PEPA test: fast and powerful differential analysis from relative quantitative proteomics data using shared peptides

We propose a new hypothesis test for the differential abundance of proteins in mass-spectrometry based relative quantification. An important feature of this type of high-throughput analysis is that it involves an enzymatic digestion of the sample proteins into peptides prior to identification and quantification. Due to numerous homology sequences, different proteins can lead to peptides with identical amino acid chains, so that their parent protein is ambiguous. These so-called shared peptides make the protein-level statistical analysis a challenge and are often not accounted for. In this article, we use a linear model describing peptide-protein relationships to build a likelihood ratio test of differential abundance for proteins. We show that the likelihood ratio statistic can be computed in time linear in the number of peptides. We also provide the asymptotic null distribution of a regularized version of our statistic. Experiments on both real and simulated datasets show that our procedures outperform state-of-the-art methods. The procedures are available via the pepa.test function of the DAPAR Bioconductor R package.


Introduction
Quantitative proteomics refers to the identification and quantification of the proteins present in a biological sample. This field has rapidly grown mature over the last decade, allowing for a refined understanding of a wide variety of biomolecular processes: phenotypes of new forms of life, such as giant viruses [17], host-pathogen cell interactions [15] or microbial infections [12]. Like many other omics sciences, it is based on a large scale sequencing approach bound to high-throughput measurements whose statistical processing is a central issue.
The most classically used measurement pipeline [4] is referred to as relative bottom-up MS/MS quantification [24]. The term bottom-up refers to the fact that proteins are not directly identified: instead, they are first digested by an enzyme into smaller molecules called peptides, which are easier to analyze by MS/MS. The term MS/MS refers to the fact that two kinds of mass spectrometry (MS) measurements are performed alternately. This instrumental pipeline is particularly useful in discovery proteomics, where the goal is to find a shortlist of proteins that are significantly differentially abundant. Several samples are collected under different biological conditions (e.g., healthy vs. disease, wildtype vs. mutant, etc.) and analyzed with the aforementioned pipeline, leading to a list of identified peptides and their intensity for each sample. Several methods have been proposed to detect differentially abundant proteins; they can be divided into two main families: peptide-based and aggregation-based methods, the latter also referred to as summarization-based in [9]. In aggregation-based methods, peptide-level information is first aggregated at the protein level, and proteins are then tested for differential abundance using these summaries [20,19,5,6]. They mostly differ by the set of peptides that they aggregate and the test statistic they use after aggregation. Peptide-based models, on the other hand, do not rely on an aggregation step and build a test statistic using peptide intensities as the sampling unit [2,1,3,10]. A more detailed discussion of the literature and preliminary comparisons among aggregation-based methods are provided in Section A of the Supplementary Material.
For both families of methods, deciding which proteins are differentially abundant from peptide-level observations is made difficult by the presence of shared peptides (as opposed to protein-specific ones): due to the numerous homology sequences between different genes, some peptides can belong to several distinct proteins [16]. This problem has long been reported in the literature [13]. It affects all data produced by a bottom-up approach, including label-free data as studied in this paper as well as isobaric tag data, where similar aggregation problems were reported [11,14]. According to [5], up to 50% of peptides can be shared in the proteome of complex organisms. We illustrate the extent of this phenomenon and its effect on the detection of differentially abundant proteins in Section B of the Supplementary Material, on a proteomic dataset obtained from the LC-MS/MS analysis of mouse liver samples.
To the best of our knowledge, the few solutions to the shared peptide problem available in the literature have hardly spread to proteomics platforms, as they are computationally expensive and do not scale to large proteomics datasets. [2], for example, exploit a model akin to the one we use in this paper, but include a factor representing the peptide-specific relationship between the measured peptide intensity and its actual abundance. This factor makes the negative log-likelihood non-convex in the parameters, so its minimization is non-trivial and possibly expensive; the authors restrict themselves to peptides shared by no more than two proteins. No algorithm or code is available for this method, to the best of our knowledge. More recently, [1] proposed AllP, which is also based on a model similar to ours but uses a log-normal model. The use of this distribution corresponds to a common assumption on observed peptide distributions [18]. Unfortunately, maximizing the corresponding likelihood is computationally more demanding than for the normal distribution we use, as no closed form maximizer is available. As reported in the original article, a synthetic dataset with 100 proteins requires three days of computation, and on such datasets the algorithm fails to converge in 18% of cases.
In this context, our contribution is fourfold:
1. We introduce a linear model which relates measured peptide MS intensities to latent protein abundances, and use it to build a PEptide based Protein differential Abundance (PEPA) likelihood ratio test which accounts for shared peptides.
2. Our linear model allows for faster estimation than existing log-linear models, but still involves an nq × (p + q) design matrix, where n is the number of samples, q the number of peptides and p the number of proteins. Computing our statistic with naive least square implementations can therefore be slow and memory-hungry in practice. We show that our likelihood ratio statistic can nevertheless be computed in O(nq), making it compatible with proteomics platform throughput.
3. We empirically observe that regularized estimators of the variance parameter lead to more powerful tests. We show that under the null hypothesis of homogeneity, the regularized log-likelihood ratio statistic is still asymptotically χ² distributed up to some normalization.
4. We provide R code for all our methods in the DAPAR Bioconductor R package, so they can be routinely used by proteomics practitioners.

Methods
We consider a proteomics experiment measuring the intensity of q previously identified peptides that map onto a set of p known proteins. A biological sample consists of observed intensities for the q peptides. Each peptide can in turn belong to several of the p proteins, and the abundance of a peptide is the sum of the abundances of all proteins containing this peptide. Formally, if the proteins have respective abundance values θ_1, …, θ_p in a sample, then the abundance of peptide k in this sample should be Σ_{j=1}^p x_kj θ_j, where x_kj = 1 if peptide k belongs to protein j, and 0 otherwise.
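The additive peptide-protein relationship above can be sketched in a few lines of Python (our own toy illustration; the matrix, names and numbers are made up, not from the paper):

```python
import numpy as np

# Toy peptide-protein map: q = 4 peptides, p = 2 proteins.
# Peptide 2 (0-indexed) is shared between both proteins.
X = np.array([[1, 0],   # peptide 0 belongs to protein 0 only
              [1, 0],   # peptide 1 belongs to protein 0 only
              [1, 1],   # peptide 2 is shared by proteins 0 and 1
              [0, 1]])  # peptide 3 belongs to protein 1 only

theta = np.array([3.0, 5.0])  # protein abundances theta_1, ..., theta_p

# Abundance of each peptide is the sum of the abundances of its
# parent proteins: (X @ theta)_k = sum_j x_kj * theta_j.
peptide_abundance = X @ theta
print(peptide_abundance)  # [3. 3. 8. 5.]
```

The shared peptide (row 2) receives contributions from both proteins, which is precisely what makes protein-level inference ambiguous.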

Model
The observed intensities ỹ_k from an MS/MS experiment are typically modeled as samples from a log-normal distribution [2,18,1]:

ln ỹ_k = α_k + ln( Σ_{j=1}^p x_kj θ_j ) + ε_k,  ε_k ~ N(0, σ²),   (1)

where σ² > 0 is the variance of the distribution, α_k is a peptide-specific effect, X ∈ {0,1}^{q×p} is a binary matrix whose elements are the x_kj, and θ ∈ R^p and α ∈ R^q are vectors containing the protein abundances and peptide effects respectively.
The parameters of interest for differential analysis are the protein abundances θ_1, …, θ_p. They are unobserved, and we want to test whether they change between two experimental conditions of interest. More precisely, we assume the n = n_1 + n_2 biological samples are measured under two different experimental conditions (n_1 under the first condition, n_2 under the second) and we want to test

H_0: θ_j^(1) = θ_j^(2)  against  H_1: θ_j^(1) ≠ θ_j^(2),   (2)

where θ^(1) and θ^(2) ∈ R^p are the protein abundance vectors under the two conditions and j is the protein being tested for differential abundance. The data we use to test (2) consist of q × n i.i.d. intensity measurements {ỹ_k^i}, i = 1, …, n, k = 1, …, q. To make the analysis and computation easier, we make the approximation that

ln ỹ_k = α_k + Σ_{j=1}^p x_kj θ_j + ε_k,  ε_k ~ N(0, σ²),   (3)

which amounts to replacing ln Σ_{j=1}^p x_kj θ_j by its first order Taylor expansion, as discussed in Supplementary Material C. Consistently, we observe that the likelihood ratio statistic relying on (3) leads to good empirical performance for testing H_0, even on real data or on data simulated from log-normal distributions.
Computing this statistic only requires solving linear problems, which is typically much faster than solving the non-linear problem associated with (1). It also makes the computation amenable to an even faster procedure, which we introduce in Section 2.2. In the rest of this paper, we therefore assume that the ỹ_k^i are sampled from (3) and let y_k^i denote the log-intensities ln ỹ_k^i.

The likelihood ratio statistic relies, for k ∈ {0, 1}, on the least square problems

β̂_k ∈ arg min_β ‖y − X_k β‖²,   (4)

where the design matrices are defined by:

X_0 is a vertical concatenation of n copies of the matrix (X I_q) ∈ R^{q×(p+q)}, and X_1 is a vertical concatenation of n_1 copies of the matrix (X_{−j} x_j 0 I_q) ∈ R^{q×(p+q+1)} and n_2 copies of the matrix (X_{−j} 0 x_j I_q) ∈ R^{q×(p+q+1)},   (5)

I_q being the identity matrix in R^q and X_{−j} the X matrix without its j-th column x_j. The vector y ∈ R^{nq} contains all {y_k^i}, i = 1, …, n, k = 1, …, q, i.e., the n_1 × q peptide intensity measurements under the first condition, followed by the n_2 × q ones under the second. Finally, β_0 = (θ, α) ∈ R^{p+q} and β_1 = (θ_{−j}, θ_j^(1), θ_j^(2), α) ∈ R^{p+q+1}. Considering σ as a fixed parameter, the ML estimator of σ² is

σ̂_k² = ‖y − X_k β̂_k‖² / (nq).   (6)

Using an inverse gamma prior σ² ~ Inv-Gamma(α, β) for α ≥ 0, the maximum a posteriori (MAP) estimator of σ² is

σ̂_k² = (‖y − X_k β̂_k‖² + s) / (nq + 2(α + 1)),   (7)

with s = 2β. This estimator is implicitly used in test statistics like SAM [21]. It amounts to regularizing the variance estimate and can lead to better power than t-tests to detect differential abundance when only few samples are available.
To choose s in practice, we generalize the heuristic of [21]: we compute our statistic for all proteins across a grid of values of s and retain the value leading to the smallest coefficient of variation of the statistic across variance levels. The motivation for this heuristic is that the amplitude of the regularized statistic should not be determined by the variance of the residuals.
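To make the regularization concrete, here is a minimal Python sketch (our own illustration, not code from DAPAR): the MAP variance estimate assumes the standard normal/inverse-gamma conjugacy, and the grid search uses a toy statistic 1/σ̂² as a stand-in for the regularized test statistic; the grid values and the chi-square residual distribution are assumptions for the demonstration only.

```python
import numpy as np

def map_variance(rss, nq, a, b):
    # Mode of the Inv-Gamma(a + nq/2, b + rss/2) posterior on sigma^2:
    # (2b + rss) / (2a + nq + 2).  With s = 2b, this adds s to the
    # residual sum of squares, i.e. it regularizes the ML estimate.
    return (2.0 * b + rss) / (2.0 * a + nq + 2.0)

# Heuristic for s: compute a (toy) regularized statistic for every
# protein over a grid of s values and keep the s giving the smallest
# coefficient of variation of the statistic.
rng = np.random.default_rng(0)
rss = rng.chisquare(df=4, size=1000)   # per-protein residual sums of squares
grid = [0.0, 0.1, 1.0, 10.0]

def coeff_var(v):
    return v.std() / v.mean()

cv_by_s = {s: coeff_var(1.0 / map_variance(rss, nq=24, a=0.0, b=s / 2.0))
           for s in grid}
s_star = min(cv_by_s, key=cv_by_s.get)  # retained regularization level
```

With a = b = 0 the MAP estimate reduces to rss / (nq + 2), close to the ML estimate rss / nq, so the prior only matters when rss is small, i.e. exactly when variance estimates are unstable.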
Individual effects such as our peptide effect α_k are commonly modeled as random variables and endowed with a prior distribution. In our experiments, using a fixed or random α_k made little difference, so we opted for the fixed effect model, as it is amenable to fast computation using Proposition 1 (see below).
Finally, in practice X often involves disjoint sets of proteins with no peptide in common. When testing (2) for a protein j, we only use peptides whose observation affects the estimation of θ_j. Concretely, we identify connected components in the bipartite graph whose nodes are peptides and proteins, with edges between each peptide and its parent proteins, and apply our procedure to each connected component separately. Section D of the Supplementary Material further discusses this point and introduces a heuristic procedure which does not separate connected components.
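The component search is a standard union-find over the bipartite peptide-protein graph; the sketch below is our own plain-Python illustration (not the DAPAR implementation), run on a hypothetical toy map:

```python
def connected_components(edges, n_pep, n_prot):
    """Union-find over the bipartite peptide-protein graph.
    `edges` lists (peptide, protein) pairs, i.e. the nonzeros of X.
    Proteins are offset by n_pep so both node sets share one index space."""
    parent = list(range(n_pep + n_prot))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for k, j in edges:
        rk, rj = find(k), find(n_pep + j)
        if rk != rj:
            parent[rk] = rj

    # Group node indices by their root.
    comps = {}
    for u in range(n_pep + n_prot):
        comps.setdefault(find(u), []).append(u)
    return list(comps.values())

# Peptides {0,1} map to protein 0, peptide 2 is shared by proteins 0 and 1,
# peptide 3 maps to protein 2: proteins 0 and 1 end up in one component.
comps = connected_components([(0, 0), (1, 0), (2, 0), (2, 1), (3, 2)],
                             n_pep=4, n_prot=3)
print(len(comps))  # 2
```

Each returned component can then be tested independently, which keeps the design matrices small in the common case.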

Computation of the test statistics
We consider the likelihood ratio statistic for (2) under model (3):

λ = nq ln(σ̂_0² / σ̂_1²),   (8)

where σ̂_0² and σ̂_1² are obtained by solving either (6) for ML estimation or (7) for MAP estimation. Both estimators require solving the least square problem in (4) for k = 0 and k = 1. A naive implementation explicitly storing the nq × (p + q) and nq × (p + q + 1) design matrices X_k would be slow and could run out of memory even for small n when dealing with connected components involving thousands of peptides. In the experiments on simulated data with 50% of shared peptides presented in Section 4.1, the largest connected component contained 993/1000 proteins and 4981/5000 peptides. Computing the test statistic (8) under H_0 across only 2 × 3 samples took 4.8 GB of memory and 23 minutes on a 3 GHz i7 core with an implementation using the lm function of R. Computing the statistic (8) only requires knowing the sum of squared residuals min_β ‖y − X_k β‖², not the minimizer β̂_k ∈ arg min_β ‖y − X_k β‖² itself. An implementation computing the projection X_0 (X_0^T X_0)^† X_0^T y onto the span of X_0, relying on the singular value decomposition of (X I_q), took 1.6 GB and 3 minutes. We now show how this sum of squared residuals can be computed in time linear in the size of y.
On the same simulation, our approach allowed us to compute (8) in 0.002 seconds with only marginal memory usage.
Proposition 1 Let X_0 and X_1 be defined as in (5) and let y ∈ R^{nq} contain the n samples of log-intensities. Then

nq σ̂_0² = Σ_{i=1}^n ‖y^i − ȳ‖²  and  nq σ̂_1² = Σ_{i=1}^n ‖y^i − ȳ‖² − (n_1 n_2 q_j / n) (ȳ_j^(1) − ȳ_j^(2))²,

where ȳ = n^{-1} Σ_{i=1}^n y^i ∈ R^q is the average across the n samples, ȳ^(l) ∈ R^q is the average across the n_l samples under condition l ∈ {1, 2}, ȳ_j^(l) = x_j^T ȳ^(l) / q_j ∈ R, l = 1, 2, is the average across samples under condition l of the log-intensities of all peptides belonging to protein j, and q_j = ‖x_j‖² is the number of peptides belonging to protein j. The additional term in σ̂_1² can then be interpreted as the squared difference between the average log-intensities across peptides between the two conditions.
An important consequence of Proposition 1 is that the log-likelihood ratio statistic (8) can be computed in O(nq) by computing averages over subsamples of y, without storing the X_k matrices or diagonalizing the (X_k^T X_k) matrices.
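As a sanity check on this closed form, the following sketch (our own Python illustration; the constants follow our reading of Proposition 1 and should be checked against the paper before reuse) computes λ from sample means only:

```python
import numpy as np

def pepa_lambda(Y1, Y2, pep_idx):
    """Likelihood ratio statistic (8) computed in O(nq) from averages.
    Y1, Y2: (n_l, q) arrays of log-intensities under each condition;
    pep_idx: indices of the peptides belonging to the tested protein j."""
    (n1, q), n2 = Y1.shape, Y2.shape[0]
    n = n1 + n2
    Y = np.vstack([Y1, Y2])
    ybar = Y.mean(axis=0)                # per-peptide mean over all samples
    rss0 = ((Y - ybar) ** 2).sum()       # nq * sigma_0^2
    qj = len(pep_idx)
    m1 = Y1[:, pep_idx].mean()           # avg log-intensity of protein j, cond. 1
    m2 = Y2[:, pep_idx].mean()           # avg log-intensity of protein j, cond. 2
    rss1 = rss0 - (n1 * n2 * qj / n) * (m1 - m2) ** 2
    return n * q * np.log(rss0 / rss1)
```

On random data this agrees with the λ obtained by explicitly solving the two least square problems with numpy's lstsq, while never forming the nq × (p + q) design matrices.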
Using Proposition 1 to compute the likelihood ratio statistic moves the computational bottleneck to the identification of connected components, as illustrated in Section 4.1. The heuristic we discuss in Section D of the Supplementary Material skips this identification step and retains all peptides for each tested protein.

Null distribution of λ

By Wilks' theorem [23], λ converges in law to a χ²_1 distribution as n → ∞ when the y_k^i are sampled i.i.d. under H_0 (i.e., θ^(1) = θ^(2)) and when using maximum likelihood estimators of (θ, α, σ²) from (4) and (6). This result provides asymptotic levels for our test: rejecting H_0 when λ > χ²_{1,α}, where χ²_{1,α} is the 1 − α quantile of the χ²_1 distribution, asymptotically leads to a false positive rate of α. The asymptotics are in n even though the number of sampling units is nq, as the size of the parameter α also increases with q. When using the MAP estimator (7) for σ², Wilks' theorem no longer holds, and indeed we observed in our experiments that the null distribution of λ under H_0 deviates from the χ²_1 distribution. However, Proposition 2 shows that multiplying λ by a constant factor is enough to recover a correct asymptotic level.
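For χ²_1 specifically, the asymptotic p-value has a closed form through the complementary error function, since a χ²_1 variable is the square of a standard normal; a minimal sketch (our own, stdlib-only):

```python
import math

def chi2_1_pvalue(lam):
    # P(chi^2_1 > lam) = P(|Z| > sqrt(lam)) = erfc(sqrt(lam / 2))
    return math.erfc(math.sqrt(lam / 2.0))

# The 95% quantile of chi^2_1 is about 3.8415, so the p-value there is 0.05.
print(round(chi2_1_pvalue(3.8415), 3))  # 0.05
```

Rejecting when this p-value falls below α is equivalent to the rejection rule λ > χ²_{1,α} above.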
Proposition 2 (stated and proved in Section F of the Supplementary Material) gives the appropriate rescaling, which depends on the true variance σ². In practice, σ² is unknown, but we obtained satisfactory test levels on both real and simulated data by plugging in the maximum likelihood estimate σ̂² instead.

Experimental setting
We use both simulated datasets and datasets resulting from spike-in samples to evaluate the performance of our test procedure. Simulated datasets allow us to control key parameters such as the proportion of shared peptides, while spike-in datasets are closer to the data actually produced by proteomics platforms.

Simulated datasets
We simulate peptide intensities for each of n_1 = n_2 = 3 samples under two biological conditions, for q = 5000 peptides belonging to p = 1000 proteins. We purposely use a generative model, described in Supplementary Material G.1, which differs from the normal model (3) used in our testing procedure. This allows us to obtain more realistic data and to assess the robustness of our method to deviations from the model.
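For reference, a minimal simulation sketch in Python (our own illustration following the log-normal model (1), not the generative model of Supplementary Material G.1; the sharing fraction and all distributional parameters below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, n1, n2 = 1000, 5000, 3, 3

# Random peptide-protein map: each peptide gets one parent protein,
# and a chosen fraction of peptides get a second parent (shared peptides).
shared_frac = 0.2
parents = rng.integers(0, p, size=q)
X = np.zeros((q, p))
X[np.arange(q), parents] = 1
shared = rng.random(q) < shared_frac
second = rng.integers(0, p, size=q)
X[np.where(shared)[0], second[shared]] = 1

theta = rng.lognormal(mean=2.0, sigma=0.5, size=p)   # protein abundances
alpha = rng.normal(0.0, 0.3, size=q)                 # peptide effects
mu = np.log(X @ theta) + alpha                       # log-scale peptide mean, as in (1)
Y = rng.normal(mu, 0.2, size=(n1 + n2, q))           # observed log-intensities
```

A differential protein can then be injected by rescaling one coordinate of theta in the second condition before drawing the corresponding samples.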

Spike-in datasets
The true set of differentially expressed proteins is generally not known in real data, making it difficult to compare differential analysis methods. We therefore resort to spike-in samples, for which the true set of differentially abundant proteins is known. We use the two datasets described in [7], which are available in the DAPARdata R package [22]. These two datasets contain 6 samples that were prepared using the equimolar human protein mixture Sigma UPS1, comprising 48 human proteins. Their differential abundance ratios are 2 and 2.5 for the first and second datasets, which are respectively referred to as Exp1_R2_pept and Exp1_R25_pept.

Compared methods
We evaluate two versions of the PEPA test, both using the likelihood ratio statistic (8): PEPA-ML relies on the ML estimator (6) of the variance, and PEPA-MAP on the MAP estimator (7) of the variance with an inverse-gamma prior. Comparing these two procedures provides insights on the respective merits of the likelihood ratio test itself and of the variance regularization. These two methods are compared to several other reference methods.
The only other peptide-based model that accounts for all shared peptides is AllP [1], which cannot cope with the large scale datasets that we consider in our experiments, so we compare against methods that only account for protein-specific peptides. The first one, referred to as PeptideModel, performs a two-sample t-test at the peptide level. We also consider two aggregation-based models: the first one, denoted AllSpec-SAM, performs the SAM test proposed by [21] (with automatic tuning of the fudge factor parameter, as discussed by [8]) over each protein summarized by the sum of intensities of all of its protein-specific peptides; the second one relies on a different summarization. To compare the performances of the different methods, we construct precision-recall (PR) curves. In order to stabilize the results, the PR curves are averaged over 30 runs for simulated data, and over 10 runs on the spike-in data when using the peptide merging procedure.
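The PR curves can be obtained by sweeping a threshold over the per-protein test statistics; a minimal sketch (our own, with made-up scores), assuming larger statistics mean stronger evidence of differential abundance:

```python
import numpy as np

def precision_recall(scores, labels):
    """Precision-recall points obtained by sweeping a threshold over
    the test statistics, largest first.  `labels` marks the truly
    differentially abundant proteins (1) vs the rest (0)."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                 # true positives at each cutoff
    k = np.arange(1, len(labels) + 1)      # number of proteins called positive
    precision = tp / k
    recall = tp / labels.sum()
    return precision, recall

prec, rec = precision_recall([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
print(rec.tolist())  # [0.5, 0.5, 1.0, 1.0]
```

Averaging such curves over repeated runs, as done in the experiments, simply means interpolating them on a common recall grid before taking the mean.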

Additional assumptions made by PEPA
We evaluate all methods on their ability to test the null hypothesis H_0: θ_j^(1) = θ_j^(2) defined in (2).

Results
In this section, we present PR curves on simulated and spike-in datasets, calibration curves, and a runtime comparison of the evaluated methods.

Performances on simulated datasets
The performances on simulated datasets with 0%, 5%, 20% and 50% of shared peptides are reported below.

Performances on spike-in datasets

Exp1_R2_pept and Exp1_R25_pept originally contain 10722 (resp. 10601) peptides, among which 211 (resp. 290) come from human proteins, so the proportion of shared peptides is rather small: for either dataset, the proportion of introduced shared peptides is below 2.65% (to be compared with proportions of up to 50% of shared peptides, as recalled in the introduction).
We notice a large difference in performance and behavior between the two datasets. Exp1_R25_pept derives from a series of LC-MS/MS experiments that did not undergo any malfunction, so the data is of rather high quality. On Exp1_R2_pept, it is not possible to diagnose the source of the noise, which could range from the MS acquisition to the bioinformatic data processing, yet it is clear that on this dataset the UPS1 proteins are more difficult to detect. Both datasets correspond to scenarios that can be faced on proteomics platforms.

Peptide-based methods do not always dominate aggregation-based methods
The domination of peptide-based over aggregation-based models that was previously reported in the literature is not systematic in our experiments.

Our methods handle proteins with no specific peptide

In both experiments, some proteins are lost by methods that only rely on protein-specific peptides because, as the number of shared peptides increases, these proteins end up with only shared peptides. On Figures 2 and 3, this leads to the noticeable dropouts at the lower end of the curves for these methods.
This illustrates an important practical issue: methods that only rely on protein-specific peptides are unable to deal with some proteins. Accounting for shared peptides as we suggest not only improves our ability to detect differentially abundant proteins among those handled by classical methods, but also increases the proteome coverage.
To conclude, these experiments show that both our method accounting for shared peptides and its regularized version improve our ability to detect differentially abundant proteins in proteomics datasets. The more shared peptides, the larger the gain in performance, but even on datasets with no shared peptides the methods remain accurate and never strongly underperform. Finally, the regularized version always performs better than the unregularized one. Using our model to estimate abundances, a task which is out of the scope of this paper and which we did not include in our experiments, would require actually estimating the regression parameters, e.g., using explicit formulas for (X_k^T X_k)^† X_k^T y.