Combining high resolution and exact calibration to boost statistical power: A well-calibrated score function for high-resolution MS2 data

To achieve accurate assignment of peptide sequences to observed fragmentation spectra, a shotgun proteomics database search tool must make good use of the very high resolution information produced by state-of-the-art mass spectrometers. However, making use of this information while also ensuring that the search engine’s scores are well calibrated—i.e., that the score assigned to one spectrum can be meaningfully compared to the score assigned to a different spectrum—has proven to be challenging. Here, we describe a database search score function, the “residue evidence” (res-ev) score, that achieves both of these goals simultaneously. We also demonstrate how to combine calibrated res-ev scores with calibrated XCorr scores to produce a “combined p-value” score function. We provide a benchmark consisting of four mass spectrometry data sets, which we use to compare the combined p-value to the score functions used by several existing search engines. Our results suggest that the combined p-value achieves state-of-the-art performance, generally outperforming MS Amanda and Morpheus and performing comparably to MS-GF+. The res-ev and combined p-value score functions are freely available as part of the Tide search engine in the Crux mass spectrometry toolkit (http://crux.ms).


Introduction
: The figure plots, as a function of peptide length, the proportion of randomly generated peptide sequences that obey Equation 1. The two series marked "Real" use monoisotopic masses from the 20 real amino acids; the series marked "Random" uses masses that have a random number in the range 0, 1] added to each mass. Two bin sizes (1.005079 Da and 0.02 Da) are used. For each series, a total of 100,000 peptides were simulated. whose (discretized) mass is i and whose (discretized) score with respect to the current spectrum is j. The procedure works by filling in values in this table for increasingly large values of i and j, computing each new 46 value by summing over existing entries in the table (see Methods for details).
The problem arises in the discretization of the mass axis. The logic associated with filling in the table There are two important changes in this preprocessing compared to that for the XCorr score function. First, discretization of the peaks' mass values is deferred until after the high-resolution residue evidence has been quantified and aggregated. Second, the final subtractive step, which induces a cross-correlation penalty in the 141 XCorr score function, is omitted altogether. Another difference is the addition of two peaks, representing 142 the m/z of the N-terminal group and the m/z of the precursor minus the C-terminal group (typically a 143 hydroxyl group). Both of these peaks have intensities of zero. 144 Next, we quantify and aggregate evidence for each type of residue inducing a b-ion cleavage at each i.e., r takes a value of 1 when the deviation is 0, and 0 when the deviation is equal to or greater than m tol .

152
This magnitude r is then multiplied by the sum of the rank intensities of the two peaks, prior to being stored. 153 Residue evidence is stored in a two-dimensional residue evidence matrix R. The columns of R are indexed 154 by discretized masses m j , and the rows or R correspond to the amino acids a i found in the peptide database 155 (typically around 20 rows). The increment of evidence r generated according to Eq. 3 is added to element 156 R aimj , where a i = X and m j is obtained by discretizing m A with a bin width W = 1.0005079 Da.

157
Each pair of peaks A and B is also considered as a putative pair of charge 1+ y ions, charge 2+ b 158 ions, and charge 2+ y ions, and additional residue evidence is generated using appropriate modifications 159 to Eq. 3. In all cases, however, the evidence is added to the element of R indexed by the discretized mass 160 of the corresponding charge 1+ b ion, so that all evidence related to a particular locus of fragmentation is 161 aggregated together. 162 Once all possible residue evidence has been accumulated into matrix R, the values in R undergo a linear 163 discretization to integer values, such that the minimum value in R is 0 and the maximum is some specified 164 integer r max . This integer discretization ensures that scores will have integer values, which is required for 165 the subsequent dynamic programming. 166 The theoretical spectrum corresponding to a candidate peptide is very simple. For each possible prefix 167 sequence of the peptide a tuple is created, consisting of two elements: the identity of the prefix's C-terminal 168 amino acid and an integer formed by discretizing the prefix mass with bin width W = 1.0005079. For a can-169 didate peptide P of length n, the full representation B then consists of n−1 such tuples: {(a k , m k )} k=1...n−1 .

170
Finally, assume that candidate peptide P of length n has a minimal binary representation B as described 171 above. Then the residue evidence score Ψ between P and spectrum S is the sum of elements selected from 172 the residue evidence matrix R (derived from S) according to the tuples in B: 2.3 Calibrating the residue evidence score via dynamic programming 174 The following assumes a spectrum S with precursor mass is m S is being scored.

175
Let P (1→n) be a peptide of length n, with mass m (1→n) = m S and amino acid sequence a 1 , a 2 , . . . , a n .

176
Because Ψ(P, S) is additive, the score for matching S with P (1→n) can be obtained by first calculating the 177 score for the prefix sequence P (1→n−1) = a 1 , a 2 , ..., a n−1 , then adding the evidence r = R anm S from the 178 residue evidence matrix R. Note that this process is equally valid for any subsequence P (1→k) = a 1 , a 2 , ..., a k 179 with mass m (1→k) , 180 Ψ(P (1→k) ) = Ψ(P (1→k−1) ) + R a k m (1→k) Let C s,m be the count of peptides with mass m that produce a discretized score s. If (hypothetically) all 181 the peptides have the same terminal amino acid a with mass m a , then we would have Allowing for all naturally occurring amino acids a i ∈ A, with masses m ai , the count becomes Since Ψ(P, S) is additive, Eq. 7 is valid for all masses 1 ≤ m ≤ m S . Eq. 7 defines the basic recursion of the 184 DP.

185
The DP computation of C is conducted in a two-dimensional array, where the rows are indexed by s and 186 the columns by m. The number of rows is determined by an estimate of the largest possible score for S: where Z (i) refers to the sorted column maxima from R and q = m S / min{m ai∈A } .

188
We initially set: where T N is the mass of the N-terminal group.

190
• C s,m ← 0 for all s = 0 or m = 1. This includes a range of indices s < 1 and m < 1 that are accessed 191 during the DP.

192
The elements of the array are then computed sequentially: properties. This problem can be solved by considering the relative abundances of amino acids in the recursive 205 counting: where p ai is the probability of finding amino acid a i in a large collection of naturally occurring peptides, 207 with ai∈A p ai = 1. Note that it may be important to use different estimates of p ai for the N-terminus, 208 C-terminus, and non-terminal positions, depending on the specificity of the enzyme used for digestion.

209
Assume we have calculated, using DP, the distribution of scores C s,m S over all possible peptides for 210 spectrum S, where 0 ≤ s ≤ s max . Then the p-value relative to this distribution for a specific peptide P , 211 matched to S with residue evidence score ψ = Ψ(P, S), is These p-values can be used in place of raw residue evidence scores during a standard database search. The res-ev p-value and the XCorr p-value provide complementary yet not fully independent estimates of the 215 quality of a given peptide-spectrum match (PSM). Accordingly, we employ a previously described method 216 for assigning a p-value to the product of n correlated p-values (11), using the following equation:  Table 1: Mass spectometry datasets. The database used for the ocean data set is comprised of individual peptides derived from high-throughput sequencing reads, rather than full-length proteins.
where n is the number of p-values being multiplied (in our case, n = 2), Z n is the product of the p-values, m  2.5 Data sets 226 We used four previously described tandem mass spectrometry data sets to validate our methods (Table 1).

227
The data sets were selected to represent a diversity of both sample types and instrument types.  Briefly, a metapeptide database is a peptide database whose sequences are derived from raw read sequences 250 that have been translated into peptides in all six reading frames (14).

251
Human fraction (15): Histologically normal adrenal gland tissue from three deceased individuals were 252 pooled together using equal amonts of protein from each donor. Samples were lysed using SDS. The pro-253 tein sample was fractionated by SDS-polyacrylamide gel electrophoresis (SDS-PAGE). Then the protein 254 bands were destained, reduced, and alkylated. Protein digestion was peformed using an in-gel trypsin digestion. Following digestion, the peptides were desalted. The resulting peptide sample was separated with 256 a 60 min linear gradient using reversed-phase liquid chromatography on an Easy-nLC II nanoflow liquid 257 chromatography system (Thermo Scientific). Data was acquired using a LTQ-Orbitrap Elite (Thermo Sci-258 entific). The mass spectrometer was operated in a Top 20 data-dependent acquisition mode with a 30 259 second dynamic exclusion window. MS1 scans were acquired at a mass resolution of 120,000 at 400 m/z. resulting peptides were chemically labeled using stable isotope dimethyl labeling. The yfgM knockout lysate 270 was labeled with the "Medium" isotope and the the WT sample was labeled with the "Heavy" isotope.

271
These lysates were mixed together in a 1:1 ratio and then fractionated into 45 fractions using strong cation were not considered. The resulting target and decoy peptides were placed into a new .fasta file. The words 295 'target' and 'decoy' were appended to the peptide headers of the peptides in their respective .fasta files.

296
Then the target and decoy .fasta file were concatenated to create a target-decoy database for MS Amanda.

297
Because it is not possible to turn off N-terminal methionine clipping in MS Amanda, in order to ensure 298 that the other three database search tools were exposed to the same peptides as MS Amanda, we then 299 subjected the predigested .fasta files to a second round of "digestion". In this second round, no missed 300 cleavages were allowed, N-terminal methionines were allowed to be clipped, and peptides shorter than five 301 amino acids were removed. Since the .fasta files that went through the second round were pre-digested, the peptide that each score function detected, and whether that peptide is a target or decoy, is also listed.

335
The value 'NA' is placed into empty cells that result from one score function scoring a scan and another 336 score function not scoring that particular scan. This phenomenon is due to each program having a different 337 threshold for the minimum number of peaks required to score a scan. A second R script used the PSM table   338 as input to calculate false discovery rates and generated the plots for this publication (Supplemental File 4). 339 We used the following false discovery rate equation: FDR = (number of decoys + 1) / number of targets 340 (21). For the Percolator analysis, the Tide search was performed as described previously, except that during index 343 creation, the "digestion" option was set to "partial-digest" and one missed cleavage was allowed ("missed- For any observed spectrum, we can use dynamic programming to determine the exact distribution of residue-354 evidence scores that result from each possible peptide sequences whose discretized mass matches the dis-355 cretized precursor mass. We can then compare the score from a particular PSM to this distribution and 356 calculate a p-value, i.e., we compute the probability of observing a residue-evidence score greater than or 357 equal to the score of a particular PSM.

358
To test the validity of the resulting p-values, we searched real data against a decoy database. Specifically, 359 we searched the Plasmodium data set against a shuffled Plasmodium database (see Methods for details) using  Plasmodium scan 11156, annotated with res-ev, with a high p-value. (D) Same as (C), but annotated using XCorr, with a low p-value. In each panel, peaks colored in blue, dark blue, red, and dark red represent b+1, b+2, y+1, y+2 ions, respectively. data set, but we searched against a concatenated database of both real ("target") and shuffled ("decoy") peptide sequences. From the resulting ranked list of PSMs, we used target-decoy analysis to estimate the 375 false discovery rate (FDR) associated with each observed PSM score (22). We measured FDR out to a 376 maximum of 10%, reasoning that higher FDR thresholds are not likely to be of much practical value. For 377 comparison, we repeated our search with three other score functions: the uncalibrated res-ev score function, 378 the uncalibrated high-res XCorr score function, and the XCorr p-value. Note that the latter necessarily 379 discards the high resolution of the fragment m/z axis, because the XCorr dynamic programming procedure 380 requires ∼1 Da bins.

387
A complementary measure of the quality of a PSM score function is the "target match percentage" 388 (TMP), which is defined as the fraction of spectra for which the top-scoring match involves a target peptide 389 (23). For a perfectly random score function, we expect the TMP to be ∼50%. The best possible TMP is 390 100%; however, in practice any real data set will contain spectra that cannot be identified, either because 391 the corresponding generating peptide is not in the given peptide database or because the spectrum was 392 generated by a non-peptide contaminant. TMP provides a measure of the quality of a score function that is 393 independent of a score function's calibration. This is because the TMP never involves comparing scores for

394
PSMs involving different spectra. Hence, the distribution of PSM scores for spectrum A can be dramatically 395 different from the distribution of PSM scores for spectrum B, but the TMP achieved by the score function 396 can still be high.

397
In the Plasmodium TMP analysis, the res-ev p-value (and, by definition, also the raw res-ev score)

418
In contrast, scan 11156 ( Figure 3C-D) illustrates why some PSMs score well using XCorr and poorly 419 using residue-evidence. This scan was assigned the same peptide (ELERSGEVAPDIHEHIK) by both score 420 functions, but the resulting PSM received a high p-value (8.45 × 10 −1 , 47.4% FDR) from residue-evidence 421 and a low p-value (1.11 × 10 −2 , 3.4% FDR) from XCorr. The disparity between the two score functions 422 arises because, using residue-evidence, only three pairs of peaks contribute to the score. In contrast, the 423 XCorr score includes individual components corresponding to sixteen different fragment ions (b2, b3, b4, 424 b7, b9, b11, b12, b16, y1, y2, y5, y10, y12, y14, y15, and y16). In Figure 3D, there are only fourteen 425 colored fragment ions because the b3 and b9 ion correspond to the same peak and the b7 and the y16 ion 426 also correspond to the same peak. The lack of a long "ladder" of successive b-or y-ion peaks keeps the  residue-evidence score low, relative to XCorr.

439
To combine these two scores, we applied a previously described method for estimating the statistical      Table 2: Target match percentages. The TMPs of four score functions (rows) for four datasets (columns). The TMP is defined as the percentage of spectra that match a target peptide.
value at a 1% FDR threshold. However, at larger FDR thresholds, combined p-value consistently performed probabilities (31-34), or run a machine learning post-processor on the combined results (35). Our empirical 516 results suggest that the res-ev score function and its combined p-value provide yet another complementary 517 view of peptide-spectrum matching, which will likely add value in the context of such aggregation schemes.

518
One potential explanation for the relatively poor performance of MS-GF+ on the Plasmodium dataset is due to the unusual nature of the data itself. MS-GF+ uses a supervised machine learning algorithm to 520 learn a scoring model. To perform well on the Plasmodium data might require a model that was trained on 521 data with TMT labeling and digestion with Lys-C. In contrast, the combined p-value approach is invariant 522 to properties of the data, such as digestion and labeling schemes.

523
A potential area for future work lies in the method for combining XCorr and res-ev p-values. We 524 empirically set the parameter m, which represents the degree of dependency between the two types of p-525 values, to be 1.2. However, in principle this value could be re-estimated for each new data set, using a 526 strategy similar to that shown in Supplementary Figure 1. However, our empirical results (not shown) 527 suggest that the behavior of the method is not strongly dependent upon the choice of m.

528
The strong performance of the Morpheus search engine as measured by TMP suggests that this score 529 does a good job of identifying the generating peptide for a given spectrum. On the other hand, the poor 530 overall performance of Morpheus suggests that the score function is poorly calibrated. This is not surprising, 531 because the score is simply the sum of two terms: the number of matched product ions, and the fraction of the 532 observed peak intensities that can be assigned to matched products. Thus, longer peptides or spectra with 533 more peaks will tend to achieve higher Morpheus scores on average. Our results suggest that a calibrated 534 version of this score function should be able to achieve very good empirical performance.

535
In addition to proposing a new score function, we have provided a new benchmark for use in evaluating The following supplemental files can be found in PRIDE under accession PXD009265.