Mass spectrometrists should search for all peptides, but assess only the ones they care about

In shotgun proteomics, identified mass spectra that are deemed irrelevant to the scientific hypothesis are often discarded. Noble (2015) [1] therefore urged researchers to remove irrelevant peptides from the database prior to searching in order to improve statistical power. Here, however, we argue that both the classical method and Noble’s revised method produce suboptimal peptide identifications and have problems controlling the false discovery rate (FDR). Instead, we show that searching for all expected peptides and removing irrelevant peptides prior to FDR calculation results in more reliable identifications at a controlled FDR level than the classical strategy, which discards irrelevant peptides after FDR calculation, or Noble’s strategy, which discards irrelevant peptides prior to searching.

because the irrelevant peptides were removed from the database prior to searching. Noble correctly pointed out that this issue will not lead to statistical problems as long as a correct FDR procedure is adopted. However, we illustrate that the popular target-decoy FDR procedure [3] cannot avoid these statistical problems when the sub-sub search strategy is adopted on small to moderately sized subsets. We also argue that the Noble approach still sacrifices statistical power by testing more hypotheses than necessary, i.e. PSMs that would match well to irrelevant peptides in the complete search could actually be discarded because it is highly unlikely that these are subset peptides. We therefore propose a search-all-assess-subset (all-sub) strategy: (1) search the mass spectra against a database with all proteins that are expected in the sample, and (2) discard PSMs matching irrelevant peptides in the complete search prior to (3) FDR calculation, which promises to further boost the statistical power. The filtering strategy in step (2) is independent of the subsequent data analysis steps and can reduce the multiple testing problem considerably without compromising the FDR calculation.
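The three steps of the all-sub strategy can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `PSM` record, its field names and the function `all_sub_accept` are hypothetical, scores are assumed to be "higher is better", the search is assumed to be a concatenated target-decoy search, and each decoy is assumed to inherit subset membership from the target peptide it was derived from. Score ties are not treated specially.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    score: float     # search-engine score, higher is better (assumed convention)
    is_decoy: bool   # True if the best match in the complete search is a decoy
    in_subset: bool  # True if the matched peptide (or the target a decoy was
                     # derived from) belongs to the subset of interest


def all_sub_accept(psms, fdr_level=0.01):
    """Return the subset target PSMs accepted at `fdr_level`.

    `psms` is assumed to hold one best-scoring PSM per spectrum from step (1),
    the search against the complete target-decoy database.
    """
    # Step (2): discard PSMs whose best match is an irrelevant peptide.
    subset = sorted((p for p in psms if p.in_subset),
                    key=lambda p: p.score, reverse=True)

    # Step (3): target-decoy FDR on the remaining subset PSMs only.
    n_targets = n_decoys = 0
    n_keep = 0  # length of the longest top-ranked prefix meeting the FDR level
    for rank, psm in enumerate(subset, start=1):
        if psm.is_decoy:
            n_decoys += 1
        else:
            n_targets += 1
        if n_targets > 0 and n_decoys / n_targets <= fdr_level:
            n_keep = rank

    return [p for p in subset[:n_keep] if not p.is_decoy]


# Toy data, invented purely for illustration.
toy_psms = [
    PSM(10.0, False, True),
    PSM(9.0, False, True),
    PSM(8.0, False, False),  # irrelevant target: removed in step (2)
    PSM(7.0, True, True),
    PSM(6.0, False, True),
    PSM(5.0, True, False),   # irrelevant decoy: removed in step (2)
]
accepted = all_sub_accept(toy_psms, fdr_level=0.4)
```

Because the filter in step (2) only uses the identity of the matched peptide, it can be applied to the output of any search engine before the FDR step.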

In figure 1 we illustrate that the fraction of incorrect target PSMs (π₀) in the complete search is substantially different from the one in the cytoplasm subset. Based on the TDA approach we estimate that 13.9% of the target PSMs are incorrect hits when adopting the all-all search, while the actual fraction in the subset is probably lower, i.e. π₀ = 7.2% as estimated with the all-sub strategy. This is also reflected in the distributions of the all-all and the all-sub MS-GF+ scores in figure 1, which are bimodal. The first mode, corresponding to incorrect PSMs, is much higher for the all-all strategy than for the all-sub method. Hence, the FDR cutoff using the all-all search strategy is probably too conservative for the cytoplasm example. This is also reflected by the increased number of subset PSMs that are returned by the all-sub method (2,578 vs 2,553 PSMs).

However, the FDR of the all-all method can also be too liberal, i.e. when the fraction of incorrect PSMs in the subset is higher than the one in the complete search (e.g. the ATPase activity subset in supplementary Fig. 10). Hence, the FDR in the all-all strategy is often not representative of that of the subset, leading to suboptimal PSM lists, which can be either too long or too short depending on the scenario. [...] is observed for the human subsample (6,108 vs 6,474 PSMs).
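The mismatch between the all-all FDR and the FDR within a subset can be made concrete with a toy calculation. The sketch below is illustrative only: the helper names `tda_fdr` and `compare_fdr` are hypothetical, the data are invented, and the FDR estimate is assumed to be the usual concatenated target-decoy ratio (number of decoys divided by number of targets above the cutoff).

```python
def tda_fdr(psms, cutoff):
    """Estimated FDR among target PSMs with score >= cutoff.

    Each PSM is a tuple (score, is_decoy, in_subset). Returns 0.0 when no
    target passes the cutoff (a convention chosen for this sketch).
    """
    n_decoys = sum(1 for score, is_decoy, _ in psms
                   if score >= cutoff and is_decoy)
    n_targets = sum(1 for score, is_decoy, _ in psms
                    if score >= cutoff and not is_decoy)
    return n_decoys / n_targets if n_targets else 0.0


def compare_fdr(psms, cutoff):
    """Return (all-all FDR, subset FDR) at the same score cutoff."""
    subset = [p for p in psms if p[2]]
    return tda_fdr(psms, cutoff), tda_fdr(subset, cutoff)


# Invented toy data: (score, is_decoy, in_subset).
toy_psms = [
    (9, False, True),
    (8, False, False),
    (7, False, False),
    (6, True, True),
    (5, False, False),
    (4, False, True),
]
fdr_all, fdr_subset = compare_fdr(toy_psms, cutoff=4)
```

In this toy example the cutoff looks acceptable for the complete search (one decoy against five targets) while the same cutoff gives one decoy against two targets within the subset, which is the "too liberal" scenario. Swapping which PSMs are flagged as subset members produces the opposite, "too conservative" scenario.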

The Noble sub-sub strategy, on the other hand, is more conservative than our all-sub strategy at [...] to be required. But due to the specific empirical distribution of the decoy scores, the subset TDA still results in a lower cutoff than in the all-all approach (Supplementary Fig. 27).

We observe that the location and shape of (1) the decoy distribution and (2) [...] Table 1). We also developed a user-friendly web-based tool in R [8] that provides (1) the all-sub FDR, (2) the rescaled all-all FDR and (3) diagnostic plots for assessing the location-scale assumption (http://iomics.ugent.be/saas/ and Supplementary Code). In our application, π₀ is estimated based on the ratio of the number of subset decoys to the number of subset targets in a concatenated target-decoy search. We feel that our approach can be further optimized for small subsets by using the location and shape assumption explicitly when estimating π₀.
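Written out, the ratio described above for π₀ reads as the first expression below. The second expression is the corresponding score-dependent FDR estimate, which is not stated explicitly in the text and is given here as an assumption (the standard concatenated target-decoy estimate); the threshold symbol t is introduced only for illustration.

```latex
\hat{\pi}_0 \;=\; \frac{\#\{\text{subset decoy PSMs}\}}{\#\{\text{subset target PSMs}\}},
\qquad
\widehat{\mathrm{FDR}}(t) \;=\;
\frac{\#\{\text{subset decoy PSMs with score} \ge t\}}
     {\#\{\text{subset target PSMs with score} \ge t\}}
```

Under this reading, the π₀ estimate is the FDR estimate obtained when no score threshold is applied, which is why it becomes unstable for small subsets: both counts are then small integers.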