Assessment of false discovery rate control in tandem mass spectrometry analysis using entrapment

A pressing statistical challenge in the field of mass spectrometry proteomics is how to assess whether a given software tool provides accurate error control. Each software tool for searching such data uses its own internally implemented methodology for reporting and controlling the error. Many of these software tools are closed source, with incompletely documented methodology, and the strategies for validating the error are inconsistent across tools. In this work, we identify three different methods for validating false discovery rate (FDR) control in use in the field, one of which is invalid, one of which can only provide a lower bound rather than an upper bound, and one of which is valid but under-powered. The result is that the field has a very poor understanding of how well we are doing with respect to FDR control, particularly for the analysis of data-independent acquisition (DIA) data. We therefore propose a new, more powerful method for evaluating FDR control in this setting, and we then employ that method, along with an existing lower bounding technique, to characterize a variety of popular search tools. We find that the search tools for analysis of data-dependent acquisition (DDA) data generally seem to control the FDR at the peptide level, whereas none of the DIA search tools consistently controls the FDR at the peptide level across all the datasets we investigated. Furthermore, this problem becomes much worse when the latter tools are evaluated at the protein level. These results may have significant implications for various downstream analyses, since proper FDR control has the potential to reduce noise in discovery lists and thereby boost statistical power.


S1
The combined entrapment method tends to overestimate the true FDP and the lower bound underestimates it The underlying assumption of the combined entrapment estimation method is that because the entrapment database is r times larger than the sample database, a spectrum that is foreign to the combined database is r times more likely to be optimally matched to the entrapment than to the original target part of the combined database.In particular, in the extreme case that none of the original target peptides is present in the sample N E is an unbiased estimator of r • V T , so in this case, the numerator of (1), N E (1 + 1/r), is an unbiased estimator of V E + V T , the number of false discoveries in the combined database.Of course, in the typical case a subset T p of the original target database, T , is present in the sample.With T m denoting the complementary set of those peptides that are not in the sample, we can split the entrapment dataset E T into two corresponding sets E p and E m so that each of those parts is roughly r times larger than the corresponding original target part.
The above discussion shows that N Em , the number of entrapment discoveries in the subset E m , is an unbiased estimator of r • N Tm = r • V Tm , where N Tm is the number of (false) original target discoveries in T m .At the same time N Ep = V Ep is obviously an overestimate of r • V Tp , because the latter is 0 by definition: all the peptides in T p are in the sample so none can be falsely detected, even if the PSM that identifies it happens to be incorrect.
It follows that Therefore, showing the numerator in (1) indeed generally overestimates V E + V T , the number of false discoveries in the target+entrapment list of discoveries.
At the same time it is clear that indeed a lower bound on the FDP in the target+entrapment discovery list.

S2 The sample entrapment method can underestimate or overestimate the FDP
The sample entrapment estimation method (3) aims to estimate the FDP only among the original target discoveries rather than among the combined list of target+entrapment discoveries.This approach is inherently problematic because the analyzing tool is asked to control the FDR in the target+entrapment list of discoveries while the evaluation is done only on the original target discoveries.As a result, while as explained next, the sample entrapment method typically underestimates the FDP, it can also potentially overestimate the FDP.Therefore, this method cannot be used to show that a tool controls the FDR, nor that it does not control the FDR.The related fallacy of trying to control the FDR in a subset by controlling it in the larger set has already been pointed out in the literature [2,5].Indeed, typically the FDP among the discoveries from the target T will be significantly lower than among the target+entrapment list of discoveries, and hence (3) will typically underestimate the FDP among the latter discoveries.Critically, it is the latter FDP that the search tool is actually trying to control, so typically the sample estimation will underestimate the true FDP.
To demonstrate that (3) can in some extreme cases overestimate the FDP, we find it simpler to assume that the analyzing tool aims to report peptides in DDA data.Consider a hypothetical case, where every original target peptide in T produces a very good spectrum that creates a perfect PSM.Suppose that, in addition to those perfect spectra, the dataset also contains a large number of other spectra that are foreign to T .Because the native spectra will mostly be optimally matched to the original target peptides they were generated from, the entrapment peptides will overwhelmingly be optimally matched to the foreign spectra.Inevitably, some of those entrapment peptides will score fairly high and make it to the target+entrapment discovery list.Consequently, we will end up with N E > 0 and hence FDP T > 0 in (3), but the true FDP is 0 because there are no false sample matches, so (3) overestimates the true FDP in this case.
Other versions of (3) that aim to estimate the FDP among the discoveries from the original target T have been used as well.For example, Yu et al. [7] estimate this FDP by , where M T and M E are the number of peptides in the original target and entrapment databases.Much of the above criticism applies to this estimator as well: it can overestimate as well as underestimate the FDP.For example, if the tool reports that it detected all the original target peptides, then this estimate is 0, but of course, this will typically underestimate the FDP.Similarly, if all the original target peptides are present in the sample then the actual FDP is 0, but this estimate will typically be positive.

S3 Potential pitfalls of using foreign entrapment sequences
Critical to all methods of FDP estimation is the ratio r of entrapment-to-original-target sequences.When using shuffled entrapment every original target sequence creates exactly r shuffled entrapment sequences, each with the same composition as the target's, and therefore there is no ambiguity about that ratio.In contrast, consider using a foreign entrapment in peptide-level analysis.The natural choice for r is the ratio of the number of entrapment-to-original-target peptides.However, it is conceivable that the entrapment peptides would have very different amino acid composition and length distributions compared with those of the original target peptides.Such differences can cast doubts over the adequacy of that simple ratio.

S3.1 Selecting the "right" evolutionary distance
An issue related to the last point is the critical selection of the foreign species: choose species that are at a very large evolutionary distance to your target species and you risk creating entrapment sequences that do not offer a realistic competition to the incorrect target discoveries they need to model.In this case you are likely to underestimate the true FDP when using the combined estimation.Conversely, if your entrapment includes a species that is a close relative of your target, then you run the risk of overestimating the FDP, particularly when considering sample peptides whose exact matches are missing from the original target database, say due to unexpected modifications or genetic variations.We next demonstrate this point through a few sets of entrapment experiments we conducted, where in each set we use the same analysis tool and the same dataset, while varying the evolutionary distance of the foreign entrapment sequences even as we keep a fixed r = 1.
Figure S1 presents the different conclusions we can draw when evaluating Sage's peptide-level analysis of the human-sampled HEK293 DDA dataset using two different foreign entrapment resources (both described in Section 4.4).As expected, when using a mixture of (non-ape) primates and mouse proteins for our foreign entrapment (left panel), Sage's estimated FDPs are substantially higher than when using Arabidopsis for the entrapment (right panel).Because we cannot use the paired estimation when using a foreign entrapment, the combined estimation is the only way to establish FDR control in this case.However, the combined curve is noticeably above the diagonal in the left panel.In particular, it estimates the FDP for the 1% FDR threshold at a rather high 2.4%, thus calling into question Sage's FDR control had we only used that entrapment experiment.In contrast, using the Arabidopsis peptides the combined method is only marginally above the diagonal (right panel).Recalling the general conservative nature of the combined method this raises the possibility that using now the much more evolutionary distant Arabidopsis we might be slightly underestimating the FDP in this case.
We observe the same trends when using foreign entrapments to evaluate DIA-NN's analysis of the humansampled PXD042704 (human-astral) DIA dataset.Specifically, Supplementary Figure S2's comparison of using the Primates+Mouse vs. Arabidopsis entrapment sequences to evaluate DIA-NN's precursor-level analysis looks qualitatively very similar to the corresponding peptide-level analysis of Sage in Figure S1, so the same comments apply here.Similarly, the protein-level entrapment analyses depicted in Figure S3 again show that all estimated FDPs increase as the evolutionary distance to the human sample decreases.

S3.2 Foreign entrapments can be compromised by contamination
Finally, it is important to note that an entrapment experiment that relies on foreign sequences can be led astray by contamination.Indeed, compare the results of the two entrapment experiments summarized in Figure S4, where in both cases Sage peptide-level analysis was applied to the HEK293 dataset using two different foreign entrapments, both constructed so that r = 1: one made of yeast + Arabidopsis peptides (left panel) and the other made of Arabidopsis-only peptides (right panel).Clearly, adding the yeast peptides to the entrapment database significantly increases both the combined and the lower bound estimates of the FDP.
Looking more closely, we find that at the 1% FDR threshold Sage reported 108,867 peptides, of which 1,409 were entrapment peptides corresponding to a combined estimate of 2.59% (r = 1), and a lower bound of 1.29% on the FDP.Notably, 897 of the 1,409 discovered entrapment peptides (63.7%) were yeast peptides even though only 289,258 of the 1,389,436 entrapment peptides (20.8%) came from yeast.Statistically this is a highly significant enrichment of the yeast peptides among the discovered entrapment peptides (two sided Fisher Exact p-value < 2.2e-16).Unlike the examples in the previous section, yeast of course is not significantly closer to human than Arabidopsis, thus yeast contamination remains the most likely explanation.Consistently, searching the HEK293 MS/MS dataset against a protein database containing human and yeast proteins using Sage, we found that 1,510 out of the 111,376 peptides reported at 1% FDR match only to yeast proteins (a further 486 peptides match both the yeast and human proteins).Factoring in the ratio of human to yeast peptides this 1.4% of yeast peptides is much larger than expected.Moreover, if we specifically focus on the run with ID 02B, we find that 961 of the 9,390 peptides reported at 1% FDR match only to yeast proteins (a further 103 peptides match both the yeast and human proteins).This 10.2% yeast content is a fairly strong indication of some yeast contamination in that specific run.

S4 A survey of notable entrapment experiments
We conducted an extensive literature review looking at some prominent publications that used entrapment experiments.The results are summarized in Table 1, and here we discuss a few of those publications in more detail.
In [3] the authors estimated the FDP as N E •1/r N T +N E •1/r , which does not seem to be well motivated.Applying this formula to the numbers they reported using-N T = 8, 867, N E = 297 (E. coli peptides) and r = 48, 131/29, 333-we obtain the 0.02 estimated FDP that they reported.Regardless of the fact that this estimate is already higher than the 1% FDR threshold, if you use the combined estimation method, Equation (1), then you get an estimate higher than 0.05, or 5 times higher than the threshold.Moreover, even the lower bound, Equation (2), is larger than 0.03, or 3 times higher than the threshold, casting doubt over the claim that Specter controls the FDR.
In [4] the authors indicated that they used r = 1.176 and claim that the estimated FDPs (referred to as "external FDR") were in very good agreement with the corresponding thresholds ("internal FDR").However, they used the sample method, which as we argued is invalid.Had the authors used the valid combined method, then taking r into account, the 1% FDP estimate they computed would have translated to 2.2%.Because this is not a lower bound it does not demonstrate MaxDIA fails to control the FDR, but at the same time this experiment does not provide evidence to support the fact that it controls the FDR.
In [7] the authors also used the invalid sample method to estimate the FDP among the quantified proteins reported by three different DIA tools.Using their reported values, r = (6060 + 4401 + 16224)/20407 ≈ 1.31 the estimated FDPs (at 1% FDR) they reported for DIA-NN library-free, FP-MSF, and FP-MSF hybrid pipelines of 1.2%, 1.9%, and 1.8% respectively, correspond to 2.7%, 4.3% and 4.1% respectively if we use the valid combined method.Again, this analysis does not prove that these three tools do not control the FDR, but this experiment cannot be taken as evidence of proper FDR control.
In [1] the lower bound was correctly used to argue questionable FDR control in the case of MaxDIA, but it was also incorrectly used to argue proper FDR control of Spectronaut and DIA-NN.As we mentioned, the lower bound cannot be used as evidence of proper FDR control.For example, the authors report that using an in silico entrapment setup, DIA-NN's estimated FDP at 1% FDR was 1.77%, which they found acceptable.However, keeping in mind that this is a lower bound and that given that the entrapment part consisted of 16,202 Arabidopsis protein sequences and the original target was made of 17,082 mouse and 6,730 yeast protein sequences, we have r ≈ 0.68.Therefore, using the same entrapment setup, the combined method estimates the FDP at 2.6%.The real FDP is somewhere in between, but it is highly questionable whether these results can be taken as evidence for proper FDR control.
The paper describing the recently released AlphaPept refers to some entrapment experiments, but the method is not fully described [6].Looking at the accompanying code they released, we found that they were estimating the FDP as N E /(N E + N E + N D ), where N D is the number of decoys above the cutoff.This estimate is even smaller than the lower bound on the FDP (Equation ( 2)); hence, it cannot possibly be used as evidence of valid FDR control.

S5 Rationale behind the paired entrapment method and its extension
We argued in Supplementary Note S1 that, on average, the numerator in (1) overestimates the number of false discoveries.The paired entrapment estimation method can reduce this bias by taking advantage of the pairing information if it exists.Accordingly, in this section we require that our combined target+entrapment database contains pairs of original target and entrapment peptides (or proteins) {(t i , e i ) : i = 1, . . ., n}, so that t i ∈ T p if and only if e i ∈ E p , and similarly for T m , and E m , where again, T p and T m are the original target peptides that are present, respectively, missing from the sample.Noting that in this case the ratio of entrapment to original target size is r = 1, the above discussion in Section S1 shows that N Em , the number of entrapment discoveries in the subset E m , is an unbiased estimator of V Tm .We can derive an alternative estimator of V Tm by competing t i with e i as follows.Let s be the discovery cutoff score and let where we can randomly break, or alternatively discard, pairs with t i = e i (the results will slightly differ, but from a statistical perspective the two are equivalent).Note that we overload the notations t i /e i here to mean the peptides as well as their scores.We also define N Tm≥s>Em and N Tm>Em≥s by the obvious analogy with the roles of E m and T m swapped.
The combined estimation method relies on the fact that for t i ∈ T m and r = 1, P (e i ≥ s) = P (t i ≥ s), and hence N Em is an unbiased estimator of V Tm .Examining the different contributions to N Em we find that The equal chance assumption implies that for t i ∈ T m P (e i ≥ s > t i ) = P (t i ≥ s > e i ), and that In general, compared with N Em , N Em≥s>Tm + 2 • N Em>Tm≥s , is an inferior estimator of V Tm because it considers less data than N Em does, and hence it has a higher variability.However, the advantage of the paired approach becomes clear once we consider the corresponding estimators of V Tp .Indeed, because T p contains only peptides that are in the sample, we expect that for any reasonable scoring function and t i ∈ T p , P (e i > t i ≥ s) < P (t i > e i ≥ s), and therefore typically Because V Tp = 0 the fact that the estimator on the right hand side is generally smaller implies that it has a smaller bias.
The combined entrapment estimator (1) with r = 1 uses N E = N Em + N Ep to estimate V T , and hence 2N E = V E + N E to estimate V E + V T , the number of false target+entrapment discoveries.Similarly, our less conservative estimator is defined by adding N Ep≥s>Tp + 2 • N Ep>Tp≥s and N Em≥s>Tm + 2 • N Em>Tm≥s to estimate the same V T , and with N E = V E as before we obtain (4).
The k-matched paired entrapment estimations method We can generalize the paired entrapment estimation method to use k ≥ 1 entrapment peptides matched to each original target peptide, i.e., we assume the combined target+entrapment database contains (k + 1)-tuples of an original target and k independently generated entrapment peptides that are associated with it: {(t i , e 1 i , . . ., e k i ) : i = 1, . . ., n}.In practice, we generate these k entrapment peptides by applying multiple shuffles to t i and in particular r = k.
For l = 1, . . ., k, let N (l)≥s denote the number of original target peptides in T for which exactly l of its associated entrapment peptides score ≥ s and t i < min{s, e 1 i , . . ., e k i }, i.e., the target scored worse than the threshold s as well as its k associated entrapments.Similarly, we let N (k+1)≥s denote the number of original target peptides t i that again scored worse than its k associated entrapments but t i ≥ s.We define the k-matched entrapment estimate as Clearly, (S1) reduces to (4) for k = 1, so the paired entrapment estimation (4) coincides with the k = 1 k-matched paired estimation.Moreover, it can be shown that for any reasonable scoring function for which for t i ∈ T p , P (e j i > t i ) < P (t i > e j i ) for all j, FDP * k T ∪ E T of (S1) would still overestimate the FDP, but less so than FDP * T ∪ E T of (4). Figure S5: Peptide or precursor-level FDR control evaluation of DIA search tools In each panel, the empirical FDR was estimated for a given dataset (row) and search tool (column) using three different entrapment methods.Each panel plots the empirical FDR as a function of the FDR threshold.The dashed vertical lines are at the 1% FDR threshold, as are the numbers reported in text in the panels.The "Max q-value" value in each Spectronaut panel was the maximum reported q-value by the tool.Figure S6: Precursor-level FDR control evaluation of DIA search tools In each panel, the empirical FDR was estimated for a given dataset (row) and search tool (column) using three different entrapment methods.Each panel plots the empirical FDR as a function of the FDR threshold.The dashed vertical lines are at the 1% FDR threshold, as are the numbers reported in text in the panels.The "Max q-value" value in each Spectronaut panel was the maximum reported q-value by the tool.Figure S7: Precursor-level FDR control evaluation of DIA search tools In each panel, the empirical FDR was estimated for a given dataset (row) and search tool (column) using three different entrapment methods.Each panel plots the empirical FDR as a function of the FDR threshold.The dashed vertical lines are at the 1% FDR threshold, as are the numbers reported in text in the panels.The "Max q-value" value in each Spectronaut panel was the maximum reported q-value by the tool.As r increases all methods report increasingly larger estimated FDP.Specifically, with r = 6 even the lower bound on the FDP among the 6,636 proteins discovered at 1% FDR is as high as 4.1%.

Figure S2 :
Figure S2: Varying the evolutionary distance of the precursor-level foreign entrapment with DIA-NN applied to the DIA dataset PXD042704 (human-astral) with r = 1.The dashed vertical lines are at the 1% FDR threshold, as are the numbers reported in text in the panels.

Figure S3 :
FigureS3: Varying the evolutionary distance of the protein-level foreign entrapment with DIA-NN applied to the DIA dataset PXD042704 (human-astral) with r = 1.The dashed vertical lines are at the 1% FDR threshold, as are the numbers reported in text in the panels.

Figure S4 :
Figure S4: Foreign entrapments can be biased by contamination.

Figure S10 :
FigureS10: Protein-level FDR control evaluation of DIA search tools In each panel, the empirical FDR was estimated for a given dataset (row) and search tool (column) using three different entrapment methods.Each panel plots the empirical FDR as a function of the FDR threshold.The dashed vertical lines are at the 1% FDR threshold, as are the numbers reported in text in the panels.

Figure S13 :
FigureS13: Spectronaut struggles to control the FDR at the protein level as r increases on the DIA dataset PXD042704 (human-astral).As r increases all methods report increasingly larger estimated FDP.Specifically, with r = 6 even the lower bound on the FDP among the 6,636 proteins discovered at 1% FDR is as high as 4.1%.
Protein-level FDR control evaluation of DIA search tools In each panel, the empirical FDR was estimated for a given dataset (row) and search tool (column) using three different entrapment methods.Each panel plots the empirical FDR as a function of the FDR threshold.The dashed vertical lines are at the 1% FDR threshold, as are the numbers reported in text in the panels.
FigureS9: Protein-level FDR control evaluation of DIA search tools In each panel, the empirical FDR was estimated for a given dataset (row) and search tool (column) using three different entrapment methods.Each panel plots the empirical FDR as a function of the FDR threshold.The dashed vertical lines are at the 1% FDR threshold, as are the numbers reported in text in the panels.