Unveiling the links between peptide identification and differential analysis FDR controls by means of a practical introduction to knockoff filters

In proteomic differential analysis, FDR control is often performed through a multiple test correction (i.e., the adjustment of the original p-values). In this protocol, we apply a recent and alternative method, based on so-called knockoff filters. It shares interesting conceptual similarities with the target-decoy competition procedure, classically used in proteomics for FDR control at the peptide identification step. To provide practitioners with a unified understanding of FDR control in proteomics, we apply the knockoff procedure on real and simulated quantitative datasets. Leveraging these comparisons, we propose to adapt the knockoff procedure to better fit the specificities of quantitative proteomic data (mainly, very few samples). The performances of the knockoff procedure are compared with those of the classical Benjamini-Hochberg procedure, thereby shedding new light on the strengths and weaknesses of target-decoy competition.


Introduction
Controlling the false discovery rate (FDR) is a well-established practice in most -omic approaches, as it answers a pervasive question: Considering quantitative measurements for many covariates (be they genes, transcripts, metabolites, or proteins) in a set of samples split into at least two different biological conditions, how can we shortlist some differentially expressed ones while controlling the risk of false positives (i.e., covariates wrongly selected because they look differentially expressed while they are not)? To cope with this, the most commonly used procedure is without a doubt the Benjamini-Hochberg one (BH) [2]. However, due to its large field of application, FDR control has attracted considerable additional effort in biostatistics, and many authors have proposed to improve upon BH FDR control [3,8], or have proposed alternative frameworks to do so [1,6,20].
In the specific case of proteomics, FDR control is not only used in the aforementioned biomarker selection problem. It is also an essential quality control metric when matching experimental fragmentation spectra onto in silico spectra (i.e., spectra derived from a reference database of protein sequences). However, for historical reasons, the associated FDR control is not performed using classical tools from biostatistics. On the contrary, a rather empirical approach termed target-decoy [9] is almost exclusively used. It consists in searching two databases: the first one, referred to as target, containing the genuine protein sequences, and another one, referred to as decoy, containing artefactual sequences. Under the assumption that target mismatches and decoy matches are equally likely, the number of decoy matches can be used to estimate the number of target mismatches, thus opening the door to FDR control.
For a long time, FDR control for peptide identification and for protein differential analysis have been considered largely independent. However, theoretical connections exist: Notably, it has long been established [16] that if target and decoy databases are searched independently, then the procedure is broadly equivalent to relying on empirical null theory to estimate the FDR in a BH-related way [8]. More recently, it has been shown [7] that the BH procedure could be a user-friendly and computationally attractive alternative to target-decoy competition (TDC) (see Note 1). However, recent developments in theoretical biostatistics have made the links between both approaches to FDR control even tighter. Notably, the authors of [1] have proposed to tackle FDR control in biomarker research using an algorithmic procedure akin to that of TDC. Broadly, this novel approach, denoted as "knockoff filter," works as follows. First, knockoff variables are simulated to be as independent as possible from the sample conditions, while preserving the covariance structure of the original variables (see Note 2). Second, a competition is organized between each original variable and its associated knockoff. Third, the proportion of retained knockoffs is used to estimate the proportion of wrongly selected original covariates (see Table 1 for a more detailed comparison with TDC). Conversely, authors have recently leveraged the theory underlying knockoff filters to propose improved TDC strategies (see [10]).
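As a preview of the procedure detailed in the Methods section, these three steps can be run in a few lines with the knockoff R package. The sketch below uses toy data whose sizes and effect values are our own illustrative choices (real proteomic data usually has far fewer samples):

```r
library(knockoff)

# Toy data: 100 samples, 30 covariates, the first 5 truly differential
# (illustrative sizes only, chosen so that the sketch runs reliably).
set.seed(1)
n <- 100; p <- 30
y <- rep(c(1, -1), each = n / 2)
X <- matrix(rnorm(n * p), n, p)
X[y == 1, 1:5] <- X[y == 1, 1:5] + 1

# Steps 1-3: knockoff generation, original-vs-knockoff scoring, and
# selection at the target FDR, all wrapped in knockoff.filter().
result <- knockoff.filter(X, y,
                          knockoffs = create.second_order,
                          statistic = stat.lasso_lambdasmax,
                          fdr = 0.05)
result$selected   # indices of covariates deemed differentially expressed
```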
Overall, the framework of knockoff filters is particularly insightful to provide a global understanding of FDR control in proteomics, and the purpose of this protocol is to root such a unified view in empirical comparisons using both real and simulated data. Interestingly, the results of these comparisons are compliant with empirical knowledge about the various strengths and weaknesses classically associated with each FDR control method.

Notations
We start by reviewing commonly used yet conflicting notations in biostatistics and proteomics.

Classical notations in biostatistics
In biostatistics, the false discovery rate (FDR) and the false discovery proportion (FDP) are distinct notions. The FDP corresponds to what was classically and informally referred to as the "true FDR" in proteomics, i.e., the exact proportion of false positives among the proteins that passed the user-defined selection threshold and are therefore deemed differentially abundant. Of course, except for benchmark artificial or simulated datasets, this quantity is unknown in practice.
The FDR reads FDR = E[FDP], where E stands for the expectation, which broadly amounts to the long-run average of the FDP over an infinite number of related experiments subject to stochastic fluctuations. This quantity is also unknown, but it is much easier to estimate, and such an estimate is classically noted FDR-hat. Estimating the FDR is insightful but, unfortunately, not always sufficient [14]. An unbiased FDR estimate is expected to provide a value close to E[FDP]. However, on a given dataset, this value may be larger or smaller than the FDP. While a slightly too large estimate implies a conservative behavior (there will be fewer false positives than expected among the shortlisted biomarkers), a too small estimate implies a too liberal quality control and subsequent risks in post-proteomics experiments.

Table 1: Step-by-step comparison of target-decoy competition and the knockoff filter (second-order approximation).

Target-Decoy Competition:
1. Construct peptide decoys such that decoy PSMs have the same score distribution as erroneous target PSMs.
2. For each real spectrum obtained, find the best match among all targets and decoys, and retain its score.
3. The number of selected decoy PSMs at a given cutoff enables estimating the FDR on the selected target PSMs.

Knockoff filter (second-order approximation):
1. For each protein, generate knockoff abundances with the same mean and correlation matrices as the original abundances.
2. For each protein, compute a score describing whether the original abundance vector or its knockoff best predicts the condition.
3. The number of selected knockoffs at a given cutoff enables estimating the FDR on the original proteins deemed differentially abundant.
To cope with the weaknesses of FDR estimation, FDR control procedures have been developed: they rely on more conservative assumptions that yield slightly fewer selected discoveries at a given cut-off parameter. If we denote by FDR-hat_α the FDR estimate resulting from controlling the FDR at level α (α being classically tuned to 1%), it is likely that FDR-hat_α ≤ α.
In other words, if one cuts off a list of putative biomarkers according to an FDR controlled at 1%, the FDR estimate on this very list is likely to be slightly lower than 1%. However, as the FDP remains unknown, this is the only way to safely assume that the FDP is equal to or lower than 1%.

Classical notations in proteomics
In proteomics, most of the notions described above (see Subheading 2.1) are conflated. Since the mid-2010s, discriminating between the FDP and the FDR has progressively become standard. However, the distinction between FDR (as equal to E[FDP]), FDR-hat, FDR-hat_α, and α is scarce. The reason is obvious: except for specific methodological publications, most of them are not useful to the community. Indeed, in practice, a proteomic researcher only needs to manipulate α, the cut-off parameter, and to understand that after applying the FDR control accordingly, the FDP is not necessarily strictly equal to α, but possibly slightly smaller. However, the everyday language is error-prone: when one says or writes "We selected the putative biomarkers at an FDR of 1%," what is referred to as FDR is not E[FDP], FDR-hat, or FDR-hat_α, but α.
To cope with this, it is possible to rely on other notations. They are not as formal as those of mainstream biostatistics (see Subheading 2.1), although they are sometimes reported in mathematics works [4]. However, they are sufficient for rigorous everyday work in a proteomic lab. Essentially, it amounts to conflating the FDR estimate with α, and to defining FDR control as a procedure which provides the following guarantee with a sufficiently high probability:

FDP ≤ α    (1)

This formulation can be misleading in the sense that it gives the impression that the FDR control procedure indeed controls the FDP (see Note 3). However, it has two advantages: First, it makes the everyday language compliant with the minimum amount of statistical notions possible; second, it simplifies the understanding of other statistical notions such as "q-value" or "adjusted p-value," as, using this formalism, they are simply equivalent to the FDR, as detailed in [5]. In the rest of the protocol, the naming conventions resulting from Eq. 1 are used, so that FDR refers to α, the FDR level tuned by the practitioner to perform FDR control.
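As a concrete illustration of this convention, the BH procedure as implemented in base R returns adjusted p-values that can be directly thresholded at α. The p-values below are hypothetical toy values:

```r
# Toy p-values: the first few look differential, the rest do not.
pv <- c(0.0001, 0.0004, 0.009, 0.12, 0.35, 0.41, 0.58, 0.67, 0.81, 0.94)

alpha <- 0.01                        # FDR level tuned by the practitioner
adj <- p.adjust(pv, method = "BH")   # BH-adjusted p-values (q-values)
selected <- which(adj <= alpha)      # shortlist controlled at FDR level alpha
selected                             # returns indices 1 and 2 for these values
```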

Other notations used in this protocol
Hereafter, the following mathematical notations are used:
1. n: the number of biological samples.
2. p: the number of proteins to include in differential analysis.
3. X ∈ R^(n×p): the matrix of protein abundances, where each row corresponds to a sample and each column corresponds to a protein.
4. X_j: the vector of abundances of the j-th protein, i.e., the j-th column of X.
5. x_(i,j): the abundance value of the j-th protein for the i-th replicate.
6. y: the vector representing the condition label (numerical value) of the biological samples, of length n. For example, the i-th coefficient of y is 1 if the i-th sample comes from the healthy condition, and -1 if it comes from the disease condition.
7. X^Ko ∈ R^(n×p): the matrix of knockoff protein abundances.
8. X_j^Ko: the knockoff vector of abundances of the j-th protein.
9. W: the vector of scores of all proteins (only the original ones, not the knockoffs), of length p.
10. W_j: the score associated to the j-th protein. A large positive value is evidence that the protein is differentially expressed. It is typically constructed by comparing the predictive power of X_j and X_j^Ko on the sample conditions. Swapping X_j and X_j^Ko should swap the sign of W_j. A null W_j means that X_j and X_j^Ko bring the same amount (or lack thereof) of information on the condition.
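To make these notations concrete, the base-R snippet below builds a toy abundance matrix X and condition vector y matching the conventions above (the sizes and abundance values are arbitrary illustrative choices):

```r
set.seed(1)
n <- 6    # number of biological samples
p <- 10   # number of proteins

# X: n-by-p matrix of log2 abundances (random toy values around 20)
X <- matrix(rnorm(n * p, mean = 20, sd = 2), nrow = n, ncol = p)

# y: +1 for the first condition (e.g., healthy), -1 for the second
y <- rep(c(1, -1), each = n / 2)

X[2, 5]   # x_(2,5): abundance of the 5th protein in the 2nd replicate
```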

R version
R version 4.0.3 (or above) is required to use the following packages. We recommend using an integrated development environment such as RStudio to execute the commands of this protocol. It can be downloaded from https://www.rstudio.com/.

Packages
The following packages are necessary:
1. The packages knockoff, lars [13], and glmnet [11] must be installed from the CRAN:

   install.packages("knockoff")
   install.packages("lars")
   install.packages("glmnet")

2. cp4p [12] provides two datasets with controlled ground truth: they result from the analysis of samples containing different abundances of 48 human proteins spiked into a yeast background [19]. The p-values from a Welch t-test associated to each protein are also provided, along with functions to apply the Benjamini-Hochberg procedure for differential analysis. To install the cp4p package, it is first necessary to install the BioConductor [15] packages it depends on:

   if (!requireNamespace("BiocManager", quietly = TRUE))
     install.packages("BiocManager")
   BiocManager::install("multtest")
   BiocManager::install("limma")
   BiocManager::install("qvalue")

3. Then, cp4p can be installed from the CRAN:

   install.packages("cp4p")

4. Finally, load the packages in the environment:

   library(cp4p)
   library(knockoff)
   library(lars)

Data Format
This protocol relies on a data format which is quite uncommon in proteomics (see Note 4). The input data X on which FDR control is applied should have at least 3 rows, i.e., at least 3 biological samples in total are needed. The number p of proteins to include in differential analysis can be arbitrary. Values of abundance in X should be log2-scaled.
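These format requirements (at least 3 rows, samples in rows, log2 scale) can be checked with a short helper function. The helper below is our own suggestion, not part of the original protocol, and the unlogged-intensity check is a heuristic only:

```r
check_input <- function(X) {
  # X must be a numeric matrix with samples as rows and proteins as columns
  stopifnot(is.matrix(X), is.numeric(X))
  # at least 3 biological samples in total are needed
  stopifnot(nrow(X) >= 3)
  # log2-scaled label-free intensities rarely exceed a few dozen units;
  # a very large maximum suggests raw (unlogged) intensities (heuristic)
  if (max(X, na.rm = TRUE) > 1000)
    warning("Abundances look unlogged; apply log2() first?")
  invisible(TRUE)
}

check_input(matrix(rnorm(30, mean = 20), nrow = 3))  # passes silently
```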
For convenience, we use two datasets in this protocol: a dataset resulting from real mass-spectrometry output, called LFQRatio25 (see Subheading 3.4), and a simulated dataset with adjustable parameters (see Subheading 3.5).

Data loading from cp4p
The following commands enable loading and preparing the LFQRatio25 dataset [12]: 1. Load the dataset with the following command:
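A sketch of this loading step, assuming the dataset is exposed by cp4p under the name LFQRatio25 (as the earlier description of the package suggests):

```r
library(cp4p)

# Load the spiked-in dataset shipped with cp4p and inspect its structure
# before building the abundance matrix X and the condition vector y.
data(LFQRatio25)
str(LFQRatio25)
```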

Data simulation
The following commands enable preparing a simulated dataset: 1. The code below randomly generates a dataset broadly akin to LFQRatio25.
Due to randomness, it will differ from one run to another. To ensure that the results are reproducible and to obtain the same results as in the remainder of the protocol, use the following optional command to set the random seed (see
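A minimal base-R generator of a dataset broadly akin to LFQRatio25 (few samples, many proteins, a small truly differential fraction) can be sketched as follows; all sizes and effect values here are illustrative assumptions of ours:

```r
set.seed(42)   # optional: fixes the random seed for reproducibility
n <- 6         # biological samples (3 per condition)
p <- 1000      # proteins, of which p1 are truly differential
p1 <- 50
y <- rep(c(1, -1), each = n / 2)

# Baseline log2 abundances, common to both conditions
X <- matrix(rnorm(n * p, mean = 20, sd = 1), nrow = n, ncol = p)

# Shift the first p1 proteins in the first condition (effect size: 1 unit)
X[y == 1, 1:p1] <- X[y == 1, 1:p1] + 1
```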

Methods
This section falls into the following subsections: 1. We explain how to apply the original knockoff-filter procedure to control the FDR for differential expression analysis. Precisely, we show how to (1) generate knockoff variables; (2) compute a score for each protein/knockoff pair; (3) select differentially abundant proteins for a predefined target FDR.
2. We detail how to replace the default scoring strategy with other ones, and compare these alternative knockoff procedures to the classical Benjamini-Hochberg (BH) procedure.
3. We propose some code to illustrate the sensitivity of the knockoff filter procedure to the random generation of knockoffs.

Original knockoff procedure
1. Choose the dataset on which to apply the knockoff procedure: 5. Set the value of the targeted FDR, compute the resulting threshold, and select the proteins whose score is above this threshold. The target_fdr parameter must be a number between 0 and 1. The offset parameter determines which FDR estimator to use; it can be set to either 0 or 1 (see Note 10). When offset is 0, a biased FDR estimate is used; when offset is 1, an unbiased yet more conservative estimate is used. We notice that the FDP and power curves in Figures 1 and 2 are almost always horizontal. This means that the selected variables remain the same whatever the chosen FDR target. When offset equals 1 (unbiased estimator), no proteins are deemed differentially expressed below a certain FDR value. Thus, even though there are no false positives, there are no true positives either, making FDR control through knockoff filters practically useless.
We mainly explain this over-conservativeness by the use of variable selection with the Lasso algorithm at the score computation step. In fact, in the n << p setting, the Lasso algorithm will select at most n variables. This is problematic for differential expression analysis, where the total number of samples rarely exceeds the number of a priori differentially expressed proteins. On top of that, as very few covariates are selected, and some original variables are much more differentially abundant than all the others, knockoff variables are almost never selected. Thus, estimating the number of false discoveries from the number of selected knockoffs is not appropriate in our case. The efficiency of variable selection with the Lasso is thoroughly discussed in [21].
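The n << p limitation can be checked directly: along a Lasso path computed with lars, the active set never exceeds the number of samples. The toy sizes below are our own illustrative choices:

```r
library(lars)

set.seed(7)
n <- 10; p <- 200
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

# Full Lasso path: one row of coefficients per step
fit <- lars(X, y, type = "lasso")

# Number of nonzero coefficients at each step of the path
active_sizes <- apply(coef(fit) != 0, 1, sum)
max(active_sizes) <= n   # the Lasso can select at most n variables
```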

Scoring methods based on forward stagewise regression and t-test
Preliminary experimental comparisons highlighted that the accuracy of the knockoff procedure highly depends on the chosen feature selection algorithm. We hereafter describe two scoring methods that we found to address the issue described above (see Subheading 4.1). The first one consists in using the forward stagewise selection (FS) algorithm (see Note 11). The second one is derived from the variable selection procedure classically used in proteomics: it amounts to computing a t-test p-value for both the original and knockoff variables; then, the final score (i.e., W_j) is defined as the log difference of the p-values (LDP) obtained for each original variable and its knockoff. We observe that the knockoff filter procedure with LDP broadly follows the same trend as the BH one on LFQRatio25 (see Figure 4). By construction, the LDP score is never null, yielding a rather symmetric distribution (see Figure 3). The largest positive scores (depicted in the right-hand tail) result from differentially abundant proteins, while the left-hand tail amounts to selected knockoff proteins. The distribution being more symmetric than when using the Lasso, it is possible to select a larger subset of proteins at a given FDR. However, when using the FS-based scores, the knockoff filter behaves roughly as with the Lasso, yielding a greater but still insufficient power.
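The LDP scoring can be sketched in base R as follows. The helper names (ldp_scores, knockoff_threshold) are ours, the independent random matrix standing in for a proper knockoff copy is for illustration only, and the selection rule implements the usual knockoff+ threshold formula:

```r
# W_j = log10(p-value of knockoff) - log10(p-value of original): a large
# positive W_j means the original protein is far more significant.
ldp_scores <- function(X, Xko, y) {
  pval <- function(x) t.test(x[y == 1], x[y == -1])$p.value  # Welch t-test
  log10(apply(Xko, 2, pval)) - log10(apply(X, 2, pval))
}

# Knockoff+ threshold: smallest t with (1 + #{W <= -t}) / max(1, #{W >= t}) <= fdr
knockoff_threshold <- function(W, fdr = 0.05) {
  ts <- sort(abs(W))
  ok <- sapply(ts, function(t) (1 + sum(W <= -t)) / max(1, sum(W >= t)) <= fdr)
  if (any(ok)) ts[which(ok)[1]] else Inf
}

# Toy illustration: 6 samples, 300 proteins, the first 30 shifted upwards,
# and an independent random matrix as a crude stand-in for real knockoffs.
set.seed(3)
n <- 6; p <- 300
y <- rep(c(1, -1), each = n / 2)
X <- matrix(rnorm(n * p, mean = 20), n, p)
X[y == 1, 1:30] <- X[y == 1, 1:30] + 2
Xko <- matrix(rnorm(n * p, mean = 20), n, p)

W <- ldp_scores(X, Xko, y)
selected <- which(W >= knockoff_threshold(W, fdr = 0.1))
```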
Finally, the BH procedure also yields anti-conservative results on LFQRatio25, as the FDP is always higher than the FDR. However, this can be explained by other preprocessing steps (match between runs, normalization, imputation, etc.), which tend to shrink the within-condition variance prior to differential analysis, as well as to increase the risk of false positives that are not accounted for by FDR control. Indeed, Benjamini-Hochberg is conservative on simulated data (see Figure 5).

Figure 3: Histogram of the scores W_j obtained with the log difference of p-values scoring method, on the LFQRatio25 dataset. The blue area corresponds to the selected original variables, and the red area to the selected knockoff variables, both at a threshold of 2 (hence, a conservative FDR estimate at a selection threshold of 2 reads FDR-hat = (red area + 1) / blue area).

Sensitivity of FDR control to the knockoffs used
Knockoff generation with the create.second_order function (see Subheading 4.1, Step 3) involves the random draw of a knockoff matrix (similarly to the random generation of decoy sequences). Hence, on a given dataset, running two consecutive FDR control procedures with knockoff filters should lead to slightly different results. We hereafter propose several experiments to illustrate the sensitivity of the knockoff filter procedure to the knockoff generation, as well as to evaluate its magnitude, leading to a display akin to that of Figure 6. The proteins selected at an FDR of 5% for each knockoff series are retained in a matrix referred to as scores:

   image(t(submat), col = c("grey", "blue", "red"),
         xlab = "Proteins (selected at least once)", axes = FALSE)
   mtext(text = c(paste("Knockoff", c(1, 15, 30))), side = 2,
         line = 0.1, at = seq(0.0, 1, 1/2), las = 1, cex = 0.9)
   legend("topright", inset = c(-0.23, 0),
          legend = c("Selected H_0", "Selected H_1", "Not selected"),
          fill = c("red", "cyan", "grey"))

Figures 5 and 7 emphasize the important variability resulting from the random nature of knockoff filters. To counter this variability, [18] proposes a method to aggregate multiple knockoffs. In fact, a similar sensitivity has already been commented upon with target-decoy procedures [17], so it seems to be a problem ubiquitous in FDR control procedures that involve simulating artifactual data under the null hypothesis. Finally, these observations provide intuitive support for the tools described in [10], which rely on multiple decoy databases to construct a knockoff-like score.
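The repeated-draw experiment can be condensed into a loop. The sketch below uses the knockoff package on a small simulated dataset (toy sizes and the number of draws are illustrative choices of ours) and records which proteins are selected across 30 independent knockoff draws:

```r
library(knockoff)

# Toy data: 100 samples, 30 proteins, the first 5 truly differential
# (illustrative sizes; real proteomic data usually has far fewer samples).
set.seed(1)
n <- 100; p <- 30
y <- rep(c(1, -1), each = n / 2)
X <- matrix(rnorm(n * p), n, p)
X[y == 1, 1:5] <- X[y == 1, 1:5] + 1

# Run the full knockoff filter 30 times; each run draws fresh knockoffs.
n_draws <- 30
selections <- matrix(FALSE, nrow = n_draws, ncol = p)
for (k in seq_len(n_draws)) {
  res <- knockoff.filter(X, y, knockoffs = create.second_order,
                         statistic = stat.lasso_lambdasmax, fdr = 0.05)
  selections[k, res$selected] <- TRUE
}

freq <- colMeans(selections)  # selection frequency of each protein
```

Proteins whose frequency is close to 1 are selected regardless of the random draw, while intermediate frequencies reveal the sensitivity discussed above.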