Abstract
Proteogenomics aims to identify variant or unknown proteins in bottom-up proteomics by searching custom protein databases derived from transcriptomes or genomes. However, empirical observations indicate that the large size of these proteogenomic databases is associated with lower sensitivity of peptide identification. Various strategies have been proposed to avoid this, including the generation of reduced transcriptome-informed protein databases (i.e., reference protein databases that retain only proteins whose transcripts are expressed in the sample-matched transcriptome), which were found to increase peptide identification sensitivity. In this work, we present a detailed evaluation of this approach. First, we establish that the increased sensitivity in peptide identification is in fact a statistical artefact, which directly results from the limited capability of target-decoy competition (TDC) to accurately model incorrect target matches with excessively small databases. As anti-conservative false discovery rates (FDRs) likely hamper the robustness of the resulting biological conclusions, we advocate for alternative FDR control methods that are less sensitive to database size. Second, we show that, despite not increasing sensitivity, reduced transcriptome-informed databases are useful because they reduce the ambiguity of protein identifications by yielding fewer shared peptides. Furthermore, we illustrate that searching the reference database and subsequently filtering out proteins with unexpressed transcripts similarly reduces protein identification ambiguity, while being a more transparent and reproducible strategy. To summarize, using reduced transcriptome-informed databases is an interesting strategy that has not been promoted for the right reason: its benefit is a decrease in protein identification ambiguity, not the artefactual increase in peptide identification sensitivity.
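To make the two families of FDR control mentioned above concrete, the following minimal Python sketch contrasts a common TDC-style estimate (decoy matches divided by target matches above a score threshold) with Benjamini-Hochberg adjusted p-values. It is an illustration under stated assumptions, not the analysis pipeline of this work: the function names `tdc_fdr` and `bh_adjust`, the optional "+1" decoy correction, and all scores and p-values are hypothetical.

```python
def tdc_fdr(target_scores, decoy_scores, threshold, add_one=True):
    """TDC-style FDR estimate at a score threshold (illustrative sketch).

    Counts target and decoy matches scoring at or above the threshold and
    returns (decoys [+ 1]) / targets, capped at 1.0. The "+1" term is an
    assumed, commonly used conservative correction.
    """
    n_targets = sum(score >= threshold for score in target_scores)
    n_decoys = sum(score >= threshold for score in decoy_scores)
    if n_targets == 0:
        return 1.0  # nothing accepted: report the most conservative value
    numerator = n_decoys + (1 if add_one else 0)
    return min(1.0, numerator / n_targets)


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical PSM scores, for illustration only.
    targets = [10.0, 9.0, 8.0, 7.0, 3.0]
    decoys = [8.5, 2.0, 1.0]
    print(tdc_fdr(targets, decoys, threshold=7.0, add_one=False))
    print(tdc_fdr(targets, decoys, threshold=7.0, add_one=True))

    # Hypothetical PSM p-values, for illustration only.
    print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

With only a handful of decoy matches above the threshold, the TDC estimate moves in coarse steps: a single additional decoy changes it substantially. The BH adjustment, by contrast, depends only on the p-values and their ranks.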
Competing Interest Statement
The authors have declared no competing interest.
ABBREVIATIONS
- BH: Benjamini-Hochberg procedure
- CC: connected component
- DB: database
- FDR: false discovery rate
- MS/MS: tandem mass spectrometry
- PSM: peptide-spectrum match
- TDC: target-decoy competition