Abstract
In shotgun proteomics, the amount of information that can be extracted from label-free quantification experiments is typically limited by the identification rate and the noise level of the quantitative signals. This generally results in low sensitivity in differential expression analysis at the protein level. Here, we propose a quantification-first approach that reverses the classical identification-first workflow. Specifically, we introduce a method, Quandenser, that applies unsupervised clustering at both the MS1 and MS2 levels to summarize all analytes of interest without assigning identities. This prevents valuable information from being discarded prematurely in the identification process and, because clustering reduces the amount of data, allows us to spend more effort on identification. Applying this methodology to a dataset of partially known composition, we could employ open modification and de novo searches to identify multiple analytes of interest that would have gone unnoticed in traditional pipelines. Furthermore, Quandenser reports error rates at the feature level, which we integrated into our probabilistic protein quantification method, Triqler, to propagate error probabilities from the feature level all the way to the protein level. Quandenser/Triqler outperformed the state-of-the-art method MaxQuant/Perseus, consistently reporting more differentially abundant proteins, even in a clinical dataset where none had been discovered previously. Compellingly, in all three clinical datasets investigated, the differentially abundant proteins showed enrichment for functional annotation terms.