Abstract
Despite the rapidly increasing amount of data in public mass spectrometry repositories, peptide mass spectra are usually analyzed by each laboratory in isolation, treating each experiment as if it had no relationship to any other. This approach fails to exploit the wealth of existing, previously analyzed mass spectrometry data. Alternatively, although large collections of spectra can be jointly analyzed using spectrum clustering methods, this unsupervised approach does not utilize information from peptide identifications. Here, we propose to train a deep neural network in a supervised fashion based on previous assignments of peptides to spectra. The network, called “GLEAMS,” learns to embed spectra into a low-dimensional space in which spectra generated by the same peptide are close to one another. We empirically demonstrate the utility of this learned embedding by propagating annotations from labeled to unlabeled spectra. We further use GLEAMS as the basis for large-scale spectral clustering, detecting groups of unidentified, proximal spectra representing the same peptide, and we show how to use these clusters to explore the dark proteome of repeatedly observed yet consistently unidentified mass spectra. We provide a software implementation of our approach, along with a tool to quickly embed additional spectra using a pre-trained model, to facilitate large-scale analyses.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
We have completely rewritten GLEAMS from the ground up to enable efficient scaling to repository-scale data. We have replaced our custom clustering algorithm with a standard large-scale clustering algorithm, DBSCAN. This change appropriately places the emphasis of our manuscript on the quality of our learned embedding, showing that these embeddings work well with a standard clustering tool. We have run our analysis on a massive dataset comprising 669 million spectra. This dataset is a pre-analyzed compendium of peptide-spectrum matches (PSMs) previously generated to produce a community knowledge base of the discoverable human proteome.
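To illustrate the idea of applying a standard density-based clustering algorithm to learned embeddings, the following is a minimal sketch of DBSCAN in plain Python. It is not the GLEAMS implementation: the two-dimensional toy coordinates stand in for real spectrum embeddings, Euclidean distance is assumed, and the names `dbscan`, `region_query`, `eps`, and `min_samples` as well as the parameter values are illustrative choices only. An analysis at repository scale would rely on an optimized library rather than this quadratic-time version.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Assign a cluster label (0, 1, ...) or NOISE to every point."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already visited
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # points[i] is a core point: start a new cluster and grow it
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


# Hypothetical toy "embeddings": two tight groups and one isolated point.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0),
]

labels = dbscan(embeddings, eps=0.5, min_samples=2)
print(labels)
```

In this sketch, points that share a label play the role of spectra presumed to come from the same peptide, while the isolated point is left as noise. If some members of a cluster carried peptide annotations and others did not, the cluster would be a natural unit for propagating those annotations or for flagging repeatedly observed but unidentified spectra.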