A learned embedding for efficient joint analysis of millions of mass spectra

Wout Bittremieux; Damon H. May; Jeffrey Bilmes; William Stafford Noble

doi:10.1101/483263

Abstract

Despite the rapidly increasing amount of data in public mass spectrometry repositories, peptide mass spectra are usually analyzed by each laboratory in isolation, treating each experiment as if it has no relationship to any others. This approach fails to exploit the wealth of existing, previously analyzed mass spectrometry data. Alternatively, although large spectral data can be jointly analyzed using spectrum clustering methods, this unsupervised approach does not utilize information from peptide identifications. Here, we propose to train a deep neural network in a supervised fashion based on previous assignments of peptides to spectra. The network, called “GLEAMS,” learns to embed spectra into a low-dimensional space in which spectra generated by the same peptide are close to one another. We empirically demonstrate the utility of this learned embedding by propagating annotations from labeled to unlabeled spectra. We further use GLEAMS as the basis for a large-scale spectral clustering, detecting groups of unidentified, proximal spectra representing the same peptide, and we show how to use these clusters to explore the dark proteome of repeatedly observed yet consistently unidentified mass spectra. We provide a software implementation of our approach, along with a tool to quickly embed additional spectra using a pre-trained model, to facilitate large-scale analyses.
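The annotation-propagation idea described above can be illustrated with a minimal sketch, assuming spectra have already been embedded as fixed-length vectors. The function name `propagate_annotations`, the Euclidean metric, the `max_dist` threshold, and the toy values are illustrative assumptions and are not taken from the GLEAMS implementation. Each unlabeled embedding receives the peptide label of its nearest labeled neighbor, provided that neighbor lies within the threshold.

```python
import math

def propagate_annotations(labeled, unlabeled, max_dist=0.5):
    """Assign each unlabeled embedding the peptide of its nearest labeled
    neighbor, or None when no labeled embedding lies within max_dist.

    labeled:   list of (embedding, peptide) pairs
    unlabeled: list of embeddings
    (Hypothetical helper; the metric and threshold are assumptions.)
    """
    results = []
    for query in unlabeled:
        best_peptide, best_dist = None, math.inf
        # Brute-force nearest-neighbor search over the labeled embeddings.
        for emb, peptide in labeled:
            d = math.dist(query, emb)
            if d < best_dist:
                best_peptide, best_dist = peptide, d
        # Propagate the label only when the neighbor is close enough.
        results.append(best_peptide if best_dist <= max_dist else None)
    return results

# Toy 2-D embeddings with made-up peptide labels.
labeled = [((0.0, 0.0), "PEPTIDEK"), ((5.0, 5.0), "SAMPLER")]
unlabeled = [(0.1, 0.2), (4.9, 5.1), (2.5, 2.5)]
annotations = propagate_annotations(labeled, unlabeled)
```

At repository scale, a brute-force search like this one would presumably be replaced by an approximate nearest-neighbor index; the sketch only shows the principle.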

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

We have completely rewritten GLEAMS from the ground up to enable efficient scaling to repository-scale data. We have replaced our custom clustering algorithm with a standard large-scale clustering algorithm, DBSCAN. This change appropriately places the emphasis of our manuscript on the quality of our learned embedding, showing that these embeddings work well with a standard clustering tool. We have run our analysis on a massive dataset comprising 669 million spectra. This dataset is a pre-analyzed compendium of peptide-spectrum matches (PSMs) previously generated to produce a community knowledge base of the discoverable human proteome.
https://github.com/bittremieux/GLEAMS
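To illustrate how a density-based method groups nearby embeddings, the following is a minimal, brute-force sketch of the DBSCAN algorithm in pure Python. It is not the GLEAMS code; the function name `dbscan`, the Euclidean metric, the parameter values `eps` and `min_samples`, and the toy points are all assumptions chosen for illustration. In practice an optimized library implementation would be used instead.

```python
import math

NOISE = -1

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: returns one cluster label per point (-1 = noise)."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force region query; the result includes point i itself.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:
                queue.extend(nbrs)  # j is a core point: keep expanding
        cluster += 1
    return labels

# Toy 2-D "embeddings": two tight groups and one isolated point.
embeddings = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0),
              (5.0, 5.0), (5.0, 5.1), (5.1, 5.0),
              (10.0, 10.0)]
cluster_labels = dbscan(embeddings, eps=0.3, min_samples=3)
```

Points labeled -1 belong to no cluster. In the setting described above, clusters that contain only unidentified spectra would be the candidates for further exploration.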

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.