Skip to main content
bioRxiv
  • Home
  • About
  • Submit
  • ALERTS / RSS
Advanced Search
New Results

MassGenie: a transformer-based deep learning method for identifying small molecules from their mass spectra

Aditya Divyakant Shrivastava, Neil Swainston, Soumitra Samanta, Ivayla Roberts, View ORCID ProfileMarina Wright Muelas, View ORCID ProfileDouglas B. Kell
doi: https://doi.org/10.1101/2021.06.25.449969
Aditya Divyakant Shrivastava
1Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, Faculty of Health and Life Sciences, University of Liverpool, Crown St, Liverpool L69 7ZB, UK
2Department of Computer Science and Engineering, Nirma University, Ahmedabad 382481, India
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Neil Swainston
1Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, Faculty of Health and Life Sciences, University of Liverpool, Crown St, Liverpool L69 7ZB, UK
3Mellizyme Biotechnology Ltd, Liverpool Science Park IC1, 131 Mount Pleasant, Liverpool L3 5TF
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Soumitra Samanta
1Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, Faculty of Health and Life Sciences, University of Liverpool, Crown St, Liverpool L69 7ZB, UK
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Ivayla Roberts
1Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, Faculty of Health and Life Sciences, University of Liverpool, Crown St, Liverpool L69 7ZB, UK
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Marina Wright Muelas
1Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, Faculty of Health and Life Sciences, University of Liverpool, Crown St, Liverpool L69 7ZB, UK
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Marina Wright Muelas
Douglas B. Kell
1Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, Faculty of Health and Life Sciences, University of Liverpool, Crown St, Liverpool L69 7ZB, UK
2Department of Computer Science and Engineering, Nirma University, Ahmedabad 382481, India
3Mellizyme Biotechnology Ltd, Liverpool Science Park IC1, 131 Mount Pleasant, Liverpool L3 5TF
4Novo Nordisk Foundation Centre for Biosustainability, Technical University of Denmark, Building 220, Kemitorvet, 2800 Kgs. Lyngby, Denmark
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Douglas B. Kell
  • For correspondence: dbk@liv.ac.uk doukel@biosustain.dtu.dk
  • Abstract
  • Full Text
  • Info/History
  • Metrics
  • Supplementary material
  • Preview PDF
Loading

Abstract

The ‘inverse problem’ of mass spectrometric molecular identification (‘given a mass spectrum, calculate the molecule whence it came’) is largely unsolved, and is especially acute in metabolomics where many small molecules remain unidentified. This is largely because the number of experimentally available electrospray mass spectra of small molecules is quite limited. However, the forward problem (‘calculate a small molecule’s likely fragmentation and hence at least some of its mass spectrum from its structure alone’) is much more tractable, because the strengths of different chemical bonds are roughly known. This kind of molecular identification problem may be cast as a language translation problem in which the source language is a list of high-resolution mass spectral peaks and the ‘translation’ a representation (for instance in SMILES) of the molecule. It is thus suitable for attack using the deep neural networks known as transformers. We here present MassGenie, a method that uses a transformer-based deep neural network, trained on ~6 million chemical structures with augmented SMILES encoding and their paired molecular fragments as generated in silico, explicitly including the protonated molecular ion. This architecture (containing some 400 million elements) is used to predict the structure of a molecule from the various fragments that may be expected to be observed when some of its bonds are broken. Despite being given essentially no detailed nor explicit rules about molecular fragmentation methods, isotope patterns, rearrangements, neutral losses, and the like, MassGenie learns the effective properties of the mass spectral fragment and valency space, and can generate candidate molecular structures that are very close or identical to those of the ‘true’ molecules. We also use VAE-Sim, a previously published variational autoencoder, to generate candidate molecules that are ‘similar’ to the top hit. In addition to using the ‘top hits’ directly, we can produce a rank order of these by ‘round-tripping’ candidate molecules and comparing them with the true molecules, where known. As a proof of principle, we confine ourselves to positive electrospray mass spectra from molecules with a molecular mass of 500Da or lower. The transformer method, applied here for the first time to mass spectral interpretation, works extremely effectively both for mass spectra generated in silico and on experimentally obtained mass spectra from pure compounds. The ability to create and to ‘learn’ millions of fragmentation patterns in silico, and therefrom generate candidate structures (that do not have to be in existing libraries) directly, thus opens up entirely the field of de novo small molecule structure prediction from experimental mass spectra.

Competing Interest Statement

The authors have declared no competing interest.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Back to top
PreviousNext
Posted June 26, 2021.
Download PDF

Supplementary Material

Email

Thank you for your interest in spreading the word about bioRxiv.

NOTE: Your email address is requested solely to identify you as the sender of this article.

Enter multiple addresses on separate lines or separate them with commas.
MassGenie: a transformer-based deep learning method for identifying small molecules from their mass spectra
(Your Name) has forwarded a page to you from bioRxiv
(Your Name) thought you would like to see this page from the bioRxiv website.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Share
MassGenie: a transformer-based deep learning method for identifying small molecules from their mass spectra
Aditya Divyakant Shrivastava, Neil Swainston, Soumitra Samanta, Ivayla Roberts, Marina Wright Muelas, Douglas B. Kell
bioRxiv 2021.06.25.449969; doi: https://doi.org/10.1101/2021.06.25.449969
Digg logo Reddit logo Twitter logo Facebook logo Google logo LinkedIn logo Mendeley logo
Citation Tools
MassGenie: a transformer-based deep learning method for identifying small molecules from their mass spectra
Aditya Divyakant Shrivastava, Neil Swainston, Soumitra Samanta, Ivayla Roberts, Marina Wright Muelas, Douglas B. Kell
bioRxiv 2021.06.25.449969; doi: https://doi.org/10.1101/2021.06.25.449969

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
  • Tweet Widget
  • Facebook Like
  • Google Plus One

Subject Area

  • Bioinformatics
Subject Areas
All Articles
  • Animal Behavior and Cognition (3609)
  • Biochemistry (7585)
  • Bioengineering (5533)
  • Bioinformatics (20825)
  • Biophysics (10344)
  • Cancer Biology (7995)
  • Cell Biology (11653)
  • Clinical Trials (138)
  • Developmental Biology (6617)
  • Ecology (10224)
  • Epidemiology (2065)
  • Evolutionary Biology (13639)
  • Genetics (9557)
  • Genomics (12856)
  • Immunology (7930)
  • Microbiology (19568)
  • Molecular Biology (7675)
  • Neuroscience (42182)
  • Paleontology (308)
  • Pathology (1259)
  • Pharmacology and Toxicology (2208)
  • Physiology (3271)
  • Plant Biology (7058)
  • Scientific Communication and Education (1295)
  • Synthetic Biology (1953)
  • Systems Biology (5433)
  • Zoology (1119)