RT Journal Article
SR Electronic
T1 A supervised fingerprint-based strategy to connect natural product mass spectrometry fragmentation data to their biosynthetic gene clusters
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 2021.10.05.463235
DO 10.1101/2021.10.05.463235
A1 Tiago F. Leao
A1 Mingxun Wang
A1 Ricardo da Silva
A1 Justin J.J. van der Hooft
A1 Anelize Bauermeister
A1 Asker Brejnrod
A1 Evgenia Glukhov
A1 Lena Gerwick
A1 William H. Gerwick
A1 Nuno Bandeira
A1 Pieter C. Dorrestein
YR 2021
UL http://biorxiv.org/content/early/2021/10/06/2021.10.05.463235.abstract
AB Microbial natural products, in particular secondary or specialized metabolites, are an important source and inspiration for many pharmaceutical and biotechnological products. However, bioactivity-guided methods widely employed in natural product discovery programs do not explore the full biosynthetic potential of microorganisms, and they usually miss metabolites that are produced at low titer. As a complementary method, genome-based mining in natural products research has facilitated the charting of many novel natural products in the form of predicted biosynthetic gene clusters that encode their production. Linking the biosynthetic potential inferred from genomics to the specialized metabolome measured by metabolomics would accelerate natural product discovery programs. Here, we applied a supervised machine learning approach, the K-Nearest Neighbor (KNN) classifier, to systematically connect metabolite mass spectrometry data to their biosynthetic gene clusters. This pipeline offers a method for annotating the biosynthetic genes for known metabolites, analogs of known metabolites, and cryptic metabolites that are detected via mass spectrometry. We demonstrate this approach by automated linking of six different natural product mass spectra, and their analogs, to their corresponding biosynthetic genes.
Our approach can be applied to bacterial, fungal, algal and plant systems where genomes are paired with corresponding MS/MS spectra. Additionally, an approach that connects known metabolites to their biosynthetic genes potentially allows for bulk production via heterologous expression, which is especially useful when the metabolites are produced in low amounts by the original producer.
Significance: The pace of natural products discovery has remained relatively constant over the last two decades. At the same time, there is an urgent need to find new therapeutics to fight antibiotic-resistant bacteria, cancer, tropical parasites, pathogenic viruses, and other severe diseases. To spark the discovery of structurally novel and bioactive natural products, we here introduce a supervised learning algorithm (K-Nearest Neighbor) that can connect MS/MS spectra of known metabolites, analogs of known metabolites, and as-yet-unknown metabolites to their corresponding biosynthetic gene clusters. Our Natural Products Mixed Omics tool provides access to genomic information for bioactivity prediction, class prediction, substrate predictions, and stereochemistry predictions to prioritize relevant metabolite products and facilitate their structural elucidation.
Competing Interest Statement: W.H.G. has an equity interest in NMRFinder and in SirenasMD Inc., companies that may potentially benefit from the research results, and W.H.G. also serves on the Scientific Advisory Boards of these companies. The terms of this arrangement have been reviewed and approved by the University of California San Diego in accordance with its conflict of interest policies. P.C.D. is a scientific advisor to SirenasMD Inc., Galileo and Cybele, and cofounder and scientific advisor to Ometa and Enveda with approval by the University of California San Diego. M.W. is a cofounder of Ometa Labs, LLC.