plantMASST - Community-driven chemotaxonomic digitization of plants

Understanding the distribution of hundreds of thousands of plant metabolites across the plant kingdom presents a challenge. To address this, we curated publicly available LC-MS/MS data from 19,075 plant extracts and developed the plantMASST reference database encompassing 246 botanical families, 1,469 genera, and 2,793 species. This taxonomically focused database facilitates the exploration of plant-derived molecules using tandem mass spectrometry (MS/MS) spectra. This tool will aid in drug discovery, biosynthesis, (chemo)taxonomy, and the evolutionary ecology of herbivore interactions.

oxygen and energy, sustaining animal life, including our own, and have been used by humans to treat diseases. Despite their ecological, nutritional, and medicinal value, the taxonomic distribution of plant metabolites is hard to establish. This is due to the absence of a common open-access MS spectral database and search engine for plants. Furthermore, not every species in the plant kingdom has been studied and not every metabolite has been discovered. Most chemotaxonomic studies of plants are limited to specific clades of the taxonomic tree [2][3][4][5] and rely on the availability of data for known compounds in structural and spectral databases [6][7][8][9][10][11][12][13][14] , limiting the search for yet-to-be-characterized molecules. However, creating a chemophenetic plant metabolite inventory of all plant species found worldwide is within reach as mass spectrometry (MS) technologies and algorithms are continuously improving. To facilitate this, we have created plantMASST, a taxonomically informed mass spectrometry search tool for plant metabolites within the Global Natural Products Social Molecular Networking (GNPS) ecosystem. plantMASST builds on the approach taken for microbeMASST 15 , creating a digital inventory of untargeted plant metabolomics data. This enables the querying of MS/MS spectra corresponding to known and unknown molecules within a curated database of LC-MS/MS data from plant extracts, with results mapped across the plant taxonomic tree. As of May 2024, plantMASST contained curated reference datasets with LC-MS/MS data from 19,075 plant extracts and over 100 million MS/MS spectra linked to their respective taxonomic information (Figure 1a). The plantMASST reference database results from community contributions and metadata curation by 90 scientists worldwide, and it now includes 246 botanical families, 1,469 genera, and 2,793 species. To increase plantMASST coverage, we encourage the community to deposit new plant datasets in MassIVE (https://massive.ucsd.edu/) with associated metadata in the ReDU template 16 .
A single MS/MS spectrum can be searched in plantMASST through its web interface (https://masst.gnps2.org/plantmasst/). The search output is a list of all data files where the queried MS/MS spectrum (within the user-defined scoring criteria) has been observed. Additionally, an interactive taxonomic tree of the results is generated, which can be easily explored (Figure 1b,c).
The search is accomplished by providing either a Universal Spectrum Identifier (USI) 17 or a precursor m/z and a corresponding list of fragment ions as m/z-intensity pairs (Supplementary Figure S1a). Although the parameters are adjustable, the web interface provides the following default values: a precursor and fragment mass tolerance of 0.05 Da and a minimum cosine similarity of 0.7, with at least three matched fragment ions shared between the queried spectrum and the spectra available in plantMASST. "Analog search" can also be enabled by the user to explore MS/MS spectra from putative structural analogs or different ion forms of the same molecules (e.g., different adducts, multimers, and in-source fragment ions). Moreover, plantMASST automatically performs spectral library searches against the reference libraries publicly available within the GNPS environment for direct metabolite annotation through MS/MS spectral matching. Building on the concepts of MASST 18 and fastMASST 19 , plantMASST returns results within seconds and reports the taxonomic distribution of metabolites at the same time. The interactive taxonomic tree used to visualize the results (with links to related tools in the GNPS ecosystem) was built using the NCBI 20 taxonomy (Supplementary Figure S1b). Finally, users can explore matches by inspecting the MS/MS mirror plots, with shared peaks highlighted, using the Metabolomics Spectrum Resolver 21 and can access the corresponding LC-MS/MS files using the GNPS Dashboard 22 .
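The default matching criteria described above can be illustrated with a minimal Python sketch. This is not the plantMASST implementation: it assumes a plain (unweighted) cosine score with greedy one-to-one peak pairing, and the function names `cosine_match` and `is_hit` are hypothetical. Only the default values (0.05 Da tolerance, cosine of 0.7, three matched fragment ions) come from the text.

```python
from math import sqrt


def cosine_match(query, reference, tol=0.05):
    """Greedy cosine score between two peak lists of (m/z, intensity) pairs.

    Returns (score, n_matched). Each peak is paired at most once.
    Assumption: plain cosine without intensity or mass weighting.
    """
    # Collect all peak pairs whose m/z values agree within the tolerance.
    candidates = []
    for i, (mz_q, int_q) in enumerate(query):
        for j, (mz_r, int_r) in enumerate(reference):
            if abs(mz_q - mz_r) <= tol:
                candidates.append((int_q * int_r, i, j))

    # Accept the pairs with the largest intensity product first.
    candidates.sort(reverse=True)
    used_q, used_r = set(), set()
    dot = 0.0
    for product, i, j in candidates:
        if i in used_q or j in used_r:
            continue
        used_q.add(i)
        used_r.add(j)
        dot += product

    norm_q = sqrt(sum(inten * inten for _, inten in query))
    norm_r = sqrt(sum(inten * inten for _, inten in reference))
    if norm_q == 0 or norm_r == 0:
        return 0.0, 0
    return dot / (norm_q * norm_r), len(used_q)


def is_hit(query_precursor, ref_precursor, query, reference,
           tol=0.05, min_cosine=0.7, min_peaks=3):
    """Apply the default thresholds quoted in the text to one spectrum pair."""
    if abs(query_precursor - ref_precursor) > tol:
        return False
    score, n_matched = cosine_match(query, reference, tol)
    return score >= min_cosine and n_matched >= min_peaks


# Toy spectra (hypothetical values, for illustration only).
query_peaks = [(100.0, 10.0), (150.0, 50.0), (200.0, 100.0)]
reference_peaks = [(100.01, 12.0), (150.02, 45.0), (200.0, 100.0), (250.0, 5.0)]
score, n_matched = cosine_match(query_peaks, reference_peaks)
```

With analog search disabled, a precursor m/z outside the tolerance rejects the pair before any fragment comparison, which is why `is_hit` checks the precursor first.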
Here, we highlight the use of plantMASST through three representative applications on natural product discovery and one about how to explore public human diet intervention datasets.
First, researchers may desire to explore the taxonomic distribution of known plant-derived molecules. Therefore, as proof of concept, we investigated moroidin. This bicyclic octapeptide belongs to the family of BURP-domain-derived ribosomally synthesized and post-translationally modified peptides (RIPPs) 23 and was first isolated from Dendrocnide moroides Wedd. (Urticaceae) 24 and Celosia argentea L. (Amaranthaceae) 25 . This molecule was described to induce apoptosis in the A549 human non-small cell lung cancer cell line and it has been gaining attention in the search for new drugs for cancer treatment 26 . We used plantMASST to search an MS/MS spectrum of the [M + 2H] 2+ ion of moroidin and found four potential producer species across two different plant families, including yellow bauhinia (Bauhinia tomentosa Vell., Fabaceae), which was not a known producer of moroidin. Moroidin identification in the extract of B. tomentosa seeds was confirmed as Metabolite Identification level 1 by comparison to an authentic standard (Figure 1b) according to the Metabolomics Standards Initiative (MSI, Supplementary Figure S4) 27 . Another example of natural product discovery across the taxonomic domain is showcased by searching for piperlongumine, an amide alkaloid originally isolated from the Piper genus (Piperaceae), which has reported antitumor activity 28 (Figure 1c). This molecule is also currently being investigated for the treatment of glioblastoma 29 . Using plantMASST, we observed that the piperlongumine MS/MS spectrum was also detected in two other botanical species, Gymnotheca chinensis Decne. (Saururaceae) and Clematis apiifolia DC. (Ranunculaceae), a finding that has not been previously reported. MS/MS and retention time matching with a commercial standard allowed us to confirm the presence of piperlongumine in both species and achieve level 1 identification (Figure 1c) 27 . This expands the known natural reservoirs of piperlongumine beyond its primary source, Piper longum L. (Piperaceae), and highlights the versatility and efficacy of plantMASST in discovering pharmaceutically relevant phytochemicals in diverse plant families. It also suggests new research questions, such as why the production of these alkaloids is observed in such diverse plant species. Other representative plantMASST applications, such as caffeine, reserpine, icaridin, lutein, methoxsalen, and tryptophan, can be found in Supplementary Figures S1, S3, S4, and S6.

interpretation of a specialized metabolite known to be produced by specific plants, moroidin (the molecule was confirmed using a standard, Supplementary Figure S4). Pie charts display the percentage of matches within that taxonomic level detected against the plantMASST database. Blue indicates the percentage of samples with matches and yellow without matches. The reference MS/MS spectrum for moroidin (CCMSLIB00005435899) is available in the GNPS library. c) Output of the piperlongumine (CCMSLIB00010117596) search. Mainly produced by Piper species, the molecule was also confirmed, for the first time, to be present in Gymnotheca chinensis and Clematis apiifolia leaf extracts via MS/MS and retention time matching to its commercial standard. In the mirror plots, the top spectrum corresponds to piperlongumine and the bottom spectrum is a match from plantMASST. Additional tissues were also matched, and the chromatograms can be found in Supplementary Figure S5.
Beyond demonstrating the potential of plantMASST for highlighting specific plant metabolites with a narrow distribution within the plant kingdom, we built on recent observations that human neuroactives are also present in plants 30,31 . Many of them are involved in the signaling, communication, and adaptation of plants 31 . However, it is not yet known to what extent they are distributed across the plant kingdom. As no such report currently exists, we used plantMASST to search the MS/MS spectra of acetylcholine, dopamine, serotonin, glutamate, gamma-aminobutyric acid (GABA), and norepinephrine, all known neuroactives also produced by humans. Additionally, for comparison, we searched for cannabidiol (CBD) and tetrahydrocannabinol (THC), two plant-derived metabolites known to affect human brain physiology (Figure 2a,b).
Serotonin, also known as 5-HT, was detected in 61 out of 246 plant families available in plantMASST, with a notable prevalence in species belonging to the Malpighiaceae family, including the genus Banisteriopsis (Figure 2b). This observation suggests that this genus is a potential source of neuroactives such as serotonin. It may also help explain the neuroactive properties of species of this genus, such as Banisteriopsis caapi C.V. Morton (Malpighiaceae), which is the main ingredient of Ayahuasca, an indigenous beverage traditionally used in the Northwestern Amazon to treat mental health disorders 32 . These plants could be further evaluated for serotonin-mediated effects on breathing, sleep, arousal, and seizure control 33,34 .
Other neuroactives were found in specific plant families. For instance, tryptamine was detected only in Solanum lycopersicum L. (Solanaceae); CBD and THC, two molecules that have neurological effects on humans, were found only in Cannabis sativa (Cannabaceae). Therefore, plantMASST can also enable users to explore the diversity of neuroactives in plants and highlight plants that could be further investigated for potential pharmacological applications or the extraction of bioactive compounds with medicinal properties.
Although plantMASST and the underlying reference database can be leveraged in many ways, we showcase a last example of how plantMASST can be used to detect human dietary consumption of plants. We reanalyzed all the MS/MS spectra from two public metabolomics datasets of human fecal samples from diet-related studies: one comparing vegan vs omnivore diets 35,36 , and the other comparing American vs Mediterranean diets 37 . We observed a higher percentage of MS/MS matches to plant metabolites in the vegan group than in the omnivore group (Figure 2c). Interestingly, subjects consuming the Mediterranean diet also had more MS/MS matches to plant metabolites than those consuming the American diet (Figure 2d). These results suggest that plantMASST could also be used to define the nature of plant-derived dietary patterns.
It is important to bear in mind certain limitations when interpreting the results of plantMASST, since the current reference database contains data acquired under diverse experimental and acquisition conditions. These include the plant organ, plant growth conditions, the biome of origin, extraction methods, sample preparation, instrument geometries, and defined collision energies, among others. Therefore, even if a plant is known to be a producer of a compound of interest, a plantMASST match might be missed because low-intensity MS/MS ions yield too few fragment ion matches or because no MS/MS of the compound was triggered due to low precursor ion intensity. If users lower the minimum number of matched MS/MS fragment peaks to find more matches, this may result in more false positive matches. Further, although many MS/MS spectra will be quite specific to one molecule (e.g., moroidin and reserpine, Supplementary Figure S2a), isomers that have the same mass can have nearly identical MS/MS spectra, as is the case for quercetin and morin (Supplementary Figure S2b). This means that in some cases the taxonomic interpretation of a family of molecules may not map well to the individual molecules.
Finally, the reference database currently covers only 2,793 plant species, representing a fraction of the species that exist. Despite these challenges, plantMASST represents an important advance in our ability to map the plant taxonomic distribution of metabolites, especially as researchers continue to expand the taxonomic curation of untargeted plant metabolomics data. We expect that plantMASST will potentially have profound implications for the fields of drug discovery, nutrition, and chemical ecology.

Data collection and curation
To enable the taxonomic search of known and unknown compounds and to make a dent in capturing chemotaxonomic data from all available plant extracts, 211 publicly available MS/MS datasets in GNPS/MassIVE were manually compiled and each file within these datasets was taxonomically defined, where possible to the species level. For the samples in which the species was not known, the closest known taxonomic rank was defined (e.g., genus or family). These datasets consisted of 20,209 unique LC-MS/MS files, representing plant extracts inclusive of all plant tissue types (leaves, stems, seeds, among others) and habits (trees, shrubs, lianas, herbs, etc.). To collect this number of samples, we made an open call to the scientific community to deposit plant-related datasets in GNPS/MassIVE, which led to the deposition of an additional 25 datasets between December 2022 and March 2023, resulting from the efforts of 12 research groups across the world. The information collected for each file included in plantMASST is available on GitHub and comprises the following: the path of each file, the filename in the format 'Dataset/Filename', the MassIVE ID, the taxon name, the NCBI Taxonomy ID, ReDU availability, whether the file corresponds to a blank or a QC, and the USI of the file.
To get the NCBI taxonomy IDs, the datasets containing ReDU metadata were matched to NCBI IDs.When only a table containing the sample names was provided, the NCBI taxonomy IDs were manually retrieved from the NCBI Taxonomy web browser (https://www.ncbi.nlm.nih.gov/taxonomy).
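As an illustration of how such a per-file metadata table could be reduced to the set of distinct taxa, the following Python sketch reads a small CSV and collects unique NCBI Taxonomy IDs while skipping blanks and QCs. The column names (`ncbi_id`, `blank_or_qc`), the file contents, and the function name `distinct_taxa` are hypothetical placeholders; the actual layout of the table on GitHub may differ.

```python
import csv
import io


def distinct_taxa(rows, id_column="ncbi_id", control_column="blank_or_qc"):
    """Return the set of distinct taxonomy IDs, ignoring blank and QC files.

    Column names are hypothetical; adapt them to the real table header.
    """
    ids = set()
    for row in rows:
        # Skip files flagged as blanks or quality controls.
        if row[control_column].strip().lower() in {"blank", "qc"}:
            continue
        tax_id = row[id_column].strip()
        if tax_id:  # files without an assigned ID are ignored
            ids.add(tax_id)
    return ids


# Placeholder table: identifiers and taxon names are invented for illustration.
EXAMPLE_TABLE = """filename,massive_id,taxon,ncbi_id,blank_or_qc
DATASET_A/leaf_01.mzML,DATASET_A,Taxon A,1001,no
DATASET_A/leaf_02.mzML,DATASET_A,Taxon A,1001,no
DATASET_A/blank_01.mzML,DATASET_A,,,blank
DATASET_B/seed_01.mzML,DATASET_B,Taxon B,1002,no
DATASET_B/qc_01.mzML,DATASET_B,Taxon B,1002,qc
"""

rows = list(csv.DictReader(io.StringIO(EXAMPLE_TABLE)))
taxa = distinct_taxa(rows)
```

Deduplicating at this stage means the lineage lookup only has to be run once per taxon rather than once per file.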

plantMASST taxonomic tree generation
The plantMASST taxonomy tree was created with R Studio 4.2.2 and Python 3.10. Only distinct NCBI IDs (n = 3,173) were kept in the database. To obtain the complete lineage of every NCBI ID 38 , the 'taxize' package (v.0.9.100) was used, specifically its classification function. The main taxonomic ranks (kingdom to species), along with subgenus, subspecies, and varieties, were retained to create taxonomic trees with an equal number of taxa nodes. After importing the NCBI ID list for every taxon into Python, a taxonomic tree was created using the ETE toolkit 39 .
After that, the created Newick tree was transformed into JSON format, and details such as taxonomic rank and number of accessible samples were added. All taxonomic entries in plantMASST are classified according to the NCBI taxonomy.

[...] TSV file containing the USIs to be searched or an MGF file containing precursors and MS/MS spectra of the ions. The same settings described in the API web interface, such as minimum cosine, m/z tolerance, precursor m/z, and minimum fragments matched, are adjustable. To create the resulting taxonomic tree, the JSON file of the complete plantMASST taxonomic tree is filtered and converted into a D3 JavaScript object that can be visualized as an HTML file.
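The tree annotation and filtering steps can be sketched in plain Python. This is a simplified illustration, not the published pipeline: it builds the tree directly from lineage lists instead of converting a Newick tree produced with the ETE toolkit, and the function names, lineage keys, and sample counts are hypothetical. The family, genus, and species names are taken from the examples discussed in the text.

```python
import json


def build_tree(lineages, sample_counts):
    """Build a nested tree annotated with rank and number of samples.

    lineages: dict mapping a taxon key to a list of (rank, name), root to leaf.
    sample_counts: dict mapping the same keys to a number of samples.
    """
    root = {"name": "root", "rank": "no rank", "samples": 0, "children": {}}
    for key, lineage in lineages.items():
        n = sample_counts.get(key, 0)
        node = root
        node["samples"] += n
        for rank, name in lineage:
            node = node["children"].setdefault(
                name, {"name": name, "rank": rank, "samples": 0, "children": {}}
            )
            node["samples"] += n  # every ancestor accumulates the count
    return root


def prune(node, matched):
    """Keep only nodes that are matched or have a matched descendant.

    Children are returned as a list, the nested layout D3 hierarchies expect.
    """
    kept = []
    for child in node["children"].values():
        pruned_child = prune(child, matched)
        if pruned_child is not None:
            kept.append(pruned_child)
    if not kept and node["name"] not in matched:
        return None
    return {"name": node["name"], "rank": node["rank"],
            "samples": node["samples"], "children": kept}


# Hypothetical keys and sample counts, for illustration only.
lineages = {
    "t1": [("family", "Piperaceae"), ("genus", "Piper"), ("species", "Piper longum")],
    "t2": [("family", "Saururaceae"), ("genus", "Gymnotheca"),
           ("species", "Gymnotheca chinensis")],
    "t3": [("family", "Cannabaceae"), ("genus", "Cannabis"),
           ("species", "Cannabis sativa")],
}
sample_counts = {"t1": 4, "t2": 2, "t3": 5}

full_tree = build_tree(lineages, sample_counts)
result_tree = prune(full_tree, {"Piper longum", "Gymnotheca chinensis"})
result_json = json.dumps(result_tree, indent=2)
```

The pruned structure serializes directly to JSON, so it could be embedded in an HTML page and rendered by a D3 tree layout.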

plantMASST applications
plantMASST can be used in a variety of scenarios. First, we showed the taxonomic distribution of specific metabolites present in the GNPS public libraries (moroidin, caffeine, reserpine, icaridin, lutein, methoxsalen, and piperlongumine). These searches were done using the web interface and the default parameters. Second, selected neuroactives with MS/MS spectra available in the GNPS public libraries (acetylcholine, dopamine, GABA, glutamate, norepinephrine, serotonin, CBD, THC, and tryptamine) were searched against plantMASST with the web interface using the default parameters. The matches obtained were manually inspected via mirror plots and the low-quality matches were filtered out. The tables containing the taxonomic distribution were downloaded and combined to visualize them using the Interactive Tree of Life

Retention time matching for moroidin
Moroidin was isolated and purified as described 23 from Celosia argentea flower and seed material. All materials were purchased from Fisher Scientific unless otherwise noted. Celosia argentea seeds were purchased from Seedville TM , and Bauhinia tomentosa seeds were purchased from rarepalmseeds.com TM . Seeds of C. argentea or B. tomentosa (0.2 g each) were ground with mortar and pestle and extracted with 3 mL methanol for 1 h with shaking at 200 rpm at 37 °C in a 7 mL glass scintillation vial (Fisher Scientific 03-337-26). The methanol extracts were dried under nitrogen gas and resuspended in 3 mL of deionized water. The resuspensions were partitioned twice with hexane (1:1, v/v), partitioned twice with ethyl acetate (1:1, v/v), and extracted once with 3 mL n-butanol. The n-butanol extract was dried in vacuo in a Thermo Scientific SPD140P1 speed vac and resuspended in 2 mL of 80% methanol for LC-MS/MS analysis. The moroidin standard had a final concentration of 500 nM in 80% methanol.
LC-MS/MS analysis for moroidin was carried out on a Thermo H-ESI-Q-Exactive Orbitrap mass spectrometer coupled to a Thermo Vanquish ultra

GitHub (https://github.com/helenamrusso/plantmasst). To help interpret and establish that distinct plant species' small molecules were only found, known molecules already present in the GNPS library (https://library.gnps2.org/) were employed.
- Moroidin (CCMSLIB00005435737)

the percentage of matches within that taxonomic level detected against the plantMASST database. Blue indicates the percentage of samples with matches and yellow without matches. Reference MS/MS spectra of reserpine (CCMSLIB00010110971), and it represents a level 2 annotation according to the Metabolomics Standards Initiative 27 . b) The MS/MS spectrum is searched against the GNPS libraries, and possible annotations are returned if matches are identified. Users can visit the accompanying GNPS Library Spectrum page for information on the reference spectrum. c) Data on matched scans in the samples from various taxa are supplied. Furthermore, users can visualize the mirror plot between the queried spectrum and spectra from datasets included in the plantMASST database, along with the similarity score and matching fragments. The user can also obtain the project's MassIVE accession number as well as contact information for the person who contributed the data.

Figure 1. Schematic overview of plantMASST infrastructure and output. a) The creation of the plantMASST reference database involved utilizing 19,075 community-curated LC-MS/MS data and knowledge from MassIVE, GNPS 12 , and ReDU 16 . b) Example of plantMASST output

Figure 2. Use of plantMASST in the search for neuroactives in plants and plant-derived small molecules in public datasets of human fecal samples. a) Taxonomic distribution of neuroactive compounds across plant families. b) Heatmap showing the distribution of the nine neuroactives in 156 plant species. The bar plot highlights the sum of the percentage (%) of matches to serotonin/dopamine among the botanical families. c) Percentages of metabolites matched to plantMASST in a publicly available human diet-related dataset (GNPS/MassIVE: MSV000086989) 35,36 containing fecal samples from vegans (n = 27) and omnivores (n = 27). d) Percentages of metabolites matched to plantMASST in a human diet-related dataset (GNPS/MassIVE: MSV000093005) 37 containing fecal samples from patients consuming either an American (n = 27) or Mediterranean (n = 82) diet. Boxplots represent first (lower), median, and third (upper) quartiles. Upper and lower whiskers extend to the closest value to +/-1.5 * interquartile range (IQR). The independent two-sided t-test was applied in Fig 2c, while the two-sided Mann-Whitney-Wilcoxon test was applied in Fig 2d, and differences were considered statistically significant at p < 0.05. Food icons were obtained from Bioicons.com.

(iTOL) 40 . A heatmap was generated in Python (version 3.11) to show the plantMASST matches of these neuroactives to different plant species. The packages 'pandas' (version 2.2.2) and 'plotly' (version 5.21.0) were used for this analysis. Finally, we re-analyzed two diet-related MS/MS studies (vegan vs omnivore diet: MSV000086989; American vs Mediterranean diet: MSV000093005). The raw data were processed in MZmine3 (version 3.4.27). The MZmine3 batch files used in each study are available on GitHub (https://github.com/helenamrusso/plantmasst), in addition to the generated output files. The generated .mgf files were used as input for the plantMASST batch search (https://github.com/robinschmid/microbe_masst) using the following parameters: cosine threshold: 0.7; minimum matched peaks: 4; precursor mass and fragment tolerance: 0.02 Da; analog search: off. Boxplots showing the percentages of matches to plantMASST in each sample were generated in Python using the 'seaborn' package (version 0.11.2). The Shapiro-Wilk test was used to assess the normality of the data.
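The per-sample quantity shown in the boxplots, the percentage of MS/MS spectra with a plantMASST match, can be sketched with the Python standard library alone. The function names and all counts below are hypothetical toy values, not results from the reanalyzed studies, and the sketch stops at group medians; normality testing and group comparison would be done separately with a statistics package.

```python
from statistics import median


def percent_matched(n_matched, n_total):
    """Percentage of a sample's MS/MS spectra with at least one match."""
    if n_total <= 0:
        raise ValueError("a sample must contain at least one MS/MS spectrum")
    return 100.0 * n_matched / n_total


def median_percent_by_group(samples):
    """Median matched percentage per group.

    samples: iterable of (group, n_matched, n_total) tuples.
    """
    by_group = {}
    for group, n_matched, n_total in samples:
        by_group.setdefault(group, []).append(percent_matched(n_matched, n_total))
    return {group: median(values) for group, values in by_group.items()}


# Toy counts for two unnamed groups (invented for illustration).
toy_samples = [
    ("group_a", 30, 200),
    ("group_a", 50, 250),
    ("group_b", 10, 200),
    ("group_b", 20, 400),
]
group_medians = median_percent_by_group(toy_samples)
```

Normalizing by the total number of spectra per sample keeps samples with different acquisition depths comparable.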
search for plant-derived molecules (Figure 2c) from fecal samples of vegans and omnivores is publicly available in GNPS/MassIVE under the accession number MSV000086989. Data used to assess plant-derived molecules in fecal samples from people subjected to an American and Mediterranean diet is publicly available in GNPS/MassIVE under the accession number MSV000093005. Data acquired for retention time matching between piperlongumine standard and plant extracts is available in GNPS/MassIVE under the accession number MSV000094562.
MassIVE IDs) of the studies used to generate this tool is available on GitHub (https://github.com/helenamrusso/plantmasst, plant_masst_table.csv). All the taxonomic trees shown in this manuscript can be interactively explored by downloading the .html files available on

130; Microscans: 1; data type: profile; Use EASY-IC(TM): ON. Dynamic exclusion mode: Custom; Exclude after n times: 1; Exclusion duration (s): 5; Mass tolerance: ppm; low: 10, high: 10; Exclude isotopes: true. Apex detection: Desired Apex Window (%): 50. Isotope Exclusion: Assigned and unassigned with an exclusion window (m/z) for unassigned isotopes: 8. The intensity threshold was set to 2.5E5 and a targeted mass exclusion list was used. The centroid data-dependent MS 2 (dd-MS 2 ) scan acquisition events were performed in discovery mode, triggered by Apex detection with a trigger detection (%) of 300 and a maximum injection time of 120 ms, performing 1 microscan. The top 3 abundant precursors (charge states 1 and 2) within an isolation window of 1.2 m/z were considered for MS/MS analysis. For precursor fragmentation in the HCD mode, normalized collision energies of 15, 30, and 45% were used. Data was recorded in profile mode (Use EASY-IC(TM): ON).