Abstract
Mass spectrometry is a cornerstone of untargeted metabolomics, enabling the characterization of metabolites in both positive and negative ionization modes. However, comparisons across ionization modes have remained a substantial challenge due to the distinct fragmentation patterns produced by each polarity. To overcome this barrier, we present MS2DeepScore 2.0, a machine learning-based model to predict chemical similarity between mass fragmentation spectra, which works both between different and the same ionization modes. We demonstrate the utility of MS2DeepScore 2.0 in a human urine case study, where MS2DeepScore enabled cross-ionization mode molecular networking, enhancing data exploration and metabolite annotation. To ensure robustness, we have implemented a quality estimation method that flags spectra with low information content or those dissimilar to the training data, thereby minimizing false predictions. Altogether, MS2DeepScore 2.0 extends our current capabilities in organizing, exploring, and annotating untargeted metabolomics profiles.
Introduction
Mass spectrometry is widely used to map the chemical contents of natural extracts and other biological mixtures. In untargeted metabolomics, tandem mass spectrometry (or mass spectrometry fragmentation, MS/MS, MS2) is typically used to support structural annotation of metabolite features detected in metabolomics profiles. Interpretation of tandem mass spectra is increasingly done with the help of computational tools that assist with structurally annotating mass spectra, such as SIRIUS1, MS-Finder2 and MS2Query3. Furthermore, mass spectral similarity scores, like the cosine score, modified cosine score4, Spec2Vec5, MS2DeepScore6 and others7-9, play a crucial role in in silico annotation and organization approaches like library matching, analogue searching, and organizing spectra by molecular networking.
The most widely used classical spectrum similarity measure, the cosine score, evaluates similarity based on visual equivalence in fragmentation patterns, making it effective for identifying (near-)identical molecules under strict conditions. The so-called modified cosine score considers both neutral losses and direct matching fragments during signal alignment. Thereby, it can account for a single structural modification and is used for searching structurally similar molecules4, 10, 11.
Nevertheless, both scores fail to serve as general proxies for chemical similarity, as they struggle to account for more complex fragmentation relationships arising from multiple structural modifications5. Additionally, both metrics assume similar experimental conditions, making them sensitive to variations in ionization mode, instrument type, collision energy, and data processing pipelines. Consequently, they identify only a tiny fraction of the dense chemical relationships found in complex samples5, 12.
Mass spectrometry can be performed in two ionization modes: positive and negative. How suitable a particular ionization mode is for detecting a metabolite largely depends on the metabolite’s structure13, 14. Consequently, mass spectrometry data is often acquired in both ionization modes to cover a larger fraction of the metabolome of the measured samples. While mass fragmentation spectra are highly similar for the same molecule when recorded in the same ionization mode with the same acquisition parameters, this is often not the case when comparing to a mass spectrum recorded in the other ionization mode15. By design, neither the cosine score nor the modified cosine score is suitable for comparing spectra across different ionization modes. As a result, positive and negative ionization mode mass spectra are mostly analyzed separately, for instance, by searching in separate reference libraries and creating two separate molecular networks16-18. While approaches like MolNotator19 and Ion Identity Molecular Networking20 can merge positive and negative mode spectra into one network, they require adduct identification based on well-aligned retention times and the recognition of specific mass differences between mass features. Achieving retention time alignment can be cumbersome and necessitates using the same chromatography column for both positive and negative ionization modes. A cross-ionization mode MS2 similarity metric could alleviate these challenges by enabling streamlined computational workflows that align positive and negative ionization mode data.
Here, we developed a mass spectral similarity metric that can predict chemical similarity between mass spectra of not only the same, but also different ionization modes. The approach for this new similarity metric is based on the Siamese neural network architecture used in the previous version of MS2DeepScore6. The original MS2DeepScore model was able to predict chemical similarities with good overall accuracy, but it has several shortcomings. First, separate models had to be trained for positive and negative ionization mode data which meant less training data for each model and no cross-ionization mode applications. Secondly, the former MS2DeepScore models were trained on MS2 fragments only; however, spectral metadata like precursor m/z, ionization mode, or other acquisition parameters could become valuable information for improving prediction quality. In the present work, we explore and evaluate the addition of metadata to the model input and show that using ionization mode and precursor m/z as input improves model performance. Thirdly, we introduce a new pair sampling algorithm that reduces biases introduced during model training. Finally, we introduce a method that can estimate the mass spectral embedding quality for each input spectrum. This allows users to filter out spectra for which the MS2DeepScore predictions are unreliable, e.g., due to low spectral quality or when spectra differ substantially from the training data. This further improves the reliability of MS2DeepScore results.
In addition to the above-mentioned key aspects, this work also contains many technical improvements on the MS2DeepScore code and hyperparameters which lead to better predictions and much shorter runtimes both for model training and chemical similarity prediction. We also added a training pipeline that makes training new models easier, more streamlined, and more robust. Our latest model is now also available in mzmine21, seamlessly integrating MS2DeepScore molecular networking of thousands of spectra in seconds with mzmine’s feature detection workflows and statistical analysis dashboards. The interactive network visualizer facilitates exploring the chemical space by combining multiple spectral similarity metrics. This local deployment offers scientists without programming expertise easy access to MS2DeepScore molecular networks.
We demonstrate the utility of our model through a case study on human urine, showcasing its ability to integrate positive and negative ionization mode spectra into a unified analysis. We found cross-ionization mode clusters in a molecular network, thereby finding more chemical relations between metabolite features. We also highlight the possibility of directly visualizing MS2DeepScore embeddings using UMAP22 and provide an interactive plot for intuitive exploration of both positive and negative ionization mode spectra of the case study. Annotation by experts confirmed the validity of the discovered links between positive and negative ionization mode spectra. By enabling cross-ionization mode molecular networking and embedding visualizations, our model paves the way for extracting deeper insights from untargeted metabolomics data.
Results
Cross-ionization mode models
Training MS2DeepScore 2.0 on mass fragmentation spectra in both ionization modes resulted in a model that performs well for predicting chemical similarity between spectra acquired in the same ionization mode, but also for predicting chemical similarity between positive and negative ionization mode spectra. Figure 1 shows the predicted MS2DeepScore scores between pairs of test spectra. Figure 1c shows that MS2DeepScore 2.0 can also predict chemical similarity between spectra of two different ionization modes.
Predictions are made between all test spectra, followed by taking the average per unique molecule pair. The counts in each Tanimoto bin are normalized, by dividing the counts by the total number of pairs in each Tanimoto score bin. 50 uniformly distributed bins were used for both axes. a) Predictions between pairs of negative ionization mode spectra in the test set. b) Predictions between pairs of positive ionization mode spectra in the test set. c) Predictions between pairs of negative and positive ionization mode spectra in the test set.
Case studies MS2DeepScore
The new capabilities of cross-ionization mode MS2DeepScore models are illustrated with a case study on human urine samples. Multiple cross-ionization mode networks were formed linking biochemically relevant metabolites together. The highlighted clusters in Figure 2 have been manually curated and annotated by experts. In total, 37 spectra were manually annotated, resulting in annotating the structural identity of 13 different metabolites. The confidence level23 of the annotations can be found in Supplementary Table 7.2. For example, the left cluster contains caffeine-related molecules, all part of known caffeine metabolism pathways24. A model without cross-ionization mode predictions would not have been able to link positive and negative ionization mode spectra in this cluster together. Therefore, the cross-ionization mode model was able to highlight new connections in the molecular network that correspond to real metabolic pathways.
By predicting chemical similarity between both the positive and negative ionization mode spectra, spectra of both ionization modes can be visualized together. An interactive version of the full UMAP plot is available as an HTML file and an interactive version of the molecular network can be loaded in Cytoscape (see Data Availability section). a) A molecular network created by using MS2DeepScore 2.0 similarity scores. We highlight a few examples where MS2DeepScore was able to predict close chemical similarity between positive and negative ionization modes. We recommend visualization in Cytoscape for more details (see Data Availability section). b) UMAP representation of the MS2DeepScore 2.0 embeddings of the human urine case study. Two spectra are highlighted which both correspond to the same molecule, but were recorded in positive and negative ionization modes. MS2DeepScore 2.0 correctly predicted very similar embeddings, while the fragments do not overlap. The UMAP representation is zoomed in to show the area with most spectra; as a result, a few spectra (<5%) are excluded from the plot.
As an intermediate output, MS2DeepScore produces mass spectral embeddings. UMAP can plot these 500-dimensional vectors in a 2D representation of the chemical space complementary to the networking approach. Figure 2b shows the UMAP22 representation of the embeddings of the case study. In this UMAP representation we combined both positive and negative ionization mode spectra into a single representation. We observe mixing of the embeddings of both ionization modes in 2D space and highlight an example of a positive and a negative mode spectrum of cholic acid, a bile acid commonly found in urine25. Both spectra have a similar embedding and are visualized closely together in the UMAP representation, indicating high predicted chemical similarity. The spectra of this molecule in different ionization modes are visually very different, but still MS2DeepScore 2.0 was able to predict high chemical similarity of 0.79, while the modified cosine score results in a similarity of 0.36 and cosine score a similarity of 0.0. The full UMAP representation of the embeddings is available as an interactive HTML file allowing for exploring the case studies further, see Data Availability section.
Uncertainty evaluation
MS2DeepScore predictions can be unreliable for some mass spectra. This could, for instance, be due to bad or incomplete fragmentation, fragments of multiple metabolites in one spectrum (i.e., “hybrid” spectra), or simply because there were no similar spectra in the training data. Here, we have developed and assessed our Embedding Evaluator model, and we show how it can identify spectra for which MS2DeepScore cannot predict reliable chemical similarities. We can improve prediction reliability by filtering out spectra with a high predicted MSE. Figure 3c shows the effect of removing the spectra with the highest predicted MSE. Additional analysis in Supplementary Information section 4 shows that the predicted MSE correlates with features like number of fragments, precursor m/z, ionization mode, and signal intensities.
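The filtering step described above can be illustrated with a minimal sketch. It assumes that a predicted MSE is already available for every spectrum; the function name `keep_most_reliable`, the `fraction_removed` parameter, and the toy values are hypothetical and are not part of the MS2DeepScore package.

```python
def keep_most_reliable(predicted_mse, fraction_removed):
    """Return the indices of spectra kept after dropping the least reliable ones.

    predicted_mse: one predicted MSE per spectrum (hypothetical evaluator output).
    fraction_removed: share of spectra to discard, between 0 and 1.
    """
    if not 0.0 <= fraction_removed <= 1.0:
        raise ValueError("fraction_removed must be between 0 and 1")
    n_remove = int(len(predicted_mse) * fraction_removed)
    # Sort spectrum indices from lowest to highest predicted MSE.
    order = sorted(range(len(predicted_mse)), key=lambda i: predicted_mse[i])
    # Drop the n_remove spectra with the highest predicted MSE.
    kept = order[: len(order) - n_remove]
    return sorted(kept)


# Toy example: five spectra, remove the 40% with the highest predicted MSE.
predicted = [0.02, 0.30, 0.05, 0.01, 0.12]
kept = keep_most_reliable(predicted, 0.4)
print(kept)
```

In practice, the fraction to remove is a trade-off between prediction reliability and the number of spectra retained for analysis.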
An embedding evaluator is implemented to predict if MS2DeepScore can make reliable chemical similarity predictions for an input spectrum. This embedding evaluator is trained by using the MSE for a spectrum as a proxy. a) Architecture for predicting the embedding quality for a spectrum. b) Predicted MSE against the true MSE for spectra in the test set. c) Removing the spectra with the highest predicted MSE reduces the MSE mostly in the high and low bins. The test spectra with the highest MSE predicted by our Embedding Evaluator model were removed. Predictions were made between the test spectra. First the MSE is calculated by taking the average MSE between all spectra of two molecules, followed by taking the average per Tanimoto bin.
The input layer comprises scan metadata and fragment data after binning the m/z axis and applying square root transformation to the signal intensities. Numerical data, e.g., precursor m/z or collision energy, is transformed to values closer to 1, so that all inputs are of a similar order of magnitude, which helps training. Textual inputs, like ionization mode or instrument type, are one-hot encoded. A single dense layer converts the input to a numerical vector (embedding) of length 500. The model is trained to create embeddings for which the cosine similarity between two embeddings correlates well with chemical similarity (Tanimoto score).
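As a minimal illustration of the last step, the cosine similarity between two embeddings can be computed as sketched below. The 4-dimensional vectors are toy values chosen for readability (the model described here produces embeddings of length 500), and returning 0 for an all-zero embedding is an assumption of this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero embedding is treated as dissimilar
    return dot / (norm_a * norm_b)


# Toy embeddings (hypothetical values, not model output).
emb_pos = [0.2, -0.5, 0.1, 0.7]
emb_neg = [0.4, -1.0, 0.2, 1.4]    # same direction as emb_pos, different scale
emb_other = [0.5, 0.2, 0.0, 0.0]   # orthogonal to emb_pos

print(cosine_similarity(emb_pos, emb_neg))
print(cosine_similarity(emb_pos, emb_other))
```

Because only the direction of the embedding matters, two spectra with no overlapping fragments can still receive a high predicted similarity if the network maps them to similar directions in embedding space.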
Additional metadata input
MS2DeepScore 2.0 allows for adding metadata as input to the neural net. Experiments with using additional metadata showed that adding precursor m/z, ionization mode, and adduct type as input for the model notably improved the performance, while one-hot encoding of the instrument type did not (see Supplementary Figure 1.5). Considering that the adduct type remains unknown or is annotated with lower accuracy in many common mass spectrometry workflows, the model used in the main text relies only on precursor m/z and ionization mode as metadata.
Comparison with single ionization mode models
Supplementary Figure 3.3 compares models trained on a single ionization mode to the MS2DeepScore 2.0 model trained on both ionization modes. As expected, the single ionization mode models perform poorly at predicting chemical similarity between spectra of the other ionization mode (see Supplementary Figures 3.1 and 3.2). The model trained on only positive ionization mode spectra performs comparably to the dual-ionization mode model at predicting similarity between positive ionization mode spectra. However, the model trained on only negative ionization mode spectra performs differently from the dual-ionization mode model at predicting similarity between negative ionization mode spectra: for lower Tanimoto bins, the dual-ionization mode model has a higher loss than the single ionization mode model, but in the 0.9-1.0 bin it has a lower loss.
Interestingly, MS2DeepScore models trained on only one of the ionization modes show a better than random prediction performance when predicting mass spectral similarity of spectra obtained in the other ionization mode (e.g., a positive ionization mode model predicting between pairs of negative ionization mode spectra, see Supplementary Figure 3.1a). This suggests that MS2DeepScore is able to detect patterns that generalize between the two ionization modes.
Sampling algorithm
The sampling algorithm is optimized to result in balanced sampling over Tanimoto scores and equal sampling of the different molecules. With this newly developed sampling algorithm we achieve a good balance of molecule sampling frequencies with a maximum of < 15% difference between molecules (see Supplementary Figure 2.1d) and an exactly equal sampling frequency over the whole Tanimoto score range grouped in 10 bins. More details about the sampling algorithm optimization can be found in the Supplementary Information section 2.
Speed of training MS2DeepScore
To achieve fast training and prediction times, the entire model was implemented using PyTorch26. The MS2DeepScore 2.0 model was trained in 11.2 hours on a server with an Intel Xeon Gold 6342 2.8 GHz CPU, an Nvidia A40 GPU, and 512 GB of memory.
Discussion
MS2DeepScore is able to make reliable predictions between mass spectra measured under different conditions, even if hardly any of the fragments overlap. We show that an MS2DeepScore model trained on both ionization modes can predict good estimates of the chemical similarity between spectra measured in different ionization modes (Figure 1c). In addition, when doing predictions between the same ionization modes, the dual-ionization mode MS2DeepScore model performs similarly to a MS2DeepScore model that was only trained on one ionization mode (Supplementary Figure 3.1 and 3.2). Therefore, the dual-ionization mode model can be used without compromising on model performance for same-ionization mode comparisons.
In this work, we also made substantial updates to the MS2DeepScore pipeline. For example, the pair sampling algorithm to train MS2DeepScore was optimized. Sampling pairs during training is a crucial step in training MS2DeepScore, since low Tanimoto scores are orders of magnitude more frequent than high Tanimoto scores. Here we introduce a new sampling algorithm that not only balances the sampled pairs over equally spaced Tanimoto bins, but also balances the sampling frequency of each molecule and even the distribution of the Tanimoto scores per unique molecule. This new pair sampling algorithm reduces potential biases in the training data and makes sure the diversity in the training data is used well. Whilst these changes were a substantial improvement in making good use of the chemical diversity in the training set, we do note that our new sampling algorithm did not enforce balanced sampling of the different ionization mode pairs. Because more positive ionization mode spectra were available, this resulted in sampling more positive ionization mode pairs compared to negative ionization mode pairs. Currently, the dual ionization mode model has a higher average MSE on negative vs. negative ionization mode pairs than a model trained only on negative ionization mode spectra, but this is not the case for the positive vs. positive ionization mode MSE. Having more balanced sampling over the ionization modes might improve model performance for negative vs. negative ionization mode predictions. In addition, since there are differences in the Tanimoto score distributions for positive and negative ionization mode spectra, this might result in not having an equal number of pairs per Tanimoto bin for each ionization mode. For example, there were not many high Tanimoto score examples between positive and negative ionization mode spectra during training, which likely explains the observation that almost no predictions above 0.9 were made for cross-ionization mode pairs.
In future work, the sampling algorithm could be further optimized to also enforce balanced pair sampling for the different ionization mode pairs.
In addition, allowing metadata as input to the model improves performance, see Supplementary Figure 1.5. The model used in the main text uses precursor m/z and ionization mode as metadata input. Using the adduct as input was also beneficial for model performance. There are methods to predict adduct information from MS1 (full) scans27; however, not all preprocessing tools generate reliable adduct information. Therefore, we decided to not include this in the default model, since using incorrect adduct information could result in reduced performance. Currently, the model does not use MS1 data directly. In future work it would be interesting to include MS1 data into training of MS2DeepScore models. However, a challenge is that public mass spectral libraries generally do not have annotated raw MS1 spectra available. Directly including MS1 data, or alternatively predicted adducts or molecular formulas as features into the model could potentially further improve MS2DeepScore performance.
MS2DeepScore models are trained to predict chemical similarity scores, using the Tanimoto coefficient between Daylight fingerprints as the primary metric. Widely regarded as effective for fingerprint-based comparisons, the Tanimoto score has become a standard in cheminformatics applications28-30. However, molecular similarity is inherently subjective, varying by context and application31, 32. Even when we restrict ourselves to fingerprint-based metrics, many possible variants with different strengths and weaknesses exist33, 34. Future work could explore alternative similarity metrics and fingerprints, leveraging the flexible architecture of MS2DeepScore to expand its applicability across diverse tasks.
The cross-ionization mode MS2DeepScore model uses a Siamese neural network architecture similar to the original MS2DeepScore paper6. Hyperparameter optimization resulted in using a dense network of a different size. For benchmarking, we picked the best-performing model. Smaller models are possible with only slight performance reductions. For details see Supplementary Information section 1. If computing time or embedding size is crucial for an application, an MS2DeepScore model with a smaller model architecture or embedding size can easily be trained.
Fragment signals are initially binned, which might result in the loss of useful information. In Supplementary Figure 1.6 we show that using bins smaller than 0.1 Da did not result in improved performance. However, exploring alternative methods that do not require binning could potentially further improve model performance. For instance, the recent DreaMS model uses a transformer architecture, which does not require binning of spectral data35. By pretraining a DreaMS model on unsupervised data followed by transfer learning to create a spectral similarity prediction model, it is possible to train a model that works without binning. So far, the DreaMS model has only been trained on positive ionization spectra, making it unsuitable for training a cross-ionization mode model. In future work, it would be valuable to attempt pretraining models on both ionization modes followed by transfer learning to create a chemical similarity predictor. For future work exploring new deep learning algorithms, we recommend building on the groundwork done in MS2DeepScore 2.0, by reusing the newly implemented sampling algorithm, automated training pipeline, and benchmarking methods, to make the results reproducible and comparable.
The prediction quality of the MS2DeepScore model is sensitive to the quality and type of the input spectra. Poor predictions are expected for low-quality spectra with limited fragmentation, chimeric mass spectra from multiple precursor ions, or spectra with little similarity to our training data. In the original MS2DeepScore paper6, the uncertainty was estimated using a Monte-Carlo dropout regularization36. In later real-world applications, however, we noted that this was a subpar solution. For example, we noticed that spectra with little similarity to the training data as well as low-quality spectra often received very similar embeddings. This is very detrimental, because similar embeddings will lead to (mostly false) predictions of high chemical similarities. Hence, MS2DeepScore 2.0 is now complemented by an uncertainty estimation for individual input spectra. This model predicts the MSE from a spectral embedding created by MS2DeepScore. We demonstrated that the overall accuracy can be raised by removing the test spectra with a high predicted uncertainty, see Figure 3. We anticipate that other tools can use this new method for uncertainty and spectral quality estimation.
Given the enormous range of possible mass spectral datasets and applications, there remains a risk of our model not being well-suited for very specific tasks or chemical classes. In such cases, we recommend training a custom MS2DeepScore model. Training a new MS2DeepScore model is now relatively easy since an automatic training pipeline is available. For smaller custom datasets, we recommend merging them with larger available datasets, such as the here-used public libraries, before training a new model from scratch. We speculate that a promising alternative route could be to start with our pre-trained model and run additional training on the custom reference data, a common “fine tuning” strategy in deep learning.
MS2DeepScore is available through PyPI via pip, is actively maintained, and adheres to best practices in software development. Most code is covered by unit tests and supported by a continuous integration (CI) pipeline to ensure reliability and robustness. To make MS2DeepScore accessible to a wider audience, including users unfamiliar with Python, MS2DeepScore is now also available in mzmine21. Within mzmine, MS2DeepScore-based molecular networking can be combined with feature detection and compound annotation, and interactively linked to statistical analysis. We anticipate this will make MS2DeepScore and cross-ionization mode predictions available to a wide audience of chemists without programming experience.
The ability to reliably predict chemical similarities across ionization modes creates entirely new options for mass spectral data exploration by combining positive and negative ionization mode data. Similarity-based graphs can now be generated independent of the ionization mode, rendering cross-ionization mode molecular networking feasible. Furthermore, our new model can be used as a basis to use the larger positive ionization mode reference spectral library as a source for annotation of the negative ionization mode data, and vice versa. We expect that this will help researchers to more quickly and comprehensively identify new molecules and to make new chemical and biological discoveries.
Methods
Metadata as input
MS2DeepScore 1.0 uses mass fragments as an input to predict chemical similarity between mass spectra6. In the current work, MS2DeepScore 2.0 allows for the use of additional metadata of the fragmentation spectra. This is implemented in a flexible way which allows adding any type of metadata as an input into the model. Numerical data, e.g., precursor m/z or collision energy, is transformed to values closer to 1, so that all inputs are of a similar order of magnitude, which helps training. Textual inputs, like ionization mode or instrument type, are one-hot encoded. For the dual-ionization mode model used in the main text, precursor m/z and ionization mode were used as additional metadata input. As a part of the current study, other experiments were also run with instrument type and/or adduct type as additional input(s). An overview of the selected model architecture can be found in Figure 1.
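The encoding of the two metadata fields used in the main-text model could look roughly like the sketch below. The exact numerical transformation is not specified here, so dividing the precursor m/z by a constant (`mz_scale=1000.0`) is purely an assumption for illustration, as are the function name and the order of the output values.

```python
def encode_metadata(precursor_mz, ionization_mode, mz_scale=1000.0):
    """Encode precursor m/z and ionization mode as a flat numeric vector.

    mz_scale is a hypothetical scaling constant; the text only states that
    numerical values are transformed to values closer to 1.
    """
    modes = ("positive", "negative")
    mode = ionization_mode.lower()
    if mode not in modes:
        raise ValueError(f"unknown ionization mode: {ionization_mode!r}")
    # One-hot encoding: one slot per known ionization mode.
    one_hot = [1.0 if mode == known else 0.0 for known in modes]
    return [precursor_mz / mz_scale] + one_hot


# Example: a spectrum with precursor m/z 409.3 recorded in negative mode.
features = encode_metadata(409.3, "negative")
print(features)
```

Such a vector would then be concatenated with the binned fragment intensities to form the model input.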
Tanimoto score
As a metric for chemical similarity between two molecules, the Tanimoto score between molecular fingerprints was used37. An RDKit38 Daylight fingerprint (4096 bits) was generated for each unique 2D structure. This score was used for training and benchmarking and is referred to simply as the Tanimoto score.
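For reference, the Tanimoto score between two bit fingerprints is the number of shared on-bits divided by the number of bits set in either fingerprint. The sketch below shows this on toy fingerprints represented as sets of on-bit positions; it does not call RDKit, the bit positions are made up, and returning 0 for two empty fingerprints is an assumption.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto score between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints count as dissimilar
    return len(a & b) / union


# Toy fingerprints within a 4096-bit space (hypothetical bit positions).
fp_a = {3, 17, 256, 1024, 4000}
fp_b = {3, 17, 300, 1024}

print(tanimoto(fp_a, fp_b))  # 3 shared bits out of 6 set in total
```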
Spectrum pair selection for training
One of the key challenges in training a model to predict Tanimoto scores is the highly non-uniform distribution of these scores across possible molecule pairs. Low Tanimoto scores are several orders of magnitude more frequent than high Tanimoto scores. In our previous work6, this imbalance was partly mitigated by a data generator that selected a molecule pair belonging to a random Tanimoto score bin for each pair selection step. However, molecules in the used dataset often lacked partners in the high Tanimoto score ranges. As a result, even though the former data generator substantially reduced the bias, there was still a considerable shift towards lower Tanimoto scores. In addition, the selection of the second molecule in a pair was not equally distributed, leading to high variability in sampling frequency per unique molecule.
To address these issues, we developed a new pair sampling algorithm optimized for balanced sampling across Tanimoto score bins and near uniform sampling frequencies for each unique molecule. During training, the pair sampling algorithm loops over selected molecule pairs and randomly selects two corresponding spectra per molecule pair, because often multiple spectra are available for a single molecule. In this work, two molecules were considered the same if the first 14 characters of their InChIKeys were equal, thereby ignoring stereochemistry. The pair sampling algorithm loops multiple times over the selected set of molecule pairs, but the corresponding spectra are randomly resampled every loop.
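The grouping of spectra by molecule and the random selection of one spectrum per molecule can be sketched as follows. The InChIKeys below are made-up placeholders that only mimic the 14-10-1 block layout, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def group_by_molecule(spectra):
    """Group spectrum ids by the first 14 characters of their InChIKey."""
    groups = defaultdict(list)
    for spectrum_id, inchikey in spectra:
        groups[inchikey[:14]].append(spectrum_id)
    return dict(groups)


def sample_spectrum_pair(molecule_pair, groups, rng):
    """Randomly pick one spectrum for each molecule in a molecule pair."""
    mol_a, mol_b = molecule_pair
    return rng.choice(groups[mol_a]), rng.choice(groups[mol_b])


# Placeholder InChIKeys (not real compounds).
MOL_A = "A" * 14
MOL_C = "C" * 14
spectra = [
    ("s1", MOL_A + "-UHFFFAOYSA-N"),
    ("s2", MOL_A + "-BBBBBBBBSA-N"),  # stereo variant, same first block
    ("s3", MOL_C + "-UHFFFAOYSA-N"),
]

groups = group_by_molecule(spectra)
rng = random.Random(42)
pair = sample_spectrum_pair((MOL_A, MOL_C), groups, rng)
print(groups, pair)
```

Calling `sample_spectrum_pair` again in a later loop over the same molecule pair may return a different spectrum for a molecule with several spectra, which corresponds to the resampling described above.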
Before training the model, a balanced set of molecule pairs was selected. The molecule pair sampling happened per Tanimoto bin, but during the sampling, the molecule sampling count was tracked. The sampling algorithm started by selecting the least frequently sampled molecule with pairs available in the Tanimoto bin. From the candidate pairs for this molecule, the second molecule with the lowest sampling count was chosen. Resampling of molecule pairs was allowed, enabling both a close-to-equal sampling frequency across unique molecules and a balanced distribution over Tanimoto bins. To minimize resampling, the algorithm prioritized least-sampled available pairs before selecting the least sampled second molecule. This sampling algorithm significantly improved sampling balance.
However, some molecules were still sampled up to six times more than others. To further reduce this imbalance, a maximum sampling count per molecule was introduced, limiting sampling frequency disparities to less than 15%. Details of the experiments conducted to optimize the sampling algorithm are provided in Supplementary Information section 2.
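A simplified sketch of the greedy selection within a single Tanimoto bin is given below. It is an illustration under stated assumptions rather than the MS2DeepScore implementation: ties are broken alphabetically, the maximum sampling count per molecule is omitted, and the function and variable names are hypothetical.

```python
from collections import defaultdict


def sample_pairs_for_bin(candidate_pairs, n_pairs, molecule_counts):
    """Greedily pick n_pairs molecule pairs from one Tanimoto bin.

    candidate_pairs: (molecule_a, molecule_b) pairs whose score falls in the bin.
    molecule_counts: defaultdict(int) of sampling counts shared across all bins.
    """
    partners = defaultdict(set)
    for a, b in candidate_pairs:
        partners[a].add(b)
        partners[b].add(a)
    if not partners:
        return []
    pair_counts = defaultdict(int)
    selected = []
    for _ in range(n_pairs):
        # 1. Least frequently sampled molecule that has pairs in this bin.
        first = min(partners, key=lambda m: (molecule_counts[m], m))
        # 2. Prefer the least-sampled pair, then the least-sampled partner.
        second = min(
            partners[first],
            key=lambda m: (pair_counts[frozenset((first, m))], molecule_counts[m], m),
        )
        selected.append((first, second))
        pair_counts[frozenset((first, second))] += 1
        molecule_counts[first] += 1
        if second != first:
            molecule_counts[second] += 1
    return selected


# Toy bin with four molecules and four candidate pairs.
counts = defaultdict(int)
bin_pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
selected = sample_pairs_for_bin(bin_pairs, 4, counts)
print(selected, dict(counts))
```

Passing the same `molecule_counts` object to the calls for every Tanimoto bin is what lets the sampling frequency of each molecule stay balanced across bins.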
Binning spectra
Before training, the fragments were binned to make them suitable as input for the neural network. Bins of 0.1 Da were used over the range 10 ≤ m/z < 1000, resulting in 9,900 bins. In the former MS2DeepScore work, bins were only included if they had at least one fragment in the training data. Instead, MS2DeepScore 2.0 uses all bins, even if none of the training spectra have a fragment in this bin. This reduces code complexity and reduces the risk of accidental mismatch between the binning method and model versions. Intensity values were square-root transformed to reduce the impact of high intensity signals.
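The binning step can be sketched as follows. How several peaks falling into the same bin are combined is not stated here, so keeping the maximum is an assumption of this sketch, as are the function name and the rounding used to guard against floating-point error at bin edges.

```python
import math

MIN_MZ = 10.0
MAX_MZ = 1000.0
BIN_WIDTH = 0.1
N_BINS = round((MAX_MZ - MIN_MZ) / BIN_WIDTH)  # 9,900 bins


def bin_spectrum(peaks):
    """Convert (m/z, intensity) peaks into a fixed-length intensity vector."""
    vector = [0.0] * N_BINS
    for mz, intensity in peaks:
        if not MIN_MZ <= mz < MAX_MZ:
            continue  # peaks outside the binned range are ignored
        # Rounding before flooring avoids off-by-one bins from float error.
        index = math.floor(round((mz - MIN_MZ) / BIN_WIDTH, 6))
        index = min(index, N_BINS - 1)
        # Square-root transform; assumption: keep the highest peak per bin.
        vector[index] = max(vector[index], math.sqrt(intensity))
    return vector


peaks = [
    (5.0, 1.0),      # below the range, dropped
    (10.0, 1.0),     # first bin
    (55.05, 0.04),   # shares a bin with the next peak
    (55.08, 0.25),
    (999.95, 0.25),  # last bin
    (1000.0, 1.0),   # upper bound is exclusive, dropped
]
vector = bin_spectrum(peaks)
print(len(vector), sum(1 for value in vector if value > 0))
```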
Architecture improvements
MS2DeepScore 1.0 is implemented in TensorFlow6, 39. Here, the entire MS2DeepScore 2.0 model was reimplemented using PyTorch26. This improved compatibility with GPUs and Apple M1 chips, but also overall code readability. Combined with an entirely new implementation of the DataGenerators, this resulted in a substantial speed-up in the training of models.
A pipeline is now available that performs all steps necessary for training new MS2DeepScore models. The wrapper function only requires a file with annotated mass spectra and the settings for model training. First, mass spectra are separated by ionization mode and split into training, validation, and test sets. After that, the model is trained and benchmarking figures are created.
Model settings
The original MS2DeepScore paper used two layers of 500 nodes with an embedding size of 200. Given the expanded training library and the dual-ionization mode training, it was expected that a different model architecture could result in better performance. Hyperparameter optimization was performed to determine an optimal configuration, as detailed in Supplementary Information section 1. The final architecture consisted of a single layer with 10,000 nodes and an embedding size of 500, which was used for all models presented in the main text.
Compared to the former MS2DeepScore models, several other adjustments were made. The final layer activation function was changed from ReLU to Tanh40, dropout and batch normalization were removed, and the settings for data augmentation were changed: augment removal max was changed from 0.3 to 0.2, augment intensity was changed from 0.4 to 0.2, and augment noise intensity was changed from 0.01 to 0.02. The exact settings were added as a JSON file to the Zenodo entry (see the Data Availability section).
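To give a sense of the size of the final architecture, the number of trainable parameters can be estimated with a short calculation. This is a rough sketch under stated assumptions: the layers are taken to be fully connected with bias terms, and the input is taken to be only the 9,900 binned intensities. Any additional inputs the real model may use are ignored, so the actual count can differ.

```python
def dense_parameter_count(layer_sizes):
    """Weights plus biases of consecutive fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Assumed layout: 9,900 input bins -> 10,000 hidden nodes -> 500-dimensional embedding.
layer_sizes = [9_900, 10_000, 500]
total_parameters = dense_parameter_count(layer_sizes)
print(f"{total_parameters:,} trainable parameters")
```

Under these assumptions the first layer accounts for about 99 million of the roughly 104 million parameters, which shows that the cost of the model is dominated by the wide hidden layer.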
Input data filtering and splitting
For training the models, we combined multiple public libraries: the GNPS library4, the MassBank EU library, the MassBank of North America (MoNA) library41, and the MS2 spectra of MSnLib created by Brungs et al.42. After combining these libraries, they were first cleaned using the matchms library cleaning pipeline43, 44. The settings for cleaning can be found in Supplementary Settings 1. Experiments that assessed the model performance for different minimum signal numbers and intensity thresholds can be found in Supplementary Figure 1.1. After cleaning, the library consisted of 36,638 unique molecules and 519,580 spectra in positive ionization mode and 18,480 unique molecules and 145,594 spectra in negative ionization mode. Molecules were considered identical if the first block of their InChIKey identifiers (14 letters) was equal, thereby ignoring stereochemistry.
The cleaned spectrum library was split by ionization mode and divided into training, validation, and test sets. We selected 1/20th of the unique molecules for each of the validation and test sets; all spectra corresponding to these molecules were removed from the training set. For the positive and negative mode sets, different InChIKeys were selected for the validation and test sets, since we might otherwise have introduced a bias in the validation set for spectra that were available in both ionization modes. The dual-ionization mode model was trained on the combined positive and negative ionization mode training spectra. The validation spectra were used for all experiments for the optimization of our model, like changing the filtering of input spectra, or adjustments to the model size. The test set was not used during any experimentation or hyperparameter optimization and was only used for benchmarking of the final model.
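A molecule-level split of this kind can be sketched as below. The sketch is illustrative only: the function name, the dictionary-based spectrum representation, and the use of a random seed per ionization mode are assumptions and not the package's interface.

```python
import random


def split_by_molecule(spectra, k=20, seed=0):
    """Split spectra so that each molecule ends up in exactly one set.

    spectra: list of dicts with an "inchikey" entry. Molecules are identified
    by the first 14 letters of the InChIKey, which ignores stereochemistry.
    """
    molecules = sorted({s["inchikey"][:14] for s in spectra})
    random.Random(seed).shuffle(molecules)
    n_holdout = len(molecules) // k  # 1/20th of the unique molecules per set
    validation_ids = set(molecules[:n_holdout])
    test_ids = set(molecules[n_holdout:2 * n_holdout])
    train, validation, test = [], [], []
    for spectrum in spectra:
        molecule = spectrum["inchikey"][:14]
        if molecule in validation_ids:
            validation.append(spectrum)
        elif molecule in test_ids:
            test.append(spectrum)
        else:
            train.append(spectrum)
    return train, validation, test


# Toy library: 40 molecules with two spectra each (made-up InChIKeys).
library = [{"inchikey": f"{'A' * 12}{chr(65 + i // 26)}{chr(65 + i % 26)}-UHFFFAOYSA-N"}
           for i in range(40) for _ in range(2)]
positive_split = split_by_molecule(library, seed=1)
negative_split = split_by_molecule(library, seed=2)  # different selection per mode
print([len(part) for part in positive_split])
```

Splitting on molecules rather than on spectra ensures that no molecule contributes spectra to more than one of the three sets.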
Embedding and score uncertainty estimation
The prediction quality of the MS2DeepScore model is sensitive to the quality of input spectra and their similarity to the training data. To detect spectra that are hard to predict for MS2DeepScore, we designed a new pipeline using a convolutional neural network that predicts the quality of a spectrum embedding. We designed the “Embedding Evaluator” model by implementing an InceptionTime architecture45 using PyTorch26, and trained it on the mean squared error (MSE) of all Tanimoto score predictions between the embedding in question and 999 randomly sampled other spectra from the training data. The underlying idea is that the Embedding Evaluator will learn to identify embeddings of low-quality or out-of-distribution input data. In later applications, the predicted embedding qualities can be used for uncertainty estimation.
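The training target of the Embedding Evaluator can be illustrated with a small sketch. The function and its arguments are hypothetical: `predict` stands for any function returning a predicted similarity for two spectra and `tanimoto` for one returning the true structural similarity. Neither is part of the published interface.

```python
import random


def embedding_quality_target(query, training_spectra, predict, tanimoto,
                             n_samples=999, seed=0):
    """MSE of the predictions between one spectrum and randomly sampled others."""
    others = [s for s in training_spectra if s is not query]
    sampled = random.Random(seed).sample(others, min(n_samples, len(others)))
    errors = [(predict(query, other) - tanimoto(query, other)) ** 2
              for other in sampled]
    return sum(errors) / len(errors)


# Toy usage with placeholder spectra and constant scoring functions.
toy_spectra = [f"spectrum_{i}" for i in range(1500)]
target = embedding_quality_target(toy_spectra[0], toy_spectra,
                                  predict=lambda a, b: 0.5,
                                  tanimoto=lambda a, b: 0.0)
print(target)
```

A high target value marks a spectrum whose predictions are unreliable on average, which is the signal the evaluator is trained to recover from the embedding alone.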
Benchmarking
The mean squared error (MSE) was used as a loss function, measuring the difference between the predicted and actual Tanimoto score. During training, a sampling algorithm ensured equal numbers of spectrum pairs in each Tanimoto bin and a balanced representation of molecule pairs. However, due to the lower number of available validation spectra, it is not suitable to use the same sampling algorithm for the validation spectra, since this would significantly reduce the number of pairs available for benchmarking. To obtain a representative MSE for the validation and test set, the average loss per molecule pair was calculated by averaging the losses of all available spectrum pairs for each molecule pair. The mass spectral library used often contains multiple mass spectra for one molecule, in some cases up to several hundred spectra for the same molecule. By taking the average loss per molecule pair, we ensured that the model performance is not judged mostly on the performance of a few molecules with a high number of mass spectra. The average MSE per molecule pair was then used to calculate the average MSE per Tanimoto bin. Ten equally spaced Tanimoto bins between 0 and 1 were used. The final loss used was the average MSE over these 10 bins. In addition to this benchmarking, we analyzed the performance of MS2DeepScore for different adducts and different compound classes; these results can be found in Supplementary Information section 5. A comparison of MS2DeepScore to the modified cosine score is available in Supplementary Information section 6.
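The two-stage averaging can be sketched as follows. The function name and the record layout are hypothetical, and averaging over non-empty bins only is an assumption for the case where a bin contains no molecule pairs.

```python
from collections import defaultdict


def binned_mse(records, n_bins=10):
    """Average MSE over Tanimoto bins, after averaging per molecule pair.

    records: iterable of (molecule_a, molecule_b, true_tanimoto, predicted_score),
    one record per spectrum pair.
    """
    losses_per_pair = defaultdict(list)
    true_score = {}
    for mol_a, mol_b, true, predicted in records:
        pair = frozenset((mol_a, mol_b))
        losses_per_pair[pair].append((predicted - true) ** 2)
        true_score[pair] = true
    # Stage 1: one average loss per molecule pair, however many spectra it has.
    losses_per_bin = defaultdict(list)
    for pair, losses in losses_per_pair.items():
        bin_index = min(int(true_score[pair] * n_bins), n_bins - 1)
        losses_per_bin[bin_index].append(sum(losses) / len(losses))
    # Stage 2: one average per Tanimoto bin, then the mean over the bins.
    bin_means = [sum(values) / len(values) for values in losses_per_bin.values()]
    return sum(bin_means) / len(bin_means)


example_records = [
    ("A", "B", 0.05, 0.05), ("A", "B", 0.05, 0.25),  # two spectrum pairs, one molecule pair
    ("A", "D", 0.07, 0.07),
    ("A", "C", 0.95, 0.95),
]
print(binned_mse(example_records))
```

Averaging per molecule pair first means that a molecule with hundreds of spectra carries the same weight as one with a single spectrum, and averaging per bin afterwards keeps the abundant low-similarity pairs from dominating the final loss.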
Case studies
To illustrate the new possibilities MS2DeepScore 2.0 introduces, we show the capabilities of the new model to create dual-ion-mode molecular networks and UMAP22 embedding representations using a case study with an experimental dataset of human urine samples.
The MS2 spectra used for this case study were acquired for urine fractions generated at the National Phenome Centre by reversed-phase liquid chromatography (RP-LC) as detailed in Albreht et al.46. Briefly, 10-fold pre-concentrated human urine was subjected to semi-preparative RP-LC using a scaled-up approach initially developed in Whiley et al.47, collecting 90 RP fractions. Urine fractions were treated like urine samples and profiled using a previously reported RP-LC assay48 on a Waters Acquity UPLC instrument coupled to a Xevo G2-S TOF mass spectrometer (Waters Corp., Manchester, UK) via a Z-spray electrospray ionization (ESI) source. Mobile phases A and B were LC-MS grade water and acetonitrile, respectively, both pre-spiked with 0.1% formic acid. The gradient elution program, detailed in the protocols associated with Lewis et al.49, was run with a mobile phase flow rate of 0.6 mL/min using a Waters 2.1 × 150 mm (1.8 μm) HSS T3 column maintained at 45 °C during LC separation. The mass spectrometry parameters can be found in Supplementary Information 7.1.
MassLynx software (Waters, Manchester, U.K.) was used for data acquisition and visual inspection. The raw data files were converted from the Waters .RAW format to .mzML format using the msconvert tool from the ProteoWizard toolkit50. DDA files converted to .mzML format were peak picked and converted to .mgf format using MSDIAL ver.4.9.221218 Windowsx6434 using the following parameters for peak detection: min peak height = 1000 amplitude, mass slice width = 0.05 Da; MS2Dec: sigma window value = 0.5, MS2 abundance cutoff = 200 amplitude; Alignment: RT tolerance = 0.05 min, MS1 tolerance = 0.01 Da.
The peak-picked spectra were further processed by matchms43, 44; only MS2 spectra with more than four fragments were kept. The exact processing settings and logging can be found in the Jupyter notebook on GitHub. Putative annotations and analogue predictions were done using MS2Query3. Annotations with a prediction higher than 0.7 are included in the interactive UMAP embedding visualization.
Molecular networking
The case study data was used to create a dual-ionization mode molecular network with MS2 spectra as nodes and MS2DeepScore similarities as edges. A graphml file was created using matchms. The minimum MS2DeepScore cut-off was 0.85 and “top-n” was set to 20, meaning that only the 20 highest similarity scores per spectrum were considered for creating edges. The link method was mutual, which means an edge was only added if it was in the top list of both nodes. For each node, the 10 highest scores with a mutual link in the top 20 of both nodes were used for creating edges. These settings could still result in more than 10 edges connecting to a single node if an edge from another node was among that node’s 10 highest similarity scores with a mutual connection.
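The edge selection rule can be sketched in plain Python. This is not the matchms implementation: the function name and the nested-dictionary score layout are assumptions, and treating the cut-off as inclusive is an interpretation of "minimum cut-off".

```python
def mutual_top_n_edges(scores, cutoff=0.85, top_n=20, max_links=10):
    """Select network edges with a mutual top-n rule.

    scores: dict mapping each node to a dict of {neighbor: similarity},
    assumed to be symmetric.
    """
    # Ranked top-n neighbors per node (highest score first, ties by name).
    top = {node: sorted(neighbors, key=lambda n: (-neighbors[n], n))[:top_n]
           for node, neighbors in scores.items()}
    edges = set()
    for node, ranked in top.items():
        added = 0
        for neighbor in ranked:
            if added == max_links or scores[node][neighbor] < cutoff:
                break  # link budget used up, or remaining scores are too low
            if node in top.get(neighbor, ()):  # mutual: each is in the other's top-n
                edges.add(frozenset((node, neighbor)))
                added += 1
    return edges


toy_scores = {
    "a": {"b": 0.95, "c": 0.90, "d": 0.5},
    "b": {"a": 0.95, "c": 0.86, "d": 0.4},
    "c": {"a": 0.90, "b": 0.86, "d": 0.3},
    "d": {"a": 0.5, "b": 0.4, "c": 0.3},
}
# With one link per node, "a" still ends up with two edges: a-b from its own
# list and a-c contributed by node "c".
print(mutual_top_n_edges(toy_scores, top_n=2, max_links=1))
```

The toy example shows why a node can exceed the per-node link limit: the limit applies to the edges a node contributes, not to the edges it receives from others.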
The graphml file was used for visualizing the molecular network in Cytoscape51 (see Figure 3a). To highlight the capabilities of MS2DeepScore across ionization modes, we selected a few clusters that included both positive and negative ionization mode spectra. These clusters were manually annotated by experts. Supplementary Table 7.1 provides the method of annotation and confidence levels.
Embedding UMAP visualization
MS2DeepScore generates spectral embeddings as intermediate output. These embeddings can be used directly to visualize spectra in 2D space by using dimensionality reduction methods. Here we used UMAP to reduce the 500 embedding dimensions to two dimensions. The number of neighbours was set to 50; this setting influences how local or global the 2D representation is. The resulting UMAP representation can be found in Figure 3b and is also available as an interactive plot (see the Data Availability section). The interactive plot can be coloured based on ionization mode or ClassyFire52 compound class annotation of MS2Query3 analogue predictions.
Integration into mzmine
MS2DeepScore is available through PyPI as a pip installable Python package. Even though little programming knowledge is required to apply MS2DeepScore models and clear tutorials are available, this was still a significant hurdle for scientists without programming experience. To offer an easy local deployment, MS2DeepScore has been integrated into mzmine21, a modular MS data processing software. Now, mzmine enables Feature-based Molecular Networking (FBMN53) and Ion identity Molecular Networking (IIMN20) using MS2DeepScore in an interactive network visualizer coupled with compound annotation and statistics dashboards. This allows users to create molecular networks and explore the chemical space within the mzmine graphical user interface, without requiring command line or scripting. A tutorial for using MS2DeepScore within mzmine can be found here: https://mzmine.github.io/mzmine_documentation/module_docs/group_spectral_net/molecular_net_working.html#algorithm-MS2DeepScore.
Integration of MS2DeepScore in mzmine required converting the MS2DeepScore model to the torch script format which is supported by the Deep Java Library (DJL). The package https://github.com/niekdejonge/MS2DeepScore_java_conversion contains scripts for converting existing MS2DeepScore models. The latest torch script version of the MS2DeepScore model is available at https://doi.org/10.5281/zenodo.12628368 and can be automatically downloaded from within mzmine’s molecular networking module. MS2DeepScore is available in mzmine starting from mzmine version 4.3.0.
Code availability
MS2DeepScore is available as a PyPI package and is therefore pip installable. The version used for this manuscript is version 2.5.0. All code is available on https://github.com/matchms/MS2DeepScore. The notebooks used for creating the benchmarking figures can be found in the folder https://github.com/matchms/MS2DeepScore/tree/main/notebooks/MS2DeepScore_2. Details about how each figure can be reproduced can be found in Supplementary Information section 8.
The mzmine source code for the PyTorch model integration via TorchScript format is available on the mzmine GitHub https://github.com/mzmine/mzmine/tree/master/mzmine-community/src/main/java/io/github/mzmine/modules/dataprocessing/group_spectral_networking/MS2DeepScore. The scripts for converting existing MS2DeepScore models to TorchScript can be found on GitHub: https://github.com/niekdejonge/MS2DeepScore_java_conversion.
Data availability
The dual-ionization mode MS2DeepScore model and embedding evaluator model used in this study can be downloaded from Zenodo, https://doi.org/10.5281/zenodo.14290920. The training, validation, and test spectra can be downloaded from https://zenodo.org/records/13934470. All case study data can be found on https://zenodo.org/records/14535374; this includes an interactive version of the full UMAP plot as an HTML file and the required files to create a molecular network in Cytoscape.
Author contributions
NdJ came up with the concept for cross-ionization mode similarity scores and wrote the first version of the manuscript. FH came up with the concept for the EmbeddingEvaluator. NdJ, JJJvdH, FH designed the research and revised the manuscript. NdJ, DJ, LJT, and FH contributed to the code. NdJ and FH designed and evaluated the code for the current version. NdJ and RS implemented MS2DeepScore in mzmine. EC annotated the case study data. Data from the urine samples used in this study was originally generated by the National Phenome Centre, Imperial College London. All authors contributed to the data analysis and interpretation. JJJvdH and FH supervised this work.
Competing interests
JJJvdH is a member of the Scientific Advisory Board of NAICONS Srl., Milano, Italy and consults for Corteva Agriscience, Indianapolis, IN, USA. RS is a co-founder of mzio GmbH, Bremen, Germany. All other authors declare no competing interests.
Acknowledgements
The authors thank Corinna Brungs for sharing prereleases of the latest MSnLib library42 and for her assistance with filtering out MSn and merged MS2 spectra. The authors thank Tomáš Pluskal for hosting NJ during a lab visit, which enabled the collaboration that led to the integration of MS2DeepScore in mzmine. NJ thanks Dick de Ridder for helpful discussions and feedback on the results of MS2DeepScore 2.0. NJ thanks the research lab of Soha Hassoun for valuable feedback on the first preprint. This work was supported by the Medical Research Council and National Institute for Health Research [grant number MC_PC_12025] and the Medical Research Council UK Consortium for MetAbolic Phenotyping (MAP UK) [grant number MR/S010483/1]. Infrastructure support was provided by the National Institute for Health Research (NIHR) Imperial Biomedical Research Centre (BRC).
Footnotes
* These authors jointly supervised this work.
Updated data links to match the most up-to-date models.