Comparison of Cosine, Modified Cosine, and Neutral Loss Based Spectrum Alignment For Discovery of Structurally Related Molecules

Spectrum alignment of tandem mass spectrometry (MS/MS) data using the modified cosine similarity and subsequent visualization as molecular networks have been demonstrated to be a useful strategy to discover analogs of molecules from untargeted MS/MS-based metabolomics experiments. Recently, a neutral loss matching approach has been introduced as an alternative to MS/MS-based molecular networking, with an implied performance advantage in finding analogs that cannot be discovered using existing MS/MS spectrum alignment strategies. To comprehensively evaluate the scoring properties of neutral loss matching, the cosine similarity, and the modified cosine similarity, similarity measures of 955,228 peptide MS/MS spectrum pairs and 10 million small molecule MS/MS spectrum pairs were compared. This comparative analysis revealed that the modified cosine similarity outperformed neutral loss matching and the cosine similarity in all cases. The data further indicated that the performance of MS/MS spectrum alignment depends on the location and type of the modification, as well as the chemical compound class of fragmented molecules.


Introduction
With the growing availability of free, open, and proprietary tandem mass spectrometry (MS/MS) libraries, MS/MS-based spectrum alignment is routinely used for metabolite annotation and organization of mass spectral datasets. [1][2][3][4] For MS/MS spectrum comparison, the cosine similarity is the most widely used approach to match near-identical spectra with each other. 5,6 Adaptations thereof, such as the modified cosine similarity, 1,7 can be used to discover non-identical but related spectra using analog searching and molecular networking of small molecules. 8,9 Similar approaches have also been used in proteomics to perform open modification searching for the unbiased detection of peptides that contain post-translational modifications. [10][11][12][13] Additionally, spurred by the emergence of increasingly powerful computational and machine learning techniques, several novel similarity scores have been proposed. [14][15][16][17][18]
Recently, neutral loss-based spectrum alignment was introduced in the METLIN analysis ecosystem 19 as a strategy to match MS/MS spectra of analog molecules. 20 During neutral loss matching, MS/MS spectra are mirrored at their precursor mass by calculating the distances from each fragment ion peak to the precursor mass, describing the neutral losses. Effectively, the neutral loss approach recalibrates spectra with their precursor masses as the origin, and a mirror database of neutral loss spectra was created in METLIN. 19 The neutral loss spectra can then be used to find related spectra of modified analog molecules. Based on representative examples, Aisporna et al. 20 have postulated that neutral loss matching can be particularly effective at finding MS/MS spectrum pairs that other approaches, such as spectrum alignment using the cosine and modified cosine similarities, could not find. This benefit was explained by the increase in spectrum usage (i.e. more explained intensity captured by matching fragment ions) in the assignment of related matches. Nevertheless, although a few select MS/MS spectra were chosen to compare different spectrum similarity measures, the difference in spectrum matching performance was not systematically quantified. 20 We hypothesized that the modified cosine similarity, which forms the foundation of molecular networking in the Global Natural Products Social Molecular Networking (GNPS) system, 7 should be able to discover all MS/MS spectrum pairs that can be found by neutral loss matching, as it considers both directly matching peaks and matching peaks that are shifted by the pairwise precursor mass difference. We, therefore, set out to systematically compare the performance of neutral loss matching, cosine similarity, and modified cosine similarity on a large collection of spectrum pairs of modified peptides and small molecules. As the data of the original study 20 were not available due to the access restrictions of the METLIN database, 19 we instead used a large collection of 955,228 and 10 million MS/MS spectrum pairs for peptides and small molecules, respectively, which were derived from the MassIVE-KB 21 and GNPS resources. 7

Results
We evaluated neutral loss matching, cosine similarity, and modified cosine similarity in matching MS/MS spectra of non-identical yet structurally related molecules (Figure 1). The cosine similarity is a common strategy to compare MS/MS spectra to each other by matching fragment ions in both spectra with identical m/z values (while accounting for a relevant fragment mass tolerance based on the data acquisition settings). The modified cosine similarity is an extension of the cosine similarity which not only matches fragment ions with identical m/z values, but also considers fragment ions across both spectra with m/z values that are shifted by the precursor mass difference of the spectrum pair under consideration. Finally, neutral loss matching first transforms MS/MS spectra into neutral loss spectra by mirroring the fragment ion m/z values at the precursor mass, after which the neutral loss peaks with identical ∆m/z values are matched to each other.

Figure 1. [...] precursor m/z = 810.487). 22 We evaluate the cosine similarity (left), modified cosine similarity (middle), and neutral loss matching (right, visualized as neutral loss spectra) to compare MS/MS spectra to each other. Directly matching fragments with near-identical m/z or ∆m/z values in the original MS/MS spectra and the neutral loss spectra, respectively, are indicated in blue. Additional matching fragments that are offset by the pairwise precursor m/z difference, considered by the modified cosine similarity, are indicated in red. Unmatched fragments are indicated in black. (c) Illustration of a heat map to compare two alternative spectrum similarity measures, specifically the cosine similarity and neutral loss matching. The heat map shows the bivariate distribution of the corresponding spectrum similarity values for all MS/MS spectrum pairs, with each point corresponding to multiple observations as indicated by the heat map coloring. The density plots (gray) show the marginal distributions of the individual spectrum similarities. Representative examples with different spectrum similarity values are indicated by the colored diamonds, with the corresponding spectrum alignments shown in the colored boxes.

Three datasets with heterogeneous mass spectral properties were used to compare the different spectrum similarity measures. Because large-scale ground truth benchmark datasets of related small molecule spectrum pairs with known structural modifications do not yet exist, the first dataset we used consisted of 955,228 known peptide MS/MS spectrum pairs derived from the MassIVE-KB resource, 21 where peptide pairs differ by a single modification (post-translational modification or amino acid substitution). Second, mass spectrum pairs were derived from 495,600 small molecule reference MS/MS spectra contained in the GNPS community mass spectral libraries. 7 Because over 1.5 billion pairwise comparisons (precursor mass difference between 1 and 200 Da) can be performed using these spectra, a random subset of 10 million spectrum pairs was used for computational efficiency. Third, 340,637 bile acid MS/MS spectrum pairs from the GNPS community spectral libraries were used to evaluate the influence of specific modification properties, such as addition or loss of oxygen, conjugation of the bile acids with amino acids (as amino acid amidates), and substitution of conjugated amino acids attached to the bile acid core.
Evaluation of the different spectrum similarity measures on the peptide data indicated that the modified cosine similarity strictly outperformed both the cosine similarity and neutral loss matching (Figure 2a), while neutral loss matching resulted in a higher score than the cosine similarity in only 8.8% of the peptide pairs. Spectrum usage (in terms of the number of explained fragment ions and total explained intensity) can be used to assess the performance of spectrum similarity measures (Figure 2b). 20 In general, neutral loss matching resulted in very low spectrum usage for peptide mass spectra, with a median explained intensity of 17.1%. In contrast, the median explained intensities for the cosine similarity and modified cosine similarity were 56.4% and 70.7%, respectively. Taking into account the modification position relative to the linear peptide sequences (Figure 2c), neutral loss matching performed better with modifications located closer to the C-terminal end. In this case, a maximal number of y-ions, which are the dominant signals in peptide mass spectra, include the modification and can be matched using the neutral loss strategy. In contrast, the cosine similarity performed best when the modification was located close to, but not directly at, the peptide N-terminus. We hypothesize that in this case the cosine similarity can match a large number of y-ions as well as the initial b-ions, whereas when the modification occurs at the N-terminus, no b-ions contribute to the cosine similarity while higher index y-ions often do not occur and thus do not provide further benefit. Finally, the modified cosine similarity achieved high scores across a range of modification positions (except for N-terminal modifications), which indicates its beneficial performance in capturing both directly matching fragment ions and fragment ions that are shifted due to the mass difference induced by the modification.
Although pairs of peptide MS/MS spectra can serve as ground truth, because in this case all pairwise modifications are known, peptides are only a subset of molecules. Small molecules are chemically more diverse, and many of the MS/MS spectra will be significantly different. Therefore, we also evaluated the different spectrum similarity measures on a random subset of 10 million spectrum pairs derived from the GNPS community spectral libraries (Figure 3). Because these pairs were randomly selected, the majority of pairs exhibit low spectrum similarities as they correspond to molecules that are structurally unrelated. Similar to the previous analysis using peptide data, the modified cosine similarity strictly outperformed the cosine similarity and neutral loss matching (Figure 3a). Neutral loss matching performed considerably better than for peptide data, however, with 32.0% of the spectrum pairs achieving a higher neutral loss match score than cosine score, and 68.0% of the spectrum pairs achieving a higher cosine score than neutral loss score. Because this is a random set of mass spectrum comparisons where most are not expected to match, the spectrum usage is low for all three similarity measures (Figure 3b). Nevertheless, the ranking of explained intensities was the same as for the peptide data: the explained intensity from the modified cosine similarity exceeded that from the cosine similarity, which in turn exceeded that from neutral loss matching. We also evaluated the spectrum similarity measures as a function of the chemical similarity between the molecules, based on the Tanimoto index 23 as a proxy for structural similarity (Figure 3c). This indicated that, although the majority of pairs still exhibited a poor spectrum similarity even for molecules with a high Tanimoto similarity, the modified cosine similarity was able to best reflect structural similarity.
Considering that the majority of the 10 million randomly selected spectrum pairs correspond to MS/MS spectra of unrelated molecules, we filtered for mass spectrum pairs whose molecules exhibit high structural similarity (Tanimoto index above 0.9; 8,956 spectrum pairs). Although the Tanimoto index is an imperfect method to fully assess structural similarity, 24 this filter still enriches for structurally related molecules. This data subset can provide additional insights into the impact of each spectrum similarity measure, especially when comparing the spectrum usage (Figure 4a). This indicates that the modified cosine similarity captured more spectrum usage (as defined by explained intensity) than the cosine similarity and neutral loss matching. The different behavior of peptide spectrum pairs versus metabolite spectrum pairs, especially when comparing cosine similarity to neutral loss matching (Figure 2a, Figure 3a, Figure 4a) suggests that various modifications may impact spectrum similarity differently. This is supported by the observation that spectrum similarities for structurally related molecules differ significantly per chemical compound class (Figure 4b). For example, some compound classes that are characterized by shared or similar molecule backbones, such as purine nucleotides and glycerophospholipids, exhibit very high cosine and modified cosine similarities and low neutral loss scores. In contrast, other compound classes, such as flavonoids, show very low similarities, irrespective of the spectrum similarity measure. Finally, there are interesting compound classes, such as carboxylic acids and indoles, that show lower cosine similarities and neutral loss scores, but higher modified cosine similarities. This indicates that for these classes, the MS/MS spectra contain both directly matching fragment ions and fragment ions that are shifted by modifications or neutral losses, which can only be captured jointly by the modified cosine similarity. 
To evaluate how specific types of modifications impact spectrum similarity, we performed detailed analyses of the bile acid molecular family, which has a number of distinct modifications. Our lab has recently discovered many new bile acids, for which the MS/MS spectra have been made available as an open access resource. [25][26][27][28][29] Additionally, we have collected MS/MS data on a historical library of previously determined bile acids. Although other studies have reported the recent discovery of new bile acids as well, 30-35 unfortunately these could not be included in the current analysis as the corresponding MS/MS data are not publicly available yet. Because our continued study of bile acids has given us a deep understanding of this subset of molecules, we can further analyze these data to understand how specific modifications influence spectrum similarity (Figure 5). In total 846 MS/MS spectra of 369 unique bile acids were included, leading to 340,637 pairs of MS/MS spectra. Although bile acids can undergo many different modifications, 36 here we focused on (i) all bile acid pairs, (ii) bile acid pairs that differ by a single oxygen, (iii) bile acid pairs that differ by a conjugated amino acid, and (iv) bile acid pairs that differ by the substitution of a conjugated amino acid. As in the previous analyses, the modified cosine similarity strictly outperformed neutral loss matching and the cosine similarity (Figure 5a). Interestingly, for this class of molecules neutral loss matching outperformed cosine similarity in 68.6% of cases.
An increasing Tanimoto index between bile acid pairs, reflecting more similarity in the structures, resulted in increased spectrum similarities as well, with the modified cosine similarity best capturing high structural similarity (Figure 5b). However, these trends strongly depend on the type of modification. For a modification consisting of a single oxygen difference, the majority of spectrum pairs achieved high modified cosine similarities (median 0.805), whereas cosine similarities (median 0.426) and neutral loss scores (median 0.224) were considerably lower (Figure 5c). When comparing spectra of non-conjugated bile acids to those of conjugated bile acids, none of the spectrum similarity measures worked well (Figure 5d). This indicates that this type of modification exerts a strong influence on the fragmentation pathways, even though the molecules belong to the same class. Finally, when comparing spectra of conjugated bile acids that have undergone an amino acid substitution, neutral loss matching behaved most similarly to the modified cosine similarity, both showing a bimodal score distribution, and strongly outperforming the cosine similarity (Figure 5e). Nevertheless, for all of the mass spectrum pairs, there was not a single instance in which neutral loss matching outperformed the modified cosine similarity.

Discussion
Here we have evaluated the performance of three related mass spectrum similarity measures (cosine similarity, neutral loss matching, and modified cosine similarity) in capturing similarities between structurally related molecules. Our evaluations indicate that the modified cosine similarity is superior to both alternative similarity measures for both peptide and small molecule MS/MS data. These results are in concordance with the popularity of the modified cosine similarity as one of the most commonly used methods to find MS/MS spectra of related molecules. 7 Our interpretation from the peptide data is that the cosine similarity most closely approximates the modified cosine similarity when the modification is located in the N-terminal region of the peptide, as this will maximally conserve the overall spectral pattern of the dominant y-ions. Furthermore, despite only considering directly matching fragment ions, our results show that the traditional cosine similarity outperforms the recently proposed neutral loss matching strategy 20 in the majority of cases, even for modified small molecules. Despite the overall advantage of the cosine similarity, there are a significant number of small molecule MS/MS spectrum pairs for which neutral loss matching outperformed cosine similarity. In contrast, neutral loss matching was always inferior to the modified cosine similarity. We currently hypothesize that neutral loss matching outperforms the cosine similarity when the modification is on the same side of the molecule as the charge. In this case, all fragment ions are shifted by the modification mass, and the neutral loss spectra become mirror images of the original MS/MS spectra, whereas no fragment ions will match directly using the cosine similarity. Nevertheless, under such a circumstance the modified cosine similarity still captures all shifted fragment ions as well, which is confirmed by the comparisons presented.
In conclusion, both cosine similarity and neutral loss matching can only capture a subset of fragment ions that are matched by the modified cosine similarity.
Although there is not a single test case where our analysis revealed neutral loss matching to outperform the modified cosine similarity, to enable community use of neutral loss matching as a comparative alternative to the cosine similarity and the modified cosine similarity, we have implemented it in several software tools, including matchms, 37 GNPS, 7 and MZmine. 38 For detailed exploration of MS/MS spectrum similarity, user-friendly viewers are also available, for example within MZmine (Figure 6). 38,39 Additionally, to promote open and reproducible science, all data and code to execute the presented analyses are available with an open and unrestricted license.

Figure 6. Viewer in MZmine for user-friendly evaluation of spectrum similarity measures.
The example shows the spectrum usage from the modified cosine similarity (top) and neutral loss matching (bottom) for the MS/MS spectra of the bile acids taurocholic acid (CCMSLIB00005435561) and glycocholic acid (CCMSLIB00005435513). The viewer shows the directly matching fragment ions (top) and neutral loss ions (bottom) in blue, and the matching fragment ions that are offset by the precursor mass difference (top) in orange. The side panel shows the relative contribution of all pairs of matching fragment and neutral loss ions to the overall score.
Besides the three mass spectrum similarity measures that we have evaluated in this work, an increasing number of other spectrum similarity measures for the discovery of MS/MS spectra of related molecules are being proposed, including various MS/MS spectrum preprocessing steps, different implementations of (modified) cosine similarities that utilize different weighting schemes, 11,19,40 and methods that use machine learning, such as Spec2Vec, 14 MS2DeepScore, 16 spectral entropy, 15 SIMILE, 17 GLEAMS, 18 and many others. These similarity measures likely provide alternative and complementary approaches for the discovery of related molecules, as the modified cosine similarity only accounts for mass shifts associated with a single structural modification, whereas in practice the structural relationship between two analogs can consist of two or more modifications, resulting in more complex mass fragmentation relationships. Future benchmarking of such methods, as well as the effects of data processing, are necessary to provide further insight into the strengths and limitations of each approach.

Acknowledgments
This research was supported by BBSRC-NSF award 2152526 and National Institutes of Health award U19 AG063744. We want to thank Kris Laukens for assistance with creating the figures. Figure 1 was created with BioRender.com.

Author contributions statement
PCD conceptualized and supervised the work. WB, RS, and FH implemented various spectrum similarity measures. WB, RS, JJJvdH, MW, and PCD performed the data analyses. MW provided computational resources. WB and PCD wrote the manuscript. All authors reviewed and edited the manuscript.

Competing interests statement
PCD is on the advisory board of Cybele and is a co-founder and advisor of Ometa and Enveda, with prior approval by UC San Diego. JJJvdH is a member of the Scientific Advisory Board of NAICONS Srl., Milan, Italy. MW is a co-founder of Ometa Labs LLC.

Spectrum similarity measures
The cosine similarity, modified cosine similarity, and neutral loss matching were implemented in Python (version 3.10). In contrast to the original formulation of neutral loss spectra, 20 spectra were not mirrored at their precursor m/z, as this is not properly defined for multiply charged precursors (i.e. a multiply charged precursor peak can have a lower m/z value than its singly charged fragment ions), but instead at the theoretical m/z of the singly charged precursor. For the cosine similarity and neutral loss matching, only directly matching fragment ions with a near-identical m/z or ∆m/z, respectively, were considered. For the modified cosine similarity, fragment ions with m/z values that are shifted by the pairwise precursor mass difference were considered as well. For all similarity measures, the optimal matching peak assignment across both MS/MS spectra was computed using the SciPy (version 1.8.0) 41 implementation of the linear assignment problem. Additional scientific computing used Numba (version 0.55.1), 42 NumPy (version 1.21.5), 43
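The matching rules described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the presented analyses: the function names, the argument order, the unit-norm intensity scaling, and the restriction to singly charged precursors (so that the precursor m/z can be used directly as the mirroring point) are assumptions made here. Only the allowed peak shifts and the use of SciPy's linear assignment solver follow the text.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def _score(mz1, int1, mz2, int2, shifts, tol):
    """Score of the optimal one-to-one peak matching for the allowed m/z shifts."""
    mz1, mz2 = np.asarray(mz1, float), np.asarray(mz2, float)
    # Assumption: scale intensities to unit Euclidean norm, so that
    # two identical spectra obtain a score of 1.
    int1 = np.asarray(int1, float) / np.linalg.norm(int1)
    int2 = np.asarray(int2, float) / np.linalg.norm(int2)
    # diff[i, j] = m/z of peak i in spectrum 1 minus m/z of peak j in spectrum 2.
    diff = mz1[:, None] - mz2[None, :]
    allowed = np.zeros(diff.shape, dtype=bool)
    for shift in shifts:
        allowed |= np.abs(diff - shift) <= tol
    # Allowed peak pairs contribute the product of their intensities, others 0.
    gain = np.where(allowed, int1[:, None] * int2[None, :], 0.0)
    rows, cols = linear_sum_assignment(gain, maximize=True)
    return float(gain[rows, cols].sum())


def cosine(mz1, int1, mz2, int2, tol=0.1):
    """Only directly matching fragment ions (shift 0)."""
    return _score(mz1, int1, mz2, int2, shifts=(0.0,), tol=tol)


def modified_cosine(mz1, int1, pmz1, mz2, int2, pmz2, tol=0.1):
    """Direct matches plus matches shifted by the precursor m/z difference."""
    return _score(mz1, int1, mz2, int2, shifts=(0.0, pmz1 - pmz2), tol=tol)


def neutral_loss(mz1, int1, pmz1, mz2, int2, pmz2, tol=0.1):
    """Direct matches between neutral loss spectra (precursor minus fragment)."""
    nl1 = pmz1 - np.asarray(mz1, float)
    nl2 = pmz2 - np.asarray(mz2, float)
    return _score(nl1, int1, nl2, int2, shifts=(0.0,), tol=tol)


if __name__ == "__main__":
    # Toy spectra (made up for illustration): two of three fragments carry
    # a +14 m/z modification in spectrum b.
    a = dict(mz=[100.0, 200.0, 300.0], inten=[1.0, 1.0, 1.0], pmz=400.0)
    b = dict(mz=[100.0, 214.0, 314.0], inten=[1.0, 1.0, 1.0], pmz=414.0)
    print(cosine(a["mz"], a["inten"], b["mz"], b["inten"]))
    print(modified_cosine(a["mz"], a["inten"], a["pmz"], b["mz"], b["inten"], b["pmz"]))
    print(neutral_loss(a["mz"], a["inten"], a["pmz"], b["mz"], b["inten"], b["pmz"]))
```

In this toy example the cosine similarity can only use the single unshifted fragment and neutral loss matching can only use the two shifted ones, whereas the modified cosine similarity uses all three, which mirrors the qualitative behavior reported in the Results.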

Analysis of modified peptide MS/MS spectrum pairs
Pairs of modified peptide MS/MS spectra were extracted from the MassIVE-KB spectral library (version 2018/06/15). 21 This is a repository-wide spectral library derived from over 30 TB of human MS/MS proteomics data, containing 2,154,269 MS/MS spectra. First, MS/MS spectra with precursor charge 2, 3, or 4 were considered and spectrum pairs were required to have identical precursor charges. Second, pairs of spectra whose peptides differ by a single modification were extracted. This modification can consist of the absence/presence of a single post-translational modification, or a single amino acid difference (edit distance of 1, corresponding to a single amino acid substitution or amino acid prefix or suffix addition/removal). To avoid inclusion of MS/MS spectrum pairs that only differ by nearby ambiguously localized post-translational modifications, all MS/MS spectrum pairs were required to have a precursor mass difference of at least 4 m/z. This resulted in 955,228 MS/MS spectrum pairs with known peptide labels derived from the MassIVE-KB spectral library.
MS/MS spectra were preprocessed by removing peaks within a 0.1 m/z window around the precursor m/z and removing noise peaks with intensity below 1% of the base peak intensity. Spectrum similarities were computed using a 0.1 m/z fragment mass tolerance. The relative locations of the modifications by which pairs of peptides differ were calculated by normalizing the modification indexes in each pair of peptide sequences by the length of the shortest paired peptide sequence.
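The two preprocessing steps can be illustrated with a short Python sketch. It is not the code used for the analyses; the function name `preprocess`, the representation of a spectrum as a list of `(m/z, intensity)` tuples, the interpretation of the window as ±0.1 m/z, and the choice to determine the base peak after removing the precursor peak are all assumptions made for this example.

```python
def preprocess(peaks, precursor_mz, window=0.1, min_rel_intensity=0.01):
    """Remove the precursor peak region and low-intensity noise peaks.

    peaks: list of (mz, intensity) tuples (hypothetical representation).
    """
    # Step 1: drop peaks close to the precursor m/z
    # (assumption: the window is applied as +/- `window`).
    kept = [(mz, inten) for mz, inten in peaks if abs(mz - precursor_mz) > window]
    if not kept:
        return []
    # Step 2: drop peaks below 1% of the base peak intensity
    # (assumption: the base peak is taken from the remaining peaks).
    base_peak = max(inten for _, inten in kept)
    return [(mz, inten) for mz, inten in kept if inten >= min_rel_intensity * base_peak]


if __name__ == "__main__":
    spectrum = [(100.0, 50.0), (150.0, 0.4), (250.0, 100.0), (500.02, 1000.0)]
    print(preprocess(spectrum, precursor_mz=500.0))
```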

Analysis of modified small molecule MS/MS spectrum pairs
Pairs of modified small molecule MS/MS spectra were extracted from the GNPS community spectral libraries (ALL_GNPS_NO_PROPOGATED), consisting of 495,600 MS/MS spectra, and the GNPS bile acids library (BILELIB19), consisting of 4,533 MS/MS spectra (data downloaded on May 12, 2022). 7 First, MS/MS spectra with precursor charge 1 were considered (spectra with the unspecified precursor charge 0 were assumed to have precursor charge 1 as well). Additionally, only centroid MS/MS spectra that contain at least 6 fragment ions, were acquired in positive ion mode, and have [M+H]+ adducts were included. For the GNPS community spectral libraries, only MS/MS spectra with structural information specified by InChI or SMILES strings were included, and these were used to compute compound classes using Classyfire via its web application programming interface. 50 For the general small molecule analysis, all MS/MS spectrum pairs from the GNPS community spectral libraries with a pairwise precursor mass difference between 1 and 200 m/z were considered, resulting in over 1.5 billion possible spectrum pairs. In deference to computational limitations, 10 million MS/MS spectrum pairs were randomly selected for the subsequent analysis.
For the modified bile acids analysis, all MS/MS spectrum pairs from the GNPS bile acids library with a pairwise precursor mass difference between 1 and 200 m/z were considered, resulting in 340,637 spectrum pairs. Additionally, bile acids that differ by a single oxygen were selected by filtering on a pairwise precursor mass difference (mass tolerance 0.01 m/z) of 15.9949 m/z; bile acids that differ by a conjugation were selected by filtering on pairwise precursor mass differences (mass tolerance 0.01 m/z) of 57.0214 m/z (glycine), 71.0371 m/z (alanine), 107.0041 m/z (taurine), 147.0684 m/z (phenylalanine), or 163.0633 m/z (tyrosine); and bile acids that differ by a substitution in conjugation were selected by filtering on pairwise precursor mass differences (mass tolerance 0.01 m/z) of 49.9826 m/z (taurine ↔ glycine), 35.9669 m/z (taurine ↔ alanine), 40.0643 m/z (taurine ↔ phenylalanine), 56.0592 m/z (taurine ↔ tyrosine), 90.0469 m/z (glycine ↔ phenylalanine), 106.0418 m/z (glycine ↔ tyrosine), 76.0313 m/z (alanine ↔ phenylalanine), or 92.0262 m/z (alanine ↔ tyrosine).
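The selection of bile acid pairs by precursor mass difference can be sketched as a simple lookup. The mass differences and the 0.01 m/z tolerance are taken from the text; the dictionary names, the function `classify_delta`, and the precursor m/z values in the usage example are hypothetical and only serve as an illustration.

```python
# Pairwise precursor mass differences (m/z) as listed in the text.
OXYGEN = {"oxygen": 15.9949}
CONJUGATION = {
    "glycine": 57.0214,
    "alanine": 71.0371,
    "taurine": 107.0041,
    "phenylalanine": 147.0684,
    "tyrosine": 163.0633,
}
SUBSTITUTION = {
    "taurine <-> glycine": 49.9826,
    "taurine <-> alanine": 35.9669,
    "taurine <-> phenylalanine": 40.0643,
    "taurine <-> tyrosine": 56.0592,
    "glycine <-> phenylalanine": 90.0469,
    "glycine <-> tyrosine": 106.0418,
    "alanine <-> phenylalanine": 76.0313,
    "alanine <-> tyrosine": 92.0262,
}


def classify_delta(precursor_mz1, precursor_mz2, deltas, tol=0.01):
    """Return the label whose mass difference matches the pair, or None."""
    delta = abs(precursor_mz1 - precursor_mz2)
    for label, expected in deltas.items():
        if abs(delta - expected) <= tol:
            return label
    return None


if __name__ == "__main__":
    # Illustrative precursor m/z values (assumed, not taken from the dataset).
    print(classify_delta(408.2876, 424.2825, OXYGEN))
    print(classify_delta(409.2949, 466.3163, CONJUGATION))
    print(classify_delta(516.2990, 466.3163, SUBSTITUTION))
```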
MS/MS spectra were preprocessed by removing peaks within a 0.1 m/z window around the precursor m/z and removing noise peaks with intensity below 1% of the base peak intensity. Spectrum similarities were computed using a 0.1 m/z fragment mass tolerance. Tanimoto indexes 23 were computed using RDKit (version 2022.3.2) 51 by creating molecule objects from their SMILES strings, deriving RDKit topological fingerprints using the RDKit default settings (2048 bits), and computing the Tanimoto index.
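For reference, the Tanimoto index between two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below shows only this final step on plain Python sets of bit indices; it deliberately does not call RDKit, and the function name, the toy fingerprints, and the convention of returning 0 for two empty fingerprints are assumptions made for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto index of two fingerprints given as collections of set-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Convention chosen here for two empty fingerprints.
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Toy fingerprints (made up): two shared bits out of six distinct bits.
    print(tanimoto({1, 2, 3, 4}, {3, 4, 5, 6}))
```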

Data and code availability
All code and notebooks to recreate the presented analyses are available as open source under the permissive BSD license at https://github.com/bittremieux/cosine_neutral_loss. A permanent archive of the source code and the analysis notebooks is available on Zenodo at https://doi.org/10.5281/zenodo.6584619. Additionally, cosine similarity, modified cosine similarity, and neutral loss matching have been implemented in the matchms Python package (https://github.com/matchms/matchms; version 0.15.0). 37 An interactive viewer to inspect neutral loss matching and (modified) cosine similarity is available in MZmine 3 (https://github.com/mzmine/mzmine3) 38 under "Menu > Tools > Spectral mirror." Spectra can be retrieved by their GNPS library identifiers or Universal Spectrum Identifiers (USIs). 39,52 Experimental MS/MS spectra can be selected from feature finding results by selecting two rows in a feature table and choosing "Show > MS/MS spectral mirror" from the right-click context menu.
The MassIVE-KB spectral library and the GNPS community spectral libraries are available under an open and free license at https://massive.ucsd.edu/ProteoSAFe/static/massive-kb-libraries.jsp and https://gnps-external.ucsd.edu/gnpslibrary, respectively. A permanent archive of the spectral libraries, as well as the computed similarity scores for the different analyses, is available on Zenodo at https://doi.org/10.5281/zenodo.6829249.
Individual spectra are accessible by their Universal Spectrum Identifiers (USIs). 39,52 The spectra displayed in Figure 1b