Abstract
Matching peak features across multiple LC-MS runs (alignment) is an integral part of all LC-MS data processing pipelines. Alignment is challenging due to variations in the retention time of peak features across runs and the large number of peak features produced by a single compound in the analyte. In this paper, we propose a Bayesian non-parametric model that aligns peaks via a hierarchical cluster model using both peak mass and retention time. Crucially, this method provides confidence values in the form of posterior probabilities, allowing the user to distinguish between aligned peaksets of high and low confidence. The results from our experiments on a diverse set of proteomic, glycomic and metabolomic data show that the proposed model produces alignment results competitive with other widely used benchmark methods, while at the same time providing a probabilistic measure of confidence in the alignment results, thus allowing the user to trade precision against recall.
Availability Our method has been implemented as a stand-alone application in Java, available for download at http://github.com/joewandy/HDP-Align.
1 Introduction
Liquid chromatography coupled to mass spectrometry (LC-MS) is a popular method for performing large-scale experiments and investigating the differential expression of compounds in LC-MS-based omics (such as proteomics, metabolomics and glycomics). Large-scale untargeted LC-MS studies may involve the analysis of potentially hundreds of runs [6]. Typical LC-MS data pre-processing pipelines operate in a serial manner with many intermediate steps, and in untargeted studies even relatively small errors in any step preceding the identification stage can result in significant differences in the final analysis and biological conclusions. Alignment, where the correspondences between peaks across runs are established, forms an integral part of the LC-MS data pre-processing pipeline and is a challenging problem due to the non-linear deviations in retention time (RT) of peak features across runs [23] and the large number of ‘related-peaks’ derived from a single compound alone [9]. These related-peaks include e.g. adducts, fragments and multiple ionisation forms, and often share similar chromatographic peak shapes and close RT values. In this paper, the term ‘run’ refers to the output from running a biological sample through the LC-MS instrument once, while the term ‘feature’ refers to a tuple of minimally the (m/z, RT) values derived from a single peak.
Alignment methods can be broadly divided into two categories [27]: (a) warping-based methods that perform RT correction of peak features before matching, and (b) direct-matching methods that establish peak correspondence by performing matching on peak features directly, without performing RT correction, often by maximising some objective function. Warping methods attempt to correct the RT drifts present across runs, either using the full LC-MS profile data or the feature data alone. Once the time warping function has been established, finding the actual matching between peak features is straightforward as both peak masses and RT values should be accurately conserved across runs. In contrast, direct matching methods perform the alignment of peaks by directly matching the features across runs, skipping the initial time-warping step. Direct matching methods can be preferred in certain cases due to their simplicity, while still offering good performance [14,22,33].
Alignment methods can also be categorised depending on whether they require a user-defined reference run to be specified. When such a reference is necessary, the full alignment of multiple runs is constructed through successive merging of pairwise runs towards the reference run (e.g. MZmine2’s Join aligner in [22]). Alternatively, methods that do not require a reference run can either operate in a hierarchical fashion – where the final multiple alignment results are constructed in a greedy manner by merging of successive pairwise results following a guide tree (e.g. SIMA, described in [33]) – or by pooling features across runs and grouping similar peaks in the combined input simultaneously (e.g. the group() function of XCMS in [26]).
According to [27], the common shortcomings shared by many alignment methods include the incorrect modelling assumption that the elution order of peaks is preserved across runs, and the abundance of user-defined parameters, which can dramatically influence alignment results. Further uncertainties can be introduced by the selection of a reference run and the construction of a guide tree in hierarchical methods. Since alignment is such an important part of the data preprocessing steps, it would be useful to be able to robustly identify the uncertainty or confidence in the alignment results. The subject of identifying and quantifying uncertainty has been extensively investigated for multiple sequence alignment (MSA) in genomics and transcriptomics. [13] attempt to quantify the alignment uncertainty of the popular MSA tool ClustalW [30], based on evaluations using synthetic data, and conclude that between half and all of the columns in their benchmark MSA results contain alignment errors. [19] construct a score that reflects the consensus between all possible pairwise alignments in T-COFFEE, while [20] propose GUIDANCE, a confidence measure obtained from perturbations of guide trees. Statistical approaches that provide a measure of confidence in alignment results have also been explored by [24] and [2], where the MSA results and phylogeny are constructed simultaneously, thus eliminating the need for a guide tree.
Despite the clear benefits of alignment uncertainty quantification in the sequence domain, the challenge of quantifying uncertainty in alignment results remains relatively unaddressed for the alignment of multiple runs in LC-MS-based omics. Bayesian methods operating on profile data [11,16,32] and feature-based alignment methods [7,22,33] exist to correct RT drift, but in such methods, uncertainties are not propagated from the RT regression stage to the peak matching stage that follows. Several recent feature-based alignment methods incorporate probabilistic modelling as part of their workflow, making it possible to extract some form of score or probability for the alignment results; however, these methods are often limited to the alignment of two runs, which is not a realistic assumption in actual LC-MS experiments. For example, [10] propose an empirical Bayes model for pairwise peak matching. Matching confidence can be obtained from the model in the form of a posterior probability for any peak pair in two runs; however, constructing multiple alignment results in [10] still requires a greedy search to find candidate features within m/z and RT tolerances of a predetermined set of ‘landmark’ peaks. [8] describe PeakLink, an alignment workflow that performs an initial warping using a fourth-degree polynomial. PeakLink poses the pairwise matching problem as a binary classification problem, where a Support Vector Machine (SVM) is trained on an alignment ground truth derived from MS/MS information and used to differentiate matching from non-matching candidate pairs to produce the actual alignment results. While not explicitly included in the output of PeakLink, a matching score can be extracted from the SVM that represents how far each candidate pair lies from the decision boundary. Note that such scores are not well calibrated in the probabilistic sense, making comparisons of matching scores less straightforward. PeakLink has also not been extended to the problem of aligning multiple runs, although [8] state that it would be possible to do so given a suitable reference run.
The goal of establishing the matching of peaks across multiple runs at once can be viewed as a clustering problem, where a set of peaks can be grouped (by their m/z, RT and other suitable features) into local clusters within each run (representing all of the peaks from an individual compound), which are further grouped into global clusters shared across runs. A preliminary form of this idea has been explored in [5], where hierarchical clustering is performed on the total ion chromatogram data to group peaks into within-run local clusters, which are further grouped into across-run super clusters. The highly accurate mass information available from modern LC-MS platforms is not used in [5], although it is highlighted as possible future work. The choice of a hierarchical clustering method in [5] also requires various user-defined parameters, such as a suitable cut-off for the resulting dendrogram, a suitable linkage method and an appropriate distance measure between groups of peaks.
In this work, we expand upon the idea of viewing direct matching as a hierarchical clustering problem by proposing HDP-Align, a Bayesian non-parametric model that groups related-peaks within runs and assigns them to global clusters shared across runs. Within each global cluster, peaks are further grouped by their m/z values into mass clusters, which represent the various ionisation products (IPs) derived from the global compound. The proposed model allows us to infer the matching of peaks across all runs at once, without the need for any intermediate merging of pairwise runs, and the resulting posterior summaries provide us with a confidence score in the matching quality of aligned peaksets. This introduces the possibility of allowing the user to trade recall for precision from the alignment results by returning a smaller subset of the results having a higher confidence score of being correctly aligned. Figure 1 shows an illustration of the clustering process in HDP-Align. Additionally, the latent variables of clustering structure inferred in the model can potentially have physically meaningful identities that can be used for further analysis, and using a metabolomic dataset, we demonstrate the usefulness of such clustering objects by using the mass clusters derived from the model to perform putative annotations of features based on their potential adduct types and metabolite identities.
2 Related Work
As outlined in the Introduction, establishing the matching of peaks across multiple runs at once can be viewed as a clustering problem: peaks are grouped (by their m/z, RT and other suitable features) into local clusters within each run, which are further grouped into global clusters shared across runs. Hierarchical clustering has been used for the matching of peaks across runs [5,31]. In [31], peaks are hierarchically clustered based on their m/z values to construct the matching across runs, while in [5], peaks from the entire dataset are pooled and a hierarchical clustering scheme based on RT alone is used to group peaks into within-run local clusters, which are further grouped into across-run super clusters. Both approaches require choosing various user-defined parameters, such as a suitable cut-off for the resulting dendrogram, a suitable linkage method and an appropriate distance measure between groups of peaks. In [31], no chromatographic separation is performed, so only the m/z values of peaks are used. The nature of the gas chromatography data used in [5], where retention time is more reproducible across runs, means that good alignment performance can be obtained even without the m/z information. This is not the case for LC-MS data, where retention time drift is common and highly accurate m/z information is crucial for alignment. The proposed HDP-Align model fills this gap by using both the m/z and RT values, important for LC-MS peak alignment, in the hierarchical clustering process. The probabilistic approach employed by HDP-Align also allows us to extract confidence values for aligned peaksets.
3 Hierarchical Dirichlet Process Mixture Model for Alignment
The proposed model for HDP-Align is framed as a Hierarchical Dirichlet Process (HDP) mixture model [29]. Essential modifications to the basic HDP model were performed to suit the nature of the multiple peak alignment problem. Figure 2 shows the conditional dependencies between random variables in the HDP-Align model.
Our input consists of J input files, indexed by j = 1, …, J, corresponding to the J LC-MS runs to be aligned. Each j-th input file contains Nj peaks in total, which can be separated into Kj local clusters of related-peaks. In a j-th file, peaks are indexed by n = 1, …, Nj and local clusters are indexed by k = 1, …, Kj. Across all files, we assign each local cluster k in file j to a global cluster i = 1, …, I, where I is the total number of global clusters, using the indicator variable v, as described in the following paragraph. A global cluster corresponds to the compound of interest during LC-MS analysis, e.g. metabolite or peptide fragment, that is present across runs, while local clusters are realisations of the global clusters in a specific run. Finally, within each global cluster i, we can further group peaks by their m/z values into A mass clusters (indexed by a = 1, …, A). Each mass cluster therefore corresponds to the ionisation product peaks, coming from the different runs, that are produced by a global compound during mass spectrometry.
We use the indicator variable $z_{jnk} = 1$ to denote the assignment of peak n in file j to local cluster k in that file. Similarly, $v_{jni} = 1$ if peak n in file j is assigned to global cluster i, and $v_{jnia} = 1$ if peak n in file j is assigned to mass cluster a linked to metabolite i. Let $d_j$ be the list of observed data of peaks in file j, $d_j = (d_{j1}, d_{j2}, \ldots, d_{jN_j})$, where $d_{jn} = (x_{jn}, y_{jn})$ with $x_{jn}$ the RT value and $y_{jn}$ the log m/z value of the peak feature. The log of the m/z value is used here as the m/z error is assumed to increase linearly with the observed m/z value [21]. $\theta$ denotes the global mixing proportions and $\pi_j$ the local mixing proportions for file j. The global mixing proportions $\theta$ are distributed according to the Griffiths, Engen and McCloskey (GEM) distribution,

$$\theta \sim \text{GEM}(\alpha'),$$

where the GEM distribution over $\theta$ is described through the stick-breaking construction:

$$\theta_i = \beta_i \prod_{l=1}^{i-1} (1 - \beta_l), \qquad \beta_i \sim \text{Beta}(1, \alpha').$$
The local mixing proportions $\pi_j$ are distributed according to a Dirichlet Process (DP) prior with base measure $\theta$ and concentration parameter $\alpha_t$:

$$\pi_j \mid \alpha_t, \theta \sim \text{DP}(\alpha_t, \theta).$$
Within each file j, the indicator variable $z_{jnk} = 1$ denotes the assignment of peak n in file j to local RT cluster k in that file, and follows the local mixing proportions for that file:

$$P(z_{jnk} = 1 \mid \pi_j) = \pi_{jk}.$$
The RT value $t_i$ of a global mixture component is drawn from a base Gaussian distribution with mean $\mu_0$ and precision (inverse variance) $\sigma_0$:

$$t_i \sim \mathcal{N}(\mu_0, \sigma_0^{-1}).$$
The RT value $t_{jk}$ of a local mixture component in file j, whose parent is global cluster i, is normally distributed with mean $t_i$ and precision $\delta$:

$$t_{jk} \sim \mathcal{N}(t_i, \delta^{-1}).$$

The precision $\delta$ controls how much the RT values of related-peak groups across runs are allowed to deviate from the parent global compound’s RT.
Finally, the observed peak RT value is normally distributed with mean $t_{jk}$ and precision $\gamma$:

$$x_{jn} \sim \mathcal{N}(t_{jk}, \gamma^{-1}).$$

The precision $\gamma$ controls how much the RT values of peaks can deviate from their related-peak group.
The m/z value produced by a high-precision MS instrument is highly accurate, and its correspondence is often preserved across runs. Once peaks have been assigned to their respective global clusters, we need to further separate the peaks within each global cluster into mass clusters to obtain the actual alignment; these mass clusters correspond to ionisation products. We do this by incorporating an internal DP mixture model on the m/z values ($y_{jn}$) within each global cluster i. Let the indicator $v_{jnia} = 1$ denote the assignment of peak n in file j to mass cluster a in the i-th global cluster. Then:

$$\lambda_i \sim \text{GEM}(\alpha_m), \qquad \mu_{ia} \sim \mathcal{N}(\psi_0, \rho_0^{-1}), \qquad p(y_{jn} \mid v_{jnia} = 1, \ldots) = \mathcal{N}(y_{jn} \mid \mu_{ia}, \rho^{-1}) \cdot \mathbb{I}(d_{jn}),$$

where the index ia refers to the a-th mass cluster of the i-th global cluster. $\lambda_i$ is the mixing proportions of the i-th internal DP mixture for the masses, with $\alpha_m$ the concentration parameter. $\mu_{ia}$ is the mass cluster mean, drawn from the Gaussian base distribution with mean $\psi_0$ and precision $\rho_0$. The observed mass value is drawn from a Gaussian distribution with component mean $\mu_{ia}$ and precision $\rho$, whose value is set based on the MS instrument’s resolution. Additionally, we place a constraint on the likelihood of $y_{jn}$ using the indicator function $\mathbb{I}(\cdot)$, such that $\mathbb{I}(d_{jn}) = 1$ if there are no other peaks inside the mass cluster that come from the same file as the current peak $d_{jn}$, and 0 otherwise. This constraint captures the restriction that a peak feature can only be matched to peaks from different files, reflecting the assumption that within each LC-MS run, a compound produces ionisation products with distinct mass-to-charge fingerprints that can be used for matching to other runs.
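To make the generative process concrete, the following sketch samples RT values from the model using a truncated stick-breaking approximation to the GEM distribution. Class and constant names, and the truncation level, are our own illustrative choices rather than part of the released implementation; the mass side, omitted for brevity, follows the same pattern with $\lambda_i$, $\mu_{ia}$ and $\rho$.

```java
import java.util.Random;

/** Illustrative sketch (not the released HDP-Align code): sampling RT values
 *  from the generative model with a truncated stick-breaking GEM. */
public class HdpAlignGenerativeSketch {
    static final Random RNG = new Random(42);
    static final double ALPHA_PRIME = 10;                // top-level DP concentration
    static final double MU0 = 1000, SIGMA0 = 1.0 / 5e6;  // base RT mean/precision (Section 5.4)
    static final double DELTA = 1.0 / (20 * 20);         // global-to-local RT precision (20 s sd)
    static final double GAMMA = 1.0 / (2 * 2);           // local-to-peak RT precision (2 s sd)
    static final int TRUNC = 50;                         // truncation level (illustrative)

    /** theta ~ GEM(alpha): theta_i = beta_i * prod_{l<i}(1 - beta_l), beta_i ~ Beta(1, alpha). */
    static double[] stickBreaking(double alpha) {
        double[] theta = new double[TRUNC];
        double remaining = 1.0;
        for (int i = 0; i < TRUNC; i++) {
            double beta = 1 - Math.pow(RNG.nextDouble(), 1.0 / alpha); // Beta(1, alpha) via inverse CDF
            theta[i] = beta * remaining;
            remaining *= 1 - beta;
        }
        return theta;
    }

    public static void main(String[] args) {
        double[] theta = stickBreaking(ALPHA_PRIME);
        double[] t = new double[TRUNC];                  // global cluster RTs: t_i ~ N(mu0, 1/sigma0)
        for (int i = 0; i < TRUNC; i++) t[i] = MU0 + RNG.nextGaussian() / Math.sqrt(SIGMA0);
        int i = sampleIndex(theta);                      // pick a global cluster for one local cluster
        double tjk = t[i] + RNG.nextGaussian() / Math.sqrt(DELTA); // t_jk ~ N(t_i, 1/delta)
        double x = tjk + RNG.nextGaussian() / Math.sqrt(GAMMA);    // x_jn ~ N(t_jk, 1/gamma)
        System.out.printf("global %d: t_i=%.1f  t_jk=%.1f  peak RT=%.1f%n", i, t[i], tjk, x);
    }

    static int sampleIndex(double[] w) {
        double u = RNG.nextDouble(), cum = 0;
        for (int i = 0; i < w.length; i++) { cum += w[i]; if (u < cum) return i; }
        return w.length - 1;
    }
}
```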
4 Inference
Inference in the model is performed via a Gibbs sampling scheme, allowing us to compute alignment probabilities as the proportion of posterior samples in which a set of peaks is placed in the same mass cluster of the same global cluster. In this manner, peaks from different runs that fall in the same mass cluster are considered aligned, as they have similar RT and m/z values. In each iteration of the sampling procedure, we instantiate the mixture component parameters for the local RT clusters ($t_{jk}$) and global RT clusters ($t_i$). In the internal DP mixture linked to each global cluster i, we marginalise out the mass cluster parameters ($\mu_{ia}$). The sampler is initialised by assigning all peaks in each run to a single local RT cluster; across runs, these local clusters are assigned to one global cluster shared across runs, and within that global cluster, peaks from different runs are assigned to a single mass cluster. The sampler then iterates through each peak feature, removing it from the model, updating the assignment of the peak to clusters, and performing the necessary book-keeping on any instantiated mixture components. Further details on the specific Gibbs updates can be found in the following sections.
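To give a flavour of this procedure in code, the following is a deliberately simplified, single-level analogue: a collapsed CRP mixture over RT values only, using each cluster's empirical mean in place of an instantiated mean parameter. The actual sampler applies the same remove/score/reassign cycle jointly over local, global and mass clusters.

```java
import java.util.*;

/** Illustrative single-level analogue of the sampler (not the released code):
 *  a collapsed CRP mixture over RT values with fixed observation precision. */
public class ToyCrpGibbs {
    static final double ALPHA = 10.0;               // DP concentration
    static final double GAMMA = 0.25;               // observation precision (1/variance)
    static final double MU0 = 100.0, SIGMA0 = 1e-4; // base measure mean and precision

    public static void main(String[] args) {
        double[] x = {98.5, 99.1, 100.2, 150.3, 151.0}; // toy peak RT values
        int[] z = new int[x.length];                    // cluster labels, all start at 0
        Random rng = new Random(1);
        for (int iter = 0; iter < 1000; iter++)
            for (int n = 0; n < x.length; n++)
                z[n] = resample(x, z, n, rng);          // remove peak n, then reassign it
        System.out.println(Arrays.toString(z));         // the first three peaks typically co-cluster
    }

    static int resample(double[] x, int[] z, int n, Random rng) {
        Map<Integer, double[]> stats = new HashMap<>(); // cluster id -> {count, sum}
        for (int m = 0; m < x.length; m++) {
            if (m == n) continue;                       // peak n is removed from the model
            double[] s = stats.computeIfAbsent(z[m], k -> new double[2]);
            s[0] += 1; s[1] += x[m];
        }
        List<Integer> ids = new ArrayList<>(stats.keySet());
        double[] p = new double[ids.size() + 1];
        for (int k = 0; k < ids.size(); k++) {
            double[] s = stats.get(ids.get(k));
            p[k] = s[0] * gaussian(x[n], s[1] / s[0], 1 / GAMMA);           // size c_k times predictive
        }
        p[ids.size()] = ALPHA * gaussian(x[n], MU0, 1 / SIGMA0 + 1 / GAMMA); // new cluster
        int choice = sampleIndex(p, rng);
        return choice < ids.size() ? ids.get(choice) : Collections.max(ids) + 1;
    }

    static double gaussian(double v, double mean, double var) {
        return Math.exp(-0.5 * (v - mean) * (v - mean) / var) / Math.sqrt(2 * Math.PI * var);
    }

    static int sampleIndex(double[] p, Random rng) {
        double total = 0; for (double w : p) total += w;
        double u = rng.nextDouble() * total, cum = 0;
        for (int i = 0; i < p.length; i++) { cum += p[i]; if (u < cum) return i; }
        return p.length - 1;
    }
}
```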
4.1 Updating peak assignments
We use the following variables to denote the counts of items in each clustering object: $c_{jk}$ is the number of peaks in local cluster k of file j, $c_i$ is the number of local clusters in global cluster i, and $c_{ia}$ is the number of peaks in mass cluster a inside global RT cluster i. To update the assignment of a peak $d_{jn}$ to a local RT cluster k during Gibbs sampling, we need the conditional probability of $z_{jnk} = 1$ given all other parameters, denoted $P(z_{jnk} = 1 \mid d_{jn}, \ldots)$:

$$P(z_{jnk} = 1 \mid d_{jn}, \ldots) \propto \begin{cases} c_{jk} \; p(d_{jn} \mid z_{jnk} = 1, \ldots) & \text{for an existing local cluster } k, \\ \alpha_t \; p(d_{jn} \mid z_{jnk^*} = 1, \ldots) & \text{for a new local cluster } k^*. \end{cases} \qquad (13)$$
We consider the top (existing cluster) and bottom (new cluster) cases of eq. (13) separately in the following.
For an existing local RT cluster k, the prior weight in eq. (13) is proportional to the cluster size $c_{jk}$, and the likelihood $p(d_{jn} \mid z_{jnk} = 1, \ldots)$ is assumed to factorise across the RT ($x_{jn}$) and mass ($y_{jn}$) terms:

$$p(d_{jn} \mid z_{jnk} = 1, \ldots) = p(x_{jn} \mid z_{jnk} = 1, \ldots)\; p(y_{jn} \mid z_{jnk} = 1, \ldots). \qquad (14)$$
The RT term $p(x_{jn} \mid z_{jnk} = 1, \ldots)$ in eq. (14) is normally distributed with mean $t_{jk}$ and precision $\gamma$, while the mass term $p(y_{jn} \mid z_{jnk} = 1, \ldots)$ is an internal DP mixture of mass components linked to the parent global cluster i of the existing local cluster k. We then marginalise over all mass clusters in i to get

$$p(y_{jn} \mid z_{jnk} = 1, v_{jni} = 1, \ldots) = \sum_{a} \frac{c_{ia}}{\sum_{a'} c_{ia'} + \alpha_m}\, p(y_{jn} \mid v_{jnia} = 1, \ldots) + \frac{\alpha_m}{\sum_{a'} c_{ia'} + \alpha_m}\, p(y_{jn} \mid v_{jnia^*} = 1, \ldots). \qquad (15)$$
To compute the terms in eq. (15), first consider the case of an existing mass cluster a in global RT cluster i. With $\mu_{ia}$ marginalised out, the predictive density is

$$p(y_{jn} \mid v_{jnia} = 1, \ldots) = \mathcal{N}\!\left(y_{jn} \mid \mu_{ia}', (\rho_{ia}')^{-1} + \rho^{-1}\right) \mathbb{I}(d_{jn}), \quad \text{where } \rho_{ia}' = \rho_0 + c_{ia}\rho \text{ and } \mu_{ia}' = \frac{\rho_0 \psi_0 + \rho \sum_{jn \in ia} y_{jn}}{\rho_{ia}'}. \qquad (16)$$
For a new mass cluster $a^*$ in global RT cluster i, we marginalise out $\mu_{ia}$ to obtain

$$p(y_{jn} \mid v_{jnia^*} = 1, \ldots) = \mathcal{N}(y_{jn} \mid \psi_0, \rho_0^{-1} + \rho^{-1}). \qquad (17)$$
The prior weight in eq. (13) for peak $d_{jn}$ to start a new local cluster $k^*$ is proportional to $\alpha_t$. Marginalising over all global clusters in the top-level DP, we get

$$p(d_{jn} \mid z_{jnk^*} = 1, \ldots) = \sum_{i} \frac{c_i}{\sum_{i'} c_{i'} + \alpha'}\, p(d_{jn} \mid v_{jni} = 1, \ldots) + \frac{\alpha'}{\sum_{i'} c_{i'} + \alpha'}\, p(d_{jn} \mid v_{jni^*} = 1, \ldots). \qquad (18)$$
There are two terms to compute in eq. (18): whether peak $d_{jn}$ falls in an existing global cluster i or a new global cluster $i^*$. For an existing global RT cluster i, $p(d_{jn} \mid v_{jni} = 1, \ldots)$ is assumed to factorise into its RT and mass terms, so $p(d_{jn} \mid v_{jni} = 1, \ldots) = p(x_{jn} \mid v_{jni} = 1, \ldots) \cdot p(y_{jn} \mid v_{jni} = 1, \ldots)$. We marginalise over all local RT clusters to obtain

$$p(x_{jn} \mid v_{jni} = 1, \ldots) = \mathcal{N}(x_{jn} \mid t_i, \delta^{-1} + \gamma^{-1}), \qquad (19)$$

and marginalise over all possible mass clusters in the internal DP linked to global cluster i to obtain $p(y_{jn} \mid v_{jni} = 1, \ldots)$, which is defined in eq. (15). Finally, for a new global RT cluster $i^*$, $p(d_{jn} \mid v_{jni^*} = 1, \ldots)$ is also assumed to factorise into its RT and mass terms. We marginalise over $t_{jk}$ and $t_i$ to obtain

$$p(x_{jn} \mid v_{jni^*} = 1, \ldots) = \mathcal{N}(x_{jn} \mid \mu_0, \sigma_0^{-1} + \delta^{-1} + \gamma^{-1}), \qquad (20)$$

and marginalise over $\mu_{ia}$ to get

$$p(y_{jn} \mid v_{jni^*} = 1, \ldots) = \mathcal{N}(y_{jn} \mid \psi_0, \rho_0^{-1} + \rho^{-1}). \qquad (21)$$
4.2 Updating instantiated variables
The following expressions are used to update the instantiated mixture component parameters in the model during Gibbs sampling.
Updating the global cluster RT $t_i$: here, $t_{jk \in i}$ refers only to the local RT clusters currently assigned to global cluster i, and $c_i$ is the count of such local clusters. Then

$$t_i \mid \ldots \sim \mathcal{N}\!\left(\mu_i', (\sigma_i')^{-1}\right), \quad \text{where } \sigma_i' = \sigma_0 + c_i \delta \text{ and } \mu_i' = \frac{\sigma_0 \mu_0 + \delta \sum_{jk \in i} t_{jk}}{\sigma_i'}.$$
Updating the local cluster RT $t_{jk}$: here, $x_{jn \in k}$ refers only to the peaks currently assigned to local RT cluster k, and $c_{jk}$ is the count of such peaks. Then

$$t_{jk} \mid \ldots \sim \mathcal{N}\!\left(\mu_{jk}', (\sigma_{jk}')^{-1}\right), \quad \text{where } \sigma_{jk}' = \delta + c_{jk} \gamma \text{ and } \mu_{jk}' = \frac{\delta t_i + \gamma \sum_{n \in k} x_{jn}}{\sigma_{jk}'}.$$
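Both updates are instances of the standard conjugate update for a Gaussian mean with known precisions; a minimal sketch (a hypothetical helper, not the released code):

```java
/** Posterior for a Gaussian mean with known precisions: prior N(priorMean, 1/priorPrec),
 *  c observations with sum `sum` and per-observation precision obsPrec.
 *  For t_i:  prior = (mu0, sigma0), observations = the t_jk's, obsPrec = delta.
 *  For t_jk: prior = (t_i, delta),  observations = the peak RTs x_jn, obsPrec = gamma. */
static double[] gaussianPosterior(double priorMean, double priorPrec,
                                  double obsPrec, int c, double sum) {
    double postPrec = priorPrec + c * obsPrec;
    double postMean = (priorPrec * priorMean + obsPrec * sum) / postPrec;
    // A new value is then sampled as postMean + rng.nextGaussian() / Math.sqrt(postPrec).
    return new double[] {postMean, postPrec};
}
```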
4.3 Using the Inference Results
Using the posterior samples from Gibbs sampling, we can compute various posterior summaries and, more interestingly, extract the alignment of peaks from the inference results (since features assigned to the same mass cluster within the same global RT cluster are considered to be aligned). For each sample, we record the aligned peaksets of peaks put into the same mass cluster. Averaging over all samples provides a distribution over these aligned peaksets. Note that across all the aligned peaksets from all samples, the same peak can be matched to different partners with varying probabilities, depending on how often they co-occur in the same mass cluster. To allow control over precision and recall, we provide a user-defined threshold t: aligned peaksets are returned only when they occur with matching probability > t. Varying this threshold allows the user to trade precision for recall: a low value of t gives a larger set of results that are potentially less precise, while a high t gives a smaller, more precise set of aligned peaksets. This output is not available from the other baseline alignment methods and can be useful in problem domains where high precision is required from the alignment results.
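In code, this post-processing reduces to counting how often each peakset recurs across posterior samples; a sketch (types and names are illustrative assumptions, with peaks represented by string identifiers):

```java
import java.util.*;

/** Keep aligned peaksets whose posterior matching probability exceeds the threshold t. */
class PeaksetFilter {
    static Map<Set<String>, Double> filter(List<Set<Set<String>>> samples, double t) {
        Map<Set<String>, Integer> counts = new HashMap<>();
        for (Set<Set<String>> sample : samples)          // one entry per posterior sample
            for (Set<String> peakset : sample)           // peaks sharing a mass cluster
                counts.merge(peakset, 1, Integer::sum);
        Map<Set<String>, Double> kept = new LinkedHashMap<>();
        counts.forEach((peakset, c) -> {
            double prob = (double) c / samples.size();   // posterior matching probability
            if (prob > t) kept.put(peakset, prob);
        });
        return kept;
    }
}
```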
4.4 Ionisation Product and Metabolite Identity Annotations
In metabolomic studies using electrospray ionisation, a single metabolite can generate multiple ionisation products (IPs, such as isotopic variants, adducts and fragment peaks), alongside other peaks resulting from noise and artifacts introduced during mass spectrometry [15]. Determining and annotating these IP peaks is desirable to remove extraneous peaks and reduce the burden of subsequent analysis in the data processing pipeline. Additionally, deducing the precursor mass of the compound that generates the IP peaks is necessary to query compound library databases in order to assign metabolite identities.
The resulting clustering objects inferred by HDP-Align lend themselves to further analysis in a natural fashion, as global RT clusters in HDP-Align may correspond to metabolites, while local RT clusters may correspond to the noisy realisations of these metabolites within each run. Mass clusters in the internal mixture of each global cluster could correspond to the ionisation products of a metabolite. To demonstrate the possibility of obtaining additional information beyond alignment from the output of HDP-Align, we follow the workflow in [15], which performs IP and metabolite annotation of peaks. This workflow is composed of multiple key steps: peak matching, ionisation product clustering and metabolite mass matching. A key difference between HDP-Align and the workflow in [15] lies in the fact that HDP-Align is able to perform the two separate steps of peak alignment and potential IP clustering simultaneously, as shown in Figure 3.
Given the set of potential IP clusters, we can perform IP annotations on the peaks. To do this using the metabolomic dataset, first we take the set of clustering objects produced in a posterior sample. For each mass cluster, we assign its m/z value to be the average of the m/z values of the features assigned to it, denoted by m. The list of common adducts (Table 1) in positive ionisation mode is used to compute the inverse transformation for the precursor mass that generates the observed mass. Following [15], any two mass clusters sharing the same precursor mass (within tolerance) provide a vote on the presence of that consensus precursor mass. The mass clusters and the peaks inside them can then be annotated with the adduct types that produce the transformations to the shared precursor mass. The set of precursor masses deduced in this manner can also be used to query KEGG (a database of metabolite compounds) in order to assign metabolite identities to the global compound.
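A sketch of this voting scheme in code (only two adduct rules are shown for brevity, and the mean m/z values are chosen so as to reproduce the two precursor masses reported in Section 6.2; the actual adduct list is in Table 1):

```java
import java.util.*;

/** Vote for precursor masses: each mass cluster's mean m/z is inverse-transformed under
 *  each adduct rule; two clusters agreeing on a precursor (within tolerance) vote for it,
 *  and both can then be annotated with their respective adduct types. */
public class PrecursorVoting {
    public static void main(String[] args) {
        double[] clusterMz = {293.13424, 315.11618, 384.15006, 406.13200}; // mean m/z per mass cluster
        String[] adducts = {"M+H", "M+Na"};
        double[] shifts  = {1.00728, 22.98922};  // positive mode, singly charged
        double tolPpm = 3.0;
        List<double[]> candidates = new ArrayList<>(); // {precursorMass, clusterIdx, adductIdx}
        for (int c = 0; c < clusterMz.length; c++)
            for (int a = 0; a < adducts.length; a++)
                candidates.add(new double[] {clusterMz[c] - shifts[a], c, a});
        for (int i = 0; i < candidates.size(); i++)
            for (int j = i + 1; j < candidates.size(); j++) {
                double m1 = candidates.get(i)[0], m2 = candidates.get(j)[0];
                boolean differentClusters = candidates.get(i)[1] != candidates.get(j)[1];
                if (differentClusters && Math.abs(m1 - m2) / m1 * 1e6 < tolPpm)
                    System.out.printf("precursor %.5f: cluster %.0f as %s, cluster %.0f as %s%n",
                        (m1 + m2) / 2, candidates.get(i)[1], adducts[(int) candidates.get(i)[2]],
                        candidates.get(j)[1], adducts[(int) candidates.get(j)[2]]);
            }
    }
}
```

Run on these illustrative values, the sketch reports the two consensus precursor masses (292.12696 and 383.14278), each supported by an M+H/M+Na pair of mass clusters.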
5 Evaluation Study
5.1 Evaluation Datasets
Performance of the proposed method and the benchmark methods is evaluated on LC-MS datasets from proteomic, glycomic and metabolomic experiments. The Proteomic dataset is obtained from [14]: all 6 fractions, each containing 2 runs of features with high RT variation across runs, are used in our experiments. The Glycomic dataset is provided by [32]; we use the first 10 runs from this dataset in our experiment. Both the Proteomic and Glycomic datasets provide an alignment ground truth and have been used to benchmark alignment performance in other studies [1,14,22,32,33]. Additionally, we introduce a Metabolomic dataset generated from standard runs used for the calibration of chromatographic columns [3]. The runs were produced from different LC-MS analyses separated by weeks, representing a potentially challenging alignment scenario; 6 runs were used in the experiment. Alignment ground truth was constructed from the putative identification of peaks in each of the 6 runs separately at 3 ppm using the Identify module in mzMatch [25], taking as additional input a database of 104 compounds known to be present and a list of common adducts in positive ionisation mode (Table 1). This was followed by matching of features sharing the same annotations across runs to construct the alignment ground truth. Only peaks unambiguously identified with exactly one annotation are used for this purpose; peaks with more than one annotation per run are discarded from the ground-truth construction.
Table 2 summarises the different evaluation datasets and the number of features each dataset has.
5.2 Performance Measures
Performance of the evaluated methods in our results is quantified in terms of precision and recall. These two measures are also commonly used in information retrieval, where ‘precision’ refers to the fraction of retrieved items that are relevant, while ‘recall’ refers to the fraction of relevant items that are retrieved [18].
To provide a definition of ‘precision’ and ‘recall’ suitable for evaluating alignment performance, we first enumerate all the possible q-size combinations for every aligned peakset in both the method’s output and the ground truth list. For example, suppose an alignment method returns two aligned peaksets, {a, b, c, d} and {e, f, g}, as its output. When q = 2, this output can be enumerated into a list of 9 ‘alignment items’ comprising all pairwise combinations of features: {a, b}, {a, c}, {a, d}, {b, c}, {b, d}, {c, d}, {e, f}, {e, g}, {f, g}. Let M and G be the results of such enumeration from a method’s output and the ground truth respectively. Each distinct combination of features in M and G can be considered as an item during performance evaluation. Intuitively, the choice of q reflects the strictness of what is considered to be a true positive item, with larger values of q demanding an alignment method that produces results spanning more runs correctly.
For a given q, the following positive and negative instances of alignment item can now be defined for the purpose of performance evaluation:
True Positive (TP): items that should be aligned (present in G) and are aligned (present in M).
False Positive (FP): items that should not be aligned (absent from G) but are aligned (present in M).
True Negative (TN): items that should not be aligned (absent from G) and are not aligned (absent from M).
False Negative (FN): items that should be aligned (present in G) but are not aligned (absent from M).
In the context of alignment performance, precision is therefore the fraction of items in M that are correct with respect to G, while recall is the fraction of items in G that are aligned in M. A method with a perfect alignment output would have both precision and recall values of 1.0.
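The computation above can be summarised in a short sketch (helper names are our own, and the ground-truth peaksets are invented purely to illustrate the arithmetic):

```java
import java.util.*;

public class AlignmentEval {
    /** Enumerate all q-size combinations of features from each aligned peakset. */
    static Set<Set<String>> enumerate(List<List<String>> peaksets, int q) {
        Set<Set<String>> items = new HashSet<>();
        for (List<String> ps : peaksets) combine(ps, q, 0, new ArrayDeque<>(), items);
        return items;
    }

    static void combine(List<String> ps, int q, int start, Deque<String> cur, Set<Set<String>> out) {
        if (cur.size() == q) { out.add(new HashSet<>(cur)); return; }
        for (int i = start; i <= ps.size() - (q - cur.size()); i++) {
            cur.push(ps.get(i)); combine(ps, q, i + 1, cur, out); cur.pop();
        }
    }

    public static void main(String[] args) {
        // Method output from the example in Section 5.2; an invented ground truth.
        List<List<String>> method = List.of(List.of("a","b","c","d"), List.of("e","f","g"));
        List<List<String>> truth  = List.of(List.of("a","b","c"), List.of("e","f","g"));
        Set<Set<String>> M = enumerate(method, 2), G = enumerate(truth, 2);
        Set<Set<String>> tp = new HashSet<>(M); tp.retainAll(G);   // true positives = M ∩ G
        System.out.printf("precision=%.2f recall=%.2f%n",
            (double) tp.size() / M.size(),   // 6/9
            (double) tp.size() / G.size());  // 6/6
    }
}
```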
5.3 Benchmarking Method
We benchmark HDP-Align against two established alignment methods: SIMA [33] and MZmine2’s Join aligner [22]. The performance of both methods has been evaluated in past studies [1,14,22,32,33]. SIMA is a stand-alone program, while the Join aligner is an integral part of the MZmine2 suite, widely used for the pre-processing of LC-MS data. The selection of SIMA and Join as benchmark methods is motivated by the fact that both are direct-matching methods (and thus easily comparable to HDP-Align) yet differ sufficiently in how they establish the final alignment results: they differ in the distance/similarity function between peak features, the matching algorithm itself, and the merging order of pairwise results used to construct the full alignment results.
The two most important parameters to configure in both methods are the mass and RT tolerance parameters, used for thresholding and computing feature similarities during matching. We label these common parameters as the T(m/z) and Trt parameters. Note that despite the common label, each method may use the parameter values differently during the alignment process. In our experiments, we let T(m/z) and Trt vary within reasonable ranges (details in Section 5.4) and report all performance values generated by each combination of the two parameters.
5.4 Parameter Optimisations
Tables 3 and 4 describe the parameter ranges of each method during performance evaluation. For HDP-Align (Table 3), we perform the experiments based on our initial choices of appropriate parameter values; these are almost certainly less than optimal and could be tuned further.
For SIMA and Join, we report the results from all combinations of the mass and RT tolerance parameters within the reasonable ranges listed in Table 4. The ranges of the T(m/z) and Trt parameters used are based on values reported in [14] for the Proteomic dataset and in [32] for the Glycomic dataset. For the Metabolomic dataset, they were chosen in light of the mass accuracy and RT deviations of the data.
In HDP-Align, the mass cluster standard deviation is set to the equivalent value in parts-per-million (ppm): 500 ppm for the Proteomic dataset and 3 ppm for the Glycomic and Metabolomic datasets. The local (within-run) cluster RT standard deviation is assumed to be fairly constant and is set to 2 seconds for all datasets, while the global cluster RT standard deviation is set in a dataset-specific manner: 50 seconds for the Proteomic dataset and 20 seconds for the remaining datasets. The larger standard deviation is required for the Proteomic dataset to accommodate its greater RT drift across runs.
Other hyperparameters in HDP-Align are fixed to the following values: α′ = 10, αt = 10, αm = 100. The values of the precision hyperparameters for global cluster RT (σ0) and mass cluster (ρ0) are set to a broad value of 1/5E6. No significant changes were found to the results when these hyperparameters for the DP concentrations and cluster precisions were varied. The mean hyperparameters μ0 and ψ0 are set to the means of the RT and m/z values of the input data respectively. During inference for the Glycomic and Metabolomic datasets, 500 posterior samples were collected for the Gibbs sampling procedure, discarding the first 500 during the burn-in period. For the Proteomic dataset with larger RT deviations, 5000 posterior samples were obtained after discarding the first 5000 samples during burn-in. The number of samples is selected to ensure convergence during inference.
6 Results
Precision and recall values for the evaluated methods on the different datasets are shown in Sections 6.1 and 6.2. Additionally, an example of the further annotation of putative adduct types and metabolite identities that can be produced by HDP-Align is shown in Section 6.2. Running times of the evaluated methods are reported in Section 6.3.
6.1 Proteomic Results
Figure 4 shows the results from performance evaluation on the Proteomic dataset. We see that both benchmark methods (SIMA and Join) produce a wide range of performance depending on the chosen (T(m/z), Trt) parameter values. Sensitivity to parameter values is expected on this dataset due to the low mass accuracy of the MS instrument that produced the data and the high RT drift across runs (further details in [14]). HDP-Align performs well on several fractions (particularly fractions 040, 060, 080 and 100), with precision-recall performance close to the optimum attainable by the benchmark methods. On all fractions, HDP-Align is also able to produce higher-precision results than the benchmark methods by reducing recall through an appropriate setting of the threshold t. The primary benefit of quantifying alignment uncertainty is realised here: the well-calibrated probability scores on the matching confidence of aligned peak features produced by HDP-Align allow the user to choose the point along the precision-recall (PR) curve at which to operate. It is less obvious how this could be accomplished in the benchmark methods by varying the RT (Trt) and m/z (T(m/z)) thresholding parameters, if it is possible at all.
6.2 Glycomic and Metabolomic Results
Figures 5 and 6 show the results from experiments on the Glycomic and Metabolomic datasets. Similar to the Proteomic dataset, a wide range of precision-recall values can be observed in the results for the benchmark methods on the two datasets. The performance of HDP-Align, using the same set of parameters on both datasets, comes close to the optimal results from the benchmark methods, while still allowing the user to control the desired point along the precision-recall curve.
The results for the Glycomic dataset (Figure 5) also show how the measured precision-recall values change depending on the strictness of what constitutes an ‘item’ during performance evaluation. This is accomplished by gradually increasing the value of q (described in detail in Section 5.2), which determines the size of the feature combinations enumerated from a method’s output. For example, q = 2 considers all pairwise combinations of features from the method’s output during performance evaluation, while q = 4 considers all combinations of size 4, and so on. Figure 5 shows that as q is increased, parameter sensitivity becomes more of an issue for the benchmark methods, with more parameter sets having lower precision in the results. Across all values of q evaluated, the parameter pairs that produce the best alignment performance (points with high precision and recall) generally combine a small T(m/z) with a large Trt. Examples of parameter pairs that produce the best and worst performance for SIMA are shown in Figure 6. These results suggest the importance of high mass precision during matching. Importantly, we see from Figure 5 that the performance of HDP-Align remains fairly consistent as q is increased.
The Metabolomic dataset also provides us with additional results in the form of annotations of putative adduct types and metabolite identities. A thorough evaluation of the quality of such annotations, in comparison to e.g. the workflow proposed in [15], is beyond the scope of this paper and would likely necessitate a different and more appropriate evaluation dataset. Instead, we present an example of the further analysis performed by HDP-Align (as proposed in Section 4.4) on the resulting clustering objects after inference. Figure 7 shows a global RT cluster where peak features across runs have been grouped by their RT and m/z values. Within this global cluster, peak features are further separated into 6 mass clusters, corresponding to ionisation products produced by the global cluster during mass spectrometry. In Figure 7, mass clusters A and B contain features aligned from several runs but do not share a possible precursor mass with any other mass cluster. Mass clusters C and D share a common precursor mass (292.12696) and can thus be annotated with the adduct types that produce the transformation. Similarly, mass clusters E and F share a common precursor mass at 383.14278. Queries to a local KEGG database are issued based on these precursor mass values, producing several compound identities that can be putatively assigned to the global RT cluster. It is a great strength of our approach that this putative identification step arises naturally from the alignment results.
6.3 Running Time
The main factors affecting the running time of HDP-Align are the total number of peaks across all runs to be processed and the number of samples produced during Gibbs sampling. In each iteration of Gibbs sampling, HDP-Align removes a peak from the model, updates the parameters of the model conditioned on all other parameters, and reassigns the peak to RT and mass clusters. One such sweep therefore scales with the total number of peaks N across all runs, i.e. it is O(N). In practice, additional time is also spent on various necessary book-keeping operations, such as deleting empty clusters that are no longer required, updating internal data structures, and so on.
As a representative example, for the Glycomic dataset (N = 9344), HDP-Align requires approximately 5 hours to collect 1000 samples, whereas both SIMA and Join perform the alignment within 5 to 10 seconds. Similarly, for the Metabolomic dataset (N = 7477), HDP-Align produces its results in approximately 4 hours after collecting 1000 samples, while SIMA and Join complete within seconds. The running time of HDP-Align, while significantly longer than that of these two benchmark methods, is comparable to other computationally intensive steps (e.g. peak detection) in a typical LC-MS pipeline.
7 Discussion and Conclusion
We have presented a hierarchical non-parametric Bayesian model that performs direct matching of peak features, a problem of significant importance in the data pre-processing pipeline of large untargeted LC-MS datasets. Unlike other direct-matching methods, the novelty of our proposed approach lies in its ability to produce well-calibrated probability scores on the matching confidence of aligned peak features (evidenced by the increasing precision and decreasing recall as the threshold t is increased). This is accomplished by casting the multiple alignment of LC-MS peak features as a hierarchical clustering problem. Matching confidence can then be obtained from the probabilities of co-eluting peak features being assigned to the same mass component of the same global cluster. Experiments on datasets from real proteomic, glycomic and metabolomic experiments show that HDP-Align produces alignment results competitive with the benchmark alignment methods, with the added benefit of providing a measure of confidence in alignment quality. This can be useful in real analytical situations, where neither the optimal parameters nor the alignment ground truth is known to the user.
Through comparisons against benchmark methods, our studies have also investigated the effect of sub-optimal parameter choices on alignment performance. While beyond the scope of our paper, we agree with [27,28] that thorough investigations into the influence of the numerous configurable parameters (prevalent in nearly all LC-MS data processing pipelines) on the resulting biological conclusions are of utmost importance. This should be followed by the development of methods that minimise or automatically tune such configurable parameters. Despite the abundance of new methods proposed for LC-MS data pre-processing, relatively few studies have addressed quantifying uncertainties and alleviating the burden of parameter optimisation during actual data analysis. One way to minimise the number of parameters is to integrate multiple steps of the typical LC-MS pipeline into fewer steps. Our proposed model in HDP-Align can potentially be extended in this manner, as evidenced by the Metabolomic dataset results, where we directly use the clustering objects inferred by the model for further analysis of putative adduct and metabolite annotations. While the annotation approach proposed in Section 4.4 is fairly simple, it can easily be extended to more sophisticated annotation strategies, such as in CAMERA [12].
A primary weakness of HDP-Align lies in the long computational time required to produce results. Additional work will be required to reduce the computational burden of the model through various optimisation tricks and potentially by parallelising the Gibbs inference step using e.g. the method described in [17]. Another possibility is to adopt a different non-sampling-based inferential approach while still retaining the essence and benefits of the HDP model in this paper. The results presented in the current paper suggest the method shows enough promise to warrant the effort to speed it up.
Another aspect worthy of investigation is determining the most effective way to present and visualise the alignment probabilities produced by HDP-Align. Additional sources of information present in the LC-MS data, such as chromatographic peak shapes, can also be used to improve alignment performance and subsequent analyses that follow.
Finally, replacing or enhancing the mixture of mass components used in HDP-Align with a more appropriate mass model, such as that in MetAssign [4] that specifically takes into account the inter-dependency structure of peaks, is an avenue for future work. This will be particularly useful when extending the proposed model in HDP-Align into a single inferential model that encompasses many intermediate steps in a typical LC-MS data processing pipeline.
Acknowledgments
We would like to thank Glasgow Polyomics for providing us with the metabolomics data used for performance evaluation. JW was funded by a PhD studentship from SICSA. SR was supported by the BBSRC (BB/L018616/1).
Footnotes
* joe.wandy@glasgow.ac.uk