Abstract
Canonical distance and dissimilarity measures can fail to capture important relationships in high-throughput sequencing datasets since these measurements are unable to represent feature interactions. By learning a dissimilarity using decision tree ensembles, we can avoid this important pitfall. We used 16S rRNA data from the lumen and mucosa of the distal and proximal human colon and the stool of patients suffering from immune-mediated inflammatory diseases, and compared how well the Jaccard and Aitchison metrics preserve the pairwise relationships between samples against dissimilarities learned using Random Forests, Extremely Randomized Trees, and LANDMark. We found that dissimilarities learned by unsupervised LANDMark models were better at capturing differences between communities in each dataset. For example, differences in the microbial communities of the distal colon's lumen and mucosa were better reflected using the LANDMark dissimilarity (p ≤ 0.001, R2 = 0.476) than using the Jaccard distance (p ≤ 0.001, R2 = 0.313) or Random Forest dissimilarity (p ≤ 0.001, R2 = 0.237). In addition, applying Uniform Manifold Approximation and Projection to dissimilarity matrices and transforming the result using principal components analysis created two-dimensional projections that captured the main axes of variation while also preserving the pairwise distances between samples (e.g., ρ = 0.8804, p ≤ 0.001 for the distal colon dissimilarities). Finally, supervised LANDMark models tend to outperform both Random Forest and Extremely Randomized Tree classifiers. Models employing multivariate splits can improve the analysis of complex high-throughput sequencing datasets. The improvements observed in this work likely result from the ability of these models to reduce noise from uninformative features. In an unsupervised setting, LANDMark models can preserve pairwise relationships between samples.
When used in a supervised manner, these methods tend to learn a decision boundary that is more reflective of the biological variation within the dataset.
Author Summary Distance and dissimilarity measures are often used to investigate the structure of biological communities. However, our investigation into two commonly used distance measures, the Jaccard and Aitchison distances, demonstrates that these measures can fail to capture important relationships in microbiome communities. This is likely due to their inability to identify dependencies between features. For example, both the Jaccard and Aitchison metrics are unable to identify subsets of samples where the presence of one feature depends on another. Previous research has found that Random Forest embeddings can be used to create an alternative dissimilarity measure for dimensionality reduction in genomic datasets. We show that dissimilarities learned by decision tree ensembles, especially those using base-estimators capable of partitioning data using oblique and non-linear cuts, can be superior since these approaches naturally model these interactions.
Introduction
Biomarkers are objectively measurable characteristics of biological systems which can identify and provide evidence in favor or against a biological process or condition (1,2). For example, organisms that are present or absent in patients suffering from a disease, such as Crohn’s Disease, could be considered a biomarker if they can be used to predict the condition (3). Machine learning (ML) algorithms are being increasingly applied to a wide array of genomic, metagenomic, and transcriptomic data sets to identify relevant biomarkers and create predictive models of these datasets. When analyzing amplicon sequencing data one typical goal is to discover amplicon sequence variants (ASVs) associated with each of the biological communities being studied. For example, a recent study identified how impaired dopamine signaling in mice with a defective dopamine transporter gene alters the activity of metabolic pathways and the composition of the gut microbiome (4). Unlike approaches such as DESeq2 and MetagenomeSeq, ML models tend not to assume anything about the underlying distribution of each co-variate (5,6). Furthermore, many ML models, such as neural networks and Random Forests, can naturally model interactions between covariates (7,8). For these reasons, ML represents a potentially powerful way to identify biomarkers in high-throughput sequencing (HTS) data. Out of the myriad of available machine learning methods, Random Forests (RFs) and other decision tree ensembles have become very popular due to their good overall performance when working with high-throughput sequencing data. Furthermore, extensive tools and approaches have been designed which are starting to peel back the “black-box” veneer of these and other machine learning models (9). 
For example, RFs have been recently applied to study and identify operational taxonomic units (OTU), which can be considered a class of biomarkers, from the microbiomes of patients suffering from cardiovascular disease, chronic obstructive pulmonary disease, and various immune-mediated inflammatory diseases (3,10,11). These models, which are not linearly constrained, have been shown to generalize well to unseen data in more recent amplicon sequencing studies (12).
While machine learning has become incredibly popular and has led to important discoveries, biomarker selection using RFs and other commonly used approaches can be problematic due to various algorithmic assumptions. For example, each decision tree in a RF uses a recursive series of axis-orthogonal splits to approximate the underlying data generating function (13,14). However, more complex oblique or non-linear splits often result in more appropriate representations of the data generating function (12,14). Another classification algorithm, k-nearest neighbors, is sensitive to the number of neighbors and the choice of distance metric (15). Logistic regression, ridge regression, and linear support vector classifiers can only learn linear decision boundaries (12), while neural networks can require a large amount of data and time to learn appropriate weights for each parameter.
One aspect of RFs which has not been extensively explored is their ability to learn a dissimilarity measure when working in an unsupervised setting. Unsupervised RFs have previously been used to discover similar cell populations in single-cell RNAseq data, identify different classes of renal cell carcinoma tumors, and study the underlying structure of a population using shared genetic variation (16–18). This body of work has demonstrated that unsupervised RFs can identify important sources of variation between samples while remaining robust to noise and to problems stemming from the high dimensionality of high-throughput sequencing datasets. While these results lay the groundwork and demonstrate the utility of unsupervised RFs, they do not investigate the potential of multivariate decision trees in learning a similarity function. In this study, we investigate the ability of multivariate decision trees to learn a similarity measure and how this learned measure compares with standard distance measures. Finally, we examine how successful multivariate trees are at classifying and identifying biomarkers in two medically important human microbiome datasets.
Methods
Dataset Selection
Two human microbiome datasets were selected for inclusion in this study. The first was derived from the colons of healthy individuals (19) using 16S rRNA amplicon sequencing. This dataset collected samples from the unprepared colons of healthy individuals and was chosen since we could divide the dataset into four sets of comparisons (19). These comparisons examined differences in the abundance of OTUs between the microbial communities of the proximal lumen (RS) and mucosa (RB), the distal lumen (LS) and mucosa (LB), between the RS and the LS, and finally between the RB and the LB. The second dataset was chosen since it contains samples from patients who suffer from immune-mediated inflammatory diseases (IMID) (3). Differences between the microbiomes of patients suffering from Crohn's disease (CD), ulcerative colitis (UC), multiple sclerosis (MS), and rheumatoid arthritis (RA) were compared to healthy controls. Specifically, the work by Forbes et al. (2018) investigated if disease-specific taxonomic biomarkers, OTUs, could be identified in each patient's stool. In both studies, the authors used differential abundance testing and Random Forests to identify potential OTU biomarkers (3,13).
Bioinformatic Processing of Raw Reads
Raw sequences from two previously published datasets were obtained from the Sequence Read Archive (PRJNA450340 and PRJNA418115) (3,19). All bioinformatic processing of the raw reads was performed using the MetaWorks v1.8.0 pipeline (available online at: https://github.com/terrimporter/MetaWorks) (20). The default settings for merging reads were used except for the parameter controlling the minimum fraction of matching bases, which was increased from 0.90 to 0.95. This was done to remove a larger fraction of potentially erroneous reads. Merged reads were then trimmed using the default settings MetaWorks passes to CutAdapt. Since reads from PRJNA418115 were pre-processed and the primers removed (personal communication with Kaitlin Flynn, Ph.D. (kjflynn06@gmail.com) in January 2019), no reads were discarded during trimming. The remaining quality-controlled sequences were then de-replicated and denoised using VSEARCH 2.15.2 to remove putative chimeric sequences (21). Finally, VSEARCH was used to construct a matrix where each row is a sample and each column an Amplicon Sequence Variant (ASV). Taxonomic assignment was conducted using the RDP Classifier (version 2.13) and the built-in reference set (22).
ASVs which are likely to be contaminants, specifically those likely belonging to chloroplasts and mitochondria, were removed. From the remaining sequences, only those belonging to the domains Bacteria and Archaea were retained for further analysis. In the IMID dataset, only sequences assigned to Firmicutes, Actinobacteria, and Tenericutes were retained. This was done since the original study found that operational taxonomic units assigned to other bacterial groups were underrepresented (3). Following the initial processing steps, ASVs with a bootstrap support of 0.8 or higher were chosen for further analysis. The cutoff of 0.8 for the V4 rRNA region sequenced in the 16S dataset was chosen because fragments of ∼200 bp in length are likely to be assigned to the correct genus 95.7% of the time (23). A site-by-ASV count matrix, where each row is a sample and each column an ASV, was created using this data. The matrix was filtered to retain only ASVs found in three or more samples. This filtration step was taken since reducing the size of the feature space can often lead to a more generalizable model (24–26).
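The prevalence filter described above can be sketched in a few lines of NumPy; the count values here are purely illustrative:

```python
import numpy as np

# Toy site-by-ASV count matrix (hypothetical values): rows are samples,
# columns are ASVs.
counts = np.array([
    [5, 0, 2, 0],
    [3, 0, 0, 0],
    [0, 1, 4, 0],
    [2, 0, 7, 1],
])

# Keep only ASVs detected (count > 0) in three or more samples.
prevalence = (counts > 0).sum(axis=0)
filtered = counts[:, prevalence >= 3]
```

In this toy table, only the first and third ASVs appear in at least three samples, so `filtered` retains two columns.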
The filtered matrix must be transformed in such a way as to minimize the impact of various technical factors, such as differences in library size (27). Our unsupervised and supervised analyses examined two transformations of the filtered matrix. The first was the presence-absence transformation. This transformation is useful since it reflects whether ASVs are present or absent in a sample, and it minimizes the impact of technical errors such as differences in library size and the uneven amplification of DNA. The second, the centered-log-ratio (CLR) transformation, was used since it effectively addresses the fact that amplicon-sequencing data is compositional (24,28). When searching for biomarkers, the transformation which resulted in the best generalization performance was used.
Training of Unsupervised Models
Tree-based models are an effective means of capturing the similarity between samples. The similarity matrix, S, can be constructed by calculating how often samples co-occur in the terminal leaves of each decision tree. This co-occurrence, S(xi, xj), is a similarity and can be found using Equation One: S(xi, xj) = (xi · xj) / N, where xi and xj are the vector representations of all terminal node positions of samples xi and xj in the forest and N is the total number of trees in the forest. The similarity matrix, S, is then converted into a dissimilarity matrix, D, using Equation Two: D(xi, xj) = √(1 − S(xi, xj)) (17). This dissimilarity measure, while not a metric such as the Jaccard distance (29), can be used to investigate beta-diversity and can be constructed using either a supervised or an unsupervised approach (17). To use decision tree ensembles in an unsupervised manner, a second dataset is created such that the columns (ASVs) are randomly permuted. In the case of the CLR-transformed data, the original counts were permuted before the CLR transformation. The samples in the permuted dataset are assigned a label of "0" while samples in the original dataset are assigned a label of "1". The classifier is then tasked with finding the difference between the permuted and original data. RF and ET classifiers were used at their default settings, except for the number of trees, which was set to 128 (30). LANDMark (Oracle) models were trained using 128 trees and with the number of features considered at each node set to √n, where n is the number of features in the filtered dataset. This was done to generate a more diverse set of trees. To avoid generating proximity matrices that are biased due to a lucky permutation, we created 100 different unsupervised proximity matrices using Equation One and combined them using Equation Two to create a dissimilarity matrix.
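A minimal sketch of this permutation procedure, using scikit-learn's ExtraTreesClassifier as the ensemble and D = √(1 − S) as the similarity-to-dissimilarity conversion (function and variable names here are illustrative, not from the LANDMark codebase):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def unsupervised_dissimilarity(X, n_trees=128, seed=0):
    # Build the permuted copy: shuffle each column (ASV) independently.
    rng = np.random.default_rng(seed)
    X_perm = np.column_stack([rng.permutation(col) for col in X.T])
    # Label real samples "1" and permuted samples "0", then train.
    X_all = np.vstack([X, X_perm])
    y = np.r_[np.ones(len(X)), np.zeros(len(X))]
    clf = ExtraTreesClassifier(n_estimators=n_trees, random_state=seed)
    clf.fit(X_all, y)
    # Terminal-node ids reached by the real samples in every tree.
    leaves = clf.apply(X)
    # Equation One: fraction of trees in which two samples share a leaf.
    S = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
    # Equation Two: convert the similarity matrix to a dissimilarity.
    return np.sqrt(1.0 - S)

X = np.random.default_rng(1).normal(size=(12, 6))
D = unsupervised_dissimilarity(X)
```

The resulting matrix is symmetric with a zero diagonal; in the full procedure this construction would be repeated over many permutations and the similarities averaged before conversion.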
Analysis of Beta-Diversity
Dissimilarity and distance matrices were used as input for PerMANOVA and a principal coordinates analysis (PCoA). A Uniform Manifold Approximation and Projection (UMAP) using the dissimilarity and distance matrices was also conducted (31). The UMAP algorithm was chosen since it projects a high-dimensional graph of the input data into a lower-dimensional Euclidean space. This algorithm can create potentially better representations of the sampling space since high-throughput sequencing data can lie on a complex high-dimensional manifold (31). Finally, the pairwise distances between samples in the UMAP embedding were calculated and used by PCoA to embed the UMAP projection into a two-dimensional space (32). Spearman’s rho was used to measure the distortion between the embeddings and the original distances/dissimilarities.
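The PCoA step and the distortion check can be sketched with a classical-scaling implementation. UMAP itself (from the umap-learn package) is omitted here to keep the sketch dependency-light, and Euclidean distances stand in for a learned dissimilarity:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

def pcoa(D, n_components=2):
    """Classical PCoA: double-center the squared dissimilarities and
    eigendecompose; negative eigenvalues are clipped to zero."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0.0, None))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
D = squareform(pdist(X))      # stand-in for a learned dissimilarity matrix
coords = pcoa(D)              # two-dimensional embedding
rho, _ = spearmanr(pdist(X), pdist(coords))  # distortion measure
```

Spearman's rho between the original and embedded pairwise distances is the same distortion measure reported in the Results.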
Assessment of Supervised Model Generalization Performance
Following our investigation of beta-diversity using similarity measures derived from unsupervised models, we assessed the generalization and feature selection performance of the LANDMark (Oracle), ET, and RF classifiers (33,34). Thirty different train-test splits, with the classes in each set being proportional to those in the original, were created for each metabarcoding dataset. 50% of the original data was used to construct each training set and, for reproducibility, the random state used to create each train-test split was set to the iteration number of the split. RF and ET classifiers were used at their default settings, apart from the number of trees, which was set to 128 (30). LANDMark (Oracle) models were also trained using 128 trees and, as in the unsupervised learning, the number of features considered at each node was set to √n. The remaining 50% of the data were used to calculate the balanced accuracy score using Scikit-Learn (33). This process was repeated for presence-absence and CLR-transformed data. Unless otherwise stated, the analysis of the IMID data was conducted using the first time point. This was done to avoid inflating the balanced accuracy scores, since the microbiomes across time were found to be highly similar.
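The repeated stratified splitting scheme can be sketched as follows; synthetic data stands in for the transformed ASV table, and a Random Forest stands in for any of the three classifiers:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative stand-in for a transformed ASV table.
X, y = make_classification(n_samples=60, n_features=20, random_state=0)

scores = []
for i in range(30):
    # Stratified 50/50 split; the split's random state is its iteration
    # number, as described above.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=i)
    clf = RandomForestClassifier(n_estimators=128, random_state=i)
    clf.fit(X_tr, y_tr)
    scores.append(balanced_accuracy_score(y_te, clf.predict(X_te)))
```

The thirty balanced accuracy scores are what feed into the Bayesian comparison of models described below.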
The transform (presence-absence or CLR) resulting in the best generalization performance was used during feature selection. ASV features were selected using recursive feature elimination (RFE) followed by RFE with 5-fold stratified cross-validation. RFE was first used to find a set of 200 predictive features; this step aimed to remove ASVs with little predictive value. Following this, RFE with 5-fold stratified cross-validation was used to create a more distilled subset of at least 20 predictive ASVs. The step size for each round of feature elimination was set to 5%. Each iteration's test set was used to evaluate the predictive balanced accuracy of the final model. The subsets of ASVs from the best performing iteration were chosen for further analysis and display. Shapley scores, calculated using the 'Explainer' function of the Python 'shap' package, were used to identify the ASVs which strongly impacted the prediction of each sample (35). The 'shap' package was also used to generate decision heatmaps which display the impact each ASV has on prediction. When this process was used to analyze IMID data, only samples from the first time point were used as input into RFE. However, Shapley scores were calculated twice. The first set of scores was calculated using each iteration's test data. The second set was calculated using the first time point as the background dataset and the second time point as the testing data. A Bayesian analysis, using Nadeau and Bengio's corrected t-test implemented in the Python 'baycomp' package, was used to compare the generalization performance of models before and after feature selection (36). The region of practical equivalence (ROPE), the probability of two models having equivalent performance, was defined as a difference in score within ±0.025. Although the size of this region is arbitrary, it was chosen since it represents the impact of two classification errors.
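The two-stage feature elimination can be sketched with scikit-learn's RFE and RFECV (the Shapley-score steps are omitted here; an Extremely Randomized Trees model stands in for the chosen classifier, and the data are synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE, RFECV
from sklearn.model_selection import StratifiedKFold

# Stand-in for a transformed ASV table with many uninformative columns.
X, y = make_classification(n_samples=80, n_features=300, n_informative=10,
                           random_state=0)

base = ExtraTreesClassifier(n_estimators=25, random_state=0)

# Stage one: plain RFE down to 200 candidate features, removing 5% of
# features per elimination step.
stage_one = RFE(base, n_features_to_select=200, step=0.05).fit(X, y)
X_reduced = stage_one.transform(X)

# Stage two: RFE with 5-fold stratified cross-validation, keeping at
# least 20 features.
stage_two = RFECV(base, min_features_to_select=20, step=0.05,
                  cv=StratifiedKFold(n_splits=5)).fit(X_reduced, y)
```

`stage_two.support_` then marks the distilled ASV subset used for the final model and for Shapley-score analysis.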
Finally, the structure of the decision space was investigated to ascertain how well each model learns an appropriate decision boundary (14,37).
Results
The Choice of Transformation and Dissimilarity Measure Can Result in Different Interpretations of Amplicon Sequencing Data
When LANDMark (Oracle), ET, and RF classifiers were trained to differentiate between real and randomized samples, statistically significant differences between sampling locations were detected when using each model’s dissimilarity matrix (Table 1). These tests demonstrated that the most suitable choice of transformation depends on the dataset. For example, the main effect (sampling location) clearly explained a greater fraction of the variance when using the presence-absence transformation in each subset of the healthy gut data. For the IMID data, the CLR transformation was the better choice. These tests also demonstrate that unsupervised models, such as LANDMark (Oracle), can capture information that distinguishes samples, especially when trained using appropriately transformed data. To test if the number of features used has an impact on the explanatory ability of the main effect, we created multiple dissimilarity matrices where the number of features considered at each node was N, 2N, 4N, 8N, and 16N. Here N is equal to the square root of the number of ASVs. This investigation revealed that the explanatory ability of the main effect in each dataset appears to be sensitive to the number of features explored at each node (Figure 1). Interestingly, there appears to be an inverse relationship between LANDMark and the RF and ET models. Finally, the amount of explained variance along the first principal coordinate tended to be greater when using LANDMark (Oracle) dissimilarities. The spread of samples along this axis also tended to reflect differences in sampling location/disease phenotype (Figures 2 and 3). These results are particularly surprising since these matrices were created without using any of the metadata.
UMAP followed by PCoA is Effective at Creating Ordinations of the Investigated 16S rRNA Datasets
In PCoA projections of the original dissimilarity matrices, little to no correlation between distances in the original and projected spaces was observed (Figures 4 and 5 A, D, G, J). However, there is a trend where the most dissimilar pairs of samples could be found on the right side of each PCoA plot. Projections of each original dissimilarity matrix by UMAP, however, appear to better reflect the topology of the input space since distances between samples in the original and projected spaces appear to be correlated (Figures 4 and 5 B, E, H, K). Simply put, if the distance between two samples was large in the original space it also tended to be large in the UMAP space. Furthermore, Spearman's rho tended to be highest in the UMAP projections of LANDMark (Oracle) dissimilarities, suggesting that this approach is particularly effective at preserving relationships between samples (Figures 4 and 5 E). In one dataset (LB vs LS), projecting the pairwise distances derived from the original LANDMark (Oracle) dissimilarities resulted in the formation of two distinct groups (Figure 1). This can be explained as inter-class variation being greater than intra-class variation in this subset, an observation supported by the PerMANOVA results (see Table 1). This was also observed in other subsets, though not to such an extreme degree. Finally, unlike the PCoA projections of the original dissimilarities, a two-dimensional PCoA embedding of the UMAP distances does not result in a notable difference in the pairwise dissimilarities between samples (Figures 4 and 5 C, F, I, L).
The Choice in Data Transformation Could Impact Generalization Performance
When training using all features, generalization performance in the different subsets of the healthy gut dataset differed depending on the transformation. When training LANDMark (Oracle), ET, and RF models on the healthy-gut dataset, a Bayesian analysis showed that the presence-absence transformation is more likely to yield a model with better generalization performance in nearly all subsets of the data (Table 2). ET and RF models did perform better when trained on CLR-transformed data in the RS-LS subset. However, this is unlikely to matter since no model was able to learn a way to distinguish RS samples from LS samples regardless of transformation. Since the presence-absence transformed data was more likely to generate better models, we investigated if there would be any practical difference between models. In the IMID datasets, generalization performance appeared to depend on both the choice of transformation and the classification model. For example, RF and ET models performed better when trained on presence-absence transformed data in the MS-HC subset, and the performance of these models is likely to be equivalent in the RA-HC and UC-HC subsets regardless of transformation (Table 2). However, the performance of LANDMark (Oracle) was best on CLR-transformed data across all subsets.
The Supervised LANDMark (Oracle) Classifier Learns Better Decision Rules than the Random Forest and Extremely Randomized Trees Classifiers
Supervised LANDMark’s ability to split samples into their respective classes using multiple features resulted in clearer separations between classes (Figure 6). The decision boundaries learned by LANDMark were also less influenced by the peculiarities of the RF or ET classifiers. For example, an arcing effect was observed in the PCoA projection of the decision space of the RF classifier (Figure 6, Right Panel) while no such pattern could be observed in the decision space of the LANDMark classifier (Figure 6, Left Panel). Regardless of which classifier was used, the first principal component in each PCoA projection explained a large amount of the variance in the decision space. This suggests that each classifier can learn good decision rules which separate different classes of samples (14,37). However, due to the small number of samples, the PCoA results for the higher components should be interpreted with some caution. Finally, LANDMark (Oracle) models tend to be as good or better than RF or ET models since they appear to generalize better (Tables 3 - 5).
ASVs Predicted to Have a High Impact on Model Performance are Consistent with Previously Reported Results
The ASVs identified using LANDMark (Oracle) and RFE in the LB-LS subset of the healthy gut dataset are generally consistent with what was reported by Flynn et al. (19). We confirmed that Turicibacter spp., Peptoniphilus spp., and Finegoldia spp. play a role in differentiating these two sites (19) (Suppl Figures 1 and 2). However, the results suggest that the individual impact that these ASVs have on classification is somewhat muted. Also, the differences in overall importance may be due to the experimental design since we built our models using 50% of the dataset. The ASV which had the strongest influence on generalization performance in test samples, ASV 317, belonged to Schaalia spp. and was not originally identified as important. Interestingly, ASV 576 (assigned to Anaeromassilibacillus spp.) was only present in one test sample but its absence strongly shifted the predictions of the model towards both types of samples, suggesting a possible interaction between one or more ASVs. Currently, it is difficult to determine interactions between ASVs using LANDMark. To investigate potential interactions involving ASV 576, an Extremely Randomized Trees model with 2048 trees was trained. This approach was chosen since it has been shown to approximate a non-linear function as the number of trees increases (33,34). While classification was not perfect (balanced accuracy score of 0.9) this follow-up analysis did confirm that ASVs 317 (Schaalia), 457 (Enterocloster), 429 (Faecalicatena), 120 (Veillonella), 610 (Eisenbergiella), and 249 (Lawsonibacter) primarily impact classification and that the effect of ASV 576 is likely an artifact (Suppl Figure 3).
We identified a group of ASVs which are important for distinguishing between CD and HC samples. ASVs belonging to Gemmiger, Coprococcus, and Lachnospiracea incertae sedis were included in this group. Furthermore, the genera identified by our model are consistent with those reported in the original work (3). A lower abundance of ASVs 18, 64, 36, 95, 187, and 92 shifts model predictions away from HCs. These ASVs were assigned to the genera Gemmiger, Coprococcus, and (for the remainder) Blautia, respectively. Interestingly, a higher abundance of these ASVs did not result in a strong shift towards the prediction of a HC. An increase in the abundance of ASV 39 (Lachnospiracea incertae sedis) shifts predictions towards CD. A sixth ASV, assigned to the genus Monoglobus, a taxon that was not previously identified as important, was identified in our analysis (Figure 7). While a detailed discussion of Monoglobus is outside the scope of this work, this species has been shown to be involved in pectin degradation, and the metabolites produced from these pathways are important mediators of the inflammatory response (38,39). Within test samples from the first time point, a higher abundance of this ASV tended to shift some predictions towards healthy controls while a lower abundance tended to shift predictions away from healthy controls. In a follow-up analysis using the second time point, however, the impact this ASV had on model predictions was considerably more muted (Suppl Figure 4). Finally, our analysis identified a group of additional ASVs (which included taxa such as Terrisporobacter, Neglecta, and Roseburia) where a decrease in abundance tends to shift predictions towards CD. The overall influence that these ASVs exert on prediction is smaller, however.
Discussion
The datasets investigated here were chosen since the human gut microbiome is an important area of medical research and is becoming increasingly linked to important disease phenotypes. Since machine learning models are increasingly being used to identify predictive features, it is important to understand how the quality and interpretation of results change depending on the machine learning model. This will hopefully allow greater insights into the composition and function of the human microbiome. The choice of transformation and dissimilarity measure is an important consideration when investigating microbiome data. It has long been known that the choice of dissimilarity measure can influence our measurement and interpretation of the main gradients influencing the structure of communities and the taxonomic similarity between pairs of samples (40,41). For example, recent investigations have demonstrated that this choice can result in misleading results due to the sparsity inherent to the data, and differences in library size and sampling (24,27,42). To combat these problems a multitude of dissimilarity measures and ordination approaches have been developed to summarize and visualize ASV differences between sites (41). However, this toolkit remains incomplete since distance metrics and other commonly used dissimilarity measures have difficulty capturing potential interactions between ASVs. For example, the Jaccard distance simply calculates the number of shared ASVs over the total number of unique ASVs between two communities, and it fails to consider how dependencies between ASVs influence the structure of a community. An example of such a dependency occurs when the presence of one ASV depends on the exclusion of another (43). Furthermore, when using measures that incorporate abundance information, it is simple to show how differences in abundance can result in situations where sites that share the same species are more dissimilar than sites that have no species in common.
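This abundance effect can be made concrete with a toy example using the Euclidean distance (the geometry underlying the Aitchison framework); the abundances are hypothetical:

```python
import numpy as np

# Hypothetical abundances for three sites and two species.
site_a = np.array([100.0, 0.0])  # species 1, very abundant
site_b = np.array([1.0, 0.0])    # species 1, rare: shares a species with A
site_c = np.array([0.0, 1.0])    # shares no species with A or B

d_shared = np.linalg.norm(site_a - site_b)    # sites with a species in common
d_disjoint = np.linalg.norm(site_b - site_c)  # sites with none in common

# d_shared (99.0) greatly exceeds d_disjoint (about 1.41): abundance
# differences alone can make compositionally overlapping sites look far
# more dissimilar than sites sharing nothing.
```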
While applying transformations, such as CLR or converting to presence-absence, can help in these situations, a review of the literature suggests that there is yet to be a consensus on which approach is best (24,41,44,45). Our results are also unclear in this matter and suggest that the best choice in transformation will depend on both the dataset and model being used. For example, our results suggest that the presence-absence transformation may be better suited when samples come from (or are suspected to come from) two or more distinct ecological niches, such as the lumen and mucosa of the colon (46). This likely occurs since differences between these communities are dominated by changes in the presence and absence of specific organisms rather than abundance. However, when analyzing changes occurring within similar niches, such as those derived from stool, the CLR transformation may be more useful since it is sensitive to changes within compositions (28,47).
Alternative approaches to measuring pairwise dissimilarity, such as learning a dissimilarity measure, have also been developed and applied to the analysis of genomic and transcriptomic datasets (13,16,17,29). Unfortunately, while the properties of various dissimilarity measures have been extensively investigated, comparatively little work has been done exploring how learned dissimilarity measures can be used to investigate the same data. They are particularly interesting since they can learn a representation of the underlying manifold upon which the input samples are embedded (29,48). Given that amplicon sequencing datasets tend to lie on such manifolds, using learned dissimilarities could represent a potentially powerful way to analyze these datasets. Furthermore, since these dissimilarity matrices are derived from decision tree ensembles, interactions between ASVs are potentially accounted for, thereby overcoming one of the weaknesses of distance metrics (7,43,48). Therefore, using learned dissimilarities could result in the construction of more informative ordinations.
Our experiments show that a PCoA, on its own, is not able to adequately project samples into an appropriate embedding. This occurs since PCoA is a type of matrix factorization algorithm, and it is difficult to construct a linear representation in cases where the input manifold is non-linear. In these cases, PCoA cannot adequately preserve relationships between samples and the resulting projection will not effectively capture important aspects of the data. This is evident in Figures 2 and 3, which demonstrate that the first two principal axes of each PCoA projection of the original dissimilarities explain only a small fraction of the variation in each dataset. This is further underscored by the data presented in panels A, D, G, and J of Figures 4 and 5. These experiments clearly show that PCoA only rotates the input space and does not preserve the pairwise dissimilarities between samples in the resulting projection. Graph algorithms, such as UMAP, are an attractive alternative since these approaches are designed to learn an appropriate representation of the input manifold. Our experiments, evidenced in Figures 4 and 5, show that UMAP (and UMAP followed by PCoA) preserves the relationships between samples in the projected space since the pairwise dissimilarities in the original and projected spaces are correlated (31,49). Simply put, if the distance or dissimilarity between a pair of samples is large in the original space it tends to be large in the projected space. Applying these algorithms to our datasets allowed us to effectively visualize the relationships between samples, specifically differences in sampling location, with minimal distortion. Our results also support the growing body of work showing that UMAP preserves the overall structure of HTS datasets and that it is more capable of representing sources of biological variation than PCoA (32).
Finally, since the number of components used to construct a UMAP projection is arbitrary, we strongly suggest running a grid search over two UMAP hyperparameters, the number of components and the number of neighbors, so that the projection which best preserves the pairwise dissimilarities between samples can be identified.
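The suggested grid search amounts to scoring each candidate projection by the correlation between original and embedded pairwise distances. The sketch below demonstrates the evaluation loop; to keep it self-contained it uses scikit-learn's PCA and Gaussian toy data as stand-ins, but with the umap-learn package installed, `umap.UMAP(n_components=k, n_neighbors=n)` can be substituted directly (both follow the fit_transform estimator convention).

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 30))        # toy stand-in for the input samples

orig = pdist(X)                      # pairwise distances in the original space

# Score each candidate dimensionality by how well the embedding
# preserves the original pairwise distances (Spearman correlation).
best = (None, -np.inf)
for k in (2, 3, 5, 10):
    emb = PCA(n_components=k, random_state=0).fit_transform(X)
    rho, _ = spearmanr(orig, pdist(emb))
    if rho > best[1]:
        best = (k, rho)

print(best)  # (best number of components, its correlation)
```

For UMAP, the loop would iterate over a product of `n_components` and `n_neighbors` values, retaining the setting with the highest correlation.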
The dissimilarity matrices learned by unsupervised LANDMark (Oracle) resulted in projections that more clearly distinguished between the known main effects (sampling location and disease phenotype) (Table 1). Also, as the number of features used for splitting in LANDMark (Oracle) increased, so did the explanatory power of the main effects. This result demonstrates that distance metrics, such as the Jaccard or Aitchison metrics, may not capture important differences between samples as readily as learned dissimilarities. One possible explanation is the inclusion of an increasing number of irrelevant dimensions as the dimensionality of the dataset grows (50,51). In amplicon sequencing datasets, irrelevant dimensions likely arise from uninformative ASVs, potentially informative but highly variable ASVs, the splitting of a single genome across multiple ASVs, and missing data (24,27,52,53). Learned dissimilarity measures, such as those explored here, may be capable of identifying and reducing the impact uninformative ASVs exert when measuring dissimilarity. For example, in an RF classifier only the ASV that results in the best split is chosen at each node (13). Therefore, the impact of uninformative ASVs tends to be minimized since they are rarely selected. LANDMark (Oracle) extends this idea by identifying which linear or non-linear model best discriminates between classes using a randomly selected coalition of ASVs (37).
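One widely used way to obtain a tree-ensemble dissimilarity without labels is the synthetic-class trick: permute each feature independently to destroy feature interactions, then train the ensemble to distinguish real from permuted samples. The sketch below illustrates this idea with a scikit-learn Random Forest on toy count data; it is an analogy for how an unsupervised ensemble can be trained, not the specific procedure used by LANDMark (Oracle).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.poisson(3.0, size=(50, 25)).astype(float)   # toy ASV-like counts

# Synthetic class: permute each column independently so that marginal
# distributions are kept but interactions between features are destroyed.
X_syn = np.column_stack([rng.permutation(col) for col in X.T])
X_all = np.vstack([X, X_syn])
y_all = np.r_[np.ones(len(X)), np.zeros(len(X_syn))]

forest = RandomForestClassifier(n_estimators=128, random_state=0).fit(X_all, y_all)

# Dissimilarity among the *real* samples from shared-leaf proximity.
leaves = forest.apply(X)
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
dissimilarity = 1.0 - proximity
```

Splits that merely separate real from permuted data must exploit the joint structure of the features, so the resulting proximities reflect interactions that element-wise distance metrics cannot represent.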
We show that oblique decision tree ensemble classifiers, such as LANDMark (Oracle), can result in highly predictive models. Across our experiments, a LANDMark (Oracle) classifier was likely to perform at least as well as the ET or RF classifiers. Furthermore, when compared to the RF and ET classifiers, feature selection was less likely to impact the generalization performance of a LANDMark (Oracle) classifier (Table 3). This result is important since it suggests that LANDMark (Oracle) is more robust to noise, especially when trained on CLR-transformed data. It is also important to consider the shape of the decision boundaries learned by these classifiers. Both the RF and ET classifiers produce a blocky boundary since each is only capable of learning axis-aligned splitting rules, although the boundary learned by ET tends to be smoother due to the random selection of cut-points (14,34). Smoother boundaries are preferred since they are likely to be a more faithful approximation of the rules which generate the data being studied (14,54). While the performance of all three models was similar in some instances, we noted issues with the decision boundaries in these cases. Specifically, we observed structures in the higher components of PCoAs computed from the proximity matrices of supervised RF and ET models. These structures were absent in LANDMark (Oracle) models, implying that a smoother boundary was learned. This is consistent with other work involving this class of classifiers (14,37).
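The advantage of oblique splits is easy to see on a toy problem whose true boundary is not axis-aligned. In the hypothetical example below, the class boundary is the diagonal of the unit square; a single axis-aligned split (a depth-1 CART stump) can only approximate it, whereas one oblique split, represented here by a logistic regression hyperplane over both features, recovers it almost exactly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 2))
y = (X[:, 0] > X[:, 1]).astype(int)   # true boundary is the diagonal x0 = x1

# One axis-aligned split versus one oblique (multivariate) split.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
oblique = LogisticRegression(max_iter=1000).fit(X, y)

s_axis = stump.score(X, y)     # roughly 0.75 in expectation for this problem
s_obl = oblique.score(X, y)    # near perfect
print(s_axis, s_obl)
```

An ensemble of axis-aligned trees must stack many such stumps to carve out a staircase approximation of the diagonal, which is the source of the blocky boundaries discussed above.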
The generalization performance of our models tended to differ from that reported in the original work (3,19). We believe these differences arose from differences in methodology: the use of ASVs, our choice of transformation, and our use of repeated split-half cross-validation. Since we chose to analyze ASVs instead of OTUs, the dimensionality of our dataset increased substantially. For example, the original IMID study used 383 OTUs while our study found 702 ASVs (3). While using ASVs can provide richer information, generalization performance may degrade if ASVs artificially split bacterial genomes into different clusters (52). This occurs since the signal from one unique strain is spread over multiple ASVs. While this can lower classification performance, the choice is justifiable since the results of our analysis are reproducible and the ASVs we identified as important can be used to generate new hypotheses for future experiments (55). The number of trees used to train our models and the way generalization performance was calculated also differed. The original IMID work used 500 trees and calculated generalization performance using the out-of-bag error, while the work by Flynn et al. (2018) used non-rarefied data as input and measured generalization performance using AUC scores (3,19). In contrast, we used 128 trees and split our data into training and testing sets using repeated split-half cross-validation. Previous work has demonstrated that the performance of an RF tends to plateau after 128 trees (30,37), and additional testing using various subsets of the IMID dataset confirmed that adding more trees is unlikely to result in substantially better performance (Suppl Table 1). Finally, the most significant contributor to differences in generalization performance is likely our choice to use repeated split-half cross-validation.
This approach is expected to result in decreased generalization performance since fewer samples are used for training. However, the advantage of this approach is that the overlap between training datasets is minimized (56). This reduces the dependence between different estimates of generalization performance thereby improving the ability to detect a true difference between the generalization performance of two classifiers (56). An additional advantage of using split-half cross-validation is that we can use more testing samples to calculate feature importance scores.
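Repeated split-half cross-validation can be expressed concisely with scikit-learn's resampling utilities. The sketch below is illustrative (synthetic data, an Extremely Randomized Trees classifier as the model, and hyperparameters chosen for brevity rather than matching the manuscript's exact pipeline): each repetition holds out 50% of the samples, so training sets overlap less between repetitions than they would under k-fold cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Toy stand-in for a transformed ASV table with class labels.
X, y = make_classification(n_samples=80, n_features=30, n_informative=6,
                           random_state=0)

# Repeated split-half CV: 10 repetitions, each training on one random
# half of the data and testing on the other half (stratified by class).
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.5, random_state=0)
scores = cross_val_score(
    ExtraTreesClassifier(n_estimators=128, random_state=0), X, y, cv=cv)

print(scores.mean(), scores.std())
```

The larger held-out half also yields more test samples per repetition, which is what enables the feature importance calculations described above.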
The ASVs identified as important by LANDMark (Oracle) are consistent with those identified in the original studies. This not only confirms the viability of LANDMark (Oracle) in this area of research but also strengthens the original work, since its findings were replicated using a very different approach. Our work also demonstrates that classifiers such as LANDMark can not only validate the results of the original studies but also add new insights. For example, in the LS-LB investigation LANDMark (Oracle) identified Schaalia spp. as an important marker capable of distinguishing between the proximal lumen and mucosa of the colon. In addition to detecting strongly predictive ASVs, our approach was also capable of detecting ASVs which have a more subtle effect on distinguishing CD patients from healthy controls and on determining whether samples originated in the distal or proximal colon. Finally, while detecting single-ASV biomarkers is important, we should remain cognizant that these organisms interact with each other and with the host. Therefore, when building and analyzing predictive models it is important to use approaches that can explore, quantify, and validate these interactions.
Both the RF and ET classifiers identified fewer important ASVs than LANDMark (Oracle). The larger number of ASVs identified by LANDMark (Oracle) is likely due to differences in how nodes are constructed. In RF and ET classifiers, only a single feature is used to construct each node (13,34). Therefore, only a very small fraction of features (at most n−1, where n is the number of samples) will be used to construct each tree. In practice, even fewer features are likely to be used if particularly good splits are found early, and features may be reused at deeper nodes within each tree. This form of tree construction has also been shown to have a strong regularizing effect, which could limit the amount of information upon which decisions are made (57). While a regularization effect similar to that observed in RF and ET likely also occurs in LANDMark, it may be more muted because LANDMark considers more features at each node (37).
This allows richer information to be used to construct each tree but comes at the cost of including features that may have a limited impact on classification. For this reason, we believe it is particularly important to pair LANDMark models with model-agnostic introspection algorithms, such as the Permutation Explainer, which can quantify feature importance and interactions between features (58). It is also important to note that genome splitting could contribute to this effect (52). For example, multiple ASVs were assigned to Peptoniphilus in the LS-LB data and to Blautia and Coprococcus in the CD-HC data. Therefore, additional work is needed to determine the extent of this issue in 16S datasets and how best to handle it.
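The core idea behind such model-agnostic introspection is simple: shuffle one feature at a time on held-out data and record how much the model's score drops. The sketch below illustrates this with scikit-learn's `permutation_importance` on synthetic data; it is a simpler relative of the Permutation Explainer cited above, not the manuscript's exact analysis, and the model and data here are stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Toy dataset with a handful of genuinely informative features.
X, y = make_classification(n_samples=100, n_features=15, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=128, random_state=0).fit(X_tr, y_tr)

# Permute each feature on the held-out half; the mean drop in accuracy
# is a model-agnostic estimate of that feature's importance.
result = permutation_importance(model, X_te, y_te, n_repeats=20,
                                random_state=0)
ranked = result.importances_mean.argsort()[::-1]
print(ranked[:4])   # indices of the top-ranked features
```

Because the importance is computed on held-out samples, uninformative features that were never truly predictive receive scores near zero even if the model occasionally split on them.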
Conclusions and Future Work
Our work has shown that unsupervised LANDMark (Oracle) models can learn effective dissimilarity matrices. When paired with modern dimensionality reduction approaches, such as UMAP, the global structure of the original dissimilarity matrix is preserved. UMAP representations can then be combined with existing matrix factorization approaches to create informative ordinations. However, this comes at a cost of interpretability, since it is difficult to determine how the variance along each axis relates to the presence/absence or abundance of each ASV. Therefore, future work should investigate approaches capable of identifying which ASVs determine the location of samples in the transformed space. Finally, we show that LANDMark (Oracle) can learn highly predictive models after feature selection. Importantly, the ASVs identified by feature selection are consistent with contemporary work. Due to the way LANDMark constructs each tree, further investigation into the integration of feature selection and a statistical analysis of the resulting feature impact scores is necessary. This could identify a small subset of highly predictive ASVs while sidestepping the need for generalized linear models, since the degree of confidence in the impact each ASV has on classification is evaluated rather than differences in abundance/presence.
Declarations
Ethics approval and consent to participate
Not Applicable
Consent for publication
Not Applicable
Availability of data and materials
The authors confirm that all relevant data are included in the article and/or its supplementary information files.
Competing interests
The authors declare that they have no competing interests.
Funding
JR is supported by funds from the Food from Thought project as part of Canada First Research Excellence Fund. MH received funding from the Government of Canada through Genome Canada and Ontario Genomics. BG is supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) grant (RGPIN-2020-05733).
Authors’ contributions
JR and MH conceived the project. JR analyzed/interpreted the results. JR wrote the draft. JR, MH, BG, and SK read, discussed, and contributed to the draft. MH provided computational resources. All authors have read and approved the final manuscript.
Additional Files
Additional File 1 – Supplementary Figures 1 to 4
Additional File 2 – Supplementary Table 1
Additional File 3 – Raw ESV Table, Taxonomic Assignments, and Code
Acknowledgements
We would like to thank Dr. Katie McGee and Dr. Terri M. Porter for their thoughtful discussions during the development of LANDMark.