Abstract
Popular naive Bayes taxonomic classifiers for amplicon sequences assume that all species in the reference database are equally likely to be observed. We demonstrate that classification accuracy degrades linearly with the degree to which that assumption is violated, and in practice it is always violated. By incorporating environment-specific taxonomic abundance information, we demonstrate that species-level resolution is attainable.
Main
Advances in high-throughput DNA sequencing and bioinformatics analyses have illuminated the crucial roles of microbial communities in human populations and planetary health12 and enable microbiome meta-analysis on a massive scale3. An important step in characterizing microbial communities is classification of short marker-gene DNA sequences (e.g., bacterial 16S rRNA genes) to infer taxonomic composition.
Short marker-gene sequence reads often contain insufficient information to differentiate species using conventional methods4–8. However, current best practices rely on species-level classification to circumvent well-documented inconsistencies between genus-level reference taxonomies and molecular phylogeny (e.g., Clostridium and Eubacterium)9.
In this work, we demonstrate that a substantial improvement in classification accuracy of marker-gene sequences can be achieved if a reference taxonomic distribution for the sample’s source environment is known. This technique enables marker-gene sequencing to differentiate individual species at a level of accuracy previously only available at genus level.
We focus on q2-feature-classifier, a QIIME 210 plugin for taxonomic classification. In previous work4 we benchmarked this method against other common classifiers, including RDP Classifier11 and several consensus-based methods using real and simulated data for four bacterial and fungal loci. In general, q2-feature-classifier met or exceeded the accuracy of the other classifiers4. However, all tested methods performed similarly if their parameters were tuned in a concordant manner. The significant performance enhancement demonstrated in the current work for q2-feature-classifier therefore implies improved performance over those other methods.
RDP Classifier and q2-feature-classifier use similar naive Bayes machine-learning classifiers to assign taxonomies based on sequence k-mer frequencies, and exhibit very similar performance when default parameters are used4. The default assumption of these classifiers is that each species in the reference taxonomy is equally likely to be observed. Unlike RDP classifier, however, q2-feature-classifier now allows prior probabilities to be set for each species. We refer to the prior probabilities as taxonomic weights and the default equal probabilities as uniform weights. We hypothesized that inputting the frequencies with which each taxon is actually observed in nature as taxonomic weights would improve classifier performance.
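The effect of taxonomic weights on a naive Bayes classifier can be illustrated with a minimal sketch. This is not the q2-feature-classifier implementation: the k-mer length, the smoothing constant, the two species names and the toy sequences are all assumptions chosen for illustration. It shows that when two species have identical reference amplicons, the k-mer likelihoods cancel and the posterior is determined entirely by the prior.

```python
import math
from collections import Counter

def kmers(seq, k=3):
    """Return the overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train(reference, k=3, alpha=1.0):
    """Fit per-taxon k-mer log-likelihoods with additive smoothing (alpha)."""
    counts = {taxon: Counter() for taxon in reference}
    for taxon, seqs in reference.items():
        for seq in seqs:
            counts[taxon].update(kmers(seq, k))
    vocab = sorted({km for c in counts.values() for km in c})
    model = {}
    for taxon, c in counts.items():
        total = sum(c.values()) + alpha * len(vocab)
        model[taxon] = {km: math.log((c[km] + alpha) / total) for km in vocab}
    return model

def posterior(model, priors, seq, k=3):
    """Posterior over taxa: log prior plus summed k-mer log-likelihoods."""
    scores = {}
    for taxon, loglik in model.items():
        score = math.log(priors[taxon])
        # k-mers never seen in the reference are skipped in this sketch.
        score += sum(loglik[km] for km in kmers(seq, k) if km in loglik)
        scores[taxon] = score
    top = max(scores.values())
    unnormalised = {t: math.exp(s - top) for t, s in scores.items()}
    norm = sum(unnormalised.values())
    return {t: v / norm for t, v in unnormalised.items()}

# Two hypothetical species whose reference amplicons are identical.
reference = {"g__A; s__one": ["ACGTACGTAC"], "g__A; s__two": ["ACGTACGTAC"]}
model = train(reference)
query = "ACGTACGTAC"

# Uniform weights: the posterior is split evenly between the two species.
uniform = posterior(model, {"g__A; s__one": 0.5, "g__A; s__two": 0.5}, query)
# Hypothetical habitat-specific weights: the posterior follows the prior.
bespoke = posterior(model, {"g__A; s__one": 0.9, "g__A; s__two": 0.1}, query)
```

In this toy case the uniform posterior is 0.5 for each species, whereas the weighted posterior is 0.9 for the species that is more common in the habitat.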
Taxonomic weights were downloaded and assembled using our new utility, q2-clawback (https://github.com/BenKaehler/q2-clawback). We created weights for 14 Earth Microbiome Project Ontology (EMPO) 3 habitat types1 across 21,513 samples from the Qiita microbial study management platform3 (see Online Methods for details). q2-clawback can assemble weights from any appropriately curated set of samples or by querying Qiita on any available metadata category. We refer to EMPO 3 habitat-specific taxonomic weights as bespoke weights.
Using bespoke weights, researchers can now classify sequences at species level with the same confidence that they previously classified sequences at genus level (Figure 1). The mean error rate (the proportion of reads incorrectly classified) across the 14 EMPO 3 habitat types was 14% (±1% s.e.) for bespoke weights at species level and 16% (±1% s.e.) for uniform weights at genus level (single-sided paired t-test P = 0.14). These results indicate that bespoke weights achieve species-level accuracy comparable to or better than what uniform weights achieve at genus level. (As described below, bespoke weights significantly outperform uniform weights by all metrics when both are compared at species level.) The mean Bray-Curtis dissimilarity between observed and expected taxonomic abundances was 0.13 (±0.01 s.e.) for bespoke weights at species level and 0.15 (±0.01 s.e.) for uniform weights at genus level (single-sided paired t-test P = 0.013) (Table S2, Figure S2), indicating better performance of bespoke weights. See Supplementary Results for further benchmarking results and Online Methods for methodological details.
Bespoke weights significantly outperformed uniform weights when both were compared at species level (bespoke error rate = 14%, uniform error rate = 25%, paired t-test P = 5.8×10^-5) (Figure 1). Similar results were obtained for Bray-Curtis dissimilarity and F-measure (see Supplementary Results). Averaged across the 14 EMPO 3 habitats, Proteobacteria and Firmicutes were the most abundant phyla (34% and 18% of reads, respectively). Switching from uniform to bespoke weights caused error rates for classification of species in these phyla to drop from 35.4% (±0.7% s.e.) to 22.3% (±0.4% s.e.) and from 43.6% (±0.7% s.e.) to 24.3% (±0.3% s.e.), respectively (Figure S8). These differences were highly significant for both Proteobacteria and Firmicutes (paired t-tests P = 1.4×10^-6 and P = 8.4×10^-6, respectively).
Classifier performance was sensitive to the choice of taxonomic weights. Testing the use of taxonomic weights from EMPO 3 habitats that were not the sample’s true habitat revealed that as the taxonomic weights moved away from the bespoke weights for a given sample, error rate increased, as expected (Pearson r^2 = 0.57, P < 2.2×10^-16) (Figure S5; see Supplementary Results and Online Methods). We also tested the classification accuracy when using the average of the 14 EMPO 3 habitat-specific bespoke weights, which we term average weights. For every EMPO 3 habitat, bespoke weights outperformed average weights (sign test P = 6.1×10^-5) (Figures 2, S2–3). Similarly, average weights always outperformed uniform weights (sign test P = 6.1×10^-5) (Figures 2, S2–3). The implication is that classification accuracy improves when taxonomic weights more closely resemble taxonomic frequencies observed in nature. Importantly, uniform weights gave inferior performance, even compared to using taxonomic weights from the EMPO 3 habitats other than the sample’s source habitat (cross-habitat weights; Figure S4; see Supplementary Results).
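The sign test P value quoted above can be checked by hand. The sketch below assumes the test is one-sided (the text does not say) and that one weighting scheme outperformed the other in all 14 habitats, so the P value is the probability of 14 successes in 14 fair coin flips. The function name is ours.

```python
from math import comb

def one_sided_sign_test(wins, n):
    """P(X >= wins) for X ~ Binomial(n, 0.5): a one-sided sign test P value."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# One scheme wins in all 14 EMPO 3 habitats.
p = one_sided_sign_test(14, 14)
print(f"{p:.2g}")  # 6.1e-05
```

With 14 habitats this is the smallest P value a one-sided sign test can return, which is why the same value appears for both comparisons.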
The ability of uniform-weight classifiers to resolve species-level differences from marker genes is directly related to the sequence topology of reference species. Species with highly similar sequences will be difficult to differentiate, even if these species occupy exclusive ecological habitats. However, bespoke weights incorporate habitat-specific species distribution information to guide sequence classification. Hence, classification accuracy under bespoke weights for a given habitat type is tied to the sequence topology and distribution of individual species in that habitat. We devised a statistic that we term the confusion index to quantify how often similar sequences originated from different species in the same habitat (see Online Methods). The confusion index is a function of the taxonomic difference between sequences with similar k-mer profiles and the frequency that they appear, taking the bespoke weights as the likelihood of observing a given species. We found that error rates for bespoke weights were correlated with the confusion index (Figure S7; Pearson r^2 = 0.72, P = 1.4×10^-4, see Online Methods and Supplementary Results). That is, classification accuracy is affected by how often different species in the same sample have similar amplicon sequences but different taxonomic classifications, and that varies between EMPO 3 habitats.
The assumption of uniform weights, that species are evenly distributed in nature and hence equally likely to be detected, is incorrect. We have demonstrated that this assumption imposes a consistently negative impact on performance, even when compared to deliberately incorrect taxonomic weights selected from ecologically dissimilar environmental sources (the cross-habitat weights). As a result, we suggest the continued usage of uniform weights is not justifiable. When publicly accessible pre-existing microbiome data is available for the sample (i.e., environment) type being investigated, bespoke weights should be used. For all other natural sample types, average weights estimated from global microbial species distributions are superior to uniform weights. For highly unusual sample distributions, e.g., in synthetic populations, we recommend compiling custom bespoke weights from existing samples. In the Supplementary Results we demonstrate how shotgun metagenome data may be used to improve classification accuracy (Figure S9). Efforts to curate microbiome data and the continued contribution of researchers to online microbiome data repositories will refine and extend the ability to apply appropriate bespoke weights for sequence classification in diverse sample types.
By comparing uniform, average, and bespoke weights, we have shown that the more specific the taxonomic weights to a sample’s environment, the better the classification accuracy. q2-clawback facilitates achieving these improvements in accuracy by making it easier for the researcher to assemble weights that are more specific than identifying a sample’s EMPO 3 habitat. For instance, it is trivial to assemble weights for all stool samples with human hosts from Qiita (see the online tutorial, https://library.qiime2.org/plugins/q2-clawback).
The results we present provide a general path for delivering species-level classification accuracy. As such, the work provides a complementary solution to the small number of existing specialist classification databases12–15. Moreover, bespoke weight classification permits the detection of unexpected species not encompassed by custom databases.
By improving species-level classification of marker-gene sequences, bespoke weights may support critical functional inferences, e.g., differentiation of pathogenic and non-pathogenic species of the same genus16–21. Ongoing improvements in public reference sequence and sample databases will further boost performance, supporting biological insight into global microbiome compositions. Uniform weights should always be avoided, as they distort natural species distributions, leading to imprecise and incorrect taxonomic predictions.
Methods
Methods, including statements of data availability and any associated accession codes and references, are available in the online version of the paper.
Author Contributions
Conceived, designed, and performed experiments: BDK and NAB. Designed and wrote clawback software: BDK and NAB. Wrote manuscript: BDK, NAB, JGC, GAH. Developed supporting software (redbiom): DM, RK. Provided critical review of manuscript and results: DM, RK, JGC, GAH.
Competing Interests
The authors declare no competing interests.
Online Methods
Data
We downloaded all public 150 nucleotide 16S v4 samples for 18 EMPO 3 habitat types from Qiita3 using q2-clawback. The downloaded data consisted of sequence variant and abundance information. The sequence variants were prepared by the standard Qiita pipeline, including Deblur22, prior to download. q2-clawback uses redbiom23 (https://github.com/biocore/redbiom) to access Qiita. Data from the following Qiita studies were used: 1111324, 11444, 1716, 1036925, 99026, 2080, 1713, 894, 1289, 1883, 1673, 1288, 10353, 219227, 10323, 678, 1773, 662, 1799, 864, 1481, 102428, 1064, 2182, 10934, 1674, 179529, 10273, 1028330, 1042231, 804, 10308, 105632, 238228, 1240, 889, 1041, 1717, 1222, 11149, 11669, 80733, 10245, 1711, 1721, 910, 1001, 895, 55034, 174735, 71336, 755, 861, 95837, 1116138, 1115439, 945, 723, 1715, 1714, 10798.
The three EMPO 3 habitat types in the EMPO 1 control category were excluded, as were Hypersaline (saline), Aerosol (non-saline), and Plant surface, which all had fewer than nine samples in the Qiita database for 150 nt sequence variants. The number of samples downloaded for each EMPO 3 habitat is shown in Table OM1.
For the cross validation analysis, sequence-variant level data was discarded and only taxonomic abundance information was retained. The sequence variants were classified using the standard q2-feature-classifier naive Bayes classifier based on Greengenes 99% identity OTU reference data40 to obtain empirical taxonomic abundance data for each sample. The naive Bayes classifier was trained using the “balanced” parameter recommendations given in Bokulich, Kaehler, et al.4.
For the shotgun data experiment (see Supplementary Results), data was downloaded from the Human Microbiome Project website2. The downloaded tables had been prepared using a pipeline leading to MetaPhlAn241. Paired 16S stool samples were downloaded from Qiita3 in the form of DNA sequencing data with quality scores. The 16S samples were trimmed to 340 nt and denoised using DADA242. In total, 71 pairs of shotgun and 16S stool samples were found. Reference data sets were downloaded from the NCBI RefSeq database43. Full 16S sequences were trimmed to the V3-V5 regions (forward primer CCTACGGGAGGCAGCAG; reverse primer CCGTCAATTCMTTTRAGT), using q2-feature-classifier4, resulting in 20,696 reference sequences across 14,777 taxa. It should be noted that this experiment is intended for demonstration only, and that we are not advocating the use of the NCBI 16S RefSeq database for this purpose, as on average there are fewer than two reference sequences for each taxon.
Clawback
q2-clawback is a free, open-source, BSD-licensed package that is available on GitHub (https://github.com/BenKaehler/q2-clawback). It includes methods for downloading sequence variants from Qiita (fetch-Qiita-samples), extracting sequence variants for taxonomic classification using q2-feature-classifier (sequence-variants-from-samples), and assembling taxonomic weights from collections of samples of taxonomic abundance (generate-class-weights). These methods can be run independently or combined into a single method call (assemble-weights-from-Qiita). Figure OM1 shows the workflow for these methods. An online tutorial is available (https://library.qiime2.org/plugins/q2-clawback).
In general, taxonomic weights are assembled as follows. A set of sequence variants with abundances is acquired (fetch-Qiita-samples). The sequence variants are extracted (sequence-variants-from-samples) and classified using the naive Bayes classifier under uniform weights using “balanced” settings4. Classification to species level is forced by setting the confidence parameter to −1. The resulting read counts are aggregated, normalised, and added to a small (10^-6 unobserved weight default) uniform offset (generate-class-weights) to form bespoke weights. The resulting weights are used to retrain the naive Bayes classifier to create a classifier under the bespoke weights assumption. In our experiments, which are detailed below, this procedure was modified slightly to accommodate cross validation and compilation of taxonomic weights from a variety of sources.
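The aggregation step can be sketched as follows. This is an illustration of the description above, not the q2-clawback source: the function name, the data layout (one dictionary of read counts per sample) and the final renormalisation are assumptions.

```python
def generate_class_weights(samples, reference_taxa, unobserved_weight=1e-6):
    """Aggregate read counts, normalise, and add a small uniform offset.

    samples: list of {taxon: read count} dictionaries (hypothetical layout).
    reference_taxa: every taxon in the reference taxonomy; all taxa seen in
    the samples are assumed to appear here.
    """
    totals = {taxon: 0.0 for taxon in reference_taxa}
    for sample in samples:
        for taxon, reads in sample.items():
            totals[taxon] += reads
    grand_total = sum(totals.values())
    # The offset gives taxa never observed in the habitat a non-zero weight.
    weights = {t: c / grand_total + unobserved_weight for t, c in totals.items()}
    # Renormalise so the weights sum to one (an assumption of this sketch).
    norm = sum(weights.values())
    return {t: w / norm for t, w in weights.items()}

samples = [{"s__a": 90, "s__b": 10}, {"s__a": 60, "s__b": 40}]
weights = generate_class_weights(samples, ["s__a", "s__b", "s__c"])
```

The offset matters because a weight of exactly zero would make a taxon impossible to predict, however well its reference sequence matched the read.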
Cross Validation Using Empirical Taxonomic Abundance
To test classification accuracy using varying taxonomic weights, we developed a cross-validation strategy that accounted for the observed abundances of taxa in any given habitat. This strategy ensured that a classifier was never asked to classify a sequence that had occurred in its training set or generate taxonomic abundances that had directly contributed to its input taxonomic weights. To our knowledge, our cross-validation strategy is the first to incorporate information about taxonomic weights in assessing taxonomic classifier performance. This situation is known in machine learning as imbalanced learning44.
Cross validation was used to analyse the effectiveness of setting the taxonomic weights for the q2-feature-classifier naive Bayes taxonomic classifier. A single cross-validation test follows the pattern below (shown in Figure OM2; several steps are described in more detail afterwards):
1. Obtain a set of reference sequences and reference taxonomies.
2. Obtain a set of samples for a given EMPO 3 habitat type, where each sample contains the number of reads observed for each taxon.
3. Perform stratified k-fold cross validation simultaneously on reference sequences and samples.
4. For each fold:
   a. Train a classifier on the training reference sequences, optionally incorporating read counts from the training samples to calculate taxonomic weights.
   b. Simulate samples that closely match the taxonomic abundances in the test samples using the test reference sequences, then classify them using the above classifier.
Step 2. Data was obtained as detailed above. Taxonomic abundances were estimated using the naive Bayes classifier under uniform weights using “balanced” settings4, where the classifier was forced to classify to species level.
Step 3. We performed 5-fold cross validation in each instance. Standard stratification for 5-fold cross validation requires that at least five sequences exist for each taxonomy, which is not the case for the 99% identity Greengenes reference taxonomy. We therefore formed a stratum for each taxonomy for which five or more reference sequences existed (large taxonomies) and merged the remaining taxonomies (small taxonomies) into those strata. A single large taxonomy was chosen for each small taxonomy by training a naive Bayes classifier on the large taxonomies, classifying the reference sequences in the small taxonomies, then voting weighted by confidence. Shuffled stratified 5-fold cross validation was then implemented using a standard library call to scikit-learn45.
Cross validation was performed simultaneously on samples and reference sequences. Sample cross validation was not stratified.
Step 4a. Each sample consisted of a set of taxonomies and their abundances. Taxonomic weights were formed by aggregating those counts across the training samples. As a result of the merged strata in Step 3, some taxonomies that were present in the bespoke weights were not present amongst the taxonomies of the training sequences. Any such taxonomy was mapped to the nearest taxonomy that was present amongst the taxonomies represented by the training sequences, as measured by the voting system from Step 3.
Step 4b. Samples were simulated by drawing sequences from the test sequences in such a way as to closely resemble the taxonomic abundances of the test samples. Again as a result of the merged strata in Step 3, some taxonomies that were present in the test samples were not present in the taxonomies of the test sequences. In the same way as for Step 4a, any missing species-level taxonomy was mapped to the closest taxonomy for a sequence present in the test sequences. Once missing taxonomies were resolved, samples were simulated by drawing test sequences as evenly as possible from each taxonomy so that any read count was a whole number.
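The even drawing in Step 4b can be sketched as below. The function names and data layout are hypothetical, and the choice of which sequences receive the leftover reads (here, the first few) is an assumption, since the text only requires that counts be whole numbers and as even as possible. Missing taxonomies are assumed to have been mapped already.

```python
def draw_evenly(read_count, sequence_ids):
    """Split read_count across sequence_ids as evenly as whole numbers allow."""
    base, extra = divmod(read_count, len(sequence_ids))
    # The first `extra` sequences each receive one additional read.
    return {seq_id: base + (1 if i < extra else 0)
            for i, seq_id in enumerate(sequence_ids)}

def simulate_sample(taxon_reads, test_sequences):
    """Map {taxon: reads} to {sequence id: reads} using each taxon's test sequences."""
    simulated = {}
    for taxon, reads in taxon_reads.items():
        simulated.update(draw_evenly(reads, test_sequences[taxon]))
    return simulated

# Hypothetical test fold: two test sequences for s__a, one for s__b.
test_sequences = {"s__a": ["a1", "a2"], "s__b": ["b1"]}
sample = simulate_sample({"s__a": 5, "s__b": 2}, test_sequences)
```

Here the five reads for s__a become three copies of a1 and two of a2, so the simulated sample reproduces the test sample's taxonomic abundances exactly.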
For the q2-feature-classifier naive Bayes classifiers reported in this study, we used the “balanced” parameters recommended for uniform weights4. That is, we used a confidence level of 0.7 in all cases. In Bokulich et al.4, a confidence level of 0.92 was recommended for bespoke weights tested on mock communities. We tested the classifiers at this level but in all cases the results were dominated by the less conservative confidence level of 0.7.
F-measure and Bray-Curtis46 dissimilarity were calculated for each sample and taxonomic level using the q2-quality-control QIIME 2 plugin (https://github.com/qiime2/q2-quality-control). F-measure for each fold was aggregated across samples by weighting by the total read count for each sample. Bray-Curtis dissimilarity was averaged across samples without weighting, but samples with fewer than 1,000 reads were filtered out.
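The two aggregation rules can be sketched as follows. This is not the q2-quality-control code: the function names are ours, abundances are represented as dictionaries of read counts, and the read filter is applied to the expected counts, which is an assumption.

```python
def bray_curtis(expected, observed):
    """Bray-Curtis dissimilarity between two {taxon: abundance} dictionaries."""
    taxa = set(expected) | set(observed)
    difference = sum(abs(expected.get(t, 0) - observed.get(t, 0)) for t in taxa)
    total = sum(expected.get(t, 0) + observed.get(t, 0) for t in taxa)
    return difference / total

def weighted_mean(values, read_counts):
    """Aggregate a per-sample metric (e.g. F-measure), weighting by read count."""
    return sum(v * n for v, n in zip(values, read_counts)) / sum(read_counts)

def mean_bray_curtis(sample_pairs, min_reads=1000):
    """Unweighted mean over (expected, observed) pairs with enough reads."""
    kept = [bray_curtis(e, o) for e, o in sample_pairs
            if sum(e.values()) >= min_reads]
    return sum(kept) / len(kept)

# Two hypothetical samples; the second has too few reads and is filtered out.
pairs = [({"a": 600, "b": 400}, {"a": 400, "b": 600}),
         ({"a": 5, "b": 5}, {"a": 10})]
overall = mean_bray_curtis(pairs)
```

Weighting F-measure by read count lets deep samples dominate, whereas the unweighted Bray-Curtis mean treats every retained sample equally.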
Error rates, or the proportion of reads not correctly classified, were calculated as follows. A classification was called correct only if the expected classification exactly matched the observed classification to the required taxonomic level. That is, if the expected classification did not contain classification all the way to that level because that species was not present in the training set, then the classification was called correct only if it was truncated at exactly the right level. Correct classification rates were again calculated for each sample and aggregated across samples by weighting by the total read count for each sample. Aggregation across folds and EMPO 3 habitats was evenly weighted.
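The matching rule can be sketched as below. The sketch assumes semicolon-delimited taxonomy strings in which a truncated classification simply omits the deeper ranks, and that level 7 corresponds to species. The function names and data layout are hypothetical.

```python
def truncate(taxonomy, level):
    """Keep at most `level` ranks of a semicolon-delimited taxonomy string."""
    ranks = [rank.strip() for rank in taxonomy.split(";") if rank.strip()]
    return tuple(ranks[:level])

def is_correct(expected, observed, level):
    """Correct only if both agree exactly after truncation to `level`.

    If the expected taxonomy stops short of `level`, the observed taxonomy
    must stop at the same rank to count as correct.
    """
    return truncate(expected, level) == truncate(observed, level)

def error_rate(samples, level):
    """samples: list of samples, each a list of (expected, observed, reads)."""
    weighted_sum = total_reads = 0.0
    for sample in samples:
        reads = sum(n for _, _, n in sample)
        wrong = sum(n for e, o, n in sample if not is_correct(e, o, level))
        # Weight each sample's error rate by its total read count.
        weighted_sum += (wrong / reads) * reads
        total_reads += reads
    return weighted_sum / total_reads

genus = "k__B; p__P; c__C; o__O; f__F; g__G"
species_x = genus + "; s__x"
species_y = genus + "; s__y"
rate = error_rate([[(species_x, species_x, 90), (species_x, species_y, 10)],
                   [(species_x, species_y, 100)]], level=7)
```

Under this rule, over-classification is penalised: if the expected taxonomy stops at genus, a species-level call counts as an error even when the genus is right.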
Confusion Index
The degree to which species can be successfully resolved is directly related to the dissimilarity of their sequences. We sought to establish a property of the reference data and taxonomic weights that was related to the classification accuracy across EMPO 3 habitats. For any pair of DNA sequences, the critical quantities are their sequence and taxonomic dissimilarities. Sequence dissimilarity is measured as the Bray-Curtis dissimilarity of k-mer counts. Taxonomic dissimilarity is the depth (from species level) of the most recent common ancestor, e.g. zero for the same species, one for species within the same genus and seven for an Archaean versus a Bacterium.
The Confusion Index is then the log of the product of the probability that the sequence dissimilarity for any pair of sequences is less than a threshold (we selected 0.25) and the expectation of the taxonomic distance given that the sequence dissimilarity is less than 0.25. The expectation was calculated under the assumption that the two sequences were sampled independently with probability given by their bespoke weights. That is,
$$\mathrm{CI} = \log \sum_{i,j} w(i)\, w(j)\, d_t(i,j)\, I\big(d_s(i,j) < 0.25\big),$$
where CI is the Confusion Index, d_s(i, j) is the sequence dissimilarity between the ith and jth sequences, d_t(i, j) is the taxonomic dissimilarity between the ith and jth sequences, w(i) is the weight of the ith sequence, and I(·) is the indicator function.
The Confusion Index quantifies how often a pair of taxa have nearly identical sequences but different taxonomies for a given set of taxonomic weights. One advantage of this quantity is that it can be estimated statistically by taking a random sample of pairs of sequences. In this study we sampled 10^8 pairs of sequences for each calculation.
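A Monte Carlo estimate of the Confusion Index can be sketched as follows, taking the definition as the log of the expected taxonomic dissimilarity over weighted random pairs, counted only when the sequence dissimilarity falls below the threshold. The function names, the default k-mer length, the default number of pairs and the seven-rank taxonomy tuples are assumptions of this sketch, not details of the study's code.

```python
import math
import random
from collections import Counter

def kmer_counts(seq, kmer_size):
    """Count the overlapping k-mers in a sequence."""
    return Counter(seq[i:i + kmer_size] for i in range(len(seq) - kmer_size + 1))

def sequence_dissimilarity(seq_a, seq_b, kmer_size=7):
    """Bray-Curtis dissimilarity of the two sequences' k-mer counts."""
    counts_a = kmer_counts(seq_a, kmer_size)
    counts_b = kmer_counts(seq_b, kmer_size)
    shared = sum((counts_a & counts_b).values())  # sum of element-wise minima
    return 1 - 2 * shared / (sum(counts_a.values()) + sum(counts_b.values()))

def taxonomic_dissimilarity(tax_a, tax_b):
    """Depth, counted up from species, of the most recent common ancestor."""
    for i, (rank_a, rank_b) in enumerate(zip(tax_a, tax_b)):
        if rank_a != rank_b:
            return len(tax_a) - i
    return 0

def confusion_index(seqs, taxa, weights, n_pairs=10_000, threshold=0.25,
                    kmer_size=7, seed=0):
    """Estimate log E[d_t * I(d_s < threshold)] from weighted random pairs."""
    rng = random.Random(seed)
    indices = range(len(seqs))
    total = 0.0
    for _ in range(n_pairs):
        i, j = rng.choices(indices, weights=weights, k=2)
        if sequence_dissimilarity(seqs[i], seqs[j], kmer_size) < threshold:
            total += taxonomic_dissimilarity(taxa[i], taxa[j])
    return math.log(total / n_pairs) if total else float("-inf")

# Two hypothetical species of one genus with identical amplicons.
shared_ranks = ("k__B", "p__P", "c__C", "o__O", "f__F", "g__G")
taxa = [shared_ranks + ("s__x",), shared_ranks + ("s__y",)]
ci = confusion_index(["ACGTACGT", "ACGTACGT"], taxa, [0.5, 0.5], kmer_size=3)
```

In this toy case half of all sampled pairs are two different species with identical sequences and taxonomic dissimilarity one, so the estimate should be close to log 0.5.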
Comparison of Taxonomic Classification for Shotgun and Amplicon Sequencing
The effect of using taxonomic weights derived from taxonomic classification of shotgun sequencing reads was determined using 5-fold cross validation, where each classifier was trained using taxonomic weights aggregated across the samples in the training set, then tested on 16S samples from a test set. Taxon detection rate (TDR)4 was computed using the q2-quality-control QIIME 2 plugin. TDR is the fraction of taxa that were discovered in the shotgun sequencing sample that were also found in the amplicon sample.
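As defined above, TDR reduces to a set calculation. The sketch below is illustrative only (it is not the q2-quality-control implementation), and the function name and taxon labels are hypothetical.

```python
def tdr(shotgun_taxa, amplicon_taxa):
    """Fraction of taxa found in the shotgun sample that the amplicon sample also contains."""
    shotgun, amplicon = set(shotgun_taxa), set(amplicon_taxa)
    return len(shotgun & amplicon) / len(shotgun)

# Two of the four shotgun taxa are recovered by the amplicon sample.
rate = tdr({"s__a", "s__b", "s__c", "s__d"}, {"s__a", "s__b", "s__x"})
```

Taxa found only in the amplicon sample (s__x here) do not affect TDR, which measures recovery of the shotgun taxa only.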
Code Availability
q2-clawback is available at https://github.com/BenKaehler/q2-clawback/releases/tag/0.0.4. All other code developed for this study is available at https://github.com/BenKaehler/paycheck/releases/tag/0.0.2.
Data Availability
The Qiita data used in this study have been deposited at https://doi.org/10.5281/zenodo.2548899. The HMP and NCBI data used in this study have been deposited at https://doi.org/10.5281/zenodo.2549777.
Acknowledgments
QIIME 2 development was primarily funded by NSF Awards 1565100 to JGC and 1565057 to RK. This work was supported by an NHMRC project grant APP1085372, awarded to GAH, JGC, RK.
References