Abstract
Metagenomics experiments often characterize microbial communities by sequencing the ribosomal 16S and ITS regions. Taxonomy prediction is a fundamental step in such studies. The SINTAX algorithm predicts taxonomy by using k-mer similarity to identify the top hit in a reference database and provides bootstrap confidence for all ranks in the prediction. SINTAX achieves accuracy comparable to or better than the RDP Naive Bayesian Classifier with a simpler algorithm that does not require training. Most tested methods are shown to have high rates of over-classification errors, in which novel taxa are incorrectly predicted to have known names.
Introduction
Sequencing of tags such as the ribosomal 16S gene and fungal internal transcribed spacer (ITS) region is a popular method for surveying microbial communities. Recent examples include the Human Microbiome Project (HMP Consortium, 2012) and a survey of the Arabidopsis root microbiome (Lundberg et al., 2012). A fundamental step in such studies is to predict the taxonomy of sequences found in the reads. The most popular method is currently the RDP Naive Bayesian Classifier (Wang et al., 2007) (hereafter RDP). Additional taxonomy prediction methods are supported by QIIME (Caporaso et al., 2010) and mothur (Schloss et al., 2009).
Reference databases
Taxonomy prediction requires a reference database containing sequences with taxonomy annotations. Authoritative prokaryotic sequence classifications exist for at most the ~12,000 named species belonging to ~2,300 genera, which represent only a tiny fraction of extant species (Yarza et al., 2014). Available databases include the RDP training sets, the full RDP database (RDPDB) (Maidak et al., 2001), SILVA (Pruesse et al., 2007), Greengenes (DeSantis et al., 2006) and UNITE (Kõljalg et al., 2013). The RDP 16S training set v16 (RTS) has 13,212 sequences belonging to 2,126 genera, while the RDP Warcup ITS training set (Deshpande et al., 2015) v2 has 18,878 sequences belonging to 8,551 species. The RDP training sets contain only sequences with authoritative names and are therefore much smaller than SILVA, Greengenes and UNITE, which include environmental sequences. SILVA v123 has 1.8M small subunit ribosomal RNA sequences; v114 was estimated to contain ~94,000 genera (Yarza et al., 2014). Greengenes v13.5 has 1.8M 16S sequences. UNITE release 01.08.2015 has 476k ITS sequences representing ~71,000 species. Most taxonomy annotations in SILVA and Greengenes are predictions obtained by computational and manual analyses which are primarily based on trees predicted from multiple alignments (McDonald et al., 2012; Yilmaz et al., 2014); in RDPDB most annotations are predicted by RDP. In the 16S databases (RDPDB, SILVA and Greengenes), no attempt is made to classify unnamed groups, while UNITE assigns numerical “species hypothesis” identifiers to unnamed clusters. By default, QIIME uses a subset of Greengenes clustered at 97% identity (GGQ, containing 99k sequences in v13.8), and mothur recommends a subset of SILVA (SILVAM, containing 172k sequences in v123). The RDP web site and stand-alone software use the RDP training sets.
Database coverage and novel taxa
If a query sequence is found in the database, its taxonomy is naively given by the reference annotation. This prediction may be wrong if the database has annotation errors or if multiple species are identical over the sequenced region, which often happens with short tags such as the popular V4 hypervariable region of 16S. The latter scenario cannot be reliably identified by checking the database for other identical sequences because the reference data may be incomplete. If the query sequence is not found in the database, then prediction is more difficult. For example, using a 95% identity threshold for clustering full-length 16S sequences was found to give groups that best approximate genera (Yarza et al., 2014). Thus, if a 16S sequence has 95% identity with a database hit, it might be in the same genus, but since identity correlates only approximately with taxonomic rank, it could belong only to the same family or same class. Or, it could belong to the same species if there is atypically large variation between paralogs or strains. From this perspective, the task of taxonomy prediction is to estimate the lowest common rank (LCR) between the query and the database. A query rank r is known if r ≥ LCR, i.e. at least one member of its clade is present in the reference database (regardless of whether it is named), and novel if r < LCR. The coverage of a reference database at a given rank with respect to a set of query sequences is the fraction of queries that are known, and novelty = (1 – coverage) is the fraction of queries that are novel. The mean top-hit identity (MTI) between query sequences and their top hits can be used as an approximate indication of coverage. To obtain typical query sets, I constructed OTUs at 97% identity using UPARSE (Edgar, 2013) from V4 reads of human gut, mouse gut and soil communities (Kozich et al., 2013) and from ITS reads of a soil fungal community (Schmidt et al., 2013). MTIs of these samples vs. commonly-used reference databases are shown in Table 1. All V4 samples have MTI < 95% with RTS, suggesting that many, perhaps most, OTUs belong to novel genera, especially in soil (MTI = 88%).
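To make these definitions concrete, the following minimal sketch (Python; the function name and the use of a single fixed identity threshold are my illustrative assumptions, not part of any published tool) estimates MTI and the fraction of putatively novel genera from query top-hit identities, using the ~95% full-length 16S genus threshold of Yarza et al. (2014). Since identity correlates only approximately with rank, this is a rough indication rather than a measurement.

```python
def mti_and_novelty(top_hit_identities, genus_threshold=0.95):
    """top_hit_identities: fractional identity of each query (e.g. OTU
    representative) to its top hit in the reference database.
    Returns (mean top-hit identity, estimated fraction of queries
    belonging to novel genera)."""
    n = len(top_hit_identities)
    mti = sum(top_hit_identities) / n
    # Queries below the ~95% identity threshold are treated as
    # putatively novel genera (a rough heuristic, per the text).
    novel = sum(1 for ident in top_hit_identities if ident < genus_threshold)
    return mti, novel / n

# Example with made-up identities for five OTUs:
mti, novelty = mti_and_novelty([0.99, 0.93, 0.88, 0.91, 0.97])
print(f"MTI = {mti:.1%}, estimated genus-level novelty = {novelty:.1%}")
```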
Table 1. Mean top-hit identities.
RDP leave-one-out validation
RDP was tested on 16S and ITS sequences using leave-one-out validation (Wang et al., 2007; Deshpande et al., 2015), in which one query sequence is extracted from the training set (RTS and Warcup, respectively) and classified using the remaining sequences as a reference. Accuracy (AccRDP) is calculated as the fraction of sequences that are correctly classified. Roughly half (1,119 / 2,472) of the genera in RTS are singletons, i.e. have exactly one training sequence, while about a quarter (2,258 / 8,548) of the species in Warcup are singletons, comprising 8% (16S) and 13% (ITS) of the training sequences. A singleton cannot be classified correctly in a leave-one-out test because no training sequences are left for its clade, so the maximum AccRDP achievable by an ideal algorithm is the fraction of training sequences belonging to non-singleton taxa, i.e. 92% for 16S genus and 87% for ITS species, rather than 100% as would usually be expected for an accuracy measure. The average number of non-singleton training sequences is 9 per genus in RTS and 14 per species in Warcup, which suggests that correct classification should be relatively easy for most queries; in practice, however, many genera will be novel, and taxa that are rare in the database may be common in the query set and vice versa. Also, all predictions are included in AccRDP regardless of their bootstrap confidence values, rather than applying the authors’ recommended cutoff (here, 80%) as would usually be expected for a benchmark test. In summary, the RDP leave-one-out test does not model typical query datasets, and AccRDP does not give a realistic estimate of accuracy by any conventional definition.
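The leave-one-out procedure itself is straightforward. Below is a minimal sketch (Python), in which classify() is a hypothetical placeholder for RDP, SINTAX or any other classifier; it illustrates why singletons cap the achievable AccRDP.

```python
def leave_one_out_accuracy(training_set, classify, rank=-1):
    """training_set: list of (taxonomy, sequence) pairs, where taxonomy
    is a tuple of names ordered from domain down to the lowest rank.
    classify(query_seq, refs) -> predicted taxonomy tuple; a hypothetical
    placeholder for RDP, SINTAX or any other method.
    Returns AccRDP: the fraction of all queries whose name at `rank` is
    predicted correctly. Singletons are included and can never be
    correct, which caps the achievable score below 100%."""
    correct = 0
    for i, (true_tax, query) in enumerate(training_set):
        refs = training_set[:i] + training_set[i + 1:]  # hold one out
        if classify(query, refs)[rank] == true_tax[rank]:
            correct += 1
    return correct / len(training_set)
```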
Methods
Performance metrics
Sensitivity should be measured as the fraction of known queries that are correctly identified so that the highest achievable sensitivity by an ideal algorithm is 100%. If novel queries were also counted then sensitivity <100% would reflect an opaque combination of low database coverage and failures to correctly predict known taxa, as with AccRDP. It is useful to distinguish two types of false positive error: misclassifications, where an incorrect name is predicted for a known rank, and over-classifications, where a name is predicted for a novel rank. For a given query set, reference database and taxonomic rank let Nknown and Nnovel be the number of queries with known and novel taxa respectively. Let TP be the number of correct predictions, FPmis be the number of misclassification errors and FPover be the number of over-classification errors. The total number of queries is N = Nknown + Nnovel. The following accuracy metrics can now be defined:
Sensitivity = TP / Nknown,
Misclassification rate = MC = FPmis / Nknown,
Over-classification rate = OC = FPover / Nnovel,
Errors per query = EPQ = (FPmis + FPover) / N.
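As a concrete reference, here is a minimal sketch of these four metrics for a single rank (Python; the encoding of per-query results is my own choice, not from the paper):

```python
def accuracy_metrics(results):
    """results: list of (is_known, prediction) per query at one rank,
    where prediction is 'correct', 'wrong', or None (no name predicted,
    e.g. below the bootstrap cutoff).
    Returns (Sensitivity, MC, OC, EPQ) as defined above."""
    n_known = sum(1 for known, _ in results if known)
    n_novel = len(results) - n_known
    tp = sum(1 for known, p in results if known and p == 'correct')
    fp_mis = sum(1 for known, p in results if known and p == 'wrong')
    # Any name predicted for a novel query is an over-classification.
    fp_over = sum(1 for known, p in results if not known and p is not None)
    return (tp / n_known,                       # Sensitivity
            fp_mis / n_known,                   # MC
            fp_over / n_novel,                  # OC
            (fp_mis + fp_over) / len(results))  # EPQ
```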
To a first approximation, we might expect misclassification and over-classification rates to be similar across different datasets, because these measures reflect intrinsic characteristics of an algorithm independent of the data, while EPQ, the measure that is typically of most interest in practice, will depend strongly on database coverage (equivalently, on query novelty). For example, if a query set contains mostly known sequences, we would expect errors to be rare and dominated by misclassifications, while if a query set is highly novel then there may be many over-classifications. If these expectations are correct, then values of MC and OC measured on a benchmark test will be similar to those obtained on biological data in practice, while EPQ will be similar only if the benchmark has a similar rate of novel taxa.
Clade partition cross-validation (CPX)
If high ranks are usually known but low ranks are often novel, then a benchmark test should contain a mix of known and novel taxa at low ranks so that both MC and OC can be measured. This can be achieved by clade partition cross-validation (CPX), as follows. Clades at a given rank rpart in a reference database are partitioned so that a randomly-chosen half of the daughter groups in each clade are assigned to the query set and the other half to the reference set, so that ranks below rpart are always novel. For example, if rpart = family, then half of the genera in a given family are assigned to the query set and half to the reference set. Singletons are always assigned to the query set and so are always novel, while non-singletons are always known. For this work, I used rpart = family and rpart = genus and calculated performance metrics from the combined predictions on both query-reference pairs; a sketch of the partition step is given below.
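A minimal sketch of the partition step (Python; I assume taxonomies are tuples of names ordered from domain down, with rpart given as an index into the tuple):

```python
import random
from collections import defaultdict

def cpx_partition(db, rpart):
    """db: list of (taxonomy, sequence) pairs; rpart indexes the
    partition rank (e.g. family) in the taxonomy tuple.
    Returns (query_set, reference_set). For each clade at rpart, a
    random half of its daughter groups goes to the query set and the
    rest to the reference set; clades with a single daughter
    (singletons) go entirely to the query set, so their ranks below
    rpart are always novel."""
    daughters = defaultdict(set)  # clade at rpart -> daughter names
    for tax, _ in db:
        daughters[tax[:rpart + 1]].add(tax[rpart + 1])
    query_groups = set()
    for clade, kids in daughters.items():
        kids = list(kids)
        random.shuffle(kids)
        # Singleton clades go entirely to the query set.
        chosen = kids if len(kids) == 1 else kids[:len(kids) // 2]
        for kid in chosen:
            query_groups.add(clade + (kid,))
    query, ref = [], []
    for tax, seq in db:
        (query if tax[:rpart + 2] in query_groups else ref).append((tax, seq))
    return query, ref
```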
SINTAX algorithm
For a query sequence Q and reference database R, the SImple Non-Bayesian TAXonomy (SINTAX) algorithm proceeds as follows. Let W(Q) be the set of k-mers in Q, where k = 8 by default. In one iteration, a random sub-sample ws(Q) of size s is extracted from W(Q), where s = 32 by default. Sub-sampling is performed with replacement. For each reference sequence r ∈ R, the number of words in common is Usubset(r) = |ws(Q) ⋂ W(r)|. The top hit T by k-mer similarity is identified as T = argmax(r) Usubset(r) and the taxonomy is taken from the annotation of T. By default, 100 iterations are performed. For each rank, the name that occurs most often is identified and its frequency is reported as its bootstrap confidence. SINTAX is similar to the RDP algorithm except that (a) the taxonomy in each iteration is identified from the top k-mer hit for the sub-sample rather than the most probable taxonomy according to the naive Bayesian calculation, and (b) a fixed-size subset of 32 k-mers is used for bootstrapping. RDP uses a subset of size |Q|/k, the number of non-overlapping k-mers, because overlapping k-mers are not independent. SINTAX uses a fixed-size subset to compensate for a problem that arises with longer sequences. For a given query sequence, consider reference sequences ranked using all k-mers, i.e. in order of decreasing Uall(r) = |W(Q) ⋂ W(r)|. This gives a list of taxonomies sorted by decreasing Uall. Let C1 be the top taxonomy and C2 the second-ranked taxonomy, with similarities Uall1 and Uall2 respectively using all k-mers, and similarities Usubset1 and Usubset2 using the subset in a given iteration. If Uall1 ≫ Uall2, then Usubset1 will be greater than Usubset2 in most or all iterations, and C1 will therefore have high bootstrap confidence. Conversely, if Uall1 is only slightly greater than Uall2, then it is more likely that the order of the top two taxonomies will be reversed in Usubset order, and the bootstrap confidence of C1 will then be lower. The bootstrap confidence thus correlates with the difference Uall1 – Uall2, giving an indication of how much closer the top taxonomy is to the query than the second-ranked taxonomy. As the sequence length |Q| increases, the number of non-overlapping k-mers |Q|/k increases. Using a larger subset reduces fluctuations under sub-sampling, so that the taxonomy order defined by Usubset converges on the order defined by Uall, and in particular C1 is the top-ranked taxonomy more often. The bootstrap confidence of C1 therefore tends to increase for longer sequences, regardless of whether it is correct. In other words, as the sequence length increases, using a subset of size |Q|/k tends to give high bootstrap confidence to the top k-mer hit for any query sequence. When the query is novel, C1 is always wrong, and the over-classification rate therefore increases with longer sequences. This problem is mitigated by using a fixed subset size of 32. See Supp. Table 1 for a comparison of SINTAX with s = 32 and s = |Q|/k vs. RDP.
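A minimal sketch of the algorithm as described above (Python; this is not the production implementation and omits its optimizations, and ties for the top hit are broken arbitrarily):

```python
import random
from collections import Counter

K = 8        # k-mer length (default)
S = 32       # fixed bootstrap sub-sample size (default)
ITERS = 100  # bootstrap iterations (default)

def kmers(seq, k=K):
    """W(seq): the set of k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def sintax(query, refs, s=S, iters=ITERS):
    """refs: list of (taxonomy, sequence) pairs; taxonomy is a tuple of
    names from domain down (uniform depth assumed). Returns, for each
    rank, the most frequent name over all iterations and its bootstrap
    confidence."""
    wq = list(kmers(query))
    ref_words = [(tax, kmers(seq)) for tax, seq in refs]
    votes = [Counter() for _ in refs[0][0]]  # one Counter per rank
    for _ in range(iters):
        ws = random.choices(wq, k=s)  # sub-sample with replacement
        # Top hit T = reference sharing the most sub-sampled words.
        top_tax = max(ref_words,
                      key=lambda tw: sum(w in tw[1] for w in ws))[0]
        for rank, name in enumerate(top_tax):
            votes[rank][name] += 1
    return [(c.most_common(1)[0][0], c.most_common(1)[0][1] / iters)
            for c in votes]
```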
Results
I tested SINTAX v1.0, standalone RDP v2.12, QIIME v1.9.1 and mothur v1.36.1. The QIIME assign_taxonomy.py script was run with options -m uclust (Quc, the default), -m sortmerna (Qsm), -m blast (Qblast) and -m rdp (Qrdp, a wrapper for standalone RDP that sets a bootstrap cutoff of 50% by default). The mothur classify.seqs command was run with method=wang (Mrdp, the default, a re-implementation of the RDP algorithm) and method=knn (Mknn).
Leave-one-out validation
Results for leave-one-out testing of SINTAX and RDP are shown in Table 2 and Supp. Table 1. The AccRDP metric is shown together with Sensitivity, OC and MC for genus (16S) or species (ITS). At phylum rank, EPQ is given as the measure of error rate because almost all phyla are known, so OC cannot be measured reliably and MC ≈ EPQ.
Table 2. Leave-one-out test results.
Clade partition cross-validation
Results for CPX testing are shown in Table 3 and Supp. Table 2. SINTAX is compared to other algorithms at genus and phylum ranks using the default database for each method. For RDP, a bootstrap cutoff of 80% is used, as this is the value recommended by the RDP authors. SINTAX and RDP have similar performance on V4, while SINTAX has substantially lower error rates on full-length 16S and ITS due to its lower over-classification rates. The ITS phylum sensitivity of SINTAX (98.3%) is notably better than that of RDP (81.8%).
Table 3. CPX test results.
High over-classification rates
A high rate of over-classification at low ranks was found for all methods in all tests, ranging from a minimum genus OC of 12.2% (SINTAX on RTS V4 with an 80% bootstrap cutoff) to a maximum of 87% (Qblast on GGQ V4). The QIIME OC rates were especially high: the lowest OC of any QIIME method (tested on V4 with its default GGQ reference database) was 48.4% (Qsm). A high OC rate is of practical importance if a significant number of novel genera is present in the query set. Table 1 shows that the OTUs for the V4 soil data have mean identity 95% or less with all of the default reference databases, implying that half or more probably belong to novel genera. On a query set with 50% novel genera, an algorithm will have an overall false-positive (FP) rate of 0.5×OC + 0.5×MC. At genus rank, using the OC and MC values measured on the CPX test, this gives an FP rate of 13% for RDP and SINTAX at 80% bootstrap using RTS, 29% for the default QIIME method Quc, and 52% for Qblast.
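The expected FP rate at a given novelty is a one-line calculation; a minimal sketch (Python) follows. The specific MC values behind the figures quoted above are not restated here.

```python
def expected_fp_rate(oc, mc, novelty):
    """Expected overall false-positive rate at a given rank for a query
    set in which a fraction `novelty` of queries is novel: novel queries
    can only contribute over-classifications, known queries only
    misclassifications."""
    return novelty * oc + (1.0 - novelty) * mc

# With 50% novel genera, the FP rate is simply the mean of OC and MC;
# e.g. the 52% figure quoted above for Qblast combines its 87% OC with
# its measured MC.
```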
Discussion
SINTAX achieves accuracy comparable to RDP on V4 and better accuracy on full-length 16S and ITS. SINTAX is conceptually simple: it finds the top k-mer hit, and the evidence supporting a prediction can be presented as a list of reference taxonomies with their k-mer similarities to the query sequence. The RDP algorithm is more opaque, making it difficult to review the evidence supporting a prediction. The posterior probabilities calculated according to naive Bayesian theory are astronomically small for correct predictions, typically in the range 10⁻¹⁸ to 10⁻²⁴. Ideally, a posterior would be an estimate of the probability that a taxonomy is correct and would be ~1 for a correct prediction, but here the probabilities are wrong by twenty or so orders of magnitude, necessitating post-hoc bootstrapping to obtain a usable confidence measure. SINTAX and RDP have similar accuracy when both use sub-sample size |Q|/k for bootstrapping (Supp. Table 2), in which case the algorithms are essentially the same except for the score used to sort taxonomies. This suggests that the naive Bayesian approach can be interpreted as an approximation to finding the top k-mer hit.
The high measured over-classification rates indicate an unsolved problem with novel taxa, which is readily explained by sparse reference data (Fig. 1). At high ranks, this may not be important because novel phyla are rare, but at low ranks novel taxa are common, especially in communities that are less studied or difficult to culture (e.g. extremophiles) or highly diverse (e.g. soil).
Figure 1. A sparse reference database induces over-classification errors. Q is the query and T is the top hit. Black circles denote known species, white circles novel species. If Q belongs to a novel genus (white square), the top few hits will tend to belong to the most similar known genus (black square). Here, SINTAX will tend to give high bootstrap confidence to the known genus. Other algorithms which explicitly (or effectively) take a consensus of the taxonomies of the most similar reference sequences, such as RDP, will similarly tend to make over-classification errors.
On full-length 16S sequences, RDP has a measured over-classification rate of 40% (CPX) and 48% (leave-one-out) at the recommended 80% bootstrap cutoff. This result suggests that many of the genus annotations in RDPDB, most of which were predicted by RDP at 80% bootstrap, may be false positives: 47% of the 3.2M RDPDB sequences have top-hit identity <95% with RTS, implying that roughly half belong to novel genera. Assuming 47% novel genera and the lower OC value gives an estimate of 0.47 × 0.40 × 3.2M ≈ 600k over-classified genus annotations.
Footnotes
robert{at}drive5.com