Abstract
Amplicon sequencing generates chimeric reads which can cause spurious inferences of biological variation. I describe UCHIME2, an update of the popular UCHIME chimera detection algorithm with new modes optimized for high-resolution biological sequence reconstruction (“denoising”) and other applications. I show that chimera frequency correlates inversely with divergence, that error-free chimera prediction from sequence is impossible in principle, and that UCHIME2 achieves higher detection accuracy than previous methods.
Amplicon sequencing is widely used to survey biological sequence variation in applications including marker gene metagenomics1, immune system repertoire analysis2 and cancer genomics3. In such experiments, chimeric amplicons form when an incomplete DNA strand anneals to a different template and primes synthesis of a new template derived from two different biological sequences4. Recent chimera detection methods include ChimeraSlayer4, UCHIME5, DECIPHER6 and CATCh7. UCHIME classifies a query sequence by making a model from a concatenated pair of sub-sequences (segments) in a reference database. The query is predicted to be chimeric if the score of its alignment to the model exceeds a threshold. The reference database is provided by the user or constructed de novo from the reads. UCHIME2 modifies the original UCHIME algorithm by classifying a query as unknown if uncertainty is high due to missing reference data. Such queries would be classified, perhaps misleadingly, as “non-chimeric” by UCHIME. UCHIME2 also adds heuristics with parameter values optimized for different applications and introduces a new strategy designed for filtering denoised (error-corrected) amplicon sequences.
To investigate chimeras encountered in practice, I used Illumina reads of the currently popular V4 region of the 16S ribosomal RNA (16S) gene from high- and low-diversity microbial communities (soil and human vagina, respectively) and from two artificial communities (mock1 and mock2); see Supp. Note 1 for details. I extracted reads with < 1 expected errors8 and made OTUs at 97% identity using UCLUST9. Using its recommended 16S reference (the ChimeraSlayer Gold database), UCHIME predicted 9% of soil and 39% of vagina OTUs to be chimeric, while using SILVA10 predicted 57% and 53% respectively. The lower sensitivity of Gold vs. SILVA is explained by its smaller size (5.1k vs. 1.4M sequences), and the relatively better performance of Gold on vagina vs. soil is explained by the higher frequency of well-known species in vagina. The frequencies predicted using SILVA are probably underestimates and are surely better than using Gold (Supp. Notes 2 and 3), showing that at least half the soil and vagina OTUs are chimeric.
Earlier work4–6 characterized chimeras by parent divergence because detection is harder when parents are similar. However, a chimera can be arbitrarily close to one parent, e.g. if the other's segment is short. A better indication of detectability is the number of differences compared to the closest known non-chimeric sequence (reference divergence, D). Measured frequencies of chimeras with 1 ≤ D ≤ 20 are reported in Fig. 1, showing that frequency correlates inversely with D and a majority have D < 10. I also considered the identity of a segment with its closest reference sequence (segment identity, S). S is <100% when a parent is missing from the database, which reduces detectability.
Previously published tests of reference-based chimera detection methods4–6 use simulations which assume, unrealistically, that parent sequences are known, i.e. S = 100%. To investigate the dependence of prediction accuracy on both D and S, I designed a new benchmark, CHSIMA, with simulated 16S and ITS chimeras with D = 1 to 10 and S = 90% to 100% (Supp. Note 5). False positives (FPs) were measured by dividing a chimera-free database into pairs (splits) with identities from 90% to 100%. I calculated sensitivity as the fraction of simulated chimeras that were correctly predicted and, unlike previously published tests, included false negatives (FNs) as well FPs in the error rate because both may be comparably harmful, especially when D > 3% where FNs cause spurious OTUs and FPs discard valid biological sequences. For assessments of DECIPHER and CATCh see Supp. Notes 4 and 6, which show DECIPHER to have very low sensitivity. CHSIMA results are given in Table 1 and Supp. Note 12. The balanced and sensitive modes of UCHIME2 have higher sensitivity than previous methods, with balanced having the lower overall error rate, and the high-confidence mode reports fewer false-positives.
The sensitivity of all tested methods falls rapidly with decreasing segment similarity. This is explained by chimeric models (fakes) of non-chimeric queries (Supp. Note 7), which are common and in the most challenging cases are exact matches (perfect fakes). Remarkably, I found that a large majority of non-chimeric sequences in the 99% identity splits have perfect fake models for the V4 region of 16S and the ITS1 and ITS2 fungal Internal Transcribed Spacer regions, and a third to a half in the 97% identity splits. Introducing heuristics to reduce FPs due to fakes therefore unavoidably causes an increase in FNs, especially for low-divergence chimeras. Error-free chimera/non-chimera classification from sequence is therefore impossible in principle, even in an ideal scenario where the reference database is complete and correct and the query sequence has no errors (Supp. Note 9). The problem of fakes is exacerbated when there are sequence errors or the database is incomplete, as is typically the case, noting that when S ≲ 100%, the probability of a non-chimera having a perfect fake model is very high (Supp. Note 7). These results, combined with the rapid increase in FNs with lower segment identities (Table 1), show that the choice of Gold as default for UCHIME was misguided; much better accuracy will usually be achieved with a large database such as SILVA due to the reduction in FNs.
Methods
Given a query sequence Q, UCHIME2 uses the UCHIME algorithm to construct a model (M), then makes a multiple alignment of Q with the model and top hit (T, the most similar reference sequence). The following metrics are calculated from the alignment: number of differences dQT between Q and T and dQM between Q and M, the alignment score (H) using eq. 2 in Edgar et al. 2011. The fractional divergence with respect to the top hit is calculated as divT = (dQT – dQM)/|Q|. If divT is large, the model is a much better match than the top hit and the query is more likely to be chimeric, and conversely if divT is small, the model is more likely to be a fake. If divT > 3%, the chimera would cause a spurious OTU.
Deep sequencing enables resolution of amplicons with as few as one difference11,12. In marker gene applications, this can be accomplished by error-correction (denoising) using algorithms such as DADA212 and UNOISE8 which predict the set of unique amplicon sequences from which the reads are derived. Many amplicons will be chimeric, and a post-processing step is therefore required to identify the non-chimeric subset. With this application in mind, I designed a denoised de novo (DDN) mode of UCHIME2 using a strategy similar to the de novo mode of UCHIME. Each sequence is compared with all amplicons having greater abundance on the assumption that parents are more abundant than chimeras because they undergo more rounds of amplification. If a perfect model is found (dQM=0, dQT > 0), the amplicon is classified as chimeric. This method is not heuristic in the sense that a perfect model will always be reported if one exists. Unlike UCHIME de novo, DDN will detect chimeras with divergences as low as a single difference. DDN compares an amplicon with all higher-abundance sequences, including those with chimeric models (which would be discarded by UCHIME), enabling detection of chimeras with three or more segments which form13 when a parent is itself chimeric. State-of-the-art denoisers achieve very low error rates12, giving amplicon sequences close to the ideal scenario described in the main text: error-free sequences and a complete reference database. Even in this scenario, false positives are possible because a valid biological variant with one difference has a perfect fake de novo model with a probability of a few percent (Supp. Note 9).
However, given that this probability is low and that D=1 chimeras are common (Fig. 1) I considered it preferable to classify any query with a perfect model as chimeric, allowing a few FPs due to perfect fakes rather than the many FN D=1 chimeras that would typically result from imposing more stringent criteria.
Accurate de novo chimera detection for high-resolution applications requires denoised amplicon sequences. Even if read quality filtering is applied, for example by requiring that the most probable number of errors implied by Phred scores is zero, many reads with one or more errors remain8. De novo chimera detection would then be severely degraded by the explosion in perfect fake models when S < 100% (Supp. Note 7).
For lower-resolution (OTU) clustering of quality-filtered (but not denoised) reads at 97% identity, UPARSE was shown14 to generate dramatically fewer chimeric OTUs on mock community data than pipelines which use UCHIME for chimera filtering. This was achieved despite the inherent limitations of chimera detection because UPARSE does not need to distinguish between low-divergence chimeras, reads with base call errors and biological variants with <3% differences—all are assigned to the closest OTU and their sequences per se are discarded. An OTU pipeline should therefore use UPARSE for clustering and chimera filtering unless reads are denoised, in which case using UCHIME2 DDN followed by OTU clustering with UCLUST is a reasonable alternative.
Results in the main text show that 100% segment similarity is needed for accurate detection with a user-supplied database, but this will rarely be possible in practice because reads often have errors and reference databases are incomplete and contain sequence errors and ambiguous bases. Noting that chimeras with D=1 are common, UCHIME2 classifies queries as non-chimeric only when there is a 100% identical reference sequence and otherwise as chimeric or unknown. I designed four different sets of parameters for reference-based chimeric classification, as follows. Denoised mode reports all chimeras when query sequences are error-free and the reference database is complete and correct. Balanced mode attempts to minimize both FPs and FNs, giving the lowest overall error rate. Specific mode prioritizes reducing FPs at the expense of increased FNs, with results similar to UCHIME. (Here, unknown is considered to be a FN if the query is a chimera). High-confidence mode further reduces FPs, at the cost of the highest overall error rate. Sensitive mode is designed to reduce FNs at the expensive of a higher FP rate. Parameter values for each mode are specified in Supp. Note 10. Noting that users rarely set non-default parameters even when command-line options are available, I did not set a default mode for UCHIME2, hoping that this would make users more aware of the limitations of reference-based chimera filtering by asking them to choose a compromise between sensitivity, false-positive and false-negative error rates that is most appropriate for addressing their biological questions.
Author contributions
R.C.E. conceived of the study, performed the analysis and wrote the manuscript.
UCHIME2: improved chimera prediction for amplicon sequencing
Supplementary Notes
Note 1. Reads of soil, vagina and mock communities. Note 2. Measuring chimera divergences and frequencies.
Note 3. SILVA gives the best chimera frequency measurement. Note 4. Assessment of DECIPHER and CATCh.
Note 5. Design of the CHSIMA benchmark.
Note 6. Comments on the DECIPHER benchmark. Note 7. Fake and perfect fake models are common.
Note 8. Probability of a de novo perfect fake model.
Note 9. Error-free classification is impossible in principle. Note 10. Default parameters for UCHIME2 modes.
Note 11. Method names, software versions and command lines. Note 12. CHSIMA results.
Supplementary References
Note 1. Reads of soil, vagina and mock communities
Reads for soil, human vagina and artificial community mock1 were taken from Kozich et al.1. The reads were downloaded from http://www.mothur.org/MiSeqDevelopmentData/, accessed 3rd Sept 2015.
Mock2 is MiSeq dataset 2 from Bokulich et al.2 The reads were kindly provided by Dirk Gevers at the Broad Institute. There is now a link to the data from the QIIME resource page (http://qiime.org/home_static/dataFiles.html, accessed 23rd June 2016).
Note 2. Measuring chimera divergences and frequencies
In order to measure chimera divergence frequencies reliably, it is necessary to align sequences which can be classified as chimeric or not with high confidence against error-free parent sequences using a method that is unbiased with respect to divergence. This is challenging because it is not possible to identify sequencing error with 100% certainty, and error-free chimera detection is not possible, especially with noisy reads, and especially when divergence is low, as shown in the main text. This note describes a method which overcomes these challenges to make an accurate measurement of divergence frequencies.
Reference database
For the mock1 and mock2 communities, I used the known 16S sequences for the species in the samples. For the soil and vagina samples, I used the 100 most abundant error-corrected amplicons generated by UNOISE3 as the reference database. The top 100 sequences are almost certainly correct (because denoising is most effective at high abundances) and a large majority of chimeras have parents in the top 100. Regardless, the subset of chimeras with top-100 parents is surely unbiased with respect to divergence frequencies.
Query set
I searched for chimeras in reads after filtering at maximum one expected error3. I did not use error-correction because many chimeras are singletons, which are discarded by denoising methods including DADA24 and UNOISE.
Chimera detection algorithm
I used the denoised mode of UCHIME2 which is not heuristic in the sense that it is guaranteed to report all perfect models, i.e. all models with dQM = 0 and dQT > 0 (see Methods in main text) and will therefore identify all error-free chimeric sequences with parents in the reference database. Sequencing error in a chimeric read will almost certainly result in a false negative (because then dQM = (number of base call errors) > 0 except in pathological cases where errors reproduce a chimera by chance). The false-positive rate for error-free reads against error-free parents is estimated in Note 9 and found to be a few percent for D=1 and vanishingly small for D > 1.
Despite known sources of error, divergence frequencies are accurately measured
This method has a number of well-understood sources of error. A measurement of divergence frequencies will be degraded only if these errors are biased with respect to divergence, otherwise they are benign. False negatives are caused by (1) unfiltered sequencing error and (2) parents missing from the reference database, neither of which is biased w.r.t. chimera divergence. False positives are rare and the causes should again have no bias w.r.t. divergence with the possible exception of D=1. Therefore, the method described in this note gives an accurate measurement of the chimera divergence distribution except that D=1 may be overestimated.
Note 3. SILVA gives the best chimera frequency estimate
To investigate the coverage of the SILVA and Gold databases on the human vagina and soil samples, I measured the identities of the top hits of the 100 most abundant denoised amplicons (Fig. SN3.1). SILVA has ≥99% identity for 96% of the vagina amplicons and 75% of the soil amplicons, while Gold has ≥99% identity top hits for 50% and 6% of the amplicons, respectively. Referring to Table 1 in the main text, UCHIME2 in balanced mode has sensitivity 87% for identities of 97% or higher with an approximately equal number of FPs and FNs. Chimera frequencies from UCHIME2-balanced predictions with SILVA should therefore be superior to Gold, especially for soil, though will probably give an underestimate due to the lack of high identity reference sequences for ~25% of the high-abundance amplicons in soil.
Note 4. Assessment of DECIPHER and CATCh
CATCh5 is an ensemble classifier that generates a consensus result from predictions by UCHIME6, DECIPHER7 and other algorithms. I was unable to install the stand-alone version of DECIPHER or to configure DECIPHER to use a non-default reference database, which precluded running DECIPHER or CATCh on most benchmark tests including the ChimeraSlayer8 benchmark, the CHSIMA test described in the main text, or the SIM2 and CHSIM tests described in the UCHIME paper.
I tested DECIPHER and CATCh using the length-300 ChimeraSlayer test sequences (https://sourceforge.net/projects/microbiomeutil/files/ChimeraTestRegime/, retrieved 2nd Feb 2016). These include 2,500 length-300 two-segment chimeras (Ch300) constructed from parents in the Gold database and 4,770 length-300 segments (Ctl300) of Gold sequences, which are named isolates and thus known to be non-chimeric. I chose this test because it uses published benchmark data, the length is close to the currently popular V4 region (~250nt) and because it is reasonable to assume that all the Gold sequences are present in the reference databases used by DECIPHER and CATCh, noting that if parent sequences were not present, this would bias a test in favor of UCHIME and UCHIME2. For DECIPHER, I used its web server (http://decipher.cee.wisc.edu/FindChimeras.html, data submitted 1st Jul 2016). CATCh predictions were kindly provided by Mohamed Mysara.
In the ChimeraSlayer benchmark, divergence (PDiv) is measured between parents rather than the distance (D) between the chimera and its closest parent. PDiv provides an approximate upper bound D ≤ PDiv/2 because the closer parent is probably the one with the longer segment, which covers at least half of the chimera. Thus, PDiv ≤ 6% corresponds approximately to D ≤ 3%, the “harmful range" where chimeras will cause spurious OTUs in a typical analysis. Low-divergence chimeras are common (main text), so the “abundant range” D ≤ 5% or PDiv ≤ 10% is the most important in practice. Above PDiv = 10%, chimeras are relatively rare and are detected with high sensitivity by all tested methods except DECIPHER.
Results are shown in Table SN4.1 and in Figs. SN4.1 and SN4.2. The sensitivity of DECIPHER was found to be much lower than the other methods, while CATCh had sensitivity slightly higher than UCHIME. Both CATCh and DECIPHER reported several false positives despite that fact that all of the control sequences are segments of named isolate sequences, while UCHIME and all UCHIME2 modes reported no false positives as expected for known sequences. Note that this is not the same procedure used for the false positive tests described in the ChimeraSlayer and UCHIME papers which used a “leave-one-out" strategy (i.e., the query sequence was omitted from the database). In this case, I could not change the databases for DECIPHER or CATCh so I tested all methods using “leave-all-in”, i.e. without deleting the query sequence, as in the CHSIMA false positive tests with SegId=100% (Note 5).
UCHIME2 in denoised mode is not heuristic in the sense that it will report all query sequences not found in the reference database for which a perfect chimeric model can be constructed. Its sensitivity is <100% on Ch300 (Fig. SN4.2) because 38 of the simulated chimeras have hits to Gold with 100% identity, e.g. L300chmraD1_S000440923_5642-6199:6200-6518_S000382508 matches S000842173, illustrating that error-free prediction is impossible in principle (Note 8).
Note 5. Design of the CHSIMA benchmark
My goal with CHSIMA was to implement a benchmark for reference-based methods that measures accuracy with varying chimera divergence (D) and varying segment similarity (S). Previous benchmarks tested variation with D but made the unrealistic assumption that parent sequences are always known, i.e. S=100%.
For a given region (e.g., V4 or ITS1), I extracted the region from the reference database, giving sequence set R. For each S, I constructed two subsets (splits XS and YS) of R with approximately equal numbers of sequences such that each sequence in one split had a top hit with identity S in the other split (Fig. SN5.1). In the special case S=100%, X100 = Y100 = a random subset of 1,000 sequences from R.
For each pair of splits, I constructed a set of two-segment simulated chimeras CS from parent sequences in YS, generating 100 chimeras for each D = 1, 2 … 10 for a total of 1,000.
True positives (TPs) are measured with XS as the reference database and CS as the query set. Sensitivity is the fraction of chimeras that are successfully detected and the false-negative (FN) rate is the fraction not detected.
False positives (FPs) are measured with a random subset of 1,000 sequences from YS as the query and XS as the database. This measures the FP rate with a known distance (S) between the database and the query sequences, in contrast to the leave-one-out method used in the ChimeraSlayer and UCHIME papers where the distance has varying values <100% but is not otherwise characterized.
I defined sensitivity and error rates for a given S as follows:
In most cases, |XS| = |YS| = 1,000, in which case the overall error rate is the mean of the FP rate and FN rate:
Sensitivity and error rates may be given for each D or averaged over all D = 1, 2 … 10.
Previous benchmarks5–8 did not include false negatives in reported error rates. While the false negative rate is readily calculated as FN rate = (100% – Sensitivity), I felt it was more informative to include false negatives in reported total error rates because of the atypical importance of false-negative chimera predictions. A sensitivity of 90% and false-positive error rate of 2% would sound pretty good for a typical informatics algorithm, but here if you have 1,000 OTUs then you will end up with 10% = 100 spurious OTUs due to false-negative chimeras in addition to the 20 good OTUs lost due to false positives. In my opinion, it is more informative to express this as a 12% error rate so that a casual reader or user does not overlook the importance of FNs. For another example, the sensitivity of DECIPHER for 300nt chimeras with S=100% and PDiv ≤ 10 is 5% (Fig. SN4.2) with a FP rate of 1% (Table SN4.1) so the large majority of harmful chimeras will be missed by DECIPHER. In my opinion, quoting the overall error rate per query (95% + 1%)/2 = 48% gives a better indication of the poor performance of the algorithm on this data.
Note 6. Comments on the DECIPHER benchmark test
I was not able to reproduce results in the DECIPHER paper because some details are not specified and I was unable to install the stand-alone version of DECIPHER.
The DECIPHER paper does not fully explain how false positives were measured, in particular which reference database was used for each algorithm. The authors state “[t]he 2,000 parent sequences in the data set were … evaluated to estimate the rate of false-positive detections … with fs_DECIPHER having the lowest false positive rate (0.15%), followed by ChimeraSlayer (0.70%), WigeoN (0.85%), Uchime (1.5%), and ss_DECIPHER (1.6%)." UCHIME has a zero false positive rate for query sequences having a 100% identical full-length match, so the database used for UCHIME was missing at least 1.5% = 30 of the full-length sequences for the 2,000 parents; in fact, probably many more, given its low FP rate. The Gold database does not contain all of the parents, and if Gold was used with UCHIME this could explain why its FP rate is >0 and raises the question of whether DECIPHER was run with its own default database which does include all the parent sequences and would bias the results in favor of DECIPHER.
Another issue with the DECIPHER benchmark arises from reference sequences with missing bases at the 5' and 3' terminals. While the RDP isolate and Gold sequences are sometimes considered to be “full-length", in fact they often do not cover the complete gene (see Fig. SN6.1 for an example). With default settings, UCHIME allows terminal gaps in query sequences but not reference sequences (see discussion of Global-X alignments and Fig. 1 in Edgar et al. 2011). This may partly explain the anomalously low sensitivity of UCHIME on the DECIPHER benchmark test.
Note 7. Fake and perfect fake models are common
Given a query sequence Q, a chimeric model is a pair of reference sequence segments (A, B) concatenated together that have divergence > 0, i.e. are more similar to Q than the most similar sequence in the database (the top hit, T). If Q is not chimeric, the model is fake. A perfect fake model is a fake model that is identical to Q.
To give an indication of the frequency of fake models found by UCHIME2 I counted the number of alignments with a positive score, ignoring all other thresholds. Results are shown in the “Fakes” column in Table SN7.1. These are underestimates of the number of fake models that could be constructed, noting for example that this method fails to find many perfect fake models with S=99% found by UCHIME2-denoised (Table SN7.1).
All queries with at least one perfect fake model are identified by UCHIME2 in denoised mode because the algorithm is not heuristic, in the following sense: if one or more chimeric models exist, and the query sequence is not present in the database, then a model will be reported. Results for CHSIMA are shown in Table SN7.2 (see Note 5 for explanation of CHSIMA). This table can be interpreted as giving the false-positive error rate of UCHIME2-denoised, or equivalently as giving the frequency of perfect fake models for each segment identity (SegId, S). For all regions except full-length 16S, a large majority of sequences have a perfect fake model with S=99% and between a third and a half with S=97%. At S=99%, 10% of full-length 16S sequences have a perfect fake model.
The observation that both fakes and perfect fakes are very common implies that chimeras cannot be reliably distinguished from non-chimeras by any conceivable reference-based algorithm. In the de novo case, sequence abundances provide additional evidence which is predictive but not definitive. Similarly, it is not possible to screen large reference databases such as SILVA9 and Greengenes10 for chimeras with high accuracy. Low-divergence chimeras are common, and if the parents of a low-divergence chimera are present in the database, an algorithm such as UCHIME2-denoised can discover the model, but would not be able to reliably determine whether the model was fake because there is no evidence either way in the sequence. If its parents are not present, then the sensitivity and/or error rate of all algorithms necessarily increase rapidly with decreasing S and again, if a model is found, a chimera cannot be reliably distinguished from a correct biological sequence with a perfect or imperfect fake model.
Note 8. Probability of a de novo perfect fake model
A de novo detection algorithm constructs a reference database from sequences found in the reads. The observation that perfect fake models are common (Note 7) raises the question of how often true biological variants with a few substitutions could cause false positives due to perfect fake models in a de novo database. To investigate this, I extracted the top 100 denoised amplicons for the soil and vagina samples and generated 100,000 random variants of these amplicons with 1, 2 … 5 substitutions. I then used UCHIME2-denoised with the variants as a query and the top 100 denoised amplicons as a reference to determine the fraction of variants which had a perfect chimeric model. This simulates the problem faced by the post-processing step in a denoising pipeline which attempts to predict which of the error-correct amplicon sequences are chimeric.
Results are shown in Table SN8.1, which show that a biological variant with one difference has a probability of a few percent of having a perfect fake model while a variant with two or more differences is very unlikely to have a perfect fake model. This shows that a few percent of low-abundance biological variants with a single substitution will be discarded as false positive chimeras. Biological variants with two or more substitutions are very unlikely to have perfect fake models.
Note 9. Error-free chimera detection is impossible in principle
As discussed in Note 7, a perfect fake model is a pair of reference sequence segments (A, B) concatenated together that reproduce a query sequence Q which is known to be a correct biological sequence. If the reference database contains A and B but not Q, then it is impossible in principle for a reference-based algorithm to distinguish a chimeric amplicon CAB = A+B from an amplicon derived from the correct biological sequence Q because the sequences of CAB and Q are identical. Perfect fake models are surprisingly common (Note 7).
A sequence which is found in the reference database and also has a perfect chimeric model constructed from two other reference sequences cannot be reliably classified for essentially the same reason -- the sequences of a potential chimera and a correct biological sequence are identical. The results of Note 7 show that such cases are very common, especially for the currently popular regions V4 and ITS. Consider R99 = X99+ Y99for the V4 region. 97% of the sequences in X99have a perfect fake model in Y99, which implies that conversely ~97% of the sequences in Y99have perfect fake models in X99.Combining the two splits, it follows that at least ~97% of the sequences in R99have perfect fake models constructed from other sequences in R99. The problem of perfect fake models is thus not solved in the ideal case where all biological sequences are known, noting that a set of denoised amplicons is a good approximation to this scenario.
In the de novo case, the unique sequence abundances of A, B and Q provide additional evidence which is predictive but not definitive because low-abundance variants with perfect chimeric models may be due to correct biological sequences with fake models (Note 8), to chimeric amplicons, or reads with incorrect bases due to uncorrected sequencing error, or polymerase copying mistakes. Since perfect fake models are relatively rare in the de novo case, especially when there multiple substitutions, UCHIME2-denoised-de-novo (DDN) classifies a sequence as chimeric if it has a perfect model and the parent sequences are more abundant.
Note 10. Default parameters for UCHIME2 modes
Note 11. Method names, versions and command lines
Source code and Linux binary for UCHIME2 is provided in the supplementary files. Command line options are given in the README.TXT file.
ChSlayer is the mothur11 implementation of ChimeraSlayer with default parameters. I used v1.36.1 of mothur. Typical commands:
mothur “#align.seqs(candidate=q.fasta, template=gold.align, processors=6)” mothur “#chimera.slayer(fasta=q.align, template=gold.align, processors=6)”ChSlayer-kmer is the same as ChSlayer plus the search=kmer option.
DECIPHER was run using its web server http://decipher.cee.wisc.edu/FindChimeras.html selecting the “short-length sequences" option for sequences < 1,000nt, “full-length sequences" for sequences ≥ 1,000nt. Data was submitted 1st Jul 2016.
CATCh results were kindly provided by M. Mysara.
Note 12. CHSIMA results
References
Supplementary References
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.