RT Journal Article
SR Electronic
T1 KrakenHLL: Confident and fast metagenomics classification using unique k-mer counts
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 262956
DO 10.1101/262956
A1 Breitwieser, FP
A1 Salzberg, SL
YR 2018
UL http://biorxiv.org/content/early/2018/02/09/262956.abstract
AB Motivation: False positive identifications are a significant problem in metagenomics. Spurious identifications can attract many reads that often aggregate in the genomes. Genome coverage may be used to filter false positives, but fast k-mer based metagenomic classifiers only provide read counts as metrics, and re-alignment is expensive. We propose using k-mer coverage, which can be computed during classification, as a proxy for genome base coverage. Results: We present KrakenHLL, a metagenomics classifier that records the number of unique k-mers as well as coverage for each taxon. KrakenHLL is based on the ultra-fast classification engine Kraken and combines it with HyperLogLog cardinality estimators. We demonstrate that more false-positive identifications can be filtered using the unique k-mer count, especially when looking at species of low abundance. Further enhancements include mapping against multiple databases, plasmid and strain identification using an extended taxonomy, and inclusion of over 100,000 additional viral strain sequences. KrakenHLL runs as fast as Kraken, and sometimes faster. Availability and Implementation: KrakenHLL is implemented in C++ and Perl, and available under the GPL v3 license at https://github.com/fbreitwieser/krakenhll. Contact: florian.bw{at}gmail.com.
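
The abstract's central idea is that a taxon's unique k-mer count, estimated with HyperLogLog, separates reads spread across a genome from reads piled onto one locus. It can be illustrated with a minimal Python sketch. This is not KrakenHLL's code, which is written in C++ and Perl. It is a generic textbook HyperLogLog. The k-mer length `K = 21`, the precision `p = 12`, the BLAKE2b hash, the synthetic 3,000-base genome, and the helper names `kmers`, `HyperLogLog`, and `summarize` are all assumptions made for illustration.

```python
import hashlib
import math
import random

K = 21  # k-mer length; an assumed value for this sketch


def kmers(seq, k=K):
    """Yield every overlapping k-mer of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class HyperLogLog:
    """Textbook HyperLogLog with 2**p registers and a 64-bit hash."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        x = int.from_bytes(digest, "big")
        idx = x >> (64 - self.p)                       # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # position of the leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)               # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:                   # small range: fall back to linear counting
            return m * math.log(m / zeros)
        return raw


def summarize(reads):
    """Return (read count, exact unique k-mers, HyperLogLog estimate)."""
    exact = set()
    hll = HyperLogLog()
    for read in reads:
        for kmer in kmers(read):
            exact.add(kmer)
            hll.add(kmer)
    return len(reads), len(exact), hll.estimate()


# Synthetic data (hypothetical): one random genome, two ways of drawing reads.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(3000))
tiled_reads = [genome[i:i + 100] for i in range(0, 2901, 50)]  # spread over the genome
stacked_reads = [genome[:100]] * len(tiled_reads)              # all from one locus

tiled = summarize(tiled_reads)
stacked = summarize(stacked_reads)
print("tiled   (reads, exact, estimate):", tiled)
print("stacked (reads, exact, estimate):", stacked)
```

Both read sets have the same read count, so read count alone cannot tell them apart. The stacked set covers far fewer distinct k-mers, which is the kind of signal the abstract proposes for filtering false positives. The exact `set` is included only to check the estimator. A HyperLogLog sketch uses a fixed number of registers, whereas an exact set grows with the number of distinct k-mers.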