RT Journal Article
SR Electronic
T1 Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 2023.02.24.529942
DO 10.1101/2023.02.24.529942
A1 Jarno N. Alanko
A1 Jaakko Vuohtoniemi
A1 Tommi Mäklin
A1 Simon J. Puglisi
YR 2023
UL http://biorxiv.org/content/early/2023/02/24/2023.02.24.529942.abstract
AB Motivation: Huge data sets containing whole-genome sequences of bacterial strains are now commonplace and represent a rich and important resource for modern genomic epidemiology and metagenomics. To make efficient use of these data sets, indexing data structures that are both scalable and provide rapid query throughput are paramount. Results: Here, we present Themisto, a scalable colored k-mer index designed for large collections of microbial reference genomes, that works for both short and long read data. Themisto indexes 179 thousand Salmonella enterica genomes in 9 hours. The resulting index takes 142 gigabytes, and Themisto pseudoaligns reads from a Salmonella enterica isolate sample against the index at a rate of 2 million base pairs per second on 48 threads. In comparison, the best competing tools, Metagraph and Bifrost, were only able to index 11 thousand genomes in the same time. In pseudoalignment, these other tools were either an order of magnitude slower than Themisto or used an order of magnitude more memory. Themisto also offers superior pseudoalignment quality, achieving a higher recall than previous methods on Nanopore read sets. Availability and implementation: Themisto is available and documented as a C++ package at https://github.com/algbio/themisto under the GPLv2 license. Contact: jarno.alanko@helsinki.fi. Supplementary information: Supplementary data are available at Bioinformatics online. Competing Interest Statement: The authors have declared no competing interest.