bioRxiv
New Results

Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes

Jarno N. Alanko, Jaakko Vuohtoniemi, Tommi Mäklin, Simon J. Puglisi
doi: https://doi.org/10.1101/2023.02.24.529942
Jarno N. Alanko
1Department of Computer Science, University of Helsinki, Helsinki, 00014, Finland
  • For correspondence: alanko.jarno@gmail.com
Jaakko Vuohtoniemi
1Department of Computer Science, University of Helsinki, Helsinki, 00014, Finland
Tommi Mäklin
2Department of Mathematics and Statistics, University of Helsinki, Helsinki, 00014, Finland
Simon J. Puglisi
1Department of Computer Science, University of Helsinki, Helsinki, 00014, Finland

Abstract

Motivation Huge data sets containing whole-genome sequences of bacterial strains are now commonplace and represent a rich and important resource for modern genomic epidemiology and metagenomics. To make efficient use of these data sets, indexing data structures that are both scalable and provide rapid query throughput are paramount.

Results Here, we present Themisto, a scalable colored k-mer index designed for large collections of microbial reference genomes that works for both short- and long-read data. Themisto indexes 179 thousand Salmonella enterica genomes in 9 hours. The resulting index takes 142 gigabytes, and Themisto pseudoaligns reads from a Salmonella enterica isolate sample against the index at a rate of 2 million base pairs per second on 48 threads. In comparison, the best competing tools, Metagraph and Bifrost, were only able to index 11 thousand genomes in the same time. In pseudoalignment, these other tools were either an order of magnitude slower than Themisto or used an order of magnitude more memory. Themisto also offers superior pseudoalignment quality, achieving a higher recall than previous methods on Nanopore read sets.
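
To make the terms "colored k-mer index" and "pseudoalignment" concrete, the following is a minimal toy sketch in Python. It is not Themisto's implementation, which is a C++ package built on far more compact data structures. The function names `build_index` and `pseudoalign`, the toy genomes, and the choice to skip read k-mers that are absent from the index are all assumptions made for illustration; reverse complements are ignored for brevity.

```python
from collections import defaultdict


def build_index(genomes, k):
    """Map each k-mer to the set of genome ids ("colors") that contain it."""
    index = defaultdict(set)
    for color, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(color)
    return dict(index)


def pseudoalign(read, index, k):
    """Intersect the color sets of all read k-mers that occur in the index.

    Assumption of this sketch: k-mers absent from the index are skipped.
    Returns an empty set if no k-mer of the read is indexed.
    """
    result = None
    for i in range(len(read) - k + 1):
        colors = index.get(read[i:i + k])
        if colors is None:
            continue  # k-mer not in any reference genome
        result = set(colors) if result is None else result & colors
    return result if result is not None else set()


# Hypothetical toy references; real inputs would be whole bacterial genomes.
genomes = {
    "genome_A": "ACGTACGGA",
    "genome_B": "TTACGGACC",
}
index = build_index(genomes, k=4)

shared = pseudoalign("TACGGA", index, k=4)    # k-mers found in both genomes
only_a = pseudoalign("GTACGG", index, k=4)    # GTAC occurs only in genome_A
no_hit = pseudoalign("GGGGG", index, k=4)     # no indexed k-mer
```

In this sketch a read is reported against every reference whose color appears in all of the read's indexed k-mers, so no base-level alignment is computed. A plain dictionary of sets does not scale to hundreds of thousands of genomes, which is the problem a purpose-built index addresses.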

Availability and implementation Themisto is available and documented as a C++ package at https://github.com/algbio/themisto under the GPLv2 license.

Contact jarno.alanko@helsinki.fi

Supplementary information Supplementary data are available at Bioinformatics online.

Competing Interest Statement

The authors have declared no competing interest.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.
Posted February 24, 2023.
Subject Area

  • Bioinformatics