New Results

Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression

Karel Břinda, Leandro Lima, Simone Pignotti, Natalia Quinones-Olvera, Kamil Salikhov, Rayan Chikhi, Gregory Kucherov, Zamin Iqbal, Michael Baym
doi: https://doi.org/10.1101/2023.04.15.536996
Karel Břinda
1GenScale, Inria/IRISA Rennes, Campus de Beaulieu, 35042 Rennes Cedex, France
2Department of Biomedical Informatics, Harvard Medical School, MA 02115 Boston, USA
  • For correspondence: karel.brinda@inria.fr baym@hms.harvard.edu
Leandro Lima
3EMBL-EBI, CB10 1SD Hinxton, UK
Simone Pignotti
2Department of Biomedical Informatics, Harvard Medical School, MA 02115 Boston, USA
4LIGM, CNRS, Univ. Gustave Eiffel, 77454 Marne-la-Vallée Cedex 2, France
Natalia Quinones-Olvera
2Department of Biomedical Informatics, Harvard Medical School, MA 02115 Boston, USA
Kamil Salikhov
4LIGM, CNRS, Univ. Gustave Eiffel, 77454 Marne-la-Vallée Cedex 2, France
Rayan Chikhi
5Department of Computational Biology, Institut Pasteur, 75015 Paris, France
Gregory Kucherov
4LIGM, CNRS, Univ. Gustave Eiffel, 77454 Marne-la-Vallée Cedex 2, France
Zamin Iqbal
3EMBL-EBI, CB10 1SD Hinxton, UK
Michael Baym
2Department of Biomedical Informatics, Harvard Medical School, MA 02115 Boston, USA
  • For correspondence: karel.brinda@inria.fr baym@hms.harvard.edu

ABSTRACT

Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections makes it effectively impossible to search these data using tools such as BLAST and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs, and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate it can align genes, plasmids, or entire sequencing experiments against all sequenced bacteria until 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure.
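
The core idea, that placing evolutionarily related genomes next to each other helps a general-purpose compressor, can be illustrated with a small synthetic sketch. This is not the authors' pipeline: the clade model, the genome sizes, the mutation rate, and the use of Python's `zlib` are all assumptions chosen for illustration. Because DEFLATE can only reference the previous 32 KiB of input, a near-identical genome is exploitable only when it sits close by in the stream.

```python
import random
import zlib

BASES = "ACGT"


def random_genome(rng, length):
    """Return a random DNA string (hypothetical stand-in for a clade ancestor)."""
    return "".join(rng.choice(BASES) for _ in range(length))


def mutate(rng, genome, rate):
    """Redraw each base with probability `rate` (the redraw may pick the same base)."""
    return "".join(rng.choice(BASES) if rng.random() < rate else b for b in genome)


def build_clades(seed=0, n_clades=4, per_clade=4, length=20_000, rate=0.01):
    """Simulate a few clades, each a set of close variants of one ancestor."""
    rng = random.Random(seed)
    clades = []
    for _ in range(n_clades):
        ancestor = random_genome(rng, length)
        clades.append([mutate(rng, ancestor, rate) for _ in range(per_clade)])
    return clades


def compressed_size(genomes):
    """Size in bytes of the concatenated genomes after zlib compression."""
    return len(zlib.compress("".join(genomes).encode("ascii"), 9))


clades = build_clades()

# Phylogeny-guided order: relatives are adjacent (20 kb apart, inside the window).
phylo_order = [g for clade in clades for g in clade]

# Clade-agnostic order: relatives are 80 kb apart, outside the 32 KiB window.
interleaved = [g for group in zip(*clades) for g in group]

size_phylo = compressed_size(phylo_order)
size_interleaved = compressed_size(interleaved)
print(f"phylogeny-guided order: {size_phylo} bytes")
print(f"interleaved order:      {size_interleaved} bytes")
```

Both orderings contain exactly the same genomes, so any difference in output size comes from the ordering alone. Compressors with larger windows (such as xz) are less sensitive at this toy scale, but the same effect reappears once the collection exceeds their window.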

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • added acknowledgement

  • https://github.com/karel-brinda/phylogenetic-compression-supplement

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.
Posted April 18, 2023.
Subject Area

  • Bioinformatics