bioRxiv
New Results

Fast and compact matching statistics analytics

Fabio Cunial, Olgert Denas, Djamal Belazzougui
doi: https://doi.org/10.1101/2021.10.05.463202
Fabio Cunial
*Max Planck Institute for Molecular Cell Biology and Genetics (MPI-CBG and CSBD), Dresden, 01307, Germany
  • For correspondence: fabio.cunial@gmail.com
Olgert Denas
†Blue River Technology, Sunnyvale, CA 94086, USA
Djamal Belazzougui
‡CAPA, DTISI, Centre de Recherche sur l’Information Scientifique et Technique, Algiers, Algeria

Abstract

Motivation Fast, lightweight methods for comparing the sequences of ever-larger assembled genomes from ever-growing databases are increasingly needed in the era of accurate long reads and pan-genome initiatives. Matching statistics is a popular method for computing whole-genome phylogenies and for detecting structural rearrangements between two genomes, since it is amenable to fast implementations that require a minimal setup of data structures. However, current implementations use a single core, take too much memory to represent the result, and do not provide efficient ways to analyze the output in order to explore local similarities between the sequences.

Results We develop practical tools for computing matching statistics between large-scale strings, and for analyzing its values, faster and using less memory than the state of the art. Specifically, we design a parallel algorithm for shared-memory machines that computes matching statistics 30 times faster with 48 cores in the cases that are most difficult to parallelize. We design a lossy compression scheme that shrinks the matching statistics array to a bitvector that takes from 0.8 to 0.2 bits per character, depending on the dataset and on the value of a threshold, and that achieves 0.04 bits per character in some variants. And we provide efficient implementations of range-maximum and range-sum queries that take a few tens of milliseconds while operating on our compact representations, and that allow computing key local statistics about the similarity between two strings. Our toolkit makes construction, storage, and analysis of matching statistics arrays practical for multiple pairs of the largest genomes available today, possibly enabling new applications in comparative genomics.
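
To make the objects mentioned above concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard definition of matching statistics (MS[i] is the length of the longest prefix of query[i:] that occurs in the text), uses a naive quadratic-time computation, and shows a well-known lossless encoding of the array in at most 2n bits, which relies on the fact that MS[i] >= MS[i-1] - 1. The parallel algorithm and the lossy, threshold-based compression described in the abstract are not reproduced here. All function names are hypothetical, and the range queries are brute-force stand-ins for the indexed versions the paper describes.

```python
def matching_statistics(query: str, text: str) -> list[int]:
    """Naive computation: MS[i] = length of the longest prefix of
    query[i:] that occurs as a substring of text. Illustration only."""
    ms = []
    length = 0
    for i in range(len(query)):
        # MS[i] >= MS[i-1] - 1, so resume from the previous length minus one.
        length = max(length - 1, 0)
        while i + length < len(query) and query[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


def encode_bits(ms: list[int]) -> list[int]:
    """Lossless bitvector: for each i, write MS[i] - MS[i-1] + 1 zeros
    followed by a one, with the convention MS[-1] = 1."""
    bits = []
    prev = 1
    for value in ms:
        bits.extend([0] * (value - prev + 1))
        bits.append(1)
        prev = value
    return bits


def decode_bits(bits: list[int]) -> list[int]:
    """Recover MS[i] as (position of the i-th one) - 2*i, 0-based."""
    ones = [pos for pos, bit in enumerate(bits) if bit == 1]
    return [pos - 2 * i for i, pos in enumerate(ones)]


def range_max(ms: list[int], lo: int, hi: int) -> int:
    """Brute-force maximum of MS over the half-open interval [lo, hi)."""
    return max(ms[lo:hi])


def range_sum(ms: list[int], lo: int, hi: int) -> int:
    """Brute-force sum of MS over the half-open interval [lo, hi)."""
    return sum(ms[lo:hi])


if __name__ == "__main__":
    ms = matching_statistics("abcab", "xabcy")
    bits = encode_bits(ms)
    print(ms, bits, decode_bits(bits) == ms)
    print(range_max(ms, 1, 4), range_sum(ms, 1, 4))
```

In this toy example, a long run of large MS values inside a window would indicate a region of the query that is locally similar to the text, which is the kind of local statistic that range-maximum and range-sum queries are meant to expose.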

Availability and implementation Our C/C++ code is available at https://github.com/odenas/indexed_ms under GPL-3.0.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • cunial@mpi-cbg.de

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.
Posted October 07, 2021.
Subject Area

  • Bioinformatics