bioRxiv
New Results

Matchtigs: minimum plain text representation of kmer sets

Sebastian Schmidt, Shahbaz Khan, Jarno Alanko, Giulio E. Pibiri, Alexandru I. Tomescu
doi: https://doi.org/10.1101/2021.12.15.472871
Sebastian Schmidt
1 Department of Computer Science, University of Helsinki, Helsinki, Finland
  • For correspondence: sebastian.schmidt@helsinki.fi alexandru.tomescu@helsinki.fi
Shahbaz Khan
1 Department of Computer Science, University of Helsinki, Helsinki, Finland
Jarno Alanko
1 Department of Computer Science, University of Helsinki, Helsinki, Finland
2 Faculty of Computer Science, Dalhousie University, Halifax, Canada
Giulio E. Pibiri
3 Department of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University of Venice, Venice, Italy
4 ISTI-CNR, Pisa, Italy
Alexandru I. Tomescu
1 Department of Computer Science, University of Helsinki, Helsinki, Finland
  • For correspondence: sebastian.schmidt@helsinki.fi alexandru.tomescu@helsinki.fi

Abstract

We propose a polynomial-time algorithm for computing a minimum plain-text representation of kmer sets, as well as an efficient near-minimum greedy heuristic. When compressing read sets of large model organisms or bacterial pangenomes, we shrink the representation by up to 60% over unitigs and 27% over previous work, with only a minor increase in runtime. Additionally, the number of strings is decreased by up to 97% over unitigs and 91% over previous work. Finally, a small representation has advantages in downstream applications: it speeds up SSHash-Lite queries by up to 4.26× over unitigs and 2.10× over previous work.
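
To make the two size measures in the abstract concrete (cumulative length and number of strings), the following is a minimal Python sketch of what it means for a set of strings to represent a kmer set. It is an illustration only, not the matchtigs algorithm: the helper names `kmers` and `size`, the toy sequences, and k = 3 are all hypothetical, and reverse complements are ignored for simplicity.

```python
def kmers(strings, k):
    """Return the set of all length-k substrings of the given strings."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}


def size(strings):
    """Return (cumulative length in characters, number of strings)."""
    return sum(len(s) for s in strings), len(strings)


k = 3  # toy value chosen for illustration

# Two hypothetical plain-text representations of the same kmer set.
naive = ["ACG", "CGT", "GTA", "CGA"]  # one string per kmer
compact = ["ACGTA", "CGA"]            # overlapping kmers merged into one string

# Both lists spell out exactly the same kmers ...
assert kmers(naive, k) == kmers(compact, k)

# ... but differ in cumulative length and in string count.
print(size(naive))    # (12, 4)
print(size(compact))  # (8, 2)
```

In this toy case, merging overlapping kmers reduces the cumulative length from 12 to 8 characters and the string count from 4 to 2 while the represented kmer set stays the same.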

Availability

matchtigs: https://github.com/algbio/matchtigs

SSHash-Lite: https://github.com/jermp/sshash-lite

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • Forgot to update abstract

  • [16] Downloaded in February 2022.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted November 22, 2022.