bioRxiv
New Results

Masked superstrings as a unified framework for textual k-mer set representations

Ondřej Sladký, Pavel Veselý, Karel Břinda
doi: https://doi.org/10.1101/2023.02.01.526717
Ondřej Sladký
Charles University in Prague, Czech Republic
Pavel Veselý
Charles University in Prague, Czech Republic
  • For correspondence: vesely@iuuk.mff.cuni.cz karel.brinda@inria.fr
Karel Břinda
Inria/IRISA Rennes, France

Abstract

The popularity of k-mer-based methods has recently led to the development of compact k-mer-set representations, such as simplitigs/Spectrum-Preserving String Sets (SPSS), matchtigs, and eulertigs. These aim to represent k-mer sets via strings that contain individual k-mers as substrings more efficiently than the traditional unitigs. Here, we demonstrate that all such representations can be viewed as superstrings of input k-mers, and as such can be generalized into a unified framework that we call the masked superstring of k-mers. We study the complexity of masked superstring computation and prove NP-hardness for both k-mer superstrings and their masks. We then design local and global greedy heuristics for efficient computation of masked superstrings, implement them in a program called KmerCamel, and evaluate their performance using selected genomes and pan-genomes. Overall, masked superstrings unify the theory and practice of textual k-mer set representations and provide a useful framework for optimizing representations for specific bioinformatics applications.
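The idea of a masked superstring can be illustrated with a minimal sketch. The abstract does not spell out the mask format, so the code below assumes one: the mask is a binary string of the same length as the superstring, and a '1' at position i means that the k-mer starting at position i belongs to the represented set. The function name and the toy input are hypothetical and are not taken from KmerCamel.

```python
def kmers_from_masked_superstring(superstring, mask, k):
    """Return the set of k-mers whose start positions are marked '1' in the mask.

    Assumed convention (not stated in the abstract): the mask has the same
    length as the superstring, and mask[i] == '1' selects superstring[i:i+k].
    """
    if len(superstring) != len(mask):
        raise ValueError("superstring and mask must have the same length")
    kmers = set()
    # Only positions with a full k-mer to their right can start a k-mer.
    for i in range(len(superstring) - k + 1):
        if mask[i] == "1":
            kmers.add(superstring[i:i + k])
    return kmers


# Hypothetical toy example with k = 3: the superstring also contains "GTA",
# but the mask excludes it, so only three k-mers are represented.
toy_superstring = "ACGTAC"
toy_mask = "110100"
toy_kmers = kmers_from_masked_superstring(toy_superstring, toy_mask, 3)
print(sorted(toy_kmers))
```

In this sketch the superstring alone would stand for four k-mers, while the mask narrows it down to the intended set. This is what lets a single string overlap k-mers freely without introducing unwanted ones.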

Competing Interest Statement

The authors have declared no competing interest.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.
Posted February 03, 2023.

Subject Area

  • Bioinformatics