Abstract
The design of efficient dynamic data structures for large k-mer sets is one of the central challenges of sequence bioinformatics. Recent advances in compact k-mer set representations via simplitigs/Spectrum-Preserving String Sets, culminating in the masked superstring framework, have provided data structures of remarkable space efficiency for a wide range of k-mer sets. However, the ability to perform set operations has remained limited due to the static nature of the underlying compact representations. Here, we develop f-masked superstrings, a concept combining masked superstrings with custom demasking functions f to enable efficient k-mer set operations via string concatenation. Combining them with the FMSI index for masked superstrings, we obtain a memory-efficient k-mer index supporting set operations via Burrows-Wheeler Transform merging. The framework provides a promising theoretical solution to a pressing bioinformatics problem and highlights the potential of f-masked superstrings to become an elementary data type for k-mer sets.
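The idea of set operations via concatenation can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`represented_kmers`, `f_or`, `f_xor`, `concat`) and the toy inputs are hypothetical, and the sketch assumes a simplified unidirectional model that ignores reverse complements. A masked superstring is modeled as a string plus a binary mask whose last k-1 positions are 0, and a demasking function f decides, from the mask bits at all occurrences of a k-mer, whether that k-mer belongs to the set.

```python
from collections import defaultdict

def represented_kmers(superstring, mask, k, f):
    """Return the k-mers whose occurrence mask bits satisfy the demasking function f."""
    occurrences = defaultdict(list)
    for i in range(len(superstring) - k + 1):
        occurrences[superstring[i:i + k]].append(mask[i])
    return {kmer for kmer, bits in occurrences.items() if f(bits)}

def f_or(bits):
    # A k-mer is represented if at least one occurrence is masked on.
    return any(bits)

def f_xor(bits):
    # A k-mer is represented if an odd number of occurrences are masked on.
    return sum(bits) % 2 == 1

def concat(a, b):
    # Concatenate superstrings and masks. Because the last k-1 mask bits of
    # `a` are 0, k-mers newly created across the boundary stay masked off.
    return a[0] + b[0], a[1] + b[1]

k = 3
A = ("ACGT", [1, 1, 0, 0])  # toy input representing {ACG, CGT}
B = ("CGTA", [1, 1, 0, 0])  # toy input representing {CGT, GTA}

S, M = concat(A, B)
union = represented_kmers(S, M, k, f_or)
symmetric_difference = represented_kmers(S, M, k, f_xor)
```

In this toy case the same concatenated string yields the union {ACG, CGT, GTA} under `f_or` and the symmetric difference {ACG, GTA} under `f_xor`, since CGT is masked on twice.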
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
2012 ACM Subject Classification Applied computing → Bioinformatics; Theory of computation → Pattern matching
Funding Ondřej Sladký: Supported by GA ČR project 22-22997S and ERC-CZ project LL2406 of the Ministry of Education of the Czech Republic. Pavel Veselý: Supported by GA ČR project 22-22997S, ERC-CZ project LL2406 of the Ministry of Education of the Czech Republic, and the Center for Foundations of Modern Computer Science (Charles Univ. project UNCE 24/SCI/008). Karel Břinda: Supported by the French National Research Agency (ANR) under Grant ANR-24-CE45-1226 for the REALL project.
This version presents more unified results and a more unified narrative. The section on FMSI and the FMSI experiments were moved to a separate manuscript. The experiments on set operations were revised.
https://github.com/OndrejSladky/f-masked-superstrings-supplement