PT - JOURNAL ARTICLE
AU - Sladký, Ondřej
AU - Veselý, Pavel
AU - Břinda, Karel
TI - FroM Superstring to Indexing: a space-efficient index for unconstrained k-mer sets using the Masked Burrows-Wheeler Transform (MBWT)
AID - 10.1101/2024.10.30.621029
DP - 2024 Jan 01
TA - bioRxiv
PG - 2024.10.30.621029
4099 - http://biorxiv.org/content/early/2024/11/03/2024.10.30.621029.short
4100 - http://biorxiv.org/content/early/2024/11/03/2024.10.30.621029.full
AB - The exponential growth of DNA sequencing data limits the searchable proportion of the data. In this context, tokenization of genomic data via their k-merization provides a path towards efficient algorithms for their compression and search. However, indexing even single k-mer sets still remains a significant bioinformatics challenge, especially if k-mer sets are sketched or subsampled. Here, we develop the FMSI index, a space-efficient data structure for unconstrained k-mer sets, based on approximated shortest superstrings and the Masked Burrows-Wheeler Transform (MBWT), an adaptation of the BWT for masked superstrings. We implement this in a program called FMSI, and via extensive evaluations using prokaryotic pan-genomes, we show FMSI substantially improves space efficiency compared to the state of the art, while maintaining a competitive query time. Overall, our work demonstrates that superstring indexing is a highly general, parameter-free approach for modern k-mer sets, without imposing any constraints on their structure. Competing Interest Statement: The authors have declared no competing interest.