A performant bridge between fixed-size and variable-size seeding

Arne Kutzner; Pok-Son Kim; Markus Schmidt

doi:10.1101/825927

ABSTRACT

Seeding is usually the initial step of high-throughput sequence aligners. Two popular seeding strategies are fixed-size seeding (k-mers, minimizers) and variable-size seeding (MEMs, SMEMs, maximally spanning seeds). The former benefits from fast index construction and fast seed computation, while the latter benefits from high seed entropy. Here we build a performant bridge between the two strategies and show that neither is theoretically superior. We propose an algorithmic approach for computing MEMs from k-mers or minimizers. Further, we describe techniques for extracting SMEMs or maximally spanning seeds from MEMs. Comprehensive benchmarking shows the practical value of the proposed approaches. In this context, we report on the effects and the fine-tuning of occurrence filters for the different seeding strategies.
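
To make the idea of deriving MEMs from k-mers concrete, the following minimal Python sketch merges exact k-mer matches that overlap or touch on the same diagonal (reference position minus query position). It is an illustration under simplifying assumptions, not the algorithm proposed in this paper: it assumes the complete set of k-mer matches is available (no minimizer sampling, no occurrence filter), in which case the merged seeds are the MEMs of length at least k. The names `kmer_seeds` and `merge_to_mems` and the `(ref_pos, query_pos, length)` seed representation are hypothetical.

```python
from collections import defaultdict


def kmer_seeds(ref, query, k):
    """Return all exact k-mer matches as (ref_pos, query_pos, length) triples."""
    # Hash index over every k-mer of the reference (assumption: no filtering).
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return [(r, q, k)
            for q in range(len(query) - k + 1)
            for r in index.get(query[q:q + k], [])]


def merge_to_mems(seeds):
    """Merge seeds that overlap or touch on the same diagonal."""
    # Sort by diagonal (ref_pos - query_pos), then by query position, so that
    # seeds belonging to the same maximal match become neighbours.
    ordered = sorted(seeds, key=lambda s: (s[0] - s[1], s[1]))
    merged = []
    for r, q, n in ordered:
        if merged:
            pr, pq, pn = merged[-1]
            same_diagonal = (pr - pq) == (r - q)
            if same_diagonal and q <= pq + pn:
                # Extend the previous seed to cover the current one.
                merged[-1] = (pr, pq, max(pn, q + n - pq))
                continue
        merged.append((r, q, n))
    return merged


mems = merge_to_mems(kmer_seeds("ACGTACGTTT", "GTACGA", 3))
```

With sampled seeds such as minimizers, merging alone would not guarantee maximality; the merged seeds would additionally have to be extended at both ends until a mismatch is reached.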