Abstract
K-mer-based methods are widely used in bioinformatics for various types of sequence comparison. However, a single mutation alters k consecutive k-mers, which makes most k-mer-based applications for sequence comparison sensitive to variable mutation rates. Many techniques have been studied to overcome this sensitivity, e.g., spaced k-mers and k-mer permutation techniques, but these techniques do not handle indels well. For indels, pairs or groups of small k-mers are commonly used, but these methods first produce k-mer matches and only in a second step pair or group them. Such techniques produce many redundant k-mer matches because k is small.
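The effect of a single mutation on k-mers can be illustrated with a small sketch. The sequence, the value of k, and the helper names `kmers` and `changed_kmers` below are hypothetical choices made for illustration only; they are not taken from the reference implementation.

```python
def kmers(seq, k):
    """Return (start position, k-mer) pairs for every k-mer in seq."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def changed_kmers(a, b, k):
    """Start positions whose k-mer differs between equal-length sequences a and b."""
    return [i for (i, x), (_, y) in zip(kmers(a, k), kmers(b, k)) if x != y]


# Hypothetical example sequence with one substitution (C -> G) at position 10.
reference = "ACGTACGTTGCAACGTAGCT"
mutated = reference[:10] + "G" + reference[11:]

k = 5
affected = changed_kmers(reference, mutated, k)
print(affected)  # start positions of the k-mers that overlap the mutated base
```

With k = 5, every k-mer that overlaps the substituted position differs between the two sequences, so a run of k consecutive k-mers no longer matches.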
Here, we propose strobemers as an alternative to k-mers for sequence comparison. Intuitively, strobemers consist of linked minimizers. We show that under a certain minimizer selection technique, strobemers provide more evenly distributed sequence matches than k-mers and are less sensitive to different mutation rates and distributions. Strobemers also produce a higher total match coverage across sequences. Strobemers are a useful alternative to k-mers for performing sequence comparisons as commonly used in sequence alignment, clustering, classification, and error-correction. A reference implementation with code for analyses is available at https://github.com/ksahlin/strobemers.
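One way to read "linked minimizers" is sketched below. This is a simplified illustration under stated assumptions, not the method as defined in the paper or in the reference implementation: it links two k-mers only, it picks the second one as the minimum-hash k-mer in a downstream window, and the function names (`h`, `linked_strobes`), the parameters (`k`, `w_min`, `w_max`), and the use of CRC32 as a hash are all hypothetical choices.

```python
import zlib


def h(s):
    """Deterministic stand-in hash (CRC32); an assumption, any hash could be used."""
    return zlib.crc32(s.encode())


def linked_strobes(seq, k, w_min, w_max):
    """For each start i, link the k-mer at i with the minimum-hash k-mer
    whose start lies in the window [i + w_min, i + w_max].

    Returns a list of (i, j, linked_string) tuples. Positions whose window
    would fall entirely past the end of the sequence are skipped.
    """
    out = []
    last_start = len(seq) - k
    for i in range(last_start + 1):
        lo = i + w_min
        hi = min(i + w_max, last_start)  # truncate the window at the sequence end
        if lo > hi:
            break
        # Minimizer of the window; ties are broken by the leftmost position.
        j = min(range(lo, hi + 1), key=lambda p: (h(seq[p:p + k]), p))
        out.append((i, j, seq[i:i + k] + seq[j:j + k]))
    return out


# Hypothetical usage: print the first few linked pairs.
for i, j, s in linked_strobes("ACGTACGTTGCAACGTAGCT", k=3, w_min=3, w_max=6)[:3]:
    print(i, j, s)
```

Because the second k-mer is chosen by hash value within a window rather than at a fixed offset, the distance between the two linked k-mers varies, which is what lets a match survive a nearby indel in this simplified picture.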
Competing Interest Statement
The authors have declared no competing interest.