Abstract
K-mer-based methods are widely used in bioinformatics for various types of sequence comparison. However, a single mutation alters k consecutive k-mers, which makes most k-mer-based applications for sequence comparison sensitive to variable mutation rates. Many techniques have been studied to overcome this sensitivity, e.g., spaced k-mers and k-mer permutation techniques, but these techniques do not handle indels well. For indels, pairs or groups of small k-mers are commonly used, but these methods first produce k-mer matches and only in a second step pair or group the k-mers. Such techniques produce many redundant k-mer matches due to the small size of k.
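The sensitivity described above can be illustrated with a minimal sketch. The sequence, the mutated position, and the helper names (`kmers`, `changed_kmer_positions`) are hypothetical and chosen only for illustration. A single substitution away from the sequence ends changes every k-mer that overlaps it, i.e., k consecutive k-mers.

```python
def kmers(seq, k):
    """Return all k-mers of seq, ordered by start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def changed_kmer_positions(a, b, k):
    """Start positions at which the k-mers of two equal-length sequences differ."""
    assert len(a) == len(b)
    return [i for i, (x, y) in enumerate(zip(kmers(a, k), kmers(b, k))) if x != y]


# Illustrative (made-up) sequence with one substitution at position 12.
ref = "ACGTACGGTCAGTTCAGGCATCGA"
pos = 12
new_base = "A" if ref[pos] != "A" else "C"
mut = ref[:pos] + new_base + ref[pos + 1:]

k = 5
changed = changed_kmer_positions(ref, mut, k)
print(changed)  # start positions of the k-mers that overlap the substitution
```

With k = 5, the five k-mers starting at positions 8 through 12 all overlap the substituted base and no longer match, while every other k-mer is unaffected.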
Here, we propose strobemers as an alternative to k-mers for sequence comparison. Intuitively, strobemers consist of linked minimizers. We use simulated data to show that strobemers provide more evenly distributed sequence matches and are less sensitive to different mutation rates than k-mers and spaced k-mers. Strobemers also produce a higher match coverage across sequences. We further implement a proof-of-concept sequence matching tool, StrobeMap, and use synthetic and biological Oxford Nanopore sequencing data to show the utility of strobemers for sequence comparison in different contexts such as sequence clustering and alignment scenarios. A reference implementation of our tool StrobeMap, together with code for the analyses, is available at https://github.com/ksahlin/strobemers.
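To make the idea of "linked minimizers" concrete, the following is a rough sketch of one possible construction with two linked substrings (strobes). It is not the StrobeMap implementation. The function name `minstrobes2`, the parameters `k`, `w_min`, and `w_max`, the CRC32 hash, and the truncation of windows at the sequence end are all assumptions made for illustration; the definitions in the paper and repository may differ. For each start position, the first strobe is the k-mer at that position, and the second strobe is the k-mer with the smallest hash value (a minimizer) in a window downstream.

```python
import zlib


def strobe_hash(s):
    """Deterministic hash for a strobe (CRC32 is an arbitrary illustrative choice)."""
    return zlib.crc32(s.encode())


def minstrobes2(seq, k, w_min, w_max):
    """Sketch: link the k-mer at i with the minimizer k-mer in [i + w_min, i + w_max].

    Returns tuples (i, j, strobemer), where i and j are the start positions of
    the first and second strobe. Windows are truncated at the sequence end
    (an assumption of this sketch).
    """
    last_start = len(seq) - k
    out = []
    for i in range(last_start + 1):
        lo = i + w_min
        hi = min(i + w_max, last_start)
        if lo > hi:  # no room left for a second strobe
            break
        # Second strobe: position with the smallest hash in the window.
        j = min(range(lo, hi + 1), key=lambda p: strobe_hash(seq[p:p + k]))
        out.append((i, j, seq[i:i + k] + seq[j:j + k]))
    return out


seq = "ACGTACGGTCAGTTCAGGCATCGA"  # made-up example sequence
strobes = minstrobes2(seq, k=3, w_min=4, w_max=8)
for i, j, s in strobes[:5]:
    print(i, j, s)
```

Because the gap between the two strobes varies with the sequence content rather than being fixed, a seed of this kind can still match across a small insertion or deletion that falls between its strobes, which a contiguous or spaced k-mer of the same span cannot.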
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
- More analyses on biological data
- More informative analysis metrics
- Added spaced k-mers to comparison
- Added analysis of proof-of-concept tool StrobeMap
- Implemented and evaluated new strobemer class, hybridstrobes
- Modifications to definitions/construction of strobemers
- Added runtime and memory comparison
- Added thinning analysis