PT  - JOURNAL ARTICLE
AU  - Dan DeBlasio
AU  - Fiyinfoluwa Gbosibo
AU  - Carl Kingsford
AU  - Guillaume Marçais
TI  - Practical universal &lt;em&gt;k&lt;/em&gt;-mer sets for minimizer schemes
AID  - 10.1101/652925
DP  - 2019 Jan 01
TA  - bioRxiv
PG  - 652925
4099  - http://biorxiv.org/content/early/2019/05/30/652925.short
4100  - http://biorxiv.org/content/early/2019/05/30/652925.full
AB  - Minimizer schemes have found widespread use in genomic applications as a way to quickly predict the matching probability of large sequences. Most methods for minimizer schemes use randomized (or close to randomized) ordering of k-mers when finding minimizers, but recent work has shown that not all non-lexicographic orderings perform the same. One way to find k-mer orderings for minimizer schemes is through the use of universal k-mer sets, which are subsets of k-mers that are guaranteed to cover all windows. The smaller this set the fewer false positives (where two poorly aligned sequences being identified as possible matches) are identified. Current methods for creating universal k-mer sets are limited in the length of the k-mer that can be considered, and cannot compute sets in the range of lengths currently used in practice. We take some of the first steps in creating universal k-mer sets that can be used to construct minimizer orders for large values of k that are practical. We do this using iterative extension of the k-mers in a set, and guided contraction of the set itself. We also show that this process will be guaranteed to never increase the number of distinct minimizers chosen in a sequence, and thus can only decrease the number of false positives over using the current sets on small k-mers.