PT - JOURNAL ARTICLE
AU - Dan DeBlasio
AU - Fiyinfoluwa Gbosibo
AU - Carl Kingsford
AU - Guillaume Marçais
TI - Practical universal k-mer sets for minimizer schemes
AID - 10.1101/652925
DP - 2019 Jan 01
TA - bioRxiv
PG - 652925
4099 - http://biorxiv.org/content/early/2019/05/30/652925.short
4100 - http://biorxiv.org/content/early/2019/05/30/652925.full
AB - Minimizer schemes have found widespread use in genomic applications as a way to quickly predict the matching probability of large sequences. Most methods for minimizer schemes use randomized (or close to randomized) orderings of k-mers when finding minimizers, but recent work has shown that not all non-lexicographic orderings perform the same. One way to find k-mer orderings for minimizer schemes is through the use of universal k-mer sets, which are subsets of k-mers that are guaranteed to cover all windows. The smaller this set, the fewer false positives (where two poorly aligned sequences are identified as possible matches) are reported. Current methods for creating universal k-mer sets are limited in the length of the k-mer that can be considered, and cannot compute sets in the range of lengths currently used in practice. We take some of the first steps in creating universal k-mer sets that can be used to construct minimizer orders for large, practical values of k. We do this using iterative extension of the k-mers in a set and guided contraction of the set itself. We also show that this process is guaranteed never to increase the number of distinct minimizers chosen in a sequence, and thus can only decrease the number of false positives compared with using the current sets on small k-mers.
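
The relationship between a universal k-mer set and a minimizer ordering, as described in the abstract, can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not show their extension or contraction steps. The function names (`make_order`, `minimizers`, `is_universal`), the brute-force universality check, and the hand-built toy set for k = 2 are all assumptions made for illustration. The sketch also assumes that a window consists of w consecutive k-mers, so it spans w + k - 1 characters.

```python
from itertools import product


def make_order(universal_set):
    """Hypothetical ordering: k-mers in the universal set rank first,
    with ties broken lexicographically."""
    def order_key(kmer):
        return (kmer not in universal_set, kmer)
    return order_key


def minimizers(seq, k, w, order_key):
    """Return the set of (position, k-mer) pairs selected as the minimum
    of each window of w consecutive k-mers (leftmost wins on ties)."""
    ks = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(ks) - w + 1):
        best = min(range(start, start + w),
                   key=lambda i: (order_key(ks[i]), i))
        chosen.add((best, ks[best]))
    return chosen


def is_universal(candidate, k, w, alphabet="ACGT"):
    """Brute-force check (feasible only for tiny k and w): every string of
    length w + k - 1 must contain at least one k-mer from the candidate set."""
    for chars in product(alphabet, repeat=w + k - 1):
        s = "".join(chars)
        if not any(s[i:i + k] in candidate for i in range(w)):
            return False
    return True


# Toy set for k = 2, w = 2 (an assumption for illustration, not from the paper):
# all 2-mers except those going from {A, C} to {G, T}.
all_2mers = {a + b for a in "ACGT" for b in "ACGT"}
universal = all_2mers - {"AG", "AT", "CG", "CT"}

picked = minimizers("ACGTACGGT", 2, 2, make_order(universal))
print(sorted(picked))
```

Because every window contains at least one k-mer from a universal set, an ordering that ranks those k-mers first can only ever select minimizers from the set. This is why a smaller universal set limits the number of distinct minimizers.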