Abstract
Minimizers are ubiquitously used in data structures and algorithms for efficient searching, mapping, and indexing of high-throughput DNA sequencing data. Minimizer schemes select a minimum k-mer in every L-long sub-sequence of the target sequence, where minimality is with respect to a predefined k-mer order. Commonly used minimizer orders select more k-mers overall than necessary and therefore provide limited improvement to runtime and memory usage of downstream analysis tasks. The recently introduced universal k-mer hitting sets produce minimizer orders resulting in fewer selected k-mers. Unfortunately, generating compact universal k-mer hitting sets is currently infeasible for k > 13, and thus cannot help in the many applications that need minimizers of larger k.
Here, we close this gap by introducing decycling set-based minimizer orders. We define new orders based on minimum decycling sets, which are guaranteed to hit any infinitely long sequence. We show that in practice these new minimizer orders select a number of k-mers comparable to that of minimizer orders based on universal k-mer hitting sets, and can also scale up to larger k. Furthermore, we developed a query method that avoids the need to keep the k-mers of a decycling set in memory, which enables the use of these minimizer orders for any value of k. We expect the new decycling set-based minimizer orders to improve the runtime and memory usage of algorithms and data structures in high-throughput DNA sequencing analysis.
1 Introduction
As the number and depth of high-throughput sequencing experiments grow, efficient methods to map, store, and search DNA sequences have become critical in their analysis. Sequence sketching is a fundamental building block of many of the basic sequence analysis tasks, such as assembly [20,4], alignment [22,19,11], and binning [2,1,6]. The common principle in all sketching techniques is the selection of a k-mer representative from a long DNA sequence for indexing sequences in data structures or algorithms. Key parameters for evaluating the merits of sketching techniques are density [13], defined as the fraction of k-mers selected, maximum bin load [12], defined as the maximum number of windows sketched by a single k-mer, and conservation of the sketch under mutations or sequencing errors.
One of the most common sequence sketching techniques is minimizers [23]. The minimizer of an L-long sequence is the minimum among all the w = L− k +1 k-mers that it contains, according to some order o over the k-mers. Selecting the minimizers from all overlapping L-long windows of a sequence provides a sketch of that sequence. Despite the advantages of minimizers, commonly used minimizer orders, such as lexicographic and random, have been shown to perform poorly in density [13] and maximum bin load [6]. In addition, the conservation under mutations and robustness to errors of such minimizer orders is low relative to other sketching techniques [3].
A recent breakthrough in developing minimizer orders with lower density has been achieved by compact universal k-mer hitting sets (UHSs) [18]. A UHS is a set of k-mers guaranteed to hit any L-long sequence. In terms of a complete de Bruijn graph of order k, a minimum UHS is a minimum set of nodes whose removal leaves no path of length L−k + 1 in the graph. Heuristic algorithms for finding a minimum UHS include DOCKS [18] and PASHA [5], both of which approach UHS construction as a path covering problem in a complete de Bruijn graph. The UHS-based minimizer orders were shown to achieve lower density than common orders [13,5]. However, constructing and storing UHSs is inefficient due to the exponential dependence of the heuristic algorithms on k, and currently compact UHSs are available only for k≤13. Key to these methods is the ability to efficiently identify a minimum decycling set, which is a set of k-mers guaranteed to hit any infinitely long sequence. A minimum decycling set can be generated in time log-linear in the de Bruijn graph size [15].
Partly due to the challenges in constructing UHSs, other recent works have focused on developing sequence-specific minimizer orders. For example, sequence-specific minimizer orders have been used in binning applications to achieve lower maximum bin size or more balanced bins than general minimizers [1,6]. Hoang et al. [8] used deep learning to achieve sequence-specific low-density minimizers for much longer k (up to 320). Still, these solutions are tailored to a specific sequence, and cannot be generally applied.
In this work, we developed new methods to construct general minimizer orders that scale to larger k. We defined minimizer orders based only on a minimum decycling set. We further improved the scalability of our approach by implementing an efficient method to query in linear time if a k-mer belongs to a minimum decycling set without the need to construct, store, or query the whole set. Finally, we demonstrate that our new decycling-set-based minimizer orders achieve density that is comparable to or better than UHS-based orders. The minimizer orders we defined thus provide for the first time general orders with low density that can scale to any value of k. All code developed under this project is publicly available via github.com/OrensteinLab/DecyclingSetBasedMinimizerOrder.
2 Preliminaries and definitions
We begin by defining and providing theoretical background on concepts necessary for the description and evaluation of our methods.
k-mer
For a string S over an alphabet Σ, a k-mer is a contiguous substring of length k. We denote the k-mer starting at position i as S[i, i + k − 1].
k-mer order
For a function on k-mers $o : \Sigma^k \to \mathbb{R}$, we say that k-mer $x_1$ is less than $x_2$ under $o$ ($x_1 <_o x_2$) iff $o(x_1) < o(x_2)$.
Minimizer scheme
A minimizer scheme is a function $f_{k,w,o} : \Sigma^{w+k-1} \to \{0, \dots, w-1\}$. Function f returns the position of the minimum k-mer in a given window of w overlapping k-mers (i.e., in every L = w + k−1 long window). By convention, ties are broken by choosing the left-most k-mer. The minimizers of a string S, denoted as $\mathcal{M}_{k,w,o}(S)$, are all the positions in the string that are selected by applying the scheme to all overlapping L-long windows of S:
$$\mathcal{M}_{k,w,o}(S) = \{\, i + f_{k,w,o}(S[i, i+w+k-2]) \mid 0 \le i \le |S| - w - k + 1 \,\}$$
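As a minimal sketch of the definition above (not the authors' implementation), the following Python computes the minimizer positions of a string for any order supplied as a key function. The function names and the 0-based position convention are assumptions made for illustration.

```python
def minimizer_position(window, k, order):
    """Offset (0..w-1) of the minimum k-mer in one window under `order`."""
    w = len(window) - k + 1
    # min() returns the first minimal element, so ties go to the left-most k-mer.
    return min(range(w), key=lambda i: order(window[i:i + k]))


def minimizers(seq, k, w, order):
    """Set of 0-based positions selected over all overlapping L-long windows."""
    L = w + k - 1
    return {
        i + minimizer_position(seq[i:i + L], k, order)
        for i in range(len(seq) - L + 1)
    }


if __name__ == "__main__":
    # Lexicographic order: the k-mer string itself serves as the key.
    print(sorted(minimizers("ACGTACGT", k=2, w=3, order=lambda kmer: kmer)))
```

Any callable that returns comparable keys can be passed as `order`, so the same routine can be reused with hash-based or set-based orders.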
Universal hitting set (UHS)
A universal hitting set $U_{k,L} \subseteq \Sigma^k$ is a set of k-mers such that any L-long string contains at least one k-mer from $U_{k,L}$ as a contiguous substring. By construction, at least one k-mer from $U_{k,L}$ must appear in every window of w = L − k + 1 overlapping k-mers, and thus it is possible to define minimizer orders that are compatible with a UHS. An order $o_{U_{k,L},h}$ is compatible with $U_{k,L}$ if $x_1 \in U_{k,L},\ x_2 \notin U_{k,L} \Rightarrow x_1 <_{U_{k,L},h} x_2$, and otherwise, when $x_1$ and $x_2$ are either both in or both not in the UHS, $x_1 <_{U_{k,L},h} x_2 \Leftrightarrow h(x_1) < h(x_2)$ for some order h.
Partition-compatible minimizer order
We extend the above definition of UHS-compatible orders to minimizer orders that are compatible with an ordered partition of $\Sigma^k$. Given an ordered partition of $\Sigma^k$, $\Pi = [C_1, \dots, C_m]$, we define a compatible minimizer order $o_{\Pi,h}$ such that for $x_1 \in C_i$ and $x_2 \in C_j$, $i < j \Rightarrow x_1 <_{\Pi,h} x_2$, and if $i = j$ then $x_1 <_{\Pi,h} x_2 \Leftrightarrow h(x_1) < h(x_2)$ for some order h.
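One way to realize a partition-compatible order in code is to compare k-mers by a pair (index of the part, value of h). The sketch below assumes the parts are given as membership predicates and uses a BLAKE2-based hash as a deterministic stand-in for the order h; both choices, and all names, are illustrative assumptions rather than the authors' code.

```python
import hashlib


def stable_hash(kmer, seed=0):
    """Deterministic pseudo-random 64-bit value (stand-in for the order h)."""
    digest = hashlib.blake2b(f"{seed}:{kmer}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def partition_order(parts, h=stable_hash):
    """Key function for an order compatible with an ordered partition.

    `parts` lists membership predicates for all parts except the last one;
    k-mers matching none of them fall into the final part.
    """
    def key(kmer):
        for idx, contains in enumerate(parts):
            if contains(kmer):
                return (idx, h(kmer))
        return (len(parts), h(kmer))
    return key
```

Python compares tuples element by element, so the part index decides first and h only breaks ties within a part, which mirrors the two cases of the definition.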
de Bruijn graph
A de Bruijn graph (dBG) of order k is a directed graph in which every node is labelled with a distinct k-mer and there may be a directed edge from node a to b iff the (k−1)-long suffix of a is the same as the (k−1)-long prefix of b. The edge is labelled with the (k + 1)-long merge of the two labels. A complete dBG has a node for every possible k-mer and an edge for every possible (k + 1)-mer. Paths in a dBG of order k represent sequences, and a path of w nodes represents a sequence of w overlapping k-mers. Thus, the nodes represented by a UHS $U_{k,L}$ will be a covering set for all (L−k + 1)-long paths in a complete dBG of order k.
Decycling set
A decycling set in a graph G = (V, E) is a set of nodes whose removal results in an acyclic graph. Finding a minimum decycling set (also called feedback vertex set) in an arbitrary graph is NP-hard [9]. We are interested in a minimum decycling set in a complete dBG of order k, which we denote by Dk. Mykkeltveit [15] gave an efficient algorithm to construct such a set in time log-linear in the complete dBG size. A pure cycle is a set of nodes corresponding to all the cyclic rotations of some k-mer [15]. Mykkeltveit showed that Dk contains a single node from each pure cycle in a complete dBG. Moreover, each pure cycle defines a conjugacy class, and thus the pure cycles factor the complete dBG, namely every k-mer belongs to exactly one of the pure cycles.
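To make the pure-cycle factorization concrete, the following brute-force sketch groups all k-mers into conjugacy classes for small k. It is only an illustration (exponential in k, with hypothetical function names); since a decycling set must hit every pure cycle, the number of classes it reports is the size of a minimum decycling set.

```python
from itertools import product


def pure_cycles(k, alphabet="ACGT"):
    """Partition all k-mers into conjugacy classes (sets of cyclic rotations)."""
    seen, classes = set(), []
    for letters in product(alphabet, repeat=k):
        kmer = "".join(letters)
        if kmer in seen:
            continue
        # All cyclic rotations of the k-mer form one pure cycle.
        cycle = {kmer[i:] + kmer[:i] for i in range(k)}
        seen |= cycle
        classes.append(cycle)
    return classes


if __name__ == "__main__":
    for k in (3, 4, 5):
        print(k, len(pure_cycles(k)))
```

Periodic k-mers yield smaller classes (a homopolymer is a class of size one), which is why the class count is slightly larger than |Σ|^k / k.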
Mykkeltveit embedding
To determine which of the cyclic rotations of a k-mer to include in Dk, Mykkeltveit defined an embedding of k-mers in the complex plane. For a k-mer x, the embedding combines the k-th roots of unity, with the i-th root weighted by xi, where xi is the numeric encoding of the i-th character of x (in our case the encoding of the DNA alphabet is: A=0, C=1, G=2, T=3), and ℝ(x) denotes the projection of the embedding onto the real axis (Figure 1). The minimum decycling set constructed by Mykkeltveit’s algorithm includes for each conjugacy class the first counter-clockwise rotation x such that ℝ(x) > 0. When all rotations have ℝ(x) = 0, any arbitrary k-mer from the cycle can be selected.
Figure 1. The embedding is shown for the rotations of the k-mer ACACT, indicated above each subfigure. Each letter of the k-mer encodes a weight (in parentheses) placed at the k-th roots of unity (red dots). The embedding represents the center of mass of the k-mer (black dot). The sign of each embedding projected onto the real axis is shown below each rotation. In this example, ACTAC (red box) is the first counter-clockwise rotation x with ℝ(x) > 0, and is thus selected by Mykkeltveit’s algorithm to participate in a minimum decycling set.
Mykkeltveit’s algorithm has an efficient implementation due to Knuth [10]. This implementation uses the FKM algorithm [7] to enumerate the k-mer conjugacy classes in lexicographic order. The representative selected for each class is the first one with ℝ(x) > 0, and for classes with ℝ(x) = 0 for all k-mer rotations, the lexicographically smallest k-mer is included in the decycling set. A minimum decycling set consists of $O(|\Sigma|^k/k)$ k-mers and it can be generated in time $O(k|\Sigma|^k)$, i.e. log-linear in the dBG size [15,21].
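The selection rule can be checked on small k with a brute-force sketch. This is not Knuth's FKM-based implementation: it simply visits every k-mer. It also assumes that the signed quantity written ℝ(x) in the text is computed as the sum of x_i · sin(2πi/k), a convention chosen here because position 0 then contributes nothing and because it reproduces the ACACT example, where ACTAC is selected. The tolerance `EPS`, the restriction to k ≥ 3, and all function names are further assumptions.

```python
import math
from itertools import product

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}
EPS = 1e-9  # tolerance for treating a floating-point projection as zero


def proj(kmer):
    """Signed projection of the embedding (assumed: sum of x_i * sin(2*pi*i/k))."""
    k = len(kmer)
    return sum(ENC[c] * math.sin(2 * math.pi * i / k) for i, c in enumerate(kmer))


def decycling_set(k):
    """Brute-force decycling set: one representative per conjugacy class."""
    chosen, seen = set(), set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        if kmer in seen:
            continue
        # Consecutive left rotations follow the edges of the pure cycle.
        cycle = [kmer[i:] + kmer[:i] for i in range(k)]
        seen.update(cycle)
        vals = [proj(r) for r in cycle]
        if all(abs(v) <= EPS for v in vals):
            chosen.add(min(cycle))  # all-zero class: lexicographically smallest
            continue
        for i, rot in enumerate(cycle):
            # cycle[i - 1] precedes cycle[i] on the pure cycle (index -1 wraps).
            if vals[i] > EPS and vals[i - 1] <= EPS:
                chosen.add(rot)
                break
    return chosen


def is_acyclic_after_removal(k, removed):
    """Kahn's algorithm on the complete dBG of order k minus `removed`."""
    nodes = {"".join(t) for t in product("ACGT", repeat=k)} - set(removed)
    succ = {v: [v[1:] + c for c in "ACGT" if v[1:] + c in nodes] for v in nodes}
    indeg = {v: 0 for v in nodes}
    for v in nodes:
        for u in succ[v]:
            indeg[u] += 1
    stack = [v for v in nodes if indeg[v] == 0]
    visited = 0
    while stack:
        v = stack.pop()
        visited += 1
        for u in succ[v]:
            indeg[u] -= 1
            if indeg[u] == 0:
                stack.append(u)
    return visited == len(nodes)
```

For example, `is_acyclic_after_removal(3, decycling_set(3))` checks that removing the 24 selected 3-mers leaves no cycle in the complete dBG of order 3.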
Minimizer density
The expected density of a minimizer scheme is the fraction of k-mer positions that will be selected as minimizers in expectation over an infinitely long random i.i.d. sequence. The particular density of a minimizer scheme on a specific sequence S (e.g., the human genome) is the fraction of k-mer positions selected by the scheme on that sequence.
The density factor normalizes density for the window size w of the scheme. We follow the definition of Zheng et al. [8]: for a sequence S the density factor is
$$\frac{|\mathcal{M}_{k,w,o}(S)|}{|S| - L + 1} \cdot (w + 1),$$
where L = w + k−1. This definition of the density factor removes the dependence on L, e.g. making the expected density factor of all random minimizers the same, regardless of k and L. Note that other works define the density factor simply as the density times a factor of (w + 1) (cf. [13]). Expected and particular density factors are defined analogously to expected and particular density. The expected density factor of random minimizers has been shown to be 2 [13] and there is a general non-tight lower-bound of 1.5 [14].
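A small sketch of how a particular density factor could be measured on a sequence is given below. It assumes the density factor is the number of selected positions divided by the number of L-long windows, |S| − L + 1, multiplied by (w + 1); that normalization, like the function names, is an assumption of this example and should be checked against the cited definition.

```python
def minimizers(seq, k, w, order):
    """0-based positions selected by the scheme over all L-long windows."""
    L = w + k - 1
    selected = set()
    for i in range(len(seq) - L + 1):
        window = seq[i:i + L]
        # Left-most minimum k-mer in the window (min() keeps the first minimum).
        offset = min(range(w), key=lambda j: order(window[j:j + k]))
        selected.add(i + offset)
    return selected


def density_factor(seq, k, w, order):
    """(#selected positions / #windows) * (w + 1); normalization is assumed."""
    L = w + k - 1
    n_windows = len(seq) - L + 1
    return len(minimizers(seq, k, w, order)) / n_windows * (w + 1)


if __name__ == "__main__":
    # Lexicographic order on a toy sequence: 3 positions over 5 windows, w = 3.
    print(density_factor("ACGTACGT", k=2, w=3, order=lambda kmer: kmer))
```

A homopolymer is the worst case under this measure: every window selects its own left-most k-mer, so the factor reaches w + 1.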
3 Methods
The current heuristic algorithms that generate compact UHSs begin by constructing a minimum decycling set Dk. We propose using Dk as an “approximate UHS” and defining an order based on it. The rationale for this idea is threefold. First, for most combinations of k and L, the majority of k-mers in UHSs generated by these heuristics belong to Dk. Second, as discussed above, Dk can be generated very efficiently, while the subsequent k-mer additions that the heuristics perform in order to remove long paths are very slow. Third, Zheng et al. [24] showed that following the removal of a minimum decycling set from a complete dBG of order k, the longest remaining path has length $O(k^3)$, which bounds the length of the longest remaining sequence. Below we explore this idea and develop several variants of decycling set-based minimizer orders.
3.1 Decycling set-based minimizer orders
We define a partition-compatible order based only on a minimum decycling set Dk and use it in lieu of a UHS-based minimizer order. In this order, k-mers in Dk precede all other k-mers. Within each set, a random hash function is used to compare between k-mers. Dk can be constructed efficiently using Knuth’s implementation of Mykkeltveit’s algorithm [15] as described above.
For large values of k, when Dk is too large to store or takes too much time to compute, we can instead scan the target sequence and for every k-mer test its membership in Dk on the fly using the procedure outlined in Algorithm 1. The real parts of the embeddings of a k-mer x and its clockwise rotation x′ are computed in O(k) time and compared to determine if x is the first counterclockwise rotation with ℝ(x) > 0. If ℝ(x) = ℝ(x′) = 0, then the algorithm determines whether x is a lexicographically smallest rotation in O(k) time.
(Alg. 1 correctness) Alg. 1 correctly determines whether a k-mer is a member of Dk in time O(k).
Proof. The proof follows from the definition of Dk. We say that a k-mer x is positive, negative, or non-positive if ℝ(x) > 0, < 0, or ≤ 0, respectively. Recall that a k-mer x ∈ Dk iff either: (i) it is the first positive counter-clockwise rotation in its conjugacy class; or, (ii) all k-mers in the conjugacy class have ℝ = 0 and x is a lexicographically smallest rotation.
For (i), line 6 returns true iff the input k-mer x is the first positive counter-clockwise rotation in its conjugacy class, i.e. x has ℝ(x) > 0 and the one-letter clockwise rotation of x, denoted x′, has ℝ(x′) ≤ 0.
For (ii), note that if two consecutive rotations of a k-mer x, x′ have ℝ(x) = ℝ(x′) = 0 (lines 7-8), then all k-mers in that conjugacy class have zero embedding (Lemma 1 in Mykkeltveit [15]). The loop in lines 10-14 checks all possible rotations of x and returns false if it finds a k-mer that is lexicographically smaller than x (line 11), otherwise it returns true either if it checked all possible rotations and none of them is lexicographically smaller than x (i = 0 and j ≥k−1) or it finds that x is identical to one of its rotations and x is a lexicographically smallest rotation (i = k and j ≥k−1).
The embedding computations (lines 1, 2, and 4) take O(k) time. The loop beginning on line 10 can run for at most 2k times and performs constant time computations per iteration for a total running time of O(k).
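The membership test described in this section can be sketched in Python as follows. This is not a transcription of Algorithm 1, whose listing is not reproduced here, and its line numbers do not apply to the code below. Two assumptions are made: the signed quantity ℝ(x) is computed as the sum of x_i · sin(2πi/k), the convention that reproduces the ACACT example, and the smallest-rotation check uses a standard linear-time necklace test swapped in for the loop the proof refers to. `EPS` and the function names are illustrative, and k ≥ 3 is assumed.

```python
import math

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}
EPS = 1e-9  # tolerance for treating a floating-point projection as zero


def proj(kmer):
    """Signed projection of the embedding (assumed: sum of x_i * sin(2*pi*i/k))."""
    k = len(kmer)
    return sum(ENC[c] * math.sin(2 * math.pi * i / k) for i, c in enumerate(kmer))


def is_smallest_rotation(kmer):
    """Linear-time necklace test: True iff no rotation is lexicographically smaller."""
    n, p = len(kmer), 1
    for i in range(1, n):
        if kmer[i] < kmer[i - p]:
            return False
        if kmer[i] > kmer[i - p]:
            p = i + 1
    return n % p == 0


def in_decycling_set(kmer):
    """O(k) membership test for the decycling set, without building the set."""
    prev = kmer[-1] + kmer[:-1]  # rotation preceding kmer on its pure cycle
    a, b = proj(kmer), proj(prev)
    if a > EPS and b <= EPS:
        return True  # first positive rotation of its conjugacy class
    if abs(a) <= EPS and abs(b) <= EPS:
        # Two consecutive zero projections: the whole class has zero embedding.
        return is_smallest_rotation(kmer)
    return False


if __name__ == "__main__":
    for kmer in ("ACACT", "CACTA", "ACTAC", "CTACA", "TACAC"):
        print(kmer, in_decycling_set(kmer))
```

Because the test only needs the k-mer and its preceding rotation, it can be applied while scanning a sequence, which is the property that lets the order scale to large k.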
3.2 Double decycling set-based minimizer orders
By symmetry, Mykkeltveit’s construction can be used to create a minimum decycling set using the first counter-clockwise negative k-mer x in each conjugacy class rather than the first positive one. We refer to this set as the symmetric decycling set. The decycling set and symmetric decycling set divide sequences according to the following interesting property:
Theorem 1 (remaining path partition). In any remaining path in the complete dBG after removing Dk, all the positive nodes precede all the non-positive nodes.
In other words, a remaining path must consist of two distinct parts: A positive part, containing only positive k-mers, followed by the second non-positive part consisting of non-positive k-mers only. The proof relies on two lemmas:
Lemma 1. The k-mers associated with all incoming neighbours of a node x in a dBG have the same value of ℝ.
Proof. All incoming neighbours y of x differ only in y0, whose weight does not contribute to ℝ(y), and therefore all have the same ℝ(y).
Lemma 2. The pure cycles factor the complete dBG, namely, every k-mer belongs to exactly one of the pure cycles.
Proof. Every k-mer is on some pure cycle corresponding to its rotations. Assume to the contrary that k-mer x is on two distinct pure cycles, C1 and C2. Let y be the last common node in the path in C1∩C2 starting from x. Then, the edges out of y in the two cycles are distinct, contradicting the fact that both correspond to the cyclic rotation of y.
Proof (Thm. 1). Let $x_i$ be the first non-positive node in a remaining path $x_1, \dots, x_t$ and assume to the contrary that there exists a positive $x_j$ for j > i. W.l.o.g. assume $x_j$ is the first with that property in the path, so that $x_{j-1}$ is non-positive. Let C be the pure cycle that contains $x_j$. C exists and it is well defined by Lemma 2. Let y be the node preceding $x_j$ in C. Both y and $x_{j-1}$ are incoming neighbours of $x_j$, so by Lemma 1, ℝ($x_{j-1}$) = ℝ(y). Since y is therefore non-positive, $x_j$ should be in Dk as the first positive node in C, a contradiction.
By a similar argument, in a remaining path after removing the symmetric decycling set, the negative nodes precede all other nodes. Thus, removal of a double decycling set, consisting of the union of Dk and the symmetric decycling set, would leave only short remaining paths that cannot contain both negative and positive k-mers.
We define a partition-compatible minimizer order based on double decycling sets, in which the k-mers of the double decycling set precede all other k-mers. Because the double decycling set leaves even shorter remaining paths, we hypothesize that this minimizer order may achieve lower density compared to the one using only a single decycling set.
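The sketch below illustrates one possible double decycling set-based key and checks the path-partition property by brute force for small k. Several choices here are assumptions and not taken from the text: the parts are ordered as decycling set, then symmetric decycling set, then all remaining k-mers; ℝ(x) is computed as the sum of x_i · sin(2πi/k); zero-embedding classes are represented by their lexicographically smallest rotation and placed in the first part; and a BLAKE2-based hash stands in for the pseudo-random order.

```python
import hashlib
import math
from itertools import product

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}
EPS = 1e-9  # tolerance for treating a floating-point projection as zero


def proj(kmer):
    """Signed projection of the embedding (assumed: sum of x_i * sin(2*pi*i/k))."""
    k = len(kmer)
    return sum(ENC[c] * math.sin(2 * math.pi * i / k) for i, c in enumerate(kmer))


def part_index(kmer):
    """0: decycling set, 1: symmetric set only, 2: neither (ordering assumed)."""
    prev = kmer[-1] + kmer[:-1]  # rotation preceding kmer on its pure cycle
    a, b = proj(kmer), proj(prev)
    if a > EPS and b <= EPS:
        return 0  # first positive rotation
    if abs(a) <= EPS and abs(b) <= EPS:
        # Zero-embedding class: smallest rotation represents it (simple O(k^2) check).
        rotations = [kmer[i:] + kmer[:i] for i in range(len(kmer))]
        return 0 if kmer == min(rotations) else 2
    if a < -EPS and b >= -EPS:
        return 1  # first negative rotation
    return 2


def double_decycling_key(kmer, seed=0):
    """Sort key: part index first, then a deterministic pseudo-random hash."""
    digest = hashlib.blake2b(f"{seed}:{kmer}".encode(), digest_size=8).digest()
    return (part_index(kmer), int.from_bytes(digest, "big"))


def violates_path_partition(k):
    """Look for a surviving dBG edge that steps from non-positive to positive."""
    for letters in product("ACGT", repeat=k):
        y = "".join(letters)
        if part_index(y) == 0:
            continue  # y was removed with the decycling set
        for c in "ACGT":
            x = y[1:] + c
            if part_index(x) != 0 and proj(y) <= EPS and proj(x) > EPS:
                return True
    return False
```

For the rotations of ACACT, this key places ACTAC in the first part and TACAC in the second, and `violates_path_partition(3)` returns `False`, consistent with the remaining-path property stated above.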
3.3 Modified decycling set-based minimizer orders
We defined another variant of the decycling set-based order to account for homopolymers. Long homopolymers in a sequence can increase the particular density of a decycling set-compatible minimizer order, and removing them from the set may improve the scheme performance. We note that all homopolymers have an embedding with ℝ(x) = 0. Since in practice k-mers with embedding 0 are only a small fraction of the decycling set, we simply choose to exclude all of them from the decycling set. We call the resulting set the modified decycling set and the corresponding order the modified decycling set-based order. This relieves us of the need to perform lines 7-14 in Algorithm 1, and as a result could speed up the membership test. Modified symmetric decycling sets are defined analogously, as is the modified double decycling set compatible order.
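Under the same assumed projection as elsewhere in these sketches (ℝ(x) computed as the sum of x_i · sin(2πi/k)), the modified membership tests reduce to two sign comparisons, with no smallest-rotation check. The function names and the tolerance `EPS` are illustrative assumptions.

```python
import math

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}
EPS = 1e-9  # tolerance for treating a floating-point projection as zero


def proj(kmer):
    """Signed projection of the embedding (assumed: sum of x_i * sin(2*pi*i/k))."""
    k = len(kmer)
    return sum(ENC[c] * math.sin(2 * math.pi * i / k) for i, c in enumerate(kmer))


def in_modified_decycling_set(kmer):
    """First positive rotation only; zero-embedding classes are excluded."""
    prev = kmer[-1] + kmer[:-1]  # rotation preceding kmer on its pure cycle
    return proj(kmer) > EPS and proj(prev) <= EPS


def in_modified_symmetric_set(kmer):
    """First negative rotation only; zero-embedding classes are excluded."""
    prev = kmer[-1] + kmer[:-1]
    return proj(kmer) < -EPS and proj(prev) >= -EPS


if __name__ == "__main__":
    # Homopolymers have zero embedding, so they are never selected.
    print(in_modified_decycling_set("AAAAA"), in_modified_decycling_set("ACTAC"))
```

For k = 5 the only zero-embedding classes are the four homopolymers, so each modified set has 204 members instead of 208.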
4 Results
We compared the performance of our new decycling set-based minimizer orders to UHS-based orders and random orders, across a range of k and L values. We measured performance using expected and particular density factors. Expected density factors were estimated by measuring density on five random i.i.d. sequences of 1M nt. Particular density factors were measured on a randomly selected 1M nt segment from chromosome X of the CHM13 telomere-to-telomere human genome assembly [16] with 10 repetitions using different seeds for the pseudo-random hash functions. We used Python’s hash function as the pseudorandom hash to compare between k-mers within each set of a partition. Scripts to compute the expected and particular density of the different minimizer orders are available from github.com/OrensteinLab/DecyclingSetBasedMinimizerOrder.
4.1 Decycling set-based orders outperform UHS-based orders
Figure 2 compares the density factors of the tested orders for k = 11 and varying L values, and for k = 5 to 15 and L = 100. Average density factors over the repeated runs are shown for visual clarity. The same plots with error bars displayed are in Figure S1. The order denoted “decycling-UHS” is a variant of UHS order in which the decycling set k-mers precede the rest of the UHS. UHS- and decycling-UHS-compatible orders were generated by DOCKS for k < 12 and by PASHA for 12≤k≤13. The sets of PASHA are slightly less compact than those of DOCKS, hence the slight bump in density factor for the UHS order at k = 12 (Figures 2A,B). UHSs for larger k could not be generated due to time- and storage-intensive computation required for every combination of k and L. In contrast, the decycling set-based orders have the distinct advantage of being easily computed on the fly for any (k, L) combination.
Figure 2. The expected density (A,C) and particular density factors on CHM13X (B,D) of different minimizer orders are compared for fixed L = 100 and varying k (A,B) and fixed k = 11 and varying L (C,D).
The decycling set-based orders consistently perform similarly to or better than UHS-based orders. As expected, random orders typically do worst, and the relative improvement of UHS- and decycling set-based orders compared to random orders increases with k. Conversely, as L grows for fixed k the density factors of the different methods are more similar. The particular density matches the expected density relatively closely for all orders but is much noisier (Figure S1).
4.2 Scaling to k ≥ 20
We compared the decycling set-based orders to the random baseline orders for much larger k than is possible with UHS-compatible orders. Figure 3 shows results for k = 20, 50 and 100. Average density factors over the repeated runs are shown. The same plots with error bars displayed are in Figure S2.
Figure 3. The expected density factor (A,C,E) and the particular density factor on CHM13X (B,D,F) of different minimizer orders are compared for fixed k = 20 (A,B), fixed k = 50 (C,D), and fixed k = 100 (E,F) for varying L. Note that in C-F the lines for the modified and unmodified orders are almost identical.
As k grows, the advantage of the decycling set-based order becomes even more pronounced and the double decycling set-based order improves more significantly over the decycling set-based order. This is true in particular for shorter L, with the differences between the decycling and double decycling set-based orders disappearing as L grows. At the same time, for larger k, the modified variants of the decycling set and double decycling set orders perform essentially the same as the original, but with improved k-mer query runtime. In all cases, the particular density factor is very close to that of the expected density factor.
5 Discussion
In this work, we solved one of the major limitations of UHS-based minimizer orders. By relieving the strict requirement of generating a set of k-mers that hits every L-long sequence, we were able to generate minimizer orders that are close to universal and can be calculated efficiently on the fly. Based on Mykkeltveit’s algorithm, we developed a method to determine if a k-mer belongs to a minimum decycling set, which can be applied to any k. We demonstrated that minimizer orders based on minimum decycling sets are comparable or better in their density to minimizer orders based on UHSs, thus achieving good performance while avoiding escalating runtime and memory usage with the increase of k.
We also defined the modified and double decycling set orders. For longer k and relatively shorter L, the double decycling set-based order yields much lower density than even the decycling set-based order. Although we did not perform extensive runtime comparisons of the methods, the double decycling set-based order is generally slower to compute than the decycling set-based order, and the modified orders perform fewer computations and thus can be slightly faster. As the density of the different methods converges as L increases, this suggests using modified double decycling set-based order for smaller L to achieve lower density, while modified decycling set-based order can be used for larger L and achieve similar density with faster running times. Based on the results we have presented, a general rule-of-thumb appears to be that the advantage of the double decycling set persists until around L = 2.5k.
We see several promising future directions to take. First, it may be possible to more rigorously define which of the different decycling set-based orders is better to use for each given combination of k and L. Second, frequency-based orders are known to be highly efficient in terms of density while easily computable as sequence-specific minimizer orders. It will be interesting to extend our work by ranking each of the sets in a partition by their frequency in a specific sequence dataset to achieve lower density values (as was recently shown by incorporating UHS-based orders with frequency ranking [17]). Third, it may be possible to use decycling sets and their variants as sketches without defining compatible minimizer orders by simply including all decycling set k-mers in the sketch. By choosing an appropriate value of k and decycling set variant it may be possible to achieve a given desired density. Such schemes would be better conserved than minimizers as they are not dependent on a longer sequence window.
Our new approach can enable more efficient analyses of high-throughput DNA sequencing data. By implementing our new decycling set-based minimizer orders in data structures and algorithms of high-throughput DNA sequencing analysis, we expect to see reductions in runtime and memory usage, beyond what was previously demonstrated using UHS-based minimizer orders.
Acknowledgments
This study was supported by a United States–Israel Binational Science Foundation (BSF) grant no. 2020297 to YO and BB. RS was supported in part by the Israel Science Foundation (grant 2206/22) and by Len Blavatnik and the Blavatnik Family foundation. DP and LP were supported in part by fellowships from the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University. LP was supported in part by the National Natural Science Foundation of China project 61902072.