New Results

Efficient minimizer orders for large values of k using minimum decycling sets

David Pellow, Lianrong Pu, Baris Ekim, Lior Kotlar, Bonnie Berger, Ron Shamir, Yaron Orenstein
doi: https://doi.org/10.1101/2022.10.18.512682
1 Blavatnik School of Computer Science, Tel-Aviv University, Israel (David Pellow, Lianrong Pu, Ron Shamir)
2 Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA (Baris Ekim, Bonnie Berger)
3 Department of Computer Science, Ben-Gurion University, Israel (Lior Kotlar)
4 Department of Computer Science, Bar-Ilan University, Israel (Yaron Orenstein)
5 The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Israel (Yaron Orenstein)
For correspondence: yaron.orenstein@gmail.com

Abstract

Minimizers are ubiquitously used in data structures and algorithms for efficient searching, mapping, and indexing of high-throughput DNA sequencing data. Minimizer schemes select a minimum k-mer in every L-long sub-sequence of the target sequence, where minimality is with respect to a predefined k-mer order. Commonly used minimizer orders select more k-mers overall than necessary and therefore provide limited improvement to runtime and memory usage of downstream analysis tasks. The recently introduced universal k-mer hitting sets produce minimizer orders resulting in fewer selected k-mers. Unfortunately, generating compact universal k-mer hitting sets is currently infeasible for k > 13, and thus cannot help in the many applications that need minimizers of larger k.

Here, we close this gap by introducing decycling set-based minimizer orders. We define new orders based on minimum decycling sets, which are guaranteed to hit any infinitely long sequence. We show that in practice these new minimizer orders select a number of k-mers comparable to that of minimizer orders based on universal k-mer hitting sets, and can also scale up to larger k. Furthermore, we developed a query method that avoids the need to keep the k-mers of a decycling set in memory, which enables the use of these minimizer orders for any value of k. We expect the new decycling set-based minimizer orders to improve the runtime and memory usage of algorithms and data structures in high-throughput DNA sequencing analysis.

1 Introduction

As the number and depth of high-throughput sequencing experiments grow, efficient methods to map, store, and search DNA sequences have become critical in their analysis. Sequence sketching is a fundamental building block of many of the basic sequence analysis tasks, such as assembly [20,4], alignment [22,19,11], and binning [2,1,6]. The common principle in all sketching techniques is the selection of a k-mer representative from a long DNA sequence for indexing sequences in data structures or algorithms. Key parameters for evaluating the merits of sketching techniques are density [13], defined as the fraction of k-mers selected, maximum bin load [12], defined as the maximum number of windows sketched by a single k-mer, and conservation of the sketch under mutations or sequencing errors.

One of the most common sequence sketching techniques is minimizers [23]. The minimizer of an L-long sequence is the minimum among all the w = L − k + 1 k-mers that it contains, according to some order o over the k-mers. Selecting the minimizers from all overlapping L-long windows of a sequence provides a sketch of that sequence. Despite the advantages of minimizers, commonly used minimizer orders, such as lexicographic and random, have been shown to perform poorly in density [13] and maximum bin load [6]. In addition, the conservation under mutations and robustness to errors of such minimizer orders are low relative to other sketching techniques [3].

A recent breakthrough in developing minimizer orders with lower density has been achieved by compact universal k-mer hitting sets (UHSs) [18]. A UHS is a set of k-mers guaranteed to hit any L-long sequence. In terms of a complete de Bruijn graph of order k, a minimum UHS is a minimum set of nodes whose removal leaves no path of length L−k + 1 in the graph. Heuristic algorithms for finding a minimum UHS include DOCKS [18] and PASHA [5], both of which approach UHS construction as a path covering problem in a complete de Bruijn graph. The UHS-based minimizer orders were shown to achieve lower density than common orders [13,5]. However, constructing and storing UHSs is inefficient due to the exponential dependence of the heuristic algorithms on k, and currently compact UHSs are available only for k≤13. Key to these methods is the ability to efficiently identify a minimum decycling set, which is a set of k-mers guaranteed to hit any infinitely long sequence. A minimum decycling set can be generated in time log-linear in the de Bruijn graph size [15].

Partly due to the challenges in constructing UHSs, other recent works have focused on developing sequence-specific minimizer orders. For example, sequence-specific minimizer orders have been used in binning applications to achieve lower maximum bin size or more balanced bins than general minimizers [1,6]. Hoang et al. [8] used deep learning to achieve sequence-specific low-density minimizers for much longer k (up to 320). Still, these solutions are tailored to a specific sequence, and cannot be generally applied.

In this work, we developed new methods to construct general minimizer orders that scale to larger k. We defined minimizer orders based only on a minimum decycling set. We further improved the scalability of our approach by implementing an efficient method to query in linear time if a k-mer belongs to a minimum decycling set without the need to construct, store, or query the whole set. Finally, we demonstrate that our new decycling-set-based minimizer orders achieve density that is comparable to or better than UHS-based orders. The minimizer orders we defined thus provide for the first time general orders with low density that can scale to any value of k. All code developed under this project is publicly available via github.com/OrensteinLab/DecyclingSetBasedMinimizerOrder.

2 Preliminaries and definitions

We begin by defining and providing theoretical background on concepts necessary for the description and evaluation of our methods.

k-mer

For a string S over an alphabet Σ, a k-mer is a contiguous substring of length k. We denote the k-mer starting at position i as S[i, i + k − 1].

k-mer order

For a function on k-mers $o : \Sigma^k \to \mathbb{R}$, we say that k-mer $x_1$ is less than $x_2$ under o ($x_1 <_o x_2$) iff $o(x_1) < o(x_2)$.

Minimizer scheme

A minimizer scheme is a function $f_{k,w,o} : \Sigma^{w+k-1} \to \{0, \dots, w-1\}$. The function returns the position of the minimum k-mer in a given window of w overlapping k-mers (i.e., in every L = w + k − 1 long window). By convention, ties are broken by choosing the left-most k-mer. The minimizers of a string S, denoted as $\mathcal{M}_{k,w,o}(S)$, are all the positions in the string that are selected by applying the scheme to all overlapping L-long windows of S:

$$\mathcal{M}_{k,w,o}(S) = \{\, i + f_{k,w,o}(S[i, i+w+k-2]) \;\mid\; 0 \le i \le |S| - w - k + 1 \,\}$$
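As a minimal sketch of the definition above (not the authors' implementation), the following Python function applies a scheme to every L-long window of a string and collects the selected positions. The name `minimizer_positions` and the use of a Python callable for the order o are illustrative assumptions; the lexicographic order in the usage line is only an example.

```python
def minimizer_positions(S, k, w, order):
    """Positions selected by the minimizer scheme f_{k,w,o} over all windows of S.

    `order` maps a k-mer to a comparable value (the order o); hypothetical name.
    """
    L = w + k - 1
    selected = set()
    for i in range(len(S) - L + 1):
        # The w overlapping k-mers of the window starting at position i.
        window = [S[i + j:i + j + k] for j in range(w)]
        # min() returns the first minimal element, i.e. the left-most on ties.
        offset = min(range(w), key=lambda j: order(window[j]))
        selected.add(i + offset)
    return sorted(selected)


# Lexicographic order on a toy sequence with k = 2 and w = 3 (so L = 4).
print(minimizer_positions("ACGTACGT", 2, 3, order=lambda x: x))  # [0, 1, 4]
```

Consecutive windows often share their minimizer, which is why the number of selected positions is smaller than the number of windows.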

Universal hitting set (UHS)

A universal hitting set $U_{k,L} \subseteq \Sigma^k$ is a set of k-mers such that any L-long string contains at least one k-mer from $U_{k,L}$ as a contiguous substring. By construction, at least one k-mer from $U_{k,L}$ must appear in every window of w = L − k + 1 overlapping k-mers, and thus it is possible to define minimizer orders that are compatible with a UHS. An order $o_{U_{k,L},h}$ is compatible with $U_{k,L}$ if $x_1 \in U_{k,L}$, $x_2 \notin U_{k,L}$ implies $x_1 <_{o_{U_{k,L},h}} x_2$, and otherwise, when $x_1$ and $x_2$ are either both in or both not in the UHS, $x_1 <_{o_{U_{k,L},h}} x_2 \Leftrightarrow h(x_1) < h(x_2)$ for some order h.

Partition-compatible minimizer order

We extend the above definition of UHS-compatible orders to minimizer orders that are compatible with an ordered partition of $\Sigma^k$. Given an ordered partition of $\Sigma^k$, $\Pi = [C_1, \dots, C_m]$, we define a compatible minimizer order $o_{\Pi,h}$ such that for $x_1 \in C_i$, $x_2 \in C_j$: if i < j then $x_1 <_{o_{\Pi,h}} x_2$, and if i = j then $x_1 <_{o_{\Pi,h}} x_2 \Leftrightarrow h(x_1) < h(x_2)$ for some order h.
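A partition-compatible order can be sketched as a two-level sort key: the index of the part that contains the k-mer, followed by the value of h. The code below is an illustration under assumptions: the helper names, the toy partition, and the BLAKE2-based stand-in for h are not taken from the paper.

```python
import hashlib


def h(kmer):
    """Stand-in for the order h: a fixed pseudo-random hash (an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def partition_order_key(kmer, partition):
    """Sort key for a partition-compatible order: part index first, then h."""
    for idx, part in enumerate(partition):
        if kmer in part:
            return (idx, h(kmer))
    raise ValueError(f"{kmer} is not covered by the partition")


# Hypothetical ordered partition of the 2-mers over the alphabet {A, C}.
partition = [{"AC"}, {"CA", "CC"}, {"AA"}]
ranked = sorted(["AA", "CC", "AC", "CA"],
                key=lambda x: partition_order_key(x, partition))
print(ranked)
```

Python compares tuples lexicographically, so every k-mer of an earlier part precedes every k-mer of a later part, and h only decides comparisons within a part. A UHS-compatible order is the special case of a partition with two parts.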

de Bruijn graph

A de Bruijn graph (dBG) of order k is a directed graph in which every node is labelled with a distinct k-mer and there may be a directed edge from node a to b iff the (k−1)-long suffix of a is the same as the (k−1)-long prefix of b. The edge is labelled with the (k + 1)-long merge of the two labels. A complete dBG has a node for every possible k-mer and an edge for every possible (k + 1)-mer. Paths in a dBG of order k represent sequences, and a path of w nodes represents a sequence of w overlapping k-mers. Thus, the nodes represented by a UHS $U_{k,L}$ will be a covering set for all (L − k + 1)-long paths in a complete dBG of order k.
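The adjacency of a complete dBG never needs to be stored, because neighbours follow directly from the node label. The sketch below illustrates this for the DNA alphabet; all function names are hypothetical.

```python
def successors(kmer, alphabet="ACGT"):
    """Nodes reachable by one edge: drop the first letter, append one letter."""
    return [kmer[1:] + c for c in alphabet]


def predecessors(kmer, alphabet="ACGT"):
    """Nodes with an edge into kmer: prepend one letter, drop the last letter."""
    return [c + kmer[:-1] for c in alphabet]


def edge_label(a, b):
    """The (k+1)-long merge of two adjacent node labels."""
    if a[1:] != b[:-1]:
        raise ValueError("no edge from a to b")
    return a + b[-1]


def path_to_sequence(path):
    """Spell the sequence represented by a path of w nodes (length w + k - 1)."""
    return path[0] + "".join(node[-1] for node in path[1:])


path = ["ACG", "CGT", "GTT"]
print(path_to_sequence(path))  # ACGTT
```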

Decycling set

A decycling set in a graph G = (V, E) is a set of nodes whose removal results in an acyclic graph. Finding a minimum decycling set (also called feedback vertex set) in an arbitrary graph is NP-hard [9]. We are interested in a minimum decycling set in a complete dBG of order k, which we denote by $D_k$. Mykkeltveit [15] gave an efficient algorithm to construct such a set in time log-linear in the complete dBG size. A pure cycle is a set of nodes corresponding to all the cyclic rotations of some k-mer [15]. Mykkeltveit showed that $D_k$ contains a single node from each pure cycle in a complete dBG. Moreover, each pure cycle defines a conjugacy class, and thus the pure cycles factor the complete dBG, namely every k-mer belongs to exactly one of the pure cycles.
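For small k the pure cycles can be enumerated by brute force, which gives a quick way to see how large a minimum decycling set is (one node per pure cycle). The following is an illustrative sketch with hypothetical function names, not the FKM-based enumeration used in practice.

```python
from itertools import product


def pure_cycle(kmer):
    """All cyclic rotations of a k-mer, i.e. its conjugacy class."""
    return frozenset(kmer[i:] + kmer[:i] for i in range(len(kmer)))


def pure_cycles(k, alphabet="ACGT"):
    """The set of pure cycles of the complete dBG of order k (brute force)."""
    return {pure_cycle("".join(p)) for p in product(alphabet, repeat=k)}


cycles = pure_cycles(3)
print(len(cycles))                  # number of pure cycles for k = 3
print(sum(len(c) for c in cycles))  # every 3-mer is counted exactly once
```

Because the classes are disjoint and cover all k-mers, the class sizes sum to $|\Sigma|^k$.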

Mykkeltveit embedding

To determine which of the cyclic rotations of a k-mer to include in $D_k$, Mykkeltveit defined an embedding of k-mers in the complex plane. For a k-mer x, the embedding is the weighted sum of the k-th roots of unity, in which the i-th root carries the weight $x_i$, where $x_i$ is the numeric encoding of the i-th character of x (in our case the encoding of the DNA alphabet is: A=0, C=1, G=2, T=3) (Figure 1); ℝ(x) denotes the projection of the embedding onto the real axis. The minimum decycling set constructed by Mykkeltveit’s algorithm includes for each conjugacy class the first counter-clockwise rotation x such that ℝ(x) > 0. When all rotations have ℝ(x) = 0, any arbitrary k-mer from the cycle can be selected.

Fig. 1: Mykkeltveit embedding.

The embedding is shown for the rotations of the k-mer ACACT, indicated above each subfigure. Each letter of the k-mer encodes a weight (in parentheses) placed at the k-th roots of unity (red dots). The embedding represents the center of mass of the k-mer (black dot). The sign of each embedding projected onto the real axis is shown below each rotation. In this example, ACTAC (red box) is the first counter-clockwise rotation x with ℝ(x) > 0, and is thus selected by Mykkeltveit’s algorithm to participate in a minimum decycling set.
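The example in Fig. 1 can be reproduced numerically. The sketch below assumes that the roots of unity are laid out so that ℝ(x) equals $\sum_i x_i \sin(2\pi i/k)$ with 0-based positions. This convention is an assumption made here: it gives the first character zero weight and selects ACTAC among the rotations of ACACT, as in the figure. The function names are illustrative.

```python
import math

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}  # encoding given in the text


def projection(kmer):
    """Signed projection R(x), assuming position i carries weight sin(2*pi*i/k)."""
    k = len(kmer)
    return sum(ENC[c] * math.sin(2 * math.pi * i / k) for i, c in enumerate(kmer))


def rotations(kmer):
    """Rotations in pure-cycle order: each step moves the first letter to the end."""
    return [kmer[i:] + kmer[:i] for i in range(len(kmer))]


for rot in rotations("ACACT"):
    value = projection(rot)
    print(rot, round(value, 3), "+" if value > 0 else "-")
```

Under this convention the signs along the cycle ACACT, CACTA, ACTAC, CTACA, TACAC are −, −, +, +, −, so ACTAC is the first positive rotation that follows a non-positive one.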

Mykkeltveit’s algorithm has an efficient implementation due to Knuth [10]. This implementation uses the FKM algorithm [7] to enumerate the k-mer conjugacy classes in lexicographic order. The representative selected for each class is the first one with ℝ(x) > 0, and for classes with ℝ(x) = 0 for all k-mer rotations, the lexicographically smallest k-mer is included in the decycling set. A minimum decycling set consists of $O(|\Sigma|^k / k)$ k-mers and it can be generated in time $O(k|\Sigma|^k)$, i.e. log-linear in the dBG size [15,21].

Minimizer density

The expected density of a minimizer scheme is the fraction of k-mer positions that will be selected as minimizers in expectation over an infinitely long random i.i.d. sequence. The particular density of a minimizer scheme on a specific sequence S (e.g., the human genome) is the fraction of k-mer positions selected by the scheme on that sequence, $|\mathcal{M}_{k,w,o}(S)| / (|S| - k + 1)$. The density factor normalizes density for the window size w of the scheme. We follow the definition of Zheng et al. [8]: for a sequence S the density factor is $\frac{|\mathcal{M}_{k,w,o}(S)|}{|S| - L + 1} \cdot (w + 1)$, where L = w + k − 1. This definition of the density factor removes the dependence on L, e.g. making the expected density factor of all random minimizers the same, regardless of k and L. Note that other works define the density factor simply as the density times a factor of (w + 1) (cf. [13]). Expected and particular density factors are defined analogously to expected and particular density. The expected density factor of random minimizers has been shown to be 2 [13] and there is a general non-tight lower-bound of 1.5 [14].
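The particular density of a scheme on a given sequence can be computed directly from the selected positions. The sketch below is illustrative only: the function names are hypothetical, and the density factor it reports is the simpler variant mentioned above (density times w + 1).

```python
def minimizer_positions(S, k, w, order):
    """Positions selected over all windows of w k-mers (left-most on ties)."""
    selected = set()
    for i in range(len(S) - (w + k - 1) + 1):
        offset = min(range(w), key=lambda j: order(S[i + j:i + j + k]))
        selected.add(i + offset)
    return sorted(selected)


def particular_density(S, k, w, order):
    """Fraction of the |S| - k + 1 k-mer positions of S that are selected."""
    return len(minimizer_positions(S, k, w, order)) / (len(S) - k + 1)


def simple_density_factor(S, k, w, order):
    """Density times (w + 1), the simpler definition used by other works."""
    return particular_density(S, k, w, order) * (w + 1)


S = "ACGTACGT"
print(particular_density(S, 2, 3, order=lambda x: x))     # 3 of 7 positions
print(simple_density_factor(S, 2, 3, order=lambda x: x))
```

On a sequence this short the value is dominated by boundary effects; the measurements in the paper use sequences of 1M nt.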

3 Methods

The current heuristic algorithms that generate compact UHSs begin by constructing a minimum decycling set $D_k$. We propose using $D_k$ as an “approximate UHS” and defining an order based on it. The rationale for this idea is threefold. First, for most combinations of k and L, the majority of k-mers in UHSs generated by these heuristics belong to $D_k$. Second, as discussed above, $D_k$ can be generated very efficiently, while the subsequent k-mer additions that the heuristics perform in order to remove long paths are very slow. Third, Zheng et al. [24] showed that following the removal of a minimum decycling set from a complete dBG of order k, the longest remaining path has length $O(k^3)$, which bounds the length of the longest remaining sequence. Below we explore this idea and develop several variants of decycling set-based minimizer orders.

3.1 Decycling set-based minimizer orders

We define a partition-compatible order based only on a minimum decycling set Dk and use it in lieu of a UHS-based minimizer order. In this order, k-mers in Dk precede all other k-mers. Within each set, a random hash function is used to compare between k-mers. Dk can be constructed efficiently using Knuth’s implementation of Mykkeltveit’s algorithm [15] as described above.

For large values of k, when Dk is too large to store or takes too much time to compute, we can instead scan the target sequence and for every k-mer test its membership in Dk on the fly using the procedure outlined in Algorithm 1. The real parts of the embeddings of a k-mer x and its clockwise rotation x′ are computed in O(k) time and compared to determine if x is the first counterclockwise rotation with ℝ(x) > 0. If ℝ(x) = ℝ(x′) = 0, then the algorithm determines whether x is a lexicographically smallest rotation in O(k) time.

Algorithm 1: Decycling set membership (the listing is a figure in the original and is not reproduced here)

Proposition 1

(Alg. 1 correctness) Alg. 1 correctly determines whether a k-mer is a member of Dk in time O(k).

Proof. The proof follows from the definition of Dk. We say that a k-mer x is positive, negative, or non-positive if ℝ(x) > 0, < 0, or ≤ 0, respectively. Recall that a k-mer x ∈ Dk iff either: (i) it is the first positive counter-clockwise rotation in its conjugacy class; or, (ii) all k-mers in the conjugacy class have ℝ = 0 and x is a lexicographically smallest rotation.

For (i), line 6 returns true iff the input k-mer x is the first positive counterclockwise rotation in its conjugacy class, i.e. x has ℝ(x) > 0 and the one letter clockwise rotation of x, denoted x′ has ℝ(x′)≤0.

For (ii), note that if two consecutive rotations of a k-mer x, x′ have ℝ(x) = ℝ(x′) = 0 (lines 7-8), then all k-mers in that conjugacy class have zero embedding (Lemma 1 in Mykkeltveit [15]). The loop in lines 10-14 checks all possible rotations of x and returns false if it finds a k-mer that is lexicographically smaller than x (line 11), otherwise it returns true either if it checked all possible rotations and none of them is lexicographically smaller than x (i = 0 and j ≥k−1) or it finds that x is identical to one of its rotations and x is a lexicographically smallest rotation (i = k and j ≥k−1).

The embedding computations (lines 1, 2, and 4) take O(k) time. The loop beginning on line 10 can run for at most 2k times and performs constant time computations per iteration for a total running time of O(k).
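The membership test described in this section can be sketched in Python as follows. This is not a transcription of Algorithm 1, and its line structure does not correspond to the line numbers cited in the proof. It assumes ℝ(x) = $\sum_i x_i \sin(2\pi i/k)$ with 0-based positions, uses a floating-point tolerance `EPS` (an assumption), and checks the lexicographically smallest rotation with a simple $O(k^2)$ comparison rather than the $O(k)$ loop of the paper. All names are hypothetical.

```python
import math
from itertools import product

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}
EPS = 1e-9  # tolerance for treating a floating-point projection as zero (assumption)


def projection(kmer):
    """R(x), assuming position i carries weight sin(2*pi*i/k)."""
    k = len(kmer)
    return sum(ENC[c] * math.sin(2 * math.pi * i / k) for i, c in enumerate(kmer))


def is_smallest_rotation(kmer):
    """True if no rotation is lexicographically smaller (simple O(k^2) check)."""
    return all(kmer <= kmer[i:] + kmer[:i] for i in range(1, len(kmer)))


def in_decycling_set(kmer):
    """Membership in D_k without constructing or storing the set."""
    r = projection(kmer)
    r_prev = projection(kmer[-1] + kmer[:-1])  # preceding rotation on the pure cycle
    if r > EPS and r_prev <= EPS:
        return True   # first positive rotation of its conjugacy class
    if abs(r) <= EPS and abs(r_prev) <= EPS:
        return is_smallest_rotation(kmer)  # class with zero embedding
    return False


def decycling_set(k):
    """Enumerate D_k by brute force (only sensible for small k)."""
    return {x for x in map("".join, product("ACGT", repeat=k)) if in_decycling_set(x)}


print(in_decycling_set("ACTAC"), in_decycling_set("CTACA"))
print(len(decycling_set(4)))
```

For small k the enumeration gives a sanity check: the set should contain exactly one k-mer per pure cycle, and removing it should leave an acyclic graph.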

3.2 Double decycling set-based minimizer orders

By symmetry, Mykkeltveit’s construction can be used to create a minimum decycling set using the first counter-clockwise negative k-mer x in each conjugacy class rather than the first positive one. We refer to this set as the symmetric decycling set $\tilde{D}_k$. The decycling set and symmetric decycling set divide sequences according to the following interesting property:

Theorem 1

(remaining path partition). In any remaining path in the complete dBG after removing Dk, all the positive nodes precede all the non-positive nodes.

In other words, a remaining path must consist of two distinct parts: A positive part, containing only positive k-mers, followed by the second non-positive part consisting of non-positive k-mers only. The proof relies on two lemmas:

Lemma 1

The k-mers associated with all incoming neighbours of a node x in a dBG have the same value of ℝ.

Proof. All incoming neighbours y of x differ only in $y_0$, whose weight does not contribute to ℝ(y), and therefore they all have the same ℝ(y).

Lemma 2

The pure cycles factor the complete dBG, namely, every k-mer belongs to exactly one of the pure cycles.

Proof. Every k-mer is on some pure cycle corresponding to its rotations. Assume the contrary that k-mer x is on two distinct pure cycles, C1 and C2. Let y be the last common node in the path in C1∩C2 starting from x. Then, the edges out of y in the two cycles are distinct, contradicting the fact that both correspond to the cyclic rotation of y.

Proof (Thm. 1). Let $x_i$ be the first non-positive node in a remaining path $x_1, \dots, x_t$ and assume to the contrary that there exists a positive $x_j$ for j > i. W.l.o.g. assume $x_j$ is the first node with that property in the path, so that $x_{j-1}$ is non-positive. Let C be the pure cycle that contains $x_j$; C exists and is well defined by Lemma 2. Let y be the node preceding $x_j$ in C. Both $x_{j-1}$ and y are incoming neighbours of $x_j$, so by Lemma 1, ℝ($x_{j-1}$) = ℝ(y). Since y is therefore non-positive, $x_j$ should be in $D_k$ as the first positive node in C, a contradiction.

By a similar argument, in a remaining path after removing the symmetric decycling set $\tilde{D}_k$, the negative nodes precede all other nodes. Thus, removal of a double decycling set consisting of $D_k \cup \tilde{D}_k$ would leave only short remaining paths that cannot contain both negative and positive k-mers.
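Theorem 1 can be checked exhaustively for small k, since it is equivalent to the statement that no edge of the remaining graph leads from a non-positive node to a positive node. The sketch below performs this check by enumerating all (k + 1)-mers. It assumes ℝ(x) = $\sum_i x_i \sin(2\pi i/k)$ with 0-based positions and a floating-point tolerance; the names are hypothetical.

```python
import math
from itertools import product

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}
EPS = 1e-9  # floating-point tolerance (assumption)


def projection(kmer):
    """R(x), assuming position i carries weight sin(2*pi*i/k)."""
    k = len(kmer)
    return sum(ENC[c] * math.sin(2 * math.pi * i / k) for i, c in enumerate(kmer))


def in_decycling_set(kmer):
    """First positive rotation, or smallest rotation of a zero-embedding class."""
    r = projection(kmer)
    r_prev = projection(kmer[-1] + kmer[:-1])
    if r > EPS and r_prev <= EPS:
        return True
    if abs(r) <= EPS and abs(r_prev) <= EPS:
        return all(kmer <= kmer[i:] + kmer[:i] for i in range(1, len(kmer)))
    return False


def violating_edges(k):
    """Remaining edges u -> v that go from a non-positive to a positive node."""
    bad = []
    for p in product("ACGT", repeat=k + 1):
        edge = "".join(p)
        u, v = edge[:-1], edge[1:]
        if in_decycling_set(u) or in_decycling_set(v):
            continue  # the edge is not part of the remaining graph
        if projection(u) <= EPS and projection(v) > EPS:
            bad.append((u, v))
    return bad


print(violating_edges(5))  # an empty list is what Theorem 1 predicts
```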

We define a partition-compatible minimizer order based on double decycling sets with the ordered partition $\Pi = [D_k, \tilde{D}_k, \Sigma^k \setminus (D_k \cup \tilde{D}_k)]$, where $\tilde{D}_k$ is the symmetric decycling set. Because the double decycling set leaves even shorter remaining paths, we hypothesize that this minimizer order may achieve lower density compared to the one using only a single decycling set.
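A double decycling set-based order can be sketched as a sort key with three parts: the decycling set first, then the symmetric decycling set, then all other k-mers. The code assumes ℝ(x) = $\sum_i x_i \sin(2\pi i/k)$ with 0-based positions, places classes with zero embedding in the decycling set only, and uses a BLAKE2 hash as a stand-in for the random order within each part. These choices and all names are assumptions made for illustration.

```python
import hashlib
import math

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}
EPS = 1e-9  # floating-point tolerance (assumption)


def projection(kmer):
    """R(x), assuming position i carries weight sin(2*pi*i/k)."""
    k = len(kmer)
    return sum(ENC[c] * math.sin(2 * math.pi * i / k) for i, c in enumerate(kmer))


def part_index(kmer):
    """0: decycling set, 1: symmetric decycling set, 2: all other k-mers."""
    r = projection(kmer)
    r_prev = projection(kmer[-1] + kmer[:-1])  # preceding rotation on the pure cycle
    if r > EPS and r_prev <= EPS:
        return 0  # first positive rotation
    if abs(r) <= EPS and abs(r_prev) <= EPS:
        # Zero-embedding class: only the smallest rotation is selected (part 0).
        is_min = all(kmer <= kmer[i:] + kmer[:i] for i in range(1, len(kmer)))
        return 0 if is_min else 2
    if r < -EPS and r_prev >= -EPS:
        return 1  # first negative rotation
    return 2


def double_decycling_key(kmer):
    """Sort key: part index first, then a pseudo-random hash within the part."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return (part_index(kmer), int.from_bytes(digest, "big"))


rots = ["ACACT", "CACTA", "ACTAC", "CTACA", "TACAC"]
print([part_index(x) for x in rots])
print(min(rots, key=double_decycling_key))
```

Passing `double_decycling_key` as the order of a minimizer scheme makes k-mers of the decycling set win every window in which they occur, with the symmetric set as the second choice.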

3.3 Modified decycling set-based minimizer orders

We defined another variant of the decycling set-based order to account for homopolymers. Long homopolymers in a sequence can increase the particular density of a decycling set-compatible minimizer order, and removing them from the set may improve the scheme performance. We note that all homopolymers have an embedding with ℝ(x) = 0. Since in practice k-mers with embedding 0 are only a small fraction of the decycling set, we simply choose to exclude all of them from the decycling set. We call the resulting set the modified decycling set and the corresponding order the modified decycling set-based order. This relieves us of the need to perform lines 7-14 in Algorithm 1, and as a result could speed up the membership test. Modified symmetric decycling sets are defined analogously, as is the modified double decycling set-compatible order.

4 Results

We compared the performance of our new decycling set-based minimizer orders to UHS-based orders and random orders, across a range of k and L values. We measured performance using expected and particular density factors. Expected density factors were estimated by measuring density on five random i.i.d. sequences of 1M nt. Particular density factors were measured on a randomly selected 1M nt segment from chromosome X of the CHM13 telomere-to-telomere human genome assembly [16] with 10 repetitions using different seeds for the pseudo-random hash functions. We used Python’s hash function as the pseudorandom hash to compare between k-mers within each set of a partition. Scripts to compute the expected and particular density of the different minimizer orders are available from github.com/OrensteinLab/DecyclingSetBasedMinimizerOrder.
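The estimation procedure for the random baseline can be sketched as follows. The sketch swaps in a keyed BLAKE2 hash for Python's built-in `hash`, so that each repetition is reproducible from an explicit seed (string hashing in Python is salted per process unless `PYTHONHASHSEED` is set). The sequence length, the number of repetitions, the simple density factor (density times w + 1), and all function names are assumptions chosen to keep the example small.

```python
import hashlib
import random


def keyed_hash(kmer, seed):
    """Pseudo-random rank of a k-mer under a given non-negative seed."""
    key = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(kmer.encode(), digest_size=8, key=key).digest()
    return int.from_bytes(digest, "big")


def density_factor(S, k, w, rank):
    """Density of the scheme on S times (w + 1); left-most k-mer wins ties."""
    ranks = [rank(S[i:i + k]) for i in range(len(S) - k + 1)]
    selected = set()
    for i in range(len(ranks) - w + 1):
        window = ranks[i:i + w]
        selected.add(i + window.index(min(window)))  # index() gives the left-most minimum
    return len(selected) / len(ranks) * (w + 1)


def estimate_expected_density_factor(k, w, n=20000, repeats=3, seed=0):
    """Average density factor of a random order over random i.i.d. sequences."""
    rng = random.Random(seed)
    values = []
    for r in range(repeats):
        S = "".join(rng.choice("ACGT") for _ in range(n))
        values.append(density_factor(S, k, w, lambda x: keyed_hash(x, seed + r)))
    return sum(values) / len(values)


print(estimate_expected_density_factor(k=8, w=10))
```

For a random order the printed value should be close to 2, the expected density factor of random minimizers.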

4.1 Decycling set-based orders outperform UHS-based orders

Figure 2 compares the density factors of the tested orders for k = 11 and varying L values, and for k = 5 to 15 and L = 100. Average density factors over the repeated runs are shown for visual clarity. The same plots with error bars displayed are in Figure S1. The order denoted “decycling-UHS” is a variant of UHS order in which the decycling set k-mers precede the rest of the UHS. UHS- and decycling-UHS-compatible orders were generated by DOCKS for k < 12 and by PASHA for 12≤k≤13. The sets of PASHA are slightly less compact than those of DOCKS, hence the slight bump in density factor for the UHS order at k = 12 (Figures 2A,B). UHSs for larger k could not be generated due to time- and storage-intensive computation required for every combination of k and L. In contrast, the decycling set-based orders have the distinct advantage of being easily computed on the fly for any (k, L) combination.

Fig. 2: Density factor of decycling set-based minimizer orders and UHS-based orders.

The expected density factors (A,C) and particular density factors on CHM13X (B,D) of different minimizer orders are compared for fixed L = 100 and varying k (A,B) and fixed k = 11 and varying L (C,D).

The decycling set-based orders consistently perform similarly to or better than UHS-based orders. As expected, random orders typically do worst, and the relative improvement of UHS- and decycling set-based orders compared to random orders increases with k. Conversely, as L grows for fixed k the density factors of the different methods become more similar. The particular density matches the expected density relatively closely for all orders but is much noisier (Figure S1).

4.2 Scaling to k ≥ 20

We compared the decycling set-based orders to the random baseline orders for much larger k than is possible with UHS-compatible orders. Figure 3 shows results for k = 20, 50 and 100. Average density factors over the repeated runs are shown. The same plots with error bars displayed are in Figure S2.

Fig. 3: Density factors of decycling set-based orders for large k.

The expected density factor (A,C,E) and the particular density factor on CHM13X (B,D,F) of different minimizer orders are compared for fixed k = 20 (A,B), fixed k = 50 (C,D), and fixed k = 100 (E,F) for varying L. Note that in C-F the lines for the modified and unmodified orders are almost identical.

As k grows, the advantage of the decycling set-based order becomes even more pronounced and the double decycling set-based order improves more significantly over the decycling set-based order. This is true in particular for shorter L, with the differences between the decycling and double decycling set-based orders disappearing as L grows. At the same time, for larger k, the modified variants of the decycling set and double decycling set orders perform essentially the same as the original, but with improved k-mer query runtime. In all cases, the particular density factor is very close to that of the expected density factor.

5 Discussion

In this work, we solved one of the major limitations of UHS-based minimizer orders. By relaxing the strict requirement of generating a set of k-mers that hits every L-long sequence, we were able to generate minimizer orders that are close to universal and can be calculated efficiently on the fly. Based on Mykkeltveit’s algorithm, we developed a method to determine if a k-mer belongs to a minimum decycling set, which can be applied to any k. We demonstrated that minimizer orders based on minimum decycling sets are comparable or better in their density to minimizer orders based on UHSs, thus achieving good performance while avoiding escalating runtime and memory usage with the increase of k.

We also defined the modified and double decycling set orders. For longer k and relatively shorter L, the double decycling set-based order yields much lower density than even the decycling set-based order. Although we did not perform extensive runtime comparisons of the methods, the double decycling set-based order is generally slower to compute than the decycling set-based order, and the modified orders perform fewer computations and thus can be slightly faster. As the density of the different methods converges as L increases, this suggests using modified double decycling set-based order for smaller L to achieve lower density, while modified decycling set-based order can be used for larger L and achieve similar density with faster running times. Based on the results we have presented, a general rule-of-thumb appears to be that the advantage of the double decycling set persists until around L = 2.5k.

We see several promising future directions to take. First, it may be possible to more rigorously define which of the different decycling set-based orders is better to use for each given combination of k and L. Second, frequency-based orders are known to be highly efficient in terms of density while easily computable as sequence-specific minimizer orders. It will be interesting to extend our work by ranking each of the sets in a partition by their frequency in a specific sequence dataset to achieve lower density values (as was recently shown by incorporating UHS-based orders with frequency ranking [17]). Third, it may be possible to use decycling sets and their variants as sketches without defining compatible minimizer orders by simply including all decycling set k-mers in the sketch. By choosing an appropriate value of k and decycling set variant it may be possible to achieve a given desired density. Such schemes would be better conserved than minimizers as they are not dependent on a longer sequence window.

Our new approach can enable more efficient analyses of high-throughput DNA sequencing data. By implementing our new decycling set-based minimizer orders in data structures and algorithms of high-throughput DNA sequencing analysis, we expect to see reductions in runtime and memory usage, beyond what was previously demonstrated using UHS-based minimizer orders.

Acknowledgments

This study was supported by a United States–Israel Binational Science Foundation (BSF) grant no. 2020297 to YO and BB. RS was supported in part by the Israel Science Foundation (grant 2206/22) and by Len Blavatnik and the Blavatnik Family foundation. DP and LP were supported in part by fellowships from the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University. LP was supported in part by the National Natural Science Foundation of China project 61902072.

Footnotes

  • https://github.com/OrensteinLab/DecyclingSetBasedMinimizerOrder

References

  1. Chikhi, R., Limasset, A., Medvedev, P.: Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics 32(12), i201–i208 (2016)
  2. Deorowicz, S., Kokot, M., Grabowski, S., Debudaj-Grabysz, A.: KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31(10), 1569–1576 (2015), https://doi.org/10.1093/bioinformatics/btv022
  3. Edgar, R.: Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences. PeerJ 9, e10805 (2021)
  4. Ekim, B., Berger, B., Chikhi, R.: Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer. Cell Systems (2021)
  5. Ekim, B., Berger, B., Orenstein, Y.: A randomized parallel algorithm for efficiently finding near-optimal universal hitting sets. In: Research in Computational Molecular Biology. pp. 37–53. Springer International Publishing (2020)
  6. Flomin, D., Pellow, D., Shamir, R.: Data set-adaptive minimizer order reduces memory usage in k-mer counting. Journal of Computational Biology (2022)
  7. Fredricksen, H., Maiorana, J.: Necklaces of beads in k colors and k-ary de Bruijn sequences. Discrete Mathematics 23(3), 207–210 (1978), https://www.sciencedirect.com/science/article/pii/0012365X7890002X
  8. Hoang, M., Zheng, H., Kingsford, C.: Differentiable learning of sequence-specific minimizer schemes with DeepMinimizer. Journal of Computational Biology (2022)
  9. Karp, R.: Reducibility among combinatorial problems. In: Miller, R., Thatcher, J. (eds.) Complexity of Computer Computations, pp. 85–103. Plenum Press (1972)
  10. Knuth, D.E.: Unavoidable2. http://www-cs-faculty.stanford.edu/~uno/programs/unavoidable2.w (2003)
  11. Li, H.: Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18), 3094–3100 (2018)
  12. Li, Y., Kamousi, P., Han, F., Yang, S., Yan, X., Suri, S.: Memory efficient minimum substring partitioning. In: Proceedings of the VLDB Endowment. vol. 6, pp. 169–180. VLDB Endowment (2013)
  13. Marçais, G., Pellow, D., Bork, D., Orenstein, Y., Shamir, R., Kingsford, C.: Improving the performance of minimizers and winnowing schemes. Bioinformatics 33(14), i110–i117 (2017), https://doi.org/10.1093/bioinformatics/btx235
  14. Marçais, G., DeBlasio, D., Kingsford, C.: Asymptotically optimal minimizers schemes. Bioinformatics 34(13), i13–i22 (2018), https://doi.org/10.1093/bioinformatics/bty258
  15. Mykkeltveit, J.: A proof of Golomb’s conjecture for the de Bruijn graph. Journal of Combinatorial Theory, Series B 13(1), 40–45 (1972), http://www.sciencedirect.com/science/article/pii/00958956729000684
  16. Nurk, S., Koren, S., Rhie, A., Rautiainen, M., Bzikadze, A.V., Mikheenko, A., Vollger, M.R., Altemose, N., Uralsky, L., Phillippy, A.M., et al: The complete sequence of a human genome. Science 376(6588), 44–53 (2022), https://www.science.org/doi/abs/10.1126/science.abj6987
  17. Nyström-Persson, J., Keeble-Gagnere, G., Zawad, N.: Compact and evenly distributed k-mer binning for genomic sequences. Bioinformatics 37(17), 2563–2569 (2021)
  18. Orenstein, Y., Pellow, D., Marçais, G., Shamir, R., Kingsford, C.: Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing. PLoS Computational Biology 13(10), e1005777 (2017)
  19. Pellow, D., Duttaf, A., Shamir, R.: Parameterized syncmer schemes improve long-read mapping. bioRxiv (2022)
  20. Rautiainen, M., Marschall, T.: MBG: Minimizer-based sparse de Bruijn Graph construction. Bioinformatics 37(16), 2476–2478 (2021), https://doi.org/10.1093/bioinformatics/btab004
  21. Ruskey, F., Savage, C., Wang, T.M.Y.: Generating necklaces. Journal of Algorithms 13(3), 414–430 (1992)
  22. Sahlin, K.: Flexible seed size enables ultra-fast and accurate read alignment. bioRxiv (2022), https://www.biorxiv.org/content/early/2022/05/25/2021.06.18.449070
  23. Schleimer, S., Wilkerson, D.S., Aiken, A.: Winnowing: local algorithms for document fingerprinting. In: Proceedings of the 2003 ACM SIGMOD international conference on Management of data. pp. 76–85 (2003)
  24. Zheng, H., Kingsford, C., Marçais, G.: Lower density selection schemes via small universal hitting sets with short remaining path length. In: Research in Computational Molecular Biology. pp. 202–217. Springer International Publishing (2020)
Posted October 21, 2022.