Skip to main content
bioRxiv
  • Home
  • About
  • Submit
  • ALERTS / RSS
Advanced Search
New Results

Weighted minimizer sampling improves long read mapping

Chirag Jain, Arang Rhie, Haowen Zhang, Claudia Chu, Sergey Koren, Adam Phillippy
doi: https://doi.org/10.1101/2020.02.11.943241
Chirag Jain
1National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Arang Rhie
1National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Haowen Zhang
2College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Claudia Chu
2College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Sergey Koren
1National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Adam Phillippy
1National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • For correspondence: chirag.jain@nih.gov
  • Abstract
  • Full Text
  • Info/History
  • Metrics
  • Preview PDF
Loading

Abstract

Motivation In this era of exponential data growth, minimizer sampling has become a standard algorithmic technique for rapid genome sequence comparison. This technique yields a sub-linear representation of sequences, enabling their comparison in reduced space and time. A key property of the minimizer technique is that if two sequences share a substring of a specified length, then they can be guaranteed to have a matching minimizer. However, because the k-mer distribution in eukaryotic genomes is highly uneven, minimizer-based tools (e.g., Minimap2, Mashmap) opt to discard the most frequently occurring minimizers from the genome in order to avoid excessive false positives. By doing so, the underlying guarantee is lost and accuracy is reduced in repetitive genomic regions.

Results We introduce a novel weighted-minimizer sampling algorithm. A unique feature of the proposed algorithm is that it performs minimizer sampling while taking into account a weight for each k-mer; i.e, the higher the weight of a k-mer, the more likely it is to be selected. By down-weighting frequently occurring k-mers, we are able to meet both objectives: (i) avoid excessive false-positive matches, and (ii) maintain the minimizer match guarantee. We tested our algorithm, Winnowmap, using both simulated and real long-read data and compared it to a state-of-the-art long read mapper, Minimap2. Our results demonstrate a reduction in the mapping error-rate from 0.14% to 0.06% in the recently finished human X chromosome (154.3 Mbp), and from 3.6% to 0% within the highly repetitive X centromere (3.1 Mbp). Winnowmap improves mapping accuracy within repeats and achieves these results with sparser sampling, leading to better index compression and competitive runtimes.

Contact adam.phillippy{at}nih.gov

Availability Winnowmap is built on top of the Minimap2 codebase (Li, 2018) and is available at https://github.com/marbl/winnowmap.

1 Introduction

Continued development of time and space-efficient algorithmic techniques has been pivotal for dealing with the exponential growth of DNA sequencing throughput. In the context of mapping and alignment applications, tools have evolved from purely alignment-based (Smith and Waterman, 1981), to seed-and-extend (Altschul et al., 1997; Kurtz et al., 2004), to succinct text indexing (Langmead and Salzberg, 2012; Yu et al., 2015), and now to ‘sketch’-based (Ondov et al., 2016; Li, 2018) techniques. Sketch-based algorithms employ dimensionality reduction to transform a sequence into a more compact representation, e.g., a subset of k-mers present in the sequence (Marçais et al., 2019; Rowe, 2019). While these algorithms continue to be widely leveraged in bioinformatics, they are even more prevalent for long-read (PacBio/Oxford Nanopore) analyses because longer strings are more amenable to compaction. As such, several long-read based mappers (Li, 2016; Popic and Batzoglou, 2017; Jain et al., 2018; Li, 2018), genome assemblers (Berlin et al., 2015; Koren et al., 2017; Chin and Khalak, 2019; Shafin et al., 2019; Kundu et al., 2019), metagenomic read classifiers (Dilthey et al., 2019), transcriptomic tools (Sahlin and Medvedev, 2019; Sahlin et al., 2020) employ either minimizer- or MinHash-based sequence comparison.

Minimizer sampling was introduced to the field by Roberts et al. (2004) for scaling the genome assembly problem, after being independently described a year earlier by Schleimer et al. (2003) in the text mining literature. Given a fixed window length w and a pre-defined ordering of all k-mers, minimizer sampling selects the minimum k-mer from every consecutive window (Figure 1). Minimizers can be collected in linear time with regard to the sequence length, and matching substrings can be quickly identified since the algorithm is guaranteed to select a common minimizer for two sequences if they share an exact match of at least w + k − 1 bases long. Further, if the ordering of k-mers is determined using a random permutation, then the expected minimizer sampling density is known to be 2/(w + 1) (Roberts et al., 2004). Due to these properties, there exists a trade-off between sensitivity and speed when deciding the window length w.

Fig. 1.
  • Download figure
  • Open in new tab
Fig. 1.

Visualization of the minimizer sampling technique (Roberts et al., 2004). The sequence being sampled is shown as a black line. Assuming a pre-defined order of k-mers (e.g., lexicographical), the sampling selects each of the smallest k-mer(s) (shown as red arrows) in consecutive sliding windows (shown as blue intervals). The length of each window, i.e., w is three in the above example.

As one would expect, minimizers originating from repetitive genomic segments are sampled at a higher frequency than minimizers from unique regions. When used for seeding pairwise alignments, high-frequency minimizers lead to excessive false-positives and can prohibitively increase runtime and memory usage. A popular yet sub-optimal way to deal with this problem is to mask minimizers whose frequency exceeds a certain threshold. For instance, Minimap2 (Li, 2018) discards the top 0.02% most frequently occurring minimizers, whereas Mashmap (Jain et al., 2018) discards the top 0.001% by default. An unfortunate consequence of masking is that minimizer matches are no longer guaranteed and accuracy is reduced in repetitive genomic regions. Negative effects of k-mer masking have been highlighted before, such as poor homology detection in genomic repeats (Frith, 2011) and mis-assembly of high-copy plasmids (Koren et al., 2017).

To address the above problem, we propose optimizations to the standard minimizer sampling technique. We formulate the idea of ‘weighted-minimizers’, where k-mers that are assigned higher (or lower) weights are more (or less) likely to be selected as minimizers. Using this technique, we show that down-weighting highly repetitive k-mers avoids excessive false-positive matches, while still maintaining the guarantee that all matching substrings of length w+k−1 or greater will share a minimizer. In addition to weighted minimizers, we also incorporate and demonstrate the utility of ‘robust winnowing’ (Schleimer et al., 2003). In the case of multiple equally-minimum k-mers in a window, robust winnowing breaks the tie by preferring a k-mer that has already been chosen by a previous window. This differs from the standard algorithm (Roberts et al., 2004), which selects all equally-minimum k-mers, and is a simple yet crucial trick for efficient handling of low-complexity sequences such as “ACACAC…”. After implementing the above optimizations, we prove that the time complexity to compute weighted, robust minimizers remains linear with regard to sequence length and that the original minimizer match guarantee is preserved.

We refer to our implementation as Winnowmap, which replaces the minimizer sampling and indexing algorithms of the widely used alignment tool Minimap2 (Li, 2018). In doing so, we reuse Minimap2’s highly efficient anchor chaining and gapped alignment routines. We evaluate the speed and accuracy of Winnowmap using simulated and real long-read sequencing data aligned to the human reference genome GRCh38 (Schneider et al., 2017), including the recently completed X chromosome that contains the first-ever assembled human centromere (Miga et al., 2019). Compared to Minimap2, our results demonstrate that Winnowmap reduces the mapping error-rate over the entire X chromosome (0.14% to 0.06%), with the biggest gains achieved in the highly repetitive centromeric region (3.6% to 0.0%). Maintaining high alignment accuracy within long genomic repeats is critical for accurate genome assembly and variant calling. By avoiding masking, we show that Winnowmap maintains uniform minimizer density, while the masking heuristic used by Minimap2 leads to significant drops in minimizer density, especially within long tandem repeats like the centromeric satellite array. Moreover, Winnowmap uses less memory (up to 50% less) while maintaining a similar runtime versus Minimap2. Here we focus on sequence read mapping, but we expect our optimizations to benefit all applications of minimizers.

Related work

Once a k-mer ordering is defined, the set of minimizers picked from an input sequence becomes deterministic. Accordingly, our idea of weighted-minimizers is implemented by manipulating the k-mer ordering to prefer higher weighted k-mers. There have been prior attempts to optimize minimizer selection for different applications. Chikhi et al. (2015) use minimizers for low-memory construction and traversal of de Bruijn graphs. In their algorithm, a complete set of σk k-mers (σ is the size of the alphabet) is sorted based on k-mer frequency to define an order. Accordingly, a low-frequency k-mer is preferred over a high-frequency k-mer in each window. This procedure, however, requires 64 × σk bits in memory. This would work for small values of k (≤ 10 in their application), but does not scale to the larger k-mer sizes (e.g. 15–21) typically used for long-read analyses. Other work has focused on optimizing k-mer ordering to reduce the minimizer density (Orenstein et al., 2016; Marçais et al., 2018; DeBlasio et al., 2019). These optimizations are complementary to our objective as we seek to avoid excessive false-positive hits that occur by sampling high frequency k-mers. Our formulation of weighted-minimizers is partly inspired by (Chum et al., 2008), which extends the MinHash technique to incorporate tf-idf weighting and was adopted by Canu (Koren et al., 2017) for sequence read overlap detection. Below we present background on the classic minimizer sampling (Roberts et al., 2004) and weighted MinHash (Chum et al., 2008) techniques, the two key ideas upon which our algorithm is based.

2 Background

2.1 Minimizer sampling

Here we formally discuss the standard minimizer sampling algorithm and recall its properties. Continuing the same notation, w denotes window length and σ is size of the alphabet. Let U be the universe of all σk k-mers, and h : U → [0, 1] be a random hash function that assigns each k-mer to a real number within a unit interval. The function h induces an ordering among k-mers. Although Roberts et al. (2004) originally described the algorithm using a lexicographical ordering, a randomized ordering often works better in practice (Marçais et al., 2017).

The minimizer sampling algorithm entails computing the minimum k-mer within each consecutive window of length w. Ties are handled by picking all equally minimum k-mers (Figure 1). Perhaps the easiest way to compute minimizers is to use two loops such that the outer loop iterates through the input sequence while the inner loop computes a minimum k-mer within each window. Assuming the hash h of a k-mer is computable in O(1) time, this nested-loop approach requires O(nw) time for an input sequence of length n. However, there exists a linear-time O(n) algorithm for computing sliding-window minimum elements (Smith, 2011), where the trick is to use a double-ended queue while streaming k-mers (Figure 2). This algorithm starts with an empty queue Q (Algorithm 1, line 1). Before inserting a new k-mer at the back end of Q, the algorithm scans and discards all k-mers higher than the current one (line 5). This is because all the previously seen higher k-mers cannot be minimizers of either the current or subsequent windows. Due to this step, Q maintains sorted k-mer order in each iteration. After inserting the new k-mer, we also discard those k-mers from the front that are located outside the range of the current window (line 8). Due to the sorted order in Q, the smallest k-mer is simply drawn from front of Q (line 12). Note that each element of the array arr is inserted and deleted only once throughout the algorithm. Therefore, the total runtime of the algorithm is linear in sequence length. Schleimer et al. (2003) proved that the expected density, i.e. fraction of k-mers sampled from a random input sequence equals 2/(w + 1) for all practical values of w (w < σk).

Fig. 2.
  • Download figure
  • Open in new tab
Fig. 2.

Visualizing a single iteration of Algorithm 1. As the window slides to the right, the new k-mer is inserted at the back of the double-ended queue Q and a minimizer is selected from front end of the queue. The above figure also uses two equal k-mers (order= 5) in different colors to highlight how ties are handled in Algorithm 1.

Algorithm 1:

Standard procedure for computing minimizers

Figure
  • Download figure
  • Open in new tab

We intentionally split Algorithm 1 into a main routine and four helper functions. Later when we introduce optimizations to this algorithm to avoid sampling repetitive k-mers, we will only modify the functions leaving the main routine intact. Below, we summarize the properties of the minimizer sampling Algorithm 1 as a lemma. Readers are referred to the original articles (Roberts et al., 2004; Smith, 2011) for formal proofs.

Lemma 1.

The following statements are true:

  1. Algorithm 1 computes the desired set of minimizers.

  2. k-mer order values in queue Q of Algorithm 1 remain sorted in non-decreasing order during execution.

  3. Assuming k-mer hashing is an O(1) operation, then Algorithm 1 uses O(n) time where n is length of input sequence.

  4. If we ignore the possibility of k-mer ties within a window and assume that k-mer hash values are independent and uniformly distributed, then the expected density of minimizers is 2/(w + 1).

  5. At least one matching minimizer will be sampled from two strings with matching substrings of length ≥ w + k − 1.

2.2 MinHash and k-mer weighting

MinHash (Broder, 1997), like minimizer sampling, can be used to compute a signature of a sequence. It is a well-known locality sensitive hashing scheme for Jaccard similarity. Recall that the Jaccard similarity of two k-mer sets X and Y is |X ∩ Y |/|X ∪ Y |. Assuming k-mer hash values are independent and uniformly distributed, Broder (1997) proved that the probability of Embedded Image is equal to the Jaccard similarity. This enables an unbiased estimation of Jaccard while using only a subset of the original sets. When using Jaccard, elements in X ∩ Y contribute uniformly to the similarity value. However in some applications, the ‘importance’ of each element can vary. For instance, when comparing two sequences, a match of a rare k-mer carries more significance than a match of a highly-repetitive k-mer. Accordingly, Chum et al. (2008) proposed a variant of the Jaccard similarity metric called weighted set-similarity. Suppose function µ : U → ℕ assigns a weight to a k-mer, then the weighted set-similarity between two k-mer sets X and Y is given by: Embedded Image

Note that Jw (X, Y) is more influenced by higher-weighted k-mers. It is possible to extend the MinHash algorithm to develop an unbiased estimator for Jw. Let h1, h2 … hτ be τ independent random hash functions where τ equals the highest k-mer weight Embedded Image. A k-mer i is hashed using the first µ(i) hash functions: Embedded Image. Naturally if a k-mer has a higher weight, its expected hash value f (i) is lower because more random hashes are considered. As a result, the probability of a k-mer i ∈ X being selected as signature of set X is decided by its weight and equals µ(i)/Σj∈X µ(j). Thus, similar to the original MinHash algorithm, the probability of Embedded Image is Jw (X, Y).

In the above approach, we need to generate µ(i) uniformly distributed random numbers for each k-mer i. As a result, the above technique would require a large amount of space and time when input weights are large. Moreover, the above technique does not generalize to real-numbered weights. To remedy this, Chum et al. (2008) suggest an optimization which allows estimating Jw using a single random number for each k-mer. Their optimization follows from the observation that only the relative order of k-mer hash values matters and not their absolute values. As a result, it becomes possible to use a single random hash function while transforming its output to match the cumulative distribution of the hash function f. Their proposed alternative hashing function assigns k-mer i to 1 − (h1(i))1/µ(i). This also generalizes the above algorithm to real-numbered weights µ : U → ℝ+. This algorithm partly inspired our formulation of weighted-minimizer sampling (Section 4).

3 Algorithm for Robust Winnowing

It is useful to recall the definition of robust winnowing:

Definition 1

(Schleimer et al. (2003)). Robust winnowing: In each window select the minimum k-mer. If possible break ties by selecting the same k-mer as the window one position to the left. If not, select the rightmost minimal k-mer.

This tie breaking differs from the original algorithm of Roberts et al. (2004), which preserves all equally minimum k-mers. Robust sampling has one advantage and one disadvantage. The advantage is that we avoid sampling too many minimizers in low-complexity substrings such as “ACACAC…” or “AAAAA…” (Figure 3). Its disadvantage is that the minimizer selection from a window during a tie depends on the context, i.e. it depends on which minimizers were selected in prior windows. However, our results show that the advantage outweighs the disadvantage, and this optimization yields significant improvements in runtime and memory-usage without affecting accuracy.

Fig. 3.
  • Download figure
  • Open in new tab
Fig. 3.

Visualization of tie breaking in the standard minimizer and robust winnowing algorithms. We use a low-complexity sequence “CACACA…” to illustrate the difference between the two approaches. The sequence is comprised of two 3-mers, “CAC” and “ACA”, which are ordered lexicographically in the above example. In the left plot, we break ties by sampling all equally-minimum k-mers per window, where as we follow the robust winnowing tie breaking in the right plot. Note that robust winnowing samples half the minimizers as compared to the standard approach in this repetitive sequence example.

Schleimer et al. (2003) did not provide an algorithm and time complexity for computing robust minimizers, so for completeness we provide here a linear-time algorithm (Algorithm 2) that borrows psuedocode from Algorithm 1, while modifying its furtherPop and furtherSample functions. Lemma 2 proves correctness of the proposed algorithm. Similar to Algorithm 1, it is trivial to argue that the runtime remains linear. Further, any matching substring of length ≥ w + k − 1 between two sequences sample at least one robust minimizer of same order.

Algorithm 2:

Computing minimizers using the robust winnowing method

Figure
  • Download figure
  • Open in new tab

Lemma 2.

Algorithm 2 computes the desired set of minimizers using the robust winnowing criteria.

Proof. Using the robust winnowing scheme, we sample a new robust minimizer at iteration i iff either of the following are true: (i) the newest k-mer is lower than all k-mers in Q, or (ii) a prior robust minimizer (i.e. the front k-mer of Q) goes out of range. In the first case, the new k-mer will be selected as a robust minimizer by the sample function and pushed to the front of Q, while discarding other k-mers in Q. In the second case, we must pick the right-most minimum k-mer if there are multiple equal minimums in Q. The function furtherPop pops all but the right-most minimum in Q.

4 Weighted minimizer sampling

Using the same notation as before, function µ : U → ℝ+ assigns a weight to a k-mer. Weighted minimizer sampling is defined as following.

Definition 2.

Weighted minimizer sampling: Build a random ordering of all k-mers such that for a k-mer set X = {k1, k2, …, k|X|} ⊆ U, P (min(k1, k2, …, k|X|) = ki) = µ(ki)/(µ(k1) + µ(k2) + … + µ(k|X|)) ∀i ∈ [1, |X|]. Using this k-mer ordering, execute the robust-winnowing procedure.

In the original minimizer sampling algorithm (Section 2.1), the k-mer ordering is defined directly using a random hash function. As a result, any k-mer in a window is equally likely to be selected as a minimizer (assuming no k-mer ties). The advantage of the weighted technique is that it allows us to bias selection of certain k-mers over others. If (k1, k2, … kw) is the set of k-mers in a window, then the probability of k-mer ki being minimum equals µ(ki)/(µ(k1) + µ(k2) + … + µ(kw)). As a result, the higher the weight of a k-mer relative to its neighboring k-mers in a window, the more likely it is to be sampled as a minimizer. The weighted technique reduces to the original unweighted algorithm if all k-mer weights are equal. In order to define a k-mer ordering needed for weighted minimizer sampling, we borrow the optimized hashing technique of Chum et al. (2008) (Section 2.2). Given a random hash function h : U → [0, 1], we assign the order of a k-mer ki to be Embedded Image.

We now analyze the expected density of weighted minimizer sampling. Recall that the expected density equals the expected fraction of k-mers sampled from a random input sequence. While the original algorithm guarantees density of 2/(w + 1) (Lemma 1), here the density depends on k-mer weights.

Lemma 3.

Suppose k1, k2, … kn−k+1 are the k-mers in the order as they appear in a random input sequence of length n. Assume σk ≫ w. The expected density of weighted minimizer sampling equals: Embedded Image

Proof. Suppose Wi is window of w consecutive k-mers starting at position i. Window Wi is said to be charged if the minimum k-mer in window Wi differs from the minimum k-mer in window Wi−1, i.e. Wi selects a new minimum. The expected density of a minimizer sampling can be analyzed by counting total number of charged windows in the input sequence.

A union of two windows Wi and Wi−1 defines an interval of length w + 1 in the input sequence. Accordingly, k-mers ki−1, ki, … ki+w−1 occur in this interval, and let kmin be the minimum k-mer among these. As σk ≫ w, we safely ignore the possibility of ties. Window Wi is charged if and only if either kmin = ki−1 or kmin = ki+w−1. Define random variable Xi to be 1 if Wi is charged and 0 if not.

For i ∈ [2, n − k − w + 2] (for all windows except the first) Embedded Image

For i = 1, E[Xi] = 1. Therefore, the expected density Embedded Image

The theoretical worst case expected density of weighted minimizer sampling occurs when k-mers k1, k2, …, kn−k+1 are distinct (i.e., σk ≫ n) and µ(k1) ≫ µ(k2) ≫ … ≫ µ(kn−k+1). In this case, the density approximately equals 1. A best case scenario is one where k1, k2, …, kn−k+1 are distinct and weights µ(kw), µ(k2w), µ(k3w) … are significantly higher than all the other k-mer weights. In this case, density is optimal and equals 1/w.

Winnowmap implements weighted minimizers sampling to avoid indexing highly repetitive k-mers for long-read mapping. Given any reference genome, we assume the availability of a pre-computed set 𝒮 of repetitive k-mers, i.e. k-mers that occur with frequency above a certain cutoff. This set can be computed efficiently using a k-mer counting tool. We assign any k-mer ki ∈ U a weight of v < 1 if ki ∈ 𝒮, and 1 otherwise (Algorithm 3). By default, Winnowmap uses a frequency cutoff of 1024 for determining highly repetitive k-mers, and sets v = 1/8. Note that the asymptotic time complexity of minimizer sampling remains linear in sequence length if we use a σk-sized bit array to store the set 𝒮 that supports membership queries in O(1) time.

Lemma 4.

The following statements are true for Algorithm 3:

  1. Assuming k-mer hashing is an O(1) operation, then Algorithm 3 uses O(n) time where n is length of input sequence.

  2. At least one minimizer of the same order will be sampled from two strings with matching substrings of length ≥ w + k − 1.

Algorithm 3 involves computational overhead that affects indexing performance in practice. First, random accesses to a σk-sized bit array in each iteration makes the implementation slow (line 5). Second, computing a power arithmetic operation for each occurrence of repetitive k-mers is expensive (line 6). We implement two optimizations to avoid this overhead. As |𝒮| is expected to be ≪ σk, we use a space-efficient bloom filter to query membership in 𝒮, which incurs a marginal false-positive rate. For the second issue, we compute x1/v without using a power operation. For example, if v is 1/8, then x1/v = x8 can be computed using three multiplications. Finally, we use −x1/v (line 6) and −x (line 7) to assign k-mer order values instead of 1 − x1/v and 1 − x respectively to avoid rounding errors (e.g., 1 − 1e-20 rounded to 1). As a result of these optimizations, weighted minimizer sampling is nearly as fast as the standard minimizer sampling.

Algorithm 3:

Algorithm for computing weighted minimizers

Figure
  • Download figure
  • Open in new tab

5 Results

In this section, we demonstrate the empirical advantage of robust winnowing and weighted minimizer sampling for long read mapping. As mentioned earlier, we have implemented Winnowmap by replacing the minimizer sampling and indexing procedures of the well-established mapping tool Minimap2 (Li, 2018). Minimap2 has been previously shown to achieve excellent read mapping results in terms of accuracy and speed, so in this section we compare Winnowmap directly against Minimap2 (v2.17-r954-dirty).

5.1 Experimental setup

Hardware and software

For all our experiments, we used a server equipped with two Intel Xeon E5-2698 v4 20-core processors and 1 TB memory. Both Winnowmap and Minimap2 were run in parallel-mode using 16 CPU threads, and we report wall-clock time in our results. We used recommended parameters for mapping PacBio and Oxford Nanopore (ONT) reads based on Minimap2’s user documentation. These are -cx map-pb for PacBio reads, and -cx map-ont for ONT reads. Winnowmap requires an additional parameter specifying a file containing the list of highly-repetitive k-mers for a reference genome. In our experiments, this file was assumed to be pre-computed. Given the latest advances in k-mer counting algorithms, this is a low-overhead O(n) operation, especially for the values of k typically used for long-read mapping (e.g. 15 for ONT reads and 19 for PacBio reads).

Winnowmap uses a default window length of 50. In contrast, Minimap2 performs dense minimizer sampling using a default window length of 10. Our results show that, unlike Minimap2, Winnowmap can tolerate sparser sampling without compromising accuracy. Minimap2 relies on standard minimizer sampling (Roberts et al., 2004) before masking the top 0.02% most frequent minimizers from its index. Winnowmap does not use any masking heuristics. All other algorithmic features and parameters are the same between Minimap2 and Winnowmap.

Benchmarking data sets

Our data sets include two simulated sets of PacBio reads (D1, D2) and one set of real ultra-long ONT reads (D3). Total count, N50, and maximum length of these read sets is shown in Table 5.1. The simulated read sets were obtained using PBSIM (Ono et al., 2013) with a mean error-rate of 10% and mean read length of 15 kbp. PacBio read set D1 was simulated from a recently finished human X chromosome (Miga et al., 2019) and D2 simulated from the human whole-genome reference GRCh38 (Schneider et al., 2017), respectively. The assembly of chromosome X (v0.7) contains the first-ever resolved centromere (3.1 Mbp), which allowed us to evaluate mapping accuracy within a long tandem repeat. Finally, read set D3 is a small random sample of the rel3 human ‘CHM13’ whole-genome, ultra-long ONT sequencing data (Miga et al., 2019). A minimum read length of 1 kbp was enforced during sub-sampling. The read sets were mapped to the human chromosome X assembly (D1) and the human reference genome (D2, D3), respectively.

Evaluation criteria

For simulated read sets D1 and D2, we evaluated mapping accuracy against the known truth. We followed a similar criteria previously used by Li (2018). A read is said to be mapped correctly if its primary alignment overlaps with the true interval, and the overlapping bases constitute ≥ 10% of the union of the true and the reported alignment intervals. We used Minimap2’s paftools utility (Li, 2018) to compute mapping accuracy using this criteria. We also measured indexing time, mapping time, peak memory-use, fraction of unmapped reads, and minimizer sampling density.

5.2 Comparison with Minimap2

We evaluated performance of Winnowmap and Minimap2 using data sets D1–D3 (Table 5.1). We tested the two algorithms using both window length parameters 10 and 50. While discussing differences between Winnowmap and Minimap2, we refer to the results obtained using their default window length parameters, i.e. 50 for Winnowmap, 10 for Minimap2. Using simulated read sets D1 and D2, Winnowmap achieves better read mapping accuracy than Minimap2. Moreover, the accuracy of Winnowmap is largely unaffected by the varying window length. Compared to Minimap2, Winnowmap reduces read mapping error from 0.14% to 0.06% on data set D1, and from 1.85% to 1.81% on data set D2. Both Winnowmap and Minimap2 were able to map all reads in D1 and D2. In the D3 data set, there are many very long reads but also a higher fraction of short reads. This resulted in a significant fraction of unmapped reads using both algorithms. Using its default window length, Winnowmap has comparable mapping and indexing runtime to Minimap2, while using less memory. However, with a small window length of 10, the runtime and memory use of Winnowmap is much higher because weighted minimizer sampling is less effective for small window lengths. Large windows provide more freedom for the weighted minimizer sampling to choose a non-repetitive k-mer over a repetitive k-mer.

Winnowmap significantly improves accuracy in large repeats. Using data set D1, we evaluated its accuracy within the centromeric region of chromosome X, which comprises approximately 1500 tandemly arrayed copies of a 2057 bp repeat unit spanning approximately 3.1 Mbp. To measure mapping accuracy within this repeat array, we considered all 1057 reads that were sampled from the region (positions 57, 828, 561 to 60, 934, 693). Table 3 shows the mapping accuracy with different window length parameters. Compared to Minimap2, Winnowmap reduces mapping error from 3.6% to 0.0% and its accuracy is again robust to changing window length. This improvement is possible since Winnowmap maintains all minimizers in the index whereas Minimap2 masks repetitive minimizers. To visualize the effect of masking, we show minimizer sampling density across the length of chromosome X in Figure 4. This plot shows the minimizer sampling density of three methods: (i) the standard minimizer sampling (Roberts et al., 2004) without any modification (blue), (ii) minimizer sampling and masking in Minimap2 (orange), and (iii) weighted minimizer sampling in Winnowmap (green). The window length was fixed to 10 for all the three methods. Minimap2’s masking heuristic leads to poor sampling density in long repeats with the most significant dip observed in the centromere region.

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 1.

List of datasets used for evaluation.

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 2.

Comparison of mapping performance between Winnowmap and Minimap2 using data sets D1–D3. We tested both methods while varying window length, which controls the minimizer sampling rate. Larger window lengths lead to more sparse sampling, as indicated by the last column. Minimap2 uses w = 10 by default. Winnowmap is more robust to sparse sampling and, as a result, uses a larger default window length of 50.

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 3.

Evaluation of mapping accuracy in the centromeric region of human chromosome X. Here we consider only those reads in data set D1 which were sampled from centromeric repeat array.

Fig. 4.
  • Download figure
  • Open in new tab
Fig. 4.

Minimizer sampling density using a complete human chromosome X as the reference (data set D1). We compare three sampling algorithms using fixed window length w = 10. ‘Standard’ method refers to the classic minimizer sampling algorithm from Roberts et al. (2004), without any masking or modification. Minimap2 uses the standard algorithm, but masks the most frequently occurring minimizers (top 0.02%) in the reference (count ≥ 160 for this reference). Winnowmap uses weighted minimizer sampling. Both ‘Standard’ and Winnowmap methods maintain at least one minimizer per window in their index and achieve near-uniform density. The masking heuristic in Minimap2 affects long repetitive segments (e.g. centromere: 58–61 Mbp), where density drops significantly.

5.3 Benefit of weighted minimizer sampling and robust winnowing

Weighted minimizer sampling in Winnowmap includes two key optimizations, (i) robust winnowing, and (ii) weighted sampling. These two optimizations are independent of one another and we evaluated the advantages of each. To do this, we compared (i) Winnowmap, (ii) a version of Winnowmap that implements only robust winnowing, and (iii) a version of Minimap2 that implements the standard minimizer sampling without the masking heuristic. Accordingly, these three methods are referred to as ‘Optimization 1+2’, ‘Optimization 1’, and ‘Standard’, respectively in Figure 5. As there is no masking involved, these three methods index all minimizers irrespective of their frequency. Here we show mapping time and peak memory usage using data sets D1–D3 while keeping window length equal. We used a window length of 10 for data sets D1 and D2, and 50 for data set D3. The ‘Standard’ method crashed on data set D3 with a window length of 10. These results demonstrate that these Winnowmap optimizations are crucial for speed and memory efficiency.

Fig. 5.
  • Download figure
  • Open in new tab
Fig. 5.

Benefit of the proposed minimizer sampling optimizations to long read mapping performance. We compare three methods: ‘Standard’ (default minimizer sampling, no masking), ‘Optimization 1’ (robust winnowing) and ‘Optimization 1+2’ (robust winnowing and weighted minimizer sampling). All three methods select at least one minimizer in each window. The left plot indicates total mapping time whereas the right plot indicates total memory usage, using data sets D1–D3. The x-axis is log-scaled in both plots. The combination of both optimizations is crucial in Winnowmap to achieve good efficiency.

6 Conclusions

Minimizer sampling is a simple yet powerful technique to speed up genome sequence analysis. When this technique is used for seeding sequence alignments, minimizers with high frequency often lead to too many false-positive seed matches. The large number of false hits naturally leads to high memory-usage and runtime. To date, a popular way to deal with this issue has been to mask highly repetitive minimizers. While this masking heuristic eliminates false hits, the underlying guarantee of a minimizer seed match is lost in the process. As a result, masking decreases mapping accuracy within long genomic repeats. In this paper, we describe optimizations to the standard minimizer sampling procedure and evaluate the improvements to long-read mapping. We have introduced weighted minimizer sampling, where users can specify which k-mers should be more (or less) likely to be selected as minimizers. Lastly, when there are multiple equally minimum k-mers in a window (e.g. in low complexity regions), robust winnowing is effective in preventing oversampling.

We implemented weighted minimizer sampling and robust winnowing in Winnowmap. Winnowmap makes it feasible to map PacBio or ONT reads without the need for a masking heuristic. As a result, it achieves superior mapping accuracy by maintaining a uniform sampling density across the reference sequence using a simple weighting criteria.

Future work will be focused in three directions. First, we will explore more sophisticated weight functions, e.g. down-weighting erroneous k-mers in addition to repetitive k-mers. A set of likely erroneous k-mers could be detected prior to mapping by intersecting the k-mer sets of the reads and the reference genome. Second, we will test the idea of weighted minimizer sampling in other applications such as long-read overlapping and metagenomic read classification. Third, the density bounds that were discussed in this paper are defined for arbitrary weights. However, in most applications, we can assume weight values are bounded between [1/c, c], where c is a constant. It will be useful to derive tight bounds on the expected density under this constraint.

References

  1. ↵
    Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997). Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic acids research, 25(17), 3389–3402.
    OpenUrlCrossRefPubMedWeb of Science
  2. ↵
    Berlin, K., Koren, S., Chin, C.-S., Drake, J. P., Landolin, J. M., and Phillippy, A. M. (2015). Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nature biotechnology, 33(6), 623.
    OpenUrlCrossRefPubMed
  3. ↵
    Broder, A. Z. (1997). On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pages 21–29. IEEE.
  4. ↵
    Chikhi, R., Limasset, A., Jackman, S., Simpson, J. T., and Medvedev, P. (2015). On the representation of de bruijn graphs. Journal of Computational Biology, 22(5), 336–352.
    OpenUrlCrossRefPubMed
  5. ↵
    Chin, C.-S. and Khalak, A. (2019). Human genome assembly in 100 minutes. bioRxiv.
  6. ↵
    Chum, O., Philbin, J., Zisserman, A., et al. (2008). Near duplicate image detection: min-hash and tf-idf weighting. In Bmvc, volume 810, pages 812–815.
    OpenUrl
  7. ↵
    DeBlasio, D., Gbosibo, F., Kingsford, C., and Marcais, G. (2019). Practical universal k-mer sets for minimizer schemes. BioRxiv, page 652925.
  8. ↵
    Dilthey, A. T., Jain, C., Koren, S., and Phillippy, A. M. (2019). Strain-level metagenomic assignment and compositional estimation for long reads with metamaps. Nature communications, 10(1), 1–12.
    OpenUrl
  9. ↵
    Frith, M. C. (2011). Gentle masking of low-complexity sequences improves homology search. PLoS One, 6(12), e28819.
    OpenUrlCrossRefPubMed
  10. ↵
    Jain, C., Dilthey, A., Koren, S., Aluru, S., and Phillippy, A. (2018). A fast approximate algorithm for mapping long reads to large reference databases. Journal of Computational Biology, 25(7), 766.
    OpenUrl
  11. ↵
    Koren, S., Walenz, B. P., Berlin, K., Miller, J. R., Bergman, N. H., and Phillippy, A. M. (2017). Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome research, 27(5), 722–736.
    OpenUrlAbstract/FREE Full Text
  12. ↵
    Kundu, R., Casey, J., and Sung, W.-K. (2019). Hypo: Super fast & accurate polisher for long read genome assemblies. bioRxiv.
  13. ↵
    Kurtz, S., Phillippy, A., Delcher, A. L., Smoot, M., Shumway, M., Antonescu, C., and Salzberg, S. L. (2004). Versatile and open software for comparing large genomes. Genome biology, 5(2), R12.
    OpenUrlCrossRefPubMed
  14. ↵
    Langmead, B. and Salzberg, S. L. (2012). Fast gapped-read alignment with bowtie Nature methods, 9(4), 357.
    OpenUrl
  15. ↵
    Li, H. (2016). Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14), 2103–2110.
    OpenUrlCrossRefPubMed
  16. ↵
    Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18), 3094–3100.
    OpenUrlCrossRefPubMed
  17. ↵
    Marçais, G., Pellow, D., Bork, D., Orenstein, Y., Shamir, R., and Kingsford, C. (2017). Improving the performance of minimizers and winnowing schemes. Bioinformatics, 33(14), i110–i117.
    OpenUrlCrossRef
  18. ↵
    Marçais, G., DeBlasio, D., and Kingsford, C. (2018). Asymptotically optimal minimizers schemes. Bioinformatics, 34(13), i13–i22.
    OpenUrl
  19. ↵
    Marçais, G., Solomon, B., Patro, R., and Kingsford, C. (2019). Sketching and sublinear data structures in genomics. Annual Review of Biomedical Data Science, 2.
  20. ↵
    Miga, K. H., Koren, S., Rhie, A., Vollger, M. R., Gershman, A., Bzikadze, A., Brooks, S., Howe, E., Porubsky, D., Logsdon, G. A., et al. (2019). Telomere-to-telomere assembly of a complete human x chromosome. BioRxiv, page 735928.
  21. ↵
    Ondov, B. D., Treangen, T. J., Melsted, P., Mallonee, A. B., Bergman, N. H., Koren, S., and Phillippy, A. M. (2016). Mash: fast genome and metagenome distance estimation using minhash. Genome Biology.
  22. ↵
    Ono, Y., Asai, K., and Hamada, M. (2013). Pbsim: Pacbio reads simulator—toward accurate genome assembly. Bioinformatics, 29(1), 119–121.
    OpenUrlCrossRefPubMedWeb of Science
  23. ↵
    Orenstein, Y., Pellow, D., Marçais, G., Shamir, R., and Kingsford, C. (2016). Compact universal k-mer hitting sets. In International Workshop on Algorithms in Bioinformatics, pages 257–268. Springer.
  24. ↵
    Popic, V. and Batzoglou, S. (2017). A hybrid cloud read aligner based on minhash and kmer voting that preserves privacy. Nature communications, 8, 15311.
    OpenUrl
  25. ↵
    Roberts, M., Hayes, W., Hunt, B. R., Mount, S. M., and Yorke, J. A. (2004). Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18), 3363–3369.
    OpenUrlCrossRefPubMedWeb of Science
  26. ↵
    Rowe, W. P. (2019). When the levee breaks: a practical guide to sketching algorithms for processing the flood of genomic data. Genome biology, 20(1), 199.
    OpenUrl
  27. ↵
    Sahlin, K. and Medvedev, P. (2019). De novo clustering of long-read transcriptome data using a greedy, quality-value based algorithm. In International Conference on Research in Computational Molecular Biology, pages 227–242. Springer.
  28. ↵
    Sahlin, K., Sipos, B., James, P. L., Turner, D. J., and Medvedev, P. (2020). Error correction enables use of oxford nanopore technology for reference-free transcriptome analysis. bioRxiv.
  29. ↵
    Schleimer, S., Wilkerson, D. S., and Aiken, A. (2003). Winnowing: local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 76–85. ACM.
  30. ↵
    Schneider, V. A., Graves-Lindsay, T., Howe, K., Bouk, N., Chen, H.-C., Kitts, P. A., Murphy, T. D., Pruitt, K. D., Thibaud-Nissen, F., Albracht, D., et al. (2017). Evaluation of grch38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome research, 27(5), 849–864.
    OpenUrlAbstract/FREE Full Text
  31. ↵
    Shafin, K., Pesout, T., Lorig-Roach, R., Haukness, M., Olsen, H. E., Bosworth, C., Armstrong, J., Tigyi, K., Maurer, N., Koren, S., et al. (2019). Efficient de novo assembly of eleven human genomes using promethion sequencing and a novel nanopore toolkit. BioRxiv.
  32. ↵
    Smith, K. C. (2011). Sliding window minimum implementations. https://goo.gl/8RC54b [Online; accessed 16-January-2020].
  33. ↵
    Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), 195–197.
    OpenUrlCrossRefPubMedWeb of Science
  34. ↵
    Yu, Y. W., Daniels, N. M., Danko, D. C., and Berger, B. (2015). Entropy-scaling search of massive biological data. Cell systems, 1(2), 130–140.
    OpenUrl
Back to top
PreviousNext
Posted February 11, 2020.
Download PDF
Email

Thank you for your interest in spreading the word about bioRxiv.

NOTE: Your email address is requested solely to identify you as the sender of this article.

Enter multiple addresses on separate lines or separate them with commas.
Weighted minimizer sampling improves long read mapping
(Your Name) has forwarded a page to you from bioRxiv
(Your Name) thought you would like to see this page from the bioRxiv website.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Share
Weighted minimizer sampling improves long read mapping
Chirag Jain, Arang Rhie, Haowen Zhang, Claudia Chu, Sergey Koren, Adam Phillippy
bioRxiv 2020.02.11.943241; doi: https://doi.org/10.1101/2020.02.11.943241
Digg logo Reddit logo Twitter logo Facebook logo Google logo LinkedIn logo Mendeley logo
Citation Tools
Weighted minimizer sampling improves long read mapping
Chirag Jain, Arang Rhie, Haowen Zhang, Claudia Chu, Sergey Koren, Adam Phillippy
bioRxiv 2020.02.11.943241; doi: https://doi.org/10.1101/2020.02.11.943241

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
  • Tweet Widget
  • Facebook Like
  • Google Plus One

Subject Area

  • Bioinformatics
Subject Areas
All Articles
  • Animal Behavior and Cognition (4117)
  • Biochemistry (8824)
  • Bioengineering (6528)
  • Bioinformatics (23480)
  • Biophysics (11800)
  • Cancer Biology (9218)
  • Cell Biology (13333)
  • Clinical Trials (138)
  • Developmental Biology (7440)
  • Ecology (11420)
  • Epidemiology (2066)
  • Evolutionary Biology (15165)
  • Genetics (10447)
  • Genomics (14054)
  • Immunology (9180)
  • Microbiology (22182)
  • Molecular Biology (8820)
  • Neuroscience (47610)
  • Paleontology (350)
  • Pathology (1430)
  • Pharmacology and Toxicology (2492)
  • Physiology (3735)
  • Plant Biology (8085)
  • Scientific Communication and Education (1438)
  • Synthetic Biology (2222)
  • Systems Biology (6042)
  • Zoology (1254)