Abstract
The sharp increase in next-generation sequencing capacity has created a demand for algorithms capable of quickly searching a large corpus of biological sequences. The complexity of biological variability and the magnitude of existing data sets have impeded finding algorithms with guaranteed accuracy that run efficiently in practice. Our main contribution is the Tensor Sketch method, which efficiently and accurately estimates edit distances. In our experiments, Tensor Sketch had a 0.956 Spearman's rank correlation with the exact edit distance, improving on its best competitor, Ordered MinHash, by 23%, while running almost 5 times faster. Finally, all sketches can be updated dynamically if the input is a sequence stream, making the method appealing for large-scale applications where data cannot fit into memory.
Conceptually, our approach has three steps: 1) represent sequences as tensors over their sub-sequences, 2) apply tensor sketching that preserves tensor inner products, 3) implicitly compute the sketch. The sub-sequences, which are not necessarily contiguous pieces of the sequence, allow us to outperform k-mer-based methods, such as min-hash sketching over a set of k-mers. Typically, the number of sub-sequences grows exponentially with the sub-sequence length, introducing both memory and time overheads. We directly address this problem in steps 2 and 3 of our method. While rank-1 and super-symmetric tensors are known to admit efficient sketching, the sub-sequence tensor satisfies neither of these properties. Hence, we propose a new sketching scheme that completely avoids the need for constructing the ambient space.
Our tensor-sketching technique's main advantages are three-fold: 1) Tensor Sketch has higher accuracy than the other sketching methods assessed that are used in practice. 2) All sketches can be computed in a streaming fashion, leading to significant time and memory savings when there is overlap between input sequences. 3) It is straightforward to extend tensor sketching to different settings, leading to efficient methods for related sequence analysis tasks. We view tensor sketching as a framework for tackling a wide range of relevant bioinformatics problems, and we are confident that it can bring significant improvements for applications based on edit distance estimation.
1 Introduction
The emergence of next-generation sequencing technologies and a dramatic decrease in cost have led to an exponential increase in biological sequence data, frequently stored in exceptionally large databases. While alignment scores are considered the gold standard of sequence distance in many bioinformatics applications, the growing number of sequences to be analyzed poses serious challenges to exact distance computation via alignment. This problem has motivated research on time- and space-efficient alignment-free methods that estimate rather than compute sequence similarity. Especially applications relying on comparing millions of sequences have turned towards approximate methods over exact alignments [2]. Out of many possibilities, we have selected the task of phylogeny reconstruction to further motivate distance estimation. In this task, the estimated distances between sequences are used to reconstruct the evolutionary history in the form of a phylogenetic tree. Instead of using exact alignment scores, many alignment-free methods rely on k-mer statistics as a proxy, where the term k-mer refers to the length-k substrings of a given string. The multiplicity, frequency, mode, and reappearance statistics of k-mers have been utilised to directly estimate evolutionary distance [17,16,1]. Other approaches include variable-length matches in their statistics [18].
To break the stringent dependency on adjacent letters in a k-mer, spaced k-mers introduce "match" and "ignore" positions. For example, if the match-pattern "11011" is used, then "CTGAC" versus "CTTAC" constitutes a match. Spaced k-mers have been shown to improve mapping sensitivity, the accuracy of phylogenies, and the performance of sequence classification [5,7,13,15]. Analogously, substring-based methods can be relaxed to allow for some mismatches [6]. Since the quality of the estimates greatly depends on the selected match-pattern, several works have focused on optimizing these patterns [3,11,8,10]. However, finding combinatorially optimal patterns becomes intractable as the number of ignore-positions increases. Furthermore, any optimization is prone to be task-dependent, which would require the optimization to be repeated for each task separately.
More recently, hashing-based methods have become increasingly popular. MinHash sketch methods have primarily been used for set similarity estimation [2] and have shown promise in fast genomic and metagenomic distance estimation by representing sequences as collections of informative substrings [12]. From the locality-sensitive hashing literature we know that any sketch over the ℓp-norm (p ∈ (0, 2]) automatically leads to sub-quadratic nearest-neighbor search via locality-sensitive hashing [4]. This provides further motivation for focusing on sketching the edit distance, leaving behind the inherent problems of seed-based methods. One of the most accurate sketching methods currently available is Ordered MinHash (OMH) [9], which considers tuples of k-mers that may be non-adjacent. Therefore, we will compare our approach against the classical k-mer-based methods MinHash, Weighted MinHash, and Ordered MinHash.
An especially difficult task for most current methods is the estimation of edit distance for longer, less closely related sequences. To illustrate the main limitations of using k-mer statistics in this context, we look at two examples that act as opposing forces in deciding the size of k.
Consider two sequences drawn uniformly at random over the alphabet of size 4, and let v_x, v_y ∈ ℝ^{4^k} denote their k-mer frequency profiles (the multiplicity of each k-mer divided by the total number of k-mers). For long sequences with N ≫ 4^k, the k-mer frequencies converge to their mean, that is 4^{−k} for all components, which implies that ||v_x − v_y||_1 → 0. Therefore, any k-mer-profile-based method will severely underestimate the distance between these two random sequences. In order to avoid this, k has to be restricted to larger values, k ≳ log₄(N).
Consider the scenario in which we want to rely on k-mer matching to find similar parts of two sequences x and y, where y is generated by independently mutating every index of x with probability r ∈ (0,1). The probability that a given k-mer of x remains unmutated in y is (1 − r)^k. This implies that k ≲ r^{−1} is necessary to keep the k-mer match probability from converging to zero.
By combining the insights from these examples, we conclude that any value of k will fail with high probability in one of these scenarios for distantly related sequences. Since both results hold for k-mer frequency vectors in general, any such statistic on the k-mer profile is also bound to fail. This inherent problem with k-mers is our main motivation for designing sketches that are more resilient to mutations.
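To make the first example concrete, the following minimal simulation (illustrative code with our own helper names, not part of any published pipeline) compares the ℓ1 distance between the k-mer frequency profiles of two unrelated random sequences for a small and a moderately large k; in line with the argument above, the small-k distance should come out close to zero, while the larger-k distance should not.

```python
import numpy as np
from collections import Counter

def kmer_profile(seq, k, alphabet=4):
    """Normalized k-mer frequency vector of an integer sequence."""
    counts = Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))
    total = len(seq) - k + 1
    profile = np.zeros(alphabet ** k)
    for kmer, c in counts.items():
        idx = 0
        for ch in kmer:                  # encode the k-mer as a base-|alphabet| integer
            idx = idx * alphabet + int(ch)
        profile[idx] = c / total
    return profile

rng = np.random.default_rng(0)
N = 100_000
x, y = rng.integers(0, 4, N), rng.integers(0, 4, N)
for k in (3, 8):
    d = np.abs(kmer_profile(x, k) - kmer_profile(y, k)).sum()
    print(f"k={k}: ||v_x - v_y||_1 = {d:.3f}")  # small k: near 0; larger k: clearly > 0
```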
Conceptually, our method can be seen as an expectation over all possible spaced seeds, without any limit on the number of ignore positions. In other words, Tensor Sketching resolves the inherent limitations of any seed-based method by taking an average over all possible seeds, thereby reducing the risk of selecting a seed that lowers the sensitivity. In statistical terms, this implies that the resulting sketch is a Bayesian admissible estimator.
We begin by introducing our notation and terminology in Section 2.1 and then proceed to presenting the Tensor Sketch and Tensor Slide Sketch methods in Sections 2.2 and 2.3, respectively. In Section 3 we first summarize our data generation scheme and then compare our approach to the available state-of-the-art methods, before motivating possible applications such as phylogeny reconstruction and graph distance estimation. In the final Section 4 we put our work into context and give an outlook on future work.
2 Methods
2.1 Preliminaries and Notation
Sets
We use 𝟙{·} to denote the indicator function, taking a value of 1 when the logical expression between the brackets is true, and zero otherwise. For an integer N, [N] denotes the integer set [N] := {1, …, N}. For a finite set S, let |S| and #S interchangeably denote its cardinality, and let the bijective maps from S to [|S|] be referred to as the permutations of S. Finally, we use X ~ S to denote that X is drawn uniformly at random from the set S.
Vectors
Throughout the manuscript, boldface fonts are used to distinguish vectors, tensors, and sequences from scalar variables, e.g., a = (a_1, …, a_N). 1_N and 0_N denote the all-ones and all-zeros vectors of length N, respectively. We write i ≡_N j when the integers i and j are equal modulo N. The circular r-shift shift_r(a) shifts the elements of a to the left by r positions, shift_r(a) := (a_{r+1}, …, a_N, a_1, …, a_r); formally, shift_r(a)_i := a_{σ(i)}, where the mapping σ(i) := ((i + r − 1) mod N) + 1 circularly shifts indices to the left.
Strings
Let Σ, Σ^N, and Σ* denote the alphabet, the set of strings of length N, and the set of all strings, respectively. Let x ∘ y denote the concatenation of x and y, and |x| the length of the string x. We use x_i to denote the i-th character. Define x[i:j] to be the slice from the i-th to the j-th index, x[i:j] := x_i … x_j, referred to as a substring of x, and define x[i] := x_{i_1} … x_{i_k} to be a subsequence of x, when i = (i_1, …, i_k) with 1 ≤ i_1 < ⋯ < i_k ≤ |x| are strictly increasing indices of x. For two strings x, y ∈ Σ*, the edit distance d_ed(x, y) denotes the minimum number of edit operations (insertions, deletions, and substitutions) needed to transform one string into the other. It can be defined recursively over the last characters a, b ∈ Σ as

d_ed(x ∘ a, y ∘ b) = min{ d_ed(x, y ∘ b) + 1, d_ed(x ∘ a, y) + 1, d_ed(x, y) + 𝟙{a ≠ b} },

with d_ed(ε, ε) := 0 as the recursion basis.
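For reference, this recursion corresponds to the familiar quadratic-time dynamic program; a minimal implementation (ours, for illustration only) is:

```python
def edit_distance(x: str, y: str) -> int:
    """d_ed(x, y): the minimum number of insertions, deletions, and substitutions
    needed to turn x into y, computed with the classic O(|x|*|y|) dynamic program."""
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                                   # delete all of x[:i]
    for j in range(m + 1):
        d[0][j] = j                                   # insert all of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + (x[i - 1] != y[j - 1]))  # (mis)match
    return d[n][m]
```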
Minimum hash sketching
For a sequence x ∈ Σ^N, let X := (X_1, …, X_{N−k+1}) denote its k-mer sequence, with X_i denoting the i-th k-mer, X_i := x[i:i+k−1], and #X_i the occurrence number of this k-mer, #X_i := #{j ≤ i : X_j = X_i}. Furthermore, the pairs (X_i, #X_i) will be referred to as unified k-mers.
In MinHash (MH), a random hash is used to order the k-mers, and for each hash function the k-mer with the lowest hash value is kept; that is, with respect to hash functions h_1, …, h_D over k-mers, the sketch is φ_MH(x) := (X_{i_r})_{r∈[D]} with i_r := argmin_i h_r(X_i).

In Weighted MinHash (WMH), the hash functions are instead drawn from a hash family over unified k-mers, so that for each hash function the unified k-mer (X_i, #X_i) with the lowest hash value is kept.

Finally, Ordered MinHash (OMH) [9] generalizes WMH by sampling tuples of k-mers that appear in the same order within the sequence. The sampling uses hashes over the unified k-mer space and selects the indices that are mapped to the t lowest hash values. Formally, for each hash function h_r, we sketch (X_{σ_1}, …, X_{σ_t}), where the monotone indices 1 ≤ σ_1 < ⋯ < σ_t ≤ |X| are those mapped to the t lowest hash values among {h_r(X_i, #X_i)}_i.
We can compute distances via the Hamming distance d_H(a, b), defined as the number of indices at which the input sequences differ, d_H(a, b) := #{i : a_i ≠ b_i}, where a and b have equal length. When φ is a MinHash sketch, d_H(φ(x), φ(y)) is closely related to the Jaccard similarity index between the k-mer sets, J(S_X, S_Y) := |S_X ∩ S_Y| / |S_X ∪ S_Y| with S_X := {X_i : i ≤ |x| − k + 1} and S_Y defined analogously, since E[1 − d_H(φ(x), φ(y))/D] = J(S_X, S_Y).
Note that MH, WMH, and OMH are defined as the elements with the lowest hash over a set of k-mers, unified k-mers, and ordered tuples of k-mers, respectively. For example, in WMH we change the definition of S_X to the set of unified k-mers, S_X := {(X_i, #X_i) : i ≤ |x| − k + 1} (and analogously for S_Y). It is worth mentioning that OMH becomes equivalent to WMH if we set the tuple length to one, t = 1.
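As a concrete illustration of these definitions (with our own helper names, and CRC32 as a simple stand-in for the hash families; not the implementation used in the experiments), MH and WMH sketches and their Hamming distance can be computed as follows:

```python
import zlib

def _h(r, obj):
    # Deterministic hash for repetition r: CRC32 of a string encoding of the object.
    return zlib.crc32(f"{r}|{obj}".encode())

def kmers(x, k):
    return [x[i:i + k] for i in range(len(x) - k + 1)]

def minhash_sketch(x, k, D):
    """MH: for each of the D hash functions, keep the k-mer with the smallest hash."""
    return [min(kmers(x, k), key=lambda km: _h(r, km)) for r in range(D)]

def weighted_minhash_sketch(x, k, D):
    """WMH: hash 'unified k-mers', i.e. (k-mer, occurrence number) pairs."""
    seen, unified = {}, []
    for km in kmers(x, k):
        seen[km] = seen.get(km, 0) + 1
        unified.append((km, seen[km]))
    return [min(unified, key=lambda u: _h(r, u)) for r in range(D)]

def hamming(a, b):
    """Sketch distance: the number of positions at which two sketches differ."""
    return sum(ai != bi for ai, bi in zip(a, b))
```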
2.2 Tensor Sketch
First, define the set of all increasing t-tuples of indices over [N], {(i_1, …, i_t) : 1 ≤ i_1 < ⋯ < i_t ≤ N}. Without loss of generality, let the alphabet be relabeled to Σ = [m], with m := |Σ| denoting its size. The main idea behind the sequence tensor sketch is to count subsequences as opposed to k-mers. Define T_x ∈ ℝ^{m^t} to be the probability mass function of a random t-ary subsequence of x,

T_x[a_1, …, a_t] := ℙ_i( x[i] = a_1 ⋯ a_t ),

where i is drawn uniformly from the increasing t-tuples over [|x|]. With a slight abuse of notation, we treat T_x both as a probability mass function and as a t-dimensional tensor of size m^t. Then, for two arbitrary strings x, y ∈ Σ*, define the order-t similarity s^{(t)}(x, y) and distance d^{(t)}(x, y) as

s^{(t)}(x, y) := ⟨T_x, T_y⟩,    (9)

d^{(t)}(x, y) := ||T_x − T_y||₂.    (10)
The inner product ⟨T_x, T_y⟩ equals the probability of a match if we draw one t-ary subsequence from each of x and y, which closely relates it to the OMH sketch.
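For illustration, the subsequence tensor and the order-t similarity of equation (9) can be constructed naively as follows (our own code; exponential in t, so only usable for very short inputs, but it makes the object being sketched explicit):

```python
import itertools
import numpy as np

def subsequence_tensor(x, t, m=4):
    """Order-t subsequence tensor T_x: the empirical probability mass of the
    t-ary (not necessarily contiguous) subsequences of x (integers in [0, m))."""
    T = np.zeros((m,) * t)
    for idx in itertools.combinations(range(len(x)), t):   # all increasing t-tuples
        T[tuple(x[i] for i in idx)] += 1
    return T / T.sum()

def order_t_similarity(x, y, t, m=4):
    """s^(t)(x, y) = <T_x, T_y>: the probability that two independently drawn
    t-ary subsequences of x and y spell the same string."""
    return float(np.sum(subsequence_tensor(x, t, m) * subsequence_tensor(y, t, m)))
```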
Naive computation of equations (9) and (10) requires O(N^t) time and O(m^t) memory, which becomes prohibitive even for moderate values of these variables. Our tensor sketching scheme provides a (1+ε)-approximation of the tensor distance, but requires only O(NtD) time to compute. Furthermore, any downstream analysis will operate on sketches of size D, regardless of the original sequence length.
Since we are interested in preserving the Euclidean norm after sketching, we follow the definition of tensor sketch by Pham and Pagh [14]. For an integer D, the tensor sketch Φ: ℝ^{m^t} → ℝ^D sketches an m^t-dimensional tensor into ℝ^D. Let us assume a series of pairwise independent hash functions h_1, …, h_t: [m] → [D] and sign hash functions s_1, …, s_t: [m] → {−1, +1}. Moreover, define the hash sum H(a_1, …, a_t) := h_1(a_1) + ⋯ + h_t(a_t) (mod D) and the hash sign product S(a_1, …, a_t) := s_1(a_1) ⋯ s_t(a_t). Finally, the tensor sketch Φ(T) := (φ_r(T))_{r∈[D]} is defined as

φ_r(T) := Σ_{a ∈ [m]^t : H(a) ≡_D r} S(a) · T[a],    (12)

where T ∈ ℝ^{m^t} is an arbitrary tensor. Crucially, the tensor sketch preserves the Euclidean norm in expectation, E[ ||Φ(T)||₂² ] = ||T||₂², and, assuming t to be constant, the variance decreases with the sketch dimension, Var[ ||Φ(T)||₂² ] = O(||T||₂⁴ / D). Therefore, ||Φ(T)||₂² is concentrated around its mean by a second-moment bound (see Lemma 7 of Pham and Pagh [14]), and a sketch size of D = O(1/ε²) suffices to bound the multiplicative error by ε.
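As a reference point, the definition in (12) can be applied directly to an explicit tensor (our own illustrative code); this exponential-size construction is exactly what the implicit algorithm of the next subsection avoids:

```python
import numpy as np

def tensor_sketch_explicit(T, hashes, signs, D):
    """Tensor sketch of an explicit tensor T of shape (m, ..., m): every entry is
    bucketed by the sum of its coordinate hashes modulo D and accumulated with the
    product of its coordinate signs.  hashes/signs are (t, m) lookup tables."""
    t = T.ndim
    phi = np.zeros(D)
    for a in np.ndindex(T.shape):                   # all coordinates (a_1, ..., a_t)
        r = sum(int(hashes[j, a[j]]) for j in range(t)) % D
        s = np.prod([signs[j, a[j]] for j in range(t)])
        phi[r] += s * T[a]
    return phi
```

Applying this to T_x and T_y built with subsequence_tensor above, ||Φ(T_x) − Φ(T_y)||₂² is an unbiased estimate of ||T_x − T_y||₂².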
If we could compute Φ(T_x) and Φ(T_y) efficiently, then estimating the distance as ||Φ(T_x) − Φ(T_y)||₂ would merely require O(D) time, as opposed to the exponential cost of constructing the ambient space. However, since the tensors T_x are never rank-1, the conventional efficient tensor sketching schemes cannot be applied here. Therefore, we have designed an efficient algorithm to compute (12), and we provide several extensions of the algorithm in the next section. In summary, for a sequence x, the tensor sketch φ_ts(x) is computed by applying the tensor sketch Φ to the tuple tensor T_x:

φ_ts(x) := Φ(T_x).    (13)
Implicit sketching by recursive computation
First, observe that the r-th component of the sketch, φ_r(T_x), can be rewritten in terms of the probability mass of (H(x[i]), S(x[i])), when i is a uniformly drawn ordered t-tuple:

φ_r(T_x) = ℙ_i( H(x[i]) ≡_D r, S(x[i]) = +1 ) − ℙ_i( H(x[i]) ≡_D r, S(x[i]) = −1 ).    (16)
We set out to recursively compute this probability mass for t-ary tuples, based on the probability mass of shorter ordered tuples. Therefore, for p ∈ {1, …, t}, let T^{(p)}_x(r, b) be the probability mass function of (H_p(x[i]), S_p(x[i])), when i is a random ordered p-tuple over [N], while H_p(a) := h_1(a_1) + ⋯ + h_p(a_p) (mod D) and S_p(a) := s_1(a_1) ⋯ s_p(a_p) are the partial hash sum and product up to p, respectively.
To recursively compute T^{(p)}_{x[1:N]} for p ∈ [t], we separate the tuples into two cases, depending on whether the last index is part of the tuple (i_p = N) or not (i_p < N):

T^{(p)}_{x[1:N]}(r, b) = (1 − z) · T^{(p)}_{x[1:N−1]}(r, b) + z · T^{(p−1)}_{x[1:N−1]}(r − r′ mod D, b·b′),    (17)

where z := p/N, since ℙ(i_p = N) = p/N for a uniformly drawn increasing p-tuple over [N], and r′ := h_p(x_N), b′ := s_p(x_N). Defining the vector T^{(p)}_{x,b} := (T^{(p)}_x(r, b))_{r∈[D]}, the index shift r − r′ (mod D) amounts to circularly shifting T^{(p−1)}_{x[1:N−1], b·b′} to the left by r′ indices. Hence, the recursion in vector form can be written as
♢ Recursive insert (p ∈ [t]): T^{(p)}_{x[1:N], b} = (1 − z) · T^{(p)}_{x[1:N−1], b} + z · shift_{r′}( T^{(p−1)}_{x[1:N−1], b·b′} )    (18)

♢ Basis (p = 0): T^{(0)}_{x, +1} := (𝟙{r ≡_D 0})_{r∈[D]},  T^{(0)}_{x, −1} := 0_D
Interestingly, the recursion relation for the variables at position N depends only on the variables at position N − 1. Therefore, it is sufficient to store the variables corresponding to the current position, which implies a memory complexity of O(tD). This fact is exploited in Algorithm 1, which is why the subscript x[1:N] is dropped from T, and ± is used for brevity⁴. The number of sub-problems needed to compute (φ_r(T_x))_{r∈[D]} is exactly 2NtD, and the cost of each recursion step is O(1). Therefore, we can compute the sketch in O(NtD) time.
Algorithm 1: Sequence Tensor Sketching
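A Python sketch of the insert-only recursion (18), using our own variable names and conventions rather than the authors' reference implementation, is given below; on short inputs it can be checked against tensor_sketch_explicit applied to subsequence_tensor.

```python
import numpy as np

def tensor_sketch_stream(x, t, D, hashes, signs):
    """Implicit tensor sketch of an integer sequence x (values in [0, m)).

    hashes: (t, m) int array with values in {0, ..., D-1}; signs: (t, m) in {-1, +1}.
    T[p, 0] / T[p, 1] hold, for every residue r, the probability that a uniformly
    random increasing p-tuple of indices of the prefix seen so far has partial hash
    sum r (mod D) and sign product +1 / -1.  Runs in O(N t D) time and O(t D) memory.
    """
    T = np.zeros((t + 1, 2, D))
    T[0, 0, 0] = 1.0                          # basis: empty tuple, hash sum 0, sign +1
    for i, c in enumerate(x, start=1):        # insert the i-th character
        for p in range(min(i, t), 0, -1):     # longer tuples first: T[p-1] still holds
            z = p / i                         #   the previous prefix; z = P(i_p = i)
            flip = 1 if signs[p - 1, c] < 0 else 0
            for b in (0, 1):
                shifted = np.roll(T[p - 1, b], int(hashes[p - 1, c]))
                T[p, b ^ flip] = (1 - z) * T[p, b ^ flip] + z * shifted
    return T[t, 0] - T[t, 1]                  # r-th entry equals phi_r(T_x), cf. (16)
```

The sketch distance of equation (10) is then estimated as np.linalg.norm(tensor_sketch_stream(x, ...) − tensor_sketch_stream(y, ...)), using the same hash tables for both sequences.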
2.3 Tensor Slide Sketch
In this section, we improve the precision of the tensor sketch by concatenating sketches of overlapping subsequences of the original sequence. However, instead of applying Algorithm 1 to compute each sketch separately, we can exploit the fact that many sub-problems in the recursion (18) are shared. The main idea is to consider the sub-problems T^{(p,q)}, corresponding to the hash functions h_p, …, h_q and s_p, …, s_q, for all 1 ≤ p ≤ q ≤ t, so that we can efficiently remove characters from the initial position. Formally, T^{(p,q)}_x(r, b), for all r ∈ [D] and b ∈ {−1, +1}, is the probability of the partial hash sum and sign product over h_p, …, h_q and s_p, …, s_q being equal to (r, b), when i is a randomly drawn (q − p + 1)-ary increasing subsequence (see Supplementary Section A for more details).
In Algorithm 2, T^{(p,q)}_b denotes the probability vector (T^{(p,q)}(r, b))_{r∈[D]}. The time and memory complexity of computing the sketches of all w-mers of a reference sequence of length N are O(t²ND) and O(t²D), respectively. The length w of the sliding window does not appear in any complexity term.
Equipped with the streaming algorithm, we can concatenate the sketches of windows as we slide the window along the sequence. More specifically, because sliding windows with a large overlap cannot improve our estimate, we can down-sample by a factor of s, referred to as the stride; in other words, we store the sketch of a window every s base pairs. Formally, we define φ_tss(x) as

φ_tss(x) := ( φ_ts(x[is+1 : is+w]) )_{i = 0, 1, …, ⌊(N−w)/s⌋},

where φ_ts is the Tensor Sketch (13). If we naively called the tensor sketch for each window, O(wtN/s) time would be required per embedding dimension, whereas the tensor slide sketch algorithm computes all window sketches in O(t²N). In our experience, moderate values of t ≤ 5 suffice for most applications, implying that for all w > ts the streaming algorithm will run faster. Finally, the distance between two tensor slide sketches is defined as the squared Euclidean norm, with zero-padding if the two sequences are not of equal length.
Algorithm 2: Tensor Slide Sketch
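As an illustration of the definition of φ_tss above (not of the streaming Algorithm 2 itself), the naive construction simply re-sketches every window; it reuses the tensor_sketch_stream helper sketched in Section 2.2 and pays the extra factor w/s that Algorithm 2 avoids:

```python
import numpy as np

def tensor_slide_sketch_naive(x, t, D, w, s, hashes, signs):
    """phi_tss(x): concatenation of the tensor sketches of the length-w windows
    taken every s positions (naive version; Algorithm 2 shares sub-problems
    across windows instead of re-sketching each one)."""
    starts = range(0, len(x) - w + 1, s)
    return np.concatenate([tensor_sketch_stream(x[i:i + w], t, D, hashes, signs)
                           for i in starts])
```

The distance between two such sketches is the squared Euclidean norm, with the shorter sketch zero-padded if the sequences differ in length.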
3 Experimental Results
The primary objective of our tensor sketching framework is to efficiently sketch global alignment scores between sequences. In order to put our method into context, we compare its performance to MH, WMH, and OMH [9], three state-of-the-art sketching techniques (see Section 2 for formal definitions).
3.1 Synthetic data generation and parameter selection
As a basis for evaluation, we generated a set of synthetic sequence pairs. Given a mutation rate r ∈ (0,1) and a randomly generated reference sequence, we mutate each index with probability r, with insertions, deletions, and substitutions being equally likely. While the mutation rate within each pair is fixed, it is drawn uniformly from the unit interval anew for each pair, which ensures the diversity of the resulting edit distances.
We evaluate the sketch-based distance of each pair against the exact edit distance by quantifying how well each method preserves the order of distances, as captured by Spearman's rank correlation. We select the parameters of each hashing method such that it achieves the highest rank correlation for a fixed embedding dimension D (see Supplementary Section B for more details regarding parameter selection).
3.2 Tensor slide sketch achieves high rank correlation with edit distance
For each sketching method, we determined Spearman’s rank correlation between edit distance and sketch distance, as well as absolute and relative execution time (Table 1). In particular, Tensor Slide Sketch achieves a rank correlation of 0.956 while reducing the computation time by 96.3%. Furthermore, we calculated the area under the receiver operating characteristic (AUROC) for a classifier discriminating two sequences based on a threshold for their exact edit distance (normalized by length), indicating that TS and TSS outperform other methods in detection of distantly related sequences (normalized edit distance threshold 0.5), while TSS outperforms other methods on all but the smallest edit distance threshold.
Table 1: 1000 sequence pairs of length 10000 were generated over an alphabet of size 4, with the mutation rate drawn uniformly from [0.00, 1.00]. The time column shows the normalized time in milliseconds, i.e., the total time divided by the number of sequences, while the relative time shows the ratio of the sketch-based time to the time for computing the exact edit distance. The values shown are averages over independent trials, followed by their standard deviations. The embedding dimension is set to D = 64, and the individual model parameters are (a) MinHash k = 12, (b) Weighted MinHash k = 12, (c) Ordered MinHash k = 2, t = 7, (d) Tensor Sketch t = 6, (e) Tensor Slide Sketch w = 1000, s = 100, t = 3.
We then assessed the performance and execution time of all considered methods on the synthetically generated data set (Figure 1). We observe that the quality of both MinHash (MH) and Weighted MinHash (WMH) greatly diminishes as the sequences become more distant from one another (Figure 1a), and that the quality of MH and WMH drops with increasing sequence length (Figure 1b). In both plots, OMH and the tensor sketch methods maintain a much higher accuracy. Notably, the increased accuracy of TS and TSS comes at only a very small additional cost in time when compared with MH (Figure 1c), while both methods run much faster than OMH. Lastly, we assessed the relationship between sketching accuracy and the number of embedding dimensions for each method (Figure 1d). Remarkably, TSS achieves a rank correlation of 0.74 even when using only 10 embedding dimensions.
Figure 1: The dataset for these experiments consisted of 2000 sequence pairs independently generated over an alphabet of size 4. The embedding dimension is set to D = 256, and the model-specific parameters are MinHash k = 12, Weighted MinHash k = 12, Ordered MinHash k = 2, t = 7, Tensor Sketch t = 6, and Tensor Slide Sketch t = 3, with window and stride set to 10% and 1% of the sequence length and its embedding dimension set to the square root of D. (1a) Area under the ROC curve (AUROC) for the detection of edit distances below a threshold using the sketch-based approximations; the x-axis shows which (normalized) edit distance threshold is used, and the y-axis shows the AUROC for the various sketch-based distances. (1b) Spearman's rank correlation plotted against the sequence length (logarithmic scale). (1c) Similar setting to (1b), plotting the execution time of each sketching method (y-axis, logarithmic scale) as a function of sequence length (x-axis, logarithmic scale); the reported times are normalized, i.e., average sketching time plus average distance computation time for each method. (1d) Spearman's rank correlation of each sketching method as a function of the embedding dimension D (x-axis, logarithmic scale). In all subplots the solid lines and shades indicate the mean and standard deviation, computed over independent runs with the same configuration.
In summary, TSS produces the most accurate sketches across the board, while introducing only a small time and memory overhead compared with min-hashing. Furthermore, TS and TSS can be computed in a streaming fashion. For example, in order to sketch all 100-mers of a reference sequence, the time complexity grows only linearly with the length of the reference sequence, while WMH and MH need to be computed separately for each window, making them slower by a factor of the window length for this task.
While the sorting of hash values creates a computational bottleneck for all hash-based schemes, the tensor sketch update is based on a vector shift and sum, which explains its faster execution in practice. Moreover, such vector operations can be executed in a single step on some CPU or GPU architectures to achieve an even greater boost. Finally, the dependence on large values of k introduces additional memory and time overhead if k-mers no longer fit into built-in types. In contrast, tensor sketches can be stored in efficient built-in types regardless of the choice of parameters.
3.3 Tensor Sketching helps estimate phylogenies
We further explored the task of estimating all pairwise distances within a given set of evolutionarily related sequences, resulting in a well-structured distance matrix. This task is reminiscent of important bioinformatics applications, such as the binning of metagenomic reads or phylogeny reconstruction. In both cases, the reconstruction of a good distance matrix as a whole, and not only of individual components, forms the basis for a good approximation. While there are several ways to formulate this problem, we simply visualize the exact and approximate matrices and compare them with an overall metric such as Spearman's rank correlation. Figure 2 shows such a distance matrix. The sequences were generated to emulate a phylogenetic tree, with each sequence mutated from its parent at a fixed rate of 15%. The optimal parameters of each model were again chosen based on Spearman's rank correlation, while the window and stride length for TSS were fixed to 10% and 1% of the sequence length, giving an advantage to the competing methods. It is evident that MH and WMH are accurate only for small distances and fail for more distant sequences, which could negatively affect tree reconstruction based on these estimated distances. In contrast, OMH and the tensor sketching methods preserve a wider range of distances, which makes them better candidates for approximate phylogeny reconstruction.
Figure 2: Subplot (a) illustrates the exact edit distance matrix, while subplots (b)-(f) show the approximate distance matrices based on the sketching methods. To highlight how well each method preserves the rank of distances, the color code in all plots indicates the rank of each distance (lighter, shorter distance), and Spearman's rank correlation is shown at the top of each plot. The phylogeny was simulated from an initial random sequence of length N = 10000 over an alphabet of size 4. Afterwards, for 6 generations, each sequence in the gene pool was mutated and added back to the pool, resulting in 64 sequences overall. The mutation rate was set to 0.15. The embedding dimension is set to D = 64, and the individual model parameters are (b) MinHash k = 12, (c) Weighted MinHash k = 12, (d) Ordered MinHash k = 2, t = 7, (e) Tensor Sketch t = 5, (f) Tensor Slide Sketch t = 2, w = 1000, s = 100.
3.4 Tensor Sketch opens up a wide range of applications
Sketching de-Bruijn graph distances
The slide-sketch algorithm presented in Algorithm 2 assumes only that a stream of incoming characters is available and that the task is to sketch substrings of the entire sequence. While this description may sound restricted to a linear reference sequence, the same assumptions actually apply to any Euler path on a string graph. For example, if the input is a de-Bruijn graph and the task is to sketch all vertices of the graph, any traversal of the graph that walks through adjacent vertices can be mapped to a stream of incoming and outgoing characters, which fits the assumptions of the slide-sketch algorithm. Consequently, the time complexity of sketching the vertices of a de-Bruijn graph grows only as a function of its size, i.e., the number of its vertices, as opposed to the size of each vertex.
Robustness to Global Mutations
Tensor Sketch is sensitive to global shifts and transpositions of characters, while k-mer statistics are entirely local. To illustrate this, consider the case where sequence y is constructed from x by swapping its first and second half, y := x[N/2+1:N] ∘ x[1:N/2]. All k-mers except those crossing the middle of x remain intact under this block permutation, that is (N − 2k + 2) out of the (N − k + 1) k-mers, implying that for sufficiently large N ≫ k the k-mer profile of y converges to that of x. This may severely overestimate their alignment score. In contrast, only 2·binom(N/2, t)/binom(N, t) ≈ 2^{1−t} of the t-ary subsequences remain intact. This greater sensitivity makes tensor sketch an appealing method for applications in which such block permutations are likely to occur.
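A quick numerical check of these two fractions, using the combinatorial counts stated above (illustrative code):

```python
from math import comb

N, k, t = 10_000, 12, 4
kmer_frac = 2 * (N // 2 - k + 1) / (N - k + 1)     # k-mers fully inside either half
tuple_frac = 2 * comb(N // 2, t) / comb(N, t)      # t-tuples of positions whose order survives
print(f"intact k-mers:         {kmer_frac:.3f}")   # ~0.999
print(f"intact t-subsequences: {tuple_frac:.3f}")  # ~0.125 = 2**(1 - t)
```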
4 Discussion
We presented tensor sketching, a method for estimating sequence similarity for biological sequences. We demonstrated that Tensor Slide Sketch achieves a high Spearman's rank correlation while running an order of magnitude faster than computing the exact alignment. When compared with other sketching methods, Tensor Sketch and Tensor Slide Sketch both preserve the order between distances more accurately than MinHash, Weighted MinHash, and Ordered MinHash.
Bayesian estimation with tensor sketching
Spaced k-mers were introduced motivated by applications involving distantly related sequences. If we allow for i ignore positions in addition to k match positions, there will be on the order of binom(k+i, i) possible patterns. Considering all such patterns is clearly not an option if i is anything larger than a small constant. Not only is it highly non-trivial to search for the optimal pattern in this large combinatorial space, but any incremental step of pattern optimization also has to be repeated if different tasks require different sets of patterns. Seen from this angle, Tensor Sketch can be interpreted as an average over all possible spaced t-mer patterns with (N − t) ignore positions, while the sketching step avoids the exponential cost of explicitly representing all combinatorial options.
The core idea of tensor sketching, namely averaging over all possible seed patterns, can alternatively be viewed as Bayesian estimation. This implies that our estimator is admissible, i.e., no other estimator can outperform it across the board. This provides some intuition into why our approach achieves acceptable or better accuracy over the entire range of edit distances. In other words, the risk of selecting a bad seed or k-mer size is minimized by taking the mean over all possible seeds. While this corresponds to a uniform prior over seeds, one can introduce a non-uniform prior by assigning different weights to the entries of our tensors, for example to penalize patterns with a larger number of gaps. In this way, we can strengthen or weaken the contribution of individual patterns. Such a weighted average can help incorporate background statistics into the distance metric, as well as introduce non-uniform mismatch penalties. These variations and extensions come naturally within the tensor sketching framework, in contrast with hash-based approaches that would require heuristic steps.
Conclusion
This work's main contributions are the introduction of two tensor-based sketching methods for sequence similarity estimation: Tensor Sketch, together with an efficient algorithm to compute it, and Tensor Slide Sketch, a streaming version of the algorithm. Our results indicate that the tensor sketching method and its applications open up exciting research directions for the bioinformatics community to explore. The main advantages of tensor sketching compared with hash-based methods are that it 1) can run in a streaming fashion, 2) achieves much higher accuracy, and 3) runs much faster in practice.
Acknowledgements
We would like to thank Ximena Bonilla, Daniel Danciu, Kjong-Van Lehmann, and Ragnar Groot Koerkamp for their constructive feedback on the manuscript. AJ was funded by ETH Zurich core funding to GR. AK was partially funded by PHRT Project #106 and the Swiss National Science Foundation Grant No. 407540_167331 “Scalable Genome Graph Data Structures for Metagenomics and Genome Annotation” as part of Swiss National Research Programme (NRP) 75 “Big Data”.
Appendix A Tensor Slide Sketch Algorithm
For all 1 ≤ p ≤ q ≤ t, let H_{p:q}(a) := h_p(a_1) + ⋯ + h_q(a_{q−p+1}) (mod D) and S_{p:q}(a) := s_p(a_1) ⋯ s_q(a_{q−p+1}) be the partial hash sum and product from p up to q, respectively, and define T^{(p,q)}_x(r, b) to be the probability mass function of (H_{p:q}(x[i]), S_{p:q}(x[i])), when i is a uniformly drawn ordered (q − p + 1)-tuple.
The recursive insertion is analogous to (18): we can derive T^{(p,q)}_{x[1:N]} from the previous layer T^{(p,q)}_{x[1:N−1]} and the smaller problem T^{(p,q−1)}_{x[1:N−1]}. Recursive deletion is equivalent to rolling back the insertion of the first character. We can derive T^{(p,q)}_{x[2:N]} from T^{(p,q)}_{x[1:N]} and T^{(p+1,q)}_{x[2:N]} by separating the tuples into two cases, according to whether or not the first index is part of the tuple:

T^{(p,q)}_{x[1:N]}(r, b) = (1 − z) · T^{(p,q)}_{x[2:N]}(r, b) + z · T^{(p+1,q)}_{x[2:N]}(r − r′ mod D, b·b′),

where z := (q − p + 1)/N, analogous to the insertion-only case, and r′ := h_p(x_1) and b′ := s_p(x_1) are the hashes of the removed character. We can rearrange this identity to compute T^{(p,q)}_{x[2:N]} from the other terms:

T^{(p,q)}_{x[2:N]}(r, b) = ( T^{(p,q)}_{x[1:N]}(r, b) − z · T^{(p+1,q)}_{x[2:N]}(r − r′ mod D, b·b′) ) / (1 − z).
Defining the probability vector T^{(p,q)}_{x,b} := (T^{(p,q)}_x(r, b))_{r∈[D]} and writing z := (q − p + 1)/N, with (r′, b′) := (h_p(x_1), s_p(x_1)) for deletions and (r′, b′) := (h_q(x_N), s_q(x_N)) for insertions, we can write the recursion in vector form more concisely as:

♢ Recursive delete (p ≤ q): T^{(p,q)}_{x[2:N], b} = ( T^{(p,q)}_{x[1:N], b} − z · shift_{r′}( T^{(p+1,q)}_{x[2:N], b·b′} ) ) / (1 − z)    (31)

♢ Recursive insert (p ≤ q): T^{(p,q)}_{x[1:N], b} = (1 − z) · T^{(p,q)}_{x[1:N−1], b} + z · shift_{r′}( T^{(p,q−1)}_{x[1:N−1], b·b′} )

♢ Basis (q < p): T^{(p,q)}_{x, +1} := (𝟙{r ≡_D 0})_{r∈[D]},  T^{(p,q)}_{x, −1} := 0_D
In order to implement an iterative algorithm, one can roll back the recursive calls and execute them in reverse order. Namely, for deletion, the recursive relation for T^{(p,q)}_{x[2:N]} depends on T^{(p,q)}_{x[1:N]} and T^{(p+1,q)}_{x[2:N]}, implying that the update for the shorter tuple (p + 1, q) must precede the update for (p, q); this imposes a partial order corresponding to the tuple length (q − p + 1), which in turn corresponds to the length of the random subsequence. For the insertion recursion (18), the sub-problems refer to the state before the current update, whereas for the deletion recursion (31), the sub-problems refer to the state after it; in-place updates therefore proceed in descending order of subsequence length for insertions and in ascending order for deletions.
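To make the roll-back concrete, here is a sketch of one deletion step (our own code, consistent with the deletion recursion (31) as stated above and not necessarily identical to the authors' Algorithm 2). Tbl maps (p, q) to a (2, D) array holding the probability vectors for sign +1 and −1; entries with q < p hold the fixed empty-tuple basis, and updates proceed from shorter to longer tuples:

```python
import numpy as np

def slide_delete(Tbl, c, window_len, t, D, hashes, signs):
    """Roll back the insertion of character c (the character leaving a window of
    length window_len > t).  hashes/signs are the (t, m) lookup tables of h_p, s_p."""
    for length in range(1, t + 1):                 # shorter tuples first
        for p in range(1, t - length + 2):
            q = p + length - 1
            z = length / window_len                # P(the first tuple index is position 1)
            flip = 1 if signs[p - 1, c] < 0 else 0
            shorter = Tbl[(p + 1, q)]              # already holds its post-deletion value
            for b in (0, 1):
                rolled = np.roll(shorter[b ^ flip], int(hashes[p - 1, c]))
                Tbl[(p, q)][b] = (Tbl[(p, q)][b] - z * rolled) / (1.0 - z)
    return Tbl
```

Insertion of a new character at the right end of the window mirrors the insert step of the earlier code, applied to every pair (p, q) with the hash and sign functions h_q and s_q, in descending order of tuple length; the window sketch is then read off as Tbl[(1, t)][0] − Tbl[(1, t)][1].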
Appendix B Parameter selection
The parameters for each model were selected according to the highest Spearman’s rank correlation. Figure S1 shows the accuracy of each method as a function of its parameters. The only model that was optimized for more than one parameter is OMH, where tuple length t and k-mer length k were jointly optimized.
B.1 Hashing algorithm
In our parameter search both CRC32 and MurmurHash3 hashing algorithms were used to investigate the effect of hashing on the accuracy of min-hashing methods. We did not observe a significant effect of hashing on the accuracy of min-hashing. Therefore, MurmurHash3 was used in all experiments following the grid search, as it is more commonly used in bioinformatics practice.
B.2 Tensor Slide Sketch
As TSS produces a 2-dimensional output, in all experiments its embedding dimension was set to √D. For example, if MH, WMH, OMH, and TS were run with 64 embedding dimensions, the TSS sketch dimension was set to 8. TSS concatenates these 8-dimensional sketches every s steps. Throughout the experiments, the window and stride lengths were always set proportionally to the sequence length, to 10% and 1% of the length, respectively. As a result, the overall sketch size of TSS remains comparable to that of the other methods.
Figure S1: Hyper-parameters for each method: subplots (a) to (f) show Spearman's rank correlation of each sketching method versus its respective parameters. The shades in the plots indicate the standard deviation of Spearman's correlation, calculated over independent runs of the same configuration. The grid search is defined over 1 ≤ k ≤ 16 and 2 ≤ t ≤ 10. The tuple length t = 1 is not included, because this case is covered by WMH. Subplot (a) shows the accuracy of MH versus the k-mer size, with the hashing algorithm indicated by color. Subplot (b) is similar to subplot (a), but for WMH. Subplot (c) shows Spearman's correlation of OMH as a function of k-mer size and tuple length, with the hashing algorithm fixed to CRC32. Subplot (d) is similar to (c), the only difference being that the hashing method is set to MurmurHash3. Subplots (e) and (f) illustrate the average accuracy of TS and TSS, respectively, as a function of the tuple length t.
Footnotes
⁴ The plus-minus signs are shorthand for two equations: one obtained by taking all signs to be +, and another by taking all to be −, but no other combination; for example, sin(a ± b) = sin(a)cos(b) ± sin(b)cos(a). If ∓ appears in the same equation, its order is reversed; for example, cos(a ± b) = cos(a)cos(b) ∓ sin(a)sin(b).