## Abstract

The sharp increase in next-generation sequencing capacity has created a demand for algorithms capable of quickly searching a large corpus of biological sequences. The complexity of biological variability and the magnitude of existing data sets have impeded finding algorithms with guaranteed accuracy that run efficiently in practice. Our main contribution is the Tensor Sketch method, which efficiently and accurately estimates edit distances. In our experiments, Tensor Sketch had a 0.88 Spearman’s rank correlation with the exact edit distance, almost doubling the 0.466 correlation of the closest competitor while running 8.8 times faster. Finally, all sketches can be updated dynamically if the input is a sequence stream, making the method appealing for large-scale applications where data cannot fit into memory.

Conceptually, our approach has three steps: 1) represent sequences as tensors over their sub-sequences, 2) apply tensor sketching that preserves tensor inner products, 3) implicitly compute the sketch. The sub-sequences, which are not necessarily contiguous pieces of the sequence, allow us to outperform *k*-mer-based methods, such as min-hash sketching over a set of *k*-mers. Typically, the number of sub-sequences grows exponentially with the sub-sequence length, introducing both memory and time overheads. We directly address this problem in steps 2 and 3 of our method. While rank-1 and super-symmetric tensors are known to admit efficient sketching, the sub-sequence tensor satisfies neither of these properties. Hence, we propose a new sketching scheme that completely avoids the need for constructing the ambient space.

Our tensor-sketching technique’s main advantages are three-fold: 1) Tensor Sketch has higher accuracy than any of the other assessed sketching methods used in practice. 2) All sketches can be computed in a streaming fashion, leading to significant time and memory savings when there is overlap between input sequences. 3) It is straightforward to extend tensor sketching to different settings leading to efficient methods for related sequence analysis tasks. We view tensor sketching as a framework to tackle a wide range of relevant bioinformatics problems, and we are confident that it can bring significant improvements for applications based on edit distance estimation.

## 1 Introduction

The emergence of next-generation sequencing technologies and a dramatic decrease in cost have led to an exponential increase in biological sequence data, frequently stored in exceptionally large databases. While alignment scores are considered the gold standard of sequence distance in many bioinformatics applications, the growing number of sequences to be analyzed poses serious challenges to exact distance computation via alignment. This problem has motivated research on time- and space-efficient alignment-free methods that estimate rather than compute sequence similarity. Especially applications relying on comparing millions of sequences have turned towards approximate methods over exact alignments [2]. Out of many possibilities, we have selected the task of phylogeny reconstruction to further motivate distance estimation. In this task, the estimated distances between sequences are used to reconstruct the evolutionary history in the form of a phylogenetic tree. Instead of using exact alignment scores, many alignment-free methods rely on *k*-mer statistics as a proxy (where the term *k-mer* refers to a length-*k* substring of a given string). The multiplicity, frequency, mode, and reappearance statistics of *k*-mers have been utilised to directly estimate evolutionary distance [17,16,1]. Other approaches include variable-length matches in their statistics [18].

To break the stringent dependency of adjacent letters in a *k*-mer, *spaced k-mers* have introduced “match” and “ignore” positions. For example, if a match-pattern “11011” is used, then “CTGAC” versus “CTTAC” constitutes a match. Spaced *k*-mers have been shown to improve mapping sensitivity, the accuracy of phylogenies, and the performance of sequence classification [5,7,13,15]. Analogously, substring-based methods can be relaxed to allow for some mismatches [6]. Since the quality of the estimations will greatly depend on the selected match-pattern, several works have focused on optimizing these patterns [3,11,8,10]. However, finding combinatorially optimal patterns becomes intractable as the number of ignore-positions increases. Furthermore, any optimization is prone to be task-dependent, which would require the optimization to be repeated for each task separately.
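For illustration, a spaced match under a pattern such as “11011” can be checked in a few lines (a hypothetical helper written for this example, not code from any of the cited tools):

```python
def spaced_match(a: str, b: str, pattern: str) -> bool:
    """True if a and b agree on every '1' (match) position of the pattern;
    '0' (ignore) positions may differ freely."""
    assert len(a) == len(b) == len(pattern)
    return all(x == y for x, y, p in zip(a, b, pattern) if p == "1")
```

Under the pattern “11011”, “CTGAC” and “CTTAC” match because they differ only at the single ignore position.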

More recently, hashing-based methods have become increasingly popular. MinHash sketch methods have primarily been used for set similarity estimation [2] and have shown promise in fast genomic and metagenomic distance estimation by representing sequences as collections of informative substrings [12]. From the locality sensitive hashing literature we know that any sketch over the *ℓ^{p}*-norm (*p* ∈ (0, 2]) will automatically lead to sub-quadratic nearest neighbor search [4]. This provides further motivation for focusing on sketching the edit distance, leaving behind the inherent problems of seed-based methods. To our knowledge, the most accurate sketching method currently available is *Ordered MinHash (OMH)* [9], which dramatically improves the accuracy of *k*-mer MinHash by considering *k*-mers that are non-adjacent. Therefore, we will compare our approach against the classical *k*-mer MinHash and against Ordered MinHash as the current state-of-the-art.

An especially difficult task for most current methods is the estimation of edit distance for longer, less closely related sequences. To illustrate the main limitations of using *k*-mer statistics in this context, we look at two examples that act as opposing forces in deciding the size of *k*.

*Example 1*. Consider two sequences **x**, **y** drawn randomly over an alphabet of size 4, and let **v**_{x}, **v**_{y} denote the *k*-mer frequency profiles of the two sequences. For long sequences with *N* ≫ 4^{k}, the *k*-mer frequencies will converge to their mean, that is 4^{-k} for all components, which implies that ║**v**_{x} – **v**_{y}║ → 0. Therefore, any *k*-mer profile-based method will severely *underestimate* the distance between these two random sequences. In order to avoid this, *k* has to be restricted to larger values, *k* ≳ log_{4}(*N*).

*Example 2*. Now, consider a second pair of sequences, where **y** is generated by substituting every character of **x** with 8% probability. Plugging the lower bound *k* = log_{4}(*N*) back in, the probability that a *k*-mer does not overlap any substitution equals (1/8)^{log_{4}(N)} = *N*^{-1.5}. Therefore, the expected number of *k*-mers not overlapping any mutation is *N* · *N*^{-1.5} = *N*^{-0.5}, which converges to 0 as *N* grows. Since the sequences share a 92% similarity in global alignment by design, the frequency vectors are guaranteed to *overestimate* the distance.

By combining the insights from these examples, we conclude that any value of *k* will fail with high probability in one of these scenarios for distantly related sequences. Since both results hold for *k*-mer frequency vectors in general, any such statistic on the *k*-mer profile is also bound to fail. This inherent problem with *k*-mers is our main motivation for designing sketches that are more resilient to mutations.
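The *k*-mer frequency profiles used in these examples can be made concrete with a small helper (an illustrative sketch written for this discussion, not the paper's implementation):

```python
from collections import Counter

def kmer_freq_profile(x: str, k: int) -> dict:
    """Normalized k-mer frequency vector of x (the profile v_x of Example 1)."""
    total = len(x) - k + 1
    counts = Counter(x[i:i + k] for i in range(total))
    return {km: c / total for km, c in counts.items()}

def l2_distance(u: dict, v: dict) -> float:
    """Euclidean distance between two sparse frequency profiles."""
    keys = set(u) | set(v)
    return sum((u.get(km, 0.0) - v.get(km, 0.0)) ** 2 for km in keys) ** 0.5
```

For random sequences with *N* ≫ 4^{k}, every entry of such a profile concentrates around 4^{-k}, so the profile distance between unrelated sequences shrinks, as Example 1 argues.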

Conceptually, our method can be seen as expectation over all spaced seeds, without any limits on the number of *ignore* positions. In other words, Tensor Sketching resolves the inherent limitations of any seed-based method by taking an average of all possible seeds, and thereby reducing the risk of selecting a seed that reduces the sensitivity. In statistical terms, this implies that the resulting sketch is a Bayesian admissible estimator.

We will begin by introducing our notation and terminology in Section 2.1 and then proceed to present the Tensor Sketch and Tensor Slide Sketch methods in Sections 2.2 and 2.3, respectively. In Section 3, we first summarize our data generation scheme and then compare our approach to the available state-of-the-art methods. We then discuss applications such as phylogeny reconstruction and graph distance estimation. In the final Section 4, we put our work into context and give an outlook on future work.

## 2 Methods

### 2.1 Preliminaries and Notation

#### Sets

For an integer *N*, [*N*] denotes the integer set [*N*] := {1,…, *N*}. For a finite set *S*, we also consider the set of all permutations of the elements of *S*.

#### Vectors

Throughout the manuscript, boldface fonts are used to distinguish vectors, tensors, and sequences from scalar variables: **a** = (*a*_{1}, *a*_{2}). **1**^{N} and **0**^{N} are the all-ones and all-zeros vectors of length *N*, respectively. We write *i* ≡ *j* (mod *N*) when integers *i* and *j* are equal modulo *N*. The circular *r*-shift shift_{r}(**a**) shifts the elements of **a** to the right by *r* positions, shift_{r}(**a**) = (*a*_{N–r+1},…, *a*_{N}, *a*_{1},…, *a*_{N–r}), i.e., the underlying index mapping circularly shifts indices to the right.

#### Strings

Let 𝒜 denote the alphabet, 𝒜^{N} the strings of length *N*, and 𝒜* the set of all strings. **x** ∘ **y** denotes concatenation, and |**x**| is the length of **x**. *x*_{i} denotes the *i*-th character of **x**, and *x*_{[i:j]} := *x*_{i} … *x*_{j} is a slice from the *i*th to the *j*th index, referred to as a *substring* of **x**, while *x*_{[**i**]} := *x*_{i_1} … *x*_{i_k} is a *k-ary subsequence (or just subsequence)* of **x**, when **i** = (*i*_{1},…, *i*_{k}) is a strictly increasing sequence of indices with 1 ≤ *i*_{1} < ··· < *i*_{k} ≤ |**x**|. For two strings **x**, **y** ∈ 𝒜*, the edit distance d_{ed}(**x**, **y**) denotes the minimum number of edit operations (insertions, deletions, and substitutions) needed to transform one string into the other. It can be defined recursively using **x**′ = *x*_{[2:|x|]} and **y**′ = *y*_{[2:|y|]} as

d_{ed}(**x**, **y**) = min { d_{ed}(**x**′, **y**) + 1, d_{ed}(**x**, **y**′) + 1, d_{ed}(**x**′, **y**′) + 𝟙{*x*_{1} ≠ *y*_{1}} },

with d_{ed}(*ε*, *ε*) := 0 as the recursion basis.
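The edit distance recursion can be implemented directly with memoization (a minimal illustrative sketch; the paper's experiments use an equivalent dynamic program):

```python
from functools import lru_cache

def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    transforming x into y, with basis d(eps, eps) = 0."""
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # i, j are the lengths of the remaining suffixes of x and y.
        if i == 0 or j == 0:
            return i + j  # only insertions or deletions remain
        return min(
            d(i - 1, j) + 1,                     # delete first remaining char of x
            d(i, j - 1) + 1,                     # insert first remaining char of y
            d(i - 1, j - 1) + (x[-i] != y[-j]),  # match or substitute
        )
    return d(len(x), len(y))
```

This runs in O(|x|·|y|) time, which is exactly the quadratic cost that sketching aims to avoid at scale.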

#### Minimum hash sketching

For a sequence **x** ∈ 𝒜^{N}, let *X*_{i} denote the *i*th *k*-mer, *X*_{i} := *x*_{[i:i+k–1]}, let #*X*_{i} denote its occurrence number, #*X*_{i} := #{*j* ≤ *i* : *X*_{j} = *X*_{i}}, and let the *unified k-mer* refer to the pair (*X*_{i}, #*X*_{i}).

_{i}, #X_{i}In MinHash (MH), a random hash is used to sort the *k*-mers, the *k*-mer with the lowest index, with respect to hash functions

Weighted MinHash (WMH) instead draws its hash functions over the set of unified *k*-mers.

Ordered MinHash (OMH) [9] generalizes WMH by sampling multiple *k*-mers that appear in the same order in the sequence. The sampling uses hash functions over the unified *k*-mer space and selects the indices that are mapped to the *t* lowest hash values. Formally, for each hash function *h*_{r}, we sketch (*X*_{σ_1},…, *X*_{σ_t}), where the monotone indices 1 ≤ *σ*_{1} < ··· < *σ*_{t} ≤ |**x**| – *k* + 1 are those mapped to the *t* lowest hash values.

Finally, we can compute the MinHash distance via the *Hamming distance* d_{H}(*X*, *Y*), defined as the number of indices at which two points *X* and *Y* differ, d_{H}(*X*, *Y*) = #{*i* ∈ [*N*] : *X*_{i} ≠ *Y*_{i}}. Therefore, when *φ* is a MinHash sketch, d_{H}(*φ*(**x**), *φ*(**y**)) becomes an estimator of the Jaccard set similarity index. Hence, the different MinHash methods differ only in the choice of the set they sketch. Observe that MH, WMH, and OMH are MinHash sketches over *k*-mers, unified *k*-mers, and *t*-tuples of unified *k*-mers, respectively, which in turn implies the equivalence of OMH with WMH for the parameter choice *t* = 1.
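A minimal MinHash sketch with its Hamming-distance Jaccard estimator can be sketched as follows (an illustrative toy, not the compared implementations; the hash functions are simulated here by salting Python's built-in `hash`, which is seeded per process, so sketches are only comparable within a single run):

```python
import random

def kmers(x: str, k: int) -> set:
    """The set of k-mers (length-k substrings) of x."""
    return {x[i:i + k] for i in range(len(x) - k + 1)}

def minhash_sketch(x: str, k: int, D: int, seed: int = 0) -> list:
    """For each of D salted hash functions, keep the k-mer with the lowest hash."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(D)]
    return [min(kmers(x, k), key=lambda km, s=s: hash((s, km))) for s in salts]

def jaccard_estimate(sx: list, sy: list) -> float:
    """Fraction of agreeing sketch coordinates: 1 - d_H / D."""
    return sum(a == b for a, b in zip(sx, sy)) / len(sx)
```

Identical sequences always yield identical sketches, while sequences sharing no *k*-mer always yield estimate 0.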

### 2.2 Tensor Sketch

First, define the set of all increasing *t*-tuples of indices in [*N*], i.e., tuples (*i*_{1},…, *i*_{t}) ∈ [*N*]^{t} with *i*_{1} < ··· < *i*_{t}.

Without loss of generality, let the alphabet be relabeled to [*m*] := {1,…, *m*}, where *m* is its size. The main idea behind the sequence tensor sketch is counting subsequences as opposed to *k*-mers. Define **T**_{x} to be the probability mass function of a random *t*-ary subsequence of **x**.

With a slight abuse of notation, we treat **T**_{x} both as a probability mass function and as an *m*^{t}-dimensional tensor. A simple yet elegant observation links the tensor representation to Ordered MinHash: the inner product ⟨**T**_{x}, **T**_{y}⟩ equals the number of ordered *t*-ary tuples shared between **x** and **y**, normalized by the total number, which appears as the numerator of the Jaccard index. Therefore, OMH and Tensor Sketch are closely related from a theoretical perspective. For two arbitrary strings **x**, **y**, define the order-*t* similarity *s*^{(t)}(**x**, **y**) and distance *d*^{(t)}(**x**, **y**) as

*s*^{(t)}(**x**, **y**) := ⟨**T**_{x}, **T**_{y}⟩,  *d*^{(t)}(**x**, **y**) := ║**T**_{x} – **T**_{y}║^{2}_{2}.

However, naive computation of these quantities requires *m*^{t} memory and a correspondingly prohibitive running time, which becomes infeasible for even moderate values of these variables. Our tensor sketching scheme provides a (1 + *ϵ*)-approximation for the tensor distance, but requires only time linear in the sequence length to compute. Furthermore, any downstream analysis will operate on sketches of fixed size *D*, regardless of the original sequence length.

Since we are interested in preserving the Euclidean norm after sketching, we follow the definition of tensor sketch by Pham and Pagh [14]. For an integer *D*, tensor sketch maps an *m*^{t}-dimensional tensor into ℝ^{D}. Let us assume a series of pairwise independent hash functions *h*_{1},…, *h*_{t} : [*m*] → [*D*], and sign hash functions *s*_{1},…, *s*_{t} : [*m*] → {–1, +1}. Moreover, define the hash sum *H* and hash sign product *S* for a tuple (*a*_{1},…, *a*_{t}) ∈ [*m*]^{t} as

*H*(*a*_{1},…, *a*_{t}) := *h*_{1}(*a*_{1}) + ··· + *h*_{t}(*a*_{t}) (mod *D*),  *S*(*a*_{1},…, *a*_{t}) := *s*_{1}(*a*_{1}) ··· *s*_{t}(*a*_{t}).

Finally, the tensor sketch *Φ*(*T*) := (*ϕ*_{r}(*T*))_{r∈[D]} is defined as

*ϕ*_{r}(*T*) := Σ_{**a** : H(**a**) ≡ r} *S*(**a**) *T*_{**a**},

where *T* is an arbitrary *m*^{t}-dimensional tensor. Crucially, tensor sketch preserves the Euclidean norm in expectation, 𝔼║*Φ*(*T*)║^{2}_{2} = ║*T*║^{2}_{2}, and its accuracy increases with the sketch size *D*, which leads to a second moment bound on the estimation error that decays as 1/*D* (for proof, see Lemma 7 of Pham and Pagh [14]).

Therefore, a sketch size of *D* = 4*ϵ*^{-2} suffices to bound the multiplicative error by *ϵ* with constant probability, and a sketch size of O(*ϵ*^{-2} log(1/*δ*)) suffices to boost the probability of success to 1 – *δ*, for arbitrary *δ* ∈ (0,1), by taking the median of log(1/*δ*) independent sketches of size 4*ϵ*^{-2}.

If we could compute the sketches *Φ*(**T**_{x}) and *Φ*(**T**_{y}) efficiently, distance computation would merely require O(*D*) time, as opposed to the exponential cost of constructing the ambient space. However, since the tensors **T**_{x} are never rank-1, the conventional efficient tensor sketching schemes cannot be applied here. Therefore, we have designed an efficient algorithm to compute the sketch directly, and provide several extensions of the algorithm in the next section. In summary, for a sequence **x**, the tensor sketch *φ*_{ts}(**x**) is computed by applying the tensor sketch to the tuple tensor, *φ*_{ts}(**x**) := *Φ*(**T**_{x}).

#### Implicit sketching by dynamic programming

First, observe that the *r*th component of the sketch, *ϕ*_{r}(**T**_{x}), can be rewritten based on the probability mass of (*H*(*x*_{[**i**]}), *S*(*x*_{[**i**]})), when **i** is a uniformly drawn ordered *t*-tuple.

We set out to design a dynamic program that computes the probability mass for ordered *t*-tuples based on the probability mass of shorter ordered tuples. Therefore, for *p* ∈ {1,…, *t*}, let **T**_{p} be the probability mass function of (*H*_{p}(*x*_{[**i**]}), *S*_{p}(*x*_{[**i**]})), when **i** is a random ordered *p*-tuple over [*N*], while *H*_{p} and *S*_{p} are the partial hash sum and sign product up to *p*, respectively.

To recursively compute **T**_{p} for *p* ≥ 2, we separate into two cases, depending on whether the last index is part of the tuple (*i*_{p} = *N*, which happens with probability *p*/*N*) or not (*i*_{p} < *N*). In the former case, the last character contributes *h*_{p}(*x*_{N}) to the hash sum and *s*_{p}(*x*_{N}) to the sign product. Viewing **T**_{p} as a pair of vectors **T**^{±}_{p} over *r* ∈ [*D*], the index shift *r* – *r*′ (mod *D*) amounts to a circular *r*′-shift, so the recursion can be written in vector form as

◊ *Recursive insert* (*p* ∈ [*t*]): **T**^{±}_{p} ← (1 – *p*/*i*) **T**^{±}_{p} + (*p*/*i*) shift_{h_p(x_i)}(**T**^{±·s_p(x_i)}_{p–1})

◊ *Basis* (*p* = 0): **T**^{+}_{0} := (1, **0**^{D–1}), **T**^{–}_{0} := **0**^{D}

Interestingly, the recursion relation for the variables at step *i* depends only on the variables of the previous step *i* – 1. Therefore, it is sufficient to store the variables of the current layer, which implies a memory complexity of O(*tD*). This fact is exploited in Algorithm 1, which is why the subscript *x*_{[i]} is dropped from **T**, and ±1 is used for brevity^{4}. The number of sub-problems needed to compute the sketch is exactly 2*NtD*, and the cost of each sub-problem is constant. Therefore, we can compute the sketch with dynamic programming in O(*NtD*) time.
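The dynamic program can be sketched compactly as follows (an illustrative signed-value variant written for this text: characters are assumed pre-mapped to integers in [0, m), the hash functions are explicit tables h[p][c] ∈ [0, D) and s[p][c] ∈ {−1, +1}, and the ± components are folded into signed entries, which yields the same final sketch as tracking them separately):

```python
from itertools import combinations

def tensor_sketch(seq, t, D, h, s):
    """O(N*t*D) sketch of the t-ary subsequence tensor of seq.

    Returns a length-D vector phi, where phi[r] is the signed probability
    mass Pr[H = r, S = +1] - Pr[H = r, S = -1] over a uniformly random
    increasing t-tuple of positions in seq.
    """
    # T[p][r]: signed mass of the partial hash sum being r for a random
    # ordered p-tuple of the prefix seen so far; T[0] is fixed at e_0.
    T = [[0.0] * D for _ in range(t + 1)]
    T[0][0] = 1.0
    for i, c in enumerate(seq, start=1):
        # Descending p, so T[p-1] still holds its value from step i-1.
        for p in range(min(t, i), 0, -1):
            shift, sign = h[p - 1][c], s[p - 1][c]
            T[p] = [
                (1 - p / i) * T[p][r] + (p / i) * sign * T[p - 1][(r - shift) % D]
                for r in range(D)
            ]
    return T[t]

def tensor_sketch_naive(seq, t, D, h, s):
    """Reference implementation enumerating all increasing t-tuples."""
    phi, tuples = [0.0] * D, list(combinations(range(len(seq)), t))
    for idx in tuples:
        r, sign = 0, 1
        for p, j in enumerate(idx):
            r = (r + h[p][seq[j]]) % D
            sign *= s[p][seq[j]]
        phi[r] += sign / len(tuples)
    return phi
```

On small inputs the two agree up to floating-point error, while the recursive version needs only O(*tD*) memory instead of enumerating all tuples.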

### 2.3 Tensor Slide Sketch

In this section, we assume that our task involves sketching sequences that have significant overlap. Instead of applying Algorithm 1 to each sequence separately, we can exploit the fact that many sub-problems of the insertion recursion are shared. This idea is exploited in Algorithm 2, which computes sketches for all *w*-mers of a reference sequence of length *N* with time and memory complexity only *t* times higher than that of Algorithm 1. The length *w* of the sliding window does not appear in any complexity term, which can provide significant savings, considering that *t* is typically much smaller than *w*.

The tensor slide sketch is defined as the sequence of Tensor Sketches *φ*_{ts} computed over a sliding window, the only difference being that successive window sketches are computed incrementally. It is important to note that Tensor Slide Sketch (TSS) only stores a subsample of the window positions: since heavily overlapping windows cannot improve our estimate, we can down-sample the slide sketch to keep it condensed.

## 3 Experimental Results

The primary objective of our tensor sketching framework is to efficiently sketch global alignment scores between sequences. In order to put our method into context, we compare its performance to MH, WMH, and OMH [9], three state-of-the-art sketching techniques (see **Section 2** for formal definitions).

### 3.1 Synthetic data generation

As a basis for evaluation, we generated a set of synthetic sequence pairs over the full spectrum of edit distances using Monte Carlo sampling. We draw a reference sequence **x** at random, and for the desired edit distance *d* ∈ [*N*], we generate **y** by mutating **x** with *d* random edit operations, **x** =: *x*_{0} → *x*_{1} → ⋯ → *x*_{d} =: **y**, where *x*_{i+1} is created by selecting a random index from [|*x*_{i}|] and either deleting, substituting, or inserting a random character at that index. We accept the pair (**x**, **y**) if d_{ed}(**x**, **y**) = *d*, and compute the sketches and sketch-based distances for each method. This process is repeated to create the full data set. In total, we generated 1000 sequence pairs of length *N* = 20000 over an alphabet of size 4. (See **Figure B.1** for a raw presentation of approximate vs. exact distances as a scatter plot.)
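The mutation step of this generation scheme can be sketched as follows (a minimal illustration with function names of our own choosing; the acceptance test d_{ed}(**x**, **y**) = *d* is omitted here):

```python
import random

ALPHABET = "ACGT"

def mutate(x: str, d: int, rng: random.Random) -> str:
    """Apply d random edit operations (delete/substitute/insert) to x."""
    s = list(x)
    for _ in range(d):
        i = rng.randrange(len(s))
        op = rng.choice(("delete", "substitute", "insert"))
        if op == "delete":
            if len(s) > 1:  # keep the sequence non-empty
                del s[i]
        elif op == "substitute":
            s[i] = rng.choice(ALPHABET)
        else:
            s.insert(i, rng.choice(ALPHABET))
    return "".join(s)

def sample_pair(n: int, d: int, seed: int = 0):
    """Draw a random reference x of length n and a d-step mutated partner y."""
    rng = random.Random(seed)
    x = "".join(rng.choice(ALPHABET) for _ in range(n))
    return x, mutate(x, d, rng)
```

Since each operation changes the length by at most one, the resulting pair satisfies d_{ed}(**x**, **y**) ≤ *d*; the rejection step then filters for exact equality.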

### 3.2 Tensor slide sketch achieves high rank correlation with edit distance

For each sketching method, we determined Spearman’s rank correlation with the edit distance, the execution time, and the execution time relative to exact edit distance computation (see Table 1). In particular, Tensor Slide Sketch achieves a correlation of 0.881 while reducing the computation time by 98.6%. The rank correlation of 0 for MinHash reflects the fact that sufficiently long sequences all have MinHash distance 0, as they contain every possible *k*-mer. Furthermore, AUCs were calculated for the classification of sequences based on a threshold on their exact edit distance (normalized by length). For all thresholds, TSS clearly outperforms the other sketches.

We then assessed the performance and execution time of all considered methods on the synthetically generated data set (Figure 1). We observe that the quality of both MinHash (MH) and Weighted MinHash (WMH) greatly diminishes as the sequences become more distant from one another (**Figure 1a**), and that the quality of MH and WMH drops with increasing sequence length (**Figure 1b**). In both plots, the accuracy of OMH and of the tensor sketch methods remains at a much higher level. Notably, the increased accuracy of TS and TSS comes at only a very small additional cost in time when compared with MH (**Figure 1c**), while both methods run much faster than OMH. Lastly, we assessed the relationship between sketching accuracy and the number of embedding dimensions for each method (**Figure 1d**). Remarkably, TSS achieves a rank correlation of 0.74 even when using only 10 embedding dimensions.

In summary, TSS produces the most accurate sketches across the board, while introducing a small time and memory overhead when compared with min-hashing. Furthermore, TS and TSS can be computed in a streaming fashion. For example, in order to sketch all 100-mers of a reference sequence, the time complexity grows only linearly with the length of the reference sequence, while WMH and MH need to be computed separately for each sequence of length m, rendering them m times slower for this task.

While sorting creates an iterative bottleneck for all hash-based schemes, the tensor sketch operation is based on a vector shift and sum, which explains the faster execution in practice. Although we have only reported times for single-thread execution, we can expect tensor sketch to achieve a greater boost from multi-threading, as vector shift and sum can be easily vectorized for parallel execution. Moreover, vector operations can be executed in a single step on some CPU or GPU architectures to achieve an even greater boost. Finally, the hash-based methods store k-mers, or tuples of k-mers as the sketch, which grows with alphabet size and model parameters. In contrast, the tensor sketches can be stored as small integers, namely in a single byte.

### 3.3 Tensor Sketching helps estimate phylogenies

We further explored the task of estimating all pairwise distances within a given set of evolutionarily related sequences, resulting in a well-structured distance matrix. This task is reminiscent of important bioinformatics applications, such as the binning of metagenomic reads or phylogeny reconstruction. In both cases, a good reconstruction of the distance matrix as a whole, and not only of its individual components, forms the basis for a good approximation. While there are several ways to formulate this problem, we can simply visualize the exact and approximate matrices and compare them with an overall metric such as Spearman’s rank correlation. Figure 2 shows such a distance matrix. The sequences were generated to emulate a phylogenetic tree.

### 3.4 Tensor Sketching supports Discretization

While the result of sketching is a rational number, in our applications these numbers were discretized to save space. In fact, one can store the sign bit plus the logarithm to base (1 + *ϵ*), rounded down to an integer *i*, to ensure an *ϵ*-bound on the relative error, as the original number must lie between (1 + *ϵ*)^{i} and (1 + *ϵ*)^{i+1}. We used 256 bins, which implies one byte per dimension. In contrast, MH and WMH store a *k*-mer for each embedding dimension, which takes *k* bits per dimension. As OMH sketches *t* separate *k*-mers, this grows further to *tk* bits.
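The logarithmic discretization can be sketched in a few lines (an illustrative version with our own function names, returning a (sign, exponent) code rather than the packed one-byte encoding):

```python
import math

def discretize(v: float, eps: float = 0.05):
    """Encode v as (sign, i) with |v| in [(1+eps)**i, (1+eps)**(i+1));
    zero gets the special code (0, 0)."""
    if v == 0.0:
        return (0, 0)
    i = math.floor(math.log(abs(v), 1 + eps))
    return (1 if v > 0 else -1, i)

def reconstruct(code, eps: float = 0.05) -> float:
    """Decode back to the lower endpoint of the bin."""
    sign, i = code
    return 0.0 if sign == 0 else sign * (1 + eps) ** i
```

Since |v| lies between (1 + eps)^i and (1 + eps)^{i+1}, the reconstruction error is bounded by a factor of (1 + eps), i.e., a relative error of at most eps.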

### 3.5 Tensor Sketch opens up a wide range of applications

#### Sketching de Bruijn graph distances

The slide-sketch algorithm presented in Algorithm 2 assumes only access to a stream of incoming characters, with the task of sketching substrings of the entire sequence. While this description may sound restricted to a linear reference sequence, the same assumptions actually apply to any Euler path in a string graph. For example, if the input is a de Bruijn graph and the task is to sketch all vertices of the graph, any traversal of the graph that walks through adjacent vertices can be mapped to a stream of incoming and outgoing characters, which fits the assumptions of the slide sketch algorithm. Consequently, the time complexity of sketching the vertices of a de Bruijn graph grows only as a function of the graph’s size, i.e., the number of its vertices, as opposed to the size of each vertex.

#### Robustness to Global Mutations

Tensor Sketch is sensitive to global shifts and transpositions of characters, while *k*-mer statistics are entirely local. To illustrate this, consider case when sequence ** y** is constructed from

**, by swapping the first and second half of the sequence . It is evident that (**

*x**N*–

*k*+ 1) out of

*N k*-mers are intact under this block permutation, implying that for sufficiently large

*N*≫

*k*the

*k*-mer profile of

**will converge to**

*y***. This may severely overestimate their alignment score. In contrast, only of**

*x**t*-ary subsequences remain intact. This greater sensitivity makes tensor sketch an appealing method for applications that such block permutations are likely to occur.
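The *k*-mer half of this argument is easy to sanity-check (an illustrative snippet; the only *k*-mers of **x** destroyed by the half-swap are the at most *k* − 1 that straddle the midpoint):

```python
from collections import Counter

def kmer_profile(x: str, k: int) -> Counter:
    """Multiset of the k-mers of x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

def shared_kmers(x: str, y: str, k: int) -> int:
    """Size of the multiset intersection of the two k-mer profiles."""
    return sum((kmer_profile(x, k) & kmer_profile(y, k)).values())
```

For a half-swapped pair, the shared count stays within *k* − 1 of the total *N* − *k* + 1, so the profiles are nearly identical even though the global alignment is poor.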

## 4 Discussion

We presented tensor sketching, a method for estimating sequence similarity for biological sequences. We demonstrated that Tensor Slide Sketch achieves a high Spearman’s rank correlation, but runs an order of magnitude faster than computing the exact alignment. When compared with other sketching methods, Tensor Sketch and Tensor Slide Sketch both achieve a more accurate preservation of the order between distances than MinHash, Weighted MinHash, and Ordered MinHash.

### Hyper-parameters

The results presented in Section 3 have focused on a fixed set of model parameters. However, our results are fairly robust to the choice of parameters. Namely, varying the *k*-mer size benefits hash-based sketches for small edit distances while hurting their performance on more distant pairs, so Spearman’s rank correlation will not substantially improve. While tensor sketch achieves the best accuracy in our set of experiments, a comprehensive study of different parameter regimes and of the settings under which they are beneficial merits further investigation. In particular, min-hash methods typically show good performance at detecting small distances, as is evident in our AUROC plot in Figure 1. This suggests the possibility of a hybrid version of Tensor Sketch and Ordered MinHash that combines the best features of the two. Moreover, such a study could shed more light on the connections between min-hashing and tensor sketching.

### Bayesian estimation with tensor sketching

Spaced *k*-mers were introduced motivated by applications involving distantly related sequences. If we allow for *i* *ignore* positions, there will be combinatorially many possible patterns. Considering all such patterns is clearly not an option if *i* is anything larger than a small constant. Not only is it highly non-trivial to search for the optimal pattern in this large combinatorial space, but any incremental step of pattern optimization also has to be repeated whenever different tasks require a different set of patterns. Seen from this angle, Tensor Sketch can be interpreted as an average over all possible spaced *t*-mer patterns with (*N* − *t*) ignore positions, while the sketching step avoids the exponential cost of explicitly representing all combinatorial options.

The core idea of tensor sketching, that is, averaging over all possible seed patterns, can alternatively be viewed as Bayesian estimation. This implies that our estimator is admissible, i.e., no other estimator can outperform it across the board. This provides some intuition as to why our approach achieves acceptable or better accuracy over the entire range of edit distances. In other words, the risk of selecting a bad seed or *k*-mer size is minimized by taking the mean over all possible seeds. While this corresponds to a uniform prior over seeds, one can introduce a non-uniform prior by assigning different weights to the vertices of our tensors, for example to penalize a larger number of gaps in the pattern. In this way, we can strengthen or weaken the contribution of individual patterns. A weighted average can help fold background statistics into the distance metric, as well as introduce non-uniform mismatch penalties. These variations and extensions come naturally within the tensor sketching framework, in contrast with hash-based approaches that require heuristic steps.

### Conclusion

This work’s main contributions are the introduction of two tensor-based sketching methods for sequence similarity estimation: Tensor Sketch, with an efficient algorithm to compute it, and Tensor Slide Sketch, a streaming version of the algorithm. Our results indicate that the tensor sketching method and its applications open up exciting research directions for the bioinformatics community to explore. The main advantages of tensor sketching compared with hash-based methods are that it 1) can run in a streaming fashion, 2) achieves much higher accuracy, and 3) runs much faster in practice.

## Acknowledgements

We would like to thank Kjong-Van Lehmann and Ximena Bonilla for their constructive feedback on the manuscript. AJ was funded by ETH Zurich core funding to GR. AK was partially funded by PHRT Project #106 and the Swiss National Science Foundation Grant No. 407540_167331 “Scalable Genome Graph Data Structures for Metagenomics and Genome Annotation” as part of Swiss National Research Programme (NRP) 75 “Big Data”.

## Appendix A Tensor Slide Sketch Algorithm

The main idea is to extend the data structures to dynamically append a character to the string or remove one from its beginning; the procedure is detailed in Algorithm 2. In order to implement an iterative algorithm, one can roll back the recursive calls and execute them in reverse order. Namely, for deletion, the recursive relation for a tuple range depends on strictly longer ranges, implying that updates for longer tuples must precede updates for shorter ones, imposing a partial order corresponding to the tuple length (*q* – *p* + 1), which in turn corresponds to the random subsequence length. For the insertion recursion, sub-problems correspond to shorter subsequence lengths, while for the deletion recursion, sub-problems correspond to longer subsequence lengths, imposing an ascending and a descending dependency on subsequence length, respectively.

For all 1 ≤ *p* ≤ *q* ≤ *t*, let *H*_{p:q} and *S*_{p:q} be the partial hash sum and sign product from *p* up to *q*, respectively, and define **T**_{p:q} to be the probability mass function of (*H*_{p:q}(*x*_{[**i**]}), *S*_{p:q}(*x*_{[**i**]})), when **i** is a uniformly drawn ordered (*q* – *p* + 1)-tuple.

The recursive insertion is analogous to the insertion recursion of Section 2.2, as we can derive **T**_{p:q} based on the previous layer and the smaller sub-problems **T**_{p+1:q}. Recursive deletion is equivalent to rolling back the insertion of the first character. We can derive the pre-deletion value based on **T**_{p:q} and **T**_{p+1:q} by separating into two cases, depending on whether the random tuple starts at the first index or not, where *r*′ := *r* – *h*_{p}(*x*_{1}) and *b*′ := *b* · *s*_{p}(*x*_{1}) are the recursively shifted hash and sign, as before. We can rearrange this identity to compute the post-deletion value from the remaining terms.

Writing these quantities as vectors over *r* ∈ [*D*], we can state the recursion in vector form more concisely as:

◊ *Recursive delete* (*p* ≤ *q*):

◊ *Recursive insert* (*p* ≤ *q*):

◊ *Basis* (*q* < *p*):

## Appendix B More results

Figure B.1 shows each individual pair of sequences as a point in a scatter plot, with solid lines indicating the median and quartiles. As the Spearman’s rank correlation indicates, the ranks of edit distances are better preserved by the two tensor sketching methods, and most of all by Tensor Slide Sketch. This difference is due to the fact that hash-based sketches are mainly sensitive to small edit distances; beyond roughly 0.1 (normalized by length), they quickly plateau.

We can evaluate these methods more quantitatively by requiring them to solve a decision version of the edit distance problem, in which the task is to decide which pairs lie below a certain distance threshold and which do not. The performance of each method can then be depicted by an ROC curve; these are plotted for four different thresholds in Figure B.2, with the data set remaining the same as in Figure B.1. The main takeaway of the ROC plots is that while hash-based methods are effective at very low thresholds, they lose their sensitivity as the task moves away from comparing highly similar sequence pairs. Interestingly, Tensor Slide Sketch performs nearly as well at the lowest thresholds, while significantly outperforming the other methods at higher thresholds.

## Footnotes

↵4 The plus-minus signs are shorthand for two equations: one obtained by taking all signs to be +, and another by taking all to be –, but no other combination; for example, sin(*a* ± *b*) = sin(*a*) cos(*b*) ± sin(*b*) cos(*a*). If ∓ also appears in the equation, its order is reversed accordingly; for example, cos(*a* ± *b*) = cos(*a*) cos(*b*) ∓ sin(*a*) sin(*b*).