Seed-chain-extend alignment is accurate and runs in close to O(m log n) time for similar sequences: a rigorous average-case analysis

Seed-chain-extend with k-mer seeds is a powerful heuristic technique for sequence alignment employed by modern sequence aligners. While effective in practice for both runtime and accuracy, theoretical guarantees on the resulting alignment do not exist for seed-chain-extend. In this work, we give the first rigorous bounds for the efficacy of seed-chain-extend with k-mers in expectation. Assume we are given a random nucleotide sequence of length ~n that is indexed (or seeded) and a mutated substring of length ~m ≤ n with mutation rate θ < 0.206. We prove that we can find a k = Θ(log n) for the k-mer size such that the expected runtime of seed-chain-extend under optimal linear gap cost chaining and quadratic time gap extension is O(m·n^{f(θ)} log n), where f(θ) < 2.43·θ holds as a loose bound. In fact, for a reasonable θ = 0.05, f(θ) < 0.08, indicating nearly quasilinear running time in practice. The alignment also turns out to be good; we prove that more than a 1 − O(1/√m) fraction of the homologous bases are recoverable under an optimal chain. We also show that our bounds work when k-mers are sketched, i.e. only a subset of all k-mers is selected. Under the open syncmer sketching method, one can sketch with decreasing density as a function of n and achieve asymptotically smaller chaining time, yet the same bounds for extension time and recoverability hold. In other words, sketching reduces chaining time without increasing alignment time or decreasing accuracy too much, justifying the effectiveness of sketching as a practical speedup in sequence alignment. We verify our results in simulation and conjecture that f(θ) can be further reduced.


Introduction
Since the earliest years of bioinformatics, one primitive task has been sequence alignment [4,64], which plays a major role in genomic sequencing and phylogenetics. Intuitively, alignment matches together similar parts of two strings, inserting gaps where needed. More formally, alignment is defined by a set of allowed edit operations (e.g. single-character substitutions, insertions, deletions) with associated costs; an alignment is a sequence of those operations transforming one string into another string (or for local alignment, transforming one string into a substring of the other), and the alignment score is the summed cost of those operations [8].
Unfortunately, the best guaranteed algorithms for computing alignment [53,64] are quadratic in time complexity; worse, this bound appears to be tight [5]. Thus, to deal with large volumes of sequencing data [45,7], sequence alignment programs employ heuristics [4,37,44,29,35,40] without performance guarantees [46] for computational efficiency. All these heuristic algorithms are by necessity fast, achieving empirically sub-quadratic runtimes on real problems. They fail on adversarial examples, but aligners perform well in practice because the sequences being aligned are similar and not pathological [43,24,48,69,31].
Phylogenetics often makes use of comparative genomics, where aligning together multiple whole genomes allows annotating the ways in which two species have diverged over evolutionary time [33]. In genomic sequencing, the alignment task manifests because the sequencing machines are only able to read a small portion of a chromosome at a time, producing short snippets known as reads [3]. It is then incumbent on read-mapping software to determine the likely origin location of that read in the genome, and for already sequenced species, this is usually performed by aligning the read to a known reference genome [54].
Historically, different types of heuristics were used for the two tasks, because aligning a very small substring to a longer string is easier than aligning comparable-size strings. Indeed, the seed-and-extend heuristic, as seen in BWA and Bowtie 2, is only applicable to NGS short reads. However, as 3rd-generation long-read sequencing becomes more prominent, the two tasks become more similar and the same heuristics can be used for both [60]. In this manuscript, we address the seed-chain-extend heuristic used in modern software for both read mapping and whole genome alignment.

Our contribution
Our goal in this manuscript is to close the gap between theory and practice, rigorously justifying the heuristics used in some of the most widely used alignment software. To this end, we turn to the methods of average-case analysis [67], which gives us a way of breaking through the quadratic barrier of alignment [47]. Given a random string, we define a substitution model giving rise to a distribution on pairs of inputs and then average our analysis over pairs of strings drawn from this distribution.
Recently, Ganesh and Sy [21] also used probabilistic analysis to show that a heuristic algorithm based on banded alignment can run in O(n log n) time for two length n sequences and is optimal with high probability. However, their banded alignment heuristic has not yet seen any usage in practical software, and their analysis is invalid in the case of read mapping as it only pertains to two nearly equal length strings. Thus we turn instead to the analysis of an empirically battle-tested heuristic for sequence alignment: seed-chain-extend. Seed-chain-extend is a well-established technique for comparative genomics [1,49], and recently the addition of sketching (or subsampling) has made it popular for long-read aligners [58, 11,65] including minimap2 [37], the primary algorithm our model of seed-chain-extend is based on.
We provide, to the best of our knowledge, the first average-case bounds on runtime and optimality for the sketched k-mer seed-chain-extend alignment heuristic under a pairwise mutation model. Our optimality result shows that for large enough k-mer size k, the alignment is mostly constrained to be near the correct diagonal of the alignment matrix, and that runtime is close to linear when the mutation or error rate is reasonably small. We also show that subsampling a Θ(1/log n) fraction of k-mers asymptotically reduces our bounds on chaining time but not extension time. Our results give a theoretical justification for both the empirical accuracy and sub-quadratic runtime of seed-chain-extend. A simplified version of our theorem follows.
Simplified Theorem 1 (Informal main result; no sketching). Suppose we are given a uniformly random DNA string of length n and a mutated substring of length m where each base is substituted with probability θ. If θ < 0.206 and the longer string is already seeded, then we can choose k = Θ(log n) such that the expected runtime of k-mer seed-chain-extend is O(m·n^{f(θ)} log(n)) = O(m·n^{2.43·θ} log(n)), and in expectation ≥ 1 − O(1/√m) of the homologous bases can be recovered from this alignment.
bioRxiv preprint doi: https://doi.org/10.1101/2022.10.14.512303; this version posted October 18, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

We will state our models and definitions precisely in Sections 2 and 6. Our main result, Theorem 1, precisely defines the function f(θ) < 2.43·θ in the exponent, but 2.43·θ is a convenient upper bound. We can quickly see from this bound that for modest mutation rates, n^{2.43·θ} is not too large.
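To make the exponent bound concrete, the following quick calculation (plain Python; the genome-scale value n = 3 × 10^9 is our illustrative choice, not from the analysis) evaluates n^{2.43·θ} for a few mutation rates:

```python
import math

def runtime_exponent_bound(theta):
    """Loose upper bound f(theta) < 2.43 * theta on the exponent of n in
    the expected-runtime bound O(m * n^f(theta) * log n)."""
    return 2.43 * theta

# For a human-genome-scale n and modest mutation rates, the polynomial
# factor n^(2.43 * theta) stays small.
n = 3e9
for theta in (0.01, 0.05, 0.10):
    b = runtime_exponent_bound(theta)
    print(f"theta = {theta:.2f}: n^{b:.4f} ~ {n ** b:.1f}")
```

For example, at θ = 0.05 the loose exponent is 0.1215, so even at n = 3 × 10^9 the polynomial factor is only on the order of tens.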

Mutation model
Let S = x_1 x_2 . . . x_{n+k−1} be a random uniform string with n + k − 1 i.i.d. letters on an alphabet of size σ for some k = Θ(log n). Let S′ = y_{p+1} y_{p+2} . . . y_{p+m+k−1} be a substring of S of length m + k − 1 starting at a fixed position p + 1, with each character independently mutated to a different letter with probability θ. Although notationally a little confusing at first, the k − 1 term ensures that S and S′ contain exactly n and m k-mers respectively; k-mers are in many ways the natural unit of measurement, rather than individual characters. We model only point substitutions here and not indels. Independent substitution models have been considered in theoretical work [9,62,21] and, importantly, have also been demonstrated to be useful empirically [11,55,61]. Also, while genomes can be repetitive, on the level of k-mers a random model has been shown to be reasonable [18].
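The mutation model above can be sketched directly; the snippet below (our illustrative parameters, not the authors' code) samples S and a substring S′ whose characters are each substituted to a different letter with probability θ:

```python
import random

def simulate_pair(n, m, p, k, theta, sigma="ACGT", seed=0):
    """Sample S (length n + k - 1, i.i.d. uniform letters) and
    S' = substring of S of length m + k - 1 starting at 0-based offset p,
    with each character substituted to a *different* letter w.p. theta."""
    rng = random.Random(seed)
    S = [rng.choice(sigma) for _ in range(n + k - 1)]
    S_prime = []
    for ch in S[p : p + m + k - 1]:
        if rng.random() < theta:
            # point substitution: always to a different letter
            ch = rng.choice([c for c in sigma if c != ch])
        S_prime.append(ch)
    return "".join(S), "".join(S_prime)

S, Sp = simulate_pair(n=1000, m=200, p=50, k=10, theta=0.05)
# S contains exactly n k-mers and S' exactly m k-mers.
assert len(S) == 1000 + 10 - 1 and len(Sp) == 200 + 10 - 1
```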

Modelling seed-chain-extend
A brief overview of seed-chain-extend based alignment is as follows: first, a subset of the k-mers in both S and S′ are taken as seeds, and exact seed matches between S and S′, called anchors, are obtained. We only use k-mer seeds in this study, although other types of seeds are possible [28,29]. An optimal increasing subsequence of possibly overlapping anchors based on some score is then collected into a chain, where increasing is defined with the standard precedence relationship [25] between k-mer anchors (see Figure 1a and Chaining below). The chain is extended into a full alignment by aligning between anchor gaps in the chain.
Model overview: our model of seed-chain-extend is primarily inspired by minimap2 with a few key differences. It captures the following steps: seeding the query S ′ , matching the k-mers to obtain anchors, sorting the anchors, chaining, and extending. We assume the reference S has already been seeded and only the query needs seeding, as in the case of read alignment. For comparing two similar length genomes, the seeding time of either genome is comparable, so the asymptotics will be the same for comparative genomics.
Runtimes: non-sketched seeding runtime with a hash table is O(m), whereas sketched seeding runtime is O(mk) (discussed in Section 7). Letting N be a random variable for the number of anchors, matching is O(N +m) by iterating through a hash table, sorting is O(N log N ), and (optimal) chaining is O(N log N ) (see Section 2.3). Extension is the only step with unknown time complexity. It will turn out that extension and N log N are the dominating asymptotic terms, so our goal is to bound these terms in expectation. Empirically, it has been shown that chaining and alignment usually take the most time [27] for read mapping.
Chaining: a chain is a sequence of tuples C = ((i_1, j_1), . . . , (i_u, j_u)) where i_ℓ and j_ℓ are the starting positions of the anchoring k-mers on S and S′ respectively, under the convention that S′ = y_{p+1} . . ., so the k-mer labelled (p + 1) on S′ is actually the first k-mer. The precedence relations i_ℓ > i_{ℓ−1} and j_ℓ > j_{ℓ−1} must hold for all ℓ, and k-mers can overlap. Our chaining score is the L1 or linear gap cost [1,39] of the form u − ζ[(i_u − i_1) + (j_u − j_1)] for ζ > 0, which penalizes long chains built from distant spurious anchors and is necessary for our proofs when n > m. The score is sometimes defined equivalently as u − ζ Σ_{ℓ=2}^{u} [(i_ℓ − i_{ℓ−1}) + (j_ℓ − j_{ℓ−1})], which telescopes to the same quantity. In the language of [1], we let our anchor fragments have length 1, so the k-mers can overlap. While minimap2 v2.22's default chaining score is different and uses a heuristic banded chaining approach, it does use a linear gap cost (without overlaps) in certain situations, e.g. mapping long contigs [38].
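The linear gap cost objective can be made concrete with a simple predecessor DP. The optimal chaining used in the analysis runs in O(N log N) via range queries; the O(N^2) sketch below (ours, for illustration only) optimizes the same telescoped score u − ζ[(i_u − i_1) + (j_u − j_1)]:

```python
def chain_anchors(anchors, zeta):
    """O(N^2) predecessor DP for the linear gap cost objective
    u - zeta * ((i_u - i_1) + (j_u - j_1)). By telescoping, each chained
    step contributes 1 - zeta * ((i_l - i_{l-1}) + (j_l - j_{l-1})),
    so a standard predecessor DP applies. Illustrative only: an
    O(N log N) range-query formulation exists for this score."""
    if not anchors:
        return [], 0.0
    anchors = sorted(anchors)
    n = len(anchors)
    score = [1.0] * n           # best score of a chain ending at anchor l
    back = [-1] * n
    for l in range(n):
        i, j = anchors[l]
        for p in range(l):
            ip, jp = anchors[p]
            if ip < i and jp < j:  # strict precedence; overlaps allowed
                cand = score[p] + 1 - zeta * ((i - ip) + (j - jp))
                if cand > score[l]:
                    score[l], back[l] = cand, p
    end = max(range(n), key=lambda l: score[l])
    best = score[end]
    chain = []
    while end != -1:            # traceback
        chain.append(anchors[end])
        end = back[end]
    return chain[::-1], best
```

On the anchors [(1, 1), (2, 2), (3, 3), (10, 2)] with ζ = 0.1, the DP recovers the diagonal chain and ignores the distant spurious anchor, as the gap penalty intends.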
Extension: we use quadratic time extension between gaps based on any alignment score (e.g. edit distance, affine gap costs [15]) as our optimality criterion only depends on the chain (Section 4). We do not extend past the ends of the chain and do not use banded alignment [12] in this step, unlike minimap2 [37].

Extension and chaining runtimes
Given sorted anchors, let T_Chain be the time spent finding an optimal chain. T_Chain depends on the objective function [51,25,1,56]. Since our gap costs are linear, T_Chain = O(N log N) where N is the number of anchors [1]. For the extension time T_Ext, let (G_1, G′_1), . . . , (G_{u−1}, G′_{u−1}) be the sizes of the gaps for an optimal chain. G_ℓ indicates the length of the substring between the k-mers i_ℓ, i_{ℓ+1} on S, and G′_ℓ similarly for S′; G_ℓ, G′_ℓ can be zero. With quadratic time gap extension, the extension time is T_Ext = Σ_ℓ O((G_ℓ + 1)(G′_ℓ + 1)), which is at most O(mn) in the worst case. We will show that the expected runtime is better than this upper bound, but it serves as a useful worst case. Since S, S′ are random strings, T_Chain, T_Ext, N, and the alignment itself all become random variables. Our goal will be to bound E[T_Chain] and E[T_Ext].
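The gap accounting above can be sketched as follows, assuming one (G_ℓ + 1) × (G′_ℓ + 1) dynamic-programming table per gap (our hedged reading of "quadratic time extension"; the exact constant does not matter for the asymptotics):

```python
def extension_cost(chain, k):
    """Total quadratic-extension cost over the gaps of a chain:
    sum over consecutive anchors of (G_l + 1) * (G'_l + 1), where
    G_l, G'_l are the lengths of the substrings strictly between the
    two k-mers on S and S' (possibly zero). An accounting sketch only."""
    total = 0
    for (i1, j1), (i2, j2) in zip(chain, chain[1:]):
        G = max(i2 - i1 - k, 0)    # bases between the two k-mers on S
        Gp = max(j2 - j1 - k, 0)   # and on S'
        total += (G + 1) * (Gp + 1)
    return total
```

For instance, two anchors at (0, 0) and (5, 7) with k = 3 leave gaps of 2 and 4 bases, costing (2 + 1)(4 + 1) = 15 DP cells.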

Results
First, a few definitions are in order. Recall that |S| = n + k − 1 ~ n and |S′| = m + k − 1 ~ m are our string lengths. We define σ > 1 to be the size of the alphabet and 0 < θ < 1 to be the probability a base mutates. Our theory holds for any alphabet size, but we use σ = 4 for specifying numerical constants. Let log = log_σ with base σ and ln be the natural logarithm. Let k = C log n for a fixed C > 0 so that key quantities can be expressed interchangeably as σ^k = n^C and (1 − θ)^k = n^{−Cα}, where α = − log_σ(1 − θ) > 0. We define the actual goodness of the chain in terms of recoverability in Section 4; it measures the fraction of homologous bases under our mutation model that could potentially be recovered by extending through an optimal chain and only depends on the chain, not the actual alignment.
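These interchangeable forms are easy to sanity-check numerically (the values of n, C, and θ below are our illustrative choices):

```python
import math

def alpha(theta, sigma=4):
    """alpha = -log_sigma(1 - theta) > 0, as in the definitions above."""
    return -math.log(1 - theta, sigma)

# With k = C * log_sigma(n), the key quantities can be written either way:
# sigma^k = n^C and (1 - theta)^k = n^(-C * alpha).
n, C, theta = 10 ** 6, 2.0, 0.05
k = C * math.log(n, 4)
assert math.isclose(4 ** k, n ** C)
assert math.isclose((1 - theta) ** k, n ** (-C * alpha(theta)))

# The main theorem's mutation-rate threshold corresponds to alpha < 1/6,
# i.e. theta < 1 - 4^(-1/6) ~ 0.2063.
assert alpha(0.20) < 1 / 6 < alpha(0.21)
```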
Our proof follows in three main steps. First, we bound the first and second moments of the random variable N, which denotes the number of anchors, implying that chaining is fast. Second, we use concentration inequalities for sums of dependent random variables and exploit the structure of chaining to show that with high probability, an optimal chain does not deviate much from the chain with only "homologous anchors" (Figure 1a). The failure probabilities will be on the order of Θ(1/n), allowing us to also bound everything in expectation. Lastly, we bound the expected runtime of extension through gaps between homologous anchors.
Asymptotically, the chaining runtime is smaller than the extension runtime. However, the implied constants can be much smaller for the extension term [27], so it is practically useful to reduce the runtime of chaining via sketching. Let 0 < 1/c ≤ 1 be the expected fraction of selected k-mers for a sketching method. We use the open syncmer k-mer seeding method [16], which has a useful mathematical property (Theorem 7), giving:

Theorem 2 (Sketched result). In addition to the hypotheses outlined in Theorem 1, let c = O(log n) < k and ζ = 1. For open syncmer sketched seed-chain-extend, the expected running time is O(min((1/c^2)·m·n^{Cα} log^3(n), c·m·n^{Cα} log n)) for extension and O((1/c)·m·n^{−Cα} log m) for chaining. The expected recoverability of any optimal chain is ≥ 1 − O(1/√m), as in the non-sketched case.

This shows that the asymptotic upper bound on extension runtime is the same even if we let c grow with n like c = Θ(log n) < k, leading to the following conclusion: sketching can reduce chaining time without increasing extension time much. Other seeding schemes, e.g. minimizers [63,57,37,14] or FracMinHash [23], behave differently, but our techniques provide intuition and in some cases can be extended (see Section 7).
We validate our results with simulations in Section 8. We show that our expected time bounds predict runtimes well for large enough n. We then show that when sketching with Θ(log n) density, the multiplicative slow-down in extension time plateaus, which is in accordance with our theory. However, the multiplicative speed-up in chaining time increases asymptotically. Thus sketching is extremely effective at producing favourable runtime trade-offs between chaining and extension.
Figure 1: (a) k-mer anchors under mutations and their corresponding alignment matrix. Blue anchors are "homologous anchors" while red are "spurious anchors".

We now define recoverability as the number of homologous bases one can possibly recover by seed-chain-extend; a visual example is shown in Figure 1b. We define Align(C) carefully in the appendix. Note that an optimal chain with respect to our linear gap cost objective u − ζ[(i_u − i_1) + (j_u − j_1)] does not directly optimize for recoverability. We provably find the chain with the optimal linear gap cost, and then we will argue that it leads to good recoverability.
Traditionally, alignment optimality is defined using a generalized edit distance [70]. However, these distances are only used as a proxy for detecting sequence homology [6,68,41]. We know the true underlying sequence ancestry in our model, so defining optimality with respect to sequence homology suits the actual goal of sequence alignment. The reason for the name "recoverability" is that extension could potentially recover all recoverable bases, but this depends on the extension algorithm and is not guaranteed [42].
Under our model, the trivial O(1) alignment that aligns S ′ back to the originating substring of S without indels is the most homologous. This may seem to make our results superfluous; however, remember that the algorithm does not know where S ′ "begins" if m < n. Also, while we do not attempt the case with indels, seed-chain-extend is still valid when indels are present whereas the trivial solution does not allow for indels.
We will lower bound E[R(C)]. To do this, we will work with a more natural object called a break.
Definition 2 (inspired by [21]). We call matching bases and anchors of the form (x, x) homologous and spurious otherwise. Given a chain ((i_1, j_1), . . . , (i_u, j_u)) and a maximal interval [p..q] such that (i_p, j_p), . . . , (i_q, j_q) are all spurious anchors, define the break B as B = [min(i_p, j_p) .. max(i_q, j_q) + k − 1]. Let the length or size of a break be L(B) = max(i_q, j_q) − min(i_p, j_p) + k, and let L(C) = Σ_B L(B) be the total length over all breaks.
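Definition 2 translates directly into code; the sketch below (with a hypothetical per-anchor `homologous` flag, which in our model is known since the true origin of S′ is known) computes L(C):

```python
def total_break_length(chain, homologous, k):
    """L(C) from Definition 2: each maximal run of spurious anchors
    (i_p, j_p), ..., (i_q, j_q) contributes a break of length
    L(B) = max(i_q, j_q) - min(i_p, j_p) + k. `homologous[l]` flags
    whether anchor l is of the homologous form (x, x)."""
    total, run = 0, []
    for anchor, hom in zip(chain, homologous):
        if hom:
            if run:  # close off a maximal run of spurious anchors
                total += max(run[-1]) - min(run[0]) + k
                run = []
        else:
            run.append(anchor)
    if run:          # trailing maximal run of spurious anchors
        total += max(run[-1]) - min(run[0]) + k
    return total
```

For example, a chain with one spurious run from (5, 6) to (8, 9) and k = 3 has a single break of length 9 − 5 + 3 = 7.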
We prove Lemma 1 in the appendix. The concept of a break is illustrated in Figure 1b. Breaks cover non-recoverable regions, so subtracting the breaks from the span of the anchors lower bounds the recoverability.

Fundamental tools and bounds
In this section, we describe some fundamental tools for dealing with pairs of random mutating strings. We first need to give a careful probabilistic exploration of random k-mer anchors on S = x_1 x_2 . . . and S′ = y_{p+1} y_{p+2} . . .. This requires a bit of work due to the dependence between the random strings S and S′. For the rest of the paper, missing proofs can be found in the appendix.
Definition 3. Let M(i, j) be random variables such that M(i, j) = 1 if x_i = y_j and 0 otherwise. Let A(i, j) be the random variable Π_{ℓ=0}^{k−1} M(i + ℓ, j + ℓ). A(i, j) is an indicator random variable for the presence of a k-mer anchor at positions (i, j). We will also refer to the A(i, j) variables as "anchors".
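Definition 3's anchor variables can be enumerated directly with the hash-table matching described in the model overview. This sketch (ours) counts anchors of the homologous form, i.e. a query k-mer matching its originating position (i = p + j in 0-based coordinates), versus spurious ones:

```python
from collections import defaultdict

def count_anchors(S, S_prime, k, p):
    """Count homologous anchors N_H (a query k-mer matching its
    originating 0-based offset p + j in S) and spurious anchors N_S
    (all other exact k-mer matches between S and S')."""
    index = defaultdict(list)          # k-mer -> start positions in S
    for i in range(len(S) - k + 1):
        index[S[i:i + k]].append(i)
    NH = NS = 0
    for j in range(len(S_prime) - k + 1):
        for i in index.get(S_prime[j:j + k], ()):
            if i == p + j:             # matches its homologous origin
                NH += 1
            else:
                NS += 1
    return NH, NS
```

For example, with S = "ACGTACGT", S′ = S[2:6] = "GTAC", p = 2 and k = 2, the three query k-mers yield three homologous anchors and two spurious ones (from the repeated "GT" and "AC").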
We would like for M(i, j), M(i + 1, j + 1), . . . to be independent, so that finding the probability of k-mer matches is easy. However, in our model, it is not actually obvious a priori that M(i, j) and M(i + 1, j + 1) are independent when j ≠ i. Consider M(1, 2) and M(2, 3). The former is a function of x_1, y_2, and the latter of x_2, y_3. However, x_2 and y_2 are dependent in our model, so more work is needed. A convenient graphical representation of independence for the M random variables is the match graph, which we define below.
A match graph is shown in Supplementary Figure 1. The main reason for defining the match graph is the following theorem, which allows us to graphically determine independence of the A variables. We will denote the random variables Σ_{i=p+1}^{p+m} A(i, i) = Σ_i A(i, i) = N_H(n, m) = N_H and Σ_{i≠j} A(i, j) = N_S(n, m) = N_S respectively as the number of homologous anchors and spurious anchors; we will drop the dependence on n, m to simplify notation. These are key random variables that we wish to bound later on. The theorem below follows directly from Corollary 1 by linearity of expectation.

Bounding sums of k-mer random variables
We proceed to bound the distribution of the random variables N_S and N_H, which are sums of dependent random variables. We first bound N_S by computing the second moments and using variance-based bounds.
To do this, we need to examine the independence structure of the A(i, j) random variables.
Lemma 2. For A(i, j) and A(h, l), if both of the following conditions hold: then the induced match graph on the M variables for A(i, j) and A(h, l) has no cycles.
Intuitively, the first condition states that two anchors do not overlap too much, e.g. A(1, 1) and A(2, 2) are not independent when k = 3. The intuition behind the second condition can be illustrated by the following situation where k = 1 and θ ∼ 0: consider the anchors A(1, 5), A(5, 1). Since it is likely that x 1 = y 1 and x 5 = y 5 , if x 1 = y 5 then x 5 = y 1 with high probability, so x 1 = y 5 is not independent of x 5 = y 1 .
Corollary 2. If A(i, j) and A(h, l) satisfy Lemma 2, they are independent.
Proof. By condition (1) in Lemma 2, A(i, j) and A(h, l) do not share any M variables, i.e. A(i, j) = M(i, j)M(i + 1, j + 1) · · · and similarly for A(h, l), but neither product shares a variable with the other. Since the match graph has no cycles under these conditions, by Theorem 3, all M variables are independent, so Pr(A(i, j)A(h, l) = 1) = 1/σ^{2k} = Pr(A(i, j) = 1) Pr(A(h, l) = 1) as desired.
In particular, E[N_S] ≤ m(n − 1)/σ^k. Now we can use the variance bound and Chebyshev's inequality to get the result below. Note the bound uses k = C log n; we will prefer this form for quantities directly used for proving the main result. The label F1 in the theorem refers to the particular event space for which the bound always holds. We will label each proposition that holds with high probability with the event space that we are operating in. We will continue this convention for the rest of the paper, as this will be useful when computing our final bounds.
Lemma 4 (F1). With probability ≥ 1 − 1/n, the number of spurious anchors satisfies N_S ≤ n^{2−C} + √(8mC log(n)) · n^{1−C/2}, where we used m(n − 1)/σ^k < n^{2−C}. If C > 3, then for large n, N_S = 0 with high probability, and our analysis would be easy. But we want C as small as possible. It turns out we can make C ~ 2 for reasonable θ, significantly tightening our bounds.
For N_H we can get a stronger exponential bound because of its independence structure. The A(i, i) variables, which we call homologous anchors, are only dependent in a small neighbourhood around i of size k, because k-mers on non-overlapping substrings are independent. This is called k-dependence (not to be confused with k-independence) and is used in [9] to show N_H is asymptotically normal. Concentration bounds can also be translated to the k-dependent scenario [26].
Theorem 5 (Dependent Chernoff-Hoeffding bound; Corollary 2.4 from Janson [26], reworded and simplified). Suppose we have X = Σ_{a∈A} Bernoulli_a(q) for some 0 < q < 1. A proper cover of A is a family of subsets {A_i}_{i∈I} such that all random variables in each A_i ⊂ A are independent and ∪_{i∈I} A_i = A. Let χ(A) be the minimum size of the cover, |I|, over all possible proper covers. Then for t ≥ 0, Pr(X ≥ E[X] + t) ≤ exp(−2t^2 / (χ(A) · |A|)).

Lemma 5.
Proof. We simply use Theorem 5 with q = (1 − θ)^k. By k-dependence, we can easily see that {A(i, i), A(i + k, i + k), . . .} is a partition satisfying the independence condition, and we will have at most k sets. Thus χ(A) ≤ k, and we're done.

Proof of non-sketched main result
To prove the main result on seed-chain-extend without sketching, we will need to bound three quantities in expectation: the runtime of chaining, the recoverability of chaining, and the runtime of extension.
We first bound O(E[N log N]), the expected runtime of chaining (and also anchor sorting). Now we bound the expected recoverability of the chain. Given S and S′, let a homologous gap of size ℓ + k − 1 bases be an interval of ℓ consecutive k-mers for which no homologous anchors exist (i.e. the k-mers are mutated). In the context of a chain, a homologous gap will refer to a gap flanked by two homologous anchors. Technically, if ℓ consecutive k-mers are uncovered but are flanked by two homologous anchors, this gives ℓ − k + 1 uncovered bases. We will ignore these factors of k as we will show that they are asymptotically small. It turns out homologous gaps grow relatively slowly in n with high probability.
Lemma 6 (F2). With probability ≥ 1 − 1/n, no homologous gap has size greater than g(n) (whose explicit form appears in Lemma 7), plus a small C log n term we will ignore because it is small asymptotically.
In a chain, gaps may also be flanked by one or two spurious anchors. We call these non-homologous gaps. We first bound break lengths, which will imply good recoverability, and bound non-homologous gaps later on.
Lemma 7 (F1 + F2). Take any C > min(3, 2/(1 − 2α)) and let ζ = 1/(6g(n)), where g(n) = Θ(log(n) ln(n) · n^{Cα}) with an explicit constant depending on C. Assume m = Ω(n^{2Cα+ϵ}) for some ϵ > 0. Then for large enough n, there are no breaks of length ≥ m^{1/2} with probability greater than 1 − 2/n in an optimal chain.

Corollary 3. Under the same assumptions as in Lemma 7, the expected recoverability of any optimal chain is ≥ 1 − O(1/√m).

The idea behind proving the above propositions is to work in a space of events F1 ∩ F2 where "bad events" do not occur and any optimal chain is recoverable. Because this space of bad events is small, they do not contribute to our expected value too much. This finishes the recoverability result for the main theorem.
The last step is to bound extension running time, and this comes down to separately bounding the size of the homologous and non-homologous gaps in any optimal chain. We bound the runtime of extension through homologous gaps by directly calculating the expectation through all possible homologous gaps. We then show that the runtime through non-homologous gaps is small and does not contribute to the asymptotic term.

Proof (Theorem 1). The expected runtime of chaining follows from Theorem 6. The recoverability result follows from Corollary 3. The expected runtimes of extension through homologous and non-homologous gaps are both O(m·n^{Cα} log(n)) by Lemmas 8 and 9. This completes the proof as long as we satisfy the assumptions of Lemma 7 and Theorem 6 on C, α. To satisfy the assumptions, we require C > 1/(1 − α), C > min(3, 2/(1 − 2α)), and Cα < 1/2; otherwise m = Ω(n^{2Cα+ϵ}) > n for large enough n. It is not hard to check that the limiting condition is α < 1/6, so we require − log_4(1 − θ) < 1/6. This works out to be θ < 1 − 4^{−1/6} < 0.2063. We can also remove the minimum condition on C because α < 1/6 implies 2/(1 − 2α) < 3.

Sketching and local k-mer selection
Now consider not selecting all of the k-mers in a string, but only a subset of them during the initial seeding step. This allows one to chain only a subset of the k-mers, potentially providing runtime savings.
We use the open syncmer method [16]. Given a string, we take all k-mers of the string and break each k-mer into s-mers with s < k. There are k − s + 1 s-mers in a k-mer. We select or seed the k-mer if the smallest s-mer (subject to some ordering, which we choose as uniform random) is in the ⌈(k − s + 1)/2⌉-th position. The expected fraction of selected k-mers over a string with i.i.d uniform letters is called the density, and it is 1/(k − s + 1) for the open syncmer method (up to a small error term O((k − s + 1)^2 / σ^s) which we will ignore; see [71] or [62]). We will let c be the reciprocal of the density, so c = k − s + 1.
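The selection rule can be sketched as follows (a minimal Python illustration with a lazily sampled uniform random order on s-mers and ties broken leftmost; function and parameter names are ours):

```python
import random

def open_syncmer_positions(seq, k, s, seed=0):
    """Select open syncmers: a k-mer is seeded iff its smallest s-mer
    (under a uniform random order on s-mers, ties broken leftmost) sits
    at the middle position t = ceil((k - s + 1) / 2). A sketch of the
    scheme described in the text."""
    rng = random.Random(seed)
    order = {}
    def rank(x):  # lazily assign a uniform random rank to each s-mer
        if x not in order:
            order[x] = rng.random()
        return order[x]
    t = -(-(k - s + 1) // 2)  # ceil((k - s + 1) / 2), 1-based
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        argmin = min(range(len(smers)), key=lambda j: (rank(smers[j]), j))
        if argmin == t - 1:
            selected.append(i)
    return selected
```

On random strings, roughly a 1/(k − s + 1) fraction of positions is selected, and the spacing property of Theorem 7 (consecutive selections ≥ ⌈(k − s + 1)/2⌉ apart when k − s + 1 is odd) holds for this implementation.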
The original open syncmer definition in [16] had a parameter t where a k-mer was selected if the smallest s-mer was in the t-th position; we proved in [62] that the optimal t is ⌈(k − s + 1)/2⌉ with respect to maximizing the number of conserved bases from k-mer matching. The reason we choose open syncmers is primarily due to the following fact, which was shown in [16]:

Theorem 7 ([16]). Let t = ⌈(k − s + 1)/2⌉. If k − s + 1 is odd, two consecutive open syncmers must have starting positions ≥ t bases apart. If even, they must have starting positions ≥ t − 1 bases apart.
Theorem 7 follows by examining the smallest s-mer in a k-mer and noticing that in the next overlapping k-mer, the locations for the new smallest s-mer are restricted. This theorem is the reason we use open syncmers and is crucial to our proofs. The spacing property makes selected open syncmers a polar set [72]; other methods also give rise to polar sets [19,20], but open syncmers seem to perform well empirically [62,20] and are easy to describe. For the rest of the section, we will assume c = k − s + 1 is odd, so ⌈c/2⌉ = (c + 1)/2. Let A(i, j) be the random variables as defined before. Let J(i) be the indicator random variable for whether the i-th k-mer is selected on S as an open syncmer, and J′(j) similarly for the j-th k-mer on S′. We now wish to calculate E[J(i)J′(j)A(i, j)], the probability that a k-mer match exists and the k-mer is an open syncmer.
Proof. If A(i, j) = 1, then the i-th k-mer on S and the j-th k-mer on S′ are the same. If a k-mer is selected as an open syncmer on S, an identical k-mer must also be selected on S′, since selection depends only on the k-mer's content. A(i, j) and J(i) are independent because we assume the random ordering for the s-mers is independent of the random mutations. Using the law of total expectation and Theorem 4 gives the result for both cases.
Definition 5. We will replace all random variables involving anchors and matches with a superscript * to indicate sketched seeds, e.g.
The above follows directly from Lemma 10. As expected, subsampling to a 1/c fraction of the k-mers gives a 1/c fraction of expected hits. Note the important property of context independence used in the proofs: if A(i, j) = 1, then J(i) = J′(j). This property is not satisfied if one were sampling k-mers randomly or using minimizers [59,62]. We now deduce the sketched moment bounds on N*.
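Context independence is easy to see in code: the open syncmer decision reads only the k-mer itself, so identical k-mers are always co-selected regardless of surrounding context, which is exactly why A(i, j) = 1 forces J(i) = J′(j). A minimal sketch (the `rank` mapping below is a hypothetical stand-in for the random s-mer order):

```python
def is_open_syncmer(kmer, s, rank):
    """Selection decision as a function of the k-mer alone: true iff the
    smallest s-mer (under `rank`, ties broken leftmost) sits at the
    middle position t = ceil((k - s + 1) / 2)."""
    k = len(kmer)
    t = -(-(k - s + 1) // 2)  # 1-based middle position
    smers = [kmer[j:j + s] for j in range(k - s + 1)]
    argmin = min(range(len(smers)), key=lambda j: (rank(smers[j]), j))
    return argmin == t - 1

# A hypothetical fixed order on the s-mers of "ACGTACGT" (k=8, s=5).
rank = {"ACGTA": 0.1, "CGTAC": 0.7, "GTACG": 0.4, "TACGT": 0.9}.get
# Because the decision reads only the k-mer, a k-mer match A(i, j) = 1
# forces J(i) = J'(j): the same decision is made in both strings.
S_context = "TT" + "ACGTACGT" + "GG"
assert is_open_syncmer("ACGTACGT", 5, rank) == is_open_syncmer(S_context[2:10], 5, rank)
```

By contrast, a minimizer decision depends on the neighbouring k-mers in its window, so two identical k-mers in different contexts need not be co-selected.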
The sketched moment bounds allow us to start re-analyzing the crucial propositions in Section 6 in the context of sketching. The first main result is that sketching reduces the chaining time as one would expect. The second result we wish to highlight is the new bound on extension runtime through homologous gaps. It will turn out that in the sketched case, extension over homologous gaps also dominates runtime. Since we sketch with density 1/c, this result states that we can sketch with decreasing density and have the same asymptotic extension time as without sketching, which is surprising. The crucial fact we use in the proof of Lemma 13 is Theorem 7, which shows that open syncmer seeding weakens the k-dependence of the seeds since they are now at least (c + 1)/2 bases apart. This tightens the bound in Theorem 5 and also shows that the sketched maximum homologous gap size is Θ(g(n)), with g(n) as in Lemma 6. If we used other methods such as closed syncmers [16] or FracMinHash [23,22], then we would recover the O(c·m·n^{Cα} log n) result in Lemma 13 but not the O((1/c^2)·m·n^{Cα} log^3(n)) result. Thus, we have saved a log factor by using open syncmers when we let c = Θ(log n). The bulk of Section E in the appendix is dedicated to proving Lemma 13.
The key quantities used for the rest of the proofs are N*_S and g′(n), the new maximum homologous gap size. We will prove g′(n) = Θ(g(n)) with g(n) as in Lemma 6, so the only asymptotic difference in these bounds is a factor of 1/√c in N*_S. Since we want N*_S to be small, this does not affect downstream analysis. It follows that the proofs of the rest of the lemmas in Section 6 can be replicated almost verbatim. We give a sketch of these proofs in the appendix.

Empirical results
We implemented a basic version of sketched seed-chain-extend and tested it on simulated random sequences. For the chaining step, we implemented an AVL-tree-based max-range-query method as described in [50,39]. For the extension step, we used a standard dynamic programming (DP) algorithm implemented in rust-bio [34]. Our code is available at https://github.com/bluenote-1577/basic_seed_chainer/.
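For intuition, chaining under the linear gap cost from Appendix A (chain score u − ζ[(i_u − i_1) + (j_u − j_1)], which telescopes over consecutive anchors) can be written as a simple dynamic program. The quadratic-time reference below is our own simplified sketch, not the AVL-tree O(N log N) implementation used in the experiments, and it assumes strict precedence between chained anchors.

```python
def best_chain_score(anchors, zeta):
    """O(N^2) reference DP for linear-gap-cost chaining.

    anchors: list of (i, j) k-mer start positions; zeta: gap-cost weight.
    For a chain (i_1,j_1),...,(i_u,j_u) the score is
    u - zeta*[(i_u - i_1) + (j_u - j_1)], which telescopes over
    consecutive anchors, so dp[t] = 1 + max(0, max_s dp[s] - gap(s, t)).
    """
    anchors = sorted(anchors)
    dp = []
    for t, (it, jt) in enumerate(anchors):
        best = 0.0
        for s, (i_s, j_s) in enumerate(anchors[:t]):
            if i_s < it and j_s < jt:  # strict precedence
                best = max(best, dp[s] - zeta * ((it - i_s) + (jt - j_s)))
        dp.append(1.0 + best)
    return max(dp) if dp else 0.0
```

The telescoping of the linear cost is what makes the max-range-query speedup possible: the inner maximum can be maintained in a balanced tree keyed by anchor position.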
We let k = C log n where C = 2 1−2α , and do seed-chain-extend on two length n sequences. As k cannot be fractional, we let n k = 4 k C for varying integer values of k. We will set c = k − 6 = C log(n) − 6 which grows with n. The constant 6 is arbitrary but chosen sufficiently large that so s-mers in the open syncmer method are not so small as to make the method degenerate. We found that recoverability was always quite high and that breaks did not occur very often in actual simulations; we show this in Supplementary Figure 9. Thus the primary focus of the empirical results will be runtime. We used 50,000 iterations for every data point.

Accuracy of asymptotic extension runtime predictions
We first empirically investigate our upper bound on expected extension runtime, which is O(n^{1+Cα} log(n)) for both sketched and non-sketched extension when c = Θ(log n). Assuming that the runtime is simply λ n_k^{1+Cα} log(n_k) for some fixed constant λ, we can predict the ratio of the runtimes as n_{k+1}^{1+Cα} log(n_{k+1}) / (n_k^{1+Cα} log(n_k)). Of course, this is incorrect for small n, as smaller terms may dominate runtime, but we expect it to be reasonably accurate for large n. We plot the empirical and predicted ratios of extension runtimes in Figure 2 (a). Figure 2 (a) shows that our upper bound looks reasonable for both sketched and non-sketched cases for θ = 0.10. In Supplementary Figure 8, we show the same plots for θ = 0.05. The empirical results never cross the predicted ratio, but this is not too surprising as the predicted ratio is only approximate. Importantly, the empirical extension runtime ratios slope downwards, which is in agreement with the prediction.
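The predicted ratio is straightforward to evaluate; the helper below (ours, with σ = 4 and C = 2/(1 − 2α) as in our setup) computes n_{k+1}^{1+Cα} log(n_{k+1}) / (n_k^{1+Cα} log(n_k)) for n_k = 4^{k/C}. Since n_{k+1}/n_k = 4^{1/C} and 1/C + α = 1/2 exactly for this choice of C, the ratio simplifies to 2 · (k + 1)/k and decreases toward 2 as k grows.

```python
import math

def predicted_ratio(k, theta, sigma=4):
    """Predicted extension-runtime ratio between consecutive lengths
    n_k = sigma**(k/C), assuming runtime ~ lambda * n^{1+C*alpha} * log n,
    with alpha = -log_sigma(1-theta) and C = 2/(1-2*alpha)."""
    alpha = -math.log(1 - theta, sigma)
    C = 2 / (1 - 2 * alpha)
    n_k = sigma ** (k / C)
    n_k1 = sigma ** ((k + 1) / C)
    return (n_k1 ** (1 + C * alpha) * math.log(n_k1)) / \
           (n_k ** (1 + C * alpha) * math.log(n_k))
```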
Sketching with Θ(1/log n) density gives favourable runtime tradeoffs

One of our key results in Theorem 2 is that the upper bound on asymptotic runtime of extension does not depend on the density 1/c as long as c < k, but the chaining speed-up scales multiplicatively with c. In Figure 2 (b), we plot the multiplicative speed-up and slow-down of chaining and extension runtimes. This figure shows why sketching is so effective in practice with respect to runtimes; the maximal extension slow-down plateaus to ≤ 3 times in this case, whereas the chaining speed-up is linear as the string grows exponentially. Therefore it is worth sketching aggressively to reduce runtime if chaining is slow. One should however still be careful of the sensitivity loss due to sketching in practice [62].
In practice, because extension is heavily optimized [43,17,66] and repetitive k-mers lead to more anchors, chaining can be a bottleneck even though it is asymptotically faster. In fact, we tried using WFA [43], a highly optimized algorithm for extension, instead of the generic DP algorithm, and found extension was ∼60 times faster than chaining without sketching for n ∼ 2,800,000, k = 23, θ = 0.05 in our implementation.
Figure 2: (a) Runtime analysis of extension. The y-axis shows the runtime ratios (with 95% confidence intervals) for iteration i divided by the runtime for iteration i − 1, where the sequence length n_i is plotted on the x-axis; the plot compares the predicted value n_{k+1}^{1+Cα} log(n_{k+1}) / (n_k^{1+Cα} log(n_k)) (proved for large n) against the empirical extension time ratios with and without sketching. As n grows, the ratio seems to asymptotically approach our conjectured bound.

Conclusion
In this work, we rigorously justify the empirical success of the seed-chain-extend heuristic through average-case analysis under a simple mutation model. We showed the alignment is both accurate and fast: a ≥ 1 − O(1/√m) fraction of the sequence homology is recoverable from the chain while running in only O(mn^{Cα} log(n)) time, where for even a moderate mutation rate θ = 0.05, Cα < 0.08. Interestingly, a recent aligner [31] empirically achieved this predicted n^{1.08} runtime for θ = 0.05 on two length-n sequences using similar but distinct techniques. In addition, we also proved that one can sketch to arbitrary densities 1/c where c < k = O(log n) while asymptotically decreasing runtime and without asymptotically decreasing recoverability, justifying the effectiveness of sketching. Because our set-up is modeled on techniques used by practical software such as minimap2, our results provide a theoretical backing for why modern sequence alignment software performs so well in practice.
In terms of improving our bounds, it seems unlikely that the expected runtime is truly quasilinear, due to the fundamental quantity (1 − θ)^{C log n} = n^{−Cα}, the likelihood of a k-mer match, decreasing faster than logarithmically in n. In spite of this, we believe there can be significant improvements on our bounds in Theorems 1 and 2. In the term n^{Cα}, C is required to be > 2/(1 − 2α), which seems unsatisfactory. This is because our analysis of spurious anchors relied on a relatively weak variance bound. The variance bound also leads to a 1 − O(1/√m) bound for recoverability, which we also believe can be tightened. To surpass these bounds, a deeper understanding of spurious anchor subsequences must be developed. Tighter analysis of spurious anchors would also allow repeats, a major source of complexity in real genomes [32], to be modelled more accurately. The problem of spurious anchor subsequences is very similar to that of common subsequences in random strings over extremely large alphabets, a topic that has received much theoretical attention [13,30,36,52]. It seems to us that this is only a first step in exploring the probabilistic theory of seed-chain-extend, and there are many more interesting problems to tackle.

A Mathematical conventions and reminders
We list a few reminders and conventions that will be useful for the technical parts of our work. ≫ will mean asymptotically dominates, i.e. f(n) ≫ g(n) if and only if lim_{n→∞} g(n)/f(n) = 0. |S| ∼ n and |S′| ∼ m, where S, S′ are our pair of strings and ∼ ignores small factors of k present. This is because k = C log n for a constant C > 0 and m = Ω(n^{2Cα+ϵ}) for the assumptions of our main theorems, and these terms dominate k = O(log n). (1 − θ)^k = n^{−Cα}, where α = − log_σ(1 − θ) > 0 is a function of 0 < θ < 1, the mutation rate, and σ > 1 is our alphabet size. Our results are general, but we will use σ = 4 for specific numerical results. When we use the assumption C > 2/(1 − 2α), C is also greater than 2, a useful fact we will use repeatedly. A chain C is a sequence of anchors (exact matches of k-mers), represented by tuples ((i_1, j_1), . . . , (i_u, j_u)) where i_ℓ is the starting position of the ℓth k-mer on S and j_ℓ is the starting position of the ℓth k-mer on S′. Anchors can overlap. The score for a chain C is u − ζ[(i_u − i_1) + (j_u − j_1)], where ζ = ζ(n) > 0 will eventually be considered as a (decreasing) function of n.
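These conventions can be sanity-checked numerically; the tiny helper below (ours) computes Cα = 2α/(1 − 2α) for the choice C = 2/(1 − 2α), which is the exponent f(θ) quoted in the abstract (f(0.05) < 0.08 and f(θ) < 2.43 · θ for θ < 0.206):

```python
import math

def c_alpha(theta, sigma=4):
    """C*alpha for C = 2/(1-2*alpha), where alpha = -log_sigma(1-theta).
    This equals 2*alpha/(1-2*alpha), the exponent f(theta) in the
    runtime bound O(m * n^{C*alpha} * log n)."""
    alpha = -math.log(1 - theta, sigma)
    assert 1 - 2 * alpha > 0, "theta too large: need alpha < 1/2"
    return 2 * alpha / (1 - 2 * alpha)
```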

B Missing proofs and definitions from Section 4
We note that technically, the alignment matrix in Definition 1 is slightly different from the standard dynamic programming matrix [15], which can be thought of as a directed graph with a path representing an alignment. Our representation does not include the graphical information; it only captures information pertaining to the possibility of matching bases, which is sufficient for us.

B.1 Recoverability and definition of Align(C)
Figure 1b will serve as a helpful guide for our definitions. Let C be a chain of anchors ((i_1, j_1), . . . , (i_u, j_u)). Below we carefully define recoverability and set up our proof of Lemma 1.
A chain gives a set of k-mer matches and a set of possible alignments by extending through bases, constraining the full alignment matrix between S and S′. We can formalize this as follows. Given two consecutive anchors (i_ℓ, j_ℓ) and (i_{ℓ+1}, j_{ℓ+1}), extending allows for possible matches between the gaps given by the following set of possible matches:

Ext(ℓ) = {(x, y) : i_ℓ + k ≤ x ≤ i_{ℓ+1} − 1, j_ℓ + k ≤ y ≤ j_{ℓ+1} − 1}.

The factors of +k and −1 are because i_ℓ is the start of the ℓth anchor, and extension starts after the end of the first k-mer and goes until 1 base before the start of the second k-mer. Ext(ℓ) corresponds to the green boxes in Figure 1b. The set Ext(ℓ) is empty if i_ℓ + k > i_{ℓ+1} − 1, and similarly for j, i.e. there are no gaps between possibly overlapping anchors. Align(C) also takes into account the k-mer matches given by C. Thus we define Align(C) formally as

Align(C) = [⋃_{ℓ=1}^{u} {(i_ℓ + t, j_ℓ + t) : t ∈ [0..k − 1]}] ∪ [⋃_{ℓ=1}^{u−1} Ext(ℓ)],

where the first term corresponds to the k-mer matches in the alignment matrix.
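As a concrete rendering of these sets (our own sketch, using the index conventions just described), the cells of Ext(ℓ) and Align(C) can be enumerated directly:

```python
def ext(anchor1, anchor2, k):
    """Ext(l): matrix cells reachable by extension between two
    consecutive anchors; empty if the anchors (possibly) overlap."""
    (i1, j1), (i2, j2) = anchor1, anchor2
    return {(x, y) for x in range(i1 + k, i2) for y in range(j1 + k, j2)}

def align(chain, k):
    """Align(C): the k-mer match cells of every anchor, plus the
    extension cells between each pair of consecutive anchors."""
    cells = {(i + t, j + t) for (i, j) in chain for t in range(k)}
    for a, b in zip(chain, chain[1:]):
        cells |= ext(a, b, k)
    return cells
```

For example, with k = 3 and the chain ((0, 0), (5, 6)), the extension cells are {3, 4} × {3, 4, 5}, and overlapping anchors contribute no extension cells at all.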

B.2 Proof of Lemma 1
Since we care about the intersection of Align(C) with the homologous diagonal, we can manipulate the expression for Align(C) to get something more tractable. Let Diag[a..b] = {(x, x) : x ∈ [a..b]}, C_H be the set of homologous anchors in C, and C_S be the set of spurious anchors. The homologous anchors give rise to the homologous matches M_H in the alignment matrix. The spurious anchors give no matches on the homologous diagonal, so they contribute nothing to the Align(C) ∩ D_H term. Therefore, NR represents the parts of the diagonal in Figure 1b that are not recoverable or accessible by extension through anchors, although it may technically intersect Align(C) in our mathematical definition. The following identity then holds after rewriting M_H, because the min-max condition is redundant for homologous anchors where i_ℓ = j_ℓ. We can fill in the gaps between M_H ∪ NR along the diagonal with Ext(ℓ), after noticing that the union over all of these sets covers the entire diagonal between the first and last anchor. After bounding the diagonal to lie within D_H and removing NR from both sides of the equation, we obtain the following result.
Proof. Using the supplementary lemma above, we note that j_u ≤ p + m and j_1 ≥ p + 1 since the anchors must lie between [p + 1..p + m]; after ignoring the k term, this gives the first claim. We now prove that the breaks "cover" NR, so |NR| ≤ L where L is the total length of the breaks, proving the result. Let π(NR) be the projection of the set onto one of the coordinate axes (it doesn't matter which one). Let x ∈ π(NR). Then for some spurious anchor (i, j), min(i, j) ≤ x ≤ max(i, j) + k − 1. But every spurious anchor is contained in exactly one break, since breaks partition the set of spurious anchors. This break B = [a..b + k − 1] ⊃ [min(i, j).. max(i, j) + k − 1], since the break takes the minimum and maximum coordinates over all spurious anchors in the break. Thus x ∈ B for some break B, and the union of the breaks B in C contains π(NR). Hence L(C) ≥ |π(NR)| = |NR|, and we're done.
First, we claim that G_1, . . . , G_q, considered as random letters in S and S′, are mutually independent (without considering additional M variables). Indeed, each G_i can be considered as just a set of pairs of (x_ℓ, y_ℓ) random variables, because the edges (x_ℓ, y_ℓ) always exist in the original graph and so the connected component must leave no x_i or y_i unpaired. All pairs of (x_ℓ, y_ℓ) random variables are mutually independent by definition of our original mutation model. Thus all G_i s are mutually independent. Now we consider the E_ℓ s. The random variables in E_ℓ are functions of G_ℓ, the letters in the connected component. It follows that the E_ℓ s are functions of mutually independent G_ℓ s, and are themselves mutually independent. Now let x ∈ G_x and y ∈ G_y be two letters in different connected components. x, E_x are both functions of G_x, and similarly for y, E_y, and G_y. Thus Pr(x, y, E_1, . . . , E_q) = Pr(x, E_x) Pr(y, E_y) ∏_{ℓ ≠ x,y} Pr(E_ℓ), which shows conditional independence of x, y with respect to {E_1, . . . , E_q} = M. Since G(M) has no cycles, x_α and y_β lie on two different connected components of G(M⁻) after removing the edge induced by M(α, β) (we use the α ≠ β assumption here). Therefore x_α, y_β are conditionally independent by the above Supplementary Lemma 2, so we can write the above sum as ∑_{a∈Σ} Pr(x_α = a | M⁻) Pr(y_β = a | M⁻).
We claim that both terms in the sum are 1/σ. To show this, let C_σ be a cyclic permutation of order σ on letters in Σ (e.g. C_4 sends A → C → T → G → A) and apply it to every letter on both strings when we write C_σ(S, S′). It's not hard to see that this is a measure (or probability) preserving transformation for Pr(· | M⁻), because it preserves matches between S and S′ and letters are distributed uniformly.
Let A be the set of strings (S, S′) for which x_α = a; we can see that Pr(x_α = a | M⁻) = Pr(A | M⁻). Now apply C_σ σ times and use measure preservation. All C_σ^n(A), n = 1, . . . , σ − 1, together with A, are clearly disjoint and partition the space of strings. Thus, we see that Pr(x_α = a | M⁻) = 1/σ as desired. The argument works exactly the same for Pr(y_β = a | M⁻), so Pr(y_β = a | M⁻) = 1/σ, and our induction step is complete.
C.2 Proof of Corollary 1

If the match graph induced by A(i, j), i.e. induced by M(i, j), M(i+1, j+1), . . . , M(i+k−1, j+k−1), has no cycles, using Theorem 3 gives the result. This is easy to see from drawing out the match graph, which has edges (x_ℓ, y_ℓ) for all ℓ and (x_i, y_j), . . . , (x_{i+k−1}, y_{j+k−1}), but we give a rigorous argument below. See Supplementary Figure 1 for an example of such a match graph.
First, suppose j > i and suppose x_{i+ℓ} is the vertex with a cycle and with the smallest possible ℓ ∈ [0..k−1]. We can assume the cycle touches some x because the match graph is bipartite. Since x_{i+ℓ} has degree at most two and therefore equal to two, we can traverse the edge (x_{i+ℓ}, y_{i+ℓ}) as the first edge in the cycle. If y_{i+ℓ} has degree one, then x_{i+ℓ} has no cycle, so suppose the degree is two. Then we have an edge (y_{i+ℓ}, x_{i+α}) which is in the cycle. However, this implies x_{i+α} also has a cycle and α < ℓ because j > i; the edges from x_i to y_j have increasing positions. This creates a contradiction as ℓ was the smallest index with a cycle, so no cycles exist. Now if we have j < i, we repeat the same argument but with ℓ the largest such index.

Lemma 2 For A(i, j) and A(h, l), if both of the following conditions hold: 1. |i − h| ≥ k or |j − l| ≥ k, and 2. |i − l| ≥ k or |j − h| ≥ k, then the induced match graph on the M variables for A(i, j) and A(h, l) has no cycles.
The intuition for the conditions of Lemma 2 is that the first condition prevents both anchors from overlapping too much, while the second condition prevents the case where, for example, the two variables M(1, 5) and M(5, 1) cause a cycle to form in the match graph via x_1 → y_5 → x_5 → y_1 → x_1.
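These acyclicity claims are easy to check mechanically. The union-find sketch below (our own illustration) builds the match graph from the homologous pair edges (x_t, y_t) plus anchor-induced edges, and reports whether any added edge closes a cycle; it confirms that a single anchor A(i, j) with i ≠ j induces no cycle, while the M(1, 5), M(5, 1) example above does.

```python
def has_cycle(num, match_edges):
    """Cycle check on the bipartite match graph with nodes x_0..x_{num-1}
    (ids 0..num-1) and y_0..y_{num-1} (ids num..2*num-1), the homologous
    pair edges (x_t, y_t), and extra anchor-induced edges (i, j) meaning
    (x_i, y_j). Uses union-find; a parallel edge counts as a 2-cycle."""
    parent = list(range(2 * num))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    edges = [(t, num + t) for t in range(num)] + \
            [(i, num + j) for (i, j) in match_edges]
    seen = set()
    for a, b in edges:
        if (a, b) in seen:
            return True
        seen.add((a, b))
        ra, rb = find(a), find(b)
        if ra == rb:
            return True
        parent[ra] = rb
    return False
```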
Proof. If |i − h| ≥ k or |j − l| ≥ k, then without loss of generality, assume |i − h| ≥ k. Then the sets X_i = {x_i, . . . , x_{i+k−1}} and X_h = {x_h, . . . , x_{h+k−1}} are disjoint. See Supplementary Figure 2 for a pictorial representation. It will turn out that if instead |j − l| ≥ k, then the xs become ys and the argument doesn't change.
Now we claim two facts.
1. The degree of every x ∈ X_i or X_h must be at most two: each A variable induces one new edge for the xs covered by some k-mer, but these are disjoint sets of k-mers. 2. If a cycle exists, the cycle must touch some x ∈ X_i and also an x′ ∈ X_h: if it did not, a cycle must exist for some x ∈ X_i or X_h (the graph is bipartite, so it must include an x) and would not use induced edges on the match graph for one of A(i, j) or A(h, l). This implies the match graph for just one of A(i, j) or A(h, l) would have a cycle, which is impossible by the proof of Corollary 1.
To complete the proof, we proceed with an argument similar to the proof of Corollary 1. We need four different cases corresponding to either |i − l| ≥ k or |h − j| ≥ k and our previous assumption of |i − h| ≥ k or |j − l| ≥ k. We prove only one case but it is easy to translate the argument to the other three.
Let's assume |i − l| ≥ k. By the second fact above, if a cycle were present, we can assume it touches some x_ℓ ∈ X_i. If j > i, let x_ℓ be the leftmost x_ℓ ∈ X_i with a cycle, and if j < i, let x_ℓ be the rightmost. Because the degree of x_ℓ is at most two (by the first fact above) and thus must be exactly two, we can assume the cycle starts with the edge (x_ℓ, y_ℓ). Now, y_ℓ has degree at most two since |i − l| ≥ k, i.e. the k-mer starting at l on S′ doesn't overlap y_ℓ; therefore the only other edge adjacent to y_ℓ comes from some x_α ∈ X_i. Therefore x_α is either to the left of x_ℓ if j > i or to the right if j < i, and must also have a cycle. However, because x_ℓ was the leftmost or rightmost such x, this is impossible, so we are done.
If |j − l| ≥ k, we use ys instead of xs and we switch "rightmost" with "leftmost". If we have |j − h| ≥ k, we switch X_i or Y_i with X_h or Y_h instead. Repeating the same argument verbatim gives the conclusion.

C.4 Proof of Lemma 3
Thus the variance Var(N_S) can be upper bounded by 8k²mn/σ^k.
Proof. We will frequently use (m − 1) < m and (n − 1) < n to simplify the bounds. Let B_k(i, j) = {(h, l) : |h − i| < k and |l − j| < k}. We can separate the sum into three different parts, so that N_S² = S_1 + S_2 + S_3. Let us bound the expectation of each sum separately. Firstly, we get E[A(i, j)A(h, l)] = σ^{−2k} when (h, l) is not in the set B_k(i, j) ∪ B_k(j, i) by Corollary 2, and there are at most m²(n − 1)² possible A(h, l)A(i, j) tuples.
The variance of N_H was calculated in [9] for the variable N_mut = n − N_H, but we can use a simpler approximate bound.
Notice that if |i − j| < k then there is obviously dependence between A(i, i) and A(j, j) because the anchors have overlapping bases, but otherwise the variables are independent. The only indices where dependence may be an issue are when either |h − i| < k or |l − i| < k. Thus for each pair h, l, there are at most 4k choices for i which may not be independent. We can use the same idea as the previous bound to bound the resulting sum.

Proof. For all x > 0, we have the inequality ln(x/u) ≤ x/u − 1 for any u > 0. Substituting this into N ln N gives N ln N ≤ N²/u + N(ln u − 1). For the moments of N = N_S + N_H, we can just use the moment bounds in Lemma 3. Plugging in the bounds yields

E[N²] ≤ (8k²mn^{1−C} + m²n^{2−2C}) + (2kmn^{−Cα} + m²n^{−2Cα}) + (8kmn^{1−C} + 2m²n^{1−Cα−C}).

To simplify these terms, notice that n^{−Cα} ≫ n^{1−C} and mn^{−Cα} ≫ k², where ≫ means asymptotically dominates. The first relation is because −Cα > 1 − C follows from our assumption C > 1/(1 − α). The second is because m = Ω(n^{2Cα+ϵ}), so mn^{−Cα} ≫ n^ϵ ≫ k² = O(log²(n)).
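The first moment driving these bounds is E[N_S] = (m − k + 1)(n − k + 1)/σ^k ≈ mn/σ^k for two independent random strings, which is easy to sanity-check by simulation (a toy-parameter sketch of ours; the paper's regime has k = C log n):

```python
import random
from collections import Counter

def count_anchors(n, m, k, rng):
    """Count k-mer matches (anchors) between two independent random DNA
    strings; the expectation is (n-k+1)*(m-k+1)/4**k."""
    S = ''.join(rng.choice('ACGT') for _ in range(n))
    T = ''.join(rng.choice('ACGT') for _ in range(m))
    c1 = Counter(S[i:i + k] for i in range(n - k + 1))
    c2 = Counter(T[j:j + k] for j in range(m - k + 1))
    return sum(c1[w] * c2[w] for w in c1)
```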

D.2 Proof of Lemma 6
Supplementary Lemma 3. For any interval consisting of ℓ k-mers on S′, the probability that all ℓ homologous anchors are 0 is upper bounded by the tail bound of Lemma 5.

Proof. The letters on an interval on S, S′ are distributed identically to a length ℓ + k − 1 version of S, S′. We can then use Lemma 5 with t = m(1 − θ)^k.
Lemma 6 (F2) With probability ≥ 1 − 1/n, no homologous gap has size greater than (50/8) · k(1 − θ)^{−k} ln(n), plus a small C log n term we will ignore because it is small asymptotically.
Proof. Using the above supplementary lemma, let ℓ = (50/8) · k(1 − θ)^{−k} ln(n). Then the probability that there are no homologous anchors in a segment of ℓ k-mers is ≤ 1/n². Let HG_1, . . . , HG_{m−ℓ+1} be indicator random variables where HG_i = 1 if the next ℓ k-mers from position i have no homologous anchors and 0 otherwise. It follows that E[∑ HG_i] ≤ m/n² ≤ 1/n. Using Markov's inequality, we see that Pr(∑ HG_i ≥ 1) ≤ 1/n. Thus with probability ≥ 1 − 1/n, no homologous gap is larger than (50/8) · k(1 − θ)^{−k} ln(n) k-mers, as desired. ℓ k-mers correspond to a homologous gap of size ℓ − k + 1, so we need to add k − 1 to get an upper bound on the homologous gap size, but we will ignore this in the analysis because it is small asymptotically.
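Plugging k = C log_σ n into the Lemma 6 bound makes the g(n) = Θ(n^{Cα} log²(n)) scaling explicit; the helper below (ours) evaluates ℓ = (50/8) · k(1 − θ)^{−k} ln(n), and the ratio to n^{Cα} log_σ²(n) comes out to the constant (50/8) · C · ln σ.

```python
import math

def gap_bound(n, theta, sigma=4):
    """Lemma 6 homologous gap bound (in k-mers):
    ell = (50/8) * k * (1-theta)**(-k) * ln(n), with k = C*log_sigma(n),
    C = 2/(1-2*alpha), and alpha = -log_sigma(1-theta)."""
    alpha = -math.log(1 - theta, sigma)
    C = 2 / (1 - 2 * alpha)
    k = C * math.log(n, sigma)
    return (50 / 8) * k * (1 - theta) ** (-k) * math.log(n)
```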
If we assume C > 3, then Lemma 4 shows that with probability ≥ 1 − 1/n and large enough n, no spurious anchors exist at all. Of course, no breaks can occur so we are already done in this case. The rest of the section is for tackling the case C ≤ 3. We will prove a series of supplementary lemmas, and then prove Lemma 7.
We will now assume the hypotheses of Lemma 7 for the rest of this section. That is, 3 ≥ C > 2/(1 − 2α), m = Ω(n^{2Cα+ϵ}) for some ϵ > 0, and ζ = 1/(6g(n)) as in the statement of Lemma 7. We will not be too careful with small additive constants of order O(log n) due to indexing offsets from now on. For example, a length m^{1/2} interval of bases technically contains m^{1/2} − k + 1 k-mers, but since we work with asymptotics we'll treat this as ∼ m^{1/2}.
Proof. We will show that no chain of only spurious anchors can be optimal, implying that any optimal chain must have at least one homologous anchor. By F2, there are no homologous gaps of length ≥ g(n) where g(n) = Θ(n^{Cα} log²(n)), so we can lower bound the number of homologous anchors N_H under the event space F2 by N_H = Ω(m / (n^{Cα} log²(n))).
Let score(N_H) be the score of the chain with only homologous anchors. Then score(N_H) ≥ N_H − 2ζm, because the chain of only homologous anchors has a linear cost of at most 2ζm. Now consider a chain without homologous anchors, i.e. only spurious anchors. Such a chain has maximum score score(N_S) ≤ N_S − 2N_Sζ, because there are at most N_S such anchors, and the N_S in the linear cost assumes the smallest possible linear cost, where there are no gaps between the anchors.
We therefore want the asymptotic inequality score(N_S) ≪ score(N_H) to hold. This holds when 1 − C/2 + Cα < 0, which is equivalent to our assumed condition C > 2/(1 − 2α); thus the chain of only homologous anchors has asymptotically greater score than any chain with only spurious anchors. Therefore any optimal chain with high probability must contain at least one homologous anchor for large enough n.

Supplementary Figure 3: Circles are bases, grey circles are mutated bases, and boxes are k-mers. The red spurious anchor indicates a break. We can always remove a break flanked by two homologous anchors and then add in homologous anchors that may be present within the break. This idea is used in the proof of Supplementary Lemma 5 to show that for large enough breaks, such a procedure always improves the chaining score.

Supplementary Lemma 5 (F1 + F2). Suppose a break is flanked by two homologous anchors in an optimal chain. That is, homologous anchors exist on both sides of the break in the chain. Then with probability ≥ 1 − 2/n and large enough n, this break has size < m^{1/2}.
Proof. Suppose a break of length ≥ m^{1/2} is flanked by two homologous anchors. Then for the given chain ((i_1, j_1), . . . , (i_u, j_u)), the break occurs somewhere in the middle, say ((i_r, j_r), . . . , (i_t, j_t)) where r > 1, t < u. Let us construct a new chain by removing all spurious anchors ((i_r, j_r), . . . , (i_t, j_t)) within our chain that lie in this break, and then adding all homologous anchors that are present within the break to the chain. See Supplementary Figure 3.
This does not change the gap cost part of the chaining score, which is still ζ[(i_u − i_1) + (j_u − j_1)], so if there are more homologous anchors after the switch, then this switch is more optimal. Assuming condition F2, the number of homologous anchors contained within a break is lower bounded by m^{1/2}/g(n). Assuming condition F1, the number of spurious anchors, and therefore the size of the break, is upper bounded by N_S = Θ(log(n) m^{1/2} n^{1−C/2}). Notice that m^{1/2}/g(n) ≫ Θ(log(n) m^{1/2} n^{1−C/2}) = N_S again exactly when 1 − C/2 + Cα < 0, which is equivalent to C > 2/(1 − 2α). Thus under the event space F1 ∩ F2, switching to the homologous k-mers always improves our optimal chain for large enough n.
Supplementary Lemma 6 (F1 + F2). Suppose a break is flanked by a single homologous anchor in an optimal chain. Then with probability ≥ 1 − 2/n and large enough n, this break has size < m^{1/2}.

Proof. It follows easily from the definition of a break that a break flanked by one homologous anchor must occur at the start or end of the chain ((i_1, j_1), . . . , (i_u, j_u)). Let us assume that such a break occurs at the end of the chain at anchors ((i_r, j_r), . . . , (i_u, j_u)); the same argument works for the break occurring at the beginning of the chain.
We use a similar argument as above, but this time we must take into account the linear gap cost. Furthermore, instead of switching to the corresponding homologous anchors, we just remove the break from the chain and check whether that improves the score. See Supplementary Figure 4.
The score of the old chain is A − ζ[(i_u − i_{r−1}) + (j_u − j_{r−1})] + w, where A is the score of the subchain of anchors prior to r, and w is the number of anchors in the break. This is upper bounded by A − ζ(B − k) + N_S, where N_S = Θ(log(n) m^{1/2} n^{1−C/2}) as before and B is the length of the break. This holds because the length of the break minus k is B − k = max(i_u, j_u) − min(i_r, j_r), which is less than (i_u − i_{r−1}) + (j_u − j_{r−1}). Removing the break entirely is more effective when ζ(B − k) > N_S, in which case we can get a better chain by simply removing the break. By the previous discussion, 1 − C/2 + Cα < 0, so we just let B = m^{1/2} and this completes the proof.
Proof (Lemma 7). Breaks are flanked by either one, two, or no homologous anchors. The last case can only occur if the entire chain is a break. By Supplementary Lemma 4, we always have at least one homologous anchor in any optimal chain (under F1 ∩ F2), so any break is flanked by at least one homologous anchor. Therefore, by the previous supplementary lemmas, no breaks of size ≥ m^{1/2} occur with probability ≥ 1 − 2/n, after using the conditions F1, F2 with the appropriate assumptions.
Supplementary Figure 5: Given an optimal chain (shown with k-mer anchors in red and blue), if the last k-mer on S′ is 2√m distance away from the end, because the break size is < √m, there will remain at least √m bases left (shaded in blue) near the end of S′, S. We use this geometry in the proof of Supplementary Lemma 7 and argue that adding in every possible homologous k-mer in the shaded region gives a more optimal score.

D.4 Proof of Corollary 3
Supplementary Lemma 7. Under the assumptions of Lemma 7, given any optimal chain (i_1, j_1), . . . , (i_u, j_u), we have j_u − j_1 ≥ m − 4√m.

Proof. Lemma 7 shows the max break size is m^{1/2}; we claim that j_u ≥ p + m − 2√m, i.e. j_u is less than 2√m away from the end of S′. Suppose otherwise. The argument proceeds in two cases. The idea is to essentially show that adding on all of the homologous anchors near the end of S′ always increases the score (under the event space F1 ∩ F2). If (i_u, j_u) is homologous, then m^{1/2} ≫ g(n): g(n) = Θ(n^{Cα} log²(n)) is the maximum distance between homologous k-mers and m = Ω(n^{2Cα+ϵ}). Thus for large enough n, we can find another homologous k-mer near the edge of S′. Adding this homologous k-mer to the chain changes the score by at least −2ζg(n) + 1, where the −2ζg(n) is the linear cost. Since ζ = 1/(6g(n)), this is positive, so the old chain was not optimal. For the non-homologous case, refer to Supplementary Figure 5 for the geometry. If (i_u, j_u) is not homologous, then i_u ∈ [j_u − √m, j_u + √m], because any spurious anchor is contained in a break, which has size < √m. Assume j_u = p + m − L where L > 2√m, so j_u is L away from the end of S′. We claim that adding in all homologous anchors in the interval [max(i_u, j_u) + 1..p + m] gives a more optimal chain. Indeed, since max(i_u, j_u) ≤ j_u + √m, this interval contains at least L − √m > √m bases, and the gain from the added homologous anchors outweighs the additional linear gap cost after using the inequality L > 2√m. Therefore adding in these homologous k-mers makes the score bigger, contradicting the optimality of our chain. Thus L ≤ 2√m, and the distance from j_u to the end of S′ is at most 2√m. The argument works with directions flipped for j_1, so j_u − j_1 ≥ m − 4√m as desired.
Corollary 3. Under the same assumptions as in Lemma 7, the expected recoverability of any optimal chain is ≥ 1 − O(1/√m) for large enough n.
Proof. Let F = F1 ∩ F2, so all breaks have length < m^{1/2}; Pr(F) ≥ 1 − 2/n. Let R(C) = R be the recoverability as before, and let |S′| = m + k − 1 ∼ m, since our final result uses big O notation anyway. From Lemma 1, which relates recoverability to breaks, we get a lower bound on R in terms of j_u − j_1 and the total break length, where we define C_S to be the number of spurious anchors in an optimal chain. The number of breaks is upper bounded by C_S, and we have used Lemma 7 and Supplementary Lemma 7 to bound L(B) by m^{1/2} and j_u − j_1 by m − 4√m respectively. Since E[N_S] ≤ mn^{1−C}, we have that E[N_S | F] ≤ mn^{1−C}/(1 − 2/n) by the law of total expectation and non-negativity of N_S. Using this, we get that E[R] ≥ 1 − O(1/√m).

Remark 1. In the proofs of the intermediate lemmas associated with Corollary 3, we assumed that S′ was not too close to the edges of S, i.e. that quantities such as p − √m were non-negative. It is not hard to see that if p were close to the edges of S, the breaks would actually be smaller, because by definition breaks cannot go past the ends of S. One could in fact show that our upper bounds still work after accounting for these edge cases, but for simplicity we omit them from our proof.
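The two conditioning steps in the proof above can be written out explicitly. The following display is a sketch reconstructed from the quantities already defined (F, N_S, R), not a verbatim formula from the original:

```latex
\begin{align*}
\mathbb{E}[N_S \mid F] \;&\le\; \frac{\mathbb{E}[N_S]}{\Pr(F)}
  \;\le\; \frac{mn^{1-C}}{1 - 2/n},\\
\mathbb{E}[R] \;\ge\; \Pr(F)\,\mathbb{E}[R \mid F]
  \;&\ge\; \Big(1 - \tfrac{2}{n}\Big)\Big(1 - O\big(\tfrac{1}{\sqrt{m}}\big)\Big)
  \;=\; 1 - O\Big(\tfrac{1}{\sqrt{m}}\Big).
\end{align*}
```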

D.5 Proof of Lemma 8
We prove Lemma 8 by proving a series of lemmas.
Supplementary Lemma 8. For a given instance of S, S′, let C be any optimal chain, and let C_only be the chain consisting of only the homologous anchors in S, S′. Defining T^H_Ext to be the runtime of extension of C over only the homologous gaps in C, and T_Ext(C_only) as the runtime of extension over the homologous gaps of C_only, we have that T^H_Ext ≤ T_Ext(C_only).

Proof. Let (i_ℓ, i_ℓ) and (i_{ℓ+1}, i_{ℓ+1}) be two consecutive homologous anchors in an optimal chain C. We can guarantee that none of the random variables A(i_ℓ + 1, i_ℓ + 1), . . . , A(i_{ℓ+1} − 1, i_{ℓ+1} − 1) equals 1; otherwise adding such an anchor would improve an optimal chain (it does not incur a linear gap cost penalty due to being flanked by two anchors). Thus (i_ℓ, i_ℓ) and (i_{ℓ+1}, i_{ℓ+1}) are also consecutive anchors in C_only, and the homologous gap corresponding to those anchors is also present in C_only. Therefore T^H_Ext ≤ T_Ext(C_only), since all homologous gaps in C are also in C_only.

Definition 6. Define K ⊂ C_only to be the subchain of homologous anchors for which the starting positions of the k-mers on S are in {p + 1, p + 1 + k, p + 1 + 2k, . . . , p + 1 + (⌊|S′|/k⌋ − 1)k}. That is, the subchain K consists of only homologous anchors restricted to k-mers that are spaced k bases apart starting from the first index. Since K is sparser than C_only, it should take longer to extend through K than through C_only. We formalize this below.
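Definition 6 can be made concrete with a short sketch (not the paper's code; `k_spaced_subchain` and its arguments are illustrative names). Given the start positions of homologous anchors on S, it keeps only those lying on the k-spaced grid, so consecutive kept k-mers never overlap and are mutually independent:

```python
# Sketch of Definition 6: restrict homologous anchors to the grid
# p+1, p+1+k, ..., p+1+(floor(|S'|/k)-1)k of non-overlapping k-mer starts.
def k_spaced_subchain(homologous_starts, p, k, m_prime):
    """m_prime = |S'|; returns grid positions carrying a homologous anchor."""
    grid = set(range(p + 1, p + 1 + (m_prime // k) * k, k))
    return sorted(s for s in homologous_starts if s in grid)

# Toy example: p = 0, k = 3, |S'| = 12, so the grid is {1, 4, 7, 10}.
assert k_spaced_subchain([1, 2, 4, 5, 9, 10], p=0, k=3, m_prime=12) == [1, 4, 10]
```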
Supplementary Lemma 9. Let G_start be the distance from the first anchor of K to the start of S′, and similarly for G_end and the last anchor of K to the end of S′. Then T_Ext(C_only) ≤ T_Ext(K) + G_start² + G_end².

Supplementary Figure 6: A graphical proof of Supplementary Lemma 9. Given a subchain of homologous anchors, the extension time is longer because the square of the larger gaps in the subchain contains the square of the smaller gaps over all homologous anchors. (The figure legend marks homologous anchors in C_only but not in the subchain.) The last gap on the right is not accounted for in the subchain, but adding G_end² and G_start² fixes this. Note: the second-to-last orange square on the right is a gap for both chains, i.e. the square is both orange and blue.
Proof. See Supplementary Figure 6 for a visualization of the proof. Let {G^K_1, G^K_2, . . . , G^K_r} be the homologous gap sizes in K, where we think of G^K_1 as the gap size (possibly 0) between the first two anchors, and so forth. Similarly, let {G^C_1, G^C_2, . . . , G^C_q} be the homologous gap sizes in C_only. The extension runtime is T_Ext(K) = Σ_{i=1}^{r} (G^K_i)², and similarly for T_Ext(C_only). Any two consecutive anchors in K give rise to a gap G^K_i. These two anchors also exist on C_only, but there may be intermediate anchors, so this gives rise to multiple gaps G^C_j, . . . , G^C_{j+ℓ} between these two anchors on C_only. Now G^C_j + · · · + G^C_{j+ℓ} ≤ G^K_i, because the sum of all intermediate gaps is at most the size of the larger gap, so (G^C_j)² + · · · + (G^C_{j+ℓ})² ≤ (G^K_i)². Let G^C_a be the leftmost gap on C_only to the right of the first anchor of K, and G^C_b be the gap to the left of the last anchor of K. Considering every pair of consecutive anchors on K, we get Σ_{j=a}^{b} (G^C_j)² ≤ T_Ext(K) by the previous inequality. However, we are not done yet, because the leftmost and rightmost gaps in C_only may not be contained in any gap of K. Treating G_start as the "gap" on K containing all of the leftmost gaps on C_only, and similarly for G_end, we get Σ_{j<a} (G^C_j)² ≤ G_start² and Σ_{j>b} (G^C_j)² ≤ G_end² by the same arguments as above. Combining both inequalities gives the result.
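The key inequality in this proof (splitting a gap into sub-gaps never increases the sum of squared extension costs) can be checked numerically. The following is a minimal sketch, with randomly generated gaps standing in for the chains:

```python
# Numerically checks the inequality used in Supplementary Lemma 9:
# if gaps g_1..g_t on C_only lie inside a single gap G on K
# (so g_1 + ... + g_t <= G), then g_1^2 + ... + g_t^2 <= G^2,
# i.e. extending through the finer chain is never slower under
# the quadratic extension cost model.
import random

def check_subgap_domination(trials=1000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        G = rng.randint(1, 1000)   # containing gap on the sparse chain K
        # split part of G into sub-gaps, as intermediate anchors would
        remaining, subgaps = G, []
        while remaining > 0 and rng.random() < 0.9:
            g = rng.randint(0, remaining)
            subgaps.append(g)
            remaining -= g
        assert sum(g * g for g in subgaps) <= G * G
    return True

print(check_subgap_domination())  # prints True
```

The assertion holds deterministically: since the sub-gaps sum to at most G, their squared sum is at most (Σg)² ≤ G².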
By the above results, we can bound T^H_Ext by either T_Ext(C_only) or T_Ext(K). We will work with K in this section, but we will actually use C_only for the sketched version of the main theorem. Below we give random variables to calculate these extension times.
Under our original definition, the extension time over C_only would be Σ_{ℓ=1}^{u−1} G_ℓ², where G_ℓ is the size of the ℓth gap, but u is a random variable. It is clear that Σ_i Y_i², summed over all positions i, is also the runtime of extension over C_only, and it will be easier to handle in our proof now that the upper index is not a random variable. Similarly, Σ_i (Y^K_i)² is the runtime of extension over K. We will now work with the chain K restricted to equally spaced k-mers and their unmutated homologous anchors. We upper bound this extension time in expectation, which will upper bound E[T^H_Ext] as well.
The random variable G_start = ℓ · k if for the first ℓ k-mers, A(1, 1) = 0, A(1 + k, 1 + k) = 0, . . . , A(1 + (ℓ − 1)k, 1 + (ℓ − 1)k) = 0 but A(1 + ℓk, 1 + ℓk) = 1. In other words, the first ℓ k-mers that are spaced k bases apart are mutated, but the (ℓ + 1)th such k-mer is not mutated. Thus Pr(G_start = ℓ · k) ≤ (1 − (1 − θ)^k)^ℓ (1 − θ)^k. Importantly, we used the fact that k-mers spaced k bases apart are independent of each other. The ≤ comes from the fact that G_start cannot be larger than |S′|. It follows that E[G_start²] = Σ_ℓ (ℓk)² Pr(G_start = ℓ · k) = O(k²/(1 − θ)^{2k}) = O(n^{2Cα} log² n), which follows from geometric series manipulations. Notice that O(n^{2Cα} log² n) = O(m) because m = Ω(n^{2Cα+ϵ}). It is clear that the same argument holds for G_end², so both terms are O(m). Now for Y^K_i, we have that Y^K_i = ℓ · k with ℓ > 0 only if i ∈ {p + 1, p + 1 + k, p + 1 + 2k, . . .}, the k-mer at i is unmutated, the k-mers that are k, 2k, 3k, . . . , ℓk bases ahead of i are mutated, and the k-mer (ℓ + 1) · k bases ahead of i is unmutated. Thus Pr(Y^K_i = ℓ · k) ≤ (1 − θ)^{2k} (1 − (1 − θ)^k)^ℓ if i ∈ {p + 1, p + 1 + k, . . .} and 0 otherwise.
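The geometric-series step can be sanity-checked by simulation. The following sketch (assumptions: k-spaced k-mers survive mutation independently with probability q = (1 − θ)^k, as in the argument for G_start) compares a Monte Carlo estimate of E[G_start²] with the geometric closed form k²(1 − q)(2 − q)/q², which is O(k²/(1 − θ)^{2k}):

```python
# Monte Carlo check of E[G_start^2] against the geometric-series closed form
# underlying the O(k^2 / (1-theta)^{2k}) bound. Each k-spaced k-mer survives
# independently with probability q = (1-theta)^k; G_start = (number of mutated
# k-mers before the first survivor) * k.
import random

def expected_gap_sq(theta=0.05, k=10, trials=200_000, seed=1):
    rng = random.Random(seed)
    q = (1 - theta) ** k              # Pr(a k-mer is unmutated)
    total = 0
    for _ in range(trials):
        ell = 0                       # mutated k-mers before first survivor
        while rng.random() >= q:
            ell += 1
        total += (ell * k) ** 2       # G_start = ell * k
    empirical = total / trials
    closed_form = k * k * (1 - q) * (2 - q) / (q * q)
    return empirical, closed_form

emp, exact = expected_gap_sq()
assert abs(emp - exact) / exact < 0.05   # empirical within 5% of closed form
```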
We can compute the expectation over all Y^K_i and only pick out the non-zero random variables where i ∈ {p + 1, p + 1 + k, . . .}. There are at most m/k such random variables, so we get E[T_Ext(K)] ≤ Σ_i E[(Y^K_i)²] + E[G_start²] + E[G_end²]. Each E[(Y^K_i)²] is O(k² n^{Cα}) by the same geometric series manipulation (the extra factor of (1 − θ)^k = n^{−Cα}, compared to E[G_start²], comes from the unmutated flanking k-mer), so the sum is O((m/k) · k² n^{Cα}) = O(mn^{Cα} log n). The expectations in all the terms are O(mn^{Cα} log n), so this finishes the proof.
D.6 Proof of Lemma 9

Supplementary Lemma 10 (F1 + F2). Under the same assumptions as Lemma 7, all non-homologous gaps in an optimal chain have size < √m + (50/8)C log(n) ln(n) n^{Cα} on both S and S′ with probability ≥ 1 − 2/n for large enough n.
Proof. Let us work in the event F = F1 ∩ F2, which holds with probability ≥ 1 − 2/n. There are two types of non-homologous gaps: those flanked by two spurious anchors, and those flanked by only one.
If a non-homologous gap is flanked by two spurious anchors, then it is part of a break. The gap size must be smaller than the break, which has length less than √m. Suppose instead that a non-homologous gap is flanked by one spurious anchor A(i, j) and one homologous anchor A(h, h), and WLOG that |i − h| > |j − h|. We know with high probability that |j − h| < g(n) = (50/8)C log(n) ln(n) n^{Cα}, as otherwise there would be a homologous anchor between A(i, j) and A(h, h) (by property F2/Lemma 6) and the chain would not be optimal, since we could just insert the additional homologous anchor between j and h. Furthermore, |i − j| < √m, as |i − j| is less than the break size. We then see that |i − h| ≤ |i − j| + |j − h| < √m + (50/8)C log(n) ln(n) n^{Cα}, as desired.

Proof (of Lemma 9). Define γ as the maximum non-homologous gap size as in Supplementary Lemma 10, conditional on F = F1 ∩ F2; then γ = O(m^{1/2} + n^{Cα} log²(n)) = O(m^{1/2}) when m = Ω(n^{2Cα+ϵ}). We bound the expected value of the non-homologous gaps as follows.

E[Σ_{G_j non-homologous} G_j² | F] ≤ γ² · 2E[C_S | F]. The inequality uses γ > G_j and the fact that the number of non-homologous gaps is at most 2 times C_S, which we define to be the number of spurious anchors in the chain (each anchor gives rise to at most two unique gaps). C_S ≤ N_S follows trivially, so E[Σ_{G_j non-homologous} G_j² | F] ≤ 2γ² E[N_S | F] ≤ 2γ² · mn^{1−C}/(1 − 2/n) = O(m² n^{1−C}) = O(mn^{2−C}), since m ≤ n. The first inequality follows from E[N_S | F] Pr(F) ≤ E[N_S] and non-negativity of N_S, as well as Pr(F) ≥ 1 − 2/n. Finally, outside of F, the worst-case extension through non-homologous gaps is just O(nm) as in Section 2.3. Since Pr(F̄) ≤ 2/n, the expected runtime is O(nm/n + mn^{2−C}) = O(m) as desired.
E Missing proofs from Section 7

E.1 Proof of Lemma 11

Lemma 11. The variance Var(N*_S) can be upper bounded in the same manner as Var(N_S) in Lemma 3.

Proof. The proof follows almost exactly as in Lemma 3. We only do the variance bound as an example; the other moment bounds follow in exactly the same way. We upper bound the three sums S_1, S_2, S_3 in the proof of Lemma 3, now with the sketched versions S*_1, S*_2, S*_3. By independence of A(i, j) and A(h, l) in S*_1, the second term is n^{−2C}. Now notice that under the conditions of S_1, either |i − h| ≥ k or |j − l| ≥ k, meaning that two of the k-mers along either S or S′ are independent. WLOG we can assume it is i and h; thus J(i) and J(h) are independent. The A(i, j) = 1 condition implies J(i) = J′(j), and similarly for h, l.

Remark 2. The variables J(i) are k-dependent, so we could use Theorem 5 for this sum. This, however, leads to a bound like exp(−O(m/(kc))) instead of our bound of exp(−O(m/(k + c))) in the above lemma. While the former bound still leads to the same asymptotic behavior for Supplementary Lemma 13, which we are ultimately trying to prove, we believe it is enlightening to show how the dependence structure of the J(i)s can be used to obtain a better bound.
Let I(i, i + τ) = min(J(i, i + τ), 1); then if Σ_i J(i) is large over [a..b], the sum Σ_{i=a}^{b−τ+1} I(i, i + τ) is also large. Notice that I(i, i + τ) is the random variable which is 1 when some k-mer in [i..i + τ) is an open syncmer. By Theorem 7, only one of the variables J(i), . . . , J(i + τ − 1) can be equal to 1. Using Pr(J(i) = 1) = 1/c, we get Pr(I(i, i + τ) = 1) = τ/c by disjointness. I(i, i + τ) is (k + τ − 1)-dependent because it examines the k + τ − 1 bases starting from position i, so Theorem 5 applies.
Summing over the indicators I(i, i + τ) and using τ = (c + 1)/2, we obtain that the probability that A*(a, a), . . . , A*(b, b) are all 0 is upper bounded as in the non-sketched case, where we now sum only over syncmer anchors instead of all k-mer anchors in the range.
Now by Theorem 7, the distance between consecutive positions in J is at least (c + 1)/2 (remembering that we assume c is odd for simplicity), which means that given the ith open syncmer, the (i + ⌈2k/(c + 1)⌉)th open syncmer is more than k bases away from the ith open syncmer, and hence the two are independent. Thus the A(i, i)s in J are now ⌈2k/(c + 1)⌉ < 2k/(c + 1) + 1 dependent. Conditioning on |J| > (1 − β)ℓ/c, we can bound the first term using Theorem 5, where we substitute |J| = (1 − β)ℓ/c and use c/(c + 1) ≤ 1 to remove the corresponding term. To bound the second term, we just use Supplementary Lemma 11. This furnishes the final result.

Supplementary Lemma 13. No homologous gap has size greater than g′(n) with probability ≥ 1 − 2/n, after ignoring an additive (c + 1)/2 − 1 + C log n term.
Proof. We use Supplementary Lemma 12 after plugging in the value ℓ = g′(n) and letting β = 1/2. Algebraic manipulation shows that the probability is at most 2/n², using k = C log n and the inequality c < k. Then, as in the proof of Theorem 6, we use indicator random variables and Markov's inequality in the same way to get the result.
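The distance guarantee from Theorem 7 used above can be checked empirically. The following sketch (not the paper's implementation; `smer_hash` is a stand-in for a random hash over s-mers) selects open syncmers by requiring the minimal s-mer to sit at the middle of the c = k − s + 1 offsets, and verifies that consecutive selected positions are at least (c + 1)/2 apart:

```python
# Open syncmer selection with the minimal s-mer required at the middle offset,
# plus an empirical check of the spacing guarantee (consecutive open syncmers
# are >= (c+1)/2 positions apart when c = k - s + 1 is odd).
import random

def smer_hash(smer):
    return hash(smer)  # stand-in for a random hash over s-mers

def open_syncmer_positions(seq, k, s):
    c = k - s + 1              # number of s-mer start offsets in a k-mer
    t = (c + 1) // 2 - 1       # middle offset (c assumed odd)
    pos = []
    for i in range(len(seq) - k + 1):
        hashes = [smer_hash(seq[i + j : i + j + s]) for j in range(c)]
        if hashes.index(min(hashes)) == t:
            pos.append(i)
    return pos

rng = random.Random(42)
seq = "".join(rng.choice("ACGT") for _ in range(5000))
k, s = 15, 7                   # c = 9 (odd); guarantee: spacing >= 5
P = open_syncmer_positions(seq, k, s)
assert all(b - a >= (k - s + 2) // 2 for a, b in zip(P, P[1:]))
```

The spacing assertion holds for any hash function: two windows less than (c + 1)/2 apart would each need the shared minimal s-mer at their own middle offset, which is impossible.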

E.4 Proof of Lemma 13
We use the same definitions and supplementary lemmas as in Section D.5 for Y_i, C_only, and K, now considering the sketched versions Y*_i, C*_only, and K*. We first note that the sketched versions of Supplementary Lemmas 8 and 9 hold by the same arguments. The analogous version of Lemma 8 holds as well.

Supplementary Lemma 14. Under open syncmer sketching, E[T^H_Ext] = O(c · mn^{Cα} log n).

Proof. This is essentially the same argument as in the proof of Lemma 8 outlined in Section D.5. We end up bounding E[T_Ext(K*)] + E[(G*_start)²] + E[(G*_end)²]; the only difference is that (1 − θ)^k becomes (1 − θ)^k / c due to sketching. Ultimately, the main term is the sum of the E[(Y^K*_i)²], and the same geometric series manipulation as before shows that this is O(c · mn^{Cα} log n). This finishes the proof.
When c is fixed to be a constant independent of n, we get the same asymptotic bound as before. This bound suggests that sketching makes extension slower. However, we can actually do better when c grows with n. This time we proceed in a different manner, directly using C*_only to bound the extension time. We let Y*_i be the sketched version of Y_i as in Definition 7, a random variable measuring gap sizes between anchors.
Proof. Y*_i = ℓ is equivalent to the k-mer covering [i..i + k − 1] being unmutated (i.e. no bases on the k-mer are mutated) and selected as an open syncmer, all of the k-mers in between these two flanking k-mers not being present, and the k-mer covering [i + ℓ + k..i + ℓ + 2k − 1] being unmutated and selected as an open syncmer. Calling these events H_1, H_2, and H_3 respectively, Pr(Y*_i = ℓ) = Pr(H_1 ∩ H_2 ∩ H_3).
The k-mers considered in H_2 that lie in between the flanking k-mers overlap the flanking k-mers in H_1 and H_3, so these events are not independent. To upper bound H_2, let H′_2 be the event that the k-mers lying completely in the interval [i + k..i + ℓ + k − 1] on S (not just overlapping the interval) are each mutated or not selected as an open syncmer. H′_2 ⊃ H_2 as events. Upon examination, this set of k-mers is exactly the set of k-mers starting at positions in H_i(k, ℓ). Now notice that H_1, H′_2, H_3 are all independent, as the k-mers in each event lie on non-overlapping bases, so Pr(Y*_i = ℓ) ≤ Pr(H_1) Pr(H′_2) Pr(H_3). In the case that [i + ℓ + k..i + ℓ + 2k − 1] does not lie on S′, i.e. i + ℓ + k > p + m, then Pr(Y*_i = ℓ) = 0, as a homologous gap of this length cannot exist near the edges, so the upper bound still holds. The upper bound also holds if i − k + 1 < p + 1 for the same reason. This finishes the proof.

The claimed bound holds by noticing that the first term in Supplementary Lemma 12 is dominated by the second term and using k > c, where we also ignore the −(c + 1)/2 + 1 in Supplementary Lemma 12 because it is asymptotically small. We can now bound the expected time of extension over homologous gaps using the Y*_i. The other bound c · mn^{Cα} log n follows from Supplementary Lemma 14, so taking the minimum over both bounds yields the result.

Remark 3. Intuitively, what is happening here is that open syncmers have the same key property as K: they are spaced at least (c + 1)/2 ∼ k/2 bases apart when c ∼ k, and are "almost independent" like the anchors in K. C*_only has similar properties to the non-sketched chain K, which shows why they give the same bounds. Interestingly, if we were to bound C_only to get a result for the non-sketched version E[T^H_Ext] while using the same techniques as above (with the dependent Chernoff-Hoeffding bound), the result would be O(mn^{Cα} log³ n), worse than the bound obtained via K by a factor of log²(n).
Remark 4. If we were to use the FracMinHash method or another k-mer selection/sketching method without the distance guarantee provided by Theorem 7, Supplementary Lemma 12 would not hold, and therefore the above argument would fail. Supplementary Lemma 14 would however still hold.

E.5 Re-proving sketched bounds
Supplementary Lemma 16 (F1* + F2*). Using the same assumptions as in Lemma 7, there are no breaks of length > m^{1/2} in an optimal chain with probability greater than 1 − 3/n for large enough n.
Proof. In Section D.3, the structure of the problem does not change. The only differences are the bounds on gap size and on spurious k-mers. Asymptotically, the homologous gap size is still the same, and both the number of spurious anchors and the number of homologous anchors are smaller by a factor of √c, which does not affect the inequalities. Thus all of the bounds in Section D.3 hold, except with slightly different probabilities, since F1* holds with probability ≥ 1 − 2/n instead of 1 − 1/n.

The sketched analogue of Lemma 9 also holds: the proof is the same as the proof of Lemma 9 in Section D.6, except the sketched maximum non-homologous gap length is m^{1/2} + g′(n). This is still O(m^{1/2}), and N*_S < N_S, so the expected value of the non-homologous runtime does not change.
Figure 9: The recoverability of our alignments over 50,000 iterations as a function of sequence length n, where θ = 0.10 and k was chosen as described in Section 8. Breaks were uncommon, and most of the recoverability loss comes from the (j_u − j_1) term, due to homologous anchors not being near the edges.