## Abstract

As sequencing datasets keep growing larger, time and memory efficiency of read mapping are becoming more critical. Many clever algorithms and data structures were used to develop mapping tools for next generation sequencing, and in the last few years also for third generation long reads. A key idea in mapping algorithms is to sketch sequences with their minimizers. Recently, syncmers were introduced as an alternative sketching method that is more robust to mutations and sequencing errors.

Here we introduce parameterized syncmer schemes, and provide a theoretical analysis for multi-parameter schemes. By combining these schemes with downsampling or minimizers we can achieve any desired compression and window guarantee. We introduced syncmer schemes into the popular minimap2 and Winnowmap2 mappers. In tests on simulated and real long read data from a variety of genomes, the syncmer-based algorithms reduced unmapped reads by 20-60% at high compression while using less memory. The advantage of syncmer-based mapping was even more pronounced at lower sequence identity. At sequence identity of 65-75% and medium compression, syncmer mappers had 50-60% fewer unmapped reads, and ∼ 10% fewer of the reads that did map were incorrectly mapped. We conclude that syncmer schemes improve mapping under higher error and mutation rates. This situation happens, for example, when the high error rate of long reads is compounded by a high mutation rate in a cancer tumor, or due to differences between strains of viruses or bacteria.

## 1 Introduction

As the volume of third-generation, long read sequencing data increases, new computational methods are needed to efficiently analyze massive datasets of long reads. One of the most basic steps in analysis of sequencing data is mapping reads to a known reference sequence or to a database of many sequences. Several long read mappers have been proposed [6, 15], with minimap2 [8] being the most popular. minimap2 is a multi-purpose sequence aligner that uses sequence minimizers as alignment anchors. Minimizers, the minimum valued *k*-mers in windows of *w* overlapping *k*-mers of a sequence, are used to sketch sequences. They have greatly improved computational efficiency of many different sequence analysis algorithms (e.g. [2], [7], [17]). A key criterion in evaluating minimizer schemes is their *density*, which is the fraction of *k*-mers selected. The inverse of the density is called the *compression rate* of the scheme.

Recent work has shown that minimizers are less effective under high error or mutation rates. Motivated by this observation, Edgar [4] recently introduced a novel family of *k*-mer selection schemes called *syncmers*. Syncmers are a set of *k*-mers defined by the position of their minimum *s*-long substring (*s*-minimizer). Syncmers constitute a predetermined subset of all possible *k*-mers and, unlike minimizers, they are defined by the sequence of the *k*-mer only and do not depend on the rest of the sequence in which they appear. Syncmers are therefore more likely to be conserved under mutations. In contrast, minimizers are selected depending on a larger window, which is more likely to contain mutations or errors. This difference is crucial in long read data, which has much higher error rate than short reads [3]. Another key difference between syncmer and minimizer schemes is that the latter guarantee selection of a *k*-mer in every window of *w* consecutive *k*-mers (this is called a *window guarantee*), while syncmers do not.

Edgar defined several syncmer variants, including the families of *open syncmers*, whose *s*-minimizer appears at one specific position, and *closed syncmers*, whose *s*-minimizer appears at either the first *or* the last position in the *k*-mer [4]. He computed the properties of a range of syncmer schemes and used them to choose a scheme with a desired compression rate. Shaw and Yu [16] recently formalized the notions of the conservation of selected positions and their clustering along a sequence, and provided a broader theoretical analysis.

In this work we generalize syncmer schemes to multiple arbitrary *s*-minimizer positions. We call these *parameterized syncmer schemes* (PSSs), where the parameters are the possible indices of the *s*-minimizer in a selected *k*-mer, and an *n*-parameter scheme uses *n* such indices. An example is a 3-parameter scheme that selects any 15-mer with the minimum 5-mer appearing at position 1, 5, or 9. We analyze the properties of such parameterized syncmer schemes and determine which schemes perform well for a given compression rate through theoretical analysis and empirical testing.

When using PSSs in practice, the selected *k*-mers must often be downsampled to achieve a desired compression rate. We demonstrate that it is possible to retain properties of syncmers such as minimum and most frequent distances between selected positions by choosing the correct parameters and downsampling rate of a PSS. We can also have a window guarantee by combining syncmers and minimizers.

Many read mappers work by indexing seeds in a reference sequence and then finding a chain of matching seeds, forming a segment with high scoring alignment with the query sequence. In the long read mapper minimap2 [8], minimizers are used as the seeds, with the advantage that any identical window of length *w* will have the same minimizer. However, for longer reads with a much higher error rate, conservation of the selected *k*-mers becomes more important than the window guarantee, especially when there are also mutations. For example, it was shown that with 90% identity between aligned sequences, only about 30% of the positions on the sequence will overlap a conserved minimizer in minimap2 [4].

We introduced syncmer schemes into two leading long read mappers: the latest release of minimap2 [9] and Winnowmap2 [5]. Winnowmap extended minimap2 and re-weighted the minimizers by frequency to obtain a better distribution of minimizers and improve mapping, especially in highly repetitive regions [6]. Winnowmap2 achieved even better performance in repetitive regions by replacing the extension phase of minimap2’s seed-and-extend alignment algorithm with the alignment of minimal confidently alignable substrings that do not contain non-reference alleles. The latest version of minimap2 was reported to have closed the performance gap between the mappers [9].

We compared minimap2, Winnowmap2, and their modified syncmer-based versions on both simulated and real long read data. The syncmers increased the number of mapped reads across a large range of compression rates, resulting in 20-60% fewer unmapped reads at high compression. Even at lower compression, the syncmer mappers had 2-15% fewer unmapped reads. The syncmer versions took less memory but had somewhat longer mapping times. The most marked improvements were observed when the percent identity between the mapped reads and the reference sequences was low. With percent identity of 65% and 75% and medium compression, syncmer mappers had 50-60% fewer unmapped reads and still had 8-13% fewer incorrectly mapped reads.

While we were conducting this research, Shaw and Yu released a preprint that modified minimap2 to use open syncmers [16]. That work focused mostly on the theoretical properties of syncmers and provided a foundation and justification for their use. Our work greatly extends the practical use of syncmers and builds on these theoretical foundations.

The paper is structured as follows: Section 2 provides background, definitions, and terminology; Section 3 provides some theoretical analysis of PSSs; Section 4 describes the practical implementation of PSSs and their integration into minimap2 and Winnowmap2; Section 5 presents experimental results of the original and PSS-modified mappers; Section 6 discusses the results and future work.

## 2 Definitions and Background

### 2.1 Basic definitions and notations

For a string *S* over the alphabet Σ, a *k-mer* is a *k*-long contiguous substring of *S*. The *k*-mer starting at position *i* is denoted *S*[*i, i*+*k* −1] (string indices start from 1 throughout). We work with the nucleotide alphabet: Σ = {*A, C, G, T*}.

#### *k*-mer order

Given a one-to-one hash function on *k*-mers *o* : Σ^{k} → ℝ, we say that *k*-mer *x*_{1} is less than *x*_{2} if *o*(*x*_{1}) *< o*(*x*_{2}). Examples include lexicographic encoding or random hash. We will denote this as *x*_{1} *< x*_{2} when *o* is clear from the context. In this work we use a random order unless otherwise noted.

#### Canonical *k*-mers

Denote the reverse complement of *x* by $\bar{x}$. For a given order, the *canonical form* of a *k*-mer *x*, denoted by *Canonical*(*x*), is the smaller of *x* and $\bar{x}$. For example, under the lexicographic order, *Canonical*(*CGGT*) = *ACCG*.
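The canonical form can be illustrated with a minimal sketch. The function names `reverse_complement` and `canonical` are ours and purely illustrative; the default order is lexicographic, and any other order can be passed in as a key function.

```python
# Complement table for the nucleotide alphabet {A, C, G, T}.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str, order=None) -> str:
    """Return the smaller of a k-mer and its reverse complement.

    `order` maps a k-mer to a comparable value; when omitted, plain string
    comparison (lexicographic order) is used.
    """
    key = order if order is not None else (lambda x: x)
    return min(kmer, reverse_complement(kmer), key=key)


print(canonical("CGGT"))  # ACCG
```

Passing a hash function as `order` gives the canonical form under a random order instead.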

### 2.2 Selection schemes

#### Selection scheme

A *selection scheme* is a function from a string to the indices of positions in it *f* : Σ* → ℕ. The scheme implicitly selects the *k*-mers starting at these positions. For a string *S* ∈ Σ*, *f*_{k}(*S*) = {*i*_{1}, *i*_{2}, …, *i*_{n}} is the set of start indices of the *k*-mers selected by the scheme.

#### Minimizers

A *minimizer scheme* chooses the position of the minimum value *k*-mer in every window of *w* consecutive *k*-mers in *S*:

$$\mathcal{M}_{k,w,o}(S) = \left\{ \operatorname*{arg\,min}_{i \le j \le i+w-1} o\big(S[j, j+k-1]\big) \;\middle|\; 1 \le i \le |S| - w - k + 2 \right\}$$

where the minimum is according to *k*-mer ordering *o*. By convention, ties are broken by choosing the leftmost position. An example of a minimizer selection scheme is shown in Figure 1A. By definition, minimizers select a *k*-mer in every window of *w k*-mers. This property is called a *window guarantee*.
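A direct, unoptimized sketch of minimizer selection follows. It is not the implementation used by any mapper; the function name `minimizers` is illustrative, positions are reported 1-based as in the text, and the default order is lexicographic for reproducibility.

```python
def minimizers(seq, k, w, order=lambda x: x):
    """Return the sorted 1-based start positions selected by a (k, w) minimizer scheme."""
    n_kmers = len(seq) - k + 1
    selected = set()
    for start in range(n_kmers - w + 1):  # 0-based start of each window
        window = range(start, start + w)
        # min() returns the first minimal element, which gives leftmost tie-breaking.
        best = min(window, key=lambda j: order(seq[j:j + k]))
        selected.add(best + 1)  # convert to 1-based indexing
    return sorted(selected)


# 3-mers of ACGTACGT: 1 ACG, 2 CGT, 3 GTA, 4 TAC, 5 ACG, 6 CGT
print(minimizers("ACGTACGT", k=3, w=2))  # [1, 2, 3, 5]
```

This scan costs O(w) per window; production code typically uses a sliding-window minimum instead.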

#### Parameterized syncmers

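A *parameterized syncmer scheme* (PSS) with parameters 0 *< x*_{1} *<* … *< x*_{n−1} *< x*_{n} ≤ *k* − *s* + 1 selects a *k*-mer if the minimum *s*-mer of that *k*-mer appears at one of the positions *x*_{i} in the *k*-mer:

$$Sy_{k,s,o}(x_1, \ldots, x_n)(S) = \left\{ i \;\middle|\; \operatorname*{arg\,min}_{1 \le j \le k-s+1} o\big(S[i+j-1, i+j+s-2]\big) \in \{x_1, \ldots, x_n\},\ 1 \le i \le |S| - k + 1 \right\}$$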

As *o* is fixed we will drop it from the notation where possible. An example of a PSS is shown in Figure 1B. We denote the family of all *n*-parameter syncmers as *Sy*_{n}. Note that, unlike minimizers, a PSS may not have a window guarantee since it will only identify a fixed subset of all *k*-mers.

Under these definitions, the open and closed syncmer schemes defined in [4] are *Sy*_{1} and *Sy*_{k,s}(1, *k* − *s* + 1), respectively. From here on, we will refer to PSSs simply as syncmers.
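The selection rule can be sketched as follows. This is an illustration rather than the implementation of Section 4: `pss_select` is a hypothetical name, the order is lexicographic by default, and leftmost tie-breaking among equal *s*-mers is our assumption.

```python
def pss_select(seq, k, s, params, order=lambda x: x):
    """Return 1-based start positions of k-mers whose minimum s-mer
    sits at one of the 1-based offsets listed in `params`."""
    allowed = set(params)
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # 1-based offset of the minimum s-mer inside this k-mer.
        offset = min(range(k - s + 1), key=lambda j: order(kmer[j:j + s])) + 1
        if offset in allowed:
            selected.append(i + 1)
    return selected


# Closed syncmers with k = 5, s = 2 correspond to the parameters (1, k - s + 1) = (1, 4).
print(pss_select("ACGTACGT", k=5, s=2, params=(1, 4)))  # [1, 2]
print(pss_select("ACGTACGT", k=5, s=2, params=(3,)))    # [3]
```

The decision for each *k*-mer uses only that *k*-mer, which is the context independence that distinguishes syncmers from minimizers.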

#### Downsampled syncmers

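Given a uniformly random hash function *h* : Σ^{k} → [0, *H*], for a given string *S, downsampling* selects syncmers only from the set of |Σ|^{k}*/δ k*-mers with the lowest hash values:

$$Sy^{\delta}_{k,s,o}(x_1, \ldots, x_n)(S) = \left\{ i \in Sy_{k,s,o}(x_1, \ldots, x_n)(S) \;\middle|\; h\big(S[i, i+k-1]\big) \le H/\delta \right\}$$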

We call *δ* the *downsampling rate*.
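Downsampling can be sketched as a threshold on a *k*-mer hash. The BLAKE2b-based hash, the constant `H_MAX`, and the function names below are assumptions made for illustration and are not taken from the mappers' code.

```python
import hashlib

H_MAX = 2**64 - 1  # largest value of the 64-bit hash used in this sketch


def kmer_hash(kmer, seed=b"downsample"):
    """Map a k-mer to an integer in [0, H_MAX] with a keyed BLAKE2b digest."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8, key=seed).digest()
    return int.from_bytes(digest, "big")


def downsample(seq, k, positions, delta, h=kmer_hash, h_max=H_MAX):
    """Keep the selected 1-based positions whose k-mer hash is at most h_max / delta."""
    threshold = h_max / delta
    return [i for i in positions if h(seq[i - 1:i - 1 + k]) <= threshold]


# Toy hash with h_max = 100 so the effect of delta = 2 is easy to follow.
toy = {"ACG": 10, "CGT": 60, "GTA": 30}
print(downsample("ACGTA", 3, [1, 2, 3], delta=2, h=toy.__getitem__, h_max=100))  # [1, 3]
```

With a uniformly distributed hash, roughly a fraction 1/*δ* of the selected *k*-mers survives.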

#### Windowed syncmers

A syncmer scheme may leave large gaps between selected positions on some input sequences. Windowed syncmers fill in these gaps using a minimizer scheme, thus providing a window guarantee. For clarity in the definition below let *Sy* represent *Sy*_{k,s,o}(*x*_{1}, …, *x*_{n})(*S*).

Letting 𝒮 represent all *k*-mers that can be syncmers in *Sy*_{k,s,o}(*x*_{1}, …, *x*_{n}), an equivalent definition would be: ℳ_{k,w,o′} (*S*) where *o*′ is defined such that *x* ∈ 𝒮, *x*′ ∈ Σ^{k} \𝒮 ⇒ *x < x*′.
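The equivalent formulation as a minimizer scheme under the modified order *o*′ can be sketched as below. The names are illustrative, and ordering the *k*-mers within 𝒮 (and within its complement) by the original order *o* is our assumption, since the text only requires every syncmer to precede every non-syncmer.

```python
def min_smer_offset(kmer, s, order):
    """1-based offset of the minimum s-mer in a k-mer (leftmost on ties)."""
    return min(range(len(kmer) - s + 1), key=lambda j: order(kmer[j:j + s])) + 1


def windowed_syncmers(seq, k, s, params, w, order=lambda x: x):
    """Minimizer selection under o': syncmers rank below all other k-mers."""
    allowed = set(params)

    def o_prime(kmer):
        is_syncmer = min_smer_offset(kmer, s, order) in allowed
        return (0 if is_syncmer else 1, order(kmer))

    n_kmers = len(seq) - k + 1
    selected = set()
    for start in range(n_kmers - w + 1):
        best = min(range(start, start + w), key=lambda j: o_prime(seq[j:j + k]))
        selected.add(best + 1)  # 1-based position
    return sorted(selected)


# Only the 5-mer at position 1 is a syncmer here; positions 2 and 3 are gap fillers.
print(windowed_syncmers("ACGTACGT", k=5, s=2, params=(1,), w=2))  # [1, 2, 3]
```

Because the result is a minimizer scheme, every window of *w* consecutive *k*-mers contains a selected position.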

### 2.3 Properties and evaluation criteria of schemes

We define some properties of selection schemes and several metrics that will allow us to compare different schemes.

#### Density and compression

The *density* [13] of a scheme is the expected fraction of positions selected by the scheme in an infinitely long random sequence: *d*(*f*) = 𝔼 [|*f* (*S*)|*/*|*S*|] as |*S*| → ∞. *Compression* is defined as *c*(*f*) = 1*/d*(*f*), i.e. the factor by which the sequence *S* is “compressed” by representing it using only the set of selected *k*-mers.

#### Conservation

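Conservation [16] is the expected fraction of positions covered by a selected *k*-mer in sequence *S* that will also be covered by the same selected *k*-mer in the mutated sequence *S*′ where *S*′ is generated by iid base mutations with rate *θ*. Define the set of positions covered by the same selected *k*-mer in both sequences

$$B(f, S, S') = \left\{ i \;\middle|\; \exists\, j \in f(S) \cap f(S'),\ j \le i \le j+k-1,\ S[j, j+k-1] = S'[j, j+k-1] \right\},$$

then the *conservation* of the scheme is defined as $Cons(f, \theta, k) = \mathbb{E}\big[|B(f, S, S')|\big] / |S|$.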

#### Spread and distance distribution

One key feature of a scheme is the distribution of distances between selected positions. This would tell us the frequency with which selected positions appear close together or far apart. Shaw and Yu studied the probability distribution of selecting *at least one* position in a window of length *α*. We refer to the vector *P* (*f, α*) of these probabilities as the *spread*.

We define the distribution of the distances between *consecutive* selected positions: *Pr*(*f*) = [*Pr*(*f*, 1), *Pr*(*f*, 2), …], where *Pr*(*f, n*) is the probability that position *i* + *n* is the next selected position given that position *i* is selected. We refer to this as the *distance distribution* of the scheme.

#### *pN* metric

The *pN* metric (*N* ∈ [0, 100]) is the *Nth* percentile of the distance distribution, i.e., it is the length *l* for which *N* % of the distances between consecutive selected positions are of length ≤ *l*.
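Both the empirical distance distribution and the *pN* metric can be computed directly from a sorted list of selected positions. The sketch below uses the nearest-rank percentile convention, which is our assumption; the function names are illustrative.

```python
import math
from collections import Counter


def distance_distribution(positions):
    """Empirical distribution of distances between consecutive selected positions."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    counts = Counter(gaps)
    return {d: c / len(gaps) for d, c in sorted(counts.items())}


def p_n(positions, n):
    """Nearest-rank N-th percentile of the distances between consecutive positions."""
    gaps = sorted(b - a for a, b in zip(positions, positions[1:]))
    if not gaps:
        raise ValueError("need at least two selected positions")
    rank = max(1, math.ceil(n / 100 * len(gaps)))
    return gaps[rank - 1]


positions = [1, 4, 7, 13, 16]            # distances: 3, 3, 6, 3
print(distance_distribution(positions))  # {3: 0.75, 6: 0.25}
print(p_n(positions, 50), p_n(positions, 100))  # 3 6
```

Here *p*100 is simply the largest distance between consecutive selected positions.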

##### 𝓁 and 𝓁_{2} metrics

Let the lengths of the uncovered gaps between *k*-mers selected by a scheme on a sequence *S* be *l*_{1}, *l*_{2}, …. We define and .

Many of these properties can be determined theoretically in expectation for given sequence and mutation models and the selection scheme. They can also be determined empirically for a specific sequence. Some metrics may also be applied either to all the positions selected by a scheme in a reference, or only to the selected positions that are *conserved* after mutation or sequencing error. We refer to the latter using the subscript *mut*, for example, 𝓁_{2,mut}.

### 2.4 Analysis of syncmer schemes – prior work

Edgar recently introduced syncmers as an alternative to minimizers and other selection schemes with the goal of improving conservation rather than density, arguing that the latter is dictated by the application and system constraints [4]. He introduced open and closed syncmers. Rotated variants of syncmers, in which the minimizer is allowed to circle around from the end of the *k*-mer to the beginning, were also introduced, but we found that they were not useful in practice and do not address them here. Analyses of syncmer densities, window guarantees, and distributions were provided in [4] for open, closed, and downsampled syncmers.

Shaw and Yu greatly extended the framework for theoretical analysis of syncmers [16]. They defined the spread and conservation of a scheme. The two are connected through the number of unmutated *k*-mers overlapping a given position, *α*(*θ, k*), for a given mutation rate, *θ*. Letting *P* (*f*) = [*P* (*f*, 1), *P* (*f*, 2), …, *P* (*f, k*)] be the spread, and *P* (*α*(*θ, k*)) = [*P* (*α*(*θ, k*) = 1), *P* (*α*(*θ, k*) = 2), …, *P* (*α*(*θ, k*) = *k*)], then *Cons*(*f, θ, k*) = *P* (*f*) · *P* (*α*(*θ, k*)). Note that there is a closed form expression for calculating *P* (*α*(*θ, k*) = *α*), and that *P* (*f*, 1) = *d*(*f*). Their theoretical framework allowed Shaw and Yu to obtain expressions for the spread (and therefore conservation) of open and closed syncmers and other selection schemes.

## 3 Analysis of parameterized syncmer schemes

### 3.1 Recursive expressions for conservation of PSSs

We extend the analysis of [16] to obtain recursive definitions of the spread for general 2-parameter schemes where the *s*-minimizer indices can take on any values. We show how to incorporate downsampling and extend the definitions further to the conservation of schemes with more parameters. We used these expressions to compute the expected conservation for all PSSs of *k*-mer lengths 11, 13, 15, 17, and 19 and mutation rates 0, 0.05, 0.15 and 0.25. These results are provided in Supplementary File 1, Table 1.

Consider a window of *α* consecutive *k*-mers. We assume randomly distributed sequence throughout. Let *s*_{β} be the *s*-minimizer in the window, at position *β*. Then if *t* is a parameter of the syncmer scheme, *s*_{β} generates a syncmer if it is not in the first *t* − 1 or last *k* − *t* positions in the window. If *β* is not in a position where it generates a syncmer, we recursively check to the left or right of *β* to see if a syncmer is generated by the *s*-minimizer of that region. See Figure 2 for an example.

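For a 1-parameter scheme *f* with *k*-mer length *k, s*-minimizer length *s*, and parameter *t* let *P* (*α*) be the probability of selecting at least one syncmer in a window of *α* adjacent *k*-mers. Then, assuming a uniformly random hash over the *s*-mers, and conditioning on the position of the *s*-minimizer of the *α*-window, *β*:

$$P(\alpha) = \sum_{\beta=1}^{t-1} p_\beta\, P(\alpha - \beta) \;+\; \sum_{\beta=t}^{\alpha+t-1} p_\beta \cdot 1 \;+\; \sum_{\beta=\alpha+t}^{\alpha+k-s} p_\beta\, P(\beta - k + s - 1)$$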

The probability of any of the *k* + *α* − *s* starting positions being the *s*-minimizer is denoted as *p*_{β} and assumed to be uniform. If *β* is in the first *t* −1 or last *k* −*t* starting positions (red regions in Figure 2A), then a syncmer may be generated by the remaining *α* − *β* positions to the right or *β* −*k* + *s*− 1 positions to the left, respectively. Note that we define $\sum_{\beta=i}^{j} (\cdot) = 0$ when *i > j* and *P* (*x*) = 0 when *x* ≤ 0.
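The recursion for a 1-parameter scheme without downsampling can be sketched with memoization. This is our reading of the description above, not the authors' code: the *s*-minimizer at position *β* generates a syncmer exactly when the *k*-mer starting at *β* − *t* + 1 lies inside the window (*t* ≤ *β* ≤ *α* + *t* − 1), *p*_{β} = 1/(*α* + *k* − *s*), and *P*(*x*) = 0 for *x* ≤ 0. Exact fractions are used so the values can be checked by hand.

```python
from fractions import Fraction
from functools import lru_cache


def spread_one_param(k, s, t, max_alpha):
    """Return [P(1), ..., P(max_alpha)] for the 1-parameter scheme with offset t."""

    @lru_cache(maxsize=None)
    def P(alpha):
        if alpha <= 0:
            return Fraction(0)
        n_pos = alpha + k - s          # number of s-mer start positions in the window
        p_beta = Fraction(1, n_pos)    # uniform position of the s-minimizer
        total = Fraction(0)
        for beta in range(1, n_pos + 1):
            if beta < t:                   # too far left: recurse on the right part
                total += p_beta * P(alpha - beta)
            elif beta <= alpha + t - 1:    # generates a syncmer inside the window
                total += p_beta
            else:                          # too far right: recurse on the left part
                total += p_beta * P(beta - k + s - 1)
        return total

    return [P(a) for a in range(1, max_alpha + 1)]


spread = spread_one_param(k=15, s=5, t=3, max_alpha=15)
print(spread[0], spread[1])  # 1/11 2/11
```

The first entry equals the density 1/(*k* − *s* + 1), as expected from *P* (*f*, 1) = *d*(*f*).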

When downsampling syncmers, there is a probability of 1*/δ* that an *s*-minimizer in the syncmer generating region (i.e. the green region in Figure 2A) really generates a syncmer. If it does not, then the left and right regions are considered recursively, yielding the following expression, where we simplify notation by letting *P* (*α* −*β*) = *P*_{R} and *P* (*β* −*k* +*s* −1) = *P*_{L}:

In the case of 2-parameter schemes, two syncmers are generated by *s*_{β} in regions that will overlap if the parameters *t*_{1} and *t*_{2} are within *s* of each other, and will be disjoint otherwise (see Figure 2B,C). Combining these two cases into a single recursive expression yields:

When downsampling is used then the 1 in the second and fifth sums is replaced by as in the one parameter case. The third sum expresses the overlapped region where either parameter creates a syncmer, when it exists. When *both* generated syncmers are downsampled then the left and right sides are recursively checked, thus the 1 is replaced by .

This expression can be greatly simplified by introducing the notation *count*(*β*) that represents the number of syncmers generated by the *s*-minimizer *s*_{β}. For example, *count*(*β*) = 0 in the red region of Figure 2 and *count*(*β*) = 2 in the overlapped region when *β* = 6 or 7 in Figure 2C. The general expression for *P* (*α*) for *any* PSS is:

Note that this definition relies on the definition *P* (*x*) = 0, *x* ≤ 0 to include the correct terms for the correct values of *β*.

The value of *P* (*α*) can thus be computed efficiently for any PSS and used to calculate the conservation using the formula from [16].

### 3.2 Choosing an appropriate metric to compare schemes

While Edgar shows convincingly that conservation is a more appropriate metric for selection schemes than density, we argue that 𝓁_{2,mut} contains additional important information for the purpose of mapping. Specifically, observe that, for given mutation rate *θ, k*, and selection scheme *f*, we have 𝔼 [𝓁_{mut,θ,f,k}] = 1 − *Cons*(*f, θ, k*). While 𝓁 (and conservation) counts the number of bases that are not covered by conserved selected *k*-mers, it treats all gap lengths equally. In contrast, 𝓁_{2,mut} penalizes a few large gaps more than many smaller gaps with the same total length. See the example in Figure 3. When the selected *k*-mers are used as seeds for mapping, it is important to avoid large gaps, in order to enable read mapping across gaps. Thus 𝓁_{2} complements 𝓁 as an indicator of how the selection scheme will affect mapping performance.
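A small numerical illustration of this point follows. It compares the total gap length with the sum of squared gap lengths for two hypothetical gap profiles. The normalization of 𝓁 and 𝓁_{2} is deliberately left out, which is an assumption of this sketch, because it does not change the comparison between sequences of equal length.

```python
def gap_totals(gaps):
    """Return (sum of gap lengths, sum of squared gap lengths)."""
    return sum(gaps), sum(g * g for g in gaps)


even = [10, 10, 10, 10]   # four moderate gaps
skewed = [37, 1, 1, 1]    # one large gap, same total length

print(gap_totals(even))    # (40, 400)
print(gap_totals(skewed))  # (40, 1372)
```

Both profiles leave 40 bases uncovered, so a linear measure cannot tell them apart, while the squared measure is more than three times larger for the profile with one long gap.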

### 3.3 Calculating the distance distribution

For a given scheme, the distribution of distances between adjacent syncmer positions is specified by *Pr*(*d* = *x*), the probability that the distance *d* is *x*. To calculate this probability, we define the new quantities *F* (*α*) and *L*(*α*) denoting the probability that *only* the first or *only* the last *k*-mer in a window of *α k*-mers is a syncmer, respectively. We refer to these *k*-mers as *K*_{1} and *K*_{α} respectively. Note that due to symmetry *F* (*α*) = *L*(*α*). Note also that 1 − *P* (*α*) gives the probability that *no k*-mer in an *α*-window is a syncmer.

We compute *F* (*α*) by conditioning on *β* as before. For simplicity we divide the sum over *β* into cases based on the syncmers that are generated by *s*_{β} rather than breaking up the sum across different values of *β*. With some abuse of notation, we let *K*_{i} represent the event that *s*_{β} generates *K*_{i} as a syncmer.

In the first case we have the probability that *K*_{1} is not downsampled, any other syncmer generated by *s*_{β} is downsampled, and there are no other syncmers generated to the right of *β*. In the second case we have the probability that any syncmers generated by *s*_{β} are downsampled, no syncmers are generated to the right of *β*, and the recursive computation of the probability that the *s*-minimizer of the segment to the left of *β* generates a syncmer at *K*_{1}.

Similarly, define *D*(*α*) to be the probability that in a window of *α k*-mers *only* the first *and* last *k*-mers are syncmers. Then

### 3.4 Calculating 𝓁_{2,mut}

To compute the desired metric 𝓁_{2,mut} we must calculate the metrics from the previous section but only with *conserved* syncmers. We add the subscript ‘*mut*’ to a value to indicate that only conserved syncmers are considered. The impact of mutations is similar to that of downsampling shown in the previous section, except that when a syncmer is lost due to mutation, the surrounding *k*-mers are also lost. In this case we consider no downsampling to make the expressions simpler.

Let Ω_{β} be the set of syncmers generated by *s*_{β}, and *ω*_{βi} be the members of this set. Note that |Ω_{β}| = *count*(*β*). Then,

For convenience we call the first probability *P*_{conserved} and the second *P*_{recurse}.

*P*_{conserved} is computed using the inclusion-exclusion principle:

Every term in this series is calculated as (1 − *θ*)^{countBases} where *θ* is the mutation rate and *countBases* counts the number of bases covered by the conserved *k*-mers (i.e. if two conserved syncmers overlap, the shared bases are counted only once).
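A single term of this kind can be computed as in the following sketch, where overlapping *k*-mers contribute their shared bases only once. The function names are illustrative, and iid substitutions with rate *θ* are assumed as in the text.

```python
def count_bases(starts, k):
    """Number of distinct bases covered by the k-mers starting at `starts`."""
    covered = set()
    for start in starts:
        covered.update(range(start, start + k))
    return len(covered)


def prob_all_conserved(starts, k, theta):
    """Probability that every listed k-mer is free of mutations."""
    return (1 - theta) ** count_bases(starts, k)


# Two 5-mers starting at positions 1 and 4 overlap in two bases and cover 8 in total.
print(count_bases([1, 4], k=5))  # 8
print(prob_all_conserved([1, 4], k=5, theta=0.15))
```

Without the overlap correction the exponent would be 10 rather than 8, which would underestimate the probability.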

*P*_{recurse} is more complicated. We again sum over all values of *β*. When Ω_{β} is empty (e.g. *β* is in the red region), then the recursion is similar to the case without mutation. Otherwise, all of the syncmers are lost due to mutation, and we additionally sum over the possible positions of the first and last points of mutation in Ω_{β}, named *f* and *l*, respectively.

Here *left* = *max*(0, *min*(*f* − *k, β* − *k* + *s* − 1)) and *right* = *max*(0, *min*(*α* − *l, α* − *β*)). We expand the joint probability as
where *y* is 1 if *l* = *f* and 2 otherwise, and *x* is the number of unmutated bases that is fixed by the given values of *f* and *l*. The conditional probability is written as
and is computed using the inclusion-exclusion principle as above.

*F* (*α*) and *D*(*α*) are extended to the case of mutation similarly:

Here we again divide into cases depending on whether there are syncmers that can be lost. We have also used recursive expressions that are similar to the above except we are given that *b* bases to the left or right of the defined region are conserved. These are calculated using similar techniques as above.

Finally, we can use these expressions to compute:

Note that, unlike *P* (*α*), which can be computed efficiently, the computation of these metrics includes an infinite sum. The sum can be truncated at an appropriate distance *x*, but there are still many more terms than in the computation of the conservation. In practice, simulating a very long sequence, selecting syncmers, and simulating mutations to determine these metrics empirically is much less time consuming and yields results that are very close to the true values. We used this simulation method to compute 𝓁_{2,mut} for *k* = 11, 13, 15, 17 and 19, mutation rates 0.05, 0.15 and 0.25, and all 2- and 3-parameter schemes, presented in Supplementary File 1, Table 2 (note that for 1-parameter schemes the best 𝓁_{2} and 𝓁 are the same, and thus already known from [16]).
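The simulation approach can be sketched as follows. This is a simplified illustration, not the code used to produce the supplementary tables: it uses a single forward strand, iid substitutions, a random *s*-mer order built from a seeded shuffle, and a much shorter sequence than one would use in practice. It returns the fraction of bases covered by conserved syncmers and the list of uncovered gap lengths, from which gap-based metrics can be derived.

```python
import random
from itertools import product


def simulate_conserved_coverage(n, k, s, params, theta, seed=0):
    """Simulate a random sequence, mutate it, and measure coverage by conserved syncmers."""
    rng = random.Random(seed)
    smers = ["".join(p) for p in product("ACGT", repeat=s)]
    rng.shuffle(smers)
    rank = {sm: r for r, sm in enumerate(smers)}  # random order on s-mers
    ref = "".join(rng.choice("ACGT") for _ in range(n))
    mut = "".join(
        rng.choice([c for c in "ACGT" if c != b]) if rng.random() < theta else b
        for b in ref
    )
    allowed = set(params)
    covered = [False] * n
    for i in range(n - k + 1):
        kmer = ref[i:i + k]
        offset = min(range(k - s + 1), key=lambda j: rank[kmer[j:j + s]]) + 1
        # A syncmer depends only on its own k-mer, so an unchanged k-mer that is
        # selected in the reference is also selected in the mutated sequence.
        if offset in allowed and mut[i:i + k] == kmer:
            for j in range(i, i + k):
                covered[j] = True
    gaps, run = [], 0
    for is_covered in covered:
        if is_covered:
            if run:
                gaps.append(run)
            run = 0
        else:
            run += 1
    if run:
        gaps.append(run)
    return sum(covered) / n, gaps


fraction, gaps = simulate_conserved_coverage(5000, 15, 5, (3, 9), theta=0.15)
print(round(fraction, 3), max(gaps))
```

Increasing `n` reduces the variance of the estimates at the cost of running time.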

### 3.5 Achieving the target compression

A simple extension of the expression for compression of open and closed syncmers yields that the compression of an *n*-parameter PSS is $(k - s + 1)/n$, where we assume that *s* is long enough relative to *k* that the *s*-minimizer is likely to be unique. As we show in Supplementary File 1, Table 4, it is preferable to achieve a specific compression with minimal downsampling. For example, the 𝓁_{2} of the best 2-parameter scheme with a downsampling rate of 2 is an order of magnitude worse than that of the best 1-parameter scheme without downsampling. Thus, to choose the best PSS for a given target compression, we can choose the one with compression closest to, but below, the desired compression and then downsample to reach the desired compression.
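The selection rule can be sketched numerically. The sketch assumes that an *n*-parameter scheme has density *n*/(*k* − *s* + 1), in line with the known densities 1/(*k* − *s* + 1) and 2/(*k* − *s* + 1) of open and closed syncmers, and that the downsampling rate needed is the ratio of target to scheme compression. It chooses only *s* and the number of parameters *n*; which positions to use would still be decided by 𝓁_{2,mut}. All names are illustrative.

```python
from fractions import Fraction


def pss_compression(k, s, n):
    """Compression of an n-parameter scheme, assuming a unique s-minimizer."""
    return Fraction(k - s + 1, n)


def choose_scheme(k, target, s_values, max_params=3):
    """Pick (s, n) whose compression is closest to, without exceeding, the target.

    Returns (s, n, compression, downsampling rate), or None if nothing fits.
    """
    best = None
    for s in s_values:
        for n in range(1, max_params + 1):
            if n > k - s + 1:
                continue  # cannot have more parameters than s-mer positions
            c = pss_compression(k, s, n)
            if c <= target and (best is None or c > best[0]):
                best = (c, s, n)
    if best is None:
        return None
    c, s, n = best
    return s, n, c, Fraction(target) / c


print(pss_compression(15, 5, 2))       # 11/2, i.e. compression 5.5
print(choose_scheme(15, 20, [4, 5]))   # s = 4, n = 1, compression 12, rate 5/3
```

For *k* = 15 and *s* = 5, two parameters give compression 5.5, the same theoretical compression as minimizers with *w* = 10.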

## 4 Methods

We modified the code of minimap2 (v2.22-r1105-dirty) and Winnowmap2 (v2.03) to select our syncmer variants as seeds instead of minimizers. The code is available from https://github.com/Shamir-Lab/syncmer_mapping.

### 4.1 Syncmer schemes implementation

The implementation of the syncmer schemes defined in Section 2 is straightforward. Sequences are scanned from left to right, the canonical *k*-mer at each position is identified using a random hash *h*_{1}, and the index of the minimum *s*-mer under another random hash *h*_{2} is determined.

For downsampled schemes, syncmers are selected if their hash value normalized between 0 and 1 is below the downsampling rate. Note that a different hash function than *h*_{1} must be used to ensure random downsampling. Windowed schemes are integrated into the minimizer selection scheme of the mappers except that syncmers are selected in each window first. If no syncmer is present, then the minimizer is selected.

Pseudocode describing these implementations is presented in Algorithms 1 and 2.

Additional implementation and optimization details are presented in Supplementary section S1.

## 5 Results

We evaluated different PSSs on real genomes and compared them to theoretical results from Section 3. We also evaluated PSS-based mapping compared to the original minimizer-based versions of minimap2 and Winnowmap2 on simulated and real read data.

The sequences used for these experiments were: human GRCh38.p13 [14], human chromosome X from CHM13 (v1.0), *E. coli* K12 [1], and a microbial sample BAC containing assemblies of 15 microbes for which PacBio long read data is available [12] (three of the microbes were used in [16], see Supplementary Section S3 for more details on the samples selected). Information about the sequences is presented in Table 1.

We simulated PacBio and ONT reads from Chromosome X and from BAC with a depth of 10. Details of simulation parameters are found in the Supplement Section S2. For real datasets we selected a random set of 10K ONT reads of the NA12878 cell line with read length capped at 10kb (SRA accession ERR3279003), and 1K PacBio reads for each of the BAC microbes [12]. Details are available in Table 2.

### 5.1 Properties of parameterized syncmer schemes

Our theoretical analysis of PSS properties (Section 3) relies on a number of assumptions. Specifically, it assumes uniform iid sequences and mutations, assumes only substitution mutations, and treats the sequence as a single forward strand. We therefore examined the properties of PSSs on real genomes where these assumptions do not necessarily apply, and compared them to minimizer schemes.

We used *k* = 15 and selected the best syncmer schemes (based on 𝓁_{2,mut}) with theoretical compression 5.5 and 10. The default minimizer scheme of minimap2 uses *k* = 15, *w* = 10 yielding the theoretical compression of 5.5. A theoretical compression of 10 is achieved by minimap2 with *w* = 19. The properties of these schemes on the ECK12 and CHM13X sequences without mutation are shown in Supplementary section S4, Table S1. For *unmutated* reference genomes, minimizers outperformed syncmers, with much lower 𝓁_{2} and *p*100 values for schemes with the same compression.

To test the schemes on *mutated* sequences, we selected *k*-mers using the different schemes, simulated iid substitutions to the CHM13X sequence at a rate of 15%, and computed the properties of the conserved *k*-mers selected by the schemes. The performance is summarized in Table 3. Under mutation the advantage of syncmers is clear: syncmers have 19-33% more conserved positions and better performance in all metrics.

The windowed and downsampled variants are shown in Supplementary Tables S3 and S2. As expected, shorter window lengths require more selected positions to be filled by minimizers and have markedly lower 𝓁_{2} than the unwindowed versions. With mutations, windowed syncmers with short window lengths do even better than the unwindowed PSSs, even with relatively few conserved minimizer positions (see Table S3).

Figure 4 shows the distribution of distances between selected positions for *Sy*_{15,5}(3, 9) on CHM13X. Figure 4A shows the distance distribution of syncmers selected only using forward strand *k*-mers. It matches the theoretical distribution from Section 3 closely, with a minimum distance of 3 and a sharp peak at 6. In mapping, the read orientation is unknown and canonical syncmers are used. Figure 4B shows the results using canonical syncmers. The distance distribution still retains the peak at 6 and a local maximum at 3, but now adjacent positions are selected, and it has a much longer tail of distances, as reflected also in the *p*100 values shown in Supplementary Table S1.

### 5.2 The fraction of unmapped reads

We mapped reads using minimap2 and Winnowmap2 with ℳ_{15,10} (low compression), ℳ_{15,50} (medium), and ℳ_{15,100} (high) on 4 datasets. For each dataset, the syncmer-minimap and syncmer-winnowmap parameters were selected to have the best performance based on theoretical 𝓁_{2,mut} for the same compression achieved by minimap2. In all cases this resulted in *Sy*_{15,5}(3, 9) matching the low compression, and *Sy*_{15,4}(6) matching the medium and high compression. The other scheme parameters were manually selected to closely match the real compression. The exact compression, window length, and downsampling rates are given in Supplementary File 1, Table 5.

Figure 5 shows the percentage of unmapped reads achieved by each of the mappers for simulated PacBio and ONT reads mapped to the human reference genome. See Supplementary Figure S7 for additional results, including windowed mappers. Syncmer variants performed essentially the same as or better than the original mappers in all cases, with a marked advantage at high compression. All mappers did much better on the PacBio reads than on ONT reads, which have a higher proportion of deletions and substitutions. The jump in the fraction of unmapped reads between medium and high compression may indicate that in order to overcome the large fraction of non-conserved seeds, existing mappers need to use a lower compression with many redundant seeds.

We compared the performance of all mappers on real data (Table 2) across a range of compression values. The ONT reads were mapped against the reference GRCh and the PacBio bacterial reads were mapped against the BAC reference. See Figure 6. The syncmer variants consistently outperformed the original minimizer-based mappers, with syncmer-winnowmap performing the best across the larger part of the range. Full results and scheme parameters are given in Supplementary File 1, Table 6. For high compression, the minimizers had 20-40% more unmapped reads than the syncmers. At low compression rates of 5.5 − 11, minimizers had 2-15% more unmapped reads than syncmers.

### 5.3 Mapping correctness

We evaluated the mapping correctness for PacBio simulated reads as done in [6] (see Supplementary section S2 for details). The percentages of incorrectly mapped reads simulated from the CHM13X and BAC genomes are shown in Figure 7. Winnowmap consistently performed better than minimap, and the syncmer variants of Winnowmap performed best overall.

Although we cannot evaluate the mapping correctness on the real datasets, the mapping quality scores can be used to compare the different mappers. On the four real datasets, reads mapped by syncmer-minimap but not by minimap2 generally had higher mapping quality than those mapped by minimap2 and not syncmer-minimap. For example, for the human cell line ONT reads, comparing minimap2 with ℳ_{15,50} to corresponding syncmer-minimap, 39 minimap-only reads had average mapping quality 31.4 (median 27), while 94 syncmer-minimap-only reads had an average quality score of 38.7 (median 42.5). Full results for different compression rates are presented in Supplementary File 1, Table 7.
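The comparison of exclusively mapped reads can be sketched as follows. This is a minimal illustration, assuming that each mapper's output has already been reduced to a dictionary from read name to mapping quality for the reads it mapped. The function name and the toy values are hypothetical and are not the data reported above.

```python
from statistics import mean, median


def exclusive_mapq_summary(mapq_a, mapq_b):
    """Summarise mapping quality of reads mapped by only one of two mappers.

    mapq_a, mapq_b: dict of read name -> MAPQ for the reads each mapper mapped.
    Returns (summary for A-only reads, summary for B-only reads).
    """
    only_a = [q for name, q in mapq_a.items() if name not in mapq_b]
    only_b = [q for name, q in mapq_b.items() if name not in mapq_a]

    def summarise(values):
        if not values:
            return {"n": 0, "mean": None, "median": None}
        return {"n": len(values), "mean": mean(values), "median": median(values)}

    return summarise(only_a), summarise(only_b)


# Hypothetical toy input: r1 is mapped by both tools and is therefore ignored.
minimap_only, syncmer_only = exclusive_mapq_summary(
    {"r1": 60, "r2": 10, "r3": 20},
    {"r1": 60, "r4": 40, "r5": 50},
)
```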

### 5.4 Impact of sequence identity level

We examined the impact of the level of identity between the sequenced reads and the reference to which they are aligned. Differences between the sequences can be due to the high sequencing error rate of long reads, mutations in the sequenced organism, or differences between sequenced and reference strains. We simulated 1000 PacBio reads from CHM13X at percent identity 65%, 75%, and 95% in addition to the default 87% used above. The results are shown in Figure 8 and S3. For minimap2 and Winnowmap2 we used ℳ_{15,50}, and in the syncmer variants we used *Sy*_{15,4}(6) with the other parameters selected as above to match the compression of minimap2.

The syncmer variants outperformed the original tools in terms of fraction of reads mapped, with larger gains as percent identity decreases. This highlights the impact of the increased conservation of syncmers. All tools performed very well at higher percent identity, indicating that more than enough seeds were selected and conserved to adequately map all reads (and thus perhaps compression could be increased). Winnowmap2 performed noticeably worse at lower percent identity, leaving almost all reads unmapped at 65% identity. Syncmer-minimap outperformed minimap2 on the fraction of correctly mapped reads in all cases. Winnowmap2 correctly mapped a larger fraction of the mapped reads at 75% identity, but mapped only 35% of the reads, compared to ≥ 95% for the other variants. At 95% identity the syncmer variants had fewer incorrectly mapped reads.

### 5.5 Performance of windowed syncmer schemes

Windowed schemes combine syncmers and minimizers, allowing for a syncmer scheme to have a window guarantee with a relatively short window. In practice the windowed variants of our syncmer mappers performed very similarly to or slightly worse than the variants without windowing for the same compression. Supplementary section S5 presents all of the results on the windowed variants of the mappers.

### 5.6 Runtime and memory

We compared the runtime and memory usage of the six tested mappers on the PacBio and ONT simulated reads from bacteria and human. Table 4 shows the performance for three different tasks. The third task maps bacterial reads against the human genome, to show the effect on timing of many unmapped reads. All experiments were performed on a 44-core, 2.2 GHz server with 792 GB of RAM, using 50 threads. Peak RSS (in GB) and real time (in seconds) as measured by the tools are reported. For minimap2 and Winnowmap2, ℳ_{15,10} was used; for the syncmer variants, *Sy*_{15,5}(3, 9) was used with the same parameters matched to the minimizers as above.

In all cases, syncmer-winnowmap used the least RAM, at the cost of higher runtimes, and minimap2 achieved the fastest mapping, at the cost of higher RAM usage. Syncmer-minimap achieved a balance with the second fastest runtime and second or third lowest memory usage among the six tools.

We also looked at the runtimes and memory usage of all runs of the continuous compression experiment shown in Figure 6. Results are shown in Figures 9 and 10 (see also Supplementary Figures S5 and S6 for the windowed variants). minimap2 was consistently the fastest, followed by syncmer-minimap, which took 50-100% longer. Interestingly, the two datasets show exactly opposite trends in memory usage (Figure 10). This is because the bacterial reference genomes are relatively short, and thus the memory bottleneck is in the mapping stage, while for the human reference genome the memory bottleneck is in the indexing stage. Increasing compression lowers index size but results in longer alignments between anchors, requiring more memory in the mapping phase. Thus, when indexing is the bottleneck, increasing compression reduces memory, while when mapping is the bottleneck it increases memory. Winnowmap2 and its variants used less memory in the mapping phase while minimap2 and its variants used less memory in the indexing phase. In the case that indexing is the bottleneck, the syncmer variants had lower memory usage than the original mappers across most of the range of compression values (Figure 10B).

## 6 Discussion

In this study we generalized the notion of syncmers to PSSs and derived their theoretical properties. We incorporated PSSs into the long read mappers minimap2 and Winnowmap2. Our syncmer mappers outperformed minimap2 and Winnowmap2 and succeeded in mapping more long reads across a range of different compression values for multiple real and simulated datasets.

As our results show, the advantage of using syncmers is most marked at high compression and high error rate, as is expected due to their higher conservation. Yet the advantage is present at the lower compression used by existing mappers. For large genomes, such as the human genome, using the higher compression enabled by syncmers also leads to lower RAM usage. Syncmer-minimap is slower than the highly optimized minimap2, taking 50-100% longer to map reads, but it is faster than Winnowmap2. Future work should focus on lowering the runtime by further optimizing the syncmer mapping implementation.

There are a number of issues and questions that this work leaves open, particularly in the theoretical analysis. First, the analysis of windowed schemes and downsampled schemes under mutation remains to be completed. Second, an expression for 𝓁_{2} for minimizer schemes could also be obtained. Third, can the theory be expanded to canonical *k*-mers? Fourth, it may be possible to obtain more robust definitions of conservation and 𝓁_{2} that do not depend on preserving indices between sequences, thereby allowing indels to be included in the theoretical analysis.

Another possible avenue to explore is in the definition of the selection scheme itself. Is it possible to select *k*-mers in a biased way to increase the compression but still retain the beneficial distance distribution of syncmer schemes? The quest for an “optimal” scheme is not over.

## Supplementary information

### S1 Syncmer based mapping implementations

Modifications to the mappers were minimal. Only the code that selects the *k*-mers to use as seeds to index the reference sequence and as anchors from the query reads was modified. Here we describe implementation details and optimizations in the code that differ from the high-level descriptions in Algorithms 1 and 2.

In minimap2 the most frequent minimizers (0.02% by default) are dropped to reduce spurious matches and lower the runtime and memory usage. We also drop the most frequent selected *k*-mers as the last stage of all minimap syncmer variants for consistency. In Winnowmap, the most frequent *k*-mers (also 0.02% by default) are re-weighted in the minimizer order so they are less likely to be selected as minimizers. In the syncmer-winnowmap variant, we do not consider *k*-mer weighting, and thus we simply drop these *k*-mers if they are selected. However, in the *windowed* syncmer-winnowmap variant we do re-weight the frequent *k*-mers before selecting minimizers in empty windows.

We use several different hashes in our syncmer variants of the mappers: *h*_{can} to select canonical *k*-mers, *h*_{s} to select *s*-minimizers, *h*_{min} to select minimizers for windowed variants, and *h*_{down} for downsampling. We require that *h*_{can} ≠ *h*_{down} to maintain random downsampling. In the minimap syncmer variants we use hash64 from minimap2 for *h*_{min}, and for the other hashes a variant of MurmurHash2 modified so that *murmur*2(0) ≠ 0, which preserves randomness. Thus *h*_{min}(*x*) = −*hash*64(*x*)/UINT64_MAX, *h*_{s}(*x*) = *murmur*2(*x*), *h*_{down}(*x*) = *murmur*2(*x*), and *h*_{can}(*x*) = *murmur*2(*x* << 1 + 5), which ensures that it takes a different value than *h*_{down}. For the winnowmap variants we use *h*_{can}(*x*) = *lexicographic*(*x*), as this is what is used by the *k*-mer counter Meryl, and *h*_{min}(*x*) = −(*murmur*2(*x*)/UINT64_MAX)^{8} when *x* is one of the most frequent *k*-mers and −*murmur*2(*x*)/UINT64_MAX otherwise. The other hashes are as in the minimap variants.
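The effect of the eighth power in the Winnowmap ordering can be illustrated with a short sketch. It assumes that the *k*-mer with the smallest *h*_{min} value in a window is the one selected, and it uses a BLAKE2 digest as a stand-in for MurmurHash2. Both the stand-in hash and the function names are assumptions made for illustration.

```python
import hashlib

UINT64_MAX = 2**64 - 1


def unit_hash(kmer):
    """Stand-in for murmur2(x)/UINT64_MAX: maps a k-mer to a value in [0, 1]."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / UINT64_MAX


def h_min_weighted(kmer, frequent_kmers):
    """Weighted order value: -(u^8) for frequent k-mers and -u otherwise."""
    u = unit_hash(kmer)
    return -(u ** 8) if kmer in frequent_kmers else -u


# Because 0 <= u <= 1 implies u^8 <= u, a frequent k-mer always receives a
# value at least as large as it would unweighted, so it is less likely to be
# the minimum in its window.
kmer = "ACGTACGTACGTACG"  # hypothetical 15-mer
plain = h_min_weighted(kmer, frequent_kmers=set())
weighted = h_min_weighted(kmer, frequent_kmers={kmer})
```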

In all windowed variants, downsampling occurs before filling in empty windows with minimizers.
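The fill-in step can be sketched as follows, under the assumption that each window of *w* consecutive *k*-mers is checked against the already downsampled selection and that an empty window receives its minimizer under *h*_{min}. Algorithm 2 is not reproduced in this section, so the function below, its name, and its tie-breaking rule are illustrative assumptions rather than the mappers' code.

```python
def fill_empty_windows(selected, hashes, w):
    """Add a minimizer to every window of w consecutive k-mers with no selected k-mer.

    selected: k-mer positions kept after syncmer selection and downsampling.
    hashes:   one h_min value per k-mer position in the sequence.
    w:        window length, counted in k-mers.
    """
    kept = set(selected)
    result = set(selected)
    for start in range(len(hashes) - w + 1):
        window = range(start, start + w)
        if not any(pos in kept for pos in window):
            # Ties are broken in favour of the leftmost position (an assumption).
            result.add(min(window, key=lambda pos: (hashes[pos], pos)))
    return sorted(result)


# Hypothetical toy input: ten k-mers, of which only the first and last survive
# downsampling. Every window of three k-mers then contains a selected position.
toy_hashes = [5, 3, 8, 1, 9, 7, 2, 6, 4, 0]
filled = fill_empty_windows([0, 9], toy_hashes, w=3)
```

Applying downsampling first, as described above, means the window guarantee holds for the final seed set. Downsampling after the fill-in could remove the minimizers that provide the guarantee.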

ONT reads were mapped using the `map-ont` option in all mappers, while PacBio reads were mapped using the `map-pb` option (`map-pb-clr` in Winnowmap and variants). The latter uses homopolymer compression (HPC) and thus has a real compression (on the non-HPC sequence) that is above the theoretical one.
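Homopolymer compression collapses each run of identical bases to a single base before *k*-mers are selected. A minimal sketch is given below; the function name is hypothetical and the mappers implement this step internally.

```python
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse every run of identical bases into a single base."""
    return "".join(base for base, _run in groupby(seq))


# Hypothetical toy read: the compressed sequence is shorter than the original.
original = "AAACCGTTTTA"
compressed = homopolymer_compress(original)
```

Since seeds are chosen on the shorter compressed sequence, the number of seeds per base of the original sequence is lower, which is why the real compression exceeds the theoretical one.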

### S2 Simulation parameters and details

PacBio reads were simulated using PBSim [11] with error rates and read lengths roughly matched to the statistics observed in a recent benchmark of long read correction methods [3] unless otherwise indicated.

The CLR model was used with the following parameter settings: depth 10, mean length 9000, length std 6000, minimum length 100, maximum length 40000, mean accuracy 0.87, accuracy std 0.02, minimum accuracy 0.85, maximum accuracy 1, and difference ratio 10:48:19.

ONT reads were simulated using NanoSim [18] with default parameters and the human pre-trained model for Guppy base calls.

To evaluate mapping correctness for the PacBio simulated data we used the mapeval utility of paftools packaged with minimap2. In this tool, reads are considered correctly mapped if the overlap between the read alignment and the true read location is ≥ 10% of the combined length of the true read and aligned read interval. This criterion was also used in [6].
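The overlap criterion can be written out as a short check. This is a sketch only: it reads "combined length" as the length of the union of the two intervals, uses half-open coordinates, and omits the reference-name comparison, all of which are assumptions rather than a transcription of the mapeval code.

```python
def is_correctly_mapped(true_start, true_end, aln_start, aln_end, min_ratio=0.1):
    """Return True if the overlap covers at least min_ratio of the combined interval.

    Coordinates are half-open [start, end) positions on the same reference sequence.
    """
    overlap = min(true_end, aln_end) - max(true_start, aln_start)
    if overlap <= 0:
        return False  # The intervals do not intersect.
    combined = max(true_end, aln_end) - min(true_start, aln_start)
    return overlap / combined >= min_ratio


# Hypothetical coordinates: 500 overlapping bases out of a 1500-base union.
ok = is_correctly_mapped(100, 1100, 600, 1600)
```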

### S3 Bacterial species used

We chose a single representative assembly of each strain from [12] with the fewest, longest and most highly covered contigs and concatenated all references into a single fasta file. Reads from the same samples were downloaded. Assemblies and reads from the following samples were used:

bc1019, Bacillus cereus 971 (ATCC 14579)

bc1059, Bacillus subtilis W23

bc1101, Burkholderia cepacia (ATCC 25416)

bc1102, Enterococcus faecalis OG1RF (ATCC 47077D-5)

bc1111, Escherichia coli K12

bc1087, Escherichia coli W (ATCC 9637)

bc1018, Helicobacter pylori J99 (ATCC 700824)

bc1077, Klebsiella pneumoniae (ATCC BAA-2146)

bc1082, Listeria monocytogenes (ATCC 19117)

bc1043, Methanocorpusculum labreanum Z (ATCC 43576)

bc1047, Neisseria meningitidis FAM18 (ATCC 700532)

bc1054, Rhodopseudomonas palustris

bc1119, Staphylococcus aureus HPV (ATCC BAA-44)

bc1079, Staphylococcus aureus subsp. aureus (ATCC 25923)

bc1052, Treponema denticola A (ATCC 35405)

### S4 Properties of syncmer schemes on real genome sequences without mutation

The theoretical properties measured on real genomes (without mutation) are shown in Table S1.

### S5 Windowed syncmer scheme results

Tables S2 and S3 present the properties of windowed syncmer schemes on real genome sequences with and without mutation, respectively.

Figures S1 and S2 present the number of unmapped reads and wrongly mapped reads for simulated datasets. These correspond to Figures 5 and 7 and include the results for windowed variants. Figure S3 presents the impact of percent sequence identity on the windowed variants as well, corresponding to Figure 8.

Results on the real human and bacterial reads are presented in Figure S4, and the runtimes and RAM usage for these runs are in Figures S5 and S6. The runtime and memory usage on different tasks for the windowed variants is presented in Table S4.

### S6 Supplemental performance results

Figure S7 shows additional results for the number of unmapped reads at low, medium, and high compression rates.

## Acknowledgement

Study supported in part by the Israel Science Foundation (grant No. 3165/19, within the Israel Precision Medicine Partnership program, and grant No. 1339/18) and by Len Blavatnik and the Blavatnik Family Foundation. DP was supported in part by a fellowship from the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University.