## Abstract

Sketching methods offer computational biologists scalable techniques to analyze data sets that continue to grow in size. MinHash is one such technique that has enjoyed recent broad application. However, traditional MinHash has previously been shown to perform poorly when applied to sets of very dissimilar sizes. *FracMinHash* was recently introduced as a modification of MinHash to compensate for this lack of performance when set sizes differ. While experimental evidence has been encouraging, FracMinHash has not yet been analyzed from a theoretical perspective. In this paper, we perform such an analysis and prove that while FracMinHash is not unbiased, this bias is easily corrected. Next, we detail how a simple mutation model interacts with FracMinHash and are able to derive confidence intervals for evolutionary mutation distances between pairs of sequences as well as hypothesis tests for FracMinHash. We find that FracMinHash estimates the containment of a genome in a large metagenome more accurately and more precisely when compared to traditional MinHash, and the confidence interval performs significantly better in estimating mutation distances. A Python-based implementation of the theorems we derive is freely available at https://github.com/KoslickiLab/mutation-rate-ci-calculator. The results presented in this paper can be reproduced using the code at https://github.com/KoslickiLab/ScaledMinHash-reproducibles.

## 1 Introduction

Sketching-based approaches in recent years have been successfully applied to a variety of genomic and metagenomic analysis tasks, due in large part to such methods incurring low computational burden when applied to large data sets. For example, Mash [24] is a MinHash [7]-based approach that was used to characterize the similarity between all pairs of RefSeq genomes in less than 30 CPU hours. Such efficiency gains are due primarily to sketching-based approaches recording a small subsample (or modification thereof) of the data in such a fashion that some distance metric is approximately preserved, a process known as locality sensitive hashing. In bioinformatics, this has resulted in improvements to error correction [27, 23], assembly [9, 3, 14, 12], alignment [17, 22], clustering [30, 10, 26, 18], classification [20, 19, 6], and so on. Importantly, the accuracy and efficiency of sketching approaches can frequently be characterized explicitly, allowing practitioners to balance between efficiency improvements and accuracy. Often, these theoretical guarantees dictate that certain sketching approaches are well suited only to certain kinds of data. For example, MinHash, which is used in many of the aforementioned applications, has been shown to be particularly well-suited to quantify the similarity of sets of roughly the same size, but falters when sets of very different sizes are compared [18]. This motivated the introduction of the containment MinHash, which utilizes a MinHash sketch of the smaller set along with an additional probabilistic data structure (a Bloom filter [5]) to store the larger set. While this improved speed and accuracy, this approach can become quite inconvenient for large sets due to requiring a Bloom filter to be created for the larger of the two sets. To ameliorate this, an approach called the “FracMinHash” was recently introduced [15, 16] that modifies the MinHash sketch size to scale naturally with the size of the underlying data.
This has been implemented in the software package Sourmash [8], which uses these FracMinHash sketches to facilitate genomic and metagenomic similarity assessment (`sourmash compare`), metagenomic taxonomic classification (`sourmash gather`), and database searches (`sourmash search`) [26]. Independently, and more recently, the same concept of FracMinHash was introduced by Ekim et al. [12], there under the name *universe minimizer*.

While there is ample computational evidence for the superiority of FracMinHash when compared to the classic MinHash, particularly when comparing sets of different sizes, no theoretical characterization about the accuracy and efficiency of the FracMinHash approach has yet been given. In this manuscript, we address this missing characterization of accuracy and efficiency by deriving a number of theoretical guarantees. In particular, we demonstrate that the FracMinHash approach, as originally introduced, requires a slight modification in order to become an unbiased estimator of the containment index. After this, we characterize the statistics of this unbiased estimator and derive an asymptotic normality result for FracMinHash. This in turn allows us to derive confidence intervals and hypothesis tests for this estimator when considering a simple mutation model (which is related to the commonly used Average Nucleotide Identity score). We also characterize the likelihood of experiencing an edge case when analyzing real data which allows us to provide a level of confidence along with the estimated containment index. Finally, we support the theoretical results with additional experimental evidence and compare our approach to the frequently used Mash distance [24].

A Python-based implementation of the theorems we derive is freely available at https://github.com/KoslickiLab/mutation-rate-ci-calculator.

### A note on naming

As Phil Karlton is reported to have said [1]: “There are only two hard things in Computer Science: cache invalidation and naming things.” The latter certainly holds true in computational biology as well. As noted above, the concept discussed herein has been defined similarly and independently by different authors. Ekim et al. [12] referred to the concept as *universe minimizers*, Irber et al. [16, 26, 8] called it *scaled MinHash*, and Edgar [11] called them *min code syncmers*. A recent Twitter thread [29] involving these authors and others coalesced on the following definitions: *FracHash* is any sketching technique that takes a defined fraction of the hashed elements. As such, Broder’s [7] ModHash is one such example of a FracHash. A *FracMinHash* is then a sketch that takes a fraction of the hashed elements, specifically those that hash to a value below some threshold (hence the “min”).

## 2 FracMinHash and its statistics

We begin first by introducing the definition of FracMinHash using a slight modification of the definition contained in [16]. We compute the expectation of FracMinHash, find that it is nearly, but not exactly, an unbiased estimator of the containment index, and then compute the variance of the unbiased version. We conclude this section by showing asymptotic normality of FracMinHash, a result that will be used in a subsequent section to derive confidence intervals and hypothesis tests for FracMinHash under a simple mutation model.

### 2.1 Definitions and preliminaries

We recall the definition of FracMinHash given in [15] and reiterate its expected value before extending the statistical analysis of this quantity. Given two arbitrary sets *A* and *B* which are subsets of a domain Ω, the containment index *C*(*A, B*) is defined as

*C*(*A, B*) := |*A* ⋂ *B*| / |*A*|.

Let *h* be a perfect hash function *h*: Ω → [0, *H*] for some *H* ∈ ℝ⁺. For a *scale factor s* where 0 ≤ *s* ≤ 1, a FracMinHash sketch of a set *A* is defined as follows:

**FRAC**_{s}(*A*) = { *h*(*a*) : *a* ∈ *A* and *h*(*a*) ≤ *Hs* }.

The scale factor *s* is a tunable parameter that can modify the size of the sketch. Using this FracMinHash sketch, we define the FracMinHash estimate of the containment index as follows:

*Ĉ*_{frac}(*A, B*) := |**FRAC**_{s}(*A*) ⋂ **FRAC**_{s}(*B*)| / |**FRAC**_{s}(*A*)|.   (2)
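These definitions can be sketched directly in Python. The snippet below is an illustrative toy implementation, not the Sourmash code: the hash function, its 64-bit range *H*, and the example element strings are all placeholder assumptions.

```python
import hashlib

H = 2**64  # assumed hash range [0, H); real implementations choose their own

def _hash(element: str) -> int:
    # stand-in for a perfect, uniformly distributed hash function h: Ω → [0, H)
    return int.from_bytes(hashlib.blake2b(element.encode(), digest_size=8).digest(), "big")

def frac_sketch(elements, s: float) -> set:
    # FRAC_s(A): keep exactly those elements whose hash falls at or below H*s
    return {_hash(e) for e in elements if _hash(e) <= H * s}

def containment_estimate(A, B, s: float) -> float:
    # eq. (2): |FRAC_s(A) ∩ FRAC_s(B)| / |FRAC_s(A)|  (the biased estimate)
    sketch_a, sketch_b = frac_sketch(A, s), frac_sketch(B, s)
    return len(sketch_a & sketch_b) / len(sketch_a) if sketch_a else 0.0
```

Note that, unlike classic MinHash, the sketch size here is not fixed: it is binomially distributed with mean *s*|*A*|.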

For notational simplicity, we define *X*_{A} := |**FRAC**_{s}(*A*)|. Observe that if one views *h* as a uniformly distributed random variable, we have that *X*_{A} is distributed as a binomial random variable: *X*_{A} ~ Binom(|*A*|, *s*). Furthermore, if *A* ⋂ *B* = ∅ where both *A* and *B* are non-empty sets, then *X*_{A} and *X*_{B} are independent when the probability of success is strictly smaller than 1. Using these notations, we compute the expectation of eq. (2), recapitulated from [15] for completeness.

**Theorem 1.** *For* 0 < *s* < 1, *if A and B are two non-empty sets such that A \ B and A ⋂ B are non-empty, then*

E[ *X*_{A⋂B} / (*X*_{A⋂B} + *X*_{A\B}) · 𝟙_{X_{A} > 0} ] = (|*A* ⋂ *B*| / |*A*|) · (1 − (1 − *s*)^{|A|}).

*Proof*. Using the notation introduced previously, observe that *X*_{A} = *X*_{A⋂B} + *X*_{A\B} and that the random variables *X*_{A⋂B} and *X*_{A\B} are independent (which follows directly from the fact that *A* ⋂ *B* and *A* \ *B* are non-empty, distinct sets). We will use the following fact from standard calculus: for *x* > 0,

1/*x* = ∫_{0}^{∞} e^{−tx} d*t*.

Then using the moment generating function of the binomial distribution, for *X* ~ Binom(*n, s*) we have

E[e^{−tX}] = (1 − *s* + *s*e^{−t})^{n}.

We also know by continuity that

E[*X*e^{−tX}] = −(d/d*t*) E[e^{−tX}] = *ns*e^{−t}(1 − *s* + *s*e^{−t})^{n−1}.

Using these observations, with *n* = |*A* ⋂ *B*| and *m* = |*A* \ *B*| (so that *n* + *m* = |*A*|), we can then finally calculate that

E[ *X*_{A⋂B}/(*X*_{A⋂B} + *X*_{A\B}) · 𝟙_{X_{A} > 0} ] = ∫_{0}^{∞} E[*X*_{A⋂B} e^{−tX_{A⋂B}}] · E[e^{−tX_{A\B}}] d*t* = ∫_{0}^{∞} *ns*e^{−t}(1 − *s* + *s*e^{−t})^{n+m−1} d*t* = (*n*/(*n* + *m*)) · (1 − (1 − *s*)^{n+m}),

where Fubini’s theorem is used to exchange the expectation and the integral, and independence is used to factor the expectation.
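Theorem 1 can be checked numerically. The following Monte Carlo sketch (illustrative only; the parameter choices are arbitrary) samples *X*_{A⋂B} ~ Binom(*n, s*) and *X*_{A\B} ~ Binom(*m, s*) and compares the empirical mean of the ratio against *C*(*A, B*) scaled by the bias factor:

```python
import random

def mean_ratio(n, m, s, trials=200_000, seed=0):
    # empirical E[X_{A∩B}/(X_{A∩B}+X_{A\B}) · 1{X_A > 0}]
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = sum(rng.random() < s for _ in range(n))  # X_{A∩B} ~ Binom(n, s)
        y = sum(rng.random() < s for _ in range(m))  # X_{A\B} ~ Binom(m, s)
        if x + y > 0:
            total += x / (x + y)
    return total / trials

n, m, s = 6, 4, 0.2              # tiny |A| = n + m makes the bias visible
c_true = n / (n + m)             # containment index C(A, B)
bias = 1 - (1 - s) ** (n + m)    # bias factor from Theorem 1
est = mean_ratio(n, m, s)
```

With these parameters the raw estimate concentrates near *C*(*A, B*) · (1 − (1 − *s*)^{|A|}) ≈ 0.54 rather than *C*(*A, B*) = 0.6.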

In light of Theorem 1, we note that eq. (2) is *not* an unbiased estimate of *C*(*A, B*). This may explain the observations in [16] that showed the uncorrected version in eq. (2) leads to suboptimal performance for short sequences (e.g., viruses). However, for sufficiently large |*A*| and *s*, the bias factor (1 − (1 − *s*)^{|A|}) is sufficiently close to 1. Alternatively, if |*A*| is known (or estimated, e.g., by using HyperLogLog [13]), then

*C*_{frac}(*A, B*) := |**FRAC**_{s}(*A*) ⋂ **FRAC**_{s}(*B*)| / ( |**FRAC**_{s}(*A*)| · (1 − (1 − *s*)^{|A|}) )   (14)

is an unbiased estimate of the containment index *C*(*A, B*). Throughout the rest of the paper, we will refer to the debiased *C*_{frac}(*A, B*) as the *fractional containment index*. We now turn to calculating the expectation and variance of the fractional containment index *C*_{frac}(*A, B*).

### 2.2 Mean and variance of *C*_{frac}(*A, B*)

The expectation of *C*_{frac} (*A, B*) follows directly from Equation (14) and Theorem 1.

**Corollary 2.** *For* 0 < *s* < 1, *if A and B are two distinct sets such that A* ⋂ *B is non-empty, the expectation of C*_{frac}(*A, B*) *is given by*

E[*C*_{frac}(*A, B*)] = |*A* ⋂ *B*| / |*A*| = *C*(*A, B*).

We now turn to determining the variance of *C*_{frac}(*A, B*). Observing the independence of *X*_{A⋂B} and *X*_{A\B} given that the intersection of *A* and *B* is non-empty, ideally we could determine the variance of *C*_{frac}(*A, B*) using the associated multivariate probability mass function. However, doing so does not result in a closed-form formula. Therefore, we use Taylor expansions to approximate the variance.

**Theorem 3.** *For n* = |*A* ⋂ *B*| *and m* = |*A* \ *B*| *where both m and n are non-zero, a first order Taylor series approximation gives*

Var[ *X*_{A⋂B} / (*X*_{A⋂B} + *X*_{A\B}) ] ≈ *nm*(1 − *s*) / (*s*(*n* + *m*)³).

Using the results of Theorem 3, we have the variance of *C*_{frac}(*A, B*) as follows.

**Corollary 4.** *For n* = |*A* ⋂ *B*| *and m* = |*A* \ *B*| *where both m and n are non-zero, a first order Taylor series approximation gives*

Var[*C*_{frac}(*A, B*)] ≈ *nm*(1 − *s*) / ( *s*(*n* + *m*)³ (1 − (1 − *s*)^{n+m})² ).

Proceeding in the same fashion, we can obtain second and third order approximations to the variance. Indeed, series approximations can be had to arbitrarily high order due to the binomial distribution having finite central moments of arbitrary order. However, we found that the higher order expansion derivations are tedious and long, whereas the results obtained using first order approximation are both simple and accurate enough in practice, as our numerical experiments below demonstrate.
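The accuracy of the first order approximation can be checked by simulation. The snippet below is an illustrative sketch with arbitrary parameters; it compares the empirical variance of the ratio against *nm*(1 − *s*)/(*s*(*n* + *m*)³):

```python
import random

def empirical_ratio_variance(n, m, s, trials=20_000, seed=1):
    # sample X ~ Binom(n, s) and Y ~ Binom(m, s), record X/(X+Y)
    rng = random.Random(seed)
    vals = []
    for _ in range(trials):
        x = sum(rng.random() < s for _ in range(n))
        y = sum(rng.random() < s for _ in range(m))
        if x + y > 0:
            vals.append(x / (x + y))
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

n, m, s = 150, 100, 0.3
taylor_var = n * m * (1 - s) / (s * (n + m) ** 3)  # first order approximation
emp_var = empirical_ratio_variance(n, m, s)
```

For these parameters the empirical variance and the first order approximation agree to within a few percent.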

### 2.3 Asymptotic normality of *C*_{frac}(*A,B*)

In order to derive confidence intervals and hypothesis tests for *C*_{frac}(*A, B*) in the next section, in this section we prove this quantity’s asymptotic normality. We utilize the delta method [2, section 14.1.3] combined with the De Moivre-Laplace theorem. Indeed, the De Moivre-Laplace theorem guarantees asymptotic normality of *X*_{A⋂B} and *X*_{A\B}, and since *f*(*x, y*) = *x*/(*x* + *y*) is twice differentiable, we can apply the delta method to obtain:

**Theorem 5.** *For n* = |*A* ⋂ *B*| *and m* = |*A* \ *B*| *where both m and n are non-zero,*

( *C*_{frac}(*A, B*) − E[*C*_{frac}(*A, B*)] ) / √(Var[*C*_{frac}(*A, B*)]) → 𝒩(0, 1)

*in distribution as n* + *m* → ∞.

## 3 Statistics of *C*_{frac}(*A,B*) under simple mutation model

In this section, we turn our attention to analyzing how a simple mutation model affects *C*_{frac}(*A, B*). The model under consideration is a simple mutation process where each nucleotide of some sequence *S* is independently mutated at a fixed rate. This model was recently introduced in [4] where it was quantified statistically how this mutation process affects the *k*-mers in *S*. We extend the results of [4] to the case where *A* is the set of *k*-mers of *S*, *B* is the set of *k*-mers of *S*’ (that is, the sequence *S* after the mutation process) and where the quantity under consideration is *C*_{frac}(*A, B*). We first recall a few important definitions.

### 3.1 Preliminaries

We follow closely the exposition contained in [4]. Let *L* > 0 be a natural number that denotes the number of *k*-mers in some string *S*. A *k-span K*_{i} is the range of integers [*i*, *i* + *k* − 1] which denotes the set of indices of the sequence *S* where a *k*-mer resides. Fix a *mutation rate p* where 0 < *p* < 1. The *simple mutation model* considers each position *i* = 0, …, *L* + *k* − 1 and with probability *p*, marks it as *mutated*. A mutation at location *i* *affects* the *k*-spans *K*_{max(0, i−k+1)}, …, *K*_{i}. Let *N*_{mut} be a random variable defined to be the number of affected/mutated *k*-spans. We use *q* = 1 − (1 − *p*)^{k} to express the probability that a *k*-span is mutated. Note that 1 − *p* corresponds precisely to the expected average nucleotide identity (ANI) between a sequence *S* and its mutated counterpart *S*′.

In addition to the number of affected or unaffected *k*-spans, we shall need to define the sets of *k*-mers before and after the mutation process. Given a nonempty sequence *S* on the alphabet {*A, C, T, G*} and a *k*-mer size such that each *k*-mer in *S* is unique, let *A* represent the set of all *k*-mers in *S* and let *L* = |*S*| − *k* + 1. Now, we apply the simple mutation model to *S* via the following: if for any *i* ∈ [0, …, *L* + *k* − 1], this index is marked as mutated, let *S*′_{i} be some nucleotide in {*A, C, T, G*} \ {*S*_{i}}, and otherwise let *S*′_{i} = *S*_{i} if the index *i* is not marked as mutated. Let *B* represent the set of *k*-mers of *S*′, and we assume that *S*′ does not contain repeated *k*-mers either. In summary, *A* denotes the set of *k*-mers of a sequence *S*, and *B* denotes the set of *k*-mers of a sequence *S*′ derived from *S* using the simple mutation model with no spurious matches.
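The simple mutation model can be simulated directly; only the marked positions matter, not the nucleotides themselves. The following sketch (illustrative, with arbitrary parameters) draws *N*_{mut} and compares its empirical mean to E[*N*_{mut}] = *qL*:

```python
import random

def simulate_nmut(L, k, p, rng):
    # mark each of the L + k - 1 sequence positions as mutated with probability p,
    # then count the k-spans [i, i + k - 1], i = 0, ..., L - 1, containing a mutation
    marked = [rng.random() < p for _ in range(L + k - 1)]
    return sum(any(marked[i:i + k]) for i in range(L))

L, k, p = 2000, 21, 0.01
q = 1 - (1 - p) ** k  # probability that a given k-span is mutated
rng = random.Random(42)
trials = 500
mean_nmut = sum(simulate_nmut(L, k, p, rng) for _ in range(trials)) / trials
```

Averaged over many trials, `mean_nmut` approaches *qL*, matching the expectation of *N*_{mut} derived in [4].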

### 3.2 Expectation and variance

We immediately notice that |*A* \ *B*| = *N*_{mut}, and |*A* ⋂ *B*| = *L* − *N*_{mut}. We note that the results in Theorem 3, Corollary 4 and Theorem 5 above still hold for a fixed *N*_{mut} (since *m* = *N*_{mut} and *n* = *L* − *N*_{mut}). However, assuming a simple mutation model, *N*_{mut} is not a fixed quantity, but rather a random variable that depends primarily on the mutation rate *p* (among other parameters of the mutation model). Therefore, the analyses so far only connect *C*_{frac}(*A, B*) to a fixed *N*_{mut}, as we have only considered the randomness from the FracMinHash sketching process. To quantify the impact of the mutation rate *p* on *C*_{frac}(*A, B*), we now consider the randomness introduced by both the FracMinHash sketching process and the mutation process simultaneously.

Let *P* and *S* be the probability tuples corresponding to the mutation and FracMinHash sketching random processes, respectively. We will use the subscript *P, S* to indicate the product space, e.g. E_{P,S}[·] and Var_{P,S}[·]. Hence we assume that the mutation process and the process of taking a FracMinHash sketch are independent. Indeed, the hash functions have no relation to the point mutations introduced by the simple mutation model. Before proceeding with the analysis, we make a note that the expectation and variance of *N*_{mut} under the simple mutation model with no spurious matches have been investigated in [4]. As such, we already know E_{P}[*N*_{mut}] and Var_{P}[*N*_{mut}], and will use these results directly (see [4, Table 1]).

**Theorem 6.** *For* 0 < *s* < 1, *if A and B are respectively distinct sets of k-mers of a sequence S and a sequence S′ derived from S under the simple mutation model with mutation probability p such that A* ⋂ *B is non-empty, then the expectation of C*_{frac}(*A, B*) *in the product space P, S is given by*

E_{P,S}[*C*_{frac}(*A, B*)] = 1 − *q* = (1 − *p*)^{k},

*where P and S are the probability tuples corresponding to the mutation and FracMinHash sketching random processes, respectively*.

*Proof.*

E_{P,S}[*C*_{frac}(*A, B*)] = E_{P}[ E_{S}[*C*_{frac}(*A, B*)] ] = E_{P}[ (*L* − *N*_{mut})/*L* ] = (*L* − *qL*)/*L* = 1 − *q*.

Here, we used Fubini’s theorem in the second step. We also used the expectation of *N*_{mut} from [4], where *q* = 1 − (1 − *p*)^{k}.

Next, we turn to the more challenging task of calculating the variance of *C*_{frac}(*A, B*) in the product space *P, S*. In the following, note that Var_{P}[*N*_{mut}] is already known (see [4, Theorem 2]).

**Theorem 7.** *For* 0 < *s* < 1, *if A and B are respectively distinct sets of k-mers of a sequence S and a sequence S′ derived from S under the simple mutation model with mutation probability p such that A* ⋂ *B is non-empty, then the variance of C*_{frac}(*A, B*) *in the product space P, S is given by*

Var_{P,S}[*C*_{frac}(*A, B*)] ≈ (1 − *s*)( *L* · E_{P}[*N*_{mut}] − E_{P}[*N*_{mut}²] ) / ( *s* *L*³ (1 − (1 − *s*)^{L})² ) + Var_{P}[*N*_{mut}] / *L*²,

*where P and S are the probability tuples corresponding to the mutation and FracMinHash sketching random processes, respectively*.

With these quantities in hand, we are now in a position to derive hypothesis tests and confidence intervals for *C*_{frac}(*A, B*).

## 4 Hypothesis test and confidence interval

Observe that the marginal of *C*_{frac}(*A, B*) with respect to the mutation process is simply (*L* − *N*_{mut})/*L*. Using the results in [4], we note that *N*_{mut} is asymptotically normally distributed when the mutation rate *p* and *k*-mer length *k* are independent of *L*, and *L* is sufficiently large. In Theorem 5, we showed that *C*_{frac}(*A, B*) is normally distributed for a fixed *N*_{mut}. Therefore, considering the randomness from both the FracMinHash sketching and the mutation model independently, *C*_{frac}(*A, B*) is asymptotically normal when all conditions are met. Using the statistics derived in Section 3, we obtain the following hypothesis test for *C*_{frac}(*A, B*).

**Theorem 8.** *Let* 0 < *s* < 1, *let A and B be two distinct sets of k-mers, respectively of a sequence S and a sequence S′ derived from S under the simple mutation model with mutation probability p, such that A* ⋂ *B is non-empty*.

*Also, let* 0 < *α* < 1, *z*_{α} = Φ^{−1}(1 − *α*/2), *and*

*C*_{low} = E_{P,S}[*C*_{frac}(*A, B*)] − *z*_{α} √(Var_{P,S}[*C*_{frac}(*A, B*)]), *C*_{high} = E_{P,S}[*C*_{frac}(*A, B*)] + *z*_{α} √(Var_{P,S}[*C*_{frac}(*A, B*)]).

*Then, the following holds as L* → ∞ *and p and k are independent of L*:

Pr( *C*_{low} ≤ *C*_{frac}(*A, B*) ≤ *C*_{high} ) = 1 − *α*.

We can turn this hypothesis test into a confidence interval for the mutation rate *p* as follows.

**Theorem 9.** *Let A and B be two distinct sets of k-mers, respectively of a sequence S and a sequence S′ derived from S under the simple mutation model with mutation probability p, such that A* ⋂ *B is non-empty. Let* E_{p_{fixed}}[*X*] *and* Var_{p_{fixed}}[*X*] *denote the expectation and variance of X under the randomness from the mutation process with fixed mutation rate p*_{fixed}. *Then, for fixed α, s, k and an observed fractional containment index C*_{frac}(*A, B*), *there exists an L large enough such that there exists a unique solution p* = *p*_{low} *to the equation*

*C*_{frac}(*A, B*) = E_{p_{low}}[*C*_{frac}(*A, B*)] − *z*_{α} √(Var_{p_{low}}[*C*_{frac}(*A, B*)])

*and a unique solution p* = *p*_{high} *to the equation*

*C*_{frac}(*A, B*) = E_{p_{high}}[*C*_{frac}(*A, B*)] + *z*_{α} √(Var_{p_{high}}[*C*_{frac}(*A, B*)])

*such that the following holds:*

Pr( *p*_{low} ≤ *p* ≤ *p*_{high} ) = 1 − *α* *as L* → ∞.

## 5 Likelihood of corner cases

In practice, one disadvantage of sketching techniques is that the size of the sketch (here controlled via the scale factor *s*) may be too small (respectively, too large) to distinguish between highly similar (respectively, dissimilar) sequences. For example, given a small mutation rate *p*, one may need a very large scale factor, and hence sketch, to be able to distinguish between a sequence and its mutated version. These “corner cases” are precisely the ones where the confidence interval given by Theorem 9 will likely fail. One of these pathological cases shows up when there is nothing common between the two FracMinHash sketches **FRAC**_{s}(*A*) and **FRAC**_{s}(*B*). We observe that this occurs when *X*_{A⋂B} = 0. Now *X*_{A⋂B} is distributed as a binomial distribution Binom(*n, s*) where *n* = |*A* ⋂ *B*| = *L* − *N*_{mut}, so the probability of the intersection being empty with respect to the sketching process is:

Pr_{S}( *X*_{A⋂B} = 0 ) = (1 − *s*)^{L − N_{mut}}.

Ideally, we would be able to directly calculate E_{P}[(1 − *s*)^{L − N_{mut}}], the expected probability of this corner case happening. Unfortunately, we do not have a closed form representation of *N*_{mut}, and therefore instead take a Taylor series of (1 − *s*)^{L − N_{mut}} with respect to *N*_{mut} about E[*N*_{mut}] = *qL*:

E_{P}[(1 − *s*)^{L − N_{mut}}] ≈ (1 − *s*)^{L(1 − q)} ( 1 + (ln(1 − *s*))²/2 · Var_{P}[*N*_{mut}] + ⋯ ).

Hence, if we calculate the first *i* central moments of *N*_{mut}, we can approximate the expected probability of this corner case happening. In practice, we have found even the second central moment to suffice.
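As a concrete sketch, the expansion truncated at the second central moment can be evaluated as follows; the function name is ours, and `var_nmut` must be supplied from the formulas of [4, Table 1] (or estimated by simulation):

```python
import math

def empty_intersection_prob(L, q, s, var_nmut):
    # Taylor expansion of E_P[(1 - s)^(L - N_mut)] about E[N_mut] = qL,
    # keeping terms up to the second central moment of N_mut
    base = (1 - s) ** (L * (1 - q))
    return base * (1 + 0.5 * math.log(1 - s) ** 2 * var_nmut)
```

With `var_nmut = 0` this reduces to the zeroth order term (1 − *s*)^{L(1−q)}; a positive variance of *N*_{mut} only increases the estimated probability of this corner case.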

The remaining pathological case occurs when *p* ≠ 0 and yet **FRAC**_{s}(*A*) = **FRAC**_{s}(*B*) (i.e. the sketches are not large enough to distinguish between *A* and *B*). Similar to before, we have

Pr_{S}( **FRAC**_{s}(*A*) = **FRAC**_{s}(*B*) ) = (1 − *s*)^{|A \ B| + |B \ A|} = (1 − *s*)^{2N_{mut}},

and hence the Taylor series expansion about E[*N*_{mut}] = *qL* gives

E_{P}[(1 − *s*)^{2N_{mut}}] ≈ (1 − *s*)^{2qL} ( 1 + 2(ln(1 − *s*))² · Var_{P}[*N*_{mut}] + ⋯ ),

and the expected probability of the sketches of *A* and *B* not differing at all can similarly be estimated when given access to central moments of *N*_{mut}.

In both cases, these formulas help practitioners assess whether containment estimates of 0 or 1 are due to parameter settings (e.g., scale factor too high or too low), or else are biologically meaningful.

## 6 Experiments and results

### 6.1 FracMinHash accurately estimates the containment index for sets of very different sizes

We first show that FracMinHash can estimate the true containment index better when the sizes of two sets are dissimilar. For this experiment, we compared FracMinHash with the popular MinHash implementation tool Mash [25]. We took a Staphylococcus genome from the GAGE dataset [28] and selected a subsequence that covers *C*% of the whole genome in terms of number of bases, added this sequence to a metagenome, and calculated the containment of Staphylococcus in this “super metagenome.” The metagenome we used is a WGS metagenome sample from a pharmaceutical degrading enrichment culture (NCBI accession PRJNA782474), consisting of approximately 1.3G bases. We used a scale factor of 0.005 for FracMinHash, and we set the number of hash functions for Mash to be the same as the size of the FracMinHash sketch of the Staphylococcus genome (approximately 1500 on average).

We repeated this setup for different values of *C*, and compared the containment index calculated by Mash and FracMinHash in Figure 1. The points shown in the figure are the mean values for multiple runs with different seeds, whereas the error bars show the standard deviation. Mash primarily reports the MinHash Jaccard index, so we converted the Jaccard index into a containment index by counting the number of distinct *k*-mers using brute force.
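The conversion is elementary: since *J* = |*A* ⋂ *B*|/(|*A*| + |*B*| − |*A* ⋂ *B*|), the intersection size is |*A* ⋂ *B*| = *J*(|*A*| + |*B*|)/(1 + *J*), and dividing by |*A*| gives the containment. A minimal sketch (the function name is ours):

```python
def jaccard_to_containment(j, size_a, size_b):
    # recover |A ∩ B| from the Jaccard index j and the distinct k-mer counts,
    # then divide by |A| to obtain C(A, B)
    intersection = j * (size_a + size_b) / (1 + j)
    return intersection / size_a
```

For example, with |*A*| = 100, |*B*| = 200 and *J* = 0.2, the recovered intersection is 50 and the containment is 0.5.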

Figure 1 illustrates that while Mash and FracMinHash both faithfully estimate the true containment index, the FracMinHash approach more accurately estimates the containment index as this index increases in value. In addition, the estimate is more precise as demonstrated by the size of the error bars on the estimates. This is likely due to the fact that while Mash and FracMinHash both use a sketch of size 4,000 for the Staphylococcus genome, Mash uses the same fixed value of 4,000 when forming a sketch for the metagenome, while FracMinHash selects a sketch size that scales with the size of the metagenome. This can be seen most starkly when the metagenome is significantly larger than the query genome when estimating containment indices.

### 6.2 FracMinHash gives accurate confidence intervals around mutation rates

Next, we show that the confidence interval from Theorem 9 for the mutation rate *p* works well in practice. To do so, we performed 10,000 simulations of sequences of length *L* = 10k, 100k and 1M that underwent the simple mutation model with *p* = 0.001, 0.1 and 0.2. We then used a scale factor of *s* = 0.1 when calculating *p*_{low} and *p*_{high} for a 95% confidence interval and repeated this for *k*-mer sizes of 21, 51 and 100. Table 1 records the percentage of experiments that resulted in *p*_{low} ≤ *p* ≤ *p*_{high} and demonstrates that the confidence intervals indeed achieve approximately 95% coverage. We also performed the same experiment for other scale factors. The results are similar, but for the sake of brevity these tables are included in the appendix.

### 6.3 FracMinHash more accurately estimates mutation distance

#### 6.3.1 On simulated data

We now compare the Mash estimate and FracMinHash estimate (given as a confidence interval) of mutation rates. For this experiment, we simulated point mutations in the aforementioned Staphylococcus genome at a mutation rate *p*, and then calculated the distance of the original Staphylococcus genome with this mutated genome using both Mash and the interval given by Theorem 9. The results are shown in Figure 2a. This plot shows that Mash overestimates the mutation rate by a noticeable degree, with increasing inaccuracy as the mutation distance increases. This is likely due to the Mash distance assuming a Poisson model for how mutations affect *k*-mer occurrences, which has been shown to be violated when considering a point mutation model. In contrast, the point estimate given by Theorem 9 is fairly close to the true mutation rate, and the confidence interval reliably contains the true mutation rate.

#### 6.3.2 On real data

Finally, we conclude this section by presenting pairwise mutation distances between a collection of real genomes using both Mash and the interval in Theorem 9. To make a meaningful comparison, it is important to compute the true mutation distance (or equivalently, the average nucleotide identity) between a pair of genomes. For this purpose, we used OrthoANI [21], a fast ANI calculation tool. From amongst 199K bacterial genomes downloaded from NCBI, we randomly selected pairs of genomes so that the pairwise ANI ranges from 0.5 to 1. For visual clarity, we kept at most 3 pairs of genomes for any ANI interval of width 5%. We used 4000 hash functions to run Mash, and set *L* = (|*A*| + |*B*|)/2 for the confidence intervals in Theorem 9, where |*A*| and |*B*| denote the numbers of distinct *k*-mers in the two genomes in a pair. The results are presented in Figure 2b.

Mash consistently overestimates the mutation distance, particularly for moderate to high distances. In contrast, the confidence intervals given by Theorem 9 perform significantly better. It is noticeable that the confidence intervals are not as accurate as in the case of a simulated genome (presented in Figure 2a). This is natural: when we introduce point mutations, the resulting pair of genomes do not vary in length, whereas in this real setup the genomes in a pair have very dissimilar sizes, contain repeats, and easily violate the simplifying assumptions of the simple mutation model.

## 7 Conclusions

In contrast to classic MinHash, which uses a fixed sketch size, FracMinHash automatically scales the size of the sketch based on the size of the input data. This has the advantage of facilitating accurate comparison of sets of very different sizes, but also means that sketch sizes can become quite large. However, given that a user has control over what percentage of the data to keep in the sketch (in terms of *s*), reasonable estimates can be made about sketch sizes *a priori*. In addition, one particularly attractive feature of FracMinHash is its analytical tractability: as we have demonstrated, it is relatively straightforward to characterize the performance of FracMinHash, derive its statistics, and study how it interacts with a simple mutation model. Given these advantages, it seems reasonable to favor FracMinHash in situations where sets of differing sizes are being compared, or else when fast and accurate estimates of mutation rates are desired (particularly for moderate to high mutation rates).

## Acknowledgements

This material is based upon work supported by the National Science Foundation under grant No. DMS-1664803. The authors would like to acknowledge the helpful inputs from Luiz Irber and Paul Medvedev.

## A Appendix

### A.1 Verification of Theorem 9 using simulations

Similar to Table 1, we repeated the experiment for the same settings except with two different scale factors. The results are shown in this section.

### A.2 Missing theorems and proofs

**Theorem 3** (restated). *For n* = |*A* ⋂ *B*| *and m* = |*A* \ *B*| *where both m and n are non-zero, a first order Taylor series approximation gives*

Var[ *X*_{A⋂B} / (*X*_{A⋂B} + *X*_{A\B}) ] ≈ *nm*(1 − *s*) / (*s*(*n* + *m*)³).

*Proof.* Let *f*(*x, y*) = *x*/(*x* + *y*), *μ*_{x} = *ns*, *μ*_{y} = *ms*, and use subscripts to denote partial derivatives:

*f*_{x}(*μ*_{x}, *μ*_{y}) = *μ*_{y}/(*μ*_{x} + *μ*_{y})² = *m*/(*s*(*n* + *m*)²), *f*_{y}(*μ*_{x}, *μ*_{y}) = −*μ*_{x}/(*μ*_{x} + *μ*_{y})² = −*n*/(*s*(*n* + *m*)²).

We then have the first order Taylor series:

Var[*f*(*X*_{A⋂B}, *X*_{A\B})] ≈ *f*_{x}² Var[*X*_{A⋂B}] + 2 *f*_{x} *f*_{y} Cov[*X*_{A⋂B}, *X*_{A\B}] + *f*_{y}² Var[*X*_{A\B}] = *nm*(1 − *s*) / (*s*(*n* + *m*)³),

with the middle term vanishing due to independence.

**Theorem 5** (restated). *For* 0 < *s* < 1, *n* = |*A* ⋂ *B*| *and m* = |*A* \ *B*| *where both m and n are non-zero,*

( *C*_{frac}(*A, B*) − E[*C*_{frac}(*A, B*)] ) / √(Var[*C*_{frac}(*A, B*)]) → 𝒩(0, 1) *in distribution as n* + *m* → ∞.

*Proof.* The covariance matrix of (*X*_{A⋂B}, *X*_{A\B}) is calculated as

Σ = diag( *ns*(1 − *s*), *ms*(1 − *s*) ),

the off-diagonal entries vanishing by independence. Using the same notation as in Theorem 3, let ∇*f* = (*f*_{x}, *f*_{y}) evaluated at (*μ*_{x}, *μ*_{y}).

The delta method then uses the first order Taylor series from Theorem 3 to obtain that *X*_{A⋂B}/(*X*_{A⋂B} + *X*_{A\B}) − *n*/(*n* + *m*) converges in distribution to a centered normal with variance ∇*f* Σ ∇*f*ᵀ = *nm*(1 − *s*)/(*s*(*n* + *m*)³); dividing by the constant debiasing factor yields the claim for *C*_{frac}(*A, B*).

**Theorem 7** (restated). *For* 0 < *s* < 1, *if A and B are respectively distinct sets of k-mers of a sequence S and a sequence S′ derived from S under the simple mutation model with mutation probability p such that A* ⋂ *B is non-empty, then the variance of C*_{frac}(*A, B*) *in the product space P, S is given by*

Var_{P,S}[*C*_{frac}(*A, B*)] ≈ (1 − *s*)( *L* · E_{P}[*N*_{mut}] − E_{P}[*N*_{mut}²] ) / ( *s* *L*³ (1 − (1 − *s*)^{L})² ) + Var_{P}[*N*_{mut}] / *L*²,

*where P and S are the probability tuples corresponding to the mutation and FracMinHash sketching random processes, respectively.*

*Proof.* First, we calculate the second moment of *C*_{frac}(*A, B*) in the product space as follows:

E_{P,S}[*C*_{frac}(*A, B*)²] = E_{P}[ Var_{S}[*C*_{frac}(*A, B*)] + (E_{S}[*C*_{frac}(*A, B*)])² ] ≈ E_{P}[ (1 − *s*) *N*_{mut}(*L* − *N*_{mut}) / ( *s* *L*³ (1 − (1 − *s*)^{L})² ) ] + E_{P}[ ((*L* − *N*_{mut})/*L*)² ],

where the inner variance is given by Corollary 4 with *n* = *L* − *N*_{mut} and *m* = *N*_{mut}. Therefore, we calculate the variance in the product space as follows:

Var_{P,S}[*C*_{frac}(*A, B*)] = E_{P,S}[*C*_{frac}(*A, B*)²] − (E_{P,S}[*C*_{frac}(*A, B*)])² ≈ (1 − *s*)( *L* E_{P}[*N*_{mut}] − E_{P}[*N*_{mut}²] ) / ( *s* *L*³ (1 − (1 − *s*)^{L})² ) + Var_{P}[*N*_{mut}]/*L*².

**Theorem 9** (restated). *Let A and B be two distinct sets of k-mers, respectively of a sequence S and a sequence S′ derived from S under the simple mutation model with mutation probability p, such that A* ⋂ *B is non-empty. Let* E_{p_{fixed}}[*X*] *and* Var_{p_{fixed}}[*X*] *denote the expectation and variance of X under the randomness from the mutation process with fixed mutation rate p*_{fixed}. *Then, for fixed α, s, k and an observed fractional containment index C*_{frac}(*A, B*), *there exists an L large enough such that there exists a unique solution p* = *p*_{low} *to the equation*

*C*_{frac}(*A, B*) = E_{p_{low}}[*C*_{frac}(*A, B*)] − *z*_{α} √(Var_{p_{low}}[*C*_{frac}(*A, B*)])

*and a unique solution p* = *p*_{high} *to the equation*

*C*_{frac}(*A, B*) = E_{p_{high}}[*C*_{frac}(*A, B*)] + *z*_{α} √(Var_{p_{high}}[*C*_{frac}(*A, B*)])

*such that the following holds:*

Pr( *p*_{low} ≤ *p* ≤ *p*_{high} ) = 1 − *α* *as L* → ∞.

*Proof.* Given the results in Theorem 8, we only need to prove that *p*_{low} and *p*_{high} are well defined. It suffices to show that E_{p_{low}}[*C*_{frac}(*A, B*)] − *z*_{α} √(Var_{p_{low}}[*C*_{frac}(*A, B*)]) and E_{p_{high}}[*C*_{frac}(*A, B*)] + *z*_{α} √(Var_{p_{high}}[*C*_{frac}(*A, B*)]) are strictly monotonic in *p*_{low} and *p*_{high}, respectively, under the stated conditions.

Let us first investigate the function for *p*_{low}. For simplicity, we will write *p* instead of *p*_{low}, *z* instead of *z*_{α}, and *N* instead of *N*_{mut}. We observe that the function under consideration is

*g*(*p*) = E_{p}[*C*_{frac}(*A, B*)] − *z* √(Var_{p}[*C*_{frac}(*A, B*)]),

where the expectation and the variance are given by Theorems 6 and 7, with the moments of *N* taken at mutation rate *p*.

After a very long and tedious series expansion of the derivative d*g*/d*p* about *L* = ∞, we obtain that the leading term of the derivative is strictly negative.

Therefore, as *L* approaches ∞, the derivative is always negative, which gives us that the function is monotonically decreasing in *p*_{low} in the asymptotic case.

The proof that E_{p_{high}}[*C*_{frac}(*A, B*)] + *z* √(Var_{p_{high}}[*C*_{frac}(*A, B*)]) is monotonically decreasing in *p*_{high} proceeds in an entirely analogous manner.

### A.3 Theoretical guarantees to accurately estimate containment index

In this section, we present theoretical evidence that *C*_{frac}(*A, B*) is able to estimate the true containment index *C*(*A, B*) with high accuracy. Let the elements in *A* ∪ *B* be *e*_{i} for *i* = 1 to *N*. We define an indicator variable *Y*_{i} associated with an element *e*_{i} as follows:

*Y*_{i} = 1 if *e*_{i} ∈ *A* ⋂ *B* and *h*(*e*_{i}) ≤ *Hs*, and *Y*_{i} = 0 otherwise.

Let *Y* be the number of elements in **FRAC**_{s}(*A*) ⋂ **FRAC**_{s}(*B*). Naturally, *Y* = Σ_{i=1}^{N} *Y*_{i}. The probability of *Y*_{i} being 1 is *s* for every *e*_{i} ∈ *A* ⋂ *B*. Therefore, we have:

E[*Y*] = *s* |*A* ⋂ *B*|.

Let us make a simplifying assumption that the exact cardinality of the set *A* is known. Let us define *Y*′ as *Y*/(*s*|*A*|). Therefore, E[*Y*′] = |*A* ⋂ *B*|/|*A*| = *C*(*A, B*). If we use *Y*′ as the estimator to measure *C*(*A, B*), then we have

Pr( |*Y*′ − *C*(*A, B*)| ≥ *δ* *C*(*A, B*) ) ≤ 2 exp( −*s* |*A* ⋂ *B*| *δ*²/3 ),

where we used the Chernoff bound for a sum of Bernoulli random variables in the last step. The result is intuitive, stating that when the two sets have more in common, or when we work with a larger scale factor, the estimate *Y*′ performs better. This is expected, and conforms to the concept of using a scale factor. *C*_{frac}(*A, B*) estimates *C*(*A, B*) slightly differently than *Y*′, and further investigations are required to narrow down the theoretical guarantees of *C*_{frac}(*A, B*) estimating *C*(*A, B*).
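The behavior of *Y*′ can be illustrated with a small simulation; keeping each element of *A* ⋂ *B* with probability *s* models the event *h*(*e*_{i}) ≤ *Hs*. The parameters below are arbitrary:

```python
import math
import random

rng = random.Random(7)
size_a, size_inter, s = 10_000, 6_000, 0.1   # |A|, |A ∩ B|, scale factor
c_true = size_inter / size_a                 # C(A, B) = 0.6
trials = 1_000
ests = []
for _ in range(trials):
    y = sum(rng.random() < s for _ in range(size_inter))  # Y ~ Binom(|A ∩ B|, s)
    ests.append(y / (s * size_a))                         # Y' = Y / (s |A|)
mean_est = sum(ests) / trials

# fraction of runs deviating from C(A, B) by at least delta*C(A, B),
# compared against the Chernoff bound 2*exp(-s|A ∩ B|delta^2/3)
delta = 0.1
observed_tail = sum(abs(e - c_true) >= delta * c_true for e in ests) / trials
chernoff_bound = 2 * math.exp(-s * size_inter * delta ** 2 / 3)
```

In this configuration the empirical mean of *Y*′ concentrates tightly around *C*(*A, B*), and the observed tail probability sits well below the Chernoff bound, as the bound is not tight.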
