Abstract
The ability to quickly and inexpensively describe the taxonomic diversity in an environment is critical in this era of rapid climate and biodiversity changes. The currently preferred molecular technique is (meta)barcoding in which taxonomically informative plastid/mitochondrial markers are sequenced. It is low-cost and widely used, but has drawbacks. As sequencing costs continue to fall, an alternative approach based on genome-skimming has been proposed. This approach first applies low-pass (100Mb – several Gb per sample) sequencing to voucher and/or query samples and then recovers marker genes and/or organelle genomes computationally. In contrast, we suggest the use of the unassembled sequence data for taxonomic identification using an alignment-free approach based on the k-mer decomposition of the sequencing reads. Our approach is motivated by earlier work that connects genomic distance to the Jaccard index on k-mer collections, but improves upon prior work through a careful modeling of the impact of low coverage, sequencing error, and other factors on the Jaccard index. Our tool, Skmer, estimates genomic distance between two organisms represented by their k-mer collections obtained from the genome-skims, and uses distance estimates to match a genome-skim query to a reference collection. Skmer shows excellent performance in our simulation studies, and makes the assembly-free approach to genome-skimming a viable alternative to traditional barcoding. The Skmer software is made publicly available on https://github.com/shahab-sarmashghi/Skmer.git
1 Introduction
The ability to quickly and inexpensively study the taxonomic diversity in an environment is critical in this era of rapid climate and biodiversity changes. The current molecular technique of choice is (meta)barcoding [1–3]. Traditional (meta)barcoding is based on DNA sequencing of taxonomically informative and group-specific marker genes (e.g., mitochondrial COI [1, 4] and 12S/16S [5, 6] for animals, chloroplast genes like matK for plants [7], and ITS [8] for fungi) that are variable enough for taxonomic identification, but have flanking regions that are sufficiently conserved to allow for PCR amplification using universal primers. Barcoding is used for taxonomic identification of single-species samples. In the case of metabarcoding, the goal is to deconstruct the taxonomic composition of a mixed sample consisting of multiple species. Beyond the barcoding application, the barcoding marker genes have also been used to delimitate species [9] and to infer phylogenies [10, 11].
The accuracy of (meta)barcoding depends on the coverage of the reference database and the method used to search queries against it [3]. To satisfy the coverage requirement, reference databases with millions of barcodes have been generated (e.g., the Barcode of Life Data System, BOLD, for the COI marker [12]). Computational methods for finding the closest match in a reference dataset of markers (e.g., TaxI [13]), and for placement of a query into existing marker trees [14–16] have been developed. However, the traditional approach to (meta)barcoding has drawbacks. PCR for marker gene amplification requires relatively high quality DNA and thus cannot be applied to samples in which the DNA is heavily fragmented. Moreover, since barcode markers are relatively short regions, their phylogenetic signal and identification resolution can be limited [17]. For example, 896 of the 4,174 species of the wasp could not be distinguished from other species using COI barcodes [18]. While low costs have kept PCR-based pipelines attractive, decreasing costs of shotgun sequencing have now made it possible to shotgun sequence 1-2Gb of total DNA per reference specimen sample for as low as $80 [19], even after including sample preparation and labor costs. Therefore, researchers have proposed an alternate method for barcoding a sample that uses low-pass sequencing to generate genome-skims [19, 20], and subsequently identifies chloroplast or mitochondrial marker genes or assembles the organelle genome. Reconstructing plastid and mtDNA genomes from low-pass shotgun data is doable because non-nuclear DNA tends to be heavily overrepresented in shotgun sequencing; for example, 10.4% of all reads from the Apocynaceae family of flowering plants were from the chloroplast in one genome-skimming study [20]. Large reference databases based on genome-skimming techniques are under construction (e.g., PhyloAlps [21], NorBol [22], and DNAmark [23] projects).
Most current applications of genome-skimming to species identification require organelle genome assembly, a task that requires relatively time-consuming manual curation steps to ensure that assembly errors are avoided [24]. The current approach also discards a vast proportion of the non-target data, which means reducing the signal. Among the existing genome-skimming projects, the DNAmark project has started to consider an alternative approach. Perhaps instead of only relying on organelle markers, we could use the entire set of reads generated in a genome-skim as the identifier of a species. This approach poses an interesting methodological question: can the unassembled data be used to taxonomically profile reference and query samples in a similar manner to conventional barcoding, but using all available genomic information and saving us from the labor-intensive task of mitochondria/plastid genome assembly? In this paper, we introduce a new method to use low coverage genome-skims of both reference and query samples. Our approach aims to use all the generated sequence data and to eliminate the need for marker gene assembly. By avoiding the assembly step, our approach also reduces the amount of data processing needed for expanding the reference database.
We treat genome-skims simply as low-coverage “bags of reads”, both for a collection of reference species and for query samples. The problem is to find the reference genome-skim that matches the query; if an exact match is not found, we seek the closest available match. A more advanced problem, not directly addressed here, is placing the query in a phylogeny of reference species. A yet more difficult challenge, also not addressed here, is decomposing a query genome-skim that contains DNA from several different taxa into its constituent species.
Central to solving these problems is the ability to estimate a distance between two genome-skims for low and varied coverage using assembly-free and alignment-free approaches. Alignment-free comparison methods [25–27] have been widely studied, including for phylogenetic reconstruction [25, 28–37]. However, these methods typically assume high coverage, enough to cover most of the genome with at least one read [38]. The required levels of coverage are not economically feasible for building up large databases of reference genome-skims or for general processing of query samples. Like many existing alignment-free methods [39, 40], we decompose all reads into fixed-length oligomers (denoted k-mers with length k) [41], and use existing tools for computing the k-mer frequencies (e.g., JellyFish [42]). Similar to Ondov et al., we compute the hamming distance using k-mers [41]. Recall that the Jaccard index J is a similarity measure between any two sets (e.g., k-mer collections) defined as the size of their intersection divided by the size of their union. Ondov et al. describe a tool, Mash [41], in which (a) J is estimated efficiently using a hashing procedure; and (b) J is translated into an estimate of the hamming distance between two genomes, which in turn relates to the evolutionary distance. Unfortunately, the estimate of J is impacted by coverage, repeats, sequencing error and other factors, and no current approach works well for low-coverage datasets. Here, we develop and implement techniques to correct these errors with the aim of enabling the assembly-free approach to genome-skimming. Our tool, Skmer, shows excellent performance in computing distance, identification, and placement of genome-skim queries onto a reference collection. The assembly-free approach to genome-skimming, therefore, should be further explored as a viable alternative to the current approach.
2 Methods
Consider an idealized model where two genomes are the outcome of a random process that copies a genome and introduces mutations at each position with fixed probability d. Moreover, substitutions are the only allowed mutation. In this case, the per-nucleotide hamming distance D between the two genomes is a random variable (r.v.) with expected value d. We would like to estimate d. While this is a simplified model, we will test the method on real pairs of genomes that differ due to complex mutational processes (also, see Appendix B for extensions). We start with known results connecting the Jaccard index and the hamming distance and then show how these results can be generalized to low-coverage genome-skims. Throughout, we state our results succinctly and leave derivations and more careful justifications to Appendix A.
Jaccard index versus genomic distance. The Jaccard index of subsets $A_1$ and $A_2$ is defined as
$$J = \frac{|A_1 \cap A_2|}{|A_1 \cup A_2|}. \tag{1}$$
Let W be the number of shared k-mers between the two genomes. Note that $J = \frac{W}{2L - W}$, where L is the genome length. Assuming random genomes and no repeats, perhaps justifiably [43], the probability that a changed k-mer exists elsewhere in the genome is vanishingly small for sufficiently large k. Thus, we assume a k-mer is in the shared k-mer set only if no mutation falls on it, an event that has probability $(1 - d)^k$. We can therefore model W as a binomial with probability $(1 - d)^k$ and L trials. As Ondov et al. [41] pointed out, we can estimate
$$D = 1 - \left(\frac{2J}{1 + J}\right)^{1/k}, \tag{2}$$
and they further approximate D as $-\frac{1}{k}\ln\frac{2J}{1 + J}$. To be able to estimate large distances, we avoid the unnecessary approximation and use Equation 2 directly. We skim each genome to obtain k-mer sets $A_1$, $A_2$ and estimate J using Equation 1, which can be computed efficiently using a hashing technique used by Mash [41]. Note, however, that Equation 2 assumes a high coverage of the genome so that each k-mer is sampled at least once with very high probability. This assumption is violated for genome-skims in consequential ways. As a simple example, suppose the coverage is low enough that a k-mer is sampled with probability 0.5. Then, even for identical genomes, a quarter of the k-mers are expected to be shared and three quarters to be observed in at least one skim, so we estimate J as $\frac{1}{3}$, resulting in a distance estimate of D ≈ 0.032 for k = 21.
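As a rough illustration of the relationship described above, the following Python sketch converts a Jaccard index into a distance under the idealized model (substitutions only, no repeats, full coverage). The function names are ours and are not part of Mash or Skmer; the formula is obtained by combining J = W/(2L − W) with E[W] = L(1 − d)^k. The last lines reproduce the low-coverage example, where a sampling probability of 0.5 inflates the distance between identical genomes.

```python
import math


def jaccard(a: set, b: set) -> float:
    """Jaccard index of two k-mer sets: |A ∩ B| / |A ∪ B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def distance_from_jaccard(j: float, k: int) -> float:
    """Invert J = W / (2L - W) with E[W] = L (1 - d)^k, assuming full coverage."""
    if j <= 0.0:
        return 1.0  # no shared k-mers: the estimate saturates
    return 1.0 - (2.0 * j / (1.0 + j)) ** (1.0 / k)


def log_approx_distance(j: float, k: int) -> float:
    """Logarithmic approximation -(1/k) ln(2J / (1 + J))."""
    if j <= 0.0:
        return 1.0
    return -math.log(2.0 * j / (1.0 + j)) / k


# Identical genomes, each k-mer sampled with probability 0.5 in both skims:
# a fraction 0.25 of k-mers is shared and 0.75 is seen in at least one skim.
eta = 0.5
j_observed = (eta * eta) / (eta + eta - eta * eta)  # = 1/3
d_naive = distance_from_jaccard(j_observed, k=21)   # about 0.032 instead of 0
```

The uncorrected estimate is nonzero even though the two genomes are identical, which is the bias that the coverage-aware model in the next section is meant to remove.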
2.1 Extending to genome-skims with known low coverage
We now show how Equation 2 can be refined to handle genome-skims despite low and uneven coverage, sequencing error, and varying genome-lengths. We assume that coverage is known (but see the next section).
When the genome is not fully covered, three sources of randomness are at work: mutations and sampling of k-mers from each of the two genomes. Each genome of length L is sequenced independently using randomly distributed short reads of length ℓ at coverages $c_1$ and $c_2$ to produce two genome-skims. Under the simplifying assumption that genomes are not repetitive, we choose k to be large enough so that each k-mer is unique with high probability. Therefore, the number of distinct k-mers in each genome is $L - k \approx L$. The probability of covering each k-mer can be approximated as $\eta_i = 1 - e^{-\lambda_i}$ where $\lambda_i = c_i(1 - k/\ell)$. Modeling the sampling of k-mers as independent Bernoulli trials, $|A_i|$ becomes binomially distributed with parameters $\eta_i$ and L. By independence, $W = |A_1 \cap A_2|$ also becomes binomially distributed with parameters $\eta_1\eta_2(1 - d)^k$ and L. Moreover, $U = |A_1 \cup A_2|$ can also be modeled approximately as a Gaussian with mean $(\eta_1 + \eta_2 - \eta_1\eta_2(1 - d)^k)L$. Treating $\eta_1$ and $\eta_2$ as known and dividing W by U gives us:
$$J \approx \frac{\eta_1\eta_2(1 - D)^k}{\eta_1 + \eta_2 - \eta_1\eta_2(1 - D)^k},$$
thus,
$$D = 1 - \left(\frac{(\eta_1 + \eta_2)J}{\eta_1\eta_2(1 + J)}\right)^{1/k}.$$
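The low-coverage model above can be checked numerically with a short sketch. The code below is our own illustration, not Skmer's implementation: it computes the sampling probability η from the coverage, forms the expected Jaccard index as the ratio of the expected intersection and union sizes, and then inverts that relation. The read length of 100 bp and the function names are assumptions made for the example.

```python
import math


def sampling_probability(coverage: float, k: int, read_len: int) -> float:
    """eta = 1 - exp(-lambda), with k-mer coverage lambda = c (1 - k / read_len)."""
    lam = coverage * (1.0 - k / read_len)
    return 1.0 - math.exp(-lam)


def expected_jaccard(d: float, eta1: float, eta2: float, k: int) -> float:
    """Ratio of the expected intersection and union sizes (per genome position)."""
    shared = eta1 * eta2 * (1.0 - d) ** k
    return shared / (eta1 + eta2 - shared)


def corrected_distance(j: float, eta1: float, eta2: float, k: int) -> float:
    """Solve the expected-Jaccard relation for the distance, clamping at zero."""
    if j <= 0.0:
        return 1.0
    ratio = (eta1 + eta2) * j / (eta1 * eta2 * (1.0 + j))
    return max(0.0, 1.0 - ratio ** (1.0 / k))


# Round trip: 1X and 2X skims of genomes at distance 0.05, k = 31,
# with an assumed read length of 100 bp.
eta1 = sampling_probability(1.0, k=31, read_len=100)
eta2 = sampling_probability(2.0, k=31, read_len=100)
j = expected_jaccard(0.05, eta1, eta2, k=31)
d_hat = corrected_distance(j, eta1, eta2, k=31)  # recovers 0.05 up to rounding
```

With η1 = η2 = 1 the correction reduces to the full-coverage estimate, so the same function covers both regimes.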
Sequencing error
Each error reduces the number of shared k-mers and increases the total number of observed k-mers, and thus can also change the Jaccard index. Let є denote the base-miscall rate. For large k and small є, the probability that an erroneous k-mer produces a non-novel k-mer is negligible. The probability that a k-mer is covered by at least one read, without any error, is approximately
$$\eta_i = 1 - e^{-\lambda_i(1 - \epsilon)^k}. \tag{3}$$
Adding up the number of error-free and erroneous k-mers, the total number of k-mers observed from both genomes can again be approximately modeled as a Gaussian with mean $\zeta_i L$ for
$$\zeta_i = \eta_i + \lambda_i\left(1 - (1 - \epsilon)^k\right). \tag{4}$$
Just as before, we can simply estimate D by solving for it in
$$J = \frac{\eta_1\eta_2(1 - D)^k}{\zeta_1 + \zeta_2 - \eta_1\eta_2(1 - D)^k}. \tag{5}$$
When the coverage is sufficiently high, each k-mer will be covered by multiple reads with high probability, and low-abundance k-mers can be safely considered as erroneous. Mash has an option to filter out k-mers with abundances less than some threshold m to remove k-mers that are likely to be erroneous. In this case, $\zeta_i = \eta_i$, assuming all erroneous k-mers are removed. For instance, filtering single-copy k-mers (i.e., m = 2) gives us:
$$\eta_i = \zeta_i = 1 - e^{-\lambda_i(1 - \epsilon)^k} - \lambda_i(1 - \epsilon)^k e^{-\lambda_i(1 - \epsilon)^k}, \tag{6}$$
and the Jaccard index follows the same equation as (5). Since this filtering approach only works for high coverage, we filter low-abundance k-mers only when our estimated coverage is higher than a threshold (described below). Note that the genome-skims compared may use different filtering schemes, yet Equation 5 holds regardless.
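To make the roles of η and ζ concrete, here is a minimal sketch that computes both quantities with and without abundance filtering. It is our own reading of the model rather than Skmer's code: we assume that every erroneous k-mer is novel, that erroneous k-mers arise at a rate of λ(1 − ρ) per genome position with ρ = (1 − є)^k, and that filtering at threshold m removes all of them. The generalization to thresholds above m = 2 and the 100 bp read length are also assumptions.

```python
import math


def eta_zeta(coverage, eps, k, read_len, min_count=1):
    """Return (eta, zeta): per-position rates of error-free and of all observed k-mers.

    min_count=1 keeps every k-mer; min_count=m drops k-mers seen fewer than m
    times and assumes that this removes every erroneous k-mer.
    """
    lam = coverage * (1.0 - k / read_len)  # k-mer coverage
    rho = (1.0 - eps) ** k                 # probability that a k-mer is error-free
    mu = lam * rho                         # mean number of error-free copies
    if min_count <= 1:
        eta = 1.0 - math.exp(-mu)
        zeta = eta + lam * (1.0 - rho)     # assumption: each erroneous k-mer is novel
        return eta, zeta
    # Poisson probability of seeing at least min_count error-free copies
    below = sum(mu ** t / math.factorial(t) for t in range(min_count))
    eta = 1.0 - math.exp(-mu) * below
    return eta, eta


# A 1X skim without filtering and an 8X skim with single-copy k-mers removed.
eta_low, zeta_low = eta_zeta(1.0, eps=0.01, k=31, read_len=100)
eta_high, zeta_high = eta_zeta(8.0, eps=0.01, k=31, read_len=100, min_count=2)
```

At 1X, the erroneous k-mers make ζ noticeably larger than η, which is why ignoring sequencing error deflates the Jaccard index at low coverage.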
Differing genome lengths
Based on a model where the genomic distance between genomes of different lengths is defined to be confined to the mutations that fall on homologous sequences, we can derive
$$D = 1 - \left(\frac{(\zeta_1 L_1 + \zeta_2 L_2)J}{\eta_1\eta_2\min(L_1, L_2)(1 + J)}\right)^{1/k}.$$
This computation does not penalize for genome length difference. While a rigorous modeling of evolutionary distance for genomes of different lengths requires sophisticated models of gene gain, duplication, and loss, we take the heuristic approach used by Ondov et al. [41] and simply replace $\min(L_1, L_2)$ with $(L_1 + L_2)/2$. This ensures that the estimated distance increases as genome lengths become successively more different. This leads us to our final estimate of distance given by:
$$D = 1 - \left(\frac{2(\zeta_1 L_1 + \zeta_2 L_2)J}{\eta_1\eta_2(L_1 + L_2)(1 + J)}\right)^{1/k}. \tag{7}$$
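The final estimate can be sketched as a single function. This is a hedged illustration under the model described above, in which the expected number of shared k-mers is η1η2(1 − D)^k times the average genome length and the expected number of observed k-mers is ζ1L1 + ζ2L2. The function name and all numeric values below are made up for the round-trip check and do not come from the datasets in this paper.

```python
def skim_distance(j, eta1, zeta1, len1, eta2, zeta2, len2, k):
    """Distance estimate that accounts for coverage, error, and genome length.

    Solves J = s / (zeta1*len1 + zeta2*len2 - s) for D, where
    s = eta1 * eta2 * (len1 + len2) / 2 * (1 - D)^k.
    """
    if j <= 0.0:
        return 1.0
    observed = zeta1 * len1 + zeta2 * len2
    shared_scale = eta1 * eta2 * 0.5 * (len1 + len2)
    ratio = observed * j / (shared_scale * (1.0 + j))
    return max(0.0, 1.0 - ratio ** (1.0 / k))


# Round trip with illustrative (made-up) parameters.
eta1, zeta1, len1 = 0.45, 0.65, 2.0e8
eta2, zeta2, len2 = 0.40, 0.60, 1.3e8
k, d_true = 31, 0.1
shared = eta1 * eta2 * 0.5 * (len1 + len2) * (1.0 - d_true) ** k
j = shared / (zeta1 * len1 + zeta2 * len2 - shared)
d_hat = skim_distance(j, eta1, zeta1, len1, eta2, zeta2, len2, k)
```

When the two lengths are equal, the length terms cancel and the function reduces to the equal-length estimate with ζ1 + ζ2 in the numerator.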
2.2 Estimating Coverage
So far we have assumed a perfect knowledge of sequencing depth and error. We will continue to use a given constant base error rate є (either known or estimated from Phred scores). However, for genome-skims, the genome length is not known; thus, we need to estimate the coverage in order to apply our distance correction.
The sequencing depth, which is the average number of reads covering a position in the genome, can be estimated from the k-mer coverage profiles. The number of reads covering a k-mer is a Poisson r.v. with mean λ, where λ is defined as the k-mer coverage. As we look into the histogram data, it is easier to work with counts instead of probabilities. Let M denote the total number of k-mers of length k in the genome, and let $M_i$ count the number of k-mers covered by i reads. Thus, for i ≥ 0, $E[M_i] = M\frac{\lambda^i}{i!}e^{-\lambda}$. For a given set of reads, we can count the number of times that each k-mer is seen, and assuming zero sequencing error, it equals the number of reads covering that k-mer. Then, we can aggregate the number of k-mers covered by i reads and find $M_i$ for i ≥ 1. However, since in a genome-skim large parts of the genome may not be covered, both M and $M_0$ are unknown. To deal with this issue, we could take the ratio of consecutive counts to get a series of estimates of λ as $\hat{\lambda} = \frac{M_{i+1}}{M_i}(i + 1)$. In practice, sequencing errors change the frequency of k-mers, which has to be considered when estimating the coverage. As before, we assume that the k-mer length k is large enough that any error will introduce a novel k-mer, so the count of all erroneous k-mers is added to the count of single-copy k-mers. Moreover, for k-mers with more than one copy, the number of times that each k-mer is seen equals the number of reads covering that k-mer without any error. Formally, let $\tilde{M}_i$ denote the count of k-mers seen i times in the presence of error, and let $\rho = (1 - \epsilon)^k$ denote the probability of an error-free k-mer. Under these assumptions, $E[\tilde{M}_i] = M\frac{(\lambda\rho)^i}{i!}e^{-\lambda\rho}$ for i ≥ 2, while $E[\tilde{M}_1] = M\lambda\rho e^{-\lambda\rho} + M\lambda(1 - \rho)$ also includes the erroneous k-mers.
If we know the error rate, then λ can be estimated using the information in the $\tilde{M}_i$'s. Similar to the case of zero error, a family of estimates is obtained by taking the ratio of consecutive counts:
$$\hat{\lambda} = \frac{\tilde{M}_{i+1}}{\tilde{M}_i}\cdot\frac{i + 1}{\rho} \quad \text{for } i \ge 2. \tag{8}$$
For the case of i = 1, the ratio $\tilde{M}_2/\tilde{M}_1$ does not have a closed-form inverse, so we solve the equation numerically from an initial estimate. While any of these can be used in principle, the empirical performance can be affected by the choice; in our tool, we use heuristic rules (described below) that seek to use error-free but large $M_i$ values.
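The ratio-of-consecutive-counts idea can be sketched as follows. This is an illustration under the Poisson model described above, not the heuristic used in the tool: it covers only i ≥ 2 when the error rate is nonzero (the i = 1 case needs a numerical solve, which is omitted here), and the histogram is synthetic, built from the expected counts for an assumed λ = 1.5. Function names are ours.

```python
import math


def lambda_from_histogram(counts, i, eps=0.0, k=31):
    """Estimate the k-mer coverage from counts[i] and counts[i + 1].

    counts[i] is the number of distinct k-mers seen exactly i times.
    With eps > 0, only i >= 2 is supported, since counts[1] is inflated
    by erroneous k-mers.
    """
    if eps > 0.0 and i < 2:
        raise ValueError("with sequencing error, use i >= 2")
    rho = (1.0 - eps) ** k  # probability that a k-mer is error-free
    return (i + 1) * counts[i + 1] / (counts[i] * rho)


def coverage_from_lambda(lam, k, read_len):
    """Invert lambda = c (1 - k / read_len) to get the sequencing depth c."""
    return lam / (1.0 - k / read_len)


# Synthetic expected histogram for M = 1e6 k-mers, lambda = 1.5, eps = 0.01.
M, lam, eps, k = 1_000_000, 1.5, 0.01, 31
rho = (1.0 - eps) ** k
mu = lam * rho
counts = {i: M * mu ** i * math.exp(-mu) / math.factorial(i) for i in range(1, 8)}
counts[1] += M * lam * (1.0 - rho)  # erroneous k-mers land in the single-copy bin

lam_hat = lambda_from_histogram(counts, i=2, eps=eps, k=k)
c_hat = coverage_from_lambda(lam_hat, k=k, read_len=100)  # read length is assumed
```

On real histograms the counts are noisy, so the choice of i matters; larger bins give more stable ratios, which is consistent with the preference for large, error-free counts stated above.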
3 Experimental setup
Skmer takes as input two or more genome-skims and a point estimate of sequencing error, e. It uses JellyFish [42] to compute Mi values, which are then used in estimating λ based on Equation 8. We first compute finally, if c < 2, . Then, Mash is used to estimate the Jaccard index, as described below. Finally, we use Equation 7 to compute the hamming distance with η and ζ values computed using Equations 3, 4 if c < 5 or else using Equation 6. Also, the genome length L is estimated as the total sequence length divided by the coverage c.
We used a series of experiments to (i) study the accuracy of our new approach compared to existing methods with respect to computing the hamming distance, and (ii) find the reference match to a query sequence in a reference dataset of genome-skims, or the closest match when the query is not included in the reference.
We compared performance against Mash/Mash* and AAF [30]. For Mash and Skmer, we used k = 31 (selected empirically; Fig. S1) and sketch size $10^7$. As Mash handles errors by removing low-copy k-mers, we set the minimum count for k-mers to be included according to our estimate of c. We also created a version of Mash called Mash* that did not use the approximation of D made by Ondov et al. [41] and instead applied Equation 2 directly. AAF [30] is another method that uses k-mers to estimate distances. AAF has an algorithm to correct hamming distances for low coverage, but the correction relies on adjusting the length of tip branches in a distance-based inferred phylogeny. As such, it cannot run on a pair of genomes and requires at least four genomes. Also, AAF leaves coverage estimation to the user with some guidelines, which we fully follow (Appendix C).
Genomic Datasets
We used three sets of publicly available assembled genomes (Tables S3–S5) and used ART [44] to simulate genome-skims, controlling for the sequencing depth (coverage) and introducing sequencing error at a fixed rate of є = 0.01 (Appendix C). Specifically, the data included 21 Drosophila genomes (flies) and 22 genomes from the Anopheles genus (mosquitoes) obtained from InsectBase [45], and 47 avian species from the Avian Phylogenomic Project [46, 47]. We also used simulations to control mutation distance between pairs of genomes. As a challenging case, we took the highly repetitive assembly of the wasp species Cotesia vestalis, and mutated it artificially; we only applied single nucleotide mutations distributed uniformly at random across the genome. We repeated the study on the simpler case of the fly species D. melanogaster. As with the real genomes, we generated genome-skims using ART with є = 0.01 and varying coverage between and 16X. For simulated genomes, we repeated the skimming 10 times and reported the mean and standard error.
Evaluation Metrics
For simulated data, the true distance is controlled and is thus known. For biological datasets, the ground truth is unknown. Instead, we use the distance measured on the full assembly by each method as its ground truth; thus, the ground truth for AAF is computed using AAF. We show both the absolute error and the relative error, measured as $|d - \hat{d}|/d$ where $d$ and $\hat{d}$ are the true and the estimated distances.
Leave-i-out
We used a leave-i-out strategy to study the accuracy of searching for a query genome in a reference set. For a query genome $G_q$ in a set of n genomes $\{G_1, \dots, G_n\}$, we ordered all genomes based on their distances to $G_q$ calculated using the full assemblies, which represents the ground truth; let $G_{q_1}, \dots, G_{q_n}$ denote the order (note $G_{q_1} = G_q$). For 1 < i < n, we removed the closest i – 1 genomes to $G_q$ from the reference dataset, leaving us with $G_{q_i}, \dots, G_{q_n}$. We then ordered the remaining genomes by each method; let $x_1, \dots, x_{n-i+1}$ be the order obtained by a method and let r be the rank of the best remaining genome according to the ground truth in the estimated order (i.e., $x_r = G_{q_i}$). Since r = 1 implies perfect performance, and r > 1 indicates error, we measured error as the mean of r – 1 across all query genomes (1 ≤ q ≤ n).
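The rank error for a single query can be sketched in a few lines. This is a minimal illustration of the procedure described above, assuming that distances are stored in dictionaries keyed by genome name and that the query itself appears in both dictionaries with distance zero; the function name and the toy distances are ours.

```python
def rank_error(true_dist, est_dist, i):
    """Return r - 1 for one query in a leave-out experiment.

    true_dist and est_dist map genome names to their distance from the query
    (the query itself included, at distance 0). The closest i - 1 genomes by
    true distance are removed before ranking the rest by estimated distance.
    """
    truth_order = sorted(true_dist, key=true_dist.get)
    remaining = truth_order[i - 1:]   # drop the closest i - 1 genomes
    best = remaining[0]               # best remaining match by ground truth
    est_order = sorted(remaining, key=est_dist.get)
    return est_order.index(best)      # zero-based rank, i.e. r - 1


# Toy example: the estimates swap the order of genomes "a" and "b".
true_dist = {"q": 0.0, "a": 0.01, "b": 0.05, "c": 0.10}
est_dist = {"q": 0.0, "a": 0.06, "b": 0.05, "c": 0.12}

err_i2 = rank_error(true_dist, est_dist, i=2)  # "a" is ranked second: error 1
err_i3 = rank_error(true_dist, est_dist, i=3)  # "b" is ranked first: error 0
```

Averaging this quantity over all queries for a fixed i gives the mean rank error reported in the leave-out experiments.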
4 Results
4.1 Hamming distance for pairs of genome-skims
We first study the accuracy of Mash and Skmer in estimating the hamming distance between a pair of genomes. Since AAF cannot be run on pairs of genomes, we do not test it in our first set of analyses.
Simulated Genomes
On simulated genomes, where we control both the distance and coverage, distances are computed with high accuracy by Mash when coverage is high (Fig. 1a), except where the true distance is also high (i.e., 0.2). However, the accuracy of Mash quickly degrades when the coverage is reduced to 4X or less. In contrast, even when the coverage is reduced to , Skmer has high accuracy. For example, with the true distance set to 0.05, Mash estimates the distance as 0.085 with 1X coverage (an overestimation by 70%) while Skmer corrects the distance to 0.044 (an underestimation by 12%). Note that applying Mash* to the complete assemblies generally generates very accurate results, as expected, but even given the full assembly, Mash* still has a small but noticeable error when d = 0.2. We note that repeating skimming ten times with different samples produces extremely consistent estimates. Repeating the process with the Drosophila melanogaster genome as the base genome also produces similar results (Fig. S2). The only condition where Skmer has considerable absolute error is with coverage below 1X and d = 0.2 (Fig. 1a). However, we note that for d = 0.001, the relative error is not small with low coverage (Fig. S3b) indicating that distinguishing very small distances (perhaps below species-level) requires high coverage. Estimating the right order of magnitude when the true distance is 0.001 seems to require at least 2X coverage while 1X coverage is sufficient to distinguish distances at or above 0.01 (Fig. S3).
To find the minimum levels of coverage required for accurately estimating the hamming distance using Skmer, we repeat the simulation but range the coverage from to 1X (Fig. S4). Interestingly, even with very low coverage, the absolute error in estimated distances is relatively small, especially when the true distance is also small (for d ≥ 0.1, Skmer estimates start to degrade below coverage).
Real Genomes
We now test methods on real pairs of insect and avian genomes. Note that unlike the simulated datasets, here, genomes can undergo all types of genetic variations and complex rearrangements, and thus do not have the same length. Since the true distance cannot be controlled, we carefully selected several pairs of genomes to cover a wide range of mutation distance and genome length. Here, the distance estimated by Mash* on the assemblies is considered the true distance. For all pairs of insect and avian genomes (Fig. 1b and Fig. S5), Mash has high error for coverage below 8X while Skmer successfully corrects the estimated distance and obtains values extremely close to the results of running Mash* on the full assembly. For example, the distance between A. stephensi with length ~196Mbp and A. maculatus with length ~132Mbp is estimated to be 0.104 based on the full assembly and 0.103 (1% underestimation) with only coverage using Skmer, while Mash would estimate the distance to be 0.168 (60% overestimation). Interestingly, on real data, Skmer seems to have even less error than on simulated genomes.
Coverage estimates
Our estimates of c are close to the true c used in simulations (Fig. S6a). Notably, Skmer run with the true coverage is less accurate than with estimated coverage (Fig. S6b). We speculate that on genomes with repeats, by slightly overestimating coverage, our method gives an estimate of the “effective” coverage, reducing the impact of repeats on the Jaccard index.
4.2 Sets of genome-skims
We now turn to datasets with sets of genome-skims. So far, our experiments have controlled for the coverage by skimming varying amount of sequence data, proportional to the genome length. In our genome-skimming application, coverage will not be fixed. Often, the amount of sequence data obtained for each species will be relatively similar. As a result, genomes of different length end up being sequenced with different coverage depth proportional to the inverse of their length. Moreover, the sequencing effort per species may also vary across sequencing protocols, experiments and research labs, and so a database of reference genome-skims may consist of samples with heterogeneous sequencing coverages. We now study the accuracy of different methods in the presence of mixed coverage.
Fixed sequencing effort
We start with experiments where all species are skimmed with the same sequencing effort (0.1Gb, 0.5Gb, or 1Gb) and measure the error in the estimated mutation distance between all pairs of species in the Anopheles, Drosophila, and avian datasets (Figs. 2, S7-S9). The error in the distance estimated by Mash relative to the ground truth can be quite large (higher than 200% in the worst case) while Skmer consistently makes accurate estimates close to the true distance even at the lowest amount of coverage (Table 1). We should note that the typical genome length of species varies among these three datasets, and so equal sequencing effort means an unequal range of sequencing coverage. For instance, the bird genomes are on average ~5 times larger than Anopheles genomes; thus, birds need to be skimmed with a larger amount of sequence to have an accuracy comparable with Anopheles species. As expected, increasing coverage reduces the error for all methods including Mash (Figs. S7-S9).
Heterogeneous sequencing effort
We now further mix coverages as follows to capture the scenario where genome-skims come from various labs or experimental protocols. For each species, we choose its total sequencing effort from three possible values 0.1Gb, 0.5Gb, and 1Gb, uniformly at random, and estimate all pairs of distances within each dataset as before (Fig. 3). Here, in addition to Mash, we also compare our results with AAF. Similar to the case of fixed sequencing effort, Skmer mitigates large relative error in the distances estimated by Mash and produces accurate results. The correction applied by AAF also reduces the impact of low coverage to some extent; still Skmer has considerably less error (Table 2). For example, in the Drosophila dataset, the worst-case error of AAF is above 70%, whereas it never exceeds 4% for Skmer.
Sequencing Error
We tested the impact of (i) providing an incorrect estimate of є to Skmer and (ii) using uneven distributions of error that change across the length of the read to emulate the Illumina HiSeq2000 platform. Skmer seems generally robust to mis-specifications of the sequencing error model, especially when the error is underestimated (Fig. 4 and Table S6). However, overestimating the error (e.g., setting it to 2% where the true error rate is 1%) leads to a noticeable increase in the distance errors. Using uneven patterns of error across a read has minimal negative impact on the accuracy of Skmer.
Running time
In terms of running time, Skmer and Mash are comparable while AAF is much slower. For example, the total running time (using 24 CPU cores) to compute distances based on genome-skims for all pairs of birds using Mash, Skmer, and AAF was roughly 8, 33, and 460 minutes, respectively.
4.3 Leave-out search against a reference database of genome-skims
We now study the effectiveness of using hamming distance to search a database of genome-skims to find the closest match to a query genome-skim. Given a query genome-skim and a reference dataset of genomes, we can order the reference genomes based on their hamming distance to the query. The results can be provided to the user as a ranking. When the query genome is available in the reference dataset, finding the match is relatively easy. To study the effectiveness of the search as the distance of the closest available match increases, we use a leave-out experiment, as described earlier in Section 3. Figure 5 shows the mean rank error of the best remaining match in a leave-out experiment when removing i – 1 genomes for 1 < i < n. Recall that rank error zero corresponds to a perfect match to the best available genome.
On the Anopheles dataset, Skmer consistently outperforms Mash and AAF in terms of finding the best remaining match. In fact, for finding the second, third, or fourth best match, Skmer has close to zero error. In contrast, Mash and AAF are on average off by one genome even for finding the second best match. On the Drosophila dataset, finding matches seems relatively easier for all three methods. Still, in finding the second best match, Skmer again has close to zero error while AAF and Mash are each off by a genome close to half of the time. After the second best match, AAF and Skmer have comparable accuracy while Mash is considerably worse. These results demonstrate that correcting the distance impacts not only our understanding of the absolute distance, but also estimates of the relative distance of genome-skims.
5 Discussion
We showed that hamming distances as small as 0.01 can be estimated accurately from genome-skims with 1X or lower coverage. What does a distance of 0.01 mean? The answer will depend on the organisms of interest. For example, two eagle species of the same genus (H. albicilla and H. leucocephalus) have D ≈ 0.003 but two Anopheles species of the same species complex (A. gambiae and A. coluzzii) have D ≈ 0.018. Broadly speaking, for eukaryotes, detecting distances on the order of $10^{-2}$ is often enough to distinguish between species (Fig. S10). On the other hand, distances on the order of $10^{-3}$ often differentiate between populations or very similar species. Detection at these lower levels seems to require 2X coverage using Skmer (Fig. S3b) but future work should study the exact level of sequencing required for accurate ordering of species at distances on the order of $10^{-3}$ or less. Moreover, the question of the minimum coverage required may lend itself to information-theoretical bounds and near-optimal solutions, similar to those established for the assembly problem [48, 49].
All of our tests in this study were based on simulating genome-skims from assemblies by sub-sampling reads and adding sequencing error. While this provided us with reliable ground truth of distances, real applications of genome-skims may face further complications. For example, the actual coverage of real genome-skims may not be uniform and randomly distributed. At a minimum, actual genome skims will have an overrepresentation of mitochondrial or plastid sequence. Moreover, the read length may be different between the query and the reference genome-skims. More importantly, other sources of DNA originating from for example, parasites, diet, fungi, commensals, bacteria, and human contamination may all be present in the sample and may cause an over-estimation of the distance. This may or may not impact the ranking of a genome skim with regards to the reference species, but it certainly can impact the value of the estimated distance. We recommend that before using Skmer, database searches should be used to find and eliminate bacterial or fungal contamination (perhaps using metagenomic tools such as Kraken [50]). Our future efforts will further study ways to eliminate impacts of external DNA. A related direction of future work is to explore whether Skmer can be extended to environmental DNA analyses, i.e., queries consisting of genome-skims of multi-taxa samples. While Skmer is presented here in a general setting, its best use is for eukaryotic organisms, where the notion of species is better established and species can be separated with reasonable effort. We tested Skmer on birds and insects, but we predict it will work equally well for plants, a prediction that should be tested in future work.
The connection between hamming distance and phylogenetic distance depends on the mutation processes considered. If only substitutions are allowed and assuming the Jukes-Cantor model [51], the phylogenetic distance is $-\frac{3}{4}\ln\left(1 - \frac{4}{3}D\right)$; note this transformation is monotonic and does not change rankings of matches to a query search. Assuming a more complex model such as GTR [52], hamming distance is not enough to estimate the phylogenetic distance. However, we have devised a simple procedure to estimate GTR distances using the log-det approach [53] by repeated applications of Skmer to perturbed reads (Appendix B). The GTR distances can rank matches to a query differently from the hamming distance; the accuracy of the two distances should be compared in future work. Insertions, deletions, duplications, losses, and repeats can all reduce the Jaccard index and thus increase the hamming distance. However, with these mutations, the correct definition of the evolutionary distance is not straightforward; nor is its relationship to hamming distance or Jaccard index clear. Here, we focused on estimating the hamming distance with high accuracy despite low coverage, leaving these broader questions to future work.
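For reference, the Jukes-Cantor correction mentioned above is the standard transformation −(3/4) ln(1 − (4/3)D). The sketch below applies it to a few hamming distances to show that the transformation preserves their order; the function name is ours and the input values are arbitrary.

```python
import math


def jukes_cantor(hamming: float) -> float:
    """Jukes-Cantor phylogenetic distance for a per-site hamming distance."""
    if not 0.0 <= hamming < 0.75:
        raise ValueError("the correction is defined only for 0 <= D < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * hamming / 3.0)


# The transformation is monotonic, so the ranking of matches is unchanged.
hamming_distances = [0.003, 0.018, 0.05, 0.1, 0.2]
corrected = [jukes_cantor(d) for d in hamming_distances]
same_order = corrected == sorted(corrected)
```

The corrected values are always at least as large as the hamming distances, with the gap growing as multiple substitutions at the same site become more likely.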