Abstract
Given a massive collection of sequences, it is infeasible to perform pairwise alignment for basic tasks like sequence clustering and search. To address this problem, we demonstrate that the MinHash technique, first applied to clustering web pages, can be applied to biological sequences with similar effect, and extend this idea to include biologically relevant distance and significance measures. Our new tool, Mash, uses MinHash locality-sensitive hashing to reduce large sequences to a representative sketch and rapidly estimate pairwise distances between genomes or metagenomes. Using Mash, we explored several use cases, including a 5,000-fold size reduction and clustering of all ~55,000 NCBI RefSeq genomes in 46 CPU hours. The resulting 93 MB sketch database includes all RefSeq genomes, effectively delineates known species boundaries, reconstructs approximate phylogenies, and can be searched in seconds using assembled genomes or raw sequencing runs from Illumina, Pacific Biosciences, and Oxford Nanopore. For metagenomics, Mash scales to thousands of samples and can replicate Human Microbiome Project and Global Ocean Survey results in a fraction of the time. Other potential applications include any problem where an approximate, global sequence distance is acceptable, e.g. to triage and cluster sequence data, assign species labels to unknown genomes, quickly identify mistracked samples, and search massive genomic databases. In addition, the Mash distance metric is based on simple set intersections, which are compatible with homomorphic encryption schemes. To facilitate integration with other software, Mash is implemented as a lightweight C++ toolkit and freely released under a BSD license at https://github.com/marbl/mash.
Introduction
When BLAST was first published in 1990 (ref. 1), there were fewer than 50 million bases of nucleotide sequence in the public archives (http://www.ncbi.nlm.nih.gov/genbank/statistics); now a single sequencing instrument can produce over 1 trillion bases per run2. New methods are needed that can manage and help organize this scale of data. To address this, we consider the general problem of computing an approximate distance between two sequences and describe Mash, a general-purpose toolkit that utilizes the MinHash technique3 to reduce large sequences (or sequence sets) to compressed sketch representations. Using only the sketches, which can be thousands of times smaller, the similarity of the original sequences can be rapidly estimated with bounded error. Importantly, the error of this computation depends only on the size of the sketch and is independent of the genome size. Thus, sketches comprising just a few hundred values can be used to approximate the similarity of arbitrarily large datasets. This has important applications for large-scale genomic data management and emerging long-read, single-molecule sequencing technologies.
The MinHash technique is a form of locality-sensitive hashing4 that has been widely used for the detection of near-duplicate Web pages and images5, 6, but has seen limited use in genomics despite initial applications over ten years ago7. More recently, MinHash has been applied to the relevant problems of genome assembly8, 16S rDNA gene clustering9, and metagenomic sequence clustering10. Because of the extremely low memory and CPU requirements of this probabilistic approach, MinHash is well suited for data-intensive problems in genomics. To facilitate this, we have developed the Mash toolkit for flexible construction, manipulation, and comparison of MinHash sketches from genomic data. We build upon past applications of MinHash by deriving a new significance test and distance metric, the Mash distance, which estimates a simple evolutionary distance. Similar ‘alignment-free’ methods have a long history in bioinformatics11,12. However, methods based on string matching must process the entire sequence with each comparison13-16, while methods based on short word counts have lacked the ability to differentiate closely related sequences17-20. In contrast, the Mash distance can be quickly computed from the size-reduced sketches alone, producing a result that strongly correlates with Average Nucleotide Identity (ANI)21. Thus, Mash combines the high specificity of matching-based approaches with the dimensionality reduction of statistical approaches.
Mash provides two basic functions for sequence comparisons: sketch and dist. The sketch function converts a sequence or collection of sequences into a MinHash sketch (Figure 1). The dist function compares two sketches and returns, for the original sequences, an estimate of the Jaccard index (i.e. the fraction of shared k-mers), a P-value, and the Mash distance, which estimates the rate of sequence mutation under a simple evolutionary model16 (Methods). Since Mash relies only on comparing length-k substrings, or k-mers, the inputs can be whole genomes, metagenomes, nucleotide sequences, amino acid sequences, or raw sequencing reads. Each input is simply treated as a collection of k-mers taken from some known alphabet, allowing many applications. Here we examine three specific use cases: (1) sketching and clustering the entire NCBI RefSeq genome database, (2) searching assembled and unassembled genomes against the sketched RefSeq database, and (3) computing a distance between metagenomic samples using both assembled and unassembled read sets. Additional applications can be envisioned and are covered in the Discussion.
Results
Clustering all genomes in NCBI RefSeq
Mash enables scalable whole-genome clustering, which is an important application for the future of genomic data management. As genome databases increase in size, and whole-genome sequencing becomes routine, it will become infeasible to manually assign taxonomic labels for all genomes. Thus, generalized and automated methods will be useful for constructing and partitioning groups of related genomes, e.g. for the automated detection of outbreak clusters22. To illustrate the utility of Mash, we sketched and clustered all of NCBI RefSeq Release 7023, totaling 54,118 organisms and 618 Gbp of genomic sequence. The resulting sketches total only 93 MB, yielding a compression factor of >5,000-fold versus the uncompressed FASTA (674 GB). Sketching all genomes and computing all ~1.5 billion pairwise distances required just 26.1 and 20.3 CPU hours, respectively (the sketch database is provided as Supplementary Data 1). This process is easily parallelized, which can reduce the wall clock time to minutes with sufficient compute resources. Once constructed, additional genomes can be added incrementally to the full RefSeq database in just 0.9 CPU seconds per 5 MB genome (or 4 CPU minutes for a 3 GB genome).
The resulting Mash distances correlate well with ANI, with D ≈ 1 − ANI across multiple sketch and k-mer sizes (Figure 2). Due to the high cost of computing ANI, a subset of 500 Escherichia genomes was selected for comparison (Supplementary Table 1). For ANI in the range of 90–100%, the correlation with Mash distance is very strong across multiple sketch sizes and choices of k. For the default sketch size of s=1,000 and k=21, Mash approximates 1 − ANI with a root-mean-square error of 0.00274 on this dataset. This correlation begins to degrade for more divergent genomes because the Mash estimate becomes more variable and penalizes genome size differences, whereas ANI is based solely on the core genome. Increasing the sketch size reduces the variance of the Mash estimates, especially for divergent genomes (Supplementary Figures 1 and 2), while the choice of k is a tradeoff between sensitivity and specificity. Smaller values of k are more sensitive for divergent genomes, but lose specificity for large genomes due to chance k-mer collisions (Supplementary Figure 3). Such chance collisions skew the Mash distance, but given a known genome size, undesirable k-mer collisions can be avoided by choosing a suitably large value of k (Methods); however, an overly large k can result in no shared k-mers being found.
Approximate species clusters can be generated from the all-pairs distance matrix by graph clustering methods or by simple thresholding of the Mash distance to create connected components. To illustrate, we linked all RefSeq genomes with a pairwise Mash distance ≤0.05, which equates to an ANI of ≥95%. This threshold roughly corresponds to a 70% DNA-DNA reassociation value—a historical, albeit debatable, definition of bacterial species21. Figure 3 shows the resulting graph of significant (P ≤ 10^-10) pairwise distances with D ≤ 0.05 for all microbial genomes. Simply considering the connected components of the resulting graph yields a partitioning that largely agrees with the current NCBI bacterial species taxonomy. Eukaryotic and plasmid components are shown in Supplementary Figures 4 and 5, but would require alternate parameters for species-specific clustering due to their varying characteristics. Beyond simple clustering, the Mash distance approximates the mutation rate and can also be used to rapidly approximate phylogenies using hierarchical clustering. For example, all pairwise Mash distances for 17 RefSeq primate genomes were computed in just 2.5 CPU hours (11 minutes wall clock on 17 cores) with default parameters (s=1,000 and k=21) and used to build a neighbor-joining tree24. Figure 4 compares this tree to an alignment-based phylogenetic model tree downloaded from the UCSC genome browser25. The Mash and UCSC primate trees are topologically consistent for everything except the Homo/Pan split, for which the Mash topology is more similar to past phylogenetic studies26 and mitochondrial trees12. On average the Mash branch lengths are slightly longer, with a Branch Score Distance27 of 0.10 between the two trees, but additional distance corrections are possible for k-mer based models16. However, due to limitations of both the k-mer approach and the simple distance model, we emphasize that Mash is not explicitly designed for phylogeny reconstruction, especially for genomes with large size differences, and should be used only in cases where such approximations are sufficient.
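As a minimal illustration of this thresholding step (a Python sketch only, not Mash code; the tuple-based input format, the function name, and the example distances are assumptions), connected components can be formed from a pairwise distance list with a simple union-find:

def species_clusters(pairwise, threshold=0.05):
    """Connected components of genomes linked by Mash distance <= threshold.
    `pairwise` is an iterable of (genome_a, genome_b, distance) tuples (assumed format)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b, d in pairwise:
        find(a)
        find(b)                             # register both genomes as nodes
        if d <= threshold:
            union(a, b)                     # link genomes within the distance threshold

    clusters = {}
    for genome in list(parent):
        clusters.setdefault(find(genome), set()).add(genome)
    return list(clusters.values())

# e.g. (hypothetical distances):
# species_clusters([("E. coli K-12", "E. coli O157:H7", 0.02),
#                   ("E. coli K-12", "B. subtilis", 0.30)])
# -> [{"E. coli K-12", "E. coli O157:H7"}, {"B. subtilis"}]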
Rapid genome identification from assemblies or reads
With a pre-computed sketch database, Mash is able to rapidly identify unknown genomes from both genome assemblies and raw sequencing reads. To illustrate, we computed Mash distances for multiple Escherichia coli datasets compared against the RefSeq sketch database (Table 1). This test included the K12 MG1655 reference genome as well as assembled and unassembled sequencing runs from the ABI 3730, Roche 454, Ion PGM, Illumina MiSeq, PacBio RSII, and Oxford Nanopore MinION instruments. For assembled genomes, the correct strain was identified in a few seconds using Lowest Common Ancestor (LCA) classification. For raw sequencing reads, the correct species was identified in all cases, including 1D MinION reads28, which had an average sequencing error rate of ~40%. The reduced resolution obtained when using raw sequences is due to erroneous k-mers, which introduce noise into the sketch. To mitigate this problem, Mash uses a streaming Bloom filter to probabilistically ignore single-copy k-mers from raw read sets, but some fraction of erroneous k-mers will persist and skew the Mash distance. Increasing the sketch and Bloom filter sizes improves accuracy for raw sequencing data, as can read trimming and/or error correction.
To further test Mash’s discriminatory power, we searched MinION reads collected from two closely related Bacillus species (ANI ≈ 95%) against the full RefSeq sketch database. In both cases Mash correctly identified the species, using 43,806 and 91,379 sequences collected from single MinION R7.3 runs of Bacillus anthracis Ames and Bacillus cereus ATCC 10987, respectively (combined 1D and 2D reads). In the case of the higher quality B. cereus reads, processed with a more recent ONT workflow (1.10.1 vs. 1.6.3), the correct strain was identified with simple LCA classification. Both searches required just one minute of CPU time and 209 MB of RAM. Such low-overhead searches could be used to quickly triage unknown samples or to rapidly select a reference genome for further, more detailed comparative analyses.
Clustering massive metagenomic datasets
Mash can also replicate the function of k-mer based metagenomic comparison tools, but in a fraction of the time previously required. The metagenomic comparison tool DSM, for example, computes an exact Jaccard index using all k-mers that occur more than twice per sample29. By design, Mash rapidly approximates this result by filtering unique k-mers with a Bloom filter and estimating the Jaccard index via MinHashing. COMMET also uses k-mers to approximate similarity, but attempts to identify a set of similar reads between two samples using Bloom filters30, 31. The similarity of two samples is then defined as the fraction of similar reads that the two datasets share, which is essentially a read-level Jaccard index. Figure 5a replicates the analysis of Maillet et al.30, using both Mash and COMMET to cluster Global Ocean Survey (GOS) data32. Mash is over 50-fold faster than COMMET and correctly identifies the clusters from the original GOS study. This illustrates the incremental scalability of Mash: the primary overhead is sketching, which occurs only once per sample. After sketching, computing pairwise distances is nearly instantaneous. Thus, Mash avoids the quadratic barrier usually associated with all-pairs comparisons and scales well to many samples. For example, a new GOS sample could be added to the Mash distance table in less than a minute, compared to an hour for COMMET, making Mash ideal for real-time sample analysis.
For a large-scale test, samples from the Human Microbiome Project33 (HMP) and Metagenomics of the Human Intestinal Tract34 (MetaHIT) were combined to create a ~10 TB 888-sample dataset. Importantly, the size of a Mash sketch is independent of the input size, requiring only 70 MB to store the combined sketches (s=10,000, k=21) for these datasets. Both assembled and unassembled samples were analyzed, requiring 4.4 CPU hours to process all assemblies and 279.6 CPU hours to process all read sets. The two clusterings are remarkably similar, with all samples clearly grouped by body site (Figure 5b). However, because the Mash distance is based on k-mer sets, it is not sensitive to changes in relative abundance and may be more prone to batch effects, such as sequencing error rate. For example, Mash does not cluster MetaHIT samples by health status, as previously reported34, and MetaHIT samples appear to preferentially cluster with one another. It is not clear if this reflects true sample differences (e.g. American vs. European stool) or batch effect. Additionally, Mash identified outlier samples that were independently excluded by the HMP’s quality control process. When included in the clustering, these samples were the only ones that failed to cluster by body site (Supplementary Figure 6).
Discussion
Mash enables the comparison and clustering of whole genomes on a massive scale. Potential applications include the rapid triage and clustering of sequence data, for example, to quickly select the most appropriate reference genome for read mapping or to identify mis-tracked or low-quality samples that fail to cluster as expected. The strong correlation between Mash distance and ANI promises approximate phylogeny construction, which could be used to rapidly determine outbreak clusters for thousands of genomes in real time. Additionally, because the Mash distance is based upon simple set intersections, it can be computed using homomorphic encryption schemes35, enabling privacy-preserving genomic tests36.
Future applications of Mash could include read mapping and metagenomic sequence classification via windowed sketches or a containment score3 to test for the presence of one sequence within another. However, both of these approaches would require additional sketch overhead to achieve acceptable sensitivity. Improvements in database construction are also expected. For example, rather than storing a single sketch per sequence (or window), similar sketches could be merged to further reduce space and improve search times. Obvious strategies include choosing a representative sketch per cluster or hierarchically merging sketches via a Bloom tree37. Finally, both the sketch and dist functions are designed as online algorithms, enabling, for example, dist to continually update a sketch from a streaming input. The program could then be modified to terminate when enough data has been collected to make a species identification at a predefined significance threshold. This functionality is designed to support the analysis of real-time data streams, as is expected from nanopore-based sequencing sensors22.
METHODS
The Mash version 1.0 codebase is provided as Supplementary Data 2. Precompiled binary releases and source code updates are available from https://github.com/marbl/mash.
Mash sketch
To construct a MinHash sketch, Mash first determines the set of constituent k-mers by sliding a window of length k across the sequence. Mash supports arbitrary alphabets (e.g. nucleotide or amino acid) and both assembled and unassembled sequences. Without loss of generality, here we will assume a nucleotide alphabet Σ={A,C,G,T}. Depending on the alphabet size and choice of k, each k-mer is hashed to either a 32-bit or 64-bit value via a hash function, h. For nucleotide sequence, Mash uses canonical k-mers by default to allow strand-neutral comparisons. In this case, only the lexicographically smaller of the forward and reverse complement representations of a k-mer is hashed. For a given sketch size s, Mash uses a “bottom sketch” strategy returning the s smallest hashes output by h over all k-mers in the genome (Figure 1). For a sketch size s and genome size n, this can be efficiently computed in O(n log s) time by maintaining a sorted list of size s and updating the current sketch only when a new hash is smaller than the current sketch maximum. Further, the probability that the ith hash of the genome will enter the sketch is s/i, so the expected runtime of the algorithm is O(n + s log s log n) (Ref.3), which becomes nearly linear when n >> s.
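The bottom-sketch update is simple to express in code. The following Python sketch is illustrative only (Mash itself is implemented in C++ and uses MurmurHash3; the SHA-1-based stand-in hash, the function names, and the handling of non-ACGT characters here are our assumptions):

import hashlib
import heapq

COMP = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMP)[::-1]
    return min(kmer, rc)

def hash64(kmer):
    """Stand-in 64-bit hash; any well-mixed hash function illustrates the idea."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")

def bottom_sketch(seq, k=21, s=1000):
    """Return the s smallest distinct k-mer hashes of seq, in sorted order."""
    heap = []                              # max-heap of the current sketch (values negated)
    members = set()                        # hashes currently in the sketch
    for i in range(len(seq) - k + 1):
        h = hash64(canonical(seq[i:i + k]))
        if h in members:
            continue                       # sketch is a set: skip duplicates
        if len(heap) < s:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:                 # smaller than the current sketch maximum
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    return sorted(-x for x in heap)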
As demonstrated by Figure 3, a sketch comprising 400 32-bit hash values is sufficient to roughly group microbial genomes by species. With these parameters, the sketch for a ~3 billion base-pair human genome represents a million-fold (lossy) compression. However, the probability of a given k-mer K appearing in a random genome X of size n is:

P(K ∈ X) = 1 − (1 − |Σ|^−k)^n    (1)
Thus, for k=16 the probability of observing a given k-mer in a 3 Gbp genome is 0.50, and 25% of k-mers are expected to be shared between two random 3 Gbp genomes by chance alone. This will skew any k-mer based distance and make distantly related genomes appear more similar than reality. To avoid this phenomenon, it is sufficient to choose a value of k that minimizes the probability of observing a random k-mer. Given a known genome size n and the desired probability q of observing a random k-mer (e.g. 0.01), this can be computed as38:

k = log_|Σ|(n(1 − q)/q)    (2)

which yields k=14 and k=19 (rounding to the nearest integer) for 5 Mbp and 3 Gbp genomes (q=0.01), respectively. We have found k=21 gives accurate estimates in most cases (including metagenomes), so this is set as the default. However, for constructing the RefSeq database, k=16 was chosen so that each hash could fit in 32 bits, minimizing the database size at the expense of reduced specificity for larger genomes. The small k also improves sensitivity, which helps with noisy data like single-molecule sequencing (Supplementary Figure 3).
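As a quick numeric check of Equation 2 (a throwaway Python snippet; the function name and the choice to round to the nearest integer are ours):

import math

def min_k(n, q=0.01, alphabet=4):
    # Equation 2: k-mer size at which a random k-mer appears with probability ~q
    return round(math.log(n * (1 - q) / q, alphabet))

print(min_k(5e6))  # -> 14 for a 5 Mbp genome
print(min_k(3e9))  # -> 19 for a 3 Gbp genome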
Lastly, for sketching raw sequencing reads, Mash provides a streaming Bloom filter to remove erroneous k-mers. This approach assumes that redundancy in the data (e.g. depth of coverage >5) will result in true k-mers appearing multiple times in the input, while false k-mers will appear only once. To probabilistically exclude unique k-mers from the sketch, new hashes are only inserted into the sketch if they are found in the Bloom filter. If a new hash would have otherwise been inserted in the sketch, but was not found in the Bloom filter, it is inserted into the Bloom filter so that subsequent appearances of the hash will pass.
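The general idea of this single-copy filter is sketched below in Python. It is a simplified stand-in, not Mash's C++ implementation: it screens every k-mer hash rather than only sketch candidates, and the tiny Bloom filter class, its size, and its hash mixing are our assumptions:

class TinyBloom:
    """Minimal Bloom filter, for illustration only."""
    def __init__(self, bits=1 << 24, hashes=3):
        self.size, self.hashes = bits, hashes
        self.bits = bytearray(bits // 8)

    def _positions(self, h):
        # derive several bit positions from one integer hash (simple mixing)
        for i in range(self.hashes):
            yield (h + i * 0x9E3779B97F4A7C15) % self.size

    def add(self, h):
        for p in self._positions(h):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, h):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(h))

def filter_single_copy(kmer_hashes):
    """Yield only hashes seen before; single-copy k-mers are probabilistically dropped."""
    bloom = TinyBloom()
    for h in kmer_hashes:
        if h in bloom:
            yield h       # second or later appearance: allowed into the sketch
        else:
            bloom.add(h)  # first appearance: remember it, but do not sketch it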
Mash distance
A MinHash sketch of size s=1 is equivalent to the subsequent ‘minimizer’ concept of Roberts et al.39, which has been used in genome assembly40, k-mer counting41, and metagenomics42. Importantly, the more general MinHash concept permits an approximation of the Jaccard index J(A, B) = |A ∩ B| / |A ∪ B| between two k-mer sets A and B. Mash follows Broder’s original formulation and merge-sorts two bottom sketches S(A) and S(B) to estimate the Jaccard index3. The merge is terminated after s unique hashes have been processed (or both sketches are exhausted), and the Jaccard estimate is computed as j = x/s′ for x shared hashes found after processing s′ hashes. Because the sketches are stored in sorted order, this requires only O(s) time and effectively computes:

j = |S(A ∪ B) ∩ S(A) ∩ S(B)| / |S(A ∪ B)|    (3)

which is an unbiased estimate of the true Jaccard index, as illustrated in Figure 1. Conveniently, the error bound of the Jaccard estimate relies only on the sketch size and is independent of genome size43. Specific confidence intervals are given below and in Supplementary Figure 1. Note, however, that the relative error can grow quite large for very small Jaccard values (i.e. divergent genomes). In these cases, a larger sketch size or smaller k is needed to compensate. For flexibility, Mash can also compare sketches of different sizes, but such comparisons are constrained by the smaller of the two sketches, s1 < s2, and only the s1 smallest values are considered.
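A minimal Python version of this merge is shown below (illustrative only; Mash's C++ implementation differs in detail, and the function and parameter names are ours):

def jaccard_estimate(sketch_a, sketch_b, s=1000):
    """Merge-sort two sorted bottom sketches and estimate the Jaccard index as x / s'."""
    i = j = shared = processed = 0
    while processed < s and (i < len(sketch_a) or j < len(sketch_b)):
        a = sketch_a[i] if i < len(sketch_a) else float("inf")
        b = sketch_b[j] if j < len(sketch_b) else float("inf")
        if a == b:
            shared += 1       # hash present in both sketches
            i += 1
            j += 1
        elif a < b:
            i += 1
        else:
            j += 1
        processed += 1        # one unique hash of the union processed
    return shared / processed if processed else 0.0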
The Jaccard index is a useful measure of global sequence similarity because it correlates with Average Nucleotide Identity (ANI), the most common metric used to describe global sequence similarity. However, like the MUM index13, J is sensitive to genome size and simultaneously captures both point mutations and gene content differences. For distance-based applications, the Jaccard index can be converted to the Jaccard distance Jδ(A, B) = 1 − J(A, B), which is related to the q-gram distance but without occurrence counts44. This can be a useful metric for clustering, but is non-linear in terms of the sequence mutation rate. In contrast, the Mash distance D seeks to directly estimate a mutation rate under a simple Poisson process of random site mutation. As noted by Fan et al.16, given the probability d of a single substitution, the expected number of mutations in a k-mer is λ = kd. Thus, under a Poisson model (assuming unique k-mers and random, independent mutation), the probability that no mutation will occur in a given k-mer is e^−kd, with an expected value equal to the ratio of conserved k-mers w to the total number of k-mers t in the genome, w/t = e^−kd. Solving for d gives d = −ln(w/t)/k. To account for two genomes of different sizes, Fan et al.16 set t to the smaller of the two genomes’ k-mer counts, thereby measuring containment of the k-mer set. However, Mash sets t to the average genome size n, thereby penalizing for genome size differences and measuring resemblance (e.g. to avoid a distance of zero between a phage and a genome containing that phage). Finally, because the Jaccard estimate j can be framed in terms of the average genome size n, j = w/(2n − w), the fraction of shared k-mers can be framed in terms of the Jaccard index, w/n = 2j/(1 + j), yielding the Mash distance:

D = −(1/k) ln(2j/(1 + j))    (4)
Equation 4 carries many assumptions and does not attempt to model more complex evolutionary processes, but closely approximates the divergence of real genomes (Figure 2). With appropriate choices of s and k, it can be used as a replacement for costly ANI computations. Supplementary Figure 2 gives 99% confidence bounds on the Mash distance for various sketch sizes, and Supplementary Figure 3 illustrates the relationship between the Jaccard index, Mash distance, k-mer size, and genome size.
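For concreteness, Equation 4 is straightforward to evaluate; a small Python helper (ours, not part of Mash) and one worked value:

import math

def mash_distance(j, k=21):
    """Equation 4: convert a Jaccard estimate j into a Mash distance."""
    if j <= 0:
        return 1.0  # no shared hashes: the distance saturates
    return -math.log(2 * j / (1 + j)) / k

# e.g. a Jaccard estimate of ~0.68 at k=21 gives D ~= 0.01,
# i.e. roughly 99% ANI under the D ~= 1 - ANI relationship noted in the Results
print(mash_distance(0.68))  # ~0.0101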
Mash P-value
In cases of distantly related genomes it can be difficult to judge the significance of a given Jaccard index or Mash distance. As illustrated by Equation 1, for small k and large n there can be a high probability of a random k-mer appearing by chance. How many k-mers, then, are expected to match between the sketches of two random genomes? This depends on the sketch size and the probability of a random k-mer appearing in the genome. From Equation 1, the expected Jaccard index r between two random genomes X and Y is given by:

r = P(K ∈ X) P(K ∈ Y) / (P(K ∈ X) + P(K ∈ Y) − P(K ∈ X) P(K ∈ Y))    (5)
Knowing the expected population size m of all distinct k-mers in X and Y, which for genomes containing n_X and n_Y distinct k-mers is:

m = (n_X + n_Y) / (1 + r)    (6)
The probability p of observing x or more matches between the sketches of these two random genomes can be computed using the hypergeometric cumulative distribution function. For the sketch size s, expected Jaccard index r, and expected population size m:

p = 1 − Σ_{i=0}^{x−1} C(rm, i) C(m − rm, s − i) / C(m, s)    (7)

where C(a, b) denotes the binomial coefficient.
However, because m is typically very large and the sketch size is relatively much smaller, it is more practical to approximate the hypergeometric distribution with the binomial distribution:

p = 1 − Σ_{i=0}^{x−1} C(s, i) r^i (1 − r)^(s−i)    (8)
Mash uses Equation 8 to compute the P-value of observing a given Mash distance (or less) under the null hypothesis of both genomes being purely random with uniform character frequencies. Of course biological sequences are not random and this equation does not account for compositional characteristics like GC bias, but it is useful in practice for ruling out clearly insignificant results (especially for small values of k and j). Note, this only describes the significance of a single comparison, and multiple testing must be considered when searching against a large database.
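A compact Python version of this test is given below (our illustration, combining Equations 1, 5, and 8; Mash itself computes the P-value with the GNU Scientific Library, and the function names here are ours):

from math import comb

def p_random_kmer(n, k, alphabet=4):
    """Equation 1: probability that a given k-mer appears in a random genome of size n."""
    return 1.0 - (1.0 - alphabet ** (-k)) ** n

def expected_jaccard(nx, ny, k):
    """Equation 5: expected Jaccard index of two random genomes of sizes nx and ny."""
    px, py = p_random_kmer(nx, k), p_random_kmer(ny, k)
    return px * py / (px + py - px * py)

def mash_pvalue(x, s, nx, ny, k):
    """Equation 8: chance of x or more sketch matches between two random genomes."""
    r = expected_jaccard(nx, ny, k)
    return 1.0 - sum(comb(s, i) * r ** i * (1 - r) ** (s - i) for i in range(x))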
RefSeq clustering
By default, Mash uses 32-bit hashes for k-mers where |Σ|^k ≤ 2^32 and 64-bit hashes where 2^32 < |Σ|^k ≤ 2^64. Thus, to minimize the resulting size of the all-RefSeq sketches, k=16 was chosen along with a sketch size of s=400. While not ideal for large genomes (due to the small k) or highly divergent genomes (due to the small sketch), these parameters are well suited for determining species-level relationships between the microbial genomes that currently constitute the majority of RefSeq. For similar genomes (e.g. ANI >95%), sketches of a few hundred hashes are sufficient for basic clustering. As ANI drops further, the Jaccard index rapidly becomes very small and larger sketches are required for accurate estimates. Confidence bounds for the Jaccard estimate can be computed using the inverse cumulative distribution function for the hypergeometric or binomial distributions (Supplementary Figure 1). For example, with a sketch size of 400, two genomes with a true Jaccard index of 0.1 (x=40) are very likely to have a Jaccard estimate between 0.075 and 0.125 (P>0.9, binomial density for 30≤x≤50). For k=16, this corresponds to a Mash distance between 0.09 and 0.12.
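That bound is easy to verify directly (a quick Python check, not part of Mash):

from math import comb

def binom_pmf(i, s, p):
    # probability of exactly i shared hashes in a sketch of size s when the true Jaccard is p
    return comb(s, i) * p ** i * (1 - p) ** (s - i)

# With s=400 and a true Jaccard index of 0.1, the estimate x/s lands in [0.075, 0.125]
# (30 <= x <= 50) with probability a bit above 0.9 (roughly 0.92), matching the bound above.
print(sum(binom_pmf(i, 400, 0.1) for i in range(30, 51)))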
RefSeq Complete release 70 was downloaded from NCBI FTP (ftp://ftp.ncbi.nlm.nih.gov). Using FASTA and GenBank records, replicons and contigs were grouped by organism using a combination of the two-letter accession prefix, taxonomy ID, BioProject, BioSample, assembly ID, plasmid ID, and organism name fields to ensure distinct genomes were not combined. In rare cases this strategy resulted in over-separation due to database mislabeling. Plasmids and organelles were grouped with their corresponding nuclear genomes when available; otherwise they were kept as separate entries. Sequences assigned to each resulting ‘organism’ group were combined into multi-FASTA files and chunked for easy parallelization. Each chunk was sketched with:
mash sketch -s 400 -k 16 -f -o chunk *.fasta

This required 26.1 CPU hours on a heterogeneous cluster of AMD processors. The resulting, chunked sketch files were combined with the Mash paste function to create a single ‘refseq.msh’ file containing all sketches. Each chunked sketch file was then compared against the combined sketch file, again in parallel, using:
mash dist -t refseq.msh chunk.msh

This required 20.3 CPU hours to create pairwise distance tables for each chunk. The resulting chunk tables were concatenated and formatted to create a PHYLIP-formatted distance table.
For the ANI comparison, a subset of 500 Escherichia genomes was selected to present a range of distances yet bound the runtime of the comparatively expensive ANI computation (Supplementary Table 1). ANI was computed using MUMmer’s ‘dnadiff’ program and extracting the 1-to-1 ‘AvgIdentity’ field from the resulting report files45. The corresponding Mash distances were taken from the all-vs-all distance table as described above.
For the primate phylogeny, the FASTA files were sketched separately, in parallel, taking an average time of 8.9 minutes each and a maximum time of 11 minutes (Intel Xeon E5-4620 2.2 GHz processor and solid-state drive). The sketches were combined with Mash paste and the combined sketch given to dist. These operations took insignificant amounts of time, and table output from dist was given to PHYLIP46 neighbor to produce the phylogeny. Accessions for the 17 genomes used are given in Supplementary Table 2. The UCSC tree was downloaded from: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/multiz20way/
RefSeq search
Each dataset listed in Table 1 was compared against the full RefSeq Mash database using the following command for assemblies:

mash dist refseq.msh seq.fasta

and the following command for raw reads:

mash dist -u refseq.msh seq.fasta

which enabled the Bloom filter to remove erroneous, single-copy k-mers. Hits were sorted by distance and all hits within one order of magnitude of the most significant hit (P ≤ 10^-10) were used to compute the lowest common ancestor using an NCBI taxonomy tree. The smallest significant distance, with ties broken by P-value, was also reported.
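The lowest-common-ancestor step is conceptually simple; the Python sketch below (ours, not the code used for the analysis, taking a hypothetical child-to-parent taxonomy map as input) shows one way to compute it for the taxonomy IDs of the retained hits:

def lowest_common_ancestor(taxids, parent):
    """LCA of a set of taxonomy IDs, given a child->parent dict (the root has no entry)."""
    def lineage(t):
        path = [t]
        while t in parent:        # walk up until the root
            t = parent[t]
            path.append(t)
        return path[::-1]         # root ... leaf

    paths = [lineage(t) for t in taxids]
    lca = None
    for level in zip(*paths):     # compare lineages level by level from the root down
        if len(set(level)) == 1:
            lca = level[0]
        else:
            break
    return lca

# e.g. with parent = {"E. coli K-12": "E. coli", "E. coli": "Escherichia",
#                     "E. fergusonii": "Escherichia"}
# lowest_common_ancestor({"E. coli K-12", "E. fergusonii"}, parent) -> "Escherichia"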
Metagenomic clustering
The Global Ocean Survey (GOS) dataset32 was downloaded from the iMicrobe FTP site (ftp://ftp.imicrobe.us/projects/26). The full dataset was split into 44 samples corresponding to Table 1 in Rusch et al.32. This is the dataset used for benchmarking in the Compareads paper30, and that analysis was replicated using both Mash and COMMET31, the successor to Compareads. COMMET was run with default parameters (t=2, m=all, k=33) as:

python Commet.py read_sets.txt

where ‘read_sets.txt’ points to the gzipped FASTQ files. This required 34 CPU hours (2,069 CPU minutes) and 4 GB of RAM. The heatmaps were generated in R using the quartile coloring of COMMET31 (Supplementary Note 1). Supplementary Figure 7 shows the original heatmap generated by COMMET on this dataset. Mash was run as:
mash sketch -u -g 3500 -k 21 -s 10000 -o gos *.fa

This required 0.6 CPU hours (37 CPU minutes) and 19.6 GB of RAM (or 8 MB without Bloom filtering). The resulting combined sketch file totaled just 3.4 MB, compared to the 20 GB FASTA input. Mash distances were computed for all pairs of samples as:

mash dist -t gos.msh gos.msh

which required less than 1 CPU second to complete.
All available HMP and MetaHIT samples were downloaded from:

http://downloads.hmpdacc.org/data/Illumina/ (HMP reads)
http://downloads.hmpdacc.org/data/HMASM/ (HMP assemblies)
ftp://ftp.sra.ebi.ac.uk/vol1/ERA000/ERA000116/fastq/ (MetaHIT reads)
http://www.bork.embl.de/~arumugam/Qin_et_al_2010/ (MetaHIT assemblies)

totaling 764 sequencing runs (9.3 TB) and 755 assemblies (60 GB) for HMP, and 124 sequencing runs (1.1 TB) and 124 assemblies (10 GB) for MetaHIT. Mash was run in parallel with the same parameters used for the GOS datasets, and the resulting sketches were merged with Mash paste. Sketching the 764 HMP sequencing runs required 259.5 CPU hours (average 0.34, max 2.01), and the 755 assemblies required 3.7 CPU hours (average 0.005). Sketching the 124 MetaHIT sequencing runs required 20 CPU hours (average 0.16, max 0.62), and the 124 assemblies required 0.64 CPU hours (average 0.005). Mash distances were computed for all pairs of samples as before for GOS. This required 3.3 CPU minutes for both sequencing runs and assemblies. HMP samples that did not pass HMP QC requirements33 were removed from Figure 5b, but Supplementary Figure 6 shows all HMP assemblies clustered, including several samples that did not pass HMP quality controls. These samples are the only ones that fail to group by body site. Thus, Mash can also act as an alternate QC method to identify mis-tracked or low-quality samples.
Mash engineering
Mash builds upon the following open-source software packages: kseq47 for FASTA parsing, Cap’n Proto for serialized output (https://capnproto.org), MurmurHash3 for k-mer hashing (https://code.google.com/p/smhasher), the GNU Scientific Library48 (GSL) for P-value computation, and the ‘bloom’ Bloom filter library (https://code.google.com/p/bloom). All Mash code is licensed with a 3-clause BSD license. If needed, Mash can also be built using the Boost library49 to avoid the GSL (GPLv3) license requirements. Due to Cap’n Proto requirements, a C++11 compatible compiler is required to build from source, but precompiled binaries are distributed for convenience.
AUTHOR CONTRIBUTIONS
AMP conceived the project, designed the methods, and wrote the paper with input from BDO, TJT, and SK. BDO wrote the software and assisted with analyses. TJT led the RefSeq and tree analyses. SK led the search and metagenomic analyses. ABM and NHB performed sequencing experiments.
COMPETING FINANCIAL INTERESTS
The authors have no financial interests to declare.
ACKNOWLEDGEMENTS
The authors thank Konstantin Berlin for helpful discussions; Brian Walenz and Torsten Seemann for reviewing the draft; and Philip Ashton, Aleksey Jironkin, and Nicholas Loman for providing early feedback on the software. This research was supported in part by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health, and under Contract No. HSHQDC-07-C-00020 awarded by the Department of Homeland Security (DHS) Science and Technology Directorate (S&T) for the management and operation of the National Biodefense Analysis and Countermeasures Center (NBACC), a Federally Funded Research and Development Center. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the DHS or S&T. In no event shall the DHS, NBACC, S&T or Battelle National Biodefense Institute (BNBI) have any responsibility or liability for any use, misuse, inability to use, or reliance upon the information contained herein. DHS does not endorse any products or commercial services mentioned in this publication.