Abstract
The identification of reference genomes and taxonomic labels from metagenome data underlies many microbiome studies. Here we describe two algorithms for compositional analysis of metagenome sequencing data. We first investigate the FracMinHash sketching technique, a derivative of modulo hash that supports Jaccard containment estimation between sets of different sizes. We implement FracMinHash in the sourmash software, evaluate its accuracy, and demonstrate large-scale containment searches of metagenomes using 700,000 microbial reference genomes. We next frame shotgun metagenome compositional analysis as the problem of finding a minimum collection of reference genomes that “cover” the known k-mers in a metagenome, a minimum set cover problem. We implement a greedy approximate solution using FracMinHash sketches, and evaluate its accuracy for taxonomic assignment using a CAMI community benchmark. Finally, we show that the minimum metagenome cover can be used to guide the selection of reference genomes for read mapping. sourmash is available as open source software under the BSD 3-Clause license at github.com/dib-lab/sourmash/.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
↵a Funded by Grant GBMF4551 from the Gordon and Betty Moore Foundation; Grant R01HG007513 from the NIH NHGRI
↵b Funded by Grant GBMF4551 from the Gordon and Betty Moore Foundation
↵c Funded by Grant GBMF4551 from the Gordon and Betty Moore Foundation; Grant R03OD030596 from the NIH Common Fund
↵d Funded by Grant 1711984 from the NSF; Grant GBMF4551 from the Gordon and Betty Moore Foundation; Grant 2018911 from the NSF
↵e Funded by Grant 2029170 from the NSF
↵f Funded by Grant GBMF4551 from the Gordon and Betty Moore Foundation; Grant R01HG007513 from the NIH NHGRI; Grant 2018911 from the NSF; Grant R03OD030596 from the NIH Common Fund
luizirber, luizirber, brooksph, brooksph, taylorreiter, ReiterTaylor, bluegenes, saltyscientist, mahmudhera, HeraMahmudur, dkoslicki, DavidKoslicki, ctb
Added a reference to additional analytical results, and added details on the database used for taxonomic benchmarking.