Exploring neighborhoods in large metagenome assembly graphs reveals hidden sequence diversity

C. Titus Brown; Dominik Moritz; Michael P. O’Brien; Felix Reidl; Taylor Reiter; Blair D. Sullivan

doi:10.1101/462788

Abstract

Genomes computationally inferred from large metagenomic data sets are often incomplete and may be missing functionally important content and strain variation. We introduce an information retrieval system for large metagenomic data sets that exploits the sparsity of DNA assembly graphs to efficiently extract subgraphs surrounding an inferred genome. We apply this system to recover missing content from genome bins and show that substantial genomic sequence variation is present in a real metagenome. Our software implementation is available at https://github.com/spacegraphcats/spacegraphcats under the 3-Clause BSD License.

Footnotes

DM, MPO, FR, and BDS designed and implemented algorithms; CTB, DM, MPO, and FR developed software; CTB and TR conducted biological data analysis; CTB and BDS supervised work. All authors interpreted results, wrote text, created figures, and approved the submitted paper.

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.