Accurate and complete genomes from metagenomes

  1. Jillian F. Banfield1,5,6
  1. Department of Earth and Planetary Sciences, University of California, Berkeley, California 94720, USA;
  2. Graduate Program in Biophysical Sciences, University of Chicago, Chicago, Illinois 60637, USA;
  3. Department of Medicine, University of Chicago, Chicago, Illinois 60637, USA;
  4. Bay Paul Center, Marine Biological Laboratory, Woods Hole, Massachusetts 02543, USA;
  5. Department of Environmental Science, Policy, and Management, University of California, Berkeley, California 94720, USA;
  6. Earth and Environmental Sciences, Lawrence Berkeley National Laboratory, University of California, Berkeley, California 94720, USA
  • 7 Present address: Department of Bacteriology, University of Wisconsin, Madison, WI 53706, USA

  • Corresponding authors: jbanfield{at}berkeley.edu, meren{at}uchicago.edu
  • Abstract

    Genomes are an integral component of the biological information about an organism; thus, the more complete the genome, the more informative it is. Historically, bacterial and archaeal genomes were reconstructed from pure (monoclonal) cultures, and the first reported sequences were manually curated to completion. However, the bottleneck imposed by the requirement for isolates precluded genomic insights for the vast majority of microbial life. Shotgun sequencing of microbial communities, referred to initially as community genomics and subsequently as genome-resolved metagenomics, can circumvent this limitation by obtaining metagenome-assembled genomes (MAGs); but gaps, local assembly errors, chimeras, and contamination by fragments from other genomes limit the value of these genomes. Here, we discuss genome curation to improve and, in some cases, achieve complete (circularized, no gaps) MAGs (CMAGs). To date, few CMAGs have been generated, although notably some are from very complex systems such as soil and sediment. Through analysis of about 7000 published complete bacterial isolate genomes, we verify the value of cumulative GC skew in combination with other metrics to establish bacterial genome sequence accuracy. The analysis of cumulative GC skew identified potential misassemblies in some reference genomes of isolated bacteria and the repeat sequences that likely gave rise to them. We discuss methods that could be implemented in bioinformatic approaches for curation to ensure that metabolic and evolutionary analyses can be based on very high-quality genomes.
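
The cumulative GC skew metric mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names and the default window size are assumptions chosen for illustration. GC skew is computed per window as (G - C) / (G + C) and then summed along the sequence. For a complete genome replicated bidirectionally from a single origin, the cumulative curve is generally expected to show one minimum (near the origin) and one maximum (near the terminus), so additional sharp extrema may point to regions worth inspecting for misassembly.

```python
from typing import List, Tuple


def gc_skew_windows(seq: str, window: int = 1000) -> List[float]:
    """GC skew, (G - C) / (G + C), for consecutive non-overlapping windows.

    The window size is an illustrative assumption, not a value from the article.
    Windows with no G or C are assigned a skew of 0.0.
    """
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        skews.append((g - c) / (g + c) if (g + c) else 0.0)
    return skews


def cumulative_gc_skew(seq: str, window: int = 1000) -> List[float]:
    """Running sum of the windowed GC skew along the sequence."""
    total = 0.0
    cumulative = []
    for skew in gc_skew_windows(seq, window):
        total += skew
        cumulative.append(total)
    return cumulative


def skew_extrema(cumulative: List[float]) -> Tuple[int, int]:
    """Window indices of the global minimum and maximum of the cumulative skew.

    In a well-assembled circular genome these are candidate positions for the
    replication origin (minimum) and terminus (maximum).
    """
    i_min = min(range(len(cumulative)), key=cumulative.__getitem__)
    i_max = max(range(len(cumulative)), key=cumulative.__getitem__)
    return i_min, i_max
```

In practice the sequence would be read from a FASTA file and the cumulative curve plotted, because the overall shape of the curve (a single clean peak and trough versus several abrupt reversals) carries more information than the two extrema alone.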

    Footnotes

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.258640.119.

    • Freely available online through the Genome Research Open Access option.

    This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.
