Abstract
Microbial and viral diversity, distribution, and ecological impacts are often studied using metagenome-assembled sequences, but genome incompleteness hampers comprehensive and accurate analyses. Here we introduce COBRA (Contig Overlap Based Re-Assembly), a tool that resolves de Bruijn graph-based assembly breakpoints and joins contigs. Although COBRA is applicable to any short-read assembled DNA sequences, we benchmarked it using a dataset of published complete viral genomes from the ocean. COBRA accurately joined contigs assembled by metaSPAdes, IDBA_UD, and MEGAHIT, outperforming several existing binning tools and achieving significantly higher genome accuracy (96.6% vs 19.8-59.6%). We applied COBRA to viral contigs that we assembled from 231 published freshwater metagenomes and obtained 7,334 high-quality or complete species-level genomes (clusters with 95% average nucleotide identity) for viruses of bacteria (phages), ∼83% of which represent new phage species. Notably, ∼70% of the 7,334 species genomes were circular, compared to 34% before COBRA analyses. We expanded genomic sampling of huge phages (≥ 200 kbp), the largest of which was curated to completion (717 kbp in length). The improved phage genomes from Rotsee Lake provided context for metatranscriptomic data and indicated in situ activity of huge phages and of WhiB- and cysC/cysH-encoding phages at this site. In conclusion, COBRA improves the assembly contiguity and completeness of microbial and viral genomes and thus the accuracy and reliability of analyses of gene content, diversity, and evolution.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
This version includes minor modifications to the Abstract section and updated data availability.