bioRxiv
New Results

ResMiCo: increasing the quality of metagenome-assembled genomes with deep learning

Olga Mineeva, Daniel Danciu, Bernhard Schölkopf, Ruth E. Ley, Gunnar Rätsch, Nicholas D. Youngblut
doi: https://doi.org/10.1101/2022.06.23.497335
Olga Mineeva
1Department of Computer Science, ETH Zürich, Zürich, Switzerland
2Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen, Germany
3Swiss Institute for Bioinformatics, Lausanne, Switzerland
Daniel Danciu
1Department of Computer Science, ETH Zürich, Zürich, Switzerland
Bernhard Schölkopf
1Department of Computer Science, ETH Zürich, Zürich, Switzerland
2Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen, Germany
5ETH AI center, ETH Zürich, Zürich, Switzerland
Ruth E. Ley
4Department of Microbiome Science, Max Planck Institute for Biology, Tübingen, Germany
Gunnar Rätsch
1Department of Computer Science, ETH Zürich, Zürich, Switzerland
3Swiss Institute for Bioinformatics, Lausanne, Switzerland
5ETH AI center, ETH Zürich, Zürich, Switzerland
6Department of Biology, ETH Zürich, Zürich, Switzerland
7Medical Informatics Unit, Zürich University Hospital, Zürich, Switzerland
  • For correspondence: nyoungblut@tuebingen.mpg.de gunnar.raetsch@inf.ethz.ch
Nicholas D. Youngblut
4Department of Microbiome Science, Max Planck Institute for Biology, Tübingen, Germany
  • For correspondence: nyoungblut@tuebingen.mpg.de gunnar.raetsch@inf.ethz.ch

Abstract

The number of published metagenome assemblies is rapidly growing due to advances in sequencing technologies. However, sequencing errors, variable coverage, repetitive genomic regions, and other factors can produce misassemblies, which are challenging to detect for taxonomically novel genomic data. Assembly errors can affect all downstream analyses of the assemblies. The accuracy of state-of-the-art reference-free misassembly prediction does not exceed an AUPRC of 0.57, and it is not clear how well these models generalize to real-world data. Here, we present the Residual neural network for Misassembled Contig identification (ResMiCo), a deep learning approach for reference-free identification of misassembled contigs. To develop ResMiCo, we first generated a training dataset of unprecedented size and complexity that can be used for further benchmarking and developments in the field. Through rigorous validation, we show that ResMiCo is substantially more accurate than the state of the art, and the model is robust to novel taxonomic diversity and varying assembly methods. ResMiCo estimated 4.7% misassembled contigs per metagenome across multiple real-world datasets. We demonstrate how ResMiCo can be used to optimize metagenome assembly hyperparameters to improve accuracy, instead of optimizing solely for contiguity. The accuracy, robustness, and ease of use of ResMiCo make the tool suitable for general quality control of metagenome assemblies and assembly methodology optimization.
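
For readers unfamiliar with the AUPRC metric quoted above, the following is a minimal sketch of one common step-wise estimate of it (average precision), computed from per-contig labels and scores. It is not taken from ResMiCo; the function name `average_precision` and the toy labels and scores are hypothetical, and the sketch assumes that no two contigs share the same score.

```python
def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    labels: 1 for a misassembled contig, 0 for a correctly assembled one.
    scores: classifier output, higher meaning "more likely misassembled".
    Assumes no tied scores (ties would need to be grouped per threshold).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")

    # Rank contigs from the highest to the lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Precision at the rank where recall increases by 1 / n_pos.
            precision_sum += true_positives / rank
    return precision_sum / n_pos


# Hypothetical toy data: two misassembled contigs among five.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
toy_ap = average_precision(toy_labels, toy_scores)
```

Because a classifier that ranks contigs at random scores approximately the fraction of positive examples, AUPRC is typically preferred over accuracy when the positive class (here, misassembled contigs) is rare.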

Author summary

Metagenome assembly quality is fundamental to all downstream analyses of such data. The number of metagenome assemblies, especially metagenome-assembled genomes (MAGs), is rapidly increasing, but tools to assess the quality of these assemblies lack the accuracy needed for robust quality control. Moreover, existing models have been trained on datasets lacking complexity and realism, which may limit their generalization to novel data. Due to the limitations of existing models, most studies forgo such approaches and instead rely on CheckM to assess assembly quality, an approach that only utilizes a small portion of all genomic information and does not identify specific misassemblies. We harnessed existing large genomic datasets and high-performance computing to produce a training dataset of unprecedented size and complexity and thereby trained a deep learning model for predicting misassemblies that can robustly generalize to novel taxonomy and varying assembly methodologies.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • † The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.

  • https://github.com/leylabmpi/ResMiCo

  • http://ftp.tue.mpg.de/ebio/projects/ResMiCo/

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted June 26, 2022.
ResMiCo: increasing the quality of metagenome-assembled genomes with deep learning
Olga Mineeva, Daniel Danciu, Bernhard Schölkopf, Ruth E. Ley, Gunnar Rätsch, Nicholas D. Youngblut
bioRxiv 2022.06.23.497335; doi: https://doi.org/10.1101/2022.06.23.497335

Subject Area

  • Bioinformatics