bioRxiv
New Results

SemiBin2: self-supervised contrastive learning leads to better MAGs for short- and long-read sequencing

Shaojun Pan, Xing-Ming Zhao, Luis Pedro Coelho
doi: https://doi.org/10.1101/2023.01.09.523201
Shaojun Pan
1 Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
2 Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Ministry of Education, Shanghai, China
Xing-Ming Zhao
1 Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
2 Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Ministry of Education, Shanghai, China
3 MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
4 Zhangjiang Fudan International Innovation Center, Shanghai, China
  • For correspondence: xmzhao@fudan.edu.cn luispedro@big-data-biology.org
Luis Pedro Coelho
1 Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
2 Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Ministry of Education, Shanghai, China
  • For correspondence: xmzhao@fudan.edu.cn luispedro@big-data-biology.org

Abstract

Motivation: Metagenomic binning methods that reconstruct metagenome-assembled genomes (MAGs) from environmental samples are widely used in large-scale metagenomic studies. The recently proposed semi-supervised binning method, SemiBin, achieved state-of-the-art binning results in several environments. However, it requires annotating contigs, a computationally costly and potentially biased process.

Results: We propose SemiBin2, which uses self-supervised learning to learn feature embeddings from the contigs. In simulated and real datasets, we show that self-supervised learning achieves better results than the semi-supervised learning used in SemiBin1 and that SemiBin2 outperforms other state-of-the-art binners. Compared to SemiBin1, SemiBin2 reconstructs 8.3%–21.5% more high-quality bins and requires only 25% of the running time and 11% of the peak memory usage in real short-read sequencing samples. To extend SemiBin2 to long-read data, we also propose an ensemble-based DBSCAN clustering algorithm, which yields 13.1%–26.3% more high-quality genomes than the second-best binner for long-read data.
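
As a rough illustration of what self-supervised contrastive learning on contigs can look like, the following minimal Python sketch builds training pairs without any annotation and scores them with a standard pairwise contrastive loss. It is not the SemiBin2 implementation. Everything in it is an assumption made for illustration: the pair construction (two halves of one contig as a positive pair, halves of different contigs as a negative pair), the 2-mer frequency profile standing in for a learned embedding, the margin value, and the function names `kmer_profile`, `make_pairs`, and `contrastive_loss`.

```python
import math
from itertools import product


def kmer_profile(seq, k=2):
    """Normalized k-mer frequency vector over ACGT (stand-in for an embedding)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-mers containing ambiguous bases such as N
            counts[kmer] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    return [counts[km] / total for km in kmers]


def make_pairs(contigs):
    """Build labeled pairs from the contigs alone (assumed scheme).

    Positive pair: the two halves of the same contig.
    Negative pair: the left halves of two different contigs.
    """
    halves = [(c[:len(c) // 2], c[len(c) // 2:]) for c in contigs]
    pairs = []
    for i, (left, right) in enumerate(halves):
        pairs.append((left, right, True))
        for j in range(i + 1, len(halves)):
            pairs.append((left, halves[j][0], False))
    return pairs


def contrastive_loss(u, v, same, margin=1.0):
    """Pairwise contrastive loss on two vectors.

    Positive pairs are penalized by squared distance; negative pairs are
    penalized only when they are closer than the margin.
    """
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    if same:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy contigs, invented for this example.
contigs = ["ATATATATATATATAT", "GCGCGCGCGCGCGCGC"]
pairs = make_pairs(contigs)
losses = [
    contrastive_loss(kmer_profile(a), kmer_profile(b), same)
    for a, b, same in pairs
]
```

In a real setting the fixed k-mer profile would be replaced by the output of a trainable network, and the summed loss would be minimized over its parameters so that fragments of the same genome end up close together in the embedding space.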

Availability and Implementation: SemiBin2 is available as open-source software at https://github.com/BigDataBiology/SemiBin/ and the analysis scripts used in the study can be found at https://github.com/BigDataBiology/SemiBin2_benchmark.

Competing Interest Statement

The authors have declared no competing interest.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted January 09, 2023.
bioRxiv 2023.01.09.523201; doi: https://doi.org/10.1101/2023.01.09.523201

Subject Area

  • Bioinformatics