New Results

CONSENT: Scalable long read self-correction and assembly polishing with multiple sequence alignment

Pierre Morisse, Camille Marchet, Antoine Limasset, Thierry Lecroq, Arnaud Lefebvre
doi: https://doi.org/10.1101/546630
Pierre Morisse
1 Normandie Université, UNIROUEN, INSA Rouen, LITIS, 76000 Rouen, France
  • For correspondence: pierre.morisse2@univ-rouen.fr
Camille Marchet
2 Univ. Lille, CNRS, UMR 9189 - CRIStAL
Antoine Limasset
2 Univ. Lille, CNRS, UMR 9189 - CRIStAL
Thierry Lecroq
3 Normandie Univ, UNIROUEN, LITIS, 76000 Rouen, France
Arnaud Lefebvre
3 Normandie Univ, UNIROUEN, LITIS, 76000 Rouen, France

Abstract

Motivation Third-generation sequencing technologies, namely Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), produce long reads of tens of kbp, which are expected to solve various problems such as contig and haplotype assembly, scaffolding, and structural variant calling. However, these reads also display high error rates, which can reach 10 to 30% for basic ONT and non-CCS PacBio reads. As a result, error correction is often the first step of projects dealing with long reads. Because the first long-read sequencing experiments produced reads with error rates higher than 15% on average, most methods relied on complementary short-read data to perform correction, in a hybrid approach. However, these sequencing technologies evolve fast, and the error rate of long reads now reaches 10 to 12%. As a result, self-correction is now frequently used as the first step of third-generation sequencing data analysis projects. Efficient tools for the self-correction of long reads are now available, and recent observations suggest that avoiding second-generation sequencing reads could bypass their inherent bias.

Results We introduce CONSENT, a new method for the self-correction of long reads that combines different strategies from the state of the art. More precisely, we combine a multiple sequence alignment strategy with the use of local de Bruijn graphs. Moreover, the multiple sequence alignment benefits from an efficient segmentation strategy based on k-mer chaining, which allows a considerable speed improvement. Our experiments show that CONSENT compares well to the latest state-of-the-art self-correction methods, and even outperforms them on real Oxford Nanopore datasets. In particular, they show that CONSENT is the only method able to scale efficiently to the correction of Oxford Nanopore ultra-long reads, and that it can process a full human dataset, containing reads reaching lengths up to 1.5 Mbp, in 15 days. CONSENT also implements an assembly polishing feature, and is thus able to correct errors directly from raw long-read assemblies. Our experiments show that CONSENT outperforms state-of-the-art polishing tools in terms of resource consumption, and provides comparable results. Moreover, we show that, for a full human dataset, assembling the raw data and polishing the assembly afterwards is less time-consuming than assembling the corrected reads, while providing better quality results.
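
The k-mer chaining idea mentioned above can be illustrated with a minimal Python sketch. It is not CONSENT's implementation (which is written in C++); the function names `unique_kmer_positions` and `chain_anchors`, the restriction to k-mers that occur exactly once in each read, and the use of a longest increasing subsequence to pick the chain are all assumptions made for illustration. The sketch finds k-mers shared by two reads and keeps the longest set of anchors that appear in the same order in both, which is the kind of collinear chain that could delimit segments for a multiple sequence alignment.

```python
from bisect import bisect_left


def unique_kmer_positions(seq, k):
    """Map each k-mer that occurs exactly once in seq to its start position."""
    positions = {}
    repeated = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in positions:
            repeated.add(kmer)
        else:
            positions[kmer] = i
    return {kmer: pos for kmer, pos in positions.items() if kmer not in repeated}


def chain_anchors(read_a, read_b, k):
    """Return the longest collinear chain of shared k-mer anchors.

    Each anchor is a (position in read_a, position in read_b) pair.
    Hypothetical helper for illustration only.
    """
    pos_a = unique_kmer_positions(read_a, k)
    pos_b = unique_kmer_positions(read_b, k)
    # Anchors sorted by position in read_a; positions are distinct in both reads.
    anchors = sorted((pos_a[kmer], pos_b[kmer]) for kmer in pos_a.keys() & pos_b.keys())

    # Longest strictly increasing subsequence on read_b positions (patience sorting).
    tails = []       # tails[i]: smallest read_b position ending a chain of length i + 1
    tails_idx = []   # index in anchors of the anchor stored in tails[i]
    prev = [-1] * len(anchors)  # back pointers used to rebuild the chain
    for idx, (_, b) in enumerate(anchors):
        j = bisect_left(tails, b)
        if j == len(tails):
            tails.append(b)
            tails_idx.append(idx)
        else:
            tails[j] = b
            tails_idx[j] = idx
        prev[idx] = tails_idx[j - 1] if j > 0 else -1

    # Walk the back pointers from the end of the longest chain.
    chain = []
    idx = tails_idx[-1] if tails_idx else -1
    while idx != -1:
        chain.append(anchors[idx])
        idx = prev[idx]
    return chain[::-1]


if __name__ == "__main__":
    # read_b carries a 3-base insertion relative to read_a.
    print(chain_anchors("ACGTTGCA", "ACGTAATTGCA", 3))
```

In this sketch, consecutive anchors in the chain bound windows of the two reads that can be aligned independently, so each alignment problem stays small even when the reads are very long. Anchors may overlap here; a practical implementation would likely also constrain the spacing between them.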

Availability and implementation CONSENT is implemented in C++, supported on Linux platforms and freely available at https://github.com/morispi/CONSENT.

Contact pierre.morisse2@univ-rouen.fr

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • Remove bioinformatics template

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.
Posted April 24, 2020.
Subject Area

  • Bioinformatics