New Results

Long-read error correction: a survey and qualitative comparison

Pierre Morisse, Thierry Lecroq, Arnaud Lefebvre
doi: https://doi.org/10.1101/2020.03.06.977975
Pierre Morisse
1Univ Rennes, Inria, CNRS, IRISA, 35000, Rennes, France
  • For correspondence: contact@pierremorisse.com
Thierry Lecroq
2Normandie Univ, UNIROUEN, LITIS, 76000 Rouen, France
Arnaud Lefebvre
2Normandie Univ, UNIROUEN, LITIS, 76000 Rouen, France

Abstract

The third-generation sequencing technologies Pacific Biosciences and Oxford Nanopore Technologies were made available in 2011 and 2014, respectively. In contrast with second-generation sequencing technologies such as Illumina, these new technologies allow the sequencing of long reads of tens to hundreds of kbp. These so-called long reads are particularly promising, and are expected to solve various problems such as contig and haplotype assembly or scaffolding. However, these reads are also much more error-prone than second-generation reads, and display error rates reaching 10 to 30%, depending on the sequencing technology and the version of the chemistry. Moreover, these errors are mainly insertions and deletions, whereas most errors in Illumina reads are substitutions. As a result, long reads require efficient error correction, and a plethora of error correction tools directly targeted at these reads have been developed in the past ten years. These methods can adopt a hybrid approach, using complementary short reads to perform correction, or a self-correction approach, making use only of the information contained in the long-read sequences. Both approaches rely on various strategies such as multiple sequence alignment, de Bruijn graphs, or hidden Markov models, and some tools combine several strategies. In this paper, we present a complete survey of long-read error correction, reviewing the methodologies and tools available to date for both hybrid correction and self-correction. Moreover, the characteristics of the long reads, such as sequencing depth, length, error rate, or even sequencing technology, have a huge impact on how well a given tool or strategy performs, and can thus drastically reduce the correction quality. We thus also present an in-depth benchmark of available long-read error correction tools on a wide variety of datasets, composed of both simulated and real data, with various error rates, coverages, and read lengths, ranging from small bacterial to large mammalian genomes.
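
As a rough illustration of the hybrid, de Bruijn graph flavoured strategy mentioned in the abstract, the sketch below collects "solid" k-mers from accurate short reads and flags the positions of a long read that no solid k-mer covers. It is a toy example under stated assumptions, not the method of any surveyed tool: the function names, the value of k, the `min_count` threshold, and the sequences are all hypothetical, and a real corrector would go on to search the graph for a path bridging each flagged region.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(short_reads, k, min_count=2):
    """Keep k-mers seen at least min_count times (assumed error-free)."""
    counts = kmer_counts(short_reads, k)
    return {kmer for kmer, n in counts.items() if n >= min_count}


def weak_regions(long_read, solid, k):
    """Return half-open (start, end) intervals covered by no solid k-mer."""
    covered = [False] * len(long_read)
    for i in range(len(long_read) - k + 1):
        if long_read[i:i + k] in solid:
            for j in range(i, i + k):
                covered[j] = True

    regions, start = [], None
    for pos, ok in enumerate(covered):
        if not ok and start is None:
            start = pos                    # a weak region opens here
        elif ok and start is not None:
            regions.append((start, pos))   # the weak region closes here
            start = None
    if start is not None:
        regions.append((start, len(long_read)))
    return regions


# Hypothetical data: short reads sampled from the sequence ACGTTGCATC.
short_reads = ["ACGTTGCA", "ACGTTGCA", "GTTGCATC", "GTTGCATC"]
solid = solid_kmers(short_reads, k=4)

# The long read carries one inserted base (an extra A at index 5).
print(weak_regions("ACGTTAGCATC", solid, k=4))  # [(5, 6)]
```

In this toy case the single flagged interval coincides with the inserted base. On real data, weak regions are typically longer and more ambiguous, which is where the graph traversal or alignment step of an actual tool comes in.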

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • Updated results and commentaries

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.
Posted July 12, 2021.
Subject Area

  • Bioinformatics