bioRxiv
New Results

Hercules: a profile HMM-based hybrid error correction algorithm for long reads

Can Firtina, Ziv Bar-Joseph, Can Alkan, A. Ercument Cicek
doi: https://doi.org/10.1101/233080
Can Firtina
Bilkent University
Ziv Bar-Joseph
Carnegie Mellon University
Can Alkan
Bilkent University
A. Ercument Cicek
Bilkent University
  • For correspondence: cicek@cs.bilkent.edu.tr

Abstract

Motivation: Choosing between second and third generation sequencing platforms involves a trade-off between accuracy and read length. Several applications, including de novo assembly and fusion and structural variation detection, require reads that are both long and accurate. In such cases researchers often combine the two technologies and correct the more error-prone long reads using the short reads. Current approaches rely on various graph-based alignment techniques and do not take the error profile of the underlying technology into account. Memory- and time-efficient machine learning algorithms that address these shortcomings have the potential to achieve a more accurate integration of these two technologies.

Results: We designed and developed Hercules, the first machine learning-based long read error correction algorithm. The algorithm models every long read as a profile Hidden Markov Model with respect to the underlying platform's error profile. It learns a posterior transition/emission probability distribution for each long read and uses this distribution to correct errors in that read. Using datasets from two DNA-seq BAC clones (CH17-157L1 and CH17-227A2) and human brain cerebellum polyA RNA-seq, we show that Hercules-corrected reads have the highest mapping rate among all competing algorithms, and the highest accuracy when most of the basepairs of a long read are covered with short reads.

Availability: Hercules source code is available at https://github.com/BilkentCompGen/Hercules
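To make the idea of a per-read emission distribution concrete, the following is a deliberately simplified sketch and not the Hercules implementation. It assumes substitution errors only (no insertion or deletion states), ungapped short read alignments, and a plain count-based update in place of the HMM training that the abstract describes. All function names, the `match_prob` value of 0.85, and the toy sequences are hypothetical choices made for illustration.

```python
BASES = "ACGT"


def init_emissions(long_read, match_prob=0.85):
    """Build a prior emission distribution for every long read position.

    The observed base receives match_prob (an assumed value standing in for
    a platform error profile); the other three bases share the remainder.
    """
    other = (1.0 - match_prob) / 3
    return [
        {b: (match_prob if b == base else other) for b in BASES}
        for base in long_read
    ]


def update_with_short_reads(emissions, alignments, short_read_weight=1.0):
    """Combine the prior with short read evidence and renormalize.

    alignments is a list of (start, sequence) pairs giving ungapped
    placements of short reads on the long read. This count-based update is
    a stand-in for proper HMM training, used here only for illustration.
    """
    posterior = [dict(dist) for dist in emissions]
    for start, seq in alignments:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(posterior) and base in BASES:
                posterior[pos][base] += short_read_weight
    for dist in posterior:
        total = sum(dist.values())
        for b in dist:
            dist[b] /= total
    return posterior


def decode(posterior):
    """Emit the most probable base at each position."""
    return "".join(max(BASES, key=lambda b: dist[b]) for dist in posterior)


if __name__ == "__main__":
    noisy_long_read = "ACGTTCGA"                    # toy read with one substitution
    short_reads = [(2, "GTAC"), (3, "TACG")]        # toy ungapped alignments
    prior = init_emissions(noisy_long_read)
    posterior = update_with_short_reads(prior, short_reads)
    print(decode(posterior))
```

In this toy setting, positions not covered by any short read keep the long read's own base, which loosely mirrors the abstract's observation that accuracy depends on how much of the long read is covered by short reads.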

Copyright 
The copyright holder for this preprint is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
  • Posted December 13, 2017.
