bioRxiv
New Results

Fast and memory-efficient noisy read overlapping with KD-trees

Dmitri Parkhomchuk, Andreas Bremges, Alice C. McHardy
doi: https://doi.org/10.1101/166835
Dmitri Parkhomchuk (1, 2)
Andreas Bremges (1, 2)
Alice C. McHardy (1)
1 Helmholtz Centre for Infection Research, 38124 Braunschweig, Germany
2 German Center for Infection Research (DZIF), 38124 Braunschweig, Germany

Abstract

Motivation: Third-generation sequencing technologies produce long but noisy reads, with increasing sequencing throughput and decreasing per-base costs. Detecting read-to-read overlaps in such data is the most computationally intensive step in de novo assembly. Efficient algorithms have recently been developed for this task; nearly all of them use long k-mers (>10 bp) to compare reads, but they differ in their approaches to indexing, hashing, filtering, and dimensionality reduction.

Results: We describe an algorithm for efficient overlap detection that directly compares the full spectrum of short k-mers, namely tetramers, through geometric embedding and approximate nearest neighbor search in multidimensional KD-trees. A proof-of-concept implementation detected read-to-read overlaps in bacterial PacBio and ONT datasets with notably lower memory consumption than state-of-the-art approaches, and the detected overlaps allowed downstream de novo assembly into single contigs. We also introduce a sequence-context-dependent tagging scheme that contributes to memory and computational efficiency and could be used with other alignment and overlap detection algorithms.

Availability: A C++14 implementation is available under the open source Apache License 2.0 at: https://github.com/dzif/kd-tree-overlapper
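
To make the idea in the Results paragraph concrete, the following is a minimal sketch of how a sequence window might be embedded as a 256-dimensional tetramer count vector and queried in a KD-tree. It is an illustration under stated assumptions, not the authors' method and not code from the repository above: all names (`tetramer_profile`, `KdTree`, `nearest`) are hypothetical, the embedding is a plain count vector, and the search is exact rather than approximate. The abstract does not specify the embedding, the windowing, or the tagging scheme, so none of those are modeled here.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <limits>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not the kd-tree-overlapper API.
constexpr std::size_t kDim = 256;  // 4^4 possible tetramers
using Profile = std::array<float, kDim>;

// Map a nucleotide to 0..3, or -1 for any other character (e.g. 'N').
inline int base_code(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;
    }
}

// Count every overlapping tetramer; tetramers spanning a non-ACGT
// character are skipped.
Profile tetramer_profile(const std::string& seq) {
    Profile p{};
    unsigned code = 0;
    int valid = 0;  // consecutive valid bases seen so far
    for (char c : seq) {
        int b = base_code(c);
        if (b < 0) { valid = 0; continue; }
        code = ((code << 2) | static_cast<unsigned>(b)) & 0xFFu;  // keep last 4 bases
        if (++valid >= 4) p[code] += 1.0f;
    }
    return p;
}

float squared_distance(const Profile& a, const Profile& b) {
    float d = 0.0f;
    for (std::size_t i = 0; i < kDim; ++i) {
        float diff = a[i] - b[i];
        d += diff * diff;
    }
    return d;
}

// Textbook KD-tree with median splits and exact nearest neighbor search.
// Assumption: an approximate search would bound the number of visited nodes.
struct KdTree {
    struct Node { std::size_t point; std::size_t axis; int left; int right; };
    std::vector<Profile> points;
    std::vector<Node> nodes;
    int root = -1;

    explicit KdTree(std::vector<Profile> pts) : points(std::move(pts)) {
        std::vector<std::size_t> idx(points.size());
        for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
        root = build(idx, 0, idx.size(), 0);
    }

    int build(std::vector<std::size_t>& idx, std::size_t lo, std::size_t hi,
              std::size_t depth) {
        if (lo >= hi) return -1;
        std::size_t axis = depth % kDim;
        std::size_t mid = lo + (hi - lo) / 2;
        std::nth_element(idx.begin() + lo, idx.begin() + mid, idx.begin() + hi,
                         [&](std::size_t a, std::size_t b) {
                             return points[a][axis] < points[b][axis];
                         });
        int id = static_cast<int>(nodes.size());
        nodes.push_back(Node{idx[mid], axis, -1, -1});
        int l = build(idx, lo, mid, depth + 1);
        int r = build(idx, mid + 1, hi, depth + 1);
        nodes[id].left = l;   // index again: the vector may have reallocated
        nodes[id].right = r;
        return id;
    }

    void search(int id, const Profile& q, std::size_t& best, float& best_d) const {
        if (id < 0) return;
        const Node& n = nodes[id];
        float d = squared_distance(q, points[n.point]);
        if (d < best_d) { best_d = d; best = n.point; }
        float delta = q[n.axis] - points[n.point][n.axis];
        search(delta < 0 ? n.left : n.right, q, best, best_d);
        // Visit the far side only if the splitting plane is closer than the best hit.
        if (delta * delta < best_d) search(delta < 0 ? n.right : n.left, q, best, best_d);
    }

    // Index of the stored profile closest to q (0 if the tree is empty).
    std::size_t nearest(const Profile& q) const {
        std::size_t best = 0;
        float best_d = std::numeric_limits<float>::max();
        search(root, q, best, best_d);
        return best;
    }
};
```

In use, one would build the tree from the profiles of reference windows and call `nearest` with the profile of a query window; a small distance suggests a candidate overlap that still needs verification. Exact KD-tree search loses most of its pruning power in 256 dimensions, which is presumably one reason the paper relies on approximate search.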

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted July 21, 2017.
Subject Area

  • Bioinformatics