RT Journal Article
SR Electronic
T1 Fast and memory-efficient noisy read overlapping with KD-trees
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 166835
DO 10.1101/166835
A1 Dmitri Parkhomchuk
A1 Andreas Bremges
A1 Alice C. McHardy
YR 2017
UL http://biorxiv.org/content/early/2017/07/21/166835.abstract
AB Motivation: Third-generation sequencing technologies produce long but noisy reads with increasing sequencing throughput and decreasing per-base costs. Detecting read-to-read overlaps in such data is the most computationally intensive step in de novo assembly. Recently, efficient algorithms were developed for this task; nearly all of these utilize long k-mers (>10 bp) to compare reads, but vary in their approaches to indexing, hashing, filtering, and dimensionality reduction. Results: We describe an algorithm for efficient overlap detection that directly compares the full spectrum of short k-mers, namely tetramers, through geometric embedding and approximate nearest neighbor search in multidimensional KD-trees. A proof-of-concept implementation detected read-to-read overlaps in bacterial PacBio and ONT datasets with notably lower memory consumption than state-of-the-art approaches and allowed downstream de novo assembly into single contigs. We also introduce a sequence-context-dependent tagging scheme that contributes to memory and computational efficiency and could be used with other aligning and overlapping algorithms. Availability: A C++14 implementation is available under the open source Apache License 2.0 at: https://github.com/dzif/kd-tree-overlapper