Abstract
De novo assembly is the process of accurately reconstructing a genome sequence using only overlapping, error-containing DNA sequence fragments (reads) that redundantly sample a genome. While longer reads simplify genome assembly and improve the contiguity of the reconstruction, current long-read technologies come with high error rates. A crucial step of de novo genome assembly for long reads consists of finding overlapping reads. We present Berkeley Long-Read to Long-Read Aligner and Overlapper (BELLA), a novel algorithm for computing overlaps and alignments that balances the goals of recall (completeness) and precision (avoiding incorrect overlaps), consistently performing well on both, and doing so with reasonable compute time and memory usage.
We present a probabilistic model that demonstrates the soundness of using short, fixed-length k-mers to detect overlaps, avoiding expensive pairwise alignment of each read against all others. We then introduce a notion of reliable k-mers based on this model. Using reliable k-mers eliminates both the k-mer set explosion that would otherwise occur with highly erroneous reads and the spurious overlaps from k-mers originating in repetitive regions. Finally, we present a new method for separating true (genomic) overlaps from false positives using a combination of alignment techniques and probabilistic modeling. With this method, the probability of a false positive drops exponentially as the length of the overlap between sequences increases. On both real and simulated data, BELLA on average outperforms previous tools in F1 score, meaning that its precision and recall are both the best or close to the best. On simulated data, BELLA achieves on average 2.7% higher recall, 17.9% higher precision, and 10.9% higher F1 score than state-of-the-art tools, while remaining competitive in runtime.
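The intuition behind detecting overlaps with short k-mers can be illustrated with a small calculation. The sketch below is not BELLA's actual model. It assumes that sequencing errors are independent and uniformly distributed with a per-base error rate, and it treats non-overlapping k-mer windows as independent trials. The function names and the parameter values in the usage note are hypothetical and chosen only for illustration.

```python
def prob_kmer_correct(error_rate: float, k: int) -> float:
    """Probability that k consecutive bases in one read are all error-free,
    assuming independent per-base errors (a simplifying assumption)."""
    return (1.0 - error_rate) ** k


def prob_shared_kmer_correct(error_rate: float, k: int) -> float:
    """Probability that the same genomic k-mer is error-free in BOTH reads,
    assuming the two reads have independent errors at the same rate."""
    return prob_kmer_correct(error_rate, k) ** 2


def prob_at_least_one_shared(error_rate: float, k: int, overlap_len: int) -> float:
    """Rough probability that two reads overlapping by overlap_len bases
    share at least one error-free k-mer.

    Simplification: only the floor(overlap_len / k) non-overlapping windows
    are counted, and they are treated as independent trials.
    """
    windows = overlap_len // k
    p = prob_shared_kmer_correct(error_rate, k)
    return 1.0 - (1.0 - p) ** windows
```

For example, calling `prob_at_least_one_shared(0.15, 17, 2000)` and then repeating the call with a larger `overlap_len` shows how, under these assumptions, the chance of sharing at least one correct k-mer grows with overlap length even when each individual k-mer is unlikely to be error-free.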