PT - JOURNAL ARTICLE
AU - Giulia Guidi
AU - Marquita Ellis
AU - Daniel Rokhsar
AU - Katherine Yelick
AU - Aydın Buluç
TI - BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper
AID - 10.1101/464420
DP - 2019 Jan 01
TA - bioRxiv
PG - 464420
4099 - http://biorxiv.org/content/early/2019/01/29/464420.short
4100 - http://biorxiv.org/content/early/2019/01/29/464420.full
AB - Motivation: De novo assembly is the process of accurately reconstructing a genome sequence using only overlapping, error-containing DNA sequence fragments (reads) that redundantly sample a genome. While longer reads simplify genome assembly and improve the contiguity of the reconstruction, current long-read technologies come with high error rates. A crucial step of de novo genome assembly for long reads consists of finding overlapping reads. We present Berkeley Long-Read to Long-Read Aligner and Overlapper (BELLA), a novel algorithm for computing overlaps and alignments that balances the goals of recall (completeness) and precision (avoiding incorrect overlaps), consistently performing well on both, and doing so with reasonable compute time and memory usage. Results: We present a probabilistic model that demonstrates the soundness of using short, fixed-length k-mers to detect overlaps, avoiding expensive pairwise alignment of each read against all others. We then introduce a notion of reliable k-mers based on our probabilistic model. The use of reliable k-mers eliminates both the k-mer set explosion that would otherwise occur with highly erroneous reads and the spurious overlaps from k-mers originating in repetitive regions. Finally, we present a new method for separating true (genomic) overlaps from false positives using a combination of alignment techniques and probabilistic modeling. Using this methodology, the probability of false positives drops exponentially as the length of overlap between sequences increases.
On both real and simulated data, BELLA on average outperforms previous tools in F1 score, meaning that both precision and recall are the best or close to the best. On simulated data, BELLA achieves an average of 2.7% higher recall, 17.9% higher precision, and 10.9% higher F1 score than state-of-the-art tools, while remaining competitive in runtime.
Availability: https://github.com/giuliaguidi/bella
Contact: gguidi{at}berkeley.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
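
The idea stated in the abstract, detecting candidate overlaps from shared short k-mers while discarding k-mers that are too rare (likely sequencing errors) or too frequent (likely repeats), can be illustrated with a minimal Python sketch. This is not BELLA's implementation: the function names, the `min_count` and `max_count` bounds, and the choice to count k-mers once per read are assumptions made for illustration. In BELLA the reliable range is derived from the probabilistic model, whereas fixed bounds are used here. Reverse complements and the alignment step that filters false positives are omitted.

```python
from collections import Counter, defaultdict
from itertools import combinations


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def candidate_overlaps(reads, k, min_count=2, max_count=4):
    """Map each read pair (i, j), i < j, to its number of shared retained k-mers.

    min_count and max_count are hypothetical fixed bounds standing in for
    a model-derived reliable range.
    """
    # Count the number of distinct reads in which each k-mer occurs.
    counts = Counter()
    for read in reads:
        counts.update(set(kmers(read, k)))

    # Retain k-mers that are neither too rare nor too frequent.
    retained = {km for km, c in counts.items() if min_count <= c <= max_count}

    # Index: retained k-mer -> set of read ids containing it.
    index = defaultdict(set)
    for rid, read in enumerate(reads):
        for km in kmers(read, k):
            if km in retained:
                index[km].add(rid)

    # Every pair of reads sharing a retained k-mer is a candidate overlap.
    shared = Counter()
    for rids in index.values():
        for pair in combinations(sorted(rids), 2):
            shared[pair] += 1
    return dict(shared)
```

For example, with the reads `"AACCGGTT"`, `"CGGTTACA"`, and `"TTTTTTTT"` and `k=4`, only the first two reads share retained k-mers (`CGGT` and `GGTT`), so the single candidate pair is `(0, 1)`. No all-against-all alignment is needed to find it, which is the cost the abstract says the k-mer approach avoids.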