## Abstract

*De novo* assembly is the process of accurately reconstructing a genome sequence using only overlapping, error-containing DNA sequence fragments (*reads*) that redundantly sample a genome. While longer reads simplify genome assembly and improve the contiguity of the reconstruction, current long-read technologies come with high error rates. A crucial step of *de novo* genome assembly for long reads consists of finding overlapping reads. We present Berkeley Long-Read to Long-Read Aligner and Overlapper (BELLA), a novel algorithm for computing overlaps and alignments that balances the goals of recall (completeness) and precision (avoiding incorrect overlaps), consistently performing well on both, and doing so with reasonable compute time and memory usage.

We present a probabilistic model which demonstrates the soundness of using short, fixed-length *k*-mers to detect overlaps, avoiding expensive pairwise alignment of each read against all others. We then introduce a notion of *reliable k-mers* based on our probabilistic model. The use of *reliable k-mers* eliminates both the *k*-mer set explosion that would otherwise occur with highly erroneous reads, and the spurious overlaps from *k*-mers originating in repetitive regions. Finally, we present a new method for separating true (*genomic*) overlaps from false positives using a combination of alignment techniques and probabilistic modeling. Using this methodology, the probability of false positives drops exponentially as the length of overlap between sequences increases. On both real and simulated data, BELLA on average outperforms previous tools in the F1 score, meaning that both precision and recall are the best or close to the best. On simulated data, BELLA achieves an average of 2.7% higher recall, 17.9% higher precision, and 10.9% higher F1 score than state-of-the-art tools, while remaining competitive in runtime.

## 1 Introduction

Recent advancements in sequencing technologies have made large-scale genomic data more accessible than ever, enabling the characterization of genome structure and its variation between and within species. The analysis of data after sequencing is a challenging task. One of the biggest challenges for the analysis of high-throughput sequencing reads (i.e. short DNA fragments) is whole genome assembly (Zhang *et al.*, 2011), which is the process of aligning and merging DNA fragments in order to reconstruct the original sequence. More specifically, *de novo* genome assembly reconstructs a genome from redundantly sampled fragmentary DNA sequences, without prior knowledge of the genome. *De novo* genome assembly can generate sequences for previously uncharacterized genomes (Simpson and Durbin, 2012). Even for genomes that already have *reference* assemblies, *de novo* assembly can enable the identification of individual-specific genetic features without biases from the reference, which represents only a single individual or composite. Thus through *de novo* assembly the full range of genetic variation in a species becomes accessible.

Current high-throughput sequencing methods can be divided into two main categories based on read length: “short-read” and “long-read” technologies. The main limitation of short-read technologies (Bentley *et al.*, 2008) is their inability to deal with genomic repeats longer than the read length, resulting in fragmented assemblies due to ambiguous placements of sequences (Phillippy *et al.*, 2008; Nagarajan and Pop, 2009). Conversely, long-read technologies, offered by companies Pacific Biosciences (PacBio) with their Single-Molecule Real Time (SMRT) sequencing (Eid *et al.*, 2009) and Oxford Nanopore Technologies with their Nanopore sequencing (Goodwin *et al.*, 2015), generate long reads with average lengths reaching and often exceeding 10,000 base pairs (bp). These allow the resolution of complex genomic repeats, enabling more accurate assemblies. However, the improved read length of these technologies comes at the cost of lower accuracy, with error rates ranging from 5% to 35%. Nevertheless, errors are more random and more evenly distributed within PacBio long reads (Giordano *et al.*, 2017) compared to short-read technologies.

While short read assembly typically relies on the De Bruijn Graph (DBG) abstraction, a de Bruijn graph for long reads would be too fragmented to be useful due to the high error rates. Hence, an overwhelming majority of the state-of-the-art long read assemblers uses the Overlap-Layout-Consensus (OLC) paradigm (Berlin *et al.*, 2015). In OLC assembly the first step is the detection of overlaps between reads and the construction of an overlap graph. The OLC paradigm is used for long reads in part because significantly fewer reads are required to cover the genome (by a factor of 100X), limiting the size of the overlap graph. Highly-accurate overlap detection is a major computational bottleneck of the OLC assembly pipeline (Myers, 2014), mainly due to the compute-intensive nature of pairwise alignment.

At present, several algorithms are capable of overlapping error-prone long-read data with varying accuracy. The prevailing approach is to use an indexing data structure, such as a *k*-mer index table or a suffix array, to identify a set of initial candidate read pairs, thus mitigating the high cost of computing pairwise alignments in a second stage (Chu *et al.*, 2016).

The process of identifying a set of initial candidate read pairs, sometimes simply known as *overlapping*, affects both the accuracy and the algorithm runtime. Precise identification of initial candidate read pairs minimizes the pairwise alignment running time, while retaining all pairs that truly overlap in the genome. A solid mathematical model is critical for identifying these pairs. In addition, computationally efficient and highly accurate overlapping and alignment algorithms have the potential to improve existing long-read assemblers, enabling *de novo* reference assemblies and detection of genetic variations of higher quality.

The main contributions of this work are the following:

- Using a Markov chain model (Markov, 1971), we demonstrate the soundness of using a *k*-mer seed-based approach for accurately identifying initial candidate read pairs.
- We develop a simple procedure for pruning *k*-mers and prove that it retains nearly all true overlaps with high probability. The result is greater computational efficiency without loss of accuracy.
- We reformulate the problem of overlap detection in terms of a sparse matrix-matrix multiplication, which enables the use of high-performance techniques not previously applied in the context of long-read overlap and alignment.
- We couple our overlap detection with a state-of-the-art seed-and-extend banded-alignment method (Döring *et al.*, 2008; Zhang *et al.*, 2000) to obtain BELLA, an all-to-all long-read aligner.
- BELLA uses a new method based on Chernoff bounds to separate true alignments from false positives depending on the alignment score. We show that the probability of false positives drops exponentially as the length of overlap between sequences increases. On simulated data, BELLA achieves high recall (of true alignments) with negligible losses of precision.

## 2 Approach

### Proposed Algorithm

The current work develops a computationally efficient and highly accurate algorithm for overlap detection and alignment for long-read *de novo* genome assembly. The algorithm is implemented in a high-performance software package, called Berkeley Long-Read to Long-Read Aligner and Overlapper (BELLA). In this paper, we present the mathematical approaches that underlie the proposed work and describe the implementation.

BELLA uses a seed-based approach to detect overlaps in the context of *de novo* assembly. Such an approach parses the reads into *k*-mers (i.e. sub-strings of fixed length *k*), which are then used as feature vectors to identify overlaps amongst all reads. Using a mathematical approach based on the Markov chain model, we first show the feasibility of using a *k*-mer seed based approach for overlap detection of long-read data with high error rates.

However, not all *k*-mers are created equal in terms of their usefulness for detecting overlaps. For instance, the overwhelming majority of *k*-mers that occur only once in the dataset are errors (and are also not useful for seeding overlaps between pairs of reads). Similarly *k*-mers that occur more frequently than what would be expected given the sequencing depth and the error profile are likely to come from repetitive regions of the genome. It is a common practice to prune the *k*-mer space using various methodologies (Koren *et al.*, 2017; Lin *et al.*, 2016).

BELLA implements a novel method for filtering out *k*-mers that are likely to either contain errors or originate from a repetitive region. The *k*-mers that are retained by BELLA are considered to be *reliable*, where the reliability of a *k*-mer is defined as its probability of having originated from a unique (non-repetitive) region of the genome. We argue that unique *k*-mers are sufficient for overlap detection because long reads either (a) include long enough non-repetitive sections to identify overlaps using unique *k*-mers or (b) they are completely contained within a repeat, in which case their overlaps are ambiguous and uninformative to begin with. BELLA’s reliable *k*-mer detection explicitly maximizes the retention of *k*-mers that belong to unique regions of the genome, using a probabilistic analysis given the error rate and the sequencing depth. BELLA estimates the error rate from the mean error probability of the data.

BELLA uses a sparse matrix to internally represent its data, where the rows are reads, the columns are reliable *k*-mers, and a nonzero entry **A**(*i, j*) contains the position of the *j*th reliable *k*-mer within the *i*th read. Construction of this sparse matrix requires efficient *k*-mer counting.

Overlap detection is implemented in BELLA using Sparse Generalized Matrix Multiplication (SpGEMM), which allows our algorithm to achieve fast overlapping without using approximate approaches. The running time of BELLA’s overlap detection phase is not quadratic in the number of reads as it never initiates a comparison between two reads if they do not share any *k*-mers. Sparse matrix multiplication is a highly flexible and efficient computational paradigm that enables better organization of computation and generality, because it is able to manipulate complex data structures such as the ones used in finding overlaps using shared *k*-mers. The implementation of this method within our pipeline enables the use of high-performance techniques previously not applied in the context of long-read alignment. It also allows continuing performance improvements in this step due to the ever-improving optimized implementations of SpGEMM (Nagasaka *et al.*, 2018; Azad *et al.*, 2016; Dalton *et al.*, 2015).

BELLA’s overlap detection has been coupled with an optimized state-of-the-art seed-and-extend algorithm (Döring *et al.*, 2008; Zhang *et al.*, 2000), meaning the alignment between two reads starts from a shared seed (identified in the previous overlap detection) and not necessarily from the beginning of reads. During the alignment stage, BELLA uses a new method to separate true alignments from false positives depending on the alignment score. We prove that the probability of false positives decreases exponentially as the length of overlap between reads increases. In the future, improvements on existing seed-and-extend algorithms and newly developed ones that would enable higher accuracy to be achieved could be integrated seamlessly into our pipeline.

Existing tools also implement approximate overlap detection using sketching. A sketch is a reduced space representation of a sequence. Multiple randomized hash functions convert *k*-mers into integer fingerprints and a subset of these is selected to represent the sketch of a sequence according to some criterion. For example, Berlin *et al.* (2015) retain only the smallest integer for each hash function and use the collection of these minimum valued fingerprints as sketch. These methods, while fast, are approximate because sketching is a lossy transformation. Conversely, BELLA uses an explicit *k*-mer representation, which allows us to couple our overlap detection with a seed-and-extend alignment to refine the output and to improve the precision of our algorithm.

## 3 Methods

### Overlapping Feasibility

Chaisson and Tesler (2012) proposed a theory for how long reads contain subsequences that may be used to anchor alignments to the reference genome. The sequences are modeled as random processes that generate error-free regions whose length is geometrically distributed, with each such region separated by an error (Giordano *et al.*, 2017). Their theory yields the minimum sequence length required to contain an *anchor* within a given confidence interval.

Here, we present a theory on how these subsequences, also known as *k*-mers, can be used to anchor alignments between two long reads in the context of *de novo* assembly, allowing accurate overlap discovery among all the reads in a dataset. The initial assumption of our model defines the probability of correctly sequencing a base as:

*p* = 1 − *e*

where *e* is the error rate of the sequencer. From this notion, we model the probability of observing *k* correct consecutive bases on both *read*_{1} and *read*_{2} as a Markov chain process (Markov, 1971).

The Markov chain process is characterized by a *transition matrix* **P** that includes the probabilities to move from one state to another. Each row-index *start* of **P** represents the starting state, and each column-index *end* of **P** represents the ending state. Each entry of **P** is a non-negative number indicating a *transition probability*. Our transition matrix has (*k* + 1) possible states, which lead to (*k* + 1)^{2} transition probabilities of moving from *start* to *end*. The probability of having one correct base on both reads is *p*^{2}. For any state except the *absorbing* state *k*, an error in at least one of the two sequences sets the model back to state 0, which happens with probability 1 − *p*^{2}; otherwise, the Markov chain transition from state *i* to *i* + 1 happens with probability *p*^{2}. The absorbing state *k* cannot be abandoned, as both *read*_{1} and *read*_{2} have already seen *k* consecutive correct bases. Hence, its transition probability is 1.

### Probability of observing at least one shared correct *k*-mer in an overlap region of length L *> k*

The transition matrix is filled with the probabilities of moving from one state to another, resulting in the following (for the sake of illustration, *k* has been set to 5):

$$
\mathbf{P} = \begin{pmatrix}
1-p^2 & p^2 & 0 & 0 & 0 & 0 \\
1-p^2 & 0 & p^2 & 0 & 0 & 0 \\
1-p^2 & 0 & 0 & p^2 & 0 & 0 \\
1-p^2 & 0 & 0 & 0 & p^2 & 0 \\
1-p^2 & 0 & 0 & 0 & 0 & p^2 \\
0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}
$$

One can then find the probability of being in any of the states after *L* steps in the Markov chain by computing the *L*^{th} power of the matrix **P**. More efficiently, one can just compute **vP**^{L}, where **v** = (1, 0, …, 0) is the standard unit vector; this requires only *L* matrix-vector products rather than explicit matrix powers. This approach is sufficient because we are only interested in the probability of reaching the final absorbing state. The above operation yields the probability of having one correct *k*-mer in the same location on both reads given a certain overlap region.

Figure 1 describes the process: each state contains the number of successful sequenced bases obtained up to this point on both reads, while the arrows represent the transition probabilities. Algorithm 1 shows this procedure in terms of pseudo-code. In it, *p* represents the probability of correctly sequencing a base on one read.
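As an illustration, the vector-times-matrix-power computation of Algorithm 1 can be sketched in a few lines of Python. This is a didactic stand-in under the model's assumptions, not BELLA's implementation, and the function name is ours:

```python
def prob_shared_kmer(k, e, L):
    """Probability that two overlapping reads share at least one correct
    k-mer in an overlap of length L, via the (k + 1)-state Markov chain."""
    p2 = (1.0 - e) ** 2            # both reads correct at a base: p^2
    v = [1.0] + [0.0] * k          # start in state 0 (no correct bases yet)
    for _ in range(L):             # one matrix-vector product per step: v <- vP
        w = [0.0] * (k + 1)
        w[0] = (1.0 - p2) * sum(v[:k])   # an error on either read resets to 0
        for i in range(k):
            w[i + 1] += p2 * v[i]        # one more correct base on both reads
        w[k] += v[k]                     # absorbing state k is never left
        v = w
    return v[k]
```

Because only the absorbing-state entry is needed, the vector-iteration form avoids materializing **P**^{L}.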

The proposed model is the driving factor behind the choice of the optimal *k*-mer length to be used during the overlap detection phase. Figure 2 illustrates the probabilities of finding one correct *k*-mer when varying the value of *k*, the error rate, and the overlap length *L*. When the error rate is high, the model favors a shorter *k*-mer length, so that a correct seed can be found with high probability even in a narrow overlap region, because the probability of a *k*-mer being correct decreases approximately geometrically as its length increases. With decreasing error rate, however, a larger *k* would be preferable, since it would decrease the number of *k*-mers coming from repetitive regions of the genome.

### Reliable *k*-mers

Repetitive regions of the genome lead to certain *k*-mers occurring frequently in the input reads. *k*-mers from these regions pose two problems for pairwise overlapping and alignment. First, their presence increases the computational cost, both at the overlapping stage and at the alignment stage, because these *k*-mers generate numerous and possibly spurious overlaps. Second, they often do not provide valuable information.

Our argument here is that *k*-mers coming from a repetitive region in the genome can be ignored for the purpose of seed-based overlapping. This is because either (a) the read is longer than the repeat, in which case there should be enough sequence data from the non-repeat section to find overlaps, or (b) the read is shorter than the repeat, in which case the read will not be particularly useful for downstream tasks such as *de novo* assembly.

Following the terminology proposed by Lin *et al.* (2016), we identify *k*-mers that do not exist in the genome as *non-genomic*, thus characterizing *k*-mers present in the genome as *genomic*. A genomic *k*-mer can be *repeated*, if it is present multiple times in the genome, or *unique*, if it is not. One can think of the presence of *k*-mers within each read as that read’s feature vector. For the reasons discussed above, the feature vector should include all the unique *k*-mers, as they often are the most informative features.

Since we do not know the genome prior to assembly, we must estimate the genomic uniqueness of *k*-mers from our sample of redundant, error-containing reads. In this section, we provide a mathematically grounded procedure that chooses a frequency range for *k*-mers that we consider to be *reliable*. The basic question that guides the reliable *k*-mer selection procedure is the following: “Given that a *k*-mer is sequenced from a unique (non-repeat) region of the genome, what is the probability it will occur at least *m* times in the input data?”. For a genome *G* sequenced at depth *d*, the conditional modeled probability is:

*Pr*(freq(*k-mer, G, d*) ≥ *m* | COUNT(MAP(*k-mer, G*)) = 1)

where MAP(*k-mer, G*) is the set of locations in the genome *G* where *k-mer* can be mapped, the COUNT() function computes the cardinality of a given input set, and freq(*k-mer, G, d*) is the number of occurrences of *k-mer* within the sequenced reads, assuming each region of *G* is sequenced *d* times (the *sequencing depth*). In that sense, BELLA’s approach to selecting reliable *k*-mers diverges sharply from how Lin *et al.* (2016) select their *solid strings*. While the solid strings approach discards infrequent *k*-mers, our model discards highly repetitive *k*-mers, arguing that (a) unique *k*-mers are sufficient to find informative overlaps, and (b) a unique *k*-mer has low probability of occurring frequently. However, both approaches are justifiable within their respective problem statements, since our model is applied to the OLC paradigm and the one developed by Lin *et al.* is applied to the DBG paradigm.

The probability of a *k*-mer being sequenced correctly is approximately (1 − *e*)^{k}, where *e* is the error rate. The probability of correctly sequencing a *k*-mer once can be generalized to obtain the probability of seeing it multiple times in the data, considering that each correct sequencing of that *k*-mer is an independent event. For example, if the sequencing depth is *d*, the probability of observing a unique *k*-mer *k _{i}* in the input data *d* times is approximately (1 − *e*)^{dk}. More generally, the number of times a unique *k*-length section of the genome is sequenced correctly when the sequencing depth is *d* follows a binomial distribution *B*(*n*, *p*), where *n* is the number of trials and *p* is the probability of success. Consequently, we derive that the probability of observing a *k*-mer *k _{i}* (which corresponds to a unique, non-repetitive region of the genome) *m* times within sequencing input data with depth *d* is:

*Pr*(*m*; *d*, (1 − *e*)^{k}) = C(*d*, *m*) · (1 − *e*)^{km} · (1 − (1 − *e*)^{k})^{d−m}

where *m* is the multiplicity of the *k*-mer in the input data, *e* is the error rate, *d* is the sequencing depth, *k* is the *k*-mer length, and C(*d*, *m*) is the binomial coefficient. Given the values of *d*, *e*, and *k*, the curve *Pr*(*m*; *d*, (1 − *e*)^{k}) can be computed.
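For concreteness, the binomial probability above can be evaluated directly; a minimal sketch (the function name is illustrative):

```python
from math import comb

def prob_multiplicity(m, d, e, k):
    """Pr(m; d, (1 - e)^k): probability that a k-mer from a unique region
    of the genome is sequenced correctly exactly m times at depth d."""
    p = (1.0 - e) ** k                  # probability of one correct sequencing
    return comb(d, m) * p**m * (1.0 - p)**(d - m)
```

Summing the curve over *m* = 0 … *d* gives 1, as expected for a probability distribution.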

For low values of *m*, equation 5 might not approximate well the expected number of occurrences of a *k*-mer *k _{i}*, because other sections of the genome could have been morphed into *k _{i}* by mistake. However, for medium and high values of *m*, such recurrent morphing is unlikely, especially for values of *k* high enough to be unique in the genome being sequenced.

Equation 5 is used to identify the range of reliable *k*-mers. To select the lower bound *l*, we compute *Pr*(*m*; *d*, (1 − *e*)^{k}) for each multiplicity *m* and cumulatively sum up these probabilities, starting from *m* = 2. The cumulative sum does not start from *m* = 1 because a *k*-mer appearing a single time in the input data (and therefore appearing on a single read) cannot be used to identify an overlap between two reads. The lower bound is the smallest value of *m* at which the cumulative sum exceeds a user-defined threshold *ϵ* (Algorithm 2). The choice of *l* matters when the sequencing error rate is relatively low (≈ 5%) or when the sequencing coverage is high (≈ 50 − 60×), or both. This is because in those cases, a *k*-mer with small multiplicity has a high probability of being incorrect.

The upper bound *u* is chosen by following a similar procedure. Here, the probabilities are cumulatively summed up starting from the largest possible value of *m* (i.e., *d*). In this case, *u* is the largest value of *m* at which the cumulative sum exceeds the user-defined threshold *ϵ* (see Algorithm 3). The *k*-mers that appear more frequently than *u* have too low a probability of belonging to a unique region of the genome, and such multi-mapped *k*-mers would lead to an increase in computational cost, and potentially to mis-assemblies.
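The two cumulative-sum procedures (Algorithms 2 and 3) can be sketched together as follows; `eps` plays the role of the user-defined threshold *ϵ*, and its default value here is only a placeholder:

```python
from math import comb

def reliable_bounds(d, e, k, eps=0.001):
    """Sketch of the lower/upper reliable-multiplicity bounds (Algorithms 2-3).
    eps stands in for the user-defined threshold (default is a placeholder)."""
    p = (1.0 - e) ** k
    pr = lambda m: comb(d, m) * p**m * (1.0 - p)**(d - m)
    # lower bound l: accumulate Pr(m) upward from m = 2 (a single-occurrence
    # k-mer cannot seed an overlap between two reads)
    l, cum = 2, 0.0
    for m in range(2, d + 1):
        cum += pr(m)
        if cum > eps:
            l = m
            break
    # upper bound u: accumulate Pr(m) downward from the largest multiplicity d
    u, cum = d, 0.0
    for m in range(d, 1, -1):
        cum += pr(m)
        if cum > eps:
            u = m
            break
    return l, u
```

Both bounds shift with the depth, error rate, and *k*-mer length, as noted below.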

*K*-mers appearing with multiplicity greater than *u*, and those appearing with multiplicity smaller than *l*, are discarded and not used as read features in the downstream algorithm. Our reliable *k*-mer selection procedure discards at most a 2*ϵ* fraction of the useful information, in terms of *k*-mers that could have been used for overlap discovery.

To summarize, both *l* and *u* vary with depth, error rate and *k*-mer length. They are computed by employing the user’s estimates for those properties of the dataset.

### Sparse Matrix Construction and Multiplication

The overlapper takes the *k*-mer dictionary from the previous module as input and removes from the dictionary the *k*-mers whose occurrence counts fall outside the reliable range [*l, u*]. For each *k*-mer, only the lexicographically smaller of the *k*-mer and its reverse complement is stored.
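The canonical-form rule (keeping the lexicographically smaller of a *k*-mer and its reverse complement) can be sketched as:

```python
def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse
    complement, so both strands map to one canonical representative."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    rc = "".join(comp[b] for b in reversed(kmer))
    return min(kmer, rc)
```

This halves the dictionary size while keeping strand-symmetric matching intact.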

BELLA uses a sparse matrix format to store its data and sparse matrix-matrix multiplication (SpGEMM) to identify overlaps. Sparse matrices express the data access patterns in a concise and clear manner, allowing better organization of computation. The sparse matrix **A**, also known as the *data matrix*, is a |*reads*|-by-|*k-mers*| matrix with reads as its rows and the entries of the *k*-mer dictionary as its columns. If the *j*th reliable *k*-mer is present in the *i*th read, the cell (*i, j*) of **A** is non-zero. **A** is then multiplied by its transpose, **A**^{T}, yielding a sparse *overlap matrix* **AA**^{T} of dimensions |*reads*|-by-|*reads*|. Each non-zero cell (*i, j*) of the overlap matrix contains the number of *k*-mers shared between the *i*th and the *j*th read, together with the positions, in each of the two reads, of (at most) two shared *k*-mers.

The column-by-column sparse matrix multiplication is implemented efficiently using the Compressed Sparse Columns (CSC) format for storing sparse matrices. However, other options are certainly possible in the future, which is one of the advantages of our work. Any novel sparse matrix format and multiplication algorithm would be applicable to the overlapping problem and would enable continued performance improvements since multiple software packages already implement this primitive, including Intel MKL and Sandia’s KokkosKernels (Deveci *et al.*, 2017).

The SpGEMM algorithm shown in Figure 3 is functionally equivalent to a *k*-mer based seed-index table, which is common in other long-read alignment codes. However, unlike hash tables, the CSC format allows true constant-time random access to columns. More importantly, the computational problem of accumulating the contributions from multiple shared *k*-mers to each pair of reads is handled automatically by the choice of appropriate data structures within SpGEMM. Figure 3 illustrates the merging operation of BELLA, which uses a hash table data structure indexed by the row indexes of **A**, following the multi-threaded implementation proposed by Nagasaka *et al.* (2018). Finally, the contents of the hash table are stored into a column of the final matrix once all required nonzeros for that column are accumulated.

In the resulting sparse overlap matrix **AA**^{T}, each non-zero cell (*i, j*) is a structure composed of an integer value storing the number of shared *k*-mers, and an integer array of size 4 storing the position on read *i* and on read *j* of two shared *k*-mers (or just one if the pair shares just one *k*-mer). To enable this special multiplication, which performs scalar multiplication and additions differently than with standard floating point numbers, we use the semiring abstraction (Kepner and Gilbert, 2011). Multiplication on a semiring allows the user to overload scalar multiplication and addition operations and still use the same SpGEMM algorithm. Many existing SpGEMM implementations support user-defined semirings, including those that implement the GraphBLAS API (Buluç *et al.*, 2017).
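To make the semiring idea concrete, the following pure-Python sketch mimics the column-by-column SpGEMM on the overlap semiring: the scalar "multiply" pairs up a shared *k*-mer's positions in the two reads, and the scalar "add" accumulates the shared-*k*-mer count while keeping at most two seed positions. This is an illustrative stand-in, not BELLA's C++/CSC implementation:

```python
def overlap_matrix(reads, k, reliable):
    """Pure-Python sketch of the overlap SpGEMM on a custom semiring:
    'multiply' pairs the positions of a shared k-mer in two reads,
    'add' accumulates the count and keeps at most two seed positions."""
    # data matrix A in column-major form: k-mer -> list of (read id, position)
    A = {}
    for rid, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            kmer = seq[pos:pos + k]
            if kmer in reliable:
                A.setdefault(kmer, []).append((rid, pos))
    # A * A^T: reads sharing a column (k-mer) produce a nonzero output cell
    overlaps = {}  # (i, j) -> [shared k-mer count, up to two seed position pairs]
    for entries in A.values():
        for i, pos_i in entries:
            for j, pos_j in entries:
                if i < j:  # the output is symmetric: keep the lower triangle
                    cell = overlaps.setdefault((i, j), [0, []])
                    cell[0] += 1                        # semiring 'add'
                    if len(cell[1]) < 2:
                        cell[1].append((pos_i, pos_j))  # semiring 'multiply' result
    return overlaps
```

Note that read pairs sharing no reliable *k*-mer never appear, which is exactly why the cost is not quadratic in the number of reads.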

Increasing the genome size also increases the memory requirements for building the final overlap matrix. For large genomes, it is possible that the sparse overlap matrix **AA**^{T} would not fit in memory even if the data matrix **A** does. BELLA avoids this situation by dividing the multiplication into batches based on the available RAM. At each stage, only a batch of columns of the overlap matrix are created. The set of nonzeros in that batch of the overlap matrix are immediately tested for alignments (as described in Section 3). The pairs that pass the alignment test are written to the output file of BELLA so that the current batch of overlap matrix can be discarded.

Given the nature of our problem, the sparse overlap matrix **AA**^{T} is symmetric. Thus, we compute only the lower triangle of **AA**^{T}, avoiding computing the pairwise alignment twice for each pair. Currently, there are no known specialized SpGEMM implementations for **AA**^{T} that store and operate only on **A**, but we hope to develop one in the future; this would cut the memory requirements in half. The obvious alternative of computing inner products of rows of **A** is suboptimal, because it has to perform Ω(|*reads*|^{2}) inner products even though the majority of them are zero. By contrast, our column-by-column implementation runs faster than *O*(|*reads*|^{2}) whenever the overlap matrix **AA**^{T} is sparse. Given that the main purpose of the overlapping process is to filter candidate pairs, the overlap matrix tends to be sparse. In practice, Table 4 shows that the overlap matrix is more than 99% sparse in all tested cases.

### Pairwise Alignment

BELLA avoids over-filtering in the overlap stage in order to achieve high recall. However, overlapping alone is not sufficient for high precision because the number of shared *k*-mers between read pairs is not a sufficiently informative feature for filtering false positives. High precision is desirable for avoiding wasted work in subsequent stages of *de novo* assembly. To achieve high precision, BELLA filters candidate read pairs by performing fast (i.e., approximately linear time) pairwise alignments.

For pairwise alignment, BELLA employs a seed-and-extend alignment algorithm. In contrast to approaches that rely on sketches or minimizers, such as Minimap and MHAP, seed-and-extend alignment can be performed directly using the *k*-mers from BELLA’s overlap stage. BELLA’s alignment module is based on an x-drop implementation proposed by Zhang *et al.* (2000) and implemented in the SeqAn library (Döring *et al.*, 2008), a `C++` library for sequence analysis. However, these can easily be changed in the future as more advanced pairwise alignment implementations are developed.

BELLA passes at most two seed *k*-mers to the x-drop alignment algorithm. If the read pair shares more than two *k*-mers, two *k*-mers are chosen such that they are at least *n* bases apart, for the following reason: a second alignment performed with an overlapping or nearby *k*-mer is likely to be only marginally better than the first, but to cost as much computationally. BELLA’s default setting is *n* = 1,000. Our experiments showed that further increasing the number of seeds did not increase accuracy by a large enough margin to justify the extra computational cost, suggesting that using at most two seeds for x-drop alignment represents a good compromise between accuracy and computational cost.

For each read pair in the overlap matrix, the alignment is extended from the one or two seed *k*-mers until the alignment score drops *x* points below the best score seen so far. For example, if the best score seen thus far were 200 with *x* = 3, the algorithm would stop the alignment when the score dropped to 197 or lower. The errors could be non-consecutive: supposing a linear scoring matrix, the score could go down to 198, get back to 199, and finally go down again to 197, at which point the stop criterion is met and the alignment ends. Once the alignment is complete, if the best score is lower than a given threshold, the pair of sequences is discarded.
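The x-drop termination rule described above can be illustrated on a trace of per-step score changes (this sketches only the stop criterion, not SeqAn's alignment itself; the function name is ours):

```python
def xdrop_stop(deltas, x):
    """Illustrates the x-drop termination rule on a trace of per-step score
    changes: extend while the score stays within x of the best score so far."""
    score = best = 0
    for step, d in enumerate(deltas):
        score += d
        best = max(best, score)
        if score <= best - x:
            return step, best  # stop: score dropped x (or more) below the best
    return len(deltas), best   # extension ran to the end without an x-drop
```

On the worked example in the text (best score 200, *x* = 3, scores dipping to 198, recovering to 199, then falling to 197), the extension stops exactly at the 197 step.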

The choice of *k*, the x-drop value, and the alignment score threshold all influence the final recall and precision. Our theoretical model for tuning these parameters assumes the typical read error rates of high-throughput PacBio sequencing (10 − 15%). These errors are randomly distributed, and the majority of them are indels, that is, erroneous insertions or deletions of nucleotides (Giordano *et al.*, 2017). We argue that an excessively small x-drop factor, such as *x* = 3, might be too strict for PacBio data as, for example, 3 consecutive insertions on one sequence at the beginning of the alignment would cause the exclusion of a true overlapping pair from the output. Larger values of *x* could increase the number of true positives while also making it easier to differentiate true alignments from false positives.

A fixed alignment threshold might not capture the real alignment. Instead, we expect the alignment score to increase as the overlap length between two sequences increases. Consequently, the alignment threshold would benefit from an adaptive behavior according to the estimated overlap length between two sequences that show evidence of overlap, resulting in an *adaptive threshold*. The choice of the scoring matrix used in the pairwise alignment step can justify the alignment score threshold being a linear function of the estimated overlap length.

Given an estimated overlap region of length *L* and the probability *p* = *q*^{2} of getting a correct base on both sequences, we would expect *m* = *p* · *L* correct matches within that overlap region. The alignment score *χ*, corresponding to the edit distance, can be written as follows:

*χ* = *α* · *m* − *β* · (*L* − *m*)

where *m* is the number of matches, *L* is the length of the overlap region, and *α* is the value associated with a match in the scoring matrix while *β* is the penalty for a mismatch or a gap/indel (*α, β* > 0). Given these assumptions, we define the ratio *ϕ* between *χ* and the estimated overlap length *L* as:

*ϕ* = *χ*/*L* = *α* · (*m*/*L*) − *β* · (1 − *m*/*L*)

The expected value of *ϕ* is equal to 2*p* − 1 when an exact alignment algorithm with *α* = *β* = 1 is used. We define a cutoff of the form (1 − *δ*)*ϕ*, retaining pairs scoring above this cutoff as true alignments and discarding the remaining pairs. We use a Chernoff bound (Chernoff, 1952; Hoeffding, 1963) to choose the value of *δ*, proving that there is only a small probability of missing a true overlap of length *L* ≥ 2000 bp (the minimum overlap length for a pair to be considered a true positive) when using the above-defined cutoff. We derived the following Chernoff bound:

$$\Pr\left[\chi \le (1 - \delta)\,\mathrm{E}[\chi]\right] \le e^{-\delta^{2} p L}$$

Given two sequences that indeed overlap by *L* = 2000 bp, the probability that their alignment score falls below the mean by more than 20% (*δ* = 0.2) is ≤ 7.90 × 10^{−26}. The derivation of this bound is reported in the Supplementary Material. BELLA achieves high recall and precision among state-of-the-art software tools using an x-drop value of *x* = 7 and an adaptive threshold derived from the scoring matrix with a cutoff rate of *δ* = 0.2.
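
The adaptive cutoff and the miss probability can be sketched in a few lines of Python. This is a worked check, not BELLA's implementation; the closed form exp(−*δ*²·*p*·*L*) for the bound is an assumption, chosen because it reproduces the 7.90 × 10^{−26} figure quoted above:

```python
import math

def adaptive_cutoff(L, error_rate=0.15, delta=0.2, alpha=1.0, beta=1.0):
    """Adaptive alignment-score threshold for an estimated overlap length L.

    Assumes the linear scoring model from the text: +alpha per match,
    -beta per mismatch/indel, and probability p = (1 - error_rate)**2
    that a base is correct on both reads.
    """
    p = (1.0 - error_rate) ** 2
    expected_phi = alpha * p - beta * (1.0 - p)  # E[score]/L; 2p - 1 when alpha = beta = 1
    return (1.0 - delta) * expected_phi * L      # keep pairs scoring above this

def miss_probability(L, error_rate=0.15, delta=0.2):
    """Chernoff-style bound on missing a true overlap of length L
    (assumed form: exp(-delta**2 * p * L))."""
    p = (1.0 - error_rate) ** 2
    return math.exp(-delta ** 2 * p * L)
```

For *L* = 2000 and *δ* = 0.2 the sketch gives a score cutoff of 712 and a miss probability of about 7.9 × 10^{−26}, matching the value in the text.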

Lastly, BELLA outputs alignments in a format similar to that of BLASR/MHAP (Chaisson and Tesler, 2012; Berlin *et al.*, 2015), which output a `.m4` file. For each pair of reads that pass both the overlapping and alignment stage filters, the output includes a line with: (a) the identifiers of the reads, (b) the number of shared *k*-mers, (c) the alignment score, (d) strand information (*n* if the reads belong to the same strand, *c* otherwise, following the DALIGNER (Myers, 2014) convention), (e) the start and end positions of the alignment in the first read, (f) the start and end positions of the alignment in the second read, and (g) the lengths of the reads. If a pair of reads shares two *k*-mers and both respective alignment scores exceed the alignment score threshold, the higher-scoring alignment is output.
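
A minimal sketch of assembling one such record follows; the field order mirrors the list (a)–(g) above, but the separator and exact `.m4` column layout here are assumptions, so consult the tool's documentation for the authoritative format:

```python
def format_overlap_line(id1, id2, nkmers, score, same_strand,
                        start1, end1, start2, end2, len1, len2):
    """Format one overlap record with the fields listed in the text:
    read identifiers, shared k-mer count, alignment score, strand flag
    ('n' = same strand, 'c' = complement, DALIGNER convention),
    alignment coordinates on both reads, and read lengths.
    Illustrative only; the real column layout may differ."""
    strand = "n" if same_strand else "c"
    fields = [id1, id2, nkmers, score, strand,
              start1, end1, start2, end2, len1, len2]
    return " ".join(str(f) for f in fields)
```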

## 4 Evaluation

The datasets used for evaluation are listed in Table 1. We selected genomes with varying size and complexity since analysis results are sensitive to these features (Li *et al.*, 2012).

As performance metrics, we used recall, precision, F1 score, and running time. Recall is the fraction of true positives of the aligner/overlapper over the total size of the *ground truth*; precision is the fraction of true positives over the total number of elements found by the aligner/overlapper (i.e., its true and false positives combined). The F1 score is the harmonic mean of precision and recall, and is the preferred metric when seeking a balance between the two: a tool that consistently performs well in both precision and recall is more valuable than one that achieves a low score in either, and the F1 score captures this.

More formally,

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

where *TP* is the size of the true positive set, *FP* is the size of the false positive set, and *FN* is the size of the false negative set. We consider a read pair a true positive if the sequences align for at least 2 kb in the reference genome. We derived the threshold *t* = 2 kb from the procedure proposed by Li (2016) and generated the ground truth using Minimap2. A detailed description of our evaluation procedure and ground truth generation can be found in the Supplementary Material.
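
These three metrics can be computed directly from the confusion counts; a straightforward sketch:

```python
def evaluation_metrics(tp, fp, fn):
    """Recall, precision, and F1 score from true positive, false positive,
    and false negative counts."""
    recall = tp / (tp + fn)           # fraction of ground truth recovered
    precision = tp / (tp + fp)        # fraction of reported pairs that are correct
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return recall, precision, f1
```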

## Results

We evaluated BELLA against several state-of-the-art overlappers, using both simulated and real PacBio data. The simulated data was generated using PBSIM (Ono *et al.*, 2012) with an error rate of 15%. The results are shown in Table 2 and Table 3, respectively. The last column of each table indicates whether the respective overlapper also performs alignment on overlapping reads.

Simulated data has the advantage that the ground truth is known: since the reads are extracted from a fixed genome, the overlaps are known by construction. Table 2 shows that MECAT, Minimap2, and MHAP made significant trade-offs between recall and precision; while MECAT had the highest precision, it missed a large number of the true overlaps. In contrast, both BELLA and BLASR were consistently strong (typically over 80%) in both precision and recall, but BLASR had a much higher computational cost (on average, 4.3× slower than BELLA). BELLA's consistent quality is reflected in its F1 scores, which outperformed all the other tools. Minimap2 was the fastest tool on the whole set of simulated data, but performed only overlap detection and not alignment.

For the evaluation with real data, since exact error rates are unknown, BELLA estimates the mean error rate *e* from the read set using the per-base quality scores provided with the data. The results are shown in Table 3. Although BLASR performed reasonably well on the synthetic data, it achieved very low recall on certain real data sets (as low as 3.23% for *C. elegans* 40X). DALIGNER proved to be the fastest of the tools that perform alignment, but BELLA was within 2.5× of its runtime and dominated it in quality (precision and recall). As on the synthetic data, BLASR was the slowest by a significant margin.

Finally, Table 4 shows the number and the percentage of candidate overlaps discarded by BELLA, before and after the alignment step, on four representative data sets. For comparison, the number of reads squared (the potential all-to-all pairwise alignment workload) is also provided. The overlap detection procedure of BELLA was able to filter over **99%** of all read pairs before performing pairwise alignment.

## 5 Discussion

BELLA offers a computationally efficient and highly accurate approach to overlapping and aligning noisy long reads, based on mathematical models that minimize the cost of overlap detection without missing large numbers of true overlaps. The evaluation results in Tables 2, 3, and 4 demonstrate BELLA's superior accuracy compared to state-of-the-art software on simulated data, where reliable ground truth is known, and competitive accuracy on real data, where the ground truth is based on the results of read-to-reference alignment. The runtime of BELLA is, for the most part, in line with competing overlappers, which is noteworthy given that BELLA also outputs alignments. Further, these alignments are sufficiently accurate to support downstream analysis and assembly tasks.

On simulated data, BELLA achieved both high precision and high recall, thus obtaining the best F1 score on the whole set of data. Table 2 shows that BELLA outperformed its closest competitor, Minimap2, with an average of 2.7% higher recall, 17.9% higher precision, and 10.9% higher F1 score. For *P. aeruginosa*, Minimap2's recall was comparable to that achieved by BELLA; for the remaining data sets, BELLA clearly achieved the highest recall.

On real data, BELLA had the highest precision on the *E. coli* (Sample) data set and consistently high F1 scores, comparable with those of its main competitor, Minimap2, on the two *E. coli* data sets. Notably, BELLA had a 40.5% higher F1 score than Minimap2 for *C. elegans* 40X. Overall, the top performer on one data set often becomes one of the worst on another, whereas BELLA's F1 score is consistently within 6.5% of the top entry.

BELLA estimates the error rate *e* from the read set unless it is user-defined; if the error rate is known, we recommend providing it to BELLA. Along with the sequencing depth and the *k*-mer length, *e* is one of the parameters BELLA uses to compute the reliable *k*-mer range and the adaptive alignment score threshold. Consequently, the error rate of the data set influences which *k*-mers are retained for overlap detection, which in turn affects BELLA's recall and precision.

Tables 2 and 3 show that BELLA achieved higher recall on synthetic data than on real data. This behavior could be related to how the ground truth sets are generated. For synthetic data, the exact location from which a read originates is known, since this location was chosen deterministically by the tool (PBSIM) that generated the data. For real data, the location of each read is determined by Minimap2. However, there is no guarantee that Minimap2 correctly locates all the alignments, or that it finds every correct alignment. Hence, BELLA could find correct alignments beyond those identified by Minimap2, and its true accuracy on real data may be higher than reported. We plan to investigate these issues more deeply in the future.

BELLA saves and uses at most two shared *k*-mers as seeds for the alignment step. These *k*-mers are chosen such that they are at least *n* bases apart, which avoids additional, computationally expensive pairwise alignments seeded by overlapping or nearby *k*-mers. If performance is not a concern, the user can disable this optimization and use all available *k*-mers (separated by at least *n* bases) as seeds. For example, when using all reliable *k*-mers separated by at least 1,000 bp, BELLA achieved 4.1% higher recall (80.84%), 0.1% higher precision (81.96%), and 2.1% higher F1 score (81.39%) for *E. coli* (Sample), thus outperforming Minimap2 in quality on this data set. This quality improvement came at the cost of a ≈2× slower runtime, which ultimately motivated our choice of at most two shared *k*-mers.
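
The seed selection described here can be sketched as a simple greedy scan; this illustrates the stated rule and is not BELLA's actual code:

```python
def select_seeds(kmer_positions, min_separation=1000, max_seeds=2):
    """Greedily pick up to `max_seeds` shared k-mer positions, each at
    least `min_separation` bases after the previously chosen one.
    `kmer_positions` holds start positions of shared k-mers on one read."""
    chosen = []
    for pos in sorted(kmer_positions):
        if not chosen or pos - chosen[-1] >= min_separation:
            chosen.append(pos)
            if len(chosen) == max_seeds:
                break
    return chosen
```

For example, positions [10, 200, 1500, 3000] with the defaults yield seeds at 10 and 1500: position 200 is skipped for being too close to 10, and the scan stops once two seeds are found.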

On the other hand, if recall and speed (and not necessarily precision) are the user's main interest, the pairwise alignment can be skipped and the overlapping module used alone, in which case BELLA achieves its maximum recall. For example, on *E. coli* (Sample), BELLA achieved a recall of 94.02% with a speedup of ≈2× over the value reported in Table 3, outperforming MECAT with respect to runtime.

Table 4 shows that BELLA can filter out over 99% of all read pairs while still achieving high recall. This demonstrates the high quality of our overlap detection filter: it excludes a massive number of non-overlapping read pairs without losing correct ones.

We note that parallel SpGEMM implementations offer a path for efficient parallelization of the BELLA methods described above.
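
To make the sparse-matrix formulation concrete, here is a toy Python sketch of overlap candidate detection as the product A·Aᵀ of a reads-by-*k*-mers occurrence matrix with its transpose, formed column by column. This omits reliable-*k*-mer filtering and reverse complements, and is not BELLA's SpGEMM implementation:

```python
from collections import defaultdict
from itertools import combinations

def shared_kmer_counts(reads, k=4):
    """Count shared k-mers per read pair via the sparse product A @ A.T.

    Conceptually, A is the sparse reads-by-kmers occurrence matrix; the
    product is formed one k-mer column at a time: each k-mer's read list
    contributes +1 to every pair of reads containing that k-mer."""
    kmer_to_reads = defaultdict(set)
    for i, read in enumerate(reads):
        for s in range(len(read) - k + 1):
            kmer_to_reads[read[s:s + k]].add(i)
    counts = defaultdict(int)  # (i, j) -> number of distinct shared k-mers
    for read_ids in kmer_to_reads.values():
        for i, j in combinations(sorted(read_ids), 2):
            counts[(i, j)] += 1
    return dict(counts)
```

Because only nonzero entries of the product are ever touched, read pairs sharing no *k*-mer cost nothing, which is the property that makes the SpGEMM formulation efficient.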

## 6 Conclusion

Single-molecule sequencing technologies have favored overlap-layout-consensus (OLC) approaches to *de novo* assembly, enabling highly accurate reconstruction of complex genomes. However, overlap detection on long-read data is a major computational bottleneck, and high error rates further complicate it by making the signal hard to separate from the noise.

We presented BELLA, a computationally efficient and highly accurate long-read to long-read aligner and overlapper. BELLA uses a *k*-mer-based approach to detect overlaps between noisy long reads. We demonstrated the feasibility of this approach through a mathematical model based on Markov chains. BELLA provides a novel algorithm for pruning *k*-mers that are unlikely to be useful in overlap detection and whose presence would only incur unnecessary computational costs. This reliable *k*-mer detection algorithm explicitly maximizes the probability of retaining *k*-mers that belong to unique regions of the genome.

To achieve fast overlapping without sketching, BELLA uses sparse matrix-matrix multiplication and utilizes high-performance software and libraries developed for this sparse matrix subroutine. Any novel sparse matrix format and multiplication algorithm would be applicable to overlap detection and enable continued performance improvements. BELLA’s overlap detection has been coupled with a state-of-the-art seed-and-extend banded-alignment method. We developed and implemented a new method to separate true alignments from false positives depending on the alignment score. This method demonstrated that the probability of false positives decreases exponentially as the length of overlap between sequences increases.

BELLA achieves higher recall and F1 scores than state-of-the-art tools on simulated data and high values of precision and F1 score on real data, while being performance competitive, thus demonstrating the validity of our approach. Future work includes a further characterization of real data features and the development of a high-performance pairwise alignment algorithm to improve BELLA’s performance.

## Funding

This work is supported by the Advanced Scientific Computing Research (ASCR) program within the Office of Science of the DOE under contract number DE-AC02-05CH11231. We used resources of the National Energy Research Scientific Computing Center (NERSC), supported by the Office of Science of the DOE under Contract No. DE-AC02-05CH11231. This research was also supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.

## Acknowledgements

Thanks to Heng Li for the help with Minimap2 and to Rob Egan and Steven Hofmeyr for valuable discussions.