Abstract
Motivation Sequence alignment has been a core problem in computational biology for the last half-century. It is an open problem whether exact pairwise alignment is possible in linear time for related sequences (Medvedev, 2022b).
Methods We solve exact global pairwise alignment with respect to edit distance by using the A* shortest path algorithm on the edit graph. To efficiently align long sequences with a high error rate, we extend the seed heuristic for A* (Ivanov et al., 2022) with match chaining, inexact matches, and the novel match pruning optimization. We prove the correctness of our algorithm and provide an efficient implementation in A*PA.
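The core idea of running A* on the edit graph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the A*PA implementation: it uses only a simplified seed heuristic (count the remaining seeds of the first sequence that have no exact match anywhere in the second) and omits match chaining, inexact matches, and match pruning. The function names `seed_heuristic_table` and `astar_edit_distance` and the parameter `k` (seed length) are hypothetical and chosen for this example.

```python
import heapq


def seed_heuristic_table(a, b, k):
    """Return h where h[i] is the number of seeds of `a` that start at or
    after position i and have no exact match anywhere in `b`.

    Seeds are the non-overlapping substrings a[0:k], a[k:2k], ...
    Each unmatched seed must cost at least one edit, so h is a lower bound
    on the remaining edit distance (an admissible heuristic).
    """
    kmers_b = {b[j:j + k] for j in range(len(b) - k + 1)}
    unmatched = {s for s in range(0, len(a) - k + 1, k)
                 if a[s:s + k] not in kmers_b}
    h = [0] * (len(a) + 1)
    for i in range(len(a) - 1, -1, -1):
        h[i] = h[i + 1] + (1 if i in unmatched else 0)
    return h


def astar_edit_distance(a, b, k=3):
    """Exact edit distance between `a` and `b` via A* on the edit graph.

    A state (i, j) means a[:i] has been aligned to b[:j]. States may be
    re-expanded when a shorter path is found, so an admissible heuristic
    is enough for the result to be exact.
    """
    n, m = len(a), len(b)
    h = seed_heuristic_table(a, b, k)
    best = {(0, 0): 0}            # best known distance g to each state
    heap = [(h[0], 0, 0, 0)]      # entries are (f = g + h, g, i, j)
    while heap:
        _, g, i, j = heapq.heappop(heap)
        if g > best.get((i, j), float("inf")):
            continue              # stale queue entry
        if i == n and j == m:
            return g
        steps = []
        if i < n and j < m:       # match (cost 0) or substitution (cost 1)
            steps.append((i + 1, j + 1, 0 if a[i] == b[j] else 1))
        if i < n:                 # delete a[i]
            steps.append((i + 1, j, 1))
        if j < m:                 # insert b[j]
            steps.append((i, j + 1, 1))
        for ni, nj, cost in steps:
            ng = g + cost
            if ng < best.get((ni, nj), float("inf")):
                best[(ni, nj)] = ng
                heapq.heappush(heap, (ng + h[ni], ng, ni, nj))
    raise RuntimeError("end state unreachable")  # cannot happen on an edit graph
```

For example, `astar_edit_distance("kitten", "sitting")` returns 3. The heuristic here depends only on the position in the first sequence, which keeps the sketch short; it guides the search less precisely than a heuristic that also accounts for where in the second sequence the matches occur.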
Results We evaluate A*PA on synthetic data (random sequences of length n with uniform mutations with error rate e) and on real long ONT reads of human data. On the synthetic data with e=5% and n≤10^7 bp, A*PA exhibits a near-linear empirical runtime scaling of n^1.08 and achieves >250× speedup compared to the leading exact aligners EDLIB and BIWFA. Even for a high error rate of e=15%, the empirical scaling is n^1.28 for n≤10^7 bp. On two real datasets, A*PA is the fastest aligner for 58% of the alignments when the reads contain only sequencing errors, and for 17% of the alignments when the reads also include biological variation.
Contact ragnar.grootkoerkamp{at}inf.ethz.ch, pesho{at}inf.ethz.ch
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
- fixed abstract: it was not matching the PDF because of twice repeated Motivation and Methods
- Added (c) The Authors 2022 to front page footer
- \section*{acknowledgements} (unnumbered)
- change hyperref package to not draw those ugly boxes