Exact global alignment using A* with seed heuristic and match pruning

Ragnar Groot Koerkamp; Pesho Ivanov

doi:10.1101/2022.09.19.508631

Abstract

Motivation Sequence alignment has been a core problem in computational biology for the last half-century. It is an open problem whether exact pairwise alignment is possible in linear time for related sequences (Medvedev, 2022b).

Methods We solve exact global pairwise alignment with respect to edit distance by using the A* shortest path algorithm on the edit graph. To efficiently align long sequences with a high error rate, we extend the seed heuristic for A* (Ivanov et al., 2022) with match chaining, inexact matches, and the novel match pruning optimization. We prove the correctness of our algorithm and provide an efficient implementation in A*PA.
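As an illustration of the approach described in the Methods paragraph, the following is a minimal Python sketch of A* on the edit graph guided by a basic seed heuristic. It is not the A*PA implementation: it uses exact seed matches only and omits match chaining, inexact matches, and match pruning. The function names (`seed_heuristic_table`, `astar_edit_distance`) and the default seed length `k` are hypothetical choices made for this example. The sketch assumes the heuristic for a state (i, j) is the number of seeds starting at or after position i in the first sequence that have no exact occurrence in the second sequence. Each such seed must cost at least one edit, so the estimate never exceeds the true remaining cost. Because this simple heuristic is not necessarily consistent, the sketch lets a state be re-expanded whenever a cheaper path to it is found.

```python
import heapq


def seed_heuristic_table(a, b, k):
    """Return h where h[i] = number of seeds of `a` that start at or after
    position i and have no exact occurrence anywhere in `b`."""
    n = len(a)
    kmers_b = {b[j:j + k] for j in range(len(b) - k + 1)}
    # Mark the start position of every non-overlapping seed without a match.
    unmatched_at = [0] * (n + 1)
    for start in range(0, n - k + 1, k):
        if a[start:start + k] not in kmers_b:
            unmatched_at[start] = 1
    # Suffix sums turn the marks into a lookup table indexed by i.
    h = [0] * (n + 2)
    for i in range(n, -1, -1):
        h[i] = h[i + 1] + unmatched_at[i]
    return h[:n + 1]


def astar_edit_distance(a, b, k=3):
    """Edit distance between `a` and `b`, found by A* on the edit graph."""
    n, m = len(a), len(b)
    h = seed_heuristic_table(a, b, k)
    best = {(0, 0): 0}            # cheapest known cost g to reach each state
    heap = [(h[0], 0, 0, 0)]      # entries are (f = g + h, g, i, j)
    while heap:
        _, g, i, j = heapq.heappop(heap)
        if g > best[(i, j)]:
            continue              # stale entry: a cheaper path was found later
        if i == n and j == m:
            return g
        neighbours = []
        if i < n and j < m:       # match (cost 0) or substitution (cost 1)
            neighbours.append((i + 1, j + 1, 0 if a[i] == b[j] else 1))
        if i < n:                 # delete a[i]
            neighbours.append((i + 1, j, 1))
        if j < m:                 # insert b[j]
            neighbours.append((i, j + 1, 1))
        for ni, nj, cost in neighbours:
            ng = g + cost
            if ng < best.get((ni, nj), float("inf")):
                best[(ni, nj)] = ng
                heapq.heappush(heap, (ng + h[ni], ng, ni, nj))
    raise AssertionError("the end state is always reachable")


print(astar_edit_distance("kitten", "sitting", k=2))
```

The heuristic depends only on the position in the first sequence, so it can be precomputed once as a table. The sketch says nothing about the runtime behaviour reported for A*PA, which relies on the additional techniques listed above.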

Results We evaluate A*PA on synthetic data (random sequences of length n with uniform mutations at error rate e) and on real long ONT reads of human data. On the synthetic data with e=5% and n≤10⁷ bp, A*PA exhibits a near-linear empirical runtime scaling of n^1.08 and achieves a >250× speedup compared to the leading exact aligners EDLIB and BIWFA. Even for a high error rate of e=15%, the empirical scaling is n^1.28 for n≤10⁷ bp. On two real datasets, A*PA is the fastest aligner for 58% of the alignments when the reads contain only sequencing errors, and for 17% of the alignments when the reads also include biological variation.

Availability github.com/RagnarGrootKoerkamp/astar-pairwise-aligner

Contact ragnar.grootkoerkamp{at}inf.ethz.ch, pesho{at}inf.ethz.ch

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

- fixed abstract: it was not matching the PDF because of twice repeated Motivation and Methods
- Added (c) The Authors 2022 to front page footer
- \section*{acknowledgements} (unnumbered)
- change hyperref package to not draw those ugly boxes

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.