Abstract
Motivation Pairwise alignment of nucleotide sequences is, in practice, computed with the seed-and-extend strategy: seeds (shared patterns) between the sequences are enumerated first, and each seed is then extended by a Smith-Waterman-like semi-global dynamic programming to obtain a full pairwise alignment. With the advent of massively parallel short-read sequencers, algorithms and data structures for finding seeds efficiently have been explored extensively. However, recent advances in single-molecule sequencing technologies yield millions of reads, each orders of magnitude longer than those produced by short-read sequencers, and therefore demand a faster algorithm for the extension step, which dominates the computation time of pairwise local alignment. Our goal is to design a faster extension algorithm that copes with two major drawbacks of single-molecule sequencers: the sequencing error rate is high (e.g., 10-15%), and insertions and deletions are more frequent than substitutions.
Results We propose an adaptive banded dynamic programming (DP) algorithm for calculating pairwise semi-global alignments of nucleotide sequences. It tolerates a relatively high insertion or deletion rate while keeping the band width at a small constant (e.g., 32 cells). In every band-advancing operation, the cells at the forefront of the band have no mutual dependencies and are calculated simultaneously, which allows efficient Single-Instruction-Multiple-Data (SIMD) parallelization. An experiment shows that our algorithm runs approximately 8 times faster than the extension alignment algorithm in NCBI BLAST+ while retaining similar sensitivity and accuracy. The results indicate that the algorithm can replace the extension alignment routines in existing nucleotide local alignment programs.
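To make the idea of a constant-width band that advances one anti-diagonal at a time more concrete, the following is a minimal, unoptimized Python sketch. It is not the released implementation. The function names, the linear gap penalty, the scoring values, and in particular the steering rule (compare the scores at the two ends of the current forefront and shift the band toward the larger one, moving right on ties) are assumptions made for illustration only. A plain full-matrix routine is included so that the banded score can be compared with an unrestricted one.

```python
NEG = float("-inf")


def adaptive_banded_extension(a, b, band=32, match=1, mismatch=-1, gap=-1):
    """Best extension score from cell (0, 0), computed inside a moving band.

    Illustrative sketch: linear gaps and an assumed steering rule.
    """
    n, m = len(a), len(b)
    lo = -(band // 2)       # smallest row index i covered on the current anti-diagonal
    prev2, prev1 = {}, {}   # scores on anti-diagonals p-2 and p-1, keyed by row index i
    best = 0
    for p in range(n + m + 1):          # p = i + j identifies the anti-diagonal
        cur = {}
        # Cells on one anti-diagonal read only prev1 and prev2, never each
        # other, so this loop is the part a SIMD implementation vectorizes.
        for i in range(lo, lo + band):
            j = p - i
            if not (0 <= i <= n and 0 <= j <= m):
                continue                # cell lies outside the DP matrix
            if p == 0:
                cur[i] = 0              # origin of the extension
                continue
            score = NEG
            if i > 0 and j > 0:
                s = match if a[i - 1] == b[j - 1] else mismatch
                score = prev2.get(i - 1, NEG) + s          # diagonal move
            score = max(score,
                        prev1.get(i - 1, NEG) + gap,       # from cell (i-1, j)
                        prev1.get(i, NEG) + gap)           # from cell (i, j-1)
            cur[i] = score
        if not cur:
            break                       # the band has left the matrix
        best = max(best, max(cur.values()))
        # Assumed steering rule: shift the band down when the bottom-left end
        # of the forefront scores higher than the top-right end, else right.
        top = cur.get(lo, NEG)
        bottom = cur.get(lo + band - 1, NEG)
        if bottom > top:
            lo += 1
        prev2, prev1 = prev1, cur
    return best


def full_extension_score(a, b, match=1, mismatch=-1, gap=-1):
    """Reference: the same recurrence evaluated over the whole matrix."""
    m = len(b)
    prev = [j * gap for j in range(m + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        row = [i * gap] + [0] * m
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            row[j] = max(prev[j - 1] + s, prev[j] + gap, row[j - 1] + gap)
        best = max(best, max(row))
        prev = row
    return best
```

Because the banded routine evaluates a subset of the cells of the full matrix, its score can never exceed the full-matrix score. How often the two agree on noisy reads depends on the band width and on the steering rule, which is what a sensitivity comparison would need to measure.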
Availability The implementation of the algorithm and the benchmarking scripts are available at https://github.com/ocxtal/adaptivebandbench.
Contact mkasa{at}k.u-tokyo.ac.jp