Abstract
Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. We present AvxWindowedFMindex (AWFM-index), an open-source, thread-parallel FM-index library written in C that is optimized for indexing nucleotide and amino acid sequences. AWFM-index is easy to incorporate into bioinformatics software. In a single-threaded context, it performs exact-match count and locate queries ~2-4x faster than SeqAn3’s FM-index implementation for nucleotide search, and ~2-6x faster for amino acid search. This performance is due to (i) a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and (ii) inclusion of a cache-efficient lookup table for partial k-mer searches. AWFM-index also parallelizes trivially across multiple threads with good scaling, and enables efficient on-disk storage of the memory-intensive suffix array. The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex.
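To make the role of the occurrence function concrete, the following is a minimal sketch of an FM-index exact-match count query over nucleotides. It is not the AWFM-index API or its storage layout: every name here (`ToyFmIndex`, `toy_build`, `toy_count`, `MAX_LEN`) is hypothetical, the suffix array is built by naive sorting, and the occurrence function uses a plain scalar popcount over one bit-vector per letter with no checkpoints, rather than the strided AVX2 format described above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LEN 1024 /* sketch only: fixed capacity, including the sentinel */

typedef struct {
    size_t n;                       /* BWT length, including the sentinel */
    uint64_t bits[4][MAX_LEN / 64]; /* bits[c]: bit i is set iff bwt[i] is letter c */
    size_t first[5];                /* first[c]: number of symbols smaller than letter c */
} ToyFmIndex;

static const char *g_text; /* read by the qsort comparator; not thread-safe */

static int letter_code(char ch) {
    switch (ch) {
        case 'A': return 0; case 'C': return 1;
        case 'G': return 2; case 'T': return 3;
        default:  return -1;
    }
}

/* Compare two suffixes; the NUL terminator plays the role of the sentinel. */
static int cmp_suffix(const void *a, const void *b) {
    return strcmp(g_text + *(const size_t *)a, g_text + *(const size_t *)b);
}

static unsigned popcount64(uint64_t x) {
    unsigned count = 0;
    while (x) { x &= x - 1; count++; } /* clear the lowest set bit */
    return count;
}

/* Build the toy index: naive suffix sort, then one bit-vector per letter. */
static void toy_build(ToyFmIndex *idx, const char *text) {
    size_t sa[MAX_LEN];
    size_t len = strlen(text);
    assert(len + 1 <= MAX_LEN);
    memset(idx, 0, sizeof *idx);
    idx->n = len + 1;
    for (size_t i = 0; i <= len; i++) sa[i] = i;
    g_text = text;
    qsort(sa, idx->n, sizeof sa[0], cmp_suffix);

    size_t counts[4] = {0, 0, 0, 0};
    for (size_t i = 0; i < idx->n; i++) {
        if (sa[i] == 0) continue; /* this BWT position holds the sentinel */
        int c = letter_code(text[sa[i] - 1]);
        assert(c >= 0);
        idx->bits[c][i / 64] |= UINT64_C(1) << (i % 64);
        counts[c]++;
    }
    idx->first[0] = 1; /* only the sentinel sorts before 'A' */
    for (int c = 0; c < 4; c++) idx->first[c + 1] = idx->first[c] + counts[c];
}

/* Occurrence function: how many times letter c appears in bwt[0..i). */
static size_t occ(const ToyFmIndex *idx, int c, size_t i) {
    size_t count = 0, w;
    for (w = 0; w < i / 64; w++) count += popcount64(idx->bits[c][w]);
    if (i % 64) {
        uint64_t mask = (UINT64_C(1) << (i % 64)) - 1;
        count += popcount64(idx->bits[c][w] & mask);
    }
    return count;
}

/* Backward search: returns the number of exact matches of a non-empty pattern. */
static size_t toy_count(const ToyFmIndex *idx, const char *pattern) {
    size_t lo = 0, hi = idx->n;
    for (size_t k = strlen(pattern); k-- > 0;) {
        int c = letter_code(pattern[k]);
        if (c < 0) return 0;
        lo = idx->first[c] + occ(idx, c, lo);
        hi = idx->first[c] + occ(idx, c, hi);
        if (lo >= hi) return 0;
    }
    return hi - lo;
}
```

Each pattern character costs two occurrence-function calls, which is why the speed of that function dominates query time. In this sketch `occ` scans from the start of the bit-vector, so its cost grows with the text length; a practical index stores periodic checkpoint counts so that each call touches only one fixed-size block.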
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Minor changes to the text, and a couple of extra experiments.
http://wheelerlab.org/publications/2021-AWFM-Anderson/Anderson_suppl.tar.gz