Abstract
Long-read sequencing is now routinely used at scale for genomics and transcriptomics applications. Mapping long reads or a draft genome assembly to a reference sequence is often one of the most time-consuming steps in these applications. Here, we present techniques to accelerate minimap2, a widely used mapping tool. We apply multiple optimizations, using SIMD parallelization, efficient cache utilization and a learned index data structure, to accelerate its three main computational modules: seeding, chaining and pairwise sequence alignment. Together, these optimizations reduce the end-to-end mapping time of minimap2 by up to 1.8× while producing identical output.
- architecture-aware optimizations
- SIMD
- sequence alignment
- minimap2
- learned-index
Competing Interest Statement
SK, VM and SM are employees of Intel Corporation.
Footnotes
This revised version presents mm2-fast optimizations applied to minimap2 (version 2.22). Figures 2, 3, and 4 revised. Supplementary Figures S1 and S2 revised. Supplementary file updated.