Accelerating long-read analysis on modern CPUs

Saurabh Kalikar; Chirag Jain; Vasimuddin Md; Sanchit Misra

doi:10.1101/2021.07.21.453294

Abstract

Long-read sequencing is now routinely used at scale for genomics and transcriptomics applications. Mapping long reads or a draft genome assembly to a reference sequence is often one of the most time-consuming steps in these applications. Here, we present techniques to accelerate minimap2, a widely used software for mapping. We present multiple optimizations using SIMD parallelization, efficient cache utilization and a learned index data structure to accelerate its three main computational modules, i.e., seeding, chaining and pairwise sequence alignment. Together, these optimizations reduce the end-to-end mapping time of minimap2 by up to 1.8× while maintaining identical output.

Keywords

architecture-aware optimizations; SIMD; sequence alignment; minimap2; learned-index

Competing Interest Statement

SK, VM and SM are employees of Intel Corporation.

Footnotes

This revised version presents mm2-fast optimizations applied to minimap2 (version 2.22). Figures 2, 3, and 4 were revised. Supplementary Figures S1 and S2 were revised. The supplementary file was updated.

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.