RT Journal Article
SR Electronic
T1 Sapling: Accelerating Suffix Array Queries with Learned Data Models
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 2020.01.29.925768
DO 10.1101/2020.01.29.925768
A1 Kirsche, Melanie
A1 Das, Arun
A1 Schatz, Michael C.
YR 2020
UL http://biorxiv.org/content/early/2020/01/30/2020.01.29.925768.abstract
AB Motivation: As genomic data becomes more abundant, efficient algorithms and data structures for sequence alignment become increasingly important. The suffix array is a widely used data structure to accelerate alignment, but the binary search algorithm used to query it requires widespread memory accesses, causing a large number of cache misses on large datasets. Results: Here we present Sapling, an algorithm for sequence alignment which uses a learned data model to augment the suffix array and enable faster queries. We investigate different types of data models, providing an analysis of different neural network models as well as providing an open-source aligner with a compact, practical piecewise linear model. We show that Sapling outperforms both an optimized binary search approach and multiple existing read aligners on a wide collection of genomes, including human, bacteria, and plants, speeding up the algorithm by more than a factor of two while adding less than 1% to the suffix array’s memory footprint. Availability and implementation: The source code and tutorial are available open-source at https://github.com/mkirsche/sapling. Supplementary Information: Supplementary notes and figures are available online.