LevioSAM: Fast lift-over of alternate reference alignments

Motivation As more population genetics datasets and population-specific references become available, the task of translating (“lifting”) read alignments from one reference coordinate system to another is becoming more common. Existing tools generally require a chain file, whereas VCF files are the more common way to represent variation. Existing tools also do not make effective use of threads, creating a post-alignment bottleneck. Results LevioSAM is a tool for lifting SAM/BAM alignments from one reference to another using a VCF file containing population variants. LevioSAM uses succinct data structures and scales efficiently to many threads. When run downstream of a read aligner, levioSAM completes in less than 13% the time required by an aligner when both are run with 16 threads. Availability https://github.com/alshai/levioSAM Contact tmun1@jhu.edu, langmea@cs.jhu.edu


Introduction
Most analyses of sequencing datasets start by aligning the reads to a linear reference genome. But using a single linear reference leads to reference bias, a tendency to produce incorrect alignments or to miss alignments for reads containing non-reference alleles. Many methods have been proposed for reducing bias by incorporating alternate alleles into the reference (Garrison et al., 2018). Some of these methods focus on aligning reads to one or more linear references containing non-reference alleles, e.g. major allele (Dewey et al., 2011), or individual-specific references (Rozowsky et al., 2011).
Such variant-aware methods often end with a step that translates alignments from the variant-augmented coordinate system to a standard reference. This can affect the alignment's offset with respect to the chromosome, its CIGAR string (reflecting insertion/deletion differences), etc.
This "lift-over" problem has been addressed in prior tools including UCSC liftOver (Fujita et al., 2010), and CrossMap (Zhao et al., 2014). To our knowledge, only CrossMap supports lifting alignments in the SAM/BAM format (Li et al., 2009). These tools also require a "chain file" as input, whereas allele information for building alternate references is usually available as VCF files (Danecek et al., 2011) provided by large consortia for cataloguing variation (1000Genomes Project Consortium et al., 2015Karczewski et al., 2020).
We describe levioSAM, a new method and software tool for lifting alignments from an alternate to a target reference genome. LevioSAM uses succinct data structures to store and query the pattern of gaps in the alignment between the references. LevioSAM supports the most common file formats: SAM and BAM for alignments and VCF for genetic variants. When using multiple threads, levioSAM is faster than CrossMap but gives identical mapping results.

Approach
LevioSAM uses succinct data structures from the SDSL library (Gog et al., 2014) to represent indel differences between the source and target references ( Figure 1).
The structures can stored on disk so they can be reused without recalculating differences. LevioSAM supports standard BAM/SAM input files. LevioSAM's translation affects several fields of the SAM: e.g. its offset, tag information, CIGAR string, and paired-end information. VCF and BAM file parsing is supported by htslib (Li et al., 2009). Multithreading and piped input/output are supported, allowing levioSAM to fit into existing workflows without creating a new computational bottleneck.

Results
We evaluated the computational efficiency of levioSAM when lifting human alignments from a major-allele version of GRCh38 to the standard GRCh38 reference. We built the major-allele reference based on the 1000 Genomes Project callset (Lowy-Gallego et al., 2019), giving a linear reference with 1,746,180 single-nucleotide variants and 252,781 indels with respect to GRCh38. We randomly sampled 10 million 150-bp Illumina reads from ERR3239334, a whole-genome sequencing dataset derived from the NA12878 individual. We used Bowtie 2 (Langmead and Salzberg, 2012) with default parameters to align sequencing reads to the major-allele GRCh38 reference.
We found that levioSAM used only a fraction of the computational resources used by Bowtie 2 during alignment (Figure 2a). The mean CPU time for levioSAM is 12.0% compared to Bowtie 2 (83.6 vs. 699.3 ms per read) for single-end reads, and 12.9% (87.0 vs. 673.8 ms per read) for paired-end reads. LevioSAM used less than 13 MB memory in both cases, a fraction of Bowtie 2's 3GB footprint. LevioSAM scaled efficiently with added threads (Figure 2b  (2) This value is used as input to a rank 0 query w.r.t bv1 to obtain the corresponding position on the top sequence. 25.0 times faster than when using 1 thread and when lifting unpaired alignments, and 23.9 times faster for paired-end alignments. LevioSAM's peak memory footprint was 54% higher for 32 threads compared to a single thread, but peak memory never exceeded 16MB.
We then compared the run-time and memory of levioSAM to that of CrossMap, which also supports BAM/SAM lift-over, using the same set of reads. CrossMap requires a "chain" file rather than a VCF, so we first converted the major-allele VCF to a chain file. LevioSAM and CrossMap take a similar amount of time using a single thread (896 and 931 seconds respectively), but CrossMap lacks support for multithreading, making it a likely bottleneck. The tools produced identical lift-over coordinates for each alignment, though there were some discrepancies in the MAPQ (0.11% alignments differed) and CIGAR strings (0.12% of alignments differed) -though every CIGAR string was still valid. The MAPQ discrepancies tended to occur in discordant alignments. Overall, levioSAM's memory efficiency, speed, and thread scaling allow it to fit in a typical read alignment workflow without creating a new bottleneck downstream of read alignment. These experiments were performed on a computer with a 2.2 Ghz Intel Xeon CPU (E5-2650 v4) and 515GB memory.

Discussion
LevioSAM is an alignment lifting tool that works with SAM/BAM and VCF files and makes it easier to implement efficient workflows for read alignment to one or more customized reference genomes. LevioSAM is based on succinct data structures and scales well with multiple threads, making it computationally efficient to be integrated in alignment pipelines. We believe this tool will further benefit methods based on augment reference genomes to reduce reference bias (Chen et al., 2020). While levioSAM currently assumes that differences between the source and target references consist only of mismatches and indels, it will be important to support larger differences such as inversions or rearrangements in the future. While this is not possible in general, we expect that many cases can be handled by e.g. splitting reads into segments during the lifting process.