RT Journal Article SR Electronic T1 AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes JF bioRxiv FD Cold Spring Harbor Laboratory SP 2021.02.16.431517 DO 10.1101/2021.02.16.431517 A1 Jeremie S. Kim A1 Can Firtina A1 Meryem Banu Cavlak A1 Damla Senol Cali A1 Nastaran Hajinazar A1 Mohammed Alser A1 Can Alkan A1 Onur Mutlu YR 2021 UL http://biorxiv.org/content/early/2021/02/17/2021.02.16.431517.abstract AB As genome sequencing tools and techniques improve, researchers are able to incrementally assemble more accurate reference genomes, which enable sensitivity in read mapping and downstream analysis such as variant calling. A more sensitive downstream analysis is critical for a better understanding of the genome donor (e.g., health characteristics). Therefore, read sets from sequenced samples should ideally be mapped to the latest available reference genome that represents the most relevant population. Unfortunately, the increasingly large amount of available genomic data makes it prohibitively expensive to fully re-map each read set to its respective reference genome every time the reference is updated. There are several tools that attempt to accelerate the process of updating a read data set from one reference to another (i.e., remapping) by 1) identifying regions that appear similarly between two references and 2) updating the mapping location of reads that map to any of the identified regions in the old reference to the corresponding similar region in the new reference. The main drawback of existing approaches is that if a read maps to a region in the old reference that does not appear with a reasonable degree of similarity in the new reference, the read cannot be remapped. We find that, as a result of this drawback, a significant portion of annotations (i.e., coding regions in a genome) are lost when using state-of-the-art remapping tools. To address this major limitation in existing tools, we propose AirLift, a fast and comprehensive technique for remapping alignments from one genome to another. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces 1) the number of reads (out of the entire read set) that need to be fully mapped to the new reference by up to 99.99% and 2) the overall execution time to remap read sets between two reference genome versions by 6.7×, 6.6×, and 2.8× for large (human), medium (C. elegans), and small (yeast) reference genomes, respectively. We validate our remapping results with GATK and find that AirLift provides similar accuracy in identifying ground truth SNP and INDEL variants as the baseline of fully mapping a read set.Code Availability AirLift source code and readme describing how to reproduce our results are available at https://github.com/CMU-SAFARI/AirLift.Competing Interest StatementThe authors have declared no competing interest.