Abstract
Over the last two decades, the human reference genome has undergone multiple updates as the field works toward a complete linear representation of our genome. Two versions of the human reference are currently used in the biomedical literature, GRCh37/hg19 and GRCh38, and conversion between them is critical for quality control, imputation, and association analysis. In the present study, we show that genomic coordinates for single nucleotide variants (SNVs) in regions inverted between builds of the reference genome are erroneously converted by the TOPMed imputation server. Depending on the array type, we estimate that these inappropriate conversions of variant coordinates occur in about 2-5 Mb of the genome. Errors for palindromic variants in these inverted regions cannot be detected by standard quality control procedures, and they destabilize the local haplotype structure, leading to loss of imputation accuracy and of power in association analyses. Though only a small proportion of the genome is affected, we show that these regions include important disease susceptibility variants that would be lost due to poor imputation. For example, we show that a known locus associated with prostate cancer on chr10 would have its association P-value weaken from 2.86×10⁻⁷ to 0.0011 in a case-control analysis of 20,286 Africans and African Americans (10,643 cases and 9,643 controls). We propose and publicly release on GitHub a straightforward heuristic, triple-liftOver, that can easily detect and correct these variants in the regions inverted between genome builds to locally improve imputation accuracy.
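Why palindromic variants escape allele-based quality control can be illustrated with a minimal sketch. It is not the triple-liftOver heuristic itself, and the function names are hypothetical, introduced only for illustration. It assumes that a variant in an inverted region should have its alleles complemented after conversion, and that a quality control step compares the resulting allele pair against a reference panel.

```python
# Illustrative sketch only; NOT the authors' triple-liftOver implementation.
# All function names here are hypothetical.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def is_palindromic(a1, a2):
    """True if the two SNV alleles are complements of each other (A/T or C/G)."""
    return COMPLEMENT[a1.upper()] == a2.upper()


def flip_strand(a1, a2):
    """Return the allele pair as read on the opposite strand."""
    return COMPLEMENT[a1.upper()], COMPLEMENT[a2.upper()]


def strand_flip_is_detectable(a1, a2):
    """A missed strand flip can be seen by comparing allele sets only
    when the flipped pair differs from the original pair."""
    return set(flip_strand(a1, a2)) != {a1.upper(), a2.upper()}


if __name__ == "__main__":
    for alleles in [("A", "G"), ("A", "T"), ("C", "G")]:
        print(alleles, is_palindromic(*alleles), strand_flip_is_detectable(*alleles))
```

For an A/G variant, the opposite strand reads T/C, so a missed flip produces an allele mismatch that can be flagged. For A/T or C/G variants, the allele set is unchanged by the flip, so the error passes unnoticed while the allele labels are effectively swapped.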
Competing Interest Statement
The authors have declared no competing interest.