PT - JOURNAL ARTICLE
AU - Yue Wu
AU - Anna Yaschenko
AU - Mohammadreza Hajy Heydary
AU - Sriram Sankararaman
TI - Fast estimation of genetic correlation for Biobank-scale data
AID - 10.1101/525055
DP - 2019 Jan 01
TA - bioRxiv
PG - 525055
4099 - http://biorxiv.org/content/early/2019/01/20/525055.short
4100 - http://biorxiv.org/content/early/2019/01/20/525055.full
AB - Genetic correlation, i.e., the proportion of phenotypic correlation across a pair of traits that can be explained by genetic variation, is an important parameter in efforts to understand the relationships among complex traits. The observation of substantial genetic correlation across a pair of traits can provide insights into shared genetic pathways as well as a starting point to investigate causal relationships. Attempts to estimate genetic correlations among complex phenotypes attributable to genome-wide SNP variation data have motivated the analysis of large datasets as well as the development of sophisticated methods. Bi-variate Linear Mixed Models (LMMs) have emerged as a key tool to estimate genetic correlation from datasets where individual genotypes and traits are measured. The bi-variate LMM jointly models the effect sizes of a given SNP on each of the pair of traits being analyzed. The parameters of the bi-variate LMM, i.e., the variance components, are related to the heritability of each trait as well as correlation across traits attributable to genotyped SNPs. However, inference in bi-variate LMMs, typically achieved by maximizing the likelihood, poses serious computational challenges. We propose RG-Cor, a scalable randomized Method-of-Moments (MoM) estimator of genetic correlations in bi-variate LMMs. RG-Cor leverages the structure of genotype data to obtain runtimes that scale sub-linearly with the number of individuals in the input dataset (assuming the number of SNPs is held constant). We perform extensive simulations to validate the accuracy and scalability of RG-Cor. RG-Cor can compute the genetic correlations on the UK Biobank dataset consisting of 430,000 individuals and 460,000 SNPs in 3 hours on a stand-alone compute machine.