Abstract
Genetic correlation, i.e., the proportion of phenotypic correlation across a pair of traits that can be explained by genetic variation, is an important parameter in efforts to understand the relationships among complex traits. The observation of substantial genetic correlation across a pair of traits can provide insights into shared genetic pathways as well as a starting point to investigate causal relationships. Attempts to estimate genetic correlations among complex phenotypes attributable to genome-wide SNP variation have motivated the analysis of large datasets as well as the development of sophisticated methods.
Bi-variate Linear Mixed Models (LMMs) have emerged as a key tool to estimate genetic correlation from datasets where individual genotypes and traits are measured. The bi-variate LMM jointly models the effect sizes of a given SNP on each of the pair of traits being analyzed. The parameters of the bi-variate LMM, i.e., the variance components, are related to the heritability of each trait as well as correlation across traits attributable to genotyped SNPs. However, inference in bi-variate LMMs, typically achieved by maximizing the likelihood, poses serious computational challenges.
We propose RG-Cor, a scalable randomized Method-of-Moments (MoM) estimator of genetic correlations in bi-variate LMMs. RG-Cor leverages the structure of genotype data to obtain runtimes that scale sub-linearly with the number of individuals in the input dataset (assuming the number of SNPs is held constant). We perform extensive simulations to validate the accuracy and scalability of RG-Cor. RG-Cor can compute the genetic correlations on the UK Biobank dataset consisting of 430,000 individuals and 460,000 SNPs in 3 hours on a stand-alone compute machine.
1 Introduction
Understanding the shared genetic structure underlying traits and diseases can provide insights into shared disease etiology and can form the starting point to investigate causal relationships among traits [2]. Genetic correlation, i.e., the proportion of phenotypic correlation across a pair of traits that can be explained by genetic variation, is an important parameter in efforts to quantify the relationships among complex traits, as it can provide insights into biological pathways that are shared among the pair of traits. For example, significant genetic correlation between body mass index (BMI) and lymphocyte count has been used to conclude that lymphocytes are relevant to body weight regulation [1]. Similarly, a number of studies have reported a high genetic correlation between schizophrenia and bipolar disorder [6, 2].
While estimation of genetic correlation has traditionally relied on family studies, the availability of genome-wide genetic data has led to a number of approaches to estimate genetic correlation from these datasets. Bi-variate Linear Mixed Models (LMMs) have emerged as a key statistical model for this problem [13]. The bi-variate LMM jointly models the effect sizes of a given SNP on each of the pair of traits being analyzed. The parameters of the bi-variate LMM, i.e., the variance components, are related to the heritability of each trait as well as the genetic correlation across the traits.
The most commonly used method for estimating genetic correlation as well as trait heritabilities in a bi-variate LMM relies on the restricted maximum likelihood method, termed genomic restricted maximum likelihood (GREML) [5, 3, 8, 7]. However, GREML poses a serious computational burden: it requires solving a non-convex optimization problem with an iterative optimization algorithm. While a number of methods have been proposed to improve the computational efficiency of GREML [5], current GREML methods are still computationally expensive when applied to large-scale datasets such as the UK Biobank, which contains genotypes from around half a million individuals at a million SNPs [11].
Another state-of-the-art method, LD-score regression (LDSC), requires only summary statistics from genome-wide association studies (GWAS) to estimate genetic correlations [2]. LDSC is appealing as it does not require individual-level data, thereby mitigating the privacy concerns that arise from sharing such data. Further, LDSC often has substantially reduced computational requirements (assuming that the summary statistics have been computed). Nevertheless, LDSC has some drawbacks: its estimates tend to have large standard errors and are prone to bias in settings where there is a mismatch between the samples used to estimate summary statistics and the reference datasets that are used to estimate LD [9].
1.1 Our Contribution
We propose RG-Cor, a randomized algorithm to estimate genetic correlations of traits using individual-level genotypes that can scale to the dataset sizes typical of the UK Biobank. This method for estimating genetic correlation builds upon our randomized estimator of heritability [12]. RG-Cor is a randomized Method-of-Moments (MoM) estimator of the heritability of traits as well as the genetic correlation between pairs of traits. MoM estimators tend to be less statistically efficient than GREML. Despite this statistical inefficiency, the MoM estimator leads to closed-form solutions for the heritability and genetic correlation parameters. However, the main computational bottleneck of the MoM estimator in genetic correlation estimation is the computation of the N × N genetic relationship matrix, which captures the relationships between all pairs of N individuals in the dataset.
For genetic correlation estimation, our randomized MoM estimator (RG-Cor) relies on the observation that the key computational bottleneck can be replaced by multiplying the N × M (individuals × SNPs) genotype matrix with a small number, B, of random vectors, thereby obtaining a time complexity of $O(NMB)$. We can gain further efficiency by leveraging the structure of the genotype matrix, whose entries all lie in the finite set {0, 1, 2}, so that the time complexity can be reduced to $O\left(\frac{NMB}{\max(\log_3 N, \log_3 M)}\right)$.
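As a small illustration of why the raw {0, 1, 2} matrix is sufficient, the product of the standardized genotype matrix with a set of random vectors can be computed from G alone, so the standardized matrix never needs to be stored. The sketch below shows only this algebraic step; the function name and the toy data are hypothetical, and it does not include the fast multiplication routine for discrete matrices that yields the reduced complexity.

```python
import numpy as np

def standardized_products(G, Z):
    """Return X^T Z for standardized X, using only the raw genotype matrix G.

    Column m of X is (G[:, m] - mu_m) / sd_m, hence
    X^T Z = diag(1 / sd) (G^T Z - mu (1^T Z)).
    """
    G = np.asarray(G)
    Z = np.asarray(Z, dtype=float)
    mu = G.mean(axis=0)
    sd = G.std(axis=0)            # population SD, so each column of X has squared norm N
    col_sums = Z.sum(axis=0)      # 1^T Z, one value per random vector
    return (G.T @ Z - np.outer(mu, col_sums)) / sd[:, None]

rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(100, 20))   # toy genotypes, for illustration only
G = G[:, G.std(axis=0) > 0]                # drop monomorphic SNPs
Z = rng.standard_normal((G.shape[0], 5))   # B = 5 random vectors

X = (G - G.mean(axis=0)) / G.std(axis=0)   # explicit standardization, used only as a check
fast = standardized_products(G, Z)
```

The only product involving genotypes is `G.T @ Z`, which is the operation a specialized routine for matrices over {0, 1, 2} would accelerate.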
We apply RG-Cor to pairs of traits to estimate their heritabilities and genetic correlation, together with estimates of the standard errors. We show in simulations that RG-Cor yields accurate estimates of genetic correlation. Compared to GREML estimators, we show that the loss in statistical efficiency of RG-Cor is fairly modest. On the other hand, RG-Cor is several orders of magnitude faster than other methods. Finally, we applied RG-Cor to compute the genetic correlation of selected pairs of traits in 291,273 white British individuals in the UK Biobank.
2 Methods
2.1 Model assuming complete overlap of samples across traits
For simplicity, we first assume that we observe two traits measured on the same set of N samples. We observe genotypes across these N individuals at M SNPs. The genotype vector for individual n is a length-M vector denoted by $g_n \in \{0, 1, 2\}^M$. The mth entry of $g_n$ denotes the number of minor alleles carried by individual n at SNP m. Let G be the N × M matrix of genotypes. Let X denote the N × M matrix of standardized genotypes obtained by centering and scaling each column of G so that $\sum_{n} x_{n,m} = 0$ and $\sum_{n} x_{n,m}^2 = N$ for all m ∈ {1,…, M}. Let $y_1, y_2$ denote two vectors of phenotypes of length N.
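The standardization can be checked numerically. The following is a minimal sketch on toy genotypes (the helper name and the simulated data are assumptions for illustration); it confirms that each standardized column sums to zero, has squared norm N, and that the resulting relatedness matrix $XX^T/M$ has trace N.

```python
import numpy as np

def standardize_genotypes(G):
    """Center and scale each SNP column so it sums to 0 and its squares sum to N."""
    G = np.asarray(G, dtype=float)
    sd = G.std(axis=0)   # population SD (ddof=0) gives sum_n x_{n,m}^2 = N exactly
    if np.any(sd == 0):
        raise ValueError("monomorphic SNPs must be removed before standardizing")
    return (G - G.mean(axis=0)) / sd

rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(200, 50))   # toy genotypes, for illustration only
G = G[:, G.std(axis=0) > 0]
X = standardize_genotypes(G)

N, M = X.shape
K = X @ X.T / M   # N x N relatedness matrix built from standardized genotypes
```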
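We assume the vectors of phenotypes $y_1, y_2$ are related to the genotypes X by a bivariate linear mixed model:

$$y_1 = X\beta_1 + \epsilon_1, \qquad y_2 = X\beta_2 + \epsilon_2 \tag{1}$$

where the pairs $(\beta_{1,m}, \beta_{2,m})$ are independent across SNPs and the pairs $(\epsilon_{1,n}, \epsilon_{2,n})$ are independent across individuals, each with mean zero and covariances

$$\mathrm{cov}\begin{pmatrix}\beta_{1,m}\\ \beta_{2,m}\end{pmatrix} = \frac{1}{M}\begin{pmatrix}\sigma_{g_1}^2 & \gamma_g\\ \gamma_g & \sigma_{g_2}^2\end{pmatrix}, \qquad \mathrm{cov}\begin{pmatrix}\epsilon_{1,n}\\ \epsilon_{2,n}\end{pmatrix} = \begin{pmatrix}\sigma_{e_1}^2 & \gamma_e\\ \gamma_e & \sigma_{e_2}^2\end{pmatrix}$$

Here $\beta_1, \beta_2$ denote the M-vectors of SNP effect sizes, i.e., $\beta_{1,m}$ denotes the mean change in phenotype 1 when the genotype at SNP m changes from 0 to 1 or from 1 to 2.
Here phenotypes $y_1, y_2$ are centered so that $\sum_n y_{1,n} = \sum_n y_{2,n} = 0$.
$\sigma_{g_1}^2$ denotes the genetic variance of trait 1, i.e., the variance component of phenotype 1 corresponding to the vector of genotypes across M SNPs, while $\sigma_{g_2}^2$ is the genetic variance for phenotype 2.
$\sigma_{e_1}^2, \sigma_{e_2}^2$ denote the residual variances (variance not explained by genetics) for each of the two traits.
$\gamma_g$ and $\gamma_e$ denote the genetic and residual covariances. We define the genetic correlation as $\rho_g = \gamma_g / \sqrt{\sigma_{g_1}^2 \sigma_{g_2}^2}$. Let $y = [y_1^T, y_2^T]^T$, $\epsilon = [\epsilon_1^T, \epsilon_2^T]^T$, and $\beta = [\beta_1^T, \beta_2^T]^T$. Thus we have:

$$y = \begin{pmatrix} X & 0\\ 0 & X\end{pmatrix}\beta + \epsilon$$

We have $E[y] = 0$ since the phenotypes are centered. The population covariance of the two phenotypes is given by:

$$\mathrm{cov}(y) = E[yy^T] = \begin{pmatrix}\sigma_{g_1}^2 K + \sigma_{e_1}^2 I_N & \gamma_g K + \gamma_e I_N\\ \gamma_g K + \gamma_e I_N & \sigma_{g_2}^2 K + \sigma_{e_2}^2 I_N\end{pmatrix}$$

Here $K = \frac{XX^T}{M}$ is the genetic relatedness matrix (GRM).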
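We aim to jointly estimate the genetic and residual variance as well as covariance parameters. Our approach to estimating both the variance components and the genetic correlation relies on a Method-of-Moments (MoM) estimator obtained by equating the population covariance to the empirical covariance. The empirical covariance of the concatenated phenotype vector y is estimated by the sample covariance $yy^T$. The MoM estimator is obtained by solving the following ordinary least squares problem:

$$(\hat\sigma_{g_1}^2, \hat\sigma_{g_2}^2, \hat\sigma_{e_1}^2, \hat\sigma_{e_2}^2, \hat\gamma_g, \hat\gamma_e) = \arg\min \left\| yy^T - \mathrm{cov}(y) \right\|_F^2$$

where $\mathrm{cov}(y)$ is the population covariance of the concatenated phenotype vector, viewed as a function of the parameters. Setting the gradient of the objective function to zero gives us the normal equations (see Supplementary Material). We observe that solving for the genetic and residual covariance parameters $(\gamma_g, \gamma_e)$ is independent of solving for the variance parameters $(\sigma_{g_1}^2, \sigma_{e_1}^2, \sigma_{g_2}^2, \sigma_{e_2}^2)$. Thus, MoM estimates of the covariance parameters can be obtained by solving the set of normal equations:

$$\begin{pmatrix}\mathrm{tr}(K^2) & \mathrm{tr}(K)\\ \mathrm{tr}(K) & N\end{pmatrix}\begin{pmatrix}\hat\gamma_g\\ \hat\gamma_e\end{pmatrix} = \begin{pmatrix} y_1^T K y_2\\ y_1^T y_2\end{pmatrix}$$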
The GRM K can be computed in time $O(MN^2)$ and requires $O(N^2)$ memory. Given the GRM, computing each of the coefficients for the normal equations requires $O(N^2)$ time.
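Given each of the coefficients, we can solve analytically for $\hat\gamma_g$ and $\hat\gamma_e$. Indeed, we can write:

$$\hat\gamma_g = \frac{y_1^T K y_2 - y_1^T y_2}{\mathrm{tr}(K^2) - N}, \qquad \hat\gamma_e = \frac{y_1^T y_2}{N} - \hat\gamma_g$$

Here we have used the property that tr(K) = N due to the use of a standardized genotype matrix.
Finally, we use estimates of the genetic variance parameters to obtain a plug-in estimate of the genetic correlation:

$$\hat\rho_g = \frac{\hat\gamma_g}{\sqrt{\hat\sigma_{g_1}^2 \hat\sigma_{g_2}^2}}$$

The estimators for $\sigma_{g_1}^2$ and $\sigma_{g_2}^2$ are given by

$$\hat\sigma_{g_1}^2 = \frac{y_1^T K y_1 - y_1^T y_1}{\mathrm{tr}(K^2) - N}$$

and

$$\hat\sigma_{g_2}^2 = \frac{y_2^T K y_2 - y_2^T y_2}{\mathrm{tr}(K^2) - N}$$

(see Supplementary Material).
Substituting these expressions for the genetic covariance and variances gives us the following estimator of genetic correlation:

$$\hat\rho_g = \frac{y_1^T K y_2 - y_1^T y_2}{\sqrt{\left(y_1^T K y_1 - y_1^T y_1\right)\left(y_2^T K y_2 - y_2^T y_2\right)}}$$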
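The closed-form MoM estimates for the complete-overlap setting can be sketched with an exact GRM as follows. This is a minimal illustration under the assumptions that X is standardized (so that tr(K) = N) and the phenotypes are centered; the function name and toy data are hypothetical, and forming K explicitly is only practical for small N.

```python
import numpy as np

def mom_complete_overlap(X, y1, y2):
    """Closed-form MoM estimates; X standardized, y1 and y2 centered."""
    N, M = X.shape
    K = X @ X.T / M            # exact GRM: O(M N^2) time, O(N^2) memory
    tr_K2 = np.sum(K * K)      # tr(K^2) for symmetric K
    denom = tr_K2 - N          # uses tr(K) = N

    def numer(a, b):
        return a @ K @ b - a @ b

    gamma_g = numer(y1, y2) / denom
    gamma_e = y1 @ y2 / N - gamma_g
    var_g1 = numer(y1, y1) / denom
    var_g2 = numer(y2, y2) / denom
    # The plug-in correlation is undefined when a variance estimate is not positive.
    if min(var_g1, var_g2) > 0:
        rho_g = gamma_g / np.sqrt(var_g1 * var_g2)
    else:
        rho_g = float("nan")
    return {"gamma_g": gamma_g, "gamma_e": gamma_e, "var_g1": var_g1,
            "var_g2": var_g2, "rho_g": rho_g, "tr_K2": tr_K2}

rng = np.random.default_rng(1)
G = rng.binomial(2, 0.4, size=(150, 300))   # toy genotypes, for illustration only
G = G[:, G.std(axis=0) > 0]
X = (G - G.mean(axis=0)) / G.std(axis=0)
y1 = rng.standard_normal(150)
y1 -= y1.mean()
y2 = rng.standard_normal(150)
y2 -= y2.mean()
est = mom_complete_overlap(X, y1, y2)
```

MoM variance estimates can be negative in finite samples, which is why the sketch returns NaN for the correlation in that case rather than taking the square root of a negative number.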
2.2 Model assuming partial overlap of samples across traits
We now generalize our approach to the setting where the traits are no longer observed on the same samples. Assume we have $N_1$ samples for trait 1 and $N_2$ samples for trait 2, of which N samples ($N \le N_1$, $N \le N_2$) contain measurements for both traits. $G_1$ and $G_2$ denote the genotype matrices for the samples of each trait, and we assume that all samples are genotyped at the same set of SNPs. We define $X_1, X_2$ to be the $N_1 \times M$ and $N_2 \times M$ matrices of standardized genotypes obtained by centering and scaling each column of $G_1$ and $G_2$ so that $\sum_n x_{a,n,m} = 0$ and $\sum_n x_{a,n,m}^2 = N_a$ for all m ∈ {1,…, M}, a ∈ {1, 2}. Let $y_1, y_2$ denote the two vectors of phenotypes, of size $N_1$ and $N_2$ respectively. Additionally, we define an $N_1 \times N_2$ indicator matrix C, where $C_{i,j} = 1$ when individual i among the trait-1 samples and individual j among the trait-2 samples refer to the same sample, and 0 otherwise. We also define $\beta_1, \beta_2$ to be the M-vectors of SNP effect sizes.
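We assume the two phenotypes $y_1, y_2$ are related to the genotypes by the following bivariate linear mixed model:

$$y_1 = X_1\beta_1 + \epsilon_1, \qquad y_2 = X_2\beta_2 + \epsilon_2$$

The population covariance of the phenotypes is now:

$$\mathrm{cov}(y) = \begin{pmatrix}\sigma_{g_1}^2 K_1 + \sigma_{e_1}^2 I_{N_1} & \gamma_g K_A + \gamma_e C\\ \gamma_g K_A^T + \gamma_e C^T & \sigma_{g_2}^2 K_2 + \sigma_{e_2}^2 I_{N_2}\end{pmatrix}$$

Here $K_1 = \frac{X_1 X_1^T}{M}$ is the GRM for the samples observed for the first trait while $K_2 = \frac{X_2 X_2^T}{M}$ is the GRM for the samples for the second trait. $K_A$ is the GRM for all pairs of samples across the two traits: $K_A = \frac{X_1 X_2^T}{M}$.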
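As before, the MoM estimator is obtained by equating the population covariance to the empirical covariance, estimated by $yy^T$, that is, by solving the following ordinary least squares problem:

$$\arg\min \left\| yy^T - \mathrm{cov}(y) \right\|_F^2$$

where the minimization is over the variance and covariance parameters. Define $K_C$ to be the GRM for samples with measurements of both phenotypes, while $y_{1C}$ and $y_{2C}$ denote the N-vectors of phenotypes for traits 1 and 2 on these samples. The MoM estimator for genetic covariance satisfies the set of normal equations:

$$\begin{pmatrix}\mathrm{tr}(K_A K_A^T) & \mathrm{tr}(K_C)\\ \mathrm{tr}(K_C) & N\end{pmatrix}\begin{pmatrix}\hat\gamma_g\\ \hat\gamma_e\end{pmatrix} = \begin{pmatrix} y_1^T K_A y_2\\ y_{1C}^T y_{2C}\end{pmatrix} \tag{8}$$

Finally, given each of the coefficients, we can solve analytically for $\hat\gamma_g$ and $\hat\gamma_e$.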
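Finally, the estimate of genetic correlation is given by the plug-in estimate:

$$\hat\rho_g = \frac{\hat\gamma_g}{\sqrt{\hat\sigma_{g_1}^2 \hat\sigma_{g_2}^2}}$$

The estimators for $\sigma_{g_1}^2$ and $\sigma_{g_2}^2$ require computing $\mathrm{tr}(K_1^2)$ and $\mathrm{tr}(K_2^2)$ (see Supplementary Materials).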
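For the partial-overlap setting, the 2 × 2 system for the genetic and residual covariances can be sketched directly from the least squares fit of $y_1 y_2^T$ onto the cross-GRM $X_1 X_2^T / M$ and the indicator matrix C. The code below is a small exact illustration with hypothetical function and argument names (sample identifiers are assumed to be available as arrays); it forms all matrices explicitly and is therefore only suitable for toy sizes.

```python
import numpy as np

def mom_covariance_partial_overlap(X1, X2, y1, y2, ids1, ids2):
    """Solve the normal equations for (gamma_g, gamma_e) with exact matrices."""
    M = X1.shape[1]
    K_A = X1 @ X2.T / M                      # N1 x N2 cross-GRM
    # C[i, j] = 1 when sample i of trait 1 and sample j of trait 2 are the same person.
    C = (np.asarray(ids1)[:, None] == np.asarray(ids2)[None, :]).astype(float)
    if C.sum() == 0:
        raise ValueError("no shared samples: gamma_e is not identifiable")
    lhs = np.array([[np.sum(K_A * K_A), np.sum(K_A * C)],
                    [np.sum(K_A * C), np.sum(C * C)]])
    rhs = np.array([y1 @ K_A @ y2, y1 @ C @ y2])
    gamma_g, gamma_e = np.linalg.solve(lhs, rhs)
    return gamma_g, gamma_e

def standardize(G):
    return (G - G.mean(axis=0)) / G.std(axis=0)

rng = np.random.default_rng(2)
G = rng.binomial(2, 0.3, size=(100, 200))    # toy genotypes, for illustration only
idx1, idx2 = np.arange(0, 80), np.arange(30, 100)   # 50 shared samples
keep = (G[idx1].std(axis=0) > 0) & (G[idx2].std(axis=0) > 0)
X1, X2 = standardize(G[idx1][:, keep]), standardize(G[idx2][:, keep])
y1 = rng.standard_normal(80)
y1 -= y1.mean()
y2 = rng.standard_normal(70)
y2 -= y2.mean()
gamma_g, gamma_e = mom_covariance_partial_overlap(X1, X2, y1, y2, idx1, idx2)
```

Here `np.sum(K_A * C)` picks out the cross-GRM entries of the shared samples, `np.sum(C * C)` equals the number of shared samples, and `y1 @ C @ y2` is the inner product of the two phenotypes over the shared samples.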
2.3 RG-Cor: Randomized MoM estimator for Genetic Correlation
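The computational bottleneck in obtaining MoM estimators of the genetic variance and covariance parameters lies in computing $\mathrm{tr}(K^2)$ for the setting of complete overlap of samples, and in computing $\mathrm{tr}(K_A K_A^T)$, $\mathrm{tr}(K_1^2)$, and $\mathrm{tr}(K_2^2)$ for partially overlapping samples.
Naive computation of $\mathrm{tr}(K_A K_A^T)$ requires $O(N_1 N_2 M)$ operations, where $N_1, N_2$ are the sample sizes of the two traits. Similarly, $\mathrm{tr}(K_1^2)$ and $\mathrm{tr}(K_2^2)$ can be computed in $O(N_1^2 M)$ and $O(N_2^2 M)$ time.
To overcome this computational bottleneck, we replace these quantities with randomized estimators $L_B^A$, $L_B^1$, and $L_B^2$ of $\mathrm{tr}(K_A K_A^T)$, $\mathrm{tr}(K_1^2)$, and $\mathrm{tr}(K_2^2)$, respectively.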
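Given an N × N matrix A and a random vector z with mean zero and covariance $I_N$, we use the following identity to construct the randomized estimators [4]:

$$E[z^T A z] = \mathrm{tr}(A) \tag{10}$$

Equation 10 leads to the following unbiased estimators of $\mathrm{tr}(K_A K_A^T)$, $\mathrm{tr}(K_1^2)$, and $\mathrm{tr}(K_2^2)$ given B random vectors $z_1, \ldots, z_B$ drawn independently from a distribution with zero mean and identity covariance matrix:

$$L_B^A = \frac{1}{B}\sum_{b=1}^{B} z_b^T K_A K_A^T z_b = \frac{1}{BM^2}\sum_{b=1}^{B} \left\| X_2 X_1^T z_b \right\|_2^2$$

$$L_B^1 = \frac{1}{B}\sum_{b=1}^{B} z_b^T K_1^2 z_b = \frac{1}{BM^2}\sum_{b=1}^{B} \left\| X_1 X_1^T z_b \right\|_2^2, \qquad L_B^2 = \frac{1}{B}\sum_{b=1}^{B} z_b^T K_2^2 z_b = \frac{1}{BM^2}\sum_{b=1}^{B} \left\| X_2 X_2^T z_b \right\|_2^2$$

In each case $z_b$ has the dimension of the matrix whose trace is being estimated ($N_1$ for $L_B^A$ and $L_B^1$, $N_2$ for $L_B^2$).
In practice, we compute the above estimators by drawing each entry of $z_b$ independently from a standard normal distribution.
The RG-Cor estimator is obtained by solving Equation 8 after replacing $\mathrm{tr}(K_A K_A^T)$ with $L_B^A$.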
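The randomized trace estimator can be sketched in a few lines. The version below is an illustration under the stated model rather than the RG-Cor implementation: it estimates the trace of the cross-GRM product using only products of the standardized genotype matrices with B standard normal vectors, so the $N_1 \times N_2$ matrix is never formed. Function names and toy data are hypothetical.

```python
import numpy as np

def randomized_trace(X1, X2, B=10, seed=None):
    """Estimate tr(K_A K_A^T) with K_A = X1 X2^T / M, without forming K_A.

    Passing X1 = X2 = X gives an estimate of tr(K^2).
    """
    rng = np.random.default_rng(seed)
    N1, M = X1.shape
    Z = rng.standard_normal((N1, B))   # columns are the random vectors z_b
    # z^T K_A K_A^T z = ||K_A^T z||^2 with K_A^T z = X2 (X1^T z) / M.
    U = X2 @ (X1.T @ Z) / M            # cost O((N1 + N2) M B)
    return np.sum(U * U) / B

def standardize(G):
    return (G - G.mean(axis=0)) / G.std(axis=0)

rng = np.random.default_rng(3)
G = rng.binomial(2, 0.3, size=(100, 200))   # toy genotypes, for illustration only
keep = (G[:80].std(axis=0) > 0) & (G[30:].std(axis=0) > 0)
X1, X2 = standardize(G[:80][:, keep]), standardize(G[30:][:, keep])

K_A = X1 @ X2.T / X1.shape[1]
exact = np.sum(K_A * K_A)              # tr(K_A K_A^T), feasible only for small N
approx = randomized_trace(X1, X2, B=2000, seed=0)
```

A large B is used here only so that the toy comparison with the exact trace is tight; the simulations in this paper use B = 10 and B = 100.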
2.4 Inclusion of covariates
In a number of settings, it is desirable to include covariates, such as age, sex, or principal components to correct for population structure, in the analysis. In the complete overlap setting, the samples share the covariates. This changes Equation 1 into:

$$y_1 = W\alpha_1 + X\beta_1 + \epsilon_1, \qquad y_2 = W\alpha_2 + X\beta_2 + \epsilon_2 \tag{11}$$

Here W is an N × C matrix of covariates while $\alpha_1, \alpha_2$ are C-vectors of fixed effects. In this setting, we transform Equation 11 by multiplying both sides by the projection matrix

$$V = I_N - W(W^T W)^{-1} W^T$$

Similarly, for traits that partially share samples, where the covariate matrix is $W = \begin{pmatrix} W_1 & 0\\ 0 & W_2\end{pmatrix}$, with $W_1$ the covariate matrix for trait 1 and $W_2$ the covariate matrix for trait 2, the projection matrix is:

$$V = I_{N_1 + N_2} - W(W^T W)^{-1} W^T$$
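Multiplying by the projection onto the orthogonal complement of the covariates does not require forming an N × N matrix. A minimal sketch, assuming W has full column rank and using hypothetical names and toy covariates, applies the projection through a thin QR factorization:

```python
import numpy as np

def residualize(W, *arrays):
    """Apply V = I - W (W^T W)^{-1} W^T to each array without forming V.

    Assumes W has full column rank, so Q Q^T is the projection onto col(W).
    """
    Q, _ = np.linalg.qr(W)   # thin QR: Q is N x C with orthonormal columns
    return [A - Q @ (Q.T @ A) for A in arrays]

rng = np.random.default_rng(4)
N = 120
age = rng.uniform(40, 70, size=N)            # toy covariates, for illustration only
sex = rng.integers(0, 2, size=N).astype(float)
W = np.column_stack([np.ones(N), age, sex])  # intercept, age, sex
y1 = 0.02 * age + rng.standard_normal(N)
X = rng.standard_normal((N, 30))

y1_res, X_res = residualize(W, y1, X)
```

After projection the trace of the projected GRM is in general no longer N, so closed forms that rely on tr(K) = N would need the trace of the projected matrix instead; this sketch covers only the projection step.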
3 Experiments
3.1 Simulation to assess the accuracy and computational efficiency of RG-Cor
We performed simulations to compare the performance of RG-Cor to other methods for genetic correlation estimation in terms of accuracy, running time and memory usage. Specifically, we compared the performance of RG-Cor to GREML (as implemented in the GCTA software) [5], BOLT-REML [7], and LD-score regression (LDSC) [2]. The GREML software aims to compute maximum likelihood (or restricted maximum likelihood (REML)) estimates of a bi-variate linear mixed model [5]. BOLT-REML is an approximate REML method that can scale to larger problem sizes relative to GREML. LDSC, on the other hand, is widely used to estimate genetic correlation when only summary statistics from GWAS on pairs of traits are available.
Experiments were based on real genotypes from the UK Biobank [11] and the Northern Finland Birth Cohort (NFBC) dataset [10]. We simulated pairs of traits with known values of heritability and genetic correlation. Experiments to assess the estimation accuracy of each method used the full NFBC dataset, containing 315,529 SNPs and 5,326 individuals, so that all the methods could be run in reasonable time. To assess computational efficiency, we compared RG-Cor to BOLT-REML and GREML in terms of running time and memory usage on subsets of the UK Biobank data.
3.2 RG-Cor estimator is accurate
In our first set of simulations, we compared the accuracy of RG-Cor to GREML, BOLT-REML, and LDSC. We evaluated the accuracy of the RG-Cor estimates when the number of random vectors B was set to 10 as well as 100.
For these experiments, we analyzed the Northern Finland Birth Cohort (NFBC) dataset, which contains 5,326 individuals and 315,529 SNPs after removing SNPs with minor allele frequency ≤ 0.05 and with Hardy-Weinberg Equilibrium p-value < 0.01.
Given the genotypes, we simulated a pair of phenotypes based on the complete overlap model specified in Equation 1. We assumed that all SNPs have an effect on each trait (i.e., the trait architecture is infinitesimal). We considered settings where (i) the true heritabilities of both phenotypes are low (set to 0.1 and 0.2 respectively), and (ii) one of the phenotypes has low heritability while the other has high heritability (set to 0.2 and 0.8). Fixing the true heritability of each phenotype, we varied the true genetic correlation across {0, 0.2, 0.5, 0.8}. We repeated each experiment 100 times.
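A simulation of this kind can be sketched as follows. The code draws infinitesimal effect sizes with a chosen genetic correlation and adds residuals; it assumes a total phenotypic variance of 1 for each trait (so the genetic variance equals the heritability) and a residual correlation of 0 by default, neither of which is specified above. The function name and toy genotypes are hypothetical.

```python
import numpy as np

def simulate_trait_pair(X, h2_1, h2_2, rho_g, rho_e=0.0, seed=None):
    """Simulate two centered phenotypes from the complete-overlap bivariate model."""
    rng = np.random.default_rng(seed)
    N, M = X.shape

    def correlated_pairs(n, var1, var2, rho):
        # Two columns with the requested variances and correlation rho.
        z1 = rng.standard_normal(n)
        z2 = rng.standard_normal(n)
        a = np.sqrt(var1) * z1
        b = np.sqrt(var2) * (rho * z1 + np.sqrt(1.0 - rho ** 2) * z2)
        return np.column_stack([a, b])

    beta = correlated_pairs(M, h2_1 / M, h2_2 / M, rho_g)      # every SNP has an effect
    eps = correlated_pairs(N, 1.0 - h2_1, 1.0 - h2_2, rho_e)   # residual terms
    Y = X @ beta + eps
    Y -= Y.mean(axis=0)                                        # center each phenotype
    return Y[:, 0], Y[:, 1]

rng = np.random.default_rng(5)
G = rng.binomial(2, 0.3, size=(300, 400))   # toy genotypes, for illustration only
G = G[:, G.std(axis=0) > 0]
X = (G - G.mean(axis=0)) / G.std(axis=0)

# One replicate of the low-heritability setting with genetic correlation 0.5.
y1, y2 = simulate_trait_pair(X, h2_1=0.1, h2_2=0.2, rho_g=0.5, seed=0)
```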
In Figure 1(a), we show the situation where the true heritabilities of both phenotypes are low and fixed to be 0.1 and 0.2. This is a typical situation since complex phenotypes tend to have low heritability in human populations. In Figure 1(b), the true heritabilities of the two phenotypes are fixed to be 0.2 and 0.8. Table 1 summarizes these results, reporting the bias, standard error, and mean square error (MSE) of the methods for each parameter setting of Figure 1. We observe that the statistical efficiency of the estimates from BOLT-REML (approximate REML) and RG-Cor are comparable. While GREML estimates tend to be the most statistically efficient, as expected, in the low heritability setting RG-Cor achieves a lower standard error than GREML. We also observe that in all cases LDSC attains large standard errors, consistent with previous observations [14]. We conclude that RG-Cor is comparable to BOLT-REML and is particularly useful in cases where the heritability of both traits is low and their genetic correlation is high. Finally, the results are indistinguishable when RG-Cor uses B = 10 versus B = 100.
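The three summary statistics can be computed from the replicate estimates of a single parameter setting as in the sketch below. The exact definitions used for Table 1 are not spelled out here, so the sketch assumes the usual ones (bias as mean estimate minus truth, standard error as the sample standard deviation across replicates, MSE as the mean squared deviation from the truth); the input values are placeholders, not results from this study.

```python
import numpy as np

def summarize_replicates(estimates, truth):
    """Bias, standard error, and mean square error across simulation replicates."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - truth
    se = est.std(ddof=1)               # spread of the estimates across replicates
    mse = np.mean((est - truth) ** 2)
    return bias, se, mse

# Placeholder replicate estimates for a true genetic correlation of 0.5.
bias, se, mse = summarize_replicates([0.4, 0.5, 0.6], truth=0.5)
```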
Figure 1: We compared the accuracy of methods for genetic correlation estimation using simulated phenotypes and genotypes from the Northern Finland Birth Cohort (NFBC). In Figure 1(a), the heritabilities of the two traits are fixed to be 0.1 and 0.2, while in Figure 1(b), the heritabilities of the two traits are 0.2 and 0.8. We vary the genetic correlation across {0, 0.2, 0.5, 0.8}. In some cases where the genetic correlation is high, RG-Cor is statistically efficient relative to. We observe that the standard error of the RG-Cor estimates is relatively insensitive when we change B from 10 to 100.
Table 1: Estimates of bias, mean square error, and standard error of genetic correlation estimation methods in simulations.
3.3 RG-Cor is computationally efficient
In order to measure computational efficiency, we sub-sampled the UK Biobank genotypes to sample sizes of 1,000, 2,000, 5,000, 10,000, 50,000, 100,000, and 290,000, the last of which is approximately the sample size of the UK Biobank dataset after quality control.
Prior to the sub-sampling experiment, we performed the following individual-level and SNP-level quality controls. We constrained the samples to the white British population as indicated by self-reported ethnicity. We removed 14,255 samples with missingness > 0.1. We restricted our analysis to SNPs that were present on the UK Biobank Axiom array used to genotype the UK Biobank. We removed SNPs with greater than 1% missingness and minor allele frequency smaller than 1%. Our final dataset contained 291,273 individuals and 459,792 SNPs after quality control. All experiments were performed on an AMD EPYC machine on which we restricted the run time to 6 days and memory usage to 200 GB.
Figure 2 shows that neither GREML nor BOLT-REML scales to large sample sizes. GREML could not scale to sample sizes of 100,000 or more due to the requirement of computing and operating on a genetic relatedness matrix (GRM). The runtime of BOLT-REML scales as $N^{1.5}$, as reported previously [7]. We observed convergence difficulties while running BOLT-REML on subsets of the UK Biobank data. Based on the observed runtimes, we extrapolate that BOLT-REML would require about 17 days to run on the full UK Biobank dataset with 291,273 samples. On the other hand, RG-Cor ran in about 3 hours and used 50 GB of memory on the set of 291,273 individuals. The memory usage of RG-Cor also scales linearly.
Figure 2: We measured the run time and memory usage of methods for genetic correlation estimation as a function of the number of samples while fixing the number of SNPs to 459,792. The samples were obtained as subsets of unrelated, white British individuals in the UK Biobank. We performed all comparisons on an AMD EPYC machine. In Figure 2(a), GREML could not finish computation on 100,000 samples. BOLT-REML scales well but is nevertheless computationally intensive with increasing sample sizes. RG-Cor runs in a few hours on even the largest dataset. Figure 2(b) shows that both RG-Cor and BOLT-REML have scalable memory requirements.
3.4 Genetic correlation among traits in UK Biobank
We applied RG-Cor to analyze phenotypes in the UK Biobank. We restricted our analysis to SNPs genotyped on the UK Biobank Axiom array, filtering out the genetic markers that had a high missingness rate (> 1%) or low minor allele frequency (< 1%). We also filtered out subjects that had a high missing genotype rate (> 1%), as well as samples that have genetic kinship with any other sample (samples having any relatives in the dataset according to phenotype field 22021: Genetic kinship to other participants). After quality control, we obtained 291,273 unrelated individuals and 459,792 SNPs.
We analyzed seven continuous phenotypes and estimated the genetic correlations on all 21 pairs of phenotypes (Figure 3). All genetic correlation estimates across traits are adjusted for gender, UK Biobank assessment center, age at recruitment, and top 10 principal components of the genotype. We estimated the standard error of the RG-Cor estimator using a computationally efficient block Jackknife (see Supplementary Material).
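The idea of a block jackknife standard error can be sketched as follows. The computationally efficient version used for RG-Cor is described in the Supplementary Material and is not reproduced here; this illustration assumes contiguous, equally sized blocks of SNPs and naively recomputes the estimate with each block left out. All names and the toy data are hypothetical.

```python
import numpy as np

def jackknife_se(leave_one_out):
    """Delete-one-block jackknife standard error from leave-one-block-out estimates."""
    t = np.asarray(leave_one_out, dtype=float)
    J = t.size
    return np.sqrt((J - 1) / J * np.sum((t - t.mean()) ** 2))

def snp_block_jackknife(estimator, X, y1, y2, n_blocks=10):
    """Recompute estimator(X, y1, y2) with each contiguous block of SNPs removed."""
    M = X.shape[1]
    blocks = np.array_split(np.arange(M), n_blocks)
    loo = []
    for block in blocks:
        keep = np.ones(M, dtype=bool)
        keep[block] = False
        loo.append(estimator(X[:, keep], y1, y2))
    return jackknife_se(loo)

def gamma_g_hat(X, y1, y2):
    # MoM genetic covariance with an exact GRM (toy sizes only).
    K = X @ X.T / X.shape[1]
    return (y1 @ K @ y2 - y1 @ y2) / (np.sum(K * K) - np.trace(K))

rng = np.random.default_rng(6)
G = rng.binomial(2, 0.3, size=(120, 200))   # toy genotypes, for illustration only
G = G[:, G.std(axis=0) > 0]
X = (G - G.mean(axis=0)) / G.std(axis=0)
y1 = rng.standard_normal(120)
y1 -= y1.mean()
y2 = rng.standard_normal(120)
y2 -= y2.mean()
se = snp_block_jackknife(gamma_g_hat, X, y1, y2, n_blocks=10)
```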
Figure 3: Blue indicates negative genetic correlation and red indicates positive genetic correlation. Genetic correlations that are significantly different from 0 are marked with an asterisk. Genetic correlations that are significantly different from 0 after Bonferroni correction for the 21 tests in this analysis are marked with two asterisks. Genetic correlations between traits are computed after correcting for covariates: gender, UK Biobank assessment center, age at recruitment, and the top 10 genotype principal components.
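The marking scheme can be illustrated with a short sketch. It assumes a two-sided normal (Wald) test based on the estimate and its jackknife standard error, and a nominal level of 0.05; neither choice is stated above, and the numbers passed to the function are placeholders rather than estimates from this study.

```python
import math
from itertools import combinations

def two_sided_p(estimate, se):
    """Two-sided p-value for H0: genetic correlation = 0 under a normal approximation."""
    z = estimate / se
    return math.erfc(abs(z) / math.sqrt(2.0))

n_traits = 7
pairs = list(combinations(range(n_traits), 2))   # all unordered pairs of traits
alpha = 0.05                                     # assumed nominal level
bonferroni_alpha = alpha / len(pairs)            # corrected for the number of pairs

def significance_marker(estimate, se):
    p = two_sided_p(estimate, se)
    if p < bonferroni_alpha:
        return "**"
    if p < alpha:
        return "*"
    return ""

# Placeholder inputs for illustration only.
marker = significance_marker(0.25, 0.1)
```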
4 Discussion
We have described RG-Cor, a scalable estimator for genetic correlation. We show that the RG-Cor estimates of genetic correlation are accurate and achieve statistical efficiency comparable to that of BOLT-REML while being highly scalable. We use RG-Cor to compute genetic correlations across seven continuous phenotypes in the UK Biobank, obtaining estimates consistent with previous results.
This genome-wide analysis of genetic correlation is a stepping stone for understanding the relationships across human traits and diseases. In future analyses, we intend to systematically scan pairs of traits in biobank datasets to obtain genetic correlation estimates. We can also partition the genetic correlation with respect to the function and minor allele frequency of the SNPs to better interpret the underlying relationships and causality.
Availability
The RHE-reg software is made freely available to the research community at: https://github.com/sriramlab/RHE-reg
Acknowledgments
This research was conducted using the UK Biobank Resource under application 33127. We thank the participants of UK Biobank for making this work possible. SS is supported in part by NIH grant R35GM125055, NSF Grant III-1705121, an Alfred P. Sloan Research Fellowship, and a gift from the Okawa Foundation.