ABSTRACT
Large-scale cohorts with combined genetic and phenotypic data, coupled with methodological advances, have produced increasingly accurate genetic predictors of complex human phenotypes, called polygenic risk scores (PRS). In addition to the potential translational impact of identifying at-risk individuals, PRS are being utilized for a growing list of scientific applications including causal inference, identification of pleiotropy and genetic correlation, and powerful gene-based and mixed-model association tests. Existing PRS approaches rely on external large-scale genetic cohorts that have also measured the phenotype of interest. They further require matching on ancestry and genotyping platform or imputation quality. In this work we present a novel reference-free method to produce PRS that does not rely on an external cohort. We show that naive implementations of reference-free PRS result either in substantial over-fitting or in prohibitive increases in computational time. We show that our algorithm avoids both of these issues, and can produce informative in-sample PRS over any existing cohort without over-fitting. We then demonstrate several novel applications of reference-free PRS including detection of pleiotropy across 246 metabolic traits and efficient mixed-model association testing.
INTRODUCTION
Individual genetic polymorphisms typically explain only a small proportion of the heritability, even for traits that are highly heritable [Nolte et al. (2017)]. Polygenic risk scores (PRS) aggregate the contributions of multiple genetic variants to a phenotype [Torkamani et al. (2018)]. These scores can be calculated using routinely recorded genotypes [Torkamani et al. (2018); Nolte et al. (2017)], are strongly associated with heritable traits [Nolte et al. (2017)], and are independent of environmental exposures and other factors that are uncorrelated with germline genetic variants. These properties have motivated a rapidly expanding list of applications from basic science (e.g. causal inference and Mendelian randomization [Burgess and Thompson (2013)], hierarchical disease models [Cortes et al. (2017)], and identification of pleiotropy [Krapohl et al. (2016)]) to translation (e.g. estimating disease risk [Khera et al. (2018); Maas et al. (2016)], identifying patients who are likely to respond well to a particular therapy [Natarajan et al. (2017)], or flagging subjects for modified screening [Seibert et al. (2018)]).
Polygenic risk scores are calculated as a weighted sum of genotypes. In some applications all genotyped SNPs may be used, but often only a small set are given nonzero weight. The subset of SNPs selected to contribute to a PRS may be a genome-spanning but uncorrelated (LD-pruned) set, or a set of SNPs with independent evidence of association with the phenotype of interest. Gene-specific polygenic risk scores are also generated using selected sets of SNPs within a region of the genome, such as a window around the coding region of a particular gene [Gusev et al. (2016); Gamazon et al. (2015)]. The weights on the SNPs included in a polygenic score are often derived from the marginal regression coefficients of an external GWAS [Dudbridge (2016); Wray et al. (2007)], but they may instead be based on predictive models using all SNPs. Joint predictive models include linear mixed models (LMMs) and their sparse extensions [Yang et al. (2011); Zhou et al. (2013); Vilhjálmsson et al. (2015)] and other regularized regression models such as the lasso or elastic net [Warren et al. (2014); Gusev et al. (2016); Gamazon et al. (2015); Rakitsch et al. (2012)]. The predictions from these joint analyses using genome-wide variation are also approximated by post-processing of GWAS summary statistics [Vilhjálmsson et al. (2015); Gusev et al. (2016)].
For these SNP weights to accurately reflect the SNPs’ joint association with the phenotype and to generate informative and interpretable polygenic risk scores, the reference data set must match the target data set in many ways: the populations must have similar ancestry; the trait of interest must be measured and in a similar way; and identical genotypes must be assayed or imputed. Further, the reference data must be large enough to accurately learn the PRS weights.
An alternative approach is to use the studied data set itself to build a reference-free PRS. This eliminates the need for an external reference data set with matched genotypes, phenotypes, and populations. However, as we show below, naive approaches easily over-fit genetic effects. This over-fitting produces PRS that are correlated with non-genetic components of the phenotype, which induces bias or other errors in downstream applications. Cross-validation is one established approach to mitigate over-fitting, which in this context involves holding out and computing a polygenic score for each sample in turn. The main hurdle to this approach is computation time, as standard leave-one-out cross-validation requires fitting the PRS model N times in a sample with N individuals.
Here we report an efficient method to generate PRS by using the out-of-sample predictions from a cross-validated linear mixed model (LMM). Our approach generates leave-one-out (LOO) polygenic risk scores, which we call cvBLUPs, with computational complexity linear in sample size after a single LMM fit. In addition to eliminating the reliance on external data and guaranteeing the PRS are generated from a relevant population and phenotype, we describe several applications that are only feasible with cvBLUPs. We first demonstrate several desirable statistical properties of cvBLUPs and then consider applications including evidence of polygenicity across metabolic phenotypes, a novel formulation of mixed model association studies, and selection of relevant principal components for control of confounding by population structure. To facilitate their use, we have incorporated the calculation of cvBLUPs in the genetic analysis program GCTA [Yang et al. (2011)].
METHODS
We consider a continuous phenotype y measured on N individuals, which depends on an N-by-M matrix of additively coded genotypes G, other covariates X, and random noise ε:

$$y = X\beta + Gb + \varepsilon \quad (1)$$
For each subject i, the polygenic score $PS_i$ is calculated as in Equation 2:

$$PS_i = \sum_{j \in S} g_{ij} w_j \quad (2)$$

where S is the set of SNPs in the polygenic model, $g_{ij}$ is the number of alleles corresponding to the SNP weights at SNP j carried by subject i, and $w_j$ is the SNP weight. $w_j$ is often chosen to be the estimated effect size of SNP j in an external GWAS.
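In code, Equation 2 is a single matrix-vector product. The following minimal sketch (our own illustrative names, assuming plain numpy arrays rather than any particular package) computes the scores for all subjects at once:

```python
import numpy as np

# Minimal sketch of Equation 2 (illustrative names, not code from any
# published package). G is an N-by-M matrix of additively coded genotypes
# (0/1/2), snp_set indexes the SNPs in the set S, and w holds their
# weights, e.g. effect estimates from an external GWAS.
def polygenic_score(G: np.ndarray, snp_set: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Return the vector of polygenic scores PS_i for all N subjects."""
    return G[:, snp_set] @ w
```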
Our objective is to produce a leave-one-out cross-validated polygenic score for each subject. We generate our PRS as genetic predictions from a linear mixed model (LMM). LMMs are widely used for genetic prediction [Robinson et al. (1991)], heritability estimation [Yang et al. (2010); Kang et al. (2008, 2010); Lippert et al. (2011); Zhou and Stephens (2012)], and other polygenic analyses [Lee et al. (2012); Zhou and Stephens (2012); Kang et al. (2008); Lippert et al. (2011)]. The LMM (Equation 3) jointly models the contributions of all SNPs Z and other covariates X to the phenotype y:

$$y = X\beta + Zb + \varepsilon, \qquad b_j \overset{\text{i.i.d.}}{\sim} N(0, \sigma_g^2/M), \qquad \varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma_\varepsilon^2) \quad (3)$$

Following others [Yang et al. (2011)], we define Z by centering and scaling the columns of G to have mean 0 and variance 1.
The key LMM parameters are the genetic variance $\sigma_g^2$ and the noise variance $\sigma_\varepsilon^2$. We estimate these by REML [Patterson and Thompson (1971); Kang et al. (2008); Yang et al. (2011)] or Haseman-Elston regression [Chen (2014)]. These variance estimates are then used to estimate b, the genetic effect sizes (i.e. weights), or Zb, the genetic predictions (i.e. BLUPs):

$$\hat b = \frac{\hat\sigma_g^2}{M} Z^T \hat\Sigma^{-1} (y - X\hat\beta) \quad (4)$$

$$\widehat{Zb} = \hat\sigma_g^2 K \hat\Sigma^{-1} (y - X\hat\beta) \quad (5)$$

where K is the genetic relatedness matrix (GRM) computed from the centered and scaled genotypes Z by $K = ZZ^T/M$. The estimated phenotypic covariance, $\hat\Sigma = \hat\sigma_g^2 K + \hat\sigma_\varepsilon^2 I_N$, where $I_N$ is the N-by-N identity matrix, decomposes the covariance into components due to shared genetics (Zb) and to noise (ε).
While the BLUPs in Equation 5 could be used as a polygenic score, we show below that they in general over-fit the noise (ε) and are therefore inappropriate for most PRS applications. To address this problem, we propose to use leave-one-out (LOO) cross-validated BLUPs instead of ordinary BLUPs, which guarantees independence between the genetic predictions and ε. Unfortunately, standard LOO approaches multiply computational time by a factor of N.
We avoid this penalty by leveraging the fact that for linear models, where fitted values are a linear transformation of the phenotypes, $\hat y = Hy$, the LOO prediction errors can be calculated from a single model fit [Hastie et al. (2009)]. BLUPs from LMMs fall into this category of linear predictors, with $H = \hat\sigma_g^2 K \hat\Sigma^{-1}$ applied to $\tilde y = y - X\hat\beta$, i.e. the phenotype after removing the estimated fixed effects.
In more detail, the LOO prediction errors are the differences between the observed residual phenotypes (after subtracting fixed effects) and the LOO genetic predictions, $e_{(-i)} = \tilde y_i - \widehat{Zb}_{(-i),i}$. The in-sample residuals r are the differences between the residual phenotypes and the BLUPs, $r_i = \tilde y_i - \widehat{Zb}_i$. For a linear model, these are related by a simple equation [Hastie et al. (2009)]:

$$e_{(-i)} = \frac{r_i}{1 - H_{i,i}} \quad (6)$$

where $H_{i,i}$ is the ith diagonal element of the matrix H. Intuitively this says that, due to over-fitting, the in-sample residuals $r_i$ are deflated by $(1 - H_{i,i})$ relative to their unbiased LOO counterparts.
We can rearrange these expressions to calculate the LOO predictions, or cvBLUPs, given the standard BLUPs, the residual phenotypes $\tilde y$, and the diagonal elements of the H matrix:

$$\mathrm{cvBLUP}_i = \widehat{Zb}_{(-i),i} = \tilde y_i - \frac{\tilde y_i - \widehat{Zb}_i}{1 - H_{i,i}} \quad (7)$$
Because all of these elements are computed when fitting an LMM, cvBLUPs can be produced with no additional computational complexity.
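To make the recipe concrete, the following self-contained sketch (our own code, not the GCTA implementation; the variance components are treated as known rather than estimated by REML, and there are no fixed effects, so $\tilde y = y$) simulates a phenotype, forms BLUPs and cvBLUPs via Equations 5 and 7, and spot-checks the shortcut against an explicit leave-one-out refit:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, h2 = 500, 1000, 0.5

# Simulate scaled genotypes, per-SNP effects, and a phenotype with no
# fixed effects (so the residual phenotype is y itself).
Z = rng.standard_normal((N, M))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
b = rng.normal(0.0, np.sqrt(h2 / M), M)
y = Z @ b + rng.normal(0.0, np.sqrt(1.0 - h2), N)

# Variance components treated as known for this sketch (in practice they
# come from REML or Haseman-Elston regression).
sig_g, sig_e = h2, 1.0 - h2
K = Z @ Z.T / M                                   # GRM: K = Z Z^T / M
H = sig_g * K @ np.linalg.inv(sig_g * K + sig_e * np.eye(N))
blup = H @ y                                      # in-sample BLUPs (Eq. 5)
cvblup = y - (y - blup) / (1.0 - np.diag(H))      # LOO shortcut (Eq. 7)

# Spot check: an explicit refit excluding subject 0 gives the same LOO
# prediction as the single-fit shortcut.
idx = np.arange(1, N)
pred0 = sig_g * K[0, idx] @ np.linalg.inv(
    sig_g * K[np.ix_(idx, idx)] + sig_e * np.eye(N - 1)) @ y[idx]
assert np.isclose(cvblup[0], pred0)
```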
RESULTS
Empirical confirmation of cross-validated predictions
To examine the properties of the proposed cvBLUP formulation we conducted a set of simulations. We generated 1000 data sets with N = 1000 subjects under the model y = Xβ + gβg + Zb + ε. X consists of 5 normally distributed covariates, and Xβ jointly explains 20% of the phenotypic variance. g represents an additively coded SNP with allele frequency 0.5, and gβg contributes 2% of the phenotypic variance. Z represents M = 1000 independent SNPs with minor allele frequencies drawn i.i.d. and uniformly from [0.05, 0.5]. Effect sizes $b_j$ are drawn i.i.d. from $N(0, \sigma_g^2/M)$, with the genetic variance $\sigma_g^2$ accounting for 39% of the phenotypic variance. The residual noise also accounts for 39% of the phenotypic variance, giving a heritability h2 ≈ 50%. For each simulated data set, we first estimate variance components and then compute BLUPs and cvBLUPs as described above.
Figure 1 shows scatter plots for one simulation of non-genetic (ε) and genetic factors (Zb) plotted against in-sample BLUPs (Left) and cvBLUPs (Right). As expected, the standard BLUPs are highly correlated with ε but the cvBLUPs are not, a central required property of genetic predictors for most applications. In contrast, both BLUPs and cvBLUPs are highly correlated with the true Zb, emphasizing the subtlety in constructing valid polygenic risk scores.
Table 1 shows the mean correlations of true simulated values with standard BLUPs and cvBLUPs. Again, standard BLUPs are clearly correlated with the noise term ε due to over-fitting, but cvBLUPs are appropriately uncorrelated with ε. Standard BLUPs, but not cvBLUPs, are also correlated with the unmodeled causal SNP. This type of correlation causes downstream problems for residual analyses, predictions, and causal inference. Importantly, the cvBLUPs are independent of all unmodeled effects as desired.
Genetic predictions and cross-trait predictions using cvBLUPs
We next applied cvBLUP in an analysis of Finnish men from the METSIM cohort [Laakso et al. (2017)]. This cohort comprises 10,197 men aged 45 to 73 at recruitment between 2005 and 2010 in Kuopio, Finland. Blood serum samples were collected from each participant, and 228 metabolites in the samples were quantified by NMR. In addition to the metabolites, biometric traits including height and weight, and epidemiological traits such as diagnoses or family history of diabetes and CHD, were recorded for a total of 248 phenotypes. Continuous phenotypes were quantile normalized. All samples were genotyped at 665,478 SNPs on the Illumina OmniExpress chip. After removing subjects with missing rates above 5 percent and SNPs with missing rates above 5 percent, 10,070 subjects and 609,131 SNPs remain.
We initially consider genetic predictions of the metabolic, biometric, and epidemiological traits in an unrelated subset of subjects (with genetic relatedness less than 0.05). Since the metabolic traits are expected to be affected by statins and by pharmaceutical interventions for diabetes, we exclude subjects with diabetes or statin use from the initial analysis and calculation of cvBLUPs. No comparable data set with the panel of metabolic measurements available in the METSIM cohort exists, but cvBLUPs allow computationally efficient genetic predictions of all 246 phenotypes (excluding diabetes status and statin use). With the genetic prediction models learned in the restricted set of subjects, we extended predictions to the excluded subjects with standard BLUP effect size estimates (Equation 4). Thus, cvBLUPs allow analyses of reference-free genetic predictions in a subset of subjects that is restricted to avoid confounding by known environmental exposures (statins and responses to diagnoses of diabetes) or by family structure; and these genetic predictions may then be extended to the subjects initially excluded to avoid confounding.
We first estimated $\sigma_g^2$ and $\sigma_\varepsilon^2$ for each phenotype using linear mixed models [Yang et al. (2011)]. Overall, 198 of 246 phenotypes have statistically significant heritability at the 0.05 significance level by Wald tests. The significant heritabilities range from 14.6 to 46.1 percent with a mean and standard deviation of 27.5 and 8.0 percent respectively. The 48 non-significant heritabilities range from 0 to 14.4 percent with a mean and standard deviation of 8.3 and 4.3 percent respectively. We next used the method described above to compute leave-one-out polygenic risk scores (i.e. cvBLUPs). Figure 2 and Supplemental Table 12 show the correlations of the phenotypes (rows) with cvBLUPs (columns) for all 246 phenotypes, grouped by hierarchical clustering of the rows [Kolde and Kolde (2018)] with the same permutation applied to the columns. The blue diagonal shows the expected positive correlation between a cvBLUP and its own phenotype, with mean 0.065 and standard deviation 0.037. Focusing on the 198 traits with significant heritabilities, the correlations of cvBLUP and phenotype have mean 0.078 and standard deviation 0.027.
The off-diagonal blue patches in the figure represent cvBLUPs that are positively correlated with and predictive of other phenotypes, while red patches represent cvBLUPs that are predictive but negatively correlated with other phenotypes. The off-diagonal correlations show the widespread pleiotropy of genetic effects on metabolism, with over 16,203 off-diagonal cvBLUP-phenotype associations at FDR = 0.05 [Benjamini and Yekutieli (2001)]. Many cvBLUP-trait correlations are sign-consistent with the respective trait-trait correlations. For example, HDL and LDL cholesterol are well known to be negatively correlated [Terry et al. (1989)], and our results demonstrate this is partially due to negative genetic pleiotropy. That is, we observe that SNPs associated with increased LDL are also associated with decreased HDL in aggregate (and vice versa).
cvBLUPs in association testing
In addition to use in detection of pleiotropy, polygenic modelling is a widely used tool in association testing [Kang et al. (2008); Yang et al. (2014); Lippert et al. (2011); Zhou and Stephens (2012); Loh et al. (2015)], and we therefore consider cvBLUPs in this context. We compare the relative performance of five groups of methods: first, unadjusted regression; second, principal component adjusted regression; third, standard linear mixed model (LMM) association tests; fourth, LMM residual-based methods; and fifth, cvBLUP-adjusted regression. Standard LMM-based methods use association tests where the covariance of the observations, based on the genetic relatedness of the subjects, is estimated and used to calculate effect estimates and test statistics by generalized least squares [Kang et al. (2008); Yang et al. (2014); Lippert et al. (2011); Zhou and Stephens (2012)]. The LMM residual-based methods perform association tests on the BLUP-residualized phenotypes, and possibly genotypes. Then, to account for the bias inherent in using the over-fit BLUPs, they apply an adjustment step to the resulting test statistics and effect sizes. These methods were pioneered by GRAMMAR/GRAMMAR-γ and include the recent BOLT and BOLT-INF methods [Aulchenko et al. (2007); Svishcheva et al. (2012); Loh et al. (2015)].
The typical sources of confounding for associations with germline genetic markers are population structure, family structure, and batch effects in the data collection [Listgarten et al. (2010)]. Genetic principal components as adjustment covariates may suffice to control for confounding by population structure or batch effects, but linear mixed models are often more effective at controlling these sources of confounding [Kang et al. (2010)], while also helping to control confounding by family structure and boosting power to detect true associations over standard fixed-effect regression models [Yang et al. (2014)]. These benefits of linear mixed models come at the expense of an increased computational burden over standard linear regression.
In genome-wide association studies the statistical significance of each variant $g_j$ is tested individually. The SNP is jointly analyzed with covariates X, and the contributions of the unmodeled variants enter a larger error term:

$$y = X\beta + g_j \beta_j + \eta, \qquad \eta = Z_{(-j)} b_{(-j)} + \varepsilon \quad (8)$$
Often a better estimate of the effect size of a particular SNP $g_j$ may be made by accounting for the contributions of the other variants to the phenotype y, and by blocking the effects of confounders of the genotype-phenotype association – by adjustment with an appropriate set of fixed-effect covariates or by other means [Zaitlen et al. (2012a,b); Yang et al. (2014)].
Here we demonstrate the use of cvBLUPs as adjustment covariates in a linear regression model that efficiently captures some of the benefits of a standard mixed model association study. To compare the performance of cvBLUP-adjusted analyses to existing methods for association testing under a range of study scenarios, we used simulations. Methods compared were: unadjusted linear regression, PC-adjusted linear regression, a standard LMM association test (GCTA), and BOLT association tests. BOLT results were collected both for the infinitesimal genetic model (BOLT-INF) and the sparse causal genetic model (BOLT-LMM). Association tests conducted with cvBLUP adjustment, GCTA, and BOLT were done with leave-one-chromosome-out schemes wherein the variance components, cvBLUPs, and phenotypic predictions and residuals (BOLT) were calculated using only SNPs on different chromosomes than the test SNPs.
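For concreteness, a cvBLUP-adjusted test is ordinary least squares with the cvBLUP as one extra covariate. A minimal sketch follows (our own names; it assumes the phenotype y, the test-SNP genotypes g, and a cvblup vector computed with a leave-one-chromosome-out GRM are already available):

```python
import numpy as np
from scipy import stats

# Sketch of a cvBLUP-adjusted association test (illustrative names):
# OLS of the phenotype on the test SNP plus the cvBLUP covariate, where
# cvblup was computed leaving out the test SNP's chromosome.
def cvblup_adjusted_test(y, g, cvblup):
    """Return (beta_hat, se, p) for the test SNP g."""
    X = np.column_stack([np.ones_like(y), g, cvblup])
    beta, rss, _, _ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = rss[0] / (len(y) - X.shape[1])        # residual variance
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    z = beta[1] / se
    return beta[1], se, 2.0 * stats.norm.sf(abs(z))
```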
In each simulation, data sets were generated with N = 2000 subjects under the model y = Xβ + gβg + Zb + ε. Here X consists of normally distributed covariates, drawn to contribute i.i.d. noise to the phenotypes in the independent-subject simulations, but correlated with the subjects' ancestral populations in the simulations with population structure, and with family membership in the simulations with confounding by family structure. Xβ was scaled to contribute 10% of the phenotypic variance. g represents a set of 5 additively coded causal SNPs with effect sizes βg set to a common fixed value of 0.25. A set of 5 null SNPs were also drawn but did not contribute to the phenotype y. Z represents M = 2000 independent SNPs modeled with random effect sizes. For simulations run under the infinitesimal genetic model, the random effects $b_j$ were drawn from $N(0, \sigma_g^2/M)$, with the genetic variance accounting for 40% of the phenotypic variance. For the sparse non-infinitesimal model, a fraction $m_c$ = 2% of the SNPs in Z were selected to be causal, with effect sizes drawn from $N(0, \sigma_g^2/(m_c M))$. The residual noise accounts for 40% of the phenotypic variance, giving a heritability h2 ≈ 50%. Effect estimates and test statistics produced by the various analysis methods are summarized for 1000 null SNPs and 1000 causal SNPs.
In Table 2, the results of analyzing data sets with independent subjects and an infinitesimal genetic architecture are shown. All methods produced unbiased effect estimates and well calibrated tests under the null, but the cvBLUP and mixed-model based methods were more powerful – with greater average test statistics for the causal SNPs. The BOLT effect estimates were biased, with effect sizes deflated towards zero. This deflation of BOLT effect estimates is seen across the simulation scenarios; however, we do not detect deflation of the BOLT effect estimates in the real-data analyses of the METSIM cohort data below, where the ratio of the number of SNPs to the number of subjects is much larger. Bias in BOLT may be due to the empirical estimation of a deflation-correction factor for the residual-based test, following the GRAMMAR-γ adjusted residual analysis method [Loh et al. (2015); Svishcheva et al. (2012)]. In analogous simulations with a sparse genetic model (Table 9, Supplement), BOLT-LMM is considerably more powerful because it explicitly models the sparse genetic architecture. When the true genetic contribution to the phenotype, Zb (not including the test SNPs), is included as a covariate in a linear regression association test of causal SNPs g, the power is greatly increased.
In Table 3, the results of analyzing data sets with confounding by population structure are shown. Subjects were drawn from 5 distinct populations with pairwise Fst set to 0.03. Population-specific allele frequencies were generated by the Balding-Nichols model [Balding and Nichols (1995)]. The component Xβ in the generating model was drawn to be correlated with population, but X was not used in the analyses. In this scenario, unadjusted linear regression and linear regression with fewer than 4 principal components as adjustment covariates give inflated test statistics and excessive false positives under the null. Correcting for population structure by inclusion of 4 principal components, cvBLUP adjustment, or use of the mixed model based methods gives well calibrated test statistics under the null. For causal test SNPs, cvBLUP and mixed model based methods give greater test statistics than PC-adjusted linear regression with 4 PCs, indicating greater power. Linear regression with adjustment for the true genetic effect Zb gives high power for detecting causal SNPs, but does not control the inflation of test statistics at null SNPs due to population structure. Covariate adjustment with cvBLUPs controls the confounding by population structure and improves power, as do the LMM based methods: these methods use all SNPs and so detect the shifts in allele frequencies across populations when there is confounding by population structure.
cvBLUPs correct for population structure because they are weighted combinations of ALL principal components, where the weights are based on the singular value corresponding to the principal component and on the strength of the association of the principal component with the outcome (see Supplement). Conceptually, cvBLUPs control for population structure as if all PCs were considered and the most relevant ones for the analysis were kept.
In Table 4, the results of analyzing data sets with confounding by family structure are shown. Here the 2000 subjects in each simulation represented 200 families with 10 subjects each. Families were generated in pedigrees with four founders and six of their descendants, with descendants' genotypes generated by gene dropping from their parents. In the data generating model, there were covariate effects Xβ correlated with family membership, but these covariates were not included in the analyses, creating confounding by family structure. In this scenario, unadjusted linear regression and PC-adjusted linear regression have inflated test statistics and correspondingly high false discovery rates. Standard linear mixed models (GCTA) control for the confounding by family structure, with accurate effect estimates under the null and the alternative, and only slightly inflated test statistics under the null, with mean and standard deviation 1.10 and 0.05 respectively. In this scenario, cvBLUP adjusted analyses and the results from BOLT have biased effect estimates under the alternative, and deflated (conservative) test statistics under the null. This suggests over-adjustment for family structure in both cvBLUP-adjusted analyses and BOLT.
Finally, we applied these methods to the METSIM data described above. All p-values were GC adjusted for comparison purposes. All mixed model-based methods, including LR+cvBLUP, were more powerful than standard linear regression. As expected, BOLT-LMM had the highest power due to its modelling of non-infinitesimal genetic architecture. In this data analysis there is no observed evidence of bias in the BOLT effect size estimates, such as a systematic deflation relative to estimates made by standard linear regression or standard LMM fixed effect estimates, suggesting that the deflation seen in simulation may not be a problem in practice.
Leveraging cvBLUPs to compute linear mixed model shrink parameters
Finally, we examine the use of cvBLUPs in estimating LMM shrink parameters. The BLUP effect size estimates from a linear mixed model are "shrunk" to be smaller in magnitude, exploiting a bias-variance trade-off that reduces their mean squared error. With the estimated genetic variance component $\hat\sigma_g^2$, the estimated effect size distribution $N(0, \hat\sigma_g^2/M)$ acts as an empirical prior on the effect sizes, which yields estimates that are biased toward the prior mean of 0. Equivalently, the BLUP effect size estimates are ridge regression estimates with a penalty parameter λ related to the LMM variance components: $\lambda = M\hat\sigma_\varepsilon^2/\hat\sigma_g^2$.
When SNPs are independent, the LMM shrinkage estimate of the effect sizes b for given estimates of the variance components is [Vilhjálmsson et al. (2015)]:

$$\hat b_j = \frac{h^2}{h^2 + M/N}\, \tilde b_j \quad (9)$$

where $\tilde b_j$ is the marginal effect estimate, and the shrink, $S = h^2/(h^2 + M/N)$, can be calculated directly from the number of individuals and SNPs. However, when SNPs are in LD, it is more complicated to estimate the effective number of SNPs [Patterson et al. (2006)] and hence the shrink. Here we show how cvBLUPs can be used to estimate the shrink directly.
First, in the case of independent SNPs, the LMM estimates of the SNP effects would be reduced in magnitude and variance relative to the true values b, as in Equation 9, and there would be a corresponding decrease in the variance of polygenic risk scores for a new individual with normalized genotypes $Z_{new}$, calculated using $\hat b$ as weights rather than the true effect sizes b:

$$\mathrm{Var}(Z_{new} \hat b) \approx S\, \sigma_g^2 \quad (10)$$
Equation 10 suggests direct estimation of the shrink by taking the ratio of $\widehat{\mathrm{Var}}(\mathrm{cvBLUP})$ to $\hat\sigma_g^2$. We confirm via simulation that this approach is approximately unbiased. For multiple settings of N (number of subjects), M (number of SNPs) and h2 (heritability), heritability and variance components were estimated, and the value of the standard independent-SNP model for the shrink, $h^2/(h^2 + M/N)$, was compared to the empirical BLUP shrink $\widehat{\mathrm{Var}}(\mathrm{BLUP})/\hat\sigma_g^2$ and the empirical cvBLUP shrink $\widehat{\mathrm{Var}}(\mathrm{cvBLUP})/\hat\sigma_g^2$. In each simulation setting, the BLUP shrink is much larger than the independent-model value or the cvBLUP shrink, due to over-fitting by the standard BLUPs. However, the cvBLUP formula and the independent-SNP model are consistent for all parameter settings.
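A self-contained numerical sketch of this comparison (our own code; variance components treated as known and SNPs simulated without LD) follows. By Equations 9 and 10, the cvBLUP shrink should track $h^2/(h^2 + M/N)$ while the BLUP shrink is inflated by over-fitting:

```python
import numpy as np

# Empirical shrink comparison under an independent-SNP simulation
# (illustrative names; variance components treated as known).
rng = np.random.default_rng(1)
N, M, h2 = 2000, 4000, 0.5
Z = rng.standard_normal((N, M))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
y = Z @ rng.normal(0.0, np.sqrt(h2 / M), M) \
    + rng.normal(0.0, np.sqrt(1.0 - h2), N)

sig_g, sig_e = h2, 1.0 - h2
K = Z @ Z.T / M
H = sig_g * K @ np.linalg.inv(sig_g * K + sig_e * np.eye(N))
blup = H @ y
cvblup = y - (y - blup) / (1.0 - np.diag(H))

S_indep = h2 / (h2 + M / N)          # Equation 9: independent-SNP shrink
S_blup = np.var(blup) / sig_g        # inflated by over-fitting
S_cvblup = np.var(cvblup) / sig_g    # should track S_indep
print(f"S_indep={S_indep:.3f} S_blup={S_blup:.3f} S_cvblup={S_cvblup:.3f}")
```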
Because this approach does not require identification of an effective number of SNPs, it extends directly to the case where there is linkage disequilibrium. We applied this approach to simulated data based on the METSIM cohort used above. We estimated the shrink parameter for simulated phenotypes based on the real genotypes at 609,131 SNPs with minor allele frequencies greater than 0.01 in the 6,263 unrelated (at the 0.05 level) subjects. Twenty simulations were run using fractions of causal SNPs between 0.0001 and 1.0. Causal SNPs were chosen by simple random sampling with equal probability from the genotyped SNPs. SNP effect sizes were normally distributed and trait heritability was 50 percent. The shrink formula for independent SNPs suggested a shrink of about 0.0045-0.0055, while the cvBLUP shrink – the ratio of the variance of the cvBLUPs to the estimated genetic variance component $\hat\sigma_g^2$ – ranged from 0.06-0.08.
In this setting, with LD between the SNPs, the independent-SNP formula for the shrink is invalid. In particular, the effective number of SNPs is much smaller than the total number of SNPs, so using the total number yields a shrink estimate over ten times smaller than the cvBLUP-based estimate.
DISCUSSION
Here we describe a new and computationally efficient approach for generating polygenic risk scores directly from a linear mixed model. We show that the LMM framework allows direct calculation of out-of-sample genetic predictions. Our approach will have immediate utility for the growing list of applications that rely on PRS, and we provide examples of several additional application areas, including detection of pleiotropy, powerful association testing, and estimation of the polygenic shrink.
The elimination of over-fitting by cvBLUPs relative to BLUPs suggests a solution to the bias problem in residual-based methods. Rather than post-hoc correction of residual test statistics as in GRAMMAR-γ and BOLT, the LMM residuals may be replaced by out-of-sample prediction errors from cross-validated predictors: use (y − cvBLUP) instead of the standard residuals (y − BLUP).
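In code the substitution is a one-line change; a minimal sketch with our illustrative names:

```python
import numpy as np

# Sketch of the proposed substitution (illustrative names): form
# association-test outcomes from out-of-sample prediction errors rather
# than over-fit in-sample BLUP residuals.
def loo_residuals(y, blup, h_diag):
    """Return y - cvBLUP given in-sample BLUPs and diag(H) (Equation 7)."""
    cvblup = y - (y - blup) / (1.0 - h_diag)
    return y - cvblup          # equal to (y - blup) / (1 - h_diag)
```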
There are several limitations of this approach. First, cvBLUPs are calculated from the standard LMM framework, which corresponds to the infinitesimal genetic model. As we see in supplementary Tables 9-11, in simulations with sparse genetic architectures, sparse models have considerably higher power. In future work, we intend to use sparse analogues of cvBLUPs generated as cross-validated predictions. Unfortunately, sparse models including the BOLT-LMM model and the lasso are not amenable to fast leave-one-out cross-validation as in Equation 7.
Another limitation is that when used as adjustment covariates, cvBLUPs do not control confounding by family structure or cryptic relatedness as well as standard mixed model association tests. Rather than using all subjects for computation of cvBLUPs, an alternative protocol is to calculate cvBLUPs for an unrelated subset, and also to calculate BLUP effect size estimates (Equation 4) using these unrelated subjects. This procedure will tend to block confounding by genetic structure remaining in the nominally unrelated subset and will improve power by accounting for the polygenicity of the trait, but it could under-adjust for confounding by family structure because, by construction, the training set for learning the polygenic model does not contain closely related subjects. Methods involving cross-validated predictions from multiple models, or cvBLUPs from mixed models with multiple variance components, may prove useful, by analogy with other multiple variance component methods that include sparse relatedness matrices to indicate family membership [Zaitlen et al. (2013); Tucker et al. (2015)].
In our cross-trait analysis of the METSIM data set we show the cross-correlations between the PRS for one trait and the actual (normalized) phenotypic measures for other traits. We are working to extend these cross-trait analyses – in particular by using correlations of cvBLUPs for pairs of traits as estimates of the genetic correlation. However, even naive correlations of cvBLUPs give an effective picture of the genetic correlations between traits. Since cvBLUPs are efficiently calculated one at a time, and genetic correlations are then estimated in trivial pairwise analyses of traits, the pairwise correlations of hundreds or thousands of traits may be efficiently calculated this way. Furthermore, in the pairwise analyses of the METSIM phenotypes, and in the pairwise analyses of RNA expression that underlie our trans-eQTL analysis [Liu et al. (2018)], we generate asymmetric cross-correlation matrices, because the correlation of the cvBLUP for trait A with measured trait B is not the same as the correlation of measured trait A with the cvBLUP for trait B. We are exploring applications of these asymmetric matrices for network analysis and causal inference.
Efficient generation of out-of-sample genetic predictions by leave-one-out cross-validation of the predictions from a linear mixed model is an effective way to generate polygenic risk scores, and it opens PRS-based analyses to scenarios where there is no reference data set available to build a typical scoring model. It is now well known that PRS and genetic predictions transfer poorly to populations that are distinct from the reference data set used to learn the genetic model [Martin et al. (2017); Scutari et al. (2016)]. We look forward to using reference-free PRS methods based on cvBLUPs for applications with data from under-represented populations.
The principle of using cross-validated predictions from polygenic models as PRS may be extended to predictions from sparse or complex models, but the cross-validated predictions from a standard linear mixed model, which we call cvBLUPs, are particularly simple to calculate and have novel and interpretable applications. To make the results of this work accessible to the community, we have implemented them in the GCTA software package [Yang et al. (2011)].
CONFLICTS OF INTEREST
None.
APPENDICES
Formula for efficient leave-one-out cross-validation of the linear mixed model
Let

$$y = X\beta + Zb + \varepsilon$$

represent a linear mixed model with continuous outcome y, fixed-effect covariates X, fixed effect sizes β, additively coded genotypes Z, random genetic effect sizes b, and unmodeled or environmental factors ε. The genetic effect sizes b and environmental factors ε are modeled as i.i.d. normally distributed random variables with variances $\sigma_g^2/M$ and $\sigma_\varepsilon^2$ respectively:

$$b_j \overset{\text{i.i.d.}}{\sim} N(0, \sigma_g^2/M), \qquad \varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma_\varepsilon^2)$$
By convention the genotypes are scaled to have mean 0 and variance 1, and the SNP effect sizes are assumed to be independent of MAF on this scale. So the total genetic contribution to the phenotype, Zb, is normally distributed with mean 0 and variance $\sigma_g^2$. The scaled genotypes Z are used to calculate a SNP-based genetic relatedness matrix K:

$$K = \frac{1}{M} Z Z^T$$
The observations y in the linear mixed model are normally distributed with a covariance Σ that depends on K and the variance components $\sigma_g^2$ and $\sigma_\varepsilon^2$:

$$y \sim N(X\beta,\ \Sigma), \qquad \Sigma = \sigma_g^2 K + \sigma_\varepsilon^2 I_N$$
To simplify notation, set the fixed effect sizes β, and hence the fixed effects Xβ, to 0:

$$y = Zb + \varepsilon$$
Exclude observation i from the genetic predictive model fit and then make an out-of-sample (oos) prediction for observation i:

$$\hat y_i^{oos} = z_i \hat b_{(-i)}$$

where $z_i$ is the ith row of Z and $\hat b_{(-i)}$ is fit without observation i.
Now generate the augmented vector $y^{+i}$ as the vector of outcomes y with observation $y_i$ replaced with its out-of-sample prediction from a model trained on the remaining observations, $\hat y_i^{oos}$.
Consider the ridge regression interpretation of the mixed model, with ridge penalty $\lambda = M\sigma_\varepsilon^2/\sigma_g^2$:

$$\hat b = \arg\min_b\ \|y - Zb\|^2 + \lambda \|b\|^2$$

Assume the variance components or heritability have been estimated in a prior step or are known.
For the reduced data set with the outcome for observation i removed we have:

$$\hat b_{(-i)} = \arg\min_b\ \sum_{k \neq i} (y_k - z_k b)^2 + \lambda \|b\|^2$$

For the augmented data set with the outcome for observation i replaced by an out-of-sample prediction we have:

$$\hat b^{+i} = \arg\min_b\ \sum_{k} (y^{+i}_k - z_k b)^2 + \lambda \|b\|^2$$

For the augmented model, the ith term $(y^{+i}_i - z_i \hat b_{(-i)})^2 = 0$ by the definition of $y^{+i}_i$, which is the smallest value this term can take, and the remaining terms in the expression for $\hat b^{+i}$ are the same as for the reduced model $\hat b_{(-i)}$, so $\hat b^{+i} = \hat b_{(-i)}$.
Now consider the difference in the predictions for the ith value from the augmented model and the model with all observations:

$$\hat y^{+i}_i - \hat y_i = H_{i,\cdot}\,(y^{+i} - y)$$

Here, $H_{i,\cdot}$ is the ith row of the matrix H. The vectors $y^{+i}$ and y only differ at the ith element, so:

$$\hat y^{+i}_i - \hat y_i = H_{i,i}\,(\hat y_i^{oos} - y_i)$$

Finally, since $\hat y^{+i}_i = z_i \hat b^{+i} = z_i \hat b_{(-i)} = \hat y_i^{oos}$, rearranging gives:

$$\hat y_i^{oos} = y_i - \frac{y_i - \hat y_i}{1 - H_{i,i}}$$

which matches Equation 7.
Relationship between principal components and cross-validated BLUPs
Genetic principal components are routinely used as quantitative measures of study participants' ancestry [Patterson et al. (2006); McVean (2009)] and as such are used as adjustment covariates in association studies to block confounding by population structure [Patterson et al. (2006)]. Linear mixed models provide another framework for controlling potential confounding by population structure [Yang et al. (2014)]. Both PCs and LMMs account for ancestry and other forms of genetic structure in a data set through analyses of the genetic relatedness matrix (GRM) $K = ZZ^T/M$. Specifically, PCA involves calculating some number of eigenvectors of the GRM – or equivalently left singular vectors of the scaled genotype matrix Z – and using them as adjustment covariates in regression analyses, while LMMs model observed outcomes as non-independent, with the random effects that contribute to the outcome y correlated to a degree determined by the shared genetic variation between each pair of subjects: $\mathrm{Cov}(y) = \sigma_g^2 K + \sigma_\varepsilon^2 I_N$.
In principal component adjusted analyses, some number of PCs are used as adjustment covariates in linear regression. With linear mixed models, generalized least squares with an estimate of the sample covariance is used.
Principal component adjusted analysis has several disadvantages. First, it is not clear how many principal components should be used. In GWAS it has become conventional to use a standard number of PCs, say 10, but it is generally not clear whether that will be enough to account for the components of genetic structure that are confounded with non-genetic factors in a study. Second, it is not clear which PCs should be included. Conventionally, PCs are sorted in descending order by the magnitude of the corresponding eigenvalues, and PCs with the largest eigenvalues are used. However, selected PCs may not be associated with any non-genetic factors and may not relieve any confounding. Finally, in association testing applications, tests needlessly lose power due to over-adjustment when PCs that do not adjust for confounding are included as covariates. By construction PCs represent axes of variation in the genotype data, so some may be highly correlated with a test SNP, and inclusion of correlated covariates increases the standard errors around test SNP effect estimates.
The spectral decomposition of the GRM K is:

$$K = U \Lambda U^T$$

where U is an orthogonal matrix whose columns are the genetic PCs (the eigenvectors of K), and Λ is a diagonal matrix of the corresponding eigenvalues.
Principal component adjusted regression includes $U_k$ (the first k columns of U) as regression covariates to improve the estimation or testing of $\beta_g$:

$$y = g\beta_g + U_k \gamma + \varepsilon$$

(where γ denotes the fixed effects of the included PCs).
This leaves open the question of how many or which PCs should be included. As an alternative to standard practice, we can try using all of the PCs, and use regularization to keep the model estimable. We can also minimize problems due to over-adjustment by calculating the PCs U using a GRM K* or scaled genotype matrix Z* that does not include SNPs in linkage disequilibrium with or on the same chromosome as the test-SNP g.
Towards a connection to cvBLUPs, rescale the columns of U by the corresponding singular values:

$$U_s = U \Lambda^{1/2}$$
Rescaling U to $U_s$ puts a higher prior weight on PCs with larger eigenvalues. Now compress the contribution of the principal components to the outcome by calculating a vector of phenotypic predictions using ridge regression with a penalty λ on the principal components:

$$\hat y = U_s (U_s^T U_s + \lambda I_N)^{-1} U_s^T y$$
If we define $\lambda = \sigma_\varepsilon^2/\sigma_g^2$ and recall that $K = U \Lambda U^T = U_s U_s^T$, so that $U_s^T U_s = \Lambda$, we have:

$$\hat y = U_s (\Lambda + \lambda I_N)^{-1} U_s^T y = K (K + \lambda I_N)^{-1} y = \sigma_g^2 K (\sigma_g^2 K + \sigma_\varepsilon^2 I_N)^{-1} y$$
But this is just a BLUP (compare Equation 5 with Xβ = 0). So BLUPs arise as the limiting case of PC-adjusted regression in which all PCs are used, with ridge regularization keeping the model estimable. Therefore cross-validated predictions from ridge regression on all PCs are cvBLUPs.
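A short numerical check (our own code) confirms that ridge regression on all rescaled PCs reproduces the BLUP smoother exactly:

```python
import numpy as np

# Verify numerically that Us (Us' Us + lam I)^{-1} Us' y equals
# K (K + lam I)^{-1} y, with Us = U Lambda^{1/2} (illustrative names).
rng = np.random.default_rng(2)
N, M = 200, 500
Z = rng.standard_normal((N, M))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
K = Z @ Z.T / M
y = rng.standard_normal(N)
lam = 1.0                                    # lam = sig_e^2 / sig_g^2

evals, U = np.linalg.eigh(K)                 # K = U diag(evals) U'
Us = U * np.sqrt(np.clip(evals, 0.0, None))  # rescale PCs by singular values
yhat_pc = Us @ np.linalg.solve(Us.T @ Us + lam * np.eye(N), Us.T @ y)
yhat_blup = K @ np.linalg.solve(K + lam * np.eye(N), y)
assert np.allclose(yhat_pc, yhat_blup)
```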
cvBLUPs as adjustment covariates are similar to a compression of all PCs into a single covariate, with the PCs given prior weights that emphasize those with larger eigenvalues but do not exclude any. The PCs are also weighted by their relevance to the outcome, because cvBLUPs are predictions from a ridge regression model whose implicit effect sizes reflect each PC's association with the outcome. Over-fitting in the ridge regression step is avoided by the leave-one-out cross-validation. Finally, loss of power by over-adjustment is avoided by excluding the chromosome, or the SNPs in linkage disequilibrium with the test SNP, from the GRM used for calculation of the cvBLUPs. In fact, unlike PC adjustment but similar to standard LMM analyses, cvBLUP adjustment boosts the power of association studies by modeling genetic contributions to the phenotype other than the SNP of interest, thereby increasing the signal-to-noise ratio.
Additional results with cvBLUP-adjusted association testing
ACKNOWLEDGEMENTS
NZ, JM, DP were funded by NIH grants (K25HL121295, U01HG009080, R01HG006399).
We thank the individuals who participated in the METSIM study. This study was funded by the National Institutes of Health (NIH) grants HL-095056, HL-28481, and U01 DK105561. The funders had no role in study design, data collection, and analysis, decision to publish, or preparation of the article.
MAK works in a Unit that is supported by the University of Bristol and UK Medical Research Council (MCUU12013/1). The Baker Institute is supported in part by the Victorian Government’s Operational Infrastructure Support Program.