Abstract
We introduce Genetic Instrumental Variables (GIV) regression – a method to estimate causal effects in non-experimental data with many possible applications in the social sciences and epidemiology. In non-experimental data, genetic correlation between the outcome and the exposure of interest is a source of bias. Instrumental variable (IV) regression is a potential solution, but valid instruments are scarce. Existing literature proposes to use genes related to the exposure as instruments (i.e. Mendelian Randomization – MR), but this approach is problematic due to possible pleiotropic effects of genes that can violate the assumptions of IV regression. In contrast, GIV regression provides accurate estimates for the causal effect of the exposure and gene-environment interactions involving the exposure under less restrictive assumptions than for MR. As a valuable byproduct, GIV regression also provides accurate estimates of the chip heritability of the outcome variable. GIV regression uses polygenic scores (PGS) for the exposure and the outcome of interest, both of which can be constructed from genome-wide association study (GWAS) results. By splitting the GWAS sample for the outcome into non-overlapping subsamples, we obtain multiple indicators of the outcome PGS that can be used as instruments for each other. In two empirical applications, we demonstrate that our approach produces reasonable estimates of the chip heritability of educational attainment (EA) and, unlike the results using MR, GIV regression estimates find that the positive relationship between body height and EA is primarily due to genetic confounds that have pleiotropic effects on both traits.
Introduction
A major challenge in the social sciences and in epidemiology is the identification of causal effects in non-experimental data. In these disciplines, ethical and legal considerations along with practical constraints often preclude the use of experiments to randomize the assignment of observations between treatment and control groups or to carry out such experiments in samples that represent the relevant population [1]. Instead, many important questions are studied in field data which make it difficult to discern between causal effects and (spurious) correlations that are induced by unobserved factors [2]. Obviously, confusing correlation with causation is not only a conceptual error, it can also lead to ineffective or even harmful recommendations, treatments, and policies, as well as a significant waste of resources (e.g., as in [3]).
One important source of bias in field data is genetic effects: Twin studies [4] as well as methods based on molecular genetic data [5, 6] can be used to estimate the proportion of variance in a trait that is due to the linear genetic effects (so-called narrow-sense heritability). Using these and related methods, an overwhelming body of literature demonstrates that almost all important human traits, behaviors, and health outcomes are influenced both by genetic predisposition as well as environmental factors ([7, 8, 9]). Most of these traits are “genetically complex”, which means that the observed heritability is due to the accumulation of effects from a very large number of genes that each have a small, often statistically insignificant, influence [10]. Furthermore, genes often influence several seemingly unrelated traits (i.e. they have “pleiotropic effects”) [11] and genetic correlations between many traits have been convincingly demonstrated [12], giving rise to unobserved variable bias in field studies that do not control for the genetic predisposition of individuals for the exposure and the outcome of interest.
One popular strategy to isolate causal effects in non-experimental data is to use instrumental variables (IVs), which “purge” the exposure of its correlation with the error term in the regression [13]. IVs need to satisfy two important assumptions. First, they need to be correlated with the exposure of interest conditional on the other control variables in the regression (i.e. IVs need to be “relevant”). Second, they need to be independent of the error term of the regression conditional on the other control variables and produce their correlation with the outcome solely through their effect on the exposure. In practice, finding valid IVs that satisfy both requirements is difficult. In particular, the second requirement (the so-called exclusion restriction) is challenging.
Epidemiologists have proposed to use genetic information to construct IVs and termed this approach Mendelian Randomization (MR) [14, 15, 16, 17]. The idea is in principle appealing because genotypes are randomized in the production of gametes by the process of meiosis. Thus, conditional on the genotype of the parents, the genotype of the offspring is the result of a random draw. If it were known which genes affect the exposure, it might be possible to use them as IVs to identify the causal influence of the exposure on some outcome of interest. Yet, there are four challenges to this idea. First, we need to know which genes affect the exposure and isolate true genetic effects from environmental confounds that are correlated with ancestry. Second, if the exposure is a genetically complex trait, any gene by itself will only capture a very small part of the variance in the trait, which leads to the well-known problem of weak instruments [18, 19]. Third, genotypes are only randomly assigned conditional on the genotype of the parents. Unless it is possible to control for the genotype of the parents, the genotype of the offspring is not random and correlates with everything that the genotypes of the parents correlate with (e.g. parental environment, personality, and habits) [20]. Fourth, the function of most genes is not completely understood. Therefore, it is difficult to rule out direct pleiotropic effects of genes on the exposure and the outcome, which would violate the exclusion restriction [16].
Recent advances in complex trait genetics make it possible to address the first two challenges of MR. Array-based genotyping technologies have made the collection of genetic data fast and cheap. As a result, very large datasets are now available to study the genetic architecture of many human traits and a plethora of robust, replicable genetic associations has recently been reported in large-scale genome-wide association studies (GWAS) [21]. These results begin to shed light on the genetic architecture that is driving the heritability of traits such as body height [22], BMI [23], schizophrenia [24], Alzheimer’s disease [25], depression [26], or educational attainment (EA) [27]. High-quality GWAS use several strategies to control for genetic structure in the population and indeed, empirical evidence suggests that the vast majority of the reported genetic associations for many traits are not confounded by ancestry [28, 29, 30, 31]. Furthermore, so-called polygenic scores (PGS) have become the favored tool for summarizing the genetic predispositions for genetically complex traits [32, 33, ?, 27]. PGS are linear indices that aggregate the effects of all currently measured genetic variants (typically single nucleotide polymorphisms, a.k.a. SNPs), and recent studies demonstrate the ability of PGS to predict genetically complex outcomes such as height, BMI, schizophrenia, and EA [34, 22, 23, 24, ?]. For example, a polygenic score for EA currently captures 4-6% of the variance in the trait and replicates extremely well across different hold-out samples [27]. Although PGS still capture substantially less of the variation in traits than suggested by their heritability [35] (an issue we return to below), PGS capture a much larger share of the variance of genetically complex traits than individual genetic markers. The third challenge could in principle be addressed if the genotypes of the parents and the offspring are observed (e.g. in a large sample of trios) or by using large samples of dizygotic twins where the genetic differences between siblings are random draws from the parents’ genotypes. However, the fourth challenge (i.e. pleiotropy) remains a serious obstacle despite recent efforts to relax the exogeneity assumptions in MR ([36, 37]).
Here, we present a novel method that we call Genetic Instrumental Variables (GIV) regression that can be implemented using widely available statistical software. In contrast to MR, GIV regression does not require strong assumptions about the causal mechanism of genes because it effectively controls for possible pleiotropic effects of genes. In particular, GIV regression is based on the insight that adding the true PGS for the outcome to a regression model would effectively eliminate bias arising from a genetic correlation between the outcome and an exposure of interest. Furthermore, we argue that the attenuated predictive accuracy of PGS is conceptually similar to the well-known problem of measurement error in regression analysis. Instrumental variable (IV) techniques can correct attenuation bias in regression coefficient estimates that results from measurement error [38]. We argue that it is possible to obtain a valid IV for a PGS by randomly splitting the GWAS sample that was used for its construction. Typically, a GWAS is used to estimate the effects of individual SNPs in a discovery sample. Then, the estimated effects are utilized as weights for the genetic data in an independent prediction sample. By splitting the GWAS sample into independent subsamples, one can obtain several PGS (i.e. multiple indicators) in the prediction sample. Each will have even lower predictive accuracy than the original score due to the smaller GWAS subsamples used in their construction, but these multiple indicators can be used as IVs for each other, and the instruments will satisfy the assumptions of IV regression to the extent that the measurement errors (the difference between the true and calculated PGS) are uncorrelated.
We show that it is possible under plausible assumptions to obtain consistent estimates of the narrow-sense heritability of a trait by using IV regression that utilizes two PGS that were constructed this way. Then we extend the idea to the problem of estimating causal effects in non-experimental data. We argue that using multiple indicators of the PGS of the outcome together with a PGS for the exposure produces IVs that come reasonably close to satisfying the assumptions of IV regression. Finally, we demonstrate how our approach can be straightforwardly extended to obtain causal estimates of gene-environment interactions (GxE) on outcomes.
We begin by laying out the assumptions of our approach and prove that GIV regression yields consistent estimates for the effect of the PGS on the outcome variable, when the other covariates in the model are exogenous and when the true PGS is uncorrelated with the error term net of the included covariates. We then turn to the more complex case in which a regressor of interest (T) is potentially correlated with unobserved variables in the error term because of pleiotropy, and we show that the bias under these assumptions with GIV regression is generally smaller than with OLS, MR, or what we will term an enhanced version of MR (EMR). We then use simulations to test how our approach behaves in finite samples under plausible assumptions about genetic correlations and then show how sensitive our method is to violations of the assumptions in comparison to MR and EMR.
Next, we demonstrate the practical usefulness of our approach in empirical applications using the publicly available Health and Retirement Study [39]. First, we demonstrate that a consistent estimate of the so-called chip heritability [35] of EA can be obtained with our method. Then, we estimate the effects of body height on EA. As a “negative control,” we check whether our method finds a causal effect of EA on body height (it should not).1
Theory
Assumptions
The method we describe builds on the standard identifying assumptions of IV regression [13]. In the context of our approach, this implies four specific conditions:
1. Complete genetic information: The available genetic data include all variants that influence the variable(s) of interest.
2. Genetic effects are linear: All genetic variants influence the variable(s) of interest via additive linear effects. Thus, there are no genetic interactions (i.e. epistasis) or dominant alleles.
3. Genome-wide association studies successfully control for population structure: In other words, the available regression coefficients for the genetic variants are not systematically biased by omitted variables that describe the genetic ancestry of the population. Failure to control for population structure can lead to spurious genetic associations [20].
4. It is possible to divide GWAS samples into non-overlapping sub-samples drawn from the same population as the sample used for analysis.
Estimating narrow-sense SNP heritability from polygenic scores
Under these assumptions, consistent estimates of the chip heritability of a trait2 can be obtained from polygenic scores (for full details, see Supporting Information section 2). If y is the outcome variable, X is a vector of exogenous control variables, and $S^*_{y|X}$ is a summary measure of genetic tendency for y in the presence of controls for X, then one can write
$$y = X\beta + G\zeta_{y|X} + \epsilon = X\beta + \gamma S^*_{y|X} + \epsilon, \tag{1}$$
where G is an n×m matrix of genetic markers, and ζy|X is the m×1 vector of SNP effect sizes, where the number of SNPs is typically in the millions. If the true effects of each SNP on the outcome were known, the true genetic tendency $S^*_{y|X}$ would be expressed by the PGS for y, and the marginal R2 of $S^*_{y|X}$ in equation 1 would be the chip heritability of the trait. In practice, GWAS results are obtained from finite sample sizes that only yield noisy estimates of the true effects of each SNP. Thus, a PGS constructed from GWAS results typically captures far less of the variation in y than suggested by the chip heritability of the trait ([40]; [33]; [35]). This is akin to the well-known attenuation bias resulting from measurement error [41]. We refer to the estimate of the PGS from available GWAS data in the presence of controls for X as Sy|X, where
$$S_{y|X} = G\hat{\zeta}_{y|X}, \tag{2}$$
and substitute Sy|X for $S^*_{y|X}$ in equation 1. The variance of a trait that is captured by its available PGS increases with the GWAS sample size available to estimate ζy|X and converges to the SNP-based narrow-sense heritability of the trait in the limit if all relevant genetic markers were included in the GWAS and if the GWAS sample size were sufficiently large [35].
It has long been understood that multiple indicators can, under certain conditions, provide a strategy to correct regression estimates for attenuation from measurement error ([42]; [43]). IV regression using estimation strategies such as two stage least squares (2SLS) and limited information maximum likelihood (LIML) will provide a consistent estimate for the regression coefficient of a variable that is measured with error if certain assumptions are satisfied ([38]; [44]): (1) The IV is correlated with the problem regressor, and (2) conditional on the variables included in the regression, the IV does not directly cause the outcome variable, and it is not correlated with any of the unobserved variables that cause the outcome variable [38]. In general, these assumptions are difficult to satisfy. In the present case, however, GWAS summary statistics can be used in a way that comes close enough to meeting these conditions to measurably improve results obtainable from standard regression.
The most straightforward solution to the problem of attenuation bias is to obtain multiple indicators of the PGS by splitting the GWAS discovery sample for y into two mutually exclusive subsamples. This produces noisier estimates of the true PGS, with lower predictive accuracy, but the multiple indicators can be used as IVs for each other (SI appendix). Standard 2SLS regression using Sy1 as an instrument for Sy2 will then recover a consistent estimate of γ in equation 1.
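The split-sample construction described above can be sketched in a few lines of Python. This is a toy illustration under strong simplifying assumptions that are ours, not the paper's: genotypes are simulated as independent standardized markers (no linkage disequilibrium), SNP effects are estimated by one marginal regression per marker, and all sizes, the heritability value, and the names (`gwas`, `simulate`, `S_y1`, `S_y2`, `S_full`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_gwas, n_pred, m = 4_000, 4_000, 1_000      # toy sizes; real GWAS use far more SNPs
h2 = 0.4                                      # assumed chip heritability (illustrative)
zeta = rng.normal(0, np.sqrt(h2 / m), m)      # true SNP effects

def simulate(n):
    """Simulate standardized, mutually independent markers and a phenotype."""
    G = rng.standard_normal((n, m))
    y = G @ zeta + rng.normal(0, np.sqrt(1 - h2), n)
    return G, y

def gwas(G, y):
    """Per-SNP marginal OLS slopes (one regression per marker)."""
    Gc, yc = G - G.mean(axis=0), y - y.mean()
    return (Gc.T @ yc) / (Gc ** 2).sum(axis=0)

G_gwas, y_gwas = simulate(n_gwas)             # discovery sample
G_pred, y_pred = simulate(n_pred)             # independent prediction sample

# Randomly split the discovery sample into two non-overlapping halves.
perm = rng.permutation(n_gwas)
half1, half2 = perm[: n_gwas // 2], perm[n_gwas // 2:]

# Each half yields its own weights, hence its own indicator of the PGS.
S_y1 = G_pred @ gwas(G_gwas[half1], y_gwas[half1])
S_y2 = G_pred @ gwas(G_gwas[half2], y_gwas[half2])
S_full = G_pred @ gwas(G_gwas, y_gwas)        # score from the unsplit sample

r2 = lambda s: np.corrcoef(s, y_pred)[0, 1] ** 2
print(f"R2 full: {r2(S_full):.3f}  R2 half 1: {r2(S_y1):.3f}  R2 half 2: {r2(S_y2):.3f}")
```

In this setup each half-sample score predicts y less well than the full-sample score, and all of them fall short of the assumed heritability, which is the attenuation that the IV step is meant to correct.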
Assuming that the variables in equation 1 are standardized to have mean zero and a standard deviation of one, and further assuming that the controls contained in X do not correlate with genotype G, a consistent estimate of the chip heritability of y can now be obtained from $\hat{h}^2_y = \hat{\gamma}^2_{GIV}\,\rho(S_{y1}, S_{y2})$, where ρ is the correlation coefficient. The heritability estimate is not simply equal to $\hat{\gamma}^2_{GIV}$ because we regressed on the observed score $S_{y|X}$ instead of the true PGS $S^*_{y|X}$. Thus, we standardize with respect to the variance of $S_{y|X}$ instead of $S^*_{y|X}$, which inflates the squared coefficient by the factor $\mathrm{Var}(S_{y|X})/\mathrm{Var}(S^*_{y|X})$. Multiplying $\hat{\gamma}^2_{GIV}$ with the correlation between Sy1 and Sy2 recovers a consistent estimate for $h^2_y$ (see Supporting Information section 2.1).
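A minimal numerical sketch of this correction follows. It assumes, purely for illustration, that the two observed scores equal the true PGS plus independent errors of equal variance; the function names (`iv_slope`, `giv_heritability`) and all parameter values are hypothetical and not taken from the paper.

```python
import numpy as np

def iv_slope(y, x, z):
    """Just-identified IV (2SLS) slope of y on x, with z as the instrument."""
    yc, xc, zc = y - y.mean(), x - x.mean(), z - z.mean()
    return (zc @ yc) / (zc @ xc)

def giv_heritability(y, s1, s2):
    """Squared IV coefficient times corr(s1, s2), after standardizing all inputs."""
    std = lambda v: (v - v.mean()) / v.std()
    y, s1, s2 = std(y), std(s1), std(s2)
    gamma = iv_slope(y, s2, s1)               # s1 instruments s2
    rho = np.corrcoef(s1, s2)[0, 1]
    return gamma ** 2 * rho

rng = np.random.default_rng(0)
n, h2, err_var = 200_000, 0.4, 0.6            # illustrative values only
s_true = rng.normal(0, np.sqrt(h2), n)        # true PGS, variance equal to h2
y = s_true + rng.normal(0, np.sqrt(1 - h2), n)
s1 = s_true + rng.normal(0, np.sqrt(err_var), n)   # indicator 1, independent error
s2 = s_true + rng.normal(0, np.sqrt(err_var), n)   # indicator 2, independent error

h2_giv = giv_heritability(y, s1, s2)
r2_naive = np.corrcoef(y, s2)[0, 1] ** 2      # R2 of a single noisy score
print(f"true h2 = {h2}, GIV estimate = {h2_giv:.3f}, naive R2 = {r2_naive:.3f}")
```

Under these assumptions the naive R2 of one score is attenuated well below the assumed heritability, while the product of the squared IV coefficient and the correlation between the two scores lands close to it. The equal-error-variance condition corresponds to splitting the GWAS sample into halves of equal size.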
Reducing bias arising from genetic correlation between exposure and outcome
The logic from above can be extended to situations where the question of interest is not the chip heritability of y per se, but rather the effect of some non-randomized exposure on y (e.g. a behavioral or environmental variable, or a non-randomized treatment due to policy or medical interventions). We can rewrite equation 1 by adding a treatment variable of interest T, such that
$$y = \delta T + X\beta + \gamma S^*_{y|XT} + \epsilon, \tag{3}$$
where, for example, y is EA and T is body height. In each case, it is presumed that the outcome variable is to some extent caused by genetic factors, and the concern is that the genetic propensity for the outcome variable, $S^*_{y|XT}$, is also correlated with the exposure represented by T in equation (3). If $S^*_{y|XT}$ is not observed and controlled for in equation 3, $\hat{\delta}$ will be a biased estimate of the effect of T on y.3
In standard Mendelian randomization (MR), a measure of genetic tendency (ST) for a behavior of interest (T in equation 3) is used as an IV in an effort to purge the estimated effect of T of bias that arises from correlation between T and unobservable variables in the disturbance term, under the argument that the genetic tendency variable, e.g., the measured PGS ST, is exogenous ([46];[44]). One such example would be the use of a PGS for height as an instrument for height in a regression of EA on height. The problem with this approach is that the PGS for height will fail to satisfy the exclusion restriction if some of the genes affecting height also have a direct effect on EA (e.g. via healthy cell growth and metabolism) or if they are correlated with unmeasured environmental factors that affect EA.4 Note that this problem arising from pleiotropic effects of genes is not solved even if infinitely large GWAS samples were available.
The multiple indicator strategy described above provides multiple approaches for addressing the bias in MR. If the genetic propensity for y could be directly controlled in the regression, MR would provide less biased estimates of the effect of T. We refer to the combined use of Sy1 as a control and ST as an IV as “enhanced Mendelian Randomization” (EMR). Here, the PGS for y may have been estimated only with controls for X or also with controls for T; when we leave it unsubscripted below, we refer to either of these alternatives.5 However, controlling for Sy1 as a proxy for the true PGS is not adequate, both because it leaves a component of the true PGS in the error term, which causes the exclusion restriction assumption of MR to fail, and because the bias in the estimated coefficient of Sy1 also produces bias in the estimated coefficient of T. The bias arising from the use of a proxy for the true PGS is a form of omitted variable bias (SI appendix).
Violation of the exclusion restriction due to genetic correlation is potentially solved (or at least is less severe) when a third indicator of the PGS for y, i.e., Sy3 is used to instrument simultaneously both Sy1 and T in equation 1. However, the practical problem with using two indicators as the sole instrument is that their mutual correlation will be relatively high (depending on their reliability) and they are weak instruments for T. As a practical strategy, the best solution is arguably to use ST along with Sy2 (or Sy2 and Sy3) as instruments for Sy1 and T. ST will still violate the exclusion restriction to the extent that it is correlated with Guy1. However, the extent of the violation will be reduced by the presence of Sy1 in the regression. Arguably a strategy that both reduces the correlation between ST and ϵ (through the inclusion of Sy1 in the model) and eliminates or greatly reduces omitted variable bias through the inclusion of an instrument for Sy1 in the first stage equation will outperform MR in the estimation of a consistent effect of T that is purged of genetic correlation. Figure 1 illustrates the GIV regression strategy we propose.
Figure 1: Genetic Instrumental Variables (GIV) regression
As noted above, if not all relevant genetic effects are contained in the PGS (e.g. interaction effects, structural variants, or rare alleles may be missing given currently available GWAS data), the PGS instruments above will not perfectly satisfy the exclusion restriction to the extent that they are correlated with the omitted genetic variables. However, the above approach would generally be expected to reduce bias due to genetic correlation, given that a large fraction of heritability can be attributed to linear effects of common SNPs that are well tagged by currently available genotyping arrays [35, 47, 34, 48]. See the SI appendix for details.
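The four estimation strategies compared in this section (OLS, MR, EMR, and GIV regression) can be contrasted in a small simulation. The sketch below is not the simulation design from the SI appendix: it assumes that endogeneity arises only from a genetic correlation between the true PGS for y and for T, that measurement errors in the observed scores are mutually independent, and that there are no covariates X. The helper `tsls` and every parameter value are hypothetical choices made for illustration.

```python
import numpy as np

def tsls(y, endog, instr, exog=()):
    """Two-stage least squares via numpy.
    Returns coefficients ordered as [endog..., exog..., intercept]."""
    const = np.ones(len(y))
    X = np.column_stack([*endog, *exog, const])        # second-stage regressors
    Z = np.column_stack([*instr, *exog, const])        # excluded instruments + exogenous
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # first-stage fitted values
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

rng = np.random.default_rng(1)
n, delta, r_g = 200_000, 0.2, 0.5       # true effect of T and genetic correlation (assumed)
g = rng.multivariate_normal([0, 0], [[1, r_g], [r_g, 1]], size=n)
sy_true = np.sqrt(0.3) * g[:, 0]        # true PGS for y
sT_true = np.sqrt(0.5) * g[:, 1]        # true PGS for T, correlated with sy_true
T = sT_true + rng.normal(0, np.sqrt(0.5), n)
y = delta * T + sy_true + rng.normal(0, np.sqrt(0.5), n)

noise = lambda: rng.normal(0, np.sqrt(0.3), n)         # independent measurement errors
S_y1, S_y2, S_T = sy_true + noise(), sy_true + noise(), sT_true + noise()

ols = tsls(y, [T], [T])[0]                    # instruments equal regressors: plain OLS
mr = tsls(y, [T], [S_T])[0]                   # MR: S_T instruments T
emr = tsls(y, [T], [S_T], exog=[S_y1])[0]     # EMR: S_y1 added as a control
giv = tsls(y, [T, S_y1], [S_T, S_y2])[0]      # GIV: S_T and S_y2 instrument T and S_y1
print(f"true {delta}  OLS {ols:.3f}  MR {mr:.3f}  EMR {emr:.3f}  GIV {giv:.3f}")
```

In this deliberately favorable setup, the OLS, MR, and EMR estimates of the effect of T are pulled upward by the genetic correlation, whereas the GIV estimate stays close to the value used to generate the data. With correlated measurement errors or omitted genetic effects, as discussed in the text, the GIV estimate would no longer be expected to be unbiased.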
Gene-environment interactions
We next generalize equation (3) to the case of gene-environment interactions, where the effect of T varies with the PGS. In principle, these interactions could be extremely complicated and so for practical reasons, we focus here on obtaining plausible estimates of the linear interaction between the PGS for y and T. We rewrite equation (3) with an additional interaction term between T and the PGS (equation 4).
Now there are three endogenous variables, T, Sy1|XT, and T·Sy1|XT. Also, the disturbance term has now been elaborated to include a term that is a function of T, and so an additional PGS for y is needed as an additional instrumental variable. This additional PGS for y will allow the use of IV regression to estimate δ2. In the simulations described in section SI 3.3 (see SI Figure 14), GIV regression performs better than OLS, MR, or EMR in estimating the parameters of equation 4. Of course, the interaction term may not fully capture all gene-environment interactions involving T or other environmental variables. Correlations between the IVs and variables in the error term of equation 4 will violate the assumptions of IV regression. We address issues of violated assumptions below.
Supplementary Figure 1: Model with measurement errors and no additional endogeneity and methods 1,2,3,6 (see main text)
Supplementary Figure 2: Model with measurement errors and no additional endogeneity and methods 1,2,4,5 (see main text)
Supplementary Figure 3: Model with measurement errors and no additional endogeneity and methods 1,5,6,7 (see main text)
Supplementary Figure 4: Model with measurement errors, additional endogeneity, and methods 1,2,4,7 (see main text)
Supplementary Figure 5: Model with measurement errors, additional endogeneity, and methods 1,2,4,7 (see main text)
Supplementary Figure 6: Model with measurement errors, additional endogeneity, and methods 1,5,6,7 (see main text)
Supplementary Figure 7: Model with measurement errors and skewness in T
Supplementary Figure 8: Model with measurement errors and skewness in y
Supplementary Figure 9: Model with measurement errors and kurtosis
Supplementary Figure 10: Model with dependent measurement errors between multiple indicators of S
Supplementary Figure 11: Model with dependent measurement errors among T and multiple indicators of S (all correlations assumed equally large)
Supplementary Figure 12: Model with correlations between common and rare genetic variances
Supplementary Figure 13: Model with parental effects
Supplementary Figure 14: Model with interaction term
Simulations
We explored the robustness of GIV regression in finite samples using a range of simulation scenarios (SI appendix). The simulations generate data from a set of known models, which we then analyze to produce coefficient estimates of the effect of the PGS for y on y and the effect of T on y. We produce these estimates using OLS, MR, EMR, and GIV regression, and compare these results with the true answer across a range of parameter values. The simulations specify that the true PGS for y and T are correlated and that the observed PGS for y and T are constructed with error. We make the conservative assumption that the entire genetic correlation between the traits is due to Type 1 pleiotropy, i.e. all genes that are associated with both phenotypes have direct effects on both.6 In practice, this is unlikely to be the case, but it is equally unlikely that one can put a credible upper bound on (or completely rule out) Type 1 pleiotropy. In one set of scenarios, we make the assumption that the entire endogeneity problem arises from the genetic correlation between y and T, a problem which would be solved if we could measure the PGS for these two phenotypes without error. In a second set of simulations, we make the additional assumption that endogeneity arises from other (e.g., environmental) sources that cause the disturbance term in the structural equation for y to be correlated with the disturbance term in the structural model for T even if the true PGS for y and for T were in the respective structural equations. In a third set of simulations, we assume that the genetic factors that affect T are correlated with the environmental factors in the disturbance term for y, as would be the case if parental genes, which affect the PGS for T, also either cause or select for environmental factors that affect y net of T and the PGS for y.
We conducted simulations which alternately specify that the effects of T and the PGS for y are additive and that the effects interact (i.e., where the effect of T on y depends on the PGS for y). The simulations alternately assume that the underlying distributions for the errors in the structural equations for y and T are multivariate normal and deviate from normal via the introduction of skew and kurtosis. They also alternately assume that the errors in the equation for the observed PGS for T and the multiple observed PGS for y are independent or correlated. Finally, we simulate the scenario where even the true PGS for y and T fail to capture all the genetic effects on y and T because they omit rare genetic variants, and where the rare variants for y are correlated with the rare variants for T.
The details of these simulation results are described in the SI appendix. The results provide considerable support for the claim that GIV regression greatly improves our ability to estimate the effects of variables that may have a causal effect on an outcome variable but where genetic correlation and other forms of endogeneity are present. When the only problem is measurement error in the PGS for y and T, GIV regression produces results that bracket the true answer. GIV regression also provides accurate results when the errors of the two structural models are correlated for reasons beyond measurement error in the PGS. Skewness and kurtosis in the distributions of the errors have little effect on the quality of GIV regression estimates.
When the measurement errors of the PGS are correlated (i.e. when overlapping GWAS samples were used to construct the PGS for y and the genetic IVs), the exclusion restriction is violated and GIV regression estimates are biased. However, we find that GIV regression still outperforms MR or EMR for small to moderately correlated measurement errors (ρ < 0.5). It is also encouraging to find that missing genetic variants from the PGS for y and T do not lead to noteworthy bias in the GIV regression estimate for the effect of T on y. Finally, we find that in situations where endogeneity is induced by the effects of or correlation between parental genes and the environment of the parents’ children, estimates from all methods are, as expected, biased. However, we still find that GIV regression outperforms all other methods in terms of the size of the bias of the estimated effect of T on the outcome. In the next section, we discuss these scenarios and our results in greater detail.
Violated assumptions
We now elaborate on the assumptions that GIV regression is based on, and discuss what our simulations tell us about how GIV regression performs under potential violations in comparison with OLS, MR, and EMR.
1. Complete genetic information: Current GWAS are based on two technologies to obtain genetic data. First, so-called genotyping arrays are used to extract information from DNA samples for a selected sub-set of genetic markers. Array technologies allow high throughput and are substantially cheaper than sequencing the entire human genome, which mostly consists of genetic information that does not vary among humans. Instead, array technologies focus on genetic markers that are known to vary within or across specific human populations. Second, one makes use of the fact that genetic markers which are physically close to each other on a chromosome tend to be correlated. This allows genotyping arrays to focus on one or a few SNPs per region that represent the genetic variations (so-called haplotypes) which can be found among humans. Next, information from fully sequenced reference samples is used to impute the missing SNPs [49]. This approach yields highly accurate information for common genetic polymorphisms [50]. However, genotyping and imputation accuracy attenuate strongly for rare polymorphisms as well as for so-called structural genetic variants (e.g. deletions, insertions, inversions, copy-number variants) that are not directly included in the genotyping array. Newer genotyping arrays tend to capture more and better selected polymorphisms than older arrays. Furthermore, increasing sample sizes of completely sequenced reference populations allow imputation of missing genetic variants with ever increasing accuracy [50]. Nevertheless, this implies that the assumption of complete genetic information is violated in practice, although this is likely to be a temporary issue. Another implication of this assumption is that it will be important in practice to ensure that all PGS used in GIV regression are constructed from the same or at least from largely overlapping sets of SNPs.
While it is not possible to know the impact of genetic variants that are not yet included in GWAS data, recent research [47] finds that the 1000 Genomes imputed data imply very little bias for our method arising from correlation between missing genetic information and the SNPs used to estimate the PGS for height, because the 1000 Genomes imputed data contain almost all the narrow-sense heritability of these traits. Specifically, we used Yang et al.’s results to infer the effects of rare variants on y (and also on T) in the SI appendix, and we then computed the bias via simulations using a range of correlations between common and rare genetic variants. The simulation shows that our results are robust across a range of plausible values for these correlations (Supplementary Figure 12).
2. Genetic effects are linear: Possible violations of this assumption could arise if non-linear genetic effects such as systematic gene-gene interactions (a.k.a. epistasis) ended up in the error terms of both scores or if y is affected by genetic dominance or by unmeasured genetic markers (e.g. very rare alleles or structural variants not included in the GWAS or the prediction sample). In other words, suppose that the true structural equation is
$$y = X\beta + \gamma S^*_{y|X} + f(G) + \epsilon, \tag{5}$$
where $S^*_{y|X}$ is the true PGS and f(G) includes interaction terms between the various genetic markers in G, the effects of unmeasured genetic markers and other nonlinear effects. The presence of f(G) in equation (5) may cause the exclusion restriction to be violated; Sy2 may be correlated with the disturbance term because the true PGS may be correlated with the non-zero interaction effects in f(G). This problem is not solved even if the measurement error in the PGS were essentially eliminated through the use of extremely large GWAS samples and equation (5) were estimated by ordinary least squares; the problem stems from the failure to control for (or find an instrument for) f(G). Imagine that the PGS perfectly captures the linear effects of the measured SNPs. If f(G) is uncorrelated with the PGS, then the estimated effect of the PGS will be consistent, but the proportion of variance in y that is statistically explained by genetic factors will be underestimated and the standard error of the effect of the PGS will be higher than if f(G) were observed. If f(G) is positively correlated with the PGS, then the true effect of the PGS will be overestimated and the total variance explained by genetic factors will be underestimated. If f(G) is negatively correlated with the PGS, then the true effect of the PGS will be underestimated and the total variance explained by genetic factors will also be underestimated. Epistasis certainly exists to some extent. However, the observed twin correlations for the majority of traits (69%) are consistent with a simple and parsimonious model where twin resemblance is solely due to additive genetic variation and where epistasis is therefore not a major problem [8].
3. Genome-wide association studies successfully control for population structure: Violations of this assumption lead to biased PGS or to PGS that predict y for non-genetic reasons (as when a model without population controls makes it seem as if Italians like pasta or the Chinese use chopsticks for genetic reasons) [20]. This can lead to a violation of the exclusion restriction if the population structure variables that are correlated with the PGS are not controlled for in the PGS or the structural model, and if these population variables affect the outcome of interest. Multiple indicators of the PGS would not resolve this omitted variable bias because each of these indicators would also be correlated with the omitted variables.
4. It is possible to divide GWAS samples into non-overlapping sub-samples drawn from the same population as the sample used for analysis. In principle, this assumption seems unproblematic: the availability of large-scale, population-based, genotyped datasets such as the UK Biobank makes it straightforward to randomly split the sample into parts and to exclude genetically related or identical observations. One practical issue is that one may want to use results from published GWAS studies to construct polygenic scores. In this case, it should be verified that the genetic architecture of the trait is identical in the GWAS results and the analysis sample (e.g. using bivariate LD score regression [12]). Furthermore, most GWAS studies are conducted as a meta-analysis of summary statistics from various samples. Meta-analysis circumvents the legal, practical, and logistic challenges that would have to be overcome to pool data from several providers at one central location, and it is as statistically powerful as analyzing the raw data directly [51]. However, the meta-analysis approach makes it difficult to check whether the same or closely related individuals have been included in several samples. It is currently unknown whether and to what extent such hidden overlap between GWAS samples is a real issue. We explore in SI section 3.2.2 the consequences of a correlation between the measurement errors of the polygenic scores. As can be seen from Supplementary Figures 10 and 11, when the measurement errors for Sy1 and Sy2 are not independent, all methods produce biased estimates. When the correlation between the measurement errors of the PGS for y is small, GIV regression outperforms the other methods. When the correlation becomes moderate to strong, none of the methods produce accurate estimates.
When the measurement errors in Sy1 and Sy2 are correlated with the measurement error in T, there is a region of small to moderate correlation strength in which EMR performs better than GIV regression or MR. When the correlation is strong, none of the methods produce accurate estimates.
5. The PGS for the outcome is uncorrelated with omitted inherited environmental factors that affect the outcome. An example of potential bias stemming from omitted inherited environmental factors would arise from the correlation between the PGS for height and the PGS for parent’s height, which is correlated with parent’s height, under conditions when parents get an environmental effect from height (e.g., higher pay for being taller) that affects the quality of the childhood health environment that could be correlated both with child’s height and with child’s EA. Violation of the exclusion restriction would be avoided by controlling for parental height, or for the parental resources that are consequences of parental height. More generally, however, there might be other causal pathways between parental genetic factors that affect the child’s environment in ways that affect her EA (e.g., via the BMI of parents). One alternative strategy for blocking these pathways would be to construct and control for a parental PGS for the child’s EA. However, large family samples including biological parents and their offspring would be required for this. Such samples are still very rare and often not available in the public domain. Furthermore, in the absence of sufficiently large GWAS samples to estimate parental PGS with high accuracy, controlling for the observed parental PGSs would not be perfect, for the same reasons described throughout this paper in the context of a person’s own PGS (though it would then in principle be possible to pursue a multiple indicator strategy for parental PGS as we do for the child’s PGS). 
Whether it would be preferable to control for the parental PGS (and to use an additional indicator of the PGS to obtain consistent estimates of the effect of the parental PGS on the outcome) or for parental phenotypical characteristics that might affect a person’s life course environment, or for the environmental characteristics themselves would depend upon whether the causal pathways are well-enough understood and whether sufficient information is available about a person’s environment, her parent’s phenotypical characteristics, or her parent’s PGS for the person’s outcome of interest in the analysis.
To assess the potential impact of bias from omitted inherited environmental factors that are correlated with the IVs, we carried out a final simulation where we assumed that a genetic marker of the parents (PT) affects y net of the PGS for y and T. We assumed a range of values for the effect of PT, allowing it to range from zero to the same size as the effect of T in our structural model (equation 3). Supplementary Figure 13 shows that GIV regression generally outperforms OLS, MR, and EMR, but that all methods produce substantial bias if the reduced form effect of PT, which affects y entirely indirectly through omitted environmental factors, rivals the effect of T on y. In practice, the problem is unlikely to be this large; even if the indirect effects of parental genes through their effect on the child’s environment are sizable, much of the bias can be removed from the estimation via controls for these consequential environmental variables, for the parental phenotypical characteristics that produce these environmental effects (e.g., parental education or income or height), or for the parental genotype (via a PGS for the genetic effects of parents on y), or through the use of data on twins that allows an effective control of parental genotype via the estimation of within-family regression models.
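The sensitivity to correlated measurement errors discussed under assumption 4 can be illustrated with a small simulation. The following is a minimal sketch and not the simulation design of the SI: it assumes that each score equals a common true score plus noise, draws the two noise terms directly with a chosen correlation `rho`, and uses hypothetical values for the sample size and the variances. Under these assumptions the simple IV ratio converges to the true coefficient multiplied by var(true PGS) / (var(true PGS) + cov(e1, e2)), so it recovers the true coefficient only when `rho` is zero.

```python
import numpy as np


def giv_slope(rho, n=200_000, h2=0.2, gamma=1.0, seed=0):
    """IV slope of y on S1, instrumented by S2, when their errors correlate.

    All parameter values are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    s_true = rng.normal(0.0, np.sqrt(h2), n)  # true PGS
    # Measurement errors: variance h2 each, correlation rho between them.
    cov_e = h2 * np.array([[1.0, rho], [rho, 1.0]])
    e = rng.multivariate_normal([0.0, 0.0], cov_e, size=n)
    s1 = s_true + e[:, 0]
    s2 = s_true + e[:, 1]
    y = gamma * s_true + rng.normal(0.0, np.sqrt(1.0 - h2), n)
    # Simple IV (ratio) estimator: cov(S2, y) / cov(S2, S1).
    return np.cov(s2, y)[0, 1] / np.cov(s2, s1)[0, 1]


for rho in (0.0, 0.5, 0.9):
    print(f"rho = {rho:.1f}   GIV slope = {giv_slope(rho):.3f}")
```

With independent errors the slope is close to the true value of 1, and it shrinks as the error correlation grows, which mirrors the qualitative pattern described for Supplementary Figures 10 and 11.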
Empirical applications
We illustrate the practical use of GIV regression in empirical applications using data from the Health and Retirement Study (HRS) for 8,638 unrelated individuals of North-West European descent who were born between 1935 and 1945 (SI appendix).
The narrow-sense SNP heritability of educational attainment
First, we demonstrate that GIV regression can recover the genomic-relatedness-matrix restricted maximum likelihood (GREML) estimate of the chip heritability of EA.
Specifically, we follow the common practice in GREML estimates of heritability and analyze the residual of EA from a regression of EA on birth year, birth year squared, gender, and the first 10 principal components from the genetic data [52]. Next, we standardize the residual and regress it on a standardized PGS for EA using OLS or GIV. The results are displayed in Table 1. In the standard OLS estimate, the PGS explains 6.3% of the variance in EA (with a 95% confidence interval of +/- 1.8%), which is similar to the results reported by the Social Science Genetic Association Consortium [27]. This is substantially lower than the 17.3% (95% CI +/- 4%) estimate of chip heritability reported by [52] in the same data using GREML. Columns 2 and 3 show GIV regression results using one of the two subset PGS as the covariate and the other as the instrumental variable. Since we are regressing on a PGS that contains measurement error (rather than on the true PGS), the squared standardized coefficient does not equal the chip heritability of EA. Instead, a downward correction of the estimate is required to obtain an unbiased estimate (see SI appendix for details). Applying this correction, the GIV regression results in columns 2 and 3 imply a chip heritability of 8.8% (95% CI +/- 3.4%) and 15.5% (95% CI +/- 4.8%), respectively. Note that we obtain higher estimates of chip heritability when we use the UKB score for EA in the structural model. The UKB score is derived from a GWAS on just one relatively homogeneous population sample, whereas the SSGAC score is the result of a meta-analysis across many different cohorts. It is reasonable to assume that the meta-analyzed samples often had genetic correlations of lower than 1, which would tend to attenuate the predictive accuracy of the SSGAC score in the HRS [52] and, as a result, also attenuate the GIV estimate of chip heritability.
The estimate using the UKB score for EA in the structural model is consistent with the GREML estimate from [52].
Table 1: Effects of the PGS on educational attainment in the HRS subsample
The relationship between body height and educational attainment
Previous studies using both OLS and sibling or twin fixed effects methods have found that taller people generally have higher levels of EA [53, 54, 55]. Taller people also tend to fare better in various other life domains, including higher earnings, higher marriage rates for men (though with higher probabilities of divorce), and higher fertility [56, 57, 58, 59, 60]. The question is what drives these results. Can they be attributed to genetic effects that jointly influence these outcomes? Are there social mechanisms that systematically favor taller or penalize shorter individuals? Or are there non-genetic factors (e.g., the uterine and post-birth environments especially related to nutrition or disease) that affect both height and these life course outcomes? The literature on the relationship between height and EA has found evidence that the association arises largely through the relationship between height and cognitive ability, which may suggest that the height-EA association is driven largely by a genetic association between height and cognitive ability. We use GIV regression with individual-level data from the HRS to clarify the influence of height on EA, and we compare these results with those obtained from OLS and from MR. In addition, we conduct a “negative control” experiment that estimates the causal effect of EA on body height (which should be zero). A complete description of the materials and methods is available in the SI Appendix.
GWAS summary statistics for height were obtained from the Genetic Investigation of ANthropometric Traits (GIANT) consortium [22] and by running a GWAS on height using the interim release of genetic data in the UK Biobank [61], which was not part of the GIANT sample. We refer to these as Height_GIANT and Height_UKB, respectively. GWAS summary statistics for EA were obtained from the Social Science Genetic Association Consortium (SSGAC). The most recent study of the SSGAC on EA used a meta-analysis of 64 cohorts for genetic discovery and the interim release of the UKB for replication [27]. We refer to these samples as EA_SSGAC and EA_UKB, respectively. There is an overlap in the cohorts between Height_GIANT and EA_SSGAC. To ensure independence of measurement errors in the PGS, whenever one of the two was used as regressor, we excluded the other as instrument and used a PGS from UK Biobank data instead.
The OLS results in Table 2 show that height (in meters) appears to have a strong effect on years of EA, with two additional centimeters in height generating one additional month of EA. MR appears to confirm the causal interpretation of the OLS result; indeed, the point estimate from MR is even slightly larger than from OLS. As discussed above, MR suffers from probable violations of the exclusion restriction. These violations could stem from the possibility that some genes have direct effects on both height and EA (i.e. Type 1 pleiotropy). They could also stem from the possibility that the PGS for height by itself is correlated with the genetic tendency for parents to have higher EA and income, and therefore a lower nutritional or disease risk for their children, who therefore are more likely to reach their full cognitive potential and have higher EA. Controlling for the PGS is an imperfect strategy for eliminating this source of endogeneity, because the bias in the estimated effect of the PGS also biases the estimated effect of height (the omitted variable bias discussed earlier).
Table 2: Regression of educational attainment on height in the Health and Retirement Study (HRS)
In contrast, estimates from GIV regression in Table 2 show both a considerably larger effect of the education PGS on EA, and a small and statistically insignificant effect of height on EA. These results imply that the positive correlation between height and EA is not a causal relationship. Rather, the observed phenotypic correlation is primarily due to the genetic correlation between the two traits. Furthermore, our “negative control” using GIV regression finds no causal effect of EA on height, as expected. One might also contrast our results with those of [53], who found a correlation between the height and EA of Finnish monozygotic (MZ) twins. Silventoinen et al’s study effectively controls for all genetic correlation between height and EA. However, their result would still suffer from endogeneity bias to the extent that the difference in MZ twin heights is related to intra-uterine or post-birth environmental differences that cause one twin to be taller and have higher cognitive or non-cognitive abilities than the other twin. GIV regression is arguably superior to twin fixed effects to the extent that these environmental variables are uncorrelated with the PGS for height once the PGS for EA is effectively controlled, which makes the combination of the PGS for height and the PGS for EA valid instruments.
Conclusion
Accurate estimation of causal relationships with observational data is one of the biggest and most important challenges in epidemiology and the social sciences, two fields of inquiry where many questions of interest cannot be adequately addressed with properly designed experiments due to practical or ethical constraints. Here, we have proposed a method that allows genetic data to be used for this purpose. Thinking of genetic data as a sort of naturally occurring experiment is appealing because the genotypes that arise from two mates are randomized by the process of meiosis. Thus, given that virtually all human traits are heritable to some extent, an individual’s genotype could in principle be used to identify causal effects across a wide range of important scientific questions. Thanks to cheap and accurate genotyping technologies and growing insights into the genetic architecture of many traits via large-scale GWAS, this general idea is becoming more and more feasible in practice. In principle, it is this idea which underlies so-called Mendelian Randomization (MR), a method suggested by epidemiologists that uses genetic data as instrumental variables.
The crucial identifying assumption of MR is that the genes which are used as instruments do not also affect the outcome through other causal pathways via so-called pleiotropic effects. In light of the widespread and often substantial genetic correlations between many traits, this assumption seems problematic. We have proposed a new strategy that we call genetic instrumental variable (GIV) regression, that eliminates or at least substantially reduces the bias of MR due to pleiotropy under a set of arguably more realistic assumptions. We have explored conditions where the assumptions underlying GIV regression will fail and conclude that GIV regression outperforms OLS and MR in a broad range of realistic scenarios.
The simulations described in the paper certainly do not cover all conceivable data generating processes, but we would argue that they are nonetheless of considerable utility in assessing the performance of GIV regression. Analyses with real data demonstrate that GIV regression recovers estimates of the effect size of the outcome PGS that are consistent with alternative approaches to estimate the extent of narrow-sense heritability. Our analyses also provide reason to be cautious when using OLS or MR to estimate causal effects between variables that are known to be genetically correlated. Existing knowledge about the effects of epistasis, rare or dominant alleles, structural variants, or population structure provides good grounds to be cautiously optimistic that GIV regression is an important tool for assessing causal effects when unmeasured genetic correlation is likely to be a serious issue. In particular, constant improvements in genotyping technology, increasing GWAS samples, and even better statistical methods to control for population structure in GWAS will make it less and less likely in the future that the assumptions underlying our approach will be seriously violated. Additional knowledge in this rapidly developing field will provide further guidance for assessing the extent of remaining bias in GIV regression estimates. The combination of new estimation tools and continued rapid advancements in genetics should provide a significant improvement in our understanding of the effects of behavioral and environmental variables on important socioeconomic and medical outcomes.
1 Introduction
The Supplementary Information for this article consists of four sections. In section 2, we provide technical details of Genetic Instrumental Variables (GIV) regression. In section 3, we describe a set of simulations to illustrate and explore how GIV regression performs in finite samples when the model assumptions are satisfied or violated in various ways. Section 4 describes the data and methods used for our empirical examples. In section 5, we provide supplementary information about the empirical examples described in the article.
2 Genetic Instrumental Variables (GIV) regression
2.1 Estimating narrow-sense SNP heritability from polygenic scores
We begin by showing that consistent estimates of the chip heritability of a trait (i.e. the proportion of variance in a trait that is due to linear effects of currently measurable SNPs) can be obtained from polygenic scores. If y is the outcome variable, X is a vector of exogenous control variables, and the true polygenic score (PGS) is a summary measure of the genetic tendency for y in the presence of controls for X, then one can write
where, for example, y is educational attainment. Typical variables in X would be age, gender, and the first ten principal components in the genetic data as controls for population structure. If the heritability of y is caused by a large number of genetic loci, each with a very small effect [1], we call y a “genetically complex trait.” In this situation, the genetic liability for y cannot be adequately represented by just one gene. Rather, it is preferable to approximate the genetic liability with a polygenic score (PGS). The weights of the SNPs that are summed up in the PGS are obtained from a GWAS on y in an independent sample [2, 3]. In a GWAS, y is regressed on each SNP separately, typically including a set of control variables such as age, sex, and the first few principal components of the genetic data to control for population structure [4]. Thus, the obtained estimates for each SNP do not account for correlation between SNPs (a.k.a. linkage disequilibrium, LD), which may bias the PGS. In practice, several solutions are available to deal with this challenge, including pruning SNPs for LD prior to constructing the score [5] or using a method that explicitly takes the LD structure between SNPs into account (e.g. LDpred, see [6]). The scores themselves (Sy|X) are linear combinations of the elements in G weighted by the estimated coefficients obtained from
where G is an n×m matrix of genetic markers and the coefficient vector is the m×1 vector of LD-adjusted estimated effect sizes. The number of SNPs (m in equation 2) is typically in the millions. If the true effects of each SNP on the outcome were known, the true genetic tendency would be expressed by the PGS for y, and the marginal R2 of the true PGS in equation 1 would be the chip heritability of the trait. In practice, GWAS results are obtained from finite sample sizes that only yield noisy estimates of the true effects of each SNP. Thus, a PGS constructed from GWAS results typically captures far less of the variation in y than suggested by the chip heritability of the trait ([7]; [2]; [8]). We refer to the estimate of the PGS from available GWAS data as Sy|X, and substitute Sy|X for the true PGS in equation 1. The variance of a trait that is captured by its available PGS increases with the GWAS sample size available to estimate ζ and converges to the true narrow-sense heritability of the trait in the limit if all relevant genetic markers were included in the GWAS and if the GWAS sample size were sufficiently large [8].
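The construction of a PGS from GWAS estimates can be sketched in a few lines. This is a simplified illustration under assumptions that are not part of the original analysis: genotypes are drawn as independent standard normal variables (so there is no LD and no need for pruning or LDpred), there are no control variables, and the sample sizes, the number of markers, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gwas, n_target, m, h2 = 2_000, 1_000, 200, 0.5   # illustrative values only

zeta = rng.normal(0.0, np.sqrt(h2 / m), m)          # true per-SNP effects


def simulate(n):
    """Draw standardized genotypes G and a phenotype y = G @ zeta + noise."""
    G = rng.standard_normal((n, m))
    y = G @ zeta + rng.normal(0.0, np.sqrt(1.0 - h2), n)
    return G, y


G_gwas, y_gwas = simulate(n_gwas)        # discovery (GWAS) sample
G_target, y_target = simulate(n_target)  # independent prediction sample

# "GWAS": regress y on each SNP separately, giving one slope per marker.
Gc = G_gwas - G_gwas.mean(axis=0)
yc = y_gwas - y_gwas.mean()
zeta_hat = (Gc.T @ yc) / (Gc ** 2).sum(axis=0)

# PGS in the prediction sample: genotypes weighted by the estimated effects.
pgs = G_target @ zeta_hat
r2 = np.corrcoef(pgs, y_target)[0, 1] ** 2
print(f"out-of-sample R^2 of the PGS: {r2:.3f} (simulated heritability: {h2})")
```

Because the weights are estimated from a finite discovery sample, the score explains less of the variance in the prediction sample than the simulated heritability, which is the attenuation the text describes.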
As reported in [9] and [2], the explained variance in a regression of a phenotype on its PGS can be expressed as
$$R^2 = \frac{(h^2)^2}{h^2 + m/n}$$
where y is standardized, $h^2$ is the genetic variance of y (i.e., the proportion of the variance in y explained by G), n is the sample size, and m is the number of genetic markers. For example, a PGS for EA based on a GWAS sample of 100,000 individuals would be expected to explain about 4% of the variance of EA in a hold-out sample (assuming there are 70,000 effective loci, all of them included in the GWAS, and a chip heritability of 20% [9]), even though the estimated total heritability of EA in family studies is roughly 40% [10].
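The numerical example above can be checked with a short calculation. The sketch below assumes the widely used approximation R2 = h2 · h2 / (h2 + m/n) for the expected out-of-sample explained variance of a PGS; this functional form is an assumption chosen because it reproduces the figure quoted in the text, and the function name `expected_pgs_r2` is hypothetical.

```python
def expected_pgs_r2(h2: float, n: int, m: int) -> float:
    """Expected hold-out R^2 of a PGS (assumed form: h2 * h2 / (h2 + m / n)).

    h2: chip heritability of the standardized trait
    n:  GWAS discovery sample size
    m:  number of (effective) genetic markers
    """
    return h2 * h2 / (h2 + m / n)


# Values from the text: 100,000 individuals, 70,000 effective loci, h2 = 20%.
r2_example = expected_pgs_r2(h2=0.20, n=100_000, m=70_000)
print(f"expected R^2: {r2_example:.3f}")

# With a much larger GWAS the expected R^2 approaches the chip heritability.
r2_large = expected_pgs_r2(h2=0.20, n=1_000_000_000, m=70_000)
print(f"expected R^2 with a very large GWAS: {r2_large:.4f}")
```

The first call gives 0.04 / 0.9, i.e. roughly 4%, and the second shows the convergence toward the chip heritability as the discovery sample grows.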
It has long been understood that multiple indicators can, under certain conditions, provide a strategy to correct regression estimates for attenuation from measurement error ([11]; [12]). Instrumental variables (IV) regression using estimation strategies such as two stage least squares (2SLS) and limited information maximum likelihood (LIML) will provide a consistent estimate for the regression coefficient of a variable that is measured with error if certain assumptions are satisfied ([13]; [14]): (1) The IV is correlated with the problem regressor, and (2) conditional on the variables included in the regression, the IV does not directly cause the outcome variable, and it is not correlated with any of the unobserved variables that cause the outcome variable [13]. In general, these assumptions are difficult to satisfy. In the present case, however, GWAS summary statistics can be used in a way that comes close enough to meeting these conditions to measurably improve results obtainable from standard OLS regression and from standard Mendelian Randomization (MR) [15].
Multiple indicators of the PGS provide a theoretical solution to the problem of attenuation bias, and, we argue, a practical solution as well. The most straightforward solution to the problem is to split the GWAS discovery sample for y into two mutually exclusive subsamples. This produces noisier estimates of the SNP effects, with lower predictive accuracy. However, it also produces an IV for Sy|X that has desirable properties. Formally, we estimate one coefficient vector for ζy in equation 2 from the first training sample and a second coefficient vector from the second training sample. It follows then that
for the j-th genetic marker, where uy1|X and uy2|X are asymptotically normally distributed errors with E(uy1j|X) = E(uy2j|X) = 0, and where xj is the observed number of reference alleles for location j. Because the two discovery samples are non-overlapping, uy1|X and uy2|X would be independent of each other if the PGS model is correctly specified (we return to this point below). By applying the two vectors of estimated coefficients, we obtain two PGS,
where G is the matrix of genetic markers for the analytical sample. We then rewrite equation 1 in terms of the observed first PGS as
As can be seen from equation 5, the PGS Sy1|X is correlated with the error term via its correlation with Guy1|X from equation 4. However, under the assumptions that equation (2) accurately describes the relationship between G and y and that the genetic architecture of the trait is identical across GWAS and prediction samples, Sy2|X meets the two requirements to be a valid instrument for Sy1|X, namely that it is correlated with Sy1|X (through their mutual dependence on the true PGS) and uncorrelated with the disturbance term, because neither the true PGS nor Guy2|X is correlated with Guy1|X. Therefore, the covariance of Guy1|X and Guy2|X is
Equation (6) holds because uy1i|X and uy2j|X, being random measurement error, are independent of the genetic markers and uncorrelated with each other, under the assumption that equation (2) is the correct specification of the relationship between G and y. It follows, therefore, that the IV Sy2|X is uncorrelated with the error term in equation 5, i.e.,
At the same time, Sy2|X is correlated with Sy1|X through their common dependence on the true PGS. Under the assumptions that X is uncorrelated with ϵ net of the true PGS, and that the true PGS is uncorrelated with ϵ net of X, Sy2|X is a valid IV for the estimation of γ in equation 5.
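The argument above can be illustrated with a simulation. The following is a minimal sketch under simplifying assumptions that are not part of the original analysis: genotypes are independent standard normal draws (no LD), there are no controls X, the GWAS noise of the two subsamples is drawn directly as independent normal errors of equal variance, and all numerical values are hypothetical. It compares the OLS slope of y on the first score with the simple IV estimator that uses the second score as the instrument.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5_000, 1_000        # analysis-sample size and number of markers (assumed)
h2, gamma = 0.2, 1.0       # variance of the true PGS and its true coefficient

G = rng.standard_normal((n, m))                 # standardized genotypes
zeta = rng.normal(0.0, np.sqrt(h2 / m), m)      # true marker effects
u1 = rng.normal(0.0, np.sqrt(h2 / m), m)        # estimation error, GWAS subsample 1
u2 = rng.normal(0.0, np.sqrt(h2 / m), m)        # estimation error, GWAS subsample 2

s_true = G @ zeta          # true PGS
s1 = G @ (zeta + u1)       # first noisy PGS  (true PGS + G @ u1)
s2 = G @ (zeta + u2)       # second noisy PGS (true PGS + G @ u2)
y = gamma * s_true + rng.normal(0.0, np.sqrt(1.0 - h2), n)


def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.cov(a, b)[0, 1]


gamma_ols = cov(s1, y) / cov(s1, s1)   # attenuated by measurement error in s1
gamma_giv = cov(s2, y) / cov(s2, s1)   # s2 instruments s1

print(f"true gamma: {gamma:.2f}")
print(f"OLS on S1:  {gamma_ols:.2f}")
print(f"GIV (S2 as instrument for S1): {gamma_giv:.2f}")
```

In this design the measurement-error variance equals the variance of the true score, so OLS is attenuated toward roughly half the true coefficient, while the IV ratio stays close to it because the errors of the two scores are independent.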
The above derivation assumes that the true coefficients of the genetic markers in G do not vary in the population. More generally, we might assume that the population consists of a finite number of (possibly latent) groups, k = 1, …, K, with the kth group having its own true polygenic score. Absent information about the specific number of groups and the group memberships of individuals in any specific population, the polygenic score that would be estimated from a sufficiently large sample from that population would be a weighted average of the scores for each group, with the weights dependent on the proportion each group is of the total population [13]. Any population P therefore can be characterized in terms of its group composition, p1, p2, …, pK. The above results apply straightforwardly when the PGS are estimated and analyzed using samples from a single group. When they are instead estimated on a population that is a mixture of groups, the situation is more complicated. The true PGS for any individual who is in group k can be expressed as
where P = {p1, p2, …, pK} is the group composition that defines population P and Δyk|X is the deviation between the group k specific PGS for trait y and the population average (for population P). Under this elaboration, equation 5 can be written as
where the equation involves the true PGS for trait y for individual i in group k and the first polygenic score estimated using coefficients from the GWAS sample drawn from population P. Variation in true PGS by group creates the possibility that the exclusion restriction will be violated. If the second polygenic score is used as the IV, it is correlated with Δyk|X to the extent that the true PGS differ by group and to the extent that the weighted average deviation of the true PGS estimated from each individual’s group and the true PGS estimated from the other groups correlates with the PGS for the population P. If the two PGS were estimated on one “pure” group and the analysis sample was for a second “pure” group, then the deviation between the two PGS would of course correlate with the PGS for one of the groups and the exclusion restriction would be violated unless the SNP coefficients of the PGS for the one group were the same as the SNP coefficients of the PGS for the other group. If the analysis sample and the GWAS samples are drawn from the same population (i.e., the same mixture of groups), we would expect the correlation between the deviations for analysis sample members (drawn from each of the groups in the same proportion as the GWAS sample) and the true PGS for the GWAS sample to be very small. If the population consists only of a single group or, equivalently, if all groups have the same SNP coefficients in their PGS for trait y, then the issue of group-specific heterogeneity in PGS disappears.
When PGS for y are used that were constructed with a different set of control variables than are used in the regression, the above results need to be modified. Let us assume that variables χ were controlled in the GWAS and variables X are controlled in the regression model. Then
where dyXχ is the vector of differences in the effects of genetic markers on y when X is controlled and when χ is controlled. If a finite sample PGS of y is constructed using χ as controls, i.e., Sy1|χ, and this finite sample PGS is used in place of Sy1|X as a proxy for the true PGS in model 1, one obtains
where
The problem now is that using Sy2|χ as an IV would violate the exclusion restriction to the extent that dyXχ differs from zero, because GdyXχ is both in Sy2|χ and in the error, and because the true PGS would generally be correlated with GdyXχ. The extent of bias would depend on the extent to which the effects of the genetic markers on y differ when X and when χ are controlled.
Once a consistent estimate for γ has been obtained, it is possible to derive an estimate of the narrow-sense SNP (or chip) heritability of y. In a univariate linear regression model with standardized variables, the squared regression coefficient is equal to R2. This follows directly from the definition of R2 as the variance of y explained by X as a fraction of total variance of y. Thus, γ2 in equation 1 can be thought of as the narrow-sense chip heritability of y if both y and the true PGS are standardized variables with mean zero and a standard deviation of one (assuming the controls included in X are not correlated with genotype G). In practice, however, the estimate of γ originates from a regression on a PGS that contains measurement error (Sy1|X or Sy2|X) rather than on the true PGS. In particular, the obtained regression coefficient will be standardized using the variance of Sy1|X or Sy2|X instead of the variance of the true PGS. It turns out that this implies that the heritability estimate is biased by a factor equal to
which simplifies to
if the observed score was standardized. However, it is possible to derive a simple error correction because one can estimate the variance of the true PGS by estimating the covariance of Sy1|X and Sy2|X:
With an estimate of the variance of the true PGS at hand, we can back out an unbiased heritability estimate:
When y is standardized, so that var(y) = 1, the error correction simplifies to
An estimate of the standard error of h2 can be obtained using the delta method [17].
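The logic of the error correction can be sketched numerically. The code below is an illustration under stated assumptions rather than a reproduction of the expressions above: it draws the true score and two independent measurement errors directly (no genotype matrix, no controls), standardizes y and the first score, and corrects the squared GIV coefficient by the ratio cov(S1, S2) / var(S1), which estimates the share of the variance of S1 that is due to the true score. The sample size, the variances, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, h2 = 200_000, 0.2       # illustrative sample size and true heritability

s_true = rng.normal(0.0, np.sqrt(h2), n)              # true PGS
y = s_true + rng.normal(0.0, np.sqrt(1.0 - h2), n)    # outcome with var(y) = 1
s1 = s_true + rng.normal(0.0, np.sqrt(h2), n)         # noisy score 1
s2 = s_true + rng.normal(0.0, np.sqrt(h2), n)         # noisy score 2 (independent error)


def standardize(v):
    """Rescale to mean zero and standard deviation one."""
    return (v - v.mean()) / v.std()


def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.cov(a, b)[0, 1]


y_std, s1_std = standardize(y), standardize(s1)

# GIV coefficient on the standardized noisy score, with s2 as the instrument.
gamma_giv = cov(s2, y_std) / cov(s2, s1_std)

h2_naive = gamma_giv ** 2                       # overstates heritability
signal_share = cov(s1, s2) / cov(s1, s1)        # estimates var(true PGS) / var(S1)
h2_corrected = h2_naive * signal_share          # downward correction

print(f"true h2:       {h2:.3f}")
print(f"naive gamma^2: {h2_naive:.3f}")
print(f"corrected h2:  {h2_corrected:.3f}")
```

In this design half of the variance of each observed score is measurement error, so the squared coefficient is about twice the simulated heritability before the correction and close to it afterwards.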
2.2 Reducing bias due to genetic correlation between exposure and outcome
The logic from above can be extended to situations where the question of interest is not the SNP heritability of y per se, but rather the influence of some non-randomized exposure on y (e.g. a behavioral or environmental variable, or a non-randomized treatment due to policy or medical interventions). We can rewrite equation 1 by adding a treatment variable of interest T, such that
where, for example, y could be educational attainment and T could be body height. In each case, it is presumed that the outcome variable is to some extent caused by genetic factors, and the concern is that the genetic propensity for the outcome variable is also correlated with the treatment represented by T in equation (7). If the true PGS is not observed, it is part of the disturbance term. Equation (7) is written without any interaction terms involving the true PGS and T, implying that while the effect of T may vary with the true PGS, δ is a (variance-weighted) average effect of T across values of the true PGS [13]. If the same (uncontrolled for) genetic tendencies that affect the outcome variable also affect or are otherwise correlated with T (e.g., if T is influenced by parental genes that are correlated with the true PGS), then the estimate of δ will be a biased estimate of the effect of T.
In standard MR, a measure of genetic tendency (ST) for a behavior of interest (T in equation 7) is used as an IV in an effort to purge the estimate of δ of the bias that arises from correlation between T and unobservable variables in the disturbance term, under the argument that the genetic tendency variable, e.g., the measured PGS ST, is exogenous ([18];[14]). Implicitly, the true PGS for y is in the error term, as is shown in equation 7. One such example would be the use of a PGS for height as an instrument for height in a regression of educational attainment on height. The second stage regression in MR, then, takes the form
The problem with this approach is that the PGS for height will typically fail to satisfy the exclusion restriction because of so-called Type 1 pleiotropy [15]: the genetic variants that predispose individuals to be tall may also directly increase the predisposition for higher educational attainment [19, 20] (e.g. via healthy cell growth and metabolism). This problem is not solved even if we could use the true PGS as the IV.
The multiple indicator strategy described above provides potentially attractive approaches for addressing the bias in MR. If the genetic propensity for y could be directly controlled in the regression, MR would provide less biased estimates of the effect of T. We refer to the combined use of Sy1|XT as a control and ST as an IV as “enhanced Mendelian Randomization” (EMR). However, controlling for Sy1|XT as a proxy for the true PGS $S^*_{y|XT}$ is not adequate, both because it leaves a component of $S^*_{y|XT}$ in the error term, which causes the exclusion restriction assumption of MR to fail, and because the bias in the estimated coefficient of Sy1|XT also produces bias in the estimated coefficient of T. The bias arising from the use of a proxy for $S^*_{y|XT}$ as a control variable in OLS regression is a form of omitted variable bias. To see this, assume that ST would be a valid instrument for T if $S^*_{y|XT}$ were measured and controlled. In this case, the second stage of 2SLS would involve the substitution of $\hat{T}$ for T, and the regression would give a consistent estimate of δ if $S^*_{y|XT}$ were observed, i.e.
$$y = \delta \hat{T} + \gamma S^*_{y|XT} + \delta (T - \hat{T}) + \epsilon = \delta \hat{T} + \gamma S_{y1|XT} + \gamma v_1 + \delta (T - \hat{T}) + \epsilon,$$
where
$$v_1 = S^*_{y|XT} - S_{y1|XT}$$
and where (for simplicity) we drop other covariates from the model.3 If v1 is omitted from the regression, then the bias in δ and γ is equal to the product of γ and the regression coefficients of the regression of v1 on $\hat{T}$ and Sy1|XT.
More generally, if a set of variables z is omitted from a regression of y on a vector x, then
$$y = X\beta + Z\lambda + \epsilon$$
and
$$E[b \mid X, Z] = \beta + (X'X)^{-1}X'Z\lambda,$$
where b is the OLS estimator from the regression of y on x alone, and X and Z are the matrices of included and omitted variables. The expression (X′X)−1X′Z gives the matrix of coefficients from regressions of each of the omitted variables on the included variables, and λ is the vector of coefficients of the variables in z in the regression of y on x and z. If z consists of a single omitted variable, then λ is a scalar, and (X′X)−1X′z is the vector of estimated regression coefficients of z on the included variables x.
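The omitted variable formula holds as an exact in-sample identity when the population coefficients are replaced by the OLS estimates from the long regression. The sketch below checks this on simulated data; the variable names and coefficient values are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Included regressors X and a single omitted variable z that is correlated with X.
X = rng.standard_normal((n, 2))
z = 0.6 * X[:, 0] - 0.3 * X[:, 1] + rng.standard_normal(n)
y = X @ np.array([1.0, -0.5]) + 0.8 * z + rng.standard_normal(n)

def ols(A, b):
    """OLS coefficients of b on the columns of A."""
    return np.linalg.lstsq(A, b, rcond=None)[0]

b_long = ols(np.column_stack([X, z]), y)  # regression of y on x and z
b_short = ols(X, y)                       # z omitted
aux = ols(X, z)                           # (X'X)^{-1} X'z
lam = b_long[-1]                          # coefficient of z in the long regression

# Short-regression coefficients = long-regression coefficients + aux * lambda.
bias = aux * lam
print("omitted variable bias:", bias)
print("identity holds:", np.allclose(b_short, b_long[:2] + bias))
```

The same logic applies to EMR, with v1 in the role of z and the predicted treatment and Sy1|XT in the role of the included variables.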
Violation of the exclusion restriction in EMR due to genetic correlation is potentially solved (or at least is less severe) when an additional indicator of the PGS for y, i.e., Sy3|XT, is used to instrument simultaneously both Sy1|XT and T in equation 1. The practical problem with using two indicators of the true PGS for y as the sole instruments is that their mutual correlation will be relatively high (depending on their reliability) and that they are weak instruments for T. Instead, a practical and, we will argue, effective strategy is to use ST along with Sy2|XT (or Sy2|XT and Sy3|XT) as instruments for Sy1|XT and T. ST will still violate the exclusion restriction to the extent that it is correlated with v1. However, the extent of the violation will be reduced by the presence of Sy1|XT in the regression.
Arguably, a strategy that both reduces the correlation between ST and ϵ (through the inclusion of Sy1|XT in the model) and eliminates or greatly reduces omitted variable bias through the inclusion of an instrument for Sy1|XT in the first stage equation will outperform MR in the estimation of a consistent effect of T that is purged of genetic correlation. As noted above, if not all relevant genetic effects are contained in the PGS (e.g. interaction effects, structural variants, or rare alleles may be missing given currently available GWAS data), the PGS instruments above will not perfectly satisfy the exclusion restriction to the extent that the PGS is correlated with the omitted genetic variables. However, the above approach would generally be expected to reduce bias due to genetic correlation, given that a large fraction of heritability can be attributed to linear effects of common SNPs that are well tagged by currently available genotyping arrays [8, 21, 9, 22].
This argument can be made more formally. MR assumes a regression of y on covariates and T, with the true PGS for y in the error term of equation 7. If, as above, we omit the covariates X (or, more precisely, residualize y, T, and Sy1|XT from their dependence on X), then the bias in MR equals
where coefficient
is the coefficient of
in the regression of
on
is the effect of
on y, net of X and T, and the second term is ignorable because
is orthogonal to its residual.
If we estimate δ using EMR, the omitted variables are now v1 (instead of the true PGS for y) and
, and the bias in the estimator for δ is
where the first term is the product of
(the coefficient of
in the regression of v1 on
and Sy1|XT) and coefficient
, which is the effect of v1 on y, net of X,
, and Sy1|XT. The second term is the product of the coefficient of
in the regression of
on
and Sy1|XT and the coefficient of
on y, net of X,
, and Sy1|XT. As with MR, the second term is ignorable.
Lastly, we consider GIV regression. The second stage of GIV regression is
In GIV regression, the bias of δ is given by
The first term is the product of , which is the coefficient of
in the regression of
on
and
multiplied by λ1, the effect of
on y. The second term is the coefficient of
in the regression of
on
and
multiplied by λ2, the effect of
on y. As with MR and EMR, the second term is ignorable.
The major difference between the bias in MR and the bias in GIV regression stems from the relative sizes of and
. In general, we expect that the correlation between the true PGS for y and that part of T which is predicted by the observed PGS for T (which drives
) will be stronger than is the correlation between the residual PGS for y (i.e., the difference between the true PGS for y, controlling for X and T, and the predicted PGS from Sy2|XT) and that part of T which is predicted by the observed PGS for T and Sy2|XT (which drives
). Therefore, in general, we expect a lower bias in the estimate of δ using GIV regression than using MR. We find support for our expectation in the simulations and empirical analyses discussed below.
In principle, PGS can be computed that include controls for T as well as for X. In practice, however, PGS are typically computed without a control for T. How would the situation above change if instead of Sy1|XT and Sy2|XT, we used Sy1|X and Sy2|X? Now the second stage of GIV regression is
where dy|XT is the vector of differences in the effects of each of the genetic markers in G on y when both X and T are controlled and when only X is controlled. In GIV regression, the bias of δ is given by
The first term is the product of , which is the coefficient of
in the regression of
on
and
multiplied by λ1, the effect of
on y. The second term is the coefficient of
in the regression of
on
and
multiplied by λ2, the effect of
on y. As with MR and EMR, the second term is ignorable. We expect that the correlation between the true PGS for y and that part of T which is predicted by the observed PGS for T (which drives
) will be stronger than is the correlation between the two components of the residual PGS for y4 and that part of T which is predicted by the observed PGS for T and Sy2 (which drives
). We expect the advantage of GIV regression to be smaller when the error term also includes Gdy|XT, which stems from the use of a PGS for y that lacks a control for T. However, there is no serious barrier to constructing PGS for y that include or exclude a control for T, so the practical importance of including this control for GIV regression analysis can be established empirically (e.g., via correlations between the PGS calculated with and without the control).
2.3 Gene-environment interactions
We next generalize equation (7) to the case of gene-environment interactions, where the effect of T varies with the PGS. In principle, these interactions could be extremely complicated, so for practical reasons we focus here on obtaining plausible estimates of the linear interaction between the true PGS for y and T. We rewrite equation (7) accordingly, adding the interaction between T and the PGS. Now there are three endogenous variables: T, Sy1|XT, and T Sy1|XT. The disturbance term has also been elaborated to include a term that is a function of T, and so an additional instrument, Sy4|XT, is needed. As before, Sy4|XT will be a valid instrument for the same reasons as in equation (6) to the extent that problems deriving from correlation between the instrument and f(G) are relatively small.
3 Simulations
We describe the basic simulation model and the data generating process in 3.1. Section 3.2 studies various violations of the model assumptions.
3.1 Standard model
Our interest lies in studying the effect of a treatment T on an outcome y. Both T and y are partly heritable and the genetic propensities of individuals for both variables are summarized by their true polygenic scores. These polygenic scores are not observed directly. Instead, they are empirically approximated from genome-wide association study (GWAS) results for T and y with finite sample sizes. The estimated regression coefficients from the GWAS are used as weights to construct the scores in an independent sample with genetic data. Since the GWASs were conducted in finite sample sizes, the estimated betas will have standard errors greater than zero, which implies that the constructed PGS will be noisy proxies of the true PGS [2]. We denote the actually available (noisy) PGS for T and y for individuals i = 1, …, N as STi and Syi|XT, respectively. The data generating process is as follows:
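$$y_i = \delta T_i + \gamma S^*_{yi|XT} + \epsilon_i \quad (15)$$
$$T_i = \beta S^*_{Ti} + \eta_i \quad (16)$$
$$S_{yi|XT} = S^*_{yi|XT} + \varepsilon_{yi} \quad (17)$$
$$S_{Ti} = S^*_{Ti} + \varepsilon_{Ti} \quad (18)$$
The true scores $S^*_{T}$ and $S^*_{y|XT}$ are drawn from a multivariate normal distribution with non-zero covariance and have a correlation of ρG, where in our simulation we assume for simplicity that X = 0. We assume that all measurement errors are independent of each other. Both y and T are standardized and have a mean of 0 and a variance of 1. Furthermore, the true polygenic scores ($S^*_{T}$ and $S^*_{y|XT}$) are also assumed to be standardized. This implies the following variances for T and y:
$$\mathrm{var}(T) = \beta^2 + \sigma^2_{\eta} = 1$$
$$\mathrm{var}(y) = \delta^2 + \gamma^2 + 2\delta\gamma\beta\rho_G + 2\delta\sigma_{\epsilon,\eta} + \sigma^2_{\epsilon} = 1$$
where σϵ,η is the covariance between ϵ and η. One can calculate from this variance decomposition what β and γ should be in terms of their heritability ($h^2_T$ and $h^2_y$) and their genetic correlation:
Similarly, $\sigma^2_{\epsilon}$ and $\sigma^2_{\eta}$ can be expressed as
where ρe is the correlation between ϵ and η, and where we assume that h is the heritability of G net of X.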
In practice, an important concern is endogeneity and bias due to unobserved environmental factors that jointly influence T and y (i.e., ρe ≠ 0). Our simulations cover two broad scenarios. In one scenario, we assume that endogeneity in the naive OLS estimates arises solely due to genetic correlations between T and y (i.e., ρG ≠ 0). In this case, the endogeneity problem would be solved if the true PGS were known. We simulate this scenario of entirely genetically caused endogeneity by drawing ϵ and η from independent distributions. In the second scenario, there is “additional endogeneity” due to unobserved non-genetic effects that matter for both T and y (i.e., ρe ≠ 0). We simulate this more realistic scenario by drawing ϵ and η from a multivariate normal distribution with ρe = 0.4.
The variances of the measurement error terms of the two scores can be derived using the model of Dudbridge [2]. In the original Dudbridge model, the polygenic scores are not standardized and have a variance equal to the heritability of the trait. The effect sizes are assumed to be randomly distributed across the genome. Therefore, the true polygenic score and the estimated polygenic score are defined as
The variance of this score is equal to
where m is the number of independent genetic markers and where the first approximately-equals sign reflects the fact that the variance of
will generally be slightly smaller than is the variance of
. The variance of the estimation error can now be written as
Here n is the sample size used to estimate the SNP effects in the GWAS. Since we standardize the polygenic scores in our simulations, we divide this variance by the heritability such that our true polygenic score has a variance of one. Therefore, the variance of the error in the polygenic score for observation i and trait k is
For our simulations, we use parameters in a range close to the empirical values we estimated in the Health and Retirement Study (see section 4). For all scores we assume mki equal to 300,000. The GWAS discovery sample sizes are assumed to vary between 200,000 and 2,000,000 per trait. Several recently published GWAS already had sample sizes exceeding 200,000 [23, 24, 25, 20] and 2,000,000 will be a realistic sample size for many traits in the near future. For models that include multiple polygenic scores per trait, we assume that the GWAS discovery sample was divided into equal parts per score. The size of our estimation sample is set to 8,600 (again similar to the sample size of the HRS, see section 4). We assume chip heritabilities of (similar to results reported for educational attainment [9, 16]) and
(similar to results reported for body height [21]). For δ, we assume 0.15, which is the phenotypic correlation between height and educational attainment in the HRS. The genetic correlation between height and educational attainment has been estimated to be 0.15 by [20]. Hence, we use this value for ρG. Note that this implies we assume that the entire genetic correlation between the traits is due to Type 1 pleiotropy, i.e. we assume that all genes that are associated with both phenotypes have direct effects on both rather than some of the genes having cascade effects (e.g. a direct effect on height that triggers higher educational attainment, which shows up in the genetic correlation estimate). Surely, this is a conservative assumption to the disadvantage of classic MR. However, since there is no way to exclude the possibility that Type 1 pleiotropy is underlying the observed genetic correlation between the two traits, we prefer this conservative assumption.
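To give a sense of the magnitudes involved, the sketch below evaluates the measurement error variance of a standardized PGS for the GWAS sample sizes considered here. It rests on an assumption about the form of the expression: the error variance is taken to be m/n before standardization and m/(n·h²) after dividing by the heritability, as the verbal description above suggests. The helper names and the heritability of 0.2 are illustrative choices, not values confirmed by this section.

```python
def pgs_error_variance(m, n, h2):
    """Assumed error variance of a standardized PGS: (m / n) / h2."""
    return m / (n * h2)

def pgs_reliability(m, n, h2):
    """Share of the observed score's variance due to the true PGS (variance 1)."""
    return 1.0 / (1.0 + pgs_error_variance(m, n, h2))

m = 300_000   # number of independent markers, as stated in the text
h2 = 0.2      # illustrative chip heritability (assumption)

for n_total in (200_000, 1_000_000, 2_000_000):
    n_split = n_total // 2  # discovery sample split in two equal parts
    var_e = pgs_error_variance(m, n_split, h2)
    rel = pgs_reliability(m, n_split, h2)
    print(f"n per score = {n_split:>9,}: error variance = {var_e:.2f}, "
          f"reliability = {rel:.2f}")
```

Under these assumptions, splitting the discovery sample doubles the error variance of each score relative to a score built from the full sample, which is the price paid for obtaining two indicators with independent errors.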
All simulations were written in MATLAB and the code is posted on GitHub (https://github.com/cburik/GIVsim). Each simulation is conducted as follows:
i. Calculate the simulation parameters from the input parameters.
ii. Draw the true PGSs from a multivariate normal distribution.
iii. Draw the error terms and measurement errors from their respective distributions.
iv. Compute the “measured” PGS, T, and y from equations 15-18.
v. Estimate δ and γ in the simulated data and save the estimation results.
vi. Repeat steps ii-v to create a distribution of estimated effect sizes and their confidence intervals for each method.
We estimate the coefficients:
1. using OLS.
2. using 2SLS and ST as the IV (i.e., Mendelian Randomization - MR).
3. using 2SLS with ST as the instrument for T, treating Sy1|XT as exogenous (i.e., Enhanced Mendelian Randomization - EMR).
4. using 2SLS with Sy2|XT as the instrument for T, treating Sy1|XT as exogenous (i.e., EMR with an alternative instrument).
5. using 2SLS with Sy2|XT and Sy3|XT as instruments (i.e., Genetic Instrumental Variables regression - GIV).
6. using 2SLS with Sy2|XT and ST as instruments (i.e., Genetic Instrumental Variables regression - GIV).
7. using 2SLS with Sy2|XT, Sy3|XT, and ST as instruments (i.e., Genetic Instrumental Variables regression - GIV).
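A single replication of this procedure can be sketched in Python for methods 1, 2, 3 and 6. This is not the authors' MATLAB code. It assumes the data generating process y = δT + γS*y + ϵ, T = βS*T + η with independent ϵ and η (genetic endogeneity only), sets β, γ and the error variances directly rather than deriving them from heritabilities, and does not rescale y to unit variance. The function names `tsls` and `simulate` and all default values other than δ and ρG are hypothetical.

```python
import numpy as np

def tsls(y, X, Z):
    """Two-stage least squares: regress X on Z, then y on the fitted X."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

def simulate(n=200_000, delta=0.15, gamma=0.4, beta=0.7, rho_g=0.15,
             err_var=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # True PGS for T and y, standardized, with genetic correlation rho_g.
    s_t, s_y = rng.multivariate_normal(
        [0.0, 0.0], [[1.0, rho_g], [rho_g, 1.0]], size=n).T
    # Independent eta and eps: endogeneity is purely genetic (rho_e = 0).
    T = beta * s_t + np.sqrt(1 - beta**2) * rng.standard_normal(n)
    y = delta * T + gamma * s_y + 0.9 * rng.standard_normal(n)

    def noisy(s):
        # Observed PGS = true PGS + independent measurement error.
        return s + np.sqrt(err_var) * rng.standard_normal(n)

    S_T, S_y1, S_y2 = noisy(s_t), noisy(s_y), noisy(s_y)
    one = np.ones(n)
    X = np.column_stack([one, T, S_y1])  # regressors: constant, T, S_y1
    return {
        "OLS (1)": np.linalg.lstsq(X, y, rcond=None)[0][1],
        "MR (2)": tsls(y, np.column_stack([one, T]),
                       np.column_stack([one, S_T]))[1],
        "EMR (3)": tsls(y, X, np.column_stack([one, S_T, S_y1]))[1],
        "GIV (6)": tsls(y, X, np.column_stack([one, S_T, S_y2]))[1],
    }

estimates = simulate()
for name, value in estimates.items():
    print(f"{name}: delta_hat = {value:.3f} (true delta = 0.15)")
```

Repeating the call with different seeds would give the distribution of estimates described in step vi; a single large draw is used here only to keep the example short.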
The results of the simulations are shown in Supplementary Figures 1-14. The results of the standard model (with valid assumptions) are depicted in Supplementary Figures 1 to 6.
Supplementary Figures 1-3 show results from the relatively simple scenario of genetic endogeneity only (ρg = 0.15, but ρe = 0). In this case, we would not need an instrument for T if the true PGS for y were known or if measurement error in the observed PGS were dealt with. This is also apparent in the figures: when the GWAS sample becomes large enough, the OLS estimates converge to the true coefficients. We see the same tendency for EMR (method 3) in Supplementary Figure 1, but it performs slightly worse than OLS. MR estimates (Supplementary Figures 1 and 2) are biased for all sample sizes due to the omitted effect of the true PGS for y and the correlation of the instrument ST with that omitted variable. Supplementary Figure 1 also reports estimates for GIV regression using the 6th method, using Sy2|XT and ST as instruments. This version of GIV regression outperforms the other methods and provides estimates close to the true values even for GWAS samples of 200,000 observations that are split into two samples of 100,000 to create two indicators of the score. The variance of the estimates is larger compared to the other methods, but decreases with total GWAS sample size.
Supplementary Figure 2 shows the same results, but now with GIV regression estimates using the 5th method, which uses Sy2|XT and Sy3|XT as instruments, and EMR estimates using the 4th method, which uses Sy2|XT as the instrument. In this version, the variance of the GIV regression estimates is so large that the confidence bounds are outside of the figure. The high correlation between Sy1|XT, Sy2|XT and Sy3|XT implies large standard errors of the estimated coefficients due to multicollinearity. In other words, the instruments do not contain enough additional information to get precise estimates of the effects of the PGS and T. EMR using method 4 also does not perform as well as EMR using method 3.
Supplementary Figure 3 compares the OLS estimates with all three GIV regression methods (5, 6 and 7). Again, it is clear that method 5 yields very imprecise results. Methods 6 and 7 perform equally well. However, method 7 requires the construction of an additional polygenic score. This additional effort does not seem to be justified compared to the results from method 6. Thus, we recommend method 6 for most practical applications.
Supplementary Figures 4, 5 and 6 show simulation results of the same methods, but now for the more complex scenario with additional endogeneity (ρg = 0.15, ρe > 0). As can be seen from Supplementary Figure 4, the OLS estimates are biased for the effects of both the PGS and T even if the PGS was based on large GWAS samples. As the GWAS sample size increases, the estimate for the effect of the PGS converges towards the true value, while the effect of T remains systematically overestimated due to unobserved environmental effects. The estimates of MR are also biased in this scenario due to the omitted effect of the true PGS for y. EMR estimates are likewise biased. However, EMR estimates of T are much closer to the true value than OLS and MR estimates, and they converge towards the true coefficients as the GWAS sample sizes increase towards infinity. Supplementary Figure 4 also shows that the GIV regression estimates using method 6 are very close to the true values for all GWAS sample sizes. Supplementary Figures 5 and 6 show that methods 4 and 5 again perform very poorly, and we find no performance increase of method 7 compared to our preferred method 6.
3.2 Simulations of violated assumptions
We now turn to a set of simulations that systematically violate the identifying assumptions of GIV regression. The goal of this exercise is to explore how sensitive GIV regression is to these violations in finite samples. Our simulations focus on the best performing GIV regression and EMR variants from section 3.1 (i.e., methods 6 and 3). We add OLS and standard MR estimates as benchmarks.
3.2.1 Skewness and kurtosis
Since GIV regression is an instrumental variables method, theoretical proofs of consistency rely on the central limit theorem. Yet, the more relevant question in practice is how sensitive the method is to skewness and kurtosis in finite samples. To explore this question, we simulate samples with different degrees of skewness and kurtosis by transforming draws from a normal distribution using Fleishman’s power method [26, 27], keeping the GWAS sample size fixed at n = 1,000,000. The other parameters remained the same as in the standard model with additional endogeneity (ρg = 0.15, ρe = 0.4).
We investigate three different models. First, we add skewness to y via ϵ in equation 15. Second, we add skewness to T via η in equation 16. Note that if T is skewed, y will also become skewed. Third, we simulate a model with kurtosis in both y and T via ϵ and η.
For the first and second scenario, we used a kurtosis of 3 (corresponding to a normal distribution) unless that was not possible. Not all combinations of kurtosis and skewness are attainable; in the case of high skewness, more kurtosis is needed. In those cases, we used the minimum kurtosis needed to obtain a given skewness. The minimum kurtosis is found via the formula from [27]:
where k is the minimal amount of kurtosis needed and s is skewness.
Supplementary Figures 7 and 8 present the results for scenarios 1 and 2, with skewness in T and y, respectively. In both cases, it appears that only the OLS estimates are affected by skewness, while GIV regression estimates remain close to the true values. Supplementary Figure 9 shows the results for scenario 3 with kurtosis. Again, GIV regression estimates remain close to the true values, and kurtosis does not diminish the performance of the other methods relative to the baseline scenario with normally distributed variables.
3.2.2 Dependent measurement errors
One of the key assumptions of GIV regression is independence of the measurement errors of the polygenic scores. Specifically, the measurement errors of the scores used as instruments (εy2, εT) should be independent from the score used as regressor (εy1). The independence of the measurement errors is violated if there is an overlap in the GWAS samples used to construct the different scores and the correlation between εy1 and εy2 will depend on the extent to which there is sample overlap. Furthermore, if the GWAS samples used to construct Sy1|XT and ST overlap, then the strength of the correlation between the measurement errors also depends on the extent to which the outcome variables (y and T) are correlated.
We simulate a violation of the independent measurement errors assumption by drawing the measurement errors from a multivariate normal distribution using a non-zero correlation between them. We fixed the GWAS sample size at n = 1,000,000 in these simulations and varied the correlation between the measurement errors. Again, there are different versions of this model. In the first scenario, only the measurement errors in Sy1|XT and Sy2|XT (εy1 and εy2) are correlated with each other. In the second scenario, the measurement error of ST (εT) is also correlated with the others.
The results of the first scenario are shown in Supplementary Figure 10. GIV regression outperforms the alternative methods for small to moderately correlated measurement errors (ρ < 0.5). As the correlation of the measurement errors increases, GIV regression starts to underestimate the genetic effects on y due to attenuation bias. At the same time, GIV regression begins to overestimate the effect of T, but only to a small extent. The GIV regression estimates of the treatment effect remain much closer to the true effect than the OLS estimates, even for severe violations of the independent error assumption. However, for very strong violations (i.e. for very strong overlap between the samples used to construct Sy1 and Sy2), GIV regression performs slightly worse than MR and EMR.
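The direction of these biases can be reproduced from the moment conditions of GIV regression. The sketch below solves for the large-sample GIV estimates of δ and γ when the two measurement errors εy1 and εy2 have covariance c. It is a simplified derivation under stated assumptions, not the authors' simulation: the data generating process is taken to be y = δT + γS*y + ϵ and T = βS*T + η with standardized true scores, ρe = 0, and εT independent of the other errors. The function name `giv_plim` and the parameter values for γ, β and the error variance are hypothetical.

```python
import numpy as np

def giv_plim(c, delta=0.15, gamma=0.4, beta=0.7, rho_g=0.15):
    """Large-sample GIV estimates (delta, gamma) when cov(e_y1, e_y2) = c."""
    # Covariances of the instruments (rows: S_T, S_y2)
    # with the regressors (columns: T, S_y1).
    A = np.array([[beta, rho_g],
                  [beta * rho_g, 1.0 + c]])
    # Covariances of the instruments with y.
    b = np.array([delta * beta + gamma * rho_g,
                  delta * beta * rho_g + gamma])
    return np.linalg.solve(A, b)

err_var = 1.0  # assumed measurement error variance of each score
for rho_m in (0.0, 0.25, 0.5, 0.9):
    d_hat, g_hat = giv_plim(rho_m * err_var)
    print(f"error correlation {rho_m:.2f}: delta_hat = {d_hat:.3f}, "
          f"gamma_hat = {g_hat:.3f}")
```

With uncorrelated errors the true parameters are recovered. As the error correlation grows, the estimate of γ is attenuated while the estimate of δ rises by a smaller amount, in line with the pattern described for the first scenario.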
Supplementary Figure 11 displays the results of the second scenario, now with two invalid instruments (Sy2|XT and ST). Again, more strongly correlated measurement errors induce a stronger bias in the GIV regression estimates. However, in contrast to the first scenario, GIV regression now underestimates both the genetic effect and the effect of the treatment. None of the displayed estimation methods get anywhere close to the true parameters in the case of strongly correlated measurement errors.
3.2.3 Missing genetic variants
We simulate a situation where not all genetic variants are captured by the polygenic scores, which will be common in practice since GWAS results never include all genetic variants and rare variants may be left out. In this situation, we augment the model of equations 15-18 by splitting the scores into two parts: one part for common variants and a second part for rare variants. This situation is described by the following equations:
We determined β1, β2, γ1 and γ2 using the observed heritability and the estimated missing heritability. [21] estimate the chip heritability of height to be 0.56 and state that the total narrow-sense heritability is likely to be between 0.6 and 0.7 (the current estimate is still attenuated due to imperfect tagging of rare and structural genetic variants). For this simulation exercise we assume the observed chip heritability of T to be 0.55 and the total narrow-sense heritability to be 0.65. For y, we extrapolate this ratio and assume the observed chip heritability and the total narrow-sense heritability to be 0.2 and 0.24, respectively.
similarly:
with:
We assume that the rare variants are correlated with the common variants, but they are unobserved. In practice, we do not know the size of this correlation. To get a sense, we deconstructed the scores used in the empirical part of our paper into different parts. Specifically, we constructed PGS for EA and height using three different subsets of GWAS data from the 1000 Genomes project: (G1) a PGS based only on SNPs included in HapMap 3, (G2) a PGS based on common SNPs not included in HapMap 3, but included in 1000 Genomes (MAF > 5%), and (G3) a PGS based on rare SNPs not included in HapMap 3, but included in 1000 Genomes (MAF ≤ 5%). Due to LD, the three scores should be correlated with each other, but the question is the extent to which this correlation affects the correlation of PGS for two different traits. The following correlation matrices show the correlation of G1 with the common SNPs not included in HapMap 3, and the correlation between G2 and the rare SNPs included in 1000 Genomes (MAF ≤ 5%).
During the simulations, we again fix the GWAS sample at 1,000,000. We vary the correlation between the common and the rare genetic variants (ρr). Because γ1 depends on ρr, γ1 will have different values depending on the input parameters.
The simulation results are shown in Supplementary Figure 12. We see that the effect of T is still consistently estimated across all simulated scenarios. Thus, we conclude that missing (rare or structural) genetic variants do not lead to a noteworthy bias in the GIV regression estimate of the treatment T on outcome y.
3.2.4 Parental effects
The last situation we simulate is concerned with unobserved parental effects. In particular, the genetic correlation between a biological parent and his or her offspring is on average 0.5. Since the genotypes of the parents partly influence the environment of the offspring (e.g. via socio-economic status and parental habits), it is possible that environmental factors that are “inherited” via the parents will violate the exclusion restriction of IV regression because unobserved genotypes of the parents would partly correlate with the polygenic scores of the offspring and residual environmental factors captured by the error term. In principle, it is possible to control for parental genotypes directly (via genetic data from the parents) or indirectly (via the inclusion of family fixed-effects, e.g. in a large sample of dizygotic twins). However, only very few samples currently exist that contain a large enough sample of genotyped trios (i.e. mother, father, child) or genotyped dizygotic twins and also the phenotypes of interest.
Our simulations below explore the consequences arising from unobserved “inherited environments” in a standard GIV regression model. Specifically, we model a situation where the unobserved weighted average polygenic score of the parents for the treatment variable (PT) has a direct or indirect effect on the outcome, y. We augment the standard model as follows:
We assume PT has a correlation of 0.5 with the true PGS for T and a correlation of ρg = 0.5 with the true PGS for y. Furthermore, we assume that the parental effect is directly related to the true PGS for T and only related to the true PGS for y through pleiotropy between T and y. A scenario where the parental genetic effect is more directly related to the true PGS for y can also be imagined. In both cases, the assumptions for GIV regression will be violated. However, a stronger bias can be expected when the omitted parental environment has a stronger correlation with the instrument for the treatment variable (ST). Hence, these simulations can be seen as a worst-case scenario. Furthermore, we vary θ from 0 to 0.15 (i.e., the effect of PT on y is not stronger than the effect of T on y). In this situation β is the same as in the standard model. However, the simulations need to adjust γ to account for overestimation of the heritability due to the unobserved effect of PT:
As before, the simulations fixed the GWAS sample at n = 1,000,000. We varied the strength of the parental effects (θ). Because γ depends on θ, γ has different values depending on the input parameters.
The results of these simulations are shown in Supplementary Figure 13. While we can see from the top panel that the genetic effects are estimated consistently with GIV regression, it is clear from the bottom panel that the parental effects bias the estimated effect of T. The treatment effects are overestimated because of the omitted variable bias caused by PT. Note that GIV regression still outperforms all the other models even in this case.
3.3 Simulation of gene-environment interactions
Next, we simulate the gene-environment interaction model described in SI section 2.3. In principle, these gene-environment interactions could be extremely complicated, but for practical reasons we focus on a simple linear interaction between T and the true PGS for y. The equations of the data generating process have been augmented to equations 52-55.
We assume here that δ2 is of the same size as δ1 (0.15). Hence, the interaction term is relatively important. The other parameters have the same values as in the standard model (including the additional endogeneity), except that we have to take the interaction term into account when calculating γ.
As in the standard model, the GWAS sample varied from 200,000 to 2,000,000. The parameters are estimated with two methods, OLS and GIV regression. For GIV regression, Sy1|XT, T, and TSy1|XT are used as endogenous regressors and Sy2|XT, ST, and Sy2|XT ST are used as instruments.
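The instrument set for the interaction model can be sketched as follows. This is an illustrative Python example rather than the authors' MATLAB implementation. It assumes the data generating process y = δ1·T + γ·S*y + δ2·T·S*y + ϵ with T = β·S*T + η, independent ϵ and η, and independent measurement errors; the values of γ, β and the error variances are hypothetical, and y is not rescaled to unit variance.

```python
import numpy as np

def tsls(y, X, Z):
    """Two-stage least squares: regress X on Z, then y on the fitted X."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

rng = np.random.default_rng(0)
n = 200_000
delta1, delta2, gamma, beta, rho_g, err_var = 0.15, 0.15, 0.4, 0.7, 0.15, 1.0

# True PGS for T and y with genetic correlation rho_g.
s_t, s_y = rng.multivariate_normal(
    [0.0, 0.0], [[1.0, rho_g], [rho_g, 1.0]], size=n).T
T = beta * s_t + np.sqrt(1 - beta**2) * rng.standard_normal(n)
y = (delta1 * T + gamma * s_y + delta2 * T * s_y
     + 0.9 * rng.standard_normal(n))

def noisy(s):
    # Observed PGS = true PGS + independent measurement error.
    return s + np.sqrt(err_var) * rng.standard_normal(n)

S_T, S_y1, S_y2 = noisy(s_t), noisy(s_y), noisy(s_y)
one = np.ones(n)

# Endogenous regressors: T, S_y1 and their product.
X = np.column_stack([one, T, S_y1, T * S_y1])
# Instruments: S_T, S_y2 and their product.
Z = np.column_stack([one, S_T, S_y2, S_T * S_y2])

giv = tsls(y, X, Z)
ols = np.linalg.lstsq(X, y, rcond=None)[0]
for label, est in (("OLS", ols), ("GIV", giv)):
    print(f"{label}: delta1 = {est[1]:.3f}, gamma = {est[2]:.3f}, "
          f"delta2 = {est[3]:.3f}")
```

The product of the two instruments serves as the instrument for the interaction term, so the model remains just identified with three endogenous regressors and three excluded instruments.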
The results of these simulations are shown in Supplementary Figure 14. From all three panels it is clear that GIV regression estimates all three coefficients of equation 52 consistently, while the OLS estimates are clearly biased even for larger GWAS samples. The variance of the GIV regression estimates is relatively large compared to OLS. This should be taken into consideration if the effect size of the interaction term is expected to be smaller. We have not compared GIV regression to MR in this scenario, as there is no standard way to do MR with polygenic scores and gene-environment interactions.
4 Data
We use data from the Health and Retirement Study (HRS) [28]. The HRS is a longitudinal survey on health, retirement and aging which is representative of the US population aged 50 years or older. The survey consists of eleven waves from 1992 to 2012. We used phenotypic data that has been cleaned and harmonized by the RAND Corporation.5
Since 2006, data collection has expanded to include biomarkers and a subset of the participants has been genotyped.6 Autosomal SNPs were imputed using the worldwide reference panel from phase I of the 1000 Genomes project (v3, released March 2012)[29]. If the uncertainty about the genotype of an individual was greater than 10 percent, the SNP was removed. Furthermore, SNPs were removed from the entire sample if the imputation quality was below 70 percent, if the minor allele frequency was smaller than 1 percent, or if the SNP was missing in over 5 percent of the sample. Our analyses are restricted to unrelated participants of European descent according to the standard HRS protocol. Specifically, HRS filtered out parent-offspring pairs, siblings and half-siblings. Selection on European descent was done based on self-reported race and principal component analysis [30]. The polygenic score for educational attainment is negatively correlated with birth year (r = -0.06; p < 0.0001) and educational attainment influences longevity [31, 32]. Since the HRS is a sample of an older population, we further restrict our sample to a smaller range of birth years (1935 to 1945) to reduce sample selection bias that is correlated with the PGS. This resulted in a sample of 2,839 individuals.
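The sample-wide SNP filters described above can be written as a short predicate. This is a minimal sketch under stated assumptions: the record type `SnpStats`, its field names, and the example SNP identifiers are hypothetical, and the per-genotype uncertainty filter (greater than 10 percent) is not modeled because the text does not specify how it was applied.

```python
from dataclasses import dataclass

@dataclass
class SnpStats:
    """Hypothetical per-SNP summary; field names are illustrative only."""
    rsid: str
    imputation_quality: float   # imputation quality metric on a 0-1 scale
    minor_allele_freq: float
    missing_rate: float         # share of the sample with a missing genotype

# Thresholds taken from the text.
MIN_IMPUTATION_QUALITY = 0.70
MIN_MAF = 0.01
MAX_MISSING_RATE = 0.05

def passes_qc(snp: SnpStats) -> bool:
    """Keep a SNP only if it clears all three sample-wide filters."""
    return (
        snp.imputation_quality >= MIN_IMPUTATION_QUALITY
        and snp.minor_allele_freq >= MIN_MAF
        and snp.missing_rate <= MAX_MISSING_RATE
    )

snps = [
    SnpStats("rs_example_1", 0.95, 0.200, 0.01),  # clears every filter
    SnpStats("rs_example_2", 0.60, 0.200, 0.01),  # imputation quality too low
    SnpStats("rs_example_3", 0.95, 0.005, 0.01),  # minor allele too rare
    SnpStats("rs_example_4", 0.95, 0.200, 0.08),  # too much missingness
]
kept = [snp.rsid for snp in snps if passes_qc(snp)]
print(kept)
```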
We constructed polygenic scores starting with a set of 2,224,079 SNPs that were either directly genotyped in HRS or present in the HapMap3 reference panel [33], providing us with a high-resolution coverage of common genetic variants. To control for linkage disequilibrium (LD) between SNPs, we constructed all polygenic scores using LDpred [6] with the default LD window (total number of SNPs divided by 3000) and assuming that 30 percent of the SNPs are causal.
The polygenic scores for educational attainment were constructed using GWAS results provided by the Social Science Genetic Association Consortium [20], excluding HRS and the 23andMe cohort from the meta-analysis, but including the UK Biobank (see Supplementary Table 1). We constructed three polygenic scores. The first uses a meta-analysis of all available data, with a sample of 318,954 individuals (1,873,557 SNPs). The other two are created by splitting this sample in two: one score uses only data from the UK Biobank (n = 111,349; 1,873,557 SNPs) and the other uses the remaining sub-sample of several cohorts from around the world (n = 207,605; 1,849,602 SNPs).
The first polygenic score for height was constructed using the publicly available GWAS summary results from the GIANT consortium (n = 253,288) [25]7, which are based on ≈ 2.5 million autosomal SNPs that were imputed using the HapMap 2 CEU reference panel [34] (see Supplementary Table 2). Merging this set with the directly genotyped and HapMap 3 SNPs resulted in 1,264,571 SNPs that were included in the score by LDpred.
For the second polygenic score for height, we conducted a GWAS on body height in the UK Biobank (UKB). The UKB is a publicly available population-based prospective study of individuals aged 40-69 years during recruitment in 2006-2010 [35]. We restricted the analysis to unrelated British individuals of European descent [36] who were available in the interim release of the genetic data (n = 112,151). Autosomal SNPs were imputed using the UK10K reference panel. Details on genotyping, pre-imputation quality control, and imputation have been documented extensively elsewhere [36]. The GWAS included dummies for genotyping batch, year of age, and sex as control variables, along with all interaction terms between the age dummies and sex. Furthermore, the first 15 principal components of the genetic data were included to control for subtle population structure. GWAS results underwent quality control following an extended version of the EasyQC protocol [37] described in detail elsewhere [23]. This yielded 1,861,232 autosomal SNPs that were included in the LDpred scores.
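As a quick consistency check on these figures, the two split samples should add up to the full meta-analysis sample, and the default LD window follows from the SNP counts. The sketch below assumes that the "total number of SNPs" in the LD window rule refers to the SNP count reported for each score and that the result is truncated to an integer; both are assumptions, not statements about how LDpred was actually invoked.

```python
# Sample sizes and SNP counts as reported for the three EA scores.
ea_scores = {
    "meta-analysis": {"n": 318_954, "snps": 1_873_557},
    "UK Biobank": {"n": 111_349, "snps": 1_873_557},
    "other cohorts": {"n": 207_605, "snps": 1_849_602},
}

# The two sub-samples should partition the meta-analysis sample.
split_total = ea_scores["UK Biobank"]["n"] + ea_scores["other cohorts"]["n"]

# Default LD window: number of SNPs divided by 3000 (integer part, by assumption).
ld_window = {name: score["snps"] // 3000 for name, score in ea_scores.items()}

print(split_total == ea_scores["meta-analysis"]["n"])
print(ld_window)
```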
5 Empirical applications
We demonstrate the value of the GIV regression approach in several important empirical applications. First, we estimate the chip heritability of educational attainment (EA) in the HRS data from a PGS for EA. We use the residual of EA after regressing it on control variables. The results are shown in Table 1 of the main text. All reported coefficients are standardized. Since the squared standardized coefficient in OLS equals R², our OLS result in column 1 of Table 1 implies that the PGS for EA currently captures 6.3% of the variance in EA (with a 95% confidence interval of +/- 1.8%, obtained from multiplying the standard error estimate by 1.96).
Using the GIV regression results reported in columns 2 and 3 of Table 1 and the error correction described in section 2.1 above, we obtain chip heritability estimates of 8.8% (95% CI +/- 3.4%) and 15.5% (+/- 4.8%), respectively.
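The link between the standardized OLS coefficient and R² used above can be checked numerically. The sketch below uses simulated data with an arbitrary, hypothetical effect size; it only illustrates the identity for a single standardized regressor and does not reproduce the HRS estimates or the error correction.

```python
import math
import random
import statistics

def standardize(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

random.seed(1)
n = 5_000
pgs = [random.gauss(0, 1) for _ in range(n)]
ea = [0.25 * s + random.gauss(0, 1) for s in pgs]   # hypothetical effect size

x, y = standardize(pgs), standardize(ea)

# With both variables standardized, the intercept is zero and the OLS slope
# equals the correlation between x and y.
beta = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
residual_ss = sum((b - beta * a) ** 2 for a, b in zip(x, y))
total_ss = sum(b * b for b in y)
r_squared = 1 - residual_ss / total_ss

print(round(beta ** 2, 6), round(r_squared, 6))   # the two values coincide

# Read in reverse, an R² of 6.3% corresponds to a standardized coefficient
# of roughly 0.25.
implied_beta = math.sqrt(0.063)
print(round(implied_beta, 3))
```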
Second, we estimate the (causal) effect of body height on EA. Earlier studies have reported a positive relationship between these variables [38, 39, 40]. Third, we present results from a negative control that estimates the (causal) effect of EA on body height (which should be zero). We estimate these effects using OLS, MR, EMR, and GIV regression. We include birth year, birth year squared, educational attainment of both parents and (in pooled models) gender as control variables. We included PGS of EA or height depending on the method. All variables have been standardized.
There is an overlap in the cohorts used by the GIANT consortium in the GWAS on height and by the SSGAC GWAS on EA [20]. To ensure independence of measurement errors in the PGS, whenever the GIANT height PGS was used, we excluded the other as an instrument and used a PGS constructed from the UK Biobank GWAS results instead.
The OLS results in Table 2 in the main text appear to show that height has a strong effect on EA. MR seems to confirm the causal interpretation of the OLS result; indeed, the point estimate from MR is even slightly larger than from OLS. However, as discussed above, MR suffers from probable violations of the exclusion restriction. These violations could stem from the possibility that the same genetic factors that increase height also directly increase EA. They could also stem from the possibility that the PGS for height is itself correlated with the genetic tendency for parents to have higher EA and income, and therefore a lower nutritional or disease risk for their children, who are in turn more likely to reach their full cognitive potential and have higher EA. Controlling for the PGS (i.e., EMR) is an imperfect strategy for eliminating this source of endogeneity, because the bias in the estimated effect of the PGS also biases the estimated effect of height (the omitted variable bias discussed earlier).
In contrast, estimates from both GIV regressions in Table 2 in the main text show both a considerably larger effect of the education PGS on EA and a small and statistically insignificant effect of height on EA. These results imply that the positive correlation between height and EA is not a causal relationship. Instead, the phenotypic correlation seems to be entirely explained by the genetic correlation between the two traits.
Supplementary Table 3 shows the estimates of EA on height (which should be zero) using the four estimation strategies. OLS and MR both find (erroneously) a statistically significant positive effect of EA on height. The GIV regression estimate for the effect of EA on height is indistinguishable from zero in both specifications. In this application, EMR also finds a small and statistically insignificant effect of EA on height, though it underestimates the genetic contribution to height.
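The qualitative pattern across the four estimators can be reproduced in a toy simulation in which the exposure has no causal effect but shares genetic influences with the outcome. Everything below is a simplified sketch: the data generating process, parameter values, and variable names are assumptions, control variables are omitted, OLS is run without a PGS control, and EMR is taken to mean MR with one noisy outcome PGS added as an exogenous control.

```python
import numpy as np

def tsls(y, X, Z):
    """2SLS estimate; reduces to OLS when Z equals X."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

rng = np.random.default_rng(42)
n = 200_000
rho, b, pi_T, d = 0.5, 0.5, 0.5, 0.0   # hypothetical values; true effect of T is zero

G_y = rng.standard_normal(n)                                      # genetic score for y
G_T = rho * G_y + np.sqrt(1 - rho**2) * rng.standard_normal(n)    # genetically correlated score for T
v = rng.standard_normal(n)                                        # non-genetic confounder
T = pi_T * G_T + v
y = b * G_y + d * T + 0.3 * v + rng.standard_normal(n)

# Noisy polygenic scores, as if estimated in independent GWAS samples.
S_y1 = G_y + rng.standard_normal(n)
S_y2 = G_y + rng.standard_normal(n)
S_T = G_T + rng.standard_normal(n)

c = np.ones(n)
effect_of_T = {
    "OLS": tsls(y, np.column_stack([c, T]), np.column_stack([c, T]))[1],
    "MR": tsls(y, np.column_stack([c, T]), np.column_stack([c, S_T]))[1],
    "EMR": tsls(y, np.column_stack([c, T, S_y1]), np.column_stack([c, S_T, S_y1]))[1],
    "GIV": tsls(y, np.column_stack([c, T, S_y1]), np.column_stack([c, S_T, S_y2]))[1],
}
for name, estimate in effect_of_T.items():
    print(f"{name}: {estimate:.3f}")
```

In this setup the MR estimate converges to b·rho/pi_T = 0.5 although the true effect is zero, because the score for T is correlated with the genetic component of y. Controlling for a noisy outcome score removes only part of that bias, whereas instrumenting the noisy score with a second, independent score should bring the estimated effect of T close to zero.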
Supplementary Tables
Cohort list for Educational Attainment
Cohort list for Height
Regression of height on educational attainment in the Health and Retirement Study (HRS)
Footnotes
This research was facilitated by the Social Science Genetic Association Consortium (SSGAC) and by the research group on genetic and social causes of life chances at the Zentrum für interdisziplinäre Forschung (ZiF) Bielefeld. Data analyses make use of the UK Biobank resource under application number 11425. We acknowledge data access from the Genetic Investigation of ANthropometric Traits Consortium (GIANT). We used data from the Health and Retirement Study (HRS), which is supported by the National Institute on Aging (NIA U01AG009740, RC2 AG036495, RC4 AG039029). HRS genotype data can be accessed via the database of Genotypes and Phenotypes (dbGaP, accession number phs000428.v1.p1). Researchers who wish to link genetic data with other HRS measures that are not in dbGaP, such as educational attainment, must apply for access from HRS. We are very grateful to Richard Karlsson Linnér for help with the GWAS analyses in the UK Biobank and to Aysu Okbay for providing us with subsets of the GWAS meta-analysis on educational attainment. We thank Patrick Turley, Daniel J. Benjamin, Jonathan P. Beauchamp, Niels Rietveld, Eric Slob, Hans van Kippersluis, Benjamin Domingue, and Lisbeth Trille Loft for productive discussions and comments on earlier versions of the manuscript. Furthermore, we thank Elliot Tucker-Drob for pointing us to the necessary correction of the heritability estimate in our model. The study was supported by funding from an ERC Consolidator Grant (647648 EdGe, Philipp D. Koellinger).
↵1 Note that a clean experimental design which randomizes people into groups based on body height or EA is not possible. Thus, any attempt to study the causal relationship between the two variables must rely on observational data and naturally occurring experiments like the genetic endowment of individuals, which we exploit here.
↵2 i.e. the proportion of variance in a trait that is due to linear effects of currently measurable SNPs
↵3 This is the standard case of omitted variable bias, see [45].
↵4 Classic MR typically does not use PGS as instruments. Instead, the idea is to use single genetic variants that are known to affect the exposure via well-understood biological mechanisms that make it unlikely to violate the exclusion restriction. In practice, limited knowledge about the biological function of most genes makes it difficult to rule out direct pleiotropic effects of the gene on both the exposure and the outcome of interest.
↵5 The implications of using Sy or alternatively Sy|X are discussed in the SI.
↵6 as opposed to cascade effects where, for example, a component of the PGS for y affects y indirectly through its effect on height that then causes higher EA
↵7 GREML yields unbiased estimates of SNP-based heritability that are not affected by attenuation, see [?].
↵8 Results from [?] and [27] suggest a genetic correlation between height and EA of about 0.15.
↵1 This conclusion assumes that the two PGS are estimated from the same population. In principle, the PGS for a trait could vary across sub-populations. Using a PGS from one subpopulation as an instrument for a PGS from another subpopulation could cause a violation of the exclusion restriction. This potential problem is solved if the two scores are estimated from randomly chosen subsamples of a single GWAS sample after randomly excluding related individuals so that the final sample consists only of unrelated individuals. This can be done using the UK Biobank.
↵2 This issue is similar to the attenuation of predictive accuracy of a PGS that results from an imperfect genetic correlation between the GWAS summary statistics in the hold-out sample and the GWAS summary statistics in the discovery sample [16].
↵3 We could, for example, imagine replacing each of the variables in equation 8 with the residual of this variable from an OLS regression of that variable on the variables in X.
↵4 One component is the difference between the true PGS for y, without controls for X and T, and the predicted PGS for y (without controls for X and T) from Sy2. The other component is the difference in the coefficients of G on y in the presence and the absence of X and T.
↵5 RAND HRS Data, Version O. Produced by the RAND Center for the Study of Aging, with funding from the National Institute on Aging and the Social Security Administration. Santa Monica, CA (August 2016). See http://www.rand.org/labor/aging/dataprod/hrs-data.html for additional information.
↵7 http://portals.broadinstitute.org/collaboration/giant/index.php/GIANT consortium data files\#GWAS Anthropometric 201