Abstract
Genome-wide association studies (GWAS) have identified thousands of genetic variants associated with complex human traits, but only a fraction of variants identified in discovery studies achieve significance in replication studies. Replication in GWAS studies has been well-studied in the context of winner’s curse, which is the inflation of effect size estimates for significant variants in a study. Multiple methods have been proposed to correct for the effects of winner’s curse. However, winner’s curse is often not sufficient to explain lack of replication. Another reason why studies fail to replicate is that there are fundamental differences between the discovery and replication studies. A confounding factor can create the appearance of a significant finding while actually being an artifact that will not replicate in future studies. We propose a statistical framework that utilizes GWAS replication studies to model winner’s curse and study-specific heterogeneity due to confounders and correct for these effects. We show through simulations and application to 100 human GWAS data sets that modeling both winner’s curse and study-specific heterogeneity explains observed patterns of replication in GWAS studies better than modeling winner’s curse alone.
Introduction
Replication is a gold standard in scientific discovery. Consensus emerges when a result has been replicated repeatedly by multiple researchers. Recently, a vigorous discussion has emerged about how often replication of an initial study fails across all fields of science, including genomics [1, 2, 3, 4, 5]. Genome-wide association studies (GWAS) are an ideal model to study replication because a large number of GWAS data sets with replication studies are publicly available. GWAS replication studies are typically conducted in an independent cohort and on a smaller set of variants than the discovery study. In the National Human Genome Research Institute Catalog of Published GWAS, thousands of genetic variants have been associated with complex human traits, but not all associated variants achieve significance in follow-up replication studies [6, 4, 5, 7].
There are several reasons why associations do not replicate. The first is simply statistical. It is possible that the association is not observed in the replication study by chance. However, if the p-value from the original finding is highly significant and the replication studies have similar experimental designs, this scenario is unlikely. A second reason why studies can fail to replicate is winner’s curse, which is the inflation of effect size estimates for significant variants in a study. This phenomenon occurs because the reported findings are a small fraction of many possible findings. In the case of GWAS, the significant associations are discovered after examining millions of variants and pass a stringent genome-wide significance threshold. This can result in inflated effect size estimates of significant variants in a study, especially when studies are underpowered. Winner’s curse has been studied extensively in GWAS, and multiple methods have been proposed to correct for its effects [9, 10, 11, 12, 4]. However, winner’s curse is often not sufficient to explain lack of replication. A third reason why studies fail to replicate is that there are fundamental differences between the original and the replication study. An effect present in one study but not present in other studies can create the appearance of a significant finding that is not replicated in future studies [13]. This can either occur because of an underlying biological difference or a technical difference between the two studies. We refer to the cause of these differences as confounders.
Current methods for modeling confounders fall into two broad categories. The first class of methods attempts to model the effect of confounders before the association statistic is calculated in order to remove their effects from the association statistic. While these methods are widely used, they have several fundamental limitations. Methods that account for known covariates may not correct for all potential confounders. Confounding correction methods that use unsupervised learning to learn principal components or other global patterns in the data can incorrectly model the true signal as a confounder, which would remove true biological signal from the data [15, 14]. Similarly, when using unsupervised methods, it is unclear whether residual confounding remains in the data. The second class of methods attempts to directly adjust p-values by a constant factor to remove inflation. An example of such a method is genomic control [16]. In genomic control, there is an assumption that relatively few variants affect the trait and the vast majority do not. The implication of this assumption is that if the association statistics are ranked, then the variant corresponding to the median statistic will not affect the trait, and the value of this statistic will represent only the effect of the confounders. Genomic control scales all of the p-values using this statistic. Recently it has been observed that due to polygenicity and linkage disequilibrium (LD) structure in the genome, the majority of variants (including the one corresponding to the median statistic) either affect the trait or are correlated with variants that affect the trait. This breaks the genomic control assumption. While LD-score regression has been shown to distinguish polygenicity and confounding [17], it has been shown that this approach can also result in inflated SNP-based heritability estimates under strong stratification [18].
In this paper, we present a novel approach for characterizing study-specific heterogeneity due to confounders using replication studies. The key insight in our approach is that we can use replications to identify the presence of confounders and then use this information to correct the studies. Since replication studies are performed on the same phenotype, utilizing replication studies to estimate the effect of confounders does not rely on assumptions to distinguish between polygenicity and confounding. Furthermore, we can apply our approach in combination with traditional techniques like linear mixed models and principal component analysis. Our approach can be used to model any residual confounding effects after application of these methods.
In our framework, we use a random effects model to jointly model the effect of both winner’s curse and study-specific confounders on GWAS summary statistics. We show through simulations that we can accurately estimate the contribution of confounders on a study by using the existing findings of the study and a replication. We apply this framework to 100 GWAS studies from the Human GWAS Catalog and observe a surprising amount of confounding in GWAS studies. We validate our approach by comparing the predicted replication rate under our model with both the true replication rate and the predicted replication rate under a naive model that only accounts for winner’s curse. We show that modeling both winner’s curse and study-specific heterogeneity due to confounders explains observed patterns of replication in GWAS studies better than modeling winner’s curse alone.
Results
Method overview
The main goal of this framework is to account for winner’s curse and confounding between discovery and replication GWAS studies of the same phenotype. We compare the predicted replication rate of two random effects models: one that corrects for only winner’s curse and one that corrects for both winner’s curse and confounding. Through this comparison, we show that jointly modeling both winner’s curse and study-specific confounders explains observed patterns of replication better than the naive approach that only models winner’s curse. We introduce these models without accounting for differences in sample size for clarity, but we relax this constraint in the Methods section.
In GWAS, winner’s curse is the phenomenon where the association statistics for variants meeting a genome-wide threshold tend to be overestimated. The effect of winner’s curse can be observed in Figure 1, where the summary statistics for the significant variants in the discovery study are substantially lower in the replication study. Due to this phenomenon, not all of the significant variants in the initial discovery study replicate. Winner’s curse is widely observed in GWAS due to lack of statistical power in initial discovery studies. When power is low, the variants that are most significant in a study are likely to have inflated effect sizes due to random noise.
To model random noise contributing to winner’s curse, we model the statistics for each variant k from the discovery and replication studies as normally distributed random variables ($s_k^{(1)}$ and $s_k^{(2)}$, respectively). We assume that there is a shared genetic effect $\lambda_k$ that is responsible for the observed association signal. Thus, the distribution of the statistic for variant k in study i is $s_k^{(i)} \sim N(\lambda_k, 1)$. We define the prior probability of the true genetic effect to be $\lambda_k \sim N(0, \sigma_g^2)$, where the variance in the true genetic effect, $\sigma_g^2$, is learned from the data. Then, we model the joint distribution of the summary statistics from the two studies (Equation 1).
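For concreteness, the joint distribution referenced as Equation 1 can be sketched as follows. This assumes unit-variance sampling noise in each study, equal sample sizes, and a zero-mean normal prior on the shared genetic effect; the symbol $\sigma_g^2$ for the prior variance is a naming choice made here for illustration.

```latex
\begin{pmatrix} s_k^{(1)} \\ s_k^{(2)} \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix}
\sigma_g^2 + 1 & \sigma_g^2 \\
\sigma_g^2 & \sigma_g^2 + 1
\end{pmatrix}
\right)
```

The off-diagonal term arises because the two statistics share only the genetic effect, while each diagonal term adds the study's own unit noise variance.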
We correct for winner’s curse by computing the conditional distribution of the replication summary statistic given the initial summary statistic (Equation 2). Using this conditional distribution, we can account for winner’s curse and compute the expected value of the summary statistic in the replication study, along with confidence intervals. This framework accurately models the data in cases where winner’s curse is the only source of inflation. Figure 2A shows a GWAS on height, where most of the variants fall within the 95% confidence intervals of the model accounting for winner’s curse [19]. This shows that in studies without substantial confounding effects, winner’s curse can adequately explain the replication rate.
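The conditioning step can be illustrated with a minimal sketch, assuming the simplified setting of this overview (equal sample sizes, unit noise variance, and a genetic variance written here as `sigma_g2`). The function names are hypothetical and are not taken from the authors' software. Under these assumptions the conditional mean is `sigma_g2 / (sigma_g2 + 1) * s1`, which shrinks the discovery statistic toward zero.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% normal quantile


def wc_conditional(s1, sigma_g2):
    """Mean and variance of s2 given s1 under the winner's-curse-only model.

    Assumes var(s1) = var(s2) = sigma_g2 + 1 and cov(s1, s2) = sigma_g2.
    """
    total = sigma_g2 + 1.0
    mean = sigma_g2 / total * s1
    var = total - sigma_g2 ** 2 / total
    return mean, var


def wc_interval(s1, sigma_g2, z=Z_95):
    """Confidence interval for the replication statistic given s1."""
    mean, var = wc_conditional(s1, sigma_g2)
    half_width = z * math.sqrt(var)
    return mean - half_width, mean + half_width


# Example: a discovery statistic of 6.0 with genetic variance 1.0
mean, var = wc_conditional(6.0, 1.0)
low, high = wc_interval(6.0, 1.0)
```

A replication statistic that falls outside `(low, high)` for many variants would suggest heterogeneity beyond what winner's curse alone explains.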
However, there is often additional heterogeneity due to confounding, and a framework that only accounts for winner’s curse is inadequate. Figure 2B shows an example of a GWAS on height in African American women [20]. In this study there was substantial confounding, and only 5/84 (6%) of variants replicated. Using a model that only accounts for winner’s curse, most variants are outside of the 95% confidence intervals, indicating that there is additional heterogeneity that is not modeled. To account for study-specific confounding, we decompose the effect size of the summary statistics into a genetic effect ($\lambda_k$) and study-specific confounding effects ($\delta_k^{(i)}$). The distribution of the statistic for variant k in study i is $s_k^{(i)} \sim N(\lambda_k + \delta_k^{(i)}, 1)$. In addition to the prior on the genetic effect, we introduce priors on the study-specific confounders, $\delta_k^{(i)} \sim N(0, \sigma_{ci}^2)$. We incorporate both of these priors into the joint distribution of the summary statistics (Equation 3).
We correct for both winner’s curse and confounding by computing the conditional distribution of the replication summary statistic given the initial summary statistic (Equation 4). By taking into account the additional variance in the statistics from confounders, we are able to more accurately model the summary statistic data from the two studies (Figure 2B). The model that only accounts for winner’s curse predicted that 84 variants would replicate, whereas our model that also accounts for confounding predicted that only 4 variants would replicate, which is substantially closer to the true value of 5 variants. This difference in predictions is due to the study-specific confounding effects estimated in the second model, which both decrease the expected value of the statistics in the replication study and increase their variance. After correcting for winner’s curse and confounding, most variants are within the 95% confidence intervals for the model.
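The joint and conditional distributions referenced as Equations 3 and 4 can be sketched as follows, assuming equal sample sizes, unit noise variance, and independent zero-mean normal priors on the genetic effect and on each study's confounder. The variance symbols $\sigma_g^2$, $\sigma_{c1}^2$, and $\sigma_{c2}^2$ are naming choices made here for illustration.

```latex
\begin{pmatrix} s_k^{(1)} \\ s_k^{(2)} \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix}
\sigma_g^2 + \sigma_{c1}^2 + 1 & \sigma_g^2 \\
\sigma_g^2 & \sigma_g^2 + \sigma_{c2}^2 + 1
\end{pmatrix}
\right)

s_k^{(2)} \mid s_k^{(1)} \sim N\!\left(
\frac{\sigma_g^2}{\sigma_g^2 + \sigma_{c1}^2 + 1}\, s_k^{(1)},\;
\sigma_g^2 + \sigma_{c2}^2 + 1 - \frac{\sigma_g^4}{\sigma_g^2 + \sigma_{c1}^2 + 1}
\right)
```

In this form, confounding in the discovery study ($\sigma_{c1}^2$) shrinks the conditional mean further toward zero, and confounding in either study widens the conditional variance, which matches the qualitative behavior described above.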
For each data set, we compute estimates of the summary statistics that we would expect using a framework that only accounts for winner’s curse and a framework that accounts for winner’s curse and confounding. We also compute the expected replication rate under the two models. We apply this framework to simulated data and 100 human GWAS in the GWAS catalog. We compare the predicted replication rates under the two models with the true replication rate.
Confounding explains low replication in simulated data
We generated simulated data to demonstrate that our approach accurately models the effects of winner’s curse and confounding to explain low replication in GWAS studies (see Methods). We set the variance for the genetic and confounding effects to a range of values from 0.0 to 3.0. Using multiple combinations of fixed parameters, we simulated summary statistics for 1000 variants by drawing the shared genetic effect, study-specific confounders, and study-specific error from normal distributions. We then computed the summary statistics for each variant as the sum of the genetic effect, the study-specific confounder, and the study-specific error. We define the true replication rate to be the percentage of variants that are significant in the discovery study that are also significant in the replication study using a Bonferroni correction for multiple testing.
We directly compared our method with a simplified model that only takes into account winner’s curse. When only accounting for winner’s curse, the predicted replication rate was often much higher than the true replication rate (Figure 3). The winner’s curse model only accurately predicted the replication rate when the confounding for both studies was set to zero (i.e., $\sigma_{c1}^2 = 0$ and $\sigma_{c2}^2 = 0$). This indicates that when confounding exists between two GWAS studies, the two studies may have different effect sizes. Thus, a model that only accounts for winner’s curse may overestimate the expected replication in the presence of confounding.
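The simulation described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it omits any sample-size scaling, uses a two-sided test, and applies the Bonferroni correction in the replication study over the variants carried forward from discovery. All three choices, along with the function names, are assumptions made here.

```python
import random
from statistics import NormalDist


def simulate(m, sigma_g2, sigma_c1_2, sigma_c2_2, seed=0):
    """Draw (s1, s2) for m variants as genetic effect + confounder + noise."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(m):
        lam = rng.gauss(0.0, sigma_g2 ** 0.5)  # shared genetic effect
        s1 = lam + rng.gauss(0.0, sigma_c1_2 ** 0.5) + rng.gauss(0.0, 1.0)
        s2 = lam + rng.gauss(0.0, sigma_c2_2 ** 0.5) + rng.gauss(0.0, 1.0)
        pairs.append((s1, s2))
    return pairs


def z_threshold(alpha, n_tests):
    """Two-sided z-score cutoff for a Bonferroni-corrected level alpha."""
    return NormalDist().inv_cdf(1.0 - alpha / (2.0 * n_tests))


def true_replication_rate(pairs, alpha=0.05):
    """Fraction of discovery-significant variants also significant in replication."""
    z1 = z_threshold(alpha, len(pairs))
    hits = [(s1, s2) for s1, s2 in pairs if abs(s1) > z1]
    if not hits:
        return float("nan")
    z2 = z_threshold(alpha, len(hits))  # only carried-forward variants are tested
    return sum(abs(s2) > z2 for _, s2 in hits) / len(hits)


# Example: 1000 variants, genetic variance 3.0, with and without confounding
rate_clean = true_replication_rate(simulate(1000, 3.0, 0.0, 0.0, seed=1))
rate_confounded = true_replication_rate(simulate(1000, 3.0, 3.0, 3.0, seed=1))
```

Comparing `rate_clean` with `rate_confounded` over several seeds gives a rough sense of how study-specific confounding lowers replication for a fixed genetic variance.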
We then applied our method that takes into account both winner’s curse and study-specific confounding. For simulations where the confounding is greater than zero, the predicted replication rate under this model was closer to the true replication rate than the simplified model that only accounts for winner’s curse (Figure 3). To ensure that our maximum likelihood estimates of the variance parameters were accurate, we compared the estimates with the true values. For all simulations, the maximum likelihood estimates of the variance parameters are close to the true values (Figure 4).
Accounting for confounding better explains replication rate in 100 human GWAS datasets
We then apply our framework to 100 human GWAS studies previously curated to require summary statistic data availability, a focus on human genetics, and other consistency criteria [4]. All studies have a discovery and replication design, where only the significant SNPs in the discovery study are tested in the replication study. We use the summary statistics from these discovery and replication studies to test our framework’s ability to capture the effects of winner’s curse and confounding. After learning the variance parameters for the genetic and confounding effects, we calculate the predicted replication rate under the model accounting for winner’s curse and the model accounting for winner’s curse and confounding (see Methods). We compare these predicted replication rates to the true replication rates to assess which model explains the observed replication better.
We define the true replication rate to be the percentage of variants that are significant in the discovery study that are also significant in the replication study. We use a Bonferroni-adjusted p-value threshold of 0.05/M for each study, where M is the number of SNPs tested in the study. Of the 1652 reported GWAS variants, only 519 (31%) replicate. Using the simplified model that does not account for confounding, we would expect 1552 (94%) of the variants to replicate. However, when we estimate the effect of confounding in our framework, we expect 548 (33%) of the variants to replicate, which is very close to the observed value. This gives evidence of a substantial bias beyond what we would expect from winner’s curse alone. Our study-specific replication rates also show that accounting for confounding improves prediction of replication rate, indicating that accounting for confounding is important for understanding patterns of replication across studies (Figure 5).
We compare our predicted replication rates with those previously reported by Palmer and Pe’er, who correct for the expected bias in observed effect due to winner’s curse in the same 100 GWAS studies [4]. At a Bonferroni-adjusted significance level of 0.05, Palmer and Pe’er predict that 610 loci will replicate, which is more than both the true number of replicating variants (519) and the number predicted by our method when accounting for confounders (548). This suggests that by utilizing replication studies, we can account for more heterogeneity due to confounding and explain replication better than adjusting for winner’s curse alone.
Estimates of study-specific confounding elucidate lack of replication
Our framework models additional variation in the summary statistics that is due to study-specific confounders. To further assess the effect of our estimated levels of confounding on replication rate, we analyzed the distribution of estimated confounding in all studies. The estimated value of $\sigma_{c1}^2$, the confounding variance in the discovery study, was negatively correlated (Spearman ρ = −0.84) with the true replication rate in these studies (Figure 6). In many studies, the estimated level of confounding was substantial: the variance due to confounding was an order of magnitude larger than the variance due to noise (winner’s curse). We stratified the studies by the total number of significant variants in the discovery study since our estimates of $\sigma_{c1}^2$ may be less robust for studies with only one significant variant. In studies with at least 50 significant variants, the correlation between confounding and true replication rate is strongest (Spearman ρ = −0.95). For subsequent analyses, we focused only on studies that have at least 50 significant SNPs in the discovery GWAS (8 studies).
In order to understand why some studies replicate poorly, we analyzed the ancestry of the discovery and replication studies. When GWAS are performed in populations with different ancestries, differences in the true effect sizes between populations can contribute to lack of replication. Thus, we expect that studies using homogeneous populations would replicate better than studies using heterogeneous populations. However, we observed a range of confounding and replication for both types of studies (Figure 6). For instance, the studies using heterogeneous populations had replication rates ranging from 2% to 75% [22, 23, 24, 25, 19, 26]. Of the two studies from homogeneous populations, one study had a replication rate of 27% [27], while the other had a replication rate of only 2% [20]. While ancestry explains replication inconsistently, our estimates of confounding can distinguish between studies where ancestry is correctly accounted for and studies where it is not (Figure 6).
Another potential cause of poor replication is sample size. When sample sizes are small, winner’s curse may contribute to lack of replication in GWAS studies. The study with the smallest sample size (176 individuals) also had the lowest replication rate (1%) and highest amount of confounding [22]. While the correlation between sample size and true replication is moderate (Spearman ρ = 0.46), there are some studies where smaller sample sizes have higher replication rates and vice versa. Our model can be used to identify when small sample sizes negatively affect replication rate.
Discussion
We developed a novel statistical framework to correct for winner’s curse and study-specific confounding in GWAS data. This framework utilizes GWAS replications to identify the presence of confounders without relying on assumptions to distinguish between polygenicity and confounding.
We showed through simulations that our model accurately estimates the variance of the genetic and confounding effects and that our model can be used to explain replication rates. When applying our method to 100 human GWAS studies, we showed that a model that accounts for winner’s curse and confounding explains replication rates more accurately than a model that only accounts for winner’s curse. While the estimated confounding in the discovery study explains the replication rate well, ancestry and size do not explain the replication consistently. We also showed that confounding is highly prevalent in GWAS studies. This indicates that modeling residual confounding is necessary for understanding lack of replication in GWAS studies.
One of the difficulties in our analyses is that some GWAS studies have very few significant variants, making the maximum likelihood estimates of the variance parameters unstable. Theoretically, it is possible to compute the variance parameters using additional variants that were not significant in the initial discovery study. However, summary statistics for these variants are often not computed in replication studies, in order to reduce the multiple testing burden. Nevertheless, as GWAS studies have increasingly larger sample sizes, we expect that the number of GWAS variants will increase and make our estimated parameters increasingly robust.
Methods
GWAS overview
In GWAS studies, an association study is performed between each genetic variant and the phenotype. The effect size of each variant k is determined by estimating the maximum likelihood parameters of Equation 5, where $y_j$ is the phenotype of individual j, $\mu$ is the phenotypic mean, $x_{kj}$ is the normalized genotype of individual j at variant k, $\beta_k$ is the effect size of variant k, $e_j$ is the error, and N is the number of individuals.
In vector notation, Equation 5 becomes the following.
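The per-individual model (Equation 5) and its vector form can be sketched from the variable definitions above. Normally distributed errors, and the symbol $\sigma_e^2$ for their variance, are assumptions made here so that the maximum likelihood estimates are well defined.

```latex
y_j = \mu + \beta_k x_{kj} + e_j, \qquad e_j \sim N(0, \sigma_e^2), \qquad j = 1, \dots, N

\mathbf{y} = \mu \mathbf{1} + \beta_k \mathbf{x}_k + \mathbf{e}, \qquad \mathbf{e} \sim N(\mathbf{0}, \sigma_e^2 \mathbf{I})
```

Here $\mathbf{y}$, $\mathbf{x}_k$, and $\mathbf{e}$ stack the phenotypes, normalized genotypes, and errors of the N individuals, and $\mathbf{1}$ is the all-ones vector.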
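The resulting maximum likelihood estimates are $\hat{\mu} = \frac{1}{N}\mathbf{1}^T\mathbf{y}$ and $\hat{\beta}_k = \frac{\mathbf{x}_k^T\mathbf{y}}{\mathbf{x}_k^T\mathbf{x}_k}$. The residuals can be used to estimate the standard error $\hat{\sigma}_e = \sqrt{\frac{\|\mathbf{y} - \hat{\mu}\mathbf{1} - \hat{\beta}_k\mathbf{x}_k\|^2}{N-2}}$. The standard error of the estimator $\hat{\beta}_k$ is $\hat{\sigma}_e/\sqrt{N}$. Since the sample sizes for GWAS studies are large, the association statistic $S_k = \hat{\beta}_k\sqrt{N}/\hat{\sigma}_e$ follows an approximately normal distribution (Equation 8).
Under the null hypothesis, $S_k$ will follow the standard normal distribution, which can be used to compute the significance of association. In the standard GWAS framework, we assume that the standardized effect size is caused by a true genetic effect $\lambda_k$. Thus, Equation 8 can be rewritten as $S_k \sim N(\lambda_k\sqrt{N}, 1)$.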
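The computation of the association statistic can be illustrated with a minimal sketch for a single variant. It assumes the genotype is normalized to mean 0 and variance 1 and that the residual variance uses N − 2 degrees of freedom; both are assumptions made here, and the function name is hypothetical. Monomorphic variants (zero genotype variance) are not handled.

```python
import math


def association_statistic(genotypes, phenotypes):
    """Return (beta_hat, sigma_e_hat, statistic) for one variant."""
    n = len(phenotypes)
    mean_x = sum(genotypes) / n
    sd_x = math.sqrt(sum((g - mean_x) ** 2 for g in genotypes) / n)
    x = [(g - mean_x) / sd_x for g in genotypes]  # normalized genotype

    mu_hat = sum(phenotypes) / n
    beta_hat = sum(a * b for a, b in zip(x, phenotypes)) / sum(a * a for a in x)

    # Residual standard deviation with n - 2 degrees of freedom
    residuals = [b - mu_hat - beta_hat * a for a, b in zip(x, phenotypes)]
    sigma_e_hat = math.sqrt(sum(r * r for r in residuals) / (n - 2))

    # Effect estimate divided by its standard error, sigma_e_hat / sqrt(n)
    statistic = beta_hat * math.sqrt(n) / sigma_e_hat
    return beta_hat, sigma_e_hat, statistic


# Example: six individuals with minor-allele counts 0, 1, or 2
beta_hat, sigma_e_hat, stat = association_statistic(
    [0, 1, 2, 0, 1, 2], [1.0, 2.1, 2.9, 1.2, 1.9, 3.1]
)
```

Because the normalized genotype has sum of squares equal to N, dividing `beta_hat` by its standard error reduces to multiplying by `sqrt(n) / sigma_e_hat`.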
Correcting GWAS summary statistics for winner’s curse
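Given Equation 8, we can write the distributions of summary statistics for an initial discovery study and a replication study as $s_k^{(1)} \sim N(\lambda_k\sqrt{N_1}, 1)$ and $s_k^{(2)} \sim N(\lambda_k\sqrt{N_2}, 1)$, respectively, where $N_1$ and $N_2$ are the sample sizes of the two studies.
We assume that $\lambda_k$ is the same across multiple studies on the same trait. We define the prior distribution of $\lambda_k$ as $\lambda_k \sim N(0, \sigma_g^2)$, where $\sigma_g^2$ is the variance in the true effect size. Thus, the posterior distributions of $s_k^{(1)}$ and $s_k^{(2)}$ are also normally distributed.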
We correct for winner’s curse by computing the conditional distribution of the replication statistic $s_k^{(2)}$ given the discovery statistic $s_k^{(1)}$. We derive the conditional distribution from the joint distribution as follows.
The covariance between $s_k^{(1)}$ and $s_k^{(2)}$ is computed as follows.
Therefore, the joint distribution of $s_k^{(1)}$ and $s_k^{(2)}$ is Equation 11.
Conditioning on $s_k^{(1)}$, we obtain Equation 12.
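The derivation can be sketched as follows under the assumption that the statistic in study i has mean $\lambda_k\sqrt{N_i}$ and unit noise variance, with $\lambda_k \sim N(0, \sigma_g^2)$ and independent noise terms $\epsilon^{(i)}$. The symbols $N_1$, $N_2$, $\sigma_g^2$, and $\epsilon^{(i)}$ are naming choices made here, so this is an illustration consistent with the surrounding definitions rather than a verbatim statement of Equations 11 and 12.

```latex
\mathrm{Cov}\!\left(s_k^{(1)}, s_k^{(2)}\right)
= \mathrm{Cov}\!\left(\sqrt{N_1}\lambda_k + \epsilon^{(1)},\; \sqrt{N_2}\lambda_k + \epsilon^{(2)}\right)
= \sqrt{N_1 N_2}\,\sigma_g^2

\begin{pmatrix} s_k^{(1)} \\ s_k^{(2)} \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix}
N_1\sigma_g^2 + 1 & \sqrt{N_1 N_2}\,\sigma_g^2 \\
\sqrt{N_1 N_2}\,\sigma_g^2 & N_2\sigma_g^2 + 1
\end{pmatrix}
\right)

s_k^{(2)} \mid s_k^{(1)} \sim N\!\left(
\frac{\sqrt{N_1 N_2}\,\sigma_g^2}{N_1\sigma_g^2 + 1}\, s_k^{(1)},\;
N_2\sigma_g^2 + 1 - \frac{N_1 N_2\,\sigma_g^4}{N_1\sigma_g^2 + 1}
\right)
```

The last line is the standard conditional of a bivariate normal: the mean is the covariance divided by the variance of $s_k^{(1)}$, times $s_k^{(1)}$, and the variance is the variance of $s_k^{(2)}$ minus the squared covariance divided by the variance of $s_k^{(1)}$.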
For each value of $s_k^{(1)}$, the mean of the conditional distribution gives the expected summary statistic in a replication study, correcting for winner’s curse. This distribution can also be used to create a confidence interval for the replication summary statistic.
Correcting GWAS summary statistics for winner’s curse and confounding
Suppose that, in addition to study-specific environmental effects, there are also study-specific confounders. We model these confounders in the discovery study and replication study as $\delta_k^{(1)} \sim N(0, \sigma_{c1}^2)$ and $\delta_k^{(2)} \sim N(0, \sigma_{c2}^2)$, respectively. We decompose the effect size into the sum of a genetic component ($\lambda_k$) and a confounding component ($\delta_k^{(i)}$).
Similar to the case without confounding, the posterior distributions of $s_k^{(1)}$ and $s_k^{(2)}$ are normally distributed (Equations 15 and 16).
Therefore, the joint distribution is Equation 17.
Similar to the winner’s curse only model, we can find the expected summary statistic in a replication study, correcting for winner’s curse and confounding, by computing the conditional distribution of the replication statistic given the discovery statistic (Equation 18).
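The joint and conditional distributions with confounding can be sketched as follows, assuming the statistic in study i has mean $\sqrt{N_i}(\lambda_k + \delta_k^{(i)})$ and unit noise variance, with independent priors $\lambda_k \sim N(0, \sigma_g^2)$ and $\delta_k^{(i)} \sim N(0, \sigma_{ci}^2)$. The symbol names and the sample-size scaling are assumptions made here for illustration.

```latex
\begin{pmatrix} s_k^{(1)} \\ s_k^{(2)} \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix}
N_1(\sigma_g^2 + \sigma_{c1}^2) + 1 & \sqrt{N_1 N_2}\,\sigma_g^2 \\
\sqrt{N_1 N_2}\,\sigma_g^2 & N_2(\sigma_g^2 + \sigma_{c2}^2) + 1
\end{pmatrix}
\right)

s_k^{(2)} \mid s_k^{(1)} \sim N\!\left(
\frac{\sqrt{N_1 N_2}\,\sigma_g^2}{N_1(\sigma_g^2 + \sigma_{c1}^2) + 1}\, s_k^{(1)},\;
N_2(\sigma_g^2 + \sigma_{c2}^2) + 1 - \frac{N_1 N_2\,\sigma_g^4}{N_1(\sigma_g^2 + \sigma_{c1}^2) + 1}
\right)
```

Because the confounders are independent across studies, they add to the diagonal of the covariance matrix but leave the off-diagonal term unchanged. Setting $\sigma_{c1}^2 = \sigma_{c2}^2 = 0$ recovers the winner's-curse-only model.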
Predicting the replication rate
The conditional distribution can also be used to predict the replication rate of an initial discovery study. For a genetic variant k with association statistic $s_k^{(1)}$ in an initial discovery study, the probability of replication is the probability, under this conditional distribution, that the replication statistic $s_k^{(2)}$ passes the threshold z, where z is the z-score threshold corresponding to a specific significance threshold t for the replication study.
The predicted replication rate can be calculated as the average probability of replication across all significant variants in the discovery study. Let 𝒜 be the set of variants found to be significant in the discovery study. The predicted replication rate (r) is defined as $r = \frac{1}{|\mathcal{A}|}\sum_{k \in \mathcal{A}} p_k$, where $p_k$ is the probability of replication for variant k.
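The averaging step can be illustrated with a short sketch. It takes, for each significant discovery variant, the mean and variance of the conditional distribution of the replication statistic and treats replication as a two-sided event (the replication statistic exceeding z in absolute value). The two-sided definition and the function names are assumptions made here; a same-direction requirement would change the first function.

```python
from statistics import NormalDist


def replication_probability(cond_mean, cond_var, z):
    """P(|s2| > z) when s2 | s1 is normal with the given mean and variance."""
    dist = NormalDist(cond_mean, cond_var ** 0.5)
    return dist.cdf(-z) + (1.0 - dist.cdf(z))


def predicted_replication_rate(conditionals, z):
    """Average replication probability over significant discovery variants.

    conditionals: list of (conditional mean, conditional variance) pairs,
    one per variant in the significant set.
    """
    probs = [replication_probability(m, v, z) for m, v in conditionals]
    return sum(probs) / len(probs)


# Example: three variants with conditional means 1.0, 3.0, 5.0 and variance 1.5
z_cut = NormalDist().inv_cdf(1.0 - 0.05 / (2 * 3))  # Bonferroni over 3 tests
rate = predicted_replication_rate([(1.0, 1.5), (3.0, 1.5), (5.0, 1.5)], z_cut)
```

The same function serves both models: only the conditional means and variances passed in differ between the winner's-curse-only model and the model with confounding.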
Estimating the variance components from data
The genetic variance $\sigma_g^2$ and the confounding variances $\sigma_{c1}^2$ and $\sigma_{c2}^2$ are not known a priori. We estimate these parameters from the data using the following procedures. Since in many cases the replication study only tests variants that are significant in the initial study, we calculate the variance components using only data from variants that are significant in the initial study and tested in both studies. Let the total number of variants studied be M and the total number of significant variants in the initial study be M′.
We first calculate the maximum likelihood estimate (MLE) for the total variance of the statistics in the discovery study, $s^{(1)}$ (Equation 20). We include the unobserved variants that are not significant in the first study in the likelihood by integrating over all of their possible values.
We then use this estimate of the total variance to compute the expected value of the replication statistics $s^{(2)}$ for different values of $\sigma_g^2$. We select the value of $\sigma_g^2$ that minimizes the residual sum of squares between the predicted value of $s^{(2)}$ and the true value (Equation 21).
We can then decompose the total variance in $s^{(1)}$ using the previously estimated total variance and genetic variance $\hat{\sigma}_g^2$. We solve for $\sigma_{c1}^2$ as follows.
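One way to write this decomposition, as a sketch: assume the variance of the discovery statistic is $N_1(\sigma_g^2 + \sigma_{c1}^2) + 1$, that is, genetic and confounding variance scaled by the discovery sample size plus unit noise variance. The symbol $\hat{\sigma}_t^2$ for the estimated total variance and this scaling are assumptions made here, not notation confirmed by the text.

```latex
\hat{\sigma}_t^2 = N_1\left(\hat{\sigma}_g^2 + \hat{\sigma}_{c1}^2\right) + 1
\quad\Longrightarrow\quad
\hat{\sigma}_{c1}^2 = \frac{\hat{\sigma}_t^2 - 1}{N_1} - \hat{\sigma}_g^2
```

Under these assumptions, the discovery-study confounding variance is whatever part of the total variance is left after removing the noise and the genetic component.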
Finally, we use the joint distribution of $s^{(1)}$ and $s^{(2)}$ (Equation 17) to compute the MLE of $\sigma_{c2}^2$, using the previously estimated $\hat{\sigma}_g^2$ and $\hat{\sigma}_{c1}^2$.
Data generating model
We generated simulated data to demonstrate that our approach can capture the effects of winner’s curse and confounding to explain low replication in GWAS studies. To show that our model is more effective at explaining low replication than a method that only takes into account winner’s curse, we directly compare our method with a simplified model that only takes into account winner’s curse.
We set the variance of the shared genetic effect ($\sigma_g^2$) and the variances of the study-specific confounders ($\sigma_{c1}^2$ and $\sigma_{c2}^2$) to fixed values between 0.0 and 3.0. For each variant, we sampled from the following distributions.
We assumed that sample sizes for the discovery and replication studies were 5000 and 1000, respectively. We computed the summary statistics for each study as the sum of the genetic effect, the study-specific confounder, and the study-specific error. Using our framework, we estimate the variance components and compare these estimates with the true values.
Acknowledgment
J.Z., J.Z., S.F., R.B., and E.E. are supported by National Science Foundation grants 0513612, 0731455, 0729049, 0916676, 1065276, 1302448, 1320589, 1331176, and 1815624, and National Institutes of Health grants K25-HL080079, U01-DA024417, P01-HL30568, P01-HL28481, R01-GM083198, R01-ES021801, R01-MH101782 and R01-ES022282. J.Z. is supported by National Institutes of Health Award Number T32MH073526 and National Science Foundation Graduate Research Fellowship under Grant DGE-1650604.