Abstract
We present a new statistical test of association between a trait (either quantitative or binary) and genetic markers, which we theoretically and practically prove to be robust to arbitrarily complex population structure. The statistical test involves a set of parameters that can be directly estimated from large-scale genotyping data, such as that measured in genome-wide association studies (GWAS). We also derive a new set of methodologies, called a genotype-conditional association test (GCAT), shown to provide accurate association tests in populations with complex structures, manifested in both the genetic and environmental contributions to the trait. We demonstrate the proposed method on a large simulation study and on the Northern Finland Birth Cohort study. In the Finland study, we identify several new significant loci that other methods do not detect. Our proposed framework provides a substantially different approach to the problem from existing methods. We provide some discussion on its similarities and differences with the linear mixed model and principal component approaches.
INTRODUCTION
Performing genome-wide tests of association between a trait and genetic markers is one of the most important research efforts in modern genetics [1-3]. However, a major problem to overcome is how to test for associations in the presence of population structure [4]. Human populations are often structured in the sense that the genotype frequencies at a particular locus are not homogeneous throughout the population. Rather, there are latent variables (such as geography or ancestry) that directly affect the allele frequencies of the genotypes. At the same time, there may be other loci and non-genetic factors that also correlate with these latent variables, which in turn are correlated with the trait of interest. When this occurs, genetic markers become spuriously statistically associated with the trait of interest despite the fact that there is no biological connection.
The importance of addressing association testing in structured populations is evidenced by the existence of a large literature of methods proposed for this problem [5,6]. The well established methods all take a similar strategy in that the trait is modeled in terms of the genetic markers of interest, while attempting to adjust for genetic structure. Two popular approaches are to correct population structure by including principal components of genotypes as adjustment variables [7,8] or by fitting a linear mixed effects model involving an estimated kinship or covariance matrix from the individuals’ genotypes [9,10]. Previous work investigating the limitations of these two methods includes Wang, et al. (2013) [11]. These two approaches have been shown to be based on a common model that makes differing assumptions about how the kinship or covariance matrices are utilized in the model [5]. This common model does not allow for non-genetic (e.g., environmental) contributions to the trait to be dependent with population structure. The linear mixed effects model requires that the genetic component is composed of small effects that additively are well-approximated by the Normal distribution. The model itself is therefore an approximation, and it is not yet possible to theoretically prove that a test based on this model is robust to structure for the more general class of relevant models that we investigate.
By taking a substantially different approach that essentially reverses the placement of the trait and genotype in the model, we formulate and provide a theoretical solution to the problem of association testing in structured populations for both quantitative and binary traits under general assumptions about the complexity of the population structure and its relationship to the trait through both genetic and non-genetic factors. This theoretical solution directly leads to a method for addressing the problem in practice that differs in key ways from the mixed model and principal component approaches. The method is straightforward: a model of structure is first estimated from the genotypes, and then a logistic regression is performed where the SNP genotypes are logistically regressed on the trait plus an adjustment based on the fitted structure model. The coefficient corresponding to the trait is then tested for statistical significance. The class of models to which this provides a test robust to structure is fairly general.
This association testing framework is robust to population genetic structure, as well as to non-genetic effects that are dependent on or correlated with population genetic structure (for example, lifestyle and environment may be correlated with ancestry) and to heteroskedasticity that is dependent on structure. We introduce a test based on this framework, called “genotype conditional association test” (GCAT). We show the proposed method corrects for structure on simulated data with a quantitative trait and compares favorably to existing methods. We also apply the method to the Northern Finland Birth Cohort data [12] and identify several new associated loci that have not been identified by existing methods. For example, the proposed method is the only one to identify a SNP (rs2814982) associated with height, which we note is linked to another SNP (rs2814993) that has been associated with skeletal frame size [13]. We discuss the advantages and disadvantages of the proposed framework with existing approaches, and we conclude that the proposed framework will be useful in future studies as sample sizes and the complexity of structure increase.
RESULTS
Population Structure Model
Suppose that there are n individuals, each with m measured SNP genotypes. The genotype for SNP i in individual j is denoted by xij ∈ {0, 1, 2}, i = 1, 2, …, m, j = 1, 2, …, n. We collect these SNP genotypes into an m × n matrix X, where the (i, j) entry is xij. We denote the genotypes for individual j by xj = (x1j, x2j, …, xmj)T.
We utilize our recently developed framework that flexibly models complex population structures for diallelic loci [14]. Let Z be an unobserved variable describing how individuals fit into the underlying population structure. For a SNP i, the allele frequency πi can be viewed as being a function of Z, πi(Z). For a random sample of n individuals from an overall population, we therefore have sampled population structure positions z1, z2, …, zn with resulting allele frequencies πi(z1), πi(z2), …, πi(zn) for SNP i. In Hao et al. (2013) [14], we formulate and estimate a model for m SNPs simultaneously while providing a flexible parameterization of the form of πi(Z).
For shorthand, πij ≡ πi(zj) is the allele frequency for SNP i conditioned on the ancestry state of individual j. The πij values are called “individual-specific allele frequencies.” These allele frequencies can be collected into an m × n matrix F, where the (i, j) entry is πij. Note that E[xij/2∣zj] = πij, and when Hardy-Weinberg equilibrium holds, xij∣zj ∼ Binomial(2, πij). We utilize the framework from Hao et al. (2013) [14] that allows the simultaneous estimation of all πij from a given genotype data set X. Specifically, it provides estimates of latent variables that form a linear basis of the logit(πij) quantities, which turns out to be the most convenient scale on which to estimate a model of structure for the proposed testing framework. The model and estimation procedure is called “logistic factor analysis” (LFA). It should be noted that other well-behaved estimates of πij may be utilized as well. Further details are provided in Methods.
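As a minimal sketch of the sampling model just described, the following Python draws genotypes xij ∼ Binomial(2, πij) from a small matrix F of individual-specific allele frequencies and checks empirically that xij/2 averages to πij. The matrix values, the seed, and the function name `draw_genotypes` are illustrative assumptions and are not taken from the analyses reported here.

```python
import random

def draw_genotypes(F, rng):
    """Draw x_ij ~ Binomial(2, pi_ij) for each entry of the m x n matrix F."""
    # A Binomial(2, p) draw is the sum of two independent Bernoulli(p) draws.
    return [[int(rng.random() < p) + int(rng.random() < p) for p in row]
            for row in F]

# Hypothetical individual-specific allele frequencies: 2 SNPs x 3 individuals.
F = [[0.1, 0.5, 0.9],
     [0.3, 0.3, 0.7]]

rng = random.Random(0)
X = draw_genotypes(F, rng)

# Averaging x_ij / 2 over repeated draws should approach pi_ij,
# because E[x_ij / 2 | z_j] = pi_ij. Here we check SNP 1, individual 3.
reps = 20000
mean_half = sum(draw_genotypes(F, rng)[0][2] for _ in range(reps)) / (2 * reps)
print(X, round(mean_half, 2))
```

The Binomial(2, πij) form relies on Hardy-Weinberg equilibrium holding conditional on zj, as stated in the text.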
Trait Models
We assume a trait (quantitative or binary) has been measured on each individual, which we denote by yj, j = 1, 2, …, n. One way in which spurious associations occur in the presence of population structure is that SNPs become correlated with each other when structure is not taken into account. Therefore, if a SNP is causal for the trait of interest, then any other SNP correlated with this causal SNP may also show an association. For SNPs in linkage disequilibrium due to their physical proximity to the causal SNP, one expects these to be associated with the trait regardless of structure. However, in the presence of structure, there may be many unlinked SNPs that also show associations with the trait due to the fact that structure induces correlations of these SNPs with the causal SNP. Indeed, one of the early methods for detecting structure in association studies was to show that many randomly chosen, unlinked SNPs show associations to the trait [4]. This source of confounding is typically the main focus of association tests designed for structured populations.
Another key issue that is less often considered is the fact that lifestyle and environment are also often related to ancestry (Figure 1a). This implies that non-genetic effects may also be directly related to structure. We therefore extend the concept of the latent variable Z to include not only population genetic structure, but also lifestyle and environment: Z = (structure, lifestyle, environment). For each observed individual j, there is an underlying latent variable zj that contains the information about structure, lifestyle, and environment for individual j. We allow for the case that structure or ancestry may be directly influential on or related to lifestyle and environment, and that all three of these variables may influence the trait of interest. An association test that is immune to structure should also be immune to the non-genetic effects that are confounded with structure.
We consider the following models of quantitative and binary traits. We write the trait models in terms of additive genetic effects, but the framework can be extended to account for dominance models and interactions, and the models can also incorporate adjustment variables that capture known sources of trait variation.
The quantitative trait model is

$$y_j = \alpha + \sum_{i=1}^{m} \beta_i x_{ij} + \lambda_j + \epsilon_j, \tag{1}$$

where α is an intercept, βi is the genetic effect of SNP i on the trait, λj is the random non-genetic effect, and ϵj is the random noise variation. To allow the interdependence of structure, lifestyle, and environment, we assume that xj = (x1j, …, xmj)T, λj, and σj² may all be functions of zj. We assume that ϵj∣zj ∼ Normal(0, σj²), which allows for heteroskedasticity of the random noise variation. The distribution of λj can remain unspecified, although we assume that λj and zj may be dependent random variables. The population genetic model summarized above shows how the distribution of xj depends on zj. Without having observed zj, it follows that xj, λj, and ϵj are dependent random variables; however, we assume that conditional on zj, these random variables are independent.
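The binary trait model is

$$\log\left(\frac{\Pr(y_j = 1 \mid x_j, \lambda_j)}{\Pr(y_j = 0 \mid x_j, \lambda_j)}\right) = \alpha + \sum_{i=1}^{m} \beta_i x_{ij} + \lambda_j, \tag{2}$$

where again βi is the genetic effect of SNP i on the trait, λj is the non-genetic effect, and we allow for the case that xj and λj may be dependent due to the common confounding latent variable zj as described for the quantitative trait model.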
We have shown that the linear mixed effects model and principal components approaches involve more restrictive assumptions about the trait models utilized in testing for associations (Methods).
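To make the confounding in the quantitative trait model concrete, the sketch below simulates a trait as an intercept plus additive genetic effects, a non-genetic effect λj, and Normal noise, with the allele frequencies, λj, and the noise scale all depending on a latent position zj. The one-dimensional zj, the effect sizes, the specific functional forms, and the name `simulate_trait` are hypothetical choices made only for illustration.

```python
import random

def simulate_trait(x_j, beta, lam_j, sigma_j, rng, alpha=0.0):
    """y_j = alpha + sum_i beta_i * x_ij + lambda_j + eps_j, eps_j ~ Normal(0, sigma_j^2)."""
    genetic = sum(b * x for b, x in zip(beta, x_j))
    return alpha + genetic + lam_j + rng.gauss(0.0, sigma_j)

rng = random.Random(1)
beta = [0.5, 0.0, -0.3]                  # hypothetical effects; the second SNP is null
z = [rng.random() for _ in range(500)]   # hypothetical latent position z_j in [0, 1]

y = []
for z_j in z:
    pi_j = [0.1 + 0.8 * z_j] * len(beta)  # allele frequencies depend on z_j
    x_j = [int(rng.random() < p) + int(rng.random() < p) for p in pi_j]
    lam_j = 2.0 * z_j                     # non-genetic effect depends on z_j
    sigma_j = 0.5 + z_j                   # noise scale depends on z_j (heteroskedasticity)
    y.append(simulate_trait(x_j, beta, lam_j, sigma_j, rng))
```

In this toy setting the null SNP is marginally correlated with the trait through zj even though its coefficient is zero, which is the kind of spurious association the test is designed to remove.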
Motivation and Rationale of the Proposed Test
The rationale for the proposed test is schematized in Figure 1. The SNP Xi and the trait Y become spuriously associated because they are under the influence of a common latent variable Z. This latent variable contains information on population structure, lifestyle, and environment, all of which may be interdependent and play a determining role in the trait. The problem is that we cannot directly observe Z and we would like to avoid making assumptions about its mathematical form. If we can successfully construct either Xi∣Z (the distribution of Xi conditional on Z) or Y∣Z, then it is possible to perform a test of association between Xi and Y that is immune to the effects of Z. Possible association tests should occur between Xi∣Z and Y, between Xi and Y∣Z, or between Xi∣Z and Y∣Z.
The linear mixed model and principal components approaches can be interpreted as attempts to estimate a model of Y∣Z. This requires additional assumptions about non-genetic and genetic effects, and their relationship to Z, specifically that there is no relationship between structure and non-genetic effects in the trait model (Methods and ref. [5]). Due to the massive number of SNPs that have been measured in GWAS, trying to construct Xi∣Z is appealing since we have an abundance of information about the effect of latent variables on the genotypes. (For example, this can easily be visualized in principal components constructed from the genotypes.) Our approach is therefore to carry out an association test between Xi∣Z and Y by specifically testing whether there is equality or not between Pr(Xi∣Y, Z) and Pr(Xi∣Z) (Figure 1b). If Pr(Xi∣Y, Z) = Pr(Xi∣Z) then there is no association between the SNP Xi and the trait Y; if Pr(Xi∣Y, Z) ≠ Pr(Xi∣Z), then there is an association. This test of association is in theory immune to population structure because we have taken into account Z.
One remaining problem is that we cannot observe Z. However, it is straightforward to show that when there is no association Pr(Xi∣Y, Z, πi(Z)) = Pr(Xi∣Y, πi(Z)) and Pr(Xi∣Z, πi(Z)) = Pr(Xi∣πi(Z)). In other words, in order to capture Xi∣Z, it suffices to capture Xi∣πi(Z), the effect of Z on the allele frequency of SNP i. We have recently developed a framework that flexibly parameterizes and estimates Xi∣πi(Z) [14]. In order to test whether Pr(Xi∣Y, πi(Z)) = Pr(Xi∣πi(Z)), we perform a logistic regression of the SNP genotypes Xi on the trait Y plus the transformed individual-specific allele frequencies, logit(πi(Z)), where logit(p) = log(p/(1 − p)) for 0 < p < 1. This inverse regression approach is a substantial departure from the most commonly employed methods that attempt to adjust for population structure.
Association Test Immune to Population Structure
We have derived a statistical hypothesis test of association that is equivalent to testing whether βi = 0 for each SNP i in the above trait models (1) and (2), and whose null distribution does not depend on structure or the non-genetic effects correlated with structure, making it immune to spurious associations due to structure (METHODS). Specifically, the test allows for general levels of complexity in structure because the test is based on adjusting for structure according to individual-specific allele frequencies.
We have proved a theorem (METHODS) that shows that βi = 0 in models (1) and (2) implies that bi = 0 in the following model:

$$x_{ij} \mid y_j, z_j \sim \text{Binomial}\left(2, \; \text{logit}^{-1}\left(a_i + b_i y_j + \text{logit}(\pi_{ij})\right)\right) \tag{3}$$

for all j = 1, 2, …, n. This establishes a model that can be used to test for associations in place of models (1) and (2). Note that the non-genetic effects, heteroskedasticity, and polygenic background do not appear in the above model used to test for associations. This is important because under our general assumptions, these terms can be difficult or even impossible to estimate in practice. Furthermore, testing for association under this model means that the test will have a valid null distribution regardless of the form of the non-genetic effects, heteroskedasticity, and polygenic background.
As fully detailed in METHODS, an association statistic whose null distribution is known can be constructed by testing whether bi = 0 in the above model, which we have shown is a valid test of βi = 0 in trait models (1) and (2). Briefly, the testing procedure works as follows:
1. Formulate and estimate a model of population structure that provides well-behaved estimates of the logit(πij) values. We specifically use the logistic factor analysis (LFA) approach of ref. [14], which has been shown to provide an accurate linear basis of the logit(πij) values.
2. For each SNP i, perform a logistic regression of the SNP genotypes on the trait values plus the model terms that estimate the logit(πij) values. Also, perform a logistic regression of the SNP genotypes on only the model terms that estimate logit(πij), where the trait is now excluded from the fit. These two model fits are compared via a likelihood ratio statistic, where the larger the statistic, the more evidence there is that bi ≠ 0.
3. Calculate a p-value for each SNP based on our result that when the null hypothesis of no association is true, βi = 0 in models (1) and (2), then the above statistic follows a χ² distribution with one degree of freedom for large sample sizes.
We call our proposed test the “genotype conditional association test” (GCAT). As a general concept, such an approach is sometimes called an inverse regression model because we consider E[x∣y] rather than E[y∣x].
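The testing procedure can be sketched for a single SNP in a few lines of Python. This is a simplified illustration under stated assumptions rather than the implementation used in this work: it adjusts for structure using the true logit(πij) values (closer to the oracle version of the test than to the LFA-based one), it fits a free coefficient on that adjustment term, and the helper names `solve`, `fit_loglik`, and `gcat_pvalue` and all simulated values are hypothetical. The p-value uses the identity Pr(χ² with 1 df > s) = erfc(√(s/2)).

```python
import math
import random

def solve(A, b):
    """Solve the small linear system A u = b by Gaussian elimination with pivoting."""
    k = len(b)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for cc in range(c, k + 1):
                M[r][cc] -= f * M[c][cc]
    u = [0.0] * k
    for r in reversed(range(k)):
        u[r] = (M[r][k] - sum(M[r][cc] * u[cc] for cc in range(r + 1, k))) / M[r][r]
    return u

def linear_predictor(coef, v):
    # Clamped only as a numerical safeguard against overflow in exp().
    return max(-30.0, min(30.0, sum(c * t for c, t in zip(coef, v))))

def fit_loglik(x, V, iters=25):
    """Maximized Binomial(2, p_j) log-likelihood (up to a constant), logit(p_j) = V[j] . coef."""
    k = len(V[0])
    coef = [0.0] * k
    for _ in range(iters):  # Newton-Raphson updates
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for x_j, v in zip(x, V):
            p = 1.0 / (1.0 + math.exp(-linear_predictor(coef, v)))
            for a in range(k):
                grad[a] += (x_j - 2.0 * p) * v[a]
                for b in range(k):
                    info[a][b] += 2.0 * p * (1.0 - p) * v[a] * v[b]
        coef = [c + s for c, s in zip(coef, solve(info, grad))]
    return sum(x_j * linear_predictor(coef, v)
               - 2.0 * math.log1p(math.exp(linear_predictor(coef, v)))
               for x_j, v in zip(x, V))

def gcat_pvalue(x, y, logit_pi):
    """Likelihood ratio test of b_i = 0, referred to a chi-square with 1 df."""
    full = [[1.0, y_j, l_j] for y_j, l_j in zip(y, logit_pi)]
    null = [[1.0, l_j] for l_j in logit_pi]
    stat = max(0.0, 2.0 * (fit_loglik(x, full) - fit_loglik(x, null)))
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Toy data: allele frequency and trait both depend on a latent position z_j.
rng = random.Random(3)
n = 400
z = [rng.random() for _ in range(n)]
pi = [0.1 + 0.8 * z_j for z_j in z]
logit_pi = [math.log(p / (1.0 - p)) for p in pi]
x = [int(rng.random() < p) + int(rng.random() < p) for p in pi]
y_null = [2.0 * z_j + rng.gauss(0.0, 1.0) for z_j in z]   # confounded, no SNP effect
y_causal = [x_j + y_j for x_j, y_j in zip(x, y_null)]     # SNP effect of size 1 added

stat_null, p_null = gcat_pvalue(x, y_null, logit_pi)
stat_causal, p_causal = gcat_pvalue(x, y_causal, logit_pi)
print(p_null, p_causal)
```

In practice the adjustment terms would be the estimated logistic factors from step 1 rather than the true logit(πij), and a tested statistical library would normally replace the hand-written Newton-Raphson fit.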
Simulation Studies
We performed an extensive set of simulations to demonstrate that the proposed test is robust to population structure and to assess its power to detect true associations (full technical details in Methods). We compared the proposed test to its oracle version where model (3) and test-statistic (6) are used with the true πij values. We also included in the simulation studies three important and popular methods: (i) the method of adjusting the trait and genotypes by principal components computed from the full set of genotypes [8] and (ii) two implementations of the linear mixed effects model approach [9,10], specifically EMMAX by Kang et al. (2010) [10] and GEMMA by Zhou and Stephens (2012) [15]. The methods are abbreviated as “PCA,” “LMM-EMMAX,” and “LMM-GEMMA.”
The complete simulation study on quantitative traits involved population structure constructed in 11 different ways for each of three different apportionments of variance among genetic effects, non-genetic effects, and random variation that all contribute to variation in the trait. Therefore, each configuration involved a constructed allele frequency matrix F and values assigned to variances Var(∑i βi xij), Var(λj), and Var(ϵj) from model (1). For each of these 33 = 11 × 3 configurations, we simulated 100 GWAS data sets, for a grand total of 3300 studies.
We simulated allele frequencies: (i) subject to structure estimated from three real data sets: HapMap, Human Genome Diversity Project (HGDP), and the 1000 Genomes Project (TGP), where the HapMap structure was simulated according to the Balding-Nichols model; (ii) at four different levels of admixture in the Pritchard-Stephens-Donnelly (PSD) model, which is an extension of the Balding-Nichols model; and (iii) for four different types of spatially defined structure. We intentionally simulated challenging population structures, having in mind that future GWAS such as the forthcoming “Genotype Tissue Expression” program (GTEx) data may involve particularly challenging forms of structure.
In order to provide an extra challenge to the proposed test, we simulated the allele frequencies from a model that differs from LFA model (4). We generated allele frequencies parameterized by F = ΓS, where F is the matrix of πij values, Γ is an m × d matrix and S is the d × n matrix that encapsulates the structure. This model captures as special cases the Balding-Nichols model and the PSD model [14]. It was also intended to provide an advantage to the PCA and LMM methods because the structure is manifested on the observed genotype scale [14], which is the same scale on which both methods estimate structure.
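As a small illustration of the parameterization F = ΓS, the sketch below builds individual-specific allele frequencies for an admixture-style special case in which the columns of Γ are allele frequencies in d ancestral populations and each column of S holds one individual's ancestry proportions. The numerical values and the helper name `matmul` are assumptions chosen for illustration, not the simulation settings used here.

```python
def matmul(G, S):
    """Plain-Python matrix product of an m x d matrix G and a d x n matrix S."""
    return [[sum(g * s for g, s in zip(row, col)) for col in zip(*S)] for row in G]

# Hypothetical Gamma: m = 3 SNPs x d = 2 ancestral populations.
Gamma = [[0.1, 0.9],
         [0.5, 0.5],
         [0.8, 0.2]]

# Hypothetical S: d = 2 x n = 3 individuals; each column sums to 1.
S = [[1.0, 0.5, 0.0],
     [0.0, 0.5, 1.0]]

F = matmul(Gamma, S)   # m x n matrix of individual-specific allele frequencies
print(F)
```

Because each column of S is a convex combination, every entry of F stays between the smallest and largest ancestral frequency in its row, so the result is always a valid allele frequency.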
We simulated 10 truly associated SNPs whose effect sizes are distributed according to a Normal distribution. All genotypes were simulated to be in linkage equilibrium so that true and false positives are unambiguous. We set the variances Var(∑i βi xij), Var(λj), and Var(ϵj) to be: (5%, 5%, 90%), (10%, 0%, 90%), and (10%, 20%, 70%). Setting these variances enforced a certain overall level of genetic contribution to the trait; therefore our simulation study results were minimally affected by the choice of 10 truly associated SNPs and the Normal distribution on their effect sizes. In each simulation scenario, we simulated data for m = 100,000 SNPs and n = 5000 individuals, except HGDP necessarily restricted us to n = 940 individuals and TGP to n = 1500 individuals. The dimension of the structure was set to d = 3, although we carried out the same simulations for d = 6 and the results were quantitatively very similar and qualitatively equivalent.
Each simulation configuration involved analyzing 100 GWAS data sets (X, y), where the Oracle method, the proposed GCAT method, and the PCA method were applied to each study. For a given simulated study, we obtained a set of m = 100,000 p-values. So-called “spurious associations” occur when the p-values corresponding to null (non-associated) SNPs are artificially small. For a given p-value threshold t, we expect there to be m0 × t false positives among the m0 p-values corresponding to null SNPs, where m0 = 100,000 – 10 in our case. At the same time, we can calculate the observed number of false positives simply by counting how many of the null SNP p-values are less than or equal to t. The excess observed false positives are spurious associations. A method properly accounts for structure when the average difference between the observed and expected numbers of false positives is zero. The best one can do on a study-by-study basis is captured by the Oracle method, which according to our theory is immune to structure and provides the correct null distribution.
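The excess of observed over expected false positives described above amounts to a short calculation, sketched here in Python. The function name `excess_false_positives` and the two toy sets of p-values are hypothetical; they stand in for the null-SNP p-values of one simulated study.

```python
def excess_false_positives(null_pvalues, t):
    """Observed minus expected number of null p-values at or below threshold t."""
    m0 = len(null_pvalues)
    observed = sum(1 for p in null_pvalues if p <= t)
    expected = m0 * t
    return observed - expected

# Hypothetical null p-values on an even grid, mimicking an ideal Uniform(0, 1) null.
m0 = 1000
uniform_like = [(k + 0.5) / m0 for k in range(m0)]

# Squaring shrinks every p-value, mimicking a method that leaves structure uncorrected.
inflated = [p ** 2 for p in uniform_like]

print(excess_false_positives(uniform_like, 0.05))
print(excess_false_positives(inflated, 0.05))
```

A well-calibrated method gives an excess near zero at every threshold, while inflated test statistics give a positive excess.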
We found from using the distributed binary executable EMMAX software and our own implementation that EMMAX required a 10-fold increase in computational time over the proposed method and PCA when analyzing n = 5000 individuals. Therefore, it was not reasonable to apply EMMAX to all 3300 simulated GWAS data sets. We limited comparisons with EMMAX to five representative structure configurations of the full 11 for a single apportionment of the variances assigned to genetic effects, non-genetic effects, and random variation. GEMMA was computationally more efficient, though still significantly slower than GCAT or our implementation of PCA. Figure 2 shows the excess in observed false positives vs. the expected number of false positives for the Oracle, GCAT (proposed), PCA, and both implementations of LMM methods under five configurations of structure for the variance configuration corresponding to genetic=5%, environmental=5%, and noise=90%. It can be seen that the LFA implementation of the proposed GCAT method performs similarly to the Oracle test, whereas PCA tends to suffer from an excess of spurious associations. Figures S1-S8 show a more complete set of simulations with results from all three sets of variances for the full 11 configurations of structure. Due to the computational constraints mentioned above, the additional simulations feature only results from GEMMA for LMM methods.
In comparing the statistical power among the methods (Figures S9-S17), we found that the Oracle, GCAT, and PCA performed similarly well, while the two LMM methods often suffered from a loss of power. We also carried out analogous simulations on binary traits simulated from model (2) and we found that all methods performed similarly well in terms of producing correct p-values that were robust to structure. This result agrees with the comparisons made between PCA and a linear mixed effects model in Astle and Balding (2009) [5].
Analysis of the Northern Finland Birth Cohort Data
We applied the proposed method to the Northern Finland Birth Cohort (NFBC) genome-wide association study data [12], which includes several metabolic traits and height. This study has also been analyzed by the LMM and PCA methods, as well as a standard analysis uncorrected for structure [10]. We carried out association analyses with the proposed method on the 10 traits that were also analyzed using the other methods (Table 1). After processing the data, including filtering for missing data, minor allele frequencies, and departures from Hardy-Weinberg equilibrium, the data were composed of m = 324,160 SNPs and n = 5027 individuals (Methods). The logistic factors were computed on a subset of the data where markers were at least 200 kbp apart.
Most traits showed only approximate Normal distributions, so we applied a Box-Cox Normal transformation to all traits so that they satisfy the model assumptions. We noted that C-reactive Protein (CRP) and Triglycerides (TG) traits followed an exponential distribution more closely, so it was unnecessary to transform these two traits. The developed theory can be extended to exponentially distributed quantitative traits as well.
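For readers unfamiliar with the Box-Cox family, the sketch below shows the standard one-parameter transform for a positive trait value. The text does not state how the transformation parameter was chosen for each trait, so the parameter `lam`, the function name `box_cox`, and the example values are purely illustrative assumptions.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive value y: (y**lam - 1) / lam, or log(y) when lam == 0."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive values")
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam

# Hypothetical right-skewed trait values, transformed with lam = 0 (the log transform).
trait = [0.8, 1.0, 1.5, 2.7, 6.0]
transformed = [box_cox(v, 0.0) for v in trait]
print(transformed)
```

In practice the parameter is typically chosen per trait, for example by maximizing a Normal profile likelihood over a grid of candidate values.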
The 20 most significant SNPs for each of the 10 traits are shown in Table S1. Kang et al. (2010) utilized a genome-wide significance threshold of p-value < 7.2 × 10⁻⁸ as proposed in ref. [16], so we also utilized this threshold for comparative purposes. The number of loci found to be significant for each method is shown in Table 1. Whereas our proposed method identifies 16 significant loci, the other methods identify 11 to 14 loci.
We identified three new loci that were not identified by the other methods. None of the other methods identified any significant associations for the height trait. However, we identified rs2814982 on chromosome 6 as being statistically associated with height (Table S1). This SNP is located ∼70 kbp from another SNP, rs2814993, which has been associated with skeletal frame size in a previous study [13]. Additionally, rs2814993 was the fifth most significant SNP for height. For the LDL cholesterol trait, we identified a significant association with rs11668477, which was significantly associated with LDL cholesterol in a different study [17]. Finally, there were significant associations between the glucose (GLU) trait and a cluster of SNPs (rs3847554, rs1387153, rs1447352, rs7121092) proximal to the MTNR1B locus; variation at this locus has been associated with glucose in a previous study [18].
As described in Sabatti et al. (2009) [12], the NFBC data show modest levels of inflation due to population structure as measured by the genomic control inflation factor (GCIF) [19] of test statistics from an uncorrected analysis. The population structure present among these individuals may be subtler and manifested on a finer scale than other settings. Noting that the GCAT approach does not attempt to adjust for a polygenic background, the GCIF values calculated for the proposed method (Table S2) were found to be in line with what is expected for polygenic traits where no structure is present [20], providing evidence that the proposed method adequately accounts for structure.
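The genomic control inflation factor is commonly computed as the median of the observed one-degree-of-freedom test statistics divided by the median of the χ² distribution with one degree of freedom (about 0.455). The sketch below uses that common definition on simulated statistics; the function name `gcif` and the simulated inputs are assumptions for illustration and do not reproduce the values in Table S2.

```python
import random
import statistics

# Median of a chi-square with 1 df: the square of the 75th percentile of a standard Normal.
CHI2_1_MEDIAN = statistics.NormalDist().inv_cdf(0.75) ** 2

def gcif(chi2_stats):
    """Genomic control inflation factor: median statistic over the null median."""
    return statistics.median(chi2_stats) / CHI2_1_MEDIAN

rng = random.Random(2)
null_stats = [rng.gauss(0.0, 1.0) ** 2 for _ in range(20000)]  # chi-square(1) draws
inflated_stats = [1.2 * s for s in null_stats]                 # uniformly inflated statistics

print(round(gcif(null_stats), 2), round(gcif(inflated_stats), 2))
```

A value near 1 indicates test statistics consistent with the null distribution for most SNPs, although polygenic traits can produce values somewhat above 1 even without structure, as noted above.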
DISCUSSION
We considered models of quantitative and binary traits involving genetic effects and non-genetic effects in the presence of arbitrarily complex population structure. We allowed for the non-genetic effects to be confounded with population genetic structure since structure, ancestry, lifestyle, and environment – all factors potentially involved in complex traits – may be highly dependent with one another. A causal model provided the intuition that under these models, it is most reasonable to account for this confounding in the genotypes, but it is not tractable to do so in the non-genetic effects. This follows because we have many instances of genotypes that can be jointly modeled to provide reliable estimates of structure, but the non-genetic effects are never directly observed and we do not have repeated instances of them. In general it is not possible to estimate a latent variable that accounts for the confounding between structure and non-genetic effects.
These observations led us to propose an inverse regression approach to testing for associations, where the association is tested by modeling genotype variation in terms of the trait plus model terms accounting for structure. In this model, the terms accounting for structure were based on the logistic factor analysis approach that we have proposed [14], although the general form of the association test can incorporate other methods that estimate population structure. We mathematically proved under general assumptions that the trait term in the model is non-zero only when the genetic marker is truly associated with the trait, regardless of the population structure. We demonstrated that the implemented test properly accounts for structure in a large body of simulated studies that included a wide range of population structures. We also applied the method to 10 traits from the Northern Finland Birth Cohort genome-wide association study. The proposed method identified three new loci associated with the traits, including being the only method among those we considered that identifies a locus associated with the height trait. Overall, we showed that the proposed method compares favorably to existing methods and we also noted that it has favorable computational requirements compared to existing methods.
As GWAS increase in sample size and levels of complexity of population structure, it is important to develop methods that properly account for structure and scale well with sample size. Whereas we found that the popular principal components adjustment does not properly account for structure, we also found that the mixed model approach performs reasonably well. However, the mixed model approach involves estimating an n × n kinship matrix and its current implementation does not scale well with sample size. The kinship matrix quickly becomes computationally unwieldy when n grows large, and the possibility of the estimated kinship matrix becoming overwhelmed by noise is a concern [21]. In the Northern Finland Birth Cohort data, the mixed model approach required us to estimate 12 million parameters, whereas the proposed method involved estimating 25-thousand parameters, a ∼500-fold decrease. A study involving n = 10,000 individuals with the same complexity of structure requires estimating about 50-million parameters in the mixed model kinship matrix, whereas the proposed method requires estimating 50-thousand parameters, a ∼1000-fold decrease. In addition, estimating the structure in the proposed method primarily uses singular value decomposition, for which a rich literature of computational techniques exists. We utilized a Lanczos bidiagonalization algorithm [22] which scales approximately linearly with respect to n for d ≪ n. The proposed method is well equipped to scale to massive GWAS and can take advantage of future advances for computing singular value decomposition.
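The parameter counts quoted above can be checked with a short calculation. The sketch assumes that the kinship count refers to the distinct entries of a symmetric n × n matrix and that the proposed method's count is n times five latent factors per individual; both are assumptions chosen because they reproduce the quoted figures, and the function names are hypothetical.

```python
def kinship_parameters(n):
    """Distinct entries of a symmetric n x n matrix (upper triangle plus diagonal)."""
    return n * (n + 1) // 2

def factor_parameters(n, k):
    """Entries of a k x n matrix of per-individual latent factors."""
    return n * k

counts = {}
for n in (5027, 10000):
    kin = kinship_parameters(n)
    fac = factor_parameters(n, 5)   # k = 5 is an assumption matching the quoted counts
    counts[n] = (kin, fac, round(kin / fac))
    print(n, kin, fac, round(kin / fac))
```

Under these assumptions the ratio grows roughly linearly in n, which is why the fold decrease doubles when the sample size doubles.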
The key assumption to verify in utilizing the proposed GCAT approach is that population structure observed in the SNP genotypes is adequately modeled and estimated. One can test for associations among SNPs that show convincing empirical evidence that the model of structure is reasonably well-behaved; this can be directly tested on the genotype data as previously demonstrated in our logistic factor analysis (LFA) model of structure [14]. For example, on the Northern Finland Birth Cohort Study, we empirically verified that utilizing the LFA model with dimension d = 6 accounted for structure reasonably well for the great majority of SNPs. The linear mixed effects model (LMM) approach and principal components (PCA) approach make trait model assumptions that may be difficult to verify in practice (Methods).
We anticipate that the proposed genotype conditional association test (GCAT) will be useful for future studies. The mathematical framework we have developed should facilitate its extension to traits modeled according to distributions not considered here while maintaining our theoretical proof that the test accounts for population structure in the presence of non-genetic effects also confounded with structure.
METHODS
Logistic Factor Analysis (LFA)
When forming a latent variable model of structure, where the goal is to make minimal assumptions about the underlying structure, there are benefits to modeling logit(πij) in terms of a latent variable model instead of πij directly [14]. The quantity logit(πij) = log(πij/(1 − πij)) is called the “natural parameter” of the distribution of xij when we assume Hardy-Weinberg equilibrium so that xij ∼ Binomial(2, πij). The quantity logit(πij) occurs as a linear term in the log-likelihood of the data, and it is the target parameter in logistic regression because of its straightforward mathematical properties. This viewpoint also facilitates calculating the distribution of xij given the structure, which is the essential challenge in accounting for structure in the proposed association testing framework.
In the association testing framework detailed below, it turns out that developing a latent variable model and estimate of logit(πij) is particularly appropriate. The approach is called “logistic factor analysis” (LFA). Let L be an m × n matrix with (i, j) element equal to logit(πij). Consider the following parameterization:
$$ L = AH \tag{4} $$
where A is an m × d matrix, H is a d × n matrix, and d ≪ n. The columns of H are independent, and column j captures the structure information for individual j. That is, Pr(xij∣hj, zj) = Pr(xij∣hj) where hj is column j of H. Row i of A determines how SNP i is affected by structure. We have shown in ref. [14] that this model performs well in estimating structure resulting from discrete subpopulations, admixed populations, the Balding-Nichols model [23], the Pritchard-Stephens-Donnelly model [24], and models of spatially oriented structure.
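As a minimal sketch of the parameterization logit(F) = L = AH (not the authors' software), the following Python builds individual-specific allele frequencies from a toy A and H and draws genotypes from them. The matrix sizes, the Normal draws used to fill A and H, and the helper names are illustrative assumptions.

```python
import math
import random

def inv_logit(v):
    """Map a logit-scale value back to a probability."""
    return 1.0 / (1.0 + math.exp(-v))

def allele_frequencies(A, H):
    """Return the m x n matrix F whose logit equals the product A H."""
    m, d, n = len(A), len(H), len(H[0])
    return [[inv_logit(sum(A[i][k] * H[k][j] for k in range(d)))
             for j in range(n)] for i in range(m)]

def draw_genotypes(F, rng):
    """Draw x_ij ~ Binomial(2, pi_ij) as the sum of two Bernoulli draws."""
    return [[(rng.random() < p) + (rng.random() < p) for p in row] for row in F]

rng = random.Random(0)
m, d, n = 5, 3, 8                      # toy sizes (assumed); d << n in practice
A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
# Last row of H is fixed at 1 so the last column of A acts as an intercept.
H = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(d - 1)] + [[1.0] * n]
F = allele_frequencies(A, H)           # individual-specific allele frequencies
X = draw_genotypes(F, rng)             # genotypes coded 0, 1, or 2
```

Column j of H is shared by every SNP, while row i of A is specific to SNP i, which is what lets a small d describe structure across many SNPs.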
In practice, H will be unknown, so it must be estimated. We have developed a method called logistic factor analysis (LFA) that we have shown to estimate H well [14]. Specifically, the LFA estimate Ĥ has been shown to span the same space as the true H at a high level of accuracy, which implies that replacing H with Ĥ in the above equations yields nearly identical results. The accuracy of Ĥ in estimating H has been demonstrated even when the individual-specific allele frequencies are not directly constructed from model (4), L = AH.
Proposed Association Testing Framework
We have derived a statistical hypothesis test of association that is equivalent to testing whether βi = 0 for each SNP i in the above trait models (1) and (2), and whose null distribution does not depend on structure or the non-genetic effects correlated with structure, making it immune to spurious associations due to structure. Specifically, the test allows for general levels of complexity in structure because the test is based on adjusting for structure according to individual-specific allele frequencies.
A Model of Genetic Variation Given the Trait and Structure
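As a first step, we have proved a theorem (see below) that shows that βi = 0 in models (1) and (2) implies that bi = 0 in the following model:
$$ x_{ij} \mid y_j, z_j \sim \text{Binomial}\left(2,\ \text{logit}^{-1}\left(a_i + b_i y_j + \text{logit}(\pi_{ij})\right)\right) \tag{3} $$
for all j = 1, 2, … , n. This establishes a model that can be used to test for associations in place of models (1) and (2).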
There are a few important details to note. First, the variables λj, $\sigma_j^2$, and (xkj)k≠i do not appear in the model. This is important because it is impossible to estimate λj and $\sigma_j^2$ in the typical setting, and we will also typically not know the polygenic ∑k≠i βkxkj component of the model. Second, the genotype variation is being modeled in terms of the trait variation, instead of the other way around. This is initially counterintuitive because almost all association tests involve modeling the trait in terms of the SNP genotypes. As explained in more detail below, this reversal is crucial for adjusting the probability distribution of xij according to structure, and for eliminating the need to estimate λj, $\sigma_j^2$, and (βk)k≠i.
We call our proposed test the “genotype conditional association test” (GCAT). The model we propose to utilize is sometimes called an inverse regression model because we utilize E[x∣y] rather than E[y∣x].
Proposed Test Conditional on Individual-Specific Allele Frequencies
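As a second step, we have derived a test statistic to test whether bi = 0 in model (3) whose null distribution is immune to structure. The log-likelihood function of the parameters given individual j is
$$ \ell(a_i, b_i \mid x_{ij}, y_j, \pi_{ij}) = \log \Pr(x_{ij} \mid y_j, \pi_{ij}; a_i, b_i), $$
where the probability on the right-hand side is calculated according to model (3). The log-likelihood of all n individuals is
$$ \ell(a_i, b_i \mid x_i, y, \pi_i) = \sum_{j=1}^{n} \ell(a_i, b_i \mid x_{ij}, y_j, \pi_{ij}), $$
where πi = (πi1, πi2, … , πi,n) and y = (y1, y2, … , yn). The test statistic we utilize is a generalized likelihood ratio test statistic [25]:
$$ T(x_i, y, \pi_i) = 2\left[\max_{a_i, b_i} \ell(a_i, b_i \mid x_i, y, \pi_i) - \max_{a_i} \ell(a_i, b_i = 0 \mid x_i, y, \pi_i)\right] \tag{6} $$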
The log-likelihood is maximized by performing a logistic regression of all n observed genotypes for SNP i on the right-hand side of model (3). We have proven a theorem (see below) that shows that when βi = 0 in models (1) or (2), the null distribution of this test statistic is $\chi^2_1$ as n → ∞, regardless of the values of πij, (xkj)k≠i, (βk)k≠i, λj, and $\sigma_j^2$ for j = 1, 2, … , n in models (1) and (2).
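The likelihood ratio test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes the logistic-regression form logit(p_j) = a + b·y_j + logit(π_j), with logit(π_j) entering as a fixed offset, treats the allele frequencies π_j as known, and uses hypothetical function names (`max_loglik`, `gcat_known_pi`). The p-value uses the fact that a χ² variable with one degree of freedom exceeds t with probability erfc(√(t/2)).

```python
import math

def inv_logit(v):
    return 1.0 / (1.0 + math.exp(-v))

def max_loglik(x, y, offset, fit_b, iters=50):
    """Maximize the Binomial(2, p_j) log-likelihood (up to a constant) with
    logit(p_j) = a + b * y_j + offset_j, using Newton's method.
    If fit_b is False, b is held at 0 (the null model)."""
    a = b = 0.0
    for _ in range(iters):
        p = [inv_logit(a + b * yj + o) for yj, o in zip(y, offset)]
        r = [xj - 2.0 * pj for xj, pj in zip(x, p)]      # score residuals
        w = [2.0 * pj * (1.0 - pj) for pj in p]          # Fisher weights
        g0, h00 = sum(r), sum(w)
        if fit_b:
            g1 = sum(ri * yj for ri, yj in zip(r, y))
            h01 = sum(wi * yj for wi, yj in zip(w, y))
            h11 = sum(wi * yj * yj for wi, yj in zip(w, y))
            det = h00 * h11 - h01 * h01
            da = (h11 * g0 - h01 * g1) / det             # 2 x 2 Newton step
            db = (h00 * g1 - h01 * g0) / det
        else:
            da, db = g0 / h00, 0.0
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-10:
            break
    eta = [a + b * yj + o for yj, o in zip(y, offset)]
    # x*log(p) + (2 - x)*log(1 - p) simplifies to x*eta - 2*log(1 + exp(eta))
    return sum(xj * e - 2.0 * math.log1p(math.exp(e)) for xj, e in zip(x, eta))

def gcat_known_pi(x, y, pi):
    """Return (statistic, p-value) for one SNP given known allele frequencies."""
    offset = [math.log(p / (1.0 - p)) for p in pi]
    t = 2.0 * (max_loglik(x, y, offset, True) - max_loglik(x, y, offset, False))
    t = max(t, 0.0)                                      # guard against round-off
    return t, math.erfc(math.sqrt(t / 2.0))
```

The binomial coefficient is omitted from the log-likelihood because it cancels in the difference between the two maximized values.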
Proposed Test In Terms of LFA Model
As a third step, we have extended the above results to the case where the individual-specific allele frequencies are unknown and must be estimated. This requires a model of the individual-specific allele frequencies, and we utilize model (4) so that $\text{logit}(\pi_{ij}) = \sum_{k=1}^{d} a_{ik} h_{kj}$. First, assume that H from model (4) is known. We have proved that βi = 0 in models (1) and (2) implies bi = 0 in the following model:
$$ x_{ij} \mid y_j, h_j \sim \text{Binomial}\left(2,\ \text{logit}^{-1}\left(\sum_{k=1}^{d} a_{ik} h_{kj} + b_i y_j\right)\right) \tag{7} $$
for all j = 1, 2, … , n, where hj is column j of H and, without loss of generality, we let hdj = 1 so that aid is an intercept term. The test statistic used to test for an association between SNP i and the trait is the following generalized likelihood ratio test statistic:
$$ T(x_i, y, H) = 2\left[\max_{a_i, b_i} \ell(a_i, b_i \mid x_i, y, H) - \max_{a_i} \ell(a_i, b_i = 0 \mid x_i, y, H)\right] \tag{8} $$
where ai = (ai1, ai2, … , ai,d). The log-likelihoods in this test statistic are maximized by performing a logistic regression of all n observed genotypes for SNP i on the right-hand side of model (7). As in the previous case, we have proven a theorem (see below) that shows that when βi = 0 in models (1) or (2), the null distribution of this test statistic is $\chi^2_1$ as n → ∞, regardless of the values of πi, (xkj)k≠i, β−i, λ, and $\sigma^2$ in models (1) and (2).
The proposed test utilizes LFA to form an estimate Ĥ, replaces H with Ĥ, and carries out the test using model (7) and test statistic (8): T(xi, y, Ĥ). This approach directly allows the simultaneous estimation of ai and bi for each SNP i under the unconstrained model and the estimation of ai with bi = 0 under the constraints of the null hypothesis. Because of this, the test allows the uncertainty of the m × d unknown parameters of A to be taken into account and it allows bi to be competitively fit with ai under the unconstrained, alternative hypothesis model.
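The following Python sketch illustrates the test when the logistic factors enter as covariates. It assumes that an estimate Ĥ is already available (the LFA estimation itself is not shown) and that Ĥ contains a row of ones acting as the intercept. The function names and the small Gaussian elimination helper are hypothetical and chosen only to keep the example dependency-free; this is not the authors' software.

```python
import math

def solve(M, v):
    """Solve M z = v by Gaussian elimination with partial pivoting."""
    k = len(v)
    aug = [row[:] + [vi] for row, vi in zip(M, v)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, k):
            f = aug[r][c] / aug[c][c]
            for cc in range(c, k + 1):
                aug[r][cc] -= f * aug[c][cc]
    z = [0.0] * k
    for r in reversed(range(k)):
        tail = sum(aug[r][cc] * z[cc] for cc in range(r + 1, k))
        z[r] = (aug[r][k] - tail) / aug[r][r]
    return z

def max_loglik(x, Z, iters=50):
    """Maximized Binomial(2, p_j) log-likelihood (up to a constant) with
    logit(p_j) equal to the inner product of row Z[j] and the coefficients."""
    k = len(Z[0])
    theta = [0.0] * k
    for _ in range(iters):
        eta = [sum(t * z for t, z in zip(theta, zj)) for zj in Z]
        p = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        grad = [sum((xj - 2 * pj) * zj[a] for xj, pj, zj in zip(x, p, Z))
                for a in range(k)]
        info = [[sum(2 * pj * (1 - pj) * zj[a] * zj[b] for pj, zj in zip(p, Z))
                 for b in range(k)] for a in range(k)]
        step = solve(info, grad)                  # Newton / Fisher scoring step
        theta = [t + s for t, s in zip(theta, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    eta = [sum(t * z for t, z in zip(theta, zj)) for zj in Z]
    return sum(xj * e - 2 * math.log1p(math.exp(e)) for xj, e in zip(x, eta))

def gcat_lfa(x, y, H_hat):
    """x: genotypes of one SNP; y: trait; H_hat: d rows by n columns,
    one row of which is all ones (intercept)."""
    n = len(x)
    Z0 = [[row[j] for row in H_hat] for j in range(n)]   # null: factors only
    Z1 = [zj + [y[j]] for j, zj in enumerate(Z0)]        # alternative adds y_j
    t = max(0.0, 2 * (max_loglik(x, Z1) - max_loglik(x, Z0)))
    return t, math.erfc(math.sqrt(t / 2))                # chi-square(1) tail
```

Because the coefficients on the factors are refit under both hypotheses, the trait coefficient competes with them directly in the unconstrained fit, which is the behavior described in the paragraph above.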
Another approach is to first carry out estimation of F by whatever method the analyst finds appropriate and then base the test on statistic (6) with the πij replaced by the estimates $\hat{\pi}_{ij}$. This has the advantage that it allows for a much broader class of methods to estimate F, but it may be more conservative than the above implementation because bi is not competitively fit with the πij under the unconstrained model. In this case, F may be estimated in a manner that allows for fine-scale levels of inter-individual coancestry and locus-specific models of structure without relying on the lower d-dimensional factorized model L = AH that we used here.
Proposed Test Under the Alternative Hypothesis
The proposed association test is based on models (3) and (7). Even though we have proved that the test is immune to population structure, it is also important to demonstrate that the test has favorable statistical power to identify true associations. We have shown that model (3) is a tractable approximation of the model of xij given yj and zj under general configurations of a true alternative hypothesis for SNP i, where βi ≠ 0 (see below). This provides the beginnings of a mathematical framework for characterizing the power of the test.
Theorems and Proofs
Because xij∣zj ∼ Binomial(2, πi(zj)) where we write πij ≡ πi(zj), it follows that Pr(xij∣πij, zj) = Pr(xij∣πij). We assume that Pr(xij∣hj, zj) = Pr(xij∣hj); in other words, all information about the influence of population structure on the genotypes of individual j is captured through column j of H. It therefore follows that Pr(xij∣πij, hj, zj) = Pr(xij∣πij, hj) = Pr(xij∣πij). We also assume that the SNP genotypes are mutually independent given the structure (which also implies the set of SNPs we consider are in linkage equilibrium, given the structure). These assumptions yield the following equalities:
Theorem 1. Suppose that yj is distributed according to model (1) or (2), xij∣πij ∼ Binomial(2, πij) as parameterized above, and the SNP genotypes are mutually independent given the structure as detailed above. Then βi = 0 in models (1) or (2) implies that bi = 0 in model (3).
Note: We provide two proofs of this theorem because both provide relevant insights. The first version gives insight into the probabilistic mechanism underlying the proposed approach and has some generality beyond the modeling assumptions made here. The second version directly shows how the terms in models (1) and (2) relate to those in model (3).
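Proof (version 1): When βi = 0, it follows that Pr(yj∣(xkj)k≠i, xij, zj) = Pr(yj∣(xkj)k≠i, zj) by the assumptions of models (1) and (2). Noting that Pr((xkj)k≠i∣xij, zj) = Pr((xkj)k≠i∣zj) by the conditional independence assumption, we have:
$$ \begin{aligned} \Pr(y_j \mid x_{ij}, z_j) &= \sum_{(x_{kj})_{k \neq i}} \Pr\left(y_j \mid (x_{kj})_{k \neq i}, x_{ij}, z_j\right) \Pr\left((x_{kj})_{k \neq i} \mid x_{ij}, z_j\right) \\ &= \sum_{(x_{kj})_{k \neq i}} \Pr\left(y_j \mid (x_{kj})_{k \neq i}, z_j\right) \Pr\left((x_{kj})_{k \neq i} \mid z_j\right) \\ &= \Pr(y_j \mid z_j). \end{aligned} $$
By Bayes' theorem we have
$$ \Pr(x_{ij} \mid y_j, z_j) = \frac{\Pr(y_j \mid x_{ij}, z_j) \Pr(x_{ij} \mid z_j)}{\Pr(y_j \mid z_j)}. $$
Since Pr(yj∣xij, zj) = Pr(yj∣zj), this implies that Pr(xij∣yj, zj) = Pr(xij∣zj) and it follows that bi = 0 in model (3).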
Proof (version 2): For either model (1) or (2), it follows that and similarly
By the assumptions detailed above, we have Pr(xij∣(xkj)k≠i, zj) = Pr(xij∣πij) and therefore:
Under the quantitative trait model (1), it follows that
Plugging this back into equation (10) shows that where and . Following analogous steps, we find that where . When βi = 0 in model (1), then aij = ãij = bij = 0.
Under the binary trait model (2), it follows that where and bi = βi. Plugging this back into equation (10) shows that
Following analogous steps, we find that where . When βi = 0 in model (2), then aij = ãij = bi = 0.
Putting these together, we have that when βi = 0 in models (1) or (2), then model (3) holds with bi = 0.
Corollary 1. Suppose that the assumptions of Theorem 1 hold and additionally that model (4) holds, so that $\text{logit}(\pi_{ij}) = \sum_{k=1}^{d} a_{ik} h_{kj}$. Then βi = 0 in models (1) or (2) implies that bi = 0 in model (7).
Proof: The proof is the same as that to Theorem 1, except we replace πij with hj.
Theorem 2. Suppose that yj is distributed according to model (1) or (2) and that xij∣πij ∼ Binomial(2, πij). If βi = 0 in models (1) or (2), then the test statistic T(xi, y, πi) defined in (6) converges in distribution to $\chi^2_1$ as n → ∞.
Proof: When βi = 0, then [xij∣yj, πij] ∼ Binomial(2, πij) by Theorem 1. It then follows that $T(x_i, y, \pi_i) \to \chi^2_1$ in distribution as n → ∞ by Wilks’ theorem [25].
Corollary 2. Suppose that the assumptions of Theorem 1 hold and additionally that model (4) holds, so that $\text{logit}(\pi_{ij}) = \sum_{k=1}^{d} a_{ik} h_{kj}$. If βi = 0 in models (1) or (2), then the test statistic T(xi, y, H) defined in (8) converges in distribution to $\chi^2_1$ as n → ∞.
Proof: When βi = 0, then $[x_{ij} \mid y_j, h_j] \sim \text{Binomial}\left(2, \text{logit}^{-1}\left(\sum_{k=1}^{d} a_{ik} h_{kj}\right)\right)$ by Corollary 1. It then follows that $T(x_i, y, H) \to \chi^2_1$ in distribution as n → ∞ by Wilks’ theorem [25].
Proposed Model Under the Alternative Hypothesis
When the alternative model is true this means that βi ≠ 0. In this case it is worthwhile to characterize model (3) in terms of the distribution of xij∣yj, zj. Under trait models (1) or (2), it follows that:
This implies that where under model (1) we have , and under model (2) we have , , bij = βi.
In the case that aij = ãij, it is the case that
However, this exact equality holds only when βi = 0. For the typical effect sizes seen in GWAS, it will nevertheless be true that aij ≈ ãij, in which case the above functional form will be approximately true. This allows for an approximation that can be utilized in practice for power calculations.
Simulated Allele Frequencies
In order to simulate the m × n matrix of genotypes X, we first needed to simulate the m × n matrix of allele frequencies F. Recall that we model the allele frequencies by forming L = logit(F) and then utilizing the model L = AH from equation (4).
Rather than simulating allele frequencies from the L = AH model we use to perform the proposed association test, we simulated them from a different model to demonstrate the flexibility of the L = AH model. Specifically, we let F = ΓS where Γ is m × d and S is d × n with d ≤ n. The d × n matrix S encapsulates the genetic population structure for these individuals since S is not SNP-specific but is shared across SNPs. The m × d matrix Γ maps how the structure is manifested in the allele frequencies of each SNP. We have shown that the model F = ΓS includes as special cases discrete subpopulations, the Balding-Nichols model, and the Pritchard-Stephens-Donnelly model.
We formed Γ and S for the 11 different population structure configurations exactly as carried out in Hao et al. (2013) [14]. These constructions are summarized as follows from Hao et al. (2013).
Balding-Nichols Model
The HapMap data set was deliberately sampled to be from three discrete populations, which allowed us to populate each row i of Γ with three independent and identically distributed draws from the Balding-Nichols model: $\gamma_{ik} \sim \text{Beta}\left(\frac{1 - F_i}{F_i} p_i,\ \frac{1 - F_i}{F_i} (1 - p_i)\right)$, where k ∈ {1, 2, 3}. Each γik is interpreted to be the allele frequency for subpopulation k at SNP i. The pairs (pi, Fi) were computed by randomly selecting a SNP in the HapMap data set, calculating its observed allele frequency, and estimating its FST value using the Weir & Cockerham estimator [26]. The columns of S were populated with indicator vectors such that each individual was assigned to one of the three subpopulations. The subpopulation assignments were drawn independently with probabilities 60/210, 60/210, and 90/210, which reflect the subpopulation proportions in the HapMap data set. The dimensions of the simulated data were m = 100,000 SNPs and n = 5000 individuals.
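A small Python sketch of this construction for a single SNP follows. It uses the standard Beta parameterization of the Balding-Nichols model; the particular (p, F) pair, the toy sample size, and the function names are illustrative assumptions rather than values from the study.

```python
import random

def balding_nichols_row(p, fst, k, rng):
    """Draw k subpopulation allele frequencies from
    Beta(p (1 - F) / F, (1 - p) (1 - F) / F)."""
    a = p * (1.0 - fst) / fst
    b = (1.0 - p) * (1.0 - fst) / fst
    return [rng.betavariate(a, b) for _ in range(k)]

def assign_subpopulations(n, probs, rng):
    """Build S as indicator rows: S[k][j] is 1 if individual j is in subpopulation k."""
    labels = rng.choices(range(len(probs)), weights=probs, k=n)
    return [[1 if lab == k else 0 for lab in labels] for k in range(len(probs))]

rng = random.Random(0)
n = 10                                                   # toy sample size (assumed)
gamma_i = balding_nichols_row(p=0.2, fst=0.1, k=3, rng=rng)   # made-up (p, F)
S = assign_subpopulations(n, [60 / 210, 60 / 210, 90 / 210], rng)
# Row i of F = Gamma S: each individual receives its subpopulation's frequency.
pi_i = [sum(gamma_i[k] * S[k][j] for k in range(3)) for j in range(n)]
```

With indicator columns in S, the product ΓS simply selects one of the three subpopulation frequencies per individual, which is the discrete-subpopulation special case mentioned earlier.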
1000 Genomes Project (TGP)
We started with the TGP data set from Hao et al. (2013) [14]. The matrix Γ was generated by sampling for k = 1, 2 and setting γi3 = 0.05. In order to generate S, we computed the first two principal components of the TGP genotype matrix after mean centering each SNP. We then transformed each principal component to be between (0, 1) and set the first two rows of S to be the transformed principal components. The third row of S was set to 1, i.e. an intercept. The dimensions of the simulated data were m = 100,000 and n = 1500, where n was determined by the number of individuals in the TGP data set.
Human Genome Diversity Project (HGDP)
We started with the HGDP data set from Hao et al. (2013) [14] and applied the same simulation scheme as for the TGP scenario. The dimensions of the simulated data were m = 100,000 and n = 940, where n was determined by the number of individuals in the HGDP data set.
Pritchard-Stephens-Donnelly (PSD)
The PSD model assumes individuals to be an admixture of ancestral subpopulations. The rows of Γ were again created by three independent and identically distributed draws from the Balding-Nichols model: $\gamma_{ik} \sim \text{Beta}\left(\frac{1 - F_i}{F_i} p_i,\ \frac{1 - F_i}{F_i} (1 - p_i)\right)$, where k ∈ {1, 2, 3}. For this scenario, the pairs (pi, Fi) were computed from analyzing the HGDP data set for observed allele frequency and estimated FST via the Weir & Cockerham estimator [26]. The estimator requires each individual to be assigned to a subpopulation; these assignments were made according to the K = 5 subpopulations from the analysis in Rosenberg et al. (2002) [27]. The columns of S were sampled as $(s_{1j}, s_{2j}, s_{3j}) \sim \text{Dirichlet}(\alpha)$ for j = 1, …, n. There were four PSD scenarios with parameter values α = (0.01, 0.01, 0.01), α = (0.1, 0.1, 0.1), α = (0.5, 0.5, 0.5), and α = (1, 1, 1). α = (0.1, 0.1, 0.1) was chosen as the representative structure for Figure 2. The dimensions of the simulated data were m = 100,000 SNPs and n = 5000 individuals.
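The admixture step can be illustrated with a short Python sketch. It assumes that each column of S is a Dirichlet(α) draw, generated here by normalizing independent Gamma variates, and it uses made-up ancestral frequencies for a single SNP. The names and toy sizes are hypothetical.

```python
import random

def dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]

rng = random.Random(0)
n = 6                                   # toy sample size (assumed)
alpha = (0.1, 0.1, 0.1)                 # the representative PSD setting
S_cols = [dirichlet(alpha, rng) for _ in range(n)]   # admixture proportions per individual
gamma_i = [0.10, 0.45, 0.80]            # made-up ancestral frequencies for one SNP
# Individual-specific allele frequency: admixture-weighted average of gamma_i.
pi_i = [sum(g * s for g, s in zip(gamma_i, col)) for col in S_cols]
```

Small values of α concentrate each individual's ancestry on one subpopulation, while α = (1, 1, 1) spreads the proportions uniformly over the simplex.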
Spatial
We seek to simulate genotypes such that the population structure relates to the spatial position of the individuals. The matrix Γ was populated by sampling for k = 1, 2 and setting γi3 = 0.05. The first two rows of S correspond to coordinates for each individual on the unit square and were set to be independent and identically distributed samples from Beta(a, a), while the third row of S was set to be 1, i.e. an intercept. There were four spatial scenarios with parameter values of a = 0.1, 0.25, 0.5, and 1. As a → 0, the individuals are placed closer to the corners of the unit square, while when a = 1, the individuals are distributed uniformly. a = 0.1 was chosen as the representative structure for Figure 2. The dimensions of the simulated data were m = 100,000 SNPs and n = 5000 individuals.
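The effect of the Beta(a, a) parameter on the spatial layout can be checked with a brief Python sketch. It only builds S (the sampling of Γ is not shown), and the summary statistic, sample size, and function names are assumptions introduced for illustration.

```python
import math
import random

def spatial_structure(n, a, rng):
    """Rows 1 and 2: Beta(a, a) coordinates on the unit square; row 3: intercept."""
    return [[rng.betavariate(a, a) for _ in range(n)],
            [rng.betavariate(a, a) for _ in range(n)],
            [1.0] * n]

def mean_corner_distance(S):
    """Average distance from each individual to the nearest corner of the unit square."""
    total = 0.0
    for u, v in zip(S[0], S[1]):
        total += math.hypot(min(u, 1.0 - u), min(v, 1.0 - v))
    return total / len(S[0])

rng = random.Random(0)
clustered = mean_corner_distance(spatial_structure(2000, 0.1, rng))
uniform = mean_corner_distance(spatial_structure(2000, 1.0, rng))
```

Comparing `clustered` with `uniform` shows the behavior described above: a small a pulls individuals toward the corners, while a = 1 spreads them evenly over the square.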
Simulated Traits
For each of the 11 simulation scenarios, we generated 100 independent studies. For each study, X was formed by simulating xij ∼ Binomial(2, πij) where F was constructed as described above. In order to simulate a quantitative trait, we needed to simulate α, βi, λj, and ϵj from model (1).
First, we set α = 0. Without loss of generality SNPs i = 1, 2, … , 10 were set to be true alternative SNPs (where βi ≠ 0); we simulated for i = 1, 2, … , 10. We set βi = 0 for i > 10. Note that X is influenced by the latent variables z1, … , zn through S in the model F = ΓS described above. In order to simulate λj and ϵj so that they are also influenced by the latent variables z1, … , zn, we performed the following:
Perform K-means clustering on the columns of S with K = 3 using Euclidean distance. This assigns each individual j to one of three mutually exclusive cluster sets where .
Set λj = k for all individuals j in cluster k, for each k = 1, 2, 3.
Let and set for all for each k = 1, 2, 3.
Draw independently for j = 1, 2, … , n.
This strategy simulates non-genetic effects and random variation that manifest among K discrete groups over a more continuous population genetic structure defined by S. This is meant to emulate the fact that environment (specifically lifestyle) may partition among individuals in a manner distinct from, but highly related to population structure.
This yields three values $\sum_{i} \beta_i x_{ij}$, λj, and ϵj for each individual j = 1, 2, … , n. In order to set the variances of these three values to prespecified levels νgen, νenv and νnoise, we rescaled each quantity as follows:
The trait for a given study was then formed according to model (1), $y_j = \alpha + \sum_{i} \beta_i x_{ij} + \lambda_j + \epsilon_j$, using the rescaled quantities, for j = 1, 2, … , n. For each of the 11 simulation scenarios, we considered the following three configurations of (νgen, νenv, νnoise): (5%, 5%, 90%), (10%, 0%, 90%) and (10%, 20%, 70%).
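The variance-rescaling step can be sketched as follows. The exact rescaling formula is not reproduced here, so this example assumes the simplest choice: multiply each component by a constant so that its sample variance equals the target. The toy inputs (including cluster labels standing in for the K-means output) and the function names are hypothetical.

```python
import math
import random
import statistics

def rescale(values, target_var):
    """Scale values so that their (population) sample variance equals target_var."""
    sd = statistics.pstdev(values)
    if target_var == 0 or sd == 0:
        return [0.0] * len(values)
    c = math.sqrt(target_var) / sd
    return [c * v for v in values]

def simulate_trait(genetic, env, noise, v_gen, v_env, v_noise):
    """Sum the rescaled genetic, non-genetic, and noise components (alpha = 0)."""
    g = rescale(genetic, v_gen)
    e = rescale(env, v_env)
    z = rescale(noise, v_noise)
    return [a + b + c for a, b, c in zip(g, e, z)]

rng = random.Random(3)
n = 300                                                  # toy sample size (assumed)
genetic = [rng.gauss(0, 2) for _ in range(n)]            # stands in for sum_i beta_i x_ij
env = [float(rng.choice([1, 2, 3])) for _ in range(n)]   # lambda_j = cluster label k
noise = [rng.gauss(0, 1) for _ in range(n)]              # epsilon_j
y = simulate_trait(genetic, env, noise, 0.05, 0.05, 0.90)
```

Setting a target variance of zero, as in the (10%, 0%, 90%) configuration, removes that component from the trait entirely.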
In total, there were 11 different types of structures considered over three different configurations of genetic, environmental, and noise variances for a total of 33 settings. For each setting, we simulated 100 independent studies where each involved m = 100,000 SNPs and up to n = 5000 individuals.
Northern Finland Birth Cohort Data
Genotype data was downloaded from dbGaP (Study Accession: phs000276.v1.p1 ). Individuals were filtered for completeness (maximum 1% missing genotypes) and pregnancy. (Pregnant women were excluded because we did not receive IRB approval for these individuals.) SNPs were first filtered for completeness (maximum 5% missing genotypes) and minor allele frequency (minimum 1% minor allele frequency), then tested for Hardy-Weinberg equilibrium . The final dimensions of the genotype matrix are m = 324,160 SNPs and n = 5027 individuals.
A Box-Cox transform was applied to each trait, where the parameter was chosen such that the central 95% of the trait values were as close to the Normal distribution as possible. Indicators for sex, oral contraception, and fasting status were added as adjustment variables. For glucose, the individual with the minimum value was removed from the analysis as an extreme outlier. All analyses were performed with d = 6 logistic factors, which was determined based on the Hardy-Weinberg equilibrium method described in ref. [14]. The association tests were performed exactly as described in the main text.
Linear Mixed Effects Model and Principal Component Analysis Approaches
In order to explain the assumptions made by the linear mixed effects model approach (LMM) and principal components approach (PCA), we first re-write model (1) as follows: where the object of inference is βi for each SNP i = 1, … , m. As explained in Astle and Balding (2009) [5], these approaches assume that , meaning that the non-genetic effects are independent from population structure and there is no heteroskedasticity among individuals.
The LMM approach also makes the assumption that we can approximate the genetic contribution by a multivariate Normal distribution: where Φ is the n × n kinship matrix. If we define , we can write the above model as where it is assumed that . Since it is not the case in general that the are identically distributed for all SNPs i = 1, … , m, one can either estimate a different pair of parameters for each SNP or assume that these parameters change very little between SNPs. Since the former tends to be computationally demanding, algorithms such as EMMAX [10] propose to estimate a single pair of parameters from a null model and then utilize this single estimate for every SNP. More recently, algorithms such as GEMMA have been proposed to relax this assumption [15].
The n × n kinship matrix Φ is estimated from the genotype data X. This involves the simultaneous estimation of $(n^2 - n)/2$ parameters, which is particularly large for sample sizes considered in current GWAS (on the order of $10^8$ for n = 10,000). The uncertainty in the estimated Φ is typically not taken into account, and there is so far no regularization of the high-dimensional estimator of Φ. Unregularized estimates of large covariance matrices have been shown to be problematic [28,29], a concern that is also applicable to estimates of Φ. Estimating involves manipulations of the estimated Φ matrix, which can pose numerical challenges due to the fact that the estimated Φ is both high-dimensional and nonsingular. The LMM approach therefore makes assumptions that are important to verify for each given study and it involves some challenging calculations and estimations.
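As a quick arithmetic check of the parameter count quoted above, the number of distinct off-diagonal entries of a symmetric n × n matrix can be computed directly. The function name is hypothetical.

```python
def kinship_parameter_count(n):
    """Number of distinct off-diagonal entries in a symmetric n x n matrix."""
    return (n * n - n) // 2

count = kinship_parameter_count(10_000)
print(count)  # 49995000, roughly 5 x 10^7 pairwise kinship coefficients
```

The count grows quadratically in n, so doubling the sample size roughly quadruples the number of pairwise coefficients to estimate.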
The PCA approach first calculates the top d principal components on a normalized version of the genotype matrix X. In the method proposed by Price et al. (2006) [8], these principal components are then regressed out of each SNP i and the trait (regardless of whether it is binary or quantitative). A correlation statistic is calculated between each adjusted SNP genotype and the adjusted trait, and the p-value that tests for equality to 0 is reported. As shown in Hao et al. (2013) [14], the top d principal components form a high-quality estimate of a linear basis of the allele frequencies πij. Extracting the residuals after linearly regressing the genotype data for SNP i onto these principal components is equivalent to estimating the quantity xij − πij. Using the trait as the response variable in this regression adjustment is equivalent to estimating under the assumptions on the trait model given above (where this quantitative trait model is assumed regardless of whether the trait is quantitative or binary). Therefore, the association test carried out in the PCA approach implicitly involves an estimated form of the model: where it is assumed that λj + ϵj are approximately i.i.d. . When the correlation between the adjusted trait and the adjusted genotype for SNP i is computed, the residual variation is based on the joint distribution of ∑k≠iβk(xkj − πkj) + λj + ϵj for j = 1, … , n.
Let us denote . Since Var(xij − πij) = 2πij(1 − πij) and Var(xkj − πkj) = 2πkj(1 − πkj), it follows that (xij − πij) and (xkj − πkj) for i, k = 1, … , m and j = 1, … , n still suffer from confounding due to structure through their variances. Therefore, the implicit assumption made by the PCA approach that the are independent and identically distributed in the above model is violated. This is our interpretation of why the PCA approach shows poor performance in adjusting for structure under our quantitative trait simulations. Astle and Balding (2009) [5] make further mathematical characterizations of the relationship between the implicit models in the PCA and LMM approaches, which we also found to be helpful.
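The dependence of the residual variance on the allele frequency can be verified numerically. The sketch below computes the variance of a Binomial(2, π) genotype from its three-point probability mass function and compares it with 2π(1 − π); the chosen values of π and the function name are illustrative assumptions.

```python
def genotype_variance(pi):
    """Variance of x ~ Binomial(2, pi), computed from its three-point pmf."""
    pmf = {0: (1 - pi) ** 2, 1: 2 * pi * (1 - pi), 2: pi ** 2}
    mean = sum(x * p for x, p in pmf.items())
    return sum((x - mean) ** 2 * p for x, p in pmf.items())

for pi in (0.05, 0.25, 0.5):
    # The two printed variances agree up to floating-point round-off.
    print(pi, genotype_variance(pi), 2 * pi * (1 - pi))
```

Two individuals whose ancestry gives them allele frequencies of 0.05 and 0.5 at the same SNP therefore have residual variances that differ by about a factor of five, so mean-centering alone does not make the residuals identically distributed.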
Interestingly, when considering the binary trait model (2), the Bernoulli distributed trait does not involve a mean and variance term as in the Normal distributed quantitative trait. It may be the case that this difference contributes to explaining why the PCA approach shows similar behavior to the GCAT and LMM approaches for binary traits (see Results and ref. [5]). Specifically, the PCA approach appears to perform reasonably well in adjusting for structure for the binary trait simulations that we considered.
Software Implementation
The proposed method has been implemented in open source software, which will be made publicly available upon publication.
Footnotes
* These authors contributed equally to this work
+ Present address: Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Rockville, MD 20850
1 In our implementation, the logistic factors are included as covariates, which serve as the model terms that estimate the values.