Testing for genetic associations in arbitrarily structured populations

Minsun Song; Wei Hao; John D. Storey

doi:10.1101/012682

Abstract

We present a new statistical test of association between a trait (either quantitative or binary) and genetic markers, which we theoretically and practically prove to be robust to arbitrarily complex population structure. The statistical test involves a set of parameters that can be directly estimated from large-scale genotyping data, such as that measured in genome-wide association studies (GWAS). We also derive a new set of methodologies, called a genotype-conditional association test (GCAT), shown to provide accurate association tests in populations with complex structures, manifested in both the genetic and environmental contributions to the trait. We demonstrate the proposed method on a large simulation study and on the Northern Finland Birth Cohort study. In the Finland study, we identify several new significant loci that other methods do not detect. Our proposed framework provides a substantially different approach to the problem from existing methods. We provide some discussion on its similarities and differences with the linear mixed model and principal component approaches.

INTRODUCTION

Performing genome-wide tests of association between a trait and genetic markers is one of the most important research efforts in modern genetics [1-3]. However, a major problem to overcome is how to test for associations in the presence of population structure [4]. Human populations are often structured in the sense that the genotype frequencies at a particular locus are not homogeneous throughout the population. Rather, there are latent variables (such as geography or ancestry) that directly affect the allele frequencies of the genotypes. At the same time, there may be other loci and non-genetic factors that also correlate with these latent variables, which in turn are correlated with the trait of interest. When this occurs, genetic markers become spuriously statistically associated with the trait of interest despite the fact that there is no biological connection.

The importance of addressing association testing in structured populations is evidenced by the existence of a large literature of methods proposed for this problem [5,6]. The well established methods all take a similar strategy in that the trait is modeled in terms of the genetic markers of interest, while attempting to adjust for genetic structure. Two popular approaches are to correct for population structure by including principal components of genotypes as adjustment variables [7,8] or by fitting a linear mixed effects model involving an estimated kinship or covariance matrix from the individuals’ genotypes [9,10]. Previous work investigating the limitations of these two methods includes Wang, et al. (2013) [11]. These two approaches have been shown to be based on a common model that makes differing assumptions about how the kinship or covariance matrices are utilized in the model [5]. This common model does not allow for non-genetic (e.g., environmental) contributions to the trait to be dependent on population structure. The linear mixed effects model requires that the genetic component is composed of small effects that additively are well-approximated by the Normal distribution. The model itself is therefore an approximation, and it is not yet possible to theoretically prove that a test based on this model is robust to structure for the more general class of relevant models that we investigate.

By taking a substantially different approach that essentially reverses the placement of the trait and genotype in the model, we formulate and provide a theoretical solution to the problem of association testing in structured populations for both quantitative and binary traits under general assumptions about the complexity of the population structure and its relationship to the trait through both genetic and non-genetic factors. This theoretical solution directly leads to a method for addressing the problem in practice that differs in key ways from the mixed model and principal component approaches. The method is straightforward: a model of structure is first estimated from the genotypes, and then a logistic regression is performed where the SNP genotypes are logistically regressed on the trait plus an adjustment based on the fitted structure model. The coefficient corresponding to the trait is then tested for statistical significance. The class of models to which this provides a test robust to structure is fairly general.

This association testing framework is robust to population genetic structure, to non-genetic effects that are dependent on or correlated with population genetic structure (for example, lifestyle and environment may be correlated with ancestry), and to heteroskedasticity that is dependent on structure. We introduce a test based on this framework, called the “genotype conditional association test” (GCAT). We show the proposed method corrects for structure on simulated data with a quantitative trait and compares favorably to existing methods. We also apply the method to the Northern Finland Birth Cohort data [12] and identify several new associated loci that have not been identified by existing methods. For example, the proposed method is the only one to identify a SNP (rs2814982) associated with height, which we note is linked to another SNP (rs2814993) that has been associated with skeletal frame size [13]. We discuss the advantages and disadvantages of the proposed framework relative to existing approaches, and we conclude that the proposed framework will be useful in future studies as sample sizes and the complexity of structure increase.

RESULTS

Population Structure Model

Suppose that there are n individuals, each with m measured SNP genotypes. The genotype for SNP i in individual j is denoted by x_ij ∈ {0, 1, 2}, i = 1, 2, …, m, j = 1, 2, …, n. We collect these SNP genotypes into an m × n matrix X, where the (i, j) entry is x_ij. We denote the genotypes for individual j by x^j = (x_1j, x_2j, …, x_mj)^T.

We utilize our recently developed framework that flexibly models complex population structures for diallelic loci [14]. Let Z be an unobserved variable describing how individuals fit into the underlying population structure. For a SNP i, the allele frequency π_i can be viewed as being a function of Z, π_i(Z). For a random sample of n individuals from an overall population, we therefore have sampled population structure positions z₁, z₂, …, z_n with resulting allele frequencies π_i(z₁), π_i(z₂), …, π_i(z_n) for SNP i. In Hao et al. (2013) [14], we formulate and estimate a model for m SNPs simultaneously while providing a flexible parameterization of the form of π_i(Z).

For shorthand, π_ij ≡ π_i(z_j) is the allele frequency for SNP i conditioned on the ancestry state of individual j. The π_ij values are called “individual-specific allele frequencies.” These allele frequencies can be collected into an m × n matrix F, where the (i, j) entry is π_ij. Note that E[x_ij/2∣z_j] = π_ij, and when Hardy-Weinberg equilibrium holds, x_ij∣z_j ∼ Binomial(2, π_ij). We utilize the framework from Hao et al. (2013) [14] that allows the simultaneous estimation of all π_ij from a given genotype data set X. Specifically, it provides estimates of latent variables that form a linear basis of the logit(π_ij) quantities, which turns out to be the most convenient scale on which to estimate a model of structure for the proposed testing framework. The model and estimation procedure is called “logistic factor analysis” (LFA). It should be noted that other well-behaved estimates of π_ij may be utilized as well. Further details are provided in Methods.
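As a small illustration of the sampling model x_ij∣z_j ∼ Binomial(2, π_ij), the following sketch draws genotypes from a given matrix F of individual-specific allele frequencies. It is only a toy example: the function name `sample_genotypes` and the 2 × 4 matrix F are hypothetical, and Hardy-Weinberg equilibrium is assumed so that each genotype is the sum of two independent allele draws.

```python
import random

def sample_genotypes(F, rng):
    """Draw x_ij ~ Binomial(2, pi_ij) for every entry of the m x n matrix F.

    Each genotype is the sum of two independent Bernoulli(pi_ij) allele draws,
    which is the Binomial(2, pi_ij) model under Hardy-Weinberg equilibrium.
    """
    return [[(rng.random() < p) + (rng.random() < p) for p in row] for row in F]

rng = random.Random(0)

# Hypothetical individual-specific allele frequencies: 2 SNPs (rows) x 4 individuals (columns).
F = [[0.1, 0.1, 0.8, 0.8],
     [0.5, 0.4, 0.3, 0.2]]

X = sample_genotypes(F, rng)  # m x n genotype matrix with entries in {0, 1, 2}
```

Averaging x_ij/2 over many independent draws with the same π_ij recovers π_ij, which is the identity E[x_ij/2∣z_j] = π_ij stated above.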

Trait Models

We assume a trait (quantitative or binary) has been measured on each individual, which we denote by y_j, j = 1, 2, …, n. One way in which spurious associations occur in the presence of population structure is that SNPs become correlated with each other when structure is not taken into account. Therefore, if a SNP is causal for the trait of interest, then any other SNP correlated with this causal SNP may also show an association. For SNPs in linkage disequilibrium due to their physical proximity to the causal SNP, one expects these to be associated with the trait regardless of structure. However, in the presence of structure, there may be many unlinked SNPs that also show associations with the trait due to the fact that structure induces correlations of these SNPs with the causal SNP. Indeed, one of the early methods for detecting structure in association studies was to show that many randomly chosen, unlinked SNPs show associations to the trait [4]. This source of confounding is typically the main focus of association tests designed for structured populations.

Another key issue that is less often considered is the fact that lifestyle and environment are also often related to ancestry (Figure 1a). This implies that non-genetic effects may also be directly related to structure. We therefore extend the concept of the latent variable Z to include not only population genetic structure, but also lifestyle and environment: Z = (structure, lifestyle, environment). For each observed individual j, there is an underlying latent variable z_j that contains the information about structure, lifestyle, and environment for individual j. We allow for the case that structure or ancestry may be directly influential on or related to lifestyle and environment, and that all three of these variables may influence the trait of interest. An association test that is immune to structure should also be immune to the non-genetic effects that are confounded with structure.

Figure 1:

Rationale for the proposed test of association. (a) A graphical model describing population structure and its effects on traits. Population structure serves as a latent causal variable common among a set of loci, via the allele frequencies. When one locus has a causal effect on the trait, this induces spurious associations with other loci affected by population structure. At the same time, population structure may be highly related to lifestyle and the environment as these are all possibly related to ancestry and geography. (b) Accounting for confounding due to latent population structure. Left panel: A test for association between the ith SNP X_i and trait Y without taking into account Z will produce a spurious association due to the fact that both X_i and Y are confounded with Z. Right panel: A test for association between X_i∣π_i(Z) and Y will be unbiased because conditioning on π_i(Z) breaks the relationship between Z and X_i.

We consider the following models of quantitative and binary traits. We write the trait models in terms of additive genetic effects, but the framework can be extended to account for dominance models and interactions, and the models can also incorporate adjustment variables that capture known sources of trait variation.

The quantitative trait model is

y_j = α + Σ_{i=1}^m β_i x_ij + λ_j + ϵ_j,   (1)

where α is an intercept, β_i is the genetic effect of SNP i on the trait, λ_j is the random non-genetic effect, and ϵ_j is the random noise variation. To allow the interdependence of structure, lifestyle, and environment, we assume that x^j = (x_1j, …, x_mj)^T, λ_j, and σ_j² may all be functions of z_j. We assume that ϵ_j ∼ Normal(0, σ_j²), which allows for heteroskedasticity of the random noise variation. The distribution of λ_j can remain unspecified, although we assume that λ_j and z_j may be dependent random variables. The population genetic model summarized above shows how the distribution of x^j depends on z_j. Without having observed z_j, it follows that x^j, λ_j, and ϵ_j are dependent random variables; however, we assume that conditional on z_j, these random variables are independent.

The binary trait model is

log( Pr(y_j = 1) / Pr(y_j = 0) ) = α + Σ_{i=1}^m β_i x_ij + λ_j,   (2)

where again β_i is the genetic effect of SNP i on the trait, λ_j is the non-genetic effect, and we allow for the case that x^j and λ_j may be dependent due to the common confounding latent variable z_j as described for the quantitative trait model.

We have shown that the linear mixed effects model and principal components approaches involve more restrictive assumptions about the trait models utilized in testing for associations (Methods).

Motivation and Rationale of the Proposed Test

The rationale for the proposed test is schematized in Figure 1. The SNP X_i and the trait Y become spuriously associated because they are under the influence of a common latent variable Z. This latent variable contains information on population structure, lifestyle, and environment, all of which may be interdependent and play a determining role in the trait. The problem is that we cannot directly observe Z and we would like to avoid making assumptions about its mathematical form. If we can successfully construct either X_i∣Z (the distribution of X_i conditional on Z) or Y∣Z, then it is possible to perform a test of association between X_i and Y that is immune to the effects of Z. Possible association tests should occur between X_i∣Z and Y, between X_i and Y∣Z, or between X_i∣Z and Y∣Z.

The linear mixed model and principal components approaches can be interpreted as attempts to estimate a model of Y∣Z. This requires additional assumptions about non-genetic and genetic effects and their relationship to Z, specifically that there is no relationship between structure and non-genetic effects in the trait model (Methods and ref. [5]). Due to the massive number of SNPs that have been measured in GWAS, trying to construct X_i∣Z is appealing since we have an abundance of information about the effect of latent variables on the genotypes. (For example, this can easily be visualized in principal components constructed from the genotypes.) Our approach is therefore to carry out an association test between X_i∣Z and Y by specifically testing whether there is equality or not between Pr(X_i∣Y, Z) and Pr(X_i∣Z) (Figure 1b). If Pr(X_i∣Y, Z) = Pr(X_i∣Z), then there is no association between the SNP X_i and the trait Y; if Pr(X_i∣Y, Z) ≠ Pr(X_i∣Z), then there is an association. This test of association is in theory immune to population structure because we have taken into account Z.

One remaining problem is that we cannot observe Z. However, it is straightforward to show that when there is no association Pr(X_i∣Y, Z, π_i(Z)) = Pr(X_i∣Y, π_i(Z)) and Pr(X_i∣Z, π_i(Z)) = Pr(X_i∣π_i(Z)). In other words, in order to capture X_i∣Z, it suffices to capture X_i∣π_i(Z), the effect of Z on the allele frequency of SNP i. We have recently developed a framework that flexibly parameterizes and estimates X_i∣π_i(Z) [14]. In order to test whether Pr(X_i∣Y, π_i(Z)) = Pr(X_i∣π_i(Z)), we perform a logistic regression of the SNP genotypes X_i on the trait Y plus the transformed individual-specific allele frequencies, logit(π_i(Z)), where logit(p) = log(p/(1 − p)) for 0 < p < 1. This inverse regression approach is a substantial departure from the most commonly employed methods that attempt to adjust for population structure.

Association Test Immune to Population Structure

We have derived a statistical hypothesis test of association that is equivalent to testing whether β_i = 0 for each SNP i in the above trait models (1) and (2), and whose null distribution does not depend on structure or the non-genetic effects correlated with structure, making it immune to spurious associations due to structure (METHODS). Specifically, the test allows for general levels of complexity in structure because the test is based on adjusting for structure according to individual-specific allele frequencies.

We have proved a theorem (METHODS) that shows that β_i = 0 in models (1) and (2) implies that b_i = 0 in the following model:

x_ij ∣ y_j, z_j ∼ Binomial(2, logit⁻¹(a_i + b_i y_j + logit(π_ij)))   (3)

for all j = 1, 2, …, n. This establishes a model that can be used to test for associations in place of models (1) and (2). Note that the non-genetic effects, heteroskedasticity, and polygenic background do not appear in the above model used to test for associations. This is important because under our general assumptions, these terms can be difficult or even impossible to estimate in practice. Furthermore, testing for association under this model means that the test will have a valid null distribution regardless of the form of the non-genetic effects, heteroskedasticity, and polygenic background.

As fully detailed in METHODS, an association statistic whose null distribution is known can be constructed by testing whether b_i = 0 in the above model, which we have shown is a valid test of β_i = 0 in trait models (1) and (2). Briefly, the testing procedure works as follows:

1. Formulate and estimate a model of population structure that provides well-behaved estimates of the logit(π_ij) values. We specifically use the logistic factor analysis (LFA) approach of ref. [14], which has been shown to provide an accurate linear basis of the logit(π_ij) values.
2. For each SNP i, perform a logistic regression of the SNP genotypes on the trait values plus the model terms that estimate the logit(π_ij) values¹. Also, perform a logistic regression of the SNP genotypes on only the model terms that estimate logit(π_ij), where the trait is now excluded from the fit. These two model fits are compared via a likelihood ratio statistic, where the larger the statistic, the more evidence there is that b_i ≠ 0.
3. Calculate a p-value for each SNP based on our result that when the null hypothesis of no association is true, i.e., β_i = 0 in models (1) and (2), the above statistic follows a χ² distribution with one degree of freedom for large sample sizes.
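The testing procedure above can be sketched in code for a single SNP. This is a minimal illustration under stated assumptions, not the authors' implementation: the fitted logit(π_ij) values are taken as given and entered as a single covariate with a free coefficient, standing in for the LFA model terms; the function names (`solve`, `fit_binomial2_logistic`, `gcat_pvalue`) and the toy two-population data are hypothetical; and the p-value uses the large-sample χ² approximation with one degree of freedom.

```python
import math
import random

def solve(A, b):
    """Solve the small linear system A u = b by Gauss-Jordan elimination with pivoting."""
    k = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]

def fit_binomial2_logistic(x, V, iters=50):
    """Fit x_j ~ Binomial(2, p_j), logit(p_j) = V[j] . coef, by Newton's method.

    Returns the coefficients and the maximized log-likelihood (up to an
    additive constant that cancels in likelihood ratios).
    """
    k = len(V[0])
    coef = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xj, v in zip(x, V):
            eta = sum(c * vi for c, vi in zip(coef, v))
            p = 1.0 / (1.0 + math.exp(-eta))
            for a in range(k):
                grad[a] += (xj - 2.0 * p) * v[a]
                for b in range(k):
                    hess[a][b] += 2.0 * p * (1.0 - p) * v[a] * v[b]
        step = solve(hess, grad)
        coef = [c + s for c, s in zip(coef, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    loglik = 0.0
    for xj, v in zip(x, V):
        eta = sum(c * vi for c, vi in zip(coef, v))
        loglik += xj * eta - 2.0 * math.log1p(math.exp(eta))
    return coef, loglik

def gcat_pvalue(x, y, logit_pi):
    """Likelihood ratio test of b_i = 0 for one SNP.

    x: genotypes in {0, 1, 2}; y: trait values; logit_pi: fitted logit(pi_ij).
    """
    null_V = [[1.0, h] for h in logit_pi]                      # structure terms only
    alt_V = [[1.0, h, yj] for h, yj in zip(logit_pi, y)]       # structure terms + trait
    _, ll0 = fit_binomial2_logistic(x, null_V)
    _, ll1 = fit_binomial2_logistic(x, alt_V)
    stat = max(0.0, 2.0 * (ll1 - ll0))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Toy data: two subpopulations that differ in allele frequency and in trait mean,
# so the SNP and the trait are confounded by structure although the SNP is null.
rng = random.Random(1)
n = 400
pi = [0.2 if j < n // 2 else 0.7 for j in range(n)]
x = [(rng.random() < p) + (rng.random() < p) for p in pi]
y = [rng.gauss(0.0 if j < n // 2 else 2.0, 1.0) for j in range(n)]
logit_pi = [math.log(p / (1.0 - p)) for p in pi]

stat, pval = gcat_pvalue(x, y, logit_pi)
```

In this sketch the true π_ij values are supplied, which corresponds to the oracle setting; in practice they would be replaced by estimates from a structure model such as LFA.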

We call our proposed test the “genotype conditional association test” (GCAT). As a general concept, such an approach is sometimes called an inverse regression model because we consider E[x∣y] rather than E[y∣x].

Simulation Studies

We performed an extensive set of simulations to demonstrate that the proposed test is robust to population structure and to assess its power to detect true associations (full technical details in Methods). We compared the proposed test to its oracle version where model (3) and test-statistic (6) are used with the true π_ij values. We also included in the simulation studies three important and popular methods: (i) the method of adjusting the trait and genotypes by principal components computed from the full set of genotypes [8] and (ii) two implementations of the linear mixed effects model approach [9,10], specifically EMMAX by Kang et al. (2010) [10] and GEMMA by Zhou and Stephens (2012) [15]. The methods are abbreviated as “PCA,” “LMM-EMMAX,” and “LMM-GEMMA.”

The complete simulation study on quantitative traits involved population structure constructed in 11 different ways for each of three different apportionments of variance among genetic effects, non-genetic effects, and random variation that all contribute to variation in the trait. Therefore, each configuration involved a constructed allele frequency matrix F and values assigned to the variances Var(Σ_{i=1}^m β_i x_ij), Var(λ_j), and Var(ϵ_j) from model (1). For each of these 33 = 11 × 3 configurations, we simulated 100 GWAS data sets, for a grand total of 3300 studies.

We simulated allele frequencies: (i) subject to structure estimated from three real data sets: HapMap, Human Genome Diversity Project (HGDP), and the 1000 Genomes Project (TGP), where the HapMap structure was simulated according to the Balding-Nichols model; (ii) at four different levels of admixture in the Pritchard-Stephens-Donnelly (PSD) model, which is an extension of the Balding-Nichols model; and (iii) for four different types of spatially defined structure. We intentionally simulated challenging population structures, having in mind that future GWAS such as the forthcoming “Genotype Tissue Expression” program (GTEx) data may involve particularly challenging forms of structure.

In order to provide an extra challenge to the proposed test, we simulated the allele frequencies from a model that differs from LFA model (4). We generated allele frequencies parameterized by F = ΓS, where F is the matrix of π_ij values, Γ is an m × d matrix and S is the d × n matrix that encapsulates the structure. This model captures as special cases the Balding-Nichols model and the PSD model [14]. It was also intended to provide an advantage to the PCA and LMM methods because the structure is manifested on the observed genotype scale [14], which is the same scale on which both methods estimate structure.

We simulated 10 truly associated SNPs whose effect sizes are distributed according to a Normal distribution. All genotypes were simulated to be in linkage equilibrium so that true and false positives are unambiguous. We set the variances Var(Σ_{i=1}^m β_i x_ij), Var(λ_j), and Var(ϵ_j) to be: (5%, 5%, 90%), (10%, 0%, 90%), and (10%, 20%, 70%). Setting these variances enforced a certain overall level of genetic contribution to the trait; therefore our simulation study results were minimally affected by the choice of 10 truly associated SNPs and the Normal distribution on their effect sizes. In each simulation scenario, we simulated data for m = 100,000 SNPs and n = 5000 individuals, except HGDP necessarily restricted us to n = 940 individuals and TGP to n = 1500 individuals. The dimension of the structure was set to d = 3, although we carried out the same simulations for d = 6 and the results were quantitatively very similar and qualitatively equivalent.
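To make the simulation design concrete, the following sketch generates one small study with allele frequencies of the form F = ΓS, genotypes in linkage equilibrium, and a quantitative trait whose genetic, non-genetic, and noise components are rescaled to a chosen apportionment of variance. It is an assumption-laden toy, not the simulation code used here: the admixture-style construction of S (Dirichlet proportions), the way the non-genetic effect depends on structure, the rescaling by sample variance, the small default sizes, and the names `rescale` and `simulate_study` are all hypothetical choices made for illustration.

```python
import random
import statistics

def rescale(values, target_var):
    """Center values and scale them so their population variance equals target_var."""
    if target_var == 0:
        return [0.0] * len(values)
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) * target_var ** 0.5 / sd for v in values]

def simulate_study(m=200, n=300, d=3, n_causal=10,
                   var_parts=(0.05, 0.05, 0.90), seed=0):
    rng = random.Random(seed)
    # Gamma: m x d matrix of allele frequencies in d hypothetical ancestral populations.
    Gamma = [[rng.uniform(0.05, 0.95) for _ in range(d)] for _ in range(m)]
    # S: d x n structure matrix, stored column-wise; each column holds admixture
    # proportions summing to 1, drawn from a symmetric Dirichlet(0.5) distribution.
    S = []
    for _ in range(n):
        g = [rng.gammavariate(0.5, 1.0) for _ in range(d)]
        total = sum(g)
        S.append([gk / total for gk in g])
    # F = Gamma S: individual-specific allele frequencies pi_ij.
    F = [[sum(Gamma[i][k] * S[j][k] for k in range(d)) for j in range(n)]
         for i in range(m)]
    # Genotypes drawn independently given F (linkage equilibrium).
    X = [[(rng.random() < p) + (rng.random() < p) for p in row] for row in F]
    # The first n_causal SNPs are truly associated, with Normal effect sizes.
    beta = [rng.gauss(0.0, 1.0) for _ in range(n_causal)]
    genetic = [sum(beta[i] * X[i][j] for i in range(n_causal)) for j in range(n)]
    # Non-genetic effect that depends on structure (here, on the first row of S).
    nongenetic = [S[j][0] + rng.gauss(0.0, 0.1) for j in range(n)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    parts = [rescale(v, t) for v, t in zip((genetic, nongenetic, noise), var_parts)]
    y = [sum(part[j] for part in parts) for j in range(n)]
    return {"F": F, "X": X, "y": y, "S": S, "parts": parts}

study = simulate_study()  # variance apportionment (5%, 5%, 90%)
```

Because each column of S sums to one and the entries of Γ lie strictly between 0 and 1, every π_ij is a valid allele frequency; other constructions of S (for example, discrete subpopulations or spatial positions) fit the same F = ΓS form.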

Each simulation configuration involved analyzing 100 GWAS data sets (X, y), where the Oracle method, the proposed GCAT method, and the PCA method were applied to each study. For a given simulated study, we obtained a set of m = 100,000 p-values. So-called “spurious associations” occur when the p-values corresponding to null (non-associated) SNPs are artificially small. For a given p-value threshold t, we expect there to be m₀ × t false positives among the m₀ p-values corresponding to null SNPs, where m₀ = 100,000 – 10 in our case. At the same time, we can calculate the observed number of false positives simply by counting how many of the null SNP p-values are less than or equal to t. The excess of observed over expected false positives consists of spurious associations. A method properly accounts for structure when the average of this difference across studies is zero. The best one can do on a study-by-study basis is captured by the Oracle method, which according to our theory is immune to structure and provides the correct null distribution.
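The excess false positive calculation described above amounts to a few lines of code. The sketch below is illustrative only; the function name `excess_false_positives` and the uniformly distributed example p-values are hypothetical.

```python
import random

def excess_false_positives(null_pvalues, thresholds):
    """For each threshold t, return (expected, observed - expected) false positives.

    expected = m0 * t, where m0 is the number of null SNPs;
    observed = number of null p-values less than or equal to t.
    """
    m0 = len(null_pvalues)
    results = []
    for t in thresholds:
        observed = sum(p <= t for p in null_pvalues)
        expected = m0 * t
        results.append((expected, observed - expected))
    return results

# Example: p-values that are exactly Uniform(0, 1), as for a well-calibrated test.
rng = random.Random(0)
null_p = [rng.random() for _ in range(99990)]
curve = excess_false_positives(null_p, [0.0005, 0.001, 0.0025])
```

For a calibrated test the second entry of each pair fluctuates around zero; a systematically positive value indicates spurious associations.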

We found from using the distributed binary executable EMMAX software and our own implementation that EMMAX required a 10-fold increase in computational time over the proposed method and PCA when analyzing n = 5000 individuals. Therefore, it was not reasonable to apply EMMAX to all 3300 simulated GWAS data sets. We limited comparisons with EMMAX to five representative structure configurations of the full 11 for a single apportionment of the variances assigned to genetic effects, non-genetic effects, and random variation. GEMMA was computationally more efficient, though still significantly slower than GCAT or our implementation of PCA. Figure 2 shows the excess in observed false positives vs. the expected number of false positives for the Oracle, GCAT (proposed), PCA, and both implementations of LMM methods under five configurations of structure for the variance configuration corresponding to genetic=5%, environmental=5%, and noise=90%. It can be seen that the LFA implementation of the proposed GCAT method performs similarly to the Oracle test, whereas PCA tends to suffer from an excess of spurious associations. Figures S1-S8 show a more complete set of simulations, with results from all three sets of variances for the full 11 configurations of structure. Due to the computational constraints mentioned above, the additional simulations feature only results from GEMMA for LMM methods.

Figure S1:

Performance of association tests on 100 simulated studies based off the PSD model of structure for various α comparing the Oracle, GCAT (proposed), LMM (GEMMA), and PCA tests. The variance contributions to the trait are genetic=5%, environmental=5%, and noise=90%. The remaining details are equivalent to Figure 2.

Figure S2:

Performance of association tests on 100 simulated studies based off the spatial model of structure for various α comparing the Oracle, GCAT (proposed), LMM (GEMMA), and PCA tests. The variance contributions to the trait are genetic=5%, environmental=5%, and noise=90%. The remaining details are equivalent to Figure 2.

Figure S3:

Performance of association tests on 100 simulated studies based off the PSD model of structure for various α comparing the Oracle, GCAT (proposed), LMM (GEMMA), and PCA tests. The variance contributions to the trait are genetic=10%, environmental=0%, and noise=90%. The remaining details are equivalent to Figure 2.

Figure S4:

Performance of association tests on 100 simulated studies based off the spatial model of structure for various α comparing the Oracle, GCAT (proposed), LMM (GEMMA), and PCA tests. The variance contributions to the trait are genetic=10%, environmental=0%, and noise=90%. The remaining details are equivalent to Figure 2.

Figure S5:

Performance of association tests on 100 simulated studies based off the Balding-Nichols, HGDP, and TGP simulation scenarios comparing the Oracle, GCAT (proposed), LMM (GEMMA), and PCA tests. The variance contributions to the trait are genetic=10%, environmental=0%, and noise=90%. The remaining details are equivalent to Figure 2.

Figure S6:

Performance of association tests on 100 simulated studies based off the PSD model of structure for various α comparing the Oracle, GCAT (proposed), LMM (GEMMA), and PCA tests. The variance contributions to the trait are genetic=20%, environmental=10%, and noise=70%. The remaining details are equivalent to Figure 2.

Figure S7:

Performance of association tests on 100 simulated studies based off the spatial model of structure for various α comparing the Oracle, GCAT (proposed), LMM (GEMMA), and PCA tests. The variance contributions to the trait are genetic=20%, environmental=10%, and noise=70%. The remaining details are equivalent to Figure 2.

Figure S8:

Performance of association tests on 100 simulated studies based off the Balding-Nichols, HGDP, and TGP simulation scenarios comparing the Oracle, GCAT (proposed), LMM (GEMMA), and PCA tests. The variance contributions to the trait are genetic=20%, environmental=10%, and noise=70%. The remaining details are equivalent to Figure 2.

Figure 2:

Performance of association tests on 100 simulated studies based off the Balding-Nichols, HGDP, TGP, PSD α = 0.1, and spatial a = 0.1 simulation scenarios comparing the Oracle, GCAT (proposed), LMM (EMMAX), LMM (GEMMA), and PCA tests. The quantitative traits are based on model (1). The variance contributions to the trait are genetic=5%, environmental=5%, and noise=90%. The differences between the observed number of false positives and expected number of false positives versus the expected number of false positives under the null are plotted for each simulated study (grey lines), the average of those differences (black line), and the middle 90% (blue lines). All simulations had m = 100,000 SNPs, so the range of the x-axis corresponds to choosing a significance threshold of up to 0.0025. The difference on the y-axis is the number of “spurious associations.” PCA is shown on a separate y-axis since it usually has a much larger maximum than the other methods. The Oracle method is where the true allele frequencies are inputted into the proposed test, which we have theoretically proven always corrects for structure.

In comparing the statistical power among the methods (Figures S9-S17), we found that the Oracle, GCAT, and PCA performed similarly well, while the two LMM methods often suffered from a loss of power. We also carried out analogous simulations on binary traits simulated from model (2) and we found that all methods performed similarly well in terms of producing correct p-values that were robust to structure. This result agrees with the comparisons made between PCA and a linear mixed effects model in Astle and Balding (2009) [5].

Figure S9:

Statistical power of the Oracle, GCAT (proposed), PCA, and both LMM association tests. The results are for the simulated data sets shown in Figure 2. The quantitative traits are based on model (1). The variance contributions to the trait are genetic=5%, environmental=5%, and noise=90%.

Figure S10:

Power analysis for the simulation studies presented in Figure S1.

Figure S11:

Power analysis for the simulation studies presented in Figure S2.

Figure S12:

Power analysis for the simulation studies presented in Figure S3.

Figure S13:

Power analysis for the simulation studies presented in Figure S4.

Figure S14:

Power analysis for the simulation studies presented in Figure S5.

Figure S15:

Power analysis for the simulation studies presented in Figure S6.

Figure S16:

Power analysis for the simulation studies presented in Figure S7.

Figure S17:

Power analysis for the simulation studies presented in Figure S8.

Figure S18:

Theoretical versus observed quantiles of −log₁₀(p-value) from the GCAT association tests on the Northern Finland Birth Cohort traits. The y-axis was truncated at p-value < 10⁻⁸; see Table S1 for the smallest p-values for each trait.

Analysis of the Northern Finland Birth Cohort Data

We applied the proposed method to the Northern Finland Birth Cohort (NFBC) genome-wide association study data [12], which includes several metabolic traits and height. This study has also been analyzed by the LMM and PCA methods, as well as a standard analysis uncorrected for structure [10]. We carried out association analyses with the proposed method on the 10 traits that were also analyzed using the other methods (Table 1). After processing the data, including filtering for missing data, minor allele frequencies, and departures from Hardy-Weinberg equilibrium, the data were composed of m = 324,160 SNPs and n = 5027 individuals (Methods). The logistic factors were computed on a subset of the data where markers were at least 200 kbp apart.

Table 1:

Number of significant loci at genome-wide significance (p-value < 7.2 × 10⁻⁸) for each of the 10 traits from the Northern Finland Birth Cohort. The counts for LMM+GC, PCA+GC, and Uncorr+GC are derived from Table 2 in Kang et al. (2010).

Most traits showed only approximate Normal distributions, so we applied a Box-Cox Normal transformation to all traits so that they satisfy the model assumptions. We noted that C-reactive Protein (CRP) and Triglycerides (TG) traits followed an exponential distribution more closely, so it was unnecessary to transform these two traits. The developed theory can be extended to exponentially distributed quantitative traits as well.

The 20 most significant SNPs for each of the 10 traits are shown in Table S1. Kang et al. (2010) utilized a genome-wide significance threshold of p-value < 7.2 × 10⁻⁸ as proposed in ref. [16], so we also utilized this threshold for comparative purposes. The number of loci found to be significant for each method is shown in Table 1. Whereas our proposed method identifies 16 significant loci, the other methods identify 11 to 14 loci.

Table S1:

The top 20 most associated SNPs for each of the 10 traits considered in the Northern Finland Birth Cohort study. The GCAT p-value and GCAT+GC p-value (genomic control adjusted GCAT p-value) are shown for each SNP. SNPs that achieved GCAT+GC p-value < 7.2 × 10⁻⁸ are colored, and each locus for a given trait is given a different color.

We identified three new loci that were not identified by the other methods. None of the other methods identified any significant associations for the height trait. However, we identified rs2814982 on chromosome 6 as being statistically associated with height (Table S1). This SNP is located ∼ 70kbp from another SNP, rs2814993, which has been associated with skeletal frame size in a previous study [13]. Additionally, rs2814993 was the fifth most significant SNP for height. For the LDL cholesterol trait, we identified a significant association with rs11668477, which was significantly associated with LDL cholesterol in a different study [17]. Finally, there were significant associations between the glucose (GLU) trait and a cluster of SNPs (rs3847554, rs1387153, rs1447352, rs7121092) proximal to the MTNR1B locus; variation at this locus has been associated with glucose in a previous study [18].

As described in Sabatti et al. (2009) [12], the NFBC data show modest levels of inflation due to population structure as measured by the genomic control inflation factor (GCIF) [19] of test statistics from an uncorrected analysis. The population structure present among these individuals may be subtler and manifested on a finer scale than other settings. Noting that the GCAT approach does not attempt to adjust for a polygenic background, the GCIF values calculated for the proposed method (Table S2) were found to be in line with what is expected for polygenic traits where no structure is present [20], providing evidence that the proposed method adequately accounts for structure.
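As a rough illustration of how a genomic control inflation factor can be computed, the sketch below divides the median of a set of one-degree-of-freedom test statistics by the median of the χ²₁ distribution (approximately 0.4549). The function name `gcif` and the toy statistics are assumptions made for illustration; they are not taken from the analysis reported here.

```python
import statistics

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1_MEDIAN = 0.4549364

def gcif(test_statistics):
    """Genomic control inflation factor: observed median / null median."""
    return statistics.median(test_statistics) / CHI2_1_MEDIAN

# Toy example (hypothetical values): statistics whose median equals the
# null median give a GCIF of 1, i.e. no inflation.
toy = [0.05, 0.2, CHI2_1_MEDIAN, 1.3, 4.1]
print(round(gcif(toy), 3))
```

Values noticeably above 1 indicate that the bulk of the test statistics is larger than expected under the null distribution.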

Table S2:

The genomic control inflation factor (GCIF) was calculated for each trait in the association analysis of the Northern Finland Birth Cohort traits. The calculation was based on SNPs spaced at ∼250kbp. The 95% Bonferroni adjusted simultaneous confidence interval under the assumption that the median statistic follows the theoretical null distribution is (0.9389, 1.0666). We calculated GCIF for the proposed test statistics as defined in the text.

DISCUSSION

We considered models of quantitative and binary traits involving genetic effects and non-genetic effects in the presence of arbitrarily complex population structure. We allowed for the non-genetic effects to be confounded with population genetic structure since structure, ancestry, lifestyle, and environment – all factors potentially involved in complex traits – may be highly dependent with one another. A causal model provided the intuition that under these models, it is most reasonable to account for this confounding in the genotypes, but it is not tractable to do so in the non-genetic effects. This follows because we have many instances of genotypes that can be jointly modeled to provide reliable estimates of structure, but the non-genetic effects are never directly observed and we do not have repeated instances of them. In general it is not possible to estimate a latent variable that accounts for the confounding between structure and non-genetic effects.

These observations led us to propose an inverse regression approach to testing for associations, where the association is tested by modeling genotype variation in terms of the trait plus model terms accounting for structure. In this model, the terms accounting for structure were based on the logistic factor analysis approach that we have proposed [14], although the general form of the association test can incorporate other methods that estimate population structure. We mathematically proved under general assumptions that the trait term in the model is non-zero only when the genetic marker is truly associated with the trait, regardless of the population structure. We demonstrated that the implemented test properly accounts for structure in a large body of simulated studies that included a wide range of population structures. We also applied the method to 10 traits from the Northern Finland Birth Cohort genome-wide association study. The proposed method identified three new loci associated with the traits, including being the only method among those we considered that identifies a locus associated with the height trait. Overall, we showed that the proposed method compares favorably to existing methods and we also noted that it has favorable computational requirements compared to existing methods.

As GWAS increase in sample size and levels of complexity of population structure, it is important to develop methods that properly account for structure and scale well with sample size. Whereas we found that the popular principal components adjustment does not properly account for structure, we also found that the mixed model approach performs reasonably well. However, the mixed model approach involves estimating an n × n kinship matrix and its current implementation does not scale well with sample size. The kinship matrix quickly becomes computationally unwieldy when n grows large, and the possibility of the estimated kinship matrix becoming overwhelmed by noise is a concern [21]. In the Northern Finland Birth Cohort data, the mixed model approach required us to estimate 12 million parameters, whereas the proposed method involved estimating 25-thousand parameters, a ∼500-fold decrease. A study involving n = 10,000 individuals with the same complexity of structure requires estimating about 50-million parameters in the mixed model kinship matrix, whereas the proposed method requires estimating 50-thousand parameters, a ∼1000-fold decrease. In addition, estimating the structure in the proposed method primarily uses singular value decomposition, for which a rich literature of computational techniques exists. We utilized a Lanczos bidiagonalization algorithm [22] which scales approximately linearly with respect to n for d ≪ n. The proposed method is well equipped to scale to massive GWAS and can take advantage of future advances for computing singular value decomposition.
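The parameter counts quoted above can be checked with a few lines of arithmetic. The kinship count (n² − n)/2 is the number of distinct off-diagonal entries of a symmetric n × n matrix. The count (d − 1) · n for the proposed method is an assumption made here (the d − 1 non-intercept rows of H with d = 6), chosen because it reproduces the quoted figures; the function names are hypothetical.

```python
def kinship_parameters(n):
    """Distinct off-diagonal entries of an n x n symmetric kinship matrix."""
    return (n * n - n) // 2

def lfa_parameters(n, d):
    """Assumed count for the structure estimate: (d - 1) non-intercept rows of H."""
    return (d - 1) * n

# n = 5027 is the Northern Finland Birth Cohort sample size; d = 6 factors.
for n in (5027, 10000):
    k, l = kinship_parameters(n), lfa_parameters(n, d=6)
    print(n, k, l, round(k / l))
```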

The key assumption to verify in utilizing the proposed GCAT approach is that population structure observed in the SNP genotypes is adequately modeled and estimated. One can test for associations among SNPs that show convincing empirical evidence that the model of structure is reasonably well-behaved; this can be directly tested on the genotype data as previously demonstrated in our logistic factor analysis (LFA) model of structure [14]. For example, on the Northern Finland Birth Cohort Study, we empirically verified that utilizing the LFA model with dimension d = 6 accounted for structure reasonably well for the great majority of SNPs. The linear mixed effects model (LMM) approach and principal components (PCA) approach make trait model assumptions that may be difficult to verify in practice (Methods).

We anticipate that the proposed genotype conditional association test (GCAT) will be useful for future studies. The mathematical framework we have developed should facilitate its extension to traits modeled according to distributions not considered here while maintaining our theoretical proof that the test accounts for population structure in the presence of non-genetic effects also confounded with structure.

METHODS

Logistic Factor Analysis (LFA)

When forming a latent variable model of structure, where the goal is to make minimal assumptions about the underlying structure, there are benefits to modeling logit(π_ij) in terms of a latent variable model instead of π_ij directly [14]. The quantity logit(π_ij) = log(π_ij/(1 − π_ij)) is called the “natural parameter” of the distribution of x_ij when we assume Hardy-Weinberg equilibrium so that x_ij ∼ Binomial(2, π_ij). The quantity logit(π_ij) occurs as a linear term in the log-likelihood of the data, and it is the target parameter in logistic regression because of its straightforward mathematical properties. This viewpoint also facilitates calculating the distribution of x_ij given the structure, which is the essential challenge in accounting for structure in the proposed association testing framework.
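The role of logit(π_ij) as the natural parameter can be checked numerically. The sketch below writes the Binomial(2, π) log-probability in two algebraically equivalent ways, the second of which is linear in logit(π). The function names are illustrative assumptions.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def binom2_logpmf(x, p):
    """log Pr(x) for x ~ Binomial(2, p), computed directly."""
    return math.log(math.comb(2, x)) + x * math.log(p) + (2 - x) * math.log(1.0 - p)

def binom2_logpmf_natural(x, p):
    """Same quantity, written so that logit(p) enters linearly through x."""
    return math.log(math.comb(2, x)) + x * logit(p) + 2.0 * math.log(1.0 - p)

# The two forms agree for every genotype value.
for x in (0, 1, 2):
    assert math.isclose(binom2_logpmf(x, 0.3), binom2_logpmf_natural(x, 0.3))
```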

In the association testing framework detailed below, it turns out that developing a latent variable model and estimate of the logit(π_ij) is particularly appropriate. The approach is called “logistic factor analysis” (LFA). Let L be an m × n matrix with (i, j) element equal to logit(π_ij). Consider the following parameterization:

$$L = AH, \tag{4}$$

where A is an m × d matrix, H is a d × n matrix, and d ≪ n. The columns of H are independent, and column j captures the structure information for individual j. That is, Pr(x_ij∣h^j, z_j) = Pr(x_ij∣h^j) where h^j is column j of H. Row i of A determines how SNP i is affected by structure. We have shown in ref. [14] that this model performs well in estimating structure resulting from discrete subpopulations, admixed populations, the Balding-Nichols model [23], the Pritchard-Stephens-Donnelly model [24], and models of spatially oriented structure.

In practice, H will be unknown, so it must be estimated. We have developed a method called logistic factor analysis (LFA) that we have shown to estimate H well [14]. Specifically, the LFA estimate Ĥ has been shown to span the same space as the true H at a high level of accuracy, which implies that replacing H with Ĥ in the above equations yields nearly identical results. The accuracy of Ĥ in estimating H has been demonstrated even when the individual-specific allele frequencies are not directly constructed from model (4), L = AH.
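To make the parameterization L = AH concrete, the following minimal sketch builds individual-specific allele frequencies π_ij = logit⁻¹((AH)_ij) from toy matrices and draws genotypes from Binomial(2, π_ij). The matrix sizes, the random entries of A and H, and the function names are hypothetical; this is not the LFA estimation procedure of ref. [14], only an illustration of the model it targets.

```python
import math
import random

def inv_logit(v):
    return 1.0 / (1.0 + math.exp(-v))

def allele_frequencies(A, H):
    """pi_ij = logit^{-1}((AH)_ij) for an m x d matrix A and a d x n matrix H."""
    m, d, n = len(A), len(H), len(H[0])
    return [[inv_logit(sum(A[i][k] * H[k][j] for k in range(d)))
             for j in range(n)] for i in range(m)]

def draw_genotypes(F, rng):
    """x_ij ~ Binomial(2, pi_ij), drawn as the sum of two Bernoulli trials."""
    return [[(rng.random() < p) + (rng.random() < p) for p in row] for row in F]

rng = random.Random(0)
m, d, n = 4, 3, 5                      # toy sizes (hypothetical)
A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
H = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(d - 1)]
H.append([1.0] * n)                    # last row of ones acts as an intercept
F = allele_frequencies(A, H)
X = draw_genotypes(F, rng)
```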

Proposed Association Testing Framework

We have derived a statistical hypothesis test of association that is equivalent to testing whether β_i = 0 for each SNP i in the above trait models (1) and (2), and whose null distribution does not depend on structure or the non-genetic effects correlated with structure, making it immune to spurious associations due to structure. Specifically, the test allows for general levels of complexity in structure because the test is based on adjusting for structure according to individual-specific allele frequencies.

A Model of Genetic Variation Given the Trait and Structure

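As a first step, we have proved a theorem (see below) that shows that β_i = 0 in models (1) and (2) implies that b_i = 0 in the following model:

$$x_{ij} \mid y_j, z_j \sim \mathrm{Binomial}\left(2,\ \mathrm{logit}^{-1}\left(a_i + b_i y_j + \mathrm{logit}(\pi_{ij})\right)\right) \tag{3}$$

for all j = 1, 2, … , n. This establishes a model that can be used to test for associations in place of models (1) and (2).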

There are a few important details to note. First, the variables λ_j, σ²_j, and (x_kj)_k≠i do not appear in the model. This is important because it is impossible to estimate λ_j and σ²_j in the typical setting, and we will also typically not know the polygenic component ∑_k≠i β_k x_kj of the model. Second, the genotype variation is being modeled in terms of the trait variation, instead of the other way around. It is initially counter-intuitive because almost all association tests involve modeling the trait in terms of the SNP genotypes. As explained in more detail below, this reversal is crucial for adjusting the probability distribution of x_ij according to structure, and for eliminating the need to estimate λ_j, σ²_j, and (β_k)_k≠i.

We call our proposed test the “genotype conditional association test” (GCAT). The model we propose to utilize is sometimes called an inverse regression model because we utilize E[x∣y] rather than E[y∣x].

Proposed Test Conditional on Individual-Specific Allele Frequencies

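As a second step, we have derived a test statistic to test whether b_i = 0 in model (3) whose null distribution is immune to structure. The log-likelihood function of the parameters given individual j is

$$\ell(a_i, b_i \mid x_{ij}, y_j, \pi_{ij}) = \log \Pr(x_{ij} \mid y_j, \pi_{ij};\, a_i, b_i),$$

where the probability on the right-hand side is calculated according to model (3). The log-likelihood of all n individuals is

$$\ell(a_i, b_i \mid x_i, y, \pi_i) = \sum_{j=1}^{n} \ell(a_i, b_i \mid x_{ij}, y_j, \pi_{ij}),$$

where π_i = (π_i1, π_i2, … , π_in) and y = (y₁, y₂, … , y_n). The test statistic we utilize is a generalized likelihood ratio test statistic [25]:

$$T(x_i, y, \pi_i) = 2\left[\max_{a_i, b_i} \ell(a_i, b_i \mid x_i, y, \pi_i) - \max_{a_i} \ell(a_i, b_i = 0 \mid x_i, y, \pi_i)\right]. \tag{6}$$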

The log-likelihood is maximized by performing a logistic regression of all n observed genotypes for SNP i on the right-hand side of model (3). We have proven a theorem (Theorem 2 below) that shows that when β_i = 0 in models (1) or (2), the null distribution of this test statistic is χ²₁, regardless of the values of π_ij, (x_kj)_k≠i, (β_k)_k≠i, λ_j, and σ²_j for j = 1, 2, … , n in models (1) and (2).
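The test conditional on known allele frequencies can be sketched in a few lines. The code below assumes the inverse regression model takes the logistic form x_ij ∼ Binomial(2, logit⁻¹(a_i + b_i y_j + logit(π_ij))), so that logit(π_ij) enters as a fixed offset; it fits the full and null models by Newton-Raphson and compares the likelihood ratio statistic to χ²₁. The function names, the fixed iteration count, and the simulated data are assumptions for illustration, not the released implementation.

```python
import math
import random

def _loglik(x, y, offset, a, b):
    """Binomial(2, p_j) log-likelihood up to a constant, logit(p_j) = a + b*y_j + offset_j."""
    total = 0.0
    for xj, yj, oj in zip(x, y, offset):
        eta = a + b * yj + oj
        total += xj * eta - 2.0 * math.log1p(math.exp(eta))
    return total

def _fit(x, y, offset, fit_b, iters=50):
    """Newton-Raphson for (a, b); b is held at 0 when fit_b is False."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xj, yj, oj in zip(x, y, offset):
            p = 1.0 / (1.0 + math.exp(-(a + b * yj + oj)))
            r, w = xj - 2.0 * p, 2.0 * p * (1.0 - p)   # score residual, weight
            g0 += r
            g1 += r * yj
            h00 += w
            h01 += w * yj
            h11 += w * yj * yj
        if fit_b:
            det = h00 * h11 - h01 * h01
            a += (h11 * g0 - h01 * g1) / det
            b += (h00 * g1 - h01 * g0) / det
        else:
            a += g0 / h00
    return a, b

def gcat_statistic(x, y, pi):
    """Likelihood ratio statistic and chi-square(1) p-value for H0: b = 0."""
    offset = [math.log(p / (1.0 - p)) for p in pi]
    a1, b1 = _fit(x, y, offset, fit_b=True)
    a0, _ = _fit(x, y, offset, fit_b=False)
    t = 2.0 * (_loglik(x, y, offset, a1, b1) - _loglik(x, y, offset, a0, 0.0))
    t = max(t, 0.0)
    return t, math.erfc(math.sqrt(t / 2.0))

# Toy usage on simulated (hypothetical) data: one null SNP, one associated SNP.
rng = random.Random(7)
n = 500
pi = [rng.uniform(0.2, 0.8) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]
draw = lambda p: (rng.random() < p) + (rng.random() < p)
x_null = [draw(p) for p in pi]
x_alt = [draw(1.0 / (1.0 + math.exp(-(math.log(p / (1.0 - p)) + yj))))
         for p, yj in zip(pi, y)]
t_null, p_null = gcat_statistic(x_null, y, pi)
t_alt, p_alt = gcat_statistic(x_alt, y, pi)
```

In practice the π_ij are unknown; the logistic factors would then replace the offset as covariates in the same regression.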

Proposed Test In Terms of LFA Model

As a third step, we have extended the above results to the case where the individual-specific allele frequencies are unknown and must be estimated. This requires a model of the individual-specific allele frequencies, and we utilize model (4) so that logit(π_ij) = ∑_{k=1}^{d} a_ik h_kj. First, assume that H from model (4) is known. We have proved that β_i = 0 in models (1) and (2) implies b_i = 0 in the following model:

$$x_{ij} \mid y_j, z_j \sim \mathrm{Binomial}\left(2,\ \mathrm{logit}^{-1}\left(\sum_{k=1}^{d} a_{ik} h_{kj} + b_i y_j\right)\right) \tag{7}$$

for all j = 1, 2, … , n, where h^j is column j of H and it is noted that without loss of generality we let h_dj = 1, making a_id an intercept term. The test statistic used to test for an association between SNP i and the trait is the following generalized likelihood ratio test statistic:

$$T(x_i, y, H) = 2\left[\max_{a_i, b_i} \ell(a_i, b_i \mid x_i, y, H) - \max_{a_i} \ell(a_i, b_i = 0 \mid x_i, y, H)\right], \tag{8}$$

where a_i = (a_i1, a_i2, … , a_id). The log-likelihoods in this test statistic are maximized by performing a logistic regression of all n observed genotypes for SNP i on the right-hand side of model (7). As in the previous case, we have proven a theorem (Corollary 2 below) that shows that when β_i = 0 in models (1) or (2), the null distribution of this test statistic is χ²₁, regardless of the values of π_i, (x_kj)_k≠i, β_−i, λ, and σ² in models (1) and (2).

The proposed test utilizes LFA to form an estimate Ĥ, replaces H with Ĥ, and carries out the test using model (7) and test statistic (8): T(x_i, y, Ĥ). This approach directly allows the simultaneous estimation of a_i and b_i for each SNP i under the unconstrained model and the estimation of a_i with b_i = 0 under the constraints of the null hypothesis. Because of this, the test allows the uncertainty of the m × d unknown parameters of A to be taken into account and it allows b_i to be competitively fit with a_i under the unconstrained, alternative hypothesis model.

Another approach is to first carry out estimation of F by whatever method the analyst finds appropriate and then base the test on statistic (6) with the π_ij replaced with the estimates π̂_ij. This has the advantage that it allows for a much broader class of methods to estimate F, but it may be more conservative than the above implementation because b_i is not competitively fit with the π_ij under the unconstrained model. In this case, F may be estimated in a manner that allows for fine-scale levels of inter-individual coancestry and locus-specific models of structure without relying on the lower d-dimensional factorized model L = AH that we used here.

Proposed Test Under the Alternative Hypothesis

The proposed association test is based on models (3) and (7). Even though we have proved that the test is immune to population structure, it is also important to demonstrate that the test has favorable statistical power to identify true associations. We have shown that model (3) is a tractable approximation of the true model under general configurations of a true alternative hypothesis for SNP i where β_i ≠ 0 (see below). This provides the beginnings of a mathematical framework for characterizing the power of the test.

Theorems and Proofs

Because x_ij∣z_j ∼ Binomial(2, π_i(z_j)) where we write π_ij ≡ π_i(z_j), it follows that Pr(x_ij∣π_ij, z_j) = Pr(x_ij∣π_ij). We assume that Pr(x_ij∣h^j, z_j) = Pr(x_ij∣h^j); in other words, all information about the influence of population structure on the genotypes of individual j is captured through column j of H. It therefore follows that Pr(x_ij∣π_ij, h^j, z_j) = Pr(x_ij∣π_ij, h^j) = Pr(x_ij∣π_ij). We also assume that the SNP genotypes are mutually independent given the structure (which also implies the set of SNPs we consider are in linkage equilibrium, given the structure). These assumptions yield the following equalities:

Theorem 1

Suppose that y_j is distributed according to model (1) or (2), x_ij∣π_ij ∼ Binomial(2, π_ij) as parameterized above, and the SNP genotypes are mutually independent given the structure as detailed above. Then β_i = 0 in models (1) or (2) implies that b_i = 0 in model (3).

Note: We provide two proofs of this theorem because both provide relevant insights. The first version gives insight into the probabilistic mechanism underlying the proposed approach and has some generality beyond the modeling assumptions made here. The second version directly shows how the terms in models (1) and (2) relate to those in model (3).

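Proof (version 1): When β_i = 0, it follows that Pr(y_j∣(x_kj)_k≠i, x_ij, z_j) = Pr(y_j∣(x_kj)_k≠i, z_j) by the assumptions of models (1) and (2). Noting that Pr((x_kj)_k≠i∣x_ij, z_j) = Pr((x_kj)_k≠i∣z_j) by the conditional independence assumption, we have:

$$\begin{aligned}
\Pr(y_j \mid x_{ij}, z_j) &= \sum_{(x_{kj})_{k \neq i}} \Pr(y_j \mid (x_{kj})_{k \neq i}, x_{ij}, z_j)\, \Pr((x_{kj})_{k \neq i} \mid x_{ij}, z_j) \\
&= \sum_{(x_{kj})_{k \neq i}} \Pr(y_j \mid (x_{kj})_{k \neq i}, z_j)\, \Pr((x_{kj})_{k \neq i} \mid z_j) \\
&= \Pr(y_j \mid z_j).
\end{aligned}$$

By Bayes theorem we have

$$\Pr(x_{ij} \mid y_j, z_j) = \frac{\Pr(y_j \mid x_{ij}, z_j)\, \Pr(x_{ij} \mid z_j)}{\Pr(y_j \mid z_j)}.$$

Since Pr(y_j∣x_ij, z_j) = Pr(y_j∣z_j), this implies that Pr(x_ij∣y_j, z_j) = Pr(x_ij∣z_j) and it follows that b_i = 0 in model (3).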

Proof (version 2): For either model (1) or (2), it follows that and similarly

By the assumptions detailed above, we have Pr(x_ij∣(x_kj)_k≠i, z_j) = Pr(x_ij∣π_ij) and therefore:

Under the quantitative trait model (1), it follows that

Plugging this back into equation (10) shows that where and . Following analogous steps, we find that where . When β_i = 0 in model (1), then a_ij = ã_ij = b_ij = 0.

Under the binary trait model (2), it follows that where and b_i = β_i. Plugging this back into equation (10) shows that

Following analogous steps, we find that where . When β_i = 0 in model (2), then a_ij = ã_ij = b_i = 0.

Putting these together, we have that when β_i = 0 in models (1) or (2), then model (3) holds with b_i = 0.

Corollary 1

Suppose that the assumptions of Theorem 1 hold and additionally logit(π_ij) = ∑_{k=1}^{d} a_ik h_kj as in model (4). Then β_i = 0 in models (1) or (2) implies that b_i = 0 in model (7).

Proof: The proof is the same as that to Theorem 1, except we replace π_ij with h^j.

Theorem 2

Suppose that y_j is distributed according to model (1) or (2) and that x_ij∣π_ij ∼ Binomial(2, π_ij). If β_i = 0 in models (1) or (2), then the test statistic T(x_i, y, π_i) defined in (6) converges in distribution to χ²₁ as n → ∞.

Proof: When β_i = 0, then [x_ij∣y_j, π_ij] ∼ Binomial(2, π_ij) by Theorem 1. It then follows that T(x_i, y, π_i) → χ²₁ in distribution as n → ∞ by Wilks’ theorem [25].

Corollary 2

Suppose that the assumptions of Theorem 1 hold and additionally logit(π_ij) = ∑_{k=1}^{d} a_ik h_kj as in model (4). If β_i = 0 in models (1) or (2), then the test statistic T(x_i, y, H) defined in (8) converges in distribution to χ²₁ as n → ∞.

Proof: When β_i = 0, then [x_ij∣y_j, h^j] ∼ Binomial(2, logit⁻¹(∑_{k=1}^{d} a_ik h_kj)) by Corollary 1. It then follows that T(x_i, y, H) → χ²₁ in distribution as n → ∞ by Wilks’ theorem [25].
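Because the generalized likelihood ratio statistic constrains a single parameter, Wilks' theorem gives a χ² reference distribution with one degree of freedom, and p-values follow from its upper tail. The sketch below uses the identity Pr(χ²₁ > t) = erfc(√(t/2)) and finds a critical value by bisection. The function names and bisection bounds are illustrative assumptions.

```python
import math

def chi2_1_sf(t):
    """Upper tail probability Pr(chi-square_1 > t) for t >= 0."""
    return math.erfc(math.sqrt(t / 2.0))

def chi2_1_critical(alpha, lo=0.0, hi=200.0):
    """Statistic value whose upper tail probability equals alpha (bisection)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_1_sf(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi

# Statistic value corresponding to the genome-wide threshold 7.2e-8.
print(chi2_1_critical(7.2e-8))
```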

Proposed Model Under the Alternative Hypothesis

When the alternative model is true this means that β_i ≠ 0. In this case it is worthwhile to characterize model (3) in terms of the distribution of x_ij∣y_j, z_j. Under trait models (1) or (2), it follows that:

This implies that where under model (1) we have , and under model (2) we have , , b_ij = β_i.

In the case that a_ij = ã_ij, it is the case that

However, this exact equality is only the case when β_i = 0. For the typical effect sizes seen in GWAS, it will nevertheless be true that a_ij ≈ ã_ij, in which case the above functional form will be approximately true. This allows for an approximation that can be utilized in practice for power calculations.

Simulated Allele Frequencies

In order to simulate the m × n matrix of genotypes X, we first needed to simulate the m × n matrix of allele frequencies F. Recall that we model the allele frequencies by forming L = logit(F) and then utilizing the model L = AH from equation (4).

Instead of simulating allele frequencies from the L = AH model we use to perform the proposed association test, we instead simulated them from a different model to demonstrate the flexibility of the L = AH model. Specifically, we let F = ΓS where Γ is m × d and S is d × n with d ≤ n. The d × n matrix S encapsulates the genetic population structure for these individuals since S is not SNP-specific but is shared across SNPs. The m × d matrix Γ maps how the structure is manifested in the allele frequencies of each SNP. We have shown that the model F = ΓS includes as special cases discrete subpopulations, the Balding-Nichols model, and the Pritchard-Stephens-Donnelly model.

We formed Γ and S for the 11 different population structure configurations exactly as carried out in Hao et al. (2013) [14]. These constructions are summarized as follows from Hao et al. (2013).

Balding-Nichols Model

The HapMap data set was deliberately sampled to be from three discrete populations, which allowed us to populate each row i of Γ with three independent and identically distributed draws from the Balding-Nichols model: γ_ik ∼ BN(p_i, F_i), where k ∈ {1, 2, 3}. Each γ_ik is interpreted to be the allele frequency for subpopulation k at SNP i. The pairs (p_i, F_i) were computed by randomly selecting a SNP in the HapMap data set, calculating its observed allele frequency, and estimating its F_ST value using the Weir & Cockerham estimator [26]. The columns of S were populated with indicator vectors such that each individual was assigned to one of the three subpopulations. The subpopulation assignments were drawn independently with probabilities 60/210, 60/210, and 90/210, which reflect the subpopulation proportions in the HapMap data set. The dimensions of the simulated data were m = 100,000 SNPs and n = 5000 individuals.
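This construction can be sketched as follows, using the standard parameterization of the Balding-Nichols model as a Beta distribution, BN(p, F) = Beta(p(1 − F)/F, (1 − p)(1 − F)/F). The values of p and F, the number of individuals, and the function name are hypothetical; in the simulations the pairs (p_i, F_i) came from the HapMap data.

```python
import random

def balding_nichols(p, fst, rng):
    """One draw from BN(p, F) = Beta(p(1-F)/F, (1-p)(1-F)/F)."""
    scale = (1.0 - fst) / fst
    return rng.betavariate(p * scale, (1.0 - p) * scale)

rng = random.Random(1)
p_i, f_i = 0.3, 0.1          # hypothetical (p_i, F_i) for one SNP
gamma_i = [balding_nichols(p_i, f_i, rng) for _ in range(3)]  # row i of Gamma

# Assign 10 toy individuals to subpopulations with probabilities
# 60/210, 60/210 and 90/210, then encode each as an indicator column of S.
labels = rng.choices([0, 1, 2], weights=[60, 60, 90], k=10)
S = [[1 if lab == k else 0 for lab in labels] for k in range(3)]

# Allele frequency of individual j at SNP i: entry (i, j) of F = Gamma S.
pi_i = [sum(gamma_i[k] * S[k][j] for k in range(3)) for j in range(10)]
```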

1000 Genomes Project (TGP)

We started with the TGP data set from Hao et al. (2013) [14]. The matrix Γ was generated by sampling for k = 1, 2 and setting γ_i3 = 0.05. In order to generate S, we computed the first two principal components of the TGP genotype matrix after mean centering each SNP. We then transformed each principal component to be between (0, 1) and set the first two rows of S to be the transformed principal components. The third row of S was set to 1, i.e. an intercept. The dimensions of the simulated data were m = 100,000 and n = 1500, where n was determined by the number of individuals in the TGP data set.

Human Genome Diversity Project (HGDP)

We started with the HGDP data set from Hao et al. (2013) [14] and applied the same simulation scheme as for the TGP scenario. The dimensions of the simulated data were m = 100,000 and n = 940, where n was determined by the number of individuals in the HGDP data set.

Pritchard-Stephens-Donnelly (PSD)

The PSD model assumes individuals to be an admixture of ancestral subpopulations. The rows of Γ were again created by three independent and identically distributed draws from the Balding-Nichols model: γ_ik ∼ BN(p_i, F_i), where k ∈ {1, 2, 3}. For this scenario, the pairs (p_i, F_i) were computed from the HGDP data set by calculating the observed allele frequency and estimating F_ST via the Weir & Cockerham estimator [26]. The estimator requires each individual to be assigned to a subpopulation; these assignments were made according to the K = 5 subpopulations from the analysis in Rosenberg et al. (2002) [27]. The columns of S were sampled from a Dirichlet(α) distribution for j = 1, …, n. There were four PSD scenarios with parameter values α = (0.01, 0.01, 0.01), α = (0.1, 0.1, 0.1), α = (0.5, 0.5, 0.5), and α = (1, 1, 1). The scenario with α = (0.1, 0.1, 0.1) was chosen as the representative structure for Figure 2. The dimensions of the simulated data were m = 100,000 SNPs and n = 5000 individuals.
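A minimal sketch of the admixture step, assuming the admixture proportions for each individual are drawn from Dirichlet(α): a Dirichlet draw is obtained by normalizing independent Gamma(α_k, 1) variates, and the individual-specific allele frequency is the proportion-weighted average of the subpopulation frequencies. The subpopulation frequencies and function name below are hypothetical.

```python
import random

def dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]

rng = random.Random(3)
alpha = (0.1, 0.1, 0.1)            # the representative PSD setting
gamma_i = [0.1, 0.5, 0.9]          # hypothetical subpopulation frequencies at SNP i

# One column of S per individual; pi_ij is entry (i, j) of F = Gamma S.
columns = [dirichlet(alpha, rng) for _ in range(5)]
pi_i = [sum(g * s for g, s in zip(gamma_i, col)) for col in columns]
```

Small values of α concentrate each individual's ancestry in a single subpopulation, whereas α = (1, 1, 1) spreads the admixture proportions uniformly over the simplex.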

Spatial

We seek to simulate genotypes such that the population structure relates to the spatial position of the individuals. The matrix Γ was populated by sampling for k = 1, 2 and setting γ_i3 = 0.05. The first two rows of S correspond to coordinates for each individual on the unit square and were set to be independent and identically distributed samples from Beta(a, a), while the third row of S was set to be 1, i.e. an intercept. There were four spatial scenarios with parameter values of a = 0.1, 0.25, 0.5, and 1. As a → 0, the individuals are placed closer to the corners of the unit square, while when a = 1, the individuals are distributed uniformly. a = 0.1 was chosen as the representative structure for Figure 2. The dimensions of the simulated data were m = 100,000 SNPs and n = 5000 individuals.

Simulated Traits

For each of the 11 simulation scenarios, we generated 100 independent studies. For each study, X was formed by simulating x_ij ∼ Binomial(2, π_ij), where π_ij is element (i, j) of the matrix F constructed as described above. In order to simulate a quantitative trait, we needed to simulate α, β_i, λ_j, and ϵ_j from model (1).

First, we set α = 0. Without loss of generality SNPs i = 1, 2, … , 10 were set to be true alternative SNPs (where β_i ≠ 0); we simulated β_i for i = 1, 2, … , 10. We set β_i = 0 for i > 10. Note that X is influenced by the latent variables z₁, … , z_n through S in the model F = ΓS described above. In order to simulate λ_j and ϵ_j so that they are also influenced by the latent variables z₁, … , z_n, we performed the following:

Perform K-means clustering on the columns of S with K = 3 using Euclidean distance. This assigns each individual j to one of three mutually exclusive cluster sets 𝒮₁, 𝒮₂, 𝒮₃, where 𝒮_k ⊂ {1, 2, … , n}.
Set λ_j = k for all j ∈ 𝒮_k for each k = 1, 2, 3.
Let and set for all for each k = 1, 2, 3.
Draw ϵ_j ∼ Normal(0, σ²_j) independently for j = 1, 2, … , n.

This strategy simulates non-genetic effects and random variation that manifest among K discrete groups over a more continuous population genetic structure defined by S. This is meant to emulate the fact that environment (specifically lifestyle) may partition among individuals in a manner distinct from, but highly related to population structure.

This yields three values, ∑_i β_i x_ij, λ_j, and ϵ_j, for each individual j = 1, 2, … , n. In order to set the variances of these three values to prespecified levels ν_gen, ν_env and ν_noise, we rescaled each quantity as follows:

The trait for a given study was then formed according to model (1) from these rescaled quantities for j = 1, 2, … , n. For each of the 11 simulation scenarios, we considered the following three configurations of (ν_gen, ν_env, ν_noise): (5%, 5%, 90%), (10%, 0%, 90%) and (10%, 20%, 70%).
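One simple way to set component variances to prespecified levels is to multiply each component by the square root of the ratio of the target variance to its sample variance. The sketch below does this under that assumption; it is not necessarily the exact rescaling used in the simulations, and the function name and toy component values are hypothetical.

```python
import math
import random
import statistics

def rescale(values, target_var):
    """Scale values so that their (population) sample variance equals target_var."""
    factor = math.sqrt(target_var / statistics.pvariance(values))
    return [v * factor for v in values]

rng = random.Random(0)
n = 1000
genetic = [rng.gauss(0, 2) for _ in range(n)]      # stand-in for the genetic component
environ = [float(rng.choice([1, 2, 3])) for _ in range(n)]  # stand-in for lambda_j
noise = [rng.gauss(0, 1) for _ in range(n)]        # stand-in for epsilon_j

# Configuration (nu_gen, nu_env, nu_noise) = (5%, 5%, 90%).
g = rescale(genetic, 0.05)
e = rescale(environ, 0.05)
z = rescale(noise, 0.90)
trait = [gj + ej + zj for gj, ej, zj in zip(g, e, z)]  # alpha = 0
```

Each component then has the requested variance; the variance of the summed trait equals the total only up to the sample covariances among the components.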

In total, there were 11 different types of structures considered over three different configurations of genetic, environmental, and noise variances for a total of 33 settings. For each setting, we simulated 100 independent studies where each involved m = 100,000 SNPs and up to n = 5000 individuals.

Northern Finland Birth Cohort Data

Genotype data were downloaded from dbGaP (Study Accession: phs000276.v1.p1). Individuals were filtered for completeness (maximum 1% missing genotypes) and pregnancy. (Pregnant women were excluded because we did not receive IRB approval for these individuals.) SNPs were first filtered for completeness (maximum 5% missing genotypes) and minor allele frequency (minimum 1% minor allele frequency), then tested for Hardy-Weinberg equilibrium. The final dimensions of the genotype matrix are m = 324,160 SNPs and n = 5027 individuals.

A Box-Cox transform was applied to each trait, where the parameter was chosen such that the central 95% of the trait values were as close to the Normal distribution as possible. Indicators for sex, oral contraception, and fasting status were added as adjustment variables. For glucose, the individual with the minimum value was removed from the analysis as an extreme outlier. All analyses were performed with d = 6 logistic factors, which was determined based on the Hardy-Weinberg equilibrium method described in ref. [14]. The association tests were performed exactly as described in the main text.
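For reference, the Box-Cox transform itself is sketched below in its standard form: (y^λ − 1)/λ for λ ≠ 0 and log y for λ = 0, applied to strictly positive values. The selection of λ based on the central 95% of the trait values is not shown; the function name and example values are illustrative assumptions.

```python
import math

def box_cox(values, lam):
    """Box-Cox transform of strictly positive values with parameter lam."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

# Hypothetical trait values transformed with two candidate parameters.
trait = [0.8, 1.5, 2.2, 4.0, 9.5]
print(box_cox(trait, 0.5))
print(box_cox(trait, 0))
```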

Linear Mixed Effects Model and Principal Component Analysis Approaches

In order to explain the assumptions made by the linear mixed effects model approach (LMM) and principal components approach (PCA), we first rewrite model (1) as follows:

$$y_j = \alpha + \beta_i x_{ij} + \sum_{k \neq i} \beta_k x_{kj} + \lambda_j + \epsilon_j,$$

where the object of inference is β_i for each SNP i = 1, … , m. As explained in Astle and Balding (2009) [5], these approaches assume that the λ_j + ϵ_j are independent and identically distributed, meaning that the non-genetic effects are independent from population structure and there is no heteroskedasticity among individuals.

The LMM approach also makes the assumption that we can approximate the genetic contribution by a multivariate Normal distribution: where Φ is the n × n kinship matrix. If we define , we can write the above model as where it is assumed that . Since it is not the case in general that the are identically distributed for all SNPs i = 1, … , m, one can either estimate a different pair of parameters for each SNP or assume that these parameters change very little between SNPs. Since the former tends to be computationally demanding, algorithms such as EMMAX [10] propose to estimate a single pair of parameters from a null model and then utilize this single estimate for every SNP. More recently, algorithms such as GEMMA have been proposed to relax this assumption [15].

The n × n kinship matrix Φ is estimated from the genotype data X. This involves the simultaneous estimation of (n² − n)/2 parameters, which is particularly large for sample sizes considered in current GWAS (about 5 × 10⁷ for n = 10,000). The uncertainty in the estimated Φ is typically not taken into account, and there is so far no regularization of the high-dimensional estimator of Φ. Unregularized estimates of large covariance matrices have been shown to be problematic [28,29], a concern that is also applicable to estimates of Φ. Estimating the model parameters involves manipulations of the estimated Φ matrix, which can pose numerical challenges due to the fact that the estimated Φ is both high-dimensional and nonsingular. The LMM approach therefore makes assumptions that are important to verify for each given study and it involves some challenging calculations and estimations.

The PCA approach first calculates the top d principal components on a normalized version of the genotype matrix X. In the method proposed by Price et al. (2006) [8], these principal components are then regressed out of each SNP i and the trait (regardless of whether it is binary or quantitative). A correlation statistic is calculated between each adjusted SNP genotype and the adjusted trait, and the p-value that tests for equality to 0 is reported. As shown in Hao et al. (2013) [14], the top d principal components form a high-quality estimate of a linear basis of the allele frequencies π_ij. Extracting the residuals after linearly regressing the genotype data for SNP i onto these principal components is equivalent to estimating the quantity x_ij − π_ij. Using the trait as the response variable in this regression adjustment is equivalent to estimating under the assumptions on the trait model given above (where this quantitative trait model is assumed regardless of whether the trait is quantitative or binary). Therefore, the association test carried out in the PCA approach implicitly involves an estimated form of the model: where it is assumed that λ_j + ϵ_j are approximately i.i.d. . When a correlation between the adjusted trait and the adjusted genotype for SNP i is carried out, then the residual variation is based on the joint distribution of ∑_k≠iβ_k(x_kj − π_ik) + λ_j + ϵ_j for j = 1, … , n.

Let us denote . Since Var(x_ij − π_ij) = 2π_ij(1 − π_ij) and Var(x_kj − π_kj) = 2π_kj(1 − π_kj), it follows that (x_ij − π_ij) and (x_kj − π_kj) for i, k = 1, … , m and j = 1, … , n still suffer from confounding due to structure through their variances. Therefore, the implicit assumption made by the PCA approach that the are independent and identically distributed in the above model is violated. This is our interpretation of why the PCA approach shows poor performance in adjusting for structure under our quantitative trait simulations. Astle and Balding (2009) [5] make further mathematical characterizations of the relationship between the implicit models in the PCA and LMM approaches, which we also found to be helpful.

Interestingly, when considering the binary trait model (2), the Bernoulli distributed trait does not involve a mean and variance term as in the Normal distributed quantitative trait. It may be the case that this difference contributes to explaining why the PCA approach shows similar behavior to the GCAT and LMM approaches for binary traits (see Results and ref. [5]). Specifically, the PCA approach appears to perform reasonably well in adjusting for structure for the binary trait simulations that we considered.

Software Implementation

The proposed method has been implemented in open source software, which will be made publicly available upon publication.

Footnotes

* These authors contributed equally to this work.
+ Present address: Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Rockville, MD 20850
¹ In our implementation, the logistic factors are included as covariates, which serve as the model terms that estimate the logit(π_ij) values.

References

[1] McCarthy, M. I., Abecasis, G. R., Cardon, L. R., Goldstein, D. B., Little, J., Ioannidis, J. P. A., and Hirschhorn, J. N. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 9(5), 356–369 (2008).
[2] Frazer, K. A., Murray, S. S., Schork, N. J., and Topol, E. J. Human genetic variation and its contribution to complex traits. Nat Rev Genet 10(4), 241–251 (2009).
[3] Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447(7145), 661–678 (2007).
[4] Pritchard, J. K. and Rosenberg, N. A. Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet 65(1), 220–228 (1999).
[5] Astle, W. and Balding, D. J. Population structure and cryptic relatedness in genetic association studies. Statistical Science 24, 451–471 (2009).
[6] Price, A. L., Zaitlen, N. A., Reich, D., and Patterson, N. New approaches to population stratification in genome-wide association studies. Nat Rev Genet 11(7), 459–463 (2010).
[7] Zhang, S., Zhu, X., and Zhao, H. On a semiparametric test to detect associations between quantitative traits and candidate genes using unrelated individuals. Genetic Epidemiology 24(1), 44–56 (2003).
[8] Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., and Reich, D. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909 (2006).
[9] Yu, J., Pressoir, G., Briggs, W. H., Vroh Bi, I., Yamasaki, M., Doebley, J. F., McMullen, M. D., Gaut, B. S., Nielsen, D. M., Holland, J. B., Kresovich, S., and Buckler, E. S. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics 38(2), 203–208 (2006).
[10] Kang, H. M., Sul, J. H., Service, S. K., Zaitlen, N. A., Kong, S.-Y., Freimer, N. B., Sabatti, C., and Eskin, E. Variance component model to account for sample structure in genome-wide association studies. Nature Genetics 42(4), 348–354 (2010).
[11] Wang, K., Hu, X., and Peng, Y. An analytical comparison of the principal component method and the mixed effects model for association studies in the presence of cryptic relatedness and population stratification. Hum. Hered. 76(1), 1–9 (2013).
[12] Sabatti, C., Service, S. K., Hartikainen, A.-L., Pouta, A., Ripatti, S., Brodsky, J., Jones, C. G., Zaitlen, N. A., Varilo, T., Kaakinen, M., Sovio, U., Ruokonen, A., Laitinen, J., Jakkula, E., Coin, L., Hoggart, C., Collins, A., Turunen, H., Gabriel, S., Elliott, P., McCarthy, M. I., Daly, M. J., Järvelin, M.-R., Freimer, N. B., and Peltonen, L. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nature Genetics 41(1), 35–46 (2009).
[13] Soranzo, N., Rivadeneira, F., Chinappen-Horsley, U., Malkina, I., Richards, J. B., Hammond, N., Stolk, L., Nica, A., Inouye, M., Hofman, A., Stephens, J., Wheeler, E., Arp, P., Gwilliam, R., Jhamai, P. M., Potter, S., Chaney, A., Ghori, M. J. R., Ravindrarajah, R., Ermakov, S., Estrada, K., Pols, H. A. P., Williams, F. M., McArdle, W. L., van Meurs, J. B., Loos, R. J. F., Dermitzakis, E. T., Ahmadi, K. R., Hart, D. J., Ouwehand, W. H., Wareham, N. J., Barroso, I., Sandhu, M. S., Strachan, D. P., Livshits, G., Spector, T. D., Uitterlinden, A. G., and Deloukas, P. Meta-analysis of genome-wide scans for human adult stature identifies novel loci and associations with measures of skeletal frame size. PLoS Genetics 5(4), e1000445 (2009).
[14] Hao, W., Song, M., and Storey, J. D. Probabilistic models of genetic variation in structured populations applied to global human studies. arXiv:1312.2041 (2013).
[15] Zhou, X. and Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nature Genetics 44(7), 821–824 (2012).
[16] Dudbridge, F. and Gusnanto, A. Estimation of significance thresholds for genomewide association scans. Genetic Epidemiology 32(3), 227–234 (2008).
[17] Sandhu, M. S., Waterworth, D. M., Debenham, S. L., Wheeler, E., Papadakis, K., Zhao, J. H., Song, K., Yuan, X., Johnson, T., Ashford, S., Inouye, M., Luben, R., Sims, M., Hadley, D., McArdle, W., Barter, P., Kesäniemi, Y. A., Mahley, R. W., McPherson, R., Grundy, S. M., Wellcome Trust Case Control Consortium, Bingham, S. A., Khaw, K.-T., Loos, R. J. F., Waeber, G., Barroso, I., Strachan, D. P., Deloukas, P., Vollenweider, P., Wareham, N. J., and Mooser, V. LDL-cholesterol concentrations: a genome-wide association study. Lancet 371(9611), 483–491 (2008).
[18] Prokopenko, I., Langenberg, C., Florez, J. C., Saxena, R., Soranzo, N., Thorleifsson, G., Loos, R. J. F., Manning, A. K., Jackson, A. U., Aulchenko, Y., Potter, S. C., Erdos, M. R., Sanna, S., Hottenga, J.-J., Wheeler, E., Kaakinen, M., Lyssenko, V., Chen, W.-M., Ahmadi, K., Beckmann, J. S., Bergman, R. N., Bochud, M., Bonnycastle, L. L., Buchanan, T. A., Cao, A., Cervino, A., Coin, L., Collins, F. S., Crisponi, L., de Geus, E. J. C., Dehghan, A., Deloukas, P., Doney, A. S. F., Elliott, P., Freimer, N., Gateva, V., Herder, C., Hofman, A., Hughes, T. E., Hunt, S., Illig, T., Inouye, M., Isomaa, B., Johnson, T., Kong, A., Krestyaninova, M., Kuusisto, J., Laakso, M., Lim, N., Lindblad, U., Lindgren, C. M., McCann, O. T., Mohlke, K. L., Morris, A. D., Naitza, S., Orrù, M., Palmer, C. N. A., Pouta, A., Randall, J., Rathmann, W., Saramies, J., Scheet, P., Scott, L. J., Scuteri, A., Sharp, S., Sijbrands, E., Smit, J. H., Song, K., Steinthorsdottir, V., Stringham, H. M., Tuomi, T., Tuomilehto, J., Uitterlinden, A. G., Voight, B. F., Waterworth, D., Wichmann, H.-E., Willemsen, G., Witteman, J. C. M., Yuan, X., Zhao, J. H., Zeggini, E., Schlessinger, D., Sandhu, M., Boomsma, D. I., Uda, M., Spector, T. D., Penninx, B. W., Altshuler, D., Vollenweider, P., Jarvelin, M. R., Lakatta, E., Waeber, G., Fox, C. S., Peltonen, L., Groop, L. C., Mooser, V., Cupples, L. A., Thorsteinsdottir, U., Boehnke, M., Barroso, I., Van Duijn, C., Dupuis, J., Watanabe, R. M., Stefansson, K., McCarthy, M. I., Wareham, N. J., Meigs, J. B., and Abecasis, G. R. Variants in MTNR1B influence fasting glucose levels. Nature Genetics 41(1), 77–81 (2009).
[19] Devlin, B. and Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).
[20] Yang, J., Weedon, M. N., Purcell, S., Lettre, G., Estrada, K., Willer, C. J., Smith, A. V., Ingelsson, E., O’Connell, J. R., Mangino, M., Mägi, R., Madden, P. A., Heath, A. C., Nyholt, D. R., Martin, N. G., Montgomery, G. W., Frayling, T. M., Hirschhorn, J. N., McCarthy, M. I., Goddard, M. E., Visscher, P. M., and GIANT Consortium. Genomic inflation factors under polygenic inheritance. European Journal of Human Genetics 19(7), 807–812 (2011).
[21] Witten, D. M., Tibshirani, R., and Hastie, T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3), 515–534 (2009).
[22] Baglama, J. and Reichel, L. Restarted block Lanczos bidiagonalization methods. Numerical Algorithms 43, 251–272 (2006).
[23] Balding, D. J. and Nichols, R. A. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96(1-2), 3–12 (1995).
[24] Pritchard, J. K., Stephens, M., and Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155(2), 945–959 (2000).
[25] Wilks, S. S. The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics 9(1), 60–62 (1938).
[26] Weir, B. and Cockerham, C. Estimating F-statistics for the analysis of population structure. Evolution 38, 1358–1370 (1984).
[27] Rosenberg, N. A., Pritchard, J. K., Weber, J. L., Cann, H. M., Kidd, K. K., Zhivotovsky, L. A., and Feldman, M. W. Genetic structure of human populations. Science 298, 2381–2385 (2002).
[28] Bickel, P. J. and Levina, E. Regularized estimation of large covariance matrices. The Annals of Statistics 36(1), 199–227 (2008).
[29] Friedman, J., Hastie, T., and Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3), 432–441 (2008).

Posted December 12, 2014.


[1] [1].↵
McCarthy, M. I., Abecasis, G. R., Cardon, L. R., Goldstein, D. B., Little, J., Ioannidis, J. P. A., and Hirschhorn, J. N. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature reviews. Genetics 9(5), 356–369 (2008).
OpenUrl CrossRef PubMed Web of Science

[2] [2].↵
Frazer, K. A., Murray, S. S., Schork, N. J., and Topol, E. J. Human genetic variation and its contribution to complex traits. Nat Rev Genet 10(4), 241–251, Apr (2009).
OpenUrl CrossRef PubMed Web of Science

[3] [3].↵
Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447(7145), 616–678, Jun (2007).
OpenUrl

[4] [4].↵
Pritchard, J. K. and Rosenberg, N. A. Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet 65(1), 220–228, Jul (1999).
OpenUrl CrossRef PubMed Web of Science

[5] [5].↵
Astle, W. and Balding, D. J. Population structure and cryptic relatedness in genetic association studies. Statistical Science 24, 451–471 (2009).
OpenUrl CrossRef

[6] [6].↵
Price, A. L., Zaitlen, N. A., Reich, D., and Patterson, N. New approaches to population stratification in genome-wide association studies. Nat Rev Genet 11(7), 459–463, Jun (2010).
OpenUrl CrossRef PubMed Web of Science

[7] [7].↵
Zhang, S., Zhu, X., and Zhao, H. On a semiparametric test to detect associations between quantitative traits and candidate genes using unrelated individuals. Genetic epidemiology 24(1), 44–56 (2003).
OpenUrl CrossRef PubMed Web of Science

[8] [8].↵
Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., and Reich, D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38(8), 904–909, Aug (2006).
OpenUrl CrossRef PubMed Web of Science

[9] [9].↵
Yu, J., Pressoir, G., Briggs, W. H., Vroh Bi, I., Yamasaki, M., Doebley, J. F., McMullen, M. D., Gaut, B. S., Nielsen, D. M., Holland, J. B., Kresovich, S., and Buckler, E. S. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics 38(2), 203–208 (2006).
OpenUrl CrossRef PubMed Web of Science

[10] [10].↵
Kang, H. M., Sul, J. H., Service, S. K., Zaitlen, N. A., Kong, S.-Y., Freimer, N. B., Sabatti, C., and Eskin, E. Variance component model to account for sample structure in genome-wide association studies. Nature Genetics 42(4), 348–354 (2010).
OpenUrl CrossRef PubMed Web of Science

[11] [11].↵
Wang, K., Hu, X., and Peng, Y. An analytical comparison of the principal component method and the mixed effects model for association studies in the presence of cryptic relatedness and population stratification. Hum. Hered. 76(1), 1–9 (2013).
OpenUrl

[12] [12].↵
Sabatti, C., Service, S. K., Hartikainen, A.-L., Pouta, A., Ripatti, S., Brodsky, J., Jones, C. G., Zaitlen, N. A., Varilo, T., Kaakinen, M., Sovio, U., Ruokonen, A., Laitinen, J., Jakkula, E., Coin, L., Hoggart, C., Collins, A., Turunen, H., Gabriel, S., Elliot, P., McCarthy, M. I., Daly, M. J., Järvelin, M.-R., Freimer, N. B., and Peltonen, L. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nature Genetics 41(1), 35–46 (2009).
OpenUrl CrossRef PubMed Web of Science

[13] [13].↵
Soranzo, N., Rivadeneira, F., Chinappen-Horsley, U., Malkina, I., Richards, J. B., Hammond, N., Stolk, L., Nica, A., Inouye, M., Hofman, A., Stephens, J., Wheeler, E., Arp, P., Gwilliam, R., Jhamai, P. M., Potter, S., Chaney, A., Ghori, M. J. R., Ravindrarajah, R., Ermakov, S., Estrada, K., Pols, H. A. P., Williams, F. M., McArdle, W. L., van Meurs, J. B., Loos, R. J. F., Dermitzakis, E. T., Ahmadi, K. R., Hart, D. J., Ouwehand, W. H., Wareham, N. J., Barroso, I., Sandhu, M. S., Strachan, D. P., Livshits, G., Spector, T. D., Uitterlinden, A. G., and Deloukas, P. Meta-analysis of genome-wide scans for human adult stature identifies novel loci and associations with measures of skeletal frame size. PLoS genetics 5(4), e1000445 (2009).
OpenUrl

[14] [14].↵
Hao, W., Song, M., and Storey, J. D. Probabilistic models of genetic variation in structured populations applied to global human studies. arXiv:1312.2041 (2013).

[15] [15].↵
Zhou, X. and Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nature genetics 44(7), 821–824, Jul (2012).
OpenUrl CrossRef PubMed

[16] [16].↵
Dudbridge, F. and Gusnanto, A. Estimation of significance thresholds for genomewide association scans. Genetic Epidemiology 32(3), 227–234 (2008).
OpenUrl CrossRef PubMed Web of Science

[17] [17].↵
Sandhu, M. S., Waterworth, D. M., Debenham, S. L., Wheeler, E., Papadakis, K., Zhao, J. H., Song, K., Yuan, X., Johnson, T., Ashford, S., Inouye, M., Luben, R., Sims, M., Hadley, D., McArdle, W., Barter, P., Kesäniemi, Y. A., Mahley, R. W., McPherson, R., Grundy, S. M., Wellcome Trust Case Control Consortium, Bingham, S. A., Khaw, K.-T., Loos, R. J. F., Waeber, G., Barroso, I., Strachan, D. P., Deloukas, P., Vollenweider, P., Wareham, N. J., and Mooser, V. Ldl-cholesterol concentrations: a genome-wide association study. Lancet 371(9611), 483–491 (2008).
OpenUrl CrossRef PubMed Web of Science

[18] [18].↵
Prokopenko, I., Langenberg, C., Florez, J. C., Saxena, R., Soranzo, N., Thorleifsson, G., Loos, R. J. F., Manning, A. K., Jackson, A. U., Aulchenko, Y., Potter, S. C., Erdos, M. R., Sanna, S., Hottenga, J.-J., Wheeler, E., Kaakinen, M., Lyssenko, V., Chen, W.-M., Ahmadi, K., Beckmann, J. S., Bergman, R. N., Bochud, M., Bonnycastle, L. L., Buchanan, T. A., Cao, A., Cervino, A., Coin, L., Collins, F. S., Crisponi, L., de Geus, E. J. C., Dehghan, A., Deloukas, P., Doney, A. S. F., Elliott, P., Freimer, N., Gateva, V., Herder, C., Hofman, A., Hughes, T. E., Hunt, S., Illig, T., Inouye, M., Isomaa, B., Johnson, T., Kong, A., Krestyaninova, M., Kuusisto, J., Laakso, M., Lim, N., Lindblad, U., Lindgren, C. M., McCann, O. T., Mohlke, K. L., Morris, A. D., Naitza, S., Orrù, M., Palmer, C. N. A., Pouta, A., Randall, J., Rathmann, W., Saramies, J., Scheet, P., Scott, L. J., Scuteri, A., Sharp, S., Sijbrands, E., Smit, J. H., Song, K., Steinthorsdottir, V., Stringham, H. M., Tuomi, T., Tuomilehto, J., Uitterlinden, A. G., Voight, B. F., Waterworth, D., Wichmann, H.-E., Willemsen, G., Witteman, J. C. M., Yuan, X., Zhao, J. H., Zeggini, E., Schlessinger, D., Sandhu, M., Boomsma, D. I., Uda, M., Spector, T. D., Penninx, B. W., Altshuler, D., Vollenweider, P., Jarvelin, M. R., Lakatta, E., Waeber, G., Fox, C. S., Peltonen, L., Groop, L. C., Mooser, V., Cupples, L. A., Thorsteinsdottir, U., Boehnke, M., Barroso, I., Van Duijn, C., Dupuis, J., Watanabe, R. M., Stefansson, K., McCarthy, M. I., Wareham, N. J., Meigs, J. B., and Abecasis, G. R. Variants in mtnr1b influence fasting glucose levels. Nature genetics 41(1), 77–81 (2009).
OpenUrl CrossRef PubMed Web of Science

[19] [19].↵
Devlin, B. and Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).
OpenUrl CrossRef PubMed Web of Science

[20] [20].↵
Yang, J., Weedon, M. N., Purcell, S., Lettre, G., Estrada, K., Willer, C. J., Smith, A. V., Ingelsson, E., O’Connell, J. R., Mangino, M., Mägi, R., Madden, P. A., Heath, A. C., Nyholt, D. R., Martin, N. G., Montgomery, G. W., Frayling, T. M., Hirschhorn, J. N., McCarthy, M. I., Goddard, M. E., Visscher, P. M., and GIANT Consortium. Genomic inflation factors under polygenic inheritance. European Journal of Human Genetics 19(7), 807–812 (2011).
OpenUrl CrossRef PubMed

[21] [21].↵
Witten, D. M., Tibshirani, R., and Hastie, T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3), 515–534 (2009).
OpenUrl CrossRef PubMed Web of Science

[22] [22].↵
Baglama, J. and Reichel, L. Restarted block lanczos bidiagonalization methods. Numerical Algorithms 43, 251–272 (2006).
OpenUrl

[23] [23].↵
Balding, D. J. and Nichols, R. A. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96(1-2), 3–12 (1995).
OpenUrl CrossRef PubMed Web of Science

[24] [24].↵
Pritchard, J. K., Stephens, M., and Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155(2), 945–959, Jun (2000).
OpenUrl Abstract/FREE Full Text

[25] [25].↵
Wilks, S. S. The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics 9(1), 60–62 (1938).
OpenUrl CrossRef

[26] [26].↵
Weir, B. and Cockerham, C. Estimating F-statistics for the analysis of population structure. Evolution 38, 1358–1370 (1984).
OpenUrl CrossRef PubMed Web of Science

[27] [27].↵
Rosenberg, N. A., Pritchard, J. K., Weber, J. L., Cann, H. M., Kidd, K. K., Zhivotovsky, L. A., and Feldman, M. W. Genetic structure of human populations. Science 298, 2381–2385 (2002).
OpenUrl Abstract/FREE Full Text

[28] [28].↵
Bickel, P. J. and Levina, E. Regularized estimation of large covariance matrices. The Annals of Statistics 36(1), 199–227, 02 (2008).
OpenUrl CrossRef

[29] [29].↵
Friedman, J., Hastie, T., and Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3), 432–441 (2008).
OpenUrl CrossRef PubMed Web of Science
