SNP-based heritability estimation: measurement noise, population stratification and stability

Eric R. Gamazon; Danny S. Park

doi:10.1101/040055

Abstract

Siddharth Krishna Kumar¹ and co-authors claim to have shown that “GCTA applied to current SNP data cannot produce reliable or stable estimates of heritability.” Given the numerous recent studies on the genetic architecture of complex traits that are based on this methodology, these claims have important implications for the field. Through an investigation of the stability of the likelihood function under phenotype perturbation and an analysis of its dependence on the spectral properties of the genetic relatedness matrix, our study characterizes the properties of an important approach to the analysis of GWAS data and identifies crucial errors in the authors’ analyses, invalidating their main conclusions.

Heritability estimation using genome-wide SNP data is a fundamental research topic with profound implications for studies of the genetic architecture of complex traits. The development of a novel methodology^2,3 in this direction has spurred studies, on a broad spectrum of complex traits, that have reinforced the view that a substantial portion of missing heritability can be accounted for by hitherto undiscovered common variants^4,5 and has led to substantial research that has demonstrated that certain functional categories of SNPs contribute disproportionately to the heritability of complex diseases^6-8. However, in a recent report¹, Krishna Kumar and co-authors claim to have proved that the method “may not reliably improve our understanding of the genomic basis of phenotypic variability” even when the assumptions of the method are satisfied exactly and that the heritability estimates produced are highly sensitive to the choice of sample used and to measurement errors in the phenotype. We investigated these claims by characterizing the dynamical properties of the likelihood function and identified crucial analytic errors that seriously undermine the validity of the authors’ conclusions.

The GREML model

We consider the following model (Figure 1A) of the phenotype y (which has been simplified, as in Krishna Kumar et al., to exclude any fixed effects):

$$y = Zu + \varepsilon$$

where u is a P×1 vector of random (genetic) effects, Z is an N×P (standardized genotype) matrix and ε is the (non-genetic) residual. Here

$$u \sim N(0, \sigma^2 I_P), \qquad \varepsilon \sim N(0, \alpha^2 I_N)$$

Figure 1.

A) The GREML model (which underlies the GCTA software implementation) has been simplified here, as in the Krishna Kumar et al. study, to exclude fixed effects. Note that the phenotypic covariance decomposes into a genetic covariance and a residual covariance. The Genetic Relatedness Matrix (GRM), which quantifies the genetic similarity between pairs of individuals using the genotype data Z, can be written as A = ZZ^T/P. Using Restricted Maximum Likelihood (REML), GCTA provides estimates of σ² and α² and thus of the SNP-based heritability: h² = Pσ²/(Pσ² + α²). The dynamics of the log-likelihood can be investigated by considering a perturbation in the phenotype vector (e.g., through the gradient or the local curvature) or the spectral properties of the GRM. B) We performed simulations, assuming N = 2,000 unrelated individuals, P = 50,000 independent SNPs and h² = 0.75. For each value of the minor allele frequency (MAF) ∊ {0.10, 0.30}, we generated the matrix Z by drawing from the binomial distribution Bin(2, MAF) and standardizing (i.e., centering and scaling) the entries. We simulated 100 phenotypes for each MAF. The genetic effects u were drawn from the standard normal N(0,1). We used the generative model described in (A), adding the residual variance needed to attain the required level of heritability. The distribution of GREML estimates for h² and the corresponding standard error is shown for each MAF.

Thus, the distribution of y assumes the following form:

$$y \sim N\left(0,\ \sigma^2 ZZ^T + \alpha^2 I_N\right)$$

Note that the phenotypic covariance, var(y), is the sum of a genetic covariance and a residual covariance. The Genetic Relatedness Matrix (GRM), which quantifies the genetic similarity between pairs of individuals using the genotype data Z, can be written as follows:

$$A = \frac{ZZ^T}{P}$$
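As an illustration of the construction A = ZZ^T/P, the following minimal Python sketch draws genotypes from Bin(2, maf), standardizes each SNP and forms the GRM. The function names, the dimensions (kept much smaller than in Figure 1B) and the handling of monomorphic SNPs are assumptions made for this example; it is not the code used for the simulations reported here.

```python
import numpy as np

def standardized_genotypes(n, p, maf, rng):
    """Draw an n x p genotype matrix from Bin(2, maf) and standardize each SNP (column)."""
    g = rng.binomial(2, maf, size=(n, p)).astype(float)
    g -= g.mean(axis=0)          # center each SNP at its sample mean
    sd = g.std(axis=0)           # sample standard deviation per SNP
    keep = sd > 0                # monomorphic SNPs cannot be scaled; drop them (assumption)
    return g[:, keep] / sd[keep]

def grm(Z):
    """Genetic relatedness matrix A = Z Z^T / P for a standardized genotype matrix Z."""
    return Z @ Z.T / Z.shape[1]

rng = np.random.default_rng(0)
Z = standardized_genotypes(n=200, p=1000, maf=0.30, rng=rng)
A = grm(Z)
```

Because every column of Z is centered and scaled to unit sample variance, the diagonal of A averages to 1 and each row of A sums to 0; both are quick sanity checks on a GRM built this way.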

Singularity index and induced quadratic form

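We refer to the function S(Z) ≔ log(det(α²I + σ²ZZ^T)) as the singularity index (because it provides a formal test for the invertibility of the phenotypic covariance matrix α²I + σ²ZZ^T) and refer to the function Q(Z,y₁) ≔ y₁^T(α²I + σ²ZZ^T)⁻¹y₁ as the induced quadratic form. Note the log-likelihood of the observed phenotype data y₁ is given by

$$\ell(\sigma^2, \alpha^2; y_1) = -\frac{1}{2}\left[ N\log(2\pi) + \log\det\left(\alpha^2 I + \sigma^2 ZZ^T\right) + y_1^T\left(\alpha^2 I + \sigma^2 ZZ^T\right)^{-1} y_1 \right]$$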

Using Restricted Maximum Likelihood (REML), GCTA estimates the variances σ² and α² given the observation y₁, thereby providing an estimate of the SNP-based heritability:

$$\hat{h}^2 = \frac{P\hat{\sigma}^2}{P\hat{\sigma}^2 + \hat{\alpha}^2}$$

Equivalently, the log-likelihood function, now viewed as a function of Z and y₁, can be written as a sum involving the singularity index and the induced quadratic form:

$$\ell(Z, y_1) = -\frac{1}{2}\left[ N\log(2\pi) + S(Z) + Q(Z, y_1) \right]$$
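To make the decomposition concrete, here is a minimal Python sketch that evaluates the Gaussian log-likelihood as −½[N log(2π) + S(Z) + Q(Z,y₁)], with S(Z) = log det(α²I + σ²ZZ^T) and Q(Z,y₁) = y₁^T(α²I + σ²ZZ^T)⁻¹y₁. The function names are hypothetical, and a Gaussian Z is used purely as a stand-in for a standardized genotype matrix; this is an assumption of the example, not a statement about genotype data.

```python
import numpy as np

def phenotypic_covariance(Z, sigma2, alpha2):
    """var(y) = alpha2 * I + sigma2 * Z Z^T."""
    return alpha2 * np.eye(Z.shape[0]) + sigma2 * (Z @ Z.T)

def singularity_index(Z, sigma2, alpha2):
    """S(Z) = log det(alpha2 * I + sigma2 * Z Z^T), computed stably via slogdet."""
    _, logdet = np.linalg.slogdet(phenotypic_covariance(Z, sigma2, alpha2))
    return logdet

def induced_quadratic_form(Z, y, sigma2, alpha2):
    """Q(Z, y) = y^T (alpha2 * I + sigma2 * Z Z^T)^{-1} y."""
    V = phenotypic_covariance(Z, sigma2, alpha2)
    return float(y @ np.linalg.solve(V, y))

def log_likelihood(Z, y, sigma2, alpha2):
    """Sum of the constant, the singularity index and the induced quadratic form."""
    n = Z.shape[0]
    return -0.5 * (n * np.log(2 * np.pi)
                   + singularity_index(Z, sigma2, alpha2)
                   + induced_quadratic_form(Z, y, sigma2, alpha2))

rng = np.random.default_rng(1)
N, P, h2 = 60, 150, 0.75
Z = rng.standard_normal((N, P))          # stand-in for standardized genotypes (assumption)
sigma2, alpha2 = h2 / P, 1.0 - h2        # so that P*sigma2 / (P*sigma2 + alpha2) = h2
y = Z @ rng.normal(0.0, np.sqrt(sigma2), P) + rng.normal(0.0, np.sqrt(alpha2), N)
ll = log_likelihood(Z, y, sigma2, alpha2)
```

Only the quadratic form depends on the phenotype, which is why the phenotype sensitivity of the likelihood can be studied through Q alone.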

Perturbation of the standardized genotype matrix Z and the GRM A

Because the Z in the GREML model is a standardized genotype matrix (wherein each entry is a function of the number of copies of the reference allele and the reference allele frequency at a SNP), there are implicit constraints on what is a valid perturbed genotype matrix Z + perturb(Z) (i.e., constraints which determine whether Z + perturb(Z) is a realizable or an ill-defined standardized genotype matrix). A perturbation matrix perturb(Z) may generate a matrix that departs substantially from a standardized genotype matrix, yielding an ill-defined revised model. To illustrate this, if the original (e.g., independent, real and random) entries in Z have mean 0 and variance 1, a perturbation with elements on the primary diagonal due to the introduction of the phenotype noise would preserve the mean of these elements but alter their variance, possibly quite substantially. In short, not every element of Matrices(N, P) represents a standardized genotype matrix, and not every perturbation is a reasonable one. For the same reason, a perturbation of the GRM (by an error matrix E, as in the authors’ equation [A17] of the Appendix) does not necessarily generate a valid (revised) GRM. (For example, the resulting perturbed GRM must be symmetric, which implies that the perturbation matrix E must be symmetric as well.) Furthermore, modeling the difference between the true Z and the sample Z through an error matrix F via an additive model (Z_sample = Z_true + F) makes some very strong assumptions, including that the two matrices, Z_true and Z_sample, are of the same dimension (in particular, that they contain the same number of variants). It is therefore more sound to evaluate the discordance between the true GRM (GRM_true) and the estimated GRM (GRM_sample). The impact of this discordance (arising, for example, from the imperfect tagging of causal variants^2,9) on the REML estimate of heritability is indeed a valid subject of research³.
Interestingly, this issue is related to the classic Horn’s conjecture in matrix theory (which was finally settled¹⁰) on the spectrum of the sum of two Hermitian matrices and on how the eigenvalues of two Hermitian matrices constrain the eigenvalues of their sum.

A critique of the authors’ claims

The authors evaluated the sensitivity of the likelihood function, and the resulting GREML estimate, to the GWAS data (specifically, phenotype measurement noise and population stratification). We report here crucial errors in the authors’ analyses, on which the main conclusions of the study are based. Furthermore, we highlight a methodological gap, which we address using an approach that may be of interest to future studies in population genetics and GWAS of complex traits.

(We should note that random matrix theory for the Wishart product matrix ZZ^T (or the GRM) generally assumes a Z with independent Gaussian entries, and any application in genetics must demonstrate that the relevant theoretical results apply (robustly) to a non-Gaussian matrix (e.g., one consisting of standardized genotype data). The authors appear to claim, clearly incorrectly and rather confusingly, a Wishart distribution for both Z and its symmetrization ZZ^T (e.g., see pages E62 and E68 of the authors’ paper¹). In what follows, we will assume that Z is a standardized genotype matrix (and thus non-Gaussian), and Gaussian-based results that require extension to the non-Gaussian case will be explicitly stated.)

1. Sensitivity of third term of log-likelihood to phenotype noise

The authors sought to show the instability of the induced quadratic form Q(Z,y₁), and thus of the log-likelihood, by showing its sensitivity to the phenotype measurement (i.e., to a perturbation of y₁). In their analysis, this conclusion follows from the instability of the spectral properties of Z even under a “small perturbation.” The authors used the following “equivalence” of perturbations (see equation [A10] of their Appendix A) – namely, the perturbation to the phenotype measurement and the induced perturbation of the matrix Z:

Applying the Sherman-Morrison-Woodbury identity to the third term of the log-likelihood (equation [e2]), one obtains

Thus, the sensitivity, assuming phenotype perturbation (equation [*]), depends not only on the factor with an underlying bracket (i.e., the spectral properties of Z), but also on the remaining terms. (The authors highlighted the former and, curiously, disregarded the latter.) Ignoring these remaining terms may yield invalid inferences concerning Q(Z,y₁).
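The point that the quadratic form depends on more than the spectrum of Z can be checked numerically. The sketch below assumes Q(Z,y₁) = y₁^T(α²I + σ²ZZ^T)⁻¹y₁ and uses one standard form of the Sherman-Morrison-Woodbury identity, (α²I + σ²ZZ^T)⁻¹ = (1/α²)[I − Z((α²/σ²)I + Z^TZ)⁻¹Z^T]; the function names and the Gaussian stand-in for Z are assumptions of the example, and the grouping of terms is not necessarily the one used in the displayed equation.

```python
import numpy as np

def q_direct(Z, y, sigma2, alpha2):
    """Q computed by solving with the N x N phenotypic covariance."""
    n = Z.shape[0]
    V = alpha2 * np.eye(n) + sigma2 * (Z @ Z.T)
    return float(y @ np.linalg.solve(V, y))

def q_woodbury(Z, y, sigma2, alpha2):
    """Q computed through the Sherman-Morrison-Woodbury identity (P x P solve)."""
    p = Z.shape[1]
    inner = (alpha2 / sigma2) * np.eye(p) + Z.T @ Z   # depends on Z only
    zty = Z.T @ y                                     # couples Z with the phenotype
    return float((y @ y - zty @ np.linalg.solve(inner, zty)) / alpha2)

rng = np.random.default_rng(2)
N, P = 80, 30
Z = rng.standard_normal((N, P))    # stand-in for standardized genotypes (assumption)
y = rng.standard_normal(N)
sigma2, alpha2 = 0.75 / P, 0.25

q1 = q_direct(Z, y, sigma2, alpha2)
q2 = q_woodbury(Z, y, sigma2, alpha2)
```

In this form the phenotype enters through y₁^Ty₁ and Z^Ty₁, in addition to the P×P factor determined by Z alone, so an argument about sensitivity that tracks only the latter is incomplete.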

Importantly, Q(Z,y₁) is an ℝ-valued continuous function at every (Z₀,y₀) ∊ Matrices(N,P)×ℝ^N, i.e.,

∀ε > 0 ∃δ > 0 so that |Q(Z,y) − Q(Z₀,y₀)| < ε whenever d((Z,y),(Z₀,y₀)) < δ,

where d: (Matrices(N,P)×ℝ^N) × (Matrices(N,P)×ℝ^N) → ℝ is the distance function defined by:

$$d\big((Z,y),(Z_0,y_0)\big) = \sqrt{\lVert Z - Z_0\rVert^2 + \lVert y - y_0\rVert^2}$$

Here, for M = [m_ij] ∊ Matrices(N,P), ||M||² = Σ_ij m_ij². The distance function endows the set Matrices(N,P)×ℝ^N with the topology of a Euclidean space (homeomorphic to ℝ^(NP+N)) on which Q(Z,y), consisting of sums and products of continuous functions, is continuous. Similarly, the proper subset

{(Z, y) | Z a standardized genotype matrix and y a phenotype vector} ⊂ Matrices(N,P)×ℝ^N, which is embeddable into ℝ^(NP+N) via the canonical inclusion, gets an induced subspace topology on which Q(Z,y) is continuous.

Given a fixed matrix Z, we ask how a perturbation in y₁ changes Q(Z,y₁). The rate of change in Q with respect to (the vector) y₁ is given by the gradient:

$$\nabla_{y_1} Q = (M + M^T)\,y_1, \qquad M := \left(\alpha^2 I + \sigma^2 ZZ^T\right)^{-1}$$

This simplifies to the following expression (by symmetry of M):

$$\nabla_{y_1} Q = 2My_1$$

which allows us to quantify the rate of change as a function of (the perturbed) y₁. Because S(Z) does not depend on y₁, this also gives the rate of change of the entire log-likelihood with respect to the phenotype vector (up to a constant factor). Furthermore, the expression for ∇Q shows that Q ∈ C¹, i.e., it is actually continuously differentiable as a function of y₁. involves a second-degree polynomial in A and is therefore continuous as a function of the GRM.

Finally, the second derivative (Hessian) matrix, which carries information about the local curvature, does not vary with the phenotype:

$$\nabla^2_{y_1} Q = 2M$$

implying that higher-order derivatives do not vary with phenotype (indeed, they vanish). This “curvature” matrix, along with the gradient, allows us to write a perturbation expansion, i.e., the local Taylor series expression, for the value of the log-likelihood at a perturbed value y₁ + Δy₁:

$$Q(Z, y_1 + \Delta y_1) = Q(Z, y_1) + 2\,y_1^T M\,\Delta y_1 + \Delta y_1^T M\,\Delta y_1$$

demonstrating the stability of the log-likelihood under phenotype measurement noise. Having ruled out phenotype perturbation as the source of the claimed instability in the likelihood function, we proceeded to characterize the dynamical properties of the likelihood under a perturbation in the genetic relatedness (see next section).
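The gradient, the Hessian and the exactness of the second-order expansion can be verified numerically. The sketch below assumes Q(y) = y^TMy with M = (α²I + σ²ZZ^T)⁻¹, so that the gradient is 2My and the Hessian is 2M; the function names, the dimensions and the Gaussian stand-in for Z are illustrative assumptions.

```python
import numpy as np

def inverse_covariance(Z, sigma2, alpha2):
    """M = (alpha2 * I + sigma2 * Z Z^T)^{-1}, a symmetric positive-definite matrix."""
    n = Z.shape[0]
    return np.linalg.inv(alpha2 * np.eye(n) + sigma2 * (Z @ Z.T))

def Q(y, M):
    return float(y @ M @ y)

def grad_Q(y, M):
    return 2.0 * (M @ y)          # (M + M^T) y with M symmetric

def hessian_Q(M):
    return 2.0 * M                # does not depend on y

rng = np.random.default_rng(3)
N, P = 40, 100
Z = rng.standard_normal((N, P))   # stand-in for standardized genotypes (assumption)
M = inverse_covariance(Z, sigma2=0.75 / P, alpha2=0.25)
y = rng.standard_normal(N)
dy = 0.1 * rng.standard_normal(N)  # phenotype perturbation (measurement noise)

exact = Q(y + dy, M)
taylor = Q(y, M) + grad_Q(y, M) @ dy + 0.5 * dy @ hessian_Q(M) @ dy

# Central finite differences as an independent check of the gradient
h = 1e-4
fd_grad = np.array([(Q(y + h * e, M) - Q(y - h * e, M)) / (2 * h) for e in np.eye(N)])
```

Since Q is quadratic in the phenotype, the second-order expansion reproduces Q(y + Δy) up to rounding error for any size of Δy, not only for small perturbations.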

Consistent with (a) the continuity of the function Q in Z and y₁, (b) the linear rate of change in Q with respect to y₁ and (c) the well-behaved local Taylor series expansion, simulations we performed confirm the stability of the GREML estimate (Figure 1B). We note that, in fact, both terms (Q(Z,y₁) and S(Z)) of the log-likelihood are continuous functions at every (Z₀,y₀) ∊ Matrices(N,P)×ℝ^N.

The authors’ figure 5, which was intended to show the variation in the GREML estimates from random sampling from repeated measures of a phenotype, is not unexpected and, furthermore, does not empirically support the flawed theoretical argument about the instability of the log-likelihood.

2. Stability of second term of log-likelihood in stratified population

Here we are interested in describing the dynamics of the likelihood function with respect to the spectral properties of the GRM in the general context (i.e., not merely when the GRM reflects population stratification). But first we consider a particular structure of genetic relatedness to evaluate the authors’ claims concerning the instability of the singularity index S(Z) under population stratification. Using the singular value decomposition (SVD) of Z (Z = UWV^T, with W the diagonal matrix of singular values) and applying the matrix determinant lemma, one obtains the following decomposition:

The last term of equation [**] can be written in terms of the singular values w_i of Z. (Here we suppose that the singular values are ordered in magnitude from largest to smallest.) From this, the authors concluded (incorrectly, as we will see) that in a stratified population (for which, it is claimed, thousands of the w_i are close to 0), this expression for the last term of [**] (and thus the entire expression itself) is sensitive to small changes in the values of the w_i. However, one cannot show the instability of the singularity index without also considering the rest of the terms in equation [**]. Indeed, equation [**] can be rewritten as follows:

$$S(Z) = N\log\alpha^2 + \sum_i \log\left(1 + \frac{\sigma^2 w_i^2}{\alpha^2}\right) \qquad \text{[e4]}$$

For singular values w_i of Z that are close to 0, log(1 + σ²w_i²/α²) ≈ σ²w_i²/α² (based on the Taylor series expansion). Thus, the sampling variability for near-zero singular values (from the expression for S(Z); see equation [e4]) does not arise from the terms highlighted by the authors, but from the terms σ²w_i²/α². Such near-zero singular values should add little to the singularity index, and closely-packed singular values (i.e., those for which w_i ≈ w, for some constant w) should affect S(Z) nearly similarly; thus the claim that near-zero singular values lead to unreliable estimates of the variance explained by all SNPs (= Pσ²) remains unfounded. In contrast, very large eigenvalues (such as those reflecting non-random population structure) affect the stability of the index. The rate of change of the index with respect to w_i, namely ρ_i ≔ ∂S/∂w_i, is given by the following expression:

$$\rho_i = \frac{\partial S}{\partial w_i} = \frac{2\sigma^2 w_i}{\alpha^2 + \sigma^2 w_i^2}$$

The rate-vector ρ is highly informative about the sampling behavior of the index at extreme singular values. Note that as w_i → ∞, ρ_i is approximately 2/w_i. Thus, the marginal effect of increasing singular value on the index decays at infinity in a manner inversely proportional to the magnitude of the singular value. Clearly, lim_{w_i→0} ρ_i = 0, which implies that the rate of change becomes almost negligible for singular values near 0.
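The behavior of the index at extreme singular values can be illustrated with a short numerical sketch. It assumes S(Z) = log det(α²I + σ²ZZ^T), which in terms of the singular values w_i of Z equals N log α² + Σ_i log(1 + σ²w_i²/α²), and the derivative ρ_i = 2σ²w_i/(α² + σ²w_i²) obtained by differentiating that sum. Function names, dimensions and the Gaussian stand-in for Z are assumptions of the example.

```python
import numpy as np

def singularity_index(Z, sigma2, alpha2):
    """S(Z) computed directly from the N x N phenotypic covariance."""
    n = Z.shape[0]
    _, logdet = np.linalg.slogdet(alpha2 * np.eye(n) + sigma2 * (Z @ Z.T))
    return logdet

def singularity_index_from_w(w, n, sigma2, alpha2):
    """S(Z) written in terms of the singular values w of Z."""
    return n * np.log(alpha2) + np.sum(np.log1p(sigma2 * w ** 2 / alpha2))

def rate_vector(w, sigma2, alpha2):
    """rho_i = dS/dw_i = 2 * sigma2 * w_i / (alpha2 + sigma2 * w_i^2)."""
    return 2.0 * sigma2 * w / (alpha2 + sigma2 * w ** 2)

rng = np.random.default_rng(4)
N, P = 40, 100
Z = rng.standard_normal((N, P))            # stand-in for standardized genotypes (assumption)
sigma2, alpha2 = 0.75 / P, 0.25
w = np.linalg.svd(Z, compute_uv=False)     # min(N, P) singular values, largest first

S_direct = singularity_index(Z, sigma2, alpha2)
S_spectral = singularity_index_from_w(w, N, sigma2, alpha2)

# Rate of change at near-zero, moderate and very large singular values
w_probe = np.array([0.0, 1e-3, 1.0, 1e3, 1e6])
rho = rate_vector(w_probe, sigma2, alpha2)
```

The rate vanishes at w_i = 0 and approaches 2/w_i for large w_i, so neither extreme produces an unbounded sensitivity of the index to a change in a single singular value.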

As we have already noted, the singularity index is also a continuous function at each (Z₀,y₀) ∊ Matrices(N, P)×ℝ^N and, by projection to the first coordinate, a continuous function of the matrix Z. A natural question is how a perturbation (error) in the genotype matrix determines its spectral properties. The classical Weyl’s inequality^11,12 implies that, given Z_sample = Z_true + F with singular values w_i,sample and w_i,true, respectively, |w_i,sample − w_i,true| ≤ ||F||₂ for each i, i.e., the size of the perturbation bounds the size of the resulting perturbations in singular values. A corollary of Weyl’s inequality is that the following functions: w_i: Matrices(N, P) → ℝ, Z ↦ w_i(Z) (for 1 ≤ i ≤ P) are continuous. Thus, the authors’ working premise that “a small perturbation of Z causes a large change in its spectral properties” (see page E67), notwithstanding the serious limitation (described above) inherent in the use of a perturbation in Z for this type of stability analysis, appears to conflict with this fundamental result.
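Weyl’s bound on singular values is easy to check numerically. The following sketch perturbs a matrix additively, Z_sample = Z_true + F, and compares the largest shift among the ordered singular values with the spectral norm ||F||₂. The variable names, the dimensions, the Gaussian stand-in for Z and the noise scale are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(5)
N, P = 50, 120
Z_true = rng.standard_normal((N, P))        # stand-in for the true genotype matrix (assumption)
F = 0.05 * rng.standard_normal((N, P))      # additive error matrix
Z_sample = Z_true + F

# Singular values are returned in descending order, so they pair up index by index
w_true = np.linalg.svd(Z_true, compute_uv=False)
w_sample = np.linalg.svd(Z_sample, compute_uv=False)

spectral_norm_F = np.linalg.norm(F, 2)      # largest singular value of F
max_shift = np.max(np.abs(w_sample - w_true))
bound_holds = bool(max_shift <= spectral_norm_F + 1e-12)
```

The bound is uniform over i: no singular value, whether near zero or very large, can move by more than ||F||₂ under the perturbation.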

If a small perturbation in Z implies a correspondingly small perturbation in the singular values, what can be deduced about the dynamical properties of the likelihood assuming large errors in Z? It is important to note that when the perturbation (i.e., ||F||₂) is large, Weyl’s inequality, as stated (or, indeed, as in the version stated in the reference cited by the authors¹²), sets a correspondingly large bound on |w_i,sample − w_i,true|. The large upper bound does not of course mean a large change in the spectral properties, but it does imply less discrimination in our ability to distinguish between the (paired) singular values. Thus, additional machinery would be required to make accurate inferences from the singular values or the spectrum of the GRM.

3. Methodological gap

What is notably missing from the authors’ analysis, given its use of the eigenvalues of the GRM (from the SVD) to evaluate the stability of the GREML approach, is a quantification of the degree to which the eigenvalues reflect non-random population structure versus random expectation. A large eigenvalue may well be “within null expectation,” and there is thus a need to quantify its significance. (Note this is different from the empirical distribution of the GRM eigenvalues as presented in the authors’ figure 1, which aimed to show, despite the small sample sizes considered, concordance of the data with the asymptotic behavior of eigenvalues from the Marchenko-Pastur theory.) Consideration of the null is also missing from the authors’ appropriation of the notion of an “ill-conditioned” matrix Z, which is defined in terms of the condition number κ(Z) (the ratio of the largest to the smallest singular value of Z), as an approach for investigating the effect on GREML estimates. In addition to these key methodological gaps, it is important to note that κ is a property of the matrix Z rather than of the GREML method. Indeed, a very large κ would also affect effect size estimation in simple linear regression that jointly fits multiple SNPs as fixed effects; a very large κ would imply that even a small change in y could have a destabilizing impact on the estimated SNP effect sizes and that matrix inversion would be unstable with finite-precision numbers.

The distribution of the largest eigenvalue of the Wishart matrix ZZ^T formed from a matrix Z with independent Gaussian entries is known¹³. For large values of N and P, if λ denotes the largest eigenvalue, then (λ − μ(N, P))/σ(N, P) assumes the Tracy-Widom distribution¹⁴; here both the centering constant μ(N, P) and the scaling constant σ(N, P) depend only on N and P. If the following assumptions are met for the symmetrization ZZ^T = [S_ij] in the GREML model (where now Z is the standardized genotype matrix with non-Gaussian entries):

the (independent real random) entries have mean 0 and variance 1
all moments of these random variables are finite
E(S_ij)^(2m) ≤ (Cm)^m for some constant C (i.e., the distributions of the entries decay at least as fast as a Gaussian distribution)

Soshnikov’s extension theorem¹⁵ implies that the ratio (λ − μ)/σ, for some centering and scaling constants μ and σ that depend only on N and P, converges in distribution to the Tracy-Widom distribution, just as in the Wishart case. The ratio thus provides a way to assess the significance of the largest eigenvalue of a GRM and to quantify the presence of non-random population structure in the genotype data¹⁶. (For example, using the Framingham dataset presented in the authors’ figure 3, one concludes that the dataset shows extreme population stratification, p < 2.2×10⁻¹⁶.) Exact expressions for the density and the moments of the distribution of the smallest eigenvalue (in terms of polynomials, exponentials and hypergeometric functions) for a matrix with independent Gaussian entries have been derived, and, interestingly, the form of this distribution depends on whether P − N is odd or even¹⁷. Additionally, the work of Edelman provides a closed form for the distribution of the condition number κ. Indeed, for Z with independent standard-Gaussian entries and large N¹⁷, an asymptotic distribution for κ(Z) is available.
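A minimal sketch of the centered and scaled largest-eigenvalue statistic follows. The constants are taken in a commonly quoted form for the Gaussian case, μ = (√(N−1) + √P)² and σ = (√(N−1) + √P)(1/√(N−1) + 1/√P)^(1/3); treat this choice as an assumption of the sketch and check it against the cited reference before use. The function name, the Gaussian null matrix and the two-group mean shift used to mimic stratification are likewise hypothetical. No p-value is computed, because NumPy does not provide the Tracy-Widom distribution.

```python
import numpy as np

def largest_eigenvalue_statistic(X):
    """(lambda_max - mu) / sigma for the largest eigenvalue of X X^T, X being N x P."""
    n, p = X.shape
    lam = np.linalg.svd(X, compute_uv=False)[0] ** 2          # largest eigenvalue of X X^T
    root_sum = np.sqrt(n - 1) + np.sqrt(p)
    mu = root_sum ** 2                                          # centering constant (assumed form)
    sigma = root_sum * (1 / np.sqrt(n - 1) + 1 / np.sqrt(p)) ** (1 / 3)  # scaling (assumed form)
    return (lam - mu) / sigma

rng = np.random.default_rng(6)
N, P = 200, 400
X_null = rng.standard_normal((N, P))                 # no structure: independent entries

# Hypothetical stratification: two equal groups with opposite mean shifts along one direction
group = np.where(np.arange(N) < N // 2, 1.0, -1.0)
direction = rng.standard_normal(P)
X_structured = X_null + 0.5 * np.outer(group, direction)

stat_null = largest_eigenvalue_statistic(X_null)
stat_structured = largest_eigenvalue_statistic(X_structured)
```

Under the null the statistic stays within a few units of zero, whereas the structured matrix yields a value far in the upper tail; this is the comparison against null expectation that a raw “largest singular value” does not provide.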

The claims made by the authors concerning the stability of the GREML estimates, such as those based on the skew in singular values (e.g., the “Largest Singular Value” of Figure 3 and the discussion thereof in the text), are, as currently presented, statistically problematic without consideration of what is expected under the null distribution, which we characterized here.

Conclusions

We investigated the dynamics of the GREML model to evaluate the dependence of the heritability estimate on phenotype perturbation and on the spectral properties of the genetic relatedness matrix. We explored the properties of the singularity index and the induced quadratic form as functions of the GRM and the phenotype. Furthermore, we derived an explicit expression for the rate of change (as well as all higher-order ones) in the log-likelihood with respect to the phenotype vector. Having ruled out phenotype perturbation as a cause of the claimed instability, we then explored the dynamical properties of the likelihood function under perturbations in the spectral properties of the GRM. In particular, we examined the sensitivity to outlier singular values, demonstrating that the authors’ claims regarding the impact of sampling variability for near-zero singular values were based on an analytic error (and assumed an incorrect view of the structure of genetic relatedness under population stratification). (It should be noted that the observation that population structure, which may be reflected in the largest eigenvalues of the GRM, may confound heritability estimation, and must thus be adjusted for, has been repeatedly discussed and investigated^18,19.) Finally, we investigated a methodological gap in the authors’ study and highlighted an approach to address it, which may be of broad interest to methods development in population genetics and genome-wide association analysis.

Author contributions

E.R.G. designed the study, performed the research and wrote the paper. D.S.P. performed the research. Both authors reviewed and approved the final manuscript.

Acknowledgments

E.R.G. acknowledges support from R01 MH101820, R01 MH090937 and R01 CA157823.

References

Krishna Kumar, S., Feldman, M. W., Rehkopf, D. H. & Tuljapurkar, S. Limitations of GCTA as a solution to the missing heritability problem. Proc Natl Acad Sci U S A 113, E61–70, doi:10.1073/pnas.1520109113 (2016).
Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet 42, 565–569, doi:10.1038/ng.608 (2010).
Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet 88, 76–82, doi:10.1016/j.ajhg.2010.11.011 (2011).
Lee, S. H. et al. Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs. Nat Genet 44, 247–250, doi:10.1038/ng.1108 (2012).
Davis, L. K. et al. Partitioning the heritability of Tourette syndrome and obsessive compulsive disorder reveals differences in genetic architecture. PLoS Genet 9, e1003864, doi:10.1371/journal.pgen.1003864 (2013).
Gamazon ER, I. H., Liu C, Members of the Bipolar Disorder Genome Study (BiGS) Consortium, Nicolae DL, Cox NJ. The Convergence of eQTL Mapping, Heritability Estimation and Polygenic Modeling: Emerging Spectrum of Risk Variation in Bipolar Disorder. arxiv (2013).
Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat Genet 47, 1228–1235, doi:10.1038/ng.3404 (2015).
Gusev, A. et al. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am J Hum Genet 95, 535–552, doi:10.1016/j.ajhg.2014.10.004 (2014).
Speed, D., Hemani, G., Johnson, M. R. & Balding, D. J. Improved heritability estimation from genome-wide SNPs. Am J Hum Genet 91, 1011–1021, doi:10.1016/j.ajhg.2012.10.010 (2012).
Knutson, A. & Tao, T. The honeycomb model of GL(n)(C) tensor products I: Proof of the saturation conjecture. J Am Math Soc 12, 1055–1090, doi:10.1090/S0894-0347-99-00299-4 (1999).
Weyl, H. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Mathematische Annalen 71, 441–479 (1912).
Stewart, G. W. Perturbation theory for the singular value decomposition. Institute for Advanced Computer Studies, University of Maryland, College Park, MD (1998).
Johnstone, I. M. On the distribution of the largest eigenvalue in principal components analysis. Ann Stat 29, 295–327, doi:10.1214/aos/1009210544 (2001).
Tracy, C. A. & Widom, H. Level-Spacing Distributions and the Airy Kernel. Commun Math Phys 159, 151–174, doi:10.1007/BF02100489 (1994).
Soshnikov, A. A note on universality of the distribution of the largest eigenvalues in certain sample covariance matrices. J Stat Phys 108, 1033–1056, doi:10.1023/A:1019739414239 (2002).
Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet 2, e190, doi:10.1371/journal.pgen.0020190 (2006).
Edelman, A. Eigenvalues and Condition Numbers of Random Matrices. SIAM J Matrix Anal A 9, 543–560, doi:10.1137/0609045 (1988).
Zaitlen, N. et al. Using extended genealogy to estimate components of heritability for 23 quantitative and dichotomous traits. PLoS Genet 9, e1003520, doi:10.1371/journal.pgen.1003520 (2013).
Browning, S. R. & Browning, B. L. Population structure can inflate SNP-based heritability estimates. Am J Hum Genet 89, 191–193; author reply 193-195, doi:10.1016/j.ajhg.2011.05.025 (2011).