SNP heritability: What are we estimating?

The SNP heritability has become a central concept in the study of complex traits. Estimation of based on genomic variance components in a linear mixed model using restricted maximum likelihood has been widely adopted as the method of choice were individual level data are available. Empirical results have suggested that this approach is not robust if the population of interest departs from the assumed statistical model. Prolonged debate of the appropriate model choice has yielded a number of approaches to account for frequency- and linkage disequilibrium dependent genetic architectures. Here we analytically resolve the question of how these estimates relate to of the population from which samples are drawn. In particular, we show that the correct model for the purpose of inference about does not require knowledge of the true genetic architecture of a trait. More generally, our results provide a complete perspective of these class of estimators of , highlighting practical shortcomings of current practise. We illustrate our theoretical results using simulations and data from UK Biobank.

The SNP heritability is defined as the fraction of the phenotypic variance explained by additive effects of a given set of genetic variants 1,2 . It forms a bound on the ability to predict a phenotype using linear models of the chosen variants and has been important in the debate about the so-called missing heritability [3][4][5] . It has also been used to draw conclusions about the genetic architecture of phenotypes by contrasting the heritabilities of different categories of genetic variants 6 . Variance components, based on genomic relationship matrices, fitted using restricted maximum likelihood estimation, the so called G-REML method 3 , have been proposed, implemented in various tools [7][8][9][10] , and widely adopted as an approach to estimate Based on empirical observations, it has been suggested that G-REML estimates are biased under departure of the population from the model 4,8,11,12 . This has led to a number of variations on the G-REML approach based on various assumptions about the genetic architecture of the phenotype 8,[11][12][13] . These different models yield different estimates 14,15 .
However, the correct choice of model remains unclear, as does the question what aspects of the population need to be incorporated into the model. We show that these questions can be resolved analytically, by directly relating the estimates of model parameters to parameters of the population. Our results highlight the central role of the linkage disequilibrium (LD) structure of the population, in particular between variants with non-zero additive effect. We show that incorporating this structure, which can be estimated from data, into the genomic relationship matrix (GRM) leads to an statistically consistent estimator of ݄ ௦ ଶ . Conversely, the choice of alternative GRMs leads to a bias. This bias depends on the departure of the GRM's assumptions from the LD structure of the population. It furthermore changes depending on the relative number of individuals and genetic variants in the analysis. The G-REML model with the standard GRM corresponds to the assumption of perfect linkage equilibrium amongst all modeled genetic variants. In practice populations are expected to depart from this assumption. On the one hand, natural processes like assortative mating or selection act directly on the LD structure in the population 16 . On the other hand, sampling strategies like, for example, case-control sampling will induce LD between causal variants in the sample 17 . Finally, we show that these effects are relevant even if causal variants are in linkage equilibrium. Typically, the set of modeled variants will not include all causal variants.
We show, that this is sufficient to necessitate accounting for the LD structure of the modeled variants in order to avoid complex biases.

Relating G-REML estimates to population parameters
The G-REML estimate of ݄ ௦ ଶ , is given by In order to disentangle these two effects, we increase the number of variants by adding genotypes permuted amongst individuals, which by their very nature should not capture genetic variance (Methods). As predicted, the estimates of heritability for a fixed sample size increase when the number of genetic variants modeled is doubled or quadrupled (Fig. 2b).
These increases are consistent with our expectations, as becomes apparent when the results are plotted as a function of the ratio ܰ ‫ܯ‬ ⁄

Consequences of the nonrandom distribution of causal variants with respect to LD
Understanding the relationship of ߪ ො ଶ and the actual population parameters allows us to formally address questions which previously could only be considered using simulations. This is important as simulations by necessity only consider a finite set of conditions. This may lead to false conclusions if the set of considered conditions is too narrow. To illustrate this point, we turn to the question of the consequences of biases in the distribution of causal genetic variants. It has been observed that causal variants are not expected to consist of a random sample from all genetic variants 20 . Furthermore, the nonrandom distribution of causal variants with respect to LD in particular has been repeatedly suggested as a cause of bias in G-REML 4,8,12 . This has given rise to a number of variants on the original method, in particular LDAK 8 and GREML-LDMS 4 , aimed at correcting this bias.
Here, we re-evaluate these claims through the perspective of (1).  ߚ 's, and may be expected to also be in LD with each other.
We illustrate these points in the context of a previously proposed simulation study 8 Fig. 3). This means, that we do not need to resort to processes like assortative mating, that induce LD between causal variants. Even in the absence of any   Fig. 4b).

Discussion
We   (Fig. 3c). In contrast the corresponding SNP heritability in a population which is in complete linkage equilibrium is zero, due to the lack of LD between modeled and causal variants.
As we demonstrate no complex processes acting on LD like, for example, assortative mating or selection, are necessary to give rise to complex behavior of the estimator. It is sufficient for the causal variants to not be included in the model, and most analyses may be expected to fall within this setting. Our simulations suggest that the consequences are particularly severe if causal variants are weakly tagged, which again constitutes the expected norm rather than exception. In general we may expect that the overall contribution from LD is the result of the confluence of several independent processes. Some of these processes may depend on the specific set of model variants chosen, as is for example the case for the tagging structure.
We only present an overview of some of the implications of our theoretical results. However, denotes the expectation w.r.t.

‫ݔ‬
. It is worth emphasising that the formulation of is the covariance matrix of the genetic variants in the population of interest.

Asymptotic equivalence of single variance component and stratified models
The general behavior of models with multiple genomic variance components is more

UK Biobank Data
All simulations and primary data analyses were performed using data from the UK Biobank 19 , in particular the same set of genotype data was used throughout. These were genotypes of 343,884 unrelated (Kinship Coefficient < 0.0442) White-British individuals. We only considered bi-allelic autosomal variants which were assayed by both genotyping arrays employed by UK Biobank, passed UK Biobank quality control procedures and, in the unrelated White-British sample, had a minor allele frequency >1% and did not depart from Hardy-Weinberg equilibrium (P < 10 -50 ). The unrelated White-British subset of individuals was obtained by excluding individuals who were identified by the UK Biobank as outliers based on either genotyping missingness rate or heterogeneity, or whose sex inferred from the genotypes did not match their self-reported sex. We then identified a subset of individuals such that for no two individuals the Kinship Coefficient was larger than 0.0442.
The White-British subset of these was obtained by retaining all individuals for whom the projection on the 20 leading genomic principal components was within three standard deviations from the mean of all individuals who self identified as White-British. Finally, we 13 removed individuals with a genotype call missing-ness rate >5% across variants which passed our quality control procedure.

Computation of Genomic Relationship Matrices
We make use of three types of GRMs, referred to as the standard, LDAK, and LD structured , that is the empirical LD matrix. As in the case of LDAK, we computed this matrix only once for each required set of variants using all available individuals.

Simulations
We implemented two simulation studies to illustrate aspects of the analytical results. All simulations utilized the genotype data from the UK Biobank as described above. However, we only used a restricted set of genetic variants, specifically only those on chromosome 18, in order to be able to achieve a wider range of ratios between numbers of individuals and genetic variants in the models.
All models were fitted using either GCTA, for smaller sample sizes and non-standard GRMs, or BOLT-REML, in the case of sample sizes larger than 100,000, tools.

Consequences of non-zero contributions from LD to genetic variance
We aimed to highlight the behavior of estimates of  It is worth emphasizing, that we specifically chose to retain the causal variants in the model and did not simulate effects sizes following a reasonable prior distribution. We retained the causal variants so as to exclude effects of tagging of these by the variants retained in the model. We did not simulate the effects sizes from any of the linear mixed model prior distributions to illustrate the point that the presented results hold without assumptions on the true generating distribution for the population.

Consequences of biased distribution of causual variants
We implemented a simulation study following the design of Speed et al. 8

Application to Height in UK Biobank
We estimated heritability using the standard G-REML approach for height measured in UK Biobank participants. We included Sex, Age and the leading five genomic principle components, computed on the whole UK Biobank cohort, as covariates in the analysis.

Effect of changes in sample size
We . We again divided the available 343,884 individuals into the maximal number of non-overlapping subsets for each sample size and fitted models for each subset.

Variance captured by rarer variants
We fitted models using either all available genetic variants with either MAF > 5%, In all cases the results follow the theoretical predictions.