## Abstract

The proportion of phenotypic variance attributable to the additive effects of a given set of genotyped SNPs (i.e. SNP-heritability) is a fundamental quantity in the study of complex traits. Recent works have shown that existing methods to estimate genome-wide SNP-heritability often yield biases when their assumptions are violated. While various approaches have been proposed to account for frequency- and LD-dependent genetic architectures, it remains unclear which estimates of SNP-heritability reported in the literature are reliable. Here we show that genome-wide SNP-heritability can be accurately estimated from biobank-scale data irrespective of the underlying genetic architecture of the trait, without specifying a heritability model or partitioning SNPs by minor allele frequency and/or LD. We use theoretical justifications coupled with extensive simulations starting from real genotypes from the UK Biobank (N=337K) to show that, unlike existing methods, our closed-form estimator for SNP-heritability is highly accurate across a wide range of architectures. We provide estimates of SNP-heritability for 22 complex traits and diseases in the UK Biobank and show that, consistent with our results in simulations, existing biobank-scale methods yield estimates up to 30% different from our theoretically-justified approach.

## Introduction

SNP-heritability, the proportion of phenotypic variance attributable to the additive effects of a given set of SNPs, is a fundamental quantity in the study of complex traits [1]; it provides an upper bound on risk prediction from a linear model relating genotypes to phenotype [2] and, when defined as a function of all SNPs on a genotyping array, yields insights into the “missing heritability” of complex traits [3–5]. Traditionally, SNP-heritability is estimated by fitting variance components models with REML [3,6–9]. With some notable exceptions [8], REML-based methods are typically not scalable to biobanks that assay hundreds of thousands of individuals (e.g., UK Biobank contains genotype measurements for more than half a million individuals [10]). SNP-heritability can also be estimated from summary-level GWAS data by assessing the deviation in marginal association statistics as a function of the LD score of each SNP [11–14], thus making SNP-heritability estimation scalable to hundreds of thousands or even millions of individuals. More recently, a randomized extension of Haseman-Elston (HE) regression [15] was shown to estimate a single genetic variance component from individual-level data as accurately as REML methods but in a fraction of the run-time [16].

To facilitate inference, all existing methods for genome-wide SNP-heritability inference make various assumptions on the underlying genetic architecture of the trait, which is typically parametrized by *polygenicity* (the number of variants with effect sizes larger than some small quantifiable constant *δ*) and *MAF/LD-dependence* (the coupling of effect sizes with minor allele frequency (MAF), local linkage disequilibrium (LD), or other functional genomic annotations such as regions of open chromatin) [17]. Since the true genetic architecture of any given trait is unknown, existing methods are susceptible to bias and often yield vastly different estimates of SNP-heritability for the same traits, even when applied to the same data [9,14,18]. Although multi-component methods that stratify SNPs by MAF and LD can ameliorate some of the robustness issues of single-component methods [7,18,19], fitting multiple variance components to biobank-scale data with REML is highly resource-intensive [8] and it is currently unclear whether stratifying by MAF/LD produces accurate estimates of total SNP-heritability for methods based on summary statistics. Alternate methods explicitly model MAF- and LD-dependent architectures when estimating SNP-heritability [6,9,14]; however, these approaches can produce drastically different estimates when their assumptions are violated [6,9,14,18,19]. In addition, genetic architecture is unlikely to be the same across traits or populations due to, for example, variable degrees of negative selection acting on different traits in different populations [17,20–25]. Methods that jointly infer SNP-heritability and other parameters such as the strength of negative selection or polygenicity have been proposed [14,23,26] but are computationally intensive and/or sensitive to LD-dependent architectures. Thus, it remains unclear which estimates of genome-wide SNP-heritability computed from biobank-scale data (e.g., UK Biobank [10]) are reliable.

In this work, we investigate whether genome-wide SNP-heritability can be accurately estimated under a generalized random effects (GRE) model that makes minimal assumptions on the genetic architecture of complex traits. Under this model, every causal effect can have an arbitrary SNP-specific variance, and SNP-heritability is defined as the sum of the SNP-specific variances (Methods). To the best of our knowledge, all existing methods make additional assumptions on top of the GRE model (Table 1). For example, the infinitesimal model assumed by single-component GREML [3] (and several other methods [8,16]) imposes an inverse relationship between MAF and effect size by assuming that every standardized effect size explains an equal portion of total SNP-heritability, whereas the single-component LDAK model assumes that each SNP-specific variance is inversely proportional to both MAF and the LD neighborhood of the SNP [6,9]. We derive a closed-form estimator for SNP-heritability under the GRE model as a function of GWAS marginal association statistics and in-sample LD and show that this estimator is consistent (i.e. the estimator approaches the true SNP-heritability as sample size increases) and unbiased (i.e. the expectation of the estimator is equal to the true SNP-heritability) when the number of individuals is larger than the number of SNPs. Most importantly, the accuracy of this estimator does not depend on the underlying genetic architecture of the trait. The GRE estimator has the same analytical form as previously proposed “fixed effect estimators” [27,28] that aim to estimate the variance explained by a set of fixed (non-random) SNP effects at a particular genomic region – a different quantity from the estimand of interest here (i.e. SNP-heritability defined as the sum of per-SNP variances; see Methods).

Through theoretical derivations and extensive simulations across a wide range of MAF- and LD-dependent architectures starting from real genotypes from the UK Biobank [10] (337K individuals and 593K SNPs), we find that the GRE estimator provides nearly unbiased estimates of SNP-heritability across all architectures whereas existing methods are sensitive to model misspecification. For example, across 126 distinct architectures, the maximum bias we observe with the GRE estimator is 2% of the simulated SNP-heritability whereas state-of-the-art methods such as stratified LD score regression (S-LDSC) [12,13] and SumHer [14] yield biases between −84% and 28% of the simulated SNP-heritability. For completeness, we also contrast the GRE estimator with several REML-based methods in simulations at lower sample sizes (due to the computational burden of most REML methods) and find that, consistent with recent reports [18], all REML-based methods are biased when their model assumptions are violated. Across a similar set of 126 architectures, the bias of the GRE estimator ranges from −5% to 6% of the simulated SNP-heritability whereas single-component REML methods [3,6,8,9] are biased by anywhere between −44% and 18% of the simulated SNP-heritability. We confirm that multi-component REML methods that stratify SNPs by MAF and LD score (GREML-LDMS-I [18]) are more accurate than single-component REML methods if favorable SNP stratification criteria are used (i.e. if SNPs are stratified by the same MAF bins used to define the causal variant MAF spectrum). The performance of the GRE estimator, which does not stratify SNPs or assume a specific heritability model [6,9,14], is similar to that of GREML-LDMS-I with favorable stratification criteria, thereby confirming that SNP-heritability can be accurately estimated without knowledge of the underlying genetic architecture.

Finally, we use marginal association statistics and in-sample LD from *N* = 290K unrelated British individuals genotyped at *M* = 460K SNPs (MAF > 1%) to provide estimates of SNP-heritability for 22 complex traits and diseases in the UK Biobank [10]. Consistent with our simulations, across the 18 traits with SNP-heritability estimates greater than 0.05, we find that estimates from S-LDSC (controlling for the baseline-LD model [13]) and SumHer differ from the GRE estimates by a median of −9% and 11%, respectively. For example, for height, estimates from S-LDSC (0.56) and SumHer (0.63) are approximately 7% lower and 5% higher, respectively, than our estimate of 0.60. Similarly, for hypertension, estimates from S-LDSC (0.14) and SumHer (0.18) are 12.5% lower and 12.5% higher, respectively, than our estimate of 0.16. Taken together, our results demonstrate that SNP-heritability can be accurately estimated from biobank-scale data without prior knowledge of the genetic architecture of the trait, motivating the development of new methods to make inferences from biobank-scale data under fewer modeling assumptions.

## Results

### Overview of the approach

We investigate the utility of an estimator for SNP-heritability derived under a model that makes minimal assumptions on genetic architecture. We assume the standardized phenotype of an individual is a linear function of their genotypes: $y = \mathbf{x}^{T}\boldsymbol{\beta} + \epsilon$, where $\mathbf{x}$ is a vector of standardized genotypes at $M$ SNPs, $\boldsymbol{\beta}$ is a vector of standardized effect sizes corresponding to the $M$ SNPs, and $\epsilon$ is environmental noise (Methods). We assume the effects can follow any distribution as long as the effect size of every SNP $i$ is zero-centered ($\mathrm{E}[\beta_i] = 0$) with a finite SNP-specific variance $\sigma_i^2$ that is allowed to be 0, and that the covariance between the effects of any pair of SNPs is zero ($\mathrm{E}[\beta_i \beta_j] = 0$ for all $i \neq j$). We term this model the “generalized random effects” (GRE) model as, to the best of our knowledge, all existing methods to estimate SNP-heritability impose additional assumptions on top of this model. For example, setting $\sigma_i^2 = h_g^2/M$ for $i = 1, \ldots, M$ results in the single-component GREML model [3], whereas setting $\sigma_i^2 \propto w_i [f_i(1 - f_i)]^{0.75}$ (where $w_i$ is a function of the “LD score” of SNP $i$ and $f_i$ is the MAF of SNP $i$) results in the most recent LDAK model [9] (Table 1). Under the GRE model, the SNP-heritability explained by the $M$ SNPs is the sum of SNP-specific variances: $h_g^2 = \sum_{i=1}^{M} \sigma_i^2$ (Methods).
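To make the model concrete, the following minimal sketch draws one phenotype from standardized genotypes when every effect is independent, zero-centered, and has its own variance, with the variances summing to the target SNP-heritability. It is an illustration under stated assumptions, not the authors' simulation code: the function names, the Gaussian draws (the model itself permits any zero-mean distribution), and the `power` parameter are hypothetical, and LD-dependent weights are omitted.

```python
import numpy as np

def simulate_gre_phenotype(X, sigma2, rng):
    """Draw y = X @ beta + eps with independent, zero-mean effects.

    X      : (N, M) standardized genotypes (column mean 0, variance 1)
    sigma2 : (M,) SNP-specific variances; their sum is the SNP-heritability
    """
    h2g = float(np.sum(sigma2))
    if not 0.0 <= h2g < 1.0:
        raise ValueError("SNP-specific variances must sum to a value in [0, 1)")
    # A variance of 0 yields an effect of exactly 0 (a non-causal SNP).
    beta = rng.normal(0.0, np.sqrt(sigma2))
    # Environmental noise carries the remaining phenotypic variance.
    eps = rng.normal(0.0, np.sqrt(1.0 - h2g), size=X.shape[0])
    return X @ beta + eps, beta

def equal_variances(M, h2g):
    """Every SNP explains the same share of SNP-heritability."""
    return np.full(M, h2g / M)

def maf_dependent_variances(maf, h2g, power=0.75):
    """Variances proportional to [f(1 - f)]^power, rescaled to sum to h2g.

    `power` is an illustrative knob chosen for this sketch.
    """
    raw = (maf * (1.0 - maf)) ** power
    return h2g * raw / raw.sum()
```

Both helper functions return variances that sum to the requested heritability, so they differ only in how that total is distributed across SNPs, which is the sense in which they are special cases of one model.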

In this work, we are interested in accurately estimating $h_g^2$ from genotype measurements across *N* individuals at *M* typed SNPs. When *N* > *M*, the estimator $\hat{h}^2_{\mathrm{GRE}} = (N \hat{\boldsymbol{\beta}}^{T} \hat{\mathbf{V}}^{\dagger} \hat{\boldsymbol{\beta}} - q)/(N - q)$, where $\hat{\boldsymbol{\beta}}$ is the vector of standardized SNP effects estimated by ordinary least squares (OLS), $\hat{\mathbf{V}}^{\dagger}$ is the pseudoinverse of the in-sample LD matrix, and *q* is the rank of the in-sample LD matrix, is an unbiased estimator of SNP-heritability under the GRE model. That is, $\mathrm{E}[\hat{h}^2_{\mathrm{GRE}}] = h_g^2$ (Methods). The GRE model allows each SNP-specific variance to be an arbitrary finite value satisfying the constraints $\sigma_i^2 \geq 0$ and $\sum_{i=1}^{M} \sigma_i^2 = h_g^2$. Thus, $\sigma_i^2$ can capture any relationship between effect size and MAF/LD, which in turn implies that $\hat{h}^2_{\mathrm{GRE}}$ is unbiased under most genetic architectures. Unfortunately, even the largest biobank-scale datasets currently available contain fewer unrelated individuals than typed SNPs (e.g., UK Biobank has genotyped *M* ≈ 593K SNPs in *N* ≈ 337K unrelated British individuals), which limits the utility of the above estimator. We therefore extend our approach by partitioning the genome by chromosome into 22 approximately independent regions:

$$\hat{h}^2_{\mathrm{GRE}} = \sum_{k=1}^{22} \frac{N \hat{\boldsymbol{\beta}}_k^{T} \hat{\mathbf{V}}_k^{\dagger} \hat{\boldsymbol{\beta}}_k - q_k}{N - q_k},$$

where for each chromosome *k* with *p*_{k} typed SNPs, $\hat{\boldsymbol{\beta}}_k$ is the *p*_{k}-vector of standardized SNP effects estimated by OLS, $\hat{\mathbf{V}}_k^{\dagger}$ is the pseudoinverse of the in-sample LD matrix, and *q*_{k} is the rank of the in-sample LD matrix. Although this genome-wide estimator does not provide theoretical guarantees of unbiasedness, we show through extensive simulations that the magnitude of the bias is extremely small across all architectures when *N* is sufficiently larger than *p*_{k}.
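The estimator described above has the form (N · β̂ᵀV̂†β̂ − q)/(N − q) for a single block, summed over blocks for the genome-wide version. The sketch below computes it from individual-level data, assuming genotypes and phenotype have already been standardized in-sample. The function names and the eigenvalue cutoff `rcond` are assumptions made for illustration; this is not the released GRE software.

```python
import numpy as np

def gre_block(X, y, rcond=1e-10):
    """Single-block estimate (N * b' V^+ b - q) / (N - q).

    X : (N, p) genotypes standardized in-sample (column mean 0, variance 1)
    y : (N,) phenotype standardized in-sample (mean 0, variance 1)
    """
    N = X.shape[0]
    beta_hat = X.T @ y / N            # marginal OLS effects, standardized scale
    V = X.T @ X / N                   # in-sample LD matrix
    # V is symmetric PSD, so its eigendecomposition gives both the rank
    # and the pseudoinverse once near-zero eigenvalues are dropped.
    w, U = np.linalg.eigh(V)
    keep = w > rcond * w.max()
    q = int(keep.sum())               # numerical rank of the LD matrix
    proj = U[:, keep].T @ beta_hat
    quad = float(np.sum(proj ** 2 / w[keep]))   # beta_hat' V^+ beta_hat
    return (N * quad - q) / (N - q)

def gre_genome_wide(blocks, y):
    """Sum of per-block estimates, for example one block per chromosome."""
    return sum(gre_block(Xk, y) for Xk in blocks)
```

A typical call would pass one genotype matrix per chromosome, for example `gre_genome_wide([X_chr1, X_chr2], y)`, where `X_chr1` and `X_chr2` are hypothetical arrays. The subtraction of q corrects for the noise that the projection onto the genotype column space picks up, which is why each block needs N comfortably larger than its number of SNPs.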

### Accurate estimation of SNP-heritability irrespective of disease architecture

To investigate the bias and variance of $\hat{h}^2_{\mathrm{GRE}}$, we perform simulations starting from the real genotypes of *N* = 337205 unrelated British individuals in the UK Biobank [10]. First, we use data from chromosome 22 (*M* = 9654 typed SNPs) to simulate 64 distinct MAF- and LD-dependent architectures by varying the SNP-heritability ($h_g^2$), the proportion of causal variants (*p*_{causal}), the distribution of causal variant MAF (CV MAF), and the strength of coupling between effect size and MAF/LD; we use “LDAK-LD-dependent” to describe architectures where causal effects are coupled with “LDAK weights” (see Methods). To enable comparison of estimates across different values of $h_g^2$, we assess bias as a percentage of the simulated value of $h_g^2$ (relative bias) or the error of a single estimate as a percentage of $h_g^2$ (relative error). Consistent with analytical derivations, the GRE estimator restricted to chromosome 22 provides unbiased estimates across the 64 architectures after correcting for 16 independent tests at each value of $h_g^2$ (bias p-value < 0.05/16 is considered significant; see Methods) (Figure 1a,c, Supplementary Table S1). The average relative bias across the 64 architectures is 0.00015% of the simulated $h_g^2$; the largest bias observed under any single architecture is reported in Supplementary Figure S1a and Supplementary Table S1. We confirm that the analytical estimator of the standard error (Methods) is well-calibrated across all genetic architectures (Supplementary Figure S3a, Supplementary Table S4). We then investigate the bias induced by partitioning chromosome 22 into non-independent blocks and find that, as expected, our estimator accrues statistically significant upward bias as the average block size decreases (Supplementary Figure S2, Supplementary Table S2). For example, in simulations on chromosome 22 at a fixed simulated $h_g^2$ with *α* = −1 and *p*_{causal} = 1%, where causal variants were chosen randomly from all SNPs, using a single chromosome-wide LD block produces approximately unbiased estimates (bias = 6.9 × 10^{−5}, p-value = 0.55) whereas partitioning the chromosome into 2 disjoint blocks of equal size induces a small but significant upward bias (bias = 4.3 × 10^{−4}, p-value = 5.3 × 10^{−4}) (Supplementary Figure S2, Supplementary Table S2).
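The bias summaries used in this section (bias, relative bias as a percentage of the simulated SNP-heritability, and a significance call at 0.05/16) can be sketched as follows. The exact test is specified in the Methods; the two-sided z-test on the mean of the replicate estimates used here is an assumption for illustration, and `bias_summary` is a hypothetical helper name.

```python
import math
import numpy as np

def bias_summary(estimates, h2g_true, n_tests=16, alpha=0.05):
    """Summarize replicate estimates for one simulated architecture.

    Assumes a two-sided z-test of 'mean estimate == simulated value';
    this is a stand-in for the test described in the Methods.
    """
    est = np.asarray(estimates, dtype=float)
    bias = float(est.mean() - h2g_true)
    se = float(est.std(ddof=1) / math.sqrt(est.size))   # SE of the mean
    z = bias / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))        # two-sided normal p-value
    return {
        "bias": bias,
        "relative_bias_pct": 100.0 * bias / h2g_true,
        "p_value": p_value,
        "significant": p_value < alpha / n_tests,       # Bonferroni threshold
    }
```

Relative error for a single estimate follows the same pattern: `100 * (estimate - h2g_true) / h2g_true`.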

Next, we investigate the accuracy of the GRE estimator in genome-wide simulations (*N* = 337K unrelated individuals and *M* = 593K array SNPs) where we use 22 chromosome-wide LD blocks to compute $\hat{h}^2_{\mathrm{GRE}}$. Despite the 22-block approximation, we find that $\hat{h}^2_{\mathrm{GRE}}$ is highly accurate and robust across all 64 MAF- and LDAK-LD-dependent architectures (Figure 1b,c). The average bias across the 64 architectures is 0.97% × $h_g^2$, with the relative bias under any single architecture ranging from 0.07% to 2.1% of the simulated $h_g^2$ (Supplementary Figure S1b, Supplementary Table S3). The largest error observed for a single estimate across all 6400 simulations (64 genetic architectures × 100 simulation replicates) is shown in Figure 1c. As *N/M* increases, the variance of $\hat{h}^2_{\mathrm{GRE}}$ decreases while the relative bias across the 64 architectures appears to be approximately fixed, ranging between 0.91% (*N* = 100K) and 0.99% (*N* = 200K) (Figure 1d). This small but statistically significant upward bias in the genome-wide estimates is likely due to the 22-chromosome approximation, which ignores stochastic correlations between genotypes of SNPs on different chromosomes. These trends hold for a range of values of *p*_{causal} (Supplementary Table S3). Most importantly, the accuracy of the GRE estimator does not correlate with the simulated trait architecture (Figure 1b). We also assess the calibration of our analytical estimator for the standard error in the genome-wide simulations and observe a small downward bias with respect to the empirical standard deviation of estimates (Supplementary Figure S3b, Supplementary Table S5). For example, across 16 distinct architectures sharing the same simulated $h_g^2$, the empirical standard deviation computed from 100 independent estimates of $h_g^2$ ranges from 0.0049 to 0.0064, whereas our estimate of the standard error is approximately 0.0036 across all architectures (Supplementary Figure S3b, Supplementary Table S5). In all subsequent analyses, we compute $\hat{h}^2_{\mathrm{GRE}}$ with the 22 chromosome-wide LD block approximation as this provides sufficiently accurate estimates and a fair comparison to other methods.

### Comparison of methods to estimate SNP-heritability

We compare $\hat{h}^2_{\mathrm{GRE}}$ with existing state-of-the-art approaches to estimate SNP-heritability that are easily scalable to the full UK Biobank data (*N* = 337K): LD score regression with no annotations (LDSC), which assumes *α* = −1 and no coupling of effect size with LD [11]; stratified LD score regression (S-LDSC), which partitions $h_g^2$ by a set of annotations of interest [12,13]; and SumHer, a recent scalable extension of LDAK which explicitly models MAF- and LD-dependent architectures through a specific form of the SNP-specific variances [14] (Table 1). To ensure a fair comparison among the methods, LD scores for LDSC, S-LDSC, and SumHer are computed using in-sample LD among the *M* SNPs, and in all simulations we aim to estimate the SNP-heritability explained by the same set of *M* SNPs (see Methods).

We find that $\hat{h}^2_{\mathrm{GRE}}$ is highly accurate and robust across all simulated architectures while LDSC, S-LDSC, and SumHer are sensitive to deviations from their respective model assumptions. For example, in the simulations shown in Figure 2, LDSC is approximately unbiased under the “single-component GREML model” (relative bias = 0.04%, *p* = 0.86) but is sensitive to the MAF spectrum of causal variants and the degree of coupling between effect size and MAF/LD (e.g., across the 12 architectures where *p*_{causal} = 1%, relative bias ranges from −44% to 50%) (Supplementary Table S6). Similarly, SumHer is accurate under the “LDAK model” (relative bias = 5.3%) but highly sensitive to other plausible genetic architectures (when *p*_{causal} = 1%, relative bias ranges from −19% to 22%) (Figure 2, Supplementary Table S7). Estimates from S-LDSC (MAF), which partitions $h_g^2$ by 10 MAF bins (Supplementary Table S8; Methods), are less biased compared to estimates from LDSC when causal effects are coupled with only MAF, but are significantly downward biased when causal effects are also coupled with LDAK weights (at the same simulated $h_g^2$, relative bias range is [1.9%, 7.0%] when *γ* = 0 and [−58%, −37%] when *γ* = 1) (Figure 2, Supplementary Table S9). S-LDSC with 10 MAF bins and an additional continuous “level of LD” (LLD) annotation, which we denote S-LDSC (MAF+LLD) (Methods), produces similar results on the same architectures (relative bias range is [1.8%, 6.5%] when *γ* = 0 and [−80%, −33%] when *γ* = 1) (Supplementary Table S10). In contrast, the relative bias of $\hat{h}^2_{\mathrm{GRE}}$ ranges from 0.45% to 1.3% across the same 16 genetic architectures (same simulated $h_g^2$, *p*_{causal} = 1%) (Figure 2, Supplementary Table S3). These trends hold for a range of values of $h_g^2$ and *p*_{causal}: across 112 distinct LDAK-LD- and/or MAF-dependent architectures, the average and range of the relative bias of each method are 0.96% [−0.06%, 2.1%] for $\hat{h}^2_{\mathrm{GRE}}$, −2.2% [−71%, 70%] for LDSC, −22% [−62%, 8.7%] for S-LDSC (MAF), −29% [−89%, 9.0%] for S-LDSC (MAF+LLD), and 2.8% [−27%, 28%] for SumHer (Figure 1b, Figure 2, Supplementary Figures S4-S7, Supplementary Tables S3, S6-S7, S9-S10). We also perform simulations under 14 alternative LD-dependent architectures where the variance of each SNP is coupled with its inverse LD score instead of its LDAK weight (i.e. “LD-score-dependent” architectures; see Methods, Supplementary Figure S8) and find that $\hat{h}^2_{\mathrm{GRE}}$ remains nearly unbiased (relative bias ranges from 0.52% to 1.3%) whereas estimates from S-LDSC (MAF), S-LDSC (MAF+LLD), and SumHer are downward-biased on average across the 14 architectures (Supplementary Figure S9, Supplementary Table S11).

For completeness, we also compare $\hat{h}^2_{\mathrm{GRE}}$ to four widely used REML-based methods: single-component GREML (GREML), which assumes *α* = −1 and no coupling of effect size with LD [3]; GREML-LDMS-I, a multi-component extension of GREML that partitions SNPs by MAF and LD score [18]; BOLT-REML, a computationally efficient variance components estimation method with assumptions similar to those of GREML [8]; and LDAK, which assumes a specific form of the coupling of effect size with LD and recommends setting *α* = −0.25 [6,9] (Table 1). Because it is computationally intractable to apply the REML-based methods to thousands of genome-wide simulations with 337K individuals, we perform simulations using a reduced number of individuals and SNPs (*N* = 8430 individuals and *M* = 14821 array SNPs; see Methods). We find that the single-component REML methods (GREML, BOLT-REML, and LDAK) are sensitive to MAF- and LD-dependent architectures that deviate from their respective model assumptions, whereas our estimator is robust to all architecture parameters. For example, in the simulations shown in Figure 3, GREML and BOLT-REML are accurate under the “single-component GREML model” (GREML: relative bias = −1.4%, *p* = 6.0 × 10^{−3}, Supplementary Table S12; BOLT-REML: relative bias = −0.16%, *p* = 0.75, Supplementary Table S13) and LDAK is approximately unbiased under the “LDAK model” (relative bias = 0.16%, *p* = 0.77, Supplementary Table S14), but all single-component methods are sensitive to the MAF spectrum of causal variants and to the coupling of causal effects with MAF/LD. Across the 12 architectures in Figure 3 where *p*_{causal} = 1%, the relative biases of the single-component methods range from −15% to 7.9% (GREML), −14% to 9.1% (BOLT-REML), and −34% to 8.2% (LDAK) (Supplementary Tables S12-S14). In contrast, for the same 12 architectures, $\hat{h}^2_{\mathrm{GRE}}$ yields relative biases in the range [−2.1%, 1.7%], which is comparable to the relative bias observed with GREML-LDMS-I (range [−2.9%, 1.5%]) when using 8 GRMs defined by 4 LD quartiles and 2 MAF bins (MAF > 5% and MAF ≤ 5%) that align with the causal variant MAF spectrum (Figure 3, Supplementary Tables S15, S16). These trends are consistent across a range of values of $h_g^2$ and *p*_{causal}: across the 112 distinct LDAK-LD- and/or MAF-dependent architectures shown in Supplementary Figures S10-S14, the average and range of the relative bias are 0.09% [−4.9%, 6.4%] (GRE), −0.6% [−5.9%, 2.3%] (GREML-LDMS-I), −2.9% [−27%, 15%] (GREML), −1.8% [−25%, 18%] (BOLT-REML), and −8.2% [−44%, 13%] (LDAK) (Supplementary Tables S12-S16). Similar trends are observed in additional simulations under 14 LD-score-dependent architectures (Supplementary Figure S15, Supplementary Table S17). We note that the performance of GREML-LDMS-I depends on the resolution of the MAF and LD bins used to partition SNPs; in an extreme example where all causal variants are drawn from a MAF range tightly concentrated near 1%, running GREML-LDMS-I with the same 8 GRMs as before yields downward-biased estimates whereas our estimator remains robust (Supplementary Figure S16, Supplementary Tables S12-S16). While the variance of our estimator is larger than the variances of the REML-based methods (Figure 3), our approach is designed for biobank-scale GWAS data with sample sizes several orders of magnitude larger than what we used in these small-scale simulations. In summary, our results confirm that it is possible to accurately estimate $h_g^2$ under minimal assumptions about genetic architecture.

### Estimating SNP-heritability of 22 complex traits in the UK Biobank

Finally, we apply our approach to estimate $h_g^2$ for 22 complex traits and diseases in the UK Biobank (*N* = 290K unrelated British individuals, *M* = 460K array SNPs; see Methods) [10]. For comparison, we also provide estimates of $h_g^2$ from LDSC (no annotations), S-LDSC (controlling for the baseline-LD model [13,29]), and SumHer. Of the 22 traits analyzed (6 quantitative and 16 binary), we focus on 18 traits for which $\hat{h}^2_{\mathrm{GRE}} > 0.05$ (Table 2). Using our approach, estimates of SNP-heritability for the 6 quantitative traits range from 0.12 (smoking status) to 0.60 (height). Across the 12 binary traits, our estimates range from 0.064 (autoimmune disorders) to 0.16 (hypertension) (Table 2). To enable a direct comparison between $\hat{h}^2_{\mathrm{GRE}}$ and the SNP-heritability quantities estimated by LDSC, S-LDSC, and SumHer, we run each of the summary-statistics-based methods with LD scores and regression weights computed from in-sample LD among the typed SNPs, and we estimate SNP-heritability as the sum of the per-SNP variances across all *M* SNPs (Methods). Across the 18 traits, the median difference (as a percentage of $\hat{h}^2_{\mathrm{GRE}}$) between S-LDSC (baseline-LD/in-sample) and $\hat{h}^2_{\mathrm{GRE}}$ is −9%; the median difference between SumHer (in-sample) and $\hat{h}^2_{\mathrm{GRE}}$ is 11% (Figure 4, Supplementary Table S18). This pattern is roughly consistent with the global trends we observed in genome-wide simulations (Figure 2). As expected [11], LDSC (in-sample) yields inflated estimates across all 18 traits. To enable a comparison between $\hat{h}^2_{\mathrm{GRE}}$ and SNP-heritability estimates from summary-statistics-based methods reported in the literature, we also run LDSC, S-LDSC, and SumHer with their recommended parameter settings [11,12,14,29] and with LD scores and regression weights computed from 489 Europeans in the 1000 Genomes Phase 3 reference panel [30]. We note that when running these methods as recommended, their SNP-heritability estimands are not equivalent to our definition of $h_g^2$ (see Methods and refs. [11,12,14,19] for details). Across the 18 traits for which $\hat{h}^2_{\mathrm{GRE}} > 0.05$, the median differences with respect to $\hat{h}^2_{\mathrm{GRE}}$ are −11% for LDSC (1KG), −14% for S-LDSC (baseline-LD/1KG), and 38% for SumHer (1KG) (Supplementary Figure S17, Supplementary Table S18). Across 9 traits (a subset of the 18 traits with $\hat{h}^2_{\mathrm{GRE}} > 0.05$) for which a previous study reported estimates from single-component BOLT-REML computed from approximately 337K unrelated white British individuals in the UK Biobank [31], the median difference between the previously reported BOLT-REML estimates and $\hat{h}^2_{\mathrm{GRE}}$ is 8% (Supplementary Table S18).
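The differences in this section are expressed as a percentage of the GRE estimate. As a small worked check, the sketch below applies that definition to the height and hypertension estimates quoted in the Introduction (S-LDSC 0.56 and SumHer 0.63 against 0.60; S-LDSC 0.14 and SumHer 0.18 against 0.16). The helper name `pct_diff` is illustrative.

```python
def pct_diff(other, gre):
    """Difference between another method's estimate and the GRE estimate,
    as a percentage of the GRE estimate."""
    return 100.0 * (other - gre) / gre

# Estimates quoted in the text: (S-LDSC, SumHer, GRE)
traits = {
    "height": (0.56, 0.63, 0.60),
    "hypertension": (0.14, 0.18, 0.16),
}

for name, (sldsc, sumher, gre) in traits.items():
    print(f"{name}: S-LDSC {pct_diff(sldsc, gre):+.1f}%, "
          f"SumHer {pct_diff(sumher, gre):+.1f}%")
# height: S-LDSC -6.7%, SumHer +5.0%
# hypertension: S-LDSC -12.5%, SumHer +12.5%
```

The median across traits can then be taken with `statistics.median` over the per-trait values; the per-trait estimates themselves are in Table 2 and Supplementary Table S18 and are not reproduced here.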

### Runtime and memory requirement

Since our approach is designed to be applied to biobank-scale data, we report the runtime and memory requirements for computing $\hat{h}^2_{\mathrm{GRE}}$ with 22 chromosome-wide LD blocks in the UK Biobank (*N* = 337K individuals, *M* = 593K array SNPs). First, we compute chromosome-wide LD; this has complexity $\mathcal{O}(N p_k^2)$ for chromosome *k* with *p*_{k} SNPs. In practice, this step does not impose a computational bottleneck because the LD computations can be parallelized over SNP partitions (e.g., given a pair of SNP partitions containing 1000 SNPs each, computing the pairwise LD matrix takes about 10 minutes and 16GB of memory). Second, the pseudoinverse of each chromosome-wide LD matrix is computed via truncated singular value decomposition (SVD), which has complexity $\mathcal{O}(p_k^3)$ for chromosome *k*. This step is parallelized over chromosomes; for chromosome 2, which has the largest number of typed SNPs, computing the truncated SVD and pseudoinverse of the LD matrix takes about 3 hours and 60GB of memory. Lastly, given the precomputed pseudoinverse of each chromosome-wide LD matrix and OLS association statistics, computing genome-wide $\hat{h}^2_{\mathrm{GRE}}$ has complexity $\mathcal{O}(\sum_{k} p_k^2)$. For any of the UK Biobank traits analyzed in this work, this takes less than 1 hour and requires 24GB of memory; most of this time is spent loading the pseudoinverse LD matrices into memory.
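The split between a one-time LD step and a cheap per-trait step can be sketched as below: the pseudoinverse and rank of each block's LD matrix are computed once by truncated SVD, then reused with each trait's OLS statistics in the per-block form (N · β̂ᵀV̂†β̂ − q)/(N − q). This is a minimal sketch under assumptions, not the released software: the function names and the relative cutoff `rel_tol` are hypothetical, and a production version would store the pseudoinverses on disk rather than hold them in a list.

```python
import numpy as np

def truncated_pinv(V, rel_tol=1e-8):
    """Pseudoinverse and rank of a symmetric PSD LD matrix via truncated SVD.

    Singular values below rel_tol * (largest singular value) are dropped.
    """
    U, s, Vt = np.linalg.svd(V, hermitian=True)
    keep = s > rel_tol * s.max()
    V_pinv = (Vt[keep].T / s[keep]) @ U[:, keep].T
    return V_pinv, int(keep.sum())

def gre_from_summary(beta_hats, pinvs, ranks, N):
    """Per-trait step: sum of per-block estimates from precomputed pieces.

    beta_hats : list of per-block vectors of standardized OLS effects
    pinvs     : list of per-block LD pseudoinverses (from truncated_pinv)
    ranks     : list of per-block LD ranks
    """
    total = 0.0
    for b, P, q in zip(beta_hats, pinvs, ranks):
        quad = float(b @ P @ b)          # quadratic form, cost O(p_k^2)
        total += (N * quad - q) / (N - q)
    return total
```

Because the pseudoinverses depend only on the genotypes, they can be shared across all traits measured in the same individuals, which is why the per-trait cost is dominated by loading them into memory.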

## Discussion

In this work, we show that highly accurate estimation of SNP-heritability can be achieved under minimal assumptions on the genetic architecture of complex traits. In particular, our proposed estimator assumes that each SNP effect has a fixed SNP-specific variance that can capture any arbitrary relationship between effect size and genomic features such as MAF and LD. We show that all existing methods to estimate SNP-heritability impose additional assumptions on top of the GRE model, and we confirm through extensive simulations that these methods are susceptible to bias when their modeling assumptions are not met. Additionally, we confirm that REML-based methods that partition SNPs by MAF and LD score generally yield much smaller bias compared to single-component REML methods [18]. In contrast, our estimator, derived under the GRE model, provides accurate estimates of SNP-heritability regardless of the underlying genetic architecture, without specifying a heritability model or partitioning SNPs by functional categories. On average across 18 heritable traits in the UK Biobank ($\hat{h}^2_{\mathrm{GRE}} > 0.05$), our approach yields estimates that are higher than S-LDSC estimates (controlling for the baseline-LD model [29]) and lower than SumHer estimates (with the recommended heritability model [9,14]). One practical advantage of our approach over methods such as LDSC, S-LDSC, and SumHer is that the estimand of our approach is always the same for a given genotype matrix, whereas the definitions and interpretations of the estimands of LDSC, S-LDSC, and SumHer can vary drastically depending on what sets of SNPs are used in each step of the inference procedure (e.g., the set of SNPs used to compute LD scores need not be the same set of SNPs that defines the SNP-heritability estimand of interest) [11,12,19]. 
Overall, our results show that while existing methods can yield biases, for the purpose of estimating total SNP-heritability of complex traits, most methods are relatively accurate and robust to plausible genetic architectures.

We conclude with several caveats and future directions. First, the utility of the GRE estimator critically depends on the ratio between the number of SNPs (*M*) and the number of individuals (*N*) in the data: as *M/N* increases, the eigenstructure of the in-sample LD matrix (and sample covariance matrices in general) becomes increasingly distorted (larger eigenvalues are overestimated and smaller eigenvalues are underestimated) [32]. We mitigate this by assuming that the genome-wide LD matrix has a block diagonal structure (specifically, one block per chromosome); since the number of unrelated British individuals in the UK Biobank is larger than the number of array SNPs per chromosome, our approach is able to provide meaningful estimates of the SNP-heritability attributable to common SNPs (MAF > 1%) in individuals of British ancestry. A major limitation of our approach remains with respect to imputed and/or whole-genome sequenced data, in which the number of SNPs will continue to be orders of magnitude larger than the number of individuals for the foreseeable future. We defer a thorough investigation of regularized estimation of LD in high-dimensional settings (*M* > *N*) to future work.

Second, this work focuses on accurate estimation of SNP-heritability given OLS association statistics and chromosome-wide LD matrices estimated from the same genotype data. While summary statistics have been made publicly available for hundreds of GWAS, in-sample LD is usually unavailable or impossible to obtain since most large-scale GWAS are meta-analyses [33], and publicly available reference panels such as 1000 Genomes [30] currently have sample sizes in the hundreds or thousands at most. In addition, many publicly available summary statistics were computed using linear mixed models rather than OLS in order to control for population structure/cryptic relatedness. Previous works have noted (in the context of statistical fine-mapping) that the LD computation must be adjusted to accommodate association statistics computed from mixed models [33,34]. The sensitivity of our estimator to reference panel LD (with or without regularized LD estimation) and/or mixed model association statistics remains unclear [28,35]; we leave an investigation of both for future work. Furthermore, our simulations use typed SNPs to draw phenotypes because imputed genotypes have highly irregular LD patterns [9,18]. Although it would be more realistic to simulate causal variants and phenotypes from a denser set of genotyped SNPs or from whole-genome sequencing data [18], our simulation design was dependent on the availability of individual-level genotype measurements in biobank-scale sample sizes.

Third, the GRE estimator does not correct for population structure/cryptic relatedness. We mitigate this in our analysis of real UK Biobank traits by considering only unrelated individuals (> 3rd degree relatives) and by including age, sex, and the top 20 principal components as covariates in the linear regression when computing OLS association statistics. However, recent work has found significant evidence of assortative mating for some traits in the UK Biobank (e.g., height) and not others [36], suggesting that our estimator might be more susceptible to bias for some traits than for others. Currently, it remains unclear how to quantify the bias of our estimator due to population structure and/or assortative mating in real data. In addition, we derive the GRE estimator under no ascertainment in case/control data. Future work is needed to extend the GRE approach to control for ascertainment bias [15,16,37,38].

Finally, while previous works have applied similar estimators in the context of fixed effects models to estimate local SNP-heritability within small regions (e.g., LD blocks) [27, 28], additional work is needed to extend our approach to perform functional partitioning of SNP-heritability by higher-resolution annotations. Existing methods for partitioning genome-wide SNP-heritability by small and/or overlapping annotations make various assumptions on genetic architecture [8,12–14,29], motivating the development of new methods in this area under fewer assumptions.

## URLs

GRE estimator: https://bogdan.dgsom.ucla.edu/pages/software

BOLT-LMM: https://data.broadinstitute.org/alkesgroup/BOLT-LMM/

GCTA: https://cnsgenomics.com/software/gcta/

LDAK: http://dougspeed.com/ldak/

LDSC: https://github.com/bulik/ldsc/

baseline-LD annotations: https://data.broadinstitute.org/alkesgroup/LDSCORE/

PLINK: https://www.cog-genomics.org/plink2

UK Biobank: https://www.ukbiobank.ac.uk

## Methods

### The generalized random effects model

We model the phenotype for an individual *n* randomly sampled from the population as $y_n = \mathbf{x}_n^{T}\boldsymbol{\beta} + \epsilon_n$, where **x**_{n} = (*x*_{n1}, …, *x*_{nM})^{T} is a vector of standardized genotypes measured at *M* SNPs for individual *n*, **β** = (*β*_{1}, …, *β*_{M})^{T} is an *M*-vector of the corresponding standardized SNP effect sizes, and $\epsilon_n$ is environmental noise. We assume Var[*y*_{n}] = 1 and that the genotype at each SNP *i* is centered and scaled in the population such that E[*x*_{ni}] = 0 and Var[*x*_{ni}] = 1; i.e. $x_{ni} = (g_{ni} - 2f_i)/\sqrt{2f_i(1-f_i)}$, where *g*_{ni} ∈ {0, 1, 2} is the number of copies of the effect allele at SNP *i* for individual *n*, and *f*_{i} is the population frequency of the effect allele at SNP *i*. We define the population LD between two SNPs *i* and *j* to be *v*_{ij} ≡ E[*x*_{ni}*x*_{nj}] for all *i* ≠ *j*. The population LD matrix among the *M* SNPs is therefore $\mathbf{V} = \mathrm{E}[\mathbf{x}_n\mathbf{x}_n^{T}]$. For simplicity, we use “SNP effect sizes” in lieu of “standardized SNP effect sizes” to refer to **β**. We assume that the genotypes **x**_{n} and effect sizes **β** are independent given allele frequencies (*f*_{1}, …, *f*_{M}) and **V**.
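
The standardization and LD definitions above can be sketched numerically. This is a minimal illustration, not the authors' code: it assumes NumPy, uses in-sample allele frequencies in place of population frequencies, and the function names `standardize_genotypes` and `in_sample_ld` and the toy allele frequencies are hypothetical.

```python
import numpy as np

def standardize_genotypes(G):
    """Center and scale allele counts (0/1/2) per SNP: (g - 2f) / sqrt(2f(1-f))."""
    G = np.asarray(G, dtype=float)
    f = G.mean(axis=0) / 2.0              # in-sample effect-allele frequency (assumption)
    sd = np.sqrt(2.0 * f * (1.0 - f))     # standard deviation under Hardy-Weinberg proportions
    if np.any(sd == 0):
        raise ValueError("monomorphic SNPs cannot be standardized")
    return (G - 2.0 * f) / sd

def in_sample_ld(X):
    """In-sample LD matrix (1/N) X^T X for an N x M standardized genotype matrix."""
    return X.T @ X / X.shape[0]

# Toy data: 2000 individuals, 3 independent SNPs with made-up frequencies
rng = np.random.default_rng(0)
freqs = np.array([0.1, 0.3, 0.5])
G = rng.binomial(2, freqs, size=(2000, 3))
X = standardize_genotypes(G)
V_hat = in_sample_ld(X)
```

With this scaling each column has mean exactly 0 in the sample, while its variance is only approximately 1 because the Hardy-Weinberg variance is used instead of the empirical one.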

Under the generalized random effects (GRE) model, the first two moments of the distribution of the effect size of SNP *i* are E[*β*_{i}] = 0 and $\mathrm{Var}[\beta_i] = \sigma_i^2$, where $\sigma_i^2$ can be any arbitrary nonnegative finite number. We assume the covariance between the effects of different SNPs is 0 (i.e. Cov[*β*_{i}, *β*_{j}] = E[*β*_{i}*β*_{j}] = 0 for all *i* ≠ *j*). Because the SNP-specific variances can capture any polygenicity (number of variants with effects larger than some measurable constant) and any degree of coupling between genomic features (e.g., MAF and LD) and effect size, the GRE model encompasses most realistic genetic architectures (Table 1).

We define total SNP-heritability to be the proportion of phenotypic variance attributable to the additive effects of a set of *M* SNPs whose genotypes are directly measured:

$$h^2_g \equiv \frac{\mathrm{Var}[\mathbf{x}_n^{T}\boldsymbol{\beta}]}{\mathrm{Var}[y_n]} = \mathrm{Var}[\mathbf{x}_n^{T}\boldsymbol{\beta}].$$

Thus, $h^2_g$ is defined with respect to a given population and a given set of SNPs. By definition, $0 \le h^2_g \le 1$. Similarly, we define regional SNP-heritability $h^2_k$ to be the proportion of phenotypic variance due to the additive effects of the genotyped SNPs in region *k*. We assume that the set of SNPs that defines $h^2_k$ is a subset of the *M* SNPs that define $h^2_g$ (thus, $h^2_k \le h^2_g$). If region *k* is the whole genome, $h^2_k = h^2_g$.

### Estimating SNP-heritability under the GRE model

We are interested in estimating $h^2_g$ under the GRE model (Equation 3). In a GWAS with *N* individuals genotyped at *M* SNPs, let **X** be the *N* × *M* matrix of standardized genotypes (i.e. each column of **X** has been standardized to have mean 0 and variance 1), let **y** = (*y*_{1}, …, *y*_{N})^{T} be the *N*-vector of standardized phenotypes, and let $\hat{\mathbf{V}} = \frac{1}{N}\mathbf{X}^{T}\mathbf{X}$ be the *M* × *M* in-sample LD matrix (an estimate of population LD, **V**) with rank *q*, where 1 ≤ *q* ≤ *M*. Let **X** = (**X**_{1} … **X**_{K}) be a set of *K* approximately independent genome partitions spanning all *M* SNPs (e.g., chromosomes). For each region *k* containing *p*_{k} SNPs, **X**_{k} is the *N* × *p*_{k} standardized genotype matrix and $\hat{\mathbf{V}}_k = \frac{1}{N}\mathbf{X}_k^{T}\mathbf{X}_k$ is the corresponding *p*_{k} × *p*_{k} in-sample LD matrix with rank *q*_{k}, where 1 ≤ *q*_{k} ≤ *p*_{k}. We propose the following estimator for genome-wide SNP-heritability:

$$\hat{h}^2_{\mathrm{GRE}} = \sum_{k=1}^{K} \frac{N\,\hat{\boldsymbol{\beta}}_k^{T}\hat{\mathbf{V}}_k^{\dagger}\hat{\boldsymbol{\beta}}_k - q_k}{N - q_k}, \tag{4}$$

where $\hat{\boldsymbol{\beta}}_k = \frac{1}{N}\mathbf{X}_k^{T}\mathbf{y}$ is the *p*_{k}-vector of marginal SNP effects estimated by ordinary least squares (OLS) for region *k* and $\hat{\mathbf{V}}_k^{\dagger}$ is the pseudoinverse of $\hat{\mathbf{V}}_k$.
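
The block-wise estimator described above can be sketched as follows. This is an illustrative NumPy implementation under stated assumptions, not the released GRE software: it takes individual-level standardized genotype blocks as input, and the function names `gre_block` and `gre_estimate` and the relative eigenvalue cutoff `rtol` are hypothetical choices.

```python
import numpy as np

def gre_block(beta_hat, V_hat, N, rtol=1e-8):
    """One block's contribution: (N * b' V^+ b - q) / (N - q), with q = rank(V_hat)."""
    evals, evecs = np.linalg.eigh(V_hat)            # V_hat is symmetric positive semidefinite
    keep = evals > rtol * evals.max()               # hypothetical cutoff for nonzero eigenvalues
    q = int(keep.sum())
    proj = evecs[:, keep].T @ beta_hat              # beta_hat in the retained eigenbasis
    quad = float(np.sum(proj ** 2 / evals[keep]))   # beta_hat' V_hat^+ beta_hat
    return (N * quad - q) / (N - q), q

def gre_estimate(X_blocks, y):
    """Sum block contributions; X_blocks are N x p_k standardized genotype matrices."""
    N = y.shape[0]
    total = 0.0
    for X_k in X_blocks:
        beta_k = X_k.T @ y / N                      # marginal OLS effects for block k
        V_k = X_k.T @ X_k / N                       # in-sample LD for block k
        total += gre_block(beta_k, V_k, N)[0]
    return total

# Noiseless toy check: if y lies in the column space of X and has unit variance,
# the quadratic form equals 1 and the estimate is 1 up to rounding.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ rng.standard_normal(5)
y = y / np.sqrt(np.mean(y ** 2))
h2_noiseless = gre_estimate([X], y)
h2_duplicated = gre_estimate([np.hstack([X, X[:, [0]]])], y)  # rank-deficient block
```

Working in the eigenbasis avoids forming the pseudoinverse explicitly, and a duplicated SNP leaves the estimate unchanged because only the rank enters the correction term.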

In the following sections, we first derive $\hat{h}^2_{\mathrm{GRE}}$ in the simplest case where *K* = 1 and *N* > *M* by finding an estimator that satisfies $\mathrm{E}[\hat{h}^2_{\mathrm{GRE}}] = h^2_g$. We then describe modifications to this estimator to allow *N* < *M* as well as rank-deficient LD matrices. Lastly, we derive an analytical form for the standard error of $\hat{h}^2_{\mathrm{GRE}}$.

### Derivation of $\hat{h}^2_{\mathrm{GRE}}$ assuming fixed **β** and *N* > *M*

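Recall that Var[*y*_{n}] = 1 and $h^2_g = \mathrm{Var}[\mathbf{x}_n^{T}\boldsymbol{\beta}]$. Our goal is to find an estimator $\hat{h}^2_{\mathrm{GRE}}$ that satisfies $\mathrm{E}[\hat{h}^2_{\mathrm{GRE}}] = h^2_g$ (Equation 2). If **β** were fixed and we observed **V** and **β**, we could estimate $h^2_g$ as $\boldsymbol{\beta}^{T}\mathbf{V}\boldsymbol{\beta}$. However, in reality, we observe noisy estimates of **β** and **V** from **X**, the genotype matrix, and **y**, the standardized phenotype vector. We assume that when *N* > *M*, $\hat{\mathbf{V}} \to \mathbf{V}$ as *N* → ∞ (in practice, the assumption that *N* > *M* is untrue; in subsequent sections we show how we partition the genome into *K* blocks such that *N* > *p*_{k} for each block *k*). In a typical GWAS, the marginal SNP effects are estimated through ordinary least squares (OLS) regression as $\hat{\boldsymbol{\beta}} = \frac{1}{N}\mathbf{X}^{T}\mathbf{y}$. Given **X** and fixed **β**, it follows that

$$\mathrm{E}[\hat{\boldsymbol{\beta}} \mid \mathbf{X}, \boldsymbol{\beta}] = \hat{\mathbf{V}}\boldsymbol{\beta}, \qquad \mathrm{Var}[\hat{\boldsymbol{\beta}} \mid \mathbf{X}, \boldsymbol{\beta}] = \frac{1 - h^2_g}{N}\,\hat{\mathbf{V}}.$$

Thus, as *N* → ∞, $\hat{\boldsymbol{\beta}} \to \mathbf{V}\boldsymbol{\beta}$. Substituting $\boldsymbol{\beta} \approx \hat{\mathbf{V}}^{-1}\hat{\boldsymbol{\beta}}$ and $\mathbf{V} \approx \hat{\mathbf{V}}$, we obtain the revised estimator $\hat{\boldsymbol{\beta}}^{T}\hat{\mathbf{V}}^{-1}\hat{\boldsymbol{\beta}}$. The expectation of this estimator is

$$\mathrm{E}[\hat{\boldsymbol{\beta}}^{T}\hat{\mathbf{V}}^{-1}\hat{\boldsymbol{\beta}} \mid \mathbf{X}, \boldsymbol{\beta}] = \mathrm{tr}\!\left(\hat{\mathbf{V}}^{-1}\,\mathrm{Var}[\hat{\boldsymbol{\beta}} \mid \mathbf{X}, \boldsymbol{\beta}]\right) + \boldsymbol{\beta}^{T}\hat{\mathbf{V}}\hat{\mathbf{V}}^{-1}\hat{\mathbf{V}}\boldsymbol{\beta} = \frac{M(1 - h^2_g)}{N} + \boldsymbol{\beta}^{T}\hat{\mathbf{V}}\boldsymbol{\beta}. \tag{7}$$

We define $\hat{h}^2_{\mathrm{GRE}}$ to be an estimator that satisfies $\mathrm{E}[\hat{h}^2_{\mathrm{GRE}} \mid \mathbf{X}, \boldsymbol{\beta}] = \boldsymbol{\beta}^{T}\hat{\mathbf{V}}\boldsymbol{\beta}$. Substituting $h^2_g = \boldsymbol{\beta}^{T}\hat{\mathbf{V}}\boldsymbol{\beta}$ into Equation 7 and solving for $h^2_g$, we obtain

$$\hat{h}^2_{\mathrm{GRE}} = \frac{N\,\hat{\boldsymbol{\beta}}^{T}\hat{\mathbf{V}}^{-1}\hat{\boldsymbol{\beta}} - M}{N - M}.$$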

### Unbiasedness of $\hat{h}^2_{\mathrm{GRE}}$ under the GRE model when *N* > *M*

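Recall that under the GRE model, E[*β*_{i}] = 0 and $\mathrm{Var}[\beta_i] = \sigma_i^2$, where $0 \le \sigma_i^2 < \infty$ for all SNPs *i*. In the previous sections, we showed that $\mathrm{E}[\hat{h}^2_{\mathrm{GRE}} \mid \boldsymbol{\beta}] = \boldsymbol{\beta}^{T}\mathbf{V}\boldsymbol{\beta}$ (taking $\hat{\mathbf{V}} = \mathbf{V}$) and $h^2_g = \mathrm{Var}[\mathbf{x}_n^{T}\boldsymbol{\beta}]$. Recalling that Cov[*β*_{i}, *β*_{j}] = 0 for all *i* ≠ *j* and that *v*_{ii} = Var[*x*_{ni}] = 1, it follows that

$$h^2_g = \mathrm{E}\!\left[\mathrm{Var}[\mathbf{x}_n^{T}\boldsymbol{\beta} \mid \boldsymbol{\beta}]\right] + \mathrm{Var}\!\left[\mathrm{E}[\mathbf{x}_n^{T}\boldsymbol{\beta} \mid \boldsymbol{\beta}]\right] = \mathrm{E}[\boldsymbol{\beta}^{T}\mathbf{V}\boldsymbol{\beta}] = \sum_{i=1}^{M}\sum_{j=1}^{M} v_{ij}\,\mathrm{E}[\beta_i\beta_j] = \sum_{i=1}^{M}\sigma_i^2.$$

Therefore, $\mathrm{E}[\hat{h}^2_{\mathrm{GRE}}] = \mathrm{E}\!\left[\mathrm{E}[\hat{h}^2_{\mathrm{GRE}} \mid \boldsymbol{\beta}]\right] = \mathrm{E}[\boldsymbol{\beta}^{T}\mathbf{V}\boldsymbol{\beta}] = h^2_g$. This result implies that $\hat{h}^2_{\mathrm{GRE}}$ is an unbiased estimator for $h^2_g$ under a wide range of genetic architectures that fall under the GRE model.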

### Practical considerations for implementation

For most GWAS, because the number of genotyped SNPs *M* is much larger than the number of individuals *N* in the study, $\hat{\mathbf{V}}$ is a poor estimator of **V** genome-wide; as *M/N* increases, the eigenstructure of $\hat{\mathbf{V}}$ becomes increasingly distorted (larger eigenvalues are overestimated and smaller eigenvalues are underestimated) [32]. In addition, it is generally computationally intractable to compute and invert $\hat{\mathbf{V}}$ genome-wide. Thus, in practice, we divide the genome into a set of *K* approximately independent blocks (e.g., by chromosome) and implement $\hat{h}^2_{\mathrm{GRE}}$ as

$$\hat{h}^2_{\mathrm{GRE}} = \sum_{k=1}^{K} \frac{N\,\hat{\boldsymbol{\beta}}_k^{T}\hat{\mathbf{V}}_k^{-1}\hat{\boldsymbol{\beta}}_k - p_k}{N - p_k},$$

where $\hat{\boldsymbol{\beta}}_k$ is the *p*_{k}-vector of marginal effects for block *k* containing *p*_{k} SNPs. While this estimator does not provide theoretical guarantees of unbiasedness, it remains statistically consistent (i.e. $\hat{h}^2_{\mathrm{GRE}} \to h^2_g$ as *N* → ∞).

### Extension for rank-deficient LD

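It is often the case that two SNPs are perfectly correlated in a genotype block **X**_{k}, or that *N* < *p*_{k} for a block *k*. In this case, $\hat{\mathbf{V}}_k$ is rank-deficient (i.e. its rank is less than *p*_{k}) and $\hat{\mathbf{V}}_k^{-1}$ does not exist. We therefore compute $\hat{\mathbf{V}}_k^{\dagger}$, the pseudoinverse (Moore-Penrose inverse) of $\hat{\mathbf{V}}_k$, which approximates $\hat{\mathbf{V}}_k^{-1}$ using its truncated eigendecomposition. Let $q_k = \mathrm{rank}(\hat{\mathbf{V}}_k)$ and let $\hat{\mathbf{V}}_k = \mathbf{U}_k\boldsymbol{\Lambda}_k\mathbf{U}_k^{T}$ be the eigendecomposition of $\hat{\mathbf{V}}_k$, where **Λ**_{k} = diag(*λ*_{1}, …, *λ*_{qk}, 0, …, 0). The pseudoinverse of $\hat{\mathbf{V}}_k$ is $\hat{\mathbf{V}}_k^{\dagger} = \mathbf{U}_k\boldsymbol{\Lambda}_k^{\dagger}\mathbf{U}_k^{T}$, where $\boldsymbol{\Lambda}_k^{\dagger} = \mathrm{diag}(1/\lambda_1, \ldots, 1/\lambda_{q_k}, 0, \ldots, 0)$.

Substituting $\boldsymbol{\beta}_k \approx \hat{\mathbf{V}}_k^{\dagger}\hat{\boldsymbol{\beta}}_k$ and $\mathbf{V}_k \approx \hat{\mathbf{V}}_k$, we obtain the following estimator for $h^2_k$: $\hat{\boldsymbol{\beta}}_k^{T}\hat{\mathbf{V}}_k^{\dagger}\hat{\boldsymbol{\beta}}_k$. Let **I**_{qk} be a *p*_{k} × *p*_{k} diagonal matrix in which the first *q*_{k} diagonal entries are 1 and the rest are 0, so that $\hat{\mathbf{V}}_k^{\dagger}\hat{\mathbf{V}}_k = \mathbf{U}_k\mathbf{I}_{q_k}\mathbf{U}_k^{T}$. The expectation of our estimator given **β** and **X** is

$$\mathrm{E}[\hat{\boldsymbol{\beta}}_k^{T}\hat{\mathbf{V}}_k^{\dagger}\hat{\boldsymbol{\beta}}_k \mid \mathbf{X}, \boldsymbol{\beta}] = \frac{1 - h^2_k}{N}\,\mathrm{tr}(\mathbf{I}_{q_k}) + \boldsymbol{\beta}_k^{T}\hat{\mathbf{V}}_k\hat{\mathbf{V}}_k^{\dagger}\hat{\mathbf{V}}_k\boldsymbol{\beta}_k = \frac{q_k(1 - h^2_k)}{N} + \boldsymbol{\beta}_k^{T}\hat{\mathbf{V}}_k\boldsymbol{\beta}_k.$$

We wish to find an estimator $\hat{h}^2_k$ that satisfies $\mathrm{E}[\hat{h}^2_k \mid \mathbf{X}, \boldsymbol{\beta}] = \boldsymbol{\beta}_k^{T}\hat{\mathbf{V}}_k\boldsymbol{\beta}_k$. Substituting $h^2_k = \boldsymbol{\beta}_k^{T}\hat{\mathbf{V}}_k\boldsymbol{\beta}_k$ into the above equation and solving for $h^2_k$, we obtain

$$\hat{h}^2_k = \frac{N\,\hat{\boldsymbol{\beta}}_k^{T}\hat{\mathbf{V}}_k^{\dagger}\hat{\boldsymbol{\beta}}_k - q_k}{N - q_k}.$$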
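
The truncated-eigendecomposition pseudoinverse described in this section can be sketched as below. The sketch assumes NumPy; the function name `pinv_truncated` and the relative eigenvalue cutoff `rtol` are hypothetical, since the text does not state how numerically zero eigenvalues are identified.

```python
import numpy as np

def pinv_truncated(V, rtol=1e-8):
    """Pseudoinverse of a symmetric PSD matrix from its truncated eigendecomposition.

    Returns (V_pinv, q), where q is the number of eigenvalues kept (the numerical rank).
    """
    evals, U = np.linalg.eigh(V)
    keep = evals > rtol * evals.max()     # treat tiny eigenvalues as exactly zero (assumption)
    inv = np.zeros_like(evals)
    inv[keep] = 1.0 / evals[keep]
    V_pinv = (U * inv) @ U.T              # U diag(inv) U^T
    return V_pinv, int(keep.sum())

# Toy block with two perfectly correlated SNPs (column 0 is duplicated)
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
X = np.hstack([X, X[:, [0]]])
X = (X - X.mean(axis=0)) / X.std(axis=0)
V = X.T @ X / X.shape[0]
V_pinv, q = pinv_truncated(V)
```

In this toy block the LD matrix is 5 × 5 but has rank 4, and the trace of `V_pinv @ V` equals that rank, which is the quantity that enters the correction term of the estimator.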

### Analytical variance of $\hat{h}^2_{\mathrm{GRE}}$

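Following quadratic form theory [28,39], the variance of $\hat{h}^2_{\mathrm{GRE}}$ in the single-block case is given by

$$\mathrm{Var}[\hat{h}^2_{\mathrm{GRE}}] = \left(\frac{N}{N - q}\right)^{2}\left(2q\left(\frac{1 - h^2_g}{N}\right)^{2} + 4\left(\frac{1 - h^2_g}{N}\right)h^2_g\right). \tag{14}$$

When using the *K*-block approximation, which assumes that the blocks are independent, we approximate Equation 14 as the sum of the variances of the local SNP-heritabilities:

$$\mathrm{Var}[\hat{h}^2_{\mathrm{GRE}}] \approx \sum_{k=1}^{K}\left(\frac{N}{N - q_k}\right)^{2}\left(2q_k\left(\frac{1 - h^2_k}{N}\right)^{2} + 4\left(\frac{1 - h^2_k}{N}\right)h^2_k\right). \tag{15}$$

Because $h^2_g$ and $h^2_k$ for all *k* are unknown, Equation 14 is estimated by plugging in $\hat{h}^2_{\mathrm{GRE}}$ and Equation 15 is estimated by plugging in $\hat{h}^2_k$, the estimates of the regional SNP-heritabilities.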
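
A plug-in standard error under the *K*-block approximation can be sketched as follows. The sketch assumes Gaussian noise, so that the quadratic form in each block has variance 2*q*((1 − *h*²)/*N*)² + 4((1 − *h*²)/*N*)*h*², rescaled by (*N*/(*N* − *q*))²; this form is an assumption of the sketch, and the function names are hypothetical.

```python
import math

def gre_block_variance(h2_k, q_k, N):
    """Variance of one block's estimate under the assumed Gaussian quadratic-form result."""
    noise = (1.0 - h2_k) / N                      # per-SNP sampling variance of marginal effects
    return (N / (N - q_k)) ** 2 * (2.0 * q_k * noise ** 2 + 4.0 * noise * h2_k)

def gre_standard_error(h2_blocks, q_blocks, N):
    """Square root of the summed block variances, plugging in the block estimates."""
    return math.sqrt(sum(gre_block_variance(h, q, N) for h, q in zip(h2_blocks, q_blocks)))

# Hypothetical example: two blocks with no heritability, rank 10 each, N = 110
se_null = gre_standard_error([0.0, 0.0], [10, 10], 110)
```

For a block with zero heritability the second term vanishes, so the variance reduces to 2*q*/(*N* − *q*)², which grows quickly as the rank approaches the sample size.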

### Simulation Framework

To assess the performance of $\hat{h}^2_{\mathrm{GRE}}$ and other methods, we simulated continuous phenotypes from genotype array data in the UK Biobank [10] under a range of genetic architectures. We obtained a set of *N* = 337205 unrelated British individuals to use in simulations by extracting individuals that are > 3rd degree relatives and excluding individuals with putative sex chromosome aneuploidy. In all simulations, we standardize the genotype matrix before drawing phenotypes such that each column (SNP) of the genotype matrix has mean 0 and variance 1. In other words, we standardize the genotype at SNP *i* for individual *n* by computing $x_{ni} = (g_{ni} - 2f_i)/\sqrt{2f_i(1-f_i)}$, where *g*_{ni} ∈ {0, 1, 2} is the number of minor alleles at SNP *i* for individual *n* and *f*_{i} is the minor allele frequency (MAF) of SNP *i* among the *N* individuals.

Given standardized genotypes for *N* individuals at *M* SNPs and a fixed value of $h^2_g$, phenotypes are simulated under different genetic architectures according to the following model. The proportion of causal variants, *p*_{causal}, is set to either 1 (i.e. an infinitesimal model in which all variants have nonzero effects), 0.01, or 0.001. Let *c*_{i} ∈ {0, 1} be an indicator variable for the causal status of SNP *i*. If *p*_{causal} = 1, *c*_{i} = 1 for *i* = 1, …, *M*. Otherwise, if 0 ≤ *p*_{causal} < 1, we draw *p*_{causal} × *M* SNPs from the set of SNPs with minor allele frequencies in one of three ranges: (0, 0.5], (0.01, 0.05], or (0.05, 0.5]. We use the abbreviation “CV MAF” to refer to the MAF range from which causal variants are drawn. The standardized SNP effect sizes and phenotypes are then drawn according to the following model:

$$\beta_i \mid c_i \sim N\!\left(0,\; c_i\,\sigma_i^2\right), \quad \sigma_i^2 \propto [f_i(1-f_i)]^{1+\alpha}\,w_i^{\gamma}, \qquad \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim N\!\left(\mathbf{0},\; (1 - h^2_g)\,\mathbf{I}\right),$$

where *α* is a parameter that controls the coupling of MAF and effect size, *w*_{i} is a SNP-specific LD weight, and *γ* ∈ {0, 1} is a global parameter specifying whether the effect size of a SNP is coupled with its LD score. We simulate two types of LD-dependent architectures by defining the SNP-specific LD weights *w*_{1}, …, *w*_{M} to be either (1) the default “LDAK weights” computed by the LDAK software [6], or (2) the inverse unpartitioned “LD score” of each SNP computed within a 2-Mb window using the LDSC software (i.e. $w_i = 1/\sum_j r^2_{ij}$, where *j* indexes the set of SNPs within a 2-Mb window centered on SNP *i*) [11]. When *γ* = 1, both the LDAK weights and inverse LD score weights cause SNPs in regions of higher LD to have smaller effects than do SNPs in regions of lower LD. We set *α* to one of two values: *α* = −1, which indicates a relatively strong inverse relationship between MAF and effect size, or *α* = −0.25, which indicates a weaker inverse relationship between MAF and effect size. Each per-SNP variance is multiplied by a constant scaling factor to ensure that $\sum_{i=1}^{M} c_i\sigma_i^2 = h^2_g$. Note that $\mathrm{Var}[\beta_i \mid c_i] = \sigma_i^2$ if *c*_{i} = 1 and $\beta_i = 0$ if *c*_{i} = 0.

Finally, given simulated phenotypes **y** = (*y*_{1}, …, *y*_{N})^{T} and genotypes **X**, we compute marginal association statistics through ordinary least squares (OLS) as $\hat{\boldsymbol{\beta}} = \frac{1}{N}\mathbf{X}^{T}\mathbf{y}$.
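
The simulation steps above can be sketched in a few lines. This is an illustration under explicit assumptions rather than the authors' pipeline: per-SNP variances are taken to be proportional to [*f*(1 − *f*)]^{1+α} *w*^{γ} for causal SNPs, causal SNPs are drawn from all SNPs (the CV MAF restriction is omitted), and the function names, random LD weights, and Gaussian stand-in genotypes are hypothetical.

```python
import numpy as np

def simulate_effects(maf, ld_weight, h2, p_causal, alpha, gamma, rng):
    """Draw standardized effects whose per-SNP variances sum to h2."""
    M = maf.shape[0]
    n_causal = max(1, int(round(p_causal * M)))
    causal = np.zeros(M, dtype=bool)
    causal[rng.choice(M, size=n_causal, replace=False)] = True
    var = causal * (maf * (1.0 - maf)) ** (1.0 + alpha) * ld_weight ** gamma
    var = h2 * var / var.sum()                     # constant scaling so variances sum to h2
    beta = rng.standard_normal(M) * np.sqrt(var)   # non-causal SNPs get exactly zero effect
    return beta, var

def simulate_phenotype(X, beta, h2, rng):
    """Phenotype = genetic component + Gaussian noise with variance 1 - h2."""
    return X @ beta + rng.standard_normal(X.shape[0]) * np.sqrt(1.0 - h2)

rng = np.random.default_rng(2)
M, N = 1000, 500
maf = rng.uniform(0.01, 0.5, size=M)
ld_weight = rng.uniform(0.5, 2.0, size=M)          # placeholder for LDAK weights or inverse LD scores
beta, var = simulate_effects(maf, ld_weight, h2=0.25, p_causal=0.01, alpha=-0.25, gamma=1, rng=rng)
X = rng.standard_normal((N, M))                    # stand-in for real standardized genotypes
y = simulate_phenotype(X, beta, h2=0.25, rng=rng)
beta_hat = X.T @ y / N                             # marginal OLS association statistics
```

Setting `gamma=0` removes the LD dependence entirely, and `alpha=-1` makes the standardized per-SNP variance independent of MAF.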

### Comparison of methods in simulations

Unless otherwise specified, in all genome-wide simulations, we use real genotypes of *N* = 337205 unrelated British individuals measured at *M* = 593300 array SNPs to draw causal effects for all *M* SNPs and phenotypes for all *N* individuals. OLS summary statistics are computed for all *M* SNPs using the simulated phenotypes and real genotypes for all *N* individuals. We implement our estimator (Equation 4) by computing chromosome-wide in-sample LD for each chromosome *k* as $\hat{\mathbf{V}}_k = \frac{1}{N}\mathbf{X}_k^{T}\mathbf{X}_k$, and we compare $\hat{h}^2_{\mathrm{GRE}}$ to three computationally efficient methods that operate on summary statistics: LD score regression (LDSC) [11], stratified LD score regression (S-LDSC) [12,13], and SumHer [14].

To run LDSC with no annotations, we use the LDSC software (see URLs) to compute the LD score of each SNP as a function of its LD to all other SNPs in a 2-Mb window centered on the SNP. The LD scores are computed from a random sample of 40K individuals to reduce the amount of memory required by the LDSC software. We run the regression with an unconstrained intercept, using all *M* SNPs as observations in the response variable, where each SNP in the regression is weighted to account for heteroscedasticity and correlations between association statistics at SNPs in LD [11]. We estimate $h^2_g$ as a function of all *M* SNP-specific variances by running LDSC with the flags --not-M-5-50 and --chisq-max 99999 (the latter option prevents the LDSC software from dropping high-effect SNPs).

We run S-LDSC in two ways to account for MAF- and LD-dependent architectures. S-LDSC (MAF) refers to S-LDSC with 10 binary MAF bin annotations defined such that each bin contains exactly 10% of the typed SNPs; this is intended to mirror the 10 MAF bin annotations in the S-LDSC “baseline-LD model” [13] (see Supplementary Table S8 for precise MAF bin ranges for the UK Biobank Axiom Array). S-LDSC (MAF+LLD) refers to S-LDSC with the same 10 MAF bins and an additional continuous “level of LD” (LLD) annotation computed by quantile-normalizing the unpartitioned LD scores within each MAF bin to a standard normal distribution [13]. While our definition of LLD is intended to mirror the LLD annotation in the baseline-LD model, we do not set the LLD of variants with MAF < 0.05 to 0 because our estimand of interest is the SNP-heritability attributable to all *M* SNPs (not just SNPs with MAF > 0.05 [13]). For each annotation, LD scores are computed within 2-Mb windows from a random sample of 40K individuals. We run the regression with all *M* SNPs, an unconstrained intercept, and the recommended regression weights [12,13]. Once again, we use the flags --not-M-5-50 and --chisq-max 99999 to estimate $h^2_g$ as a function of all *M* SNP-specific variances and to prevent the LDSC software from dropping high-effect SNPs.
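
The two annotation types described above (equal-sized MAF bins and a within-bin quantile-normalized LLD) can be sketched as follows. This is an illustration only: the rank offset (*r* + 0.5)/*n* used for the normal quantiles, the handling of ties, the function names, and the toy LD scores are assumptions, not details taken from the S-LDSC software.

```python
import numpy as np
from statistics import NormalDist

def maf_bins(maf, n_bins=10):
    """Assign each SNP to one of n_bins equal-sized bins ordered by MAF."""
    M = maf.shape[0]
    order = np.argsort(maf, kind="stable")
    bins = np.empty(M, dtype=int)
    bins[order] = np.arange(M) * n_bins // M
    return bins

def level_of_ld(ld_score, bins):
    """Quantile-normalize LD scores to a standard normal within each MAF bin."""
    nd = NormalDist()
    lld = np.empty(ld_score.shape[0], dtype=float)
    for b in np.unique(bins):
        idx = np.flatnonzero(bins == b)
        ranks = np.argsort(np.argsort(ld_score[idx], kind="stable"), kind="stable")
        # (rank + 0.5) / n keeps quantiles strictly inside (0, 1); the offset is an assumption
        lld[idx] = [nd.inv_cdf((r + 0.5) / idx.size) for r in ranks]
    return lld

rng = np.random.default_rng(3)
maf = rng.uniform(0.01, 0.5, size=1000)
ld_score = rng.gamma(2.0, 50.0, size=1000)   # made-up LD scores for illustration
bins = maf_bins(maf)
lld = level_of_ld(ld_score, bins)
```

Each binary MAF annotation is then the indicator `bins == b`, and `lld` is the single continuous annotation; within every bin it has mean 0 by construction.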

To run SumHer, we first use the LDAK software (see URLs) to compute the default “LDAK weights” using in-sample LD [6,9,14]. Second, we compute “LD tagging” (i.e. LD scores) using 1-Mb windows centered on each SNP and setting *α* = −0.25 as recommended [14]. The LDAK software is memory-efficient, allowing us to use in-sample LD computed from all *N* = 337K individuals to obtain LDAK weights and LD tagging. Finally, we run SumHer to estimate $h^2_g$ as a function of all *M* SNP-specific variances. Unless otherwise specified, all default parameter settings are used to run SumHer in simulations.

Similarly, in all small-scale simulations, we use real genotypes of *N* = 8430 unrelated individuals at *M* = 14821 array SNPs to draw phenotypes for all *N* individuals. These individuals and SNPs are a subset of the full UK Biobank data that were used in the genome-wide simulations, and were chosen by selecting approximately 2.5% of individuals and the first 2.5% of SNPs from the beginning of each chromosome in order to preserve a realistic LD structure among the SNPs. OLS summary statistics are computed from the simulated phenotypes and genotypes for all *N* individuals and *M* SNPs, and $\hat{h}^2_{\mathrm{GRE}}$ is computed using in-sample chromosome-wide LD. We run the implementation of single-component GREML [3] provided by the GCTA software [40] and single-component BOLT-REML [8] provided by the BOLT-LMM software (see URLs), both with default parameters. We run the implementation of GREML-LDMS-I [18] provided by the GCTA software using 8 GRMs created from 2 MAF bins (MAF ≤ 0.05 and MAF > 0.05) and 4 LD score quartiles; LD scores were computed using the GCTA software with the default window size of 200 kb. We run LDAK using the default LDAK weights, setting *α* = −0.25 as recommended [6,9].

For a given genetic architecture, we generate 100 simulation replicates and obtain 100 estimates of $h^2_g$ from each method. We estimate the bias of an estimator under a given architecture by computing the difference between the average of the 100 estimates and the simulated $h^2_g$ (i.e. $\frac{1}{100}\sum_{i=1}^{100}\hat{h}^2_i - h^2_g$, where $\hat{h}^2_i$ is the estimate from the *i*-th simulation). To test whether the bias is statistically significant (i.e. significantly different from 0), we assess the z-score of the bias (the bias divided by the standard error of the mean of the 100 estimates), which follows a N(0, 1) distribution under the null hypothesis. To enable a comparison of estimators across different values of $h^2_g$, we assess the relative bias of an estimator under a single architecture as a percentage of $h^2_g$. In Figure 1c, we compute the error of a single estimate from the *i*-th simulation as $\hat{h}^2_i - h^2_g$; errors are also reported as percentages of $h^2_g$.
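
The bias summaries described above can be computed with the standard library alone. In this sketch the function name `summarize_bias`, the dictionary keys, and the five example estimates are hypothetical; the two-sided p-value is an added convenience that assumes the N(0, 1) null stated in the text.

```python
import math
from statistics import NormalDist, mean, stdev

def summarize_bias(estimates, h2_true):
    """Bias, z-score, two-sided p-value, and relative bias (%) across replicates."""
    n = len(estimates)
    bias = mean(estimates) - h2_true
    sem = stdev(estimates) / math.sqrt(n)        # standard error of the mean
    z = bias / sem
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return {
        "bias": bias,
        "z": z,
        "p_value": p_value,
        "relative_bias_pct": 100.0 * bias / h2_true,
    }

# Hypothetical estimates from five replicates with simulated heritability 0.2
summary = summarize_bias([0.24, 0.26, 0.25, 0.27, 0.23], h2_true=0.2)
```

Here the mean estimate is 0.25, so the bias is 0.05, which corresponds to a relative bias of 25% of the simulated value.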

### Analysis of UK Biobank phenotypes

We estimate SNP-heritability for 22 real complex traits (6 quantitative, 16 binary) in the UK Biobank [10]. We use PLINK [41] to exclude SNPs with MAF < 0.01 and genotype missingness > 0.01 as well as SNPs that fail the Hardy-Weinberg test at significance threshold 10^{−7}. We keep only the individuals with self-reported British white ancestry and no kinship (i.e. > 3rd degree relatives). After removing individuals who are outliers for genotype heterozygosity and/or missingness, we obtain a set of *N* = 290,641 unrelated British individuals to use in the real data analyses. For all traits, marginal association statistics are computed through OLS in PLINK, using age, sex, and the top 20 genetic principal components (PCs) as covariates in the regression; these 20 PCs were precomputed by UK Biobank from a superset of 488,295 individuals. Additional covariates were used for waist-to-hip ratio (adjusted for BMI) and diastolic/systolic blood pressure (adjusted for cholesterol-lowering medication, blood pressure medication, insulin, hormone replacement therapy, and oral contraceptives). We compute $\hat{h}^2_{\mathrm{GRE}}$ for each trait using chromosome-wide in-sample LD estimated from all *N* individuals.

When using LDSC, S-LDSC, or SumHer to estimate SNP-heritability, it is necessary to define and distinguish between the following sets of SNPs: the set of SNPs containing all possible causal SNPs of interest (used to compute LD scores and LDAK weights), the set of SNPs used as observations in the regression, and the set of SNPs that defines the SNP-heritability estimand of interest. We run two versions of LDSC [11], S-LDSC (controlling for the most recent baseline-LD model) [12,13,29], and SumHer [14]. First, to enable a more direct comparison between $\hat{h}^2_{\mathrm{GRE}}$ and the estimands of LDSC, S-LDSC, and SumHer, we run an “in-sample LD” version of each method where the *M* typed SNPs (MAF > 0.01) are used to compute LD scores and LDAK weights, perform the regression, and estimate SNP-heritability (i.e. we define the SNP-heritability estimand to be the sum of the per-SNP variances across the *M* typed SNPs). We refer to the in-sample LD versions of these methods as LDSC (in-sample), S-LDSC (baseline-LD/in-sample), and SumHer (in-sample). To run LDSC (in-sample) and S-LDSC (baseline-LD/in-sample), we use the LDSC software (URLs) to compute LD scores and regression weights within 2-Mb windows centered on each SNP, using a random sample of 40K individuals to reduce the memory requirement. To run SumHer (in-sample), we use the LDAK software (URLs) to compute LD tagging from the genotypes of all *N* individuals, using 1-Mb windows centered on each SNP and setting *α* = −0.25 as recommended [9,14]. Unless otherwise specified, all other parameters were set to the default settings of each software.

To enable comparisons between $\hat{h}^2_{\mathrm{GRE}}$ and estimates from LDSC, S-LDSC, and SumHer reported in the literature, we also run each method with its recommended parameter settings and LD estimated from reference panel sequencing data. We refer to these methods as LDSC (1KG), S-LDSC (baseline-LD/1KG), and SumHer (1KG) to indicate that LD is estimated from the 489 Europeans in the 1000 Genomes Phase 3 reference panel [30]. We run LDSC (1KG) and S-LDSC (baseline-LD/1KG) with LD scores and regression weights computed within 1-cM windows from 9,997,231 SNPs with minor allele count greater than 5 in the reference panel genotypes (URLs), and we define the SNP-heritability estimand to be a function of the array SNPs with MAF > 0.05 [11,12]. We run SumHer (1KG) using the 8,569,062 SNPs with MAF > 0.01 in the reference panel to compute LDAK weights and LD tagging (1-cM windows) and to define the SNP-heritability estimand; we control for a multiplicative inflation of test statistics as recommended [14]. See refs. [11,12,14,19] for details about the definitions and interpretations of the estimands of LDSC, S-LDSC, and SumHer.

## Acknowledgments

This research was conducted using the UK Biobank Resource under applications 33297 and 33127. We thank the participants of UK Biobank for making this work possible. We also thank Ruth Johnson, Malika Kumar Freund, Megan Major, Steven Gazal, Alkes Price and David Balding for helpful discussions. This work was funded by the National Institutes of Health (NIH) under awards R01HG009120, R01MH115676, R01HG006399, U01CA194393, T32NS048004, T32MH073526, and T32HG002536.