Estimating heritability of complex traits in admixed populations with summary statistics

All summary statistics-based methods to estimate the heritability of SNPs (hg2) rely on accurate linkage disequilibrium (LD) calculations. In admixed populations, such as African Americans and Latinos, LD estimates are influenced by admixture and can result in biased hg 2 estimates. Here, we introduce covariate-adjusted LD score regression (cov-LDSC), a method to provide robust hg2 estimates from GWAS summary statistics and in-sample LD estimates in admixed populations. In simulations, we observed that unadjusted LDSC underestimates hg2 by 10%-60%; in contrast, cov-LDSC is robust to all simulation parameters. We applied cov-LDSC to approximately 170,000 Latino, 47,000 African American 135,000 European individuals in three quantitative and five dichotomous phenotypes. Our results show that most traits have high concordance of hg2 between ethnic groups; for example in the 23andMe cohort, estimates of hg2 for BMI are 0.22 ± 0.01, 0.23 ± 0.03 and 0.22 ± 0.01 in Latino, African American and European populations respectively. However, for age at menarche, we observe population specific heritability differences with estimates of hg2 of 0.10 ± 0.03, 0.33 ± 0.13 and 0.19 ± 0.01 in Latino, African American and European populations respectively.


Introduction
It is important for human geneticists to study how genetic variants that influence phenotypic variability act across different populations worldwide 1,2 . With increasingly large and diverse genetic studies, it is now becoming feasible to assess how the genetic mechanisms of complex traits act across populations. However, to date, most genomewide association studies (GWAS) have been focused on relatively homogenous continental populations and, in particular those of European descent 3 . NonEuropean populations, particularly those with mixed ancestral backgrounds such as African Americans and Latinos, have been underrepresented in genetic studies. Many statistical methods to analyze genetic data assume homogeneous populations. In order to ensure that the benefits of GWAS are shared beyond individuals of homogeneous continental ancestry, statistical methods for admixed populations are needed 4 .
Among methods to analyze polygenic complex traits in homogeneous populations, summary statisticsbased methods such as linkage disequilibrium score regression (LDSC) 5,6 and its extensions [5][6][7][8] have become particularly popular due to their computational efficiency, relative ease of application, and their applicability without raw genotyping data 9 . These methods can be used to estimate SNP heritability, the proportion of phenotypic variance explained by genotyped variants 5,10-12 , distinguish polygenicity from confounding 5 , establish relationships between complex phenotypes 7 , and model genomewide polygenic signals to identify key cell types and regulatory mechanisms of human diseases 6,13,14 .
Summary statisticsbased methods for polygenic analysis frequently rely on accurate linkage disequilibrium (LD) calculations, which can be obtained on a random subset of the individuals without phenotypic information. Consequently, LDSC, SumHer 15 and other summary statisticsbased methods have not easily been expanded to admixed populations. The LD score for a SNP is the sum of its pairwise correlations (r 2 ) with all other SNPs. For most homogenous populations, LD scores are easy to calculate for SNPs in local windows since r 2 between SNPs more than one centimorgan (cM) is rarely nonzero. In admixed populations however, LD scores are confounded since pairwise r 2 between SNPs is inflated due to admixture itself. Furthermore, in homogenous populations LD scores can be easily calculated in reference panels since representative data is widely available. For admixed populations, even when reference panels are available, they may not be representative of the precise populations used in the genetic study. For example Latino populations in different regions worldwide can be the result of admixture between the same continental populations, but with dramatic differences in admixture proportions and timing of the admixture event 16 . A generic reference panel cannot easily capture these differences and hence cannot produce accurate LD scores that can be widely used for all Latino populations. Thus, LDSC has only been recommended to be applied in homogeneous populations.
In this work, we first examine the performance of LDSC in admixed populations and demonstrate that LDSC yields severely downwardly biased estimates of SNP heritability. Next, we extend the LDSCbased methods to admixed populations by introducing covariateadjusted LDSC (covLDSC). In covLDSC, for each variant we regress the global genotypic PCs, obtained within the GWAS samples, out of the raw genotypes to obtain covariateadjusted genotypes. We then compute LD scores on the adjusted genotypes and use LDSC to estimate heritability. Using covariateadjusted insample LD to compute LD scores removes the potential concerns of reference panel mismatch, longdistance admixtureLD, and covariate effects listed above, and produces accurate estimates of heritability with summary statistics ( Methods, Figure 1 ). Furthermore, heritability can be partitioned to identify key gene sets that have disproportionately high heritability. While access to the genotype data of the GWAS samples is required to compute the covariateadjusted LD scores, LD can be estimated on a random subset of the individuals, preserving the computational efficiency of LDSC and allowing for its application to very large studies. Individual cohorts can also release the insample covariateadjusted LD scores as well as the summary statistics to avoid privacy concerns associated with genotypelevel information to facilitate future studies.
After demonstrating that covLDSC is robust to a wide range of simulation scenarios, we apply it to 8,124 Latinos from a type 2 diabetes study (the Slim Initiative in Genomic Medicine for the Americas, SIGMA) 17 as well as 161,894 Latino, 46,844 African American, and 134,999 European research participants from a personal genetics company (23andMe). We analyze three quantitative phenotypes (body mass index, height, and age at menarche), and five dichotomous phenotypes (type 2 diabetes (available in the SIGMA cohort only), left handedness, morning person, motion sickness, and nearsightedness).
One powerful component of LDSC is that it can be used to test whether a particular genome annotation for example, sets of genes that are specifically expressed within a candidate tissue or cell type capture more heritability than expected by chance 12, 13 . We demonstrate that covLDSC can be applied in the same way to identify traitrelevant tissue and cell types in admixed and homogenous populations. We examine height, BMI and morning person since . CC-BY 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/503144 doi: bioRxiv preprint these traits had sufficient statistical power for celltype enrichment analyses in the personal genetics cohort. We observe a high level of consistency among enriched tissue types, highlighting that the underlying biological processes are shared among studied populations.
This analysis of hundreds of genome annotations in cohorts of over 100,000 individuals would have been impossible with existing genotypebased methods 18,19 .

Mathematical framework of covLDSC
In the standard polygenic model on which LDSC is based, genotype vectors are represented as , with each SNP normalized to mean zero and variance one in the population. In the absence of covariates, we model the phenotypes (1) where and is a vector of pernormalizedgenotype effect sizes, which we model as random with mean zero. In standard LDSC, the variance of is ; in β j M h g 2 / stratified LD score regression the variance of depends on a set of genome annotations. We β j assume, without loss of generality, that has mean zero and variance one in the population.
Let denote the chisquare statistic for the jth SNP, approximately equal to , and . The main equation on which LDSC is based is: .
, is the correlation between SNPs and in the underlying population. We estimate via R jk weighted regression of on our estimates of , evaluating significance with a block χ j 2 jackknife across SNPs 6 .
In the absence of covariates, the LD scores can be estimated from an external reference panel such as 1000 Genomes, as long as the correlation structure in the reference panel matches the correlation structure of the sample. In most homogeneous populations, we can also assume that the true underlying correlation is negligible outside of a 1cM window.
Next, suppose that we would like to include covariates in our model, such that Equation (1)  Using genotype data to compute LD scores means that the model being fit is based on the joint effects of a sparser set of SNPs, e.g. the genotyped SNPs, than when sequence data is used to compute LD scores. For estimating total SNP heritability, this means that covLDSC estimates the same estimand as GCTA ( ) and not the usual estimand of LDSC ( ; see below).  Window size and number of PCs in LD score calculations In addition to computing LD from the covariateadjusted genotypes, we also investigate the appropriate window size for estimating LD scores. To do this, we examine the effect of varying the genomic window size for both simulated and real data sets. We determine that LD score estimates were robust to the choice of window size if the increase in the mean LD score estimates was less than 1% per cM beyond a given window. Using this criterion, we use window sizes of 5cM and 20cM for the simulated and real genotypes respectively ( Supplementary Table 13 ). We also calculate the squared correlations between LD score estimates using the chosen window size and other LD score estimates with window sizes larger than the chosen window. The squared correlations were greater than 0.99 in all cases ( Supplementary Table   46 ) indicating the LD score estimates were robust at the chosen window sizes.
Similarly, to determine the number of PCs needed to be included in the GWAS association tests and covLDSC calculations, we examine the effect of varying the genomic window size using different numbers of PCs. The number of PCs that needed to be included for covariate adjustment depended on the population structure for different datasets.

Genotype simulations
We evaluate the performance of LDSC and covLDSC with simulated phenotypes and both simulated and real genotypes. For the simulated genotypes, we used msprime 20 version 0.6.1 to simulate population structure with mutation rate and recombination maps from the 0 2 1 8 HapMap Project 21 . We adapt the demographic model from Mexican migration history 22 using parameters that were previously inferred from the 1000 Genomes Project 23 . We assume the . CC-BY 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It  Table 7 ). After QC, a total of 8,214 individuals and 943,244 SNPs remain. We estimate the insample LD score with a 20cM window and 10 PCs in all scenarios.
We use these genotypes for simulations. We also analyze three phenotypes from the SIGMA cohort: height, BMI, and type 2 diabetes (T2D). For T2D, we assume a reported prevalence in Mexico of 0.144 17 . For each phenotype, we include age, sex, and the first 10 PCs as fixed effects in the association analyses.

Phenotype simulations
We simulate phenotypes with two different polygenic genetic architectures, given by GCTA 19 and the baseline model 6 respectively. In the GCTA model, all variants are equally likely to be causal independent of their functional or minor allele frequency (MAF) structure, and the standardized causal effect size variance is constant, i.e.
. In contrast, the ar(β ) where is the value of annotation at variant and represents j the pervariant contribution, of one unit of the annotation , to heritability. We generate all causal variants among common observed variants with MAF >5% (~40,000 SNPs in simulated genotypes and 943,244 SNPs in the SIGMA cohort). To represent environmental stratification, similar to previously described 5 , we add 0.2 * standardized first principal component to the standardized phenotypes.
We simulate both quantitative and casecontrol traits with both GCTA and baseline model genetic architectures, using both simulated and real genotypes, varying the number of causal variants, the true heritability, and environmental stratification. For casecontrol simulations, we adopt a liability threshold model with disease prevalence 0.1. We obtain 5,000 cases and 5,000 controls for each simulation scenario.
To obtain summary statistics for the simulated traits, we apply singlevariant linear models for quantitative traits and logistic models for binary trait both with 10 PCs as covariates in association analyses using PLINK 1.90 25 . 23andMe cohort All participants were drawn from the customer base of 23andMe, Inc., a direct to consumer We restrict participants to those who have European, African American, or Latino ancestry determined through an analysis of local ancestry 26 .
To compute LD scores, we use both genotyped and imputed SNPs. We filter genotyped variants with a genotype call rate 90%, nonzero selfchain score, strong evidence of Hardy Weinberg ≤ disequilibrium ( ), and failing a parentoffspring transmission test. For imputed variants, 0 p > 1 20 we use a reference panel that combined the May 2015 release of the 1000 Genomes Phase 3 haplotypes 23 with the UK10K imputation reference panel 27 . Imputed dosages are rounded to the nearest integer (0, 1, 2) for downstream analysis. We filter variants with imputation rsquared ≤ 0.9. We also filter genotyped and imputed variants for batch effects and sex dependent effects.
To minimize rounding inaccuracies, we prioritize genotyped SNPs over imputed SNPs in the merged SNP set. We restrict the merged SNP set to HapMap3 variants with MAF 0.05. We ≥ measure LD scores in a subset of African Americans (61,021) and Latinos (9,990) on chromosome 2 with different window sizes from 1cM to 50cM ( Supplementary Table 3 ) and squared correlation between different window sizes ( Supplementary Table 6) . We compute all LD scores with a 20cM window.
In genomewide association analyses, for each population, we choose a maximal set of unrelated individuals for each analysis using a segmental identitybydescent (IBD) estimation algorithm 28 . We define individuals to be related if they share more than 700cM IBD.
We performe association tests using linear regression model for quantitative traits and logistic regression model for binary traits assuming additive allelic effects. We include covariates for . CC-BY 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It

Heritability Estimation
We calculate insample LD scores using both a nonstratified LD score model and the baseline model 6 . In simulated phenotypes generated with the GCTA model, we use nonstratified LDSC to estimate heritability. In simulated phenotypes generated using the baseline model, we use LDSCbaseline to estimate heritability. We use the 53 nonfrequency dependent annotations included in the baseline model to estimate in the 23andMe research database and the h g 2 SIGMA cohort real phenotypes. We recognize recent studies have shown that genetic heritability can be sensitive to the choice of LDdependent heritability model 8,11 . However, understanding the LD and MAFdependence of complex trait genetic architecture is an important but complex endeavor potentially requiring both modeling of local ancestry as well as large sequenced reference panels that are currently unavailable. We thus leave this complexity for future work. Tissue Type Specific Analyses Following Finucane et al 13 , we extend covLDSC so that we can assess enrichment in and around sets of genes that are specifically expressed in tissue and celltypes (covLDSCSEG).
We annotate the genes with the same set of tissue specific expressed genes identified previously 13 using the Genotype-Tissue Expression (GTEx) project and a public dataset made available by the Franke lab. We calculate withinsample stratified covLD scores in the 23andMe cohort for each of these gene sets. We obtain regression coefficients from the model and perSNP heritability by one sd increase in value of the annotation of each cell type, conditional on other 53 noncell type specific baseline annotations. We calculate a onetailed pvalue for each coefficient where the null hypothesis is that the coefficient is nonpositive 13 . All the significant enrichments are reported with false discovery rate < 5% . We og (p) .75) ( l 10 > 2 perform fixedeffect inverse variance weighting metaanalysis using and normalized standard error (se) across populations.

Software Availability
An opensource software implementation of covariateadjusted LD score regression is publicly available (see Web Resources).

Robustness of LD score estimation
To demonstrate the effect of admixture on the stability of LD score estimates, we first calculated LD scores with genomic window sizes ranging from 050 cM in both European (EUR, N=503) and admixed American (AMR, N=347) populations from the 1000 Genomes Project 23 . As window size increases, we expect the mean LD score to plateau because LD should be negligible for large enough distance. If the mean LD score does not plateau, but continues to rise with increasing window size, then one of two possibilities may apply: (1)   beyond 1cM in size, as previously reported. However, in the AMR population the mean LD score estimates continued to increase concavely with increasing window size. In contrast, when we applied covLDSC with 10 PCs to calculate covariate adjusted LD scores, we observed that LD score estimates plateaued for both EUR and AMR at a 1cM and 20cM window size respectively (<1% increase per cM, Supplementary Table 10 ). This suggested that covLDSC was able to correct the longrange LD due to admixture and yielded stable estimates of LD scores ( Method, Supplementary Figure 1 ), and also that covLDSC was applicable in homogeneous populations ( Supplementary Table 10 ). The larger window size for the AMR population was needed due to residual LD caused by recent admixture. We next tested the sensitivity of the LD score estimates with regard to the number of PCs included in the covLDSC. We observed that in the AMR panel, LD score estimates were unaffected by adding PCs and by increasing window sizes above 20cM ( Supplementary Figure 2 ).

Simulations with simulated genotypes
To assess whether covLDSC produces unbiased estimates of , we first simulated h g 2 genotypes of admixed individuals ( Methods ). We simulated genotypes of 10,000 unrelated diploid individuals for approximately 400,000 common SNPs on chromosome 2 in a coalescent framework using msprime 20 . First, we tested LDSC and covLDSC with different admixture proportions between two ancestral populations, and a quantitative phenotype with a of 0.4 h g 2 using an additive model ( Methods ). We observed that as the proportion of admixture increased, for LDSC increasingly underestimated true by as much as 18.6%. In marked contrast, ĥ g Simulation results with real genotypes We next examined the performance of both unadjusted LDSC and covLDSC on real genotypes of individuals from admixed populations. We used genotype data from SIGMA, which includes  Table 11 ).
Thus far, we have used covLDSC by calculating LD scores on the same set of samples that were used for association studies (insample LD scores). In practical applications, computing LD scores on the whole data set can be computationally expensive and difficult to obtain, so we  Figure 10 ). We repeated these analyses in simulated phenotypes in the SIGMA cohort. We subsampled the SIGMA cohort, and obtained unbiased estimates when using as few as 1,000 samples ( Figure 2d ). We therefore recommend computing insample LD scores on a randomly chosen subset of at least 1,000 individuals from a GWAS in our approach. GCTAbased results would be impossible to obtain on the much larger datasets, for example the 23andMe cohort described below.  Table 12 ). For each phenotype, we estimated using the same populationspecific insample LD scores. For most phenotypes, h g 2 the reported was similar among the three population groups with a notable exception for h g 2 age at menarche ( Figure 3 ). This suggested possible differences ( between .1 0 p = 7 1 3 Latinos and Europeans) in the genetic architecture of these traits between different ethnic groups. It has been long established that there is population variation in the timing of menarche 35,36 . Early menarche might influence the genetic basis of other medically relevant traits since early age at menarche is associated with a variety of chronic diseases such as childhood obesity, coronary heart disease and breast cancer 37,38 . These results highlighted the importance of including diverse populations in genetic studies in order to enhance our understanding of complex traits that show differences in their genetic heritability.

Tissue type specific analysis
We applied stratified covLDSC to sets of specifically expressed genes 13  Latino, and African American populations. We only tested height, BMI and morning person, which were the three traits that had heritability zscores larger than seven in at least two populations 6 ( Supplementary Table 13 ). We also performed inversevariance weighting metaanalysis across the three populations ( Supplementary Table 14  African Americans: slope=0.67, se=0.12, ) ( Figure 4(a)(e) ). The sizes of these .4 0 p = 3 1 8 three brain structures have been shown to be correlated with BMI using magnetic resonance imaging data 39 . The midbrain and the limbic system are highly involved in the food rewarding signals through dopamine releasing pathway 40 . Furthermore, the hypothalamus in the limbic system releases hormones that regulate appetite, energy homeostasis and metabolisms, like leptin, insulin, and ghrelin 40,41 . For height, similar to previously reported associations 13 , we also identified enrichments in the gene sets derived from musculoskeletal and connective tissues. In the metaanalysis, the three most significant enrichments were cartilage ( Evidence showed that circadian rhythm was controlled by the suprachiasmatic nucleus, the master clock in our brain, and also the circadian oscillator that resides in neurons of the cerebral cortex 46,47 , 48 . We also found unique enrichments of esophagus muscularis and the esophagus gastroesophageal junction in the Latino populations, but the heterogeneity test showed that the difference is not significant ( = 0.49 and 0.50 respectively). We observed that the slope of regression between Latino coefficients and European coefficients across gene sets was 0.70 (se=0.062, ) ( Figure 4(k)(n) ). Compared to the original LDSCSEG,

Discussion
As we expand genetic studies to explore admixed populations around the world, extending statistical genetics methods to make inferences within admixed populations is crucial. This is particularly true for methods based on summary statistics, which are dependent on the use of LD scores that we showed to be problematic in admixed populations. In this study, we confirmed that original LDSC and other summary statisticsbased methods, such as PCGCs 49 and SumHer 50 that were originally designed for homogenous populations, should not be applied to admixed populations. We introduced covLDSC which regresses out global PCs on individual genotypes during the LD score calculation, and showed it can yield unbiased LD scores, heritability estimates and its enrichment, such as traitrelevant cell and tissue type enrichments, in homogenous and admixed populations.
Although our work provides a novel, efficient approach to estimate genetic heritability and to confidence that this tissue may account for polygenic heritability. Larger sample sizes are needed to increase the power of our current analyses and to enhance our understanding of how genetic variants that are responsible for heritable phenotypic variability differ among populations.
`` ¶ As the number of admixed and other diverse GWAS and biobank data become readily available 1,53,61 , our approach provides a powerful way to study admixed populations. .

Appendix Insample versus outofsample LD
To test the reliability of using an outofsample reference LD panel for covLDSC applications, we first examined the performance of outofsample LD scores obtained from 1,000 samples with a perfectly matching demographic history in the simulated genotypes. covLDSC yielded unbiased estimates when using 1,000 samples in an outofsample reference panel with a perfectly matching population structure ( Supplementary Figure 10 ). Next, we tested the accuracy of heritability estimates when using 1000 Genome Project to obtain outofsample LD scores. When using the AMR panel as a reference panel for the SIGMA cohort, we observed an unbiased estimate ( , Figure 2d ). This suggested that the AMR panel included in  Figure 9d ). This is probably due to attenuation bias from noisily estimated LD scores at N<1,000. Next, we explored the feasibility of applying an outofsample reference panel in cell and tissue type specific analyses. To this end, we used 1000 Genomes AMR samples (N=347) to obtain stratified LD scores and applied it on summary statistics obtained from 23andMe. In contrast to using insample LD scores, we discovered no significant tissue type enrichment ( Supplementary Figure 16 ) when using LD scores obtained from the 1000 Genomes. This suggested an outofsample reference panel may significantly reduce the power for detecting trait relevant tissue types in admixed populations.
We therefore caution that when using 1000 Genomes or any outofsample reference panels for a specific admixed cohort, users should ensure that the demographic histories are shared between the reference and the study cohort. We highly recommend computing insample LD scores on a randomly chosen subset of at least 1,000 individuals from a GWAS. We also strongly encourage cohorts to release their summary statistics and insample covariateadjusted LD scores at the same time to facilitate future studies.  The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/503144 doi: bioRxiv preprint The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/503144 doi: bioRxiv preprint The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/503144 doi: bioRxiv preprint . CC-BY 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/503144 doi: bioRxiv preprint . CC-BY 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/503144 doi: bioRxiv preprint Table 1. estimates of height, BMI and type 2 diabetes using different heritability h g 2 estimation methods. Reported values are estimates of (with standard deviations in h g 2 brackets) from LDSC using a 20cM window, covLDSC using a 20cM window and 10 PCs, and GCTA using REAP to obtain the genetic relationship matrix with adjustment by 10  . CC-BY 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/503144 doi: bioRxiv preprint