Abstract
Over the past decade, summary statistics from genome-wide association studies (GWAS) have been used to detect and quantify polygenic adaptation in humans. Several studies have reported signatures of natural selection at sets of SNPs associated with complex traits, like height and body mass index. However, more recent studies suggest that some of these signals may be caused by biases from uncorrected population stratification in the GWAS data with which these tests are performed. Moreover, past studies have predominantly relied on SNP effect size estimates obtained from GWAS panels of European ancestries, which are known to be poor predictors of phenotypes in non-European populations. Here, we collated GWAS data from multiple anthropometric and metabolic traits that have been measured in more than one cohort around the world, including the UK Biobank, FINRISK, Chinese NIPT, Biobank Japan, APCDR and PAGE. We then evaluated how robust signals of polygenic adaptation are to the choice of GWAS cohort used to identify associated variants and their effect size estimates, while using the same panel to obtain population allele frequencies (The 1000 Genomes Project). We observe many discrepancies across tests performed on the same phenotype and find that association studies performed using multiple different cohorts, like meta-analyses, tend to produce scores with strong overdispersion across populations. This results in apparent signatures of polygenic adaptation which are not observed when using effect size estimates from biobank-based GWAS of homogeneous ancestries. Indeed, we were able to artificially create score overdispersion when taking the UK Biobank cohort and simulating a meta-analysis on multiple subsets of the cohort. This suggests that extreme caution should be taken in the execution and interpretation of future tests of polygenic adaptation based on population differentiation, especially when using summary statistics from GWAS meta-analyses.
Introduction
Most human phenotypes are polygenic: the genetic component of trait variation across individuals is caused by differences in genotypes between individuals at a large number of variants, each with a relatively small contribution to the trait (Fisher, 1918; Turelli, 2017). This applies to traits as diverse as a person’s height, their risk of schizophrenia or their risk of developing arthritis. The study of complex traits spans more than a century but only in the last two decades has it become possible to systematically explore the genetic variation underlying these traits (Sella & Barton, 2019). The advent of genome-wide association studies (GWAS) has led to the identification of thousands of variants that are associated with such traits, either due to true biological mechanisms or because of linkage with causal variants (Visscher et al., 2012).
However, most research into the genetic aetiology of complex traits is based on GWAS data from populations of European ancestries (Popejoy & Fullerton, 2016). This bias in representation contributes to existing disparities in medical genetics and healthcare around the world (Martin et al., 2019). The low portability of European GWAS results – and, in particular, polygenic scores – to non-European populations is particularly concerning (Martin et al., 2017, 2019) (but see (Ragsdale et al., 2020)). Important trait-associated variants in non-European populations may be missed if they have low frequencies or are absent in European populations. Moreover, effect size estimates for an associated variant derived from a European-ancestry GWAS may not accurately reflect the effect of the same variant on the trait in other populations (Wojcik et al., 2019). This could be due to differences in epistasis, differences in linkage disequilibrium between causal and ascertained variants, or gene-by-environment interactions, to name a few causes (Guo et al., 2018). Additionally, negative selection and demographic history may cause differences in genetic architectures between populations (Durvasula & Lohmueller, 2019).
During the last decade, GWAS summary statistics have also been used to look for evidence of directional selection pushing a trait to a new phenotypic optimum, via allele-frequency shifts occurring across a large number of associated variants – a phenomenon known as polygenic adaptation (Pritchard et al., 2010; Hayward & Sella, 2019). For example, several studies have consistently found evidence for polygenic adaptation operating on height-associated variants in Europe, mainly across a south-to-north gradient (Turchin et al., 2012; Berg & Coop, 2014; Robinson et al., 2015; Mathieson et al., 2015; Racimo et al., 2018; Berg et al., 2017). To test for selection, these studies primarily relied on summary statistics from the GIANT consortium dataset, which is a meta-analysis of anthropometric GWAS from multiple European cohorts (Allen et al., 2010; Wood et al., 2014). They looked for overdispersion in the frequencies of trait-associated variants across populations, relative to a neutral null model. To account for potential confounding due to population stratification, some have tried to replicate this signal using family-based association studies (Allison et al., 1999; Robinson et al., 2015). Berg et al. (2019) and Sohail et al. (2019) showed that this signal of polygenic adaptation on height-associated variants in Europe (and possibly on other trait-associated variants) is attenuated and in some cases no longer significant when using effect size estimates from GWAS performed on the UK Biobank – a large cohort composed primarily of individuals of British ancestry (Bycroft et al., 2018). There is no single explanation yet for these contradictory findings, but the most plausible one is that previous studies were impacted by very subtle confounding due to uncorrected population stratification in GIANT, and that data from family-based studies were not analyzed properly (Berg et al., 2019; Sohail et al., 2019).
It is as yet unclear how the choice of GWAS cohort affects tests of polygenic adaptation based on allele frequency differences between populations. Each cohort differs in ancestries of participants, inclusion criteria of individuals, SNP ascertainment scheme and association method. Given the poor portability of polygenic scores across populations, is it also true that GWAS performed on different cohorts will result in inconsistent signals of selection? Can we narrow down on the reason for the inconsistencies in previous studies of polygenic adaptation by looking at a larger number of cohorts? Here, we collated GWAS summary statistics from multiple complex traits that have been measured in more than one cohort around the world. We then evaluated how robust signals of polygenic adaptation are to the choice of cohort used to obtain effect size estimates. Across all comparisons, we used the same population genomic panel to obtain population allele frequency estimates: The 1000 Genomes Project phase 3 (The 1000 Genomes Project Consortium, 2015). We observe many discrepancies across tests performed on the same phenotype and attempt to understand what may be causing these discrepancies. Although we compare results for several traits, we pay special attention to height, as it is the most well-characterized and studied complex trait in the human genetics literature, as well as a trait for which we have summary statistics from the largest number of GWAS cohorts.
Methods
GWAS summary statistics
We obtained GWAS summary statistics from five large-scale biobanks and two GWAS meta-analyses (Figure 1). Since we aim to make comparisons among them, our interest is focused on traits that were measured in at least two different studies. This resulted in a total of 30 traits being included in our analysis.
Below, we provide a brief summary of each of the GWASs we focused on. For an overview of the type of arrays and association methods used in each of these, see Table 1.
UKBB
Summary statistics from the GWAS performed on all UK Biobank traits (Bycroft et al., 2018). These were released by the Neale lab (round 2: http://www.nealelab.is/uk-biobank/), after filtering for individuals with European ancestries. The UK Biobank includes genetic and phenotypic data from participants from across the United Kingdom, aged between 40 and 69. The traits measured include a wide range of lifestyle factors, physical measurements, and other phenotypic information gained from blood, urine and saliva samples. The Neale lab performed association testing in ∼340,000 unrelated individuals.
FINRISK
Summary statistics from GWASs carried out using the National FINRISK 1992-2012 collection from Finland. The FINRISK study is coordinated by the National Institute for Health and Welfare (THL) in Finland and its target population is sampled from six different geographical areas in Northern Finland. The FINRISK study has been conducted as a cross-sectional population survey every 5 years since 1972 to assess the risk factors of chronic diseases and health behavior in the working age population. Blood samples were collected from 1992 to 2012. Anthropometric measures and other lifestyle information were also collected. The number of samples used for the GWAS results varies among the different traits (from ∼5,000 to ∼25,000) (Borodulin et al., 2018).
PAGE
Summary statistics from a multi-ethnic GWAS mega-analysis performed by the PAGE (Population Architecture using Genomics and Epidemiology) consortium (http://www.pagestudy.org/). This is a project developed by the National Human Genome Research Institute and the National Institute on Minority Health and Health Disparities, to characterize population-level disease risks in various populations from the Americas (Matise et al., 2011; Carlson, 2016). The association analysis was assembled from four different cohorts: the Hispanic Community Health Study/Study of Latinos (HCHS/SOL), the Women’s Health Initiative (WHI), the Multiethnic Cohort (MEC) and the Icahn School of Medicine at Mount Sinai BioMe biobank in New York City (BioMe). The authors performed GWAS on 26 clinical and behavioural phenotypes. The study includes samples from 49,839 non-European-descent individuals. Genotyped individuals self-reported as Hispanic/Latino (n = 22,216), African American (n = 17,299), Asian (n = 4,680), Native Hawaiian (n = 3,940), Native American (n = 652) or Other (n = 1,052). The number of variants analyzed varies from 22 to 25 million for continuous phenotypes and 11 to 28 million for case/control traits. Sample sizes ranged from 9,066 to 49,796 individuals (Wojcik et al., 2019).
BBJ
Summary statistics from GWASs performed using the Biobank Japan Project, which enrolled 200,000 patients from 12 medical institutions located throughout Japan between 2003-2008. The authors collected biological samples and other clinical information related to 47 diseases and self-reported anthropometric measures. GWAS were then conducted on approximately 162,000 individuals to identify genetic variants associated with disease susceptibility and drug responses. Around 6 million variants were included in this GWAS (Nagai et al., 2017; Hirata et al., 2017; Kanai et al., 2018).
Chinese NIPT
Summary statistics from a GWAS performed in China using non-invasive prenatal testing (NIPT) samples from 141,431 pregnant women. The participants were recruited from 31 administrative divisions across the country. The study aimed to investigate genetic associations with maternal and infectious traits, as well as two anthropometric traits: height and BMI (Liu et al., 2018a). It included ∼60,000 individuals. The number of imputed variants used was around 2 million.
APCDR
Summary statistics from GWASs performed using the African Partnership for Chronic Disease Research cohort, which was assembled to conduct epidemiological and genomic research of non-communicable diseases across sub-Saharan Africa. The dataset includes 4,956 samples from Uganda (Baganda, Banyarwanda, Burundi, and others). The authors performed GWAS on 34 phenotypes, including anthropometric traits, blood factors, glycemic control, blood pressure, lipid tests, and liver function tests (Heckerman et al., 2016).
GIANT
Summary statistics published by the Genetic Investigation of Anthropometric Traits consortium (2012-2015 version, before including UK Biobank individuals) (Wood et al., 2014; Locke et al., 2015). GIANT is a meta-analysis of summary association statistics for various anthropometric traits, and includes information from more than 250,000 individuals of European descent. The meta-analysis was performed on 2.5 million autosomal SNPs, after imputation.
Population genomic panel
We used the 1000 Genomes Project phase 3 release data (The 1000 Genomes Project Consortium, 2015) to retrieve the allele frequencies of trait-associated variants in different population panels sampled from around the world (Figure 1). We used these to compute polygenic scores for each panel, using autosomal SNPs only. The dataset contains samples from 2,504 people from 26 present-day population panels, whose abbreviations and descriptions are listed in Table S1.
Identifying trait-associated SNPs
We used summary statistics for a set of 30 traits that were measured in at least two of the previously-listed GWAS datasets. Table S2 shows the full list of the traits included in this analysis and the number of variants and individuals per trait. For each trait, we excluded triallelic variants, variants with a minor allele frequency lower than 0.01 and those classified as low-confidence variants whenever this information was available in the summary statistics file. We selected a set of trait-associated SNPs based on a P-value threshold, and the effect size estimates of these variants were used to construct a set of polygenic scores. To only include approximately independent trait-associated variants in our scores, we used a published set of 1,700 non-overlapping and approximately independent linkage-disequilibrium (LD) blocks to divide the genome (Berisa & Pickrell, 2016). We extracted the SNP within each block with the lowest association P-value. To investigate the robustness of signals to different filtering schemes, we used two P-value thresholds to extract significantly associated variants: 1) P < 1e − 5 and 2) the standard genome-wide significance cutoff, P < 5e − 8. As an example, Figure S1 shows the distribution of effect size estimates of 1,700 approximately independent SNPs for height (P < 1e − 5). In turn, Figure S2 shows the distribution of their effect size estimates scaled by the square root of the study’s sample size, which serves as a fairer comparison among studies. In order to build an empirical genome-wide covariance matrix (F-matrix) with non-associated SNPs, we extracted all SNPs with a P-value larger than 5e − 8 and then sampled every 20th “non-associated” SNP across the entire genome. We also used the LD score regression approach (Heckerman et al., 2016) to obtain an LD score regression intercept, LD score regression ratio, and a SNP heritability estimate for each GWAS that we looked at.
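The block-wise selection step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the manuscript: the record layout (`id`, `pos`, `p`, `beta`), the function name `top_snp_per_block` and the representation of LD blocks as a sorted list of start positions on one chromosome are all assumptions made for the example.

```python
from bisect import bisect_right


def top_snp_per_block(snps, block_starts, p_threshold=1e-5):
    """Keep, per LD block, the SNP with the lowest P-value below the threshold.

    snps: iterable of dicts with keys 'id', 'pos', 'p', 'beta' (hypothetical
          layout, all on a single chromosome).
    block_starts: sorted start positions of consecutive, non-overlapping blocks.
    """
    best = {}
    for snp in snps:
        if snp["p"] >= p_threshold:
            continue  # not associated at this cutoff
        block = bisect_right(block_starts, snp["pos"]) - 1
        if block < 0:
            continue  # position lies before the first block
        if block not in best or snp["p"] < best[block]["p"]:
            best[block] = snp
    return [best[b] for b in sorted(best)]


# Toy data: three blocks starting at positions 100, 1000 and 2000.
snps = [
    {"id": "rs1", "pos": 150, "p": 3e-9, "beta": 0.04},
    {"id": "rs2", "pos": 180, "p": 2e-6, "beta": -0.02},
    {"id": "rs3", "pos": 1200, "p": 4e-7, "beta": 0.01},
    {"id": "rs4", "pos": 2500, "p": 0.3, "beta": 0.00},
]
block_starts = [100, 1000, 2000]

lenient = top_snp_per_block(snps, block_starts, p_threshold=1e-5)
strict = top_snp_per_block(snps, block_starts, p_threshold=5e-8)
```

With the lenient cutoff, one SNP is retained in each of the first two blocks; with the genome-wide significance cutoff, only the first block contributes a SNP, which illustrates why the two thresholds yield score sets of different sizes.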
Neutrality test for polygenic scores
Polygenic risk scores aim to predict the genetic risk of a disease, or the genetic value of a trait, by combining the additive effect of a large number of trait-associated loci across the genome. For each trait, we obtained polygenic scores by computing the sum of allele frequencies at each of the top trait-associated SNPs from each block, weighted by their effect size estimates for that trait. The allele frequencies for these SNPs were retrieved from The 1000 Genomes Project population panels using glactools (Renaud, 2017). We then built a polygenic score vector for a given trait, $\vec{Z}$, that contains the polygenic scores of all populations for that trait. Let $\vec{p}_l \in [0, 1]^M$ be the vector of derived allele frequencies at locus $l$, where $p_{l,m}$ is the derived allele frequency at locus $l$ in population $m$, while $\alpha_l$ is the effect size estimate of the derived allele at locus $l$. Then, the vector of polygenic scores, $\vec{Z}$, has length $M$ equal to the number of populations ($M = 26$) and each element $Z_m$ is the polygenic score for population $m$:
$$Z_m = \sum_{l=1}^{L} \alpha_l \, p_{l,m}$$
Here, $L$ is the total number of trait-associated loci.
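As a rough illustration of the score construction, the sketch below computes one score per population as the effect-size-weighted sum of derived allele frequencies. It assumes the unscaled convention (some formulations multiply the sum by 2 to account for diploid genotypes); the function name `polygenic_scores` and the nested-list layout are hypothetical.

```python
def polygenic_scores(effects, freqs):
    """Population-level scores as effect-size-weighted sums of frequencies.

    effects: list of L effect size estimates for the derived alleles.
    freqs:   L rows, each a list of M derived allele frequencies
             (one per population). Layout is an assumption of this sketch.
    Returns a list of M scores.
    """
    if len(effects) != len(freqs):
        raise ValueError("need one frequency row per effect size")
    n_pops = len(freqs[0])
    scores = [0.0] * n_pops
    for alpha, row in zip(effects, freqs):
        if len(row) != n_pops:
            raise ValueError("every locus needs a frequency for each population")
        for m, p in enumerate(row):
            scores[m] += alpha * p
    return scores


# Two loci, two populations (toy numbers).
effects = [0.5, -0.25]
freqs = [
    [0.2, 0.6],  # locus 1 in populations 1 and 2
    [0.4, 0.8],  # locus 2 in populations 1 and 2
]
scores = polygenic_scores(effects, freqs)
```

In this toy case the two loci cancel out in the first population, while the second population ends up with a slightly positive score.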
Berg & Coop (2014) introduced a model designed for comparing polygenic scores across populations, in order to test for deviations from neutrality, which could perhaps be driven by adaptive divergence between populations. The test works by looking for overdispersion from a multivariate normal distribution, which would fit the distribution of scores if this was determined purely by genetic drift.
Berg & Coop (2014) showed that, under a purely neutral model of genetic drift, the joint distribution of $\vec{Z}$ across closely-related populations should be approximately multivariate normal:
$$\vec{Z} \sim \mathrm{MVN}(\mu \vec{1}, V_A F)$$
where $\vec{1}$ is a vector of ones and:
$$\mu = \sum_{l=1}^{L} \alpha_l \, \bar{p}_l, \qquad V_A = \sum_{l=1}^{L} \alpha_l^2 \, \bar{p}_l (1 - \bar{p}_l)$$
and $\bar{p}_l$ is the average allele frequency of locus $l$ across all populations. The matrix $F$ is a genome-wide covariance matrix that captures the co-ancestry among each pair of populations (Berg & Coop, 2014). Based on this null model, we can measure the Mahalanobis distance of the observed $\vec{Z}$ from its expectation under neutral genetic drift by computing $Q_X$:
$$Q_X = \frac{(\vec{Z} - \mu \vec{1})^T F^{-1} (\vec{Z} - \mu \vec{1})}{V_A}$$
Under neutrality, the $Q_X$ statistic is expected to follow a chi-squared distribution with $M - 1$ degrees of freedom, $Q_X \sim \chi^2_{M-1}$ (Berg & Coop, 2014). A significantly large value of $Q_X$ indicates that there is an excess of variance in $\vec{Z}$ that cannot be explained by drift alone.
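The following sketch shows how a Mahalanobis-type statistic of this kind could be computed from a score vector. It is an illustration under several assumptions: scores are taken to be unscaled weighted sums of allele frequencies, so that the neutral mean is the weighted sum of average frequencies and the scaling variance is the sum of squared effects times p(1 − p); the covariance matrix F is supplied by the caller and is assumed invertible (how it is estimated and made full rank is not shown); and the names `solve_linear` and `qx_statistic` are hypothetical. A small Gaussian elimination routine is used only to keep the example dependency-free.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def qx_statistic(scores, f_matrix, effects, mean_freqs):
    """Scaled squared distance of the scores from their neutral expectation."""
    mu = sum(a * p for a, p in zip(effects, mean_freqs))
    v_a = sum(a * a * p * (1.0 - p) for a, p in zip(effects, mean_freqs))
    deviations = [z - mu for z in scores]
    # F^{-1} (Z - mu) is obtained by solving a linear system, not by inversion.
    weighted = solve_linear(f_matrix, deviations)
    return sum(d * w for d, w in zip(deviations, weighted)) / v_a


# Toy example: one locus, two populations, correlated drift between them.
qx = qx_statistic(
    scores=[0.6, 0.4],
    f_matrix=[[2.0, 1.0], [1.0, 2.0]],
    effects=[1.0],
    mean_freqs=[0.5],
)
```

Larger values indicate that the scores are more spread out than the covariance structure in F would predict; converting the value into a P-value additionally requires a chi-squared tail probability or one of the randomization schemes.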
P-values via randomization schemes
To avoid relying on the assumption that $\vec{Z}$ follows a multivariate normal distribution under neutrality, we also obtained P-values via two alternative methods (Berg & Coop, 2014; Berg et al., 2017; Racimo et al., 2018). The first one relies on obtaining neutral pseudo-samples by randomizing the sign of the effect size estimates of all trait-associated SNPs, and then recomputing $Q_X$. The second one involves obtaining pseudo-samples by sampling random SNPs across the genome with the same allele frequency distribution in a particular (target) population as the SNPs used to compute $Q_X$. For each trait-associated SNP, we thus sampled a new SNP from a subset of the non-associated SNPs whose frequencies lie in the range $[p - 0.01, p + 0.01]$, where $p$ is the derived allele frequency of the trait-associated SNP. Then, we obtained a new P-value by computing the $Q_X$ statistic on each of the pseudo-samples $i$:
$$P = \frac{1}{N} \sum_{i=1}^{N} I\left(Q_X^{(i)} > Q_X^{\mathrm{obs}}\right)$$
Here, $Q_X^{(i)}$ is the $Q_X$ statistic computed on pseudo-sample $i$, $Q_X^{\mathrm{obs}}$ is the $Q_X$ statistic computed on the true set of trait-associated SNPs, $I()$ is an indicator function and $N$ is the number of pseudo-samples used, which was set to 1,000. We tested the effect of using different population panels as our ‘target’ population for the frequency-matching scheme. Since we are utilizing seven GWAS cohorts that are composed of Latin American, Asian (Japanese and Chinese), sub-Saharan African and European (Finnish and British) individuals, we decided to use population panels from the 1000 Genomes that roughly matched these ancestries: PUR, CHB, JPT, LWK, FIN and GBR.
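The two randomization schemes can be sketched as follows. This is an illustrative outline only: the helper names, the dictionary layout for candidate SNPs and the use of a strict inequality when counting exceedances are assumptions, and `toy_statistic` is a stand-in for the actual test statistic, which would be recomputed from the population scores on each pseudo-sample.

```python
import random


def empirical_p_value(observed, pseudo_stats):
    """Fraction of pseudo-sample statistics that exceed the observed one."""
    if not pseudo_stats:
        raise ValueError("need at least one pseudo-sample")
    exceed = sum(1 for q in pseudo_stats if q > observed)
    return exceed / len(pseudo_stats)


def flip_signs(effects, rng):
    """Scheme 1: flip each effect size sign independently with probability 1/2."""
    return [a if rng.random() < 0.5 else -a for a in effects]


def frequency_matched_snp(p, candidates, rng, window=0.01):
    """Scheme 2: draw a non-associated SNP with frequency in [p - w, p + w]."""
    pool = [c for c in candidates if abs(c["freq"] - p) <= window]
    if not pool:
        raise ValueError("no non-associated SNP within the frequency window")
    return rng.choice(pool)


def sign_randomization_p(effects, statistic, n_pseudo=1000, seed=0):
    """Empirical P-value of `statistic` under random sign flips."""
    rng = random.Random(seed)
    observed = statistic(effects)
    pseudo = [statistic(flip_signs(effects, rng)) for _ in range(n_pseudo)]
    return empirical_p_value(observed, pseudo)


def toy_statistic(effects):
    """Stand-in statistic for the example: squared sum of effect sizes."""
    return sum(effects) ** 2


# All effects share a sign, so no sign-flipped sample can exceed the observed value.
p_toy = sign_randomization_p([1.0] * 6, toy_statistic)

candidates = [
    {"id": "n1", "freq": 0.10},
    {"id": "n2", "freq": 0.205},
    {"id": "n3", "freq": 0.50},
]
matched = frequency_matched_snp(0.2, candidates, random.Random(1))
```

Passing the statistic in as a callable keeps the resampling logic separate from the way the statistic itself is computed, so the same wrapper works for either scheme.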
Assessing different association methods
We were also interested in evaluating the effects of different types of association methods on the significance of the QX statistic. We used the UKBB cohort to perform different types of association studies on height. Starting from 805,426 genotyped variants across the genome, we restricted to SNPs with a minor allele frequency (MAF) < 5% globally, and performed associations on three different sets of individuals from the UKBB cohort: 1) self-reported white British individuals (“British”), 2) self-reported “white” individuals, and 3) “all ethnicities”, i.e. a UKBB set including all self-reported ethnicity categories. We applied the following quality filters in each of the resulting sets: 1) removed variants with P < 1e − 10 from the Hardy-Weinberg equilibrium test, 2) removed variants with MAF < 0.1% in the set, 3) removed variants with an INFO score less than 0.8, 4) removed variants outside the autosomes, and 5) removed individuals that were 7 standard deviations away from the first six PCs in a PCA of the set. We then performed 1) a GWAS via a linear model (LM) with PLINK 1.9 (Chang et al., 2015) and 2) a GWAS via a linear mixed model (LMM) using BOLT-LMM (Loh et al., 2018), on each of the three sets (Table 2). We used sex, age, age², sex*age, sex*age² and the first 20 PCs as covariates.
We also aimed to test whether a meta-analysis approach could lead to overdispersion of polygenic scores, and consequently, an inflated QX statistic. Therefore, we created an artificial meta-analysis on the entire UKBB cohort, approximately emulating the number of individual sub-cohorts that were included in GIANT. We divided the “all ethnicities” set of individuals into 75 sets, using two different approaches. In one approach, we obtained 75 clusters from a K-means clustering of the first three principal components from a PCA of the UKBB individuals. In the other approach, we created 75 groups of equal size, randomly assigning individuals to each group, regardless of their PCA placement. We used PLINK 1.9 to perform a linear association model in each of the 75 clusters or groups. Afterwards, we integrated all summary statistics into a meta-analysis, using two different meta-analysis methods (Table 2): an inverse variance method and a sample size-based method, both implemented in METAL (Willer et al., 2010). As before, we used sex, age, age², sex*age, sex*age² and the first 20 PCs as covariates.
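For readers unfamiliar with the two meta-analysis schemes, the sketch below shows the standard fixed-effect formulas for a single SNP: inverse-variance weighting of per-cohort effect estimates, and a sample-size-weighted combination of per-cohort Z-scores. It is not METAL itself and does not reproduce its input handling (allele alignment, genomic control, and so on); the function names are hypothetical.

```python
import math


def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse-variance meta-analysis for one SNP.

    betas: per-cohort effect size estimates.
    ses:   per-cohort standard errors.
    Returns the pooled effect and its standard error.
    """
    weights = [1.0 / (se * se) for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    return beta, math.sqrt(1.0 / total)


def sample_size_meta(z_scores, sample_sizes):
    """Sample-size-weighted combination of per-cohort Z-scores for one SNP."""
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    return numerator / math.sqrt(sum(w * w for w in weights))


# Two sub-cohorts of equal precision and equal size (toy numbers).
pooled_beta, pooled_se = inverse_variance_meta([0.10, 0.20], [0.05, 0.05])
pooled_z = sample_size_meta([2.0, 2.0], [100, 100])
```

Neither scheme sees individual-level data, so any structure that differs between sub-cohorts can only be corrected within each sub-cohort before pooling.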
Source code
The code used to perform the analyses in this manuscript is available at: https://github.com/albarema/GWAS_choice/.
Results
Robustness of signal of selection and population-level differences
We obtained sets of trait-associated SNPs for GWAS performed on seven different cohorts: UK Biobank, FINRISK, Chinese NIPT, Biobank Japan, APCDR, PAGE and GIANT. Using the effect size estimates from each GWAS, we calculated population-wide polygenic scores for each of the 26 population panels from the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015), using allele frequencies from each population panel. We then tested for overdispersion of these scores using the QX statistic, which was designed to detect deviations from neutral genetic drift affecting a set of trait-associated SNPs (Berg & Coop, 2014). We focused on 30 traits that were phenotyped in two or more cohorts, so that we could compare the P-value of this statistic using effect size estimates from at least two different cohorts (see Methods).
We applied the QX statistic to each of the 30 traits by selecting SNPs we deemed to be significantly associated with each trait. We used two different P-value cutoffs to select these SNPs: 1) a lenient cutoff, P < 1e − 5 and 2) the standard genome-wide significance cutoff P < 5e − 8. To verify that significant P-values of the QX statistics were not due to violations of the chi-squared distributional assumption, we also computed P-values using two randomization schemes: one is based on randomizing the sign of the effect size estimates of the trait-associated SNPs, while the other is based on using frequency-matched non-associated SNPs (see Methods). In general, we observe few notable differences in P-values when using the three schemes, although the sign-randomization scheme is sometimes inconsistent with the other two (Figures S3 and S4). The number of significant SNPs for each of the traits under the two cutoffs is shown in Figures S5 and S6.
After controlling for multiple testing (P < 0.05/n, where n is the number of assessed GWAS), we find only a few traits with significant overdispersion in QX. Under the P < 1e − 5 SNP-association cutoff, the only traits with significant overdispersion in at least one cohort are height and white blood cells (WBC) (Figures 2, S3). Potassium urine levels and mean corpuscular hemoglobin (MCH) also result in significant values of QX when using the P < 5e − 8 SNP-association cutoff (Figures S4, S7).
Figure 3 shows polygenic scores computed for each of the 1000 Genomes populations for height. In agreement with previous studies (Berg et al., 2019; Sohail et al., 2019), we observe that differences in polygenic height scores when using effect size estimates from the UKBB are greatly attenuated relative to differences in scores built when using estimates from GIANT. Extending this analysis across all datasets, we observe that PAGE scores are also over-dispersed, though in different directions than GIANT scores (Figure 3). Additionally, the observation that Europeans have very high polygenic scores when using GIANT effect size estimates cannot be replicated using any of the other GWAS estimates. Indeed, we obtain significant QX P-values only for PAGE and GIANT, after multiple testing correction (using either association P-value threshold to select SNPs). The numbers of SNPs used for polygenic scores are shown in Table 3. The LD score regression ratio is substantially higher for PAGE and FINRISK than for the other cohorts (Tables 4, S3).
We also tested how the choice of the SNP association P-value threshold influenced the results. Sohail et al. (2019) showed that between-population differences in polygenic height scores grow stronger when using more lenient SNP association P-value cutoffs. However, one then runs the risk of including more variants that may be significantly associated due to uncorrected population stratification. We see smaller score overdispersion when using the genome-wide significant SNPs than when using the more lenient P-value cutoff (right column, Table 3 and Figure S8). We also looked in closer detail at other traits with evidence for significant overdispersion via the QX test. White blood cell count (WBC), for example, shows strong overdispersion when using PAGE, but not when using the UKBB or BBJ effect size estimates (Figures S9, S10). We also observe a similar pattern when looking at mean corpuscular hemoglobin (MCH) scores (Figures S11, S12). In the case of urine levels of potassium, larger between-population differences are found in UKBB than in BBJ, when we use the stringent threshold (Figures S13, S14). In general, we observe that between-population differences in scores tend to be more similar between studies when using the stricter SNP-association P-value threshold than when using the more lenient threshold.
Relationship between GWAS effect size estimates
To better understand where the differences in overdispersion of QX could stem from, we performed pairwise comparisons of the effect size estimates from the different GWAS. Since the UKBB GWAS is the GWAS with the largest number of individuals, we decided to compare the estimates from each of the other studies to the UKBB estimates. Here, we only focused on the 1,700 approximately-independent SNPs (the best tag SNP within each LD block). We began by only using SNPs that were classified as significant in UKBB using the lenient cutoff (P < 1e − 5) (Figure 4). We observe that effect size estimates are correlated, as expected, but the strength of this correlation varies strongly across comparisons. UKBB vs. GIANT shows the highest correlation, while UKBB vs. APCDR shows the lowest. Importantly, those SNPs that also have a significant P-value in the non-UKBB GWAS in each comparison (colored in red in Figure 4) show a higher correlation than the rest of the SNPs.
The same analysis was carried out with SNPs classified as significant in each of the non-UKBB studies. The correlation of effect size estimates is generally lower (Figure S15), and a high percentage of SNPs deemed to be significant in the non-UKBB GWAS have effect size estimates approximately equal to zero in the UKBB GWAS (Figure S15). This pattern is stronger when we do not filter the 1,700 approximately independent SNPs by a particular SNP-association P-value cutoff (Figures S16 and S17).
We computed pairwise Pearson correlation coefficients between estimated effect sizes in the UKBB GWAS and each of the other GWAS (Table S4 when using SNPs that are significant in UKBB and Table S5 when using SNPs that are significant in the other GWAS). We observe that GWAS performed on individuals living geographically close to Britain have higher correlations to UKBB estimates than those that are performed on distant individuals. For instance, GIANT and FINRISK, both European-based GWAS, show high correlation in effect size estimates with UKBB (0.9958 and 0.790, respectively). In contrast, the GWAS based on an African panel, APCDR, shows an extremely low correlation in effect size estimates with UKBB (correlation coefficient = 0.087). This cohort has by far the smallest sample size of all the cohorts we analyzed (n = 4,778), which may explain the low correlation.
We also observe higher correlations when filtering for significantly trait-associated SNPs using either of the two SNP significance thresholds (P < 1e − 5 and P < 5e − 8) from the 1700 LD blocks. The sample size of the GWAS also affects the correlation in effect size estimates. We can see in Figure S18 that there is a positive relationship between the number of samples included in the non-UKBB GWAS and the Pearson correlation coefficients of the estimated effect sizes to the UKBB GWAS, which has the largest sample size.
Evidence for population stratification
Berg et al. (2019) looked for latent population stratification by studying the relation between allele frequency differences between populations and differences in effect size estimates between two GWAS. Presumably, if neither GWAS is affected by population stratification, there should not be a relation between these two variables. We plotted SNP differences in allele frequency between northern European and East Asian, African, and southern European samples (GBR, CHB, LWK and TSI subsets of 1000 Genomes, respectively) against the difference in effect size between a pair of GWAS. When comparing the UKBB and GIANT, we replicate the signal of correlation in differences between northern and southern Europeans from Berg et al. (2019) (P < 1e − 5, see Figure S19). This pattern is also observed in the GBR vs. CHB and GBR vs. LWK comparisons (Figures S20 and S21, panels A and B). However, these differences are not observed for any other pairwise GWAS comparisons (Figures S19, S22, S23 and S24).
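One simple way to quantify such a relation is to correlate, across SNPs, the allele frequency difference between two population panels with the effect size difference between two GWAS. The sketch below uses a Pearson correlation as a stand-in; the text does not specify the exact test used, and the function names and input layout are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def stratification_signal(freq_pop_a, freq_pop_b, beta_gwas_1, beta_gwas_2):
    """Correlate per-SNP frequency differences with effect size differences."""
    freq_diff = [a - b for a, b in zip(freq_pop_a, freq_pop_b)]
    beta_diff = [b1 - b2 for b1, b2 in zip(beta_gwas_1, beta_gwas_2)]
    return pearson_r(freq_diff, beta_diff)


# Toy case in which effect differences track frequency differences exactly.
r = stratification_signal(
    freq_pop_a=[0.5, 0.6, 0.7, 0.8],
    freq_pop_b=[0.4, 0.4, 0.4, 0.4],
    beta_gwas_1=[0.01, 0.02, 0.03, 0.04],
    beta_gwas_2=[0.0, 0.0, 0.0, 0.0],
)
```

A correlation near zero is what one would expect if neither set of effect size estimates is confounded along the axis separating the two panels; a strong correlation, as in this toy case, would point to stratification in at least one of the two GWAS.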
Assessing different association methods
We find strong differences in the amount of polygenic score overdispersion across GWASs, but the GWASs we assessed were carried out using different association methods. We wanted to evaluate the effect of different association methods on the overdispersion of polygenic scores, while using the same underlying association cohort. We chose the UKBB cohort for this assessment, as it is the largest cohort among the ones we tested. We first split the UKBB cohort into three increasingly more expansive sets: 1) “British”, 2) “White”, and 3) “all ethnicities”, based on a self-identified ethnicity classification carried out by the UKBB consortium. We then performed linear model (LM) and linear mixed model (LMM) association methods on each of the three sets of individuals (Table 2, see Methods). We also wanted to see if we could replicate the strong overdispersion in polygenic scores we saw in the meta-analysis cohorts, like GIANT, by partitioning the entire UKBB cohort into 75 cohorts (approximately emulating the number of cohorts in GIANT), and then performing a meta-analysis on the summary statistics obtained from individual GWASs performed separately on each of these cohorts (Table 2, see Methods).
The population-wide polygenic scores and the QX scores obtained using effect size estimates from each of these different methods are in Figures 5 and 6, respectively. There is increased power to detect height-associated SNPs when we use the mixed model. Regardless of whether one uses a linear or a linear mixed model, GWASs performed on a more expansive category of people (“all ethnicities”) lead to greater overdispersion of polygenic scores than GWASs performed on more restrictive categories (“British” or “White”). Additionally, our artificial meta-analysis on the UKBB data resulted in even stronger overdispersion of the scores and, consequently, an even more strongly inflated QX statistic. Indeed, the most extreme QX-derived P-values across settings were those from the meta-analyses (Figure 6). This increased overdispersion is particularly evident when looking at the scores of European (FIN, CEU, GBR) and Latin American (PEL, CLM, MXL) populations in the meta-analysis setting (Figure 5).
Discussion
When looking for signals of polygenic adaptation based on population differentiation, we observe highly inconsistent signals depending on the GWAS cohort from which we obtained effect size estimates. Because we use the exact same population panels to obtain population allele frequencies in all tests, the inconsistencies must stem from differences in the effect size estimates among the different GWASs. These inconsistencies are not limited to tests involving height-associated SNPs: they also appear in tests involving SNPs associated with other phenotypes, like white blood cell count, mean corpuscular hemoglobin and potassium levels in urine.
For those phenotypes for which we have effect size estimates from more than two different sources, we find that the GWASs performed using multiple different cohorts – GIANT and PAGE – show strong overdispersion in genetic scores and, consequently, strong evidence for selection when relying on the QX statistic. In biobank-based GWASs conducted using panels with relatively homogeneous ancestries, the signals of selection are generally (but not always) more attenuated. Moreover, in the case of height, the distributions of genetic scores obtained using GIANT estimates and using PAGE estimates are not consistent with each other, suggesting that the differences in scores are unlikely to be driven by a biological signal that the biobank-based tests fail to pick up.
Furthermore, when we performed an artificial meta-analysis on the UKBB data, emulating the methodology of GIANT, we observed greater dispersion of polygenic scores among populations than when using single GWAS cohorts of more homogeneous ancestries, echoing findings by Kerminen et al. (2019) at a more localized geographic scale. This increase in score dispersion in turn causes an inflation of the QX statistic. Overall, this suggests that uncorrected population structure in GWAS meta-analyses may be a strong confounder for tests of polygenic adaptation based on patterns of population differentiation in polygenic scores.
Another possible cause for these inconsistencies could be differences in the number of individuals included in each GWAS, leading to differences in power to detect polygenic adaptation on trait-associated variants. Indeed, in some of the smaller cohorts (FINRISK and APCDR) we observe little to no evidence for strong deviations from neutrality in the distribution of genetic scores across populations.
We also find that the type of test performed to obtain QX P-values has little effect on those P-values, at least relative to the magnitude of the differences observed when using effect size estimates from different GWAS cohorts. Those phenotypes and GWAS cohorts for which we find significant overdispersion via the chi-squared distributional assumption for the QX statistic also tend to be the ones for which we find significant overdispersion when not relying on it. This suggests that this assumption – while not entirely accurate (Berg & Coop, 2014) – is still reasonably valid across all the phenotypes we looked at, assuming the effect size estimates are not affected by stratification.
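The comparison between the two kinds of P-value can be sketched as follows. The sketch assumes, following Berg & Coop (2014), that under neutrality QX is approximately chi-squared distributed with M − 1 degrees of freedom for M populations. It also assumes that an alternative null distribution of QX values is already available from some resampling scheme, which is not specified here. The helper names, the +1 correction in the empirical P-value and all numbers are illustrative assumptions rather than the implementation used in this study.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df >= 1.

    Uses the recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / gamma(a + 1)
    for the regularized upper incomplete gamma function, with y = x / 2.
    """
    if x < 0 or df < 1:
        raise ValueError("x must be non-negative and df a positive integer")
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)             # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))  # Q(1/2, y)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def empirical_p_value(observed, null_values):
    """Share of null QX values at least as large as the observed one (+1 correction)."""
    exceed = sum(1 for v in null_values if v >= observed)
    return (1 + exceed) / (1 + len(null_values))

# Hypothetical inputs: an observed QX, 26 populations, and a small set of
# null QX values from an unspecified resampling scheme.
observed_qx = 40.0
n_pops = 26
null_qx = [18.2, 22.5, 25.1, 27.9, 30.4, 33.0, 35.6, 41.3, 24.4]

p_chi2 = chi2_sf(observed_qx, n_pops - 1)
p_emp = empirical_p_value(observed_qx, null_qx)
print(p_chi2, p_emp)
```

In practice the null set would need to be far larger than in this toy example, since the smallest attainable empirical P-value is 1 / (1 + number of null values).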
When focusing on height, which is the phenotype for which we have summary statistics from the largest number of GWAS cohorts, we were only able to detect evidence for uncorrected European stratification in GIANT (Berg et al., 2019), i.e. a strong correlation between cross-population allele frequencies and differences in effect size estimates between cohorts. However, Haworth et al. (2019), Novembre & Barton (2018) and Rosenberg et al. (2019) encourage caution in the interpretation of signals of polygenic adaptation, because residual stratification may be present even in GWAS panels with no clear evidence for it.
In future studies of polygenic adaptation, we recommend the use of large homogeneous data sets and the verification of signals of polygenic adaptation in multiple GWAS cohorts (e.g. Chen et al., 2019). We also recommend caution even when statistics testing against neutrality are significant in multiple GWAS cohorts: it is still possible that all the GWAS cohorts are affected by subtle stratification, possibly along different axes of population structure. To try to avoid stratification issues, recent studies have proposed looking for evidence of polygenic adaptation within the same panel that was used to obtain SNP effect size estimates, i.e. avoiding comparisons between populations that include individuals from outside the GWAS used to obtain effect size estimates (e.g. Liu et al., 2018b). However, Mostafavi et al. (2019) recently showed that the accuracy of polygenic scores often depends on the age and sex composition of the GWAS participants, even when studying individuals of roughly similar ancestries within a single cohort, due to heritability differences along these axes of variation. This implies that ancestry-based stratification is not the only confounder that researchers should be aware of when trying to detect polygenic adaptation. Approaches based on tree sequence reconstructions along the genome (Speidel et al., 2019; Stern et al., 2020) appear to be a fruitful avenue of research towards the development of methods that can properly control for some of these confounders.
Overall, we urge caution in the interpretation of signals of polygenic adaptation based on human GWAS data, at least until we have robust generative models that can explain exactly how stratification is creeping into these tests (Young et al., 2019). Due to the high risk of misappropriation of this type of result by hate groups (Harmon, 2018), we also recommend that researchers make an effort to explain in their publications the caveats and problems associated with these tests (Novembre & Barton, 2018; Rosenberg et al., 2019), as well as the strong sensitivity of these tests to the choice of input datasets.
Supplementary Figures
Supplementary Tables
Acknowledgments
We thank Jeremy Berg for helpful comments on the manuscript, as well as Mark Daly, Samuli Ripatti, Jukka Koskela, Masahiro Kanai, Yukinori Okada, Yoichiro Kamatani, Evan Irving-Pease and Graham Gower for general advice and technical assistance in obtaining and handling association data. We would also like to thank the members of the Racimo group for feedback throughout the duration of the project. Finally, we thank all participants of the biobanks and association studies included in this work, for their valuable contribution to the study of genetic variation and disease. Access to the UK Biobank data was provided to AA via application number 32683 and to ARM via application number 31063. FR is supported by a Villum Fonden Young Investigator award (project no. 00025300). AR-M is supported by the Lundbeck Foundation GeoGenetics Centre grant (R302-2018-2155) and a Novo Nordisk grant to the GeoGenetics Centre (NNF18SA0035006). ARM is supported by funding from the National Institutes of Health (K99MH117229).