Abstract
Accurate assessment of health disparities requires unbiased knowledge of genetic risks in different populations. Unfortunately, most genetic association studies use genotyping arrays and European samples. Here, we integrate whole genome sequence data, GWAS results, and computer simulations to examine how ascertainment bias causes disease risks to be misinferred in non-study populations. We find that genetic disease risks are substantially overestimated for individuals with African ancestry – risk allele frequencies at known disease loci are 1.15% higher on average in Africa. These patterns hold for multiple disease classes (e.g., cancer, gastrointestinal, morphological, and neurological diseases). A contributing factor to this bias is that existing genotyping arrays are enriched for SNPs that have higher frequencies of ancestral alleles in Africa. Computer simulations of GWAS that use samples from bottlenecked non-African populations recapitulate regional differences in allele frequencies at disease susceptibility loci. These differences cause genetic disease risks to be overestimated for individuals with African ancestry and underestimated for individuals with non-African ancestry. We find that the extent of ascertainment bias depends on the genotyping platform used, numbers of cases and controls, demographic history, the proportion of ancestral vs. derived risk alleles, and choice of study population (African GWAS are less biased). Importantly, biases are only moderately reduced if GWAS use whole genome sequences and hundreds of thousands of cases and controls. Our results indicate that caution must be taken when using GWAS results from one population to predict disease risks in another population.
Introduction
In the past decade, over 3,000 genome-wide association studies (GWAS) have successfully identified more than 39,000 genetic variants that are associated with common diseases and other traits [1, 2]. Most GWAS use genotyping arrays to test whether specific risk alleles are more common in cases vs. controls. However, the vast majority of published GWAS have used samples of European ancestry [3, 4], and a looming challenge is to be able to generalize GWAS results across populations [5-10]. Results from GWAS can be combined to generate polygenic risk scores to predict individual risks of disease [11-13]. These polygenic risk scores involve summing the number of risk alleles in each individual's genome to quantify hereditary disease burdens. Further refinement of genetic risk scores involves weighting SNPs by effect size [14]. Additional complications for genetic risk scores include the “missing heritability” problem [15], which implies that the bulk of causal variants remain undiscovered. Diseases can also have different genetic architectures in different populations [16]. Because of these issues, genetic predictions of disease risk are not always accurate, and it is important to be able to distinguish between situations where genetic risks actually differ between populations and when predictions of genetic health disparities are spurious.
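As a minimal sketch of the risk score calculation described above, the following Python snippet sums risk allele counts for one individual and optionally weights each SNP by an effect size. The SNP identifiers, genotype counts, and weights are hypothetical placeholders, and the use of log odds ratios as weights is an assumption made for illustration; it is not a description of any particular published score.

```python
def risk_score(genotypes, weights=None):
    """Sum risk allele counts across SNPs, optionally weighted by effect size.

    genotypes: dict mapping SNP id -> number of risk alleles carried (0, 1, or 2)
    weights:   optional dict mapping SNP id -> effect size (assumed here to be
               a log odds ratio); if omitted, every SNP has weight 1
    """
    total = 0.0
    for snp, count in genotypes.items():
        weight = 1.0 if weights is None else weights[snp]
        total += weight * count
    return total


# Hypothetical individual and hypothetical effect sizes, for illustration only.
genotypes = {"snp_a": 2, "snp_b": 0, "snp_c": 1}
log_odds = {"snp_a": 0.10, "snp_b": 0.25, "snp_c": 0.05}

unweighted = risk_score(genotypes)          # counts risk alleles only
weighted = risk_score(genotypes, log_odds)  # weights each SNP by effect size
print(unweighted, weighted)
```

Because both versions depend only on which SNPs are included and how many risk alleles are carried, any systematic shift in risk allele frequencies at the included SNPs carries over directly into the score.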
Genetic health disparities can arise when allele frequencies at disease-associated loci differ across populations [14]. These allele frequency differences are magnified for pairs of populations that do not share recent evolutionary history [17, 18]. Population bottlenecks and founder effects have influenced hereditary disease risks in a number of global populations. Many of these effects are disease-specific, such as elevated risks of cystic fibrosis among the Québécois [19] and cardiovascular disease among the descendants of the HMS Bounty mutineers [20]. Evolutionary history also affects whether there are genetic differences in disease risks across populations, including recent natural selection near disease susceptibility loci [21] and whether risk alleles are ancestral (shared with other primates) or derived (due to new mutations) [22, 23]. Although risks of individual diseases can differ across populations, the overall burden of hereditary diseases is expected to be similar across the globe [24]. Systematic departures from this null expectation may arise because many disease alleles are presently unknown.
Even if real differences exist between populations, SNP ascertainment bias can cause genetic disease risks to be misestimated. There are multiple sources of bias in GWAS, including choice of genotyping technology, the ancestry of study participants, and whether sample sizes are large or small [25-28]. Most commercially available genotyping arrays use SNPs that were originally ascertained in European populations, and arrays are enriched for intermediate frequency alleles (i.e., alleles with frequencies that are closer to 50%) [25]. Because of this, allele frequencies at presently known disease loci are not independent of the genotyping technology used to detect genetic associations. As of 2016, the ancestry of 81% of all GWAS samples was European and 14% was Asian [3], and this is likely to cause the set of known disease associations to be enriched for alleles that are polymorphic or intermediate frequency in Europe or Asia, but not Africa. There is also evidence that disease-associated alleles have elevated minor allele frequencies in study populations [5]. Biases in genetic studies parallel what is observed in social science research: most samples are from Western, educated, industrialized, rich and democratic (WEIRD) societies [29, 30]. An additional consideration is that large sample sizes are required to detect associations between SNPs and genetic diseases when risk alleles have small effect sizes or are rare [31]. Because of this, ascertainment bias is more problematic for GWAS that have small numbers of cases and controls.
At present, the extent to which ascertainment bias hinders precision medicine and personal genomics is unknown. To bridge this knowledge gap, we tested empirical data for systematic bias in risk allele frequencies across populations. Extensive computer simulations of GWAS were then used to provide insight into multiple causes of what appears to be a genetic health disparity (including the effects of different genotyping arrays, study designs, mode of inheritance, and evolutionary histories). Here, we focus on the problem of using disease associations discovered in one population to predict disease risks in another population, as opposed to whether GWAS findings can be successfully replicated across multiple populations.
Results
Empirical patterns of genetic risk
Allele frequencies at 3036 disease-associated loci were analyzed for each continental super-population in the 1000 Genomes Project dataset. Contrary to null expectations, the mean frequencies of risk alleles at disease susceptibility loci vary across populations (Fig. 1A). Specifically, the overall risk allele frequencies are significantly higher in African populations compared to non-African populations (mean difference: +1.15%, p-value = 0.02129, paired Wilcoxon signed-rank test). However, what appear to be genetic health disparities (elevated risk allele frequencies in Africa) are due to SNP ascertainment bias.
We explored differences in risk allele frequencies by binning each disease-associated locus into one of seven different categories: gastrointestinal (GI) or liver, metabolic, morphological, cancer, neurological, miscellaneous, and cardiovascular disease. As illustrated in Fig. 1A, population-level differences in risk allele frequencies persist when GWAS results are binned by disease type. Compared to other populations, African populations have the highest risk allele frequency in five out of seven disease types: metabolic (p-value = 0.005502), morphological (p-value = 0.09494), cancer (p-value = 0.1169), neurological (p-value = 0.0995), and miscellaneous disease (p-value = 0.3865, paired Wilcoxon signed-rank tests). African populations have intermediate frequencies of risk alleles at loci that are associated with GI or liver diseases (p-value = 0.6965), and lower frequencies of risk alleles at loci that are associated with cardiovascular disease (p-value = 0.01404, paired Wilcoxon signed-rank tests). Among non-African populations there was no underlying trend.
Further stratification according to ancestral vs. derived status reveals a clear pattern: disease types that have a larger proportion of ancestral alleles tend to have elevated risk allele frequencies in Africa (Fig. 1B). After binning GWAS SNPs by disease category, we find that the differences in the mean frequency of risk alleles between African and non-African populations are highly correlated with the proportion of risk alleles that are ancestral (r^2 = 0.842). This suggests that continental patterns of disease risk may vary for risk alleles that are ancestral vs. derived.
The joint site frequency spectrum (SFS) of risk alleles in African and non-African populations provides empirical evidence of SNP ascertainment bias (Fig. 2). In this study, we focused on unfolded allele frequencies, rather than minor allele frequencies. In general, ancestral risk alleles tend to be the major allele and derived risk alleles tend to be the minor allele. This is expected given that derived alleles are, by definition, evolutionarily younger than ancestral alleles. 69.2% of the ancestral risk alleles are found at higher frequency in African populations (below the diagonal), and 64.5% of the derived risk alleles are found at higher frequency in non-African populations (above the diagonal). The null expectation is that equal numbers of alleles would be found on each side of the diagonal. Examining the borders of the joint frequency spectrum between African and non-African populations emphasizes the effects of study populations in GWAS. Many disease-associated alleles are found at extreme allele frequencies in Africa (close to 0 or 1) and at intermediate allele frequencies outside of Africa. This occurs because most GWAS have used non-African samples and statistical power is maximized at intermediate frequencies.
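The comparison against the diagonal of the joint frequency spectrum can be sketched as follows. This is an illustrative Python snippet, not the analysis code used here: the frequency pairs are made-up toy values, and excluding loci that fall exactly on the diagonal is an assumption.

```python
def diagonal_split(pairs):
    """Fractions of loci with higher risk allele frequency inside vs. outside Africa.

    pairs: iterable of (african_freq, non_african_freq) tuples, one per locus.
    Loci with identical frequencies (on the diagonal) are excluded; this
    tie-handling rule is an assumption made for this sketch.
    """
    pairs = list(pairs)
    higher_afr = sum(1 for afr, non_afr in pairs if afr > non_afr)
    higher_non = sum(1 for afr, non_afr in pairs if afr < non_afr)
    informative = higher_afr + higher_non
    if informative == 0:
        raise ValueError("no loci fall off the diagonal")
    return higher_afr / informative, higher_non / informative


# Toy frequency pairs (African, non-African), for illustration only.
toy_loci = [(0.90, 0.60), (0.75, 0.80), (0.55, 0.30), (0.40, 0.40)]
frac_higher_afr, frac_higher_non = diagonal_split(toy_loci)
print(frac_higher_afr, frac_higher_non)
```

Under the null expectation stated above, both fractions would be close to 0.5; applying the same split separately to ancestral and derived risk alleles is what exposes the asymmetry.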
The difference in risk allele frequencies between African and non-African populations is expected to be zero when bias is absent. Conditioning on whether risk alleles are ancestral or derived reveals a striking pattern: ancestral risk alleles are found at much higher frequencies in Africa and derived risk alleles are found at much lower frequencies in Africa (Fig. 2B). The mean difference in ancestral risk allele frequencies between African and non-African populations is +9.51%, and the mean difference in derived risk allele frequencies between African and non-African populations is -5.40% (p-value < 2.2 x 10^-16 for both comparisons, Wilcoxon signed-rank tests). The overall continental difference in risk allele frequencies of +1.15% arises because 44% of presently known disease-associated SNPs have ancestral risk alleles and 56% of disease-associated SNPs have derived risk alleles.
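The arithmetic behind the overall difference can be checked with a short weighted average. The sketch below uses only the rounded values quoted in the text; the function name is a hypothetical helper introduced for illustration.

```python
def pooled_difference(prop_ancestral, diff_ancestral, diff_derived):
    """Weighted mean of the ancestral and derived frequency differences.

    prop_ancestral: proportion of risk alleles that are ancestral (0 to 1)
    diff_ancestral: mean African minus non-African difference, ancestral alleles (%)
    diff_derived:   mean African minus non-African difference, derived alleles (%)
    """
    prop_derived = 1.0 - prop_ancestral
    return prop_ancestral * diff_ancestral + prop_derived * diff_derived


# Rounded values from the text: 44% ancestral, +9.51% and -5.40%.
overall = pooled_difference(0.44, 9.51, -5.40)
print(round(overall, 2))  # 1.16
```

The result, roughly +1.16%, agrees with the reported +1.15% up to the rounding of the inputs. The same function also shows how the sign of the overall difference depends on the mix of alleles: with these two differences held fixed, it changes sign when the ancestral proportion drops below about 36%.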
Because many disease associations involve imputed SNPs, we tested whether continental differences in risk allele frequencies persist for SNPs that are not on the Affymetrix Genome-Wide Human SNP Array 6.0. For this set of disease-associated loci, we find that SNPs with ancestral risk alleles have higher allele frequencies in Africa (+8.63% on average) and that SNPs with derived risk alleles have lower allele frequencies in Africa (-4.83% on average). This suggests that biases persist even for imputed SNPs.
Genotyping arrays are biased
One potential source of bias is genotyping platform: GWAS use microarrays with pre-ascertained SNPs. Whole genome sequencing (WGS) data from the 1000 Genomes Project reveals that each population has a similar mean derived allele frequency (Fig. 3A). This is expected since all human populations share the same evolutionary distance to chimpanzees. Compared to WGS data, derived allele frequencies are elevated for SNPs on the Affymetrix Genome-Wide Human SNP Array 6.0 and the Illumina Omni 5M microarray. However, commonly used genotyping arrays also exhibit continental patterns of bias: derived allele frequencies in African populations are markedly lower than derived allele frequencies in non-African populations (p-value < 2.2 x 10^-16 for both arrays, Wilcoxon signed-rank tests). This bias is due to the fact that genotyping arrays contain SNPs that were ascertained in non-African populations. WGS data have an unbiased SFS with similar numbers of SNPs above and below the diagonal (Fig. 3B). By contrast, the Affymetrix Genome-Wide Human SNP Array 6.0 and the Illumina Omni 5M microarray are enriched for SNPs that are above the diagonal, i.e. SNPs with higher derived allele frequencies outside of Africa (Fig. 3C and Fig. 3D). This pattern mirrors what is seen for empirical GWAS data (Fig. 2A), which suggests that genotyping arrays contribute to continental differences in risk allele frequencies.
Simulated GWAS capture the effects of bias
Using computer simulations, we set out to test whether ascertainment bias is sufficient to explain observed patterns at disease-associated loci. Simulations use allele frequency data from the 1000 Genomes Project, knowledge of which SNPs are on genotyping arrays, and GWAS power calculations [31]. Importantly, these simulations do not assume that there are any underlying differences in hereditary disease risks across populations (i.e. simulated differences in risk allele frequencies are due to ascertainment bias). Results from computer simulations are similar to what is observed in empirical data: compared to non-African populations, African populations have elevated frequencies of ancestral risk alleles and reduced frequencies of derived risk alleles (Fig. 4). Note that empirical risk alleles have been discovered in a heterogeneous set of studies. By varying the parameters of GWAS simulations we are able to quantify individual effects of each potential source of ascertainment bias (study population, genotyping technology, sample size, and the dominance of disease alleles).
Choice of study population has a profound effect on the relative frequencies of risk alleles in different populations. Simulated GWAS that use African (AFR) samples yield similar risk allele frequencies across each of the five continental super-populations. However, simulated GWAS that use American (AMR), East Asian (EAS), European (EUR), or South Asian (SAS) samples produce a set of disease-associated loci with elevated frequencies of ancestral risk alleles and reduced frequencies of derived risk alleles in Africa (Fig. 4A). The magnitudes of these differences in allele frequencies are comparable to what is observed in empirical GWAS data. Regardless of study population, risk allele frequencies are similar for each non-African population, and this may be due in part to the relatively recent divergence times between these populations. Because statistical power is maximized at intermediate allele frequencies, mean risk allele frequencies in study populations are shifted closer towards 50%. We note that simulated GWAS that use a mixture of samples from different continents (MIX) still produce a set of disease-associated loci with elevated frequencies of ancestral risk alleles and reduced frequencies of derived risk alleles in Africa. Similarly, simulated GWAS that use admixed American (AMR) samples yield biased allele frequencies. Taken together, these results suggest that pooling samples with different ancestries is unlikely to alleviate the problem of SNP ascertainment bias.
Although genotyping arrays contribute to ascertainment bias, GWAS simulations reveal that biases in risk allele frequencies persist even if whole genome sequences are used. Recall that empirical GWAS data come from a heterogeneous set of studies, while simulated results assume a single study design and effect size. Despite this, allele frequency differences between Africa and Europe are similar for real and simulated data (Table 1). Disease associations from simulations of European GWAS yield similar results for the Affymetrix Genome-Wide Human SNP Array 6.0 and the Illumina Omni 5M microarray (ancestral risk allele frequencies were 10.7% and 11.0% higher in Africa and derived risk alleles were 8.0% and 8.2% higher in Europe, respectively). Somewhat surprisingly, disparities in allele frequencies also occur for European GWAS simulations that use whole genome sequences (Fig. 4B). However, continental differences in allele frequencies were reduced for simulations that used whole genome sequences (ancestral risk allele frequencies were 9.7% higher in Africa and derived risk alleles were 7.2% higher in Europe). The fact that allele frequency differences arise from WGS simulations lends additional support to the claim that biases will persist for imputed SNPs.
Continental biases in risk allele frequencies occur even if GWAS use large sample sizes. Simulated GWAS with fewer than 10,000 European cases and controls yield large differences in African and non-African allele frequencies (Fig. 4B). This occurs regardless of whether simulations use SNPs from the Affymetrix Genome-Wide Human SNP Array 6.0 or WGS. We find that well-powered studies with hundreds of thousands of cases and controls still result in notable differences in continental allele frequencies. There are diminishing returns for increasing sample sizes if simulated GWAS use genotyping arrays. By contrast, whole genome sequencing of one million cases and controls minimizes the amount of bias. Statistical power is also a function of the p-value threshold used in a GWAS. Holding the default parameter values constant, we find that using a more stringent p-value threshold amplifies risk allele frequency differences; ancestral risk allele frequencies are 12.2% higher in Africa and derived risk allele frequencies are 8.8% higher in Europe if a p-value threshold of 5 x 10^-8 is used.
Although we focused on additive effects, continental biases in risk allele frequencies vary for other modes of inheritance. It is easier to detect associations for low frequency dominant alleles, intermediate frequency additive alleles, and high frequency recessive alleles. However, the power to detect a genetic association does not solely depend on minor allele frequency (e.g. disease-causing alleles at 10% and 90% have a different chance of being successfully detected) [31]. Using simulations of European GWAS, we find that African risk allele frequencies are expected to be higher than European risk allele frequencies for dominant models of disease and lower than European risk allele frequencies for recessive models of disease (Table 2). These trends occur whether risk alleles are ancestral (dominant: +19.7%, recessive: +2.9%) or derived (dominant: -2.2%, recessive: -17.9%).
Discussion
SNP ascertainment bias confounds GWAS results and creates the illusion of genetic health disparities. Specifically, African populations tend to have higher frequencies of ancestral risk alleles and lower frequencies of derived risk alleles at existing GWAS loci. Taking into account the magnitude of these differences and the proportion of ancestral alleles in GWAS results yields risk allele frequencies that are 1.15% higher in Africa. This has important implications with respect to precision medicine and personal genomics: disease risks are likely to be misestimated if GWAS results are naively used to calculate genetic risk scores. Biased predictions of genetic risks are expected to be magnified for individuals of African descent, potentially complicating existing health disparities that are due to socio-cultural factors including access to medical care [32, 33].
Importantly, elevated risk allele frequencies in African populations are the opposite of what one expects to see given what is known about human demographic history. Natural selection is more efficient at purging deleterious variants when population sizes are large [34], and an important difference between African and non-African populations is that the latter have been subjected to multiple bottlenecks and founder effects following the out-of-Africa migration. Because of this, non-African genomes carry an excess load of homozygous deleterious alleles (as identified via GERP scores) [35]. By contrast, geographic patterns at known disease-associated loci differ by continent (Fig. 1), and this is due in part to SNP ascertainment bias.
The effects of different study populations are asymmetric. For example, if a GWAS uses European samples, allele frequencies at disease-associated loci will be similar across non-African populations and different for Africa (Fig. 4). By contrast, risk allele frequencies from African GWAS are relatively similar across all global populations. Because successful detection of a SNP-disease association requires that a causal locus is polymorphic in the study population [31, 36], bottlenecks and founder effects can contribute to the illusion of genetic health disparities. Consider a disease-causing allele that is initially found at the same frequency in two populations (i.e. prior to the divergence of these populations). Over time, genetic drift causes allele frequencies at this locus to change in each daughter population (Fig. 5A). Importantly, non-African populations have experienced a history of population bottlenecks, including a drastic reduction in population size during the out-of-Africa migration [37], and there is a greater chance that non-African populations will have allele frequencies that are either 0 or 1. Note that derived alleles tend to be low frequency and ancestral alleles tend to be high frequency [22]. African GWAS result in minimal bias and non-African GWAS result in an excess of ancestral risk alleles with elevated allele frequencies in Africa and an excess of derived risk alleles with elevated allele frequencies outside of Africa (compare Fig. 5B and Fig. 5C). Biases in genetic predictions of disease risk depend upon historical population sizes and divergence times.
Although whole genome sequencing can identify many genetic variants that are missing from genotyping arrays, many of these variants are rare and population-specific (Fig. 3B). Because of this, disease associations that use WGS data need not generalize well to other populations. We find that continental biases in risk allele frequencies persist even if GWAS use whole genome sequences and hundreds of thousands of cases and controls (Fig. 4B). This has important implications for genetic risk score calculations: estimates of disease risk depend upon the population(s) in which disease associations were originally discovered, regardless of whether WGS data were used.
Going forward, there are multiple ways to extend the benefits of precision medicine and personal genomics to a wide range of global populations. One option is to replicate every existing GWAS in as many populations as possible. However, this option has limited feasibility: even if sufficient funds and epidemiological resources are available, it is not always possible to obtain large sample sizes for each population. Instead, genetic risk scores can correct for SNP ascertainment bias. This requires understanding how risk allele frequencies differ between populations (as shown here), and leveraging linkage disequilibrium information to infer the effect sizes of risk alleles in non-study populations [38, 39]. Only by understanding the effects of SNP ascertainment bias can accurate predictive models of genetic disease risks be built.
Methods
Population genetic data
Allele frequencies were obtained for each of the five continental super-populations of the 1000 Genomes Project: Africa (AFR), Americas (AMR), East Asia (EAS), Europe (EUR), and South Asia (SAS) [17]. Phase 3 data were used. These frequencies were used to generate risk allele frequencies and derived allele frequencies at disease-associated loci from the NHGRI-EBI GWAS Catalog and simulated datasets. Ancestral and derived states in phase 3 1000 Genomes Project VCF files were used (these ancestral states were inferred via the EPO pipeline from Ensembl). We found that derived allele frequencies were elevated for large chunks of chromosome 8, which is indicative of misidentified ancestral states. To compensate for this, we masked SNPs found in the chr8: 89,00,000-146,364,022 region (hg19). Individuals in phase 3 of 1000 Genomes Project were genotyped using WGS. Allele frequencies of SNPs on the Affymetrix Genome-Wide Human SNP Array 6.0 and the Illumina Omni 5M microarray were found by merging data from the 1000 Genomes Project with lists of SNP ids obtained from the Affymetrix and Illumina websites.
Identification of disease-associated variants
Using the NHGRI-EBI GWAS Catalog [1], Berens and colleagues generated a curated set of 3180 disease-associated loci [40]. This involved filtering out SNPs that were not associated with a disease, eliminating SNPs lacking risk allele or odds ratio information, and LD-pruning. Here, we further constrained the set of disease-associated loci from [40] by requiring knowledge of whether risk alleles are ancestral or derived. After excluding 144 SNPs with unknown ancestral states, we were left with a focal set of 3036 disease-associated loci. We classified these 3036 disease-associated loci into seven non-overlapping categories: gastrointestinal/liver, metabolic, morphological, cancer, neurological, miscellaneous, and cardiovascular. Wilcoxon signed-rank tests were used to compare disease allele frequencies between African and non-African populations.
GWAS simulations
Computer simulations were used to test whether SNP ascertainment bias alone can produce what appears to be genetic health disparities. The goal here was to generate simulated datasets comparable to the set of 3036 disease-associated loci from the NHGRI-EBI GWAS Catalog. These simulations assume that the underlying risks of disease are the same across the globe. Two general types of simulations were run: simulations with ancestral risk alleles and simulations with derived risk alleles. Simulations involved randomly drawing a test SNP from a list of known genetic variants ascertained via WGS or found on commercial genotyping arrays. Conditioning on whether risk alleles are ancestral or derived, the risk allele frequency of the test SNP was found in the study population. We then used a Perl script based on the GAS/CaTS power calculator [31] to determine the probability of detecting a successful genetic association at the test SNP. The GAS power calculator leverages information about the number of cases and controls, p-value threshold, disease model, prevalence, disease allele frequency, and genotype relative risk (http://csg.sph.umich.edu/abecasis/cats/gas_power_calculator/). For each test SNP, we generated a uniformly distributed random number between 0 and 1. The test SNP was retained if the random number was less than the power to successfully detect a genetic association, and the test SNP was rejected if the random number was greater than the probability of detection. This process was repeated until a set of 3036 successful disease associations was detected. At each of these 3036 SNPs, we obtained simulated risk allele frequencies for five super-populations in the 1000 Genomes Project dataset (AFR, AMR, EAS, EUR, SAS).
Our default parameters were as follows: genotyping technology = Affymetrix Genome-Wide Human SNP Array 6.0, study population = Europe (EUR), sample size = 3500 cases and 3500 controls, genetic model = additive, p-value threshold = 10^-5, prevalence = 0.1, and genotype relative risk = 1.211. These parameter values were chosen to be representative of the empirical data found in the NHGRI-EBI GWAS Catalog.
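The retention step described above amounts to rejection sampling, which can be sketched in Python as follows. This is an illustration under stated assumptions, not the Perl and R code used for the simulations: `toy_power` is a made-up stand-in that merely peaks at intermediate frequencies and does not reproduce the GAS/CaTS power calculation, the SNP records are toy values, drawing SNPs with replacement is an assumption, and `max_draws` is a safety limit added for this sketch.

```python
import random


def toy_power(freq):
    """Hypothetical stand-in for a power calculation (NOT the GAS/CaTS method).

    Returns a value in [0, 1] that is highest at a frequency of 0.5 and
    zero for monomorphic sites.
    """
    return 4.0 * freq * (1.0 - freq)


def simulate_gwas(snps, study_pop, n_hits, power_fn=toy_power,
                  rng=None, max_draws=1_000_000):
    """Draw SNPs at random and keep each with probability equal to its power.

    snps:      list of dicts mapping population label -> risk allele frequency
    study_pop: label of the population in which the simulated GWAS is run
    n_hits:    number of retained associations to collect
    """
    rng = rng or random.Random()
    hits = []
    for _ in range(max_draws):
        if len(hits) == n_hits:
            break
        snp = rng.choice(snps)               # random test SNP (with replacement)
        power = power_fn(snp[study_pop])     # power in the study population only
        if rng.random() < power:             # retain if uniform draw < power
            hits.append(snp)
    if len(hits) < n_hits:
        raise RuntimeError("max_draws reached before collecting n_hits")
    return hits


# Toy risk allele frequencies, for illustration only.
toy_snps = [
    {"AFR": 0.90, "EUR": 0.50},
    {"AFR": 0.50, "EUR": 0.02},
    {"AFR": 0.20, "EUR": 0.45},
]
hits = simulate_gwas(toy_snps, "EUR", n_hits=100, rng=random.Random(0))
mean_afr = sum(h["AFR"] for h in hits) / len(hits)
mean_eur = sum(h["EUR"] for h in hits) / len(hits)
print(mean_afr, mean_eur)
```

Because retention depends only on the frequency in the study population, SNPs that are rare or fixed there are seldom kept, whatever their frequency elsewhere. Swapping `toy_power` for a real power calculation and the toy records for 1000 Genomes Project frequencies would be needed before drawing any quantitative conclusion.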
Our default model was modified to test which aspects of SNP ascertainment bias contribute the most to continental differences in risk allele frequencies. This involved varying the following simulation parameters: genotyping technology, sample size, mode of inheritance, and the p-value threshold required for association detection. The effects of different genotyping technologies were simulated by drawing random SNPs from either the Affymetrix Genome-Wide Human SNP Array 6.0, the Illumina Omni 5M microarray, or WGS data from the 1000 Genomes Project. To examine the effects of different study populations, simulated risk allele frequencies were chosen from one of five different populations (AFR, AMR, EAS, EUR, or SAS) or from an equal mixture of all five populations (MIX). The effects of different sample sizes were simulated by varying the number of cases and controls from 3 to 6 on a log10 scale at intervals of 0.1 (i.e. between 1,000 and 1,000,000 cases and controls). Three genetic modes of inheritance were simulated: dominant, additive, and recessive. Two different p-value thresholds were simulated: 1 x 10^-5 and 5 x 10^-8.
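For concreteness, the sample size grid described above (exponents from 3 to 6 in steps of 0.1) can be generated as follows. This is a small Python sketch; rounding each value to the nearest whole number of individuals is an assumption, since the text does not say how non-integer sizes were handled.

```python
def sample_size_grid(low_exp=3.0, high_exp=6.0, step=0.1):
    """Sample sizes evenly spaced on a log10 scale, endpoints included.

    Exponents are built from integer multiples of `step` to avoid the
    drift that comes from repeatedly adding a floating point increment.
    """
    n_steps = round((high_exp - low_exp) / step)
    exponents = [low_exp + i * step for i in range(n_steps + 1)]
    # Rounding to whole individuals is an assumption of this sketch.
    return [round(10 ** e) for e in exponents]


grid = sample_size_grid()
print(len(grid), grid[0], grid[-1])
```

With the default arguments this yields 31 sample sizes running from 1,000 to 1,000,000.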
Data access
Global allele frequencies are publicly available from the 1000 Genomes Project website: http://www.internationalgenome.org/data. Disease associations are publicly available from the NHGRI-EBI GWAS Catalog: http://www.ebi.ac.uk/gwas/. R and Perl scripts used in GWAS simulations are available upon request.
Disclosure declaration
All authors declare that they have no competing interests.
References
Acknowledgements
We thank A. Martin, U. Martinez-Marigorta, and M. Quiver for helpful discussions during the writing of this paper. This work was supported by NIH/NCI grant U01CA184374 and start-up funds from Georgia Institute of Technology.