Ancient DNA reveals that few GWAS loci have been strongly selected during recent human history

Genetic data from ancient humans has provided new evidence in the study of loci thought to be under historic selection, and thus is a powerful tool for identifying instances of selection that might be missed by methods that use present-day samples alone. Using a curated set of disease-associated variants from the NHGRI-EBI GWAS Catalog, we provide an analysis to identify disease-associated variants that bear signatures of selection over time. After accounting for the fact that not every ancient individual contributed equally to modern genomes, a Bayesian inference method was used to infer allele frequency trajectories over time and determine which disease-associated loci exhibit signatures of natural selection. Of the 2,709 variants analyzed in this study, 895 show at least a weak signature of selection (|s| > 0.001), including multiple variants that are introgressed from Neanderthals. However, only nine disease-associated variants show a signature of strong selection (|s| > 0.01). Additionally, we find that many risk-associated alleles have increased in frequency during the past 10,000 years. Overall, we find that disease-associated variants from GWAS are governed by nearly neutral evolution. Exceptions to this broad pattern include GWAS loci that protect against asthma and variants in MHC genes. Ancient samples allow us an unprecedented look at how our species has changed over time, and our results represent an important early step in using this new source of data to better understand the evolution of hereditary disease risks.


Introduction
Many studies have been conducted to understand the genetic basis of diseases and other traits. Despite most variants appearing to evolve neutrally, exceptions exist for some disease-associated 2 loci (Supplemental Table S1). For example, six of the nine variants under strong selection (s > 3 0.01) are associated with autoimmune diseases. One of them (rs13277113) is associated with 4 systemic lupus erythematosus (Hom et al., 2008), and has been found by the GTEx Project (v8) to 5 significantly affect expression of BLK, a gene involved in B cell development (Gauld and Cambier, 6 2004). Another of these variants (rs2270450) is associated with developing Hashimoto thyroiditis 7 versus Graves' disease (Oryoji et al., 2015). Curiously, this variant is significantly associated with 8 expression of TDRD6, a gene involved in spermiogenesis as well as autoimmune polyendocrine 9 syndrome (Akpınar et al., 2017;Bensing et al., 2007). The other four strongly selected variants 10 associated with autoimmune diseases are found within the MHC region, which is known to be 11 dense with rapidly evolving genes critical to immune function.

Archaic introgression introduced some selected alleles at GWAS loci 1
Disentangling the effects of selection and archaic introgression can be difficult if one is analyzing 2 modern genomes (Racimo et al., 2016). However, time series data from ancient genomes can 3 sidestep many of these difficulties. Of the 2,709 disease-associated loci analyzed here, 24 contain 4 introgressed alleles from Neanderthals. Of these 24 loci, nine introgressed variants have signatures 5 of weak selection in the recent past ( Figure 1). Some introgressed alleles appear to have reached 6 moderate allele frequencies prior to being negatively selected, while other introgressed alleles 7 appear to have only been positively selected during recent human history. Overall, we find that 8 disease-associated alleles that have a Neanderthal origin have similar selection coefficients to other 9 disease-associated alleles from the GWAS Catalog (Table 1). This is not wholly surprising, given 10 that purging of strongly deleterious introgressed alleles occurred shortly after admixture (Petr et 11 al., 2019), and introgressed alleles that are able to persist to the present day will have managed to 12 avoid being filtered out due to Dobzhansky-Muller incompatibilities or other hybrid fitness effects.

Most selected alleles at disease loci have small effect sizes 10
We examined whether selection coefficients are correlated with disease-related effect sizes. Given 11 that our time series dataset exclusively contains samples from Europe and effect sizes are 12 contingent on a host of factors including study population and environmental factors, this analysis 13 focused on the subset of 1,910 variants that were identified in a GWAS that used samples of 14 European ancestry ( Figure 2). Overall, we find that there is no relationship between odds ratio 15 (OR) and magnitude of s for non-MHC variants (r = -0.012). This pattern arises because many 1 GWAS Catalog traits are largely irrelevant to fitness. For example, the variant with the highest 2 OR analyzed here (rs3825942, OR=20.1) is associated with glaucoma, a disease which normally 3 strikes long after reproductive age. Although the variants in Figure 2 are associated with a 4 heterogenous set of diseases, our results suggest that the selection coefficients are largely 5 decoupled from the effects of GWAS loci on disease risks. 6 7

Variants in the MHC region show stronger signatures of selection 8
Compared to other disease-associated alleles, MHC variants show a strikingly different pattern. 9 Disease-associated variants in the MHC region on chromosome 6 have larger effects sizes than 10 disease-associated variants that are not in the MHC region (mean MHC OR: 2.11, mean non-MHC 11 OR: 1.36, t test P = 4.12 x 10 -19 ). Disease-associated variants in MHC genes also show a modest 12 positive correlation between s and OR (r = 0.167). Variants located in the MHC region were 13 significantly more likely to show signatures of weak selection than GWAS variants found in other 14 parts of the human genome (Table 1, hypergeometric P < 10 -6 ). Disease-associated variants in the 15 MHC region were also significantly depleted of selection for protective alleles (hypergeometric P 16 < 10 -6 ). Given the highly pleiotropic nature of the immune system, it is possible that alleles that 17 increase risk for some immune diseases have been historically protective against others 18 unrepresented in the GWAS Catalog (i.e., infectious diseases). 19 20

Disease-specific tests for enrichment of selection signatures 21
Although we caution against overinterpreting disease-specific results (due to genetic hitchhiking 22 and the pleiotropic nature of variants analyzed in this paper), multiple diseases were enriched for 23 Page 10 signatures of weak selection. Idiopathic membranous nephropathy, an autoimmune disease that 1 severely impacts kidney function, showed the strongest signature of selection, with 17 of 19 2 variants under at least weak selection ( Figure 3A; hypergeometric P < 1.00 x 10 -6 ). Unsurprising 3 given the strong immune component to this disease, all but one of the variants associated with this 4 disease are found in the MHC region, which likely contributes to enrichment for selected alleles 5 that are associated with idiopathic membranous nephropathy. Curiously, alleles that protect against 6 this disease have tended to decrease in frequency during recent human history ( Figure 3B; 7 hypergeometric P = 4.76 x 10 -4 ). As stated above, this result could represent either genetic 8 hitchhiking or pleiotropic effects. The next strongest enrichment signal was from asthma, with 26 9 of the 49 variants associating with it being under at least weak selection ( Figure 3A; 10 hypergeometric P = 2.12 x 10 -3 ). We found that alleles that protect against asthma were strongly  Table S2). 15 16 In addition, we found three diseases that were significantly depleted for signatures of weak 17 selection (Supplemental Table S2). The strongest signal was for breast cancer ( Figure 3B; 21 of 18 96 variants; hypergeometric P = 1.20 x 10 -2 ), followed by major depressive disorder (10 of 55 19 associated variants; hypergeometric P = 1.27 x 10 -2 ), and addiction (3 of 23 variants; 20 hypergeometric P = 2.96 x 10 -2 ). Taken together, these results reveal that not all diseases are 21 equally affected by selection, even when selection does not appear to be broadly acting. Given that severity and when a disease manifests determine its fitness impact, we also examined 9 whether diseases with an early age of onset have different characteristics than diseases with a 10 late age of onset. Overall, 55.9% of alleles that are associated with increased disease risks in our 11 dataset were derived, as opposed to ancestral (binomial P < 1.00 x 10 -6 ). This trend was more 12 pronounced in diseases whose onset was before the age of 18 (hypergeometric P = 3.50 x 10 -5 ; 13 Figure 4A). Despite this effect, age of onset does not seem to impact the likelihood of a variant 14 being under selection ( Figure 4B). However, there is a slight bias in favor of the protective allele 15 for both the 0-18 age of onset (binomial P = 0.047) and the 19-40 age of onset disease variants 16 (binomial P = 0.042; Figure 4C).

11
We find that the majority of GWAS variants do not appear to be under strong selection, and that 12 selection coefficients (s) and odds ratios (OR) are uncorrelated. Allele frequency changes at many 13 disease-associated loci may be due to the indirect effects of genetic hitchhiking, i.e., selection 14 acting at closely linked loci. Because of this, our results suggest an "upper bound" for the amount 15 of selection experienced by GWAS variants. In general, we find that our results reinforce the 16 notion that diseases do not necessarily impact fitness. Our study also demonstrates the utility of 17 time series data to infer the evolutionary histories of individual disease-associated loci. Disease phenotypes are obvious targets for natural selection, so the general lack of variants under 1 selection in this set of GWAS loci at first seems unexpected. However, a previous study estimating 2 selection coefficients from C-scores showed a remarkably similar proportion of GWAS variants 3 under weak selection (Racimo and Schraiber, 2014). Indeed, there are many reasons why we would 4 not expect many of these variants to be under strong directional selection in the timescale 5 examined. First, many of these diseases are unlikely to impact fitness, a scenario supported by our 6 result that OR and the magnitude of s are uncorrelated for most variants. Of the 178 diseases 7 considered here, 20 are late-onset (age of onset of 60 or later) and while they shorten lifespans, 8 they do so well after reproductive age. Barring strong effects from scenarios such as the 9 grandmother hypothesis (Williams, 1957) and presuming these variants substantially affect no 10 other fitness-impacting traits, this makes this group of disease variants essentially invisible to 11 selection. The fact that these diseases show no marked depletion in the proportion of variants under 12 weak selection when compared to diseases with earlier ages of onset could imply that genetic 13 hitchhiking or pleiotropy may be obscuring expected patterns of selection. 14 15 Another possibility is that selection has acted on disease traits, but the polygenic nature of the 16 disease means individual loci affecting risk might be close to neutrality. Indeed, this is expected 17 as many GWAS variants are segregating at allele frequencies that no strongly deleterious variants Interestingly, diseases whose onset is during pre-reproductive or reproductive age are slightly more 16 likely to have protective alleles that are positively selected. A prime example of this involves 17 variants affecting asthma, which are strongly enriched for selection overall, and positive selection 18 on protective alleles in particular. Asthma is a chronic inflammatory disorder of the airways 19 primarily affecting children whose incidence has increased substantially over recent decades 20 It is worth mentioning that the approach taken here is subject to the limitations of the GWAS data 4 to which it has been applied. While incredibly useful, GWAS are not easily conducted on certain 5 diseases that likely carry great importance to fitness, such as infectious diseases or diseases and 6 health events that result in patient death before they can be enrolled in a study (i.e., myocardial 7 infarction, stroke). Our approach here is well suited to examining the common variants typically 8 found in GWAS, but it is ill-suited to rare or family-specific variation. Additionally, the timescales 9 we can examine using this approach are defined by the density and timeline of ancient samples. 10 As many of our samples are from the last 10,000 years and the youngest is from ~1,208 years ago, 11 very recent allele frequency shifts are difficult to detect. It is possible that some variants may have 12 been under episodic selection, but our method is unable to detect selection if allele frequency 13 trajectories are U-shaped. 14 15

Conclusion 16
Our study shows the utility of ancient genomic data and time series analyses in understanding the 17 effects of selection versus neutral processes on complex disease risk. These results contribute to 18 the growing body of data on how selection has shaped the human genome, as well as the 19 importance of disambiguating fitness effects from disease status. It is also worth noting that 20 incorporating information from ancient samples can help give context in situations where 21 traditional markers of selection (linkage disequilibrium, singleton density, etc.) are obscured due 22 to the history of the region. This is a particular source of trouble in regions such as haplotypes 23 Page 16 introgressed from ancient hominins, whose history violates many assumptions of methods using 1 modern data alone (Vernot and Akey, 2014). As more high-quality ancient genomes are collected 2 from around the world, our understanding of evolutionary medicine will continue to grow.

Age of onset determination and phenotype collapsing 20
When available, we used the disease information collated by Google from multiple data sources 21 (for more information, see: https://support.google.com/websearch/answer/2364942). Onset was 22 broken down into seven age ranges: 0-2, 3-5, 6-13, 14-18, 19-40, 41-60, and 60+. For each disease, 23 Page 17 we used the age range determined by the highest frequency of ages affected. When multiple age 1 ranges were equally likely, we used the youngest age range. For the diseases that Google did not 2 have this information, we used either age of onset data or the age of study participants from the 3 studies contributing to the GWAS Catalog (Welter et al., 2014). For phenotypes that lacked an age 4 of onset (allergic reactions, injuries, etc.), we used the category "Any." In addition, we collapsed 5 similar phenotypes into an umbrella phenotype (Supplemental Table S2). For example, all 6 symptoms associated with type II diabetes were collapsed into the type II diabetes umbrella 7 category.

Allele frequency trajectory and selection inference 21
We used a method developed by Schraiber et al (2016). For variants where the maximum 22 contribution of individuals of a particular lifestyle was less than 100%, we down-sampled the 23 individuals from that lifestyle by enforcing a chance of inclusion equal to the maximum 1 contribution of that lifestyle. Individuals were then binned by age (>9500, 9500-8000, 8000-6500, 2 6500-5000, 5000-3500, and <3500 years ago). Each time point was determined by calculating the

Introgressed and MHC variant identification 19
We determined the introgression status of variants using the haplotypes and variants identified in 20 a study of Neanderthal and Denisovan introgression across modern Eurasian and Melanesian 21 groups (Vernot et al., 2016). Coordinates defined by the Genome Reference Consortium were used to specify the MHC region (hg38 chr6:28,510,020-33,480,577). When necessary, we used liftOver 1 to convert data from hg38 to hg19 or vice versa (Hinrichs et al., 2006). 2 3

Binomial and hypergeometric enrichment 4
We used the numpy module in python to generate binomial and hypergeometric distributions. In 5 all cases, we simulated 1,000,000 draws. Unless otherwise stated in the text, hypergeometric 6 distributions were generated based on the results from the overall set of 2,709 GWAS variants 7 with selection information.