Abstract
As humans spread throughout the world, they adapted to variation in many environmental factors, including climate, diet, and pathogens. Because many of these adaptations were likely mediated by multiple non-coding variants with small effects on gene regulation, it has been difficult to link genomic signals of selection to specific genes, and to describe the regulatory response to selection. To overcome this challenge, we adapted PrediXcan, a machine learning method for imputing gene regulation from genotype data, to analyze low-coverage ancient human DNA (aDNA). First, we used simulated genomes to benchmark strategies for adapting gene regulatory prediction to increase robustness to incomplete aDNA data. Applying the resulting models to 490 ancient Eurasians, we found that genes with the strongest divergent regulation among ancient populations with hunter-gatherer, pastoralist, and agricultural lifestyles are enriched for metabolic and immune functions. Next, we explored the contribution of divergent gene regulation to two traits with strong evidence of recent adaptation: dietary metabolism and skin pigmentation. We found enrichment for divergent regulation among genes previously proposed to be involved in diet-related local adaptation, and in many cases, the predicted effects on regulation provide explanations for previously observed signals of selection, e.g., at FADS1, GPX1, and LEPR. For skin pigmentation, we applied new models trained in melanocytes to a time series of 2999 ancient Europeans spanning ~38,000 years BP. In contrast to diet, skin pigmentation genes show little regulatory change over time, suggesting that adaptation mainly involved large-effect coding variants. This work demonstrates how aDNA can be combined with present-day genomes to shed light on the biological differences among ancient populations, the role of gene regulation in adaptation, and the relationship between ancient genetic diversity and the present-day distribution of complex traits.
Introduction
In the last decade, the number of ancient DNA (aDNA) samples from anatomically modern humans (AMHs) has increased dramatically (Marciniak and Perry, 2017). These samples span the globe, and cover time periods from several hundred to tens of thousands of years ago. This is a rich data source for understanding genetic changes and adaptations that occurred as humans expanded across the globe. However, linking genetic differences in aDNA samples to phenotypes poses several challenges (Irving-Pease et al., 2021). First, while the samples are often paired with archaeological information, this is limited to what biological material has survived for thousands of years. Thus, most phenotypes of interest are not directly measurable. Second, due the complexity of many phenotypes and gaps in our knowledge of the genetic architecture of most traits, drawing conclusions about most phenotypes of interest based on genetic information alone is challenging (Benton et al., 2021; Li et al., 2020).
To date, most studies have focused on comparing aDNA from different geographical regions to map migrations and their relationship to archaeological changes (Skoglund and Mathieson, 2018). Shifts from a hunter-gatherer lifestyle to pastoral herding and agricultural farming have been of particular interest, because these changes had profound implications for multiple aspects of life. These include changes in day-to-day activities, population density, interactions with the environment, and substantial dietary shifts, such as increased reliance on domesticated grains (Goude and Fontugne, 2016; Olsson and Paik, 2016). These shifts likely modified selective pressures on populations as their lifestyles, diets, and pathogen exposures changed.
Genomic scans in present-day populations have identified many loci with evidence of positive selection (Field et al., 2016; Grossman et al., 2013; Rees et al., 2020; Voight et al., 2006). In some cases, selection can be linked to changes in the coding sequence of specific genes (Grossman et al., 2013; Lamason et al., 2005). In others, it can be linked to changes in gene regulation. For example, selection at the FADS1 locus is linked to increased expression (Buckley et al., 2017; Mathieson and Mathieson, 2018; Ye et al., 2017). However in most cases, the molecular basis of signals of selection remains poorly understood, even when a specific gene can be implicated. For example, the leptin receptor (LEPR) is surrounded by a haplotype that has experienced recent positive selection (Voight et al., 2006), and protein-coding changes in LEPR have been implicated in increased cold tolerance (Hancock et al., 2008). However, altered expression of this gene is also associated with altered appetite regulation and metabolism (Kentish et al., 2013; Loos et al., 2006). Due to the difficulty in measuring environmental variables and disentangling LD patterns, it remains unclear whether selection is acting on coding variants, expression changes, or both, and which environmental variable is the source of the selective pressure (Luca et al., 2010). Even these examples are exceptional; most selection signals cannot even be confidently attributed to specific genes. Selection peaks often span many genes, with little indication of which might drive changes in fitness or the underlying molecular mechanisms. This motivated us to ask whether information about variants associated with gene expression, such as expression quantitative trait loci (eQTL), could help to identify genes under selection—analogous to the way in which eQTL data can inform variant-gene-phenotype mapping in genome-wide and transcriptome-wide association studies.
We therefore developed an approach to identify genes whose regulation shifted in coordination with lifestyle changes in recent human history. These differences in regulation between ancient human groups in distinct environments suggest adaptation. To quantify gene regulation from aDNA samples, we adapted the PrediXcan-based approach we previously used to study gene regulation in archaic hominins (Colbran et al., 2019; Gamazon et al., 2015). Since available human aDNA have variable quality and coverage, we conducted simulations and control analyses to evaluate how models for imputing gene regulation perform when applied to low-coverage data, and how to ameliorate the effects of missing variants. These yielded heuristics for determining when regulation could be accurately modeled.
Guided by these simulations, we applied PrediXcan models for thousands of genes to hundreds of ancient humans representing populations from hunter-gatherer, pastoral, and agricultural lifestyles. We found enrichment for metabolic and immune pathways among the genes most divergently regulated between lifestyle groups. This reflects both the altered metabolic requirements and immune pressures of lifestyle shifts and highlights specific genes and pathways involved. For example, divergent regulation of LEPR suggests that its functions in metabolism and appetite regulation were relevant for recent adaptation. We also analyzed the predicted regulation of 20 diet-related genes in genomic regions with evidence of recent local adaptation. Supporting the accuracy of our approach, we rediscover the FADS locus regulatory haplotype that has been previously shown to vary by lifestyle and is likely the target of selection. We also identified divergent regulation between aDNA samples for selected genes involved in response to selenium (GPX1) and carnitine (SLC22A5) levels.
Modeling gene regulation using aDNA also allows us to characterise the nature of selection on specific phenotypes. To illustrate this, we investigated changes in predicted regulation of genes involved in skin pigmentation—the phenotype that is most clearly under directional selection in these populations—using PrediXcan models trained on expression data from melanocytes. We find that skin pigmentation genes show no consistent change in regulation over time suggesting that, for this particular phenotype, evolutionary change was driven by coding variants rather than regulatory changes. Overall, this work provides an atlas of imputed regulation for hundreds of ancient humans across thousands of genes to facilitate future exploration of gene regulatory shifts in recent human evolution, and demonstrates the utility of combining molecular predictive models with ancient DNA to understand the evolution of complex traits.
Results
Gene regulatory patterns can be imputed using low-coverage aDNA data
The genetically regulated component of gene expression can be predicted by machine learning models trained on gene expression. Previous approaches have applied these models to genome-wide common variant data from present-day humans (Fig. 1A), for example to perform transcriptome-wide association studies (Gamazon et al., 2015; Zhou et al., 2020; Zhu and Zhou, 2020), and to high-coverage archaic hominin genomes (Colbran et al., 2019). Here, we adapt this approach to enable application to low-coverage genotype data from ancient human individuals, considering the unique attributes of these data. In particular, aDNA data vary in coverage, depth, and quality. This creates a trade-off between number of individuals available for analysis and the genotype quality.
To explore this trade-off and the feasibility of this approach on available aDNA data, we created simulated ancient genomes by removing variants from present-day individuals with whole-genome sequencing from the 1000 Genomes Project (1kG) (The 1000 Genomes Project Consortium, 2015). (See the Supplementary Materials for detailed discussion.) First, we found that PrediXcan models trained using common variants identified from present-day whole genome sequencing data are robust to random patterns of missing data (Spearman ρ > 0.75 with up to 45% of variants missing; Supplementary Fig. 2A). However, nearly all aDNA samples used here were genotyped by targeted capture of ~1240k variants (“1240k set”; Supplementary Fig. 1B) (Fu et al., 2015; Haak et al., 2015). Furthermore, many of the ancient samples have low genotyping coverage resulting in many missing variants (Supplementary Fig. 2B). Thus, we next matched the missing data to patterns observed in aDNA and compared the performance of different prediction models applied to full genomes vs. genomes with simulated missing data (Supplementary Fig. 1A). These models’ consistency decreased substantially when applied to genomes with missing data matched to that in ancient DNA samples (median Spearman ρ = 0.39; Fig. 1B).
To address this, we trained prediction models using only variants from the 1240k set. The predictions of these models were correlated with those of the full models (median Spearman ρ = 0.67), as expected given the LD between variants in the 1240k set and those in the full models. We also identified a set of variants that were most frequently available in the highest quality ancient samples; this resulted in a set of the 600,000 most-informative variants from the 1240k set (“top600k set”). We then trained models using these variants targeted to the aDNA data (top600k) and evaluated their performance on full genomes and simulated ancient genomes (Methods). While predictions made by the 1240k and top600k models were largely consistent with those made by the Full models when applied to genomes with no missing data (median rho 0.82 and 0.79 respectively), only the 1240k models maintained consistency when applied to incomplete genomes (Fig. 1C). We therefore concluded that the 1240k trained models strike a balance between accuracy and sample size when applied to ancient data, and thus we used these models for the rest of our analysis.
Imputing gene regulatory differences between ancient human populations
We collected ancient human samples with genetic data from a variety of sequencing and genotyping platforms (Methods). Based on the analyses in the previous section, we ranked individuals by the number of sites successfully genotyped, and took the top quartile of individuals (>771240 SNPs, or 0.74x coverage), restricting to individuals from Eurasia due to sample density and genetic similarity to the training data (Fig. 2A). The samples ranged in date from 90 years before present (yBP) to 45,000 yBP, with the majority between 2,500 and 6,000 yBP (Fig. 2B).
We then assigned individuals to a lifestyle (hunter-gatherer, pastoralist, or agricultural) by literature review of the associated archaeological culture based on information from the original aDNA publications. In general, hunter-gatherers were from sites: 1) dated to times before any evidence of domestication or 2) with evidence only for foraging and meat consumption and no domesticated plants or animals. Agriculturalists were from sites with evidence for domesticated grains and animals. Pastoralists can be difficult to distinguish from agriculturists, and here refers to individuals from often semi-nomadic societies focused on domesticated animals (primarily the Yamnaya and similar groups). In addition, in some cases, the lifestyle distinction was based on genetic similarity to other groups, so the categories used here are based on a combination of genetics and archaeology. Because of these difficulties, we focused primarily on comparisons between hunter-gatherers and the other groups. This process resulted in 490 ancient Eurasian individuals with an assigned lifestyle and aDNA for further study (Fig. 2C).
We then applied the 210,800 “1240k” gene regulation prediction models described in the previous section to the 490 ancient samples, as well as to 503 present-day Europeans from the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015). This resulted in normalized expression predictions in different tissues (“predicted regulation”) for 14,873 unique genes. The observed expression level of a gene in an tissue in an individual is a combination of genetically regulated and environmental factors. The output of our prediction model is not a direct proxy for the observed expression, but rather a quantification of the genetic component of gene regulation. Thus, differences in predicted regulation between individuals reflect potential differences in the inherited genetic component of expression, not environmentally driven differences.
Divergently regulated genes are enriched for immune and metabolic functions
To survey high-level differences among ancient individuals from the three lifestyle groups, we identified divergently regulated genes as those for which the P-value of a Kruskal-Wallis test passed a Bonferroni multiple testing correction (per-tissue). For example, GPR84 was among the most differences in predicted regulation between populations (Fig. 3A; Adrenal Gland predicted regulation of −0.0421 in agriculturalists vs. 0.197 in hunter-gatherers; K-W P -value). Overall, 5759 unique genes showed evidence of divergent regulation between lifestyles in at least one tissue (median 2 tissues; Supplementary Fig. 6A), and an average of 9.8% of genes in each tissue were divergent (Supplementary Fig. 6B). However, most divergent genes had relatively small changes in magnitude between groups (e.g. maximum 1.17 magnitude difference between hunter-gatherers and agriculturalists in Subcutaneous Adipose) and the majority of these differences are likely attributable to genetic drift, rather than the effects of selection. We therefore imposed a genomic control (Methods) on the full distribution of 210,800 (genes × tissues) Kruskal-Wallis P-values (Fig. 3A). We focused on the 500 genes with the most evidence of divergent regulation (corrected P < 3.46 × 10−3, FDR = 0.586), which may be enriched for targets of selection.
We hypothesized that immune and metabolic traits were among those under the most selective pressure as populations transitioned between lifestyles. To identify systematic patterns in the 500 most divergently regulated genes, we conducted Gene Ontology (GO) over-representation analysis. The twenty most-enriched annotation terms (Fig. 3B) included immune-related (e.g.“antigen processing and presentation”) as well as basic metabolic processes and cellular functions (e.g. “glycoprotein metabolic process”). In addition, the enrichments for some of the more general terms may be driven by genes with pleiotropic immune system effects. For example, the eight genes driving the enrichment of the “DNA-templated transcription, elongation” term included THOC5, which also functions in immunity and response to stimuli through cytokine-mediated pathways (Mancini et al., 2004; Tamura et al., 1999), ELP1, which has functions in proinflammatory signalling (Cohen et al., 1998), and AFF4, a component of the super elongation complex, which is recruited in response to HIV-1 infection (Chou et al., 2013; He et al., 2010).
Many gene sets are likely to maintain similar regulatory patterns across populations, regardless of lifestyle, and these should not be enriched among the top divergently regulated genes. To test this, we quantified the enrichment of three such sets under strong functional constraint among the 500 most diverged genes between lifestyle groups across tissues: 1) genes that have experienced stabilizing selection on their levels of expression across many species (Chen et al., 2018), 2) genes responsible for core house-keeping functions (Eisenberg and Levanon, 2013), and 3) genes that are intolerant to loss-of-function coding variation (“LOF-intolerant”) in present-day humans (Lek et al., 2016) (Methods). As expected, LOF-intolerant genes and those under long-term stabilizing selection are not enriched (Table 1). Surprisingly, housekeeping genes were slightly enriched (OR = 1.33, P = 0.0076). By definition, housekeeping genes have uniform and ubiquitous expression across tissues, so this pattern could partially be explained by increased power to model changes in their regulation in multiple tissues. However, many housekeeping genes are also involved in basic cellular metabolism (Eisenberg and Levanon, 2003), which could require fine tuning in response to changes in nutrient sources or other environmental shifts. We also tested for enrichment of genes that encode proteins that directly interact with viruses, since these genes are known to evolve rapidly (Enard et al., 2016), but we find no enrichment among the top 500 genes, suggesting that selection at these loci could be driven by coding rather than regulatory changes.
Several of the top divergently regulated genes underlying the GO functional enrichments have been implicated in local adaptation, for example EP300 (Zheng et al., 2017) and several subunits of HLA-DQ (Catassi and Catassi, 2018; De Silvestri et al., 2018; Pierini and Lenz, 2018). In the next two sections, we explore the connection between sequence signatures of recent adaptive evolution and divergent gene regulation with a focus on diet and skin pigmentation.
Changes in gene regulation contributed to adaptation to diet between ancient lifestyles
Many regions of the human genome bear signatures of recent population-specific adaptive evolution. However, the phenotypic drivers and molecular mechanisms underlying these evolutionary signatures are largely unresolved. Since diet was one of the main factors that shifted with the change from hunting and gathering to farming, we hypothesized that gene regulatory changes between lifestyle groups might be the target of signals of selection at dietary genes.
We compared the predicted regulation of 20 diet-related genes in regions with evidence of population-specific local adaptation (Rees et al., 2020) between ancient human groups with different lifestyles (Methods). Models for the 20 genes tested were enriched for lower P-values (P = 1.19 × 10−14, K-S test), with 4 unique genes among the top 500 most diverged genes by group (Supplementary Table 4).
FADS1 showed the most consistent evidence for divergent regulation between agriculturalists, pastoralists, and hunter-gatherers, with nominally significant differences in 21 tissues (Supplementary Table 2). In each tissue, hunter-gatherers had significantly lower FADS1 levels than in agriculturalists or present-day Europeans, as would be expected from a diet containing higher levels of long-chain plasma unsaturated fatty acids (Fig. 4B). We observed a similar trend among 32 ancient Africans, indicating that this is not necessarily specific to Eurasian populations (Supplementary Fig. 7A). The variants driving these regulatory differences are in linkage disequilibrium (LD) with the functional haplotype implicated in previous evolutionary studies (Supplementary Fig. 7A; Supplementary Table 3) (Ameur et al., 2012; Buckley et al., 2017; Mathieson and Mathieson, 2018; Ye et al., 2017). Overall, FADS1 predicted regulation is also negatively correlated with the date of the sample (Spearman ρ= −0.32, P = 1.95 × 10−20, which agrees with known allele frequency trajectories (Buckley et al., 2017; Mathieson and Mathieson, 2018; Ye et al., 2017).
Another gene in the FADS gene cluster, FADS2, functions in the same pathway as FADS1 and is also among the 500 most diverged genes. However it shows evidence for divergent regulation in fewer tissues than FADS1 (Supplementary Table 4), and the direction of effect is not consistent across tissues. Its presence therefore seems more likely to be due to overlap in regulatory variants with FADS1 than to selection on FADS2 regulation specifically. Our results further support the relevance of lifestyle differences between ancient populations in selection on the FADS locus and highlights the potential importance of regulatory changes of FADS1 in dietary adaptation in Eurasians.
Among the putative diet adaptation genes, GPX1, an antitoxin selenoprotein, and SLC22A5, a transporter responsible for recycling and uptake of carnitine (Console et al., 2018).(Supplementary Table 4; Supplementary Fig. 8) were also divergently regulated. The GPX1 locus has experienced selective sweeps related to environmental selenium levels (Engelken et al., 2016; White et al., 2015), and has been implicated in response to viral infections (Guillin et al., 2019). Carnitine plays an important role in the transport of certain long-chain fatty acids to the mitochondria for energy production; thus, modulation of its regulation could suggest a difference in metabolism related to variation in the energy demands of different lifestyles. Both selenium and carnitine levels differ in the likely primary diets of the ancient populations considered here (Flanagan et al., 2010; Mann, 2018), suggesting that both as potential targets of local adaptation.
Though it was not on the list of putative diet-related adaptation genes, LEPR has been suggested as the driver of nearby signatures of selection due to its function in appetite and cold tolerance (Hancock et al., 2008; Luca et al., 2010; Voight et al., 2006). LEPR was divergently regulated between lifestyle groups in the cerebellum (Fig. 4B) (the only brain tissue with a model for LEPR), both adipose tissues, and several other tissues. It was consistently predicted to be downregulated in agriculturalists compared to the other two groups in each tissue (Supplementary Table 5). Leptin is a hormone produced by adipose cells that suppresses appetite (Barrios-Correa et al., 2018), so this supports a possible connection between appetite regulation and the observed signatures of selection. This is particularly relevant to modern populations given the association of decreased LEPR function with obesity and metabolic disorders (Dehghani et al., 2018; Farooqi et al., 2007).
Overall, these analyses suggest that recent regulatory changes made a substantial contribution to adaption to diet. More broadly, they demonstrate the potential for this method to explain observed signals of selection and to disentangle its effects on nearby genes.
Skin pigmentation evolution was not driven by changes in gene regulation
We hypothesized that genes involved in complex phenotypes under selection in a population would exhibit systematic changes over time in their regulation. To test this, we focused on skin pigmentation, a trait that is known to have been under selection in humans in West Eurasia (Berg and Coop, 2014; Ju and Mathieson, 2020; Wilde et al., 2014) and for which many of the genes involved are well-understood (Sturm and Duffy, 2012). We trained new PrediXcan models using genetic variants and gene expression in melanocytes from a diverse population (Zhang et al., 2017). We were able to model 17 genes known to be involved in the melanogenesis pathway (Sturm and Duffy, 2012). Because skin pigmentation-associated variants changed in frequency over time, we applied these models to a time series of 2999 ancient Europeans dated between 38,052 yBP and 150 yBP, as well as 503 present-day Europeans from the 1000 Genomes Project and tested for systematic changes over time in predicted regulation.
Skin pigmentation genes are not enriched for differential regulation compared to all 6923 genes modeled in melanocytes (K-S Test P = 0.53; Fig. 5A). Predicted regulation showed a nominally significant linear relationship with time for only four skin pigmentation genes (Table 1), and only one (TYR) remained significant after genomic control.
We predict that TYR’s expression increased over time (Fig. 5B) and is higher in non-African (particularly European) populations compared to African populations (Fig. 5C), and in (more recent) agriculturalist populations compared to hunter-gatherers (Fig. 5D). TYR encodes an enzyme important for one of the earliest steps of the melanogensis pathway and loss-of-function mutations cause albinism (Ghodsinejad Kalahroudi et al., 2014; Norman et al., 2017). It is therefore surprising that increased expression would be driven by selection for decreased pigmentation. One possibility is that increased expression due to gene regulatory variants compensates for the increase in frequency in Europeans of an activity-reducing coding variant (rs1042602) in TYR (Wilde et al., 2014). Selection on pigmentation could favor the coding variant, while the maintenance of other functions of the gene could require increased expression. Supporting this, rs1042602 has a positive weight in the fitted PrediXcan model showing that it in fact is associated with increased expression.
Finally, we were unable to build accurate PrediXcan models for many known pigmentation genes, including those with known selected coding changes (Lamason et al., 2005; Soejima and Koda, 2007), mostly because there was not enough regulatory variation nearby the genes. Overall, our results suggest that, in contrast to diet, changes in gene regulation did not play a large role in the evolution of skin pigmentation in Europe. This is consistent with observations that selection signals for pigmentation-associated variants in Europe are mostly driven by a relatively small number large-effect, coding variants despite the polygenic nature of the phenotype Ju and Mathieson (2020).
Discussion
In this study, we adapted the PrediXcan approach for modeling the genetic component of tissue-specific gene regulation and applied it to hundreds of low-coverage ancient DNA samples from individuals from three different lifestyles and to a ~38,000-year transect of ancient Europeans. Our simulations and evaluations suggest that models of gene regulation for thousands of genes retain utility even when variant data are limited, as long as the models are trained for the specific application and their limitations properly taken into account. This is encouraging for the expansion of the PrediXcan approach to other contexts in which different variants were assayed than those used to train the original PrediXcan models. As more accurate methods are developed, it will be important to keep this aspect of their performance in mind.
Here, we found that over 5,000 genes showed evidence for divergent regulation among ancient hunter-gatherers, pastoralists, and agriculturalists in at least one tissue. The 500 genes most divergently regulated between lifestyles were enriched for metabolic and immune processes, indicating that altered gene regulation has shaped these functions during recent human evolution. Focusing on genes involved in diet, we find enrichment for divergent regulation in genes with nearby signals of recent selection, suggesting that changes in gene regulation may play a substantial role in adaptation to changes in diet.
Second, we trained new prediction models in melanocytes to analyze changes in the regulation of skin pigmentation genes in a time transect of ancient and present-day Europeans spanning 38,000 years. In contrast to genes associated with diet, we found that most genes we modeled show little to no systematic change in regulation over time, suggesting that selection on skin pigmentation mostly operated on a few large-effect coding variants. The exception, TYR, is predicted to have been up-regulated over time, which is contrary (with repsect to the trait) to the effects of a known coding variant in the gene and the predicted effects of gene expression on the trait itself (Chaki et al., 2011; Wilde et al., 2014). However, the increased expression in Europeans may be a response to the increase in frequency of a coding variant (rs1042602) that decreases activity. These results underscore the wide variety of adaptive mechanisms in recent human evolution, and the ability of ancient DNA to illuminate these mechanisms. The other skin pigmentation genes that show nominal changes in predicted regulation over time, MITF and TRPM1, are closely linked to TYR in the melanogenesis pathway, with MITF regulating both TYR and TRPM1 (D’Mello et al., 2016). Further analysis of the predicted perturbations of those relationships is needed to better understand the phenotypic consequences of these changes.
There are a several caveats to consider when interpreting these PrediXcan results. Previous work has demonstrated that, while there are some decreases in accuracy, the approach maintains utility when applied to non-European present-day populations and to archaic hominins (Colbran et al., 2019; Petty et al., 2019). Furthermore, the ancient Eurasian individuals considered here are less diverged from the GTEx cohort used for training than in these previous applications. However, due to the low coverage of the aDNA data and the focus on commonly assayed variants, there are many regulatory effects that these models do not capture. In addition, the models do not capture the effects of environment (both direct and indirect) on gene expression. Therefore, while differences in predicted regulation do not necessarily indicate a change in transcript expression levels, they do the identify change in the genetic architecture of a gene’s regulation. Our approach is therefore complementary to experimental assays of the regulatory effects of ancient genomic variants in present-day human cells (Weiss et al., 2021), and such approaches could be used to test our computational predictions. Another major limitation is that we are only able to draw conclusions about genes with sufficient expression and nearby present-day common variation. Finally, we have not developed a formal test for selection on gene regulation. While we have in some cases been able to link regulatory variation to signals of selection based on genomic data, most of the differences we observe were likely the result of genetic drift. Developing tests for selection on gene regulation that consider aDNA remains an important area for future work.
Despite these limitations, we demonstrate the utility of considering regulatory effects of variants in combination in ancient individuals. In particular, we show that changes in gene regulation were essential to many, but not all, recent human adaptations. The frequent occurrence of metabolic and immune genes among the most divergently regulated genes between ancient lifestyles underscores the contribution of gene regulation to adaptation to the substantial changes in lifestyle that the shift from nomadic hunting and gathering to stationary farming had on humans. Our targeted analysis of diet genes with evidence of results adaptive evolution further suggests that adapting to diets with different nutrient and fat compositions required population-level shifts in the regulation of many metabolic genes. In contrast, the lack of consistent gene regulatory changes in skin pigmentation genes suggests that adaptation in this trait was mainly mediated by coding variants.
Lifestyle and sun exposure are not the only variables that differ among the ancient humans with genetic information, and more diverse aDNA data are rapidly becoming available. Therefore, extending this analysis to ancient individuals across other evolutionary shifts will promising. It will also be informative to expand studies into non-European populations, both ancient and present-day, to learn when gene regulatory shifts are unique to specific populations or shared.
Overall, this study demonstrates the power of focusing evolutionary analyses on combinations of variants with established relationships to molecular phenotypes. Our approach is well-positioned to use the increasing availability of present-day and ancient genome data to provide both mechanistic explanations of selection signals and to generate hypothesis about phenotypic differences between ancient and present-day groups. While this study focused on gene regulatory shifts in response to changes in lifestyle and temporal shifts in regulation of skin pigmentation genes, similar methods could be applied in many other questions and sets of ancient samples. Given the importance of gene regulation in recent evolution, this is a necessary step in identifying and interpreting candidate regions that have been shaped by recent human evolution. Further analyses using this approach will contribute to understanding the genome’s response to large-scale environmental changes and the influence of these changes on humans today.
Methods
Ancient genotype and lifestyle data
For the lifestyle analyses, we obtained ancient human genotypes from a set compiled and analyzed by the Allen Ancient DNA Resource (v42.4; accessed March 1, 2020), then lifted them over the Genome Build hg38 using liftOverPlink. We filtered out samples that did not pass their QC procedure and ranked remaining samples by genotype count (i.e., the number of variants with a genotype call). We also filtered samples by their continent of origin, and primarily focused on 490 ancient Eurasians. (A FADS1 analysis additionally considered 32 ancient Africans.) For a present-day comparison, we used genome for 503 European samples from the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015).
We manually assigned ancient samples to lifestyle groups by literature review based on archaeological information about the site and previous research about the associated culture. More specifically, we used lifestyles as assigned by the original publication of the sample where available. We then propagated those lifestyle labels to other samples based on the associated culture (again, as assigned by the original publication), then conducted a further literature review to match any unassigned cultures to a lifestyle based on similarity to those already matched. Samples were removed from consideration when there was not enough lifestyle-related evidence to make a call. The distinction between pastoral and agricultural groups was often difficult, and when there was ambiguity the groups were preferentially assigned to the agricultural category (Supplemental File S1).
Adapting PrediXcan for aDNA
Final models for aDNA-based gene regulation prediction
The set of models used for all lifestyle analyses were trained on whole genome sequencing and RNA-seq data from GTEx v8 for 49 tissues using variants that were genotyped by first enriching for the targeted variants (“1240k set”) (Fu et al., 2015; Haak et al., 2015). For each tissue, we considered only models that explained a significant amount of variance (FDR < 0.05, r2 > 0.01). In addition, we further required that each 1240k-trained model maintain high correlations with the original GTEx model (r > 0.5) over all 2504 1kG individuals. All LD calculations for variants in all 1kG Populations were made using LDLink (Machiela and Chanock, 2015).
The set of models used to study skin pigmentation were trained on genotype and RNA-seq data collected from melanocytes from 106 male skin samples (Zhang et al., 2017). We imputed all genotypes to 1000 Genomes using the NIH TOPMed server (Das et al., 2016) with the following settings: ref: 1kG Phase 3 v5; pop = other/mixed; rsq filter 0.001; phasing = eagle v2.4. We filtered genes to those with measured expression in at least 10 samples, with RSEM > 0.5 and > 6 reads, then each gene was inverse quantile normalized to a standard normal distribution across samples. We then corrected for ancestry using the first 3 principle components and 10 PEER factors. We trained the PrediXcan models using only ~1,240,000 SNPs that were genotyped by first enriching for those targeted SNPs (“1240k set”) (Fu et al., 2015; Haak et al., 2015), and included any gene for which the model was able to explain a nominally significant amount of variance in the observed expression (P < 0.05). We focused on a set of 17 genes (Sturm and Duffy, 2012) involved in skin pigmentation for which we were able to build models.
We abbreviate the 49 GTEx tissues considered as follows: Adipose - Subcutaneous: ADPS, Adipose - Visceral Omentum: ABPV, Adrenal Gland: ADRNLG, Artery - Aorta: ARTA, Artery - Coronary: ARTC, Artery - Tibial: ARTT, Brain - Amygdala: BRNAMY, Brain - Anterior Cingulate Cortex: BRNACC, Brain - Caudate: BRNCDT, Brain - Cerebellar Hemisphere: BRNCHB, Brain - Cerebel lum: BRNCHA, Brain - Cortex: BRNCTX, Brain - Frontal Cortex: BRNFCTX, Brain - Hippocampus: BRNHPP, Brain - Hypothalamus: BRNHPT, Brain - Nucleus Accumbens basal ganglia: BRNNCC, Brain - putamen basal ganglia: BRNPTM, Brain- Spinal Cord Cervical C-1: BRNSPN, Brain- Substantia Nigra: BRNSN, Breast: BREAST, Cells - Transformed Fibroblasts: FIBS, Colon - Sigmoid: CLNS, Colon - Transverse: CLNT, Esophagus - Gastroesophageal Junction: ESPGJ, Esophagus - Mucosa: ESPMC, Esophagus - Muscularis: ESPMS, Heart - Atrial Appendage: HRTAA, Heart - Left Ventricle: HRTLV, Kidney Cortex: KDNY, Liver: LIVER, Lung: LUNG, Minor Salivary Gland: MNRSG, Cells- EBV-transformed Lymphocytes: LYMPH, Ovary: OVARY, Pancreas: PNCS, Pituitary: PTTY, Prostate: PRSTT, Skeletal Muscle: MSCSK, Skin - Not sun-exposed: SKINNS, Skin - Sun-exposed: SKINS, Small Intestine: SMINT, Spleen: SPLEEN, Stomach: STMCH, Testis: TESTIS, Thyroid: THYROID, Tibial Nerve: NERVET, Uterus: UTERUS, Vagina: VAGINA, Whole Blood: WHLBLD.
Evaluating strategies for applying PrediXcan to aDNA
To evaluate the performance of different strategies for training PrediXcan regulation prediction models and applying them to aDNA, we carried out several simulations. In the random simulations, for each percentage missing threshold, we randomly selected 20 European individuals from 1kG (The 1000 Genomes Project Consortium, 2015), then randomly removed that percentage of genotype calls from their genomes before applying PrediXcan models to the simulated genomes (Supplementary Fig. 1). For each downsampled genome, we calculated a Spearman correlation between the predicted regulation of each gene in four tissues for the downsampled vs. the full genome. Thus, each box in Supplementary Fig. 2A has 80 (20 × 4) points. We then calculated the Spearman correlation between the median correlation between downsampled and full model predictions for each threshold and the percentage of variants missing at that threshold.
We also simulated missing data by matching patterns of missing variants from aDNA samples (Supplementary Fig. 1B). We used 3383 ancient human samples compiled and made available by the Allen Ancient DNA Resource on March 1, 2020 (v42.4). We selected three random Europeans from 1kG, then for each ancient sample we created three matching masked genomes that were missing exactly the same variants. For each masked genome, we calculated the Spearman correlation between the predicted regulation of each gene in all four tissues for the masked vs. the full genome (i.e. one correlation per individual).
We also evaluated three different sets of variants for training PrediXcan models. The “full set” consisted of all variable sites identified in GTEx v8 (this included both single nucleotide variants and short indels in hg38 coordinates). The “1240k set” was formed by intersecting the full set with the variants genotyped on the 1240k chip, which totalled 714,959 variants after lifting them over to hg38. Lastly, we assembled the “top600k set” of variants, which is a subset of the 1240k set with high “support”. We calculated the “support” for each variant over N aDNA samples as , where NumV ars is the number of variants successfully called in sample n. In other words, support for a variant is the number of samples in which that variant was successfully genotyped, weighted by the quality (i.e., number of genotyped variants) of the sample. A variant can therefore obtain a high support either by being genotyped in many low-quality samples, or in fewer high-quality samples. We ranked the variants by their support. We identified the top 600k variants, and for the purposes of simulating the behaviour of models when applied to incomplete data, we also considered the top 500k variants with the highest support (“top500k”; N = 499,666). For each set of variants, we trained a set of models and created a set of 1kG genomes masked to only include those variants (Fig. 1A). We assessed the performance of combinations of models and genomes by calculating the correlation of predictions made by each model-genome pair with predictions made by the Full models on the Full 1kG Genomes (i.e. one correlation was calculated per individual 1kG sample for each pair).
Identifying divergent gene regulation between ancient lifestyles
To identify genes with evidence for divergence in predicted gene regulation between the three lifestyle groups, we applied a Kruskal-Wallis test for the predictions of each gene model over individuals from each group. We accounted for multiple testing with a Bonferroni correction within each tissue. Genes passing the correction in at least one tissue are said to show evidence for a significant difference in regulation. To further isolate the genes that are the most likely to be diverged due to selection rather than drift, used genomic control to correct for population stratification by calculating an inflation factor λ and recalculating p-values based on the distribution of χ2/λ (Devlin and Roeder, 1999). To focus our discussion on the genes with the strongest evidence for divergence, we sorted all models by GC-corrected P-value and identified the top 500 unique genes (corresponding to 1236 models), which corresponded to those with at least one model with a GC-corrected P < 3.46 × 10−3 and FDR=0.586.
Gene set enrichment among diverged genes
To conduct functional enrichment analyses on the top 500 most diverged genes, we tested for GO annotation over-representation using WebGestalt with default parameters (Liao et al., 2019) Specifically, we compared the biological process GO terms among the 500 most diverged genes versus all genes with a model in at least one tissue. We also tested for enrichment of several other gene sets of interest: 1) genes whose expression in particular tissues is under stabilizing selection across 17 mammalian species (Chen et al., 2018); genes that are intolerant to loss-of-function variants in their protein products (called if the upper bound of the 95% confidence interval of the observed/expected ratio is lower than 0.35) (Lek et al., 2016); 3) housekeeping genes that show consistent expression across tissues (Eisenberg and Levanon, 2013); and 4) a set genes encoding virus interacting proteins (Enard et al., 2016). We calculated an odds ratio for each, and used a Fisher’s exact test to determine significance. For the genes under stabilizing selection on gene expression, we considered only those tested in that study before calculating statistics.
Skin pigmentation time series data and analysis
We obtained ancient human genome data from the Allen Ancient DNA Resource (v44.3; accessed February 8, 2021). We filtered for individual human samples from Europe (west of 59° East), and in the case of duplicate individuals chose the sample with the highest average coverage. We filled in missing dosages using the mean dosage across the other samples. This resulted in 2999 ancient Europeans, to which we added 503 European samples from the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015) to construct a time series ranging from 38,052 yBP to present (31 samples were older than 15,000 yBP).
To identify genes which showed a systematic change in regulation over time, we obtained predicted regulation values for each gene in each individual using the melanocyte PrediXcan models. We then regressed the predicted regulation on the date of the sample using a linear regression framework, including the first 10 principle components to correct for ancestry. We further controlled for population stratification using genomic control (Devlin and Roeder, 1999), and identified the skin pigmentation genes for which the effect size of date was significant (corrected P < 0.05). We additionally compared the predicted regulation of TYR in all 2504 individuals from the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015), separated by continental ancestry.
Data and Code Availability
All data and scripts are available on Github at https://github.com/colbrall/ancient_human_predixcan and https://github.com/colbrall/skin_pigmentation_regulation
Author Contributions
L.L.C., I.M. and J.A.C. designed the experiments and wrote the manuscript. M.R.J. designed the simulation experiments and conducted pilot simulation analyses. L.L.C conducted all other experiments. All authors edited and approved the final manuscript.
Competing Interests
The authors report no conflicts of interest.
Supplementary Materials
Evaluating the robustness of gene regulation prediction models to missing data
There are several challenges involved in adapting PrediXcan to be applied to ancient DNA. First, ancient individuals may be genetically diverged from the populations used to train the prediction models. Recent evaluations of the portability of these models across modern human populations have shown that while their accuracy decreases when applied across human populations, most models maintain substantial predictive ability that enables the discovery of meaningful biological associationsOkoro et al. (2021); Petty et al. (2019). We have also demonstrated that this approach can be applied to archaic hominin genomesColbran et al. (2019). The ancient Eurasian individuals we analyze here are less genetically diverged from the training population than modern African-ancestry individuals or archaic hominins. In addition, we stress that these models do not aim to predict gene expression itself, but rather to quantify the common variant mediated component of gene regulation, and thus, they provide an means to detect shifts in regulatory architecture.
Second, available aDNA data varies in coverage, depth, and quality. This creates a trade-off between number of individuals available for analysis and the quality of their genotyping. In addition, many other potential applications of PrediXcan involve populations that differ from the training populations. In these situations, even if the populations are of similar ancestry, all the variants the models use to predict gene expression may not be assayed in the population of interest. Therefore it is of great interest to understand how PrediXcan behaves under these conditions with varying levels of missing variant data, and develop ways to optimize its performance in such cases.
To better understand and quantify these patterns, we first trained PrediXcan models on expression data and all available variants in GTEx v8 (“Full models”). We then applied these prediction models to all variants and downsampled sets of variants from individuals with whole-genome sequencing from the 1000 Genomes Project (1kG)The 1000 Genomes Project Consortium (2015). This enabled us to evaluate how predictions change with missing variant data. We selected nine thresholds for percentage of missing SNPs (5%-45% missing) and downsampled 20 random European individuals per threshold (Supplementary Fig. 1A). The agreement between the predictions on downsampled genomes and full genomes was strongly correlated with the percentage of SNPs missing (Supplementary Fig. 2B). However, Spearman correlations were above 0.75 for all comparisons, even when genomes were missing as many as 45% of their SNPs. This suggests that model accuracy can be maintained even at relatively high rates of missingness, likely due to LD between variants.
While this is encouraging, missing SNPs may not be randomly distributed throughout the genome in some applications. To evaluate how biases in missingness across the genome could affect these results, we repeated the comparison described above. However, instead of randomly downsampling SNPs, we matched the patterns of missingness to the dataset (Fig. 1B) of particular interest to us: 3383 aDNA samples, with widely varying numbers of missing SNPs (Supplementary Fig. 2C). Overall, the correlations were much lower (median Spearman ρ=0.39; Fig. 1B). This is unsurprising given that the aDNA samples were obtained using targeted capture, while the model training data was based on whole genome sequencing. At most, the aDNA samples had 714,959 SNPs, while the training data had over 5 million (i.e. 87% missing). This indicates that, while models can tolerate a fair amount of missing data, the missingness caused by a mismatch between genotyped SNPs and training SNPs is likely to substantially decrease prediction accuracy.
We next evaluated whether imputation performance could be improved in our aDNA application by customizing the training data to contain only variants that will be available in the application data. This step ensures that any variants used that are not assayed in the application data are not used in model training. Thus, we retrained PrediXcan models using GTEx v8, but only considered the SNPs present in the 1240k capture set that is commonly used in aDNA studies (“1240k set”; Supplementary Fig. 1B). This resulted in 714,959 variants for model training (the number that were successfully lifted over to the hg38 genome build), as opposed to the 5,310,489 variants available for the full models based on whole genomes. Because most of the genotyped aDNA samples tend to be low-coverage (Supplementary Fig. 2C), we also tested the use-case where we chose the SNPs in the dataset most likely to provide information. To do this, we ranked SNPs by the number of samples in the 3383 ancient samples used above with genotype calls, and weighted that count by the overall coverage of those samples. We then chose the 600,000 SNPs with the best ranking (“top600k set”), thereby prioritizing SNPs that were frequently present in the best-quality samples.
Training models on fewer SNPs resulted in a small decrease in the variance explained in gene expression for some genes (Fig. 3), and fewer significant models were constructed in each tissue (7184, 6587, and 5196 significant models for Full, 1240k, and top600k respectively in Whole Blood). This also resulted in inconsistent predictions when models were applied to the same individuals (Fig. 4). This behaviour is caused by the decrease in available SNPs for modelling, which is reflected in the smaller models built in the 1240k and top600k sets (Fig. 5A,C). While many genes continued to be successfully modelled by the targeted models, restricting to fewer SNPs resulted in a loss of genes for which the Full models had a lower r2.
While these results demonstrate that using more SNPs during training results in more numerous and accurate models, this does not take into account the presence of missing data in the genomes that will be used for predictions. Therefore the outstanding question is whether model performance is more consistent when trained on fewer SNPs without a large missing data percentage, or if it is better to include more SNPs during training, but allow more missing data during application. To answer this, we applied our targeted models described above to 1kG genomes downsampled to match the various sets of SNPs used during model training and compared their agreement with the full models applied to the full genomes. For the top600k models, we further downsampled the application set to 500k SNPs using the same methodology.
In line with our initial findings, we found that all models lost consistency when applied to genomes with missing SNPs (Fig. 1C). The 1240k models maintained the highest agreement with the Full models when applied to incomplete data (median ρ=0.78, vs 0.58 and 0.41 for Full and top600k models, respectively), though they also had a larger variance in agreement. The Full models likely did worse because there was a much larger drop in the number of SNPs available compared to training, while the top600k models’ reliance on fewer SNPs may have increased their susceptibility to missing SNPs. This suggests that a balance between targeting model training for the dataset in question and allowing some missingness is the best course of action. For analyses presented in this paper, we therefore applied models targeted to the 1240k variant set to ancient individuals with relatively high coverage (above the 3rd quartile among all samples available).
Supplementary Figures
Acknowledgements
We would like to thank Eric Gamazon for discussions and help with data acquisition, as well as Dan Zhou for help in debugging the PrediXcan training pipeline. We would also like to thank current and former members of the Capra and Mathieson labs for useful discussions of the project and for providing comments on the figures and manuscript.
L.L.C. was funded by NIH grant T32GM080178 to Vanderbilt University and T32HG009495 to the University of Pennsylvania. I.M was funded by NIH award R35GM133708. J.A.C. was funded by NIH award R01GM115836 and R35GM127087, and the Burroughs Wellcome Fund.
This work was conducted in part using the resources of the Advanced Computing Center for Research and Education at Vanderbilt University, Nashville, TN. The research is solely the responsibility of the authors and does not necessarily represent the official views of Vanderbilt University Medical Center, the National Institutes of Health, or other funding agencies.