Abstract
Modern humans carry Neanderthal and Denisovan (archaic) genome elements that may be the result of environmental adaptation. These effects may be particularly evident in pharmacogenes, the genes responsible for the processing of exogenous substances such as food, pollutants, and medications. However, the health implications and contribution of archaic ancestry in pharmacogenes of modern humans remain understudied. We characterize eleven key cytochrome P450 (CYP450) genes involved in drug metabolizing reactions in three Neanderthal and one Denisovan individuals and examine archaic introgression in modern human populations. We infer the metabolizing efficiency of these eleven genes in archaic individuals and show important genetic differences relative to modern human variants. We identify archaic-specific SNVs in each CYP450 gene, including some that are potentially damaging, which may result in altered metabolism in modern humans carrying these variants. We highlight four genes that display interesting patterns of archaic variation: CYP2B6, in which we find a large number of unique variants in the Vindija Neanderthal, some of which are shared with a small subset of African modern humans; CYP2C9, which contains multiple variants that are shared between Europeans and Neanderthals; CYP2A6, in which the *12 variant, defined by a hybridization event, was found in humans and Neanderthals, suggesting the recombination event predates both species; and CYP2J2, in which we hypothesize a Neanderthal variant was re-introduced in non-African populations by archaic admixture. The genetic variation identified in archaic individuals implies environmental pressures that may have driven CYP450 gene evolution.
Introduction
The cytochrome P450 (CYP450) genes encode oxidase enzymes that function in metabolism of endogenous small molecules and in detoxification of exogenous (or xenobiotic) compounds. This gene family is present in all mammals, with 57 active genes and 58 pseudogenes coding for CYP450 enzymes in humans (Thomas 2007). Xenobiotic-substrate enzymes have been studied extensively because of their roles in the absorption, distribution, metabolism, and excretion (ADME) of pharmaceuticals and drug development. The evolution of xenobiotic CYP450 enzymes may be driven by an organism’s need to metabolically detoxify foreign compounds, often toxic chemicals produced by plants, fungi, and bacteria, in the local environment. The CYP450 genes show evidence of positive selection and high allele frequency variation in humans (Thomas 2007; Fuselli et al. 2010). It has been suggested that the shift from hunting and gathering to food production in humans may have profoundly changed the selective effect of some CYP450 enzymes (Fuselli et al. 2010; Fuselli 2019). Although these genes have been studied extensively in humans and other mammals, the extent of genetic variation in archaic individuals and its relation to modern human variation has not been addressed. Genetic variation in ADME genes varies extensively across modern human populations, and identifying alleles inherited through archaic introgression may be informative about the origin of specific pharmacogenetic (PGx) variants and their resulting phenotypes (e.g., metabolizer status).
Throughout human evolutionary history, our species has adapted to numerous distinct and varied challenging environments, which present the need to metabolize new xenobiotic substances, such as food, pollutants, and medications (Fan et al. 2016). Modern humans encountered new environmental challenges as they dispersed first throughout and then outside of the African continent, but they also encountered other hominin species already adapted for life in those regions of the world: Neanderthals and Denisovans (Durvasula and Sankararaman 2020; Bergström et al. 2021). The direct sequencing of multiple Neanderthal and Denisovan genomes has revealed a complex history of admixture between these archaic humans and the ancestors of modern humans (Browning et al. 2018; Villanea and Schraiber 2018). Most modern humans carry a small but significant portion of archaic ancestry, which has been targeted by natural selection (Sankararaman et al. 2016). Purifying selection — or negative natural selection — has removed most archaic variants in functional genomic regions (Petr et al. 2019; Zhang et al. 2020; Schaefer et al. 2021), while some archaic variants may have been lost through genetic bottlenecks or drift in modern human populations. However, there are functional regions for which archaic variants are found at very high frequency in living humans through the effects of positive natural selection (Mendez et al. 2012; Huerta-Sánchez et al. 2014; Racimo et al. 2015; Dannemann et al. 2016). One of the most well-known examples of adaptation through archaic admixture is high-altitude adaptation in Tibetans, a population in which the Denisovan variant of the gene EPAS1 is found at extremely high frequency, and has conferred them with resistance to hypoxic stress in high-altitude environments (Huerta-Sánchez et al. 2014).
The CYP450 genes are a prime target for adaptation through archaic introgression as humans encountered novel exogenous substances as they expanded their range. Neanderthals and Denisovans may have possessed CYP450 variants fine-tuned to metabolizing substances found in their native habitats of Eurasia and Siberia. As modern humans expanded outside of Africa, they likely faced novel environmental factors which may have influenced selective pressures. Advantageous archaic CYP450 variants introduced to modern humans through admixture may have been retained in the modern human gene pool by natural selection.
Here, we investigate the evolution and predicted phenotypes of CYP450 genes in archaic individuals to increase our understanding of introgressed genetic variants and the different selective regimes and environments acting on these enzymes. We examine genetic variation and predict metabolizer phenotypes in eleven CYP450 genes in three Neanderthal and one Denisovan individuals using publicly available high-resolution whole genome sequencing data. In addition, we examine the effect of archaic introgression in modern human populations for these eleven genes. These 11 genes encode enzymes that when combined are responsible for up to 75% of the metabolism of commonly prescribed drugs (Evans and Relling 1999), which could have implications for modern human health and disease as well as drug safety and efficacy.
Results
Archaic and shared human CYP450 variation
In this study, we investigated genetic variation in eleven CYP450 genes: CYP1A2, CYP2A6, CYP2B6, CYP2C8, CYP2C9, CYP2C19, CYP2D6, CYP2E1, CYP2J2, CYP3A4, and CYP3A5. We identified a total of 1,623 single nucleotide variants (SNVs) in the four archaic individuals (one Denisovan from Denisova Cave and three Neanderthals from the Vindija, Denisova Cave/Altai, and Chagyrskaya sites) for the eleven CYP450 genes investigated (Figure 1, Supplemental Table 1). Of the variants identified in the archaic individuals, 81.2% (n=1318) were intronic, 6.8% (n=111) were in the promoter region, 4.7% (n=77) were exonic, 5.1% (n=83) were in the untranslated regions (3’/5’-UTR), 1.9% (n=31) were non-coding RNA (ncRNA), and 0.2% (n=3) were splicing variants (Figure 1). Across all genes investigated, the Vindija Neanderthal presented the highest number of total variants (n=943), followed by the Chagyrskaya Neanderthal, Altai Neanderthal, and the Denisovan individual (n=667, n=584, and n=574, respectively).
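The category percentages above follow directly from the reported counts. The short sketch below recomputes them as a consistency check; the counts are taken from the text, while the dictionary keys and the function name are illustrative choices rather than part of the original analysis.

```python
# Counts of archaic SNVs per annotation category, as reported in the text.
counts = {
    "intronic": 1318,
    "promoter": 111,
    "exonic": 77,
    "UTR": 83,
    "ncRNA": 31,
    "splicing": 3,
}

# Sum of all categories; the text reports 1,623 SNVs in total.
total = sum(counts.values())


def category_shares(category_counts):
    """Return each category's share of the total as a percentage (1 decimal)."""
    n = sum(category_counts.values())
    return {name: round(100 * c / n, 1) for name, c in category_counts.items()}


shares = category_shares(counts)
for name, pct in shares.items():
    print(f"{name}: {pct}% (n={counts[name]})")
```

The six categories sum to the reported total, and the rounded shares agree with the percentages given in the paragraph.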
Structural variation (SV) was additionally investigated in each of the eleven CYP450 genes in archaic individuals using sequencing read data and previously validated methods (Lee, Wheeler, Patterson, et al. 2019; Lee, Wheeler, Thummel, et al. 2019). SV was only observed in CYP2A6 and CYP2D6 in Neanderthal individuals (Figure 2, Supplemental Figure 1). The Altai and Vindija Neanderthal individuals exhibited a greater structural variation burden, presenting the homozygous partial deletion hybrid variant CYP2A6*12/*12 (Figure 2a-b) and a gene multiplication event: CYP2D6*2/*2×4 for Altai (i.e., 4 copies of CYP2D6) and CYP2D6*2/*2×3 for Vindija (Supplemental Figure 1a-b). The Chagyrskaya Neanderthal was heterozygous with a CYP2A6*1/*12 diplotype (Figure 2c), and no SV was identified in its CYP2D6*2/*41 diplotype (Supplemental Figure 1c). The Denisovan individual did not show any copy number variation in any of the eleven CYP450 genes investigated (Figure 2d, Supplemental Figure 1d).
To further assess the genomic landscape of the archaic individuals in the context of modern humans, we identified variation in modern human individuals from the 1000 Genomes Project (Supplemental Figure 2, Supplemental Table 2a, 1000 Genomes Project Consortium et al. 2015), as well as Papuans sequenced as part of the Simons Genome Diversity Project (Mallick et al. 2016) and compared the archaic and modern human CYP450 haplotypes. To determine genetic variation that could be a result of potential introgression, we identified archaic variants that were present in non-African modern human populations. As the majority of archaic admixture occurred in Eurasia, shared genetic variants between archaic humans and Africans are expected to be shared ancestrally, but variants exclusively shared between non-Africans and archaic humans are expected to be the result of archaic admixture (see U statistic in Racimo et al., 2016). A total of 155 archaic SNVs were shared between archaic and modern humans (Table 1, Supplemental Table 3); 140 of these SNVs were shared only between Neanderthals and non-African modern humans, 11 were shared between only the Denisovan individual and non-African modern humans, and 4 of the shared SNVs were identified in the Neanderthals, the Denisovan, and non-African modern humans (Supplemental Table 3). CYP2C19 contained the greatest number of archaic variants shared with non-African modern humans (n=64). Although most shared variants were present at a frequency of less than 2% in any non-African modern human population, some CYP450 genes contain archaic variants that are found in human populations with a frequency of 10% and greater (n=41). 
These variants include SNVs in CYP2C8 and CYP2J2 that were at frequencies greater than 10% in Europeans and admixed Americans, SNVs in CYP2C9 that were at elevated frequencies in Europeans, admixed Americans, and South Asians (including rs1799853, the causative SNV for the CYP2C9*2 haplotype, that was identified at frequencies of 8-15% in European populations and admixed American populations), and a SNV in CYP1A2 (rs2470890) found at elevated frequencies in all non-African populations (Supplemental Table 3). Of the 41 SNVs found at elevated frequencies, only four were exonic (rs2470890 in CYP1A2, rs10509681 in CYP2C8, rs11572080 in CYP2C8, and rs1799853 in CYP2C9) and the remainder were intronic. Of all the non-African modern human populations, Papuans had the fewest shared archaic SNVs, but for these sites, archaic variants were observed at higher allele frequencies than any other human population, and all but one of these archaic variants were exclusive to Papuans (Supplemental Table 3). For example, we identified the exonic variant rs3915951 in CYP2D6 that had an allele frequency of 12.5% (n=1) in Papuans and was not found in any other modern human population (Supplemental Table 3). This observation of high frequency of archaic alleles exclusive to Papuans is consistent with either founder effects or a higher degree of archaic admixture into their ancestral populations.
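The sharing criterion described above (archaic alleles present in non-African populations but absent from Africans) can be sketched as a simple filter. The example below is a minimal illustration under stated assumptions: the `Site` record, the population labels, the site identifiers, the allele frequencies, and the `max_african_freq` threshold are all hypothetical and are not taken from the study's pipeline or data.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    site_id: str            # hypothetical identifier, not a real rsID
    in_neanderthal: bool    # archaic allele seen in any Neanderthal genome
    in_denisovan: bool      # archaic allele seen in the Denisovan genome
    freq: dict = field(default_factory=dict)  # population label -> allele frequency


def sharing_class(site, african_label="AFR", max_african_freq=0.0):
    """Classify a site as exclusively shared between archaic and non-African genomes.

    Returns None when the allele is not archaic, exceeds the allowed frequency in
    Africans, or is absent from every non-African population.
    """
    if not (site.in_neanderthal or site.in_denisovan):
        return None
    if site.freq.get(african_label, 0.0) > max_african_freq:
        return None
    non_african = [f for pop, f in site.freq.items() if pop != african_label]
    if not any(f > 0 for f in non_african):
        return None
    if site.in_neanderthal and site.in_denisovan:
        return "neanderthal_and_denisovan"
    return "neanderthal_only" if site.in_neanderthal else "denisovan_only"


# Toy sites with made-up frequencies, for illustration only.
sites = [
    Site("site_a", True, False, {"AFR": 0.0, "EUR": 0.12, "EAS": 0.01}),
    Site("site_b", False, True, {"AFR": 0.0, "PAP": 0.125}),
    Site("site_c", True, True, {"AFR": 0.0, "SAS": 0.02}),
    Site("site_d", True, False, {"AFR": 0.30, "EUR": 0.25}),  # also in Africans
    Site("site_e", True, False, {"AFR": 0.0, "EUR": 0.0}),    # archaic-specific
]
classes = {s.site_id: sharing_class(s) for s in sites}
print(classes)
```

A strict threshold of zero in Africans is the simplest reading of "exclusively shared"; a small non-zero tolerance could be passed through `max_african_freq` if low-level back-migration is a concern.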
Super-divergent CYP2B6 haplotypes shared between the Vindija Neanderthal and African individuals
The surprisingly large number of heterozygous SNVs in the Vindija Neanderthal CYP2B6 gene was an outlier relative to other Neanderthal diplotypes and warranted further scrutiny. We identified 334 variant sites in the Vindija Neanderthal, while the other three archaic genomes had fewer than 100 variants each (Denisovan: n=75; Altai: n=72; and Chagyrskaya: n=76, Figure 1). Of the Vindija variant sites, 91.9% (307/334) were heterozygous (Figure 3). To determine how the divergent Vindija CYP2B6 haplotype compared to other regions of the genome, we calculated the pairwise distance between the Chagyrskaya Neanderthal and Vindija Neanderthal in 29.1 kb windows (the length of the CYP2B6 gene) across the genome and identified the Vindija CYP2B6 region as within the top 1% of windows based on pairwise distance between Neanderthals (pairwise distance = 0.00175, Supplemental Figure 3).
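The window-based comparison can be illustrated with a small sketch. It assumes, for simplicity, that pairwise distance is the number of differing sites in a window divided by the window length; the definition actually used in the study is given in its Methods and may differ (for example in how unphased heterozygous genotypes are counted). The function names and the toy coordinates are hypothetical.

```python
def windowed_distance(diff_positions, region_length, window=29_100):
    """Per-window pairwise distance: differing sites divided by window length.

    diff_positions are 0-based coordinates where the two genomes differ.
    Only complete windows are scored; positions past the last complete
    window are ignored.
    """
    n_windows = region_length // window
    counts = [0] * n_windows
    for pos in diff_positions:
        idx = pos // window
        if idx < n_windows:
            counts[idx] += 1
    return [c / window for c in counts]


def fraction_at_or_above(values, target):
    """Fraction of windows whose distance is at least the target value."""
    return sum(v >= target for v in values) / len(values)


# Toy example: a 5,000 bp region scored in 1,000 bp windows (made-up positions).
toy_diffs = [10, 20, 1500, 4200, 4300, 4999, 5000]
distances = windowed_distance(toy_diffs, region_length=5000, window=1000)
rank = fraction_at_or_above(distances, distances[4])
print(distances, rank)
```

A window falls in the top 1% when the fraction of windows with a distance at least as large is 0.01 or less; in the toy example the last window is the most divergent of five.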
Given that CYP2B6 shares significant homology with the pseudogene, CYP2B7, which is located nearby (40.6 kb), we wanted to ensure that the elevated variation identified in CYP2B6 in the Vindija Neanderthal was not an effect of read mis-mapping with the paralog gene (Zanger and Klein 2013). To confirm a low likelihood of paralog gene mis-mapping error, we assessed the read depth at the CYP2B6 locus in the Vindija Neanderthal genome and identified no read depth elevation nor structural variation present (Supplemental Figure 4).
Furthermore, we identified 11 individuals from 1000 Genomes Project samples (1000 Genomes Project Consortium et al. 2015) who carry a related haplotype, all of whom also show non-elevated read depth and an increase in variation similar to that of the Vindija Neanderthal (Figure 3, Supplemental Figure 4, Supplemental Table 2b). Human individuals carrying this haplotype were found exclusively in African populations (ASW, ESN, GWD, LWK, MSL, YRI), at low frequencies (Supplemental Table 2b). The two divergent CYP2B6 haplotypes (in the Vindija Neanderthal and in the 11 Africans) uniquely share 206 SNVs; however, each also carries SNVs that are private to that haplotype (92 African and 58 Vindija SNVs, Supplemental Figure 5). By comparison, the other archaic individuals harbor far fewer unique SNVs in their CYP2B6 haplotypes (26 SNVs across the three individuals).
Phased Diplotypes and Predicted Metabolizer Phenotypes
Pharmacogene haplotypes are typically identified as star alleles: haplotype patterns composed of SNVs, indels, and structural variants (SVs) in pharmacogenes that are usually associated with enzyme activity levels. We identified the star allele composition of the eleven CYP450 genes in archaic individuals from variant and sequencing read data with previously validated methods (Lee, Wheeler, Patterson, et al. 2019; Lee, Wheeler, Thummel, et al. 2019). We report the phased diplotype and predicted phenotype for each individual based on the two primary star haplotypes called by Stargazer (Lee, Wheeler, Patterson, et al. 2019; Lee, Wheeler, Thummel, et al. 2019, Figure 4, Supplemental Table 4).
All three Neanderthal individuals displayed the reference diplotype (*1/*1) for CYP2J2 and CYP3A5 and presented the following non-reference homozygous diplotypes: CYP1A2*1F/*1F, CYP2B6*22/*22, CYP2C8*3/*3, CYP2C19*8/*8, and CYP2E1*7/*7 (Figure 4). The Altai and Vindija Neanderthals presented the homozygous SV diplotype CYP2A6*12/*12, while the Chagyrskaya Neanderthal was heterozygous for *12 (CYP2A6*1/*12, Figure 4). For CYP2C9, the Chagyrskaya Neanderthal presented a *2/*2 diplotype, the Altai Neanderthal presented a *1/*2 diplotype, and we were unable to accurately determine the diplotype for the Vindija Neanderthal because of a lack of read coverage at specific star variant locations (see Methods, Figure 4). The Altai and Vindija Neanderthals presented gene copy number variation for CYP2D6, presenting *2/*2×4 and *2/*2×3 diplotypes, respectively; the Chagyrskaya displayed a CYP2D6*2/*41 diplotype. For CYP3A4, the Chagyrskaya Neanderthal presented a reference diplotype (*1/*1), the Altai Neanderthal presented a *1/*1B diplotype, and the Vindija Neanderthal presented a homozygous *1B/*1B diplotype.
We additionally predicted metabolizer status phenotypes based on the diplotypes for each individual using Stargazer, as designated by the Clinical Pharmacogenetics Implementation Consortium (CPIC, https://cpicpgx.org, Figure 4). Although Stargazer is designed for phenotype prediction in modern humans, we used it to assess overall enzyme functionality in archaic individuals. All three Neanderthal individuals were predicted to have the same phenotype for the following enzymes: a normal metabolizer phenotype for CYP2J2 and CYP3A5; an intermediate metabolizer phenotype for CYP2C8 and CYP2E1; a slow metabolizer phenotype for CYP2A6; a poor metabolizer phenotype for CYP2C19; and an ultra-rapid metabolizer phenotype for CYP2B6. The CYP1A2 diplotype shared by the three Neanderthals currently has an unknown functional impact based on PharmVar (https://www.pharmvar.org/). For the CYP2C9 enzyme, the phenotype for the Chagyrskaya Neanderthal was predicted as a normal metabolizer, the Altai Neanderthal phenotype was predicted as an intermediate metabolizer, and the phenotype for the Vindija Neanderthal could not be accurately determined. For CYP3A4, the Chagyrskaya Neanderthal had a predicted normal metabolizer phenotype and the diplotypes for the Altai and Vindija Neanderthal individuals currently have an unknown functional impact. The gene copy number increase of CYP2D6 in the Altai and Vindija Neanderthal individuals predicts an ultra-rapid metabolizer phenotype, while the Chagyrskaya Neanderthal was predicted to have a normal metabolizer phenotype for this enzyme.
The Denisovan individual had notably different variation from that observed in the Neanderthal individuals (Figure 4). The Denisovan individual displayed the reference diplotype (*1/*1) for CYP2C8 and CYP2C19 and presented the following non-reference diplotypes: CYP1A2*1/*1F, CYP2A6*1/*9, CYP2B6*6/*6, CYP2C9*1/*8, CYP2D6*1/*4, CYP2E1*1/*7, CYP2J2*1/*7, CYP3A4*1B/*1B, and CYP3A5*1/*3 (Figure 4). The predicted phenotypes for the Denisovan individual showed a normal metabolizer phenotype for the CYP2C8, CYP2C9, CYP2C19, and CYP2E1 enzymes, an intermediate metabolizer phenotype for the CYP2B6, CYP2D6, and CYP3A5 enzymes, and a slow metabolizer phenotype for the CYP2A6 enzyme. The diplotypes identified for the CYP1A2, CYP2J2, and CYP3A4 enzymes currently have an unknown functional impact.
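To illustrate how a diplotype with copy number variation translates into a metabolizer phenotype, the sketch below applies an activity-score scheme to the four CYP2D6 diplotypes reported above. This is not Stargazer's implementation. The per-allele activity values and the phenotype cut-offs are assumptions based on the CPIC CYP2D6 genotype-to-phenotype convention as commonly summarized, and should be checked against the current CPIC tables. The copy counts reflect one reading of the text, in which *2/*2×4 denotes four gene copies in total and *2/*2×3 denotes three.

```python
# Assumed activity values per CYP2D6 star allele (to be verified against CPIC).
ACTIVITY_VALUE = {"*1": 1.0, "*2": 1.0, "*4": 0.0, "*41": 0.5}


def activity_score(allele_copies):
    """Sum the activity value of every gene copy in a diplotype."""
    return sum(ACTIVITY_VALUE[allele] * n for allele, n in allele_copies.items())


def phenotype(score):
    """Map an activity score to a metabolizer class (assumed cut-offs)."""
    if score == 0:
        return "poor"
    if score < 1.25:
        return "intermediate"
    if score <= 2.25:
        return "normal"
    return "ultra-rapid"


# Allele -> number of gene copies, read from the diplotypes in the text.
individuals = {
    "Altai": {"*2": 4},                    # *2/*2x4, four copies in total
    "Vindija": {"*2": 3},                  # *2/*2x3, three copies in total
    "Chagyrskaya": {"*2": 1, "*41": 1},    # *2/*41
    "Denisovan": {"*1": 1, "*4": 1},       # *1/*4
}

predicted = {name: phenotype(activity_score(d)) for name, d in individuals.items()}
print(predicted)
```

Under these assumptions the sketch gives the same CYP2D6 classes as reported in the text: ultra-rapid for Altai and Vindija, normal for Chagyrskaya, and intermediate for the Denisovan.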
Potentially Function-Altering SNVs
While Stargazer can make predictions from known star alleles, a number of novel missense variants were identified in the dataset. We identified potentially damaging or function altering SNVs in each gene using various programs that can predict the functional impact of genetic variation: Combined Annotation Dependent Depletion (CADD) integrates multiple weighted metrics to identify deleterious variants (Rentzsch et al. 2019; Rentzsch et al. 2021), Sorting Intolerant From Tolerant (SIFT) predicts the functional impact of amino acid substitution caused by genetic variation (Vaser et al. 2016), and Polymorphism Phenotyping (PolyPhen) identifies the predicted structural and functional outcome of amino acid substitution (Adzhubei et al. 2010). Variants considered potentially damaging or function-altering were defined as SNVs that are in the top 1% of deleterious variants by CADD score (PHRED-normalized CADD score ≥ 20), deleterious by SIFT prediction (SIFT score ≤ 0.05) or damaging by Polyphen2 prediction (Polyphen2 score between 0.15-1.0 as possibly damaging and 0.85-1.0 as probably damaging). We identified 23 potentially damaging variants in the eleven CYP450 genes investigated, of which 20 were considered deleterious by CADD (PHRED-normalized CADD score was ≥ 20), 16 were predicted by SIFT as deleterious, and 16 had a Polyphen prediction of damaging: 3 as “possibly damaging” and 13 as “probably damaging” (Figure 5, Supplemental Table 5). Of the 23 variants identified, 7 SNVs were considered novel and not observed in any variant annotation databases utilized (see Methods). CYP2D6 had the highest number of deleterious variants found in exonic and splicing sites (4 SNVs total) and the other genes each had 1-3 potentially deleterious variants. In the CYP2J2 gene, we have identified numerous deleterious SNVs in the Chagyrskaya and Vindija Neanderthal individuals in heterozygous form but did not find any of these Neanderthal deleterious SNVs in modern human populations. 
No potentially deleterious SNVs were identified in CYP2A6. Four SNVs were diagnostic for star allele haplotypes as discussed above (Figure 4), including the variant accounting for the CYP2C9*2 allele (rs1799853), the variant defining the CYP2C19*8 allele (rs41291556), and two variants (rs3892097 and rs1065852) that compose the CYP2D6*4 haplotype (Figure 5).
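The filtering criteria used in this section can be expressed as a small rule set. The sketch below flags a variant when any of the three predictors meets the thresholds stated in the text (PHRED-scaled CADD ≥ 20, SIFT ≤ 0.05, PolyPhen ≥ 0.15). Splitting the PolyPhen range at 0.85 into "possibly" and "probably" damaging is an interpretation of the stated ranges, and the function names and example scores are hypothetical.

```python
def polyphen_label(score):
    """Label a PolyPhen score using the cut-offs stated in the text."""
    if score is None:
        return None
    if score >= 0.85:
        return "probably damaging"
    if score >= 0.15:
        return "possibly damaging"
    return "benign"


def damaging_flags(cadd_phred=None, sift=None, polyphen=None):
    """Return the list of predictors that call a variant potentially damaging."""
    flags = []
    if cadd_phred is not None and cadd_phred >= 20:
        flags.append("CADD")
    if sift is not None and sift <= 0.05:
        flags.append("SIFT")
    label = polyphen_label(polyphen)
    if label in ("possibly damaging", "probably damaging"):
        flags.append(f"PolyPhen ({label})")
    return flags


# Made-up scores for three hypothetical variants.
examples = {
    "variant_1": damaging_flags(cadd_phred=24.1, sift=0.01, polyphen=0.97),
    "variant_2": damaging_flags(cadd_phred=12.0, sift=0.20, polyphen=0.40),
    "variant_3": damaging_flags(cadd_phred=5.3, sift=0.60, polyphen=0.02),
}
for name, flags in examples.items():
    status = "potentially damaging" if flags else "not flagged"
    print(name, status, flags)
```

A variant is retained if the returned list is non-empty, which matches the "or" logic of the criteria; missing scores (for example SIFT on a non-coding site) are simply skipped.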
Most archaic variants present in non-African modern humans were either intronic (n=140) or exonic synonymous (n=4) mutations, making their impact on metabolism unclear; however, we identify three exonic variants with a CADD score ≥ 20, suggesting that they are likely to impact function (rs3915951, rs41291556, and rs1799853, bolded in Supplemental Table 3). The first variant, rs3915951, is an exonic missense variant in CYP2D6 that was identified as heterozygous in a single Papuan. The second variant is the causative exonic missense SNV for CYP2C19*8 (rs41291556) and had a frequency of 1.5% in Central Europeans and 0.5% in the Telugu population in the 1000 Genomes Project dataset (Supplemental Table 3). The third variant is the causative missense SNV for CYP2C9*2 (rs1799853) and was found globally at less than 5% frequency, except in European and admixed American populations, where the allele frequency ranges from 8% to 15% (Supplemental Table 3).
Introgression of archaic CYP450 alleles into modern humans
Each CYP450 gene had a range of archaic SNVs shared with modern human populations, from 2 shared SNVs (CYP1A2) to 64 shared SNVs (CYP2C19, Table 1). The frequency of shared SNVs also varied extensively across global human populations, likely reflecting patterns of past gene flow with diverse archaic populations and genetic drift (Sankararaman et al. 2014; Sankararaman et al. 2016), but overall the shared archaic SNVs were generally at low frequency across modern human populations (Supplemental Table 3). To confirm whether the archaic CYP450 SNVs identified in modern human populations corresponded to an archaic haplotype inherited from Neanderthals, we calculated sequence divergence between the haplotypes of each CYP450 gene and the Vindija Neanderthal using Haplostrips (Supplemental Figure 6a-k). Interestingly, CYP2J2 showed evidence of haplotype sharing between the archaic individuals and modern humans: 8 archaic SNVs were shared between Neanderthals and all non-African modern human populations at frequencies higher than 2%, making it a possible example of archaic introgression of a functional gene (Supplemental Table 3). We visualized these distances grouped by geographic region (Figure 6, Supplemental Figure 6i), and found that all non-African super-populations carry Neanderthal-like CYP2J2 alleles at low frequencies: 0.4% in Southeast Asia, 4.8% in Europe, 3.9% in East Asia, and 7.9% in admixed Americans. The Neanderthal-like CYP2J2 haplotype was not found in African populations.
Introgression of the CYP2A6*12 hybrid allele
We identified the CYP2A6*12 allele in all three Neanderthal individuals (Figure 3), and this may suggest inheritance of the hybrid allele through human-Neanderthal introgression. The CYP2A6 enzyme metabolizes coumarin, nicotine, and other plant secondary metabolites. The CYP2A6 gene is located adjacent to the inactive CYP2A7 gene, and several allelic variants of CYP2A6 have been created by unequal crossover and gene conversion events between these genes (Oscarson et al. 2002). CYP2A6*12 is a hybrid allele in which exons 1 and 2 originate from CYP2A7 and exons 3–9 originate from CYP2A6 (Figure 4). It causes a 50% reduction in CYP2A6 protein levels and a 40% decrease in CYP2A6 coumarin 7-hydroxylation activity (Oscarson et al. 2002), leading to slow metabolism of various substrates of the CYP2A6 enzyme. The CYP2A6*12 allele is found at low frequencies in global populations, including African American (0.4%), Canadian First Nation (0.5%), and Japanese (0.8%) individuals, and is absent in African populations (Oscarson et al. 2002).
However, it is possible that the unequal crossover event that created CYP2A6*12 predates the split of modern humans from Neanderthals and Denisovans, and that the allele is present in humans and Neanderthals because of incomplete lineage sorting (ILS, suggested in Lin et al. 2015), although this would not explain why the allele is absent in Africans. To assess whether the CYP2A6*12 haplotype could have survived disruption by recombination in this region in the time since the split of modern humans and the archaic individuals, we calculated the probability of a haplotype carrying CYP2A6*12 being maintained in both human and Neanderthal lineages through ILS. To calculate this probability, we used a previously published ILS equation (see Methods, Huerta-Sánchez et al. 2014). Using the 31.9 kb CYP2A6*12 hybrid segment (Oscarson et al. 2002) and a regional recombination rate of 0.77 cM/Mb (Myers et al. 2005), we estimate that segments exceeding 17.1 kb are unlikely to have been maintained since the human-Neanderthal branch split (at α=0.05), and it is therefore improbable that the 31.9 kb CYP2A6*12 haplotype would have been maintained by ILS in both lineages (p=0.0014).
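The ILS calculation can be sketched as follows, assuming the equation takes the form commonly attributed to Huerta-Sánchez et al. (2014): the expected length of a shared ancestral segment is L = 1/(r × t), and the probability of a segment of at least m base pairs is 1 − GammaCDF(m; shape = 2, rate = 1/L), which has the closed form (1 + m/L) × exp(−m/L). The recombination rate and segment lengths come from the text. The total branch length `T_ASSUMED` is a placeholder chosen for illustration only; the value used in the study is specified in its Methods.

```python
import math


def expected_shared_length(r_per_bp, t_generations):
    """Expected length (bp) of an unbroken ancestral segment: L = 1 / (r * t)."""
    return 1.0 / (r_per_bp * t_generations)


def ils_tail_probability(m_bp, r_per_bp, t_generations):
    """P(segment length >= m) = 1 - GammaCDF(m; shape=2, rate=1/L).

    For shape 2 the gamma survival function is (1 + x) * exp(-x) with x = m / L.
    """
    x = m_bp * r_per_bp * t_generations
    return (1.0 + x) * math.exp(-x)


R_PER_BP = 0.77e-8     # 0.77 cM/Mb expressed per bp per generation
T_ASSUMED = 36_000     # hypothetical combined branch length, in generations

p_segment = ils_tail_probability(31_900, R_PER_BP, T_ASSUMED)
p_cutoff = ils_tail_probability(17_100, R_PER_BP, T_ASSUMED)
print(f"P(length >= 31.9 kb) = {p_segment:.4f}")
print(f"P(length >= 17.1 kb) = {p_cutoff:.3f}")
```

With this placeholder branch length, the 17.1 kb segment sits close to the 5% cut-off and the 31.9 kb segment has a tail probability on the order of 10^-3, in line with the values reported above. Longer branch lengths or higher recombination rates make long shared segments even less probable under ILS.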
Discussion
We investigated a panel of CYP450 genes which are responsible for 75% of all drug metabolizing reactions and represent the bulk of drug metabolizing enzymes (DME) for which therapeutic recommendations exist (Evans and Relling 1999). DMEs are subject to evolutionary processes (e.g., mutation, drift, natural selection) because of their role as detoxifiers of xenobiotic substances. Here, we analyze data from modern and archaic humans to gain insight into the evolutionary history of these loci and to predict metabolizing phenotypes of archaic individuals.
Archaic CYP450 metabolizing phenotypes
We find that the majority of the archaic CYP450 genes in our panel conferred a normal/intermediate or unknown metabolizer phenotype (CYP1A2, CYP2C8, CYP2C9, CYP2E1, CYP2J2, CYP3A4 and CYP3A5), consistent with previous studies showing strong purifying selection (Nelson et al. 2012) at these loci. The genes that showed either a poor/slow (CYP2A6, CYP2C19) or ultra-rapid (CYP2B6, CYP2D6) metabolizer status in the archaic individuals have also been known to have high variation in human populations (Zhou et al. 2017) and may indicate increased natural population variation or adaptation to the environment. However, it is important to note that the predicted phenotypes are based on modern human phenotypes, and it is unclear how the archaic-specific variation (in particular from the novel function-altering variants) and the archaic genomic landscape would affect metabolism in archaic individuals. These variants of uncertain significance would benefit from further functional tests, individually as well as in the haplotypes found in archaic individuals, measuring enzyme activity and protein abundance.
We highlight three CYP450 genes with high archaic genetic diversity: CYP2C9, CYP2A6, and CYP2J2. First, CYP2C9 contains a number of SNVs that are shared between Neanderthals and Europeans and are rare or absent from Africans, suggesting the possibility of archaic introgression for these variants. The CYP2C9 enzyme is responsible for metabolizing multiple classes of medications and is an important DME in humans (Miners and Birkett 1998; Henderson et al. 2018). Substrates for CYP2C9 include non-steroidal anti-inflammatories (Tracy et al. 1995; Miners et al. 1996; Hamman et al. 1997; Yamazaki et al. 1998), angiotensin II blockers (e.g. losartan) (Stearns et al. 1995), S-warfarin (Rettie et al. 1994; Yamazaki et al. 1998), phenytoin (Giancarlo et al. 2001), and tolbutamide (Miners and Birkett 1996). In the CYP2C9 gene, more than 60 functional haplotypes have been identified through the PharmVar consortium (Gaedigk et al. 2018; Sangkuhl et al. 2021). The shared CYP2C9 variants do not show the usual relationship between genetic and geographic distance expected as human populations expanded out of Africa and into Eurasia, Oceania, and the Americas. If the CYP2C9*2 variant was indeed introgressed from Neanderthals, the elevated frequency in European populations suggests that it may have been adaptive for processing certain xenobiotics that human populations were exposed to in Western Eurasia. Changes in CYP450 genes as a result of adaptation to new diets have been previously identified, such as selection for slower-metabolizing CYP2D6 variants in agricultural populations (Fuselli et al. 2010). However, we cannot rule out that the elevated frequency of the Neanderthal allele in Europeans was the result of a strong founder effect. Further exploration would be needed to determine whether this haplotype was introgressed and then targeted by positive selection.
Second, CYP2A6 carries a variant (*12) that is defined by a recombination event and is found in both modern humans and Neanderthals. The CYP2A6 gene is expressed mainly in the liver and represents between 1% and 10% of the total liver CYP450 protein (Haberl et al. 2005). The CYP2A6 enzyme metabolizes drugs and pro-carcinogenic compounds including tegafur (Komatsu et al. 2000), valproic acid (Kiang et al. 2006; Tan et al. 2010), and coumarin (Miles et al. 1990), and is the primary enzyme involved in nicotine metabolism (Hukkanen et al. 2005). At present, more than 10 different allelic variants are known that cause absent or reduced enzyme activity (Oscarson et al. 2002; Benowitz et al. 2006), suggesting that eliminating or reducing CYP2A6 function is a recurring evolutionary strategy. Accordingly, there is marked diversity in CYP2A6 enzyme function in global populations, where only approximately 1% of European and Middle Eastern populations are poor metabolizers, but up to 20% of East Asian populations show severely reduced enzyme activity (Haberl et al. 2005). An alternative explanation for the distribution of CYP2A6*12 is that this hybrid allele arose multiple times, independently in modern non-Africans and in Neanderthals, in addition to being introgressed at low frequency from Neanderthals. It has been proposed that the slow metabolism of some plant secondary metabolites may be adaptive, as high levels of these toxins in human tissues may act as a deterrent to parasites (Sullivan et al. 2008; Hagen et al. 2009; Hagen et al. 2013), perhaps explaining why this hybrid allele occurs in both Neanderthals and some human populations. Future directions include examining CYP2A6 variation in other species, such as non-human primates, to estimate more accurately when the CYP2A6*12 allele arose.
Finally, CYP2J2 is a gene for which Neanderthal variants may have been introduced to non-African populations by archaic admixture and are now found in some modern populations. The CYP2J2 enzyme accounts for roughly 1-2% of hepatic CYP protein expression, but also displays high expression in the lung, kidney, heart, placenta, salivary gland, and skeletal muscle (Bièche et al. 2007; Murray 2016). CYP2J2 is involved in the oxidation pathways of polyunsaturated fatty acids (PUFAs) (Murray 2016) and mediates the oxidation of drugs including ebastine (Hashizume et al. 2002; Lee et al. 2012), astemizole (Matsumoto and Yamazoe 2001; Lafite et al. 2007), and terfenadine (Matsumoto and Yamazoe 2001). The majority of established CYP2J2 polymorphisms occur at low frequencies, with the most common variant, CYP2J2*7, occurring at frequencies between 1% and 20% across global populations (King et al. 2002; Dreisbach et al. 2005; King et al. 2005; Wang et al. 2006; Polonikov et al. 2007; Polonikov et al. 2008; Murray 2016). Introgressed variants in CYP2J2 may have been adaptive and selected for in modern human populations, possibly due to their role in the oxidation of PUFAs. Oxidation of PUFAs degrades food by producing off-flavors and may have been instrumental in modern human expansion into new environments. Further work is needed to determine the functional effects of the shared CYP2J2 variants and to test their association with PUFA levels in modern human populations.
Super-divergent CYP2B6 in the Vindija Neanderthal
The highly divergent CYP2B6 haplotypes found in the Vindija Neanderthal and a small number of African individuals (n=11) were an unexpected finding. CYP2B6 is responsible for metabolizing drugs such as the prodrug cyclophosphamide (Huang et al. 2000; Zanger and Klein 2013); efavirenz, a non-nucleoside reverse transcriptase inhibitor (Ward et al. 2003; Desta et al. 2007; Zanger and Klein 2013); the antidepressant bupropion (Faucette et al. 2000; Hesse et al. 2000; Zanger and Klein 2013); and ketamine (Desta et al. 2012). Genetic polymorphisms in CYP2B6 have been shown to alter enzyme activity (Ariyoshi et al. 2001; Lang et al. 2001; Kirchheiner et al. 2003; Ward et al. 2003; Hofmann et al. 2008), and more than 37 CYP2B6 haplotypes have been identified (Gaedigk et al. 2018; Desta et al. 2021).
The Vindija Neanderthal had far more variant sites in CYP2B6 than the other Neanderthal and Denisovan individuals and than modern human populations. The excess of heterozygous sites did not seem to be due to a mapping error, and the variation shared with 11 African individuals from the 1000 Genomes Project strongly suggests that these sites may be part of the same haplotype, with some divergence between the African modern human haplotype and the Vindija haplotype. This pattern seems consistent with early human-Neanderthal admixture allowing for private variation to accumulate on the African and Vindija Neanderthal haplotypes. Similarly, introgression from an archaic population with higher levels of heterozygosity could have resulted in two divergent haplotypes being passed to the human-Neanderthal shared lineage, which then assorted into Africans and Neanderthals.
In fact, a study that aimed to identify archaic gene flow in Africans that pre-dates the split time of Neanderthals, Denisovans, and modern humans identified CYP2B6 as a candidate region in the African Mende population, three of whom have the divergent CYP2B6 haplotype (Durvasula and Sankararaman 2020). These haplotypes were likely introgressed from a population that lived long before humans diverged from Neanderthals and Denisovans and show much more divergence from one another than the other modern and archaic human haplotypes do. Because the Vindija genome is unphased, its haplotypes cannot be inferred directly and the exact relationship between the two divergent haplotypes is difficult to determine; nevertheless, both Neanderthals and modern Africans harbor ancient haplotypes of CYP2B6. Given the number of differences between the two divergent CYP2B6 haplotypes, they likely both existed long before the modern human-Neanderthal split at ∼500,000 ybp.
CYP450 genetic introgression
The sharing of archaic variants in the eleven CYP450 genes we studied in modern humans is largely consistent with our general understanding of archaic introgression in human populations. For example, the number of Neanderthal SNVs is significantly higher than the number of Denisovan SNVs for all populations except Papuans (Sankararaman et al. 2016), which is consistent with the majority of shared variants in humans being Neanderthal in origin. Papuans and other Melanesian populations have much more Denisovan ancestry than most other populations (Meyer et al. 2012) and may have had gene flow from a different Denisovan population than the one encountered by the ancestors of Eurasians (Browning et al. 2018), and thus it is not surprising that they have an exclusive set of archaic alleles for the CYP450 genes, many of which are Denisovan in origin.
Interestingly, of the 23 potentially damaging variants we found in the eleven CYP450 genes, 7 SNVs were found uniquely in the archaic individuals. Neanderthal and Denisovan genome elements in modern humans are the targets of natural selection; in particular, most of the depletion of archaic ancestry has been predicted to be the result of purifying selection against weakly deleterious alleles (Harris and Nielsen 2016; Juric et al. 2016). After archaic and modern human admixture, increased pressure from purifying selection would remove archaic deleterious alleles, as they would result in marked health and fitness detriments for modern humans. The strength of purifying selection against these deleterious alleles has been estimated to be extremely high, and likely removed a significant portion of Neanderthal ancestry from the modern human gene pool in just a few generations (Harris and Nielsen 2016; Juric et al. 2016; Petr et al. 2019).
Limitations
A potential limitation of this study is the nature of ancient DNA: degraded DNA and chemical damage, particularly cytosine (C) → thymine (T) transitions, result in low read depth and sequencing errors for ancient genomes. In order to distinguish true polymorphism at heterozygous sites from sequencing errors and ancient DNA (aDNA) damage, we excluded bases that are represented in only a single read at a given position in the BAM read alignment. Because DNA damage and sequencing errors are distributed randomly, we expect a false polymorphism to appear only once at each position, whereas any polymorphism represented in multiple reads is statistically likely to represent true heterozygosity. To further account for false positive variant calls, we filtered calls for allelic bias, removing any heterozygous calls that had an allelic read ratio less than 0.2 or greater than 0.8. The nature of ancient genomes resulted in low read depth for some star alleles that were called (Figure 3), so we also manually checked the sequencing reads for all diplotypes reported (Supplemental Table 4).
As a result of our novel filtering method, the archaic variant call files we generated contain a larger number of heterozygous sites compared to the variant call files that are usually used to represent the archaic genomes (Meyer et al. 2012; Prüfer et al. 2014; Prüfer et al. 2017; Mafessoni et al. 2020). The method used to make the previously generated variant call files likely used a more stringent method of filtering, resulting in more sites being called as homozygous. To ensure that our results were not biased by our variant calling method, we repeated all analyses (including calling star alleles, identifying archaic SNVs in modern humans, and comparing human and archaic haplotypes) using the original variant call files, and found that while the number of identified SNVs is smaller, the patterns identified (such as the divergent CYP2B6 haplotype in the Vindija and allele sharing between archaic and modern humans) remain.
Other limitations were the required phasing of the archaic CYP450 sequences with the program Beagle, which uses a human reference panel based on the 1000 Genomes panel for haplotype assessment, and the calling of the final diplotype with currently known human star alleles. While this is the best reference panel available, it is uncertain whether phasing based on modern humans reflects the true arrangement of alleles along chromosomes in archaic individuals. This limitation is somewhat ameliorated by two observations. First, most CYP450 genes in the archaic individuals had homozygous allele calls (Supplemental Table 3), which is consistent with the much lower heterozygosity of Neanderthals and Denisovans along the entire genome (Meyer et al. 2012; Prüfer et al. 2014; Prüfer et al. 2017; Mafessoni et al. 2020). Second, many of the heterozygous star allele calls were composed of only a single nucleotide variant, so phasing had no effect on those calls. The final diplotype calls also did not consider any of the novel potentially deleterious SNVs; these variants will require further functional validation.
Lastly, a further limitation to determining if archaic CYP450 alleles were inherited by modern humans through introgression is incomplete lineage sorting (ILS), in which modern humans, Neanderthals and Denisovans all share CYP450 alleles ancestrally, as could be the case for CYP2A6*12. However, we expect haplotypes shared through ILS to accumulate neutral mutations over time, slowly driving divergence between haplotypes. We expect true archaically introgressed alleles to be less divergent, as is the case in CYP2J2.
Conclusions
Understanding the impact of archaic variation on modern human health is still in its early stages. While a few genes have been identified for which the archaic alleles play an important role in human adaptation to their environment, such as EPAS1, the vast contribution of archaic ancestry to human health remains unknown. Our results suggest that interactions between modern and archaic humans may have resulted in the introduction of novel CYP450 variants into modern human populations, helping them adapt to novel environments as they expanded out of Africa. Important insights will continue to emerge from the careful inspection of pharmacologically relevant and highly studied genes such as the CYP450 genes investigated here.
Materials and Methods
Samples
We investigate population-specific CYP450 gene variation by combining data from the 1000 Genomes Project (1000 Genomes Project Consortium et al. 2015), the Neanderthal genome project (Prüfer et al. 2014), the Chagyrskaya Neanderthal genome project (Mafessoni et al. 2020), and the Denisovan Genome Project (Meyer et al. 2012). We extracted coding variation from four archaic human genomes: a single Denisovan individual from the Denisova cave in the Altai Mountains (∼21X coverage), a Neanderthal individual from Croatia (∼42X coverage), and two Neanderthal individuals from the Altai Mountains, one from the Denisova cave (∼30X coverage) and one from the Chagyrskaya cave (∼28X coverage). These individuals are estimated to be at least 50,000 years old.
Variant Calling
Sequencing data for the three Neanderthal individuals and one Denisovan individual are publicly available, and processing of the BAM files has been outlined in previous literature (Meyer et al. 2012; Prüfer et al. 2014; Prüfer et al. 2017; Mafessoni et al. 2020). Sample BAM files were separated by chromosome for easier downstream processing using the samtools (version 1.10 with htslib 1.10) ‘view’ function (Li et al. 2009). Variant calling was performed for each CYP450 gene of interest using pypgx (version 0.1.37) ‘bam2vcf’, which implements the Genome Analysis Toolkit (GATK, version 4.1.9.0) ‘HaplotypeCaller’ function (options --emit-ref-confidence GVCF; --minimum-mapping-quality 10) followed by GATK ‘GenotypeGVCFs’ to merge individual samples (Poplin et al. 2018). Variants were called against the HumanG1Kv37 hg19 reference assembly. GATK ‘VariantFiltration’ was additionally utilized to annotate variants with a quality score 50. Variants from modern human individuals from the 1000 Genomes Project (1000 Genomes Project Consortium et al. 2015) included for further analysis were called using the same procedure.
For individual SNV analysis, variant call format (VCF) files were further filtered by the PHRED-scaled quality score (QUAL) 40, and gene regions were determined as the RefSeq start and end coordinates of the gene with 2000 base-pairs upstream to account for the promoter region, using bcftools (version 1.10.2 with htslib 1.10.2; Li et al. 2009; O’Leary et al. 2016). Read depth for each variant was reported from the GDF files generated from pypgx bam2gdf (described below). Variants were additionally filtered for allele balance (AB) bias, and polymorphic sites that had a BAM read depth ≤ 1 for one allele were removed to adjust for the random effects of aDNA damage being called as true variation. AB was calculated as the ratio of minor allelic read depth to major allelic read depth, and variant calls were filtered with an AB > 0.2.
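The two allele-level filters described above can be sketched as follows. This is a minimal illustration and not the pipeline's actual code: the function name and the representation of a site as two read depths are assumptions, while the thresholds (more than one read per allele, AB > 0.2) are taken from the text.

```python
def passes_allele_filters(ref_depth, alt_depth, min_allele_depth=2, min_ab=0.2):
    """Return True if a heterozygous call survives both filters.

    ref_depth / alt_depth: reads supporting each allele (hypothetical inputs).
    """
    minor = min(ref_depth, alt_depth)
    major = max(ref_depth, alt_depth)
    # Remove sites where one allele is supported by a single read (or none),
    # since isolated aDNA damage is expected to show up in only one read.
    if minor < min_allele_depth:
        return False
    # Allele balance: minor allelic depth divided by major allelic depth.
    return minor / major > min_ab
```

With these defaults, a site with depths 10 and 3 is kept (AB = 0.3), while a site with depths 30 and 3 is removed (AB = 0.1).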
Variant annotation was conducted through ANNOVAR (-protocol refGeneWithVer, knownGene, avsnp150, ljb26_pp2hvar), which utilizes dbSNP150, RefSeq, and the UCSC genome browser to identify known variants and the gene locations at which variants occur; the occurrence of novel SNVs was additionally confirmed with gnomAD (v2.1.1; Sherry et al. 2001; Kent et al. 2002; Wang et al. 2010; O’Leary et al. 2016; Karczewski et al. 2021). Variant locations were identified with ANNOVAR, which annotates gene location based on RefSeq and UCSC genome browser genomic databases. Variants are defined as exonic, splicing, ncRNA, 3’ and 5’UTR, intronic, and in the promoter region. If a location varied between the RefSeq and UCSC gene annotation, the called location is determined by an ordering precedence: exonic/splicing > ncRNA > 3’/5’UTR > intronic > promoter region.
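The precedence rule for conflicting annotations can be illustrated with a short sketch. The label strings and the function name below are simplified, hypothetical stand-ins and are not ANNOVAR's exact output values; only the ordering itself comes from the text.

```python
# Lower rank = higher precedence, following
# exonic/splicing > ncRNA > 3'/5'UTR > intronic > promoter region.
# The label strings are illustrative assumptions.
LOCATION_RANK = {
    "exonic": 0,
    "splicing": 0,
    "ncRNA": 1,
    "UTR3": 2,
    "UTR5": 2,
    "intronic": 3,
    "promoter": 4,
}


def resolve_location(refseq_label, ucsc_label):
    """Pick the higher-precedence location when the two annotations differ."""
    # min() returns the first argument on ties, so RefSeq wins equal ranks.
    return min((refseq_label, ucsc_label), key=LOCATION_RANK.__getitem__)
```

For example, a variant annotated as intronic by one source and exonic by the other would be reported as exonic.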
Potentially pathogenic variants were identified using three algorithms to rank variant deleteriousness: Combined Annotation Dependent Depletion (CADD v1.6), Sorting Intolerant From Tolerant (SIFT 4G v2.0.0), and Polymorphism Phenotyping (PolyPhen v2) (Adzhubei et al. 2010; Vaser et al. 2016; Rentzsch et al. 2019). Variants were considered deleterious if the PHRED-normalized CADD score was ≥ 20, a variant was identified to be deleterious by the SIFT prediction, or a variant was identified as probably/possibly damaging by PolyPhen2 prediction.
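The decision rule above is a logical OR across the three predictors, which can be sketched as below. The function name, argument names, and the exact prediction strings are assumptions made for illustration; the tools' real output formats may differ.

```python
def is_potentially_deleterious(cadd_phred=None, sift_pred=None, polyphen_pred=None):
    """Flag a variant if any one of the three predictors calls it damaging.

    Missing predictions (None) are treated as providing no evidence.
    Prediction strings are hypothetical, lower-case labels.
    """
    if cadd_phred is not None and cadd_phred >= 20:
        return True
    if sift_pred == "deleterious":
        return True
    return polyphen_pred in {"probably damaging", "possibly damaging"}
```

Because the criteria are combined with OR, a variant with a CADD score below 20 is still flagged when either SIFT or PolyPhen predicts damage.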
Comparisons of pairwise distance between Neanderthal individuals were conducted by calculating the pairwise difference across windows of two Neanderthal genomes. In our case, we used the Chagyrskaya and Vindija VCFs provided by their original publications (Prüfer et al. 2017; Mafessoni et al. 2020) and chose a window size corresponding to the CYP2B6 gene length (29.1 kb). As the archaic genomes are unphased, we could not separate the two haplotypes for each individual, so we compared genotypes rather than haplotypes. To calculate pairwise distance, we added 1 to our total number of differing sites if the two genotypes were homozygous for opposite alleles (i.e., 0/0 vs. 1/1), and added 1/2 if one of the genotypes was heterozygous, then divided the total number of differing sites by the length of the window.
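A minimal sketch of this genotype-based distance is given below. It assumes biallelic sites encoded as alternate-allele counts (0 for 0/0, 1 for 0/1, 2 for 1/1), and it assumes that identical genotypes, including two heterozygotes, contribute nothing; the text does not state how that last case was handled. The function name is hypothetical.

```python
def genotype_distance(genotypes_a, genotypes_b, window_length):
    """Pairwise difference per base between two unphased individuals.

    genotypes_a / genotypes_b: alternate-allele counts (0, 1, or 2) at the
    same ordered sites within one window.
    """
    if len(genotypes_a) != len(genotypes_b):
        raise ValueError("genotype lists must cover the same sites")
    # |a - b| / 2 gives 1 for opposite homozygotes (0 vs 2),
    # 0.5 when exactly one genotype is heterozygous, and 0 when equal.
    total = sum(abs(a - b) / 2 for a, b in zip(genotypes_a, genotypes_b))
    return total / window_length
```

For the CYP2B6 comparison, `window_length` would be the 29.1 kb gene length expressed in base pairs.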
Star allele calling and metabolizer phenotype prediction
CYP450 diplotypes and metabolizer phenotypes were determined using Stargazer, a bioinformatics tool for identifying star alleles of pharmacogenes by detecting SNVs, indels, and SVs from next-generation sequencing data (Lee, Wheeler, Patterson, et al. 2019; Lee, Wheeler, Thummel, et al. 2019). Briefly, small nucleotide variants (SNVs and indels) observed from each CYP gene were input into Stargazer, which utilizes the statistical phasing software BEAGLE (Browning and Browning 2007) to phase the variants into haplotypes. Phased variants were then matched to star alleles for each of the two haplotypes from an individual. Of note, diplotype calls were considered “non-determinant” if any of the variants used to define a star allele in Stargazer had a read depth ≤ 1. In addition to small nucleotide variants, SV calls were made by Stargazer using per-base depth of coverage data computed with the pypgx ‘bam2gdf’ command (https://github.com/sbslee/pypgx). Three control genes were used to normalize read depth and estimate accurate gene copy number: vitamin D receptor (VDR), epidermal growth factor receptor (EGFR), and ryanodine receptor 1 (RYR1). Final calls for star alleles were determined as the call that was consistent in the highest number of control genes. SV calls were additionally confirmed with visual inspection of read depth profiles for the three control genes. Finally, phenotype prediction was conducted in Stargazer by converting the called diplotype into an activity score, which is then used to predict the gene phenotype.
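The two bookkeeping rules in this paragraph (the read-depth check on defining variants and the agreement across control genes) can be sketched as follows. This is not Stargazer's code; the function names, the dictionary layout, and the choice to return `None` when no call has a strict majority are assumptions for illustration.

```python
from collections import Counter


def is_determinate(defining_variant_depths):
    """A diplotype call is usable only if every defining variant has depth > 1."""
    return all(depth > 1 for depth in defining_variant_depths)


def consensus_call(calls_by_control_gene):
    """Return the call supported by the most control genes, or None on a tie.

    calls_by_control_gene: e.g. {"VDR": "*1/*2", "EGFR": "*1/*2", "RYR1": "*1/*1"}
    (hypothetical calls, one per control gene used for normalization).
    """
    counts = Counter(calls_by_control_gene.values()).most_common()
    if not counts:
        return None
    # If the top two calls are equally supported there is no consensus.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]
```

With three control genes, a call made under two of them is returned, and three different calls yield no consensus.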
Identifying archaic alleles in modern populations
To identify the presence of archaic CYP450 variants in modern human populations, we calculated the allele frequency of the archaic variants described above in modern humans. We used modern human genomes from all populations sequenced for the 1000 Genomes Project (1000 Genomes Project Consortium et al. 2015), as well as the Papuans sequenced as part of the Simons Genome Diversity Project (Mallick et al. 2016). We examined all SNVs that were shared between archaic and modern humans, but also focused on a subset of SNVs that were more likely to be introgressed, as they were found outside of Africa but not in African populations. To exclude SNVs that were ancestrally shared between archaic humans and modern humans, archaic SNVs had to have a frequency in African populations of less than 1 percent and be present in at least one non-African population with an allele frequency greater than 1 percent. For all results described in the text, we used the archaic VCF generated using the methods in this paper for comparison with modern humans. For comparison, we repeated these analyses with the published VCFs that are associated with the original archaic genome sequencing studies (Meyer et al. 2012; Prüfer et al. 2014; Prüfer et al. 2017; Mafessoni et al. 2020).
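The frequency criterion for candidate introgressed SNVs can be sketched as below. The population codes are 1000 Genomes labels, but which populations are treated as African, and whether the African threshold is applied per population (as here) or to a pooled frequency, are assumptions; the function name and the example frequencies are hypothetical.

```python
# Illustrative choice of African populations (1000 Genomes codes).
AFRICAN_POPS = {"YRI", "LWK", "GWD", "MSL", "ESN"}


def is_candidate_introgressed(freq_by_pop, african_pops=AFRICAN_POPS, threshold=0.01):
    """Apply the rule: < 1% in Africans and > 1% in at least one non-African population.

    freq_by_pop: mapping of population code to the archaic allele's frequency.
    """
    african = [f for pop, f in freq_by_pop.items() if pop in african_pops]
    non_african = [f for pop, f in freq_by_pop.items() if pop not in african_pops]
    rare_in_africa = all(f < threshold for f in african)
    present_elsewhere = any(f > threshold for f in non_african)
    return rare_in_africa and present_elsewhere


# Made-up frequencies for a single archaic SNV.
example = {"YRI": 0.0, "LWK": 0.002, "CEU": 0.08, "CHB": 0.0}
candidate = is_candidate_introgressed(example)
```

An SNV that is common in any African population fails the filter regardless of its non-African frequency, which is what excludes ancestrally shared variation.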
To explore SNV sharing between the CYP2B6 haplotypes, we divided the archaic and modern humans into four groups: the Vindija genome (containing the divergent haplotype), the other three archaic genomes, the eleven Africans with the divergent haplotypes, and the rest of the modern Africans without the divergent haplotypes. Each group was scored for the presence or absence of a given SNV, and the sharing of these SNVs was summarized in a Venn diagram (Supplemental Figure 4).
We used Haplostrips (Marnetto and Huerta-Sánchez 2017) to calculate the distance between haplotypes of all CYP450 genes in modern humans in the 1000 Genomes Panel relative to a reference haplotype. Distances in Haplostrips are Manhattan distances, simply the number of SNVs with different alleles between the two sequences. Haplotypes are re-ordered by decreasing similarity with the archaic reference haplotype. Here, the Haplostrip is polarized to a Vindija Neanderthal reference haplotype (a consensus of the two archaic chromosomes) or a Denisovan reference haplotype, and each subsequent haplotype is ordered by genetic similarity, from most related to least related. For this analysis, we looked at haplotypes composed of each CYP450 gene at the gene coordinates, plus 5000 base pairs downstream and upstream to capture more linked neutral variation, found in the VCF files from the 1000 Genomes Project and the various archaic genome projects. We used the genetic distances calculated by Haplostrips to rank the proximity of modern human haplotypes to archaic haplotypes.
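The distance and ordering described above can be illustrated with a small sketch. This is not Haplostrips itself; it assumes haplotypes are given as equal-length sequences of 0/1 alleles over the same SNVs, and the function names and sample names are hypothetical.

```python
def manhattan_distance(haplotype, reference):
    """Number of SNVs at which the haplotype differs from the reference."""
    if len(haplotype) != len(reference):
        raise ValueError("haplotypes must cover the same SNVs")
    return sum(a != b for a, b in zip(haplotype, reference))


def order_by_similarity(haplotypes, reference):
    """Return haplotype names sorted from most to least similar to the reference.

    haplotypes: mapping of name -> sequence of 0/1 alleles.
    """
    return sorted(haplotypes, key=lambda name: manhattan_distance(haplotypes[name], reference))


# Toy example with a hypothetical archaic reference haplotype.
archaic_ref = [1, 1, 0, 1]
modern = {
    "hapA": [0, 0, 0, 0],
    "hapB": [1, 1, 0, 1],
    "hapC": [1, 0, 0, 1],
}
ranked = order_by_similarity(modern, archaic_ref)
```

For 0/1 alleles, counting mismatches is identical to the Manhattan distance between the two allele vectors.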
Incomplete Lineage Sorting Calculation
Incomplete Lineage Sorting (ILS) was assessed using a previously published equation (Huerta-Sánchez et al. 2014). Regional recombination rates were determined using HapMap (Myers et al. 2005). For these calculations, we used a generation time of 29 years (Langergraber et al. 2012; Zeberg and Pääbo 2021) and a branch length of 550,000 years with a 50,000 year timeframe of Neanderthal-Human interbreeding (Prüfer et al. 2014; Zeberg and Pääbo 2021).
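As a hedged sketch of this calculation, assume the equation takes the form used by Huerta-Sánchez et al. (2014): the expected length of a segment shared through ILS is L = 1/(r × t), where r is the recombination rate per base pair per generation and t is the branch length in generations, and the probability of observing a shared segment of length at least m is 1 − GammaCDF(m, shape = 2, rate = 1/L). For shape 2 this has the closed form (1 + m/L)·exp(−m/L). How the 550,000 and 50,000 year figures are combined into the branch length is not spelled out here, so the function below takes the branch length as an input; the function and parameter names are hypothetical.

```python
import math


def ils_probability(segment_length_bp, recomb_rate_per_bp_per_gen,
                    branch_length_years, generation_time_years=29):
    """Probability that a segment at least this long persists through ILS.

    Uses the survival function of a Gamma(shape=2, rate=1/L) distribution,
    where L = 1 / (r * t) is the expected shared segment length.
    """
    t_generations = branch_length_years / generation_time_years
    expected_length = 1.0 / (recomb_rate_per_bp_per_gen * t_generations)
    x = segment_length_bp / expected_length
    # Closed form of 1 - GammaCDF(m; shape=2, rate=1/L).
    return (1.0 + x) * math.exp(-x)
```

Under this form, longer shared haplotypes and higher local recombination rates both make ILS a less likely explanation than introgression.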
Data Visualization
All data visualization was conducted in R Studio Suite (R version 4.0.2, https://www.r-project.org/) with the use of the CRAN library ggplot2 for graphical visualization of variant frequency, star allele diplotype calls, deleterious variant occurrence, and haplotype distance visualization (R Core Team 2008; Wickham 2009). The R package venneuler was used to create the Venn diagram in Supplemental Figure 4 (https://CRAN.R-project.org/package=venneuler).
Funding
THW and KGC are supported by the NHGRI grant to KGC R35HG011319. KEW is supported by the NIH grant to EHS 1R35GM128946-01. EHS was also supported by an Alfred P. Sloan Fellowship.
Data Availability
The scripts created to analyze and generate this data can be found at https://github.com/kelsey-witt/archaic_pgx and https://github.com/the-claw-lab/aDNA_PGx_2021.
Acknowledgements
We would like to thank George Perry and Omer Gokcumen for comments on early versions of our manuscript, and anonymous reviewers for their feedback.