Abstract
Living at high altitude is one of the most difficult challenges that humans have had to cope with during their evolution. Whereas several genomic studies have revealed some of the genetic bases of adaptations in Tibetan, Andean and Ethiopian populations, relatively little evidence of convergent evolution to altitude in different continents has accumulated. This lack of evidence may reflect truly different evolutionary responses, but it may also be due to the low power of former studies, which have mainly focused on populations from a single geographical region or performed separate analyses on multiple pairs of populations to avoid problems linked to shared histories between some populations. We introduce here a hierarchical Bayesian method to detect local adaptation that can deal with complex demographic histories. Our method can identify selection occurring at different scales, as well as convergent adaptation in different regions. We apply our approach to the analysis of a large SNP dataset from low- and high-altitude human populations from America and Asia. The simultaneous analysis of these two geographic areas allows us to identify several candidate genome regions for altitudinal selection, and we show that convergent evolution among continents has been quite common. In addition to identifying several genes and biological processes involved in high altitude adaptation, we identify two specific biological pathways that could have evolved in both continents to counter toxic effects induced by hypoxia.
Introduction
Distinguishing between neutral and selected molecular variation has been a long-standing interest of population geneticists. This interest was fostered by the publication of Kimura's seminal paper 1 on the neutral theory of molecular evolution. Although the controversy rests mainly on the relative importance of genetic drift and selection as explanatory processes for the observed biodiversity patterns, another important question concerns the prevalent form of natural selection. Kimura 1 argued that the main selective mechanism was negative selection against deleterious mutations. However, an alternative point of view emphasizes the prevalence of positive selection, the mechanism that can lead to local adaptation and eventually to speciation 2,3.
A powerful approach to uncover positive selection is the study of mechanisms underlying convergent evolution. When different populations or evolutionary lineages are exposed to similar environments, positive selection should indeed lead to similar phenotypic features. Convergent evolution can be achieved through similar genetic changes (sometimes called “parallel evolution”) at different levels: the same mutation appearing independently in different populations, the same existing mutation being recruited by selection in different populations, or the involvement of different mutations in the same genes or the same biological pathways in separate populations 4. However, existing statistical genetic methods are not well adapted to the study of convergent evolution when data sets consist of multiple contrasts of populations living in different environments 5. The current strategy is to carry out independent genome scans in each geographic region and to look for overlaps between loci or pathways that are identified as outliers in different regions 6. Furthermore, studies are often split into a series of pairwise analyses that consider sets of populations inhabiting different environments. Whereas this strategy has the advantage of not requiring the modeling of complex demographic histories 7,8, it often ignores the correlation in gene frequencies between geographical regions when correcting for multiple tests 9. As a consequence, current approaches are restricted to the comparison of lists of candidate SNPs or genomic regions obtained from multiple pairwise comparisons. This sub-optimal approach may also result in a global loss of power as compared to a single global analysis, and thus in a possible underestimation of the genome-wide prevalence of convergent adaptation.
One particularly important example where this type of problem arises is the study of local adaptation to high altitude in humans. Human populations living at high altitude need to cope with one of the most stressful environments in the world, to which they are likely to have developed specific adaptations. The harsh conditions associated with high altitude include not only low oxygen partial pressure, referred to as high-altitude hypoxia, but also other factors like low temperatures, arid climate, high solar radiation and low soil quality. While some of these stresses can be buffered by behavioral and cultural adjustments, important physiological changes have been identified in populations living at high altitude (see below). Recently, genomic advances have unveiled the first genetic bases of these physiological changes in Tibetan, Andean and Ethiopian populations 10-19. The study of convergent or independent adaptation to altitude is of primary interest 11,19,20, but this problem has been only superficially addressed so far, as most studies focused on populations from a single geographical region 10,13,14,16-19.
Several candidate genes for adaptation to altitude have nevertheless been clearly identified 21,22, the most prominent ones being involved in the hypoxia inducible factor (HIF) pathway, which plays a major role in response to hypoxia 23. In Andeans, VEGFA (vascular endothelial growth factor A, MIM 192240), PRKAA1 (protein kinase, AMP-activated, alpha 1 catalytic subunit, MIM 602739) and NOS2A (nitric oxide synthase 2A, MIM 163730) are the best-supported candidates, as well as EGLN1 (egl-9 family hypoxia-inducible factor 1, MIM 606425), a downregulator of some HIF targets 12,24. In Tibetans 10,11,13,14,16,25, the HIF pathway gene EPAS1 (endothelial PAS domain protein 1, MIM 603349) and EGLN1 have been repeatedly identified 22. Recently, three similar studies focused on Ethiopian highlanders 17-19 suggested the involvement of HIF genes other than those identified in Tibetans and Andeans, with BHLHE41 (MIM 606200), THRB (MIM 190160), RORA (MIM 600825) and ARNT2 (MIM 606036) being the most prominent candidates.
However, there is little overlap in the list of significant genes in these three regions 18,19, with perhaps the exception of alcohol dehydrogenase genes identified in two out of the three analyses. Another exception is EGLN1: a comparative analysis of Tibetan and Andean populations 12 concluded that “the Tibetan and Andean patterns of genetic adaptation are largely distinct from one another”, identifying a single gene (EGLN1) under convergent evolution, but with both populations exhibiting a distinct dominant haplotype around this gene. This limited convergence does not contradict available physiological data, as Tibetans exhibit some phenotypic traits that are not found in Andeans 26. For example, Tibetan populations show lower hemoglobin concentration and oxygen saturation than Andean populations at the same altitude 27. Andeans and Tibetans also differ in their hypoxic ventilatory response, birth weight and pulmonary hypertension 28. Finally, EGLN1 has also been identified as a candidate gene in Kubachians, a high altitude (~2,000 m a.s.l.) Daghestani population from the Caucasus 15, as well as in Indians 29.
Nevertheless, it is still possible that the small number of genes under convergent evolution is due to a lack of power of genome scan methods done on separate pairs of populations. In order to overcome these difficulties, we introduce here a Bayesian genome scan method that (i) extends the F-model 30,31 to the case of a hierarchically subdivided population consisting of several migrant pools, and (ii) explicitly includes a convergent selection model.
We apply this approach to find genes, genomic regions, and biological pathways that have responded to convergent selection in the Himalayas and in the Andes.
Material and Methods
Hierarchical Bayesian Model
One of the most widely used statistics for the comparison of allele frequencies among populations is FST 32,33, and most studies cited in the introduction used it to compare low- and high-altitude populations within a given geographical region (Tibet, the Andes or Ethiopia). Several methods have been proposed to detect loci under selection from FST, and one of the most powerful approaches is based on the F-model (reviewed by Gaggiotti and Foll 34). However, this approach assumes a simple island model where populations exchange individuals through a unique pool of migrants. This assumption is strongly violated when dealing with replicated pairs of populations across different regions, which can lead to a high rate of false positives 35.
In order to relax the rather unrealistic assumption of a unique and common pool of migrants for all sampled populations, we extended the genome scan method first introduced by Beaumont and Balding 30 and later improved by Foll and Gaggiotti 31. More precisely, we posit that our data come from G groups (migrant pools or geographic regions), each group g containing Jg populations. We then describe the genetic structure by an F-model that assumes that allele frequencies at locus i in population j from group g, $\tilde{p}_{ijg} = \{p_{ijg1}, p_{ijg2}, \ldots, p_{ijgK_i}\}$ (where Ki is the number of distinct alleles at locus i), follow a Dirichlet distribution parameterized with group-specific allele frequencies $\tilde{p}_{ig} = \{p_{ig1}, p_{ig2}, \ldots, p_{igK_i}\}$ and with $F_{SC}^{ijg}$ coefficients measuring the extent of genetic differentiation of population j relative to group g at locus i. Similarly, at a higher group level, we consider an additional F-model where allele frequencies pig follow a Dirichlet distribution parameterized with meta-population allele frequencies $\tilde{p}_{i} = \{p_{i1}, p_{i2}, \ldots, p_{iK_i}\}$ and with $F_{CT}^{ig}$ coefficients measuring the extent of genetic differentiation of group g relative to the meta-population as a whole at locus i. Figure S1 shows the hierarchical structure of our model in the case of three groups (G = 3) and four populations per group (J1 = J2 = J3 = 4) and Figure S2 shows the corresponding non-hierarchical F-model for the same number of populations. All the parameters of the hierarchical model can be estimated by further assuming that alleles in each population j are sampled from a multinomial distribution 36. These assumptions lead to an expression for the probability of observed allele counts $a_{ijg} = \{a_{ijg1}, a_{ijg2}, \ldots, a_{ijgK_i}\}$:

$$\Pr(a_{ijg} \mid \tilde{p}_{ijg}, \tilde{p}_{ig}, \theta_{ijg}, \phi_{ig}) = \Pr(a_{ijg} \mid \tilde{p}_{ijg}) \, \Pr(\tilde{p}_{ijg} \mid \tilde{p}_{ig}, \theta_{ijg}) \, \Pr(\tilde{p}_{ig} \mid \tilde{p}_{i}, \phi_{ig})$$

where $\Pr(a_{ijg} \mid \tilde{p}_{ijg})$ is the multinomial likelihood, $\Pr(\tilde{p}_{ijg} \mid \tilde{p}_{ig}, \theta_{ijg})$ and $\Pr(\tilde{p}_{ig} \mid \tilde{p}_{i}, \phi_{ig})$ are Dirichlet prior distributions, $\theta_{ijg} = 1/F_{SC}^{ijg} - 1$, and $\phi_{ig} = 1/F_{CT}^{ig} - 1$. This expression can be simplified by integrating over pijg so as to obtain:

$$\Pr(a_{ijg} \mid \tilde{p}_{ig}, \theta_{ijg}, \phi_{ig}) = \Pr(a_{ijg} \mid \tilde{p}_{ig}, \theta_{ijg}) \, \Pr(\tilde{p}_{ig} \mid \tilde{p}_{i}, \phi_{ig})$$

where $\Pr(a_{ijg} \mid \tilde{p}_{ig}, \theta_{ijg})$ is the multinomial-Dirichlet distribution 34. The likelihood is obtained by multiplying across loci, regions and populations:

$$L(\tilde{p}_{ig}, \theta_{ijg}, \phi_{ig}) = \prod_{i} \prod_{g=1}^{G} \prod_{j=1}^{J_g} \Pr(a_{ijg} \mid \tilde{p}_{ig}, \theta_{ijg}) \, \Pr(\tilde{p}_{ig} \mid \tilde{p}_{i}, \phi_{ig})$$

Using this model, we incorporate potential deviation from the genome-wide F-statistics at each locus as in Beaumont and Balding 30. The genetic differentiation within each group g is:

$$\log\left(\frac{F_{SC}^{ijg}}{1 - F_{SC}^{ijg}}\right) = \log\left(\frac{1}{\theta_{ijg}}\right) = \alpha_{ig} + \beta_{jg}$$

where αig is a locus-specific component of $F_{SC}^{ijg}$ shared by all populations in group g, and βjg is a population-specific component shared by all loci. Similarly, we decompose the genetic differentiation at the group level under a logistic model as:

$$\log\left(\frac{F_{CT}^{ig}}{1 - F_{CT}^{ig}}\right) = \log\left(\frac{1}{\phi_{ig}}\right) = A_{i} + B_{g}$$

where Ai is a locus-specific component of $F_{CT}^{ig}$, shared by all groups in the meta-population, and Bg is a group-specific component shared by all loci.
By doing this, our model also eliminates the ambiguity of having a single αi parameter for more than two populations, since we now have (i) different selection parameters in each geographic region (αig are group specific) and (ii) separate parameters sensitive to adaptation among regions at the larger scale (Ai). We use the likelihood function and the logistic decomposition to derive the full Bayesian posterior:

$$\Pr(\tilde{p}_{i}, \tilde{p}_{ig}, \alpha_{ig}, \beta_{jg}, A_{i}, B_{g} \mid a) \propto L(\tilde{p}_{ig}, \theta_{ijg}, \phi_{ig}) \, \Pr(\tilde{p}_{i}) \, \Pr(\alpha_{ig}) \, \Pr(\beta_{jg}) \, \Pr(A_{i}) \, \Pr(B_{g})$$

where the prior for pi is a non-informative Dirichlet distribution, the priors for αig and Ai are Gaussian with mean 0 and variance 1, and the priors for βjg and Bg are Gaussian with mean −1 and variance 1. Note that priors on βjg and Bg have practically no influence on the posteriors, as these parameters use the huge amount of information coming from all loci.
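The building blocks of the model can be illustrated with a short Python sketch. It is not the authors' implementation: the function names are hypothetical, and it assumes the usual F-model parameterization in which the Dirichlet precision is 1/F − 1, so that the logistic decomposition gives a precision of exp(−(locus effect + population effect)).

```python
import math


def f_from_logit(locus_effect, pop_effect):
    """Invert log(F / (1 - F)) = locus_effect + pop_effect to recover F."""
    return 1.0 / (1.0 + math.exp(-(locus_effect + pop_effect)))


def theta_from_f(f):
    """Dirichlet precision implied by F (assumed relation: theta = 1/F - 1)."""
    return 1.0 / f - 1.0


def log_dirichlet_multinomial(counts, freqs, theta):
    """Log-probability of allele counts under a multinomial-Dirichlet
    distribution with parent allele frequencies `freqs` and precision `theta`."""
    n = sum(counts)
    logp = math.lgamma(n + 1) + math.lgamma(theta) - math.lgamma(n + theta)
    for a, p in zip(counts, freqs):
        logp += math.lgamma(a + theta * p) - math.lgamma(theta * p)
        logp -= math.lgamma(a + 1)
    return logp


# A locus effect of 0 (neutral) with a population effect of -1 (the prior mean).
f_neutral = f_from_logit(0.0, -1.0)
theta_neutral = theta_from_f(f_neutral)
# Log-probability of observing 30 and 70 copies of two alleles in one population
# when the group-level frequencies are 0.4 and 0.6.
example_logp = log_dirichlet_multinomial([30, 70], [0.4, 0.6], theta_neutral)
```

A positive locus-specific effect raises F and therefore lowers the precision, which allows population allele frequencies to depart further from the group frequencies than expected at neutral loci.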
Parameter Estimation
We extend the Reversible Jump Markov Chain Monte Carlo (RJ-MCMC) approach proposed by Foll and Gaggiotti 31 to identify selection both within groups and among groups. For each locus and in each group separately, we consider a neutral model where αig = 0, and a model with selection where the locus-specific effect αig ≠ 0. Similarly, we consider two models at the group level for each locus where Ai = 0 for the neutral model, and Ai ≠ 0 for the model with selection. In order to tailor our approach to study convergent adaptation, we also consider the case where different groups share a unique locus-specific component αi (see Figure 1 for an example of such a model with two groups of two populations). At each iteration of the MCMC algorithm, we update Ai and αig in a randomly chosen group g for all loci. As described in Foll and Gaggiotti 31, we propose to remove αig from the model if it is currently present, or to add it if it is not, and we do the same for Ai. We also add a specific Reversible Jump proposal for convergent adaptation: if all groups but one are currently in the selection model (αig ≠ 0 for all g but one), we propose with a probability 0.5 to move to the convergent evolution model (where we replace all αig by a single selection parameter αi shared by all groups), and with a probability 0.5 we perform a standard jump as described above.
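The choice between a standard jump and the convergent-adaptation jump can be sketched as follows. This is only an illustration of the proposal step under stated assumptions: the function name and the representation of the model state are hypothetical, the Metropolis-Hastings acceptance step is omitted, and moves out of the convergent model are not covered because they are not detailed here.

```python
import random


def propose_jump(in_model, group, rng):
    """Choose the type of model jump for one locus.

    in_model: tuple of booleans, one per group, True if alpha_ig is currently
              in the model (selection) and False if it is fixed to 0 (neutral).
    group:    index of the randomly chosen group being updated.
    rng:      a random.Random instance.

    Returns ("convergent", None) when a move to the shared-alpha model is
    proposed, or ("standard", new_in_model) when alpha_ig is added or removed.
    """
    n_selected = sum(in_model)
    # All groups but one under selection: propose convergence half of the time.
    if n_selected == len(in_model) - 1 and rng.random() < 0.5:
        return "convergent", None
    # Standard jump: remove alpha_ig if present, add it otherwise.
    flipped = list(in_model)
    flipped[group] = not flipped[group]
    return "standard", tuple(flipped)


rng = random.Random(1)
kind, new_state = propose_jump((True, False), 1, rng)
```

With two groups, the convergent jump is only ever proposed from a state in which exactly one group is in the selection model.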
Figure 1. Directed acyclic graph describing the Bayesian formulation of the hierarchical F-model at a given locus i. Square nodes represent data and circles represent model parameters to be estimated. Dashed circles represent population allele frequencies, which are analytically integrated using a Dirichlet-multinomial distribution (see method description). Lines between the nodes represent direct stochastic relationships within the model. With the exception of Figure 4, we use the same color codes in all Figures, with blue for Asia, red for America, and yellow for convergent adaptation.
Genomic Data Set
In order to improve our understanding of the genetic bases of adaptation to altitude, we have applied our hierarchical Bayesian method to the dataset from Bigham et al. 12. This data set consists of 906,600 SNPs genotyped in four populations using the Affymetrix Genome-Wide Human SNP Array 6.0 platform (see Web Resources). These four populations consist of two populations living at high altitude in the Andes (49 individuals) and in Tibet (49 individuals), as well as two related lowland populations from Central America (39 Mesoamericans) and East Asia (90 individuals from the international HapMap project 37). Thus, we compared four alternative models for each locus at the population level: 1) a neutral model (αi1 = αi2 = 0), 2) a model with selection acting only in Tibetans (αi2 = 0), 3) a model with selection acting only in Andeans (αi1 = 0), and 4) a convergent adaptation model with selection acting in both Tibetans and Andeans (αi1 = αi2 = αi). We estimate the posterior probability that a locus is under selection by summing up the posterior probabilities of the three non-neutral models (2, 3 and 4) and we control the False Discovery Rate (FDR) by calculating associated q-values 38-40, which are Bayesian analogues of p-values that take multiple testing into account. For a given SNP, a q-value corresponds to the expected FDR if its posterior probability is used as a significance threshold. We do not pay any particular attention to the Ai parameter here, as it can be interpreted as a potential adaptation at the continental level in Asians and Native Americans, which is not directly relevant in the context of adaptation to high altitude (but see Discussion).
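The two summaries described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the dictionary keys are hypothetical, and the q-value is computed as the mean posterior probability of neutrality among all loci that would be declared significant at a given threshold, which is one common way to obtain an expected FDR from posterior probabilities.

```python
def selection_posterior(model_probs):
    """Posterior probability of selection: sum over the three non-neutral models."""
    return sum(model_probs[m] for m in ("asia", "america", "convergent"))


def q_values(posteriors):
    """Expected FDR when each locus' posterior probability is used as threshold.

    Loci are ranked by decreasing posterior probability of selection; the
    q-value of a locus is the average of (1 - posterior) over all loci ranked
    at least as high. Tied loci receive the same q-value.
    """
    order = sorted(range(len(posteriors)), key=lambda i: posteriors[i], reverse=True)
    q = [0.0] * len(posteriors)
    expected_false = 0.0
    tie_start = 0
    for rank, idx in enumerate(order, start=1):
        expected_false += 1.0 - posteriors[idx]
        last_of_tie = rank == len(order) or posteriors[order[rank]] != posteriors[idx]
        if last_of_tie:
            for tied in order[tie_start:rank]:
                q[tied] = expected_false / rank
            tie_start = rank
    return q


# Three hypothetical loci with posterior probabilities of selection.
example_q = q_values([0.99, 0.5, 0.9])
```

In this toy example the locus with posterior probability 0.99 is the only one that would pass a q-value threshold of 0.01.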
We excluded SNPs with a global minor allele frequency below 5% to avoid potential biases due to uninformative polymorphisms 41. This left us with 632,344 SNPs that were analyzed using the hierarchical F-model described above. We identified genomic regions potentially involved in high altitude adaptation by using a sliding-window approach. We considered windows of 500 kb, with a shifting increment of 25 kb at each step. The average number of SNPs per window over the whole genome was 121.4 (sd=44.6), after discarding any window containing fewer than 50 SNPs. We considered a window as a candidate target for selection if the 5% quantile of the q-values in the window was lower than 0.01, and we merged overlapping significant windows into larger regions.
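The window scan can be sketched as below for a single chromosome. This is an illustrative sketch rather than the authors' script: the function names are hypothetical, windows are assumed to start at position 0, and the 5% quantile is taken with a simple floor rule on the sorted q-values, which is only one of several possible quantile definitions.

```python
from bisect import bisect_left


def lower_quantile(values, prob):
    """Quantile by a simple floor rule on the sorted values (an assumption)."""
    ordered = sorted(values)
    return ordered[int(prob * (len(ordered) - 1))]


def candidate_regions(positions, qvals, size=500_000, step=25_000,
                      min_snps=50, prob=0.05, threshold=0.01):
    """Scan one chromosome and merge overlapping significant windows.

    positions: SNP coordinates in base pairs, sorted in increasing order.
    qvals:     q-values of the same SNPs, in the same order.
    Returns a list of (start, end) candidate regions.
    """
    regions = []
    start = 0
    while start <= positions[-1]:
        first = bisect_left(positions, start)
        stop = bisect_left(positions, start + size)
        window_q = qvals[first:stop]
        # Discard sparse windows, then test the lower quantile of the q-values.
        if len(window_q) >= min_snps and lower_quantile(window_q, prob) < threshold:
            if regions and start < regions[-1][1]:
                regions[-1][1] = start + size  # overlaps the previous region
            else:
                regions.append([start, start + size])
        start += step
    return [tuple(region) for region in regions]


# Toy chromosome: 100 SNPs, of which those at positions 40 to 59 have low q-values.
toy_positions = list(range(100))
toy_qvals = [0.001 if 40 <= pos < 60 else 0.5 for pos in toy_positions]
toy_regions = candidate_regions(toy_positions, toy_qvals, size=20, step=5, min_snps=10)
```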
Detecting Polygenic Convergent Adaptation
We first used SNPs identified as being under convergent adaptation to perform classical enrichment tests for pathways using Panther (see Web Resources) 42 and Gene Ontology (GO) 43 using String 9.1 (see Web Resources) 44. More specifically, we extracted the list of 302 genes within 10 kb of all SNPs assigned to the convergent adaptation model and showing a q-value below 10%, to serve as input for these tests.
These two approaches have limited power to detect selection acting on polygenic traits, for which adaptive events may have arisen from standing variation rather than from new mutations 3,25. In order to detect polygenic adaptation, we used a recent gene set enrichment approach 45, which tests if the distribution of a statistic computed across genes of a given gene set is significantly different from the rest of the genome. As opposed to the classical enrichment tests, this method does not rely on an arbitrary threshold to define the top outliers and it uses all genes that include at least one tested SNP. In short, we tested more than 1,300 gene sets listed in the Biosystems database 46 for a significant shift in their distribution of selection scores relative to the baseline genome-wide distribution. In our case, the selection score of each SNP is its associated q-value of convergent selection. As previously done 45, we calculated the sum of gene scores for each gene set and compared it to a null distribution of random sets (N=500,000) to infer its significance (see “Gene set enrichment analysis method” section in the Appendix). In order to avoid any redundancy between gene sets, we iteratively removed genes belonging to the most significant gene sets from the less significant gene sets before testing them again in a process called “pruning”. This process leads to a list of gene sets whose significance is obtained by testing completely non-overlapping lists of genes. See the Appendix for a more detailed description of the method.
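The core of such a test, comparing the summed score of a gene set with that of random gene sets of the same size, can be sketched as follows. This is a simplified illustration and not the published method: the function name is hypothetical, gene scores are assumed to be oriented so that higher values indicate stronger evidence of convergent selection, and the pruning step is not shown.

```python
import random


def gene_set_p_value(gene_scores, gene_set, n_random=10_000, rng=None):
    """Empirical p-value of the summed score of a gene set.

    gene_scores: dict mapping every tested gene to its score
                 (assumption: higher score = stronger evidence of selection).
    gene_set:    iterable of gene names; genes without a score are ignored.
    """
    rng = rng or random.Random(0)
    members = [gene for gene in gene_set if gene in gene_scores]
    observed = sum(gene_scores[gene] for gene in members)
    all_genes = list(gene_scores)
    at_least_as_high = 0
    for _ in range(n_random):
        sample = rng.sample(all_genes, len(members))
        if sum(gene_scores[gene] for gene in sample) >= observed:
            at_least_as_high += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (at_least_as_high + 1) / (n_random + 1)


# Toy genome of 100 genes with increasing scores, and two hypothetical gene sets.
toy_scores = {f"g{i}": i / 100 for i in range(100)}
p_top = gene_set_p_value(toy_scores, ["g95", "g96", "g97", "g98", "g99"], n_random=2000)
p_bottom = gene_set_p_value(toy_scores, ["g0", "g1", "g2", "g3", "g4"], n_random=2000)
```

Because every gene contributes to the sum, a set can reach significance through many moderately high scores even when none of its genes would pass a single-gene threshold.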
Independent SNP Simulations
In order to evaluate the performance of our hierarchical method, we simulated data with features similar to the genomic data set analyzed here under our hierarchical F-model. Our simulated scenario thus includes two groups of two populations made of 50 diploids each, with FSC = 0.02 for all four populations and FCT = 0.08 for both groups. Note that these F-statistics correspond to those measured on the genomic data set we have analyzed here. In each group, a fraction of loci are under selective pressure in one of the two populations only. We simulated a total of 100,000 independent SNPs among which (i) 2,500 are under weak convergent evolution with αi = 3, (ii) 2,500 are under stronger convergent evolution with αi = 5, (iii) 2,500 are under weak selection in the first group with αi1 = 3 and neutral (αi2 = 0) in the second group, (iv) 2,500 are under stronger selection in the first group with αi1 = 5 and neutral (αi2 = 0) in the second group, and (v) 90,000 remaining SNPs that are completely neutral (αi1 = αi2 = 0). As in the real data, we conditioned the SNPs to have a global minor allele frequency above 5%. We analyzed this simulated dataset using three different approaches: (i) the hierarchical F-model introduced above, (ii) two separate pairwise analyses (one for each group) containing two populations using the original F-model implemented in BayeScan 31 (see Web Resources), (iii) a single analysis containing the four populations using the original F-model implemented in BayeScan 31 ignoring the hierarchical structure of the populations. In our hierarchical model, the best selection model for each SNP was identified as described above using a q-value<0.01. When analyzing data in separate pairs of populations, we considered a SNP to be under convergent adaptation when it had a q-value<0.01 in the two regions.
Haplotype-Based Simulations and Statistics
Several alternative methods exist to detect natural selection. In particular, methods based on haplotype structure 47-51 have been widely applied to identify local adaptation to high altitude in humans (including the dataset from Bigham et al. 12 we are using here). In order to compare the performance of our approach with haplotype-based methods (see below), we have simulated haplotypic datasets with features similar to the genomic data set analyzed here. We used the SimuPop package for Python 52 (see Web Resources) and considered a scenario where an ancestral population gives rise to two descendant populations, which after 600 generations (15,000 years) undergo separate splits into two populations, one at sea level and the other at high altitude. After the second split, populations evolve for 200 generations (5,000 years) until the present time. This evolutionary scenario is supposed to approximate the divergence of Asian and Amerindian populations followed by a subsequent divergence of highland and lowland populations in Asia and in America, even though this history might have been more complex 53,54. We assume that there is no migration between populations and we adjusted population sizes so that FSC = 0.02 for all four populations and FCT = 0.08 for both groups, to have F-statistics values comparable to the observed data set. More precisely, we used Ne=10,000 for the ancestral population, Ne=4,000 for the two descendant populations and Ne=3,500 for each of the four populations after the second split. Recombination rate was set to 10⁻⁸ (=1cM/Mb) and the mutation rate to 1.2×10⁻⁸ 55. We considered a strong selection scenario (Ns=100) and a moderate selection scenario (Ns=10), with positive selection operating only in high altitude populations right after the second split.
We simulated 1,500 genomic regions each with 101 SNPs spaced every 4kb, of which (i) 1,000 were neutral, (ii) 250 were under moderate convergent evolution (Ns=10 in the two high-altitude populations), (iii) 250 were under strong convergent evolution (Ns=100 in the two high-altitude populations). For selected regions, selection operates on the SNP located at the center of the genomic region (i.e. SNP 50). We generated datasets that differed in the initial allele frequency (IAF) of the selected variant: (i) IAF=0.001, (ii) IAF=0.01, and (iii) IAF=0.1. At the end of the simulations, we sampled 50 individuals from each population and analyzed the resulting dataset using different approaches. We used two commonly used statistics describing the pattern of long range homozygosity: the integrated haplotype score iHS based on the decay of haplotype homozygosity with recombination distance 48 and the cross-population extended haplotype homozygosity (XP-EHH), which contrasts the evidence of positive selection between two populations 49, and which is therefore particularly well suited to our case. Overall, we thus compared four different approaches: (i) the hierarchical F-model introduced above, (ii) two separate pairwise analyses (one for each group) containing two populations using the original F-model implemented in BayeScan, (iii) two separate pairwise analyses (one for each group) containing two populations using XP-EHH, and (iv) two separate analyses of the high altitude populations using iHS. We used receiver operating characteristic (ROC) curves and the area under the curve (AUC) to compare the performance of the four approaches as implemented in the R package pROC 56. Except for the hierarchical F-model introduced above, none of these approaches can explicitly model convergent evolution, and convergent adaptation is only inferred after separate analyses when significance is reached in the two regions at the same time.
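For readers unfamiliar with the AUC, the quantity being compared can be sketched in Python as the probability that a simulated selected region receives a higher score than a simulated neutral one. This is a generic rank-based calculation and not the pROC implementation used for the analyses; the function name and the toy scores are hypothetical.

```python
def roc_auc(selected_scores, neutral_scores):
    """Area under the ROC curve from two lists of scores.

    Computed as the fraction of (selected, neutral) pairs in which the selected
    region has the higher score, with ties counted as one half.
    """
    wins = 0.0
    for selected in selected_scores:
        for neutral in neutral_scores:
            if selected > neutral:
                wins += 1.0
            elif selected == neutral:
                wins += 0.5
    return wins / (len(selected_scores) * len(neutral_scores))


# Toy example: two selected and two neutral regions.
toy_auc = roc_auc([0.9, 0.8], [0.1, 0.85])  # 3 of the 4 pairs are ranked correctly
```

An AUC of 0.5 corresponds to a method that ranks selected and neutral regions at random, and an AUC of 1 to perfect separation.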
Results
Patterns of Selection at the SNP Level
Using our hierarchical Bayesian analysis, we identified 1,159 SNPs potentially under selection at the 1% FDR level (q-value<0.01). For each SNP, we identified the model of selection (selection only in Asia, selection only in South America, or convergent selection; see Material and Methods) with the highest posterior probability. With this procedure, 362 SNPs (31%) were found under convergent adaptation, whereas 611 SNPs (53%) were found under selection only in Asia, and 186 SNPs (16%) under selection only in South America. These results suggest that, even at the SNP level, convergent adaptation is more common than previously thought 5,24,57, although they are consistent with the results of a recent literature meta-analysis over several species 5.
In order to evaluate the additional power gained with the simultaneous analysis of the four populations, we performed separate analyses in the two continents using the non-hierarchical F-model 31. These two pairwise comparisons identified 160 SNPs under selection in the Andes, and 940 in Tibet. The overlap in significant SNPs between these two separate analyses and that under the hierarchical model is shown in Figure 2A. Interestingly, only 6 SNPs are found under selection in both regions when performing separate analyses in Asians and Amerindians. This very limited overlap persists even if we strongly relax the FDR in both pairwise analyses: at the 10% FDR level only 13 SNPs are found under selection in both continents. These results are consistent with those of Bigham et al. 12, who analyzed both continents separately with a different statistical method based on FST, and who found only 22 significant SNPs in common between the two geographic regions. This suggests that the use of intersecting sets of SNPs found significant in separate analyses is a sub-optimal strategy to study the genome-wide importance of convergent adaptation. Interestingly, 15% of the SNPs (162 SNPs, see Figure 2A) identified as under selection by our method are not identified by any of the separate analyses, suggesting a net gain in power for our method to detect genes under selection (as confirmed by our simulation studies below).
Figure 2. Venn diagrams showing the overlap of SNPs potentially under selection in Asia and in America at a 1% FDR. A: Overlap between all SNPs found under any type of selection using our hierarchical model introduced here (green) with those found in separate analyses performed in Asia (blue) and in America (red). B: Overlap between SNPs found under convergent selection using our hierarchical model (yellow) with those found in separate analyses performed in Asia (blue) and in America (red).
We examined in more detail the 362 SNPs identified as under convergent adaptation. The overlap of these SNPs (the yellow circle) with those identified by the two separate analyses is shown in Figure 2B. As expected, the 6 SNPs identified under selection in both regions by the two separate analyses are part of the convergent adaptation set. However, we note that 272 of the SNPs in the convergent adaptation set (75%) are identified as being under selection in only one of the two regions by the separate analyses. This suggests that although natural selection may be operating similarly in both regions, limited sample size may prevent its detection in one of the two continents.
Genomic Regions Under Selection
Using a sliding window approach, we find 25 candidate regions with length ranging from 500 kb to 2 Mb (Figure 3 and Table S1). Among these, 18 regions contain at least one significant SNP assigned to the convergent adaptation model, and 11 regions contained at least one 500 kb-window where the convergent adaptation model was the most frequently assigned selection model among significant SNPs (Figure 3). Contrastingly, Bigham et al. 12 identified 14 and 37 candidate 1 Mb regions for selection in Tibetans and Andeans, respectively, but none of these 1 Mb regions were shared between Asians and Amerindians. Moreover, only two of the regions previously found under positive selection in South America and only four in Asia overlap with our 25 significant regions.
Figure 3. Each dot represents the 5% quantile of the SNPs q-values in a 500 kb window. Windows are shifted by increments of 25 kb and considered as a candidate target for selection if the 5% quantile is lower than 0.01 (horizontal dashed line). Overlapping significant windows are merged into 25 larger regions (indicated by grey vertical bars, see Table S2). Significant windows are colored in yellow when they contain at least one significant SNP for convergent adaptation. Otherwise they are colored according to the most represented model of selection identified among the SNPs they contain: blue for selection only in Asia and red for selection only in America. We also report the names of genes discussed in the text.
As noted above, the only gene showing signs of convergent evolution found by Bigham et al. 12 is EGLN1, which has also been identified in several other studies (see Table 1 in Simonson et al. 22 for a review). EGLN1 is also present in one of our 25 regions where three out of eight significant SNPs are assigned to the convergent adaptation model. We note that the significant SNPs in this region are not found in EGLN1 directly but in two genes surrounding it (TRIM67 [MIM 610584] and DISC1 [MIM 605210]), as reported earlier 14,58. The HIF pathway gene EPAS1, which is the top candidate in many studies 22, is also present in one of our 25 regions, where 28 of the 80 significant SNPs are assigned to the convergent adaptation model. Recently a particular 5-SNP EPAS1 haplotype has been identified in Tibetans as being the result of introgression from Denisovans 54. Unfortunately none of the five SNPs of interest identified in this study are present in our dataset, and additional sequencing will be required to check whether this haplotype is also present in Andeans.
Table 1. Number of simulated SNPs found to be neutral or under selection at an FDR of 1%, reported for different methods and selection strengths.
Out of the 1,159 SNPs we identified above as being under one model of selection, 312 are located within our 25 regions (Table S1), and 120 of them are identified as under convergent adaptation (out of a total of 362 SNPs identified as under convergent adaptation in the whole data set, see Figure 2B). Almost all the 18 regions containing at least one significant SNP assigned to the convergent adaptation model also contain SNPs where the best-supported model is selection only in Asia, or selection only in America. However, it is hard to determine whether this reflects both convergent adaptation and region-specific adaptation in the same genomic region, or simply different statistical power.
Polygenic Convergent Adaptation
We identified three pathways significantly enriched for genes involved in convergent adaptation using Panther 42 after Bonferroni correction at the 5% level. These are the “metabotropic glutamate receptor group I” pathway, the “muscarinic acetylcholine receptor 1 and 3” signaling pathway, and the “epidermal growth factor receptor” (EGFR, MIM 131550) signaling pathway. Using the String 9.1 database 44, two GO terms were significantly enriched for these genes when controlling for a 5% FDR: “ethanol oxidation” (GO:0006069) and “positive regulation of transmission of nerve impulse” (GO:0051971). Using a recent and more powerful gene set enrichment approach 45, we first identified 25 gene sets with an associated q-value below 5% (Table S2). An enrichment map showing these sets and their overlap is presented in Figure 4. There are two big clusters of overlapping gene sets, one related to Fatty Acid Oxidation with “Fatty Acid Omega Oxidation” as the most significant set and another immune system related cluster with “Interferon gamma signaling” as the most significant gene set. After pruning, only these two above-mentioned gene sets are left with a q-value below 5%. It is worth noting that the “Fatty Acid Omega Oxidation” pathway, which is the most significant gene set (q-value<10⁻⁶), contains many top scoring genes for convergent selection, including several alcohol and aldehyde dehydrogenases, as listed in Table S3. Interestingly, the GO term “ethanol oxidation” is no longer significant after excluding the genes involved in the “Fatty Acid Omega Oxidation” pathway. Out of the 362 SNPs identified under convergent adaptation above, only four are located in genes (±50kb) belonging to the Fatty Acid Omega Oxidation pathway (rs3805322, rs2051428, rs4767944, rs4346023), and only seven SNPs are found in genes belonging to the Interferon gamma signaling pathway (rs12531711, rs7105122, rs4237544, rs10742805, rs17198158, rs4147730, rs3115628).
This apparent lack of significant SNPs in candidate pathways is expected, as our gene set enrichment approach does not rely on an arbitrary threshold to define the top outliers and is thus more suited to detect lower levels of selection acting synergistically on polygenic traits.
The 25 nodes represent gene sets with q-value<0.05. The size of a node is proportional to the number of genes in a gene set. The node color scale represents gene set p-values. Edges represent mutual overlap: nodes are connected if one of the sets has at least 33% of its genes in common with the other gene set. The widths of the edges scale with the similarity between nodes.
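The edge rule of the enrichment map can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `overlap_edges` and the toy gene sets are hypothetical, and the similarity used here (shared genes divided by the size of the smaller set) is one plausible reading of "similarity between nodes", not necessarily the measure used to draw Figure 4.

```python
from itertools import combinations


def overlap_edges(gene_sets, min_fraction=0.33):
    """Return (set_a, set_b, similarity) for every pair of connected gene sets.

    Two sets are connected if at least one of them shares `min_fraction`
    of its genes with the other. That is equivalent to requiring
    shared / size_of_smaller_set >= min_fraction.
    """
    edges = []
    for (name_a, a), (name_b, b) in combinations(sorted(gene_sets.items()), 2):
        shared = len(a & b)
        if shared == 0:
            continue  # also avoids dividing by zero for empty sets
        # Assumption: similarity is the overlap relative to the smaller set.
        similarity = shared / min(len(a), len(b))
        if similarity >= min_fraction:
            edges.append((name_a, name_b, similarity))
    return edges


# Hypothetical toy gene sets, for illustration only.
toy_sets = {
    "A": {"g1", "g2", "g3"},
    "B": {"g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10", "g11", "g12"},
    "C": {"x1"},
}
toy_edges = overlap_edges(toy_sets)
```

In the toy example, set A shares one of its three genes with set B, so the pair is connected even though that gene is only a tenth of set B.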
Power of the Hierarchical F-Model
Our simulations show a net increase in power to detect selection using the global hierarchical approach as compared to using two separate pairwise analyses (Table 1 and Figures 5 and 6). For the 2,500 SNPs simulated under the weak convergent selection model (αi = 3), the hierarchical model detects 6.5 times more SNPs than the two separate analyses (306 vs. 47). This difference can be explained by the smaller amount of information used when doing separate analyses instead of a single one. The power greatly increases when selection is stronger, and among the 2,500 SNPs simulated with αi = 5, 1,515 are correctly classified using our hierarchical model, as compared to only 643 using separate analyses. As with the real altitude data, the two separate analyses often wrongly classify the convergent SNPs correctly identified as such by our hierarchical method as being under selection in only one of the two groups, but sometimes also as completely neutral (64 such SNPs when αi = 3 and 76 when αi = 5, see Figure 5B and D). We note that the hierarchical model is also more powerful at detecting selected loci regardless of whether or not the SNPs are correctly assigned to the convergent evolution set. Indeed, our method identifies 2,626 SNPs as being under any model of selection (i.e. convergent evolution or selection in only one of the two regions) among the 5,000 simulated under convergent selection, whereas the separate analysis detects only 2,475 SNPs. When selection is present only in one of the two groups (αi1 = 3 or 5 and αi2 = 0), the power of the hierarchical model is comparable with the separate analysis in the corresponding group, implying that there is no penalty in using the hierarchical model even in the presence of group-specific selection. A few of the group-specific selected SNPs are wrongly classified in the convergent adaptation model with a false positive rate of 1.7% (84 SNPs out of 5,000).
Overall, the false discovery rate is well calibrated using our q-value threshold of 0.01 in both cases, with 29 false positives out of 4,141 significant SNPs (FDR=0.70%) for our hierarchical model, and 30 false positives out of 3,984 significant SNPs (FDR=0.75%) for the two separate analyses. Finally, when the four populations are analyzed together without accounting for the hierarchical structure, a large number of false positives appear (Table 1 and Figure 6C), in keeping with previous studies 35. Under this island model, 1,139 neutral SNPs are indeed identified as being under selection among the 90,000 simulated neutral SNPs (vs. 29 and 30 using the hierarchical method or two separate analyses, respectively). The non-hierarchical approach does not allow one to distinguish different models of selection, but among the 10,000 SNPs simulated under different types of selection, only 2,598 are significant. This shows that the non-hierarchical analysis leads to both reduced power and a very large false discovery rate (FDR=30.4%) in the presence of a hierarchical genetic structure.
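The false discovery rates quoted above follow directly from the counts in the text, as the short Python sketch below illustrates. The helper name `empirical_fdr` is hypothetical. For the non-hierarchical analysis, the total number of significant SNPs is assumed to be the sum of the 1,139 neutral and 2,598 selected SNPs reported as significant.

```python
def empirical_fdr(false_positives, significant):
    """Fraction of significant SNPs that were actually simulated as neutral."""
    if significant == 0:
        return 0.0
    return false_positives / significant


# Counts taken from the text above.
fdr_hierarchical = empirical_fdr(29, 4141)
fdr_separate = empirical_fdr(30, 3984)

# Assumption: all significant SNPs in the non-hierarchical analysis are
# either neutral false positives (1,139) or truly selected SNPs (2,598).
fdr_non_hierarchical = empirical_fdr(1139, 1139 + 2598)

for label, value in [
    ("hierarchical", fdr_hierarchical),
    ("separate", fdr_separate),
    ("non-hierarchical", fdr_non_hierarchical),
]:
    print(f"{label}: FDR = {100 * value:.2f}%")
```

The first two values stay below the nominal 1% threshold, whereas the non-hierarchical value is roughly thirty times larger.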
Venn diagrams showing the overlap of SNPs simulated under a convergent evolution model and identified under selection at a 1% FDR. A and C: Overlap between SNPs found under any type of selection using our hierarchical model (green) with those found in separate analyses performed in group 1 (blue) and in group 2 (red). B and D: Overlap between SNPs found under convergent adaptation using our hierarchical model (yellow) with those found in separate analyses performed in group 1 (blue) and in group 2 (red). In A and B, 2,500 SNPs are simulated under weak convergent selection (αi = 3), while in C and D 2,500 SNPs are simulated under stronger convergent selection (αi = 5).
For simulated SNPs, we plot the best selection model inferred (A) under our hierarchical F-model, (B) using two separate analyses of pairs of populations, and (C) under a non-hierarchical F-model performed on four populations, thus ignoring the underlying hierarchical population structure. The colors indicate the inferred model: convergent evolution (yellow), selection only in the first group (blue), selection only in the second group (red), and no selection (black). Note that we use purple in the C panel, as this approach does not allow one to distinguish between different models of selection. For better visualization, we only plot 10,000 neutral loci among the 90,000 simulated, but the loci not shown display a very similar pattern.
Our haplotype-based simulations also show that our hierarchical model generally performs much better than iHS and XP-EHH at detecting convergent adaptation (Table 2). iHS has very low power to detect selection in all scenarios tested here, while XP-EHH performs well (AUC=0.75) when IAF=0.1 and selection is moderate (Ns=10), and very well (AUC=0.94) when IAF=0.001 and selection is strong (Ns=100). ROC curves for these two cases are presented in Figure 7, and for the four other cases in Figure S3. In both of these cases, however, the hierarchical model has a higher performance (AUC=0.92 and AUC=0.999, respectively), and it also shows a high performance (AUC=0.92 and AUC=0.94) in the two other scenarios with strong selection (Ns=100) where XP-EHH has almost no power to detect convergent adaptation (AUC=0.57 and AUC=0.50 respectively, see Table 2). Interestingly, for the case where IAF=0.1 and Ns=10, XP-EHH has a slightly higher performance to detect selection in each region individually than the F-model as implemented in BayeScan (AUC=0.75 vs. AUC=0.71, respectively), but our hierarchical model outperforms these two approaches drastically (AUC=0.92, Figure 7A). Note that XP-EHH performs somewhat better than the other methods in one scenario (Ns=10, IAF=0.01), but its performance (AUC=0.60) is not particularly good and all methods seem to have difficulty detecting convergent adaptation in this case. Overall, our analyses confirm that the use of separate analyses results in reduced power to detect convergent adaptation, which explains the difference between results obtained using our and previous methods when detecting high altitude adaptation in humans. The ROC analysis also shows that using a less stringent cutoff in separate analyses is far from being as powerful as our hierarchical model.
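For readers less familiar with the AUC values compared above, the following Python sketch shows one standard way to compute the area under a ROC curve from per-locus scores and known simulation labels. It uses the rank-based (Mann-Whitney) formulation, which is an assumption made for illustration and not necessarily the procedure used to produce Table 2. The function name `roc_auc` and the example scores are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparisons.

    The AUC equals the probability that a randomly chosen selected locus
    receives a higher score than a randomly chosen neutral locus, with
    ties counted as one half.
    """
    selected = [s for s, y in zip(scores, labels) if y]
    neutral = [s for s, y in zip(scores, labels) if not y]
    if not selected or not neutral:
        raise ValueError("need at least one selected and one neutral locus")
    wins = 0.0
    for s in selected:
        for n in neutral:
            if s > n:
                wins += 1.0
            elif s == n:
                wins += 0.5
    return wins / (len(selected) * len(neutral))


# Hypothetical scores: two selected loci (label 1) and two neutral loci (label 0).
example_auc = roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

An AUC of 0.5 corresponds to a method that cannot separate selected from neutral loci, which is the sense in which XP-EHH has almost no power in the scenarios where its AUC is 0.50 or 0.57.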
ROC curves summarizing the relative performance of our hierarchical model, BayeScan, and XP-EHH to detect convergent adaptation for simulated scenarios when (A) IAF=0.1 and Ns=10 and (B) IAF=0.001 and Ns=100 (see also Table 2 for overall scores).
Performance (AUC) of different methods to detect convergent adaptation in the case of haplotype-based simulated data sets. For the case Ns=100 and IAF=0.001 iHS could not be computed.
Discussion
Convergent Adaptation to High Altitude in Asia and America is Not Rare
Our hierarchical F-model reveals that convergent adaptation to high altitude is more frequent than previously described in Tibetans and Andeans. Indeed, 31% (362/1159) of all SNPs found to be potentially under selection at an FDR of 1% can be considered as under convergent adaptation in Asia and America. This is in sharp contrast with a previous analysis of the same data set where only a single gene was found to be responding to altitudinal selection in both Asians and Amerindians 12. Our model confirms the selection of EGLN1 in both Tibetans and Andeans. We also show that some genes already known to be involved in adaptation to high altitude in Tibetans, like EPAS1, may also have the same function in Andeans. Finally, we identified genomic regions, pathways, and GO terms potentially linked to convergent adaptation to high altitude in Tibetans and Andeans that have not been previously reported. Our approach thus seems more powerful than previous pairwise analyses, which is confirmed by our simulation studies. It suggests that datasets analyzed by previous studies that tried to uncover convergent adaptation by comparing lists of significant SNPs from separate pairwise analyses 59-63 would benefit from being reanalyzed with our method. We note that more complex demography could lead to a false positive rate higher than the nominal value. Based on the simple scenario of the divergence of four populations we have simulated, we found that our method is robust to the assumed demographic model, but this may not always be the case, and significant SNPs have to be considered only as candidates for further investigations.
Polygenic and Convergent Adaptation in the Omega Oxidation Pathway
Our top significant GO term is linked to alcohol metabolism, in keeping with a recent study of a high altitude population in Ethiopia 18,19. Indeed, one of the 25 regions identified in the present study includes several alcohol dehydrogenase (ADH) genes (ADH1A [MIM 103700], ADH1B [MIM 103720], ADH1C [MIM 103730], ADH4 [MIM 103740], ADH5 [MIM 103710], ADH6 [MIM 103735], ADH7 [MIM 600086]) located in a 370 kb segment of chromosome 4 (Figure 3), and another significant 2 Mb segment of chromosome 12 includes ALDH2 (acetaldehyde dehydrogenase, MIM 100650). Some evidence of positive selection in ADH1B and ALDH2 had been reported in East-Asian populations, but without any clear selective forces identified 64.
Interestingly, our gene set enrichment analysis suggests a potential evolutionary adaptation of this group of genes, since they all belong to our most significant pathway, namely “Fatty Acid Omega Oxidation” (Table S3). Omega oxidation is an alternative to the beta-oxidation pathway involved in fatty acid degradation and energy production in the mitochondrion. Degradation of fatty acids into sugar by omega oxidation is usually a minor metabolic pathway, which becomes more important when beta-oxidation is defective 18,19,65, or in case of hypoxia 66. It is however unclear if omega oxidation is a more efficient alternative to beta oxidation at high altitude, or if it would rather contribute to the degradation of fatty acids accumulating when beta oxidation is defective. The detoxifying role of this pathway is supported by the fact that it is usually mainly active in the liver and in the kidney 65. The fact that Ethiopians also show signals of adaptations in ADH and ALDH genes 19 suggests that convergent adaptation in the omega oxidation pathway could have occurred on three different continents in humans.
Response to Hypoxia-Induced Neuronal Damage
Hypoxia leads to neuronal damage through over-stimulation of glutamate receptors 67. Two out of our three significant pathways found with Panther (“metabotropic glutamate receptor group I” and “muscarinic acetylcholine receptor 1 and 3”) for convergent adaptation are involved with neurotransmitter receptors. The metabotropic glutamate receptor group I increases N-methyl-D-aspartate (NMDA) receptor activity, and this excitotoxicity is a major mechanism of neuronal damage and apoptosis 68. Consistently, the only significant GO term after excluding the genes involved in omega oxidation is also related to neurotransmission (“positive regulation of transmission of nerve impulse”) and contains two significant glutamate receptor genes (GRIK2 [MIM 138244] and GRIN2B [MIM 138252]) as well as IL6 (MIM 147620).
One of our top candidate regions for convergent adaptation includes 19 significant SNPs assigned to the convergent adaptation model, which are spread over a 100 kb region on chromosome 7 around IL6 (Figure 3), the gene encoding interleukin-6 (IL-6), an important cytokine. Interestingly, it has been shown that IL-6 plasma levels increase significantly when sea-level resident individuals are exposed to high altitude (4300 m) 69, and IL-6 has been shown to have a neuroprotective effect against glutamate- or NMDA-induced excitotoxicity 70. Consistently, the “metabotropic glutamate receptor group III” pathway seems to have responded to selection in Ethiopian highlanders 17. Together, these results suggest a genetic basis for an adaptive response to neuronal excitotoxicity induced by high altitude hypoxia in humans.
Versatility of the Hierarchical Bayesian Model to Uncover Selection
Our statistical model is very flexible and can cope with a variety of sampling strategies to identify adaptation. For example, Pagani et al. 15 used a different sampling scheme to uncover high altitude adaptation genes in North-Caucasian highlanders. They sampled Daghestani from three closely related populations (Avars, Kubachians, and Laks) living at high altitude that they compared with two lowland European populations. Here again, our strategy would allow the incorporation of these five populations into a single analysis. A first group would correspond to the Daghestan region, containing the three populations and a second group containing the two lowland populations. However, in that case, it is the decomposition of FCT in equation 2 that would allow the identification of loci overly differentiated between Daghestani (“group 1”) and European (“group 2”) populations.
Our approach could also be very useful in the context of Genome Wide Association Studies (GWAS) meta-analysis. For example, Scherag et al. 71 combined two GWAS on French and German samples to identify SNPs associated with extreme obesity in children. These two data sets could be combined and a single analysis could be performed under our hierarchical framework, explicitly taking into account the population structure. Our two “groups” in Figure 1 would correspond respectively to French and German individuals. In each group the two “populations” would correspond respectively to cases (obese children) and controls (children with normal weight). As in the present study, the decomposition of FSC and the use of a convergent evolution model would allow the identification of loci associated with obesity in both populations. Additionally, a potential hidden genetic structure between cases and controls and any shared ancestry between French and Germans would be dealt with by the βjg and Bg coefficients in equations 1 and 2, respectively.
We have introduced here a flexible hierarchical Bayesian model that can deal with complex population structure, and which allows the simultaneous analysis of populations living in different environments in several distinct geographic regions. Our model can be used to specifically test for convergent adaptation, and this approach is shown to be more powerful than previous methods that analyze pairs of populations separately. The application of our method to the detection of loci and pathways under selection reveals that many genes are under convergent selection in the American and Tibetan highlanders. Interestingly, we find that two specific pathways could have evolved to counter the toxic effects of hypoxia, which adds to previous evidence (e.g. EPAS1 and EGLN1 22) suggesting that human populations living at high altitude might have mainly evolved ways to limit the negative effects of normal physiological responses to hypoxia, and might not have had enough time yet to develop more elaborate adaptations to this harsh environment.
Supplemental Data Description
Supplemental Data includes three figures and three tables.
Web Resources
The URLs for data presented herein are as follows:
Affymetrix Genome-Wide Human SNP Array 6.0 description, http://www.affymetrix.com/estore/catalog/131533/AFFY/Genome-Wide+Human+SNP+Array+6.0#1_1
BayeScan version 2.1, http://cmpg.unibe.ch/software/BayeScan/
NCBI Biosystems, http://www.ncbi.nlm.nih.gov/biosystems
NCBI Entrez Gene, http://www.ncbi.nlm.nih.gov/gene
Online Mendelian Inheritance in Man (OMIM), http://www.omim.org/
PANTHER, http://www.pantherdb.org
simuPOP, http://simupop.sourceforge.net
STRING version 9.1, http://string-db.org
Acknowledgements
We thank Prof. Abigail Bigham for making the genetic data analyzed here available. This work has been made possible by Swiss NSF grants No. 3100A0-126074, 31003A-143393, and CRSII3_141940 to LE. OEG was supported by French ANR grant No 09-GENM-017-001 and by the Marine Alliance for Science and Technology for Scotland (MASTS). The program BayeScan3 used to analyze the data is available from MF upon request.
Appendix Gene set enrichment analysis method
To find signals of selection at the pathway level we applied a gene set enrichment approach as described by Daub et al. 45. This method tests whether the genes in a gene set show a shift in the distribution of a selection score. In our case we take as selection score sconv = 1-qconv, where qconv is the q-value of a SNP computed from the probability of convergent selection. The enrichment test requires one sconv value per gene, so we transformed the SNP-based scores into gene-based scores. We first downloaded 19,683 protein coding human genes, located on the autosomes and on the X chromosome, from the NCBI Entrez Gene website 72 (see Web Resources). Next we converted the SNPs to hg19 coordinates. A total of 670 SNPs could not be mapped, leaving 631,674 SNPs. These SNPs were assigned to genes: if a SNP was located within a gene transcript, it was assigned to that gene; otherwise it was assigned to the closest gene within 50kb distance. For each gene, we selected the highest sconv value of all SNPs assigned to this gene. After removing 2,411 genes with no SNPs assigned, a list of 17,272 genes remained.
We downloaded 2,402 gene sets from the NCBI Biosystems database 46 (see Web Resources). After discarding genes that were not part of the aforementioned gene list, removing gene sets with fewer than 10 genes and pooling (nearly) identical gene sets, 1,339 sets remained that served as input in our enrichment tests.
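The SNP-to-gene scoring rule described above can be illustrated with a minimal Python sketch. The function name `gene_scores`, the tuple layouts, and the toy coordinates are hypothetical. The sketch also assumes that a SNP lying inside several overlapping transcripts is assigned to all of them, a case the text does not specify.

```python
def gene_scores(snps, genes, max_dist=50_000):
    """Map SNP-level q-values to one s_conv score per gene.

    snps  : iterable of (chrom, position, q_conv)
    genes : iterable of (name, chrom, start, end)
    Returns {gene_name: highest s_conv among SNPs assigned to the gene}.
    Genes without any assigned SNP are absent from the result.
    """
    scores = {}
    for chrom, pos, q_conv in snps:
        s_conv = 1.0 - q_conv
        inside, nearest, nearest_dist = [], None, None
        for name, g_chrom, start, end in genes:
            if g_chrom != chrom:
                continue
            if start <= pos <= end:
                inside.append(name)  # SNP lies within the transcript
            else:
                dist = start - pos if pos < start else pos - end
                if dist <= max_dist and (nearest_dist is None or dist < nearest_dist):
                    nearest, nearest_dist = name, dist
        # Assumption: overlapping transcripts all receive the SNP.
        if inside:
            targets = inside
        elif nearest is not None:
            targets = [nearest]
        else:
            targets = []  # no gene within max_dist: SNP is not used
        for name in targets:
            scores[name] = max(scores.get(name, 0.0), s_conv)
    return scores


# Hypothetical toy data, for illustration only.
toy_genes = [("G1", "chr1", 1_000, 2_000), ("G2", "chr1", 100_000, 110_000)]
toy_snps = [
    ("chr1", 1_500, 0.20),   # inside G1
    ("chr1", 2_100, 0.05),   # 100 bp downstream of G1
    ("chr1", 60_000, 0.00),  # 40 kb upstream of G2, 58 kb from G1
    ("chr2", 1_500, 0.00),   # no gene on this chromosome
]
toy_scores = gene_scores(toy_snps, toy_genes)
```

This nested loop is quadratic in the number of SNPs and genes, so a genome-wide data set would call for an interval index instead. The sketch is only meant to make the assignment rule explicit.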
We computed the SUMSTAT 73 score for each set, which is the sum of the sconv values of all genes in a gene set. Gene sets with a high SUMSTAT score are likely candidates for convergent selection. To assess the significance of each tested gene set, we compared its SUMSTAT score with a null distribution of SUMSTAT scores from random gene sets (N=500,000) of the same size. We could not approximate the null distribution with a normal distribution as applied in Daub et al. 45, as random gene sets of small to moderate size produced a skewed SUMSTAT distribution. Taking the highest sconv score among SNPs near a gene can induce a bias, since genes with many SNPs are more likely to have an extreme value assigned. To correct for this possible bias we placed each gene in a bin containing all genes with approximately the same number of SNPs and constructed the random gene sets in the null distribution in such a way that they were composed of the same number of genes from each bin as the gene set being tested. To remove overlap among the candidate gene sets, we applied a pruning method in which we iteratively assigned overlapping genes to the highest scoring gene set. As these tests are no longer independent, we empirically estimated the q-value of these pruned sets. All sets that scored a q-value <5% (before and after pruning) were reported.
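The SUMSTAT test with a bin-matched null distribution can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used here: the function names, the `snp_bin` mapping from genes to SNP-count bins, and the default number of random sets are hypothetical, and the "+1" correction in the empirical p-value is a common convention that the text does not mention.

```python
import random
from collections import Counter, defaultdict


def sumstat(gene_set, score):
    """SUMSTAT: sum of the gene-level s_conv scores in a gene set."""
    return sum(score[g] for g in gene_set)


def binned_null_pvalue(gene_set, score, snp_bin, n_random=10_000, seed=0):
    """Empirical p-value of a gene set's SUMSTAT against bin-matched random sets.

    score   : {gene: s_conv} for every gene in the background list
    snp_bin : {gene: bin label}, genes in a bin have similar SNP counts
    Every gene of `gene_set` must be present in both mappings.
    """
    rng = random.Random(seed)
    genes_by_bin = defaultdict(list)
    for gene in sorted(score):
        genes_by_bin[snp_bin[gene]].append(gene)
    # Number of genes the tested set draws from each bin.
    bin_counts = Counter(snp_bin[g] for g in gene_set)
    observed = sumstat(gene_set, score)
    at_least = 0
    for _ in range(n_random):
        null_score = 0.0
        for bin_label, k in bin_counts.items():
            # Random genes from the same bin, sampled without replacement.
            null_score += sum(score[g] for g in rng.sample(genes_by_bin[bin_label], k))
        if null_score >= observed:
            at_least += 1
    # Assumption: add-one correction so the p-value is never exactly zero.
    return (at_least + 1) / (n_random + 1)


# Hypothetical toy background: 100 genes in one bin, five of them high scoring.
toy_score = {f"g{i}": (1.0 if i < 5 else 0.0) for i in range(100)}
toy_bin = {gene: 0 for gene in toy_score}
p_top = binned_null_pvalue({f"g{i}" for i in range(5)}, toy_score, toy_bin, n_random=1000)
p_low = binned_null_pvalue({f"g{i}" for i in range(50, 55)}, toy_score, toy_bin, n_random=1000)
```

Matching the bin composition of each random set to that of the tested set keeps genes with many SNPs from inflating the score, which is the bias the binning is designed to remove.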