Summary
Traditional genome-wide scans for positive selection have mainly uncovered selective sweeps associated with monogenic traits. While selection on quantitative traits is much more common, very few signals have been detected because of their polygenic nature. We searched for positive selection signals underlying coronary artery disease (CAD) in worldwide populations, using novel approaches to quantify relationships between polygenic selection signals and CAD genetic risk. We identified candidate adaptive loci that may have been directly modified by disease pressures given their significant associations with CAD genetic risk. Top candidates were consistently associated with reproductive traits, suggesting antagonistic-pleiotropic tradeoffs with early-life phenotypes, and also showed more evidence of gene regulatory effects in HapMap3 lymphoblastoid cell lines than non-adaptive candidates. Our study provides a novel approach for detecting selection on polygenic traits and evidence that modern human genomes have evolved in response to selection pressures induced by CAD and by other early-life traits sharing pleiotropic links with CAD.
Highlights
Widespread genomic signals of positive selection are present at coronary artery disease (CAD) loci
Selection peaks that are significantly associated with genetic risk suggest loci modified (in)directly by CAD
Selection was more often associated with variants important for regulating gene expression
CAD loci share many pleiotropic links with early-life traits suggesting antagonistic effects
Introduction
It is well established that modern human traits are a product of past evolutionary forces that have shaped heritable phenotypic and molecular variation, but we are far from understanding what diseases have driven natural selection and how this process has left its imprint across the genome. Although many recent genome-wide multi-population scans have searched for signatures of positive selection [1-9], these studies have detected few signals of selection on candidate loci associated with traits or diseases [10-12]. This suggests that classic ‘selective sweeps’ have been relatively rare in recent human history [13, 14] and that the tools currently used miss most of the smaller selection signals caused by diseases associated with polygenic traits [12]. This limits our understanding of how natural selection has acted on variation underlying complex diseases. In this study, we aimed to comprehensively identify positive selection signals underlying coronary artery disease (CAD) loci with methods designed to detect signals of recent positive selection. We also compared quantitative selection signals in 12 worldwide populations (HapMap3) with patterns of disease risk to identify signals of selection linked to CAD pressure.
Classic population genetics theory describes positive selection with the selective-sweep (or hard-sweep) model, in which a strongly advantageous mutation increases rapidly in frequency (often to fixation), resulting in reduced heterozygosity of nearby neutral polymorphisms due to genetic hitch-hiking [15, 16] and a longer haplotype with higher frequency. Many methods have been developed to detect these signatures [17, 18], including traditional tests that detect differentiation in allele frequencies among populations (e.g. Wright’s fixation index, Fst [19]) and more recently developed within-population tests for extended haplotype homozygosity (e.g. the integrated haplotype score, iHS [9]). Some of the most convincing examples of human adaptive evolution have been uncovered for traits influenced by single loci with large effects. For example, the lactase persistence (LCT) and Duffy-null (DARC) mutations affecting expression of key proteins in milk digestion [10] and malarial resistance [20] both display hallmarks of selective sweeps. Other loci that are not clearly monogenic but also show selective sweeps are associated with high-altitude tolerance (EPAS1 [21]) and skin pigmentation (SLC24A5 and KITLG [22]). These previous studies showed that rapid selective sweeps occurred around loci where alleles that were previously rare or absent in populations had large effects on phenotypes.
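As a concrete illustration of the differentiation-based tests mentioned above, the following minimal sketch computes Wright's fixation index for a single biallelic SNP from per-population allele frequencies, using the textbook definition Fst = (H_T - H_S) / H_T. The function name and the equal weighting of populations are assumptions made for illustration; the estimators used in the cited studies may differ (for example, by correcting for sample size).

```python
def wright_fst(allele_freqs):
    """Fst for one biallelic SNP from per-population allele frequencies.

    Illustrative assumption: all populations are weighted equally.
    """
    if not allele_freqs:
        raise ValueError("need at least one population")
    n = len(allele_freqs)
    p_bar = sum(allele_freqs) / n
    # Expected heterozygosity of the pooled (total) population.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    # Mean expected heterozygosity within the subpopulations.
    h_within = sum(2.0 * p * (1.0 - p) for p in allele_freqs) / n
    if h_total == 0.0:
        return 0.0  # SNP is monomorphic in every population
    return (h_total - h_within) / h_total
```

Identical frequencies in all populations give a value of zero, while populations fixed for alternative alleles give a value of one.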
Motivated by these initial successes and the increasing availability of global population data genotyped on higher resolution arrays (e.g. HapMap Project, 1000 Genomes Project), many genome-wide scans for candidate adaptive loci have recently been performed [11]. These studies suggest that selection may have operated on a variety of biological processes [10] in ways that differ among populations (i.e. local adaptation) [23], has been prevalent in genetic variation linked to metabolic processes [24], and may have often targeted intergenic regions and gene regulatory variants rather than protein-coding regions [12]. However, only the larger signals underlying monogenic traits are typically captured due to the lack of statistical power imposed by the need to correct for genome-wide multiple testing [18]. Most of these candidates are also not yet convincing: studies that utilized the same data have produced inconsistent results [14], many candidates cannot be validated because biological or functional information is absent [25, 26], and selective sweeps may actually have been rare in human populations [27, 28].
In contrast to population genetics, in quantitative genetics rapid adaptation typically involves selection acting on quantitative traits that are highly polygenic [29, 30]. Under the ‘infinitesimal (polygenic) model’, such traits are likely to respond quickly to changing selective pressures through smaller frequency shifts in many polymorphisms already present in the population [13, 31]. Such alleles would not necessarily sweep to fixation, would produce smaller changes in surrounding heterozygosity, and would thus be hard to detect with most current population genetic methods [14, 26, 32]. Note that polygenic and classic sweep models are not mutually exclusive [13, 33], because alleles with small and large effects may both underlie a polygenic trait. Thus the degree to which candidate alleles will be detectable after a selective event will vary. Given that most common diseases are highly polygenic [34], this suggests a need to improve how we detect and understand adaptive signatures in the loci associated with polygenic traits.
Recent selection studies investigating polygenic traits have taken two approaches. The first scans for significant selection signals within genome-wide significant disease effect SNPs. For example, Ding and Kullo [35] found significant population differentiation (Fst) for 8 of 158 index SNPs underlying 36 cardiovascular disease phenotypes, and Raj et al. [36] observed elevated positive selection scores (Fst, iHS) for 37 of 416 index susceptibility SNPs underlying 10 inflammatory diseases. The second approach tests if aggregated shifts in genome-wide significant allele frequencies are associated with phenotypic differences by population, latitudinal, or environmental gradients, which might indicate local adaptation. For example, Castro and Feldman [37] used 1300 index SNPs underlying many polygenic traits and found elevated adaptive signals (Fst and iHS) above background variation, and Turchin et al. [38] demonstrated moderately higher frequencies of 139 height-increasing alleles in Northern (taller) compared to Southern (shorter) European populations. These approaches all assume that the variants with the most significant p values are the most probable selection targets, but many if not most such variants are tagging tested or untested causal variants, which may themselves be of lower frequencies. This suggests that an approach sensitive to more subtle signals of selection and disease risk is needed for polygenic selection.
We chose CAD as a model for examining polygenic selection signals underlying complex disease because it has imposed (and continues to impose) considerable disease burden (selection pressure) in humans [39], its underlying genetic architecture has been extensively studied [40, 41] and many of its risk factors (cholesterol, blood pressure) have been under recent natural selection [42] related to potential pleiotropic effects or tradeoffs with CAD. Antagonistic pleiotropy describes gene effects on multiple linked traits, where selection on one trait may cause fitness tradeoffs (e.g. disease, survival) in another due to their negative genetic association [43]. Two common misconceptions are that CAD has an exclusively late age of onset and that it only occurs at appreciable frequency in contemporary humans. If that were true, selection might not have had either the opportunity or sufficient time to affect genetic variation associated with CAD. However, CAD manifests early in life [45, 46] and can be detected even in adolescence through degree of atherosclerosis [46, 47] and myocardial infarction events [48]. CAD is also a product of many heritable risk factors (cholesterol, weight, blood pressure) whose variation is expressed during the reproductive period, when CAD could drive selection directly or indirectly. Furthermore, CAD has impacted human populations since at least the ancient Middle Kingdom period, with studies finding the presence of atherosclerosis in Egyptian mummies [49]. This suggests that there has been enough time for genomic signatures of selection related to CAD to develop and be detectable in modern humans.
By combining several 1000 Genomes-imputed datasets including HapMap3 and Finnish SNP data, a large genetic meta-analysis of CAD, and HapMap3 gene expression data, we sought to address the reason(s) why CAD exists in humans by answering the following questions: 1) Has selection recently operated on CAD loci? 2) How do selection signals underlying CAD loci vary among populations and are they enriched for gene regulatory effects? 3) Do candidate adaptive signatures overlap directly with CAD genetic risk and is this useful for highlighting disease-linked selection signals? 4) Do CAD-linked selection signals display functional effects and evidence of antagonistic pleiotropy, in that they are also linked to biological processes or traits influencing reproduction?
Results
To test for selection signals for variants directly linked with CAD, we utilized SNP summary statistics from 56 genome-wide significant CAD loci in Nikpay et al. [41], the most recent and largest CAD case-control GWAS meta-analysis to date, to identify 76 candidate genes for CAD (Experimental Procedures). Nikpay et al. used 60,801 CAD cases and 123,504 controls from a mix of individuals of mainly European (77%), South Asian (13%, India and Pakistan), East Asian (6%, China and Korea), and Hispanic and African American (∼4%) descent, with genetic variation imputed to high density using the 1000 Genomes reference panel. By investigating all SNPs in candidate CAD genes, we aimed to improve detection of smaller polygenic selection signals for the range of functional genic variants and short-range intergenic regulatory variants that would be missed with approaches that only consider genome-wide significant SNPs.
Signals of positive selection within coronary artery disease loci
We used the integrated Haplotype Score (iHS) to estimate positive selection for each SNP underlying candidate CAD genes within each population separately. Because iHS is typically used to detect candidate adaptive SNPs where the selected alleles may not have reached fixation [9], this estimate is better suited than other measures for detecting recent signals of selection [18]. iHS is also better suited for detecting selection acting on standing variation in polygenic traits [18, 50].
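To make the score more concrete, the sketch below follows the commonly described construction of iHS: an unstandardized score ln(iHH_A / iHH_D), where iHH_A and iHH_D are the integrated haplotype homozygosity values around the ancestral and derived alleles, is standardized against SNPs of similar derived allele frequency. This is a minimal sketch under assumptions, not the pipeline used in this study: the function names, the 0.05 bin width, and the handling of sparse bins are illustrative choices, and the iHH inputs are taken as given rather than computed from phased haplotypes.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def unstandardized_ihs(ihh_ancestral, ihh_derived):
    """Log ratio of integrated haplotype homozygosity (ancestral / derived)."""
    return math.log(ihh_ancestral / ihh_derived)


def standardize_ihs(raw_scores, derived_freqs, bin_width=0.05):
    """Standardize raw scores within derived-allele-frequency bins.

    bin_width is an assumed value. SNPs in bins with fewer than two
    members, or with zero variance, are left as NaN.
    """
    bins = defaultdict(list)
    for idx, freq in enumerate(derived_freqs):
        bins[int(freq / bin_width)].append(idx)

    standardized = [float("nan")] * len(raw_scores)
    for members in bins.values():
        if len(members) < 2:
            continue
        values = [raw_scores[i] for i in members]
        mu, sd = mean(values), stdev(values)
        if sd == 0.0:
            continue
        for i in members:
            standardized[i] = (raw_scores[i] - mu) / sd
    return standardized
```

Large absolute standardized values then mark SNPs whose haplotypes are unusually long relative to other SNPs at a similar frequency, which is why the score remains informative when the selected allele has not reached fixation.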
Candidate selection signals were found for many of the 76 CAD genes within each of the 12 worldwide populations (11 HapMap3 populations and Finns; Fig. 1A for top 40 based on their association with CAD log odds genetic risk, Fig. S1 for all 76). These were defined as ‘peaks’ of significantly elevated iHS scores across SNPs within each gene-population combination, with the apex approximating the likely positional target of positive selection.
All 76 genes are shown ranked according to Fig. 1B. Boxes show magnitude and significance of largest positive selection signal (integrated haplotype score, iHS) within each gene-population combination. P values (circles within squares) were obtained from 10000 permutations. Bonferroni corrected p value limit also shown (α=0.05/76=0.000657) with closed circles. Populations. Grouped by common ancestry, African (ASW, African ancestry in Southwest USA; MKK, Maasai in Kinyawa, Kenya; YRI, Yoruba from Ibadan, Nigeria; LWK, Luhya in Webuye, Kenya), East-Asian (CHB, Han Chinese subjects from Beijing; CHD, Chinese in Metropolitan Denver, Colorado; JPT, Japanese subjects from Tokyo), European (CEU, Utah residents with ancestry from northern and western Europe from the CEPH collection; TSI, Tuscans in Italy; FIN, Finnish in Finland), GIH (Gujarati Indians in Houston, TX, USA), MEX (Mexican ancestry in Los Angeles, CA, USA).
The 40 of 76 CAD genes investigated are shown that have at least four significant selection-risk associations in Panel B across all 12 populations. Panel A. Magnitude and significance of largest positive selection signal (integrated haplotype score, iHS) within each gene-population combination. P values (circles within squares) were obtained from 10000 permutations. Bonferroni corrected p-value limit also shown (α=0.05/76=0.000657) with closed circles. Panel B. Null hypothesis: no association between CAD genetic risk and positive selection, tested using mixed effects model with SNP estimates of CAD log odds genetic risk and iHS while accounting for gene LD structure as a random effect (first eigenvector from LD matrix per gene). Scaled regression coefficients were obtained directly from regressions, each p value from 10000 permutations. Panel C. Null hypothesis: association between genetic risk and positive selection for SNPs within CAD genes no different than non-CAD associated genes. Permuted p values were estimated by comparing each p value in Panel B against 100 nominal p values obtained by randomly choosing (without replacement) 100 non-CAD associated genes of similar size across the genome and using the same mixed effects model setup as described above. Populations. Grouped by ancestry, African (ASW, African ancestry in Southwest USA; MKK, Maasai in Kinyawa, Kenya; YRI, Yoruba from Ibadan, Nigeria; LWK, Luhya in Webuye, Kenya), East-Asian (CHB, Han Chinese subjects from Beijing; CHD, Chinese in Metropolitan Denver, Colorado; JPT, Japanese subjects from Tokyo), European (CEU, Utah residents with ancestry from northern and western Europe from the CEPH collection; TSI, Tuscans in Italy; FIN, Finnish in Finland), GIH (Gujarati Indians in Houston, TX, USA), MEX (Mexican ancestry in Los Angeles, CA, USA).
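The significance criteria quoted in these captions can be sketched as follows. The Bonferroni limit is simply the nominal level divided by the number of genes tested. The permutation scheme itself is not detailed in this section, so the second function only shows a generic way to turn an observed statistic and a set of permuted statistics into an empirical p value; the function names and the add-one correction are assumptions for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance limit after Bonferroni correction."""
    return alpha / n_tests


def empirical_p_value(observed, permuted_stats):
    """One-sided permutation p value.

    Counts permuted statistics at least as large as the observed one.
    The add-one correction (an assumption here) keeps the estimate
    above zero for a finite number of permutations.
    """
    n_extreme = sum(1 for stat in permuted_stats if stat >= observed)
    return (n_extreme + 1) / (len(permuted_stats) + 1)


# 0.05 / 76 is about 0.00066 (the captions report 0.000657).
limit = bonferroni_threshold(0.05, 76)
```

With 10000 permutations, the smallest p value such a scheme can return is roughly 0.0001, which is below the Bonferroni limit for 76 genes.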
In the sample of all populations (Fig. 1A, largest iHS scores), most candidate selection signals were relatively small, but a few larger signals were detected. For example, out of the 912 gene-by-population combinations (Fig. S1), 354 (38%) contained weak-moderate candidate selection signals (significant iHS between 2-3), 84 (9%) contained moderate-strong signals (significant iHS between 3-4), and 6 (0.6%) had very strong signals (significant iHS > 4). The 6 largest selection signals were found in the following gene-population combinations: BCAS3 in GIH (iHS=4.45), MEX (iHS=4.23) and CEU (iHS=4.86), PEMT in MKK (iHS=4.24), ANKS1A in LWK (iHS=4.03), and CXCL12 in JPT (iHS=4.10), with all iHS p values <0.0001. Six genes (BCAS3, SMG6, PDGFD, KSR2, SMAD3, HDAC9) exhibited candidate selection signals consistently within all populations (Fig. 1A), and many genes also contained consistent selection signals for all populations within similar ancestral groups (e.g. African, European etc, Fig. 1A).
Within CAD genes, multiple candidate selection signals were sometimes present (particularly within larger genes, within separate linkage disequilibrium (LD) blocks); these varied between and sometimes within populations. For example, in PHACTR1 (∼0.57 Mb in size, 14 introns) there were three main candidate selection signals in introns 4, 7 and 11 (see Fig. S2, comparing cross-population selection signals in PHACTR1) that were in separate LD blocks (see Fig. 3C, LD plots). Within most populations, there was a broad and relatively weak set of candidate selection signals in intron 4 (the largest PHACTR1 intron, ∼300 kb in length). Intron 4 is also the location of the published CAD index SNP (rs9369640) for PHACTR1. Three of the African populations had their highest iHS score at the same SNP in intron 4 (rs8180558): ASW (iHS=2.4, P<0.05), LWK (iHS=2.8, P<0.01) and YRI (iHS=2.2, P<0.05). This SNP is ∼18 kb upstream from the index CAD SNP (r² between rs8180558 and rs9369640 in PHACTR1: ASW=0.12; LWK=0.03; YRI=0.04). Peaks of PHACTR1 selection signals within the three East Asian populations were at rs4715043 in CHB (iHS=2.3, P<0.05) and rs6924689 in both CHD (iHS=2.9, P<0.01) and JPT (iHS=3.0, P<0.01). The GIH population contained the largest selection signal, also in intron 4, with an apex at rs4142300 (iHS=3.7, P<0.001; 75 kb downstream of, and r²=0.07 with, the index CAD SNP rs9369640). TSI had the same apex SNP in intron 4, though the TSI signal was weaker and non-significant (rs4142300, iHS=1.84); rs4142300 was also close to the apex SNP in CEU (rs9349350, iHS=2.0, P<0.05, r²=0.92) and MEX (rs2015764, iHS=2.1, P<0.05, r²=0.30). Other significant candidate selection signals were also present in intron 7 for three of the African populations (ASW, LWK, MKK) and for the CHD and GIH populations, with the largest intron 7 signal within MKK (SNP rs13191209, iHS=3.0, P<0.001).
The last significant candidate selection signal within PHACTR1 was found within intron 11 with the largest signal at rs9349549 (MKK iHS=2.9, P<0.01; CEU iHS=2.7, P<0.01; TSI iHS=3.0, P<0.01). Other interesting candidate selection signals present in other CAD genes (Fig. S1) are not discussed here. Such patterns suggest that candidate selection signals are complex and often do not correspond to the alleles with largest effect on CAD.
Per-SNP integrated Haplotype Scores (iHS) plotted by chromosome position within PHACTR1 (including LD plots below each) for 12 worldwide populations. Permuted p value significance for each score coded by color (grey, non-significant; orange, p < 0.05). Red dashed line indicates position of index SNP for PHACTR1. Grey columns in background represent intron spans. Populations are clustered by common ancestry, African (ASW, African ancestry in Southwest USA; MKK, Maasai in Kinyawa, Kenya; YRI, Yoruba from Ibadan, Nigeria; LWK, Luhya in Webuye, Kenya), East-Asian (CHB, Han Chinese subjects from Beijing; CHD, Chinese in Metropolitan Denver, Colorado; JPT, Japanese subjects from Tokyo), European (CEU, Utah residents with ancestry from northern and western Europe from the CEPH collection; TSI, Tuscans in Italy; FIN, Finnish in Finland), GIH (Gujarati Indians in Houston, TX, USA), MEX (Mexican ancestry in Los Angeles, CA, USA).
Relationship between CAD genetic risk and selection across populations
For each CAD gene within each population, we used a mixed effects linear model to regress SNP-based estimates of CAD log odds genetic risk (ln(OR), obtained from cardiogramplusc4d.org) against iHS selection scores (Experimental Procedures). We accounted for LD structure by including the first eigenvector from an LD matrix of correlations (r²) between SNPs within each gene as a random effect.
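The structure of this per-gene test can be illustrated with a deliberately simplified stand-in. The sketch below extracts the first eigenvector of a gene's LD matrix by power iteration and then fits an ordinary least-squares regression of |ln(OR)| on |iHS| with that eigenvector as a fixed covariate. This is not the mixed effects model described above, in which the eigenvector enters as a random effect; the choice of ln(OR) as the response, the use of absolute values (following the figure captions), and all function names are assumptions for illustration.

```python
import math


def first_eigenvector(matrix, n_iter=500):
    """Leading eigenvector of a symmetric non-negative matrix (power iteration)."""
    n = len(matrix)
    vec = [1.0] * n
    for _ in range(n_iter):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - rest) / aug[r][r]
    return x


def selection_risk_slope(ihs, ln_or, ld_r2):
    """OLS slope of |ln(OR)| on |iHS|, adjusting for the first LD eigenvector.

    Simplification: the eigenvector is a fixed covariate here,
    not a random effect as in the model described in the text.
    """
    eig = first_eigenvector(ld_r2)
    rows = [[1.0, abs(s), e] for s, e in zip(ihs, eig)]
    y = [abs(v) for v in ln_or]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    _intercept, slope, _ld_coef = solve_linear(xtx, xty)
    return slope
```

In practice the observed slope would be compared with slopes from permuted data to obtain a p value, and a dedicated mixed-model library would replace the hand-written least-squares step.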
For a subset of CAD loci, we found significant quantitative associations between disease risk and selection signals, and for each of these the direction of the association was often consistent between populations (Fig. 1B). Furthermore, when compared to a null distribution of genes selected randomly from the genome, the strength of the association between CAD log odds and selection signal was statistically significant at most loci (Fig. 1C). Fig. 1B shows 40 genes ranked by the number of significant associations across the 12 populations, excluding genes with fewer than four significant associations. Positive and negative associations indicate elevated selection signals present in regions with higher or lower CAD log odds genetic risk, respectively.
In the comparison across populations, directionality of significant selection-risk associations tended to be most consistent for populations within the same ancestral group (Fig. 1B). For example, in PHACTR1, negative associations were present within all European populations (CEU, TSI, FIN), and in NT5C2 strong positive associations were present in all East Asian populations (CHB, CHD, JPT). Other negative associations that were consistent across all populations within an ancestry group included five genes in Europeans (COG5, ABO, ANKS1A, KSR2, FLT1) and four genes (LDLR, PEMT, KIAA1462, PDGFD) in East Asians.
Additional consistent positive associations included four genes (CNNM2, TEX41, NT5C2, MIA3) in East Asians, three (BCAS3, RAI1, KCNK5) in Europeans, and one (PPAP2B) in Africans. In comparison to other ancestral groups, African populations showed fewer significant selection-risk associations (27.9% of all 76-gene x 12-population combinations) than Asians (31.5%) or Europeans (32.8%). Some associations were consistent in all but one population (e.g. CNNM2, ABCG8 in Europeans; BCAS3, KCNK5 in Asians; CNNM2, TEX41 in Africans) or unique to one population within an ancestral group (e.g. TEX41 in FIN, COG5 in ASW).
Below we focus on BCAS3 (Fig. 2) and PHACTR1 (Fig. 3), two of the strongest selection-risk associations which, when adjusting for LD (Experimental Procedures), displayed varying directionality between at least two populations.
A. Correlation between selection signals (iHS) and coronary artery disease (CAD) log odds genetic risk (ln(OR)), both represented as absolute values. Red line/upper right value, β from mixed effects regression. B. Base pair positional comparison of selection signals and CAD genetic risk across BCAS3. Blue points, CAD log odds values; grey (non-significant) and orange (significant) points, iHS scores. Horizontal bar shows BCAS3 gene (and intron) span and location of lead index SNP. Blue/orange lines are smoothed lines estimated with loess function in R. C. LD plots, r². Populations: CEU, Utah residents with ancestry from northern and western Europe from the CEPH collection; YRI, Yoruba from Ibadan, Nigeria.
Genetic risk of CAD vs positive selection in BCAS3
The genetic risk of CAD for variants in BCAS3 was positively correlated with an extremely large candidate adaptive signal in all European and two of three East Asian populations (Fig. 1B). For example, in CEU the largest iHS score was 4.85 and highly significant, and iHS was elevated across most of BCAS3 (Fig. 2B CEU, spanning introns 1-18 and various LD-blocks, Fig. 2C), which matched the approximate trends in CAD log odds, giving rise to a highly significant positive correlation (Fig. 2A CEU). In contrast, in YRI there was no detectable selection signal close to the index SNP (Fig. 2B YRI), but weak-moderate signals were present towards the end of BCAS3 (Fig. 2B YRI, introns 18-19, smaller LD-blocks Fig. 2C), which also corresponded with lower CAD log odds (Fig. 2B, YRI), thus giving rise to a significant negative correlation in Fig. 2A.
Genetic risk of CAD vs positive selection in PHACTR1
For all European populations, PHACTR1 (see CEU example, Fig. 3A) selection peaks were typically located within regions of consistently lower CAD log odds (Fig. 3B). This contrasted with most non-European populations, where the highest candidate selection peaks were located within regions with elevated CAD log odds (including the index CAD SNP rs9369640, intron 4). The largest selection peak in GIH (Fig. 3B) overlapped the CAD log odds peak in PHACTR1, giving rise to the strong positive association seen in Fig. 3A. The two distinctive selection peaks in both CEU and GIH were separated by different LD-blocks (Fig. 3C), suggesting that these may have developed independently within PHACTR1. Interestingly, the negative association found for the MKK population was due to the location of the selection peaks more closely matching those of the European populations in intron 11 (Fig. S2).
A. Correlation between selection signals (iHS) and coronary artery disease (CAD) log odds genetic risk (ln(OR)), both represented as absolute values. Red line/upper right value, β from mixed effects regression. B. Base pair positional comparison of selection signals and CAD genetic risk across PHACTR1. Blue points, CAD log odds values; grey (non-significant) and orange (significant) points, iHS scores. Horizontal bar shows PHACTR1 gene (and intron) spans and location of index SNP if present. C. LD plots, r². Populations: CEU, Utah residents with ancestry from northern and western Europe from the CEPH collection; GIH, Gujarati Indians in Houston, TX, USA.
Enrichment of gene regulatory variants under selection at CAD loci
To establish whether variants with evidence of selection in CAD genes also showed evidence of function, we performed an eQTL scan in 8 HapMap3 populations with matched lymphoblastoid cell line (LCL) gene expression. We compared all SNPs in each CAD locus against expression for each focal gene within each population. We found that SNPs with significant integrated Haplotype Scores (iHS) were more often associated with gene expression than SNPs with non-significant selection scores (Fig. 4, Kolmogorov-Smirnov test p value <0.001). To assess which biological pathways were enriched among the highest-ranked genes according to Fig. 1B, i.e. those where selection scores were most closely associated with CAD log odds genetic risk, we submitted the top 10 genes to the Enrichr analysis tool [51] and found that these genes are especially enriched in pathways related to metabolism, focal adhesion and transport of glucose and other sugars. More interestingly, we found connections to reproductive phenotypes in the associations of these genes with pathways, ontologies, cell types and transcription factors. For example, we found links to ovarian steroidogenesis and to genes expressed in specific cell types and tissues including the ovary, endometrium and uterus (see Table S4 for Enrichr outputs).
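The comparison of eQTL p value distributions can be sketched with a two-sample Kolmogorov-Smirnov statistic, which is the largest vertical gap between the two empirical cumulative distribution functions. The sketch below is an illustration under assumptions: the function names are hypothetical, and significance is assessed by shuffling group labels rather than by the analytic KS distribution that a statistics package would normally use.

```python
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d_max = 0.0
    for value in sorted(set(a) | set(b)):
        # Advance each pointer past all observations <= value.
        while i < len(a) and a[i] <= value:
            i += 1
        while j < len(b) and b[j] <= value:
            j += 1
        d_max = max(d_max, abs(i / len(a) - j / len(b)))
    return d_max


def ks_permutation_p(sample_a, sample_b, n_perm=2000, seed=0):
    """Permutation p value for the KS statistic (labels shuffled at random)."""
    rng = random.Random(seed)
    observed = ks_statistic(sample_a, sample_b)
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:n_a], pooled[n_a:]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Here `sample_a` would hold the eQTL p values of SNPs with a significant selection signal and `sample_b` those of the matched SNPs without one.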
Summary distribution of permuted eQTL p values for SNPs with (left) or without (right) a significant selection signal. SNPs with a significant selection signal (iHS) were chosen by taking the largest significant positive selection signal (if one was present) within each gene-population combination. The same number of SNPs without a significant selection signal were also randomly drawn across all gene-population combinations for comparison. These SNPs were used in an eQTL analysis where they were regressed (including gender as a covariate) against their associated gene probe’s expression.
Discussion
This study has identified many candidate adaptive signals, which suggests that selection on CAD loci is much more widespread than previously appreciated (also see Supplementary Discussion). It has previously been suggested [12] and demonstrated [52] that selection on gene expression levels has been an important element of human adaptation in general. We confirm this result for CAD-associated loci: SNPs with significant positive selection signals within CAD loci were more likely than SNPs without such signals to be associated with gene expression levels in cis (Fig. 4).
We found evidence that some of these signals may be a result of selection pressures induced directly by CAD itself. This finding is important for highlighting genes that may have been modified directly by selection on disease phenotypes and also for our general understanding of how quickly human genomes can respond to selection induced by changing environments. Subsequent biological process analyses and a thorough literature assessment (below) demonstrated that the loci most consistently associated with CAD genetic risk are also often linked to human reproduction, which suggests both their potential to respond to natural selection and their possible role via antagonistic pleiotropy in the reproductive tradeoffs that would help to explain why CAD exists in human populations.
Coronary artery disease-induced changes to human genomes
One of our most interesting findings was the significant association between selection signals and CAD log odds genetic risk. This approach of integrating genome scans of positive selection with genome-wide genotype-phenotype data has been promoted previously as a tool to uncover biologically meaningful selection signals of recent human adaptation [12, 52] but has rarely been applied. Among the exceptions, Jarvis et al. [55] found a cluster of selection and association signals coinciding on chromosome 3 that included genes DOCK3 and CISH, which are known to affect height in Europeans.
For highly-ranked genes (according to the number of significant associations present within the 12 populations) in Fig. 1B such as BCAS3, CNNM2, TEX41, SMG6 and PHACTR1, the consistent overlap between selection and genetic risk of CAD suggests that many of these may have been modified by CAD-linked selective pressures. If so, then two conditions must have been met. Firstly, CAD was present for long enough to be involved in these genetic alterations, an evolutionary process which generally takes thousands of years. Indeed, precursors of CAD (i.e. atherosclerosis) are detectable in very early civilizations [49]. Secondly, the effects of CAD were directly or indirectly expressed during the reproductive period and trait variation was under natural selection due to its effects on reproductive success.
It is only possible for natural selection to directly act on CAD if those outcomes modify individual fitness relative to others in the same population. As outlined in the introduction, this is possible as CAD outcomes (i.e. myocardial infarction) do occur in young adults. However, early-life CAD outcomes are relatively rare, suggesting selection is more likely to operate indirectly on CAD via its risk factors (or other pleiotropically linked traits, discussed below), which provides a more likely explanation for the close associations we found between positive selection and genetic risk. Supporting this, phenotypic selection has been found operating on CAD risk factors [42], suggesting that these selection pressures are still present in modern humans.
Some genes had large signals of selection but showed weak or no consistent overlap with CAD genetic risk. For example, HDAC9 (Histone Deacetylase 9) shows extensive evidence for having undergone recent selection within most populations, especially those of European or Mexican descent, but little or no overlap with CAD risk was evident in most populations. This suggests positive selection has operated on this gene due to its effects on a trait unrelated to CAD, which may not be surprising given HDAC9’s broad biological roles (as a transcriptional regulator and in cell-cycle progression) and association with other very different phenotypes including ulcerative colitis [57] and psychiatric disorders [58]. This further demonstrates that this approach is useful for separating candidate selection signals important for the disease or phenotype of interest from those that are not.
Pleiotropic effects that establish the genetic foundations of tradeoffs
To further investigate whether top candidate adaptive loci for CAD modify fitness or share pleiotropic links with other traits that may modify fitness, we performed an extensive systematic literature search on the 40 top-ranked genes in Fig. 1 and a random set of 20 genes. If they have been under selection recently, they might still be associated with reproductive variation (i.e. fitness) in modern environments. We found that all 40 CAD genes shared at least one (often more) connection with fitness (Table S1-S2). Some appear to directly influence fitness (offspring number, age at menarche, menopause, survival), while many were associated with early-life reproductive traits that are likely to indirectly correlate with fitness including variation in ability to fertilize/conceive or fetal growth, development and survival. To test the novelty of this, we randomly chose 20 genes that were approximately the same size as the top 20 genes in Fig. 1. We only found three (out of 20) random genes with at least one potential link with fitness (Table S3). This suggests there are unique pleiotropic links between CAD and traits that have likely been under selection earlier in life.
Evidence for direct links between CAD genes and fitness (Table S1-S2) included genes associated with reproductive (PPAP2B, [59]) or twinning (SMAD3, [60]) capacity and number of offspring produced (e.g. KIAA1462, [61], SLC22A5, [62]). PHACTR1, LPL, SMAD3, ABO and SLC22A5 may contribute to reproductive timing (menarche, menopause) in women [63-65] and animals [66]. Expression of PHACTR1 [67], KCNK5 [68], MRAS and ADAMTS7 [69] appears to regulate lactation capacity. Some gene deficiencies also cause pregnancy loss (e.g. LDLR, [70], COL4A2, [71]). Evidence for antagonistic links was much more common. For instance, 25 genes shared links with traits expressed during pregnancy (Table S1-S2), i.e. variation that can negatively influence the health and survival outcomes of both the fetus and mother [72]. For example, a variant of CDKN2B-AS1 significantly contributes to risk of fetal growth restriction [73], both FLT1 [74] and LPL [75] are significantly differentially expressed in placental tissues from pregnancies with intrauterine growth restriction (IUGR) and preeclampsia, and LDLR-deficient mice had litters with significant IUGR [76]. A further 29 and 19 genes were linked to traits that can directly influence female and male fertility, respectively (13 influence both) (Table S1-S2). For example, BCAS3 and PHACTR1 are highly expressed during human embryogenesis [77, 78], SWAP70 is intensely expressed at the site of implantation [79], and PHACTR1 may play a role in receptivity to implantation [80]. For ABCG8 and KSR2, animal models provide further support, as gene expression deficiency can cause infertility in females (ABCG8, [81]) and males (KSR2, [82]).
Pleiotropic connections were also apparent in the classification of specific disorders or from studies investigating single-gene effects. For example, women with polycystic ovarian syndrome (PCOS) have higher rates of infertility due to ovulation failure and modified cardiovascular disease risk factors (i.e. diabetes, obesity, hypertension [83]). A number of CAD genes in this study (e.g. PHACTR1, LPL, PDGFD, IL6R, CNNM2) are differentially expressed in women with PCOS [84-88], suggesting possible links between perturbed embryogenesis and angiogenesis. In males, this can be demonstrated with a mutation in SLC22A5 that causes both cardiomyopathy and male infertility due to an altered ability to break down lipids [89, 90]. More generally, many recent studies link altered cholesterol homeostasis with fertility, which is most apparent in patients suffering from hyperlipidemia or metabolic syndrome [91, 92].
To facilitate interpretation of selection occurring on early-life traits or CAD phenotypic risk factors that share pleiotropic connections and possible evolutionary tradeoffs with coronary artery disease, we present a conceptual figure (Fig. 5). These pleiotropic effects are important because many of them affect traits expressed early in life, some extremely early in life. Any allele that increases reproductive performance enough early in life to more than compensate for a loss of associated fitness late in life will be selected [43]. Such a mechanism has been recently suggested to help explain the maintenance of polymorphic disease alleles in modern human populations [93]. Some previous studies have tested for such tradeoffs in humans using direct fitness-related phenotypes (e.g. [44]) although evidence for such a mechanism influencing human disease is currently lacking. Our approach examining antagonistic fitness effects for disease genes that displayed consistent selection-genetic risk associations in diverse worldwide populations provides support for such a mechanism influencing CAD. Here we have presented multiple cases in which such antagonistic pleiotropy appears to be present for genes associated with CAD, which may help to explain our vulnerability to the disease.
As a simple example, antagonistic pleiotropy (AP) describes a gene’s effects on two traits (pleiotropy) that oppositely (antagonistically) affect individual fitness at different ages. Whether selection favors a gene that confers a fitness advantage and a disadvantage at different ages depends on the size and timing of the effects. An advantage during the ages with the highest probability of reproduction (between ∼20-45 years of age in humans) would increase fitness (lifetime reproductive success) more than a similarly sized disadvantage at later ages would decrease it. This concept is part of the well-known evolutionary theory of ageing, which describes tradeoffs in energy invested into growth, reproduction and survival [105]. In Fig. 5, intense natural selection occurring on CAD loci as a result of fitness advantages (+ signs, red text callout box 1.) conferred by genetically correlated risk factors (‘CAD risk factors’ box) or early-life traits (‘early-life traits’ box) trades off with the deleterious effects of these genes on fitness (i.e. CAD burden) later in life (- sign, red text callout box 2.), where the intensity of selection is weak. This occurs because of the negative relationship between genetic effects on early vs late-life traits (- sign, red text callout box 3.), which could help explain the high prevalence and maintenance of CAD in modern human populations. Over shorter timescales, lifetime probability of CAD is modified by a combination of genetic and environmental risk factors (e.g. [106]). There is good evidence that such antagonistic effects have operated on CAD loci, given that: (i) we found significant associations between CAD genetic risk and selection (Fig. 1-2); (ii) CAD genes also underlie many early-life traits known to modify fitness (Table S2); and (iii) phenotypic selection has been found operating on CAD phenotypic risk factors [42].
Study limitations
There are also some limitations to our approach. We used CAD genetic risk estimated from a meta-analysis of individuals of predominantly European ancestry (77%), with smaller contributions from south/east Asian (19%) and Hispanic and African American (∼4%) ancestries [41]. Genetic risk variation for CAD might differ in the populations that are unrepresented (i.e. Mexican) or less well represented (i.e. African) in this meta-analysis. If that were the case, it would reduce the usefulness of comparing selection and risk estimates in those populations. We also saw fewer significant selection-risk associations in the African populations (Fig. 1B); however, this may be because selection signals in the African populations are less obvious than those in East Asian and European populations, perhaps owing to lower linkage disequilibrium, which is consistent with results from previous studies [94]. Calculating disease risk and selection variation from populations within the same ancestral group might help resolve this; however, it represents a potential shortcoming only for our cross-population analyses and not for our observations of antagonistic pleiotropy.
Summary
In this study, we found evidence that natural selection has recently operated on CAD-associated variation. By comparing positive selection variation with genetic risk variation at known loci underlying CAD, we were able to identify and prioritize genes that have been the most likely targets of selection related to this disease across diverse human populations. That selection signals and the direction of selection-risk relationships varied among some populations suggests that CAD-driven selection has operated differently in these populations, and thus that these populations might respond differently to similar heart disease prevention strategies. The pleiotropic effects that genes associated with CAD have on reproductive traits expressed early in life strongly suggest some of the evolutionary reasons for human vulnerability to CAD.
Experimental Procedures
Defining loci linked to coronary artery disease
We started with the 56 lead index SNPs from Supplementary Table 5 in Nikpay et al. [41] corresponding to 56 CAD loci. When the index SNP was genic, all SNPs within that gene were extracted (using NCBI’s dbSNP) including directly adjacent intergenic SNPs ±5000bp from untranslated regions (UTR) in LD>0.7 (with any respective genic SNP). When the index SNP was intergenic, that SNP and other directly adjacent SNPs ±5000bp and in LD>0.7 (with the index SNP) were extracted and combined with SNPs from the respective linked gene listed in Nikpay et al. [41] including SNPs ±5000bp from UTR regions in LD>0.7 with that gene. This resulted in SNP lists for 56 genes. To further explore other genes not directly connected with lead index SNPs, but that were found within the CAD loci identified by Nikpay et al. [41], we extracted SNPs within each of those genes (plus SNPs ±5000bp from UTR regions in LD>0.7 with that gene). This resulted in SNP lists for a further 20 genes, bringing the total number of candidate genes for CAD to 76.
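The extraction rule for a single gene can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Snp` record, the function name and the precomputed `max_r2_genic` field are hypothetical, and treating "LD>0.7" as an r2 threshold is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Snp:
    rsid: str
    pos: int             # base-pair position on the chromosome
    max_r2_genic: float  # hypothetical: highest LD (assumed r2) with any SNP inside the gene

def select_gene_snps(snps, utr_start, utr_end, flank_bp=5000, ld_min=0.7):
    """Keep SNPs inside the gene (UTR to UTR) plus flanking SNPs within
    flank_bp of either UTR that are in LD > ld_min with a genic SNP."""
    selected = []
    for s in snps:
        in_gene = utr_start <= s.pos <= utr_end
        in_flank = (utr_start - flank_bp <= s.pos < utr_start) or (
            utr_end < s.pos <= utr_end + flank_bp
        )
        if in_gene or (in_flank and s.max_r2_genic > ld_min):
            selected.append(s.rsid)
    return selected

# Toy gene spanning 10,000-20,000 bp
toy = [
    Snp("rs_in", 15_000, 1.0),       # genic: kept
    Snp("rs_flank_hi", 8_000, 0.9),  # flanking, high LD: kept
    Snp("rs_flank_lo", 21_000, 0.5), # flanking, low LD: dropped
    Snp("rs_far", 30_000, 0.95),     # outside the 5 kb flank: dropped
]
kept = select_gene_snps(toy, 10_000, 20_000)
```

For an intergenic index SNP, the same flank-and-LD test would be applied around the index SNP itself before merging with the linked gene's list.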
The per-SNP log odds (ln(OR)) values for the 76 genes were obtained from Nikpay et al. [41] available at http://www.cardiogramplusc4d.org/downloads and used in the analysis described below.
Preparation of HapMap3 samples
Genotype data (1,457,897 SNPs, 1,478 individuals) were downloaded for 11 HapMap Phase 3 (release 3) populations (http://www.hapmap.org [95]) including: Yoruba from Ibadan, Nigeria (YRI), Maasai in Kinyawa, Kenya (MKK), Luhya in Webuye, Kenya (LWK), African ancestry in Southwest USA (ASW), Utah residents with ancestry from northern and western Europe from the CEPH collection (CEU), Tuscans in Italy (TSI), Japanese from Tokyo (JPT), Han Chinese from Beijing (CHB), Chinese in Metropolitan Denver, Colorado (CHD), Gujarati Indians in Houston, TX, USA (GIH), and Mexican ancestry in Los Angeles, CA, USA (MEX). We also included another HapMap3 population, the Finnish in Finland (FIN) sample (ftp://ftp.fimm.fi/pub/FIN_HAPMAP3 [96]). These data had already been pre-filtered, i.e. SNPs were excluded if they were monomorphic or had call rate < 95%, MAF < 0.01, Hardy-Weinberg equilibrium P < 1×10⁻⁶, etc.
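To make the call-rate and MAF thresholds concrete, a per-SNP check might look like the sketch below. It is illustrative only: the function name and genotype encoding are assumptions, the Hardy-Weinberg test is deliberately omitted, and in practice such filters are normally applied with a tool such as Plink.

```python
def passes_qc(genotypes, min_call_rate=0.95, min_maf=0.01):
    """genotypes: alternate-allele counts (0, 1 or 2) per individual,
    with None for a missing call. HWE filtering is not covered here."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    call_rate = len(called) / len(genotypes)
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)  # monomorphic SNPs have maf == 0
    return call_rate >= min_call_rate and maf >= min_maf
```

A monomorphic SNP fails automatically because its minor allele frequency is zero.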
Before phasing and imputation, we performed a divergent ancestry check with flashpca [97] to check the accuracy of population assignments, converted SNP data from build 36 to 37 with UCSC LiftOver (https://genome.ucsc.edu/cgi-bin/hgLiftOver), checked strand alignment in Plink v1.9 [98] to ensure all genotypes were reported on the forward strand, and kept only autosomal SNPs. To speed up imputation, data were first pre-phased with Shapeit v2 [99] using the duoHMM option, which incorporates pedigree information to improve phasing, with default values for window size (2 Mb), per-SNP conditioning states (100) and effective population size (n=15000), and with genetic maps from the 1000 Genomes Phase 3 b37 reference panel (ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/).
Phased data were imputed in 5 Mb chunks across each chromosome with Impute v2 [100]. We then removed any multiallelic SNPs (insertions, deletions, etc.) from the imputed data and excluded SNPs with call rate < 95%, HWE P < 1×10⁻⁶ and MAF < 1%. The final dataset was then phased with Shapeit v2, and alleles were converted to ancestral and derived states using a Python script. Ancestral allele states came from 1000 Genomes Project FASTA files, which were derived from the 6-primate (human, gorilla, orangutan, chimp, macaque, marmoset) Enredo-Pecan-Ortheus alignment [101] in the Ensembl Compara 59 database [102].
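The ancestral/derived recoding step could look roughly like the sketch below. The authors' script is not shown in the text, so the function name and the 0/1 haplotype encoding are assumptions. Upper-casing the ancestral base means lower-case (lower-confidence) calls are accepted, which is a further assumption and may not match the original choice.

```python
def polarize(ref, alt, ancestral, haplotypes):
    """Recode 0/1 haplotype alleles (0 = ref, 1 = alt) so that
    0 = ancestral and 1 = derived. Returns None when the ancestral
    base is unknown or matches neither allele."""
    anc = ancestral.upper()
    if anc == ref.upper():
        return list(haplotypes)            # ref is already ancestral
    if anc == alt.upper():
        return [1 - h for h in haplotypes] # flip: alt is ancestral
    return None                            # e.g. 'N', '.', '-' or a third base
```

SNPs for which this returns `None` have no ancestral state and therefore cannot receive an iHS score.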
Estimating signatures of recent selection
Integrated Haplotype Score (iHS)
Using the package rehh [103] in R version 3.1.3, per-SNP iHS scores were calculated within each population (after excluding non-founders) using methods described previously [9]. iHS could not be calculated for SNPs without an ancestral state, for SNPs whose population minor allele frequency was <5%, or for some SNPs close to chromosome ends or to large regions without SNPs [9]. Rehh was also used to standardize (mean 0, variance 1) iHS values empirically to the distribution of available genome-wide SNPs with similar derived allele frequencies. For analyses in the main text, we considered a SNP to have a candidate selection signal if it had an absolute iHS score > 2, a permuted p value <0.05, and was within a ‘cluster’ of SNPs that also had elevated iHS scores. Although permuting p values is computationally more intensive, it provides more flexibility to detect smaller selection signals that may be incorrectly classified with the more stringent Bonferroni correction that is often applied to these estimates. For the analyses described below, even though we only used iHS estimates for the SNPs defined in the CAD genes (and additional SNPs for permutation purposes), we calculated per-SNP iHS scores genome-wide (rather than locally, i.e. within 1 Mb regions around focal SNPs), because final adjustments are made relative to other genome-wide SNPs in similar derived allele frequency classes, which gives more accurate estimates. P values for iHS scores were permuted by comparing nominal p values against 10000 randomly selected estimates from within the same derived allele frequency classes.
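The frequency-binned standardization can be illustrated in a few lines. This is a sketch of the general idea only, not rehh's implementation: the function names, the 20 equal-width frequency bins and the use of the population standard deviation are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardize_by_daf(raw_scores, dafs, n_bins=20):
    """Standardize raw iHS values to mean 0, variance 1 within bins of
    derived allele frequency (DAF)."""
    bins = defaultdict(list)
    for i, daf in enumerate(dafs):
        bins[min(int(daf * n_bins), n_bins - 1)].append(i)
    out = [None] * len(raw_scores)
    for idx in bins.values():
        vals = [raw_scores[i] for i in idx]
        mu, sd = mean(vals), pstdev(vals)
        for i in idx:
            # A bin with no spread carries no information; score it as 0
            out[i] = (raw_scores[i] - mu) / sd if sd > 0 else 0.0
    return out

def flag_candidates(z_scores, threshold=2.0):
    """First of the three criteria in the text: |iHS| > 2. The permuted
    p value and the clustering requirement are not modelled here."""
    return [abs(z) > threshold for z in z_scores]
```

Because each SNP is scaled against others of similar derived allele frequency, the bins need to be filled from genome-wide data rather than from the candidate genes alone, which is why scores were computed genome-wide.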
Comparing CAD genetic risk and quantitative selection signals
We first tested the null hypothesis that there is no association between CAD genetic risk and signals of positive selection for CAD genes. For each gene within each population, we used a mixed effects linear model to regress SNP-based estimates of CAD log odds (ln(OR)) genetic risk against selection scores (iHS) resulting in 912 separate regressions. To account for LD structure (and potential confounding of highly correlated SNPs) within each gene, we also included the first eigenvector derived from an LD matrix of correlations (r2) between SNPs within each gene as a random effect. We chose to model LD structure with mixed-effects models rather than LD-prune because for many genes, the sample would have been too small for regression analyses. Also, it would be very difficult to properly capture both selection and the CAD log odds peaks needed to compare these variables. We accounted for multiple testing by permuting p values for each regression based on comparing each nominal p value against 10000 permuted p values derived from shuffling iHS scores.
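The permutation step can be sketched as below. This is a deliberately simplified stand-in: it uses a plain Pearson correlation between ln(OR) and iHS instead of the mixed-effects model, omits the LD eigenvector, and compares test statistics rather than nominal p values. The function names, the seed and the +1 correction are assumptions.

```python
import random

n_regressions = 76 * 12  # 76 genes x 12 populations = 912 models

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def permutation_p(log_odds, ihs, n_perm=10000, seed=0):
    """Two-sided permutation p value for the association between
    per-SNP CAD log odds and iHS, obtained by shuffling iHS scores."""
    rng = random.Random(seed)
    observed = abs(pearson_r(log_odds, ihs))
    shuffled = list(ihs)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(log_odds, shuffled)) >= observed:
            hits += 1
    # +1 in numerator and denominator keeps the estimate above zero
    return (hits + 1) / (n_perm + 1)
```

Shuffling iHS across SNPs breaks any positional correspondence with risk, but it also breaks the LD structure, which is one reason the study additionally models LD within each gene.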
Genes were then ranked based on the number of significant associations summed across the 12 populations. The 40 genes with four or more significant associations are shown in Fig. 1B. To illustrate the positional architecture of these selection-risk associations, plots for selected highly ranked genes are shown in Fig. 2-3. By showing how CAD genetic risk peaks and valleys correspond to variation in the magnitude of selection scores (iHS), these plots allowed visual assessment of potential modifications made to the phenotype-genotype map by selective pressures imposed directly or indirectly by CAD. They also helped us localize selection peaks within genes and compare them between populations. Similar peaks suggested similar selection and different peaks suggested local adaptation. This way of presenting the results also allowed us to detect the smaller adaptive shifts in allele frequencies typically expected to underlie selection on polygenic traits.
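The ranking itself amounts to counting significant gene-by-population results. A minimal sketch follows; the function name and input layout are hypothetical, and ties are broken alphabetically purely for reproducibility.

```python
from collections import Counter

def rank_genes(significant_pairs, min_hits=4):
    """significant_pairs: iterable of (gene, population) tuples, one per
    significant selection-risk regression. Returns (gene, count) pairs
    with count >= min_hits, sorted by descending count."""
    counts = Counter(gene for gene, _pop in significant_pairs)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(gene, n) for gene, n in ranked if n >= min_hits]
```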
We then tested a second null hypothesis: that the selection-risk associations for the CAD genes do not differ from those for non-CAD-associated loci. For each of the 76 CAD genes, we randomly (without replacement) chose 100 genes of similar length across the genome and performed the same mixed effects regression procedure described above for each gene-by-population combination, using CAD log odds values from Nikpay et al. [41], iHS scores estimated from the SNP data, and the first LD eigenvector from SNPs within a gene. Permuted p values were derived by comparing the nominal p value for each CAD gene against the 100 null distribution p values from the non-CAD associated genes. Results are shown in Fig. 1C.
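The matched-null comparison can be sketched in two steps: draw length-matched genes without replacement, then locate the CAD gene's p value in the null distribution. The function names, the 10% length tolerance and the +1 correction are assumptions, since the text does not specify how "similar length" was defined.

```python
import random

def sample_matched_genes(target_len, gene_lengths, n=100, tol=0.1, seed=0):
    """Pick up to n non-CAD genes (without replacement) whose length is
    within tol * target_len of the CAD gene's length."""
    pool = [g for g, length in gene_lengths.items()
            if abs(length - target_len) <= tol * target_len]
    return random.Random(seed).sample(pool, min(n, len(pool)))

def empirical_p(cad_p, null_ps):
    """Share of null genes with a p value at least as small as the CAD
    gene's, with a +1 correction so the result is never exactly zero."""
    hits = sum(1 for p in null_ps if p <= cad_p)
    return (hits + 1) / (len(null_ps) + 1)
```

With 100 null genes the smallest attainable value under this correction is 1/101, so the resolution of the test is set by the size of the matched set.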
Identifying functional targets of selection
To examine whether candidate adaptive signals within each gene corresponded to a gene’s regulatory variation, we regressed that gene’s probe expression levels on SNPs within the focal gene and on gender. Expression had previously been quantified in lymphoblastoid cell lines using Illumina’s Human-6 v2 Expression BeadChip for eight of the 12 populations [104]. While selection related to CAD may have targeted regulatory variants important for other tissues or cell types, gene expression data were only available for this cell type. Given the central importance of circulating lymphoblastoid cells in CAD and its risk factors, we might expect this cell type to be a good candidate in which to search for associations between selection signals and regulatory variants important for these genes. The raw gene microarray expression data had previously been normalized on a log2 scale using quantile normalization for replicates of a single individual, then median normalization for each population [104]. P values for each SNP-probe association were permuted using 10000 permutations by randomly shuffling gene probe expression values. P values were then extracted for the most significant iHS score for each gene-population combination and compared to the same number of p values randomly drawn from different LD blocks underlying SNPs with non-significant iHS scores across each gene-population combination. A Kolmogorov-Smirnov test was used to compare the two distributions of p values. To examine what biological processes were associated with the top-ranked genes from Fig. 1, we uploaded the top 10 genes into Enrichr (http://amp.pharm.mssm.edu/Enrichr/) to define associated pathways (i.e. KEGG 2015, kegg.jp/kegg), ontologies (MGI Mammalian phenotypes, informatics.jax.org), cell types (Cancer cell line Encyclopedia, broadinstitute.org/ccle) and transcription factors (ChEA 2015, amp.pharm.mssm.edu/lib/chea.jsp).
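The two-sample Kolmogorov-Smirnov comparison rests on the largest gap between two empirical cumulative distributions. The sketch below computes that statistic only, with a hypothetical function name. A p value would normally come from an existing routine such as `scipy.stats.ks_2samp`, and the text does not say which software the authors used.

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample KS statistic D: the maximum absolute difference
    between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:  # the maximum is always reached at an observed value
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d
```

Here `a` would hold the expression-association p values at the top iHS SNPs and `b` those drawn from SNPs with non-significant iHS. A large D indicates that the two sets of p values are distributed differently.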
Author Contributions
Conceptualization, S.G.B. and M.I.; Methodology, S.G.B. and M.I.; Formal analysis, S.G.B. and Q.H.; Literature review, S.G.B.; Writing – original draft, S.G.B. and M.I.; Writing – review & editing, S.G.B., Q.H., L.G., S.R., G.A., S.C.S. and M.I.; Visualization, S.G.B.; Funding acquisition, M.I.; Supervision, M.I.
Acknowledgements
This study was supported by the National Health and Medical Research Council (NHMRC) of Australia (grant no. 1062227) and the National Heart Foundation of Australia. MI was supported by a Career Development Fellowship co-funded by the NHMRC and the National Heart Foundation of Australia (no. 1061435). GA was supported by an NHMRC Peter Doherty Early Career Fellowship (no. 1090462). We are grateful to the CARDIoGRAMplusC4D consortium for making their large-scale genetic data available. A list of members of the consortium and the contributing studies is available at www.cardiogramplusc4d.org.