I. Abstract
Background: Gene-gene and gene-environment interactions are known to contribute significantly to variation of complex phenotypes in model organisms. However, their identification in human association studies remains challenging for myriad reasons. In the case of gene-gene interactions, the large number of potential interacting pairs presents computational, multiple hypothesis correction, and other statistical power issues. In the case of gene-environment interactions, the lack of consistently measured environmental covariates in most disease studies precludes searching for interactions and creates difficulties for replicating studies.
Results: In this work, we develop a new statistical approach to address these issues that leverages genetic ancestry (θ) in admixed populations. We applied our method to gene expression and methylation data from African American and Latino admixed individuals, identifying nine interactions that were significant at a threshold of p < 5 × 10⁻⁸. We replicate two of these interactions and show that a third has previously been identified in a genetic interaction screen for rheumatoid arthritis.
Conclusion: We show that genetic ancestry can be a useful proxy for unknown and unmeasured environmental exposures with which it is correlated.
II. Background
Genetic association studies in humans have focused primarily on the identification of additive SNP effects through marginal tests of association. There is growing evidence that both gene-gene (G × G) and gene-environment (G × E) interactions contribute significantly to phenotypic variation in humans and model organisms[1-5]. In addition to explaining additional components of missing heritability, interactions lend insights into biological pathways that regulate phenotypes and improve our understanding of their genetic architectures. However, identification of interactions in human studies has been complicated by the multiple testing burden in the case of G × G interactions, and the lack of consistently measured environmental covariates in the case of G × E interactions[6,7].
To overcome these challenges, we leverage the unique nature of genomes from recently admixed populations such as African Americans, Latinos, and Pacific Islanders. Admixed genomes are mosaics of different ancestral segments[8] and for each admixed individual it is possible to accurately estimate θ, the proportion of ancestry derived from each ancestral population (e.g. the fraction of European/African ancestry in African Americans)[9]. Studies have demonstrated that an array of environmental and biomedical covariates are correlated with θ[10-13], and we therefore consider its use as a surrogate for unmeasured and unknown environmental exposures. θ is also correlated with the genotypes of SNPs that are highly differentiated between the ancestral populations. Thus θ may also be used as a proxy for detecting epistatic interactions. Therefore, we propose a new SNP by θ test of interaction (AITL) in order to detect evidence of interaction in admixed populations.
We first investigate the properties of our method through simulated genotypes and phenotypes of admixed populations. In our simulations we demonstrate that differential linkage-disequilibrium (LD) between ancestral populations can produce false positive SNP by θ interactions when local ancestry is ignored. To accommodate differential LD, we include local ancestry in our statistical model and demonstrate that this properly controls this confounding factor. We also show that AITL is well powered to detect gene-environment interactions when θ is correlated with the environmental covariates of interest. However, the power for detecting pairwise G × G interactions at highly differentiated SNPs is lower than direct interaction tests even after accounting for the additional multiple testing burden.
We applied our method to gene expression data from African Americans and DNA methylation data from Latinos. We identified one genome-wide significant interaction (p < 5 × 10⁻⁸) associated with gene expression in the African Americans and eight significant interactions (p < 5 × 10⁻⁸) associated with methylation in the Latinos. We replicated two of the eight interactions associated with DNA methylation in the Latinos and show that the interaction associated with gene expression has also previously been found to have epistatic effects in the Wellcome Trust Case Control Consortium (WTCCC) rheumatoid arthritis case/control dataset[14]. Together, these results provide evidence for the existence of interactions regulating expression and methylation.
III. Results
Simulated Data
To determine the utility of using θ as a proxy for unmeasured and unknown environmental covariates, we applied the AITL to simulated 2-way admixed individuals. We tested θ1, the proportion of ancestry from ancestral population 1, for interaction with simulated SNPs (see Simulation Framework). Power was computed over 1,000 simulations, assuming 10,000 SNPs being tested, and using a Bonferroni-corrected p-value cutoff of 5 × 10⁻⁶. We calculated the power using assumed interaction effect sizes (either βG × G or βG × E) of 0.1, 0.2, 0.3, and 0.4 (see Simulation Framework). Although the few interactions reported for human traits and diseases show much smaller effect sizes, we simulated large effects because genetic and environmental effect sizes in omics data, such as the expression and methylation data considered here, are known to be of larger magnitude. For example, some cis-eQTL SNPs explain up to 50% of the variance of gene expression[15].
Power When Using θ as a Proxy for Highly Differentiated SNPs
To determine whether using θ as a proxy for a highly differentiated SNP is more powerful than testing all pairs of potentially interacting SNPs directly, we simulated two interacting SNPs in 1,000 admixed individuals (see Simulation Framework). We then tested for an interaction using AITL by replacing the genotypes at the highly differentiated SNP with θ1. We observed that even with moderate effect sizes, using θ in place of the actual genotypes does not provide any increase in power, even after accounting for multiple testing correction (see Figure 1a). This is in agreement with recent work showing the limited utility of local ancestry by local ancestry interaction tests for identifying underlying SNP by SNP interactions when genotype data are available[28]. For the larger effect sizes we simulated, we do see power increasing as the difference between ancestral allele frequencies increases. The plots show that AITL would be unable to detect anything unless the effect was very strong. Figure 1b reveals that even with the multiple testing penalty, testing all SNP pairs directly is always more powerful. We note that when testing the interacting SNPs directly, we used a cutoff p-value of 1 × 10⁻⁹ since in theory we were testing all unique pairs of 10,000 SNPs. Based on these results, we would recommend testing for pairs of interacting SNPs directly if pairwise G × G interactions are a subject of interest in the study. However, when multi-way interactions are considered, AITL may become more powerful (see Discussion).
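The two significance cutoffs used in the simulations follow from a Bonferroni correction at a family-wise error rate of 0.05. The short Python sketch below reproduces the arithmetic; the function name `bonferroni_cutoff` is ours and purely illustrative.

```python
from math import comb


def bonferroni_cutoff(alpha: float, n_tests: int) -> float:
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


n_snps = 10_000

# One SNP-by-theta test per SNP.
single_snp_cutoff = bonferroni_cutoff(0.05, n_snps)

# Direct pairwise testing: every unique pair of SNPs is one test.
n_pairs = comb(n_snps, 2)
pairwise_cutoff = bonferroni_cutoff(0.05, n_pairs)

print(f"SNP-by-theta cutoff: {single_snp_cutoff:.2e}")
print(f"{n_pairs} SNP pairs, pairwise cutoff: {pairwise_cutoff:.2e}")
```

The pairwise cutoff is roughly 5,000 times more stringent than the single-SNP cutoff, which is the multiple testing penalty that the direct test has to overcome in Figure 1b.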
Power When Using θ as a Proxy Environmental Covariate
When assessing the utility of θ as a proxy for an environmental covariate E, we simulated 3,000 individuals. E was simulated such that it was correlated with the individuals’ global ancestries in varying degrees (see Simulation Framework). Figure 2 shows the power of the AITL as a function of the Pearson correlation between θ1 and E. The power of testing E directly is exactly the power of the AITL when the correlation is equal to 1. As expected, as the correlation increases, the power increases as well. When the effect size is 0.1, the power to detect a gene-environment interaction is low whether one uses θ1 or E. However, both tests are much better powered for effect sizes greater than or equal to 0.2, with the AITL’s power being dependent on the level of correlation.
Differential LD
To demonstrate that differential LD has the potential to cause inflated test statistics, we ran 10,000 simulations of 1,000 admixed individuals. For each individual we simulated 2 SNPs, a causal SNP and a tag SNP. The LD between the tag SNP and causal SNP was different based on the ancestral background the SNPs were on (see Simulation Framework). Over 10,000 simulations, we computed the mean test-statistic for the AIT and the AITL. We note that the phenotypes for these simulations were generated under a model that assumed no interaction. We observed a mean with a standard deviation of 1.53 for AITL. AIT, which does not condition on local ancestry, had a mean with a standard deviation of 3.60. We also looked at λGC, or genomic control, as another indicator of test-statistic inflation[16]. λGC compares the median observed χ² test-statistic versus the true median under the null. In our simulations, we observed λGC = 5.81 for AIT and λGC = 0.980 for AITL (see Supplementary Figure S1). Last, we computed the proportion of test-statistics that passed a p-value threshold of 0.05 and 0.01 in our simulations. The AIT had 3687 statistics passing a p-value of 0.05 and 1687 at a threshold of 0.01, whereas AITL had 464 and 96 at the same p-value thresholds. The results for AITL are as expected under a true null. The results from our simulations show that not accounting for local ancestry can result in inflated test-statistics and can potentially lead to false positive findings.
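The genomic control factor described above can be computed in a few lines. The following Python sketch is illustrative only: it uses synthetic statistics rather than the simulation output, and the helper name `lambda_gc` is our own.

```python
import random
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 df: the square of the 75th
# percentile of a standard normal (about 0.455).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def lambda_gc(chi2_stats):
    """Genomic control factor: observed median statistic over the null median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


# Synthetic illustration (not the paper's data): squared standard normals are
# chi-square with 1 df, so they mimic a well-calibrated test.
rng = random.Random(0)
null_stats = [rng.gauss(0, 1) ** 2 for _ in range(10_000)]
inflated_stats = [2.0 * s for s in null_stats]  # uniformly inflated statistics

print(round(lambda_gc(null_stats), 2))
print(round(lambda_gc(inflated_stats), 2))
```

A value near 1 indicates calibrated statistics, as reported for AITL, whereas values well above 1 indicate inflation of the kind reported for AIT.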
Real Data
Coriell Gene Expression Results
We first applied our method to the Coriell gene expression dataset[17]. The Coriell cohort is composed of 94 African-American individuals and the gene expression values of ~8800 genes in lymphoblastoid cell lines (LCLs). Since African Americans derive their genomes from African and European ancestral backgrounds, we tested for interaction between a given SNP and the proportion of European ancestry, θEUR. Each SNP by θEUR term was tested once for association with the expression of the gene closest to the SNP. We observed well-calibrated statistics with a λGC equal to 1.04 (see Supplementary Figure S2). In the LCLs, we found that interaction of rs7585465 with θEUR was associated with ERBB4 expression (AITL p = 2.95 × 10⁻⁸, Marginal p = 0.404) at a genome-wide significant threshold (p ≤ 5 × 10⁻⁸).
Given that the gene expression values come from LCLs (all cultured according to the same standards), the SNPs are either interacting with epigenetic alterations due to environmental exposures that have persisted since transformation into LCLs or the signals are driven by epistatic interactions. In our simulations, we showed that using θ as a proxy for a single highly differentiated SNP is underpowered compared to testing all pairs of potentially interacting SNPs directly. However, there are many SNPs that are highly differentiated across the genome with which θ will be correlated. It is therefore possible that θ is capturing the interaction between the aggregate of all differentiated trans-SNPs (i.e. global genetic background) and the candidate SNP. This is consistent with a recently reported finding, conducted in human iPS cell lines, that genetic background accounts for much of the transcriptional variation[2,18].
GALA II Methylation Results
We searched for interactions in methylation data derived from a study of asthmatic Latino individuals called the Genes-environments and Admixture in Latino Americans (GALA II)[19]. The methylation data is composed of 141 Mexicans and 184 Puerto Ricans. As the phenotype, we used DNA methylation measurements on ~300,000 markers from peripheral blood. As we had done with gene expression, we tested for interaction between a given SNP and θEUR using AITL. All SNPs within a 1 Mb window centered around the methylation probe were tested. We used the European component of ancestry because it is the component shared most between Mexicans and Puerto Ricans (see Table 1). We observed well-calibrated test statistics with λGC equal to 1.06 in the Mexicans and 0.96 in the Puerto Ricans (see Supplementary Figure S3). We tested 128,794,325 methylation-SNP pairs, which results in a Bonferroni-corrected p-value cutoff of 3.88 × 10⁻¹⁰. However, this cutoff is extremely conservative given that the tests are not all independent. We therefore report all results that are significant at 5 × 10⁻⁸ in either set as an initial filter. We found 5 interactions in the Mexicans and 3 in the Puerto Ricans that are significant at this threshold (see Table 2).
Unlike the Coriell individuals, who are 2-way admixed, the GALA II Latinos are 3-way admixed and derive their ancestries from European, African, and Native American ancestral groups. Consequently, to confirm that incomplete modeling or better tagging on one of the non-European ancestries was not driving the results, we retested all significant interactions including a second component of ancestry for AITL. In the case of the Mexicans, we included African and European ancestry, and in the case of the Puerto Ricans, we included European and Native American ancestry. Even after adjusting for the second ancestry, the interactions between SNP and θEUR remained highly significant (see Supplementary Table 1).
We then performed a replication study of the significant Puerto Rican associations in the Mexican cohort and vice versa. To account for the fact that we are replicating eight total results across both populations, we used a Bonferroni-corrected p-value threshold equal to 0.05/8 = 6.25 × 10⁻³. The interaction of rs4312379 and rs4312379 with ancestry in the Puerto Ricans replicated in the Mexicans. Furthermore, there was a highly significant overall trend of association in the replication study (permutation p < 1 × 10⁻⁴). The lack of direct replication for other specific interactions might be driven in part by the fact that Mexicans and Puerto Ricans have distinct genetics and environmental exposures. Overall, our results from the GALA II cohort suggest there are both genetic and environmental interactions that have yet to be discovered in admixed individuals.
IV. Discussion and Conclusions
For many disease architectures, interactions are believed to be a major component of missing heritability[20]. Finding new interactions has proven to be difficult for logistical, statistical, biological, and computational reasons. In this study, we have demonstrated that in admixed populations, testing for gene by θ interactions can be leveraged to overcome some of the difficulties typically encountered when searching for interactions. Although our method does not provide details as to which covariate is interacting with a genetic locus, it can show whether an interaction effect exists in a given dataset. Furthermore, the drawback of not having consistently measured environmental covariates is addressed by our method. Genetic ancestry is nearly perfectly replicable, especially with respect to environmental measurements that can be influenced by a myriad of factors between studies. Testing for the presence of interaction using a nearly perfectly reproducible covariate may enhance our understanding of the genetic basis of disease and other traits. Our method also provides the additional benefit of not being confounded by interactions between unaccounted-for covariates[21].
Our simulations showed that genetic ancestry can be a good proxy for an environmental covariate depending on the correlation between the two. On the other hand, our simulations also revealed that testing SNP by θ where genetic ancestry is a proxy for a single highly differentiated SNP is severely underpowered. Although genetic ancestry in our simulations was not a good proxy for a single SNP, our results from cell lines suggest that genetic ancestry is a good proxy for genetic background, since all highly differentiated SNPs across the genome will be correlated with genetic ancestry. There are also other contexts in which modeling SNP by θ may be useful, such as in heritability estimation. We have previously shown that local ancestry from admixed populations can be leveraged to estimate the total additive heritability of a phenotype[22]. We could also use the SNP by θ interaction terms to estimate heritability in a mixed-model framework because genetic ancestry is correlated with many genetic markers and environmental covariates[23]. To do so, we would introduce an additional variance component computed from SNP by θ across the genome in addition to the component computed from SNPs only. In this scenario, genetic ancestry would represent an aggregate of potential interacting genetic and environmental covariates. It will be interesting to see whether such estimations yield more accurate measures of heritability.
In our analysis of real data, we discovered gene by θ interactions associated with genes that have known interactions. In the Coriell data, we found that ERBB4 gene expression was associated with a SNP by θ interaction. Notably, ERBB4 gene expression has been previously shown to be modulated by SNP-SNP interactions in individuals of European background with schizophrenia[24,25]. Furthermore, the SNP rs7585465 in ERBB4 that we identified has been shown to be part of multiple epistatic interactions from the results of interaction analysis for rheumatoid arthritis in the WTCCC; of note, this SNP was in interaction for this disease with a highly population-differentiated SNP rs163673 (which has allele “A” frequency of 0.11 in the reference African population YRI and 1.0 in the reference European ancestry population CEU)[26]. In the GALA II Mexicans, the interaction of rs925736 with ancestry was associated with the methylation of HDAC4, a known histone deacetylase (HDAC). In concert with DNA methylases, HDACs function to regulate gene expression by altering chromatin state[27]. In Europeans, HDACs have been shown to be associated with lung function through direct genetic effects and through environmental interactions[28,29]. For the GALA II Puerto Ricans, rs17091085 showed an interaction associated with the methylation state of SERPINA6. Of note, interaction between birth weight and SERPINA6 has been previously associated with Hypothalamic-Pituitary-Adrenal axis function[30]. Further investigations of our interaction findings are thus warranted.
Our analysis revealed the existence of interactions but does not provide a direct way to determine the covariate that is interacting with a SNP. Further work will need to be done to uncover the exact environmental exposures or genetic loci with which SNPs are interacting. The existence of gene by θ interactions in GALA II underscores why modeling interactions should be considered for future association studies and heritability estimation in admixed populations.
V. Materials and Methods
Our approach is best illustrated with an example. First consider testing a SNP s for interaction with an environmental covariate E. θ can serve as a proxy for E if the two are correlated, even if E is unknown or unmeasured (see Figure 3a). Now consider testing s for interaction with a SNP j≠s that is highly differentiated in terms of ancestral allele frequencies, for example, a SNP that has a high allele frequency in one ancestral population and a low allele frequency in the other ancestral population. θ can be used as a proxy for j because θ and the genotypes of SNP j will be correlated. Consider the case where j has a frequency of 0.9 in population 1 and a frequency of 0.1 in population 2. Individuals with large values of θ1 are more likely to have derived j from population 1 and on average have greater genotype values at j. Similarly, individuals with small values of θ1 are more likely to have derived j from population 2 and on average have smaller genotype values. Thus, θ will be correlated with the genotypes of the individuals for highly differentiated SNPs and can serve as a proxy for detecting interactions (see Figure 3b).
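The intuition in this example can be checked numerically. The Python sketch below simulates 2-way admixed individuals and compares the correlation between θ1 and genotype for a highly differentiated SNP (frequencies 0.9 and 0.1, as in the example) and for an undifferentiated SNP. The θ1 distribution (normal with mean 0.7 and standard deviation 0.2, truncated to [0, 1]) is taken from the Simulation Framework; the sample size, the seed, and the function names are illustrative assumptions.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_theta_and_genotype(n, p1, p2, rng):
    """Simulate global ancestry and genotypes at one SNP for n individuals.

    p1 and p2 are the allele frequencies in ancestral populations 1 and 2.
    """
    thetas, genotypes = [], []
    for _ in range(n):
        theta1 = min(1.0, max(0.0, rng.gauss(0.7, 0.2)))
        g = 0
        for _ in range(2):  # two haplotypes per individual
            from_pop1 = rng.random() < theta1
            g += int(rng.random() < (p1 if from_pop1 else p2))
        thetas.append(theta1)
        genotypes.append(g)
    return thetas, genotypes


rng = random.Random(1)
r_differentiated = pearson(*simulate_theta_and_genotype(5000, 0.9, 0.1, rng))
r_undifferentiated = pearson(*simulate_theta_and_genotype(5000, 0.5, 0.5, rng))

print(f"differentiated SNP:   r = {r_differentiated:.2f}")
print(f"undifferentiated SNP: r = {r_undifferentiated:.2f}")
```

Only the differentiated SNP shows an appreciable correlation with θ1, which is why θ can stand in for such a SNP but carries no information about SNPs with similar frequencies in both ancestral populations.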
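Consider an admixed individual i who derives his or her genome from k ancestral populations. We denote individual i’s global ancestry proportion for ancestry a as $\theta_{ia}$, where $\sum_{a=1}^{k} \theta_{ia} = 1$. The local ancestry of individual i at a SNP s is denoted as $\gamma^{a}_{is} \in \{0, 1, 2\}$ and is equal to the number of alleles from ancestry a inherited at SNP s. Current methods allow us to estimate ancestry directly from genotype data both globally and at specific SNPs[9,31,32]. We denote the genotype of an individual i at SNP s as $g_{is} \in \{0, 1, 2\}$ and the corresponding phenotype as $y_i$.
In this work, we model phenotypes in an additive linear regression framework, but note that our method can easily be extended to a logistic framework for case-control data. Assuming n (unrelated) individuals, define $\mathbf{y} = (y_1, \ldots, y_n)^{T}$ to be the vector of all individuals’ phenotypes. The model for the phenotype is then $\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon}$ is an n×1 vector of error terms with $\epsilon_i \sim N(0, \sigma^2)$, X is an n×v matrix of v covariates, and $\boldsymbol{\beta}$ is a v×1 vector of the covariate effect sizes. We note that in our notation bold symbols denote vectors, and $\mathbf{x}_i^{T}$ is the ith row of X. Assuming independence, the likelihood under this model is:
$$L(\boldsymbol{\beta}, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mathbf{x}_i^{T}\boldsymbol{\beta})^2}{2\sigma^2}\right)$$
We can compute the log-likelihood ratio statistic (D) using a maximum likelihood approach:
$$D = -2 \ln \frac{L_0(\hat{\boldsymbol{\beta}}_0, \hat{\sigma}_0^2)}{L_1(\hat{\boldsymbol{\beta}}_1, \hat{\sigma}_1^2)}$$
The maximum likelihood estimator (MLE) of the effect sizes is $\hat{\boldsymbol{\beta}} = (X^{T}X)^{-1}X^{T}\mathbf{y}$, and the MLE of the error variance is $\hat{\sigma}^2 = \frac{1}{n}(\mathbf{y} - X\hat{\boldsymbol{\beta}})^{T}(\mathbf{y} - X\hat{\boldsymbol{\beta}})$. Here, $L_1$ is the likelihood under the alternative and $L_0$ is the likelihood under the null. $\hat{\boldsymbol{\beta}}_j$ and $\hat{\sigma}_j^2$ (j = 0, 1) are the effect sizes and error variance estimates that maximize the respective likelihoods. D is distributed as χ² with k degrees of freedom (df), where k is the number of parameters constrained under the null.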
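For a Gaussian linear model, the statistic D has a convenient closed form. The derivation below is a standard result that we add for illustration; it is not taken from the original text. The symbols $\mathrm{RSS}_0$ and $\mathrm{RSS}_1$ (residual sums of squares under the null and alternative models, with $\mathrm{RSS} = \sum_i (y_i - \mathbf{x}_i^{T}\hat{\boldsymbol{\beta}})^2$) are our own notation.

```latex
\begin{align*}
\ln L(\hat{\boldsymbol{\beta}}, \hat{\sigma}^2)
  &= -\frac{n}{2}\ln\left(2\pi\hat{\sigma}^2\right) - \frac{\mathrm{RSS}}{2\hat{\sigma}^2}
   = -\frac{n}{2}\ln\left(2\pi\hat{\sigma}^2\right) - \frac{n}{2},
  \qquad \hat{\sigma}^2 = \frac{\mathrm{RSS}}{n}, \\
D &= -2\left[\ln L_0 - \ln L_1\right]
   = n \ln\frac{\hat{\sigma}_0^2}{\hat{\sigma}_1^2}
   = n \ln\frac{\mathrm{RSS}_0}{\mathrm{RSS}_1}.
\end{align*}
```

Because the alternative model contains every covariate of the null model, $\mathrm{RSS}_1 \le \mathrm{RSS}_0$ and therefore $D \ge 0$; each test reduces to two least-squares fits.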
1-df Ancestry Interaction Test (AIT)
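The first test we present is the standard direct test of interaction. We test for a SNP’s interaction with θ instead of an environmental covariate or another genotype. Let $\mathbf{g}_s$ be the vector of the individuals’ genotypes at SNP s, $\boldsymbol{\theta}_a$ be the vector of their global ancestries for ancestry a, and $\mathbf{g}_s \circ \boldsymbol{\theta}_a$ be the vector of interaction terms which result from the component-wise multiplication of the genotype and global ancestry vectors. We test the alternative hypothesis $H_1: \mathbf{y} = \beta_{G}\mathbf{g}_s + \beta_{G\times\theta}(\mathbf{g}_s \circ \boldsymbol{\theta}_a) + \beta_{\theta}\boldsymbol{\theta}_a + \boldsymbol{\epsilon}$ against the null hypothesis $H_0: \mathbf{y} = \beta_{G}\mathbf{g}_s + \beta_{\theta}\boldsymbol{\theta}_a + \boldsymbol{\epsilon}$.
In this test of interaction, we test a single ancestry versus the other ancestries that may be present in the population of interest. One parameter is constrained under the null, which results in a statistic with k = 1 df. Let $\beta_{G}$, $\beta_{G\times\theta}$, and $\beta_{\theta}$ denote the effect sizes of genotype, interaction, and global ancestry under a given hypothesis, respectively. The statistic is given below.
$$D_{AIT} = -2 \ln \frac{L_0(\hat{\beta}_{G}, \hat{\beta}_{\theta}, \hat{\sigma}_0^2)}{L_1(\hat{\beta}_{G}, \hat{\beta}_{G\times\theta}, \hat{\beta}_{\theta}, \hat{\sigma}_1^2)}$$
where, under the alternative, X is an n × 3 matrix composed of $\mathbf{g}_s$, $\mathbf{g}_s \circ \boldsymbol{\theta}_a$, and $\boldsymbol{\theta}_a$ as columns.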
1-df Ancestry Interaction Test with Local Ancestry (AITL)
Given that the individuals we analyze in this work are assumed to be admixed, there is potential for confounding due to differential LD. An interaction that is not driven by biology could occur because a causal variant may be better tagged by the SNP being tested on one ancestral background than on another (see Figure 3c). We account for the different LD patterns on varying ancestral backgrounds by including local ancestry as an additional covariate in AITL. By including local ancestry, we assume that the SNP being tested is on the same local ancestry block as the causal SNP that it may be tagging. Such an assumption is reasonable because admixture events in populations such as Latinos and African Americans are relatively recent and their genomes have not undergone many recombination events. As a result, local ancestry blocks on average stretch for several hundred kilobases[33,34].
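Let $\boldsymbol{\gamma}^{a}_{s}$ be the vector of local ancestry calls at SNP s for all individuals for ancestry a and let $\mathbf{g}_s \circ \boldsymbol{\gamma}^{a}_{s}$ be the interaction terms from component-wise multiplication of the two vectors. We use the following alternative and null hypotheses:
$$H_1: \mathbf{y} = \beta_{G}\mathbf{g}_s + \beta_{G\times\theta}(\mathbf{g}_s \circ \boldsymbol{\theta}_a) + \beta_{\theta}\boldsymbol{\theta}_a + \beta_{G\times\gamma}(\mathbf{g}_s \circ \boldsymbol{\gamma}^{a}_{s}) + \beta_{\gamma}\boldsymbol{\gamma}^{a}_{s} + \boldsymbol{\epsilon}$$
$$H_0: \mathbf{y} = \beta_{G}\mathbf{g}_s + \beta_{\theta}\boldsymbol{\theta}_a + \beta_{G\times\gamma}(\mathbf{g}_s \circ \boldsymbol{\gamma}^{a}_{s}) + \beta_{\gamma}\boldsymbol{\gamma}^{a}_{s} + \boldsymbol{\epsilon}$$
Here we are testing for an interaction effect, i.e. $\beta_{G\times\theta} \neq 0$, and constrain one parameter under the null, resulting in a statistic with k = 1 df. Let $\beta_{G\times\gamma}$ and $\beta_{\gamma}$ denote the effect sizes of the interaction between genotype and local ancestry and just local ancestry, respectively. The log likelihood ratio statistic is given by
$$D_{AITL} = -2 \ln \frac{L_0(\hat{\beta}_{G}, \hat{\beta}_{\theta}, \hat{\beta}_{G\times\gamma}, \hat{\beta}_{\gamma}, \hat{\sigma}_0^2)}{L_1(\hat{\beta}_{G}, \hat{\beta}_{G\times\theta}, \hat{\beta}_{\theta}, \hat{\beta}_{G\times\gamma}, \hat{\beta}_{\gamma}, \hat{\sigma}_1^2)}$$
where, under the alternative, X is an n × 5 matrix composed of $\mathbf{g}_s$, $\mathbf{g}_s \circ \boldsymbol{\theta}_a$, $\boldsymbol{\theta}_a$, $\mathbf{g}_s \circ \boldsymbol{\gamma}^{a}_{s}$, and $\boldsymbol{\gamma}^{a}_{s}$ as columns. All of these test statistics are straightforwardly modified to jointly incorporate several ancestries in the case of multi-way admixed populations.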
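To make the AITL test concrete, the following dependency-free Python sketch fits the five-column alternative model and the four-column null model by least squares and converts the residual sums of squares into a 1-df likelihood ratio statistic. It is an illustrative sketch under stated assumptions, not the GxTheta implementation: the function names, the simulated allele frequencies (0.6 and 0.2), the sample size, and the deliberately exaggerated interaction effect of 2.0 are all our own choices, and the closed form D = n ln(RSS0/RSS1) is the standard Gaussian result.

```python
import math
import random


def ols_rss(columns, y):
    """Residual sum of squares from least squares of y on the given columns.

    No intercept is added, matching the design matrices described in the text.
    """
    k, n = len(columns), len(y)
    # Normal equations (X'X) b = X'y, stored as an augmented k x (k + 1) matrix.
    a = [[sum(ci[t] * cj[t] for t in range(n)) for cj in columns]
         + [sum(ci[t] * y[t] for t in range(n))] for ci in columns]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
    beta = [a[j][k] / a[j][j] for j in range(k)]
    return sum((y[t] - sum(b * c[t] for b, c in zip(beta, columns))) ** 2
               for t in range(n))


def aitl(g, theta, gamma, y):
    """1-df test of the genotype-by-global-ancestry term, conditioning on
    local ancestry. Returns (D, p-value)."""
    n = len(y)
    g_theta = [a * b for a, b in zip(g, theta)]
    g_gamma = [a * b for a, b in zip(g, gamma)]
    rss1 = ols_rss([g, g_theta, theta, g_gamma, gamma], y)  # alternative, n x 5
    rss0 = ols_rss([g, theta, g_gamma, gamma], y)           # null, n x 4
    d = max(0.0, n * math.log(rss0 / rss1))
    p = math.erfc(math.sqrt(d / 2))  # chi-square survival function, 1 df
    return d, p


# Toy data: 2-way admixed individuals (all parameters are illustrative).
rng = random.Random(42)
n = 3000
theta = [min(1.0, max(0.0, rng.gauss(0.7, 0.2))) for _ in range(n)]
haps = [[rng.random() < t for _ in range(2)] for t in theta]  # True = population 1
gamma = [sum(h) for h in haps]                                 # local ancestry 0/1/2
g = [sum(int(rng.random() < (0.6 if from1 else 0.2)) for from1 in h) for h in haps]

y_interaction = [2.0 * gi * ti + rng.gauss(0, 1) for gi, ti in zip(g, theta)]
y_null = [rng.gauss(0, 1) for _ in range(n)]

d, p = aitl(g, theta, gamma, y_interaction)
d_null, p_null = aitl(g, theta, gamma, y_null)
print(f"interaction present: D = {d:.1f}, p = {p:.2e}")
print(f"no interaction:      D = {d_null:.2f}, p = {p_null:.2f}")
```

Dropping the two local ancestry columns from both fits gives the AIT statistic; extending the column lists with further ancestry components gives the multi-way version mentioned above.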
Simulation Framework
For all our simulations, we simulated 2-way admixed individuals. Global ancestry for ancestral population 1 (θ1) was drawn from a normal distribution with μ = 0.7 and σ = 0.2. Individuals with θ1 > 1 or θ1 < 0 were assigned a value of 1 or 0, respectively. We simulated phenotypes of individuals to investigate our method in three different scenarios: gene-environment interactions, pairwise gene-gene interactions, and false positive interactions due to local differential tagging.
To simulate phenotypes under the situation of a gene-environment interaction, we simulated a single SNP. For each individual i, we assigned the local ancestry, or the number of alleles derived from population 1 ($\gamma^{a}_{i}$), by performing two binomial trials, one for each haplotype, with the probability of success equal to $\theta_{i1}$. We then drew ancestry-specific allele frequencies following the Balding-Nichols model by assuming $F_{ST} = 0.16$ and drawing two ancestral frequencies, $p_1$ and $p_2$, from the following beta distribution[35]:
$$p_1, p_2 \sim \mathrm{Beta}\left(\frac{p(1 - F_{ST})}{F_{ST}}, \frac{(1 - p)(1 - F_{ST})}{F_{ST}}\right)$$
where p is the underlying MAF in the entire population and is set to 0.2. Genotypes were drawn using a binomial trial for each local ancestry haplotype with the probability of success equal to $p_1$ for a haplotype derived from population 1 and $p_2$ for a haplotype derived from population 2. Environmental covariates correlated with θ1, $E_i$, were generated for each individual i by drawing from a normal distribution, $E_i \sim N(\theta_{i1}, \sigma_E^2)$. $\sigma_E$ was varied from 0 to 5 in increments of 0.005 to create $E_i$’s that were correlated with individuals’ global ancestries in varying degrees. We generated phenotypes for individuals assuming only an interaction effect by drawing from a normal distribution, $y_i \sim N(\beta_{G\times E}\, g_i E_i, \sigma^2)$, for a given interaction effect size $\beta_{G\times E}$.
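The gene-environment simulation can be sketched in Python as follows. This is a minimal illustration of the steps described above rather than the code used for the paper: the Balding-Nichols beta parameterization is the standard one, while the unit phenotype noise, the parameter name `sigma_e`, the seed, and the function names are assumptions made for the example.

```python
import random


def balding_nichols_freqs(p, fst, rng):
    """Draw two ancestry-specific allele frequencies around frequency p."""
    a = p * (1 - fst) / fst
    b = (1 - p) * (1 - fst) / fst
    return rng.betavariate(a, b), rng.betavariate(a, b)


def simulate_gxe(n, beta_gxe, sigma_e, rng, p=0.2, fst=0.16):
    """Return (p1, p2, rows); each row is (theta1, genotype, environment, phenotype)."""
    p1, p2 = balding_nichols_freqs(p, fst, rng)
    rows = []
    for _ in range(n):
        theta1 = min(1.0, max(0.0, rng.gauss(0.7, 0.2)))
        g = 0
        for _ in range(2):                     # two haplotypes
            from_pop1 = rng.random() < theta1  # local ancestry of this haplotype
            g += int(rng.random() < (p1 if from_pop1 else p2))
        e = rng.gauss(theta1, sigma_e)         # environment tracks ancestry
        y = rng.gauss(beta_gxe * g * e, 1.0)   # unit noise SD is an assumption
        rows.append((theta1, g, e, y))
    return p1, p2, rows


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(7)
p1, p2, tight = simulate_gxe(3000, 0.2, 0.05, rng)  # little noise around theta
_, _, loose = simulate_gxe(3000, 0.2, 1.0, rng)     # heavy noise around theta

r_tight = pearson([r[0] for r in tight], [r[2] for r in tight])
r_loose = pearson([r[0] for r in loose], [r[2] for r in loose])
print(f"p1 = {p1:.3f}, p2 = {p2:.3f}")
print(f"corr(theta1, E): {r_tight:.2f} (sigma_e = 0.05), {r_loose:.2f} (sigma_e = 1.0)")
```

Increasing the noise parameter lowers the correlation between θ1 and E, which is the quantity on the horizontal axis of the power curves in Figure 2.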
To simulate phenotypes based on gene-gene interactions, we simulated two SNPs. At both SNPs, we assigned the local ancestry values as described for the gene-environment case. We assigned genotypes for individuals at the first SNP assuming an allele frequency of 0.5 for both populations and drawing from two binomial trials. We assigned genotypes at the second SNP over a wide range of ancestry-specific allele frequencies to simulate different levels of SNP differentiation. Ancestry-specific allele frequencies were initially set to p1 = p2 = 0.5; we then iteratively increased p1 by 0.005 while simultaneously decreasing p2 by 0.005 until p1 = 0.95 and p2 = 0.05. Genotypes at the second SNP were drawn using the same approach described for gene-environment. Using the simulated genotypes, phenotypes were drawn from a normal distribution, $y_i \sim N(\beta_{G\times G}\, g_{i1} g_{i2}, \sigma^2)$, where $g_{is}$ is the genotype for individual i at the simulated SNP s.
To simulate the scenario of differential LD on different ancestral backgrounds leading to false positives, we simulated phenotypes based on a single causal SNP that was tagged by another SNP. At both SNPs, local ancestries were assigned as described previously and genotypes were drawn using ancestry-specific allele frequencies. Ancestral allele frequencies were assigned such that the average r² between the causal and tag SNP was 0.272 on the background of ancestral population 1 and 0.024 on the background of ancestral population 2. Thus, the tag SNP was only a tag on the population 1 background and not on the population 2 background. Phenotypes were drawn from a normal distribution, $y_i \sim N(\beta_{Causal}\, g_{ic}, \sigma^2)$, assuming no interaction and βCausal = 0.7, where $g_{ic}$ is the genotype of individual i at the causal variant.
We implemented our approach in an R package (GxTheta), which is available for download at http://www.scandb.org/newinterface/GxTheta.html
Data Normalization
Gene Expression Normalization
Gene expression data (see Results) were first standardized for each gene such that mean expression was 0 and variance was 1. We then computed a covariance matrix of individuals’ expression values and performed PCA on the covariance matrix. Residuals were computed for all expression values by adjusting for the top 10 principal components and the mean for each gene was added back to the residuals. Due to the high dynamic range of gene expression compared to methylation, we conservatively chose to additionally perform quantile normalization. We then sorted the gene expression residuals and used the quantiles of their rank order to draw new expression values from a normal distribution, $N(0, 1)$, by using the inverse cumulative distribution function[24,25].
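The final quantile normalization step can be sketched as a rank-based inverse normal transform. The Python example below is illustrative only: the rank offset of 0.5, the tie handling (ties keep their input order), and the function name are our assumptions, since the text does not specify them.

```python
from statistics import NormalDist


def inverse_normal_transform(values):
    """Map values to standard normal quantiles according to their rank order."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices, smallest first
    std_normal = NormalDist()                          # mean 0, SD 1
    out = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # (rank - 0.5) / n keeps every quantile strictly inside (0, 1).
        out[idx] = std_normal.inv_cdf((rank - 0.5) / n)
    return out


# Toy residuals for one gene, including one extreme value.
residuals = [2.3, -0.7, 15.0, 0.1, -4.2]
transformed = inverse_normal_transform(residuals)
print([round(v, 2) for v in transformed])
```

The transform preserves the ordering of individuals while removing the influence of extreme values, such as the 15.0 in the toy input.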
Methylation Data Normalization
Raw methylation values (see Results) were first normalized using Illumina’s control probe scaling procedures. All probes with median methylation less than 1% or greater than 99% were removed and the remaining probes were logit-transformed as previously described[36]. To control for extreme outliers, we truncated the distribution of methylation values. For a given probe, we first computed the mean and standard deviation of the methylation values. We then set any methylation values deviating more than 2.58 standard deviations from the mean to the methylation value corresponding to the 99.5th quantile.
Availability of Supporting Data
The Coriell data is available from dbGaP under accession number phs000211.v1.p1. The GALA and SAGE data is available by emailing the study organizers at https://pharm.ucsf.edu/gala/contact
Competing Interests
The authors declare that they have no competing interests.
Authors’ Contributions
DSP, IE, EK, EE, EH and NZ designed research. DSP, IE, EK, and NZ performed research. DSP, IE, EK, EE, CE, CRG, JMG, EG, HA, CJY, EE, EH, and NZ contributed new reagents/analytic tools. DSP, ERG, and NZ wrote the manuscript. All authors read and approved the final manuscript.
Description of Additional Data Files
The following data are available with the online version of this paper. The supplementary material contains QQ-plots for the simulations and real analyses performed as well as a table containing p-values for the 2-component ancestry analysis of the GALA methylation data.
Acknowledgements
We would like to thank Lancelote Leong for his helpful manuscript comments.