Abstract
Genome-wide association (GWA) studies have identified thousands of significant genetic associations in humans across a number of complex traits. However, the majority of these studies focus on linear additive relationships between genotypic and phenotypic variation. Epistasis, or non-additive genetic interactions, has been identified as a major driver of both complex trait architecture and evolution in multiple model organisms; yet, this same phenomenon is not considered to be a significant factor underlying human complex traits. There are two possible reasons for this assumption. First, most large GWA studies are conducted solely with European cohorts; therefore, our understanding of broad-sense heritability for many complex traits is limited to just one ancestry group. Second, current epistasis mapping methods commonly identify significant genetic interactions by exhaustively searching across all possible pairs of SNPs. In these frameworks, estimated epistatic effect sizes are often small and power can be low due to the multiple testing burden. Here, we present a case study that uses a novel region-based mapping approach to analyze sets of variants for the presence of epistatic effects across six diverse subgroups within the UK Biobank. We refer to this method as the “MArginal ePIstasis Test for Regions” or MAPIT-R. Even with limited sample sizes, we find a total of 245 pathways within the KEGG and REACTOME databases that are significantly enriched for epistatic effects in height and body mass index (BMI), with 67% of these pathways being detected within individuals of African ancestry. As a secondary analysis, we introduce a novel region-based “leave-one-out” approach to localize pathway-level epistatic signals to specific interacting genes in BMI.
Overall, our results indicate that non-European ancestry populations may be better suited for the discovery of non-additive genetic variation in human complex traits — further underscoring the need for publicly available, biobank-sized datasets of diverse groups of individuals.
Introduction
Genome-wide association (GWA) studies are a powerful tool for understanding the genetic architecture of complex traits and phenotypes [1–8]. The most common approach for conducting GWA studies is to use a linear mixed model to test for statistical associations between individual genetic variants and a phenotype of interest; here, the estimated regression coefficients represent an additive relationship between the number of copies of a single-nucleotide polymorphism (SNP) and the phenotypic state. While this approach has produced many statistically significant additive associations, it is less amenable to detecting nonlinear genetic associations that also contribute to a trait’s genetic architecture. Epistasis, commonly defined as the nonlinear, or non-additive, interaction between multiple genetic variants, is a well-established phenomenon in a number of model organisms [9–18]. Epistasis has also been suggested as a major driver of both phenotypic variation and evolution [19–26]. Still, there remains skepticism and controversy regarding the importance of epistasis in human complex traits and diseases [27–34]. For example, multiple studies have suggested that phenotypic variation can be mainly explained with additive effects [27, 28, 32], although this hypothesis has been challenged recently [35]. In initial work to locate the “missing heritability” in the human genome (the discrepancy between larger pedigree-based trait heritability estimates and smaller SNP-based trait heritability estimates using the first wave of human GWA study results [36–38]), it was suggested that epistasis may account for a significant portion of this observed discrepancy [24, 39, 40]. However, other studies have posited that, for at least some human phenotypes, genetic interactions are unlikely to be a major contributor to total heritability [34, 41, 42].
Algorithmically, detecting statistically significant epistatic signals via genome-wide scans is much more computationally expensive than the traditional hypothesis-generating GWA framework. GWA tests for additive effects are linear in the number of SNPs, while epistasis scans usually consider, at a minimum, all pairwise combinations of SNPs (e.g., a total of J(J − 1)/2 possible pairwise combinations for J variants in a study). Methods that fall within the MArginal ePIstasis Test (MAPIT) framework [43–46] attempt to address these challenges by alternatively testing for marginal epistasis. Specifically, instead of directly identifying individual pairwise or higher-order interactions, these approaches focus on identifying variants that have a non-zero interaction effect with any other variant in the data. Indeed, analyzing epistasis among pairs of SNPs can be underpowered in GWA studies, particularly when applied to polygenic traits or traits which are generated by many mutations of small effect [4, 47–49].
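To make the scaling argument concrete, the short sketch below compares the number of additive tests with the number of pairwise tests for a few SNP counts, along with the Bonferroni threshold an exhaustive pairwise scan would face. The SNP counts and the 0.05 family-wise level are illustrative assumptions, not values taken from this study.

```python
from math import comb


def n_pairwise_tests(num_snps: int) -> int:
    """Number of unordered SNP pairs, J(J - 1)/2."""
    return comb(num_snps, 2)


def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test p-value threshold under a Bonferroni correction (alpha is assumed)."""
    return alpha / num_tests


for num_snps in (1_000, 100_000, 500_000):  # illustrative values of J
    pairs = n_pairwise_tests(num_snps)
    print(
        f"J={num_snps:>7,}  additive tests={num_snps:,}  "
        f"pairwise tests={pairs:,}  "
        f"pairwise threshold={bonferroni_threshold(pairs):.2e}"
    )
```

Testing a few hundred or a few thousand predefined regions instead of all SNP pairs keeps the correction many orders of magnitude milder, which is the motivation for the region-based test.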
To overcome this limitation, more recent computational approaches have expanded the additive GWA framework to aggregate across multiple SNP-level association signals and test for the enrichment of genes and pathways [50–61]. In Nakka et al. [62] we showed that enrichment analyses applied to multiple ancestries can identify genes and gene networks contributing to disease risk that ancestry-specific enrichment analyses fail to find. Recent multiethnic GWA studies have also found that using non-European populations offers new insights into additive genetic architecture [63–70]. However, despite this growing body of work and increasing efforts to promote conducting GWA studies in diverse ancestries [68, 71–75], few studies have investigated the role of epistasis in shaping multiethnic human genetic variation (but see [76–79]). Expanding epistasis studies to include non-European ancestries, as well as to aggregate over multiple SNP-level signals, may reveal a new understanding of non-additive genetic architecture in human complex traits.
In this study, our objective is to expand the marginal epistasis framework from individual SNPs to user-specified sets of variants (e.g., genes, signaling pathways) and apply the framework to multiple, diverse human ancestries. We aim to detect novel interactions between biologically relevant disease mechanisms underlying complex traits and to analyze multiple human ancestries, all while reducing the multiple testing burden that traditionally hinders exhaustive epistatic scans. We implement our new approach in “MArginal ePIstasis Test for Regions”, which we refer to as MAPIT-R. We apply MAPIT-R using pathway annotations from the “Kyoto Encyclopedia of Genes and Genomes” (KEGG) and REACTOME databases [80] to standing height and body mass index (BMI) assayed in individuals from multiple human ancestry “subgroups” (British, African, Caribbean, Chinese, Indian, and Pakistani) in the UK Biobank [81]. Spanning across all these subgroups, we find more than 200 pathways that have significant marginal epistatic effects on standing height and BMI. We then investigate the distribution of these significant non-additive signals across ancestries, traits, and pathways, finding future directions to prioritize for studies of epistasis in human complex traits.
Materials and Methods
Overview of the MAPIT-R Model
We describe the intuition behind the “MArginal ePIstasis Test for Regions” (MAPIT-R) in detail here. Consider a genome-wide association (GWA) study with $N$ individuals. Within this study, we assume that we have an $N$-dimensional vector of quantitative traits $\mathbf{y}$, an $N \times J$ matrix of genotypes $\mathbf{X}$, with $J$ denoting the number of single nucleotide polymorphisms (SNPs) encoded as {0, 1, 2} copies of a reference allele at each locus, and a list of $L$ predefined genomic regions of interest $\{\mathcal{R}_1, \ldots, \mathcal{R}_L\}$. We will let each genomic region $l$ represent a known collection of annotated SNPs $j \in \mathcal{R}_l$ with set cardinality $|\mathcal{R}_l|$. In this work, each $\mathcal{R}_l$ includes sets of SNPs that fall within functional regions of genes that have been annotated as being members of certain pathways or gene sets (see Supplementary Note). Recall that our objective is to test whether a set of biologically relevant variants has a nonzero interaction effect with any other region along the genome. Therefore, MAPIT-R works by examining one region at a time (indexed by $l$) and fits the following linear mixed model

$$\mathbf{y} = \mu + \mathbf{Z}\boldsymbol{\delta} + \sum_{j \in \mathcal{R}_l} \mathbf{x}_j \beta_j + \sum_{k \notin \mathcal{R}_l} \mathbf{x}_k \beta_k + \sum_{j \in \mathcal{R}_l} \sum_{k \notin \mathcal{R}_l} (\mathbf{x}_j \circ \mathbf{x}_k)\, \theta_{jk} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I}), \tag{1}$$

where $\mu$ is an intercept term; $\mathbf{Z}$ is a matrix of covariates (e.g., the top principal components from the genotype matrix) with coefficients $\boldsymbol{\delta}$; $\sum_{j \in \mathcal{R}_l} \mathbf{x}_j \beta_j$ is the summation of region-specific effects with corresponding additive effect sizes $\beta_j$ for the $j$-th variant; $\mathbf{x}_j$ is an $N$-dimensional genotypic vector for the $j$-th variant in the $l$-th region that is the focus of the model; $\mathbf{m}_l = \sum_{k \notin \mathcal{R}_l} \mathbf{x}_k \beta_k$ is the combined additive effects from all other $J - |\mathcal{R}_l|$ SNPs in the data that have not been annotated as being within the $l$-th region of interest, with coefficients $\beta_k$; $\mathbf{x}_k$ is an $N$-dimensional genotypic vector for the $k$-th variant in the data that has not been annotated as being within the $l$-th region of interest; $\mathbf{g}_l = \sum_{j \in \mathcal{R}_l} \sum_{k \notin \mathcal{R}_l} (\mathbf{x}_j \circ \mathbf{x}_k)\, \theta_{jk}$ is the summation of all pairwise interaction effects (i.e., the Hadamard product $\mathbf{x}_j \circ \mathbf{x}_k$) between the $j$-th variant in the $l$-th annotated region $\mathcal{R}_l$ and all other $k \neq j$ variants outside of $\mathcal{R}_l$, with corresponding coefficients $\theta_{jk}$; and $\boldsymbol{\varepsilon}$ is a normally distributed error term with mean zero and independent residual error variance scaled by the component $\tau^2$. There are a few important takeaways from this formulation of MAPIT-R. First, the term $\mathbf{m}_l$ effectively represents the polygenic background of all variants except for those that have been annotated for the $l$-th region of interest. Second, and most importantly, the term $\mathbf{g}_l$ is the main focus of the model and represents the marginal epistatic effect of the region $\mathcal{R}_l$ [43, 44]. It is important to note that each component of the model will change with every new region that is considered.
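For convenience, we assume that both the genotype matrix (column-wise) and the trait of interest have been mean-centered and standardized to have unit variance. Next, because the model in Eq. (1) is an underdetermined linear system ($J > N$), we ensure identifiability by assuming that the individual regression coefficients follow univariate normal distributions where

$$\beta_j \sim \mathcal{N}\!\left(0, \frac{\nu^2}{|\mathcal{R}_l|}\right), \qquad \beta_k \sim \mathcal{N}\!\left(0, \frac{\omega^2}{J - |\mathcal{R}_l|}\right), \qquad \theta_{jk} \sim \mathcal{N}\!\left(0, \frac{\sigma^2}{|\mathcal{R}_l|\,(J - |\mathcal{R}_l|)}\right).$$

With the assumption of normally distributed effect sizes, the MAPIT-R model defined in Eq. (1) becomes a multiple variance component model

$$\mathbf{y} = \mu + \mathbf{Z}\boldsymbol{\delta} + \sum_{j \in \mathcal{R}_l} \mathbf{x}_j \beta_j + \mathbf{m}_l + \mathbf{g}_l + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I}), \tag{2}$$

where $\sum_{j \in \mathcal{R}_l} \mathbf{x}_j \beta_j \sim \mathcal{N}(\mathbf{0}, \nu^2 \mathbf{K}_l)$ with $\mathbf{K}_l = \mathbf{X}_{\mathcal{R}_l} \mathbf{X}_{\mathcal{R}_l}^{\top} / |\mathcal{R}_l|$ being the genetic relatedness matrix computed using genotypes from all variants within the region of interest (the columns $\mathbf{X}_{\mathcal{R}_l}$ of $\mathbf{X}$); $\mathbf{m}_l \sim \mathcal{N}(\mathbf{0}, \omega^2 \mathbf{V}_l)$ with $\mathbf{V}_l = \mathbf{X}_{-\mathcal{R}_l} \mathbf{X}_{-\mathcal{R}_l}^{\top} / (J - |\mathcal{R}_l|)$ being the genetic relatedness matrix computed using genotypes outside the region of interest (the remaining columns $\mathbf{X}_{-\mathcal{R}_l}$); and $\mathbf{g}_l \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{G}_l)$ with $\mathbf{G}_l = \mathbf{K}_l \circ \mathbf{V}_l$ representing a second-order interaction relationship matrix, which is obtained by using the Hadamard product (i.e., the element-wise multiplication) between the region-specific relatedness matrix and its corresponding polygenic background. Importantly, the variance component $\sigma^2$ effectively captures the marginal epistatic effect for the $l$-th region. Even though we limit ourselves to the task of identifying second-order (i.e., pairwise) epistatic relationships between sets of SNPs in this paper, extensions to higher-order and gene-by-environment interactions are straightforward to implement for alternative analyses [43, 45, 82–84].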
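As a rough illustration of the three covariance matrices described above, the following sketch builds a region-specific relatedness matrix, the relatedness matrix of the polygenic background, and their Hadamard product from a toy genotype matrix. The function names, the simulated genotypes, and the normalization by the number of SNPs in each set are assumptions made for this example; this is not the MAPITR package code.

```python
import numpy as np


def standardize_columns(X):
    """Mean-center each SNP column and scale it to unit variance.

    Assumes no monomorphic SNPs (a constant column would divide by zero).
    """
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)


def region_covariances(X, region_idx):
    """Return (K, V, G) for the SNP columns listed in region_idx."""
    in_region = np.zeros(X.shape[1], dtype=bool)
    in_region[np.asarray(region_idx)] = True
    X_in, X_out = X[:, in_region], X[:, ~in_region]
    K = X_in @ X_in.T / X_in.shape[1]     # relatedness from SNPs inside the region
    V = X_out @ X_out.T / X_out.shape[1]  # relatedness from all remaining SNPs
    G = K * V                             # Hadamard (element-wise) product
    return K, V, G


rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(50, 200))  # toy {0, 1, 2} genotype matrix
X = standardize_columns(genotypes)
K, V, G = region_covariances(X, region_idx=range(20))
print(K.shape, V.shape, G.shape)
```

All three matrices are N × N, so the cost of the test depends on the number of individuals and the number of regions rather than on the number of SNP pairs.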
Hypothesis Testing with the MAPIT-R Framework
In this section, we describe how to perform joint estimation of all the variance component parameters in the MAPIT-R model. Since our goal is to identify genomic regions that have significant non-zero interaction effects on a given phenotype, we examine each annotated SNP-set $l = 1, \ldots, L$ in turn, and test the null hypothesis in Eq. (1) and Eq. (2) that $H_0: \sigma^2 = 0$. We make use of the MQS method for parameter estimation and hypothesis testing [83]. Briefly, MQS is based on the computationally efficient method of moments and produces estimates that are mathematically identical to the Haseman-Elston (HE) cross-product regression [85]. To estimate the variance components with MQS, we first regress out the additive effects of the $l$-th SNP-set, the fixed covariates, and the intercept term. Equivalently, we multiply both sides of Eq. (1) by a projection (hat) matrix such that the model becomes orthogonal to the column space of the regressed-out terms. Specifically, we define $\mathbf{H} = \mathbf{I} - \mathbf{B}(\mathbf{B}^{\top}\mathbf{B})^{-1}\mathbf{B}^{\top}$, where $\mathbf{B}$ is a concatenated matrix of the regressed-out terms, with $\mathbf{1}$ being an $N$-dimensional vector of ones. This yields a simplified model in which $\mathbf{y}^* = \mathbf{H}\mathbf{y}$ is the projected phenotype of interest; each remaining random effect is projected in the same way, with, for example, $\mathbf{g}_l^* = \mathbf{H}\mathbf{g}_l \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{G}_l^*)$ and $\mathbf{G}_l^* = \mathbf{H}\mathbf{G}_l\mathbf{H}$; and $\boldsymbol{\varepsilon}^* = \mathbf{H}\boldsymbol{\varepsilon}$ is the projected residual error. Lastly, for each annotation considered, the MQS estimate for the marginal epistatic effect is computed as

$$\hat{\sigma}^2 = \mathbf{y}^{*\top} \left( \sum_{k} (\mathbf{S}_l^{-1})_{ck}\, \boldsymbol{\Sigma}_{lk} \right) \mathbf{y}^*,$$

where $\mathbf{S}_l$ is the matrix with elements $(\mathbf{S}_l)_{jk} = \mathrm{tr}(\boldsymbol{\Sigma}_{lj} \boldsymbol{\Sigma}_{lk})$ for the covariance matrices $\boldsymbol{\Sigma}_{l1}, \boldsymbol{\Sigma}_{l2}, \ldots$ of the projected model, and $c$ is the index of the epistatic covariance matrix $\mathbf{G}_l^*$ among them. Here, $\mathrm{tr}(\cdot)$ is used to denote the matrix trace function. It has been well established that the marginal variance component estimate $\hat{\sigma}^2$ follows a mixture of chi-square distributions under the null hypothesis because of its quadratic form and the assumed normally distributed trait $\mathbf{y}$ [43, 53, 86–89]. Namely, $\hat{\sigma}^2 \sim \sum_{i=1}^{n} \lambda_i \chi^2_{1,i}$, where $\chi^2_{1,i}$ are chi-square random variables with one degree of freedom and $(\lambda_1, \ldots, \lambda_n)$ are the eigenvalues of a matrix [43, 83] that depends on the MQS estimates of $(\nu^2, \omega^2, \tau^2)$ under the null hypothesis. Several approximation and exact methods have been suggested to obtain p-values under the distribution of $\hat{\sigma}^2$. In this paper, we use the Davies exact method [87, 90].
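The projection and moment-matching steps can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the MAPITR implementation: it assumes the projection removes the intercept, the covariates, and the region's own SNPs, and that three covariance matrices (the projected interaction matrix, the projected background relatedness matrix, and the projected identity) remain. All data are simulated, and every function and variable name is hypothetical.

```python
import numpy as np


def projection_matrix(B):
    """Return H = I - B (B^T B)^{-1} B^T, which removes the column space of B."""
    n = B.shape[0]
    return np.eye(n) - B @ np.linalg.solve(B.T @ B, B.T)


def mqs_estimates(y, covariances):
    """Method-of-moments estimates solving S @ theta = q, where
    S[j, k] = tr(Sigma_j Sigma_k) and q[j] = y^T Sigma_j y."""
    S = np.array([[np.sum(A * C.T) for C in covariances] for A in covariances])
    q = np.array([y @ A @ y for A in covariances])
    return np.linalg.solve(S, q)


# Toy data (simulated; all sizes are arbitrary assumptions).
rng = np.random.default_rng(1)
n, n_snps, region = 100, 300, np.arange(10)
X = rng.integers(0, 3, size=(n, n_snps)).astype(float)
X = (X - X.mean(axis=0)) / X.std(axis=0)
Z = rng.normal(size=(n, 2))  # stand-in covariates (e.g., top PCs)
y = rng.normal(size=n)       # phenotype simulated with no genetic signal

X_in = X[:, region]
X_out = np.delete(X, region, axis=1)
K = X_in @ X_in.T / X_in.shape[1]
V = X_out @ X_out.T / X_out.shape[1]
G = K * V

# Assumed contents of B: intercept, covariates, and the region's SNPs.
B = np.column_stack([np.ones(n), Z, X_in])
H = projection_matrix(B)
y_star = H @ y
sigma2_hat, omega2_hat, tau2_hat = mqs_estimates(
    y_star, [H @ G @ H, H @ V @ H, H]
)
print(f"marginal epistatic variance estimate: {sigma2_hat:.4f}")
```

The p-value step is deliberately left out of this sketch, because it requires the null eigenvalues and a routine for mixtures of chi-square distributions such as the Davies method.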
Software Availability
Code for implementing the “MArginal ePIstasis Test for Regions” (MAPIT-R) is freely available in R/Rcpp and is located at https://cran.r-project.org/web/packages/MAPITR/index.html. All MAPIT-R functions use the CompQuadForm R package to compute p-values with the Davies method. Note that the Davies method can sometimes yield a p-value that exactly equals 0. This can occur when the true p-value is extremely small [91]. In this case, we report p-values as being truncated at $1 \times 10^{-10}$. Alternatively, one could also compute p-values for all MAPIT-R based functions using Kuonen’s saddlepoint method [91, 92] or Satterthwaite’s approximation equation [93].
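The truncation rule above amounts to flooring reported p-values before they are tabulated or log-transformed. A minimal sketch, with hypothetical function names, is:

```python
import math

P_VALUE_FLOOR = 1e-10  # reporting floor stated in the text


def truncate_p_value(p, floor=P_VALUE_FLOOR):
    """Replace p-values below the floor (including exact zeros) with the floor."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    return max(p, floor)


def neg_log10_p(p):
    """-log10 of the truncated p-value; finite even when the raw value is 0."""
    return -math.log10(truncate_p_value(p))


print(truncate_p_value(0.0), neg_log10_p(0.0))
```

Applying the floor before the log transform avoids infinite values in plots of −log10 p-values.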
SNP-Set and Pathway Annotations
To create appropriate pathway annotations for MAPIT-R, we first assign SNPs to genes and then aggregate the genes together according to pathway definitions provided by the KEGG and REACTOME databases, respectively. KEGG and REACTOME pathway definitions were downloaded and extracted from the Broad Institute’s Molecular Signatures Database (MSigDB; https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp#C2) under the collection “C2: Curated Gene Sets” [80]. SNPs were annotated using Annovar [94] and were then mapped to a given gene if they were exonic, intronic, in the 5’ and 3’ UTRs, or within 20kb upstream or downstream of the gene.
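The SNP-to-gene-to-pathway aggregation can be sketched as follows. This is a positional simplification: it assigns a SNP to a gene when it lies within the gene body or the 20kb flanks, whereas the study relied on Annovar's functional annotations. The gene names, coordinates, SNP IDs, and pathway names are placeholders invented for the example.

```python
WINDOW_BP = 20_000  # 20kb flank on each side of the gene, as described above


def snps_for_gene(gene, snps, window=WINDOW_BP):
    """IDs of SNPs on the gene's chromosome within [start - window, end + window]."""
    chrom, start, end = gene
    return {
        snp_id
        for snp_id, (snp_chrom, pos) in snps.items()
        if snp_chrom == chrom and start - window <= pos <= end + window
    }


def pathway_snp_sets(pathways, genes, snps):
    """Union of gene-level SNP sets for every pathway; unknown genes are skipped."""
    annotated = {}
    for name, members in pathways.items():
        snp_set = set()
        for gene_name in members:
            if gene_name in genes:
                snp_set |= snps_for_gene(genes[gene_name], snps)
        annotated[name] = snp_set
    return annotated


# Placeholder annotations (not real genes or variants).
snps = {
    "rs1": ("1", 95_000), "rs2": ("1", 150_000), "rs3": ("1", 230_001),
    "rs4": ("2", 150_000), "rs5": ("1", 80_000),
}
genes = {"GENE_A": ("1", 100_000, 200_000), "GENE_B": ("2", 140_000, 160_000)}
pathways = {
    "PATHWAY_X": ["GENE_A"],
    "PATHWAY_Y": ["GENE_A", "GENE_B", "GENE_MISSING"],
}

regions = pathway_snp_sets(pathways, genes, snps)
for name, snp_set in regions.items():
    print(name, sorted(snp_set))
```

Because a gene can belong to several pathways, the resulting SNP sets may overlap across pathways.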
UK Biobank Data
To create the UK Biobank population subgroups used in this study (UK Biobank Application Number 2241), we first extracted and grouped individuals by the self-identified ancestries of “African”, “British”, “Caribbean”, “Chinese”, “Indian”, and “Pakistani”. For the British subgroups, five non-overlapping sets of N = 4,000 individuals and five non-overlapping sets of N = 10,000 individuals were created, with one set from each sample size being used for “primary analyses” and the remaining four being used for the “replication analyses”. Standard quality control procedures were applied to each population subgroup (see Supplementary Note for details). “Local” principal component analysis (PCA) was conducted to confirm ancestry groupings and to remove outliers. We refer to conducting PCA on each subgroup separately as “local” PCA to help distinguish from the alternative setup of conducting PCA on the entire dataset jointly, which we refer to as “global” PCA (see Supplementary Figure 1). Note that the genetic data we used in this study were the directly genotyped variant sets from the UK Biobank after running imputation of missing genotypes on the University of Michigan Imputation Server [95]. Here, imputation was conducted manually with an ancestry-diverse and sample-size balanced reference panel (1000G Phase 3 v5). For details on the final UK Biobank dataset, see Supplementary Tables 1 and 2. Lastly, both the standing height and body mass index (BMI) traits were adjusted for age, gender, and assessment center. Following previous pipelines [33, 96], each dataset was first divided into males and females. Age was then regressed out within each sex, and the resulting residuals were inverse normalized. These normalized values were then combined back together and assessment center designations were regressed out. The top 10 “local” principal components (PCs) were included as covariates during the actual MAPIT-R analyses.
In total we conducted 24 different analyses (2 pathway databases, 2 phenotypes, 6 population subgroups), which we refer to as ‘database-phenotype-subgroup’ combinations. Lastly, for analyses using permuted phenotypes, permutations were conducted within-subgroup and done by randomly reassigning phenotypes to individuals.
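The phenotype adjustment and permutation steps described above can be sketched as follows. This is an illustrative sketch under assumptions rather than the study's pipeline: the rank offset of 0.5 in the inverse normal transform, the treatment of assessment center as a categorical variable, and all simulated values are choices made for the example.

```python
import random
from statistics import NormalDist, mean


def residualize_on_covariate(y, x):
    """Residuals from a least-squares regression of y on x with an intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return [yi - (my + slope * (xi - mx)) for xi, yi in zip(x, y)]


def inverse_normal(values):
    """Rank-based inverse normal transform (0.5 rank offset is an assumed convention)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        out[i] = NormalDist().inv_cdf((rank - 0.5) / n)
    return out


def residualize_on_groups(y, groups):
    """Regressing on group indicators is equivalent to removing each group's mean."""
    by_group = {}
    for yi, g in zip(y, groups):
        by_group.setdefault(g, []).append(yi)
    group_mean = {g: mean(v) for g, v in by_group.items()}
    return [yi - group_mean[g] for yi, g in zip(y, groups)]


def adjust_phenotype(pheno, age, sex, center):
    """Within-sex age adjustment and normalization, then assessment-center adjustment."""
    adjusted = [0.0] * len(pheno)
    for s in set(sex):
        idx = [i for i, v in enumerate(sex) if v == s]
        resid = residualize_on_covariate([pheno[i] for i in idx], [age[i] for i in idx])
        for i, z in zip(idx, inverse_normal(resid)):
            adjusted[i] = z
    return residualize_on_groups(adjusted, center)


def permute_within_subgroup(pheno, seed=0):
    """Randomly reassign phenotypes to individuals of a single subgroup."""
    shuffled = list(pheno)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Simulated toy subgroup (values are not from the UK Biobank).
rng = random.Random(1)
n = 200
sex = [rng.choice(["F", "M"]) for _ in range(n)]
age = [rng.uniform(40, 70) for _ in range(n)]
center = [rng.choice(["c1", "c2", "c3"]) for _ in range(n)]
height = [165 + (8 if s == "M" else 0) - 0.1 * a + rng.gauss(0, 6) for s, a in zip(sex, age)]

adjusted = adjust_phenotype(height, age, sex, center)
permuted = permute_within_subgroup(adjusted, seed=42)
print(len(adjusted), round(mean(adjusted), 6))
```

Permuting within a subgroup keeps the phenotype distribution fixed while breaking its link to genotypes, which is what makes the permuted analyses usable as a null.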
Figure 1. Subgroups in the UK Biobank were defined by self-identified ancestry: “African”, “British”, “Caribbean”, “Chinese”, “Indian”, and “Pakistani” (see legend to the right of panel (b)). Genome-wide significance was determined by using Bonferroni-corrected p-value thresholds based on the number of pathways tested in each database-phenotype-subgroup combination (see Supplementary Table 1). Across all database-phenotype combinations, the African subgroup has the largest number of significant pathways. For lists of the specific significant pathways per database-phenotype-subgroup combination, see Supplementary Table 3. Results from running MAPIT-R with REACTOME database pathways can be found in Supplementary Figure 2.
Results
Multiethnic Analyses Enable the Detection of Pathway-Level Interactions
We applied MAPIT-R to height and body mass index (BMI) to detect pathways from the KEGG and REACTOME databases [80] with significant epistatic interactions with other regions on the genome, using genotype data and diverse individuals from the UK Biobank. We focused on height and BMI due to the extensive work that has already been done investigating the broad-sense and narrow-sense heritabilities of these traits [29, 41, 97–100], and we used the KEGG and REACTOME databases because they cover an extensive range of both biological processes and pathway-sizes (measured in SNP counts). We analyzed six different human ancestry subgroups that we extracted from the UK Biobank: African (N = 3111), British (N = 3848, chosen randomly from the full N = 472,218 cohort), Caribbean (N = 3833), Chinese (N = 1448), Indian (N = 5077), and Pakistani (N = 1581) (Supplementary Figure 1 and Supplementary Tables 1-2). Subgroups were extracted based on self-identified ancestry and individuals were filtered using standard quality control procedures (see Materials and Methods and Supplementary Note for details). In total, we conducted 24 different analyses (i.e., 2 pathway databases, 2 phenotypes, 6 population subgroups), which we refer to as ‘database-phenotype-subgroup’ combinations.
Applying MAPIT-R to height and BMI within each ancestry subgroup, we find a total of 245 enriched pathways that have genome-wide significant signals for marginal epistatic interactions with the rest of the genome (Figure 1, Supplementary Figure 2, and Supplementary Table 3). Here, p-value significance thresholds were determined by using Bonferroni correction based on the number of pathways tested per analysis (see Supplementary Table 1). Overall, a similar number of pathways were statistically enriched between the KEGG and REACTOME databases (130 and 115, respectively); however, we find that BMI yields more non-additive genetic signal than height (155 versus 90 significant pathways, respectively). Across each ancestry-specific subgroup, our findings overlap with results from other work showing evidence for the importance of epistasis in human immunity, particularly involving the Major Histocompatibility Complex (MHC) [101–107], as well as the key roles metabolic processes and cellular signaling play in trait architecture for model systems [108–114]. Most notably, however, the majority of our results occurred within the African subgroup: 165 out of 245 significant pathways across all analyses.
Focusing on the African subgroup, the enriched pathways represent multiple biologically relevant themes in both height and BMI (Table 1 and Supplementary Table 3). When analyzing height with annotations from the KEGG database, we find that most of the statistically significant marginal epistatic interactions occur in pathways related to canonical signaling cascades, functions within the immune system, and sets of genes that affect heart conditions. Previous multiethnic GWA studies of height have found additive associations with cytokine genes [115] and WNT/beta-catenin signaling [116]. Results from MAPIT-R suggest that non-additive interactions involving cytokine receptors (p-value = $2.84 \times 10^{-8}$) and genes within the WNT-signaling pathway (p-value = $6.54 \times 10^{-6}$) also contribute to the complex genetic architecture of height. In BMI, we find similar themes, as well as multiple statistically significant signals from metabolic pathways (Table 1). Notably, MAPIT-R identified pathways related to ErbB signaling (p-value = $3.30 \times 10^{-7}$) and ether lipid metabolism (p-value = $1.41 \times 10^{-4}$) as having significant marginal epistatic effects, both of which have also been shown to have additive associations with BMI [96, 117, 118].
Table 1. The biological themes include cellular signaling, immune system, heart condition, and metabolism. Notably enriched pathways for each biological theme are included in the second column. For each pathway, MAPIT-R p-values, highlighted gene associations, and references for each gene association are shown for both height (third, fourth, and fifth columns) and BMI (sixth, seventh, and eighth columns). Genome-wide significance was determined by using Bonferroni-corrected p-value thresholds based on the number of pathways tested in each database-phenotype-subgroup combination (Supplementary Table 1). “Highlighted Genes” and “References” were determined using relevant SNP association citations from the GWAS Catalog (version 1.0.2) [7]. For a full list of MAPIT-R significant pathways in all database-phenotype-subgroup combinations, see Supplementary Table 3. NS indicates that a pathway was not genome-wide significant for a given phenotype.
It is important to note that, in our analyses, the African subgroup has neither the largest sample size nor the largest number of SNPs following quality control (Supplementary Table 1). Thus, to investigate the power of MAPIT-R and its sensitivity to underlying parameters, we conducted simulation studies under a range of genetic architectures (Supplementary Figure 3) [43]. Here, we found that MAPIT-R both controls type 1 error accurately and also has the power to effectively detect pathway level marginal epistasis, even for polygenic traits where the contribution from individual SNPs to the broad-sense heritability of a trait can be quite low. We also ran versions of MAPIT-R on the real data, but with permuted phenotypes, to ensure that the model was not identifying significant non-additive genetic relationships by chance (Supplementary Figures 4 and 5). These permutations allowed us to further investigate MAPIT-R’s false discovery rates, in which we observe values only as high as 1.5% across our different database-phenotype-subgroup combinations at multiple significance thresholds (Supplementary Table 4).
Evidence of Epistasis within the Non-African Subgroups
In our analyses of the British, Chinese, Caribbean, Indian, and Pakistani subgroups, we identify 80 pathways in total that have significant marginal epistatic interactions. Interestingly, many of these pathways overlap with the set of significant results from the African subgroup; there is notably less overlap though in results between each of the individual non-African subgroups (Figure 2 and Supplementary Figure 6). For example, in the height analysis with KEGG annotations, 6-out-of-7 and 7-out-of-8 enriched pathways identified using the Caribbean and Chinese subgroups overlap with those detected while using the African subgroup, respectively. However, there is no overlap in results from our marginal epistasis scans at the pathway level between the Chinese and Caribbean subgroups.
Figure 2. Subgroups in the UK Biobank were defined by self-identified ancestry: “African”, “British”, “Caribbean”, “Chinese”, “Indian”, and “Pakistani” (ordered here from top-to-bottom and left-to-right). Genome-wide significance was determined by using Bonferroni-corrected p-value thresholds based on the number of pathways tested in each database-phenotype-subgroup combination (see Supplementary Table 1). The diagonal shows the total number of genome-wide significant pathways per subgroup. We observe that significant pathways identified in non-African subgroups overlap more often with pathways from the African subgroup than they do with pathways from the other, remaining non-African subgroups. Results for both phenotypes in the REACTOME database can be seen in Supplementary Figure 6.
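The overlap counts summarized in this kind of matrix can be computed with plain set intersections. In the sketch below, the pathway identifiers are placeholders, and the set sizes were chosen only to mirror the Caribbean and Chinese counts quoted in the text for height with KEGG annotations (6 of 7 and 7 of 8 shared with the African subgroup, none shared with each other); the size of the African set is arbitrary.

```python
def overlap_matrix(significant):
    """Pairwise overlap counts; diagonal entries are each subgroup's total."""
    names = list(significant)
    return {
        a: {b: len(significant[a] & significant[b]) for b in names}
        for a in names
    }


# Placeholder pathway IDs, not the pathways reported in the study.
significant = {
    "African": {f"pathway_{i}" for i in range(1, 14)},
    "Caribbean": {f"pathway_{i}" for i in range(1, 7)} | {"pathway_20"},
    "Chinese": {f"pathway_{i}" for i in range(7, 14)} | {"pathway_21"},
}

counts = overlap_matrix(significant)
for subgroup, row in counts.items():
    print(subgroup, row)
```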
The pathways commonly identified with significant marginal epistatic signals in both the African and Caribbean subgroups contain genes related to multiple kinases (e.g., MAPK1, ROCK1, PRKCB, PAK1) and calcium channel proteins (e.g., CACNA1S, CACNA1D) (Supplementary Tables 5 and 6) — many of which are supported by associations validated in previous GWA applications [33, 119]. In contrast, the pathways with significant marginal epistatic effects identified in both the African and Chinese subgroups are pathways related to the immune system and contain multiple HLA loci (e.g., HLA-DRA, HLA-DRB1, HLA-A, HLA-B) (Supplementary Tables 5 and 6). These results are unsurprising since it is well known that the MHC region holds significant clinical relevance in complex traits [44, 103, 104, 120]; however, more recent work has also suggested that Han Chinese genomes may be particularly enriched for interactions involving HLA loci [121].
Stronger Epistatic Signals underlie BMI than Height
In our analyses with the African subgroup, we detected far more significantly enriched pathways for BMI than for height while using both the KEGG and REACTOME database annotations (Figure 1 and Supplementary Figure 2). While there is considerable correlation between the MAPIT-R p-values in height and BMI (Pearson correlation coefficient r = 0.76 in KEGG and 0.72 in REACTOME, respectively), there are stronger marginal epistatic signals in BMI that remain significant after Bonferroni correction (Figure 3). These results align with pedigree-based heritability estimates for each trait, which have indicated that narrow-sense heritability is around $h^2 = 0.8$ in height and between $h^2 = 0.4$ and $h^2 = 0.6$ in BMI [97, 98]. Taken together, these estimates suggest that non-additive effects may play a greater role in BMI than in height, as we have observed here.
Figure 3. For each plot, the x-axis shows the −log10 transformed MAPIT-R p-value for height, while the y-axis shows the same results for BMI. The red horizontal and vertical dashed lines are marked at the Bonferroni-corrected p-value thresholds for genome-wide significance in each pathway-phenotype combination (see Supplementary Table 1). Pathways in the top right quadrant have significant marginal epistatic effects in both traits, while points in the bottom right and top left quadrants are pathways that are uniquely enriched in height or BMI, respectively. The four highlighted pathways in blue represent a cluster of oncogenic and signaling pathways whose loci have been functionally connected to BMI in previous studies [122–129]. Across both databases, BMI results have lower MAPIT-R p-values than height results on average. For these comparisons in all of the UK Biobank subgroups, see Supplementary Figure 21.
We detected one specific cluster of pathways in the KEGG database with notably divergent statistical evidence for marginal epistasis in height versus BMI (see Figure 3). These four highlighted pathways are related to oncogenic activity and include: genes associated with small cell lung cancer (p-value = $3.20 \times 10^{-10}$), the ErbB signaling pathway (p-value = $3.30 \times 10^{-7}$), genes associated with non-small cell lung cancer (p-value = $1.64 \times 10^{-6}$), and T-cell receptor signaling (p-value = $6.12 \times 10^{-6}$). There are predominantly two sets of gene families that appear in all four of these annotated gene sets: phosphatidylinositol 3-kinases (PI3Ks) and the AKT serine/threonine-protein kinases (see Supplementary Table 7). One particular gene in this group, AKT2, has been associated with multiple monogenic disorders of glucose metabolism, including severe insulin resistance and diabetes, and severe fasting hypoinsulinemic hypoglycemia [122–124], representing a possible driver of this cluster. Additionally, pharmacological inhibition of crosstalk between the PI3Ks has been shown to reduce adiposity and metabolic syndrome in both humans and model organisms [125–129].
Testing Variability in MAPIT-R with British Replicate Subpopulations
One important consideration of our results is that the diverse non-European human ancestries in the UK Biobank have smaller sample sizes than recent GWA studies in individuals of European ancestry. Given the large sample size of over N = 470,000 individuals for the full white British cohort in the UK Biobank, we decided to test whether subsampled datasets from this group — similar in size to the non-European ancestry subgroups — would be large enough to gain insight into the genetic variation of height and BMI. Here, we sampled four additional, non-overlapping random subgroups of N = 4,000 British individuals and tested whether MAPIT-R results in these replicate subgroups were consistent with our results for the original British 4,000 subgroup. We also constructed larger non-overlapping British subsamples of N = 10,000 individuals to investigate how our results might vary with sample size. In total we analyzed five non-overlapping sets of N = 4,000 British individuals and five non-overlapping sets of N = 10,000 British individuals.
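Drawing non-overlapping replicate subsamples of a fixed size can be done by shuffling the cohort once and slicing it. The sketch below is an illustration only: the function name, the seed, and the toy cohort are assumptions, and it does not address whether the N = 4,000 and N = 10,000 sets were allowed to share individuals with each other.

```python
import random


def non_overlapping_subsamples(ids, n_sets, set_size, seed=0):
    """Split a shuffled copy of ids into n_sets disjoint lists of set_size each."""
    if n_sets * set_size > len(ids):
        raise ValueError("not enough individuals for the requested subsamples")
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i * set_size:(i + 1) * set_size] for i in range(n_sets)]


# Toy cohort of 25 placeholder IDs split into 5 sets of 4.
cohort = [f"id_{i}" for i in range(25)]
replicates = non_overlapping_subsamples(cohort, n_sets=5, set_size=4, seed=7)
for k, subset in enumerate(replicates, start=1):
    print(k, subset)
```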
When applying MAPIT-R to these data replicates, we find that our results are robustly similar to what was observed in the original British 4,000 subgroup. Overall, there is a limited number of pathways with significant marginal epistatic effects, regardless of the pathway annotation scheme being used (i.e., KEGG versus REACTOME). Moreover, there is also limited overlap in the significant pathways that were detected between each of the subsampled replicates. These results are depicted and summarized in Supplementary Figures 7-12. As previously done with the individuals of non-European ancestry, we also checked that the null hypothesis of MAPIT-R remained well-calibrated on these subsampled British replicates by permuting the height and BMI measurements. Once again, we found that MAPIT-R continued to exhibit low empirical false discovery and type 1 error rates (Supplementary Tables 8 and 9). Altogether, the consistency of these analyses with the results from the original 4,000-individual British subgroup suggests that sample size is not a driving factor in the detection of pathway-level marginal epistasis.
The Proteasome is Enriched for Marginal Epistasis Signals
To better identify the genes and genomic regions that are driving pathway-level marginal epistatic effects, we first investigated genes and gene families that are enriched amongst the significant pathways identified by MAPIT-R. To accomplish this, we conducted two types of hypergeometric tests for enrichment to detect genes that are overrepresented amongst the pathway annotations with low p-values (Supplementary Table 3). In the first test, we took the annotations from a given database (i.e., KEGG or REACTOME) and implemented a standard hypergeometric test where we compared the number of times a gene appears within the set of significant epistatic pathways versus the number of times that same gene appears across all pathways in the database. This type of test, however, may be confounded by the fact that larger pathways naturally have more SNPs and are therefore more likely to be involved in non-additive genetic interactions (see Supplementary Figures 13 and 14). To mitigate this concern, we ran a second hypergeometric enrichment test using only pathways containing 1000 SNPs or fewer. By focusing on smaller pathways, we are better able to identify genes enriched for marginal epistasis versus spurious signals that may happen by chance in larger pathways.
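One plausible reading of the two enrichment tests is sketched below: for each gene, the number of significant pathways containing it is compared with its frequency across all pathways using an upper-tail hypergeometric probability, optionally after discarding pathways above a SNP-count cutoff. The exact counting scheme used in the study may differ, and the pathway names, gene names, and sizes here are placeholders.

```python
from math import comb


def hypergeom_upper_tail(k, n_draws, n_success, n_total):
    """P(X >= k) for X ~ Hypergeometric(n_total, n_success, n_draws)."""
    upper = min(n_draws, n_success)
    numerator = sum(
        comb(n_success, i) * comb(n_total - n_success, n_draws - i)
        for i in range(k, upper + 1)
    )
    return numerator / comb(n_total, n_draws)


def gene_enrichment(pathway_genes, significant, pathway_sizes=None, max_snps=None):
    """Per-gene enrichment p-values among significant pathways.

    If max_snps is given, only pathways with at most that many SNPs are used.
    """
    if max_snps is not None:
        kept = {p for p in pathway_genes if pathway_sizes[p] <= max_snps}
    else:
        kept = set(pathway_genes)
    sig = set(significant) & kept
    genes = sorted(set().union(*(pathway_genes[p] for p in sig)))
    results = {}
    for gene in genes:
        n_with_gene = sum(gene in pathway_genes[p] for p in kept)
        k = sum(gene in pathway_genes[p] for p in sig)
        results[gene] = hypergeom_upper_tail(k, len(sig), n_with_gene, len(kept))
    return results


# Placeholder annotations (not real pathways or genes).
pathway_genes = {
    "PW1": {"A", "B"}, "PW2": {"A", "C"}, "PW3": {"B", "C"},
    "PW4": {"C", "D"}, "PW5": {"D"},
}
pathway_sizes = {"PW1": 500, "PW2": 1500, "PW3": 800, "PW4": 900, "PW5": 200}
significant = {"PW1", "PW2"}

unrestricted = gene_enrichment(pathway_genes, significant)
restricted = gene_enrichment(pathway_genes, significant, pathway_sizes, max_snps=1000)
print(unrestricted)
print(restricted)
```

Genes that remain enriched under both versions are the ones least likely to be flagged merely because they sit in very large pathways.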
Figure 4 shows the hypergeometric p-values for all genes in significant interacting pathways. Here, we focus on results for BMI within the African subgroup using annotations from the REACTOME database, and we specifically highlight only the genes that were significant under both types of hypergeometric enrichment tests (i.e., the genes that were robustly identified as drivers regardless of the number of SNPs included in the test). Notably, these gene families (PSMA, PSMB, PSMC, PSMD, PSME, and PSMF) are all components of, or related to, the proteasome. The proteasome is a complex protein structure that acts as the catalytic half of the ubiquitin-proteasome system (UPS), a critical system for the proper degradation of proteins within the cell [130–132]. The main proteasome isoform, 26S, is made up of two substructures: (i) the 20S core particle (CP) of four stacked rings (two outer structural rings encoded by PSMA genes and two inner catalytic rings encoded by PSMB genes), and (ii) the 19S regulatory particle (RP) which caps both ends of the CP (encoded by genes within both the PSMC and PSMD families). See Figure 5(a) for an illustration of this structure. Since these gene families covered both a large number of genomic sites and biological functions known to be relevant to BMI, we used the proteasome as a test case to further refine the pathway-level signals identified by MAPIT-R.
Here, the gene-based p-values using the size-restricted pathways are shown on the y-axis, while the results from the original unrestricted version of the analysis are shown on the x-axis. The blue dashed circle in panel (b) highlights the proteasome gene family cluster. For lists containing each gene’s original and size-restricted hypergeometric p-values, see Supplementary Table 11. Note that we only show results for BMI because few MAPIT-R significant pathways in the height analysis remained after imposing the size restriction. For lists containing gene counts for each database-phenotype-subgroup combination under both the original and size-restricted data sets, see Supplementary Tables 12 and 13.
(a) Models of different isoforms of the proteasome, a complex protein structure required for proper degradation of many proteins in the cell. The “26S Proteasome” is the main isoform, composed of the 20S core particle (CP) and capped on both ends by the 19S regulatory particle (RP). The “Hybrid Proteasome” isoform is produced when the CP binds on one end with an RP and on the other end with the IFN-γ-inducible 11S complex PA28αβ. The PSMA and PSMB gene families encode components of the CP, the PSMC and PSMD gene families encode components of the RP, and members of the PSME gene family encode PA28αβ. Note that PSMF represents a proteasome inhibitor and is not shown. The structures shown were adopted and modified from the Protein Data Bank (human 26S proteasome, https://www.rcsb.org/structure/5GJR; mouse PA28αβ, https://www.rcsb.org/structure/5MX5) [141]. (b) The heatmap shows the change in original MAPIT-R −log10 p-value for different REACTOME pathways when each proteasome gene family is removed one at a time in a “leave-one-out” manner. The analyses were conducted in BMI for the African subgroup of the UK Biobank. The x-axis shows each proteasome gene family and the y-axis lists each REACTOME pathway. Each column has been scaled by the number of SNPs present in the given gene family and, as a result, the heatmap specifically shows the −log10 p-value change (△ in legend) per SNP. (c) The table shows the number of SNPs present in each proteasome gene family (left), as well as the number of SNPs present in each REACTOME pathway (right).
To investigate whether components of the proteasome served as drivers of significant marginal epistatic effects, we conducted a “leave-one-out” analysis with each of the gene families in the proteasome. More specifically, we first used MAPIT-R to reanalyze BMI after leaving out SNPs annotated within genes belonging to the PSMA, PSMB, PSMC, PSMD, PSME, and PSMF families, one family at a time. Next, we compared these new “leave-one-out” MAPIT-R p-values to each pathway’s original p-value from running MAPIT-R on the full data. This enabled us to identify whether the removal of a particular gene family would lead to a notable loss of information regarding a pathway’s epistatic interactions with the rest of the genome.
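The comparison between original and “leave-one-out” p-values can be illustrated with a short sketch. It assumes the quantity of interest is the drop in −log10 p-value divided by the number of SNPs removed, with positive values meaning that signal was lost; this sign convention is an assumption. The function names, pathway and family labels, and all numbers below are hypothetical placeholders rather than results from the study.

```python
from math import log10


def loo_delta_per_snp(p_original, p_leave_one_out, n_snps_removed):
    """Drop in -log10(p) per removed SNP (positive means signal was lost)."""
    if not (0 < p_original <= 1 and 0 < p_leave_one_out <= 1):
        raise ValueError("p-values must lie in (0, 1]")
    if n_snps_removed <= 0:
        raise ValueError("at least one SNP must be removed")
    delta = (-log10(p_original)) - (-log10(p_leave_one_out))
    return delta / n_snps_removed


def loo_table(original, leave_one_out, snp_counts):
    """Build {(pathway, family): per-SNP change} for every reanalysis.

    original:      dict pathway -> p-value on the full data
    leave_one_out: dict (pathway, family) -> p-value with that family removed
    snp_counts:    dict family -> number of SNPs annotated to that family
    """
    return {
        (pathway, family): loo_delta_per_snp(original[pathway], p, snp_counts[family])
        for (pathway, family), p in leave_one_out.items()
    }


# Placeholder inputs: one pathway, two gene families, invented p-values.
original = {"pathway_1": 1e-6}
leave_one_out = {("pathway_1", "family_A"): 1e-6, ("pathway_1", "family_B"): 1e-4}
snp_counts = {"family_A": 40, "family_B": 20}

table = loo_table(original, leave_one_out, snp_counts)
```

In this toy input, removing `family_A` leaves the p-value unchanged, so its per-SNP change is zero, while removing `family_B` weakens the signal by two orders of magnitude spread over 20 SNPs.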
Figure 5(b) shows the results from this analysis. We find that the PSMA and PSME gene families exhibit biologically interpretable changes in p-value magnitudes across multiple REACTOME pathways. For the PSMA gene family, we observe no examples where removing these genes leads to large increases in the MAPIT-R p-values. As previously mentioned, the PSMA gene family encodes the two outer rings of the four stacked rings that make up the 20S core particle. These outer “alpha” rings are gates which block entry into the core of the proteasome until they are opened by stimulation from the 19S regulatory particle [133–135]. Unlike the inner “beta” rings encoded by the PSMB family, which contain the proteolytic active sites, the outer rings do not have any catalytic functionality [136, 137]. This less direct role in the protein degradation process may explain the lack of increase in MAPIT-R p-values, or lack of information lost, when PSMA genes are removed from the analysis.
For the PSME gene family, on the other hand, we find some of the largest increases in MAPIT-R p-values across multiple REACTOME pathways. Members of the PSME gene family encode an alternative regulatory particle, 11S PA28αβ, that also associates with the 20S core. PA28αβ is an Interferon-γ (IFN-γ) inducible regulatory protein that operates in a ubiquitin-independent manner and increases production of a particular subset of proteasomes known as immunoproteasomes [138–141]. Immunoproteasomes are specialized isoforms that are expressed at higher levels in hematopoietic cells and are more directly associated with immunity-related processes such as MHC antigen presentation [142–144]. Additionally, recent work has connected PSME genes to the regulation of NF-κB signaling [145, 146]. Altogether, these connections to immune activity may explain why removal of the PSME gene family affects marginal epistatic signals in pathways related to NF-κB, B-cells, HIV, and apoptosis. Lastly, when conducting these “leave-one-out” MAPIT-R analyses in the remaining UK Biobank subgroups, we observe that removing the PSME gene family also leads to some of the largest increases in MAPIT-R p-values in individuals of non-African ancestry (Supplementary Figures 15-20 and Supplementary Table 10). The consistency of this result across all subgroups suggests that PSME is a key contributor to proteasome epistatic interactions with other regions in the genome.
Discussion
Here, we present the first scans for marginal epistasis within multiple human ancestries. We implement a new method, MAPIT-R, to test for evidence of non-additive genetic effects at the pathway level and apply the framework to six different human ancestries sampled in the UK Biobank: African, British, Caribbean, Chinese, Indian, and Pakistani subgroups. Using two different pathway databases, we study continuous measurements of height and body mass index (BMI) and find a total of 245 pathways that have significant epistatic interactions with their polygenic background (see Figure 1). We find that the African subgroup produces the majority of these results, with over 65% of our 245 significant pathways being identified within this subgroup (see Figure 2). Additionally, we find that pathways related to immunity, cellular signaling, and metabolism have significant signals in our genome-wide marginal epistasis scans, and that BMI produces more significant marginal epistatic interactions at the pathway level than height (see Figure 3 and Table 1). In testing for drivers of our MAPIT-R results, we find evidence that the proteasome may be enriched for marginal epistatic interactions and characterize how proteasome gene families contribute to the non-additive genetic architecture of complex traits (see Figures 4-5).
The fact that we find such an abundance of epistatic signals in the African subgroup underscores that African populations, and non-European ancestries in general, are particularly useful for complex trait genetics [66–68, 72, 147–152]. Past research has shown that African ancestry genomes offer a more complete characterization of the genetic architecture of skin pigmentation [63, 64], reveal the evolutionary histories of FOXP2 and other loci [153, 154], and are needed for more transferable polygenic risk scores [65, 70]. While many studies have called for more GWA studies to be conducted in individuals of non-European ancestry [71, 73–75, 155], we believe this study reveals that our understanding of the role of epistasis in human complex trait architecture and broad-sense heritability will also expand with multiethnic analyses. Our results suggest that non-European ancestries, and African ancestries in particular, may be better suited for identifying signals of epistasis than European ancestries.
Our analyses are not without limitations. First, we are limited by the computational costs of epistasis detection, although testing for marginal epistasis reduces our testing burden compared to standard exhaustive epistasis scans. Still, the MAPIT-R framework does not scale well to the full sample sizes of modern human genomic biobanks [43, 45, 84]; it becomes computationally burdensome when analyzing tens of thousands of individuals. One important future direction for research is to detect epistatic interactions using GWA summary statistics. Moving away from the need for individual-level genotype-phenotype data to GWA study summary statistics has proven useful both in improving algorithmic efficiency and in increasing power in multiple other GWA contexts [61, 156–160]. Another noticeable limitation is that MAPIT-R cannot be used to directly identify the interacting variant pairs that drive individual non-additive associations with a given trait. In particular, after identifying that a pathway is involved in epistasis, it is still unclear which particular region of the genome it interacts with.
While the novel “leave-one-out” approach we implement here (see Figure 5(b)) helps narrow down the list of potential regions, MAPIT-R still does not directly identify pairs of interacting variants. Exploring marginal epistasis results a posteriori in a two-step procedure can be one way to overcome these issues. For example, linking MAPIT-R with a framework that explicitly follows up on marginal epistasis signals with locus-focused methods such as fine-mapping [161–163] or co-localization [164–168] could further expand the power of the framework.
URLs
MArginal ePIstasis Test for Regions (MAPIT-R) software, https://cran.r-project.org/web/packages/MAPITR/index.html; UK Biobank, https://www.ukbiobank.ac.uk; Molecular Signatures Database (MSigDB), https://www.gsea-msigdb.org/gsea/msigdb/index.jsp; Database of Genotypes and Phenotypes (dbGaP), https://www.ncbi.nlm.nih.gov/gap; NHGRI-EBI GWAS Catalog, https://www.ebi.ac.uk/gwas/; UCSC Genome Browser, https://genome.ucsc.edu/index.html; MArginal ePIstasis Test (MAPIT), https://github.com/lorinanthony/MAPIT; PLINK, https://www.cog-genomics.org/plink/1.9/.
Author Contributions
MCT, LC, and SR conceived the study design. LC and SR conceived the methods. MCT developed the software and carried out the analyses of the UK Biobank data. GD carried out the simulation studies. All authors wrote and reviewed the manuscript.
Competing Interests
The authors declare no competing interests.
Acknowledgments
We thank Shigeo Murata for helpful feedback during the preparation of this manuscript. This research was conducted in part using computational resources and services at the Center for Computation and Visualization at Brown University. This research was also conducted using data from the UK Biobank Resource under Application Number 22419. G. Darnell was supported by NSF Grant No. DMS-1439786 while in residence at the Institute for Computational and Experimental Research in Mathematics (ICERM) in Providence, RI. This research was supported in part by grants P20GM109035 (COBRE Center for Computational Biology of Human Disease; PI Rand) and P20GM103645 (COBRE Center for Central Nervous; PI Sanes) from the NIH NIGMS, 2U10CA180794-06 from the NIH NCI and the Dana Farber Cancer Institute (PIs Gray and Gatsonis), as well as by an Alfred P. Sloan Research Fellowship (No. FG-2019-11622) awarded to L. Crawford. This research was also partly supported by the US National Institutes of Health (NIH) grant R01 GM118652, and the National Science Foundation (NSF) CAREER award DBI-1452622 to S. Ramachandran. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of any of the funders.