Abstract
Detecting epistatic interactions at the gene level is essential to understanding the biological mechanisms of complex diseases. Unfortunately, genome-wide interaction association studies (GWAIS) involve many statistical challenges that make such detection hard. We propose a multi-step protocol for epistasis detection along the edges of a gene-gene co-function network. Such an approach reduces the number of tests performed and provides interpretable interactions, while keeping type I error controlled. Yet, mapping gene interactions into testable SNP-interaction hypotheses, as well as computing gene pair association scores from SNP pair ones, is not trivial. Here we compare three SNP-gene mappings (positional overlap, eQTL and proximity in 3D structure) and use the adaptive truncated product method to compute gene pair scores. This method is non-parametric, does not require a known null distribution, and is fast to compute. We apply multiple variants of this protocol to an inflammatory bowel disease (IBD) GWAS dataset. Different configurations produced different results, highlighting that various mechanisms are implicated in IBD, while at the same time, results overlapped with known disease biology. Importantly, the results of the proposed pipeline also differ from those of a conventional approach where no network is used, showing the potential for additional discoveries when prior biological knowledge is incorporated into epistasis detection.
1 Introduction
Genome-wide association studies (GWAS) have identified over 70000 genetic variants associated with complex traits [5]. Often, these variants taken together do not explain the whole variance of a trait. A representative example is inflammatory bowel disease (IBD), which includes Crohn’s disease and ulcerative colitis. Pooled twin studies estimate their heritabilities at 0.75 and 0.67 respectively [15]. Yet, despite large GWAS that identified over 200 IBD-associated loci [11], only a low proportion of their variance has been explained [38]. Possible explanations include a large number of common variants with small effects, rare variants with large effects not covered in GWAS, unaccounted gene-environment interactions, and genetic interactions [29]. In this article we explore the latter, called epistasis, which has been linked to IBD in the past [14, 26, 30, 33, 45, 53]. Often, two types of epistasis are described: biological and statistical epistasis [31]. Broadly described, biological epistasis refers to a physical interaction between two biomolecules that has an impact on the phenotype. Statistical epistasis refers to departures from population-level linear models describing relationships between predictive factors such as alleles at different genetic loci.
Genome-wide association interaction studies (GWAIS) focus on the detection of statistical epistasis. To date, these studies have produced few replicable, functional conclusions, and specific gene-gene interactions have rarely been identified. This may be due to the small effect sizes of the interactions, the low statistical power, or the absence of a widely accepted GWAIS protocol. Even in the absence of statistical challenges, GWAIS are usually conducted on Single Nucleotide Polymorphisms (SNPs), and SNP interactions often lack a straightforward functional interpretation. Moving from SNP- to gene-level tests, which jointly consider all the SNPs mapped to the same gene, might address both shortcomings. First, aggregating SNP pair statistics into gene pair statistics is likely to increase the statistical power when dealing with complex diseases [49]. Second, converting statistical findings into biological hypotheses [23] may facilitate their functional interpretation [22].
To both reduce the number of tests and improve the interpretability of significant SNP interactions, some authors propose examining only pairs of SNPs likely to be functionally related [32]. Such approaches use prior biological knowledge, for instance, of SNPs involved in genes that establish a protein-protein interaction [17]. Yet, limiting studies to one particular kind of gene-gene interaction might be reductive. To tackle that issue, Pendergrass et al. [34] developed Biofilter, a gene-gene co-function network, which aggregates multiple databases. Additionally, such approaches often require a proper mapping of SNPs to genes.
In this article, we propose guiding statistical epistasis using plausible biological epistasis. Taking exclusively interactions reported from at least 2 different sources in Biofilter, we compile a subset of gene-gene interactions that are biologically plausible. Then, we exclusively search for those interactions in a GWAIS dataset, reducing the multiple test burden and improving the interpretability. We investigate different ways of mapping SNPs to genes and use the adaptive truncated product method to estimate the association of gene pairs. Network and pathway analyses are used to further assist in the interpretation of epistasis findings. The proposed pipeline is applied to GWAS data from the International IBD Genetic Consortium [11].
2 MATERIALS AND METHODS
2.1 Dataset and initial quality control
We investigated the IIBDGC dataset, produced by the International Inflammatory Bowel Disease Genetics Consortium (IIBDGC). The large sample size of this dataset helps overcome the issue of reduced statistical power that is common in GWAIS. This dataset was genotyped on the Immunochip SNP array [8]. We performed quality control as in Ellinghaus et al. [12], thereby reducing the number of SNPs from 196524 to 130071. The final dataset contains 66280 samples, out of which 32622 are cases (individuals with IBD) and 33658 are controls.
The IIBDGC dataset aggregates different cohorts, and contains potentially confounding population structure. As in Ellinghaus et al. [12], we used the first 7 principal components to model population stratification. Because several epistasis detection methods, such as those implemented in PLINK [37], cannot include covariates in their logistic regression models, we instead adjusted the phenotypes by regressing out those principal components. In other words, we derived adjusted phenotypes from the logistic regression model by subtracting model-fitted values from observed phenotype values, i.e. response residuals (see Supplementary).
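The adjustment described above can be written compactly. The notation below ($y_i$ for the observed case/control status of individual $i$, $\mathrm{PC}_{ik}$ for its $k$-th principal component, $\hat{\alpha}$ for the fitted logistic coefficients, $\hat{\pi}_i$ for the fitted probability and $\tilde{y}_i$ for the adjusted phenotype) is introduced here for illustration only and is an assumption about how the response residuals would be expressed:

```latex
\hat{\pi}_i = \frac{1}{1 + \exp\!\left(-\left(\hat{\alpha}_0 + \sum_{k=1}^{7} \hat{\alpha}_k \, \mathrm{PC}_{ik}\right)\right)},
\qquad
\tilde{y}_i = y_i - \hat{\pi}_i
```

Under this reading, the adjusted phenotype $\tilde{y}_i$ is continuous, which is why it can later be used as the response of a linear regression.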
2.2 Gene interaction detection procedure
As we describe in more detail below, we applied different functional filters to the available data. These filters use plausible interactions between genes, and three different ways of mapping SNPs to those genes, and hence, to these interactions. These three mappings exploit different degrees of biological knowledge to map SNPs to genes, referred to as Positional, eQTL and Chromatin. For each of the three SNP-to-gene mappings, we only analyzed the pairs of SNPs corresponding to a gene pair with prior evidence for interaction. In addition, we compared our findings in these scenarios to the Standard scenario, where all SNP pairs are analyzed without prior filtering. An overview of the entire pipeline is presented in Fig. 1.
2.2.1 From gene models to SNP models
Although the unit of analysis in GWAIS is the SNP, biological interactions are often characterized at the gene level. Hence, we mapped all SNPs in the dataset to genes using FUMA [47], a post-GWAS annotation tool. We created an artificial input where every SNP is significant in order to perform such mapping on all the SNPs. We performed three SNP-gene mappings using FUMA’s SNP2GENE: positional, eQTL and 3D chromatin interaction (Table 1). In the Positional mapping, we mapped a SNP to a gene when the genomic coordinates of the former were within the boundaries of the latter ± 10 kb. The eQTL mapping uses eQTLs obtained from GTEx [16]. We mapped an eQTL SNP to its target gene when the association P-value was significant in any tissue (FDR < 0.05). Lastly, in the Chromatin mapping, we mapped a SNP to a gene when a contact had been observed, in the 3D structure of the genome, between the former and the region around the latter’s promoter (250 bp upstream and 500 bp downstream from the transcription start site) in any of the Hi-C datasets included in FUMA (FDR < 10^-6). This mapping might contain new, undiscovered regulatory variants which, like the SNPs obtained through eQTL mapping, regulate the expression of a gene.
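As an illustration of the Positional rule only, the following minimal sketch maps SNPs to genes whose boundaries, extended by a 10 kb window, contain the SNP. It is not FUMA's implementation; the function name `positional_mapping`, the record layouts and the toy identifiers are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical record layouts: (chromosome, position) for SNPs and
# (chromosome, start, end) for genes, all coordinates in base pairs.
Snp = Tuple[str, int]
Gene = Tuple[str, int, int]


def positional_mapping(
    snps: Dict[str, Snp], genes: Dict[str, Gene], window: int = 10_000
) -> Dict[str, List[str]]:
    """Map each SNP to every gene whose boundaries +/- `window` contain it."""
    mapping: Dict[str, List[str]] = {}
    for snp_id, (snp_chrom, pos) in snps.items():
        hits = [
            gene_id
            for gene_id, (gene_chrom, start, end) in genes.items()
            if gene_chrom == snp_chrom and start - window <= pos <= end + window
        ]
        if hits:  # SNPs that map to no gene are left out
            mapping[snp_id] = sorted(hits)
    return mapping


# Toy data (made up for the example).
snps = {"rs1": ("1", 95_000), "rs2": ("1", 250_000), "rs3": ("2", 100_500)}
genes = {"GENE_A": ("1", 100_000, 120_000), "GENE_B": ("2", 100_000, 101_000)}
example_mapping = positional_mapping(snps, genes)
```

The quadratic scan over genes is kept for readability; on a real array an interval index would be a more sensible choice.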
2.2.2 Co-function gene and SNP networks
We used Biofilter 2.4 [34] to obtain candidate gene pairs to investigate for evidence of epistasis. Biofilter generates pairs of genes that are likely to interact (gene models), based on evidence of co-function across multiple publicly available biological databases. It includes genomic locations of SNPs and genes, as well as known relationships among genes and proteins such as interaction pairs, pathways and ontological categories, but does not use trait information. As per Biofilter’s default, we used gene models supported by evidence in at least 2 databases. Additionally, we removed self-interactions, as detection of within-gene epistasis requires special considerations and is beyond the scope of this paper.
Given this set of gene models, and three different ways of obtaining SNP models from it, we removed all the SNPs that did not participate in any SNP model. Subsequently, we created six datasets. In one dataset no filter was applied (Standard analysis), i.e. no Biofiltering nor any SNP-to-gene mapping. Hence, the original SNP set was used. We also created one dataset exclusively for each SNP to gene mapping (Positional, eQTL and Chromatin). Lastly, we created two datasets using joint mappings: one with all the mappings (Positional + eQTL + Chromatin); and one with only the functional ones (eQTL + Chromatin). Since the main objective of this protocol is to increase the biological interpretability of epistasis findings, we have excluded other combinations that mix functional and non-functional information (Positional + eQTL and Positional + Chromatin).
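The step from gene models to SNP models can be sketched as follows. This is an illustrative reading of the text, not the authors' code: `build_snp_models` and the toy gene and SNP identifiers are hypothetical, and a SNP pair is allowed to belong to several gene models.

```python
from itertools import product
from typing import Dict, FrozenSet, Iterable, Set, Tuple

GenePair = Tuple[str, str]


def build_snp_models(
    gene_models: Iterable[GenePair], gene_to_snps: Dict[str, Set[str]]
) -> Dict[FrozenSet[str], Set[GenePair]]:
    """Expand gene pairs into unordered SNP pairs, remembering their gene pairs."""
    models: Dict[FrozenSet[str], Set[GenePair]] = {}
    for gene_a, gene_b in gene_models:
        if gene_a == gene_b:  # self-interactions are out of scope
            continue
        snps_a = gene_to_snps.get(gene_a, set())
        snps_b = gene_to_snps.get(gene_b, set())
        for snp_a, snp_b in product(snps_a, snps_b):
            if snp_a != snp_b:  # a SNP cannot interact with itself
                pair = frozenset((snp_a, snp_b))
                models.setdefault(pair, set()).add(tuple(sorted((gene_a, gene_b))))
    return models


# Toy data (made up for the example).
gene_models = [("G1", "G2"), ("G2", "G3"), ("G1", "G1")]
gene_to_snps = {"G1": {"rs1"}, "G2": {"rs2", "rs3"}, "G3": {"rs3"}}
snp_models = build_snp_models(gene_models, gene_to_snps)

# SNPs that take part in at least one SNP model; all others would be removed.
retained_snps = set().union(*snp_models)
```

Swapping `gene_to_snps` for the Positional, eQTL or Chromatin mapping (or the union of several of them) would yield the mapping-specific datasets described above.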
We discarded SNP models involving rare variants (MAF < 5%) or variants deviating from Hardy–Weinberg equilibrium (P-value < 0.001). Nevertheless, all risk SNPs described in Liu et al. [27] were included, even when they did not meet these quality control criteria. Then, when the two SNPs of a SNP model were located in the HLA region, we discarded the pair, as it is difficult to differentiate between main and non-additive effects in this region [43]. Lastly, we discarded models where the SNPs were in linkage disequilibrium (r² > 0.75), as motivated in Gusareva and Van Steen [18].
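The filters above can be summarized as a predicate on a SNP pair. The sketch below is hypothetical: the `SnpQc` record and both function names are invented for illustration, and it assumes that the exemption for known risk SNPs applies to the MAF and Hardy–Weinberg filters only, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpQc:
    maf: float        # minor allele frequency
    hwe_p: float      # P-value of the Hardy-Weinberg test
    in_hla: bool      # True if the SNP lies in the HLA region
    known_risk: bool  # True if the SNP is a previously reported risk SNP


def keep_snp(snp: SnpQc, min_maf: float = 0.05, min_hwe_p: float = 0.001) -> bool:
    """Per-SNP filter; known risk SNPs are kept regardless (assumption)."""
    return snp.known_risk or (snp.maf >= min_maf and snp.hwe_p >= min_hwe_p)


def keep_snp_model(a: SnpQc, b: SnpQc, r2: float, max_r2: float = 0.75) -> bool:
    """Pair-level filter: per-SNP QC, HLA exclusion, then the LD cut-off."""
    if not (keep_snp(a) and keep_snp(b)):
        return False
    if a.in_hla and b.in_hla:  # both SNPs in the HLA region
        return False
    return r2 <= max_r2        # drop pairs in strong LD


# Toy records (made up for the example).
common = SnpQc(maf=0.20, hwe_p=0.5, in_hla=False, known_risk=False)
rare = SnpQc(maf=0.01, hwe_p=0.5, in_hla=False, known_risk=False)
rare_risk = SnpQc(maf=0.01, hwe_p=0.5, in_hla=False, known_risk=True)
hla = SnpQc(maf=0.30, hwe_p=0.5, in_hla=True, known_risk=False)
```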
2.2.3 SNP-level epistasis detection and multiple testing correction
We used PLINK 1.9 with the option --epistasis to detect epistasis through a linear regression on the population-structure-adjusted phenotypes:
$$Y \sim \beta_0 + \beta_1 g_A + \beta_2 g_B + \beta_3 g_A g_B$$
where gA and gB are the genotypes under additive encoding for SNPs A and B respectively, Y is the adjusted phenotype, and β0, β1, β2, and β3 are the regression coefficients. PLINK performs a statistical test to evaluate whether β3 ≠ 0. It only returns SNP pairs with a P-value lower than a specified threshold. We used the default, 0.0001. Except in the Standard analysis, only SNP pairs belonging to a SNP model were tested.
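To make the interaction test concrete, here is a minimal standard-library sketch of an ordinary least squares fit with an interaction term. It is not PLINK's implementation: `solve` and `interaction_test` are hypothetical helpers, and the P-value relies on a normal approximation to the t distribution, which is only reasonable for large samples.

```python
import math
from typing import List, Sequence, Tuple


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def interaction_test(
    g_a: Sequence[float], g_b: Sequence[float], y: Sequence[float]
) -> Tuple[float, float]:
    """Return (beta3, two-sided P-value) for Y ~ b0 + b1*gA + b2*gB + b3*gA*gB."""
    rows = [[1.0, a, b, a * b] for a, b in zip(g_a, g_b)]
    k = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    residuals = [yi - sum(bj * rj for bj, rj in zip(beta, r)) for r, yi in zip(rows, y)]
    sigma2 = sum(e * e for e in residuals) / (len(y) - k)
    # Var(beta3) = sigma2 * last diagonal entry of (X'X)^-1.
    inv_last = solve(xtx, [0.0, 0.0, 0.0, 1.0])[3]
    z = beta[3] / math.sqrt(sigma2 * inv_last)
    # Normal approximation to the t distribution (assumption: large sample).
    return beta[3], math.erfc(abs(z) / math.sqrt(2.0))


# Toy data: every genotype combination four times, with a true beta3 of 0.5
# and a small deterministic perturbation standing in for noise.
grid = [(a, b) for a in (0, 1, 2) for b in (0, 1, 2)] * 4
g_a = [a for a, _ in grid]
g_b = [b for _, b in grid]
y = [1 + 2 * a + 3 * b + 0.5 * a * b + 0.01 * (-1) ** i for i, (a, b) in enumerate(grid)]
beta3, p_value = interaction_test(g_a, g_b, y)
```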
To correctly account for multiple testing, the P-value threshold of significance had to be dataset-dependent as the number of tested SNP pairs changed from dataset to dataset (Section 2.2.1). We obtained these thresholds through permutations as in Hemani et al. [19] (Fig. 1). More specifically, for each dataset, we permuted the phenotypes 400 times and fitted the aforementioned regression-based association model. This produced a null distribution of the extreme P-values for this number of tests given the LD structure in the data. For each dataset, we took the most extreme P-value from each of the 400 permutations and set the threshold for a 5% family-wise error rate (FWER) to be the 5th percentile of these most extreme P-values. Subsequent experiments showed that a higher number of permutations, 1000, barely changed the empirical threshold (data not shown). Hence, 400 was a sufficient number of permutations to obtain an adequate threshold.
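The threshold computation can be sketched as below, assuming the smallest P-value of each permutation has already been collected. `fwer_threshold` is a hypothetical helper, and the choice of order statistic (the k-th smallest value with k = floor(alpha × number of permutations)) is one reasonable convention among several for an empirical percentile.

```python
from typing import Sequence


def fwer_threshold(min_pvalues: Sequence[float], alpha: float = 0.05) -> float:
    """Empirical alpha-quantile of the per-permutation minimum P-values."""
    ordered = sorted(min_pvalues)
    # Number of permutations allowed to reach the threshold (small epsilon
    # guards against floating-point products such as 2.9999999999999996).
    k = int(alpha * len(ordered) + 1e-9)
    if k < 1:
        raise ValueError("too few permutations for the requested alpha")
    return ordered[k - 1]


# Toy null distribution: 400 made-up minimum P-values, 0.001 to 0.400.
toy_min_pvalues = [i / 1000 for i in range(400, 0, -1)]
threshold = fwer_threshold(toy_min_pvalues)  # 20th smallest value
```

With 400 permutations and alpha = 0.05, exactly 20 permuted datasets have a minimum P-value at or below the returned threshold.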
2.2.4 From SNP-level to gene-level epistasis
Our next step was to use significant SNP interactions to identify significant gene interactions, which requires combining the P-values of all SNP pairs mapped to the same gene pair. Suppose that SNP interaction tests have been conducted for N individual hypotheses H0i, i = 1, 2, …, N, for example, N SNP models mapped to the same gene model. We tested the joint null hypothesis $H_0 = \bigcap_{i=1}^{N} H_{0i}$ at significance level α versus the combined alternative hypothesis H1: at least one of H0i is false. To do so, we considered all SNP pairs mapped to the same gene pair as a set of tests with the same global null hypothesis, and applied the Adaptive Truncated Product Method (ATPM) [39] (Fig. 1).
ATPM is an adaptive variant of the Truncated Product Method (TPM) of Zaykin et al. [52], which uses as a test statistic the product of the P-values that are smaller than some pre-specified threshold (here, those of the significant SNP interactions). More specifically, given a truncation point τ and a number N of SNP interactions, this test statistic is given as
$$W(\tau) = \prod_{i=1}^{N} p_i^{I(p_i \le \tau)}$$
where $p_i$ is the P-value of the i-th SNP interaction and I(·) is the indicator function. TPM is interesting in our context because it does not require P-values for all SNP pairs but only for the most strongly associated ones.
The distribution of W(τ) under the null hypothesis is unknown when the individual tests are not independent, which is clearly the case here, but an empirical P-value $\hat{P}(\tau)$ can be estimated through permutations. Because the choice of τ is arbitrary, the adaptive version of TPM (ATPM) explores several values of τ and selects the one with the smallest empirical P-value, $\hat{P}^{\min} = \min_{\tau} \hat{P}(\tau)$. The distribution of $\hat{P}^{\min}$ under the null hypothesis can again be determined through permutations [13].
In our procedure, which is detailed below for a given gene pair, we used B = 999 permutations and τ ∈ {0.001, 0.01, 0.05}. Remarkably, and following the suggestion of Becker and Knapp [2], the null distribution includes both the statistic from the observed dataset, and from the 999 permutations.
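1. For each SNP model i = 1, …, N mapped to the gene pair, compute its P-value $p_{i,b}$ in the original dataset (b = 0) and in each of the B = 999 permutations (b = 1, …, B).
2. For each value of τ and b, compute the test statistic $W_b(\tau) = \prod_{i=1}^{N} p_{i,b}^{I(p_{i,b} \le \tau)}$.
3. For each value of τ and b, estimate the P-value $\hat{P}_b(\tau) = \frac{1}{B+1} \sum_{c=0}^{B} I\left(W_c(\tau) \le W_b(\tau)\right)$.
4. For each value of b, compute $\hat{P}_b^{\min} = \min_{\tau} \hat{P}_b(\tau)$.
5. Estimate the P-value of the gene model as $P_0 = \frac{1}{B+1} \sum_{b=0}^{B} I\left(\hat{P}_b^{\min} \le \hat{P}_0^{\min}\right)$.
6. Reject the global null hypothesis if $P_0 \le \alpha = 0.05$.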
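The procedure above can be sketched in a few lines of standard-library Python. This is an illustrative reading rather than the authors' implementation: `atpm_pvalue` is a hypothetical name, the input is assumed to be a complete table of strictly positive P-values (row 0 for the observed data, one further row per permutation), and the observed dataset is counted in the null distribution as recommended by Becker and Knapp [2].

```python
import math
from typing import Sequence


def atpm_pvalue(
    pvalues: Sequence[Sequence[float]],
    taus: Sequence[float] = (0.001, 0.01, 0.05),
) -> float:
    """Gene-pair P-value; pvalues[b][i] is SNP model i in dataset b (b = 0 observed)."""
    n_datasets = len(pvalues)  # B + 1
    # Log of the truncated product W_b(tau); an empty product gives log W = 0.
    log_w = [
        [sum(math.log(p) for p in row if p <= tau) for row in pvalues]
        for tau in taus
    ]
    # Empirical P-value of each W_b(tau) among the B + 1 datasets
    # (small products are the extreme ones).
    p_hat = [
        [sum(other <= w for other in per_tau) / n_datasets for w in per_tau]
        for per_tau in log_w
    ]
    # Best truncation point for each dataset.
    p_min = [min(p_hat[t][b] for t in range(len(taus))) for b in range(n_datasets)]
    # Position of the observed minimum within the distribution of minima.
    return sum(value <= p_min[0] for value in p_min) / n_datasets


# Toy example with B = 4 permutations and N = 2 SNP models (made-up values).
toy = [
    [1e-5, 0.2],   # observed data
    [0.5, 0.6],
    [0.03, 0.7],
    [0.9, 0.004],
    [0.3, 0.3],
]
gene_p = atpm_pvalue(toy)
```

With only four permutations the smallest attainable P-value is 1/5, which is what the toy example returns; the B = 999 permutations used in the text allow values down to 0.001.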
2.3 Pathway analysis
A pathway enrichment analysis on the neighborhood of a significant gene model can inform about the broader framework in which gene epistasis occurs. To define such neighborhoods, we adapted the network neighborhood search protocol from Yip et al. [50]. We computed the neighborhood of two genes as the list of all genes that (1) participate in any of the shortest paths between the two studied genes in the Biofilter network, once the direct link between them is removed; and (2) are also involved in a significant interaction with at least one other gene on these paths. We restricted our attention to neighborhoods containing at least 3 genes, including the 2 from the considered gene model. For each of these, we conducted a gene set enrichment analysis in relevant gene sets from the Molecular Signatures Database (MSigDB version 7) [24, 41]. We performed the enrichment analysis using a hypergeometric test, which compares the observed overlap between two sets to the overlap expected when drawing equally-sized random sets from the universe of genes. We favored the hypergeometric test over the chi-square test used in Yip et al. [50] because the neighborhoods were small and because the chi-square test is an approximation, whereas the hypergeometric test is exact. The universe set was analysis-dependent. For the Standard analysis, it contained all genes that are in an annotated pathway and can be mapped via genomic proximity to a SNP of the dataset. For the other analyses, it contained the genes that are present in Biofilter gene models, are in an annotated pathway, and can be mapped to a SNP of the dataset via the appropriate SNP-to-gene mapping. Finally, a pathway was deemed significant when the corresponding test P-value was lower than the Bonferroni threshold (0.05/(# pathways × # tested gene sets)), where only pathways containing at least one gene of the neighborhood were counted.
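For reference, a one-sided hypergeometric enrichment P-value and the Bonferroni threshold described above can be computed as in the sketch below. The function and argument names are hypothetical, and the numbers in the example are made up.

```python
from math import comb


def hypergeom_enrichment_pvalue(
    universe_size: int, pathway_size: int, neighborhood_size: int, overlap: int
) -> float:
    """P(X >= overlap) when drawing `neighborhood_size` genes from the universe."""
    max_overlap = min(pathway_size, neighborhood_size)
    total = comb(universe_size, neighborhood_size)
    tail = sum(
        comb(pathway_size, k)
        * comb(universe_size - pathway_size, neighborhood_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def bonferroni_threshold(n_pathways: int, n_gene_sets: int, alpha: float = 0.05) -> float:
    """Significance threshold alpha / (# pathways x # tested gene sets)."""
    return alpha / (n_pathways * n_gene_sets)


# Toy example: a 3-gene neighborhood sharing 2 genes with a 5-gene pathway,
# in a universe of 20 genes.
p_enrich = hypergeom_enrichment_pvalue(20, 5, 3, 2)
```

Here the tail probability is (C(5,2)·C(15,1) + C(5,3)·C(15,0)) / C(20,3) = 160/1140, so this toy overlap would not be called significant.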
3 RESULTS
3.1 Type I error
In this article, we propose a protocol for epistasis detection using a gene co-function network (Section 2.2). Due to its multi-stage nature, type I error needs to be controlled. For that purpose, we performed a permutation analysis based on 1000 permutations for each of the datasets, permuting the phenotypes and running the entire protocol to detect significant gene interactions (Table 2). This permutation procedure is independent of the one used in the proposed protocol to compute significance thresholds. When at least one significant gene interaction was observed in a permutation, that permutation was considered a false positive (FP). This allowed us to compute the type I error rate as the fraction of permutations that were false positives, $\#\mathrm{FP} / 1000$. Type I error was under control in all tested experimental settings, with estimates ≤ 6.6%.
3.2 SNP to gene mapping: Chromatin contacts map more SNPs per gene than other mappings
We obtained gene models from the Biofilter network (Section 2.2.2), and considered three analyses to obtain SNP models from these gene models (Section 2.2.1): Positional, eQTL and Chromatin. Chromatin produced the largest number of unique SNP-gene mappings (2394590), roughly 6 times more than eQTL (411120) and 14 times more than Positional (174879) (Table 1). The Chromatin analysis had on average the largest number of SNPs mapped to a gene, followed by eQTL and Positional (Fig. 2A). Nonetheless, the number of SNPs mapped to a gene varied considerably across genes (Fig. 2B). In addition, the number of SNPs mapped to the same gene varied considerably across analyses (Fig. 2C, D and E): in general, the genes with most SNPs mapped using the eQTL mapping had relatively few SNPs mapped in the Chromatin mapping, and vice versa.
3.3 The Positional analysis does not recover any SNP interaction
Using the aforementioned SNP-gene mappings, and combinations of them (cross-mappings), yielded six datasets in which we analyzed SNP models (Section 2.2.3). The resulting epistatic SNP-SNP networks are described in Table 3. Strikingly, while the Standard analysis generated the largest SNP-interaction network by number of SNPs (55 nodes/SNPs and 57 edges/interactions), the eQTL one was the largest by number of interactions (64). The Positional analysis produced no significant interactions at all.
Notably, the significant SNP interactions tended to be located in nearby genomic regions and to overlap with GWAS main effects loci (Fig. 3A). To investigate whether main effects could be driving some of the signals, even when in imperfect LD with epistatic SNP pairs (a phenomenon sometimes referred to as “phantom epistasis” [10]), we conducted a linear-regression-based test as in Section 2.2.3, but including a vector of polygenic risk scores as a covariate. The polygenic risk scores were computed with PRSice-2 [7] with the trait adjusted for PCs, and are expected to capture the variance explained by main effects. The observed effect of many significant SNP models decreased notably when we conditioned on single-SNP effects in this way (Fig. 3B), but this was not the case for all of them. The latter suggests a masking effect opposite to phantom epistasis. However, it is unclear how to adequately correct for multiple hypothesis testing after this adjustment in our setting, and in what follows we still use the unadjusted P-values, with the understanding that some of them may be inflated by weak correlations with main effects.
3.4 Gene epistasis: “functional” mappings boost discovery and interpretability
Findings of a GWAIS are often presented as a network, with nodes indicating SNPs and edges between nodes being present when the analysis protocol identifies the corresponding SNP pair as significantly interacting with the trait of interest. We converted SNP model networks into gene model epistasis networks (Fig. 4), considering an edge between genes whenever gene model significance was obtained through the previously described ATPM approach (Section 2.2.4). The largest network was obtained under Standard mapping (26 edges). The eQTL + Chromatin combinations performed second best (12, 13 edges). Since no significant SNP pairs were detected under Positional, no significant gene pairs were produced either (Table 4).
For both eQTL and Standard, most of the significant SNP models mapped to exactly one gene model, removing possible sources of ambiguity (Fig. 5A). That was less the case under the Chromatin analysis, where it was more common for the same SNP model to map to different gene models. We also investigated the relationship between significant gene models and the number of significant SNP models that mapped to them (Fig. 5B). Most significant gene interactions were supported by relatively small numbers of SNP models: either few in absolute number, or few with respect to the total number of SNP models for that significant gene model.
3.5 Biofilter boosts discovery of interpretable hypotheses
Searching for epistatic interactions exclusively across edges of the Biofilter network greatly reduces the number of tests. Yet, this gain in statistical power might not lead to greater discoveries as it potentially disregards new interactions absent from databases. Hence, we tested whether exhaustively searching for epistasis on the datasets not reduced for Biofilter models but using each mapping, led to similar results. At the SNP level (Fig. 6B, upper panel), only a small proportion of the significant interactions were still detected when the network was not used. Strikingly, that difference got smaller at the gene level (Fig. 6B, lower panel). This suggests that the significant SNP models, even if fewer in number, are strong enough to lead to the detection of the gene models.
In a similar vein, we studied the number of interactions detected by considering the overlap between the significant models detected in the different analyses. Including more SNP-gene mappings in the analysis was mostly beneficial with respect to considering one mapping at a time, since both at the gene and at the SNP level, the significant interactions in Positional + eQTL + Chromatin highly overlapped with the other analyses (Fig. 6A). Nonetheless, a few interactions were also missed in this joint analysis, in particular 20 significant SNP models detected in the eQTL analysis.
3.6 Chromatin and Standard analyses partially replicate previous studies on IBD
In the past, several genetic studies of epistasis in IBD have been conducted [14, 25, 26, 30, 33, 45]. We compared them to our results at the gene level, the minimal functional unit at which we expect genetic studies to converge. Several epistatic alterations have been reported involving interleukins [14, 26, 30]. Our Standard analysis also resulted in interactions involving three interleukins (IL-19, IL-10 and IL-23), although interacting with different genes than in the aforementioned studies. Functional analyses such as Positional + eQTL + Chromatin recovered five interleukins (IL-4, IL-5, IL-13, IL-19, IL-20). In addition, Lin et al. [25] detected interactions involving NOD2, with both IL-23R and other genes. Our Standard analysis highlighted two potentially new epistatic interactions involving NOD2.
Discoveries in the proposed protocol are guided by plausible biological interactions. Hence, every significant gene model can be traced back to a biological database, therefore producing biological hypotheses. For instance, the gene model MST1-MST1R is significant in multiple pipelines. Both genes have been linked to IBD, both by themselves [3, 6] and in interaction with other genes [48]. MST1R is a surface receptor of MST1, and, through physical interaction, they play a role in the regulation of inflammation.
3.7 Pathway analyses highlight the involvement of the extracellular matrix in IBD
Pathway enrichment analyses of each interaction’s neighborhood (Section 2.3) allowed us to identify broader biological mechanisms that the significant interaction pairs might be involved in. The eQTL analysis produced multiple significant pathways, involving the triangle of interactions formed by two genes located in 3p21.31 (HYAL1, HYAL3) and one in 7q31.32 (SPAM1) (Fig. 4). The affected pathways were related to the extracellular matrix, and specifically to glycosaminoglycan degradation. Links between the turnover of the extracellular matrix and IBD-related inflammation have been reported in the past [35]. More specifically, glycosaminoglycan [40] and hyaluronan [1] degradation products lead to an inflammatory response. When restricting attention to pathways of minimum gene size 10 and maximum gene size 500 to avoid imbalances and non-normality, four pathways are removed: cellular response to UV B, hyalurononglucosaminidase activity, hexosaminidase activity and CS/DS degradation. The Chromatin mapping and the Standard pipeline did not produce significant pathways.
3.8 The proposed pipeline increases reproducibility
GWAIS results are notoriously hard to reproduce. Hence, we studied whether our proposed pipeline led to more stable results. For that purpose, we ran the whole protocol again on a random subset of the data containing 80% of the samples. In each subset, 49% of the individuals were cases, respecting the initial proportion of cases and controls of the entire dataset. We repeated this experiment 10 times for each SNP-gene mapping. Conservatively, we used the same SNP and gene significance thresholds as for the corresponding entire dataset.
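The subsampling experiment can be sketched as follows. Both helpers (`stratified_subsample`, `recovery_rate`) and the toy gene pairs are hypothetical; the sketch only illustrates how an 80% subset that preserves the case/control ratio could be drawn, and how the fraction of gene pairs recovered from the full-cohort analysis could be measured.

```python
import random
from typing import List, Sequence, Set, Tuple


def stratified_subsample(
    labels: Sequence[int], fraction: float, rng: random.Random
) -> List[int]:
    """Indices of a random subset keeping `fraction` of each phenotype class."""
    chosen: List[int] = []
    for value in sorted(set(labels)):
        group = [i for i, lab in enumerate(labels) if lab == value]
        chosen.extend(rng.sample(group, round(fraction * len(group))))
    return sorted(chosen)


def recovery_rate(full: Set[Tuple[str, str]], subset: Set[Tuple[str, str]]) -> float:
    """Fraction of gene pairs found on the full cohort that are found again."""
    return len(full & subset) / len(full) if full else 0.0


# Toy cohort: 50 cases (1) and 50 controls (0).
labels = [1] * 50 + [0] * 50
subset_idx = stratified_subsample(labels, 0.8, random.Random(0))

# Toy gene pairs (made up) detected on the full cohort and on one subset.
full_pairs = {("A", "B"), ("C", "D"), ("E", "F")}
subset_pairs = {("A", "B"), ("E", "F"), ("X", "Y")}
rate = recovery_rate(full_pairs, subset_pairs)
```

Averaging `rate` over the 10 repetitions of each analysis would give a recovery percentage comparable across analyses.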
The Standard pipeline, which does not include Biofilter network information, produced on average 11.4 significant gene models (standard error (SE) 1.1). With the eQTL (respectively Chromatin) analysis, we detected on average 5.8 gene pairs (respectively 3.2) with SE 0.1 (respectively 0.4). Fig. 7 shows that pipelines including biological knowledge recover, on average, at least 60% of the gene pairs detected with the entire cohort (83% for eQTL and 60% for Chromatin mapping), whereas without this knowledge (Standard), we recover less than 40% of the pairs. Hence, the Standard analysis appears to be the least robust in terms of conservation of gene pairs. This shows that filtering does increase robustness at the gene level. In addition, over the 10 repetitions, the eQTL analysis highlights significant pathways only once. The detected pathways are the same as those obtained with the entire population. No enriched pathways were found for the Chromatin or the Standard analysis.
4 DISCUSSION
In this article we proposed a new protocol for epistasis detection, based on a variety of functional filtering strategies (Section 2.2), and studied its application to GWAS data for Inflammatory Bowel Disease (Section 2.1). The protocol included several components to control for type I error, thereby strengthening our belief in the discovered genetic interactions.
A common theme in the interpretation of epistasis results consists of linking the associated variants to an altered gene function. In this article, we considered 3 different such SNP-gene mappings. Notably, the number of SNP-gene correspondences provided by each mapping differed by up to an order of magnitude. Moreover, the different mappings described genes unevenly; for instance, genes that had most SNPs mapped by using a chromatin contact map had comparatively few eQTL SNPs. This motivated combining multiple mappings into one analysis (e.g. eQTL + Chromatin) in order to combine different perspectives on the epistasis process. For the most part, these complementary mappings improved the analyses, by recovering most of the interactions significant in the analyses that used one mapping at a time. Importantly, our results display the benefits of going beyond one single SNP-gene mapping (often, genomic position) to interpret epistasis results.
Restricting the tested interactions to functionally plausible pairs of genes and SNPs joins two faces of epistasis: searching for statistical epistasis, yet exclusively on plausible candidates for biological epistasis. This has several advantages. First, a more targeted input dataset reduces the number of tests and, in consequence, the multiple testing burden. In contrast, the high dimensionality of GWAIS data requires a much more stringent multiple testing correction and limits the detection of epistasis with low effect sizes. As we observed in Section 3.2, adopting one of the proposed analyses may reduce the number of SNP interactions to test by more than half. Yet, the Standard analysis, which does not use Biofilter, produces the most significant gene models. Second, the proposed protocol addresses the reproducibility issues widespread in GWAIS by producing results that are consistent at the gene and pathway levels (Section 3.8). Indeed, we observed an increased analytic robustness when using Biofilter gene models, in line with previous reports [4]. In particular, eQTL and Chromatin mappings, separate or in combination, increased said robustness. Third, restricting the search for epistasis to biologically plausible interactions yields results that are biologically interpretable and strikingly different from the ones obtained without using functional filtering (Section 3.4). Not surprisingly, different mappings also provided very different interaction signals and gave information on different genes. In particular, in Section 3.6 we corroborated that the significant gene models from different functional filters were relevant to the biology of IBD. This was especially true for the Chromatin analysis (but also the eQTL analysis), giving rise to interactions with seemingly meaningful biological underpinnings, and stressing the relevance of regulatory variants in susceptibility to IBD.
In contrast, the Standard analysis detected multiple interactions that were hard to interpret. For instance, several interactions involved RNA genes of unknown function (e.g. LOC101927272 or LINC02178).
Remarkably, while the Standard analysis produced rich results, the Positional analysis did not lead to any significant SNP models. They both use genomic position to map SNPs to genes, but Positional is restricted to gene models in Biofilter. We note that the Positional analysis does not coincide with how Biofilter is typically used on GWAS data for epistasis detection. The latter involves pooling all SNPs that are mapped to genes which occur in Biofilter-proposed gene interaction models, and subsequently exhaustively screening those SNPs for pairwise interactions. These pairs may also involve gene pairs that were not highlighted by Biofilter, in contrast to our Positional analysis. We evaluated the impact of Biofilter on the final results. No significant SNP interactions were detected in the Positional analysis. In the analysis without Biofilter (dataset reduced to mappable SNPs using genomic proximity, but not reduced to Biofilter gene pairs), 62 pairs were significant. Also, of the 86 SNP interactions that passed the experimental threshold in the Standard analysis (dataset reduced neither to SNPs mappable using genomic proximity, nor to Biofilter gene pairs), only 57 are mappable to gene interactions using genomic proximity. Hence, 66% of significant SNP pairs are mappable via genomic proximity in the Standard analysis.
An important component of our protocol is the conversion of SNP-based epistasis to gene-based epistasis. The most popular approach consists in aggregating SNP-level P-values into gene-level statistics, which can be done in different ways (see [28] for some early examples, and [46] for recent developments). Here, we developed a generic approach that exploits a permutation strategy to define a P-value cutoff for SNP interactions, at a FWER of 5%, and then we followed the original implementation of the adaptive TPM (ATPM) to accommodate several truncation thresholds at once [39] while taking permutations instead of bootstrap as in Yu et al. [51]. The two algorithms are very similar, but we favored the TPM over the rank truncated product method of Yu et al. [51] that employs the product of the L most significant P-values, because the TPM only requires P-values smaller than a specified threshold, which is in line with the output of PLINK epistasis detection and saves storage space. Following both protocols and the recommendation of Becker and Knapp [2] we included measures derived from observed data in computing statistics under the null.
Remarkably, our proposed procedure keeps type I error under control without additional corrections for multiple testing at the gene-model level. We hypothesize that this stems from two reasons. First, we apply a stringent correction for multiple testing at the SNP level. Second, when moving from SNP model significance to gene model significance, we restrict attention to significant SNP pairs in the ATPM. Hence, we do not consider any gene model that does not map to at least one such SNP model. Alternative strategies could have been considered, for instance conducting the ATPM on all gene models rather than restricting ourselves to those with significant SNP models. This could have led to increased discovery in cases where the P-values of the SNP models mapped to a gene model tend to be low, albeit non-significant. However, it may also lead to an increased type I error. Accounting for that would require a multiple testing correction at the gene level, which would be difficult since the dependency between the tests is unknown. Additionally, common multiple testing corrections would require a much higher number of permutations to obtain the necessary numerical precision.
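The two quantitative points above can be made concrete with a short sketch. It assumes a min-P style permutation scheme for the SNP-level cutoff (the smallest P-value across all SNP pairs is recorded for each permutation) and a Bonferroni-type correction for the gene-level argument. Both are illustrative assumptions rather than a description of the exact procedure used here, and the function names are hypothetical.

```python
import math
from fractions import Fraction


def fwer_cutoff(min_pvalues_per_permutation, alpha=0.05):
    """P-value cutoff such that at most a share alpha of the permutations
    has a minimum P-value at or below it (assumes a min-P scheme)."""
    ordered = sorted(min_pvalues_per_permutation)
    # Exact arithmetic avoids floating-point surprises in alpha * B
    k = math.floor(Fraction(str(alpha)) * len(ordered))
    if k == 0:
        raise ValueError("too few permutations for this alpha")
    return ordered[k - 1]


def min_permutations(n_tests, alpha=0.05):
    """Fewest permutations B such that the smallest attainable permutation
    P-value, 1 / (B + 1), reaches a Bonferroni threshold alpha / n_tests."""
    return math.ceil(Fraction(n_tests) / Fraction(str(alpha))) - 1


# Hypothetical example: 1000 gene-level tests at alpha = 0.05
needed = min_permutations(1000)
```

In this hypothetical setting, `needed` is 19999, compared with 19 permutations for a single test at the same level, which illustrates why a gene-level correction would inflate the computational cost.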
How to best perform a pathway analysis of epistasis results is understudied. Often, all genes belonging to any significant gene pair are simply pooled together into a joint enrichment analysis. This approach discards the gene-gene interaction information that was, indeed, the object of analysis. Hence, in our procedure we adapted the network neighborhood search protocol from Yip et al. [50], which considers the topology of the network using the shortest paths between the studied genes. It should be noted that we only used the topology to derive a neighborhood for each significant gene pair; then, we discarded the edge information. Yet, there are several directions for improvement. One is to exploit the topology of the epistasis network beyond the creation of a neighborhood. Another is to take into account the gene size (or the number of SNPs per gene), for instance by performing a weighted version of the statistical test. Jia et al. [20] suggested a method for gene set enrichment analysis of GWAS data that adjusts for gene length bias or the number of SNPs per gene. In our data, we did observe a link between the significance of the gene models and the number of SNPs mapped to the gene. For instance, in the eQTL analysis, the only one producing significant pathways, the median number of SNPs per gene is 385 among genes in significant pairs, versus 3 SNPs per gene genome-wide.
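To make the idea of a topology-derived neighborhood concrete, here is a simplified sketch. It assumes that the neighborhood of a gene pair is the set of genes lying on at least one shortest path between the two genes, and that enrichment is assessed with an unweighted one-sided hypergeometric test. The exact definitions in Yip et al. [50] and in our adaptation may differ, and the gene names and network below are hypothetical.

```python
from collections import deque
from math import comb


def bfs_distances(adj, source):
    """Unweighted shortest-path distances from source to every reachable gene."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def shortest_path_neighborhood(adj, gene_a, gene_b):
    """Genes on at least one shortest path between gene_a and gene_b,
    endpoints included. Edge information is discarded in the result."""
    dist_a = bfs_distances(adj, gene_a)
    dist_b = bfs_distances(adj, gene_b)
    if gene_b not in dist_a:  # disconnected pair: keep only the two genes
        return {gene_a, gene_b}
    d = dist_a[gene_b]
    return {g for g in dist_a if g in dist_b and dist_a[g] + dist_b[g] == d}


def hypergeom_pvalue(n_universe, n_pathway, n_selected, n_overlap):
    """One-sided hypergeometric P-value, P(X >= n_overlap)."""
    upper = min(n_pathway, n_selected)
    favorable = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return favorable / comb(n_universe, n_selected)


# Hypothetical undirected network stored as an adjacency dictionary
adj = {
    "G1": ["G2", "G4"], "G2": ["G1", "G3"], "G3": ["G2", "G4", "G5"],
    "G4": ["G1", "G3"], "G5": ["G3"],
}
neighborhood = shortest_path_neighborhood(adj, "G1", "G3")
pathway = {"G2", "G4"}  # hypothetical pathway annotation
p_enrich = hypergeom_pvalue(len(adj), len(pathway), len(neighborhood),
                            len(neighborhood & pathway))
```

A weighted variant of the test, for example one that accounts for the number of SNPs per gene, would replace `hypergeom_pvalue` while leaving the neighborhood construction unchanged.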
Several protocol changes may impact final results. As reported elsewhere [4], these changes or choices include the modelling framework (parametric, non-parametric, semi-parametric), the encoding of the genetic markers, and LD handling. In this work, we used an additive encoding scheme (0, 1, 2 indicating the number of copies of the minor SNP allele), a popular choice in part because of its computational efficiency. However, this encoding scheme has been reported to increase false positives (for instance [44]). This observation is based on type I error studies with data generated under the null hypothesis of no pairwise genetic interactions but in the presence of main effects (see for instance [21]). Here, we investigated the type I error control of our protocols under a general null hypothesis of no genetic associations with the trait (no interactions and no main effects) and established adequate control. However, this does not guarantee that our SNP interaction results were not overly optimistic. To address this, we adjusted SNP-level epistasis P-values for main effects as comprised in a polygenic risk score. Not only does such a post-analysis adjustment via conditional regression reduce over-optimism due to inadequate control for lower-order effects, thus addressing phantom epistasis [10], but it may also occasionally highlight the masking of SNP interactions (as was shown in Fig. 3B - eQTL). More work is needed to investigate the impact on the gene-level interaction results derived from these adjusted P-values. For convenience, we used the regression framework to identify SNP interactions and relied on earlier recommendations regarding LD handling [18].
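The additive encoding and the adjustment for a polygenic risk score can be pictured as columns of a regression design matrix. The sketch below only builds that matrix; fitting the model (for instance by logistic regression) is left to any standard routine. The function names, the genotype format and the toy individuals are assumptions for illustration, and the exact form of the conditional regression used here may differ.

```python
def additive_code(genotype, minor_allele):
    """Number of copies (0, 1 or 2) of the minor allele in a two-letter genotype."""
    if len(genotype) != 2:
        raise ValueError("expected a two-letter genotype such as 'AG'")
    return sum(1 for allele in genotype if allele == minor_allele)


def design_row(g1, g2, prs):
    """Columns: intercept, SNP 1, SNP 2, SNP 1 x SNP 2 interaction, PRS covariate."""
    return [1.0, float(g1), float(g2), float(g1 * g2), float(prs)]


# Hypothetical individuals: (genotype at SNP 1, genotype at SNP 2, PRS)
individuals = [("AA", "CT", -0.3), ("AG", "TT", 1.2), ("GG", "CC", 0.4)]
X = [
    design_row(additive_code(a, "G"), additive_code(b, "T"), prs)
    for a, b, prs in individuals
]
```

In such a layout, the test of interest concerns the coefficient of the fourth column (the interaction term), while the last column absorbs main effects summarized by the risk score.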
Our protocols are built on output from Biofilter, which can be presented as a co-functional gene network. One of the motivations was its proven ability to highlight meaningful interactions in a narrower alternative hypothesis space, at the expense of leaving parts of the interaction search space unexplored. The database that Biofilter built contained 37266 interactions. This is notably smaller than other gene interaction databases, like HINT ([9], 173797 interactions) or STRING ([42], 11759455 interactions). Testing gene interactions with other (combinations of) biological interaction networks was beyond the scope of this paper. Furthermore, Biofilter analysis and exhaustive screening may lead to non-overlapping results. An example within a regression context is given in [4].
4.1 CODE AVAILABILITY
The code necessary to reproduce this article’s results and analyses is available on GitHub at https://github.com/DianeDuroux/BiologicalEpistasis. Additionally, we prepared network_epistasis.nf, a dataset agnostic Nextflow version of the proposed pipeline, available at https://github.com/hclimente/gwas-tools.
5 CONCLUSION
In this study we presented a protocol to enhance the interpretation of epistasis screening from GWAS. It includes gene-level epistasis discoveries with type I error under control, as well as a network-guided pathway analysis. Moreover, it improves the robustness of the results, making epistasis detection more reproducible. Aggregating SNP-level results into gene-level epistasis is challenging, but it allows the inclusion of information from biological interaction databases. Based on that, we conducted multiple analyses that use different sources of prior biological knowledge about SNP-to-gene relationships and gene interaction models, as well as rigorous statistical approaches to assess significance. Each of them offers a different, albeit complementary, view of the disease, which leads to additional insights.
Their application to GWAS data for inflammatory bowel disease highlighted the potential of our strategy, including network-guided pathway analysis, as it recovered known aspects of IBD while capturing relevant and previously unreported features of its genetic architecture. These strategies will contribute to identifying gene-level interactions from SNP data for complex diseases, and to increasing confidence in these findings.
Funding and acknowledgements
We thank the International IBD Genetics Consortium for data collection and processing and for interesting discussions. Computational resources have been provided by the Consortium des Equipements de Calcul Intensif (CECI), funded by the Fonds de la Recherche Scientifique de Belgique (F.R.S.-FNRS) under Grant No. 2.5020.11 and by the Walloon Region. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreements No 666003 and 813533. KVS acknowledges opportunities and funding provided by WELBIO (Walloon Excellence in Life sciences and BIOtechnology). C-A.A. acknowledges funding from Agence Nationale de la Recherche (ANR-18-CE45-0021-01).
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].