Abstract
As the size of genome-wide association studies (GWAS) increases, detecting interactions among single nucleotide polymorphisms (SNPs) or genes associated with particular phenotypes is garnering more and more interest as a means to decipher the full genetic basis of complex diseases. Systematically testing for interactions is, however, challenging both from a computational and from a statistical point of view, given the large number of possible interactions to consider. In this paper, we propose a framework to identify pairwise interactions with a particular target variant, using a penalized regression approach. Narrowing the scope of interaction identification around a predetermined target provides increased statistical power and better interpretability, as well as computational scalability. We compare our new methods to state-of-the-art techniques for epistasis detection on simulated and real data, and demonstrate the benefits of our framework to identify pairwise interactions in several experimental settings.
1 Introduction
The amount of data generated by genome-wide association studies (GWAS) has dramatically increased in the last few years, and more diseases are now being tackled with larger cohorts. Nevertheless, despite this tangible progress, our understanding of complex diseases is still limited. The classical approach in GWAS is to test each single nucleotide polymorphism (SNP) marginally for association with the phenotype of interest, while correcting for multiple hypothesis testing. However, this fails to explain most of the phenotypic variance known to be heritable, a phenomenon known as missing heritability. Epigenetics and rare variants with small to moderate effects are among the reasons advanced to explain the limitations of GWAS 1,2. In addition, high-order epistatic effects, one of the main hypotheses behind missing heritability 3, are not taken into account in marginal testing.
By constructing additive models of significant SNPs, only a small fraction of the missing heritability, as measured by narrow-sense heritability 3, is explained. For instance, the explained heritability for type II diabetes stands at 6% 4. For height, an extensively-studied trait, the explained proportion is only 5% 5. By revealing genetic interactions, epistasis can give an insight into the complex mapping between genotype and phenotype that cannot be extracted from marginal association testing. For instance, several epistatic mechanisms have been highlighted in the onset of Alzheimer's disease 6. Most notably, the interaction between the two genes BACE1 and APOE4 was found to be significant in four distinct datasets.
Epistasis can be defined from two different angles: biological epistasis and statistical epistasis. The definition of statistical epistasis dates back to Fisher 7, who characterized it as the departure from additivity in a mathematical model relating multilocus genotypes to phenotypic variation. A number of strategies deployed in the context of statistical epistasis are reviewed in Cordell 8 and Niel et al. 9. The strategies can be partitioned into two main categories: gene-gene interactions and SNP-SNP interactions. Approaching epistasis from the angle of gene-gene interactions is consistent with the definition of biological epistasis 10 as biomolecular or protein-protein interactions. Aggregator 11 and EigenEpistasis 12 are examples of tools with gene-gene interaction statistics as final output. In particular, Aggregator combines SNP-SNP interaction statistics to construct gene-level statistics. Exhaustive SNP-SNP interaction testing is still the most popular approach. It requires correcting for multiple testing using procedures such as the Bonferroni correction 13, a typical family-wise error rate (FWER) controlling approach, or the Benjamini-Hochberg procedure 14, an example of the less stringent false-discovery rate (FDR) procedures. For all procedures, the correction comes at the cost of poor statistical power 15. For second-order interactions, billions of SNP pairs must be tested, which impacts the statistical power; the decrease is even greater for higher-order interactions. Moreover, exhaustive testing beyond second-order interactions is still unfeasible in reasonable time 16. For increased speed, the current state-of-the-art BOOST 17 and its GPU derivative 18 add a preliminary screening stage that ensures the survival of significant interactions.
Another fast interaction search algorithm in the high-dimensional setting is the xyz-algorithm 19, which considers the interaction problem from a different perspective. Instead of assessing the dependency between the product of two variables and an outcome, the pair of interest is formed by a first variable and the Hadamard product of the outcome and a second variable. To reduce the computational overhead, the pair is projected on a set of random vectors. On the LURIC 20 GWAS dataset, the xyz-algorithm tested more than 10¹¹ interactions while being about as fast as a two-stage LASSO 21.
In addition to exhaustive statistical testing, one can also fit exhaustive regression models with linear ("marginal") effect terms and quadratic ("interaction") terms. For a better inference of the interactions, Bien et al. 22 introduced hierNET, a LASSO with hierarchy constraints between univariate and bivariate terms. When the truth is hierarchical, hierNET outperforms exhaustive regression models. Though the hierarchy constraint is plausible for many applications, it severely limits the scalability of the method to high-dimensional problems, particularly GWAS: the current release of hierNET 22 handles only hundreds of predictors.
By contrast, instead of constructing exhaustive models, we focus on expanding knowledge around predetermined loci, which we refer to as "targets" in what follows. Such targets can be drawn from the literature, from experiments or from top hits in previous GWAS. Exhaustive genome-scale models with all pairwise terms are often computationally intensive and suffer from low statistical power; leveraging formerly identified SNPs is then a sensible option. Fewer interactions have to be studied, with the additional guarantee that the target affects the phenotype in question. Nonetheless, such a partial study should account for other effects of both the target and the rest of the genotype that are not due to their interaction; failure to address this issue can bias the results. In the epistasis literature, methods with such properties are lacking. Similar problems are encountered in clinical trials, where the goal is to infer the treatment response variation uniquely due to the interaction between the treatment assignment and the clinical covariates. Propensity score 23 techniques, developed specifically for this reason, are a common approach to achieve that. We therefore draw on those models to propose a family of model selection methods that robustly infer second-order interactions with a fixed SNP, through the formulation of different L1-penalized regression problems. Given the high-dimensional setting, sparsity-aware methods like the LASSO are well suited for model selection in genomic applications. The first category of methods developed in this work are regression approaches where the outcome combines the phenotype, the target and propensity-like quantities, and the candidate SNPs are used as covariates. We also present a weighted binary classification approach, where the outcome is the target, while the phenotype is included in the sample weights with the propensity score.
A by-product of our work is a new framework to estimate conditional probabilities within the genome using the semi-parametric representation of the chromosomes developed for fastPHASE 24.
In the statistical literature, the selection of causal variants is a support recovery problem. For parameterized models like the LASSO, stability selection 25 is an attractive option as a model selection procedure. It aggregates the empirical selection probabilities for each variable along the LASSO path while controlling for the family-wise error rate. The original feature importance criterion in stability selection is the maximal selection probability along the stability path. In our work, we use as a criterion the area under the stability path because it better accounts for the early stages of the stability path.
In this paper, we propose a new framework to study epistasis by focusing only on the synergies with a predetermined target. By proceeding this way, the methods developed in this work improve the recovery of interacting SNPs compared to standard methods like GBOOST or a LASSO with interaction terms. We evaluate the performance of our methods against two baseline models on simulated GWAS for several types of disease models. We also conduct a case study on a real GWAS dataset for type II diabetes to demonstrate the scalability of our methods and to investigate the differences between their results.
2 Material and Methods
2.1 Setting and notations
We model genotypes and phenotypes as a triplet of random variables (X, A, Y), where Y is a discrete (e.g., in case-control studies) or continuous phenotype, X = (X1,⋯, Xp) ∈ {0, 1, 2}p represents a genotype with p SNPs, and A is a (p + 1)-th target SNP of interest. The reason why we split the p + 1 SNPs into X and A is that our goal is to detect interactions involving A and other SNPs in X. We restrict ourselves to a binary encoding of A in {–1, +1}, which allows us for example to study both recessive and dominant phenotypes, depending on how we binarize the SNP represented in A. We also introduce a version of the binarized target SNP taking values in {0, 1} by letting à = (A + 1)/2.
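As a toy illustration of these encodings, sketched here in Python (the dominant binarization below is our own assumption for the example; any other binarization of A works the same way):

```python
import numpy as np

# Hypothetical genotype calls for the target SNP, in {0, 1, 2}.
raw_target = np.array([0, 1, 2, 0, 2])

# Dominant binarization (assumed for illustration): any copy of the
# minor allele maps to +1, otherwise -1, giving A in {-1, +1}.
A = np.where(raw_target > 0, 1, -1)

# Binarized version in {0, 1}: A_tilde = (A + 1) / 2.
A_tilde = (A + 1) // 2
```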
The target SNP A being binary, it is always possible to decompose the genotype-phenotype relationship as

Y = μ(X) + δ(X) A + ∊,   (1)

where ∊ is a zero-mean random variable and

μ(X) = (𝔼[Y | X, A = +1] + 𝔼[Y | X, A = −1]) / 2,   δ(X) = (𝔼[Y | X, A = +1] − 𝔼[Y | X, A = −1]) / 2.   (2)
With these notations, we see from (1) that the term δ(X) A represents the marginal effect of A as well as synergistic effects between A and all SNPs in X. In the context of genomic data, we can interpret these synergies as pure epistatic effects. Furthermore, if δ(X) is sparse in the sense that it only depends on a subset of elements of X (which we call the support of δ), then the SNPs in the support of δ are the ones interacting with A. In other words, searching for epistasis between A and SNPs in X amounts to searching for the support of δ.
A GWAS dataset is a set of genotype-phenotype triplets (Xi, Ai, Yi)i=1,⋯,n, which we model as independently and identically distributed according to the law of (X, A, Y). To estimate the support of δ from GWAS data, we propose below several models based on sparse regression and classification. The common thread between them is the use of propensity scores, which model the linkage disequilibrium (LD) dependency between the target SNP A and the rest of the genotype X. Mathematically, the propensity score π(A|X) corresponds to the conditional probability of A given X. The balancing through the propensity scores filters out the common effects of the SNPs within X to only retain the synergistic effects with the target A. The first family of methods we propose all fall under the modified outcome banner. In these models, an outcome that combines the phenotype Y with the target SNP A and the propensity score π(A|X) is fitted linearly to the genomic covariates X. We propose several variants of this approach, based on different normalizations of π(A|X) to control for estimation errors. Our second proposal is a case-only method based on the framework of outcome weighted learning (OWL) developed by Zhao et al. 26. In this model, which is a weighted binary classification, the outcome is the target SNP A, and the covariates are the rest of the genotype X. The phenotype and the propensity score π(A|X) are incorporated in the sample weights Y / π(A|X). The following subsections (Sections 2.2 and 2.3) elaborate on those methods. Section 2.4 details our approach for estimating the propensity score π(A|X). Finally, Section 2.5 explains how we perform model selection through stability selection.
If not stated otherwise, the full data pipeline is written in the R language. A comprehensive package covering the methods presented in this paper will soon be made available on Bioconductor 27.
2.2 Modified outcome regression
For a given sample, only one of the two possibilities A = +1 or A = −1 is observed, making the direct estimation of δ(X) using (2) impossible from empirical GWAS data. The propensity score π(A|X) comes into play to circumvent this problem. By considering Ã = (A + 1)/2 ∈ {0, 1}, we can indeed rewrite (2) as:

δ(X) = 𝔼[ Ã Y / π(Ã = 1|X) − (1 − Ã) Y / π(Ã = 0|X) | X ] / 2.
Given an estimate of π(Ã|X), we define the modified outcome Ỹ of an observation (X, A, Y) as:

Ỹ = ( Ã / π(Ã = 1|X) − (1 − Ã) / π(Ã = 0|X) ) Y / 2,   (3)

and re-express (2) simply as

δ(X) = 𝔼[Ỹ | X].   (4)
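The transformation can be sketched as follows (a minimal sketch, assuming the standard inverse-propensity-weighted contrast; the constant factor 1/2 is a convention tied to the {−1, +1} encoding of A):

```python
import numpy as np

def modified_outcome(Y, A_tilde, pi1):
    """Modified outcome as an inverse-propensity-weighted contrast.
    This form is an assumption of the sketch, up to constant factors.
    pi1 is the estimated propensity P(A_tilde = 1 | X)."""
    return 0.5 * Y * (A_tilde / pi1 - (1 - A_tilde) / (1 - pi1))

# Toy data: phenotype, binarized target, and invented propensity estimates.
Y = np.array([1.0, 0.0, 1.0, 1.0])
A_tilde = np.array([1, 0, 1, 0])
pi1 = np.array([0.5, 0.25, 0.8, 0.5])
Y_mod = modified_outcome(Y, A_tilde, pi1)
```

The pairs (Xi, Ỹi) can then be fed to any sparse regression solver.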
We note that our definition of modified outcome (3) generalizes that of Tian et al. 28 where it is defined as Ỹ = Y Ã; both definitions are equivalent in the specific situation considered by Tian et al. 28 where A and X are independent, i.e., P (Ã = 1|X) = P (Ã = 1), and furthermore P (Ã = 1) = 1/2. Our definition (3) is valid when A and X are not independent.
Given (4), we can estimate the support of δ from GWAS data by first transforming them into genotype - modified outcome pairs (Xi, Ỹi)i=1,⋯,n, and then applying a model for support recovery in sparse regression of Ỹ given X. For that purpose we implement a stability selection procedure explained below.
Furthermore, we propose several alternative definitions of Ỹ, which improve numerical stability and large-sample variance by controlling for the inverse of the propensity score π(A|X). A first alternative, which we call normalized modified outcome, separately normalizes the inverse propensity weights of cases and controls so that each group's weights sum to one. It is consistent and was found in empirical studies to have a lower variance than the original modified outcome estimator 29.
However, both estimators may suffer from numerical instability because of the inverse propensity score weighting. If the conditional probabilities π̂(Ai = 0|Xi) or π̂(Ai = 1|Xi) are small, the weight attributed to sample i can be very large relative to the other samples. The use of the inverse of the propensity scores is well-studied in the statistical literature 29,30. A second alternative definition of Ỹ, which we call shifted modified outcome, simply consists in the addition of a small term ξ = 0.1 to obtain an upper bound on the inverse of the propensity scores:

Ỹ = ( Ã / (π(Ã = 1|X) + ξ) − (1 − Ã) / (π(Ã = 0|X) + ξ) ) Y / 2.
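A minimal sketch of the shift, assuming ξ is simply added to the estimated propensity before inversion, which bounds every weight by 1/ξ:

```python
import numpy as np

XI = 0.1  # shift term from the text

def shifted_inverse_weights(pi, xi=XI):
    """Bound the inverse-propensity weights by 1/xi by shifting the
    estimated propensity before inverting (illustrative sketch)."""
    return 1.0 / (pi + xi)

# A near-zero propensity no longer produces an exploding weight.
pi_hat = np.array([0.001, 0.5, 0.95])
w = shifted_inverse_weights(pi_hat)
```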
The last approach within this category, which we call robust modified outcome, is closely related to modified outcome and normalized modified outcome: all three are solutions to the same system of estimating equations, parameterized by a pair (η0, η1), where μ1 = 𝔼[𝔼[Y(1)|X]] and μ0 = 𝔼[𝔼[Y(0)|X]] denote the mean potential outcomes.
For all choices of (η0, η1), the resulting estimator is consistent for the average risk difference 𝔼[δ(X)]. Modified outcome corresponds to the case (η0, η1) = (μ0, μ1), while (η0, η1) = (0, 0) yields the second estimator, normalized modified outcome. Robust modified outcome is the solution to this system with the smallest large-sample variance: its expression is obtained by plugging in empirical estimates of the values of η0 and η1 that respectively minimize the large-sample variances of μ̂0 and μ̂1. For more details, we refer the reader to Lunceford and Davidian 29.
2.3 Outcome weighted learning
Inspired by the OWL model of Zhao et al. 26 in the context of randomized clinical trials, we now propose a second formulation as a weighted binary classification problem to estimate δ(X) and its support. Like OWL, this formulation amounts mathematically to predicting A from X, where errors are penalized more or less depending on Y. We assume in this section that Y takes only nonnegative values, e.g., Y ∈ {0, 1} for a case-control study. To take into account the dependency between A and X, we extend the OWL definition and consider the following function:

d* = argmin_d 𝔼[ (Y / π(A|X)) ϕ(A d(X)) ],   (5)

where ϕ is a non-increasing loss function such as the logistic loss:

ϕ(u) = log(1 + e^(−u)).   (6)
The reason to consider this formulation is that:
Lemma 1.
The solution d* to (5)-(6) is:

d*(x) = log( 𝔼[Y | X = x, A = +1] / 𝔼[Y | X = x, A = −1] ).
Proof. For any x ∈ {0, 1, 2}p, we see from (5) that d*(x) must minimize the function l : ℝ → ℝ defined by

l(u) = 𝔼[Y | X = x, A = +1] ϕ(u) + 𝔼[Y | X = x, A = −1] ϕ(−u),

which is minimized when l′(u) = 0, i.e., when u = log( 𝔼[Y | X = x, A = +1] / 𝔼[Y | X = x, A = −1] ). □
Lemma 1 clarifies how d* is related to δ: while δ is the difference of the expected phenotype conditioned on the two alternative values of A, d* is the log-ratio of the same two quantities. In particular, both functions have the same sign for any genotype X. Hence we propose to estimate d* and its support, as an approximation and alternative to estimating δ and its support, in order to capture epistatic phenomena with A.
For a given sample (X, A, Y) if we define the weight W = Y /π(A|X), we can interpret d* in (5) as a logistic regression classifier that predicts A from X, with errors weighted by W. Hence d* and its support can be estimated from GWAS data by standard tools for weighted logistic regression and support estimation; we implement a stability selection procedure combined with elastic net regularized logistic regression, explained below.
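A minimal sketch of such a weighted logistic regression, fitted here by plain gradient descent (our pipeline uses penalized solvers; the uniform weights below are only to keep the toy example self-contained, whereas the method would use W = Y / π(A|X)):

```python
import numpy as np

def weighted_logistic(X, A01, W, lr=0.1, n_iter=500):
    """Gradient descent on the W-weighted logistic loss.
    A01 in {0, 1} is the binarized target; W are the sample weights."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (W * (prob - A01)) / n   # gradient of the weighted loss
        beta -= lr * grad
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# The target depends on the first covariate only (true coefficient +1).
A01 = (rng.random(200) < 1.0 / (1.0 + np.exp(-X[:, 0]))).astype(float)
W = np.ones(200)   # uniform weights: reduces to ordinary logistic regression
beta = weighted_logistic(X, A01, W)
```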
In the case of qualitative GWAS studies, we encode Y as 0 for controls and 1 for cases. The regression weights W of controls thus become 0, resulting in a case-only approach for epistasis detection. Tools such as PLINK 31 and INTERSNP 32 implement optional case-only analyses, which can be more powerful in practice than a joint case-control analysis 8,33,34,35. In the case of PLINK and INTERSNP, additional hypotheses such as the independence of gene-gene frequencies are needed to ensure the validity of the statistical test. In our case, the family of weights {Wi = 1/π(Ai|Xi)}i=1,⋯,n corrects for the dependency between the target A and the genotype X. We can therefore forego such hypotheses on the data. We may even argue that the controls are indirectly included in the regression model through π(A|X), which represents the dependency pattern within the general population, and not only within cases.
2.4 Estimate of the propensity score
As mentioned above, the propensity score π(A|X) is a standard tool in clinical trials. In that context, A is the treatment assignment, X are the clinical covariates, and the outcome is the treatment response. We are interested in the interaction between the treatment and the covariates to understand the main drivers of treatment response. Practitioners often opt for a parametric model for the propensity score π(A|X), e.g. a logistic regression of the treatment assignment on the covariates.
It is common practice to include a number of higher-order terms to model the interaction between the clinical covariates within X. The included variables are preferably either causal (related to the response) or confounding variables (related to both the response and the treatment assignment).
For single-nucleotide polymorphisms, a similar logistic regression model could in principle capture the structural dependence between the target of interest A and the rest of the genotype X. Because of the ultra-high-dimensional setting and the linkage disequilibrium along the chromosomes, we opt instead for a more structure-aware model, namely a hidden Markov model 24. The hidden states represent contiguous clusters of phased haplotypes, and the emission states correspond to SNPs. Several authors 24,36,37,38 advocate this model as a more flexible representation than haploblocks 39. Our choice was also motivated by the heavy skewness towards 0 and 1 of propensity score distributions estimated with regression models, a consequence of severe overfitting. In Appendix A, we provide a full characterization of this model.
The hidden Markov model representation of the genome was developed to perform imputation, and has essentially remained confined to that application. For example, the fastPHASE software 24 based on this model leads to near-perfect imputation results, with error rates typically lower than 0.01. Among other applications, this representation has been used to construct knockoff copies of SNPs 40 to control the false discovery rate in GWAS 41. The estimate of the propensity scores π(A|X) is a new application of this representation in the context of genome-wide association studies.
Since the structural dependence is chromosome-wise, we only retain the SNPs located on the same chromosome as the SNP A, which we denote here by XA. Mathematically, this amounts to assuming that A and XA are independent of the SNPs on all other chromosomes.
The pathological cases π(A|XA) ≈ 1 and π(A|XA) ≈ 0 can be avoided by the removal of all SNPs within a certain distance of A. In our implementation, we first performed an adjacency-constrained hierarchical clustering of the SNPs located on the chromosome of the target A. We fixed the maximum correlation threshold at 0.5. To alleviate strong linkage disequilibrium, we then discarded the SNPs within a three-cluster window of SNP A. Such filtering is sensible since we are looking for biological interactions between functionally-distinct regions. The neighboring SNPs are not only removed for the estimation of the propensity score, but also in the regression models searching for interactions. Other alternatives do exist to control the tails of the conditional distribution π̂(A|X). A straightforward approach is to trim the upper and lower percentiles (often the 1st and 99th percentiles) 42,43.
After fitting the unphased genotype model using fastPHASE, the last remaining step is the application of the forward algorithm 44 to obtain estimates of the likelihoods of the two potential observations (A = 1, XA) and (A = −1, XA). Bayes' theorem then yields the desired propensity scores π(A|XA) = π(A|X).
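The forward computation and the Bayes step can be sketched on a toy discrete HMM (all transition and emission numbers below are invented for illustration; fastPHASE estimates them from data, and symbol 1 stands for A = +1, symbol 0 for A = −1):

```python
import numpy as np

def forward_loglik(obs, start, trans, emit):
    """Log-likelihood of an observation sequence under a discrete HMM,
    computed with the forward algorithm and per-step rescaling.
    start: (K,) initial distribution; trans: (K, K); emit: (K, M)."""
    alpha = start * emit[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

# Toy 2-state, 2-symbol HMM.
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
emit = np.array([[0.8, 0.2], [0.2, 0.8]])

# Propensity via Bayes: compare the joint likelihoods of the two
# completed sequences (A = +1, X_A) and (A = -1, X_A).
x = [0, 0, 1]
l1 = np.exp(forward_loglik([1] + x, start, trans, emit))
l0 = np.exp(forward_loglik([0] + x, start, trans, emit))
pi_A1 = l1 / (l1 + l0)
```

Here the neighboring observations favor the state that emits symbol 0, so pi_A1 falls below one half.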
2.5 Support estimation
In order to estimate the support of δ in the case of modified outcome regression (4), and of d* in the case of OWL (5), we model both functions as linear models and estimate non-zero coefficients by elastic net regression 45 combined with stability selection 25.
More precisely, given a GWAS cohort (Xi, Ai, Yi)i=1,⋯,n, we first define empirical risks for a candidate linear model x ↦ γ⊤x for δ and d* as, respectively,

R1(γ) = (1/n) Σi=1,⋯,n (Ỹi − γ⊤Xi)²   and   R2(γ) = (1/n) Σi=1,⋯,n Wi ϕ(Ai γ⊤Xi),

where Wi = Yi / π(Ai|Xi).
For a given regularization parameter λ > 0 and empirical risk R = R1 or R = R2, we then define the elastic net estimator:

γ̂λ ∈ argmin_γ R(γ) + λ [ (1 − s) ‖γ‖1 + s ‖γ‖2² ],

where we fix s = 10⁻⁶ to give greater importance to the L1-penalization. Over a grid of values Λ for the penalization parameter λ, we subsample the whole cohort N = 50 times without replacement. The size of the generated subsamples I1,⋯, IN is ⌊n/2⌋. Each subsample I provides a different support for γ̂λ, which we denote Ŝλ(I). For λ ∈ Λ, the empirical frequency of the variable Xk entering the support is then given by:

σ̂λ(k) = (1/N) Σj=1,⋯,N 1{k ∈ Ŝλ(Ij)}.
In the original stability selection procedure 25, the decision rule for including the variable k in the final model is whether its maximal empirical selection frequency along the stability path exceeds a predefined threshold t. For noisy high-dimensional data, the maximal empirical frequency along the stability path may not be sufficiently robust. In line with the results of Haury et al. 46, we found that the area under the stability path is a better criterion for model selection. The main intuition behind its better performance is the early entry of causal variables into the LASSO path.
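The contrast between the two criteria can be illustrated on an invented stability path, where freq[k, l] denotes the selection frequency of variable k at grid position l:

```python
import numpy as np

# Toy 3-variable, 5-point stability paths (all frequencies invented).
freq = np.array([
    [0.9, 0.8, 0.6, 0.4, 0.2],    # enters early with high frequency
    [0.0, 0.1, 0.2, 0.95, 0.95],  # enters late, but with a higher maximum
    [0.1, 0.1, 0.1, 0.1, 0.1],    # noise variable
])

max_score = freq.max(axis=1)     # original stability-selection criterion
area_score = freq.mean(axis=1)   # area under the path (uniform grid positions)
```

Under the maximum criterion the late entrant (variable 1) outranks the early one (variable 0), whereas the area criterion rewards early entry, matching the intuition above.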
Finally, to determine the grid Λ, we make use of the R package glmnet 47. We generate a log-scaled grid of 200 values (λl)l=1,⋯,200 between λ1 = λmax and λ200 = λmax/100, where λmax is the maximum λ leading to a non-zero model. To improve the inference, we only retain the first half of the path, between λ1 and λ100. The benefit of a thresholded regularization path is to discard a large number of irrelevant covariates that enter the support for low values of λ.
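A sketch of this grid construction (λmax is set to 1.0 here purely for illustration; in practice it is returned by glmnet):

```python
import numpy as np

lambda_max = 1.0  # placeholder; glmnet computes this from the data

# Log-scaled, decreasing grid of 200 values between lambda_max
# and lambda_max / 100.
grid = np.geomspace(lambda_max, lambda_max / 100, num=200)

# Keep only the first half of the path, as described in the text.
grid = grid[:100]
```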
3 Results
3.1 Simulations
Disease model
We simulate phenotypes using a logit model with the following structure:

logit ℙ(Y = 1 | X, A) = (β0,V + A β1,V)⊤ XV + βW⊤ XW + βZ⊤ (XZ1 ⊙ XZ2),

where V, W, Z1 and Z2 are random subsets of {1,⋯, p}. The variables within the vector XV interact with A. In the disease model, we also included two other sets of variables, XW and XZ1 ⊙ XZ2 (the element-wise product of XZ1 and XZ2). The variables XW correspond to marginal effects, while XZ1 ⊙ XZ2 corresponds to quadratic effects. The effect sizes β0,V, β1,V, βW and βZ are sampled from 𝒩 (0, 1). Given the symmetry around 0 of the effect size distributions, the simulated cohorts are approximately equally balanced between cases and controls.
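One plausible instantiation of such a simulation, sketched in Python (the exact form of the disease model and the overlap constraints between subsets are simplified away here; all sizes and seeds are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 1000, 50, 8

X = rng.integers(0, 3, size=(n, p)).astype(float)  # toy genotypes in {0, 1, 2}
A = rng.choice([-1, 1], size=n)                    # binary target SNP

# Random causal subsets (overlaps not controlled in this toy sketch)
# and N(0, 1) effect sizes.
V, W, Z1, Z2 = (rng.choice(p, k, replace=False) for _ in range(4))
b0V, b1V, bW, bZ = (rng.normal(size=k) for _ in range(4))

logit = X[:, V] @ b0V + A * (X[:, V] @ b1V) + X[:, W] @ bW \
        + (X[:, Z1] * X[:, Z2]) @ bZ
Y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
```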
To account for the diversity of effect types in disease models, we simulate four scenarios with different overlap configurations between XV and (XW, XZ1). For each of the scenarios detailed below, we conducted 125 simulations: 5 sets of causal SNPs {A, V, W, Z1, Z2} × 5 sets of effect sizes × 5 replicates.
Synergistic only effects, |V ∩ W | = 0, |V ∩ Z1| = 0, |V | = |W | = |Z1| = |Z2| = 8;
Partial overlap between synergistic and marginal effects, |V ∩ W | = 4, |V ∩ Z1| = 0, |V | = |W | = |Z1| = |Z2| = 8;
Partial overlap between synergistic and quadratic effects, |V ∩ W | = 0, |V ∩ Z1| = 4, |V | = |W | = |Z1| = |Z2| = 8;
Partial overlap between synergistic and quadratic/marginal effects, |V ∩ W | = 2, |V ∩ Z1| = 2, |V | = |W | = |Z1| = |Z2| = 8.
Because of the filtering window around the SNP A, the causal SNPs indexed by V, W, Z1 and Z2 were sampled outside of that window. The second constraint on the causal SNPs is a lower bound on the minor allele frequencies (MAF), which we fixed at 0.2. The goal is to obtain well-balanced marginal distributions for the different variants; for rare variants, it is difficult to untangle the statistical power of any method from the inherent difficulty in detecting them. The lower bound is also coherent with the common disease-common variant hypothesis 48: the main drivers of complex/common diseases are common SNPs.
Genotype simulations
We simulated genotypes using the second release of HAPGEN 49. The underlying model for HAPGEN is the same hidden Markov model described in Appendix A. The starting point is a reference set of population haplotypes. The accompanying haplotypes dataset is the 1000 Genomes phase 3 reference haplotypes 50. In our simulations, we only use the European population samples. The second input to HAPGEN is a fine scale recombination map. Consequently, the simulated haplotypes/genotypes exhibit the same linkage disequilibrium structure as the original data.
In comparison to the HAPGEN-generated haplotypes, the final marker density of SNP arrays is significantly reduced. For example, the genotyping array for the WTCCC case-control consortium 51 is the Affymetrix 500K. As its name suggests, "only" five hundred thousand positions are genotyped. As most GWAS are based on SNP array data, we only extract from the simulated genotypes the markers of the Affymetrix 500K. In the subsequent QC step, we only retain common bi-allelic SNPs, defined by a MAF > 0.01. We also remove SNPs that are not in Hardy-Weinberg equilibrium (p < 10⁻⁶).
For iterative simulations, HAPGEN can be time-consuming, notably for large cohorts of thousands of samples. Instead, we proceed in the following way: we generate once and for all a large dataset of 20 thousand samples on the 22nd chromosome. To benchmark for varying sample sizes n ∈ {500, 1000, 2000, 5000}, we repeatedly draw n individuals uniformly and without replacement from the population of 20 000 to create 125 case-control cohorts. On the 22nd chromosome, we then select p = 5 000 SNPs located between the nucleotide positions 16 061 016 and 49 449 618. We do not conduct any posterior pruning, to avoid filtering out the true causal SNPs.
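The subsampling step can be sketched as follows (a single draw per sample size is shown; the study repeats this to obtain 125 cohorts per configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
POOL = 20_000  # size of the pre-generated HAPGEN population

# One cohort per sample size, drawn uniformly without replacement.
cohorts = {n: rng.choice(POOL, size=n, replace=False)
           for n in (500, 1000, 2000, 5000)}
```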
Evaluation
We benchmark our new methods against two baselines. The first method is GBOOST 17, a state-of-the-art method for epistasis detection. For all SNP pairs, it implements the log-likelihood ratio test statistic to compare the goodness of fit of two models: the full logistic regression model with main effect terms and interaction terms, and the logistic regression model with main effects only. The preliminary sure screening step to discard a number of SNPs from exhaustive pairwise testing was omitted, since we are only interested in the GBOOST score for the pairs of the form (A, Xk), where Xk is the k-th SNP. The second method, which we refer to as product LASSO, originates from the machine learning community. It was developed by Tian et al. 28 to estimate interactions between a treatment and a large number of covariates. It fits an L1-penalized logistic regression model with A × X as covariates. The variable of interest A is symmetrically encoded as {–1, +1}. Under general assumptions, Tian et al. 28 show that this model is a good approximation to the optimal decision rule d* (see Section 2.3).
We visualize the results of our methods in terms of receiver-operating characteristic (ROC) curves and precision-recall (PR) curves. The ROC and PR curves are derived from the stability paths. For each SNP, the score is the area under its corresponding stability path. For ROC/PR curves, no normalization is needed to bring the scores into the [0, 1] range. The labels are 1 for the SNPs interacting with the target A, and 0 otherwise. The covariates and the outcome differ between our methods. That implies a different regularization path for each method and as a result, incomparable stability paths. For better interpretability and comparability between the methods, we use the position l on the stability path grid Λ = (λl) s.t. λl > λl+1 instead of the value of λl for computing the area under the curve.
In Figure 1, we provide the ROC and PR curves for the fourth scenario, which corresponds to a partial overlap between synergistic and quadratic/marginal effects, and for a sample size n = 500. Because of space constraints, all ROC/PR figures and corresponding AUC tables are listed in Appendix B. The figures represent the average ROC and PR curves of the 125 simulations in each of the four scenarios. To generate those figures, we used the R package precrec 52, which performs nonlinear interpolation in the PR space. The AUCs were computed with the same package.
Regardless of the scenario and the sample size, the areas under all ROC curves are higher than 0.5, confirming that all methods perform better than random, yet with varying degrees of success. By contrast, the overall areas under the precision-recall curves are low. The maximum area under the precision-recall curve is 0.41, attained by shifted modified outcome for n = p = 5 000. This can be attributed to the imbalanced nature of the problem: 8 synergistic SNPs out of 5 000. For both ROC and PR, we also observe increasing AUCs with the cohort size.
The best performing methods are robust modified outcome and GBOOST. Robust modified outcome has a slight lead in terms of ROC AUCs, notably for low sample sizes; the latter setup is the closest to our intended application in genome-wide association studies. Of special interest to us in the ROC space is the bottom-left area, which reflects the retrieval performance for highly-ranked instances. For all scenarios, we witness a better start for robust modified outcome. The other methods within the modified outcome family behave similarly, as expected given their theoretical similarities. Despite the model misspecification, product LASSO performs rather well: on average, it comes third behind GBOOST and robust modified outcome. The outcome weighted learning approach, which approximates δ through d*, has consistently been the worst performer in the ROC space.
In PR space, the results are more mixed. For low sample sizes, robust modified outcome is still the best performing method. As the sample size increases, other methods within the modified outcome family, notably shifted modified outcome, surpass the robust modified outcome approach. Surprisingly, the good performance of GBOOST in ROC space is not reproduced in PR space. This might be explained by the highly imbalanced nature of the problem and the lower performance of GBOOST, compared to robust modified outcome, in the high-specificity region of the ROC curves (lower left). By contrast, product LASSO always trails the best performer of the modified outcome family. As for ROC curves, we are also interested in the beginning of the PR curves. For a recall rate of 0.125, the highest precision rate is near 0.5 for the first, third and fourth scenarios, which implies that we detect on average one causal SNP among the first two selected. For the second scenario, the highest precision rate is even higher, at approximately 0.68. The area under the stability path is thus a robust score for model selection in the high-dimensional setting.
It is worth noting the homogeneous behavior of the different methods across the four scenarios: for a given sample size and a given method, the ROC and PR AUCs are similar. This suggests that all methods successfully filter out the common effects term μ(X), even in the presence of an overlap between the causal SNPs within μ(X) and δ(X).
3.2 Case study: type II diabetes dataset of the WTCCC
As a case study, we selected the type II diabetes dataset of the WTCCC 51 to illustrate the scalability of our methods to real datasets. To the best of our knowledge, no confirmed epistatic interactions exist for type II diabetes. We propose instead to study the synergies with a particular target: rs41475248 on chromosome 8. Our first criterion for choosing this target is the presence of a significant epistatic effect; to verify it, we initially ran GBOOST. SNP rs41475248 is involved in 3 epistatic interactions when controlling the false discovery rate at 0.05. The second criterion is that the target is a common variant: its MAF is 0.45.
Before running our methods on the WTCCC dataset, we applied the same QC procedures with the following thresholds: a minimum minor-allele frequency of 0.01 and p > 10⁻⁶ for the Hardy-Weinberg equilibrium test. After filtering, 354 439 SNPs remain. The number of samples is 4 897, split between 1 953 cases and 2 944 controls.
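For reference, these two QC filters can be sketched as follows. The helper `qc_filter` is a hypothetical illustration (production pipelines often use an exact test for HWE rather than the chi-square approximation below):

```python
import numpy as np
from scipy.stats import chi2

def qc_filter(G, maf_min=0.01, hwe_p_min=1e-6):
    """Keep SNPs passing minor-allele-frequency and Hardy-Weinberg filters.

    G: (n_samples, n_snps) genotype matrix with values in {0, 1, 2}.
    Assumes every SNP is polymorphic (expected counts all nonzero).
    """
    n = G.shape[0]
    p_allele = G.mean(axis=0) / 2.0               # frequency of the coded allele
    maf = np.minimum(p_allele, 1.0 - p_allele)
    # observed genotype counts per SNP
    n0 = (G == 0).sum(axis=0)
    n1 = (G == 1).sum(axis=0)
    n2 = (G == 2).sum(axis=0)
    # expected counts under Hardy-Weinberg equilibrium
    q = 1.0 - p_allele
    e0, e1, e2 = n * q**2, 2 * n * p_allele * q, n * p_allele**2
    chi2_stat = (n0 - e0)**2 / e0 + (n1 - e1)**2 / e1 + (n2 - e2)**2 / e2
    hwe_p = chi2.sf(chi2_stat, df=1)              # 1 degree of freedom
    return (maf >= maf_min) & (hwe_p > hwe_p_min)
```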
To solve the different L1-penalized regressions, we abandoned glmnet in favor of another solver, biglasso 53, because glmnet does not accept such ultra-high dimensional design matrices as input. By contrast, biglasso was specifically developed for this setting, thanks to its multi-threaded implementation and its use of memory-mapped files. Because biglasso does not implement sample weighting, it cannot be used to run outcome weighted learning. Moreover, this approach performed worse than the modified outcome approaches on simulated data; we therefore excluded it from this case study.
The main difficulty in the evaluation of GWAS methods is the biological validation of the study results. We often lack the evidence to correctly label each SNP as being involved or not in an epistatic interaction, which makes it impossible to evaluate the model selection performance of the different methods on real datasets. However, we can study the concordance between them. A common measure for this purpose is Kendall's tau, a rank correlation coefficient. In Table 1, we give the correlation matrix of our methods and the two baselines of Section 3.1. All elements are positive, which indicates a relative agreement between the methods. Modified outcome, normalized modified outcome and shifted modified outcome have the highest correlation coefficients, as expected given their theoretical similarities. We also note that the lowest score is obtained between robust modified outcome and GBOOST, the two best performing methods of the previous section. This suggests that these two methods may make different true discoveries.
In any follow-up work, we will only exploit the highly-ranked variants. A weighted tau statistic that assigns a higher weight to the first instances is therefore more relevant. Weighted nonnegative tau statistics better assess the relative level of concordance between pairs of methods, while the sign of Kendall's tau shows whether two methods agree or disagree overall. In Table 2, we list Kendall's tau coefficients with multiplicative hyperbolic weighting. Similarly to the unweighted case, robust modified outcome is least correlated with GBOOST and most correlated with product LASSO.
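Both rank-correlation measures are available in SciPy; the snippet below contrasts them on two hypothetical score vectors (`additive=False` selects the multiplicative hyperbolic weigher):

```python
import numpy as np
from scipy.stats import kendalltau, weightedtau

rng = np.random.default_rng(0)
# hypothetical SNP scores produced by two methods on the same 1000 variants
scores_a = rng.standard_normal(1000)
scores_b = scores_a + 0.5 * rng.standard_normal(1000)  # correlated second method

tau, _ = kendalltau(scores_a, scores_b)
# multiplicative hyperbolic weighting: a pair with ranks (r, s) weighs
# 1 / ((r + 1) * (s + 1)), emphasizing agreement among top-ranked variants
wtau, _ = weightedtau(scores_a, scores_b, additive=False)
```

By default `weightedtau` ranks elements by decreasing score, so the weighting indeed emphasizes the head of each ranking, which is what matters for follow-up work.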
Aside from rank correlation, another option for appraising the results is to measure the association between each method's top SNPs and the phenotype. Table 3 lists, in increasing order, the Cochran-Armitage test p-values for the top 25 SNPs of each method. Although univariate summary measures, the Cochran-Armitage statistics give us an indication of the true ranking performance. Robust modified outcome is clearly the method with the lowest p-values: for instance, its top 14 SNPs all have a p-value lower than 0.001. That confirms the result of our simulations, where robust modified outcome was the best performer for capturing causal SNPs. The p-values associated with product LASSO and GBOOST are also relatively low, with respectively 5 and 4 p-values lower than 0.001. However, we note the overall difficulty in drawing clear conclusions for all methods: even without multiple testing correction, most of the p-values for each method already exceed classical significance levels, e.g. 0.05, and for 3 out of 6 methods, the p-value of the 25th SNP is greater than 0.90. Nonetheless, the presence of such high univariate p-values among highly-ranked SNPs illustrates the capacity of our methods to discover novel associations undetected by univariate methods.
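The Cochran-Armitage trend statistic with genotype scores (0, 1, 2) equals n times the squared Pearson correlation between genotype and phenotype, which gives a compact implementation. The simulation below is purely illustrative (hypothetical effect size and allele frequency):

```python
import numpy as np
from scipy.stats import chi2

def trend_test(g, y):
    """Cochran-Armitage trend test p-value.

    g: genotypes in {0, 1, 2}; y: binary case-control status.
    Uses the identity: trend statistic = n * corr(g, y)**2, chi-square with 1 df.
    """
    r = np.corrcoef(g, y)[0, 1]
    return chi2.sf(len(g) * r**2, df=1)

rng = np.random.default_rng(0)
g = rng.binomial(2, 0.3, size=2000)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(g - 1))))       # genotype raises disease risk
p_assoc = trend_test(g, y)                                # associated SNP: tiny p-value
p_null = trend_test(rng.binomial(2, 0.3, size=2000), y)   # independent SNP: large p-value
```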
4 Discussion
We presented a new family of methods for epistasis detection, which revolve around detecting new interactions with specific targets/genes. Given our partial understanding of common diseases, such refocused models could be more useful for understanding the underlying biology. Hundreds of genes have already been associated with common diseases via univariate GWAS. For type II diabetes, examples include the genes TCF7L2 and ABCC8: the latter affects insulin regulation, while the former impacts both insulin secretion and glucose production. The next step is to build upon these findings to detect potential synergies between these genes and the rest of the genome. Beyond a better understanding of disease mechanisms through new biomarker discovery, we see the development of combination drug therapies as a potential application of our work.
Among the methods we propose, robust modified outcome seems the most suited in practice to GWAS applications: its AUCs are overall the highest, as is its early retrieval performance. More importantly, robust modified outcome outperforms GBOOST. From a dimensionality point of view, the closest simulations to real GWAS are those with sample size n = 500. Across the four scenarios, robust modified outcome not only outperforms GBOOST, the current state of the art for epistasis detection, but also the other methods based on regression models. However, the low PR AUCs show that there is still room for improvement: the highest observed PR AUC is 0.17. In the PR space, we also note that several of our methods clearly outperform GBOOST for all scenarios and all sample sizes, while the GBOOST ROC curves behave similarly to those of the other methods. Such differences between ROC and PR curves are common for highly-skewed datasets, where PR curves are more informative 54. The main point of our methods is to focus on the synergies with a particular target while discarding other effects. The consistent ROC and PR AUCs across the four different scenarios show that they are rather successful at that: their performance is not strongly impacted by the presence of additional marginal and/or epistatic effects.
The case study that we carried out for type II diabetes demonstrates the scalability of all methods to real GWAS. One way to improve runtime is to adjust the number of subsamples used for stability selection, though possibly at the expense of performance. The development of new and faster LASSO solvers 55,56 for large-scale problems will further help improve the adoption of our methods by end-users.
According to the two rank correlation measures (Kendall's tau and weighted Kendall's tau), all methods tend to agree, though only partially. In our simulations, synthetic performance measures such as ROC and PR AUCs were relatively close; on the other hand, the rank correlations are far from complete agreement (values well below 1). For instance, GBOOST agrees least with robust modified outcome, even though the two methods are the best performers in our simulations. We conclude that a consensus method combining GBOOST and robust modified outcome could improve the recovery of interacting SNPs. Theoretically, the ranking differences between the methods raise the question of guarantees for support recovery in terms of effect sizes and dependence structure among covariates. The existence of common variants with low effect sizes is a major hypothesis for missing heritability. A recent paper by Boyle et al. 57 even advances the hypothesis of an "omnigenic" model, which proposes that most heritability lies outside of core pathways, principally within regulatory pathways. That means that a large number of variants influence the phenotype. However, this brings up the question of causality: how can one define a causal SNP when all variants are related to the phenotype?
The simulations show that a number of the highly-ranked SNPs are false positives. That is accentuated by the imbalanced nature of our problem: a handful of causal SNPs for thousands of referenced SNPs. Hopefully, the continual decrease in genotyping costs will result in a dramatic increase in sample sizes and, consequently, in statistical power. For instance, the UK Biobank 58 comprises full genome-wide data for five hundred thousand individuals. We also point out that our methods naturally extend to higher-order interactions: the main idea is to combine two SNPs into a single target through a binary function, such as the product of the two SNPs. We expect the results to depend on both the combination rule and the encoding chosen for each SNP; moreover, such simplifications incur a loss of information. We leave the study of those extensions to future work.
The main contribution of our work is the extension of the causal inference framework to epistasis through the development of propensity-like scores for genomic data. The superior performance of robust modified outcome is partly due to its robustness against propensity score misspecification. One area of improvement is propensity score estimation, which would benefit a large number of methods. An interesting proposal from Wager et al. 59 completely forgoes propensity scores for the estimation of average treatment effects. All of the presented methods were originally developed for clinical trials, where the analog of the target SNP is the treatment assignment and the analog of the genotype is the set of clinical covariates. Given the rich literature in that field, this opens the door to a much broader panel of methods. In particular, future directions of our work include conditioning on multiple covariates (whether clinical covariates, variables encoding population structure or other genetic variants) to account for, among other things, higher-order interactions and population stratification.
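To make the propensity-score connection concrete, here is a minimal sketch of the modified outcome idea with a binarized target and a known propensity score. The data, effect sizes and the constant propensity are hypothetical; in our actual methods the propensity score is estimated from the genotypic HMM:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 2000, 30
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)   # genotypes at p SNPs
t = rng.binomial(1, 0.5, size=n)                      # binarized target variant
# phenotype: marginal effect of SNP 0, interaction of SNP 1 with the target
y = 0.5 * X[:, 0] + 1.0 * t * X[:, 1] + rng.standard_normal(n)

e = 0.5                                   # propensity score, known here by design
y_mod = y * (t / e - (1 - t) / (1 - e))   # modified outcome transformation
# E[y_mod | X] = E[y | T=1, X] - E[y | T=0, X]: marginal effects cancel,
# so the LASSO support should concentrate on the interacting SNP 1
lasso = LassoCV(cv=5).fit(X, y_mod)
```

Running this, the largest coefficient in absolute value falls on SNP 1, while the purely marginal SNP 0 is filtered out in expectation.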
A Genotypic hidden Markov model
In this Appendix, we make explicit the transition and emission probabilities of the genotypic hidden Markov model. For that purpose, we start by considering a pair of ordered haplotypes $h^a = (h^a_1, \cdots, h^a_p)$ and $h^b = (h^b_1, \cdots, h^b_p)$. We recall that the two haplotypes correspond to the same positions. The hidden variables $z^a$ and $z^b$ represent cluster memberships. They take discrete values in {1, ⋯, K }p. Scheet and Stephens 24 define the clusters as a “(common) combination of alleles at tightly linked SNPs”. The underlying hidden Markov models for the two alleles have identical forms. We then focus on the first allele a, and follow the notations of 41.
The marginal distribution of the first hidden state can be written as:
$$P(z^a_1 = k) = \alpha_{1,k}, \quad k \in \{1, \cdots, K\}.$$
For j ∈ {2, ⋯, p}, the transition matrix is given by:
$$P(z^a_j = k' \mid z^a_{j-1} = k) = (1 - r_j)\,\mathbf{1}\{k' = k\} + r_j\,\alpha_{j,k'}.$$
The parameter r = (r2, ⋯, rp) can be interpreted as the recombination rate between loci j – 1 and j, although Scheet and Stephens 24 point out the general mismatch between the observed recombination rates and the estimate of r. The parameter α = (αj,k)(j,k)∈{1,⋯,p}×{1,⋯,K} is the relative frequency of the cluster k at locus j.
Conditionally on the latent state $z^a_j = k$, the allele $h^a_j$ is a Bernoulli random variable, where $\theta_{j,k}$ is the frequency of allele 1 in cluster k at position j:
$$P(h^a_j = 1 \mid z^a_j = k) = \theta_{j,k}.$$
Under the Hardy-Weinberg equilibrium (HWE), a third hidden Markov model for the unphased genotype can be derived by combining the HMMs of the two alleles a and b. The emission states X = (X1, ⋯, Xp) ∈ {0, 1, 2}p are given by the sum of the emission states of the two haplotype models, $X_j = h^a_j + h^b_j$. Because of the phase indetermination, the latent states are unordered pairs of haplotype latent states, $z_j = \{z^a_j, z^b_j\} = \{k_1, k_2\}$. Thus, the dimensionality of the latent variable space is K(K + 1)/2. The different probabilities of the genotype model are computed by considering the two cases: $k_1 = k_2$ and $k_1 \neq k_2$.
The initial latent state distribution is given by:
$$P(z_1 = \{k_1, k_2\}) = \begin{cases} \alpha_{1,k_1}^2 & \text{if } k_1 = k_2, \\ 2\,\alpha_{1,k_1}\,\alpha_{1,k_2} & \text{if } k_1 \neq k_2. \end{cases}$$
In a similar fashion, denoting by $T_j(k, k') = (1 - r_j)\,\mathbf{1}\{k' = k\} + r_j\,\alpha_{j,k'}$ the transition of a single haplotype chain, the transition probabilities are:
$$P(z_j = \{k'_1, k'_2\} \mid z_{j-1} = \{k_1, k_2\}) = \begin{cases} T_j(k_1, k'_1)\,T_j(k_2, k'_1) & \text{if } k'_1 = k'_2, \\ T_j(k_1, k'_1)\,T_j(k_2, k'_2) + T_j(k_1, k'_2)\,T_j(k_2, k'_1) & \text{if } k'_1 \neq k'_2, \end{cases}$$
and the emission probabilities are:
$$P(X_j = 0 \mid z_j = \{k_1, k_2\}) = (1 - \theta_{j,k_1})(1 - \theta_{j,k_2}),$$
$$P(X_j = 1 \mid z_j = \{k_1, k_2\}) = \theta_{j,k_1}(1 - \theta_{j,k_2}) + (1 - \theta_{j,k_1})\,\theta_{j,k_2},$$
$$P(X_j = 2 \mid z_j = \{k_1, k_2\}) = \theta_{j,k_1}\,\theta_{j,k_2}.$$
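As a sanity check, the single-haplotype transition matrix and the genotype emission probabilities above can be written out directly. These are illustrative helper functions, not part of fastPHASE:

```python
import numpy as np

def haplo_transition(r_j, alpha_j):
    """Transition matrix of one haplotype chain at locus j:
    T[k, k'] = (1 - r_j) * 1{k = k'} + r_j * alpha_j[k']."""
    K = len(alpha_j)
    return (1 - r_j) * np.eye(K) + r_j * np.tile(alpha_j, (K, 1))

def genotype_emission(theta_jk1, theta_jk2):
    """P(X_j = 0, 1, 2 | z_j = {k1, k2}) under HWE, for allele-1
    frequencies theta_jk1 and theta_jk2 of the two clusters."""
    p0 = (1 - theta_jk1) * (1 - theta_jk2)
    p1 = theta_jk1 * (1 - theta_jk2) + (1 - theta_jk1) * theta_jk2
    p2 = theta_jk1 * theta_jk2
    return np.array([p0, p1, p2])
```

Each row of the transition matrix and each emission vector sums to one, which is a quick way to check an implementation of the model.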
To estimate the parameters ν = (α, r, θ), we use the imputation software fastPHASE 24, which fits the hidden Markov model using an expectation-maximization (EM) algorithm 60. Its computational complexity is 𝒪(npK²). The complexity scales linearly in both p and n, rendering fastPHASE well-suited for real case-control datasets, where the number of SNPs is typically in the hundreds of thousands and the number of samples in the thousands. In practice, as a trade-off between a rich representation of the clusters and the ensuing quadratic complexity in K, we chose K = 12.
B Simulation results
B.1 First scenario: synergistic only effects
B.2 Second scenario: partial overlap between synergistic and marginal effects
B.3 Third scenario: partial overlap between synergistic and quadratic effects
B.4 Fourth scenario: partial overlap between synergistic and quadratic/marginal effects
5 Acknowledgement
This study makes use of data generated by the Wellcome Trust Case-Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under awards 076113, 085475 and 090355.
Footnotes
* Contact: lotfi.slim{at}mines-paristech.fr