## Abstract

**Background** Genome-wide association studies (GWAS) are now routinely imputed for untyped SNPs based on various powerful statistical algorithms for imputation trained on reference datasets. The use of predicted allele count for imputed SNPs as the dosage variable is known to produce valid score test for genetic association.

**Methods** In this paper, we investigate how to best handle imputed SNPs in various modern complex tests for genetic association incorporating gene-environment interactions. We focus on case-control association studies where inference in an underlying logistic regression model can be performed using alternative methods that rely on varying degree on an assumption of gene-environment independence in the underlying population. As increasingly large scale GWAS are being performed through consortia effort where it is preferable to share only summary-level information across studies, we also describe simple mechanisms for implementing score-tests based on standard meta-analysis of “one-step” maximum-likelihood estimates across studies.

**Results** Applications of the methods in simulation studies and a dataset from genome-wide association study of lung cancer illustrate ability of the proposed methods to maintain type-I error rates for underlying testing procedures. For analysis of imputed SNPs, similar to typed SNPs, retrospective methods can lead to considerable efficiency gain for modeling of gene-environment interactions under the assumption of gene-environment independence.

**Conclusions** Proposed methods allow valid analysis of imputed SNPs in case-control studies of gene-environment interaction using alternative strategies that had been earlier available only for genotyped SNPs.

*Introduction*

Genome-wide association studies (GWAS) are now routinely imputed for untyped SNPs with powerful imputation algorithms [Howie et al., 2009; Browning and Browning, 2009; Li et al., 2010; O’connel et al., 2016; Loh et al., 2016] up to various reference panels such as the Hapmap (The International HapMap Consortium, 2005) and the 1000 Genomes (The 1000 Genomes Project Consortium, 2010; The 1000 Genomes Project Consortium, 2012; Sudmant, P. H. et al., 2015). Standard association tests for imputed SNPs are performed using the predicted allele count as the underlying dosage variable of the association model. Many earlier fine mapping studies based on the Hapmap panel have successfully used imputation for better characterization of common susceptibility SNPs within regions initially discovered through typed SNPs. More recently, imputation based on the 1000 Genome reference panel in existing GWAS for several traits have led to the discovery of new susceptibility loci containing uncommon or rare susceptibility variants [Guerreiro et al., 2013; Wang et al., 2014; Horikoshi et al., 2015].

The use of expected allele count for imputed SNPs as the dosage variable is known to produce valid score-test for genetic association [Marchini and Howie, 2010]. In this paper, we investigate how to best handle imputed SNPs in various modern complex tests for genetic association incorporating gene-environment interactions. In particular, we focus on case-control association studies where inference in an underlying logistic regression model can be performed using various alternative methods that rely on varying degree on an assumption of gene-environment independence in the underlying population. As increasingly large scale GWAS are being performed through consortia effort where it is preferable to share only summary-level information across studies, we also explore how these methods could be implemented in the context of meta-analysis. We study type-1 error and power of alternative methods using extensive simulation studies. An application of the methods is illustrated through a re-analysis of the National Cancer Institute GWAS of lung cancer that has been imputed by the 1000 Genome reference panel.

## Methods

### Options for Joint-Test of Association for Genotyped SNPs

We assume the main goal of our study is to test the association of disease-status (*D*) with genotype status ( *G*) of marker SNPs in the presence of a set of environmental risk factors ( *X*) that are known to be associated with the disease. We consider logistic regression to specify the disease-risk model in the form
where an interaction term between *G* and *X* is incorporated to allow the effect of the genetic factor, as measured in the odds-ratio scale, to vary by the level of the environmental factors. Commonly, SNP genotypes ( *G*) are coded as allele count assuming a linear-trend model for association with the underlying trait. More generally, genotype could be coded according to dominant, recessive or a two degree-of-freedom saturated model. A joint-test for genetic association under the above model corresponds to a global null hypothesis in the form

For genotyped SNPs, a multi degrees-of-freedom joint-test of association and interaction has been studied earlier [Kraft et al., 2007]. Typically, the analysis is performed based on standard prospective logistic regression analysis of case-control data.

Alternatively the analysis can be performed based on a retrospective-likelihood [Chatterjee and Carroll, 2005] that allows enhancement of power by exploitation of an assumption of gene-environment independence in the underlying population. Under gene-environment independence assumption, a case-only analysis can also be performed for inference on the logistic regression interaction parameter [Peigorsch et al., 1994], but it is not suitable for joint-testing of genetic association and interaction. The use of gene-environment independence assumption, however, can lead to serious bias in both the joint-and interaction-tests when the underlying assumption of gene-environment independence is violated [Albert et al., 2001;Mukherjee et al., 2008; Mukherjee et al., 2012].

A third alternative for joint-testing of association and interaction is to use an empitical-Bayes type inferential procedure that allows data adaptive shrinkage between estimates obtained from the prospective and retrospective likelihoods to strike a balance between efficiency and bias incurred by gene-environment independence assumption. Extensive simulation studies have shown that methods that exploit gene-environment independence assumption, such as retrospective-or EB- method, have substantial potential to improve power for gene-environment case-control studies compared to standrd prospective logistic regression [Mukherjee et al., 2008; Mukherjee et al., 2012]. The risk of false positives due to gene-environment correlation is generally low in many realistic situations and can be further minimized using the data adaptive EB or various types of two-stage procedures [Cornelis et al., 2012; Mukherjee et al., 2012].

### Derivation of Score-Tests

A major advantage of score-test, compared to Wald-or Likelihood-ratio test (LRT), is that it only requires imputation under the null model of no association and thus can easily incorporate expected dosage returned by popular imputation algorithms. Further for the analysis of less common and rare variants, score-tests may have more robust properties than Wald test or LRT as the number of cases or/and controls can be sparse in variant genotype categories.

Suppose that data consist of (*D _{u}, X_{u}, G_{u}*),

*u*= 1,…,

*n*where

*D*,

_{u}*X*, and

_{u}*G*, respectively, denote the disease status, environmental exposure, and SNP-genotype status for subject

_{u}*u*. Let

*Z*= (1,

*X*) and

*W*= (

*G, G**

*X*) denote a partitioning of the design matrix associated with the “nuisance” parameters,

*η*= (

*α,β*), and the parameters of interest,

_{x}*θ*= (

*β*), respectively, for the underlying logistic regression model.

_{g},β_{gx}### Prospective (PT) Method

The standard prospective likelihood of case-control data is derived as

Under the prospective-likelihood, the score-function for *θ* is given by

Under the null hypothesis of no association,

The maximum likelihood estimator (MLE) of the nuisance parameters *n* under the null model can be estimated by fitting the null model

The multivariate score-test-statistic can be computed as where is the variance-covariance matrix for the score-vector accounting for uncertainty associated with estimation of the nuisance parameters. One can estimate using the efficient information matrix in a model-based fashion or using the empirical variance-covariance matrix of the associated influence function to achieve robustness against mis-specification of the null model.

### Retrospective (RT) Method

The retrospective likelihood for case-control data is given by

It has been long known that inference for the parameters of interest under underlying logistic regression model is equivalent under the retrospective and prospective likelihoods for case-control data when no assumption is made about joint distribution of the underlying risk-factors, i.e. *G* and *X* in our example [Prentice and Pyke, 1979]. However, if an assumption of gene-environment environment independence is invoked, then more efficient inference is possible under the retrospective likelihood. In particular, Chatterjee and Carroll [2005] have previously shown that under the assumption of gene-environment independence, but without any further restriction on the distribution of *X*, inference under retrospective likelihood can be made using a “profile-likelihood” of the form
where the conditioning *R* = 1 is introduced to indicate the selection mechanisms of subjects into the sample under the case-control sampling scheme. Derivation of requires specification of population genotype frequencies, either using two parameters under a general multinomial model or using a single parameter under the assumption of Hardy Weinberg Equilibrium (HWE). Thus, for the retrospective likelihood, we expand the nuisance parameter vector as *η*^{*} =(*η*, *γ*) that the nuisance parameters include both parameters of the disease-risk and genotype frequency models.

The score-function for association parameters of interest for the retrospective likelihood can be derived as
where denotes expectation with respect to the joint probability distribution of *D* and *G* given *X* and *R* = 1.

Under the null,
which differs from the corresponding score-vector (equation (2)) is derived under the prospective likelihood only in the way the expectation is derived in the second term. In particular, under the retrospective likelihood, the expectation term is evaluated under the assumption of gene-environment independence while the prospective likelihood does not require any such assumption. Under the null hypothesis, the parameters of the null model (3) can be estimated using standard prospective logistic analysis since the MLE under the retrospective-and prospective-likelihoods are the same as we allow the distrbution of non-genetic risk-factors to be completely unspecified. Further, under null, MLE associated with genotype frequency model *γ* can be obtained from the pooled sample of the cases and controls. The multivariate-score test can now be derived under the retrospective likelihood following the same-steps as that for described for the propsective likelihood (see Supplementary Methods Section 1.2 for complete details).

### EB Procedure

Implementation of the original EB procedures requires parameter estimates from the prospective-and retrospective-likelihood methods. The estimate itself cannot be directly derived from a likelihood and thus derivation of a score-test for this procedure is not straightforward. As an alternative, we propose a “score-type” test that could maintain some of the advatnages of the score-tests as described earlier and yet allow combining inference from the prospective-and retrospective-likelihoods in a data adaptive fashion to balance between bias and efficiency. We first note that any score-test can be written in the form of a Wald-like test-statistic
.
where can be viewed as a one-step MLE starting from the null parameter value *θ*_{0}=0. Thus, taking advantage of the above Wald-like representation of score-test, we propose an EB score-type test in the form
where the corresponding EB estimates and associated variance-covariance matrix are obtained by combining the one-step MLE estimates derived from the prospective-and retrospective-likelihood (see Supplementary Methods Section 1.3) using formulae analogous to those described for the original EB procedure [Mukherjee and Chatterjee, 2008].

Derivation of the PT, RT, and EB methods under a more general setting that allows accounting for additional covariates in the model is given in Supplementary Methods.

### Handling Imputed Genotype Data

Once the forms of the score-tests are derived with observed genotyped data, handling imputed genotype data for all the procedures is relatively straightforward as it simply involves replacing *G _{u}* by

*Ĝ*, the expected value of genotype dosage taking into account predicted probabilities of different genotype values returned by the imputation algorithm. It is noteworthy how imputed genotype data are handled differentially in the prospective-and retrospective-score functions. Under the prospective-likelihood, the score-function for imputed genotype data takes the form where the imputed genotype-dosage variable contributes to both terms of the left hand side of the equation. In contrast, under the retrospective-likelihood, the score function for imputed genotype data takes the form where the imputed genotype-dosage variable contributes only to the first term of the equation. The genotype frequency parameters, required in derivation of the retrospective-score function, can be estimated from imputed genotype data based on overall predicted genotype counts observed in the pooled sample of cases and controls. Derivations of efficient-information matrices and empirical variance-covariance matrices for the score-vectors follow the same steps as those for observed genotype data for each of the respective procedures. Finally, the derivation of the one-step MLEs and score-type test using the EB procedure follows the same steps as those described for observed genotype data.

_{u}*Analysis of NCI GWAS of Lung Cancer*

We analyzed data from a GWAS of lung cancer generated at the National Cancer Institute. The dataset included 5713 cases and 5736 controls from four different study sites (Table 1). The samples were originally genotyped using a combination of lllumina GWAS platforms and were imputed using the 1000 Genomes Phase 2 reference panel using IMPUTE2 software [Howie et al., 2009]. The details of the studies can be found in several previous publications [Landi et al.,2009]. We evaluated the performance of the different methods in evaluating joint association of lung cancer with SNP-genotypes and genotype-by-smoking interactions. We derived score test under a logistic regression model where SNP genotypes were coded assuming additive effects. For modeling the effect of smoking status, recorded as current, former or never, we used two dummy variables. The resulting joint tests for association and interaction had three degrees of freedom. We also examined the two degree-of-freedom score-tests associated with only the interaction parameters of the model, but the underlying p-values were derived under the global null hypotheses of absence of both association and interactions. Figure 1 shows the quantile-quantile (Q-Q) plots for the interaction-only tests for the application of the PT, RT, and EB methods (left panel) and the joint-tests under the PT, RT, and EB methods (left panel) and the joint-tests under the PT, RT, and EB methods together with the test for main effect of *G* of the model without interaction (right panel), which were restricted to the analysis of ~5.3 million SNPs such that MAF > 0.05, the imputation quality reported to have info measure *I _{A}* ≥ 0.5, and the p-values from all the seven tests are available. As the patterns are generally similar for the model-based and empirical variance estimators, we only show results using the former method. In general, the Q-Q plot associated with the interaction-only tests aligns close to the diagonal line indicating that all the methods are maintaining type-1 error well. The Q-Q plot neither shows any strong upward curvature near low p-values that could be indicative of the presence of many strong interactions in the data.

In contrast, the Q-Q plot for the main-effect-only and joint tests of association and interaction clearly shows a strong upward curvature near the tail of the distribution. This pattern is largely driven by SNPs in the chromosome 15q25.1 region which are previously shown to be strongly associated with the risk of lung cancer (See Supplementary Figure SI for plots after removal of this region). SNPs in this region, which contains multiple nicotine receptor genes, have been shown to be associated with both risk of lung cancer [Amos et al., 2008; Thorgeirsson et al., 2008] and smoking intensity [Thorgeirsson et al., 2008;Saccone et al.,2010]. However, no SNPs in this region has been reported to be associated with smoking status even in studies with extremely large samples size (N>100K) [The Tobacco and Genetics Consortium, 2010]. Thus it is interesting that in this region (x-axis p-value < 10^{−4}), the RT and EB method, both of which exploit an assumption of independence of genotype and smoking status, consistently produce lower p-values for the SNPs than those from the main-effect only test and the joint-test under the PT method. It appears that, in this data, although gene-environment interactions by themselves are not identifiable at a high significance level, proper accounting for these effects using efficient methods are enhancing the detection of underlying signals captured by the joint test.

We also evaluated the performance of different methods including SNPs with lower MAF (MAF=0.01-0.05). In this setting, we observe that the Q-Q plots for the methods that used sandwich variance-covariance estimators were highly inflated indicating systematic problem with type-1 error rate control. The problem could be traced to small sample bias of the sandwich standard errors because of small number of non-smoking cases (N=355) who also carried variant genotype for rare SNPs in our study. When we combined non-smokers and former-smokers together to a single category, the bias went away (data not shown).

*Simulation Studies*

We generated data on a binary environmental exposure variable which is assumed to follow Bernoulli (0.5) and be independently distributed of *G*. We simulated SNP genotype ( *G*) assuming HWE and minor allele frequency (MAF) value of 0.3 or 0.05. Given the values of *G* and *X*, we generated the binary disease outcomes for individuals from the logistic regression model (1). We chose *β _{x}* = log(1.5) or

*β*= log(2) to allow the association of

_{x}*X*with

*D*to be either modest or strong, respectively. For evaluation of type-1 error, we assumed no genetic association, i.e. both

*β*= 0 and

_{g}*β*= 0. For evaluation of power, we set

_{gx}*β*= log(1.2) and

_{gx}*β*= log(1) or log(1.05). In all simulations

_{g}*β*was set such that an overall disease rate in the underlying population is about 5%. For evaluating type-1 error and power, we simulated 500,000 and 1,000 datasets, respectively, with each set consisting of 5,000 controls and 5,000 cases.

_{0}To evaluate validity and power of the methods when the SNP of interest may not be genotyped, we simulated haplotypes which consist of the SNP of interest and neighboring SNPs in linkage disequilibrium (Table 2). In one setting (left panel), the variant of interest was common (MAF=0.3) and could be imputed with high accuracy ( *R*^{2} =0.8) based on genotypes of the neighboring SNPs (See Stram [2004] for *R*^{2}). In the other setting (right panel), the variant of interest was less common (MAF= 0.05) and could be predicted with moderate accuracy (*R*^{2} = 0.5) based on the genotype status of the neighboring SNPs. Assuming HWE in the general population, multi-locus genotypes of individuals were generated from simulted haplotypes. We analytically evaluated the conditional probability for genotype at the SNP of interest for each configuration of genotypes at the neighboring SNPs. For simulating “imputed dosage” for the SNP of interest, we simulated the genotype data for the neigboring SNPs first and then assigned predicted genotype probabilities for the SNP of interest using the known conditional probabilities. For analysis of each simulated data, we pretended that only the predicted probabilities, and not the actual genotypes, were available at the SNP of interest.

We implemented each of the PT, RT, and EB score tests using either a model-based or an empirical variance-covariance estimator. However, in evaluation of the EB procedure, due to the lack of a model-based formula, the covariance between prospective and retrospective estimators was always evaluated based on empirical covariance of the underlying influenced functions. For each method, we evaluated the performance of both the joint-and interaction-only tests. In general, simulation studies show that the proposed methods perform well in maintaining type-1 error both at modertate (*α* = 0.05) and stringent (*α* = 0.0001) significance levels (Figure 2). In some scenarios, the RT method, when implemented with the sandwich variance estimator, showed a slight inflation over the nominal significance level. Across all the scenarios, the EB method was conservative, a pattern that has been reported earlier for analysis of typed SNPs and has been traced to the use of a conservative variance estimator [Mukherjee et al., 2012]. Employing the PT, RT, and EB methods on typed SNPs shows consistent results (Supplementary Figure S2).

Simulation studies of power (Table 3) suggest that relative performance of three different methods was similar for untyped SNPs as has been reported for typed SNPs in earlier studies [Mukherjee et al., 2012]. In particular, the RT method had the maximum power, the PT method has the minimum power and the EB procedure performed in between. All methods lost power to a similar degree for the analysis of untyped SNPs compared to the analysis of the same SNP had it been typed. The use of model-based versus sandwich variance estimators did not have much effect in power for any of the methods (Supplementary Table S1).

PT-joint, prospective joint test; RT – joint, retrospective joint test; EB-joint, empirical Bayes joint test; PT-int, prospective interaction test; RT-int, retrospective interaction test; EB-int, empirical Bayes interaction test; SNP, single nucleotide polymorphism; MAF, minor allele frequency.

## Discussion

In summary, we propose various types of score tests for genetic association and gene-environment interactions for analysis of case-control studies. Similar to standard tests for genetic association, in these methods, imputed genotype data for untyped SNP could be handled by simply substituting genotype values with predicted dosage that could be available from popular imputation software.

The prospective and retrospective score-tests are derived directly from the underlying likelihoods for case-control studies. We derived the score-test for the EB procedure using underlying one-step maximum-likelihood estimates of parameters obtained from the prospective-and retrospective-likelihoods. The one-step MLEs can also be used to perform multivariate meta-analysis of the parameters across studies and then derive various test-statistics based on meta-analyzed parameter estimates and their variance-covariance matrices. In our implementation of all the methods in the R software package CGEN (https://www.bioconductor.org/packages/release/bioc/html/CGEN.html) we allow returning of the one-step MLEs to facilitate meta-analysis.

Both analysis of simulated and real datasets suggest that the proposed methods can generally control type-1 error rates, but small sample bias could arise in the presence of sparse genotype-by-exposure cells, especially if sandwich variance estimators are used in some of these methods. Simulation studies of power show that the relative performances of the PT, RT, and EB procedure are quite similar for the analysis of untyped and typed SNPs. Although not studied directly, it can be anticipated that in the presence of gene-environment correlation in the population, the relative performance of these methods for their ability to control type-1 error would also be similar as has been reported in earlier studies [Mukherjee et al., 2012] for typed SNPs.

Although all the methods are valid for both continuous and categorical exposures, our numerical studies only involve categorical exposures. Future studies are needed to investigate performance of the proposed methods in the presence of continuous exposure and model mis-specification. It has been noted before that if the model for association of the disease with a continuous exposure is mis-specified, then the test for genetic associations and interaction could be biased due to underestimation of variance of target parameters under the mis-specified model [Tchetgen and Kraft, 2011]. We focus on test for genetic association and interactions using single genetic markers. Further studies are also merited how these methods could be extended for derivation of gene-level aggregate tests of genetic associations and interactions.

## Software link

The proposed method has been implemented in open source software, available at https://www.bioconductor.org/packages/release/bioc/html/CGEN.html.

## Acknowledgement

Research of Minsun Song, Maria Teresa Landi, Neil Caporaso, and Nilanjan Chatterjee was supported by the intramural program of the Nationcal Cancer Institute.