ABSTRACT
Multi-marker approaches are currently attracting considerable interest in genome wide association studies and can enhance power to detect new associations under certain conditions. Gene and pathway based association tests are increasingly being viewed as useful complements to the more widely used single-marker association analyses, which have successfully uncovered numerous disease variants. A major drawback of single-marker based methods is that they do not consider pairwise and higher-order interactions between genetic variants. Here, we describe novel tests for multi-marker association analyses that are based on phenotype predictions obtained from machine learning algorithms. Instead of utilizing only a linear or logistic regression model, we propose the use of ensembles of diverse machine learning algorithms for constructing such association tests. As the true mathematical relationship between a phenotype and any group of genetic and clinical variables is unknown in advance and may be complex, such a strategy gives us a general and flexible framework to approximate this relationship across different sets of SNPs. We show how phenotype predictions obtained from ensemble learning algorithms can be used for constructing tests for the joint association of multiple variants. We first apply our method to simulated datasets to demonstrate its power and correctness. Then, we apply our method to previously studied asthma-related genes in two independent asthma cohorts to conduct association tests.
INTRODUCTION
Genome wide association studies (GWAS) have generated a wealth of information about genes and genetic variants influencing various diseases and traits (Visscher 2012). The vast majority of GWAS have focused on single-marker analysis and tests for significance were corrected for multiple hypotheses testing to obtain the correct false positive rates. Because the number of markers tested in such studies is large, a single nucleotide polymorphism (SNP) needs to have strong effects or the sample size needs to be large enough to cross the stringent genome wide significance thresholds. Furthermore, many complex traits are thought to result from the interplay of multiple genetic and environmental factors, which are not captured by single SNP association tests. Given these limitations of single-marker analysis, many multi-marker approaches for association testing have been proposed and are increasingly being used to complement single SNP analysis (e.g. Pang et al. 2006; Wang et al. 2007; Li et al. 2009; Liu et al. 2010; Wu et al. 2010; Li et al. 2011; Wu et al. 2011; Huang et al. 2011; Li et al. 2012; Chung and Chen 2012).
Genes are the basic functional units of the genome and multiple polymorphisms within or near a gene can jointly affect its products. Thus, multi-marker association tests can realistically model the multiplicity that occurs biologically. While individual causal variants might show only a marginal signal of association, jointly utilizing all informative SNPs within a gene may detect their manifold effects. Testing genes also reduces the burden of multiple testing from millions of individual SNP tests to around 20,000 genes. Gene-based methods may also be less sensitive to differences in allele frequency and linkage disequilibrium between population groups (and, therefore, may produce more replicable results).
To date, several gene-based association tests have been proposed (e.g. Li et al. 2009; Liu et al. 2010; Wu et al. 2010; Li et al. 2011; Wu et al. 2011; Huang et al. 2011; Li et al. 2012). Most of these approaches first assign a subset of SNPs to a particular gene based on their location in the genome; they then seek to calculate a gene-based p value based on the individual SNP association tests. VEGAS (Versatile gene-based association study) is a gene-based method that combines the chi-square test statistics of individual SNPs, while accounting for their dependence (Liu et al. 2010). GATES is another gene-based association test that uses an extended Simes procedure to integrate the p values of individual variants while accounting for pairwise correlations between variants when calculating the effective number of independent tests (Li et al. 2011). SKAT is a logistic kernel machine based test that can account for nonlinear effects when determining the gene-level significance (Wu et al. 2010; Wu et al. 2011).
Generally, the methods used for combining p values in gene-based tests can be divided into 2 categories: best-SNP picking tests and all-SNP aggregating tests. Best-SNP picking tests use only one SNP-based p value after adjusting for multiple testing. GATES is an example of a testing method that falls within this category. All-SNP aggregating tests, such as VEGAS-SUM and SKAT, attempt to accumulate the effects of all SNPs into a test when determining the overall p value. HYST is a recently developed hybrid method that uses both of these kinds of approaches in its calculations (Li et al. 2012).
Many existing gene-based approaches either use the minimum of the p values for variants within a gene or integrate the p values/test statistics from individual variants to determine the overall gene-level p values. However, this may not be optimal in terms of utilizing the information available in the data and it may be better to determine the joint association of multiple predictive SNPs rather than use individual SNP measures. In addition, many existing methods do not account for nonlinear effects. Our main goal here is to develop an accurate method for multi-marker association analysis that can incorporate pairwise and higher order interactions between variables. We use phenotype prediction algorithms as a basis for constructing such association tests. Since both the underlying genetic architecture of a trait and the optimal model structure for combining the association information across multiple SNPs are not usually known before testing, we propose a machine learning approach for this purpose. The main novelty of our approach is the use of an ensemble of diverse learning models to generate phenotype predictions. In this approach, we feed the initial predictions generated from many individual learning algorithms into a second-level learning algorithm which weights their contributions suitably to generate a final prediction (Breiman 1996; Breiman 2001; Bell et al. 2007; Sill et al. 2009; Toscher et al. 2009). Thus, our approach involves blending the results of different learning algorithms by using a “meta-level” learning algorithm. We also use additional variables called “meta-features” (e.g., age, gender, body mass index, individual genotypes, ancestry) as inputs to guide this blending procedure (Sill et al. 2009). In principle, such a combination of models can allow us to better approximate (on average) the true underlying relationships between the input variables and phenotype across multiple sets of SNPs. 
Of note, this method allows the relationships between different groups of SNPs and the phenotype to be non-linear, complex, and variable, as is likely to occur in nature.
Here, we show how machine learning algorithms can be used to construct powerful tests for multi-marker association analysis. We then show how to construct tests of association in the presence of non-genetic covariates and how to construct a multi-marker test of interactions under this framework. We first apply our method to simulated datasets to demonstrate its power and correctness. Lastly, we apply our method to previously studied asthma-related genes in two independent asthma cohorts to conduct gene-based association tests.
METHODS
Approach for predicting phenotypes
Here, we present an overview of our approach to predict phenotypes from genetic and clinical variables through the use of multiple machine learning algorithms. First, we create a list of all genetic variants and clinical covariates that can potentially influence the phenotype of interest, such as a disease or drug response. Next, we perform a feature selection step where we identify a subset of variables that are useful for building a predictive model (i.e., associated with the phenotype). This can be done in many ways, such as using variable importance scores from a random forest algorithm or Pearson’s correlation coefficient with the phenotype. Different machine learning algorithms (e.g., random forests (Breiman 2001), support vector machines (Cortes and Vapnik 1995; Harris et al. 1996) and logistic regression) are then trained using this subset of informative variables. Subsequently, we use the predictions from these individual models along with the selected features as inputs in a “meta-level” random forest algorithm. Lastly, we assess prediction accuracy by testing the model on samples outside the training set and through 20-fold cross-validation.
Ensemble learning algorithm for phenotype prediction
Ensemble learning variation 1:
1) Generate a set of all genetic variables.
2) Perform feature selection on the training data in order to identify an informative subset of variables (f1, f2…fn) for phenotype prediction. This can be performed using either pairwise correlation coefficients between variables and phenotype or by using random forest variable importance scores to rank the variables. Then, we can use the top 10%-30% of the variables in a prediction model.
3) Train k independent machine learning approaches on the training data using the selected features and generate model predictions P1, P2…Pk.
4) Use the predictions from step 3, P1, P2…Pk and f1, f2…fn as inputs and train a “meta-level” learning algorithm using random forests. Note that this is a key step in the algorithm and generates a final prediction by blending many individual predictions in a possibly nonlinear manner. The main goal is to learn the best model to combine individual models from the training data so that we can predict the phenotype as well as possible. The non-linear combination of models along with the meta-features gives us a more general predictive framework, which can accommodate different model structures and also allows the overall model to vary across the multi-dimensional parameter space.
5) Generate predictions in test data Pblend1 using the models trained in steps 3 and then 4. Repeat for all cross-validation folds to obtain unbiased phenotype predictions for all samples.
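As an illustration of steps 2 to 5, the following is a minimal sketch written with NumPy and scikit-learn. It is not the implementation used in this study. The choice of three base learners, the 30% feature cut-off, the use of 5 rather than 20 folds, the toy data, and the inner cross-validation used to produce training inputs for the meta-level model are all assumptions made for the example; the function names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.svm import SVR


def select_features(X, y, top_fraction=0.3):
    """Step 2: rank variables by |Pearson r| with the phenotype, keep the top fraction."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = np.abs(Xc.T @ yc) / np.where(denom > 0, denom, np.inf)  # constant columns get r = 0
    n_keep = max(1, int(round(top_fraction * X.shape[1])))
    return np.argsort(r)[::-1][:n_keep]


def base_learners():
    # Hypothetical choice of k = 3 base models; any diverse set could be used.
    return [RandomForestRegressor(n_estimators=50, random_state=0),
            SVR(kernel="rbf"),
            LinearRegression()]


def blended_cv_predictions(X, y, n_folds=5, inner_folds=3, seed=0):
    """Return out-of-fold blended predictions (Pblend1) for every sample."""
    p_blend = np.empty(len(y), dtype=float)
    outer = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train, test in outer.split(X):
        cols = select_features(X[train], y[train])      # training data only
        F_train, F_test, y_train = X[train][:, cols], X[test][:, cols], y[train]

        # Step 3: base predictions P1..Pk. The meta-learner is trained on inner
        # out-of-fold predictions (an assumption of this sketch) so that it does
        # not learn from overfitted in-sample values.
        P_train = np.zeros((len(train), len(base_learners())))
        inner = KFold(n_splits=inner_folds, shuffle=True, random_state=seed)
        for fit_idx, hold_idx in inner.split(F_train):
            for j, model in enumerate(base_learners()):
                model.fit(F_train[fit_idx], y_train[fit_idx])
                P_train[hold_idx, j] = model.predict(F_train[hold_idx])
        fitted = [m.fit(F_train, y_train) for m in base_learners()]
        P_test = np.column_stack([m.predict(F_test) for m in fitted])

        # Step 4: meta-level random forest on [P1..Pk, f1..fn].
        meta = RandomForestRegressor(n_estimators=100, random_state=seed)
        meta.fit(np.hstack([P_train, F_train]), y_train)
        # Step 5: blended predictions for the held-out fold.
        p_blend[test] = meta.predict(np.hstack([P_test, F_test]))
    return p_blend


# Toy data: 200 samples, 10 SNPs coded 0/1/2, trait driven by one interaction.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(200, 10)).astype(float)
y = X[:, 0] * X[:, 1] + rng.normal(size=200)
p_blend1 = blended_cv_predictions(X, y)
```

Because feature selection and all model fitting happen inside each training fold, every entry of p_blend1 comes from models that never saw that sample, which is what allows the predictions to be used later in an association test.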
Generalization: An ensemble of ensembles
Generalizations of the algorithm described previously are also possible that can potentially further boost the prediction accuracy. In particular, the creation of an ensemble of models (steps 3 and 4 in previous algorithm) can be done in a variety of different ways. For example:
Ensemble learning variation 2: Combining of predictions from individual learning models can be done sequentially using predictions from all previous steps as inputs in the next step (i.e. instead of steps 3 and 4). Therefore, as an alternative approach, we can do the following:
1) Train learning algorithm 1 on the training data using the selected features f1, f2…fn as inputs and generate model predictions P1.
2) Train learning algorithm 2 on the training data using P1 and the selected features f1, f2…fn as inputs and generate model predictions P2.
3) Train learning algorithm 3 on the training data using P1, P2 and the selected features f1, f2…fn as inputs and generate model predictions P3.
…
k) Train learning algorithm k on the training data using P1, P2,…Pk-1 and the selected features f1, f2…fn as inputs and generate model predictions Pk.
Note that each algorithm after the first is a meta-level learning algorithm. Then, we generate predictions in test data Pblend2 using the models as in training and repeat for all cross-validation folds to obtain unbiased phenotype predictions for all samples.
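The sequential scheme of variation 2 can be sketched in a few lines, again assuming scikit-learn. The three learners, their order, the fold count and the toy data are hypothetical choices for illustration. For brevity, this sketch passes in-sample training predictions forward to the next learner, which is a simplification that can overfit; out-of-fold predictions could be substituted.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor


def fit_chain(F_train, y_train, learners):
    """Fit learners in order; learner j sees f1..fn plus P1..P(j-1)."""
    inputs = F_train
    for model in learners:
        model.fit(inputs, y_train)
        # Append this learner's prediction as an extra input column for the next one.
        inputs = np.column_stack([inputs, model.predict(inputs)])
    return learners


def predict_chain(F_test, learners):
    """Replay the fitted chain on new samples and return Pk, the last prediction."""
    inputs, pred = F_test, None
    for model in learners:
        pred = model.predict(inputs)
        inputs = np.column_stack([inputs, pred])
    return pred


def chained_cv_predictions(F, y, n_folds=5, seed=0):
    """Out-of-fold chained predictions (Pblend2) for every sample."""
    p_blend2 = np.empty(len(y), dtype=float)
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train, test in folds.split(F):
        learners = [LinearRegression(),
                    KNeighborsRegressor(n_neighbors=10),
                    RandomForestRegressor(n_estimators=50, random_state=seed)]
        fit_chain(F[train], y[train], learners)
        p_blend2[test] = predict_chain(F[test], learners)
    return p_blend2


# Toy data: 150 samples, 5 selected features coded 0/1/2.
rng = np.random.default_rng(2)
F = rng.integers(0, 3, size=(150, 5)).astype(float)
y = F[:, 0] + F[:, 1] * F[:, 2] + rng.normal(size=150)
p_blend2 = chained_cv_predictions(F, y)
```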
Ensemble learning variation 3: Instead of applying an ensemble learning model (variation 1) to all the samples, we can divide the high-dimensional parameter space of variables into different subsets. Then, we can train different ensemble learning models using only samples that fall in these different subsets and finally merge these models to obtain the overall prediction model. Subsequently, we can generate final predictions, Pblend3, in test data as we did for training data for all cross-validation folds within all subsets to obtain unbiased phenotype predictions for the entire sample.
Lastly, we can train a final random forest learning algorithm that uses f1, f2…fn and Pblend1, Pblend2 and Pblend3 as inputs and performs 20-fold cross-validation to generate the final prediction Pfinal.
Multi-marker tests of association
Once we have estimated a model using any of the algorithms described in the previous section and predicted phenotypes for all individuals using cross-validation, we can construct tests of association in the following manner. For continuous traits, we can calculate Pearson’s correlation coefficient between the predicted (Pfinal) and observed (Pactual) values and obtain the corresponding p value. For case-control studies, we perform a logistic regression using all the genetic variables (i.e., SNPs) and Pfinal as explanatory variables. A chi square based likelihood ratio test can then be used to generate p values.
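For the continuous-trait case, the test reduces to a correlation between two vectors. The sketch below uses only the Python standard library. The p value is obtained from Fisher's z approximation rather than the exact t distribution, and it is two-sided; both are assumptions of this example, since the text does not state which variant was used. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_test(p_final, p_actual):
    """Return (r, two-sided p value) using Fisher's z approximation."""
    n = len(p_final)
    r = pearson_r(p_final, p_actual)
    if abs(r) >= 1.0:                      # perfect correlation: atanh is undefined
        return r, 0.0
    z = math.atanh(r) * math.sqrt(n - 3)   # approximately N(0, 1) under r = 0
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return r, p


# Toy example: predictions that track the observed trait closely.
r_demo, p_demo = correlation_test([1, 2, 3, 4, 5], [2, 4, 6, 8, 10.5])
```

With cross-validated predictions, a correlation that is significantly greater than zero indicates that the variables carry information about the phenotype in samples that were not used for training.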
Testing multi-marker associations in the presence of covariates
Association testing in the presence of covariates (e.g., age, gender, BMI and smoking status) can be done in the following manner. First, consider both non-genetic covariates and genetic variables together for phenotype prediction according to any of the ensemble learning algorithms described earlier. Let Pfinal-all be the predicted phenotype values. Then, remove the SNP variables and rerun the phenotype prediction algorithm. Let Pfinal-covariates be the predicted phenotype values. For continuous traits, we first calculate Pearson’s correlation coefficient between each of these predictions and the true phenotype (Pactual). The strength of association for the genetic variables can then be calculated using Steiger’s Z test (Steiger 1980) for the difference between the 2 calculated correlation coefficients. Let r12 and r13 denote the Pearson’s correlations between the true phenotype (Pactual) and Pfinal-covariates and Pfinal-all, respectively. Let r23 denote the Pearson’s correlation between Pfinal-covariates and Pfinal-all. Steiger’s test computes p values from a test statistic that is assumed to follow a standard normal distribution.
The statistic is formed from Z12 and Z13, the Fisher’s transformations of r12 and r13, and from r23, which accounts for the dependence between the two correlations.
For case-control studies, we can use the non-genetic covariates, genetic variables, Pfinal-all, and Pfinal-covariates as explanatory variables in a logistic regression model. We then use a chi square based likelihood ratio test to compare this full model with a model without any genetic variables (i.e. non-genetic covariates and Pfinal-covariates only) to calculate a p value for the genetic contribution.
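For the continuous-trait comparison of r12 and r13, one commonly cited form of Steiger's (1980) statistic for two dependent correlations that share a variable is Z = (Z13 - Z12) sqrt(N - 3) / sqrt(2 - 2s), where s = psi / (1 - rbar^2)^2, rbar = (r12 + r13)/2, and psi = r23 (1 - 2 rbar^2) - 0.5 rbar^2 (1 - 2 rbar^2 - r23^2). The sketch below implements this form using only the standard library. It may differ in detail from the exact expression used in this study; the sign convention (positive Z when the model with SNPs predicts better) and the one-sided p value are assumptions of the example.

```python
import math
from statistics import NormalDist


def steiger_z(r12, r13, r23, n):
    """Steiger's Z for two dependent correlations sharing one variable.

    r12: corr(Pactual, Pfinal-covariates)
    r13: corr(Pactual, Pfinal-all)
    r23: corr(Pfinal-covariates, Pfinal-all); must be below 1 in absolute value
    n:   number of samples
    """
    z12, z13 = math.atanh(r12), math.atanh(r13)        # Fisher's transformations
    r_bar_sq = ((r12 + r13) / 2.0) ** 2                # pooled squared correlation
    psi = (r23 * (1 - 2 * r_bar_sq)
           - 0.5 * r_bar_sq * (1 - 2 * r_bar_sq - r23 ** 2))
    s = psi / (1 - r_bar_sq) ** 2                      # covariance term for Z12, Z13
    z = (z13 - z12) * math.sqrt(n - 3) / math.sqrt(2 - 2 * s)
    p_one_sided = 1.0 - NormalDist().cdf(z)            # H1: adding SNPs raises the correlation
    return z, p_one_sided


# Hypothetical correlations for illustration only.
z_demo, p_demo = steiger_z(r12=0.2, r13=0.4, r23=0.5, n=1000)
```

When the two predictions are nearly identical (r23 close to 1), the denominator approaches zero, so small differences between r12 and r13 can still be detected; when r23 equals 1 exactly the statistic is undefined.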
Multi-marker tests for interactions
We can test for interactions between a set of markers in the following manner. First, consider all of the SNPs together in a linear or logistic regression model (for a continuous or case-control phenotype, respectively) and generate phenotype predictions using cross-validation for all individuals. Let Plinear be the predicted phenotype values. Then, generate phenotype predictions for all individuals using any of the ensemble learning algorithms described previously. Let Pensemble denote the predicted phenotype values. For continuous traits, we use all markers as well as Pensemble and Plinear as explanatory variables in a multiple regression model (Model 1) and perform an F test against a model (Model 0) without interactions (i.e. one with all markers and Plinear only) to calculate the p value. We compare the sum of the squared errors (SSE) of prediction to construct an F statistic with (1, N – VModel1 – 1) degrees of freedom. Here: F = [SSEModel0 – SSEModel1][N – VModel1 – 1]/SSEModel1. N denotes the number of samples and VModel1 denotes the total number of explanatory variables in Model 1. For case-control studies, we use all markers as well as Pensemble and Plinear as explanatory variables in a logistic regression model and use a chi square based likelihood ratio test against a model without interactions (i.e. one with all markers and Plinear only) to calculate the p value.
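The F statistic for the interaction test follows directly from the two sums of squared errors. The short sketch below evaluates the formula given above; the SSE values and variable count in the example are hypothetical numbers chosen only to show the arithmetic. The p value would then be taken from the upper tail of an F distribution with the returned degrees of freedom (for example with scipy.stats.f.sf, if SciPy is available).

```python
def interaction_f_statistic(sse_model0, sse_model1, n_samples, n_vars_model1):
    """F = (SSE0 - SSE1) * (N - V1 - 1) / SSE1 with (1, N - V1 - 1) degrees of freedom.

    Model 0: all markers + Plinear. Model 1: Model 0 + Pensemble.
    """
    df2 = n_samples - n_vars_model1 - 1
    if df2 <= 0:
        raise ValueError("need more samples than explanatory variables")
    f_stat = (sse_model0 - sse_model1) * df2 / sse_model1
    return f_stat, (1, df2)


# Hypothetical example: 10 SNPs + Plinear + Pensemble = 12 variables, N = 3,000.
f_demo, df_demo = interaction_f_statistic(3100.0, 3000.0, 3000, 12)
# f_demo is 100 * 2987 / 3000, i.e. about 99.57, with df_demo == (1, 2987)
```

The numerator has a single degree of freedom because Model 1 differs from Model 0 by exactly one variable, Pensemble.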
Power and Type-1 error rates of gene-based association tests for data simulated under multiplicative and additive models
We tested the performance of the proposed gene-based test by simulating genotype data for 30 biallelic SNPs assuming Hardy-Weinberg equilibrium. We assumed the following 3 scenarios of linkage disequilibrium (LD) for the 30 SNPs: i) SNPs are within blocks with high LD (r = 0.9 or 0.8 within blocks); ii) SNPs are within blocks in moderate LD (r = 0.5 or 0.4); and iii) SNPs are completely independent of one another and in linkage equilibrium. The choice of simulation settings was similar to that used previously (Li et al. 2011). For each LD scenario, we considered 3 different gene sizes with the first 3, first 10 and all 30 SNPs with 1, 2 and 6 causative SNPs, respectively. For each gene size, we tested the following models: i) a null model with no disease loci; ii) an additive model where one SNP in each LD block had a minor allele that increased the risk additively by 0.14; and iii) a multiplicative model where one SNP in each LD block had a minor allele that increased the risk by a factor of 1.14. Disease prevalence was assumed to be 0.1. For each scenario, we used a sample of 1,500 cases and 1,500 controls drawn from a simulated population of 100,000 individuals. More details about LD patterns can be found elsewhere (Li et al. 2011). Type-1 error rates and statistical power were obtained from 1,000 and 500 simulated case-control datasets, respectively, and were based on the fraction of datasets for which the gene-based association test generated significant p values (i.e. p < 0.05).
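A simplified version of this simulation, covering only the linkage-equilibrium scenario with a multiplicative model, can be sketched with the standard library as follows. The minor allele frequency of 0.3, the way the baseline risk is calibrated to the prevalence, and the function name are assumptions of this example and are not taken from the study; LD blocks are not modeled.

```python
import random


def simulate_case_control(n_pop=100_000, n_snps=3, causal=(0,), maf=0.3,
                          rel_risk=1.14, prevalence=0.1,
                          n_cases=1500, n_controls=1500, seed=0):
    """Simulate a population, assign disease status, and sample cases and controls."""
    rng = random.Random(seed)
    # Baseline risk chosen so that the expected population risk equals the prevalence:
    # E[rel_risk ** g] = (1 + (rel_risk - 1) * maf) ** 2 for one SNP under HWE.
    baseline = prevalence / (1 + (rel_risk - 1) * maf) ** (2 * len(causal))
    cases, controls = [], []
    for _ in range(n_pop):
        # Hardy-Weinberg equilibrium: genotype = minor-allele count from two independent draws.
        g = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n_snps)]
        # Multiplicative model: each minor allele at a causal SNP scales the risk.
        risk = baseline * rel_risk ** sum(g[j] for j in causal)
        (cases if rng.random() < risk else controls).append(g)
    return rng.sample(cases, n_cases), rng.sample(controls, n_controls)


case_genotypes, control_genotypes = simulate_case_control()
```

Setting causal=() gives the null model, which is the setting used for estimating the Type-1 error rate.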
Power and Type-1 error rates of a gene-based test for models with interactions
The simulations in the previous section assumed that the effects of the various disease susceptibility SNPs were independent of one another and that they increased the risk additively or multiplicatively. To explore the effect of pairwise and higher order interactions between genetic variants, we also compared the performance of methods for data simulated under models with interactions. We simulated a quantitative trait for many different models with one or more interactions among variants in addition to main effects. In addition, we also considered scenarios where there is pure epistasis (i.e. where the effect of a group of SNPs is simply due to their interactions and there are no main effects). We simulated samples of 3,000 individuals and genes with 5 or 10 SNPs assuming linkage equilibrium. The phenotype was drawn from a complex distribution involving the sum of a standard normal random variable and some multivariable function involving many SNP variables (Table 4). SNP variables are coded as 0, 1 or 2. Power and Type-1 error rates were estimated based on 100 and 500 simulated datasets, respectively. We calculated the fraction of simulated datasets for which the gene-based method generated a significant p value (p < 0.05). We compared our result with a gene-based test using linear regression and a gene-based test using GATES (Li et al. 2011). For the gene-based test with linear regression, p values were obtained using an F test statistic.
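The structure of these simulations can be sketched as follows, using only the standard library. The single product term, its coefficient, and the minor allele frequency are hypothetical stand-ins for the models listed in Table 4, which are not reproduced here. The second function shows how power or a Type-1 error rate is estimated as the fraction of simulated datasets with p < 0.05.

```python
import random


def simulate_interaction_trait(n=3000, n_snps=5, maf=0.3, beta=0.5, seed=0):
    """Quantitative trait = standard normal noise + a function of the SNPs."""
    rng = random.Random(seed)
    genotypes, trait = [], []
    for _ in range(n):
        # SNPs coded 0, 1 or 2, drawn independently (linkage equilibrium).
        g = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n_snps)]
        # Hypothetical signal: one interaction term between the first two SNPs.
        signal = beta * g[0] * g[1]
        genotypes.append(g)
        trait.append(rng.gauss(0.0, 1.0) + signal)
    return genotypes, trait


def rejection_rate(p_values, alpha=0.05):
    """Fraction of simulated datasets whose test p value falls below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


genotypes, trait = simulate_interaction_trait()
# Setting beta=0 yields null datasets; rejection_rate on their p values
# estimates the Type-1 error rate, and on non-null datasets it estimates power.
```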
Power and Type-1 Error rates for a multi-marker test for interactions
For all the models simulated in the previous section, we also constructed a multi-marker test for interactions as described previously and estimated the power of such a test. We simulated samples of 3,000 individuals and genes with 5 or 10 SNPs assuming linkage equilibrium. The phenotype was drawn from a complex distribution involving a sum of a standard normal variable and interaction terms involving many SNPs as shown in Table 4. Power and type-1 error rates were estimated based on 1,000 simulated datasets. For each model with interactions, we calculated the fraction of simulated datasets for which the multi-marker test of interactions generated a significant p value (p < 0.05); p values were based on an F test statistic with two parameters as described previously.
Datasets
We applied the methods developed in this paper to data from 2 independent studies. The studies included the Study for Asthma Phenotypes and Pharmacogenomic Interactions by Race-ethnicity (SAPPHIRE) and the Genes-environments and Admixture in Latino Americans (GALA II). Recruitment for both studies is ongoing.
SAPPHIRE is a population-based study that seeks to understand the genetic underpinnings of both asthma and asthma medication response. Study individuals included in this analysis were recruited from a single large health system serving southeast Michigan and the Detroit metropolitan area. Enlisted patients with asthma met the following criteria: age 12-56 years, a physician diagnosis of asthma, and no recorded diagnosis of chronic obstructive pulmonary disease or congestive heart failure. Control individuals without asthma were recruited from a similar geographic region and were 12-56 years of age, but they did not have a prior recorded diagnosis of asthma, chronic obstructive pulmonary disease, or congestive heart failure. Genome wide genotyping was performed using the Axiom Genome-Wide AFR array (Affymetrix, Santa Clara, CA). After data quality control, genotype information was available on 586,952 SNPs for 1,073 individuals with asthma and 328 healthy controls (Padhukasahasram et al. 2014). All of the individuals from the SAPPHIRE cohort included in this analysis were African American by self-report.
The GALA II study is a case-control study to identify gene-environment interactions contributing to asthma. Children of Latino descent age 8-21 years were recruited from New York City, Chicago, San Francisco, Houston, and Puerto Rico. Children with asthma had a physician diagnosis of asthma and either a 12% increase in forced expiratory volume in one second following the administration of albuterol or a positive methacholine challenge test. Genome wide genotype data was available on 3,772 individuals (1,891 with asthma and 1,881 without). Genomic DNA was genotyped on the Axiom Genome-Wide LAT array. After data cleaning, information was available for 747,075 SNPs genome wide.
RESULTS
Multiplicative and Additive models-Comparisons
Tables 1–3 show comparisons of the performance of various methods for disease case-control datasets simulated under additive and multiplicative models. We can see that the performance of the newly proposed method based on an ensemble of machine learning algorithms is comparable to other approaches and the Type-1 error rates produced by all methods are close to what is expected. For more details about the different methods tested in these tables, please refer to Li et al. 2011. Note that when there are no disease-related SNPs in the data, we expect to see p values < 0.05 in around 5% of the simulated datasets due to chance alone. For the ensemble learning and logistic regression methods, we can also see that power is not strongly sensitive to the strength of linkage disequilibrium. Thus, for both additive and multiplicative models, power estimates do not appear to change much across Tables 1-3 for these methods.
Models with epistatic effects
In Table 4, we show the power of the ensemble learning based multi-marker association test using a simulated quantitative trait for models with interactions. We compare the ensemble learning approach with a gene-based test constructed using multiple linear regression, as well as with the extended Simes procedure as implemented by GATES. In all situations, our simulations indicate that the machine learning approach, which can model interactive effects, is uniformly more powerful for detecting gene-based associations when compared with the other two approaches. Table 4 also shows that the estimated gain in power can be substantial. Among the other two methods, multiple linear regression performed second best while the GATES method which only integrates the p values from single marker tests had the lowest power. In Table 5, we show the power and Type-1 error rates of a multi-marker test for interactions using the same models as in Table 4. These results clearly demonstrate the ability of our approach to detect the presence of interactions by considering the difference between ensemble learning and linear model based predictions.
Application to real datasets
We applied the proposed gene-based association test to an empirical dataset consisting of 1,401 African American individuals (1,073 individuals with asthma and 328 individuals without asthma) from the SAPPHIRE cohort and 3,772 Latino children (1,891 individuals with asthma and 1,881 individuals without asthma) from the GALA II study. Tables 6 and 7 show the sample characteristics of these populations. We tested 9 previously studied asthma-related genes (Li et al. 2010; Moffatt et al. 2010; Torgerson et al. 2011) to see if these are also associated with asthma status in our datasets. Although hundreds of genes have been implicated in asthma, only a few have been reliably replicated in multiple groups. Therefore, to demonstrate the performance of our method, we restricted our analysis to a small subset of asthma genes identified (and replicated in some cases) in well-powered, high-quality studies. This also reduces the burden of multiple testing. When constructing gene-based association tests, we adjusted for age, gender and the first 10 principal components in both study groups. Principal component analysis was performed using the prcomp function in R with a random set of 10,000 markers. Tables 8 and 9 show the results of our ensemble learning gene-based association test in the SAPPHIRE and GALA II study populations, respectively. The results are compared with those obtained using the GATES method and logistic regression. At a Bonferroni adjusted significance threshold of 0.0027 (= 0.05/18 [i.e., 9 genes assessed twice]), we found that the ensemble learning gene-based test identified more statistically significant results when compared with the other gene-based methods. Specifically, IL33 was significantly associated with asthma in Latino children using the ensemble learning gene-based test, but this gene was of borderline significance using the other 2 approaches.
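The Bonferroni threshold used above is simple arithmetic: 0.05 divided by 18 tests is approximately 0.00278, which is reported in truncated form as 0.0027. A minimal sketch of the calculation follows; the helper name is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


n_genes, n_cohorts = 9, 2              # 9 genes, each assessed in 2 cohorts
threshold = bonferroni_threshold(0.05, n_genes * n_cohorts)
print(f"{threshold:.5f}")              # prints 0.00278
```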
DISCUSSION
We have introduced a new method for assessing gene-based associations using genome wide genotype data. This method uses diverse machine learning algorithms to construct predictive models for the phenotype using the SNP variation within a gene and then uses these predictions to construct tests of association. Machine learning algorithms represent powerful tools for inferring the relationship between multiple explanatory variables and a phenotype while accounting for complicated interactions between the former. Because the “true” multivariable relationship between a set of variables and a trait like disease or drug response is not known in advance, we can better approximate this relationship by first learning from the data. The use of ensemble learning-based predictions leads to novel multi-marker tests of association. In addition to gene-based tests of association, we expect that these methods could also be applied for pathway-based analysis or to any other set of polymorphic variants defining a region of interest or a functional class.
There are three key advantages of using our gene-based approach compared to existing approaches. First, our method does not make a priori assumptions about the genetic model for a SNP (i.e. additive, recessive or dominant). When constructing our tests, we can include 3 variables for each SNP where the variants are encoded according to these 3 models (i.e. additive, recessive, dominant). Thus, we can include heterogeneous effects within and across SNPs. A second advantage is the ability to include any number of covariates (genetic or non-genetic) and model higher level interactions between them. This feature makes machine learning particularly suited for assessing gene-environment or gene-gene interactions. Third, creating an ensemble of diverse multivariate models with meta-features makes our method less restrictive and capable of approximating the phenotype more accurately. Collectively, these novel aspects can boost statistical power and result in novel genetic discoveries.
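The three encodings mentioned above can be illustrated with a short sketch. It assumes that each genotype is stored as the number of minor alleles (0, 1 or 2) and that the dominant and recessive codings are defined with respect to the minor allele; the function name is hypothetical.

```python
def encode_snp(genotype):
    """Map a minor-allele count (0, 1 or 2) to additive, dominant and recessive codes."""
    if genotype not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2")
    return {
        "additive": genotype,              # effect scales with the allele count
        "dominant": int(genotype >= 1),    # one copy is enough
        "recessive": int(genotype == 2),   # two copies are required
    }


# Each SNP contributes three input variables to the learning algorithms.
encoded = [encode_snp(g) for g in (0, 1, 2)]
```

Supplying all three columns lets the learning algorithms choose, separately for each SNP, whichever coding best predicts the phenotype.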
Extensions of these methods towards the case of multiple correlated phenotypes should also be straightforward. If instead of a single phenotype, we are interested in many phenotypes that are correlated with one another in some manner, we can construct a joint association test for all of them in the following manner. First, we apply the ensemble learning based gene-based association test to each phenotype individually and obtain the corresponding p values. Subsequently, we can obtain an overall p value from these individual p values using the TATES multi-trait association method (van der Sluis et al. 2013), which is analogous to the extended Simes procedure of GATES developed for testing multi-marker associations.
We applied our method to both simulated and empirical datasets to demonstrate its power and utility. For models without interactions between variables, the ensemble learning approach worked similarly when compared with other previous gene-based association tests. In contrast, for models dominated by interactions, our simulation studies suggested that the ensemble learning test can be considerably more powerful than other methods. Thus, for situations where epistatic or gene-environment effects are likely to be important, our association test is more likely to detect associations as compared to the alternative methods described.
There are a number of potential limitations to our approach that require mentioning. First, computational time can be a limitation when applying ensemble learning-based association tests to thousands of genes. One potential solution would be to start by using a computationally efficient gene-based method, such as the GATES procedure, to first identify a smaller subset of likely candidate genes. Then, a machine learning based multi-marker association approach could be applied to this restricted set to further refine the group of candidate genes. However, at this point it is uncertain whether such an approach would result in improved statistical power. Next, we cannot state with certainty that the genes assessed here are involved in asthma pathogenesis, since many of these genes were identified in association studies and their function (as it relates to asthma) has not yet been elucidated. Therefore, while we assume that these genes represent true-positives, this portion of our analysis may not represent an actual demonstration of statistical power unless more detailed functional studies are conducted for the relevant genes to directly demonstrate their role in asthma. Lastly, it should also be mentioned that while our multi-marker tests can detect associations or the presence of interactive effects, they do not attempt to pinpoint the specific variants contributing to such effects. Elucidating such details will entail more in-depth analysis of the models learned and construction of additional tests.
In summary, ensemble learning algorithms provide a general and flexible framework for conducting association analysis. We have shown that phenotype predictions made by such algorithms can be used for many common tasks encountered in association analysis, such as multi-marker association tests, adjusting for genetic and non-genetic covariates, and tests of interaction. Because machine learning is a highly developed area of study, prediction of response from many input variables is a well-studied problem and numerous well-established algorithms are already available which can be readily incorporated as components in an ensemble learning framework to maximize prediction accuracy and construct powerful tests of association.