Abstract
High-dimensional genomics data in the biomedical sciences are an invaluable resource for constructing statistical prediction models. With the increasing knowledge of gene networks and pathways, this information can be incorporated into statistical models to improve prediction accuracy and enhance model interpretability. However, in some scenarios the network structure may be only partially known or inaccurately specified, and the performance of statistical models incorporating such network structure may be compromised. In this paper, we proposed a weighted sparse network learning method that addresses this issue by optimally combining a sparse data-driven network with a known or partially known prior network. We showed in extensive simulation studies that our proposed model attains the oracle property, which ensures accurate parameter estimation, and achieves a parsimonious model in the high-dimensional setting for different outcomes including continuous, binary and survival data. Case studies on ovarian cancer proteomics and melanoma gene expression further demonstrated that our proposed model achieves good operating characteristics in predicting response to chemotherapy and survival risk. An R package glmaag implementing our method is available on the Comprehensive R Archive Network (CRAN).
Introduction
The rapid advancement of high throughput genomics profiling has revolutionized biomedical research towards personalized medicine for treating and preventing various diseases including cancer. Several consortia have been established as part of the collaborative efforts to decipher the molecular mechanisms underlying these diseases; for example, the Cancer Genome Atlas (TCGA) project has enabled researchers to access a rich cancer genomics database. Together with the rapid development of machine learning and artificial intelligence, these databases have been utilized extensively for improving computational and statistical model building and prediction.
One key attribute of these datasets is their high dimensionality, i.e., p ≫ n, in which the number of candidate features/predictors is much larger than the sample size. For instance, in a typical DNA methylation dataset, several hundred thousand CpG sites are interrogated. The regularization framework has emerged as an attractive alternative that addresses the limitations of classical feature selection methods in generalized linear models (GLMs), including computational inefficiency and multicollinearity. For instance, GLM regularization with the l1 penalty (least absolute shrinkage and selection operator, LASSO) [28, 29] allows for simultaneous variable selection to prevent overfitting, whereas [37] showed that combining the l1 with the l2 penalty (elastic net, EN) provides not only the variable selection property but also robustness to correlated features (the grouping property). [6] and [8] argued that a good feature selection procedure should have the oracle property, which includes feature selection consistency and asymptotically unbiased parameter estimation. Thus, [36] and [38] proposed the adaptive LASSO and adaptive EN, which have the oracle property and can be optimized efficiently.
The above-mentioned methods have been shown to achieve good performance in prediction models in which no prior knowledge is available. However, the abundance of genomics research has enabled biological knowledge associated with diseases to be inferred from gene regulatory networks and pathways. Well-known databases of gene regulatory networks include KEGG: Kyoto Encyclopedia of Genes and Genomes (https://www.genome.jp/kegg/) ([11]) and the Reactome Pathways database (https://reactome.org/). If the network structure of the data is known in advance, one can potentially improve model prediction and interpretability by incorporating the prior network information. One possible extension is to replace the l2 penalty with a quadratic penalty that utilizes the unsigned or signed adaptive Laplacian matrix of the network structure ([14, 15]), which yields better performance in both prediction and variable selection. This framework has been applied to both classification ([25]) and survival ([24]) outcomes. On the other hand, [33] combined the l1 penalty with an unsigned network penalty to achieve the oracle property in the Gaussian regression framework.
Although the above-mentioned public regulatory network databases provide invaluable prior knowledge, one limitation is that most known networks only describe the connectivity without information on the strengths of the connections. These strengths are important factors that may influence the grouping property of the prediction model. On the other hand, one may also encounter a dataset whose network structure is unknown or only partially known. In this scenario, one can still apply a graph-based method by estimating the network empirically from the data, e.g., using the neighborhood selection method ([19]) to learn the connectivity among the candidate features and using the reliability score provided by the reference gene association (RGA) network ([31]) as the strengths of connectivity.
Another challenge in the regularization framework is to correctly tune the penalty parameters. A common approach is cross validation, which is straightforward to apply in the regularization framework. However, cross validation has been shown to have a tendency to overfit the data when the number of features is large relative to the sample size ([32]). An alternative approach is the stability selection method, developed based on the consistency of variable selection across multiple subsamples ([20]); this method has been shown to perform well in graph-based models ([17]).
In this paper, we addressed the above-mentioned limitations of existing network/graph-based prediction models by proposing a mixture network prediction framework that combines two candidate networks (usually one is a fixed network obtained from a gene regulatory network database while the other is estimated from the data). To this end, we adapted the l1 penalty in order to achieve the oracle property. In addition, to attain robust variable selection accuracy, we implemented the stability selection method for parameter tuning and compared this approach to the cross validation method. We developed our proposed framework for various outcomes including continuous, binary and survival data.
This paper is organized as follows. The description of our proposed method and the corresponding model fitting algorithm are provided in Section 2. The Monte Carlo simulations and case studies are provided in Sections 3 and 4, respectively. We conclude with a discussion in Section 5.
Methodology
Network Regularized Regression
We start our exposition by reviewing the (partial) log likelihood l (β) of the generalized linear model (GLM) for continuous, binary and survival outcomes, where Y = (y1,…, yn) is the outcome vector, X is the predictor matrix, δi is the event indicator for the right censored outcome, and Ri = {j | tj ≥ ti} is the risk set of subject i. None of these GLM models can be optimized directly in the high dimensional (p ≫ n) case. One approach to circumvent this challenge is to solve for the maximum penalized (partial) likelihood estimator (MPLE). We proposed a network LASSO with l1 adaptive weights (abbreviated as AAG), in which the MPLE is defined in primal form with a weight vector w ⪰ 0 for the l1 penalty, the normalized Laplacian matrix L, and tuning parameters s1 ≥ 0, s2 ≥ 0 and t > 0. To estimate the sign adapter for the network penalty, we can fit the GLM model without penalty or a ridge GLM model with l2 penalty and use the signs of the resulting coefficient estimates as the sign estimates ([24]); the signed network is then obtained by adjusting the connectivity strengths according to these estimated signs. Next, to estimate the weight vector w, we estimate the coefficients in equation 1 with s1 = 0, i.e., no l1 penalty, and compute the l1 adaptive weights w from these estimates ([36] and [38]). The normalized Laplacian matrix L is defined entrywise as L_uv = 1 if u = v and ξ_u ≠ 0, L_uv = −ω_uv/√(ξ_u ξ_v) if (u, v) ∈ E, and L_uv = 0 otherwise, where E is the connectivity set, ξ is the degree of the node, and ω is the strength (which can be either positive or negative) of the connectivity; ω can be estimated from the reference gene association network utilizing the reliability score of the Pearson correlation ([31]). The reliability score of features i and j, denoted Rij, is a function of rij, the rank of the correlation between features i and j among all correlations of feature i with the other features.
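As an illustration, the signed normalized Laplacian described above can be constructed from a weighted edge list. This is a minimal Python sketch, not the glmaag implementation: the function name and the use of summed absolute strengths as node degrees are our assumptions.

```python
import numpy as np

def signed_normalized_laplacian(p, edges):
    """Build a normalized Laplacian of the form used in the network penalty.

    `edges` maps a pair (u, v) to a signed connectivity strength omega_uv.
    Diagonal entries are 1 for connected nodes; an edge (u, v) contributes
    -omega_uv / sqrt(xi_u * xi_v), where xi is the node degree (here the
    sum of absolute strengths -- an assumption of this sketch).
    """
    W = np.zeros((p, p))
    for (u, v), omega in edges.items():
        W[u, v] = W[v, u] = omega
    xi = np.abs(W).sum(axis=0)  # node degrees
    L = np.zeros((p, p))
    for u in range(p):
        if xi[u] > 0:
            L[u, u] = 1.0
        for v in range(p):
            if u != v and W[u, v] != 0:
                L[u, v] = -W[u, v] / np.sqrt(xi[u] * xi[v])
    return L
```

Note that for an empty edge set this reduces to the zero matrix on isolated nodes and the identity on connected ones, consistent with the adaptive elastic net special case discussed below.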
We require L to be positive definite if XTX is not invertible. If L is an identity matrix, the model reduces to the adaptive elastic net model. This indicates that the adaptive elastic net ([38]) is a special case of our AAG model in which there is no connection in the network (i.e., an independent structure). To solve equation 1, we optimize the corresponding Lagrangian objective function with penalty parameters λ1 ≥ 0 and λ2 ≥ 0.
To solve equation 2 when λ1 > 0, we implemented the proximal Newton based coordinate ascent algorithm derived by [9] and [22] with the adaptive l1 penalty. For Gaussian regression the coordinate-wise update is based on S (a, b) = sign (a) (|a| − b)+, the soft-thresholding operator ([5]).
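A minimal sketch of the Gaussian coordinate-wise update follows, assuming the Lagrangian objective (1/2n)||y − Xβ||² + λ1 Σ_j w_j |β_j| + (λ2/2) βᵀLβ (our reading of the penalized likelihood; the scaling constants are assumptions, and glmaag additionally uses screening rules and warm starts).

```python
import numpy as np

def soft_threshold(a, b):
    # S(a, b) = sign(a)(|a| - b)_+
    return np.sign(a) * max(abs(a) - b, 0.0)

def cd_gaussian(X, y, w, L, lam1, lam2, n_iter=200):
    """Cyclic coordinate updates for
        (1/2n)||y - X beta||^2 + lam1 * sum_j w_j |beta_j|
                                + (lam2/2) * beta' L beta.
    Didactic sketch only: no convergence check or active-set logic.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n  # (1/n) x_j' x_j
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding feature j
            r_j = y - X @ beta + X[:, j] * beta[j]
            a = X[:, j] @ r_j / n - lam2 * (L[j] @ beta - L[j, j] * beta[j])
            beta[j] = soft_threshold(a, lam1 * w[j]) / (col_ss[j] + lam2 * L[j, j])
    return beta
```

With λ1 = λ2 = 0 the update cycles to the ordinary least squares solution, which provides a quick sanity check.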
For logistic and Cox regression, we require a quadratic approximation of the (partial) log likelihood via a second-order Taylor expansion, consisting of a constant term that does not depend on β, a working update of β, and β0 = 0 for the Cox model. For the Cox model we only need to calculate the diagonal entries of the second-derivative matrix and fix all the off-diagonal entries to zero to speed up the computation, based on the argument provided by [22] that the off-diagonal entries are small compared to the diagonal entries. For the logistic model this matrix is already diagonal. Therefore, we let u denote its diagonal elements.
Note that the Gaussian model is a special case in which ui = 1 and zi = yi. The coordinate-wise update step for the logistic model takes the same soft-thresholding form, with the observations weighted by ui and the working responses zi in place of yi.
The working update for the logistic model is given by zi = xiTβ̃ + (yi − p̃i)/(p̃i (1 − p̃i)), where p̃i is the fitted probability under the current estimate β̃.
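These are the standard iteratively reweighted least squares (IRLS) quantities, sketched below; the paper's exact notation may differ, and the function name is ours.

```python
import numpy as np

def logistic_working_response(X, y, beta):
    """Quadratic-approximation ingredients for penalized logistic regression:
    weights u_i = p_i (1 - p_i) and working response
    z_i = x_i' beta + (y_i - p_i) / u_i."""
    eta = X @ beta               # linear predictor
    prob = 1.0 / (1.0 + np.exp(-eta))
    u = prob * (1.0 - prob)      # diagonal of the second-derivative matrix
    z = eta + (y - prob) / u     # working response
    return u, z
```

At β = 0 the fitted probabilities are 0.5, so u = 0.25 and z = 4y − 2, which is a convenient starting point for the coordinate updates.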
For the Cox model, we used Breslow's method ([2]) to handle tied survival times. The working update involves Rj, the set of indices k with tk ≥ tj for the jth sample; Ci = {j | tj ≤ ti}, the set of indices j with tj ≤ ti for the ith sample; and di, the number of samples whose survival times are tied with that of the ith sample.
Mixture Network Tuning
In real data analysis, obtaining a correctly specified complete network structure may be infeasible for model fitting. In addition, even in scenarios where the network structure is known, the strengths/weights of the connectivity might not be available. To circumvent these issues, we proposed a mixture network method that combines a pre-specified network L1 and a data driven network L2 in the penalized likelihood framework, where 0 ≤ c ≤ 1 is the network weight. If L1 and L2 are both positive definite, the final mixture network L = cL1 + (1 − c) L2 is also positive definite, and thus the consistency property still holds. To obtain the network weight c, we recommend fixing λ1 = 0 when tuning the weight between networks and searching only over the combinations of λ2 × c for computational efficiency. As suggested by [4], we searched λ2 over {0.01 ⋅ 2^0, 0.01 ⋅ 2^1,…, 0.01 ⋅ 2^7}. To tune the parameter c, we recommend searching over the set {0, 0.1, 0.2,…, 1}. We tuned the two networks via cross validation and chose the value of c that optimized the cross validation performance. Upon identifying the optimal c, we fixed the final mixed network L = cL1 + (1 − c) L2 when tuning λ1 and λ2.
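The mixture tuning recipe above can be sketched for the Gaussian case, where fixing λ1 = 0 gives a closed-form ridge-type fit. This is an illustrative sketch under assumed grids, not the glmaag implementation.

```python
import numpy as np

def tune_network_weight(X, y, L1, L2, lam2_grid, c_grid, k=5, seed=0):
    """Pick the mixture weight c for L = c*L1 + (1-c)*L2 by k-fold CV
    with lam1 fixed at 0, so the Gaussian fit has the closed form
    beta = (X'X/n + lam2*L)^{-1} X'y/n."""
    n = X.shape[0]
    folds = np.random.default_rng(seed).integers(0, k, n)
    best_score, best_c = np.inf, None
    for c in c_grid:
        L = c * L1 + (1 - c) * L2
        for lam2 in lam2_grid:
            sse = 0.0
            for f in range(k):
                tr, te = folds != f, folds == f
                Xt, yt = X[tr], y[tr]
                beta = np.linalg.solve(Xt.T @ Xt / len(yt) + lam2 * L,
                                       Xt.T @ yt / len(yt))
                sse += ((y[te] - X[te] @ beta) ** 2).sum()
            if sse / n < best_score:
                best_score, best_c = sse / n, c
    return best_c
```

In practice λ2 would range over {0.01 ⋅ 2^0,…, 0.01 ⋅ 2^7} and c over {0, 0.1,…, 1} as recommended above.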
To estimate a data-driven network L2, we obtained the connectivity using the R package huge ([35]) with the penalized neighborhood selection method ([19]) tuned by the rotation information criterion (RIC). However, this method does not provide the strengths of connectivity. Therefore, we estimated the strengths/weights using the reliability score provided by the reference gene association network ([31]).
Parameter Tuning
We compared two frameworks for tuning λ1 and λ2. The first is the cross validation (CV) framework, in which we performed CV using the deviance or a robust measure: negative mean absolute error (MAE) for the Gaussian model, area under the receiver operating characteristic curve (AUC) for the logistic model and concordance index (C) for the Cox model. For the Gaussian model, the deviance measure is equivalent to the negative mean squared error (MSE). One can either use the maximum (max) rule, i.e., choose the (λ1, λ2) that maximizes the CV measure, or the one standard error (1se) rule, i.e., choose the (λ1, λ2) that yields the most parsimonious model within one standard error of the best CV measure. We also constrained the number of selected variables to be at most p/2 to improve computational speed.
Although CV is a convenient framework and has been shown to achieve good performance in low dimensional data, it may result in overfitting in high dimensional case ([32]). An alternative approach is via the stability selection (SS) proposed by [20] which measures the feature selection stability across subsampling replicates, and has been shown to be robust in graphical models ([17]).
Suppose we randomly draw K subsamples (usually K = 100) of [n/2] observations (or another subsample size depending on n, as suggested by [20] and [17]). The selection probability of feature j is the proportion of the subsamples Sk, k = 1,…, K, in which feature j is selected, and the selection variance of feature j is the corresponding binomial variance of this proportion.
A stable method should have a low selection variance; thus the instability score is defined as the average selection variance across all features.
To make the score comparable across different λ2's, we consider a monotone transformation of the instability score such that the instability path decreases with increasing λ1 for each fixed λ2. Combining the instability scores, we find the maximum score that is lower than a specified cutoff, usually 0.15, and use the corresponding (λ1, λ2) as the selected tuning parameters.
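The untransformed instability score can be sketched as follows, in the StARS style of [17] (the factor 2 and the further monotone transformation across the λ path follow that line of work; the paper's exact transformation may differ).

```python
import numpy as np

def instability_score(selections):
    """`selections` is a (K, p) boolean array: entry (k, j) indicates that
    feature j was selected in subsample k.  Computes the selection
    probability p_hat_j per feature and returns the instability score
    2 * mean_j [ p_hat_j * (1 - p_hat_j) ]."""
    p_hat = selections.mean(axis=0)          # selection probabilities
    return 2.0 * np.mean(p_hat * (1.0 - p_hat))
```

A feature selected in every subsample (or in none) contributes zero instability; a feature selected half the time contributes the maximum per-feature value of 0.5.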
Tuning λ1 and λ2 usually works iteratively by searching over λ1 for each fixed λ2 until all combinations of λ1 × λ2 have been considered. According to the strong rules for discarding predictors ([30]), it is not necessary to consider all predictors for every λ2. In particular, we can discard predictors that are unlikely to be retained in the model based on the Karush-Kuhn-Tucker (KKT) conditions. We applied the strong rules in our model to improve computational speed.
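The sequential strong rule of [30] can be sketched as follows. The feature-specific weighting is our illustrative adaptation for the adaptive l1 penalty; glmaag's exact screening condition may differ, and discarded features must still be rechecked against the KKT conditions after fitting.

```python
import numpy as np

def strong_rule_keep(X, resid, w, lam_new, lam_prev):
    """Sequential strong rule for a weighted l1 penalty (sketch):
    keep feature j only when |x_j' r| / n >= w_j * (2*lam_new - lam_prev),
    where r is the residual (or working residual) at the previous lambda
    on the path.  Features failing this test are provisionally discarded."""
    n = X.shape[0]
    score = np.abs(X.T @ resid) / n
    return score >= w * (2 * lam_new - lam_prev)
```

Screening in this way typically removes the bulk of the p features before each coordinate ascent pass, which is where most of the speedup comes from.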
Theoretical Properties
Group Effect
We showed how the network penalty adjusts for multicollinearity by proving a group effect. Without loss of generality, we assumed that the response vector y (for Gaussian models) and the predictor matrix X have been standardized. We assume that features i and j are linked to each other and to no other features, and that the signs of the estimates are correct. Assume further that the sample correlation of Xi and Xj is ρij, the l1 penalty weight for feature i is wi, and the strength of connectivity between features i and j is ωij with 0 ≤ ωij ≤ 1. Then the difference between the estimated coefficients of features i and j can be bounded in terms of ρij, ωij and the penalty parameters, so that strongly connected, highly correlated features receive similar coefficients.
Oracle Property
We provided the theoretical proof of the oracle property for our proposed method to ensure that the model is robust with respect to variable selection and coefficient estimation. The Gaussian and logistic models are in the exponential family whereas the Cox proportional hazards model is not; thus the oracle property for the Cox model differs from that for the Gaussian and logistic models. We prove the oracle property for the Cox model and for exponential family GLMs (not limited to Gaussian and logistic models) separately in the following subsections.
Generalized Linear Model in the Exponential Family
For a GLM in the exponential family, e.g., the Gaussian and logistic models, the likelihood function can be written as l (Y |X, θ) = h (Y) exp{YTθ − φ (θ)}, where θ = Xβ and β is the true coefficient vector. We denote the maximum penalized likelihood estimate by β̂, where the adaptive weights are constructed from a root-n-consistent estimator of β, such as the OLS estimator, raised to a power r > 0. Let Â and A denote the selected feature set and the true predictor set, respectively.
Suppose that the tuning parameters satisfy the appropriate rate conditions and that Λmax (L) ≤ λL < + ∞, where Λmax(⋅) represents the largest eigenvalue of a given matrix. Given the two regularity conditions:
The Fisher information matrix I (β) = E[φ″ (Xβ) XTX] and its sample counterpart are finite and positive definite.
There exists a sufficiently large open set O with β ∈ O such that for all B ∈ O we have |φ‴ (XB)| ≤ M (X) < +∞ and E[M (X) |xixjxk|] < +∞ for any 1 ≤ i, j, k ≤ p,
we have the following two properties:
Variable selection consistency: P(Â = A) → 1 as n → ∞.
Asymptotic normality: √n (β̂A − βA) converges in distribution to a mean-zero Gaussian whose covariance is the inverse of the Fisher information restricted to A.
Cox's Proportional Hazards Model
The Cox model is not within the exponential family, so the proof of the oracle property requires some modifications, as shown in [7]. Define the at-risk process Yi(t) and the counting process Ni(t) (where Ti is the failure time and Ci is the censoring time of the ith subject), and the Fisher information matrix in terms of
s(k) (β, t) = E[x (t)⊗k Y(t) exp (x (t)T β)], k = 0, 1, 2, where h0 (t) is the baseline hazard function. We assume that all the regularity conditions (A-D) in [7] hold, that A is the true predictor set, and that AC is its complement. Given that the tuning parameters satisfy the appropriate rate conditions and Λmax (L) ≤ M < +∞, the root-n consistent estimator satisfies the following conditions:
Sparsity: P(β̂AC = 0) → 1 as n → ∞.
Asymptotic normality: √n (β̂A − βA) converges in distribution to a mean-zero Gaussian whose covariance is the inverse of the Fisher information restricted to A.
Monte Carlo Simulations
We conducted a Monte Carlo study to evaluate the performance of our proposed model. We considered two network structures, namely (1) the autoregressive (AR) structure, where each feature is connected only to its neighbors, and (2) the hub (HUB) structure, where the features form groups with one dominant feature within each group.
In our simulation, we generated p = 200 features and n = 500 samples, of which 100 samples were used as training data and the remaining 400 samples were set aside as test data. The features were generated from a multivariate Gaussian distribution with mean zero and identity covariance. We assigned three twenty-feature groups with absolute coefficients 0.5, 1 and 2 and random signs; the remaining features were noninformative (i.e., had zero coefficients).
For Gaussian models, we generated Gaussian noise with mean zero and standard deviation ||β||2/2. For logistic models, we generated the outcome variable from the Bernoulli distribution with the probability of success given by the logistic transform of the linear predictor. For Cox models, we generated survival times from a Weibull baseline hazard with shape parameter 5 and scale parameter 2, with censoring times following a uniform distribution U (2, 15), which leads to a censoring rate of approximately 30%.
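The Gaussian part of this simulation design can be sketched as follows (function name and seed handling are ours; the original study was run in R).

```python
import numpy as np

def simulate(n=500, p=200, seed=1):
    """Generate data following the simulation design: p independent
    standard-Gaussian features; three 20-feature groups with absolute
    coefficients 0.5, 1 and 2 and random signs; remaining coefficients
    zero; Gaussian noise with standard deviation ||beta||_2 / 2."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    for g, size in enumerate([0.5, 1.0, 2.0]):
        beta[20 * g: 20 * (g + 1)] = size * rng.choice([-1, 1], 20)
    sigma = np.linalg.norm(beta) / 2
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta
```

Splitting the n rows into 100 training and 400 test samples then reproduces the evaluation setup described above.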
Cross Validation with p/2 Constraints
We compared our proposed model to the elastic net model (implemented in the R package glmnet) and to network-LASSO regression without the l1 adaptive weights (implemented in the R package glmgraph). Since glmgraph does not implement the Cox model, we wrote our own code for fitting Cox models in network-LASSO regression without the l1 adaptive weights. To assess the effect of network misspecification, we considered scenarios using (1) the correct network (cor), (2) the incorrect network (AR misspecified as HUB and vice versa) (wr), and (3) the estimated network (est). The signs of the network were estimated empirically. We compared this signed network model to our proposed mixture network model that combined (1) a correct network with an incorrect network, (2) a correct network with an estimated network, and (3) an incorrect network with an estimated network. The results with the AR structure as the true network are shown in Table 1. The tuning parameters were chosen via cross validation with the one standard error rule, and the number of selected parameters was constrained to be at most p/2.
In the results, EN represents the elastic net (glmnet) method, Graph represents network LASSO without l1 adaptive weights, AAG represents our proposed network LASSO with l1 adaptive weights and MixAAG represents our proposed network LASSO with l1 adaptive weights and mixture network. For the Gaussian model, we compared the mean absolute error (MAE), mean squared error (MSE), and Pearson and Spearman correlations. For the logistic model, we compared the area under the receiver operating characteristic curve (AUC) calculated via the method of [21], accuracy (ACC), Matthews correlation coefficient (MCC) and biserial correlation. For the Cox model, we compared the concordance index (C). We reported the means and standard deviations of these metrics across 100 replicates.
From the simulation results, our proposed method with l1 adaptive weights yields better performance compared to the elastic net and network-LASSO without l1 adaptive weights. For both the AR and HUB structures, incorporating the correct network yields significantly better results than the misspecified network, as expected. Moreover, the network mixture approach (i.e., mixing an incorrect network with an estimated data driven network) yields better performance than either a model with a wrong network or the elastic net model.
Cross Validation vs Stability Selection
In practice we constrain the number of selected features to be at most p/2, similar to the default of the R package glmgraph, to speed up computation. However, this constraint may not be desirable if the true number of informative features is greater than p/2. An alternative approach is the stability selection (SS) method described earlier. In this subsection we compared the variable selection accuracy of cross validation without the p/2 constraint and the stability selection method. We reported the MCC of the estimated coefficients, the sensitivity (Sn) for large, medium and small effect sizes, and the specificity (Sp), averaged over 100 replications. The results for the AR structure are shown in Table 2.
From Table 2, the logistic model fitted via cross validation without the p/2 constraint yields lower MCC and Sp compared to stability selection. For the Gaussian and Cox models, cross validation and stability selection have similar performance. The cross validation approach is also more computationally efficient (e.g., for the Gaussian model with 100 samples and 20 features, five-fold cross validation took 0.05 s while stability selection with 100 subsamples took 4.80 s).
Case Study
Platinum Therapy in Ovarian Cancer
We applied our proposed method to the ovarian cancer proteomics dataset from the Cancer Genome Atlas (TCGA) generated by the Johns Hopkins University (JHU) and the Pacific Northwest National Laboratory (PNNL) ([34]). The dataset was downloaded from the National Cancer Institute (NCI) Clinical Proteomic Tumor Analysis Consortium (CPTAC) (https://cptac-data-portal.georgetown.edu/cptacPublic/). Ovarian cancer is one of the most lethal gynecologic malignancies and is difficult to detect early. Most ovarian cancer cases are detected at a late stage and treated with chemotherapy using a platinum compound drug. Unfortunately, platinum therapy is not effective for all patients, as some patients develop resistance to the treatment. To improve ovarian cancer survival, it is important to predict whether a patient will respond to the treatment. In this case study, we used the proteins as candidate features to predict two types of outcome, namely the platinum free interval, a continuous measurement of treatment sensitivity, and the platinum status (sensitive versus resistant, obtained by dichotomizing the platinum free interval: an interval greater than 6 months was marked as sensitive). Our sample consisted of 95 sensitive patients and 34 resistant patients. We used ComBat ([10] and [12]) to adjust for batch effects. Since the missing values in mass spectrometry proteomics data can be attributed to the detection limit ([27]), we imputed the missing values by the minimum value of each protein divided by a constant ([23]). Features with more than 20% missing values were removed from our study. The pre-processed and normalized dataset consisted of 6451 candidate features/proteins for the subsequent analysis. Among the 129 samples, we randomly assigned 92 (71.3%) samples to the training set and the remaining 37 (28.7%) to the test set.
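The missing-data handling can be sketched as below. The divisor applied to the column minimum is a placeholder (the paper scales the minimum by a constant from [23] motivated by the detection-limit mechanism), and the function name is ours.

```python
import numpy as np

def preprocess(P, max_missing=0.2, divisor=2.0):
    """Drop proteins (columns) with more than `max_missing` fraction of
    missing values, then impute the remaining missing entries with each
    protein's observed minimum divided by `divisor` (placeholder constant,
    reflecting the detection-limit rationale)."""
    keep = np.isnan(P).mean(axis=0) <= max_missing
    P = P[:, keep].copy()
    col_min = np.nanmin(P, axis=0)            # per-protein observed minimum
    rows, cols = np.where(np.isnan(P))
    P[rows, cols] = (col_min / divisor)[cols]  # minimum-based imputation
    return P, keep
```

This left-tail imputation is appropriate when missingness is dominated by values below the instrument's detection limit, as argued in [27].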
We pre-screened the features using feature-wise Gaussian and logistic regression on the log platinum free interval and platinum status, respectively, in the training data and retained the candidate features with p-values ≤ 0.15; 1026 and 881 features were retained for the log platinum free interval and platinum status, respectively. To obtain a prior network structure, we downloaded the protein-protein interaction network (protein links within Homo sapiens) from the STRING database ([26]) and combined it with the Laplacian matrix estimated from the training data. The signs and strengths of connectivity of the network were estimated using the methods described in Sections 2.1 and 2.2. For platinum status prediction, the cutoff value that maximized the Youden index in the training data was used to compute the accuracy (ACC), Matthews correlation coefficient (MCC), Youden index (J), sensitivity (Sn) and specificity (Sp) in the test data. The results are shown in Tables 3 and 4.
The predictions of both the log platinum free interval and the platinum status showed that the elastic net (EN) method tended to overfit the data, since the model chose α = 0 (ridge regression) as the optimal value and thus retained all the features. In contrast, our proposed mixture network method selected a smaller number of features and achieved better prediction performance. The results also indicated that the ovarian cancer proteomics dataset has an inherent network structure, and thus our proposed method is suitable for modeling this type of data.
Survival Time in Skin Cutaneous Melanoma
Skin cutaneous melanoma (SKCM) is an aggressive malignancy that arises from uncontrolled melanocytic proliferation. Gene expression has been shown to be a promising biomarker for predicting survival in SKCM ([1], [3] and [18]). We applied our proposed method to the Cancer Genome Atlas (TCGA) SKCM gene expression data generated on the RNA-Seq platform. Gene expression values were summarized using RSEM ([13]) and normalized via the quantile normalization procedure ([16]). Our data consisted of the overall survival times of 436 patients with 217 events. We first pre-screened the candidate features, i.e., genes, by individual Cox regression and retained 864 features with p-values ≤ 0.15. We randomly divided the data into training (305 patients) and test (131 patients) sets. The network structure was based on the melanoma pathway from KEGG, combined with an estimated network in which the signs and strengths were estimated via the methods described in Sections 2.1 and 2.2. We compared the results of our proposed methods to elastic net models. We trained the models and obtained the optimal cutoff value for the log rank test, then used the selected cutoff in the test data to divide the patients into high and low risk groups, and evaluated the predictions via Kaplan-Meier curves and log rank tests. We reported the concordance index (C) in the test data and the number of features selected in the training data. The results are shown in Figure 1.
The results showed that our proposed method with cross validation performed the best (highest C index and smallest number of retained features). Our proposed method with stability selection had performance comparable to the elastic net method with cross validation.
Conclusion
Incorporating network structure in the prediction model has been shown to be important in high dimensional genomics studies for accurate feature selection and model interpretability. In this paper we proposed a mixture network regularized generalized linear model which allows us to optimally combine a prior network and a data driven network. This is particularly useful in the scenarios in which the prior network is not known with certainty. Our model safeguards against incorporating an incorrect prior network by allowing an optimally mixed network structure in the model.
Our simulation studies showed that the proposed l1 adaptive method yields higher prediction and feature selection accuracies across different scenarios. We also found that cross validation may not be the best approach for feature selection in high dimensional data, especially for binary classification. An alternative strategy is the stability selection method, which was shown to yield better performance than cross validation in such scenarios, though at a much higher computational cost. Based on our simulation results, we suggest using the stability selection method for parameter tuning in binary classification problems, whereas cross validation is often sufficient for Gaussian and Cox outcomes.
An interesting direction for future work is to replace the l1 penalty with a group LASSO penalty to allow for group-wise instead of feature-wise selection. However, the challenge would be to ensure that the group structure imposed by the group LASSO penalty is consistent with the group structure of the data driven network. One possibility is to define the group LASSO penalty after obtaining the network mixture within an iterative framework. Another future research direction is to apply the AAG method to other exponential family models, for example Poisson and negative binomial regression for count data outcomes. Our method is implemented in the R package glmaag, available on the Comprehensive R Archive Network (CRAN).
Acknowledgments
This work was supported in part by the CDC/NIOSH award U01 OH011478.