Abstract
Various feature selection algorithms have been proposed to identify cancer prognostic biomarkers. In recent years, however, their reproducibility is criticized. The performance of feature selection algorithms is shown to be affected by the datasets, underlying networks and evaluation metrics. One of the causes is the curse of dimensionality, which makes it hard to select the features that generalize well on independent data. Even the integration of biological networks does not mitigate this issue because the networks are large and many of their components are not relevant for the phenotype of interest. With the availability of multi-omics data, integrative approaches are being developed to build more robust predictive models. In this scenario, the higher data dimensions create greater challenges.
We proposed a phenotype relevant network-based feature selection (PRNFS) framework and demonstrated its advantages in lung cancer prognosis prediction. We constructed cancer prognosis relevant networks based on epithelial mesenchymal transition (EMT) and integrated them with different types of omics data for feature selection. With less than 2.5% of the total dimensionality, we obtained EMT prognostic signatures that achieved remarkable prediction performance (average AUC values >0.8), very significant sample stratifications, and meaningful biological interpretations. In addition to finding EMT signatures from different omics data levels, we combined these single-omics signatures into multi-omics signatures, which improved sample stratifications significantly. Both single- and multi-omics EMT signatures were tested on independent multi-omics lung cancer datasets and significant sample stratifications were obtained.
Introduction
Prognosis prediction is necessary for cancer clinical decision making. Traditionally, cancer prognosis prediction is based on clinical variables such as tumor stage, age, and disease history, where the information of a patient is compared against population caner registries [1]. However, these clinical parameters are insufficient to accurately predict the risk of patients [2] as histologically similar tumors can be of completely different diseases at the molecular level [3, 4]. Therefore, molecular signatures are needed to give more accurate prognosis predictions. Nowadays we can obtain tumor molecular profiles in greater details. As one of the large scale projects, the Cancer Genome Atlas (TCGA) provides access to genomic, transcriptomic, epigenomic, and proteomic data from more than 11,000 cases in 33 cancer types and subtypes [5]. Using these data, researchers aim to build better prognosis prediction models. This has been found as very challenging due to the high dimensionality of omics data, where the number of features far exceeds the number of samples. This is often addressed as the curse of dimensionality. When the data lie in high dimensions, the samples become very sparse. This can cause the lack of statistical significance and over-fitting of machine learning models. Fortunately, not all features are relevant for predicting the phenotype of interest. It is desired to find the molecular signatures that capture the footprint of the phenotype so that the signatures can be employed on unseen samples.
Various feature selection methods were proposed to find molecular signatures. Early reviews categorized them into three categories: filter, wrapper, and embedded methods [6, 7]. Although many important algorithms were introduced, network-based feature selection algorithms were not included. After reviewing the literature, we found three main categories of network-based feature selection algorithms. The first category involves network-guided search. An algorithm identifies subnetworks that can best differentiate different phenotype groups. Each subnetwork is aggregated to produce one feature (called metagene) and eventually the metagenes are used as features for training predictive models. Different scoring functions were proposed to rank the subnetworks. For example, [8] used mutual information as the scoring function and the addition operator to aggregate subnetworks. [9, 10] used the p-value of Cox PH model in defining the scoring functions. [11] dichotomized features and defined a scoring function based on information theory. [12] tested the effects of different aggregation operators on the prediction performance.
The second category of methods uses network-based regularization. Regularization methods such as Lasso have been widely applied for feature selection. To integrate network information, the penalty term takes into account the network connectivity. Adjacency matrix A and Laplacian matrix L are frequently used to represent a network G to be included in the penalty term. The majority of methods in this category are based on linear classifiers and can be written in the following form:
LFor example, in graph Lasso penalty(w, G) = λ║w║1 + (1 – λ)Ʃi,j Ai,j(wi − wj). This forces adjacent nodes to have similar weights [13, 14]. Using similar formulations, [15] proposed a network-constrained regularization and feature selection method on genomic data. [16] added a network regularization term to the log-likelihood function of the Cox proportional hazard model. [17] developed a network-constrained support vector machine algorithm, where the network-based regularization term is added to the objective function of SVM.
The third category of methods involves iterative updates of node importance scores. Frequently used algorithms include network propagation and random walk. [18, 19] adapted Google’s PageRank algorithm to rank genes in a network. Genes are assigned initial ranks r[0] ∈ ℝN. Then the rank of each gene is updated iteratively depending on the ranks of genes that are linked to it. For gene j, its rank from or is updated as: where degi is the degree of the ith gene and d is a fixed parameter. By iterating until convergence a gene will be highly ranked if it is linked to other highly ranked genes. [20] used random walk kernel to smooth gene-wise t-statistics over the network. This is achieved by assigning each node an initial score based on t-test and then multiplying it with the random walk kernel. The p-step random walk kernel is used as a similarity measure to capture the relatedness of two nodes in the network. It is defined as: where Lnorm is the normalized graph Laplacian matrix, α is a constant, and p is the number of random walk steps. The network-smoothed t-statistic is used to measure node importance. Similarly, random walk-based scoring of network components is applied in [21] to prioritize functional networks.
Besides algorithmic difference, various biological networks have been employed in network-based feature selection algorithms. A list of these molecular interactions is given in Table 1. In these studies, it was shown that superior features could be selected by the integration of networks. However, recent studies showed that network-based feature selection methods did not significantly improve prediction performance, but mainly contributed to the biological interpretations of the signatures [22–24]. [22] compared 14 feature selection methods, 8 of which integrated network information and 6 of which did not, on 6 breast cancer datasets with respect to their prediction accuracy, signature stability and biological interpretations. The results showed that network-based features in most cases could not improve prediction accuracy significantly. [23, 24] showed that when a correction of feature set size was performed, the stability of network-based features was not higher than single features.
Regardless of whether network information is integrated, finding robust molecular signatures is a challenging task. Due to the high data dimensions, it is easy to find a feature subset that fits the training data very well but hard to have good generalization. Studies showed that there was hardly considerable overlap among biomarkers identified in different studies for the same disease [33–35]. Even taking random feature sets gave comparable prediction performance [36]. The existence of many feature subsets that perform similarly well on the training set makes it difficult to identify the true signatures. Note that the randomness of signatures is also observed when network information is integrated. [23] tested different network-based feature selection algorithms on six breast cancer datasets in prognosis prediction. They showed that the randomization of network structure, which destroyed biological information, did not deteriorate the prediction performance of the selected features. [24] extended the experiments in [23] by comparing more prognosis signatures. In the end, similar results were observed.
We suppose that the main reason for these counter-intuitive results is the curse of dimensionality, where selecting molecular signatures is hard given the limited amount of samples. In principle, molecular signatures should give better predictions than random features, because it is shown in biological research that certain genes are supposed to be more important than the others in cancer progression. If we use this information to constrain the feature space and guide feature selection, we could potentially obtain more robust biomarkers. State-of-the-art studies have not utilized this knowledge but considered the whole feature space and the entire biological network. Because both the data and the network are large, the irrelevant information may overwhelm the signals. Furthermore, biological networks were typically integrated with one type of omics data. It would be very interesting to investigate how the prediction performance differs when the networks are integrated with different omics data types, and additionally what are the relationships among the features selected from different omics data.
To address this issue, we proposed a phenotype relevant network-based feature selection (PRNFS) framework. It consists of constructing a phenotype relevant gene regulatory network (GRN) and selecting features from this network. We demonstrated the superiority of this framework with the application of lung adenocarcinoma (LUAD) prognosis prediction. We constructed a GRN for EMT, which has been demonstrated as highly relevant to cancer metastasis and prognosis. On this network 4 types of omics data (mRNA-Seq, miRNA-Seq, DNA methylation, and copy number alteration data) were integrated and 10 feature selection algorithms were employed. We obtained both single- and multi-omics EMT prognostic signatures, evaluated their prediction performance, analyzed the biological interpretations, and performed survival analysis. Furthermore, these signatures were tested on independent multi-omics LUAD data. We showed that EMT prognostic signatures achieved remarkable prediction performance on TCGA data. On independent data, both single- and multi-omics signatures stratified patients into significantly different prognostic groups. Multi-omics signatures were shown to be more robust than single-omics signatures.
Materials and methods
We will first describe the construction of EMT networks. This is followed by the introduction of 10 feature selection algorithms. Then we explain the details of the experiments.
EMT gene regulatory networks
As an up-to-date EMT GRN is not readily available, we constructed the network by literature review. The network we constructed has incorporated key transcription factors, miRNAs, their regulations and interactions with EMT hallmark molecules. Multiple levels of gene regulations such as transcriptional, translational, and post-translational regulations were covered. The reference for each component in the network can be found in [37]. Since this network covers mainly driver genes, we named it as the core network.
As it is observed, driver genes are often less differentially expressed than the genes they regulate [35]. If one includes only the driver genes for identifying molecular signatures, one may have captured only partial information. We therefore extended this network by including the molecules that directly interact with or being regulated by the molecules in the core network. NetworkAnalyst tool [38] was employed to find these interactions, which consist of protein-protein interactions, miRNA-gene interactions, and transcription factor-gene interactions. The resulting network was named extended network. After constructing this network, we noticed that many features have a rather low variance among samples, we thus removed these features and obtained the filtered network. All three networks were employed in our experiments. The three networks contain 74, 123 and 455 nodes respectively. Details of the networks can be found in [37].
Experiments
We first obtained RNA-Seq, miRNA-Seq, DNA methylation, and CNA data of LUAD from FIREHOSE Broad Genome Data Analysis Center website. mRNA-Seq and miRNA-Seq data were combined because they both measure the abundance of transcripts. This resulted in 3 data levels: gene expression, DNA methylation, and CNA data. These three data levels will be abbreviated as GE, DM, and CNA in the remaining text. Each data level was normalized feature-wise by subtracting the mean and dividing by the standard deviation. More details of data pre-processing can be found in [37, 39].
Since we have obtained 3 EMT networks and 3 data levels, feature selection can be performed on each combination of network and data level. To evaluate whether EMT-based feature selection can give more robust molecular signatures for prognosis prediction, we employed 10 representative features selection algorithms to identify signatures from EMT genes and EMT networks. Table 2 gives an overview of these algorithms. Five of these algorithms integrate network information and the other five algorithms use only omics data. The underlying methodologies are very different. We suppose that if EMT network is superior for selecting prognostic signatures, the performance of the selected features from the majority of these algorithms should show improvements. As mentioned before, state-of-the-art studies usually use only gene expression data for feature selection. We instead incorporated three different omics data levels. This gives us the possibility to compare and integrate the signatures from different data levels.
Note that even the largest EMT network (the extended network) covers only 2.3% of the original data dimensions. To assess the performance of EMT-based feature selection, we compared the prediction performance of EMT signatures with the features selected out of all features from the corresponding data levels. Additionally, random networks of the same size and structure as EMT networks were generated - the nodes in the random networks were randomly chosen from all the features of the corresponding data level.
The features selected from these random networks were compared with EMT signatures. In total, we compared the prediction performance of EMT signatures with the following 4 groups:
Random features. These features were selected from random networks using the same feature selection algorithms. 150 random networks were generated for feature selection.
All EMT features. We included all features in the EMT networks without feature selection.
All features from the corresponding data levels. This corresponds to 19,290 GE features, 20,074 DM features, or 21,456 CNA features.
Features selected from all data level features by applying Lasso algorithm.
We performed the comparison by selecting features with the training set, using these features to train an SVM classifier, and classifying samples on the cross-validation set. Patients who survived more than 1400 days belong to the good prognosis group and patients who survived less than 700 days belong to the poor prognosis group. The results from 30 times stratified 10-fold cross-validation were averaged. Within each data level the same cross-validation folds were used for all the feature selection algorithms on all the comparative groups. The classification performance was evaluated using three metrics: ROC-AUC, ROC-PR and accuracy.
We chose relatively stringent thresholds for feature selection, this is to reveal more difference than similarities between the two patient groups. We argue that it becomes harder to find the signatures if the two groups have more similar samples in terms of the phenotype. For example, if one uses a single threshold of 3 years, we assume that the molecular profiles of patients who survived a bit longer than 3 years may be very similar to patients who survived a bit shorter than 3 years. In this case, it is challenging to find the signatures that can capture the most important difference between the two groups, given the limited amount of samples and their heterogeneity. However, we did not omit the influence of thresholds. We tested the performance of all feature selection algorithms with four different thresholds, in the order of increasing discrepancy: 3 years, <900 or >1200 days, <700 or >1400 days, <500 or >1500 days.
Besides evaluating the classification performance, survival analysis was performed using the selected features on censored data. The data have much more samples that could not be included in classification.We think that if the selected features are good signatures, they should be able to stratify the patients into significantly different survival groups. We performed survival analysis on both all-stage and early-stage patients. The sample sizes for classification and for survival analysis (all stage patients) are given in Table 3.
Last but not least, we analyzed the biological interpretations of EMT signatures. Instead of performing gene set enrichment analysis, which could give very significant results due to the biological context of the EMT networks, we employed association rule mining approach to infer prognostic association rules. The rules have the advantage to directly associate the states of the features to the phenotype of interest. We inferred rules using EMT signatures from individual data levels and also from their different combinations. Our motivation is to understand whether features from different data levels complement each other and jointly contribute to patient prognosis.
We were able to show that EMT signatures from different data levels complement each other in prognostic rules. This inspired us to obtain multi-omics EMT signatures by combining the signatures on individual data levels (single-omics signatures). Both single- and multi-omics EMT signatures were evaluated on TCGA data and independent LUAD multi-omics data using survival analysis. All the data and code for analysis are available at https://github.com/BorongShao/EMT prognosis-master.
Results
EMT signatures outperformed comparative groups
First, we show that regardless of the employed feature selection algorithms and evaluation metrics, EMT-based feature selection always outperforms feature selection on random networks. Fig 1 shows the distributions of AUC, AUPR, and accuracy values of EMT signatures and random ones, where DM data and core EMT network were used. S1 Fig shows the same comparative groups using GE data with filtered EMT network. In both cases, the advantages of EMT signatures are very apparent.
Next, we give the average AUC values of EMT signatures on all three data levels in Table 4. The boxplot is given in S2 Fig. These results show that features selected from GE and DM data obtained better prediction performance than features selected from CNA data. Depending on the data levels and network sizes, we find it hard to identify the best-performing feature selection algorithm. In the last two lines of the table we give the results of comparative groups 3 and 4. This shows that EMT signatures in many cases outperformed features selected from all data level features. For example, with Lasso feature selection algorithm, which was applied in both EMT feature space and in the whole feature space, EMT signatures gave better predictions in more than half of the cases. This indicates that selecting prognostic signatures from a much smaller phenotype relevant network is a feasible approach. We also evaluated the performance of the 10 feature selection algorithms with different classification thresholds. Both SVM and random forest classifiers are employed. The results are given in S3 Fig. It shows that regardless of feature selection algorithms, using more discrepant thresholds tends to obtain higher AUC values. Meanwhile, a few algorithms such as addDA2, RegCox, and Survnet are more sensitive to the effect of thresholds than the other algorithms.
Frequently selected features further improves predictions
Although EMT signatures were shown to be significantly predictive in the experiments above, we observed high variance in the AUC values from individual cross-validation tests (shown in S2 Fig). Some partitions of data into training and cross-validation sets led to good predictions and some led to poor predictions. Even on the small EMT feature space, this phenomenon is already frequently observed. This suggests that selecting molecular signatures based on single cross-validation test or single sample division into training and testing set is highly unreliable.
We think that sample heterogeneity contributed to the high variance in prediction performance. Thus, we addressed this issue by employing the frequently selected features (FSFs) from all 30 times 10-fold cross-validation feature selection. Instead of using 20 features selected from each training set, we used the top 20 FSFs and tested their performance using the same evaluation approach. DM data and the extended EMT network were employed for the test, as this combination was shown in Table 4 and Figure 4 to give above-average prediction performance. We compared the prediction performance of FSFs with that of individually selected features. The results are given in Fig 2. The density plots and results of statistical tests are given in S4 Fig.
We observed that FSFs significantly outperformed individually selected features, except for Netrank algorithm. The average AUC values of t-test, Lasso, NetLasso, and addDA2 feature selection algorithms were 0.773, 0.825, 0.796, and 0.833, respectively. It shows that using FSFs can mitigate the effect of sample heterogeneity. Recall that we used only <2.5% of the original dimensionality, namely EMT features, for feature selection and prognosis prediction. The remarkable results are consistent with biological knowledge that EMT process is highly relevant to cancer prognosis [46–50].
Biological interpretations
After identifying EMT FSFs, we further investigated their biological interpretations, especially the relationships among FSFs from different omics data levels. We employed association rule mining approach [51]. It is originally defined as the following [52]:
Let I = {i1, i2, …, in} be a set of n binary features called items. Let D = {t1, t2, …, tm} be a set of transactions called the database. A rule is defined in the form: X ⇒Y, where X ⇒ Y, ⊆ I. The itemsets X and Y are called left-hand-side (LHS) and right-hand-side (RHS). In order to select interesting rules from the set of all possible rules, constraints on various measures of significance and interest are applied. Let a rule X ⇒ Y be identified on a set of transactions T. Commonly used constraints are given below:
Support. It indicates how frequently the itemset appears in T.
Confidence. It indicates how often a rule has been found to be true.
Lift. It indicates the degree to which X and Y depend on each other.
We binarized the EMT features using their means and applied Apriori algorithm [53] to derive rules, with the constraints of confidence ≥ 0.8 and support ≥ 0.1. The algorithm was implemented in the textitarules R package [51]. Since we are trying to find molecular patterns for predicting prognosis, we set the RHS of the rules to be the class labels of prognosis.
The resulting rules show sound biological interpretations according to established findings in cancer research. Here we interpret two rules identified from the core EMT network:
{LOXL2GE = high, TGFB1GE = high, miR − 34aGE = low} ⇒ {prognosis = poor}, with support = 0.135, confidence = 1, lift = 2.046. This rule applies to all samples that have these 3 gene expression conditions. Biologically, it has been shown that LOXL2 can stabilize SNAI1. TGFB1 can phosphorylate SMAD2 and SMAD3, which interact with SMAD4 to activate HMGA2, which then activates SNAI1. When LOXL2 and TGFB1 are highly expressed, it not only induces SNAI1 gene expression but also stabilizes SNAI1 protein. miR-34a has the role of repressing SNAI1. When miR-34a has low gene expression, SNAI1 is less repressed. Taken together, these three conditions point to the direction of the high expression of SNAI1 - a master transcription factor to induce EMT. This contributes to poor prognosis. In contrast, another rule which has an opposite LOXL2 state indicates good prognosis:
{LOXL2GE = low, ETS1GE = low, LOXL2DM = high}⇒ prognosis = good, with support = 0.105, confidence = 1, lift = 1.956. In this scenario, LOXL2 has high DNA methylation level and low gene expression level, and thus not able to stabilize SNAI1. ETS1 gene is known to increase the expression of ZEB1 which induces EMT. In this rule ETS1 has low expression so it does not contribute to inducing EMT. These factors can contribute to good prognosis. S1 Table contains more examples.
From single- to multi-omics signatures
The FSFs above were obtained alternatively from single data levels. Therefore, we name them as single-omics signatures. To investigate whether molecular signatures incorporating multiple data levels can be superior, we combined single-omics signatures into multi-omics signatures and compared their capabilities in stratifying samples into different prognostic groups. Using these signatures, we clustered the samples into 3 groups with both k-means and spectral clustering algorithms. Survival analysis was performed on the resulting cluster by estimating Kaplan-Meier survival curves and conducting log-rank tests. The test results based on k-means algorithm are given in Table 5, where the columns show different data level combinations and the rows correspond to feature selection algorithms. The comparative group of using all EMT features without feature selection is included. The test results based on spectral clustering algorithm are given in S2 Table. In both tables we observe that multi-omics signatures improve sample stratifications significantly. An example is visualized in S5 Fig, S6 Fig, and S7 Fig.
Next, we performed survival analysis on early stage patients. The results are given in Table 6. It shows that EMT-based signatures can still stratify the patients into significantly different prognostic groups.
Last but not least, we tested the performance of two integrative clustering algorithms: SNF [54] and iCluster [55] with multi-omics EMT signatures. Briefly, SNF algorithm constructs sample similarity networks using individual data levels and then fuses these networks into a single similarity network, where spectral clustering is used to decide sample clusters. iCluster employs joint latent variable model to connect different data levels. The latent component is used to determine sample clusters. Based on the clustering results we performed survival analysis. The results of log-rank tests are given in S3 Table for SNF algorithm and in S4 Table for iCluster algorithm. We observed that neither SNF nor iCluster algorithm yielded better sample stratifications than using k-means algorithm (Table 5).
Test results on independent data
We obtained the test data from [56] including 164 samples with DM data. 121 of these samples have also mRNA expression data (microarray) available. The patient follow up time ranges between 2 and 99 months with the median of 44 months. The outcome (event) is defined as the occurrence of relapse, distant metastasis or death. The time to event is calculated from the date of surgery. Detailed experimental procedures and the processing of raw data are provided in [56]. EMT single-and multi-omics signatures consisting of GE and DM data levels were evaluated on the test data using survival analysis. EMT signatures were extracted from the test data without any additional training or modifications. Hierarchical clustering, instead of k-means was employed in order to compare our results with the original study [56].
We have tested the EMT signatures selected by each feature selection algorithm [37]. It is show that single-omics signatures can already stratify the samples into significantly different prognostic groups. An example is given in Fig 3. Multi-omics signatures often yielded better sample stratifications. Fig 4 shows an example where the multi-omics signature from a feature selection algorithm can significantly stratify the samples while the single-omics signatures cannot. Compared with the survival analysis results in the original study [56], we achieved more significant sample stratifications with EMT signatures.
Discussion
Various feature selection algorithms have been proposed to identify biomarkers from Omics data for predicting the phenotype of interest. Although more and more information such as biological networks and multiple types of omics data have been integrated in feature selection, recent studies show the low reproducibility of molecular signatures [22, 24, 35]. Some accredit this to the existence of a large number of genes that are correlated with the target labels [12]. Given the limited amount of samples, it becomes very hard to differentiate the marker genes and irrelevant genes. We addressed this issue by constructing a phenotype relevant gene regulatory network, integrating multiple types of omics data with the network to select molecular signatures. We have shown that with lung cancer prognosis prediction, EMT signatures selected from only 2.5% of the original feature space outperformed the classical feature selection on the whole feature space. To the best of our knowledge, we for the first time constructed a phenotype-relevant GRN for lung cancer prognosis prediction.
Previously we employed EMT networks for selecting lung cancer prognostic signatures [39, 57]. However, [57] used mRNA expression and miRNA expression data only. [39] employed three data levels for feature selection but obtained no significant improvement in predictions. In this study, we extended the EMT network to incorporate its interacting molecules. Besides, we reviewed the network used in [39] and removed the edges which denote associations rather than direct gene regulations. What also distinguishes this study from our previous work is the employment of 10 representative feature selection algorithms, instead of decomposing the network into network motifs [39, 57]. We have selected EMT signatures on three data levels with different network sizes, compared with the features selected from the whole data dimensions, and derived prognostic rules from EMT signatures. Furthermore, we obtained multi-omics signatures and showed their superior prediction performance over single-omics signatures. This shows that signatures from multiple omics data types can complement each other to better distinguish different phenotypes.
The potential of EMT molecules in prognosis prediction has also been studied before.[61] and [62] performed survival analysis using individual EMT hallmark molecules such as E-cadherin and vimentin and showed that none of these molecules could separate LUAD or bladder cancer patients into significantly different prognostic groups. Note that these conclusions were drawn from mainly univariate analysis. Since the molecules jointly contribute to the phenotype, it could be more helpful to use a set of features. This can be seen also from the prognostic association rules derived from EMT signatures, where EMT molecules are jointly associated with the phenotype. All in all, we successfully demonstrated that EMT network-based feature selection and data integration can provide advantages in selecting cancer prognostic signatures.
Supporting information
Footnotes
* borong.shao{at}biontech.de