Use of relevancy and complementary information for discriminatory gene selection from high-dimensional cancer data

With the advent of high-throughput technologies, the life sciences are generating a huge amount of biomolecular data. Global gene expression profiles provide a snapshot of all the genes that are transcribed, or not, in a cell or tissue at a particular moment under a particular condition. The high dimensionality of such gene expression data (a very large number of features/genes measured in a comparatively small number of samples) makes it difficult to identify de novo the key genes (biomarkers) that truly and significantly contribute to a particular phenotype or condition, such as cancer or disease. As the number of genes grows, simple feature selection methods perform poorly both at selecting effective, informative features and at capturing biological information. Addressing these issues, we propose a Mutual information based Gene Selection method (MGS) for selecting informative genes, and two ranking methods based on frequency (MGSf) and Random Forest (MGSrf) for ranking the selected genes. We tested our methods on four real gene expression datasets derived from different studies on cancerous and normal samples. Our methods obtained better classification rates on these datasets than recently reported methods. Our methods could also detect the key relevant pathways with a causal relationship to the phenotype.


Genes are the physical and functional units of hereditary genetic information. The activity and/or expression level of a gene affects the synthesis of downstream proteins that dictate the functionality of a cell. Therefore, the properties as well as the expression levels of a particular set of genes are responsible for a particular phenotype such as disease or tissue morphology. Genes that are able to differentiate between different states of cells (such as normal vs. diseased, quiescent vs. proliferating, adult vs. stem cells, etc.) are called informative genes or biomarkers (a measurable indicator of a biological state or condition), and they are valuable for diagnosis, drug development, etc. Especially for different cancers, these informative genes may be invaluable for the improvement of diagnosis, prognosis, and treatment. Usually, studies generating cancer specific gene expression profiles comprise a small number of control and patient samples in comparison to the tens of thousands of genes measured in each sample (high dimensionality of the data), where only a few genes are responsible for a disease. From such a large set of genes, identifying a subset that is differentially expressed in cancerous cells compared to normal ones is a challenging task and is considered NP-hard or NP-complete [1]. Therefore, feature/gene selection methods can be a useful way to identify a subset of genes relevant to a particular cancer for better diagnosis and treatment. In this paper, we use the terms "gene" and "feature" interchangeably.

In bioinformatics, several gene selection methods have been proposed, particularly for cancer data classification [2][3][4]. "Wrapper" and "filter" are two popular categories of feature selection methods [5]: wrapper methods are classifier dependent, while filter methods are classifier independent and their performance mainly depends on the selection of a criterion. Wrapper based methods select the most discriminant subset of features by minimizing the prediction error of a particular classifier [6]. Support Vector Machine based on Recursive Feature Elimination (SVM-RFE) [2] is considered to be one of the best performing wrapper methods. It ranks the genes using SVM and selects the important genes in combination with the recursive feature elimination strategy. Different variants of SVM-RFE have also been proposed [7,8]. Although wrapper based feature selection methods provide better performance, they become computationally expensive as the feature size grows. Moreover, these methods are classifier dependent and may not provide the optimal solution for other classifiers [9]. For example, the result of a wrapper method using SVM may differ from the result of using a random forest (RF). To address this problem, a hybrid filter-wrapper method, Information Guided Interactive Search (IGIS) [10], was proposed to select the best genes based on mutual information (MI), with the genes ranked using joint mutual information. However, this method selected more genes than other wrapper or hybrid algorithms. To overcome the limitations of IGIS, the improved Interaction information-Guided Incremental Selection (IGIS+) [11] was proposed, in which the first gene is selected based on the highest accuracy and Cohen's d test is used to decide whether to add a new gene to the selected gene set. One major limitation is that it uses several handcrafted thresholds for Cohen's d effect size.

Compared to wrapper methods, filter based methods are more popular because they can assess the properties of features without depending on any particular classifier. Filter methods select a subset of features based on criteria that can be evaluated on the dataset itself, such as the correlation coefficient [12], t-statistics [13], and mutual information [14,15]. Among these, MI based methods are the most popular for feature selection due to their strong theoretical background and ability to capture non-linear dependencies between features. MI based methods have also been applied to gene expression data analysis [16]. One of the earliest works used Minimum Redundancy Maximum Relevance (MRMR) [3]. In this method, the authors incrementally select the gene that holds the highest discriminatory power (relevancy) with respect to the target class (control/cancer) and the lowest dependency (redundancy) with the other selected genes. Relevance with the class variable is calculated using the MI between a gene and the class variable or F-statistics, while redundancy among the genes is calculated using pairwise MI or the Pearson correlation coefficient for discrete and continuous data, respectively. MRMR based methods have also been adopted for gene selection using temporal gene expression data [17].

Feature selection methods that use the aforementioned criteria are often posed as an optimization problem where the goal is to select the feature subset that optimizes a cost function. Generally, this cost function is constructed using one of the above criteria.
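For illustration, the following is a minimal sketch of the incremental MRMR step described above, assuming discrete (quantized) expression values; the function name and the relevancy-minus-redundancy form of the score are our assumptions for this example, not the exact formulation of [3].

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_next_gene(X, y, selected, candidates):
    """Return the candidate gene maximizing relevancy minus mean redundancy.

    X: (samples, genes) array of discretized expression values.
    y: class labels (e.g., control/cancer).
    """
    best_gene, best_score = None, -np.inf
    for g in candidates:
        relevancy = mutual_info_score(y, X[:, g])          # I(g; C)
        redundancy = (np.mean([mutual_info_score(X[:, g], X[:, s])
                               for s in selected])
                      if selected else 0.0)                # mean pairwise MI
        if relevancy - redundancy > best_score:
            best_gene, best_score = g, relevancy - redundancy
    return best_gene
```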

Apart from the aforementioned wrapper and hybrid methods, there exist a few popular evolutionary or bio-inspired algorithms to search for the optimal set of features. Almugren et al. [18] provide an extensive review of the bio-inspired wrapper and hybrid methods. Alshamlan et al. proposed a hybrid artificial bee colony as well as a genetic bee colony optimization method that uses the mRMR criterion [19,20]. El Akadi et al. [21] proposed a genetic algorithm based on the mRMR criterion. In these methods, the mRMR criterion is used to filter noisy and redundant genes in the high-dimensional microarray data, and then the bio-inspired algorithm uses classifier accuracy as a fitness function to select the most discriminating genes. Most bio-inspired algorithms are local searches with random restarts or population based methods; however, they can still get stuck at a local optimum. In order to solve the optimization problem globally, several selection methods have been attempted [22]. These methods incorporate parallel search strategies based on semi-definite programming (SDP) or quadratic programming that can find the feature subset in polynomial time [23].

Recently, deep learning based methods have provided better accuracy in different classification problems such as image or audio classification [24]. Deep learning based architectures have also been proposed for classification problems using gene expression data [4,25]. One of the most recent works based on deep learning was proposed by Ding and Peng [4]. The authors developed a new model, the Forest Deep Neural Network (fDNN), that combines a deep neural network (DNN) with a random forest (RF) to solve the problem of learning from small sample data with a large number of genes. The RF is used to reduce the dimensionality of these datasets by detecting the important genes in a supervised manner [26]. This new feature representation is then fed into the DNN to predict the outcomes. However, this method does not make use of the main advantage of deep learning in classification problems, which is automatic feature extraction. On the other hand, using a neural network as a black box to extract new features from gene expression data also reduces the interpretability of the classifier, which is important in studies such as cancer classification.

Since MI based filter methods do not extract new features and are thus more interpretable, in parallel to the developments in deep learning there has been considerable effort to better approximate MI measures such as relevancy and redundancy. New information theoretic measures have been proposed, such as complementary information: the additional information that a gene has about the class which is not found in the already selected subset of genes [15,27]. These methods attempt to estimate the joint mutual information of a feature subset with the class. In mDSM [15], the authors showed that when calculating MI from finite samples, there exists some error (bias) in all three terms, namely relevancy, redundancy, and complementary information. Moreover, for selecting a feature, they proposed to use χ² statistics by showing that all three terms follow a χ² distribution. However, even though the redundancy term has some desirable characteristics, incorporating it when analyzing gene expression data may cause informative genes to be discarded [11]. Another issue with gene selection for cancer classification, in contrast to traditional feature selection in machine learning, is that the set of selected genes should be biologically relevant to the disease under study. Although filter methods can produce a subset of genes that may be highly accurate in classifying cancer, the literature on filter methods rarely discusses the biological relevance of the selected genes [5]. Addressing these issues, in this paper we propose an MI based gene selection method and two ranking methods that reduce the number of genes that are used in cancer classification. Finally, the proposed methods select the relevant genes associated with a particular type of cancer.

Materials and methods

In this section, we first describe the datasets used in our experiments and then discuss the proposed gene selection method. It selects a set of candidate genes and then ranks them based on one of the two proposed ranking criteria. Finally, using the selected top η genes, classification and biological interpretation are performed.

Dataset description

We used four different gene expression datasets (Table 1), including GDS3341 [28] and GDS3610 [29], both derived from nasopharyngeal carcinoma tissues, which served as built-in controls in the study to assess the coherence and performance of our proposed methods. Expression data of multiple probes for the same gene were merged. All these datasets contained far fewer samples than genes (Table 1).

Proposed method

In this paper, we propose an MI based Gene Selection (MGS) method for the selection of an informative gene subset that provides both better classification accuracy and biological relevance. We first select a candidate gene subset (Fig 1A) and then use the top η genes from that subset for classification (Fig 1B). The following subsections describe our method in further detail.

For the identification of a gene subset, we propose to use a filter based gene selection method that approximates the joint MI with respect to the class variable. In order to identify an informative gene subset, we first subdivide the given gene expression data into training folds via leave-one-out cross-validation (LOOCV) and select a candidate gene subset from each fold.

To measure how much information a particular gene's expression provides for the identification of cancer data, we calculate the MI between the expression values of a gene g_i and the class variable C. This MI represents the relevancy of the gene, revealing the degree of importance of that gene in cancer data classification. Note that, before calculating the MI, the gene expression data is quantized, which is necessary for noise reduction and data simplification and thus helps maximize the relevancy of a gene to the target class C. The relevance between g_i and C is calculated using Eq. 1:

I(g_i; C) = Σ_{x∈g_i} Σ_{c∈C} p(x, c) log [ p(x, c) / (p(x) p(c)) ].   (1)

Each gene's relevancy is then tested for statistical significance with a χ² test. This test helps to determine whether the gene is significantly relevant, and it can be applied because it can be shown that the (suitably scaled) relevancy follows a χ² distribution with (I − 1)(K − 1) degrees of freedom. Here, I, K, and N represent the number of quantization levels of gene g_i, the total number of classes in C, and the total number of samples, respectively. Genes that do not satisfy the χ² critical value are discarded, as they are considered unrelated to C. All the genes retained through this process are then ranked in descending order of relevancy, and a selection procedure is applied to this ranking to obtain a subset of informative genes.
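The following is a minimal sketch of this relevancy filter, relying on the standard result that 2N·I(g_i; C) (with MI in nats) is asymptotically χ²-distributed with (I − 1)(K − 1) degrees of freedom; the quantile-based quantization, the number of levels, and the significance level are illustrative assumptions, not values stated in the text.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.metrics import mutual_info_score

def relevancy_filter(X, y, levels=3, alpha=0.05):
    """Quantize each gene, compute its relevancy I(g; C), and keep genes
    whose relevancy passes the chi-squared significance test (Eq. 1)."""
    N, n_genes = X.shape
    K = len(np.unique(y))
    df = (levels - 1) * (K - 1)
    critical = chi2.ppf(1 - alpha, df)
    kept = []
    for g in range(n_genes):
        # Quantize expression values into `levels` bins by interior quantiles.
        edges = np.quantile(X[:, g], np.linspace(0, 1, levels + 1)[1:-1])
        xq = np.digitize(X[:, g], edges)
        mi = mutual_info_score(xq, y)      # relevancy, in nats
        if 2 * N * mi > critical:          # G-statistic vs. chi-squared cutoff
            kept.append((g, mi))
    # Rank the surviving genes by relevancy, descending.
    return sorted(kept, key=lambda t: -t[1])
```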
As the top ranked gene is considered to be the most important, we first include it in the candidate gene subset G_SC. The next ranked gene is then evaluated for selection based on its score, calculated using Eq. 2:

Score(g_i) = I(g_i; C) + I(g_i; C | G_SC);   (2)

here, along with the relevancy, the complementary information I(g_i; C | G_SC), i.e., the additional information that g_i provides about C beyond the already selected genes in G_SC, is taken into account.
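A hedged sketch of this scoring step, assuming the score is relevancy plus complementary information and approximating I(g_i; C | G_SC) by the average of the pairwise conditional terms I(g_i; C | g_s) over the already selected genes g_s; this pairwise averaging is a common approximation, and the paper's exact estimator may differ.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def cond_mi(x, c, z):
    """I(x; C | z) for discrete arrays, in nats."""
    total, n = 0.0, len(z)
    for v in np.unique(z):
        m = (z == v)
        total += (m.sum() / n) * mutual_info_score(x[m], c[m])
    return total

def mgs_score(Xq, y, g, selected):
    """Eq. 2 style score: relevancy plus (approximate) complementary info."""
    relevancy = mutual_info_score(Xq[:, g], y)
    if not selected:
        return relevancy
    complementary = np.mean([cond_mi(Xq[:, g], y, Xq[:, s]) for s in selected])
    return relevancy + complementary
```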

The same subset of genes is not always selected by MGS at each iteration of LOOCV. In the final gene selection step, we therefore aggregate all the candidate gene subsets (G_SC) from the candidate gene selection step and take the union of these subsets, G_S. Afterward, the genes in G_S are ranked using one of the following two ranking criteria.

• MGS_f: This ranking is based on the assumption that the more frequently a gene appears among the candidate subsets, the more informative it is. To quantify this assumption, we compute the relative frequency of every selected gene S_i in G_S using Eq. 3 (a code sketch of both ranking schemes follows the list):
P(S_i) = F_{S_i} / N_{G_SC};   (3)

here, N_{G_SC}, F_{S_i}, and P(S_i) are the total number of candidate subsets, the frequency of the selected gene S_i, and the relative frequency of gene S_i, respectively. For example, suppose we have two candidate gene subsets from the candidate gene selection step, L_1 = {g_1, g_3, g_4, g_5, g_6} and L_2 = {g_1, g_2, g_4, g_6}. Here, the unique genes are G_S = {g_1, g_2, g_3, g_4, g_5, g_6} and the frequencies of these unique genes are F = 2, 1, 1, 2, 1, 2, respectively. So, the relative frequencies are P(S_i) = 2/2, 1/2, 1/2, 2/2, 1/2, 2/2. Thus, based on P(S_i), the ranked genes are G_S = g_1, g_4, g_6, g_2, g_3, g_5.

• MGS_rf: Informative genes have the ability to split the control and cancer samples into two groups. To find the more informative genes, we need to rank the candidate genes, and to rank them, it is necessary to measure how much information each gene contains. To measure the information content of a gene, we can use the Information Gain (IG) criterion. IG is used in decision trees to select the feature that reduces the entropy of the data the most by splitting the data into two groups (called the left and right children in a decision tree). We use the weighted IG derived in Eq. 4,
IG_w(t) = (N_t / N) [ H(t) − (N_l / N_t) H(l) − (N_r / N_t) H(r) ],   (4)

where N_t is the number of samples at the current (parent) node t, N is the total number of samples, N_l and N_r are the numbers of samples at the left and right child nodes l and r, and H(·) denotes the entropy. This weighted IG is commonly used in Decision Trees (DTs) [26], where each node in a DT contains a gene with its corresponding weighted IG. Besides, to make the weighted IG more robust, we use M DTs to construct a Random Forest and take the average of the IGs for each gene g_j ∈ G_S using Eq. 6,
IG(g_j) = (1/M) Σ_{i=1}^{k} δ(v_i.g, g_j) · IG_i,   (6)

where V = {(g_1, IG_1), ..., (g_k, IG_k)} and k is the total number of nodes in the random forest. That is, for each node of the random forest, we store the corresponding gene and its weighted IG in V. δ(v_i.g, g_j) = 1 if v_i.g = g_j, and 0 otherwise (Kronecker delta function).

This average score can be used as the importance score of each gene. In our case, this importance score represents how important a particular gene is for explaining the target class. Finally, based on the importance score, the genes in G_S are ranked in descending order.
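A minimal sketch of the MGS_f ranking (Eq. 3), reusing the example above; names are illustrative:

```python
from collections import Counter

def mgs_f_rank(candidate_subsets):
    """Rank genes by relative frequency across the candidate subsets (Eq. 3)."""
    n_subsets = len(candidate_subsets)                       # N_{G_SC}
    freq = Counter(g for s in candidate_subsets for g in s)  # F_{S_i}
    # P(S_i) = F_{S_i} / N_{G_SC}; sort descending (ties in arbitrary order).
    return sorted(freq, key=lambda g: -freq[g] / n_subsets)

L1 = {"g1", "g3", "g4", "g5", "g6"}
L2 = {"g1", "g2", "g4", "g6"}
print(mgs_f_rank([L1, L2]))  # g1, g4, g6 first (P = 1.0), then g2, g3, g5 (P = 0.5)
```

And a hedged sketch of the MGS_rf ranking, using scikit-learn's impurity-based feature importances (the mean weighted entropy decrease across the trees) as a stand-in for the average weighted IG of Eqs. 4 and 6; the exact aggregation in the paper may differ:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mgs_rf_rank(X_sel, y, gene_names, n_trees=300):
    """Rank the genes in G_S by their average weighted information gain."""
    rf = RandomForestClassifier(n_estimators=n_trees, criterion="entropy",
                                random_state=0)
    rf.fit(X_sel, y)  # X_sel contains only the columns of the genes in G_S
    order = np.argsort(rf.feature_importances_)[::-1]  # descending importance
    return [gene_names[i] for i in order]
```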

In each iteration of LOOCV, passing the training data to MGS yields candidate informative genes. This is repeated n times, and the selected candidate genes are then passed to MGS_f and MGS_rf to obtain the ranked genes. Finally, from the ranked genes, we take the top η genes as biomarkers and calculate the performance metrics.
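A hedged sketch of this overall protocol, assuming the hypothetical helper functions from the previous sketches (relevancy_filter for candidate selection and mgs_f_rank for ranking); the paper's full selection loop includes the Eq. 2 scoring step, which is omitted here for brevity:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def mgs_pipeline(X, y, eta=10):
    """LOOCV protocol: select candidates per fold, rank, keep top-eta, classify."""
    candidate_subsets = []
    for train_idx, _ in LeaveOneOut().split(X):
        # Candidate gene selection on the training fold (see earlier sketches).
        candidates = [g for g, _ in relevancy_filter(X[train_idx], y[train_idx])]
        candidate_subsets.append(set(candidates))
    # Aggregate candidates across folds and rank them (MGS_f here).
    top = mgs_f_rank(candidate_subsets)[:eta]
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = SVC(kernel="linear").fit(X[np.ix_(train_idx, top)], y[train_idx])
        correct += clf.predict(X[np.ix_(test_idx, top)])[0] == y[test_idx][0]
    return correct / len(y)
```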
To assess the performance of a gene selection method, we consider two performance metrics: accuracy and the Area Under the Receiver Operating Characteristic curve (AUROC). Accuracy is the percentage of samples predicted as their true class. AUROC represents the degree of separability between classes, and it can be used when the dataset is highly imbalanced and the number of samples is smaller than the number of genes. The ROC is a probability curve of a classifier at various thresholds: it plots the true positive rate against the false positive rate as the classification threshold varies.
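A minimal sketch of computing the two metrics with Scikit-learn; the classifier must expose probability scores (or a decision function) for the AUROC:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    """y_pred: hard class predictions; y_score: scores for the positive class."""
    return {"accuracy": accuracy_score(y_true, y_pred),
            "auroc": roc_auc_score(y_true, y_score)}
```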

We used NetworkAnalyst [33] to interpret the biological significance of the selected genes. NetworkAnalyst is a bioinformatics platform for interpreting gene expression data within the context of protein-protein interaction (PPI) networks. We used the top η genes selected for each dataset by our proposed methods and by the previously described methods as input to NetworkAnalyst. Since the type of the cancer samples in the datasets was known, we assessed the performance of the compared methods based on their ability to identify the key pathways affected in the corresponding cancer types.

Results and discussion
We compared the performance of our proposed ranking methods (MGS_f and MGS_rf) to other well-known methods: RF, fDNN, IGIS+, and mDSM. As mDSM is only a gene selection method, we incorporated our frequency and RF based gene ranking methods into it (mDSM_f and mDSM_rf) for comparison purposes. Note that, for a fair comparison, we followed the same training and testing protocol for all the datasets. For RF, fDNN, and MGS_rf (where a random forest is used), we used 300 decision trees. We evaluated the performance of these methods using SVM (linear kernel) and RF classifiers, implemented in Python with the Scikit-learn package [34].
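For reference, the two evaluation classifiers as configured above, built with Scikit-learn; hyperparameters not stated in the text are left at the library defaults:

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

svm_clf = SVC(kernel="linear", probability=True)   # probability=True enables AUROC scores
rf_clf = RandomForestClassifier(n_estimators=300)  # 300 decision trees, as in the text
```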

In this experiment, we applied the aforementioned methods to the four gene expression datasets. In this section, we first discuss the performance of all methods in terms of accuracy and AUROC and then provide the biological interpretation of the top η (= 10) selected genes. In situations where a feature selection method (IGIS+, mDSM_f, mDSM_rf) selected fewer than 10 genes, we used only those genes for our analysis. We also discuss the performance for different numbers of top η genes to measure the robustness of our method.

It is not always true that selected genes with better classification ability are also relevant to a biological process. To examine this, apart from accuracy and AUROC, we investigated the ability of the top (≤ 10) selected genes to identify the most relevant pathways in the cancer types used in the different datasets. This is described in the next section.

Nasopharyngeal carcinoma (NPC) is highly prevalent in Southeast Asia [35][36][37], and the GDS3341 and GDS3610 datasets contain NPC samples [28,29]. Though GDS3341 and GDS3610 are independent datasets, both MGS_rf and MGS_f could detect the genes involved in viral carcinogenesis and Epstein-Barr virus infection (Table 3). We used two different datasets (GDS3341 and GDS3610) on the same cancer type as built-in controls in the study to increase confidence in the experimental results. Our methods detected pivotal genes involved in pathways relevant to cancer (Fig 2), whereas the other methods could not detect any pivotal genes relevant to cancer (Table 3).

LGALS1 and LAMB1 were selected among the top 10 genes from the GDS3341 dataset by the MGS_rf method. These (highlighted in red) are part of a sub-network that contains many other proteins (highlighted in green) known to play roles in different cancers [33]. For the small and highly imbalanced dataset GDS3610, our methods showed their superiority for different numbers of η genes (Fig 4). Our methods handled not only small but also imbalanced datasets, as shown in Fig 4B; AUROC is a better metric for imbalanced datasets. We also demonstrated our methods' strength on dataset GSE106291, which has a comparatively large number of samples (Fig 6). Here, our methods performed better than the others in terms of accuracy and AUROC across different values of η, indicating their applicability to gene expression datasets with small and relatively medium sample sizes.

Conclusion

In this paper, we presented a gene selection method and two gene ranking methods for classifying high dimensional, low sample size gene expression data. The proposed gene selection method utilizes maximum relevancy and complementary information to select informative genes that have biological importance. Experimental results on real datasets illustrate that our gene selection method consistently yields higher classification accuracy and selects more biologically relevant genes than prior state-of-the-art methods. However, a few challenges are left to be addressed in further studies. First, we believe that introducing higher-order gene interaction terms will help to reduce the number of selected genes. Second, to obtain globally optimal gene subsets, we may need a semi-definite programming based search strategy instead of the χ²-based filter method used in this paper.