Deep Feature Selection for Identification of Essential Proteins of Learning and Memory in a Mouse Model of Down Syndrome

Down syndrome is a chromosomal abnormality associated with intellectual disability that affects about 0.1% of live births worldwide. It occurs when an individual carries a full or partial extra copy of chromosome 21. This trisomy results in the overexpression of genes that is believed to be sufficient to interfere with normal pathways and normal responses to stimulation, causing learning and memory deficits. Therefore, by studying these proteins and the disturbances in pathways involved in learning and memory, we can consider drugs that would correct the observed perturbations and thereby help enhance memory and learning. Here, starting from an earlier study that identified 77 proteins differentially expressed in normal and trisomic mice exposed to context fear conditioning (CFC), we provide a quantitative protein selection based on different feature selection techniques to select the proteins most important to learning and memory. These techniques include the Fisher score, the Chi score, and correlation-based subset selection. In addition, deep feature selection is utilized to extract higher-order proteins using deep neural networks. Three main experiments are carried out: studying the control mice's response, the trisomy mice's response, and the combined control-trisomy response. In each experiment, a support vector machine classifier is used to assess the ability of the selected proteins to distinguish mice that learned from those that did not learn from the fear conditioning event. By applying deep feature selection, fifteen proteins were selected in control mice, nine in trisomy mice, and seven in control-trisomy mice, achieving distinguishing accuracies of 93%, 99%, and 84%, respectively, compared to average accuracies of 74%, 78%, and 71% for the other selection methods. Some of these proteins, such as CaNA, NUMB, and NOS, have important biological functions in learning.

Overexpression of these genes may affect different biological processes and pathways, including brain development and function [6]. Over the past decades, interest in the identification of DS learning deficits has risen [7][8][9][10][11][12][13]. However, because functional information is available for less than half of the Hsa21-encoded genes [5], it is logical to look for disturbances in pathways that are critical to learning and memory, and then to consider drugs that would correct the observed perturbations. The benefit of studying pathways is that a deep understanding of the functions of individual Hsa21 genes, their interactions, and their contributions to brain function is not required [5]. Pathway disturbance can be gathered from experiments that each measure a small number of proteins. These experiments differ in factors such as the performed task, the task protocol, the timing of the protein measurements (e.g., an hour and some minutes afterwards), or the region and method of the protein analysis [14]. For instance, Ahmed et al. [14] used reverse phase protein arrays (RPPA) to analyze the levels of more than 80 proteins in mice exposed to Context Fear Conditioning (CFC) to study protein responses and interactions after normal learning. Higuera et al. [5] analyzed the levels of 77 proteins in control and trisomy (Down syndrome) mice with and without learning stimulation to study the effect of treatment and to investigate the proteins involved in the learning process. They used self-organizing maps (SOM), an unsupervised neural-network clustering method, whose output clusters are then processed to gather similar neighbors together. The Wilcoxon test is then used to assess the significance of each protein in separating two classes.
The significant proteins are then processed to obtain the final results using the intersection of the compared class pairs. However, self-organizing maps can produce fragmented subclusters, and they do not provide a quantification of the protein selection results.

Our Motivation and Contribution

Building on the work of [5], which used SOM to find the proteins significant for learning, our contribution is two-fold. First, in contrast to SOM, we propose a quantitative approach to investigate protein expression in order to identify biologically important differences in protein levels in mice exposed to CFC, using machine learning feature selection algorithms. We examine 77 proteins from subcellular fractions of brain regions of wild-type mice exposed to CFC to assess their relative learning, where part of these mice were stimulated with context shock with and without memantine treatment, since memantine is currently used to treat moderate to severe Alzheimer's disease (AD) and has been proposed for treating learning deficiency in DS [15,16]. Second, we investigate different protein selection approaches that reflect the linear nature of the selected proteins, and we also utilize deep learning to mimic the nonlinearity of the underlying selection process.

The paper is organized as follows: Section 2 discusses the dataset used as well as the methodologies utilized to select the proper proteins. In Section 3, we discuss the results of applying the feature selection approaches to the protein expressions.

Then we conclude in Section 4.

To detect the important proteins that contribute to the memory and learning process, we use their expressions as features in different feature selection models. Fig 1 illustrates the workflow of the proposed approach. As the protein expressions act as features, we use four different feature selection models: the Fisher score, the Chi score, correlation-based subset selection, and deep feature selection.

Abbreviations

Table 1 lists the abbreviations used throughout the paper; these abbreviations are also defined in the text.

In this study, we use a dataset of mouse protein expressions [14], in which each measurement can be considered an independent sample [14]. The dataset resulted from a context fear conditioning (CFC) experiment described in detail in [14]. In that experiment, mice were first injected with memantine or an equivalent quantity of saline. Then, they were placed in a novel cage where they were allowed to explore for 3 minutes and then given an electric shock (2 s, 0.7 mA, constant current). This group of mice had the chance to learn to associate the context with the stimulus (electric shock) and is accordingly considered the context-shock (CS) group. "Learning" here means "freezing" after re-exposure to the context, where freezing is defined as a lack of movement except for respiration. Another group of mice, shock-context (SC), were placed in the novel cage, immediately given the electric shock, and then allowed to explore for 3 minutes. Each group was studied under the effect of memantine and of saline. Thus, a total of eight groups of mice were produced, as illustrated in Fig 3. A reverse phase protein arrays (RPPA) technique was then used to evaluate the levels of proteins and protein modifications in subcellular fractions from the hippocampus of the subject mice, where the proteins chosen were those relevant to CFC specifically, or to learning, memory, synaptic plasticity, and Alzheimer's disease generally.
For reference, the naming convention of the mice groups in this paper is as follows: (c/t)-(SC/CS)-(s/m), where c/t denotes control or trisomy mice, SC/CS denotes shock-context or context-shock mice, and s/m denotes injection with saline or memantine. For instance, c-CS-s denotes control mice that were exposed to the context-shock experiment while injected with saline, and t-SC-m denotes trisomy mice that were injected with memantine and exposed to the shock-context experiment. At the first level, we have either control or trisomy mice. Each group is involved in a context-shock or shock-context experiment, while injected with memantine or saline. The last level is the results of the mice's responses.
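As an illustration of this naming convention, the following small helper (hypothetical, not part of the original study) decodes a group label into its three experimental factors:

```python
# Illustrative helper (not from the original study): decode a mouse-group
# label of the form (c/t)-(SC/CS)-(s/m) into its three experimental factors.
GENOTYPE = {"c": "control", "t": "trisomy"}
PROTOCOL = {"CS": "context-shock", "SC": "shock-context"}
INJECTION = {"s": "saline", "m": "memantine"}

def parse_group(label):
    """Parse e.g. 'c-CS-s' -> ('control', 'context-shock', 'saline')."""
    g, p, i = label.split("-")
    return GENOTYPE[g], PROTOCOL[p], INJECTION[i]

print(parse_group("c-CS-s"))   # ('control', 'context-shock', 'saline')
print(parse_group("t-SC-m"))   # ('trisomy', 'shock-context', 'memantine')
```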

The protein dataset needed to be processed prior to investigation. Feature selection methods fall into three broad families. Firstly, in wrapper methods [17], a search algorithm is applied to find the best subset of features. For instance, the best-first search method [18], the random hill-climbing algorithm [19,20], or forward and backward passes [21] can be used to add and remove features; these can be considered heuristic methods [22]. Secondly, in embedded methods, the accuracy of the system is measured and the most useful features are identified while building the model. These methods are similar to wrapper methods, but they are less computationally expensive and less prone to overfitting, since validation is done using cross-validation. Thirdly, in filter methods, a statistical measure is applied and each feature is assigned a score. The scores are then ranked, and a threshold can be applied to eliminate unnecessary features. In this paper, we focus on this last type and consider the Fisher, correlation [23,24], and Chi scores.

To investigate the proteins that are important in memory and learning, we divide the data into two main classes, mice that learned and mice that did not learn, as in Fig 4. First, we study the control mice group to determine which proteins are responsible for learning in control mice. In this first experiment we have two classes: c-CS-s and c-CS-m versus c-SC-s and c-SC-m. This allows us to study the effect of the context shock, i.e., normal learning. Second, for the trisomy mice, we divide the trisomy classes into two classes: mice that were rescued due to context shock and memantine, and those that failed to learn or did not learn due to the absence of a stimulus. Thus, we have t-CS-m versus t-CS-s, t-SC-s, and t-SC-m, where we can explore the effect of the rescued learning, i.e., the effect of context shock with memantine.
Third, both the control and trisomy mouse types are considered in the feature selection problem, with two classes: c-CS-m, c-CS-s, and t-CS-m versus c-SC-s, c-SC-m, t-CS-s, t-SC-s, and t-SC-m. This allows us to explore the difference between the normal learning in the control mice and in memantine-injected mice, and the not-learned mice that lacked either context shock or memantine.

Fisher Score

The within-class measure S_w expresses the distance between instances of the same class, while S_b represents the distance between the different classes, as given by equations (1) and (2).
where N is the number of instances or observations, j is the class index over n classes, m_j is the number of instances in class j, µ_i is the mean of feature i over all classes, and f_kij is the value of feature i for observation k in class j.

The F score is then the ratio of S_b to S_w, as in equation (3). This means that the higher the F score, the better the feature.
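Equations (1)-(3) did not survive extraction; a standard form of the Fisher criterion consistent with the symbol definitions above is the following (a hedged reconstruction, where the per-class mean µ_ij is an auxiliary symbol introduced here):

```latex
% Within-class scatter (1) and between-class scatter (2) for feature i,
% reconstructed in a standard form from the symbol definitions in the text;
% \mu_{ij} denotes the mean of feature i within class j.
S_w(i) = \frac{1}{N}\sum_{j=1}^{n}\sum_{k=1}^{m_j}\left(f_{kij}-\mu_{ij}\right)^2,
\qquad
S_b(i) = \frac{1}{N}\sum_{j=1}^{n} m_j\left(\mu_{ij}-\mu_i\right)^2
% Fisher score (3): between-class over within-class scatter.
F(i) = \frac{S_b(i)}{S_w(i)}
```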
Chi Score

Also called the χ² test or "goodness of fit" statistic, it reveals how likely it is that an observed distribution is due to chance, and quantifies how well the observed data distribution fits the distribution expected if the variables were independent [27]. Thus, the higher the Chi score, the better the feature separates the data.
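As a sketch of how such a score can be computed (a minimal pure-Python illustration, not the implementation used in the paper), the χ² statistic for one discretized feature compares the observed class counts in each feature bin against the counts expected under independence:

```python
# Minimal chi-square feature score (illustrative, pure Python).
# `feature` holds a discretized value per sample, `labels` the class per sample.
from collections import Counter

def chi2_score(feature, labels):
    n = len(labels)
    feat_counts = Counter(feature)          # marginal counts per feature bin
    label_counts = Counter(labels)          # marginal counts per class
    joint = Counter(zip(feature, labels))   # observed joint counts
    score = 0.0
    for f, nf in feat_counts.items():
        for c, nc in label_counts.items():
            expected = nf * nc / n          # count expected under independence
            observed = joint.get((f, c), 0)
            score += (observed - expected) ** 2 / expected
    return score

# A bin perfectly aligned with the class gets a higher score than a mixed one.
labels  = ["learned", "learned", "not", "not"]
aligned = ["hi", "hi", "lo", "lo"]
mixed   = ["hi", "lo", "hi", "lo"]
print(chi2_score(aligned, labels) > chi2_score(mixed, labels))  # True
```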

Correlation-based Feature Selection (CbFS)

The central hypothesis of this feature selection method is that good feature sets contain features that are highly correlated with the class, yet uncorrelated with each other [28]. It uses heuristic search methods to find the best feature subset, as in equation (4).
where M_S is the heuristic merit of a feature subset S containing K features, r_cf is the mean feature-class correlation, and r_ff is the average feature-feature inter-correlation. The algorithm first computes the correlation factor of each feature; then, for every combination, it calculates the merit and searches for the best subset.

Deep Feature Selection (DFS)

In deep feature selection, a sparse weight layer is attached to the inputs, and only its non-zero weights are considered in the selection. Usually, sparsifying and finding the weights of the network is done using backpropagation to solve the following objective function: where w are the weights, b are the nodes' biases, l(w, b) is the loss function (which could be log-likelihood, logistic loss, or squared loss), r(w) is the regularization term, and λ is the regularization factor that weighs the regularization against the error.
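Equations (4) and (5) did not survive extraction; the merit below follows the standard CbFS form from [28], and the objective is the generic regularized loss described in the text (a hedged reconstruction, not the authors' exact notation):

```latex
% CbFS merit of a subset S of K features (equation (4), standard form):
M_S = \frac{K\,\overline{r_{cf}}}{\sqrt{K + K(K-1)\,\overline{r_{ff}}}}
% Generic regularized objective solved by backpropagation (equation (5)):
\min_{w,b}\; l(w, b) + \lambda\, r(w)
```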

Feature selection using a neural network can be done in several ways. For instance, an elastic net can be used where the input is connected to one hidden layer and then to the output layer (with output nodes equal to the number of classes) [25]. However, with only one hidden layer, the level of complexity captured is not high. In shallow feature selection (S-DFS), only one hidden layer exists in the network, and one layer is added at the input level that has the same number of nodes as the input.

This additional layer's weights are our concern, as only non-zero or above-threshold weights are taken into the feature selection, as illustrated in Fig 5 (a). The output layer of the network is a softmax layer with a number of nodes equal to the number of classes. In deep feature selection (D-DFS), the network instead contains multiple hidden layers [29]. This model is very useful in systems of high complexity where the data cannot be separated linearly. In the same manner, an additional layer is added after the input layer with the same number of nodes as the inputs (in our case, proteins), as shown in Fig 5 (b). This allows us to learn deeply about the proteins and their relations to each other. We will refer to D-DFS simply as DFS for convenience.

In the DFS system, two regularization terms are used: one for the additional added layer and the other for the hidden network layers. Thus, the objective function in equation (5) proposed by [25] becomes equation (6), which can be explained as follows:
1. The loss term in our case is the log-likelihood, where y is the output, x is the input, w(k+1) is the weight of the k-th layer, and i is the node index. Since the last layer is a softmax layer, its loss function takes the softmax form.
2. The first regularization term controls the tradeoff between the sparsity of the weights and the smoothness of the network, where λ_2 is a user-specified parameter that can range from 0 to 1.
3. The second regularization term reduces the complexity of the model and sparsifies the hidden network. Another effect of this term is to avoid the shrinking of weights in the input weight layer (the layer after the input layer), which would result in very small w_i weights and large W weights.
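Equation (6) did not survive extraction; based on the description above and the elastic-net-style objective of [25], it plausibly takes a form such as the following (a hedged reconstruction, not the authors' exact notation):

```latex
% Hedged reconstruction of the DFS objective (6): log-likelihood loss plus
% elastic-net regularization on the input selection weights w and on the
% hidden-layer weight matrices W^{(k)}; \lambda_1, \lambda_2, \alpha_1, \alpha_2
% follow the elastic-net formulation of [25].
\min_{\theta}\; -\mathcal{L}(\theta)
  + \lambda_1\!\left[\alpha_1\lVert w\rVert_1
      + \tfrac{1-\alpha_1}{2}\lVert w\rVert_2^2\right]
  + \lambda_2\sum_{k}\left[\alpha_2\lVert W^{(k)}\rVert_1
      + \tfrac{1-\alpha_2}{2}\lVert W^{(k)}\rVert_F^2\right]
```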

To find the solution of equation (6), we can utilize gradient descent [30]. It is a first-order iterative optimization algorithm that, in each iteration, takes steps proportional to the negative of the gradient of the function at the current point.
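The update rule can be sketched in a few lines (an illustrative example on a simple one-dimensional function, not the network objective itself):

```python
# Gradient descent on f(x) = (x - 3)^2: each step moves against the gradient
# f'(x) = 2(x - 3), scaled by the learning rate.
def gradient_descent(grad, x0, lr=0.1, iters=200):
    x = x0
    for _ in range(iters):
        x = x - lr * grad(x)   # step proportional to the negative gradient
    return x

x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # converges to the minimizer 3.0
```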

For the purpose of quantifying and assessing the chosen proteins, we use classification models to measure the ability of these proteins to distinguish between the two classes studied. For instance, after selecting a group X of proteins, the performance measures of the classifiers are reported; the higher the performance, the better the protein group. We use three common measures, namely accuracy, sensitivity, and specificity. To validate the model, we use cross-validation, which lets us assess the generalization ability of the classification model, i.e., its ability to make predictions for data it has not already seen. In k-fold cross-validation, the original data is randomly partitioned into k equal-sized parts. Then, in each of the k iterations, a single part is used as testing data for the model, and the remaining k-1 parts are used as training data. This cross-validation is utilized to avoid any over-fitting that might be encountered during classification.
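In terms of confusion-matrix counts (true/false positives and negatives), these three measures are conventionally defined as follows (a standard sketch with made-up example counts, not code or numbers from the paper):

```python
# Standard definitions of the three reported measures from confusion-matrix
# counts: tp/tn are correct positive/negative predictions, fp/fn the errors.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):          # true positive rate (recall)
    return tp / (tp + fn)

def specificity(tn, fp):          # true negative rate
    return tn / (tn + fp)

# Hypothetical example: 40 learned mice correctly detected, 10 missed;
# 45 not-learned mice correctly rejected, 5 false alarms.
print(accuracy(40, 45, 5, 10))    # 0.85
print(sensitivity(40, 10))        # 0.8
print(specificity(45, 5))         # 0.9
```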

The classifier used in this paper is the linear support vector machine (SVM) [31,32]. The main idea of SVM is to find the optimal separating boundary between the data points to distinguish the classes. For instance, in a two-class SVM, the classifier searches for the closest feature points, called support vectors; the hyperplane perpendicular to the line connecting these support vectors can be taken as the separating boundary between the two classes.
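A linear SVM of this kind can be sketched from scratch via subgradient descent on the hinge loss (a minimal illustration on made-up 2-D toy data, assuming labels in {-1, +1}; the paper's actual experiments would use a tuned library implementation):

```python
# Minimal linear SVM trained by subgradient descent on the regularized hinge
# loss: L = lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b))).
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:   # point inside the margin: hinge term is active
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b = b + lr * yi
            else:            # correctly classified: only shrink the weights
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy linearly separable data: class +1 near (2, 2), class -1 near (-2, -2).
X = [[2, 2], [3, 2], [2, 3], [-2, -2], [-3, -2], [-2, -3]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])  # [1, 1, 1, -1, -1, -1]
```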

Using the protein dataset, the proposed protein selection algorithms were assessed using SVM in order to measure the separation performance of each protein set. A five-fold cross-validation was applied. The control classes were analyzed first, then the trisomy classes. Finally, the proteins that affect the discrimination of the whole dataset were analyzed. In the DFS, a data split was used with 70% training data, 15% validation data, and 15% testing data.

The Fisher scores of the 77 proteins are listed in appendix 1. The SOD1 protein, superoxide dismutase 1, scored the highest F-score of 1.63, meaning that this protein has a high correlation with the learning and memory processes, which is supported by the results of [33]. The pPKCAB protein comes in second place with a score of 1.28, followed by calcineurin CaNA, which has been shown to influence both learning and reversal learning [34]. On the other hand, CAMKII, RRP1, GluR4, pRSK, CREB, GluR3, and RSK obtained the lowest scores, suggesting that these proteins do not affect the learning process in control mice according to the Fisher criterion.

The influence of the proteins sorted by Fisher score using the SVM classifier is illustrated in Fig 6 (a), where the initial number of proteins is 1 and proteins were added one at a time to analyze their effect. The maximum accuracy that can be achieved is 92.7% using all the proteins. However, the first 30 proteins already account for 90% accuracy; the remaining 47 proteins add only about 2%. The Chi score was then tested for the control mice classes, as shown in appendix 1. Again, the SOD1 protein has the highest score of 695.5, followed by pPKCG with a score of 614.3. We can also notice that CREB, pNR2B, and GluR3 have scores near zero, and they also obtained the lowest scores in the Fisher test. Hence, the Chi test is consistent with the results from the Fisher criterion. The proteins were then ordered, and SVM classification was performed on the data, yielding a maximum accuracy of 91%, a sensitivity of 78%, and a specificity of 93%, as shown in Fig 6 (b).

Fig: Proteins related to memory and learning from [5] and our approach in the control mice group. Gray: proteins selected by [5]. Blue: proteins selected by DFS. Orange: common proteins.

The corresponding average accuracy from [5] was 78.25%. Therefore, our selected proteins provide a better distinction between the two mice groups, with and without learning experience.

The same algorithms were performed on the trisomy mice. The data was separated into two groups: a group that represents learning, t-CS-m, and groups that failed to learn, t-CS-s, t-SC-m, and t-SC-s. The Fisher scores of the proteins are shown in appendix 2, followed by an SVM classification performance assessment in Fig 10 (a). In contrast to the control mice, SOD1 was the fifth protein, while ARC and pS6 had the highest scores. The ribosomal protein pS6 has an impact on short-term memory formation, storage, and retrieval, as presented in Giese et al. [35]. On the other hand, we have a very low specificity at the beginning that is compensated by a high sensitivity; the specificity then starts to increase after 25 or more proteins are involved.

The Chi test was performed on the trisomy mice, as shown in appendix 2. pPKCG had the highest Chi score, and SOD1 came in second place, followed by CaNA and ERK. This suggests that SOD1 and CaNA are differentially expressed in both trisomy and control mice. The SVM classification performance using different numbers of proteins sorted by their Chi score is shown in Fig 10 (b). The comparison of the proposed techniques for trisomy mice is shown in Fig 11. The DFS showed a very high accuracy of 99% with few features, while the Fisher and Chi scores with SVM obtained only 80% and 78% accuracies. This is evidence of nonlinearity in the data classification. Similarly, we compare the proteins selected by the DFS and by [5] in trisomy mice; Fig 12 shows the proteins selected by DFS and by SOM, as well as the common proteins.

Among the selected proteins, BRAF, pERK, MTOR, P38, SOD1, CaNA, and Ubiquitin were common, whereas pPKCG was selected by our approach, which agrees with the results of [35]. To evaluate our selected proteins, a linear SVM was used with five-fold cross-validation. Fig 13 shows the results of the five validation iterations. The proteins selected in our study show variance in the accuracies close to that of [5]'s proteins. The maximum and average accuracies obtained are 91.23% and 88.07%, respectively, compared to 87.72% and 85.44% from [5]. Therefore, our selected proteins provide a better distinction between the two trisomy mice groups, with and without learning experience, accounting for the effect of memantine.

Fig: Proteins related to memory and learning from [5] and our approach in the trisomy mice group. Gray: proteins selected by [5]. Blue: proteins selected by DFS. Orange: common proteins.

In this experiment, we studied both control and trisomy mice. The dataset was divided into two categories: mice that learned and mice that failed to learn, including the no-learning classes resulting from shock context. The Fisher scores of the 77 proteins are shown in appendix 3. CaNA and SOD1 obtained the highest scores, followed by the Ubiquitin and ARC proteins. Notably, these proteins also had high scores in both the control and trisomy mice, which supports the results of the previous sections. Fig 14 (a) shows the SVM performance versus the number of selected proteins for classifying learned and not-learned mice in the whole dataset. The accuracy starts to settle after the first 20 proteins, at 85% accuracy, 70% sensitivity, and 84% specificity.

The Chi score of each protein is also shown in appendix 3. The CaNA protein held fourth place, while SOD1 obtained fifth place. The highest value was scored by the H3MeK4 protein. The classification performance is shown in Fig 14 (b). The overall system accuracy is not good: a maximum accuracy of 86%, a sensitivity of 83%, and a specificity of 90% were obtained using the full protein set. However, the system accuracy starts to settle after the first 33 proteins. The correlation-based subset included proteins such as H3AcK18, EGR1, H3MeK4, and CaNA, as shown in appendix 3. Again, the CaNA and NUMB proteins were indicated as significant in the learning process.

In the DFS, training data of 756 instances, 86 validation instances, and 86 test instances were utilized. We used the same parameters as in the control and trisomy experiments.

Down syndrome is a chromosomal condition associated with intellectual disability and with memory and learning deficiency. This condition is complicated and involves many genes and pathways. The pathway disturbance can be interpreted from experiments that measure the proteins. In this work, we applied different protein selection techniques in order to select, among 77 proteins from a context fear conditioning experiment performed on control and trisomy mice, the important proteins that influence memory and learning. We utilized different approaches: the Fisher score, the Chi score, correlation-based feature selection using a best-subset search algorithm, and deep protein selection using deep neural networks. In our approach, we compared the learning or rescued classes versus the no-learning or failed-to-learn classes. Our study of the proteins in the control mice group indicated that the SOD1, CaNA, Ubiquitin, and nNOS proteins have the highest impact on the memory and learning process. Utilizing deep feature selection with two hidden layers, fifteen proteins were selected, obtaining an accuracy of 93%. In the trisomy mice case, nine proteins were selected, obtaining an accuracy of 99%, indicating that memantine injected into the trisomy mice can promote the response to the learning process. The control-trisomy deep feature selection resulted in seven proteins, achieving an accuracy of 84%. The deep feature selection proved that higher-order protein selection is needed for the prediction of significant proteins related to memory and learning, as it outperformed the Fisher, Chi, and correlation-based protein selection methods. Moreover, the proteins selected in this research showed a higher ability to distinguish between the two groups involved in each classification.
In addition, the deep feature selection showed a better performance compared to SOM, since it takes advantage of the data labels, making it a supervised method.