AIRBP: Accurate identification of RNA-binding proteins using machine learning techniques

Motivation Identification of RNA-binding proteins (RBPs), which bind to ribonucleic acid molecules, is an important problem in computational biology and bioinformatics. Identifying RBPs is indispensable because they play crucial roles in post-transcriptional control of RNAs and RNA metabolism and have diverse roles in biological processes such as splicing, mRNA stabilization, mRNA localization and translation, and RNA synthesis, folding-unfolding, modification, processing, and degradation. The existing experimental techniques for identifying RBPs are time-consuming and expensive. Therefore, identifying RBPs directly from the sequence using computational methods can efficiently annotate RBPs and assist experimental design. In this work, we present a method, called AIRBP, which is designed using an advanced machine learning technique, called stacking, to effectively predict RBPs by utilizing features extracted from evolutionary information, physicochemical properties, and disordered properties. Moreover, AIRBP is trained on a useful feature subset identified by an evolutionary algorithm (EA). Results The results show that AIRBP attains Accuracy (ACC), F1-score, and MCC of 95.38%, 0.917, and 0.885, respectively, on the benchmark dataset using 10-fold cross-validation (CV). Further evaluation of AIRBP on independent test sets reveals that it achieves ACC, F1-score, and MCC of 93.04%, 0.943, and 0.855 for the Human test set; 91.60%, 0.942, and 0.789 for the S. cerevisiae test set; and 91.67%, 0.953, and 0.594 for the A. thaliana test set, respectively. These results indicate that AIRBP outperforms the current state-of-the-art method. Therefore, the proposed top-performing AIRBP can be useful for accurate identification and annotation of RBPs directly from the sequence and can help gain valuable insight to treat critical diseases. Availability Code and data are available here: http://cs.uno.edu/~tamjid/Software/AIRBP/code_data.zip


sequence-structure matching score is used to predict RBPs. As shown, SPOT-seq achieved an MCC of 0.62 on the benchmark data of 215 RBP chains and 5765 non-binding protein chains.

The machine learning-based approach for the prediction of RNA-binding proteins involves two crucial steps: i) extraction of relevant features, and ii) selection of an appropriate classification algorithm.

Furthermore, depending on the feature extraction mechanism, the existing predictive methods can be grouped into two categories: i) those that extract relevant features from the structure of the protein (Paz, et al., 2016; Shazman and Mandel-Gutfreund, 2008); and ii) those that extract relevant features from the protein sequence (Kumar, et al., 2011; Ma, et al., 2015; Ma, et al., 2015; Zhang and Liu, 2017). BindUp (Paz, et al., 2016), available as a web server, is one of the recent structure-based methods; it extracts electrostatic features and other properties from the structure of the protein and uses an SVM classifier for RBP prediction.

As reported, BindUp attains a sensitivity of 0.71 and a specificity of 0.96 on an independent test set of 323 structures of RNA-binding proteins and a control set of an equal number of structures extracted from the Protein Data Bank (PDB). Toward sequence-based approaches, Ma et al. (Ma, et al., 2015) recently proposed two different methods, which differ in the features used to train a random forest model for prediction. In the first method, the authors incorporated evolutionary information combined with physicochemical features (EIPP) and an amino acid composition feature to develop the random forest predictor. In the second method, the authors employed features such as conjoint triad, binding propensity, non-binding propensity, and EIPP to establish a random forest-based predictor with the minimum redundancy maximum relevance (mRMR) method, followed by incremental feature selection (IFS). As reported, their method achieved an accuracy of 0.8662 and MCC of 0.737. Zhang and Liu (Zhang and Liu, 2017) used features including the normalized van der Waals volume, polarizability, predicted secondary structure, solvent accessibility, and the charge and polarity of the side chain to encode the protein sequence. Here, the predicted secondary structure and surface accessibility were obtained from the SSpro and ACCpro programs (Magnan and Baldi, 2014). Likewise, the MoRF scores were predicted using the OPAL program, and the PSSM scores were obtained using the PSI-BLAST program (Altschul, et al., 1990).

In this section, we describe the C-T-D (composition, transition, and distribution) transformation method, which aims to describe the distribution patterns of amino acid properties. This method was first suggested by Dubchak et al. (Dubchak, et al., 1995) for protein fold class prediction. In our implementation, we used the C-T-D transformation to encode properties including hydrophobicity, normalized van der Waals volume, polarity, and polarizability. Furthermore, to encode the predicted secondary structure and solvent accessibility as features, we first used the SSpro and ACCpro programs (Magnan and Baldi, 2014) to predict the secondary structure in the form of 'H' (helix), 'E' (strand), and 'C' (other than helix and strand) and the solvent accessibility in the form of 'e' (exposed residues) and '-' (buried residues), respectively. SSpro and ACCpro were chosen to extract the predicted secondary structure and solvent accessibility because of their superior performance and remarkable speed. As reported, SSpro and ACCpro (Magnan and Baldi, 2014) achieved an accuracy of 92.9% and 90% for secondary structure prediction and relative solvent accessibility prediction, respectively.

Using the transformation technique described above, we obtained feature vectors of 21 and 13 dimensions for the predicted secondary structure and solvent accessibility, respectively.
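As an illustration of the C-T-D encoding, the following sketch computes the 21 features for a 3-letter secondary-structure string (composition 3 + transition 3 + distribution 15) and the 13 features for the 2-letter accessibility alphabet. The exact percentile-boundary handling is an assumption; the text does not spell it out.

```python
from itertools import combinations

def ctd_features(seq, alphabet=('H', 'E', 'C')):
    """C-T-D sketch: composition, transition, and distribution features
    over a property-class string such as a predicted secondary structure.
    For a 3-letter alphabet this gives 3 + 3 + 15 = 21 features; for a
    2-letter alphabet (e.g. 'e'/'-') it gives 2 + 1 + 10 = 13 features."""
    n = len(seq)
    # Composition: fraction of each class symbol in the sequence.
    comp = [seq.count(a) / n for a in alphabet]
    # Transition: fraction of adjacent positions switching between a pair.
    trans = []
    for a, b in combinations(alphabet, 2):
        t = sum(1 for x, y in zip(seq, seq[1:]) if {x, y} == {a, b})
        trans.append(t / (n - 1))
    # Distribution: relative position of the 1st, 25%, 50%, 75%, and 100%
    # occurrence of each class symbol (assumed rounding scheme).
    dist = []
    for a in alphabet:
        idx = [i + 1 for i, x in enumerate(seq) if x == a]
        for frac in (0.0, 0.25, 0.50, 0.75, 1.0):
            if not idx:
                dist.append(0.0)
            else:
                k = max(1, int(round(frac * len(idx))))
                dist.append(idx[k - 1] / n)
    return comp + trans + dist

features = ctd_features("HHHEECCCHHEECC")
assert len(features) == 21
```

The same function applied with `alphabet=('e', '-')` reproduces the 13-dimensional solvent-accessibility vector mentioned above.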

Shen et al. first proposed the CT (conjoint triad) transformation technique for protein-protein interaction prediction (Shen, et al., 2007), and it was successfully applied to protein-RNA interaction prediction in the past (Wang, et al., 2013; Zhang and Liu, 2017). In our implementation, we adopted the CT transformation technique to encode the protein sequence based on the charge and polarity of the side chain of the amino acids in a protein. First, the 20 standard amino acids are divided into 4 groups (Groups A, B, C, and D) according to their charge and polarity: i) acidic (residues D and E); ii) basic (residues H, R, and K); iii) polar (residues C, G, N, Q, S, T, and Y); and iv) non-polar (residues A, F, I, L, M, P, V, and W).
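The grouping above can be sketched as a conjoint-triad style encoding: map each residue to its group label and count triads of three consecutive labels. With 4 groups this yields a 4³ = 64-dimensional vector; the normalization by the number of triads is an assumption, since the text does not state the exact scaling.

```python
# Charge/polarity grouping from the text (group label -> residues).
GROUPS = {
    'A': ('D', 'E'),                                # acidic
    'B': ('H', 'R', 'K'),                           # basic
    'C': ('C', 'G', 'N', 'Q', 'S', 'T', 'Y'),       # polar
    'D': ('A', 'F', 'I', 'L', 'M', 'P', 'V', 'W'),  # non-polar
}
AA2GROUP = {aa: g for g, aas in GROUPS.items() for aa in aas}

def ct_features(seq):
    """CT sketch: frequencies of the 4^3 = 64 possible group triads
    formed by three consecutive residues (assumed normalization)."""
    labels = [AA2GROUP[aa] for aa in seq]
    counts = {}
    for i in range(len(labels) - 2):
        triad = ''.join(labels[i:i + 3])
        counts[triad] = counts.get(triad, 0) + 1
    total = max(1, len(labels) - 2)
    keys = [a + b + c for a in 'ABCD' for b in 'ABCD' for c in 'ABCD']
    return [counts.get(k, 0) / total for k in keys]

vec = ct_features("MKDERHSTACW")
assert len(vec) == 64
```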

Features extracted from evolutionary information
In this section, we describe various feature extraction techniques utilized to obtain a fixed dimensional feature vector from the evolutionary information, called the PSSM profile, to encode the protein sequence. Evolutionary information is among the most critical information for solving various biological problems and has been widely used in many research works (Iqbal, et al., 2015; Kumar, et al., 2007; Kumar, et al., 2008; Kumar, et al., 2011; Mishra, et al., 2018; Zhang and Liu, 2017). In this work, the evolutionary information in the form of the PSSM profile is directly obtained from the protein sequence and later transformed into a fixed dimensional vector. PSSM captures the conservation pattern in multiple alignments and preserves it as a matrix, with one row for each position in the alignment. A high score in the PSSM matrix indicates a more conserved position, and a lower score indicates a less conserved position (Mishra, et al., 2018). For this study, we generated the PSSM profile for every protein sequence by executing three iterations of PSI-BLAST against NCBI's non-redundant database (Altschul, et al., 1990). The evolutionary information in the PSSM profile is represented as a matrix of L × 20 dimensions, where L is the length of the protein sequence. A particular element M(i,j) of the PSSM matrix represents the score of observing amino acid j at position i of the protein sequence.

We use two types of distance transformation techniques (Mishra, et al., 2018; Xu, et al., 2015): i) the PSSM distance transformation for pairs of the same amino acid (PSSM-SDT); and ii) the PSSM distance transformation for pairs of different amino acids (PSSM-DDT), together known as PSSM-DT, to extract fixed dimensional feature vectors of size 100 and 1900, respectively.

Utilizing PSSM-SDT, we compute the occurrence probabilities for pairs of the same amino acid separated by a distance D along the sequence, which can be represented as:

PSSM-SDT(j, D) = Σ_{i=1..L−D} M(i,j) · M(i+D,j) / (L − D)

where j represents one type of amino acid, L represents the length of the sequence, and M(i,j) and M(i+D,j) represent the PSSM scores of amino acid j at positions i and i+D, respectively. Through this approach, 20 × K features were generated, where K is the maximum range of D (D = 1, 2, …, K).

Likewise, utilizing PSSM-DDT, we compute the occurrence probabilities for pairs of different amino acids separated by a distance D along the sequence, which can be represented as:

PSSM-DDT(j1, j2, D) = Σ_{i=1..L−D} M(i,j1) · M(i+D,j2) / (L − D)

where j1 and j2 represent two different types of amino acids.

The total number of features obtained by PSSM-DDT is 380 × K. Here, we consider K = 5. Therefore, a total of 100 features was obtained by the PSSM-SDT transformation and a total of 1900 features by the PSSM-DDT transformation.
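A minimal sketch of the two PSSM-DT transformations, using a plain list-of-lists PSSM (a random stand-in for a real PSI-BLAST profile here) and the K = 5 setting from the text:

```python
import random

def pssm_sdt(pssm, K=5):
    """PSSM-SDT sketch: same-amino-acid pairs at distance D = 1..K.
    `pssm` is a list of L rows with 20 scores each; yields 20*K features."""
    L = len(pssm)
    feats = []
    for D in range(1, K + 1):
        for j in range(20):
            s = sum(pssm[i][j] * pssm[i + D][j] for i in range(L - D))
            feats.append(s / (L - D))
    return feats

def pssm_ddt(pssm, K=5):
    """PSSM-DDT sketch: different-amino-acid pairs at distance D = 1..K.
    Yields 20*19*K = 380*K features."""
    L = len(pssm)
    feats = []
    for D in range(1, K + 1):
        for j1 in range(20):
            for j2 in range(20):
                if j1 == j2:
                    continue
                s = sum(pssm[i][j1] * pssm[i + D][j2] for i in range(L - D))
                feats.append(s / (L - D))
    return feats

random.seed(0)
toy = [[random.random() for _ in range(20)] for _ in range(50)]  # stand-in PSSM
assert len(pssm_sdt(toy)) == 100     # 20 * 5
assert len(pssm_ddt(toy)) == 1900    # 380 * 5
```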

Unlike PSSM-DT, the EDT approximately measures the non-co-occurrence probability of two amino acids separated by a specific distance d in a protein sequence from the PSSM profile (Mishra, et al., 2018; Zhang, et al., 2014). The EDT is calculated from the PSSM profile as:

EDT(j1, j2) = (1/D) Σ_{d=1..D} [1/(L − d)] Σ_{i=1..L−d} ( M(i,j1) − M(i+d,j2) )²

where d is the distance separating two amino acids, D is the maximum value of d, M(i,j1) and M(i+d,j2) are the elements in the PSSM profile, and j1 and j2 represent any of the 20 standard amino acids in the protein sequence. Here, D = L_min − 1, where L_min is the length of the shortest protein sequence in the benchmark dataset. Using EDT, we obtain a feature vector of dimension 400.
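A sketch of the EDT computation; the averaged squared-difference form is an assumption consistent with the non-co-occurrence description (Zhang, et al., 2014), and the toy profile is purely illustrative:

```python
def edt_features(pssm, D):
    """EDT sketch: for each ordered pair of the 20 amino-acid columns,
    average the squared PSSM-score differences over distances d = 1..D.
    Yields 20 * 20 = 400 features. The squared-difference form is an
    assumed reading of the EDT of Zhang et al. (2014)."""
    L = len(pssm)
    feats = []
    for j1 in range(20):
        for j2 in range(20):
            total = 0.0
            for d in range(1, D + 1):
                total += sum((pssm[i][j1] - pssm[i + d][j2]) ** 2
                             for i in range(L - d)) / (L - d)
            feats.append(total / D)
    return feats

# Deterministic toy profile standing in for a real PSSM (L = 30).
toy_pssm = [[((i * 7 + j) % 10) / 10.0 for j in range(20)] for i in range(30)]
edt = edt_features(toy_pssm, D=5)
assert len(edt) == 400
```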

Features extracted from disordered properties

In this section, we describe a feature extraction technique utilized to obtain a fixed dimensional feature vector from the disordered properties of a protein. RBPs are found to bind RNA not only through classical structured RNA-binding domains but also through intrinsically disordered regions (IDRs) (Calabretta and Richard, 2015). For example, approximately 20% of the identified mammalian RBPs (~170 proteins) were found to be disordered by over 80% (Järvelin, et al., 2016). The energy contribution of the large number of inter- and intra-residue interactions in intrinsically disordered proteins (IDPs) cannot be approximated by the energy functions extracted from known structures (Iqbal, et al., 2015; Mishra and Hoque, 2017; Mishra, et al., 2016; Zhou and Skolnick, 2011), as IDPs lack a defined and ordered 3D structure (Babu, et al., 2011).

Therefore, to inherently incorporate important information regarding the IDRs and amino acid interactions, we use the residue contact energy matrix (RCEM), derived by the least-squares fitting of the primary sequences of 674 proteins with the contact energies derived from the tertiary structures of 785 proteins. As shown in Table 1, the RCEM is a 20 × 20 dimensional matrix.

MoRFs, also sometimes known as molecular recognition elements (MoREs), are disordered regions in a protein that exhibit various molecular recognition and binding functions (Vacic, et al., 2007). The residue-wise MoRF scores are first predicted for each protein; then, a single predicted MoRF score is computed by taking the ratio of the sum of the residue-wise MoRF scores to the length of the protein sequence.
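The protein-level MoRF feature described above reduces to a one-line aggregation; the residue scores below are made-up placeholders for the output of a MoRF predictor such as OPAL:

```python
def protein_morf_score(residue_scores):
    """Protein-level MoRF feature: ratio of the sum of the residue-wise
    predicted MoRF scores to the protein length, as described above."""
    return sum(residue_scores) / len(residue_scores)

# Hypothetical per-residue MoRF propensities for a 4-residue toy protein.
score = protein_morf_score([0.1, 0.8, 0.9, 0.2])
assert 0.0 <= score <= 1.0
```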

Performance evaluation
To evaluate the performance of AIRBP, we adopted the widely used 10-fold CV and independent testing approaches. In 10-fold CV, the dataset is segmented into 10 parts of about the same size. While one fold is kept aside for testing, the remaining 9 folds are used to train the classifier. This process of training and testing is repeated until each fold has been kept aside once for testing, and the test accuracies of the folds are then combined to compute the average (Hastie, et al., 2009). Unlike 10-fold CV, in independent testing the classifier is trained with the benchmark dataset and subsequently tested using the independent test dataset. During independent testing, it is ensured that none of the samples in the independent test set are present in the benchmark dataset. We used several performance evaluation metrics listed in Table 2, as well as the ROC curve and AUC, to test the performance of the proposed method and to compare it with existing approaches. AUC is the area under the receiver operating characteristic (ROC) curve, which is used to evaluate how well a predictor separates the two classes (RNA-binding and non-binding proteins).
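The 10-fold CV protocol described above can be sketched as follows; the shuffling seed and the fold-slicing scheme are illustrative details, not AIRBP's actual implementation:

```python
import random

def kfold_splits(n_samples, k=10, seed=42):
    """Sketch of k-fold CV: shuffle the sample indices, cut them into
    k near-equal folds, and yield (train, test) index lists so that each
    fold is held out for testing exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(kfold_splits(103, k=10))
assert len(splits) == 10  # each fold held out once
```

In practice, each (train, test) pair would be used to fit the classifier and accumulate a per-fold accuracy, which is then averaged.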
Table 2. Name and definition of the evaluation metrics.

Name of Metric Definition
True Positive (TP) Correctly predicted RNA-binding proteins

During the feature extraction process, we collected a feature vector of 2603 dimensions, which is significantly large. Therefore, to reduce the feature space and select the relevant features that could help improve the classification accuracy, we adopted two distinct feature selection approaches. The evolutionary approach iteratively updates the population through various operators, including elitism, crossover, and mutation, to discover, prioritize, and recombine good building blocks present in parent chromosomes and finally obtain fitter chromosomes (Hoque, et al., 2010; Hoque, et al., 2007; Hoque and Iqbal, 2017).
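A minimal sketch of such an evolutionary feature-selection loop with elitism, single-point crossover, and bit-flip mutation; the population size, generation count, and toy fitness function are illustrative, not AIRBP's actual settings:

```python
import random

def ga_select(fitness, n_features, pop_size=20, generations=15, seed=3):
    """GA sketch for feature-subset selection: chromosomes are bit masks
    over the features; elitism preserves the best masks, single-point
    crossover recombines building blocks, and bit-flip mutation explores."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        nxt = [list(c) for c in ranked[:2]]          # elitism: keep top 2
        while len(nxt) < pop_size:
            p1, p2 = rng.sample(ranked[:pop_size // 2], 2)  # fitter parents
            cut = rng.randrange(1, n_features)
            child = p1[:cut] + p2[cut:]              # single-point crossover
            child[rng.randrange(n_features)] ^= 1    # bit-flip mutation
            nxt.append(child)
        population = nxt
    return max(population, key=fitness)

# Toy fitness: reward masks close to a hidden "useful" feature subset.
# In AIRBP, the fitness would instead be a classifier's CV performance.
target = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
best = ga_select(lambda m: sum(a == b for a, b in zip(m, target)), 10)
assert len(best) == 10 and set(best) <= {0, 1}
```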

Stacking is an ensemble-based machine learning approach that combines information from multiple classification models arranged in levels: the classifiers at the first level are called base-classifiers, and the classifier at the second level is called the meta-classifier. In the first level, a set of base-classifiers C1, C2, …, CN is employed (Džeroski and Ženko, 2004), and their prediction probabilities are combined to train the meta-classifier. To select the classifiers to use in the first and second levels of the AIRBP stacking framework, we analyzed the performance of six individual classification methods, including: i) Random Decision Forest (RDF) (Ho, 1995); ii) the Extremely randomized tree (ET) classifier (Geurts, et al., 2006); and iii) the K-nearest neighbors (KNN) classifier, which predicts based on the majority votes obtained from the K nearest neighbors. For KNN, we set the value of K to 9 and the rest of the parameters to their default values.

All the classification methods mentioned above are built and optimized using Python's Scikit-learn library (Pedregosa, et al., 2012). In order to design a stacking framework for AIRBP, we evaluated different combinations of base-classifiers and finally selected the one that provided the highest performance.

The set of stacking frameworks tested comprises SF1, SF2, and SF3, described below.

Here, the choice of base-level classifiers is made such that the underlying principle of learning of each of the classifiers is different from the others (Mishra, et al., 2018). For example, SF1, SF2, and SF3 each contain a different tree-based ensemble method. Comparing the 10-fold CV performance of the above three combinations, we found that the first stacking framework, SF1, attains the highest performance. Therefore, we employ four classifiers, RDF, XGBoost, LogReg, and KNN, as the base classifiers and another XGBoost as the meta-classifier in the AIRBP stacking framework. In AIRBP, the prediction probabilities of the base classifiers are combined with the 1346 features selected by GA and provided as input features to the meta-classifier.
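An SF1-style arrangement can be sketched with scikit-learn's StackingClassifier. Here GradientBoostingClassifier stands in for XGBoost, the data are synthetic, and `passthrough=True` mimics feeding the selected features to the meta-classifier alongside the base-classifier probabilities; all hyperparameters are illustrative, not AIRBP's tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the GA-selected AIRBP feature vectors.
X, y = make_classification(n_samples=300, n_features=25, random_state=7)

# SF1-style stack: RDF, boosting, LogReg, and KNN at the base level,
# with a boosting meta-classifier at the second level.
base = [
    ('rdf', RandomForestClassifier(random_state=7)),
    ('gb', GradientBoostingClassifier(random_state=7)),  # XGBoost stand-in
    ('logreg', LogisticRegression(max_iter=1000)),
    ('knn', KNeighborsClassifier(n_neighbors=9)),
]
stack = StackingClassifier(
    estimators=base,
    final_estimator=GradientBoostingClassifier(random_state=7),
    passthrough=True,  # meta-classifier also sees the input features
)
stack.fit(X, y)
train_acc = stack.score(X, y)  # resubstitution accuracy on the toy data
```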

In this section, we first demonstrate the results of the feature selection. Then, we show the performance comparison of potential base-classifiers and stacking frameworks. Finally, we report the performance of AIRBP in comparison with existing methods.

To further select the classifiers to be used at the base level, we adopted the guideline of selecting base-classifiers with different underlying learning principles. Therefore, we used KNN and LogReg as two additional classifiers at the base level. Then, we added one of the three tree-based ensemble methods, RDF, Bag, and ET, at a time as the fourth base-classifier and designed three different combinations of stacking frameworks, namely SF1, SF2, and SF3. The performance comparison of the SF1, SF2, and SF3 stacking frameworks on the benchmark dataset using 10-fold CV is presented in Table 6. Table 6 demonstrates that both SF1 and SF3 outperform SF2. Moreover, SF1 gained performance similar to SF3 in terms of ACC, F1-score, and MCC. Since our aim in this research is to build a robust system, we further examined precision and FPR; it is evident that SF1 has higher precision and lower FPR compared to SF2. Hence, we select SF1, which includes RDF, XGBoost, LogReg, and KNN as base-classifiers and another XGBoost as the meta-classifier, as our final predictor.

Performance Comparison with Existing Approaches on the Benchmark Dataset

Here, we compare the performance of AIRBP with RBPPred (Zhang and Liu, 2017) on the benchmark dataset using the 10-fold CV approach. RBPPred is a top-performing existing approach for the prediction of RBPs directly from the sequence. Furthermore, it is to be noted that AIRBP uses the same benchmark dataset as RBPPred. Therefore, for the comparison, the quantities for all the evaluation metrics for RBPPred are obtained from Zhang and Liu (Zhang and Liu, 2017). The prediction results of AIRBP and RBPPred on the benchmark dataset computed using 10-fold CV are listed in Table 7.

From Table 7, we observe that AIRBP outperforms RBPPred on all the evaluation metrics applied in this study. In particular, AIRBP provides 8.55%, 1.50%, 3.26%, 4.84%, 6.75%, and 9.53% improvement over RBPPred in terms of SN, SP, ACC, PR, F1-score, and MCC, respectively. Besides, in Table 7, we report the values of BACC, FPR, and FNR only for the AIRBP predictor, as these metrics were not reported for RBPPred. Since our benchmark dataset is highly imbalanced (2767 RBPs and 6987 non-RBPs), which also reflects the natural frequency, we focus on comparing the predictors based on MCC and F1-score. MCC considers true and false positives as well as negatives and is generally regarded as a balanced measure that can be used even when the classes are of very different sizes.

Best score values are boldfaced. '%imp' stands for percentage improvement, '-' represents a missing value or a value not reported by RBPPred, and '(-)' denotes that the % imp. cannot be calculated.

Likewise, the F1-score is the harmonic average of precision and recall and is generally considered another balanced measure when the dataset is imbalanced. Because the F1-score uses the harmonic rather than the arithmetic mean, it provides a more appropriate score for the model. From Table 7, it is clear that, based on MCC and F1-score, AIRBP outperforms RBPPred by 9.53% and 6.75%, respectively.
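Both measures can be computed directly from the confusion counts defined in Table 2; this helper is a straightforward sketch, not AIRBP code:

```python
import math

def f1_and_mcc(tp, fp, tn, fn):
    """F1 (harmonic mean of precision and recall) and MCC computed from
    the confusion counts of Table 2."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return f1, mcc

# A perfect predictor scores 1.0 on both measures.
assert f1_and_mcc(tp=10, fp=0, tn=10, fn=0) == (1.0, 1.0)
```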

In this section, we further compare the performance of AIRBP with the RBPPred predictor on three different independent test sets: Human, S. cerevisiae, and A. thaliana. Here, we only report the comparison of AIRBP with RBPPred.

Filtered by removing sequences that are more than 25% similar between the training set and the ATH, SC, and Human datasets, respectively, using the CD-HIT program.

The selected features are used to train the ensemble of predictors at the first level (i.e., base-layer) of the stacking framework. Then, the prediction probabilities from the first-level predictors are combined with the originally selected features and used to train the predictor at the second level (i.e., meta-layer) of the stacking framework. Finally, the AIRBP stacking framework achieves a 10-fold CV accuracy, F1-score, and MCC of 95.38%, 0.917, and 0.885, respectively, on the benchmark dataset. While performing the independent test,