RFDTI: Using Rotation Forest with Feature Weighted for Drug-Target Interaction Prediction from Drug Molecular Structure and Protein Sequence

The identification and prediction of Drug-Target Interactions (DTIs) is the basis for screening drug candidates and plays a vital role in the development of innovative drugs. However, because biological experimental methods are time-consuming and costly, traditional drug target identification technologies are difficult to apply on a large scale. In silico methods are therefore urgently needed to predict drug-target interactions in a genome-wide manner. In this article, we design a new in silico approach, named RFDTI, that predicts DTIs by combining a Feature weighted Rotation Forest (FwRF) classifier with protein amino acid information. This model has two outstanding advantages: a) it uses fused data from protein sequences and drug molecular fingerprints, which carries comprehensive information; b) it uses a classifier with feature selection ability, which can effectively remove noisy information and improve prediction performance. More specifically, we first use the Position-Specific Scoring Matrix (PSSM) to convert protein sequences numerically and use the Pseudo Position-Specific Scoring Matrix (PsePSSM) to extract their features. A unified numerical descriptor is then formed by combining these features with molecular fingerprints representing drug information. Finally, the FwRF is applied to the Enzyme, Ion Channel, GPCR, and Nuclear Receptor data sets. Five-fold cross-validation shows that the prediction accuracy of this approach reaches 91.68%, 88.11%, 84.72% and 78.33% on the four benchmark data sets, respectively. To further validate the performance of RFDTI, we compare it with other excellent methods and with a Support Vector Machine (SVM) model. In addition, 7 of the 10 highest-scoring predictions of novel DTIs were validated by relevant databases. The cross-validation results indicate that RFDTI is feasible for predicting the relationships between drugs and targets and can help in the discovery of new candidate drugs.


Introduction
Identifying the interactions between drugs and targets is a crucial area of genomic drug discovery, which not only helps to understand various biological processes but also contributes to the development of new drugs [1,2]. The emergence of molecular medicine and the completion of the Human Genome Project provide better conditions for the identification of new drug target proteins. Although researchers have made great efforts, only a small number of candidate drugs have so far been approved by the Food and Drug Administration (FDA) to enter the market [3][4][5]. An important reason for this situation is the inherent limitations of experimental methods: biological laboratory methods for identifying DTIs are usually expensive, time-consuming, and limited to small-scale studies. In silico methods can narrow the scope of candidate targets and provide supporting evidence for drug target experiments, thus speeding up drug discovery. Therefore, in silico methods are urgently required to improve efficiency and reduce the time needed to identify potential DTIs across the genome [6][7][8].
In recent years, researchers have developed a variety of in silico methods to analyze and predict DTIs [9][10][11]. For example, Wu et al. [12] proposed the SDTBNI model in 2016, which searches for unknown DTIs through new chemical entity-substructure linkages, drug-substructure linkages and known DTI networks. Zhang et al. [13] proposed a novel DTI prediction model based on LPLNI, which uses data points reconstructed from their neighborhoods to calculate the linear neighborhood similarity between drugs. Based on biomedical data and a Linked Tripartite Network (LTN), Zong et al. [14] used target-target and drug-drug similarities calculated by DeepWalk to predict DTIs. In addition, Peng et al. [15] combined the biological information of targets and drugs with PCA-based convex optimization algorithms to predict new DTIs using a semi-supervised inference method. Ezzat et al. [16] used an ensemble learning algorithm to predict DTIs, reducing the features with subinterval features through three dimensionality reduction models.
Generally speaking, drugs with chemical similarity also have similar biochemical activity, that is, they can bind to similar target proteins. Based on this assumption, models that use the molecular structure information of drugs and the sequence information of proteins to predict DTIs have achieved good results. For example, Wen et al. [17] extracted drug and target features from their chemical substructure and sequence information, and used a deep belief network (DBN) to predict potential DTIs.
In this article, based on the assumption that the interaction between drugs and target proteins largely depends on the information in target protein sequences and drug molecular substructure fingerprints, a novel in silico model is proposed to infer potential DTIs. Our features combine the fingerprint of the drug molecular structure with the protein sequence encoded by a feature extraction method called Pseudo Position-Specific Scoring Matrix (PsePSSM). In the experiments, we adopt the FwRF classifier to make predictions on four DTI benchmark data sets: Enzyme, Ion Channel, GPCR and Nuclear Receptor. To verify the performance of the proposed model, we compare it with an SVM classifier model, different feature extraction models and existing excellent methods. Furthermore, in the case study, 7 of the top 10 DTIs with the highest prediction scores from the RFDTI model were confirmed. The promising experimental results show that RFDTI performs well and can effectively predict potential DTIs.

Evaluation Criteria
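In this paper, accuracy (Accu.), sensitivity (Sen.), precision (Prec.), and Matthews correlation coefficient (MCC) are used to estimate the performance of RFDTI. Their formulas are as follows:
$$\mathrm{Accu.} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Sen.} = \frac{TP}{TP + FN}$$
$$\mathrm{Prec.} = \frac{TP}{TP + FP}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP is the number of drug-target pairs that are related to each other and are correctly identified; FP is the number of drug-target pairs that are not related to each other but are incorrectly identified as related; TN is the number of drug-target pairs that are not related to each other and are correctly identified; FN is the number of drug-target pairs that are related to each other but are incorrectly identified as unrelated. Moreover, the receiver operating characteristic (ROC) curve [18][19][20] and the area under the ROC curve (AUC) are used to visually display the performance of the classifier.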
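As a minimal sketch, the four count-based measures named above can be computed from the confusion-matrix counts as follows. The function name `classification_metrics` and the example counts are hypothetical and are not taken from the experiments; the standard confusion-matrix definitions of TP, FP, TN and FN are assumed.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, precision and MCC from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators so the sketch never divides by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "mcc": mcc,
    }

# Illustrative counts only (not results from the paper).
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```

With these illustrative counts the accuracy is 0.85 and the sensitivity is 0.80. MCC stays informative when the two classes differ greatly in size, which is relevant for DTI data, where negative pairs greatly outnumber positive ones.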

Model Construction
To optimize the performance of the RFDTI, the grid search method is applied to

Evaluation of Model Prediction Ability
After finding the optimal parameters of RFDTI, we apply the model to the benchmark data sets: Enzyme, Ion Channel, GPCR and Nuclear Receptor. To avoid over-fitting, we use five-fold cross-validation to evaluate the performance of the model. More specifically, we split each data set into five subsets; in each round, one subset is taken as the test set and the remaining four are used as the training set. The process is repeated for five rounds so that each subset serves once as the test set, and the results of the five rounds are averaged to produce the final result.
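The five-fold procedure described above can be sketched as follows. This is an illustrative outline only: the helper names `k_fold_indices` and `cross_validate`, the fixed random seed and the majority-class baseline are assumptions made for the example, and the baseline stands in for the actual classifier.

```python
import random

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

def cross_validate(samples, labels, train_and_score, n_folds=5, seed=0):
    """Average the score returned by `train_and_score` over all folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(samples), n_folds, seed):
        train = [(samples[i], labels[i]) for i in train_idx]
        test = [(samples[i], labels[i]) for i in test_idx]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)

def majority_baseline(train, test):
    """Hypothetical stand-in model: predict the most frequent training label."""
    train_labels = [y for _, y in train]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(1 for _, y in test if y == majority) / len(test)

# Toy data: 5 positive and 15 negative samples.
toy_samples = list(range(20))
toy_labels = [1] * 5 + [0] * 15
mean_accuracy = cross_validate(toy_samples, toy_labels, majority_baseline)
```

Because the folds are disjoint and together cover the whole data set, every drug-target pair contributes to the test score exactly once.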

Comparison between RFDTI and LPQ descriptor model
To evaluate the impact of the PsePSSM algorithm on the proposed model, we compare it with Local Phase Quantization (LPQ) on the four benchmark data sets in this section. The LPQ feature extraction algorithm is based on the blur invariance property of the Fourier phase spectrum and was originally proposed for texture description by Ojansivu and Heikkilä [21].

Comparison with existing methods
The prediction of relationships between drugs and targets has drawn increasing interest from researchers, and many excellent computational approaches have been designed so far. To better verify the proposed approach, we compare it with other existing methods using five-fold cross-validation on the same benchmark data sets. Table 7 lists the AUC values of other excellent methods and RFDTI on the four benchmark data sets. It can be seen that the results obtained by RFDTI on the Enzyme and Ion Channel data sets are significantly higher than those of other existing methods, and the result achieved by RFDTI on the GPCR data set is only 1.13% lower than the highest result. The performance of RFDTI on the Nuclear Receptor data set is less satisfactory, possibly because the Nuclear Receptor data set contains too few samples for the classifier to be trained sufficiently.

Case study
To further validate RFDTI's ability to predict potential DTIs, we use all known interactions to train the model and then predict unknown interactions. We selected the 10 drug-target pairs with the highest predictive scores and validated them in SuperTarget [23], a database that collects drug-target relations and currently stores 332,828 DTIs. As shown in Table 8, 7 of the top 10 highest-scoring predictions of the proposed model were confirmed in the database. This result indicates that RFDTI can effectively predict potential DTIs. It is worth noting that although we have not found evidence for the interaction of the remaining 3 drug-target pairs, we cannot rule out the possibility that they interact.
However, only 90 known drug-target interactions have been detected experimentally. The positive drug-target pairs (90) therefore account for only 6.41% of all drug-target pairs (1404), far fewer than the negative drug-target pairs (1404 - 90 = 1314). The same problem also appears in the other three data sets. In order to solve the problem of data imbalance, we

Molecules description
In recent years, different types of descriptors have been proposed to represent drug compounds, such as quantum chemical, topological, constitutional and geometrical descriptors. Because the molecular substructure fingerprint does not require the three-dimensional structure of the molecule and directly reflects the relationship between molecular properties and structure, more and more researchers use it as a descriptor to predict the relationship between drugs and target proteins. Specifically, all molecular substructures are first stored in the form of a dictionary, and a given drug molecule is then split into substructures. When the molecule contains a certain substructure, the corresponding bit of the descriptor is set to 1; otherwise it is set to 0. The drug molecule is thus represented as a Boolean vector. In the experiment, we use the chemical structure fingerprint set from the PubChem System, where the fingerprint property is "PUBCHEM_CACTVS_SUBGRAPHKEYS". Since each drug fingerprint records 881 substructures, the drug molecular fingerprint descriptor is 881-dimensional.
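The dictionary-based encoding described above can be illustrated with a toy sketch. The four-entry dictionary and the substructure labels below are hypothetical placeholders, whereas the PubChem fingerprint has 881 entries. Detecting which substructures a molecule contains would normally be done by a cheminformatics toolkit and is assumed to have happened already.

```python
# Hypothetical four-entry substructure dictionary (the PubChem set has 881 entries).
SUBSTRUCTURE_DICTIONARY = [
    "benzene_ring",
    "carboxyl_group",
    "hydroxyl_group",
    "amine_group",
]

def fingerprint(present_substructures, dictionary=SUBSTRUCTURE_DICTIONARY):
    """Return a 0/1 vector with one bit per dictionary entry."""
    present = set(present_substructures)
    return [1 if name in present else 0 for name in dictionary]

# Toy molecule described only by the labels of the substructures it contains.
toy_bits = fingerprint({"benzene_ring", "carboxyl_group", "hydroxyl_group"})
```

The bit order is fixed by the dictionary, so two drugs can be compared position by position and the vector length is the same for every molecule.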

Numerical characterization of protein sequences
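In the experiment, we used the Position-Specific Scoring Matrix (PSSM) to convert protein sequences numerically [30]. PSSM is widely used in protein binding site prediction, protein secondary structure prediction, and prediction of disordered regions. A PSSM is an L × 20 matrix that can be expressed as $P = \{\partial_{i,j} : i = 1, \dots, L;\ j = 1, \dots, 20\}$, where 20 represents the number of amino acids and L denotes the length of the protein sequence. The PSSM can be written as follows:
$$P = \begin{bmatrix} \partial_{1,1} & \partial_{1,2} & \cdots & \partial_{1,20} \\ \partial_{2,1} & \partial_{2,2} & \cdots & \partial_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ \partial_{L,1} & \partial_{L,2} & \cdots & \partial_{L,20} \end{bmatrix}$$
where $\partial_{i,j}$ denotes the probability that the ith residue is mutated into the jth amino acid during the evolutionary process of protein multiple sequence alignment.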
In the experiment, we use Position-Specific Iterated BLAST (PSI-BLAST) tool [31]

Feature extraction algorithm
Effective protein feature descriptors can not only mine useful information, but also improve the performance of the approach. In this study, we use the feature extraction algorithm Pseudo Position-Specific Scoring Matrix (PsePSSM), a concept introduced by Chou et al. [32]. The PsePSSM is expressed in terms of a correlation factor for each j-type amino acid, computed at a given contiguous distance along each segmented protein sequence. The flow chart of the proposed model is shown in the accompanying figure.
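Since the formula itself is not reproduced in this text, the following sketch assumes the general PsePSSM definition associated with Chou and co-workers: the descriptor concatenates the 20 column means of the PSSM with, for each contiguous distance $\lambda$, the correlation factors $\theta_j^{\lambda} = \frac{1}{L-\lambda}\sum_{i=1}^{L-\lambda}(\partial_{i,j} - \partial_{i+\lambda,j})^2$. The symbols $\theta_j^{\lambda}$ and $\lambda$, the function name `pse_pssm` and the toy matrix are assumptions. Any normalization or sequence segmentation used in the experiments is not included.

```python
def pse_pssm(pssm, max_lag):
    """Return column means followed by correlation factors for lags 1..max_lag.

    `pssm` is a list of L rows with one score per amino-acid type (20 columns
    for a real PSSM). Requires 0 <= max_lag < L.
    """
    length = len(pssm)
    n_cols = len(pssm[0])
    if not 0 <= max_lag < length:
        raise ValueError("max_lag must satisfy 0 <= max_lag < L")
    # First block: average score of each column over the whole sequence.
    features = [sum(row[j] for row in pssm) / length for j in range(n_cols)]
    # Following blocks: mean squared difference between rows `lag` apart.
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum(
                (pssm[i][j] - pssm[i + lag][j]) ** 2 for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features

# Toy 3 x 20 matrix: row i holds the value i in every column.
toy_pssm = [[float(i)] * 20 for i in range(3)]
toy_descriptor = pse_pssm(toy_pssm, max_lag=1)
```

Under this definition the descriptor has 20 + 20 × max_lag entries regardless of the sequence length L, which is what allows proteins of different lengths to be fed to the same classifier.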
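The weight of a feature F with respect to the classification problem is calculated with the $\chi^2$ statistic:
$$\chi^2 = \sum_{i=1}^{n}\sum_{j=1}^{2}\frac{(\eta_{ij} - \lambda_{ij})^2}{\lambda_{ij}}$$
where n is the number of values in feature F, and $\eta_{ij}$ is the number of samples in class $c_j$ whose feature value is $v_i$, which can be expressed as a formula:
$$\eta_{ij} = \mathrm{count}(F = v_i \ \text{and} \ C = c_j)$$
$\lambda_{ij}$ is the expected value of $v_i$ and $c_j$, which is defined as follows:
$$\lambda_{ij} = \frac{\mathrm{count}(F = v_i) \times \mathrm{count}(C = c_j)}{S}$$
where $\mathrm{count}(C = c_j)$ is the number of samples with the value $c_j$ in class C, $\mathrm{count}(F = v_i)$ is the number of samples whose value of feature F is $v_i$, and S is the number of samples.
The implementation steps of the feature weighted rotation forest are as follows: first, the weights of all features are calculated by equation 10; second, the features are sorted according to their weights; finally, the desired features are selected according to the given feature selection rate r. After performing these steps, we obtain a new data set and pass it to the rotation forest.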
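The three steps above (weighting, sorting, selecting by rate r) can be sketched as follows. The sketch assumes a chi-square statistic between each discrete feature and the class label as the weight. The function names, the rounding rule for the number of kept features and the toy data are hypothetical. Continuous features would first need to be discretized, which is not shown.

```python
from collections import Counter

def chi_square_weight(feature_values, labels):
    """Chi-square statistic between one discrete feature and the class labels."""
    n_samples = len(labels)
    observed = Counter(zip(feature_values, labels))
    value_counts = Counter(feature_values)
    class_counts = Counter(labels)
    weight = 0.0
    for value, n_value in value_counts.items():
        for cls, n_class in class_counts.items():
            expected = n_value * n_class / n_samples
            weight += (observed[(value, cls)] - expected) ** 2 / expected
    return weight

def select_features(X, y, rate):
    """Keep the top `rate` fraction of the columns of X, ranked by weight."""
    n_features = len(X[0])
    weights = [
        chi_square_weight([row[j] for row in X], y) for j in range(n_features)
    ]
    n_keep = max(1, round(rate * n_features))
    ranked = sorted(range(n_features), key=lambda j: weights[j], reverse=True)
    kept = sorted(ranked[:n_keep])  # preserve the original column order
    reduced = [[row[j] for j in kept] for row in X]
    return reduced, kept, weights

# Toy data: column 0 equals the label, column 1 is constant, column 2 is noise.
toy_X = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 0, 0]]
toy_y = [1, 1, 0, 0]
toy_reduced, toy_kept, toy_weights = select_features(toy_X, toy_y, rate=0.34)
```

In the toy data only the first column carries class information, so it receives the only non-zero weight and is the single column kept at this selection rate.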

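Assume that $\{x_i, y_i\}$ contains S training samples, where $x_i = (x_{i1}, x_{i2}, \dots, x_{in})$ is an n-dimensional feature vector. Let X be the training sample set, Y the corresponding labels and F the feature set. Then X is an S × n matrix composed of S observations of n-dimensional feature vectors. Assuming that the number of decision trees is N, the decision trees can be expressed as $T_1, T_2, \dots, T_N$. The algorithm is executed in the following steps.
(1) Using an appropriate parameter K, F is randomly divided into K independent and disjoint subsets, each containing n/K features.
(2) For each subset $F_{i,j}$, the corresponding columns of features are selected from the training set X to form a new matrix $X_{i,j}$. Then, 75% of the data is extracted from $X_{i,j}$ by bootstrap sampling to form a new training set $X'_{i,j}$.
(3) Following the standard rotation forest procedure, principal component analysis is applied to each $X'_{i,j}$, and the resulting coefficients are arranged into a rotation matrix $R_i$; its columns are rearranged to match the original feature order, giving $R_i^a$, and the tree $T_i$ is trained on $X R_i^a$.
In classification, the classifier $T_i$ produces $d_{i,j}(x R_i^a)$, the probability that the test sample x belongs to class $y_j$. The confidence of each class is then calculated by the following average combination formula:
$$\mu_j(x) = \frac{1}{N}\sum_{i=1}^{N} d_{i,j}(x R_i^a)$$
Finally, x is assigned to the class with the largest $\mu_j(x)$ value.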
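The final combination step can be illustrated in isolation: each tree returns one probability per class for a test sample, the probabilities are averaged over the trees, and the class with the largest mean is chosen. The function name `average_confidence` and the three per-tree outputs below are hypothetical. Training the trees and building the rotation matrices are not shown.

```python
def average_confidence(per_tree_probabilities):
    """Average class-probability vectors over trees; return (means, best_class)."""
    n_trees = len(per_tree_probabilities)
    n_classes = len(per_tree_probabilities[0])
    means = [
        sum(tree[j] for tree in per_tree_probabilities) / n_trees
        for j in range(n_classes)
    ]
    best_class = max(range(n_classes), key=lambda j: means[j])
    return means, best_class

# Hypothetical outputs of three trees for one test sample: [P(class 0), P(class 1)].
tree_outputs = [[0.2, 0.8], [0.4, 0.6], [0.7, 0.3]]
confidences, predicted = average_confidence(tree_outputs)
```

Averaging probabilities rather than counting votes lets a confident tree outweigh a marginal one, and the averaged confidence can also serve as the prediction score used to rank candidate drug-target pairs.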

Conclusions
Prediction of DTIs is a crucial problem for human medical improvement and genomic drug discovery. Under the hypothesis that drug molecular structures and protein amino acid sequences have a large impact on the relationships between drugs and target proteins, the RFDTI model is proposed in this article to infer potential drug-target relationships. We applied it to the Enzyme, Ion Channel, GPCR and Nuclear Receptor data sets and obtained excellent results. To further evaluate the performance of the proposed approach, we compared it with the LPQ descriptor model, the SVM classifier model and other existing methods on the same data sets. Moreover, 7 of the top 10 drug-target pairs predicted by the RFDTI model were confirmed by an independent database. The competitive cross-validation results show that RFDTI performs well and is stable and reliable.