KinasepKipred: A Predictive Model for Estimating Ligand-Kinase Inhibitor Constant (pKi)

Kinases are one of the most important classes of drug targets for therapeutic use. Algorithms that can accurately predict the ligand-kinase inhibitor constant (pKi) can considerably accelerate the drug discovery process. In this study, we developed computational models, leveraging machine learning techniques, to predict ligand-kinase pKi values. Kinase-ligand inhibitor constant (Ki) data were retrieved from the Drug Target Commons (DTC) and Metz databases. Machine learning models were developed based on structural and physicochemical features of the proteins and topological pharmacophore atomic triplet fingerprints of the ligands. Three machine learning models [random forest (RFR), extreme gradient boosting (XGBoost), and artificial neural network (ANN)] were tested for model development. The performance of our models was evaluated using several metrics with 95% confidence intervals. The RFR model was finally selected based on the evaluation metrics on the test datasets and used for the web implementation. The selected model achieved a Pearson correlation coefficient (R) of 0.887 (0.881, 0.893), a root-mean-square error (RMSE) of 0.475 (0.465, 0.486), a concordance index (Con. Index) of 0.854 (0.851, 0.858), and an area under the receiver operating characteristic curve (AUC-ROC) of 0.957 (0.954, 0.960) during the internal 5-fold cross validation.

Availability: GitHub: https://github.com/sirimullalab/KinasepKipred, Docker: sirimullalab/kinasepkipred
Implementation: https://drugdiscovery.utep.edu/pki/


Introduction
Knowledge of drug-target interactions facilitates drug side effect prediction, 1 drug repurposing, 2 and many other applications. Biochemical experiments for determining drug-target interactions are costly and time-consuming, 3 whereas computational methods are efficient, faster, and more convenient. 4 Proteins are good targets in drug design 5 and are activated or inhibited by drug compounds. The binding of a protein to a ligand is specific and plays a crucial role in many biological functions. 6 A protein kinase is an enzyme that plays a major role in cellular signal transduction by transferring a phosphate group from adenosine triphosphate (ATP) to the serine, threonine, or tyrosine residues of other proteins. 7 In this study, we focus on protein kinases because of their importance as drug targets for therapeutic use. They play important roles in a wide range of diseases such as cardiovascular disorders, inflammatory diseases, gastrointestinal stromal tumors, and cancer. 8 Kinase inhibitors that inhibit the activity of deregulated protein kinases are very efficacious in treating many diseases. 9 The US Food and Drug Administration has approved 48 kinase inhibitors, 9 and the largest class of newly approved drugs is for cancer treatment. The interaction between a kinase and a ligand is usually measured as a binding affinity value in terms of the dissociation constant (Kd), inhibition constant (Ki), or half-maximal inhibitory concentration (IC50). In this paper, we used only the data corresponding to Ki values for our dataset.

Algorithms that predict drug-target associations 10,11 and binding affinities have been published previously by other researchers. 6,10,12-14 Yamanishi et al. 10 proposed a supervised machine learning approach that treats the task as a binary classification problem, categorizing drug-target pairs as interacting or non-interacting based on an integrated model of chemical and genomic molecular profiles. Pahikkala et al. 12 introduced KronRLS, the first method to predict non-binary drug-target binding affinity values. More recently, SimBoost, 13 DeepDTA, 14 and Indra et al. 6 used machine learning models to predict binding affinity scores. SimBoost proposed a gradient boosting machine learning model that uses feature engineering to predict binding affinity values. It is the first non-linear method for continuous drug-target interaction prediction. DeepDTA is a deep learning based approach that predicts drug-target binding affinity using only the sequences of proteins and drugs. It uses convolutional neural networks (CNNs) to learn representations from the raw sequence data of proteins and drugs, and fully connected layers for the prediction. SimBoost and DeepDTA are based on the Davis, 15 Metz, 16 and KIBA 13 datasets. Indra et al. 6 developed a random forest model based on the PDBbind 2015 database. All of the aforementioned studies were generalized to all types of proteins and were not kinase specific; in addition, the proposed models were not evaluated on independent datasets.
To the best of our knowledge, there are no algorithms that are specific to kinase Ki prediction.
In this study, three machine learning models, random forest (RFR), extreme gradient boosting (XGBoost), and artificial neural network (ANN), were developed, and their performance was compared on the test dataset and an external dataset. The best model according to the evaluated statistical metrics was selected and used for the web implementation.

Datasets
Data were obtained from two databases: 1) Drug Target Commons (DTC) 17 and 2) Metz. 16 The DTC data contained 5,867,349 ligand-target pair associations. The dataset was populated with the bioactivity types Ki (inhibition constant), Kd (dissociation constant), and IC50 (half-maximal inhibitory concentration) for most of the ligand-target pairs. The "potent targets" and "potent inhibitors" were defined based on cutoffs for the four most popular bioactivity types (Kd, Ki, IC50, and activity): ≤ 100 nM (i.e., a pKi of ≥ 7) for the dose-response measurements (Kd, Ki, IC50) in biochemical assays, and ≤ 1000 nM (i.e., a pKi of ≥ 6) for the dose-response measurements (Kd, Ki, IC50) in cell-based and other assay types. 17 Where multiple bioactivity values were available, the median was taken. Since we focused only on kinases with Ki values, we retained only the data pertinent to them. This yielded 67,894 instances (ligand-target pairs) from 5,983 compounds and 118 kinases. All Ki values were converted into molar units and then recorded as the negative decadic logarithm, as shown in equation (1):

pKi = -log10(Ki)   (1)

where Ki is expressed in mol/L. We used 75% of the data for the training set and 25% for the test set.

The dataset by Metz et al. 16 was also used as an external evaluation set for the model. It contained 150,000 instances of ligand-kinase Ki values, composed of more than 3,800 compounds tested against 172 protein kinases. We filtered the Metz data to create a blind dataset of 148 kinases and 240 compounds, contributing 17,258 drug-kinase pairs that are distinct from the DTC dataset and were not used in our training or test data.
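The preprocessing steps described for the DTC data (median aggregation of repeated measurements, conversion of Ki to pKi, and a 75/25 split) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout, the identifiers `cmpd_A` and `kinase_X`, the function names, the random seed, and the choice to take the median before the logarithm are all assumptions made for the example.

```python
from collections import defaultdict

import numpy as np


def ki_nm_to_pki(ki_nm):
    """Convert Ki in nanomolar to pKi = -log10(Ki in mol/L)."""
    ki_molar = np.asarray(ki_nm, dtype=float) * 1e-9  # nM -> M
    return -np.log10(ki_molar)


def aggregate_median(records):
    """Collapse repeated (compound, kinase) measurements to their median Ki.

    `records` is an iterable of (compound_id, kinase_id, ki_nm) tuples;
    this layout is hypothetical.
    """
    grouped = defaultdict(list)
    for compound, kinase, ki in records:
        grouped[(compound, kinase)].append(ki)
    return {pair: float(np.median(values)) for pair, values in grouped.items()}


def split_75_25(n_items, seed=0):
    """Return shuffled index arrays for a 75% training / 25% test split."""
    rng = np.random.default_rng(seed)  # seed is an arbitrary assumption
    order = rng.permutation(n_items)
    cut = int(round(0.75 * n_items))
    return order[:cut], order[cut:]


# Hypothetical measurements in nM; cmpd_A has three replicates.
records = [
    ("cmpd_A", "kinase_X", 80.0),
    ("cmpd_A", "kinase_X", 100.0),
    ("cmpd_A", "kinase_X", 120.0),
    ("cmpd_B", "kinase_X", 1000.0),
]
median_ki = aggregate_median(records)
pki = {pair: float(ki_nm_to_pki(ki)) for pair, ki in median_ki.items()}
train_idx, test_idx = split_75_25(8)
```

With these inputs, a median Ki of 100 nM maps to a pKi of about 7 and 1000 nM to about 6, which matches the potency cutoffs quoted above.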

Model development
Models were developed mainly using the grid search method with 5-fold cross validation. They were built with the Scikit-Learn machine learning library for Python. 26 We used 75% of the data for the training set and 25% for the test set. Three machine learning models (random forest, extreme gradient boosting, and artificial neural network) were developed, and their performance was compared using several evaluation metrics. A more detailed explanation of these three models is available in the Supporting Information. We also compared grid search with random search 27 for our random forest model to estimate the efficiency of each method.
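As an illustration of the tuning procedure described above, the following sketch runs a grid search and a random search with 5-fold cross validation for a random forest regressor in Scikit-Learn. The synthetic data, the hyperparameter values, the RMSE-based scoring choice, and the number of random-search iterations are assumptions chosen to keep the example small; they are not the settings used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     train_test_split)

# Synthetic stand-in for the feature matrix and pKi labels (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

# 75% training / 25% test split, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Hypothetical search space; the real grid is not given in this section.
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [4, None],
    "max_features": [0.5, 1.0],
}

# Exhaustive grid search: all 8 combinations, each with 5-fold CV.
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_train, y_train)

# Random search: samples only 4 of the 8 combinations.
rand = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_grid,
    n_iter=4,
    cv=5,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
rand.fit(X_train, y_train)

print("grid search best:", grid.best_params_, -grid.best_score_)
print("random search best:", rand.best_params_, -rand.best_score_)
```

Random search fits half as many candidates here, which is the efficiency trade-off being compared: it is cheaper but may miss the best combination that an exhaustive grid would find.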

Results and Discussions
The developed models were evaluated using several metrics such as root-mean-square-error (RMSE), Pearson correlation coefficient (R), Spearman correlation coefficient (ρ), concor-

Performance of different ML algorithms on an external dataset
All three models were further tested on the Metz dataset. 16
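The evaluation metrics named in this paper (RMSE, Pearson R, Spearman ρ, concordance index, and AUC-ROC) could be computed for a set of predictions along the following lines. This is a sketch under stated assumptions: the pairwise concordance index treats tied predictions as half-concordant, and the AUC-ROC binarizes the true values at a pKi threshold of 7, borrowed from the potency cutoff quoted in the Datasets section. The threshold actually used by the authors is not stated here, and the four data points are invented for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import roc_auc_score


def rmse(y_true, y_pred):
    """Root-mean-square error between true and predicted pKi."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def concordance_index(y_true, y_pred):
    """Fraction of pairs with different true values that are ranked correctly.

    Pairs with tied predictions count as 0.5. Uses O(n^2) memory, so it is
    only suitable for small arrays as written.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff_true = y_true[:, None] - y_true[None, :]
    diff_pred = y_pred[:, None] - y_pred[None, :]
    comparable = diff_true > 0  # each unordered pair is counted once
    score = (diff_pred > 0).astype(float) + 0.5 * (diff_pred == 0)
    return float(score[comparable].sum() / comparable.sum())


def evaluate(y_true, y_pred, threshold=7.0):
    """Collect the metrics; `threshold` (pKi) is an assumed cutoff."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    r, _ = pearsonr(y_true, y_pred)
    rho, _ = spearmanr(y_true, y_pred)
    labels = (y_true >= threshold).astype(int)
    return {
        "rmse": rmse(y_true, y_pred),
        "pearson_r": float(r),
        "spearman_rho": float(rho),
        "con_index": concordance_index(y_true, y_pred),
        "auc_roc": float(roc_auc_score(labels, y_pred)),
    }


# Invented true and predicted pKi values.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 5.8, 7.4, 7.1]
metrics = evaluate(y_true, y_pred)
```

In this toy example five of the six comparable pairs are ordered correctly, so the concordance index is 5/6, while the AUC-ROC is 1.0 because both compounds at or above the threshold receive higher predictions than both below it.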