Abstract
Summary Kinases are one of the most important classes of drug targets for therapeutic use. Algorithms that accurately predict the inhibition constant of drug-kinase pairs, expressed as pKi, can considerably accelerate the drug discovery process. In this study, we developed machine learning models to predict ligand-kinase pKi values. Kinase-ligand inhibition constant (Ki) data were retrieved from the Drug Target Commons (DTC) and Metz databases. The models were built on structural and physicochemical features of the proteins and on topological pharmacophore atomic triplets fingerprints of the ligands. Three machine learning models [random forest (RF), extreme gradient boosting (XGBoost) and artificial neural network (ANN)] were tested. The RF model was selected based on its evaluation metrics on the test datasets and was used for the web implementation.
Availability GitHub: https://github.com/sirimullalab/KinasepKipred, Docker: sirimullalab/kinasepkipred
Implementation https://drugdiscovery.utep.edu/pki
Contact ssirimulla{at}utep.edu
Supplementary information Supplementary data are available at Bioinformatics online.
1 Introduction
Protein kinases play important roles in a wide range of diseases such as cardiovascular disorders, inflammatory diseases, gastrointestinal stromal tumors and cancer, and can serve as drug targets for therapeutic use (Fabbro (2015)). Kinase inhibitors, which block the activity of deregulated protein kinases, form the largest class of new drugs approved for cancer treatment. The strength of a kinase-inhibitor interaction is usually reported as a binding affinity, in terms of the dissociation constant (Kd), inhibition constant (Ki) or half-maximal inhibitory concentration (IC50). We have developed a predictive model to estimate kinase-ligand pKi values, so our dataset contains only Ki measurements. Algorithms that predict drug-target associations (Yamanishi et al. (2008); Li and Lai (2007)) and binding affinities (Pahikkala et al. (2014); He et al. (2017); Öztürk et al. (2018); Kundu et al. (2018)) have been published previously. However, to the best of our knowledge, no existing algorithm is specific to kinase Ki prediction.
2 Methods
2.1 Datasets
Kinase-drug Ki data were obtained from two publicly available databases (Tang et al. (2018); Metz et al. (2011)). The drug-target interaction (DTI) dataset from the Drug Target Commons (DTC) (Tang et al. (2018)) was used for model development and initial testing. Because we focused only on kinases, only DTI pairs with Ki values, covering 118 kinases and 5,983 compounds, were used. The final dataset contained 67,894 instances. All Ki values were converted into molar units and recorded as their negative decadic logarithm (pKi = -log10 Ki). Data from Metz et al. (Metz et al. (2011)) were used as the external dataset for evaluating our models. We filtered the Metz data to remove all entries that overlapped with DTC. The resulting external dataset contained 148 kinases and 240 compounds, giving 17,258 unique DTI pairs (with Ki values) that were not used in our training or test data.
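The unit conversion and overlap filtering described above can be illustrated with a minimal sketch. It is not the authors' preprocessing code: the function names, the unit table, and the choice to define overlap by exact (compound, kinase) pair identity are all assumptions made for illustration.

```python
import math

# Conversion factors from common concentration units to molar (M).
# The set of units handled here is an assumption, not taken from the paper.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def ki_to_pki(value, unit="nM"):
    """Convert a Ki measurement to pKi = -log10(Ki in mol/L)."""
    if value <= 0:
        raise ValueError("Ki must be positive")
    return -math.log10(value * UNIT_TO_MOLAR[unit])


def remove_overlap(external_pairs, training_pairs):
    """Keep external (compound, kinase) pairs that do not occur in the training source.

    Overlap is defined here by exact pair identity (hypothetical criterion).
    """
    seen = set(training_pairs)
    return [pair for pair in external_pairs if pair not in seen]


# Toy identifiers, not real database records.
dtc_pairs = [("cmpd1", "kinaseA"), ("cmpd2", "kinaseA")]
metz_pairs = [("cmpd1", "kinaseA"), ("cmpd3", "kinaseB")]
external = remove_overlap(metz_pairs, dtc_pairs)
print(ki_to_pki(1.0, "nM"), external)
```

On this scale a larger pKi means a more potent inhibitor, which is why the logarithmic form is convenient as a regression target.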
2.2 Molecular features
Protein features
Protein features were generated using the Python-based tool propy (Cao et al. (2013)). These are structural and physicochemical features derived from the amino acid sequence.
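As a rough illustration of what a sequence-derived descriptor looks like, the sketch below computes amino acid composition, one of the simplest descriptors of this kind. It uses only the Python standard library and does not call propy; the function name and toy sequence are hypothetical, and propy's own output scaling and feature set may differ.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in a protein sequence."""
    counts = Counter(sequence.upper())
    # Non-standard characters are ignored when computing the total.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


features = amino_acid_composition("MKKLLA")  # toy sequence, not a real kinase
print(features["K"], features["L"])
```

Each protein is thus mapped to a fixed-length numeric vector regardless of sequence length, which is what allows it to be fed to a standard regression model.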
Ligand features
Topological pharmacophore atomic triplets fingerprints (TPATFP) were generated for the ligands using Perl scripts from MayaChemTools (Sud (2016)). A more detailed explanation of the protein and ligand features is provided in the supporting information.
2.3 Model development
Models were developed using the Scikit-Learn machine learning library for Python (Pedregosa et al. (2011)), with hyperparameters tuned mainly by grid search with 5-fold cross-validation. We used 75% of the data for the training set and 25% for the test set. Three machine learning models (random forest, extreme gradient boosting, and artificial neural network) were developed, and their performance was compared using several evaluation metrics. A more detailed explanation of these three models is available in the supporting information. We also compared grid search with random search (Bergstra and Bengio (2012)) for our random forest model to estimate the efficiency of each method.
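The tuning procedure can be sketched with Scikit-Learn as follows. This is a minimal example under stated assumptions rather than the authors' pipeline: the data are synthetic stand-ins for the concatenated protein and ligand features, and the parameter grid, random seeds and scoring function are hypothetical choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): 200 samples, 10 features,
# with a pKi-like target that depends on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 6.0 + 0.8 * X[:, 0] + rng.normal(scale=0.1, size=200)

# 75% training / 25% test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical grid; the grid actually searched is not given in this section.
param_grid = {"n_estimators": [20, 50], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,  # 5-fold cross-validation on the training set
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

# The held-out test set is used only once, after tuning.
test_rmse = float(np.sqrt(np.mean((search.predict(X_test) - y_test) ** 2)))
print(search.best_params_, test_rmse)
```

For a comparison with random search, `RandomizedSearchCV` from the same module has a similar interface, taking `param_distributions` and `n_iter` in place of an exhaustive grid.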
3 Results
The developed models were evaluated using several metrics, including root-mean-square error (RMSE), Pearson correlation coefficient (R), Spearman correlation coefficient (ρ), concordance index (Con. Index), and area under the receiver operating characteristic curve (AUC-ROC). Tables 1, 2 and 3 show the scores obtained by the three models on the test set and on the Metz dataset (external test dataset). Among the three models, random forest performed best, with R 0.887, ρ 0.846, RMSE 0.475, Con. Index 0.854, and AUC 0.957 on the test set, and R 0.769, ρ 0.669, RMSE 0.503, Con. Index 0.749, and AUC 0.938 on the external dataset. More details about the results and a comparative study can be found in the supporting information.
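For readers unfamiliar with these metrics, the following standard-library sketch shows one common way to compute each of them. It is illustrative only: the function names are hypothetical, the toy values are not taken from the paper, and the activity threshold used to binarize pKi for AUC-ROC is an assumption, since the threshold is not stated in this section.

```python
import math


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    # Spearman's rho is the Pearson correlation of the rank-transformed data.
    return pearson(average_ranks(x), average_ranks(y))


def concordance_index(y_true, y_pred):
    """Fraction of pairs with distinct true values ranked in the right order.

    Pairs whose predictions are tied count as half concordant.
    """
    concordant, comparable = 0.0, 0
    for i in range(len(y_true)):
        for j in range(i + 1, len(y_true)):
            if y_true[i] == y_true[j]:
                continue
            comparable += 1
            agreement = (y_true[i] - y_true[j]) * (y_pred[i] - y_pred[j])
            if agreement > 0:
                concordant += 1.0
            elif agreement == 0:
                concordant += 0.5
    return concordant / comparable


def auc_roc(y_true, y_pred, threshold):
    """AUC after labelling y_true >= threshold as active (threshold is an assumption)."""
    pos = [p for t, p in zip(y_true, y_pred) if t >= threshold]
    neg = [p for t, p in zip(y_true, y_pred) if t < threshold]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Toy pKi values for illustration only.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 5.8, 7.4, 7.9]
print(rmse(y_true, y_pred), pearson(y_true, y_pred), spearman(y_true, y_pred))
print(concordance_index(y_true, y_pred), auc_roc(y_true, y_pred, threshold=7.0))
```

RMSE measures absolute error on the pKi scale, whereas Spearman correlation, concordance index and AUC-ROC depend only on how well the predictions rank the compounds, so a model can score well on the latter while still being miscalibrated.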
Performance of RF model on the test and Metz datasets
Performance of XGBoost on the test and Metz datasets
Performance of ANN on the test and Metz datasets
4 Web implementation and Code Availability
The model is available through a web portal at https://drugdiscovery.utep.edu/pki/. The web interface takes SMILES strings and protein sequences as inputs and returns the predictions. Additionally, the model, data and results are available on GitHub at github.com/sirimullalab/KinasepKipred. A Docker image is also available via Docker Hub at sirimullalab/kinasepkipred.
Funding
G.K. was supported by the UTEP Computational Science Program. M.H. and resources were supported by Dr. Sirimulla’s startup fund from the UTEP School of Pharmacy.
Acknowledgements
We thank Drs. Mahesh Narayan, Amy Wagler, and Gabriel A. Frietze at UTEP for helpful discussions and Dr. Michael Scott Long for reading the manuscript. We thank the High-Performance Computing Center at UTEP for assistance in using the Chanti cluster and the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper.