New Results

KinasepKipred: A Predictive Model for Estimating Ligand-Kinase Inhibitor Constant (pKi)

KC Govinda, Md Mahmudulla Hassan, Suman Sirimulla
doi: https://doi.org/10.1101/798561
KC Govinda
1Computational Science Program, College of Science, The University of Texas at El Paso, El Paso, Texas, USA
Md Mahmudulla Hassan
2Department of Computer Science, College of Engineering, The University of Texas at El Paso, El Paso, Texas, USA
Suman Sirimulla
1Computational Science Program, College of Science, The University of Texas at El Paso, El Paso, Texas, USA
2Department of Computer Science, College of Engineering, The University of Texas at El Paso, El Paso, Texas, USA
3Department of Pharmaceutical Sciences, School of Pharmacy, The University of Texas at El Paso, Texas, USA
  • For correspondence: ssirimulla@utep.edu

Abstract

Summary Kinases are one of the most important classes of drug targets for therapeutic use. Algorithms that accurately predict the inhibitor constant (pKi) of drug-kinase pairs can considerably accelerate the drug discovery process. In this study, we developed computational models that use machine learning techniques to predict ligand-kinase pKi values. Kinase-ligand inhibitor constant (Ki) data were retrieved from the Drug Target Commons (DTC) and Metz databases. The models were built on structural and physicochemical features of the proteins and on topological pharmacophore atomic triplets fingerprints of the ligands. Three machine learning models [random forest (RF), extreme gradient boosting (XGBoost), and artificial neural network (ANN)] were tested. The RF model was ultimately selected based on its evaluation metrics on the test datasets and was used for the web implementation.

Availability GitHub: https://github.com/sirimullalab/KinasepKipred, Docker: sirimullalab/kinasepkipred

Implementation https://drugdiscovery.utep.edu/pki

Contact ssirimulla{at}utep.edu

Supplementary information Supplementary data are available at Bioinformatics online.

1 Introduction

Protein kinases play important roles in a wide range of diseases, such as cardiovascular disorders, inflammatory diseases, gastrointestinal stromal tumors, and cancer, and can serve as drug targets for therapeutic use (Fabbro, 2015). Kinase inhibitors, which inhibit the activity of deregulated protein kinases, form the largest class of new drugs approved for cancer treatment. The kinase-inhibitor interaction is usually quantified as a binding affinity, expressed as the dissociation constant (Kd), the inhibition constant (Ki), or the half-maximal inhibitory concentration (IC50). We have developed a predictive model to estimate kinase-ligand pKi values, and we used only Ki data to build our dataset. Algorithms that predict drug-target associations (Yamanishi et al., 2008; Li and Lai, 2007) and binding affinities (Pahikkala et al., 2014; He et al., 2017; Öztürk et al., 2018; Kundu et al., 2018) have been published previously. However, to the best of our knowledge, no existing algorithm is specific to kinase Ki prediction.

2 Methods

2.1 Datasets

Kinase-drug Ki data were obtained from two publicly available databases (Tang et al., 2018; Metz et al., 2011). The Drug Target Interaction (DTI) dataset from the Drug Target Commons (DTC) by Tang et al. (2018) was used for model development and initial testing. Because we focused only on kinases, we used only the DTI pairs with Ki values, which covered 118 kinases and 5,983 compounds. Our final dataset thus contained 67,894 instances. All Ki values were converted into molar units and then recorded as their negative decadic logarithm (pKi). Data obtained from Metz et al. (2011) were used as the external dataset for evaluating our models. We filtered the Metz data to remove all entries that overlapped with DTC. The resulting external dataset contained 148 kinases and 240 compounds, yielding 17,258 unique DTI pairs (with Ki values) that were not used in our training and test data.
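The two preprocessing steps described above (conversion of Ki to pKi and removal of external pairs that overlap with the training source) can be sketched as follows. This is a minimal illustration, not the authors' code: the unit labels, the function names `ki_to_pki` and `remove_overlap`, and the representation of a DTI pair as a `(kinase, compound)` tuple are all assumptions made for the example.

```python
import math

# Conversion factors to molar; the unit labels are assumed for illustration.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def ki_to_pki(ki_value, unit="nM"):
    """Return pKi, the negative decadic logarithm of Ki expressed in molar."""
    if ki_value <= 0:
        raise ValueError("Ki must be positive")
    return -math.log10(ki_value * UNIT_TO_MOLAR[unit])


def remove_overlap(external_pairs, training_pairs):
    """Keep only external (kinase, compound) pairs absent from the training pairs."""
    seen = set(training_pairs)
    return [pair for pair in external_pairs if pair not in seen]


# Hypothetical identifiers, used only to exercise the functions.
dtc_pairs = [("KINASE_A", "CPD_1"), ("KINASE_B", "CPD_2")]
metz_pairs = [("KINASE_A", "CPD_1"), ("KINASE_C", "CPD_3")]
external_only = remove_overlap(metz_pairs, dtc_pairs)
print(external_only)
print(ki_to_pki(10.0, "nM"))
```

Under this definition, a tenfold decrease in Ki raises pKi by one unit, so larger pKi values correspond to tighter binding.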

2.2 Molecular features

Protein features

Protein features were generated with propy, a Python-based tool (Cao et al., 2013). These are structural and physicochemical features derived from the amino acid sequence.
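To give a sense of what a sequence-derived feature looks like, the sketch below computes amino acid composition, one of the simplest descriptors of this kind. It is written in plain Python and does not use or reproduce the propy API; the function name `amino_acid_composition` and the example sequence are assumptions for illustration only.

```python
from collections import Counter

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a protein sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    # Non-standard symbols are ignored, so the fractions sum to 1.
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Hypothetical short peptide, not a real kinase sequence.
composition = amino_acid_composition("MKKL")
print(composition["K"], composition["M"], composition["L"])
```

Each sequence maps to a fixed-length vector of 20 values regardless of its length, which is what makes descriptors of this type usable as inputs to standard machine learning models.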

Ligand features

Topological pharmacophore atomic triplets fingerprints (TPATFP) were generated for the ligands using Perl scripts from MayaChemTools (Sud, 2016). A more detailed explanation of the protein and ligand features is provided in the supporting information.

2.3 Model development

Models were developed mainly using grid search with 5-fold cross-validation, based on the scikit-learn machine learning library for Python (Pedregosa et al., 2011). We used 75% of the data for the training set and 25% for the test set. Three different machine learning models (random forest, extreme gradient boosting, and artificial neural network) were developed, and we compared their performance using several evaluation metrics. A more detailed explanation of these three models is available in the supporting information. We also compared grid search with random search (Bergstra and Bengio, 2012) for our random forest model to estimate the efficiency of each method.
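The workflow described above (a 75%/25% split, then grid search and random search with 5-fold cross-validation for a random forest) can be sketched with scikit-learn as follows. The synthetic data, the hyperparameter grid, and the scoring choice are assumptions made for illustration; the actual features and search space are described in the supporting information, not here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

# Synthetic stand-in for the real feature matrix and pKi labels (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 6.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# 75% training / 25% test split, as stated in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical, deliberately tiny hyperparameter grid.
param_grid = {"n_estimators": [10, 20], "max_depth": [3, None]}

# Exhaustive grid search with 5-fold cross-validation on the training set.
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X_train, y_train)

# Random search samples only n_iter of the same configurations.
rand = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    n_iter=2,
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=0,
)
rand.fit(X_train, y_train)

# Evaluate the refit best grid-search model on the held-out test set.
test_rmse = float(np.sqrt(np.mean((grid.predict(X_test) - y_test) ** 2)))
print(grid.best_params_, rand.best_params_, test_rmse)
```

Grid search evaluates every combination in the grid, whereas random search evaluates a fixed number of sampled combinations, so its cost does not grow with the size of the grid.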

3 Results

The developed models were evaluated using several metrics: root-mean-square error (RMSE), Pearson correlation coefficient (R), Spearman correlation coefficient (ρ), concordance index (Con. Index), and area under the receiver operating characteristic curve (AUC-ROC). Tables 1, 2, and 3 show the scores obtained by the three models on the test dataset and on the Metz dataset (the external test dataset). Among the three models, random forest performed best, with R 0.887, ρ 0.846, RMSE 0.475, Con. Index 0.854, and AUC 0.957 on the test dataset, and corresponding scores of 0.769, 0.669, 0.503, 0.749, and 0.938 on the external dataset. More details about the results and a comparative study can be found in the supporting information.
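Four of the regression metrics named above can be written out in a few lines of plain Python, as in the sketch below. The function names and the toy values are assumptions for illustration, and the concordance index shown is one common pairwise definition (ties in the prediction count as half); the paper's exact implementation is not specified in this text. AUC-ROC is left out because it requires binarizing pKi at a threshold that is not given here.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(x, y):
    """Pearson correlation coefficient R."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order."""
    concordant, comparable = 0.0, 0
    for i in range(len(y_true)):
        for j in range(i + 1, len(y_true)):
            if y_true[i] == y_true[j]:
                continue  # pairs with equal true values are not comparable
            comparable += 1
            d_true = y_true[i] - y_true[j]
            d_pred = y_pred[i] - y_pred[j]
            if d_pred == 0:
                concordant += 0.5
            elif (d_true > 0) == (d_pred > 0):
                concordant += 1.0
    return concordant / comparable


# Toy pKi values, invented for this example.
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.8, 7.4, 7.9]
print(rmse(observed, predicted), pearson(observed, predicted))
print(spearman(observed, predicted), concordance_index(observed, predicted))
```

RMSE measures the size of the errors in pKi units, while the two correlation coefficients and the concordance index measure how well the predictions preserve the trend and ranking of the observed values.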

Table 1. Performance of the RF model on the test and Metz datasets

Table 2. Performance of XGBoost on the test and Metz datasets

Table 3. Performance of the ANN on the test and Metz datasets

4 Web implementation and Code Availability

The model is available through a web portal at https://drugdiscovery.utep.edu/pki/. The web interface takes SMILES strings and protein sequences as inputs and returns the predicted values. Additionally, the model, data, and results are available on GitHub at github.com/sirimullalab/KinasepKipred. A Docker image is also available via Docker Hub at sirimullalab/kinasepkipred.

Funding

G.K. was supported by the UTEP Computational Science Program. M.H. and resources were supported by Dr. Sirimulla’s startup fund from the UTEP School of Pharmacy.

Acknowledgements

We thank Drs. Mahesh Narayan, Amy Wagler, and Gabriel A. Frietze at UTEP for helpful discussions and Dr. Michael Scott Long for reading the manuscript. We thank the High-Performance Computing Center at UTEP for assistance in using the Chanti cluster and the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper.

Footnotes

  • https://github.com/sirimullalab/KinasepKipred

References

  1. Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. JMLR, page 305.
  2. Cao, D.-S. et al. (2013). propy: a tool to generate various modes of Chou’s PseAAC. Bioinformatics, 29(7), 960–962.
  3. Fabbro, D. (2015). 25 years of small molecular weight kinase inhibitors: Potentials and limitations. Molecular Pharmacology, 87(5), 766–775.
  4. He, T. et al. (2017). SimBoost: a read-across approach for predicting drug-target binding affinities using gradient boosting machines. J Cheminform, 9(1), 24. PMID: 29086119.
  5. Kundu, I. et al. (2018). A machine learning approach towards the prediction of protein–ligand binding affinity based on fundamental molecular properties. RSC Adv., 8, 12127–12137.
  6. Li, Q. and Lai, L. (2007). Prediction of potential drug targets based on simple sequence properties. BMC Bioinformatics, 8, 353. PMID: 17883836.
  7. Metz, J. T. et al. (2011). Navigating the kinome. Nature Chemical Biology, 7, 200.
  8. Pahikkala, T. et al. (2014). Toward more realistic drug–target interaction predictions. Briefings in Bioinformatics, 16(2), 325–337.
  9. Pedregosa, F. et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
  10. Sud, M. (2016). MayaChemTools: An open source package for computational drug discovery. Journal of Chemical Information and Modeling, 56(12), 2292–2297.
  11. Tang, J. et al. (2018). Drug Target Commons: A community effort to build a consensus knowledge base for drug-target interactions. Cell Chemical Biology, 25(2), 224–229.e2.
  12. Yamanishi, Y. et al. (2008). Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics, 24(13), i232–i240. PMID: 18586719.
  13. Öztürk, H. et al. (2018). DeepDTA: Deep drug-target binding affinity prediction. Bioinformatics, 34.
Posted October 11, 2019.