Abstract
At present, drug toxicity has become a critical problem with heavy medical and economic burdens. acLQTS (acquired Long QT Syndrome) is acquired cardiac ion channel disease caused by drugs blocking the hERG channel. Therefore, it is necessary to avoid cardiotoxicity in the drug design and computer models have been widely used to fix this plight. In this study, we present a molecular fingerprint based on the molecular dynamic simulation and uses it combined with other molecular fingerprints (multi-dimensional molecular fingerprints) to predict hERG cardiotoxicity of compounds. 203 compounds with hERG inhibitory activity (pIC50) were retrieved from a previous study and predicting models were established using four machine learning algorithms based on the single and multi-dimensional molecular fingerprints. Results showed that MDFP has the potential to be an alternative to traditional molecular fingerprints and the combination of MDFP and traditional molecular fingerprints can achieve higher prediction accuracy. Meanwhile, the accuracy of the best model, which was generated by consensus of four algorithms with multi-dimensional molecular fingerprints, was 0.694 (RMSE) in the test dataset. Besides, the number of hydrogen bonds from MDFP has been determined as a critical factor in the predicting models, followed by rgyr and sasa. Our findings provide a new sight of MDFP and multi-dimensional molecular fingerprints in building models of hERG cardiotoxicity prediction.
1. Introduction
Drug-induced toxicity has become a critical reason for the failure of drug discovery and development in recent years (Wallace, 2015). A previous study showed that there were more than half of drugs failed (54%) in clinical development among 640 novel therapeutics, while 17% of them failed because of drug-induced toxicity (Hwang et al., 2016). Besides, it has also been reported that the mean costs required to bring a new drug to market increased from $374.1 million to $1335.9 million after counting for costs of failed trials (Wouters et al., 2016). Thus, it has become an urgent task to find ways to identify the toxicity of compounds on a large scale in drug development.
Acquired Long QT syndrome (acLQTS), one of the most important diseases caused by drug-induced toxicity, is a potentially life-threatening cardiac arrhythmia disease that increases the risk for syncope, sudden cardiac death (SCD), and seizures (Tester & Ackerman, 2014). The hERG protein is a tetrameric potassium ion channel and mainly relates to cardiotoxicity and acLQTS (Liu et al., 2020). It has been reported that the potassium ion channel (hERG channel) may be blocked caused by antiarrhythmic drug binding, which leads to prolonged repolarization time and acLQTS (Witchel, 2007). At present, multiple drug candidates have failed due to the cardiotoxicity of hERG, such as cisapride, terfenadine, sertindole, pimozide, and astemizole, which have become a significant limiting factor in drug discovery and development (Bergström & Lindmark, 2019; Villoutreix & Taboureau, 2019).
Computer-aided drug design (CADD) has been thought of as an alternate choice to reduce the amount of time and money in the development of drug design, especially in predicting drug toxicity (Maia et al., 2020). Molecular fingerprints are a way of CADD and are used to encoding the structure of molecules (O’Boyle et al., 2011). It has been deployed as descriptors for predicting biological activities and compound properties (Muegge & Mukherjee, 2014). Frequently used molecular fingerprints are structure-based and property-based (Kelley, 2018; Rogers & Hahn, 2010; Riniker & Landrum, 2013; Riniker, 2017). A previous study of hERG cardiotoxicity prediction showed that the accuracy of the best model developed by molecular descriptors reached 0.54 (R2), while RMSE was 0.63 (Johnson et al., 2007). Another study of the hERG channel also showed that the accuracy of the regression model by descriptors was 0.60 (Q2) and 0.55 (RMSE) for pIC50 (Radchenko et al., 2017). These results showed the practicalities and effectiveness based on commonly used molecular fingerprints. However, there are still no fingerprints that considered the time factor applied on the cardiotoxicity prediction of hERG.
Molecular dynamics fingerprints (MDFP) are the fingerprints based on calculating the trajectory of molecular dynamic simulation and have rapidly become a hotspot. After adding the dimension of time, MDFP can be seen as a choice of the traditional molecular fingerprint. The study of p-glycoprotein substrates prediction showed that gradient tree boosting (GTB) methods in combination with MDFP was the only model which achieved a good accuracy on the in-house dataset (Esposito et al., 2020). Meanwhile, the research of free-energy prediction showed good performance with a heterogeneous fusion model by MDFP (Riniker, 2017). Besides, studies of self-solvation free energies and application of MDFP in SAMPL6 octanol–water log P blind challenge also revealed a high prediction rate (Gebhardt et al., 2020; Wang & Riniker, 2019). As a consequence, MDFP can be an alternative choice of traditional molecular fingerprints and has great application potential on the cardiotoxicity prediction of hERG.
Multi-dimensional molecular fingerprints are indicated as multiple molecular fingerprints combining together in order to predict more accurately. Previous studies showed that multi-dimensional molecular fingerprints were better than the single molecular fingerprint in drug development (Kyaw et al., 2020). Thus, in this study, we studied MDFP and multi-dimensional molecular fingerprints (MDFP with other molecular fingerprints) in predicting hERG cardiotoxicity of compounds. The extensive open dataset of hERG compounds with IC50 values has been collected from previous studies. Then, molecular dynamic simulation was conducted to generate MDFP and traditional molecular fingerprints have also been generated by Baseline2D, ECFP4, and PropertyFP. Finally, the regression models were built by machine learning with four algorithms. Our study provides new sights on the combination of multi-dimensional molecular fingerprints and the research of predicting the hERG cardiotoxicity of compounds.
2. Methods
2.1. Toxicity Datasets
A high-quality hERG inhibitor dataset has been collected from the previous study (Munawar et al., 2019). The IC50 value is the biochemical half-maximal inhibitory concentration and has been used to represent the inhibiting abilities of compounds on hERG in this dataset (Kalliokoski et al., 2013). The data of toxicity have been eliminated if the name and IC50 values were repeated. The repeated molecules have also been averaged if the difference IC50 values were less than one order of magnitude (Feng et al., 2021). Finally, 203 compounds have been collected with specific IC50 values of the hERG. The distribution of training and testing sets followed by 80% and 20%, respectively. The training sets were used for 5-fold cross-validation and the testing sets were used to check the prediction performance of the established model for new compounds. Besides, pIC50 is the negative log unit of the IC50 values and has been used to represent inhibiting abilities better than IC50 (Cortés-Ciriano et al., 2020). Therefore, IC50 of compounds was converted to pIC50.
2.2. MD Simulations
Molecular dynamics (MD) simulation was performed by GROMACS (2020.4). For compounds in the dataset, mol2 files were obtained from Zinc15 (http://zinc15.docking.org/) by using SMILES files. The topology of compounds was generated with AMBER14SB force field by ACPYPE (https://www.bio2byte.be/) (Sousa da Silva et al., 2012). Afterward, the compounds were placed in a dodecahedron box with a size of 1.0 nm centrally and solvated with the TIP3P water model. Then, the descent energy minimization with 100ps was applied to the system. An additional equilibration of 1ns under NVT and NPT conditions was carried out, while the constant temperature was 300 K and the constant pressure was 1 bar, respectively (Sun et al., 2020). Finally, the system was performed with running 5 ns MD simulation and coordinates were written every 10ps, energies every 1ps.
2.3. 2D Molecular Fingerprints
Three types of molecular fingerprints have been used in this study. Baseline2D was obtained using RDKit and its elements mainly consisted of 10 counts: number of heavy atoms, number of rotatable bonds, number of N, O, F, P, S, Cl, Br, and I atoms (Riniker, 2017; Wang & Riniker, 2019). The PropertyFP fingerprint was also obtained using the Descriptastorus package from RDKit (Kelley, 2018). It contained nearly 200 atoms features and properties. Besides, ECFP4 was generated using the RDKit implementation of the Morgan algorithm with a vector length of 2048 and a radius of 2 ( Rogers & Hahn, 2010).
2.4. MD Fingerprints
The MD trajectories were analyzed by the GROMACS toolkit (Ogunwa, 2019). Following features has been generated: radius of gyration (rgyr), solvent-accessible surface area (sasa), root mean squared error (rmsd), total energy (tenergy), hydrogen bonds (hbond), kinetic energy (kinetic), Lennard-Jones short-range energies (LJ-SR) and Lennard-Jones 1-4 energies (LJ-14). The average (avr), median (mid), and standard deviation (std) of features were calculated using the R version 3.6.1 (Team, 2013). Fig. 1 showed the MDFP with all properties.
2.5. Feature Selection
Feature selection is critically important for predictive models, especially in machine learning (Johnson et al., 2018). It provides an effective way to reduce the dimensionality of data sets, identify informative features, and remove irrelevant features, improving the learning accuracy of machine learning models (Holder et al., 2017). In this study, zero variation and near-zero variation features were deleted firstly using the nearZeroVar function in the R package caret (version 6.0–84) (Kuhn, 2008). Then, recursive feature elimination (RFE) was performed to select the optimal feature subset using the rfe function in the caret package in a 10 times 5-fold cross-validation setting (Darst et al., 2018). In the RFE process, all features are first ranked according to the feature importance values obtained by the random forest (RF) algorithm, and then RF models are trained iteratively on the features that are gradually reduced according to the ranking to evaluate the performance of the feature subsets (Tang et al., 2020).
2.6. Model Construction
In this study, RF, SVM, gradient boosting machine (GBM), and partial least square regression (PLS) was used for machine learning model construction. All models were executed beyond R (version 3.6.1) with using the randomForest (version 4.6–12) (Liaw & Wiener, 2002), the kernlab (version 0.9-25) (Karatzoglou et al., 2004), the gbm (version 2.1.5) (Brandon et al., 2019), and the pls (version 2.7-1) packages (Bjørn-Helge et al., 2019), respectively.
2.6.1 Random forest
RF is the machine learning ensemble classifier and has been applied in many fields (Breiman, 2001). By constructing multiple decision trees, the RF classifier has been considered as better performance than the single decision tree (Gandhi et al., 2018). In the current study, the randomforest function has been used to build RF classifiers. The number of classification trees and variables randomly selected for each node spilt have been set as ntree = 500, while mtry was optimized from one-third of the number of features minus 10 to plus 15. The relative importance of molecular fingerprints has also been calculated by the important function of the package.
2.6.2 Support vector machine
SVM is a generalized linear classifier based on the principle of structural risk reduction for pattern recognition (Huang et al., 2018). It is well known as a supervised learning algorithm that analyzes data and recognizes patterns (Nedaie et al., 2018). In this study, the radial basis function (RBF) kernel was used for building the SVM classifier. Meanwhile, the random search method (Bergstra & Bengio, 2012) was also applied to optimize specific SVM parameters with the regularization parameter C and σ parameter by using the caret package, while C was from e-2 to e6, σ was from e-7 to e with the step of e0.5.
2.6.3 Gradient boosting machine
GBM is also a tree-based machine learning model. It has been considered as a step-wise, additive type model which sequentially fits new-tree-based models (Golden et al., 2019). Meanwhile, it also has many advantages, especially worked well in practice (Cho et al., 2019). In this study, the total number of trees (n.trees) and the maximum depth of each tree (interaction. depth) have been optimized by using the caret package and have been set from 1 to 3000 and 1 to 10, respectively. Besides, shrinkage and n.minobsinnode were set as 0.005 and 10.
2.6.4 Partial least square regression
PLS calculates a group of latent variables in connection with the output maximally and determines the relationship between the input and output data (Foodeh et al., 2020). It is a stretch of the multiple linear regression models and is widely used in many domains (Wu et al., 2020). Unlike multiple linear regression (MLR), it can handle the data with noisy, strongly collinear, and X-variables (Dong et al., 2018). In this study, n_components for PLS were optimized from 1 to the greatest features or sample sizes.
2.7. Model Evaluation
In order to test the predictive performance of the models, 5-fold cross-validation with 10 repeats has been used to evaluate the models. After randomly divided the original dataset into five equal subsets, four of them were used for training and the other was used for testing. Then the 5-fold cross-validation was repeated ten times to reduce the randomness. This cross-validation progress was performed 10 times with different random seeds of 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024. Then, average values were calculated to evaluate the prediction performance of the models.
Root-mean-squared error (RMSE), mean unsigned error (MUE), and R2 has been used to evaluate the predictive performance of the models. These indicators were calculated by the following formulas:
Where P, Ē, E, n represent predictive value, the average of experimental value, experimental value, and compound numbers, respectively.
3. Results and discussion
3.1. Feature selection
In this study, 203 compounds were collected from the previous study and divided into training and testing datasets with 80% to 20%, respectively. In order to build models to predict hERG cardiotoxicity, MDFP, Baseline2D, ECFP4, and PropertyFP have been calculated for the compounds in the dataset. Table 1 illustrated the number of features calculated from each type of molecular fingerprint and the detailed description of these features is shown in the supplementary files (Table S1 and Table S2). After the feature selection by RF-RFE, 11 and 6 features have been selected from MDFP and Baseline2D, respectively. Meanwhile, there were also 99 features selected from ECFP4 and 71 from PropertyFP. Percentage increase in MSE (%lncMSE) obtained by RF was used to evaluate the importance of features. Fig. 2 showed the top ten features (Baseline2D for six) which important to the prediction models. The results of MDFP showed that the number of hydrogen bonds between compounds and water has a significant effect on predicting hERG cardiotoxicity, followed by kinetic energy and surface area. Besides, the results of 2D molecular fingerprints indicated that the number of heavy atoms, number of O atoms (oxygens), and number of F atoms (fluorines) were the most important features in Baseline2D, while MolLog P in PropertyFP and 3218693969 in ECFP4. Above all, after calculating features in all molecular fingerprints, the following features have been selected as the most critical with heavyatoms, oxygens, fluorines, the median of hydrogen bonds, and 3218693969. These features may be played important roles in predicting the hERG cardiotoxicity and should be paid extra attention in the development of drug candidates.
3.2. Prediction performance of the models
After performing feature selection, the GBM, PLS, RF, and SVM algorithms were used for generating ML models based on the resulting fingerprints. The performance of these machine learning models was evaluated by 10 times 5-fold cross-validation and their performances were presented in Table 2. The results showed that the RMSE of each machine learning model based on PropertyFP is the lowest, with a range of 0.860-0.960, followed by MDFP, with a range of 0.967-1.039, while ECFP4 and Baseline2D are poor quality. R2 and MUE also showed the same pattern. Table 3 illustrated the performance of these models which were used to predict the pIC50 of the molecules in the testing set. In general, the models show better RMSE values in the test set than in the 5-fold cross-validation, indicating that the model has not been overfitted. Meanwhile, compared with the models based on different molecular fingerprints, the performance in the testing set was similar, while Baseline2D was slightly better (RMSE=0.721 to 0.795) and MDFP also obtained a good performance (RMSE=0.755 to 0.819). These results indicated that MDFP can effectively predict the activity of hERG inhibitors, and the predictive performance of the MDFP was similar to the traditional molecular fingerprints.
The predictive performance of the MDFP model combined with other molecular fingerprints was also investigated in this study. Table 4 and Table 5 showed the performance of models in the 5-fold cross-validation sets and testing sets while MDFP combined with other molecular fingerprints, respectively. The results showed that the combination of MDFP and other molecular fingerprints can obtain a model with better prediction performance. For example, the model established by the single molecular fingerprint (MDFP or PropertyFP) in the 5-fold cross-validation had the best performance as PropertyFP-SVM (RMSE=0.860). However, the model established by multi-dimensional molecular fingerprints (MDFP and PropertyFP) was MDFP+PropertyFP-SVM (RMSE=0.837), which showed a better performance than using the single molecular fingerprints. Besides, models combining MDFP with other molecular fingerprints also showed better predictive performance in the testing set (Table 5), while the best model was the SVM model trained on MDFP++ (MDFP with all other fingerprints) (RMSE=0.696±0.015). These results illustrated that the performance of multi-dimensional molecular fingerprints was better than the single molecular fingerprints and MDFP may provide additional effective predictors for the prediction of hERG inhibitor activity.
In order to improve the prediction performance of the model, we further averaged the prediction results of the four machine learning models to obtain a consensus value. The prediction performance was shown in Table 3 and Table 5. Fig. 3 and Fig. 4 showed the predicted values vs experimental values for MDFP and MDFP++, respectively. The values of other molecular fingerprints have been demonstrated in the supplementary files (Fig. S1 to S6). It was found that the performance of consensus models was significantly better than the other models (except PropertyFP). Among the models established by a single molecular fingerprint, the consensus model based on Baseline2D had the highest accuracy (RMSE=0.713), while the consensus model based on MDFP also obtained a better RMSE of 0.745. Meanwhile, in the model based on the multi-dimensional molecular fingerprints, MDFP+ECFP4 and MDFP++ obtained high accuracy with RMSE of 0.694 and 0.695, respectively. These results indicated that the integrated model can obtain a better method for predicting the activity of hERG inhibitors.
In summary, these results illustrated that the MDFP was effective compared with traditional molecular fingerprints and can truly be an alternative to the other molecular fingerprints. Meanwhile, the prediction accuracies of all ML models on multi-dimensional molecular fingerprints were better than the single molecular fingerprints in predicting the hERG cardiotoxicity. Besides, the integrated models showed the best prediction than the single models among most of the molecular fingerprints. Thus, the models obtained by multiple machine learning methods could be more accurate in predicting the hERG cardiotoxicity of compounds.
3.3. MDFP features associated with cardiotoxicity
To further reveal the contributions of fingerprint features associated with cardiotoxicity, the correlation coefficient has been used to determine the feature between MDFP and pIC50. Correlation is a measure of a monotonic association between 2 variables and Pearson’s correlation coefficient has become one of the most frequently used statistics (Armstrong, 2019). In this study, Pearson, Kendall, and Spearman correlations were used to evaluate the important features of MDFP with pIC50. Table 6 showed the correlation coefficient between the feature of MDFP and pIC50. The median of rgyr has been determined as the most relevant feature with pIC50 (Kendall = 0.35, Pearson = 0.51, and Spearman = 0.49), followed by the median of sasa and kinetic with the high correlation coefficient. These results showed the features which extracted from MDFP had strong correlations with pIC50 and can be used to predict cardiotoxicity in the future study.
3.4. Compared with other models
Recently, a couple of computational models have been developed for toxicity prediction. Among them, cardiotoxicity prediction has become a hotspot with multiple studies. Table 7 showed the comparisons between our model and other models for cardiotoxicity prediction. Compared with other models, the consensus model with MDFP and ECFP4 showed the lowest RMSE and MUE, with higher R2. Meanwhile, the molecular fingerprints of previous studies were used by only one dimension, which may prove that multi-dimensional fingerprints performed well in predicting the cardiotoxicity of hERG. Besides, although it was lower than QSAR-SVM, the consensus with MDFP still better than the other models as 0.745±0.005 (RMSE), which illustrated the advantages of MDFP. These findings showed that MDFP and multi-dimensional molecular fingerprints with machine learning methods can be an outstanding model in predicting cardiotoxicity.
4. Conclusion
In this study, MDFP and multi-dimensional molecular fingerprints were used for building machine learning models to predict the hERG cardiotoxicity of compounds. 203 compounds were firstly identified to establish the 5-fold cross-validation and testing datasets. Then molecular dynamic simulation has been used to generate molecular dynamic molecular fingerprints. Baseline2D, ECFP4, and PropertyFP were used to generate traditional molecular fingerprints. After that, critical features have been selected by RF-RFE and 4 machine learning algorithms, namely RF, SVM, GBM, and PLS were used for building predicting models based on the single fingerprints and multi-dimensional molecular fingerprints. Besides, the correlation between MDFP and pIC50 has also been surveyed. Results showed that MDFP has the potential to be an alternative choice of molecular fingerprints and multi-dimensional molecular fingerprints are better than single fingerprints in predicting cardiotoxicity. It also illustrated that the consensus model with MDFP and ECFP4 has the optimum prediction effect and hydrogen bonds are critically important in the models with MDFP. Our finding provides a new sight into the application of MDFP and multi-dimensional molecular fingerprints in predicting the hERG cardiotoxicity of compounds. Cell and animal experiments will be carried out to validate further.
Conflict of interests
The authors declare that they have no conflict of interests.
Data Availability Statement
All data and models generated or used during the study appear in the submitted article.
Author contributions
WZD, LZ, and HSL conceived the project, developed the prediction method, designed, and implemented the experiments, analyzed the result, and wrote the paper. YN, JSW, and XXX implemented the experiments, analyzed the result, and wrote the paper. SYH and SYL analyzed the result. All authors read and approved the final manuscript.
Acknowledgements
This study was supported by the National Natural Science Foundation of China (No. 82003655), the Key R&D Program of Liaoning Province (No. 2019JH2/10300041), Scientific Research Project from Department of Education of Liaoning Province (No. LQN201906), Shenyang Science and Technology Plan Project (No. 17-65-7-00, 19-302-3-04).