Targeting against HIV/HCV Co-infection using Machine Learning-based multitarget-quantitative structure-activity relationships (mt-QSAR) Methods

Co-infection with HIV-1 and HCV is common in certain populations. However, treatment of co-infection is challenging and requires special consideration of hepatic safety and drug-drug interactions. Multitarget inhibitors with lower toxicity may provide a promising therapeutic strategy for HIV/HCV co-infection. However, identifying a single molecule that acts on multiple targets simultaneously by experimental evaluation is costly and time-consuming. In silico target prediction tools provide more opportunities for the development of multitarget inhibitors. In this study, by combining naive Bayesian (NB) and support vector machine (SVM) algorithms with two types of molecular fingerprints (MACCS and ECFP6), 60 classification models were constructed to predict the active compounds toward 11 HIV-1 targets and 4 HCV targets based on multitarget quantitative structure-activity relationships (mt-QSAR). 5-fold cross-validation and test set validation were performed to confirm the performance of the 60 classification models. Our results show that the 60 mt-QSAR models achieved high classification accuracy, with ROC-AUC values ranging from 0.83 to 1 (mean 0.97) for the HIV-1 models and from 0.84 to 1 (mean 0.96) for the HCV models. Furthermore, the 60 models were applied to comprehensively predict the potential targets of an additional 46 compounds, including 27 approved HIV-1 drugs, 10 approved HCV drugs and 9 selected compounds known to be active on one or more targets of HIV-1 or HCV. Finally, 18 hits, including 7 approved HIV-1 drugs, 4 approved HCV drugs and 7 compounds, were predicted to be HIV/HCV co-infection multitarget inhibitors. Reported bioactivity data confirmed that 7 compounds actually interacted with HIV-1 and HCV targets simultaneously with diverse binding affinities. The remaining predicted hits and chemical-protein interaction pairs with the potential ability to suppress HIV/HCV co-infection deserve further experimental investigation. This investigation shows that the mt-QSAR method can be used to predict chemical-protein interactions for discovering multitarget inhibitors and provides a unique perspective on HIV/HCV co-infection treatment.

based, such as pharmacophore modeling [10], similarity searching [11,12] and molecular docking [13,14]. Conventional quantitative structure-activity relationship (QSAR) modeling generally takes a group of analogs against one target into account to elucidate the relationship between chemical structure and biological activity. In performance between mt-QSAR and computational chemogenomic methods, and the result showed that mt-QSAR had a better performance than chemogenomic, which had a high risk of overfitting and a high false positive rate in the external validation set [18].

In this study, the mt-QSAR method was applied to predict the chemical-protein interactions (CPIs) for 11 key targets related to HIV-1 and 4 targets related to HCV. The workflow is depicted in Fig 1. Firstly, by combining two machine learning algorithms (naive Bayesian [19,20] and support vector machine [21]) and two molecular fingerprints (MACCS [22] and ECFP6 [23]), we developed multiple mt-QSAR models for identifying inhibitors against 11 HIV-1 related and 4 HCV related targets. Secondly, 5-fold cross-validation and test set validation were used to evaluate the performance of all models. Thirdly, to verify the application of this strategy, the developed mt-QSAR models were employed to predict CPIs for 27 approved HIV-1 drugs and 10 approved HCV drugs as well as 9 compounds known to be active on one or more targets of HIV-1 or HCV. The predicted results showed 18 hits to be potential HIV/HCV co-infection multitarget inhibitors; reported bioactivity data further confirmed that 7 compounds actually interacted with HIV-1 and HCV targets simultaneously with diverse binding affinities. Our study indicates that machine learning-based mt-QSAR approaches could potentially be applied in the discovery of HIV/HCV co-inhibitors and in multitarget drug discovery.

In this study, a total of 11 targets related to HIV-1 and 4 targets related to HCV were obtained. The numbers of known small molecule inhibitors that act on HIV-1 and HCV targets were 11,006 and 1,431, respectively. The number of active compounds for each target is shown in Fig 2.

The 11 targets related to HIV-1 and 4 targets related to HCV are spatially distributed into five and two categories, respectively. As shown in Fig 3A, the HIV-1 target space (n = 11) consists of five subfamilies, including membrane receptor (n = 2), enzyme (n = 5), transcription factor (n = 1), kinase (n = 1) and unclassified protein (n = 2). Similarly, the HCV target space (n = 4) is divided into two subfamilies, including enzyme (n = 2) and unclassified protein (n = 2), as shown in Fig 3B. Thus, the two data sets for HIV-1 and HCV have diverse target coverage.

The construction of all classification models in this study was primarily based on two classifiers (NB and SVM) and two types of fingerprints (MACCS and ECFP6). Afterwards, internal 5-fold cross-validation and external test set validation were performed in turn.
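The internal 5-fold cross-validation step can be illustrated with a minimal sketch of how a data set might be partitioned. The function name `k_fold_indices`, the shuffling with a fixed seed and the sample count are assumptions made for illustration; the text does not specify how the folds were drawn or whether they were stratified.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for k-fold cross-validation.

    Hypothetical helper: shuffles sample indices once, then holds out
    each of the k folds in turn while the other k - 1 folds form the
    training set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute the shuffled samples as evenly as possible over k folds.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        valid_idx = sorted(folds[held_out])
        train_idx = sorted(
            idx
            for fold_no, fold in enumerate(folds)
            if fold_no != held_out
            for idx in fold
        )
        yield train_idx, valid_idx


# Illustrative data set size only; not taken from the study.
splits = list(k_fold_indices(n_samples=23, k=5))
```

Each compound appears in exactly one validation fold, so every model is evaluated once on every training-set compound without having seen it during fitting.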

The 5-fold cross-validation test results of mt-QSAR models identifying inhibitors and decoys are displayed in Moreover, the performance of different algorithms (NB and SVM) was also evaluated. With the fingerprint MACCS (Fig 5A), the performance of the SVM models is generally superior to that of the NB models.

In particular, for the HIV-1 models, the MCC values of the SVM models range from 0.70 to 1 with an average of 0.96, whereas those of the NB models range from 0.50 to 1 with an average of 0.81.
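For readers unfamiliar with the MCC (Matthews correlation coefficient), the sketch below computes it from the four cells of a binary confusion matrix using the standard definition, which is assumed here to be the one used for the values above. The confusion-matrix counts are invented for illustration and are not results from the study.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Return the Matthews correlation coefficient of a binary classifier.

    tp, tn, fp and fn are the counts of true positives, true negatives,
    false positives and false negatives. The value lies in [-1, 1];
    0.0 is returned when any marginal total is zero and the coefficient
    is undefined.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Invented counts for illustration only.
example_mcc = matthews_corrcoef(tp=90, tn=85, fp=15, fn=10)
perfect_mcc = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)  # equals 1.0
```

Unlike plain accuracy, the MCC uses all four cells of the confusion matrix, which makes it informative when inhibitors and decoys are unevenly represented.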

As described above, the two fingerprints show no significant difference in performance, and the SVM algorithm is slightly better than NB. However, it is theoretically difficult to determine which fingerprint or algorithm is better because each algorithm or fingerprint has its own advantages and disadvantages. For example, the models corresponding binding values are presented in Table 4. Fig 6 shows (Fig 7). The

Above all, a set of 9 compounds (Fig 8) well as HCV NS5B. Following the above rules, a CPI is regarded as a potential interaction when at least two out of four single classifiers predict a compound as positive. The specific predicted results are given in S5 Table, and the comparison of predicted and experimental results is shown in Table 5.
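The "at least two out of four" rule stated above can be sketched as a simple vote count over the four single classifiers for one compound-target pair. The function name, the dictionary layout and the example votes are assumptions made for illustration only.

```python
def is_potential_interaction(votes, min_positive=2):
    """Apply the consensus rule to the single-classifier predictions.

    votes maps a classifier label to True (predicted active) or False.
    A chemical-protein interaction is flagged as potential when at
    least min_positive classifiers predict the compound as positive.
    """
    return sum(1 for predicted_active in votes.values() if predicted_active) >= min_positive


# Hypothetical predictions of the four single classifiers for one pair.
votes = {
    "NB-MACCS": True,
    "NB-ECFP6": False,
    "SVM-MACCS": True,
    "SVM-ECFP6": False,
}
flagged = is_potential_interaction(votes)
```

With a threshold of two, a pair is flagged even when the classifiers disagree evenly, which favors sensitivity over specificity.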

Moreover, one CPI pair was a false negative in the prediction results.

The prediction results demonstrate that machine learning models have high specificity and low false

In the past, "one gene, one drug, one disease" was the main paradigm of drug discovery. However, with the introduction of the concept of polypharmacology, multitarget therapy using a single drug that simultaneously inhibits two or more targets may provide a new perspective. Here, the mt-QSAR method was applied to identify compounds acting on multiple targets for HIV/HCV co-infection.

In this study, based on the mt-QSAR method, 44 binary classifiers for 11 targets related to HIV-1 and 16 binary classifiers for 4 targets related to HCV were established to predict the CPIs. In addition, the prediction reliability of the models was confirmed by 5-fold cross-validation and test set validation.
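The classifier counts follow from crossing each target with the two algorithms and the two fingerprints, as the short sketch below shows. The target labels are placeholders, since the full list of target names is not given here, and `enumerate_models` is a hypothetical helper.

```python
from itertools import product

ALGORITHMS = ("NB", "SVM")
FINGERPRINTS = ("MACCS", "ECFP6")


def enumerate_models(targets):
    """List every (target, algorithm, fingerprint) combination."""
    return list(product(targets, ALGORITHMS, FINGERPRINTS))


# Placeholder labels; the real target names are not listed in this passage.
hiv_targets = [f"HIV1_target_{i}" for i in range(1, 12)]  # 11 targets
hcv_targets = [f"HCV_target_{i}" for i in range(1, 5)]    # 4 targets

hiv_models = enumerate_models(hiv_targets)  # 11 x 2 x 2 = 44 classifiers
hcv_models = enumerate_models(hcv_targets)  # 4 x 2 x 2 = 16 classifiers
total_models = len(hiv_models) + len(hcv_models)  # 60 classifiers overall
```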

To illustrate the application of the models, two case studies were carried out to systematically predict the multiple bioactivities of 27 approved HIV-1 drugs and 10 approved HCV drugs (case 1) and of 9 compounds known to be active toward at least one target of HIV-1 or HCV (case 2). For case 1, 21 approved drugs were predicted to be active against more than one biological target. When the predictions were compared with reported bioactivity data, the success rate was 44.6% and the failure rate was 1.8%. For case 2, the 9 compounds active toward at least one target (HIV-1 PR, RT, IN and HCV NS5B) were predicted to be multitarget inhibitors by the mt-QSAR models with a success rate of 89.5%. In short, the prediction results of the above two cases indicate that mt-QSAR models perform well in predicting multitarget inhibitors against HIV/HCV co-infection.

As mentioned above, the mt-QSAR method achieved satisfactory prediction accuracy, and machine learning models were applicable for identifying potential targets in this study. However, similarity searching and molecular docking will be better choices for targets with a small number of active molecules, such as GP41, PKC, CYP3A, NS4B and NS5A, each of which has fewer than 30 active molecules. In addition, unlike molecular docking, machine learning models and similarity searching neglect the interaction between receptor and ligands. Therefore, it is necessary to choose a reasonable single method or a combination of methods according to the specific situation.

In summary, computational methods utilized in this study could effectively detect multitarget co-

Here, x denotes the feature set of a molecule, such as a molecular fingerprint; y represents a set of categories that mark the molecular activity category with "+1" or "-1"; x_j, y_i and d are the j-th attribute value, the i-th category and the number of attributes, respectively. In addition, P(y_i) denotes the prior probabilities, which are calculated by analyzing the training set data; P(x_1, x_2, …, x_d) is a constant for a given training set.
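For reference, Bayes' rule and its naive simplification can be written in their standard textbook form as follows. This is a sketch using the symbols defined above, with d taken as the number of attributes; the exact form and numbering of the equations in the original are not shown here, so the correspondence is an assumption.

```latex
% Bayes' rule for the posterior probability of category y_i
P(y_i \mid x_1, x_2, \ldots, x_d)
  = \frac{P(x_1, x_2, \ldots, x_d \mid y_i)\, P(y_i)}{P(x_1, x_2, \ldots, x_d)}

% With conditionally independent attributes and a constant denominator
P(y_i \mid x_1, x_2, \ldots, x_d)
  \propto P(y_i) \prod_{j=1}^{d} P(x_j \mid y_i)
```

The predicted category is the y_i that maximizes the right-hand side of the second expression.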

Because of the assumption of attribute independence, equation 2 is simplified to equation 3, namely the naive Bayesian formula.

which a hyperplane with the maximal margin is generated to describe the non-linear classification boundary.

Importantly, the classification is achieved through a kernel function. There are four basic kernels: linear, polynomial, radial basis function (RBF) and sigmoid [67]. In most cases, RBF is preferred [67]. In this study, the RBF kernel function was also used. The penalty parameter C and the kernel parameter γ involved in SVM were optimized through 5-fold cross-validation.

The red lines indicate that the CPI was experimentally verified as active. The black lines indicate that the interaction has not been proven and is worthy of further experimental validation. The green and yellow lines indicate that the interaction was experimentally validated as inactive and inconclusive, respectively.
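As a small illustration of the RBF kernel and of the (C, γ) search mentioned for the SVM models, the sketch below evaluates the standard RBF kernel on two fingerprint-like bit vectors and builds a candidate grid on a base-2 logarithmic scale. The grid ranges, the function names and the example vectors are assumptions; the values actually searched are not stated in the text.

```python
import math


def rbf_kernel(x, z, gamma):
    """Standard RBF kernel: exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def parameter_grid(c_exponents, gamma_exponents):
    """Return candidate (C, gamma) pairs as powers of two.

    Hypothetical grid: each pair would be scored by 5-fold
    cross-validation and the best-scoring pair retained.
    """
    return [(2.0 ** c, 2.0 ** g) for c in c_exponents for g in gamma_exponents]


# Two toy bit vectors that differ in two positions.
similarity = rbf_kernel([1, 0, 1], [1, 1, 0], gamma=0.5)
candidates = parameter_grid(range(-1, 2), range(-3, 0))
```

A larger γ makes the kernel decay faster with distance, so the decision boundary follows individual training compounds more closely, while a larger C penalizes misclassified training compounds more heavily.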