Deep neural network affinity model for BACE inhibitors in D3R Grand Challenge 4

Drug Design Data Resource (D3R) Grand Challenge 4 (GC4) offered a unique opportunity for designing and testing novel methodology for accurate docking and affinity prediction of ligands in an open and blinded manner. We participated in the beta-secretase 1 (BACE) Subchallenge, which comprised cross-docking and redocking of 20 macrocyclic ligands to BACE and predicting binding affinities for 154 macrocyclic ligands. For this challenge, we developed machine learning models trained specifically on BACE. We developed a deep neural network (DNN) model that used a combination of both structure-based and ligand-based features and outperformed simpler machine learning models. According to the results released by D3R, we achieved a Spearman's rank correlation coefficient of 0.43(7) for predicting the affinity of 154 ligands. We describe the formulation of our machine learning strategy in detail. We compared the performance of DNN with linear regression, random forest, and support vector machines using ligand-based features, structure-based features, and a combination of both. We compared different architectures for our DNN and found that performance was highly dependent on fine optimization of the L2 regularization hyperparameter, alpha. We also developed a novel metric of ligand three-dimensional similarity, inspired by crystallographic difference density maps, to match ligands without crystal structures to similar ligands with known crystal structures. This report demonstrates that detailed parameterization, careful training and implementation, and extensive feature analysis are necessary to obtain strong performance with more complex machine learning methods. Post hoc analysis shows that scoring functions based only on ligand features are competitive with those also using structural features. Our DNN approach tied for fifth in predicting BACE-ligand binding affinities.


Introduction
The Drug Design Data Resource (D3R) has organized four Grand Challenges (GC) for docking, affinity, and free energy predictions for protein-ligand complexes. [1][2][3] By providing high-quality, blinded protein-ligand crystal structures and affinity datasets, D3R GC has attracted extensive attention from computational drug design researchers. Assessment of results from blinded competition has provided unbiased insights into the most effective strategies as well as the many shortcomings of the state of the art. This platform has sparked numerous novel computer-aided drug design methods. In this article, we present our machine learning approach in GC4 for predicting ligand binding affinities for beta-secretase 1 (BACE), a key protein in the formation of amyloid β-peptide and a leading drug target for Alzheimer's disease. [4] Computer-aided drug design facilitates the whole process of drug development, including virtual screening, lead optimization, structure-activity relationship (SAR) analysis, and ADMET modeling. [5] The design and modeling of drug molecules can be classified into structure-based drug design (SBDD) and ligand-based drug design (LBDD), depending on whether three-dimensional structural information is used. [6] To design a novel drug molecule, an accurate prediction of the binding affinity between a ligand and its target protein is helpful. However, due to the complex nature of intermolecular interactions, protein flexibility, solvation, and entropic effects, the prediction of docked structures and affinities of protein-ligand complexes is very challenging.
Classically, a predicted docked pose needs to be generated first, then the affinity is calculated by force-field-based, knowledge-based, or an empirical scoring function. Popular docking programs and scoring functions such as the AutoDock family [7][8][9], Glide [10], GOLD [11], and X-Score [12] have demonstrated their advantages in predicting docked poses and protein-ligand affinities.
Machine learning models have shown great potential in advancing current computer-aided drug design methodology. [13] The advance of machine learning platforms and tools for chemistry, such as TensorFlow [14] and scikit-learn [15], and the public availability of high-quality protein-drug datasets, such as the PDB [16] and PDBbind [17], greatly enable the application of machine learning to drug design. Novel machine learning-based scoring methods, such as RF-Score [18] and KDEEP [19], or machine learning-optimized software, such as Vinardo [20], smina [21], and RF-Score-v3 [22], have demonstrated superior performance in addition to their generalizability and accessibility. Machine learning methods have been shown to perform better than conventional scoring methods in benchmark studies. [23] For example, KDEEP, RF-Score, X-Score, and Cyscore have Pearson's correlation coefficients (R) of 0.82, 0.80, 0.66, and 0.65, respectively, for 290 protein-ligand affinities in the PDBbind v.2016 core set. [19] Surprisingly, these four scoring methods are very poor at predicting the affinities of ligands to BACE, as they yield Pearson's correlation coefficients (R) of -0.06, -0.14, -0.12, and 0.2, respectively, for 36 BACE ligands. It is still an unresolved question why some protein targets are more difficult than others for different algorithms and scoring functions.
We are especially interested in answering the question of whether a machine learning model trained on a specific target can improve the affinity prediction performance for ligands to this target. Since D3R GC4 provided a high-quality and blinded BACE affinity dataset, we took this opportunity to explore the performance of a target-trained machine learning model. We also demonstrate how the combination of structure-based and ligand-based features benefit machine learning performance.

Affinity model overview
To build and test the affinity prediction performance of BACE-specific trained machine learning models, three essential elements are required: training BACE input features (X training), training BACE experimental affinities (y training), and BACE test input features (X test); see Fig. 1. To generate input features for the ligands, we compared structure-based features with ligand-based features; for the former, a method of obtaining accurate docked poses is important.
The compilation of the training dataset and test input features is discussed in the "Training dataset compilation for affinity modeling of BACE-ligand" section. Before that, our workflow for generating predicted docked poses is introduced in the "Semi-automated ligand pose generation and docking" section.
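The overall workflow of Fig. 1 can be sketched as a standard supervised-learning setup. The array shapes below follow the dimensions stated later in the text (222 training ligands with 34 features, 154 test ligands); the random data and the use of plain linear regression are placeholders, not the paper's final model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Hypothetical stand-ins for the three essential elements in Fig. 1:
# 222 training ligands with 34 features each, and 154 test ligands.
X_training = rng.normal(size=(222, 34))   # training BACE input features
y_training = rng.normal(size=222)         # training BACE experimental affinities
X_test = rng.normal(size=(154, 34))       # BACE test input features

# Fit on the training set, then predict affinities for the blinded test set.
model = LinearRegression().fit(X_training, y_training)
y_test_pred = model.predict(X_test)       # predicted affinities, shape (154,)
print(y_test_pred.shape)
```

In the actual study the test features come from docked poses and chemoinformatic descriptors, and the model is selected among several candidates, as described in the following sections.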

Semi-automated ligand pose generation and docking
Our first objective was to develop a docking workflow for the 154 ligands without crystal structures. It was shown in D3R GC3 that docking to the appropriate receptor structure is important for success [3]. In GC4 Stage 2, D3R provided SMILES codes of the 154 macrocyclic ligands for the affinity prediction test; 20 co-crystal structures were released earlier in D3R GC4 Stage 1B.
We looked for chemical similarity between the 154 ligands (BACE_1 to BACE_158) for the affinity prediction test and the 20 ligands (BACE_1 to BACE_20) with crystal structures using the FragFp method in DataWarrior 4.7.2, [24] see supplemental Table S1 for the full similarity list.
The FragFp descriptor uses chemical fingerprints based on substructure fragment matching.
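FragFp itself is a DataWarrior-specific descriptor, but the matching step it supports reduces to comparing binary substructure fingerprints, typically with the Tanimoto coefficient. The sketch below illustrates that comparison with hypothetical hand-made fingerprints (sets of "on" bit indices); it is not the FragFp implementation.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two binary substructure fingerprints,
    each represented as a set of 'on' bit indices."""
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

# Hypothetical fingerprints for one test ligand and two reference ligands.
fp_test = {1, 4, 7, 9, 12}
refs = {
    "ref_a": {1, 4, 7, 9},   # shares 4 of 5 fragments with the test ligand
    "ref_b": {2, 4, 20},     # shares only 1 fragment
}

# Pick the reference ligand most similar to the test ligand.
best = max(refs, key=lambda name: tanimoto(fp_test, refs[name]))
print(best)  # ref_a
```

In the study, the reference set is the 20 ligands with released co-crystal structures, and the most similar one determines which receptor structure each test ligand is cross-docked into.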
We cross-docked each of the 154 test ligands to the BACE receptor bound to the most similar ligand among BACE_1 to BACE_20. An automated cross-docking Python script using Open Babel [25] and smina [21] was written to generate the docked poses for the 154 test ligands for Stage 2. Smina was used for the cross-docking procedure with the Vinardo scoring function [20]. The docking box was centered on the most similar ligand with a reduced buffer padding. For example, affinity test ligand BACE_73 is most similar to BACE_10, which has a known co-crystal structure.
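A minimal sketch of assembling one such smina invocation is shown below. The flags (`--scoring vinardo`, `--autobox_ligand`, `--autobox_add`) are real smina options; the file names, the padding value, and the exhaustiveness setting are placeholders, not the values used in the study.

```python
import shlex

def smina_crossdock_cmd(receptor, ligand, autobox_ligand, out,
                        autobox_add=4, exhaustiveness=8):
    """Build a smina cross-docking command using the Vinardo scoring
    function, with the docking box auto-defined around the most
    chemically similar co-crystallized ligand."""
    cmd = (f"smina --scoring vinardo "
           f"--receptor {receptor} --ligand {ligand} "
           f"--autobox_ligand {autobox_ligand} --autobox_add {autobox_add} "
           f"--exhaustiveness {exhaustiveness} --out {out}")
    return shlex.split(cmd)

# e.g. dock test ligand BACE_73 into the receptor from the BACE_10 co-crystal
# (hypothetical file names).
cmd = smina_crossdock_cmd("bace_10_receptor.pdbqt", "bace_73.pdbqt",
                          "bace_10_ligand.pdbqt", "bace_73_docked.pdbqt")
print(cmd[0])
```

In practice the command list would be passed to `subprocess.run(cmd)` inside a loop over all 154 test ligands.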
Using smina with the Vinardo scoring function, multiple docked poses were generated against the receptor structure from the co-crystal structure of BACE_10. The second-best pose (Vinardo affinity score: -11.2 kcal/mol) and the fourth-best pose (Vinardo affinity score: -8.8 kcal/mol) are shown in Fig. 2. Using the Rsim method, the second pose of BACE_73 was shown to be more 3D-similar to BACE_10 than the fourth pose, as its Rsim is lower (0.72 versus 1.36).
This 3D similarity algorithm works well when comparing docked poses to co-crystal structures. For all Stage 2 ligands, the docked poses were scored with this method, and the docked pose with the lowest Rsim value (highest 3D similarity) was picked for further analysis and affinity estimation.
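The exact Rsim formula is not reproduced here, but a difference-density-inspired score can be sketched as follows: place a Gaussian "pseudo-density" for each atom on a grid and integrate the absolute difference between the pose density and the reference density, normalized by the reference density. This is an illustrative assumption of the general idea (lower value = greater 3D similarity), not the paper's exact metric.

```python
import math

def density(atoms, grid, width=1.0):
    """Gaussian pseudo-density of a set of 3D atom positions on grid points."""
    rho = []
    for g in grid:
        total = 0.0
        for a in atoms:
            d2 = sum((gi - ai) ** 2 for gi, ai in zip(g, a))
            total += math.exp(-d2 / (2 * width ** 2))
        rho.append(total)
    return rho

def rsim(pose_atoms, ref_atoms, grid):
    """Difference-density-style dissimilarity: integrated absolute difference
    between pose and reference densities, normalized by the reference density.
    Lower values mean greater 3D similarity (0 for identical poses)."""
    rho_pose = density(pose_atoms, grid)
    rho_ref = density(ref_atoms, grid)
    return sum(abs(p - r) for p, r in zip(rho_pose, rho_ref)) / sum(rho_ref)

# Coarse toy grid and coordinates: a pose overlapping the reference scores
# lower (more similar) than a pose shifted away from it.
grid = [(x, y, z) for x in range(-2, 3) for y in range(-2, 3) for z in range(-2, 3)]
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
near = [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)]
far = [(0.0, 2.0, 0.0), (1.0, 2.0, 0.0)]
print(rsim(near, ref, grid) < rsim(far, ref, grid))  # True
```

Picking the pose with the lowest such score then amounts to a `min` over the scored poses.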

Training dataset compilation for affinity modeling of BACE-ligand
To build a BACE-specific affinity machine learning model, a dataset comprising 222 published BACE ligand affinities (label values ytraining, dimension: 222×1) was extracted from PDBbind v.2017, which contains affinity data for 14,761 protein-ligand complexes; see Table S2. [17] In this work, structure-based and/or ligand-based features were used as input features (X) for our machine learning models. To generate the structure-based features used for machine learning model training, ten AutoDock Vina-like scoring terms were generated for the 222 BACE ligands using smina (Table 1). To obtain the ten scoring values, the "score_only" functionality of smina was used, without conformational search or docking. The scores from smina were compiled as a training set (Xtraining_structure-based, dimension: 222×10). To generate the ligand-based chemoinformatic features, DataWarrior 4.7.2 was used to produce twenty-four ligand-based descriptors (Xtraining_ligand-based, dimension: 222×24) for the 222 training ligands (Table 2). When structure-based and ligand-based features were used together, a full training set (Xtraining, dimension: 222×34) was obtained by column-wise concatenation of the two feature blocks (Xtraining_structure-based + Xtraining_ligand-based).

Table 1 Ten structure-based AutoDock Vina terms generated by smina. Fine parameters o, w, c, g, b, s, i, j, and ^ are used in the generation of the terms. Briefly, o is the offset between atom pairs, w is the width of the Gaussian function, c is the distance cutoff, g is a good distance, b is a bad distance, s is a smoothing parameter, i and j are Lennard-Jones exponents, and ^ is the cap.
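The feature assembly described above reduces to concatenating the two feature blocks along the column axis. The zero-filled arrays below are placeholders for the actual smina scores and DataWarrior descriptors.

```python
import numpy as np

n_ligands = 222  # training ligands extracted from PDBbind v.2017

# Hypothetical placeholders for the two feature blocks described above.
X_structure = np.zeros((n_ligands, 10))  # ten smina "score_only" Vina terms
X_ligand = np.zeros((n_ligands, 24))     # twenty-four DataWarrior descriptors

# The combined training set concatenates the blocks column-wise: 222 x 34.
X_training = np.hstack([X_structure, X_ligand])
print(X_training.shape)  # (222, 34)
```

The same concatenation is applied to the test ligands so that training and test features share the same 34-column layout.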

Construction, training, and tuning of machine learning models for BACE-ligands affinity
After we obtained a training dataset (Xtraining) and training affinities (ytraining), five common machine learning models (Table 3) were constructed, refined, and compared, using structure-based and/or ligand-based features, in Python using scikit-learn.
To validate and compare machine learning models, the training dataset was first linearly rescaled to facilitate model convergence, then subjected to 10-fold cross-validation with the dataset shuffled ("random_state" set to one, for reproducibility). To refine each model, its hyperparameters were scrutinized and adjusted. For evaluation, the coefficient of determination (R²) over the 10-fold validations of each model was compared. The weights of the five Vina terms were refined via partial gradient descent of each weight until the overall RMSD reached a local minimum (Table 4; note that Vinardo also has modified term functions besides weights).
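The validation protocol above can be sketched with scikit-learn. The synthetic data, the choice of `MinMaxScaler` as the linear rescaling, and the linear model are assumptions for illustration; placing the scaler inside a pipeline ensures each fold is rescaled using only its own training split.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(222, 34))  # placeholder feature matrix
y = X @ rng.normal(size=34) + rng.normal(scale=0.1, size=222)  # synthetic affinities

# Linear rescaling inside a pipeline, then shuffled 10-fold cross-validation
# with random_state=1 for reproducibility, scored by R^2.
model = make_pipeline(MinMaxScaler(), LinearRegression())
cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(scores.mean(), scores.std() / np.sqrt(len(scores)))  # mean R^2 and its SE
```

The mean and standard error of the per-fold R² values correspond to the points and error bars plotted in Fig. 5.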
Although a statistically improved cross-docking performance was achieved for 229 BACE ligands deposited in the PDB, we discovered that the Vinardo scoring function yielded the best re-docking performance (lowest RMSD) for the macrocyclic BACE ligands that were the subject of D3R GC4. Since the Vinardo scoring function was shown to be effective in generating docked poses for macrocyclic BACE ligands (Table 5), it was used in our Stage 2 docking workflow.

Figure caption: A, BACE_26 (red) was found to be most chemically similar to BACE_3 (light pink), which has a released co-crystal structure, via the FragFp method in DataWarrior. Between two docked poses of BACE_26 generated by Vinardo scoring using smina, the pose with the lowest Rsim = 0.553 was used for structure-based affinity calculations. B, similarly, ligand BACE_137 (red) was docked to mimic the 3D structure of the most chemically similar ligand BACE_12 (light orange); the pose with the lowest Rsim = 0.564 was used for affinity modeling. ChemBioDraw and PyMOL were used to generate this figure.

Comparisons between machine learning models of BACE-ligand affinities
We obtained a dataset (Xtraining) of 222 BACE ligands deposited in the PDB with their affinities (ytraining) extracted from PDBbind v.2017 [17]. We investigated three aspects of applying machine learning in BACE-ligand affinity prediction: input feature selection, machine learning model selection, and machine learning model regularization.
First, for feature selection, we compared model performance based on structure-based features only (Table 1), ligand-based features only (Table 2), and a combination of both, using five machine learning models (Table 3). Results using only structure-based terms for affinity modeling are shown in Fig. 5A. The performance of using only ligand-based terms was also evaluated (Fig. 5B). Interestingly, all five machine learning models yielded comparable results, suggesting that linear regression adequately utilizes most of the information in the input features. The SVM performed the best among the models, with an R² of 0.62(13) for the 10-fold cross-validation set and 0.836(9) for the cross-training set.
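The five-model comparison can be sketched as a loop over scikit-learn estimators evaluated with the same shuffled 10-fold protocol. The synthetic data and the specific hyperparameters below (tree count, hidden layer sizes, alpha) are placeholders, not the tuned values from the study.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(222, 34))                               # placeholder features
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.2, size=222)   # synthetic affinities

# Stand-ins for the five models compared in Table 3 / Fig. 5.
models = {
    "LR": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "RF": RandomForestRegressor(n_estimators=50, random_state=1),
    "SVM": SVR(),
    "DNN": MLPRegressor(hidden_layer_sizes=(32, 32), alpha=1.0,
                        max_iter=2000, random_state=1),
}
cv = KFold(n_splits=10, shuffle=True, random_state=1)
results = {name: cross_val_score(make_pipeline(MinMaxScaler(), m), X, y,
                                 cv=cv, scoring="r2").mean()
           for name, m in models.items()}
for name, r2 in results.items():
    print(f"{name}: mean 10-fold R2 = {r2:.2f}")
```

Each feature set (structure-based, ligand-based, combined) is simply a different `X` fed through the same loop.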
When a combination of structure-based and ligand-based terms was utilized in the machine learning models, slightly improved performance was obtained across the different modeling methods (Fig. 5C). LR and regularized LR yielded equivalent performance, with R² = 0.59(13). This contrasts sharply with the performance of KDEEP, RF-Score, X-Score, and Cyscore, whose R values were -0.06, -0.14, -0.12, and 0.2, respectively, for 36 BACE ligands [19]. The DNN architecture and mapping matrices (Wn) are represented in Fig. 6C. Every neuron in the MLP takes a linear combination of the previous layer's outputs as input and applies a ReLU (rectified linear unit) activation to produce its output.
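Because performance was found to hinge on the L2 regularization hyperparameter alpha, scanning it over several orders of magnitude is the natural tuning step. The sketch below uses `GridSearchCV` with an MLP (ReLU activations); the grid values, layer sizes, and synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(2)
X = rng.normal(size=(222, 34))                              # placeholder features
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=222)     # synthetic affinities

# Scan the L2 penalty (alpha) of a ReLU MLP across orders of magnitude.
pipe = make_pipeline(MinMaxScaler(),
                     MLPRegressor(hidden_layer_sizes=(16, 16),
                                  activation="relu", max_iter=1000,
                                  random_state=1))
grid = GridSearchCV(pipe,
                    {"mlpregressor__alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1.0]},
                    cv=KFold(n_splits=5, shuffle=True, random_state=1),
                    scoring="r2")
grid.fit(X, y)
print(grid.best_params_["mlpregressor__alpha"])
```

The cross-validated R² for each alpha (in `grid.cv_results_`) shows how sharply the model's accuracy depends on this single hyperparameter.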
D3R released the performance of GC4 BACE Stage 2. Our results tied for the fifth best performance, with a Kendall's τ = 0.30(5) and a Spearman's ρ = 0.43(7). Numpy and scikit-learn were used to generate the values in this table.

Fig. 5
Performance of machine learning models of BACE affinity. Red boxes indicate the mean of R² for the 10-fold cross-validations; cyan boxes indicate the mean of R² for the training set. The error bars represent the standard error (SE) of the 10-fold cross-validation metric. A, only the ten structure-based Vina terms were used. B, only the twenty-four ligand-based terms (Table 2) were used. C, both the structure-based and ligand-based terms were applied. Numpy, Matplotlib, and scikit-learn were used to generate this figure.

Conclusions
Through participating in the D3R GC4 BACE Subchallenge, we investigated optimizing the AutoDock Vina docking scoring function and designed and compared machine learning models using a combination of structure-based and ligand-based terms. To generate docked poses for BACE macrocyclic ligands, an automated 3D-similarity pose selection script was shown to be effective in producing accurate docked poses. Five different machine learning models were explored, ranging in complexity from linear regression to a deep neural network. From tuning the different models, we found that hyperparameter tuning greatly affects the accuracy of protein-ligand affinity prediction.
This work shows that machine learning models are highly effective for protein-ligand affinity prediction when high-quality training datasets are available for the target protein. We found that the Vinardo scoring function, developed from a broad set of ligands, performed best for docking macrocyclic ligands to BACE. We expect docking performance can be further improved by careful choice and optimization of receptor structures. In contrast, a deep neural network trained specifically on BACE ligands performed best for affinity prediction. Affinity prediction can likely be improved by training on larger and more ligand/target-specific datasets, using deeper or more advanced neural networks such as convolutional neural networks, automating hyperparameter tuning, and carefully selecting a larger set of informative input features.