Protein-Ligand Interaction Graphs: Learning from Ligand-Shaped 3D Interaction Graphs to Improve Binding Affinity Prediction

Graph Neural Networks (GNNs) have recently gained in popularity, challenging molecular fingerprints or SMILES-based representations as the predominant way to represent molecules for binding affinity prediction. Although simple ligand-based graphs alone are already useful for affinity prediction, better performance on multi-target datasets has been achieved with models that incorporate 3D structural information. Most recent advances utilize complex GNN architectures to capture 3D protein-ligand information by incorporating ligand-interacting protein atoms as additional nodes in the graphs; or by building a second protein-based graph in parallel. This expands the graph considerably while obfuscating the shape of the underlying ligand, diminishing the advantage that GNNs have when encoding molecular structures. There is therefore a need for a simple and elegant molecular graph representation that retains the topology of the ligand while simultaneously encoding 3D protein-ligand interactions. We present Protein-Ligand Interaction Graphs (PLIGs): a simple way of representing atom-atom contacts of 3D protein-ligand complexes as node features for GNNs. PLIGs featurize an atom node in the molecular graph by describing each atom’s properties as well as all atom-atom contacts made with protein atoms within a distance threshold. The edges of the graph are therefore identical to ligand-based graphs, but the nodes encode the 3D protein-ligand contacts. Since PLIGs are applicable to any GNN architecture, we have benchmarked their performance with six different GNN architectures, and compared them to conventional ligand-based graphs and fingerprint-based multi-layer perceptron (MLP) models using the CASF-2016 benchmark set where we found PLIG-based Graph Attention Networks (GATNet) to be the best performing model (ρ=0.84, RMSE=1.22 pK). 
In summary, we created a novel graph-based representation that incorporates 3D structural information into the node features of ligand-shaped molecular graphs. The PLIG representation is simple, elegant, flexible and easily customizable, opening up many possibilities of incorporating other 2D and 3D properties into the graph. The code and implementation for PLIGs and all models can be found at github.com/MarcMoesser/Protein-Ligand-Interaction-Graphs.


Introduction
In early stage pre-clinical drug discovery, one of the most important properties of a small molecule drug is its binding affinity for the correct protein target. High binding affinity is crucial for the overall efficacy of a drug while the careful design of multi-target affinity profiles is highly desirable to avoid toxicity and side-effects. In addition, higher binding affinity allows drugs to be administered at lower doses to generate the desired efficacy, which reduces overall toxicity and increases practicality.
Computer-aided drug design (CADD) has been firmly established as a powerful technique in the drug discovery pipeline, especially machine-learning-based methods ([1]).
Interest in applying machine learning (ML) to the development of more accurate scoring functions that can predict protein-ligand binding affinity has grown in the last decade. Classical scoring functions use physics-based methods (force fields), linear combinations of (semi-)empirical terms, or knowledge-based potentials. They are used in popular docking software such as AutoDock4 ([2,3]), AutoDock Vina ([4]) or GOLD ([5]) to score 3D ligand poses. While these classical scoring functions perform well in docking and virtual screening tasks, they struggle with binding affinity prediction and ranking tasks ([6,7]).
More recently, ML-based scoring functions that can outperform classical scoring functions for binding affinity prediction have emerged. These models employ a diverse set of ML architectures and features, from classical ML techniques such as random forests and gradient boosted trees ([8,9,10]) to deep learning (DL) models ([11,12,13]). As the most accurate scoring functions use many different ML architectures, it is not yet clear how protein-ligand complexes should best be represented as features for model training. Older quantitative structure-activity relationship (QSAR) models, mostly used against a single protein target, use only ligand-based features such as ECFP fingerprints ([14]). Recent models built with the intent of creating scoring functions that generalise across multiple proteins and diverse ligands aim to incorporate 3D-structural information into their feature space. The main goal of including structure-based information is to create models with the right inductive bias that can learn the biophysics of protein-ligand interactions, rather than regurgitate biases in their training sets.
Graph neural networks (GNNs) have recently emerged as a powerful method for binding affinity prediction ([13,12]). Although simple ligand-based graphs alone are already useful for affinity prediction ([13]), much as purely ligand-based scoring functions use random forests ([9]), higher performance for affinity prediction and virtual screening tasks on multi-target datasets has been achieved with graphs that incorporate 3D-structural information ([12,15]). In order to encode 3D information, previous studies have chosen to expand the graph itself by including nodes corresponding to protein atoms that are close to the ligand ([12,15]). This expands the graph's complexity considerably while obfuscating the topology of the ligand, diminishing the advantage that GNNs might have when encoding molecular structures. We present Protein-Ligand Interaction Graphs (PLIGs) to solve this issue of incorporating structural data into GNNs. PLIGs incorporate 3D interactions directly into the atom nodes of molecular graphs, encoding the intermolecular contacts made by each ligand atom in the 3D protein-ligand complex. This enables graphs to retain the shape of the ligand, as only node features are altered. We benchmarked a large variety of modern GNN architectures: Graph Convolutional Networks (GCN, [16]), Graph Attention Networks (GATNet, [17]), Graph Isomorphism Networks (GIN, [18]), a combined GAT-GCN network [13], GraphSAGE [19] and Simple Graph Convolutional Networks (SGC, [20]). GATNet performs best when featurized with PLIGs, outperforming other well-known scoring functions such as PLEC ([10]), KDEEP ([11]) and the graph-based SIGN ([12]) when tested on the CASF-2016 benchmark set.

Training and test sets
We investigated the ability of PLIGs with different graph-based neural networks (GNNs) to predict protein-ligand binding affinity. For training and testing, we used the PDBbind database ([21,22]), a curated set of 3D structures of protein-ligand complexes and their corresponding experimentally determined binding affinities, obtained from the Protein Data Bank (PDB, [23]). In order to use the most up-to-date data, we took the PDBbind 2020 General Set and supplemented it with data from the PDBbind 2016 Refined Set, for a total of 19451 protein-ligand complexes. Additionally, for evaluation against docked poses rather than crystal poses, we subjected the combined dataset to the pre-processing and docking procedure described in Methods Section 2.2, resulting in 14981 valid, docked protein-ligand complexes with the corresponding crystal poses. The dataset was randomly split into training (14254 data points) and validation (455 data points) sets. We used the PDBbind 2016 "Core Set", also referred to as the "CASF-2016 set", as our test set, since it was used as the "scoring power" benchmark in the CASF exercise in 2016 ([24]). Our test dataset excluded 13 protein-ligand complexes that could not complete the docking procedure, failing either at the ligand preparation or the post-docking quality control stage (for a breakdown of the processing see Methods Section 2.2), resulting in 272 valid data points. The CASF-2016 test set has been widely used in the community for evaluating scoring function performance ([9,8,12]), and therefore allows us to compare our models to previously published scoring function benchmarks.
Each model was trained to predict the inhibition constant Ki, the dissociation constant Kd, or the half-maximal inhibitory concentration IC50, depending on which was provided by PDBbind for each complex. For the purpose of this study, these values were considered interchangeable and are henceforth referred to as "the binding constant, K". This practice of combining two or all three affinity values (IC50, Ki and Kd) has been used in previous, similar studies ([9,8]). For training and performance evaluation, the negative base-10 logarithm of K, commonly denoted pK, was used: pK = -log10(K). For every model, we evaluated performance by computing the Pearson correlation coefficient (ρ) as well as the root-mean-square error (RMSE, in pK units) between the predicted and the experimentally determined pK value for every complex in the test set. Cross validation was performed using a 5-fold split of the training set, excluding the validation and test sets. Details about cross validation can be found in SI Section 3. Performance on the CASF-2016 benchmark was evaluated for every model by training and testing 10 times using different random seeds on the same dataset split.
Then, the ρ and RMSE were calculated and reported as the average over the 10 runs, with the corresponding standard deviation indicating the stability of the model. Model runs were also combined into ensemble models by averaging the prediction for each protein-ligand complex in the test set across the 10 models, and calculating the ρ or RMSE between the average prediction and the true value.
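The pK conversion, evaluation metrics and ensemble averaging described above can be sketched in a few lines of plain Python (a minimal illustration; function and variable names are our own):

```python
import math

def pK(K):
    """Convert a binding constant K (molar) to pK = -log10(K)."""
    return -math.log10(K)

def pearson_r(x, y):
    """Pearson correlation coefficient ρ between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rmse(pred, true):
    """Root-mean-square error (in pK units)."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

def ensemble(predictions):
    """Average per-complex predictions across model runs.
    `predictions` is a list of per-model prediction lists."""
    return [sum(col) / len(col) for col in zip(*predictions)]

# Example: two model runs over three test complexes
runs = [[5.1, 7.2, 6.0], [5.3, 7.0, 6.4]]
avg = ensemble(runs)  # ≈ [5.2, 7.1, 6.2]
```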

Protein-ligand docking methods
First, ligand files provided by PDBbind were processed using the cheminformatics toolkit RDKit.

Architecture of machine-learning models
We explored a wide variety of graph-based neural networks (GNNs) as well as a multilayer perceptron neural network (MLPNet) in a two-branch setup (Figure 2.1), with either the GNN or MLPNet embedding the ligand and/or protein structure in the first branch, and a second branch encoding the protein sequence with a 1D convolutional neural network. Both branches feed into the readout layers, consisting of three fully-connected layers. This two-branch architecture is adapted from the GraphDTA architecture described by Nguyen et al. [13]. All models were implemented using PyTorch v1.9.0 ([30]) and PyTorch Geometric v2.0.0 ([31]). The architecture and implementation of the graph convolutional network (GCN, [16]), graph attention network (GAT, [17]), graph isomorphism network (GIN, [18]), and the combined GAT-GCN were directly adopted from [13], with the only changes being the selection of optimal hyperparameters after hyperparameter tuning (see SI Section 1). In addition, we incorporated GraphSAGE ([19]) as well as Simple Graph Convolutional Networks (SGC, [20]) into the GraphDTA architecture using the PyTorch Geometric SAGEConv and SGConv layers.

Figure 2.1: Abstract general architecture of the models with the ligand branch (green) and the sequence branch (yellow). The ligand is either embedded as a fingerprint vector in the MLPNet branch (leftmost branch), or a molecular graph is created for the GNN branch (middle branch). On the right branch, the protein embedding is shown, where a 1D-convolutional layer is fed with the embedded protein amino acid sequence. The outputs from the protein and ligand branches are concatenated and fed into the three fully-connected readout layers, and a single pK prediction is made.
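As an illustration of the two-branch idea, the following is a minimal PyTorch sketch of an MLPNet-style ligand branch combined with a 1D-convolutional sequence branch and a three-layer readout. This is not the published implementation: layer sizes, the amino-acid vocabulary size and the kernel width are illustrative assumptions (the actual hyperparameters follow GraphDTA and SI Section 1).

```python
import torch
import torch.nn as nn

class TwoBranchNet(nn.Module):
    def __init__(self, fp_dim=512, vocab=26, emb_dim=128):
        super().__init__()
        # Ligand branch: MLP over a fingerprint vector
        self.ligand = nn.Sequential(
            nn.Linear(fp_dim, 256), nn.ReLU(), nn.Linear(256, 128))
        # Protein branch: embedded amino-acid sequence -> 1D convolution
        self.embed = nn.Embedding(vocab, emb_dim)
        self.conv = nn.Conv1d(emb_dim, 32, kernel_size=8)
        self.pool = nn.AdaptiveMaxPool1d(1)
        # Readout: three fully-connected layers over the concatenated branches
        self.readout = nn.Sequential(
            nn.Linear(128 + 32, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1))

    def forward(self, fp, seq):
        lig = self.ligand(fp)                                # (B, 128)
        s = self.embed(seq).transpose(1, 2)                  # (B, emb_dim, L)
        s = self.pool(torch.relu(self.conv(s))).squeeze(-1)  # (B, 32)
        return self.readout(torch.cat([lig, s], dim=1)).squeeze(-1)

model = TwoBranchNet()
out = model(torch.randn(2, 512), torch.randint(0, 26, (2, 1000)))  # shape (2,)
```

In the GNN variants, the ligand branch is replaced by a graph encoder (e.g. a PyTorch Geometric GATConv stack) producing the same 128-dimensional ligand embedding.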

Ligand-based graphs
Ligand-based graphs were generated from the bonds and atoms of a given small molecule. Each atom (node) is represented as a one-hot encoded 40-dimensional feature vector. The feature set used by GraphDTA ([13]) consisted of the following one-hot encoded features: atomic symbol; number of adjacent heavy atoms; number of adjacent hydrogens; implicit valence; and whether the atom is in an aromatic ring. We expanded this set with additional features: a boolean variable describing whether the atom is in a ring; the one-hot encoded formal charge; and the one-hot encoded hybridization type; and we changed the encoding of the implicit valence to the explicit valence. Since PDBbind ([21,22]) carefully pre-processes ligands and assigns physiologically relevant atom charges, the inclusion of features such as the explicit valence and the formal charge gives a more complete description of the molecule. All features were calculated using RDKit (https://www.rdkit.org/, v2021.03.5).
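The node featurization above can be sketched in pure Python. The category lists below are illustrative assumptions, not the exact 40-dimensional feature set; in the real implementation each property is read from an RDKit Atom object (e.g. GetSymbol(), GetFormalCharge(), GetHybridization()), here replaced by a plain dict for self-containment:

```python
def one_hot(value, choices):
    """One-hot encode `value` over `choices`; unknown values map to all zeros."""
    return [1 if value == c else 0 for c in choices]

# Illustrative category lists -- the real feature set follows the RDKit
# properties named in the text and sums to 40 dimensions.
SYMBOLS = ["C", "N", "O", "S", "F", "Cl", "Br", "I", "P"]
HYBRIDIZATIONS = ["SP", "SP2", "SP3"]
CHARGES = [-1, 0, 1]

def atom_features(atom):
    """`atom` is a dict standing in for an RDKit Atom."""
    return (one_hot(atom["symbol"], SYMBOLS)            # atomic symbol
            + one_hot(atom["degree"], list(range(6)))   # adjacent heavy atoms
            + one_hot(atom["num_h"], list(range(5)))    # adjacent hydrogens
            + one_hot(atom["valence"], list(range(7)))  # explicit valence
            + [int(atom["aromatic"]), int(atom["in_ring"])]
            + one_hot(atom["charge"], CHARGES)          # formal charge
            + one_hot(atom["hybrid"], HYBRIDIZATIONS))  # hybridization

carbon = {"symbol": "C", "degree": 3, "num_h": 1, "valence": 4,
          "aromatic": True, "in_ring": True, "charge": 0, "hybrid": "SP2"}
vec = atom_features(carbon)  # length 9+6+5+7+2+3+3 = 35 in this toy version
```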

Small molecule fingerprints
The molecular fingerprints used for the MLPNet models were calculated using RDKit (https://www.rdkit.org/, v2021.03.5). We investigated the effect of using Morgan fingerprints with the RDKit implementations of Extended-Connectivity Fingerprints and Functional-Class Fingerprints (ECFP and FCFP, respectively; [14]) with a radius of 2 and 512- or 1024-bit vectors. These are referred to as "ECFP512", "ECFP1024", "FCFP512" and "FCFP1024", respectively.
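For reference, a minimal RDKit sketch of this fingerprint generation (the SMILES string is an arbitrary stand-in ligand; the FCFP variant uses the same call with feature-based invariants):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example

# ECFP-style Morgan fingerprint: radius 2, 512 bits
ecfp512 = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=512)
# FCFP-style variant: same call with feature-based (pharmacophoric) invariants
fcfp512 = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=512,
                                                useFeatures=True)

bits = list(ecfp512)  # 0/1 vector used as the MLPNet input
```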

Protein-Ligand Interaction Graphs (PLIGs)
The idea of Protein-Ligand Interaction Graphs (or "PLIGs") was inspired by the work of Sánchez-Cruz et al. [8], who introduced Extended Connectivity Interaction Features (ECIF): fingerprints that describe a protein-ligand complex by counting atom-atom contacts between defined protein and ligand atom types. One major disadvantage of the ECIF fingerprints is that the dimensionality of the vector depends on the dataset. An ECIF-featurized model needs to be retrained from scratch every time the model is asked to evaluate a ligand with a new "ECIF atom type" that was not represented in the original dataset. Given the diversity of chemical space, this is highly likely, especially in early stage drug discovery, where a diverse set of molecules might be screened or novel scaffolds explored for candidate optimization. PLIGs overcome this limitation by only defining atom types on the protein side, limited to the 20 proteinogenic amino acids (inclusion of more amino acids is also possible if needed), thus eliminating the problem of unknown atom types. Therefore, any new protein-ligand complex can be scored with a pre-trained PLIG model, regardless of the ligand structure. This is especially useful for building prospective models on targets where little structural information is known, docking is needed to generate ligand poses, and a diverse set of new ligands needs to be screened.
Additionally, the PLIG architecture is able to represent 3D protein-ligand complexes in a ligand-based graph, without needing additional edges and nodes, only changing the node feature vector.
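The contact-counting idea behind the PLIG node features can be sketched as follows. This is a simplified illustration: protein atom types appear here as plain "residue:element" strings, the coordinates are toy values, and the 4.5 Å default threshold is an arbitrary example (the study compares several thresholds; the actual atom typing is defined on the protein side as described above).

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def plig_contacts(ligand_xyz, protein_atoms, threshold=4.5):
    """For each ligand heavy atom, count contacts with protein atoms within
    `threshold` Å, keyed by protein atom type. Because the type vocabulary
    lives entirely on the protein side, it is fixed for any new ligand.
    ligand_xyz: list of (x, y, z); protein_atoms: list of (type, (x, y, z))."""
    per_atom = []
    for lx in ligand_xyz:
        counts = Counter()
        for ptype, px in protein_atoms:
            if dist(lx, px) <= threshold:
                counts[ptype] += 1
        per_atom.append(counts)
    return per_atom

ligand = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
protein = [("ALA:C", (1.0, 0.0, 0.0)), ("ALA:N", (3.0, 0.0, 0.0)),
           ("SER:O", (10.5, 0.0, 0.0))]
contacts = plig_contacts(ligand, protein)
# atom 0 contacts ALA:C and ALA:N; atom 1 contacts only SER:O
```

Each per-atom Counter is then concatenated with the ligand-based atom features to form the PLIG node feature vector, leaving the graph's edges untouched.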

Quality of Docked Poses
The

Model Combinations
For this study, six different GNN architectures and one MLPNet architecture were employed. The GNN models were featurized using either ligand-based graphs or PLIGs. The MLPNet models were featurized with structure-based fingerprints (ECIF, [8]) or ligand-based fingerprints (ECFP512, ECFP1024, FCFP512, FCFP1024). All resulting models were trained and tested either in a two-branch architecture with an additional protein sequence encoding branch (see Figure 2.1), or by themselves as standalone models. This resulted in the 34 models listed in Table 3.1. Each model was trained and tested separately on crystallographic and docked poses.
In addition to the models listed in Table 3.1, six different multi-model ensembles were created as described in Section 2.1. The list of created multi-model ensembles is given in Table 3.2. All multi-model ensembles were trained and tested separately on crystallographic and docked poses.

Ensemble Model Performance
Since

Model Performance on Crystal Poses
Performance of the PLIG and ECIF-based models when trained and tested on crystal poses is shown in Figure 3.1. ECIF fingerprints perform well in this setting, so their performance is not tied to the original random forest and gradient boosted tree models: they can be used effectively in deep learning as well as in classical machine learning models. We therefore show that (1) PLIGs are a high-performing graph representation of protein-ligand complexes, able to elevate GNN-based models to compete with some of the best protein-ligand binding affinity prediction methods such as ECIF ([8]), KDEEP ([11]) or AEScore ([32]); and (2) ECIF fingerprints are effective in a simple feed-forward neural network (MLPNet).
For all PLIG models, the inclusion of the protein sequence branch in the model decreases performance, while it does not significantly affect the performance of the ECIF fingerprint model (Figure 3.1).
Since PLIGs already encode the structure of the protein (at least the part near the ligand), the inclusion of 1D protein sequence information might introduce confounding information or noise, hindering PLIG model performance. In addition, models containing the protein branch are less stable during cross validation, reaching maximum performance after only a small number of epochs and therefore raising concerns of overfitting (SI Section 3). As a result, we recommend using the single-branch GNN or MLPNet implementations of our models, without the protein sequence branch, in future studies, with the GATNet PLIG implementation performing best. The best performing single model, the GATNet PLIG without protein sequence embedding (ρ=0.84, RMSE=1.24 pK), was not improved further through multi-model ensembles.

Model Performance on Docked Poses
Overall, the performance of all models decreased when trained and tested on docked poses in comparison to crystal poses (Figure 3.1 a, b). As with crystal poses, the GATNet PLIG model (no sequence) was the best performing graph model (ρ=0.77, RMSE=1.43 pK), together with the GAT-GCN PLIG model (no sequence; ρ=0.77, RMSE=1.39 pK), when trained and tested on docked poses. However, the MLPNet ECIF model (with sequence) performed better (ρ=0.82, RMSE=1.30 pK; see Figure 3.1 a, b), losing only slightly in comparison to the crystal pose results. As mentioned, models that included the protein sequence embedding were less robust during cross validation (SI Section 3). Therefore, when comparing the MLPNet and GATNet PLIG models without sequence embedding, the difference in ρ decreases to 0.03 (MLPNet ECIF no sequence: ρ=0.8; GATNet PLIG no sequence: ρ=0.77) and the difference in RMSE disappears (RMSE=1.43 pK for both models). The models were also trained on crystal structures and tested on docked poses (SI Figure 4.12), but no significant difference to models trained and tested on docked poses was observed. Overall, the drop in performance from crystal to docked poses for structure-based methods is to be expected, as docked poses tend to diverge from the original crystal structures, and the contacts identified from them might not reflect the actual contacts present in the crystal structure. Nonetheless, GATNet PLIG (no sequence) models perform best among GNN-based methods and are on a par with the MLPNet ECIF (no sequence) model.

Model Performance with Ligand-based Features
When using purely ligand-derived graphs, no significant difference in performance between the different graph architectures was observed. The ligand-based fingerprint models (MLPNet using ECFP and FCFP fingerprints) performed surprisingly well, especially when combined with the sequence embedding, reaching a ρ value of up to 0.8 (Figure 3.1 c; MLPNet FCFP1024, ECFP512 and ECFP1024 models). Overall, purely ligand-based models do not perform as well as the structure-based models trained and tested on crystal structures, but they perform comparably to the structure-based models trained and tested on docked poses (Figure 3.1 a-c, e-g). This shows that the additional noise introduced into the dataset via docking leaves structure-based models performing only as well as ligand-based models that use no 3D data at all. The same trend was observed by Boyles et al. [33] for random forest-based models.

Model Performance of Multi-model Ensembles
Since intra-model ensembles increased the performance of all models, we tested the impact of multi-model ensembles that combine structure-based and ligand-based models. Such combinations have been previously shown to recover most of the performance lost when replacing crystal with docked poses (random forest models by Boyles et al. [33]). We observe the same effect when using graph neural networks and feed-forward neural networks (MLPNet) in an ensemble model framework. Being able to apply model ensembles has the advantage that models do not have to be retrained once the decision has been made to include additional ligand features. Rather, a single new ligand-based model can be created and combined post-prediction with the structure-based models to rescue performance.
Overall, the performance of multi-model ensembles is fairly similar across models, and small differences do not necessarily correlate to superior or inferior models. The GATNet PLIG models perform best, outperforming all other models on crystal poses and when combined with ligand-based models in a multi-model ensemble regain top performance on docked poses.

Influence of Proximity Threshold on Performance
We further investigated the effect of varying the protein-ligand contact proximity threshold used during PLIG generation on the performance of the GATNet PLIG model. Performance was comparable for PLIGs generated with 5 Å and 6 Å thresholds, with no significant advantage of one over the other, which is in line with the results observed by Sánchez-Cruz et al. [8], who found 6 Å to be the optimal threshold for ECIF fingerprints in random forest and gradient boosted tree models. Furthermore, performance dropped for all thresholds when training and testing on docked poses (Table 3).

Model Generalizability
In order to assess the ability of the best performing GATNet PLIG model to generalize between different protein families, protein-ligand pairs in the training dataset were eliminated based on their sequence identity to proteins represented in the CASF-2016 benchmark, using five threshold levels between 50% and 100% identity. The GATNet PLIG (no sequence) model was trained on the reduced dataset and tested against the full CASF-2016 dataset. As expected, performance decreases with stricter identity thresholds (Figure 3.2), with the largest drop in performance observed between the original full dataset and the elimination of 100% identical proteins from the training set (meaning that only proteins in the training set with sequences identical to proteins in the test set are removed). This observation is in line with similar experiments, such as those reported by Boyles et al. [9], where model performance decreased, regardless of random forest model featurization, when test set-similar protein structures were eliminated from the training set. However, part of the decrease in performance could be due to the reduced dataset size when eliminating data points from the training set (SI Table 7.1). Nonetheless, as CASF-2016 was deliberately chosen to be representative of the proteins present in the PDBbind refined set, eliminating this bias through techniques such as the sequence identity filtering described here would give a fairer evaluation of generalizability, and we recommend such filtering for future evaluations of protein-ligand affinity scoring functions.
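The filtering step above can be sketched as follows. This is a crude illustration: difflib's similarity ratio is only a stand-in for a proper sequence identity computed from an alignment (real pipelines use tools such as BLAST or MMseqs2), and the sequences and threshold are toy examples.

```python
from difflib import SequenceMatcher

def identity(seq_a, seq_b):
    """Crude global sequence-identity proxy via difflib's similarity ratio.
    NOTE: a stand-in for alignment-based identity, for illustration only."""
    return SequenceMatcher(None, seq_a, seq_b).ratio()

def filter_training_set(train, test_seqs, max_identity=0.9):
    """Drop training complexes whose protein exceeds `max_identity` to ANY
    test-set protein. `train` is a list of (sequence, affinity) pairs."""
    return [(seq, y) for seq, y in train
            if all(identity(seq, t) <= max_identity for t in test_seqs)]

train = [("MKVLAT", 6.2), ("MKVLAA", 7.1), ("GGGGGG", 5.0)]
test_seqs = ["MKVLAT"]
kept = filter_training_set(train, test_seqs, max_identity=0.9)
# the exact-match protein "MKVLAT" is removed from the training set
```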

Discussion
The recent interest in GNN-based scoring functions for protein-ligand binding affinity prediction has led to an arms race in complexity. New models that try to incorporate structural features into GNNs add protein nodes to the graph ([12,15]), which changes the shape of the graph and introduces unnecessary model complexity. In this study, we presented Protein-Ligand Interaction Graphs (PLIGs), which encode 3D protein-ligand contacts directly into the node features of ligand-shaped graphs, and found Graph Attention Networks (GATNet) to be the best architecture for predicting binding affinity using PLIGs. This might be due to the self-attention mechanism of GATs, which allows the weighting factor of each node to be defined implicitly in relation to its neighboring nodes, instead of being explicitly defined or learned. Since the protein contacts made by each atom node, as well as the nodes themselves, are more or less important depending on their surrounding neighborhoods, the ability of GATs to elicit the contextually-relevant importance of different neighbors could contribute to their increased performance. In addition, simplicity in model design seems to be desirable for PLIG models, since the hybrid branch architecture that combines the protein sequence and the PLIG-based GNN performs worse than models with just PLIG-based GNNs.
Finally, we have shown that model ensembles can be a powerful tool in overcoming the performance drop of binding affinity scoring functions when going from training on crystal to training on docked poses. Rather than creating a hybrid model by adding ligand-based features such as ECFP fingerprints directly into the feature space of a structure-based model (as described by Boyles et al. [9]), a simple ensemble of a structure-based model and a ligand-based fingerprint model can recover or even increase performance to levels higher than the individual models by themselves. However, ensembling different structure-based methods does not seem to improve performance significantly, at least for the GNN PLIG and MLPNet ECIF models described here. Additional investigation would be required to test if a more diverse set of model ensembles could improve performance further.
This study shows that simply including proximity-based protein-ligand contacts in the atomic nodes of the ligand's molecular graph boosts the performance of graph neural networks when predicting protein-ligand binding affinity. As a result, it opens up a large space for exploring which other node features might be included in molecular graphs. Rather than increasing the size of the graph, adding nodes or increasing complexity, this method of incorporating 3D-structural information into ligand graphs is simple, powerful and compatible with a large variety of GNN architectures. Furthermore, the PLIG architecture can serve as a guide for researchers investigating which other kinds of protein-ligand complex representations might be valuable in GNNs: by following the steps laid out to create PLIGs, the general contact information can be replaced with other kinds of contact information, biophysical quantities or representations of 3D protein-ligand complex information.
Lastly, further study into the feature importance and interpretability of PLIGs could enable a deeper understanding of which individual ligand atoms and protein-ligand contacts drive high affinity since PLIG atom node features are fully interpretable.