Graph Attention Site Prediction (GrASP): Identifying Druggable Binding Sites Using Graph Neural Networks with Attention

Identifying and discovering druggable protein binding sites is an important early step in computer-aided drug discovery, but it remains a difficult task, and most campaigns rely on a priori knowledge of binding sites from experiments. Here we present a novel binding site prediction method called Graph Attention Site Prediction (GrASP) and re-evaluate assumptions in nearly every step of the site prediction workflow, from dataset preparation to model evaluation. GrASP achieves state-of-the-art performance at recovering binding sites in PDB structures while maintaining a high degree of precision, which will minimize wasted computation in downstream tasks such as docking and free energy perturbation.


Atom-Scale Features
• Atom Type
• Local Density of Atoms (9 features, density within spheres ranging from 2-10 Å)
• Solvent Accessible Surface Area

I. SC-PDB DATASET PREPARATION
Unlabeled ligands in the modified sc-PDB dataset were identified by matching their chemical composition to labeled ligands from an entry with the same PDB ID. This criterion was chosen to avoid adding ligands not present in the original sc-PDB, consistent with the sc-PDB's requirement that ligands must be biologically relevant. The count of each non-hydrogen element in the ligand was compared to that of the labeled ligands, and matches were recorded. The relatively loose criterion of element count was chosen to avoid false negatives due to inconsistencies in bond perception. Ligands with a different ordering of elements, different Sybyl atom types, or different residue names were flagged for visual inspection to confirm they were not identified as duplicates in error.
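The element-count comparison can be illustrated with a minimal sketch. The function names and the input format (a list of element symbols per ligand, and a dictionary of labeled ligands) are hypothetical and are not taken from the GrASP code; the sketch only shows the matching criterion and omits the subsequent visual inspection.

```python
from collections import Counter

def heavy_atom_counts(elements):
    """Count each non-hydrogen element given a ligand's element symbols."""
    return Counter(e.capitalize() for e in elements if e.upper() != "H")

def find_matches(unlabeled_elements, labeled_ligands):
    """Return names of labeled ligands with the same heavy-atom element counts.

    labeled_ligands is a hypothetical mapping: ligand name -> list of element symbols.
    Atom order is ignored, so ligands with differently ordered atoms still match.
    """
    query = heavy_atom_counts(unlabeled_elements)
    return [name for name, elems in labeled_ligands.items()
            if heavy_atom_counts(elems) == query]

# Toy element lists for illustration only (not real ligand compositions)
labeled = {"LIG_A": ["C", "C", "N", "O", "H"], "LIG_B": ["C", "FE", "N"]}
matches = find_matches(["N", "C", "O", "C", "H", "H"], labeled)
```

Because only counts are compared, hydrogens and bond orders never enter the comparison, which is what makes the criterion robust to differences in bond perception.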
Unlabeled ligands were identified as buried using the ratio of solvent accessible surface area (SASA) in the protein complex to SASA in vacuum. Ligands were classified as buried if they were either below 30% solvent accessible or if their fraction accessible was no more than 10% above their labeled counterpart's fraction. These criteria use a conservative definition of buried ligands, with 30% accessible surface chosen because 95% of the labeled ligands fall below this threshold. The second criterion accounts for ligands that have over 30% surface exposure in their binding modes. We found that ligands with long tails may be up to 60% accessible in their labeled binding mode, and the comparison to the labeled ligands identifies the small number of cases where unlabeled ligands are symmetric to their labeled counterparts but over 30% accessible. Using two criteria allows the base threshold of 30% to be low enough to avoid false positives, while the comparison to the labeled fraction accessible catches false negatives that would arise from unique binding modes.
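A minimal sketch of the two-part burial test is given below. The function and parameter names are hypothetical, and the sketch assumes that "10% above" means 10 percentage points of fraction accessible (an absolute margin); the text does not say whether an absolute or relative margin was used.

```python
def fraction_accessible(sasa_complex, sasa_vacuum):
    """Ratio of ligand SASA in the protein complex to ligand SASA in vacuum."""
    return sasa_complex / sasa_vacuum

def is_buried(frac_unlabeled, frac_labeled, base_threshold=0.30, margin=0.10):
    """Two-part burial test for an unlabeled ligand.

    Criterion 1: less than 30% of the ligand surface is solvent accessible.
    Criterion 2: the fraction accessible is at most `margin` above that of the
    labeled counterpart (assumed here to be an absolute margin).
    """
    below_base = frac_unlabeled < base_threshold
    near_labeled = frac_unlabeled <= frac_labeled + margin
    return below_base or near_labeled
```

For example, an unlabeled copy that is 55% accessible passes only through the second criterion when its labeled counterpart is 50% accessible, which corresponds to the long-tailed ligands described above.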

II. BINDING SITE LABELS
We train GrASP using continuously valued binding site labels that follow a sigmoid function of distance, where y_i is the class label for protein atom i and d_i is the distance from protein atom i to the nearest ligand heavy atom.
This can be viewed as a smoothed version of a 5 Å binding site definition where the labels decrease from 1 to 0 in roughly the region between 4-6 Å, as opposed to a discrete boundary at 5 Å. Both the midpoint and slope of this sigmoid were tuned as hyperparameters to optimize the top N DCA recall on the validation set.
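The behavior of such a label can be sketched with a standard logistic function. This is an illustration under stated assumptions, not the tuned labeling function: the midpoint of 5 Å follows from the description above, while the slope of 5 Å⁻¹ is a placeholder chosen only so that the transition falls roughly between 4 and 6 Å. The function name and parameters are hypothetical.

```python
import numpy as np

def soft_site_label(d, midpoint=5.0, slope=5.0):
    """Smoothed binding site label as a decreasing logistic function of distance.

    d: distance(s) in angstroms from a protein atom to the nearest ligand heavy atom.
    midpoint: distance at which the label equals 0.5 (5 A, per the text).
    slope: steepness in 1/A (placeholder value, NOT the tuned hyperparameter).
    """
    d = np.asarray(d, dtype=float)
    return 1.0 / (1.0 + np.exp(slope * (d - midpoint)))

# Labels for atoms at 3, 4, 5, 6, and 7 A from the nearest ligand heavy atom
labels = soft_site_label([3.0, 4.0, 5.0, 6.0, 7.0])
```

With these placeholder values the label is above 0.99 at 4 Å, exactly 0.5 at 5 Å, and below 0.01 at 6 Å, so atoms near the conventional cutoff contribute a graded rather than binary training signal.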

III. CONVEX HULL CENTER CALCULATION
To calculate the center of a convex hull, the hull is treated as a solid object with uniform density and its center of mass is calculated. This is accomplished by breaking the hull into tetrahedrons and taking the volume-weighted average of these tetrahedrons' centroids.
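The tetrahedral decomposition can be sketched as follows. The function name and input format are hypothetical. The sketch assumes the hull surface is already available as triangles (for instance, the `simplices` attribute of `scipy.spatial.ConvexHull` provides triangle vertex indices for 3D input) and uses the mean of the hull vertices as the shared apex of all tetrahedra, which is one convenient choice of interior point.

```python
import numpy as np

def convex_hull_center(points, faces):
    """Center of mass of a convex hull treated as a uniform-density solid.

    points: (n, 3) array of coordinates.
    faces:  (m, 3) array of vertex indices for the triangles on the hull surface.
    """
    points = np.asarray(points, dtype=float)
    faces = np.asarray(faces, dtype=int)
    # Shared apex: mean of the hull vertices, which lies inside a convex hull
    apex = points[np.unique(faces)].mean(axis=0)
    tri = points[faces]  # shape (m, 3, 3): three corners per triangle
    a, b, c = tri[:, 0] - apex, tri[:, 1] - apex, tri[:, 2] - apex
    # Tetrahedron volume = |a . (b x c)| / 6
    volumes = np.abs(np.einsum("ij,ij->i", a, np.cross(b, c))) / 6.0
    # Tetrahedron centroid = mean of its four vertices
    centroids = (tri.sum(axis=1) + apex) / 4.0
    return (volumes[:, None] * centroids).sum(axis=0) / volumes.sum()

# Square pyramid: base corners at z = 0, tip at z = 1
pyramid = [[1, 1, 0], [1, -1, 0], [-1, -1, 0], [-1, 1, 0], [0, 0, 1]]
pyramid_faces = [[0, 1, 2], [0, 2, 3], [0, 1, 4], [1, 2, 4], [2, 3, 4], [3, 0, 4]]
center = convex_hull_center(pyramid, pyramid_faces)  # approximately [0, 0, 0.25]
```

The pyramid shows why the volume weighting matters: the plain average of the five vertices sits at z = 0.2, whereas the center of mass of the solid is at one quarter of the height, z = 0.25.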

IV. SEMANTIC SEGMENTATION METRICS
We use the following metrics to evaluate the performance of semantic segmentation:
• Area under the receiver operating characteristic curve (ROC AUC): A classification threshold invariant metric that measures the trade-off between the true positive rate and false positive rate as the classification threshold is varied. Here we use macro averaging to give equal weight to each category instead of each sample because there is heavy class imbalance between site and non-site atoms.
• Area under the precision-recall curve (PR AUC): A classification threshold invariant metric that measures the trade-off between precision and recall as the classification threshold is varied. Similar to ROC AUC but less sensitive to class imbalance.
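Both metrics can be computed directly from per-atom scores, as in the minimal sketch below. It makes several assumptions that are not from the source: labels are binary (site vs. non-site), ROC AUC is computed through its pairwise-ranking interpretation, the area under the precision-recall curve is estimated by average precision (a step-wise summary, with tied scores not given special treatment), and macro averaging is omitted. The function names are hypothetical.

```python
import numpy as np

def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as one half.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]  # all positive-negative score differences
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def average_precision(labels, scores):
    """Average precision: mean of the precision values at the rank of each positive."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].mean())

# Toy example: 2 site atoms among 5 atoms
y_true = [1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.3, 0.1]
auc = roc_auc(y_true, y_score)            # 5/6
ap = average_precision(y_true, y_score)   # 5/6
```

The pairwise form of ROC AUC uses memory proportional to the number of positive-negative pairs, so it is suitable for illustration on small inputs rather than for whole-dataset evaluation.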