Refining Embedding-Based Binding Predictions by Leveraging AlphaFold2 Structures

Background. Identifying residues in a protein involved in ligand binding is important for understanding its function. bindEmbed21DL is a Machine Learning method that predicts protein-ligand binding on a per-residue level using embeddings derived from the protein Language Model (pLM) ProtT5. This method relies solely on sequences, making it easily applicable to all proteins. However, highly reliable protein structures are now accessible through the AlphaFold Protein Structure Database or can be predicted using AlphaFold2 and ColabFold, allowing the incorporation of structural information into such sequence-based predictors.

Results. Here, we propose bindAdjust, which leverages predicted distance maps to adjust the binding probabilities of bindEmbed21DL and thereby boost performance. bindAdjust raises the recall of bindEmbed21DL from 47±2% to 53±2% at a precision of 50% for small molecule binding. For binding to metal ions and nucleic acids, bindAdjust serves as a filter to identify good predictions focusing on the binding site rather than isolated residues. Further investigation of two examples shows that bindAdjust is in fact able to add binding predictions which are not close in sequence but close in structure, extending the binding residue predictions of bindEmbed21DL to larger binding stretches or binding sites.

Conclusion. Due to its simplicity and speed, the algorithm of bindAdjust can easily refine binding predictions from tools other than bindEmbed21DL and, in fact, could be applied to any protein prediction task.


Introduction
The correct functioning of proteins is only possible under specific conditions, often involving the binding of specific molecules, called ligands. As such, the precise identification of residues involved in ligand binding is a crucial step in unravelling the function of a protein and serves as a stepping-stone for fields like drug discovery and development [1,2]. While high-throughput sequencing ensures a constant increase in the availability of experimentally verified sequences, the identification of binding sites is experimentally involved and cannot keep up. As such, the field currently relies on predictive methods to bridge this sequence-annotation gap [3]. Traditionally, Machine Learning methods rely on evolutionary information as represented in multiple sequence alignments (MSAs) to predict binding residues [4][5][6]. Alternatively, homology-based inference transfers information about, e.g., binding residues from sequence-similar proteins with known annotations to uncharacterized proteins [7]. Template-based methods consider structural similarity instead of sequence similarity and therefore require structures with binding annotations [8,9]. All those approaches are time-consuming and often limited by the small number of available annotated sequences and structures. The emergence of protein Language Models (pLMs) partially addresses this issue. Based on ideas from the field of Natural Language Processing (NLP), pLMs are trained on large unlabeled protein sequence corpora and implicitly learn to encode protein features into a latent vector space [10][11][12]. Leveraging those vector representations (called embeddings) of protein sequences as input to downstream models allows the encapsulated information to be transferred to various tasks, resulting in high-quality predictions for different aspects of protein structure and function (transfer learning) [13][14][15][16][17][18][19].
One such model is bindEmbed21DL [16], which uses ProtT5 embeddings [10] [...] enables access to high-quality protein structure predictions, which can be used to extend sequence-based methods to integrate structural information and subsequently refine predictions.

Here, we propose bindAdjust, a simple and fast method leveraging (predicted) protein structures to refine binding predictions of bindEmbed21DL. The refinement algorithm combines the (predicted) distance map and the predicted binding probabilities of a protein to calculate a per-residue structure bonus. This bonus is used to adjust the probabilities generated by bindEmbed21DL accordingly. Notably, bindAdjust does not need any additional training apart from hyper-parameter selection and, thus, provides a simple approach to refine binding residue predictions from sequence-based methods.
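To make the idea of a per-residue structure bonus concrete, the following is a minimal sketch and not the published bindAdjust formula, which is not reproduced in this passage. It assumes a hypothetical bonus equal to the mean binding probability of all residues within an assumed distance threshold, scaled by an assumed weight and added to the residue's own probability. The function name `adjust_probabilities`, the 8 Å threshold, and the `weight` parameter are all illustrative assumptions.

```python
from typing import List


def adjust_probabilities(
    probs: List[float],
    dist: List[List[float]],
    threshold: float = 8.0,  # ASSUMPTION: neighborhood radius, not taken from the paper
    weight: float = 0.5,     # ASSUMPTION: bonus weight, not the paper's coefficient C
) -> List[float]:
    """Add a neighborhood bonus to each per-residue binding probability.

    The bonus of residue i is the mean probability of all other residues
    whose distance to i is at most `threshold` (hypothetical definition).
    """
    n = len(probs)
    adjusted = []
    for i in range(n):
        neighbors = [
            probs[j] for j in range(n) if j != i and dist[i][j] <= threshold
        ]
        bonus = sum(neighbors) / len(neighbors) if neighbors else 0.0
        adjusted.append(probs[i] + weight * bonus)
    return adjusted


# Toy protein with four residues; residues 0 and 3 are far apart in
# sequence but only 5 units apart in the (made-up) distance map.
probs = [0.9, 0.2, 0.1, 0.6]
dist = [
    [0.0, 3.8, 12.0, 5.0],
    [3.8, 0.0, 3.8, 11.0],
    [12.0, 3.8, 0.0, 15.0],
    [5.0, 11.0, 15.0, 0.0],
]
scores = adjust_probabilities(probs, dist)
```

In this toy example, residue 3 receives a large bonus from the confidently predicted residue 0 because the two are spatial neighbors, even though they are separated in sequence. Note that adjusted scores can exceed 1, so they are scores rather than probabilities.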
Methods

Data set. The data set was taken from bindEmbed21DL [16]. We removed four sequences because they contain non-native amino acids, which were not considered during structure prediction. Details can be found in Table S1 in the Supporting Online Material (SOM).

Binding predictions. We computed per-ligand, per-residue binding predictions using bindEmbed21DL [16]. bindEmbed21DL is a convolutional neural network relying on protein embeddings generated with the protein Language Model (pLM) ProtT5-XL-UniRef50 [10] (ProtT5) as input [16]. bindEmbed21DL provides three output probabilities for each residue in a protein, indicating whether this residue binds to metal ions, nucleic acids, or small molecules. [...]

Bonus coefficient C optimization. The coefficient C was optimized for each ligand class individually using the development set (Table S1). C determines the influence of the structure bonus on the binding probability, thus substantially influencing the performance of bindAdjust. We tested values for C between 0 and 200 and benchmarked the binarized bindAdjust scores on the development set. Binarized scores were derived by applying a cutoff on the bindAdjust scores (Eqn. 2, more information given below). We used C=38 to evaluate the performance on the test set because it performed well for all ligand classes.

Label Binarization. bindAdjust produces continuous, per-residue scores (Eqn. 2) based on the predicted bindEmbed21DL output probabilities. To evaluate against the binary binding labels (0: non-binding, 1: binding), we binarized the scores using a cutoff. For each value of C, we identified the cutoff leading to an average precision (Eqn. 3) of 50% of bindAdjust on the development set. All continuous bindAdjust scores were mapped to binary non-binding/binding if the score was smaller/larger than the defined cutoff. We chose a target precision of 50% to be able to compare performances for different values for C and other methods.
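The cutoff search described above can be sketched as follows. This is a minimal illustration under stated assumptions: precision is averaged per protein, proteins without any predicted binding residue are skipped when averaging, and the cutoff is picked from a fixed candidate grid as the one whose mean precision is closest to the target. The function names and the toy data are hypothetical.

```python
from typing import List, Optional, Sequence


def mean_precision(
    scores: Sequence[Sequence[float]],
    labels: Sequence[Sequence[int]],
    cutoff: float,
) -> Optional[float]:
    """Mean per-protein precision when residues with score > cutoff bind.

    ASSUMPTION: proteins with no predicted binding residue are skipped.
    Returns None if no protein has any prediction at this cutoff.
    """
    per_protein = []
    for protein_scores, protein_labels in zip(scores, labels):
        predicted = [s > cutoff for s in protein_scores]
        n_predicted = sum(predicted)
        if n_predicted == 0:
            continue
        true_pos = sum(
            1 for p, y in zip(predicted, protein_labels) if p and y == 1
        )
        per_protein.append(true_pos / n_predicted)
    if not per_protein:
        return None
    return sum(per_protein) / len(per_protein)


def find_cutoff(
    scores: Sequence[Sequence[float]],
    labels: Sequence[Sequence[int]],
    candidates: Sequence[float],
    target: float = 0.5,
) -> Optional[float]:
    """Return the candidate cutoff whose mean precision is closest to target."""
    best, best_gap = None, float("inf")
    for cutoff in candidates:
        precision = mean_precision(scores, labels, cutoff)
        if precision is None:
            continue
        gap = abs(precision - target)
        if gap < best_gap:
            best, best_gap = cutoff, gap
    return best


# Toy development set with two proteins (made-up scores and labels).
dev_scores = [[1.4, 0.3, 0.9, 0.2], [0.8, 1.2, 0.1]]
dev_labels = [[1, 0, 0, 0], [0, 1, 0]]
chosen = find_cutoff(dev_scores, dev_labels, candidates=[0.0, 0.5, 1.0])
```

Repeating this search for every tested value of C yields one cutoff per C, which keeps the comparison across settings at a fixed precision level.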
Sequence-Based Refinement. To ensure that bindAdjust uses information encoded in the distance map/3D structure and does not solely rely on information about residue distance in sequence, we implemented a refinement method based on distance in sequence, not structure.

Analogously to bindAdjust, we calculated a bonus, with the difference that it is based on the position difference of two residues in the sequence (see Table S3 for an example). The distance between two neighboring residues was set to 1. We hypothesize that a sufficiently large performance drop for the sequence-based refinement compared to bindAdjust would indicate that bindAdjust uses additional information beyond the sequence to refine binding residue [...]

To account for sampling variance, we computed standard errors as the standard deviations of these metrics from 10,000 bootstrapping samples.

[...] predictions for 588 proteins (CovOneBind=90%; Fig. S2, Table S4). bindAdjust improves upon this recall at the same level of precision depending on the choice of C (Table S4, Fig. S3A).

Increasing C correlates consistently with an increase in average recall and a respective decrease in CovOneBind. For C>70, bindAdjust can no longer achieve the target precision of 50% (Fig. S3A). We choose C=38 and C=66 for further analyses (Fig. S2).

Since bindAdjust achieves an increase in recall by accepting a decrease in CovOneBind, the overall increase could stem from removing all binding predictions for proteins with low recall. To assess whether this is the case, we compare the prediction performance for the 479 proteins for which both bindEmbed21DL and bindAdjust at C=38 predict at least one residue as small molecule binding. C=38 is chosen because this value leads to an increase in recall with an acceptable trade-off in terms of CovOneBind compared to C=66 (Fig. S2). bindEmbed21DL achieves a mean recall of 40±1%, while bindAdjust achieves a recall of 45±1% at the same level of precision for the same proteins (Fig. S4). This analysis shows that the increase in performance achieved by bindAdjust does not originate from the removal of proteins with subpar prediction performance and, thus, originates from improvement in the prediction quality.
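The two quantities used in this comparison can be sketched in a few lines. The sketch assumes that CovOneBind is the fraction of proteins with at least one residue predicted as binding, as the reported figures suggest, and that recall is averaged per protein over the proteins for which both methods make a prediction. The function names and toy predictions are hypothetical.

```python
from typing import List, Sequence, Tuple


def cov_one_bind(predictions: Sequence[Sequence[int]]) -> float:
    """Fraction of proteins with at least one residue predicted as binding."""
    covered = sum(1 for pred in predictions if any(pred))
    return covered / len(predictions)


def recall(pred: Sequence[int], label: Sequence[int]) -> float:
    """Per-protein recall; assumes the protein has at least one binding residue."""
    true_pos = sum(1 for p, y in zip(pred, label) if p == 1 and y == 1)
    return true_pos / sum(label)


def mean_recall_on_shared(
    pred_a: Sequence[Sequence[int]],
    pred_b: Sequence[Sequence[int]],
    labels: Sequence[Sequence[int]],
) -> Tuple[float, float, int]:
    """Mean recall of both methods on proteins where both predict >= 1 residue."""
    shared = [
        i for i in range(len(labels)) if any(pred_a[i]) and any(pred_b[i])
    ]
    rec_a = sum(recall(pred_a[i], labels[i]) for i in shared) / len(shared)
    rec_b = sum(recall(pred_b[i], labels[i]) for i in shared) / len(shared)
    return rec_a, rec_b, len(shared)


# Toy data: three binding proteins; the refined method makes no
# prediction for the second protein (made-up values).
labels = [[1, 1, 0, 0], [1, 0, 1], [0, 1, 1, 0]]
baseline = [[1, 0, 0, 0], [1, 0, 0], [0, 1, 0, 0]]
refined = [[1, 1, 0, 0], [0, 0, 0], [0, 1, 1, 0]]

cov_baseline = cov_one_bind(baseline)
cov_refined = cov_one_bind(refined)
rec_baseline, rec_refined, n_shared = mean_recall_on_shared(
    baseline, refined, labels
)
```

Restricting the comparison to the shared proteins removes the effect of dropped proteins, so any remaining recall difference reflects changed predictions on the same set.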
We also benchmark bindAdjust using C=38 and C=66 against bindEmbed21DL on the test set (TestSet220; Fig. 1). Using the cutoff which results in a mean precision of 50% on the development set, bindEmbed21DL achieves recall=38±2% and precision=58±2%, making predictions for 202 of the 220 small binding proteins (CovOneBind=92%), while bindAdjust achieves a similar recall and precision but at a lower CovOneBind (Fig. 1) [...] DevSet454 - nucleic acid binding - in Table S6 and for DevSet108 - binding to metal ions - in Table S7). However, focusing on the set of proteins for which both bindEmbed21DL and bindAdjust made a prediction, the recall of bindEmbed21DL either reaches the same level as bindAdjust (for metal ions; Table S6) or even exceeds it (for nucleic acids; Table S7). Therefore, bindAdjust identifies a subset of proteins with particularly good binding predictions by bindEmbed21DL rather than improving predictions for individual proteins. While not directly enabling a refinement of predictions, bindAdjust can be used as a filter to identify good predictions of bindEmbed21DL, especially in terms of recall. Since the refinement of bindAdjust is based on structural information, we assume that the good predictions identified by bindAdjust form spatial patterns indicating the prediction of an actual binding site rather than isolated residues. This could also explain why bindAdjust cannot improve those predictions: bindEmbed21DL correctly identifies the complete binding site and therefore bindAdjust cannot identify any additional binding residues in this site. More details can be found in the SOM Section 2.3.

Fig. 1: Average performance metrics of bindEmbed21DL, the sequence-based refinement, and bindAdjust for test set and parameter settings for the prediction of small molecule binding. TestSet220: 220 proteins binding to small molecules from the test set. For the test set, the same cutoffs to classify a residue as (non-)binding were chosen as for the development set. Standard errors are given as error estimates. Using the same cutoffs as for the development set, bindAdjust and bindEmbed21DL achieve a similar performance in terms of precision, recall, and F1, with bindAdjust making predictions for fewer proteins than bindEmbed21DL.

Fig. 2: Average performance metrics of bindEmbed21DL and bindAdjust(C=38) for the prediction of small molecule binding on protein subsets identified by bindAdjust. TestSet199: 199 proteins of the test set binding to small molecules for which both bindEmbed21DL and bindAdjust(C=38) predict at least one residue as small binding. TestSet180: 180 proteins of the test set binding to small molecules for which bindAdjust(C=66) predicts at least one residue as small binding. For the test set, the cutoffs leading to 50% precision were used to classify a residue as (non-)binding. Standard errors are given as error estimates. Using TestSet188, both bindAdjust(C=38) and bindEmbed21DL perform better than for TestSet199. This demonstrates that bindAdjust and large values for C can be used to identify a subset of proteins for which both analyzed models predict small binding exceptionally well.

bindAdjust leverages spatial information not available from sequence. The sequence-based refinement method only uses distance in sequence, not in structure, to refine predictions. For C=38, the recall for predicting binding to small molecules improves over bindEmbed21DL by four percentage points, but the sequence-based refinement cannot reach the same level of performance as bindAdjust (Table S4). At C=66, the sequence-based method does not reach the target precision level of 50% (Table S4). For both binding to metal ions and binding to nucleic acids, the sequence-based refinement cannot improve the recall of bindEmbed21DL (Table S6 for metal ions, Table S7 for nucleic acids). Using information about residues close in sequence space can apparently help in refining the predictions to a certain extent. However, incorporating structure information as done by bindAdjust is needed to obtain a large improvement. In addition, the incorporation of structural information allows for the identification of binding sites (i.e., groups of binding residues close in space), which is not possible from sequence alone. Hence, bindAdjust incorporates new information and allows for a performance boost compared to only using protein sequences.
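The difference between the two kinds of distance can be illustrated with a toy example. The sketch below builds a sequence distance map, in which neighboring residues have distance 1 as stated in the Methods, and a spatial distance map from 3D coordinates. The coordinates describe a made-up hairpin and the function names are hypothetical; a real application would read coordinates from a (predicted) structure.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def sequence_distance_map(n: int) -> List[List[int]]:
    """Distance in sequence: absolute position difference |i - j|."""
    return [[abs(i - j) for j in range(n)] for i in range(n)]


def spatial_distance_map(coords: Sequence[Coord]) -> List[List[float]]:
    """Euclidean distance between residue coordinates (e.g., one atom per residue)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


# ASSUMPTION: invented coordinates for a six-residue hairpin in which
# the chain folds back so that the first and last residue face each other.
hairpin = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (7.6, 5.0, 0.0),
    (3.8, 5.0, 0.0),
    (0.0, 5.0, 0.0),
]

seq_map = sequence_distance_map(len(hairpin))
space_map = spatial_distance_map(hairpin)

# Residue 5 is the farthest from residue 0 in sequence but closer in
# space than residue 2.
far_in_sequence = seq_map[0][5] > seq_map[0][2]
close_in_space = space_map[0][5] < space_map[0][2]
```

A refinement that only sees `seq_map` cannot relate residues 0 and 5, whereas one that sees `space_map` can treat them as part of the same site.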

Case study: bindAdjust leverages spatial relations between binding residues. From the development set, we visualized two small binding proteins with an increase in F1 score compared to bindEmbed21DL, showcasing how bindAdjust can improve the binding predictions from bindEmbed21DL by taking residues close in 3D space into account. For bindEmbed21DL, a cutoff of 0.65 was applied to classify a residue as binding. For the sequence-based refinement and bindAdjust, C was set to 38 and cutoffs of 1.35 and 1.15 were applied, respectively. We chose these cutoffs as they corresponded to an average precision of 50% on the development set (Table S4).

The Rhodothermus marinus cytochrome c (Fig. 3 [...] bindEmbed21DL predicts too few residues but at a high precision (Fig. 3A). Applying the sequence-based refinement method, the recall improves to 61% by adding five new binding predictions (Fig. 3B), of which three are correct (Precision=81%). bindAdjust adds seven residues to the predictions of bindEmbed21DL, resulting in 23 binding residues of which 20 were correct, i.e., increasing the recall to 71% at a similar level of precision (Precision=87%; Fig. 3C).
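The figures reported for this example can be cross-checked with a short calculation. The number of annotated binding residues is not stated in this passage; the value of 28 used below is inferred from the reported recall of 71% at 20 correct predictions and should be read as an assumption. The 16 baseline and 21 sequence-based predictions follow from the stated additions of seven and five residues, and the 17 correct sequence-based predictions are derived from the reported 81% precision.

```python
def precision(true_pos: int, n_predicted: int) -> float:
    """Fraction of predicted binding residues that are correct."""
    return true_pos / n_predicted


def recall(true_pos: int, n_binding: int) -> float:
    """Fraction of annotated binding residues that are predicted."""
    return true_pos / n_binding


# Stated in the text: bindAdjust predicts 23 residues, 20 of them correct.
adjust_predicted, adjust_correct = 23, 20

# Derived: bindAdjust adds 7 residues, the sequence-based method adds 5.
baseline_predicted = adjust_predicted - 7   # 16
sequence_predicted = baseline_predicted + 5  # 21

# ASSUMPTION: number of annotated binding residues inferred from
# recall = 71% at 20 correct predictions (not given in the passage).
n_binding = round(adjust_correct / 0.71)

# Derived from the reported 81% precision of the sequence-based method.
sequence_correct = round(0.81 * sequence_predicted)

adjust_precision_pct = round(100 * precision(adjust_correct, adjust_predicted))
adjust_recall_pct = round(100 * recall(adjust_correct, n_binding))
sequence_recall_pct = round(100 * recall(sequence_correct, n_binding))
```

Under the inferred total, the sequence-based recall comes out at the reported 61%, so the stated precision and recall values are mutually consistent.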

Mapping those predictions to the predicted 3D structure clearly shows that the newly added [...]

[...] Residues falsely predicted to be non-binding (False Negatives), Grey: Residues correctly predicted to not bind (True Negatives). For the Rhodothermus marinus cytochrome c, the sequence-based method (B) slightly improves over bindEmbed21DL (A) by adding five new binding predictions of which three are correct. bindAdjust (C=38) predicts seven more residues compared to bindEmbed21DL, resulting in 20 correct predictions out of 23 total predictions. For the genome polyprotein, bindEmbed21DL (D) predicts six residues to bind. The sequence-based method (E) heavily underpredicts binding by (correctly) predicting only one residue to bind. bindAdjust (F) performs best: it predicts 12 residues as binding, of which ten are correct.

[...] structure to the six residues already identified by bindEmbed21DL, resulting in a larger surface resembling an actual binding site (Fig. 3F). bindEmbed21DL achieves a particularly high recall (Table S6 & [...]

Since binding predictions close in 3D space are not necessarily close in sequence, the sequence-based refinement cannot improve the predictions for the two examples, sometimes even removing correct predictions (Fig. 3D).

[...] The algorithm at the core of bindAdjust is straightforward and easily understood, taking only 5 seconds to calculate bindAdjust scores for 1010 proteins even on low-performance CPUs. Without the requirement of time-consuming training or downloading of large models, the method is applicable to large corpora and enables proteome-wide studies.

Due to its speed and simplicity, bindAdjust can be easily adapted and applied to any type of binding prediction. In general, any prediction task strongly tied to structure such as protein function, binding, or disorder could benefit from incorporating (predicted) structures through an approach like bindAdjust.

Availability of data and materials. All data and code are available via the public GitHub