Improved Docking of Protein Models by a Combination of Alphafold2 and ClusPro

It has been demonstrated earlier that the neural network based program AlphaFold2 can be used to dock proteins given the two sequences separated by a gap as the input. The protocol presented here combines AlphaFold2 with the physics based docking program ClusPro. The monomers of the model generated by AlphaFold2 are separated, re-docked using ClusPro, and the resulting 10 models are refined by AlphaFold2. Finally, the five original AlphaFold2 models are added to the 10 AlphaFold2 refined ClusPro models, and the 15 models are ranked by their predicted aligned error (PAE) values obtained by AlphaFold2. The protocol is applied to two benchmark sets of complexes, the first based on the established protein-protein docking benchmark, and the second consisting of only structures released after May 2018, the cut-off date for training AlphaFold2. It is shown that the quality of the initial AlphaFold2 models improves with each additional step of the protocol. In particular, adding the AlphaFold2 refined ClusPro models to the AlphaFold2 models increases the success rate by 23% in the top 5 predictions, whereas considering the 10 models obtained by the combined protocol increases the success rate to close to 40%. The improvement is similar for the second benchmark that includes only complexes distinct from the proteins used for training the neural network.


Introduction
In Round 14 of Critical Assessment of Structure Prediction (CASP14), the doubly blind, biennial experiment focusing on protein structure prediction, the Google affiliated company DeepMind surprised the organizers by demonstrating major improvement in prediction accuracy [1]. The models had been obtained by AlphaFold2, a neural network based program. The AlphaFold2 algorithm and program has been recently released [2], and it is now generally recognized in the scientific community that DeepMind essentially solved the protein folding problem [3]. Given only the sequence, for a large fraction of monomeric proteins, the method is able to produce models that are indistinguishable from experimental structures. Although the accuracy may depend on the depth of the multiple sequence alignment (MSA) available for a given target, results are generally very good even without any meaningful templates and only with a shallow MSA.
Given the success of AlphaFold2 for protein structure prediction, the obvious question is whether the method can also predict protein-protein complexes. DeepMind did not participate in the assembly prediction experiment of CASP14 [4], but results show that most residues located at domain-domain interfaces of multi-domain proteins were predicted remarkably well. In fact, some of the predicted residue conformations seemed well placed for the interaction, in spite of not modeling the interacting protein partner. It was shown that docking models of interacting proteins generated by AlphaFold2 provides results that are comparable to docking X-ray structures [5]. Therefore, it was expected that deep neural network based methods can also be directly used for predicting the structures of protein-protein complexes. This assumption was confirmed by the Baker group by using RoseTTaFold, a different but similar deep learning based structure prediction method [6]. Rather than providing the sequence of a single protein as the input, they used two or more sequences with a gap between them, and demonstrated that the program can generate coordinates of two or more interacting protein chains. This was done by providing a paired alignment and modifying the residue index to include the gap. Thus, the network enabled the direct building of structure models for protein-protein complexes from sequence information, short circuiting the standard procedure of building models for individual subunits and then carrying out rigid-body docking. Mirdita et al. [7] has shown that the same technique works with AlphaFold2, often even without the paired alignment. AlphaFold2 uses relative positional encoding with a cap at |i − j| ≥ 32, which means that by offsetting the residue index between two proteins to be >32, AlphaFold2 treats them as separate polypeptide chains.
A similar protocol was applied to protein-peptide docking by Ko and Lee [8] and by Tsaban et al. [9]. Both groups linked the peptide sequence to the protein sequence via a polyglycine linker, and used AlphaFold2 without any modifications. The method was tested on large benchmark sets of protein-peptide complexes. There were some differences in the overall quality reported, but both studies concluded that fairly accurate docked structures could be obtained for about half of the targets. In some cases, the method obviously failed with the poly-glycine linker throwing the peptide segment into space. In other cases the models correctly identified the binding pocket, but showed errors of peptide rotation or translation [9]. Tsaban et al. also compared the performance of AlphaFold2 to that of the physics-based peptide docking protocol PIPER-FlexPepDock [10] and observed an almost orthogonal behavior in terms of successes and failures.
In the present paper we describe a protocol for protein-protein docking that substantially improves the AlphaFold2 generated models by combining AlphaFold2 with our established physics-based docking method ClusPro [11]. In the first step we use AlphaFold2 as described previously, with the two protein sequences separated by a gap, and generate five primary AlphaFold2 models. However, we go beyond what has been done in earlier studies and combine AlphaFold2 with ClusPro at three different points of the algorithm. First, the best AlphaFold2 model is selected, the two proteins in the complex are separated, and docked to each other using ClusPro. As standard with the server, ClusPro generates 10 models [11]. Each of the 10 ClusPro models are then used as templates in AlphaFold2 runs, resulting in 10 secondary AlphaFold2 models. As the third combination of the two methods, the five primary AlphaFold2 models from the first step are added to the 10 secondary models, and all 15 models are ranked using the PAE (Predicted Aligned Error) values generated by Alphafold2, selecting the model with the lowest PAE as the top prediction. As will be described, we tested the method on two benchmark sets of interacting proteins, and observed substantial increase in the number of complexes that were modeled with acceptable or better accuracy.

Benchmark sets
The combined method has been tested on two benchmark sets. Benchmark 1 was based on Version 5 of the very established protein-protein docking benchmark (BM5) developed by the Weng group [12]. BM5 is a nonredundant and fairly diverse set of protein complexes for testing protein-protein docking algorithms. Each entry in BM5 includes the 3D structures of the complex and one or both unbound component proteins. The set includes 40 antibody-antigen, 88 enzyme-containing and 102 ''other type'' complexes [12]. While Alphafold was applied to predict all 230 targets in BM5, due to the limitations of our GPU resources we were able to get results only for 204 complexes, among them were 34 antibody-antigen pairs and 170 nonantibody containing complexes that included both enzyme-containing and ''other type'' structures [12]. A drawback of using Benchmark 1 is that the proteins forming the complexes in BM5 were present in the training set used by AlphaFold2, which was trained on Protein Data Bank (PDB) structures released before May 2018 [2]. Thus, while the training set included only monomeric proteins, testing the method on BM5 may introduce some bias, leading to overestimating the performance of AlphaFold2.
Benchmark 2 was designed to exclude the possibility of implicit bias. It includes only heterodimeric targets from the PDB released after May 2018, since such protein structures were certainly not used for training AlphaFold2. Furthermore, we used ClusPro TBM to look for homologous templates for potential targets and limited our selection to those with no appropriate template in the PDB [13]. This was done to ensure that our data set consisted of truly novel protein-protein interactions. The final Benchmark 2 consists of 17 complexes with the PDB IDs 5ZNG, 6A6I, 6GS2, 6H4B, 6IF2, 6II6, 6ONO, 6PNQ, 6Q76, 6U08, 6ZBK, 7AYE, 7D2T, 7M5F, 7N10, 7NLJ, 7P8K.

Docking protein pairs using AlphaFold2
For the complexes in Benchmark 1, the sequences for the unbound proteins were extracted from BM5 in fasta file format [12]. For the targets in Benchmark 2, the sequence and chain information was obtained from the PDB using the PISA/author assigned biological assembly. We then generated a multiple sequence alignment (MSA) for each sequence using the MMseqs2 API [7,14]. We joined the MSAs to make pseudo-monomers in which each chain of the oligomer was separated by gaps that were set to 200 residues in our protocol [7]. The final MSA was then used as the input without any template to build five AlphaFold2 models with the pTM model parameter set as this allowed us to also obtain the predicted aligned error (PAE) values for each model [2].

Calculating the average interface PAE
The PAE values calculated by AlphaFold2 provided an error estimate for each residue with respect to every other residue in the model. We used this to calculate an average interface PAE score, where interface was defined as the residues within 10.0 Å of the other subunit. For each residue on the interface, we retained the PAE values for the interface residues on the other subunit. These were then used to calculate the average interface PAE scores that provided the ranking of the models, with lower scores being ranked higher.

Docking predicted subunits using ClusPro
The residues in the top ranked model predicted by AlphaFold2 were split into two groups. One served as the receptor, and the other, the ligand. For the complexes in Benchmark 1, this split was based on the sequences of the two proteins as defined in BM5. The residues belonging to the first protein, usually the larger one, were defined as the receptor, and the residues belonging to second protein formed the ligand. For the heterodimeric targets in Benchmark 2, the sequences in the PDB files were used to define the two proteins. The separated receptor and ligand models were then submitted to the ClusPro server for docking [11]. Some chains had long unstructured tails which were manually cut before submission. The antibody-antigen pairs were docked using ClusPro's antibody mode, whereas for all other targets we used the electrostatic-favored coefficient set as they have been shown to produce the best results [15]. In both cases the 10 best models generated by ClusPro were retained.
The ClusPro free docking protocol consists of two main steps [11]. The first step is running PIPER, a docking program that performs systematic search of complex conformations on a grid using the fast Fourier transform (FFT) correlation approach [16]. The scoring function includes the van der Waals interaction energy, an electrostatic energy term, and desolvation contributions calculated by a structure based pairwise potential [11]. The second step of ClusPro is clustering the top 1000 structures generated by PIPER using the pairwise RMSD as the distance measure. The radius used in clustering is defined in terms of C α interface RMSD. For each docked conformation, we select the residues of the ligand that have any atom within 10 Å of any receptor atom, and calculate the Cα RMSD for these residues from the same residues in all other 999 ligands. Thus, clustering 1000 docked conformations involves computing a 1000 × 1000 matrix of pairwise Cα RMSD values. Based on the number of structures that a ligand has within a (default) cluster radius of 9 Å RMSD, we select the largest cluster and rank its cluster center as the top prediction of the target complex. The members of this cluster are removed from the matrix, and we select the next largest cluster and rank its center as number two, and so on. After clustering with this hierarchical approach, the ranked complexes are subjected to a straightforward (300 step and fixed backbone) minimization of the van der Waals energy using the CHARMM potential to remove potential side chain clashes.

Refining ClusPro results using AlphaFold2
The top 10 docked structures produced by ClusPro were provided to AlphaFold2 as templates for structure prediction. The MSA for each case was generated in the same way as described earlier (see Methods 2.2). The parameters for model 1 from the pTM parameter set was used to generate one model for each template as it gave us the best results out of the five model parameters (data not shown). This resulted in 10 refined models for each target which were ranked based on their interface PAE scores. We then also added the five models generated by AlphaFold2 in the first step of this protocol. Finally, the 15 resulting models were ranked based on their average interface PAE scores.

Assessing the accuracy of docked structures
We used the DockQ program to assess the interface quality of the generated models [17]. The interface of interest for each target was the one between the receptor and ligand as specified in section 2.3. DockQ outputs a score between 0 and 1 to indicate interface quality. A score < 0.23 indicates an incorrect interface, 0.23 ≤ score < 0.49 indicates an acceptable interface, 0.49 ≤ score < 0.80 indicates a medium quality interface, and score ≥ 0.80 indicates a high quality interface. A target was considered successful for a method if the interface for one of the models produced by the method was of acceptable quality or better.

AlphaFold2 refinement improves predictions
We used the sequence information from the PDB for the 204 targets in Benchmark 1 to build the complexes using AlphaFold2. As described, the model with the lowest average interface PAE was split into two groups of residues based on the interface of interest, and docked using ClusPro. A direct comparison of the results by AlphaFold2 and ClusPro show that AlphaFold2 performs better for top 1 and top 5 predictions, both in terms of the number of targets with acceptable or better quality models and in terms of the quality of the interfaces defined by the DockQ score (Figure 1). However, considering the top 10 models from ClusPro increases the number of successful targets to 98, which is higher than the 88 successful targets for the 5 models from AlphaFold2. The higher success rate comes at the cost of interface quality as ClusPro provides far fewer targets with high quality interfaces. The 10 models generated by ClusPro were then refined using AlphaFold2 and ranked based on their average interface PAEs. As shown in Figure 1, this step improves the results considerably. AlphaFold2 on its own still generates good models as the top 1 prediction for more targets than the AlphaFold2 refinement of the ClusPro models. However, when considering the top 5 predictions, the AlphaFold2 refinement of the 10 ClusPro models is successful for 96 targets, compared to the 88 targets obtained by AlphaFold2 without ClusPro. Moreover, considering all 10 AlphaFold2 refined ClusPro models increases the number of successful targets to 109. For some targets where AlphaFold2 fails by itself and ClusPro produces an acceptable quality model, AlphaFold2 refinement is then able to improve the ClusPro interface to a high quality prediction (Table S1). However, it is also apparent that there is a loss of interface quality for some targets where AlphaFold2 by itself did better than ClusPro followed by AlphaFold2 refinement. In fact, the AlphaFold2 refinement of the 10 ClusPro models produces high quality predictions for fewer targets than when considering the top five Alphafold2 models without any further calculation (Table S1). To address this loss, we added the original five AlphaFold2 models to the 10 AlphaFold2 refined ClusPro models, and ranked the 15 models based on the average PAE interface scores. This improved the number of successful targets while also preserving, or even improving, the quality of the interfaces of the top models. This strategy, i.e., selecting 10 models with the lowest PAE values among the 15 models (10 AlphaFold2 refined ClusPro models plus the five original AlphaFold2 models) produced 123 successful targets, with an average DockQ score of 0.415. This was a 37.6% increase in the number of successful targets compared to ClusPro when considering the top 10 models, and almost 40% (39.8%) compared to the original AlphaFold2 runs. Thus, the combined protocol produces much higher success rates that either ClusPro or AlphaFold2 on its own. The refinement of ClusPro generated models by AlphaFold2 was particularly effective and improved the average DockQ by 58.3%. We note that the ClusPro performance discussed here was based on docking the monomer models generated by AlphaFold2, thus starting the calculations from sequences rather than structures, which is another strength of the combined protocol.

Refinement improvement persists for new targets
As stated, the heterodimer targets in Benchmark 2 were released in the PDB after May 2018 and did not have homologous templates. Similar to Benchmark 1, the sequences were used to derive a complex structure using Alphafold2. These were then split into subunits, docked using ClusPro, and refined by AlphaFold2. The results for these complexes differ from the results for Benchmark 1 in two significant ways. First, when considering the top 5 models, AlphaFold2 produced good predictions for fewer targets than ClusPro (Figure 2), although it did result in more high quality interfaces. The lower interface quality for the ClusPro results is not too surprising given that it is a rigid body docking method that does not take into account the conformational changes upon binding. It is likely that success for fewer targets by AlphaFold2 is due to the fact that these targets were specifically selected for their lack of appropriate templates in the PDB. Thus, AlphaFold2 had no or very few structures similar to the targets in the training data, reducing the quality of predictions. The second difference from the results for Benchmark 1 is that the number of successful targets did not improve when going from the ClusPro models to the AlphaFold2 refined ClusPro models, and further to the mixed version of AlphaFold2 and refined ClusPro models. However, similar to the results for Benchmark 1, interface quality improved considerably with refinement (Figure 3). The average DockQ score highlights this improvement as it increases from 0.422 for the top 10 ClusPro models to 0.561 for the AlphaFold2 refined ClusPro models (Table S2).

Figure 2:
Performance on Benchmark 2. The number of successful targets and the quality of best models at the different steps of the protocol.

Figure 3:
Comparison of the best model from each method to the native structure for targets 6H4B and 7P8K.

Conclusions
We tested the ability of the neural network based protein structure prediction program AlphaFold2 of predicting the structures of protein-protein complexes. The method, established in earlier publications, is simply running the program on the two sequences separated by a gap [7]. Application to 204 complexes from the "gold standard" protein-protein docking benchmark [12] confirmed that this method yields a considerable success rate when considering the top five AlphaFold2 predictions. This performance is similar to the one observed for other protein docking programs [15]. As expected, even some of the poorly predicted complexes had excellent models of the protein monomers. The monomers were docked by the physics based traditional docking server ClusPro [11], and the resulting models were used as templates in a repeated prediction of the complexes by AlphaFold2. The combination of the two approaches substantially improved the overall performance. The performance on the protein docking benchmark (Benchmark 1 in this paper) may be somewhat biased since these proteins were part of the structures used when training the AlphaFold2 program. The possibility of such bias was eliminated by performing the same type of calculations for heterodimer complexes deposited in the PDB after May 2018, the cut-off date of training AlphaFold2, and that did not have any templates. Performance was similar to that seen for the other benchmark. Thus, we conclude that the combined protocol produces better results than either AlphaFold2 or ClusPro on its own. It is also important that the calculations can be performed without knowing the structures of the proteins to be docked, and hence we believe that the algorithm proposed here will be very useful in applications.