Flex ddG: Rosetta ensemble-based estimation of changes in protein-protein binding affinity upon mutation

Computationally modeling changes in binding free energies upon mutation (interface ∆∆G) allows large-scale prediction and perturbation of protein-protein interactions. Additionally, methods that consider and sample relevant conformational plasticity should be able to achieve higher prediction accuracy than methods that do not. To test this hypothesis, we developed a method within the Rosetta macromolecular modeling suite (flex ddG) that samples conformational diversity using "backrub" to generate an ensemble of models, then applies torsion minimization and side chain repacking to each model, and averages across this ensemble to estimate interface ∆∆G values. We tested our method on a curated benchmark set of 1240 mutants, and found the method outperformed existing methods that sampled conformational space to a lesser degree. We observed considerable improvements with flex ddG over existing methods on the subset of small side chain to large side chain mutations, as well as for multiple simultaneous non-alanine mutations, stabilizing mutations, and mutations in antibody-antigen interfaces. Finally, we applied a generalized additive model (GAM) approach to the Rosetta energy function; the resulting non-linear reweighting model improved agreement with experimentally determined interface ∆∆G values, but also highlights the necessity of future energy function improvements.


Introduction
Protein-protein interactions underlie essentially all biological processes, including signal transduction and antibody-antigen recognition. Many protein-protein interfaces are sensitive to mutations that can alter interaction affinity and specificity. In fact, mutations at protein-protein interfaces have been reported to be overrepresented within disease-causing mutations, 1 highlighting the central importance of these interactions to biology and human health. A sufficiently accurate computational method capable of predicting mutations that strengthen or weaken known protein-protein interactions would hence serve as a useful tool to dissect the role of specific protein-protein interactions in important biological processes.
Coupled with state-of-the-art methods for protein engineering and design, such a method would also enhance our ability to create new and selective interactions, enabling the development of improved protein therapeutics, protein-based sensors, and protein materials.
Several prior methods have been developed to predict changes in protein-protein binding affinity upon mutation using different approaches to estimating energetic effects (scoring) and modeling structural changes (sampling). Common approaches include weighted energy functions that seek to describe physical interactions underlying protein-protein interactions, 2,3 statistical and contact potentials, 4-7 a combination of these approaches, 8,9 graph-based representations, 10 methods that sample backbone structure space locally around mutations, 11 and machine learning approaches. 12 We set out to develop and assess methods for estimating experimentally determined changes in binding free energy after mutation (interface ∆∆G) within the Rosetta macromolecular modeling suite. Rosetta is freely available for academic use, and allows combination of interface ∆∆G predictions with Rosetta's powerful protein design capabilities, which have proven successful in a variety of applications. 13

The ZEMu dataset was also curated to include a range of both stabilizing and destabilizing mutants, small side chain to large side chain mutations, single and multiple mutations, and a diversity of complexes. Small-to-large mutations are defined as those dataset cases where all mutation(s) are at positions where the residue side chain increases in van der Waals volume post-mutation.
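The small-to-large classification above can be sketched in a few lines. This is an illustrative implementation only: the residue volumes below are rough literature estimates (in Å³), and the exact volume table used for dataset curation may differ.

```python
# Approximate amino acid volumes (Å^3); illustrative values, not the
# exact table used for curating the benchmark dataset.
RESIDUE_VOLUME = {
    "G": 60.1, "A": 88.6, "S": 89.0, "C": 108.5, "D": 111.1, "P": 112.7,
    "N": 114.1, "T": 116.1, "E": 138.4, "V": 140.0, "Q": 143.8, "H": 153.2,
    "M": 162.9, "I": 166.7, "L": 166.7, "K": 168.6, "R": 173.4, "F": 189.9,
    "Y": 193.6, "W": 227.8,
}

def is_small_to_large(mutations):
    """True if every mutation in a case increases side chain volume.

    mutations: list of (wild_type_aa, mutant_aa) one-letter code pairs.
    """
    return all(
        RESIDUE_VOLUME[mut] > RESIDUE_VOLUME[wt] for wt, mut in mutations
    )

print(is_small_to_large([("A", "W")]))              # → True
print(is_small_to_large([("A", "W"), ("F", "G")]))  # → False (one shrinks)
```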

After a review of the literature from which the known experimental ∆∆G values originated, corrections to the dataset values are listed in Table S1.
Flex ddG method steps are outlined in Fig. 1.
Step 1: The protocol begins with an initial minimization (of backbone φ/ψ and side chain χ torsional degrees of freedom, using the lbfgs_armijo_nonmonotone minimization algorithm) of the input crystal structure of the wild-type protein complex. This and later minimizations are performed with constraints that harmonically restrain pairwise atom distances to their values in the input crystal structure. Minimization is run until convergence (an absolute score change upon minimization of less than one REU (Rosetta Energy Unit)).
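The harmonic pairwise atom-distance restraints can be sketched as a simple penalty function. This is a minimal illustration of the idea, not Rosetta's implementation; the 0.5 Å standard deviation and 9.0 Å pair cutoff are assumed defaults chosen for the example.

```python
import itertools
import math

def harmonic_restraint_energy(coords, ref_coords, stdev=0.5, max_dist=9.0):
    """Sum of harmonic penalties on pairwise atom distances.

    Each atom pair within max_dist in the reference (input crystal)
    structure is harmonically restrained to its reference distance.
    stdev and max_dist are illustrative, not flex ddG's actual values.
    """
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    energy = 0.0
    for i, j in itertools.combinations(range(len(ref_coords)), 2):
        d0 = dist(ref_coords[i], ref_coords[j])
        if d0 <= max_dist:
            d = dist(coords[i], coords[j])
            energy += ((d - d0) / stdev) ** 2  # harmonic penalty
    return energy

ref = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
print(harmonic_restraint_energy(ref, ref))  # → 0.0 (unperturbed structure)
```

An unperturbed structure incurs zero penalty, so the restraints only resist drift away from the crystal coordinates during minimization.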
Fig. 1: Overview of the flex ddG protocol, run in parallel for the wild-type and mutant complexes. Step 0: Input starting structure; generate pairwise atom constraints. Step 1 (Minimize, 1x): Global minimization of backbone and side chain torsions, with the pairwise atom constraints from Step 0. Step 2 (Backrub, 1x): Local sampling of backbone and side chain degrees of freedom of pivot residues, defined as those with neighbor atoms (C-β) within 8 Å of mutation positions. Step 3a (Pack, 50x): Optimize side chains globally on the wild-type model; generate pairwise atom constraints. Step 3b (Mutate and Pack, 50x): Optimize side chains globally on the mutant model (using the mutant sequence); generate pairwise atom constraints. Step 4a (Minimize, 50x): Global minimization of backbone and side chain torsions, with the pairwise atom constraints from Step 3a. Step 4b (Minimize, 50x): Global minimization of backbone and side chain torsions, with the pairwise atom constraints from Step 3b. Step 5a (Score, 50x): Score the wild-type complex and unbound partners using Rosetta's all-atom energy function. Step 5b (Score, 50x): Score the mutant complex and unbound partners using Rosetta's all-atom energy function.

Step 2: Starting from the minimized input structure, the backrub method in Rosetta is used to generate an ensemble of models. In brief, each backrub move is undertaken on a randomly chosen protein segment consisting of three to twelve adjacent residues in the neighborhood of any mutated position. The mutation neighborhood is defined by finding all residues with a C-β atom (C-α for glycines) within 8 Å of any mutant position, then adding each such residue and its adjacent N- and C-terminal residues to the list of neighborhood residues. All atoms in the backrub segment are rotated locally about an axis defined as the vector between the endpoint C-α atoms. Backrub is run at a temperature of 1.2 kT, for up to 50,000 backrub Monte Carlo trials/steps (Table S2 shows that using a kT of 1.6 gives similar results to a kT of 1.2). Up to 50 output models are generated.
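The geometry of a single backrub rotation about the endpoint C-α axis can be sketched with Rodrigues' rotation formula. This shows only the rotation itself; Rosetta's backrub mover additionally selects segments, proposes angles, and accepts or rejects moves via Metropolis Monte Carlo.

```python
import math

def rotate_about_axis(point, axis_start, axis_end, angle_rad):
    """Rotate a point about the axis through two pivot C-alpha atoms.

    Minimal sketch of a backrub move's geometry using Rodrigues'
    rotation formula; coordinates are plain (x, y, z) tuples/lists.
    """
    ax = [e - s for s, e in zip(axis_start, axis_end)]
    norm = math.sqrt(sum(c * c for c in ax))
    k = [c / norm for c in ax]                       # unit rotation axis
    v = [p - s for p, s in zip(point, axis_start)]   # vector from axis origin
    cos_t, sin_t = math.cos(angle_rad), math.sin(angle_rad)
    kxv = [k[1] * v[2] - k[2] * v[1],                # cross product k x v
           k[2] * v[0] - k[0] * v[2],
           k[0] * v[1] - k[1] * v[0]]
    kdotv = sum(ki * vi for ki, vi in zip(k, v))
    rotated = [v[i] * cos_t + kxv[i] * sin_t + k[i] * kdotv * (1 - cos_t)
               for i in range(3)]
    return [r + s for r, s in zip(rotated, axis_start)]

# 180-degree rotation about the z-axis maps (1, 0, 0) to roughly (-1, 0, 0):
p = rotate_about_axis([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0], math.pi)
print(p)
```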
Step 3A: For each of the 50 models in the ensemble output by backrub, the Rosetta packer is used to optimize side chain conformations for the wild-type sequence using discrete rotameric conformations 45 and simulated annealing. The packer is run with the multi-cool annealer option, 46 which is set to keep a history of the 6 best rotameric states visited during annealing.
Step 3B: Independently and in parallel to step 3A, side chain conformations for the mutant sequence are optimized on all 50 models, introducing the mutation(s).
Step 4A: Each of the 50 wild-type models is minimized, again with harmonic pairwise atom-distance constraints. Minimization is run with the same parameters as in Step 1; the constraint reference distances used in this step are taken from the coordinates of the Step 3A model.
Step 4B: As Step 4A, but for each of the 50 mutant models.
Step 5A: Each of the 50 minimized wild-type models is scored in complex, and the complex partners are scored individually. The scores of the split, unbound complex partners are obtained simply by moving the complex halves away from each other.
No further minimization or side chain optimization is performed on the unbound partners before scoring.
Step 5B: In the same fashion as Step 5A, each of the 50 minimized mutant models is scored in complex, and the complex partners are scored individually.
Step 6: The interface ∆∆G score is produced via Eq. 1, averaging over the n = 50 ensemble members:

∆∆G = (1/n) Σ_{i=1..n} [ (E_complex^mut,i − E_partnerA^mut,i − E_partnerB^mut,i) − (E_complex^wt,i − E_partnerA^wt,i − E_partnerB^wt,i) ]   (1)

where E_complex, E_partnerA, and E_partnerB are the Rosetta scores of the complex and of each unbound partner for ensemble member i. In the evaluation below, y_i denotes the predicted ∆∆G values, x_i the known, experimentally determined values, and e_i = y_i − x_i the prediction error.

Rosetta energy function
We reweighted each score term through a sigmoidal scaling function of the form

h_θ(x) = e^a (2 / (1 + e^{−e^b x}) − 1),   θ = (a, b)

where a sets the scaling range of the score, and b the steepness of the sigmoid scaling. Both parameters are transformed through an exponential to ensure non-negativity. The scaling function h does not introduce bias, that is, h_θ(0) = 0 for any θ. The scoring model is then a generalized additive model (GAM) over the M score terms,

f(x) = Σ_{j=1..M} h_{θ_j}(x_j).

The parameters θ = (a_j, b_j), j = 1..M, for the score terms were simultaneously sampled using a random walk Metropolis-Hastings MCMC algorithm (the mhsample function in Matlab), assuming a Gaussian likelihood y_i ∼ N(f(x_i), σ_n²) as the target distribution with a noise variance set to σ_n² = 1.0, where (x_i, y_i), i = 1..N, are the empirical observations y_i and the corresponding protein score terms x_i. We draw 1000 samples with a burn-in set to 1000 samples and a thinning parameter of 20. The proposal distribution was selected to be a symmetric uniform distribution such that [a^(s+1), b^(s+1)] ∼ U(a^(s) ± 2, b^(s) ± 2). The resulting MCMC sample represents all logistic score scalings that reproduce the empirical measurements assuming an error model with noise variance σ_n².
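One concrete parameterization consistent with the properties described in the text (exp-transformed range and steepness, no bias at zero) can be sketched as follows. The exact functional form used in the paper may differ in detail, so treat this as an assumption-laden illustration.

```python
import math

def h(x, a, b):
    """Sigmoidal scaling of one score term with h(0) = 0.

    exp(a) sets the scaling range and exp(b) the steepness; this
    parameterization is an assumption matching the description in the
    text, not necessarily the paper's exact formula.
    """
    return math.exp(a) * (2.0 / (1.0 + math.exp(-math.exp(b) * x)) - 1.0)

def gam_score(terms, params):
    """GAM prediction: sum of independently scaled score terms."""
    return sum(h(x, a, b) for x, (a, b) in zip(terms, params))

print(h(0.0, 0.3, -0.5))  # → 0.0: the scaling introduces no bias
```

Because each term saturates at ±exp(a), extreme raw score values are automatically downweighted, which is the behavior reported for terms such as fa_sol and fa_atr.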

Results and discussion
The overall performance of the protocol is summarized in Table 2. We compare four prediction methods; the full underlying data can be found in Table S3.
The new flex ddG method outperforms the comparison methods on the complete dataset in each of the correlation, MAE, and fraction correct metrics (Table 2). In particular, we see a large increase in performance relative to the other methods on the small-to-large subset of mutations. This is in accordance with our expectations that backrub ensembles should be able to sample small backbone conformational adjustments required to accommodate changes in amino acid residue size. Notably, application of backrub ensembles performs better than other methods that include backbone minimization steps only, including the current state-of-the-art Rosetta ddg_monomer method. On the small-to-large mutations subset, the ddg_monomer method achieves a Pearson correlation of only 0.31 compared to 0.65 with flex ddG.
Performance of the flex ddG method on the subset of single mutations to alanine is also competitive with or outperforms the alternative methods. As we do not expect single mutations to alanine to require intensive backbone sampling, our method's effectiveness on this subset shows that the method is fairly robust to the mutation type. As we chose to perform backrub sampling prior to introducing mutations, these results could suggest that flex ddG is effective even though the conformational sampling is not tailored to the mutant sequence.

While the flex ddG method shows improved performance on the subset of multiple mutations as compared to the control and ddg_monomer methods, flex ddG did not match the performance of the ZEMu method on this subset. This result could indicate that further refinement of the backrub parameters is required when simultaneously sampling conformational space around the sites of multiple mutations. However, and remarkably, flex ddG outperforms ZEMu on the subset of cases with multiple mutations where none of the mutations are to alanine (Table 2). Finally, the flex ddG method also shows considerable improvements over other methods on the subset of antibody-antigen complexes (Table 2).

Fig. 2 illustrates the performance of the flex ddG and control methods on the complete dataset and small-to-large subsets using scatterplots comparing experimentally determined and computationally estimated changes in binding free energies for each of the cases in the datasets. In particular, a notable improvement with flex ddG over the control can be seen for the 13 small-to-large mutations that were experimentally determined to stabilize the protein-protein interface significantly (∆∆G <= -1.0 kcal/mol). For this set, the control method misclassifies most stabilizing mutations as having minimal effect or being destabilizing (9 mutations with predicted Rosetta ∆∆G scores > 0) (Fig. 2d), whereas flex ddG identifies a sizable number (12 of 13 mutations) as having predicted Rosetta ∆∆G < 0 (Fig. 2c), even though only one of these mutations is predicted to be strongly stabilizing (predicted ∆∆G score < -1.0). The capability to predict stabilizing mutations is especially important for challenging design applications to modulate binding affinity and selectivity, as well as for creating entirely new high-affinity protein-protein interactions.
In the following sections, we assess how different flex ddG implementations would affect prediction performance, focusing separately on sampling and scoring.

Effect of ensemble size
While the results presented above used an ensemble size of 50 members, we next investigated the ideal ensemble size to maximize the predictive ability of our method. For example, prior methods used ensemble sizes ranging from ten 3 to thousands. 28 As the computational time required to run flex ddG increases linearly with ensemble size, determining an optimal size is practically relevant. We therefore evaluated the performance of flex ddG as we average across an increasing number of models (from 1 to 50, Fig. 3). The models are first sorted by the score of the corresponding repacked and minimized wild-type model, such that producing a ∆∆G with 1 model uses only the lowest (best) scoring model, with 2 models the 2 lowest scoring models, and so forth. Fig. 3(a) shows the performance on the complete dataset. As more models with increasing wild-type complex score are averaged, correlation with known experimental values increases. Conversely, performance for the no backrub control method stays approximately constant as more models are averaged. This result indicates that sampling with backrub adds information that improves ∆∆G calculation even though the additionally averaged models have higher scores (average ensemble total score is shown in Fig. S1). These higher scoring models would be excluded in methods such as the Rosetta ddg_monomer protocol, which typically uses only the lowest scoring wild-type and mutant models.
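The sort-then-average analysis described here can be sketched directly. The function and input layout below are hypothetical conveniences for illustration: parallel lists of per-model wild-type complex scores and per-model ∆∆G estimates.

```python
def ddg_vs_ensemble_size(wt_complex_scores, per_model_ddgs):
    """ddG averaged over the n best-scoring wild-type models, for each n.

    Models are first sorted by their repacked/minimized wild-type complex
    score (lower is better), mirroring the analysis in the text. Inputs
    are illustrative parallel lists, one entry per ensemble member.
    """
    order = sorted(range(len(per_model_ddgs)),
                   key=lambda i: wt_complex_scores[i])
    running, averages = 0.0, []
    for n, i in enumerate(order, start=1):
        running += per_model_ddgs[i]
        averages.append(running / n)  # mean over the n lowest-scoring models
    return averages

wt_scores = [-510.0, -505.0, -512.0]   # third model scores best
ddgs = [1.0, 2.0, 3.0]
print(ddg_vs_ensemble_size(wt_scores, ddgs))  # → [3.0, 2.0, 2.0]
```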
Instead of using just the three lowest energy models, 23 we find that the performance of the ddg_monomer method also improves as more output models are averaged (Fig. S2, Table S5). This was somewhat unexpected, as the no-backrub control method, which did not show an improvement with increasing ensemble size, is conceptually similar to the ddg_monomer method. However, the difference may arise from the fact that the ddg_monomer method ramps the weight of the repulsive Lennard-Jones term in the energy function during minimization. This strategy explores conformational space more broadly across backbone ensemble members than minimization with a fully weighted repulsive term, as in the no-backrub control method. In this fashion, including more ensemble members generated by the ddg_monomer method increases the conformational plasticity sampled, which in turn increases performance, as seen for the flex ddG method.
Using flex ddG, the subset of small-to-large mutations shows the largest increase in correlation with experimental ∆∆G values as more models are averaged (Fig. 3(b)). This result is consistent with our reasoning above that improved modeling of conformational plasticity is important for prediction performance, and that this effect is most important for significant changes in amino acid residue size. For the subset of multiple mutations where none are mutations to alanine (Fig. 3(c)), performance increases substantially at first as more models are added.
Averaging across increased numbers of models also improves correlation for the subset of single mutations to alanine (Fig. 3(d)). Here, improvements are seen up to averaging about 10 models, after which performance stays approximately constant. This observation indicates that increased sampling, at the very least, is not harmful for cases where one would expect structural changes to be relatively small on average.
From a practical standpoint, generating 20-30 models should constitute sufficient sampling for most cases. Sorting the generated models by score and selecting the best scoring 20-30 out of 50 models does not appear to be necessary, as not sorting the models by score (Fig. S3, Table S6) gives similar results to sorting the models (Fig. 3).

Effect of extent of backrub sampling in each trajectory
The extent of sampling can also be controlled by changing the number of Monte Carlo steps in the backrub simulations. Fig. 4 shows the effect of increasing the number of backrub Monte Carlo steps (while averaging all 50 models at each output step) on flex ddG performance, compared to a control method with zero backrub steps that uses only minimization and side chain packing. ∆∆G scores are calculated every 2,500 backrub steps.
After an initial increase over the first 2,500 backrub steps, performance stays relatively constant for the complete dataset (Fig. 4a) and for single mutations to alanine (Fig. 4d). However, for the subsets of small-to-large mutations (Fig. 4b) and multiple mutations, none to alanine (Fig. 4c), performance increases considerably with increasing numbers of Monte Carlo steps. This increase in performance is similar to what was observed when averaging over more models for these subsets (Fig. 3b,c). Performance levels off at around 30,000 backrub Monte Carlo steps.

[Fig. 4 caption: Predictions generated with the no backrub control protocol are shown in green. A selection of key data underlying this figure can be found in Table S7. (a) Complete dataset (n=1240). (b) Small-to-large mutation(s) (n=130). (c) Multiple mutations, none to alanine (n=45). (d) Single mutation to alanine (n=748).]
The increased performance does not appear to be simply a result of decreasing scores as the simulation progresses, as the average score of the minimized wild-type complexes does not decrease uniformly across the sampled ensemble as the simulation progresses (Fig. S1). The pairwise backrub ensemble RMSDs continue to increase throughout the backrub simulation for all subsets (Fig. S4).

We also evaluated flex ddG with the newer REF Rosetta energy function. 52 We did not observe an increase in performance on the complete ZEMu dataset, and performance decreases were seen for the subsets of small-to-large mutations and multiple mutations (Table S8). Interestingly, flex ddG performance with the REF energy function increased over that with the Talaris energy function if the resolution of the input crystal structure was <= 1.5 Å, but this subset of the data was rather small, with only 52 mutations.
Next, we sought to analyze underlying errors of the Rosetta energy function (when applied to interface ∆∆G) by assessing the individual terms of the energy function. To do so, we chose to reweight the terms of the energy function using a non-linear reweighting scheme similar to Generalized Additive Models (GAMs). 51 In this reweighting method, we used Monte Carlo sampling to fit a sigmoid function to the individual distributions of energy function terms, with the objective function of reducing the absolute error between our predictions and known experimental values over the entire dataset.
The effect on the predictions is shown in Fig. 5 (see also Fig. S5, Table S8, and Table S9). The correlation coefficient also increases when refitting the values obtained for the no backrub control, but only to 0.62 (Fig. 5b, Table S9).
The fit functions (fit to Talaris-derived ∆∆G predictions) are shown in Fig. S6. Extreme values of most score terms are downweighted, especially for the fa_sol and fa_atr terms, which make the largest contributions to predicted ∆∆G (Fig. S7).

Conclusions
We have shown on a large, curated benchmark dataset that the flex ddG method presented here is more accurate than previous methods for estimating changes in binding affinity after mutation in protein-protein interfaces. Particular improvement in performance is seen on the subset of small-to-large mutations, indicating that representing backbone flexibility using backrub motions is effective in cases where backbone rearrangements are expected to be more common. Other notable improvements over previous methods are seen for stabilizing mutations, mutations in antibody-antigen interfaces, and for cases with multiple changes where none of the mutations is to an alanine residue.
We have also shown that more accurate predictions can be obtained by averaging the predictions across a generated structural ensemble of backrub models, and that the number of required models is relatively low (20-30). Prior methods that produced ∆∆G predictions by averaging an ensemble of models required on the order of thousands of models, 28 indicating that backrub sampling can efficiently sample the local conformational space around an input wild-type structure that is relevant for interface ∆∆G prediction.
By creating a method that uses backrub to sample conformational space more broadly than minimization alone, while still staying close to the known wild-type input structure, we have also generated data that should prove useful for future energy function improvements.

Supporting Information Available
The following files are available as Supporting Information: • suppinfo.pdf: Additional figures/tables • flex-ddG-data.csv: Rosetta scores for all predictions