Abstract
The accurate evaluation of ligand binding free energies by computational methods remains a challenging and active area of research. The most widely used methods for these calculations can be roughly classified into four groups: (i) the fastest and least accurate methods, such as molecular docking, designed to sample a large number of molecules and rapidly rank them according to the potential binding energy; (ii) the so-called ‘end-point’ methods, which use a thermodynamic ensemble, typically generated by molecular dynamics, to analyze the endpoints of the thermodynamic cycle for binding and extract differences; (iii) alchemical methods, which are based on the Zwanzig relationship and compute the free energy difference after a chemical change of the system; and (iv) methods based on biased simulations, such as metadynamics. Moving down this list, the methods require increasing computational power and, as expected, deliver increasing accuracy in determining the strength of binding. Here, we describe an intermediate approach, based on the Monte Carlo Recursion (MCR) method first developed by Harold Scheraga. In this method, the system is sampled at increasing effective temperatures, and the free energy of the system is assessed from a series of terms W(b, T), computed from Monte Carlo (MC) averages at each iteration. We show the application of MCR to ligand binding with datasets of host-guest systems (N = 73) and observe a good correlation between experimental data and the binding energies computed with MCR. We also compared the experimental data with an end-point calculation from equilibrium Monte Carlo calculations, which allowed us to conclude that the low-energy (low-temperature) terms in the calculation are the most relevant to the estimation of the binding energies, resulting in similar correlations between MCR and MC data and the experimental values.
On the other hand, the MCR method provides a reasonable view of the binding energy funnel, with possible connections with the ligand binding kinetics, as well. The codes developed for this analysis are publicly available on GitHub as a part of the LiBELa/MCLiBELa project (https://github.com/alessandronascimento/LiBELa).
Introduction
Molecular interactions are central to most biological processes. Cell signaling depends on the interaction of receptors and hormones (e.g.1); metabolism and catabolism depend on enzyme-substrate recognition, and many proteins exert their biological functions by forming dimers or higher-order oligomers. Despite the pivotal relevance of molecular interactions, the accurate evaluation of the strength of binding involving two molecular species in a biological context is still a challenge2.
The most relevant thermodynamic quantity associated with binding is the change in free energy. This quantity can be rigorously computed from molecular dynamics simulations using the alchemical methods of free energy perturbation (FEP) or thermodynamic integration (TI)3. Other computationally inexpensive methods have also been proposed, based on the end-points of the thermodynamic cycle4,5, or a fast evaluation of the partition function based on limited sampling6,7. In the best scenarios, binding free energies can be computed with a precision of around 1 kcal/mol with the alchemical methods, at the cost of several (typically dozens of) simulations. Another approach for accurate computation of the ligand binding free energy involves the simulation of an unbinding process, using funnel metadynamics8, for example. In this case, appropriate collective variables must be chosen, and the unbinding should be properly sampled in MD simulations. Finally, a different approach focuses on the direct estimation of the free energy using Markov State Modeling (MSM) of the bound and unbound macrostates from several rapid MD or MC simulations. Guallar and coworkers showed promising results using MC simulations in PELE combined with MSM for the estimation of ligand binding free energies9,10. An emerging area of intense research combines the alchemical method FEP with machine learning, allowing better estimation of parameters and better precision in free energy estimation11,12, at the cost of lower throughput in calculations.
Following classical thermodynamics, the change in free energy due to binding is given by the sum of the change in macroscopic energy (ΔU) and the change in the entropy of the system at a given temperature (-TΔS). Although conceptually simple, the evaluation of this thermodynamic quantity involves several challenges. The macroscopic energy U can be assumed to be the average energy for the different microscopic states sampled by molecular dynamics (MD) or Monte Carlo (MC) simulations13, given that a good sampling of thermally accessible conformations is achieved.
The estimation of the entropy, on the other hand, is less straightforward. As pointed out by Edholm and Berendsen14, the multidimensional distribution of variables of the system hinders an accurate evaluation of changes in the system entropy. In addition to the problem of evaluating the thermodynamic quantities associated with binding, the sampling of thermally accessible and relevant microstates is itself an important issue to be addressed by free energy calculation methods.
Following the concept of the folding landscape, as proposed by Onuchic and Wolynes15–17, one could generalize the concept to the context of ligand binding. In this context, the ligand binding process can be thought of as a free energy funnel as a function of the ligand-receptor coordinates during the binding process. The depth of the funnel is related to the energy change (ΔU) due to binding, whereas the funnel radius is related to the change in entropy due to ligand binding. These ideas have already been proposed18–20; however, some difficulties remain in the accurate estimation of the binding energies. Although advances in GPU computing allow a better sampling of the phase space, coupled with better force fields and a myriad of simulation methods, the binding energies are small differences between the much larger absolute energies of the bound complex and its isolated parts, so their precise computation is still a challenge.
Here, we evaluated a fast estimation of the binding free energy for a set of more than 70 host-guest systems using the Monte Carlo Recursion (MCR) approach, first proposed by Harold Scheraga. We aimed to use computationally quick methods without compromising the accuracy, which could be used to evaluate binding energies on lower-end workstations. Our tests showed that MCR obtains a good correlation with experimental data and is fast enough to be used in large-scale drug discovery campaigns.
Methods
Test Sets
To evaluate the interaction energies computed from MC simulations, a set of 30 receptor-ligand complexes was selected. The complexes are experimental crystallographic structures of the T4 lysozyme (T4L) mutant L99A and the double mutants L99A/M102Q21 and L99A/M102E bound to several small ligands (or fragments), for which the experimental binding free energies are available in the literature and compiled in the database BindingDB22 and in Mobley and Gilson’s23 work. The PDB IDs for these complexes are given in Table 1, together with their experimental binding free energies.
In all cases, the receptor structure was parametrized with the AMBER FF14SB force field (atomic charges and Lennard-Jones parameters) using the DockPrep tool as available in UCSF Chimera24. The structural water molecules within 5 Å of the ligand binding site were maintained in the receptor structure. The ligands were parametrized using the General AMBER force field (GAFF)25 version 2 using ANTECHAMBER26 and AM1-BCC as the charge model.
Other datasets evaluated here include the cyclodextrin (CD) dataset, the cucurbit[7]uril (CB7) dataset, and the BRD4 (first bromodomain of the BRD4 protein) dataset. For these datasets, the SYBYL MOL2 files were used as provided by David Mobley’s group27. For BRD4, the receptor protein was parametrized using the UCSF Chimera24 tool DockPrep with AMBER FF14SB28 atomic charges.
Monte Carlo Implementation
The MC simulations were performed in a modified version of our ligand docking software named LiBELa (Ligand Binding Energy Landscape)29. The algorithm involves a random translation and rotation of the ligand within the receptor active site, followed by a random shift in the angle of each rotatable bond (RB). This set of 6 + 3nRB variables defines a move. The energy for the new coordinates is computed and tested according to the Metropolis criterion. The random displacements for Monte Carlo moves are sampled using the random number generator based on the RANLUX algorithm as implemented in the GNU Scientific Library30.
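The move-and-test cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions and not the MCLiBELa code: the harmonic `toy_energy` function, the function names, and the use of Python's `random` module (in place of RANLUX from the GNU Scientific Library) are all hypothetical stand-ins, and only rigid translations are sampled, with rotations and torsions left out.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def metropolis_accept(e_old, e_new, temperature, rng):
    """Metropolis criterion: accept downhill moves always, and uphill
    moves with probability exp(-(e_new - e_old) / (k_B * T))."""
    delta = e_new - e_old
    if delta <= 0.0:
        return True
    return rng.random() < math.exp(-delta / (K_B * temperature))


def toy_energy(position):
    """Hypothetical stand-in for the grid-based energy: a harmonic well
    centred at the origin (kcal/mol, coordinates in angstroms)."""
    return 0.5 * sum(x * x for x in position)


def run_mc(n_steps, temperature, max_step=0.5, seed=42):
    """Translation-only Metropolis sampling of the toy energy."""
    rng = random.Random(seed)
    position = [0.0, 0.0, 0.0]
    energy = toy_energy(position)
    accepted = 0
    energies = []
    for _ in range(n_steps):
        # Random displacement of up to max_step in each direction.
        trial = [x + rng.uniform(-max_step, max_step) for x in position]
        e_trial = toy_energy(trial)
        if metropolis_accept(energy, e_trial, temperature, rng):
            position, energy = trial, e_trial
            accepted += 1
        # The current state is recorded whether or not the move was accepted.
        energies.append(energy)
    return energies, accepted
```

The default `max_step` of 0.5 Å mirrors the maximal translation step quoted in this section; the energy trace returned by `run_mc` is the kind of sample from which MC averages are taken.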
Before each production simulation, the system was equilibrated for 10⁶ accepted MC moves (steps), and, in the case of equilibrium MC simulations, a production simulation was conducted for 5×10⁷ MC steps. The maximal translation step was set to 0.5 Å for each direction, while the maximal rotation step and torsion scan step were set to 1.25 degrees. This modified version of LiBELa will be referred to hereafter as MCLiBELa. In order to speed up the calculations, the binding potentials are pre-computed in three-dimensional grids31,32 assuming a rigid receptor, as usual in docking calculations.
For the purposes of this work, a modified version of the original interaction energy model was adopted. Here, the interaction energy was evaluated as:
In this model, the first term describes a softcore Lennard-Jones potential33 between the ligand atoms (i) and receptor atoms (j), with a smoothing parameter δ set to 2.5 Å. The second term accounts for the polar interactions through a smoothed Coulomb model. In this case, the dielectric ‘constant’ was set to the interatomic distance, i.e., D = rij, while the smoothing parameter δes was set to 2.5 Å. The third term, Eijsol, is a desolvation term, previously described by Stouten34 and Verkhivker33 and also evaluated before in the context of ligand docking35. In this model, the affinity of each atom i to the polar solvent, Si, is modeled as a linear function of the square of the atomic charge:
This empirical affinity is then combined with the volume of the solvent displaced upon interaction (solvent excluded volume), with a Gaussian envelope: where a is the atomic radius of the atom j interacting with atom i and desolvating it, rij is the interatomic distance and σ is a constant (σ=3.5 Å). The final energy associated with ligand and receptor desolvation is given by33:
In this work, the parameter α was set to 0.1 kcal/(mol·e²) and β was set to -0.005 kcal/mol35. The last term in equation (1), EijHB, explicitly models the hydrogen bonds through a 10-12 potential and a directionality term. The parameters are set to result in an interaction energy of -5.0 kcal/mol for a typical hydrogen bond, and the directionality term cos⁴(θ), where θ is the angle between donor, hydrogen atom, and acceptor, is used to ensure the proper geometry of the interaction.
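As a small numerical illustration of the solvent affinity term, the sketch below assumes that "a linear function of the square of the atomic charge" means Si = α·qi² + β, with the α and β values quoted above. The function name and the example charges are hypothetical and chosen only for illustration.

```python
ALPHA = 0.1    # kcal/(mol e^2), value quoted in the text
BETA = -0.005  # kcal/mol, value quoted in the text


def solvent_affinity(charge, alpha=ALPHA, beta=BETA):
    """Affinity of an atom for the polar solvent, assumed here to take the
    form S_i = alpha * q_i**2 + beta (linear in the squared charge)."""
    return alpha * charge ** 2 + beta


# Hypothetical partial charges (in units of e), for illustration only.
for q in (0.0, 0.2, -0.5):
    print(f"q = {q:+.1f} e -> S = {solvent_affinity(q):+.4f} kcal/mol")
```

Under this assumed form, the affinity depends only on the magnitude of the charge, and an uncharged atom carries the constant contribution β.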
For each ligand conformation being sampled, the ligand internal energy (Elig) is computed using the OpenBabel API36 and the GAFF force field. The final energy (Et), as computed in this work, is the sum of the interaction energy (Ei, in equation 1) and the ligand internal energy:
Et, as defined in equation (5), is the term used in the Metropolis criterion in the MC simulations.
Monte Carlo Recursion
The Monte Carlo Recursion (MCR) method is rooted in the computation of the quantity W(b,T), defined as37,38: where b is a constant (b>1) and the brackets represent a Monte Carlo average, evaluated at an MC temperature T. Li and Scheraga noted that W(b,T) is connected to the system partition function Q(T) by 37,38:
By iterating b, i.e., computing W(b, T), W(b₂, bb₂T), W(b₃, bb₂b₃T), …, one arrives at:
Using Q(∞) = V^N, the configurational Helmholtz free energy A(T) can thus be given by: where an ‘effective temperature’ is assigned to each iteration i. Although equation (9) expresses the free energy as an infinite series, the terms in the summation approach zero as the temperature increases, so the method converges after a few recursion terms. For the results shown here, 12 terms were included in the calculations to ensure convergence, with coefficients bi set to 1.5, 1.5, 2.0, 2.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0 and 256.0. For each term of the recursion, the system was allowed to equilibrate for 10⁶ MC steps, followed by 20 million MC steps of production simulation, used to calculate W(b, T).
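The recursion can be sketched in a few lines of Python. The conventions here are assumptions and may differ from the published equations in sign or indexing: W(b, T) is taken as the MC average of exp[(1 − 1/b)E/(kB·T)] over energies sampled at T, which equals Q(bT)/Q(T) exactly, so that ln Q(T) = ln Q(∞) − Σ ln W; the effective temperatures are taken as T₁ = T and Tᵢ₊₁ = bᵢ·Tᵢ. All function names are hypothetical.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def effective_temperatures(t_start, coefficients):
    """Temperature at which each recursion term is sampled, assuming
    T_1 = t_start and T_(i+1) = b_i * T_i."""
    temps = []
    t = t_start
    for b in coefficients:
        temps.append(t)
        t *= b
    return temps


def ln_w(energies, b, temperature):
    """ln of the sample average of exp[(1 - 1/b) * E / (k_B * T)] over
    energies sampled at T. The largest exponent is factored out before
    exponentiating to avoid floating-point overflow."""
    exponents = [(1.0 - 1.0 / b) * e / (K_B * temperature) for e in energies]
    shift = max(exponents)
    total = sum(math.exp(x - shift) for x in exponents)
    return shift + math.log(total / len(exponents))


def helmholtz_free_energy(ln_w_terms, temperature, ln_q_inf):
    """A(T) = -k_B * T * [ln Q(inf) - sum_i ln W_i] under the assumed
    convention W = Q(bT) / Q(T)."""
    return -K_B * temperature * (ln_q_inf - sum(ln_w_terms))


# Temperature ladder for the 12 coefficients and the 25 K start quoted in the text.
coefficients = [1.5, 1.5, 2.0, 2.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0]
ladder = effective_temperatures(25.0, coefficients)
```

In a full calculation, `ln_w` would be evaluated once per rung of `ladder` from the production energies of that iteration, and the resulting terms passed to `helmholtz_free_energy` together with ln Q(∞).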
The binding free energy is computed using the same simulation setup for the ligand sampled within the receptor structure (Acomplex) and for the free ligand (Aligand) and taking their difference: ΔAbind = Acomplex − Aligand. Ten independent simulations were set up for each system using different random seeds. The errors in the estimation of the binding energy were computed as the standard deviation among the ten simulations.
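The difference and its error estimate can be sketched as below. Two choices here are assumptions, since the text does not specify them: run k of the complex is paired with run k of the free ligand, and the sample standard deviation (n − 1 denominator) is used. The function name is hypothetical.

```python
import statistics


def binding_free_energy(a_complex_runs, a_ligand_runs):
    """Return (mean, standard deviation) of dA_bind = A_complex - A_ligand,
    pairing run k of the complex with run k of the free ligand."""
    if len(a_complex_runs) != len(a_ligand_runs):
        raise ValueError("both lists must hold one value per independent run")
    deltas = [c - l for c, l in zip(a_complex_runs, a_ligand_runs)]
    return statistics.mean(deltas), statistics.stdev(deltas)
```

With ten seeds per system, each list would hold ten free energies in kcal/mol, and the second returned value is the error reported for that system.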
Results
The estimation of the binding energy involved in ligand-receptor recognition is of paramount relevance in the structure-based design of drugs. Despite its relevance, this estimation still represents a challenge for computational structural biologists. Shoichet’s and Irwin’s groups showed recently that the energy scores computed from docking calculations show only a modest correlation with ligand affinity39,40, suggesting that the prioritization of a given compound after a virtual screening is still a challenge, especially if the project budget limits the number of compounds to be experimentally tested.
On the other hand, the most accurate methods for affinity estimation require expensive MD/MC simulations, making them impractical for the prioritization of a large number of compounds identified in a large to ultra-large docking campaign involving more than a hundred million compounds screened39. In this scenario, we looked for a methodology that could be accurate enough to suggest the best binders after a reasonable docking pose has been found but preserving a relatively low computational cost. In this context, we evaluated a Monte Carlo-based sampling method that could be able to refine the docking poses, but also ensure better linearity among the computed binding energies and the actual binding free energy.
After preliminary tests, low MC temperatures were chosen to start the simulations, ensuring that the ligand sampled low-energy conformations within the active site, close to the crystallographic pose. For the results shown here, a temperature of 25 K was set for all the simulations. As we will show below, starting the simulation at low MC temperatures ensures an appropriate sampling of the bound state during the recursion. Averaged over the 30 T4 lysozyme complexes, each calculation took about 24 hours for complex and free ligand, running one MCR simulation per processor thread of Intel Xeon E5645 processors. Thus, the computational costs involved in the calculations are affordable with small in-house computing resources, even for the screening of hundreds of molecules, with a view to prioritizing compounds previously selected in ligand docking campaigns.
From a sampling point of view, the MCR method can be seen as a sort of ‘inverse simulated annealing’, in the sense that the effective temperature is progressively increased to sample high-energy states, finally resulting in convergence of the cumulative quantity ln(W). This quantity will be dominated by the low-energy terms, sampled in the first iterations, making it relevant to choose low temperatures to start the simulations.
The convergence of the calculations of the binding energies was first analyzed. To assess whether the sampling acquired in the MCR simulations was appropriate and reached an equilibrium state, we analyzed the cumulative quantity ln(W) for the complex, i.e., bound ligand and receptor (blue points in Figure 1), as well as for the isolated ligand (red points) and their difference (green points). As shown in Figure 1, the calculations reach a converged state in fewer than 10 iterations. The estimation of the free energies, as defined in equation (9), also revealed a convergence of this quantity to less than 0.1 kcal/mol after the 12 iterations in MCR.
A closer inspection of the results shown in Figure 1 also shows the relevance of the first five terms in the recursion. The difference between the ligand-bound simulation and the free ligand simulation resides in these first terms, where the ligand is sampling low-energy conformations. At higher effective temperatures, the ligand dissociates from the receptor and the sampling is mostly restricted to a free ligand in solution. In this scenario, the quantity ln(W) is mostly the same for both simulations and tends to zero as the temperature is increased.
This analysis is also shown for a representative complex of the T4L dataset in Figure 2. Here, the 186L ligand-enzyme complex is shown during the first three iterations (Figure 2, left). The left panel shows the ligand sampling different conformations within the binding pocket, while the right panel shows the sampled conformations during the fourth iteration. At the beginning of the iteration, the ligand is still found in the ligand pocket, but it is rapidly displaced and starts to sample different conformations outside the ligand pocket and around the enzyme. The results shown here are in line with the values computed for the cumulative ln(W) shown in Figure 1, where the differences between ligand-bound and free ligand simulations tend to zero after the fourth iteration.
The binding energies were computed following equation (9). For this calculation, the cubic volume sampled by the ligand was estimated by the maximal displacement in the center of mass in each direction for each iteration. Figure 3 shows the free energy computed for each ligand of the T4L dataset as a function of the MCR iteration. The figure shows that the computed binding free energies converge in fewer than 10 iterations.
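One way to read the volume estimate described above is as the box spanned by the ligand center of mass during an iteration. The sketch below assumes exactly that (extent = maximum minus minimum coordinate along each axis); the function name and the input layout, a sequence of (x, y, z) tuples in Å, are hypothetical.

```python
def sampled_box_volume(com_trajectory):
    """Volume (in cubic angstroms) of the axis-aligned box spanned by the
    ligand centre of mass, given a sequence of (x, y, z) coordinates."""
    xs, ys, zs = zip(*com_trajectory)
    return (max(xs) - min(xs)) * (max(ys) - min(ys)) * (max(zs) - min(zs))
```

A ligand confined to the pocket in the early iterations yields a small volume, whereas a dissociated ligand at high effective temperature spans a much larger box.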
A comparison between the computed energies and the experimental binding free energies showed only a modest correlation, as shown in Figure 4 (inset) and Table 1. Some issues must be highlighted here. First, the experimental energies were measured at temperature T = 300 K, while the simulation data were taken at an MC temperature T = 25 K. The low MC temperature was chosen to ensure that the sampling of the binding conformation would be appropriate in the first iterations. However, the difference might lead to an overestimation of the binding energies, as shown in Figure 4. Here, the computed energies are in the interval between -10 and -20 kcal/mol, while the experimental binding energies are found in the range of -3 to -7 kcal/mol.
Second, the small range of values found in the experimental results makes it very difficult for any computational method to make predictions on this dataset. Any computational method with a precision of ±1 kcal/mol would not distinguish among most of the ligands. So, a more rigorous assessment of the method should include more datasets spanning a wider range of experimental values for the binding free energies.
Third, the important approximations applied to the model used here should be considered, including a rigid receptor with pre-computed interaction grids. Although these approximations make the sampling of conformations faster, they impose important limitations on the accuracy of the binding energy estimation.
The second issue can be further explored by simulating the binding energies in other datasets. Here we chose the datasets CB7, CD, and BRD4. Using the same approaches, the binding energies were computed for these datasets and the results are shown in Figure 4. In this scenario, the experimental binding free energies include values from 0 to -20 kcal/mol and the datasets together account for 73 data points. The correlation between the computed and the experimental binding free energies, as given by the Pearson correlation coefficient, is r = 0.73. Interestingly, a linear fit between the experimental and computed binding free energies reveals a slope of 2.3 and an intercept of -5.0 kcal/mol, suggesting an important bias of the model towards more negative values, as previously noted.
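The two statistics quoted above can be computed as in the following sketch. It assumes an ordinary least-squares fit with the experimental values on the x axis and the computed values on the y axis; the text does not state which variable was regressed on which, and the function name is hypothetical.

```python
import math


def pearson_and_fit(x, y):
    """Return (r, slope, intercept) for paired data: the Pearson correlation
    coefficient and the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, slope, intercept
```

If the reported fit is read as computed = 2.3 × experimental − 5.0, an experimental value of -5 kcal/mol maps to a computed value of -16.5 kcal/mol, in line with the -10 to -20 kcal/mol range found for the T4L ligands.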
The data shown in Figure 4 indicates that the MCR method, as applied here, can correctly rank the ligands according to their binding free energy, despite the bias in estimating the absolute binding free energies. At this point, we decided to investigate whether MCR is making better predictions than any ‘end-point’ method, i.e., since the MCR energies are dominated by the low-energy terms at the beginning of the recursion, can average energies computed at low temperatures reach the same performance?
For this purpose, we ran Metropolis Monte Carlo simulations at an MC temperature of 25 K for the receptor-ligand complex and the free ligand. The binding energies were computed by taking their differences. In this approach, the conformational entropy was also estimated using a first-order approximation for ligand rotation and translation, as previously proposed by Edholm and Berendsen14 and by Killian and coworkers47.
We found that end-point binding energies computed from equilibrium MC simulations resulted in a similar correlation with experimental data when compared to the results obtained with MCR simulations, as shown in Figure 5. This finding suggests that binding energies computed with MCR are mostly based on energy differences sampled at low temperatures.
Discussion
Here we describe an approach for rapid estimation of the biomolecular interaction by Monte Carlo simulations of the ligand within its binding pocket. The method has a few conveniences: First, the rapid sampling obtained by MC simulations allows its use as a refinement of scores obtained in docking calculations in the context of the compound screening. In an ideal scenario, it is plausible to use a docking engine to sample libraries of purchasable compounds containing millions of compounds and pre-select the most likely to bind to a given macromolecular target. Afterward, the pre-selected compounds can be prioritized and ranked using MC simulations, an inexpensive, though reliable method for the estimation of the interaction. Second, the method is extensible and may allow in the future the sampling of a restricted region of the receptor, for example, increasing the reliability of the calculations.
In terms of the changes in free energy, the ligand binding process can be compared to the protein folding process19,48,49: during the binding event, i.e., when moving from the aqueous solvent to a macromolecular binding pocket, a ligand loses its conformational entropy, while gaining interaction energy or enthalpy. The folding funnel model17, as proposed by Wolynes and Onuchic, has a parallel here for ligand binding, where the correct binding pose should be identified as the global minimum in the funnel. Also, similar to what is found in folding funnels, several local minima are observed in the binding funnel. The roughness of the binding funnel leads to several imprecise results in calculations based on finding (local) minima, such as in docking calculations, for example, resulting in a poor correlation between docking scores and experimental data. The MCR method, when applied to ligand binding, as shown in this work, can be seen as a computational method to sample low-energy conformations (bottom of the funnel), as well as high-energy conformations, potentially providing an assessment of the accessible conformations in the (un)binding funnel. In this context, additional investigation is under way to evaluate whether unbinding events sampled in MCR correlate with experimental ligand unbinding kinetic data, for example.
In terms of the binding energies computed, the MCR method showed results comparable with end-point calculations with equilibrium MC data. However, equilibrium MC simulations are much faster, taking about 6.3 hours on average for each T4L target, while MCR simulations take about 23.6 h for each T4L target, running on a single CPU thread. Even with this difference in simulation time, MCR has advantages over equilibrium MC. First, the MCR method is strongly rooted in thermodynamics. In principle, with better parameters and appropriate sampling, one could end up with free energy values close to experimental data, as shown for simpler systems37,38. Second, as we mentioned before, the method may allow the sampling of unbinding coordinates, providing a perspective of the binding funnel for a particular host-guest system.
A representative ligand-receptor view is shown in Figure 6 for the 186L complex, from the T4L dataset. The figure shows that conformations with low RMSD values (as compared to the crystal structure conformation) are observed for low-energy conformations. As the energy increases, the observed RMSD values also increase, and, at RMSD values close to 10 Å, the ligand is dissociated from its receptor. After unbinding, the ligand samples a wide range of conformations with RMSD values varying between 10 and 40 Å. The right panel in Figure 6 shows a closer view of the funneling in RMSD and the total energy of the system.
On the other hand, if only a simple rescoring of ligands is required, an equilibrium MC simulation with an end-point calculation could be sufficiently precise and about four times faster. This could allow, for example, the rescoring of 1,000 ligands pre-selected in a docking campaign to be completed in one day using an in-house computer cluster, making these calculations affordable and providing a more accurate view of the binding energetics, which may be valuable information for decision making in drug discovery pipelines.
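The cluster size implied by this scenario follows from simple arithmetic. The sketch below assumes that the average timings quoted for the T4L targets (6.3 h for equilibrium MC, 23.6 h for MCR) carry over to other ligands and that each ligand occupies one CPU thread; the function name is hypothetical.

```python
import math


def threads_needed(n_ligands, hours_per_ligand, wall_hours):
    """Number of CPU threads required to finish all ligands within the
    wall-clock budget, running one ligand per thread at a time."""
    return math.ceil(n_ligands * hours_per_ligand / wall_hours)


mc_threads = threads_needed(1000, 6.3, 24.0)    # equilibrium MC rescoring
mcr_threads = threads_needed(1000, 23.6, 24.0)  # full MCR calculation
```

Under these assumptions, rescoring 1,000 ligands in one day takes a few hundred threads with equilibrium MC and close to a thousand with MCR.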
A typical FEP calculation takes about 8 hours on a modern GPU, such as an NVIDIA GeForce RTX 2080, and a few CPU cores, depending on the implementation. Our MC/MCR calculations take 8 (MC) or 23 hours (MCR) on a single CPU thread, scalable to hundreds or thousands of calculations in computer clusters or supercomputers. Finally, the correlation achieved with the experimental binding free energies is better than the results obtained with docking calculations and comparable with FEP calculations in some scenarios50,51.
Conclusions
In conclusion, the data provided here show a conceptually simple approach for the determination of the binding free energy by combining an MC ensemble average at increasing effective temperatures and free energy estimation using the Monte Carlo Recursion method. The direct evaluation of this thermodynamic quantity allows a more precise ranking of screened ligands in docking campaigns at an affordable computational cost, even for small, in-house computer clusters. The codes developed for this analysis are publicly available on GitHub as a part of the LiBELa/MCLiBELa project (https://github.com/alessandronascimento/LiBELa).
Supporting Information Description
Two spreadsheets are provided as supplementary material. These files contain the experimental binding free energies and the computed binding free energies obtained with MCR simulations and with equilibrium MC simulations.
Table of Contents/Abstract Graphics
Acknowledgments
The authors thank the funding agency FAPESP for the financial support through grants 2014/06565-2, 2017/18173-0, 2020/03983-9, 2010/15376-8, 2015/26722-8, as well as CNPq, through grants 485950/2013-8 and 302992/2021-9. JVSC also thanks FAPESP for the Master fellowship 2015/01709-9 and 2014/01751-2. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. We also thank Heloisa Muniz, Camila Tanimoto Rodrigues, and Milton T. Sonoda (in memoriam) for the very fruitful discussions.