Overview of the SAMPL6 host-guest binding affinity prediction challenge

Accurately predicting the binding affinities of small organic molecules to biological macro-molecules can greatly accelerate drug discovery by reducing the number of compounds that must be synthesized to realize desired potency and selectivity goals. Unfortunately, the process of assessing the accuracy of current computational approaches to affinity prediction against binding data to biological macro-molecules is frustrated by several challenges, such as slow conformational dynamics, multiple titratable groups, and the lack of high-quality blinded datasets. Over the last several SAMPL blind challenge exercises, host-guest systems have emerged as a practical and effective way to circumvent these challenges in assessing the predictive performance of current-generation quantitative modeling tools, while still providing systems capable of possessing tight binding affinities. Here, we present an overview of the SAMPL6 host-guest binding affinity prediction challenge, which featured three supramolecular hosts: octa-acid (OA), the closely related tetra-endo-methyl-octa-acid (TEMOA), and cucurbit[8]uril (CB8), along with 21 small organic guest molecules. A total of 119 entries were received from 10 participating groups employing a variety of methods that spanned from electronic structure and movable type calculations in implicit solvent to alchemical and potential of mean force strategies using empirical force fields with explicit solvent models. While empirical models tended to obtain better performance than first-principle methods, it was not possible to identify a single approach that consistently provided superior results across all host-guest systems and statistical metrics. Moreover, the accuracy of the methodologies generally displayed a substantial dependence on the system considered, emphasizing the need for host diversity in blind evaluations. Several entries exploited previous experimental measurements of similar host-guest systems in an effort to improve their physical-based predictions via some manner of rudimentary machine learning; while this strategy succeeded in reducing systematic errors, it did not correspond to an improvement in statistical correlation. Comparison to previous rounds of the host-guest binding free energy challenge highlights an overall improvement in the correlation obtained by the affinity predictions for OA and TEMOA systems, but a surprising lack of improvement regarding root mean square error over the past several challenge rounds. The data suggests that further refinement of force field parameters, as well as improved treatment of chemical effects (e.g., buffer salt conditions, protonation states) may be required to further enhance predictive accuracy.

capable of possessing tight binding affinities. Here, we present an overview of the SAMPL6 host-guest 23 binding affinity prediction challenge, which featured three supramolecular hosts: octa-acid (OA), the closely 24 related tetra-endo-methyl-octa-acid (TEMOA), and cucurbit [8]uril (CB8), along with 21 small organic guest 25 molecules. A total of 119 entries were received from 10 participating groups employing a variety of methods 26 that spanned from electronic structure and movable type calculations in implicit solvent to alchemical and 27 potential of mean force strategies using empirical force fields with explicit solvent models. While empirical 28 models tended to obtain better performance than first-principle methods, it was not possible to identify 29 a single approach that consistently provided superior results across all host-guest systems and statistical 30 metrics. Moreover, the accuracy of the methodologies generally displayed a substantial dependence on 31 the system considered, emphasizing the need for host diversity in blind evaluations. Several entries ex- 32 ploited previous experimental measurements of similar host-guest systems in an effort to improve their 33 physical-based predictions via some manner of rudimentary machine learning; while this strategy succeeded 34 in reducing systematic errors, it did not correspond to an improvement in statistical correlation. Comparison 35 to previous rounds of the host-guest binding free energy challenge highlights an overall improvement in Introduction 42 Quantitative physical and empirical modeling approaches have played a growing role in aiding and directing 43 the design of small molecule biomolecular ligands for use as potential therapeutics or chemical probes [1-44 4, 23, 65]. The degree of inaccuracy of these predictions largely determines how effective they can be 45 in prioritizing synthesis of small molecule ligands [105]. Retrospective estimates have suggested that 46 current methodologies are capable of achieving about 1-2 kcal/mol inaccuracy for well-behaved protein- 47 ligand systems [6,122], but more work remains to be done to extend the applicability domain of these 48 technologies. 49 Assessment of how much of this inaccuracy can be attributed to fundamental limitations of the force 50 field in accurately modeling energetics is complicated by the presence of numerous additional factors [78]. 51 Proteins are highly dynamic entities, and many common drug targets-such as kinases [115] and GPCRs [63]- 52 possess slow dynamics with timescales of microseconds to milliseconds [62] that frustrate the computation 53 of true equilibrium affinities. While there has been some attempt to curate benchmark sets of protein- 54 ligand affinity data in well-behaved model protein-ligand systems that are believed to be mostly free of 55 slow-timescale motions that would convolve convergence issues with forcefield inaccuracies [78], other 56 effects can complicate assessment of the accuracy of physical modeling benchmarks. Ionizable residues, for 57 example, comprise approximately 29% of all protein residues [56], and large-scale computational surveys 58 suggest that 60% of all protein-ligand complexes undergo a change in ionization state upon binding [5], with 59 several notable cases characterized experimentally [24,25,91,111]. For physical or empirical modeling 60 approaches that assume fixed protonation states throughout the complexation process, protonation state 61 effects are hopelessly convolved with issues of force field inaccuracy. 62 Host-guest systems are a tractable model for assessing force field inaccuracies 63 Over the last decade, supramolecular host-guest complexes have emerged as a practical and useful model 64 system for the quantitative assessment of modeling errors for the interaction of druglike small molecules with 65 receptors. Supramolecular hosts such as cucurbiturils, cavitands, and cyclodextrins can bind small druglike 66 molecules with affinities similar to protein-ligand complexes [84,85,99]. The lack of slowly relaxing confor- 67 mational degrees of freedom of these hosts eliminates the potential for slow microsecond-to-millisecond 68 receptor relaxation timescales as a source of convergence issues [78], while the small size of these systems 69 allows many methodologies to take advantage of faster simulation times to rapidly assess force field quality. 70 The high solubilities of these systems permit high-quality biophysical characterization of their interactions 71 via gold-standard methods such as isothermal titration calorimetry (ITC) and nuclear magnetic resonance 72 (NMR) [22,38,114]. Additionally, the stability of supramolecular hosts at extreme pH allows for strict control 73 of protonation states in a manner not possible with protein-ligand systems, allowing confounding protona- 74 tion state effects to be eliminated from consideration if desired [114]. Collectively, these properties have 75 made host-guest systems a productive route for revealing deficiencies in modern force fields through blind 76 community challenge exercises we have organized as part of the Statistical Assessment of the Modeling of 77 Proteins and Ligands (SAMPL) series of blind prediction challenge [86,88,107,127]. 78 SAMPL host-guest challenges have driven advances in our understanding of sources of error 79 The SAMPL (Statistical Assessment of the Modeling of Proteins and Ligands) challenges are a recurring series 80 of blind prediction challenges for the computational chemistry community [76, Drug Design Data Resource]. 81 Through these challenges, SAMPL aims to evaluate and advance computational tools for rational drug design: 82 By focusing the community on specific phenomena relevant to drug discovery-such as the contribution 83 of force field inaccuracy to binding affinity prediction failures-isolating these phenomena from other 84 confounding factors in well-designed test systems, evaluating tools prospectively, enforcing data sharing to 85 learn from failures, and releasing the resulting high-quality datasets into the community as benchmark sets, 86 SAMPL has driven progress in a number of areas over five previous rounds of challenge cycles [9,34,35,43,87 44,81,82,86,88,93,107,107,108,127]. 88 More specifically, SAMPL host-guest challenges have provided key tests for modeling of binding interac- 89 tions [78], motivating increased attention to how co-solvents and ions modulate binding (resulting in errors 90 of up to 5 kcal/mol when these effects are neglected) and the importance of adequately sampling water 91 rearrangements [17,78,86,127]. In turn, this detailed examination has resulted in clear improvements in 92 subsequent SAMPL challenges [127], though host-guest binding remains difficult to model accurately [47], in 93 part due to force field limitations (spawning new efforts to remedy major force field deficiencies [125]). 94 SAMPL6 host-guest systems 95 Three hosts were selected for the SAMPL6 host-guest binding challenge from the Gibb Deep Cavity Cavitand 96 (GDCC) [36,48,79,80] and the cucurbituril (CB) [31,69,83] families (Figure 1). The guest ligand sets were 97 purposefully selected for the SAMPL6 challenge. The utility of these particular host systems for evaluating 98 free energy calculations has been reviewed in detail elsewhere [79,80]. 99 The two GDCCs, octa-acid (OA) [36] and tetra-endo-methyl-octa-acid (TEMOA) [33], are low-symmetry 100 hosts with a basket-shaped binding site accessible through the larger entryway located at the top. These 101 hosts also appeared in two previous SAMPL host-guest challenges-SAMPL4 [86] and SAMPL5 [127]-with 102 the names of OAH and OAMe respectively with different sets of guests. OA and TEMOA differ by four methyl 103 groups that reduce the size of the binding site entryway (Figure 1). Both hosts expose eight carboxyl groups 104 that increase their solubility. The molecular structures of the eight guests selected for the SAMPL6 challenge 105 for characterization against both OA and TEMOA are shown in Figure 1 (denoted OA-G0 through OA-G7). 106 These guests feature a single polar group situated at one end of the molecule that tends to be exposed to 107 solvent when complexed, while the rest of the compound remains buried in the hydrophobic binding site. 108 A second set of guest ligands were developed for the host cucurbit [8]uril (CB8). This host previously 109 appeared in the SAMPL3 host-guest binding challenge [87], but members of the same family or analogs 110 such as cucurbit [7]uril (CB7) and CBClip [128] were featured in SAMPL4 and SAMPL5 challenges as well. CB8 111 is a symmetric (D 8ℎ ), ring-shaped host comprising eight identical glycoluril monomers linked by pairs of 112 methylene bridges. Its top-bottom symmetry means that asymmetric guests have at least two symmetry- 113 equivalent binding modes that can be kinetically separated by timescales not easily achievable by standard 114 molecular dynamics (MD) or Monte Carlo simulations and may require special considerations, in particular in 115 alchemical absolute binding free energy calculations [75]. The CB8 guest set (compounds CB8-G0 to CB8-G13 116 in Figure 1) includes both fragment-like and bulkier drug-like compounds. 117 Some of the general modeling challenges posed by both families of host-guest systems have been 118 characterized in previous studies. While their relatively rigid structure minimizes convergence difficulties 119 associated with slow receptor conformational dynamics, both families have been shown to bind guest 120 ligands via a dewetting processes-in which waters must be removed from the binding site to accommodate 121 guests-in a manner that can frustrate convergence for strategies based on molecular simulation. In the 122 absence of tight-binding guest ligands, the octa-acid host experiences fluctuations in the number of bound 123 waters on timescales of several nanoseconds [30]

OA-G1
trans-2-hexenoic acid OA-G2 tures of the three hosts featured in the SAMPL6 challenge dataset (OA, TEMOA, and CB8) are shown in stick view from top and side perspective views. Carbon atoms are represented in gray, hydrogens in white, nitrogens in blue, and oxygens in red. Guest ligands for each complex are shown as two-dimensional chemical structures annotated by hyphenated host and guest names. Protonation states of the guest structures correspond to the predicted dominant microstate at the experimental pH at which binding affinities were collected, and matches those provided in the mol2 and sdf input files shared with the participants when the challenge was announced. The same set of guests OA-G0 through OA-G7 was used for both OA and TEMOA hosts. The gray frame (lower right) contains the three CB8 guests that constitute the bonus challenge.  -G1  OA-G2  OA-G3  OA-G4  OA-G5  OA-G6  OA-G7  TEMOA-G0  TEMOA-G1  TEMOA-G2  TEMOA-G3  TEMOA-G4  TEMOA-G5  TEMOA-G6  TEMOA-G7 CB8 System ID which is the case for the CB8 guest set, as well as for OA-G5, TEMOA-G5, and TEMOA-G7. 143 To determine experimental uncertainties, we added the relative error in the nonlinear fit-derived as-144 sociation constant ( ) or binding enthalpy (Δ ) with the relative error in the titrant concentration in 145 quadrature [20]. We decided to arbitrarily assume a relative error in the titrant concentration of 3% after 146 personal communication with Professor Lyle Isaacs who suggested a value inferior to 5% based on his expe-147 rience. The minimum relative nonlinear fit-derived uncertainty permitted was 1%, since the fit uncertainty 148 was reported by the ITC software as smaller than this in some cases. It should be noted that the error 149 propagation strategy adopted here assumes that the stoichiometry coefficient is fitted to the ITC data in 150 order to absorb errors in cell volume and titrand concentration; this approach is exact only for the OA/TEMOA 151 sets with the exclusion of OA-G5, TEMOA-G5, and TEMOA-G7, and an underestimate of the true error for the 152 remaining cases. The error was then further propagated to the binding free energies and entropies that 153 were calculated from and Δ . The final estimated experimental uncertainties are relatively small, never 154 exceeding 0.1 kcal/mol. 155 The resulting experimental measurements with their uncertainties are reported in Table 1 and Figure 2. 156 The dynamic range of the binding free energy Δ spans 4. 25   the OpenEye toolkit [54,55]. Figure 1 shows the protonation state of the molecules as provided in the input 207 files, which reflects the most likely protonation state as predicted by Epik [41,102]  and charged. The instructions stated clearly that the protonation and tautomeric states provided were not 212 guaranteed to be optimal. In particular, participants in the bonus challenge were advised to treat CB8-G12 213 with care as, in its protonated state, the nitrogen proton could be placed so that the substituent was axial 214 or equatorial. The latter solution was arbitrarily adopted by the tools used to generate the input files for 215 CB8-G12.

216
Statistical analysis of challenge entries 217 Performance statistics 218 We computed root mean square error (RMSE), mean signed error (ME), coefficient of determination (R 2 ), and 219 Kendall rank correlation coefficient ( ) comparing experimentally determined binding free energies with 220 blinded participant free energy predictions. 221 The mean signed error (ME), which quantifies the bias in predictions, was computed as where Δ ( ) and Δ ( ) are the experimental measurement of the binding free energy and its computational 223 prediction respectively for the -th molecule, and is the total number of molecules in the dataset. A positive 224 ME reflects an overestimated binding free energy Δ (or underestimated affinity = − Δ × ( ). 225 Some of the methods appearing in SAMPL6 were also used in previous rounds of the same challenge to 226 predict relative binding free energies of similar host-guest systems. In order to comment on the performance 227 of these methods over sequential challenges, for which statistics on absolute free energies are not readily 228 available, we computed a separate set of statistics defined as offset statistics, as opposed to the absolute 229 statistics defined above, in the same way they were reported in previous challenge overview papers. These will be referred to as OA/TEMOA set in the rest of the work. The only method used to predict the binding free 239 energies of the TEMOA set but not of the OA set was US-CGenFF (see Table 2  to draw general conclusions about the state of the field regarding the reliability of enthalpy predictions. 263 Moreover, the predictive performance was generally poor (see Supplementary Figure 9) Most methods employing classical force fields used GAFF [121] or GAFF2 (still under active development) 296 with AM1-BCC [54,55] or RESP [12] charges, which were usually derived at the Hartree-Fock or MP2 level of 297 theory. Other approaches made use of the AMOEBA polarizable model [96], CGenFF [28] or force match-298 ing [120] starting from CGenFF parameters. The movable type calculations utilized either the KECSA [129] 299 scoring algorithm or the more recently developed GARF [11]. Several submissions employed QM potentials 300 at the semi-empirical PM6-DH+ [64,100] or DFT level of theory either modeling the full host-guest system or 301 in hybrid QM/MM approaches that treated quantum mechanically the guest only. DFT calculations employed 302 B3LYP [13], B3PW91 [13], or TPSS [116] functionals and often the DFT-D3 dispersion correction [42]. 303 Sampling and free energy prediction 304 All the challenge entries used MD to sample host-guest conformations; uses of docking were limited to 305 preparation of initial bound geometries for subsequent simulations. This was also the case also for QM 306 and movable type calculations, where samples generated from MD were in some cases clustered prior 307 to quantum chemical energy evaluations. In a few cases, enhanced sampling techniques were used; 308 in particular, the entries identified by DDM-FM and DDM-FM-QMM used Hamiltonian Replica Exchange 309 (HREX) [113] as part of their double decoupling method (DDM) calculation [39] while Replica Exchange with 310 Solute torsional Tempering (REST) [68,71] was employed in FSDAM to generate from equilibrium the starting 311 configurations for the fast switching protocol. Many groups used the double decoupling or the double 312 annihilation method with purely classical force fields or with hybrid QM/MM potentials and either Bennett 313 acceptance ratio (BAR) [15,103] or the multistate Bennett acceptance ratio (MBAR) [104] to estimate free 314 energies for the aggregated simulation data. Other classes of methodologies applied to this dataset include 315 umbrella sampling (US) [119], movable type [130], MMPBSA [110], and free energy predictions based on QM 316 calculations. 317 The repeat appearance of hosts chosen from the octa-acid and cucurbituril families as test systems for 318 the SAMPL binding challenge, which reflects the continuous contribution of experimental data from the Gibb 319 and Isaacs laboratories, led some groups to take advantage of previously available experimental data to 320 improve their computational predictions. Several entries (e.g., SOMD-D, US-GAFF-C, and MovTyp-GE3L) were 321 submitted with a linear 1 correction of the form where the slope and offset coefficients (i.e., a and b respectively) were trained on data generated for previous 323 rounds of the challenge. In some of the movable type calculations (e.g., MovTyp-GE3O), the coefficient 324 was fixed to unity and the training data used to determine a purely additive bias correction. Relatedly, 325 RFEC-GAFF2 and RFEC-QMMM, which included predictions for the OA and TEMOA guest sets, calculated 326 the relative binding free energy between the compound and determined the offsets necessary to obtain 327 absolute free energy using binding measurements of similar OA and TEMOA guests. order to compare them on the same set of compounds. Table 3 reports such statistics with 95-percentile 334 confidence intervals and Figure 4 show the statistics bootstrap distributions. Some of the methods were 335 used to estimate the binding free energy of only one between the OA/TEMOA and the CB8 sets, and, as a 336 consequence, some of the table entries are missing. For the methodologies that made predictions of the 337 bonus compounds, we report the statistics obtained including them separately in Table 4. While it is difficult 338 to isolate methods and models that performed very well across datasets and statistics, a few patterns 339 emerged from comparing the different entries. 340 1 Technically, this is an affine transformation in the general case since ≠ 0 for some of the corrections employed by participants, but we will refer to it as linear here. association constants ( ), binding free energies (Δ ), enthalpies (Δ ), entropies at room temperature ( Δ ) and stoichiometric ratios ( ) as determined by ITC and NMR assays are reported for all compounds featured in the challenge. All quantities are reported as point estimates ± statistical error obtained by error propagation. For and Δ , the reported uncertainties incorporate both the uncertainty in the ITC enthalpogram least-squares fit and an assumed 3% uncertainty in titrant concentration. A minimum least-squares fit uncertainty of 1% was assumed for fit errors reported by instrumentation as < 1%. Δ and Δ and their uncertainties were obtained from the first two quantities. Some of the compounds in the CB8 guest set can be bound by their hosts with stoichiometries different than 1:1. For CB8-G1 and CB8-G4, which can form 1:2 (two hosts bound to the same guest) and 1:3 complexes with CB8, respectively, we report the thermodynamic quantities of only one of the equivalent binding events-the value used to calculate the statistics for challenge entries. For CB8-G12, we report the measurements of both the 1:1 (CB8-G12a) and the 2:1 (CB8-G12b) bound complexes. The original data can be found at https://github.com/MobleyLab/SAMPL6/tree/master/host_guest/Analysis/ ExperimentalMeasurements/experimental_measurements.csv. Eventual updates or corrections to the data will be made available at the same URL, and anyone wishing to reuse the data should refer there.        Challenge entries generally performed better on OA/TEMOA than CB8 341 In general, the CB8 guest set proved to be more challenging than the OA/TEMOA set both in terms of error 342 and correlation statistics. It is rarely the case that the same method scored better statistics on the former 343 set, and only MovTyp-GT1N does so with statistical significance while the opposite can be observed relatively 344 often. Figure 5-A shows the root mean square error (RMSE) and mean signed error (ME) with 95-percentile 345 bootstrap confidence interval computed for each molecule using the ten methods that scored best in RMSE 346 statistics in the merged OA/TEMOA set or the CB8 set (excluding the bonus challenge), which formed a set of 347 14 different techniques employing GAFF and GAFF2 [121], CGenFF [120], force matching [28], AMOEBA [96], 348 and QM/MM potentials using DFT(B3LYP) [13] or PM6-DH+ [64,100]. These top ten methods performed 349 poorly on eight out of the eleven CB8 compounds, and while confidence intervals for all the statistics are 350 generally large, they also performed significantly worse on several CB8 guests than the OA/TEMOA ligands 351 they accurately predicted affinities for. This loss of accuracy seems to be fairly consistent across models and 352 methodologies, but the data is not sufficient to determine the exact cause of this behavior (e.g., force field The same trend appears when examining the performance of methods in correctly predicting the tightest 361 binder of the three guest sets Figure 5-B. About 61% and 66% of the methods correctly ranked OA-G2 and 362 TEMOA-G4 as the tightest complexes in their respective sets, while CB8-G8 was correctly classified in only 363 about 43% of the cases. In particular, the latter observation is interesting when considering that the binding 364 free energy of G8 to CB8 is 3.5 kcal/mol greater than the second tightest binder (G7), despite the structural 365 similarities between both guests. It is also worth mentioning that SOMD method was the only methodology 366 that correctly ranked the tightest binder of the three separate guest sets, although the prediction that G8 367 was the highest affinity guest for CB8 did not hold when buffer salt conditions were modeled explicitly.   Data Source: https://github.com/MobleyLab/SAMPL6/tree/master/host_guest/Analysis Table 5. Offset statistics of the methods appearing in previous rounds of the SAMPL host-guest binding challenge.
Root mean square error (RMSE), coefficient of determination (R 2 ), Kendall correlation coefficient ( ), and offset root mean square error (RMSE ) computed by subtracting the mean signed error from the free energy predictions. Absolute and offset statistics for R 2 and are identical, and they are thus reported only once. Absolute statistics are identical to those presented before, but, consistently with the format adopted in the SAMPL5 host-guest binding challenge overview paper, the are reported as mean ± standard deviation of the bootstrap distribution (between parentheses) instead of the 95-percentile confidence interval. Mulliken charges were generated from DFT(B3LYP), which were subsequently used to determine AM1-BCC 460 charges. A different approach was adopted in DDM-FM-QMMM in which the platinum was substituted 461 by palladium, and the conformations necessary to the force matching parameterization procedure were 462 obtained by MNDO(d) dynamics. 463 All groups participating to the bonus challenge submitted 1:1 complex predictions also for CB8-G11 and 464 CB8-G12, for which the initial experimental data suggested the possibility of 2:1 complexes (two guests 465 simultaneously bound to one host). This later turned out to be correct only for CB8-G12, and several 466 groups reported to have computationally tested the hypothesis for CB8-G11 with the correct outcome. 467 DDM-AMOEBA was used to estimate affinity of both the 1:1 and 2:1 complexes, but in the end the first 468 one was used in the submission as the two predicted binding free energies differed by only 0.1 kcal/mol. 469 Accordingly, we used the experimental measurement determined for the first binding event to compute the 470 statistics (CB8-G12a in Table 1). 471 Summary statistics incorporating bonus challenge compounds are reported in Table 4. Although the 472 RMSE generally improves in most cases, it should be noted that this effect varies greatly across the three 473 molecules, and this improvement is mainly due to CB8-G11, whose predictions are regularly much closer to 474 the experimental measurement than the estimates provided for the other two compounds. 475 Comparison to previous rounds of the SAMPL host-guest binding challenge 476 Since previous rounds of the host-guest binding challenge featured identical or similar hosts to those 477 tested in SAMPL6, it is possible to compare earlier results and observe the evolution of methodological 478 performance. 479 Accuracy improvements over SAMPL5 for OA/TEMOA were driven by fits to prior experimental 480 data 481 SAMPL5 featured a set of compounds binding to both OA and TEMOA, which will be referred in the following 482 as the OA/TEMOA-5 set to differentiate it from the combined OA/TEMOA set used in this round of the 483 challenge. In the top row of Figure 6-A, we show median and fitted distributions of the RMSE and R 2 statistics 484 taken from the SAMPL5 overview paper [127] together with the results from SAMPL6. OA was used as a 485 test system in SAMPL4 as well, but in this case, only relative free energy predictions were submitted so we 486 cannot draw a direct comparison. Prediction accuracy displays a slight improvement of the median RMSE 487 from the previous round from 3.00 [2.70, 3.60] kcal/mol to 2.76 [1.85, 3.28] kcal/mol (95-percentile bootstrap 488 confidence intervals of the medians not shown in Figure 6-A). However, this change seems to be entirely 489 driven by the methods employing experiment-based fit corrections since removing them results in a median 490 RMSE that is essentially identical to SAMPL5. The data raises the question of whether the field is hitting the 491 accuracy limit of current general force fields. 492 On the other hand, the median R 2 improved with respect to the last round from 0.0 [0.0,0.8] to 0.5 [0.4,0.8]. 493 Even in this case, we observe a slightly lower SAMPL6 median R 2 when ignoring methods incorporating 494 experimental data, but this is likely due not to the correction itself but to the fact that the top performing 495 methods were generally submitted with and without correction, thus reducing the number of data points 496 with high R 2 . Indeed, as already discussed, no positive effect on correlation was evident from the inclusion of 497 a trained linear correction. The improvement is particularly evident when considering only free energy-based 498 methodologies (e.g., alchemical and potential of mean force calculations). It should be pointed out out that 499 the higher median R 2 observed in SAMPL6 can, in principle, be explained not only by recent methodological 500 advancements and the composition of the methods entering the challenge, but also by the particular set of 501 assayed guests. While the first explanation is obviously the most desirable, the latter is a confounding factor 502 OA / TEMOA  when attempting to associate the results of the challenge to the progress of the community. 503 Since SOMD calculations entered the SAMPL5 challenge as well [19], we can compare directly the same 504 statistics obtained by the method on the two guest sets to form an idea about the relative complexity of the 505 two sets for free energy methods. To this end, we report in Table 5  (TI) [60,112] or HREX and BAR that employed CGenFF and TIP3P, which performed relatively well in SAMPL6 529 on OA/TEMOA. It should be noted that the TI and HREX/BAR methodologies in SAMPL5 made use of a 530 Boresch-style restraint [18] harmonically constraining one distance, two angles, and three dihedrals. This is 531 similar to the solution adopted in DDM-GAFF in SAMPL6, which also showed a relatively low R 2 compared to 532 the other free energy submissions in the same round of the challenge so it is natural to suspect that it may 533 be particularly challenging to treat this class of host-guest systems with this type of restraint in alchemical 534 calculations. 535 An improvement can also be observed for the movable type method, which was applied to the OA/TEMOA- correct for dispersion interactions, which was developed earlier than DFT-D3. As already mentioned, SAMPL4 553 featured a set of 9 OA guests [86], but only relative free energy predictions were submitted so absolute 554 statistics are not available. Thus, in order to facilitate the comparison, we decided to report offset statistics 555 for the subset of the SAMPL6 methods analyzed in this section in the same way they were computed in 556 the previous two rounds of the challenge. The results are given in CB analogue called CBClip [128]. The 3D structures of the last two hosts are shown in Figure 6-B, while in 568 Figure 6-A (bottom row), we show the distribution of RMSE and R 2 computed from the binding free energy 569 predictions submitted for SAMPL3 and SAMPL5 against these four hosts. 570 In general, both statistics appear to have deteriorated from SAMPL3 to SAMPL5. Even though H1 571 and CBClip are sufficiently different for system-dependent effects to reasonably dominate the overall 572 performance, the most marked difference appears from the comparison of the SAMPL6 predictions to 573 those submitted for CB7 and CB8 in SAMPL3, which achieved a much greater R 2 in spite of the smaller 574 dynamic range of the binding affinity measurements and none of which involved simulation-based methods. 575 The explanation for this inequality is likely to be found in the complexity of the guest sets rather than a 576 methodological regression as SAMPL3 featured only two relatively simple fragment-like binders while the 577 latest round of the challenge included compounds of moderate size and/or complex stereochemistry (e.g., 578 gallamine triethiodate, quinine). 579 That the CB8 guests in SAMPL6 were particularly challenging is corroborated by the comparison between 580 the performance of DDM-AMOEBA and the results obtained by BAR-560, which also uses the double 581 decoupling method and the AMOEBA polarizable force field, on the CB7 guests in SAMPL4 [14]. Similarly to the OA/TEMOA guest set, simulation-based free energy methods display a higher median 589 R 2 than the global R 2 computed from considering all the methods in the challenge, albeit a slightly higher 590 RMSE as well. The pattern is consistent across the three rounds of the challenge, but the distributions of the 591 statistics are too wide to draw statistically significant conclusions without collecting more data.

593
As in previous years, the SAMPL host-guest binding challenge has provided an opportunity for the computa-594 tional chemistry community to focus on a common set of systems to assess the state-of-the-art practices 595 and performance of current binding free energy calculation methodologies. The value of the blind challenge 596 does not lie exclusively in the comparison and benchmarking of different methods, but also in its ability 597 to highlight general areas of weakness in the field as a whole on which the community can focus. The 598 latter aspect, in particular, risks to become of secondary importance in retrospective studies. Moreover, the 599 consistent use of octa-acid and cucurbiturils since SAMPL3, which took place in 2011, give us the opportunity 600 to make general observations over a longer time span.  comparison to other elements (e.g., charges, force field parameters, water model). 639 Even at extreme pH, protonation state effects may still contribute 640 The possible influence of multiple accessible protonation states of the guest compounds on the binding 641 free energy was left unexplored during the challenge, mirroring the widespread tendency in the free energy 642 literature to neglect its effect, and participants largely used the most likely protonation states predicted by 643 Epik that were provided in the input mol2 and sdf files. However, the p free energy penalties estimated by 644 Epik for the second most probable protonation state of the CB8 guests in water at experimental pH ( Table 6), 645 which is obtained in all cases by the deprotonation of the charged nitrogen atoms as given in Figure 1, suggest 646 that for several guests, and in particular for CB8-G3 and CB8-G11, the deprotonated state is accessible by 647 paying a cost of a few (where is Boltzmann's constant and is the absolute temperature), and a 648 change in relative populations between the end states driven by the hydrophobic binding cavity may have 649 a non-negligible effect on the binding affinity. Furthermore, even if the probability of having the carboxyl 650 Table 6. p free energy penalties predicted by Epik for the second most likely protonation state of the CB8 guests.
In all cases, the second most probable protonation state predicted by Epik can be obtained by removing the nitrogen proton of the dominant state. The estimated free energy penalties to access the deprotonated state are reported in kcal/mol and units of , where is the Boltzmann's constant and is the absolute temperature, taken to be (i.e., the temperature at experimental conditions). For all the other compounds, including the octa-acid guests, Epik was not able to find a second protonation state within a tolerance of 3 pH units. included several carboxylic acids and was measured at a similar buffer pH [118]. Experimentally, net proton 654 gain or loss during complexation could straightforwardly be assessed for highly soluble host-guest systems 655 via isothermal titration calorimetry (ITC) in buffers with the same pH but different ionization heats for proton 656 loss from solvent [7], a technique that has been used for protein-ligand systems [24,25,91,111]. Similarly 657 to buffer salts, there are few established practices in the community to treat multiple protonation states in 658 free energy calculations [66], and further development and testing of force fields and solvent models with 659 the goal of improving accuracy to experiments should consider these issues as ignoring them during the 660 fitting procedure could push the error caused by missing essential chemicals (e.g., ions, protonation and 661 tautomeric states) to other force field parameters with the risk of decreasing the transferability of the model. 662 Linear corrections fit to prior experimental measurements do not improve predictive utility 663 The experimental-based correction adopted by several groups introduces a new theme in the challenge 664 which pertains to strategies that can be used to inject previous knowledge into molecular simulations. Force 665 field parameters are in principle capable of incorporating experimental data, but an update of the model 666 driven by binding free energy measurements or other ensemble observables is doubtlessly challenging and 667 may involve calculations as expensive as the production calculations so this is normally not routinely viable, 668 although previous studies indicated the validity and feasibility of such an approach [125,126]. Other schemes 669 that emerged in particular from the field of crystallographic structural refinement avoid modifying the force 670 field parameters and instead add one or more biasing terms to the simulation to replicate experimental 671 measurements that the underlying force field cannot reproduce [16,123]. The simple linear corrections used 672 independently by various participants in this round of the challenge had a positive impact on the error, but a 673 very small effect in terms of correlation, which is often of central importance in the context of molecular 674 design. However, the simplicity of its application, which is confined entirely to the post-processing step, was 675 such that the participants were able to submit multiple entries with and without the correction. 676 Outlook for future SAMPL host-guest challenges 677 The SAMPL roadmap [77] outlines a proposal for subsequent host-guest challenges for SAMPL7-10. While 678 the future of these blind exercises is uncertain given the absence of a sustainable funding source, we briefly 679 review the likely future design of these host-guest challenges below. Scatter plots showing the experimental measurements of the host-guest binding enthalpies (horizontal axis) against the methods' predictions on the OA (yellow), TEMOA (green), and CB8 (blue) guest sets with the respective regression lines of the same color. The solid black line is the regression line obtained by using all the data points. The gray shaded area represent the points within 1.5 kcal/mol from the diagonal (dashed black line).