Assessing the accuracy of octanol-water partition coefficient predictions in the SAMPL6 Part II log P Challenge

The SAMPL Challenges aim to focus the biomolecular and physical modeling community on issues that limit the accuracy of predictive modeling of protein-ligand binding for rational drug design. In the SAMPL5 log D Challenge, designed to benchmark the accuracy of methods for predicting drug-like small molecule transfer free energies from aqueous to nonpolar phases, participants found it difficult to make accurate predictions due to the complexity of protonation state issues. In the SAMPL6 log P Challenge, we asked participants to make blind predictions of the octanol-water partition coefficients of neutral species of 11 compounds and assessed how well these methods performed absent the complication of protonation state effects. This challenge builds on the SAMPL6 pKa Challenge, which asked participants to predict pKa values of a superset of the compounds considered in this log P challenge. Blind prediction sets of 91 prediction methods were collected from 27 research groups, spanning a variety of quantum mechanics (QM) or molecular mechanics (MM)-based physical methods, knowledge-based empirical methods, and mixed approaches. There was a 50% increase in the number of participating groups and a 20% increase in the number of submissions compared to the SAMPL5 log D Challenge. Overall, the accuracy of octanol-water log P predictions in SAMPL6 Challenge was higher than cyclohexane-water log D predictions in SAMPL5, likely because modeling only the neutral species was necessary for log P and several categories of method benefited from the vast amounts of experimental octanol-water log P data. There were many highly accurate methods: 10 diverse methods achieved RMSE less than 0.5 log P units. These included QM-based methods, empirical methods, and mixed methods with physical modeling supported with empirical corrections. A comparison of physical modeling methods showed that QM-based methods outperformed MM-based methods. The average RMSE of the most accurate five MM-based, QM-based, empirical, and mixed approach methods based on RMSE were 0.92±0.13, 0.48±0.06, 0.47±0.05, and 0.50±0.06, respectively.

species of 11 compounds and assessed how well these methods performed absent the complication of protonation state 23 effects. This challenge builds on the SAMPL6 pK a Challenge, which asked participants to predict pK a values of a superset of the 24 compounds considered in this log P challenge. Blind prediction sets of 91 prediction methods were collected from 27 research 25 groups, spanning a variety of quantum mechanics (QM) or molecular mechanics (MM)-based physical methods, knowledge-based 26 empirical methods, and mixed approaches. There was a 50% increase in the number of participating groups and a 20% increase 27 in the number of submissions compared to the SAMPL5 log D Challenge. Overall, the accuracy of octanol-water log P predictions 28 in SAMPL6 Challenge was higher than cyclohexane-water log D predictions in SAMPL5, likely because modeling only the neutral 29 species was necessary for log P and several categories of method benefited from the vast amounts of experimental octanol-water 30 log P data. There were many highly accurate methods: 10 diverse methods achieved RMSE less than 0.5 log P units. These included Instead, SAMPL seeks to isolate and focus attention on individual accuracy-limiting issues. We aim to field blind challenges 83 just at the limit of tractability in order to identify underlying sources of error and help overcome these challenges. Working on 84 similar model systems or the same target with new blinded datasets in multiple iterations of prediction challenges maximize 85 our ability to learn from successes and failures. Often, these challenges focus on physical properties of high relevance to drug 86 discovery in their own right, such as partition or distribution coefficients critical to the development of potent, selective, and 87 bioavailable compounds. 88 + + + cyclohexane water water octanol water SAMPL6 pK a Challenge SAMPL6 logP Challenge SAMPL5 logD Challenge Figure 1. The desire to deconvolute the distinct sources of error contributing to the large errors observed in the SAMPL5 log D challenge motivated the separation of pK a and log P challenges in SAMPL6. The SAMPL6 pK a and log P challenges aim to evaluate protonation state predictions of small molecules in water and transfer free energy predictions between two solvents, isolating these prediction problems.
The partition coefficient (log P) and the distribution coefficient (log D) are driven by the free energy of transfer from an 89 aqueous to a nonpolar phase. Transfer free energy of only neutral species are considered for log P, whereas both neutral and 90 ionizable species contribute to log D. Such solute partitioning models are a simple proxy for the transfer free energy of a drug-like 91 molecule to a relatively hydrophobic receptor binding pocket, in the absence of specific interactions. Protein-ligand binding 92 equilibrium is analogous to partitioning of a small molecule between two environments: protein binding site and aqueous phase.
the reference interaction site model (RISM) integral equation approach, discussed further below. 185 The COSMO-RS solvation model is another method utilized in this context which covers a particularly broad range of solvents, 186 typically quite well [14][15][16][17][18]. In the present challenge, a "Cosmoquick" variant was also applied and falls into the "Mixed" method 187 category, as it utilizes additional empirical adjustments. The COSMO-RS implementation of COSMOtherm takes into account 188 conformational effects to some extent; the chemical potential in each phase is computed using the Boltzmann weights of a fixed 189 set of conformers. 190 In general, while choice of solvation model can be a major factor impacting QM approaches, the neglect of conformational 191 changes means these approaches typically (though not always) neglect any possibility of significant change of conformation on 192 transfer between phases and they simply estimate solvation by the difference in (estimated) solvation free energies for each 193 phase of a fixed conformation. Additionally, solute entropy is often neglected, assuming the single-conformation solvation free 194 energy plays the primary role in driving partitioning between phases. In addition to directly estimating solvation, QM approaches 195 can also be used to drive the selection of the gas-or solution-phase tautomer, and thus can be used to drive the choice of inputs 196 for MM approaches discussed further below.

197
Integral equation-based approaches 198 Integral equation approaches provide an alternate approach to solvation modeling (for both water and non-water solvents) 199 and have been applied in SAMPL challenges within both the MM and QM frameworks [19][20][21]. In this particular challenge, 200 however, the employed approaches were entirely QM, and utilized the reference interaction site model (RISM) approach [22][23][24]. 201 Additionally, as noted above, the IEF-PCM model used by the SMD solvation model (discussed above) is also an integral equation 202 approach. Practical implementation details mean that RISM approaches typically have one to a few adjustable parameters (e.g. 203 four [25]) which are empirically tuned to experimental solvation free energies, in contrast to the SMD and SM-n series of solvation 204 models which tend to have a larger number of adjustable parameters and thus require larger training sets. In this particular approaches do capture conformational changes on phase transfer, as long as such changes occur on timescales faster than the 223 simulation timescale. 224 1.2.2 Empirical log P predictions 225 Due to the importance of accurate log P predictions, ranging from pharmaceutical sciences to environmental hazard assessment, a 226 large number of empirical models to predict this property have been developed and reviewed [29][30][31]]. An important characteristic 227 of many of these methods is that they are very fast, so even large virtual libraries of molecules can be characterized. 228 In general, two main methodologies can be distinguished: group-or atom-contribution approaches, also called additive group 229 methods, and quantitative structure-property relationship (QSPR) methods. 230 1.2.2.1 Atom-and group-contribution approaches 231 Atom contribution methods, pioneered by Crippen in the late 1980s [32,33], are the easiest to understand conceptually. These 232 assume that each atom contributes a specific amount to the solvation free energy and that these contributions to log P are 233 additive. Using a potentially large number of different atom types (typically in the order of 50-100), the log P is the sum of the 234 individual atom types times the number of their occurrences in the molecule: A number of log P calculation programs are based on this philosophy, including AlogP [34], AlogP98 [34], and moe_SlogP [35]. 236 The assumption of independent atomic contributions fails for compounds with complex aromatic systems or stronger 237 electronic effects. Thus correction factors and contributions from neighboring atoms were introduced to account for these 238 shortcomings (e.g. in XlogP [36][37][38] and SlogP [35]. 239 In contrast, in group contribution approaches, log P is calculated as a sum of group contributions, usually augmented by 240 correction terms that take into account intramolecular interactions. Thus, the basic equation is where the first term describes the contribution of the fragments (each occurring times), the second term gives the clogP [39][40][41][42], KlogP [43], ACD/logP [44] and KowWIN [45]. 246 clogP is probably one of the most widely used log P calculation programs [39][40][41]. clogP relies on fragment values derived 247 from measured data of simple molecules, e.g., carbon and hydrogen fragment constants were derived from measured values 248 for hydrogen, methane, and ethane. For more complex hydrocarbons, correction factors were defined to minimize the differ-249 ence to the experimental values. These can be structural correction factors taking into account bond order, bond topology 250 (ring/chain/branched) or interaction factors taking into account topological proximity of certain functional groups, electronic 251 effects through -bonds, or special ortho-effects. 252 1.2.2.2 QSPR approaches 253 Quantitative structure-property relationships (QSPR) provide an entirely different category of approaches. In QSPR methods, 254 a property of a compound is calculated from molecular features that are encoded by so-called molecular descriptors. Often, 255 these theoretical molecular descriptors are classified as 0D-descriptors (constitutional descriptors, only based on the structural 256 formula), 1D-descriptors (i.e. list of structural fragments, fingerprints), 2D-descriptors (based on the connection table, topological   257 descriptors), and 3D-descriptors (based on the three-dimensional structure of the compound, thus conformation-dependent). 258 Sometimes, this classification is extended to 4D-descriptors, which are derived from molecular interaction fields (e.g., GRID, 259 CoMFA fields). 260 Over the years, a large number of descriptors have been suggested, with varying degrees of interpretability. Following 261 the selection of descriptors, a regression model that relates the descriptors to the molecular property is derived by fitting the 262 individual contributions of the descriptors to a dataset of experimental data; both linear and nonlinear fitting is possible. Various 263 machine learning approaches such as random forest models, artificial neural network models, etc. also belong to this category. 264 Consequently, a large number of estimators of this type have been proposed; some of the more well-known ones include  271 Our expectation was that empirical knowledge-based and other trained methods (implicit solvent QM, mixed methods) 272 would outperform other methods in the present challenge as they are impacted directly by the availability of octanol-water data. 273 Methods well trained to experimental octanol-water partitioning data should typically result in higher accuracy, if fitting is done 274 well. The abundance of octanol-water data may also provide empirical and mixed approaches with an advantage over physical 275 modeling methods. Current molecular mechanics-based methods and other methods not trained to experimental log P data 276 ought to do worse in this challenge. Performance of strictly physical modeling based prediction methods might generalize better 277 across other solvent types where training data is scarce, but that will not be tested by this challenge. In principle, molecular 278 mechanics-based methods could also be fitted using octanol-water data as one of the targets for force field optimization, but  will agree. Molecule IDs assigned in SAMPL6 pK a Challenge were conserved in the challenge for the ease of reference.
Research groups were allowed to participate with multiple submissions, which allowed them to submit prediction sets to 326 compare multiple methods or to investigate the effect of varying parameters of a single method. All blind submissions were 327 assigned a 5-digit alphanumeric submission ID, which will be used throughout this paper and also in the evaluation papers of 328 participants. These abbreviations are defined in Table 3. 329

330
A variety of error metrics were considered when analyzing predictions submitted to the SAMPL6 log P Challenge. Summary 331 statistics were calculated for each submission for method comparison, as well as error metrics of predictions of each method. 332 Both summary statistics and individual error analysis of predictions were provided to participants before the virtual workshop. 333 Details of the analysis and scripts are maintained on the SAMPL6 Github Repository (described in section 6). 334 There are six error metrics reported: the root-mean-squared error (RMSE), mean absolute error (MAE), mean (signed) error 335 (ME), coefficient of determination (R 2 ), linear regression slope (m), and Kendall's Rank Correlation Coefficient ( ). In addition 336 to calculating these performance metrics, 95% confidence intervals were computed for these values using a bootstrapping-337 over-molecules procedure (with 10000 bootstrap samples) as described elsewhere in a previous SAMPL overview article [60]. 338 Due to the small dynamic range of experimental log P values of the SAMPL6 set, it is more appropriate to use accuracy based 339 performance metrics, such as RMSE and MAE, to evaluate methods than correlation-based statistics. This observation is also 340 typically reflected in the confidence intervals on these metrics. Calculated errors statistics of all methods can be found in Tables S4   341   and S5.   342 Submissions were originally assigned to four method categories (physical, empirical, mixed, and other) by participants. 343 However, when we evaluated the set of participating methods it became clear that it was going to be more informative to group 344 them using the following categories: physical (MM), physical (QM), empirical, and mixed. Methods from the "other" group 345 were reassigned to empirical or physical (QM) categories as appropriate. Methods submitted as Physical by participants included 346 quantum mechanical (QM), molecular mechanics-based (MM) and, to a lesser extent, integral equation-based approaches (EC-347 RISM). We subdivided these submissions into "physical (MM)" and "physical (QM)" categories. Integral equation-based approaches 348 were also evaluated under the Physical (QM) category. The "mixed" category includes methods that physical and empirical 349 approaches are used in combination. Table 3 indicates the final category assignments in the "Category" column. 350 We created a shortlist of consistently well-performing methods that were ranked in the top 20 consistently according to two 351 error and two correlation metrics: RMSE, MAE, R-Squared, and Kendall's Tau. These are shown in Table 4. 352 We included null and reference method prediction sets in the analysis to provide perspective for performance evaluations 353 of blind predictions. Null models or null predictions employ a model which is not expected to be useful and can provide a 354 simple point of comparison for more sophisticated methods, as ideally, such methods should improve on predictions from a null 355 model. We created a null prediction set (submission ID NULL0) by predicting a constant log P value for every compound, based 356 on a plausible log P value for drug-like compounds. We also provide reference calculations using several physical (alchemical) 357 and empirical approaches as a point of comparison.    (Table 3). We also provided additional reference 419 calculations -five in the empirical category, and eight in the physical (MM) category. 420 The following sections present detailed performance evaluation of blind submissions and reference prediction methods. 421 Performance statistics of all the methods can be found in S4. Methods are referred to by their submission ID's which are provided 422 in 3. Many methods in the SAMPL6 Challenge achieved good predictive accuracy for octanol-water log P values. Figure 3 shows the 425 performance comparison of methods based on accuracy with RMSE and MAE. 10 methods achieved an RMSE ≤ 0.5 log P units. 426 These methods were QM-based, empirical, and mixed approaches (     Table 3. Submission IDs of the form REF## refer to non-blinded reference methods computed after the blind challenge submission deadline, and NULL0 is the null prediction method; all others refer to blind, prospective predictions.  rdsnw  2tzb0  mm0jf  6fyg5  fyx45  hdpuj  dqxk4  REF11  eufcy  v2q0t  zdj0j  hmz0n  vzgyt  ypmr0  REF13  5krdi  6nmtt  cp8kv  gmoq5  7dhtp  wu52s  pku5g  3vqbi  sq07q  ahmtf  dyxbt  o7djk  po4g2  REF12  3wvyh  ke5gu  REF07  6cdyo  jjd0b  ggm6n  yd6ub  REF02  bzeez  kuddg  gnxuu  ji2zm  xxh4i  w6jta  d7vth  tc4xa  nh6c0  2ggir  REF01  REF09  mwuua  qz8d5  sqosi  25s67  g6dwz  kivfu  7egyc  0a7a8  REF05  5svjv  pcv32  REF08  4nfzz  REF04  y0xxd  REF03  dbmg3  623c0  ynquk  hf4wj  REF10  ujsgv  kxsp3  cr3hs  hwf2k  REF06  5mahv  2mi5w  pnc4j  j4nb3  7gg6s  f3dpg  fmf7r  03cyy  bq6fo  bqeuh  3oqhx  odex0  padym  eg52i  h83sb  5t0yn  fcspk  fe8ws  c7t5j  jc68f  5585v  arw58  tzzb5  hsotx  rs4ns  4p2ph   Pearson's R 2 and Kendall's Rank Correlation Coefficient Tau ( ) are shown, with error bars denoting 95% confidence intervals obtained by bootstrapping over challenge molecules. Submission IDs are summarized in Table 3. Submission IDs of the form REF## refer to non-blinded reference methods computed after the blind challenge submission deadline, and NULL0 is the null prediction method; all others refer to blind, prospective predictions. Overall, a large number and wide variety of methods have a statistically indistinguishable performance on ranking, in part because of the relatively small dynamic range of this set and because of the small size of the set. Roughly the top half of methods with Kendall's Tau > 0.5 fall into this category. 4.1.2 Results from physical methods 440 One of the aims of the SAMPL6 log P Challenge was to assess the accuracy of physical approaches in order to potentially 441 provide direction for improvements which could later impact accuracy in downstream applications like protein-ligand binding. 442 Some MM-based methods used for log P predictions use the same technology applied to protein-ligand binding predictions, so 443 improvements made to modeling of partition may in principle carry over. However, prediction of partition between two solvent 444 phases is a far simpler test only capturing some aspects of affinity prediction -specifically, small molecule and solvation modeling 445 -in the absence of protein-ligand interactions and protonation state prediction problems. , and the TIP3P water model [69]. From the brief method descriptions submitted to 458 the challenge we could not identify the difference between these prediction sets. 459 RMSE values for predictions made with MM-based methods ranged from 0.74 to 4.00 log P units, with the average of the better 460 half being 1.44 log P units. 461 Submissions included diverse molecular simulation-based log P predictions made using alchemical approaches. Predictions that used polarizable force fields did not show an advantage over fixed-charged force fields in this challenge. 472 RMSEs for polarizable force field submissions range from 1.85 to 2.86 (submissions with the Drude Force Field were fyx45, pnc4j, 473 and those with the ARROW Force Field were odex0, padym fcspk, and 6cm6a). 474 Predictions using both dry and wet octanol phases were submitted to the log P challenge. When submissions from the same 475 participants were compared, we find that including water in the octanol phase only slightly lowers the RMSE (0.05-0.10 log P   j8nwc  dqxk4  ypmr0  qyzjx  6cdyo  nh6c0  kivfu  ujsgv  2mi5w  qz8d5  y0xxd  2ggir  dyxbt  mm0jf  3wvyh  25s67  zdj0j  pcv32  v2q0t  rdsnw  ggm6n  jjd0b  2tzb0  arw58  ahmtf  o7djk  4p2ph  6fyg5  sqosi  rs4ns  c7t5j  jc68f  hsotx  ke5gu  mwuua  fe8ws  5t0yn  fyx45  6nmtt  eufcy  tzzb5  3oqhx  bzeez  5svjv  odex0  padym  pnc4j  REF02  REF05  REF08  REF07  fcspk  6cm6a  623c0  4nfzz  eg52i  cp8kv  Kendall's Tau  Table 3. Submission IDs of the form REF## refer to non-blinded reference methods computed after the blind challenge submission deadline; all others refer to blind, prospective predictions. Although there was not any single method that performed significantly better than others in the log P challenge, we identified a 488 group of consistently well performing methods. There were many methods with good performance when judged based on RMSE, 489 but not many methods consistently showed up at the top according to all metrics. When individual error metrics are considered, 490 many submissions were not different from one another in a statistically significant way, and ranking typically depends on the 491 metric chosen due overlapping confidence intervals. Instead, we identified several consistently well performing methods by 492 looking at several different metrics -two assessing accuracy (RMSE and MAE) and two assessing correlation (Kendall's Tau and R 2 ). 493 We determined those methods which are in the top 20 by each of these metrics. This resulted in a list of eight methods which are 494 consistently well performing. The shortlist of consistently well-performing methods are presented in Table 4. 495 The resulting eight consistently well-performing methods were QM-based physical models and empirical methods. These  In addition to comparing method performance, we analyzed the prediction errors for each compound in the challenge set to 508 assess whether particular compounds or chemistries are especially challenging (Figure 7). For this analysis, MAE is a more 509 appropriate statistical value for following global trends, as its value is less affected by outliers than is RMSE. 510 Performance on individual molecules shows relatively uniform MAE across the challenge set ( Figure 7A). Predictions of SM14 511 and SM16 were slightly more accurate than the rest of the molecules when averaged across all methods. Prediction accuracy 512 on each molecule, however, is highly variable depending on method category ( Figure 7B). Predictions of SM08, SM13, SM09, 513 and SM12 were significantly less accurate with physical (MM) methods than the other method categories by 2 log P units in individual models. However, SM15 is more significantly shifted away from zero than any other compound (ME calculated accross 520 all molecules is -0.88±1.49 for SM15). SM08 had the most spread in log P prediction error. 521 This challenge focused on log P of neutral species, rather than log D as studied in SAMPL5, which meant that we do not see 522 the same trends where performance is significantly worse for compounds with multiple protonation states/tautomers or where 523 pK a values are uncertain. However, in principle, tautomerization can still influence log P values. Multiple neutral tautomers can be 524 present at significant populations in either solvent phase, or the major tautomer can be different in each solvent phase. However, 525 this was not expected to be the case for any of the 11 compounds in this SAMPL6 Challenge. We do not have experimental 526 data on the identity or ratio of tautomers, but tautomers other than those depicted in Figure 2 would be much higher in energy 527 according to QM predictions [22] and, thus, very unlikely to play a significant role. Still, for most log P prediction methods, it was at 528 least necessary for participants to select the major neutral tautomer. We do not observe statistically worse error for compounds 529 with potential tautomer uncertainties here, suggesting it was not a major factor in overall accuracy, some participants did chose 530 to run calculations on tautomers that were not provided in the challenge input files ( Figure 11 and Table 5), as we discuss in 531 Section 4.2.  general, all the water models tend to overestimate the log P, especially for the carboxylic acid in the challenge set, SM08, though 558 our calculations on this molecule had some specific difficulties we discuss further below. Relative to the TIP3P-FB and OPC 559 water models, predictions which used TIP3P showed improvement in some of the error metrics, such as lower deviation from 560 experiment with an RMSE range of 2  While water model and force field may have significantly impacted differences in performance across methods in some cases 569 in this challenge, we have very few cases -aside from these reference calculations -where submitted protocols differed only by 570 force field or water model, making it difficult to know the origin of performance differences for certain. Some discrepancies are observed for molecules SM13 and SM07, but are largest for SM08. For SM13 and SM07, method v2q0t 584 (InterX_GAFF_ WET_OCTANOL) performs over 1 log P unit better than 6nmtt (MD-AMBER-wetoct). The rest of the predictions for 585 these two methods differ by no more than about 1 log P unit, with the majority of the molecules differing by about 0.5 log P 586 units or less from each other. Comparing 6nmtt (MD-AMBER-wetoct) vs REF02 (YANK-GAFF-TIP3P-wet-oct) ( Figure 8A), there is a 587 substantial difference in the predicted values for molecules SM08 (4.6 log unit difference), SM13 (1.4 log unit difference), and 588 SM07 (1.2 log unit difference). Method v2q0t (InterX_GAFF_ WET_OCTANOL) and 6nmtt (MD-AMBER-wetoct) perform about 5 589 log P units better than REF02 (YANK-GAFF-TIP3P-wet-oct) for molecule SM08. Besides SM08, predictions from v2q0t (InterX_GAFF_ 590 WET_OCTANOL) and REF02 (YANK-GAFF-TIP3P-wet-oct) differ by 0.5 log P units or less from each other. In dry octanol, REF07 591 (YANK-GAFF-TIP3P-dry-oct) performs about 4 log P units worse than sqosi (MD-AMBER-dryoct) for SM08 ( Figure 8B).  In several of these approaches, users selected their own starting conformation, protonation state and tautomer, rather than 598 those provided in the SAMPL6 challenge, so the differences here could possibly be attributed to differences in tautomer or 599 resonance structures. Submissions 6nmtt (MD-AMBER-wetoct) and sqosi (MD-AMBER-dryoct) used different tautomers for SM08 600 and different resonance structures for SM11 and SM14 (microstates SM08_micro010, SM11_micro005, SM14_micro001 from the 601 previous SAMPL6 pK a Challenge). We will discuss possible differences due to tautomer choice below in Section 4.   Figure 9A shows calculations from our two different reference protocols using the DFE and IFE methods. We find that the 629 two protocols yield similar results, with the exception of two molecules. Molecule SM08 is not substantially overestimated using 630 the IFE protocol, where it is with the DFE protocol, and SM02 is largely overestimated by IFE, but not DFE ( Figure 9A). The DFE 631 (REF07)and IFE protocol both tend to overestimate the molecules' preference for octanol over water than in experiment, with the 632 DFE protocol overestimating it slightly more. Figure 9D shows comparison of predicted log P values of the same tautomers by  Figure 9. Comparison of predictions that use free energy calculations using GAFF and TIP3P water show deviations between predictions for the challenge molecules and several alternative tautomers and resonance structures. Deviations seem to largely stem from differences in equilibration amount and choice of tautomer. A compares reference direct transfer free energy (DFE, REF07) and indirect solvation-based transfer free energy (IFE) protocols to experiment for the challenge provided resonance states of molecules and a couple of extra resonance states for SM14 and SM11, and extra tautomers for SM08. B compares the same exact tautomers for submission sqosi (MD-AMBER-dryoct) and the two reference protocols to experiment. Submission sqosi (MD-AMBER-dryoct) used different tautomers than the ones provided in the challenge. C-E compares the calculated log P between different methods using the same tautomers. All of the predicted values can be found in Table 5. the DFE (REF07) and IFE protocols. The DFE and IFE protocols are almost within statistical error of one another, with the largest 634 discrepancies coming from SM02 and SM08. The DFE and IFE protocols are in better agreement for some tautomers of SM08 635 more than others. They agree better on the predicted values for SM08_micro008 and SM08_micro010 than for SM08_micro011. 636 In the SAMPL6 blind submissions, there was a third putatively equivalent method to our reference predictions with the DFE 637 protocol (REF07) and IFE protocol: sqosi (MD-AMBER-dryoct). It is identical in chosen force field, water model, and composition 638 of octanol phase, however, different tautomers and resonance states for some molecules were used. All three predictions 639 used free energy calculations with GAFF, TIP3P water, and a dry octanol phase. Additionally, sqosi (MD-AMBER-dryoct) also 640 used the more traditional indirect solvation free energy protocol. We chose to investigate the differences in these equivalent 641 approaches approaches by comparing predictions using matching tautomers and resonance structures (Figure 9). Figure 9B   642 shows comparison of these three methods using predictions made with DFE and IFE protocols using identical tautomer and 643 resonance input states as sqosi (MD-AMBER-dryoct): SM08_micro010, SM11_micro005, and SM14_micro001 (structures can be 644 found in Figure 11. Except SM02, there is general agreement between these predictions. Figure 9C, other than the SM08_micro010 645 tautomer, predictions of DFE (REF07) and sqosi (MD-AMBER-dryoct) largely agree. Figure 9E highlights SM02 and SM08_micro010 646 predictions as the major differences between our predictions with IFE protocol and sqosi (MD-AMBER-dryoct). 647 Only results from the DFE protocol were assigned submission numbers (of the form REF##) and presented in the overall 648 method analysis in Section 4.1. More details of the solvation and transfer free energy protocol can be found in section 12.1.    (Table 5). Our exploration of these issues was prompted by the fact that the MD-AMBER protocols had utilized 687 different tautomers than those initially employed in our physical reference calculations. 688 We also find that the choice of resonance structure affects calculated values, though less strongly so than the choice of 689 tautomer. Within the IFE method we find 1.3 log units of variation between log P values calculated with different resonance 690 structures of SM11 (SM11 and SM11_micro005) and 0.6 log units of variation between resonance structures of SM14 (SM14 and 691 SM14_micro001) (Table 5). 692 We also find that the partial charge assignment procedure can also dramatically impact log P values for carboxylic acids 693 ( SM08_micro010 (SM08_micro011) Figure 11. The tautomer and resonance structure choice resulted in discrepancies in the reference calculations. Shown here are calculated values for different input structures using the reference direct transfer free energy method. The uncertainties of the log P predictions were calculated as the standard error of the mean (SEM) of three replicate predictions. Structures labelled as SM08, SM11, and SM14 are based on input SMILES provided in SAMPL6 log P Challenge instructions. Three microstates shown for SM08 are different tautomers. SM08 (SM08_micro011) and SM08_micro010 are carboxylic acids, while SM08_micro008 is a carboxylate ion. SM08 (SM08_micro011) has a carbonyl group in the ring, while SM08_micro008 and SM08_micro010 have a hydroxyl in the ring. Structures pertaining to SM11 and SM14 are different resonance hybrids of the same tautomer (neutral microstate). Enumeration of all theoretically possible neutral tautomers of SAMPL6 molecules can be found in the SAMPL6 GitHub Repository (https://github.com/samplchallenges/SAMPL6/tree/master/physical_properties/pKa/microstates). 0.5 (gmoq5, Global XGBoost-Based QSPR LogP Predictor). In all these cases, using a relatively large training set (>1000-10000 712 compounds) seems to be key. 713 The exact choice of method or descriptors seems to be less critical. Predictions based on atom or group contributions perform 714 as well as those using either a small set of EHT-derived descriptors or a large set of diverse descriptors, sometimes additionally 715 including fingerprint descriptors. A possible explanation could be that log P is, to first order, primarily an additive property so that 716 empirical methods can do well since a wealth of octanol-water data is available for training. This is also reflected in the success 717 of the simple methods summing up atom contributions. This approach may become problematic, however, when a functional 718 group is present that was underrepresented or missing in the training set. In such cases, higher are expected. 719 As is true for the physical methods, empirical methods depend on the tautomeric state of the compound. Here we have 720 observed that clogP is particularly sensitive. clogP shifts of more than one log unit upon change of the tautomer are not 721 uncommon. h_logP is much less sensitive to tautomers with shifts usually below 0.5 log P units. This is also true for molecule 722 SM08, for which different tautomeric forms are possible (as seen in Figure 11). For the pyridone form of SM08 (SM08_micro011), 723 clogP predicts a log P of 2.17, whereas the hydroxy-pyridine form (SM08_micro010) yields a log P of 3.63. For h_logP, the respective 724 values are 3.09 and 3.06. 725 Despite the small training sets of the MOE models, good prediction for kinase inhibitor fragments and the extra compounds 726 was achieved. This is possibly because the training set for this model was biased towards drug-like compounds, with substantial 727 similarity to the SAMPL6 Challenge set. 728 Other studies have found that some empirical methods tend to overestimate log P when molecular weight increases [104, 105]. 729 In this challenge, this was less of an issue as molecular size remained relatively constant. 730 According to in-house experience at Boehringer-Ingelheim, different experimental log P measurement methods produce 731 values that are correlated with one another with an R 2 value of around 0.7 (T. Fox, P. Sieger, unpublished results), indicating that 732 experimental methods themselves can disagree with one another significantly. This is especially true when it comes to more 733 approximate methods of estimating log P experimentally, such as HPLC-based methods [42,106]. A dataset composed of 400 734 compounds from Boehringer-Ingelheim measured both with GLpKa and HPLC assays covering a range from 0-7 log P units had R 2 735 of 0.56, though in some cases these methods may have higher correlations with potentiometric approaches [107]. Thus, if an 736 empirical model is trained on log P data from one particular method, testing it on data collected via another method may not 737 yield performance as high as expected. 738 Here, all of the analyzed empirical reference methods achieved absolute error <2.0, and often <1.5 calculated for each 739 molecule in the SAMPL6 log P Challenge set. This is a sign of more consistent accuracy of the predictions across different 740 molecules compared to physical methods. However, it is difficult to draw general conclusions given the small size of the data set, 741 and many hypotheses being based on only one example.  To broaden the analysis with a larger set with more chemical diversity and larger dynamic range of log P values, an extra set of 744 27 compounds were included in the analysis of reference calculations (Figure 12). These compounds had literature experimental 745 log P values collected using the same method as the SAMPL6 dataset. This set is composed of substituted phenols, substituted 746 quinolines, barbiturate derivatives and other pharmaceutically relevant compounds [108]. 747 This set of molecules is larger and more diverse than the SAMPL6 challenge set, spanning a range of 4.5 log units compared to  In the physical reference calculations, the mean absolute errors are below 1 log P unit for dry octanol conditions and below 757 1 log P units for wet octanol conditions ( consistently had a slightly higher error; there is no obvious correlation between method performance and size or complexity of 767 the compounds. Figure 13A shows that 3,5-dichlorophenol, 3,4-dichlorophenol, and (±)-propranolol were the most challenging 768 compounds for empirical reference methods. The MAE calculated for these three molecules as the average of five methods 769 (EXT09, EXT10, EXT11, EXT12, EXT13) was higher than 0.5 log P units (  The comparison of identical methods also showed that the tautomer and resonance state choice for some molecules resulted in 795 discrepancies in calculated log P values. In one case, conformation selected for a carboxylic acid before charging was important. 796 We have also noticed differences in in equilibration protocols, which could be particularly important for the octanol phase, though 797 the present challenge does not conclusively demonstrate that differences in equilibration made a significant difference. errors). Many methods did better than 2 log P units of error in this challenge, which is in agreement with the expectation that 820 partition coefficients present an easier model system compared to protein-ligand binding affinities. 821 Performance of empirical methods far surpassed these thresholds taking advantage of the available octanol-water experi-822 mental data, however, these empirical techniques are specifically oriented towards predicting partitioning and cannot be applied 823 to the binding problem. help isolate methodological problems, making challenges like log P and log D modeling particularly helpful. We believe the largest 828 benefits to the field will be achieved from iterated challenges, as seen from the progress achieved in predicting hydration free 829 energies over multiple SAMPL challenges [60]. 830 As MM-based physical methods struggled with octanol-water log P predictions in SAMPL6, we recommend additional SAMPL 831 iterations focused on log P with larger datasets and more chemical diversity to facilitate progress. The conclusions of SAMPL6 pK a 832 and log P Challenges indicate that, if this had been posed as a log D challenge rather than a log P challenge, larger pK a prediction 833 errors would have masked underlying issues in predicting equilibrium partitioning of neutral solutes between solvent phases. 834 The fact that performance for physical methods was still relatively poor illustrates the potential benefit of future log P challenges. Challenge suggested that molecules with many rotatable bonds still pose challenges for contemporary methods, suggesting this 838 is a criterion for difficulty. However, in later challenges we hope to gradually increase the difficulty of the compounds considered 839 to provide a more diverse set that includes more difficult compounds including varying numbers of rotatable bonds. partition coefficient challenges with a diverse set of solvent phases would be beneficial for improving solute partitioning models. 846 The statistical power of the SAMPL6 log P Challenge for comparative method evaluation was limited due to the narrow 847 experimental data set with only 2 log P units of dynamic range and 11 data points, both of which were driven by limitations of the 848 experimental methodology chosen for this challenge [9]. Future log P challenges would benefit from larger blind datasets with a 849 broader dynamic range. We recommend at least a log P range of 1-5. The potentiometric log P measurement method used for 850 the collection for SAMPL6 data was rather low throughput, requiring method optimization for each molecule. High-throughput 851 log D measurement methods performed at pHs that would ensure neutral states of the analytes may provide a way to collect 852 larger datasets of log P measurements. However, this approach poses some challenges. First, it is necessary to measure pK a 853 values of the molecules first. Second, partitioning measurements need to be done at a pH that guarantees that the compound 854 has neutral charge, in which case solubility will be lower than if it is charged and may become a limitation for the experiment. 855 SAMPL6 log P Challenge molecules were not expected to have multiple tautomers affecting log P predictions (based on QM 856 predictions). The choice of the challenge set also ensured participants did not have to calculate contributions of multiple relevant 857 tautomerization states or shifts in tautomerization states during transfer between phases. However, participants still had to 858 select a major tautomer for each compound. To evaluate the tautomer predictions in the future, experimental measurement of 859 tautomer populations in each solvent phase would provide valuable information. However, such experimental measurements are 860 difficult and low throughput. If measuring tautomers is not a possibility, the best approach may be to exclude compounds that 861 present potential tautomerization issues from the challenge, unless the challenge focus is specifically on tautomer prediction. 862 Overall, for future solute partitioning challenges, we would like to focus on fragment-like compounds, matched molecular 863 pairs, larger dynamic range, larger set size, and functional group diversity. Here, a community-wide blind partition coefficient prediction challenge was fielded for the first time, and participants were 869 asked to predict octanol-water partition coefficients for small molecules resembling fragments of kinase inhibitors. As predicting 870 log D in the previous challenge was quite challenging due to issues with pK a prediction, the present challenge focused on log P, 871 avoiding these challenges and placing it at roughly the right level of complexity for evaluating contemporary methods and issues Empirical predictions showed performance which was less dependent on the compound/dataset than physical methods in this 881 study. For empirical methods, the size and chemical diversity of the training set employed in developing the method seems to be 882 more important than the exact methods and descriptors employed. We expected many of the empirical methods to be the top 883 performers, given the wealth of octanol-water log P training data available, and this expectation was borne out. 884 Better prediction performance was seen for octanol-water log P challenge than the SAMPL5 cyclohexane-water log D challenge. 885 In addition to absence of pK a prediction problem for the partition system, the molecules in the SAMPL6 log P Challenge were 886 considerably less diverse than in the SAMPL5 log D Challenge, which may have also affected relative performance in the two 887 challenges. Physical methods fared slightly better in this challenge than previous cyclohexane-water log D challenge, likely 888 because of the elimination of the need to consider protonation state effects. However, MM-based physical methods with 889 similar approaches did not necessarily agree on predicted values, with occasionally large discrepancies resulting from apparently 890 relatively modest variations in protocol. 891 All information regarding the challenge structure, experimental data, blind prediction submission sets, and evaluation of 892 methods is available in the SAMPL6 GitHub Repository to allow follow up analysis and additional method testing. 893 Overall, high participation and clear lessons learned pave the way forward for improving solute partitioning and biomolecular 894 binding models for structure-based drug design.       The SMILES string and the mole fraction of each compound in the system were used as input. The "wet" octanol systems     REF07 and REF02 vs exp REF08 and REF05 vs exp Figure S1. Varying the amount of water in the octanol phase has no significant effect on the predicted log P in reference calculations, as discussed in section 4.2.1. Comparison of predicted log P values to the experimental values using wet (27% water) and dry octanol phases and the (A) GAFF and (B) SMIRNOFF force field, from non-blinded reference calculations performed for this paper, shows no statistically significant difference in performance of methodologies. Comparison of the calculated log P using dry and wet octanol phases for (C) the GAFF force field and (D) the SMIRNOFF force field shows a small systematic difference.  2 Corresponds to the bond length between the oxygen and hydrogen atoms. 3 Corresponds to the length between the oxygen atom and virtual site. 4 Corresponds to the Lennard-Jones (LJ) parameters of the oxygen. "Syn" SM08_micro11 "Anti" SM08_micro11 9.8±0.1 3.9±1.1 Figure S2. Shown here are the 2-and 3D structures of SM08_micro011 with the carboxylic acid in "anti" and "syn" conformation. The dihedral angle is indicated by the arrow around the carbon and oxygen atom. The calculated log P is included for comparison. The charges pertaining to each conformation are listed in Figure S7 and transition data is available in Figure S3 . Here is the transition data for the C-O dihedral in Figure S2, with charges listed in Table S7, for the DFE method (run in triplicate).

Supplementary figures and tables
In the "anti starting position" the torsion remains "anti" throughout the simulation, while the "syn starting position" allows transitions.