Overview of the SAMPL6 pKa Challenge: Evaluating small molecule microscopic and macroscopic pKa predictions

The prediction of acid dissociation constants (pKa) is a prerequisite for predicting many other properties of a small molecule, such as its protein-ligand binding affinity, distribution coefficient (log D), membrane permeability, and solubility. The prediction of each of these properties requires knowledge of the relevant protonation states and solution free energy penalties of each state. The SAMPL6 pKa Challenge was the first time that a separate challenge was conducted for evaluating pKa predictions as part of the Statistical Assessment of Modeling of Proteins and Ligands (SAMPL) exercises. This challenge was motivated by significant inaccuracies observed in prior physical property prediction challenges, such as the SAMPL5 log D Challenge, caused by protonation state and pKa prediction issues. The goal of the pKa challenge was to assess the performance of contemporary pKa prediction methods for drug-like molecules. The challenge set was composed of 24 small molecules that resembled fragments of kinase inhibitors, a number of which were multiprotic. Eleven research groups contributed blind predictions for a total of 37 pKa distinct prediction methods. In addition to blinded submissions, four widely used pKa prediction methods were included in the analysis as reference methods. Collecting both microscopic and macroscopic pKa predictions allowed in-depth evaluation of pKa prediction performance. This article highlights deficiencies of typical pKa prediction evaluation approaches when the distinction between microscopic and macroscopic pKas is ignored; in particular, we suggest more stringent evaluation criteria for microscopic and macroscopic pKa predictions guided by the available experimental data. Top-performing submissions for macroscopic pKa predictions achieved RMSE of 0.7-1.0 pKa units and included both quantum chemical and empirical approaches, where the total number of extra or missing macroscopic pKas predicted by these submissions were fewer than 8 for 24 molecules. A large number of submissions had RMSE spanning 1-3 pKa units. Molecules with sulfur-containing heterocycles or iodo and bromo groups were less accurately predicted on average considering all methods evaluated. For a subset of molecules, we utilized experimentally-determined microstates based on NMR to evaluate the dominant tautomer predictions for each macroscopic state. Prediction of dominant tautomers was a major source of error for microscopic pKa predictions, especially errors in charged tautomers. The degree of inaccuracy in pKa predictions observed in this challenge is detrimental to the protein-ligand binding affinity predictions due to errors in dominant protonation state predictions and the calculation of free energy corrections for multiple protonation states. Underestimation of ligand pKa by 1 unit can lead to errors in binding free energy errors up to 1.2 kcal/mol. The SAMPL6 pKa Challenge demonstrated the need for improving pKa prediction methods for drug-like molecules, especially for challenging moieties and multiprotic molecules.

For multiprotic molecules, the definition of pK a diverges into macroscopic pK a and microscopic pK a [5][6][7]. Macroscopic pK a use Hammett-Taft type equations to predict pK a based on classification of the molecule to a parent class (associated with a base 168 pK a value) and two parameters that describe how the base pK a value must be modified given its substituents. Physical modeling 169 of pK a predictions requires Quantum Mechanics (QM) models. QM methods are often utilized together with linear empirical 170 corrections (LEC) that are designed to rescale and unbias QM predictions for better accuracy. Classical molecular mechanics-171 based pK a prediction methods are not feasible as deprotonation is a covalent bond breaking event that can only be captured 172 by QM. Constant-pH molecular dynamics methods can calculate pK a shifts in large biomolecular systems where there is low 173 degree of coupling between protonation sites and linear summation of protonation energies can be assumed [25]. However, 174 this approach can not generally be applied to small organic molecule due to the high degree of coupling between protonation 175 sites [26][27][28]. The SAMPL6 pK a Challenge was conducted as a blind prediction challenge and focused on predicting aqueous pK a values of 24 179 small molecules not previously reported in the literature. The challenge set was composed of molecules that resemble fragments 180 of kinase inhibitors. Heterocycles that are frequently found in FDA-approved kinase inhibitors were represented in this set. The 181 compound selection process was described in depth in the prior publication reporting SAMPL6 pK a Challenge experimental data 182 collection [8]. The distribution of molecular weights, experimental pK a values, number of rotatable bonds, and heteroatom to 183 carbon ratio are depicted in Fig. 1. The challenge molecule set was composed of 17 small molecules with limited flexibility (less not identify which submissions used these enumerated microstate lists as input for predictions and which have followed the 282 challenge instructions and relied only on their prediction method to generate microstates. 284 Since the experimental data for the challenge was mainly composed of macroscopic pK a values of both monoprotic and multipro-285 tic compounds, evaluation of macroscopic and microscopic pK a predictions was not straightforward. For a subset of 8 molecules, 286 the dominant microstate sequence could be inferred from NMR experiments. For the rest of the molecules, the only experimen-287 tal information available was the macroscopic pK a value. The experimental data-in the form of macroscopic pK a values-did 288 not provide any information on which group(s) are being titrated, the microscopic pK a values, the identity of the associated 289 macrostates (which total charge), or microstates (which tautomers). Also, experimental data did not provide any information 290 about the charge state of protonated and deprotonated species associated with each macroscopic pK a . Typically charges of 291 states associated with experimental pK a values are assigned based on pK a predictions, not experimental evidence, but we did 292 not utilize such computational charge assignment. For a fair performance comparison between methods, we avoided relying 293 on any particular pK a prediction to assist the interpretation of the experimental reference data. This choice complicated the 294 pK a prediction analysis, especially regarding how to pair experimental and predicted pK a values for error analysis. We adopted 295 various evaluation strategies guided by the experimental data. To compare macroscopic pK a predictions to experimental values, 296 we had to utilize numerical matching algorithms before we could calculate performance statistics. For the subset of molecules 297 with experimental data about microstates, we used microstate-based matching. These matching methods are described in more 298 detail in the next section.

299
Three types of submissions were collected during the SAMPL6 pK a Challenge. We have only utilized the type I (microscopic 300 pK a value and microstate IDs) and the type III (macroscopic pK a value) predictions in this article. Type I submissions contained 301 the same prediction information as the type II submissions which reported the fractional population of microstates with respect 302 to pH. We collected type II submissions in order to capture relative populations of microstates, not realizing they were redun-303 dant. The microscopic pK a predictions collected in type I submissions capture all the information necessary to calculate type 304 II submissions. Therefore, we did not use type II submissions for challenge evaluation. In theory, type III (macroscopic pK a ) 305 predictions can also be calculated from type I submissions, but collecting type III submissions allowed the participation of pK a 306 prediction methods that directly predict macroscopic pK a values without considering microspeciation and methods that apply 307 special empirical corrections for macroscopic pK a predictions. Macroscopic pK a predictions can be calculated from microscopic pK a values for direct comparison to experimental macroscopic 310 pK a values. One major question must be answered to allow this comparison: How should we match predicted macroscopic 311 pK a values to experimental macroscopic pK a values when there could multiple pK a values reported for a given molecule? For 312 example, experiments on SM18 showed three macroscopic pK a s, but prediction of xvxzd method reported two macroscopic pK a 313 values. There were also examples of the opposite situation with more predicted pK a values than experimentally determined 314 macroscopic pK a s: One experimental pK a was measured for SM02, but two macroscopic pK a values were predicted by xvxzd 315 method. The experimental and predicted values must be paired before any prediction error can be calculated, even though 316 there was not any experimental information regarding underlying tautomer and charge states. 317 Knowing the charges of macrostates would have guided the pairing between experimental and predicted macroscopic 318 pK a values, however, not all experimental pK a measurements can determine determine the charge of protonation states. The 319 potentiometric pK a measurements just captures the relative charge change between macrostates, but not the absolute value of 320 the charge. Thus, our experimental data did not provide any information that would indicate the titration site, the overall charge, 321 or the tautomer composition of macrostate pairs that are associated with each measured macroscopic pK a that can guide the 322 matching between predicted and experimental pK a values. 323 For evaluating macroscopic pK a predictions taking the experimental data as reference, Fraczkiewicz [23] delineated recom-324 mendations for fair comparative analysis of computational pK a predictions. They recommended that, in the absence of any 325 experimental information that would aid in matching, experimental and computational pK a values should be matched preserv- 326 ing the order of pK a values and minimizing the sum of absolute errors. 327 We picked the Hungarian matching algorithm [33,34] to match experimental and predicted macroscopic pK a values with 328 a squared error cost function as suggested by Kiril Lanevskij via personal communication. The algorithm is available in the 329 SciPy package (scipy.optimize.linear_sum_assignment) [35]. This matching algorithm provides optimum global assignment that 330 minimizes the linear sum of squared errors of all pairwise matches. We selected the squared error cost function instead of the 331 absolute error cost function to avoid misordered matches, For instance, for a molecule with experimental pK a values of 4 and 332 6, and predicted pK a values of 7 and 8, Hungarian matching with absolute error cost function would match 6 to 7 and 4 to 9. 333 Hungarian matching with squared error cost would match 4 to 7 and 6 to 9, preserving the increasing pK a value order between 334 experimental and predicted values. A weakness of this approach would be failing to match the experimental value of 6 to pre-335 dicted value of 7 if that was the correct match based on underlying macrostates. But the underlying pair of states were unknown 336 to us both because the experimental data did not determine which charge states the transitions were happening between and 337 also because we did not collect the pair of macrostates associated with each pK a predictions in submissions. Requiring this in-338 formation for macroscopic pK a predictions in future SAMPL challenges would allow for better comparison between predictions, 339 even if experimental assignment of charges is not possible. There is no perfect solution to the numerical pK a assignment prob-340 lem, but we tried to determine the fairest way to penalize predictions based on their numerical deviation from the experimental 341 values.

342
For the analysis of microscopic pK a predictions we adopted a different matching approach. For the eight molecules for which 343 we had the requisite data for this analysis, we utilized the dominant microstate sequence inferred from NMR experiments to 344 match computational predictions and experimental pK a values. We will refer to this assignment method as microstate matching, 345 where the experimental pK a value is matched to the computational microscopic pK a value which was reported for the dominant 346 microstate pair observed for each transition. We have compared the results of Hungarian matching and microstate matching. 347 Inevitably, the choice of matching algorithms to assign experimental and predicted values has an impact on the computed 348 performance statistics. We believe the Hungarian algorithm for numerical matching of unassigned pK a values and microstate-349 based matching when experimental microstates are known were the best choices, providing the most unbiased matching with-350 out introducing assumptions outside of the experimental data. There are six error metrics reported for the numerical error of the pK a values: the root-mean-squared error (RMSE), mean abso-358 lute error (MAE), mean error (ME), coefficient of determination (R 2 ), linear regression slope (m), and Kendall's Rank Correlation Co-359 efficient ( ). Uncertainty in each performance statistic was calculated as 95% confidence intervals estimated by non-parametric 360 bootstrapping (sampling with replacement) over predictions with 10 000 bootstrap samples. Calculated errors statistics of all 361 methods can be found in Table S2 for macroscopic pK a predictions and Tables S4 and S4 for microscopic pK a predictions.

362
Assessing macrostate predictions 363 In addition to assessing the numerical error in predicted pK a values, we also evaluated predictions in terms of their ability to 364 capture the correct macrostates (ionization states) and microstates (tautomers of each ionization state) to the extent possible 365 from the available experimental data. For macroscopic pK a s, the spectrophotometric experiments do not directly report on the 366 identity of the ionization states. However, the number of ionization states indicates the number of macroscopic pK a s that exists 367 between the experimental range of 2.0-12.0. For instance, SM14 has two experimental pK a s and therefore three different charge 368 states observed between pH 2.0 and 12.0. If a prediction reported 4 macroscopic pK a s, it is clear that this method predicted 369 an extra ionization state. With this perspective, we reported the number of unmatched experimental pK a s (the number of 370 missing pK a predictions, i.e., missing ionization states) and the number of unmatched predicted pK a s (the number of extra pK a 371 predictions, i.e., extra ionization states) after Hungarian matching. The latter count was restricted to only predictions with pK a 372 values between 2 and 12 because that was the range of the experimental method. Errors in extra or missing pK a prediction 373 errors highlight failure to predict the correct number of ionization states within a pH range.

374
Assessing microstate predictions 375 For the evaluation of microscopic pK a predictions, taking advantage of the available dominant microstate sequence data for 376 a subset of 8 compounds, we calculated the dominant microstate prediction accuracy which is the ratio of correct dominant 377 tautomer predictions for each charge state divided by the total number of dominant tautomer predictions. Dominant microstate 378 prediction accuracy was calculated over all experimentally detected ionization states of each molecule which were part of this analysis. In order to extract the sequence of dominant microstates from the microscopic pK a predictions sets, we calculated 380 the relative free energy of microstates selecting a neutral tautomer and pH 0 as reference following Equation 8. Calculation of 381 relative microstate free energies was explained in more detail in a previous publication [26].

382
The relative free energy of a state with respect to reference state B at pH 0.0 (arbitrary pH value selected as reference) can 383 be calculated as follows: Δ is equal to the number protons in state A minus that in state B. R and T indicate the molar gas constant and temperature, 385 respectively. By calculating relative free energies of all predicted microstates with respect to the same reference state and pH, 386 we were able to determine the sequence of predicted dominant microstates. The dominant tautomer of each charge state 387 was determined as the microstate with the lowest free energy in the subset of predicted microstates of each ionization state.

388
This approach is feasible because the relative free energy of tautomers of the same ionization state is independent of pH and 389 therefore the choice of reference pH is arbitrary.

390
Identifying consistently top-performing methods 391 We created a shortlist of top-performing methods for macroscopic and microscopic pK a predictions. The top macroscopic pK a 392 predictions were selected if they ranked in the top 10 consistently according to two error metrics (RMSE, MAE) and two correlation 393 metrics (R-Squared, and Kendall's Tau), while also having fewer than eight missing or extra macroscopic pK a s for the entire 394 molecule set (eight macrostate errors correspond to macrostate prediction mistake in roughly one third of the 24 compounds).

395
These methods are presented in Table 2. A separate list of top-performing methods was constructed for microscopic pK a with 396 the following criteria: ranking in the top 10 methods when ranked by accuracy statistics (RMSE and MAE) and perfect dominant 397 microstate prediction accuracy. These methods are presented in Table 3.

398
Determining challenging molecules 399 In addition to comparing the performance of methods, we also wanted to compare pK a prediction performance for each molecule 400 to determine which molecules were the most challenging for pK a predictions considering all the methods in the challenge. For 401 this purpose, we plotted prediction error distributions of each molecule calculated over all prediction methods. We also calcu-402 lated MAE for each molecule over all prediction sets as well as for predictions from each method category separately.

404
Including a null model is helpful in comparative performance analysis of predictive methods to establish what the performance 405 statistics look like for a baseline method for the specific dataset. Null models or null predictions employ a simple prediction 406 model which is not expected to be particularly successful, but it provides a simple point of comparison for more sophisticated 407 methods. The expectation or goal is for more sophisticated or costly prediction methods to outperform the predictions from a 408 null model, otherwise the simpler null model would be preferable. In SAMPL6 pK a Challenge there were two blind submissions 409 using database lookup methods that were submitted to serve as null predictions. These methods, with submission IDs 5nm4j and 410 5nm4j both used OpenEye pKa-Prospector database to find the most similar molecule to query molecule and simply reported its 411 pK a as the predicted value. Database lookup methods with a rich experimental database do present a challenging null model to 412 beat, however, due to the accuracy level needed from pK a predictions for computer-aided drug design we believe such methods 413 provide an appropriate performance baseline that physical and empirical pK a prediction methods should strive to outperform. 414 We also included additional reference calculations in the comparative analysis to provide more perspective. Some widely 415 used methods by academia and industry were missing from the blind challenge submission. Therefore, we included those meth- corrections. There were 4 methods that used a mixed physical modeling approach of QM + MM.

436
The following sections present a detailed performance evaluation of blind submissions and reference prediction methods 437 for macroscopic and microscopic pK a predictions. Performance statistics of all the methods can be found in Tables S2 and S4.
The performance of macroscopic pK a predictions was analyzed by comparison to experimental pK a values collected by the 441 spectrophotometric method via numerical matching following the Hungarian method. Overall pK a prediction performance was 442 worse than we hoped. Fig. 2    QM + MM M06-2X/6-31G*(for bases) or 6-31+G*(for acids) for gas phase, solvation free energy using TI with explicit solvent and GAFF, solvation free energy of proton -265.6 kcal/mol 0wfzo Blind [47] QM + MM M06-2X/6-31G*(for bases) or 6-31+G*(for acids) for gas phase, solvation free energy using TI with explicit solvent and GAFF, solvation free energy of proton -271.88 kcal/mol z3btx Blind QM + MM M06-2X/6-31G*(for bases) or 6-31+G*(for acids) + thermal state correction for gas phase, solvation free energy using TI with explicit solvent and GAFF, solvation free energy of proton -265.6 kcal/mol 758j8 Blind QM + MM M06-2X/6-31G*(for bases) or 6-31+G*(for acids) + thermal state correction for gas phase, solvation free energy using TI with explicit solvent and GAFF, solvation free energy of proton -271.88 kcal/mol hgn83 Blind * Microscopic pK a submissions were blind, however, participant requested a correction after blind submission deadline for macroscopic pK a submissions. Therefore, these were assigned submission IDs in the form of nb###.
RMSE, while an RMSE between 2-3 log units was observed for the majority of methods (20 out of 38 methods). Only five methods achieved RMSE less than 1 pK a unit. One is QM method with COSMO-RS approach for solvation and linear empirical cor- linear fit)), and the remaining four are empirical prediction methods of LFER (xmyhm (ACD/pKa Classic), nb007 (Schrödinger/Epik 449 Scan)) and QSPR/ML categories (gyuhx (Simulations Plus), nb017 (MoKa)). These five methods with RMSE less than 1 pK a unit are 450 also the methods that have the lowest MAE. xmyhm and xvxzd were the only two methods for which the upper 95% confidence 451 interval of RMSE was lower than 1 pK a unit.

452
In terms of correlation statistics, many methods have good performance, although the ranking of methods changes accord-453 ing to R 2 and Kendall's Tau. Therefore, many methods are indistinguishable from one another, considering the uncertainty of 454 the correlation statistics. 32 out of 38 methods have R and Kendall's Tau higher than 0.7 and 0.6, respectively. 8 methods have 455 R 2 higher than 0.9 and 6 methods have Kendall's Tau higher than 0.8. The overlap of these two sets are the following: gyuhx (Sim-

471
The distribution of macroscopic pK a prediction signed errors observed in each submission was plotted in Fig. 7A  In addition to the statistics related to the pK a value, we also analyzed missing or extra pK a predictions. Analysis of the pK a values with accuracy-and correlation-based error metrics was only possible after the matching of predicted macroscopic 498 pK a values to experimental pK a values through Hungarian matching, although this approach masks pK a prediction issues in 499 the form of extra or missing macroscopic pK a predictions. To capture this class of prediction errors, we reported the number of 500 unmatched experimental pK a s (missing pK a predictions) and the number of unmatched predicted pK a s (extra pK a predictions) 501 after Hungarian matching for each method. Both missing and extra pK a prediction counts were only considered for the pH 502 range of 2-12, which corresponds to the limits of the experimental assay. The lower subplot of Fig. 2 shows the total count 503 of unmatched experimental or predicted pK a values for all the molecules in each prediction set. The order of submission IDs 504 in the x-axis follows the RMSD based ranking so that the performance of each method from both pK a value accuracy and the 505 number of pK a s can be viewed together. The omission or inclusion of extra macroscopic pK a predictions is a critical error because 506 inaccuracy in predicting the correct number of macroscopic transitions shows that methods are failing to predict the correct set 507 of charge states, i.e., failing to predict the correct number of ionization states that can be observed between the specified pH 508 range.

509
In the analysis of these challenge results, extra macroscopic pK a predictions were found to be more common than missing 510 pK a predictions. In pK a prediction evaluations, the accuracy of predicted ionization states within a pH range is usually neglected.

511
When predictions are only evaluated for the accuracy of the pK a value with numerical matching algorithms, a larger number of 512 predicted pK a s lead to greater underestimation of prediction errors. Therefore, it is not surprising that methods are biased to 513 predict extra pK a values. The SAMPL6 pK a Challenge experimental data consists of 31 macroscopic pK a s in total, measured for 514 24 molecules (6 molecules in the set have multiple pK a s). Within the 10 methods with the lowest RMSE, only the xvxzd method 515 predicts too few pK a values (2 unmatched out of 31 experimental pK a s). All other methods that rank in the top 10 by RMSE 516 have extra predicted pK a s ranging from 1 to 13. Two prediction sets without any extra pK a predictions and low RMSE are 8xt50 517 (ReSCoSS conformations // DSD-BLYP-D3 reranking // COSMOtherm pKa) and nb015 (ChemAxon/Chemicalize). Methods ranked differently when ordered by different error metrics, although there were a couple of methods that consistently 520 ranked in the top fraction. By using combinatorial criteria that take multiple statistical metrics and unmatched pK a counts into 521 account, we identified a shortlist of consistently well-performing methods for macroscopic pK a predictions, shown in Table 2.

522
The criteria for selection were the overall ranking in Top 10 according to RMSE, MAE, R 2 , and Kendall's Tau and also having a 523 combined unmatched pK a (extra and missing pK a s) count less than 8 (a third of the number of compounds). We ranked methods 524 in ascending order for RMSE and MAE and in descending order for R 2 , and Kendall's Tau to determine methods. Then, we took 525 the intersection set of Top 10 methods according to each statistic to determine the consistently-well performing methods. This 526 resulted in a list of four methods that are consistently well-performing across all criteria.

527
Consistently well-performing methods for macroscopic pK a prediction included methods from all categories. Two methods in
536 Figure 4 plots predicted vs. experimental macroscopic pK a predictions of four consistently well-performing methods, a rep-537 resentative average method, and the null method(5nm4j). We selected the method with the highest RMSE below the median of 538 all methods as the representative method with average performance: 2ii2g (EC-RISM/MP2/cc-pVTZ-P2-q-noThiols-2par). In addition to comparing the performance of methods that participated in the SAMPL6 Challenge, we also wanted to analyze 541 macroscopic pK a predictions from the perspective of challenge molecules and determine whether particular compounds suffer 542 from larger inaccuracy in pK a predictions. The goal of this analysis is to provide insight on which molecular properties or moieties 543 might be causing larger pK a prediction errors. In Fig. 5 MAE was averaged over all the pK a values. For the analysis of pK a prediction accuracy observed for each molecule, MAE is a 546 more appropriate statistical value than RMSE for following global trends, as it is less sensitive to outliers than the RMSE. 547 Table 2. Four consistently well-performing prediction methods for macroscopic pK a prediction based on consistent ranking within the Top 10 according to various statistical metrics. Submissions were ranked according to RMSE, MAE, R 2 , and . Consistently wellperforming methods were selected as the ones that rank in the Top 10 in each of these statistical metrics. These methods also have less than 2 unmatched experimental pK a s and less than 7 unmatched predicted pK a s according to Hungarian matching. Performance statistics are provided as mean and 95% confidence intervals. choices that may artificially lower the error. 556 We separately analyzed the MAE of each molecule for empirical (LFER and QSPR/ML) and QM-based physical methods (QM, 557 QM+LEC, and QM+MM) to gain additional insight into prediction errors. Fig. 6B shows that the difficulty of predicting pK a values 558 of the same subset of molecules was a trend conserved in the performance of physical methods. For QM-based methods, sulfur-559 containing heterocycles, amides proximal to aromatic heterocycles, and compounds with iodo and bromo substitutions have 560 lower pK a prediction accuracy.

561
The SAMPL6 pK a set consists of only 24 small molecules and lacks multiple examples of many moieties, limiting our ability 562 to determine with statistical significance which chemical substructures cause greater errors in pK a predictions. Still, the trends 563 observed in this challenge point to molecules with iodo-, bromo-, and sulfur-containing heterocycles as having systematically 564 larger prediction errors in macroscopic pK a value. We hope that reporting this observation will lead to the improvement of 565 methods for similar compounds with such moieties.

566
We have also looked for correlation with molecular descriptors for finding other potential explanations as to why macroscopic 567 pK a prediction errors were larger for certain molecules. While testing the correlation between errors and many molecular de-568 scriptors, it is important to account for the possibility of spurious correlations. We haven't observed any statistically significant 569 correlation between numerical pK a predictions and the descriptors we have tested. First, having more experimental pK a values 570 (Fig. 6A) did not seem to be associated with poorer pK a prediction performance. Still, we need to keep in mind that multiprotic 571 compounds were sparsely represented in the SAMPL6 set (5 molecules with 2 macroscopic pK a values and one with 3 macro-572 scopic pK a ). Second, we checked the following other descriptors: presence of an amide group, molecular weight, heavy atom 573 count, rotatable bond count, heteroatom count, heteroatom-to-carbon ratio, ring system count, maximum ring size, and the 574 number of microstates (as enumerated for the challenge). Correlation plots and R 2 values can be seen in Fig. S2. 575 We had suspected that pK a prediction methods may perform better for moderate values (4-10) than extreme values as 576 molecules with extreme pK a values are less likely to change ionization states close to physiological pH. To test this we look at 577 the distribution of absolute errors calculated for all molecules and challenge predictions binned by experimental pK a value 2 pK a 578 unit increments. As can be seen in Fig. S3B, the value of true macroscopic pK a values was not a factor affecting the prediction 579 error seen in SAMPL6 Challenge. 580 Fig. 7B is helpful to answer the question "Are there molecules with consistently overestimated or underestimated pK a val-581 ues?". This ridge plots show the error distribution of each experimental pK a . SM02_pKa1, SM04_pKa1, SM14_pKa1, and SM21_pKa1 582 were underestimated, predicting lower protein affinity by more than 1 pK a unit by majority of the prediction methods. SM03_pKa1,

583
SM06_pKa2, SM19_pKa1, and SM20_pKa1 were overestimated by the majority of the prediction methods by more than 1 pK a unit.
SM03_pKa1, SM06_pKa2, SM10_pKa1, SM19_pKa1, and SM22_pKa1 have the highest spread of errors and were less accurately  Error bars indicate standard error of the mean of predicted and experimental values. Experimental pK a SEM values are too small to be seen under the data points. EC-RISM/MP2/cc-pVTZ-P2-q-noThiols-2par method (2ii2g) was selected as the representative method with average performance because it is the method with the highest RMSE below the median. predicted overall. 586 3.2 Analysis of microscopic pK a predictions using microstates determined by NMR for 8 molecules 587 The most common approach for analyzing microscopic pK a prediction accuracy has been to compare it to experimental macro-588 scopic pK a data, assuming experimental pK a values describe titrations of distinguishable sites and, therefore, correspond to 589 microscopic pK a s. But this typical approach fails to evaluate methods at the microscopic level.

590
Analysis of microscopic pK a predictions for the SAMPL6 Challenge was not straightforward due to the lack of experimental 591 data with microscopic resolution of the titratable sites and their associated microscopic pK a s. For 24 molecules, macroscopic 592 pK a values were determined with the spectrophotometric method. For 18 molecules, a single macroscopic titration was ob-593 served, and for 6 molecules multiple experimental pK a values were observed and characterized. For 18 molecules with a single 594 experimental pK a , it is probable that the molecules are monoprotic and, therefore, macroscopic pK a value is equal to the mi-595 croscopic pK a . There is, however, no direct experimental evidence supporting this hypothesis aside from the support from 596 computational predictions, such as the predictions by ACD/pKa Classic. There is always the possibility that the macroscopic pK a 597 observed is the result of two different titrations overlapping closely with respect to pH if any charge state has more than one 598 tautomer. We did not want to bias the blind challenge analysis with any prediction method. Therefore, we believe analyzing 599 the microscopic pK a predictions via Hungarian matching to experimental values with the assumption that the 18 molecules 600 have a single titratable site is not the best approach. Instead, an analysis at the level of macroscopic pK a values is much more 601 appropriate when a numerical matching scheme is the only option to evaluate predictions using macroscopic experimental data.   Error distribution for each SAMPL6 molecule for all prediction methods according to Hungarian matching. For multiprotic molecules, pK a ID numbers (pKa1, pKa2, and pKa3) were assigned in the direction of increasing experimental pK a value.
For a subset of eight molecules, dominant microstates were inferred from NMR experiments. Six of these molecules were monoprotic and two were multiprotic. This dataset was extremely useful for guiding the assignment between experimental 604 and predicted pK a values based on microstates. In this section, we present the performance evaluations of microscopic pK a 605 predictions for only the 8 compounds with experimentally-determined dominant microstates. imental pK a values to predicted pK a values only based on the closeness of the numerical values, without consideration of the 615 relative population of microstates and microstate identities. Because of this, a microscopic pK a value that describes a transition 616 between very low population microstates (high energy tautomers) can be assigned to the experimental pK a if it has the closest 617 pK a value. This is not helpful because, in reality, the microscopic pK a values that influence the observable macroscopic pK a the 618 most are the ones with higher microstate populations (transitions between low energy tautomers).

619
The number of unmatched predicted microscopic pK a s is shown in the lower bar plots of Fig. 8B and C, to emphasize the large 620 number of microscopic pK a predictions submitted by many methods. In the case of microscopic pK a , the number of unmatched as a reference (Fig. 8). Although we believe that microstate-based evaluation is more informative, the lack of a large experimental 638 dataset limits the conclusions to a very narrow chemical diversity. Still, microstate-based matching revealed errors masked by 639 pK a value-based matching between experimental and predicted pK a s. Both accuracy-and correlation-based statistics were calculated for the predicted microscopic pK a values after microstate-based 642 matching. RMSE, MAE, ME, R 2 , and Kendall's Tau results of each method are shown in Fig. 8C and Fig. 9. A table of the calculated 643 statistics can be found in Table S4. Due to the small number of data points in this set, correlation-based statistics have large 644 uncertainties and thus have less utility for distinguishing better-performing methods. Therefore, we focused more on accuracy-645 based metrics for the analysis of microscopic pK a s than correlation-based metrics. In terms of accuracy of predicted microscopic 646 pK a values, all three QSPR/ML based methods (nb016 (MoKa), hdiyq (Simulations Plus), 6tvf8 (OE Gaussian Process)), three QM-647 based methods (nb011 (Jaguar), ftc8w (EC-RISM/MP2/cc-pVTZ-P2-q-noThiols-2par), t8ewk (COSMOlogic_FINE17)), and one LFER 648 method (v8qph (ACD/pKa GALAS)) achieved RMSE lower than 1 pK a unit. The same six methods also have the lowest MAE.   nb011  ftc8w  6tvf8  t8ewk  v8qph  ccpmw  0xi4b  cywyk  eyetm  nb008  y4wws  ktpj5  wuuvc  xnoe0  qsicn  epvmk  4o0ia  ko8yx  2umai  cm2yq  nxaaw  wcvnu  kxztt  wexjs  z7fhp  gdqeg  8toyp  w4z0e  arcko  0wfzo  z3btx  758j8  A Dominant microstate sequence of two compounds (SM07 and SM14) were determined by NMR [8]. Based on these reference compounds, the dominant microstates of 6 related compounds were inferred and experimental pK a values were assigned to titratable groups with the assumption that only the dominant microstates have significant contributions to the experimentally observed pK a . B RMSE vs. submission ID and unmatched pK a vs. submission ID plots for the evaluation of microscopic pK a predictions of 8 molecules by Hungarian matching to experimental macroscopic pK a values. C RMSE vs. submission ID and unmatched pK a vs. submission ID plots showing the evaluation of microscopic pK a predictions of 8 molecules by microstate-based matching between predicted microscopic pK a s and experimental macroscopic pK a values. Submissions 0wfzo, z3btx, 758j8, and hgn83 have RMSE values bigger than 10 pK a units which are beyond the y-axis limits of subplot C and B. RMSE is shown with error bars denoting 95% confidence intervals obtained by bootstrapping over the challenge molecules. Lower bar plots show the number of unmatched experimental pK a s (light grey, missing predictions) and the number of unmatched pK a predictions (dark grey, extra predictions) for each method between pH 2 and 12. Submission IDs are summarized in Table 1. for all charge states (Fig. 10A).  3.2.4 Consistently well-performing methods for microscopic pK a predictions 693 We have identified different criteria for determining consistently top-performing predictions of microscopic pK a than macro-694 scopic pK a : having perfect dominant microstate prediction accuracy, unmatched pK a count of 0, and ranking in the top 10 695 according to RMSE and MAE. Correlation statistics were not found to have utility for discriminating performance due to large 696 uncertainties in these statistics for a small dataset of 10 pK a values. Unmatched predicted pK a count was also not considered since experimental data was only informative for the pK a between dominant microstates and did not capture all the possible 698 theoretical transitions between microstate pairs.  corrections with good performance with microscopic pK a predictions.

704
The Simulations Plus pK a prediction method is the only method that appeared to be consistently well-performing in both the 705 assessment for macroscopic and microscopic pK a prediction (gyuhx and hdiyq). However, it is worth noting that two methods 706 that were in the list of consistently top-performing methods for macroscopic pK a predictions lacked equivalent submissions 707 of their underlying microscopic pK a predictions, and therefore could not be evaluated at the microstate level. These meth-

How do pK a prediction errors impact protein-ligand binding affinity predictions?
711 pK a predictions provide a key input for computational modeling of protein-ligand binding with physical methods. The SAMPL6 712 pK a Challenge focused only on small molecule pK a prediction and showed how pK a prediction accuracy observed can impact the to quantitatively describe molecular association.

719
In terms of ligand protonation states, there are two ways in which pK a prediction errors can influence the prediction accuracy 720 for protein-ligand binding free energies as depicted in Fig. 11. The first scenario is when a ligand is present in aqueous solution 721 in multiple protonation states (Fig. 11A). When only the minor aqueous protonation state contributes to protein-ligand complex 722 formation, the overall binding free energy (Δ ) needs to be calculated as the sum of binding free energy of the minor state 723 and the protonation penalty of that state (Δ ). Δ is a function of both pH and pK a . A 1 unit of error in predicted pK a would 724 lead to 1.36 kcal/mol error in overall binding free energy if the protonation state with the minor population binds the protein and 725 this minor protonation state is correctly selected to model the free energy of binding; if the incorrect dominant protonation state 726 for the complex is selected, the dominant contribution to the free energy of binding may be missed entirely, leading to much 727 larger modeling errors in the binding free energy. Other scenarios-in which multiple protonation states can be significantly 728 populated in complex-can lead to more complex scenarios in which the errors in predicted pK a propagate in more complex 729 ways. The equations in Fig. 11A show the overall free energy for a simple thermodynamic cycle involving multiple protonation 730 states.

731
In addition to the presence of multiple protonation states in the aqueous environment, multiple charge states can contribute 732 to complex formation (Fig. 11B). Then, the overall free energy of binding needs to include a Multiple Protonation States Correction 733 (MPSC) term (Δ ) [4]. MPSC is a function of pH, aqueous pK a of the ligand, and the difference between the binding free energy 734 of charged and neutral species (Δ − Δ ) as shown in Fig. 11B.

735
Using the equations in Fig. 11B, we can model the true MPSC (Δ ) with respect to the difference between pH and the pK a of 736 the ligand to see when this value has a significant impact on the overall binding free energy. In Fig. 12, the true MPSC that must be 737 added to Δ is shown for ligands with varying binding affinity difference between protonation states (ΔΔ = Δ − Δ ).
Fig . 12A shows the case of a monoprotic base in which the charged state has a lower affinity than the neutral state. Solid lines 739 depict the accurate correction value. In cases where the pK a is lower than the pH, the correction factor disappears as the ligand 740 fully populates the neutral state (Δ = Δ ). As the pH dips below the pK a , the charged state is increasingly populated and 741 Δ increases to approach ΔΔ . . Aqueous ligand pK a can influence overall protein-ligand binding affinity. A When only the minor aqueous protonation state contributes to protein-ligand complex formation, the overall binding free energy (Δ ) needs to be calculated as the sum of binding affinity of the minor state and the protonation penalty of that state. B When multiple charge states contribute to complex formation, the overall free energy of binding includes a multiple protonation states correction (MPSC) term (Δ ). MPSC is a function of pH, aqueous pK a of the ligand, and the difference between the binding free energy of charged and neutral species (Δ − Δ ).
It is interesting to note the pH-pK a range over which Δ changes significantly. It is often assumed that, for a basic ligand, 743 if the pK a of a ligand is more than 2 units higher than the pH, only 1% of the population is in the neutral state according to 744 Henderson-Hasselbalch equation, and it is safe to approximate the overall binding affinity with Δ . Based on the magnitude 745 of the relative free energy difference between ligand protonation states, this assumption is not always correct. As seen in 746 Fig. 12A, the responsive region of Δ can span 3 pH units for a system with ΔΔ = 1 ∕ , or 5 pH units for a system with 747 ΔΔ = 4 ∕ . This highlights that the range of pK a values that impact binding affinity predictions is wider than 2 pH units.

748
Molecules with pK a values several units away from the physiological pH can still impact the overall binding affinity significantly 749 due to the MPSC.

750
Despite the need to capture the contributions of multiple protonation states by including the MPSC in binding affinity calcu-751 lations, inaccurate pK a predictions can lead to errors in Δ and overall free energy of binding prediction. In Fig. 12A dashed 752 lines show predicted Δ based on pK a error of -1 units. We have chosen a pK a error of 1 unit as this is the average inaccuracy 753 expected from the pK a prediction methods based on the SAMPL6 Challenge. Underestimation of the pK a causes the Δ to 754 be underestimated as well and will result in overestimated affinities (i.e., too negative binding free energy) for a varying range 755 of pH -pK a values depending on the binding affinity difference between protonation states(ΔΔ ). In Fig. 12B dashed lines show 756 how the magnitude of the absolute error caused by calculating Δ with an inaccurate pK a varies with respect to pH. Different 757 colored lines show simulated results with varying binding free energy differences between protonation states. For a system 758 whose charged state has higher binding free energy than the neutral state (ΔΔ = 2 kcal/mol), the absolute error caused by 759 underestimated pK a by 1 unit can be up to 0.9 kcal/mol. For a system whose charged state has an even lower affinity (more 760 positive binding free energy) than the neutral state (ΔΔ = 4 kcal/mol), the absolute error caused by underestimated pK a by 761 1 unit can be up to 1.2 kcal/mol. The magnitude of errors contributing to overall binding affinity is too large to be neglected.

762
Improving the accuracy of small molecule pK a prediction methods can help to minimize the error in predicted MPSC.

763
With the current level of pK a prediction accuracy as observed in SAMPL6 Challenge, is it advantageous to include the MPSC 764 in affinity predictions that may include errors caused by pK a predictions? We provide a comparison of the two choices to answer

768
What is the best strategy? Error due to choice 1 is always larger than error due to choice 2 for all pH-pK a values. In this scenario, 769 including the MPSC improves overall binding affinity prediction accuracy. The error caused by the inaccurate pK a is smaller than 770 the error caused by neglecting the MPSC.

771
We can also ask whether or not an MPSC calculated based on an inaccurate pK a should be included in binding affinity predic-772 tions in different circumstances, such as underestimated or overestimated pK a values and charged states with higher or lower 773 affinities than the neutral states. We tried to capture these circumstances in four quadrants of Fig. 12. In the case of overesti-774 mated pK a values (Fig. 12E-H), it can be seen that for most of the pH-pK a range, it is more advantageous to include the predicted 775 MPSC in affinity calculations, except a smaller window where the opposite choice would be more advantageous. For instance, 776 for the system with ΔΔ = 2 kcal/mol and overestimated pK a (Fig. 12E) for the pH-pK a region between -0.5 and 2, including the 777 predicted Δ introduces more error than ignoring the MPSC.

778
In practice, we normally do not know the exact magnitude or the direction of the error of our predicted pK a . Therefore, using 779 simulated MPSC error plots to decide when to include MPSC in binding affinity predictions is not possible. However, based on 780 the analysis of a case with 1 unit of pK a error, including the MPSC correction would be more often than not helpful in improving 781 binding affinity predictions. The detrimental effect of pK a inaccuracy is still significant. Hopefully, future improvements in pK a 782 prediction methods will improve the accuracy of the MPSC and binding affinity predictions of ligands which have multiple proto-783 nation states that contribute to aqueous or complex populations. Being able to predict pK a values with 0.5 units accuracy, for 784 example, would significantly aid binding affinity models in computing more accurate MPSC terms.

785
The whole analysis presented in this section assumes that at least the dominant protonation state of the ligand is correctly 786 included in the modeling of the protein-ligand complex. We have not discussed the case of omitting this dominant state from 787 the free energy calculations entirely when it is erroneously predicted to be a minor state in solution. Such a mistake could be 788 the most problematic, and the errors in estimated binding free energy could be very large.

790
The SAMPL6 pK a Challenge showed that, in general, pK a prediction accuracy of computational methods is lower than expected 791 for drug-like molecules. Our expectation prior to the blind challenge was that well-developed methods would achieve prediction in the SAMPL6 Challenge that achieved RMSE around 0.5 or lower for macroscopic pK a predictions for the 24 molecule set of 798 kinase inhibitor fragment-like molecules. Smaller RMSEs were observed in the microscopic pK a evaluation section of this study 799 for some methods; however, the 8 molecule set used for that analysis poses a very limited dataset to reach conclusions about 800 general expectations for drug-like molecules.

801
As the majority of experimental data was in the form of macroscopic pK a values, we had to adopt a numerical matching 802 algorithm (Hungarian matching) to pair predicted and experimental values to calculate performance statistics of macroscopic 803 pK a predictions. Accuracy, correlation, and extra/missing pK a prediction counts were the main metrics for macroscopic pK a 804 evaluations. An RMSE range of 0.7 to 3.2 pK a units was observed for all methods. Only five methods achieved RMSE between 805 0.7-1 pK a units, while an RMSE between 1.5-3 log units was observed for the majority of methods. All four methods of the LFER 806 category and three out of 5 QSPR/ML methods achieved RMSE less than 1.5 pK a units. All the QM methods that achieved this 807 level of performance included linear empirical corrections to rescale and unbias their pK a predictions.

808
Based on the consideration of multiple error metrics, we compiled a shortlist of consistently-well performing methods for 809 macroscopic pK a evaluations. Two methods from QM+LEC methods, one QSPR/ML, two empirical methods achieved consistent 810 performance according to many metrics. The common features of the two empirical methods were their large training sets 811 (16000-17000 compounds) and commercial nature.

812
There were four submissions of QM-based methods that utilized the COSMO-RS implicit solvation model. While three of these 813 achieved the lowest RMSE among QM-based methods (xvxzd, yqkga, and 8xt50) [46], one of them showed the highest RMSE 814 (0hxtm (COSMOtherm_FINE17)). The comparison of these methods indicates that capturing the conformational ensemble of 815 microstates, using high-level QM calculations, and including RRHO corrections contribute to better macroscopic pK a predictions.

816
Linear empirical corrections applied QM calculations improved results, especially when the linear correction is calibrated for an 817 experimental dataset using the same level of theory as the deprotonation free energy predictions (as in xvxzd). This challenge 818 also points to the advantage of the COSMO-RS solvation approach compared to other implicit solvent models.

819
Molecules that posed greater difficulty for pK a predictions were determined by comparing the macroscopic pK a prediction caused by ignoring the MPSC completely (solid lines) vs. calculating MPSC based in inaccurate pK a value (dashed lines). These plots provide guidance on when it is beneficial to include MPSC correction based on pK a error, pH -pK a , and ΔΔ . accuracy of each molecule averaged over all methods submitted to the challenge. pK a prediction errors were higher for compounds with sulfur-containing heterocycles, iodo, and bromo groups. This trend was also conserved when only QM-based 822 methods were analyzed. The SAMPL6 pK a dataset consisted of only 24 small molecules which limited our ability to statistically 823 confirm this conclusion, however, we believe it is worth reporting molecular features that coincided with larger errors even if 824 we can not evaluate the reason for these failures.

825
Utilizing a numerical matching algorithm to pair experimental and predicted macroscopic pK a values was a necessity, how-826 ever, this approach did not capture all aspects of prediction errors. Computing the number of missing or extra pK a predictions 827 remaining after Hungarian matching provided a window for observing macroscopic pK a prediction errors such as the number of 828 macroscopic transitions or ionization states expected in a pH interval. In pK a evaluation studies, it is typical to just focus on pK a 829 value errors evaluated after matching and to ignore pK a prediction errors that the matching protocol can not capture [49][50][51][52][53].

830
Frequently ignored prediction errors include predicting missing or extra pK a s and failing to predict the correct charge states.

831
The SAMPL6 pK a Challenge results showed sporadic presence of missing pK a predictions and very frequent tendency to make 832 extra pK a predictions. Both indicate failures to capture the correct ionization states. The traditional way of evaluating pK a s that 833 only focuses on the pK a value error after some sort of numerical match between predictions and experimental values may have 834 motivated these types of errors as there would be no penalty for missing a macroscopic deprotonation and predicting an extra 835 one. This problem does not seem to be specific to any method category. 836 We used the eight molecule subset of SAMPL6 compounds with NMR-based dominant microstate sequence information 837 to demonstrate the advantage of evaluating pK a prediction on the level of microstates. Comparison of statistics computed 838 for the 8 molecule dataset by Hungarian matching and microstate-based matching showed how Hungarian matching, despite 839 being the best choice when only numerical matching is possible, can still mask errors in pK a predictions. Errors computed by 840 microstate-based matching were larger compared to numerical matching algorithms in terms of RMSE. Microscopic pK a analysis 841 with numerical matching algorithms may mask errors due to the higher number of guesses made. Numerical matching based on 842 pK a values also ignores information regarding the relative population of states. Therefore, it can lead to pK a s defined between 843 very low energy microstate pairs to be matched to the experimentally observable pK a between microstates of higher populations.

844
Of course, the predicted pK a value could be correct however the predicted microstates would be wrong. Such mistakes caused 845 by Hungarian matching were observed frequently in SAMPL6 results, and therefore we decided microstate-based matching of 846 pK a values provides a more realistic picture of method performance.

847
Some QM and LFER methods made mistakes in predicting the dominant tautomers of the ionization states. Dominant tau-848 tomer prediction seemed to be particularly difficult for charged tautomers compared with neutral tautomers. The easiest way to 849 extract the dominant microstate sequence from predictions was to calculate the relative free energy of microstates at any refer-850 ence pH, determining the lowest free energy state in each ionization state. Errors in dominant microstate predictions were very 851 rare for neutral tautomers, but more frequent in cationic tautomers with +1 charge of the 8 molecule set. SM14 was the molecule 852 with the lowest dominant microstate prediction accuracy, while dominant microstates predictions for SM15 were perfect for all 853 molecules. SM14 and SM15 both possess two experimental pK a s and a benzimidazole scaffold. The difference between them is 854 the distance between the experimental pK a values, which is smaller for SM14. These results make sense from the perspective 855 of relative free energies of microstates. Closer pK a values mean that the free energy difference between different microstates is 856 smaller for SM14, and therefore any error in predicting the relative free energy of tautomers is more likely to cause reordering of 857 relative populations of microstates and impact the accuracy of dominant microstate predictions. It would have been extremely 858 informative to evaluate the tautomeric ratios and relative free energy predictions of microstates, however, the experimental 859 data needed for this approach was not available. Tautomeric ratios could not be measured by the experimental methods avail-860 able to us. Resolving tautomeric ratios would require extensive NMR measurements, but these measurements can suffer from 861 lower accuracy especially when the free energy difference between tautomers is large.

862
The overall assessment of the SAMPL6 pK a Challenge captured non-stellar performance for microscopic and macroscopic 863 pK a predictions which can be detrimental to the accuracy of protein-ligand affinity predictions and other pH-dependent physic-864 ochemical property predictions such as distribution coefficients, membrane permeability, and solubility. Protein-ligand binding 865 affinity predictions utilize pK a predictions in two ways: determination of the relevant aqueous microstates and quantification of 866 the free energy penalty to reach these states. More accurate microscopic pK a predictions are needed to be able to accurately 867 incorporate multiple protonation state corrections (MPSC) into overall binding affinity calculations. 868 We simulated the effect of overestimating or underestimating pK a of a ligand by one unit on overall binding affinity prediction 869 for a ligand where both cation and neutral states contribute to binding affinity. A pK a prediction error of this magnitude (assum-870 ing dominant tautomers were predicted correctly) could cause up to 0.9 and 1.2 kcal/mol error in overall binding affinity when 871 the binding affinity of protonation states are 2 or 4 kcal/mol different, respectively. For the case of 4 kcal/mol binding affinity difference between protonation states, the pH-pK a range that the error would be larger than 0.5 kcal/mol surprisingly spans around 3.5 pH units. The worse case, of course, is where there is a significant difference in binding free energy between the 874 two protonation states, but we include the wrong one in our free energy calcuation. We demonstrated that the range of pH-pK a 875 value that the MPSC needs to be incorporated in binding affinity predictions can be wider than the widely assumed range of 2 876 pH units, based on the affinity difference between protonation states. At the level of 1 unit pK a error, incorporating the MPSC 877 would improve binding affinity predictions more often than not. If the microscopic pK a could be predicted with 0.5 pK a units of 878 accuracy, MPSC calculations would be much more reliable.

879
There are multiple factors to consider when deciding which pK a prediction method to utilize. These factors include the 880 accuracy of microscopic and macroscopic pK a values, the accuracy of the number and the identity of ionization states predicted 881 within the experimental pH interval, the accuracy of microstates predicted within the experimental pH interval, the accuracy of 882 tautomeric ratio (i.e., relative free energy between microstates), how costly is the calculation in terms of time and resources, and 883 whether one has access to software licenses that might be required.

884
All of the top-performing empirical methods were developed as commercial software that requires a license to run, and 885 there were not any open-source alternatives for empirical pK a predictions. Since the completion of the blind challenge, two 886 publications reported open-source machine learning-based pK a prediction methods, however, one can only predict the most 887 acidic or most basic macroscopic pK a values of a molecule [54] and the second one is only trained for predicting pK a values of 888 monoprotic molecules [55]. Recently, a pK a prediction methodology was published that describes a mixed approach of semi-889 empirical QM calculations and machine learning that can predict macroscopic pK a s of both mono-and polyprotic species [56].

890
The authors reported RMSE of 0.85 for the retrospective analysis performed on the SAMPL6 dataset.

892
This analysis helped us understand the current state of the field and led to many lessons informing future SAMPL challenges. 893 We believe the greatest benefit can be achieved if further iterations of small molecule pK a prediction challenges can be orga-894 nized, creating motivation for improving protonation state prediction methods for drug-like molecules. In future challenges, it 895 is desirable to increase chemical diversity to cover more common scaffolds [57] and functional groups [58] seen in drug-like 896 molecules, gradually increasing the complexity of molecules.

897
Microscopic pK a measurements are needed for careful benchmarking of pK a predictions for multiprotic molecules.

898
Future challenges should promote stringent evaluation for pK a prediction methods from the perspective of microscopic pK a and 899 microstate predictions. It is necessary to assess the capability of pK a prediction methods to capture the free energy profile of 900 microstates of multiprotic molecules. This is critical because pK a predictions are often utilized to determine relevant protonation 901 states and tautomers of small molecules that must be captured in other physical modeling approaches, such as protein-ligand 902 binding affinity or distribution coefficient predictions. Different tautomers can have different binding affinities and partition 903 coefficients.

904
In this paper, we demonstrated how experimental microstate information can guide the analysis further than the typical pK a 905 evaluation approach that has been used so far. The traditional pK a evaluation approach focuses solely on the numerical error of 906 the pK a values and neglects the difference between macroscopic and microscopic pK a definitions. This is mainly caused by the 907 lack of pK a datasets with microscopic detail. To improve pK a and protonation state predictions for multiprotic molecules, it is 908 necessary to embrace the difference between macroscopic and microscopic pK a definitions and select strategies for experimen-909 tal data collection and prediction evaluation accordingly. In the SAMPL6 Challenge, the analysis was limited by the availability of 910 experimental microscopic data as well. As is usually the case, macroscopic pK a values were abundant (24 molecules) and limited 911 data on microscopic states was available (8 molecules), although the latter opened new avenues for evaluation. For future blind 912 challenges for multiprotic compounds, striving to collect experimental datasets with microscopic pK a s would be very beneficial, 913 despite the high cost of these measurements. Benchmark datasets of microscopic pK a values with assigned microstates are 914 currently missing because experimental determination of these are much more expensive and time-consuming than macro-915 scopic pK a measurements. This limits the ability to improve pK a and tautomer prediction methods for multiprotic molecules.

916
If the collection of experimental microscopic pK a s is not possible due to time and resource costs of such NMR experiments, at 917 least supplementing the more automated macroscopic pK a measurements with NMR-based determination of the dominant mi-918 crostate sequence or tautomeric ratios of each ionization state can create very useful benchmark datasets. This supplementary 919 information can allow microstate-based assignment of experimental to predicted pK a values and a more realistic assessment 920 of method performance.

921
Evaluation strategy for pK a predictions must be determined based on the nature of experimental pK a measure-922 ments available.

923
If the only available experimental data is in the form of macroscopic pK a values, the best way to evaluate computational pre-924 dictions is by calculating predicted macroscopic pK a from microscopic pK a predictions. With the conversion of microscopic pK a 925 to macroscopic pK a s, all structural information about the titration site is lost, and the only remaining information is the total 926 charge of macroscopic ionization states. Unfortunately, most macroscopic pK a measurements-including potentiometric and 927 spectrophotometric methods-do not capture the absolute charge of the macrostates. The spectrophotometric method does 928 not measure charge at all. The potentiometric method can only capture the relative charge changes between macrostates. Only 929 pH-dependent solubility-based pK a estimations can differentiate neutral and charged states from one another. It is, therefore, 930 very common to have experimental datasets of macroscopic pK a without any charge or protonation position information regard-931 ing the macrostates. This causes an issue of assigning predicted and experimental pK a values before any error statistics can be 932 calculated.

933
As delineated by Fraczkiewicz [23], the fairest and most reasonable solution for the pK a matching problem involves an 934 assignment algorithm that preserves the order of predicted and experimental microstates and uses the principle of smallest 935 differences to pair values. We recommend Hungarian matching with a squared-error penalty function. The algorithm is available 936 in SciPy package (scipy.optimize.linear_sum_assignment) [35]. In addition to the analysis of numerical error statistics following 937 Hungarian matching, at the very least, the number of missing and extra pK a predictions must be reported based on unmatched 938 pK a values. Missing or extra pK a predictions point to a problem with capturing the right number of ionization states within 939 the pH interval of the experimental measurements. We have demonstrated that for microscopic pK a predictions, performance 940 analysis based on Hungarian matching results in overly optimistic and misleading results-instead the employed microstate-941 based matching provided a more realistic assessment when microstate data is available.

942
Lessons from the first pK a blind challenge will guide future decisions on challenge rules, prediction reporting for-943 mats, and challenge inputs. 944 We solicited three different submission types in SAMPL6 to capture all the necessary information related to pK a predictions.

945
These were (1) macroscopic pK a values, (2) microscopic pK a values and microstate pair identities, and (3) fractional population 946 of microstates with respect to pH. We realized later that collecting fractional populations of microstates was redundant since 947 microscopic pK a values and microstate pairs capture all the necessary information to construct fractional population vs. pH 948 curves [26]. Only microscopic and macroscopic pK a values were used for the challenge analysis presented in this paper.

949
While exploring ways to evaluate SAMPL6 pK a Challenge results, we developed a better way to capture microscopic pK a 950 predictions, as presented in Gunner et al. [26]. This alternative reporting format consists of reporting the charge and relative 951 free energy of microstates with respect to an arbitrary reference microstate and pH. This approach presents the most concise 952 method of capturing all necessary information regarding microscopic pK a predictions and allows calculation of predicted mi-953 croscopic pK a s, microstate population with respect to pH, macroscopic pK a values, macroscopic population with respect to pH, 954 and tautomer ratios. Still, there may be methods developed to predict macroscopic pK a s directly instead of computing them 955 from microstate predictions that justifies allowing a macroscopic pK a reporting format. In future challenges, we recommend 956 collecting pK a predictions with two submission types: (1) macroscopic pK a values together with the charges of the macrostates 957 and (2) microstates, their total charge, and relative free energies with respect to a specified reference microstate and pH. This 958 approach is being used in SAMPL7.

959
In SAMPL6, we provided an enumerated list of microstates and their assigned microstate IDs because we were worried about 960 parsing submitted microstates in SMILES from different sources correctly. There were two disadvantages to this approach. First, 961 this list of enumerated microstates was used as input by some participants which was not our intention. (Challenge instructions 962 requested that predictions should not rely on these microstate lists and only use them for matching microstate IDs.) Second, 963 the first iteration of enumerated microstates was not complete. We had to add new microstates and assign them microstate 964 IDs for a couple of rounds until reaching a complete list. In future challenges, a better way of handling the problem of capturing 965 predicted microstates would be asking participants to specify the predicted protonation states themselves and assigning iden-966 tifiers after the challenge deadline to aid comparative analysis. This would prevent the partial unblinding of protonation states 967 and allow the assessment of whether methods can predict all the relevant states independently, without relying on a provided 968 list of microstates. Predicted states can be submitted as mol2 files that represent the microstate with explicit hydrogens. The 969 organizers must only provide the microstate that was selected as the reference state for the relative microstate free energy 970 calculations.

971
In the SAMPL6 pK a Challenge, there was not a requirement that participants should report predictions for all compounds.

972
Some participants reported predictions for only a subset of compounds, which may have led these methods to look more 973 accurate than others due to missing predictions. In the future, it will be better to allow submissions of only complete sets for a 974 better comparison of method performance.

975
A wide range of methods participated in the SAMPL6 pK a Challenge-from very fast QSPR methods to QM methods with a 976 high-level of theory and extensive exploration of conformational ensembles. In the future, it would be interesting to capture 977 computing costs in terms of average compute hours per molecule. This can provide guidance to future users of pK a prediction 978 methods for selection of which method to use.

979
It is advantageous to field associated challenges with common set of molecules for different physicochemical prop- October 8th, 2020, follow this principle and include both pK a , log P, and membrane permeability properties for a set of mono-992 protic compounds. We hope that future pK a challenges can focus on multiprotic drug-like compounds with microscopic pK a 993 measurements for an in-depth analysis.

995
The first SAMPL6 pK a Challenge focused on molecules resembling fragments of kinase inhibitors, and was intended to assess 996 the performance of pK a predictions for drug-like molecules. With wide participation, we had an opportunity to prospectively 997 evaluate pK a predictions spanning various empirical and QM based approaches. In addition to community participants, a small 998 number of popular pK a prediction methods that were missing from blind submissions were added as reference calculations 999 after the challenge deadline. multiprotic. For a subset of molecules there was also NMR data to inform the dominant microstate sequence, though micro-1003 scopic pK a measurements were not performed. We conducted a comparative analysis of methods represented in the blind 1004 challenge in terms of both macroscopic and microscopic pK a prediction performance avoiding any assumptions about the inter-1005 pretation of experimental pK a s.

1006
Here, we used Hungarian matching to assign predicted and experimental values for the calculation of accuracy and cor-1007 relation statistics, because the majority of experimental data was macroscopic pK a values. In addition to evaluating error in 1008 predicted pK a values, we also reported the macroscopic pK a errors that were not captured by the match between experimental 1009 and predicted pK a values. These were extra or missing pK a predictions which are important indicators that predictions are failing 1010 to capture the correct ionization states. 1011 We evaluated microscopic pK a predictions utilizing the experimental dominant microstate sequence data of eight molecules.

1012
This experimental data allowed us to use microstate-based matching for evaluating the accuracy of microscopic pK a values 1013 in a more realistic way. We have determined that QM and LFER predictions had lower accuracy in determining the dominant 1014 tautomer of the charged microstates than the neutral states. For both macroscopic and microscopic pK a predictions we have determined methods that were consistently well-performing according to multiple statistical metrics. Focusing on the com-1016 parison of molecules instead of methods for macroscopic pK a prediction accuracy indicated molecules with sulfur-containing heterocycles, iodo, and bromo groups suffered from lower pK a prediction accuracy.
The overall performance of pK a predictions as captured in this challenge is concerning for the application of pK a prediction methods in computer-aided drug design. Many computational methods for predicting target affinities and physicochemical 1020 properties rely on pK a predictions for determining relevant protonation states and the free energy penalty of such states. 1 unit 1021 of pK a error is an optimistic estimate of current macroscopic pK a predictions for drug-like molecules based on SAMPL6 Challenge 1022 where errors in predicting the correct number of ionization states or determining the correct dominant microstate were also 1023 common to many methods. In the absence of other sources of errors, we showed that 1 unit over-or underestimation of the 1024 pK a of a ligand can cause significant errors in the overall binding affinity calculation due to errors in multiple protonation state 1025 correction factor.

1026
The SAMPL6 GitHub Repository contains all information regarding the challenge structure, experimental data, blind predic-1027 tion submission sets, and evaluation of methods. The repository will be useful for future follow up analysis and the experimental 1028 measurements can continue to serve as a benchmark dataset for testing methods.

1029
In this article, we aimed to demonstrate not only the comparative analysis of the pK a prediction performance of contempo-1030 rary methods for drug-like molecules, but also to propose a stringent pK a prediction evaluation strategy that takes into account 1031 differences in microscopic and macroscopic pK a definitions. We hope that this study will guide and motivate further improve-1032 ment of pK a prediction methods.    Figure S2. MAE of macroscopic pK a predictions of each molecule did not show any significant correlation with any molecular descriptor. Plots show regression lines, 95% confidence intervals of the regression lines, and R 2 . The following molecular descriptors were calculated using OpenEye OEMolProp Toolkit [59]: molecular weight, non-terminal rotatable bond count, heteroatom to carbon ratio, maximum ring size, heavy atom count, heteroatom count, ring system count. Microstate count is based on the enumerated microstates for each compounds including additional microstates requested by participants.  Figure S3. The value of macroscopic pK a s was not a factor affecting prediction error seen in SAMPL6 Challenge according to the analysis with Hungarian matching. There was not clear trend between pK a prediction error and the true pK a error. Very high and very low pK a values have similar inaccuracy compared to pK a values close to 7. A Scatter plot of macroscopic pK a prediction error calculated with Hungarian matching vs. experimental pK a value B Box plot of absolute error of macroscopic pK a predictions binned into 2 pK a unit intervals of experimental pK a .
hdiyq nb011  nxaaw  wexjs  nb016  v8qph  ccpmw  0xi4b  t8ewk  ftc8w  eyetm  cywyk  6tvf8  wcvnu  y4wws  ko8yx  ktpj5  arcko  wuuvc  w4z0e  z7fhp  kxztt  xnoe0  8toyp  cm2yq  epvmk  0wfzo  z3btx  2umai  gdqeg  qsicn  4o0ia  nb008  758j8  hgn83   Submission ID   0   20   40   60   80   100 Accuracy of matched microstate pairs (%) Agreement between experimental dominant microstate pairs and predicted microstate pairs by Hungarian matching Figure S4. There was low agreement between experimental dominant microstate pairs and the predicted microstate pairs selected by Hungarian algorithm for microscopic pK a predictions. This analysis could only be performed for 8 molecules with NMR data. Hungarian matching algorithm which matches predicted and experimental values considering only the closeness of the numerical value of pK a and it often leads to predicted pK a matches that described a different microstates pair than the experimentally observed dominant microstates.  Table S3. Evaluation statistics calculated for all microscopic pK a prediction submissions based on Hungarian match for 8 molecules with NMR data. Methods are represented via their SAMPL6 submission IDs which can be cross-referenced with Table 1 for method details. There are eight error metrics reported: the root-mean-squared error (RMSE), mean absolute error (MAE), mean (signed) error (ME), coefficient of determination (R 2 ), linear regression slope (m), Kendall's Rank Correlation Coefficient ( ), unmatched experimental pK a s (number of missing pK a predictions) and unmatched predicted pK a s (number of extra pK a predictions between 2 and 12. This   Table 1 for method details. There are eight error metrics reported: the root-mean-squared error (RMSE), mean absolute error (MAE), mean (signed) error (ME), coefficient of determination (R 2 ), linear regression slope (m), Kendall's Rank Correlation Coefficient ( ), unmatched experimental pK a s (number of missing pK a predictions) and unmatched predicted pK a s (number of extra pK a predictions between 2 and 12. This