Abstract
The microscopic rate constants that govern an enzymatic reaction can only be measured directly under certain experimental set-ups, such as stopped-flow, continuous-flow, or temperature-jump assays; the majority of enzymology proceeds under steady-state conditions, which yield a set of easily observable parameters such as kcat, KM, and observed kinetic isotope effects (D kcat). This paper further develops a model from Toney (2013) to estimate microscopic rate constants from steady-state data for a set of reversible, four-step reactions. It uses the Bayesian modeling software Stan, and demonstrates the benefits of Bayesian data analysis in the estimation of these rate constants. In contrast to the optimization methods often employed in the estimation of kinetic constants, a Bayesian treatment is better equipped to estimate the uncertainties of each parameter; sampling from the posterior distribution using Hamiltonian Monte Carlo immediately gives parameter estimates as the mean or median of the posterior, along with confidence intervals that express the uncertainty of each parameter.
1 Introduction
The two enzymes under study here are alanine racemase (AR, EC 5.1.1.1), which catalyzes the reversible conversion of l-alanine to d-alanine, and triosephosphate isomerase (TIM, EC 5.3.1.1), which functions in glycolysis to convert dihydroxyacetone phosphate into d-glyceraldehyde 3-phosphate. Both are classified as isomerases, and both take a single substrate in the forward and reverse directions. The general reaction scheme for AR and TIM is given in Scheme 1. To fully characterize these reactions kinetically, we would like to estimate the rate constants for every step. In addition, if certain rate constants are isotopically sensitive, there will be additional values to estimate. An enzymatic reaction scheme with four reversible steps leaves us with 8 microscopic rate constants to determine. In Scheme 1, k1 and k−4 are second-order rate constants, and all others are first order. EZ is an intermediate that reacts rapidly in both directions. The substrates are taken to be l-alanine for AR and dihydroxyacetone phosphate for TIM, though the reactions are reversible. The first-order rate constants k2 and k−3 are isotopically sensitive, with primary kinetic isotope effects D k2 and D k−3.
The reaction scheme for a reversible, four-step reaction.
Since we cannot directly measure k1, k2, etc., we must rely on indirect methods of determining those values. Ref. 1, which is the starting point for this work, uses a series of measurements made under steady-state conditions, each of which can be related to the microscopic rate constants mathematically (Eqs. (2)–(12)). By incorporating sufficient experimental data, it is possible in principle to determine each of the microscopic rate constants. In Ref. 1, global fitting is used to extract individual rate constants from steady-state reaction data. Global fitting in this case refers to the use of a target function containing contributions from all of the experimental data, from which a set of parameters consistent with the entire data set is estimated through non-linear regression. The earlier work used standard non-linear optimization algorithms to minimize the relative squared error of a set of data points. The target function used was

E(θ) = Σᵢ [(fᵢ(θ) − Dᵢ) / Dᵢ]²    (1)

where θ is a vector of parameters to be estimated, D is a vector of experimental values, and fᵢ(θ) is the function relating the parameters to the i-th experimental value. This function leads to a minimization of the relative standard deviation (RSD), which is preferred because the experimental values span different orders of magnitude and must be scaled to avoid bias. Ref. 1 showed that convergence was achievable using non-linear optimization, and that the method was reasonably robust. The fact that an optimization algorithm converges on a set of parameter values is not in itself useful unless we have some confidence in those numbers. Ref. 1 wisely uses a method whereby sets of randomly generated values with the same mean and standard deviation as the experimental data are fed into the optimization algorithm, and parameters are re-calculated for each set, allowing an estimation of parameter uncertainty. Other non-linear methods employ the Hessian matrix or bootstrapping to the same effect. 3,4 These methods fall under the rubric of frequentist analysis, which is often faster than, and equally as accurate as, Bayesian methods, given plentiful, high-quality data. However, when the number of parameters to be estimated is nearly equal to the number of data points, as in the current case, Bayesian methods can provide invaluable information about the most likely parameter values, given all available data, and the uncertainty of the estimates of each parameter. 5 Here I show that a Bayesian model of the same system gives robust and useful estimates of the rate constants and their associated uncertainties. In addition, a Bayesian treatment can handle cases of possible experimental error, at the cost of greater uncertainty in the parameter prior distributions. 6
1.1 Incorporation of the Equilibrium Constant
We have here introduced a new data value to improve the estimates: the equilibrium constant Keq (Eq. 13). This is equal to the product of the forward rate constants divided by the product of the reverse rate constants, and can be determined experimentally by measuring the concentrations of reactant and product at equilibrium, or indirectly from the forward and reverse kcat/Km values using the Haldane relationship: 11,12

Keq = (kcat/Km)f / (kcat/Km)r
Direct measurement of Keq is to be preferred, since the Haldane relationship uses the kcat and Km values that are already incorporated into the model, and using these values again tends to bias the estimates. For the same reason, the values of (D (kcat/Km) − 1) / (D kcat − 1) used in Ref. 1 are not used here, because they represent re-use of data that is already incorporated as D kcat and D (kcat/Km). However, in some cases Keq may be hard to measure directly, and the Haldane relationship may be used (with caution). More reliable estimates of Keq may be possible via the Haldane relationship when high-quality data exist for homologues of the enzyme, or for point mutants, because the Keq values calculated by the Haldane relationship should theoretically be the same for all active versions of an enzyme, as long as the temperature and buffer composition are similar. In that case we can average the values from several sources to obtain a more reliable estimate of Keq.
An additional reason for using the value of Keq is that the expression contains k1 and k−4, which each appear in only one other equation (those for the forward and reverse Km). This means that, in the absence of any further information, we are dependent on accurate measurement of Km to get reliable values for k1 and k−4. For enzymes such as AR, which interconverts l- and d-alanine, the Keq is theoretically exactly 1, since there is no reason for l-alanine to have a higher or lower free energy than d-alanine in a mostly achiral aqueous solution. For TIM, no direct measurement of the Keq is available in the literature, possibly because both dihydroxyacetone phosphate and d-glyceraldehyde 3-phosphate are themselves in equilibrium with their catalytically inactive hydrated forms. 8 So, to obtain the Keq for the unhydrated forms, I averaged four literature values for Keq derived from the Haldane relationship. 8–10
2 Considerations for Accurate Parameter Estimation
2.1 General Limitation of the Bayesian Method
Every form of parameter estimation rests on a set of assumptions about the data and a model; this case is no different. Stan, like other Bayesian modeling software, requires that these assumptions be made explicit. Each parameter needs a prior distribution, which can affect the final result. The form of the model partly determines the results, and an incorrect model will lead to unhelpful results.
2.2 Choice of Priors for ks
One aspect of a Bayesian analysis that differs from the function-minimization procedures used in Ref. 1 is the requirement to specify a prior distribution for each of the parameters. This information is incorporated into the model according to the modified version of Bayes' Law: 13

p(Θ|D) ∝ p(D|Θ) p(Θ)
Here, the posterior distribution of the parameters p(Θ|D), the output of our simulation, is proportional to the product of the likelihood function p(D|Θ) and the prior distribution of the parameters p(Θ). I have chosen an uninformative prior k ∼ Exponential(β) for each of the ks, based on the following assumptions:
- The value of k is necessarily > 0, and the exponential distribution has the same domain.
- The exponential distribution is often seen in physically relevant phenomena. 14,15
- Setting k ∼ Exponential(β) with β ≪ 1 gives a broad distribution that covers the region from 1 to 1 × 10⁹, typical values for microscopic rate constants.
- At the same time, the prior is not too restrictive, which is appropriate because we have poor prior information about which values are typical for a rate constant.
This last point is especially important, as too restrictive a prior can end up determining the shape of the posterior distribution in the absence of sufficient experimental data.
Another justification may be made as follows: according to the Eyring equation, the relationship between the rate constant k and the free energy of activation ΔG‡ is

k = (kB T / h) e^(−ΔG‡ / RT)
ΔG‡ is defined as positive, and can be found in the interval [0, ∞). Its variance is expected to be finite, as there will be physical limits to its upper and lower values. Therefore, the Maximum Entropy distribution for ΔG‡, subject to these constraints, is the Uniform distribution. If ΔG‡ is uniformly distributed, then k itself is exponentially distributed under the conditions of Maximum Entropy.
The prior for k2 (and the other ks) is implemented as follows in Stan:
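The actual model file is supplied in the Supporting Information (simulate.stan); a minimal sketch of the prior structure described here, with illustrative variable names, might look like:

```stan
parameters {
  real<lower=0> beta;   // hyperparameter shared by the rate-constant priors
  real<lower=0> k2;     // one of the microscopic rate constants
}
model {
  beta ~ gamma(1, 1);       // hyperprior on beta: most mass below 1
  k2 ~ exponential(beta);   // broad, strictly positive prior for k2
}
```

The `<lower=0>` constraints enforce positivity at the sampler level, so the exponential prior never receives an out-of-domain value.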
Here, we utilize a hyperprior: the prior distribution for k2 depends on the parameter β, which is itself estimated over the course of the simulation. This allows a great deal of flexibility while keeping the mathematical form of the priors constant. The hyperprior for β is set as β ∼ Gamma(1, 1), a relatively uninformative prior with most of its mass below 1.
2.3 Choice of Priors for Intrinsic KIEs (Dki)
Kinetic isotope effects are strictly positive quantities, and for the comparison between deuterium and protium the intrinsic KIE of step i is

D ki = ki,P / ki,D

where ki,P and ki,D are the rate constants of the reaction with protiated and deuterated substrate, respectively. Common ranges for primary KIEs are 1.5–3, in the absence of quantum-mechanical tunneling. 16 Rarely, inverse KIEs are observed, where D ki < 1. Given these constraints, I set the prior as shown in Figure 8.
Figure 8 graphs the prior used for KIEs. We see that most of the mass is between 1 and 4, but the density extends to infinity in the positive direction. I limit the value of KIEs to less than 500, based on the fact that the largest measured enzymatic D kcat is around 500. 17 Any KIE greater than 6 is likely to be due to quantum mechanical effects, and in cases where this is suspected (e.g. hydride transfer) the prior could be adjusted to reflect the expected ranges of values.
2.4 The Problem Constants – k1 and k−4
In Ref. 1 and here, there are difficulties in accurately determining k1 and k−4 for both TIM and AR. Significantly, in Ref. 1, k1 and k−4 each appear in only one equation: that for Km,f (Eq. 4) or Km,r (Eq. 5), respectively.
The intuitive effect of this is that the experimental values besides Km,f and Km,r provide only indirect information about the true values of k1 and k−4, by helping to determine the values of the other parameters. An interesting consequence of this can be seen in Figure 7, which shows the correlations between parameters as the posterior distribution is explored over the course of the simulation. In row 5, column 1, we see that the values of k1 and k−1 are linearly correlated, as are the values of k4 and k−4 in row 8, column 4. Looking at the equations for Km, we see that this is largely because each contains the factor k1/k−1 or k4/k−4, and since this is the sole place that k1 and k−4 appear in this model, ambiguity in k−1 is passed along to k1, and likewise for k4 and k−4. Adding the data for Keq does not alter this, as the expression for Keq also contains k1/k−1 and k4/k−4. This tells us that k1 and k−4 cannot be considered separately from k−1 and k4; all this model can give us, in the absence of strong prior information about k1 and k−4, is the ratios k1/k−1 and k4/k−4, i.e. the equilibrium constants for the first and fourth steps. Thus in my Stan code I have replaced k1/k−1 and k4/k−4, where they appear, with K1 and K4. This slightly simplifies the calculations, and for reversible reactions such as these it is reasonable to assume that the forward and reverse constants are within three orders of magnitude of each other, so we can limit the values of K1 and K4 during the simulation to between 0 and 10³. Indeed, in both TIM and AR the values determined are approximately equal to Km,f and 1/Km,r, though this is not necessarily true in general, as a Km can be greater than, less than, or equal to the association equilibrium constant (e.g. K1) in the case of a multi-step reaction. 18
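The substitution and bounds described above can be sketched in a Stan parameters block as follows (variable names are illustrative; the actual model file is in the Supporting Information):

```stan
parameters {
  // Equilibrium constants replacing the individually
  // unidentifiable pairs k1/k-1 and k4/k-4, bounded as
  // described in the text.
  real<lower=0, upper=1000> K1;  // K1 = k1 / k-1
  real<lower=0, upper=1000> K4;  // K4 = k4 / k-4
}
```

Declaring the ratios directly, rather than the four underlying constants, removes the linear correlations seen in Figure 7 from the sampled parameter space.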
Equations used in the analysis of data for TIM, AR and the simulated data. Eqs. (2)–(12) are used in Ref. 1, and Eq. (13) is included in this analysis.
Traceplots of the modeled rate constants for the simulated data set, with experimental SD set to 0.01 for all quantities. The y-axis is the parameter value at each draw, and the x-axis is the sample number, post warm-up.
Pairwise comparison of the MCMC Draws for simulated rate constants, showing correlations between parameters.
Traceplots of the rate constants for TIM.
Pairwise comparison of the MCMC Draws for TIM rate constants, showing correlation between k1 and k−1
Traceplots of the rate constants for AR.
Pairwise comparison of the MCMC Draws for AR rate constants, showing correlation between k1 and k−1
The prior distribution used for intrinsic KIEs (Dki).
3 Results and discussion
3.1 Application to Simulated Data
We base our estimates on a set of 12 equations, and we estimate 11 parameters from these data points and their uncertainties, following the general rule of thumb that one can estimate at best n − 1 unknown parameters from n data points. However, this is only the best case; experimental error and the structure of the model can limit our ability to estimate parameters effectively. The primary difficulty here is one of structural identifiability: 19 can we, even with ideal data, estimate the parameters given the model we have?
To test the ability of our model to accurately determine rate constants, I simulated a data set with a fixed relative standard deviation (RSD) for all experimental values. I chose values for the ks in the range of 10³ to 10⁸, and two isotope-effect values in the classical range (1–6). With an RSD of 0.01, representing ideal experimental conditions, the modeled mean values are all within 10% of the true values, and the 90% confidence intervals contain the true values. Repeating this with other simulated values gives equally accurate results. The R̂ statistic 13,20 measures the divergence between MCMC chains during a simulation; when the chains have fully converged on the same posterior distribution, its value is 1.0. I have used 4 independent chains in each analysis. The R̂ for all of the parameters in this investigation is less than 1.1, as prescribed by Ref. 13.
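For reference, one common form of this convergence statistic, for m chains of n post-warm-up draws each, with W the average within-chain variance and B the between-chain variance (as defined in Ref. 20; Stan reports a split-chain refinement of this quantity), is:

```latex
% Classical Gelman–Rubin statistic:
% W = mean within-chain variance, B = between-chain variance.
\hat{R} = \sqrt{\frac{\widehat{\operatorname{var}}^{+}}{W}},
\qquad
\widehat{\operatorname{var}}^{+} = \frac{n-1}{n}\,W + \frac{1}{n}\,B
```

As the chains mix, B shrinks relative to W and R̂ approaches 1 from above, which is why values near unity are taken as evidence of convergence.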
Increasing the RSD to 0.1, a much more realistic value, shows the model beginning to drift away from the true values, along with an increase in uncertainty. Nonetheless, only two of the parameters are off by more than 50%: k4 and k−3. These two parameters are highly correlated, and in the absence of stronger prior information they are likely to deviate from their simulated values. Notably, the ratio of k4 to k−3 is simulated as 0.5 while the fit shows 0.375, suggesting that an increase in prior information for either k4 or k−3 would greatly improve the estimates of both. While Toney validated his model with ideal datasets, his test data did not include experimental error and are analogous to my dataset with RSD = 0.01. The results of the modeling show that the current method can accurately determine rate constants under ideal conditions, and that experimental error begins to degrade this at higher levels, as expected. I conclude from this that the model is structurally identifiable, with some parameters, such as the intrinsic isotope effects, determined with better precision than others.
3.2 Alanine Racemase
Table 3 shows the output of the Stan modeling for Alanine Racemase. We see that all of the parameters have converged well, as shown by the R̂ values lying close to unity. From the table, and from the graphs in Figure 9, the intrinsic KIEs (D k2 and D k−3) are in good agreement with the analysis from Ref. 1. The intrinsic KIEs for AR are especially well defined, as shown by Figure 9. The prior and posterior distributions of both D k2 and D k−3 are shown; the prior is much broader than the posterior, showing that the experimental data have been instrumental, through the likelihood function, in determining the mean and confidence intervals for the KIEs. Other parameters show some disagreement, especially the values of k3 and k−2, which differ by > 10-fold. The reason for this is not entirely clear, but it is likely due to the effects of experimental uncertainty, as both in Ref. 1 and in the present work the models are shown to give accurate results with ideal data. With real-world data, the differences between the models and algorithms, and the assumptions behind each, become more important as error increases. In every case where the present results and those of Ref. 1 disagree significantly (> 10-fold), the latter parameters show a great deal of uncertainty. In the case of AR, these are k3, k−2 and k−3. For k3 and k−2 we only have lower bounds in Ref. 1, and the SD of k−3 is 4 times the mean. In the present work, Stan uses the log of the joint likelihood function to estimate the shape and position of the posterior distribution, under the influence of a prior distribution; in Ref. 1 the program is trying to minimize a cost function (Eq. 1). Function minimization in the absence of a prior distribution can behave similarly to a Bayesian analysis with a Uniform prior on all parameters.
In a case where the domain of each parameter is on the order of 10¹², such a flat prior implies the parameter is nine times as likely to lie in the range 10¹¹ to 10¹² as between 0 and 10¹¹. This is part of the motivation for using the exponential distribution as the prior for all ks: to correct this bias towards larger numbers. In function minimization, if the experimental data are subject to error, the parameters can vary freely over large ranges during the search and converge to a wide range of values. This search is unbiased by a prior distribution, and so might be preferred as long as the uncertainty in the parameter estimates can be contained and there is little prior information. However, in cases where parameters cannot be defined to within even an order of magnitude by function minimization, a Bayesian analysis such as the one shown here should be considered. I also note that the values I estimate for all the parameters are consistent with the experimental data; the experimental means and the theoretical mean values agree to within 1% in all cases. It may be that, due to experimental uncertainty, more than one set of parameters is consistent with the data (multimodality). In that case, improved experimental data or a more informative prior distribution might be necessary to resolve the problem.
Experimental Values used to estimate Rate Constants
Statistical summary of the Stan output for the Simulated Data. R̂ is the Gelman–Rubin statistic. 20 90% CI is the 90% confidence interval for the posterior of each parameter.
Statistical summary of the Stan output for the AR Data. R̂ is the Gelman–Rubin statistic. 20 90% CI is the 90% confidence interval for the posterior of each parameter.
Posterior and prior distributions for the intrinsic KIEs (D ki) of TIM and AR. The prior distribution is shown in purple, and the posterior is shown in blue, filled.
3.3 Triosephosphate Isomerase
For TIM, similar issues arise as for AR. There is an assumption that the uncertainties in each parameter are normally distributed. Looking at the estimates for the TIM dataset, Ref. 1 estimated k−4 as 5 × 10⁵ with an SD of 20 × 10⁵; if one assumes normality, a full 40% of the confidence interval lies below zero, where it is impossible for the value to be. Likewise, the optimization results give wide ranges for the confidence intervals of other parameters.
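The 40% figure follows directly from the normality assumption; treating the reported mean and SD as the parameters of a Normal distribution gives

```latex
% Fraction of a Normal(5e5, SD = 20e5) distribution lying below zero:
P(k_{-4} < 0)
  = \Phi\!\left(\frac{0 - 5\times 10^{5}}{20\times 10^{5}}\right)
  = \Phi(-0.25) \approx 0.40
```

where Φ is the standard normal cumulative distribution function.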
Nickbarg and Knowles 9 also calculated the ratios of the forward and reverse rate constants for yeast TIM. Table 5 compares the results from the current study with those of Ref. 1 and Nickbarg and Knowles (1988). There is agreement among the three as to the ratios of the forward and reverse constants for the first and fourth steps (k1/k−1 and k4/k−4). The current paper and Nickbarg and Knowles (1988) give essentially identical estimates for k2/k−2 and k3/k−3, with Ref. 1 differing by ≈ 10³ in both cases. At stake is the question of whether the complex of TIM with the enediol intermediate (EZ in Scheme 1) is significantly higher in energy than the other enzyme forms. A higher energy would destabilize EZ, leading to higher k−2 and k3 values, and would therefore drive k2/k−2 towards zero and k3/k−3 much greater than one. While it is not my intention to wade into this debate, the results presented here are not consistent with a high-energy intermediate.
Statistical summary of the Stan output for the TIM Data. R̂ is the Gelman–Rubin statistic. 20 90% CI is the 90% confidence interval for the posterior of each parameter.
Comparison of forward and reverse rate-constant ratios from the present work, Nickbarg and Knowles (1988), and Toney (2013). Results in the two rightmost columns are given as Mean (SD).
4 Methods
4.1 Incorporation of Experimental Error
The data 1 are expressed as mean and standard deviation, as is commonly done in biochemical studies. The data are incorporated into the model as follows:

Dᵢ ∼ Normal(µᵢ, σᵢ)

where Dᵢ is the reported mean, σᵢ the reported standard deviation, and µᵢ the underlying true value to be estimated.
This is represented in Stan as follows, for kcat,f (ignoring all other data values):
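The actual model file is in the Supporting Information (simulate.stan); a self-contained sketch of this measurement model, with illustrative variable names, might look like:

```stan
data {
  real kcat_f;              // reported experimental mean of kcat,f
  real<lower=0> kcat_f_sd;  // reported experimental SD of kcat,f
}
parameters {
  real<lower=0> mu;         // underlying "true" value of kcat,f
}
model {
  // The reported mean is treated as a draw from a Normal
  // distribution centered on the true value, with SD fixed
  // to the experimentally determined uncertainty.
  kcat_f ~ normal(mu, kcat_f_sd);
}
```

In the full model, mu would be replaced by the steady-state expression for kcat,f in terms of the microscopic rate constants, tying the measurement to the parameters of interest.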
Here, the experimentally determined parameter (e.g. kcat) is assumed to be drawn from a distribution whose mean µ is the underlying true value and whose standard deviation is set equal to the experimentally determined uncertainty. The model then incorporates the reported mean as data and estimates the true value µ through the global fit. While one could use a distribution other than Normal to model the error, most published results assume Normally distributed error, so when incorporating results from others it is important to follow this assumption. Ideally, the model would incorporate the raw data and proceed from there to estimates of the rate constants; however, the raw data are not available, as is often the case with biochemical data. Nonetheless, it is still possible to obtain estimates of the rate constants from published results.
4.2 MCMC Analysis
To estimate the posterior distribution of each of the eight parameters, I used cmdstanr 0.3.0 running under R 3.5, built on cmdstan 2.26.1. 21 For each analysis, 5000 iterations of the sampler were run on 4 parallel chains. The first 4000 iterations of each chain were 'warm-up' samples, during which the parameters and step sizes are tuned by Stan's NUTS algorithm. This value, higher than the default of 1000 warm-up samples, was necessary to ensure that the sampling distribution was stable, but had minimal effect on the runtime of the program. Runtimes on macOS using a 3 GHz quad-core processor and 4 parallel chains ranged from 2 to 60 s, without diagnostic errors after sampling. Figures were generated using ggplot2 and the bayesplot package, except Figure 8, which was plotted with gnuplot.
Supporting Information Available
The following files are available free of charge.
simulate.stan: Stan model file
enrg.R: R script file to process Toney (2013) data.
simulate.R: R script file to simulate and process data.
Acknowledgement
This work was partially supported by a Summer Research Grant from Dominican University of California.
Footnotes
* E-mail: ian.barr@dominican.edu, Phone: +1 (415) 257 1346
† This article is a preprint, and has not yet undergone peer review.