## Abstract

The microscopic rate constants that govern an enzymatic reaction can only be directly measured with certain experimental set-ups, such as stopped-flow, quenched-flow, or temperature-jump assays; the majority of enzymology proceeds from steady-state conditions, which yield a set of more easily observable parameters such as *k*_{cat}, *K*_{M}, and observed kinetic isotope effects (^{D}*k*_{cat}). This paper further develops a model from Toney (2013) to estimate microscopic rate constants from steady-state data for a set of reversible, four-step reactions. It uses the Bayesian modeling software Stan, and demonstrates the benefits of Bayesian data analysis in the estimation of these rate constants. In contrast to the optimization methods often employed to estimate kinetic constants, a Bayesian treatment is better equipped to estimate the uncertainty of each parameter: sampling from the posterior distribution using Hamiltonian Monte Carlo immediately gives parameter estimates as the mean or median of the posterior, along with confidence intervals that express the uncertainty of each parameter.

## 1 Introduction

Estimation of the rate constants associated with each step of an enzymatic mechanism is rarely straightforward, due to the complexity of the reactions and the inability to observe each intermediate species during the course of a reaction. The two enzymes under study here are alanine racemase (AR, EC 5.1.1.1), which catalyzes the reversible conversion of l-alanine to d-alanine, and triosephosphate isomerase (TIM, EC 5.3.1.1), which functions in glycolysis to convert dihydroxyacetone phosphate into d-glyceraldehyde 3-phosphate. Both are classified as isomerases, and each takes a single substrate in both the forward and reverse directions. The general reaction scheme for AR and TIM is given in Scheme 1. In order to fully characterize these reactions kinetically, we would like to estimate the rate constants for every step. In addition, if certain rate constants are isotopically sensitive, there will be additional values to estimate. For an enzymatic reaction scheme with four reversible steps, that leaves us with 8 microscopic rate constants to determine. In Scheme 1, *k*_{1} and *k*_{−4} are second-order rate constants, and all others are first order. EZ is an intermediate that reacts rapidly in both directions. The substrates are taken to be l-alanine for AR and dihydroxyacetone phosphate for TIM, though the reactions are reversible. The first-order rate constants *k*_{2} and *k*_{−3} are isotopically sensitive, with primary kinetic isotope effects ^{D}*k*_{2} and ^{D}*k*_{−3}.

Since we cannot directly measure *k*_{1}, *k*_{2}, etc., we must rely on indirect methods of determining those values. Ref. 1, which is the starting point for this work, uses a series of measurements made under steady-state conditions, each of which can be related to the microscopic rate constants mathematically (Eqs. 2–12). By incorporating sufficient experimental data, it is possible in principle to determine each of the microscopic rate constants. In Ref. 1, global fitting is used to extract individual rate constants from steady-state reaction data. Global fitting in this case refers to the use of a target function containing contributions from all of the experimental data, from which a set of parameters consistent with the entire data set is estimated by non-linear regression. The earlier work used standard non-linear optimization algorithms to minimize the relative squared error of a set of data points. The target function used was

*F*(**θ**) = ∑_{i} [(*y*_{i} − *f*_{i}(**θ**))/*y*_{i}]^{2}  (1)

where **θ** is a vector of parameters to be estimated, **y** is a vector of experimental values, and *f*_{i}(**θ**) is the function relating the parameters to the *i*th experimental value. This function leads to minimization of the relative standard deviation (RSD), which is preferred because the experimental values span different orders of magnitude and must be scaled to avoid bias. Ref. 1 showed that convergence was achievable using non-linear optimization, and that the method was reasonably robust. The fact that an optimization algorithm converges on a set of parameter values is not in itself useful unless we have some confidence in those numbers. Ref. 1 wisely uses a method whereby sets of randomly generated values with the same mean and standard deviation as the experimental data are fed into the optimization algorithm, and the parameters are re-calculated for each set, allowing an estimation of parameter uncertainty. Other non-linear methods would employ the Hessian matrix or bootstrapping to the same effect.
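In code, the target of Eq. 1 amounts to a sum of squared relative residuals. The sketch below uses a toy two-parameter model and placeholder values, not the actual kinetic equations of the paper:

```python
def target(theta, y_exp, f):
    """Sum of squared relative residuals between model predictions
    f(theta) and experimental values y_exp (the global-fit target)."""
    return sum(((y - yc) / y) ** 2 for y, yc in zip(y_exp, f(theta)))

# toy model: two observables computed from two parameters (placeholders)
f = lambda th: [th[0] * th[1], th[0] / th[1]]
y_exp = [6.0, 1.5]

target([3.0, 2.0], y_exp, f)   # exact parameters give 0.0
```

An optimizer would then minimize `target` over `theta`; because each residual is scaled by the experimental value, observables of very different magnitudes contribute comparably.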

These methods fall under the rubric of frequentist analysis,^{3,4} which is often faster than, and equally as accurate as, Bayesian methods, given plentiful, high-quality data. However, when the number of parameters to be estimated is nearly equal to the number of data points, as in the current case, Bayesian methods can provide invaluable information about the most likely parameter values, given all available data, and the uncertainty of the estimates of each parameter.

Here I show that a Bayesian modeling^{5} of the same system gives robust and useful estimates of the rate constants and their associated uncertainties. In addition, a Bayesian treatment is able to handle cases of possible experimental error, at the cost of greater uncertainty in the parameter prior distributions.


### 1.1 Incorporation of the Equilibrium Constant

We have here introduced a new data value to improve the estimates: the equilibrium constant *K*_{eq} (Eq. 13). This is equal to the product of the forward rate constants divided by the product of the reverse rate constants, and can be determined experimentally by measuring the concentrations of reactant and product at equilibrium, or indirectly from the forward and reverse *k*_{cat}/*K*_{m} values using the Haldane relationship:^{11,12}

*K*_{eq} = (*k*_{cat,f}/*K*_{m,f}) / (*k*_{cat,r}/*K*_{m,r})
Direct measurement of *K*_{eq} is to be preferred, since the Haldane relationship uses the *k*_{cat} and *K*_{m} values that are already incorporated into the model, and using these values again tends to bias the estimates. For the same reason, the values of (^{D}(*k*_{cat}/*K*_{m}) − 1)/(^{D}*k*_{cat} − 1) used in Ref. 1 are not used here, because they represent re-use of data already incorporated as ^{D}*k*_{cat} and ^{D}(*k*_{cat}/*K*_{m}). However, in some cases *K*_{eq} may be hard to measure directly, and the Haldane relationship can be used (with caution). More reliable estimates of *K*_{eq} may be possible from the Haldane relationship when high-quality data exist for homologues of the enzyme, or for point mutants, because the *K*_{eq} values calculated by the Haldane relationship should in theory be the same for all active versions of an enzyme, as long as the temperature and buffer composition are similar. In this case we can average the values from several sources to obtain a more reliable estimate of *K*_{eq}.

An additional reason for using the value of *K*_{eq} is that the expression contains *k*_{1} and *k*_{−4}, which each only appear in one other equation (for *K*_{m}, forward and reverse). This means that we are dependent on accurate measurement of *K*_{m} to get reliable values for *k*_{1} and *k*_{−4}, in the absence of any further information. For enzymes such as AR, which converts l-alanine to d-alanine, the *K*_{eq} is theoretically exactly 1, since there is no reason l-alanine would have a higher or lower free energy than d-alanine in a mostly achiral aqueous solution. For TIM, there is no direct measurement of the *K*_{eq} available in the literature, possibly because both dihydroxyacetone phosphate and d-glyceraldehyde 3-phosphate are themselves in equilibrium with their catalytically-inactive hydrated forms.^{8} So in order to obtain the *K*_{eq} for the unhydrated forms, I averaged 4 literature values for *K*_{eq}, derived from the Haldane relationship.^{8–10}
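The Haldane calculation and the averaging over sources can be sketched as follows; the *k*_{cat} and *K*_{m} numbers below are illustrative placeholders, not the literature values used in this work:

```python
def haldane_keq(kcat_f, km_f, kcat_r, km_r):
    """Equilibrium constant from steady-state parameters via the
    Haldane relationship: Keq = (kcat_f/Km_f) / (kcat_r/Km_r)."""
    return (kcat_f / km_f) / (kcat_r / km_r)

# hypothetical (kcat_f [1/s], Km_f [M], kcat_r [1/s], Km_r [M]) tuples
# for several enzyme variants measured under similar conditions
sources = [
    (430.0, 9.7e-4, 4300.0, 4.3e-4),
    (500.0, 1.0e-3, 4000.0, 4.0e-4),
    (450.0, 8.0e-4, 5000.0, 5.0e-4),
]
keq_estimate = sum(haldane_keq(*s) for s in sources) / len(sources)
```

Averaging only makes sense if the variants share the same chemical equilibrium, i.e. temperature and buffer composition are comparable across sources.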

## 2 Considerations for Accurate Parameter Estimation

### 2.1 General Limitation of the Bayesian Method

Every form of parameter estimation rests on a set of assumptions about the data and the model; this case is no different. Stan, like other Bayesian modeling software, requires that these assumptions be made explicit. Each parameter needs a prior distribution, which can affect the final result. The form of the model partly determines the results, and an incorrect model will lead to unhelpful results.

### 2.2 Choice of Priors for *k*s

One aspect of a Bayesian analysis that differs from the function-minimization procedures used in Ref. 1 is the requirement to specify a prior distribution for each of the parameters. This information is incorporated into the model according to the modified version of Bayes’ Law:^{13}

*p*(Θ|*D*) ∝ *p*(*D*|Θ) *p*(Θ)

Here, the posterior distribution of the parameters *p*(Θ|*D*), the output of our simulation, is proportional to the product of the likelihood function *p*(*D*|Θ) and the prior distribution for the parameters *p*(Θ). I have chosen an uninformative prior *k* ~ Exponential(*β*) for each of the *k*s, based on the following assumptions:

- The value of *k* is necessarily > 0, and the exponential distribution has the same domain.
- The exponential distribution is often seen in physically relevant phenomena.^{14,15}
- Setting *k* ~ Exponential(*β*) with *β* << 1 gives a broad distribution that covers the region from 1 to 1 × 10^{9}, typical values for microscopic rate constants.
- Nonetheless, the prior is not too restrictive, which is important because we have poor prior information about which values are typical for a rate constant.

This last point is especially important, as too restrictive a prior can end up determining the shape of the posterior distribution in the absence of sufficient experimental data.

The prior for *k*_{2} (and the other *k*s) is implemented as follows in Stan:
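A minimal sketch of what this looks like in Stan (identifier names are my assumptions; only the relevant declarations are shown):

```stan
parameters {
  real<lower=0> beta;   // hyperparameter shared by all rate-constant priors
  real<lower=0> k2;     // microscopic rate constant k2
}
model {
  beta ~ gamma(1, 1);       // hyperprior: relatively uninformative, most mass below 1
  k2 ~ exponential(beta);   // broad prior over positive rate constants
}
```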

Here, we utilize a hyperprior *β*; the prior distribution for *k*_{2} depends on the parameter *β*, which is also estimated over the course of the simulation. This allows a great deal of flexibility while keeping the mathematical form of the priors constant. The hyperprior for *β* is set as *β* ~ Gamma (1, 1), a relatively uninformative prior with most of the mass below 1.

### 2.3 Choice of Priors for Intrinsic KIEs (^{D}*k*_{i})

Kinetic isotope effects are strictly positive quantities, and for the comparison between deuterium and protium the intrinsic KIE of step *i* is

^{D}*k*_{i} = *k*_{i,P}/*k*_{i,D}

where *k*_{i,P} and *k*_{i,D} are the rate constants of the reaction with protiated and deuterated substrate, respectively. Common ranges for primary KIEs are 1.5–3, in the absence of quantum-mechanical tunneling.^{16} Rarely, inverse KIEs are observed, where ^{D}*k*_{i} < 1. Given these constraints, I set the prior as

Figure 8 graphs the prior used for KIEs. Most of the mass lies between 1 and 4, with a long tail toward larger values. I limit the value of KIEs to less than 500, based on the fact that the largest measured enzymatic ^{D}*k*_{i} is around 500.^{17} Any KIE greater than 6 is likely to be due to quantum-mechanical effects, and in cases where this is suspected (e.g. hydride transfer) the prior could be adjusted to reflect the expected range of values.

### 2.4 The Problem Constants – *k*_{1} and *k*_{−4}

In Ref. 1 and here, there are difficulties in accurately determining *k*_{1} and *k*_{−4} for both TIM and AR. Significantly, in Ref. 1 *k*_{1} and *k*_{−4} each appear in only one equation: those for *K*_{m,f} (Eq. 4) and *K*_{m,r} (Eq. 5), respectively.

The intuitive effect of this is that each of the experimental values *besides* *K*_{m,f} and *K*_{m,r} provides only indirect information as to the true values of *k*_{1} and *k*_{−4}, by helping to determine the values of the other parameters. An interesting consequence can be seen in Figure 7, which shows the correlation between parameters as the posterior distribution is explored during the simulation. In row 5, column 1, we see that the values of *k*_{1} and *k*_{−1} are linearly correlated, as are the values of *k*_{4} and *k*_{−4} in row 8, column 4. Looking at the equations for *K*_{m}, we see that this is largely because each contains the factor *k*_{1}/*k*_{−1} or *k*_{4}/*k*_{−4}; since this is the sole place that *k*_{1} and *k*_{−4} appear in this model, ambiguity in *k*_{−1} is passed along to *k*_{1}, and so on. Adding the data for *K*_{eq} does not alter this, as the expression for *K*_{eq} also contains *k*_{1}/*k*_{−1} and *k*_{4}/*k*_{−4}. This tells us that *k*_{1} and *k*_{−4} cannot be considered separately from *k*_{−1} and *k*_{4}; all this model can give us, in the absence of strong prior information about *k*_{1} and *k*_{−4}, is the ratios *k*_{1}/*k*_{−1} and *k*_{4}/*k*_{−4}, i.e. the equilibrium constants for the first and fourth steps. Thus, in my Stan code I have replaced *k*_{1}/*k*_{−1} and *k*_{4}/*k*_{−4}, where they appear, with *K*_{1} and *K*_{4}. This slightly simplifies the calculations, and for reversible reactions such as these it is reasonable to assume that the forward and reverse constants are within three orders of magnitude of each other, so we can limit the values of *K*_{1} and *K*_{4} during the simulation to between 0 and 10^{3}. Indeed, in both TIM and AR the values determined are approximately equal to *K*_{m,f} and 1/*K*_{m,r}, though this is not necessarily true in general, as *K*_{m}s can be greater than, less than, or equal to the association equilibrium constant (e.g. *K*_{1}) in the case of a multi-step reaction.^{18}

## 3 Results and Discussion

### 3.1 Application to Simulated Data

We base our estimates on a set of 12 equations, and we estimate 11 parameters from these data points and their uncertainties, following the general rule of thumb that one can estimate at best *n* − 1 unknown parameters from *n* data points. However, this is only the best case; experimental error and the structure of the model can limit our ability to estimate parameters effectively. The primary difficulty here is one of structural identifiability:^{19} can we, even with ideal data, estimate the parameters given the model we have?

To test the ability of our model to accurately determine rate constants, I simulated a data set with a fixed relative standard deviation (RSD) for all experimental values. I chose values for the *k*s in the range of 10^{3} to 10^{8}, and two isotope effect values in the classical range (1–6). With an RSD of 0.01, representing ideal experimental conditions, the modeled mean values are all within 10% of the true values, and the 90% confidence intervals contain the true values. Repeating this with other simulated values gives equally accurate results. The *R̂* statistic^{13,20} measures the average divergence between MCMC chains during a simulation; for ideal data the value is exactly 1.0, indicating that all the chains in the simulation have converged on the same posterior distribution. I have used 4 independent chains in each analysis. The *R̂* for all of the parameters in this investigation is less than 1.1, as prescribed by Ref. 13.
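As a toy illustration of this diagnostic, a basic (non-split) Gelman–Rubin statistic can be computed as below; production implementations such as Stan's use split chains and rank normalization, so this is a sketch only:

```python
from statistics import mean, variance

def rhat(chains):
    """Basic Gelman-Rubin potential scale reduction factor.
    chains: list of equal-length sample lists, one per MCMC chain.
    Values near 1.0 indicate the chains sample the same distribution."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    W = mean(variance(c) for c in chains)   # within-chain variance
    B = n * variance(chain_means)           # between-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return (var_hat / W) ** 0.5

converged = [list(range(100)) for _ in range(4)]            # identical chains
separated = [list(range(100)),
             [x + 1000 for x in range(100)]] * 2            # chains stuck apart
```

For `converged`, the between-chain variance is zero and the statistic sits at essentially 1; for `separated`, it is far above the 1.1 threshold.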

Increasing the RSD to 0.1, a much more realistic value, shows the model beginning to drift away from the true values, with an increase in uncertainty. Nonetheless, only two of the parameters are off by more than 50%: *k*_{4} and *k*_{−3}. These two parameters are highly correlated, and in the absence of stronger prior information are likely to deviate from their simulated values. Notably, the ratio of *k*_{4} to *k*_{−3} is simulated as 0.5 and the fit gives 0.375, suggesting that an increase in prior information for either *k*_{4} or *k*_{−3} would greatly improve the estimate of both. While Toney validated his model with ideal datasets, his test data did not include experimental error and are analogous to my dataset with RSD = 0.01. The results of the modeling show that the current method is able to accurately determine rate constants under ideal conditions, and that experimental error begins to degrade this at higher levels, as expected. I conclude from this that the model is structurally identifiable, with some parameters, such as the intrinsic isotope effects, determined with better precision than others.

### 3.2 Alanine Racemase

Table 3 shows the output of the Stan modeling for alanine racemase. We see that all of the parameters have converged well, as shown by the *R̂* values lying close to unity. From the table, and from the graphs in Figure 9, the intrinsic KIEs (^{D}*k*_{2} and ^{D}*k*_{−3}) are in good agreement with the analysis in Ref. 1. The intrinsic KIEs for AR are especially well-defined, as shown in Figure 9: the prior and posterior distributions of both ^{D}*k*_{2} and ^{D}*k*_{−3} are shown, and the prior is much broader than the posterior, showing that the experimental data have been instrumental in determining the mean and confidence intervals for the KIEs through the likelihood function.

Others show some disagreement, especially the values of *k*_{3} and *k*_{−2}, which differ by > 10-fold. The reason for this is not entirely clear, but is likely due to the effects of experimental uncertainty, as both in Ref. 1 and in the present work the models are shown to give accurate results with ideal data. With real-world data, the differences between the models and algorithms, and the assumptions behind each, become more important as error increases. In every case where the present results and those of Ref. 1 disagree significantly (> 10-fold), the latter parameters show a great deal of uncertainty. In the case of AR, these are *k*_{3}, *k*_{−2} and *k*_{−3}. For *k*_{3} and *k*_{−2} we only have lower bounds in Ref. 1, and the SD of *k*_{−3} is 4 times the mean.

In the present work, Stan uses the log of the joint likelihood function to estimate the shape and position of the posterior distribution, under the influence of a prior distribution; in Ref. 1 the program instead minimizes a cost function (Eq. 1). Function minimization in the absence of a prior distribution can behave similarly to a Bayesian analysis with a uniform prior on all parameters. In a case where the domain of each parameter is on the order of 10^{12}, for a Bayesian analysis this gives us a prior under which the parameter is nine times as likely to lie in the range 10^{11} to 10^{12} as between 0 and 10^{11}. This is part of the motivation for using the exponential distribution as a prior for all *k*s: to correct this bias towards larger numbers. In minimizing the function, if the experimental data are subject to error, the parameters can vary freely over large ranges during the search and converge to a wide range of values. This search is unbiased by a prior distribution, and so might be preferred as long as the uncertainty in parameter estimates can be contained and there is little prior information. However, in cases where parameters cannot be defined to within even an order of magnitude by function minimization, a Bayesian analysis such as the one shown here should be considered. I also note that the values I estimate for all the parameters are consistent with the experimental data; the experimental and theoretical mean values agree to within 1% in all cases. It may be that, due to experimental uncertainty, more than one set of parameters is consistent with the data (multimodality). In that case, improved experimental data or a more informative prior distribution might be necessary to resolve the problem.
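The nine-to-one figure follows directly from the uniform density, as a quick check shows:

```python
def uniform_mass(lo, hi, a, b):
    """Probability mass a Uniform(lo, hi) prior assigns to [a, b]."""
    return (min(b, hi) - max(a, lo)) / (hi - lo)

# mass in the top decade vs. everything below it, for a Uniform(0, 1e12) prior
upper = uniform_mass(0, 1e12, 1e11, 1e12)   # 0.9
lower = uniform_mass(0, 1e12, 0, 1e11)      # 0.1
```

Nine-tenths of the prior mass sits in the top decade of the range, which is exactly the bias toward large values that the exponential prior corrects.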

### 3.3 Triosephosphate Isomerase

For TIM, similar issues arise as for AR. Reporting a mean and standard deviation implicitly assumes that the uncertainty in each parameter is normally distributed. Looking at the estimates for the TIM dataset, Ref. 1 estimated *k*_{−4} as 5 × 10^{5} with an SD of 20 × 10^{5}; assuming normality, roughly 40% of the probability mass of that estimate lies below zero, where it is impossible for the value to be. Likewise, the optimization results give wide ranges for the confidence intervals of other parameters.
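The fraction of a normal distribution falling below zero follows from the standard normal CDF; a quick check with the reported mean and SD:

```python
from math import erf, sqrt

def fraction_below_zero(mu, sd):
    """Mass of a Normal(mu, sd) distribution lying below zero,
    via the standard normal CDF evaluated at z = -mu/sd."""
    z = (0.0 - mu) / sd
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

fraction_below_zero(5e5, 20e5)   # roughly 0.40
```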

Nickbarg and Knowles^{9} also calculated the ratios of the forward and reverse rate constants for yeast TIM. Table 5 compares the results of the current study with those of Ref. 1 and Nickbarg and Knowles (1988). All three agree on the ratios of the forward and reverse constants for the first and fourth steps (*k*_{1}/*k*_{−1} and *k*_{4}/*k*_{−4}). The current paper and Nickbarg and Knowles (1988) give essentially identical estimates for *k*_{2}/*k*_{−2} and *k*_{3}/*k*_{−3}, with Ref. 1 differing by ≈ 10^{3} in both cases. At stake is the question of whether the complex of TIM with the enediol intermediate (EZ in Scheme 1) is significantly higher in energy than the other enzyme forms. Higher energy would destabilize EZ, leading to higher *k*_{−2} and *k*_{3} values, and would therefore drive *k*_{2}/*k*_{−2} toward zero and *k*_{3}/*k*_{−3} well above one. While it is not my intention to wade into this debate, the results presented here are not consistent with a high-energy intermediate.

## 4 Methods

### 4.1 Incorporation of Experimental Error

The data^{1} are expressed as means and standard deviations, as is commonly done in biochemical studies. The data are incorporated into the model as follows:

*y*_{exp} ~ Normal(*μ*, *σ*_{exp})

This is represented in Stan as follows, for *k*_{cat,f} (ignoring all other data values):
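In sketch form (variable names are my assumptions rather than the published model file):

```stan
data {
  real kcat_f_mean;             // reported experimental mean for kcat,f
  real<lower=0> kcat_f_sd;      // reported experimental standard deviation
}
parameters {
  real<lower=0> mu_kcat_f;      // underlying "true" value of kcat,f
}
model {
  // the reported mean is modeled as a draw centered on the true value,
  // with the reported SD supplying the measurement uncertainty
  kcat_f_mean ~ normal(mu_kcat_f, kcat_f_sd);
}
```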

Here, the true value of the experimentally determined parameter (e.g. *k*_{cat}) is assumed to be drawn from a normal distribution with mean *μ*, with the standard deviation set equal to the experimentally determined uncertainty in the value. The model then incorporates the reported mean as data, and estimates its true value *μ* as well, based on the global fitting. While one could use a distribution other than the normal to model the error, most published results use models that assume normally distributed error, so when incorporating results from others it is important to follow this assumption. Ideally, the model would incorporate the raw data instead and proceed from there to estimates of the rate constants; however, the raw data are not available, as is often the case with biochemical data. Nonetheless, it is still possible to obtain estimates of the rate constants from published results.

### 4.2 MCMC Analysis

To estimate the posterior distribution of each of the parameters, I used `cmdstanr` 0.3.0 running under R 3.5, which is built on `cmdstan` 2.26.1.^{21} For each analysis, 5000 iterations of the sampler were run on 4 parallel chains. The first 4000 iterations of each chain were ‘warm-up’ samples, during which the parameters and step sizes are tuned by Stan’s NUTS algorithm. This value, higher than the default of 1000 warm-up samples, was necessary to ensure that the sampling distribution was stable, but had minimal effect on the runtime of the program. Runtimes on macOS using a 3 GHz quad-core processor and 4 parallel chains ranged from 2 to 60 s, with no diagnostic errors after sampling. Figures were generated using `ggplot2` and the `bayesplot` package, except Figure 8, which was plotted with `gnuplot`.

## Supporting Information Available

The following files are available free of charge.

simulate.stan: Stan model file

enrg.R: R script file to process Toney (2013) data.

simulate.R: R script file to simulate and process data.

## Acknowledgement

This work was partially supported by a Summer Research Grant from Dominican University of California.

## Footnotes

† This article is a preprint, and has not yet undergone peer review.

Misspellings have been fixed, and experimental values included in a table.