Model balancing: in search of consistent metabolic states and in-vivo kinetic constants

Enzyme kinetic constants in vivo are largely unknown, which limits the construction of large metabolic models. While model fitting, in principle, aims at fitting kinetic constants to measured metabolic fluxes, metabolite concentrations, and enzyme concentrations, the resulting estimation problems are typically non-convex and hard to solve, especially if models are large. Here we assume that metabolic fluxes are known and show how consistent kinetic constants, metabolite concentrations, and enzyme concentrations can be determined simultaneously from data. If one specific term is omitted – a term that penalises small enzyme concentrations – we obtain a convex optimality problem with a unique local optimum. The estimation method with or without this term, called model balancing, applies to models with a wide range of rate laws and accounts for thermodynamic constraints on kinetic constants and metabolite concentrations through thermodynamic forces. It can be used to estimate in-vivo kinetic constants from omics data, to complete and adjust available data, or to construct plausible metabolic states with a predefined flux distribution. As a demonstrative case, we balance a model of E. coli central metabolism with artificial or experimental data. The tests show what information about kinetic constants can be obtained from omics data, and reveal the practical limits of estimating in-vivo kinetic constants.


Introduction
As the number of metabolic network reconstructions is constantly growing, so is the desire to convert metabolic networks into kinetic models by integration of omics data [1,2,3,4,5].To build models with plausible parameters and metabolic states (characterised by enzyme concentrations, metabolite concentrations, and fluxes), one needs to reconstruct the metabolic network, add allosteric regulation arrows, choose the enzymatic rate laws, determine the kinetic constants, and construct plausible metabolic states.All these subproblems have been addressed in one way or another.Pathway models have been built from in-vitro enzyme kinetics [6,7].To simplify model construction and to deal with unknown rate laws, modellers proposed standardised rate laws [1,8], used them to generate models automatically [9], and evaluated them for their practical use [10].In-vitro kinetic constants, available from the Brenda database [11,12], are widely used, and unknown k cat values have been estimated by machine learning [13].Inserting measured or sampled kinetic constants directly into models can lead to inconsistencies, given that thermodynamics implies dependencies between kinetic constants via Wegscheider conditions and Haldane relationships.To address this problem, methods to construct thermodynamically consistent parameter sets have been devised [14,15,16,17,18] and applied in modelling [5].In parallel, there have been attempts to estimate in-vivo kinetic constants [19] from flux, metabolite, and enzyme data [20,21].Methods for parameter fitting have been developed and benchmarked [22,23], and the question of parameter identifiability has been addressed [24].Large models have been parameterised [2,25], and workflows for model parameterization have been developed [26,27].Finally, even if parameters are unknown, parameter sampling and ensemble modelling [28,29,30,31,32] allow to find plausible parameter sets [33] and to draw conclusions about dynamic behaviour independently of specific parameter values [34].
Here we address the problem of finding realistic, consistent values of kinetic constants.The values in vivo are hard to measure, and the in-vitro values can be unreliable .To obtain in vivo kinetic constants, we consider a kinetic metabolic model that contains these constants as model paramaters and fit the model to measured omics data.
In an ideal case in which enzyme concentrations, metabolite concentrations and fluxes are precisely known for a sufficient number of metabolic states, we can directly solve for the kinetic constants [21].Likewise, in an ideal case in which kinetic constants and enzyme concentrations are already known precisely, metabolite concentrations and fluxes could be computed by simulating the model.In reality, we are in between these two cases: data of different types may be available, but all of them are incomplete and noisy, and cannot be used directly in models.
For a complete picture and reliable estimates, these heterogeneous data must be combined -but how?In practice, this depends on our aims: in model construction we may have different aims, for example (i) finding consistent kinetic constants in plausible ranges; (ii) adjusting and completing a data set of measured kinetic constants to obtain consistent model parameters; (iii) estimating in-vivo kinetic constants from omics data (measured enzyme concentrations, metabolite concentrations, and fluxes); (iv) completing and adjusting omics data for consistent metabolic states, which may involve predictions of physilogical metabolite concentrations [35,36] or predictions of metabolite and enzyme concentrations based on resource allocation principles [37,38,39].
Most generally, our task is not only to fit kinetic constants, but also to reconstruct consistent metabolite concentrations, enzyme concentration, and metabolic fluxes, based on available data for all these quantities.Since the data will be uncertain, inconsistent and incomplete, the uncertainties in data and estimated state variables and kinetic constants need to be quantified.Luckily, we can reduce these uncertainties by imposing further constraints by physical laws, including thermodynamic relations (Wegscheider conditions and Haldane relationships) between kinetic constants [1,8] and relations between kinetic constants and metabolic variables (e.g. the fact that flux directions follow from equilibrium constants and metabolite concentrations).In addition, we can use prior distributions (for kinetic constants, metabolite concentrations, or enzyme concentrations); and we can use measured (in vitro) parameter values as additional data (of course, these data cannot be used as test data anymore).However, one problem remains.The resulting estimation problems typically lead to non-convex optimality problems with multiple local optima: global optima may be hard to find especially if networks are large.
Here we address, as a general problem, the estimation of consistent kinetic constants and metabolic states based on heterogenous (kinetic, metabolomics, and proteomics) data, assuming that the metabolic fluxes are given (from measurements or previous calculations) and thermodynamically consistent.We show that parameter fitting in kinetic metabolic models can be formulated, with the exception of one term that penalises small enzyme concentrations, as a convex optimality problem.The resulting estimation method, called model balancing, uses the following input data: measured or assumed values of kinetic constants (which may be incomplete and uncertain), measured or assumed values of metabolite and enzyme concentrations in a number of metabolic states (which may be incomplete and uncertain), and metabolic fluxes from the same metabolic states (which may be stationary as in FBA, or non-stationary as in dynamic simulation of a kinetic model 1 ).From these data it determines a set of kinetic parameters and state variables (metabolite and enzyme concentrations for all metabolic states in question) that are consistent with the rate laws and all other dependencies in the model, plausible (i.e.respecting constraints and prior distributions), and resemble the data (showing large likelihood values).The only mandatory data are the metabolic fluxes (while other data are optional, but can help make the estimates more precise).For the estimation, we maximise the Bayesian posterior density (maximum-likelihood estimation can be obtained as a special case by the use of flat priors).If we neglect one specific term (a part of the posterior that penalises small enzyme concentrations) and assume non-flat priors, the maximum-posterior problem has a single solution which can be found be gradient descent.This convexity relies on two main assumptions: (i) all fluxes are predefined and thermodynamically consistent (i.e.infeasible cycle fluxes must be excluded) and (ii) kinetic constants and metabolite concentrations are described in log-scale.Model balancing builds upon some previous methods for model construction: parameter balancing [15,18], elasticity sampling [40], and enzyme cost minimisation [39], which we review in the discussion section.

Parameter estimation in kinetic models
A programme for finding kinetic constants and metabolic states in kinetic models by Bayesian estimation was outlined in [14].Here we follow this programme and show that it becomes mathematically more convenient if fluxes are given (using insights about convexity from [39]).

Estimating kinetic constants from omics data
How can information about in-vivo kinetic constants be obtained from omics data?Let us first have a look at an existing approach.To estimate k cat values, Davidi et al. [20] compared flux data obtained from flux balance analysis (FBA) to measured proteomics, without any knowledge of metabolite concentrations or kinetic laws.The method, which was later called kinetic profiling [41], assumes (unknown) rate laws of the general form v = e k(c) (with flux v and enzyme concentration e), where the catalytic rate k depends on (unknown) metabolite concentrations (vector c) and varies between zero and a maximal value k cat (called turnover rate or catalytic constant).To determine k cat from data, we consider a cell in different metabolic states and compute the empirical catalytic rates k (s) app = v (s) /e (s) in all of these states.We can know for sure that k cat ≥ max s k (s) app .If we further assume that each enzyme reaches its maximal capacity in at least one of these states, the k cat value (or at least, a lower bound on the k cat value) is estimated by taking their maximum value.The method was applied to a large number of enzymes in E. coli and a certain correlation between the estimated k cat values and measured in-vivo values was found.A part of the remaining deviations could be explained by enzyme kinetics and thermodynamics, but this was not quantitatively modelled.The limitations of this method are clear: since the max function is only sensitive to the highest value, a single high outlier value can completely distort the result.Such outliers may arise if a small protein concentration, due to measurement errors, appears even smaller.But aside from this practical problem, we also cannot be sure that enzymes reach (or come close to) their maximal capacity in one of the samples.If the estimated in-vivo k cat value is seen as a lower bound k cat ≥ max s v (s) /e (s) , then how far is this bound from the true in-vivo k cat value?
If we manage to model the (non-maximal) catalytic rates in the different states as a kinetic effect, can we maybe obtain a better estimate of the k cat value, even if this value is not reached in any of the samples?To do this, we need to consider metabolite concentrations and enzyme kinetics, i.e. a functional form of k(c).A typical form of k(c) for a uni-uni reaction, the Michaelis-Menten kinetics, reads v = k + cat s/Ks−k − cat p/Kp 1+s/K S +p/Kp [8] or in factorised form [42] The efficiency terms η rev (for reversibility, or thermodynamics) and η sat (for enzyme saturation and allosteric regulation) are numbers between 0 and 1 depending on metabolite concentrations.If the efficiency terms are close to 1, then k(c) approaches its maximal value k cat ; but normally k(c) is lower.Given the rate laws, and given data for fluxes v, metabolite concentrations c, and enzyme concentrations e, we might be able to estimate k cat and K M values, even if the maximal efficiency is not reached in any of the samples....

Metabolic model and statistical estimation model
Let us now state the estimation problem (see Figure 1 (a), for a list of mathematical symbols see table 1).We consider a kinetic metabolic model with thermodynamically consistent modular rate laws [8] and kinetic constants (e.g.equilibrium constants2 , catalytic constants and Michaelis-Menten constants) in a vector q = ln p.We further assume a number of metabolic states, each characterised by a flux vector v (s) , a metabolite concentration vector c (s) , and an enzyme concentration vector e (s) .These states can be stationary (with steady-state fluxes) or nonstationary (e.g.snapshots from a dynamic time course).The model defines dependencies among model parameters and state variables, and also the kinetic constants themselves are interdependent because of physical laws [1,8]: each rate law contains a forward and a backward catalytic constant as well as Michaelis-Menten constants, and all these parameters may depend on each other -within and across reactions -via Haldane relationships and Wegscheider conditions, which necessarily creates correlations between these constants.Activation and inhibition constants, which may also appear in the rate laws, are independent of all other kinetic constants.
The dependencies between model variables are summarised in appendix A.1.To satisfy all parameter dependencies, we introduce a set of independent kinetic parameters (independent equilibrium constants, Michaelis-Menten constants, and velocity constants) that determine all remaining kinetic constants (Figure 1  i , and catalytic rates k l .By inverting this equation, the enzyme concentrations can be written as functions of kinetic constants, metabolite concentrations, and fluxes (Figure 1 (b), bottom).Importantly, these functions, as well as their logarithms, are convex in the space spanned by ln p and ln c (s) .The thermodynamic forces, in the vector θ (s) = ln k eq − N ln c (s) , determine the flux directions (but note that vanishing fluxes are always allowed): This law holds for all thermodynamically feasible rate laws.
Given the kinetic constants, state variables, and dependencies in our model, we can now define estimation problems.Our most general aim is to estimate kinetic constants, metabolite profiles and enzyme profiles in a number of metabolic states, based on data for kinetic constants, fluxes, metabolite and enzyme concentrations (and possibly thermodynamic forces) in these states.All these variables3 , together with their dependecnies, can be visualised in a dependency schema as in Figure 1.All data may be uncertain and incomplete, except for the fluxes, which are assumed to be known.Besides dependencies and data, we may use prior distributions and impose upper and lower bounds on the model parameters and on metabolite and enzyme concentrations.In model balancing, we treat independent kinetic constants and metabolite concentrations as free variables, while and dependent kinetic constants, enzyme concentrations and thermodynamic forces are dependent variables (computed from kinetic constants, metabolite concentrations, and fluxes).All kinetic constants and concentrations -but not the thermodynamic forces and fluxes -are treated in log-scale, and natural logarithms are used throughout.The vector of free variables (kinetic constants and metabolite concentrations) is constrained by Eq. ( 2) and by lower and upper bounds, and the resulting solution space is a convex polytope.For a Bayesian parameter estimation, here we determine the posterior mode, but the formalism can also be readily used for maximum-likelihood estimation and posterior sampling [43].Formulae for these problems are summarised in appendix A.1.

State balancing: fitting of metabolite and enzyme concentrations
Before we come to the full model balancing problem, let us first consider a simplified problem in which the kinetic constants are known and the aim is to find consistent metabolite and enzyme concentrations for a single steady state with known fluxes 4 , and given incomplete, uncertain data for some or all metabolite and enzyme concentrations.To fit consistent metabolite and enzyme concentrations, we maximise their posterior density.
Metabolite and enzyme concentrations 5 are described by their natural logarithms 6 .For the metabolite logconcentration vector x = ln c, we assume Gaussian priors 7 (with mean vector xprior and covariance matrix Σ x prior ) and lower and upper bounds (possibly different for each metabolite).Likewise, for the enzyme log-concentration vector z = ln e, we assume Gaussian priors (with mean vector zprior and covariance matrix Σ z prior ).Both for metabolites and enzymes, negative concentrations are excluded 8 .The possible metabolite log-profiles x form a convex polytope P x in log metabolite space [39].The shape of this polytope is defined by physiological upper and lower bounds and by thermodynamic constraints, depending on flux directions and equilibrium constants.The metabolite log-concentrations x i , our free variables, determine the enzyme concentrations e l through Eq. ( 1) and both the absolute and the enzyme log-concentrations are convex functions on the metabolite polytope (appendix section A.2). To define an estimation problem, we construct the polytope as our solution space, consider prior, likelihood and posterior functions on this polytope, and use them to estimate metabolite concentrations and corresponding enzyme concentrations.
The Bayesian estimation of model variables follows the approach outlined in [14].Starting from the Gaussian prior density p(x) for metabolite log-concentrations x, we take the logarithms, omit constant terms, and obtain the prior loss score for metabolite log-concentrations.Writing quadratic functions written in short as quad(x|a, Σ) = 1 2 (x − a) Σ −1 (x − a), we obtain the joint prior for metabolite log-concentrations x = ln c and enzyme concentrations z = ln e P (x, z) = quad(x|x prior , Σ x prior ) + quad(z|z prior , Σ z prior ).
The choice of prior means and covariances is discused at the end of section 3.3.Similarly, using data for x and z, we define the likelihood loss score the negative log-likelihood (again without constant terms and the prefactor).The vectors xdata , and zdata and matrices Σ x data and Σ z data contain mean values and covariances for measurement data.For simplicity, we assume that each state variable has been measured exactly once 9 .By adding the two functions ( 4) and ( 5), we obtain the posterior loss score R (x, z) = P (x, z) + L (x, z), and by simplifying the quadratic functions [14,15], we 4 Mathematically, this estimation problem resembles Enzyme Cost Minimisation (ECM) [39].Both methods are based on kinetic models with known parameters and predefined fluxes, and both of them optimise metabolite and enzyme concentrations, but in different ways.In ECM, metabolite and enzyme concentrations are optimised for a minimal biological cost, while in the present estimation problem, they are fitted to data. 5 6 Exact zero concentrations are deemed impossible, except for enzyme concentrations in the case of exactly vanishing fluxes, which are not estimated but directly set to zero). 7Typically, we assume uncorrelated priors, with covariance matrices obtain the formula with metabolite covariance matrix and mean vector10 and analogous formulae for the enzyme covariance matrix Σ z post and mean vector zpost .Until this point, all functions considered are convex: the prior score P , likelihood score L , and posterior score R are convex in x and z, and if non-flat priors are used, then prior and posterior score are strictly convex.Now, as we remember, the enzyme concentrations are not free variables, but fully dependent on metabolite concentrations and fluxes.Therefore, the enzyme concentrations can be eliminated from the formulae.From the enzyme demand function Eq. ( 1), we obtain the function z(x), and inserting this into Eq.( 5), we obtain the three score functions as functions of x alone: For distinction, in contrast to these "score" functions, the previous functions with will also be called "prescores".
Our estimation method can be extended to problems with several metabolic states.Each condition s has its own flux distribution, metabolite data, and enzyme data, and we can run the estimation separately for each state (see also appendix A.1).In problems with several states, vanishing fluxes can be considered (see appendix B.3).
In estimation problems with a single metabolic state, this is even simpler: reactions with vanishing flux can be omitted.

A convex version of the score functions
If an optimisation problems is convex, this has many advantages: it excludes separate local optima, decreases the numerical effort, and thus makes large problems tractable.Methods like parameter balancing and ECM profit from this.Are the prior, likelihood, and posterior score functions (8) convex in x?The first term in each score is convex because F is quadratic and therefore convex 11 .The second term is composed of a convex function quad(•) and a convex function z(x), but this alone does not make it convex.To be convex, it would have to be composed of an outer non-decreasing function and an inner convex function, but quad(•) is quadratic and obviously not non-decreasing.
However, here we argue that we can modify the second term to make it more similar to a convex function, in order to make the estimation problem better tractable.First, we can assume an uncorrelated prior for enzyme concentrations, i.e. a separate Gaussian prior for each reaction l and each metabolic state s.If we do this, the second term becomes a sum of the form with a one-dimensional quadratic function Next, we split this function into G = G + + G − , where G + = G only if x ≥ a (and 0 otherwise) and G − = G only if x < a (and 0 otherwise).G + is an increasing function, and the resulting truncated quadratic function is a sum of non-decreasing functions G + (•) of a convex function z(x) and therefore convex (remember that the enzyme log demand ln e(x) is a convex function on the metabolite polytope [39] for a wide range of plausible rate laws.).If we now replace quad(•) by quad + (•) in all second terms of Eqs (8), we obtain the truncated score functions Truncated prior loss which are convex (or strictly convex if non-flat priors are used).
Now that the G − term is gone, the posterior mode can be found by convex optimisation, while using the G − term makes the problem non-convex 12 .Alternatively, we can replace with a relaxation parameter α ∈ [0, 1], which we can also see as an interpolation between the original function G (if α = 1) and the truncated function G 0 (if α = 0).Accordingly, we defined the relaxed score functions as interpolations Relaxed prior loss These functions are not convex (because they contain non-convex terms stemming from G − (•)), but by choosing small α values, they can be made arbitrarily similar to the convex truncated functions.By varying α between 0 and 1, we can interpolate between the convex truncated problem and the non-convex, original problem.
If we omit the G − term, what will the effect on the estimation problem be?If we see G (in the terms related to enzymes) as a sum of G + and G − , then G + penalises enzyme concentrations that are higher than the posterior mean z post , while G − penalises enzyme concentrations that are lower.If we omit the terms with G − , an estimated enzyme concentrations may be much smaller than the corresponding enzyme prior means or enzyme data, as if all this information did not exist.Thus, at this point, we have two possibilities: using the full posterior score function R (using the quadratic function G), which is non-convex, or using a posterior score function R 0 without the G − term (i.e., replacing G by G + ) and accepting that enzyme concentrations may be underestimated.In fact, due to the relatively general form of our model, we expect that this trade-off (lower bounds or lower penalty terms for enzyme concentrations lead to non-convex estimation problems) holds for metabolic estimation problems more generally.However, we can still interpolate between the two possibilities.With α > 0, the relaxed posterior score function R α will be non-convex, but by choosing small α values, we can make it arbitrarily similar to a convex one, while keeping at least a (weak) term for penalising small enzyme concentrations.Our tests (see below) show that this compromise, and the resulting estimation method is a promising tool.

Model balancing: estimation of kinetic constants and metabolic states
We now consider the full model balancing problem, that is, an estimation of kinetic constants, metabolite concentrations, and enzyme concentrations.Following [1], we parametrize the model by kinetic constants K ind eq , K V , K M , and possibly K A and K I (all in log-scale).For some of these constants, data may be available (for instance, equilibrium constants K eq can be estimated from thermodynamic calculations) and consistent values need to be estimated.This problem resembles our simplified problem, where the enzyme concentrations were convex in x.Now the enzyme concentrations depend additionally on kinetic constants, and are convex in the joint space of log kinetic constants and metabolite log-concentrations (for any number of states)!This means that all the formulae and arguments for our previous simplified problem (including the usage of a "one-sided" enzyme posterior for obtaining a convex problem) can be reused.A description of the algorithm, including the convexity proof for the function ln e(ln p, ln c), is given in appendix A.2.
Since state variables and kinetic constants are estimated together, and since the kinetic constants are kept constant across metabolic states, the state variables become coupled across metabolic states and need to be estimated in one go.Instead of a metabolite vector x, we consider a larger vector y, containing the metabolite log-concentrations for all metabolic states and the vector of log kinetic constants.Allowed ranges and thermodynamic constraints define a solution polytope for the vector y.The prior, likelihood, and posterior loss scores contain terms that depend on enzyme concentrations e(x).If we insert Eq. ( 1) into these formulae, these terms are convex in the log kinetic constants, and independent of the metabolite concentrations 13 .Since e l (q, x) is a convex function of the vector y = x q , all terms of the likelihood loss score are convex in y.The prior loss score is strictly convex in y if a correlated prior for kinetic constants with pseudo values14 is considered [15].Altogether, model balancing has the same favourable properties as the previous, simplified problem.In practice, the model balancing algorithm can be improved by a number of simplifications and tricks (appendix B).For example, enzyme concentrations (and therefore the likelihood function) increase very steeply close to thermodynamics-related polytope boundaries; to avoid numerical problems, a region close to the boundary may be excluded by extra constraints, or the log(log posterior) may be minimised instead of the log posterior.

Example application: Escherichia coli central metabolism
As a test case for model balancing, we consider a model of E. coli central metabolism taken from [39] (Figure 12 in appendix 12), and corresponding metabolite, enzyme, and kinetic data.We consider two different estimation scenarios, one with artificial data and one with experimental data from a single metabolic state (data from [39]).
The algorithm was always run with the same settings (including priors and bounds).In all cases, we used the relaxed non-convex version of model balancing with a relaxation parameter α = 0.5.The calculation time (matlab function fmincon, with interior-point algorithm on a normal laptop) ranged between 2 and 30 minutes, with a single outlier of 150 minutes (notably, the numerical effort did not correlate with the size or expected difficulty of the problem, which suggests that the numerics could still be improved).So far, we did not test multiple realisations of the artificial data, and did not run the simulationswith multiple starting points.For details on model structure, kinetic and state data, and priors see appendix C.

Tests with artificial data
For the tests with artificial data, we first generated an artificial "true" set of kinetic constants and state data (metabolite concentrations, enzyme concentrations, and fluxes).Artificial measurement data were generated from the same random distributions that were also used as priors in model balancing.Metabolic state variables were generated from the kinetic model (with the "true" artificial kinetic constants) by randomly chosen enzyme concentrations and external metabolite concentrations and computing the steady states.For details see appendix C.2.Now the aim was to reconstruct the true (noisy-free) values for six simulated metabolic states from (noisefree or noisy) artificial data.We considered different scenarios (see Figure 13 in appendix) in which data were either fitted (metabolite and enzyme concentrations, and "known" kinetic constants) or predicted based on the other data ("unknown" kinetic constants).In the tests with artificial data, realistic statistical distributions (for kinetic constants, metabolic variables, and their measurement errors) were used to generate data, and the same distributions were used as priors when reconstructing the true values.To obtain realistic distributions of kinetic constants, we started from known (or suspected) distributions (from [39,18], which relied on [12]), and adjusted them based on data.This is an ideal situation, making reconstruction particularly easy.In real-life applications, if our priors and assumed noise concentrations are not realistic, the reconstruction would be worse than suggested by our tests with artificial data.
The model balancing results with artificial data are shown in Figures 3, 4, 5, and 6, where noise-free or noisy kinetic and noisy state data were used in different combinations.In Figure 3 both kinetic and state data are free of noise.Subfigures show the different simulation and estimation scenarios (rows) and different types of variables (columns).Each subfigure shows a scatter plot between true and fitted variables (metabolite concentrations, enzyme concentrations, and different types of kinetic constants).Deviations from the diagonal indicate estimation errors.In the top row, data for all kinetic constants were used; in the centre row, only data for equilibrium constants were used, and in the bottom row, no kinetic data were used15 at all.Depending on the scenario, kinetic constants were either fitted (red dots) or predicted from data (magenta dots).The quality of fits or predictions is measured by geometric standard deviations 16 and linear (Pearson) correlations (for logarithmic values).For comparison, we also estimated k cat values from the same artificial data, using the "max-k app " method from [20], (Figure 7).
The first scenario (top row) assumes ideal conditions: we assume noise-free, complete kinetic data and state data.
Not surprisingly, the reconstruction errors are tiny, arising from small conflicts between data and priors.The other rows show estimation results using metabolic state data and equilibrium constants only (centre row) or using metabolic state data but no kinetic data at all (bottom row).The reconstructions in these two rows recover k cat and K M values partially, but they still depend on complete and precise state data.To assess the effect of noisy state data, we generated artificial state data (metabolite concentrations, enzyme concentrations, and fluxes) with a relative noise concentration of 20 percent.With noisy kinetic and/or state data, the estimation results become worse (Figures 4, 5, and 6), and especially the reconstruction of K M values becomes very poor.Using data on equilibrium constants improves the results and k cat values can still be partially reconstructed (Figure 6).The tests with artificial data show, first, that model balancing can adjust noisy data sets: relatively small changes in the data suffice to obtain a consistent set of kinetic and state data.Second, they show that information about k cat values can be extracted from state data.Notably, even in the case without any kinetic data (nor equilibrium constants), model balancing yields better k cat estimates than the kinetic profiling method.If we compare the results between Figures 3, 5  are much worse: due to our priors, the estimates are in realistic ranges, but otherwise they appear to be randomly distributed.Therefore, with noisy metabolic data, adjusting in-vitro kinetic constants to in-vivo values (instead of estimating in-vivo values from omics data alone) remains the main application.

Tests with measurement data
After this test, we balanced the same E. coli model with real measurement data.As kinetic data, we used the in-vitro kinetic constants collected in [39] ("original kinetic data") and, for comparison, a balanced version of the same data set ("balanced kinetic data").For details on model and data, see appendix C. Figures 8 and 9 show estimation results for a single metabolic state, aerobic growth on glucose (appendix C).Since the true model variables are not known, the available data are plotted against the reconstructed kinetic constants, metabolite concentrations and enzyme concentrations.In most cases, the scatter plots show the goodness of fit.
In some cases, where some data were not used for fitting, the plots show the quality of actual predictions.In a first test, we used a consistent set of kinetic data obtained by parameter balancing as the kinetic data (Figure 8).If all kinetic data are used (top row), a good fit to these data is achieved, and only slight adjustments were necessary to obtain a consistent kinetic model agreeing with all the state data.However, we should not forget that the kinetic constants were fitted and not predicted (as indicated by red dots).In the centre row, equilibrium constants were used as the only kinetic data the other kinetic constants were actually predicted (as indicated by magenta dots).In the bottom row, all kinetic constants, including equilibrium constants, were predicted 17 .Now the correlations to in-vitro values are relatively small.This test may still be criticised because the kinetic data were taken from a previous parameter balancing procedure on the same network model and with the same priors, which might make it easier for model balancing to recover these values.To avoid any bias, we next ran model balancing with the original in-vitro kinetic data (which contain much fewer data points for comparison), with similar results.

Parameter identifiability
To understand how much useful information can be extracted from (noise-free or noisy) data, we need to think about uncertainties, parameter identifiability, and the usage of priors.In model fitting, generally, certain parameters or parameter combinations may be non-identifiable, that is, they have no effects on measurable outcomes, and therefore we cannot determine them from available data [24].Likewise, if parameters have very small effects on measurable outcomes, and if data are sparse or noisy, our estimates of these parameters will be very uncertain.For example, if kinetic constants are estimated from metabolic state data, and if a reaction is strongly forward-driven, its backward k cat and K M values may be hard to determine (compared to forward k cat and K M values value); if an enzyme is always close to saturation, its K M value will be poorly determined, and if an enzyme remains always in its linear range (due to small substrate levels), only the k cat /K M ratio will be determined, but not k cat and K M individually; finally, enzyme concentrations and k cat values appear in rate laws in the form of a product (which is therefore well-determined by the fluxes), but their ratios may be ill-determined.In particular, an error in the scaling of flux or enzyme data will have an immediate effect on k cat estimates.
For model balancing, this means: variables or variable combinations (e.g.k cat /K M ratios) may be structurally non-identifiable (if they cannot have any effect on the metabolic state), practically non-identifiable (for example, if data from n states would be needed to infer them, but less data are given), or poorly identifiable (if they have a very small effect, e.g. the K M value in an enzyme close to full saturation).Since model balancing also uses priors and kinetic data, non-identifiability does not mean that parameters are ill-determined: the method will always result in a posterior with clearly defined modes -however, in these solutions the values of "non-identifiable" kinetic constants or state variables will completely depend on the priors, and will therefore be sensitive to the (somehow arbitrary) choice of the prior mean.Non-identifiability will always arise if there are fewer data values than variables to be estimated.For example, state data from a single metabolic state may not suffice to reconstruct the kinetic constants; if more metabolic states are used, the kinetic constant may become well-defined.
If model variables are non-identifiable (or poorly identifiable), this will lead to considerable uncertainties, visible in the posterior distribution.Assuming that the problem is convex, and that non-flat priors are used, we know that the posterior mode is single point in solution space.However, if the priors are broad, this point is somewhat arbitrary, and posterior sampling would reveal that many other solutions (along the "non-identifiable subspace") are almost as good, revealing a large uncertainty.This raises some practical questions.Can we improve the result by using more data (e.g., metabolite concentrations from more metabolic states)?If no kinetic data are given, how many metabolic states are needed to identify -at least potentially -all kinetic parameters?Which variables are hard to reconstruct?And are there variables that remain non-identifiable, no matter how much state data we use?
Paradoxically, variables that are always non-identifiable (or poorly identifiable) are less of a problem, because these variables, in turn, are also irrelevant for model predictions.An example would be an enzyme that is always close to saturation -we cannot determine its K M value, but we also do not need this K M value for forward simulation.However, if we try to simulate other conditions, in which the enzyme is not close to saturation, the results may be poor.Of course, the identifiability problem is not specific to model balancing; other estimation methods would face the same problem.

Discussion
Model balancing completes the earlier attempt from [14]

Parameter estimation
In parameter estimation, the idea of ensemble modelling is turned upside down.
Instead of assuming a distribution of model parameters and doing forward simulations, we infer such a distribution from available data, reflecting our knowledge about the model and its parameters.In practice, parameter fitting in kinetic models can be performed by Monte-Carlo methods such as random screening, genetic algorithms, or simulated annealing.For example, one may generate a large ensemble of possible parameter sets, compute the likelihood or posterior density values for each of them, and choose the best parameter set among them (see [33] for an example).Sampling-based optimisation methods are generic and easy to implement, but with large parameter spaces and complicated objective functions, finding the optimum becomes very unlikely.Moreover, without understanding the optimality problem, it is hard to assess whether there are many local optima, and how good the computed solutions actually are.
3. Fitting kinetic constants in single reactions: SIMMER and kinetic profiling Estimating kinetic constants becomes much simpler if it can be done separately for different reactions.For example, if the flux, metabolite concentrations, and enzyme concentration in a reaction are known (completely and precisely) in several metabolic states, the kinetic constants can be fitted without knowing the rest of the system18 [44,21].However, this approach has a number of limitations: to estimate the kinetic constants of a reaction, all pieces of information, including the concentration profiles of all reactants, must be given.Model balancing, in contrast, attempts to reconstruct missing information from other parts of the network.Kinetic profiling, an approach to estimate k cat values, has been described in the introduction.It is easy to implement, and -given protein concentrations and fluxes -the computational effort is low.However, it provides only a lower bound, which may be far from the true value if only few metabolic states are considered.
Model balancing, in contrast, has the potential to integrate different types of data (including K eq values and in-vitro k cat values) to obtain a best estimate, and is in principle able to estimate a k cat value even if this value is never reached in a living cell.
Parameter estimation in kinetic models can easily lead to non-convex optimality problems, especially if it involves a screening of thermodynamically feasible flux distributions [45].To obtain problems that are nearly convex, model balancing relies on two insights: all fluxes must be predefined 19 , and log kinetic constants and metabolite concentration must be used as free variables 20 .Model balancing combines ideas from two previous, convex methods: Parameter Balancing (PB) for the estimation of kinetic constants (and possible metabolic states) from data, and Enzyme Cost Minimisation (ECM) for predicting biologically favourable states from a principle of minimal effort (see Figure 2).
1. Parameter balancing.Parameter balancing is a method for estimating consistent kinetic and thermodynamic constants from kinetic and thermodynamic data.It resembles model balancing, but (in its basic form) does not use detailed information on rate laws and fluxes.All multiplicative constants (such as Michaelis-Menten constants or catalytic constants) are described by log-values.To account for parameter dependencies 21 , we choose a subset of kinetic constants 22 , the free parameters in our linear regression model, and express all other kinetic constants as functions of them.With Gaussian priors and measurement errors (in log-scale), likelihood and posterior terms are quadratic and convex.Parameter balancing can also be applied to kinetic and thermodynamic constants ("kinetic parameter balancing"), to metabolite concentrations and thermodynamic forces in one or more metabolic states ("state balancing"), or to kinetic constants and metabolic states together ("state/parameter balancing").Known flux directions can be used to define the signs of thermodynamic forces.Thus, parameter balancing can predict thermodynamically feasible kinetic constants and metabolite concentrations and its optimisation takes place on the same set as in model balancing.It provides plausible ranges for kinetic constants, but unlike model balancing it does not consider rate laws and cannot be used to fit kinetic constants to data 23 .
2. Enzyme cost minimisation.Enzyme cost minimisation (ECM) [39] predicts optimal enzyme and metabolite concentrations in kinetic models with given parameter values.ECM assumes predefined metabolic fluxes and determines metabolite and enzyme concentrations that realise these desired fluxes at a minimal cost, where cost functions can be a linear or convex function of the enzyme concentrations, plus a convex function of the metabolite concentrations.Thus, in contrast to parameter balancing, this method uses kinetic rate laws with given kinetic constants, and it is a biological cost (typically, a cost depending on absolute metabolite and enzyme concentrations), not a fit to data, that is optimised.The optimisation is carried out in (log-)metabolite space.With given rate laws, the enzyme concentrations can be written as functions of metabolite concentrations and fluxes and the cost function (a weighted sum of enzyme and metabolite concentrations) is convex on the feasible metabolite polytope.
Model balancing combines elements from these two methods.Like in parameter balancing, kinetic constants and metabolite concentrations are used as free variables (forming a parameter/concentration polytope as the solution space).And, like in ECM, we assume that the fluxes are given and use the fact that the (now, logarithmic) enzyme concentrations are convex functions of the metabolite log-concentrations.We combined this with two additional insights: the fact that (both absolute and logarithmic) enzyme concentrations are convex functions on the combined solution space of kinetic and metabolic variables, and the fact that most of the terms in the posterior score are convex, while the one sort of terms (limiting enzyme concentrations from below) that is non-convex can be omitted or be "weakened" by a tunable prefactor.
In all three methods -Parameter Balancing, Enzyme Cost Minimisation, and Model Balancing -the solution space is a high-dimensional polytope (in the space of log kinetic constants, metabolite log-concentrations, or both).
To construct this polytope, we define a box by upper and lower bounds and add linear constraints describing dependencies.In fact, the solution polytope for Model Balancing is obtained from the polytopes of the other methods by taking their Cartesian product and removing infeasible regions, in which constraints between kinetic constants and metabolite concentrations would be violated (shown in Figure 11).Since all variables are estimated together, any information about one variable (bounds, priors, or data) can improve the estimates of other variables, too.In parameter balancing, a data value for a kinetic constant may improve the estimates of all other variables.
Similarly, in model balancing additional metabolite and enzyme data may also improve the estimation of all kinetic constants.
In summary, model balancing can use various types of knowledge (network structure, data, priors, and constraints), handles different types of variables (as defined by the dependency schema used), and applies to steady and non- Figure 2: Model balancing and similar methods for parameter estimation and optimal metabolic states.The methods differ in their purpose (parameter estimation versus prediction of biologically optimal states), the choice of free variables (kinetic constants and/or metabolite and enzyme concentrations), and data used, but they all share some mathematical features: kinetic constants and metabolite concentrations are described in log-scale (such that all dependencies become linear); thermodynamic and physiological constraints are imposed; and fluxes are predefined.In each of these methods, the search space is a convex polytope.In the previous methods, the objective function was convex (either quadratic or derived from kinetics), leading to convex optimality problem.In model balancing, the non-convex terms (which penalise low enzyme concentrations) can be weakened or omitted to make the problem easily solvable.steady states 24 as long as the fluxes are thermodynamically correct 25 .The tests with articifial and measurement data show that, in realistic scenarios, a consistent model with consistent metabolic states can be obtained by relatively small adjustements of the data, that precise metabolic state data and equilibrium constants contain information about k cat and K M values, but that the estimation of k cat and K M values from metabolic data with realistic noise levels is relatively poor (although it tends to be better than the prediction by the existing kinetic profiling method).We did not perform a systematic leave-one-out crossvalidation for individual kinetic constants, but the outcome of such a crossvalidation can be expected to lie in between our two scenarios in which either all kinetic constants were used as data, or only equilibrium constants were used as data.We can conclude for sure that the usage of equilibrium constants improves the results, which confirms the importance of known equilibrium constants for constructing reliable kinetic models.
Depending on the data available, model balancing can be applied in different ways.
1. Infer missing data types Sometimes, data for two of our data types (kinetic constants, metabolite concentrations, and enzyme concentrations) are available and the aim to estimate the third, missing type of data.
There are three cases: we may estimate in-vivo kinetic constants from fluxes, metabolite concentrations, and enzyme concentrations; we may estimate metabolite concentrations from fluxes, enzyme concentrations, and a kinetic model; or we may estimate enzyme concentrations from fluxes, metabolite concentrations, and a kinetic model.If the data were complete and precise, the third type of variables could be directly computed without model balancing.But if data are uncertain and incomplete, model balancing allows us to infer the missing data while completing and adjusting the others.
2. Adjusting omics data to obtain complete, consistent metabolic states If the kinetic constants are known and metabolite and enzyme have been measured, we can turn these incomplete and uncertain data into consistent and plausible metabolic states, in agreement with our model.Again, fluxes must be given and their directions must agree with the assumed equilibrium constants and metabolite bounds.Even without any enzyme or metabolite data, we can still apply model balancing: in this case, the method would predict possible metabolic states with the given fluxes, relying on priors for enzyme or metabolite concentrations.

Imposing thermodynamic constraints and bounds and data
To obtain a consistent model, we may collect data for kinetic and state variables and use model balancing to translate them into parameters and state variables.The resulting values will satisfy the rate laws, agree with physical and physiological constraints, and resemble the data and prior values.As in all other cases, posterior sampling could be used to decrease and assess uncertainties about the model parameters.
Model balancing extracts information from heterogeneous data.Even if almost no data are available, it can be used to obtain plausible models or model ensembles.In the tests with articificial data, model balancing performed well when precise data were given, and even with imprecise data it performed better than estimation by kinetic profiling.A main limitations of model balancing is that, unlike its predecessor methods, it is not a convex problem unless the G − terms (which penalise low enzyme concentrations) are omitted.By downscaling these terms with a constant factor, the the posterior score can be made more similar to a convex function (which may make it easier to treat numerically), but it remains non-convex unless these terms are omitted completely.Another limitation of model balancing is its calculation speed.The number of variables increases with model size and the number of metabolic states, which impacts memory requirements and calculation time (results not shown).Due to its simplicity, kinetic profiling is much faster.
Here we considered a Bayesian estimation problem, constructed the posterior, but otherwise did not follow a Bayesian methodology.Instead of sampling from the posterior, we only considered the problem of finding the posterior mode.Posterior sampling would be big step forward, allowing us to determine the posterior mean, variance, covariances and marginal distributions of kinetic constants and state variables.Moreover, entire parameter sets could be sampled to obtain a model ensemble.To simplify sampling, the posterior may be approximated by a multivariate Gaussian distribution, obtained from the posterior mode and the Hessian matrix of the posterior score in this point.Such a sampling procedure has been implemented for parameter balancing, but not yet for model balancing.We hope that our formulation of the problem will facilitate posterior sampling, due to the fact that the posterior score is close to convex, and that it will pave the way to efficient sampling algorithms that might also include a sampling of metabolic fluxes [45].
[53] L. Gerosa, B.R.B.H. van Rijsewijk, D. Christodoulou, K. Kochanowski, T.S.B. Schmidt, E. Noor, and U. Sauer.Pseudo-transition analysis identifies the governing regulation of microbial nutrient adaptations from steady state data.Cell Systems, 1:270-282, 2015.ln K A , allosteric inhibition constants ln K I , and velocity constants ln K V ), collected in a vector and (ii) the metabolite log-concentrations from one or several metabolic states s, contained in metabolite vectors x (s) = ln c (s) .We obtain a vector of free variables x (1)   x (2)   ..
With n p independent kinetic constants, n m metabolites, and n s metabolic states, the vector contains n p + n m n s free variables.

Dependent variables
We consider three types of dependent variables: dependent kinetic constants, enzyme concentrations, and thermodynamic forces.(i) The dependent kinetic constants on log-scale (dependent equilibrium constants ln K dep eq , forward catalytic constants ln k + cat ), and backward catalytic constants ln k − cat ) form a vector This vector can be computed from q ind with the help of a linear function q dep = M dep q ind .The dependency matrix M follows from the stroichiometric matrix as described in [15].Similarly, the vector q of all kinetic constants is given by a linear formula (ii) The thermodynamic forces are computed by the linear formula or briefly θ = M θ y with a matrix M θ obtained from the network structure.(iii) Based on rate laws and using Eq. ( 1), the enzyme concentration vectors e (s) are given by e l and its logarithm z l are convex functions in (q, x) for a wide range of rate laws.Moreover, the formula (18) yields positive enzyme concentrations on the entire solution space x (see below), except when fluxes vanish exactly v  (where taking the logarithm would lead to problems.

Solution space
The solution space for our free variables is defined by two sorts of constraints.First, lower and upper bounds on all variables except for the enzyme concentrations27 where s denotes metabolic states.Second, the driving forces must be positive (in the direction of the fluxes, and so the known flux directions define the signs of all driving forces.For all reactions with non-zero fluxes v (s) l = 0, this yields the thermodynamic constraints which (together with Eq. ( 17)) translate into linear constraints for the variable vector y.In reactions with zero fluxes, the driving forces are unconstrained (unless we know, for some other reasons, that reactions are in chemical equilibrium).Together all these constraints can be written as with a matrix A and a vector b obtained from reaction stoichiometries and flux directions.These constraints define a convex solution polytope P. Each polytope point y describes a feasible set of model parameters and metabolic states (i.e.states with positive driving forces).Conversely, any feasible set of kinetic constants and metabolic states (respecting all bounds and constraints) corresponds to a point in the polytope.

Priors and likelihood terms
The posterior is obtained from prior and likelihood terms.Kinetic constants, metabolite concentrations, and enzyme concentrations are considered in log-scale and described by Gaussian priors and likelihood terms on this scale.The log kinetic constants depend on a smaller set of independent kinetic constants.For these independent kinetic constants, we use a correlated prior arising from independent prior terms (for independent kinetic constants) and from pseudo values (for dependent kinetic constants).
While pseudovalues are invoked formally to define correlated priors, they can be treated in practice as additional data points (see [15]).For the state variables (metabolite log-concentrations x In contrast to Parameter Balancing and ECM, model balancing determines the vectors q and x at the same time.The resulting vector y lives in a high-dimensional polytope whose geometric structure is schematically shown in Figure 11.Since each state vector y consists of a vector q and a number of vectors x s , the polytope resembles a Cartesian product of the polytopes for these single vectors.However, thermodynamic constraints between kinetic constants and metabolite concentrations require that some parts of this Cartesian product must be removed.To see how the metabolite spaces for several metabolic states are combined, let us return to our simplified model balancing problem from section 2.3.We can solve this problem separately for each state, and this is also the easiest way to solve the problem.But we can also fit all metabolic states simultaneously by one big regression model.Each metabolite profile x (s) must lie in a metabolite polytope P (s) x .If the flux directions in all metabolic states are the same, these polytopes are identical; if fluxes change their directions, the metabolite polytopes P (s) x will differ.If we merge all vectors x (s) into a vector x, the solution polytope for this vector will be higher-dimensional and will be given by the Cartesian product s P (s) x .As before, we can consider the prior, likelihood, and posterior (for all metabolic states) as functions on this solution space.Since the metabolic states terms of are quadratic in x and independent of q and therefore convex in (x (s) , q).The third terms, by themselves, are quadratic in z(q, x) and not necessarily convex.However, if they are are replaced by the truncated version, each of them can be written as a sum of non-decreasing functions of z (s) l (q, x), which are convex in (q, x), and are therefore convex in (q, x) themselves.

B Implementation
A Matlab implementation of Model Balancing, together with example models and data, is available at https: //github.com/liebermeister/model-balancing.The file format for models and data (kinetic constants, fluxes, metabolite concentrations, protein concentrations) is SBtab [48] and metabolic networks can be defined in SBML [49] or SBtab files.By default, the algorithm starts by running model balancing on an average model state (with metabolic state data given by the geometric mean over the metabolic states).The resulting kinetic constants are then used as initial values for the following full calculation with several metabolic states.

B.1 Prior distributions
Since the values of non-identifiable parameters will be predicted based on their priors, the choice of realistic priors is important!Of course, the usage of priors affects all predicted variables, pushing them towards the prior means: high values will be estimated too low and low values will be estimated too high.This effect does not matter if a variable has been measured precisely or if priors are very broad.However, for non-identifiable variables the effect is large: while model balancing predicts a single (posterior mode) value, this value is almost arbitrary, determined entirely by the choice of the prior, and carries a large uncertainty (marginal posterior variance).On the one hand, this calls for posterior sampling (ninstead of reporting only the posterior mode).On the other hand, this underlines the need for realistic, informative priors.To define informative priors for kinetic constants [14,50], we followed the procedure from parameter balancing [15] which takes into account parameter dependencies.We start by assuming independent Gaussian priors for all independent constants and (independent Gaussian) pseudo values for dependent constants: taken together, they define a correlated prior for all constants.Non-diagonal covariance matrices for ln K eq values, obtained from equilibrator, could be taken into account.Mean values and standard deviations were taken from [18] and manually refined to match empirical distribution of kinetic constants.For the state variables (metablite and enzyme log-concentrations), we typically use diagonal covariances Σ x data = diag(σ x data ) 2 and Σ z data = diag(σ z data ) 2 , assuming independent measurement errors.However, correlated errors (e.g.caused by normalisation for metabolomics or proteomics samples) may be taken into account.There are several possible reasons to do this: first, concentrations differ more strongly between metabolites than between metabolic states for the same metabolite.This means, for the vector of all metabolite data, that values for each metabolite tend to be correlated.We could describe this by splitting the data values into x i , with a reference value x ref (and the same for enzyme log concentrations z l ).Second, if we model metabolic time series, and assume that state variables vary smoothly over time, we can implement this assumption by imposing a prior that assumes correlations between (one and the same concentration at) subsequent time points.Third, there may even be reasons to assume correlated errors in data values: for example, in metabolomics data, differences in sample preparation or a normalisation per sample will lead to correlated measurement errors (by which, for example, all metabolite concentrations in one sample may appear too high or too low).In theory, such effects could be accounted for by assuming a correlated covariance matrix in the likelihood terms.
To define priors in practice, pseudo values, and constraints (for kinetic constants, metabolite concentrations and enzyme concentrations), we used the default values from parameter balancing (see www.parameterbalancing.

net.
However, when running parameter balancing as a test, we found that the available k cat values were typically much higher than the prior median value as expected for enzymes in central metabolism [12].In view of these data, we changed the prior for k cat values from a median of 10 s −1 (geometric standard deviation 100) to a median of 200 s −1 (geometric standard deviation 50).Likewise, we changed the prior width for K M values from a geometric standard deviation of 10 to a geometric standard deviation of 20 (while keeping the median 0.1 mM unchanged).A table describing the priors is provided in the github repository, file resources/data/dataprior/model-balancing prior.tsv.These values, used in the matlab implementation, can easily be modified.Importantly, in the actual model balancing calculations, a much broader prior was used in order to put more weight on existing data.

B.2 Model balancing variants and simplifications
Model balancing can be adapted in various ways.(i) If a type of data is not used, likelihood terms for this data type are omitted.Even without any data, priors will keep the results in biochemically plausible ranges.
(ii) If certain parameters (e.g. the equilibrium constants) are precisely known, their values can be predefined (e.g. by treating them as data with very small standard errors).(iii) Model balancing also applies to models with irreversible rate laws.In an irreversible rate law, there are fewer kinetic constants (since backward catalytic constants, equilibrium constants, and velocity constants do not play a role); the forward kinetic constant is a free parameter, and no Haldane relationship is considered.Describing (some or all) rate laws as irreversible changes the structure of the kinetic dependency matrix M. (iv) Different model parameterisations: instead of independent equilibrium constants, standard chemical potentials may be used as independent parameters [15].i .Uncorrelated priors for these variables yield a meaningful correlated prior for the metabolite concentrations, and a similar splitting can be used for enzyme concentrations.
So far we assumed that one data value is available for each of the model variables.For example, we assume that each metabolite level in each metabolic state has been measured, and that an in-vitro value is available for each kinetic constant.To account for the fact that values may be missing or measured multiple times, we need to relate a vector of model variables to the corresponding vector of variables measured (possibly, comprising multiple measurements of the same variable).The connection can be made by a data mapping matrix D x (for metabolite log concentrations, and similar mapping matrices for other variables) that maps the vector of metabolite log-concentrations in the model onto the vector of metabolite log-concentration data values.In the ideal case (all values have been measured exactly once), D x is a simple identity matrix.If a concentration has not been measured, the corresponding row of this identity matrix must be omitted, and if a value has been measured multiple times, the row must be duplicated.With mapping matrices, our likelihood loss score (with terms for kinetic constants and state variables) becomes L (x, z) = quad(D q θ|q data , Σ q data ) + quad(D x x|x data , Σ x data ) + quad(D e z|z data , Σ z data ), and formula (7) for posterior means and covariances reads

C.2 Artificial data
Artificial kinetic constant data were generated as follows.Given the network structure, true artificial kinetic constants were generated by assigning random (log-normal) values to ln K ind eq , ln K M , and ln K V and computing the other constants.The random values were sampled from the same distributions that are used as priors in model balancing.
To generate artificial metabolic state data, enzyme concentrations and external metabolite concentrations were randomly sampled from the same distributions that are used as priors in model balancing.Then the model was parameterised with the "true" artificial kinetic constants and was solved to obtain a steady reference state (steadystate metabolite concentrations and fluxes).Based on this reference state, a number of metabolic states were constructed by randomly varying metabolite and enzyme concentrations (again, following the prior distribution) and computing the (non-steady) reaction rates 28 .The resulting states are seen as the "true values".To generate noisy state data, uncorrelated random noise was added to the "true values".When generating artificial data, noise was also added to fluxes but the flux signs were kept unchanged, to ensure thermodynamically feasible flux directions as required in model balancing.

Figure 1 :
Figure 1: Estimation of model variables in kinetic metabolic models.(a) Kinetic model and metabolic states.A model is parameterised by kinetic constants (e.g.equilibrium constants, catalytic constants, and Michaelis-Menten constants) and gives rise to a number of metabolic states (characterised by enzyme concentrations, metabolite concentrations, and fluxes).These states may be stationary (with steady-state fluxes) or not (e.g.states during dynamic time courses).(b) Dependencies between kinetic constants and state variables.All kinetic constants are described in logarithmic scale.A subset of kinetic constants determines all other kinetic constants through linear relationships.If kinetic constants, metabolite concentrations, and fluxes are known, the enzyme concentrations can be computed from rate laws and fluxes: each enzyme concentration is a convex function of the (logarithmic) kinetic constants and metabolite concentrations.The signs of thermodynamic forces are constrained by the flux directions.(c) Parameter estimation.Kinetic constants and metabolite concentrations (for a number of metabolic states) serve as free variables of a statistical model.Dependent kinetic constants, thermodynamic driving forces, and enzyme concentrations (bottom) are treated as dependent variables, and the fluxes (top right) are predefined.For estimating the variables, priors and available data may be used.The other graphics show estimation and optimisation methods for other scenarios, in which (d) only kinetic data are balanced (no state data), (e) only state data are balanced (kinetic parameters are predefined), or (f) enzyme and metabolite concentrations are optimised for a low enzyme or metabolite cost.

l
(b), top left).The vector p contains all kinetic constants.In each metabolic state s, the rate laws determine the fluxes v k l (p, c (s) ) as functions of enzyme concentrations e (last row, no use of any kinetic data) for model balancing and Figure 7 for kinetic profiling, the model balancing predictions are much more accurate in their indidivual values (Pearson correlation of 0.63 vs 0.28) if noise-free metabolic data are given.With noisy metabolic data, both methods show similar results.It turns out that K M values are much harder to reconstruct: with noise-free metabolic data, a Pearson correlation of 0.5 (and a geometric standard deviation of 4.5) is achieved.With noisy metabolic data, the results

= 0 .
In this case, we directly set e (s) l = 0 instead of estimating ln e we assume uncorrelated Gaussian priors (i.e.log-normal priors for the concentrations themselves).Similarly, the logarithmic data values x (s) i,data and z (s) l,data in the likelihood term, are assumed to be independent and normally distributed around the true values.
(v) A preposterior for kinetic parameters may be obtained by previous parameter balancing, and pseudo values for metabolite and enzyme concentrations may be obtained by a previous ECM.(vi) To penalise unrealistically high metabolite or enzyme concentrations, a regularisation term may be added, for example, proportional to the cost function considered in ECM.(vii) Omics data may not contain absolute metabolite and enzyme concentrations, but relative changes between metabolic states.To account for such data, a variant of the dependency schema might be considered: for each metabolite, we split the log-concentrations ln c (s) i into a reference value ln c i and a deviation ∆ ln c (s)

Figure 12 :
Figure 12: Model of E. coli central carbon metabolism and protein data, taken from [39].
Ensemble modelling If the parameters or states of a model are not described by fixed numbers, but by probability distributions, we obtain a model ensemble.Ensemble modelling can be used to express the actual possibilities of a system or the possibilities we attribute to it given our limited data and knowledge, correct reaction elasticities.SKM and STM can be adapted to account for priors or data for the K M values.However, including data or priors for k cat values and enzyme concentrations remains difficult, and the method cannot be used to used to fit kinetic constants simultaneously to several metabolic states.
[40]ntegrate kinetic and metabolic data to obtain a consistent metabolic model with several states within a Bayesian framework, accounting for uncertainties and missing data.Depending on data availability and quality, it can be used for (i) inferring kinetic constants from state data and (ii) making almost complete, uncertain data consistent, certain, and precise.Various methods and tools have previously been developed to parameterise metabolic models, to integrate omics data, and to build model ensembles.These methods use different types of knowledge (in-vitro kinetic constants, omics data, and physical parameter constraints) and different estimation approaches (including machine learning, regression models, calculations based on rate laws, and model fitting).1.constants.Structural thermokinetic modelling (STM)[40], a variant of SKM, considers reversible rate laws and guarantees thermodynamically consistent results.The procedure follows a dependency schema just like in model balancing.In the first step, it requires thermodynamically consistent fluxes, metabolite concentrations, and thermodynamic forces.In the second step, thermodynamic forces are used to convert saturation values into