Abstract
Mathematical models are routinely calibrated to experimental data, with goals ranging from building predictive models to quantifying parameters that cannot be measured. Whether or not reliable parameter estimates are obtainable from the available data can easily be overlooked. Such issues of parameter identifiability have important ramifications for both the predictive power of a model, and the mechanistic insight that can be obtained. Identifiability analysis is well-established for deterministic, ordinary differential equation (ODE) models, but there are no commonly-adopted methods for analysing identifiability in stochastic models. We provide an accessible introduction to identifiability analysis and demonstrate how existing ideas for analysis of ODE models can be applied to stochastic differential equation (SDE) models through four practical case studies. To assess structural identifiability, we study ODEs that describe the statistical moments of the stochastic process using open-source software tools. Using practically-motivated synthetic data and Markov-chain Monte Carlo (MCMC) methods, we assess parameter identifiability in the context of available data. Our analysis shows that SDE models can often extract more information about parameters than deterministic descriptions. All code used to perform the analysis is available on Github.
1 Introduction
Stochastic mathematical models are rapidly becoming an essential tool for interpreting biological phenomena [1–7]. These models are necessitated, in part, by increasing experimental interest in capturing finer-scale, time-series observations [8–12] as well as spatial information [13–18] rather than coarse-scale deterministic trends (figure 1). As computational inference techniques for stochastic models have improved [19–23], a fundamental question that often remains overlooked is whether or not model parameters can be confidently estimated from the available data. Drug development, for example, often relies on the quantification of cell growth rates from a proliferation assay (figure 1a–d) [24]. If a mean-field model is applied to interpret the most frequently reported observation—cell count data—only the net growth rate is identifiable, not the proliferation and death rates [25,26]. Establishing the identifiability of model parameters is critical as predictions, and parameter estimates, from a non-identifiable model may be unreliable [27–30], with further analysis required to quantify prediction uncertainty in non-identifiable models [31–33]. Identifiability should always, therefore, be established before parameter estimation is attempted. Such identifiability analysis is well-established for deterministic ordinary differential equation (ODE) models [28,34–41], but there is a scarcity of methods available for the stochastic models that are becoming increasingly important.
(a–d) Cell proliferation and death observed in vitro over 36 hours in a proliferation assay [68]. Each snapshot has a field-of-view of 1440 × 1440 μm and the location of each cell is indicated with a yellow marker. (e) Data from the early stages of the coronavirus pandemic comprising the observed number of (i) infected individuals, deaths, and (ii) daily new case count in Australia [69]. (f) Continuous glucose monitoring data from a single individual over three consecutive days [70].
Stochasticity is fundamental to many processes [2,42–48]. For example, diabetic patients rely on the rapid interpretation of highly volatile blood glucose measurements to determine insulin input (figure 1f) [49,50]. Data from the COVID-19 pandemic [1] is also volatile (figure 1e), and inferences of epidemic data must often be drawn from a single, stochastic, time-series. Finally, for systems at equilibrium in the mean-field, such as ion-channel data, models that account for system noise are required to establish parameters [51,52]. Stochastic differential equation (SDE) models of the Itô form are widely applied in systems biology to describe stochastic phenomena [53–56]. While many stochastic systems can be simulated exactly using discrete Markov models, SDE approximations offer a significant computational advantage. In addition, the use of reflected SDEs [57] can guarantee good agreement with their discrete counterparts at boundaries [57,58]. SDE models can describe intrinsic noise in, for example, gene expression [2,9,23] or a bio-chemical reaction network [59]; extrinsic noise describing volatility in the environment [48,53,60,61]; and model approximations and unknown effects in so-called grey-box models [62,63]. Explicitly modelling this variability in biological systems can often capture more information about a process than a deterministic model is able to [64–67]. Further, SDE models can account for the correlations inherent to time-series data and account for noise that might otherwise obscure parameters. Our results demonstrate how to establish parameter identifiability for SDE models that encode information about the intrinsic noise of the process [64]. We focus on SDE state-space models that can be formulated through the chemical Langevin equation (CLE), although our intention is to provide analysis applicable to any SDE of the Itô form. The use of such formulations have a long and extensive history of use.
A prerequisite for parameter estimation is that model parameters be structurally identifiable [28,34–36,71–73]. Structural identifiability refers to the question of whether a parameter can be identified given an infinite amount of noise-free data. A state-space model is said to be structurally identifiable if distinct values of the parameters imply distinct observed model outputs (or in the case of a stochastic model, distinct observed output distributions [74]), and vice versa [75–77]. Techniques such as differential algebra [38, 78, 79] and transfer function approaches [34, 35] can establish structural identifiability in ODE models. These approaches are also used to establish identifiable relationships between parameters [35,80]—for example, the net growth rate in a proliferation assay—which can aid model design and model reduction [80–82]. Many of these techniques have accessible implementations in symbolic computation packages [38,83–86], meaning structural identifiability analysis does not require a detailed understanding of the, often complex, underlying mathematical analysis [38].
When experimental data is considered, a more useful question is that of practical identifiability or estimability [28,77,83,87]. That is, can parameters in the model be accurately estimated given a finite amount of noisy experimental data? This kind of analysis is routinely used in the field of experimental design to assess the nature of data required to adequately identify biophysical parameters [29,51,55,88–90]. Practical identifiability is established in conjunction with an inference technique, such as profile or maximum likelihood [90–93] or Markov-chain Monte-Carlo (MCMC) [29, 51]. These techniques provide information about the flatness (or otherwise) of the likelihood function or, in the Bayesian case, the posterior distribution, that describes knowledge about the parameters after the experimental data is taken into consideration. For deterministic and simple stochastic models, this information can be obtained directly from the Fisher information matrix [93]. A model parameter is classified as practically non-identifiable if it cannot be established uniquely within a reasonable level of confidence [28,29]. Compared with structural identifiability, which is a property of the model, practical identifiability is more nuanced and additionally dependent upon prior knowledge; the experimental data; and consequentially, the experiment itself [29,83]. For example, should the model and data provide no more information about a parameter than that already established in previous studies, the parameter may be classified as practically non-identifiable from the data and model in question. For this reason, we take a Bayesian approach to parameter estimation and encode existing knowledge about the parameters in a prior distribution. This question of practical identifiability has not yet been demonstrated for SDE models in systems biology.
Computational inference for stochastic models is a significant challenge [22]. Unlike approaches to parameter estimation for deterministic models, the likelihood function for a realistic stochastic model is, generally, intractable [22]. Techniques based on approximations, such as a linear-noise approximation [93] or approximate Bayesian computation [20,94–98], are available for SDEs but are, naturally, approximations. Pseudo-marginal methods [99,100], developed relatively recently, are computationally costly, but provide an unbiased estimate of the true likelihood function for partially observed time-series described by non-linear stochastic models. In this review, we utilise a pseudo-marginal MCMC approach, where we estimate the likelihood with a particle filter, which we refer to as particle MCMC [101,102]. There are many excellent reviews of inference for stochastic models in systems biology [20,22,102,103], so we do not focus on the details our out implementation here. Despite the established importance of identifiability, it is all too common in parts of the inference literature to draw the standard assumption that the model parameters are identifiable: we note that all the aforementioned review articles make no mention of identifiability. The computational cost of inference for stochastic models, in itself, motivates us to consider identifiability. For example, identifiability can guide model selection: if both a deterministic and stochastic description of a process are practically non-identifiable, the cheaper deterministic model may, in some cases, be adequate for parameter estimation. Where structural non-identifiability is detected, practical non-identifiability necessarily follows and does not need to be established separately.
The focus of this review is to provide an accessible guide to establishing identifiability in SDE models in biology. To do this, we analyse identifiability in SDE descriptions of four case study models, shown in figure 2. The simplest model we consider is a birth-death process (figure 2a) that is routinely used to describe cell proliferation and death in a range of in vitro and in vivo biological systems, such as that shown in figure 1a. We demonstrate that, from cell count-data, the cell proliferation and death rates are structurally non-identifiable for a routinely employed ODE model, but can be identified for an SDE model. Next, we consider two multi-state models where only partial observations of the system are available. First, a two-pool model (figure 2b) that can describe, for example, the decay of human cholesterol whilst it transfers between two organs [35,104]. We assume that data from the two-pool model comprises several time-series observations of the substance concentration in a single pool. Secondly, an epidemic model (figure 2c) [105–107] describes individuals infected due to interactions between susceptible and infectious individuals. We model a testing procedure such that unknown proportions of the number of infectious and recovered individuals are observed, and inferences are drawn from a single time-series. The last model we consider is a non-linear SDE model for insulin regulation by β-cells (figure 2d) [108,109]. This type of model can describe the volatility associated with data from a continuous glucose monitoring device (figure 1f) [70]. The equivalent ODE description of the β-insulin-glucose circuit is not structurally or practically identifiable [110], and we demonstrate how the analysis for the ODE description can inform a parameter transformation to aid identifiability analysis for the SDE model.
We demonstrate identifiability in an SDE CLE description of four models: (a) a birth-death process; (b) a two-pool model; (c) an epidemic model; and (d) a β-insulin-glucose circuit. The coloured boxes indicate the observed quantity, which is coupled to a noisy observation process.
We demonstrate two main approaches to assess identifiability in SDE models. First, we assess structural identifiability through a surrogate model, taken to be a system of ODEs that describe the time-evolution of the statistical moments of the SDE [111–115]. This allows us to apply the established open-source structural identifiability software package DAISY (written for the freeware REDUCE software) to the SDE models through the moment equations. We repeat this analysis in the more recent open-source software package GenSSI2 [116,117], written for MATLAB, which can be more efficient for non-linear systems. We interpret these results as a proxy for identifiability of the SDE model itself. While this approach is not always conclusive, it can provide a rapid preliminary screening tool and allows direct comparison of identifiability for an SDE model, which contains information about the mean, variance and higher moments; to identifiability for a corresponding ODE model that is typically assumed to describe an approximation of the mean. We only apply this approach where an exact system of moment equations can be derived, which occurs when the reaction rates are polynomial. For more complex stochastic models containing terms such as Hill functions, as found in the β-insulin-glucose circuit model, an exact system of moment equations cannot be derived, we do not apply the moment dynamics approach in this case. We assess practical identifiability for all models using MCMC [29,51], first demonstrating how practical identifiability can be cheaply established from a naïve proposal kernel. To compute credible intervals for each parameter, and visualise potential correlations between parameters, we produce results using a tuned proposal kernel where we can be more certain of convergence.
The outline of this review is as follows. In Section 2, we establish the types of SDE models and observation processes that we consider, and then outline the techniques used to generate synthetic data. Following this, in Section 2.2, we summarise moment closure techniques for SDEs and describe how we implement the software tools DAISY and GenSSI to assess for structural identifiability. Next, in Section 2.3, we provide a brief overview of our implementation of the particle MCMC algorithm. Full details of particle MCMC for SDE models can be found in the existing literature [102,103] and as supporting material. In Section 3, we use these tools to assess identifiability using an SDE description of four models. In Sections 4 and 5, we discuss our results and provide an outlook on the future of identifiability for stochastic models in biology. To aid in the accessibility of the techniques we review, we provide our MCMC code in the form of a module1 for the open-source, high-performance Julia programming language [118].
2 Mathematical techniques
In this section, we outline the mathematical and statistical techniques we use to perform identifiability analysis. Full details of all algorithms used are provided as supporting material.
2.1 Stochastic models in biology
We consider Itô SDE state space models of the form
Here, the state is described by is a Q-dimensional Wiener process with independent components; α(·) maps to an N-dimensional vector; and σ(·) maps to an N × Q matrix. The observables,
, are connected to the state variables according to an observation process with probability density function g(Yt|Xt, t; θ). We consider several forms of observation function, including partial observations of the state with both additive and multiplicative Gaussian noise with unknown variance
. In equations (1) and (2), θ is a vector of unknown parameters to be determined through inference. In this review, all variables and parameters are dimensionless.
The focus of this review is on Itô SDE models that are formulated through the CLE description of a system of bio-chemical reactions [59,119,120]. Therefore, additional information about rate parameters is encoded in the noise of the process. The first three models we consider (figure 2a–c) can be expressed directly as a network of reactions. As the β-insulin-glucose circuit model (figure 2d) involves state variables modelled as concentrations, not individual counts, we derive a stochastic description from the CLE but scale the noise term in proportion to the concentration of each species.
In summary, a bio-chemical reaction network comprises N species, X1, X2,…, XN, that interact through Q reactions [121–123]. The population of each species is given by . By the law of mass action [59,124], each reaction occurs with a rate described by a propensity function, ak(Xt, t; θ), which is equal to the product of the reactants and the rate constant. The net effect of the kth reaction is described by the stoichiometry νk such that, should reaction k occur in [t, t + dt),
For bio-chemical reaction networks without an explicit time-dependent input, the propensity functions will be independent of t and the system can be simulated exactly using an event-driven stochastic simulation algorithm (SSA) [5,124–126]. The principle behind an exact SSA is that reactions can be modelled by an inhomogeneous Poisson process. The time interval between reactions, Δt, is exponentially distributed such that
A single reaction occurs at each time-step; the kth reaction occurs with probability proportional to ak(Xt; θ). A typical implementation of the SSA first samples a time-step using equation (4); then samples the next event to occur; and finally updates the state. Full details of our implementation of an SSA are given as supporting material, and the reader is directed to [121] for a comprehensive review of simulation algorithms for bio-chemical reaction networks. We generate synthetic data for the first three models, for which the propensity functions are independent of t, using the SSA. In figure 3a–c we show 100 realisations of the SSA for the birth-death process, two-pool model and epidemic model, respectively.
(a–d) 100 example realisations of each model, produced using: (a–c) the SSA; and, (d) the SDE. (e–h) Synthetic data used for practical identifiability analysis. Synthetic data comprises noisy observations of the (e) full and (f–h) partial state. In (e,f,h), experimental replicates used simultaneously for parameterisation are shown semi-transparent, with the first replicate fully opaque. For the epidemic model, both short-time (opaque) and long-time (semi-transparent) data are considered separately. In both cases of the epidemic model, an unknown proportion of the number of infected individuals (green), and the recovered individuals (black), is observed. In (d,h), the β cell concentration, βt (and the measured concentration Y1,t) is shown on the right axis.
When the population of each species is large and reactions sufficiently frequent, the dynamics of a bio-chemical reaction network can be approximated using the CLE [25,119,121]. Such an approximation is widely applied in systems biology [127, 128], and it is often necessary as the SSA quickly becomes computationally expensive as the populations become large and reactions are frequent enough [129]. The CLE is an Itô SDE of the form
Here Wt = (W1,t, W2,t,…, WQ,t) is a Q-dimensional Wiener process with independent components. In this study, we derive an SDE description for each model using the CLE, and we calibrate this SDE to the synthetic data to approximate the parameters in each model. For the first three models, where data is generated using the SSA, not the SDE, this means that identifiability analysis is conducted in such a way that model misspecification could potentially arise. This pragmatically mirrors experimental data, where any model (including an ODE and SDE description) is an approximation. The forward simulation for each SDE is approximated using the Euler-Maruyama algorithm [130], where we apply reflected SDEs to ensure positivity [57]. Full details of the numerical algorithm are given as supporting material.
2.2 Moment dynamics
To enable the application of established methods for structural identifiability analysis to SDE models, we formulate a system of ODEs that describe the statistical moments of the random variable . We denote mi1i2…id(t) as a raw moment of Xt, such that [112–114,120]
where 〈·〉 indicates the expectation taken with respect to the probability measure of the random variable Xt. Here,
is the order of the moment. For example, the first order moments correspond to the mean of each dimension of Xt, the second order moments relate to the variances and covariances, and so forth.
We apply the software packages DAISY [38] and GenSSI2 [117] to establish structural identifiability of the resultant system of moment equations. The software package takes a system of ODEs describing the state equations—in our case, the moment equations—in addition to an explicit algebraic relationship between the observables and the state. We, therefore, provide the moments of the observables, Yt, in the noise-free limit, which we denote
In many cases, the observation distribution, g(Yt|Xt, t; θ), will depend upon the unknown parameters, θ, if, for example, an unknown proportion of the state is observed. This is captured in the structural identifiability analysis as the equations derived for the observed moments, n, may depend on θ. We provide well commented input and output obtained using DAISY on Github as supporting material.
An expression for the time derivative of each moment can be found using Itô’s lemma (supplementary material). When each component of σTσ is an analytic function, which occurs when all the propensity functions in the bio-chemical reaction network are also analytic functions, we obtain [131]
where H(·) denotes the N × N Hessian matrix of its argument and
. In the case that N = 1, equation (8) reduces to
where α and σ are now scalar functions.
When each component of α and σTσ are polynomials in Xt, the expectation in equation (8) can be carried through to replace powers of Xt with appropriate moments. This, in general, provides an infinite system ODEs that exactly describe the time evolution of the moments. In practice, we consider a finite system of moments, up to and including moments of order J. We express this now finite system of ODEs as
where m≤J(t) is a vector containing all the moments up to, and including, order J; and m>J(t) is a vector containing all moments of order J + 1 and above. In the case that f≤J(·) depends only on moments up to order J, the system is said to be closed at order J. That is, the infinite system of equations can be truncated at order J and solved directly to obtain an exact solution for the moments. This is the case if α and σTσ are linear in Xt, which occurs in SDEs derived from the CLE if each propensity is linear in Xt, as is the case for the first two models we consider (figure 2a,b).
For more complicated models, including the epidemic model (figure 2c), the system will not, in general, be closed. We must, therefore, apply a moment closure approximation to express moments of order higher than J in terms of lower order moments [45]. Moment closures typically make an a priori assumption about the distribution of the random variable Xt. For example, assuming components of Xt are independent or normally distributed is a common approach. In this review, we consider three common moment closures: (1) a mean-field closure [113]; (2) a pairwise closure [113]; and (3) a Gaussian closure [112].
The mean-field closure we consider makes the approximation
This closure is derived from the assumption that components of Xt are weakly correlated [113] and is also referred to as the covariance closure [132]. In the case a closure is drawn at J = 1, the mean-field closure often corresponds to an ODE description of the process. For our analysis of the epidemic model, we find that the mean-field closure behaves poorly, suggesting that an assumption that the components of Xt are independent may not be appropriate (Supplementary Material, Figure S1).
While the mean-field closure is commonly drawn at order J = 1, it is more common for the pair-approximation closure to be applied for second and higher order closures [113]. The pair-approximation closure assumes that a third order moment can be expressed as
The Gaussian closure approximates higher order moments to match those of the normal distribution, and gives a closure in terms of the mean and covariances. Higher order moments can be approximated with [112, 133]
Here, denotes a central moment; Cov(Xj,tXk,t) denotes the covariance between Xj,t and Xk,t; and Is are the sets formed by partitioning the set
into unordered pairs, where s is the number of sets. The raw moments, mi1i2…iN(t) can then be solved from the expressions for the central moments obtained from equation (12). For a practical example of the Gaussian closure, see [112].
Other choices of moment closure are routinely used in systems biology, such as those based upon a multivariate lognormal distribution [112] or a derivative matching scheme [134]. However, more complex closures add further complexity to the moment equations, which is a significant computational disadvantage for automated assessment of structural identifiability in software packages such as DAISY and GenSSI2. Furthermore, an approximate system of moment equations (which must then also be closed) could be obtained by applying a series expansion approximation, or an approximation similar to the mean-field closure, to systems containing non-polynomial analytic functions; this is the case for the fourth model we consider (figure 2d). We do not consider the moment dynamics approach for non-polynomial models in this review.
2.3 Inference with MCMC
We take a Bayesian approach to parameter estimation to update our knowledge about the parameters, θ, from a set of observations, , using the likelihood function,
, such that [135]
Here, p(θ) is the prior distribution, and represents our knowledge of θ before consideration of the observations . The prior distribution may encode information from, for example, previous experiments, established knowledge, or physical restrictions on the parameters. In the context of practical identifiability, our goal is to significantly increase our understanding of θ from our prior knowledge. We specify p(θ) to be a truncated uniform distribution: all parameters within a specified region of realistic parameter values (the support) are considered equally likely [29]. An advantage of a uniform prior in the context of identifiability is that the posterior corresponds to the truncated likelihood function, and, therefore, high density regions of the posterior correspond to regions of high likelihood. Further, should an improper, unbounded uniform prior be considered, the posterior will be directly proportional to the likelihood. Thus, our methodology can also be applied to assess parameter identifiability using a purely likelihood-based approach.
We use an MCMC technique, based on the Metropolis-Hastings algorithm, to sample from the posterior distribution [135–138]. The principle behind MCMC in Bayesian inference is to construct a Markov chain, {θi}i≥0, with a stationary distribution equal to . We make a standard choice to initiate the chain from a prior sample, θ0 ~ p(θ). At each iteration of the algorithm, a new state is proposed, θ* ~ q(θ*|θm), where q is termed the proposal kernel. The proposal is accepted, θm+1 ← θ*, with probability
else the proposal is rejected, θm+1 ← θm. Full details of our implementation are provided as supporting material. In this review, we use a multivariate normal proposal so that q(θm|θ*) = q(θ*|θm). An interpretation of the Metropolis choice of acceptance probability, equation (14), where the proposal is normal and, therefore, symmetric, is that proposals that increase the posterior density are always accepted, whereas proposals that decrease the posterior density are accepted with some reduced probability [29].
We refer to the first set of MCMC chains for each problem as pilot chains [139]. The proposal distribution for each pilot chain is set to be a multivariate normal distribution with independent components and variances equal to one-tenth the corresponding prior variance for each parameter, a typical choice. We always produce four pilot chains, each of 10,000 iterations, which we find to be sufficient to indicate identifiability for our models. These pilot chains are then used to tune the MCMC proposal kernel [140]. We then produce four tuned chains, which can be reliably used to estimate credible intervals and other features of the posterior distribution. The proposal distribution for each tuned chain is chosen to be multivariate normal, with covariance given by [139]
Here, dim(θ) is the number of unknown parameters, and is the covariance matrix for the pooled samples from the four pilot chains (a total of 28,000 samples after 3,000 samples are discarded as burn-in from each pilot chain). To assess convergence, we calculate the commonly used
[141] and neff (effective sample size) [135] diagnostics. In summary,
measures the ratio of between-chain and within-chain variance; and neff measures the effective number of independent samples drawn from the posterior. To draw reliable inferences, Gelman et al. [135] suggest ensuring that
. Full details of these convergence statistics are available in [135].
The primary challenge with performing inference for SDE models, with time-series data, is computing the likelihood function. In this review, we consider synthetic data from E independent experiments, each with NE time-series observations. The data are denoted
and correspond to the likelihood function
In most cases, the likelihood for noisy time-series data modelled by an SDE will be intractable [102]. This contrasts with data modelled by a deterministic model, which are typically assumed to be independent and normally distributed about the model output [29]. Likelihood free methods, such as ABC [20,97] and pseudo-marginal approaches [100], are routinely used in systems biology to calibrate complex stochastic models to experimental data by approximating equation (17). In this study, we apply a pseudo-marginal approach based on a bootstrap particle filter to approximate the likelihood and calibrate each SDE model to synthetic experimental data [102]. In summary, the bootstrap particle filter approximates equation (17) by
Here, the observation probability density, g (equation (2)), is averaged over R samples from the SDE, to approximate the likelihood. The bootstrap particle filter then resamples from the set of weighted samples,
, at each time-step to form the starting locations for each SDE sample to sample forward to tn+1. This process is repeated for each independent experiment, and the result is an unbiased Monte Carlo estimate of the likelihood function,
, that replaces
in the Metropolis acceptance probability (equation (14)). Full details of the particle MCMC algorithm, including an implementation for an ODE model used in one case study, are provided as supporting material, and for further information the reader is directed to [102, 103].
3 Case studies
Using the moment equations and MCMC, we provide a practical guide for assessing parameter identifiability in SDE CLE models through four case studies. We generate synthetic data for each model using the SSA when the propensity functions are time-independent (the birth-death process, two-pool model and epidemic model), and the corresponding CLE when the propensity functions are time-dependent (the β-insulin-glucose circuit). In practice, we would first assess practical identifiability using the experimental data available. However, working with synthetic data provides the means to evaluate the effect of different experiment designs, and observation protocols, on practical identifiability. Our focus is on data comprising partial observations of the process that realistically captures potential experimental data.
3.1 Birth-death process
The first model we consider is a birth-death process (figure 2a). The birth-death processes can describe, for example, the growth of a well-mixed cell population where individuals proliferate and die according to rates θ1 and θ2, respectively. We consider practical identifiability for synthetic data comprising noisy measurements of the cell count at 10 equally spaced times in 10 identically prepared experiments. Such data are typical for in vitro cell proliferation experiments [24,142], an example of which is shown in figure 1a–d.
3.1.1 Model formulation and moment equations
The birth-death process can be expressed as the bio-chemical reaction network
with stoichiometries ν1 = 1 and ν2 = −1; and propensities a1(Xt) = θ1Xt and a2(Xt) = θ2Xt. Here, we denote Xt as the number of individuals in the population. The observed number of individuals, Yt, is described by the noise model
Here, we consider a noise process that scales with the total population, that is, multiplicative Gaussian noise. We show 100 realisations of the SSA for the birth-death process in figure 3a, and the synthetic data used for practical identifiability analysis in figure 3e. The data are generated using the initial condition X0 = 50 and target parameter values θ1 = 0.2, θ2 = 0.1 and σerr = 0.05. Here, σerr ≪ 1, which ensures that Yt remains positive.
The CLE for the birth-death process is
and the first and second order moment equations are
The moment equations for the SDE description of the birth-death model above are identical to the moment equations for the discrete Markov model that we simulate using the SSA [143]. The moments of the observable (in the noise-free limit) are given to second order by n1 = m1 and n2 = m2. As α(·) and σ2(·) are linear in Xt, the moment equations of the birth-death process are closed at every order and so equations (21) are exact. Further, we note that the common mean-field model for the birth-death process,
corresponds to the first moment, and describes the average behaviour of Xt. The solution to equation (22) is
Here, the population, , undergoes exponential growth with a net-growth rate of θ1 − θ2. Therefore, intuitively, it is not possible to identify θ1 and θ2 if only average growth behaviour is observed [25].
3.1.2 Structural identifiability
We first assess structural identifiability of the moment equations in DAISY [38]. If only the first moment, n1, is observed, the system is structurally non-identifiable, meaning the model parameters cannot be uniquely estimated with any amount of data. However, the system becomes structurally identifiable if n2 is also observed. As the moment equations are closed at every order, and therefore exact, this analysis indicates that the ODE model (equation (22), corresponding to the first moment equation) is structurally non-identifiable, while the SDE model is structurally identifiable.
These structural identifiability results can be intuitively understood through reparameterisation [40]. The first moment equation (or the ODE model) can be re-parameterised with where
is the sole parameter in the model. Therefore, for a fixed
, all values on the line
produce indistinguishable behaviour in the first moment, m1, and hence in the observation, n1. On the other hand, when re-parameterised the second moment equation contains a second, linearly independent, parameter
. For the birth-death process, the second moment provides enough additional information to uniquely identify both parameters θ1 and θ2, provided enough data is available. Thus, the birth-death process is structurally identifiable from the first two moments.
3.1.3 Practical identifiability
We assess practical identifiability of the parameter vector θ = (θ1, θ2, σerr) for the ODE and SDE models using MCMC. We place independent uniform priors on each parameter so that and
. If prior knowledge about the population (i.e., the cell line) is available, perhaps based upon previously conducted experiments, this can be incorporated into the analysis through an informative prior. For example, upper bounds that define reasonable values for biological parameters are routinely applied in this context [90].
In figure 4a–i, we show MCMC results for the birth-death process using the ODE model. Based on the structural identifiability results, we expect the likelihood (and for a uniform prior, the posterior density) to be constant along the identifiable parameter combination , and we see this in figure 4d. These results also suggest that, should one of θ1 or θ2 be known (for example, if the cells are treated with an anti-proliferative drug that enforces θ2 = 0 [144]) the other be identifiable. However, lower and upper bounds for θ1 and θ2, respectively, are able to be established as a direct consequence of the prior assumption that all parameters are strictly positive. Examination of univariate credible intervals, shown in table 1, reveals that each parameter cannot individually be identified within 3–4 orders of magnitude, a hallmark of non-identifiability [29]. We note that σerr is practically identifiable (figure 4i, 95% CrI: (0.1448,0.1907)) from the ODE model, however it will always be overestimated as the observation model for the ODE model must also account for the intrinsic noise of the process.
95% credible intervals, and diagnostics, for the parameter estimates for the birth-death process. Credible intervals are approximated using the MCMC quantiles after burn-in.
MCMC results for (a–i) an ODE and (j–r) an SDE description of the birth-death process. (a,c,f) and (j,l,o) show trace plots for the ODE and SDE models, respectively. Kernel density estimates of the posterior for each parameter ((b,e,i) and (k,n,r)), and bivariate scatter plots ((d,g,h) and (m,p,q)), are produced by thinning the MCMC chains by using every 100th sample from four independent MCMC chains, after burn-in.
We repeat the analysis for the SDE model, results of which are shown in figure 4j–r. For the prior support chosen, both θ1 and θ2 are practically identifiable, as seen in figure 4k,n. Further, 95% credible intervals identify each parameter within a single order of magnitude (table 1). While structural identifiability analysis revealed that the SDE model is identifiable in the limit of infinite, noise-free data, it is not necessarily so for data with a realistic signal-to-noise ratio, characterised by the noise model parameter σerr. In our case, if prior knowledge provided an upper bound for θ1 and θ2 at, for example, 0.3, conclusions of practical identifiability may be analogous to those of the ODE model. We see this in table 1, where the upper bounds of the credible intervals for θ1 and θ2 extend beyond 0.3. This is also evident from both the bivariate scatter plot (figure 4m) and MCMC trace plots (figure 4j,l), where posterior samples above 0.3 are regularly drawn for both θ1 and θ2. As the SDE explicitly accounts for intrinsic noise, σerr is identifiable with estimates close to the true value, in contrast to results from the ODE model.
3.2 Two-pool model
Next, we consider partial observations of a process governed by a two-pool model, describing the decay of a substance that is able to transfer between two pools (figure 2b). Identifiability of a two-pool model was first examined in the fundamental study of Bellman and Åström [34] as they introduced the concept of structural identifiability. The model can represent, for example, human cholesterol distribution dispersed through two-pools (for example, two organs), where measurements are taken from a tracer in the first pool [104]. Bellman [34] and later Cobelli [35] show that, for an ODE model, the pool transfer and decay rates are not structurally identifiable. We consider practical identifiability for synthetic data comprising noisy measurements of the first pool at 10 equally spaced time points in five identically prepared experiments. Although measurements of the second pool are not taken, we assume, for demonstration purposes, that the initial concentration in each pool is zero before a known amount is introduced to the first pool, thus the full initial condition is known. In practice, the initial condition may also depend on a set of unknown parameters, and we focus on this with the epidemic model.
3.2.1 Model formulation and moment equations
The two-pool model can be expressed as the bio-chemical reaction network
with stoichiometries ν1 = (−1, 0)T, ν2 = (0, −1)T, ν3 = (−1,1)T and ν4 = (1, −1)T; and propensities a1(Xt) = θ1X1, a2(Xt) = θ2X2, a3(Xt) = θ3X1 and a4(Xt) = θ4X2. Here, we denote Xt = (X1,t, X2,t)T as the concentration of cholesterol in the first and second pools, respectively. The observed concentration, Yt, is described by the noise model
in which we consider that the data are subject to measurement error in the form of additive Gaussian noise [9, 145, 146]. We show 100 realisations of the SSA for the two-pool model in figure 3b, and the synthetic data used for practical identifiability analysis in figure 3f. The data are generated using the initial condition X0 = (100,0)T and target parameter values θ1 = 0.1, θ2 = 0.2, θ3 = 0.2, θ4 = 0.5 and σerr = 2. Here, we note that σerr ≪ Xt (figure 3c), which ensures Yt > 0.
The CLE for the two-pool model is
and the moment equations are given to second order by
The moments of the observed cholesterol concentration are given in the noise-free limit by n1 = m10 and n2 = m20. As with the birth-death process, all elements of α(·) and σ(·)σ(·)T are linear in Xt, so the moment equations are closed at every order and, therefore, exact.
3.2.2 Structural identifiability
The two-pool model provides an archetypical example of structural non-identifiability in an ODE model [34,35]. Unless a restriction is placed on one of the parameters (for example, if decay of the substance can only occur from the first pool so θ2 = 0), the model parameters are structurally non-identifiable: many parameter combinations give identical behaviour in the ODE model. Therefore, the model parameters cannot be uniquely determined from any amount of noise-free experimental data if observations are made from only the first pool.
We assess structural identifiability of an SDE description of the two-pool SDE model using DAISY with the system of moment equations up to second order (equation (26)). While the ODE model is structurally non-identifiable, the SDE model is structurally identifiable. Therefore, in the limit of infinite, noise-free data, the model parameters can be uniquely determined from an SDE description of the two-pool model.
3.2.3 Practical identifiability
To assess practical identifiability of the two-pool model, we apply MCMC to infer θ = (θ1, θ2, θ3, θ4, σerr). Initially, independent uniform priors are chosen such that , and
. The support of each prior is chosen to cover a range of magnitudes over the target parameter values. Results from four independent pilot chains, each initiated at a random sample from the prior, are shown in figure 5a–f. In figure 5a we see that the log-likelihood estimate rapidly stabilises, indicating that the chain has moved to a high-likelihood region of the parameter space. Results for σerr and θ3 also rapidly stabilise, indicating that these parameters are practically identifiable [29].
Results for the remaining three kinetic rate parameters in figure 5c,d,f indicate that θ1, θ2 and θ4 are practically non-identifiable. In particular, chains for θ1 and θ2 spend a non-negligible time near zero, indicating that the model may be indistinguishable (using the available data) from a model where removal only occurs from a single pool.
Pilot MCMC trace plots, and log-likelihood estimates, of four chains for the two pool SDE model on with (a–f) untransformed parameters; and (g–l) transformed parameters. Priors for each parameter are uniform with support corresponding to the respective axis limits. The target parameters, used to generate synthetic data, are indicated (black dashed line).
We next repeat the analysis using MCMC to infer θ* = (logθ1, logθ2, logθ3, logθ4, σerr). Inferring the logarithm of rate parameters will provide more detailed information about the magnitude of rate parameters potentially close to zero [102]. This transformation provides an excellent example of why even a uniform prior is informative, since a uniform prior placed on the linear-scale is not uniform on the log-scale. A uniform prior on the linear-scale makes parameters of a smaller magnitude less likely than a larger magnitude. The priors are again chosen to be independent and uniform (on the log-scale), such that for all i and
as before. The support of each prior is chosen, again, to cover a range of magnitudes above and below that of the target parameter values. Results in figure 5k confirm that θ3 is practically identifiable, while θ2 and θ4 are practically non-identifiable. From results in figure 5l we term θ4 one-sided identifiable: the parameter has an identifiable lower bound, and is distinguishable from zero.
To visualise correlations between inferred parameters, we tune the proposal kernel (equation (15)) and run the MCMC algorithm for 30,000 iterations, results are shown in figure 6 and table 2. If only the univariate marginal distributions are considered, all parameters except for θ4 may be classified as practically identifiable. However, our analysis shows that θ1 and θ2 are distinguishable only within a large range of magnitudes. A strong correlation is seen between θ1 and θ2, indicating that the total substance exit rate, θ1 + θ2, may be practically identifiable. If one of θ1 or θ2 were known in advance, perhaps based on past experimental knowledge, the other may become practically identifiable. Further, results from the tuned chains verify that θ3 is practically identifiable (95% CrI (0.1356,0.4857)) and θ4 is distinguishable from zero.
95% credible intervals, and diagnostics, for the parameter estimates (on the linear-scale) for the two-pool model. Credible intervals are approximated using the MCMC quantiles after burn-in.
Tuned MCMC results for the two-pool model with a parameters on the linear-scale. The left-most column shows an MCMC trace from a single chain. Kernel density estimates of the marginal posterior for each parameter and bivariate scatter plots are produced using every 300th sample from four independent MCMC chains, after burn-in. The autocorrelation function for a single chain is shown in (c), indicating that every 300th sample is approximately independent.
3.3 Epidemic model
Here, we consider a four-compartment epidemic model — the SEIR model [105–107] (figure 2c). In this model, susceptible individuals, S, are infected due to interactions with infectious individuals, I, and undergo an unknown period of time during which they have been exposed, E, but are not themselves infectious. Infectious individuals either recover or are removed from the total population, R. A noisy unknown proportion, ξ, with mean μobs, of the number of infectious and recovered individuals is monitored. This captures a testing regime where not all infectious or recovered individuals are tested. We supplement these results by considering a scenario where the same unknown proportion of the exposed individuals is also monitored during the early part of the epidemic.
The kind of data available for the epidemic model differs significantly from that for the experiment-based models we have considered thus far: we are interested in a practical identifiability problem where data from only a single time-series is available, which mirrors data available from an actual epidemic [147]. We first consider practical identifiability using data from the early part of the epidemic, before the number of cases is observed to decrease. Next, these results are compared to a case where data further through the course of the epidemic is considered (figure 3g). Initially, 10 infected individuals and 10 recovered individuals are detected. For simplicity we assume there is no noise in these initial observations, so the number of infected and recovered individuals is given by 10/μobs. An unknown number of individuals, E0, are initially exposed. In our analysis, we assume that E0 is not of direct interest, and we class it a nuisance parameter.
3.3.1 Model formulation and moment equations
The SEIR model can be represented by the following bio-chemical reactions
with stoichiometries ν1 = (−1,1, 0, 0)T, ν2 = (0, −1, 1, 0)T and ν3 = (0, 0, −1, 1)T; and propensities a1(Xt) = θ1StIt, a2(Xt) = θ2Et and α3(Xt) = a3It. Here, we denote Xt = (St, Et, It, Rt)T as the number of individuals in each compartment. Two observations are made,
Here, Y1,t and Y2,t describe the observed number of infected individuals and recovered individuals, respectively. We further assume that μobs, the average observed proportion; and σerr, the observation error, are unknown and must be estimated. We show 100 realisations of the SSA for the epidemic model in figure 3c, and synthetic data used for practical identifiability analysis in figure 3g. The data are generated using the initial condition X0 = (500 – E0, E0, 10/μobs, 10/μobs)T and target parameter values θ1 = 0.01, θ2 = 0.2, θ3 = 0.1, E0 = 20, μobs = 0.5 and σerr = 0.05. Here, we note that σerr ≪ μobs, ensuring that Y1,t and Y2,t remain positive.
The moment equations differ from the previous two models considered in that they are not closed. Therefore, the first order moment equations are not equivalent to those for the corresponding ODE model [30], unless a mean-field closure is drawn at first order. To make progress, we close the moment equations after second order to form an approximate system of moment equations for the first two moments. We give the system of 14 moment equations, under all three moment closures considered, as supporting material. The moments of the observation variables are given in the noise-free limit by
In the supporting material we produce numerical solutions to the moment equations for the epidemic model for each closure considered (figure S1). All closures predict visually identical behaviour at first order, and the pair-approximation and Gaussian closures are in agreement at second order. For the target parameters we consider, the mean-field closure does not agree at second order with the more advanced closures. Whereas a numerical solution to the moment equations for the pair-approximation and Gaussian closures is readily obtainable from a standard solver in Julia [148], the mean-field closure required a positivity-preserving Patankar-type method [149] to avoid blow up.
3.3.2 Structural identifiability
We assess structural identifiability of the approximate system of moment equations in DAISY and GenSSI2, results are shown in table 3. The ODE model, equivalent to a mean-field closure (equation (10)) drawn after the first moment, is structurally non-identifiable. The second-order systems, for all closures, are structurally identifiable (table 3). As the second-order systems are approximate, this analysis is not conclusive for the SDE. However, we can conclude that if the mean and variance of the epidemic model (the first two moments) are modelled using the system of moment equations, and data is available accordingly, the parameters are able to be accurately estimated in the limit of infinite, noise-free data. We highlight the computational cost in DAISY of introducing complexity into the moment equations through the closure methods. The pair-wise closure, equation (11), which introduces a quotient, and the Gaussian closure, equation (12), which introduces a cubic, take significantly longer using DAISY to assess than the mean-field closure, equation (10), yet give the same result. However, unlike MCMC, we note that structural identifiability results are deterministic, and independent of user choices such as prior, number of particles, and generated or real synthetic data.
Structural identifiability of the partially observed SEIR model assessed in DAISY and GenSSI2. Structural identifiability of the SDE is assessed using each closure method for third and higher order moments. Note that the ODE model is equivalent to the SDE model with a mean-field closure for second and higher order moments. Runtimes correspond to a 3.7GHz quad-core i7 desktop machine running Windows 10.
3.3.3 Practical identifiability
We assess practical identifiability of the epidemic model using MCMC to infer θ = (θ1, θ2, θ3, E0, μobs, σerr). Independent uniform priors are placed on each parameter so that and
. Results are shown in figure 7, where we initiate each chain at the same location for all forms of data we consider.
Pilot MCMC trace plots, and log-likelihood estimate, of four chains for the epidemic model. We consider data comprising noisy observations of an unknown proportion of the number of infected and recovered individuals during the early part of the epidemic (first column) and throughout the epidemic (second column). We supplement these results by considering the case we are also able to observe the same unknown proportion of the number of exposed individuals during the early part of the epidemic (third column). Priors for each parameter are uniform with support corresponding to the respective axis limits. The target parameter set, used to generate synthetic data, are indicated (black dashed line).
First, we assess identifiability when only early-time data is available. The log-likelihood estimate rapidly stabilises (figure 7a), indicating that the chains have moved to a high-likelihood region of the parameter space [29]. Results for θ3, the recovery rate, also stabilise, indicating that θ3 is structurally identifiable. Eventually, we see the estimate for θ2 stabilises in all chains, however they under-estimate the target value, although proposals equal to and greater than the target value θ2 = 0.2 are occasionally accepted. To compensate, the estimate of θ1 stabilises, and covers a region an order of magnitude greater than the target (θ1 = 0.01). Therefore, although θ1 is practically identifiable to a large, but finite, range of values, we classify θ1 as non-identifiable from the short-time data. Estimates for E0 and μobs in figure 7m,p do not stabilise, and are practically non-identifiable.
Next, we consider a scenario where long-time data are available, such that the number of infected individuals is observed to eventually decrease. The log-likelihood estimate (figure 7b) and chains for all parameters, except E0, are observed to stabilise, indicating that all parameters of interest are now practically identifiable. We supplement these results by considering a third scenario, where only early-time data are available, but the same unknown proportion of the number of exposed individuals is also monitored. As with the long-time data, all parameters of interest are now practically identifiable.
We perform a posterior predictive check [135] of the epidemic model to compare the model prediction—which accounts for parameter uncertainty, intrinsic noise and observation error—to the synthetic data used for inference. We discard the first 3,000 samples from each pilot chain as burn-in, and resample 10,000 parameter combinations for each data type considered. Results in figure 8 show that, in all cases, the model predictions are in agreement with the full time-course (although, we note, the long-time data is only used to calibrate parameters in figure 8b). Results in figure 8a, for the short-time data, highlight how practical non-identifiability affects model predictions. These results predict an epidemic size at t = 30 is noticeably wider and higher than those for the data types where θ1 is practically identifiable. Further, the lower 95% credible interval for the observed number of infected individuals reduces much faster than that predicted by the other data types.
Posterior predictive distribution for the epidemic model using (a) short–time data; (b) long-time data; and (c) short-time data where observations are also made of the number of exposed individuals. In (a,c), the dashed line indicates the last observation point used for inference. The first 3,000 samples from each pilot chain is discarded as burn-in. We resample 10,000 parameter combinations (with replacement) and solve the SDE model to estimate posterior predictive intervals (PIs). Shown are 50% (darker) and 95% (lighted) prediction intervals computed from the quantiles of the posterior predictive distribution.
3.4 β-insulin-glucose circuit
Finally, we consider a non-linear model of glucose homeostasis, the β-insulin-glucose circuit [108,109] (figure 2d). Parameterising mathematical models of glucose homeostasis is important for the development of patient-specific insulin delivery for type 1 diabetics [49]. Time-series data of blood glucose concentration is available from continuous glucose monitoring sensors, a critical component of type 1 diabetes management [50,70], an example of which is shown in figure 1f. The model describes the regulation of blood plasma glucose by insulin secreted by pancreatic β cells. Glucose is introduced into the system through a base production plus a meal intake, u(t), and decays linearly according to the insulin concentration. Insulin is secreted by β cells at a rate given by a non-linear Hill function [109]. β cells are produced and decay in a non-response to the glucose concentration. We consider identifiability for synthetic data comprising noisy measurements of the β cell and glucose concentrations, but not the insulin concentration. The data consists of five independent experiments, each comprising 15 time-series observations following a meal intake. We only consider inference for two biophysical parameters: θ1, the insulin secretion rate; and θ2, the insulin sensitivity. The non-linearities in the model mean that the moment equation approach is not available, and inference using MCMC is computationally expensive. We demonstrate how structural identifiability analysis of the corresponding ODE system [150] can guide analysis of the SDE system and alleviate some of the computational challenges.
3.4.1 Model formulation
We consider a stochastic analogue of the model presented by Karin et al. [109]. Denoting Xt = (βt, It, Gt)T as the concentrations of β cells, insulin and glucose, respectively, the propensity functions and corresponding stoichiometries are given by
where
Since βt, It and Gt denote the concentrations of each substance, and not the population counts, we scale the diffusion term in the CLE to represent the relative concentrations of each substance [57]. Denoting Nβ, NI and NG the relative concentration of β cells, insulin and glucose, respectively, we write
Two observations are made,
such that Y1,t and Y2,t are the observed β cell and glucose concentrations, respectively. We show 100 realisations of the SSA for the β-insulin-glucose circuit in figure 3d, and the synthetic data used for practical identifiability analysis in figure 3h. The data are generated using the initial condition X0 = (322,10, 5)T with fixed parameters, μ+ = 0.21/(24 × 60), μ− = 0.025/(24 × 60), η = 7.85, γ = 0.3, u0 = 1/30, c = 10−3, Nβ = 1, NI = NG = 20, and target parameters θ1 = 0.02, θ2 = 0.0005 and σerr = 0.5 [109]. Here, we note σerr ≪ βt,Gt (figure 3d), which ensures that Y1;t and Y2,t remain positive.
3.4.2 Parameter transform
Villaverde et al. [151] study structural identifiability of the corresponding ODE model using differential geometry. In the ODE model, θ1 and θ2 are structurally non-identifiable, unless the insulin concentration is also observed or one of these two parameters is known. We demonstrate this using MCMC in figure 9a, where the marginal posterior for (θ1, θ2) covers a hyperbolic region of the parameter space of equal posterior density. In the ODE model, the product θ1θ2 is structurally identifiable. To demonstrate this, we perform MCMC on the ODE model with transformed variables and
, results shown in figure 9b. These results also show how inefficient a naïve MCMC proposal can be when correlations between posterior parameters are non-linear. Structural identifiability analysis [151] indicates that the hyperbolic region defined by
(for a fixed
) produces indistinguishable behaviour, corresponding to a flat posterior when a uniform prior is applied. Despite this, the tail regions in figure 9a are rarely sampled, which could give the impression that the parameters are practically identifiable.
Kernel density of the bivariate marginal posterior distribution of the biophysical parameters in the β-insulin-glucose circuit, using the ODE and 100,000 pilot MCMC iterations (the first 3,000 are discarded as burn-in. (a) The posterior for the untransformed parameters, (θ1, θ2) shows non-identifiability. (b) The posterior for the transformed parameters, , demonstrates that
is identifiable, but
is not.
As the propensity functions for the β-insulin-glucose circuit model contain non-polynomial functions, we cannot produce an exact expression for the moment equations. Therefore, we only study practical identifiability using MCMC, and do not consider structural identifiability of the SDE for the β-insulin-glucose circuit using the moment equations. Motivated by the structural identifiability analysis of the ODE model, we use MCMC to infer , where we only consider the transformed variables
and
.
3.4.3 Practical identifiability
We show MCMC results from four pilot chains in figure 10. The log-likelihood estimate rapidly stabilises (figure 10a), as do results for and σerr (figure 10b,d). As with the ODE model,
is practically identifiable, but
is not. To visualise possible correlations between inferred parameters, we tune the proposal kernel (equation (15)) and run the MCMC algorithm for 10,000 iterations. The univariate marginal distributions, and MCMC trace plots, show that
(95% CrI: (1.34,1.67) × 10−5) and σerr (95% CrI: (0.812,1.049)) are practically identifiable, whereas
is not (95% CrI: (8.21,97.79)). No large correlations are seen between the parameters
, and θ2 is clearly practically non-identifiable as samples cover the entire range of the prior.
Pilot MCMC trace plots, and log likelihood estimate, of four chains for the β-insulin-glucose circuit in the transformed parameter space. The likelihood quickly stabilises, but estimates for do not, indicating practical non-identifiability. Priors for each parameter are uniform with support corresponding to the respective axis limits. The target parameter set, used to generate synthetic data, are indicated (black dashed line).
Tuned MCMC results for the β-insulin-glucose circuit in the transformed parameter space. The left-most column shows an MCMC trace from a single chain. Kernel density estimates of the marginal posterior for each parameter and bivariate scatter plots are produced using every 100th sample from four independent MCMC chains, after burn-in. The autocorrelation function for a single chain is shown in (c), indicating that every 20th sample is approximately independent.
4 Discussion
Mathematical models are routinely calibrated to experimental data, with goals ranging from building a predictive model to quantifying biophysical parameters that cannot be directly measured. Much of the usefulness of calibrated models hinges on an assumption that model parameters are identifiable. Heavily over-parameterised models, with large numbers of practically non-identifiable parameters, are often referred to as sloppy in the systems biology literature [72,152–154]. Worryingly, these issues of parameter identifiability can often go undetected: models with non-identifiable parameters can still match experimental data (figure 8), but may have poor predictive power and provide little or no mechanistic insight [28]. Identifiability analysis is well-developed for deterministic ODE models, but there is little guidance in the literature to conducting such analysis for the stochastic models that are often vital for interpreting complex experimental data. In this review, we demonstrate how existing techniques can be applied to assess both structural and practical identifiability of SDE models in biology.
4.1 Moment dynamics approach
We demonstrate how existing ODE identifiability techniques can be applied directly to stochastic problems by formulating a system of moment equations. In the birth-death process and the two-pool model, the derived moment equations are closed and, therefore, exactly describe the time-evolution of the moments of the SDE. In these two case studies we find that the moment equations are structurally identifiable. This implies structural identifiability of the corresponding SDE model, and parameters can be uniquely estimated in the limit of infinite, noise-free data. For an SDE model this implies an infinite number of observation-noise free trajectories of the SDE, since the variability, which relates to higher-order moments, contains information. While we find that the two-pool model of cholesterol distribution is not practically identifiable, establishing structural identifiability is useful as it suggests to the practitioner that the observation process (i.e. observe cholesterol in the first pool) is sufficient, in principle, to fully parameterise the model.
For the epidemic model, the moment equations are not closed, so we study structural identifiability through an approximate system of second-order moment equations. The idea of studying identifiability through an approximate system was first suggested by Pohjanpalo [75], who studies identifiability of ODE systems through a power series expansion. The closed system of moment equations suggest the epidemic SDE model could be structurally identifiable, and these results agree, in our case, with practical identifiability detected using MCMC. More research is needed to establish how identifiability is affected when closing, or truncating, a system of moment equations. For example, if information required to identify model parameters is contained in third or higher order moments, results suggesting that a model is practically non-identifiable from a second order closure will not be indicative of non-identifiability in the SDE model. Furthermore, if structural identifiability differs between moment closures, such a preliminary screening tool needs to be interpreted with caution. If this were the case, a conclusion of structural identifiability is indicative of the model under a particular closure. Recent work suggests that a finite number of moments often contain the information required to identify parameters [155], even for a bimodal distribution and if a closure is applied [156].
Due to the computational constraints placed on analysing structural identifiability of nonpolynomial ODE models, we do not attempt to apply the moment dynamics approach to the stochastic β-insulin-glucose circuit model. However, for many models, a mean-field closure corresponds to an ODE description of the system, and studying identifiability of this ODE model can aid practical identifiability analysis of the corresponding SDE. In our case, the corresponding ODE model is structurally non-identifiable due to a hyperbolic relationship between the two parameters of interest: for a fixed θ1 θ2, model outputs are indistinguishable [151]. The question of whether an SDE description can provide enough information to practically identify θ1 and θ2 can be answered through MCMC, however simple variants of MCMC can struggle when correlations between parameters are strong and non-linear. Therefore, we work in a transformed parameter space where, for the ODE model, is identifiable but
is not (figure 9). This analysis provides a better sense of whether the SDE model captures enough information to identify the parameters, and provides more robust results that are less dependent upon choices made in the MCMC algorithm.
4.2 Particle Markov-chain Monte Carlo
We demonstrate practical identifiability by calibrating each model to synthetic data using particle MCMC. We observe the MCMC chains to stabilise in a region of high posterior density, after which time transitions produce samples from the posterior distribution [29]. By visualising MCMC trace plots, we see that estimates of practically identifiable parameters also stabilise, but those of practically non-identifiable parameters do not. These results also demonstrate that, although estimates made of practically identifiable parameters are precise (that is, within a reasonable level of confidence), they are not necessarily accurate. For example, in figure 7g, the rate at which exposed individuals become infectious is practically identifiable, but it is underestimated compared to the target value, which could hint at model misspecification.
Given that particle MCMC is computationally expensive, our implementation of a standard technique to detect identifiability from pilot chains carries several advantages. First, pilot chains are regularly generated in the early stages of many inference procedures to establish efficient proposal kernels. Practical identifiability can, therefore, be established as part of an existing workflow. Second, more sophisticated methods are by their very nature more difficult to implement and dependent on practitioner choices, which could obscure results and require more algorithmic experimentation. In comparison, we take an automated approach: aside from the model and choice of prior, the procedure to perform MCMC for each model we consider is identical. Once identifiability is established, the computational cost of MCMC can be alleviated to some extent by adopting a more efficient inference technique. For example, adaptive MCMC [157], sequential Monte-Carlo [158], multi-level methods [159–161], sub-sampling techniques [162] and model-based proposal methods [163] provide significant performance improvements over the standard technique we employ. Further, we expect applying higher-order SDE simulation algorithms, such as a Runge-Kutta method [164], or considering GPU approaches to particle MCMC [165], to improve performance.
As we calibrate to synthetic data for the purpose of a didactic demonstration, we take a pragmatic approach by treating the true values as unknown. Hence, we initiate each chain as a random sample from the prior distribution. This involves a burn-in phase before the MCMC chain settles in an area of high posterior density. For computationally expensive models, such as those found in the cardiac modelling literature [166], synthetic data can be used with pilot chains initiated at the target values. If models have have already been calibrated to experimental data using, for example, maximum likelihood estimation, the chain can be initiated at the calibrated values. MCMC then, relatively cheaply, provides information about the posterior distribution about this point, akin to the Fisher information for models where it can be calculated [41].
MCMC can be applied to detect identifiability for any stochastic model provided an approximation to the likelihood is available. Recent developments to particle MCMC have seen its adoption for more complicated SDE models, such as SDE mixed effects models [167]. For systems with relatively small populations, it may be more appropriate to work directly with an SSA with, for example, a tau-leap method [54, 119]. Alternative approximations to the likelihood, such as those employed by ABC, may be necessary if model complexity requires; for example, should the model include spatial effects [18]. A major drawback of ABC in the context of identifiability is that one must typically decide a priori which features of the model and data to match. Common applications of ABC for SDE models match the mean and variance of system [103] or the mean square error between simulations and data [168]. Estimating the likelihood directly, as particle MCMC does, is advantageous when assessing identifiability as it is not clear a priori which features of the data and model are significant. For example, some systems might contain the information required for identifiability in higher-order moments or auto-correlations between time-series observations. If ABC is used, a variant that preserves features of the model distribution might be desirable [169].
4.3 Modelling noise
In contrast to many studies of identifiability analysis for ODEs, we do not pre-specify parameters in the observation distribution. In a deterministic modelling framework, it is common to assume that all the variability in the data is uncorrelated and sourced from the observation process [41, 90, 170]. Therefore, for an ODE model, the observation parameters can be reliably estimated using the pooled sample variance. For inference on the birth-death ODE model (figure 4), we see that, because the observation variance must now also account for intrinsic noise, the identified value of σerr is significantly larger than the target value. For an ODE model with additive homoscedastic Gaussian noise, the posterior mode (in the case of an improper uniform prior), maximum likelihood estimate and least-squares estimate are identical and are independent of the choice of the observed variance. For an SDE model, this is not the case as the intrinsic component of the noise is also modelled implicitly. Therefore, pre-specifying the observation variance could lead to biased estimates and obscure parameter identifiability. We account for this by treating the observation distribution variance as a nuisance parameter that we infer using MCMC, finding it to be identifiable in every case-study considered.
We have focussed our analysis on SDEs derived through the CLE, where the intrinsic noise can provide more information about the process. However, for large populations, the information contained in higher-order moments dissipates: to leading order, 〈X2〉 → 〈X〉2 as X → ∞. We see this in the epidemic model (figure 3c), where the variance is small compared to the scale of the mean. This loss of information in higher-order moments will not be detected by structural identifiability analysis of the moment equations, which is independent of the relative sizes of each moment. As populations become large, the information tends towards that obtained from the equivalent ODE system: this is the assumption behind many mean-field models. There are, however, many other models that contain sources of variability in their own right. For example, Mummert and Otunuga [61] study identifiability of an epidemic model where the infection rate varies according to a white noise process. Other external effects, such as seasonal effects, are often incorporated into epidemic models [171, 172]. In other systems, extrinsic noise describing, for example, the environment, forms a core part of the process and is described by an SDE independent of the population size [48]. Grey-box models use a diffusion term to characterise uncertain physiological effects [63] that could obscure inference, rather than contain information. Making high-level assumptions about which noise process contains information can help with some of the computational challenges by formulating hybrid models containing a mixture of ODEs and SDEs. Particle MCMC carries across, trivially, to any Itô SDE, and the moment equation approach can be applied provided a system of moments be constructed. We have not considered identifiability of SDE models containing non-diffusion noise, such as coloured noise or jump noise. These models lend themselves to different inference techniques, such as forms of rejection sampling [173].
4.4 Approaches to computational challenges
The primary computational cost of working with SDE models stems from the need to simulate a suite of trajectories at each iteration of the particle MCMC algorithm. This cost increases not only with the dimensionality of the problem (as for deterministic models) but also with the amount of data, since the number of particles required for an unbiased likelihood estimate increases with the sample size [25,100]. We see this, in particular, when conducting practical identifiability analysis of the two-pool and β-insulin-glucose models. These issues have important ramifications for identifiability, as it may not always be feasible to increase the amount of experimental data to rectify practical non-identifiability. Working with a surrogate model, such as a system of moment equations, can help alleviate some of these challenges. For example, establishing structural identifiability—which is requisite for parameter estimation [28]—indicates that the computational investment is worthwhile. Furthermore, such surrogate models can also form a computationally efficient alternative for assessing practical identifiability and performing inference [156,174], while still capturing more information than a purely deterministic description.
A large class of high-dimensional stochastic models lend themselves to structural identifiability analysis through moment equations. For example, CLE descriptions of multi-state ion-channel systems [57] and cascades with many bimolecular reactions [175], can be analysed in terms of a surrogate model using moment equations. This approach can be used because the propensity functions are often polynomial. However, the systems of moment equations are often infinite and require a moment closure approximation to facilitate this analysis. When a moment closure assumption is required, we find that newer software, such as GenSSI2 in Matlab, is computationally advantageous over DAISY (table 3). Further, for larger systems, it may not be necessary to consider a full system of second-order moments: a closed system that neglects covariances is also a potentially useful surrogate model.
Many stochastic problems are both computationally expensive to assess using particle MCMC and may not directly permit moment dynamics analysis. The β-insulin-glucose model comprises eight reactions and takes approximately 30 hours to perform 10,000 iterations of a pilot chain2. These issues are magnified by the quantity of data we consider in this example: five time-series of 15 observations each. Non-polynomial propensity functions mean that an exact expression for the moment equations cannot be derived, so it may not be possible to pre-detect structural non-identifiability through the moment dynamics approach. Fortunately, other approaches, such as those that use polynomial chaos [176], Gaussian processes [177], or the linear noise approximation [93] can provide alternative means of deriving surrogate models. This kind of approximation is already routine in the field of uncertainty quantification, which has deep connections to identifiability [178, 179]. In the future, many of these ideas could allow tractable structural and practical identifiability analysis of large systems of SDEs and, by extension, analysis of spatial problems described by stochastic partial differential equations (SPDEs).
The computational cost of MCMC, in particular for stochastic models with many parameters, has spurred the development of alternatives to explore and exploit the geometry of the likelihood near parameter estimates. Komorowski et al. [93] apply the linear noise approximation—which captures first and second order behaviour—to perform sensitivity analysis and calculate the Fisher information matrix for stochastic chemical kinetics models. The concept of information geometry [153,180] generalises Fisher information and can be applied to detect identifiability through local information [181], and improve the performance of MCMC algorithms [182]. For SDE models in particular, variational Bayesian techniques provide an efficient alternative to MCMC for parameter estimation [183]. In many cases, mathematical models are calibrated to experimental data to establish the value of a biophysical parameter, not to fully parameterise a model. Profile likelihood [92,184] is widely applied to assess identifiability in ODE models by maximising out parameters that are not of direct interest to reduce the dimensionality of the analysis. Since the bootstrap particle filter that we employ estimates the likelihood function, profile likelihood could be applied to SDE problems.
5 Conclusion
It is essential to consider identifiability when performing inference. Yet, there is a scarcity of methods available for assessing identifiability of the stochastic models that are becoming increasingly important. We have provided, through this review, an introduction to identifiability and a guide for performing identifiability analysis of SDE models in systems biology. By formulating a system of moment equations, we show how existing techniques for structural identifiability analysis of ODE models can be applied directly to SDE models [28, 34, 35, 150]. Through synthetic data and particle MCMC, we have demonstrated how to establish practical identifiability for SDE models from data [29, 51].
The analysis we demonstrate is critical for tailoring model complexity to the available data [28]. When a structurally identifiable model is found to be practically non-identifiable, identifiability analysis can guide experiment design to discern the quality and quantity of data required to estimate model parameters [185]. On the other hand, models found to be structurally non-identifiable should be re-parameterised, reduced in complexity, or changed [87,186]. Moving from an ODE to an SDE model can often provide enough information to render an otherwise structurally non-identifiable parameter identifiable: we demonstrate this with the birth-death process. As increasing computing power facilitates inference of complex stochastic models, we expect identifiability to become ever more relevant.
Data availability
This study contains no experimental data. Code used to produce the numerical results is available as a Julia module on GitHub at github.com/ap-browning/SDE-Identifiability.
Author Contributions
All authors designed the research; APB performed the research and wrote the manuscript; APB and DJW implemented the computational algorithms. All authors provided direction, feedback and gave approval for final publication.
Acknowledgements
This work was supported by the Australian Research Council (DP200100177) and the Air Force Office of Scientific Research (BAA-AFRL-AFOSR-20160007). A.P.B. acknowledges support from the Australian Centre for Excellence in Mathematical and Statistic Frontiers International Mobility Program. R.E.B. would like to thank the Leverhulme Trust for a Leverhulme Research Fellowship, the Royal Society for a Wolfson Research Merit Award, and the BBSRC for funding via BB/R00816/1. The authors thank Oliver Maclaren for helpful discussions, and the three anonymous referees for their comments.
Footnotes
Minor textual changes throughout (no new results).
↵1 Code available on Github at https://github.com/ap-browning/SDE-Identifiability
↵2 Runtimes for all results produced are available on Github at https://github.com/ap-browning/SDE-Identifiability
References
- [1].↵
- [2].↵
- [3].
- [4].
- [5].↵
- [6].
- [7].↵
- [8].↵
- [9].↵
- [10].
- [11].
- [12].↵
- [13].↵
- [14].
- [15].
- [16].
- [17].
- [18].↵
- [19].↵
- [20].↵
- [21].
- [22].↵
- [23].↵
- [24].↵
- [25].↵
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].↵
- [32].
- [33].↵
- [34].↵
- [35].↵
- [36].↵
- [37].
- [38].↵
- [39].
- [40].↵
- [41].↵
- [42].↵
- [43].
- [44].
- [45].↵
- [46].
- [47].
- [48].↵
- [49].↵
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- [54].↵
- [55].↵
- [56].↵
- [57].↵
- [58].↵
- [59].↵
- [60].↵
- [61].↵
- [62].↵
- [63].↵
- [64].↵
- [65].
- [66].
- [67].↵
- [68].↵
- [69].↵
- [70].↵
- [71].↵
- [72].↵
- [73].↵
- [74].↵
- [75].↵
- [76].
- [77].↵
- [78].↵
- [79].↵
- [80].↵
- [81].
- [82].↵
- [83].↵
- [84].
- [85].
- [86].↵
- [87].↵
- [88].↵
- [89].
- [90].↵
- [91].
- [92].↵
- [93].↵
- [94].↵
- [95].
- [96].
- [97].↵
- [98].↵
- [99].↵
- [100].↵
- [101].↵
- [102].↵
- [103].↵
- [104].↵
- [105].↵
- [106].
- [107].↵
- [108].↵
- [109].↵
- [110].↵
- [111].↵
- [112].↵
- [113].↵
- [114].↵
- [115].↵
- [116].↵
- [117].↵
- [118].↵
- [119].↵
- [120].↵
- [121].↵
- [122].
- [123].↵
- [124].↵
- [125].
- [126].↵
- [127].↵
- [128].↵
- [129].↵
- [130].↵
- [131].↵
- [132].↵
- [133].↵
- [134].↵
- [135].↵
- [136].
- [137].
- [138].↵
- [139].↵
- [140].↵
- [141].↵
- [142].↵
- [143].↵
- [144].↵
- [145].↵
- [146].↵
- [147].↵
- [148].↵
- [149].↵
- [150].↵
- [151].↵
- [152].↵
- [153].↵
- [154].↵
- [155].↵
- [156].↵
- [157].↵
- [158].↵
- [159].↵
- [160].
- [161].↵
- [162].↵
- [163].↵
- [164].↵
- [165].↵
- [166].↵
- [167].↵
- [168].↵
- [169].↵
- [170].↵
- [171].↵
- [172].↵
- [173].↵
- [174].↵
- [175].↵
- [176].↵
- [177].↵
- [178].↵
- [179].↵
- [180].↵
- [181].↵
- [182].↵
- [183].↵
- [184].↵
- [185].↵
- [186].↵