## Abstract

Parameter inference from high-dimensional data is challenging, and microbiome time series data is no exception. Methods aimed at predicting from point estimates exist, but they often fail to recover the true parameters even from simulated data. Computational methods to robustly infer model parameters and quantify their uncertainty are needed. Here, we propose a computational workflow addressing such challenges, allowing us to compare mechanistic models and identify the values and the certainty of inferred parameters. This approach allows us to infer which kinds of interactions occur in the microbial community. In contrast to point-estimate inference, our outcome, the distribution of the parameters, reflects their uncertainty. To achieve this, we consider at least as many equations for the statistical moments of the microbiome as there are parameters. Our inference workflow, which builds upon a mechanistic foundation of microscopic processes, can take into account that metagenomic datasets commonly provide information only on relative abundances and ensembles of hosts. With our framework, we move from qualitative prediction to quantifying the likelihood of certain interaction types in microbiomes.

Numerous studies have shown how important the microbiome is for its host, ranging from development to health [1, 2]. The promise of manipulating the microbiome relies on understanding the ecological and evolutionary processes operating on it [3]. Although metagenomic studies have widely characterized microbiome samples [4], their connection to mathematical models and eco-evolutionary theories lags behind. Part of the gap is explained by an intrinsic difficulty in analysing microbiome data [5], in particular the inverse problem of robustly inferring model parameters – and thus interactions between microbes – from data (Fig. 1 A). Pioneering studies have achieved point (“best guess”) estimates that allow accurate predictions of the dynamics [6], but fall short in recovering the true parameters from simulations [5]. This apparent contradiction, rooted in the high dimensionality of the parameter space and the incomplete nature of the data, has been discussed by Cao et al. [5]. Others have suggested using Bayesian inference – where probabilities are assigned to parameter values – to go beyond point estimates [7]. Here, we present a computational workflow that starts from defining microscopic transition rates in a mathematical model – describing ecological and evolutionary events (such as birth, migration, mutation or speciation) – and works with macroscopic statistical moments of the microbiome composition. Our Bayesian inference workflow, which naturally bypasses known limitations of point-estimate inference [5], is sufficiently flexible to test mathematical models as diverse as microbiome samples while quantifying the parameter uncertainty stemming from data limitations (Fig. 1 C) – including its extrinsic noise. Two classical ecological models, logistic growth and Lotka-Volterra, are used to illustrate its application to datasets describing absolute or relative abundances of microbes. The inference workflow outlined here bridges a gap between microbiome data and theoretical modelling.

## Results

### Developing an inference workflow

We propose a parameter inference workflow grounded on a mechanistic description of the dynamics of absolute abundances in a microbiome (Fig. 1 C). First, we write down microscopic transition rates *T* describing changes in the microbiome composition of one host, the vector **n**, to other compositions (**n**′), given the parameter set **θ**. Note that these rates can be functions of time *t* as well,

$$T(\mathbf{n} \to \mathbf{n}';\, \boldsymbol{\theta}, t).$$

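Now, instead of tracking the microbiome composition **n** in a single host, we can describe how the probability of observing a microbiome composition **n** in an ensemble of hosts, *P*(**n**, *t*), changes with time,

$$\frac{dP(\mathbf{n},t)}{dt} = \sum_{\mathbf{n}' \neq \mathbf{n}} \Big[ \underbrace{T(\mathbf{n}' \to \mathbf{n};\, \boldsymbol{\theta}, t)\, P(\mathbf{n}',t)}_{\text{influx}} - \underbrace{T(\mathbf{n} \to \mathbf{n}';\, \boldsymbol{\theta}, t)\, P(\mathbf{n},t)}_{\text{outflux}} \Big].$$

In this expression, called the master equation [8], the probability influx and outflux terms indicate an increase or decrease in the probability of composition **n**. These dynamics depend on the ecological and evolutionary processes contained in the microscopic transition rates.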

Using the master equation, we can derive equations for the statistical moments of the microbiome composition in an ensemble of hosts – namely, the product of the master equation by a variable of interest (*g*_{k}, where *k* is an identifying index) summed over all possible microbiome compositions **n**,

$$\frac{d\langle g_k \rangle}{dt} = \sum_{\mathbf{n}} g_k \frac{dP(\mathbf{n},t)}{dt}.$$

For example, computing the average abundance of microbial type *i* implies setting *g*_{k} = *n*_{i}, whereas for the co-moment of microbial types *i* and *j*, *g*_{k} = *n*_{i}*n*_{j}. Equations like these describe the expected macroscopic dynamics of the microbiome. To obtain a system of moment equations that fully determines the parameters, we have to derive at least as many equations as there are parameters in the model. For example, in a Lotka-Volterra model with *S* microbial types, there are *S* growth rates and *S*^{2} intra- and inter-specific interactions, amounting to *S* + *S*^{2} parameters. Thus, we would need *S* + *S*^{2} equations. We can use the *S* equations for the first moments ⟨*n*_{k}⟩, *S* for the second moments ⟨*n*_{k}^{2}⟩, and *S*(*S* − 1) for the co-moments ⟨*n*_{k}*n*_{l}⟩ and covariances ⟨*n*_{k}, *n*_{l}⟩ (see Sup. Methods A). Note that each equation can depend on the vector of other moments, i.e., *d*⟨*g*_{k}⟩*/dt* = *f*(⟨**g**⟩, **θ**, *t*). A step-by-step derivation from microscopic rates up to second-order moments for the logistic growth and Lotka-Volterra models can be found in the Sup. Methods A. These models include conventional ecological events, such as growth, death, immigration, and direct and indirect interactions.

We now have the elements to infer the parameters **θ** from microbiome data. The idea of Approximate Bayesian Computation (ABC) is to identify feasible parameter values by comparing the data to model predictions [7]. Specifically, for a given set of parameter values **θ**, a distance between the numerical solution of the equations for the moments, ⟨*g*_{k}⟩, and the equivalent moments estimated from data is computed, e.g., the Euclidean distance. If this distance is smaller than a threshold *ε*, the set is considered valid. By testing sets of parameters sampled according to an expectation (the prior distribution) and recording those below the threshold *ε*, a posterior distribution of the parameters reflecting the uncertainty of the inference can be obtained (Fig. 1 C). With a smaller threshold *ε*, this posterior can become the new prior and the process can be iterated to improve it. This is called Approximate Bayesian Computation - Sequential Monte Carlo (ABC-SMC). We show how to choose prior distributions of the parameters in the Sup. Methods C.
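The accept/reject step described above can be sketched in a few lines of plain Python. This is a minimal illustration of rejection ABC only, not the ABC-SMC implementation from pyABC used in this work. The toy model (exponential growth of a single first moment), the uniform prior, the threshold value, and all function names are assumptions made for the example.

```python
import math
import random

def model_moments(r, times, n0=10.0):
    """Toy stand-in for the numerical solution of the moment equations:
    the mean abundance of a single type growing exponentially at rate r."""
    return [n0 * math.exp(r * t) for t in times]

def euclidean_distance(predicted, observed):
    """Euclidean distance between predicted and observed moments."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)))

def abc_rejection(observed, times, n_draws, epsilon, rng):
    """Keep prior draws whose predicted moments lie within epsilon of the data."""
    accepted = []
    for _ in range(n_draws):
        r = rng.uniform(0.0, 2.0)  # hypothetical uniform prior on the growth rate
        if euclidean_distance(model_moments(r, times), observed) < epsilon:
            accepted.append(r)
    return accepted

times = [0.0, 1.0, 2.0, 3.0]
observed = model_moments(0.5, times)  # synthetic "data" generated with r = 0.5
posterior = abc_rejection(observed, times, n_draws=2000, epsilon=5.0,
                          rng=random.Random(1))
posterior_mean = sum(posterior) / len(posterior)
```

The accepted values in `posterior` approximate the posterior distribution of the growth rate; a sequential scheme would reuse them as the next prior with a smaller `epsilon`.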

### Properties of microbiome data

Given a microbiome dataset, all statistical moments can be estimated from the raw data. Concretely, this is done by averaging the variable of interest, *g*_{k}, over all replicates at each time point (Fig. 1 C). For example, for *g*_{k} = *n*_{i}, the replicates of *n*_{i} are simply summed and divided by the number of replicates, while for *g*_{k} = *n*_{i}*n*_{j}, the products of *n*_{i} and *n*_{j} are computed for each replicate, then summed and finally divided by the number of replicates.
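As a minimal sketch of this averaging, the following function estimates first moments and co-moments at a single time point. The function name and the abundance values are hypothetical and serve only as an illustration.

```python
def empirical_moments(replicates):
    """Estimate first moments and (co-)moments at one time point.

    `replicates` is a list of abundance vectors, one per host (replicate).
    Returns (means, second), where means[i] estimates <n_i> and
    second[i][j] estimates <n_i n_j> (the second moment when i == j).
    """
    n_rep = len(replicates)
    n_types = len(replicates[0])
    means = [sum(rep[i] for rep in replicates) / n_rep for i in range(n_types)]
    second = [[sum(rep[i] * rep[j] for rep in replicates) / n_rep
               for j in range(n_types)]
              for i in range(n_types)]
    return means, second

# Three hypothetical hosts, two microbial types, one time point.
hosts = [[10, 2], [14, 4], [12, 6]]
means, second = empirical_moments(hosts)
# Covariance <n_0, n_1> = <n_0 n_1> - <n_0><n_1>
covariance_01 = second[0][1] - means[0] * means[1]
```

Repeating the call for every sampled time point yields the time series of moments that the model predictions are compared against.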

Microbiome data is nowadays typically produced by metagenome sequencing. Conventionally, for technical reasons metagenomics only quantifies the *relative abundance* of each microbial type in a sample (Fig. 1 C) [9]. More recently, some studies have measured absolute numbers of culturable [10] and non-culturable microbes in samples [11]. We call these counts *absolute abundances*.

Our former equations only track moments of absolute abundance, ⟨*g*_{k}⟩. As Gloor et al. [9] show, inferring parameters from relative abundance (*x*_{k}) data using them would lead to spurious correlations (Fig. 2 A-B). To find equivalent expressions for the statistical moments of relative abundance, first, we propose *n*_{Σ} ≡ ∑_{j} *n*_{j} to be defined as a scaling factor and a dynamical equation for its first moment, ⟨*n*_{Σ}⟩. Then, a transformation to moments of relative abundances, ⟨*γ*_{k}⟩, is given by,

Because relative abundance datasets lack information about the scaling factor, its initial condition, ⟨*n*_{Σ}⟩ at *t* = 0, must be inferred as a free parameter, i.e., one more parameter than for absolute abundance data. Note that because ∑_{k} *x*_{k} = 1, the number of equations for the microbial types decreases but the number of parameters per type remains the same. A detailed derivation of the transformations to relative abundance for the logistic growth and Lotka-Volterra models is shown in the Sup. Methods A.

Apart from tracking the microbiome of the same hosts over time, as in animal gut studies, hosts sampled at different time points are commonly pooled together to produce a single time series (Fig. 1 B). This is the case when hosts are sacrificed during sampling, as in experimental studies of *D. melanogaster, C. elegans*, and *Hydra vulgaris* [10]. In contrast to deterministic models, our workflow can deal with pooled hosts because it accounts for stochastic demography. More concretely – akin to the concept of biological replicates – if the parameter values and initial conditions are the same in each host sampled, we can account for their emerging demographic differences, i.e., differences in microbiome composition resulting from stochasticity.
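To illustrate how hosts with identical parameters and initial conditions still diverge, the sketch below runs a Gillespie-type stochastic simulation of a single microbial type in twenty hosts. The transition rates, parameter values, and function name are hypothetical and chosen only for illustration; they are not necessarily the rates defined in the Sup. Methods A.

```python
import random

def gillespie_host(n0, t_end, f, phi, m, N, rng):
    """Simulate one host's single-type abundance with hypothetical rates:
    gain (birth + immigration) = (f*n + m) * (1 - n/N), loss (death) = phi*n."""
    n, t = n0, 0.0
    while True:
        gain = (f * n + m) * (1.0 - n / N)
        loss = phi * n
        total = gain + loss
        if total <= 0.0:
            return n
        t += rng.expovariate(total)  # waiting time to the next event
        if t > t_end:
            return n
        n += 1 if rng.random() * total < gain else -1

rng = random.Random(7)
# Twenty hosts with identical parameters and initial condition.
finals = [gillespie_host(n0=10, t_end=5.0, f=1.0, phi=0.1, m=1.0, N=100, rng=rng)
          for _ in range(20)]
mean_n = sum(finals) / len(finals)
var_n = sum((n - mean_n) ** 2 for n in finals) / len(finals)
```

The spread of `finals` across hosts is the demographic variation that a deterministic model cannot represent, while the ensemble moments `mean_n` and `var_n` are the quantities tracked by the moment equations.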

### Inference from simulated and empirical data

We tested our inference workflow in two ways. Firstly, we recovered the correct parameters from simulated relative abundance data (Fig. 2). Our approach proved successful in cases with and without inter-specific interactions, namely, data simulated from logistic growth and Lotka-Volterra models. In both cases, our approach identified the correct model while also estimating the parameter values and their certainty.

Secondly, we inferred parameters from an empirical time series of absolute abundances in a reduced mouse microbiome – OMM^{12} [12] (Fig. 3). The posteriors suggested the growth rates of *Akkermansia muciniphila, Bacteroides caecimuris, Bifidobacterium longum*, and *Muribaculum intestinale* to be most certain, with average doubling times ranging from hours to days. Meanwhile, except for *B. caecimuris*, the average death and immigration rates were less certain, ranging from *≈* 4 · 10^{5} to· 10^{6} cells per day. Most of the certainties obtained from empirical data (Fig. 3) are smaller than those from simulations (Fig. 2), highlighting the limits of the model tested and of inference from noisy empirical data. However, in each case we obtained a set of parameters – capturing interactions between microbes – with some level of certainty. To exemplify the utility of our outcome, our results point to selection as the ecological driver of the OMM^{12} dynamics, despite a possible compatibility of this data with a neutral hypothesis once it has reached steady state [13]. In this case, neutrality would imply that the parameter posteriors overlap between microbial types, which is not the case (Fig. 3).

## Discussion

The work presented here is motivated by the goal of understanding how the microbes in a microbiome interact and by the need to quantify the uncertainty of parameter estimation from microbiome data. Although inference methods for point estimation have been used [6], several issues limit their quantitative application, restricting them to recreating qualitative patterns of data [5]. A major issue is the indeterminacy of models with more parameters than equations [5, 6, 14]. We suggest a solution to this issue by deriving equations for the statistical moments of the microbiome composition – at least as many as parameters. In fact, our approach is driven by a mechanistic spirit, where microscopic rates must be written down first. As opposed to approaches where analytic solutions – or expensive stochastic simulations – are needed, here a numerical solution is sufficient to quantify the distance to data [15]. This allows our workflow to handle very diverse models, where model comparison [15] in a Bayesian sense is possible.

The workflow is not limited by the properties of microbiome data [9]. As we have shown, analyzing datasets describing the relative abundance of microbial types – even if the total absolute abundance is dynamic [14] – is possible. Such is the nature of metagenomic sequencing data – the most common method to characterize microbiomes [5]. In addition, by tracking statistical moments of the microbiome, our approach naturally accounts for the diverse types of experimental samplings, such as those where ensembles of hosts are used to obtain a single time series. Concretely, compared to other methods, we track the demographic variation between hosts explicitly, and assign the remaining variation to external noise.

Although our approach is designed for longitudinal (time series) data, analyzing single time points (snapshot data) is possible. For example, if data is assumed to be at steady state, the aim of the inference method is to find parameters that make the dynamical equations for the moments equal to zero. This does not mean that the moments are zero, but that their rate of change is. Nevertheless, as our results illustrate, given the various sources of uncertainty, time series data leads to better parameter inference, particularly from time intervals of “high activity” where many changes occur [5]. As Cao et al. [5] proposed, several of these intervals could be analysed simultaneously to improve the inference.

Bayesian inference can suffer from the curse of dimensionality in large and diverse systems [15]. By tracking numerous statistical moments that are readily solved numerically, we believe our approach, combined with data of sufficiently high quality, can overcome this to some extent – exploring the parameter space thoroughly in a reasonable time. We implemented an Approximate Bayesian Computation with Sequential Monte Carlo in our workflow using tools from the Python package pyABC [16] (Sup. Methods C), where further optimizations could greatly improve its wider applicability [7]. As a proof of principle, we applied our workflow to two simulated relative abundance datasets and recovered the true parameter values. We also applied it to a reduced microbiome in mice [12], where we estimated values and certainties of parameters describing logistic growth.

In summary, we presented a Bayesian inference workflow bridging microbiome data to theoretical modelling. By inferring parameters from datasets of microbial absolute and relative abundances, we showed its robustness – identifying likely interactions and the certainty of parameters in simulated and empirical data. Because mechanistic rates serve as stepping stones of the workflow, similar microscopic models could replace the two classical ecological models that we illustrated – including experimentally informed models.

## Author contributions

RZC and AT developed the concept. RZC and FB worked on the methods. RZC wrote the first draft; all authors reviewed and approved the final version.

## Availability of data and software

The data generated and software used for the analyses are available on GitHub (https://github.com/romanzapien/microbiome-inference.git).

## Acknowledgements

We thank the *Theoretical Biology Department* in the MPI Plön and the *Collaborative Research Centre 1182: Origins and Functions of Metaorganisms* for the fruitful discussions. Finally, we thank the Max Planck Society and the CRC 1182 for the funding provided.

## Appendix

### A. Derivation of dynamical equations for the microbiome moments

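To track the statistical moments of a model, e.g. averages, variances, and covariances, we have to account for the stochasticity of events. Thus, a description of the probability of microbiome compositions is needed. The change in probability of each microbiome composition is described by the master equation,

$$\frac{dP(\mathbf{n},t)}{dt} = \sum_{i} \Big[ T(\mathbf{n}-\mathbf{e}_i \to \mathbf{n})\, P(\mathbf{n}-\mathbf{e}_i,t) + T(\mathbf{n}+\mathbf{e}_i \to \mathbf{n})\, P(\mathbf{n}+\mathbf{e}_i,t) - \big[ T(\mathbf{n} \to \mathbf{n}+\mathbf{e}_i) + T(\mathbf{n} \to \mathbf{n}-\mathbf{e}_i) \big] P(\mathbf{n},t) \Big],$$

where **n** is the vector of absolute microbial abundances, and **e**_{i} is the amount of change, a vector with one in the *i-th* entry and zero otherwise.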

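Dynamical equations for the statistical moments can be obtained from the master equation by multiplication and subsequent summation. E.g., for the first moment ⟨*n*_{k}⟩, equivalent to the average, we have

$$\frac{d\langle n_k \rangle}{dt} = \sum_{n_1=0}^{\infty} \cdots \sum_{n_S=0}^{\infty} n_k \frac{dP(\mathbf{n},t)}{dt},$$

where for convenience, we make summations more explicit. For the second moment ⟨*n*_{k}^{2}⟩, we have

$$\frac{d\langle n_k^2 \rangle}{dt} = \sum_{n_1=0}^{\infty} \cdots \sum_{n_S=0}^{\infty} n_k^2 \frac{dP(\mathbf{n},t)}{dt},$$

and for the co-moments ⟨*n*_{k}*n*_{l}⟩,

$$\frac{d\langle n_k n_l \rangle}{dt} = \sum_{n_1=0}^{\infty} \cdots \sum_{n_S=0}^{\infty} n_k n_l \frac{dP(\mathbf{n},t)}{dt}.$$

For models with a finite carrying capacity, the upper sum limit is changed to a finite number.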

#### A.1. Logistic growth with immigration and death

Let us exemplify the former steps with a logistic growth model. Similarly to Allouche and Kadmon [17], let us define the microscopic transition rates for one microbial population *i*
where *N* is the shared carrying capacity, *f*_{i} is the maximum growth rate, and *ϕ*_{i} and *m*_{i} are the death and immigration rates for each type *i*.

Now, we illustrate how to derive dynamical equations for the moments. Let us start with the first moment,
where the first four lines describe birth or death of a microbe of type *k* and the last four lines describe birth or death of a microbe of type *i* different from *k*. Note that, by definition, at the boundaries *T*(**n** → **n** + **e**_{i}) = 0 for *n*_{i} = *N* and *T*(**n** → **n** − **e**_{i}) = 0 for *n*_{i} = 0, so their summation indices go up to *n*_{i} = *N* − 1, or start from *n*_{i} = 1, respectively.

After appropriate transformations of variables to only deal with *P* (**n**, *t*) and re-indexing, we obtain

Note that the last four terms reduce to zero, and that at the boundaries and , which allows including *n*_{k} = 0 and *n*_{k} = *N* in the summations. Simplifying, we find
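$$\frac{d\langle n_k \rangle}{dt} = \sum_{\mathbf{n}} \big[ T(\mathbf{n} \to \mathbf{n}+\mathbf{e}_k) - T(\mathbf{n} \to \mathbf{n}-\mathbf{e}_k) \big] P(\mathbf{n},t), \tag{13}$$

and substituting the transition rates *T* (**n** → **n** + **e**_{i}) and *T* (**n** → **n** − **e**_{i}) from Eqs. (10) leads to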

For other moments and models similar derivations can be done.

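For the second moment, we find

$$\frac{d\langle n_k^2 \rangle}{dt} = \sum_{\mathbf{n}} \big[ (2n_k + 1)\, T(\mathbf{n} \to \mathbf{n}+\mathbf{e}_k) - (2n_k - 1)\, T(\mathbf{n} \to \mathbf{n}-\mathbf{e}_k) \big] P(\mathbf{n},t), \tag{15}$$

which after substituting *T* (**n** → **n** + **e**_{i}) and *T* (**n** → **n** − **e**_{i}) from Eqs. (10) reduces to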

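For the co-moments, we find

$$\frac{d\langle n_k n_l \rangle}{dt} = \sum_{\mathbf{n}} \Big[ n_l \big[ T(\mathbf{n} \to \mathbf{n}+\mathbf{e}_k) - T(\mathbf{n} \to \mathbf{n}-\mathbf{e}_k) \big] + n_k \big[ T(\mathbf{n} \to \mathbf{n}+\mathbf{e}_l) - T(\mathbf{n} \to \mathbf{n}-\mathbf{e}_l) \big] \Big] P(\mathbf{n},t), \tag{17}$$

which after substituting *T* (**n** → **n** + **e**_{i}) and *T* (**n** → **n** − **e**_{i}) from Eqs. (10) leads to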

Because each equation depends on higher moments, e.g., *d*⟨*n*_{k}*n*_{l}⟩*/dt* depends on ⟨*n*_{k}*n*_{l}*n*_{j}⟩, it is not possible to solve this system of equations without additional assumptions. However, one can find approximate expressions, where lower moments replace higher moments. For example, third-order moments such as ⟨*n*_{k}*n*_{l}*n*_{j}⟩ are approximated as functions of lower moments such as ⟨*n*_{k}*n*_{l}⟩ and ⟨*n*_{j}⟩. Concretely, we can approximate e.g. ⟨*n*_{k}*n*_{l}*n*_{j}⟩ *≈* ⟨*n*_{k}*n*_{l}⟩ ⟨*n*_{j}⟩. This technique, called moment closure approximation, leads to a closed system of ODEs and we use it in our approach. Kuehn [18] provides a thorough review of this technique.
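A minimal sketch of how a closed moment system can be integrated numerically is given below. The single-type transition rates, the parameter values, and the function names are hypothetical and are not taken from Eqs. (10). The closure used here is the normal closure (third central moment set to zero), one of the options reviewed by Kuehn [18]; it is chosen for illustration and is not necessarily the closure used in this work.

```python
def moment_rhs(mean, second, f, phi, m, N):
    """Closed equations for <n> and <n^2> of a single microbial type.

    Hypothetical rates (illustration only):
      T+ = (f*n + m) * (1 - n/N)   birth and immigration
      T- = phi * n                 death
    """
    # Normal closure: the third central moment is set to zero.
    third = 3.0 * mean * second - 2.0 * mean ** 3
    a = f - m / N
    gain = a * mean - f * second / N + m            # <T+>
    loss = phi * mean                               # <T->
    n_gain = a * second - f * third / N + m * mean  # <n T+>
    n_loss = phi * second                           # <n T->
    d_mean = gain - loss
    d_second = 2.0 * (n_gain - n_loss) + gain + loss
    return d_mean, d_second

def shifted(y, k, h):
    """Return y + h*k component-wise."""
    return tuple(yi + h * ki for yi, ki in zip(y, k))

def integrate(y0, t_end, dt, **params):
    """Fixed-step fourth-order Runge-Kutta integration of the closed system."""
    y = tuple(y0)
    for _ in range(int(round(t_end / dt))):
        k1 = moment_rhs(*y, **params)
        k2 = moment_rhs(*shifted(y, k1, dt / 2), **params)
        k3 = moment_rhs(*shifted(y, k2, dt / 2), **params)
        k4 = moment_rhs(*shifted(y, k3, dt), **params)
        y = tuple(yi + dt / 6 * (a + 2 * b + 2 * c + d)
                  for yi, a, b, c, d in zip(y, k1, k2, k3, k4))
    return y

# All hosts start with exactly 10 cells, so <n^2> = 10**2 at t = 0.
mean, second = integrate((10.0, 100.0), t_end=50.0, dt=0.01,
                         f=1.0, phi=0.1, m=1.0, N=100.0)
variance = second - mean ** 2
```

The right-hand side follows the general forms of the first- and second-moment equations, with the third moment replaced by the closure so that only ⟨*n*⟩ and ⟨*n*^{2}⟩ need to be tracked.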

#### A.2. Lotka-Volterra

Now, for a model with intra- and inter-specific interactions, let us define the transition rates,

$$T(\mathbf{n} \to \mathbf{n}+\mathbf{e}_i) = n_i \Big( f_i + \sum_{j} A_{i,j}\, n_j \Big), \qquad T(\mathbf{n} \to \mathbf{n}-\mathbf{e}_i) = n_i \sum_{j} B_{i,j}\, n_j, \tag{19}$$

where *A* and *B* are matrices with non-negative entries containing the interactions, satisfying *A*_{i,j} = 0 if *B*_{i,j} *>* 0, and *B*_{i,j} = 0 if *A*_{i,j} *>* 0. Ecologically, while interactions in *A* promote growth, those in *B* lead to death. Finally, *f*_{i} is the intrinsic growth rate.

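For the first moment, similarly to Eq. (13), we have

$$\frac{d\langle n_k \rangle}{dt} = \sum_{\mathbf{n}} \big[ T(\mathbf{n} \to \mathbf{n}+\mathbf{e}_k) - T(\mathbf{n} \to \mathbf{n}-\mathbf{e}_k) \big] P(\mathbf{n},t),$$

which after substituting *T* (**n** → **n** + **e**_{i}) and *T* (**n** → **n** − **e**_{i}) from Eqs. (19), becomes

$$\frac{d\langle n_k \rangle}{dt} = f_k \langle n_k \rangle + \sum_{j} \big( A_{k,j} - B_{k,j} \big) \langle n_k n_j \rangle,$$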
which takes the form of the conventional, deterministic Lotka-Volterra equations for the abundance with growth rate *f*_{k} and interaction matrix *A*_{k,j} − *B*_{k,j}.

For the second moment, similarly to Eq. (15)
which after substituting *T* (**n** → **n** + **e**_{i}) and *T* (**n** → **n** − **e**_{i}) from Eqs. (19) leads to

For the co-moments, similarly to Eq. (17), we derive
which after substituting *T* (**n** → **n** + **e**_{i}) and *T* (**n** → **n** − **e**_{i}) from Eqs. (19) reduces to

As previously, a moment closure approximation is required to solve the system of equations.

#### A.3. From absolute to relative abundance

The former equations account for the change of absolute abundances. To focus on relative abundance data, we define the relative abundance as *x*_{i} = *n*_{i}/*n*_{Σ} and *n*_{Σ} = ∑_{j} *n*_{j} to serve as a scaling factor. Thus,

$$n_i = x_i\, n_\Sigma.$$

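Let us find the transformation to relative abundances for the first moment. Using the definition of the covariance ⟨*x*_{k}, *n*_{Σ}⟩ = ⟨*x*_{k}*n*_{Σ}⟩ *−* ⟨*x*_{k}⟩ ⟨*n*_{Σ}⟩, such that ⟨*x*_{k}*n*_{Σ}⟩ = ⟨*x*_{k}⟩ ⟨*n*_{Σ}⟩ + ⟨*x*_{k}, *n*_{Σ}⟩, we have

$$\frac{d\langle n_k \rangle}{dt} = \frac{d\langle x_k n_\Sigma \rangle}{dt} = \langle n_\Sigma \rangle \frac{d\langle x_k \rangle}{dt} + \langle x_k \rangle \frac{d\langle n_\Sigma \rangle}{dt} + \frac{d\langle x_k, n_\Sigma \rangle}{dt}.$$

Rearranging, the transformation is given by

$$\frac{d\langle x_k \rangle}{dt} = \frac{1}{\langle n_\Sigma \rangle} \left( \frac{d\langle n_k \rangle}{dt} - \langle x_k \rangle \frac{d\langle n_\Sigma \rangle}{dt} - \frac{d\langle x_k, n_\Sigma \rangle}{dt} \right).$$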

For second order moments, we use that and approximate . Then, using the chain rule

Rearranging, the transformations are given by
and
where the differential equation for ⟨*n*_{Σ}⟩ is given by

$$\frac{d\langle n_\Sigma \rangle}{dt} = \sum_{j} \frac{d\langle n_j \rangle}{dt}.$$

A close look at the dynamics of the covariances shows their contribution is negligible in large populations. To see this, let us write
after the appropriate transformations of variable to only deal with *P* (**n**, *t*) and re-indexing, we find

Note that if *n*_{Σ} ≫ 1, *n*_{Σ} *±* 1 *≈ n*_{Σ}, then, if either *n*_{k} ≫ 1, such that *n*_{k} *±* 1 *≈ n*_{k} or , the terms from the previous equation simplify, leading to

Similar arguments lead to conclude that, and

These approximations of the covariances are sensible in microbiomes, where *n*_{Σ}, *n*_{k} ≫ 1 is often the case. Moreover, in the infinite population limit, covariances must be zero.

Putting all together, the change of the first moment of relative abundance in large populations is given by while for the second moments of relative abundance and

Finally, to solve these equations in terms of relative abundance, the change of variable *n*_{i} = *x*_{i}*n*_{Σ} is needed throughout. Like Joseph et al. [14], we see that the second term of each equation serves as a “correction factor” due to the fact that relative abundances must add up to one at all times.
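The first-moment transformation in large populations can be checked numerically. The sketch below assumes that the covariances ⟨*x*_{k}, *n*_{Σ}⟩ are negligible, as argued above, so that ⟨*x*_{k}⟩ ≈ ⟨*n*_{k}⟩/⟨*n*_{Σ}⟩. The function name and the numbers are hypothetical.

```python
def relative_first_moment_rates(dn_dt, n_mean):
    """Large-population transformation of first-moment rates.

    dn_dt[k]  : d<n_k>/dt for each microbial type (absolute abundance)
    n_mean[k] : <n_k> for each microbial type
    Returns d<x_k>/dt, neglecting the covariances <x_k, n_Sigma>.
    """
    n_sigma = sum(n_mean)                   # <n_Sigma> = sum_j <n_j>
    dn_sigma_dt = sum(dn_dt)                # d<n_Sigma>/dt = sum_j d<n_j>/dt
    x_mean = [n / n_sigma for n in n_mean]  # <x_k> ~ <n_k> / <n_Sigma>
    return [(dn - x * dn_sigma_dt) / n_sigma for dn, x in zip(dn_dt, x_mean)]

# Three hypothetical microbial types.
rates = relative_first_moment_rates(dn_dt=[30.0, -10.0, 5.0],
                                    n_mean=[600.0, 300.0, 100.0])
total_rate = sum(rates)
```

Because relative abundances add up to one, the returned rates sum to zero up to rounding error; the term subtracting ⟨*x*_{k}⟩ times the rate of change of ⟨*n*_{Σ}⟩ is the correction that enforces this.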

### B. True parameters in simulations

### C. Inference settings
