Inferring interactions from microbiome 1 data

Parameter inference of high-dimensional data is challenging and microbiome time 12 series data is no exception. Methods aimed at predicting from point estimates exist, 13 but often even fail to recover the true parameters from simulated data. Computational 14 methods to robustly infer and quantify the uncertainty in model parameters are needed. 15 Here, we propose a computational workﬂow addressing such challenges – allowing us 16 to compare mechanistic models and identify the values and the certainty of inferred 17 parameters. This approach allows us to infer which kind of interactions occur in the 18

here bridges a gap between microbiome data and theoretical modelling. (n ), given the parameter set θ. Note that these rates can be functions of time t as well, T (n → n , t) = f (n, θ, t). (1) Now, instead of tracking the microbiome composition n in a single host, we can describe how the For example, computing the average abundance of microbial type i implies setting g k = n i , whereas 64 for the co-moment of microbial types i and j, g k = n i n j . Equations like these describe the expected Microbiome data is nowadays typically produced by metagenome sequencing. Conventionally, for 94 technical reasons metagenomics only quantifies the relative abundance of each microbial type in a 95 sample ( Fig. 1 C) [9]. More recently, some studies have measured absolute numbers of culturable 96 [10] and non-culturable microbes in samples [11]. We call these counts absolute abundances.

97
Our former equations only track moments of absolute abundance, g k . As Gloor et al. [9] show, 98 inferring parameters from relative abundance (x k ) data using them would lead to spurious correlations 99 ( Fig. 2 A-B). To find equivalent expressions for the statistical moments of relative abundance, first, 100 we propose n Σ ≡ j n j to be defined as a scaling factor and a dynamical equation for its first 101 moment, n Σ . Then, a transformation to moments of relative abundances, γ k , is given by, Because relative abundance datasets lack information about the scaling factor, its initial condition, 103n Σ (0), must be inferred as a free parameter, one parameter more than for absolute abundance data. Note that because k x k = 1, the number of equations for the microbial types decreases but the number of parameters per type remains. A detailed derivation of transformations to relative 106 abundance for a logistic growth and the Lotka-Volterra models is shown in the Sup. Methods A.

107
Apart from tracking the microbiome of the same hosts over time, as in animal gut studies, 108 commonly, hosts sampled at different time points are pulled together to produce a single time series 109 ( Fig. 1 B). This is the case when hosts are sacrificed while sampling as in experimental studies 110 of D. melanogaster, C. elegans, and Hydra vulgaris [10]. In contrast to deterministic models, our 111 workflow can deal with hosts pulled together thanks to its account of the stochastic demography.

112
More concretely -akin to the concept of biological replicates -if the parameter values and initial 113 conditions are the same in each host sampled, we can account for their emerging demographic 114 differences, i.e., differences in microbiome composition resulting from stochasticity.

115
Inference from simulated and empirical data 116 We tested our inference workflow in two ways. Firstly, we recovered the correct parameters from 117 simulated relative abundance data (Fig. 2). Our approach proved successful in cases with and 118 without inter-specific interactions, namely, data simulated from logistic growth and Lotka-Volterra is not the case (Fig. 3).   The parameters inferred for each species varied widely with various certainties. For the shared carrying capacity we found an average N ≈ 1.45 · 10 7 bacterial cells, ±3.49 · 10 5 cells and uncertainty of 0.0582. A system of 156 equations was solved (S = 12 1st moments and S 2 = 144 2nd moments and co-moments).

135
The work presented here is motivated by the goal to understand how the microbes in a microbiome 136 interact and the need to quantify the uncertainty of parameter estimation from microbiome data.  Although by design, our approach deals with longitudinal (time series) data, analyzing single time 155 points (snapshot data) is possible. For example, if data is assumed to be at steady state, the inference 156 method's aim is to find parameters making the dynamical equations for the moments equal to zero.
with data of sufficiently high quality can overcome this to some extent -exploring the parameter 165 space thoroughly in a reasonable time. We implemented an Approximate Bayesian Computation 166 with Sequential Monte Carlo in our workflow using tools from the Python package pyABC [16] 167 (Sup. Methods C), on which further optimizations could greatly improve its wider application [7].

168
As proof-of-principle, we applied our workflow to two simulated relative abundance datasets and 169 recovered the true parameter values. We also applied it to a reduced microbiome in mice [12], 170 where we estimated values and certainties of parameters describing logistic growth.

171
In summary, we presented a Bayesian inference workflow bridging microbiome data to theoretical 172 modelling. By inferring datasets of microbial absolute and relative abundances, we showed its where n is the vector of absolute microbial abundances, and e i is the amount of change, a vector 250 with one in the i-th entry and zero otherwise.

251
Dynamical equations for the statistical moments can be obtained from the master equation by 252 multiplication and subsequent summation. E.g., for the first moment n k , equivalent to the average, 253 we have where for convenience, we make summations more explicit. For the second moment n 2 k , we have and for the co-moments n k n l , 256 d n k n l dt = n n k n l ∂P (n, t) ∂t = . . . . . . n k n l ∂P (n, t) ∂t .
For models with a finite carrying capacity, the upper sum limit is changed to a finite number. [17], let us define the microscopic transition rates for one microbial population i, where N is the shared carrying capacity, f i is the maximum growth rate, and φ i and m i are the 262 death and immigration rates for each type i. . . . n k P (n + e k , t)T (n + e k → n) . . . n k P (n, t)T (n → n + e k ) − . . .
where the first four lines describe birth or death of a microbe of type k and the last four lines describe . . . n k P (n, t)T (n → n − e k ) Note that the last four terms reduce to zero, and that at the boundaries T (n → n − e k )| n k =0 = 0 271 and T (n → n + e k )| n k =N = 0, which allows including n k = 0 and n k = N in the summations.

272
Simplifying, we find . . . P (n, t) (T (n → n + e k ) − T (n → n − e k )) , and substituting the transition rates T (n → n + e i ) and T (n → n − e i ) from Eqs. (10) leads to For other moments and models similar derivations can be done.

275
For the second moment, we find which after substituting T (n → n + e i ) and T (n → n − e i ) from Eqs. (10) reduces to For the co-moments, we find . . . n l P (n, t)T (n → n + e k ) which after substituting T (n → n + e i ) and T (n → n − e i ) from Eqs. (10) leads to 279 d n k n l dt = (f k + f l ) n k n l − j n k n l n j N Because each equation depends on higher moments, e.g., d n k n l /dt depends on n k n l n j , it 280 is not possible to solve this system of equations without additional assumptions. However, one 281 can find approximate expressions, where lower moments replace higher moments. For example, 282 n 2 k n j and n k n l n j are approximated as functions of the lower moments: n 2 k , n k n l , and n j .

283
Concretely, we can approximate e.g. n k n l n j ≈ n k n l n j . This technique, called moment closure 284 approximation, leads to a closed system of ODEs and we use it in our approach. Kuehn [18] makes 285 a thorough review on this technique.

A.2. Lotka-Volterra
Now, for a model with intra-and inter-specific interactions, let us define the transition rates, where A and B are positively defined matrices containing the interactions, satisfying Ecologically, while interactions in A promote growth, those in B 291 lead to death. Finally, f i is the intrinsic growth rate.

292
For the first moment, similarly to Eq. (13), we have which after substituting T (n → n + e i ) and T (n → n − e i ) from Eqs. (19), which takes the form of the conventional, determininistic Lotka-Volterra equations for the abundance 295 with growth rate f k and interaction matrix A k,j − B k,j .
which after substituting T (n → n + e i ) and T (n → n − e i ) from Eqs. (19) leads to For the co-moments, similarly to Eq. (17), we derive . . . n l P (n, t)T (n → n + e k ) which after substituting T (n → n + e i ) and T (n → n − e i ) from Eqs. (19) reduces to As previously, a moment closure approximation is required to solve the system of equations. The former equations account for the change of absolute abundances. To focus on relative abundance 303 data, we define the relative abundance as and to serve as a scaling factor. Thus, Let us find the transformation to relative abundances for the first moment. Using the definition Rearranging, the transformation is given by For second order moments, we use that x k x l n 2 Σ = x k x l n 2 Σ + x k x l , n 2 Σ and approximate 310 n 2 Σ ≈ n Σ 2 . Then, using the chain rule Rearranging, the transformations are given by and where the differential equation for n Σ is given by A close look at the dynamics of the covariances shows their contribution is negligible in large 315 populations. To see this, let us write . . . n k n Σ − n k n Σ (n Σ − n Σ ) P (n + e k , t)T (n + e k → n) after the appropriate transformations of variable to only deal with P (n, t) and re-indexing, we find Note that if n Σ 1, n Σ ± 1 ≈ n Σ , then, if either n k 1, such that n k ± 1 ≈ n k or n Σ − n Σ n Σ ≈ 0, 318 the terms from the previous equation simplify, leading to Similar arguments lead to conclude that, and These approximations of the covariances are sensible in microbiomes, where n Σ , n k 1 is often the 322 case. Moreover, in the infinite population limit, covariances must be zero.

323
Putting all together, the change of the first moment of relative abundance in large populations is 324 given by while for the second moments of relative abundance Finally, to solve these equations in terms of relative abundance, the change of variable n i = x i n Σ 328 is needed all along. As Joseph et al. [14], we see that the second term of each equation serves as 329 "correction factor" due to the fact that relative abundances must add up to one at all times.  Table 1: Parameters in simulated logistic growth with immigration and death (Fig. 2 A). The growth and death rates as well as the immigration parameters were only chosen for illustration, thus, time units are arbitrary.
parameter units prior N (3000, 1000) shared carrying capacity N cells N (10 5 , 10 4 ) initial scaling factorn Σ (0) cells U(4000, 6000) Table 3: Priors for simulated logistic growth with immigration and death (Fig. 2 A). A combination of uninformative (uniform) and informative (normal) priors were used for illustration. These priors span a wide range of values to test the ability of the inference workflow to find the true parameters in simulations (Table 1). inter-specific I i,j (cells · time) −1 N (0, 5 · 10 −5 ) initial scaling factorn Σ (0) cells U(1.5 · 10 4 , 2.5 · 10 4 ) Table 4: Priors for simulated Lotka-Volterra (Fig. 2 B). A combination of uninformative (uniform) and informative (normal) priors were used for illustration. These priors span a wide range of values to test the ability of the inference workflow to find the true parameters in simulations (Table 2) death φ i cells/day U(0, 2 · 10 6 ) from 0 to ≈ 23 cells/second* immigration m i cells/day U(0, 2 · 10 6 ) from 0 to ≈ 23 cells/second* shared carrying capacity N cells U(1.4 · 10 7 , 1.6 · 10 7 ) max. number of cells in data [12] Table 5: Priors for logistic growth with immigration and death in empirical mouse data (Fig. 3). Available evidence and back-of-the-envelope calculations (marked by *) were used to propose wide priors. U(a, b) indicates a uniform distribution in the range from a to b. N (a, b) indicates a normal distribution with mean a and standard deviation b.
simulations (Fig. 2) mouse data (Fig. 3) logistic Lotka-Volterra logistic  Table 6: Settings for Approximate Bayesian Computation -Sequential Monte Carlo (ABC-SMC) code. These settings were chosen to decrease the computing time, but still robustly minimize the distance between data and model. We used tools from the Python package pyABC [16], mainly ABCSMC. The maximum number of generations, mismatch threshold (ε) minimum, and minimum ε change between generations are all stopping criteria (marked by *). LSODA is a numerical solver capable of selectively adapting to the stiffness of a system of differential equations. NA: Not Applicable.