Modeling ecological communities when composition is manipulated experimentally

In an experimental setting, the composition of ecological communities can be manipulated directly. Starting from a pool of n species, one can co-culture species in different combinations, spanning mono-cultures, pairs of species, and all the way up to the full pool. Here we advance methods aimed at inferring species interactions from data sets reporting the density attained by species in a variety of sub-communities formed from the same pool. First, we introduce a fast and robust algorithm to estimate parameters for simple statistical models describing these data, which can be combined with likelihood maximization approaches. Second, we derive from consumer-resource dynamics statistical models with few parameters, which can be applied to study systems where only a small fraction of the potential sub-communities have been observed. Third, we show how a Weighted Least Squares (WLS) framework can be used to account for the fact that species abundances often display a strong relationship between means and variances. To illustrate our approach, we analyze data sets spanning plants, bacteria, phytoplankton, as well as simulations, recovering a good fit to the data and demonstrating the ability to predict experiments out-of-sample. We greatly extend the applicability of recently proposed methods, opening the door for the analysis of larger pools of species.

coexisting species we can form from a pool of n species. The approach rests on two main assumptions. First, 96 we take each observed measurement to be a noisy realization of a "true" value: (1) That is, the observed density of species i when grown in community k,x (k) i , is given by a "true" value, i , modified by an error term, ϵ (k) i . We can arrange our empirical data in a matrixX, with n columns 99 (one for each species in the pool) and as many rows as the number of observed communities. In particular, if species i is present in community k, and zero otherwise.
if species i is in 101 community k and zero otherwise. Then, we can rewrite Eq. 1 in matrix form: When several replicates are available, we assume that the density recorded for species i in community k and 103 replicate r followsx inefficient and computationally very expensive, as it requires inverting a sub-matrix of B for each observed 143 community, and there may be up to 2 n − 1 unique community compositions that can be built from a pool of 144 n species. Moreover, the problem of minimizing deviations is markedly non-convex-starting from different 145 initial estimates, we are likely to converge to different (and thus in general sub-optimal) solutions. 146 To circumvent this problem, Maynard et al. (2020) proposed a simple analytical approach (dubbed the 147 "naïve method") to find a rough estimate of B from observed data; this initial estimate of B can then 148 be used as a starting point for more sophisticated fitting routines. However, as discussed in their study, 149 this method suffers from a number of issues (detailed in Supporting Information, S3). In particular, the 150 statistical model assumes that the observations are noisy, while the naïve method assumes that they have 151 been observed precisely and that instead Eq. 3 only holds approximately. In practice this means that this 152 approach does not provide the maximum likelihood estimate for B when the data are noisy. One of the goals 153 of the present work is to build an iterative algorithm that is not only computationally efficient, but that 154 also improves upon the naïve estimate for B by yielding a superior starting point for subsequent parameter 155 search.

157
A fast, iterative algorithm to estimate parameters 158 We want to find the maximum likelihood estimate for the matrix B, i.e., the choice of B minimizing the sum 159 of squared deviations: The computational bottleneck we face is that determining the predicted abundances x (k) i from the matrix 161 B (via Eq. 3) for all community compositions is very expensive (requiring the inversion of many matrices). 162 We therefore develop an algorithm in which this expensive calculation is performed only seldomly and at 163 defined intervals, and optimization is carried out in between these steps without having to re-calculate the 164 predicted x (k) i . To achieve this goal, we divide the process of optimizing B into two steps: a prediction step arranged our data in the matrixX as detailed above, we takeX = X + E (Eq. 2), transpose each side, and 169 multiply by B. We obtain the sum of two new matrices: In the remainder of this section we use P and S simply as a convenient means to build our algorithm-we discuss their ecological interpretation in Supporting Information S4. From Eq. 4, we have B −1 P T = X T 172 and B −1 S T = E T . Our sum of squared deviations is simply the squared Frobenius norm of E, ij E 2 ij = 173 ∥E∥ 2 F = ∥E∥, and from Eq. 2 and 4, we can write: While this algorithm is not guaranteed to find the maximum likelihood estimate of B, we observe that in 192 practice, the combination of our iterative algorithm and the final numerical optimization step yields very 193 good solutions. In all cases, we converge on a solution that is superior to the result using the naïve method. 194 As shown in Figure 1 (and Supporting Information, S8), the SSQ typically decays rapidly with the number 195 of iterative steps, and the final numerical search provides an additional improvement. While for the data 196 presented here the algorithm converges smoothly, in principle, one could observe the SSQ oscillating as 197 the algorithm progresses. As for gradient descent and similar methods, this problem can be alleviated by 198 introducing a "momentum" (Polyak, 1964;Qian, 1999), i.e., when updating matrix B in Step 4, one could model with n 2 parameters may be susceptible to overfitting. 217 We follow a disciplined approach to develop these simpler models: we consider a version of the MacArthur's 218 consumer-resource model (MacArthur, 1970) in which each species has access to its own private resources, 219 and all species have access to a shared resource (Supporting Information, S5). By iteratively reducing the 220 number of free parameters in the model, we obtain simpler structures for the matrix B (Table 1). 10 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made is obtained by assuming that interspecific interactions follow a simple pattern, defined by few species' traits.

231
In Figure 2, we show the fit of these four models when analyzing the data reported in Ghedini et al. (2022).

232
For all data sets we find the same qualitative results: the full model (n 2 parameters, 25 parameters for the   Figure 3: Sum of squared deviations between observed and predicted densities for the four data sets considered, fitted using the four models in Table 1. 13 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made  For each species (colors) and community combination, the mean and variance of the observed species densities are computed across replicates. Note that both axes have a logarithmic scale, and thus the strong linear trend displayed by the data corresponds to a power-law relationship between the mean and the variance.
14 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made be interpreted as reweighting the relative importance of errors made in our predictions, such that small errors 254 made when estimating low species abundances are penalized as much as larger errors made when predicting 255 higher species abundances. This is in contrast to OLS, where a one percent error in estimating the density 256 of a species with high abundance contributes much more to the sum of squared deviations than the same 257 proportional error for a species with low abundance.

258
Since experimental replicates are necessary to estimate these variances, we must have replicates for all 259 communities (as we do for all the data sets considered here) in order to implement the WLS approach in 260 this straightforward manner.

261
In Figure 5 (and Supporting Information, S8), we compare these two error structures and find that we 262 indeed observe a better fit for species with lower abundances when the WLS approach is used. This has 263 important implications for the qualitative prediction of species' coexistence, because for a species present 264 at low abundance in a community, the numerical difference between a positive abundance and a negative 265 abundance (equivalent to a prediction of lack of coexistence) could be quite small. 266 Finally, in Supporting Information S9 we perform simulations in which observations are taken from dis-267 tributions with known mean-variance relationship, and fit the data using OLS, WLS or a likelihood-based 268 approach. In all cases, we find that WLS outperforms OLS, allowing us to fit the data well and recover with 269 good confidence the parameters used to generate the data.  . Replicate measurements for each species/community are reported as semi-transparent points; the mean for each species/community combination is reported as a solid point. In OLS, small (proportional) deviations of highly abundant species (e.g., Synechococcus sp., in green) are penalized more than larger (proportional) deviations of lower-abundance species (e.g., Tisochrysis lutea, in yellow). In contrast, when performing WLS each data point is standardized by its corresponding variance, leveling the importance of each measurement.

16
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 3, 2022. ; https://doi.org/10.1101/2022.05.12.491213 doi: bioRxiv preprint communities have been measured, we implement the LOO approach by designating one of the communities 276 (along with any replicates) as out-of-sample (sometimes called "out-of-fit," as in Maynard  Note that in the panel displaying the measurements for the full community (A-T-D-S-Ti), Tisochrysis lutea is predicted at a negative abundance-without having observed these data, our models would suggest a lack of coexistence.
17 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 3, 2022. ; https://doi.org/10.1101/2022.05.12.491213 doi: bioRxiv preprint experimental data where community composition has been manipulated, and showcased its potential by 285 testing this approach on data sets spanning plant, phytoplankton, and bacterial communities, as well as  Our framework bridges the dynamical and statistical perspectives of these approaches. Importantly, though 296 the models presented here are statistical in nature, they have a straightforward connection to the Generalized 297 Lotka-Volterra dynamical model. By exploiting this parallel, we have derived a family of simplified models 298 that put a premium on the ecological interpretability of the parameters. While an alternative approach 299 to model regularization would be data-driven, machine learning methods (e.g., enforcing the sparsity or 300 parsimony of B through a penalized regression), we have shown that ecologically-motivated model constraints 301 are a viable option. By clearly relating statistical and dynamical models, we retain the ability to probe other 302 properties of these systems, for example in relation to invasibility, assembly and dynamical stability (Maynard 303 et al., 2020).

304
Using our framework, we are able to obtain quantitative predictions for coexistence and abundances of 305 an arbitrary set of species both in-and out-of-sample, even for data sets that comprise just a fraction of 306 the possible sub-communities that may be formed from a fixed pool of species. Here, our simpler models 307 are used to fit the relatively sparse data sets when fitting the full model is impossible. As the number 308 of combinations that can be formed from a set of n species grows exponentially with n, this ability to 309 predict species abundances out-of-sample is critical for exploring larger systems. Additionally, the good 310 performance out-of-sample indicates that the models are capturing meaningful information about species 311 interactions, rather than merely over-fitting the data.

312
Remarkably, we find that a model where interspecific interactions are approximated by a simplified struc-313 18 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 3, 2022. ; https://doi.org/10.1101/2022.05.12.491213 doi: bioRxiv preprint ture, while intraspecific interactions are modeled more freely, achieves a comparable level of accuracy to 314 the full model for the data sets. These results suggests that interspecific interactions in these systems are 315 not completely idiosyncratic, but rather are largely characterized by a lower-dimensional structure. This  Naturally, to develop these simple models we need to make certain strong assumptions about the community 325 dynamics. Here we have ruled out "true-multistability," in which a set of species can coexist at distinct 326 configurations of abundances. We have also neglected higher-order or highly-nonlinear interactions, which 327 would make the effect of species i on j context-dependent. While we would expect this framework to 328 perform poorly when these assumptions are violated, the good agreement with experimental data suggests 329 that departures from these strict assumptions are modest. However, relaxing these assumptions could further 330 expand the applicability of these methods.

331
Another area that deserves further exploration is quantifying uncertainty in the point estimates produced 332 by this approach. While one could gauge these effects via bootstrapping of experimental data, we instead 333 advocate a fully Bayesian approach to uncertainty quantification, for example as implemented by Maynard  The main outstanding problem with our approach is the assumption that dynamics have reached a steady 340 state, and thus the observed community composition is the "true" final composition for the system. Violations 341 of this assumption can greatly complicate inference. Suppose, for example, that we have two species, A and 342 B, and that A excludes B asymptotically. If we sample this system before the extinction of B, we force 343 the model to find parameters consistent with the robust coexistence of A and B, thereby biasing the results 344 considerably. This problem of "spurious coexistence" is especially troublesome for microbial communities, 345 19 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 3, 2022. ; https://doi.org/10.1101/2022.05.12.491213 doi: bioRxiv preprint where species' presence is frequently determined by sequencing. Sequencing-based methods often detect 346 some species at very low densities (Venturelli et al., 2018), potentially spuriously, making it difficult to 347 discriminate between coexistence of rare species and actual extinctions masked by "background noise." A 348 Bayesian approach could make it possible to simultaneously impute the "true" final composition (i.e., which 349 species are truly coexisting in the sample, taken as a latent parameter) as well as determine the distribution 350 of parameters.

351
This general framework can be further extended in a number of directions. For example, one could introduce 352 a more sophisticated error model assuming that species abundances are correlated within communities (e.g., 353 overestimating the density of a predator could be associated with an underestimate of the densities of its 354 prey). Similarly, we could assume more generally that observed densitiesx The goal was to test how species richness affects seedling establishment and productivity below-and above-  Table S1. replicates), two species together in all possible pairs (10 combinations, 3 replicates each), and all species 393 together (5 replicates) for 10 days (corresponding to approximately ten generations). Time series for these 394 data are reported in Figure S1. Samples were measured every other day. Here we analyze the data for day q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q  q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 24 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

389
The copyright holder for this preprint this version posted September 3, 2022. ; https://doi.org/10.1101/2022.05.12.491213 doi: bioRxiv preprint In order to fit the n 2 parameters of the interaction matrix B in our model, we need to have observed 407 a sufficiently large variety of communities. For simplicity, suppose that we are simulating data from the 408 model as specified by a given matrix B, which details the interactions between the n species in the pool. with zeros with all the species that are in the pool but not present in community k. We ask when we can 412 recover all the coefficients in B from X. 413 Under these conditions, to recover the coefficients of the matrix B we need to be able to find a unique 414 solution for BX T = P T . We can write this as: where vec(A) is the vectorization operator (stacking all the columns in a matrix into a vector), and ⊗ is the 416 Kronecker product. Note that (as explained in details in Supporting Information, S4), we cannot determine 417 all elements of P without having access to matrix B. We can however be certain that whenever X ij > 0, 418 then P ij = 1. If we then callP a matrix that hasP ij = 1 whenever X ij > 0 andP ij = 0 elsewhere. We can 419 rewrite the system of equations above as: 420 diag(vec(P ))(X ⊗ I)vec(B) = vec(P T ) , where diag(b) creates a diagonal matrix with vector b on the diagonal. It is now apparent that solving the 421 matrix equation for the n 2 entries of B requires the matrix diag(vec(P ))(X ⊗ I) to have rank n 2 . This 422 condition will be met only when each of the n species in a system are present in at least n experiments, and 423 each pair of species appears in at least one experiment. Note that if any two species were to always co-occur 424 alongside each other in the experimental data, we would not be able to fit the full B matrix-in this case, 425 we would be unable to distinguish the effects of the two species from one another.

S3. Naïve approach
not minimize the SSQ and therefore does not yield the m.l.e. for matrix B.

432
Taking the equation Eq. 4, and using the matrixP as above, we find: where • is the element-by-element (Hadamard) product.

434
If we proceed as above and attempt to solve the system of equations  27 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 3, 2022. ; https://doi.org/10.1101/2022.05.12.491213 doi: bioRxiv preprint The Generalized Lotka-Volterra (GLV) model is arguably the simplest nonlinear model for population dy-441 namics, and has been studied for more than a century (Lotka, 1920;Volterra, 1926). It can be written 442 as: whereẋ(t) = dx(t)/dt, x(t) is a column vector detailing the densities of all species at time t, r is a vector 444 of intrinsic growth/death rates, A is a matrix of species' interactions, and D(x(t)) is a diagonal matrix with 445 x(t) on the diagonal. We can divide each element of A by the corresponding growth rate, obtaining: These equations hold whenever all species are present, or when we initialize the system using a subset of  data. Note that this also makes clear that, using "static" observations of several sub-communities, we cannot 454 find A and r separately, only the composite parameters B = D(r) −1 A. 455 In this context, the predicted matrix X (for a given matrix B) is simply a collection of feasible equilibria for 456 some of the sub-systems we can form from the pool of n species. Take one of the rows of X,x, containing the 457 density of the |k| species forming the feasible equilibrium, along with zeros for the species that are absent.

458
Because of the equilibrium condition, for allx i > 0, we have for the remaining species (i.e., the species x i not present in k), we have 460 28 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 3, 2022. ; https://doi.org/10.1101/2022.05.12.491213 doi: bioRxiv preprint The quantity ρ i can be interpreted as the effect of the "resident" species (i.e., those for whichx i > 0) on the 461 growth rate of the invader i when entering the community at low abundance. Write the per-capita growth 462 rates: If the system is resting atx, we have that whenever (D(r)(1 − Bx)) i > 0, species i has positive growth rate 464 when invading the system at low abundance. If we assume that the growth rates are positive for all species, 465 we have that the inequality for species i is satisfied whenever 1 − ρ i > 0, and therefore ρ i < 1. If the sign of 466 r i is negative, the inequality is reversed.

467
Therefore, when we compute P = XB T , we have that P ij = 1 whenever the species j is part of the community 468 specified by row i of X, and that (assuming r i > 0) whenever P ij < 1 for a species that is not part of the 469 community, then the species can invade the community when rare, and cannot invade whenever P ij > 1. 470 If moreover we have that the species in k are resting at a stable equilibrium, then any row of P for which 471 P ij ≥ 1 for all species represents a feasible, stable and non-invasible equilibrium for the system, also called (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 3, 2022. ; https://doi.org/10.1101/2022.05.12.491213 doi: bioRxiv preprint where the vector α collects the attack rates for the consumers on their private resources and the vector β 495 the attack rates for each consumer and the shared resource. Then we have that M = AA T is simply: Dividing each row of M by the corresponding s i , we obtain: where d = α 2 /s, v = β/s and w = β. As such, in this case we can fully define the matrix B using 3n − 1 498 parameters (with the −1 due to the fact that elements v i w j always appear as products). We can complicate 499 this model by adding extra shared resources, each time adding further parameters. In particular, if we 500 wanted to model off-diagonal (i.e., inter-specific interactions) as the sum of two rank-1 matrices, we could 501 write: adding only 2n−3 parameters; in fact, without loss of generality we can assume that v T v ′ = 0 and w T w ′ = 0, (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 3, 2022. ; https://doi.org/10.1101/2022.05.12.491213 doi: bioRxiv preprint consumers have the same attack rate on the resources, the matrix can be written as B = D(d) + α11 T (n + 1 509 parameters). 510 We have shown how the consumer-resource framework can be used to derive simplified versions of the 511 statistical model, thereby lifting some of the requirements on the data (see section S7 below). While in a 512 consumer-resource model we would expect all attack rates to be positive, we can relax this condition to have 513 a more general and flexible model. (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 3, 2022. ; https://doi.org/10.1101/2022.05.12.491213 doi: bioRxiv preprint S6. Fitting the simplified models 515 The main computational hurdle one faces when fitting the original model with n 2 parameters is the need to 516 invert a sub-matrix of B, B (k) for every species combination k, in order to compute the predicted values.

517
Inverting matrices is computationally costly, requiring on the order of approximately n 2.4 operations for an 518 n × n matrix, even using the best available algorithms.

519
This problem is greatly reduced for the simplified models, as we can easily obtain a closed-form analytical The predicted abundance for the species in the community can be computed as: Even more conveniently, the predicted abundance for a sub-community be readily computed. Simply define 527 a vector of presence/absence π k such that π ki = 1 if species i is part of the sub-community and 0 otherwise.

528
Then, we can write x (k) , containing the predicted density of the species when present, and 0 otherwise, as:

S7. Data requirements to fit the simplified model 533
As detailed in Appendix S2, to fit the full model we need to identify the n 2 coefficients of the matrix B.

534
This can be accomplished by writing a set of equations based on the observed data. In particular, having 535 observed a certain community k in which m species coexist, we can write m equations. For example, having 536 observed species one growing in monoculture (labeled community 1), we can write: (1) Similarly, when we have observed species one and two growing together (labeled community 2), we can write 538 two equations: As such, having observed all monocultures we can write n equations, and having observed all the n 2 pairs 540 we can write n(n − 1) equations-for a total of n 2 equations, allowing us to solve for all the coefficients

545
Take the simplest of the reduced models presented in section S5 and S6: Following the same logic, it is obvious that having observed all the monocultures (yielding B ii = d i + α = 547 1/x i (i) ) and any of the pairs (allowing us to solve for α) would be sufficient to parameterize the model.

548
The more interesting and complicated case is that of the most complex of the reduced models: B = D(d) + 549 vw T . Here we show that at a minimum we need to be able to write 3n − 1 independent equations, in order to (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 3, 2022. ; https://doi.org/10.1101/2022.05.12.491213 doi: bioRxiv preprint full model, we illustrate it with an example. Take the four-species community defined by the parameters: (1, 2, 3, 4) T v = (1/8, 1/4, 1/2, 1) T w = (1/32, 1, 7/2, 1) T Yielding the matrix B:  we will only use the "observed" data for monocultures and pairs: Clearly, if we had access to all the rows ofX, we would be able to infer an arbitrary B (as in the full model), 559 and therefore also any B with the special form B = D(d) + vw T . Our goal is to show that we do not need 560 to have observed all the monocultures and pairs to infer the parameters of this simplified form.

561
Before we start, we note that G = vw T is defined uniquely by 2n − 1 (rather than 2n) parameters. In fact, 562 we can take any θ ̸ = 0 and define v ′ = θv and w ′ = 1 θ w such that G = vw T = v ′ w ′T . 563 36 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 3, 2022. ; https://doi.org/10.1101/2022.05.12.491213 doi: bioRxiv preprint and the pair involving species two and three. I.e., we have observed all the monocultures, but only four pairs 565 of the six potential. We can use this data to solve for the coefficients B 11 , B 12 , B 13 , and B 14 :  we have identified 3n − 1 coefficients of B: the n elements on the diagonal, as well as 2n − 1 elements in the 568 off-diagonal part of the matrix.

569
Note that We have therefore observed a partial matrix G: Is there a unique way to "complete" the matrix G, knowing that it must be rank-1? This problem is known 571 in mathematics as the "matrix completion" problem, and has been studied extensively. In particular, a 572 partially-specified matrix admits at least one rank-1 completion whenever it meets two conditions (Hadwin 573 et al., 2006). The first condition, called the "zero row or column property" is very simple: if one of the 574 specified coefficients is zero, then all the specified coefficients in either the corresponding row or column 575 must also be zero. In our case, we have no zero coefficients, and thus the property is trivially satisfied.

576
The second condition is based on the bipartite graph G that we can construct by taking the labels for the 577 rows and columns of G * , r i and c i respectively, as nodes and drawing an undirected edge between r i and c j 578 whenever G * ij is specified. The second condition, called the "cycle property," has to do with the consistency find the coefficients needed to complete the matrix.  Figure S3: Bipartite graph associated with the partially-specified matrix in Eq. 8. The graph represents a connected tree, and as such a completion of matrix G * exists and is unique.

595
In summary, provided with 2n − 1 nonzero coefficients of G satisfying the connectedness of the corresponding 596 graph G, we can complete the matrix and the completion will be unique. Note that we are specifying only 597 2n − 1 coefficients, which implies that G has only 2n − 1 edges. As such, if G is connected, it is necessarily 598 a tree. A tree automatically satisfies the cycle condition in (Hadwin et al., 2006), and if the coefficients 599 are nonzero, connectedness is necessary and sufficient for the existence of a unique solution. Because the 600 solution is unique, numerical methods could easily substitute the algebra we performed above.

601
Finally, having identified all of the coefficients G ii , and provided with the estimates of B ii we obtain d i by 602 38 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 3, 2022. ; https://doi.org/10.1101/2022.05.12.491213 doi: bioRxiv preprint difference.

603
To demonstrate that not all choices of 2n−1 off-diagonal coefficients would satisfy the conditions for existence 604 and uniqueness, consider the same system and use the monocultures and the experiments in which we grow 605 species (1, 2), (2,3), (3,4) and (4,1) together. In total, we should be able to identify 3n coefficients of B, 606 including all the diagonal coefficients and 2n off-diagonal coefficients. Excluding one of them arbitrarily (this 607 has no effect on the result), we end up with the partially-specified matrix: In Figure S4 we show that the induced graph G is not connected (and in fact, not a tree). As such, if the 609 matrix were to satisfy the cycle condition (it does), we would have multiple solutions (in this case, infinitely 610 many), while if it did not, we would have no possible completion.  Figure S4: Bipartite graph associated with the partially-specified matrix in Eq. 9. The graph does not represent a connected tree, and as such a completion of matrix either does not exist, or is not unique.
In summary, to identify the parameters of the simplified model we need to have sufficient data to determine 612 both the n diagonal coefficients of B as well as 2n − 1 off-diagonal coefficients, ensuring that the induced 613 graph G associated with the matrix G * is connected. This condition is both necessary and sufficient to 614 guarantee the existence and the uniqueness of a solution. Finally, because we need to identify n 2 parameters 615 (quadratic in n) for the full model, and only 3n − 1 (linear in n) for the simplified model,the benefits of the 616 simplified model will be markedly greater for larger n. (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 3, 2022. ; https://doi.org/10.1101/2022.05.12.491213 doi: bioRxiv preprint reported in section S2 in all cases ( Figures S3, S4, and S5).  42 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 3, 2022. 44 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 3, 2022. ; https://doi.org/10.1101/2022.05.12.491213 doi: bioRxiv preprint sum of squared deviations, rather than the sum of squared deviations, we obtain a better fit for the species 625 with low abundance, at the cost of a slightly less precise fit for the highly abundant species (Figures S9, S10,   626 and S11). 45 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 3, 2022. 46 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made the other hand, we make one qualitatively incorrect prediction (lack of coexistence, when the community is 632 observed to coexist empirically), and also the quantitative predictions are worse, especially for the Ochrobac-633 trum sp., which appears in only two communities ( Figure S12). In all cases, using OLS or WLS yields similar 634 results. are grown together, we obtain a qualitatively wrong prediction (lack of coexistence). Ochrobactrum sp. appears in only two communities, and when one of the communities is removed the fit is greatly reduced.

47
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 3, 2022. 48 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 3, 2022. 49 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 3, 2022. ; https://doi.org/10.1101/2022.05.12.491213 doi: bioRxiv preprint data and recover the parameters used to generate the data in the first place. These simulations allow us to 638 highlight two aspects of the approach presented here: a) WLS is superior to OLS whenever the observations 639 on abundances display a marked mean-variance relationship; and b) while the approach presented in the 640 main text is fundamentally based on normally-distributed deviations from the theoretical mean (as found 641 in both OLS and WLS), one could easily extend the approach to other distributions-in which case the 642 function to maximize would be a more generic likelihood (rather than simply minimizing the sum of squared 643 deviations).

644
All the simulations presented below are based on a simple, 3-species "true" matrix of interactions: true parameters inferred parameters Figure S19: Inferred vs. true parameters for simulations in which replicates for a given system are sampled from a Gamma distribution. For each parameter, we plot the inferred value using the specified approach and fifty replicates for each community.

53
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 3, 2022. ; https://doi.org/10.1101/2022.05.12.491213 doi: bioRxiv preprint