Towards understanding predictability in ecology: A forest gap model case study

Underestimation of uncertainty in ecology runs the risk of producing precise, but inaccurate predictions. Most predictions from ecological models account for only a subset of the various components of uncertainty, making it diffcult to determine which uncertainties drive inaccurate predictions. To address this issue, we leveraged the forecast-analysis cycle and created a new state data assimilation algorithm that accommodates non-normal datasets and incorporates a commonly left-out uncertainty, process error covariance. We evaluated this novel algorithm with a case study where we assimilated 50 years of tree-ring-estimated aboveground biomass data into a forest gap model. To test assumptions about which uncertainties dominate forecasts of forest community and carbon dynamics, we partitioned hindcast variance into five uncertainty components. Contrary to the assumption that demographic stochasticity dominates forest gap dynamics, we found that demographic stochasticity alone massively underestimated forecast uncertainty (0.09% of the total uncertainty) and resulted in overconfident, biased model predictions. Similarly, despite decades of reliance on unconstrained “spin-ups” to initialize models, initial condition uncertainty declined very little over the forecast period and constraining initial conditions with data led to large increases in prediction accuracy. Process uncertainty, which up until now had been diffcult to estimate in mechanistic ecosystem model projections, dominated the prediction uncertainty over the forecast time period (49.1%), followed by meteorological uncertainty (32.5%). Parameter uncertainty, a recent focus of the modeling community, contributed 18.3%. These findings call into question our conventional wisdom about how to improve forest community and carbon cycle projections. This foundation can be used to test long standing modeling assumptions across fields in global change biology and specifically challenges the conventional wisdom regarding which aspects dominate uncertainty in the forest gap models.

constrained by data, model parameters have often been optimized to a single "best" estimate that ignores with field collected data at each time step and restarts the process-model with an update of the ecological This box provides background on each type of uncertainty in ecological process models.
• Internal demographic stochasticity -Demographic stochasticity refers to the variability in population growth arising from random sampling of birth and deaths. • Internal State (Initial conditions) -Initial condition uncertainty is the uncertainty associated with the initial state of a system. For example, the number, type, and size of trees in a plot at the start of a model run. • External Forcing (Drivers / covariates) -Driver and covariate uncertainty is typically the uncertainty around external environmental forcings like temperature and precipitation. • Process -Model process uncertainty is a measure of the ability of the model structure to predict the latent "true" state of the system after accounting for observation errors in the data. Without an estimate of process error covariance (multivariate) or process variance (univariate) it is difficult to determine model completeness. This would be analogous to predicting with a regression model without considering its RMSE.

Ecosystem model
138 LINKAGES (Post and Pastor, 1996) is a forest gap model that links the dynamics of aboveground demo- is our case study in this paper, our approach to data assimilation variance partitioning is generalizable to 148 many process-based ecological models and data types.

149
A full analysis of the ecological dynamics inferred by LINKAGES at our site is beyond the scope of this is itself a function of the ratio of lignin to nitrogen and actual evapotranspiration (Post and Pastor, 1996). 158 We did not empirically constrain belowground state variables directly. Constraints on soil carbon, in this 159 study, are indirect through the incoming source to litter pools, aboveground biomass. the region, and mature red oak, which is found in greater abundance in the stand than is regionally typical.

167
The permanent plot was established by Walter Lyford in the 1960s and the diameter at breast height (DBH) 168 and location of all stems over 5 cm DBH have been recorded at approximately decadal intervals since then.
Additional site information including census collection is available in (Eisen and Plotkin, 2015 . We addressed this common problem by incorporating a Tobit likelihood into our 220 analysis step (Figure 1, blue). Furthermore, while the analytical solution is computationally practical, the 221 EnKF must make the assumption that process error is known. To estimate process error with data, our 222 TWEnF introduces a latent 'true' state (Berliner, 1996) in the usual framework of Bayesian hierarchical 223 models (Figure 1, pink).

224
To estimate process error covariance, we fit the following TWEnF annually to tree-ring derived multivari- Let y be the posterior mean of multivariate species biomass from the aforementioned tree ring analysis; R 229 be, similarly, the posterior covariance of species biomass from the tree ring analysis; y L be the left censored

237
We updated the estimate of the process covariance (Q t ) every time step by updating the shape parameters 238 of the Inverse Wishart distribution as follows where ⌦ is the process precision to the process covariance Q t and r and c represent the rows and columns . Each method works in an iterative forecast cycle (Dietze 2017a) over time (t to t + 1), where the model forecast (blue) is updated by the data (green) into an analysis (pink), which is used to restart the forecast for another time step. The difference between these filters is that the TWEnF is generalized for non-normal forecasts and can also estimate the process covariance matrix over time by updating prior parameters (↵ q and q ) at each time step. Let y be data mean, R be data covariance, µ f be forecast mean, P f be forecast covariance, µ a be mean analysis, P a be analysis covariance, Q t be process covariance matrix at time t, and x lt be the left censored ecosystem model ensemble values. In both cases, the analysis analysis mean (µ a ) and covariance (P a ) are taken from the filter and used to update the ecosystem model states which restart the next ecosystem model forecast.

Hindcasting uncertainty scenarios
246 We used eight model scenarios with additively more types of uncertainty to partition total forecast variance 247 between the five components we considered (Box 1). In order to partition uncertainty from initial conditions, 248 we divided the scenarios into two initial condition types: model spin-up (run without data constraints until 249 equilibrium is reached, Scenarios A1-4) or informed by data (after spin-up the forecast is constrained with 250 data from tree ring derived biomass in 1960 Scenarios B1-4). In each scenario batch (A versus B), initial 251 condition uncertainty was the first type of uncertainty added to the default ecosystem model. LINKAGES 252 includes demographic stochasticity by default, so the default version of LINKAGES plus initial condition 253 uncertainty were scenarios A1 and B1 in this analysis. We then added the following uncertainties sequentially: 254 parameter (A2 and B2), meteorological (A3 and B3), and process uncertainty (A4 and B4) ( Table 1).

255
Parameter and meteorological uncertainties were added by running each ensemble member with a different  Variance partitioning allows us to quantify which aspects of uncertainty contribute the least to overall 265 uncertainty and pinpoints where we should focus efforts to constrain uncertainty in future predictions. 266 We estimated the effect of each source of variance by calculating the difference in variance between pairs 267 of scenarios then calculating the cumulative proportion of variance in reference to the final scenario that 268 includes all five aspects of uncertainty (Dietze, 2017b). This is similar to analytical approximation methods 269 (Hawkins and Sutton, 2009) but our sequential approach accounts for nonlinear interactions that may affect 270 prediction.

271
Our scenarios did not allow a full variance partitioning because we did not introduce each source of 272 variance independently from the other sources of variance in all possible permutations. Specifically, it was not possible to partition initial condition variance because we could not separate initial condition uncertainty 274 from demographic stochasticity. However, because we had two sets of scenarios: one with data derived initial 275 conditions (scenarios A1-4) and one with spin-up based initial conditions (scenarios B1-4), we were able to 276 calculate the covariance between data derived initial condition variance and the other components of variance We found that running LINKAGES using the default parameters resulted in inaccurate predictions of forest 291 composition and biomass when compared with species-level biomass data from the nearby Harvard Forest 292 EMS Tower (Munger, 2018). Free runs of LINKAGES using data constrained parameters improved the 293 accuracy of predicted total biomass but not that of forest composition ( Figure 2 and

State data assimilation 305
Empirical estimates of aboveground biomass derived from tree ring and census observations at the Lyford 306 plot showed that red oak, the dominant species in the stand, has accrued biomass over the last fifty years 307 while understory species have experienced a few mortality events among individuals. The census was not 308 conducted annually, therefore the biomass data has larger uncertainty during periods where an individual in 309 the understory has died. For example, the green envelope spanning the data in Figure 3 had high uncertainty 310 between 1980 and 1990, a census period that experienced both yellow birch and hemlock mortality. These 311 areas of larger uncertainty allowed us to illustrate an example of successful constraint by our methods, as 312 the analysis step (Figure 3, pink) was able to match the variance associated with the data during those 313 time periods. We assessed our state data assimilation algorithm by looking at several bias diagnostics: 314 average model bias (difference between the observation and analysis over time), mean square error, R 2 , 315 relative absolute error, and absolute mean error (Supplemental Section 6). We also reported the coefficient 316 of variation for the average model bias to account for differences in species biomass magnitudes. The highest 317 biomass species, red oak, was best represented by LINKAGES with a high R 2 between the modeled red oak 318 and the data (R 2 = 0.769) (Supplemental Figure 5). As the most abundant species, red oak unsurprisingly had 319 the largest average model bias (-0.82 kgC/m 2 /yr, 10.86% coefficient of variation (CV), Table 3) and largest 320 estimated process variance (diagonal element of process error covariance matrix) among the aboveground 321 biomass of species ( 2 = 0.25, 14.2% CV). This bias increased over time, indicating that the modeled process 322 of red oak mortality and/or growth may need adjustment and agreeing with ecological analyses that red oak 323 will continue growing at Harvard Forest in the future (Eisen and Plotkin, 2015).

324
The second most abundant species, red maple, had a persistent negative bias (-0.14 kgC/m 2 /yr, Figure   325 2). This negative bias was expected because red maple is the dominant species in the region but is suppressed   Table 3: Model diagnostics ordered by species biomass. Average model bias is simply the modeled mean minus observed annual means. Estimated process variance (diagonal elements of process error covariance matrix) for each species is estimated over time using the Tobit Wishart ensemble filter (TWEnF). Average forecast total biomass is the average modeled biomass for each state variable to give a reference point for the magnitude of the estimated process variance. For example, red oak is the highest biomass species in the stand and also has the highest estimated process variance.
We estimated the process covariance matrix, akin to RMSE in linear models (Box 1), associated with  Figure 4). This suggested that, while LINKAGES typically represents beech 343 and red oak as having a positive interaction (forecast correlation = 0.105, Supplemental Materials Figure   344 9), they were actually more neutral with one another at Harvard Forest according to the tree ring data.

345
In the absence of empirical data on changing soil carbon pools, we depended on the mechanistic linkages 346 between aboveground and belowground carbon in LINKAGES to constrain soil carbon fluxes. In our runs,

347
LINKAGES did not provide a constraint on soil carbon given the aboveground biomass constraint and soil 348 carbon pools rose to highly unrealistic levels with a similarly high process variance estimate ( 2 = 1.03, 349 Table 3). Soil carbon was not estimated to be highly correlated with any species biomass in the process error  Table 3.

Hindcasting uncertainty scenarios 355
Across all uncertainty scenarios, data constrained initial conditions reduced model bias and improved root 356 mean square error agreeing with ecological hypotheses that forests and potentially many ecological processes 357 have substantial historical dependence that should be accounted for by propagating initial condition uncer-358 tainty. Across most uncertainty scenarios, excluding scenario A1 and B1, data constrained initial conditions lowered average forecast variance (Table 4) Figure 5: Individual model ensemble members overlaid with shaded 95% quantiles (outlined in black) of aboveground biomass results from each uncertainty scenario using spin-up as the initial conditions in the model (left) and using data to constrain the initial conditions in the model (right). Default was run with initial condition uncertainty and internal model demographic stochasticity, which vastly under-represents the true forecast uncertainty (first row). Next, parameter uncertainty was accounted for (second row), followed by meteorological uncertainty (third row). Finally process uncertainty estimated in the full data assimilation was accounted for (fourth row). The dotted lines on all the plots are the 95% credible intervals of the data estimated from tree rings.  . Uncertainty from ecosystem model spin up can be seen up to 1960 then data constraints greatly constrain the forecasts. Variance from the first scenario arises only from demographic stochasticity (orange). Parameter, meteorological, and process uncertainty are sequentially accounted for in the next three scenarios. The dotted lines on all the plots are species posterior means of the data estimated from tree rings. Note that the y-axes are different between plots to provide better visualization of the uncertainty components for lower biomass species.

382
Variance partitioning showed that the covariance between the initial condition uncertainty and the other 383 types of uncertainties was the dominant variance contributor over time (hashed areas in Figure 7). All 384 model scenarios that were run with model spin-up had much larger uncertainty than with data constrained 385 initial conditions ( Figure 5, column 1 versus column 2). In addition, initial conditions had long lasting 386 effects on the magnitude of the total forecast variance (Figure 7). Notably, covariance between initial 387 condition uncertainty and parameter, meteorological driver, and process uncertainty decreased significantly Year Covariance with Data Constrained Initial Conditions 0 0.2 0.5 0.8 1  The hashed areas are the relative variances that can be attributed to the covariance with initial conditions. For example, over time initial condition uncertainty covariance with meteorological uncertainty (purple) accounted for a larger proportion of total variance. Bottom: The black increasing line indicates the total amount of aboveground biomass (kgC/m 2 ) variance partitioned by the relative variance plot. This shows that while the proportion of variance that process variance is contributing to the total variance decreases over time that the absolute magnitude of that variance is not necessarily decreasing.
Because we did not assimilate soil carbon data but updated soil carbon based mechanisms in the model 396 (litterfall, mortality, decomposition, etc), we considered the soil carbon uncertainty separately from above-397 ground biomass uncertainty. The initial condition constraint was much less apparent in the soil carbon 398 variance partitioning results outside of major outliers (Figure 8, left). Process uncertainty dominated by an 399 order of magnitude, reflecting the lack of constraint by our version of LINKAGES on this carbon pool ( Figure   400 9), which was out of the bounds of any soil carbon pool on Earth. Even though process uncertainty was the 401 obvious contributor to total uncertainty, meteorological, and parameter uncertainty also caused total soil 402 carbon to drift to extremely large values. The covariance between initial conditions and process uncertainty 403 was an increasingly substantial component over time (Figure 8, right), but it was difficult to assess how much 404 of a constraint initial conditions could provide given the magnitude of uncertainty for total soil carbon. and using data to constrain the model's initial conditions (right). The default was run with initial condition uncertainty and internal model demographic stochasticity (first row). Next, parameter uncertainty was included (second row), followed by meteorological uncertainty (third row), and finally process error (fourth row). The y-axis in the fourth row is colored differently to draw attention to the much larger scale in this row. The instability in the soil carbon reconstruction arises from deterministic cohort dynamics present in the version of LINKAGES we ran. Year  The relative contribution of each type of variance to total soil carbon variance. The hashed areas are the amounts of variance that can be attributed to the covariance with initial conditions. Process (pink) uncertainties contributed a large amount of proportional variance to covariance with initial condition uncertainty. Bottom: The black increasing line indicates the total amount of variance in soil carbon (kgC/m 2 ) that is being partitioned by the relative variance plot above.

406
In our final modeling scenario (B4), we incorporated five data constrained uncertainties: demographic 407 stochasticity, parameter, meteorological, initial condition, and process uncertainty ( Figure 5 and 8 for any other type of uncertainty. Our study highlights the consequences of ignoring these uncertainties.

419
As an illustrative example, in the top left panel of Figure 5, the default parameters of LINKAGES make 420 precise predictions of the stand's above ground woody biomass, which are well outside of the distribution of    However, the overall contribution of parameter uncertainty was relatively modest (18%) suggesting parameter 445 variance is not a dominant source of uncertainty in our analysis. 446 We found that meteorological uncertainty was increasingly important over decadal hindcasts ( Figure   447 7). While hindcasting in this circumstance is much easier than forecasting because we hindsight allows  construct informative priors about initial condition states from past ecological literature (Cressie et al., 2009;466 Hobbs and Hooten, 2015). Our analysis suggests that focusing efforts on data constrained initialization will 467 be the most successful approach for improving forecast accuracy across forest gap models (68% reduction in 468 total forecast variance from data constraints), even on multi-decadal timescales.  error. Our approach allows us to propagate process uncertainty into ecological forecasts, which heretofore 497 has generally been absent from process modeling approaches. 498 We found that process uncertainty contributes substantially to total forecast variance (Figure 7 and 9, 499 pink). The particular process model we used, LINKAGES version 1.0, is something of a classic (Bonan et al., 500 2002), but it is an older model that has since been replaced in most modeling applications. It may be that an 501 alternative process model would be better at predicting 60 years of forest stand dynamics that still requires 502 testing to prove. We do note that the process covariance estimation itself is quite small (Table 3), suggesting 503 that LINKAGES does adequately capture annual forest development changes. However, a small annual error 504 is magnified over time, resulting in large 50 year uncertainty. Our estimate of process error covariance in 505 LINKAGES over the data assimilation time period (50 years) suggests there are errors in modeled species 506 mortality and recruitment, especially in red maple, that led to notable process uncertainty over many years.

507
This finding aligns with previous studies in New England showing that the competitive relationships between 508 red maple and red oak are difficult to understand and therefore predict (Lorimer, 1984;Abrams, 1998;Eisen and Plotkin, 2015). 510 The large impact of process uncertainty on forecast certainty suggests that future efforts might benefit 511 from parsing out specific components of process uncertainty. This could include comparing alternative error 512 models or the spatial and temporal autocorrelation in the process uncertainty, looking for evidence for 513 heteroskedasticity, and partitioning of process uncertainty into bias and variance components. Some of the 514 variability currently attributed to process uncertainty might also represent random individual variability  Constraining belowground soil carbon with aboveground productivity inputs has been a hallmark of our 520 understanding of the evolution of long-term soil carbon accumulation (Meentemeyer, 1978;Aber, 1982; 521 Solomon, 1986). LINKAGES mechanistically links aboveground biomass production, which was well con-522 strained in our model thanks to SDA, to soil carbon through input from litter and tree mortality. But, 523 the pools of soil carbon were not constrained by aboveground inputs in our model and grew to unrealistic 524 levels. Variance partitioning reveals that this lack of constraint is caused by both process and initial con-525 dition uncertainty (Figure 9). But, parameter and meteorological uncertainty also yield hindcasts that are 526 far from reality ( Figure 8). While the link between aboveground and belowground pools has typically been 527 assumed to be a quadratic cumulative relationship (Jenny, 1941), our work suggests that more evaluation 528 is necessary to determine a better modeled representation. We also must add that, despite diligent model

Future Directions
Beyond forest gap models, our results call into question the conventional wisdom in many areas of ecological 540 modeling more broadly, such as the reliance on "spin-up" initial conditions and the exclusion of process 541 uncertainty from predictions. Most ecological forecasts are made without the inclusion of key uncertainties, 542 with many made purely deterministically, leaving out uncertainty quantification altogether (Cressie et al.,543 2009). This common practice creates projections that may be precise but are often inaccurate. Similarly, 544 many ecological fields focus on specific aspects of uncertainty without considering the full suite of possible 545 sources of uncertainty. In particular, a lesson we learned, that demographic stochasticity is not the dominant 546 uncertainty on forest gap models, likely extends more broadly, suggesting that the current reliance on specific  Improving predictions of ecosystems properties not only leads to more efficient progress in advancing 557 basic research and theory but also enhancements for decision makers and stakeholders. We demonstrated 558 the power that uncertainty quantification has in ecology to reveal which long-standing modeling assumptions 559 (spin-up initial conditions are sufficient; models with demographic stochasticity included are sufficient to 560 capture uncertainties; process uncertainty is negligible) are not upheld by data and what steps can be taken 561 to immediately increase forecast accuracy in forest gap modeling. These lessons are not unique to forest 562 ecology. Moving forward, there is a critical need to extend analyses like these to more ecosystems, additional 563 models, and larger spatial and temporal scales. This extension will allow ecologists to assess the generalities 564 of our conclusions and to understand variation of the relative importance of different uncertainties across 565 systems (Dietze, 2017b). Demographic stochasticity, parameter, meteorological, and process uncertainties 566 are quantities that are measured across scales and systems. These types of methodologies can be used to 567 quantitatively move toward better and more useful ecological predictions through systematic evaluation of 568 the contribution of each uncertainty to total forecast uncertainty across different scales and systems.