Capturing site-to-site variability through Hierarchical Bayesian calibration of a process-based dynamic vegetation model

Process-based ecosystem models help us understand and predict ecosystem processes, but using them has long involved a difficult choice between performing data- and labor-intensive site-level calibrations or relying on general parameters that may not reflect local conditions. Hierarchical Bayesian (HB) calibration provides a third option that frees modelers from assuming model parameters to be completely generic or completely site-specific and allows a formal distinction between prediction at known calibration sites and “out-of-sample” prediction to new sites. Here, we compare calibrations of a process-based dynamic vegetation model to eddy-covariance data across 12 temperate deciduous Ameriflux sites fit using either site-specific, joint cross-site, or HB approaches. To be able to apply HB to computationally demanding process-based models we introduce a novel emulator-based HB calibration tool, which we make available through the PEcAn community cyberinfrastructure. Using these calibrations to make predictions at held-out tower sites, we show that the joint cross-site calibration is falsely over-confident because it neglects parameter variability across sites and therefore underestimates variance in parameter distributions. By showing which parameters show high site-to-site variability, HB calibration also formally gives us a structure that can detect which process representations are missing from the models and prioritize errors based on the magnitude of the associated uncertainty. For example, in our case-study, we were able to identify large site-to-site variability in the parameters related to the temperature responses of respiration and photosynthesis, associated with a lack of thermal acclimation and adaptation in the model. Moving forward, HB approaches present important new opportunities for statistical modeling of the spatiotemporal variability in modeled parameters and processes that yields both new insights and improved predictions.


Introduction:
In times of unprecedented environmental change, process-based models are one of the most important tools at our disposal to understand and predict ecosystem function under global change. By encoding our current understanding about how processes work, these models often provide greater confidence when trying to understand current conditions, which are far from equilibrium and dominated by transients, as well as when making short-term ecological forecasts or long-term projections of environmental change . However, it can be hard to apply these models effectively because the parameters in such models are empirical coefficients that need to be calibrated to data, which can vary in space and time (Zaehle et al., 2005;Buotte et al., 2021). This presents a conundrum of how to calibrate with data that are collected over different sites or other observational units, be it stands, individuals, watersheds, lakes depending on the scales that models operate and the scales that data are collected.
Users are thus often presented with a choice between using general model parameters or site-specific parameters. General parameters may reflect uncalibrated "defaults", or may result from performing a multisite "joint" calibration against all the data from all the sites assuming no site-to-site variability (Minunno et al., 2019. But in either case such parameter sets often have mixed success at capturing the dynamics at any particular site. Site-specific calibrations, on the other hand, assume all sites to be completely independent but generally do a much better job of capturing site-level dynamics, although can be both labor-and data-intensive to generate (van Oijen, 2017) and limit potential for upscaling. Similar to general calibrations, site-specific calibrations also have had mixed success in their ability to make predictions at new sites (Post et al., 2017). Furthermore, while ecologists have only recently begun propagating uncertainties in process-based models, both the general and site-specific approaches are likely to not just be 'wrong' when applied to new sites, but to be over-confidently wrong or unquantifiably uncertain, with confidence intervals that do not reflect the true uncertainties associated with extrapolating calibrations to new sites. This out-of-sample overconfidence is occurring precisely because neither approach has a way of quantifying or propagating the uncertainties associated with the spatiotemporal variability in model parameters. Finally, the presence of spatiotemporal variability in model parameters itself implies that models are missing important interactions and processes altogether (Medlyn et al., 2015).
Hierarchical Bayesian (HB) approaches to calibration provide a third option in the continuum between these two approaches (Fig 1) that allows us to quantify and partition the scales of unexplained uncertainties (what statisticians call random effects, or in our specific case; site effects), and frees the modelers from assuming model parameters to be completely generic or completely site-specific (van Oijen, 2017). The hierarchical calibration framework has two levels: at the site-level, parameters are fitted to each site, then at the hierarchical level, sites are modeled to be distributed around an across-site (global) mean with an across-site (global) variance. After calibration, when a new prediction is made at one of the sites that went into the hierarchical calibration (in-sample prediction), the site-level parameters for that site will be used.
When a new prediction will be made at a new site that was not part of the hierarchical calibration (out-of-sample prediction), global level parameters will be used. HB calibration also borrows strength from sites with more data and constrains models at under-sampled sites, which is a common scenario in ecology (Clark, 2005). If the sites are very different (large site-tosite variability), the across-site model would provide very little constraint on parameters. If the sites are very similar (little or no site-to-site variability) the across-site model would behave very much like the joint calibration. However, in either case this structure is formally informed by the data rather than making a priori assumptions. This structure also allows us to constrain parameters at one site based on observations at another when there are high across-site correlations among parameters. HB not only provides improved predictions, but also formally gives us a structure that will help us better understand this inter-site variability moving forward (Dietze, 2017). By looking at which parameters show the highest site-to-site variability, we can detect both known and novel missing key processes. For example, one of the known missing and influential processes from dynamic vegetation models is the temperature acclimation of photosynthesis (Dietze, 2014;Lombardozzi et al., 2015). In a HB calibration, this might be detected as optimum  showing high site-to-site variability, as the calibration would try to   compensate for this missing photosynthesis process in the model by suggesting different values for different sites. However, the reason why some of these parameters show high site-to-site variability may not always be known, and may point to new questions that we did not think to ask until we applied the HB framework. In such cases, we can generate hypotheses and perform simple post-calibration analyses where we can test which predictors can explain this site-to-site variability (Fer, 2020). Furthermore, by looking at across-site parameter correlations we can detect model structural errors (assuming the trade-offs in across-site parameters are not biologically meaningful). Overall, this framework gives us a way of detecting whether a process needs to be added to a model without adding it and seeing if the model gets better.
HB approach has recently gained attention for the calibration of process-based models (Thomas et al., 2017;Susiluoto et al., 2018;Tian et al., 2020), however, hierarchical processbased model calibrations are still very limited. One reason why a HB approach may not have yet been widely adopted by the process-based modeling community is that Bayesian calibration is already computationally costly for process-based models (Fer et al., 2018). Recently, statistical and computational solutions for calibration of process-based models have gained attention, including Bayesian model emulation (Fer et al., 2018), which could help overcome this hurdle.
Here we build upon our previous emulator-based Bayesian calibration tools to develop a generalized, emulator-based multi-site HB calibration tool. In addition, this tool is coupled with automated pipelines within the Predictive Ecosystem Analyzer (PEcAn), a community cyberinfrastructure tool, that can handle the data processing and execution of large numbers of model runs across a large number of sites (LeBauer et al., 2013;Fer et al., 2021).
In this study, we assimilated data from the Ameriflux network to calibrate a processbased terrestrial ecosystem model, Simplified Photosynthesis and Evapotranspiration (SIPNET), using the HB approach and present insights gained from it in comparison to both individual site-level and "joint" across-site simple Bayesian (SB, as referred by Clark, 2005) approaches (Fig 1). We specifically seek to test the following hypotheses about in-sample calibration and prediction (H1 and H2), out-of sample prediction (H3) and uncertainty partitioning (H4): (H1) Model predictions will be falsely over-confident after joint fitting because it neglects across-site parameter variability and estimates a single parameter vector for all sites. In other words, uncertainties in parameter posterior estimates and post-calibration model ensemble output will be narrower but model-data agreement will be worse (higher deviance). (H2) Model-data comparison will show the best agreements (lowest deviance) after the site-level calibrations. Sites-level parameter estimates will gain additional constraints from HB calibration, and HB site-level post-calibration model ensemble agreement with data will be better (lower deviance) than SB site-level post-calibration ensemble.
(H3) Out-of-sample model predictions using parameters fitted through joint calibration will again be overconfident: i.e. predictive intervals will be more precise (narrower width) but will often exclude data at out-of-sample sites (lower coverage). Model predictions using the HB calibration will be more uncertain (reflecting the true uncertainties) but more accurate (better model assessment scores) at out-of-sample sites.
(H4) Across-site parameter variability will be higher than parameter uncertainty.

Study sites and model
The Ameriflux network provides long-term standardized measurements across a wide range of climatic and geographic space . Here we limited our study system  Table S1 with the longest uninterrupted (continuous) data availability range given in the "Years used" column.
Differences in number of active tower years among sites provides an opportunity to demonstrate the ability of HB calibration to borrow strength across sites that vary in information availability.
Among available Ameriflux data, we use Net Ecosystem Exchange (NEE) and Latent Heat (LE) as model constraints on carbon and water dynamics, respectively. To illustrate our Hierarchical Bayes (HB) calibration, we used the Simplified Photosynthesis and Evapotranspiration (SIPNET) model (https://github.com/PecanProject/sipnet), which we ran through the Predictive Ecosystem Analyzer (PEcAn, LeBauer et al., 2013; https://github.com/PecanProject/pecan) ecological informatics toolbox. SIPNET is ideal for our study as it was developed to facilitate model comparisons to eddy-covariance data (Braswell et al., 2005;Sacks et al., 2006). For all of our calibration and validation runs we ran SIPNET's temperate deciduous PFT for the given periods (ranging from 3 to 14 years of data, also see Table S1, "Years used" column) using Ameriflux tower meteorological data as model drivers. Meteorological drivers were gap-filled using the marginal distribution sampling approach implemented in the REddyProc package described in Wutzler et al. (2018). NEE and LE data were not gap-filled as this is not required for calibration and doing so would have artificially inflated sample sizes and resulted in a model-model comparison. SIPNET also requires initial conditions for its leaf, wood, and soil C pools (Table S2), which we extracted from the Ameriflux BADM (Biological, Ancillary, Disturbance and Metadata) files, or from reference papers listed in the BADM files, when possible. For the remaining sites, we used other papers in the literature, personal communication, and expert opinion. The initial conditions used at each site are given in Table   S3, and sources used in their compilation listed in Table S4.

Parameter selection
To reduce the dimensionality of the calibration we screened SIPNET's soil decomposition and plant physiology parameters (40 in total) to determine which parameters contribute most to the uncertainty in the NEE and LE predictions of the model, and thus could be constrained by NEE and LE data in calibration. We used a global uncertainty analysis approach to determine the contribution of different parameters to the output uncertainty. First, we generated 250 ensemble runs per site by uniformly sampling prior parameter distributions.
Next, we fitted separate univariate linear regressions between all 40 parameter samples and outputs. Then, we estimated the relative importance of each parameter using the respective R 2 values of each linear regression (Dietze, 2017). Full sensitivity analysis results are available in Fig. S2. In the end, we selected the 5 most influential parameters (Table 1)

Model calibration experiments
In this study, we compare the performance of three calibration approaches ( Fig. 1): i) individual site-level, ii) joint across-site, and iii) hierarchical. Note that both individual and hierarchical approaches produce site-level calibrations. Therefore, we specify the site-level calibrations in the individual calibration framework as simple Bayesian (SB) site-level, and in the hierarchical framework as HB site-level (in-sample). Likewise, both joint and hierarchical calibration produces "global" across-site parameters which are referred as joint-global and hierarchical-global, respectively (Table 2). All results are reported for the 12 DBF sites.
One of our hypotheses (H3) suggests that model predictions with hierarchical global posteriors will capture data at out-of-sample sites better than the model predictions using joint global parameters. To test this, we used leave-one-out cross validation (LOO-CV, Gelman et al., Susiluoto et al., 2018) where we repeated the joint and HB calibration by leaving one site out at a time, and using the global parameter estimates for prediction at the out-of-sample sites.
LOO-CV was not performed for the independent site-level SB fits because, from a Bayesian perspective, this approach does not provide a basis for out-of-sample prediction. Table 2   in the ensemble run (Yes) or just the five targeted parameters (No).

Calibration approach Description All pars
Pre-PDA ensembles Model is run with PEcAn meta-analysis posteriors (PDA-priors). Yes Therefore, in this study we used the emulator approach which can be readily extended to more complex models than SIPNET.
The overall calibration scheme closely follows Fer et al. (2018): Calibration priors for model parameters were generated by PEcAn's trait data meta-analysis module, which performs a univariate hierarchical Bayesian meta-analysis while accounting for random site effects (LeBauer et al., 2013;Raczka et al., 2018). Flux data were u* filtered, corrected for autocorrelation and residuals were modeled by a heteroskedastic Laplacian error distribution following Richardson et al. (2006). Gaussian Process (GP) models were used as the emulator following the basic algorithm: Each site-level emulator calibration is run in iterative rounds with adaptive sampling following Fer et al. (2018), with the exception that we changed the total number of proposed knots per round in this study to px30, p being the number of parameters. This adaptive sampling achieves an effect analogous to a nested grid, providing a higher resolution near the center of the posterior distribution and sparse sampling in parts of parameter space with low probability. is also in agreement with the scaling experiment in Fer et al. (2018) where they showed the rate of improvement in model data agreement has not substantially changed as proposed number of knots reached 480 for 6 parameters (Fig 7b therein).

Calibration scheme
Algorithms for all three workflows are provided in

Comparison of parameter posterior distributions
To assess the impact of calibration approach on parameter means and uncertainties we report the marginal posterior median and 95% ranges at each site for each calibration approach (Fig 3). As expected, the joint calibration global posteriors were the narrowest for all parameters as this approach requires estimating the smallest number of parameters (i.e. one parameter vector for all sites). The HB global posteriors, which describe the across-site means, were typically widest, but still more informed than the priors. This is also expected as the width of these interval estimates is primarily controlled by the number of sites, not the amount of data available within a site, and we only used 12 sites in this analysis. In general, the joint and HB

Analysis of random effects
Decomposing the across-site (global) covariance matrix showed that soil respiration parameter has the highest site-to-site variability (Fig 4, right panel). This is in contrast with the relatively modest variation apparent in Fig 3, especially in comparison to the prior range, but reflects the consistently low values of the posterior parameter estimates, causing the relative variability to be high. Across-site parameter correlations were generally modest; the parameters that showed the highest site-to-site (negative) correlations with each other (Fig 4, left panel) were the leaf growth and the canopy aerodynamic resistance parameter with -0.3, and leaf growth and the soil respiration parameter with -0.29.

Comparisons of calibration approaches at in-sample sites
(H1) Model predictions will be falsely over-confident after joint fitting because it neglects across-site parameter variability and estimates a single parameter vector for all sites. In other words, uncertainties in parameter posterior estimates and post-calibration model ensemble output will be narrower but model-data agreement will be worse (higher deviance).
(H2) Model-data comparison will show the best agreements (lowest deviance) after the site-level calibrations. Sites-level parameter estimates will gain additional constraints from HB calibration, and HB site-level post-calibration model ensemble agreement with data will be better (lower deviance) than SB site-level post-calibration ensemble.
Comparison of pre-and post-calibration fits supported the first two hypotheses (Fig 6).
While all three calibration approaches showed improvement over the pre-PDA (Fig S4)

Comparisons of calibration approaches at out-of-sample sites
(H3) Out-of-sample model predictions using parameters fitted through joint calibration will again be overconfident: i.e. predictive intervals will be more precise (narrower width) but will often exclude data at out-of-sample sites (lower coverage). Model predictions using the HB calibration will be less precise (more uncertain) but more accurate (better model assessment scores) at out-of-sample sites.
The leave-one-out cross validation (LOO-CV) tests with joint and hierarchical global parameters at the out-of-sample sites partly supported the third hypothesis. shows that while joint calibration LOO-CV deviance values were lower on average, the best ensemble member (== lowest deviance) was usually in HB calibration LOO-CV ensemble.

Uncertainty partitioning
(H4) Across-site parameter variability will be higher than parameter uncertainty.
Partitioning of effects of the HB calibration by propagating the parameter uncertainty encapsulated by the site-level parameter posterior distributions, and the parameter variability encapsulated by the global parameter posterior distributions into model predictions revealed that uncertainty is dominated by across-site parameter variability (Fig 7), confirming our 4 th hypothesis. performances of the three approaches were admittedly small; however, we note that this was a result from targeting five model parameters and using only twelve sites in this proof of concept study. With more parameters and more sites included in the calibration the false confidence and lack of flexibility of the joint calibration is expected to get worse, while the gains from HB calibration are expected to become more pronounced.

Analysis of the random effects
Analysis of across-site correlation showed relatively modest (negative) correlation between across-site parameters, the highest being between the aerodynamic resistance constant and the leaf growth parameters. In other words, sites with high aerodynamic resistance constants were having low leaf growth values, and vice versa. In SIPNET, both higher aerodynamic resistance and leaf growth values result in less evaporation from soil surface.
However, except for one site, no similar trade-off was observed within sites (not shown), therefore it is unlikely that it could have led to across-site trade-offs. Still, such across-site correlations have practical implications. For example, if we can inform one of these correlated parameters when making out-of-sample predictions, we would gain information about the other (Shiklomanov et al., 2020).
The correlation analysis between potential explanatory variables and site-to-site parameter variability showed relatively high correlation between mean annual temperature and optimum photosynthesis parameter as expected (Dietze, 2014;Lombardozzi et al., 2015). This suggests that the model may be trying to compensate for the response to different temperatures by using different values for these parameters, and thus improving the temperature sensitivity, or possibly accounting for acclimation, ecotypic adaptation, or species composition within the model processes can reduce site-to-site variability and improve model predictions. Separating acclimation from adaptation is difficult from the current analysis, but could be improved by further analyses exploring different random effects such as temporal random effects if acclimation is over shorter timescales.
The highest site-to-site variability, however, was seen for the soil respiration parameter, and the highest correlations between explanatory variables and its site-to-site variability were with initial carbon pools. This could be a compensating error in SIPNET. It could also be due to missing processes in SIPNET where the model cannot capture that larger pools are associated with more recalcitrant carbon. This supports recent directions of improvements to use realistic and measurable carbon pools, and include coupled carbon, nitrogen and phosphorus cycles in soil biogeochemical modeling (Abramoff et al., 2018;Thum et al., 2019).
Among explanatory variables, a priori fixed parameters for the Laplace error model of fluxes also showed high correlations with site-to-site variability of some parameters, especially with the water use efficiency constant (wueConst). wueConst was the most influential parameter contributing to the model output uncertainties for the latent heat (LE) variable. In other words, Laplace parameters related to LE data stream may be expected to explain some of the site-tosite variability in wueConst. However, within site parameter correlations (not shown) also show relatively high negative correlation between wueConst and optimum photosynthesis parameter (psnTOpt) or positive correlation between wueConst and leaf growth parameter, sometimes both. Both psnTOpt and leaf growth parameter values were also controlled by the NEE data stream, therefore Laplace parameters related to NEE may also be expected to explain some of the site-to-site variability in wueConst within our current calibration scheme, which was the case for our results. However, in principle, the parameters of the error model should not be correlated with site-to-site variability of the process-model parameters. We calculated the parameters for the Laplace error model following the approach in Richardson et al. (2006). Fits for the asymmetric heteroskedastic Laplace flux uncertainty parameters of NEE ( Fig S5) revealed that the US-Dk2 site has distinctly different slopes than the rest of the sites. Indeed, when this site was removed from the correlation analysis there was no remaining correlation relationship between fixed Laplace parameters and site-to-site variability of model parameters. A closer look at the raw flux data from this site showed noisy data even after the u* filtering which needs further inspection. Still, an apparent next step for our calibration scheme is to fit the Laplace error model parameters simultaneously with the rest of the model parameters, instead of using a priori fixed values.

Out-of-sample experiments
Leave-one-out cross validation (LOO-CV) exercise showed that even though joint calibration global posterior ensemble simulations were more confident (smaller width on average), they excluded more of the observational data points in their predictive intervals. This was also visible in the residual check (not shown) where joint calibration posterior predictive intervals over-and under-predicted more values in comparison to the HB LOO-CV. We note that even fewer number (11)

Future steps
In this study we included 12 sites as we worked with a simple model that can simulate one biome type at a time. Next step is to perform emulator-based hierarchical Bayesian calibration of a more sophisticated dynamic vegetation model which would allow us to include more sites that can better inform the site effects. However, emulating such models might require emulator implementations that are more scalable because in that case there would be more parameters (of multiple plant functional types that represent different biomes) to target in the calibration. Adequately training an emulator for a parameter space of increasing size requires more design points, and our current emulator also gets slow with size due to O(N 3 ) floating point operations and O(N 2 ) memory that are needed in the Gaussian Process (GP) implementation, limiting its applicability in the current calibration scheme. Therefore, utilizing recent sparsity solutions to computational limitations of full GPs would be the logical next step in this regard (Vanhatalo et al., 2013;Matthews et al., 2017). parameter differences as simple "random effects" that are independent and uncorrelated.
However, as we switch to more scalable emulators and increase the number of calibration sites (with the distance between them decreasing) we need to better consider the spatial correlation structure in parameter variability. Similarly, our hierarchical model here only considered spatial variability in model parameters, but just as there are unaccounted for processes that cause variability in space, there are likely to be unaccounted for processes in models that would cause the parameters to vary in time (Wikle et al., 2019). Therefore a logical next step is to develop spatiotemporal models of parameter variability at the hierarchical level using our the emulator framework.   Table S3. Initial pool values. LAI0 are set to 0 for all sites, as runs start from winter dormancy. WSP are also set to 0 for all sites following previous studies, as this is a fast pool and snow accumulation by Jan 1 st is assumed to be negligible at most of our sites. When only one of them or the total wood C content were available in the literature, 0.8/0.2 ratio is assumed for CA/CR. When only one of them or the total litter C content were available in the literature, 0.5/0.5 ratio is assumed for CLL/CLW.  Table 2. Deviance values are normalized by sample size at each site to obtain intercomparable results across sites, the lower the better.