rsofun v4.4.1: A model-data integration framework for simulating ecosystem processes

Mechanistic vegetation models serve to estimate terrestrial carbon fluxes and climate impacts on ecosystems across diverse biotic and abiotic conditions. Systematically informing them with data is key for enhancing their predictive accuracy and estimate uncertainty. Here we present the Simulating Optimal FUNctioning {rsofun} R package, providing a computationally efficient and parallelizable implementation of the P-model for site-scale simulations of ecosystem photosynthesis, complemented with functionalities for Bayesian model-data integration and estimation of parameters and uncertainty. We provide a use case to demonstrate the package functionalities for modelling ecosystem gross CO2 uptake at one flux measurement site, including model sensitivity analysis, Bayesian parameter calibration, and prediction uncertainty estimation. {rsofun} lowers the bar of entry to ecosystem modelling and model-data integration and serves as an open access resource for model development and dissemination.


Introduction
The modelling of land ecosystem processes and structure, water, and carbon fluxes relies on both mechanistic and statistical approaches (Dietze et al., 2018;Hartig et al., 2012;Van Oijen et al., 2005).Mechanistic models are formulated as mathematical descriptions of functional relationships between the abiotic environment and ecosystem states and rates.These descriptions reflect available theory and general empirical patterns and provide a means for "translating" hypotheses about governing principles and causal relationships into testable predictions (Marquet et al., 2014), and for upscaling model-based estimates in geographical space and to novel environmental conditions.However, also mechanistic models rely on statistical (empirical) descriptions of processes at varying levels of abstraction.
Mechanistic models have model parameters that are either specified directly or fitted to data.A great advantage of mechanistic models is that they explicitly link known physical constants with process representations (e.g., molecular mass of CO2 for diffusion and assimilation, or the gravitational constant and viscosity of water for its transport and transpiration).Other parameters may be specified based on independent measurements under controlled conditions (e.g., the activation energy of Arrhenius-type metabolic rates), or represent (more or less directly) measurable plant functional traits, taken as constant over time and within plant functional types (PFTs -commonly the basic units in mechanistic vegetation models).Both types of parameters have traditionally been specified in models directly ('direct parameterization' (Hartig et al., 2012)).Yet other model parameters may not be directly observable and describe processes that are not explicitly resolved in models, but parameterized at a higher level of abstraction.Such parameters are often fitted to data such that the agreement between one (or several) related model predictions and observations is optimised.Parameter optimization/estimation for mechanistic vegetation models typically employs generic optimization algorithms or Bayesian statistical approaches and is often used for specifying different types of parameters (except for universal physical constants).Bayesian methods have the advantage that they enable a systematic assessment of the correlation structure among multiple fitted parameters, provide a means for considering uncertainty in observations and available prior information, and generate probabilistic parameter estimations and model predictions (Dietze et al., 2018;Hartig et al., 2012;Van Oijen et al., 2005).
State-of-the-art mechanistic vegetation models, used for global simulations and as components in Earth System Models, are complex.They describe a large number of interrelated state variables, treat multiple feedbacks between them, and rely on a large number of parameters for multiple PFTs and other ecosystem components and properties (e.g., soil, microbes, biophysical and hydrological properties).This makes them "data-hungry" (Hartig et al., 2012).However, their complexity and computational demand poses a limitation for systematic model-data integration and Bayesian parameter estimation.
Eco-evolutionary optimality (EEO) principles have been proposed for reducing model complexity and for a robust grounding of models in governing principles (Franklin et al., 2020;Harrison et al., 2021).They enable parameter-sparse representations, limit the distinction of separate PFTs, and may enable better model generalisations to novel environmental regimes.EEO principles make predictions of plant functional traits that would otherwise have to be prescribed as temporally fixed model parameters.However, although parameter-sparse, EEO-based vegetation models are not devoid of model parameters.Ideally, remaining parameters represent known physical constants or quantities that can be measured independently.But remaining parameters in optimality models typically also represent quantities that are not directly measurable -e.g., the marginal cost of water in (Cowan and Farquhar, 1977), or the unit cost ratio in (Prentice et al., 2014).These are considered to be more universally valid (e.g., without distinctions between PFTs), but still have to be fitted to data.
The P-model (Prentice et al., 2014;Stocker et al., 2020;Wang et al., 2017b) is an example of an EEO-guided model for terrestrial photosynthesis and its acclimation.It avoids the requirement for prescribing PFT-specific parameters of photosynthesis and stomatal regulation, but instead predicts them from universal EEO principles for the full range of environmental conditions across the Earth's (C3 photosynthesis-dominated) biomes.However, not directly observable parameters of the underlying EEO theory and of additional empirical parameterizations employed in the P-model (Stocker et al., 2020) remain and have to be fitted to data.
Here, we provide a solution for this challenge, acknowledging that the data is an integral part of the modelling process (Dietze et al., 2013) -even in theory-based models of ecosystem processes.We show how the most important parameters contributing to uncertainty in the P-model can be estimated inversely, using observations of ecosystem-level photosynthetic CO2 uptake (gross primary productivity, GPP).We provide a brief description of the theory embodied in the P-model and introduce the P-model implementation in the Simulating Optimal FUNctioning {rsofun} version v4.4 modelling framework, made available as an R package.A more comprehensive P-model description can be found in (Stocker et al., 2020).We demonstrate the functionalities of {rsofun} through a case study for simulating GPP at a single flux measurement site.The case study includes sensitivity analysis, Bayesian model calibration, and inference -the prediction of GPP with an estimation of its uncertainty.This paper presents the calibration to GPP observations only, but the package allows calibration to multiple targets simultaneously, including fluxes and leaf traits.

P-model description
The P-model predicts the acclimation of leaf-level photosynthesis to a (slow varying) environment based on EEO principles and thereby yields a parameter-sparse representation of ecosystem-level quantities, generalising across (C3 photosynthesisdominated) vegetation types and biomes.It combines the established theory for C3 photosynthesis following the Farquharvon Caemmerer-Berry (FvCB, (Farquhar et al., 1980)) model with the Least-Cost hypothesis for the optimal balancing of water loss and carbon gain (Prentice et al., 2014), and the coordination hypothesis (Wang et al., 2017a), which states that the light and Rubisco-limited assimilation rates (as described by the FvCB model) are equal for representative daytime environmental conditions.Based on these theoretical foundations, gross primary productivity (GPP) can be modelled as the product of absorbed photosynthetically active radiation, specified by the locally measured photosynthetically active radiation (PAR) and remotely sensed fAPAR, and the theory-based prediction of the ecosystem-level light use efficiency (LUE, (Stocker et al., 2020;Wang et al., 2017b).LUE acclimates to preceding environmental conditions with a characteristic, empirically determined (calibrated), time scale τ.Two latent (not directly observable) parameters govern the optimalityguided water-carbon trade-off: the unit cost ratio β (governing the balancing of maintaining the carboxylation capacity versus the transpiration stream) and c * (the marginal cost of maintaining the electron transport rate).These have previously been calibrated separately to data and specified as fixed model parameters in the P-model (Stocker et al., 2020;Wang et al., 2017b).The theory for predicting acclimated LUE then requires only the atmospheric environment to be specified (meteorological variables and CO2).A set of corollary predictions, physically and physiologically consistent with the simulated LUE, follow.These include the acclimated base rates of photosynthetic capacities in the FvCB model (Vcmax25 and Jmax25), the acclimated average ratio of leaf internal-to-ambient CO2 concentration (ci:ca), acclimated average daytime stomatal conductance (gs), and the acclimated base rate of leaf dark respiration (Rd25).Physical constants and additional parameters that determine the instantaneous temperature dependence of Vcmax, Jmax, Rd, and parameters in the FvCB model are prescribed and held fixed in the P-model (Tab.A2 in (Stocker et al., 2020)).
For simulating GPP, the P-model is conceived as a single-big-leaf model (Fig. 1).While describing leaf-level quantities at relatively high mechanistic detail, the link between the leaf and the canopy-scale is not explicitly resolved.Instead, an empirical approach for leaf-to-canopy scaling CO2 uptake is employed by treating the quantum yield parameter φ0 to be representative for the canopy-scale and allowing it to be calibrated to ecosystem-level CO2 flux data.The implementation of the P-model in the {rsofun} package (version v4.4) further includes an empirical parameterization of the temperature dependency of the quantum yield φ0, generalising the approach taken in (Stocker et al., 2020)), and an empirical soil moisture stress function, generalising the approach taken in (Stocker et al., 2020)) (Appendix A).Acclimating quantities are derived by employing the P-model theory to gradually varying environmental conditions where variations are damped and lagged by a characteristic (calibrated) time scale τ.The continuous treatment of the acclimation time scale is different from (Stocker et al., 2020).The temperature dependency of φ0 can be turned off by setting a (see Tab. 1 and Appendix A) to 0 ('ORG' model setup in (Stocker et al., 2020)).The soil moisture stress function can be turned off by setting β0 (see Tab. 1 and Appendix A) to 1 ('BRC' model setup in (Stocker et al., 2020)).Taken together, the P-model approach is guided by EEO and thus yields predictions for quantities that otherwise have to be prescribed and fitted to data.Yet, a small set of model parameters, related to empirical parametrizations and latent quantities remain and have to be calibrated to data.See Tab. 1 for a list of all (calibratable) model parameters of the P-model implementation in the {rsofun} package.Table 1: Calibratable parameters in the {rsofun} P-model implementation.We list the fixed parameter values for non-calibrated variables used in Sec.4.2 and, marked with *, the MAP estimates for the calibrated parameters.In parentheses, we list the lower and upper bounds used in the sensitivity analysis (Sec.4.1) and for the uniform priors in the Bayesian calibration (Sec 4.2).

Symbol
Parameter 3 The {rsofun} model framework {rsofun} implements the P-model (Stocker et al., 2020) and provides off-the-shelf methods for Bayesian (probabilistic) parameter and prediction uncertainty estimation.{rsofun} is distributed as an R package and also implements the BiomeE vegetation demography model (Weng et al., 2015).The latter is not further described here and is implemented at an experimental stage in {rsofun} version v4.4.The P-model implementation in {rsofun} is designed for time series simulations by accounting for temporal dependencies in the acclimation to a continuously varying environment.Function wrappers in R make the simulation workflow user-friendly and all functions and input forcing data structures are comprehensively documented (https://geco-bern.github.io/rsofun/).
In {rsofun}, model parameters can be calibrated using a calibration function `calib_sofun()`, providing two modes of calibration, one based on generalised simulated annealing ({GenSA} R package) for global optimization (Xiang et al., 2013) and one based on the {BayesianTools} R package, giving access to a wide variety of Bayesian methods (Hartig et al., 2023).
The former being fast, while the latter provides more informed parameter optimization statistics (Clark, 2004;Dietze et al., 2013).This gives the option for both exploratory and more in-depth analysis of estimated parameters.A set of standard cost functions are provided for the calibration, facilitating the exploration of various metrics or target variables, and the specification of calibrated model parameters.Furthermore, the vignettes accompanying the package (https://gecobern.github.io/rsofun/articles/)explain how to customise the calibration cost functions and interpret the calibration results.

Case study
We use {rsofun} to model GPP as estimated from ecosystem flux measurements taken at one site using the eddy-covariance technique.The site, selected here for demonstration purposes, is Puéchabon (Rambal et al., 2004) -an evergreen Mediterranean forest, dominated by Quercus ilex, growing on relatively shallow soil on karstic bedrock, and located in southern France.The climate is governed by a distinct seasonality in solar radiation and temperature, peaking in summer, and a dry period during summer months with recurrent ecosystem water limitation in late summer (Rambal et al., 2004).
Data used here cover years 2007-2012.Both the model target data (GPP) and model forcing data (see Fig. 1) were measured at the site and obtained from the FLUXNET2015 dataset (Pastorello et al., 2020).Data for years 2013 and 2014 were available from FLUXNET2015 but removed here due to unexpectedly low GPP measurements (pers.comm.Jean-Marc Limousin).

Sensitivity analysis
The P-model implementation in {rsofun} has a total of nine model parameters that are accessible for calibration (Tab.1).
Due to the relatively high computational cost of simultaneously calibrating all nine model parameters, we start by performing a sensitivity analysis to determine the most influential parameters on the model fit and exclude the least influential parameters for the subsequent calibration step.Here, we apply the Morris method for global sensitivity analysis (Morris, 1991) from the {sensitivity} R package (Looss et al., 2023) and compute the sensitivity metrics μ*, indicating the magnitude of the overall influence of a given parameter on the prediction, and σ, a measure of the heterogeneity of a parameter's influence on the prediction across the parameter space.We assume an additive and normally distributed model error term for the GPP prediction by the P-model (Trotsiuk et al., 2020) and express the fit to observed data via the Gaussian log-likelihood.The sensitivity analysis result (Fig. 2) indicates that the model error term is the most influential parameter, followed by the quantum yield intercept parameter φ0, the unit cost of electron transport c* and the soil moisture stress intercept parameter  ! .The remaining parameters have relatively little influence on the evaluation of daily GPP predictions from the single site considered here and b0, the ratio of dark respiration to the temperature-normalised maximum carboxylation rate, has no influence on GPP predictions, which follows from the model structure.These parameters will be held constant and thus excluded from the Bayesian calibration below.Note that we chose the threshold of influential and less influential (and thus fixed) parameters for demonstration purposes here and thus arbitrarily.

Bayesian calibration
We simultaneously calibrate a subset of the model parameters that have been identified as particularly influential (Sec.4.1.).
These include the model error term, φ0, c*, and β0.By letting φ0 and β0 to be calibrated and setting a to a fixed value of -0.0025 °C "# , the simulations consider a temperature-dependence of φ0 and a soil moisture stress effect on GPP (see Appendix A) -similar to the model setup 'FULL' in (Stocker et al., 2020).Note that the model error is simultaneously calibrated following (Dietze, 2017, p.201).We use the DEzs MCMC sampler (Ter Braak and Vrugt, 2008) as implemented in {BayesianTools} (Hartig et al., 2023) to estimate the posterior distribution of the calibrated parameters (Fig. 3).All parameters are given a uniform prior with bounds informed by their physical interpretations (Tab.1).GPP is assumed to be normally distributed, as in the Morris analysis.On a 12th Gen Intel Core i7-1270P processor, it took 552.854sec.to run 3 independent MCMC chains, each with 3 internal chains, and 2000 iterations of individual {rsofun} P-model runs per internal chain.The algorithm converged with a scale reduction factor (Gelman and Rubin, 1992) of 1.05 (≤1.1).More detail, examples, and explanations of calibration diagnostics are provided in the vignette (https://gecobern.github.io/rsofun/articles/sensitivity_analysis.html).The calculation of the log-likelihood is implemented in the function `cost_likelihood_pmodel()`, enabling custom calibrations, also for multiple target variables that are considered

Inference and prediction uncertainty estimation
The parameter sets generated by the MCMC chains provide the basis for inference (model prediction) and prediction uncertainty estimation, allowing us to get insights into the sources of uncertainty and assess the potential risks associated with the predicted outcomes in a changing environment.We consider a simple representation of the uncertainty, split between the parameter uncertainty and the model error (Dietze, 2017).We take 600 MCMC samples from the multivariate parameter posterior distribution (estimated during the calibration) and run the P-model for each set of parameters to predict GPP.The credible interval is computed from the posterior distribution of predicted GPP (at each time step).The predictive credible interval for GPP is computed by adding generated Gaussian error, with standard deviation err_gpp, to the predicted GPP.Fig. 4 shows that the uncertainty ascribable to the parameters (in green) is much smaller than the uncertainty due to the model error (in orange).A point estimate of GPP at each time step is calculated as the median of the posterior distribution of GPP (in orange).

Discussion
The {rsofun} R package provides a user-friendly and fast implementation of the P-model.It implements off-the-shelf modeldata assimilation routines with simple function calls, while maintaining flexibility for future experiments and the further development of model uncertainty estimation.The {rsofun} vegetation modelling framework is designed to strike a balance between integration and flexibility, enabled by extensible and user-defined calibration specifications and the parallelizable and fast low-level code implementation in Fortran 90.Its high computational efficiency offers the potential for effective model parameter and uncertainty estimation using Bayesian statistical methods -as demonstrated here.
With the function `calib_sofun()`, {rsofun} provides a blueprint for model-data assimilation and an implementation of the likelihood function with flexibility in selecting among a predefined list of model parameters and target observations.We have demonstrated here how ecosystem flux measurements of GPP from a single site can be used to estimate model parameters of the P-model and generate estimations of prediction uncertainty.The approach taken in previous studies with the P-model was to specify latent parameters directly, based on independent observations.In (Wang et al., 2017b), β and c * were determined from observations of Vcmax, Jmax, and ci:ca.These were then used as constants for modelling GPP, while additional parameters related to empirical photosynthesis stress parameterizations were calibrated to GPP observations.
Here, we simultaneously calibrate both types of parameters to GPP data.We have followed a simplified setup -used here for demonstration purposes.We note that future applications may use Vcmax, Jmax, or ci:ca data directly as additional calibration targets to provide complementary constraints.Note also that the FvCB photosynthesis model contains additional parameters (see Tab. A2 in Stocker et al., 2020) that are treated as constants here -as in previous publications (Bloomfield et al., 2023;Mengoli et al., 2022;Stocker et al., 2020;Tan et al., 2021;Wang et al., 2017b).
Our model implementation as an R package takes inspiration from the {r3PG} forest model (Trotsiuk et al., 2020), and our implementation of model-data integration on the basis of ecosystem data serves similar, yet reduced, aims and functionalities compared to PEcAn (https://pecanproject.github.io/index.html)(LeBauer et al., 2013).{rsofun} is designed to be minimally reliant on package dependencies and connections to specific data, while limiting the implementation and usability of a predefined set of process models (currently P-model, BiomeE at an experimentation stage).Note that the {rpmodel} R package (available on CRAN) also provides an implementation of the P-model, but is written fully in R and in the form of a function of a given environment -without a treatment of temporal dependencies and without the functionalities for modeldata integration.

Conclusions
We have demonstrated how important parameters contributing to uncertainty in GPP predictions by the P-model can be estimated inversely, using observations of ecosystem-level photosynthetic CO2 uptake flux time series.The Bayesian approach to model-data integration enables a probabilistic prediction of GPP and estimation of model parameters.The {rsofun} model implementation as an R package makes it possible to leverage a set of methods and complementary libraries for parameter estimation, sensitivity analysis, and calibration diagnostics -as demonstrated here.Its low-level code in Fortran 90 is geared towards computational efficiency.Provision of {rsofun} as an open-access library aims at lowering the bar of entry to modelling for both field ecologists and computational ecologists and serves as an Open Science resource for future model development and experimentation and the further development of model uncertainty estimation.

Appendix A: Calibrated empirical functions
A new formulation of the temperature dependency of the quantum yield efficiency φ0 is introduced in the {rsofun} package, allowing more flexibility than Eq.18 in (Stocker et al., 2020).It is expressed as follows: Where T stands for temperature, c is the quantum yield efficiency at optimal temperature (in mol mol "$ ), a is a unitless shape parameter and b is the optimal temperature.Whenever  = 0, the quantum yield efficiency is kept constant at  != .
A possible improvement for the model would be to use a peaked Arrhenius function instead of a parabola (Medlyn et al., 2002).Furthermore, the soil moisture stress function follows (Stocker et al., 2020), but the parameters considered for calibration there differ from the calibratable parameters in the package.
In the equation above,  stands for the plant-available soil water (in mm) and  * for the threshold indicating when the plants start being water stressed.The intercept  ! is the  reduction at low soil moisture, that is  != (0).This intercept is now calibrated directly, rather than expressed as a function of mean aridity (see eq. 20 in (Stocker et al., 2020)).

Figure 1 :
Figure 1: P-model inputs, outputs, and target observations for the calibration.The model takes as inputs static site information (longitude, latitude, elevation and soil water holding capacity), time series of meteorological forcing (daily values of temperature, VPD, PPFD, net radiation, atmospheric pressure, precipitation, cloud coverage, CO2 , fAPAR ), simulation parameters (spinup years, recycle period length) and, common for all sites, model parameters (listed in Tab. 1).The simulation returns a time series of several ecosystem fluxes, acclimating leaf traits and ecosystem water balance quantities.By comparing these outputs to field measurements and flux data in a Bayesian calibration routine, model parameters can be estimated.

Figure 2 :
Figure 2: Results from the Morris sensitivity analysis.The y-axis represents all 9 model parameters and the Gaussian error terms's standard deviation (error_gpp) and the x-axis the values of the statistics μ* and σ (unitless, like the log-likelihood).μ* indicates the magnitude of the overall influence of a certain parameter on the P-model output, while σ measures the heterogeneity of such influence across the parameter space.The parameter names are described and identified with corresponding symbols in Tab. 1.

Figure 3 :
Figure 3: Prior and posterior distributions of the calibrated model parameters and error term.The maximum a posteriori (MAP) estimate for the gaussian error is 1.22210684 and the estimates for each calibrated parameter are given in Tab. 1.

Figure 4 :
Figure 4: Predicted and observed GPP time series.The comparison is provided for the first year of GPP observations (black) at the site FR-Pue (Rambal et al., 2004) against GPP predictions (orange line), calculated as the median of the posterior distribution.A light green band indicates the 90% credible interval for GPP predictions, which captures parameter uncertainty, while the 90% predictive interval for GPP predictions (in orange) captures model uncertainty.