Calibration of models to data: a comparison of methods

Complex models are often fitted to data using simulation-based calibration, a computationally challenging process. Several calibration methods have been developed to improve computational efficiency, with no consensus on which methods perform best. We did a simulation study comparing the performance of five methods that differed in their Goodness-of-Fit (GOF) metrics and parameter search strategies. Posterior densities for two parameters of a simple Susceptible-Infectious-Recovered epidemic model were obtained for each calibration method under two scenarios. Scenario 1 (S1) allowed 60K model runs and provided two target statistics, whereas Scenario 2 (S2) allowed 75K model runs and provided three target statistics. For both scenarios, we obtained reference posteriors, against which we compared all other methods, by running Rejection ABC for 5M parameter combinations and retaining the 0.1% best. We assessed performance by applying a 2D grid to all posterior densities and quantifying the percentage overlap with the reference posterior. We considered basic and adaptive sampling calibration methods. Of the basic calibration methods, Bayesian calibration (Bc) with Sampling Importance Resampling (S1: 34.8%, S2: 39.8%) outperformed Rejection Approximate Bayesian Computation (ABC) (S1: 2.3%, S2: 1.8%). Among the adaptive sampling methods, Bc Incremental Mixture Importance Sampling (S1: 72.7%, S2: 85.5%) outperformed sequential Monte Carlo ABC (AbcSmc) (S1: 53.9%, S2: 72.9%) and Sequential ABC (S1: 21.6%, S2: 62.7%). Basic methods led to sub-optimal calibration results. Methods using the surrogate likelihood as a GOF outperformed methods using a distance measure. Adaptive sampling methods were more efficient than their basic counterparts and resulted in accurate posterior distributions. BcIMIS was the best performing method. When three rather than two target statistics were available, the difference in performance between the adaptive sampling methods was less pronounced.
Although BcIMIS outperforms the other methods, limitations related to the target statistics and available computing infrastructure may warrant the choice of an alternative method.

Author summary

As mathematical models become more realistic, they tend to become more complex. Calibration, the process of tuning a model to better reproduce empirical data, can become dramatically more computationally intensive as model complexity increases. Researchers have responded by developing a range of more efficient, adaptive sampling calibration methods. However, the relative performance of these calibration methods remains unclear. To this end, we quantified the performance of five commonly used calibration methods. We found that adaptive sampling methods were more efficient compared to their basic counterparts and resulted in more accurate posterior distributions. We identified the best performing method, but caution that limitations related to the target statistics and available computing infrastructure may warrant the choice of one of the alternatives. Finally, we provide the code used to apply the calibration methods in our study as a primer to facilitate their application.

Researchers use mathematical and computer simulation models to approximate real-world processes [1][2][3]. Model calibration, or fitting the model to data, allows for correct estimation of uncertainty and increases the confidence that the model provides a realistic approximation to the real-world process [3,4]. Another use of calibration is to estimate parameter values that are not available in the literature [5]. Representing parameter uncertainty allows decision-makers to assess the relative likelihood of different outcomes and not merely a point estimate of the most likely outcome. Calibrating complex models to data is often done using simulation-based calibration, which involves running the model many times [6]. This makes simulation-based calibration computationally costly for models with long run-times [7]. Calibration involves running the model for different parameter combinations and comparing the produced model outputs with empirical summary statistics (targets) to identify the model parameter values that achieve a good fit to the data [3,8]. The main components of a calibration method are the targets, the parameter-search strategy, the Goodness-Of-Fit (GOF) measure, the acceptance criteria, and the stopping rules [4]. We distinguish Approximate Bayesian Computation (ABC) methods, which use a distance measure as the GOF (e.g. Euclidean distance), from Bayesian calibration methods, which use the likelihood as the GOF measure (e.g. a binomial likelihood) [8,9]. In this study, we focus on sampling algorithms as the parameter-search strategy, since sampling methods obtain valid estimates of parameter uncertainty and of correlations between parameters [1,8,10].
The basic versions of sampling algorithms, Rejection ABC and Bayesian calibration with Sampling Importance Resampling (BcSIR), draw a random sample of parameter values from the prior distribution and compute a posterior distribution [8,11]. Researchers have proposed several adaptive sampling algorithms as more efficient parameter-search strategies [3,7,12,13]. Adaptive sampling algorithms improve efficiency by iteratively adapting the prior distribution such that more of the sampled parameter combinations lead to a good fit to the target statistics [7,12,13]. In the current study, we compare sequential Monte Carlo approximate Bayesian computation with partial least squares (AbcSmc), adaptive population Monte Carlo ABC, hereafter called Sequential ABC - Lenormand (Seq ABC), and Bayesian calibration with Incremental Mixture Importance Sampling (BcIMIS) [7,12,13]. We can divide the total time consumed by calibration methods into time spent running the model and time spent running the sampling algorithm itself. Although sampling algorithms can vary substantially in run-time, the sampling algorithm's efficiency (the number of model runs needed to achieve a certain GOF) is ultimately more important: as models become increasingly complex, the sampling algorithm run-time remains roughly constant, while model run-time may increase arbitrarily.
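The study's implementations use R packages (EasyABC and IMIS); purely as a language-agnostic illustration of the basic Rejection ABC logic described above, a minimal sketch in Python could look as follows. All names and the toy model here are hypothetical, not the packages' APIs.

```python
import random

def rejection_abc(simulate, prior_sample, targets, n_draws, tolerance):
    """Draw parameters from the prior; keep those whose simulated summary
    statistics fall within `tolerance` (Euclidean distance) of the targets."""
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample()
        stats = simulate(theta)
        dist = sum((s - t) ** 2 for s, t in zip(stats, targets)) ** 0.5
        if dist <= tolerance:
            accepted.append(theta)
    return accepted

# Toy example: the "model" returns the parameter itself plus noise,
# so the accepted draws should concentrate around the target value 0.5.
random.seed(1)
posterior = rejection_abc(
    simulate=lambda th: (th + random.gauss(0, 0.1),),
    prior_sample=lambda: random.uniform(0, 1),
    targets=(0.5,),
    n_draws=2000,
    tolerance=0.05,
)
```

The tolerance plays the role of the acceptance criterion: the smaller it is, the closer the accepted sample approximates the true posterior, at the cost of more rejected (wasted) model runs.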

A previous simulation study by Minter et al. found that the amount of data available for calibration affects the ability of the calibration methods to estimate the parameters [14]. The same study found that the number of retained parameter combinations, which is a function of the number of model runs performed and the tolerance of choice, influences the speed of the ABC algorithm and the accuracy of the posterior distribution. The study also found that Rejection ABC samples inefficiently compared to the ABC Sequential Monte Carlo (ABC-SMC) algorithm [14]. Similarly, research showed that BcSIR was inefficient compared to BcIMIS [8]. Another study showed that BcIMIS outperformed both simple BcSIR and three publicly available variants of generic Markov Chain Monte Carlo (MCMC) [13]. A review of calibration methods found that the number of target statistics is often only slightly higher than the number of parameters calibrated, indicating the limited availability of target statistics to calibrate the model [15]. This review also highlighted the need for studies comparing the performance, strengths, and limitations of calibration methods in scenarios differing in the number of target statistics and the number of calibrated parameters [15].

Despite this research, many questions remain around the performance of calibration methods [3,4,15]. Existing literature comparing the performance of calibration methods does not evaluate distance measures versus surrogate likelihoods as GOF measures and therefore does not allow us to draw general conclusions [16]. Also, previous literature does not quantify how well these adaptive sampling methods approximate the true posterior. Using a simple Susceptible-Infectious-Recovered disease transmission model, we compare the performance of commonly applied basic and adaptive sampling methods, including both distance measures and surrogate likelihoods as GOF measures [15]. We assume that both the number of target statistics and the computational resources available to the researcher are limited. We highlight differences between methods concerning implementation and requirements in terms of target statistics.

Performance for obtaining the posterior

Calibration methods differed in their ability to obtain the posterior (see Fig 1, Fig 2, and Fig 3). In Scenario 1, where the targets comprised prevalence at time 50 and prevalence at time 75, the posterior density for Rejection ABC was more dispersed than the reference posterior. The posterior densities for the other methods were closer to the reference posterior, but most methods were unable to identify β accurately. The posterior density for BcSIR was strongly multi-modal, in contrast to the other methods (Fig 1 and Fig A).

The posterior densities for all methods in Scenario 2 were closer to the reference posterior (Fig 2 and Fig 3) than they were for Scenario 1 (Fig 1 and Fig 3).

If the computational burden of the model calibration is small enough that a single multi-core personal computer or compute node is sufficient, Rejection ABC and Seq ABC provide the convenience of an n_cluster argument, allowing the user to launch model simulations in parallel on n_cluster cores of the computer. A convergence criterion does not allow the user to infer how many steps and how many model runs will be required before the algorithm reaches convergence. AbcSmc allows the user to specify the number of algorithm steps and the number of model runs within each step. In summary, the choice of one adaptive sampling method over another is context-dependent, conditional on the computational cost of the model and the availability of programming skills in the modelling group.

Our findings are consistent with some of those in previous work. Minter et al. similarly found that Rejection ABC was less efficient than other methods and that increasing the number of informative target statistics improves the performance of the calibration methods [14]. Our study also confirmed previous findings that BcSIR was less efficient than BcIMIS [8,13].

A limitation of our study is the uncertainty regarding how well our results generalize to realistic calibration settings. Aspects of our experimental design might have implications for our qualitative conclusions. We used a low number of simulation runs for each of the methods (e.g. 60K in Scenario 1), while providing broad prior parameter ranges, resembling the situation in which previous research provided little information for these parameters. Using broad prior distributions may have advantaged adaptive sampling methods. Calibrating two parameters in Scenario 1 with 60 000 simulation runs leads to an approximate parameter sampling density of √60000 ≈ 245, or 245 × 245 parameter combinations on the β by γ grid. As the dimensionality of parameter space increases with additional parameters, obtaining a similar parameter sampling density for calibrating three parameters would require over 14 million (245^3) simulation runs, and for four parameters, over 3.6 billion runs. Even when prior parameter ranges are closer to the posterior density, calibrating a large number of parameters can dramatically increase the fraction of parameter space producing bad model outputs. Therefore, we think that the observed increased efficiency of adaptive sampling methods over basic methods may still be an underestimation relative to realistic use cases involving more complex and imperfectly-specified models. Our use of relatively few target statistics (i.e. two to three) compared to the number of parameters we are calibrating (two) may have implications for our results. However, this design choice is in line with a recent systematic review of individual-based models that found that the number of target statistics is often limited compared to the number of parameters calibrated [15]. The last limitation is that we used default settings (e.g. convergence criterion) for most of the methods and did not explore alternative method settings.

To determine the generalizability of our findings, we recommend that future simulation research explore complex models with many parameters and target statistics, calibrating more parameters to more and different types of target statistics. The performance measures used in the current study scale to these higher-dimensional settings. We expect the advantage of adaptive sampling methods over basic methods to grow rapidly as the parameter space increases. Future research on adaptive sampling methods should explore the importance of sufficient exploration of the parameter space in the first set (e.g. of sequential Monte Carlo algorithms), and the optimal distribution of model runs across consecutive sets. Also, further research is required to find out whether some of the adaptive sampling methods encounter problems with the procedures to run the algorithm for more complex models.
Table 2 provides an overview of relevant differences between the calibration methods, including parameter search strategy, GOF, parallelizability, target statistics requirements, and stopping criteria (e.g. user-specified or convergence detection). We simulated an epidemic with a known parameter combination and observed how well different calibration methods approximated a reference posterior distribution. All simulations and analyses were conducted using R version 3.6.2 (2019-12-12) [17]. The R code used to generate the simulations is available in S4 Appendix. The EasyABC package in R was used to perform simulations for Rejection ABC and Seq ABC, whereas the IMIS package was used for BcIMIS [18,19].

Simulation Model

For the model implementation, we used the "SIR" function in the SimInf package in R. The SimInf package uses the Gillespie stochastic simulation algorithm to provide a stochastic framework in which the simulation model is implemented. The framework integrates the dynamics of infection as continuous-time Markov chains, incorporates available data, and simulates epidemics [20,21]. We assumed correct model specification, meaning that the model used to generate the target statistics was also used to re-estimate the parameters. In the model, individuals belong to one of three compartments (i.e. health states): Susceptible (S), Infectious (I), and Recovered (R). In the limit of infinite population size, the model equates to a system of three non-linear ordinary differential equations (ODEs) that relate the three compartments [22].
dS/dt = −βSI/N, dI/dt = βSI/N − γI, dR/dt = γI,

where β > 0 and γ > 0 are the rates at which individuals move from one compartment to another and N(t) = S(t) + I(t) + R(t) is the total population size, which remains constant over time [23].
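The study runs this model through SimInf's Gillespie implementation in R; as a self-contained illustration of the same stochastic algorithm, a minimal Python sketch of a Gillespie SIR simulation might look as follows. The β, γ, and population values here are arbitrary illustrative choices, not the study's true parameters.

```python
import random

def gillespie_sir(beta, gamma, s0, i0, r0, t_end, rng):
    """Exact stochastic (Gillespie) simulation of the SIR model.
    Events: infection at rate beta*S*I/N, recovery at rate gamma*I.
    Returns (times, prevalences) with prevalence = I/N."""
    n = s0 + i0 + r0
    s, i, r, t = s0, i0, r0, 0.0
    times, prev = [0.0], [i / n]
    while t < t_end and i > 0:
        rate_inf = beta * s * i / n
        rate_rec = gamma * i
        total = rate_inf + rate_rec
        t += rng.expovariate(total)       # time to next event
        if t > t_end:
            break
        if rng.random() < rate_inf / total:
            s, i = s - 1, i + 1           # infection event
        else:
            i, r = i - 1, r + 1           # recovery event
        times.append(t)
        prev.append(i / n)
    return times, prev

rng = random.Random(42)
times, prev = gillespie_sir(beta=0.16, gamma=0.077, s0=990, i0=10, r0=0,
                            t_end=100, rng=rng)
```

The prevalence trajectory from such a run is what gets compared against the target statistics (prevalence at times 50 and 75, and peak prevalence in Scenario 2).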

We considered two scenarios with different numbers of prevalence-based target statistics. We define prevalence as the fraction of the population that is infectious at a given point in time. Scenario 1 (S1) considered two target statistics (prevalence at times 50 and 75), whereas Scenario 2 (S2) considered three target statistics: the two target statistics from Scenario 1 plus the peak prevalence. Table 3 gives a summary of the simulation setup.

In the limit of an infinitesimally small acceptance tolerance, Rejection ABC can be used to approximate the true posterior [9]. To increase the efficiency of obtaining the reference posterior, we reduced the width of the prior distributions for β and γ compared to the distributions used for the methods. We ran 5M simulations and retained the 0.1% of parameter combinations with the best GOF.

To compare the posterior densities of the calibration methods to the reference posterior density, we adapted the L2 measure from Lenormand et al. [12]. To this end, we created a raster to compute percentage overlaps [24]. A raster comprises a matrix of cells arranged into rows and columns to form a grid [25]. Each cell contains a number of parameter combinations. The raster was created by considering the minimum and maximum values of β and γ retained by the calibration methods. The resulting parameter space was divided into 100x100 equally sized bins, with β values on the x-axis and γ values on the y-axis, leading to a grid of 100^2 = 10000 cells. We applied the grid to each posterior density to quantify the density in each cell. We computed the percentage overlap for each method by summing the within-cell density differences between the calibration method and the reference and subtracting from 1. Equation (4) computes the percentage overlap for each method within each scenario.
P = 1 − (1/(2n)) Σ_{i=1}^{m} Σ_{j=1}^{m} |M_ij − R_ij|,   (4)

where P is the percentage overlap of a calibration method with the reference posterior. Summing the absolute difference between cells over the m = 100 rows and m = 100 columns of the grid approximates the overlap between the calibrated posterior and the reference posterior. In both scenarios, M_ij represents the cell density of row i and column j of a method in the grid, R_ij represents the cell density of row i and column j of the reference, and n = 5000 is the number of parameter combinations retained by each calibration method (posterior size).
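The study computes this overlap with the R raster package; a minimal Python sketch of the same grid-and-overlap idea might look as follows. Cell counts are normalised so the 1/(2n) factor becomes 1/2, and the bivariate-normal draws are purely synthetic stand-ins for a method's and the reference posterior.

```python
import numpy as np

def percentage_overlap(method_draws, reference_draws, m=100):
    """Bin both (beta, gamma) samples onto a shared m-by-m grid spanning
    their combined range, then compute P = 1 - 0.5 * sum|M_ij - R_ij|
    with M, R as normalised cell densities."""
    both = np.vstack([method_draws, reference_draws])
    lo, hi = both.min(axis=0), both.max(axis=0)
    edges = [np.linspace(lo[k], hi[k], m + 1) for k in (0, 1)]
    M, _, _ = np.histogram2d(method_draws[:, 0], method_draws[:, 1], bins=edges)
    R, _, _ = np.histogram2d(reference_draws[:, 0], reference_draws[:, 1], bins=edges)
    M /= M.sum()   # normalise counts to densities
    R /= R.sum()
    return 1.0 - 0.5 * np.abs(M - R).sum()

# Two samples from the same distribution should overlap substantially;
# binning noise keeps the overlap below 1 even then.
rng = np.random.default_rng(0)
a = rng.normal([0.15, 0.08], 0.01, size=(5000, 2))
b = rng.normal([0.15, 0.08], 0.01, size=(5000, 2))
p_same = percentage_overlap(a, b)
```

Note that even two finite samples from the identical distribution score below 100% because of per-cell sampling noise, which is why the reference posterior itself sets a practical ceiling on the metric.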

Quantile-Quantile Plots

We plotted the quantiles of β and γ obtained by each method against the quantiles of β and γ of the reference. Quantile-quantile plots allow us to compare distributional features, such as the presence of outliers and shifts in location.
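As a sketch of the underlying computation, the quantiles compared in such a plot can be obtained as follows; the synthetic draws here stand in for a method's and the reference marginal posterior of β, and points on the identity line indicate matching distributions.

```python
import numpy as np

# Synthetic marginal posteriors for beta (illustrative values only).
rng = np.random.default_rng(1)
method_beta = rng.normal(0.15, 0.01, 5000)
reference_beta = rng.normal(0.15, 0.01, 5000)

# Matched quantiles at 99 probability levels; plotting q_method against
# q_reference gives the Q-Q plot.
probs = np.linspace(0.01, 0.99, 99)
q_method = np.quantile(method_beta, probs)
q_reference = np.quantile(reference_beta, probs)

# Largest vertical distance from the identity line.
max_gap = np.max(np.abs(q_method - q_reference))
```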

Basic Reproduction Number (R0) and Epidemic Doubling Time (DT)

We computed two derived quantities from the estimated parameters: the basic reproduction number, R0 = β/γ, and the epidemic doubling time, DT = ln(2)/(β − γ). For methods producing weighted posterior samples, we first drew a resample of size n using the sampling weights w, where n is the resample or posterior size and w is the sampling weight.
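A minimal sketch of these computations follows, assuming the standard SIR relations R0 = β/γ and DT = ln(2)/(β − γ) for early exponential growth (β > γ); the parameter values and weights are illustrative only.

```python
import math
import numpy as np

def derived_quantities(beta, gamma):
    """Basic reproduction number and epidemic doubling time for an SIR
    model with transmission rate beta and recovery rate gamma."""
    r0 = beta / gamma
    dt = math.log(2) / (beta - gamma)
    return r0, dt

# For methods that return weighted draws, resample with the sampling
# weights w before computing the derived quantities on each draw.
rng = np.random.default_rng(2)
draws = rng.uniform([0.12, 0.05], [0.20, 0.10], size=(1000, 2))  # (beta, gamma)
w = rng.random(1000)
w /= w.sum()                                   # normalise weights
idx = rng.choice(1000, size=1000, replace=True, p=w)
resampled = draws[idx]

r0, dt = derived_quantities(0.16, 0.08)        # e.g. R0 = 2
```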