Efficient inference for agent-based models of real-world phenomena

The modelling of many real-world problems relies on computationally heavy simulations. Since statistical inference rests on repeated simulations to sample the parameter space, the high computational expense of these simulations can become a stumbling block. In this paper, we compare two ways to mitigate this issue based on machine learning methods. One approach is to construct lightweight surrogate models to substitute for the simulations used in inference. Alternatively, one might circumvent the need for Bayesian sampling schemes altogether and directly estimate the posterior distribution. We focus on stochastic simulations that track autonomous agents and present two case studies of real-world applications: tumour growth and the spread of infectious diseases. We demonstrate that good accuracy in inference can be achieved with a relatively small number of simulations, making our machine learning approaches orders of magnitude faster than classical simulation-based methods that rely on sampling the parameter space. However, we find that while some methods generally produce more robust results than others, no algorithm offers a one-size-fits-all solution when attempting to infer model parameters from observations. Instead, one must choose the inference technique with the specific real-world application in mind. The stochastic nature of the considered real-world phenomena poses an additional challenge that can become insurmountable for some approaches. Overall, we find machine learning approaches that create direct inference machines to be promising for real-world applications. We present our findings as general guidelines for modelling practitioners.

Author summary

Computer simulations play a vital role in modern science, as they are commonly used to compare theory with observations. One can thus infer the properties of an observed system by comparing the data to the predicted behaviour in different scenarios.
Each of these scenarios corresponds to a simulation with slightly different settings. However, since real-world problems are highly complex, the simulations often require extensive computational resources, making direct comparisons with data challenging, if not impossible. It is, therefore, necessary to resort to inference methods that mitigate this issue, but it is not clear-cut which path to choose for any specific research problem. In this paper, we provide general guidelines for how to make this choice. We do so by studying examples from oncology and epidemiology and by taking advantage of developments in machine learning. More specifically, we focus on simulations that track the behaviour of autonomous agents, such as single cells or individuals. We show that the best way forward is problem-dependent, and we highlight the methods that yield the most robust results across the different case studies. We demonstrate that these methods are highly promising and produce reliable results in a small fraction of the time required by classic approaches that rely on comparisons between data and individual simulations. Rather than relying on a single inference technique, we recommend employing several methods and selecting the most reliable based on predetermined criteria.

Mathematical and computational modelling opens up and sheds light on a wide variety of research questions in fields ranging from hydrodynamics to oncology [1][2][3][4]. Such models have thus become crucial to advances in nearly all research areas and for addressing a variety of real-world problems. However, in many cases, detailed mechanistic mathematical models of nature are notoriously complex and computationally cumbersome. When faced with such models, direct statistical analyses can become insurmountable due to the associated computational cost. In this paper, we address ways to mitigate this drawback.

We take a look at a specific subset of computationally heavy simulations, known as agent-based models (ABMs) [5]. Such models keep track of autonomous agents - be they individuals, cells, or particles - that follow a set of rules prescribing their behaviour relevant to the particular phenomenon being modelled. Importantly, these rules are oftentimes stochastic and involve both interactions among the agents and between the agents and their environment. If a simulation is repeated, its output therefore changes in complex ways. Before one can compare the predictions of such ABMs to observations, one must gauge this stochasticity, which might consume considerable computational resources. Even without this additional complication, proper (Bayesian) inference requires a large number of simulations to ensure a robust exploration of the model's parameter space, which could be large [6]. However, ABMs for real-world applications often require many agents or complex interactions, making simulations computationally demanding. Due to the high computational cost, a direct route for simulation-based inference, such as Approximate Bayesian computation (ABC) methods [7,8], might thus be impracticable.

There are different problem-dependent ways to deal with the computational challenges that complex inference problems pose.
Some models lend themselves to statistical techniques that greatly reduce the number of required simulations [9]. In other cases, one might resort to interpolation in an existing archive of models or to parameterizations of emergent properties [10,11]. In recent years, it has become increasingly popular to train machine learning (ML) methods to tackle the issue [12,13]. We follow this latter approach here.

(Figure caption fragment: panels A-E contain the total duration of the epidemic, the total death toll, the highest number of infected individuals during the outbreak, the time at which the peak in infections occurs, and the highest number of recovered agents during the outbreak, respectively.)

The NN closely recovers the correct mean, median, and width of the marginal distributions obtained from simulations and generally yields the lowest Wasserstein distance. In contrast, the MDN is generally biased, overestimating the true mean, while the GP results in distributions that are much too broad.

For our epidemiological SIRDS model, on the other hand, the MDN outperforms the NN for sufficiently large training sets. The GP still yields the worst performance across all measures.

As can be seen from both figures, the accuracy and precision of the different ML approaches generally improve with increasing size of the training set. However, the GP cannot cope with the kernel matrix for the largest training set. This is not to say that this is a general flaw of Gaussian processes [28] but rather goes to show that the cost of handling the kernel matrix becomes prohibitive for large training sets.

Comparing inference techniques

When comparing the inference results to the ground truth, we include four metrics: We compare the mean of the distribution to the true parameters and compute both the relative and absolute error. Moreover, we consider the standard deviation of the posterior and the probability density (q) attributed to the true value (see further details on the metrics used in the method section).
As mentioned above, the NN and GP result in the best and worst performance, respectively. For the cancer CA model, we find that the higher accuracy and precision of the NN emulator are reflected in the performance metrics for the inference results. We illustrate this result in Fig 3. Whether we use the ABC or MCMC algorithm for inference, the NN leads to a narrower distribution, whose mean recovers the ground truth more closely than the GP does.

(Figure caption fragment: The second row shows the corresponding relative error in the prediction. The third row summarizes the standard deviation of the marginals, while the fourth row shows the negative logarithm of the probability density (q) attributed to the true parameter value. The plot includes the results from the emulation-based approaches (emu.) as well as our direct inference machines (inf.). We include two machine learning approaches: a neural network (NN) and Gaussian processes (GP). In connection with the emulators, we distinguish between results obtained when using rejection ABC and MCMC. For each approach, the label specifies the size of the training set: for the emulators, we consistently used 10^4 simulations.)

We include direct inference machines based on different sizes of the training set. Two features stand out: First, the direct inference machines perform as well as or better than the emulation-based approaches, even with a low number of simulations in the training set. Indeed, some measures do not visibly improve even when the model is confronted with significantly more training data. Secondly, the dissonance in -log q(θ_o) between the NN and GP is more prominent for the epidemiological SIRDS ABM than in the case of the cancer CA model.

To quantify the motivation for using the different inference techniques, we summarize the CPU time requirements for each method in Fig 5. Of course, the computation time is greatly model-dependent. Our intention is thus merely to underline a few key features.
As can be seen from the figure, the computation of the actual simulations by far dominates the overall cost.

(Figure caption fragment: The light magenta bars show the total time required for each inference technique. The remaining three bars specify the time required for the simulations involved (red), the time required for training and validation (yellow), and the time needed to infer the parameters for a single set of observations (blue). The number of required simulations is specified in the labels. We include the NN direct inference machine based on three different sizes of the training set. Moreover, we include the NN emulator in tandem with the rejection ABC and MCMC algorithms. The ABC accepts the best 10^4 among 10^7 randomly drawn samples. The ensemble MCMC uses 8 walkers, drawing 20,000 samples for each walker. For the MCMC, we average over the CPU times for cases a and b. For comparison, we include an estimate of the time that would be required for running the ABC with 10^7 samples using the actual simulations. Note that the ordinate is logarithmic. Note also that the time needed to run any given simulation depends on the number of agents and hence the model parameters.)

In this paper, we have compared several inference techniques. These methods fall into two main categories: emulators and direct inference machines. We find that the performance of the emulators and direct inference machines is not consistent across the two ABMs but rather depends on the real-world application in question as well as on the size of the training set and thus on the available computational resources. This leads us to our first guideline:

• There is no one-size-fits-all solution for simulation-based inference.

The performance of any given algorithm is problem-dependent. Since there are a priori no clear-cut preferences, we recommend confronting data with several techniques and selecting the best method based on suitable metrics. We hence do not present an absolute ranking of the methods per se but rather caution against solely relying on a specific method or implementation when facing a new research problem. We do, however, note that the GP performed worse than the NN and MDN methods. So, one may need to be careful when relying on Gaussian processes for emulation [30,31]. This being said, other implementations of Gaussian processes might be able to address the shortcomings of the implementation used here [32,33].

The notion that different methods may work better for different problems is consistent with the findings by Lueckmann et al. [13], who study a range of benchmark problems to provide guidelines for so-called sequential inference techniques. In a nutshell, sequential techniques explore the parameter space in an iterative fashion.

Rather than relying on a static grid for training, they hand-pick those points in the parameter space that closely match the specific observation whose properties we aim to infer. Sequential techniques are hereby able to reduce the number of required simulations by orders of magnitude. Implementations for both emulation and direct inference exist [34,35]. The catch is that new simulations are required for every new set of observations. With 250 test cases for each ABM in our analysis, we thus cannot reap the benefits of these algorithms as a substitute for our inference machines. However, sequential techniques have proven very powerful in fields such as cosmology [36]. After all, there is only one universe.

When comparing our results to those by Lueckmann et al., it is also worth noting that we use an alternative application of neural networks for emulation and direct inference in this paper: Rather than estimating the likelihood or posterior densities, we use a neural network with dropout that provides an approximate Bayesian formulation and mimics a Gaussian process [16,17,37,38].

Yet another distinct aspect of our study is that we keep the limitations of the real-world ABM applications in mind: When dealing with real-world data, one must often resort to simplifying assumptions regarding the true distribution of the data, due to the nature of the experiments. Not only does this dictate the type of data that we can gather, but it also imposes constraints on the summary statistics and the metrics that we can draw on (cf. the methods section). Our paper outlines how one might go about selecting different methods and performance measures based on the research problem, using ABMs as a case study.

We by no means claim that our list of emulators and direct inference machines is exhaustive. Other viable approaches exist, with scope for future developments. For example, recent work on generating differential equations from ABMs [39] might be extended to stochastic settings to generate new emulators. However, it is beyond the scope of the present paper to explore all possible paths. Rather, we intend to present an illustrative selection. We also note that not all ML methods can readily address the stochastic nature of the data, at least not in their standard form.

Acquiring error bars might oftentimes require elaborate extensions (see e.g. the paper by Meinshausen et al. [40] for a discussion of regression random forests [41]).

For our ABMs, we find that the direct inference machines provide accurate and precise results even with a very limited training set. Since computation time is one of the most prominent obstacles when dealing with ABMs for real-world applications, this feature favours direct inference algorithms. Thus,

• we generally recommend looking for techniques that minimize the number of required simulations, such as direct inference machines.

This is not to say that direct inference machines will be the optimal choice for all ABMs. The present success of the direct inference machines might reflect aspects, such as the MCMC. However, we note that the presented direct inference machines (as well as the emulation-based ABC or MCMC) can produce multi-modal posteriors.

As regards the statistical inference techniques, we find that the MCMC does not always yield robust results. We attribute this to the stochastic nature of the ABMs.

While the rejection ABC consistently yields worse results than the direct inference machines in terms of the employed metrics, it does not suffer from this shortcoming.

• Due to the stochastic nature of the agent-based models, some statistical inference techniques do not yield robust constraints for emulation-based approaches.

The statistical algorithms explored in this paper, i.e. rejection ABC and MCMC, are by no means exhaustive. Other viable alternatives, such as sequential Monte Carlo (SMC) [42], exist. The aforementioned sequential techniques discussed by Lueckmann et al. offer another alternative when combined with the emulators. However, we also note that the stochastic nature and intractable likelihood of the explored systems render some approaches impractical. For instance, while Hamiltonian Monte Carlo (HMC) [43] is generally a demonstrably powerful technique [44], it is not suited for our purposes, as we cannot reliably compute derivatives in the explored parameter space.

By addressing and comparing multiple widely used algorithms, we hope that the present paper assists researchers in selecting appropriate inference techniques for real-world applications. While we have focused on ABMs here as one class of real-world models, we believe our results to be relevant to other stochastic complex systems.

State-of-the-art numerical models are often computationally heavy, and we hope that our guidelines will prove useful in this broader context.

In this paper, we use two agent-based models (ABMs) describing two distinct real-world problems: Our first model deals with a malignant type of brain cancer called glioblastoma multiforme. The second model describes the spread of infectious diseases in a population. The appendices Brain tumour, CA model and Epidemic, SIRDS model provide detailed accounts of both ABMs. Since these models are solely to be understood as examples that we have used to benchmark different inference methods, we limit ourselves to remarking on a few key aspects of the models here. In both models, the agents live in a two-dimensional plane, and the dynamics of the system are governed by a set of stochastic rules that dictate the behaviour of the individual agents.

Brain tumour, CA model

Our agent-based brain cancer model is a so-called cellular automaton (CA) model, in which each agent represents a tumour cell. This ABM has four input parameters. Our first two parameters, P_div and P_cd, are probabilities that are associated with the rules for cell division and spontaneous cell death. The third parameter, λ_C, determines the nutrient consumption of the cells and entails the possibility of cell death due to nutrient deprivation. This parameter is a rate given in units of nutrient consumption per cell per time step.

To explain the fourth parameter, we need to mention that brain tumours are composed of three main cell types [45]. In the following, we refer to these as

In this paper, we perform inference based on synthetic data, i.e. mock observations. To do so, we need to consider what data would be available in a real-world scenario. In brain cancer research, data often comprise detailed snapshots of tumour growth.

As regards the summary statistics, we again resort to the emergent properties of the system. Rather than relying on a single detailed snapshot, epidemiologists often have comprehensive knowledge of D(t), I(t), and R(t). Here, we boil these time series down to five numbers: the total duration of the epidemic, the total death toll, the highest number of infected individuals during the outbreak, the time at which the peak in infections occurs, and the highest number of recovered agents during the outbreak (cf. the appendix Parameters and summary statistics).

We stress that both our cancer CA and our epidemiological SIRDS simulations are rather simplistic. However, they are sufficiently realistic to capture the main aspects of the real-world applications, including the stochastic nature of the modelled events. This feature is the most essential for our purposes. Rather than providing detailed models for specific biological systems, we investigate how to infer properties from observations based on such simulations. By keeping the models simple, we lower the computational cost, making the analysis more practical and allowing us to readily explore different aspects.
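As an illustration, the five summary statistics could be extracted from the I(t), D(t), and R(t) time series along the following lines. This is a minimal sketch; the function name and the toy data are our own and not taken from the paper.

```python
import numpy as np

def sirds_summary(t, I, D, R):
    """Reduce SIRDS time series to the five summary statistics named in
    the text: epidemic duration, death toll, peak number of infected,
    time of the infection peak, and maximum number of recovered agents."""
    active = np.nonzero(I > 0)[0]             # time steps with active infections
    duration = t[active[-1]] - t[active[0]]   # total duration of the epidemic
    death_toll = D[-1]                        # D(t) is cumulative
    peak_infected = I.max()
    t_peak = t[np.argmax(I)]
    max_recovered = R.max()
    return (int(duration), int(death_toll), int(peak_infected),
            int(t_peak), int(max_recovered))

# Toy outbreak lasting a handful of time steps
t = np.arange(6)
I = np.array([1, 3, 5, 2, 1, 0])
D = np.array([0, 0, 1, 2, 2, 2])
R = np.array([0, 1, 2, 5, 6, 7])
print(sirds_summary(t, I, D, R))  # (4, 2, 5, 2, 7)
```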

To perform inference, we construct a set of grids in the input parameter spaces with a limited number of simulations. We then use these to train the machine learning (ML) techniques discussed in the sections Emulators and Direct inference below.

We impose uniform priors on all parameters. Our parameter spaces are thus four- and five-dimensional hypercubes for the cancer CA and epidemiological SIRDS models, respectively. We could cover the parameter spaces by sampling points at regular intervals. However, this approach is rather inefficient. Alternatively, we could generate points randomly. However, this approach leads to unintended clustering. We therefore follow the suggestion by Bellinger et al. [46] and generate points based on a Sobol sequence [47]. Doing so, we sample the hypercube uniformly while minimizing redundant information, as discussed by Bellinger et al. [46].

We pick the model parameter values for our mock observations using tighter priors than those presented in Table 2. The benefits of this approach are twofold. First, we avoid drawing simulations at the edges of the regions covered by the grids. Secondly, by choosing different priors, we select points that differ from those that enter the training data, although we use a Sobol sequence to sample the parameter space.
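A Sobol grid of this kind can be generated with scipy's quasi-Monte Carlo module. The sketch below assumes a four-parameter model with hypothetical prior bounds; it is not the paper's actual grid.

```python
import numpy as np
from scipy.stats import qmc

# Draw 2^10 quasi-random points in the unit hypercube and rescale them
# to (illustrative) prior bounds for a four-parameter model.
sampler = qmc.Sobol(d=4, scramble=True, seed=0)
unit_points = sampler.random_base2(m=10)      # 1024 points in [0, 1)^4
lower = np.array([0.0, 0.0, 0.1, 0.0])        # hypothetical lower bounds
upper = np.array([1.0, 0.5, 2.0, 1.0])        # hypothetical upper bounds
grid = qmc.scale(unit_points, lower, upper)   # shape (1024, 4)
```

Drawing a power of two of points (`random_base2`) preserves the balance properties of the Sobol sequence, which is why it is preferable to an arbitrary sample size.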

In the section Results, we infer the properties of our mock observations using various techniques and compare the results with the ground truth. This comparison allows us to assess the performance of the different inference algorithms. In particular, when one or more of the output variables consistently takes on a fixed value, the defined distance measure and likelihood functions run into trouble, since we end up dividing by zero. One such scenario is the case where the parameter values ensure that all tumour cells consistently die off before t = 100. Another example is the case where the disease is never passed on. To mitigate this problem, one might discard all mock observations that produce delta functions as the marginals for one or more output variables of the ABMs - in the real world, such scenarios would not be studied, anyway. Here, we apply a more conservative criterion, discarding all mock observations for which the width of the 68 per cent confidence interval is zero. With this selection criterion, we are left with 219 and 227 points in the parameter space for the cancer CA and epidemiological SIRDS models, respectively.

Emulators

Since the ABMs in this paper are driven by stochastic events, several simulations with the same input parameters will produce a distribution for each output variable of the ABM rather than a unique deterministic result. The emulator must capture this behaviour. Based on this notion, we follow three different approaches: We emulate the simulation output using a deep neural network (NN), a mixture density network (MDN), and Gaussian processes (GP). We train and validate the emulators based on precomputed grids of models (cf. the section Model grid).

Neural networks

As our first approach, we construct a broad, deep neural network (NN) with 3 hidden layers that contain 100 neurons each.
To avoid overfitting, we include early stopping as well as dropout during the training phase [48]. The latter implies that we randomly select 20 per cent of the neurons in each layer and temporarily mask these neurons during each pass through the NN [48]. To recover the stochastic nature of the ABMs, we maintain dropout during the inference phase.

As our second approach, we employ a mixture density network (MDN) that fits a mixture of multivariate normal distributions to the output of the ABMs [20,21]. Within the framework of the MDN, it is straightforward to mimic the stochasticity of the underlying ABM: During inference, we simply sample from the constructed mixture model.
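The Monte Carlo dropout idea described above, keeping dropout active at prediction time and collecting repeated stochastic forward passes, can be sketched with plain numpy. The weights below are random stand-ins for a trained network, and the architecture is deliberately reduced to a single hidden layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in weights for a tiny one-hidden-layer network (the paper's NN
# has 3 hidden layers of 100 neurons; these weights are untrained).
W1, b1 = rng.normal(size=(4, 100)), np.zeros(100)
W2, b2 = rng.normal(size=(100, 5)), np.zeros(5)

def forward_with_dropout(x, p=0.2):
    """One stochastic forward pass with dropout kept ON, as during
    MC-dropout inference. p is the fraction of masked neurons."""
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU hidden layer
    mask = rng.random(h.shape) >= p   # randomly mask 20% of the neurons
    h = h * mask / (1.0 - p)          # inverted dropout scaling
    return h @ W2 + b2

x = rng.normal(size=4)                # one input parameter vector
samples = np.array([forward_with_dropout(x) for _ in range(1000)])
mean, std = samples.mean(axis=0), samples.std(axis=0)  # predictive spread
```

The spread of the 1000 passes plays the role of the emulator's predictive distribution for a single parameter combination.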

The fit is accomplished by training a neural network to predict the means and full covariance matrices for each of these normal distributions. We use three components for each mixture model. For a five-dimensional space of output variables from the ABM, the neural network is thus optimized to predict 63 variables: 15 means, 15 diagonal and 30 unique off-diagonal elements of the covariance matrices, and 3 component weights. To ensure that the covariance matrices are positive semi-definite, we use exponential and linear activation functions for the diagonal and off-diagonal elements, respectively [36]. To guarantee that the component weights are positive and sum to unity, they are meanwhile passed through a softmax activation function. Regarding the loss function, we maximize the log-likelihood that the mixture model attributes to the training data from the grid discussed in the section Model grid. As in the section Neural networks, we employ early stopping and dropout. Our implementation builds on TensorFlow [49].

Gaussian Processes

Finally, as our third approach, we describe the output of the ABM as a Gaussian process (GP) [18,50]. For a comprehensive overview of GPs, we refer the reader to the work by Rasmussen et al. [19].

We use a linear combination of a radial basis function (RBF) kernel and a white noise kernel to specify the covariance of the prior function. We do so to properly account for the spread in the underlying output from the ABM [51]. Just as in the case of the MDN, the stochasticity of the ABM can be emulated during inference by sampling from the obtained GP.

Here, we set the mean squared error in the prediction as the loss function, which amounts to the simplifying assumption that the output variables of the ABM are uncorrelated. We use the Python implementation of Gaussian process regression from scikit-learn [52].
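A GP emulator with this kernel choice can be set up in a few lines of scikit-learn. The training data below are a synthetic stand-in for ABM output, not the paper's grids; the kernel hyperparameter values are only initial guesses that the fit refines.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)

# Toy training data: a noisy scalar output over a 2D parameter space,
# mimicking the intrinsic spread of a stochastic simulation.
X = rng.uniform(size=(200, 2))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)

# RBF kernel plus a white-noise kernel to absorb the intrinsic spread.
kernel = RBF(length_scale=0.5) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# Predictive mean and standard deviation at new parameter combinations;
# sampling from this Gaussian mimics the stochasticity of the simulation.
mean, std = gp.predict(rng.uniform(size=(5, 2)), return_std=True)
```

The white-noise term is what lets the GP attribute part of the scatter to the simulation itself rather than forcing the mean function to interpolate every noisy training point.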
We note that the employed implementation produces a uni-modal distribution for a unique set of input parameter values. The same holds true for the NN. In contrast, the MDN is able to produce multi-modal outputs when given a single input vector.
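The MDN output dimensionality quoted in the text (63 variables for a three-component mixture over five output variables) follows from a simple count, which the hypothetical helper below makes explicit.

```python
def mdn_output_size(n_dim, n_components):
    """Number of network outputs for a full-covariance Gaussian mixture:
    means, diagonal entries, unique off-diagonal entries, and weights."""
    means = n_components * n_dim
    diag = n_components * n_dim
    off_diag = n_components * n_dim * (n_dim - 1) // 2
    weights = n_components
    return means + diag + off_diag + weights

print(mdn_output_size(n_dim=5, n_components=3))  # 63, as quoted in the text
```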

Having constructed an emulator, we can resort to standard sampling techniques to map the posterior probability distributions of the model parameters (Θ) of the ABM given a set of observations (x). In our case, these observations constitute the mock observations created by the ABM as discussed in the section Mock observations.

In this paper, we use the implementation by Lintusaari et al. [56] (elfi). As regards the prior, we use uniform priors corresponding to the region covered by the grids discussed in the section Model grid. As regards δ, we keep the best 10^4 of 10^7 samples for each combination of parameters for our mock observations.

The last thing that we need to specify for the ABC is the distance measure. To motivate our choice, we need to consider the nature of the real-world applications that our ABMs strive to model. At best, a targeted laboratory study of glioblastoma or disease transmission may involve a handful of animals, such as mice or ferrets [57,58]. It can, therefore, be notoriously complicated to assess the stochasticity that underlies the data. The same holds true when dealing with patient-specific data or data from the outbreak of an epidemic [59].

For instance, consider an experiment in which we implant human tumour cells in a single mouse to study tumour growth. Based on this time series alone, we cannot hope to rigorously constrain predictions for the outcome when repeating this experiment.

Even given a handful of such experiments, we might be hard-pressed to go beyond simplifying assumptions, such as the notion that the observations will be normally distributed. Indeed, one might often not have enough information to go beyond measures, such as the median, the mean, the standard deviation, the standard deviation of the mean, the mean absolute error and/or the median absolute error. With this in mind, we settle for a distance measure (d) of the form

d(Θ) = sqrt( Σ_i [ (x_sim,i − f_i(Θ))² / σ_sim,i² ] ),    (1)

where the sum runs over all five observed output variables of the ABM, x_sim,i is the median of the marginal distribution for the ith summary statistic in the mock observations, and σ_sim,i denotes the standard deviation of the synthetic data. The distance is computed for each prediction made by the emulator, denoted by f(Θ).

We note that x_sim,i and σ_sim,i do not fully capture the distribution of the mock observations. In other words, by employing Equation (1), we assert the notion that real-world data do not reveal the same level of detail regarding the stochasticity of the studied events as our mock observations do.

Alternatively, one might introduce a surrogate likelihood. This approach is, indeed, commonly used across different fields of research [60,61].
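The rejection step built on this distance measure can be sketched as follows. The emulator and the observation are faked with random draws; in the paper, the best 10^4 of 10^7 prior draws are kept, whereas the sketch uses the same 0.1 per cent acceptance fraction on a smaller sample.

```python
import numpy as np

rng = np.random.default_rng(2)

def distance(pred, x_obs, sigma_obs):
    """Distance between emulator predictions and the observed medians,
    weighted by the observed standard deviations."""
    return np.sqrt((((pred - x_obs) / sigma_obs) ** 2).sum(axis=-1))

# Hypothetical stand-ins for the emulator and the mock observation.
n_draws = 100_000
theta = rng.uniform(size=(n_draws, 4))  # samples from the uniform prior
pred = rng.normal(size=(n_draws, 5))    # emulator output f(theta), faked here
x_obs = np.zeros(5)                     # observed medians
sigma_obs = np.ones(5)                  # observed standard deviations

# Rejection ABC: keep the draws with the smallest distance (best 0.1%).
d = distance(pred, x_obs, sigma_obs)
keep = np.argsort(d)[: n_draws // 1000]
posterior_samples = theta[keep]
```

The accepted `theta` values approximate draws from the posterior; their marginals can then be compared with the ground truth.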

Following arguments similar to those that lead to the distance measure in Equation (1), we arrive at a surrogate likelihood p(Θ|x) of the form

ln p(Θ|x) = −(1/2) Σ_i (x_sim,i − f_i(Θ))² / σ_sim,i² + const.,    (2)

where x_sim,i and σ_sim,i again denote the median and standard deviation of the marginal of the ith output variable in the mock observations, respectively.

However, the use of Equation (2) leads to convergence problems, due to the stochastic nature of the ABMs. Alternatively, one might directly compare the predicted median (M) with that of the observations. To account for the stochastic behaviour of the ABMs, we include the standard deviation (σ_pred) of the prediction in the denominator:

ln p(Θ|x) = −(1/2) Σ_i (x_sim,i − M_i(Θ))² / (σ_sim,i² + σ_pred,i²) + const.    (3)

We refer to the surrogate likelihoods in Equations (2) and (3) as case a and case b, respectively.
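The two surrogate log-likelihoods can be written down compactly. This sketch reflects our reading of the text: in case b, the observed and predicted variances are added in the denominator; the function names are our own.

```python
import numpy as np

def log_like_a(pred, x_obs, sigma_obs):
    """Case a: compare a single stochastic emulator prediction with the
    observed medians, weighted by the observed scatter."""
    return -0.5 * (((pred - x_obs) / sigma_obs) ** 2).sum()

def log_like_b(pred_samples, x_obs, sigma_obs):
    """Case b: compare the median of repeated emulator predictions with
    the observed medians, adding the predictive spread (sigma_pred)
    to the denominator."""
    med = np.median(pred_samples, axis=0)
    sigma_pred = np.std(pred_samples, axis=0)
    return -0.5 * ((med - x_obs) ** 2 / (sigma_obs ** 2 + sigma_pred ** 2)).sum()
```

In case a, each MCMC evaluation uses one stochastic emulator draw, which is what drives the convergence problems mentioned above; case b averages that stochasticity out before evaluating the likelihood.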

Having defined a surrogate likelihood, we can employ standard sampling techniques, such as Markov chain Monte Carlo (MCMC) [6,62]. The advantage of this approach over that of rejection ABC is that we can sample more efficiently from the prior.

However, as discussed in the section Results and mentioned above, the stochasticity of the emulator, partly reflecting the stochasticity of the ABM, might compromise convergence.

In this paper, we use the ensemble sampler published by Foreman-Mackey et al. [63], based on the procedure suggested by Goodman & Weare [64]. We thus evolve an ensemble of N walkers in parallel, where N is twice the number of dimensions of the mapped input parameter space of the ABM. For each combination of parameters in our mock observations, we exclude a burn-in phase containing 5000 samples per walker. We then collect 20,000 additional samples per walker. To assess the robustness of the MCMC, we estimate the integrated autocorrelation length using an autoregressive moving-average model. In short, the autocorrelation length denotes the number of samples needed for a walker to become oblivious to its initial conditions. For further details, we refer the interested reader to the paper by Foreman-Mackey et al. [63].
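The core of the Goodman & Weare sampler is the stretch move. The following is a serial, self-contained sketch (the published implementation updates the walkers in two parallel half-ensembles); the Gaussian target is a stand-in for the surrogate likelihood.

```python
import numpy as np

rng = np.random.default_rng(3)

def log_prob(theta):
    """Stand-in for the surrogate log-likelihood; a standard Gaussian."""
    return -0.5 * (theta ** 2).sum()

def stretch_move(walkers, logp, a=2.0):
    """One sweep of the Goodman & Weare (2010) stretch move."""
    n, d = walkers.shape
    for k in range(n):
        j = rng.choice([i for i in range(n) if i != k])  # a different walker
        z = (1 + (a - 1) * rng.random()) ** 2 / a        # z ~ g(z) on [1/a, a]
        proposal = walkers[j] + z * (walkers[k] - walkers[j])
        log_accept = (d - 1) * np.log(z) + log_prob(proposal) - logp[k]
        if np.log(rng.random()) < log_accept:
            walkers[k], logp[k] = proposal, log_prob(proposal)
    return walkers, logp

# Eight walkers for a four-dimensional parameter space (2x the dimension).
walkers = rng.normal(size=(8, 4))
logp = np.array([log_prob(w) for w in walkers])
for _ in range(100):
    walkers, logp = stretch_move(walkers, logp)
```

The move is affine-invariant: each walker is stretched along the line through a randomly chosen companion, so the sampler adapts automatically to strongly correlated posteriors.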

Direct inference

The procedures presented in the section Emulators require two separate steps: First, we train and validate an emulator. Then, we perform Bayesian inference. Alternatively, we can construct an ML algorithm that directly predicts Θ given points from the output variable space of the ABMs. In the following, we will refer to such an algorithm as a direct inference machine.

Direct inference machines circumvent subsequent Bayesian sampling schemes, such as ABC or MCMC [12,13]. Once direct inference machines are trained, they are thus much faster to employ than emulators. We quantify this statement in Fig 5. However, direct inference machines have their own shortcomings. Rather than resorting to a one-size-fits-all approach, one must hence choose between different algorithms based on the inference problem that one faces.

Indeed, both emulators and direct inference machines are very popular in different research fields [13,21,46].

In this paper, we compare the results obtained using the emulators with those obtained using two direct inference machines. As our first direct inference technique, we employ a broad, deep neural network. Apart from the fact that the input and output spaces are swapped, the architecture of this neural network matches that described in the section Neural networks above. As our second approach, we train a Gaussian process based on the settings summarized in the section Gaussian Processes.
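The second approach can be sketched as follows, using scikit-learn's Gaussian process regressor as a stand-in for the paper's actual settings. The training data, the kernel, and the mapping from parameters to observables are all hypothetical; the point is only that the regression runs "backwards" relative to an emulator, from observables to parameters.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Hypothetical training set: each row of X holds summary statistics produced
# by the ABM; y holds the parameter that generated them. Inputs and outputs
# are swapped relative to an emulator: observables in, parameters out.
theta = rng.uniform(0.0, 1.0, size=(200, 1))           # model parameter
X = np.hstack([theta ** 2, np.sin(3.0 * theta)])       # stand-in "observables"
X += 0.01 * rng.normal(size=X.shape)                   # simulation noise

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-4)
gp.fit(X, theta.ravel())

# Inference: feed an "observation" in, get a parameter estimate out directly,
# with no Bayesian sampling scheme in between.
obs = np.array([[0.25, np.sin(1.5)]])                  # corresponds to theta = 0.5
theta_hat = gp.predict(obs)
```

A neural network can be slotted into the same role; only the regressor changes.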

Before commencing, we still need to address one question: How do we establish credibility intervals for the direct inference machine? Consider the case where we repeatedly generate samples using one of our direct inference machines based on the same fixed value for each of the input variables. The chosen direct inference machine would then yield a distribution that reflects the stochastic behaviour of the training data and the limited knowledge of the ML algorithm. However, following this approach, we do not include the uncertainty of the observations. Remember that the input variables of the direct inference machines are the observations, and our mock observations are distributions rather than single values. We should thus not run the direct inference machine repeatedly using the same set of input variables. Rather, to include the observational errors, we need to generate samples from a distribution of input parameters that corresponds to the distribution of the observations [46]. As discussed in the section Rejection ABC, we assume that the marginal distributions of the observed quantities are approximated by independent Gaussian distributions.
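The sampling of input values from the observational distribution can be sketched in a few lines. The medians and standard deviations below are hypothetical placeholders for real observational constraints.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observational summary: medians and standard deviations of the
# marginal distributions of three observed quantities.
obs_median = np.array([0.8, 12.5, 3.1])
obs_std = np.array([0.05, 1.0, 0.4])

# Diagonal covariance: the marginals are assumed independent (cf. Rejection ABC).
cov = np.diag(obs_std ** 2)
inputs = rng.multivariate_normal(obs_median, cov, size=10000)

# Each row is fed to the direct inference machine; the resulting 10,000
# predicted parameter vectors then approximate the posterior sample.
# predictions = direct_inference_machine.predict(inputs)  # hypothetical model
```

The spread of the resulting predictions then reflects both the observational errors and the stochasticity of the trained machine.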

Following the outline by Bellinger et al. [46], we thus sample 10,000 combinations of input parameters for the direct inference machines using a multivariate Gaussian distribution with a diagonal covariance matrix matching the observational constraints. The means of this multivariate distribution are set to the medians of the observations, while the variance reflects the standard deviations of the marginals.

• […] respectively. Any deviation of the mean of this measure from zero would reveal a bias. Since we cover a broad region of the parameter space for each ABM, we cite this measure in units of the standard deviation of the marginal obtained from the simulations (σ_sim).

• As our second measure, we include the relative absolute error in the predicted median, i.e. |M_pred − M_sim|/M_sim. Like the first measure, this measure quantifies the accuracy of the emulator.

• Thirdly, we compute the ratio between the predicted standard deviation of the marginal distribution and the standard deviation obtained from simulations (σ_pred/σ_sim). This measure should be close to one. Otherwise, the emulator over- or underestimates the uncertainties of the predictions.

• Finally, we compute the Wasserstein distance, which quantifies the discrepancy between probability distributions [65]. The lower this measure is, the more similar the predicted distribution by the emulator is to the ground truth.

We note that the metrics chosen for the output of the emulators reflect our assumptions about the nature of the data that we would obtain in a real-world scenario. As elaborated upon in the section Rejection ABC, we thus assume the observations to be approximated by a multivariate Gaussian distribution with a diagonal covariance matrix. Although we deem the listed measures to be the most suitable, it is still worth stressing that generalisations of the measures above as well as alternative measures exist [13].
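For one output quantity, the listed measures can be computed as below. The two sample sets are synthetic stand-ins for a simulated marginal and an emulated marginal; scipy's one-dimensional Wasserstein distance is used for the final measure.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)

# Synthetic stand-ins: samples from the simulation ("ground truth") marginal
# and from the emulator's predicted marginal for one output quantity.
sim = rng.normal(loc=5.0, scale=1.0, size=5000)
pred = rng.normal(loc=5.1, scale=1.1, size=5000)

# Bias in units of the simulated standard deviation (sigma_sim).
bias_in_sigma = (pred.mean() - sim.mean()) / sim.std()

# Relative absolute error in the predicted median, |M_pred - M_sim| / M_sim.
rel_median_err = abs(np.median(pred) - np.median(sim)) / abs(np.median(sim))

# Ratio of standard deviations (sigma_pred / sigma_sim); should be close to one.
std_ratio = pred.std() / sim.std()

# Wasserstein distance between the two marginal distributions.
w_dist = wasserstein_distance(pred, sim)
```

Each measure is evaluated per output quantity, consistent with the assumption of a diagonal covariance matrix.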

Meanwhile, the assumption that the observations are normally distributed does not affect the metrics with which we assess the inference results, i.e. the obtained posterior distribution for the model parameters. The marginal posterior distributions might very well be skewed or multi-modal, and the metrics should be able to handle this. We turn to these metrics below.

As regards the ABC runs, the MCMC runs, and the direct inference machines, we can compare the obtained distributions to the original input parameter values (Θ_o) of the mock observations. To assess the performance of the algorithms, we thus compute both the residuals and the relative difference between the mean of the predicted marginal […] techniques. By comparing the measures across the different algorithms, we can put the performance of the individual algorithms into perspective.

As an additional comparison, one might propose to infer the model parameters using the actual simulations in tandem with the ABC or MCMC algorithms. We could then use the obtained distributions as benchmarks. However, since we are dealing with complex ABMs, any analysis involving the actual simulations becomes a herculean task that is beyond the scope of our computational resources. After all, this stumbling block was the reason for turning to emulators and direct inference machines in the first place.

[…]

Here, D denotes the diffusion coefficient for the nutrient flow, λ_C scales the nutrient consumption of the cells, and ρ denotes the tumour cell concentration, i.e. we assume that all cell types consume the same. Finally, S signifies any nutrient sources, i.e. the aforementioned inflow of nutrients in the lower left-hand corner. Note that Equation (4) is solved on a Cartesian grid. At each time step, interpolation is hence required between this grid and the irregular lattice on which the cells live. We illustrate the described […]

[Figure caption:] This grid is highlighted in cyan outside the tumour. The cells live on an adaptive irregular grid that can, at most, harbour 40,000 cells. This adaptive grid is highlighted in cyan inside the tumour. In the proliferating rim, the irregular grid has a high resolution. In the bulk tumour, several cells share one lattice site. Single GSCs (purple), GPPs (red), and GDSs (yellow) are shown using markers whose colours match those in the left panel. Dead tumour cells, or clusters containing only dead cells, are marked with black dots. Clusters with quiescent cells are blue. The presence of a nutrient source in the lower-left corner and the imposed cell death upon starvation introduce an asymmetric growth pattern.
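A single explicit time step of such a nutrient equation on a Cartesian grid can be sketched as follows. Equation (4) is not reproduced in this excerpt, so the right-hand side below (dn/dt = D ∇²n − λ_C ρ n + S) is an assumed form consistent with the terms described in the text; the grid size, coefficients, and periodic boundaries are illustrative choices only.

```python
import numpy as np

# One explicit Euler step of an assumed reaction-diffusion equation:
#   dn/dt = D * laplacian(n) - lambda_C * rho * n + S
# where n is the nutrient concentration, rho the cell concentration,
# and S the source term. Periodic boundaries (np.roll) for brevity.
def diffusion_step(n, rho, S, D=1.0, lam_C=0.1, dx=1.0, dt=0.1):
    lap = (np.roll(n, 1, axis=0) + np.roll(n, -1, axis=0)
           + np.roll(n, 1, axis=1) + np.roll(n, -1, axis=1) - 4.0 * n) / dx**2
    return n + dt * (D * lap - lam_C * rho * n + S)

# 50x50 grid with a nutrient source in the lower-left corner, as in the text.
n = np.zeros((50, 50))
rho = np.ones((50, 50))            # uniform cell concentration for the sketch
S = np.zeros((50, 50))
S[0, 0] = 1.0
for _ in range(100):
    n = diffusion_step(n, rho, S)
```

In the actual model, n would additionally be interpolated onto the irregular cell lattice at every step.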
We assume that a sufficiently low nutrient concentration will inflict immediate cell death [81]. Such low concentrations might occur at the centre of the tumour, leading to a necrotic core. This morphology is observed for real tumours.

For glioblastoma multiforme, the migration of tumour cells is known to primarily occur along white matter tracks and blood vessels [82–84]. One might take this behaviour into account by letting the stochastic rules that govern cell actions vary across the lattice. In addition, one could allow for mutations that alter these rules.

However, for our purposes, simple rules for the cell dynamics suffice. Our main focus […]

[…] exponential distribution with mean t_p. The agent will then re-enter the population of susceptible individuals.

We stress that our epidemiological SIRDS model is rather simplistic. For instance, factors such as an individual's socioeconomic group, sex, or age might play a role in mortality and infection rates. Many mathematical models thus build on census data to draw a more realistic picture [92,93]. We do not take such information into account.

However, the considered level of complexity is sufficient for our purposes. Rather than modelling a specific disease, our focus lies on the stochasticity that all epidemiological agent-based models share, irrespective of their level of detail.