Probabilistic forecasting in infectious disease epidemiology: The thirteenth Armitage lecture

Routine surveillance of notifiable infectious diseases gives rise to daily or weekly counts of reported cases stratified by region and age group. From a public health perspective, forecasts of infectious disease spread are of central importance. We argue that such forecasts need to properly incorporate the attached uncertainty, so should be probabilistic in nature. However, forecasts also need to take into account temporal dependencies inherent to communicable diseases, spatial dynamics through human travel, and social contact patterns between age groups. We describe a multivariate time series model for weekly surveillance counts on norovirus gastroenteritis from the 12 city districts of Berlin, in six age groups, from week 2011/27 to week 2015/26. The following year (2015/27 to 2016/26) is used to assess the quality of the predictions. Probabilistic forecasts of the total number of cases can be derived through Monte Carlo simulation, but first and second moments are also available analytically. Final size forecasts as well as multivariate forecasts of the total number of cases by age group, by district, and by week are compared across different models of varying complexity. This leads to a more general discussion of issues regarding modelling, prediction and evaluation of public health surveillance data.


Introduction
Forecasting is one of the key goals of infectious disease epidemiology, and mathematical and statistical models play a central role in this task. For example, Keeling and Rohani [1] (Section 1.5) write that "models have two distinct roles, prediction and understanding. Prediction is the most obvious use of models." However, most of traditional infectious disease epidemiology is concerned with "understanding", and less so with forecasting.
Indeed, the World Health Organization (WHO) concluded in 2014 that "forecasting disease outbreaks is still in its infancy, however, unlike weather forecasting, where substantial progress has been made in recent years." [2]. As a consequence, WHO has recently organized an informal consultation with more than 130 global experts entitled "Anticipating Emerging Infectious Disease Epidemics" in order "to define the elements within which epidemics of the future will occur" [3]. Likewise, the United States Federal Government has recently sponsored two prediction competitions on dengue fever in Puerto Rico and influenza-like illness in the United States † .
There have been various attempts to predict future infectious disease trends with mathematical or statistical models. For example, there has been success in reproducing the spread of the 2002/2003 SARS epidemic based on global flight travel data [4]. Another commonly considered problem is the prediction of influenza epidemics [5,6,7,8,9,10]. However, these approaches often concentrate on data available as univariate time series. This is often of limited value since infectious diseases may spread differently in subgroups of the population considered, for example in different age groups. Strong spatio-temporal dynamics are also very common. A multivariate view is therefore required to predict incidence in different regions and age groups and this is the scenario we are considering in this article. Of course, other stratification levels such as gender may also be relevant in other applications.
The key requirements for stratified forecasting of infectious disease incidence are threefold: First, suitable multivariate models have to be developed for stratified time series of surveillance counts. This includes spatio-temporal models [11,12,13,14] and models that reflect contact patterns between age groups [15,16]. Secondly, focus should be on probabilistic forecasts, rather than on deterministic point predictions [17]. Probabilistic forecasting is the standard in many other areas of science, including economics and weather forecasting. The inherent difficulty in forecasting epidemics [18] makes the use of probabilistic forecasts in infectious disease epidemiology even more necessary to properly reflect forecasting uncertainty. The natural way to obtain probabilistic predictions is through the use of statistical models. An important aspect of the focus on predictions is the emphasis on the primacy of observables and the notion of a model as a (probabilistic) prediction device for such observables. Our models may therefore not perfectly represent (individual-based) disease transmission, but may still be useful for prediction of suitably aggregated surveillance data.
Thirdly, the quality of the forecasts has to be assessed with appropriate predictive model criteria. As in weather forecasting, we use proper scoring rules to assess the quality of probabilistic (count) predictions. Propriety ensures that both calibration and sharpness are addressed. Calibration is defined as the statistical consistency of the probabilistic forecasts and the observations. Sharpness refers to the concentration of the predictive distributions. Thus, the goal is to "maximize sharpness subject to calibration" [19]. Calibration of univariate forecasts can be visually assessed with probability integral transform (PIT) histograms for count data [20]. Calibration tests for count data can also be employed [21,22,23]. Related methods have been proposed to assess the quality of multivariate probabilistic forecasts [24,25,26].
With this paper we aim to review and address the above issues. In our application we focus on norovirus gastroenteritis (described in Section 2), an endemic infectious disease where the amount of historical data makes it possible to predict future disease spread. Although we apply our methods for predictive validation in infectious disease epidemiology exclusively in the hhh4 [27] modelling framework (described in Section 3) using the R package surveillance [28], the proposed methodology for assessing probabilistic forecasts (described in Section 4) can also be applied to other modelling approaches. Section 5 presents results for the norovirus surveillance data and Section 6 concludes with some discussion.

Norovirus gastroenteritis surveillance data
Norovirus gastroenteritis is characterized by a "sudden onset of vomiting, diarrhea and abdominal cramps lasting 2-3 days" [29]. Its generation time is similar to seasonal influenza (3-4 days). The disease is highly contagious, transmitted directly from person to person, but also indirectly via contaminated surfaces or food. No vaccination is available. Weekly (lab-confirmed) counts have been downloaded from https://survstat.rki.de (Figure 1). Our training dataidentical to the one analysed in Meyer and Held (2016) [16] -are stratified into 6 commonly used age groups (0-4, 5-14, 15-24, 25-44, 45-64, and 65+ years) and 12 city districts (of Berlin), and covers the period from week 2011/27 to week 2015/26. The following year (2015/27 to 2016/26) has been used to assess model predictions (test data).

Endemic-epidemic models for infectious disease counts
We start with a modelling framework for spatio-temporal infectious disease counts in Section 3.1. This is extended in Section 3.2 to include social contact patterns between age groups.

Space-time model
In a series of papers [27,30,31,12], a modelling framework for multivariate surveillance count time series of infectious diseases has been proposed. The formulation is built upon an additive decomposition of disease incidence into an endemic and an epidemic component. The endemic component may represent seasonal and climatic variation, heterogeneity in population numbers and other socio-demographic characteristics. The epidemic component describes the force of previously infected individuals and thus spatio-temporal interaction [12]. Let Y rt denote disease counts in region r and week t. Given the counts Y r,t−1 , r = 1, . . . , R, in the previous week, Y rt is assumed to be negative binomial distributed, i. e. Y rt ∼ NBin(µ rt , ψ), with mean µ rt = ν rt e r + φ rt r w r r Y r ,t−1 (1) and overdispersion parameter ψ, so Var(Y rt ) = µ rt (1 + ψ µ rt ). Here ν rt is an unknown endemic log-linear predictor and e r are known population fractions of the different districts. The epidemic log-linear predictor φ rt describes the force of infection from time t − 1 to time t where w r r denote normalized weights for r to r transmission from Y r ,t−1 to Y rt , i. e. r w r r = 1 [12].
Spatio-temporal modelling has always been an important feature of infectious disease epidemiology. For example, Keeling and Rohani [1] dedicate a whole section to spatial dispersal of infections. An important discovery was that shorttime travel behaviour follows approximately a power law with respect to distance [32]. Specifically, the relative frequency Pr(d) of the distance d traversed by ∼500 000 dollar bills within 4 days in the U.S. has been shown to follow a power law: 59 . An interesting feature of the power law is that it is scale-free, i. e. the power parameter (here −1.59) does not depend on the unit in which the distances d are measured. A power law for areal data has also been proposed [12,16] and is based on the adjacency order o r r between regions r and r as distance measure: Here adjacent regions r and r have order o r r = 1, regions where we need to traverse one other region are of order 2 and higher orders are defined accordingly. The power parameter ρ is treated as unknown and estimated from the data.
In our application we model ν rt and φ rt in (1) as with region-specific effects α (ν) r and α (φ) r in both components, a Christmas break indicator x t (to account for reduced reporting and school closure in calendar weeks 52 and 1) in the endemic component, and sinusoidal log-rates with frequency ω = 2π/52 in both components [31]. An alternative model replaces α (φ) r by α (φ) + τ log(e r ), so includes a parametric "gravity model" [33,11] as described in the next section in more detail.
A multivariate branching process formulation [27] is useful to derive the (time-dependent) epidemic proportion λ t as a function of the model parameters. We omit details here but will report the average epidemic proportion λ t , which describes the average proportion of the incidence that can be explained with the epidemic component.

Age-structured spatio-temporal model
Consider now counts Y grt stratified by age group g, region r and week t. Again we use a negative binomial likelihood with mean µ grt = ν grt e gr + φ grt g ,r c g g w r r Y g ,r ,t−1 , now with age group-specific overdispersion parameter ψ g . The unknown log-linear predictors ν grt and φ grt , and the known population fractions e gr also become age group-specific. There are now two sets of weights: c g g quantifies transmission from age group g to age group g, while w r r remain spatial weights for r to r transmission. The product c g g w r r is normalized such that g,r c g g w r r = 1.
We now specify the model as c g g = (C κ ) g g (8) w r r = (o r r + 1) −ρ (9) with age group and region-specific effects α g and α r in both components (ν and φ). Again we include an indicator x t for the Christmas break and let seasonal terms enter both components, where we now assume age group-specific seasonality with coefficients γ (ν) g and δ (ν) g in the endemic component. Note that the formulation (7) for φ grt extends the one used in Meyer and Held (2016) [16], since we here also allow for seasonal variation in the epidemic component [31]. The gravity model feature φ grt ∝ e τ gr causes the amount of disease transmission to scale with the population size e gr of the "importing" age group g in district r -with unknown coefficient τ .
To specify the weights c g g , we estimate a contact matrix C for Germany from recorded contacts of individuals participating in the POLYMOD study [34]. Note that social contacts are reciprocal at the population level, i. e. there is an equal number of contacts between age group g and g as between age group g and g, which has been taken into account appropriately [35]. We finally specify a power transformation for the contact matrix [16], where Λ is the diagonal matrix of eigenvalues and E is the corresponding matrix of eigenvectors of C.
To investigate the relevance of the various model features, we consider four different models of increasing complexity: Model 1 is based on equation (6) with φ grt ≡ 0 in (5), so only includes the endemic component. Model 2 extends this with an autoregression on the number of cases Y g,r,t−1 in the same age group and region, i. e. (5) reduces to µ grt = ν grt e gr + φ grt Y g,r,t−1 with φ grt as in (7), but without the gravity model component τ log(e gr ). Model 3 allows for spatial dispersal according to (9) and includes the gravity model in (7), but does not include contact data, so c g g = 1 for g = g, otherwise c g g = 0. Finally, Model 4 is based on the full formulation (5)-(9), so allows for dispersal across space and across age groups.

Predictive model assessment
We validate the models based on probabilistic one-step-ahead and long-term predictions. The one-step-ahead predictions are all negative binomial distributed, so the predictive mass probability function is known in closed form. Note that the underlying parameters change from prediction to prediction, as we always refit the model to all available data prior to the counts to be predicted. A simpler approach would always use the parameter estimates based on the original training data.
The model formulation implies that one-step-ahead predictions in different age groups and different regions are conditionally independent. Long-term predictions in time, sometimes called path forecasts, are generated through Monte Carlo simulation by sequentially simulating from one-step-ahead predictions. Note that our time series model is multivariate (across regions and age groups) and thus the path forecasts are also multivariate. Alternatively, the first two moments of multivariate path forecasts can be computed analytically in the model class considered, see Appendix A for details.
Predictions are computed for all combinations of 6 age groups × 12 districts × 52 weeks and have been suitably aggregated across districts, across age groups, or across time, if required. Similarly, the first two moments of aggregated counts can be calculated analytically from the moments of the original path forecasts.

One-step-ahead forecasts
Scoring rules (also called scoring functions) are the key measures for the evaluation of probabilistic forecasts. Scoring rules assign a numerical score based on the predictive density f (y) for the unknown quantity and on the true value y obs , that has later materialised. They are typically negatively oriented, i. e. the smaller, the better. They are called proper, if they do not provide any incentive to the forecaster to digress from her true belief, and strictly proper if any such digress results in a penalty, i. e. the forecaster is encouraged to quote her true belief rather than any other predictive distribution. Note that in the literature inappropriate scoring methods are still often used, e. g. correlation coefficients between point predictions and observations [8,10]. It is well known in the medical literature that high correlation does not necessarily imply good agreement [36] and therefore a very poor forecasting method may have high correlation. Furthermore, usage of metrics incorporating the whole probabilistic forecast (rather than using only point predictions) is still rare [6].
For ease of presentation we drop the indices g, r and t in this section and denote by Y the predictive distribution which we compare with the actual observation y obs . The logarithmic score [37] is strictly proper and defined as i. e. minus the log predictive density at the observed value. A normal approximation gives the Dawid-Sebastiani score [38]. A strictly proper alternative is the ranked probability score [19], which can be written for count data as the sum of the Brier scores for binary predictions at all possible thresholds k ∈ {0, 1, . . .} [20]. An equivalent definition is here Y and Y are independent realisations from f (y) [19]. For Poisson and negative binomial predictive distributions, both scores (LogS and RPS) can be computed directly in the R package surveillance [28]. In order to develop a calibration test for one-step-ahead count forecasts [23], we compute the average score RPS across all regions, age groups and one-step-ahead predictions and compare it to its expected value E 0 (RPS) under the null hypothesis of "forecast validity" [22], sometimes also called "perfect calibration". The difference RPS − E 0 (RPS) is standardized using the variance Var 0 (RPS). Note that our model implies that the one-stepahead predictions in different age groups and different districts are (conditionally) independent, so under H 0 independence holds for all components of RPS for fixed t. Furthermore, proofs of independence are available for the sequence of scores across time [21,22], so in the computation of Var 0 (RPS) we can simply assume independence of all components of RPS. Finally, a z-statistic can be computed where the sign of z indicates if the observations are over-/underdispersed relative to the predictions (+/− sign of z) [23,39]. A (two-sided) P -value can be computed to quantify the evidence for miscalibration. The probability integral transform (PIT) F (y obs ) is often used to visually assess calibration, here F (y) denotes the predictive CDF. Under H 0 , the probability integral transforms of a sequence of absolutely continuous one-step-ahead forecasts are independent and uniformly distributed [40]. Here we use a modification of the PIT histogram for count data [20]. We can also apply the Diebold-Mariano Test for equal predictive performance [41].
For negative binomial predictions, sharpness can be quantified with the estimated overdispersion parameter ψ, which may be age-dependent. Easier to interpret is the variance-to-mean ratio (VMR) 1 + ψ µ, but this requires an estimate of the predictive mean µ, where we simply average (in each age group for age-dependent overdispersion parameters) the predictive means across weeks and districts.

Multivariate path forecasts
Our target quantity to assess the long-term forecasts are the number of reported cases during the test period, suitably aggregated to stratification levels of interest. In the simplest case we aggregate over week, age group and region and predict the final size, i. e. the total number of reported cases during the whole year. We also predict the final size in different age groups and the final size in different regions. Finally, we predict the so-called epidemic curve, i. e. the total number of cases for each of the 52 weeks of the test data (aggregated over age group and region), given the training data.
Predictions of aggregated counts enable us to compare our models to simpler models that are applied to suitably aggregated data. For example, we could just fit a univariate negative binomial time series model to the counts Y t aggregated over age group and district to predict the total number of cases per week and to obtain final size predictions (model 5): Likewise, aggregating the data over regions yields a model for Y gt ∼ NBin(µ gt , ψ g ) to additionally obtain final size predictions in the different age groups (model 6): Finally, two models for regional counts Y rt ∼ NBin(µ rt , ψ) (aggregated over age group) have been applied. Both are based on the decomposition (1) with power law (2) and endemic model (3). Model 7 uses the epidemic model (4) with region-specific intercept α (φ) r . Model 8 is more parsimonious and replaces α (φ) r by α (φ) + τ log(e r ), so includes a gravity model component instead. Note that the more general formulation α (φ) r + τ log(e r ) is not identifiable. One possibility for a proper scoring rule is the multivariate Dawid-Sebastiani score [38] mDSS(Y , y obs ) = log|Σ| + (y obs − µ) Σ −1 (y obs − µ), that depends only on the mean vector µ and the covariance matrix Σ of the predictive distribution. The first term in (11) involves the determinant |Σ| of the covariance matrix Σ, here Σ is a d × d matrix. Transformed to this is known as the determinant sharpness (DS) and recommended as a multivariate measure of sharpness [24], with smaller values corresponding to sharper predictions. In higher dimensions it is useful to report the log determinant sharpness log DS = log|Σ|/(2d) to avoid unnecessarily large numbers. For presentational reasons we also report the Dawid-Sebastiani score (11) in a scaled version and divide it by 2d. Finally, we compute an approximate P -value for forecast validity based on the mDSS [26]. We note that this method assumes approximate normality of the observed data, which may be questionable if the observed counts are very low. However, evaluation of (11) and (12) based on Monte Carlo estimates of the first two moments is not recommended, since the determinant |Σ| is known to be very sensitive to Monte Carlo sampling error [25]. Fortunately we can evaluate this score exactly since we have analytical results for recursive computation of µ and Σ in our modelling framework, see Appendix A for details.
In order to take the full predictive distribution into account, the multivariate log-score LogS(Y , y obs ) = − log f (y obs ) will be very difficult to compute based on samples from the predictive distribution f (y). Specifically, if there is no sample exactly equal to the observed vector y obs , the empirical estimate of f (y obs ) will be zero and the log-score will be infinite. Instead we use a generalization of the ranked probability score (10), the so-called energy score [19] ES(Y , y obs ) = E ||Y − y obs || − for multivariate forecasts Y , here ||.|| denotes the Euclidean norm. The energy score reduces to the Euclidean error ||µ − y obs || for a deterministic point mass forecasts at µ and can be reported in the same unit as the observations. It has been noted [25,42] that the ES discriminates well between forecasts with different means or variances, but less so for forecasts with different correlation structures. Monte Carlo estimation of (13) is possible if samples from the predictive distribution are available [24]. These samples can also be used to derive the distribution of ES under the null hypothesis of forecast validity: Suppose we have n sim samples y (1) , . . ., y (n sim ) from the predictive distribution f (y). We can then generate n sim samples from the distribution of ES(Y , y obs ) under H 0 by computing the Monte Carlo approximation of (13) based on the realisation y obs = y (i) , i = 1, . . . , n sim and approximate the distribution of Y using the remaining n sim − 1 samples from f (y). Computation of a (one-sided) Monte Carlo P -value is then possible based on the energy score of the actual observation y obs . Table 1. AIC improves for each of the features added from model 1 to model 4, where the largest impact is due to the account for autoregression (1 → 2). Likewise, the average epidemic proportion λ t increases with increasing model complexity, so in model 4, 70% of the incidence can be explained by the epidemic component. There is strong evidence that the epidemic part scales with the population size of the "importing" districts and age groups with the coefficient τ estimated as 0.62 (95% CI: 0.27 to 0.97) in model 3 and 0.82 (95% CI: 0.49 to 1.14) in model 4. The power parameter ρ of the spatial power law (9) is estimated to be 2.21 (95% CI: 1.92 to 2.55) in model 3 and 2.30 (95% CI: 2.01 to 2.63) in model 4. Thus, the inclusion of social mixing between age groups in model 4 results in a slightly larger coefficient, reducing the range of spatial dispersal. Note that the results of model 4 differ slightly from the estimates reported in Meyer and Held [16] due to the additional seasonal variation in the epidemic component here. This adds two parameters but improves AIC by 18.3 points. The estimate of the power coefficient κ for the contact matrix C is 0.41 (95% CI: 0.29 to 0.60). For comparison, in model 6, the estimated exponent is nearly identical (κ = 0.40) but has more uncertainty (95% CI: 0.23 to 0.67).  Figure 2 gives the variance-to-mean ratios of the one-step-ahead forecasts in the different age groups and models. Generally the VMR decreases with increasing model complexity. The predictions become sharper in age groups 65+ and 45-64 when we move from model 1 to 2, 3, and finally to model 4. In age group 00-04 we also see sharper predictions in model 2-4 compared to model 1.

Assessing one-step-ahead forecasts
The mean RPS with z-value and P -value is shown in Table 2. The PIT histograms [20] based on all 6 × 12 × 52 = 3744 one-step-ahead forecasts are shown in   (Figure 3) with more PITs close to one than we would expect under forecast validity. This indicates potential problems of the negative binomial distribution in the more complex model formulations.
The corresponding results for age groups 00-04 and 65+ are also shown in Table 2. In age group 00-04, there is evidence for underdispersed predictions with values of the z-statistic larger than 2 for all four models and an even more pronounced peak of the PIT histogram close to one than for the overall set of forecasts (compare the middle panel in Figure 5). However, we see some evidence for overdispersed predictions in age group 65+ for model 1 with a negative value of the z-statistic, which fits the corresponding pattern of the PIT histogram shown in the right panel of Figure 5. Table 3. P -values from the Diebold-Mariano Test for the pairwise comparison of RPS across the different models.  Table 3. We see very strong evidence (p < 0.0001) for differences in predictive performance between model 1 and the other three models but no evidence for differences in predictive performance between models 2, 3 and 4.

Assessing long-term forecasts
Final size predictions are shown in Figure 4. All models tend to overpredict the total number of cases with varying uncertainty. In the worst cases (models 1, 2 and 8), the observed number of cases is not or only poorly supported by the predictions. The best model (in terms of RPS) is model 4, the most complex formulation, followed by model 6. This underlines the importance of using an age-structured modelling approach supported by social contact data. Figure 5 (left panel) displays the probabilistic forecasts for the total number of cases in the different age groups for models 1-4 and 6. While all models predict the number of cases reasonably well in the lower five age groups, there are remarkable differences for the last age group 65+. Model 1 overpredicts the number of cases in age group 65+ considerably, while models 2-4 give similar point predictions, but with increasing uncertainty. As a consequence, the observed number of cases in age group 65+ is well supported by the predictive distributions of models 3, 4 and 6, less so for model 2, and not at all for model 1.
The middle and right panels of Figure 5 give PIT histograms of the one-step-ahead forecasts for the five models for age group 00-04 and 65+, respectively. The PIT histograms for the other age groups are all very close to uniformity. The PIT histograms for age group 00-04 are very similar for models 1-4 with a pronounced peak at values close to 1. This indicates a tendency to underpredict relatively large observed values. A similar pattern can be seen for model 6, where the PIT histogram is based on a smaller number of observations. The PIT histogram for age group 65+ and model 1 indicates a bias of the estimates with a tendency to predict larger values than observed. This results in overprediction of the aggregated counts shown in the left panel. In contrast, the PIT histogram of model 6 is inconspicuous. The other three models compensate the bump of the PIT histogram around 0.2 with another peak at values of PIT close to 1. This seems to lead to greater uncertainty of the final size predictions as shown in the left panel. Note that the calibration tests based on the mean RPS of the one-step-ahead forecasts ( Table 2) give evidence for miscalibrated forecasts in age group 00-04 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q  for all the models, but no such evidence for age group 65+. However, the negative value of the z-statistic of model 1 (with corresponding P -value of 0.067) already indicated a potential problem of this formulation. Table 4 gives values of the log determinant sharpness for the different models and the different stratification levels. The first column gives values for total final size, where we observe for models 1 to 4 a tendency for larger values of logDS, i. e. more dispersed predictions, with increasing complexity, see also Figure 4. However, this pattern is less clear for the models 5-8 that work on aggregated data. An interesting pattern appears for the predictions by age group, region, or week: With only one exception, the models that analyse the data in the finest resolution (1)(2)(3)(4) give sharper predictions than the models that analyse aggregate data (5)(6)(7)(8). In particular, the most sophisticated model with both a spatial power law and social contact data (model 4) gives the sharpest prediction for the epidemic curve. This can be explained by the fact that the predictions from this model have the largest autocorrelations among all models from week 20 onwards (and also quite large before week 20), see Figure 6 bottom. Note also that the logDS values by region roughly reflect the order of the correlations between regions (Figure 6 top right) with a tendency for larger values of logDS for smaller correlations. However, logDS is also a function of the marginal variances, so the agreement of logDS and the correlations is not perfect. Similarly, the correlations of the predictions between age groups, shown in the top left panel of Figure 6 for models 4 and 6, do not lead to smaller values of logDS by age group because of larger variances compared to models 1 and 2, as can be seen in the left panel of Figure 5.   Table 5 gives both the energy and the Dawid-Sebastiani score by different stratification levels. Likewise, Table 6 gives the corresponding P -values for forecasts validity. The two scores agree quite well with model 4 giving the best (or second best) predictions in total, by age group and by region. Somewhat surprisingly, model 4 gives quite poor predictions by week for the DSS (rank 7), but not for the ES (rank 2). This can be explained by the fact that the ES is known to be insensitive to misspecifications in the correlation structure of multivariate predictions. We have already noted that the autocorrelations of the model 4 predictions are quite large. However, the observed time series (see Figure 7), has an oscillating pattern around weeks 20 and 30. This discrepancy between observed and predicted correlation seems to be detected by the DSS, but not by the ES. This can also be seen from Table 6, where the test based on DSS gives strong evidence for miscalibration of the forecasts by week for all models except for models 5 and 6. Interestingly, these two models have the largest values of logDS, see Table 4, but are still best in terms of DSS, see Table 5. Larger uncertainty of the predictions (see Figure 7) combined with relatively small autocorrelations (see bottom panel of Figure 6) seems to make model 5 the best in terms of DSS. Note that the corresponding test based on the ES identifies only model 1 and 8 as miscalibrated (and model 2 to a lesser extent). For the other stratification levels the P -values based on the DSS tend to be larger than the corresponding ones based on the ES.

Discussion
In this paper we have described a multivariate model framework for infectious disease surveillance counts by borrowing strength from different regions and different age groups. This model framework provides probabilistic forecasts that are useful for epidemic forecasting. Specifically, means and covariance matrices are available analytically as well as Monte Carlo samples from the full predictive distribution. Predictive model assessment helps to identify poor predictive models. The application has shown the importance of predictive model assessment in different strata. We have emphasized the importance to use appropriate methodology for predictive model assessment, including proper scoring rules and tests for forecast validity.
Our study has shown that complex modelling on the original fine resolution generally leads to better predictions of future disease incidence, even of aggregated quantities. As a consequence, the most complex model 4 was nearly always the best in predictive performance. The only exception are predictions by week, where model 4 turned out to predict poorly in terms of DSS. We were able to explain this feature by the discrepency of strong predicted correlations on the one hand, but oscillating observed number of cases on the other hand. This model deficiency was not detected by the energy score, in accordance with similar observations made in the literature [25,42]. Consideration of models for aggregated data showed the importance to integrate contact pattern data in the analysis of age-stratified surveillance data (model 6), as this model was consistently among the top three models in terms of ES and DSS. The spatial dimension turned out to be less important, which may be different in applications with more districts.
Our forecasts were always based on one particular model. A possible extension of our approach would be to use a Bayesian model average framework to combine forecasts from different models into one averaged forecast. The model weights may even differ based on differing forecasts and could be based on AIC or BIC computed from the training data [43], if the models considered act on the same data resolution. Such model averages are known to have better forecast properties and have been successfully applied in weather forecasting [44].
To aggregate across strata (e. g. to break our results down to stratification only by region, only by age group etc.), we can use Cov(BY) = B Cov(Y)B where the matrix B is suitably defined.