Connecting surveillance and population-level influenza incidence

There is substantial interest in estimating and forecasting influenza incidence. Surveillance of influenza is challenging, both because influenza must be demarcated from other respiratory viruses and because many infections are asymptomatic. To circumvent these challenges, surveillance data often target influenza-like-illness, or use context-specific normalisations such as test positivity or per-consultation rates. Notably, influenza incidence itself is not reported. We propose a framework to estimate population-level influenza incidence, and its associated uncertainty, using surveillance data and hierarchical observation processes. This new framework, together with forecasting and forecast-assessment methods, is demonstrated for three Australian states over 2016 and 2017. The framework allows for comparison within and between seasons in which surveillance effort has varied. Implementing this framework would improve influenza surveillance and forecasting globally, and could be applied to other diseases for which surveillance is difficult.

Figure 1: Schematic illustration of the proposed framework for surveillance and forecasting, using peak week as an example. Given an observed influenza surveillance time series, the peak week is typically assumed to be a point estimate based on these data, as in (A); different ways of analysing these influenza surveillance data can lead to different point estimates of the peak week (B), for example based on test positivity or per-consultation rates. Instead, we propose it should be a probabilistic estimate as in (C), because the observed time series is a sample or an estimate of the population-level dynamics, with associated uncertainty. Taking the observation process into account, we can obtain probabilistic forecasts of influenza in the population (D), which can then be used to produce probabilistic forecasts of quantities such as the peak week (E). To assess these forecasts, we propose modified metrics that take into account that the true peak week has a distribution rather than a point estimate (F).

(GPs) distributed throughout Australia. The target coverage is approximately one GP per 200,000 people in metropolitan areas, and one GP per 50,000 people in regional areas. This database is continually updated and requires the voluntary participation of GPs, so there is variation between years and locations. All GPs in the database report the number of consultations they performed each week, along with every case of a notifiable disease that they observe, including influenza-like-illness (ILI). GPs are asked to take swab samples from a proportion of those patients presenting with ILI symptoms (20% in 2015-2017), and these samples are sent for virological (PCR) testing, resulting in respiratory virus virology data for each sample. In practice, not all GPs take samples, and the proportion of patients for which samples are taken varied substantially; however, the number of doctors participating, and the number of ILI patients observed and tested, are known.
The hierarchical observation process is adapted from Cope et al. [21], with key details re-stated here for clarity. We proceed by considering how an infected individual in the population may come to be observed in the dataset. Filtering occurs at three levels:

• Not all individuals infected with influenza will seek treatment, as cases vary in severity, including potentially being mild or asymptomatic. We assume that each infectious individual has some (unknown, independent and identically distributed) probability of seeking treatment.

• Only a small proportion of GPs are part of the ASPREN network, and this proportion has varied over time. We assume that if an infectious patient consults a GP, then that GP has some fixed probability of being part of the ASPREN network (estimated annually based on the number of doctors that participate).

• The proportion of ILI cases that are sent for testing varies substantially (see Figures S1-S3), due both to changes in ASPREN protocols and to variation in doctor behaviour. We assume that each infectious patient who attends an ASPREN GP has some probability (estimated from the data, fixed within each week but varying weekly) of being sent for testing.

To obtain population-level estimates from these data, we model this observation process: if Z_j is the number of individuals in the population seeking treatment for symptomatic influenza in week j (j ∈ {1, 2, . . . , 52}), then we observe n_j ∼ Bin(Z_j, p_j) confirmed cases, with p_j the normalization probability (i.e., the proportion of doctors in the state that are in the dataset, multiplied by the proportion of ILI cases those doctors sent for testing in that week). So, given n_j and p_j, the number of symptomatic individuals is Z_j ∼ NegBin(n_j + 1, p_j).
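This observation model can be sketched in code. A minimal example, assuming the NegBin(n_j + 1, p_j) estimate arises as the posterior for Z_j under a flat prior (so Z_j is the observed n_j plus a negative-binomially distributed number of unobserved cases); the counts and probabilities below are illustrative, not real ASPREN values:

```python
import numpy as np

def sample_weekly_incidence(n_j, p_j, draws=10_000, rng=None):
    """Draw samples of Z_j, the number of treatment-seeking cases in week j,
    given n_j confirmed cases and normalization probability p_j."""
    rng = np.random.default_rng() if rng is None else rng
    # numpy's negative_binomial returns the number of failures before
    # (n_j + 1) successes, so Z_j is the observed n_j plus those
    # unobserved (untested or unsampled) cases.
    return n_j + rng.negative_binomial(n_j + 1, p_j, size=draws)

# Illustrative week: 12 confirmed cases, with 10% of the state's doctors in
# the dataset and 20% of their ILI patients tested, giving p_j = 0.02.
samples = sample_weekly_incidence(n_j=12, p_j=0.1 * 0.2)
print(int(np.median(samples)))  # a probabilistic estimate, not a point value
```

Note that a small observed count at a small p_j implies a wide posterior for Z_j, which is exactly the surveillance uncertainty the framework propagates forward.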

Critically, this is a probabilistic estimate rather than a point estimate (e.g., Figure 2). Note that this is an estimate of the number of individuals who seek treatment with influenza; to obtain the total number of individuals with influenza we would also require an estimate of the probability of seeking treatment, which is not directly available from these data (though it can be estimated using a model).

In contrast to surveillance uncertainty, practitioners understand that forecast uncertainty is critical, and best practice includes reporting and assessing influenza forecast uncertainty [12, 22]. A consequence of overlooking uncertainty in surveillance data is that, particularly when data are highly filtered or stratified (resulting in substantial uncertainty), the absence of uncertainty in estimates flows through to an absence of uncertainty in subsequent analyses. For example, assessing forecasts against point-estimate surveillance data could result in erroneous assessments; rather, forecasts should be assessed against probabilistic estimates of the true epidemic behaviour. Incorrect forecast assessment could lead to choosing inferior forecasting methods to inform management decisions, which could in the worst case have detrimental public health impacts, for example, planning hospital bed availability to peak at the wrong time.
Figure 2: 2016 ASPREN influenza surveillance data from New South Wales, demonstrating how observed data-level peaks can be misleading. The top panel shows confirmed influenza (black) and ILI (purple), along with the number of samples tested each week (green). These data would suggest peak-week point estimates in the weeks ending 6 August and 20 August (dashed black lines). The second panel (grey) shows influenza test positivity, which peaks during the week ending 4 September (dotted grey line), and is highly variable outside of the season when few tests were performed. The third panel (yellow-blue colour gradient) shows estimates of population-level symptomatic influenza incidence each week, with uncertainty: black boxes indicate the median and a 50% credible interval, with the shading showing the full density. The lowest panel (red) shows the distribution of population-level peak week calculated using all data. This demonstrates that the peak week inferred directly from the surveillance-level data can be misleading, and that test positivity is also a flawed metric.

influenza incidence is highest. When the population-level influenza estimate for each week has a probability distribution, then the peak week must also have a distribution (i.e., the distribution of the maximum of Z_1, . . . , Z_52). This can be evaluated through simulation. Figure 2 shows the relationship between surveillance data, test positivity, and the population-level peak week distribution for data from New South Wales in 2016, demonstrating that considering the data directly, or relying on positivity alone, can be erroneous, and at the least fails to quantify the uncertainty inherent in estimates of peak week from surveillance data.
Producing population-level estimates of symptomatic influenza incidence (Figure 2c) combines the ILI data, the number of tests performed, and the results of those tests to quantify incidence and its uncertainty through the season; this is then used to produce a probabilistic estimate of the peak week (Figure 2d). After estimating this distribution for the true peak week, we can use it to assess forecasts, by generalizing metrics such as forecast skill to average over the distribution (see Section 4.6 for methods).
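The simulation of the peak-week distribution can be sketched as follows. This is a minimal illustration, assuming the NegBin(n_j + 1, p_j) posterior is drawn as n_j plus a negative-binomial count of unobserved cases; the weekly counts n and probability p are hypothetical, not real surveillance values:

```python
import numpy as np

rng = np.random.default_rng(1)
weeks = np.arange(1, 53)

# Hypothetical season: confirmed cases n_j peaking around week 32, and a
# constant normalization probability p_j (in practice p_j varies weekly).
n = (30 * np.exp(-0.5 * ((weeks - 32) / 4.0) ** 2)).astype(int)
p = np.full(52, 0.02)

reps = 10_000
# One simulated season per row: draw Z_j ~ n_j + NegBin(n_j + 1, p_j).
Z = n + rng.negative_binomial(n + 1, p, size=(reps, 52))
peak_weeks = weeks[np.argmax(Z, axis=1)]

# Collate into the distribution of the population-level peak week.
values, counts = np.unique(peak_weeks, return_counts=True)
peak_week_dist = dict(zip(values.tolist(), (counts / reps).tolist()))
print(max(peak_week_dist, key=peak_week_dist.get))  # most probable peak week
```

Even with a clear peak in the observed counts, the resulting peak-week distribution spreads across several weeks, because weekly incidence estimates are wide when p_j is small.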

To illustrate the link between influenza surveillance, forecasting, and forecast assessment, we apply this framework, explicitly linking surveillance data to population-level influenza (e.g., Figure 3). We extracted forecast distributions for peak week, which are assessed against probabilistic estimates of the true peak week at the population level (Figure 4). We observed that forecasting was possible when data were sufficient, such as in New South Wales (both years), and in 2017 for Queensland and South Australia; but when data were very sparse it was not possible to effectively produce or assess forecasts (e.g., for South Australia in 2016). We evaluated

Influenza surveillance and forecasting can provide important information for public health decision making.

Of primary interest is the incidence of influenza, specifically, within populations. Unfortunately, current approaches to surveillance and forecasting do not estimate this. Instead, they often work with influenza-like-illness (ILI), and operate at variable scales influenced by a myriad of other factors. In this study we demonstrate that analysis can be performed on influenza specifically, at the population level, by carefully considering the observation process and denominators that link population-level processes to observed data.

While we claim that, in most cases, the goal in influenza surveillance and forecasting should be to analyse population-level influenza incidence itself, in some cases there may be other targets of interest. For example, in some applications one may be interested in influenza-like-illness presentations at particular points of care, such as emergency departments. In this case, it would be reasonable to surveil, report, and forecast that quantity of interest specifically. We advocate for careful consideration of the quantity of interest, how it links to the data available, and direct evaluation of that target quantity and its uncertainty.

A challenge posed by this shift in framework is communicating these ideas to the public health community. We claim that operating at the population level, and with confirmed influenza specifically, is an intuitive goal, but acknowledge that changes in approach, and in particular introducing uncertainty around perceived "truth" values such as the peak week, will require a concerted effort and uptake by informed practitioners.

But this shift in framework, to surveilling and forecasting the quantities of interest, will be of substantial benefit to the public health community.

compare to, and (ii) metrics that assess forecasts against probability distributions depend on the probability distribution for the "truth", such that no forecast can ever reach a "perfect" score (e.g., a forecast skill of 1, or an average error of 0; except when the "truth" is a point estimate). There are substantial differences between the best possible scores in different states or years (Figure 5). For this reason, we suggest that metrics such as averaged forecast skill should be used as tools for comparison between methods (or over time), comparing in the same location against the same estimated truth distribution.

One tangential advantage of these metrics is that they allow one to retrospectively assess the value of the data being used for forecasting, by comparing against the scores that would be achieved by the truth distribution itself. We observe this in the current study (Figure 5), whereby the maximum possible forecast skill for peak week is substantially higher in 2017 than in 2016 in all locations, due to more data and clearer peaks. Maximum possible forecast skill for peak week was lowest in South Australia in 2016, where data were insufficient to perform effective forecasting and there was no clear peak in the data.

We do not expect the distribution of a forecast (p_w, the forecast probability that the peak is week w) to resemble the distribution of the truth (q_w, the probability that the peak is week w given the observed data), as more data becomes available). In contrast, the distribution of the truth treats each week as an independent sample of the unobserved population-level quantity.

Note that point estimates of disease incidence can artificially inflate forecast accuracy scores, because once, say, the peak week is observed, its value is known exactly. As such, if you are certain that you have observed the peak, your forecast can assign all probability to that week. This is why, for example, the forecasts assessed in Biggerstaff et al. [12] rapidly achieve skill scores of 1.0 after the peak has occurred.

One challenge of the forecasting approach we use is that producing forecasts of influenza requires forecasts of temperature. In this study, we used a simple statistical time-series approach; it may be that improved forecasts of temperature could be obtained with more complex methods, or by using externally-produced forecasts (e.g., from the Bureau of Meteorology; www.bom.gov.au). However, it is likely that stochasticity in both the disease and observation processes outweighs variation in temperature forecasts.

One limitation of this work is that it considers only a single influenza surveillance data stream. In practice there are many complementary surveillance systems, and while none assess population-level influenza incidence directly, they each provide useful information. Future work will develop a rigorous framework to assimilate a full range of disease surveillance data streams to produce a combined estimate of population-level influenza incidence.

Influenza data were obtained as part of the ASPREN project ([13]; https://aspren.dmac.adelaide.edu.au/). In this study, we used those data from three states: New South Wales, Queensland, and South Australia.

dynamics, we used the average daily temperature rather than the three-hourly observations themselves. We used fixed, total population sizes of 6.9 million, 4.3 million, and 1.6 million in each state, respectively.

Using surveillance data and the observation process, we can evaluate the distribution of population-level influenza incidence each week. We can compare these distributions across a year to produce a distributional estimate of the peak week. This proceeds as follows:

• For each week, given n_j observed influenza cases, with p_j the normalization probability (i.e., the proportion of doctors in the state that are in the dataset, multiplied by the proportion of ILI cases those doctors sent for testing in that week), generate Z_j ∼ NegBin(n_j + 1, p_j).

• Compare the generated Z_j to determine the peak week.

• Repeat this process many (e.g., 10,000) times and collate the results to evaluate the distribution of the population-level peak week.

To forecast influenza using a model that is driven by temperature, we were first required to forecast temperature.

2. Consider the data from the current year, for weeks 1, . . . , w. Use ABC to fit the model to these data.

The prior distribution for this process is a modified version of the final posterior distribution from Step (1), including initialising the unobserved model states at the beginning of this year as those from the posterior distribution at the end of the previous year.

Figure 6: Schematic describing the forecasting process for an example week in 2016. The black line is the observed data (dotted over the forecast period). ABC is performed on two preceding years of data (green region), and the posterior is used as a prior for the current season. ABC is then performed again within the current season, based on data observed to date (red), and a forecast is produced for the remainder of the season by progressing accepted particles forward through the rest of the season (the blue solid line is the median forecast; the shaded region is the 50% prediction interval). The population-level forecast is projected down to the data-level using the true normalization data observed in that week.

• To modify the prior so as to allow for between-season variation due to, e.g., differences in strains:

true population-level peak week. We report two accuracy metrics:

• Forecast skill score, i.e., the exponential of the log score averaged over the truth distribution: exp(Σ_w q_w ln(p_w)).

This is equivalent to the forecast skill score used elsewhere [12] when the truth is a point estimate (i.e., only has density in a single week).

• Average error in peak week predictions: Σ_{w1} q_{w1} Σ_{w2} p_{w2} |w2 − w1|. This is equivalent to the average (absolute) error if the truth is a point estimate. This is reported in the Supplemental Information (Figure S7).
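Both metrics can be computed directly from the forecast and truth distributions. A minimal sketch, assuming p and q are probability vectors over weeks (the small epsilon guarding against ln(0), and the illustrative five-week vectors, are not from the paper):

```python
import numpy as np

def forecast_skill(p, q, eps=1e-12):
    """Exponential of the log score averaged over the truth distribution:
    exp(sum_w q_w * ln(p_w))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.exp(np.sum(q * np.log(p + eps))))

def average_error(p, q):
    """sum_{w1} q_{w1} sum_{w2} p_{w2} |w2 - w1|: the absolute error in
    peak week, averaged over both the forecast and truth distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    w = np.arange(len(p))
    return float(q @ np.abs(w[:, None] - w[None, :]) @ p)

# With a spread-out truth, even forecasting the truth itself scores below a
# "perfect" skill of 1; with a point-estimate truth, skill reduces to p at
# the true week.
q = np.array([0.0, 0.25, 0.5, 0.25, 0.0])
print(forecast_skill(q, q))  # best achievable skill for this truth (< 1)
print(average_error(q, q))
```

This makes concrete why the best possible score depends on the estimated truth distribution: forecast_skill(q, q) is the ceiling for a given q, and is 1 only when q has all its density in a single week.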

There are other metrics that could reasonably be used (e.g., mean squared error, or Kullback–Leibler divergence); however, we chose to focus on these two as they are most analogous to those used in existing forecasting analyses.

Approximate Bayesian computation (ABC) [23, 24] is a likelihood-free Bayesian statistical method, which we use for parameter estimation as part of forecasting. We aim to estimate the (joint) posterior distribution for model parameters (e.g., rates of infection, recovery, waning immunity, and treatment seeking) in the stochastic epidemic model (Figure S6; following Cope et al. [21]). To do this we select parameter sets (particles) from the prior, simulate from the model using each particle, and accept the particle as part of the posterior if the resulting simulated realisation (i.e., a time series of weekly confirmed influenza counts at ASPREN doctors) is close to the observed data under a chosen metric. In this study, we chose a weighted squared error metric: with true_i the observed number of cases in the i-th week from the ASPREN dataset and candidate_i the number in the simulated realisation from the candidate particle. The weighting was chosen so as to weight weeks with higher influenza activity (within the season) more highly than those with little activity. When fitting the two preceding seasons to the season of interest, the threshold was set to 0.8 × D(0), where D(0) is the score that would be achieved by a season with no influenza activity. This was to ensure that the accepted realisations did actually resemble the data, while allowing for variation in activity between seasons. When fitting during the season, the threshold was chosen adaptively to ensure an acceptance proportion of 1%.

Prior distributions for the initial, two-year fitting period were chosen to be uninformative. Exponentially-distributed priors were used for the latent, infectious, and immune periods, so as to have unbounded positive support (with means of 1 day, 2 days, and 10,000 days, respectively). Transmission was temperature-dependent, with R_0 = R̄_0 + a(T − T̄); R̄_0 had a Gamma(2, 2.5) prior, and the seasonality parameter a was −Exp(0.2) distributed. The probability that an infected individual would seek treatment had a Uniform(0.1, 1.0) prior.
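The rejection-ABC step under these priors can be sketched as follows. The epidemic simulator is a stand-in for the stochastic model of Cope et al. [21]; `simulate_season`, the weighting, and the Gamma shape/scale convention are all assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng()

def sample_prior():
    """Draw one particle from the priors stated above (Gamma shape/scale
    convention is an assumption)."""
    return {
        "latent_period": rng.exponential(1.0),        # mean 1 day
        "infectious_period": rng.exponential(2.0),    # mean 2 days
        "immune_period": rng.exponential(10_000.0),   # mean 10,000 days
        "R0_bar": rng.gamma(2.0, 2.5),                # mean transmissibility
        "a": -rng.exponential(0.2),                   # seasonality (negative)
        "p_treat": rng.uniform(0.1, 1.0),             # treatment seeking
    }

def weighted_sq_error(candidate, observed, weights):
    """Weighted squared error between simulated and observed weekly counts;
    the paper's exact weighting is not reproduced here."""
    c, o = np.asarray(candidate, float), np.asarray(observed, float)
    return float(np.sum(weights * (c - o) ** 2))

def abc_reject(observed, simulate_season, threshold, n_particles=1000):
    """Accept particles whose simulated weekly confirmed-case counts fall
    within `threshold` of the observed data."""
    weights = 1.0 + np.asarray(observed, float)  # weight high-activity weeks more
    accepted = []
    while len(accepted) < n_particles:
        theta = sample_prior()
        sim = simulate_season(theta)
        if weighted_sq_error(sim, observed, weights) <= threshold:
            accepted.append(theta)
    return accepted

# Demo with a trivial stand-in simulator that ignores the particle; a real
# run would simulate the stochastic epidemic model with theta.
obs = [0.0, 5.0, 20.0, 5.0, 0.0]
accepted = abc_reject(obs, lambda theta: obs, threshold=0.0, n_particles=100)
```

In the two-phase scheme described above, the accepted particles from the two-year fit would then serve as the prior sample for the within-season fit.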

In the second ABC phase, fitting to the season of interest, the prior distribution was taken to be the posterior sample from the initial two-year fitting.

Given that the model treats influenza as a single disease rather than fitting strains separately, prior and

Note that while forecast error can theoretically reach 0, this requires ideal conditions that do not hold in practice. An illustrative baseline of achievable forecast error is that achieved by the observed distribution of peak week from the data, shown here as a dashed line.