Abstract
Real-time forecasts based on mathematical models have become increasingly important in guiding critical decision-making during infectious disease outbreaks. Yet epidemic forecasts are rarely evaluated during or after the event, and there is no established consensus on the best metrics for assessment. Here, we disentangle different components of forecasting ability by defining three metrics that assess the calibration, sharpness and unbiasedness of forecasts. We use this approach to analyse the performance of weekly district-level forecasts generated in real time during the 2013–16 Ebola epidemic in West Africa, which informed a range of public health decisions during the outbreak. We found that forecasting performance with respect to all three measures was good at short time horizons but deteriorated for longer-term forecasts. This suggests that the forecasts were informative only a few weeks ahead of time, reflecting the high level of uncertainty in the processes driving the trajectory of the epidemic. Comparing the semi-mechanistic model we used during the epidemic to two null models showed that our chosen approach performed best with respect to probabilistic calibration, but that its sharpness decreased more rapidly at longer forecasting horizons than that of the simpler models. As forecasts become a routine part of the public health toolkit, standards for evaluating performance will be important for assessing the quality and improving the credibility of mathematical models, and for elucidating the difficulties and trade-offs involved in making forecasts that are as useful and reliable as possible.