Forecasting COVID-19 cases at the Amazon region: a comparison of classical and machine learning models

BACKGROUND Since the first reports of COVID-19, decision-makers have been using traditional epidemiological models to predict the days to come. However, the growth of computational power, the demand for adaptable predictive frameworks, the short history of the disease, and uncertainties related to input data and prediction rules also make other classical and machine learning techniques viable options. OBJECTIVE This study investigates the efficiency of six models in forecasting COVID-19 confirmed cases 17 days ahead. We compare the models autoregressive integrated moving average (ARIMA), Holt-Winters, support vector regression (SVR), k-nearest neighbors regressor (KNN), random trees regressor (RTR), seasonal linear regression with change-points (Prophet), and simple logistic regression (SLR). MATERIAL AND METHODS We apply the models to data provided by the health surveillance secretary of Amapá, a Brazilian state carved entirely into the Amazon rainforest, which has been experiencing high infection rates. We evaluate the models according to their capacity to forecast in different historical scenarios of the COVID-19 progression, such as exponential increases, sudden decreases, and stability periods of daily cases. To do so, we use a rolling forward splitting approach for out-of-sample validation. We employ the metrics RMSE, R-squared, and sMAPE to evaluate the models across the cross-validation splits. FINDINGS All models outperform SLR, especially Holt-Winters, which performs satisfactorily in all scenarios. SVR and ARIMA perform better in isolated scenarios. To implement the comparisons, we have created a web application, which is available online. CONCLUSION This work represents an effort to assist the decision-makers of Amapá in future decisions, especially under scenarios of sudden variations in the number of confirmed cases in Amapá, which could be caused, for instance, by new contamination waves or vaccination. It is also an attempt to highlight alternative models that could be used in future epidemics.

All those numbers caught the attention of many researchers, who presented models to address the concerns of the Brazilian government and population, such as when the outbreak will peak, how long it will last, and how many people will be infected or die [4,5]. Many of those forecasting models rely on epidemiological approaches [6,7] or state-of-the-art artificial intelligence (AI) algorithms. Generally, researchers address their models to the country as a unit or to highly populated areas, mainly big cities and federation states like Sao Paulo and Rio de Janeiro [4,8,9].

However, COVID-19 has also impacted other Brazilian regions, such as the North, a territory almost entirely covered by the Amazon rainforest that accounts for almost half of the Brazilian territory. The North has a low population density (4.78 inh./km²), accounts for only 8.8% of the Brazilian population, and is responsible for 14.3% of all confirmed cases of COVID-19 in Brazil. This imbalance shows in the rates of infected per population: 2.6% in the North versus 1.5% in the rest of the country [10]. Figure 1 shows the evolution of the infection rate in all five Brazilian regions.

Carved into the Amazon rainforest is Amapá, a northern state of Brazil. Amapá is like an island surrounded by the forest, since it has no land routes to any other Brazilian state (see Fig. 2). It has only 830,000 inhabitants living in an area bigger than England, which is 67 times denser. Like other parts of the Amazon, Amapá already experiences excess mortality from infectious diseases, especially among indigenous populations. Despite recent political efforts, many people living in the state still suffer from social and health problems such as minimal access to clean water and public sanitation [11]. Those and other reasons make Amapá especially susceptible to COVID-19 and other epidemic outbreaks that may occur in the future.

October 5, 2020

By the end of May, Macapá, Amapá's capital, saw its health system collapse due to COVID-19. By the end of August 2020, the state had the second-highest infection rate in Brazil, according to official data [10]. By the end of September 2020, the state also had a low fatality rate (1.29%) compared to the whole country (3.02%), which may be the result of local attempts to track new cases and avoid under-notification.

Given this context, in this paper we explore and compare traditional and AI forecasting models to support Amapaense decision-makers in future decisions. The variables of interest are the cumulative numbers of confirmed cases and deaths. We compare the models autoregressive integrated moving average (ARIMA), Holt-Winters, support vector regression (SVR), k-nearest neighbors regressor (KNN), random trees regressor (RT), seasonal linear regression with change-points (SLiR), and simple logistic regression (SLR), which sets the baseline performance in this study.

We compare the models according to the needs of local authorities. Thus, we measure each model's effectiveness in forecasting 17 days ahead and how fast it responds to quick increases and decreases in the number of cases, as well as to periods of stability. These scenarios may repeat in the future, as a result of new contamination waves or vaccination, for example. The forecasts are performed for each Amapaense municipality individually and for the state's accumulated data, which we present as our main example.

Since the municipalities are at different stages of the COVID-19 spread, they may also display very different curve growth behaviors. Thus, as a result of this study, we have also created an online application (available at http://www.previsor.covid19amapa.com/) that can be used to visualize the data, follow the steps we take, and choose the best model for different occasions.

In this section, we describe our research framework, which we split into: (1.1) data acquisition, (1.2) data splitting, (1.3) fitting and forecasting, and (1.4) model evaluation. The subsections that follow treat each of those steps.

We modeled the cumulative confirmed cases of COVID-19 in Amapá from the first official case, on March 20th, 2020, up to August 20th, 2020. We gathered the data from official reports from each of the 16 Amapaense municipalities. The collected data is also available through an application programming interface provided by the Brasil.io repository [10]. The measurement periods are different for each municipality, and Tab. 1 summarizes the dates of the first and last reports.

training set and tried to forecast the next q days. Then, iteratively, we added one day to the training set, until it comprised n − q observations. Thus, for a given municipality, we have n − p − q + 1 different cross-validation splits.
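The expanding-window procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' Algorithm 1: it assumes that p is the size of the first training set, which is consistent with the count n − p − q + 1, and the function name `rolling_forward_splits` is hypothetical.

```python
def rolling_forward_splits(n, p, q):
    """Yield (train, test) index ranges for an expanding training window.

    n: total number of daily observations
    p: size of the first training set (assumed meaning of p)
    q: forecast horizon in days
    """
    if p < 1 or q < 1 or p + q > n:
        raise ValueError("need p >= 1, q >= 1 and p + q <= n")
    # The training set grows one day at a time, from p up to n - q observations.
    for train_end in range(p, n - q + 1):
        yield range(0, train_end), range(train_end, train_end + q)


# Hypothetical sizes: 10 observations, first training set of 5, horizon of 3.
splits = list(rolling_forward_splits(n=10, p=5, q=3))
print(len(splits))  # number of splits, equal to n - p - q + 1
for train, test in splits:
    print(len(train), list(test))
```

Each test window always has q days and starts right after the last training day, so no future observation leaks into the training set.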

Amapaense decision-makers. Thus, in the first split, the raw data is divided half-and-half between training and testing sets (see Algorithm 1).

Each training sample (x) is then standardized (z) by its mean (u) and standard deviation (s), calculated as z = (x − u)/s. We then fit the training datasets to each one of the following models: autoregressive

running on a daily basis will convert the predicted values before calculating the metrics and comparing them. The models are explained as follows:

The ARIMA model stands for the integration (I) of autoregressive (AR) and moving average (MA) models. Box and Jenkins [12] first designed this model.

ARIMA may also be adjusted to consider seasonality, whose optimal value may be found with a Canova-Hansen test [13]. The optimal values of the autoregressive order (p), the degree of differencing (d), and the moving average order (q) may also be found by grid search. Usually, we select the parameters that minimize the Akaike Information Criterion (AIC). Articles such as Benvenuto et al. [14], Ceylan [15], and Singh et al. [16] bring examples of ARIMA applications to forecasting COVID-19 cases. The general equations for AR and MA models are [15]:

where Y_t, ε, φ, and θ are the observed values at time t, the value of the random shock at time t, and the AR and MA parameters, respectively. Thus, an ARMA model is given by:

where α is a constant. When dealing with non-stationarity, the data may be differenced, and the ARIMA model is then applied.

cases [19,20]. The equations of the additive model follow.
where S is the smoothed observation, L the cycle length, and t a period. The trend factor (b), the seasonal index (I), and the forecast at m steps (F) are given by:

The general logic of an SVR is relatively simple. Suppose a linear regression whose objective is to minimize the sum of squared errors.

where y_i is the target, w_i the coefficient, and x_i the feature. Then, the training of SVR aims to minimize the following system.
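For reference, a common statement of this problem is the ε-insensitive soft-margin formulation. It is given here as an assumption about the intended form, and the symbols ε (tube half-width), C (penalty weight), b (intercept), and ξ_i, ξ_i^* (slack variables), as well as the vector notation w^T x_i, are not taken from the text above:

```latex
\min_{w,\,b,\,\xi,\,\xi^{*}} \; \frac{1}{2}\lVert w \rVert^{2}
  + C \sum_{i=1}^{n} \left(\xi_i + \xi_i^{*}\right)
\quad \text{subject to} \quad
\begin{cases}
y_i - w^{\top} x_i - b \le \varepsilon + \xi_i, \\
w^{\top} x_i + b - y_i \le \varepsilon + \xi_i^{*}, \\
\xi_i,\ \xi_i^{*} \ge 0 .
\end{cases}
```

Under this formulation, errors smaller than ε cost nothing, so only the points on or outside the tube (the support vectors) shape the fit, and C trades the flatness of w against tolerance for larger deviations.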

Random forest is a machine learning algorithm with many decision trees. Breiman [26] proposed a combination of bagging and random subspace methods. Nowadays,

Prophet is a forecasting approach developed by Facebook. It employs a decomposable time series model with three main components: trend (g(t)), seasonality (s(t)), and holidays (h(t)). It also assumes an error term (ε_t) representing any idiosyncratic changes that are not predicted by the model:

$$y(t) = g(t) + s(t) + h(t) + \varepsilon_t$$

with

$$g(t) = \left(k + a(t)^{\top}\delta\right)t + \left(m + a(t)^{\top}\gamma\right) \tag{15}$$

where k is the growth rate, δ is the vector of rate adjustments, m is the offset parameter, a(t) is the indicator vector of the changepoints already passed at time t, and γ_j is set to −s_j δ_j (with s_j the time of the j-th changepoint) to make the function continuous. Another important aspect is that the model performs automatic changepoint selection by putting a sparse prior on δ.
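The piecewise-linear trend in Eq. (15) can be evaluated directly with NumPy. The sketch below is an illustration only, not Prophet's own code: the function name `piecewise_linear_trend` and the numeric values are hypothetical, and it takes γ_j = −s_j δ_j, the choice that keeps g(t) continuous at each changepoint s_j.

```python
import numpy as np


def piecewise_linear_trend(t, k, m, changepoints, delta):
    """Evaluate g(t) = (k + a(t)^T delta) * t + (m + a(t)^T gamma)."""
    t = np.asarray(t, dtype=float)
    s = np.asarray(changepoints, dtype=float)
    delta = np.asarray(delta, dtype=float)
    # a(t)_j is 1 once time t has reached changepoint s_j, and 0 before it.
    a = (t[:, None] >= s[None, :]).astype(float)
    # Offset adjustments that keep the trend continuous at each changepoint.
    gamma = -s * delta
    return (k + a @ delta) * t + (m + a @ gamma)


# Hypothetical example: slope 1 until day 2, slope 2 afterwards.
days = np.arange(5)
trend = piecewise_linear_trend(days, k=1.0, m=0.0, changepoints=[2.0], delta=[1.0])
print(trend)
```

In this toy case the slope doubles at day 2 while the curve itself has no jump, which is the behavior the γ adjustment is meant to guarantee.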

For the seasonal component, Prophet relies on Fourier series to incorporate daily, weekly, and annual seasonalities. In the case of COVID-19, we are more concerned with weekly seasonality.
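As a rough illustration of how a Fourier series encodes weekly seasonality, the sketch below builds the cosine and sine regressors for a period of 7 days. The function name `fourier_features` and the order of 3 harmonics are assumptions made for this example, not settings reported in this study.

```python
import numpy as np


def fourier_features(t, period, order):
    """Return a matrix with cos/sin columns for harmonics 1..order."""
    t = np.asarray(t, dtype=float)
    columns = []
    for harmonic in range(1, order + 1):
        angle = 2.0 * np.pi * harmonic * t / period
        columns.append(np.cos(angle))
        columns.append(np.sin(angle))
    return np.column_stack(columns)


# Two weeks of daily time stamps, weekly period, 3 harmonics (assumed order).
features = fourier_features(np.arange(14), period=7.0, order=3)
print(features.shape)
```

A seasonal component is then a linear combination of these columns, so it repeats exactly every 7 days whatever coefficients are fitted.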
In the context of COVID-19, Prophet has made few appearances in forecasting the cumulative confirmed cases and deaths [8,28].

1.4 Model evaluation

We evaluate the performance of each forecasting model in terms of R-squared (R²), Root Mean Square Error (RMSE), and Symmetric Mean Absolute Percentage Error (sMAPE). We perform the evaluations for each train/test pair created by the rolling forward splitting. Thus, each metric is computed n − p − q + 1 times.
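The textbook definitions of the three metrics are given below as a reference. The sMAPE variant shown (absolute difference divided by the mean of the absolute values, in percent) is an assumption, since several variants are in use, and ȳ denotes the mean of the observed values in the test window:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}, \qquad
R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}
                 {\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}, \qquad
\mathrm{sMAPE} = \frac{100\%}{n}\sum_{i=1}^{n}
  \frac{\left|\hat{y}_i - y_i\right|}{\left(\left|y_i\right| + \left|\hat{y}_i\right|\right)/2}
```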
where n is the number of observations, and y_i and ŷ_i are the i-th observed and predicted values.

In general, all machine learning models achieve better results than logistic regression. In Fig. 6 we can see how Holt-Winters performs in comparison to the other five models. Notice that we measure rolling forward performance according to the R-squared given by each cross-validation set.

In each pair of models, we can observe how Holt-Winters performs in comparison to another model, considering the periods we classify as (1) exponential increase, (2) after a sudden daily decrease, and (3) stability of daily new cases.

Evaluations similar to those for confirmed cases can be extended to deaths. In this case, Holt-Winters still seems to be the most suitable model, along with ARIMA. Similar considerations can also be drawn for the municipalities of Amapá. However, for small cities where data is scarce, most models we present here struggle to make predictions. In this case, even naive approaches seem to be a good alternative.

will evolve is critical for local authorities to determine the best responses.

Thus, in this paper, we compared classical and machine learning models to forecast the evolution of COVID-19 in the state. Despite the volume of research papers pointing to machine learning models as those with the best performance for many locations, in the case of Amapá, two classical approaches seem to perform better: Holt-Winters and ARIMA. This may be a consequence of the Amapaense data, which has marked seasonality and sudden variations. One advantage of these two models is that they are easier to

Networks models, which may consider other feature sets in forecasting future numbers of cases. We also intend to propose a framework that indicates the best forecasting model for each municipality and period, saving time for local decision-makers.