Comparison of national level spatial and spatio-temporal models of malaria

Geospatial statistical models play an important role in malaria control and prevention; they are widely used to produce malaria risk maps, which are essential to guide efficient resource allocation for intervention. Although many models are available for spatial mapping, the most commonly used model in the literature is the Bayesian geostatistical model (BGM), which is based on an underlying Gaussian process. To our knowledge, methods such as splines and decision trees ensemble methods have not been compared relative to their predictive skill for country level malaria prevalence mapping. Moreover, as more countries now have multiple datasets collected throughout the past decade, it is critical to evaluate if the inclusion of past datasets and the use of spatio-temporal models improve the prediction accuracy of present spatial distribution of malaria. Here we compare the prediction accuracy of five models under spatial and spatio-temporal settings in five African countries. The five models are stepwise logistic regression, generalized additive model (GAM), gradient boosted trees (GBM), Bayesian additive regression trees (BART) and the BGM. There is not a single best model to predict malaria prevalence on a national scale. The model performances varied from country to country, and from spatial to spatio-temporal setting. In general, BGM, GAM and BART models performed well, with BGM being the most consistent. The inclusion of past data is not always beneficial: the predictive performance of GAM and GBM increased under spatio-temporal setting, but BGM’s performance decreased in most of the countries. An accurate depiction of malaria risk is important and statistical assumptions that are suitable for a country does not always fit other countries and a wide range of models and settings should be used. It ensures that we find the best modeling approach possible and can provide additional insight to the spatial distribution of malaria risk. Author summary Malaria is still affecting hundreds of millions of people every year, and killing hundreds of thousands. As the majority of malaria intervention and control policies are developed at the national level, accurate spatial prediction of malaria risk is important. Choosing the best modeling approach for prediction is not straightforward. Here we compare the predictive performance of five models in five countries, with and without dataset from multiple surveys, to provide empirical evidence on whether there is a single best model for national level malaria prediction, and whether inclusion of past dataset may be beneficial in predicting current distribution of malaria risk. We find that models’ performances vary from country to country and there is no single best model. Although Bayesian geostatistical model is widely and commonly used in the literatures, its performance is not necessarily superior to other simpler-to-fit methods such as general additive model (splines) and Bayesian additive regression tree. Importantly, we also show that the incorporation of past data does not always improve spatial predictions of current disease risk. Together, this demonstrate the importance of fitting wide range of models as part of the prediction mapping process, instead of relying on a one-size-fit-all model.

it is unclear if the inclusion of past data and the use of models that account for spatial 53 and temporal correlation necessarily improve prediction of malaria prevalence across the 54 landscape. 55 In this article, we conduct a model comparison exercise to determine if GAM and 56 decision trees can be good alternatives to the BGM, under both spatial and 57 spatio-temporal setting. We also determine if inclusion of past datasets is beneficial in 58 modeling the current spatial distribution of malaria prevalence. We compared the 59 predictive performance of five modelling approaches in five sub-Saharan African 60 countries (Burkina Faso, Mali, Malawi, Nigeria and Uganda) to find out the best spatial 61 and spatio-temporal models for national level malaria prediction. The five models are 62 stepwise logistic regression, GAM, gradient boosted trees (GBM), BART and BGM, fit 63 using SPDE-INLA. In addition, we compare the performance between the spatial and 64 spatio-temporal models. To help readers use the models discussed here and to reflect on 65 some practical considerations while fitting the models, we have included a short 66 step-by-step tutorial in the Appendix (S1 Appendix) that describes how to implement 67 these models using a simulated dataset.  (Table 1). The 71 standard DHS is a household survey on a wide array of population, health and nutrition 72 indicators designed to be representative at national and regional levels 73 (http://www.dhsprogram.com). Typically, a two-stage sampling protocol is followed 74 where a few hundreds "clusters" (e.g. villages or residential areas) are selected with  The geospatial covariates were extracted from various remote sensing and GIS resources 103 that are freely and publicly available ( Table 2). They comprise of commonly used based on a buffer area around each sampled location is recommended [31]. However, we 117 found that the covariates are so highly spatially correlated that point and buffer 118 extraction do not differ significantly.
The goal here is to model p jt (i.e. malaria prevalence of a given location and time) 131 as accurately as possible given the design vector x jt and, for some models, the 132 spatial-temporal random effects. The design vector contains the predictor variables described in Table 2, latitude, longitude, the interaction between latitude and longitude, 134 and a dummy variable indicating if the observation comes from the present (1) or past 135 (0) dataset. Different assumptions are made regarding the prevalence p jt in different 136 models.

137
Under the spatial-only setting, all temporally related components are dropped and 138 all models assume that: We model the malaria prevalence of given location, p j , using the design vector x j 140 and, for some models, spatial random effects. In this setting, the design vector contains 141 the predictor variables described in Table 2, latitude, longitude, and the interaction 142 between latitude and longitude.

143
Stepwise logistic regression 144 The stepwise logistic regression model is the simplest model we adopt and is included as 145 a "baseline" model. It assumes that where β is the vector of regression coefficients. The regression coefficients are 147 estimated using iteratively weighted least squares. Stepwise regression method was used 148 to select variables based on AIC to avoid the risk of overfitting these data. This is done 149 in R using stepAIC() function from MASS package.
where f s is a thin plate spline function. This model was fitted using the restricted 154 maximum likelihood (REML) method in the mgcv package in R [41]. The package also 155 uses generalized cross validation to determine the optimal amount of smoothness in the 156 spline function.
slower learning rate requires more trees to achieve the optimal deviance, and the 174 prediction accuracy is improved from having more trees, although too many trees will 175 eventually have a detrimental impact on the performance. To obtain the predicted 176 response of a given set of predictors The fitting process involves gradient descent, thus the method is commonly called 178 "gradient boosting". Our implementation of the method also uses stochastic subsample 179 April 18, 2019 9/24 of predictors and training samples at each iteration to introduce randomness. Another 180 popular aggregation method is random forest. However, our preliminary study showed 181 that random forest performed much worse than the gradient boosted method across all 182 of our datasets, so we excluded the random forest results from this study. 183 We implemented the gradient boosting method by using the gbm package to fit the 184 model and used the caret package [42] to optimize the tuning parameters.
where Σ is the covariance matrix representing the combination of two processes: Σ s 203 arises from a 2D Gaussian process that captures the spatial correlation among clusters 204 and Σ t arises from a first order autoregressive process (AR1) that describes the 205 temporal correlation between survey years. For spatial correlation, we chose a Matern 206 corelation function of smoothing parameter ν = 1, in which the correlation between two 207 clusters j and j at time t is: where κ is an inverse range parameter, K 1 (·) is the modified Bessel function of 209 second kind with order of 1, and D j,j is the Euclidean distance between cluster j and 210 j . To account for temporal correlation, we assumed an underlying first order 211 autoregressive process, in which the correlation between two different time point t and 212 t of the same cluster j is: where φ ∈ (−1, 1) controls the temporal correlation.

214
Under the spatial-only setting, the BGM formulation above is simplified to: with all geospatial covariates and a spatial random effect term (i.e. the SPDE object). 229 Comparing predictive performance among models 230 We assessed the predictive performance of the models based on their out-of-sample 231 predictive skill using ten-fold cross validation. Since we are only interested in predicting 232 the "present" prevalence, clusters in present dataset were randomly divided into ten 233 subsets. Data from the first subset of clusters was withheld and the model was fitted 234 using data from the remaining nine subsets, and the past dataset if applicable. Then, 235 we used the estimated parameters to obtain the expected predicted prevalence of the 236 holdout dataset. This procedure was repeated nine times by withholding data from the 237 second to the tenth group of clusters. As a result, each cluster in the present dataset is 238 held out exactly once and has an out-of-sample prediction. We fitted 100 models (five 239 models × two settings×ten folds) for each country.

240
Using the out-of-sample predictions for each cluster, we calculated the per person clusters. These statistics were calculated using the following expressions: where N t is the total number of individuals sampled (for the present dataset), n jt 245 andp jt are the number of individuals sampled and the predicted prevalence for cluster j 246 at time t, respectively.

247
We compare the performance among models separately for the spatial and we only report the results for logLik for these pairwise comparisons.

253
Finally, to assess the benefit of using past prevalence data in predicting present 254 prevalence, we calculate the difference in logLik and MAE between spatial and 255 spatio-temporal setting of the same model. If spatio-temporal setting is better, i.e.

256
inclusion of past dataset is beneficial, then we expect that the difference is negative for 257 logLik and positive for MAE.

259
Performance of models for the spatial-only setting 260 There was no clear "winner" when modelling prevalence using only present dataset (Fig. 261   1a). The BGM was the best in Burkina Faso and Nigeria (in terms of logLik) and it Performance of models under spatio-temporal setting 278 The performances of the models under spatio-temporal setting can be noticeably 279 different from that of spatial setting (Fig. 2a). GAM was the best model in Burkina Nigeria, the performance differences among the models reduced remarkably.

289
Pairwise comparison among the models yielded similar results (Fig. 2b). GAM and 290 BGM appears to be the best overall: the former performed better or similar to other 291 models in all countries except in Malawi while the latter noticeably underperformed Performance difference between spatial and spatio-temporal However, our separable formulation of the covariance matrix implicitly assumes that the 320 spatial correlation pattern in the past is similar to that of the present. This is a strong 321 for GAM, GBM and BART are also relatively straightforward using the packages in R 385 and it is simple to conduct parameter tuning for GBM and BART through the caret 386 package and built-in functions.

387
As the majority of malaria intervention and control policies are developed at the 388 national level, country-level analyses play an important role in informing policy makers 389 (e.g. [45][46][47] and detailed explanations on fitting the five models under the spatial and 407 spatio-temporal settings.