A framework for testing different imputation methods for tabular datasets

Background and purpose: Handling missing values is a prevalent challenge in the analysis of clinical data. The rise of data-driven models demands an efficient use of the available data, so methods to impute missing values are crucial. Here, we developed a publicly available framework to test different imputation methods and compared their impact on a typical clinical stroke dataset as a use case.

Methods: A clinical dataset based on the 1000Plus stroke study, comprising 380 patients with complete entries, was used. Thirteen common clinical parameters, including numerical and categorical values, were selected. Missing values from 0% to 60% were simulated in a missing-at-random (MAR) and a missing-completely-at-random (MCAR) fashion and subsequently imputed using mean imputation, hot-deck imputation, multiple imputation by chained equations (MICE), the expectation maximization (EM) method and listwise deletion. Performance was assessed by the root mean squared error (RMSE), the absolute bias and the performance of a linear model for discharge mRS prediction.

Results: Listwise deletion was the worst performing method and became significantly worse than every imputation method from 2% (MAR) and 3% (MCAR) missing values onward. The underlying missing value mechanism appeared to have a crucial influence on which imputation method performed best; consequently, no single imputation method outperformed all others. A significant performance drop of the linear model started from 11% (MAR+MCAR) and 18% (MCAR) missing values.

Conclusions: In the presented case study of a typical clinical stroke dataset, we confirmed that listwise deletion should be avoided for dealing with missing values. Our findings indicate that the underlying missing value mechanism and other dataset characteristics strongly influence the best choice of imputation method.
For future studies with a similar data structure, we thus suggest using the framework developed in this study to select the most suitable imputation method for a given dataset prior to analysis.


Introduction
(NIHSS), discharge NIHSS, discharge mRS and discharge Trial-of-ORG-10172-in-Acute-Stroke-Treatment (TOAST). Six were categorical: sex, treatment with tissue plasminogen activator (tPA), occlusion, hyperlipidemia, diabetes and hypertonia.

Missing values from 0% to 60% were simulated following two different missing-value mechanisms: missing-at-random (MAR) and missing-completely-at-random (MCAR). Under MAR, the probability of a value being missing depends on the values of other observed variables. MCAR, on the other hand, describes the scenario in which values are missing completely at random: in contrast to MAR, there is no systematic reason for a missing value, and the probability of being missing is the same for every value.

Performance was estimated for the different imputation methods in two ways: 1) error assessment using RMSE and absolute bias, and 2) performance assessment of a predictive model for stroke discharge mRS. The imputation methods included 1) mean imputation, 2) hot-deck imputation, 3) MICE, 4) multiple imputation by EM and 5) listwise deletion.

Error assessment

For the error assessment analysis, we chose two common measures for evaluating imputation methods, RMSE and absolute bias [21]. In this analysis, the parameters were split into numerical and categorical. For numerical parameters, the RMSE of the normalized data is defined according to

RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2 },

where n is the number of imputed samples, \hat{Y}_i the estimated sample value and Y_i the true value. For categorical parameters, the RMSE corresponds to the percentage of misclassified values:

% of misclassified samples = (number of misclassified samples) / (total number of samples).

As a second error assessment, the mean absolute bias was calculated. For a single imputed sample i it is defined as

bias_i = \left| \frac{1}{m} \sum_{j=1}^{m} \hat{Y}_{ij} - Y_i \right|,

where m is the number of iterations for each imputation, i.e. how often the value was imputed. The absolute bias was then averaged over all n imputed samples.
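As an illustrative sketch (not the study's actual code, and with hypothetical column names), the MCAR simulation and a simple mean imputation step could look as follows:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy stand-in for the clinical table (the real study used 13 parameters
# from 380 patients; these columns are only illustrative).
df = pd.DataFrame({
    "age": rng.normal(70, 10, 380),
    "admission_nihss": rng.integers(0, 25, 380).astype(float),
})

def ampute_mcar(data: pd.DataFrame, fraction: float) -> pd.DataFrame:
    """Set a given fraction of cells to NaN completely at random (MCAR):
    every cell has the same probability of going missing."""
    mask = rng.random(data.shape) < fraction
    return data.where(~mask)  # masked cells become NaN

amputed = ampute_mcar(df, 0.10)

# Mean imputation: replace each NaN with the column mean of observed values.
imputed = amputed.fillna(amputed.mean())
```

A MAR simulation would instead make the masking probability for one column depend on the observed values of another column, which is why the two mechanisms can favor different imputation methods.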

Both the RMSE and the absolute bias were assessed for one variable at a time and then averaged for each parameter type (i.e. numerical vs. categorical).
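The three error measures described above can be sketched as follows (the per-sample bias definition reflects our reading of the text and should be taken as an assumption):

```python
import numpy as np

def rmse(y_true: np.ndarray, y_imputed: np.ndarray) -> float:
    """RMSE over the n imputed samples (numerical parameters)."""
    return float(np.sqrt(np.mean((y_imputed - y_true) ** 2)))

def misclassification_rate(y_true, y_imputed) -> float:
    """Categorical analogue of the RMSE: share of imputed values
    that differ from the true category."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_imputed)))

def mean_absolute_bias(y_true: np.ndarray, y_imputed_runs: np.ndarray) -> float:
    """y_imputed_runs has shape (m, n): m imputation iterations of the
    same n missing values. The per-sample bias is the absolute difference
    between the mean imputed value and the truth, then averaged over n."""
    per_sample_bias = np.abs(y_imputed_runs.mean(axis=0) - y_true)
    return float(per_sample_bias.mean())
```

Averaging these per-variable scores within the numerical and categorical groups then yields the parameter-type summaries used in the figures.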
Predictive model analysis

The second part of the analysis incorporated a supervised predictive model, the generalized linear model (GLM). We constructed a model predicting the modified Rankin Scale (mRS) at discharge, which is a measure of the early clinical outcome after stroke. It can take values from 0 (no symptoms) to 6 (death). The mRS was split into 0-2 (good outcome) and 3-6 (bad outcome) [4,22]. The other discharge parameters (discharge NIHSS and discharge TOAST) were excluded from this analysis to maintain the integrity of the model. This predictive modelling framework is used as a standard method for this use case [23][24][25].

MCAR missing values (Fig 1). For the categorical data, the lowest percentage of misclassified samples was observed for mean imputation (Fig 2). For MAR missing values, mean imputation appears less steady and stable compared to the MCAR case.

The more values imputed, the lower the resulting performance. The best overall performance was yielded by mean imputation (Fig 5A). Compared to the other imputation methods, this MAR case-type between 2% to 3% missing values. From 11% on, every model is significantly worse than the complete-entry model.
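A minimal sketch of the predictive-model evaluation, using scikit-learn's logistic regression as the GLM with a logit link and synthetic stand-in predictors (the real study used the clinical parameters; all variable names here are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: admission NIHSS and age as predictors,
# discharge mRS on the 0-6 scale loosely driven by NIHSS.
n = 380
X = np.column_stack([rng.integers(0, 25, n), rng.normal(70, 10, n)])
mrs = np.clip((X[:, 0] / 5 + rng.normal(0, 1, n)).round(), 0, 6)

# Dichotomize: mRS 0-2 = good outcome (0), mRS 3-6 = bad outcome (1).
y = (mrs >= 3).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

In the study's setup, this model would be refit on each imputed dataset and its AUC compared against the complete-entry model to locate the point where imputation starts to degrade prediction.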

Similar results were observed for MCAR missing values (Fig 6). The more values imputed, the lower the resulting AUC. Mean imputation yielded the best performance, yet significance was shown only for 45% missing values and above (Figs 6A and 6C).

Listwise deletion performed significantly worse than all other imputation methods starting from 3% missing values (Fig 6D). For the MCAR case-type, the complete-entry model likewise performed best.

The first significant AUC value occurred at 1% missing values. Starting from 18% missing values, every model was significantly worse than the complete-entry model.

should not be neglected.

Our results do not provide a strict recommendation for one imputation method.

While mean imputation seemed to show the lowest RMSE and the highest performance in terms of AUC, these results should be interpreted with caution. Mean imputation is a method that aims to minimize the RMSE; this measure is therefore biased towards mean imputation. For that reason, we additionally compared the methods using the absolute bias. Here, mean imputation performs well for categorical data as well as for numerical data with MCAR missing values. Looking at the error assessment for categorical data, however, we observed that mean imputation performed less robustly. In the particular case of categorical data, mean imputation means imputing the value that occurs most often in the remaining dataset. Hence, the imputation highly depends on which category the missing value belonged to. The resulting error is then less stable and more easily corrupted by the missing value pattern.
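For illustration, on a categorical column "mean imputation" reduces to mode imputation, which is why the result hinges entirely on whether the missing value happened to belong to the majority category (a minimal pandas sketch with made-up data):

```python
import pandas as pd

# Hypothetical categorical variable with two missing entries.
s = pd.Series(["yes", "no", "yes", None, "yes", None])

# Mode imputation: every missing value gets the most frequent
# observed category ("yes" here), regardless of the true value.
imputed = s.fillna(s.mode().iloc[0])
```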

In the predictive model analysis, mean imputation showed significantly better results than the other imputation methods in the range of 25% to 45% (MAR+MCAR) and 45% to 60% (MCAR) missing values. For the given dataset we establish a threshold of 11% (MAR+MCAR) and 18% (MCAR) above which imputation of missing values is discouraged. Consequently, the significant improvement of mean imputation lies a priori outside the practical range in which values should be imputed [26,27].
For numerical data in the MAR case-type, we found MICE and EM to show the lowest absolute bias. In other studies, complex algorithms like MICE and EM also appeared to be superior to seemingly old-fashioned imputation methods like mean or hot-deck imputation [16,17,26].

better. This is also corroborated in theory by the "no free lunch" theorem [33,34]. The theorem states that there is no algorithm that performs best in all tasks; the good performance of one algorithm in one task comes at the cost of low performance in another task [33].

A key finding was that listwise deletion should not be performed and that the choice of imputation method might depend highly on the underlying missing value mechanism and other characteristics of a given dataset. Thus, we suggest that the optimal imputation method is dataset-dependent, and we strongly encourage other researchers to adapt our openly available framework to their own datasets prior to analysis.