Benchmarking 50 classification algorithms on 50 gene-expression datasets

By classifying patients into subgroups, clinicians can provide more effective care than using a uniform approach for all patients. Such subgroups might include patients with a particular disease subtype, patients with a good (or poor) prognosis, or patients most (or least) likely to respond to a particular therapy. Diverse types of biomarkers have been proposed for assigning patients to subgroups. For example, DNA variants in tumors show promise as biomarkers; however, tumors exhibit considerable genomic heterogeneity. As an alternative, transcriptomic measurements reflect the downstream effects of genomic and epigenomic variations. However, high-throughput technologies generate thousands of measurements per patient, and complex dependencies exist among genes, so it may be infeasible to classify patients using traditional statistical models. Machine-learning classification algorithms can help with this problem. However, hundreds of classification algorithms exist (and most support diverse hyperparameters), so it is difficult for researchers to know which are optimal for gene-expression biomarkers. We performed a benchmark comparison, applying 50 classification algorithms to 50 gene-expression datasets (143 class variables). We evaluated algorithms that represent diverse machine-learning methodologies and have been implemented in general-purpose, open-source, machine-learning libraries. When available, we combined clinical predictors with gene-expression data. Additionally, we evaluated the effects of performing hyperparameter optimization and feature selection in nested cross-validation folds. Kernel- and ensemble-based algorithms consistently outperformed other types of classification algorithms; however, even the top-performing algorithms performed poorly in some cases. Hyperparameter optimization and feature selection typically improved predictive performance, and univariate feature-selection algorithms outperformed more sophisticated methods. Together, our findings illustrate that algorithm performance varies considerably when other factors are held constant and thus that algorithm selection is a critical step in biomarker studies.

We anticipated that there would be scenarios in which even well-performing algorithms would perform poorly. Furthermore, relatively little is known about the extent to which algorithm choice affects predictive success for a given dataset. Thus, we questioned how much variance in predictive performance we would see across the algorithms. In addition, we evaluated practical matters such as tradeoffs between predictive performance and execution time, the extent to which algorithm rankings are affected by the performance metric used, and which algorithms behave most similarly (or differently) to each other.

General trends

We evaluated the predictive performance of 50 classification algorithms on 50 gene-expression datasets. Across the 50 datasets, we made predictions for a total of 143 class variables. We divided the analysis into five stages as a way to assess benefits that might come from including clinical predictors, optimizing an algorithm's hyperparameters, or performing feature selection (Figure 1).

In Analysis 1, we used only gene-expression data as predictors and used default hyperparameters for each classification algorithm. Figure S1 illustrates the performance of these algorithms using area under the receiver operating characteristic curve (AUROC) as a performance metric. As a method of normalization, we ranked the classification algorithms for each combination of dataset and class variable.

Performance rankings differed considerably depending on which evaluation metric we used. For example, in Analysis 1, many of the same algorithms that performed well according to AUROC also performed well according to classification accuracy (Figure S2). However, classification accuracy does not account for class imbalance and thus may rank algorithms in a misleading way. For example, the weka/ZeroR algorithm is ranked 17th among the algorithms according to classification accuracy, even though the algorithm simply selects the majority class. (Our analysis included two-class and multi-class problems.) Rankings for the Matthews correlation coefficient (MCC) were relatively similar to AUROC. For example, sklearn/logistic_regression had the 2nd-best average rank according to this metric. However, in other cases, the rankings were considerably different. For example, the mlr/sda algorithm performed 3rd-best according to MCC but 26th according to AUROC (Figure S3). Figure 2 shows the rankings for each algorithm across all metrics that we evaluated, highlighting the reality that conclusions about algorithm performance depend partly on the evaluation metric used.

One of the top-performing algorithms not only outperformed other algorithms in terms of predictive ability but also was one of the fastest algorithms. In contrast, the mlr/randomForest algorithm was among the most predictive algorithms but was orders of magnitude slower than other top-performing algorithms. Performance for different implementations of the same algorithm type was significantly correlated (r = 0.87; CI = 0.82-0.90; p = 2.2e-16). However, in some instances, their performance differed dramatically. For example, when predicting drug responses for dataset GSE20181, weka/LibSVM performed 2nd best, but mlr/svm performed worst among all algorithms. Figures S4-S5 illustrate for two representative datasets that algorithms with similar methodologies often produced similar predictions, but these predictions were never perfectly correlated. Execution times also differed from one implementation to another; for example, the median execution time for weka/LibSVM was 27.9 seconds, whereas the median for mlr/svm was 114.4 seconds. Overall, the median execution times differed significantly across the software packages (Kruskal-Wallis test; p = 5.0e-07); the sklearn algorithms executed faster than algorithms from other packages (Figure 3).
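To make the rank-based normalization described above concrete, the sketch below shows one way such per-combination ranking and rank averaging might be computed. This is not our analysis code; the column names and values are hypothetical.

```python
# Sketch (not the actual analysis code): rank algorithms within each
# dataset/class-variable combination, then average the ranks per algorithm.
# All column names and values below are hypothetical.
import pandas as pd

results = pd.DataFrame({
    "dataset": ["GSE_A", "GSE_A", "GSE_A", "GSE_B", "GSE_B", "GSE_B"],
    "class_variable": ["relapse"] * 3 + ["grade"] * 3,
    "algorithm": ["svm", "random_forest", "majority_class"] * 2,
    "auroc": [0.81, 0.84, 0.50, 0.66, 0.71, 0.50],
})

# Higher AUROC is better, so rank in descending order within each combination.
results["rank"] = (results
                   .groupby(["dataset", "class_variable"])["auroc"]
                   .rank(ascending=False, method="average"))

# Average rank per algorithm across all dataset/class-variable combinations.
print(results.groupby("algorithm")["rank"].mean().sort_values())
```

The same ranking could be repeated with accuracy or MCC in place of AUROC to examine how the choice of metric shifts the ordering.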

Some classification labels were easier to predict than others. Across the dataset/class combinations in Analysis 1, the median AUROC across all algorithms ranged between 0.441 and 0.966 (Additional Data File 1). For a given dataset/class combination, algorithm performance varied considerably, though this variation was influenced partially by the weka/ZeroR results, which we used as controls. To gain insight into predictive performance for different types of class labels, we assigned a category to each class variable (Figure S6); the best predictive performance was attained for class variables representing molecular markers, histological statuses, and diagnostic labels. Class variables in the "patient characteristics" category performed worst; these variables represented miscellaneous factors such as the patient's family history of cancer, whether the patient had been diagnosed with multiple tumors, and the patient's physical and cognitive "performance status" at the time of diagnosis.

In Analysis 2, we used only clinical predictors (for the dataset / class-variable combinations with available clinical data). These results differed considerably from Analysis 1, which used only gene-expression predictors. Three linear-discriminant classifiers performed particularly well: mlr/sda, sklearn/lda, and mlr/glmnet (Figure S7). Two Naïve Bayes algorithms also ranked among the top performers, whereas these algorithms had performed poorly in Analysis 1. Only two kernel-based algorithms were ranked among the top 10: weka/LibLINEAR and sklearn/logistic_regression. Both of these algorithms used the LibLINEAR solver. Most of the remaining kernel-based algorithms were among the worst performers. As with Analysis 1, most ensemble-based algorithms ranked in the top 25, but none ranked in the top 5.
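For readers less familiar with these algorithm families, the sketch below fits a linear-discriminant classifier and a LIBLINEAR-based logistic regression (the scikit-learn analogues of sklearn/lda and sklearn/logistic_regression) to a small synthetic clinical-predictor matrix. The predictor columns echo the kinds of clinical variables used in this study (age, sex, tumor location), but all values and labels here are fabricated for illustration.

```python
# Sketch: two of the classifier types discussed above, applied to a small,
# synthetic clinical-predictor matrix (all values are fabricated).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
clinical = np.column_stack([
    rng.integers(20, 80, 60),   # age at diagnosis
    rng.integers(0, 2, 60),     # biological sex (encoded 0/1)
    rng.integers(0, 3, 60),     # tumor location (encoded as integers for simplicity)
])
labels = rng.integers(0, 2, 60)

# Logistic regression here explicitly uses the LIBLINEAR solver.
for model in (LinearDiscriminantAnalysis(),
              LogisticRegression(solver="liblinear")):
    score = cross_val_score(model, clinical, labels, scoring="roc_auc", cv=5).mean()
    print(type(model).__name__, round(score, 3))
```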

Additional Data File 2 shows the performance of each combination of dataset and class variable in Analysis 2. As with Analysis 1, we observed considerable variation in our ability to predict particular classes and categories (Figure S8). For approximately two-thirds of the dataset/class combinations, AUROC values decreased, sometimes by more than 0.3 (Figure 4A); however, in a few cases, predictive performance increased. The most dramatic improvement was for GSE58697, in which we predicted progression-free survival for desmoid tumors. The clinical predictors were age at diagnosis, biological sex, and tumor location. Salas et al. previously found in a univariate analysis that age at diagnosis was significantly correlated with progression-free survival (60). We focused on patients who experienced relatively long or short survival times and used multivariate methods.

In Analysis 3, we combined clinical and gene-expression predictors. We limited this analysis to the 108 dataset / class-variable combinations for which clinical predictors were available (Additional Data File 3; Figure S9). As with Analysis 1, kernel- and ensemble-based algorithms performed best overall (Figure S10). For 90 (83.3%) of the dataset / class-variable combinations, the AUROC values were identical to Analysis 1 (Figure 4B). Except in three cases, the absolute change in AUROC was smaller than 0.05, including for GSE58697 (0.026 increase). These results suggest that standard classification algorithms (using default parameters) are not well suited for datasets in which gene-expression and clinical predictors have simply been merged. The abundance of gene-expression variables may distract the algorithms and/or obfuscate signal from the relatively few clinical variables. Additionally, gene-expression and clinical predictors may carry redundant signals.
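The simple merging described above amounts to column-wise concatenation of the two predictor matrices, as sketched below. The data are synthetic and the variable names are hypothetical; the point is only that the few clinical columns end up alongside thousands of gene columns.

```python
# Sketch: merging gene-expression and clinical predictors by column-wise
# concatenation. All data here are synthetic; rows must be aligned by patient.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_patients = 40

expression = pd.DataFrame(rng.normal(size=(n_patients, 2000)),
                          columns=[f"gene_{i}" for i in range(2000)])
clinical = pd.DataFrame({"age_at_diagnosis": rng.integers(20, 80, n_patients),
                         "sex_male": rng.integers(0, 2, n_patients)})
labels = rng.integers(0, 2, n_patients)

combined = pd.concat([expression, clinical], axis=1)
model = LogisticRegression(max_iter=1000).fit(combined, labels)
```

Strategies such as rescaling the clinical columns or modeling the two data types separately might mitigate the imbalance between the feature sets, although the results above apply to the simple merge.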

Effects of performing hyperparameter optimization

In Analysis 4, we performed hyperparameter optimization via nested cross validation. Across all 50 classification algorithms, we employed 1008 distinct hyperparameter combinations under the assumption that the default settings may be suboptimal for the datasets we evaluated. When clinical predictors were available, we included them (as in Analysis 3). When no clinical predictors were available, we used gene-expression predictors only. In most cases, hyperparameter optimization improved predictive performance relative to the default settings (Figure 5A); however, in some cases, performance decreased.
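The nested arrangement is sketched below with generic scikit-learn components. This is an illustration of the general technique, not the ShinyLearner implementation; the algorithm, grid, and fold counts are arbitrary, and the study itself used Monte Carlo splits rather than the k-fold splitter shown here.

```python
# Sketch of nested cross-validation for hyperparameter optimization
# (generic scikit-learn illustration, not the ShinyLearner implementation).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, n_features=500, n_informative=20,
                           random_state=0)   # synthetic stand-in for expression data

# Inner folds choose the hyperparameters; outer folds estimate performance.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.001]}
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(SVC(probability=True), param_grid, cv=inner, scoring="roc_auc")
print(cross_val_score(search, X, y, cv=outer, scoring="roc_auc").mean())
```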

The best- and worst-performing class variables and categories were similar to the previous analyses (Figure S12; Additional Data File 4). We observed a positive trend in which datasets with larger sample sizes resulted in higher median AUROC values (Figure S13); however, this relationship was not statistically significant (Spearman's rho = 0.12; p = 0.15). We observed a slightly negative trend between the number of genes in a dataset and median AUROC (Figure S14), but again this relationship was not statistically significant (rho = -0.06; p = 0.47).
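Rank correlations of this kind can be computed with SciPy, as the brief sketch below shows. The arrays are randomly generated stand-ins for the per-combination sample sizes and median AUROC values, not our actual results.

```python
# Sketch: Spearman rank correlation between dataset sample sizes and median
# AUROC values. The arrays below are random stand-ins, one entry per
# dataset / class-variable combination.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
sample_sizes = rng.integers(40, 500, size=143)
median_aurocs = rng.uniform(0.45, 0.95, size=143)

rho, p_value = spearmanr(sample_sizes, median_aurocs)
print(f"Spearman's rho = {rho:.2f}, p = {p_value:.3f}")
```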

Evaluating many hyperparameter combinations enabled us to quantify how much predictive performance varied across different combinations. Some variation is desirable because it enables algorithms to adapt to diverse analysis scenarios; however, large amounts of variation may make it difficult to select an optimal combination.

Of the 1008 total combinations, 984 were considered best for at least one dataset / class-variable combination (based on average performance in inner cross-validation folds).

Effects of performing feature selection

In Analysis 5, we performed feature selection via nested cross validation. We used 14 feature-selection algorithms in combination with each of the 50 classification algorithms. Due to the computational demands of evaluating these 700 combinations, we used default hyperparameters for both types of algorithms. The feature-selection algorithms differed in their methodological approaches (Table 1). Some were univariate methods, while others were multivariate. Some feature-selection algorithms mirrored the behavior of classification algorithms (e.g., SVMs or random forests); others were based on statistical inference or entropy-based metrics.
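Pairing a feature-selection step with a downstream classifier can be sketched with a scikit-learn pipeline, which keeps feature selection inside the training folds. This is a generic illustration rather than the ShinyLearner implementation, and the selector, classifier, and number of retained features are arbitrary choices.

```python
# Sketch: feature selection paired with a classifier inside cross-validation,
# so that features are chosen only from the training portion of each fold.
# (Generic scikit-learn illustration, not the ShinyLearner implementation.)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=100, n_features=2000, n_informative=15,
                           random_state=0)   # synthetic stand-in for expression data

pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=50)),               # univariate selection
    ("classify", RandomForestClassifier(random_state=0)),   # default hyperparameters
])
print(cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc").mean())
```

In a nested setup like the one used here, the number of retained features would itself be chosen based on performance in inner folds.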

Once again, kernel- and ensemble-based classification algorithms performed best overall when feature selection was used (Figure 6). The median improvement per dataset / class-variable combination was slightly larger for feature selection than for hyperparameter optimization, and the maximal gains in predictive performance were larger for feature selection (Figure 5B, Additional Data File 5). Overall, there was a strong positive correlation between AUROC values for Analyses 4 and 5 (Spearman's rho = 0.75; Figure S20). Among the 10 dataset / class-variable combinations that improved most after feature selection, 8 were associated with prognostic, stage, or patient-characteristic variables, the categories that were most difficult to predict overall (Figure S21). The remaining two combinations were molecular markers (HER2-neu and progesterone receptor status).

Across all classification algorithms, the weka/Correlation feature-selection algorithm resulted in the best predictive performance (Figure S22), despite being a univariate method. This algorithm calculates the Pearson correlation coefficient between each feature and the class values, a relatively simple approach that also ranked among the fastest (Figure S23). Other univariate algorithms were also among the top performers. To characterize algorithm performance further, we compared the feature ranks between all algorithm pairs for two of the datasets. Some pairs produced highly similar gene rankings, whereas in other cases the similarity was low (Figures S24-S25). The weka/Correlation and mlr/kruskal.test algorithms produced similar feature ranks; both use statistical inference, but the former is a parametric method, while the latter is nonparametric.
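A univariate, correlation-based ranking in the spirit of weka/Correlation can be sketched in a few lines. This is an illustrative reimplementation, not the Weka code; it assumes numeric features and binary (0/1) class labels, and the data below are synthetic.

```python
# Sketch: score each gene by the absolute Pearson correlation between its
# values and the (binary) class labels, then rank genes by that score.
import numpy as np

def correlation_ranking(X, y):
    """Return feature indices ordered from most to least correlated with y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)           # center each feature
    yc = y - y.mean()                 # center the labels
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()) + 1e-12)
    return np.argsort(-np.abs(r))     # strongest correlations first

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 200))        # 30 samples, 200 genes (synthetic)
y = rng.integers(0, 2, size=30)
print(correlation_ranking(X, y)[:10])
```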

Finally, as a way to provide guidance to practitioners, we examined interactions between individual feature-selection algorithms and classification algorithms (Figure 7). If a researcher had identified a particular classification algorithm to use, they might wish to select a feature-selection algorithm that performs well in combination with that classification algorithm. A feature-selection algorithm that performs well overall may not perform especially well in combination with a given classification algorithm; for example, one algorithm that ranked highly overall was only the 6th-best algorithm on average when sklearn/logistic_regression was used for classification. In contrast, a feature-selection algorithm that underperforms in general may perform well in combination with a given classification algorithm. For example, sklearn/svm_rfe performed poorly overall but was effective in combination with mlr/svm.

The overarching purpose of our benchmark study was to provide insights that might inform gene-expression biomarker studies. Such insights could lead to more accurate predictions in future studies and thus benefit patients. In situations where a biomarker is applied to thousands of cancer patients, even modest increases in accuracy can benefit hundreds of patients. We also sought to help bridge the gap between machine-learning researchers who develop general-purpose algorithms and biomedical researchers who seek to apply them in a specific context. When selecting algorithm(s), hyperparameters, and features to use in a gene-expression biomarker study, researchers might base their decisions on what others have reported in the literature for a similar study; or they might consider anecdotal experiences that they or their colleagues have had. However, these decisions may lack an empirical basis and not generalize from one analysis to another. Alternatively, researchers might apply many algorithms to their data to estimate which algorithm(s) will perform best. However, this approach is time- and resource-intensive and may lead to bias if the comparisons are not performed in a rigorous manner. In yet another approach, researchers might develop a custom classification algorithm, perhaps one that is specifically designed for the target data. However, it is difficult to know whether such an algorithm would outperform existing, classical algorithms.

Many factors can affect predictive performance in a biomarker study. These factors include data-generation technologies, data normalization / summarization processes, validation strategies, and evaluation metrics used. Although such factors must be considered, we have shown that when holding them constant, the choice of algorithm, hyperparameter combination, and features usually affects predictive performance for a given dataset, sometimes dramatically. Despite these variations, we have demonstrated that particular algorithms and algorithm categories consistently outperform others across diverse gene-expression datasets and class variables. However, even the best algorithms performed poorly in some cases. These findings support the theory that no single algorithm is universally optimal (61). But they also suggest that researchers can increase the odds of success in developing accurate biomarkers by focusing on a few top-performing algorithms and by using hyperparameter optimization and/or feature selection, despite the additional computational demands in performing these steps.

This benchmark study is considerably larger than any prior study of classification algorithms applied to gene-expression data. We deliberately focused on general-purpose algorithms because they are readily available in well-maintained, open-source packages. Of necessity, we evaluated a non-exhaustive list of algorithms and hyperparameter combinations. Other algorithms or hyperparameter combinations may have performed better than those that we used.

Some algorithms had more hyperparameter combinations than others, which may have enabled those algorithms to adapt better in Analysis 4. Additionally, in some cases, our hyperparameter combinations were inconsistent between two algorithms of the same type because different software libraries support different options. Despite these limitations, a key advantage of our benchmarking approach is that we performed these comparisons in an impartial manner, not having developed any of the algorithms that we evaluated nor having any other conflict of interest that might bias our results. These findings may help to improve predictive performance in future studies. Future efforts to improve predictive ability might also include optimizing hyperparameters of feature-selection algorithms, combining hyperparameter-optimized classification algorithms with feature selection, and using multiple classifier systems (63). Transfer learning across datasets may also prove fruitful (64).

Our findings are specific to high-throughput gene-expression datasets that have either no clinical predictors or a small set of clinical predictors. However, our conclusions may have relevance to other datasets that include a large number of features and that may include a combination of numeric, discrete, and nominal features.

Finally, we mention additional limitations and caveats. We applied Monte Carlo cross validation to each dataset separately and thus did not evaluate predictive performance in independent datasets. This approach was suitable for our benchmark comparison because our priority was to compare algorithms against each other rather than to optimize their performance for clinical use. On another note, comparisons across machine-learning packages are difficult to make. For example, some sklearn algorithms provided the ability to automatically address class imbalance, whereas other software packages often did not provide this functionality. Adapting these weights manually was infeasible for this study. In addition, some classification algorithms are designed to produce probabilistic predictions, whereas other algorithms produce only discrete predictions. The latter algorithms may have been at a disadvantage in our benchmark for the AUROC and other metrics.
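As an illustration of the class-imbalance point, several scikit-learn classifiers accept a class_weight argument that reweights classes automatically; whether comparable options exist, and how they are exposed, varies across the other libraries we evaluated.

```python
# Sketch: requesting automatic class reweighting in scikit-learn.
# class_weight="balanced" weights classes inversely to their frequencies.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

balanced_logistic = LogisticRegression(class_weight="balanced", max_iter=1000)
balanced_svm = SVC(class_weight="balanced", probability=True)
```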

Data preparation

We used 50 datasets spanning diverse diseases and tissue types but focused primarily on cancer-related conditions. We used data from two sources. The first was a resource created by Golightly et al. (65) that includes 45 datasets from Gene Expression Omnibus (66). For these datasets, the gene-expression data were generated using Affymetrix microarrays, normalized using Single Channel Array Normalization (67), summarized using BrainArray annotations (68), and quality checked using IQRay (69).

We used 50 classification algorithms that were implemented in the ShinyLearner tool, which enables researchers to apply classification algorithms from multiple open-source machine-learning libraries in a uniform way. We excluded some algorithms from our analysis because they raised exceptions when we used default hyperparameters, required excessive amounts of random access memory (75 gigabytes or more), or were orders of magnitude slower than the other algorithms.

For feature selection, we used 14 algorithms that had been implemented in ShinyLearner (78). Table 1 lists each of the algorithms, along with a description and high-level category for each algorithm.

To analyze the benchmark results, we wrote scripts for Python (version 3.6) (83) and the R statistical software (version 4.0.2) (84). We also used the corrplot (85), cowplot (86), ggrepel (87), and tidyverse (88) packages.

Analysis phases

We performed this study in five phases (Figure 1). In each phase, we modulated either the data type(s) used as predictors or whether we attempted to improve predictive performance via hyperparameter optimization or feature selection.

In each phase, we used Monte Carlo cross validation. For each iteration, we randomly assigned the patient samples to either a training set or a test set, stratified by class. We assigned approximately two-thirds of the patient samples to the training set. We then made predictions for the test set and evaluated the predictions using diverse metrics (see below). We repeated this process (an iteration) multiple times and used the iteration number as a random seed when assigning samples to the training or test set (unless otherwise noted).

ShinyLearner relays this seed to the underlying algorithms, where applicable.
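A minimal sketch of this Monte Carlo scheme, using generic scikit-learn utilities rather than ShinyLearner, is shown below; the dataset is synthetic and the classifier is an arbitrary stand-in.

```python
# Sketch of Monte Carlo cross validation as described above: per iteration,
# roughly 2/3 of samples go to training (stratified by class), the iteration
# number serves as the random seed, and AUROC is computed on the test set.
# (Generic scikit-learn illustration with synthetic data, not ShinyLearner.)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=90, n_features=1000, n_informative=10,
                           random_state=0)

aurocs = []
for iteration in range(1, 51):   # 50 iterations, as in Analyses 1-3
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=2 / 3, stratify=y, random_state=iteration)
    model = RandomForestClassifier(random_state=iteration).fit(X_train, y_train)
    aurocs.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```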

During Analysis 1, we evaluated the number of Monte Carlo iterations that would be necessary to provide a stable performance estimate. For the mlr/randomForest, sklearn/svm, and weka/Bagging classification algorithms, we executed 100 iterations for representative datasets, including GSE10320 (predicting relapse vs. non-relapse). As the number of iterations increased, we calculated the cumulative average of the AUROC for each algorithm. After performing at most 40 iterations, the cumulative averages did not change more than 0.01 over sequences of 10 iterations (Figures S27-S28). To be conservative, we used 50 iterations in Analyses 1, 2, and 3. In Analyses 4 and 5, we used 5 iterations because hyperparameter optimization and feature selection are considerably more computationally intensive.
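The stability check described above can be sketched as follows, assuming the aurocs list from the previous sketch; the 10-iteration window and 0.01 threshold mirror the criterion stated in the text.

```python
import numpy as np

# Cumulative mean AUROC after each iteration (assumes "aurocs" from the sketch above).
cumulative = np.cumsum(aurocs) / np.arange(1, len(aurocs) + 1)

# Change in the cumulative mean over each window of 10 consecutive iterations;
# small values (e.g., <= 0.01) indicate that the estimate has stabilized.
window_change = np.abs(cumulative[10:] - cumulative[:-10])
print(window_change)
```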

While executing each analysis phase, we encountered some situations in which we did not obtain results for all combinations of class variables and algorithms. We describe these exceptions below.

Analysis 1. On iteration 34, the weka/RBFNetwork algorithm did not converge after 24 hours of execution time for one of the datasets. We manually changed the random seed from 34 to 134, and it converged in minutes.

Analysis 2. The mlr/glmnet algorithm failed three times due to an internal error. We limited the results for this algorithm to the iterations that completed successfully. The total number of classification problems was smaller for this analysis than for Analysis 1 because no clinical predictors were available for some class variables.
Analysis 3. Again, on iteration 34, the weka/RBFNetwork algorithm did not converge after 24 hours of execution time for one of the datasets. We manually changed the random seed from 34 to 134, and it converged in minutes. The total number of classification problems was smaller than for Analysis 1 because no clinical predictors were available for some class variables.

Analysis 4. During nested Monte Carlo cross validation, we specified a time limit of 168 hours under the assumption that some hyperparameter combinations would be especially time intensive. A total of 1022 classification tasks failed either due to this limit or due to small sample sizes. We ignored these hyperparameter combinations when determining the top-performing combinations. Most failures were associated with the mlr/h2o.gbm and mlr/ksvm classification algorithms.

Analysis 5. During nested Monte Carlo cross validation, we specified a time limit of 168 hours. A total of 574 classification tasks failed either due to this limit or due to small sample sizes. We ignored these tasks when seeking to select an optimal number of features.

When multiple cores were available on a given server, we executed tasks in parallel using GNU Parallel (89).

In outer cross-validation folds, we used diverse metrics to quantify classification performance.

Tables

Table 1: Feature-selection algorithms (implemented in ShinyLearner, the workbench for machine learning). For each algorithm, we provide a brief description of the algorithmic approach; we extracted these descriptions from the libraries that implemented the algorithms.

In addition, we assigned high-level categories that indicate whether the algorithms evaluate a single feature (univariate) or multiple features (multivariate) at a time. In some cases, the individual machine-learning libraries aggregated algorithm implementations from third-party packages. In these cases, we cite the machine-learning library and the third-party package. When available, we also cite papers that describe the algorithmic methodologies used.

Figure 1: Overview of analysis scenarios. This study consisted of five separate but related analyses. This diagram indicates which data type(s) were used and whether we attempted to improve predictive performance via hyperparameter optimization or feature selection in each analysis.