PT - JOURNAL ARTICLE
AU - Crawford, Jake
AU - Chikina, Maria
AU - Greene, Casey S.
TI - Optimizer’s dilemma: optimization strongly influences model selection in transcriptomic prediction
AID - 10.1101/2023.06.26.546586
DP - 2023 Jan 01
TA - bioRxiv
PG - 2023.06.26.546586
4099 - http://biorxiv.org/content/early/2023/06/26/2023.06.26.546586.short
4100 - http://biorxiv.org/content/early/2023/06/26/2023.06.26.546586.full
AB - Motivation: Most models can be fit to data using various optimization approaches. While model choice is frequently reported in machine-learning-based research, optimizers are not often noted. We applied two implementations of LASSO logistic regression from Python’s scikit-learn package, which use different optimization approaches (coordinate descent and stochastic gradient descent, SGD), to predict driver mutation presence or absence from gene expression across 84 pan-cancer driver genes. Across varying levels of regularization, we compared performance and model sparsity between optimizers. Results: After model selection and tuning, we found that coordinate descent (implemented in the liblinear library) and SGD tended to perform comparably. liblinear models required more extensive tuning of regularization strength, performing best for high model sparsities (fewer nonzero coefficients), but did not require selection of a learning rate parameter. SGD models required tuning of the learning rate to perform well, but generally performed more robustly across different model sparsities as regularization strength decreased.
Given these tradeoffs, we believe that the choice of optimizer should be clearly reported as part of the model selection and validation process, to allow readers and reviewers to better understand the context in which results have been generated. Availability and implementation: The code used to carry out the analyses in this study is available at https://github.com/greenelab/pancancer-evaluation/tree/master/01_stratified_classification. Performance/regularization strength curves for all genes in the Vogelstein et al. 2013 dataset are available at https://doi.org/10.6084/m9.figshare.22728644. Competing Interest Statement: The authors have declared no competing interest.