Accessible, Reproducible, and Scalable Machine Learning for Biomedicine

Supervised machine learning, where the goal is to predict labels of new instances by training on labeled data, has become an essential tool in biomedical data analysis. To make supervised machine learning more accessible to biomedical scientists, we have developed Galaxy-ML, a platform that enables them to perform end-to-end reproducible machine learning analyses at large scale using only a web browser. Galaxy-ML extends Galaxy, a biomedical computational workbench used by tens of thousands of scientists across the world, with a machine learning tool suite that supports end-to-end analysis.

There is a pressing need for a platform that (1) makes machine learning accessible to biomedical scientists who have limited programming and informatics knowledge and (2) connects machine learning with the broader ecosystems of biomedical analysis tools and a scalable computational workbench.
To meet this need, we have developed Galaxy-ML (Figure 1), an extension of the Galaxy platform (http://galaxyproject.org) 13 that features a large and diverse suite of supervised machine learning tools. Galaxy is a user-friendly, web-based computational workbench used by tens of thousands of scientists across the world for a wide variety of biomedical data analyses, including genomics, proteomics, metabolomics, cheminformatics, image processing, and flow cytometry. A key aspect of Galaxy is its web-based user interface, which enables anyone to use complex analysis tools and multi-tool workflows without any knowledge of programming. Galaxy-ML uses the Galaxy web interface to make machine learning tools and pipelines widely accessible.

Figure 1. The Galaxy-ML platform provides all the tools necessary to define a learner, train it, evaluate it, and visualize its performance. Panel B is a screenshot of the Galaxy tool used to create a gradient boosted classifier. Panel C shows a Galaxy workflow that creates a learner using a pipeline, performs a hyperparameter search, and visualizes the results.
Galaxy-ML also provides benefits in scalability, reproducibility, and workflow development. Large machine learning analyses, such as optimizing hyperparameters and evaluating models across many different datasets, can require executing tens of thousands of analyses. Galaxy-ML uses Galaxy's workflow system to execute large-scale analyses by distributing them across one or more computing clusters and running them in parallel. Galaxy records all tools and parameters used, so all analyses, including machine learning analyses, are completely reproducible. This is critical, as reproducibility continues to gain importance in machine learning research 14. Galaxy-ML enables end-to-end machine learning analyses that begin with processing primary biological data and end with trained machine learning models that can predict phenotypic attributes such as demographics or prognosis. For instance, we provide a tutorial 15 in which Galaxy-ML is used to reproduce a study that predicts an individual's chronological age from RNA-seq data 16,17. End-to-end workflows are possible because Galaxy-ML's machine learning tools can be connected to the more than 7,800 tools available in the Galaxy ToolShed 18 for analyzing genomics, proteomics, imaging, and other kinds of biomedical data.
Galaxy-ML supports four major steps in machine learning (preprocessing, modeling, ensembling, and evaluation) by integrating six machine learning libraries (Table 1) together with additional visualization and conversion tools. Scikit-learn 11 provides the foundation for Galaxy-ML with approaches for all four major steps; most scikit-learn methods are available in Galaxy-ML. Additional libraries are included to meet key needs for machine learning in biomedicine, including feature selection, approaches for working with imbalanced datasets, and modeling approaches using gradient boosted decision trees, deep learning, and ensembling. Using Galaxy-ML, tools from all these libraries can be connected together into complete machine learning pipelines and can be stored and reused as an executable workflow. An example pipeline for using RNA-seq to predict drug response might: (1) normalize gene expression values to bring them into the same scale; (2) select genes that show the highest variance; (3) use the selected genes to create a predictive model such as logistic regression or gradient-boosted decision trees; (4) use grid search with cross-validation to optimize model hyperparameters; and (5) visualize model performance from the grid search using a heatmap or parallel coordinates plot. Galaxy-ML can be used to create thousands of different machine learning pipelines. Documentation, along with tutorials, is available at https://galaxyproject.org/community/machine-learning/, and links to the Galaxy-ML code and tool repositories are available in the Methods section.
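For illustration, the five-step example pipeline above can be expressed directly with scikit-learn's Pipeline and GridSearchCV; in Galaxy-ML the same pipeline is assembled through the web interface rather than code. This is a minimal sketch: the transformer choices, hyperparameter grid, scoring metric, and the variables X and y are illustrative assumptions, not the exact Galaxy-ML configuration.

```python
# A minimal sketch of the example pipeline described above, written directly
# in scikit-learn; Galaxy-ML builds the equivalent pipeline via its web UI.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("normalize", StandardScaler()),          # (1) bring expression values onto the same scale
    ("select", VarianceThreshold()),          # (2) keep genes above a variance cutoff
    ("model", GradientBoostingClassifier()),  # (3) gradient-boosted decision trees
])

# (4) grid search with cross-validation; the grid and F1 scoring are illustrative
param_grid = {
    "select__threshold": [0.0, 0.5, 1.0],
    "model__n_estimators": [100, 300],
    "model__learning_rate": [0.01, 0.1],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
# search.fit(X, y)        # X: samples-by-genes matrix, y: drug response labels
# (5) search.cv_results_ can then be visualized, e.g., as a heatmap
```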

Table 1. Machine learning libraries integrated into Galaxy-ML.

Software Library | Applications
Scikit-learn 11 | Various approaches for preprocessing, modeling, ensembling, and evaluation
Scikit-rebate 19 | Feature selection
Imbalanced-learn 20 | Approaches for working with imbalanced datasets

We demonstrate the utility of Galaxy-ML in three use cases: (1) extending a machine learning benchmark experiment in which 4,000 models were created and evaluated on 276 biomedical datasets 24; (2) predicting drug response in cancer cell lines from gene expression data using stacked meta-ensembles; and (3) recreating deep learning models for genomics that predict, among other attributes, the functional impact of genetic variants. The Methods section provides links to complete analysis histories and results so that all analyses can be fully reproduced on any Galaxy server with the Galaxy-ML tool suite. All analyses were performed on a public Galaxy server at https://usegalaxy.eu and are listed at https://ml.usegalaxy.eu. All workflows, data, and results can be accessed via a web browser, and analyses can be reproduced directly.
In the first use case, we used Galaxy-ML to extend an analysis of machine learning models across 276 biomedical datasets 24: 164 classification datasets and 112 regression datasets 25. The original analysis compared the performance of 13 models on the 164 classification datasets. We applied 15 models to the classification datasets and 14 models to the regression datasets, creating a total of 4,028 trained models with hyperparameters optimized using grid search. We evaluated all models using 10-fold cross-validation (CV). Because many datasets were imbalanced, F1 scoring rather than ROC AUC was used to evaluate the performance of classification models, and Pearson's R² was used to evaluate the performance of regression models. Performance of the classification models is concordant with the initial publication: (a) boosted tree models perform best overall (Figure 2a) and (b) automated hyperparameter optimization improves performance for many models (Figure 2b). Performance of the regression models is similar to that in classification, though boosted tree models only modestly outperform random tree models, and hyperparameter optimization often improves results most for models with low overall performance (see Methods section M.1.1 for more details).

For the second use case, we implemented stacked meta-ensemble predictors in Galaxy-ML for drug response in cancer cell lines using high-throughput gene expression data from RNA-seq. Because cancer cell lines serve as models for patient tumors, accurate predictions of drug response can improve understanding of cancer systems biology and potentially inform patient treatment recommendations. Gene expression and drug response data were obtained from DepMap 26. This dataset poses two key challenges: (1) there are ~50,000 gene expression features but only ~1,000 cancer cell lines and ~700 drugs, so preventing overfitting is essential, and (2) the dataset is highly imbalanced because only a small number of cell lines respond to each drug.
Using Galaxy-ML, we built a meta-ensemble as well as other learners for each drug. The meta-ensemble included a linear boosted model, a tree boosted model, and k-nearest neighbor regression, and we used principal component analysis (PCA) for dimensionality reduction in several learners. Dimensionality reduction helped address the challenge of a dataset with a very large number of features. We developed predictors for both regression and classification; labels for classification were generated by thresholding drug response values and labeling cell lines as responders or non-responders to each drug, using a cutoff of z-score < -1 for responders. Predictors were scored using average precision to address the challenge of assessing model performance on a highly imbalanced dataset, where the goal is to identify responders (true positives) amongst a very large number of non-responders. To compare regressors and classifiers, average precision for regressors was calculated using rank-ordered predictions, following past machine learning work in this space 6. We evaluated each learner using nested CV, with 5-fold CV for four repetitions for the outer splits and 5-fold CV with two repetitions for the inner splits. Our results show that stacking regressors performed best for both regression (Figure 2c) and classification (Figure 2d). Linear boosting approaches also performed very well, with results that were on par with the meta-ensembles. Successful completion of these two use cases shows that Galaxy-ML can support large and diverse machine learning experiments.
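The stacked meta-ensemble described above can be sketched with scikit-learn's StackingRegressor. This is a hedged sketch only: the component models, PCA dimensionality, hyperparameters, and the variables X and y are illustrative assumptions, not the exact Galaxy-ML configuration.

```python
# A minimal sketch of a stacked meta-ensemble of the kind described above.
# Component models, PCA size, and hyperparameters are illustrative assumptions.
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Base learners: a linear model, a boosted tree model, and k-nearest neighbor
# regression; PCA shrinks the ~50,000-gene feature space for several learners.
base_learners = [
    ("linear", make_pipeline(StandardScaler(), PCA(n_components=50), Ridge())),
    ("boosted_trees", GradientBoostingRegressor(n_estimators=200)),
    ("knn", make_pipeline(StandardScaler(), PCA(n_components=50),
                          KNeighborsRegressor(n_neighbors=10))),
]

# The meta-learner combines the base learners' out-of-fold predictions;
# StackingRegressor's internal `cv` plays the role of the inner CV splits.
stack = StackingRegressor(estimators=base_learners,
                          final_estimator=Ridge(), cv=5)

# Outer evaluation: 5-fold CV repeated 4 times, mirroring the outer splits above.
outer_cv = RepeatedKFold(n_splits=5, n_repeats=4, random_state=0)
# scores = cross_val_score(stack, X, y, cv=outer_cv)
# X: cell-line-by-gene expression matrix; y: continuous drug response values
```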
In the third use case, Galaxy-ML was used to reproduce key results from Selene 12, a deep learning toolkit for biological sequence data built on the PyTorch library. Using Galaxy-ML, we reimplemented two deep learning architectures originally implemented in Selene that model and predict regulatory elements, including transcription factor binding sites, DNase I hypersensitive sites, and histone marks. Results from these models are within 1% of those reported for Selene (Figure 2).

In summary, Galaxy-ML helps accelerate biomedical research by making machine learning more accessible, scalable, and reproducible for all biomedical scientists. Galaxy-ML's tools are completely generalizable and have applications well beyond these use cases. With Galaxy's web-based user interface, an entire machine learning pipeline, from normalization and feature selection through model definition, hyperparameter optimization, and cross-fold evaluation, can be created and run on a compute cluster using a web browser. This makes scalable and reproducible machine learning accessible to biomedical scientists regardless of their informatics skills. By leveraging the more than 7,800 analysis tools available in Galaxy, comprehensive end-to-end analyses can be performed, beginning with primary analysis of -omics, imaging, or other large biomedical datasets and continuing to downstream machine learning tools that build and evaluate predictive models from features extracted from the primary data. The website https://galaxyproject.org/community/machine-learning/ provides a hub for machine learning in Galaxy and access to all Galaxy-ML tools, workflows, and tutorials. We anticipate that this hub will serve as a community starting point to foster accessible machine learning in biomedicine.

Methods
The Galaxy tool wrappers for our machine learning suite are available at the following URLs: (1) main tools: https://github.com/bgruening/galaxytools/tree/master/tools/sklearn and (2) utilities and custom classifiers: https://github.com/goeckslab/Galaxy-ML, and the entire suite can be installed onto any Galaxy server through the Galaxy ToolShed at http://bit.ly/galaxy-ml-toolshed.
Three use cases are discussed in the parent manuscript.

M.1 Use Case 1: Penn Machine Learning Benchmark datasets

For the regression analysis, we created one Galaxy history to measure the performance of 14 regressors with default and optimized parameters on 112 datasets. Table M.1 lists this history and associated workflows. As in our classification results, the XGBoost regressor records the best performance (see Figure M.1). For example, the best-performing XGBoost regressor workflow has one step (named SearchCV) that optimizes hyperparameters such as the number of estimators, the booster type, and the maximum tree depth of the XGBoost regressor. For each hyperparameter, a range of values is specified; using grid search, all parameter combinations are tried, and performance is reported with the optimal parameter settings. All the resulting datasets from running the regression algorithms with and without parameter optimization are available from https://usegalaxy.eu/u/kumara/h/pmlbregressionanalysisjune2020. Section M.1.1 has more discussion of our regression analyses.
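For illustration, the SearchCV step described above corresponds to a grid search of the following form, written with the xgboost and scikit-learn Python APIs. This is a minimal sketch: the parameter ranges shown are illustrative assumptions, not the exact grid used in the workflow.

```python
# Minimal sketch of the SearchCV step: grid search over XGBoost regressor
# hyperparameters. The parameter ranges are illustrative assumptions.
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 500],
    "booster": ["gbtree", "gblinear"],
    "max_depth": [3, 6, 10],
}
search = GridSearchCV(XGBRegressor(), param_grid, cv=5, scoring="r2")
# search.fit(X, y)                          # X, y: features and continuous targets
# search.best_params_, search.best_score_   # optimal settings and their score
```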

M.1.1 Regression analysis: Comparison of 14 regressors on 112 Penn regression datasets
Using tools for data preprocessing, feature selection, and regression, we performed an aggregated analysis of 112 regression datasets from the Penn Machine Learning Benchmark repository. This repository contains numerous regression datasets, many of which are of biological importance. We applied 14 different regressors to the 112 datasets and performed a detailed comparison of their performance (Figure M.1). To measure the accuracy of regression models, we used the R² metric, which is common in regression analyses. This metric can take any real value up to a maximum of 1.0: a negative value indicates a poor model (worse than always predicting the mean of the targets), while a value close to 1.0 indicates good performance. We used 5-fold cross-validation for training and repeated it for 10 experiment runs to compute a mean R² score for each dataset. We achieved an R² score of more than 0.80 for three regressors (XGBoost, gradient boosting, and extra trees) and close to 0.80 for two regressors (bagging and random forest).

We also computed the average running time of each regressor over all 112 datasets. The running time of a regressor on a dataset is the sum of the training and validation times for the best regression model. Regressors such as XGBoost, gradient boosting, and extra trees achieve an R² score above 0.80, but the extra trees regressor requires significantly more time to finish than the other two. Regressors such as linear regression, Huber, and elastic net are fast, but their accuracy is low. Decision tree and extra tree regressors are also fast, and their accuracy (R² > 0.7) is better than that of the linear regressors.
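As a quick reference, the R² metric and the repeated cross-validation protocol described above can be computed as follows. This is a minimal sketch: the toy values and the choice of model are illustrative, not taken from the benchmark.

```python
# Minimal sketch of the R^2 metric and the repeated-CV protocol above.
# Toy data and the choice of model are illustrative assumptions.
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.ensemble import ExtraTreesRegressor

# R^2 = 1 - SS_res / SS_tot, computed here by hand and via scikit-learn
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4])
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(1 - ss_res / ss_tot, r2_score(y_true, y_pred))  # both 0.985

# Mean R^2 from 5-fold CV repeated over 10 runs, as described above
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
# scores = cross_val_score(ExtraTreesRegressor(), X, y, scoring="r2", cv=cv)
# scores.mean()
```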

M.2 Use Case 2: DepMap Cancer Cell Lines
In our second analysis, we analyzed cancer cell line gene expression and drug response datasets from the Cancer Dependency Map Project 26 (https://depmap.org/). This dataset includes more than 50,000 gene expression values for over 1,000 cancer cell lines obtained from bulk RNA-seq, as well as drug response data for 265 drugs. Both the gene expression data and the drug response targets are continuous. We had several goals in mind when performing this analysis. We wanted to assess how well supervised learning performed on a dataset with a very large number of features and a relatively small and imbalanced number of examples; these challenges are common when machine learning is applied to molecular datasets. We also wanted to compare the performance of meta-ensemble (stacking) approaches with traditional single-model methods.
• Galaxy History URLs:
  ○ Regression: https://usegalaxy.eu/u/qiang_gu/h/depmap-regression
    ■ Example workflow: https://usegalaxy.eu/u/kumara/w/stackingensembleregressorpcaknr
  ○ Classification: https://usegalaxy.eu/u/qiang_gu/h/depmap-classification
    ■ Example workflow: https://usegalaxy.eu/u/kumara/w/stackingclassifierdrugprna2

Because the target values for this analysis, cell line drug response data, were continuous, we developed a strategy to binarize the data so that classification approaches could be used. Drug response values were standardized (z-scored), and:
• cell lines with a standardized value of less than -1 were labeled responders;
• cell lines with a standardized value between -1 and 0 were labeled indeterminate;
• cell lines with a standardized value greater than 0 were labeled non-responders.

Table: Performance results obtained using Galaxy-ML models fully trained using GPUs and Selene models. All datasets used were obtained from Selene. AUPRC is the area under the precision-recall curve, also known as the average precision. "N.R." means that the models did not report this information.
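A minimal sketch of this binarization strategy, assuming drug response values for one drug are held in a NumPy array (the toy values and variable names are illustrative):

```python
# Minimal sketch of the z-score binarization strategy described above.
import numpy as np

# Toy drug response values for one drug across cell lines (illustrative)
response = np.array([-2.1, -0.4, 0.3, 1.2, -1.5, 0.8])

# Standardize (z-score) the responses
z = (response - response.mean()) / response.std()

# Apply the thresholds from the text:
#   z < -1        -> responder
#   -1 <= z <= 0  -> indeterminate
#   z > 0         -> non-responder
labels = np.where(z < -1, "responder",
                  np.where(z <= 0, "indeterminate", "non-responder"))
print(list(zip(np.round(z, 2), labels)))
```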