Abstract
The easyml (easy machine learning) package lowers the barrier to entry to machine learning and is ideal for undergraduate and graduate students and practitioners who want to quickly apply machine learning algorithms to their research without having to worry about the best practices for implementing each algorithm. The package provides standardized recipes for regression and classification algorithms in R and Python and implements them in a functional, modular, and extensible framework. The package currently implements recipes for several common machine learning algorithms (e.g., penalized linear models, random forests, and support vector machines) and provides a unified interface to each one. Importantly, users can run and evaluate each machine learning algorithm with a single line of code. Each recipe is robust, implements best practices specific to each algorithm, and generates a report with details about the model, its performance, and journal-quality visualizations. The package's functional, modular, and extensible framework also allows researchers and more advanced users to easily implement new recipes for other algorithms.
1. Introduction
Numerous machine learning libraries (or packages) are becoming available in popular programming languages, especially R (R Core Team, 2016) and Python (Rossum, 1995). Both are high-level, interpreted languages that employ functional and object-oriented paradigms, and both have a wide ecosystem of mature machine learning libraries. However, existing machine learning libraries assume that the user has a solid understanding of statistics and machine learning principles and best practices, strong programming skills, and the knowledge of how to apply this skillset to their problem. Oftentimes this is not the case. Individuals without a strong technical background increasingly want to apply machine learning techniques to their research without having to spend years studying mathematics, statistics, and/or computer science, and there is a critical need to lower the barrier to machine learning or computational approaches in general (Ahn and Busemeyer, 2016; Ahn et al., 2017). The easyml package targets these individuals and aims to lower the barrier to entry to machine learning by providing user-friendly recipes for common machine learning algorithms.
These recipes leverage R and Python's programming capabilities and their existing machine learning libraries, and break down each analysis into steps common to all algorithms and steps unique to each algorithm. These steps are abstracted from the user by a common unified framework. In this sense, machine learning is like baking a cake: whether one wants to bake a chocolate cake or a vanilla cake, one still needs eggs, flour, and butter as the core ingredients. Mix them one way and add chocolate, and a chocolate cake is baked; mix them another way and add vanilla, and a vanilla cake is baked. Analogously, though one may run similar steps to build and evaluate a penalized linear model (Friedman et al., 2010; Simon et al., 2011) and a random forest model (Breiman, 2001), one will wish to assess the coefficients of the linear model and the variable importances of the random forest model. easyml (easy machine learning) makes this easy by handling the best practices for each algorithm while still allowing an advanced user the flexibility to customize each recipe.
2. Project Vision
Maintenance
This package is maintained by Paul Hendricks and Woo-Young Ahn.
Availability
The easyml source code is available under the MIT license and hosted on GitHub (https://github.com/CCS-Lab/easyml).
Standardized recipes
The package provides standardized recipes for regression and classification machine learning algorithms in R and Python (see Table 1). Specifically, easyml provides recipes and a unified interface for some widely used machine learning algorithms, including penalized regression models, random forests, support vector machines (Cortes and Vapnik, 1995), the Group-Lasso interaction model (Lim and Hastie, 2015), and (deep) neural network models. More advanced users will find it easy to implement new recipes for other algorithms. To implement the algorithms, we use other R and Python packages, including glmnet (Friedman et al., 2010), randomForest (Liaw and Wiener, 2002), e1071 (Meyer et al., 2017), glinternet (Lim and Hastie, 2015), nnet (Venables and Ripley, 2002), darch (Drees, 2013), and scikit-learn (Pedregosa et al., 2011). We also plan to add more algorithms in the future.
Journal-quality visualizations
Users will find that easyml immediately produces journal-quality visualizations. These visualizations can easily be modified, if needed, and used directly in research papers or presentations. See Section 4 for an example.
Functional, modular, and extensible framework
The package’s functional, modular, and extensible framework also allows researchers and more advanced users to implement new recipes for other algorithms. An example of how to implement a new algorithm is demonstrated in Section 5.
Parallelization
The nature of machine learning often lends itself to highly parallelizable code. easyml makes it possible to run all recipes leveraging as many CPUs as are available. Simply specify the n_core parameter in the interface and easyml will parallelize the analysis across that number of cores.
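A minimal sketch of this interface, where the data frame df and the outcome name "y" are placeholders for the user's own data:

```r
library(easyml)
library(parallel)

# Illustrative only: df and "y" stand in for the user's data and outcome;
# n_core spreads the resampling loops across all available cores
results <- easy_glmnet(.data = df, dependent_variable = "y",
                       n_core = detectCores())
```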
Code quality control
easyml uses software engineering best practices such as Continuous Integration (CI) to check the build of the package, unit testing and code coverage to check the quality of the code, and linting to ensure adherence to a common style. As of this writing, all builds and tests pass on Ubuntu 14.04 and Mac OS X, and the code coverage is above 85%. The project is also hosted on GitHub (https://github.com/CCS-Lab/easyml) and is available to users who want to examine the source code, contribute to the code base, or provide feedback and report potential bugs via GitHub issues.
Documentation
easyml provides exhaustive documentation and examples for both R and Python. Users interested in the R package can find documentation here: http://ccs-lab.github.io/easyml/R. Users interested in the Python package can find documentation here: http://easyml.readthedocs.io.
3. Recipes
easyml uses standardized recipes for regression and classification machine learning algorithms in R and Python. These recipes can be broken down into multiple steps and are useful for interpreting models (e.g., estimating coefficients and variable importances) and for estimating in-sample and out-of-sample performance (e.g., predictions and measures of goodness of fit). For each recipe, we describe our motivation for including it, break down its steps, and list the algorithms for which it is implemented.
Coefficients
Linear models are powerful due to their simplicity, robustness, and interpretability of variables. However, sometimes the estimated coefficients for linear models are different after each run, even with the same random state. This can be due to the low-level code not setting the random state at the C/Fortran level or due to the stochastic nature of the algorithm or optimizer. This phenomenon makes it difficult to interpret a coefficient after building the model only once. To account for this intrinsic randomness and ensure the final coefficients returned are robust estimators, we generate the coefficients n_samples times using k-fold cross validation, where n_samples = 1000 and k = 10 are set as the defaults, and then calculate the mean and standard deviation of the estimated coefficients. We have applied and validated this protocol in previous studies (Ahn et al., 2014; Ahn and Vassileva, 2016; Ahn et al., 2016; Vilares et al., 2017). The ability to generate beta coefficients is currently implemented only for the penalized linear model algorithm (easy_glmnet). See Algorithm 1.
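A conceptual sketch of this recipe, assuming glmnet for the penalized fit; the function and variable names below are ours for illustration, not the package's internals:

```r
library(glmnet)

# Replicate the cross-validated fit n_samples times and summarize the
# estimated coefficients with their mean and standard deviation
replicate_coefficients <- function(X, y, n_samples = 1000, k = 10) {
  coefs <- replicate(n_samples, {
    # Fit a penalized linear model, choosing lambda by k-fold cross validation
    cv_fit <- cv.glmnet(as.matrix(X), y, nfolds = k)
    as.numeric(coef(cv_fit, s = "lambda.min"))
  })
  list(mean = rowMeans(coefs), sd = apply(coefs, 1, sd))
}
```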
Variable Importances
Ensemble models are powerful due to their simplicity, their ability to capture non-linear patterns of features from the data, and, like linear models, their interpretability of variables. As with linear models, we wish to calculate and visualize the importances of the variables as part of our machine learning protocol. Like linear models, ensemble models often have inherent sources of randomness. For example, the random forest algorithm bootstraps the data randomly and randomly selects a subset of predictors to use in each decision tree. Interpretability heuristics such as variable importance scores can often differ from one random state to another. To ensure the resulting variable importances are robust, we run the random forest algorithm n_samples times, where n_samples = 1000 is set as the default, and then calculate the mean and standard deviation of the estimated importances. The ability to generate variable importances is currently implemented for the random forest algorithm (easy_random_forest), which was used in our recent paper (Haines et al., in preparation). See Algorithm 2.
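A conceptual sketch of this recipe using the randomForest package; again, the names here are illustrative rather than the package's internals:

```r
library(randomForest)

# Refit the forest n_samples times and summarize the importance scores
replicate_importances <- function(X, y, n_samples = 1000) {
  imps <- replicate(n_samples, {
    fit <- randomForest(X, y)
    importance(fit)[, 1]  # one importance score per predictor
  })
  list(mean = rowMeans(imps), sd = apply(imps, 1, sd))
}
```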
Predictions
We often wish to visualize our predictions, whether it's a plot of actual against predicted values or a plot of the area under the curve (AUC) of a receiver operating characteristic (ROC) curve. Models sometimes produce random, albeit small, deviations in coefficients or weights, and these deviations can propagate to our predictions. To guard against this intrinsic error, we train a model using k-fold cross validation within the training set (k = 10 set as the default) and generate predictions n_samples times for a particular train-test split (separately on the training and test sets). Then we average the predictions across the n_samples iterations. By default, the training and test sets are 67% and 33% of the whole dataset; these proportions can be adjusted (e.g., train_size = 0.67). The ability to generate predictions is currently implemented for all algorithms. See Algorithm 3 (note that nrow(Xz) indicates the number of rows in Xz).
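In outline, the recipe looks like the sketch below, where fit_model and predict_model stand in for any algorithm's fit and predict steps:

```r
# Average predictions over n_samples refits on a fixed train-test split
replicate_predictions <- function(fit_model, predict_model,
                                  X_train, y_train, X_test,
                                  n_samples = 1000) {
  train_preds <- replicate(n_samples,
                           predict_model(fit_model(X_train, y_train), X_train))
  test_preds  <- replicate(n_samples,
                           predict_model(fit_model(X_train, y_train), X_test))
  list(train = rowMeans(train_preds), test = rowMeans(test_preds))
}
```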
Model performance
Often we wish to visualize model performance representing the quality (i.e., accuracy) of our predictions, whether it’s a plot of mean squared errors, correlation coefficients, or AUCs. We can guard against intrinsic errors by replicating predictions many times for a particular train-test split, averaging the predictions across n_iterations, generating a model performance metric, and replicating for many (n_divisions) different train-test splits. The reader is referred to the Algorithm 4 box for more details. The ability to generate model performance is currently implemented for all algorithms.
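The sketch below illustrates the idea; fit_predict and metric (e.g., an AUC or mean-squared-error function) are placeholders for the algorithm-specific pieces:

```r
# One performance value per train-test split, each based on averaged predictions
replicate_metrics <- function(fit_predict, metric, X, y,
                              n_divisions = 1000, n_iterations = 10,
                              train_size = 0.67) {
  sapply(seq_len(n_divisions), function(i) {
    # Draw a new random train-test split for each division
    idx <- sample(nrow(X), size = floor(train_size * nrow(X)))
    # Average predictions over n_iterations to reduce intrinsic error
    preds <- rowMeans(replicate(n_iterations,
                                fit_predict(X[idx, ], y[idx], X[-idx, ])))
    metric(y[-idx], preds)
  })
}
```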
4. Example
This example demonstrates how to use easyml in R. For further examples on how to use easyml in R, please see the documentation at http://ccs-lab.github.io/easyml/R. For further examples on how to use easyml in Python, please see the documentation at http://easyml.readthedocs.io. In this example, we will use easyml to replicate findings reported in Ahn et al. (2016) where a penalized logistic regression was used to identify multivariate patterns of behavioral measures that can classify individuals with cocaine dependence. To use easyml in R, we must first install the easyml library.
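For instance, the development version can be installed from GitHub; this sketch assumes the devtools package and that the R package sits in the repository's R/ subdirectory:

```r
install.packages("devtools")
devtools::install_github("CCS-Lab/easyml", subdir = "R")  # subdir is an assumption
```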
Next, let’s load the package and the data set.
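Assuming the cocaine data ships with the package under the name used in Appendix A:

```r
library(easyml)
data("cocaine_dependence", package = "easyml")  # dataset name from Appendix A
```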
Finally, we pass in the following input arguments to run the analysis (a complete call is sketched after this list):
.data, the data set to be analyzed (an n × m matrix, where n is the number of samples and m is the number of features). At this time, it should contain no missing data.
dependent_variable, the name of the dependent variable, which is an n by 1 vector. In the cocaine data, diagnosis (0=healthy control, 1=cocaine user) is the dependent variable.
family, the family of regression; the choices are “gaussian” and “binomial”. Since we are modeling a binary dependent variable, we select “binomial”.
preprocess, the preprocessing function to use on the data. We choose the preprocess_scale function so as to scale (z-score) any continuous variable across samples (full data set or train/test data sets) before training a model.
exclude_variables, which variables, if any, should be excluded from the analysis. If there is more than one variable, use the function c() (e.g., exclude_variables = c("subject", "edu_yrs")).
categorical_variables, which variables are categorical, and thus need to be specially handled during preprocessing. Note that categorical variables will not be normalized. If there is more than one variable, use the function c().
random_state, the seed to use for the random state.
model_args, the list of arguments specific to penalized linear models. See ?glmnet::glmnet.
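Putting it together, a call might look like the following; the categorical variable name and the model_args values are illustrative:

```r
results <- easy_glmnet(
  .data = cocaine_dependence,
  dependent_variable = "diagnosis",
  family = "binomial",
  preprocess = preprocess_scale,
  exclude_variables = c("subject", "edu_yrs"),
  categorical_variables = c("male"),          # hypothetical column name
  random_state = 12345,
  model_args = list(alpha = 1, nlambda = 200) # passed through to glmnet
)
```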
That’s it! Now let’s examine the results. Each algorithm returns a list with objects for various functions, data structures, and plot objects that are instrumental to the analysis. Calling the names function on the variable results will show all the slots available to us.
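For example:

```r
names(results)
```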
ggplot2 objects can be accessed via the $ operator. For example, to examine the predictions for the train data set, we can call the following (the slot names here follow the plot_* pattern used throughout this section; see Figure 1):
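```r
# Slot name assumed from the package's plot_* naming convention
results$plot_predictions_train
```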
And we can examine the out-of-sample predictions for the test data set (see Figure 2).
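```r
results$plot_predictions_test  # slot name assumed, as above
```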
We can also examine the plot of the estimated beta coefficients by calling results$plot_coefficients. By default, beta coefficients and variable importances are sorted so that features are shown in the order of their magnitudes. See Figure 3, where the coefficient means are represented by the dots and the error bars represent the standard deviations.
We can reproduce almost the same plot that appears in Ahn et al. (2016). These are true ggplot2 objects and can be modified however needed. The script below shows how we can modify Figure 3 by adding a title, labels, etc.:
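```r
library(ggplot2)

# The specific title and axis labels below are illustrative
results$plot_coefficients +
  ggtitle("Penalized logistic regression on the cocaine data") +
  labs(x = "Predictor", y = "Mean coefficient estimate")
```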
As seen in Figure 4, by calling results$plot_model_performance_train, we can also examine the in-sample model performance generated for the train data set, which is the distribution of the AUCs of the ROC curves over 1,000 repetitions,
as well as the out-of-sample model performance generated for the test data set by calling results$plot_model_performance_test (see Figure 5).
The other algorithms are just as easily run and follow the same structured interface, making it easy to quickly switch to other algorithms with very few modifications to the parameters. For example, to run a random forest model, one would run something like the following (argument values carried over from the example above):
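```r
# Same shared interface; only the function (and any model_args) change
results <- easy_random_forest(
  .data = cocaine_dependence,
  dependent_variable = "diagnosis",
  family = "binomial",
  preprocess = preprocess_scale,
  exclude_variables = c("subject", "edu_yrs"),
  categorical_variables = c("male"),
  random_state = 12345
)
```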
Or, for a support vector machine model, one would run something like:
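```r
# The easy_* function name for SVMs is assumed from the package's naming convention
results <- easy_support_vector_machine(
  .data = cocaine_dependence,
  dependent_variable = "diagnosis",
  family = "binomial",
  preprocess = preprocess_scale,
  exclude_variables = c("subject", "edu_yrs"),
  categorical_variables = c("male"),
  random_state = 12345
)
```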
5. Implementing a New Algorithm
While penalized regression models, random forests, and support vector machines are among the most popular algorithms, an advanced user may wish to add an algorithm implemented elsewhere to easyml, or perhaps even write their own algorithm. easyml makes this easy by allowing users to write wrapper functions that provide a common interface to those algorithms and to pass these wrappers into easyml functions. In Appendix A, we provide an example where we wrap an algorithm that averages over several neural networks. The reader is referred to the caret (Kuhn, 2016) documentation for details on the caret::avNNet function.
6. Comparison to Similar Toolkits and Frameworks
R and Python both have a wide ecosystem of machine learning toolkits. caret and mlr (Bischl et al., 2016) are perhaps the most similar packages to easyml in R, while scikit-learn (Pedregosa et al., 2011) is perhaps the most similar package to easyml in Python. These packages contain algorithms for regression and classification tasks and tools for preprocessing and model interpretation, and all focus on lowering the barrier to entry to machine learning for non-experts. While these packages provide users the flexibility and tools to develop modeling techniques, easyml extends the process by allowing users to apply standardized recipes for common machine learning techniques and produce journal-quality visualizations, all in a single line of code.
7. Conclusions and Outlook
In conclusion, the easyml package fills a specialized niche and further lowers the barrier to entry to machine learning. Practitioners have immediate access to powerful machine learning algorithms in a single line of code in R or Python, without worrying about their implementation or the best practices for each algorithm. Researchers with strong programming skills can leverage the easyml library to provide customized extensions quickly. This is still a beta version (v0.1.0 as of May 2017) and users are encouraged to make suggestions or report any bugs by posting new issues to the GitHub repository (https://github.com/CCS-Lab/easyml). Next steps for easyml are likely to include more algorithms and additional recipes and convenience functions to further lower the barrier to entry to machine learning. We also plan to make it easy to use easyml on neuroimaging data. Specifically, we will allow users to apply a machine learning algorithm to functional magnetic resonance imaging (fMRI) data and produce journal-quality brain maps in a single line of code.
8. Acknowledgement
We thank Nathaniel Haines for his feedback on various parts of the code and his help with the random forest algorithm.
9. Author contributions
W.-Y.A. conceived the project. P.H. and W.-Y.A. wrote the code and designed and built the package.
Appendix A.
Below we demonstrate an ordinary way of fitting the avNNet model of the caret package on the cocaine dataset in R. We see that while we can build this model relatively easily, it takes some extra work, such as removing the first and second columns from the cocaine_dependence dataset. Furthermore, to evaluate this model multiple times with customized train-test splits, preprocess the data, and visualize outputs, users need to write many additional lines of code.
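A sketch of this ordinary approach; the column positions and tuning values below are illustrative:

```r
library(caret)

# Drop the non-predictor columns (e.g., subject ID and the outcome) by position
X <- cocaine_dependence[, -c(1, 2)]
y <- factor(cocaine_dependence$diagnosis)

# Fit an ensemble of neural networks averaged over random seeds
fit <- avNNet(X, y, size = 5, repeats = 10, trace = FALSE)
preds <- predict(fit, newdata = X, type = "class")
```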
Here we demonstrate the easyml way of using this model (see below how we wrapped the algorithm into easyml). We see that with very few lines of code, we can enjoy all the features and benefits of easyml. For example, users can examine model performance by calling b$plot_predictions_train_mean and b$plot_metrics_test_mean, etc.
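A hypothetical sketch of this usage, assuming we have wrapped the algorithm as easy_avNNet (defined next):

```r
# easy_avNNet is the hypothetical wrapper shown below
b <- easy_avNNet(
  .data = cocaine_dependence,
  dependent_variable = "diagnosis",
  family = "binomial",
  preprocess = preprocess_scale,
  exclude_variables = c("subject"),
  random_state = 12345
)
b$plot_predictions_train_mean  # slot names as mentioned above
b$plot_metrics_test_mean
```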
Here we show how we wrapped the avNNet algorithm into easyml.
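A minimal sketch of the wrapper pair; the fit/predict hook names and the way easyml consumes them are assumptions based on the framework described in Section 5:

```r
# Fit hook: give caret::avNNet the common (X, y) interface easyml expects
avnnet_fit_model <- function(X, y, ...) {
  caret::avNNet(X, y, size = 5, repeats = 10, trace = FALSE, ...)
}

# Predict hook: return a numeric vector of predictions for new data
avnnet_predict_model <- function(fit, X) {
  predict(fit, newdata = X, type = "raw")
}
```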