## Abstract

**Background** Feature selection is becoming increasingly important in machine learning as users demand more interpretation and post-hoc analyses. However, existing feature selection techniques in random forests are computationally intensive and carry high RAM requirements, often demanding specialized (uncommon) high-throughput infrastructures. In bioinformatics, random forest classifiers are widely used as they are flexible, self-regulating, and self-contained machine learning algorithms that are robust to the “*predictors (features) P* ≫ *subjects N*” problem with minimal tuning parameters. The current feature selection options are used extensively for biomarker detection and discovery; however, they are limited to variants of permutation tests or heuristic rankings with arbitrary cutoffs. In this work, we propose a novel paradigm for feature selection in random forests based on a binomial framework, *binomialRF*, which is designed to produce significance measures devoid of expensive and uninterpretable permutation tests. Furthermore, it offers a highly flexible framework for efficiently identifying and ranking multi-way interactions via dynamic tree programming.

**Methods** We propose a novel and scalable feature selection technique that exploits the tree structure in the random forest, treats each tree as a binomial stochastic process, and assesses feature significance by conducting a one-sided exact binomial test of whether a feature was selected more often than expected by random chance. Since each tree is an independent and identically distributed random sample in a binomial process (from the perspective of choosing a splitting variable in the root node), the test statistic is constructed from the frequency with which a feature is selected (each tree is a Bernoulli trial for selecting a feature), and the features are then ranked based on the observed test statistics, their nominal p-values, and their multiplicity-adjusted q-values. Furthermore, the binomialRF framework provides a general selection framework to identify 2-way, 3-way, and *K*-way interactions by generalizing the test statistic to count sub-trees in the random forest using dynamic tree programming.

**Results** In simulation studies, the binomialRF algorithm performs competitively with respect to the state of the art in terms of classification accuracy, true model coverage, and control of false selection when identifying main effects, while attaining substantial computational performance gains (between 30 and 600 times faster than the state of the art in high-dimensional settings). In addition, extending binomialRF using model averaging identified the true model with greater average accuracy (>20% improvement in reducing false positive feature selection at high dimensions while maximizing true model coverage) and attained greater classification accuracy (a 4-9% improvement over all competing techniques) without sacrificing computational speed (2^{nd} fastest performance after binomialRF). The framework also scales and extends easily to identifying 2-way and *K*-way interactions (i) without additional memory requirements (only the original predictor matrix need be stored), and (ii) with minimal additional computational cost due to efficient dynamic-tree-programming interaction searches. The algorithm was validated in a case study predicting bronchospasm-related hospitalization from blood transcriptomes, where binomialRF correctly identified the previously published relevant physiological pathways, presented comparable classification accuracy in a validation set, and extended previous work in this area by examining pathway-pathway interactions.

**Conclusion** The proposed binomialRF is a novel and efficient feature selection method, devoid of permutation tests, that scales linearly in the number of trees with minimal computational complexity, thus outperforming conventional alternatives from a computational perspective while attaining competitive model selection and classification accuracies and enabling computations on common, cost-effective high-throughput infrastructures. Furthermore, the binomialRF model averaging framework greatly improves the accuracy of feature selection, controlling for false selection and substantially improving model and classification accuracy. Validated in numerical studies and retrospectively in a clinical trial (case study), the binomialRF paradigm offers a binomial framework to detect feature significance and extends easily to searching for *K*-way interactions in a linear fashion, reducing a known non-polynomial-time exploration to linear approximations.

The binomialRF R package is freely available on GitHub and has been submitted to BioConductor, with all associated documentation and help files.

Github: https://github.com/SamirRachidZaim/binomialRF

BioConductor: binomialRF

## 1 Introduction

Recent advances in machine learning and data science tools have led to a revamped effort for improving clinical decision-making anchored in genomic data analysis and biomarker detection. However, despite these novel advances, random forests (RFs) [1] remain a widely popular machine learning algorithm choice in genomics given their ability to i) accurately predict phenotypes using genomic data and ii) identify relevant genes and gene products used to predict the phenotype. Literature over the past twenty years has demonstrated [2-9] their wide success in being able to robustly handle the “P ≫ *N*” issue where there are more predictors or features “*P*” (i.e., genes) than there are human subjects “*N*” while maintaining competitive predictive and gene selection abilities.

However, the translational utility of random forests has not been fully realized, as they are often viewed as “black box” algorithms by physicians and geneticists. Therefore, a substantial effort over the past decade has focused on “feature selection” in random forests [5, 6, 10-14] in order to provide better explanatory power for these models and to identify important genes and gene products in classification models. **Table 1** presents the two major classes of feature selection techniques: i) permutation-type measures of importance and ii) heuristic rankings without formal decision boundaries (i.e., no p-values); the second and third columns in **Table 1** indicate where each method falls. These feature selection techniques have been widely used in biomarker discovery by the bioinformatics community as an alternative to classical statistical techniques and have shown promising results in identifying single gene products. However, these techniques do not scale easily, computationally or memory-wise, to identifying molecular interactions, limiting their translational utility in precision medicine. To meet this need, we propose the *binomialRF* feature selection algorithm, a wrapper feature selection algorithm that identifies significant genes and gene sets in a memory-efficient and scalable fashion. Building on the “inclusion frequency” [15] heuristic feature ranking, binomialRF formalizes this concept into a binomial probabilistic framework with p-value-based significance measures and extends it to identify gene set interactions of any size. The main algorithm is presented in Section 2, while the extension to identify interactions is presented in Section 3. Theoretical computational complexity is presented in Section 4, while applications in numerical analyses and case studies evaluate its utility in Sections 5 and 6. The discussion, limitations, and concluding sections are presented in Sections 7-9.

## 2 binomialRF: Identifying main effects

### 2.1 Problem Set-up and Notation

The binomialRF algorithm is a feature selection algorithm designed to develop classifiers in genomics datasets. Datasets are denoted ***X*** and are comprised of *N* subjects (usually < 1,000) and *P* genes in the genome (usually *P* > 20,000 expressed genes). Genomics data represent the traditional “high-dimensional” setting where there are many more features than there are samples (e.g., “*P* ≫ *N*”). In binary classification settings, the outcome variable, *Y*, differentiates the case and control groups (i.e., “healthy” vs. “tumor” tissue samples).

In linear and generalized linear models (hereinafter termed “linear models”), there is a major assumption of linearity (in coefficients) in which either *E*[*Y*|*X*] = *Xβ* or *E*[*Y*|*X*] = *g*(*Xβ*), where the *E* operator denotes the “expected value”, *g* is a link function (e.g., logistic or probit for binomial regression), and *β* is a coefficient vector. Linearity assumptions are often inflexible and many times unjustified, resulting in suboptimal classification performance. Furthermore, linear models offer restrictive results when performing feature selection in the presence of high-dimensional data (“*P* ≫ *N*”) since linear models saturate at *P* = *N*, rendering powerful algorithms such as the LASSO [16] and elastic net [17] ineffective as they cannot identify additional features (genes) after saturation. In other words, if a genomics classifier requires a 1,000-gene signature to effectively predict classes, but there are only 100 subjects in the study, the LASSO and elastic net will at most select 100 genes out of the 1,000-gene signature, effectively missing the remaining 900. Similar limitations occur for other penalized regression strategies like the group LASSO [18].

These two major limitations led many bioinformaticians to consider powerful, model-free machine learning algorithms such as random forests (RF) [1] to analyze genomics datasets for developing classifiers and identifying gene-product biomarkers. Trees and random forests require minimal assumptions and can robustly handle high-dimensional data since they do not require the full column rank and other matrix regularity conditions of linear models. RFs are collections of randomized decision trees, where for each decision tree, *T*_{z}, in the RF, a subset of the data and of the features is selected. This randomization encourages a diverse set of trees and allows each individual tree to make predictions across a variety of features and subjects. The induced diversity in the RF model strengthens the overall classifier and mitigates overfitting. Specifically, each tree only sees *m* < *P* features in the root when it determines the first optimal feature for splitting the data into two subgroups. Alternatively, one can re-parameterize the feature subsampling with a parameter *s* ∈ (0,1), which determines what percentage of *P* is seen by each tree. Let *F*_{j,z} denote the random variable measuring whether *X*_{j} is selected as the splitting variable for the root at tree *T*_{z}:

$$F_{j,z} = \begin{cases} 1, & \text{if } X_{optimal} = X_j \text{ in the root of } T_z \\ 0, & \text{otherwise} \end{cases} \quad \text{(Equation 1)}$$

This results in *F*_{j,z} being a Bernoulli random variable, *F*_{j,z} ∼ *Bern*(*p*_{root}), and the sum $F_j = \sum_{z=1}^{V} F_{j,z}$ being a Binomial random variable across all *V* trees, where *p*_{root} is the probability of randomly selecting a feature *X*_{j} as the optimal splitting variable in the root of tree *T*_{z}. Under the null, *p*_{root} is constant across all trees. **Table 2** details the notation that will be used throughout this study.
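This Bernoulli/Binomial view can be checked with a short simulation (an illustrative sketch, not package code; *V*, *m*, and *P* are toy values): each simulated tree draws *m* of *P* features without replacement and, under the null, picks its root split uniformly among them, so each feature's count should concentrate around *V*/*P*.

```python
import random

random.seed(42)

P = 10        # total number of features
m = 5         # features subsampled at each root
V = 20000     # number of trees (Bernoulli trials)

F = [0] * P   # F[j] = number of trees whose root splits on feature j
for _ in range(V):
    subsample = random.sample(range(P), m)    # draw without replacement
    optimal = random.choice(subsample)        # null: uniform among the m
    F[optimal] += 1

# Under H0, F_j ~ Binomial(V, 1/P), so E[F_j] = V / P = 2000
print(F)
```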

### 2.2 High-level overview and background

The proposed binomialRF R package is software that implements a novel feature selection technique for random forests, wrapping around the existing randomForest R package [19]. Since random forests (RFs) use bootstrap sampling, feature subsetting, and aggregation to create a robust ensemble classifier from individual weak learners, the algorithm essentially averages results from a set of trees to form predictions and provide heuristic feature importance rankings. With a minimal number of tuning parameters and hyperparameters, RFs are essentially a self-regulating and self-sufficient machine learning algorithm. The binomialRF algorithm builds upon the random forest algorithm and exploits its binary-split tree structure to select important features. It treats each tree in the RF as a stochastic binomial process and develops a hypothesis-test-based feature selection criterion for random forests, resulting in a rigorous p-value ranking for feature selection. A number of existing feature selection algorithms in random forests (see **Table 1**) measure “variable importance” based on permutation tests or heuristic rankings. In this work, we provide an alternative measure of importance devoid of permutations and heuristic rankings by introducing a binomial hypothesis test into the feature selection process. To the best of our knowledge, this is the first feature selection algorithm in random forests that implements a binomial framework, can provide a relative significance measure without using permutation tests, and naturally scales to screen for interactions in a fast and memory-efficient way. We explain the components of the binomialRF algorithm in the subsequent sections.


### 2.3 Optimal splitting variables and decision trees

Consider a decision tree, *T*_{z}, in a random forest. At the top-most “root” node, *m* features are randomly subsampled from the entire space of *P* features. These *m* features are all tested as possible splitting variables for the tree, and the optimal splitting variable, *X*_{optimal}, is selected as the feature *X*_{j} that best separates the two classes and provides the best information gain. Formally, this is stated in **Equation 2**:

$$X_{optimal} = \underset{X_j \in M_z}{\arg\max}\; IG(X_j) \quad \text{(Equation 2)}$$

where *M*_{z} is the set of *m* features subsampled at the root of *T*_{z} and *IG*(*X*_{j}) denotes the information gain from splitting on *X*_{j}.

Starting from the root, each node either selects its *X*_{optimal} or becomes a terminal node as seen in **Figure 1**, where in the root, *X*_{optimal} = *X*_{1} and in the subsequent right daughter nodes, *X*_{optimal} = *X*_{2}, *X*_{3}.

If we focus solely on the root, then under a null hypothesis, each feature has the same probability of being selected as the optimal root splitting feature, denoted by *p*_{root} = Pr(*X*_{optimal} = *X*_{j}) ∀ *j* ∈ {1, …, *P*}. The random variable *F*_{j,z} (shown in **Equation 1**) is an indicator variable measuring whether *X*_{j} is selected as the optimal variable for the root at tree *T*_{z}. *F*_{j,z} is a Bernoulli random variable, *F*_{j,z} ∼ *Bern*(*p*_{root}). Summing across all the trees in the random forest, $F_j = \sum_{z=1}^{V} F_{j,z}$ is a Binomial random variable across all *V* trees, where *p*_{root} is the probability of randomly selecting a feature *X*_{j} as the optimal splitting variable in the root of tree *T*_{z}.

### 2.4 Calculating *p*_{root} and the binomial exact test for significance

Let *T*_{z} represent an individual tree grown in a random forest. Then, ***RF*** = (*T*_{1}, …, *T*_{V}) denotes the random forest (i.e., the collection of independently and identically distributed trees). If at each *T*_{z}, *m* < *P* features are subsampled to reduce the feature subspace at each tree, then the probability, *p*_{root}, of *X*_{j} being selected by a tree, *T*_{z}, is shown in **Equation 3**:

$$p_{root} = \frac{1}{m}\left(1 - \prod_{i=0}^{m-1}\frac{P-1-i}{P-i}\right) \quad \text{(Equation 3)}$$

Since features are selected without replacement in the subsampling process, $\prod_{i=0}^{m-1}\frac{P-1-i}{P-i}$ is the probability of not selecting *X*_{j} into the subsample, and using the complement rule, P(A) = 1 − P(¬A), gives the probability that *X*_{j} is available in the root; under the null, each of the *m* subsampled features is then equally likely to be optimal. Using Equation 3, we can provide a formal measure of significance (i.e., a p-value) regarding whether *X*_{j} was selected more often than expected by chance by determining if its frequency of selection exceeds a statistical threshold. If we are concerned with only conducting a single hypothesis test about predictor *X*_{j} using a significance level α, then we conclude that *X*_{j} is chosen more often than by random chance if *F*_{j} exceeds the critical value *Q*_{α,V,p}, resulting in a one-sided hypothesis test (shown in **Equations 4a-c**). Formally:

binomialRF Feature Selection Hypothesis Test:

$$H_0: p_j = p_{root} \quad \text{(Equation 4a)}$$
$$H_a: p_j > p_{root} \quad \text{(Equation 4b)}$$
$$\text{p-value} = \Pr\left(Binomial(V, p_{root}) \geq F_j\right) \quad \text{(Equation 4c)}$$

Furthermore, since we are conducting simultaneous hypothesis tests when assessing the significance of each feature, we must adjust for multiplicity. False discovery rate adjustment procedures, such as Benjamini-Hochberg (BH) [20] or Benjamini-Yekutieli (BY) [21], or family-wise error rate procedures such as Bonferroni, can all be used depending on the predictor space, though BY is always a safe option given that it provides a valid false-discovery-adjusted p-value in the case of dependent predictors.
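As an illustration of the adjustment step (a pure-Python Benjamini–Hochberg sketch; in practice one would call R's `p.adjust` or an equivalent library routine):

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg step-up FDR adjustment (returns q-values)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    prev = 1.0
    # walk from the largest p-value down, enforcing monotone q-values
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        q = min(prev, pvalues[i] * n / rank)
        qvalues[i] = q
        prev = q
    return qvalues

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print([round(q, 4) for q in bh_adjust(pvals)])
```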

### 2.5 The binomialRF algorithm

Combining the methods of Sections 2.3 and 2.4, the binomialRF feature selection algorithm constructs a formal hypothesis test to determine whether *X*_{j} is an important feature or not. It first calculates the probability of selecting *X*_{j} as *X*_{optimal} in a tree and the test statistic $F_j = \sum_{z=1}^{V} F_{j,z}$. It then conducts a hypothesis test comparing the observed test statistic *F*_{j} for each feature *X*_{j} against its expected value, yielding a p-value per feature. Finally, it returns a feature selection ranking with p-values and FDR-corrected q-values denoting variable importance. In other words, binomialRF ranks and assigns p-values to features based on how frequently they were the optimal splitting variable in the root of each tree. Restricting the search to the root of each tree is what allows modeling via a binomial hypothesis testing framework. This procedure is illustrated in **Figure 2** and formalized in **Algorithm 1** (Appendix A1).
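The end-to-end ranking can be sketched with the standard library alone. This is an illustrative reading of the procedure, not the package's implementation: `p_root` follows the complement-rule construction of Section 2.4 (and telescopes to 1/*P*), `binom_sf` computes the one-sided exact binomial tail, and features are ranked by p-value; the toy counts and the choice of `m` are arbitrary.

```python
import math

def p_root(P, m):
    """Null probability that X_j is the optimal root split: X_j must land in
    the m-feature subsample (complement rule over draws without replacement),
    then win against the other m - 1 subsampled features by chance."""
    p_not_sampled = 1.0
    for i in range(m):
        p_not_sampled *= (P - 1 - i) / (P - i)
    return (1.0 - p_not_sampled) / m      # telescopes to exactly 1 / P

def binom_sf(k, n, p):
    """One-sided exact binomial tail: P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

def binomial_rf_rank(root_counts, n_trees, m):
    """Rank features by their root-split counts F_j via exact binomial p-values."""
    p0 = p_root(len(root_counts), m)
    ranked = [(j, F_j, binom_sf(F_j, n_trees, p0))
              for j, F_j in enumerate(root_counts)]
    return sorted(ranked, key=lambda r: r[2])   # most significant first

# toy example: 5 features, 100 trees, m = 2; feature 2 dominates the roots
counts = [10, 15, 45, 18, 12]
for j, F_j, pval in binomial_rf_rank(counts, n_trees=100, m=2):
    print(f"X_{j}: F_j = {F_j}, p-value = {pval:.3g}")
```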

### 2.6 Cross-validated binomialRF

The binomial exact test is contingent on a test statistic measuring how frequently each feature is selected; if there is a single dominant feature, it can render all remaining ‘important’ features invisible, as the dominant feature will always be selected as the splitting variable. Let *s* denote the percent_feature parameter, which determines how many features are considered at each node. Features may be collinear, or some dominant features may overshadow other important ones. If *s* is too small, there will be minimal competition among features, allowing noisy ones to be deemed important; if *s* is too large, dominant features will always overshadow important but small-signal features. Therefore, it is important to test a number of possible values of *s* ∈ (0,1) and optimize it via cross-validation. An extension to binomialRF has been written to conduct a [default] 5-fold cross-validation to tune and identify the optimal *s* hyper-parameter.

### 2.7 binomialRF Model Averaging: A tool for final model selection

Model averaging is an alternative way to perform model selection by combining different models based on their performance and feature selection. In particular, assume that there are *L* different candidate models, model_{1}, model_{2}, …, model_{L}, that may contain the full true model structure, a subset of the truly important terms, or even none of the significant variables. We can implement binomialRF under each model, model_{i}, and identify the important terms under each; denote the set selected under model_{i} by Selected_{i}. Then one can define an importance metric called “Proportion Selected” – see **Equation 5** – to measure how often a given feature *X*_{j} was selected. Formally, the proportion selected metric is defined as

$$\text{Proportion Selected}(X_j) = \frac{1}{L}\sum_{i=1}^{L} I\left(X_j \in Selected_i\right) \quad \text{(Equation 5)}$$

where I is the indicator function measuring whether *X*_{j} was selected by model_{i}. If Proportion Selected (*X*_{j}) is equal to 1, then that feature is deemed important by every single candidate model; if it equals 0, then every candidate model rejected it.
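As a minimal sketch (illustrative names; `selected_sets` stands in for the per-model binomialRF selections), the metric is simply a membership average:

```python
def proportion_selected(feature, selected_sets):
    """Fraction of the L candidate models whose selected set contains the feature."""
    L = len(selected_sets)
    return sum(feature in selected for selected in selected_sets) / L

# toy example: selected sets from L = 4 candidate models
selected_sets = [
    {"X1", "X2", "X3"},
    {"X1", "X2"},
    {"X1", "X4"},
    {"X1", "X2", "X5"},
]
print(proportion_selected("X1", selected_sets))  # selected by every model -> 1.0
print(proportion_selected("X4", selected_sets))  # selected by one model  -> 0.25
```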

In **Figure 3**, we use a simple toy example to illustrate the power of model averaging under binomialRF. As shown in **Figure 3A**, consider a dataset where the design matrix X contains 10 predictors, of which the first 5 are related to the binary class label, *Y*, and the last 5 are noise.

Suppose that you do not know which features are true, so you consider a set of 10 possible candidate models (**Figure 3B**) to choose the most likely set of ‘significant’ features. Each of the proposed candidate models (model_{1}, model_{2}, …, model_{L=10}) is run, as shown by the top bars (in black and white in **Figure 3B**), and their selections (Selected_{1}, Selected_{2}, …, Selected_{L=10}) are determined, as shown by the bottom bars (in green and purple in **Figure 3B**). Then, to validate whether a feature is truly important, a model average is obtained by calculating the Proportion Selected (*X*_{j}) metric for all possible features and ranking them by their Proportion Selected value (**Figure 3C**). Using Proportion Selected > 0.5 as a cutoff, we would conclude that *X*_{1}, *X*_{2}, *X*_{3}, *X*_{4}, *X*_{5} are all significant while the remaining features can be discarded as noise.

### 2.8 Evaluations in Numerical Studies

#### 2.8.1 Logistic Dataset Generation

To understand the strengths and limitations of the binomialRF feature selection algorithm and to compare its performance with the state of the art, we conduct a variety of numerical studies. These simulation scenarios generate logistically-distributed data to mimic binary classification settings. The gold standard is generated by first creating a coefficient vector *β* whose first five elements are non-zero and the remaining are zero. Then, *X*_{N×P}, a random multivariate standard normal matrix with *N* samples and *P* features, is generated to mimic a standardized and centered genomics matrix; finally, the linear predictor *Xβ* undergoes a logistic transformation, from which Bernoulli random variables are generated to mimic the binary class vector, *Y*.
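This generation recipe can be sketched with the standard library alone (an illustrative sketch of the scheme above, not the code used in the studies; the sizes, seed, and coefficient magnitudes are arbitrary):

```python
import math
import random

random.seed(1)

N, P = 100, 20
beta = [1.0] * 5 + [0.0] * (P - 5)        # first five coefficients non-zero

# standard-normal design matrix X (mimicking a centered, scaled genomics matrix)
X = [[random.gauss(0.0, 1.0) for _ in range(P)] for _ in range(N)]

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

# class labels: Bernoulli draws from the logistic-transformed linear predictor
Y = [int(random.random() < logistic(sum(b * x for b, x in zip(beta, row))))
     for row in X]
print(sum(Y), "cases out of", N)
```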

#### 2.8.2 Modifying Signal-to-Noise Ratio

To understand the strengths and limitations of the binomialRF feature selection algorithm and to compare its performance with the state of the art, we conduct a variety of numerical studies that compare different signal-to-noise ratio settings. These simulation scenarios generate logistic data as the gold standard with a coefficient vector *β* whose first five elements are non-zero and the remaining are zero. The signal-to-noise ratio is altered in two ways to determine how robustly each technique handles noise. First, the magnitude of the non-zero coefficients is increased from 1 to 3 to evaluate how much better each technique can distinguish the true variables from noise. Second, we increase the number of features, *P*, in the design matrix, *X*_{N×P}, 10-fold from 10 to 100 to 1,000. Each time we fix the number of true features at 5, making the remaining *P* − 5 features noise, which enables us to evaluate how well each technique selects the 5 true features in the presence of 5, 95, and 995 noise features, respectively. The numerical studies are presented in Section 5. Results for *β* = [1_{5} 0_{P−5}]^{T} are shown in **Figure 6**, while the results for *β* = [3_{5} 0_{P−5}]^{T} are available upon request and are omitted as they present minimal additional information. In the simulation study, we seeded a small number of true features relative to the number of noise features and then used model averaging to perform feature selection. We ran the model averaging algorithm twice, using the following decision rules:

- Proportion Selected (*X*_{j}) ≥ 0.5, and
- Proportion Selected (*X*_{j}) ≥ 0.9.

This second cutoff was chosen due to empirical results suggesting that, as the number of candidate models increases, Proportion Selected (*X*_{j}) for a truly important feature approaches 1 (i.e., lim_{L→∞} Proportion Selected (*X*_{j}) = 1). We report simulation results for Proportion Selected (*X*_{j}) ≥ 0.9.

### 2.9 Evaluations in Clinical Studies

#### 2.9.1 Overview of the Asthma Clinical Validation Study

To determine the utility of the binomialRF feature selection algorithm in translational bioinformatics, we conducted a validation study mirroring a prior study that focused on the translational impact of random forest classifiers.

Specifically, the study by Gardeux et al. [26] determined whether:

- a classifier predicting symptomatic subjects among healthy adults who were inoculated with human rhinovirus (HRV) (i.e., the common cold), using their blood transcriptome data before and during infection, could forecast which pediatric asthmatic patients will experience recurrent exacerbations using their transcriptomes derived from *ex vivo* incubation of their blood with and without HRV; and
- we could develop a fully-specified random forest classifier (i.e., a set of features) to make these predictions using dynamic genomic information (i.e., gene expression data) before and after HRV exposure.

The study examined a few different machine learning techniques and developed a random forest classifier that identified key pathways to predict asthma exacerbation.

#### 2.9.2 Asthma Clinical Validation Datasets

The two datasets in this clinical validation are described in **Table 3** and contain microarray data from two different studies. The first study, conducted by researchers at Duke [22] in 2009, examined dynamic changes in gene expression in healthy adults in response to human rhinovirus (HRV) infection and measured rhinovirus symptoms (about 50% of subjects were asymptomatic yet shedding the virus and the remainder were symptomatic); this dataset was used as the “training set” to develop the classifier [23]. The second dataset came from a tightly-controlled clinical trial conducted at the University of Arizona in severe asthmatic patients to determine whether *ex vivo* HRV incubation of peripheral blood mononuclear cells would be associated with subsequent asthmatic exacerbation. This dataset was designed to predict asthmatic exacerbation, a phenotype that was established during a 1-year follow-up and was defined as:

- No Exacerbation: patients with no hospitalizations and/or emergency room visits; or

- Recurrent Exacerbation: patients with hospitalizations and/or emergency room visits.

The clinical trial was designed in a tightly-controlled fashion in which all patients’ demographic and clinical profiles were verified and shown equally distributed between the two groups [23] (no obvious confounders), and all patients received the same maximal treatment to mitigate exacerbations.

Both datasets contained microarray data at the gene-product level and were transformed using the *N-of-1-pathways* [24] framework to determine whether a gene-ontology biological process pathway was dysregulated in each patient. Thus, the final design matrix *X*_{N×P} contained ∼ 20 subjects in each group and approximately 3,000 pathways. The goal of our asthma case study is to determine whether our feature selection technique could confirm the clinical findings (i.e., reproduce the predicted pathways in the study), attain similar prediction performance, propose new pathway discoveries, and extend the prior study by proposing pathway-pathway interactions.

## 3 binomialRF: identifying interactions

### 3.1 Current Available Interactions Screening and Selection Techniques

In classical linear models, 2-way interactions are included in a multiplicative fashion and treated as separate features in the model with their own linear coefficient terms. Here, we denote *X*_{i} ⊗ *X*_{j} as an interaction between *X*_{i} and *X*_{j}. The main regularity condition imposed on interactions in linear models is strong heredity: the requirement that if *X*_{i} ⊗ *X*_{j} is an interaction in the model, then both *X*_{i} and *X*_{j} must be included in the model. Similarly, under weak heredity, at least one of the two features must be included individually in the model if the interaction is included. The debate between allowing weak vs. strong heredity revolves around which properties a model can attain under certain regularity conditions, as well as feasibility and utility [25, 26]. Under tree-based models, however, strong heredity is automatically induced as a natural consequence of the binary split tree’s structure: a feature can only form an interaction with the features along its root-to-node path, all of which are already part of the model. This reduces the computational inefficiency of enforcing strong heredity and avoids the irregularities present under weak heredity, making the interaction search both more computationally feasible and more statistically rigorous.

### 3.2 binomialRF: Identifying 2-way Interactions

To generalize the binomialRF algorithm to search for 2-way interactions, we generalize **Equation 3** by adding another product term to denote the second feature in the interaction set to calculate *p*_{2−way}.

Since features along a path are drawn without replacement, if *X*_{j} is selected at the root node, it is no longer available for selection further down the path. Thus, we replace *P* with (*P* − 1), and we include a 1/2 normalizing constant since the interaction can occur in two different ways (via either the left or right child node). **Figure 4A** provides a visual representation of how to generalize binomialRF to identify a 2-way interaction by looking at pairs of features starting at the root node.

Next, we update the hypothesis test in (4) and modify it to identify 2-way interactions for all possible *X*_{i} ⊗ *X*_{j} pairs.

binomialRF 2-way Interaction Selection Hypothesis Test

### 3.3 binomialRF: Generalizing to identify *K*-way interactions

To generalize **Equation 6** to multi-way interactions and calculate *p*_{K−way}, we first note that any multi-way interaction of size *K* in a binary split tree results in 2^{K−1} terminal nodes. Therefore, there are 2^{K−1} possible ways of obtaining the *K*-way interaction (**Figure 4B**). Thus, the normalizing constant in **Equation 6** is replaced with 2^{K−1} in **Equation 8**. The two products in **Equation 6** are expanded to become *K* different product terms (each representing the probability of selecting an individual feature in the interaction set), and (*P* − 2) is replaced with (*P* − *K*) to account for sampling without replacement, which yields **Equation 8**.

Next, we update the hypothesis test in **Equation 7** and modify it to identify *K*-way interactions for all possible interaction sets *X*_{i_1} ⊗ ⋯ ⊗ *X*_{i_K}.

### 3.4 Using dynamic tree programming to search for interactions

Let any node at the *K*^{th} level of a tree be a “*K*-terminal” node. Binary split trees have exactly two child nodes for every non-terminal node; therefore, to climb up from any *K*-terminal node to the root, we can calculate the path recursively. This is done by traveling up through a node’s parent nodes until we reach the root. The climbToRoot algorithm is provided in the Appendix (A2), and a pseudo-algorithm is provided below to illustrate the main concepts.

**Pseudo-Algorithm**

    Identify all K-terminal (child) nodes
    For each K-terminal (child) node:
        Initialize the interaction path as the child node
        While (child node ≠ root node):
            Determine whether the child node is a left or right daughter node
            new.parent.node ← identify parent node
            Append new.parent.node to the interaction path
            child.node ← new.parent.node
        Return (interaction path)
    Return (all 2^(K−1) K-set interaction paths)
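Assuming the tree is stored with parent pointers (e.g., derived from randomForest's flat tree tables), the climb can be written directly. This Python sketch is illustrative, not the package's climbToRoot implementation; the `parent` map and node ids are hypothetical:

```python
def climb_to_root(node, parent):
    """Follow parent pointers from a K-terminal node back to the root.
    `parent` maps each node id to its parent id; the root maps to None.
    Returns the interaction path ordered root-first."""
    path = [node]
    while parent[node] is not None:       # stop once the root is reached
        node = parent[node]
        path.append(node)
    return list(reversed(path))

# toy tree: node 0 is the root; 1, 2 are its children; 3, 4 are children of 1
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1}
print(climb_to_root(3, parent))   # -> [0, 1, 3]: a 3-way interaction path
```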

### 3.5 Evaluations in Numerical Studies

To understand the strengths and limitations of the binomialRF feature interaction selection algorithm, we conducted various small-scale numerical studies in which an interaction or set of interactions was seeded, and we evaluated how well the binomialRF algorithm detected the interaction. In contrast to the main-effects numerical studies, which examined the signal-to-noise ratio across different dimensions, the focus here was to evaluate whether binomialRF could detect the interaction structures in the absence of explicitly mapping them into the design matrix.

### 3.6 Evaluations in Clinical Studies

Biological and genomics analyses that omit interactions implicitly assume biology occurs in isolation; thus, to fully validate the binomialRF algorithm, we extend the clinical study from traditional biomarker discovery to biomarker interactions. The prior study by Gardeux et al. [23] examined a pathway-level random forest classifier; in this component of the evaluation, we consider pathway-pathway interactions.

As described in Section 2.9, the clinical trial datasets each contained approximately 3,000 pathways, yielding approximately 4.5 million possible 2-way interaction combinations. A brute-force approach would require storing a matrix roughly 1,500 times larger in order to give every 2-way combination an equal chance of attaining significance. In our case study, we show how to use binomialRF as an interaction screening process to drastically reduce the computation time.

## 4 Complexity Analysis

The binomialRF algorithm is computationally efficient compared to traditional feature selection methods in random forests in two ways: a) it attains minimal computational complexity, and b) it requires minimal memory storage during runtime. Section 4.1 describes the memory required at runtime, while Section 4.2 notes the theoretical computational complexity and conducts some studies to show the computational gain over the state of the art.

### 4.1 Memory Storage Requirements

To illustrate the magnitude of memory gained by binomialRF, we use a simple case with 10 variables to show how much more memory is required to calculate 2-way interactions. As seen in **Table 4**, to calculate 2-way interactions, binomialRF only requires an *n* × 10 matrix, whereas any other technique would require an *n* × 55 matrix, effectively 5.5 times more RAM during runtime. To calculate 2-way interactions in a moderately larger dataset with 1,000 variables, other techniques would require approximately 500 times more memory. **Table 4** illustrates the relative memory requirements for calculating 2-way and 3-way interactions when there are 10, 100, and 1000 predictors in the design matrix, X.

Thus, the memory storage gains are not trivial for even simple 2-way interactions, let alone *K*-way interactions. Note that in linear models, efficient solution paths for ⊗ only exist for *K* ∈ {1,2} (LASSO[16] for *K*=1 and RAMP[27] for *K*=2). For *K*>2, no algorithm guarantees computational efficiency. Among the RF-based feature selection techniques examined in this paper, no efficient solutions exist for scalably identifying interactions.
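To make the memory comparison concrete, the column counts behind **Table 4** can be reproduced with a short sketch (Python for illustration; the function names are ours, not the package's):

```python
from math import comb

def explicit_cols(p, k):
    """Columns of a design matrix that explicitly maps every j-way
    interaction for j = 1..k, which non-tree techniques must store."""
    return sum(comb(p, j) for j in range(1, k + 1))

def ram_factor(p, k):
    """Relative RAM factor: binomialRF keeps the original n x p matrix,
    so the ratio of explicit columns to p measures the extra memory."""
    return explicit_cols(p, k) / p
```

For p = 10 and 2-way interactions this gives 10 + 45 = 55 columns (a 5.5× factor), and for p = 1,000 it gives 500,500 columns, roughly the 500× factor quoted in the text.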

### 4.2 Computational Complexity

The computational complexity of detecting interactions is on the order of *O*(*V*2^{K−1}), where *V* is the number of trees grown in the forest, and *K* (usually small) is the depth of the interaction search in a binary split tree. For example, calculating 2-way interactions requires only twice as many operations as main effects, rather than one operation for each explicitly-mapped interaction. As seen in the clinical study validation, when *P*=3,000 pathways, 2-way interaction screening requires only ∼6,000 calculations rather than ∼4.5 million brute-force calculations. For a permutation-based algorithm, this would mean an additional 4,498,500 permutation tests to determine interaction significance. This substantial computational gain occurs because a decision tree’s binary split limits the search space drastically, at most growing by powers of 2. This is still exponential growth; however, since interaction searches are usually limited to 2-way or 3-way interactions, the 2^{K−1} term is for all practical purposes a constant, allowing the binomialRF interaction search to provide a quasi-linear approximation to a non-polynomial-time problem.
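A back-of-the-envelope sketch of these operation counts (illustrative helper names; the ∼6,000 figure corresponds to 2^{K−1} × *P* at *K* = 2):

```python
from math import comb

def screening_cost(P, K):
    """binomialRF interaction screening: roughly 2^(K-1) times the
    main-effect cost of P tests; the 2^(K-1) factor reflects the
    binary-split search space."""
    return 2 ** (K - 1) * P

def brute_force_cost(P, K):
    """Brute force: one permutation test per explicitly mapped
    K-way interaction."""
    return comb(P, K)
```

With P = 3,000 and K = 2, `screening_cost` gives 6,000 versus 4,498,500 for `brute_force_cost`, matching the numbers in the text.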

## 5 Numerical results

In this section, each method is evaluated based on its computation speed (runtime), classification accuracy, and model selection. To evaluate computation time, each method’s runtime is reported in total seconds with the range of performances displayed in a boxplot; to evaluate classification accuracy, the standard 0-1 unweighted classification loss function is used; and to evaluate model selection, false selection rates and discovery rates are used to determine how well the techniques recover the ‘true’ model.

### 5.1 Simulation study: Main effects

We generate a simple simulated logistic dataset by generating a multivariate standard normal feature matrix composed of *P* features and 100 data points. The true model consists of a β vector where the first five coefficients are non-zero and the last *P*-5 are zero. Finally, the binomial outcomes are generated using a logistic transformation to calculate the probabilities for the Bernoulli random variable. We simulate various signal-to-noise ratio settings in two different ways. First, we let the five nonzero coefficients be either all 1s or all 3s to determine if the algorithm is robust to decreasing the magnitude of the coefficients in the logistic model. Second, we add noise by increasing the number of irrelevant features relative to the five nonzero ‘true’ variables. We increase the dimension of *P* 10-fold while keeping constant the number of nonzero coefficients to increase the relative noise in the predictor matrix. Formally, the simulation structure is illustrated below:
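The simulation just described can be sketched as follows (an illustrative Python translation; the study itself was run in R, and the default argument values are placeholders matching the text):

```python
import math
import random

rng = random.Random(0)

def simulate_logistic(n=100, p=10, signal=1.0, n_true=5):
    """Sketch of the main-effects simulation: standard-normal features,
    the first n_true coefficients equal `signal` (1 or 3 in the text),
    the remaining p - n_true are zero, and
    y ~ Bernoulli(sigmoid(x . beta))."""
    beta = [signal] * n_true + [0.0] * (p - n_true)
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        eta = sum(b * v for b, v in zip(beta, x))
        prob = 1.0 / (1.0 + math.exp(-eta))   # logistic transformation
        y.append(1 if rng.random() < prob else 0)
        X.append(x)
    return X, y, beta
```

Increasing `p` ten-fold while holding `n_true = 5` fixed reproduces the increasing-noise settings described above.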

#### 5.1.1. Computation time

In order to compare the computation time of each model as accurately as possible, we strictly measure the time it takes for the algorithm to produce a feature ranking or p-value and omit all other portions of the algorithm, using the base R system.time function as seen below:

```r
## binomialRF time profile
binomialRF.time <- system.time(binom.rf <- binomialRF(X, y, ...))
```

In order to measure scalability in the predictor space, 500 random forest objects are grown with 500 trees, measuring each algorithm’s run time. We repeat this experiment expanding the dimension of *X*_{N×P} 10-fold each time and measure the runtime as the predictor matrix increases dimension. The runtimes are graphically summarized in powers of 10 (i.e., log_{10}) in **Figure 5** as the larger runtimes otherwise dominate the boxplots and do not allow for visual differentiation between all techniques. The tests were all run on a 2017 MacBook Pro laptop with 16GB RAM, Intel Core i5, with 4 cores using R version 3.4.0 (64-bit).

The runtimes are shown in the boxplots in **Figure 5**, ranked left to right by median runtime. The binomialRF is consistently the fastest tree-based feature selection algorithm, while binomialRF model averaging is on average the second fastest. Note, the model averaging algorithm considers 10 different candidate models to perform feature selection, and the runtime reported measures the time it took to average the results of all 10 candidate models. We increased the number of features from 10 to 1000 to mimic a high-dimensional space (where *P* ≫ *N*) and assess which techniques scale well in high dimensions. In the high-dimensional setting of *P* = 1,000, the binomialRF’s mean runtime was 0.96 seconds, while other techniques required on average between 30 and 600 seconds per run to analyze 1,000 features. A tabular summary of the runtimes was omitted to remove redundancy and is available upon request.

#### 5.1.2 Misclassification test error, model size, false selection rate and coverage

To assess the feature selection techniques, we evaluate them based on the accuracy of the induced final model, average model size, false selection rate (FSR), and true variable coverage. The test error is measured via a standard 0/1 loss function, and the FSR and Coverage formulas are defined below:

FSR = *U*/(1 + *U* + *I*),  Coverage = *I*_{selected}/*I*_{Total},

where *U* = uninformative variables in the model, *I* = informative variables in the model, *I*_{selected} is the number of “true” features each algorithm detected, and *I*_{Total} = 5 is the number of true predictors. The +1 is added in the denominator of the FSR formula to avoid degenerate cases when no variables are selected (or, alternatively viewed, it acts as an intercept in a linear model). Results are shown in **Figure 6.** Across the majority of the simulations, the final binomialRF-induced model, on average, results in a lower test error and attains the highest true variable coverage. However, the Boruta algorithm consistently attains the best FSR. In high dimensions, the binomialRF trade-off results in a higher FSR in order to attain better coverage.
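Given the definitions of *U*, *I*, and *I*_{selected} above, these two metrics can be computed as follows (a sketch assuming FSR = U/(1 + U + I), consistent with the +1-in-the-denominator note):

```python
def fsr(selected, true_features):
    """False selection rate: uninformative selections over 1 + all
    selections; the +1 guards against division by zero when nothing
    is selected."""
    U = len(set(selected) - set(true_features))   # uninformative variables
    I = len(set(selected) & set(true_features))   # informative variables
    return U / (1 + U + I)

def coverage(selected, true_features):
    """Fraction of the true predictors recovered by the algorithm."""
    return len(set(selected) & set(true_features)) / len(true_features)
```

For instance, selecting {x1, x6} when the true model is {x1, ..., x5} gives FSR = 1/3 and Coverage = 0.2.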

### 5.2 Simulation study: 2-way Interactions

To determine whether binomialRF can successfully detect interactions, we ran two different simulation studies. The first study considers the following model:

500 different simulation runs were conducted, each growing 500 trees in the forest. The averaged results for the top 10 interactions (ranked in order of significance) are shown in **Table 6-A**. As shown in **Table 6**, the binomialRF is clearly detecting the signal in the *X*_{1} ⊗ *X*_{3} interaction at a higher rate than random combinations of main effect signals and noise (i.e., *X*_{1} ⊗ *X*_{8}). Thus, in the simple 2-way interaction, the binomialRF model clearly detects the signal. The second interaction simulation study considers the following slightly more complicated model where two sets of 2-way interactions are now present:

Similarly, as before, 500 trees were grown, and 500 different simulation runs were conducted. The averaged results for the top 10 interactions (ranked in order of significance) are shown in **Table 6-B.**
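A minimal sketch of a seeded 2-way interaction simulation of this kind is shown below (the coefficient value and the choice of interacting pair are illustrative placeholders, not the paper's exact simulation settings):

```python
import math
import random

rng = random.Random(1)

def simulate_interaction(n=100, p=10, signal=3.0):
    """Hypothetical sketch: the outcome depends on the product X1 * X3,
    so the signal lives in the interaction rather than the main effects.
    Coefficients here are placeholders for illustration only."""
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        eta = signal * x[0] * x[2]            # seeded X1 (x) X3 interaction
        prob = 1.0 / (1.0 + math.exp(-eta))
        y.append(1 if rng.random() < prob else 0)
        X.append(x)
    return X, y
```

Because the interaction is never mapped into the design matrix, detecting X1 ⊗ X3 here exercises exactly the tree-path search of Section 3.4.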

## 6 Asthma Validation study

### 6.1 Asthma Validation study: Identifying Pathways

The study by Gardeux et al [23] was designed to assess whether it was possible to develop a fully-specified classifier from healthy adults infected with HRV that predicted asthma exacerbation in pediatric asthmatic patients who were infected with HRV. They used a variety of machine learning techniques and restricted their classifier to consider 10, 20, and 30 pathways at a time. The optimal classifier developed was a fully-specified random forest classifier that attained approximately 73 percent accuracy in the pediatric asthmatic validation-set cohort. Since our goal is not to develop the optimal classifier but rather to provide more interpretability and assess interactions, we conducted the asthma clinical validation study in a slightly different fashion while following the same overarching concepts. As shown in **Table 3**, the healthy adult [training] dataset was used to identify meaningful pathways, and the pediatric asthmatic [validation] cohort was used to confirm their utility. We first ran the binomialRF algorithm on the healthy adults considering all 3,014 pathways and validated its selected pathways in the asthmatic patients (shown in Appendix A.3). The binomialRF algorithm identified 67 significant pathways at FDR < 5%. Nineteen of the 20 pathways identified by [23] were confirmed by binomialRF, while a number of other candidate pathways identified by binomialRF may hold predictive power and may be physiologically related to the pathway classes identified in [23]. For example, “GO:0006342” is a “chromatin silencing” pathway that is physiologically related to Class “V-Chromatin Organization.” Another such example is “GO:0001763”, a “morphogenesis of a branching structure” pathway that falls under the “Class III - Morphogenesis” pathway class. Thus, binomialRF is able to confirm known physiological discoveries as well as propose biologically feasible novel candidate pathways for predicting HRV-induced asthma exacerbation.
A binomialRF model averaging analysis was also conducted – the pathway selection results from the model averaging are located in Appendix **A4. binomialRF Model Averaging in Asthma Validation Study**. A total of 10 candidate models were proposed by randomly selecting between 200 and 3,000 pathways per model. Each pathway was then ranked based on its “Proportion Selected” value, and the pathways identified in more than half the models (i.e., Proportion Selected >0.5) were selected into the final model.
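The proportion-selected rule described above can be sketched as follows (an illustrative helper, assuming each candidate model simply returns the list of pathways it selected):

```python
from collections import Counter

def model_average(candidate_selections, threshold=0.5):
    """Model-averaged feature selection: keep features whose
    'Proportion Selected' across candidate models exceeds the
    threshold (> 0.5 in the asthma study)."""
    counts = Counter(f for model in candidate_selections for f in set(model))
    n_models = len(candidate_selections)
    return sorted(f for f, c in counts.items() if c / n_models > threshold)
```

For example, a pathway selected by 3 of 3 candidate models survives the 0.5 cutoff, while one selected by 1 of 3 does not.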

### 6.2 Asthma Validation study: Classification Accuracy

To validate these results in the classifier, we also compared the classification accuracy on the pediatric asthma patients using the pathways selected by the binomialRF model against that of a random forest trained on all 3,014 pathways. We then compared the binomialRF model averaging (see **Appendix A4**) to the results obtained by the naïve random forest and the binomialRF model. The optimal binomialRF model averaging selection resulted in a classifier with a 67% and 70% mean and median classification accuracy, respectively. It is worth noting that although the validation accuracy did not surpass the 73% obtained by Gardeux [23], the binomialRF is a feature selection framework for which classification accuracy is an important outcome but not the only objective.

### 6.3 Asthma Validation study: Identifying Pathway-Pathway Interaction

To extend the Asthma Validation study from Gardeux et al [23], we also screened for pathway-pathway interactions. To reduce dimensionality and shrink the interaction search space, the binomialRF algorithm was first run (on both the prediction set and the testing set) with the 67 pathways selected by the previous run of binomialRF for single factors. Using these 67 pathways, we then screened for all possible 2-way interactions using the 2-way binomialRF algorithm and identified 107 pathway-pathway interactions (listed in **Appendix A5**, ranked in order of significance). These results indicate all the significant interactions that occur between ‘main effects’ (pathways deemed significant on their own). As a first step, exploring interactions between ‘main effects’ allows a bioinformatician to explore “expected” interactions, i.e., interactions between the main pathways at play. Restricting the interaction space to the 67 pathways identified by the first run of binomialRF greatly reduces the computational burden, but it is worth noting that this type of search introduces a confirmation bias, as we only allow the algorithm to search for interactions among pathways in which we expect interactions rather than also searching for unexpected ones. Therefore, a proper interaction search would be more robust under a model averaging setting in order to determine which interactions are consistently important versus those that are pure noise.

## 7 Discussion

### 7.1 Feature selection in random forests

In general, machine learning algorithms are judged on three main criteria: interpretability, scalability, and accuracy. The binomialRF feature selection algorithm provides a relatively simple interpretation, attains competitive model identifiability and accuracy performance, and provides a framework that easily scales in high-dimensional main-effect and interaction settings. Furthermore, it offers a distinct advantage compared to other feature selection algorithms in trees, as it easily generalizes to search for interactions using the same binomial exact test. The heuristic rankings based on the random forest importance measure (like RFE and VarSelRF) cannot search for interactions since they are restricted to univariate analyses, and their iterations can only eliminate non-important univariate relationships, thus providing zero information on interactions. The permutation-based tests (like the Permutation-Importance algorithm) do not generalize easily to search for interactions, since the literature demonstrates that it is challenging enough to estimate empirical null distributions of Z-scores of importance metrics (Vita, Altmann, etc.), let alone to estimate joint Z-score distributions or to construct scalable and efficient algorithms for estimating them.

P-value based rankings offer statistical rigor in machine learning-based models, as they allow for more robust measures of gene selection than arbitrary cutoffs (i.e., “we selected the top 10%”); permutation-type tests, by contrast, shuffle predictor values across nodes in a tree, which is an expensive operation that disregards node hierarchy. In the former case, techniques with heuristic stopping criteria are subject to arbitrary user-imposed decisions which lack scientific rigor (i.e., why is selecting the top 10% of genes more scientifically valid than the top 20%?). In the latter case, feature selection techniques that disregard the innate hierarchy in tree-based structures lose valuable signal. In the construction of a decision tree, the predictor chosen as the “splitting variable” in each node is the optimal predictor choice. It intuitively follows that features selected higher up the hierarchy of the tree carry more weight, as they are deemed “optimal” in a larger pool of samples, and the root node is theoretically the best feature in each tree as it is the “first best” feature selected. Therefore, the hierarchy must be respected.

The power of model averaging results from the same principle that makes random forests powerful classifiers: an ensemble of weak learners is stronger than an individual strong classifier. A prediction made from a majority vote or consensus of weak decision trees (i.e., a random forest) is more robust than a well-pruned decision tree, as the ensemble minimizes the effects of mistakes made by individual trees as long as the majority make the right choice. The same holds when this idea is extended to feature selection with binomialRF. An individual run of binomialRF might be sensitive to noise; however, if we instead rely on the consensus of multiple iterations of binomialRF and add some randomization (at both the feature and sample-size level), then for a feature to be selected into the final model it must be selected across the majority of the candidate models. This makes the final model selection more robust and stable. The binomialRF model averaging runs across the simulation studies resulted in the best classification rate in the test sets while controlling for false selection and maximizing true feature coverage. These results confirm the intuition that extending ensemble techniques from classifier development into feature and model selection will improve the latter. It is also worth noting that while model averaging is independent of the feature selection technique (and could have been applied to all other methods), we will explore whether the binomialRF framework (and more broadly the binomial distribution) offers theoretical advantages for deriving asymptotic results (see limitations and future studies).

### 7.2 Pathway-Pathway Interactions in the Asthma Validation Study

The first interaction listed in the supplemental table **(A5)** is GO:0016570 ⊗ GO:0009581. GO:0016570 is a histone protein modification pathway, while GO:0009581 is a pathway indicating a response to external stimulus. Their interaction indicates that differential expression of pathways associated with histone protein regulation interacts with the response to an external stimulus (likely the HRV-inoculation stimulus). In fact, a few pathway interactions down the list we see GO:0016570 ⊗ GO:0009615, which suggests that differential histone modification in response to a virus is predictive of recurrent asthma exacerbations as well as of healthy subjects’ symptoms to HRV (plausibly, the human rhinovirus infection). Indeed, histone modification has been linked to the development of asthma[28, 29], and HRV infection has also been shown to cause DNA methylation changes in epithelial cells of healthy and asthmatic subjects[30]. These two pathway-pathway interactions indicate that histone modifications are potentially highly susceptible to environmental stimuli, suggesting an epigenetic component to asthmatic children’s response to therapy. The previous “genome by environment” classifier by Gardeux et al as well as the epigenetic literature in asthma corroborate the existence of these “genome by environment” interactions [23, 31-33], illustrating the utility of looking for pathway-pathway interactions beyond the single-pathway response effects that they reported. Pathway-pathway interaction screening can thus be used to corroborate known biological phenomena as well as potentially shed light on previously unknown interacting mechanisms.

## 8 Limitations and Future Studies

### 8.1 Pushing towards more theoretical guarantees in machine learning

The binomialRF framework provides a novel paradigm that can be extended in multiple directions. On one hand, binomialRF can be extended into a Bayesian framework by placing priors (the current implementation enforces an equal-weighted discrete uniform prior) on the likelihood of selecting a feature and determining significance using posterior probabilities from a beta-binomial process. On the other hand, the binomialRF algorithm can be extended into a binomialRF model averaging framework in which candidate models comprised of feature subsets are assessed and ‘averaged’ across. Similar to Bayesian Model Averaging (BMA)[34] and Sparsity Oriented Importance Learning (SOIL)[35], binomialRF can weight candidate models based on their utility; however, in the model-free case a likelihood-induced weighting is not possible, so we can alternatively weight by out-of-bag (or validation) error. Since the model averaging data are composed of binomial test statistics, future work will explore whether any asymptotic results arise as the number of candidate models goes to infinity. At the moment, model averaging still requires arbitrary cutoffs (for our simulation studies we used Proportion Selected > 0.9 as the cutoff) to make the final model selection, with empirical results suggesting it helps reduce the false selection rate without sacrificing true feature coverage. However, these results are still empirical and offer no theoretical guarantees. We need stronger theoretical results to inform which cutoffs to use when considering a specific number of candidate models. Ideally, such results would guarantee model selection performance (as is the case with SOIL); however, at the moment this is not guaranteed beyond a few empirical studies and will thus be explored in future studies.

### 8.2 Improving interpretative power and translational utility: Incorporating pathways and ontologies in feature selection

As pathway-based biomarker studies gain more traction in the genomics realm [24, 36, 37], the machine learning community needs to continue developing domain-specific methods that cater to the bioinformatics and genomics research community. These techniques must be further explored in order to improve the translational power and interpretation of machine learning results in bioinformatics. Too often we develop powerful predictive “black box” algorithms that lack the explanatory or interpretive power required to translate information into knowledge. Therefore, future work must be prioritized in this direction. One possible direction to consider is ontology-enriched binomialRF models. Ontologies offer well-curated knowledge graphs that represent the complex interplay of biological networks. Incorporating this information beforehand into the feature selection method, and later sending the results back into the ontology domain for visualization, can yield more interesting network-level analyses. Since feature selections in binomialRF are composed of binomial test statistics, there are numerous statistical possibilities with which one can enrich gene-based binomialRF predictions into aggregated pathway-level features. For example, one way would be to extend gene detection to ontology-based pathway-level analyses via over-representation tests built from binomial test statistics. Another would incorporate gene ontology hierarchies between pathways to eliminate redundant signal and enable smarter pathway detection. Model averaging can also be conducted by incorporating knowledge graphs to make the ‘candidate’ models more ontologically meaningful, by looking at clusters of genes or pathways or by identifying which elements dominate the signal in a biological process. We will explore these in future studies.

## 9 Conclusion

As the biomarker discovery process moves away from identifying single-gene products and moves towards interactions and pathways (say from gene ontologies like GO), the statistical machine learning community will need to continue to develop corresponding interpretable and scalable techniques. The binomialRF algorithm provides an early step in this direction in order to match the technical and computational requirements for these novel large-scale genomics analyses, as well as to extend to other ‘omics.

## 11 Conflict of Interest

The authors declare no conflict of interest.

## 12 Author Contributions

SRZ conducted all the analyses in R; HHZ contributed to the statistical framework and analysis; all authors contributed to the evaluation and interpretation of the study; SRZ contributed to the figures and tables; SRZ, HHZ, CK, and YAL contributed to the writing of the manuscript; all authors read and approved the final manuscript.

## 13 Funding

This work was supported in part by The University of Arizona Health Sciences Center for Biomedical Informatics and Biostatistics, the BIO5 Institute, and the NIH (U01AI122275, NCI P30CA023074, 1UG3OD023171). This article did not receive sponsorship for publication.

## 14 Appendix

## A1: binomialRF Feature Selection Algorithm

Let *F*_{j,z} denote the random variable measuring whether *X*_{j} is selected as the splitting variable at the root of a tree *T*_{i},

This results in *F*_{j,z} being a Bernoulli random variable, *F*_{j,z} ∼ *Bern*(*p*_{root}), and the sum across all *V* trees is a binomial random variable, where *p*_{root} is the probability of randomly selecting a feature *X*_{j} as the optimal splitting variable in the root of tree *T*_{i}. *p*_{root} is given by

Under the null, *p*_{root} is constant across all trees. The binomialRF feature selection algorithm below illustrates the process of identifying main effects for the binomialRF algorithm.
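As a concrete illustration of the test itself (in Python; the published package is in R), the one-sided exact p-value can be computed directly from the selection frequency. Here *p*_{root} is treated as a user-supplied null probability, since its closed form depends on the mtry sampling scheme:

```python
from math import comb

def binom_sf(k, n, p):
    """One-sided exact binomial tail P(X >= k) for X ~ Binomial(n, p):
    the p-value for a feature chosen as the root splitting variable
    k times across n trees, under a null root-selection probability p."""
    return sum(comb(n, j) * (p ** j) * ((1 - p) ** (n - j))
               for j in range(k, n + 1))
```

For example, with n = 10 trees and p = 0.5, observing 8 or more root selections has exact tail probability 56/1024 ≈ 0.055; features would then be ranked by these nominal p-values before multiplicity adjustment.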

To generalize to identify *K*-set interactions, denoted by ⊗, replace *F*_{j,z} with
where {*X*_{i}}_{i∈K} denotes the interaction sequence. Then replace *p*_{root} with

Then the resulting count across all *V* trees is again binomial, and the hypothesis test follows from this.

## A2: climbToRoot (DTP) Algorithm

Note that in binary split trees the nodes are labeled such that each left daughter node’s label is twice its parent’s label, and each right daughter node’s label is twice its parent’s label plus one. Under this ordering, the *K*_{terminal} nodes are identified as *K*_{terminal} = {2^{K-1}, …, (2^{K} − 1)}, and lines 20-23 are required for cases in which true terminal (leaf) nodes occur before the *K*_{terminal} nodes.

## A3. Asthma Validation study validation: Predicted Pathways by binomialRF

The table below compares our Asthma Validation study to the original random forest classifier obtained by Gardeux et al. The pathways below are those determined significant by the binomialRF algorithm. The first two columns show the GO pathway identifier and description. The third column indicates whether the pathway was validated in the HRV study by Gardeux, and the last column corresponds to their 5 distinct pathway classes. As seen with some of the pathways not validated in the HRV study (e.g., “GO:0001763”), even though they were not part of the original discoveries, they correspond to the Class III “Morphogenesis” pathways, thus identifying physiologically relevant and related candidate pathway discoveries.

## A4. binomialRF Model Averaging in Asthma Validation Study

## A5. Pathway-Pathway Interactions in Asthma Validation Study

## Footnotes

## 10 List of acronyms

- RF: random forest
- BH: Benjamini-Hochberg adjustment
- BY: Benjamini-Yekutieli adjustment
- BMA: Bayesian Model Averaging
- SOIL: Sparsity Oriented Importance Learning
- HRV: Human Rhinovirus
- FSR: False Selection Rate
- LASSO: Least Absolute Shrinkage and Selection Operator
- RAMP: Regularization Algorithm under Marginality Principle
- DTP: Dynamic Tree Programming
- GO: Gene Ontology
- GO-BP: Gene Ontology Biological Processes