Consensus Machine Learning for Gene Target Selection in Pediatric AML Risk

Acute myeloid leukemia (AML) is a cancer of hematopoietic systems that poses high population burden, especially among pediatric populations. AML presents with high molecular heterogeneity, complicating patient risk stratification and treatment planning. While molecular and cytogenetic subtypes of AML are well described, significance of subtype-specific gene expression patterns is poorly understood and effective modeling of these patterns with individual algorithms is challenging. Using a novel consensus machine learning approach, we analyzed public RNA-seq and clinical data from pediatric AML patients (N = 137 patients) enrolled in the TARGET project. We used a binary risk classifier (Low vs. Not-Low Risk) to study risk-specific expression patterns in pediatric AML. We applied the following workflow to identify important gene targets from RNA-seq data: (1) Reduce data dimensionality by identification of differentially expressed genes for AML risk (N = 1984 loci); (2) Optimize algorithm hyperparameters for each of 4 algorithm types (lasso, XGBoost, random forest, and SVM); (3) Study ablation test results for penalized methods (lasso and XGBoost); (4) Bootstrap Boruta permutations with a novel consensus importance metric. We observed recurrently selected features across hyperparameter optimizations, ablation tests, and Boruta permutation bootstrap iterations, including HOXA9 and putative cofactors including MEIS1. Consensus feature selection from Boruta bootstraps identified a larger gene set than single penalized algorithm runs (lasso or XGBoost), while also including correlated and predictive genes from ablation tests. We present a consensus machine learning approach to identify gene targets of likely importance for pediatric AML risk. The approach identified a moderately sized set of recurrent important genes from across 4 algorithm types, including genes identified across ablation tests with penalized algorithms (HOXA9 and MEIS1). Our approach mitigates exclusion biases of penalized algorithms (lasso and XGBoost) while obviating arbitrary importance cutoffs for other types (SVM and random forest). This approach is readily generalizable for research of other heterogeneous diseases, single-assay experiments, and high-dimensional data. Resources and code to recreate our findings are available online.

Acute leukemia is the most prevalent childhood cancer, accounting for 30% of childhood 2 cancers overall [1,3]. Major subtypes of pediatric acute leukemia include acute myeloid 3 leukemia (AML) and acute lymphoblastic leukemia (ALL), accounting for 15% and 85% 4 of these leukemia cases, respectively [1]. Despite improving survival rates, pediatric 5 AML remains deadlier than ALL [1]. AML is a heterogeneous cancer of the blood and 6 bone marrow myeloid stem cells that presents with numerous molecular subtypes 7 actionable for stratification and treatment. These subtypes are often based on 8 cytogenetics, molecular data, and other characteristics [2,4]. By contrast to adult AML, 9 pediatric AML is characterized by rare somatic mutations, absence of common adult 10 AML mutations, and relatively frequent structural variants [4]. These findings indicate 11 the importance of age-based targeted therapies for AML treatment, and the potential 12 for molecular assays to further our understanding of how gene expression relates to 13 pediatric AML risk, prognosis, and treatment. We utilized RNA-seq expression data to 14 better understand its relation to pediatric AML risk, which remains poorly understood. 15 Interest in identification of biomarker and gene target sets of cancer risk using 16 RNA-seq data has endured for over a decade [10]. For statistical rigor and clinical 17 utility, reduction of high-dimensional, whole-genome expression sets of tens of thousands 18 of genes is vital. Differential gene expression (DEG) analysis is typically used to achieve 19 dimensionality reduction by selecting loci with maximal expression contrast between 20 sample groups. This is typically followed by fitting and optimization of models to these 21 reduced sets of DEGs, further narrowing focus to loci showing the greatest contrast and 22 most predictive qualities between sample sets. For the present work, we consider this 23 cumulative process of dimensionality reduction, model fitting, and optimization as a 24 problem of gene feature selection. 25 Selection of important genes from expression data remains challenging for biomedical 26 research, partly because the commonly applied cross-sectional case/control study design 27 confounds results interpretability. Further, underlying biological dynamics can be 28 nuanced and complex in disease processes, especially for molecularly heterogeneous 29 cancers like AML. These problems can be tractable with modern machine learning 30 approaches, which include the recently developed eXtreme Gradient Boosting 31 (XGBoost) algorithm and Boruta permutation method [7,12]. With computational 32 advances, these and other methods are more robust, efficient, and accessible to 33 quantitative researchers than ever before. However, these improvements don't address 34 the need to reconcile disparate findings from applying multiple distinct algorithm types 35 to biomedical data. For this task, it is useful to devise a formalized consensus approach 36 that leverages feature importance metrics across algorithms to arrive at a consensus forest, support vector machines (SVM), and XGBoost [11,13,14]. These represent a 51 variety of algorithm types, each with distinct assumptions, strengths, and weaknesses.

52
Random forest and SVM do not natively differentiate important from non-important 53 features, necessitating an importance or weight cutoff be set to identify the most 54 important features. By contrast, lasso and XGBoost perform penalized regression and 55 ensemble learning, respectively, which returns greatly restricted feature subsets, though 56 at the cost of feature exclusion bias (see Results). We addressed these issues by 57 bootstrapping Boruta permutations with a novel consensus importance metric based on 58 relative feature importance rank across these 4 algorithms.  in Low (binary risk group, BRG = 0), compared to Not-Low (BRG=1) clinical risk group. B Volcano plot of 62 differentially expressed genes (DEGs), x-axis is log2 fold-change, y-axis is -1 times log10 of unadjusted p-value 63 from t-tests, significance threshold (horizontal line) set at ¡0.01 p-adjusted and (vertical lines) -log2FC-¿1. 64 C Heatmap of DEG expression (Z-score of normalized expression) with sample-wise clinical annotations 65 ("cto" is primary ctogenetic subtype). (https://gdc.cancer.gov/). In brief, raw reads were aligned to GRCh38 using STAR 82 aligned in 2-pass mode and gene counts were produced using the HTSeq-counts analysis 83 workflow with Gencode v22 annotations. Full details of the data processing pipeline can 84 be found at the GDC ('https://docs.gdc.cancer.gov/Data/'). The GDC file manifest are 85 included in (Supplemental Table 4). Gene counts were then normalized using trimmed 86 mean of M (TMM) values method and converted to log2 counts per million (CPM, [18]). 87  We focused on the "Risk Group" variable from the patient clinical data table. This 93 variable is an aggregate pertaining to a combination of risk of recurrence, progression, 94 and relapse ( [1]). Patients were categorized as either low or not-low (e.g. standard or 95 high) risk, and this categorization, called binarized risk group (BRG), was used in the 96 machine learning investigation. Patients missing data for risk group were excluded from 97 the analysis. BRG sample groups were approximately balanced according to important 98 demographic variables, including age at first diagnosis and gender (Table 1).

99
Differentially Expressed Genes (DEGs) 100 To reduce noise and false positive rate, we opted to exclude genes with low expression 101 levels and which demonstrated significant differential expression in a contrast between 102 the binarized risk groups in the training data subset using the voom function from the 103 limma Bioconductor package ( [11]). With this pre-filter, we identified N = 1,998 104 (9.33% retained) differentially expressed genes (DEGs) showing substantial mean 105 differences between risk groups (absolute log2 fold-change ¿= 1, adj. p-value ¡ 0.05). 106

107
Optimizations 108 We trained and tested gene expression-basd models for predicting BRG using a variety 109 of algorithms, including two types of ensemble approaches (random forest and 110 XGBoost), a kernel-based classifier (Support Vector Machines or SVM), and penalized 111 regression (lasso). These algorithms quantify feature importance in the following ways: 112 1. Lasso assigns beta-value coefficient (positive, negative, or null/0) for use in penalized 113 regression; 2. SVM assigns a feature weight (positive or negative) for inclusion in 114 kernel-based estimator; 3. XGBoost assigns importance (positive or null/0) from gain 115 across splits; 4. Random forest assigns importance using mean decrease in Gini index 116 (positive value).

117
With each algorithm type, we fitted models by varying algorithm hyperparameters 118 (Table 1, Figure 2, and Results). For Random Forest, we varied the number of trees 119 (ntrees) from 2,000 to 10,000. For XGBoost, we varied training depth and repetitions. 120 For SVM, we varied the kernel type to be linear or radial, and the weight filter to be 121 none or 50%. For lasso, we varied the alpha value to be from 0.8-1.2 (Table 1 column 3). 122 These runs informed hyperparameters used in each of the 4 algorithms with bootstraps 123 of Boruta permutations (Supplemental Material, Figure 4).

125
To test accuracy of sample labels and quantify possible miss-classification, we performed 126 permutation tests with risk label reassignment. For each algorithm, the training dataset 127 class labels were randomly permuted (switched) 5000 times, such that each patient in 128 the training set was randomly assigned to, the class label switching allows one to infer 129 that the feature contribution for correct classification is not likely due to chance.  Table 4 for download manifest). The majority of analysis was conducted 142 using the R programming language with packages from Bioconductor and CRAN 143 repositories ( [7,[11][12][13][14], Supplemental Methods). Pediatric AML RNA-seq and clinical 144 data were bundled into SummarizedExperiment objects for convenience (Supplemental 145 Materials). Scripts, notebooks, code, and data objects are available online (website).

148
This study focused on whether gene expression could be used to predict pediatric AML 149 risk, as defined using the classical cytogenetic and molecular classification scheme [3]. 150 We initially identified TARGET pediatric AML patients with primary blood or bone 151 cancer samples (N = 137 patients) and defined a binarized risk group (BRG) classifier 152 as either low risk or not-low risk, where the latter category combines "standard" and 153 "high" risk patients (Figure 1). Summary statistics indicated binarized risk was 154 approximately balanced for important demographic characteristics, including age at first 155 diagnosis, gender, and bone marrow leukemic blast percentage and peripheral blasts 156 (Supplemental Table 1). However, the "not-low" risk group had a significantly lower   Table 1). These algorithms include two 178 ensemble methods (random forest and XGBoost) two penalized methods (XGBoost and 179 lasso) and two unpenalized methods (SVM and random forest). These algorithms 180 quantify feature importance in the following ways: 1. Lasso assigns beta-value 181 coefficient (positive, negative, or null/0) for use in penalized regression; 2. SVM assigns 182 a feature weight (positive or negative) for inclusion in kernel-based estimator; 3.

183
XGBoost assigns importance (positive or null/0) from gain across splits; 4. Random 184 forest assigns importance using mean decrease in Gini index (positive value). For each 185 algorithm, we tested at least 3 distinct hyperparameter value sets (Table 1 column   respectively ( Figure 3C and Supplemental Figure 1). Interestingly, models fitted in later 216 iterations could recover substantial performance, and this trend was even more 217 exaggerated for XGBoost than lasso ablation iterations. This trend likley reflects signal 218 gain and loss of alternative predictive and related or correlated gene sets and pathways, 219 which are unrepresented in sets from earlier ablation iterations. As iteration increases, 220 gene members of alternate functional sets may be successively selected then exhausted, 221 resulting in initial performance recovery followed by successive performance loss. These 222 findings highlight the importance of carefully evaluating iterations of penalized methods 223 in biomedical research, and the utility of ablation tests.

230
We observed substantial correlated expression, both positive and negative, across 231 genes selected in the first 3 and 4 ablation iterations for lasso and XGboost, respectively 232 ( Figure 3C and Supplemental Figure 2). Correlated expression could result from direct 233 or indirect functional interactions or relatedness. We observed evidence for functional 234 similarity across these selected gene sets, especially shared HOX pathway membership. 235 Surprisingly, HOXA9 was selected in the first iteration of lasso ablation, but not until 236 the fourth iteration of XGBoost ablation ( Figure 3A, Supplemental Figure). HOXA9 is 237 known to be co-expressed in multiple pediatric AML subtypes, and its activity can be 238 used to predict patient risk [5,9] ( Figure 3A). We further note substantial positive 239 correlation between HOXA9 and the HOX family gene MEIS1, which was selected in 240 iteration 3 of lasso ablations. MEIS1 expression is linked to hematopoietic stem cell 241 development, and HOXA9-MEIS1 complexes were found to correlate with AML subtype 242 and outcome [8,17,19,20].  N = 1,000). 262 A Workflow calculating "nrank" consensus importance, or normalized median absolute importance rank, 263 across 4 algorithms (lasso, SVM, random forest, and XGBoost). B Feature (green is confirmed, red is 264 rejected, yellow is tentative) and shadow feature-wise (blue lines) importance (rank, y-axis) across Boruta 265 permutations (x-axis, max = 100). C Upset plot of recurrent confirmed features (present in at least 20% 266 or 200/1,000 bootstraps) across Boruta bootstrap analyses with 5 distinct importance metrics (XGB = 267 XGBoost importance, SVM = SVM importance, Nrank = consensus importance nrank, Lasso = lasso 268 importance, RF = random forest importance). Red genes and data are shared across consensus, lasso, and 269 random forest runs, purple is confirmed genes unique to consensus run, blue is confirmed genes shared only 270 by consensus and lasso runs.