Abstract
Background Machine learning (ML) is the application of specialized algorithms to datasets for trend delineation, categorization or prediction. ML techniques have been traditionally applied to large, highly-dimensional databases. Gliomas are a heterogeneous group of primary brain tumors, traditionally graded using histopathological features. Recently the World Health Organization proposed a novel grading system for gliomas incorporating molecular characteristics. We aimed to study whether ML could achieve accurate prognostication of 2-year mortality in a small, highly-dimensional database of glioma patients.
Methods We applied three machine learning techniques, artificial neural networks (ANN), decision trees (DT) and support vector machines (SVM), together with classical logistic regression (LR), to a dataset consisting of 76 glioma patients of all grades. We compared the effect of applying the algorithms to the raw database against a database in which only statistically significant features were included in the algorithmic inputs (feature selection).
Results Raw input consisted of 21 variables and achieved the following performance (accuracy/AUC): 70.7%/0.70 for ANN, 68%/0.72 for SVM, 66.7%/0.64 for LR and 65%/0.70 for DT. Feature-selected input consisted of 14 variables and achieved 73.4%/0.75 for ANN, 73.3%/0.74 for SVM, 69.3%/0.73 for LR and 65.2%/0.63 for DT.
Conclusions We demonstrate that these techniques can also be applied to small, yet highly-dimensional datasets. Our ML techniques achieved reasonable performance compared with similar studies in the literature. Though local databases may be small compared with larger cancer repositories, we demonstrate that ML techniques can still be applied to their analysis, though traditional statistical methods offer similar benefit.
Introduction
Gliomas are a heterogeneous class of tumors comprising approximately 30% of all brain malignancies1. Previously, the World Health Organization (WHO) grading system stratified them by histological origin (i.e. astrocytoma, oligodendroglioma, mixed oligoastrocytoma and ependymoma), with additional grading (I-IV) according to pathological features of aggression. In 2016, the WHO presented a novel classification system incorporating molecular biomarkers including IDH1/IDH2 mutations2, MGMT methylation3, p53 and PTEN deletion4,5, EGFR amplification6, 1p/19q deletions7,8, 9p (p16) deletions9 and Ki67 index10. The phenotypic expression of these markers by a glioma carries unique prognostic11 and therapeutic implications6,7,11–14. Moreover, the prognostic implications of the interplay between a tumor possessing more than one molecular marker and a patient's baseline clinical and demographic status are not fully understood15,16. Existing prognostic systems separate patients into "low-grade" (i.e. WHO I,II) or "high-grade" (i.e. WHO III,IV) groups, and incorporate additional clinical features such as performance status, age and tumor size13,17–21 into their stratifications. Though some newer studies have incorporated limited molecular classification features22, it is clear that older prognostic indices are likely to become obsolete in the "molecular medicine" era.
Machine learning (ML) is a subset of computer science whereby a computer algorithm "learns" from prior experience. Using specified training data with known input and output values, the ML algorithm is able to devise a set of rules which can be used as predictors for novel data with similar input characteristics23. Previously, a human investigator would have to approach data collection and analysis using a set of a priori assumptions to avoid the burden of collecting data irrelevant to their hypothesis. The risk of this approach is that potentially meaningful trends driven by disregarded variables go unnoticed. ML lends itself naturally to trend-delineation in large, unprocessed datasets24. It can also be used for clinical prediction using known inputs and desired outputs (e.g. mortality). Moreover, when implemented on a local database, ML-derived prognosticators may take into account unique features of the local population and treatment infrastructure, making them potentially more useful than evidence from non-contiguous populations. Local databases may be considerably smaller than large-scale cancer repositories, limiting their "academic study" in the context of the literature, but potentially providing the local clinician with a wealth of meaningful clinical information.
Bearing these factors in mind, we aimed to apply a selection of ML algorithms to a database of 76 glioma cases in order to devise a 2-year mortality predictor. The complex histological and molecular pathological features of gliomas, combined with a series of clinical prognosticators such as performance status, age and treatment techniques25, make them an ideal multi-dimensional application for machine learning techniques. Additionally, given our database's characteristics, we aimed to compare the performance of ML algorithms using an unprocessed dataset with that achieved using a dataset in which only statistically significant variables had been pre-selected.
Machine Learning Methods
Logistic Regression
Logistic regression (LR) (Figure 1A) is a traditional statistical method used for binary classification and has been adopted as a basic ML model. It differs from linear regression (Figure 1B) in that it uses a sigmoidal (logistic) curve, delineating a boundary between two categories. As in linear regression, the logistic function is derived from a weighted transformation of the data points. The regression function thus categorizes novel inputs into one of two categories based upon which side of the boundary their coordinates fall.
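As a minimal illustrative sketch (not the study's code), the logistic function below maps a weighted combination of inputs to a probability between 0 and 1, with 0.5 serving as the classification threshold; the weights and input values are hypothetical.

```python
import numpy as np

def logistic(z):
    """Sigmoid (logistic) function: maps any real value to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical example: two features combined by learned weights and a bias term
weights = np.array([0.8, -1.2])
bias = 0.3
x = np.array([0.5, 0.25])

probability = logistic(np.dot(weights, x) + bias)
predicted_class = int(probability >= 0.5)  # thresholding at 0.5 separates the two categories
```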
Support Vector Machines
Support vector machines (SVM) (Figure 1C), like logistic regression, assign training examples to one of two categories, with a separating "hyperplane" dividing the data-points. Unlike logistic regression, however, the optimal hyperplane is the one producing the largest margin of separation between the two categories, and (when a non-linear kernel is used) its shape may not be defined by a simple function. The algorithm is tasked with finding the data points ("support vectors") that define the hyperplane and its coefficients. The trained function can then categorize novel input values into the group falling on either side of the hyperplane, in a manner similar to logistic regression.
Artificial Neural Networks
Artificial neural networks (ANN) (Figure 1D) are so called because they are loosely modeled upon the layered organization of biological neurons. The input and output values occupy the outermost, opposing layers of the network, while the inner "hidden" layers consist of successive transformations of the input values. ANNs can be used for clinical prediction by training on a set of data consisting of known input values paired with known output values. The algorithm learns from the training set by progressively adjusting the transformations applied to the inputs; the values of these transformed inputs are then used by the model to predict outputs for novel cases.
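To make the idea of "successive transformations" concrete, the toy forward pass below (not the network used in this study) pushes a hypothetical input vector through one hidden layer and an output layer; the weights are randomly initialized placeholders.

```python
import numpy as np

def relu(z):
    """Rectified linear activation applied within the hidden layer."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Output activation: converts the final transformation into a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """One forward pass: input layer -> hidden layer -> output probability."""
    hidden = relu(np.dot(W1, x) + b1)        # hidden layer: transformed inputs
    return sigmoid(np.dot(W2, hidden) + b2)  # output layer: predicted outcome

# Hypothetical, randomly initialized weights for 3 inputs and 4 hidden neurons
rng = np.random.RandomState(0)
W1, b1 = rng.randn(4, 3), np.zeros(4)
W2, b2 = rng.randn(1, 4), np.zeros(1)
print(forward(np.array([0.2, 0.7, 1.0]), W1, b1, W2, b2))
```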
Decision Trees
Decision tree (DT) (Figure 1E) algorithms split data into binary categories over progressive iterations. The algorithm aims to find the optimal features and thresholds at which to split the data, creating a "branching-tree" shaped diagram. Each node represents a point at which the data are split, and the "leaves" at the end of the tree are the output values. As the method involves binary splits, categorical data are preferred, and non-categorical data are ideally discretized prior to input.
Methods
Study Population
Our study population consisted of 76 patients (40 females, 36 males) with WHO grade I-IV gliomas presenting to the neurosurgical oncology service at [Redacted] from 2009-2017. At the end of the 2-year follow-up period, 52 patients were alive and 24 had died. Mean age for the whole population at diagnosis was 47.3 years (SD 16.8). Interventions included total or subtotal resection (as stated by the operating surgeon), stereotactic biopsy, gamma knife therapy or no intervention. Other information collected included radiological maximum tumor diameter (cm), CNS tumor location (lobe), pre-operative and post-operative Eastern Cooperative Oncology Group (ECOG) performance status (0-5), subsequent chemotherapy, radiotherapy or vaccine therapy, and whether more than one surgical intervention was performed. Surgical histopathology data included the presence of EGFR amplification, PTEN deletion, p53 mutation, 1p deletion, 19q deletion, 9p (p16) deletion, IDH1/IDH2 mutations, MGMT methylation, and Ki67 proliferation index*.
Ethical Considerations
This study was approved by the Institutional Review Board at the [Redacted]. All subjects consented to the use of their de-identified data in this study.
Study Design
Due to the relatively small number of subjects in our database (n = 76) and the high dimensionality of the data, with 21 variables, we adopted two approaches to ML for this population (Figure 2). The first was to apply the algorithms to the raw dataset, for which input variables had not been pre-selected. The second was to apply χ2 tests (for categorical variables) and independent-samples t-tests (for continuous variables) to the dataset, as outlined by Oermann et al. (2013)26. As this involved a number of independent statistical tests, Bonferroni correction was applied subsequently. Fourteen variables were thus identified for which there was a significant difference between subjects who survived 2 years and those who did not. Non-significant variables were excluded from the input.
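A minimal sketch of this feature-selection step is shown below, assuming the data are held in a pandas DataFrame `df` with a binary 2-year mortality column `dead_2yr`; the variable names are hypothetical placeholders rather than the study's actual field names.

```python
import pandas as pd
from scipy import stats

# Hypothetical column names; the study's actual variable list differs.
categorical_vars = ["idh_mutation", "mgmt_methylation", "total_resection"]
continuous_vars = ["age", "max_tumor_diameter_cm", "ki67_index"]

# Bonferroni correction: divide alpha by the number of independent tests performed
alpha = 0.05 / (len(categorical_vars) + len(continuous_vars))

selected = []
for var in categorical_vars:
    # Chi-square test on the contingency table of variable vs 2-year mortality
    table = pd.crosstab(df[var], df["dead_2yr"])
    chi2, p, dof, expected = stats.chi2_contingency(table)
    if p < alpha:
        selected.append(var)

for var in continuous_vars:
    # Independent-samples t-test between survivors and non-survivors
    alive = df.loc[df["dead_2yr"] == 0, var]
    dead = df.loc[df["dead_2yr"] == 1, var]
    t, p = stats.ttest_ind(alive, dead)
    if p < alpha:
        selected.append(var)
```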
Data Collection, Information Encoding and Dataset Splitting
The raw data were collected using Microsoft Excel (Microsoft Corporation, Redmond, U.S.A.) and parsed using the Python 2.7 programming language with custom-written code. We used binary notation for dichotomous variables (i.e. 'yes' = 1, 'no' = 0). Categorical and continuous variables (e.g. Ki67 index and age at diagnosis) were scaled to values between 0 and 1. The continuous variables were age, maximum tumor diameter and Ki67 index. Categorical variables were total resection, performance status (ECOG), lobe/area of brain affected and WHO grade. There were 3 subjects whose surgical pathology results were unavailable. Instead of discarding these from the analysis, we assigned a value of 0.5 for each affected variable (e.g. IDH1/2, PTEN etc.). All feature vectors were then normalized.
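The helper functions below sketch the encoding scheme described above; the handling of missing pathology values and the final unit-length normalization are our interpretation of the description, not the authors' original code.

```python
import numpy as np

def encode_binary(value):
    """'yes'/'no' responses to 1/0; unavailable pathology results are assigned 0.5."""
    if value is None:
        return 0.5
    return 1.0 if str(value).lower() == "yes" else 0.0

def scale_01(values):
    """Min-max scale a continuous variable (e.g. age, Ki-67 index) to the 0-1 range."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

def normalize_rows(X):
    """Normalize each subject's feature vector (one row per subject) to unit length."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / norms
```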
ML Algorithms
All ML and logistic regression models were imported from the SciKitLearn27 library. Each model was run for 15 cycles. Each cycle consisted of a training and a testing stage, with the dataset re-partitioned at every cycle. Within a cycle, the same subject was never used for both training and testing. The numbers of deceased and surviving participants in the training and testing sets did, however, vary between cycles.
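A sketch of this repeated partitioning scheme is shown below. The feature matrix `X`, outcome vector `y` and per-cycle classifier `model` are placeholders, and the 40-subject training split reflects the training-set size mentioned later in the Discussion; the authors' exact partitioning code is not shown.

```python
from sklearn.model_selection import train_test_split

n_cycles = 15
accuracies = []
for cycle in range(n_cycles):
    # Re-partition the dataset each cycle so no subject is used for both training and testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=40, random_state=cycle)
    model.fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))

mean_accuracy = sum(accuracies) / float(n_cycles)
```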
Artificial Neural Network Method
Our ANN method utilized a single layer of neurons between the input and output layers. This intermediate layer contained 100 neurons, and training used a mini-batch size of 5. The network was trained for 1000 epochs using the Adam optimizer28 with the default learning rate of 0.001. Briefly, the Adam optimizer is an algorithm for first-order gradient-based optimization that extends stochastic gradient descent.
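A scikit-learn configuration consistent with these parameters is sketched below; the training and testing arrays are assumed from the partitioning step above, and the authors' exact implementation is not shown.

```python
from sklearn.neural_network import MLPClassifier

# Single hidden layer of 100 neurons, mini-batches of 5 samples,
# Adam optimizer with the default 0.001 learning rate, up to 1000 epochs (max_iter).
ann = MLPClassifier(hidden_layer_sizes=(100,),
                    solver="adam",
                    learning_rate_init=0.001,
                    batch_size=5,
                    max_iter=1000)
ann.fit(X_train, y_train)
y_pred = ann.predict(X_test)
```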
Decision Tree Method
The criterion used to split each node was the Gini index29, a standard impurity measure for selecting splits in DT applications30. This represents a more intuitive approach than randomly selecting the criteria at which to split the data. The minimum number of samples for each leaf was 1, while the minimum number of samples required to split a node was 2.
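A scikit-learn configuration consistent with these parameters is sketched below (again assuming the training and testing arrays from the partitioning step); it is illustrative rather than the authors' exact code.

```python
from sklearn.tree import DecisionTreeClassifier

# Gini impurity as the split criterion, with a minimum of 1 sample per leaf
# and 2 samples required to split an internal node.
dt = DecisionTreeClassifier(criterion="gini",
                            min_samples_leaf=1,
                            min_samples_split=2)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
```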
Support Vector Machine Method
Our SVM model used a radial basis function kernel, with a C penalty parameter of 100 and a gamma value of 0.1.
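A scikit-learn sketch consistent with these parameters follows; enabling probability estimates (`probability=True`) is our addition to support the ROC analysis described later, and is not stated in the original methods.

```python
from sklearn.svm import SVC

# Radial basis function kernel with the reported penalty and kernel width.
svm = SVC(kernel="rbf", C=100, gamma=0.1,
          probability=True)  # probability estimates assumed here for ROC analysis
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
```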
Logistic Regression
The penalty used was the l2 norm, the C parameter was 150.0 and the optimization algorithm used was coordinate descent.
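A scikit-learn sketch consistent with these parameters is shown below; the 'liblinear' solver is our assumption, chosen because scikit-learn documents it as using a coordinate descent algorithm.

```python
from sklearn.linear_model import LogisticRegression

# L2 penalty with C = 150; 'liblinear' assumed as the coordinate-descent solver.
lr = LogisticRegression(penalty="l2", C=150.0, solver="liblinear")
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
```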
Data Processing
The averaged output values from the 15 cycles were then tabulated into standard 2×2 confusion matrices, from which the sensitivity, specificity, positive likelihood ratio (PLR), negative likelihood ratio (NLR), positive predictive value (PPV), negative predictive value (NPV) and overall accuracy were calculated. 95% confidence intervals were calculated for all metrics. Receiver operating characteristic (ROC) curves and the area under the ROC curve (AUROC) were additionally calculated and tabulated using the "roc_curve" function imported from the SciKitLearn library. To facilitate comparison between accuracy (a percentage) and AUC (a proportion), AUC results were multiplied by 100.
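The sketch below illustrates how these metrics can be derived for a single cycle with scikit-learn; `model`, `y_test` and `y_pred` are the placeholders used in the earlier sketches, and the aggregation across cycles and the confidence-interval calculations are omitted.

```python
from sklearn.metrics import confusion_matrix, roc_curve, auc

# 2x2 confusion matrix for the binary 2-year mortality outcome
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = tp / float(tp + fn)
specificity = tn / float(tn + fp)
ppv = tp / float(tp + fp)
npv = tn / float(tn + fn)
plr = sensitivity / (1 - specificity)
nlr = (1 - sensitivity) / specificity
accuracy = (tp + tn) / float(tp + tn + fp + fn)

# ROC curve and AUC from predicted probabilities of the positive (deceased) class
y_scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
auroc = auc(fpr, tpr) * 100  # scaled to 0-100 for comparison with accuracy percentages
```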
Results
Comparison of Diagnostic Performance
For raw data, the ANN method performed best in terms of sensitivity (81.54%), followed by the SVM (79.31%), LR (76.75%) and DT (73.65%) methods. Using a feature-selected dataset, sensitivity decreased for DT (68.93%), ANN (78.39%) and LR (74.26%), but increased slightly for SVM (80.54%). Using a feature-selected dataset, specificity increased for all methods, with ANN showing the largest increase (+11.62%) and DT the smallest (+7.56%). Using a feature-selected versus a raw dataset, all methods demonstrated a performance increase in terms of PPV (SVM = +7.69%; ANN = +7.08%; LR = +7.03%; DT = +5.87%), while all (DT = −6.21%; ANN = −3.79%; LR = −3.37%) except SVM (+0.54%) demonstrated a decrease in NPV. Likewise, ANN (+0.42), SVM (+0.36), LR (+0.28) and DT (+0.14) demonstrated an increase in PLR using the feature-selected dataset. In terms of NLR, all methods (SVM = −0.10; LR = −0.04; ANN = −0.02) aside from DT (+0.55) demonstrated a decrease. All methods demonstrated an increase in accuracy using the feature-selected dataset (SVM = +5.38%; ANN = +2.71%; LR = +2.62%; DT = +0.17%). Finally, feature selection increased performance, as represented by AUC, for all methods (LR = +8.58%; ANN = +6.21%; SVM = +2.83%) aside from DT, which demonstrated a decrease in AUC (−7.54%) (Figure 3)†.
Receiver Operating Characteristic Curves and Confidence Intervals
When comparing raw-data ROC curve performance to that of the y=x chance line, with an area of 0.5 (50), the SVM (AUC = 71.88) demonstrated the best performance, followed by DT (AUC = 70.54), ANN (AUC = 69.19) and LR (AUC = 64.29). Though these values exceeded 50, the 95% confidence intervals (C.I.) for both ANN (C.I. = 49.86 − 88.52) and LR (C.I. = 43.63 − 84.95) included 50, indicating non-significance. Although the lower C.I. boundaries for the SVM (C.I. = 53.40 − 90.36) and DT (C.I. = 51.62 − 89.46) algorithms exceeded 50, they did so only marginally. The feature-selected dataset provided a performance increase for all but the DT algorithm, which demonstrated a decrease in AUC. The performance benefit was indicated by higher AUC values, with ANN (AUC = 75.40) performing best, followed by SVM (AUC = 74.71), LR (AUC = 72.87) and DT (AUC = 63.00). Using feature-selected data also yielded overall narrower 95% C.I., with all methods aside from DT demonstrating a lower C.I. boundary at least 10 units above 50, indicating performance significantly better than chance, in contrast to the raw-data models. Nevertheless, for both feature-selected and raw data, none of the ML methods demonstrated a significant performance improvement over LR, nor over one another (Figure 4)‡.
Discussion
We have successfully demonstrated the application of three machine learning techniques and an ML-implemented logistic regression technique to a database of 76 glioma patients of all grades, molecular phenotypes and heterogeneous clinical characteristics. Relative to older prognostic indices, which do not incorporate molecular features, our study involves considerably fewer subjects. We accomplished our goal of applying ML techniques to this database despite its relatively low subject-to-variable ratio, and furthermore we demonstrate that machine learning can be applied with a reasonable level of confidence to make prognostic inferences from these data.
Comparison to Similar Studies
In the neuro-oncology literature, much of the focus of ML application has been directed towards discerning characteristic magnetic resonance imaging (MRI) features of CNS tumors (discussed later). Only one non-imaging focussed study has utilized ML for glioma outcome prediction31, while the study of Oermann et al. (2013) utilized a similar methodology for cerebral metastasis prognostication. The study by Malhotra et al. (2016) applied a novel data mining algorithm to extract relevant features pertaining to treatment and molecular patterns in a database of 300 newly diagnosed glioblastoma multiforme cases. The ML component of their study involved the extraction of relevant treatment and pathological features, which were then classified and subjected to classical statistical methods for prognostication. This is in effect the opposite of our feature-selected approach, as those authors utilized data mining to extract relevant features, whereas we conducted statistical tests of significance for feature selection prior to ML input. They achieved maximal C-values of 0.85 using LR and 0.84 using Cox multivariate regression. The study of Oermann et al. (2013), though pertaining to cerebral metastases rather than gliomas, utilized a methodology similar to our feature-selected approach to prognosticate 1-year survival in a total of 196 patients. In that study, the pooled voting results of 5 independent ANNs (AUC = 84%) significantly outperformed traditional LR methods (AUC = 75%). Further, they found that ML techniques were more accurate at predicting 1-year survival than two traditional prognostic indices. As our study used data from gliomas of all grades, we did not compare our results to existing prognostic indices, which specifically separate patients into low- and high-grade categories. Even using a feature-selected approach, our best-performing algorithm (coincidentally also an ANN) achieved an AUC approximately 10 units lower than their ANN approach. We suspect this is due to two reasons. Firstly, their training set consisted of 98 patients, over twice the size of our training set of 40, offering more examples to learn from. Secondly, their method utilized only 6 input variables, versus 21 for our raw approach and 14 for our feature-selected approach. It is therefore likely that the greater subject-to-variable ratio of their dataset also enhanced the predictive performance of the ML algorithm. From this, it is apparent that smaller datasets may require pre-censoring prior to ML application if predictive performance is to be maximized. Furthermore, we cannot conclude that, for small, highly-dimensional datasets, ML approaches including ANN, DT or SVM offer any significant performance advantage over traditional LR methods. Nevertheless, we achieved reasonably good predictive metrics with all ML approaches when using pre-input variable censoring.
Future Directions
ML algorithms have been intuitively applied to "data-rich" MRI sequences in an effort to quantitatively discern characteristic imaging features of gliomas, exploiting differences between the tumor area and normal brain32–36. These methods have yielded the ability to discern imaging features indicating the presence of MGMT methylation35, IDH1 mutation37–39 and 1p/19q co-deletion38. These breakthroughs potentially allow for the non-invasive identification40,41 and even prognostication41,42 of gliomas according to imaging characteristics alone. It is not unreasonable to suggest that the next generation of prognostic indices will be derived from a combination of clinical database mining techniques, such as our present study, with novel techniques of image-based ML. This would represent a substantial step forward, as previous prognostic systems relied upon invasive methods for definitive diagnosis, prognostication and treatment planning. It may also permit clinicians to predict, non-invasively and with high accuracy, the clinical course of low-grade tumors.
Another potential future direction for ML in neuro-oncology is the creation of a local repository of cases that can be used in much the same way as the study populations utilized for developing previous prognostication systems. Depending upon the integrity and scale of the localized database, predictions can be made with reasonable accuracy, as we have demonstrated in the present study. The information contained in a localized oncology repository may be of greater benefit to the clinician than data from a non-contiguous population, as localized population characteristics and demographics can be discerned and incorporated into individually-tailored therapeutic approaches.
Limitations
Though we achieved acceptable predictive performance using feature-selected data, our study has highlighted potential difficulties in applying ML to smaller, highly-dimensional clinical databases. Pre-censoring of data may optimize ML algorithms in studies using a smaller subject set; however, the censoring of particular variables may result in weaker trends going unnoticed. We also anticipate that predictive accuracy and AUC would improve by increasing the number of subjects included in the training set. Despite this, our study used a training set less than half the size of that of Oermann et al. (2013) yet achieved only slightly weaker prognostic performance. An alternative to the manual testing of variables for significance is principal component analysis43, an automated, unsupervised method of reducing dataset dimensionality that reduces the scope for human error or bias.
Conclusions
Though the "data mining" of "big data" is a popular phrase amongst both the medical and lay media, we have demonstrated that ML techniques are applicable to small yet highly-dimensional datasets. As clinical approaches to gliomas adapt to the molecular-medicine era, the small size of a local database does not present a barrier to the implementation of ML techniques for prognostication. Though our study was purely academic, it demonstrates the potential for ML to provide meaningful insight into the diagnosis and treatment of these heterogeneous tumors at a local level.
Acknowledgements
We would like to thank Yue-Fang Chang PhD, at the University of Pittsburgh for her assistance with this project.