Ensemble Feature Selection and Meta-Analysis of Cancer miRNA Biomarkers

The role of microRNAs (miRNAs) in cellular processes captured the attention of many researchers, since their dysregulation is shown to affect the cancer disease landscape by sustaining proliferative signaling, evading program cell death, and inhibiting growth suppressors. Thus, miRNAs have been considered important diagnostic and prognostic biomarkers for several types of tumors. Machine learning algorithms have proven to be able to exploit the information contained in thousands of miRNAs to accurately predict and classify cancer types. Nevertheless, extracting the most relevant miRNA expressions is fundamental to allow human experts to validate and make sense of the results obtained by automatic algorithms. We propose a novel feature selection approach, able to identify the most important miRNAs for tumor classification, based on consensus on feature relevance from high-accuracy classifiers of different typologies. The proposed methodology is tested on a real-world dataset featuring 8,129 patients, 29 different types of tumors, and 1,046 miRNAs per patient, taken from The Cancer Genome Atlas (TCGA) database. A new miRNA signature is suggested, containing the 100 most important oncogenic miRNAs identified by the presented approach. Such a signature is proved to be sufficient to identify all 29 types of cancer considered in the study, with results nearly identical to those obtained using all 1,046 features in the original dataset. Subsequently, a meta-analysis of the medical literature is performed to find references to the most important biomarkers extracted by the methodology. Besides known oncomarkers, 15 new miRNAs previously not ranked as important biomarkers for diagnosis and prognosis in cancer pathologies are uncovered. Such miRNAs, considered relevant by the machine learning algorithms, but still relatively unexplored by specialized literature, could provide further insights in the biology of cancer.


Introduction
1 Several studies have shown the properties of microRNA types (miRNAs) as oncogenes 2 and tumor suppressors [1][2][3]. Since then, many sophisticated techniques, such as 3 high-throughput technologies, microarray, mass spectrometry and especially the Next 4 Generation Sequencing (NGS), have been developed for their identification [4]. However, 5 it is clear that the development of computational tools is needed for the interpretation 6 of results from these high-throughput experiments [5]. Indeed, computational assisted 7 methods are used for the identification of miRNAs from different genome organisms, for 8 example in Caenorhabditis briggsae [6] and in Epstein-Barr virus (EBV or HHV4), a 9 member of the human herpesvirus (HHV) [7]. Furthermore, several computational 10 techniques can be applied to accurately predict miRNA expressions, as seen for example 11 in [8]. 12 Succeeding the earliest evidence of miRNA involvement in human cancer by Croce 13 and collaborators [9], various studies demonstrate that miRNA expression is deregulated 14 in human cancer through diverse mechanisms [10]. Additionally, in comparison to the 15 impractical and invasive methods currently used for cancer diagnosis [11,12], miRNA 16 biomarkers can be detected directly from biological fluids (such as blood, urine, saliva 17 and pleural fluid [13]), and they can also be used as biomarkers to detect tumors at an 18 early stage, which is extremely important for survival. For example, the 5-year survival 19 rate for lung cancer is 5%, but an early diagnosis can boost it to almost 50% [14]. Thus, 20 miRNA expression profiles correlate with clinical variables, highlighting their potential 21 value as prognostic and/or diagnostic tools. 22 In such a context of increasing availability of data, it is of utmost practical 23 importance to build databases of miRNA expressions data for cancer research [15][16][17][18][19], 24 and also to extract features that could be used as cancer biomarkers [20][21][22]. For 25 example, miRNA hsa-mir-21 is mentioned as a marker for patients with squamous cell 26 lung carcinoma [23], with astrocytoma [24], breast cancer [25], and gastric cancer [26]. 27 Following this idea, the scientific community is currently looking for miRNA signatures, 28 representing the minimal number of miRNAs to be measured for discriminating between 29 different stages and types of cancer. 30 Current NGS technologies such as Applied Biosystems, SOLiD3,or HiSeq from 31 Illumina are able to extract thousands of components in genome sequences [27], and 32 traditional linear statistical analysis are not suited to manage such quantities of 33 measured elements with non-lineal relationships to extract meaningful features. Thus, a 34 suitable solution, is to use machine learning techniques for analysis, classification, and 35 relevant features extraction of miRNA data [28][29][30]. 36 Starting from a dataset containing 8,129 patients, 29 different types of cancer, and most relevant miRNAs to use as biomarkers for cancer classification. Typically, 39 classifiers trained on a dataset will not use the whole set of available features to 40 separate classes, but just a subset which could be ordered by relative importance, with 41 a different meaning given to the list by the specific technique. The top 100 biomarkers 42 in the list are then evaluated as a potential reduced signature for classification. Finally, 43 the top 50 miRNAs are compared to a meta-analysis of the medical literature, to 44 validate the results automatically produced by the machine learning algorithms. 45 Unsurprisingly, most of the miRNAs identified by the classifiers are also considered 46 important by the specialized literature: 15 of them, however, are still understudied, and 47 they could thus represent promising leads for future exploration. 48 The rest of the paper is organized as follows. The target dataset and the proposed 49 approach are detailed in Section 2. Experimental results are reported in Section 3, while 50 Section 4 concludes the paper. 51 2 Methods

52
The considered dataset, containing miRNA sequencing isoform values, is taken from the 53 Cancer Genome Atlas 1 . The database contains the information from 8,129 patients. 54 Using the next-generation sequencing miRNASeq BCGSC IlluminaHiSeq miRNASeq 55 Level 3, a total of 1,046 miRNA expression features for each case study are extracted. 56 In summary, the dataset that will be used in the following experiments has 29 types of 57   As the objective is to find and validate a reduced list of miRNAs to be used as a 66 signature, feature selection is to be performed on the dataset. Popular approaches to 67 feature selection range from univariate statistical considerations, to iterated runs of the 68 same classifier with a progressively reduced number of features, in order to assess the 69 contribution of the features to the overall result. As the considered case study is 70 particularly complex, however, relying upon simple statistical analyses or a single 71 classifier might not suffice. Following the idea behind ensemble feature selection [31][32][33], 72 we use multiple algorithms to obtain a more robust predictive performance. For this 73 purpose, we train a set of classifiers to then extract a sorted list of the most relevant 74 features from each. As, intuitively, a feature considered important by the majority of 75 classifiers in the set is likely to be relevant for our aim, the information from all 76 classifiers is then compiled to find the most common relevant features. 77 Starting from a thorough comparison of 22 different state-of-the-art classifiers on the 78 considered dataset presented in [34], in this work a subset of those classifiers is selected 79 considering both (i) high accuracy and (ii) a way to extract the relative importance of 80 the features from the trained classifier. After preliminary tests to set algorithms' 81 hyperparameters, 8 classifiers are chosen, all featuring an average accuracy higher than 82 90% on a 10-fold cross-validation: 83 • All considered classifiers are implemented in the scikit-learn Python toolbox [43]. 92 Overall, the selected classifiers fall into two broad typologies: those exploiting 93 ensembles of classification trees [44] (Bagging, GradientBoosting, RandomForest), 94 and those optimizing the coefficients of linear models to separate classes   Table 2 compares the classification accuracy of each classifier using the full 1,046 119 features, with the accuracy obtained by the same classifier using a signature composed 120 by selected 100 features. It is interesting to notice how the accuracy is, for most cases, 121 unchanged, providing empirical evidence that a 100-miRNA signature is enough to 122 obtain good classification results.
123 Figure 2 shows a heatmap comparing the relative frequency of the overall top 100 124 most frequent features, for each considered classifier. As expected, not all classifiers use 125 the same features to separate the types of cancer, and thus using their consensus proves 126 to be more robust than just relying upon a single algorithm. It is interesting to notice 127 that while the overall most common biomarkers appear among the top for each RidgeClassifier is forced to use the top 100 features; while BaggingClassifier 136 seems to be overall unaffected by the restriction (see Table 2). One classifier, SVC, even 137 slightly increases its average accuracy, probably due to the fact that the search space 138 defined by the 100-feature signature is easier to explore for its optimization procedure. 139 Algorithm 1: Extracting the 100 most relevant features from the considered medical dataset.
1 Normalize dataset by feature; 2 Divide dataset in N folds; 3 Select K classifiers; 4 for each fold n of N do 5 for each classifier k of K do Train classifier k n on all folds minus n, using all features; Test classifier k n on fold n; Obtain sorted list l kn of features from k n ; Assign weight w f nk to each f of the 1,046 features; 6 for each feature f of F do if f is among the top 100 features in l kn then 9 Select 100-feature signature, from features with highest S f ; 10 for each fold n of N do 11 for each classifier k of K do Train classifier k n on all folds minus n, using signature; Test classifier k n on fold n; 12 Compare performance of classifiers using all features and signature; either for cancer diagnosis, prediction, or prognosis by Muller [3], and can be measured 147 by body fluids. Ferracin [2] mentions a total of 10, with 6 not appearing in [3]: 148 hsa-let-7b, hsa-let-7f-1, hsa-let-7i, hsa-mir-29c, and hsa-mir-145. In total, 21 miRNAs 149 identified by the proposed approach are found among the two surveys.

150
The most important result of this work is that several miRNAs types highly ranked 151 by the classifiers appear to be still understudied in literature. Table 3 reports a 152 summary of the literature meta-analysis, from which it clearly emerges that 15 of the 153 miRNA types appear in less than 50 references connected to cancer from the PubMed 154 query: strikingly, hsa-135-a-1 appears only once, and hsa-103-1 only twice, while both 155 are considered important by more than 60% of the classifiers. These two miRNAs could 156 thus represent promising leads for future research. Interestingly, only 25 of the miRNA 157 types found by the proposed machine learning approach also appear among the most 158 expressed in the baseline analysis summarized in Figure 1, and only 4 (hsa-mir-9-1, 159 hsa-mir-28, hsa-mir-103-1, and hsa-mir-199b) are both overexpressed and understudied, 160 suggesting that simple statistical approaches might not be enough to extract meaningful 161 information from such complex data.  hsa-mir-10b, the feature with the highest S f score, while not appearing in the surveys, 165 is currently under clinical trials as oncomarker for Glioma [45], and it is mentioned in a 166 recent article as possible oncomarker [46]. hsa-mir-135a-1 and hsa-mir-135a-2, located 167 inside chromosomes 3 and 12, respectively, generate the same mature active 168 sequence [47]. hsa-mir-9-1, hsa-mir9-2, and hsa-mir9-3, generate the oncogenic 169 hsa-mir-9 [48]. hsa-mir-944 at the moment is known to decrease malignant features in 170 gastric [49], colorectal [50] and endometrial [51] cancers. 171 We envision that understudied miRNAs could play a key role in the biology of 172 cancer. On the other hand, the presence of single-nucleotide polymorphism (SNP) could 173 affect miRNA biology due to alterations in the miRNA maturation process, their structure, and expression levels. From these, creation and loss of miRNA targeted sites 175 by SNPs is the most inspected area [52,53]. However, the miRNA-mediated oncogenic 176 transcriptional landscape could be a consequence of the SNPs presence in the seed 177 region of mature miRNAs [54], that is involved in the molecular recognition with its 178 targeted mRNAs. Although the presence of SNPs in seed regions seems to be negatively 179 selected [55], in a pathological condition as cancer, it would be interesting to perform   Table 4. hsa-mir-146b, hsa-mir-199a-2, hsa-mir-1245, hsa-mir-328, hsa-mir-124-2, hsa-let-7a-3, 224 hsa-let-7d, hsa-mir-139, hsa-let-7e, hsa-mir-101-1, hsa-mir-374b, hsa-mir-223, 225 hsa-mir-335, hsa-mir-124-1, hsa-mir-125b-1, hsa-mir-23a.