Dimensionality reduction by sparse orthogonal projection with applications to miRNA expression analysis and cancer prediction

Background: High dimensionality, i.e., p > n, is an inherent feature of many machine learning applications, including miRNA expression analysis. Fitting a classification model directly to p-dimensional data risks overfitting and a loss of accuracy, so dimensionality reduction is needed before classification.

Results: We present a novel dimensionality reduction method which uses sparse, orthogonal projections to discover linear separations in a reduced dimension space. The technique is applied to miRNA expression analysis and cancer prediction. We use least squares fitting and orthogonality constraints to find a set of orthogonal directions which are highly correlated with the class labels. We also enforce L1 norm sparsity penalties to prevent overfitting and remove uninformative features from the model. Our method offers highly competitive classification performance on synthetic examples and real miRNA expression data when compared to similar methods from the literature which use sparsity ideas and orthogonal projections.

Discussion: The technique introduced here, which uses sparse, orthogonal projections for dimensionality reduction, is shown to be highly effective in reducing the dimension of miRNA expression data. The application of focus in this article is miRNA expression analysis and cancer prediction. The technique may be generalizable, however, to other high dimensionality datasets.

controls [19,16], e.g., using a t-test [9], which can yield poor separation, as is the case in figure 1. In this case, the projection planes are spanned by either w = (1, 0) or w = (0, 1), and both features, namely miRNA1 and miRNA2, have an R = .10 correlation with the class labels.

which merges some cases and controls along the oscillating boundary. In figures 2e and 2f we see that the feature selection methods, FR and EFC, successfully remove the noisy variables, and offer the same 2-D projection (i.e., the same set of two features is selected by FR and EFC). However, the spin top is misoriented in the 2-D projection, which yields some overlap between cases and controls, and thus a reduction in classification accuracy. In figure 2d, the OPLS projection is corrupted by the noise, which causes a significant overlap between cases and controls. We see only mild noise effects and misorientation of the spin top in figure 2c using SOP, which outperforms the methods from the literature across all metrics considered.

2.2. Results - Japanese data. Here we present our results on the miRNA expression data of [19,16,14], collected from Japanese patients. This data is discussed in point (1) of section A.5.

Table 1. Japan results. In the table, ∼1 and ∼0 indicate that the score is one or zero, respectively, to two significant figures.

In this example, SOP, OPLS, and FR perform well in terms of the classification accuracy, AUC, and specificity, with SOP offering the best performance in terms of mean sensitivity and F1 score. The standard deviations of the F1 scores and sensitivities offered by SOP are less than or equal to those of FR and OPLS, indicating that SOP offers a more consistent performance in terms of F1 score and sensitivity. The box plots of figure 3 also support this. In particular, OPLS has an outlier sensitivity score of TPR = .25, which corresponds to the colorectal cancer classification. FR offers a significantly higher sensitivity of TPR = .50 on the colorectal cancer classification, with SOP offering the best sensitivity of TPR = .58. SOP has a longer lower AUC tail in the box plots, when compared to OPLS and FR, which is an advantage for OPLS and FR over SOP.

This data set contains more cases than controls, whereas the Japanese data was weighted towards controls. In section 2.2, it was harder to achieve high sensitivity given the weighting towards controls.

In this case, the specificity is of concern, given the weighting towards cases. This example provides further evidence that SOP is well suited to data sets weighted towards one class. In this example, OPLS and SOP offer the best overall performance across all metrics.

The technique may be applicable to other high dimension data sets (e.g., gene expression arrays). Further analysis is needed, however, to validate the performance on other high dimension data sets.

In further work we aim to test the effectiveness of SOP in the case when n_c > 3, as only n_c = 2, 3 are considered here. We suspect that the classification accuracy will decrease with increasing n_c, as it is more difficult to find linear separability for multiple classes simultaneously. Further testing is needed to confirm this hypothesis.

Appendix A. In this section we introduce our dimensionality reduction strategy, SOP, and discuss the data sets that will be used for testing. We also discuss the methods from the literature that we compare our method against, and introduce a new, synthetic data set to test the performance of the proposed method and those of the literature.

A.1. Dimensionality reduction methodology. Throughout this paper, we let X = (x_1, . . . , x_p) ∈ R^{n×p} denote a matrix of input variables (e.g., miRNA expression data), and y ∈ {0, . . . , n_c − 1}^n a vector of class labels, where n_c ≥ 2 is the number of classes, p is the dimension of the data (i.e., the number of variables), and n is the number of samples.

We let k < p denote the number of features extracted after dimensionality reduction. We aim to perform dimensionality reduction by right multiplication of X with a sparse, orthogonal matrix W ∈ R^{p×k}, and extract a set of features T ∈ R^{n×k} via

(A.1) T = XW.
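For concreteness, the projection (A.1) amounts to the following minimal NumPy illustration; the random X and W below are placeholders for an expression matrix and a fitted projection, not outputs of SOP:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 2000, 2                        # samples, variables, extracted features
X = rng.normal(size=(n, p))                   # placeholder expression matrix
W = np.linalg.qr(rng.normal(size=(p, k)))[0]  # a p-by-k matrix with orthonormal columns

T = X @ W                                     # (A.1): reduced features, shape (n, k)
assert T.shape == (n, k)
assert np.allclose(W.T @ W, np.eye(k))        # orthonormal columns: W^T W = I_k
```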
The calculation of W can be broken down into two steps. First, we find

(A.2) W = argmin_{W ∈ R^{p×k}} ||Y − XW||_F^2 + α ||W^T W − I_k||_F^2 + β ||W||_1,

where Y = (y, y, . . . , y) ∈ R^{n×k} (i.e., y repeated k times), ||·||_F is the Frobenius norm, and ||W||_1 = Σ_{i,j} |W_{ij}| is the entrywise L1 norm. Then we decompose W via its singular value decomposition, W = UΣV^T, and reassign W to the closest orthonormal matrix, W ← UV^T. The least squares term in (A.2) fits the input variables X to the response Y, and finds w_i for which Xw_i has high correlation with y.

The α term is included to ensure orthogonality of W. In practice, we set α to some large value, orders of magnitude greater than the maximum expression, so that W is hard constrained to be orthogonal. The β term is included to enforce sparsity and prevent overfitting. This follows the idea that many of the variables will have little predictive quality, which is typically the case in miRNA expression analysis and cancer prediction [4]. Hence, the L1 penalty is included to remove the noisy, less informative variables from consideration in the model.

We now give specific details on the implementation of SOP. To solve the objective in (A.2) we use the L-BFGS-B algorithm. To initialize L-BFGS-B we use the classical solution to the orthogonal Procrustes problem [7]. Specifically, we implement the algorithm, SOP, below:

(1) Set the input matrix X, the label matrix Y, and the number of features to be extracted, k. Set α and β.

(2) Center and normalize the input data X. For example, in section 2, the data was normalized to the cube [−1, 1]^p and centered to have mean zero.
(3) Solve the objective (A.2) for W using L-BFGS-B, initialized with the classical solution to the orthogonal Procrustes problem [7].
(4) Reassign W to the closest orthonormal matrix, W ← UV^T, where W = UΣV^T is the singular value decomposition of W.
(5) Extract the reduced dimension features T = XW, as in (A.1).

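A minimal Python sketch of steps (1)-(5) is given below. It assumes NumPy/SciPy; the quadratic form of the orthogonality penalty in (A.2), the subgradient treatment of the L1 term, and the default L-BFGS-B stopping criteria are illustrative assumptions, not a reference implementation:

```python
import numpy as np
from scipy.optimize import minimize

def sop(X, y, k, alpha=1e6, beta=1e2):
    """Illustrative sketch of SOP: fit a sparse, near-orthogonal W, then
    project onto the closest orthonormal matrix and extract T = X W."""
    n, p = X.shape
    y = np.asarray(y, dtype=float)
    Y = np.tile(y.reshape(-1, 1), (1, k))      # Y = (y, ..., y), k copies of y

    # Initialization: classical orthogonal Procrustes solution to
    # min ||Y - XW||_F s.t. W^T W = I, namely W0 = U V^T where X^T Y = U S V^T.
    U0, _, V0t = np.linalg.svd(X.T @ Y, full_matrices=False)
    W0 = U0 @ V0t

    def objective(w):
        W = w.reshape(p, k)
        R = Y - X @ W                          # least squares residual
        G = W.T @ W - np.eye(k)                # orthogonality violation
        f = (R ** 2).sum() + alpha * (G ** 2).sum() + beta * np.abs(W).sum()
        # analytic gradient; sign(W) is a subgradient of the L1 term
        grad = -2.0 * X.T @ R + 4.0 * alpha * W @ G + beta * np.sign(W)
        return f, grad.ravel()

    res = minimize(objective, W0.ravel(), jac=True, method="L-BFGS-B")
    W = res.x.reshape(p, k)

    # Step two: reassign W to the closest orthonormal matrix via its SVD
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    W = U @ Vt
    return X @ W, W                            # features T = XW (A.1), and W

```

The final SVD reassignment makes the projection matrix orthonormal to machine precision, as discussed in section A.9.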
The SOP code detailed above is available from the authors upon request.

To clarify the classification process in section 2.1: for each method the number of features was set to k = 2. After the dimension reduction was performed, we fit a K nearest neighbors classifier with K = 11 neighbors. Then, we generated a test data set (size n = 4500) from the spin top distribution of figure 2a, input the test data into our model, and calculated the classification metrics discussed in section A.3 for each method.
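As a sketch of this pipeline, assuming scikit-learn and the sop() function from the sketch above; the Gaussian toy data below stands in for the spin top distribution, which is not reproduced here:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(1)
n, p = 600, 50                                   # toy stand-in, not the spin top data
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=n) > 0).astype(int)
X_train, X_test, y_train, y_test = X[:300], X[300:], y[:300], y[300:]

T_train, W = sop(X_train, y_train, k=2)          # sop() from the sketch above
T_test = X_test @ W                              # project test data with the same W

knn = KNeighborsClassifier(n_neighbors=11)       # K = 11, as in section 2.1
knn.fit(T_train, y_train)
print("accuracy:", accuracy_score(y_test, knn.predict(T_test)))
print("AUC:", roc_auc_score(y_test, knn.predict_proba(T_test)[:, 1]))
```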

For SOP, we set α = 10^6 and β = 10^2 (see equation (A.2)). For SPCA we set λ = λ_{1,j} = λ_{1,i} (the sparsity parameter) for every 1 ≤ i, j ≤ k (using the notation of [21]), and chose λ to give the best results in terms of AUC. When implementing DROP-D, we set b = 2, and chose the w (using the notation of [8]) which gave the best results in terms of AUC. We set b = 2 as this was shown to produce good results in the examples conducted in [8].
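As an illustration of choosing a sparsity parameter by AUC, the sketch below uses scikit-learn's SparsePCA as a stand-in for the SPCA of [21] (its alpha playing the role of λ); the candidate grid, the reuse of the arrays from the previous sketch, and the KNN classifier are assumptions:

```python
import numpy as np
from sklearn.decomposition import SparsePCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

best_auc, best_lam = -np.inf, None
for lam in (0.1, 1.0, 10.0):                     # assumed candidate grid for λ
    spca = SparsePCA(n_components=2, alpha=lam, random_state=0).fit(X_train)
    knn = KNeighborsClassifier(n_neighbors=11).fit(spca.transform(X_train), y_train)
    auc = roc_auc_score(y_test, knn.predict_proba(spca.transform(X_test))[:, 1])
    if auc > best_auc:
        best_auc, best_lam = auc, lam
print("selected λ:", best_lam, "AUC:", best_auc)
```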

A.5. Real data sets. We consider the following four data sets from the literature for real data testing:

(1) Serum miRNA expression data of [19,16,14]
(2) The miRNA expression data of Keller et al. [9]
(3) The miRNA expression data of Lee et al.
(4) The miRNA expression data of Chan et al.

Given the distinct clustering among the cancer patients and the controls, it is reasonable to consider a multi-class classification, whereby we aim to classify patients into n_c = 3 classes simultaneously. In this case, the three classes are controls, breast cancers, and advanced breast cancers.

EFC is only defined when n_c = 2. To remedy this, and generalize EFC to n_c > 2, we fit a per-class model with parameters (w_j, b_j), where j ∈ {0, 1, . . . , n_c − 1} is the class label, t ∈ R^k is a sample in the reduced dimension space (e.g., one row of T), and the (w_j, b_j) are weights and biases to be trained.

For validation we use hold out set validation. That is, we randomly and uniformly set aside a subset of n_t < n samples from the data, fit a model to the remaining n − n_t samples, and validate the performance on the hold out set. The process is then repeated n_T times and the results are averaged. We set n_t = 300 and n_T = 50, so in total there are n_T × n_t = 15000 test trials for each classification performed on data set (1).
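A minimal sketch of this repeated hold out validation follows; fit_and_score is an assumed callable that trains a model (e.g., SOP followed by KNN) on the training split and returns a metric on the held out split:

```python
import numpy as np

def holdout_validate(X, y, fit_and_score, n_t=300, n_T=50, seed=0):
    """Repeated hold out set validation: set aside n_t samples uniformly at
    random, fit to the remaining n - n_t samples, score on the held out set,
    and repeat n_T times, averaging the results."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    scores = []
    for _ in range(n_T):
        test = rng.choice(n, size=n_t, replace=False)
        train = np.setdiff1d(np.arange(n), test)
        scores.append(fit_and_score(X[train], y[train], X[test], y[test]))
    return np.mean(scores), np.std(scores)
```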

A.9. Hyperparameter selection and normalization methods used for real data testing. Throughout the experiments conducted here on real miRNA expression data, all data were normalized to the cube [−1, 1]^p and centered with mean zero. For SOP, we set α = 10^3 throughout the simulations conducted in this section. α = 10^3 is such that the orthogonality constraint of (A.2) is satisfied to 3 significant figures, which is sufficient, since in step two of SOP, discussed in section A.1, the projection matrix is reassigned to the closest orthonormal matrix using the SVD. Thus, the matrix used in the SOP dimensionality reduction is orthonormal to machine precision. In general, we find that setting α = 10^m yields a W in the first step of SOP which is orthogonal to m significant figures of accuracy. For SPCA, we used the soft thresholding approach here to calculate the components, as is recommended in [21] for the case when p > n, and for gene expression arrays. All data sets considered in this section satisfy p > n.
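One plausible reading of the preprocessing described above, as a sketch (whether centering precedes or follows the rescaling is not specified, and centering may move values slightly outside the cube):

```python
import numpy as np

def preprocess(X):
    """Rescale each variable into [-1, 1], then center each variable to mean zero."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    X = 2.0 * (X - lo) / np.maximum(hi - lo, 1e-12) - 1.0  # map into [-1, 1]^p
    return X - X.mean(axis=0)                              # mean-zero columns
```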

In the DROP-D implementation, we set b = 2, and w = 5 (using the notation of [8]).

Here we present our results on the miRNA expression data published by Keller et al. in [9]. This data is discussed in point (2) of section A.5. We see an improved accuracy and specificity using OPLS when compared to SOP, whereas SOP offers an improved sensitivity and a more consistent AUC. As was the case with the Japanese data, discussed in section 2.2, SOP offers better sensitivity than the methods from the literature. Ultimately, in this case, the optimal classifier depends on the situation.

For example, we may wish to screen patients for risk of cancer, and decide who should go forward for further testing (e.g., imaging). In this case we do not want to miss any patients who have cancer or may go on to develop cancer (i.e., we want high sensitivity), and SOP would be most appropriate. In other cases, for example if we are recommending that patients have surgery based on the predicted risk of cancer, we would like to avoid recommending surgery for patients who do not have cancer (i.e., we want high specificity).

These results are presented in table 7, where we show the results using SOP, with k = 3 fixed, on all data sets (1)-(4) discussed in section A.5. We see a very comparable performance to the best k values when k = 3 is fixed throughout, and SOP retains its performance advantages when compared to OPLS, FR, EFC, SPCA, and DROP-D.

Table 7. Results using SOP with k = 3 fixed throughout all classifications. The Japan and Keller et al. data results are presented as in tables 1 and 5, and show the mean and standard deviation results across all classifications conducted in each case. The Lee et al. and Chan et al. data results are presented as in tables 3 and 6.

Choosing the best k for each method, as in the main results, also allows the feature selection methods, namely FR and EFC, a chance to perform well, as often more features than k = 3 are needed to achieve high performance. For example, in [19], a selection of miRNA panels are compared, with k ranging from k = 1 up to k = 10. In that paper, the larger k panels (e.g., k = 8) offered an improved performance when compared to the panels with smaller k (e.g., k = 3).

Appendix C. Analysis of [19,16,14].

Figure 8. TSNE plot of Japan data and an expression histogram of hsa-miR-4783-3p, which is highly correlated (R = .91) to the separation between high risk and low risk controls. The left-hand mode of the miR-4783-3p histogram corresponds to the low risk controls, and the right-hand mode to the high risk controls.

their own class, far from the cancers and controls. We find that the control separation is largely due to hsa-miR-4783-3p, which has a high correlation (R = .91) with the high risk/low risk control separation. In figure 8b we have plotted a histogram of the hsa-miR-4783-3p expression values across all control patients. We see a clear bimodal distribution, with one peak corresponding to the low risk controls (the left-hand peak), and one to the high risk controls (the right-hand peak). Such separation in the control set is not reported in [19,16], nor is hsa-miR-4783-3p selected as a biomarker in the cancer predictions. It would seem that hsa-miR-4783-3p is worth investigating further. However, we choose to leave this for further work, as the focus of this paper is testing the performance of the new dimensionality reduction methodology.

Figure 9. TSNE plot of the Japanese control set and breast cancer samples. We have highlighted the breast cancer samples of [19,16] in green, and the advanced breast cancer sample of [14] in blue.