Using a supervised principal components analysis for variable selection in high-dimensional datasets reduces false discovery rates

High-dimensional datasets, where the number of variables p is much larger than the number of samples n, are ubiquitous and often render standard classification and regression techniques unreliable due to overfitting. An important research problem is feature selection: ranking candidate variables by their relevance to the outcome variable and retaining those that satisfy a chosen criterion. In this article, we propose a computationally efficient variable selection method based on principal component analysis. The method is very simple, accessible, and suitable for the analysis of high-dimensional datasets. It allows us to correct for population structure in genome-wide association studies (GWAS), which would otherwise induce spurious associations, and it is less likely to overfit. We expect our method to accurately identify important features while reducing the false discovery rate (FDR, the expected proportion of erroneously rejected null hypotheses) by accounting for the correlation between variables and by de-noising the data in the training phase, which also makes it robust to outliers in the training data. Being almost as fast as univariate filters, our method allows for valid statistical inference. The ability to make such inferences sets this method apart from most current multivariate statistical tools designed for today's high-dimensional data. We demonstrate the superior performance of our method through extensive simulations. A semi-real gene-expression dataset, a challenging childhood acute lymphoblastic leukemia (CALL) gene expression study, and a GWAS that attempts to identify single-nucleotide polymorphisms (SNPs) associated with rice grain length further demonstrate the usefulness of our method in genomic applications.

Author summary

Feature selection is an integral part of modern statistical research and has led to various scientific discoveries, especially in emerging genomics applications such as gene expression and proteomics studies, where data have thousands or tens of thousands of features but a limited number of samples. In practice, however, for want of suitable multivariate methods, researchers often resort to univariate filters when dealing with a large number of variables. These univariate filters do not take the dependencies between variables into account because they assess variables one by one. This leads to loss of information, loss of statistical power (the probability of correctly rejecting the null hypothesis) and potentially biased estimates. In this paper, we propose a new variable selection method. Being computationally efficient, our method allows for valid inference. The ability to make such inferences sets this method apart from most current multivariate statistical tools designed for today's high-dimensional data.

Many new challenges have been raised over recent years by high-dimensional data, especially so-called 'large p small n' data with thousands or tens of thousands of features but a limited number of samples. Such data bring standard analysis tools to a halt or deteriorate the learning process, as the learning algorithm becomes prone to over-fitting due to the abundance of irrelevant and redundant features. This means that the good performance of a fitted model may be limited to the training data set while its performance on new, unseen samples may be very poor. An obvious challenge in such applications is feature selection, which is designed to filter out the features that have little to no correlation with the response variable based on the available data. Feature selection is an integral component of modern statistical research, and methods that are computationally less demanding and obtain reliable results are urgently needed.

Recently, considerable effort has been directed to identifying important variables that contribute useful information towards a response variable of interest. The approaches that are commonly used can be broadly divided into two categories: univariate filters and multivariate feature selection methods. Univariate filters independently assess each feature based on its relevance to the response and reduce the dimension by filtering out all those features that are not strongly associated with the response variable [for example, see 1,2,3]. Owing to their simplicity, interpretability and computational efficiency, filters are more commonly used in the analysis of high-dimensional datasets. One of the drawbacks of univariate filters, however, is that dependencies among variables are ignored because the variables are assessed one at a time. As a result, a substantial amount of information contained in the data is left under-utilized. Moreover, ignoring correlations between variables may lead to biased estimates and incorrect or underpowered inferences.

Multivariate variable selection methods overcome this drawback and use all the variables simultaneously to choose the best subset of variables. Many multivariate variable selection methods have been proposed. For example, the lasso regression [4] and its improved variants such as the elastic-net regression [5] are widely used because they produce sparse solutions and perform model selection simultaneously with parameter estimation. Random forest models, which use an ensemble of classification or regression trees, are also popular [6]. The algorithm returns measures of variable importance and has been used in many applications, including gene selection [7]. However, both lasso regression and random forest were designed for datasets of a moderate size (hundreds of variables) and rely on computationally cumbersome methods such as cross-validation for tuning penalty parameters. For high-dimensional data, screening techniques have been proposed; one such approach has been extended to logistic regression of case-control data [8]. Other similar screening techniques that accelerate the computation at the second stage of optimization for final variable selection and parameter estimation include SAFE rules [12] and strong rules [13].
However, these methods still suffer from the drawbacks of computational inefficiency and statistical inaccuracy at the second stage of the final variable selection if the number of filtered variables is large compared to the number of available samples.

Principal component analysis (PCA) has been used for dimension reduction in high-dimensional data because it is statistically coherent, computationally fast and scalable [14]. A disadvantage of using PCA as a dimension reduction technique is that each PC is a linear combination of all p variables, which makes the components hard to interpret. To alleviate this, Huang [17] introduced sparse PCA, which produces PCs with sparse loadings (a.k.a. coefficients) by forcing near-zero loadings to be exactly zero. The variables associated with the non-zero loadings in the first few PCs are then considered informative.

More recently, [18] adapted PCA to variable selection for high-dimensional genotype data coded 0, 1, or 2 based on the reference allele counts. These authors ranked the original variables based on their weights in the first two PCs. These approaches, however, are arguably too arbitrary to be used for general-purpose variable selection.

In this study we propose to use PCA for variable selection in a binary classification problem or a case-control study. We achieve this by using data from one class to fit a PCA model. The estimated parameters thus obtained are used to reconstruct the data from the other class. The variables that are not reconstructed accurately are considered to be important predictors for distinguishing between the two classes. The method is effective even for small sample sizes; the results, however, can be improved by accurately estimating the parameters in the training phase. Therefore, we use the class with the larger sample size for training so that the estimated parameters are as accurate as possible. In some cases it may be easier to acquire a larger sample size for one class than for the other; for example, healthy subjects may be abundant while diseased subjects are rare. Training a classification technique, such as penalized logistic regression or random forest, usually requires a sufficiently large number of samples in each class. In such case-control association studies we can use the data from the healthy subjects for fitting the PCA model. We also propose a simple procedure to obtain a p-value for each variable, since such a measure is relatively easy to interpret for decision-making purposes.
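To make the core idea concrete, here is a minimal R sketch on simulated data (the simulation and all object names are ours, not part of the original study): a PCA model is fitted to the larger class, the other class is reconstructed with the fitted loadings, and the variables with a large mean reconstruction error surface as candidate predictors.

    ## Toy illustration of the proposed scheme (simulated data; names are ours)
    set.seed(1)
    n1 <- 60; n2 <- 30; p <- 500
    Ym <- matrix(rnorm(n1 * p), n1, p)       # controls: the training class
    Yp <- matrix(rnorm(n2 * p), n2, p)       # cases
    Yp[, 1:10] <- Yp[, 1:10] + 2             # ten truly informative variables

    mu <- colMeans(Ym); s <- apply(Ym, 2, sd)
    Xm <- scale(Ym, center = mu, scale = s)  # standardized controls
    Xp <- scale(Yp, center = mu, scale = s)  # cases on the control scale

    V10 <- svd(Xm)$v[, 1:10]                 # loadings fitted to controls only
    Ep  <- Xp - Xp %*% V10 %*% t(V10)        # reconstruction error of the cases
    head(order(-abs(colMeans(Ep))), 10)      # mostly variables 1..10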

For illustration we focus on a genomics problem, emphasizing that our method is much more widely applicable. We evaluate our method through an extensive simulation study and consider its performance in a semi-real gene expression study where the true differentially expressed genes are known. We further evaluate the method through application to two publicly available high-dimensional datasets: a childhood acute lymphoblastic leukemia gene expression dataset and a genome-wide association study of rice grain length.

The first PC stands for the axis along which the observed data exhibit the largest variance; the second PC stands for the axis that is orthogonal to the first PC and along which the observed data exhibit the second largest variance; the third PC stands for the axis that is orthogonal to the first two PCs and along which the observed data exhibit the third largest variance, and so on. In this way, the p orthogonal dimensions of variability that exist in the data are captured in p PCs, and the proportions of variability that the PCs account for accumulate to the total variation in the data. The sole objective of PCA is to capture as much variation as possible in the first few PCs. It is therefore often the case that the first q (q ≪ p) PCs retain the conceivably useful information in the observed data while the rest contain variation mainly due to noise [22].

More formally, let y_ij denote a real-valued observation of the jth variable made on the ith subject, where i = 1, ..., n and j = 1, ..., p. Assume that the n observations are arranged in the n rows of an n × p data matrix Y, with columns corresponding to the p variables or features. We standardize the columns of Y to have zero mean and unit standard deviation, and store the resultant values in a data matrix X; that is, the elements x_ij of X are obtained as x_ij = (y_ij − ȳ_j)/s_j, where ȳ_j and s_j are the mean and standard deviation of the jth column of Y, respectively. The PCA can be performed by a singular value decomposition (SVD) of X; that is, the n × p matrix X of rank r ≤ min(n, p) is decomposed as

X = U Γ V^T, (1)

where U is an n × r orthonormal matrix (U^T U = I_r), Γ is an r × r diagonal matrix containing the r non-negative singular values in decreasing order of magnitude on its diagonal, and V is a p × r matrix with orthonormal columns (V^T V = I_r). Denote the sample correlation matrix of X by R, which can be expressed as

R = X^T X/(n − 1) = V Λ V^T, (2)

where Λ = Γ²/(n − 1) is an r × r diagonal matrix containing the r non-zero positive eigenvalues λ = (λ_1, ..., λ_r)^T of the matrix R on its diagonal in decreasing order of magnitude. It follows that the r columns of the matrix V contain the eigenvectors of X^T X and hence are the desired directions of variation. The derived set of r transformed variables (the PCs) is obtained as

Z = X V. (3)

It is important to note that the matrix U above contains the standardized PCs in its columns and is a scaled version of Z; to see this, multiply (1) on the right by V to obtain Z = X V = U Γ.

In practice, the first q ≪ r major components are of greater interest, since they account for most of the variation in the data. Without undue loss of information, the dimension of Z may thus be reduced from p to q; that is,

Z̃ = X Ṽ,

where Ṽ is the p × q matrix that contains the first q columns of V and Z̃ contains the first q PCs in its columns. The set of the first q PCs is a lower-dimensional representation of a p-dimensional dataset and can be used to uncover trends and patterns in the data, which is probably the most prevalent application of PCA. In addition, the most informative PCs can be used in subsequent analyses of the data [for example, see 26].
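The decomposition and the identities above are easy to verify numerically; a short sketch with base R's svd() on simulated data (object names are ours):

    ## PCA via SVD: checking eqs. (1)-(3) on simulated data
    set.seed(1)
    n <- 50; p <- 200
    X <- scale(matrix(rnorm(n * p), n, p))   # standardized columns

    sv <- svd(X)                             # X = U Gamma V^T, eq. (1)
    Z  <- X %*% sv$v                         # the PCs, eq. (3)
    all.equal(Z, sv$u %*% diag(sv$d))        # Z = U Gamma, as derived above

    lambda <- sv$d^2 / (n - 1)               # eigenvalues of R, eq. (2)
    all.equal(lambda[1:5],
              eigen(cor(X), only.values = TRUE)$values[1:5])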

However, there are concerns with this approach. First, like the p original variables, the q PCs (although informative, because they account for most of the variation in the data) are not easy to interpret [15,27], and misleading or imaginary interpretations are often attached to PCs [22]. Second, in high-dimensional datasets the level of noise due to the high dimensionality often obscures useful reproducible patterns (true structure rather than random association) in the data [28]. Third, the PCs are derived in an unsupervised manner, so they may not be optimal for further analyses such as regression. In such cases it is assumed that only a small number of variables are able to uncover interpretable patterns in the observed data, and variable selection (rather than extraction) methods are usually preferred.

A lower-rank approximation of X can be obtained as

X̂ = X Ṽ Ṽ^T = Z̃ Ṽ^T, (4)

which is the best approximation of X in the least-squares sense by a matrix of rank q [29,30]. The value of q is chosen using the available heuristic options, such as a plot of the eigenvalues (or singular values) in decreasing order of magnitude, known as a scree-plot, or the cumulative proportion of explained variance (CPEV), Σ_{k≤q} λ_k / Σ_{k≤r} λ_k, for which one chooses the smallest value of q such that the CPEV exceeds a pre-specified quantity φ between 0 and 1. Commonly used values of φ range from 0.7 to 0.9 [22]. A value of φ = 0.8 means that the chosen number of components q explains at least 80% of the total variance in the observed data.
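Continuing the sketch above, the CPEV rule and the rank-q reconstruction in (4) can be written as:

    ## Choosing q by the CPEV rule and reconstructing X, eq. (4)
    phi  <- 0.8
    cpev <- cumsum(lambda) / sum(lambda)
    q    <- which(cpev >= phi)[1]            # smallest q with CPEV >= phi

    Vq   <- sv$v[, 1:q, drop = FALSE]        # first q loading vectors
    Xhat <- X %*% Vq %*% t(Vq)               # best rank-q approximation of X

    ## the scree plot is the alternative heuristic mentioned in the text
    plot(lambda, type = "b", xlab = "component", ylab = "eigenvalue")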

PCA is computationally efficient and scalable to high-dimensional datasets. In this paper, we adapt PCA to variable selection in a classification problem, rather than its traditional use for variable extraction and thereby visualization. We achieve this by considering the error induced by the approximate reconstruction of an actual observation x_ij using the expression in (4); that is, the error is defined as

ε̂_ij = x_ij − x̂_ij, (5)

where x̂_ij are the values contained in the ith row and jth column of X̂. We store the elements ε̂_ij in an n × p residual matrix ε̂.

In a case-control study, assume that Y^− is an n_1 × p data matrix containing the data from the n_1 control samples and Y^+ is an n_2 × p data matrix containing the observations from the n_2 cases. We standardize the p columns of Y^− to have zero mean and unit standard deviation, and store the resultant values in a matrix X^−; that is, the elements x_ij^− = (y_ij^− − ȳ_j^−)/s_j^−, where ȳ_j^− and s_j^− are the mean and standard deviation of the jth column of the matrix Y^−, respectively. We decompose X^− as given in (1) and estimate the parameter V^−. We also standardize the columns of Y^+ using the means and standard deviations of the corresponding columns in Y^−, and store the resultant values in X^+; that is, the elements x_ij^+ = (y_ij^+ − ȳ_j^−)/s_j^−. We reconstruct X^+ by substituting the estimated parameter Ṽ^− in (4). The reconstruction error of X^+ is computed using the expression given in (5) as

ε̂_ij^+ = x_ij^+ − x̂_ij^+, (6)

where x̂_ij^+ is the low-rank approximation of x_ij^+. If a variable is differentially expressed between X^− and X^+, this will be reflected in the reconstruction error: the larger the reconstruction error, the larger the discriminative ability of that particular variable. We compute the mean of the jth column of ε̂^+ and denote it by t_j^+; that is,

t_j^+ = (1/n_2) Σ_{i=1}^{n_2} ε̂_ij^+. (7)

If a variable in X^+ is generated by a different model, then its corresponding t_j^+ will follow a different distribution. In high-dimensional problems only a small subset of variables usually turns out to be useful in terms of association with the response variable of interest. In such situations the realizations of t_j^+ associated with the majority of unimportant variables will behave similarly in distribution. The differential expression of a variable in the two groups, or the class-prediction ability of a variable, will be reflected by its associated value of t_j^+. In other words, the values of t_j^+ that correspond to variables that are differentially expressed between the two groups will deviate more from the center (zero) of the distribution of t_j^+. The further away a t_j^+ is from zero, the better its associated variable will perform as a predictor in classification. One can thus rank the p variables based on the magnitude of the statistic t_j^+, from most important to least important.

One can even calculate p-values associated with t_j^+ if the elements of ε̂^+ are assumed to be independent and distributed according to the Gaussian distribution; that is, ε̂_ij^+ ~ N(0, σ²), where σ² is a variance that needs to be estimated. This is a common assumption generally made when inference procedures are developed for a PCA model [for example, see 31]. If X^− and X^+ are generated by the same model, then the distribution of t_j^+ is easy to characterise; that is,

t_j^+ ~ N(0, σ²/n_2).

This turns the variable selection problem into an outlier identification problem: a variable that is associated with an outlying t_j^+ will have discriminative ability to differentiate between the groups when used in a predictive model such as logistic regression.

Note that using the sample variance as an estimator of σ² may obscure the outlying values of t_j^+, since the sample variance is not robust against outliers. It is important to use a robust estimate of σ such as the median absolute deviation (MAD), which has a 50% breakdown point [32], as opposed to the sample standard deviation, which has a 0% breakdown point. As noted above, in high-dimensional problems one usually does not expect a large number of significant variables (the MAD breaks down if the proportion of significant variables is larger than 50%). The MAD estimator of σ is

σ̂ = median_{i,j}(|ε̂_ij^+ − med(ε̂^+)|)/0.6745,

where med(ε̂^+) denotes the median of all the elements of ε̂^+.

The above procedure can also be used to identify potential discriminatory variables for the cases analysed one by one; such analyses, and the identification of subtypes of a disease, could be useful in precision medicine and personalized treatments [33,34]. The one-by-one analysis makes it possible to further identify those variables which are differentially expressed and are common (or uncommon) to all patients in the same group. This makes our procedure different from, and more advantageous than, the variable selection procedures that are commonly used in practice.
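In R, the MAD with the Gaussian consistency constant 1.4826 ≈ 1/0.6745 is available as mad(). The text leaves slightly open whether the MAD is taken over the pooled residuals or over the t_j^+ values; applying it to the pooled residual matrix, as the ordering of the algorithm steps below suggests, is our reading:

    ## Robust sigma and two-sided p-values (Ep from the toy sketch above)
    sigma_hat <- mad(Ep)                     # 1.4826 * median(|e - median(e)|)
    t_plus    <- colMeans(Ep)                # eq. (7)
    z         <- sqrt(nrow(Ep)) * t_plus / sigma_hat
    pval      <- 2 * pnorm(-abs(z))          # null: t_j^+ ~ N(0, sigma^2/n2)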

In the rest of this document we refer to our PCA-based approach as Variable Selection based on PCA (VSPCA) and formalize it in the following algorithm:

1. Standardize the columns of the control data Y^− to have zero mean and unit standard deviation, giving X^−.
2. Estimate the parameter V^− using the control sample X^− and the expression in (1).
3. Choose q and retain the first q columns of V^−, giving Ṽ^−.
4. Standardize the columns of Y^+ using the means and standard deviations of the corresponding columns of Y^−, and reconstruct the resulting X^+ using (4).
5. Compute ε̂_ij^+ using (6).
6. Estimate σ using the robust MAD estimator σ̂ given above.
7. Calculate t_j^+ using (7).
8. Compute a p-value for each variable from the null distribution t_j^+ ~ N(0, σ̂²/n_2); a small p-value indicates the inconsistency of the associated variable in X^+ with the one in X^− and hence flags it as an important variable.
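Pulling the steps together, a compact R sketch of the algorithm could look as follows; the function and argument names and the use of the CPEV rule in step 3 are our choices, and the authors' own implementation is the one given in Appendix A:

    ## A sketch of VSPCA; comments map to the numbered steps above
    vspca <- function(Ym, Yp, phi = 0.8) {
      ## steps 1-2: standardize controls and estimate V- by SVD, eq. (1)
      mu <- colMeans(Ym); s <- apply(Ym, 2, sd)
      Xm <- scale(Ym, center = mu, scale = s)
      sv <- svd(Xm)
      ## step 3: choose q by the CPEV rule
      lambda <- sv$d^2 / (nrow(Ym) - 1)
      q  <- which(cumsum(lambda) / sum(lambda) >= phi)[1]
      Vq <- sv$v[, 1:q, drop = FALSE]
      ## step 4: standardize cases with the control statistics and reconstruct
      Xp <- scale(Yp, center = mu, scale = s)
      Ep <- Xp - Xp %*% Vq %*% t(Vq)         # step 5: residuals, eq. (6)
      sigma_hat <- mad(Ep)                   # step 6: robust MAD estimate
      t_plus    <- colMeans(Ep)              # step 7: eq. (7)
      z <- sqrt(nrow(Yp)) * t_plus / sigma_hat
      list(q = q, t = t_plus, z = z,         # step 8: p-values from the null
           p.value = 2 * pnorm(-abs(z)))
    }

On the toy data introduced earlier, vspca(Ym, Yp) should flag the ten shifted variables with very small p-values.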

Note that in some circumstances one may wish to reduce the number of variables from p to a smaller number k (i.e., to filter out the p − k least important variables). This is also useful when the number of variables is small, in which case one does not have enough data to estimate σ, or when the required normality assumption for the error is not valid. In such cases step 8 of the algorithm can be dropped and the variables can be ranked based on the relative magnitude of t_j^+; those that correspond to the k largest magnitudes of t_j^+ can be selected as the superset of predictors, as in the sketch below. Note also that the p-values obtained via VSPCA need to be adjusted for multiple testing.
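For instance, with the sketch above, the rank-based filter and the multiple-testing adjustment of the next paragraph might look like:

    ## Rank-based filtering or FDR-adjusted selection (vspca, Ym, Yp as above)
    res <- vspca(Ym, Yp)
    k   <- 100
    top <- order(abs(res$t), decreasing = TRUE)[1:k]  # k most deviant variables

    padj <- p.adjust(res$p.value, method = "fdr")     # as recommended below
    sel  <- which(padj < 0.05)
    qqnorm(res$z); abline(0, 1)   # qq-plot diagnostic: bulk should hug the line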

Unless otherwise specified, all p-values reported in this paper are adjusted for multiple testing via the 'fdr' method [35] implemented in the p.adjust() function of the R 'stats' package. We recommend examining the quantile-quantile plot (qq-plot) of t_j^+ and the distribution of the unadjusted p-values before interpreting the selection results.

Simulation study

We used simulations to study the empirical performance of our method and to explore the circumstances in which it substantially dominates some state-of-the-art methods. We considered a simulation setting similar to the one created in [36] and later adopted in [37]. A range of simulation scenarios was considered based on combinations of the simulation parameters. Our method dominated under a moderate to strong value of ρ (see Fig 1); for smaller values of n_1 and ρ, a larger value of φ produced good results.

We test and illustrate our new variable selection procedure by applying it to the Platinum Spike [40] dataset. This is one of the largest semi-real datasets for which the true differentially expressed genes are known, and it is accessible through GEO Series accession number GSE21344. Hence, it has been widely used for benchmarking microarray analysis methods [for example, see 41, 42, and references therein].

The dataset was produced by a controlled experiment [40] with two experimental conditions, referred to below as condition-A and condition-B.

Although the sample size of the training set (whichever of condition-A and condition-B one wishes to choose as the training set) is very small (n_1 = 9), this dataset is worth considering here because of its benchmark status as described above [43]. The analysis of this dataset will help us understand the performance of VSPCA under this common scenario of a very small sample size.

We normalized the dataset using the Affymetrix MAS5 algorithm implemented in the mas5 function of the R package 'affy' [44]. We applied our variable selection procedure to the processed data. The PCA model was trained on the data produced under condition-B. The scree-plot is shown in Fig 3a and the CPEV plot is given in Fig 3b. The scree-plot showed a mild elbow at q = 3, and these three components together explained most of the variation in the training data.

To provide additional evidence of the performance of VSPCA, we chose a childhood acute lymphoblastic leukemia (CALL) high-throughput gene-expression dataset that is studied in detail by [45] and is accessible from the Gene Expression Omnibus (GEO).

We normalized the data using the Affymetrix MAS5 algorithm. The normalized data were then used to rank and identify genes that are associated with the class labels using our VSPCA method and the moderated t-test (MTT) implemented in the R package 'limma' [39]. We used the hyperdiploid subtype samples (n_1 = 44) as the control group to train our PCA model. The T.ALL subtype samples (n_2 = 36) were used as cases to identify genes that are differentially expressed relative to the hyperdiploid subtype samples. A scree plot based on the hyperdiploid subtype samples is shown in Fig 8, which suggested q = 7.

However, we used the proportion of explained variance criterion, which we believe protects against unreasonably small choices of q. We set φ = 0.8, which yielded q = 33 to reconstruct the test set of data (the cases group).

The qq-plot of t_j^+ is given in Fig 9. The superset of 582 genes selected using VSPCA showed higher discriminative ability than the much larger superset of 4,980 genes selected using MTT (see Fig 11). The PCA plot based on the superset selected using VSPCA is dominated by between-group variation (see Fig 11a). It can also be observed that the hyperdiploid samples (the training data) are tightly clustered together in comparison to the T.ALL samples (the test data). This is because we took into account 33 components and dropped the rest, which in general are believed to represent noise and the possible outlying effects of some of the observations. This makes the procedure robust to outliers and unusual observations in the training data. In contrast, the plot based on the superset selected using MTT is dominated by within-group variation (see Fig 11b). This indicates that the MTT performance is affected by the noise, which led to a higher FPR. As expected, our new approach can substantially reduce the FPR by accounting for correlation between variables and by dropping the minor components that are often believed to stand for noise and unusual observations in the data.
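A post-selection PCA plot of the kind just described takes only a few lines; a sketch in the toy-data naming used earlier (prcomp() and the colour coding are our choices):

    ## Post-selection PCA: first two PCs of the selected superset
    sel <- which(p.adjust(res$p.value, "fdr") < 0.01)
    pc  <- prcomp(rbind(Ym, Yp)[, sel], scale. = TRUE)
    grp <- factor(rep(c("control", "case"), c(nrow(Ym), nrow(Yp))))
    plot(pc$x[, 1], pc$x[, 2], col = grp, pch = 19,
         xlab = "PC1", ylab = "PC2")         # between-group separation expected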

To assess the stability of the above results and to further test the performance of VSPCA under a smaller n_2 (a smaller set of test data), we re-sampled the 36 T.ALL samples at a range of smaller values of n_2 and recorded the variables that were also significant under the smaller sample sizes in more than 50% of the replicates (see Fig 12). Note that the choice of 50% (> 500/1000) is an arbitrary threshold, but a similar pattern could be achieved by changing the threshold. The number of selected variables dropped much faster for MTT than it did for VSPCA as we reduced n_2.

With n_2 = 6, the superset of 4,980 genes selected through MTT dropped by 63.5%, whereas the superset of 582 genes selected through VSPCA dropped by only 21.6%. This not only shows the stability of VSPCA but also suggests that VSPCA could be more useful when the number of samples in the test class is small.

The VSPCA together with the post-selection PCA is useful when one is interested in examining the selected variables in more detail.

Fig 12: For all values of n_2 < 36 that were considered, we selected a sample without replacement from the test set of data and performed variable selection based on VSPCA and MTT using the nominal level of significance α = 0.01. The experiment was repeated 1000 times for each value of n_2 < 36. The variables that turned out to be significant under the smaller sample sizes and matched the ones selected under the entire available sample size for the test data were counted. Shown are the proportions of significant variables under the entire sample size that were also significant under smaller sample sizes in more than 50% (500/1000) of the replicates.

A similar resampling experiment was carried out for the second comparison; the corresponding figure shows the proportions of significant variables for the original complete sample with n_2 = 43 that were also significant under a smaller value of n_2 in more than 50% of the replicates.

Again, the number of selected variables dropped much more quickly for MTT than it did for VSPCA as we reduced n_2. This highlights the stability of VSPCA, as well as its usefulness when the test sample is small.

Figure caption: For all values of n_2 < 43 that were considered, we selected a sample without replacement from the test set of data and conducted variable selection based on VSPCA and MTT using the nominal level of significance α = 0.01. The experiment was repeated 1000 times for each value of n_2 < 43. The variables that were significant under the smaller sample sizes and matched the ones selected under the entire available sample size for the test data were counted. Shown are the proportions of significant variables under the entire sample size that were also significant under a smaller sample size in more than 50% (500/1000) of the replicates.
GWAS to identify SNPs associated with rice grain length

We used a publicly available High Density Rice Array (HDRA, 700k SNPs) dataset; for a detailed description of the dataset readers are referred to [46]. Note that our aim here was not to re-analyse the data but to show how well our method performs on very high-dimensional non-normal data.

We limited the genotype data to include the accessions for which phenotype data are available, and we examined the population structure with a pre-selection PCA. The PCA plots in [46] are reproduced in Fig 18; the categorization into sub-populations is in accordance with [46].
The problem of confounding by population structure in genome-wide association studies (GWAS) is widely recognized [47,26,48]: one needs to correct for the confounding induced by relatedness of the observations in order to minimize the FDR [47]. To correct for population stratification, we decompose the genotype matrix G using a singular value decomposition and reconstruct it using q = 10; that is, Ĝ = Ũ Γ̃ Ṽ^T, and we carry the structure-adjusted genotypes G̃ = G − Ĝ forward to the analysis.

The phenotype of the 1,127 varieties had a bimodal distribution (see Fig 19) with easily distinguishable peaks and a complete separation around the average grain length of 6 mm. Therefore, we partitioned the genotype data, contained in Y = G̃, into two classes based on the phenotype variable, such that the varieties of rice having an average grain length of less than 6 mm fell into class-A and those having an average grain length of more than 6 mm fell into class-B. We used the class-A data (n_1 = 692) to obtain the estimates of the parameters. We used the total-variance-explained criterion and set φ = 0.8, which led us to use q = 436 to reconstruct the genotype data from class-B (n_2 = 435). The quantiles of √n_2 t_j^+/σ̂ deviated from the expected quantiles under the null distribution N(0, 1) (see Fig 21), which indicated that some SNPs are associated with the phenotype of interest. This is equally reflected in the histogram of the unadjusted p-values. A locus on chromosome-3 was found to be highly associated with grain length in [46] and is also highly significant (p-value = 2.47e-26) in our analysis, together with some more significant SNPs in its close vicinity on chromosome-3. The locus corresponding to the peak on chromosome-5 was found to be associated with grain size [51] and is also mentioned in [46] for its possible association with grain length. Other well-pronounced peaks were located on chromosome-4 and chromosome-8, and these were also observed in [46].

PCA captures most of the variation in the observed data in the first few components; the remaining components are believed to stand for the peculiarities and unusual observations in the data, which is why PCA has been used as a core method to de-noise high-dimensional datasets. We expect our VSPCA to reduce the FDR through accounting for the correlation between variables and through de-noising the data by dropping a number of minor components in the training phase. Disregarding the minor components will reduce the individual-specific influence on the effect size, since some of them are believed to represent outliers in the data.
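A sketch of this correction as we read it; whether the genotype columns are centred before the SVD is not stated in the recovered text, so that detail is our assumption, and for a 700k-SNP matrix a truncated SVD (for example, from the 'irlba' package) would be preferable to a full svd():

    ## Remove the leading q components of the genotype matrix before VSPCA
    correct_structure <- function(G, q = 10) {
      Gc   <- scale(G, scale = FALSE)        # centre columns (our assumption)
      sv   <- svd(Gc)
      Ghat <- sv$u[, 1:q] %*% diag(sv$d[1:q]) %*% t(sv$v[, 1:q])
      Gc - Ghat                              # structure-adjusted genotypes
    }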

The VSPCA is simple and computationally efficient for high-dimensional datasets. For the computational time and memory requirements of the R implementation of the SVD, readers are referred to [54]; we reduce these requirements further by performing the SVD only on the training data. Modern variable selection approaches such as LASSO regression involve tuning parameters, and computationally cumbersome methods such as cross-validation are used to tune them. In contrast, our approach needs only a few lines of R code, as given in Appendix A, and unlike LASSO regression it does not rely on tuning parameters that need to be chosen using techniques such as cross-validation, which may not be justifiable for 'large p small n' data.

However, VSPCA requires one to choose the number of components, q (indeed, one can use q = r; however, we recommend dropping some minor components). We recommend choosing a value of φ ranging from 0.7 to 0.9 under small n_1, which is a popular heuristic approach to choosing q. This criterion usually favours larger values of q and is safer when n_1 is small. In our experience, changing the value of φ from 0.7 to 0.9 did not make a noticeable difference to the final results. Other heuristics, such as looking for an elbow point in a scree plot, could be used as well. However, we found the scree plot less safe because it favours smaller values of q, and an overly small value of q might have a large effect on the results. We found it safe to use a slightly larger value of q (compared to what the elbow point suggests), since this does not make a notable change to the results. The appropriate choice of q for PCA in general, and for VSPCA in particular, is also worthy of future investigation.