Abstract
Background Low-rank approximation is a useful approach for interpreting the features of a correlation matrix; however, a low-rank approximation may yield an estimate far from zero even when the corresponding original value is zero. In such cases, the results lead to misinterpretation.
Methods To overcome this problem, we propose a new approach for estimating a sparse low-rank correlation matrix based on threshold values combined with cross-validation. In the proposed approach, the MM algorithm is used to estimate the sparse low-rank correlation matrix, and a grid search is performed to select the threshold values related to sparse estimation.
Results Through numerical simulation, we found that the FPR and average relative error of the proposed method were superior to those of the tandem approach. In the application to microarray gene expression data, the FPRs of the proposed approach with d = 2, 3, and 5 were 0.128, 0.139, and 0.197, respectively, while the FPR of the tandem approach was 0.285.
Conclusions We propose a novel approach to estimate a sparse low-rank correlation matrix. The advantage of the proposed method is that it provides results that are easy to interpret and avoids misunderstandings. We demonstrated its superiority through both numerical simulations and real examples.
1 Background
To describe the linear relationship between two variables or subjects, a correlation matrix is calculated from multivariate data. For example, in the domain of genomics, a correlation matrix between genes is used in combination with the heatmap [Wilkinson and Friendly, 2009]. In addition, to classify sample profiles based on multivariate data, a correlation matrix between these samples is also used [Wei and Li, 2010]. However, the network structure underlying these correlation coefficients is masked because real data are observed with noise.
An approach to this problem is the use of low-rank approximation [ten Berge, 1993]. Various methods have been proposed for the estimation of low-rank correlation matrices [Pietersz and Groenen, 2004, Simon and Abell, 2010, Grubii and Pietersz, 2007, Duan et al., 2016], and this estimation has three advantages. First, it becomes easy to interpret the network structure of the correlation matrix because the variation of the low-rank correlation coefficients tends to be larger. Therefore, the heatmap of the estimated low-rank correlation matrix provides a readable visualization. Second, such a low-rank approximation can also describe a clustering structure [Ding and He, 2004], which helps to improve the interpretation. Finally, the values of the estimated low-rank correlation coefficients range from −1 to 1, in contrast to the result of singular value decomposition (SVD). Therefore, researchers can interpret the relationships easily because these values are bounded.
It is indeed the case that low-rank correlation matrices simplify these relations; however, several problems remain. First, the size of datasets continues to grow owing to improvements in information technology, so the number of coefficients that must be interpreted exceeds what a reader can reasonably grasp. Second, even if a true correlation coefficient is close to zero, the corresponding estimated low-rank correlation coefficient can be far from zero.
To overcome this problem, in this study, we propose a new approach to estimate a sparse low-rank correlation matrix. The proposed approach, combined with the heatmap, provides a visual interpretation of the relationships between variables. Sparse methods for the original correlation matrix and covariance matrix [Engel et al., 2017, Lam, 2020] are of two types. The first adds a sparsity penalty to the objective function [Bien and Tibshirani, 2011, Rothman, 2012, Xue et al., 2012, Cai et al., 2011, D’aspremont et al., 2008, Friedman et al., 2008, Rothman et al., 2008, Yuan and Lin, 2007, Cui et al., 2016]. The other uses thresholding values to achieve a sparse structure: [Bickel and Levina, 2008a] proposed the thresholding matrix estimator, and various related methods have been developed [Cai and Liu, 2011, Bickel and Levina, 2008b, El Karoui, 2008, Jiang, 2013]. In addition, to estimate a sparse correlation matrix, [Rothman et al., 2009, Lam and Fan, 2009, Liu et al., 2014] used methods based on the generalized thresholding operator [Rothman et al., 2009]. For the estimation of sparse low-rank matrices, methods based on penalty terms have also been proposed [Zhou et al., 2015, Savalle et al., 2012].
In the proposed approach, we adopt hard thresholding based on [Bickel and Levina, 2008a] and [Jiang, 2013] because this approach is quite simple and its results are easy to interpret. To estimate a sparse low-rank correlation matrix, we therefore combine the majorize-minimization algorithm (MM algorithm) [Pietersz and Groenen, 2004, Simon and Abell, 2010] with the hard thresholding approach. There are two advantages to the proposed approach. First, the estimated sparse low-rank correlation matrix allows for easy interpretation of the correlation matrix. Second, the proposed approach avoids misleading interpretations of the correlation matrix: when the true correlation coefficient is zero, the proposed method can correctly estimate the corresponding coefficient as zero. In addition, we focus only on positive correlation coefficients, not negative ones, which further simplifies the interpretation of the relations.
The rest of this paper is structured as follows. We explain the model and algorithm in Section 2. In Section 3, we evaluate the proposed approach through a numerical simulation. In addition, the results of applying the proposed method to real data are provided in Section 4. Finally, we conclude our study and discuss the proposed approach in Section 5.
2 Method
2.1 Adaptive thresholding for sparse and low-rank correlation matrix estimation
In this section, we present the proposed approach for estimating a sparse low-rank correlation matrix. First, the estimation of the low-rank correlation matrix is introduced based on the MM algorithm [Pietersz and Groenen, 2004, Simon and Abell, 2010]. Next, to achieve the sparse low-rank correlation structure, the hard thresholding operator and proposed cross-validation function are described.
2.1.1 Optimization problem of low-rank correlation matrices
Let R = (rij), rij ∈ [−1, 1] (i, j = 1, 2, …, p) and W = (wij), wij ∈ {0, 1} (i, j = 1, 2, …, p) be the correlation matrix between variables and a binary matrix, respectively, where p is the number of variables. Given the number of dimensions d ≤ p, the correlation matrix R, and the binary matrix W, the optimization problem for estimating a low-rank correlation matrix is defined as follows:

f(Y) = ∥W ⊙ (R − Y Y T)∥F2 → min   (1)

subject to

yjT yj = 1 (j = 1, 2, …, p),   (2)

where Y = (y1, y2, …, yp)T, yj = (yj1, yj2, …, yjd)T, yjo ∈ ℝ (j = 1, 2, …, p; o = 1, 2, …, d) is the coordinate matrix of variables in d dimensions, ⊙ is the Hadamard product, and ∥·∥F is the Frobenius norm. The objective function in Eq. (1) was explained by [Knol and Ten Berge, 2012]. From the constraint (2), Y Y T becomes a correlation matrix.
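As an illustrative sketch (our own naming, not the authors' implementation), the objective in Eq. (1) and the unit-row constraint (2) can be written in a few lines of NumPy:

```python
import numpy as np

def objective(R, W, Y):
    """Weighted low-rank objective ||W ⊙ (R - Y Y^T)||_F^2 from Eq. (1)."""
    D = W * (R - Y @ Y.T)          # Hadamard-masked residual
    return np.sum(D ** 2)

def project_rows(Y):
    """Enforce constraint (2): each row y_j has unit norm, so that
    diag(Y Y^T) = 1 and Y Y^T is a valid correlation matrix."""
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)
```

With `W` set to all ones, this reduces to an ordinary low-rank fit of `R`; zeros in `W` exclude entries from the fit.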
2.1.2 Estimation of low-rank correlation matrices based on MM algorithm
To estimate a low-rank correlation matrix, the MM algorithm proposed by [Simon and Abell, 2010] is used. To estimate Y under the constraint (2), the quadratic optimization problem for Y is converted into a linear optimization problem; using the linear function, the update formula can be derived in closed form with a Lagrange multiplier. Let y(t) ∈ ℝd be the parameter at step t of the algorithm, and let g(y|y(t)) be a real function g : ℝd × ℝd ↦ ℝ. If g(y|y(t)) satisfies the conditions

g(y|y(t)) ≥ f(y) for all y ∈ ℝd,   (3)
g(y(t)|y(t)) = f(y(t)),   (4)

then g(y|y(t)) is called a majorizing function of f(y) at the point y(t), where f : ℝd ↦ ℝ is the original objective function. In the MM algorithm, we minimize g(y|y(t)) rather than f(y); in many situations, g(y|y(t)) is much easier to minimize. For details of the MM algorithm, see [Hunter and Lange, 2004].
Before deriving the majorizing function, the objective function (1) is rewritten as

f(Y) = Σi=1p ( yiT Bi yi − 2 yiT bi ) + c1,   (5)

where Bi and bi are determined by W, R, and the fixed rows yj (j ≠ i), and c1 is a constant. The parameter estimation of Y is thus conducted row by row over yi. The part of Eq. (5) corresponding to yi,

fi(yi) = yiT Bi yi − 2 yiT bi,   (6)

is majorized as

fi(yi) ≤ λi yiT yi + 2 yiT (Bi − λi Id) yi(t−1) − 2 yiT bi + c2 = g(yi | yi(t−1)),   (7)

where g(yi | yi(t−1)) represents the majorizing function of Eq. (5), c2 is a constant, Id is the d × d identity matrix, λi is the maximum eigenvalue of Bi, and yi(t−1) is yi at step (t − 1) of the algorithm. The inequality in Eq. (7) is satisfied because Bi − λi Id is negative semi-definite. In fact, if yi = yi(t−1), Eqs. (6) and (7) become equal.
Using the Lagrange multiplier method and Eq. (7), the update formula for yi is derived as

yi(t) = ( (λi Id − Bi) yi(t−1) + bi ) / ∥ (λi Id − Bi) yi(t−1) + bi ∥.   (8)
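The row-wise MM updates above can be sketched as follows. This is an illustrative rendering under our own naming (`mm_low_rank_corr`), assuming Bi = Σj wij yj yjT and bi = Σj wij rij yj over j ≠ i, consistent with Eqs. (5)–(8); it is not the authors' code.

```python
import numpy as np

def mm_low_rank_corr(R, W, d, n_iter=100, seed=0):
    """Estimate a rank-d correlation matrix minimizing ||W ⊙ (R - YY^T)||_F^2
    by row-wise MM updates (sketch of the Simon & Abell-style algorithm)."""
    p = R.shape[0]
    rng = np.random.default_rng(seed)
    Y = rng.normal(size=(p, d))
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)   # constraint (2)
    for _ in range(n_iter):
        for i in range(p):
            mask = W[i].astype(bool).copy()
            mask[i] = False                          # exclude the diagonal
            Yj = Y[mask]                             # rows j != i with w_ij = 1
            B = Yj.T @ Yj                            # B_i
            b = Yj.T @ R[i, mask]                    # b_i
            lam = np.linalg.eigvalsh(B)[-1]          # largest eigenvalue of B_i
            v = (lam * np.eye(d) - B) @ Y[i] + b     # numerator of Eq. (8)
            nv = np.linalg.norm(v)
            if nv > 0:
                Y[i] = v / nv                        # normalize back to the sphere
    return Y
```

Because each row update exactly minimizes the majorizer over the unit sphere, the objective is non-increasing across iterations, which is the defining property of an MM scheme.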
2.1.3 Proposed Algorithm of Cross Validation to determine hard thresholds
In the proposed approach, to estimate a sparse low-rank correlation matrix, we adopt hard thresholding. To determine the threshold values, we introduce a cross-validation function based on [Bickel and Levina, 2008a]. The purpose of this approach is quite simple, and the proposed approach can determine the threshold values related to sparse estimation by considering the corresponding rank.
Let h(α) ∈ (−1, 1) be the threshold value corresponding to the α percentile of the correlations, where α ∈ [0, 1] is the percentage point. For a correlation rij ∈ [−1, 1], the indicator function is defined as 𝕝h(α)[rij ≥ h(α)] = 1 if rij ≥ h(α), and 0 otherwise. Using it, the proportional threshold operator is defined as

Th(α)(R) = ( rij · 𝕝h(α)[rij ≥ h(α)] ) (i, j = 1, 2, …, p),   (9)

where

𝕝h(α)[rij ≥ h(α)] ∈ {0, 1}.   (10)

The proportional threshold operator is used in the domain of neuroscience [van den Heuvel et al., 2017]. Eq. (9) can also be written in terms of R and the binary matrix Wh(α) = (𝕝h(α)[rij ≥ h(α)]) corresponding to h(α), such that Th(α)(R) = Wh(α) ⊙ R. Here, Eq. (10) is a modification of the original function of Bickel and Levina [2008a]: originally, 𝕝h(α)[|rij| ≥ h(α)] is used, but we focus only on higher positive correlations and not on negative correlations. With this modification, the results become easier to interpret.
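A minimal sketch of the proportional threshold operator in Eqs. (9)–(10), assuming h(α) is taken as the α quantile of the off-diagonal correlations (function names are our own):

```python
import numpy as np

def proportional_threshold(R, alpha):
    """Keep only correlations at or above the alpha-percentile h(alpha);
    set the rest to zero. Only positive (signed) correlations are kept,
    matching the paper's modification of Bickel and Levina's operator."""
    p = R.shape[0]
    off = R[~np.eye(p, dtype=bool)]       # off-diagonal correlations
    h = np.quantile(off, alpha)           # h(alpha): the alpha percentile
    W = (R >= h).astype(float)            # indicator 1[r_ij >= h(alpha)]
    np.fill_diagonal(W, 1.0)              # keep the unit diagonal
    return W * R, W                       # T_{h(alpha)}(R) = W ⊙ R, and W
```

Larger α keeps fewer correlations, so α directly controls the degree of sparsity.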
To estimate a sparse low-rank correlation matrix, we use the proportional threshold operator in Eq. (9) because its interpretation is quite simple. To choose the threshold value h(α), cross-validation is introduced (e.g., Bickel and Levina [2008a], Jiang [2013]). The cross-validation procedure for selecting h(α) consists of four steps, as shown in Figure 1. First, the original multivariate data X ∈ ℝn×p are split into two groups X(1,k) ∈ ℝn1×p and X(2,k) ∈ ℝn2×p, where n1 = n − ⌊n/ log n⌋, n2 = ⌊n/ log n⌋, and k is the index of the cross-validation iteration; [Bickel and Levina, 2008a] determined n1 and n2 from a theoretical perspective. Second, the correlation matrices R(1,k) and R(2,k) are calculated from X(1,k) and X(2,k), respectively. Third, the low-rank correlation matrix Φ(1,k) = Y Y T with rank d is estimated from R(1,k) based on Eq. (1) with constraint (2), giving the sparse low-rank correlation matrix Ψ(1,k) = Wh(α) ⊙ Φ(1,k). Fourth, for each candidate h(α), the first three steps are repeated K times and the proposed cross-validation function is calculated as

CV(h(α)) = (1/K) Σk=1K ∥ Th(α),d(R(1,k)) − R(2,k) ∥F2,   (11)

where Th(α),d(R(1,k)) = Wh(α) ⊙ Φ(1,k) = Ψ(1,k) and K is the number of cross-validation iterations. Among the candidate threshold values, h(α) is selected as the value that minimizes Eq. (11). The cross-validation algorithm is presented in Algorithm 2.
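The four steps above can be sketched as follows. To keep the sketch short and self-contained, a truncated eigendecomposition (`low_rank_surrogate`) stands in for the constrained MM estimate of Section 2.1.2; that substitution, and all function names, are our own simplifications, not the authors' procedure.

```python
import numpy as np

def low_rank_surrogate(R, d):
    """Hypothetical stand-in for the MM estimate: a rank-d
    eigen-approximation with the diagonal reset to one."""
    vals, vecs = np.linalg.eigh(R)
    Yd = vecs[:, -d:] * np.sqrt(np.clip(vals[-d:], 0, None))
    P = Yd @ Yd.T
    np.fill_diagonal(P, 1.0)
    return P

def cv_select_alpha(X, d, alphas, K=5, seed=0):
    """Steps 1-4 of the proposed cross-validation (Eq. (11)), sketched."""
    n, p = X.shape
    n2 = int(np.floor(n / np.log(n)))            # n2 = floor(n / log n)
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(alphas))
    for _ in range(K):
        idx = rng.permutation(n)
        X1, X2 = X[idx[n2:]], X[idx[:n2]]        # split into n1 and n2 rows
        R1, R2 = np.corrcoef(X1.T), np.corrcoef(X2.T)
        for a, alpha in enumerate(alphas):
            off = R1[~np.eye(p, dtype=bool)]
            h = np.quantile(off, alpha)          # h(alpha) from the train split
            Wh = (R1 >= h).astype(float)
            np.fill_diagonal(Wh, 1.0)
            Psi = Wh * low_rank_surrogate(R1, d)  # sparse low-rank estimate
            scores[a] += np.sum((Psi - R2) ** 2)  # Frobenius discrepancy, Eq. (11)
    return alphas[int(np.argmin(scores))]
```

The selected α is the one whose thresholded low-rank fit of the training split best predicts the held-out correlation matrix.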
Algorithm of Cross-validation for tuning proportional thresholds
Finally, h(α)†, the threshold corresponding to the minimum value of Eq. (11) among the candidate threshold values, is selected, and the sparse low-rank correlation matrix is estimated based on Eq. (1) with the corresponding binary matrix.
2.2 Numerical Simulation and Real example
In this section, we present a numerical simulation to evaluate the proposed approach. The numerical simulation was conducted in the same manner as that in [Cui et al., 2016]. In addition, we present a real example of applying the proposed method to a microarray gene expression dataset from [Khan et al., 2001].
2.2.1 Simulation design of numerical simulation
In this subsection, the simulation design is presented. The framework of the numerical simulation consists of three steps. First, artificial data with a true correlation matrix are generated. Second, a sparse low-rank correlation matrix is estimated using two methods, including the proposed method; in addition, a sample correlation matrix and a sparse correlation matrix based on thresholding are also applied. Third, the estimated correlation matrices are evaluated and compared using several evaluation indices.
In this simulation, three kinds of correlation models are used. Let I and J be sets of indices for the rows and columns of the correlation matrices, respectively, partitioned into subsets Ik and Jk. Using this notation, three true correlation models R(1), R(2), and R(3) are defined in Eqs. (12), (13), and (14), respectively. The models (12) and (13) are called sparse models, while the model (14) is called a non-sparse model by [Cui et al., 2016]. The models (12) and (13) are also used in [Bickel and Levina, 2008a], [Xue et al., 2012], and [Rothman, 2012]; see Figure 2. The artificial data are generated as xi ∼ N(0p, R(𝓁)) (i = 1, 2, …, n; 𝓁 = 1, 2, 3), where 0p is a zero vector of length p. In this simulation, we set p = 100 and the number of cross-validation iterations K = 5. For the methods estimating a sparse low-rank correlation matrix, there are 2 (Factor 1) × 3 (Factor 2) × 3 (Factor 3) × 2 (Factor 4: proposal and tandem) = 36 patterns, and for the methods without low-rank approximation, there are 2 (Factor 1) × 3 (Factor 3) × 2 (Factor 4) = 12 patterns; in total, there are 48 patterns in this numerical simulation. In each pattern, artificial data are generated 100 times and evaluated using several indices. In addition, both the proposed approach and the tandem approach are run from 50 random starts, and the best solution is selected. For R(1) and R(2), the candidate values of α are set from 0.66 to 0.86 in steps of 0.02, while for R(3), the candidates are set from 0.66 to 0.82 in steps of 0.02.
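The data-generation step xi ∼ N(0p, R(𝓁)) can be sketched via a Cholesky factorization of the true correlation model (function name is our own; any positive definite R(𝓁) works here):

```python
import numpy as np

def generate_data(R, n, seed=0):
    """Draw n samples x_i ~ N(0_p, R) for the simulation.
    R is a true correlation model, assumed positive definite."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(R)                       # R = L L^T
    Z = rng.normal(size=(n, R.shape[0]))            # i.i.d. standard normals
    return Z @ L.T                                  # rows have covariance R
```

With large n, the sample correlation of the generated data recovers R up to sampling noise.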
Next, the factors of the numerical simulation are presented; for a summary, see Table 1. Factor 1 is set to evaluate the effect of the number of subjects: if the number of subjects is smaller, the estimated sparse low-rank correlation matrix is expected to be unstable. Factor 2 is set to evaluate the effect of the rank: when a smaller rank is set, the variance between the estimated sparse low-rank correlation coefficients becomes larger, so the results become easier to interpret, although the estimated coefficients generally tend to be farther from the truth. Next, as explained in Eqs. (12), (13), and (14), there are three levels in Factor 3. Finally, in Factor 4, we set four methods: the proposed approach, the tandem approach, the sample correlation matrix, and sparse correlation matrix estimation based on a threshold value [Jiang, 2013] with modifications. The purpose of both the proposed approach and the tandem approach is to estimate a sparse low-rank correlation matrix. In the tandem approach, estimation of the low-rank correlation matrix is the same as in the proposed approach; however, the proportional threshold is determined differently from Eq. (11): it is selected using the method of [Bickel and Levina, 2008a] and [Jiang, 2013] with modification, which does not consider the features of the low-rank correlation matrix. In the tandem approach, we use Eq. (10) as the threshold function; therefore, given the corresponding Wh(α) and correlation matrix R, a sparse low-rank correlation matrix is estimated based on the optimization problem of Eq. (1). To estimate the sparse correlation matrix without dimensional reduction, Eq. (10) is used as the threshold function, although rij · 𝕝h(α)[|rij| ≥ h(α)] was used in [Jiang, 2013]; in this paper, we refer to this approach without dimensional reduction as Jiang (2013) with modifications.
Next, in the same manner as [Cui et al., 2016], we adopt four evaluation indices. To evaluate the fit between the estimated sparse low-rank correlation matrix and the true correlation matrix, the average relative errors based on the Frobenius norm (F-norm) and the spectral norm (S-norm) are adopted:

∥ R̂ − R(𝓁) ∥F / ∥ R(𝓁) ∥F and ∥ R̂ − R(𝓁) ∥S / ∥ R(𝓁) ∥S,

respectively, where ∥·∥S indicates the spectral norm, R̂ is an estimator of the sparse low-rank correlation matrix, and R(𝓁) (𝓁 = 1, 2, 3) is the true correlation matrix corresponding to Eq. (12), Eq. (13), and Eq. (14), respectively. In addition, to evaluate the results on sparseness, the true positive rate (TPR) and false positive rate (FPR) are defined as

TPR = |{(i, j) : r̂ij ≠ 0 and rij ≠ 0}| / |{(i, j) : rij ≠ 0}|,
FPR = |{(i, j) : r̂ij ≠ 0 and rij = 0}| / |{(i, j) : rij = 0}|,

where |·| indicates the cardinality of a set, r̂ij and rij are the off-diagonal elements (i ≠ j) of R̂ and R(𝓁), respectively.
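The four evaluation indices can be computed as follows (an illustrative sketch with our own function names; a "positive" is a nonzero off-diagonal entry):

```python
import numpy as np

def relative_errors(R_hat, R_true):
    """Relative errors under the Frobenius norm and the spectral norm."""
    f = np.linalg.norm(R_hat - R_true, 'fro') / np.linalg.norm(R_true, 'fro')
    s = np.linalg.norm(R_hat - R_true, 2) / np.linalg.norm(R_true, 2)
    return f, s

def tpr_fpr(R_hat, R_true, tol=1e-12):
    """TPR and FPR over off-diagonal entries: a true positive is a nonzero
    true correlation recovered as nonzero; a false positive is a zero
    true correlation estimated as nonzero."""
    p = R_true.shape[0]
    off = ~np.eye(p, dtype=bool)
    hat_nz = np.abs(R_hat[off]) > tol
    true_nz = np.abs(R_true[off]) > tol
    tpr = np.sum(hat_nz & true_nz) / max(np.sum(true_nz), 1)
    fpr = np.sum(hat_nz & ~true_nz) / max(np.sum(~true_nz), 1)
    return tpr, fpr
```

Note that when the true model has no zero entries (the non-sparse model), the FPR denominator is empty, which is why the FPR cannot be reported for that model.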
2.2.2 Application of microarray gene expression dataset
Here, we present the results of applying both the proposed approach and the tandem approach to the microarray gene expression dataset of [Khan et al., 2001]. The purpose of this real application is to evaluate the differences between two clusters of genes in the estimated sparse low-rank correlation matrices.
In [Rothman et al., 2009], the same dataset was used as an application of their method. Specifically, the dataset provided by the R package “MADE4” [Culhane et al., 2005] is used in this example. The dataset includes 64 training samples and 306 genes. In addition, there are four types of small round blue cell tumors of childhood (SRBCT): neuroblastoma (NB), rhabdomyosarcoma (RMS), Burkitt lymphoma, a subset of non-Hodgkin lymphoma (BL), and the Ewing family of tumors (EWS). In short, there are four sample classes in this dataset. As was done in [Rothman et al., 2009], the genes are classified into two clusters, “informative” and “noninformative,” where genes belonging to the “informative” cluster carry information to discriminate the four classes and those belonging to the “noninformative” cluster do not.
Next, to construct the “informative” and “noninformative” clusters, the F statistic is calculated for each gene as

Fj = [ Σg=1G ng ( x̄gj − x̄j )2 / (G − 1) ] / [ Σg=1G (ng − 1) s2gj / (n − G) ],

where G is the number of classes (NB, RMS, BL, and EWS), ng is the number of subjects belonging to class g, x̄gj is the mean of class g for gene j, x̄j is the overall mean of gene j, and s2gj is the sample variance of class g for gene j. If Fj is relatively high, gene j is considered “informative” because it tends to contain information that discriminates the classes. Based on the calculated Fj, the top 40 genes and the bottom 60 genes are set as the “informative” and “noninformative” clusters, respectively. Then, the correlation matrix of these 100 genes is calculated and set as the input data; see Figure 3.
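The per-gene one-way ANOVA F statistic above can be sketched as follows (function name is our own; columns of `X` are genes, `labels` holds the class of each sample):

```python
import numpy as np

def f_statistics(X, labels):
    """One-way ANOVA F statistic per gene (column of X), used to rank
    genes into 'informative' and 'noninformative' clusters."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n, G = len(labels), len(classes)
    grand = X.mean(axis=0)                            # overall mean per gene
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for g in classes:
        Xg = X[labels == g]
        ng = Xg.shape[0]
        between += ng * (Xg.mean(axis=0) - grand) ** 2   # between-class SS
        within += (ng - 1) * Xg.var(axis=0, ddof=1)      # within-class SS
    return (between / (G - 1)) / (within / (n - G))
```

Genes whose class means differ strongly relative to their within-class variance receive large Fj and would land in the “informative” cluster.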
To compare the results of the proposed approach with those of the tandem approach, the FPR is calculated; for the tandem approach, see “2.2.1 Simulation design of numerical simulation”. In this application, the true correlations between genes belonging to the “informative” cluster and genes belonging to the “noninformative” cluster are considered to be 0. Therefore, the denominator of the FPR is set to 2 × 40 × 60 = 4800. The TPR is not reported because it is difficult to determine the true structure: correlations within each cluster are not necessarily non-zero. For the rank, we set d = 2, 3, and 5. The candidate values of α for determining the threshold are set from 0.50 to 0.83 in steps of 0.01 for both approaches, and the algorithms are started from 50 different initial parameters. In addition, as in the numerical simulation, both the sample correlation matrix and Jiang (2013) with modifications are also employed.
3 Results
In this section, we present the results of the numerical simulation and real application.
3.1 Simulation result
In this subsection, we present the simulation results for each true correlation model. Tables 2, 3, and 4 show the FPRs and TPRs obtained when applying the methods to R(1), R(2), and R(3), respectively; each cell indicates the mean of the index. Here, R(3) is a non-sparse correlation matrix, so its FPR cannot be calculated, and neither the TPR nor the FPR of the sample correlation matrix can be calculated because the sample correlation matrix is not sparse. From the results of the numerical simulation, the FPRs of the proposed approach were the lowest among all the methods in all situations, while the TPRs of the proposed approach tended to be inferior to those of the other approaches. In short, the proposed approach yields a sparser low-rank correlation matrix than the tandem approach when a smaller rank is used.
For the relative error of the F-norm, Figures 4, 5, and 6 show the results of applying the methods to R(1), R(2), and R(3), respectively. The median of the proposed approach was lower than that of the tandem approach in each pattern, and the interquartile range of the proposed approach was smaller than that of the tandem approach in each pattern. Therefore, we confirmed that the results of the proposed approach are more effective and stable than those of the tandem approach. As the rank is set larger, the results of both approaches become lower and closer to those of Jiang (2013) with modifications in all situations. Among these methods, Jiang (2013) with modifications is the best with respect to the relative error of the F-norm; however, this is natural given the properties of low-rank approximation. Likewise, the relative errors of the S-norm for R(1), R(2), and R(3) are shown in Figures 7, 8, and 9, respectively. The tendency of the results for the S-norm is quite similar to that for the F-norm. From the results for the F-norm, we observe that the result of the proposed approach with rank 5 is quite close to that of Jiang (2013) with modifications.
For the estimated correlation matrices, Figures 10, 11, and 12 correspond to true correlation models 1, 2, and 3 with n = 50, respectively. In the same way, Figures 13, 14, and 15 correspond to true correlation models 1, 2, and 3 with n = 75, respectively. From Figures 10, 12, 13, and 15, we found that the proposed approach estimates zeros correctly compared with the tandem approach. In addition, as the rank is set larger, the estimated correlation matrices tend to be close to the results of Jiang (2013) with modifications.
3.2 Result of application of microarray gene expression dataset
In this subsection, the results of applying the methods to the microarray gene expression dataset are shown. For the estimated original correlation matrix, Jiang (2013) with modifications, the proposed approach, and the tandem approach, see Figure 16. The percentage points selected for d = 2, 3, and 5 in the proposed approach were α = 0.82, 0.81, and 0.75, respectively, while the percentage points in the tandem approach and Jiang (2013) with modifications were both α = 0.65. The estimated results of Jiang (2013) with modifications are presented in Figure 16; however, its FPRs were higher than those of the proposed approach. Note that the FPR is not affected by the choice of rank in the tandem approach. From these results, the estimated sparse low-rank correlation matrix tends to be sparser when the rank is set lower; this can be confirmed in Figure 16. In addition, as the rank is set larger, the estimated correlations of the proposed approach become similar to those of the tandem approach. Using the heatmap, we also confirmed that the estimated sparse low-rank correlations between genes belonging to the “informative” cluster tend to be similar to the results obtained in Rothman et al. [2009].
Next, Table 5 shows the FPRs of the proposed approach, the tandem approach, and Jiang (2013) with modifications. The FPRs of the proposed method with d = 2, 3, and 5 were all lower than those of both the tandem approach and Jiang (2013) with modifications. The same tendency was observed in the results of the numerical simulations.
4 Conclusion
In this paper, we proposed a novel estimation method for sparse low-rank correlation matrices based on the MM algorithm. The proposed approach overcomes a problem in estimating low-rank correlation matrices. Low-rank approximation is a very powerful tool that allows easy interpretation of the features because the contrast of the estimated coefficients becomes larger; however, the estimation sometimes leads to misunderstanding: even if the true correlation coefficient is zero, the corresponding estimated coefficient of a low-rank approximation without sparse estimation may be far from zero. We confirmed the advantages of the proposed method via a numerical simulation and a real example. In the real example of the microarray gene expression dataset, the FPRs of the proposed approach with d = 2, 3, and 5 were 0.128, 0.139, and 0.197, respectively, while those of the tandem approach and Jiang (2013) with modifications were both 0.285. Therefore, we confirmed that the FPRs of the proposed approach are the best, irrespective of the rank. In the same manner, the numerical simulation confirmed that the FPRs of the proposed approach are superior to those of the tandem approach and Jiang (2013) with modifications.
Footnotes
yfurotani{at}mis.doshisha.ac.jp
shiwa{at}mail.doshisha.ac.jp
The abstract is revised.
https://www.bioconductor.org/packages/release/bioc/html/made4.html