CNLLRR: A Novel Low-Rank Representation Method for Single-cell RNA-seq Data Analysis

The development of single-cell RNA-sequencing (scRNA-seq) technology has enabled the measurement of gene expression in individual cells, providing an unprecedented opportunity to explore biological mechanisms at the cellular level. However, existing scRNA-seq analysis methods are susceptible to noise and outliers, or ignore the manifold structure inherent in the data. In this paper, a novel method called Cauchy non-negative Laplacian regularized low-rank representation (CNLLRR) is proposed to alleviate these problems. Specifically, we employ the Cauchy loss function (CLF) instead of conventional norm constraints on the noise matrix of CNLLRR, which enhances the robustness of the method. In addition, a graph regularization term is applied to the objective function, which captures the pairwise geometric relationships between cells. The alternating direction method of multipliers (ADMM) is then adopted to solve the optimization problem of CNLLRR. Finally, extensive experiments on scRNA-seq data reveal that the proposed CNLLRR method outperforms other state-of-the-art methods for cell clustering, cell visualization and prioritization of gene markers. CNLLRR thus contributes to understanding the heterogeneity between cell populations in complex biological systems.

Author summary

Analysis of single-cell data can help to further study the heterogeneity and complexity of cell populations. Current analysis methods mainly learn the similarity between cells and then apply a clustering algorithm to the obtained similarity matrix for cell clustering or downstream analysis. Therefore, constructing an accurate cell-to-cell similarity is crucial for single-cell data analysis. In this paper, we design a novel Cauchy non-negative Laplacian regularized low-rank representation (CNLLRR) method to obtain a better similarity matrix.
Specifically, a Cauchy loss function (CLF) constraint is applied to penalize the noise matrix, which improves the robustness of CNLLRR to noise and outliers. Moreover, a graph regularization term is applied to the objective function, which effectively encodes the local manifold information of the data. Together, these guarantee the quality of the learned cell-to-cell similarity matrix. Finally, single-cell data analysis experiments show that our method is superior to other representative methods.

Introduction

In recent years, a large number of single-cell RNA sequencing (scRNA-seq) data have been generated due to the development of next-generation sequencing technologies [1]. scRNA-seq data contain a wealth of information on biological function and gene regulation. Analyzing these data paves a way to observe individual cells at unprecedented resolution [2], making it possible to explore the heterogeneity and complexity of cell populations and offering an unprecedented opportunity to study biological mechanisms and functional diversity at the cellular level.

With the help of scRNA sequencing, the identification of subpopulations of cells [3] is now possible. The identification can be treated as a clustering problem based on similarities of gene expression: cells with high similarity are categorized into the same subgroup. Cell clustering contributes to the targeted therapy of disease and the discovery of new cell subtypes. By using visualization methods, such as t-distributed stochastic neighbor embedding (t-SNE) [5], biologists can intuitively understand the distribution of the cells. Prioritizing the genes that define a subgroup of cells provides a new direction for studying cell differences.

Currently, many methods for single-cell clustering have been proposed [6,7]. Generally speaking, these methods mainly learn the similarity between cells and then apply a clustering algorithm to the obtained similarity matrix for cell clustering or downstream analysis. Therefore, constructing an accurate cell-to-cell similarity is crucial for scRNA-seq data analysis. Many efforts have been made to this end; for example, Xu et al. used the idea of shared nearest neighbors to measure the similarity between cells [8]. To enhance clustering accuracy, Wang et al.
proposed single-cell interpretation via multi-kernel learning (SIMLR) [9], which obtains a similarity matrix by learning an appropriate distance metric. Kim et al. then designed a new framework that improves SIMLR by using Pearson correlation as the measure of similarity [10]. Motivated by the assumption that neighbor information can measure similarity, Jiang et al. proposed a method called Corr based on the differentiability correlation of cell pairs [11].

The aforementioned methods rely on accurate pairwise similarities among the cells. However, the similarity is prone to corruption by noise and outliers, and the high-dimensional nature of scRNA-seq data further deteriorates the accuracy of the similarity calculation by involving an averaging operation. Alternatively, recent advances focus on capturing the global structure of the data. Zheng et al. proposed a subspace clustering algorithm called SinNLRR based on low-rank representation (LRR) to identify cell types [4]. Although the SinNLRR method achieves satisfactory experimental results, two issues still need to be considered. One is that single-cell data inevitably contain noise (introduced, for example, in the process of measuring and collecting scRNA-seq data) and outliers; however, the noise matrix of the SinNLRR method is constrained by the Frobenius norm, in which case its performance is affected. More generally, the common L1-norm, L2,1-norm or Frobenius norm applied to the noise term can only model specific noise: the L1-norm models noise that follows a Laplacian distribution, the L2,1-norm constraint tackles sample-specific outliers, and the Frobenius norm characterizes Gaussian noise well [12]. Unfortunately, noise and outliers are generally unknown and have complex statistical distributions.
Therefore, a reasonable new strategy is needed to effectively deal with noise and outliers. The other issue is that low-dimensional manifold structures exist in the high-dimensional sample space of the data. If this manifold information is not considered, the learned similarity matrix will not faithfully reflect the similarity among cells, which seriously affects the results of single-cell analysis.

To overcome the above limitations, we design a novel Cauchy non-negative Laplacian regularized low-rank representation (CNLLRR) method for clustering cells, visualizing cells and prioritizing gene markers. Specifically, a Cauchy loss function (CLF) constraint is applied to penalize the noise matrix. It effectively reduces the influence of a single sample containing noise and outliers and avoids being affected pronouncedly when a sample contains large noise. In addition, the influence function of CLF has an upper limit when estimating residuals, so it depends little on the distribution of noise and outliers [13]. Therefore, the CNLLRR method is robust to noise and outliers. Moreover, to preserve the intrinsic local geometric information in scRNA-seq data, a graph regularization term is applied to the objective function. The main contributions of this paper are as follows:

1. A novel method called CNLLRR is presented to capture the global structure of the data. CNLLRR uses the CLF instead of conventional norm constraints on the noise term, which effectively reduces the influence of a single sample.

Therefore, the robustness of the CNLLRR method to noise and outliers is improved.

2. The CNLLRR method also applies a graph regularization term to encode the manifold information of the data, which improves the quality of the learned similarity matrix.

3. We perform comprehensive experiments to validate the CNLLRR method on scRNA-seq data. The experimental results show that our method is meaningful and superior to several representative methods.

The rest of this paper is organized as follows. In the Materials and methods Section, LRR, CLF, and graph regularization are briefly reviewed, and the CNLLRR method and its optimization are described in detail. In the Results and discussion Section, experimental results on scRNA-seq data are reported and analyzed.

Low-rank representation

LRR is a widely used data analysis technique for exploring multiple low-dimensional subspace structures [14,15]. In the field of bioinformatics, a scRNA-seq dataset can be represented by a matrix whose columns represent cells and whose rows represent the expression levels of genes in different cells.

Given a data matrix X = [x_1, x_2, ..., x_n] ∈ R^{m×n}, each x_j can be considered as a linear combination of the bases in a dictionary [16]. The data matrix itself can play the role of the dictionary, which helps to discover the potential affinity between data samples. LRR decomposes the data matrix into a low-rank representation of the data plus noise, the purpose of which is to search for the lowest-rank representation of all the data [17]. The objective function of LRR can therefore be expressed as the following rank minimization problem:

min_{C,E} ||C||_* + λ ||E||_l,  s.t. X = XC + E,  (1)

where C ∈ R^{n×n} and E ∈ R^{m×n} represent the coefficient and noise matrices, respectively. ||·||_* is the nuclear norm of a matrix, defined as the sum of its singular values, and ||·||_l is some regularization strategy, such as ||·||_1, ||·||_{2,1}, or ||·||_F. λ ≥ 0 denotes the weighting parameter that balances the two terms.

Recently, Zheng et al. proposed a modified version of LRR to detect cell types in scRNA-seq data [4]. If the expression profiles of cells lie in the same subspace, these cells are most likely of the same type. Substituting X − XC for E, the objective function of SinNLRR is defined as

min_C ||C||_* + λ ||X − XC||_F^2,  s.t. C ≥ 0,  (2)

where ||·||_F represents the Frobenius norm of a matrix. C can be seen as a representation of X, with C ≥ 0 encoding the non-negative similarity between cells. The penalty is adaptive due to the selection of λ.
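To make these objectives concrete, the nuclear norm and the SinNLRR-style objective described above can be evaluated in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the function names are our own.

```python
import numpy as np

def nuclear_norm(M):
    """Nuclear norm ||M||_*: the sum of the singular values of M."""
    return np.linalg.svd(M, compute_uv=False).sum()

def sinnlrr_objective(X, C, lam):
    """SinNLRR-style objective ||C||_* + lam * ||X - XC||_F^2.

    The non-negativity constraint C >= 0 is assumed to be enforced
    elsewhere; this helper only evaluates the cost."""
    E = X - X @ C                       # residual / noise matrix
    return nuclear_norm(C) + lam * np.linalg.norm(E, 'fro') ** 2
```

For C equal to the identity, the residual E vanishes and the objective reduces to the nuclear norm of the identity, i.e. the number of cells.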

Cauchy loss function

In real-world applications, noise and outliers are common in data, and they are usually unknown and complex. Therefore, a robust loss function is urgently needed to deal with this problem. Recently, CLF, as a robust estimator, has been successfully used in face recognition [18] and image clustering [13]. CLF can be applied by solving an optimization problem of the form

min Σ_j ρ(r_j),  (3)

where r_j is the residual between the j-th data point and its estimated value, and ρ(·) is a symmetric, positive-definite function with a unique minimum at zero. The ρ(·) function in CLF is ρ(x) = log(1 + (x/a)^2), where a is a constant, hence its influence function can be calculated as

φ(x) = ρ'(x) = 2x / (a^2 + x^2).  (4)

φ(·) reflects the impact of changes in sample points on the parameter estimation. It can be seen from Eq (4) that as the residual x increases, the influence function of CLF changes only slightly; in other words, the influence function of CLF has an upper bound when estimating the residual [13]. Therefore, using CLF as a constraint on the noise matrix helps to improve the robustness of the method.
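The boundedness of the Cauchy influence function is easy to verify numerically. The following sketch (with our own helper names) implements ρ and φ as defined above:

```python
import numpy as np

def cauchy_rho(x, a=1.0):
    """Cauchy loss rho(x) = log(1 + (x/a)^2)."""
    return np.log1p((x / a) ** 2)

def cauchy_influence(x, a=1.0):
    """Influence function phi(x) = rho'(x) = 2x / (a^2 + x^2).

    phi is bounded (|phi| <= 1/a, attained at x = a) and decays toward
    zero for large residuals, unlike the L2 loss whose influence 2x
    grows without bound."""
    return 2.0 * x / (a ** 2 + x ** 2)
```

This is exactly the behavior the text appeals to: a single sample with a huge residual contributes almost nothing to the parameter estimate under CLF, whereas under the L2 loss it would dominate.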

Graph regularization

Due to the rapid development of manifold learning, increasing attention has been paid to the inherent nonlinear structure of data. Graph regularization is based on the local invariance assumption: if two sample points are close to each other in the high-dimensional data space, they are also close in the new low-dimensional space [19]. Hence graph regularization can capture the potential local manifold structure in the original data. For a given data matrix X, we use the data points as vertices to construct a k-nearest neighbor (k-NN) graph [20]. Then, a symmetric weight matrix W is defined [21], whose element W_ij represents the weight of the edge joining vertices x_i and x_j:

W_ij = 1 if x_i ∈ N_k(x_j) or x_j ∈ N_k(x_i), and W_ij = 0 otherwise,  (5)

where N_k(x_i) represents the set of k nearest neighbors of x_i. Mathematically, the graph regularization term can be measured by

(1/2) Σ_{i,j} ||c_i − c_j||^2 W_ij = tr(C L C^T),  (6)

where c_i and c_j denote the mappings of x_i and x_j under some transformation [22]. The degree matrix D is a diagonal matrix whose non-zero elements are D_ii = Σ_j W_ij, and L = D − W is the graph Laplacian matrix of X [23].
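A minimal construction of the k-NN weight matrix W, the degree matrix D and the Laplacian L = D − W, together with the regularizer tr(C L C^T), might look as follows. This is an illustrative sketch under the conventions above (columns of X are cells; helper names are our own):

```python
import numpy as np

def knn_graph_laplacian(X, k=5):
    """Binary symmetric k-NN weight matrix over the columns of X,
    plus the degree matrix D and graph Laplacian L = D - W."""
    n = X.shape[1]
    # squared pairwise Euclidean distances between columns
    d2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    W = np.zeros((n, n))
    for i in range(n):
        # k nearest neighbors of cell i, excluding itself
        nn = np.argsort(d2[i])[1:k + 1]
        W[i, nn] = 1.0
    W = np.maximum(W, W.T)          # edge if either point is a neighbor
    D = np.diag(W.sum(axis=1))
    return W, D, D - W

def graph_reg(C, L):
    """tr(C L C^T) = (1/2) * sum_ij ||c_i - c_j||^2 W_ij,
    where c_i is the i-th column of C."""
    return np.trace(C @ L @ C.T)
```

By construction each row of L sums to zero, and tr(C L C^T) penalizes representations that differ for cells joined by an edge.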

Cauchy non-negative Laplacian regularized low-rank representation

In this section, we introduce the Cauchy non-negative Laplacian regularized low-rank representation (CNLLRR) method in detail. The CNLLRR algorithm is summarized in Algorithm 1.

Objective function

LRR has been successfully applied in many fields because it can effectively explore the global structure of data [22]. However, data are inevitably contaminated by noise and outliers, and LRR does not account for the local manifold structure of the data. To overcome these limitations, a method called CNLLRR is proposed. CLF, rather than a conventional norm constraint, is applied to penalize the noise matrix. CLF can effectively reduce the influence of a single sample containing noise and outliers.

Moreover, it depends little on the distribution of noise and outliers [13], which helps to improve the robustness of the method. In addition, a graph regularization term is applied to encode the potential manifold information of the data. In conclusion, the formulation of CNLLRR can be described as follows:

min_C ||C||_* + λ Σ_j ρ(||x_j − Xc_j||_2) + β tr(C L C^T),  s.t. C ≥ 0,  (7)

where λ and β denote penalty factors for balancing the regularization terms, and c_j is the representation of the j-th data point x_j. Similar to the SinNLRR method, we replace the noise matrix E with X − XC, and the CLF constraint is applied to it. To achieve computational separability of Eq (7), an auxiliary matrix J ∈ R^{n×n} is introduced. CNLLRR can then be rewritten as the following optimization problem:

min_{C,J} ||J||_* + λ Σ_j ρ(||x_j − Xc_j||_2) + β tr(C L C^T),  s.t. C = J, C ≥ 0.  (8)

October 17, 2019 5/17

Optimization of CNLLRR

The alternating direction method of multipliers (ADMM) is used to optimize the CNLLRR method. The augmented Lagrangian function of Eq (8) can be defined as

L(C, J, Y) = ||J||_* + λ Σ_j ρ(||x_j − Xc_j||_2) + β tr(C L C^T) + ⟨Y, C − J⟩ + (μ/2) ||C − J||_F^2.  (9)

By using ⟨A, B⟩ = tr(A^T B), Eq (9) can be further rewritten as

L(C, J, Y) = ||J||_* + λ Σ_j ρ(||x_j − Xc_j||_2) + β tr(C L C^T) + (μ/2) ||C − J + Y/μ||_F^2 − (1/(2μ)) ||Y||_F^2,  (10)

where Y ∈ R^{n×n} represents the Lagrange multiplier and μ denotes a user-defined parameter [4]. We can solve problem (10) by alternately updating one variable while fixing the others.

First, J^K and Y^K are fixed when solving for C^{K+1}:

C^{K+1} = arg min_{C ≥ 0} λ Σ_j ρ(||x_j − Xc_j||_2) + β tr(C L C^T) + (μ/2) ||C − J^K + Y^K/μ||_F^2.  (11)

Through algebraic calculations there exists an explicit solution for C (Eq (12)), expressed in terms of the residual R^{K+1} = X − XC^K, a diagonal weight matrix Q^{K+1} derived from the Cauchy loss at the current residuals, and the identity matrix I.

Similarly, the subproblem used to update J is

J^{K+1} = arg min_J ||J||_* + (μ/2) ||J − (C^{K+1} + Y^K/μ)||_F^2,  (13)

a quadratic optimization with a nuclear norm term. It has an explicit solution by soft-thresholding,

J^{K+1} = U soft_{1/μ}(Σ) V^T,  (14)

where U Σ V^T is the singular value decomposition of C^{K+1} + Y^K/μ and soft_{1/μ}(·) represents the soft-thresholding operator [24]. Finally, the multiplier is updated as Y^{K+1} = Y^K + μ(C^{K+1} − J^{K+1}), and the complete optimization procedure for solving the proposed CNLLRR method is presented in Algorithm 1.
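The J-update of Eq (14) and the multiplier update admit a short implementation. Below is an illustrative sketch with our own naming; the C-update of Eq (12) is omitted because it depends on the Cauchy reweighting:

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the closed-form minimizer of
    tau * ||J||_* + (1/2) * ||J - M||_F^2 (soft-thresholds the
    singular values of M)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def admm_step(C_new, Y, mu):
    """One J-update (Eq (14)) and multiplier update, given the
    freshly updated coefficient matrix C_new."""
    J_new = svt(C_new + Y / mu, 1.0 / mu)       # Eq (14)
    Y_new = Y + mu * (C_new - J_new)            # multiplier update
    return J_new, Y_new
```

Soft-thresholding the singular values shrinks small ones to zero, which is what drives J (and hence C, through the constraint C = J) toward low rank.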
Algorithm 1 (sketch). (1) Initialize C, J, and Y; (2) while not converged and K ≤ max_iter, update C^{K+1}, J^{K+1}, and Y^{K+1} in turn; (3) output the coefficient matrix C.

Robustness analysis

In this subsection, we analyze the robustness of the proposed CNLLRR method. From the viewpoint of statistics, using a robust estimator to measure errors is beneficial to enhancing the robustness of the algorithm.
Fig 1 plots the loss and influence functions of the L1 and L2 estimators and CLF. The influence function reflects the impact of changes in sample points on parameter estimation. In Fig 1(b), the influence of a sample point on the parameter estimation increases linearly with its error for the L2 estimator; therefore, the L2 estimator is sensitive to noise and outliers. From Fig 1, we can also observe that the L1 estimator can reduce the impact of large errors on parameter estimation; however, as the error increases, its influence function does not fall off. In Fig 1(b), for CLF, as the error increases, its influence function value tends to zero; that is to say, the influence function of CLF has an upper limit. This is the reason why the CNLLRR method is robust to noise and outliers, and why the proposed CNLLRR method is more robust than SinNLRR.

Convergence and complexity analysis

We use the ADMM method to optimize the proposed CNLLRR method. As shown in Algorithm 1, C, J, and Y are iteratively updated until the convergence condition of CNLLRR is satisfied or the number of iterations exceeds the maximum we set.

Therefore, the convergence of our method can be guaranteed in practice. We also performed experiments to verify the convergence of the CNLLRR method; as shown in Fig 3, the objective converges within a limited number of iterations on the tested datasets. In addition, O notation is used to characterize the computational complexity. The computational cost of CNLLRR is mainly incurred in the procedures of updating C and J. Furthermore, each time C is updated, R and Q also need to be calculated, hence the time complexity of updating C is O(n^3 + mn^2) according to Eq (12). As shown in Eq (14), singular value decomposition (SVD) needs to be employed when updating J, whose computational cost is O(n^3). Moreover, building the data graph also requires O(mn^2). Assuming that k is the number of iterations of the algorithm, the total computational complexity of CNLLRR is O(kn^3 + kmn^2 + mn^2).

The first step is the application of the CNLLRR method. Specifically, preprocessing is needed for the input data matrix, which includes gene filtering [25] and L2-norm normalization [26]. Gene filtering is commonly used in single-cell analysis and aims to filter out rarely expressed genes (those expressed in fewer than 5% of the cells). L2-norm normalization normalizes each column vector of the matrix to unit length, i.e., x_j ← x_j / ||x_j||_2, which eliminates scale differences between samples. After preprocessing, we use the data matrix as the input of the CNLLRR method to obtain the corresponding coefficient matrix C. We then obtain a similarity matrix with a better block-diagonal structure in the same way as SinNLRR [4]: the localized similarity matrix P is defined by thresholding C with a relaxation coefficient f, which keeps only the dominant entries. Like SinNLRR, the value of f is 1.5 in our method. The final similarity matrix is easily obtained through S = P^T + P.
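The preprocessing and the final symmetrization can be sketched as follows. Note that the exact gene-filtering rule in the original text is ambiguous; the "expressed in fewer than 5% of cells" threshold used here, and all helper names, are our assumptions:

```python
import numpy as np

def preprocess(X, min_frac=0.05):
    """Gene filtering + column-wise L2 normalization (assumed forms).

    X is genes x cells. Genes expressed in fewer than min_frac of the
    cells are dropped, then every cell (column) is scaled to unit L2
    norm to remove scale differences between samples."""
    expressed = (X > 0).mean(axis=1) >= min_frac
    X = X[expressed]
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0            # guard against all-zero cells
    return X / norms

def symmetrize(P):
    """Final cell-to-cell similarity S = P^T + P."""
    return P.T + P
```

The resulting symmetric S is what the spectral clustering and visualization steps consume.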

The second step is the single-cell analysis. We perform spectral clustering on the 241 learned similarity matrix. In addition, the similarity matrix is visualized and gene 242 markers are prioritized.

Results and discussion

In this section, clustering cells, visualizing cells and prioritizing gene markers are performed to analyze the performance of the CNLLRR method. t-SNE [5], K-means, PCA [27], sparse subspace clustering (SSC) [28], SIMLR [9], Corr [11], and SinNLRR [4] are used as comparison methods. In the experiments, gene filtering and L2-norm normalization are utilized to preprocess the data. Specific information on the scRNA-seq datasets is listed in Table 1.

Appropriate parameter settings contribute to the performance of our method, so we analyze the sensitivity of CNLLRR to its parameters λ, β, a and μ. From Fig 5 and Fig 6, we can observe that the proposed method is sensitive to parameters λ and β; fortunately, λ and β can be selected within the examined intervals to achieve better performance. We can find in Fig 7(a) and 7(b) that as the values of a and μ increase, the behavior of CNLLRR is relatively stable; that is to say, the proposed method is robust to these two parameters. Based on Fig 7(b), we set the value of μ to 10 in the following experiments. In conclusion, reasonable parameter values are adopted for the subsequent analyses.

To quantify the clustering results, we employ two widely used evaluation metrics:

Adjusted Rand Index (ARI) [38] and Normalized Mutual Information (NMI) [39]. ARI is more suitable for sample-balanced datasets compared with NMI [40]. Let T and P denote the true and predicted cluster sets. Then

ARI = [ Σ_{ij} C(N_ij, 2) − Σ_i C(N_i., 2) Σ_j C(N_.j, 2) / C(N, 2) ] / [ (Σ_i C(N_i., 2) + Σ_j C(N_.j, 2)) / 2 − Σ_i C(N_i., 2) Σ_j C(N_.j, 2) / C(N, 2) ],

where N_ij represents the number of sample points in both T_i and P_j, the number of sample points in T_i is N_i., the number of sample points in P_j is N_.j, and C(a, b) denotes the binomial coefficient.

NMI is another metric that represents the similarity of cluster sets T and P. It is formulated as

NMI(T, P) = MI(T, P) / sqrt(H(T) H(P)),

where H(·) denotes the entropy of a cluster set and MI(·,·) represents the mutual information between cluster sets. Both ARI and NMI can be used to measure the agreement between predicted and true cluster assignments. All methods are provided with the real cluster numbers, even though Corr itself can automatically estimate the number of clusters. In our experiments, ARI and NMI are adopted to evaluate the clustering performance; the results are illustrated in Table 2. We can conclude that our method is reasonable and effective in suppressing noise and outliers and in preserving the potential geometric information of the data.
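Since ARI is built entirely from the contingency counts N_ij and binomial coefficients described above, it can be computed with the Python standard library alone. The sketch below uses our own naming:

```python
from math import comb
from collections import Counter

def ari(true_labels, pred_labels):
    """Adjusted Rand Index from the pairwise contingency counts N_ij."""
    n = len(true_labels)
    nij = Counter(zip(true_labels, pred_labels))  # contingency counts N_ij
    ni = Counter(true_labels)                     # row sums N_i.
    nj = Counter(pred_labels)                     # column sums N_.j
    index = sum(comb(c, 2) for c in nij.values())
    a = sum(comb(c, 2) for c in ni.values())
    b = sum(comb(c, 2) for c in nj.values())
    expected = a * b / comb(n, 2)
    max_index = (a + b) / 2.0
    return (index - expected) / (max_index - expected)
```

ARI equals 1 for a perfect clustering (up to label permutation) and is near 0 for a random assignment, which is why it is a common yardstick for cell clustering.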

In real-world applications, the number of clusters is usually unknown. To tackle this issue, we design an experiment to estimate the number of clusters in the scRNA-seq datasets. SIMLR, Corr and SinNLRR are clustering methods proposed specifically for single-cell cluster analysis, so we use these three methods for comparison. In the experiment, we first construct the normalized Laplacian matrix L_norm based on the similarity matrix S obtained by each method:

L_norm = I − B^{−1/2} S B^{−1/2},

where B represents the diagonal matrix whose diagonal elements are B_ii = Σ_j S_ij and I is the identity matrix. The eigengap [41] method is then applied to L_norm, which estimates the number of clusters by maximizing the eigenvalue gap |λ_k − λ_{k−1}| of L_norm. It should be mentioned that Corr decides the cluster number by itself automatically. The estimated numbers of clusters are as follows:

Dataset     True  SIMLR  Corr  SinNLRR  CNLLRR
Goolam       5     3      3      5        3
Pollen      11    19      3     10       10
Treutlein    5    12      3      4        5
Deng         7     6      4      5        5
Ting         5    51      5      5        5
Breton       4     2      3      6        2
Engel3       3     2      3      5        3
Grover       2     4      6      4        4
Engel4       4     2      2      6        2
Darmanis     8     2      6     10        7
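The eigengap estimate can be sketched as follows: build L_norm from S, sort its eigenvalues in ascending order, and return the k that maximizes the gap to the next eigenvalue. This is illustrative code with our own naming:

```python
import numpy as np

def eigengap_clusters(S, kmax=15):
    """Estimate the number of clusters from a similarity matrix S via
    the eigengap of L_norm = I - B^{-1/2} S B^{-1/2}."""
    S = (S + S.T) / 2.0                           # ensure symmetry
    d = S.sum(axis=1)                             # degrees B_ii
    inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_norm = np.eye(len(S)) - inv_sqrt[:, None] * S * inv_sqrt[None, :]
    lam = np.sort(np.linalg.eigvalsh(L_norm))     # ascending eigenvalues
    gaps = np.diff(lam)[:kmax]                    # lambda_{k+1} - lambda_k
    return int(np.argmax(gaps)) + 1               # k maximizing the gap
```

For a similarity matrix with k well-separated blocks, the first k eigenvalues of L_norm are near zero and the (k+1)-th jumps up, so the largest gap recovers k.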

Visualization and gene markers

In scRNA-seq data analysis, data visualization is also commonly used in identifying cell groups [42]. As a dimensionality reduction method, t-SNE has become a mainstream visualization tool. Following [9], the modified t-SNE algorithm projects the similarity matrix obtained by CNLLRR into a two-dimensional space, which shows the distinctions between cell subpopulations more intuitively and helps biologists understand the diversity of cells more simply and directly.

Fig 9(a) shows genes highly expressed in the NKT subpopulations. REXO2, HMGB2 and CCT3 can be regarded as marker genes of NKT17, NKT1 and NKT0, respectively. REXO2 is a protein-coding gene that may have a role in DNA repair, replication, and recombination, and in RNA processing and degradation; accordingly, the lack of REXO2 leads to a significant decrease in mitochondrial nucleic acid content [43]. HMGB2 encodes a member of the non-histone chromosomal high mobility group protein family; proteins of this family are chromatin-associated and ubiquitously distributed in the nuclei of higher eukaryotic cells [44]. The relevant pathways of CCT3 include the role of APC in cell cycle regulation and organelle biogenesis and maintenance; the protein encoded by this gene is a molecular chaperone and a member of the chaperonin containing TCP1 complex (CCT), also known as the TCP1 ring complex (TRiC) [45]. In Fig 9(b), PLP1, TMEM144 and CLDND1 are the gene markers of oligodendrocytes. MALAT1 is also among the highly expressed genes; PLP1 is associated with oligodendrocyte development and axonal survival. The protein encoded by AQP4 is the predominant aquaporin found in brain and has an important role in brain water homeostasis. Published articles further confirm that PLP1 and AQP4 are marker genes of oligodendrocytes and astrocytes, respectively [46]. To save space, we only perform a brief analysis of some of the gene markers in the Engel4 and Darmanis datasets.
The experiments and analysis provide researchers with new insights into the study of NKT cells and cerebral cortical tissue.

Conclusion

In this paper, we propose a robust LRR method called Cauchy non-negative Laplacian regularized LRR (CNLLRR) for scRNA-seq analysis. The proposed CNLLRR method uses CLF to reduce the negative impact of noise and outliers on clustering. Moreover, graph regularization is applied to exploit the local manifold structure of the data.

Extensive experiments comparing with other state-of-the-art methods on scRNA-seq data demonstrate the effectiveness of our proposed CNLLRR method. This method may provide guidance for subsequent single-cell analyses and be applied to other related fields. For future work, we will make further efforts to simplify the model and to apply it to larger datasets. In our model, the optimal solution is not easy to obtain owing to the four free parameters. Another shortcoming of CNLLRR is its lack of flexibility when analyzing a large number of cells. Thus, we will continue to develop the Cauchy non-negative Laplacian regularized LRR method and promote its flexibility in the application of scRNA-seq analysis.