## Abstract

Recent advances in single-cell RNA sequencing (scRNA-seq) technologies provide a great opportunity to study gene expression at cellular resolution, and scRNA-seq analyses have been routinely conducted to unfold cell heterogeneity and diversity. A critical step in scRNA-seq analyses is to cluster cells of the same type, and many methods have been developed for cell clustering. However, existing clustering methods are limited to extracting representations from the expression data of individual cells, while ignoring the high-order structural relations between cells. Here, we propose a new method (GraphSCC) to cluster cells based on scRNA-seq data by accounting for structural relations between cells through a graph convolutional network. The representation learned from the graph convolutional network, together with another representation output by a denoising autoencoder network, is optimized by a dual self-supervised module for better cell clustering. Extensive experiments indicate that the GraphSCC model outperforms state-of-the-art methods in various evaluation metrics on both simulated and real datasets. Further visualizations show that GraphSCC provides representations with better intra-cluster compactness and inter-cluster separability.

## I. Introduction

Recent advances in single-cell transcriptomics permit researchers to study complex differentiation and developmental trajectories [1], to discover novel cell types [2], and to improve understanding of human diseases [3, 4]. As scRNA-seq usually covers thousands to millions of cells, most of these studies require clustering cells of the same type. Benefiting from extensive studies of the classic clustering problem in machine learning, many algorithms developed for general clustering purposes have been transplanted to scRNA-seq clustering analyses [5, 6].

Despite the advances of many clustering methods and significant improvements in scRNA-seq measurement technologies, clustering cells based on scRNA-seq data remains challenging [7]. To be more specific, scRNA-seq data often contain substantial noise and dropout events, which may be caused by biological and technical factors, such as cell cycle effects [8], amplification bias, and the low RNA capture rate [9]. A dropout event is defined as a missed gene measurement, resulting in a 'false' zero count observation [10]. Therefore, handling the noise and dropout events is essential for improving clustering analyses. To address the dropout events in scRNA-seq data, several imputation methods have been developed. Early methods are based on statistical models, e.g., CIDR [11], scImpute [12], MAGIC [13], and SAVER [14]. With the development of deep learning techniques, many neural-network-based models have been developed. For example, DCA [10] reconstructs the scRNA-seq data through an autoencoder optimized by a zero-inflated negative binomial (ZINB) loss function [15]. DeepImpute predicts missing values using highly correlated genes with sufficient read coverage [16]. GraphSCI employs a graph neural network to reflect the relations between genes for accurate imputations [17]. Although imputed scRNA-seq data could improve the clustering results, the results remain unsatisfactory because these methods are not optimized for clustering, and the imputed data may generate false positive gene-gene correlations [18].

Recently, several clustering methods have been specifically developed for scRNA-seq data. For example, SIMLR is a spectral clustering method that learns a robust distance metric to fit the structure of scRNA-seq data using multiple kernels [19]. Seurat3.0 utilizes the Louvain algorithm [20] to cluster cells based on the low-dimensional scRNA-seq data [21]. DendroSplit uncovers multiple levels of biologically meaningful cell populations using feature selection in scRNA-seq data [22]. ScDeepCluster is a single-cell model-based deep embedded clustering method, which takes the overdispersion and sparsity of scRNA-seq data into account [23]. A few tools were designed to divide single cells into hierarchies or groups, such as SC3 [24], RaceID [25], SNN-Cliq [26], BISCUIT [27], and pcaReduce [28]. However, most of these methods rely only on the data of individual cells without explicitly considering structural relations between cells.

Structural information can be naturally captured by Graph Convolutional Networks (GCN) [29]. In recent years, GCN and its variants [30, 31] have been successfully applied to a wide range of applications, including traffic prediction [32], protein prediction [33], and drug design [34]. Xie et al. designed an unsupervised deep embedding method for clustering analysis (DEC) [35], which iteratively refines clusters by learning highly confident assignments using an auxiliary target distribution. DEC and its variant IDEC [36] have been successfully applied in molecular biology [23, 37]. Recently, Bo et al. designed a Structural Deep Clustering Network (SDCN) to integrate structural information between objects [38]. Theoretically, they proved that the inclusion of GCN imposes a high-order regularization constraint for learning representations that improve the clustering results. The model was also shown to outperform other methods on many types of datasets.

Inspired by these works, we propose a new GCN-based model, GraphSCC, to integrate structural information into the clustering of scRNA-seq data. Meanwhile, a denoising autoencoder network is employed to obtain low-dimensional representations that capture local structure, and a dual self-supervised module is utilized to iteratively optimize the representations and the clustering objective function in an unsupervised manner. We demonstrate that GraphSCC outperforms state-of-the-art methods on both simulated and real datasets. Furthermore, GraphSCC provides representations with better intra-cluster compactness and inter-cluster separability in the 2D visualization.

The advantage of GCN is its natively learnable mechanism of aggregating and propagating attributes to obtain relations over the whole cell-cell graph. Thus, the learned graph representations can be treated as high-order representations between cells. The superior performance of GraphSCC in cell cluster prediction benefits from *(i)* synergistically determining cell clusters based on the integration of high-order topological relations between cells and the characteristics of individual cells, and *(ii)* applying the dual self-supervised module to iteratively refine clusters by learning from highly confident assignments using an auxiliary target distribution.

## II. Materials and Methods

### A. Datasets and Preprocessing

#### Simulated Data

Our simulated data were generated by the Splatter R package [39], a widely used package for simulating scRNA-seq count data. For all simulated data, we set 2000 cells composed of 2000 genes, with four groups of equal size, i.e., 500 cells per group. Following the previous study [23], we set *dropout.mid* = 2 and *dropout.shape* = −1 (fixing the dropout rate at 45%), and used default values for other parameters. To simulate various clustering signal strengths, we generated datasets with different *de.facScale* values in {0.2, 0.225, 0.25, 0.275, 0.3, 0.325, 0.35, 0.4}, where *de.facScale* is the sigma parameter of a log-normal distribution controlling the multiplicative differential expression factors. To avoid random fluctuations, we repeatedly generated 20 datasets for each setting with different random seeds, and report the average results.

#### Real Data

We downloaded 15 datasets of human and mouse scRNA-seq involving various tissues and different biological processes, as used in the previous study [40], from the Hemberg group (https://hemberg-lab.github.io/scRNA.seq.datasets/). The datasets contain from dozens to thousands of cells derived from various single-cell RNA-seq techniques. Detailed information on the datasets is listed in TABLE I. The data type of the first 10 datasets is raw read counts, and the last 5 datasets are normalized counts.

#### Preprocessing

For the simulated datasets, we normalized the counts using the transcripts per million (TPM) method [41] and then scaled the value of each gene to [0, 1]. For the real datasets, we employed the procedure suggested by Seurat3.0 to normalize the scRNA-seq data and select the top 2000 highly variable genes, and then scaled the value of each gene to [0, 1]. Note that for real datasets normalized by FPKM, we first converted them to TPM as proposed by [42]:

*TPM*_{i} = (*FPKM*_{i} / ∑_{j} *FPKM*_{j}) × 10^{6}    (1)

and then preprocessed the data as above.
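The preprocessing steps above can be sketched as follows. This is a minimal illustration under our own assumptions (function names are hypothetical, and Seurat3.0's variable-gene selection is not reproduced here), not the paper's released pipeline:

```python
import numpy as np

def fpkm_to_tpm(fpkm):
    """Convert an FPKM matrix (cells x genes) to TPM per Eq. (1):
    rescale each cell's values so they sum to 1e6."""
    return fpkm / fpkm.sum(axis=1, keepdims=True) * 1e6

def minmax_per_gene(x):
    """Scale each gene (column) to [0, 1], guarding against constant genes."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    rng = np.where(hi > lo, hi - lo, 1.0)
    return (x - lo) / rng

fpkm = np.array([[10.0, 30.0], [5.0, 5.0]])   # toy 2-cell x 2-gene matrix
tpm = fpkm_to_tpm(fpkm)        # each row now sums to 1e6
scaled = minmax_per_gene(tpm)  # each gene now lies in [0, 1]
```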

### B. GraphSCC Architecture

In this section, we introduce our proposed GraphSCC network. As shown in Fig. 1, the overall network consists of three components: a Denoising Autoencoder (DAE), a Graph Convolutional Network (GCN), and a Dual Self-supervised Module (DSM). The DAE network is employed to obtain robust low-dimensional representations that can reconstruct the inputs. The GCN optimizes the representations with constraints from structural information between cells, where the graph is initialized through the K-nearest neighbor (KNN) algorithm. The DSM separates the cells according to the representations learned by DAE and GCN. Below we introduce the details.

#### 1. DAE Networks

For a given set of single cells, gene expression is represented by the matrix *X* ∈ ℝ^{N×d}, with *N* as the number of cells and *d* as the number of expressed genes. DAE is employed to encode the gene expression matrix into fixed-size vector representations. DAE is a variant of the autoencoder that takes corrupted data as input and outputs the reconstructed data. Concretely, at encoding layer *ℓ* the output *H*^{(ℓ)} is computed as:

*H*^{(ℓ)} = *∅*(*W*^{(ℓ)}*H*^{(ℓ−1)} + *b*^{(ℓ)})    (2)

where *∅* is the activation function, and *W*^{(ℓ)} and *b*^{(ℓ)} are the weight matrix and bias parameters, respectively. *H*^{(0)} = *X* + *ϵ*, where *ϵ* is Gaussian noise. The decoding layers are computed as:

*Ĥ*^{(ℓ)} = *∅*(*Ŵ*^{(ℓ)}*Ĥ*^{(ℓ−1)} + *b̂*^{(ℓ)})    (3)

where *Ŵ*^{(ℓ)} and *b̂*^{(ℓ)} are the decoder weight matrix and bias parameters, respectively, and the output of the last layer is the reconstructed data *X̂*. The encoder and decoder networks are both fully connected neural networks using the ReLU activation function. The weight and bias parameters are optimized based on the reconstruction loss:

*L*_{res} = (1/2*N*) ‖*X* − *X̂*‖²_{F}    (4)
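Eqs. (2)-(4) can be sketched in PyTorch as follows. This is a minimal sketch assuming the d-512-256-64-10 layout described in the Implementation section, not the authors' released code; the `noise_std` default is illustrative:

```python
import torch
import torch.nn as nn

class DAE(nn.Module):
    """Denoising autoencoder: corrupt X with Gaussian noise, reconstruct X."""
    def __init__(self, d, dims=(512, 256, 64, 10)):
        super().__init__()
        enc, prev = [], d
        for h in dims:
            enc += [nn.Linear(prev, h), nn.ReLU()]
            prev = h
        self.encoder = nn.Sequential(*enc[:-1])  # no ReLU on bottleneck H^(L)
        dec, prev = [], dims[-1]
        for h in list(reversed(dims[:-1])) + [d]:
            dec += [nn.Linear(prev, h), nn.ReLU()]
            prev = h
        self.decoder = nn.Sequential(*dec[:-1])  # linear reconstruction X_hat

    def forward(self, x, noise_std=1.0):
        h = self.encoder(x + noise_std * torch.randn_like(x))  # Eq. (2)
        return h, self.decoder(h)                              # Eq. (3)

x = torch.rand(8, 100)          # 8 cells, 100 genes
h, x_hat = DAE(100)(x)
loss = 0.5 * ((x - x_hat) ** 2).sum() / x.size(0)  # Eq. (4)
```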

#### 2. KNN Graph

The GCN graph was initialized by KNN, where the cell similarity is computed through the dot product function as:

*S*_{ij} = (*x*_{i} · *x*_{j}) / (|*x*_{i}||*x*_{j}|)    (5)

where *x*_{i} and *x*_{j} are the embedded features for cells *i* and *j*, obtained from the representation *H*^{(L)} of the pre-trained DAE, and |*x*_{i}| and |*x*_{j}| are the corresponding moduli, respectively. According to the similarity matrix *S* ∈ ℝ^{N×N}, the most similar cells of each cell are selected as its neighbors to construct the adjacency matrix *A* for the GCN. In this study, the number of neighbors was set to at most 1% of the total number of cells, with a maximum of 20.
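The KNN-graph construction can be sketched as below. The function name is hypothetical, and the final symmetrization of the adjacency matrix is our assumption (the paper does not state it), made so the graph is undirected for the GCN:

```python
import numpy as np

def knn_adjacency(h, frac=0.01, k_max=20):
    """Build a KNN graph from embedded features h (cells x dim) using the
    cosine-style similarity of Eq. (5); k is at most 1% of cells, capped at 20."""
    n = h.shape[0]
    k = max(1, min(int(n * frac), k_max))
    norm = np.linalg.norm(h, axis=1, keepdims=True)
    s = (h @ h.T) / (norm @ norm.T)      # similarity matrix S
    np.fill_diagonal(s, -np.inf)         # a cell is not its own neighbor
    a = np.zeros((n, n))
    for i in range(n):
        a[i, np.argsort(-s[i])[:k]] = 1  # keep the k most similar cells
    return np.maximum(a, a.T)            # symmetrize (our assumption)

h = np.random.rand(200, 10)  # e.g. H^(L) from the pre-trained DAE
a = knn_adjacency(h)         # here k = 2 (1% of 200 cells)
```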

#### 3. GCN Network

The GCN network was constructed to capture structural information between cells that is ignored by the DAE network. Inspired by a recent study [43], we use residual connections to alleviate the well-known over-smoothing phenomenon in GCN [29]. Due to the large feature dimension *d* of *X*, we use a lower-dimensional *Z*^{(0)} as the initial representation of the GCN, which is extracted from *X* by a fully connected neural network:

*Z*^{(0)} = *ϕ*(*XW* + *b*)    (6)

where *Z*^{(0)} ∈ ℝ^{N×m} with *m* < *d*, and *W* and *b* are the weight matrix and bias parameters, respectively.

Then, based on the adjacency matrix *A* obtained from KNN and the initial representation *Z*^{(0)}, with the weight matrix *W*^{(k)}, the representation *Z*^{(k+1)} learned by the GCN is obtained by the following convolutional operation:

*Z*^{(k+1)} = *ϕ*(((1 − *α*_{k})*P̃Z*^{(k)} + *α*_{k}*Z*^{(0)})((1 − *β*_{k})*I*_{N} + *β*_{k}*W*^{(k)}))    (7)

where the two hyperparameters *α*_{k} > 0 and *β*_{k} > 0 respectively ensure that each layer retains information from the input layer *Z*^{(0)} and that the weight matrix decays adaptively with stacked layers. *P̃* = *D̃*^{−1/2}*ÃD̃*^{−1/2} is the normalization term, with *Ã* = *A* + *I*_{N}, where *A* is the adjacency matrix obtained from the KNN graph, *I*_{N} is the identity matrix, and *D̃* is the degree matrix of *Ã*. The *ϕ* is the ReLU activation function. We set *β*_{k} following the previous study [43] and *α*_{k} = 0.3.

The last layer of the GCN module is connected to a softmax function applied to the final graph convolution:

*Z* = softmax(*P̃Z*^{(K)}*W*^{(K)})    (8)

where *K* indexes the last GCN layer, and the output *Z* ∈ ℝ^{N×c} can be treated as a probability distribution, with *c* the number of clusters. The entry *z*_{ij} ∈ *Z* represents the probability that cell *i* belongs to cluster center *j*.
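A minimal sketch of the residual propagation of Eqs. (6)-(8) is given below. The propagation form follows the residual GCN of [43] as reconstructed here; the fixed β value is a simplifying assumption (the cited study uses a layer-dependent schedule), and all tensor names are illustrative:

```python
import torch

def normalize_adj(a):
    """P_tilde = D^{-1/2} (A + I) D^{-1/2}; self-loops keep degrees >= 1."""
    a_tilde = a + torch.eye(a.size(0))
    d_inv_sqrt = torch.diag(a_tilde.sum(1).pow(-0.5))
    return d_inv_sqrt @ a_tilde @ d_inv_sqrt

def gcn_layer(p, z, z0, w, alpha=0.3, beta=0.1):
    """One residual graph convolution, Eq. (7)."""
    support = (1 - alpha) * (p @ z) + alpha * z0           # initial residual
    eye = torch.eye(w.size(0))
    return torch.relu(support @ ((1 - beta) * eye + beta * w))

n_cells, m, c = 50, 16, 4
a = (torch.rand(n_cells, n_cells) > 0.9).float()
a = torch.maximum(a, a.T)            # symmetric toy KNN adjacency
p = normalize_adj(a)
z0 = torch.rand(n_cells, m)          # Z^(0), Eq. (6), from a linear layer
z = z0
for k in range(5):                   # 5 GCN layers as in the Implementation
    z = gcn_layer(p, z, z0, torch.randn(m, m))
z_out = torch.softmax(p @ z @ torch.randn(m, c), dim=1)    # Eq. (8)
```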

#### 4. Dual Self-supervised Module

The dual self-supervised module was designed to guide the learning of low-dimensional representations by DAE and GCN for better cell clustering. Taking *H*^{(L)} from DAE as input, the cells are clustered into *c* clusters corresponding to *c* cluster centers *u*_{j}, *j* = 1, …, *c*, through the *K*-means algorithm. The clusters are then iteratively refined until convergence by learning from high-confidence assignments using an auxiliary target distribution.

Concretely, for the *i*-th single cell and the *j*-th cluster, the Student's t-distribution [44] is used as the kernel to calculate the similarity between the cluster center vector *u*_{j} and the data representation *h*_{i} as follows:

*q*_{ij} = (1 + ‖*h*_{i} − *u*_{j}‖²/*v*)^{−(v+1)/2} / ∑_{j′} (1 + ‖*h*_{i} − *u*_{j′}‖²/*v*)^{−(v+1)/2}    (9)

where *v* is the degree of freedom of the Student's t-distribution (*v* set to 1 in this study), and *q*_{ij} can be treated as the probability of the *i*-th cell belonging to the *j*-th cluster. Based on the computed distribution *Q* = [*q*_{ij}], the target distribution *P* = [*p*_{ij}] is computed by:

*p*_{ij} = (*q*_{ij}²/*f*_{j}) / ∑_{j′} (*q*_{ij′}²/*f*_{j′})    (10)

where *f*_{j} = ∑_{i} *q*_{ij} is the soft frequency for cluster *j*. For better clustering, a loss function is defined to minimize the Kullback–Leibler (KL) divergence between the two probability distributions:

*L*_{clu} = KL(*P*‖*Q*) = ∑_{i} ∑_{j} *p*_{ij} log(*p*_{ij}/*q*_{ij})    (11)
By minimizing the loss between the distribution *Q* and *P*, the target distribution *P* can help DAE module learn better representations for clustering cells. Since the target distribution *P* is defined by *Q*, minimizing *L*_{clu} is a form of self-supervised learning mechanism.
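The soft assignment and its sharpened target, Eqs. (9)-(11), can be sketched as follows (a minimal illustration; the representations and centers are random stand-ins for *H*^{(L)} and the K-means centers *u*_{j}):

```python
import torch

def soft_assign(h, u, v=1.0):
    """Student's t kernel, Eq. (9): q_ij = soft similarity of cell i to center j."""
    dist2 = ((h.unsqueeze(1) - u.unsqueeze(0)) ** 2).sum(-1)  # (N, c)
    q = (1.0 + dist2 / v) ** (-(v + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Sharpened target P, Eq. (10): emphasizes high-confidence assignments."""
    weight = q ** 2 / q.sum(dim=0)        # q_ij^2 / f_j
    return weight / weight.sum(dim=1, keepdim=True)

h = torch.rand(100, 10)   # stand-in for bottleneck representations H^(L)
u = torch.rand(6, 10)     # stand-in for 6 K-means cluster centers
q = soft_assign(h, u)
p = target_distribution(q)
l_clu = (p * (p / q).log()).sum() / h.size(0)   # KL(P || Q), Eq. (11), per cell
```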

To integrate the structural information between cells, we also used the target distribution *P* to supervise the updating of distribution *Z* as follows:

*L*_{gcn} = KL(*P*‖*Z*) = ∑_{i} ∑_{j} *p*_{ij} log(*p*_{ij}/*z*_{ij})    (12)

By decreasing the distance between the target distribution *P* and the soft assignments *Z*, the parameters of the GCN module learn information from the DAE module. In this way, the representation learned by the GCN contains both the characteristic information and the structural information of the data. Because both the DAE and GCN modules are supervised by the target distribution *P*, we call it a dual self-supervised mechanism.

Finally, the loss function of GraphSCC is defined as:

*L* = *L*_{res} + *θL*_{clu} + *ηL*_{gcn}    (13)

where *θ* and *η* are two hyper-parameters balancing the contributions from local structure preservation of the scRNA-seq data and from the clustering optimization. In this study, we set *θ* = 0.1 and *η* = 0.01 for all datasets.

Since the representations learned by the GCN contain both the structural information and the characteristic information learned from DAE, the soft assignments in distribution *Z* are used as the final clustering results. Thus, the label assigned to cell *i* is:

*y*_{i} = argmax_{j} *z*_{ij}    (14)

where *z*_{ij} is calculated in Eq. (8) for cell *i* and cluster *j*.
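The combined objective of Eq. (13) and the label readout of Eq. (14) are straightforward to assemble; the sketch below uses hypothetical per-step loss values in place of the actual module outputs:

```python
import torch

# Hypothetical per-step quantities from the DAE, DSM, and GCN modules:
l_res = torch.tensor(0.8)   # reconstruction loss, Eq. (4)
l_clu = torch.tensor(0.3)   # KL(P || Q), Eq. (11)
l_gcn = torch.tensor(0.5)   # KL(P || Z), Eq. (12)

theta, eta = 0.1, 0.01      # values used for all datasets in the paper
loss = l_res + theta * l_clu + eta * l_gcn   # Eq. (13)

# Final cluster labels come from the GCN soft assignments Z, Eq. (14):
z = torch.softmax(torch.rand(100, 6), dim=1)
labels = z.argmax(dim=1)
```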

The procedure of GraphSCC is also shown in Algorithm 1.

### C. Training and Evaluation

#### 1. Evaluation Metrics

The clustering results are evaluated by three commonly used metrics: Clustering Accuracy (CA) [45], Normalized Mutual Information (NMI) [46], and Adjusted Rand Index (ARI) [47]. The NMI is defined as:

NMI(*U*, *V*) = *I*(*U*, *V*) / max{*H*(*U*), *H*(*V*)}    (15)

where *U* and *V* are the predicted and truth assignments of a total of *n* cells into *C*_{U} and *C*_{V} clusters, respectively. The numerator *I*(*U*, *V*) is the mutual information between *U* and *V*, and the denominator is the larger of the entropies *H*(*U*) and *H*(*V*).

The CA is the best match between the predicted cluster assignments and the truth assignments, calculated as:

CA = max_{m} (∑_{i=1}^{n} 1{*l*_{i} = *m*(*u*_{i})}) / *n*    (16)

where *n* is the number of data points, and *m* ranges over all possible one-to-one mappings between the real labels *l*_{i} and the clustering assignments *u*_{i}.

The ARI is defined as:

ARI = 2(*ad* − *bc*) / ((*a* + *b*)(*b* + *d*) + (*a* + *c*)(*c* + *d*))    (17)

where *a* is the number of cell pairs belonging to the same group in both *U* and *V*, *b* is the number of cell pairs belonging to different groups in *V* and the same group in *U*, *c* is the number of cell pairs belonging to the same group in *V* and different groups in *U*, and *d* is the number of cell pairs belonging to different groups in both *U* and *V*.
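Of the three metrics, NMI and ARI are available in standard libraries (e.g., scikit-learn), while CA of Eq. (16) requires solving the label-matching problem; the maximization over one-to-one mappings is typically done with the Hungarian algorithm. A sketch, assuming labels are integers starting at 0:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """CA, Eq. (16): best one-to-one mapping between predicted clusters and
    true labels, found with the Hungarian algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    d = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((d, d), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                        # contingency matrix
    rows, cols = linear_sum_assignment(-cost)  # maximize matched cells
    return cost[rows, cols].sum() / y_true.size

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]   # same partition, permuted cluster labels
acc = clustering_accuracy(y_true, y_pred)   # → 1.0
```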

#### 2. Implementation

The GraphSCC model was implemented in Python 3 using PyTorch. The dimensions of the DAE are set to d-512-256-64-10, where d is the dimension of the input data and 10 is the dimension of the bottleneck latent representation *H*^{(L)}. The DAE is first pre-trained for 400 epochs by the Adam optimizer with an initial learning rate *lr* = 0.0001 and a batch size of 32. We used the "randn" function in PyTorch to generate Gaussian noises. Note that, for simulated data, we reduced the noise to 0.2 times its value. We set the number of GCN layers to 5, and set the dimensions of the initial representation *Z*^{(0)} and the hidden layers of the GCN to 256. The optimizer for the clustering stage is Adam with *lr* = 0.00001, and the clustering stage is trained for up to 1000 epochs. Training stops when the proportion of cells that change clustering assignment (*ca*) stays below a threshold *tol* for 300 consecutive steps. The *ca* is computed as *ca* = #|*Y*_{curr} ≠ *Y*_{prev}|/*n*, where *n* is the number of all cells, *Y*_{curr} is the cluster identity given by the maximum cluster assignment probability in the current step, *Y*_{prev} is the corresponding identity in the previous step, and #|*Y*_{curr} ≠ *Y*_{prev}| is the number of cells whose *Y*_{curr} differs from *Y*_{prev}. We set *tol* = 0.0001 by default. For all competing methods, we used the default parameters provided in the original articles. All experiments were conducted on an Nvidia Tesla P100 (16 GB).
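The stopping criterion can be sketched as below. The patience-counter formulation of "below *tol* for 300 consecutive steps" is our reading of the text, and the variable names are illustrative:

```python
import numpy as np

def assignment_change(y_curr, y_prev):
    """Fraction of cells whose cluster assignment changed between steps:
    ca = #|Y_curr != Y_prev| / n."""
    y_curr, y_prev = np.asarray(y_curr), np.asarray(y_prev)
    return np.mean(y_curr != y_prev)

# Hypothetical training-loop fragment: stop once ca stays below tol
# for 300 consecutive steps, tracked with a patience counter.
tol, patience, quiet = 1e-4, 300, 0
y_prev = np.array([0, 1, 1, 2])
y_curr = np.array([0, 1, 2, 2])
ca = assignment_change(y_curr, y_prev)   # 1 of 4 cells changed -> 0.25
quiet = quiet + 1 if ca < tol else 0     # reset whenever ca exceeds tol
stop = quiet >= patience
```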

## III. Results

### A. Evaluation of GraphSCC

We first evaluated GraphSCC on simulated data. To investigate the performance of GraphSCC under different scenarios, we employed the Splatter R package to generate scRNA-seq data with different "sigma" values in the log-normal distribution. A greater sigma value represents larger distances between cells from different clusters and thus lower clustering difficulty. As shown in Fig. 2, GraphSCC consistently achieved the highest NMI values. Though Seurat3.0 and IDEC could reach essentially the same NMI (∼0.98) as GraphSCC at a sigma of 0.4, they dropped sharply to 0.25 and 0.37, respectively, at a sigma of 0.2. By comparison, GraphSCC degrades much more gently, with an NMI of 0.76 at a sigma of 0.2. CIDR, scDeepCluster, and DCA performed poorly at a sigma of 0.2 with NMI below 0.1, but achieved decent results at a sigma of 0.4 with NMI of 0.72, 0.77, and 0.9, respectively. SIMLR failed (NMI close to 0) to uncover clustering signals when sigma was below 0.3, consistent with the previous observation [23].

We further investigated the clustering performance of GraphSCC on 15 real scRNA-seq datasets from different species and tissues (details in TABLE I). As shown in Fig. 3, GraphSCC outperformed the competing methods in all evaluation metrics. On average, GraphSCC achieved 0.798, 0.791, and 0.744 for CA, NMI, and ARI, respectively. These are respectively 13.3%, 9%, and 19% higher than the values achieved by Seurat3.0, generally the 2nd best method. scDeepCluster and SIMLR, in contrast to their poor performances on the simulated datasets, achieved NMI comparable to Seurat3.0 and ranked 4^{th} and 5^{th}. CIDR had an average NMI of 0.64. IDEC obtained the lowest NMI values, likely because the autoencoder module in IDEC cannot deal well with noisy scRNA-seq data. When measured by CA or ARI, these methods had generally similar ranks. The consistent outperformance of GraphSCC over the simulated and real datasets indicates the robustness of our method.

### B. Illustration of the GraphSCC

In this section, we visualize the clustering results on three datasets with different scales of cells. To illustrate the effectiveness of the embedded representations of GraphSCC, we employed t-SNE [44] to visualize them in the two-dimensional (2D) space. As shown in Fig. 4, on the Baron Mouse dataset, DCA, CIDR, and scDeepCluster showed poor performances. Seurat3.0 and SIMLR showed better separation, but the beta cells (colored blue) were separated into at least three groups and mixed with alpha cells. Compared to Seurat3.0, GraphSCC kept the beta cells in one group and produced more compact clusters for all cell types. Alpha and gamma cells were mixed in both GraphSCC and Seurat3.0.

On the Zeisel dataset, DCA, scDeepCluster, and CIDR performed poorly, mixing a few clusters together. Seurat3.0, SIMLR, and GraphSCC separated most cell populations. In comparison, the cell clusters from GraphSCC and SIMLR have better intra-cluster compactness and inter-cluster separability than those from Seurat3.0, especially for the oligodendrocytes and ca1pyramidal cells. All methods failed to separate the ca1pyramidal and s1pyramidal cells.

On the Baron Human dataset, DCA performed poorly and separated only three groups. SIMLR, CIDR, and scDeepCluster each separated at least seven compact clusters of cells, but they split the alpha cells (colored brown) into at least three groups. In contrast, Seurat3.0 and GraphSCC separated the most cell populations, though Seurat3.0 mixed a few beta and ductal cells with the alpha cells. All methods, including GraphSCC, separated the ductal cells into at least two groups.

We further visualized the wrongly clustered cells by Sankey river plots on the Baron Mouse dataset, as shown in Fig. 5, where GraphSCC achieved CA, NMI, and ARI values of 0.9, 0.904, and 0.935, respectively. For the beta cell type, which has the biggest proportion (47%), GraphSCC correctly assigned 98% of cells. In contrast, the second-best method, Seurat3.0, correctly assigned only 35% of cells; the other methods achieved accuracies of 34-67% on this cell type. Two major sources of wrong assignments in GraphSCC were the separation of the ductal cells into two clusters and the merging of the gamma cells with another cluster. The separation of ductal cells was also seen for SIMLR, and the merging of the gamma cells for Seurat3.0. These shared mistakes likely reflect the difficulty of clustering these cell types.

### C. Contribution of Components to the Clustering

In this section, we investigated the contributions of individual components to the clustering performance of GraphSCC by conducting ablation studies on the real datasets. As shown in TABLE II, removing the GCN module caused generally the largest drop in performance, with 7.8%, 4.8%, and 10.5% decreases in CA, NMI, and ARI, respectively. These changes indicate the importance of capturing structural relations between cells. Removing the residual connection (GraphSCC-Res) caused the 2nd biggest decreases in CA and NMI, and the biggest decrease in ARI. The residual connection is a good way to reduce the tendency of GCN to produce similar representations between nodes (cells), as indicated in the previous study [43]. We also clustered the cells based on the learned distributions *Q* and *P* (denoted as GraphSCC (Q) and GraphSCC (P)), and both caused decreases in performance relative to GraphSCC, which is based on distribution *Z*. In summary, the cooperation of the modules enables better clustering of scRNA-seq data.

## IV. Discussion and conclusion

In this study, we integrated the structural relations between cells into scRNA-seq clustering through a graph neural network. The structural deep clustering model GraphSCC consists of GCN, DAE, and DSM modules. GraphSCC effectively captures the relations between cells and the characteristics of the data by learning representations with the GCN and DAE modules, respectively. Furthermore, the DSM is applied to cluster cells based on these representations by iteratively optimizing the clustering objective function in an unsupervised manner. We have demonstrated that GraphSCC outperforms the competing methods on both simulated and real datasets. Furthermore, GraphSCC provides representations with better intra-cluster compactness and inter-cluster separability in the 2D visualization.

scRNA-seq is a revolutionary tool in biomedical research, and many studies have been conducted based on this technique. However, before we can fully reap the benefits of scRNA-seq, many challenges must be overcome. Clustering cells into biologically meaningful groups is a critical step in scRNA-seq analyses. Through comprehensive evaluations against competing methods on real and simulated datasets, we have shown that GraphSCC offers stable clustering results based on scRNA-seq data. We believe that GraphSCC will be a valuable tool for characterizing cellular heterogeneity. In the future, for better modeling of the distribution of scRNA-seq data, we will integrate an imputation mechanism into GraphSCC. We will also apply graph transformer models and attention mechanisms to make scRNA-seq analyses more explainable.

## AVAILABILITY OF DATA AND MATERIALS

The datasets used in this study are available at https://hemberg-lab.github.io/scRNA.seq.datasets/. All source code and datasets used in our experiments have been deposited at https://github.com/biomed-AI/GraphSCC.

## ACKNOWLEDGMENT

This study has been supported by the National Natural Science Foundation of China (61772566, 81801132, and U1611261), Guangdong Key Field R&D Plan (2019B020228001 and 2018B010109006) and Introducing Innovative and Entrepreneurial Teams (2016ZT06D211).

## Footnotes

↵* yangyd25{at}mail.sysu.edu.cn; yutong.lu{at}nscc-gz.cn