Abstract
Single-cell RNA sequencing (scRNA-seq) analysis has significantly advanced our knowledge of functional states of cells. By analyzing scRNA-seq data, we can deconvolve individual cell states into thousands of gene expression profiles, allowing us to perform cell clustering and identify significant genes for each cluster. However, interpreting these results remains challenging. Here, we present a novel scRNA-seq analysis pipeline named ASURAT, which simultaneously performs unsupervised cell clustering and biological interpretation in a semi-automatic manner, in terms of cell type and various biological functions. We validated the reliable clustering performance of ASURAT by comparing it with existing methods, using six published scRNA-seq datasets from healthy donors and cancer patients. Furthermore, we applied ASURAT to patient-derived scRNA-seq datasets including small cell lung cancers, finding putative cancer subpopulations showing different resistance mechanisms. ASURAT is expected to open new means of scRNA-seq analysis, focusing more on “biological meaning” than conventional gene-based analyses.
Introduction
Single-cell RNA sequencing (scRNA-seq) has profoundly advanced our knowledge of cells, owing to its immense potential for discovering the transcriptional principles governing cell fates at the single-cell level1. scRNA-seq has been widely used to improve understanding of individual cells2, intra- and intertumoral heterogeneity3, cell-to-cell interaction4, tumorigenesis5, drug resistance3,6, and the effects of viral infection on immune cell populations7. Various clustering methods, wherein cells are partitioned according to transcriptome-wide similarity, have been proposed8 and applied to cell type annotation9. However, interpreting single-cell data remains challenging10–13.
Conventionally, cell types are inferred using unsupervised clustering followed by a manual literature search of differentially expressed marker genes13. Currently, several computational tools, such as Garnett14 and SCSA12, are available to assist manual annotation, as detailed in the review by Pasquini et al.8. However, this process is often difficult because marker genes are generally expressed in multiple cell types15. In cancer transcriptomics, this difficulty is exacerbated by the interdependence between disease-related genes and numerous biological terms; furthermore, expression levels of marker genes can be heterogeneous depending on cancer microenvironments16.
A possible solution is to realize cell clustering and biological interpretation at the same time. Recently, reference-based analysis has been applied in single-cell transcriptomics10,12,17. One such technique is reference component analysis (RCA), which is used for accurate clustering of single-cell transcriptomes along with cell-type annotation based on similarity to reference transcriptome panels17. However, these methods require well-characterized transcriptomes with purified cells, which may be difficult to apply to ambiguous phenotypes. Another approach is using supervised classification11 combined with gene set enrichment analysis, incorporating biological knowledge such as pathway activity; hence, it may improve the interpretability over signature gene-based approaches, which place sole emphasis on individual roles of genes. However, we still lack a prevailing theory leveraging this information at the single-cell level.
To overcome the aforementioned limitations, a novel theoretical tool providing biological interpretations to computational results is needed. Thus, we propose a scRNA-seq analysis pipeline for simultaneous cell clustering and biological interpretation, named ASURAT. Here, “interpretation” is given by multiple biological terms such as cell type, biological process, pathway activity, chemical reaction, and various biological functions. By using ASURAT, users can create desired sets of biological terms and the corresponding spectrum matrices, which can be supplied to the subsequent unsupervised cell clusterings. In this paper, we first demonstrate the reliable clustering performance of ASURAT based on comparison with existing methods, using six published scRNA-seq datasets of healthy donors and cancer patients. Next, we applied ASURAT to single-cell lung cancer transcriptomes, which include malignant cancer types expressing neuroendocrine markers3. We show that ASURAT can greatly improve functional understandings of various cell types, which may contribute to clinical improvements.
Results
Overview of ASURAT
ASURAT was developed to simultaneously cluster single-cell transcriptomes and interpret them biologically, and it is implemented as R scripts (Supplementary Notes, Supplementary File 6). After inputting scRNA-seq data and knowledge-based databases (DBs), ASURAT creates lists of biological terms with respect to cell type and biological functions, which we termed signs. Then, ASURAT creates a functional spectrum matrix, termed a sign-by-sample matrix (SSM). By analyzing SSMs, users can cluster samples to aid their interpretation. The workflow is explained below (Fig. 1). The details of ASURAT’s formulations can be found in the Methods section.
Workflow of ASURAT
In preparation, we collected DBs for Disease Ontology (DO)18, Cell Ontology (CO)19, Gene Ontology (GO)20, Kyoto Encyclopedia of Genes and Genomes (KEGG)21, and Reactome22 using the R packages DOSE (version 3.16.0), ontoProc (version 1.12.0), clusterProfiler (version 3.18.0), KEGGREST (version 1.30.0), and reactome.db (version 1.74.0), respectively (Chapter 7, Supplementary Notes). Any DBs including corresponding tables between biological descriptions and genes can be input to ASURAT (Fig. 1b). Additionally, ASURAT computes a correlation matrix using Pearson or Spearman correlation coefficients from a normalized read count matrix of scRNA-seq data.
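The correlation step can be illustrated with a minimal sketch. It is written in plain Python for readability, whereas ASURAT itself is implemented in R; the function name `pearson_correlation_matrix` and the toy matrix are hypothetical, and only the Pearson variant is shown.

```python
import math


def pearson_correlation_matrix(expr):
    """Pearson correlation between the rows (genes) of a gene-by-cell matrix.

    `expr` is a list of rows, one per gene, each holding normalized
    expression values across cells. This is an illustrative helper, not
    part of ASURAT.
    """
    centered = []
    for row in expr:
        mean = sum(row) / len(row)
        centered.append([x - mean for x in row])
    norms = [math.sqrt(sum(x * x for x in row)) for row in centered]
    if any(norm == 0.0 for norm in norms):
        raise ValueError("constant rows have no defined correlation")
    p = len(expr)
    corr = [[1.0] * p for _ in range(p)]  # diagonal elements are 1
    for i in range(p):
        for j in range(i + 1, p):
            dot = sum(a * b for a, b in zip(centered[i], centered[j]))
            corr[i][j] = corr[j][i] = dot / (norms[i] * norms[j])
    return corr


# Toy data: gene 1 rises with gene 0, gene 2 falls as gene 0 rises.
toy = [[1.0, 2.0, 3.0, 4.0],
       [2.0, 4.0, 6.0, 8.0],
       [4.0, 3.0, 2.0, 1.0]]
R = pearson_correlation_matrix(toy)
```

In this toy case the first two genes are perfectly positively correlated and the third is perfectly negatively correlated with both, which is the kind of structure the sign creation step exploits.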
The first step is to create signs by inputting a normalized-and-centered read count matrix and knowledge-based DB. From a gene set Ω and correlation matrix R defined for each biological description T in DBs, ASURAT decomposes the correlation graph into several parts. Here, a triplet of biological description, gene subset, and correlation matrix is termed a sign, in particular (T, Ω, R) a parent sign. In many applications, high correlations are expected to have rich information. Hence, we decompose Ω into the following three categories (Fig. 2): (i) a strongly correlated gene set (SCG), which is a set of genes with strong positive correlations with each other; (ii) variably correlated gene set (VCG), which is a set of genes with strong negative correlations with genes in SCG; and (iii) weakly correlated gene set (WCG), which is a set of genes with weak correlations with each other.
Next, ASURAT creates an SSM for SCG by weighted averaging of normalized and centered gene set expression levels of SCGs and WCGs. Similarly, an SSM for VCG is created from VCGs and WCGs. Then, by vertically concatenating SSMs for SCG and VCG, we create a single SSM. The rows and columns of an SSM stand for signs and samples (or cells), respectively, and entries stand for cell-type or functional spectra, termed sign scores. A remarkable benefit is that users can create multiple SSMs as necessary by inputting various DBs (Fig. 1c).
The final step is to characterize samples using SSMs to produce a conclusion. One focus of analyzing SSMs is to cluster samples and find significant signs (Fig. 1d), where “significant” means that the sign score is specifically upregulated or downregulated at the cluster level (cf. separation index). In ASURAT, we use two strategies: one uses unsupervised clusterings, such as Partitioning Around Medoids (PAM), hierarchical-based, and graph-based clusterings with and without principal component analysis (PCA); while the other is a method of extracting a continuous tree-like topology using diffusion map23, followed by allocating samples to different branches of the data manifolds24. Choosing an appropriate strategy depends on the biological context, but the latter is usually applied for developmental processes or time-course experimental data, which are often followed by pseudotime analyses.
Comparison of performance of ASURAT with existing methods
Many unsupervised clustering methods have been proposed and their performances quantified using datasets with independently identified phenotypes. However, it remains unclear whether these methods robustly demonstrate better performance using cancer single-cell transcriptomes including ambiguous phenotypes. Conventional marker gene-based approaches may misrepresent cluster accuracy17, and simple application of PCA may be ineffective. However, when using ASURAT, users can obtain robust and explainable clustering results, since SSMs can be created from as many DBs as needed and supplied to the subsequent unsupervised clusterings.
To validate the reliable clustering performance of ASURAT, we obtained six published scRNA-seq datasets derived from healthy donors (PBMC datasets: pbmc_4000 and pbmc_6000), cervical cancer patients (day1_norm and day7_hypo), and lung cancer patients (sc68_vehi and sc68_cisp). From all datasets, we excluded genes and cells with low qualities and attenuated technical biases with respect to zero-inflation and variation of capture efficiencies between cells using bayNorm25. The resulting read count tables were supplied to ASURAT and four other methods: Seurat (version 4.0.1)26, Monocle 3 (version 0.2.3.0)27, SC3 (version 1.18.0)28, and PCA using prcomp() from the R stats package (version 4.0.4).
There are five blood cell types in the PBMC datasets12, which are regarded as hypothetical results. However, no consensus cell types exist, especially for cancer datasets. Hence, the clustering accuracies cannot be quantified using standard measures such as the adjusted Rand index29. Instead, the clustering qualities were assessed using validity indices such as average silhouette width (ASW)30, a measure of how tightly cells are grouped within clusters and of the distance between clusters. To reduce computational cost, we performed two-dimensional Uniform Manifold Approximation and Projection (UMAP)31 after the straightforward computations of Seurat, Monocle 3, PCA, and ASURAT; the resulting two-dimensional cell states were supplied to NbClust32, and 26 validity indices were obtained (Supplementary Files). From SC3, we obtained only ASWs computed from consensus matrices and hierarchical clusterings. We hypothesized that clustering quality positively correlates with clustering accuracy, while considering that they do not guarantee interpretability. Additionally, other topology-based clustering methods were not used for computing ASWs.
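As an illustration of the validity index, the following minimal sketch computes the ASW for points in a two-dimensional embedding using Euclidean distance. It is plain Python and is not the NbClust implementation used in this study; the function name, the singleton convention, and the toy coordinates are assumptions.

```python
import math


def average_silhouette_width(points, labels):
    """Mean silhouette width over all points (Euclidean distance).

    For each point, a is the mean distance to the other members of its own
    cluster and b is the smallest mean distance to any other cluster; the
    silhouette width is (b - a) / max(a, b).
    """
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    widths = []
    for i, point in enumerate(points):
        dist_sum = {c: 0.0 for c in clusters}
        count = {c: 0 for c in clusters}
        for j, other in enumerate(points):
            if i != j:
                dist_sum[labels[j]] += math.dist(point, other)
                count[labels[j]] += 1
        own = labels[i]
        if count[own] == 0:
            widths.append(0.0)  # assumed convention for singleton clusters
            continue
        a = dist_sum[own] / count[own]
        b = min(dist_sum[c] / count[c] for c in clusters if c != own)
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / len(widths)


# Two tight, well-separated groups in a toy 2D embedding.
embedding = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = average_silhouette_width(embedding, [0, 0, 1, 1])
bad = average_silhouette_width(embedding, [0, 1, 0, 1])
```

A labeling that follows the two groups yields an ASW close to 1, whereas a labeling that mixes them yields a negative value.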
For PBMC datasets with known numbers of clusters of existing cell types, we compared ASWs across all the methods within such numbers ±1 (shaded area in Fig. 3a). For other datasets, we focused on the ranges of the number of clusters, wherein at least one method provides ASWs ≥0.6. Interestingly, the best-performing method, exhibiting the greatest ASW, was different across the datasets (Fig. 3a). Seurat performed best when the number of clusters k = 4 in pbmc_6000. Although SC3 outperformed at a different k in day7_hypo and PBMC datasets, it could not detect >1 cluster in sc68_vehi and sc68_cisp. Compared with other methods, only the naïve usage of PCA was unremarkable across most datasets.
Notably, ASURAT outperformed existing methods at ≥1 k in every dataset, with one exception in sc68_cisp (Fig. 3a). Moreover, those ASWs were >0.5 without exception and >0.6 with only one exception (viz. sc68_cisp). The existing methods presented both strengths and weaknesses depending on the datasets. Seurat exhibited better performances with PBMC datasets, while it performed less remarkably with most cancer datasets. Although we carefully tuned Seurat’s parameters by changing the normalization method, variable gene-per-cell ratio, and the number of principal components, we could not obtain well-separated clusters for day1_norm and day7_hypo (Fig. 3b). In contrast, Monocle 3 generally exhibited better performances on cancer datasets while performing less remarkably with PBMC datasets. We found that Monocle 3’s clustering performance was unstable and strongly depended upon dimension reduction techniques.
To confirm whether ASURAT outperforms existing methods using other low-dimensional representation techniques, we replaced UMAP with t-distributed stochastic neighbor embedding (t-SNE)33 and supplied the resulting two-dimensional cell states to NbClust32. Again, we confirmed that ASURAT generated well-separated clusters with relatively greater ASWs across datasets, while Monocle 3 broke down when used with some datasets (Supplementary Fig. S1). These results indicate that cells are better characterized in the high-dimensional sign score space than in the gene expression space.
Finally, to validate ASURAT’s cell-type inference, we reanalyzed PBMC datasets using Seurat, Monocle 3, SC3, and ASURAT under almost default settings. Consequently, Seurat and Monocle 3 could reproduce most blood cell type labels (Figs. 3c and d), as inferred by Cao et al.12, but a few dozen cells remained unspecified. Although SC3 provided the greatest ASWs at k = 4 and 6 in pbmc_4000 and pbmc_6000, respectively, it reproduced only B cell and NK or NKT cell labels. However, ASURAT identified five cell types, with none remaining unspecified (Supplementary Figs. S3 and S4). The subpopulation ratios were approximately consistent with the reported values, except for the tiny megakaryocyte subpopulation. Such a small discrepancy was unavoidable, because Cao et al. used only differentially expressed genes and preselected cell types to identify the most preferable cell types. Furthermore, we reanalyzed cervical cancer datasets using ASURAT and found several putative populations of small cell neuroendocrine carcinoma and adenocarcinoma (Supplementary Figs. S5 and S6). These results demonstrate that ASURAT can perform robust, high-quality, and reliable clusterings using various single-cell transcriptomes.
Identifying chemoresistant cells in lung cancer scRNA-seq datasets
Previous work3 indicated that small cell lung cancer (SCLC) tumors undergo a shift from chemosensitivity to chemoresistance against platinum-based therapy. However, the exact mechanism behind chemoresistance is still unclear, because transcriptional heterogeneity is often concealed in hidden biological states, which cannot be readily identified by conventional marker gene-based analyses. To investigate the cancer subtypes in the chemosensitive and chemoresistant tumors, we applied ASURAT to the scRNA-seq data of circulating tumor cell-derived xenografts from the vehicle (sc68_vehi) and cisplatin (sc68_cisp) treatment groups.
Given the normalized and centered read count matrices, we created SSMs using DO and GO DBs, and KEGG for both sc68_vehi and sc68_cisp. We then visualized the sign scores in heat maps (Figs. 4a and 5a). The cells were clustered by one of the following: (i) PCA, followed by k-nearest neighbor (KNN) graph generation and Louvain algorithm using Seurat’s functions26 and (ii) diffusion map generation, followed by allocation of cells to the different branches of the data manifold using MERLoT24. Here, cells in sc68_vehi were clustered by (i), while those in sc68_cisp were clustered by (ii), providing the most explainable results.
We visualized the t-SNE plot of SSM using GO for sc68_vehi, wherein cell clustering labels and SCLC-related sign scores are overlaid (Fig. 4b). Sign IDs and the related genes are represented by, for example, DOID:5409_S (ASCL1, etc.) and DOID:5409_V (MKI67, BIRC5, etc.), where the suffixes “S” and “V” indicate SCG and VCG, respectively. Since ASCL1, MKI67, and BIRC5 are important for neuronal differentiation34, malignancy35, and inhibition of apoptosis36, respectively, DOID:5409_S and DOID:5409_V represent SCLC differentiation and proliferation with cell survival, respectively. We found at least two existing subpopulations of SCLC in sc68_vehi. This was further confirmed by violin plots for the related signs (Fig. 4c). Remarkably, sign scores for platinum drug resistance were specifically upregulated in the group with label 2 (GO: BP). The population ratios of groups 1 and 2 were 0.84 and 0.15, respectively. Consequently, we found that the SCLCs not receiving cisplatin treatment contained ≤15% putative chemoresistant cells, which was not found in the original report3.
Likewise, we visualized the diffusion map of SSM with DO for sc68_cisp. We observed a tree-like topology in the data manifold, representing a putative cell differentiation lineage (Fig. 5b). We defined a pseudotime t ∈ [0, 1] (i.e., an arc-length parameter) along the branches using MERLoT24; a starting point t = 0 was set at the end of the branch with label 1. From the pseudotime course analysis, we found at least three SCLC subpopulations (Fig. 5c). Strikingly, sign scores for different resistance mechanisms, such as platinum drug resistance and PD-L1 expression mediating immunosuppression, were upregulated in groups labeled 2 and 3 (DO: disease), while sign scores for intracellular protein transport with an SCLC malignancy marker CD2437 were upregulated in the group labeled 1 (DO: disease), suggesting the recalcitrant malignancy of relapsed SCLCs against cisplatin treatments. The population ratios of groups 1, 2, and 3 were 0.39, 0.30, and 0.30, respectively. Consequently, we found 30% putative chemoresistant SCLCs and another 30% with other possible resistant cell types expressing PD-L1, while others did not exhibit these resistance mechanisms. Our results support the finding that transcriptional heterogeneity increases in chemoresistant SCLC tumors3.
The most time-consuming step in our workflow is finalizing the set of signs by tuning ASURAT’s parameters through trial and error, which is critical for downstream analyses. Here, users may face difficulty in prioritizing the importance of several signs. For sc68_cisp, we found that the sign scores for meningioma, myopathy, malignant pleural mesothelioma, and other diseases were also upregulated in the group labeled 2, but their actual relationships to the patient’s disease were unknown. Nevertheless, ASURAT helped us find well-structured data manifolds and characterize cells in biologically explainable manners for cell types, biological processes, and signaling pathways.
Discussion
We developed a novel scRNA-seq analysis pipeline for simultaneous cell clustering and biological interpretation, allowing users to create systems of cell-type and functional spectra as necessary by inputting collected databases. The resulting matrices can be supplied to unsupervised clustering without gene preselection. We analyzed cancer patient- and healthy donor-derived scRNA-seq datasets: the former to uncover the unknown characteristics of small cell neuroendocrine cancers, and the latter to confirm cell-type inference by reproducing results inferred in previous studies.
First, we demonstrated ASURAT’s superiority to existing methods with respect to robust, high-quality, and reliable clustering using these datasets (Fig. 3). ASURAT yielded well-separated cell clusters from most transcriptomes, despite the dimension reduction processing, while other conventional methods occasionally failed, demonstrating cells were better characterized in the high-dimensional sign score space than in the gene expression space. In practice, we recommend using signature gene-based tools such as Seurat before using ASURAT to broadly understand the transcriptome. Unlike reference-based analyses10,12,17, ASURAT does not require any bespoke reference but instead takes input from knowledge-based databases.
Next, we found the putative cancer subpopulations existing in the chemosensitive and chemoresistant tumors of SCLC. We found that sc68_vehi (vehicle treatment) contained ≤ 15% possible platinum-resistant cells (Fig. 4c), suggesting this chemoresistant mechanism latently existed before the therapy. Moreover, we found that sc68_cisp (cisplatin treatment) contained 30% platinum-resistant cells with the same ratio of cells exhibiting PD-L1 expression (Fig. 5c).
Notably, we demonstrated that simultaneous cell clustering and biological interpretation of single-cell transcriptomes was viable (Fig. 1). The formulation of correlation-based decomposition of signature gene sets was critical for ASURAT’s performance (Fig. 2). Additionally, we searched virtually the whole parameter space to obtain the desired interpretation results. Thus, our strategy may greatly improve functional understandings of cancer subpopulations, intracellular heterogeneity, and cellular processes.
However, some limitations are worth noting. Although small cell neuroendocrine cancers have been studied extensively for human tumors by bulk sample RNA-seq analyses34, few publications address scRNA-seq experiments for such rare cancer subtypes. As available scRNA-seq data and knowledge-based databases expand in size and diversity, our theoretical framework for ASURAT should be generalized to prioritize biological terms more efficiently than manual screening. Furthermore, integrating systems of signs across various conditions should be addressed. One means is applying canonical correlation analysis, which has been incorporated in Seurat26,38. Nevertheless, extracting common systems of “biological meanings” across multiple conditions, different cell types, and possibly different species remains challenging.
We also expect ASURAT to improve scRNA-seq data-driven mathematical modeling for patient classification39, which includes parameter estimation for dynamical systems of gene regulatory networks. Since ASURAT detects significant biological functions (e.g., biological process, pathway activity, and chemical reaction) for cell clustering, one can obtain promising candidates for a core regulatory network, which may greatly reduce the number of parameters. Another interesting approach to this problem is using ASURAT to construct sign networks, which may be analyzed by nonparametric Markov random field theory40. We expect ASURAT to open new ways of scRNA-seq analysis from a “biological meaning” perspective, beyond conventional gene-based analyses.
Author contributions
M.O. and M.I. started the project. K.I. conceived the theory of ASURAT. K.I. developed the analysis pipeline. J.K. and M.I. prepared the cervical cancer samples and obtained the single-cell RNA sequencing data. M.I. and J.K. translated the computational results. K.I., J.K., and M.O. wrote the manuscript. M.O. supervised the work.
Conflict of interest
The authors declare no conflict of interest.
Supplementary materials
Supplementary Notes: Clear documentation (R bookdown files) showing the commands and outputs for all the analyses in the present paper, as well as an introduction to ASURAT, is available on GitHub (https://github.com/keita-iida/ASURAT).
Methods
Datasets and data processing
Human lung cancer datasets
These data were obtained from circulating tumor cell-derived xenografts cultured with vehicle (symbolized by sc68_vehi) and cisplatin (sc68_cisp) treatments, which were generated from lung cancer patients3. The data were produced with the 10x protocol using unique molecular identifiers (UMIs) (https://support.10xgenomics.com/single-cell-gene-expression/library-prep/doc/user-guide-chromium-single-cell-3-reagent-kits-user-guide-v2-chemistry). The SRA files were downloaded from Gene Expression Omnibus (GEO) with the accession code GSE138474: GSM4104164 and GSM4104165, which are referenced in Stewart et al.3. SRA Toolkit version 2.10.8 was used to dump the FASTQ files. Cell Ranger version 3.1.0 was used to align the FASTQ files to the GRCh38-3.0.0 human reference genome and produce the single-cell transcriptome datasets. After quality controls, the read count matrices of sc68_vehi (resp. sc68_cisp) contained 6581 (resp. 6347) genes and 3923 (resp. 2285) cells.
Human cervical cancer datasets
These data were obtained from cancer tissue originated spheroids (CTOS line cerv21) including small cell neuroendocrine carcinoma, cultured for 1 d under normoxic conditions (symbolized by day1_norm) and 7 d hypoxic conditions (day7_hypo), which were generated from cervical cancer patients41. The data were produced by the Nx1-seq protocol using UMIs. The FASTQ files were downloaded from the DNA Data Bank of Japan (DDBJ) with accession codes DRA007915: DRX155817 and DRX155818. The Nx1-seq data were aligned and annotated as described previously42. Briefly, the barcode sequences were extracted from the read 1 FASTQ files. The read 2 FASTQ files, which included each cell mRNA, were directly aligned to Refseq transcript sequences (ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot) using bowtie 2.2.643. The aligned reads were linked to their paired extracted barcode sequences. By counting mapped reads per barcode, the gene count data in individual cells were obtained. After quality controls, the read count matrices of day1_norm (resp. day7_hypo) contained 5272 (resp. 6213) genes and 3663 (resp. 1947) cells.
Human peripheral blood mononuclear cell datasets
These datasets were obtained from peripheral blood mononuclear cells (PBMCs) of healthy donors, which include approximately 4000 (symbolized by pbmc_4000) and 6000 (pbmc_6000) cells. The data were produced with a 10x protocol using UMIs. The single-cell transcriptome datasets were downloaded from 10x Genomics repository (https://support.10xgenomics.com/single-cell-gene-expression/datasets). The following filtered read count matrices were obtained: 4000 PBMCs from a healthy donor (https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/pbmc4k) and 6000 PBMCs from a healthy donor (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc6k). After quality controls, the read count matrices of pbmc_4000 (resp. pbmc_6000) contained 6658 (resp. 5169) genes and 3815 (resp. 4878) cells.
Data preprocessing: quality control, normalization, and centering
For all the single-cell RNA sequencing (scRNA-seq) data, the genes and cells with low qualities were removed by the following three steps: (i) removing the genes for which the number of non-zero expressing cells is less than a user-defined threshold; (ii) removing the cells whose read counts, number of genes expressed with non-zero read counts, or percent of reads mapped to mitochondrial genes fall outside user-defined ranges; and (iii) removing the genes for which the mean of read counts is less than a user-defined threshold (Chapter 3, Supplementary Notes).
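The three filtering steps can be sketched as follows. This is a simplified illustration in plain Python, not the R code in the Supplementary Notes: the function name, the argument names, the use of a single upper bound on the mitochondrial percentage, and the identification of mitochondrial genes by an "MT-" prefix are all assumptions.

```python
def quality_control(counts, genes, min_cells, count_range, n_gene_range,
                    max_mito_pct, min_mean):
    """Filter a gene-by-cell read count matrix in three steps.

    counts: list of rows (one per gene), each a list of counts per cell.
    genes:  gene symbols, in the same order as the rows of counts.
    Returns (kept gene symbols, kept cell indices, filtered matrix).
    """
    # Step (i): keep genes expressed (non-zero) in at least min_cells cells.
    gene_idx = [g for g, row in enumerate(counts)
                if sum(1 for x in row if x > 0) >= min_cells]

    # Step (ii): keep cells whose total counts, number of expressed genes,
    # and mitochondrial read percentage lie within the allowed ranges.
    cell_idx = []
    for c in range(len(counts[0])):
        column = [counts[g][c] for g in gene_idx]
        total = sum(column)
        n_expressed = sum(1 for x in column if x > 0)
        mito = sum(counts[g][c] for g in gene_idx
                   if genes[g].startswith("MT-"))  # assumed naming scheme
        mito_pct = 100.0 * mito / total if total > 0 else 0.0
        if (count_range[0] <= total <= count_range[1]
                and n_gene_range[0] <= n_expressed <= n_gene_range[1]
                and mito_pct <= max_mito_pct):
            cell_idx.append(c)

    # Step (iii): keep genes whose mean count over kept cells is high enough.
    kept_genes, kept_counts = [], []
    for g in gene_idx:
        row = [counts[g][c] for c in cell_idx]
        if row and sum(row) / len(row) >= min_mean:
            kept_genes.append(genes[g])
            kept_counts.append(row)
    return kept_genes, cell_idx, kept_counts


# Toy example with hypothetical thresholds.
toy_genes = ["A", "B", "MT-C", "D"]
toy_counts = [[5, 3, 0, 4],
              [2, 0, 0, 1],
              [1, 1, 9, 0],
              [0, 0, 1, 0]]
kept_genes, kept_cells, kept_counts = quality_control(
    toy_counts, toy_genes, min_cells=2, count_range=(3, 100),
    n_gene_range=(2, 10), max_mito_pct=30.0, min_mean=1.0)
```

In the toy example, gene D is removed in step (i), the third cell is removed in step (ii) because it expresses a single gene and its reads are entirely mitochondrial, and MT-C is removed in step (iii).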
After quality controls, the data were normalized by bayNorm25, which attenuates technical biases with respect to zero-inflation and variation of capture efficiencies between cells. The resulting inferred true count matrices were supplied to a log transformation with a pseudo-count to attenuate the impact of dispersion in the counts for highly expressed genes. Finally, subtracting the sample mean from each row vector, we obtained the normalized and centered read count matrices (Chapter 4, Supplementary Notes).
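The log transformation and centering can be sketched as below, starting from counts that are assumed to be already normalized (the bayNorm step is not reproduced). The natural logarithm and a pseudo-count of 1 are assumptions for illustration, since the text does not fix either choice, and the function name is hypothetical.

```python
import math


def log_transform_and_center(counts, pseudo_count=1.0):
    """Log-transform a gene-by-cell matrix, then center each gene (row).

    The pseudo-count keeps zero counts finite; its value and the log base
    are illustrative choices.
    """
    centered = []
    for row in counts:
        logged = [math.log(x + pseudo_count) for x in row]
        mean = sum(logged) / len(logged)
        centered.append([x - mean for x in logged])
    return centered


toy_counts = [[0.0, 3.0, 7.0, 15.0],
              [1.0, 1.0, 0.0, 30.0]]
centered_matrix = log_transform_and_center(toy_counts)
row_means = [sum(row) / len(row) for row in centered_matrix]
```

After this step every gene has mean zero across cells, which is why sign scores derived from the matrix also average to zero across samples.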
Definition of sign
Let T be a biological description, Ω a variable (e.g., gene) set defined for T, and R a relation structure (e.g., correlation matrix) among Ω. Assume that Ω can be represented by a union of its subsets based on R, that is, $\Omega = \bigcup_i \Omega^{(i)}$. Then, the triplet (T, Ω(i), R) is termed a sign, in particular (T, Ω, R) a parent sign.
Definition of correlated gene set
Let A = (ai,j) be a gene-by-sample matrix of size p × n from transcriptome data, whose entries stand for normalized and centered gene expression levels, and R = (ri,j) a correlation matrix of size p × p defined by A and a certain measure, whose diagonal elements are 1. Let α and β be positive and negative constants satisfying 0 < α ≤ 1 and −1 ≤ β < 0, respectively, and let us fix a biological description Tk and the associated gene set Ωk = {1, 2, … , mk}, where k = 1, 2, … , K for some K. Now, consider the following subsets of Ωk:
Hereinafter we omit the arguments α and β for simplicity. Let us denote , where “\” means set difference. If Vk is not empty, represent each element of Wk as a point in the Euclidean space spanned by the row vectors of R and decompose Wk into two disjoint subsets by Partitioning Around Medoids (PAM) clustering44, that is . Otherwise, if Vk is empty, let and (empty). Thus Ωk is decomposed into three parts as follows:
Let (resp. ) be the mean of off diagonal elements of R for , and assume without loss of generality. If , then , and are strongly, variably, and weakly correlated gene sets, respectively, which are abbreviated as SCG, VCG, and WCG. Otherwise, correlated gene sets cannot be defined for Tk.
For any given (Tk, Ωk, R), the genes should strongly and positively correlate within each of $\Omega_k^{(s)}$ and $\Omega_k^{(v)}$, while they negatively correlate between $\Omega_k^{(s)}$ and $\Omega_k^{(v)}$. Thus, we can hypothesize that SCG and VCG are predominantly associated with Tk, which may aid interpretation of biological meanings of corresponding signs. Fig. 2 shows that Ω(s) and Ω(v) include KRT18 and ASCL1, which respectively have negative and positive contributions for lung small cell carcinoma. Thus, we interpret that (T, Ω(s), R) and (T, Ω(v), R) relate positively and negatively with this cell type, respectively.
Though simpler methods based on decomposition of correlation graphs exist, such as one-shot PAM clustering44, tree cutting after hierarchical clustering45, independent component analysis (ICA)- or principal component analysis (PCA)-based methods46, and several graph statistical approaches47,48, we found our VCG definition is critical for providing sample clusterings in the downstream analysis. We tried replacing our decomposition method (1) with one-shot PAM clustering, but sample clusterings frequently exhibited deteriorated performance. This occurred when both VCG and WCG (obtained from the one-shot clustering) included many weakly correlated genes, which may contribute less to the parent sign.
Definition of sign-by-sample matrix
Let A = (ai,j) be a gene-by-sample matrix of size p × n from transcriptomic data, whose entries stand for normalized and centered gene expression levels, and G = {1, 2, … , p} a set representing p genes. Assume that we have q biological descriptions and the associated gene sets, denoted by Tk and Ωk, k = 1, 2, … , q, respectively. Let us assume that Ωk can be decomposed into non-empty $\Omega_k^{(s)}$, $\Omega_k^{(v)}$, and $\Omega_k^{(w)}$ for any k. Let B(x), x ∈ {s, v, w}, be matrices of size q × n whose entries are defined as follows:

$$b_{k,j}^{(x)} = \frac{1}{|\Omega_k^{(x)}|} \sum_{i \in \Omega_k^{(x)}} a_{i,j},$$

where $|\Omega_k^{(x)}|$ stands for the number of elements in $\Omega_k^{(x)}$. Additionally, let C(x), x ∈ {s, v}, be q × n matrices as follows:

$$C^{(x)} = \omega^{(x)} B^{(x)} + \left(1 - \omega^{(x)}\right) B^{(w)},$$

where ω(x), 0 ≤ ω(x) ≤ 1, are weight constants. Here C(s) and C(v) are said to be sign-by-sample matrices (SSMs) for SCG and VCG, respectively, and the (k, j) entry of C(x) is termed the sign score of the kth sign and jth sample (Fig. 1c). Note that ensemble means of sign scores across samples are zeros because SSMs are derived from the centered gene expression matrix A.
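The construction of an SSM can be sketched in a few lines, under the assumption that each sign score is a weighted average of the mean expression of the strongly (or variably) correlated genes and the mean expression of the weakly correlated genes, with a weight between 0 and 1. The function names, the default weight of 0.5, and the toy gene sets are hypothetical; this is plain Python rather than ASURAT's R code.

```python
def gene_set_means(expr, gene_sets):
    """One row per gene set: mean of the listed rows of expr, per sample."""
    n_samples = len(expr[0])
    return [[sum(expr[i][j] for i in gene_set) / len(gene_set)
             for j in range(n_samples)]
            for gene_set in gene_sets]


def sign_by_sample_matrix(expr, correlated_sets, weak_sets, weight=0.5):
    """Weighted average of correlated-set and weak-set mean expression.

    correlated_sets[k] and weak_sets[k] hold row indices of the SCG (or VCG)
    and WCG of the k-th biological description; all sets must be non-empty.
    """
    b_x = gene_set_means(expr, correlated_sets)
    b_w = gene_set_means(expr, weak_sets)
    return [[weight * bx + (1.0 - weight) * bw for bx, bw in zip(rx, rw)]
            for rx, rw in zip(b_x, b_w)]


# Toy centered expression matrix: 4 genes by 2 samples.
expr = [[1.0, -1.0],
        [3.0, -3.0],
        [0.5, -0.5],
        [-2.0, 2.0]]
ssm_scg = sign_by_sample_matrix(expr, correlated_sets=[[0, 1]], weak_sets=[[2]])
ssm_vcg = sign_by_sample_matrix(expr, correlated_sets=[[3]], weak_sets=[[2]])
ssm = ssm_scg + ssm_vcg  # vertical concatenation: rows are signs
```

Because every row of the input is centered, each row of the resulting SSM also sums to zero across samples.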
Definition of separation index
Briefly, a separation index is a measure of significance of a given sign score for a given subpopulation. Since the row vectors of SSMs are centered (i.e., the means are zeros), wherein the degree of freedom is reduced, naïve usages of statistical tests and fold change analyses should be avoided. Nevertheless, we propose helping users to find significant signs using a nonparametric index to quantify the extent of separation between two sets of random variables. A separation index of a given random variable X takes a value from −1 to 1: the larger positive value indicates that Xs are markedly upregulated, and the probability distribution is well separated against other distributions and vice versa.
Let us consider a vector a of size n, i.e., the number of samples, whose elements stand for the sign scores, and assume that the elements are sorted in ascending order. For simplicity, suppose that the samples are classified into two groups labeled 0 and 1. Let v be a vector of the labels corresponding to a, and w0 and w1 vectors having the same elements as v but sorted in lexicographic order in the forward and backward directions, respectively. Then we define the separation index as follows:

$$I(v) = \frac{d(v, w_1) - d(v, w_0)}{d(v, w_1) + d(v, w_0)}, \qquad (3)$$

where d(v, wi) is an edit distance (or Levenshtein distance49) with only adjacent swapping permitted. For example, if v = (1, 0, 0, 1, 1), then w0 = (0, 0, 1, 1, 1) and w1 = (1, 1, 1, 0, 0). From (3) one can calculate d(v, w0) = 2 and d(v, w1) = 4, and thus I(v) = 1/3. As another example, if v = (0, 1, 1, 0, 0), then I(v) = −1/3. From these examples, one can see that positive and negative values of I(v) mean that the given sign has positive and negative contributions for group “1,” respectively.
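The two worked examples can be checked with a short sketch. It assumes the index is the normalized difference between the adjacent-swap distances to the backward-sorted and forward-sorted label vectors, which reproduces the values 1/3 and −1/3 quoted in the text. The swap distance is computed by counting inversions, and the function names are hypothetical.

```python
def adjacent_swap_distance(labels, descending=False):
    """Minimum number of adjacent swaps needed to sort binary labels.

    For ascending order this counts pairs where a 1 precedes a 0; for
    descending order it counts pairs where a 0 precedes a 1.
    """
    first, second = (0, 1) if descending else (1, 0)
    swaps = 0
    seen_first = 0
    for x in labels:
        if x == first:
            seen_first += 1
        elif x == second:
            swaps += seen_first
    return swaps


def separation_index(labels):
    """Separation index of binary labels ordered by ascending sign score."""
    d0 = adjacent_swap_distance(labels)                   # distance to w0
    d1 = adjacent_swap_distance(labels, descending=True)  # distance to w1
    if d0 + d1 == 0:
        raise ValueError("both groups must be non-empty")
    return (d1 - d0) / (d1 + d0)


def separation_index_from_scores(scores, groups):
    """Sort group labels by sign score, then compute the index."""
    ordered = [g for _, g in sorted(zip(scores, groups), key=lambda t: t[0])]
    return separation_index(ordered)


example_1 = separation_index([1, 0, 0, 1, 1])
example_2 = separation_index([0, 1, 1, 0, 0])
perfect = separation_index_from_scores([0.1, 2.0, -1.0, 3.5], [0, 1, 0, 1])
```

When all cells of group 1 have higher sign scores than all cells of group 0, as in the last call, the index reaches its maximum of 1.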
Drawbacks
Signs are derived from information in existing databases (DBs). This inevitably introduces bias problems, such as the inherent incompleteness of the DBs and annotation bias, viz. some biological terms are associated with many genes, while others with few50.
To overcome this problem, one should monitor what signs are included during data processing (Fig. 1a) and carefully tune the parameters to select reliable signs (Supplementary Fig. S2). Our R programming scripts help users perform this process (Supplementary Notes).
Parameter setting
To obtain explainable results of cell clustering in the downstream analysis of ASURAT, it is critical to tune the parameters in the sign creation step (Supplementary Fig. S2). There are six to nine parameters for creating SSMs, depending on the database used, but many of them have been preset to unbiased and sensible default values. We found that our default settings worked well in our scRNA-seq analyses, but three parameters should be tuned by users, as described below.
As formulated in (1), positive and negative constants α and β, which act as thresholds on correlation coefficients, are required for decomposing correlation graphs and creating signs (see Fig. 2 for a demonstration). In addition, unreliable signs are discarded according to user-defined criteria, which are preset as follows: a sign is discarded if the total number of genes in SCG and VCG is less than nmin, or if the number of genes in WCG is less than a second minimum (the default value is 2). Furthermore, users can remove redundant signs with similar biological meanings if information contents (ICs)51 are defined.
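The discarding criteria can be written as a simple predicate. This is only a sketch under stated assumptions: the function name and the argument names n_min and n_weak_min are hypothetical, and the threshold values of 2 used in the example are illustrative rather than a statement of ASURAT's defaults.

```python
def is_reliable_sign(n_strong, n_variable, n_weak, n_min, n_weak_min):
    """Return True if a sign passes both gene-count criteria.

    n_strong, n_variable, n_weak: numbers of genes in SCG, VCG, and WCG.
    n_min: minimum allowed total of SCG and VCG genes (hypothetical name).
    n_weak_min: minimum allowed number of WCG genes (hypothetical name).
    """
    return (n_strong + n_variable) >= n_min and n_weak >= n_weak_min


# Illustrative gene counts (SCG, VCG, WCG) for three hypothetical signs.
signs = {"sign_A": (3, 1, 5), "sign_B": (1, 0, 4), "sign_C": (2, 2, 1)}
kept = [name for name, counts in signs.items()
        if is_reliable_sign(*counts, n_min=2, n_weak_min=2)]
print(kept)
```

Raising either threshold keeps fewer but better-supported signs, which is the trade-off users are expected to monitor during parameter tuning.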
Comparison of clustering validity indices of ASURAT with existing methods
To benchmark the clustering qualities of existing methods and ASURAT, we prepared six cancer patient- and healthy donor-derived single-cell RNA-seq datasets. Subsequently, careful quality control and normalization by bayNorm were performed for each dataset. However, ASURAT detected 22 additional non-negligible outliers in sc68_vehi, which led to an unusually large average silhouette width (ASW; much greater than 0.9). Hence, those cells were removed from sc68_vehi, yielding a read count table containing 6581 genes and 3901 cells (Chapter 14.2, Supplementary Notes). Note that such additional preprocessing was undertaken only for the comparison of ASWs.
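For reference, the ASW can be computed as in the following minimal sketch. It uses the standard silhouette definition with Euclidean distance and is not necessarily the exact routine used by the packages in this comparison; the function name and the toy coordinates are hypothetical.

```python
import math


def average_silhouette_width(points, labels):
    """Mean silhouette width over all points (Euclidean distance)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    n = len(points)
    widths = []
    for i in range(n):
        # Mean distance from point i to the members of each cluster,
        # excluding point i itself.
        mean_dist = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                total = sum(math.dist(points[i], points[j]) for j in members)
                mean_dist[c] = total / len(members)
        own = labels[i]
        if own not in mean_dist:  # singleton cluster: width defined as 0
            widths.append(0.0)
            continue
        a = mean_dist[own]                                    # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)  # separation
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / n


# Two compact, well-separated groups in a two-dimensional embedding.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = average_silhouette_width(points, [0, 0, 1, 1])
bad = average_silhouette_width(points, [0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```

In the toy example the correct partition gives an ASW close to 0.9, whereas the mismatched partition gives a negative value. A few cells lying far from all others can similarly dominate the distances and push the ASW towards 1, which is why outliers matter for this comparison.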
Using Seurat version 4.0.1 (ref. 26), we normalized the data by log transform with a pseudo-count of 1 (default), selected variable genes based on variance stabilizing transformation with a gene-per-cell ratio of 0.2 (as suggested in previous work52), scaled and centered gene expression levels, and performed PCA. The principal components that explain 90% of the total variability were used for the computations of Uniform Manifold Approximation and Projection (UMAP)31 and t-distributed stochastic neighbor embedding (t-SNE)33, and the resulting two-dimensional cell states were supplied to NbClust32 (Chapter 14.3.1, Supplementary Notes).
Using Monocle 3 version 0.2.3.0 (ref. 27), we ran the R function preprocess_cds() in the Monocle 3 package with the default settings, under which the data were normalized by log transform with a pseudo-count of 1, gene expression levels were scaled and centered, and PCA was performed with a reduced-space dimensionality of 50. The results were used for the computations of UMAP and t-SNE, and the resulting two-dimensional cell states were supplied to NbClust (Chapter 14.3.2, Supplementary Notes).
Using SC3 version 1.18.0 (ref. 28), we normalized the data by log transform with a pseudo-count of 1 (default), performed PCA, and ran the R function sc3() in the SC3 package with the arguments ks = 2:7 and biology = TRUE. This function automatically computed a consensus matrix for each number of clusters and output the ASW based on the hierarchical clustering of the consensus matrix (Chapter 14.3.3, Supplementary Notes). However, sc3() stopped processing and reported errors for sc68_vehi and sc68_cisp irrespective of the arguments.
For PCA-based clustering, we normalized the data by log transform with a pseudo-count of 1 and ran prcomp() in the R stats package. The principal components that explain 90% of the total variability were used for the computations of UMAP and t-SNE, and the resulting two-dimensional cell states were supplied to NbClust (Chapter 14.3.4, Supplementary Notes).
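The selection of principal components that explain 90% of the total variability can be sketched as follows. The sketch assumes that the standard deviations of all principal components are available (for example, the sdev component returned by prcomp()) and that "total variability" means the sum of their squared values; the function name and the example values are hypothetical.

```python
def n_components_for_variance(sdev, target=0.9):
    """Smallest number of leading components whose cumulative variance
    reaches `target` as a fraction of the total variance.

    sdev: standard deviations of the principal components, in decreasing order.
    """
    variances = [s * s for s in sdev]
    total = sum(variances)
    cumulative = 0.0
    for k, variance in enumerate(variances, start=1):
        cumulative += variance
        if cumulative / total >= target:
            return k
    return len(variances)


# Variances 9, 4, 1, 1: the first three components explain 14/15 of the total.
print(n_components_for_variance([3.0, 2.0, 1.0, 1.0]))  # 3
```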
Databases were downloaded in December 2020 and verified for human and mouse scRNA-seq datasets. Using ASURAT, we normalized the data by log transform with a pseudo-count of 1, scaled and centered gene expressions, and created SSMs based on Disease Ontology (DO) for sc68_vehi and sc68_cisp, Gene Ontology (GO) for day1_norm, day7_hypo, and pbmc_6000, and Cell Ontology (CO) for pbmc_4000. These SSMs were used for the computations of UMAP and t-SNE without preprocessing by PCA, and the resulting two-dimensional cell states were supplied to NbClust (Chapter 14.3.5, Supplementary Notes).
Cell-type inference of PBMC datasets by existing methods and ASURAT
To benchmark the abilities of cell-type inference of existing methods and ASURAT, we prepared the normalized read count tables of pbmc_4000 and pbmc_6000 in the same manner as described in the previous section. Using the R functions FindClusters() and FindAllMarkers() in Seurat, cluster_cells() and top_markers() in Monocle 3, and sc3_plot_markers() in SC3, we identified several different cell types by manually searching marker genes in GeneCards version 5.2 (ref. 53) (Chapter 14.4, Supplementary Notes). Seurat identified T cells (marker genes CD3D, CD3E, IL32, TRAC), monocytes (S100A8, LYZ, CD14), B cells (CD79A, MS4A1, IGHM, VPREB3, BANK1), and NK/NKT cells (NKG7, CD160, KLRF1, GZMA, GZMB, FGFBP2, GNLY). Monocle 3 identified T cells (CD3D, CD3E, CD27, IL32, TRAC, TCF7), monocytes (S100A8, LYZ, CD14), B cells (CD79A, CD79B, MS4A1, IGHM, VPREB3, BANK1), and NK/NKT cells (NKG7, GNLY, CD160, GZMA, FGFBP2). SC3 identified B cells (CD79A, MS4A1) and NK/NKT cells (TPD52L2, GZMA, GZMB, GZMH, GZMK).
Using ASURAT, we created SSMs based on CO, GO, and Kyoto Encyclopedia of Genes and Genomes (KEGG), clustered the cells by k-nearest neighbor (KNN) graph generation and Louvain algorithm using Seurat’s functions26 after dimension reduction by PCA, analyzed the separation index (3) of each sign score for each cluster, found the signs upregulated in specific clusters, and inferred the cell types (Supplementary Figs. S3 and S4; Chapter 14.4.4, Supplementary Notes): T cells (respectively marker genes CD3D, CD3E, CD247, PTPRC, IL7R, etc.), monocytes (MEF2C, LYN, CCL3, CD14, FGR, etc.), B cells (CD19, CD72, CD79B, BTK, DAPP1, etc.), NK/NKT cells (SH2D1A, KLRD1, NCR3, GZMB, CD160, FGR, ITGB2, FCGR3A, etc.), and dendritic cells (HLA-DOB, CCR7, CD2, FCGR2B, BLK, etc.).
Cell-type inference of cervical cancer datasets by ASURAT
To validate ASURAT’s reliable cell-type inference, the normalized read count tables of day1_norm and day7_hypo were prepared in the same manner as described in the previous section. Previous work studying human cervical cancers using CTOS methods indicated that some small cell neuroendocrine carcinomas (SCNCs) exhibited combined phenotypes with other non-SCNC cells41. Additionally, hypoxia drove divergent differentiation of SCNCs, but detailed molecular information remained to be elucidated. Using ASURAT, we created SSMs based on DO, GO, and KEGG, and clustered the cells by one of the following: (i) PCA, followed by KNN graph generation and Louvain algorithm using Seurat’s functions26 and (ii) diffusion map generation, followed by allocation of cells to the different branches of the data manifold by using MERLoT24. Here, cells in day1_norm were clustered by (i), while those in day7_hypo were clustered by (i) and (ii) for SSM using DO and GO, respectively (Supplementary Figs. S5 and S6).
Code availability
An open-source implementation of ASURAT is available on GitHub (https://github.com/keita-iida/ASURAT) under the GPLv3 license. All the input and output files used in the present paper and user-friendly documentation written in R bookdown can be downloaded from the above URL.
Supplementary Notes
Supplementary Notes are written in separate files, which are structured as follows:
Chapter 1. Overview of ASURAT.
Chapter 2. Preparing data sets.
Chapter 3. Data quality control (QC).
Chapter 4. Normalizing and centering data.
Chapter 5. Computing correlations among genes.
Chapter 6. Checking expression profiles of marker genes.
Chapter 7. Collecting databases (optional).
Chapter 8. ASURAT using Disease Ontology database (optional).
Chapter 9. ASURAT using Cell Ontology database (optional).
Chapter 10. ASURAT using Gene Ontology database (optional).
Chapter 11. ASURAT using KEGG (optional).
Chapter 12. ASURAT using Reactome (optional).
Chapter 13. Multiple sign analysis by concatenating DO, CO, GO, KEGG, and Reactome.
Chapter 14. Appendix A: comparing performances of ASURAT and existing methods.
Chapter 15. Appendix B: automatically tuning ASURAT’s parameters.
Supplementary Files
Supplementary Files are prepared in separate files, which are structured as follows:
Supplementary File 6. ASURAT’s R function files (SupplementaryFile_006_R_files.zip).
Supplementary Figures
Acknowledgements
We thank Takeya Kasukawa and Johannes Nicolaus Wibisana for comments that greatly improved the analysis pipeline. K.I. was supported by JSPS KAKENHI Grant No. 20K14361. K.I., J.K., and M.I. were supported by Shin Bunya Kaitaku Shien Program of Institute for Protein Research, Osaka University. M.O. was supported by JSPS KAKENHI Grant No. 17H06299, 17H06302, and 18H04031, and JST-Mirai program No. JPMJMI19G7. M.O. and M.I. were supported by P-CREATE, Japan Agency for Medical Research and Development.