Functional annotation-driven unsupervised clustering of single-cell transcriptomes

Keita Iida; Jumpei Kondo; Masahiro Inoue; Mariko Okada

doi:10.1101/2021.06.09.447731

Abstract

Single-cell RNA sequencing (scRNA-seq) analysis has significantly advanced our knowledge of functional states of cells. By analyzing scRNA-seq data, we can deconvolve individual cell states into thousands of gene expression profiles, allowing us to perform cell clustering and identify significant genes for each cluster. However, interpreting these results remains challenging. Here, we present a novel scRNA-seq analysis pipeline named ASURAT, which simultaneously performs unsupervised cell clustering and biological interpretation in a semi-automatic manner, in terms of cell type and various biological functions. We validated the reliable clustering performance of ASURAT by comparing it with existing methods, using six published scRNA-seq datasets from healthy donors and cancer patients. Furthermore, we applied ASURAT to patient-derived scRNA-seq datasets including small cell lung cancers, finding some putative cancer subpopulations showing different resistance mechanisms. ASURAT is expected to open new means of scRNA-seq analysis, focusing more on “biological meaning” than conventional gene-based analyses.

Introduction

Single-cell RNA sequencing (scRNA-seq) has profoundly advanced our knowledge of cells, owing to its immense potential for discovering the transcriptional principles governing cell fates at the single-cell level¹. scRNA-seq has been widely used to improve understanding of individual cells², intra- and intertumoral heterogeneity³, cell-to-cell interaction⁴, tumorigenesis⁵, drug resistance^3,6, and the effects of viral infection on immune cell populations⁷. Various clustering methods, wherein cells are partitioned according to transcriptome-wide similarity, have been proposed⁸ and applied to cell type annotation⁹. However, interpreting single-cell data remains challenging^10–13.

Conventionally, cell types are inferred using unsupervised clustering followed by a manual literature search of differentially expressed marker genes¹³. Currently, several computational tools, such as Garnett¹⁴ and SCSA¹², are available to assist manual annotation, as detailed in the review by Pasquini et al.⁸. However, this process is often difficult because marker genes are generally expressed in multiple cell types¹⁵. In cancer transcriptomics, this difficulty is exacerbated by the interdependence between disease-related genes and numerous biological terms; furthermore, expression levels of marker genes can be heterogeneous depending on cancer microenvironments¹⁶.

A possible solution is to realize cell clustering and biological interpretation at the same time. Recently, reference-based analysis has been applied in single-cell transcriptomics^10,12,17. One such technique is reference component analysis (RCA), which is used for accurate clustering of single-cell transcriptomes along with cell-type annotation based on similarity to reference transcriptome panels¹⁷. However, these methods require well-characterized transcriptomes with purified cells, which may be difficult to apply to ambiguous phenotypes. Another approach is using supervised classification¹¹ combined with gene set enrichment analysis, incorporating biological knowledge such as pathway activity; hence, it may improve the interpretability over signature gene-based approaches, which place sole emphasis on individual roles of genes. However, we still lack a prevailing theory leveraging this information at the single-cell level.

To overcome the aforementioned limitations, a novel theoretical tool providing biological interpretations to computational results is needed. Thus, we propose a scRNA-seq analysis pipeline for simultaneous cell clustering and biological interpretation, named ASURAT. Here, “interpretation” is given by multiple biological terms such as cell type, biological process, pathway activity, chemical reaction, and various biological functions. By using ASURAT, users can create desired sets of biological terms and the corresponding spectrum matrices, which can be supplied to the subsequent unsupervised cell clusterings. In this paper, we first demonstrate the reliable clustering performance of ASURAT based on comparison with existing methods, using six published scRNA-seq datasets of healthy donors and cancer patients. Next, we applied ASURAT to single-cell lung cancer transcriptomes, which include malignant cancer types expressing neuroendocrine markers³. We show that ASURAT can greatly improve functional understandings of various cell types, which may contribute to clinical improvements.

Results

Overview of ASURAT

ASURAT was developed for simultaneous clustering and biological interpretation of single-cell transcriptomes, and is implemented as R scripts (Supplementary Notes, Supplementary File 6). After inputting scRNA-seq data and knowledge-based databases (DBs), ASURAT creates lists of biological terms with respect to cell type and biological functions, which we termed signs. Then, ASURAT creates a functional spectrum matrix, termed a sign-by-sample matrix (SSM). By analyzing SSMs, users can cluster samples to aid their interpretation. We explain the workflow below (Fig. 1). The details of ASURAT’s formulations can be found in the Methods section.

Fig. 1

Workflow of ASURAT. (a) Flowchart of the procedures, (b) collection of knowledge-based databases (DBs), (c) creation of sign-by-sample matrices (SSMs) from normalized and centered read count matrix and the collected DBs, and (d) analysis of SSMs to infer cell types and biological functions.

Workflow of ASURAT

In preparation, we collected DBs for Disease Ontology (DO)¹⁸, Cell Ontology (CO)¹⁹, Gene Ontology (GO)²⁰, Kyoto Encyclopedia of Genes and Genomes (KEGG)²¹, and Reactome²² using the R packages DOSE (version 3.16.0), ontoProc (version 1.12.0), clusterProfiler (version 3.18.0), KEGGREST (version 1.30.0), and reactome.db (version 1.74.0), respectively (Chapter 7, Supplementary Notes). Any DBs including corresponding tables between biological descriptions and genes can be input to ASURAT (Fig. 1b). Additionally, ASURAT computes a correlation matrix using Pearson or Spearman correlation coefficients from a normalized read count matrix of scRNA-seq data.
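The correlation step can be illustrated with a short sketch. This is not ASURAT's R implementation; it is a minimal standard-library Python example, and the function names (`pearson`, `ranks`, `correlation_matrix`) are hypothetical. It assumes the input is a gene-by-cell matrix given as a list of rows and that no gene has zero variance across cells.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    # Assumption: neither sequence is constant, so the denominator is non-zero.
    return sxy / sqrt(sxx * syy)


def ranks(x):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def correlation_matrix(expr, method="pearson"):
    """Gene-by-gene correlation matrix from a gene-by-cell matrix.

    Spearman correlation is computed as Pearson correlation of the ranks.
    Diagonal elements are set to 1.
    """
    rows = [ranks(g) for g in expr] if method == "spearman" else expr
    p = len(rows)
    return [
        [1.0 if i == j else pearson(rows[i], rows[j]) for j in range(p)]
        for i in range(p)
    ]
```

Spearman correlation is less sensitive to outlying expression values than Pearson correlation, which is one reason a user might prefer it for sparse count data.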

The first step is to create signs by inputting a normalized-and-centered read count matrix and knowledge-based DB. From a gene set Ω and correlation matrix R defined for each biological description T in DBs, ASURAT decomposes the correlation graph into several parts. Here, a triplet of biological description, gene subset, and correlation matrix is termed a sign, in particular (T, Ω, R) a parent sign. In many applications, high correlations are expected to have rich information. Hence, we decompose Ω into the following three categories (Fig. 2): (i) a strongly correlated gene set (SCG), which is a set of genes with strong positive correlations with each other; (ii) variably correlated gene set (VCG), which is a set of genes with strong negative correlations with genes in SCG; and (iii) weakly correlated gene set (WCG), which is a set of genes with weak correlations with each other.

Fig. 2

An example showing decomposition of a correlation graph, which produces three signs based on a Disease Ontology (DO) term. From single-cell RNA sequencing data and a DO term with DOID 5409, which concerns small cell lung cancer, three signs (T, Ω⁽ⁱ⁾, R), i ∈ {s, v, w}, were produced from their parent sign (T, Ω, R) by decomposing the correlation graph (Ω, R) into strongly, variably, and weakly correlated gene sets, Ω^(s), Ω^(v), and Ω^(w), respectively. Red and blue edges in correlation graphs indicate positive and negative correlations, respectively, and color density indicates the strength of the correlation.

Next, ASURAT creates an SSM for SCG by weighted averaging of normalized and centered gene set expression levels of SCGs and WCGs. Similarly, an SSM for VCG is created from VCGs and WCGs. Then, by vertically concatenating SSMs for SCG and VCG, we create a single SSM. The rows and columns of an SSM stand for signs and samples (or cells), respectively, and entries stand for cell-type or functional spectra, termed sign scores. A remarkable benefit is that users can create multiple SSMs as necessary by inputting various DBs (Fig. 1c).

The final step is to characterize samples using SSMs to produce a conclusion. One focus of analyzing SSMs is to cluster samples and find significant signs (Fig. 1d), where “significant” means that the sign score is specifically upregulated or downregulated at the cluster level (cf. separation index). In ASURAT, we use two strategies: one uses unsupervised clusterings, such as Partitioning Around Medoids (PAM), hierarchical-based, and graph-based clusterings with and without principal component analysis (PCA); while the other is a method of extracting a continuous tree-like topology using diffusion map²³, followed by allocating samples to different branches of the data manifolds²⁴. Choosing an appropriate strategy depends on the biological context, but the latter is usually applied for developmental processes or time-course experimental data, which are often followed by pseudotime analyses.

Comparison of performance of ASURAT with existing methods

Many unsupervised clustering methods have been proposed and their performances quantified using datasets with independently identified phenotypes. However, it remains unclear whether these methods robustly demonstrate better performance using cancer single-cell transcriptomes including ambiguous phenotypes. Conventional marker gene-based approaches may misrepresent cluster accuracy¹⁷, and simple application of PCA may be ineffective. However, when using ASURAT, users can obtain robust and explainable clustering results, since SSMs can be created from as many DBs as needed and supplied to the subsequent unsupervised clusterings.

To validate the reliable clustering performance of ASURAT, we obtained six published scRNA-seq datasets derived from healthy donors (PBMC datasets: pbmc_4000 and pbmc_6000), cervical cancer patients (day1_norm and day7_hypo), and lung cancer patients (sc68_vehi and sc68_cisp). From all datasets, we excluded genes and cells with low qualities and attenuated technical biases with respect to zero-inflation and variation of capture efficiencies between cells using bayNorm²⁵. The resulting read count tables were supplied to ASURAT and four other methods: Seurat (version 4.0.1)²⁶, Monocle 3 (version 0.2.3.0)²⁷, SC3 (version 1.18.0)²⁸, and PCA using prcomp() from the R stats package (version 4.0.4).

There are five blood cell types in the PBMC datasets¹², which are regarded as hypothetical results. However, no consensus cell types exist, especially for cancer datasets. Hence, the clustering accuracies cannot be quantified using standard measures such as adjusted Rand index²⁹. Instead, the clustering qualities were assessed using validity indices such as average silhouette width (ASW)³⁰, a measure of how tightly cells are grouped within clusters and of the distance between clusters. To reduce computational cost, we performed two-dimensional Uniform Manifold Approximation and Projection (UMAP)³¹ after the straightforward computations of Seurat, Monocle 3, PCA, and ASURAT; the resulting two-dimensional cell states were supplied to NbClust³², and 26 validity indices were obtained (Supplementary Files). From SC3, we obtained only ASWs computed from consensus matrices and hierarchical clusterings. We hypothesized that clustering quality positively correlates with clustering accuracy, while considering that they do not guarantee interpretability. Additionally, other topology-based clustering methods were not used for computing ASWs.
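For readers unfamiliar with the ASW, the following minimal sketch shows how it can be computed from low-dimensional cell coordinates and cluster labels. It is an illustration only, not the NbClust or SC3 implementation used in this study; the function name `average_silhouette_width` is hypothetical, Euclidean distance is assumed, at least two clusters are required, and singleton clusters are given a width of 0 by convention.

```python
from math import dist


def average_silhouette_width(points, labels):
    """Mean silhouette width over all points (Euclidean distance).

    points: sequence of coordinate tuples (e.g., two-dimensional embeddings).
    labels: cluster label for each point; at least two distinct labels.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    widths = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        # Assumption: a and b are not both zero (no duplicated points across clusters).
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)
```

Values close to 1 indicate compact, well-separated clusters, whereas values near or below 0 indicate overlapping or misassigned cells.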

For PBMC datasets with known numbers of clusters of existing cell types, we compared ASWs across all the methods within such numbers ±1 (shaded area in Fig. 3a). For other datasets, we focused on the ranges of the number of clusters, wherein at least one method provides ASWs ≥0.6. Interestingly, the best-performing method, exhibiting the greatest ASW, was different across the datasets (Fig. 3a). Seurat performed best when the number of clusters k = 4 in pbmc_6000. Although SC3 outperformed at a different k in day7_hypo and PBMC datasets, it could not detect >1 cluster in sc68_vehi and sc68_cisp. Compared with other methods, only the naïve usage of PCA was unremarkable across most datasets.

Fig. 3

ASURAT outperforms existing methods for robust, high-quality, and reliable clustering of various single-cell transcriptomes. (a) Average silhouette widths (ASWs) versus the number of clusters (k), computed by two-dimensional Uniform Manifold Approximation and Projection (UMAP) and k-means clustering for Seurat, Monocle 3, PCA, and ASURAT, while they were computed by consensus matrix-based hierarchical clustering for SC3. The dashed line on the graph represents ASW = 0.6 and the shaded area the hypothetical result. (b) Comparison of UMAP plots between different methods using various datasets. The input databases for ASURAT are indicated in parentheses. (c) Visualizations of the cell types on UMAP plots for pbmc_4000, which was reanalyzed using the inherent algorithms of Monocle 3 and ASURAT. (d) Population ratios in the peripheral blood mononuclear cell (PBMC) datasets, predicted by five different methods.

Notably, ASURAT outperformed existing methods at ≥1 k in every dataset, with one exception in sc68_cisp (Fig. 3a). Moreover, those ASWs were >0.5 without exception and >0.6 with only one exception (viz. sc68_cisp). The existing methods presented both strengths and weaknesses depending on the datasets. Seurat exhibited better performances with PBMC datasets, while it performed less remarkably with most cancer datasets. Although we carefully tuned Seurat’s parameters by changing the normalization method, variable gene-per-cell ratio, and the number of principal components, we could not obtain well-separated clusters for day1_norm and day7_hypo (Fig. 3b). In contrast, Monocle 3 generally exhibited better performances on cancer datasets while performing less remarkably with PBMC datasets. We found that Monocle 3’s clustering performance was unstable and strongly depended upon dimension reduction techniques.

To confirm whether ASURAT outperforms existing methods using other low-dimensional representation techniques, we replaced UMAP with t-distributed stochastic neighbor embedding (t-SNE)³³ and supplied the resulting two-dimensional cell states to NbClust³². Again, we confirmed that ASURAT generated well-separated clusters with relatively greater ASWs across datasets, while Monocle 3 broke down when used with some datasets (Supplementary Fig. S1). These results indicate that cells are better characterized in the high-dimensional sign score space than in the gene expression space.

Finally, to validate ASURAT’s cell-type inference, we reanalyzed PBMC datasets using Seurat, Monocle 3, SC3, and ASURAT under almost default settings. Consequently, Seurat and Monocle 3 could reproduce most blood cell type labels (Figs. 3c and d), as inferred by Cao et al.¹², but a few dozen cells remained unspecified. Although SC3 provided the greatest ASWs at k = 4 and 6 in pbmc_4000 and pbmc_6000, respectively, it reproduced only B cell and NK or NKT cell labels. However, ASURAT identified five cell types, with none remaining unspecified (Supplementary Figs. S3 and S4). The subpopulation ratios were approximately consistent with the reported values, except for the tiny megakaryocyte subpopulation. Such a small discrepancy was unavoidable, because Cao et al. used only differentially expressed genes and preselected cell types to identify the most preferable cell types. Furthermore, we reanalyzed cervical cancer datasets using ASURAT and found several putative populations of small cell neuroendocrine carcinoma and adenocarcinoma (Supplementary Figs. S5 and S6). These results demonstrate that ASURAT can perform robust, high-quality, and reliable clusterings using various single-cell transcriptomes.

Identifying chemoresistant cells in lung cancer scRNA-seq datasets

Previous work³ indicated that small cell lung cancer (SCLC) tumors undergo a shift from chemosensitivity to chemoresistance against platinum-based therapy. However, the exact mechanism behind chemoresistance is still unclear, because transcriptional heterogeneity is often concealed in hidden biological states, which cannot be readily identified by conventional marker gene-based analyses. To investigate the cancer subtypes in the chemosensitive and chemoresistant tumors, we applied ASURAT to the scRNA-seq data of circulating tumor cell-derived xenografts from the vehicle (sc68_vehi) and cisplatin (sc68_cisp) treatment groups.

Given the normalized and centered read count matrices, we created SSMs using DO and GO DBs, and KEGG for both sc68_vehi and sc68_cisp. We then visualized the sign scores in heat maps (Figs. 4a and 5a). The cells were clustered by one of the following: (i) PCA, followed by k-nearest neighbor (KNN) graph generation and Louvain algorithm using Seurat’s functions²⁶ and (ii) diffusion map generation, followed by allocation of cells to the different branches of the data manifold using MERLoT²⁴. Here, cells in sc68_vehi were clustered by (i), while those in sc68_cisp were clustered by (ii), providing the most explainable results.

Fig. 4

Identification of the putative cell types in sc68_vehi by ASURAT. (a) Heat maps showing the sign scores of sign-by-sample matrices (SSMs) for Disease Ontology (DO), Gene Ontology (GO), and Kyoto Encyclopedia of Genes and Genomes (KEGG), which are concatenated vertically. (b) The t-distributed stochastic neighbor embedding (t-SNE) plots of the SSM for GO, showing cell clustering and sign scores for the indicated sign IDs. (c) Violin plots showing the distributions of sign scores for the indicated sign IDs. Each plot represents the separation index for the given group versus all other cells.

Fig. 5

Identification of the putative cell types in sc68_cisp by ASURAT. (a) Heat maps showing the sign scores of sign-by-sample matrices (SSMs) for Disease Ontology (DO), Gene Ontology (GO), and Kyoto Encyclopedia of Genes and Genomes (KEGG), which are concatenated vertically. (b) Diffusion map of the SSM for DO projected onto the first three coordinates, showing cell clustering and sign scores for the indicated sign IDs. (c) Sign scores for the indicated sign IDs plotted along the pseudotime, with standard deviations shown as the shaded area. Each plot represents the separation index for the given group versus all other cells.

We visualized the t-SNE plot of SSM using GO for sc68_vehi, wherein cell clustering labels and SCLC-related sign scores are overlaid (Fig. 4b). Sign IDs and the related genes are represented by, for example, DOID:5409_S (ASCL1, etc.) and DOID:5409_V (MKI67, BIRC5, etc.), where the suffixes “S” and “V” indicate SCG and VCG, respectively. Since ASCL1, MKI67, and BIRC5 are important for neuronal differentiation³⁴, malignancy³⁵, and inhibition of apoptosis³⁶, respectively, DOID:5409_S and DOID:5409_V represent SCLC differentiation and proliferation with cell survival, respectively. We found at least two existing subpopulations of SCLC in sc68_vehi. This was further confirmed by violin plots for the related signs (Fig. 4c). Remarkably, sign scores for platinum drug resistance were specifically upregulated in the group with label 2 (GO: BP). The population ratios of groups 1 and 2 were 0.84 and 0.15, respectively. Consequently, we found that the SCLCs not receiving cisplatin treatment contained ≤15% putative chemoresistant cells, which were not reported in the original study³.

Likewise, we visualized the diffusion map of SSM with DO for sc68_cisp. We observed a tree-like topology in the data manifold, representing a putative cell differentiation lineage (Fig. 5b). We defined a pseudotime t ∈ [0, 1] (i.e., an arc-length parameter) along the branches using MERLoT²⁴; a starting point t = 0 was set at the end of the branch with label 1. From the pseudotime course analysis, we found at least three SCLC subpopulations (Fig. 5c). Strikingly, sign scores for different resistance mechanisms, such as platinum drug resistance and PD-L1 expression mediating immunosuppression, were upregulated in groups labeled 2 and 3 (DO: disease), while sign scores for intracellular protein transport with an SCLC malignancy marker CD24³⁷ were upregulated in the group labeled 1 (DO: disease), suggesting the recalcitrant malignancy of relapsed SCLCs against cisplatin treatments. The population ratios of groups 1, 2, and 3 were 0.39, 0.30, and 0.30, respectively. Consequently, we found 30% putative chemoresistant SCLCs and another 30% with other possible resistant cell types expressing PD-L1, while others did not exhibit these resistance mechanisms. Our results support the finding that transcriptional heterogeneity increases in chemoresistant SCLC tumors³.

The most time-consuming step in our workflow is finalizing the set of signs by tuning ASURAT’s parameters through trial and error, which is critical for downstream analyses. Here, users may face difficulty in prioritizing the importance of several signs. For sc68_cisp, we found that the sign scores for meningioma, myopathy, malignant pleural mesothelioma, and other diseases were also upregulated in the group labeled 2, but their actual relationships to the patient’s disease were unknown. Nevertheless, ASURAT helped us find well-structured data manifolds and characterize cells in biologically explainable manners for cell types, biological processes, and signaling pathways.

Discussion

We developed a novel scRNA-seq analysis pipeline for simultaneous cell clustering and biological interpretation, allowing users to create systems of cell-type and functional spectra as necessary by inputting collected databases. The resulting matrices can be supplied to unsupervised clustering without gene preselection. We analyzed cancer patient- and healthy donor-derived scRNA-seq datasets: the former to uncover the unknown characteristics of small cell neuroendocrine cancers, and the latter to confirm cell-type inference by reproducing results inferred in previous studies.

First, we demonstrated ASURAT’s superiority to existing methods with respect to robust, high-quality, and reliable clustering using these datasets (Fig. 3). ASURAT yielded well-separated cell clusters from most transcriptomes, despite the dimension reduction processing, while other conventional methods occasionally failed, demonstrating cells were better characterized in the high-dimensional sign score space than in the gene expression space. In practice, we recommend using signature gene-based tools such as Seurat before using ASURAT to broadly understand the transcriptome. Unlike reference-based analyses^10,12,17, ASURAT does not require any bespoke reference but instead takes input from knowledge-based databases.

Next, we found the putative cancer subpopulations existing in the chemosensitive and chemoresistant tumors of SCLC. We found that sc68_vehi (vehicle treatment) contained ≤ 15% possible platinum-resistant cells (Fig. 4c), suggesting this chemoresistant mechanism latently existed before the therapy. Moreover, we found that sc68_cisp (cisplatin treatment) contained 30% platinum-resistant cells with the same ratio of cells exhibiting PD-L1 expression (Fig. 5c).

Notably, we demonstrated that simultaneous cell clustering and biological interpretation of single-cell transcriptomes was viable (Fig. 1). The formulation of correlation-based decomposition of signature gene sets was critical for ASURAT’s performance (Fig. 2). Additionally, we searched virtually the whole parameter space to obtain the desired interpretation results. Thus, our strategy may greatly improve functional understandings of cancer subpopulations, intracellular heterogeneity, and cellular processes.

However, some limitations are worth noting. Although small cell neuroendocrine cancers have been studied extensively for human tumors by bulk sample RNA-seq analyses³⁴, few publications address scRNA-seq experiments for such rare cancer subtypes. As available scRNA-seq data and knowledge-based databases expand in size and diversity, our theoretical framework for ASURAT should be generalized to prioritize biological terms more efficiently than manual screening. Furthermore, integrating systems of signs across various conditions should be addressed. One means is applying canonical correlation analysis, which has been incorporated in Seurat^26,38. Nevertheless, extracting common systems of “biological meanings” across multiple conditions, different cell types, and possibly different species remains challenging.

We also expect ASURAT to improve scRNA-seq data-driven mathematical modeling for patient classification³⁹, which includes parameter estimation for dynamical systems of gene regulatory networks. Since ASURAT detects significant biological functions (e.g., biological process, pathway activity, and chemical reaction) for cell clustering, one can obtain promising candidates for a core regulatory network, which may greatly reduce the number of parameters. Another interesting approach to this problem is using ASURAT to construct sign networks, which may be analyzed by nonparametric Markov random field theory⁴⁰. We expect ASURAT to open new ways of scRNA-seq analysis from a “biological meaning” perspective beyond conventional gene-based analyses.

Author contributions

M.O. and M.I. started the project. K.I. conceived the theory of ASURAT. K.I. developed the analysis pipeline. J.K. and M.I. prepared the cervical cancer samples and obtained the single-cell RNA sequencing data. M.I. and J.K. translated the computational results. K.I., J.K., and M.O. wrote the manuscript. M.O. supervised the work.

Conflict of interest

The authors declare no conflict of interest.

Supplementary materials

Notes: Documentation (R bookdown files) showing the commands and outputs for all the analyses in the present paper, as well as an introduction to ASURAT, is available on GitHub (https://github.com/keita-iida/ASURAT).

Methods

Datasets and data processing

Human lung cancer datasets

These data were obtained from circulating tumor cell-derived xenografts cultured with vehicle (symbolized by sc68_vehi) and cisplatin (sc68_cisp) treatments, which were generated from lung cancer patients³. The data were produced with the 10x protocol using unique molecular identifiers (UMIs) (https://support.10xgenomics.com/single-cell-gene-expression/library-prep/doc/user-guide-chromium-single-cell-3-reagent-kits-user-guide-v2-chemistry). The SRA files were downloaded from Gene Expression Omnibus (GEO) with the accession code GSE138474: GSM4104164 and GSM4104165, which are referenced in Stewart et al.³. SRA Toolkit version 2.10.8 was used to dump the FASTQ files. Cell Ranger version 3.1.0 was used to align the FASTQ files to the GRCh38-3.0.0 human reference genome and produce the single-cell transcriptome datasets. After quality controls, the read count matrices of sc68_vehi (resp. sc68_cisp) contained 6581 (resp. 6347) genes and 3923 (resp. 2285) cells.

Human cervical cancer datasets

These data were obtained from cancer tissue originated spheroids (CTOS line cerv21) including small cell neuroendocrine carcinoma, cultured for 1 d under normoxic conditions (symbolized by day1_norm) and 7 d hypoxic conditions (day7_hypo), which were generated from cervical cancer patients⁴¹. The data were produced by the Nx1-seq protocol using UMIs. The FASTQ files were downloaded from the DNA Data Bank of Japan (DDBJ) with accession codes DRA007915: DRX155817 and DRX155818. The Nx1-seq data were aligned and annotated as described previously⁴². Briefly, the barcode sequences were extracted from the read 1 FASTQ files. The read 2 FASTQ files, which included each cell mRNA, were directly aligned to Refseq transcript sequences (ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot) using bowtie 2.2.6⁴³. The aligned reads were linked to their paired extracted barcode sequences. By counting mapped reads per barcode, the gene count data in individual cells were obtained. After quality controls, the read count matrices of day1_norm (resp. day7_hypo) contained 5272 (resp. 6213) genes and 3663 (resp. 1947) cells.

Human peripheral blood mononuclear cell datasets

These datasets were obtained from peripheral blood mononuclear cells (PBMCs) of healthy donors, which include approximately 4000 (symbolized by pbmc_4000) and 6000 (pbmc_6000) cells. The data were produced with a 10x protocol using UMIs. The single-cell transcriptome datasets were downloaded from 10x Genomics repository (https://support.10xgenomics.com/single-cell-gene-expression/datasets). The following filtered read count matrices were obtained: 4000 PBMCs from a healthy donor (https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/pbmc4k) and 6000 PBMCs from a healthy donor (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc6k). After quality controls, the read count matrices of pbmc_4000 (resp. pbmc_6000) contained 6658 (resp. 5169) genes and 3815 (resp. 4878) cells.

Data preprocessing: quality control, normalization, and centering

For all the single-cell RNA sequencing (scRNA-seq) data, the genes and cells with low qualities were removed by the following three steps: (i) removing the genes for which the number of non-zero expressing cells is less than a user-defined threshold; (ii) removing the cells whose read counts, number of genes expressed with non-zero read counts, or percent of reads mapped to mitochondrial genes fall outside user-defined ranges; and (iii) removing the genes for which the mean of read counts is less than a user-defined threshold (Chapter 3, Supplementary Notes).
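The three filtering steps can be sketched as follows. This is a hedged illustration, not the code in the Supplementary Notes. The function name `quality_control`, its parameters, and the `"MT-"` prefix used to recognize mitochondrial genes (a convention for human gene symbols) are assumptions. Cells are kept only if all three metrics fall inside the accepted ranges, and the mitochondrial criterion is simplified to an upper bound.

```python
def quality_control(counts, genes, min_cells, count_range, gene_range,
                    max_mito_percent, min_mean, mito_prefix="MT-"):
    """Filter a gene-by-cell count matrix (list of rows).

    Returns (filtered counts, kept gene symbols, kept cell indices).
    """
    # (i) keep genes detected (non-zero) in at least `min_cells` cells
    keep = [i for i, row in enumerate(counts)
            if sum(v > 0 for v in row) >= min_cells]
    counts = [counts[i] for i in keep]
    genes = [genes[i] for i in keep]

    # (ii) keep cells whose total counts, number of detected genes, and
    #      mitochondrial read percentage are all acceptable
    n_cells = len(counts[0]) if counts else 0
    kept_cells = []
    for j in range(n_cells):
        total = sum(row[j] for row in counts)
        detected = sum(row[j] > 0 for row in counts)
        mito = sum(row[j] for row, g in zip(counts, genes)
                   if g.startswith(mito_prefix))
        mito_percent = 100.0 * mito / total if total else 0.0
        if (count_range[0] <= total <= count_range[1]
                and gene_range[0] <= detected <= gene_range[1]
                and mito_percent <= max_mito_percent):
            kept_cells.append(j)
    counts = [[row[j] for j in kept_cells] for row in counts]

    # (iii) keep genes whose mean count over the remaining cells is high enough
    keep = [i for i, row in enumerate(counts)
            if row and sum(row) / len(row) >= min_mean]
    return [counts[i] for i in keep], [genes[i] for i in keep], kept_cells
```

Because step (iii) is applied after cell filtering, gene means are computed over the retained cells only, which follows the order of the steps listed above.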

After quality controls, the data were normalized by bayNorm²⁵, which attenuates technical biases with respect to zero-inflation and variation of capture efficiencies between cells. The resulting inferred true count matrices were supplied to a log transformation with a pseudo-count to attenuate the impact of dispersion in the counts for highly expressed genes. Finally, subtracting the sample mean from each row vector, we obtained the normalized and centered read count matrices (Chapter 4, Supplementary Notes).
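The last two steps (log transformation with a pseudo-count and gene-wise centering) can be sketched as below. The bayNorm step is not reproduced here; the input is assumed to be an already normalized gene-by-cell matrix. The function name, the natural logarithm, and the default pseudo-count of 1 are assumptions made for illustration, since the text does not specify them.

```python
from math import log


def log_transform_and_center(counts, pseudo_count=1.0):
    """Log-transform a gene-by-cell matrix and center each gene (row).

    counts: list of rows, one row per gene, one column per cell.
    Each output row has zero mean across cells.
    """
    centered = []
    for row in counts:
        logged = [log(v + pseudo_count) for v in row]
        mean = sum(logged) / len(logged)
        centered.append([v - mean for v in logged])
    return centered
```

Centering each row means that any average of rows, such as a gene set mean, also has zero mean across cells.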

Definition of sign

Let T be a biological description, Ω a variable (e.g., gene) set defined for T, and R a relation structure (e.g., correlation matrix) among Ω. Assume that Ω can be represented by a union of its subsets based on R, that is, $\Omega = \bigcup_{i} \Omega^{(i)}$. Then, the triplet (T, Ω⁽ⁱ⁾, R) is termed a sign, in particular (T, Ω, R) a parent sign.

Definition of correlated gene set

Let A = (a_i,j) be a gene-by-sample matrix of size p × n from transcriptome data, whose entries stand for normalized and centered gene expression levels, and R = (r_i,j) a correlation matrix of size p × p defined by A and a certain measure, whose diagonal elements are 1. Let α and β be positive and negative constants satisfying 0 < α ≤ 1 and −1 ≤ β < 0, respectively, and let us fix a biological description T_k and the associated gene set Ω_k = {1, 2, … , m_k}, where k = 1, 2, … , K for some K. Now, consider the following subsets of Ω_k:

Hereinafter we omit the arguments α and β for simplicity. Let us denote , where “\” means set difference. If V_k is not empty, represent each element of W_k as a point in the Euclidean space spanned by the row vectors of R and decompose W_k into two disjoint subsets by Partitioning Around Medoids (PAM) clustering⁴⁴, that is . Otherwise, if V_k is empty, let and (empty). Thus Ω_k is decomposed into three parts as follows:

Let (resp. ) be the mean of off diagonal elements of R for , and assume without loss of generality. If , then , and are strongly, variably, and weakly correlated gene sets, respectively, which are abbreviated as SCG, VCG, and WCG. Otherwise, correlated gene sets cannot be defined for T_k.

For any given (T_k, Ω_k, R), the genes should strongly and positively correlate within each of Ω_k^(s) and Ω_k^(v), while they negatively correlate between Ω_k^(s) and Ω_k^(v). Thus, we can hypothesize that SCG and VCG are predominantly associated with T_k, which may aid interpretation of the biological meanings of the corresponding signs. Fig. 2 shows that Ω^(s) and Ω^(v) include KRT18 and ASCL1, which respectively have negative and positive contributions for lung small cell carcinoma. Thus, we interpret that (T, Ω^(s), R) and (T, Ω^(v), R) relate negatively and positively with this cell type, respectively.

Though simpler methods based on decomposition of correlation graphs exist, such as one-shot PAM clustering⁴⁴, tree cutting after hierarchical clustering⁴⁵, independent component analysis (ICA)- or principal component analysis (PCA)-based methods⁴⁶, and several graph statistical approaches^47,48, we found our VCG definition is critical for providing sample clusterings in the downstream analysis. We tried replacing our decomposition method (1) with one-shot PAM clustering, but sample clusterings frequently exhibited deteriorated performance. This occurred when both VCG and WCG (obtained from the one-shot clustering) included many weakly correlated genes, which may contribute less to the parent sign.
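
As a small illustration of the relation structure R used in this definition, the following Python sketch computes a correlation matrix among gene rows. Pearson correlation is an assumption here, since the text only requires "a certain measure", and the helper name pearson_correlation_matrix is hypothetical. The sketch also assumes that no gene row is constant, because a constant row has zero variance and its correlation is undefined.

```python
import math


def pearson_correlation_matrix(rows):
    """Return the p x p Pearson correlation matrix of p gene rows."""
    centered = []
    norms = []
    for row in rows:
        mean = sum(row) / len(row)
        deviations = [value - mean for value in row]
        centered.append(deviations)
        norms.append(math.sqrt(sum(d * d for d in deviations)))

    p = len(rows)
    corr = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i == j:
                corr[i][j] = 1.0  # diagonal elements are 1 by definition
            else:
                dot = sum(a * b for a, b in zip(centered[i], centered[j]))
                corr[i][j] = dot / (norms[i] * norms[j])
    return corr


# Toy example: gene 1 rises with gene 0, gene 2 falls as gene 0 rises.
toy_rows = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
toy_corr = pearson_correlation_matrix(toy_rows)
```

In this toy matrix the first two genes would fall on the same side of a positive threshold α, while the third would be linked to them only through a negative threshold β.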

Definition of sign-by-sample matrix

Let A = (a_i,j) be a gene-by-sample matrix of size p × n from transcriptomic data, whose entries stand for normalized and centered gene expression levels, and G = {1, 2, … , p} a set representing p genes. Assume that we have q biological descriptions and the associated gene sets, denoted by T_k and Ω_k, k = 1, 2, … , q, respectively. Let us assume that Ω_k can be decomposed into non-empty Ω_k^(s), Ω_k^(v), and Ω_k^(w) for any k. Let B^(x), x ∈ {s, v, w}, be matrices of size q × n whose entries are defined as follows:

$$ b^{(x)}_{k,j} = \frac{1}{|\Omega_k^{(x)}|} \sum_{i \in \Omega_k^{(x)}} a_{i,j}, $$

where |Ω_k^(x)| stands for the number of elements in Ω_k^(x). Additionally, let C^(x), x ∈ {s, v}, be q × n matrices as follows:

$$ C^{(x)} = \omega^{(x)} B^{(x)} + (1 - \omega^{(x)}) B^{(w)}, $$

where ω^(x), 0 ≤ ω^(x) ≤ 1, are weight constants. Here C^(s) and C^(v) are said to be sign-by-sample matrices (SSMs) for SCG and VCG, respectively, and the (k, j) entry of each is termed the sign score of the kth sign and jth sample (Fig. 1c). Note that ensemble means of sign scores across samples are zeros because SSMs are derived from the centered gene expression matrix A.
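
The construction of sign scores can be sketched in Python for a single sign. The sketch takes each entry of B^(x) to be the mean expression of the genes in Ω_k^(x) for a given sample, and C^(x) to be the weighted combination ω^(x) B^(x) + (1 − ω^(x)) B^(w). These forms, the function names, the weight values, and the toy matrix are assumptions of the sketch rather than ASURAT's implementation.

```python
def mean_over_genes(matrix, gene_indices):
    """Average the rows listed in gene_indices, sample by sample."""
    n_samples = len(matrix[0])
    return [
        sum(matrix[i][j] for i in gene_indices) / len(gene_indices)
        for j in range(n_samples)
    ]


def sign_scores(matrix, strong, variable, weak, weight_s, weight_v):
    """Return the SCG and VCG score rows of one sign.

    matrix: centered gene-by-sample matrix (list of rows).
    strong, variable, weak: non-empty lists of row indices for SCG, VCG, WCG.
    weight_s, weight_v: weights in [0, 1]; the values are chosen by the caller.
    """
    b_s = mean_over_genes(matrix, strong)
    b_v = mean_over_genes(matrix, variable)
    b_w = mean_over_genes(matrix, weak)
    c_s = [weight_s * s + (1.0 - weight_s) * w for s, w in zip(b_s, b_w)]
    c_v = [weight_v * v + (1.0 - weight_v) * w for v, w in zip(b_v, b_w)]
    return c_s, c_v


# Toy centered matrix: 4 genes x 3 samples (each row sums to zero).
toy_matrix = [
    [-1.0, 0.0, 1.0],
    [-2.0, 1.0, 1.0],
    [1.0, -1.0, 0.0],
    [0.5, -0.5, 0.0],
]
scg_scores, vcg_scores = sign_scores(
    toy_matrix, strong=[0, 1], variable=[2], weak=[3], weight_s=0.5, weight_v=0.5
)
```

Because every row of the toy matrix is centered, both score rows also average to zero across samples, in line with the note on ensemble means.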

Definition of separation index

Briefly, a separation index is a measure of significance of a given sign score for a given subpopulation. Since the row vectors of SSMs are centered (i.e., the means are zeros), which reduces the degrees of freedom, naïve usage of statistical tests and fold change analyses should be avoided. Instead, we propose a nonparametric index that quantifies the extent of separation between two sets of random variables, which helps users find significant signs. The separation index of a given random variable X takes a value from −1 to 1: a larger positive value indicates that X is markedly upregulated and that its probability distribution is well separated from the other distributions, and vice versa for a negative value.

Let us consider a vector a of size n, i.e., the number of samples, whose elements stand for the sign scores, and assume that the elements are sorted in ascending order. For simplicity, suppose that the samples are classified into two groups labeled 0 and 1. Let v be a vector of the labels corresponding to a, and w₀ and w₁ vectors having the same elements as v but sorted in lexicographic order in the forward and backward directions, respectively. Then we define the separation index as follows:

$$ I(v) = \frac{d(v, w_1) - d(v, w_0)}{d(v, w_1) + d(v, w_0)}, \tag{3} $$

where d(v, w_i) is an edit distance (or Levenshtein distance⁴⁹) with only adjacent swapping permitted. For example, if v = (1, 0, 0, 1, 1), then w₀ = (0, 0, 1, 1, 1) and w₁ = (1, 1, 1, 0, 0). From (3) one can calculate d(v, w₀) = 2 and d(v, w₁) = 4, and thus I(v) = 1/3. As another example, if v = (0, 1, 1, 0, 0), then I(v) = −1/3. From these examples, one can see that positive and negative values of I(v) mean that the given sign has positive and negative contributions for group “1,” respectively.
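
The worked examples can be checked with a short Python sketch. It takes I(v) to be the difference of the two adjacent-swap distances divided by their sum, a form that reproduces both examples and the stated range from −1 to 1 but is an assumption of this sketch. For binary labels, the adjacent-swap distance to a sorted arrangement equals the number of out-of-order label pairs, which is what the helper counts. The function names are hypothetical.

```python
from fractions import Fraction


def adjacent_swap_distance(labels, ascending=True):
    """Minimum number of adjacent swaps needed to sort binary labels.

    ascending=True targets w0 (all 0s first); False targets w1 (all 1s first).
    """
    late_label = 1 if ascending else 0  # label that must end up at the back
    swaps = 0
    late_seen = 0
    for label in labels:
        if label == late_label:
            late_seen += 1
        else:
            # This element has to pass every late label already seen.
            swaps += late_seen
    return swaps


def separation_index(scores, labels):
    """Separation index of group 1 versus group 0 for one sign."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    v = [labels[i] for i in order]  # labels arranged by ascending sign score
    d0 = adjacent_swap_distance(v, ascending=True)
    d1 = adjacent_swap_distance(v, ascending=False)
    if d0 + d1 == 0:
        raise ValueError("both groups must contain at least one sample")
    return Fraction(d1 - d0, d0 + d1)


ascending_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
first_example = separation_index(ascending_scores, [1, 0, 0, 1, 1])
second_example = separation_index(ascending_scores, [0, 1, 1, 0, 0])
```

Exact fractions are used so that the results can be compared directly with the values 1/3 and −1/3 given in the text. Tied scores keep their input order in this sketch, which is a simplification.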

Drawbacks

Signs are derived from information in existing databases (DBs). This inevitably introduces bias problems, such as the inherent incompleteness of the DBs and annotation bias, viz. some biological terms are associated with many genes, while others with few⁵⁰.

To overcome this problem, one should monitor what signs are included during data processing (Fig. 1a) and carefully tune the parameters to select reliable signs (Supplementary Fig. S2). Our R programming scripts help users perform this process (Supplementary Notes).

Parameter setting

To obtain explainable results of cell clustering in the downstream analysis of ASURAT, it is critical to tune the parameters in the sign creation step (Supplementary Fig. S2). There are six to nine parameters for creating SSMs depending on the database used but many of them have been preset to unbiased and sensible default values. We found that our default settings worked well in our scRNA-seq analyses but the three parameters should be tuned by users, as described below.

As formulated in (1), the positive and negative constants α and β, which serve as thresholds of correlation coefficients, are required for decomposing correlation graphs and creating signs (see Fig. 2 for a demonstration). In addition, unreliable signs are discarded according to user-defined criteria, which were preset as follows: the sum of the number of genes in SCG and VCG is less than n_min or the number of genes in WCG is less than (the default value is 2). Furthermore, users can remove redundant signs with similar biological meanings if information contents (ICs)⁵¹ are defined.
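
The discard criteria can be expressed as a small predicate. In the Python sketch below, the function name and the two threshold arguments min_strong_variable and min_weak are hypothetical placeholders for the user-defined minimum gene counts, and the value 2 passed in the usage lines is illustrative only.

```python
def is_reliable_sign(scg, vcg, wcg, min_strong_variable, min_weak):
    """Return True when a sign passes both gene-count criteria.

    A sign is discarded when |SCG| + |VCG| falls below min_strong_variable
    or when |WCG| falls below min_weak.
    """
    enough_strong_variable = len(scg) + len(vcg) >= min_strong_variable
    enough_weak = len(wcg) >= min_weak
    return enough_strong_variable and enough_weak


# Illustrative gene sets for three candidate signs.
kept = is_reliable_sign({"ASCL1", "KRT18"}, set(), {"G1", "G2"}, 2, 2)
too_few_correlated = is_reliable_sign({"ASCL1"}, set(), {"G1", "G2"}, 2, 2)
too_few_weak = is_reliable_sign({"ASCL1", "KRT18"}, set(), {"G1"}, 2, 2)
```

Raising either threshold keeps fewer signs, so it is worth recording how many signs survive each setting when tuning.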

Comparison of clustering validity indices of ASURAT with existing methods

To benchmark the clustering qualities of existing methods and ASURAT, we prepared six cancer patient- and healthy donor-derived single-cell RNA-seq datasets. Subsequently, careful quality control and normalization by bayNorm were performed for each dataset. However, 22 additional non-negligible outliers were detected for sc68_vehi by ASURAT, which led to a substantial average silhouette width (ASW) (much greater than 0.9). Hence, those cells were removed from sc68_vehi and the resulting read count table containing 6581 genes and 3901 cells was obtained (Chapter 14.2, Supplementary Notes). Note that such additional preprocessing was undertaken only for the comparison of ASWs.
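
For readers unfamiliar with the ASW, a minimal Python sketch of the silhouette computation³⁰ is given below. It uses Euclidean distance on two-dimensional points, assigns a silhouette of zero to singleton clusters, and assumes at least two clusters and no cluster of coincident points. The function name and the toy data are hypothetical, and the benchmark itself relied on NbClust and SC3 rather than on this code.

```python
import math


def average_silhouette_width(points, labels):
    """Mean silhouette over all points for a given cluster labeling."""
    n = len(points)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        mean_dist = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                total = sum(math.dist(points[i], points[j]) for j in members)
                mean_dist[c] = total / len(members)
        own = labels[i]
        if own not in mean_dist:
            widths.append(0.0)  # singleton cluster
            continue
        a = mean_dist[own]  # mean distance within the point's own cluster
        b = min(d for c, d in mean_dist.items() if c != own)  # nearest other cluster
        widths.append((b - a) / max(a, b))
    return sum(widths) / n


toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_asw = average_silhouette_width(toy_points, [0, 0, 1, 1])
bad_asw = average_silhouette_width(toy_points, [0, 1, 0, 1])
```

A labeling that follows the two visible groups gives a value close to 1, whereas a labeling that cuts across them gives a negative value.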

Using Seurat version 4.0.1²⁶, we normalized the data by log transform with a pseudo-count of 1 (default), selected variable genes based on variance stabilizing transformation with a gene-per-cell ratio of 0.2 (as suggested in previous work⁵²), scaled and centered gene expression levels, and performed PCA. The principal components that explain 90% of the total variability were used for the computations of Uniform Manifold Approximation and Projection (UMAP)³¹ and t-distributed stochastic neighbor embedding (t-SNE)³³, and the resulting two-dimensional cell states were supplied to NbClust³² (Chapter 14.3.1, Supplementary Notes).
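
The rule of retaining the principal components that explain 90% of the total variability can be sketched as follows. The Python function below is a hypothetical helper, not a Seurat call: it assumes that the per-component variances are already available in decreasing order and returns the smallest number of leading components whose cumulative share reaches the target.

```python
def n_components_for_variance(variances, target=0.9):
    """Smallest number of leading components reaching the target share.

    variances: per-component variances, sorted in decreasing order.
    target: required fraction of total variance (0.9 in the text).
    """
    total = sum(variances)
    cumulative = 0.0
    for k, variance in enumerate(variances, start=1):
        cumulative += variance
        if cumulative / total >= target:
            return k
    return len(variances)


# Illustrative variances: cumulative shares are 0.50, 0.80, 0.95, 1.00.
toy_variances = [50.0, 30.0, 15.0, 5.0]
n_kept = n_components_for_variance(toy_variances)
```

With these toy variances, the first two components fall short of 90% and the third crosses it, so three components are kept.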

Using Monocle 3 version 0.2.3.0²⁷, we ran the R function preprocess_cds() in the Monocle 3 package with the default settings, in which the data were normalized by log transform with a pseudo-count of 1, scaled and centered in gene expression levels, and subjected to PCA with a reduced-space dimensionality of 50. The results were used for the computations of UMAP and t-SNE, and the resulting two-dimensional cell states were supplied to NbClust (Chapter 14.3.2, Supplementary Notes).

Using SC3 version 1.18.0²⁸, we normalized the data by log transform with a pseudo-count of 1 (default), performed PCA, and ran R function sc3() in the SC3 package, with the arguments ks = 2:7 and biology = TRUE. This function automatically computed a consensus matrix for each number of clusters and output the ASW based on the hierarchical clustering of the consensus matrix (Chapter 14.3.3, Supplementary Notes). However, sc3() stopped processing and reported errors for sc68_vehi and sc68_cisp irrespective of the arguments.

Using PCA-based clustering, we normalized the data by log transform with a pseudo-count of 1 and ran prcomp() in the R stats package. The principal components that explain 90% of the total variability were used for the computations of UMAP and t-SNE, and the resulting two-dimensional cell states were supplied to NbClust (Chapter 14.3.4, Supplementary Notes).

Databases were downloaded in December 2020 and verified for human and mouse scRNA-seq datasets. Using ASURAT, we normalized the data by log transform with a pseudo-count of 1, scaled and centered gene expressions, and created SSMs based on Disease Ontology (DO) for sc68_vehi and sc68_cisp, Gene Ontology (GO) for day1_norm, day7_hypo, and pbmc_6000, and Cell Ontology (CO) for pbmc_4000. These SSMs were used for the computations of UMAP and t-SNE without preprocessing by PCA, and the resulting two-dimensional cell states were supplied to NbClust (Chapter 14.3.5, Supplementary Notes).

Cell-type inference of PBMC datasets by existing methods and ASURAT

To benchmark the abilities of cell-type inference of existing methods and ASURAT, we prepared the normalized read count tables of pbmc_4000 and pbmc_6000 in the same manner described in the previous section. Using R functions FindClusters() and FindAllMarkers() in Seurat, cluster_cells() and top_markers() in Monocle 3, and sc3_plot_markers() in SC3 packages, we identified several different cell types by manually searching marker genes in GeneCards version 5.2⁵³ (Chapter 14.4, Supplementary Notes). Seurat identified T cells (resp. marker genes CD3D, CD3E, IL32, TRAC), monocytes (S100A8, LYZ, CD14), B cells (CD79A, MS4A1, IGHM, VPREB3, BANK1), and NK/NKT cells (NKG7, CD160, KLRF1, GZMA, GZMB, FGFBP2, GNLY), Monocle 3 identified T cells (CD3D, CD3E, CD27, IL32, TRAC, TCF7), monocytes (S100A8, LYZ, CD14), B cells (CD79A, CD79B, MS4A1, IGHM, VPREB3, BANK1), and NK/NKT cells (NKG7, GNLY, CD160, GZMA, FGFBP2), and SC3 identified B cells (CD79A, MS4A1) and NK/NKT cells (TPD52L2, GZMA, GZMB, GZMH, GZMK).

Using ASURAT, we created SSMs based on CO, GO, and Kyoto Encyclopedia of Genes and Genomes (KEGG), clustered the cells by k-nearest neighbor (KNN) graph generation and Louvain algorithm using Seurat’s functions²⁶ after dimension reduction by PCA, analyzed the separation index (3) of each sign score for each cluster, found the signs upregulated in specific clusters, and inferred the cell types (Supplementary Figs. S3 and S4; Chapter 14.4.4, Supplementary Notes): T cells (respectively marker genes CD3D, CD3E, CD247, PTPRC, IL7R, etc.), monocytes (MEF2C, LYN, CCL3, CD14, FGR, etc.), B cells (CD19, CD72, CD79B, BTK, DAPP1, etc.), NK/NKT cells (SH2D1A, KLRD1, NCR3, GZMB, CD160, FGR, ITGB2, FCGR3A, etc.), and dendritic cells (HLA-DOB, CCR7, CD2, FCGR2B, BLK, etc.).

Cell-type inference of cervical cancer datasets by ASURAT

To validate ASURAT’s reliable cell-type inference, the normalized read count tables of day1_norm and day7_hypo were prepared in the same manner as described in the previous section. Previous work studying human cervical cancers using CTOS methods indicated that some small cell neuroendocrine carcinomas (SCNCs) exhibited combined phenotypes with other non-SCNC cells⁴¹. Additionally, hypoxia drove divergent differentiation of SCNCs, but detailed molecular information remained to be elucidated. Using ASURAT, we created SSMs based on DO, GO, and KEGG, and clustered the cells by one of the following: (i) PCA, followed by KNN graph generation and Louvain algorithm using Seurat’s functions²⁶ and (ii) diffusion map generation, followed by allocation of cells to the different branches of the data manifold by using MERLoT²⁴. Here, cells in day1_norm were clustered by (i), while those in day7_hypo were clustered by (i) and (ii) for SSM using DO and GO, respectively (Supplementary Figs. S5 and S6).

Code availability

An open-source implementation of ASURAT is available on GitHub (https://github.com/keita-iida/ASURAT) under the GPLv3 license. All the input and output files used in the present paper and user-friendly documentation written in R bookdown can be downloaded from the above URL.

Supplementary Notes

Supplementary Notes are written in separate files, which are structured as follows:

Chapter 1. Overview of ASURAT.

Chapter 2. Preparing data sets.

Chapter 3. Data quality control (QC).

Chapter 4. Normalizing and centering data.

Chapter 5. Computing correlations among genes.

Chapter 6. Checking expression profiles of marker genes.

Chapter 7. Collecting databases (optional).

Chapter 8. ASURAT using Disease Ontology database (optional).

Chapter 9. ASURAT using Cell Ontology database (optional).

Chapter 10. ASURAT using Gene Ontology database (optional).

Chapter 11. ASURAT using KEGG (optional).

Chapter 12. ASURAT using Reactome (optional).

Chapter 13. Multiple sign analysis by concatenating DO, CO, GO, KEGG, and Reactome.

Chapter 14. Appendix A: comparing performances of ASURAT and existing methods.

Chapter 15. Appendix B: automatically tuning ASURAT’s parameters.

Supplementary Files

Supplementary Files are prepared in separate files, which are structured as follows:

Supplementary File 1. NbClust’s output for 2-dim UMAP computed by Seurat across six cancer patient- and healthy donor-derived scRNA-seq datasets (SupplementaryFile_001_nbclust_umap_seurat.pdf).

Supplementary File 2. NbClust’s output for 2-dim UMAP computed by Monocle 3 across six cancer patient- and healthy donor-derived scRNA-seq datasets (SupplementaryFile_002_nbclust_umap_monocle3.pdf).

Supplementary File 3. SC3’s output of ASWs across six cancer patient- and healthy donor-derived scRNA-seq datasets (SupplementaryFile_003_average_silhouette_sc3.pdf).

Supplementary File 4. NbClust’s output for 2-dim UMAP preprocessed by PCA across six cancer patient- and healthy donor-derived scRNA-seq datasets (SupplementaryFile_004_nbclust_umap_pca.pdf).

Supplementary File 5. NbClust’s output for 2-dim UMAP computed by ASURAT across six cancer patient- and healthy donor-derived scRNA-seq datasets (SupplementaryFile_005_nbclust_umap_asurat.pdf).

Supplementary File 6. ASURAT’s R function files (SupplementaryFile_006_R_files.zip).

Supplementary Figures

Fig. S1

ASURAT outperforms existing methods with respect to producing robust, high-quality, and reliable clusterings of various single-cell transcriptomes. (a) Average silhouette widths (ASWs) versus the number of clusters, computed by two-dimensional t-distributed stochastic neighbor embedding (t-SNE) and k-means clustering for Seurat, Monocle 3, PCA, and ASURAT. (b) Comparison of t-SNE plots between different methods using various datasets. The input databases for ASURAT are indicated in parentheses.

Fig. S2

Detailed workflow of Fig. 1c focusing on the parameter settings. The indicated values are preset as default in ASURAT, while “u.d.” stands for the value or argument that users must define. Here, α and β are positive and negative threshold values of correlation coefficients, n_min and positive integers for selecting reliable signs, MEASURE the name of information content (IC)-based method defining semantic similarities, SIM_TH a threshold value used to regard two biological terms as similar, KEEP_RAREID determines whether the signs with larger ICs are kept or not (if TRUE, the signs with larger ICs are kept), and ω^(s) and ω^(v) weight constants are used to define SSMs.

Fig. S3

Identification of the cell types in pbmc_4000 by ASURAT. (a) Heat maps showing the sign scores of sign-by-sample matrices (SSMs) for Cell Ontology (CO), Gene Ontology (GO), and Kyoto Encyclopedia of Genes and Genomes (KEGG), which are concatenated vertically. The cells are clustered by k-nearest neighbor (KNN) graph generation and Louvain algorithm by using Seurat’s functions in the R package after dimension reduction by principal component analysis. (b)-(d) Violin plots showing the distributions of sign scores for the indicated sign IDs. The cell type labels were inferred by CO as follows: T cell (label 1), monocyte (label 2), B cell (label 3), and NK/NKT cell (label 4 and 5).

Fig. S4

Identification of the cell types in pbmc_6000 by ASURAT. (a) Heat maps showing the sign scores of sign-by-sample matrices (SSMs) for Cell Ontology (CO), Gene Ontology (GO), and Kyoto Encyclopedia of Genes and Genomes (KEGG), which are concatenated vertically. The cells are clustered by k-nearest neighbor (KNN) graph generation and Louvain algorithm by using Seurat’s functions in the R package after dimension reduction by principal component analysis. (b)-(d) Violin plots showing the distributions of sign scores for the indicated sign IDs. The cell type labels were inferred by CO as follows: T cell (label 1), monocyte (label 2), NK/NKT cell (label 3), B cell (label 4), and dendritic cell (label 5).

Fig. S5

Identification of the putative cell types and functional subpopulations in day1_norm by ASURAT. (a) Heat maps showing the sign scores of sign-by-sample matrices (SSMs) for Disease Ontology (DO), Gene Ontology (GO), and Kyoto Encyclopedia of Genes and Genomes (KEGG), which are concatenated vertically. The cells were clustered by k-nearest neighbor (KNN) graph generation and Louvain algorithm by using Seurat’s functions in the R package after the dimension reduction by principal component analysis. (b) t-SNE plots of the SSM for DO, showing the cell clustering and sign scores for the indicated sign IDs. (c) t-SNE plots of the SSM for GO and violin plots showing the distributions of sign scores for the indicated sign IDs.

Fig. S6

Identification of the putative cell types and functional subpopulations in day7_hypo by ASURAT. (a) Heat maps showing the sign scores of sign-by-sample matrices (SSMs) for Disease Ontology (DO), Gene Ontology (GO), and Kyoto Encyclopedia of Genes and Genomes (KEGG), which are concatenated vertically. The cells were clustered by (i) k-nearest neighbor (KNN) graph generation and Louvain algorithm by using Seurat’s functions in the R package after the dimension reduction by principal component analysis for the SSM for DO, and (ii) diffusion map, followed by allocations of samples to the different branches of the data manifold by using MERLoT for the SSM for GO. (b) t-SNE plots of the SSM for DO, showing cell clustering and sign scores for the indicated sign IDs. (c) t-SNE plots of the SSM for GO and violin plots showing the distributions of sign scores for the indicated sign IDs.

Acknowledgements

We thank Takeya Kasukawa and Johannes Nicolaus Wibisana for comments that greatly improved the analysis pipeline. K.I. was supported by JSPS KAKENHI Grant No. 20K14361. K.I., J.K., and M.I. were supported by Shin Bunya Kaitaku Shien Program of Institute for Protein Research, Osaka University. M.O. was supported by JSPS KAKENHI Grant No. 17H06299, 17H06302, and 18H04031, and JST-Mirai program No. JPMJMI19G7. M.O. and M.I. were supported by P-CREATE, Japan Agency for Medical Research and Development.

REFERENCES

[1] La Manno, G. et al. RNA velocity of single cells. Nature 560, 494–498, doi:10.1038/s41586-018-0414-6 (2018).
[2] Ganesh, K. et al. L1CAM defines the regenerative origin of metastasis-initiating cells in colorectal cancer. Nat Cancer 1, 28–45, doi:10.1038/s43018-019-0006-x (2020).
[3] Stewart, C. A. et al. Single-cell analyses reveal increased intratumoral heterogeneity after the onset of therapy resistance in small-cell lung cancer. Nat Cancer 1, 423–436, doi:10.1038/s43018-019-0020-z (2020).
[4] Chen, Z. et al. Ligand-receptor interaction atlas within and between tumor cells and T cells in lung adenocarcinoma. Int J Biol Sci 16, 2205–2219, doi:10.7150/ijbs.42080 (2020).
[5] Chen, H. J. et al. Generation of pulmonary neuroendocrine cells and SCLC-like tumors from human embryonic stem cells. J Exp Med 216, 674–687, doi:10.1084/jem.20181155 (2019).
[6] Maynard, A. et al. Therapy-Induced Evolution of Human Lung Cancer Revealed by Single-Cell RNA Sequencing. Cell 182, 1232–1251 e1222, doi:10.1016/j.cell.2020.07.017 (2020).
[7] Devitt, K. et al. Single-cell RNA sequencing reveals cell type-specific HPV expression in hyperplastic skin lesions. Virology 537, 14–19, doi:10.1016/j.virol.2019.08.007 (2019).
[8] Pasquini, G., Rojo Arias, J. E., Schafer, P. & Busskamp, V. Automated methods for cell type annotation on scRNA-seq data. Comput Struct Biotechnol J 19, 961–969, doi:10.1016/j.csbj.2021.01.015 (2021).
[9] Kim, N. et al. Single-cell RNA sequencing demonstrates the molecular and cellular reprogramming of metastatic lung adenocarcinoma. Nat Commun 11, 2285, doi:10.1038/s41467-020-16164-1 (2020).
[10] Aran, D. et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat Immunol 20, 163–172, doi:10.1038/s41590-018-0276-y (2019).
[11] Gao, F. et al. DeepCC: a novel deep learning-based framework for cancer molecular subtype classification. Oncogenesis 8, 44, doi:10.1038/s41389-019-0157-8 (2019).
[12] Cao, Y., Wang, X. & Peng, G. SCSA: A Cell Type Annotation Tool for Single-Cell RNA-seq Data. Front Genet 11, 490, doi:10.3389/fgene.2020.00490 (2020).
[13] Andrews, T. S., Kiselev, V. Y., McCarthy, D. & Hemberg, M. Tutorial: guidelines for the computational analysis of single-cell RNA sequencing data. Nat Protoc 16, 1–9, doi:10.1038/s41596-020-00409-w (2021).
[14] Pliner, H. A., Shendure, J. & Trapnell, C. Supervised classification enables rapid annotation of cell atlases. Nat Methods 16, 983–986, doi:10.1038/s41592-019-0535-3 (2019).
[15] Cancer Genome Atlas Research, N. et al. Integrated genomic and molecular characterization of cervical cancer. Nature 543, 378–384, doi:10.1038/nature21386 (2017).
[16] Moore, D., Simoes, R. M., Dehmer, M. & Emmert-Streib, F. Prostate Cancer Gene Regulatory Network Inferred from RNA-Seq Data. Curr Genomics 20, 38–48, doi:10.2174/1389202919666181107122005 (2019).
[17] Li, H. et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat Genet 49, 708–718, doi:10.1038/ng.3818 (2017).
[18] Yu, G., Wang, L. G., Yan, G. R. & He, Q. Y. DOSE: an R/Bioconductor package for disease ontology semantic and enrichment analysis. Bioinformatics 31, 608–609, doi:10.1093/bioinformatics/btu684 (2015).
[19] Diehl, A. D. et al. The Cell Ontology 2016: enhanced content, modularization, and ontology interoperability. J Biomed Semantics 7, 44, doi:10.1186/s13326-016-0088-7 (2016).
[20] Yu, G., Wang, L. G., Han, Y. & He, Q. Y. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS 16, 284–287, doi:10.1089/omi.2011.0118 (2012).
[21] Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28, 27–30, doi:10.1093/nar/28.1.27 (2000).
[22] Fabregat, A. et al. The Reactome Pathway Knowledgebase. Nucleic Acids Res 46, D649–D655, doi:10.1093/nar/gkx1132 (2018).
[23] Coifman, R. R. & Lafon, S. Diffusion maps. Appl Comput Harmon A 21, 5–30, doi:10.1016/j.acha.2006.04.006 (2006).
[24] Parra, R. G. et al. Reconstructing complex lineage trees from scRNA-seq data using MERLoT. Nucleic Acids Res 47, 8961–8974, doi:10.1093/nar/gkz706 (2019).
[25] Tang, W. et al. bayNorm: Bayesian gene expression recovery, imputation and normalization for single-cell RNA-sequencing data. Bioinformatics 36, 1174–1181, doi:10.1093/bioinformatics/btz726 (2020).
[26] Hao, Y. et al. Integrated analysis of multimodal single-cell data. Preprint at https://www.biorxiv.org/content/10.1101/2020.10.12.335331v1, doi:10.1101/2020.10.12.335331 (2020).
[27] Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol 32, 381–386, doi:10.1038/nbt.2859 (2014).
[28] Kiselev, V. Y. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 14, 483–486, doi:10.1038/nmeth.4236 (2017).
[29] Hubert, L. & Arabie, P. Comparing partitions. J Classif 2, 193–218 (1985).
[30] Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20, 53–65, doi:10.1016/0377-0427(87)90125-7 (1987).
[31] McInnes, L. & Healy, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2018).
[32] Charrad, M., Ghazzali, N., Boiteau, V. & Niknafs, A. Nbclust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software 61, 1–36 (2014).
[33] van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J Mach Learn Res 9, 2579–2605 (2008).
[34] Balanis, N. G. et al. Pan-cancer Convergence to a Small-Cell Neuroendocrine Phenotype that Shares Susceptibilities with Hematological Malignancies. Cancer Cell 36, 17–34 e17, doi:10.1016/j.ccell.2019.06.005 (2019).
[35] Skov, B. G., Holm, B., Erreboe, A., Skov, T. & Mellemgaard, A. ERCC1 and Ki67 in small cell lung carcinoma and other neuroendocrine tumors of the lung: distribution and impact on survival. J Thorac Oncol 5, 453–459, doi:10.1097/JTO.0b013e3181ca063b (2010).
[36] Belyanskaya, L. L. et al. Cisplatin activates Akt in small cell lung cancer cells and attenuates apoptosis by survivin upregulation. Int J Cancer 117, 755–763, doi:10.1002/ijc.21242 (2005).
[37] Kristiansen, G. et al. CD24 is an independent prognostic marker of survival in nonsmall cell lung cancer patients. Br J Cancer 88, 231–236, doi:10.1038/sj.bjc.6600702 (2003).
[38] Stuart, T. et al. Comprehensive Integration of Single-Cell Data. Cell 177, 1888–1902 e1821, doi:10.1016/j.cell.2019.05.031 (2019).
[39] Imoto, H., Zhang, S. & Okada, M. A Computational Framework for Prediction and Analysis of Cancer Signaling Dynamics from RNA Sequencing Data-Application to the ErbB Receptor Signaling Pathway. Cancers (Basel) 12, doi:10.3390/cancers12102878 (2020).
[40] Morrison, R. E., Baptista, R. & Marzouk, Y. Beyond normality: Learning sparse probabilistic graphical models in the non-Gaussian setting. Adv Neur In 30 (2017).
[41] Kubota, S. et al. Dedifferentiation of neuroendocrine carcinoma of the uterine cervix in hypoxia. Biochem Biophys Res Commun 524, 398–404, doi:10.1016/j.bbrc.2020.01.024 (2020).
[42] Hashimoto, S. et al. Comprehensive single-cell transcriptome analysis reveals heterogeneity in endometrioid adenocarcinoma tissues. Sci Rep 7, 14225, doi:10.1038/s41598-017-14676-3 (2017).
[43] Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595, doi:10.1093/bioinformatics/btp698 (2010).
[44] Schubert, E. & Rousseeuw, P. J. Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms. SISAP 2020, 171–187, doi:10.1007/978-3-030-32047-8_16 (2019).
[45] Murtagh, F. & Legendre, P. Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion? J Classif 31, 274–295, doi:10.1007/s00357-014-9161-z (2014).
[46] Hyvarinen, A. Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans Neural Netw 10, 626–634, doi:10.1109/72.761722 (1999).
[47] Blondel, V. D., Guillaume, J., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J Stat Mech-Theory E, P10008 (2008).
[48] Bodenhofer, U., Kothmeier, A. & Hochreiter, S. APCluster: an R package for affinity propagation clustering. Bioinformatics 27, 2463–2464, doi:10.1093/bioinformatics/btr406 (2011).
[49] Lowrance, R. & Wagner, R. A. An extension of the string-to-string correction problem. J Assoc Comput Mach 22, doi:10.1145/321879.321880 (1975).
[50] Gaudet, P. & Dessimoz, C. Gene Ontology: Pitfalls, Biases, and Remedies. Methods Mol Biol 1446, 189–205, doi:10.1007/978-1-4939-3743-1_14 (2017).
[51] Yu, G. et al. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26, 976–978, doi:10.1093/bioinformatics/btq064 (2010).
[52] Cruz, J. A. & Wishart, D. S. Applications of machine learning in cancer prediction and prognosis. Cancer Inform 2, 59–77 (2007).
[53] Stelzer, G. et al. The GeneCards Suite: From Gene Data Mining to Disease Genome Sequence Analyses. Curr Protoc Bioinformatics 54, 1.30.1–1.30.33, doi:10.1002/cpbi.5 (2016).

Posted June 10, 2021.

Subject Area

Bioinformatics

[1] ↵
La Manno, G. et al. RNA velocity of single cells. Nature 560, 494–498, doi:10.1038/s41586-018-0414-6 (2018).
OpenUrl CrossRef PubMed

[2] ↵
Ganesh, K. et al. L1CAM defines the regenerative origin of metastasis-initiating cells in colorectal cancer. Nat Cancer 1, 28–45, doi:10.1038/s43018-019-0006-x (2020).
OpenUrl CrossRef

[3] ↵
Stewart, C. A. et al. Single-cell analyses reveal increased intratumoral heterogeneity after the onset of therapy resistance in small-cell lung cancer. Nat Cancer 1, 423–436, doi:10.1038/s43018-019-0020-z (2020).
OpenUrl CrossRef

[4] ↵
Chen, Z. et al. Ligand-receptor interaction atlas within and between tumor cells and T cells in lung adenocarcinoma. Int J Biol Sci 16, 2205–2219, doi:10.7150/ijbs.42080 (2020).
OpenUrl CrossRef

[5] ↵
Chen, H. J. et al. Generation of pulmonary neuroendocrine cells and SCLC-like tumors from human embryonic stem cells. J Exp Med 216, 674–687, doi:10.1084/jem.20181155 (2019).
OpenUrl Abstract/FREE Full Text

[6] ↵
Maynard, A. et al. Therapy-Induced Evolution of Human Lung Cancer Revealed by Single-Cell RNA Sequencing. Cell 182, 1232–1251 e1222, doi:10.1016/j.cell.2020.07.017 (2020).
OpenUrl CrossRef PubMed

[7] ↵
Devitt, K. et al. Single-cell RNA sequencing reveals cell type-specific HPV expression in hyperplastic skin lesions. Virology 537, 14–19, doi:10.1016/j.virol.2019.08.007 (2019).
OpenUrl CrossRef

[8] ↵
Pasquini, G., Rojo Arias, J. E., Schafer, P. & Busskamp, V. Automated methods for cell type annotation on scRNA-seq data. Comput Struct Biotechnol J 19, 961–969, doi:10.1016/j.csbj.2021.01.015 (2021).
OpenUrl CrossRef

[9] Kim, N. et al. Single-cell RNA sequencing demonstrates the molecular and cellular reprogramming of metastatic lung adenocarcinoma. Nat Commun 11, 2285, doi:10.1038/s41467-020-16164-1 (2020).

[10] Aran, D. et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat Immunol 20, 163–172, doi:10.1038/s41590-018-0276-y (2019).

[11] Gao, F. et al. DeepCC: a novel deep learning-based framework for cancer molecular subtype classification. Oncogenesis 8, 44, doi:10.1038/s41389-019-0157-8 (2019).

[12] Cao, Y., Wang, X. & Peng, G. SCSA: A Cell Type Annotation Tool for Single-Cell RNA-seq Data. Front Genet 11, 490, doi:10.3389/fgene.2020.00490 (2020).

[13] Andrews, T. S., Kiselev, V. Y., McCarthy, D. & Hemberg, M. Tutorial: guidelines for the computational analysis of single-cell RNA sequencing data. Nat Protoc 16, 1–9, doi:10.1038/s41596-020-00409-w (2021).

[14] Pliner, H. A., Shendure, J. & Trapnell, C. Supervised classification enables rapid annotation of cell atlases. Nat Methods 16, 983–986, doi:10.1038/s41592-019-0535-3 (2019).

[15] Cancer Genome Atlas Research Network et al. Integrated genomic and molecular characterization of cervical cancer. Nature 543, 378–384, doi:10.1038/nature21386 (2017).

[16] Moore, D., Simoes, R. M., Dehmer, M. & Emmert-Streib, F. Prostate Cancer Gene Regulatory Network Inferred from RNA-Seq Data. Curr Genomics 20, 38–48, doi:10.2174/1389202919666181107122005 (2019).

[17] Li, H. et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat Genet 49, 708–718, doi:10.1038/ng.3818 (2017).

[18] Yu, G., Wang, L. G., Yan, G. R. & He, Q. Y. DOSE: an R/Bioconductor package for disease ontology semantic and enrichment analysis. Bioinformatics 31, 608–609, doi:10.1093/bioinformatics/btu684 (2015).

[19] Diehl, A. D. et al. The Cell Ontology 2016: enhanced content, modularization, and ontology interoperability. J Biomed Semantics 7, 44, doi:10.1186/s13326-016-0088-7 (2016).

[20] Yu, G., Wang, L. G., Han, Y. & He, Q. Y. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS 16, 284–287, doi:10.1089/omi.2011.0118 (2012).

[21] Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28, 27–30, doi:10.1093/nar/28.1.27 (2000).

[22] Fabregat, A. et al. The Reactome Pathway Knowledgebase. Nucleic Acids Res 46, D649–D655, doi:10.1093/nar/gkx1132 (2018).

[23] Coifman, R. R. & Lafon, S. Diffusion maps. Appl Comput Harmon A 21, 5–30, doi:10.1016/j.acha.2006.04.006 (2006).

[24] Parra, R. G. et al. Reconstructing complex lineage trees from scRNA-seq data using MERLoT. Nucleic Acids Res 47, 8961–8974, doi:10.1093/nar/gkz706 (2019).

[25] Tang, W. et al. bayNorm: Bayesian gene expression recovery, imputation and normalization for single-cell RNA-sequencing data. Bioinformatics 36, 1174–1181, doi:10.1093/bioinformatics/btz726 (2020).

[26] Hao, Y. et al. Integrated analysis of multimodal single-cell data. Preprint at https://www.biorxiv.org/content/10.1101/2020.10.12.335331v1, doi:10.1101/2020.10.12.335331 (2020).

[27] Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol 32, 381–386, doi:10.1038/nbt.2859 (2014).

[28] Kiselev, V. Y. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 14, 483–486, doi:10.1038/nmeth.4236 (2017).

[29] Hubert, L. & Arabie, P. Comparing partitions. J Classif 2, 193–218 (1985).

[30] Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20, 53–65, doi:10.1016/0377-0427(87)90125-7 (1987).

[31] McInnes, L. & Healy, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2018).

[32] Charrad, M., Ghazzali, N., Boiteau, V. & Niknafs, A. NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. J Stat Softw 61, 1–36 (2014).

[33] van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J Mach Learn Res 9, 2579–2605 (2008).

[34] Balanis, N. G. et al. Pan-cancer Convergence to a Small-Cell Neuroendocrine Phenotype that Shares Susceptibilities with Hematological Malignancies. Cancer Cell 36, 17–34 e17, doi:10.1016/j.ccell.2019.06.005 (2019).

[35] Skov, B. G., Holm, B., Erreboe, A., Skov, T. & Mellemgaard, A. ERCC1 and Ki67 in small cell lung carcinoma and other neuroendocrine tumors of the lung: distribution and impact on survival. J Thorac Oncol 5, 453–459, doi:10.1097/JTO.0b013e3181ca063b (2010).

[36] Belyanskaya, L. L. et al. Cisplatin activates Akt in small cell lung cancer cells and attenuates apoptosis by survivin upregulation. Int J Cancer 117, 755–763, doi:10.1002/ijc.21242 (2005).

[37] Kristiansen, G. et al. CD24 is an independent prognostic marker of survival in nonsmall cell lung cancer patients. Br J Cancer 88, 231–236, doi:10.1038/sj.bjc.6600702 (2003).

[38] Stuart, T. et al. Comprehensive Integration of Single-Cell Data. Cell 177, 1888–1902 e1821, doi:10.1016/j.cell.2019.05.031 (2019).

[39] Imoto, H., Zhang, S. & Okada, M. A Computational Framework for Prediction and Analysis of Cancer Signaling Dynamics from RNA Sequencing Data-Application to the ErbB Receptor Signaling Pathway. Cancers (Basel) 12, doi:10.3390/cancers12102878 (2020).

[40] Morrison, R. E., Baptista, R. & Marzouk, Y. Beyond normality: Learning sparse probabilistic graphical models in the non-Gaussian setting. Adv Neur In 30 (2017).

[41] Kubota, S. et al. Dedifferentiation of neuroendocrine carcinoma of the uterine cervix in hypoxia. Biochem Biophys Res Commun 524, 398–404, doi:10.1016/j.bbrc.2020.01.024 (2020).

[42] Hashimoto, S. et al. Comprehensive single-cell transcriptome analysis reveals heterogeneity in endometrioid adenocarcinoma tissues. Sci Rep 7, 14225, doi:10.1038/s41598-017-14676-3 (2017).

[43] Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595, doi:10.1093/bioinformatics/btp698 (2010).

[44] Schubert, E. & Rousseeuw, P. J. Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms. SISAP 2019, 171–187, doi:10.1007/978-3-030-32047-8_16 (2019).

[45] Murtagh, F. & Legendre, P. Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion? J Classif 31, 274–295, doi:10.1007/s00357-014-9161-z (2014).

[46] Hyvarinen, A. Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans Neural Netw 10, 626–634, doi:10.1109/72.761722 (1999).

[47] Blondel, V. D., Guillaume, J., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J Stat Mech-Theory E, P10008 (2008).

[48] Bodenhofer, U., Kothmeier, A. & Hochreiter, S. APCluster: an R package for affinity propagation clustering. Bioinformatics 27, 2463–2464, doi:10.1093/bioinformatics/btr406 (2011).

[49] Lowrance, R. & Wagner, R. A. An extension of the string-to-string correction problem. J Assoc Comput Mach 22, doi:10.1145/321879.321880 (1975).

[50] Gaudet, P. & Dessimoz, C. Gene Ontology: Pitfalls, Biases, and Remedies. Methods Mol Biol 1446, 189–205, doi:10.1007/978-1-4939-3743-1_14 (2017).

[51] Yu, G. et al. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26, 976–978, doi:10.1093/bioinformatics/btq064 (2010).

[52] Cruz, J. A. & Wishart, D. S. Applications of machine learning in cancer prediction and prognosis. Cancer Inform 2, 59–77 (2007).

[53] Stelzer, G. et al. The GeneCards Suite: From Gene Data Mining to Disease Genome Sequence Analyses. Curr Protoc Bioinformatics 54, 1.30.1–1.30.33, doi:10.1002/cpbi.5 (2016).
