Abstract
Analysis of single-cell RNA sequencing (scRNA-seq) datasets is a complex and time-consuming process, requiring both biological knowledge and technical skill. In order to simplify and systematize this process, we introduce UNCURL-App, an online GUI-based interactive scRNA-seq analysis tool. UNCURL-App introduces two key innovations: First, prior knowledge in the form of cell type, anatomy, and Gene Ontology databases is integrated directly with the rest of the analysis process, allowing users to automatically map cell clusters to known cell types based on gene expression. Second, tools for interactive re-analysis allow the user to iteratively create, merge, or delete clusters in order to arrive at an optimal mapping between clusters and cell types.
Availability The website is at https://uncurl.cs.washington.edu/. Source code is available at https://github.com/yjzhang/uncurl_app
Background
Single cell RNA sequencing (scRNA-seq) has become an essential and ubiquitous tool for exploring the diversity of cell types in multicellular organisms. Progress in experimental technology development has driven rapid growth in the number of scRNA-seq datasets [1, 2], with a search in 2020 for “single-cell RNA-seq” on NCBI GEO returning tens of thousands of results. Over little more than a decade, scRNA-seq experiments progressed from first proof-of-principle demonstrations using a handful of cells [3] to the construction of “cell atlases” that enumerate all of the cell types present in an organ or organism [4–9]. However, while experimental approaches have become higher throughput and more widely available, it remains challenging to map experimentally determined single cell transcriptomes to biologically meaningful cell types. Given the very large throughput in cell number and the high complexity of many of the systems under investigation, reliable data analysis has become the main bottleneck of the scRNA-seq workflow.
Assigning sequenced cells to cell types is a multi-step process that requires users to make judgment calls, because the ground truth about the abundance and identity of cell types in an experiment of interest is typically not available. In practice, sequencing data is often first “pre-processed,” i.e., corrected for variability introduced by experimentally sampling the actual cellular transcriptome, or “batch-corrected” if data from multiple experiments need to be integrated. Then, data are visualized in two dimensions and cells are clustered. Differential expression analysis identifies genes that are characteristic of each cluster, and these differentially expressed genes are used to assign clusters to cell types based on known gene-cell type associations. It is almost always necessary to iterate over this process and repeatedly remove, merge, or split clusters to arrive at a satisfactory mapping of clusters to cell types consistent with known biology. These tasks require users who have both technical proficiency and knowledge of the underlying biology.
A wide range of computational tools have been developed to guide and assist each step in the analysis workflow, from preprocessing [10–13] to clustering [14–18], data integration through batch effect correction [19, 20], and cell type annotation [21–23]. There are also a number of integrated analysis frameworks that combine several of these tasks into one package [24, 25]. However, these tools are typically restricted to command-line usage and require programming knowledge, hindering the accessibility of scRNA-seq analysis. These tools are also limited in their interactivity; even web-based tools such as scQuery [23] typically do not allow cluster assignments or cell type labels to be changed by the user. Moreover, the last step in particular, assigning labels to clusters, remains heavily dependent on a user’s prior knowledge. Thus, even with all of these computational tools, the process of analyzing a new scRNA-seq data set remains somewhat idiosyncratic.
To aid in the task of analyzing scRNA-seq data, we here introduce UNCURL-App. UNCURL-App combines data preprocessing, dimensionality reduction, clustering, differential expression, and interactive data analysis within an online graphical user interface. UNCURL-App introduces two key innovations: First, cell type databases are integrated directly with the rest of the analysis process, accelerating mapping of clusters to cell types. Second, UNCURL-App includes tools for interactive re-analysis that allow the user to create, merge, or delete clusters, thus making it possible to iteratively refine clusters using knowledge about gene expression, putative cell type annotations, and other information accessible through UNCURL-App. Because the entire workflow can be performed in a browser and because external knowledge is made available during the analysis process, we expect UNCURL-App not only to accelerate scRNA-seq analysis but also to further extend the user base for this technology.
Results
Workflow & Interface
From the user’s perspective, the first step in the UNCURL-App pipeline is to upload the data as a gene-cell read count matrix (Supplemental Figures 1 & 2). Next, the app automatically performs preprocessing and clustering using UNCURL, dimensionality reduction, and differential expression. Then, the user is redirected to an interactive web site, from which they can view the results of the previous steps, query for cell types, or perform interactive re-analysis (Figure 1).
There are three main visualization components: the dimensionality-reduced scatterplot of cells, the barplot showing the most differentially expressed genes, and the cell type query results. A labeled screenshot of the main UNCURL-App view is shown in Figure 2. The scatterplot, on the top left of the screen, shows a dimensionality-reduced view of the cells in the dataset, where each point represents a cell. This view can be colored by cluster, gene expression for selected gene(s), or custom label sets based on uploaded files or user-defined criteria. For example, a user may select all cells belonging to a given cluster that also have positive expression of a certain gene, or select all cells that both belong to a certain cluster and have a certain label in an uploaded file (Supplemental Figure 3). The user may also select cells by drawing a box or shape on the plot itself.
On the top right of the screen, the barplot shows the top differentially expressed genes for the selected cluster or label. The barplot is automatically updated whenever the user clicks on a cell on the scatterplot, showing the top genes for the cluster that the cell belongs to.
The bottom left of the screen shows the database query view. From this view, the user may query cell type databases using the top differentially expressed genes. Databases include Enrichr [26], CellMarker [27], and Gene Ontology [28, 29], as well as our new cell type database, CellMeSH (see Methods). Submitting a query will return a list of cell types with a confidence score, overlapping genes, and references for each gene-cell type pair.
Data preprocessing, clustering, and differential expression
The first step in the analysis pipeline builds on UNCURL, a tool for preprocessing and clustering scRNA-seq data using probabilistic matrix factorization [12]. UNCURL has been shown to have state-of-the-art performance in clustering large-scale scRNA-seq datasets, and performs exceptionally well on sparse datasets. It assumes that the observed read count matrix follows a Poisson, Log-Normal, or Gaussian distribution, with the parameters of the distribution coming from a hidden state matrix. This hidden matrix is the product of two non-negative matrices of rank k: M, the archetype matrix, of shape genes × k, and W, the weights matrix, of shape k × cells, where each column sums to 1. These two matrices are the outputs of UNCURL. The rank k can be manually set as an input parameter, or automatically determined using the gap score [30]. By default, k is set to 10, which tends to produce good results in practice, but can be changed interactively by merging or splitting clusters as discussed in more detail below.
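To make the shapes in this model concrete, the following is a minimal pure-Python sketch of the Poisson case only. It is not UNCURL's API: the function names `matmul` and `poisson_log_likelihood` and all toy values are hypothetical, and a real implementation would use an array library and an optimizer to fit M and W.

```python
import math

def matmul(M, W):
    """Multiply a genes x k matrix by a k x cells matrix (lists of rows)."""
    k = len(W)
    n_cells = len(W[0])
    return [[sum(row[a] * W[a][j] for a in range(k)) for j in range(n_cells)]
            for row in M]

def poisson_log_likelihood(X, M, W):
    """Log-likelihood of the count matrix X under Poisson(M @ W)."""
    hidden_state = matmul(M, W)
    total = 0.0
    for x_row, mu_row in zip(X, hidden_state):
        for x, mu in zip(x_row, mu_row):
            total += x * math.log(mu) - mu - math.lgamma(x + 1)
    return total

# Toy example (hypothetical values): 3 genes, k = 2 archetypes, 4 cells.
M = [[10.0, 1.0],   # archetype matrix, genes x k
     [1.0, 8.0],
     [5.0, 5.0]]
W = [[1.0, 0.8, 0.1, 0.0],   # weights matrix, k x cells; columns sum to 1
     [0.0, 0.2, 0.9, 1.0]]
X = [[10, 8, 2, 1],          # observed read counts, genes x cells
     [1, 2, 7, 8],
     [5, 5, 5, 5]]

hidden = matmul(M, W)        # hidden state matrix, genes x cells
```

Swapping the two rows of W assigns each cell to the wrong archetype, so `poisson_log_likelihood(X, M, [W[1], W[0]])` is lower than `poisson_log_likelihood(X, M, W)` on this toy data.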
The result of UNCURL is then used for dimensionality reduction and clustering. Dimensionality reduction is done using standard methods, such as tSNE [31] or UMAP [32]. This produces a two-dimensional scatterplot of cells. By default, clustering is done using argmax on the W matrix returned by UNCURL (as described in [12]). Each column in W represents the weights for each archetype in one cell, so the archetype with the maximum weight is the most likely cluster assignment for that cell. Clustering can also be done using the Louvain [33] or Leiden [34] community detection algorithms, which also use the W matrix as input. However, only clustering using UNCURL is compatible with iterative cluster refinement as detailed below.
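The default argmax assignment can be illustrated with a short sketch. The function name `argmax_clusters` and the toy weights are hypothetical; ties are broken here in favor of the lowest archetype index, which is an assumption rather than documented behavior.

```python
def argmax_clusters(W):
    """Assign each cell (a column of W) to the archetype with the largest weight."""
    k = len(W)
    n_cells = len(W[0])
    return [max(range(k), key=lambda a: W[a][j]) for j in range(n_cells)]

# Toy weights matrix (hypothetical): k = 3 archetypes, 4 cells.
W = [[0.7, 0.2, 0.1, 0.5],
     [0.2, 0.5, 0.1, 0.5],
     [0.1, 0.3, 0.8, 0.0]]

labels = argmax_clusters(W)  # one cluster index per cell
```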
In order to identify the most differentially expressed genes in each cluster, UNCURL-App uses one of two methods: the t-test, or the ratio of means. These metrics can either be calculated for one cluster against all other clusters, or against a single cluster. The t-test has been shown to be one of the best performing methods for identifying DE genes in scRNA-seq datasets, and is also much faster than more complex methods [35].
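The two scoring options can be sketched as follows. This is an illustration under stated assumptions, not the app's code: the text does not specify the t-test variant, so Welch's unequal-variance statistic is used here, and the small pseudocount `eps` in the ratio is likewise an assumption. All function names and data are hypothetical.

```python
import math
from statistics import mean, variance

def ratio_of_means(in_cluster, rest, eps=1e-8):
    """Mean expression inside the cluster divided by mean expression outside."""
    return (mean(in_cluster) + eps) / (mean(rest) + eps)

def welch_t(in_cluster, rest):
    """Welch's t-statistic for one gene (assumes nonzero variance in some group)."""
    se2 = variance(in_cluster) / len(in_cluster) + variance(rest) / len(rest)
    return (mean(in_cluster) - mean(rest)) / math.sqrt(se2)

def rank_genes(X, labels, cluster, score=ratio_of_means, other=None):
    """Rank gene indices for `cluster` vs. all other cells, or vs. cluster `other`."""
    scores = []
    for g, row in enumerate(X):
        inside = [x for x, l in zip(row, labels) if l == cluster]
        if other is None:
            outside = [x for x, l in zip(row, labels) if l != cluster]
        else:
            outside = [x for x, l in zip(row, labels) if l == other]
        scores.append((score(inside, outside), g))
    return [g for _, g in sorted(scores, reverse=True)]

# Toy genes x cells matrix (hypothetical) with two clusters of three cells each.
X = [[9, 11, 10, 1, 0, 2],   # gene 0: high in cluster 0
     [1, 0, 2, 8, 9, 10],    # gene 1: high in cluster 1
     [5, 4, 6, 5, 6, 4]]     # gene 2: similar in both
labels = [0, 0, 0, 1, 1, 1]

top_for_cluster_0 = rank_genes(X, labels, 0)                  # 1-vs-rest, ratio
top_pairwise = rank_genes(X, labels, 1, score=welch_t, other=0)  # cluster 1 vs. 0
```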
Interactive data analysis
UNCURL-App has the capacity to merge, split, or delete clusters of cells in an interactive fashion. After the initial analysis process is completed, there are often refinements to the clustering that users would like to make, no matter the quality of the initial clustering. For example, the user may want to split a large cluster, merge multiple similar clusters, or delete a group of poor quality cells or potential doublets. This cluster refinement might be based on the shape of the scatterplot, differential expression results, cell type queries, or some other metrics.
The user-driven changes in clustering are incorporated into UNCURL by using them to generate new initializations and then re-running the optimization process, as shown in Figure 3. This process fundamentally relies on the UNCURL algorithm [12], and was inspired by the UTOPIAN software for interactive nonnegative matrix factorization [36], but in UNCURL-App, cells take the place of documents. Suppose that we have matrices M and W, with shapes g × K and K × c. In order to split a selected cluster, we first run k-means with k = 2 on the cells assigned to the selected cluster. This generates new matrices Mcluster and Wcluster, of shape g × 2 and 2 × c, representing the means and cell cluster assignments. The column and row corresponding to the selected cluster are deleted from M and W, and Mcluster and Wcluster are appended to M and W, creating Mnew and Wnew, with shapes g × (K + 1) and (K + 1) × c. Then, UNCURL is re-run with Mnew and Wnew as the initializations, which affects other clusters as well. The process is analogous for merging clusters and assigning cells to new clusters: we create new initializations for M and W using the selected clusters or cells, and then re-run the optimization process.
After Mnew and Wnew are generated, the visualization, clustering, and differential expression results are all recalculated using Wnew, so re-clustering automatically updates the differential expression results.
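The split operation described in this section can be sketched in pure Python as follows. Several details are assumptions made for illustration: the deterministic k-means start, the hard 0/1 entries of Wcluster (zero for cells outside the selected cluster), and the absence of column renormalization. The function names are hypothetical, and the subsequent re-run of UNCURL with these initializations is not shown.

```python
def two_means(points, n_iter=20):
    """Plain k-means with k = 2 on a list of equal-length vectors."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # Deterministic start (an assumption): first point and the point farthest from it.
    centers = [points[0], max(points, key=lambda p: dist2(p, points[0]))]
    assign = [0] * len(points)
    for _ in range(n_iter):
        assign = [0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1
                  for p in points]
        for c in (0, 1):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign

def split_initialization(X, M, W, labels, selected):
    """Build M_new (g x (K+1)) and W_new ((K+1) x c) for splitting `selected`."""
    cells = [j for j, l in enumerate(labels) if l == selected]
    points = [[row[j] for row in X] for j in cells]  # one expression vector per cell
    centers, assign = two_means(points)
    n_cells = len(labels)
    # W_cluster: hard assignments, zero for cells outside the cluster (assumption).
    W_cluster = [[0.0] * n_cells for _ in range(2)]
    for j, a in zip(cells, assign):
        W_cluster[a][j] = 1.0
    # Delete the selected column of M and row of W, then append the two new ones.
    M_new = []
    for g, row in enumerate(M):
        kept = [v for a, v in enumerate(row) if a != selected]
        M_new.append(kept + [centers[0][g], centers[1][g]])
    W_new = [row for a, row in enumerate(W) if a != selected] + W_cluster
    return M_new, W_new

# Toy data (hypothetical): 2 genes, 6 cells, K = 2; cluster 0 mixes two groups.
X = [[10, 11, 0, 1, 0, 1],
     [0, 1, 10, 11, 0, 0]]
labels = [0, 0, 0, 0, 1, 1]
M = [[5.5, 0.5],
     [5.5, 0.0]]
W = [[1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]]

M_new, W_new = split_initialization(X, M, W, labels, selected=0)
```

In this toy case the selected cluster separates into cells {0, 1} and cells {2, 3}, and the resulting matrices have K + 1 = 3 archetypes.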
Examples
In order to validate the UNCURL-App workflow, we used the app to analyze three different scRNA-seq datasets, as described below. For these datasets, we performed clustering and cell type annotation using UNCURL-App with default settings. Cell type labels were generated by querying the top 50 genes by 1-vs-rest ratio with the CellMeSH database (see Methods).
The running times of the non-interactive steps are shown in Table 1. The running time for UNCURL scales linearly with the number of cells, while the running time for tSNE scales with order n log n, where n is the number of cells. With larger numbers of cells, running tSNE is the most time-consuming step. This can be obviated by using UMAP as the dimensionality reduction method.
Example: Tabula Muris lung cells
As a first example, we consider a subset of the Tabula Muris dataset from [7] containing only cell types found in the lung. This dataset contains 5449 cells and 14 annotated cell types. The labels in the original study were generated by first running graph-based clustering and then manually examining the marker genes for each cluster.
After uploading the dataset and processing it with default settings, we see the clustering and initial cell type assignments in Figure 4a. The clustering was based on UNCURL, and the scatterplot visualization was generated using tSNE. Based on the scatterplot, cluster 4 appeared to consist of at least two groups of cells that should not be grouped together. In addition, the top cell types from a CellMeSH query on the top genes in this cluster included both B and T cells (Figure 4b), suggesting that this cluster might be a mixture of at least these two cell types. Based on these observations, we decided to split this cluster using our interactive data analysis tools, resulting in the clusters given in Figure 4c. The post-split cell type assignments (Figure 4d) appeared to be more consistent with known biology than the original assignments.
Based on Figure 4f, there is generally good concordance between the generated clusters and original clusters, as well as between assigned labels and the original labels. Of the labels that were different, in most cases UNCURL-App assigned cell types that were closely related to the original ground truth label (for example, pneumocytes and columnar cells are subsets of epithelial cells, and neutrophils are a subset of leukocytes). The stromal cells are split into multiple clusters in UNCURL-App, which could represent heterogeneity in the original sample that was not captured by the original labels. No prior information about the cell types present in the dataset was used at any point in this process.
Example: 10X PBMCs
Next, we turned to a dataset of 8000 human peripheral blood mononuclear cells (PBMCs) from [37]. This dataset was created by randomly sampling 1000 cells from each of 8 scRNA-seq datasets consisting of cells that were flow-sorted based on known cell-type markers. Thus, the ground truth cell type labels represent pure samples, as opposed to the computational assignments used as ground truth in the other example datasets.
UNCURL-App was run with default settings to generate 10 initial clusters (Figure 5a). Looking at the resulting clusters and putative cell type assignments (Figure 5b), it appeared that clusters 2 and 6, labeled Neutrophils and Monocytes, were very similar, and could just represent a single group of cells. A pairwise differential expression analysis (Figure 5c) further illustrates that only two related genes, S100A8 and S100A9, appear to be significantly differentially expressed between these clusters. Plotting the expression levels of these genes (Figure 5d), it seems that the small group of cells to the left of the main cluster has much higher expression of these genes, suggesting that this group might constitute a separate cluster. Thus, we first merged clusters 2 and 6, and then split off that small group of cells. These operations resulted in the clustering shown in Figure 5f.
As with the previous dataset, there now is good correspondence with the ground truth clusters and labels (Figure 5h, i). Cells of the same ground truth type are generally assigned to the same cluster, and the cluster labels returned by CellMeSH generally correspond to the ground truth labels. CD34+ cells are generally recognized as hematopoietic stem cells [38], so the CellMeSH label here seems to be accurate. One major difference is that CellMeSH labeled all four T cell subtypes as “T-Lymphocytes”, even though they were placed in distinct clusters. To investigate further, we looked at the full list of CellMeSH labels for these clusters, not just the top one. These results are shown in Figure 5g, with the cell types most similar to the ground truth highlighted in green. For example, Cluster 0 corresponds to naive T-cells, which are selected as CD4+. Cluster 5 corresponds to naive cytotoxic T-cells, which are CD8+, and the “CD8+ T-Lymphocytes” label is the third highest label, below “T-Lymphocytes” and “Lymphocytes” (Figure 5g). Cluster 6 corresponds to memory T-cells, which can be either CD8+ or CD4+; the second and third labels are “CD8+ T-Lymphocytes” and “CD4-Positive T-Lymphocytes”. Cluster 7 corresponds largely to regulatory T-cells, which are CD4+, and the second and third highest CellMeSH labels are “CD4-Positive T-Lymphocytes” and “T-Lymphocytes, Regulatory”. This shows a good correspondence between the true and assigned labels at a more fine-grained level.
Example: SPLiT-seq spinal cord
For a final test, we turned to a larger dataset of 22,614 mouse spinal cord nuclei from 2- and 11-day-old mice sequenced using SPLiT-seq [9]. This dataset has 44 annotated cell types, which is substantially more than the previous two datasets. However, many of these annotated cell types are closely related (for example, there are 15 types of excitatory neurons), so for the “ground truth” comparisons in this section, we combine many of the annotated cell types into larger clusters of similar cells. Even after this process, many of the cell types are similar, with many subtypes of neurons.
We first ran UNCURL-App with default settings to generate an initial clustering with 10 clusters, several of which exhibit substantial heterogeneity (Figure 6a). For example, cluster 8 (labeled as “Endothelial Cells”) represents at least four different groups of non-neuronal cells. Thus, we split them into four different clusters (Figure 6b). It is clear that splitting the clusters worked to separate what appeared to be distinct cell types. In addition, the clusters that CellMeSH labels as “Neurons” or “Interneurons” (3, 6, 9, 0) all appear to be rather heterogeneous. Results after splitting some of the neuronal clusters are shown in Supplemental Figures 4-7.
As with the previous datasets, there is generally good concordance between the cell types from the original paper and the clusters generated by UNCURL-App, as shown in Figure 6d. Also similarly to previous datasets, the CellMeSH annotations were generally coarser grained than the original hand-annotated labels, with all of the neuron clusters being labeled as “Interneurons” or just “Neurons”. For the non-neuronal results, interpreting the labels identified by CellMeSH is more challenging (Supplemental Figure 5). Oligodendrocytes, astrocytes, and endothelial cells were correctly identified. For cluster 8, the ground-truth label was “VLMC”, or “vascular and leptomeningeal cells”. This is a highly specific category that does not appear in the CellMeSH ontology but was used as a cell label in Ref. [39]. Still, while coarse, the first three labels suggested by CellMeSH (Stromal Cells, Fibroblasts, Mesenchymal Stem Cells) seem consistent with cells derived from the meninges, the membrane enveloping the brain and spinal cord. In cluster 10, the ground-truth label “Ependymal” was not correctly identified by CellMeSH, and the returned results did not seem to relate to ependymal cells. This points to a paucity of annotated publications with gene markers for this cell type. For cluster 11, all of the top CellMeSH results were immune cells, a group to which the published label, “microglia”, belongs. “Microglia” was one of the top 10 cell types returned.
Discussion
Comparison with existing tools
Unlike other general-purpose toolkits for scRNA-seq analysis such as scanpy [24], Seurat [25], and Monocle 2 [40], UNCURL-App is a web-based GUI tool that does not require command line usage. This allows a much wider range of potential users, such as biologists who are not programmers. One comparable web-based tool is scQuery [23]. Both scQuery and UNCURL-App perform clustering and dimensionality reduction on uploaded single-cell datasets, and can identify cell types. With regard to the user interface, whereas UNCURL-App is a single-page application that presents all of its information on a single screen, scQuery has multiple views for different tasks. Unlike in scQuery, where cell type annotations are ultimately derived from scRNA-seq data from GEO, cell type annotations in UNCURL-App are based on the published scientific literature. UNCURL-App is also capable of interactively merging, splitting, and deleting clusters of cells, unlike scQuery.
There are a number of tools that classify cells given gene markers for known cell types, such as [21, 41]. We view these tools as complementary to UNCURL-App and CellMeSH. These tools require some knowledge of the cell types present in the dataset, as well as a way to manually find gene markers for these cell types, whereas such prior knowledge is unnecessary in the UNCURL-App/CellMeSH pipeline. In addition, CellMeSH can be used to improve the workflow for these tools by automatically selecting gene markers, obviating the need for manually finding them.
There also exist tools that perform single cell similarity search on reference datasets, such as CellAtlasSearch and scMatch [22, 42]. Rather than using marker genes, these methods compare the entire gene expression profile of every single cell to a reference database, using locality-sensitive hashing in the case of CellAtlasSearch [22] or Pearson or Spearman correlation in the case of scMatch [42]. These tools do not include functionality for clustering or low-dimensional visualization. The advantage of UNCURL-App comes with its integration of clustering, differential expression, interactive re-analysis, and cell type querying into one easy-to-use platform.
Conclusion
UNCURL-App provides a useful way to perform interactive scRNA-seq data analysis, including cell type annotation. In the future, we hope to augment UNCURL-App with new analysis capabilities, such as cell lineage and gene network analysis. We also hope to connect UNCURL-App to additional sources of information for cell type and functional annotation. This could come in the form of connections to new databases, or expansions to the CellMeSH database. Our ultimate goal is to increase UNCURL-App’s utility as a general tool for scRNA-seq analysis.
Methods
Cell type annotation
UNCURL-App has a number of interfaces to external databases, which are used to assist with identifying cell types present in the dataset, as well as helping to better understand underlying biological processes. First, UNCURL-App contains an interface to the Enrichr tool for gene set analysis [26, 43]. This tool contains interfaces to a variety of gene set databases that can be used to help identify cell function. We also provide an interface to Gene Ontology [28, 29], which is queried using the goatools package [44]. In addition, we have two databases specifically for cell type identification, CellMarker and CellMeSH.
CellMarker [27] is a hand-curated database of cell types, annotated with marker genes based on a literature search. This database consists of 673 cell types, where each cell type is associated with an average of 72 and a median of 9 marker genes. To search this database given a list of query genes, we use the hypergeometric test for the overlap between the query gene set and the marker genes. For each cell type c, the p-value is
$$p_c = \sum_{i=k_c}^{\min(n, K_c)} \frac{\binom{K_c}{i}\binom{N-K_c}{n-i}}{\binom{N}{n}}$$
where N is the total number of genes, n is the number of genes in the query set, K_c is the number of genes for the cell type, and k_c is the number of genes that overlap between the query and the cell type. This is the probability that, given that the query gene set is randomly sampled, the overlap is greater than or equal to the actual overlap. To find the top cell types for a query gene set, this p-value is calculated for all cell types, and the cell types are ranked in ascending order of p-value.
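The hypergeometric tail probability and the ranking step described above can be sketched with the standard library alone. The function names, the toy marker sets, and the background size N = 100 are hypothetical and are not taken from CellMarker; they only illustrate the calculation.

```python
from math import comb

def hypergeom_pvalue(N, n, K_c, k_c):
    """P(overlap >= k_c) when n of N genes are drawn at random and K_c are markers."""
    upper = min(n, K_c)
    numerator = sum(comb(K_c, i) * comb(N - K_c, n - i)
                    for i in range(k_c, upper + 1))
    return numerator / comb(N, n)

def rank_cell_types(query, markers, N):
    """Return (cell_type, p_value, overlapping genes), sorted by ascending p-value."""
    query = set(query)
    results = []
    for cell_type, genes in markers.items():
        overlap = query & set(genes)
        p = hypergeom_pvalue(N, len(query), len(genes), len(overlap))
        results.append((cell_type, p, sorted(overlap)))
    return sorted(results, key=lambda r: r[1])

# Toy marker sets (hypothetical, for illustration only).
markers = {
    "T cell": ["CD3D", "CD3E", "CD2"],
    "B cell": ["CD19", "MS4A1", "CD79A"],
}
query = ["CD19", "MS4A1", "GAPDH"]

ranked = rank_cell_types(query, markers, N=100)
```

For a quick hand check, with N = 10, n = 3, K_c = 4, and k_c = 2 the sum is (6·6 + 4·1)/120 = 1/3.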
CellMeSH is a new database that maps cell types to their associated genes. It was created by combining two existing literature indices: the MEDLINE citation index [45], which contains publication abstracts with associated metadata, and the gene2pubmed database [46], which contains a mapping of genes to publications. The key metadata from MEDLINE are the associated Medical Subject Headings, or MeSH terms [47], a subset of which represent cell types. For each cell type from MeSH, we found all publications where they occur, and all genes that occur in the same publications, thus creating an association between cell types and genes. This database contains 292 cell types with at least one associated gene. Searching this database can be done using a hypergeometric test. A query returns an ordered list of cell types sorted by relevance.
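The co-occurrence join behind this kind of database can be sketched as below. This is only an illustration of the idea: the input dictionaries stand in for parsed MEDLINE and gene2pubmed records, the publication IDs are made up, and counting publications per gene-cell type pair is an assumption about how associations might be weighted, not a description of the actual CellMeSH construction.

```python
from collections import Counter, defaultdict

def build_cell_type_gene_index(pub_to_cell_types, gene_to_pubs):
    """For each cell type, count the publications in which each gene co-occurs."""
    # Invert the gene -> publications mapping into publication -> genes.
    pub_to_genes = defaultdict(set)
    for gene, pubs in gene_to_pubs.items():
        for pub in pubs:
            pub_to_genes[pub].add(gene)
    index = defaultdict(Counter)
    for pub, cell_types in pub_to_cell_types.items():
        for cell_type in cell_types:
            index[cell_type].update(pub_to_genes.get(pub, ()))
    # Keep only cell types with at least one associated gene.
    return {ct: dict(counts) for ct, counts in index.items() if counts}

# Toy inputs (hypothetical publication IDs and annotations).
pub_to_cell_types = {
    101: ["T-Lymphocytes"],
    102: ["T-Lymphocytes", "B-Lymphocytes"],
    103: ["Neurons"],            # no gene is linked to this publication
}
gene_to_pubs = {"CD3E": [101, 102], "CD19": [102]}

index = build_cell_type_gene_index(pub_to_cell_types, gene_to_pubs)
```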
Implementation
UNCURL-App and the associated backend tools and databases are written in Python. The primary package is the uncurl-app package, which uses the Flask library as the server backend. Visualization is done in JavaScript using the plotly library [48]. The backend, which interfaces with the dimensionality reduction and differential expression methods, is provided by the uncurl-analysis package, and the databases are provided by the cellmarker and cellmesh packages.
Deployment
UNCURL-App has been tested to run on Ubuntu 16.04 and above, and can be deployed on a local or cloud server using Docker. We have created an example UNCURL-App deployment at https://uncurl.cs.washington.edu/. This deployment limits its upload size to 100MB.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
Conceptualized the tool: YZ, SM1, SM2, SK, GS. Implemented the application: YZ. Implemented the database: SM1, YZ. Wrote and edited the paper: YZ, SM1, SM2, SK, GS.
Additional Files
Supplemental Figures
1. Data upload interface
2. UNCURL-App preprocessing options
3. Custom cell selection interface
4. Split-seq spinal cord scatterplot after splitting out neuronal clusters
5. Top CellMeSH cell types for split-seq spinal cord data
6. Heatmap comparing UNCURL-App clusters with published labels (coarse-grained)
7. Heatmap comparing UNCURL-App clusters with published labels (fine-grained)
Acknowledgements
We thank Li Liu, Tao Peng, and Matthew Hirano for testing UNCURL-App and suggesting new features. This work was supported by NIH R01HG009136 and R01HG009892 to G.S.