
One Cell At a Time: A Unified Framework to Integrate and Analyze Single-cell RNA-seq Data

Chloe Wang, Lin Zhang, Bo Wang
doi: https://doi.org/10.1101/2021.05.12.443814
Chloe Wang
1University Health Network;
Lin Zhang
2University of Toronto
Bo Wang
1University Health Network;
  • For correspondence: bo.wang@uhnresearch.ca

1 Abstract

The surge of single-cell RNA sequencing (scRNA-seq) technologies has made available large datasets at the scale of hundreds of thousands of single cells. Integrative analysis of large-scale scRNA-seq datasets has the potential to reveal de novo cell types and to aggregate biological information. However, most existing methods fail to integrate multiple large-scale scRNA-seq datasets in a computationally and memory-efficient way. We hereby propose OCAT, One Cell At a Time, a graph-based method that sparsely encodes single-cell gene expression to integrate data from multiple sources without most variable gene selection or explicit batch effect correction. We demonstrate that OCAT efficiently integrates multiple scRNA-seq datasets and achieves state-of-the-art performance in cell-type clustering, especially in challenging scenarios with non-overlapping cell types. In addition, OCAT facilitates a variety of downstream analyses, such as gene prioritization, trajectory inference, pseudotime inference and cell inference. OCAT is a unifying tool to simplify and expedite single-cell data analysis.

2 Introduction

Rapid advances in single-cell transcriptome sequencing (scRNA-seq) have led to exponential growth in the number of large-scale scRNA-seq datasets. Integrating multiple scRNA-seq datasets from different studies has great potential to facilitate the identification of both common and rare cell types, as well as de novo cell groups. Data heterogeneity, or batch effect, is one of the biggest challenges when integrating multiple scRNA-seq datasets. Batch effect is the perturbation in measured gene expression, often introduced by factors such as library preparation, sequencing technologies and donors. It is therefore likely to be confounded with the true biological signals that embody cell identities, resulting in cells being grouped by experiment rather than by true cell type. Batch effect removal thus becomes a mandatory step prior to data integration, introducing additional computational cost. Most existing batch effect removal procedures assume that the biological effect of cell types is orthogonal to the batch effect, which is unlikely to hold in practice. Moreover, as the scale of the datasets increases, integrating multiple large-scale scRNA-seq datasets can impose a heavy, sometimes prohibitive, computational and memory burden.

Most existing scRNA-seq integration methods require batch removal steps. One of the most commonly used approaches is mutual nearest neighbors (MNNs) [3], which requires paired cells (or MNNs) to align the datasets into a shared space. However, this approach demands large run-time memory and long computation time to search for MNNs in the high-dimensional space of gene expression. Though some derivatives of the MNN method [6, 15] attempt to improve memory efficiency by performing dimension reduction in the gene expression space, the memory usage is still demanding when the number of single cells is large. Another common approach, Seurat [21], projects scRNA-seq data to a canonical correlation analysis (CCA) subspace, and then computes MNNs in the CCA subspace to correct the batch effect. Harmony [7], on the other hand, iteratively removes batch effects after projecting scRNA-seq data to a principal component analysis (PCA) subspace. However, Harmony can also consume a large amount of memory when the sample size is large. To reduce the computational burden of batch effect correction on scRNA-seq integration, we hereby propose OCAT (One Cell At a Time), a fast and memory-efficient machine learning-based method that does not require explicit batch effect removal when integrating multiple scRNA-seq datasets. OCAT utilizes sparse encoding to integrate multiple heterogeneous scRNA-seq datasets, achieving state-of-the-art or comparable performance relative to existing methods.

OCAT offers three major advantages over existing methods. First, OCAT identifies hypothetical “ghost” cells in each dataset and constructs a sparse bipartite graph between each cell and the “ghost” cells, generating a sparsified encoding of each cell optimized for computational efficiency (O(N)). Second, by connecting each individual cell to the “ghost” cell collection from all datasets, OCAT captures the global similarity structure between single cells, and thus does not require any batch removal step. Third, the OCAT sparse graph encoding can be effectively transformed into cell feature representations that readily tackle a wide range of downstream analysis tasks, providing a unified solution to common single-cell problems such as gene prioritization, trajectory inference, pseudotime inference and cell inference.

3 Results

3.1 The OCAT framework overview

OCAT integrates multiple large-scale scRNA-seq datasets using sparse encoding as the latent representation of the single-cell gene expression. Given multiple scRNA-seq gene expression matrices as input, OCAT first identifies hypothetical “ghost” cells, the centers of local neighbourhoods, from each dataset. OCAT then constructs a bipartite graph between all single cells and the “ghost” cell set, with similarities as edge weights. OCAT next uses Local Anchor Embedding (LAE) [9] to obtain a set of sparsified weights that best reconstruct each single cell from only its most similar “ghost” cells, which further amplifies strong connections. OCAT lastly captures the global cell-to-cell similarities through message passing between the “ghost” cells, which maps the sparsified weights of all single cells to the same global latent space. The sparsified weights are treated as the sparse encoding of each single cell.

As the number of most similar “ghost” cells is far smaller than the number of genes, the OCAT latent representation is very sparse. We show that this sparse encoding can effectively facilitate downstream analyses, such as cell-type clustering, differential gene selection, trajectory and pseudotime inference, and cell-type inference. Moreover, the OCAT sparse encoding is also capable of clustering spatial transcriptomics data. Figure 1 outlines the workflow of the OCAT integration procedures as well as the various downstream analysis functionalities.

Figure 1: Schematic workflow of OCAT.

OCAT identifies “ghost” cells in each individual scRNA-seq dataset, and encodes the gene expression of all single cells by their similarities to the “ghost” cell collection. The encoding can then facilitate downstream analysis tasks, such as cell clustering, gene prioritization, trajectory inference and spatial scRNA-seq clustering.

3.2 Sparse encoding of single-cell transcriptomics effectively corrects batch effect and integrates multiple scRNA-seq datasets

When integrating multiple heterogeneous scRNA-seq datasets, most existing integration methods require iterations of explicit batch effect correction steps between every pair of datasets. Another common assumption in scRNA-seq data integration is that cell types are shared across all the datasets, which is rarely true in real life. Such requirements and assumption pose major challenges in accuracy as well as computational efficiency on most existing integration methods. By capturing the global cell-to-cell similarity across datasets, OCAT does not require any explicit batch effect correction step and proves to be robust in identifying non-overlapping cell types unique to some datasets. The sparsified encoding also greatly accelerates the computational speed and considerably reduces the memory usage when integrating multiple large-scale datasets.

One common assumption of existing integration methods is that any cell type present in one dataset also exists in all the other datasets. When some cell types do not overlap across datasets, such methods can falsely merge them. For example, Seurat performs most variable gene selection between every two datasets to cluster similar gene expression profiles. We here demonstrate how this assumption introduces misclassification in the presence of non-overlapping cell types. The human dendritic dataset [26] consists of human blood dendritic cells (DCs), namely CD1C DCs, CD141 DCs, plasmacytoid DCs (pDCs) and double negative cells. The authors of [22] further processed the data and manually split it into two batches: batch 1 contains 96 pDC, 96 double negative and 96 CD141 cells, while batch 2 has 96 pDC, 96 double negative and 96 CD1C cells. CD141 cells are thus present only in batch 1, and CD1C cells only in batch 2. The visualization of cell type clustering in Figure 3 shows that Seurat, Harmony and Scanorama all falsely group CD141 and CD1C cells together, whereas OCAT distinguishes them as two separate clusters. This indicates that, through the bipartite graph between single cells and “ghost” cells, the OCAT sparse encoding recovers global cell-to-cell similarity across batches. The cell type clustering metrics reflect the same result: OCAT attains $\mathrm{NMI}_{\text{cell type}} = 0.7636$, higher than all the other benchmarked methods; see Table 1 for a detailed comparison.

Table 1: Cell type clustering and batch correction performance on integrating multiple scRNA-seq datasets.

The clustering performance is measured by the Normalized Mutual Information (NMI), where $\mathrm{NMI}_{\text{cell type}} = 1$ implies that all the cells are correctly clustered. $1 - \mathrm{NMI}_{\text{batch}} = 1$ (that is, $\mathrm{NMI}_{\text{batch}} = 0$) implies that all batch effects are corrected.
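The two scores in Table 1 can be illustrated with a small sketch. It computes NMI between two label vectors in plain Python; the arithmetic-mean normalization is an assumption, since the text does not state which NMI variant is used, and the function name `nmi` is hypothetical. The cell type score compares cluster assignments with annotated cell types, and the batch score is one minus the NMI between cluster assignments and batch labels.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information between two label lists.

    Assumption: mutual information is divided by the arithmetic mean of
    the two entropies; other normalizations exist and give other values.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_joint = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # Mutual information from the joint and marginal label frequencies.
    mi = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_joint.items()
    )
    denom = (entropy(count_a) + entropy(count_b)) / 2
    # Two constant labelings carry no information; treat them as identical.
    return mi / denom if denom > 0 else 1.0


clusters = [0, 0, 1, 1]
cell_types = ["pDC", "pDC", "CD141", "CD141"]  # clusters match cell types
batches = [1, 2, 1, 2]                         # batches independent of clusters
nmi_cell_type = nmi(clusters, cell_types)
batch_score = 1 - nmi(clusters, batches)
```

In this toy case the clusters reproduce the cell types up to renaming, and the batch labels are statistically independent of the clusters.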

We then demonstrate the performance and efficiency of OCAT on integrating more than two heterogeneous scRNA-seq datasets. The pancreatic dataset consists of five human pancreatic scRNA-seq datasets sequenced with four different technologies (inDrop [1], CEL-Seq2 [12], Smart-Seq2 [18], SMARTer [28, 30]). Datasets generated by different sequencing platforms and technologies have inherent technical differences [5, 25], which makes this integration task challenging because the distributions of gene expression vary significantly across the five datasets. Another challenge lies in the computational cost and memory consumption of integrating five datasets, caused by the iterative batch correction process over a large number of cells with high feature dimension. Nevertheless, OCAT outperforms the other methods in correctly identifying the cell types without any most variable gene selection or batch removal steps. Following the data pre-processing procedures outlined in [22], we integrate the five pancreatic datasets using OCAT, and benchmark against three existing integration methods, Seurat [21], Harmony [7] and Scanorama [6]. The visualization in Figure 2 shows that OCAT outperforms the other methods in cell type clustering ($\mathrm{NMI}_{\text{cell type}} = 0.8037$) while maintaining comparable batch correction ($1 - \mathrm{NMI}_{\text{batch}} = 0.9638$); see Table 1 for details. Figure 2C shows that OCAT is more computationally and memory efficient than the other benchmarked methods. Notably, OCAT takes less than half of the runtime of Seurat and Scanorama. Though Harmony runs slightly faster than OCAT, it requires four times more memory than OCAT. Seurat and Scanorama, on the other hand, require more than eight times as much memory as OCAT.

Figure 2: Integration of multiple scRNA-seq datasets.

A: UMAP projection of the scRNA-seq gene expression features produced by OCAT, Seurat, Harmony and Scanorama on integrating five pancreatic scRNA-seq datasets. Cells in the top panel are colored by annotated cell type and cells in the bottom panel by batch label. B: UMAP projection of the scRNA-seq sparsified embeddings by OCAT on the PBMC dataset. C: Memory usage and runtime of OCAT, Seurat, Harmony and Scanorama on integrating four benchmarking scRNA-seq datasets.

Figure 3: OCAT integration on human dendritic dataset with non-overlapping cell types.

We also validate the performance of OCAT on integrating mouse cell atlas [4], human peripheral blood mononuclear cell (PBMC) [30] and mouse hematopoietic stem and progenitor cell [13] datasets. OCAT achieves state-of-the-art or comparable performance relative to the other benchmarked methods; see Table 1 and Supplementary Figure S1 for details. Notably, when integrating the two PBMC datasets with a total of 15,476 single cells and 33,694 genes, OCAT is twice as fast as Seurat and three times as fast as Scanorama. In addition, Harmony and Seurat consume more than 29 times as much memory as OCAT, while Scanorama consumes more than 24 times as much.

3.3 OCAT unifies various downstream biological inferences

The sparse encoding framework of OCAT can readily extract the latent representation of cells in individual scRNA-seq datasets. We show in this section that sparse encoding can effectively facilitate various downstream analyses, such as cell inference, feature gene selection, trajectory inference and pseudotime inference. Further, we show that OCAT can also extract sparse latent representations of spatial scRNA-seq datasets.

3.3.1 Sparse encoding of individual scRNA-seq dataset

We first demonstrate with four large-scale scRNA-seq datasets [31, 11, 19, 32] that the sparse encoding produced by OCAT achieves state-of-the-art performance in clustering cell types compared with existing scRNA-seq clustering methods [27, 21, 10]; see Figure 4 and Table 2 for the clustering performance details. OCAT is also capable of sparsely encoding spatial scRNA-seq data by treating the spatial coordinates as additional feature representations of the single cells. We show in Figure 4H the identification of cell types using the OCAT sparse encoding on the sagittal mouse brain spatial scRNA-seq dataset [21].

Figure 4: OCAT on individual scRNA-seq datasets.

A-D: UMAP projection of the sparsified embedding generated by OCAT using the Zeisel, Macosko, Retina and PBMC 68k datasets. Single cells are color-coded by their annotated cell types. E: Gene prioritization on the Zeisel dataset. F-G: Trajectory inference and pseudotime inference on the HSMM dataset. H: Spatial scRNA-seq clustering using OCAT sparsified embeddings.

Table 2: Clustering results of OCAT benchmarked with existing methods on four individual scRNA-seq datasets: Zeisel, Macosko, Retina and PBMC 68k.

The clustering performance is measured by the Normalized Mutual Information (NMI), where NMI = 1 implies correctly clustering all the cells with the same cell types while NMI = 0 indicates random guessing.

3.3.2 Cell inference

OCAT supports immediate and accurate cell type inference of new incoming data based on existing scRNA-seq data, without re-running the feature extraction procedures on the entire combined dataset. We denote the existing scRNA-seq datasets as the “reference” dataset, and the incoming unlabelled data as the “inference” dataset. When the “inference” dataset comes in, OCAT performs sparse encoding immediately based on existing “ghost” cells previously identified in the “reference” dataset. OCAT then “trains” a Support Vector Machine (SVM) [14] on the “reference” cells to assign cell type labels to the incoming “inference” cells.

We first demonstrate the performance of cell type inference in individual scRNA-seq datasets. On the Zeisel, Macosko, Retina and PBMC 68k individual scRNA-seq data, each dataset is randomly split into 90% “reference” set and 10% “inference” set. The OCAT-encoded features of the “inference” set based on the “reference” set yield highly competitive cell-type assignment performance. We then test the OCAT cell inference method on the more challenging scenario of integrating two batches of scRNA-seq datasets using the mouse atlas and PBMC datasets. The two batches of each dataset, batch 1 and batch 2, are split into 90% “reference” set and 10% “inference” set. In the PBMC dataset, OCAT assigns cell types to the 10% “inference” set from batch 1 based on the 90% “reference” set from batch 2, and vice versa, achieving NMI of 0.8288 and 0.8007, respectively; see Supplementary Figure S2 for details.

3.3.3 Feature gene selection

OCAT effectively selects the marker genes for each cell group identified from cell-type clustering. Feature gene selection is one of the most common approaches to facilitate cell type annotations. OCAT directly identifies the most variable genes based on the raw gene expression data with respect to the clusters.

We demonstrate the efficacy of OCAT in feature gene selection using the Zeisel dataset [31] that classifies 9 cell types in the mouse somatosensory cortex and hippocampal CA1 region. Figure 4E plots the top 5 marker genes for each cell type. OCAT manages to replicate the marker gene findings reported by [31], for example, Gad1 and Gad2 genes for Interneuron cells, and Acta2 gene for mural cells. We further compare the top feature genes identified by OCAT with those identified by Seurat v3, and show the top selected genes are highly consistent. For example, for the CA1 cell type, OCAT identifies Crym, Cpne6, Neurod6, Gria1 and Wipf3 as the top 5 genes, and four of them are also in the top 5 feature genes selected by Seurat. See Supplementary Table S1 and Supplementary Figure S3 for the full comparison.

3.3.4 Trajectory and pseudotime inference

OCAT further applies the sparse encoding framework to reconstruct the developmental trajectory and pseudotime of cells from their transcriptomic profiles. In most cell populations, there exists a gradient of differentiation underlying the process of cell renewal, from progenitor cells to the terminally differentiated cell types. Based on similarities in gene expressions, trajectory and pseudotime analyses infer the differentiation status of the cell types as well as individual cells. Trajectory inference first maps out the developmental lineages from the least differentiated to most differentiated cell types. Pseudotime analysis then orders the individual cells along the predicted lineages and assigns each cell a pseudotime, indicating its time stamp in the process of differentiation.

OCAT extracts a reduced “ghost” neighbourhood graph between cell types by aggregating cell-to-cell similarities in each cluster. OCAT then infers the lineages by constructing the minimal spanning tree [8] over the aggregated “ghost” neighbourhood graph that connects all the cell types. The least differentiated cell type is considered as the root cluster, which determines the unique directionality of the inferred lineages; see Section 5 for details. Lastly, to compute the pseudotime of each cell, OCAT appoints the least differentiated cell in each “ghost” neighbourhood as the root cell. Traversing down the lineages, OCAT uses the root cell as the point of reference in each local neighbourhood to assign pseudotime to individual cells.

We validate the performance of OCAT trajectory and pseudotime inference on the human skeletal muscle myoblast (HSMM) dataset [24]. The HSMM dataset contains time-series RNA-seq data outlining the early stages of myogenesis. The 271 myoblast cells were collected at 0, 24, 48 and 72 hours of differentiation, with gold-standard annotations based on known gene markers [23, 24]. We observe the differentiation trajectory from myoblasts to intermediate cells, followed by three separate branches into myotubes, fibroblasts and undifferentiated cells, respectively. The inferred trajectory is consistent with the known biology of myotube formation. Fibroblasts and undifferentiated cells represent the two cell groups that exit the differentiation cycle prior to myotube formation [23]. The pseudotime assigned by OCAT is highly correlated with the collection time stamps, with a Pearson correlation of 0.8743 by annotated cell type group. Additionally, following the procedures in [17], we compare OCAT with Slingshot [20], PAGA Tree [29] and Monocle ICA [16] on trajectory inference with 28 gold-standard real datasets using the dynverse R package. OCAT is competitive both in grouping neighbourhoods of highly similar cells and in assisting the downstream task of identifying important genes specific to the trajectory; see Supplementary Figure S4 for details.

4 Discussion

In this work, we present OCAT, a fast and memory-efficient integration tool for scRNA-seq datasets that utilizes sparse encoding as the latent representation of cell features. Through hypothetical “ghost” cells, the sparse encoding of OCAT captures the global similarity between single cells across multiple datasets. We demonstrate that, without any batch effect correction, the sparse encoding of OCAT manages to separate biological differences between the cells from batch effects, achieving state-of-the-art or comparable performance with existing methods on identifying cell types.

Unlike most existing methods, OCAT does not rely on most variable gene selection to discriminate biological cell groups, which preserves the identities of non-overlapping cell types unique to some datasets and has the potential to facilitate the discovery of de novo cell groups. Furthermore, OCAT is computationally and memory efficient in integrating large-scale scRNA-seq datasets. Through its sparse encoding of gene expression, OCAT can scale up to integrate scRNA-seq datasets with a large number of single cells, a large number of genes or a large number of datasets.

OCAT is also applicable to analyzing individual scRNA-seq datasets, as well as spatial scRNA-seq data. We show that OCAT achieves state-of-the-art performance in identifying cell type groups on single scRNA-seq datasets. Moreover, the sparse encoding of OCAT effectively facilitates downstream analyses such as gene prioritization, trajectory inference, pseudotime inference and cell inference. With additional biological priors, OCAT has great potential to better facilitate downstream analyses and to extend to more complex tasks such as cell-to-cell communication network inference, which we will explore in future work.

OCAT is freely available at https://github.com/bowang-lab/OCAT.

5 Online methods

5.1 The OCAT framework overview

OCAT employs sparse encoding of the latent representations of single-cell transcriptomics. Given $N$ single cells, each with $M$ gene expression values, OCAT first identifies $m \ll M$ “ghost” cells and connects each individual cell with the ghost cells through a bipartite graph, where the weights of the edges are treated as the encoding. As $m \ll M$, the OCAT encoding is very sparse and computationally fast for large-scale datasets. The sparse encoding can then be deployed to find the similarities between cells within the same dataset as well as across multiple datasets. The similarities between cells can facilitate downstream analyses such as cell clustering, trajectory inference and gene prioritization. Moreover, the OCAT sparse encoding is also capable of clustering spatial transcriptomics data. Figure 1 depicts the sparse encoding procedures of OCAT and the latent representations for various downstream tasks given two input scRNA-seq datasets. In the next sections, we outline the OCAT algorithms in detail.

5.2 Data pre-processing

Given an $N \times M$ gene expression matrix $R$, where $N$ is the number of single cells and $M$ is the number of genes, OCAT first pre-processes the raw gene expression data by log-transforming each entry $r_{ij}$ in $R$ Embedded Image and normalizes the log-transformed expression to Embedded Image for $i = 1, \dots, N$ and $j = 1, \dots, M$. The normalized gene expression matrix is denoted as $X \in \mathbb{R}^{N \times M}$, where $x_i = (x_{i1}, x_{i2}, \dots, x_{iM})^T$ is the $M \times 1$ normalized gene expression vector of cell $i$.
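The exact transform and normalization formulas are not legible in this version of the text, so the following is only a sketch of one common choice: a log(1 + r) transform followed by scaling each cell to unit L2 norm. Both choices, and the function name `preprocess`, are assumptions made for illustration and may differ from what OCAT actually does.

```python
import math


def preprocess(counts):
    """Log-transform and row-normalize an N x M count matrix (list of lists).

    Assumptions (not confirmed by the text): log(1 + r) transform and
    unit L2 norm per cell.
    """
    normalized = []
    for row in counts:
        logged = [math.log1p(r) for r in row]
        norm = math.sqrt(sum(v * v for v in logged))
        # Leave all-zero cells as zeros instead of dividing by zero.
        normalized.append([v / norm for v in logged] if norm > 0 else logged)
    return normalized


X = preprocess([[0, 3, 4], [0, 0, 0]])
```

Each row of `X` plays the role of a normalized expression vector for one cell.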

5.3 Dimension reduction of gene expression matrix

To efficiently encode the transcriptomics of the single cells, OCAT further reduces the dimension of the normalized gene expression matrix $X$. OCAT adopts the online Fast Similarity Matching (FSM) algorithm [2] that projects each $x_i$ from $\mathbb{R}^M$ to $\mathbb{R}^d$ such that Embedded Image where $y_i$ is the $d \times 1$ feature vector for cell $i$. OCAT adopts $Y \in \mathbb{R}^{N \times d}$ as the transcriptomic feature representation that will facilitate the construction of the sparsified bipartite graph.

Note that although a vast collection of dimension reduction methods is available, online FSM, with a complexity of $O(NMd)$, is much more efficient than traditional principal component analysis (PCA), whose complexity is $O(M^2 N + M^3)$. PCA is offered as an alternative option in the OCAT software package.
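As a rough worked comparison, the size of the combined PBMC data reported in Section 3.2 ($N = 15{,}476$ cells, $M = 33{,}694$ genes) can be plugged into both complexity expressions. The target dimension $d = 50$ is an assumed value chosen only for illustration, and constant factors are ignored, so the result indicates an order of magnitude rather than a measured speed-up:

```latex
\begin{align*}
NMd &= 15{,}476 \times 33{,}694 \times 50 \approx 2.6 \times 10^{10}, \\
M^2 N + M^3 &\approx 1.8 \times 10^{13} + 3.8 \times 10^{13} \approx 5.6 \times 10^{13}, \\
\frac{M^2 N + M^3}{NMd} &\approx 2 \times 10^{3}.
\end{align*}
```

Under these assumptions the PCA operation count is about three orders of magnitude larger, driven by the $M^3$ term when the number of genes is large.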

5.4 Sparsified bipartite graph for single-cell transcriptomics

5.4.1 Identifying “ghost” cells

We introduce the idea of “ghost” cells, which are imaginary cells that characterize the transcriptomics of the real single cells. OCAT identifies $m$ ghost cells that are the K-Means cluster centers of $Y$, and denotes their features as $u_j$, for $j = 1, 2, \dots, m$. We then construct a sparsified bipartite graph $G = (V, U, E)$ between the single cells and the ghost cells, where each node $v_i$ represents the feature $y_i$ of the $i$th single cell.
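The ghost cell step can be sketched with a plain Lloyd's K-Means loop over the reduced features. This is a minimal illustration in pure Python rather than the OCAT implementation: the function name `kmeans_centers`, the fixed iteration cap and the initialization from the first $m$ points are all assumptions made to keep the example short and deterministic.

```python
import math


def kmeans_centers(points, m, n_iter=20):
    """Plain Lloyd's k-means; returns m cluster centers ("ghost" cells).

    Initialization with the first m points is a simplification for
    reproducibility, not something the text specifies.
    """
    if not 0 < m <= len(points):
        raise ValueError("m must be between 1 and the number of points")
    centers = [list(p) for p in points[:m]]
    for _ in range(n_iter):
        # Assignment step: nearest center by Euclidean distance.
        groups = [[] for _ in range(m)]
        for p in points:
            j = min(range(m), key=lambda k: math.dist(p, centers[k]))
            groups[j].append(p)
        # Update step: each center moves to the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[k]
            for k, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


Y = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
ghosts = kmeans_centers(Y, m=2)
```

The returned centers need not coincide with any real cell, which matches the description of ghost cells as imaginary cells summarizing local neighbourhoods.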

5.4.2 Construct sparsified bipartite graph

Our next goal is to construct the sparsified bipartite graph between the single cells and the ghost cells. For single cell $i$, the weights $z_i = (Z_{i1}, Z_{i2}, \dots, Z_{im})^T$ form an $m \times 1$ vector such that Embedded Image

For single cell $i$, OCAT first identifies its $s$ closest ghost cells, those with the top $s$ cosine similarity values, and denotes their indices as $\langle i \rangle \in [1 : m]$. OCAT then optimizes the edge weights between cell $i$ and its $s$ neighbor ghost cells, $z_{\langle i \rangle}$, using Local Anchor Embedding (LAE) [9] by Embedded Image and $U_{\langle i \rangle} = \{u_k\}_{k \in \langle i \rangle}$ are the features of the $s$ neighbor ghost cells. The edge weights of cell $i$ to all the ghost cells are thus denoted as Embedded Image, and Embedded Image, and the collection of all the edge weights connecting the $N$ single cells to the $m$ ghost cells is denoted as $Z \in \mathbb{R}^{N \times m}$.
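The two steps above (neighbor selection, then weight optimization) can be sketched as follows. The sketch assumes the usual LAE objective, namely reconstructing the cell from its $s$ neighbor ghost cells by least squares with non-negative weights that sum to one. It uses plain projected gradient descent in place of the accelerated scheme of [9], and the names `encode_cell` and `project_to_simplex` are hypothetical.

```python
import math


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0


def project_to_simplex(v):
    """Euclidean projection onto {z : z >= 0, sum(z) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, uk in enumerate(u, start=1):
        cumulative += uk
        t = (cumulative - 1.0) / k
        if uk - t > 0:
            theta = t  # the last k that satisfies the condition wins
    return [max(x - theta, 0.0) for x in v]


def encode_cell(y, ghosts, s, n_iter=200):
    """Return an m-vector of edge weights, nonzero only on the s nearest ghosts."""
    m = len(ghosts)
    if not 0 < s <= m:
        raise ValueError("s must be between 1 and the number of ghost cells")
    # Step 1: indices of the s ghost cells with the highest cosine similarity.
    nearest = sorted(range(m), key=lambda j: cosine(y, ghosts[j]), reverse=True)[:s]
    anchors = [ghosts[j] for j in nearest]
    # Step 2: minimize 0.5 * ||sum_k z_k * anchor_k - y||^2 over the simplex.
    # The step size uses the trace bound on the gradient's Lipschitz constant.
    step = 1.0 / max(sum(x * x for a in anchors for x in a), 1e-12)
    z = [1.0 / s] * s
    for _ in range(n_iter):
        recon = [sum(z[k] * anchors[k][d] for k in range(s)) for d in range(len(y))]
        resid = [r - t for r, t in zip(recon, y)]
        grad = [sum(rd * ad for rd, ad in zip(resid, a)) for a in anchors]
        z = project_to_simplex([zk - step * g for zk, g in zip(z, grad)])
    # Scatter the s optimized weights back into a sparse m-vector.
    weights = [0.0] * m
    for k, j in enumerate(nearest):
        weights[j] = z[k]
    return weights


ghost_cells = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
w = encode_cell([1.0, 0.0], ghost_cells, s=2)
```

A cell that coincides with one ghost cell ends up with essentially all of its weight on that ghost cell, and ghost cells outside the $s$ nearest always receive weight zero, which is what makes the encoding sparse.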

5.4.3 Message passing between single cells

To infer the transcriptomic similarity between single cells, a common approach is to compute the adjacency matrix $W$ between the cells. However, when the number of single cells, $N$, is large, storing an $N \times N$ adjacency matrix consumes significant memory. Instead of computing cell-to-cell similarity directly, OCAT infers it through the single cell to ghost cell edge weights, $Z$, and the similarities between ghost cells, $Z_G \in \mathbb{R}^{m \times m}$. The similarity between ghost cells is defined as Embedded Image

We then standardize $Z_{\text{ghost}}$ by Embedded Image where $D$ is a diagonal matrix with Embedded Image

The normalized ghost cell to ghost cell similarity, $Z_G$, is an $m \times m$ “ghost field” that transmits messages between single cells. Lastly, we obtain the refined sparse embeddings $Z_W$ for the single cells through message passing by Embedded Image
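The message-passing formulas are not legible in this version of the text. The sketch below therefore assumes one plausible set of forms in the spirit of anchor-graph constructions: $Z_{\text{ghost}} = Z^T Z$, $Z_G = D^{-1/2} Z_{\text{ghost}} D^{-1/2}$ with $D$ holding the row sums of $Z_{\text{ghost}}$, and $Z_W = Z Z_G$. These forms and the function name `message_passing` are assumptions, not the confirmed OCAT equations. What the sketch does illustrate reliably is the memory argument: only $N \times m$ and $m \times m$ arrays are ever formed, never an $N \times N$ matrix.

```python
import math


def message_passing(Z):
    """Refine an N x m weight matrix Z through an m x m ghost-to-ghost matrix.

    Assumed forms (not given in the text): Z_ghost = Z^T Z,
    Z_G = D^(-1/2) Z_ghost D^(-1/2) with D_jj = row sum of Z_ghost,
    and Z_W = Z Z_G.
    """
    n, m = len(Z), len(Z[0])
    # m x m ghost-to-ghost similarity accumulated over all cells.
    z_ghost = [[sum(Z[i][a] * Z[i][b] for i in range(n)) for b in range(m)]
               for a in range(m)]
    inv_sqrt = [1.0 / math.sqrt(s) if s > 0 else 0.0
                for s in (sum(row) for row in z_ghost)]
    # Symmetric normalization by the square roots of the row sums.
    z_g = [[inv_sqrt[a] * z_ghost[a][b] * inv_sqrt[b] for b in range(m)]
           for a in range(m)]
    # Each cell's weights are passed through the m x m "ghost field".
    return [[sum(Z[i][a] * z_g[a][b] for a in range(m)) for b in range(m)]
            for i in range(n)]


Z = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
Z_W = message_passing(Z)
```

In this toy example the third cell links the two ghost cells, so the first cell's refined embedding gains a small weight on the second ghost cell even though its original weight there was zero.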

5.5 Integration of multiple scRNA-seq datasets

OCAT can easily integrate multiple gene expression datasets thanks to the design of the sparsified bipartite graph. Without loss of generality, suppose we have two scRNA-seq datasets to integrate, with $N_1$ and $N_2$ single cells, respectively, and $M$ common genes. Each individual dataset first undergoes the same pre-processing and dimension reduction steps outlined in Sections 5.2 and 5.3, and yields Embedded Image for dataset 1 and Embedded Image for dataset 2.

OCAT then identifies $m_1$ ghost cells from $X_1$ with features Embedded Image, and $m_2$ ghost cells from $X_2$ with features Embedded Image. For the $i$th individual cell, OCAT identifies the $s_1$ closest ghost cells with indices $\langle i_1 \rangle$ from the first ghost cell set and the $s_2$ closest ghost cells with indices $\langle i_2 \rangle$ from the second set. Within the first ghost cell set, OCAT obtains the optimized weights $z_{\langle i_1 \rangle}$ such that Embedded Image

Similarly, OCAT obtains the optimized weights $z_{\langle i_2 \rangle}$ for the second ghost cell set. The weights of the edges connecting the $i$th single cell to all the ghost cells are then denoted as Embedded Image and Embedded Image. The collection of all the edge weights of the $(N_1 + N_2)$ single cells connecting to the $(m_1 + m_2)$ ghost cells is denoted as $Z \in \mathbb{R}^{N \times m}$, where $N = N_1 + N_2$ and $m = m_1 + m_2$.

Following (4) and (5), we obtain the refined embeddings, Embedded Image, for each single cell through message passing over the combined “ghost field”. We lastly normalize Embedded Image by Embedded Image

5.6 Gene prioritization

OCAT offers the functionality to find the feature genes for each cell type cluster. Denote the normalized gene expression matrix as $X = \{x_{ij}\}$, where $i = 1, \dots, N$ and $j = 1, \dots, M$. For cell type cluster $C$, we compare the gene expression of cluster $C$ with that of all the other cell types, and we rank the top feature genes by the magnitude of Embedded Image where Embedded Image.

5.7 Cell inference

OCAT supports immediate cell type inference of incoming data based on existing databases, without recomputing the latent representations by combining the new incoming (“inference”) dataset and the existing (“reference”) dataset.

Given an incoming “inference” set, OCAT first projects the normalized gene expression $X_{\text{infer}}$ to the same $d$-dimensional subspace as the “reference” set, obtaining the reduced cell representation $Y_{\text{infer}}$. OCAT then constructs a bipartite graph that connects these new “inference” cells to the “ghost” cells identified in the “reference” set following (3), and obtains the edge weights, $Z_{\text{infer}}$, for the “inference” cells. The edge weights then go through the same message-passing procedures as the “reference” cells, resulting in Embedded Image, the sparse representation of the new “inference” cells mapped to the same global subspace as the “reference” cells.

To assign cell type labels to the “inference” cells, OCAT “trains” a Support Vector Machine (SVM) [14] based on the sparse representations of the “reference” cells, Embedded Image, and the cell type labels for the “reference” cells. Based on the estimated coefficients from SVM, OCAT infers the cell type labels of the new incoming cells using Embedded Image.

5.8 Trajectory inference

Trajectory inference aims to computationally reconstruct the developmental trajectory of cells based on gene expression. It outlines the temporal transition from the least differentiated to the most differentiated cell types. OCAT infers the developmental lineages by connecting the similarity graph between cell types with a minimum spanning tree [8].

Suppose we have an $N \times m$ dimensional gene expression embedding for the cells, for example, the sparse embedding by OCAT, Embedded Image. The cells are clustered into $c$ cell types based on the embedding. OCAT computes the similarity score between cell type $p$ and cell type $q$, $A_{p,q}$, by averaging the pair-wise cell-to-cell cosine similarities between cell types $p$ and $q$: Embedded Image where Embedded Image is the embedding vector for the $u$th cell in cell type $p$, Embedded Image is the embedding vector for the $v$th cell in cell type $q$, and $n_p$, $n_q$ are the numbers of cells in cell types $p$ and $q$, respectively. $\lVert \cdot \rVert$ denotes the $l_2$-norm.
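The similarity score described above, the mean of all pair-wise cosine similarities between the cells of two clusters, can be sketched directly. The function name `cluster_similarity` is hypothetical and the embeddings are passed as plain lists of vectors.

```python
import math


def cluster_similarity(emb_p, emb_q):
    """Mean pairwise cosine similarity between two lists of embedding vectors."""
    def cosine(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

    if not emb_p or not emb_q:
        raise ValueError("both clusters must contain at least one cell")
    # Sum over every (u, v) pair, then divide by n_p * n_q.
    total = sum(cosine(u, v) for u in emb_p for v in emb_q)
    return total / (len(emb_p) * len(emb_q))


a_pq = cluster_similarity([[1.0, 0.0]], [[2.0, 0.0], [0.0, 3.0]])
```

Evaluating this for every pair of the $c$ cell types fills a symmetric $c \times c$ matrix of similarity scores.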

Let A = {Ap,q} ∈ ℝ^{c×c} denote the matrix of pair-wise similarity scores between c cell types. OCAT constructs an undirected graph GC from A, where each node represents a unique cell type, and the edge weight between two nodes (two cell types) is their similarity score. OCAT then obtains the minimum spanning tree T that connects all the nodes while minimizing the total sum of edge weights in the tree T. OCAT lastly adds directionality to the tree by taking the least differentiated cell type, namely, the root cell type, as the starting point of differentiation. Once the root cell type is determined, we obtain a unique directionality within the tree T.
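The tree construction and rooting can be sketched with Kruskal's algorithm [8] in plain Python. All function names below are hypothetical. One assumption is made loudly: the text uses the similarity scores directly as edge weights, whereas this sketch converts each similarity to a cost, 1 − A, so that the minimum spanning tree links the most similar cell types. Whether OCAT applies such a conversion is not stated in the text.

```python
from collections import defaultdict, deque


def kruskal_tree(weights):
    """Kruskal's algorithm on a dense symmetric weight matrix.

    Returns the c - 1 undirected edges (i, j) of a minimum spanning tree.
    """
    c = len(weights)
    parent = list(range(c))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted((weights[i][j], i, j)
                   for i in range(c) for j in range(i + 1, c))
    tree = []
    for _, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two separate components
            parent[root_i] = root_j
            tree.append((i, j))
    return tree


def orient_from_root(tree, root):
    """Direct every tree edge away from the root cell type.

    Returns (parent, child) pairs in breadth-first order.
    """
    adjacency = defaultdict(list)
    for i, j in tree:
        adjacency[i].append(j)
        adjacency[j].append(i)
    directed, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in sorted(adjacency[node]):
            if neighbour not in seen:
                seen.add(neighbour)
                directed.append((node, neighbour))
                queue.append(neighbour)
    return directed


def lineage_from_similarity(A, root):
    """Assumption: cost = 1 - similarity, so similar types are linked."""
    cost = [[1.0 - a for a in row] for row in A]
    return orient_from_root(kruskal_tree(cost), root)
```

Because a tree has exactly one path between any two nodes, fixing the root determines the direction of every edge, which is why a single breadth-first pass from the root is enough.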

5.9 Pseudotime inference

Pseudotime analysis assigns each cell a time stamp along the lineages: less differentiated cells have earlier time stamps, and more differentiated cells have later ones. It thus provides finer, cell-level granularity than the lineage ordering of cell types. OCAT defines a root cell in the root cluster, r1, to serve as the reference point for quantifying differentiation. Biologically, r1 represents the most primitive cell in the entire differentiation trajectory. OCAT identifies r1 computationally by locating the cell whose spatial distances to the other cells best agree with the lineage ordering of the identified cell types. OCAT then infers the extent to which a particular cell has differentiated from its distance to r1: less differentiated cells lie closer to r1, and more differentiated cells lie farther away.

To calculate the distance of the uth cell of type p to the root cell r1, OCAT considers both the position of cluster p along the cell type lineages and the position of cell u within cluster p. OCAT therefore defines a root cell in every non-root cluster to serve as a landmark connecting the cell types along the lineages, denoted r2, …, rc ∈ ℝ^{1×m}. In a non-root cluster p, the cell with the smallest average Euclidean distance to all cells in the previous cluster p − 1 on the same lineage is assigned as the root cell, rp. OCAT defines a distance Di for each cell i in the dataset, where D ∈ ℝ^{1×N}. The distance for the uth cell in cluster p is defined as the sum of the Euclidean distance between the cell's embedding vector and the root cell of the current cluster, rp, and the length of the cell type lineage up to cluster p: Embedded Image

OCAT uses the normalized distance D norm as the pseudotime measure: Embedded Image
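The pseudotime procedure can be sketched in plain Python under several stated assumptions. The function names are hypothetical. The root cell r1 is taken as an input rather than identified computationally. The lineage length up to a cluster is assumed to be the accumulated Euclidean distance between consecutive root cells. The normalization is assumed to be min-max scaling to [0, 1]. The last two are guesses at formulas that the text describes only in words.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pseudotime(cells_by_cluster, directed_edges, root_cluster, r1):
    """Return {cluster: [normalized distance for each cell]}.

    cells_by_cluster: {cluster: [embedding vector, ...]}
    directed_edges:   (parent, child) pairs, ordered so that every parent
                      is reached before its children
    r1:               embedding vector of the root cell (given, not inferred)
    """
    root_cell = {root_cluster: r1}
    lineage_len = {root_cluster: 0.0}
    for parent, child in directed_edges:
        previous = cells_by_cluster[parent]
        # Root cell of the child cluster: smallest average distance
        # to all cells of the previous cluster on the lineage.
        rp = min(
            cells_by_cluster[child],
            key=lambda cell: sum(dist(cell, o) for o in previous) / len(previous),
        )
        root_cell[child] = rp
        # Assumption: lineage length accumulates root-to-root distances.
        lineage_len[child] = lineage_len[parent] + dist(root_cell[parent], rp)

    raw = {
        p: [dist(cell, root_cell[p]) + lineage_len[p] for cell in cells]
        for p, cells in cells_by_cluster.items()
    }
    values = [d for ds in raw.values() for d in ds]
    low, high = min(values), max(values)
    span = high - low
    # Assumption: min-max scaling; all zeros if every distance is equal.
    return {p: [(d - low) / span if span > 0 else 0.0 for d in ds]
            for p, ds in raw.items()}
```

The sketch expects every cluster to be reachable from the root cluster through `directed_edges`; a cluster outside the tree would raise a `KeyError`.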

References

  [1] M. Baron, A. Veres, S. L. Wolock, A. L. Faust, R. Gaujoux, A. Vetere, J. H. Ryu, B. K. Wagner, S. S. Shen-Orr, A. M. Klein, et al., A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure, Cell systems, 3 (2016), pp. 346–360.
  [2] A. Giovannucci, V. Minden, C. Pehlevan, and D. B. Chklovskii, Efficient principal subspace projection of streaming data through fast similarity matching, in 2018 IEEE International Conference on Big Data (Big Data), IEEE, 2018, pp. 1015–1022.
  [3] L. Haghverdi, A. T. Lun, M. D. Morgan, and J. C. Marioni, Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors, Nature biotechnology, 36 (2018), pp. 421–427.
  [4] X. Han, R. Wang, Y. Zhou, L. Fei, H. Sun, S. Lai, A. Saadatpour, Z. Zhou, H. Chen, F. Ye, et al., Mapping the mouse cell atlas by microwell-seq, Cell, 172 (2018), pp. 1091–1107.
  [5] S. C. Hicks, F. W. Townes, M. Teng, and R. A. Irizarry, Missing data and technical variability in single-cell RNA-sequencing experiments, Biostatistics, 19 (2018), pp. 562–578.
  [6] B. Hie, B. Bryson, and B. Berger, Efficient integration of heterogeneous single-cell transcriptomes using Scanorama, Nature biotechnology, 37 (2019), pp. 685–691.
  [7] I. Korsunsky, N. Millard, J. Fan, K. Slowikowski, F. Zhang, K. Wei, Y. Baglaenko, M. Brenner, P.-r. Loh, and S. Raychaudhuri, Fast, sensitive and accurate integration of single-cell data with Harmony, Nature methods, (2019), pp. 1–8.
  [8] J. B. Kruskal, On the shortest spanning subtree of a graph and the traveling salesman problem, Proceedings of the American Mathematical society, 7 (1956), pp. 48–50.
  [9] W. Liu, J. He, and S.-F. Chang, Large graph construction for scalable semi-supervised learning, in ICML, 2010.
  [10] R. Lopez, J. Regier, M. B. Cole, M. I. Jordan, and N. Yosef, Deep generative modeling for single-cell transcriptomics, Nature methods, 15 (2018), pp. 1053–1058.
  [11] E. Z. Macosko, A. Basu, R. Satija, J. Nemesh, K. Shekhar, M. Goldman, I. Tirosh, A. R. Bialas, N. Kamitaki, E. M. Martersteck, et al., Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets, Cell, 161 (2015), pp. 1202–1214.
  [12] M. J. Muraro, G. Dharmadhikari, D. Grün, N. Groen, T. Dielen, E. Jansen, L. van Gurp, M. A. Engelse, F. Carlotti, E. J. de Koning, et al., A single-cell transcriptome atlas of the human pancreas, Cell systems, 3 (2016), pp. 385–394.
  [13] S. Nestorowa, F. K. Hamey, B. Pijuan Sala, E. Diamanti, M. Shepherd, E. Laurenti, N. K. Wilson, D. G. Kent, and B. Göttgens, A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation, Blood, The Journal of the American Society of Hematology, 128 (2016), pp. e20–e31.
  [14] W. S. Noble, What is a support vector machine?, Nature biotechnology, 24 (2006), pp. 1565–1567.
  [15] K. Polański, M. D. Young, Z. Miao, K. B. Meyer, S. A. Teichmann, and J.-E. Park, BBKNN: fast batch alignment of single cell transcriptomes, Bioinformatics, 36 (2020), pp. 964–965.
  [16] X. Qiu, Q. Mao, Y. Tang, L. Wang, R. Chawla, H. A. Pliner, and C. Trapnell, Reversed graph embedding resolves complex single-cell trajectories, Nature methods, 14 (2017), p. 979.
  [17] W. Saelens, R. Cannoodt, H. Todorov, and Y. Saeys, A comparison of single-cell trajectory inference methods, Nature biotechnology, 37 (2019), pp. 547–554.
  [18] Å. Segerstolpe, A. Palasantza, P. Eliasson, E.-M. Andersson, A.-C. Andréasson, X. Sun, S. Picelli, A. Sabirsh, M. Clausen, M. K. Bjursell, et al., Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes, Cell metabolism, 24 (2016), pp. 593–607.
  [19] K. Shekhar, S. W. Lapan, I. E. Whitney, N. M. Tran, E. Z. Macosko, M. Kowalczyk, X. Adiconis, J. Z. Levin, J. Nemesh, M. Goldman, et al., Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics, Cell, 166 (2016), pp. 1308–1323.
  [20] K. Street, D. Risso, R. B. Fletcher, D. Das, J. Ngai, N. Yosef, E. Purdom, and S. Dudoit, Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics, BMC genomics, 19 (2018), pp. 1–16.
  [21] T. Stuart, A. Butler, P. Hoffman, C. Hafemeister, E. Papalexi, W. M. Mauck III, Y. Hao, M. Stoeckius, P. Smibert, and R. Satija, Comprehensive integration of single-cell data, Cell, 177 (2019), pp. 1888–1902.
  [22] H. T. N. Tran, K. S. Ang, M. Chevrier, X. Zhang, N. Y. S. Lee, M. Goh, and J. Chen, A benchmark of batch-effect correction methods for single-cell RNA sequencing data, Genome biology, 21 (2020), pp. 1–32.
  [23] T. N. Tran and G. D. Bader, Tempora: Cell trajectory inference using time-series single-cell RNA sequencing data, PLoS computational biology, 16 (2020), p. e1008205.
  [24] C. Trapnell, D. Cacchiarelli, J. Grimsby, P. Pokharel, S. Li, M. Morse, N. J. Lennon, J. Livak, T. S. Mikkelsen, and J. L. Rinn, The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells, Nature biotechnology, 32 (2014), p. 381.
  [25] P.-Y. Tung, J. D. Blischak, C. J. Hsiao, D. A. Knowles, J. E. Burnett, J. K. Pritchard, and Y. Gilad, Batch effects and the effective design of single-cell gene expression studies, Scientific reports, 7 (2017), pp. 1–15.
  [26] A.-C. Villani, R. Satija, G. Reynolds, S. Sarkizova, K. Shekhar, J. Fletcher, M. Griesbeck, A. Butler, S. Zheng, S. Lazo, et al., Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors, Science, 356 (2017).
  [27] B. Wang, J. Zhu, E. Pierson, D. Ramazzotti, and S. Batzoglou, Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning, Nature methods, 14 (2017), pp. 414–416.
  [28] Y. J. Wang, J. Schug, K.-J. Won, C. Liu, A. Naji, D. Avrahami, M. L. Golson, and K. H. Kaestner, Single-cell transcriptomics of the human endocrine pancreas, Diabetes, 65 (2016), pp. 3028–3038.
  [29] F. A. Wolf, F. K. Hamey, M. Plass, J. Solana, J. S. Dahlin, B. Göttgens, N. Rajewsky, L. Simon, and F. J. Theis, PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells, Genome biology, 20 (2019), pp. 1–9.
  [30] Y. Xin, J. Kim, H. Okamoto, M. Ni, Y. Wei, C. Adler, A. J. Murphy, G. D. Yancopoulos, C. Lin, and J. Gromada, RNA sequencing of single human islet cells reveals type 2 diabetes genes, Cell metabolism, 24 (2016), pp. 608–615.
  [31] A. Zeisel, A. B. Muñoz-Manchado, S. Codeluppi, P. Lönnerberg, G. La Manno, A. Juréus, S. Marques, H. Munguba, L. He, C. Betsholtz, et al., Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq, Science, 347 (2015), pp. 1138–1142.
  [32] G. X. Zheng, J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, D. Wheeler, G. P. McDermott, J. Zhu, et al., Massively parallel digital transcriptional profiling of single cells, Nature communications, 8 (2017), pp. 1–12.
Posted May 13, 2021.