Cell BLAST: Searching large-scale scRNA-seq database via unbiased cell embedding

Zhi-Jie Cao; Lin Wei; Shen Lu; De-Chang Yang; Ge Gao

doi:10.1101/587360

Abstract

A large amount of single-cell RNA sequencing data produced by various technologies is accumulating rapidly. An efficient cell querying method facilitates integrating existing data and annotating new data. Here we present Cell BLAST, a novel cell querying method based on deep generative modeling, together with a well-curated reference database and a user-friendly Web interface at http://cblast.gao-lab.org, as an accurate and robust solution to large-scale cell querying.

Main Text

Technological advances during the past decade have led to rapid accumulation of large-scale single-cell RNA sequencing (scRNA-seq) data. Analogous to biological sequence analysis¹, identifying expression similarity to well-curated references via a cell querying algorithm is becoming the first step of annotating newly sequenced cells. Tools have been developed to identify similar cells using approximate cosine distance² or LSH Hamming distance^{3, 4} calculated from a subset of carefully selected genes. Such an intuitive approach is efficient, especially for large-scale data, but may suffer from non-biological variation across datasets, i.e. batch effect^{5, 6}. Meanwhile, multiple data harmonization methods have been proposed to remove such confounding factors during alignment, for example, via warping canonical correlation vectors⁷ or matching mutual nearest neighbors across batches⁶. While these methods can be applied to align multiple reference datasets, computation-intensive realignment is required for mapping query cells to the (pre-)aligned reference data space.

Here we introduce a new customized deep generative model together with a cell-to-cell similarity metric specifically designed for cell querying to address these challenges (Figure 1A, Method). Differing from canonical variational autoencoder (VAE) models^8–11, adversarial batch alignment is applied to correct batch effect during low-dimensional embedding of reference datasets. Query cells can be readily mapped to the batch-corrected reference space owing to the parametric nature of neural networks. Such a design also enables a special “online tuning” mode, which is able to handle batch effect between query and reference when necessary. Moreover, by exploiting the model’s universal approximator posterior to model uncertainty in latent space, we implement a distribution-based metric for measuring cell-to-cell similarity. Last but not least, we also provide a well-curated multi-species single-cell transcriptomics database (ACA) and an easy-to-use Web interface for convenient exploratory analysis.

Figure. 1 Cell BLAST benchmarking and application to trachea datasets.

(A) Overall Cell BLAST workflow. (B) Extent of dataset mixing after batch effect correction in four groups of datasets, quantified by Seurat alignment score. High Seurat alignment score indicates that local neighborhoods consist of cells from different datasets uniformly rather than the same dataset only. Methods that did not finish under 2-hour time limit are marked as N.A. (C) Cell type resolution after batch effect correction, quantified by cell type mean average precision (MAP). MAP can be thought of as a generalization to nearest neighbor accuracy, with larger values indicating higher cell type resolution, thus more suitable for cell querying. Methods that did not finish under 2-hour time limit are marked as N.A. (D) ROC curve of different distance metrics in discriminating cell pairs with the same cell type from cell pairs with different cell types. (E) Sankey plot comparing Cell BLAST predictions and original cell type annotations for the “Plasschaert” dataset. (F) t-SNE visualization of Cell BLAST rejected cells, colored by unsupervised clustering.

To assess our model’s capability to capture biological similarity in the low-dimensional latent space, we first benchmarked against several popular dimension reduction tools^{8, 12, 13} using real-world data (Supplementary Table 1), and found that our model is overall among the best performing methods (Supplementary Figure 1-2). We further compared batch effect correction performance using combinations of multiple datasets with overlapping cell types profiled (Supplementary Table 1). Our model achieves significantly better dataset mixing (Figure 1B) while maintaining comparable cell type resolution (Figure 1C). Latent space visualization also demonstrates that our model is able to effectively remove batch effect for multiple datasets with considerable difference in cell type distribution (Supplementary Figure 3). Of note, we found that the correction of inter-dataset batch effect does not automatically generalize to that within each dataset, which is most evident in the pancreatic datasets (Supplementary Figure 3C-D, Supplementary Figure 4A-C). For such complex scenarios, our model is flexible in removing multiple levels of batch effect simultaneously (Supplementary Figure 4D-H).

Supplementary Fig. 1 Comparing low dimensional space cell type resolution of different dimension reduction methods.

Nearest neighbor cell type mean average precision (MAP) is used to evaluate how well biological similarity is captured. MAP can be thought of as a generalization to nearest neighbor accuracy, with larger values indicating higher cell type resolution, thus more suitable for cell querying. Methods that did not finish under 2-hour time limit are marked as N.A.

Supplementary Fig. 2 t-SNE visualization of latent spaces learned by our model.

(A) “Muraro”⁴³, (B) “Adam”⁴⁴, (C) “Guo”⁴⁵, (D) “Plasschaert”¹⁵, (E) “Baron_human”⁴⁶, (F) “Bach”⁴⁷, (G) “Macosko”⁴⁸.

Supplementary Fig. 3 t-SNE visualization of latent spaces learned by our model on combinations of multiple datasets, with batch effect corrected.

Figures in the left column color cells by their cell types, while figures in the right column color cells by dataset. (A-B) “Baron_human”⁴⁶ and “Baron_mouse”⁴⁶; (C-D) “Baron_human”⁴⁶, “Muraro”⁴³, “Enge”⁴⁹, “Segerstolpe”⁵⁰, “Xin_2016”⁵¹ and “Lawlor”⁵²; (E-F) “Montoro_10x”¹⁴ and “Plasschaert”¹⁵; (G-H) “Quake_Smart-seq2”¹⁸ and “Quake_10x”¹⁸.

Supplementary Fig. 4 Multi-level batch correction and Cell BLAST strategy optimization.

(A-C) Latent space learned only with cross-dataset batch correction, colored by (A) donor in “Baron_human”⁴⁶, (B) donor in “Enge”⁴⁹, (C) donor in “Muraro”⁴³, respectively. (D-H) Latent space learned with both cross-dataset batch correction and within-dataset batch correction, colored by (D) donor in “Baron_human”⁴⁶, (E) donor in “Enge”⁴⁹, (F) donor in “Muraro”⁴³, (G) cell type, (H) dataset, respectively. (I) Standard deviation decreases as number of samples from the posterior increases. (J) Relation between Euclidean distance and posterior distance on “Baron_human”⁴⁶ data. Orange points represent cell pairs that are of the same cell type (“positive pairs”), while blue points represent cell pairs of different cell types (“negative pairs”). (K) AUROC of different distance metrics in discriminating cell pairs with the same cell type from cell pairs with different cell types. Note that posterior distribution distances for scVI only lead to decrease in performance, possibly due to improper Gaussian assumption in the posterior. (L) Accuracy, Cohen’s κ, specificity and sensitivity all increase as the number of models used for cell querying increases, among which improvement of specificity is most significant.

Supplementary Table. 1 Datasets used in dimensionality reduction and batch effect correction benchmarking.

DR, dimension reduction benchmarking; BC, batch effect correction benchmarking

While the unbiased latent space embedding derived by a nonlinear deep neural network effectively removes confounding factors, the network’s random components and nonconvex optimization procedure also lead to serious challenges, especially false positive hits when cells outside reference types are provided as query. Thus, we propose a novel probabilistic cell-to-cell similarity metric in latent space based on the posterior distribution of each cell, which we term “normalized projection distance” (NPD). Distance metric ROC analysis (Method) shows that our posterior NPD metric is more accurate and robust than Euclidean distance, which is commonly used in other neural network-based embedding tools (Figure 1D). Additionally, we exploit the stability of query-hit distance across multiple models to improve specificity (Method, Supplementary Figure 4L). An empirical p-value is computed for each query hit as a measure of “confidence”, by comparing its posterior distance to the empirical null distribution obtained from randomly selected pairs of cells in the queried database.

The high specificity of Cell BLAST is especially important for discovering novel cell types. Two recent studies (“Montoro”¹⁴ and “Plasschaert”¹⁵) independently reported a rare tracheal cell type named pulmonary ionocyte. We artificially removed ionocytes from the “Montoro” dataset, and used it as reference to annotate query cells from the “Plasschaert” dataset. In addition to accurately annotating 95.9% of query cells, Cell BLAST correctly rejects 12 out of 19 “Plasschaert” ionocytes (Figure 1E). Moreover, it highlights the existence of a putative novel cell type as a well-defined cluster with large p-values among all 156 rejected cells, which corresponds to ionocytes (Figure 1F-G, Supplementary Figure 6A, also see Supplementary Figure 5 for more detailed analysis on the remaining 7 cells). In contrast, scmap-cell² only rejected 7 “Plasschaert” ionocytes despite a higher overall rejection number of 401 (i.e. more false negatives, Supplementary Figure 6B-E).

Supplementary Fig. 5 Ionocytes predicted to be club cells are potentially doublets or of intermediate cell state.

(A) Cell-cell correlation heatmap for several cell types of interest. Cells labeled as “<X>” are reference cells in the “Montoro”¹⁴ dataset. Cells labeled as “X->Y” are cells annotated as “X” in the original “Plasschaert”¹⁵ dataset but predicted to be “Y”. (B-D) Expression levels of several club cell markers in cell groups of interest. (E-G) Expression levels of several ionocyte markers in cell groups of interest.

Supplementary Fig. 6 Rejected cells in the “Montoro” - “Plasschaert” query.

(A) t-SNE visualization of Cell BLAST rejected cells, colored by cell type. (B) Sankey plot of scmap prediction. (C, D) t-SNE visualization of scmap rejected cells, colored by unsupervised clustering (C) and cell type (D). (E) scmap similarity distribution in each cluster of scmap rejected cells. The rejected ionocytes do not have the lowest cosine similarity scores to draw enough attention.

We further systematically compared the performance of query-based cell typing with scmap-cell² and CellFishing.jl⁴ using four groups of datasets, each including both positive control and negative control queries (first 4 groups in Supplementary Table 2). Of interest, while Cell BLAST shows better performance than scmap-cell and CellFishing.jl under the default setting (Supplementary Figure 7A-C, 8-10), detailed ROC analysis reveals that the performance of scmap-cell could be further improved to a level comparable to Cell BLAST by employing higher thresholds, while the ROC and optimal thresholds of CellFishing.jl show large variation across different datasets (Supplementary Figure 7D). Cell BLAST presents the most robust performance with the default threshold (p-value < 0.05) across multiple datasets, which largely benefits real-world application. Additionally, we also assessed their scalability using reference data varying from 1,000 to 1,000,000 cells. Both Cell BLAST and CellFishing.jl scale well with increasing reference size, while scmap-cell’s querying time rises dramatically for larger reference datasets with more than 10,000 cells (Supplementary Figure 7E).

Supplementary Fig. 7 Benchmarking cell querying.

(A-C) Querying specificity (A), sensitivity (B) and Cohen’s κ (C) for different methods under the default setting. (D) ROC curve of cell querying in four different groups of test datasets. Cohen’s κ values in bottom left of each subpanel correspond to the optimal point on ROC curve. Points corresponding to each method’s default cutoff (scmap: cosine distance = 0.5, CellFishing.jl: Hamming distance = 110, Cell BLAST: p-value = 0.05) are marked as triangles. Note that CellFishing.jl does not provide a default cutoff, so we chose a Hamming distance of 110, which is the closest to balancing sensitivity and specificity, but it is still far from being stable across different datasets. (E) Querying speed on references of different sizes subsampled from the 1M mouse brain dataset³⁵.

Supplementary Fig. 8

Sankey plots for Cell BLAST in the cell querying benchmark.

Supplementary Fig. 9

Sankey plots for scmap in the cell querying benchmark.

Supplementary Fig. 10

Sankey plots for CellFishing.jl in the cell querying benchmark.

Supplementary Table. 2

Datasets used in cell query benchmarking.

Moreover, our deep generative model combined with the posterior-based latent-space similarity metric enables Cell BLAST to model a continuous spectrum of cell states accurately. We demonstrate this using a study profiling mouse hematopoietic progenitor cells (“Tusi”¹⁶), in which computationally inferred cell fate distributions are available. For the purpose of evaluation, cell fate distributions inferred by the authors are treated as ground truth. We selected cells from one sequencing run as query and the other as reference to test if we can accurately transfer continuous cell fate between experimental batches (Figure 2A-B). Jensen-Shannon divergence between prediction and ground truth shows that our prediction is again more accurate than scmap (Figure 2C).

Figure. 2 Application to hematopoietic progenitor datasets.

(A, B) UMAP visualization of latent space learned on the “Tusi” dataset, colored by sequencing run (A) and cell fate (B). Model is trained solely on cells from run 2 and used to project cells from run 1. Each one of the seven terminal cell fates (E, erythroid; Ba, basophilic or mast; Meg, megakaryocytic; Ly, lymphocytic; D, dendritic; M, monocytic; G, granulocytic neutrophil) is assigned a distinct color. Color of each single cell is then determined by the linear combination of these seven colors in hue space, weighted by cell fate distribution among these terminal fates. (C) Distribution of Jensen-Shannon divergence between predicted cell fate distributions and author-provided “ground truth”. (D) UMAP visualization of the “Velten” dataset, colored by Cell BLAST predicted cell fates. (E) Number of organs covered in each species for different single-cell transcriptomics databases, including Single Cell Portal (https://portals.broadinstitute.org/single_cell); Hemberg collection²; SCPortalen¹⁹; scRNASeqDB²⁰.

Besides batch effect among multiple reference datasets, bona fide biological similarity could also be confounded by large, undesirable bias between query and reference data. Taking advantage of the dedicated adversarial batch alignment component, we implemented a particular “online tuning” mode to handle such an often-neglected confounding factor. Briefly, the combination of reference and query data is used to fine-tune the existing reference-based model, with query-reference batch effect added as an additional component to be removed by adversarial batch alignment (Method). Using this strategy, we successfully transferred cell fate from the above “Tusi” dataset to an independent human hematopoietic progenitor dataset (“Velten”¹⁷) (Figure 2D). Expression of known cell lineage markers supports the validity of the transferred cell fates (Supplementary Figure 11A-F). In contrast, scmap-cell incorrectly assigned most cells to monocyte and granulocyte lineages (Supplementary Figure 11G). As another example, we applied “online tuning” to Tabula Muris¹⁸ spleen data, which exhibits significant batch effect between 10x and Smart-seq2 processed cells. ROC of Cell BLAST improves significantly after “online tuning”, achieving high specificity, sensitivity and Cohen’s κ at the default cutoff (Supplementary Figure 11H, last group in Supplementary Table 2).

Supplementary Fig. 11 Using “online tuning” in hematopoietic progenitor and Tabula Muris¹⁸ spleen data.

UMAP visualization of the “Velten”¹⁷ dataset, colored by expression of lineage markers, including CA1 for erythrocyte lineage (A), GP1BB for megakaryocyte lineage (B), DNTT for B cell lineage (C), TGFBI for monocyte, dendritic cell lineage (D), CLC for eosinophil, basophil, mast cell lineage (E), MPO for neutrophil lineage (F), and scmap predicted cell fate distribution (G). (H) ROC curve of cell querying in Tabula Muris¹⁸ spleen data. Cohen’s κ values in bottom left of each subpanel correspond to the optimal point on ROC curve. Points corresponding to each method’s default cutoff (scmap: cosine distance = 0.5, CellFishing.jl: Hamming distance = 110, Cell BLAST: p-value = 0.05) are marked as triangles.

A comprehensive and well-curated reference database is crucial for practical application of Cell BLAST. Based on public scRNA-seq datasets, we developed a comprehensive Animal Cell Atlas (ACA). With 986,305 cells in total, ACA currently covers 27 distinct organs across 8 species, offering the most comprehensive compendium for diverse species and organs (Figure 2E, Supplementary Figure 12A-B, Supplementary Table 3). To ensure unified and high-resolution cell type description, all records in ACA are collected and annotated with a standard procedure (Method), with 95.4% of datasets manually curated with Cell Ontology, a structured controlled vocabulary for cell types.

Supplementary Fig. 12 ACA database and Cell BLAST Web portal.

(A) Comparison of cell number in different single-cell transcriptomics databases. (B) Composition of different single-cell sequencing platforms in ACA. (C) Home page of the Web interface. (D) Web interface showing results of an example query.

A user-friendly Web server is publicly accessible at http://cblast.gao-lab.org, with all curated datasets and pretrained models available, providing an “off-the-shelf” querying service. Of note, we found that our model works well on all ACA datasets with minimal hyperparameter tuning (latent space visualization available on the website). Based on this wealth of resources, users can obtain querying hits and visualize cell type predictions with minimal effort (Supplementary Figure 12C-D). For advanced users, a well-documented Python package implementing the Cell BLAST toolkit is also available, which enables model training on custom references and diverse downstream analyses.

By explicitly modeling multi-level batch effect as well as uncertainty in cell-to-cell similarity estimation, Cell BLAST is an accurate and robust querying algorithm for heterogeneous single-cell transcriptome datasets. In combination with a comprehensive, well-annotated database and an easy-to-use Web interface, Cell BLAST provides a one-stop solution for both bench biologists and bioinformaticians.

Software availability

The full package of Cell BLAST is available at http://cblast.gao-lab.org

Author contributions

G.G. conceived the study and supervised the research; Z.J.C. and L.W. contributed to the computational framework and data curation; S.L., Z.J.C. and D.C.Y. designed, implemented and deployed the website; Z.J.C. and G.G. wrote the manuscript with comments and inputs from all co-authors.

Methods

The deep generative model

The model we used is based on the adversarial autoencoder (AAE)²¹. Denote the gene expression profile of a cell as x ∈ ℝ^G. The data generative process is modeled by a continuous latent variable z ∈ ℝ^D (D ≪ G) with Gaussian prior z ∼ N(0, I_D), as well as a one-hot latent variable c ∈ {0,1}^K, c^T c = 1 with categorical prior c ∼ Cat(K), which aims at modeling cell type clusters. A unified latent vector is then determined by l = z + Hc, where H ∈ ℝ^{D×K}. A neural network (decoder, denoted by Dec below) maps the cell embedding vector l to the two parameters of the negative binomial distribution, μ, θ = Dec(l), which models the distribution of the expression profile x; here μ and θ are the mean and dispersion of the negative binomial distribution. Theoretically, the negative binomial model should be fitted on raw count data^{8, 13, 22}. However, for the purpose of cell querying, datasets have to be normalized to minimize the influence of capture efficiency and sequencing depth. We empirically found that on normalized data, negative binomial still produced better results than alternative distributions like log-normal. To prevent numerical instability caused by normalization breaking the mean-variance relation of negative binomial, we additionally include the variance of the dispersion parameter as a regularization term.
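As a minimal sketch of the likelihood described above, the following computes a negative binomial log-probability from a mean μ and a dispersion θ. The parameterization (variance μ + μ²/θ) and the function names are assumptions made for illustration; they are not taken from the Cell BLAST package. The log-gamma form is used so that non-integer (normalized) expression values can also be evaluated.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of x under a negative binomial with mean `mu`.

    Assumed parameterization: variance = mu + mu**2 / theta.
    """
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def nb_log_likelihood(xs, mus, thetas):
    """Sum of per-gene log-probabilities for one cell (genes independent)."""
    return sum(nb_log_pmf(x, m, t) for x, m, t in zip(xs, mus, thetas))
```

In a decoder of this kind, `mus` and `thetas` would be the per-gene outputs of Dec(l), and the negative of `nb_log_likelihood` would serve as the reconstruction loss.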

The training objectives for the adversarial autoencoder involve the following components. q(z|x; Enc) and q(c|x; Enc) are “universal approximator posteriors” parameterized by another neural network (encoder, denoted by Enc). Expectations with regard to q(z|x; Enc) and q(c|x; Enc) are approximated by sampling x′ ∼ Poisson(x) and feeding x′ to the deterministic encoder network. D_z and D_c are discriminator networks for z and c, which output the probability that a latent variable sample is from the prior rather than the posterior. Effectively, adversarial training between the encoder (Enc) and the discriminators (D_z and D_c) drives the encoder output to match the prior distributions of the latent variables, p(z) and p(c). λ_z and λ_c are hyperparameters that control prior matching strength. The model is much easier and more stable to train than canonical GANs because of the low dimensionality and simple distributions of z and c.
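For orientation, objectives of this kind can be written in the standard adversarial autoencoder form sketched below. This is an assumed form that is consistent with the description above (a reconstruction term, plus discriminators that output the probability of a sample coming from the prior); the non-saturating encoder terms and the placement of λ_z and λ_c in particular are assumptions, not a transcription of the original equations.

```latex
\begin{aligned}
\min_{\mathrm{Enc},\,\mathrm{Dec}}\;&
  -\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\,
   \mathbb{E}_{z \sim q(z|x;\mathrm{Enc}),\, c \sim q(c|x;\mathrm{Enc})}
   \big[\log p(x \mid z, c;\mathrm{Dec})\big] \\
 &-\lambda_z\,\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\,
   \mathbb{E}_{z \sim q(z|x;\mathrm{Enc})}\big[\log D_z(z)\big]
  -\lambda_c\,\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\,
   \mathbb{E}_{c \sim q(c|x;\mathrm{Enc})}\big[\log D_c(c)\big] \\[4pt]
\max_{D_z}\;&
  \mathbb{E}_{z \sim p(z)}\big[\log D_z(z)\big]
  + \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\,
    \mathbb{E}_{z \sim q(z|x;\mathrm{Enc})}\big[\log\big(1 - D_z(z)\big)\big] \\[4pt]
\max_{D_c}\;&
  \mathbb{E}_{c \sim p(c)}\big[\log D_c(c)\big]
  + \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\,
    \mathbb{E}_{c \sim q(c|x;\mathrm{Enc})}\big[\log\big(1 - D_c(c)\big)\big]
\end{aligned}
```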

At convergence, the encoder learns to map data distribution to latent variables that follow their respective prior distributions and the decoder learns to map latent variables from prior distributions back to data distribution. The key element we use for cell querying is vector l on the decoding path, as it defines a unified latent space in which biological similarities are well captured. The model also works if no categorical latent variable is used, in which case l = z directly.

Some architectural designs are adopted from scVI⁸, including logarithm transformation before encoder input, and softmax output scaled by library size when computing μ. Stochastic gradient descent with minibatches is applied to optimize the loss functions. Specifically, we use the “RMSProp” optimization algorithm with no momentum term to ensure stability of adversarial training. The model is implemented using the TensorFlow²³ Python library.

Adversarial batch alignment

As a natural extension to the prior matching adversarial training strategy described in the previous section, and following recent works in domain adaptation^24–26, we propose an adversarial batch alignment strategy to align the latent space distributions of different batches. Below, we extend the derivation in the original GAN paper²⁷ to show that adversarial batch alignment converges when the embedding space distributions of different batches are aligned.

In the case of multiple batches, we assume that the marginal data distribution can be factorized over batches, where b ∈ {0,1}^B, b^T b = 1 denotes a one-hot batch indicator and the batch distribution p(b) is categorical. Adversarial batch alignment introduces an additional loss on top of ℒ_base, the loss function in (3). D_b is a multi-class batch discriminator network that outputs the probability distribution of batch membership based on the embedding vector l. λ_b is a hyperparameter controlling batch alignment strength. Additionally, the generative distribution is extended to condition on b as well.

Now, we focus on batch alignment and discard the first ℒ_base term and the scaling parameter λ_b. To simplify notation, we fuse the data distribution and the encoder transformation, and replace the minimization over the encoder with a minimization over the batch embedding distributions p_i(l). Here D_bi(l) denotes the i^th dimension of the discriminator output, i.e. the probability that the discriminator “thinks” a cell is from the i^th batch. D_b is assumed to have sufficient capacity, which is generally reasonable in the case of neural networks. The global optimum of (12) is reached when D_b outputs the optimal batch membership distribution at every l. Substituting the optimal discriminator D_b^∗(l) back into (11) gives an objective whose global minimum is reached if and only if p_i(l) = p_j(l), ∀i, j. The minimization of (11) is therefore equivalent to minimizing a form of generalized Jensen-Shannon divergence among multiple batch embedding distributions.
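The argument above can be sketched as follows. The notation is partly assumed for illustration: w_i denotes the prior probability of batch i under the categorical p(b), and p_i(l) denotes the embedding distribution of batch i after fusing the data distribution with the encoder. The steps follow the usual optimal-discriminator argument from the GAN literature, extended to a multi-class discriminator.

```latex
\begin{aligned}
&\text{Minimax problem:} &&
\min_{\{p_i\}} \max_{D_b} \; \sum_{i=1}^{B} w_i \,
  \mathbb{E}_{l \sim p_i(l)} \big[ \log D_{b_i}(l) \big],
  \qquad \sum_{i=1}^{B} D_{b_i}(l) = 1 \\[4pt]
&\text{Optimal discriminator:} &&
D_{b_i}^{*}(l) = \frac{w_i \, p_i(l)}{\sum_{j=1}^{B} w_j \, p_j(l)} \\[4pt]
&\text{Substitution:} &&
\sum_{i=1}^{B} w_i \, \mathbb{E}_{l \sim p_i(l)}
  \left[ \log \frac{w_i \, p_i(l)}{\sum_{j} w_j \, p_j(l)} \right]
= \sum_{i=1}^{B} w_i \,
  \mathrm{KL}\!\Big( p_i \,\Big\|\, \textstyle\sum_{j} w_j \, p_j \Big)
  + \sum_{i=1}^{B} w_i \log w_i \\[4pt]
&\text{Lower bound:} &&
\sum_{i=1}^{B} w_i \,
  \mathrm{KL}\!\Big( p_i \,\Big\|\, \textstyle\sum_{j} w_j \, p_j \Big)
  + \sum_{i=1}^{B} w_i \log w_i
\;\ge\; \sum_{i=1}^{B} w_i \log w_i
\end{aligned}
```

Equality holds exactly when every KL term vanishes, that is, when each p_i equals the mixture and hence p_i(l) = p_j(l) for all i, j. The weighted sum of KL divergences to the mixture is the generalized Jensen-Shannon divergence among the batch embedding distributions.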

Note that in practice, model training looks for a balance between ℒ_base and pure batch alignment. Aligning cells of the same type induces minimal cost in ℒ_base, while cells of different types could cause ℒ_base to rise dramatically if improperly aligned. During training, gradients from both the batch discriminators and the decoder provide fine-grained guidance to align different batches, leading to better results than “hand-crafted” alignment strategies like CCA⁷ and MNN⁶. Empirically, given proper values for λ_b, the adversarial approach correctly handles differences in cell type distribution among batches. If multiple levels of batch effect exist, e.g. within-dataset and cross-dataset, we use an independent batch discriminator for each component, providing extra flexibility.

Data preprocessing for benchmarks

Most informative genes were selected using the Seurat⁷ function “FindVariableGenes”. We set the argument “binning.method” to “equal_frequency” and left other arguments as default. If within dataset batch effect exists, genes are selected independently for each batch and then pooled together. By default, a gene is retained if it is selected in at least 50% of batches. Downstream benchmarks were all performed using this gene set, except for scmap and CellFishing.jl which provide their own gene selection method. GNU parallel²⁸ was used to parallelize and manage jobs throughout the benchmarking and data processing pipeline.
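The pooling rule for per-batch gene selection can be sketched as below. The function name and the representation of each batch's selection as a list of gene names are hypothetical; the per-batch selection itself (Seurat's “FindVariableGenes”) is assumed to have been run beforehand.

```python
from collections import Counter


def pool_selected_genes(per_batch_genes, min_fraction=0.5):
    """Keep genes selected in at least `min_fraction` of the batches.

    `per_batch_genes` holds one collection of gene names per batch.
    """
    n_batches = len(per_batch_genes)
    # Count each gene once per batch, even if a batch lists it twice.
    counts = Counter(g for genes in per_batch_genes for g in set(genes))
    return sorted(g for g, c in counts.items() if c >= min_fraction * n_batches)
```

For example, with four batches and the default threshold, a gene selected in two or more batches is retained.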

Benchmarking dimension reduction

PCA was performed using the R package irlba²⁹ (v2.3.2). ZIFA¹² was downloaded from its Github repository, and hard coded random seeds were removed to reveal actual stability. ZINB-WaVE¹³ (v1.0.0) was performed using the R package zinbwave. scVI⁸ (v0.2.3) was downloaded from its Github repository, and minor changes were made to the original code to address PyTorch³⁰ compatibility issues. Our modified versions of ZIFA and scVI are available upon request.

For PCA and ZIFA, data was logarithm transformed after normalization and adding a pseudocount of 1. Hyperparameters of all methods above were left as default. For our model, we used the same set of hyperparameters throughout all benchmarks. λ_z and λ_c are both set to 0.001. All neural networks (encoder, decoder and discriminators) use a single layer of 128 hidden units. Learning rate of RMSProp optimizer is set to 0.001, and minibatches of size 128 are used. For comparability, target dimensionality of each method was set to 10. All benchmarked methods were repeated multiple times with different random seeds. Run time was limited to 2 hours, after which the job would be terminated.

Cell type nearest neighbor mean average precision (MAP) was computed with the K nearest neighbors of each cell based on Euclidean distance in the low-dimensional space. Denote the cell type of a cell as y, and the cell types of its ordered nearest neighbors as y₁, y₂, … y_K. Average precision (AP) is defined for each cell from this ordered list, and mean average precision is then given by the mean of AP over all cells. Note that when K = 1, MAP reduces to nearest neighbor accuracy. We set K to 1% of total cell number throughout all benchmarks.
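A minimal sketch of this metric follows. It uses the standard information-retrieval definition of average precision (precision at each rank where the neighbor's cell type matches, averaged over the matches, and 0 when there is no match); that this matches the exact definition used in the benchmark is an assumption. The brute-force neighbor search is for illustration only and would be too slow for large datasets.

```python
import math


def average_precision(label, neighbor_labels):
    """AP of one cell given the labels of its ordered nearest neighbors."""
    hits = 0
    precision_sum = 0.0
    for k, neighbor in enumerate(neighbor_labels, start=1):
        if neighbor == label:
            hits += 1
            precision_sum += hits / k  # precision at rank k
    return precision_sum / hits if hits else 0.0


def mean_average_precision(points, labels, k):
    """MAP over all cells, using Euclidean K nearest neighbors."""
    n = len(points)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(points[i], points[j]))
        total += average_precision(labels[i], [labels[j] for j in others[:k]])
    return total / n
```

With `k=1` the function returns the fraction of cells whose nearest neighbor shares their cell type, which is the nearest neighbor accuracy mentioned above.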

Benchmarking batch effect correction

We merge multiple datasets according to shared gene names. If datasets to be merged are from different species, Ensembl ortholog³¹ information is used to map them to ortholog groups. To obtain informative genes in merged datasets, we take the union of informative genes from each dataset, and then intersect the union with the intersection of detected genes from each dataset.
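The gene set used for merged datasets can be expressed with plain set operations, as in this sketch. The function and argument names are hypothetical, and gene names are assumed to have already been mapped to shared identifiers (or ortholog groups for cross-species merges).

```python
def merged_informative_genes(informative_per_dataset, detected_per_dataset):
    """Union of informative genes, restricted to genes detected everywhere."""
    union_informative = set().union(*informative_per_dataset)
    shared_detected = set.intersection(*map(set, detected_per_dataset))
    return sorted(union_informative & shared_detected)
```

Intersecting with the commonly detected genes ensures that every retained gene can be measured in all of the merged datasets.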

CCA⁷ and MNN⁶ alignments were performed using R packages Seurat⁷ (v2.3.3) and scran³² (v1.6.9) respectively. Hard coded random seeds in Seurat were removed to reveal actual stability. The modified version of Seurat is available upon request. For comparability, we evaluated cell type resolution and batch mixing in a 10-dimensional latent space. For MNN alignment, we set argument “cos.norm.out” to false, and left other arguments as default. PCA was applied to reduce dimensionality to 10 after obtaining the MNN corrected expression matrix. For CCA alignment, we used the first 10 canonical correlation vectors. Run time was limited to 2 hours, after which the job would be terminated. Seurat alignment score was computed exactly as described in the CCA alignment paper⁷.

Cell querying based on posterior distributions

We evaluate cell-to-cell similarity based on posterior distribution distance. As in the training phase, we obtain samples from the “universal approximator posterior” by sampling x′ ∼ Poisson(x) and feeding x′ to the encoder network. In order to obtain a robust estimate of distribution distance with a small number of posterior samples, we project posterior samples of two cells onto the line connecting their posterior point estimates in the latent space, and use the distance between the projected scalar distributions to approximate the true distribution distance. Wasserstein distance is computed on normalized projections to account for non-uniform density across the embedding space. We term this distance metric normalized projection distance (NPD). By default, 50 samples from the posterior are used to compute NPD, which produces sufficiently accurate results (Supplementary Figure 4I-J). The definition of NPD does not imply an efficient nearest neighbor searching algorithm. To increase speed, we first use Euclidean distance-based nearest neighbor searching, which is highly efficient in the low dimensional latent space, and then compute posterior distances only for these nearest neighbors. The empirical distribution of posterior NPD for a dataset is obtained by computing posterior NPD on randomly selected pairs of cells in the reference dataset. Empirical p-values of query hits are computed by comparing the posterior NPD of a query hit to this empirical distribution.
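The projection-based distance and the empirical p-value can be sketched as follows. Several details are assumptions made for illustration rather than the package's implementation: posterior samples are passed in directly (the encoder is not modeled), point estimates default to sample means, normalization is done by z-scoring the projections with each cell's own projected samples and averaging the two resulting 1D Wasserstein distances, and the p-value is the plain fraction of null distances not exceeding the observed one.

```python
import math


def _mean_vector(samples):
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def _project(samples, origin, direction):
    # Scalar coordinate of each sample along the unit vector `direction`.
    return [sum((s - o) * d for s, o, d in zip(sample, origin, direction))
            for sample in samples]


def _zscore(values, reference):
    # Normalize `values` by the mean and standard deviation of `reference`.
    m = sum(reference) / len(reference)
    sd = math.sqrt(sum((r - m) ** 2 for r in reference) / len(reference))
    sd = max(sd, 1e-12)  # guard against a degenerate posterior
    return [(v - m) / sd for v in values]


def _wasserstein_1d(a, b):
    # 1-Wasserstein distance between two equal-sized 1D samples.
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal numbers of samples")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def npd(samples_p, samples_q, point_p=None, point_q=None):
    """Normalized projection distance between two cells (assumed form)."""
    if point_p is None:
        point_p = _mean_vector(samples_p)
    if point_q is None:
        point_q = _mean_vector(samples_q)
    diff = [q - p for p, q in zip(point_p, point_q)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        return 0.0
    direction = [d / norm for d in diff]
    proj_p = _project(samples_p, point_p, direction)
    proj_q = _project(samples_q, point_p, direction)
    d_p = _wasserstein_1d(_zscore(proj_p, proj_p), _zscore(proj_q, proj_p))
    d_q = _wasserstein_1d(_zscore(proj_p, proj_q), _zscore(proj_q, proj_q))
    return 0.5 * (d_p + d_q)


def empirical_p_value(distance, null_distances):
    """Fraction of random-pair distances at least as small as `distance`."""
    return sum(d <= distance for d in null_distances) / len(null_distances)
```

In a full pipeline, `npd` would be evaluated only on the Euclidean nearest neighbors of each query cell, and `null_distances` would come from randomly selected pairs of reference cells.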

We note that even with the querying strategy described above, querying with single models still occasionally leads to many false positive hits when cell types that the model has not been trained on are provided as query. This is because embeddings of such untrained cell types are mostly random, and they could localize close to reference cells by chance. We reason that embedding randomness of untrained cell types could be utilized to identify and correctly reject them. Practically, we train multiple models with different starting points (as determined by random seeds), and compute query hit significance for each model. A query hit is considered significant only if it is consistently significant across multiple models. To acquire predictions based on significant hits, we use majority voting for discrete variables, e.g. cell type, or averaging for continuous variables, e.g. cell fate distribution.
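The multi-model strategy can be sketched as below. Requiring significance in every model is one reading of “consistently significant” and is an assumption here, as are the function names and the `min_models` parameter; the 0.05 cutoff is the default threshold mentioned in the main text.

```python
from collections import Counter


def consensus_significant(p_values_per_model, cutoff=0.05, min_models=None):
    """True if a hit is significant in at least `min_models` models.

    By default the hit must be significant in all models (an assumption).
    """
    needed = len(p_values_per_model) if min_models is None else min_models
    return sum(p < cutoff for p in p_values_per_model) >= needed


def majority_vote(hit_labels):
    """Predicted discrete label from significant hits; None means rejected."""
    if not hit_labels:
        return None
    return Counter(hit_labels).most_common(1)[0][0]


def average_distribution(hit_distributions):
    """Element-wise mean of continuous annotations, e.g. cell fate vectors."""
    n = len(hit_distributions)
    return [sum(col) / n for col in zip(*hit_distributions)]
```

A query cell whose hits all fail the consensus test ends up with an empty hit list and is therefore rejected rather than forced into a reference cell type.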

Distance metric ROC analysis

Our model and scVI⁸ were fitted on reference datasets and applied to positive and negative control query datasets in the pancreas group of Supplementary Table 2. We then randomly selected 10,000 query-reference cell pairs. A query-reference pair is defined as “positive” if the query cell and reference cell are of the same cell type, and “negative” otherwise. Each benchmarked similarity metric was then computed on all sampled query-reference pairs, and used as predictors for “positive” / “negative” pairs. AUROC values were computed for each benchmarked similarity metric. In addition to Euclidean distance, we also computed posterior distribution distances for scVI (Supplementary Figure 4K). NPD is computed as described in (18), based on samples from the posterior Gaussian. JSD is computed in the original latent space (without projection).
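The AUROC of a distance metric on labeled cell pairs can be computed with the rank-based (Mann-Whitney) formulation, sketched here. The function name is hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed. Because smaller distances should indicate “positive” pairs, distances are negated before being used as scores.

```python
def auroc(scores, is_positive):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def distance_auroc(distances, same_cell_type):
    """AUROC of a distance metric for detecting same-cell-type pairs."""
    return auroc([-d for d in distances], same_cell_type)
```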

Benchmarking cell querying

For each of the four querying groups, three types of datasets were used, namely reference, positive query and negative query (see Supplementary Table 2 for a detailed list of datasets used in each querying group). For each querying method, cell type predictions for query cells are obtained based on reference hits with a minimal similarity cutoff. Cell ontology annotations in ACA were used as ground truth. Cells with no cell ontology annotations were excluded from the analysis. A prediction is considered correct if and only if it exactly matches the ground truth, i.e. no flexibility based on cell type similarity is allowed. This avoids bias introduced by the choice of a cell type similarity measure. Cells were inversely weighted by the size of the corresponding dataset when computing sensitivity, specificity and Cohen’s κ. AUROC was computed using linear interpolation. For scmap², we varied the minimal cosine similarity requirement for nearest neighbors. For Cell BLAST, we varied the maximal p-value cutoff used in filtering hits. For CellFishing.jl⁴, the original implementation does not include a dedicated cell type prediction function, so we used the same strategy as for our own method (majority voting after distance filtering) to acquire final predictions, in which we varied the Hamming distance cutoff used in distance filtering. Lastly, 4 random seeds were tested for each cutoff and each method to reflect stability. Several other cell querying tools (CellAtlasSearch³, scQuery³³, scMCA³⁴) were not included in our benchmark because they do not support custom reference datasets.
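The dataset-size weighting can be illustrated with a short sketch. The helper names are hypothetical and only Cohen’s κ is shown; it assumes that each cell receives weight 1/(number of cells in its dataset) and that predictions are compared to the ground truth by exact label match, with a rejected query represented by its own label (e.g. "rejected").

```python
from collections import Counter, defaultdict


def inverse_size_weights(dataset_ids):
    """Weight each cell by the inverse of its dataset size."""
    sizes = Counter(dataset_ids)
    return [1.0 / sizes[d] for d in dataset_ids]


def weighted_cohen_kappa(truth, pred, weights):
    """Cohen's kappa computed from a weighted confusion table."""
    total = sum(weights)
    agree = sum(w for t, p, w in zip(truth, pred, weights) if t == p)
    row, col = defaultdict(float), defaultdict(float)
    for t, p, w in zip(truth, pred, weights):
        row[t] += w  # weighted marginal of true labels
        col[p] += w  # weighted marginal of predicted labels
    p_o = agree / total
    p_e = sum(row[k] * col[k] for k in list(row)) / total ** 2
    if p_e == 1.0:
        return float("nan")  # undefined when only one label occurs
    return (p_o - p_e) / (1.0 - p_e)
```

With this weighting, every dataset contributes equally to the metric regardless of how many cells it contains.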

Benchmarking querying speed

To evaluate scalability of querying methods, we constructed reference data of varying sizes by subsampling from the 1M mouse brain dataset³⁵. For query data, the “Marques” dataset³⁶ was used. For all methods, only querying time was recorded, not including time consumed to build reference indices.

Application to trachea datasets

We first removed cells labeled as “ionocytes” in the “Montoro_10x”¹⁴ dataset, and used “FindVariableGenes” from Seurat to select informative genes using the remaining cells. Four models with different starting points were trained on the tampered “Montoro_10x” dataset. We computed posterior distribution distance p-values as mentioned before, and used a cutoff of p-value > 0.1 to reject query cells from the “Plasschaert”¹⁵ dataset as potential novel cell types. We clustered rejected cells using spectral clustering (Scikit-learn³⁷ v0.20.1) after applying t-SNE³⁸ to latent space coordinates. Average p-value for a query cell was computed as the geometric mean of p-values across all hits.
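The rejection rule and the averaged p-value can be sketched as below. Both helpers are hypothetical, and the rejection criterion is one possible reading of the text: a query cell is flagged as a potential novel cell type when none of its hits has a p-value at or below the cutoff.

```python
import math


def geometric_mean_pvalue(pvalues, floor=1e-300):
    """Geometric mean of hit p-values; `floor` guards against log(0)."""
    logs = [math.log(max(p, floor)) for p in pvalues]
    return math.exp(sum(logs) / len(logs))


def is_rejected(hit_pvalues, cutoff=0.1):
    """ASSUMPTION: reject when every hit has p-value > cutoff.

    A cell with no hits at all is also rejected, since all([]) is True.
    """
    return all(p > cutoff for p in hit_pvalues)
```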

Online tuning

When significant batch effect exists between reference and query, we support further aligning query data with reference data in an online-learning manner. All components in the pretrained model, including encoder, decoder, prior discriminators and batch discriminators are retained. The reference-query batch effect is added as an extra component to be removed using adversarial batch alignment. Specifically, a new discriminator dedicated to reference-query batch effect is added, and the decoder is expanded to accept an extra one-hot indicator for reference and query. The expanded model is then fine-tuned using the combination of reference and query data. Two precautions are taken to prevent decrease in specificity caused by over-alignment. Firstly, adversarial alignment loss is constrained to cells that have mutual nearest neighbors⁶ between reference and query data in each SGD minibatch. Secondly, we penalize the deviation of fine-tuned model weights from original weights.
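The two precautions can be illustrated with a small sketch. It is not the training code itself: the function names are hypothetical, mutual nearest neighbors are searched in whatever feature space is passed in (which space is used is an assumption left to the caller), and the squared L2 form of the weight penalty is likewise an assumption.

```python
import numpy as np


def mnn_mask(ref, query, k=1):
    """Masks of reference/query cells in a minibatch that belong to at least
    one mutual nearest neighbor pair.

    ref: (n_ref, dim) array, query: (n_query, dim) array.
    """
    dist = np.linalg.norm(ref[:, None, :] - query[None, :, :], axis=2)
    n_ref, n_query = dist.shape
    ref_to_query = np.argsort(dist, axis=1)[:, :k]    # (n_ref, k)
    query_to_ref = np.argsort(dist, axis=0)[:k, :].T  # (n_query, k)
    r2q = np.zeros_like(dist, dtype=bool)
    q2r = np.zeros_like(dist, dtype=bool)
    r2q[np.arange(n_ref)[:, None], ref_to_query] = True
    q2r[query_to_ref, np.arange(n_query)[:, None]] = True
    mutual = r2q & q2r
    return mutual.any(axis=1), mutual.any(axis=0)


def weight_deviation_penalty(weights, original_weights, strength=1.0):
    """ASSUMED squared L2 penalty on deviation from the pretrained weights."""
    return strength * sum(
        float(np.sum((w - w0) ** 2)) for w, w0 in zip(weights, original_weights)
    )
```

The adversarial alignment loss would then be evaluated only on cells selected by the two masks, while the penalty is added to the fine-tuning objective.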

Application to hematopoietic progenitor datasets

For within-“Tusi”¹⁶ query, we trained four models using only cells from sequencing run 2, and cells from sequencing run 1 were used as query cells. PBA-inferred cell fate distributions, which are 7-dimensional categorical distributions across 7 terminal cell fates, were used as ground truth. We took the average cell fate distributions of significant querying hits (p-value < 0.05) as predictions for query cells. As for scmap-cell, we filtered nearest neighbors according to the default cosine similarity cutoff of 0.5. Jensen-Shannon divergence (JSD) between true and predicted cell fate distributions was computed as below:

$$\mathrm{JSD}(p \,\|\, q) = \frac{1}{2}\,\mathrm{KL}(p \,\|\, m) + \frac{1}{2}\,\mathrm{KL}(q \,\|\, m), \qquad m = \frac{1}{2}(p + q)$$

where p and q denote the true and predicted cell fate distributions and KL is the Kullback-Leibler divergence.

For cross-species querying between “Tusi” and “Velten”¹⁷, we mapped both mouse and human genes to ortholog groups. Online tuning with 200 epochs was used to increase sensitivity and accuracy. Latent space visualization was performed with UMAP^{39, 40}.
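For the cell fate evaluation, the Jensen-Shannon divergence between a true and a predicted categorical distribution can be computed as in the sketch below. It uses the standard definition with natural logarithms (the log base is an assumption), and the function name `jsd` is illustrative.

```python
import numpy as np


def jsd(p, q):
    """Jensen-Shannon divergence between two categorical distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # renormalize in case inputs do not sum exactly to 1
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With natural logarithms the value ranges from 0 for identical distributions to ln 2 for distributions with disjoint support.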

ACA database construction

We searched Gene Expression Omnibus (GEO)⁴¹ using the following search term:

( "expression profiling by high throughput sequencing"[DataSet Type] OR "expression profiling by high throughput sequencing"[Filter] OR "high throughput sequencing"[Platform Technology Type] ) AND "gse"[Entry Type] AND ( "single cell"[Title] OR "single-cell"[Title] ) AND ("2013"[Publication Date]: "3000"[Publication Date]) AND "supplementary"[Filter]

Datasets in the Hemberg collection (https://hemberg-lab.github.io/scRNA.seq.datasets/) were merged into this list. Only animal single-cell transcriptomic datasets profiling samples under normal conditions were selected. We also manually filtered out small-scale or low-quality data. Additionally, several other high-quality datasets missing from the previous list were included for comprehensiveness.
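For readers who wish to repeat the GEO search programmatically, the sketch below builds a request URL for the NCBI E-utilities `esearch` endpoint against the GEO DataSets database (`db=gds`). This is only one possible way to run the query and not necessarily how the search was originally performed; the function name and the `retmax` value are illustrative, and no network request is made here.

```python
from urllib.parse import urlencode

# Search term copied from the text above.
GEO_TERM = (
    '( "expression profiling by high throughput sequencing"[DataSet Type] '
    'OR "expression profiling by high throughput sequencing"[Filter] '
    'OR "high throughput sequencing"[Platform Technology Type] ) '
    'AND "gse"[Entry Type] '
    'AND ( "single cell"[Title] OR "single-cell"[Title] ) '
    'AND ("2013"[Publication Date]: "3000"[Publication Date]) '
    'AND "supplementary"[Filter]'
)


def geo_esearch_url(term, retmax=500):
    """Build an E-utilities esearch URL for the GEO DataSets database."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return base + "?" + urlencode({"db": "gds", "term": term, "retmax": retmax})
```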

The expression matrices and metadata of selected datasets were retrieved from GEO, supplementary files of the publication or by directly contacting the authors. Metadata were further manually curated by adding additional description in the paper to acquire the most detailed information of each cell. We unified raw cell type annotation by Cell Ontology⁴², a structured controlled vocabulary for cell types. Closest Cell Ontology terms were manually assigned based on Cell Ontology description and context of the study.

Building reference panels for ACA database

Two types of searchable reference panels are built for the ACA database. The first consists of individual datasets with dedicated models trained on each of them, while the second consists of datasets grouped by organ and species, with models trained to align multiple datasets profiling the same species and the same organ.

Data preprocessing follows the same procedure as in previous benchmarks. Both cross-dataset batch effect and within-dataset batch effect are manually examined and removed when necessary. For the first type of reference panels, datasets too small (typically < 1,000 cells sequenced) are excluded because of insufficient training data. These datasets are still included in the second type of panels, where they are trained jointly with other datasets profiling the same organ in the same species. For each reference panel, four models with different starting points are trained. Latent space visualization, self-projection coverage and accuracy on all reference panels are available on our website.
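The two panel types can be summarized with a short sketch. The metadata fields (`name`, `species`, `organ`, `n_cells`) and the function name are hypothetical, and the 1,000-cell threshold is applied as a hard cutoff here although the text describes it only as typical.

```python
from collections import defaultdict


def build_panels(datasets, min_cells=1000):
    """Split datasets into individual panels and organ/species panels.

    datasets: list of dicts with keys "name", "species", "organ", "n_cells".
    """
    # Type 1: one panel per dataset, skipping datasets too small to train on.
    individual = [d["name"] for d in datasets if d["n_cells"] >= min_cells]
    # Type 2: datasets grouped by species and organ, small datasets included.
    grouped = defaultdict(list)
    for d in datasets:
        grouped[(d["species"], d["organ"])].append(d["name"])
    return individual, dict(grouped)
```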

Web interface

To make performing and visualizing Cell BLAST analysis convenient, we built a one-stop web interface. The client side was built with Vue.js, a JavaScript framework for single-page applications, and D3.js for cell ontology visualization. On the server side, we used Koa2, a web framework for Node.js. The Cell BLAST web portal with all accessible curated datasets is deployed on Huawei Cloud.

Acknowledgements

The authors thank Drs. Zemin Zhang, Cheng Li, Letian Tao, Jian Lu and Liping Wei at Peking University for their helpful comments and suggestions during the study. This work was supported by funds from the National Key Research and Development Program (2016YFC0901603), the China 863 Program (2015AA020108), as well as the State Key Laboratory of Protein and Plant Gene Research and the Beijing Advanced Innovation Center for Genomics (ICG) at Peking University. The research of G.G. was supported in part by the National Program for Support of Top-notch Young Professionals. Part of the analysis was performed on the Computing Platform of the Center for Life Sciences of Peking University, and supported by the High-performance Computing Platform of Peking University.

Main text references

1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. Basic local alignment search tool. J Mol Biol 215, 403–410 (1990).
2. Kiselev, V.Y., Yiu, A. & Hemberg, M. scmap: projection of single-cell RNA-seq data across data sets. Nat Methods 15, 359–362 (2018).
3. Srivastava, D., Iyer, A., Kumar, V. & Sengupta, D. CellAtlasSearch: a scalable search engine for single cells. Nucleic Acids Res 46, W141–W147 (2018).
4. Sato, K., Tsuyuzaki, K., Shimizu, K. & Nikaido, I. CellFishing.jl: an ultrafast and scalable cell search method for single-cell RNA sequencing. Genome Biol 20, 31 (2019).
5. Tung, P.Y. et al. Batch effects and the effective design of single-cell gene expression studies. Sci Rep 7, 39921 (2017).
6. Haghverdi, L., Lun, A.T.L., Morgan, M.D. & Marioni, J.C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol 36, 421–427 (2018).
7. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol 36, 411–420 (2018).
8. Lopez, R., Regier, J., Cole, M.B., Jordan, M.I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat Methods 15, 1053–1058 (2018).
9. Ding, J., Condon, A. & Shah, S.P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat Commun 9, 2002 (2018).
10. Grønbech, C.H. et al. scVAE: Variational auto-encoders for single-cell gene expression data. bioRxiv preprint, 318295 (2019).
11. Wang, D. & Gu, J. VASC: Dimension Reduction and Visualization of Single-cell RNA-seq Data by Deep Variational Autoencoder. Genomics Proteomics Bioinformatics (2018).
12. Pierson, E. & Yau, C. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol 16, 241 (2015).
13. Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat Commun 9, 284 (2018).
14. Montoro, D.T. et al. A revised airway epithelial hierarchy includes CFTR-expressing ionocytes. Nature 560, 319–324 (2018).
15. Plasschaert, L.W. et al. A single-cell atlas of the airway epithelium reveals the CFTR-rich pulmonary ionocyte. Nature 560, 377–381 (2018).
16. Tusi, B.K. et al. Population snapshots predict early haematopoietic and erythroid hierarchies. Nature 555, 54–60 (2018).
17. Velten, L. et al. Human haematopoietic stem cell lineage commitment is a continuous process. Nat Cell Biol 19, 271–281 (2017).
18. Tabula Muris Consortium et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018).
19. Abugessaisa, I. et al. SCPortalen: human and mouse single-cell centric database. Nucleic Acids Res 46, D781–D787 (2018).
20. Cao, Y., Zhu, J., Jia, P. & Zhao, Z. scRNASeqDB: A Database for RNA-Seq Based Gene Expression Profiles in Human Single Cells. Genes (Basel) 8 (2017).

Supplementary references

21. Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I. & Frey, B. Adversarial autoencoders. arXiv preprint (2015).
22. Love, M.I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15, 550 (2014).
23. Abadi, M. et al. in 12th USENIX Symposium on Operating Systems Design and Implementation 265–283 (2016).
24. Ganin, Y. & Lempitsky, V. Unsupervised domain adaptation by backpropagation. arXiv preprint (2014).
25. Xie, Q., Dai, Z., Du, Y., Hovy, E. & Neubig, G. in Advances in Neural Information Processing Systems 585–596 (2017).
26. Tzeng, E., Hoffman, J., Saenko, K. & Darrell, T. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 7167–7176 (2017).
27. Goodfellow, I. et al. in Advances in Neural Information Processing Systems 2672–2680 (2014).
28. Tange, O. GNU Parallel 2018 (2018).
29. Baglama, J., Reichel, L. & Lewis, B. irlba: Fast truncated singular value decomposition and principal components analysis for large dense and sparse matrices. R package version 2 (2017).
30. Paszke, A. et al. Automatic differentiation in PyTorch (2017).
31. Herrero, J. et al. Ensembl comparative genomics resources. Database (Oxford) 2016 (2016).
32. Lun, A.T., McCarthy, D.J. & Marioni, J.C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res 5, 2122 (2016).
33. Alavi, A., Ruffalo, M., Parvangada, A., Huang, Z. & Bar-Joseph, Z. A web server for comparative analysis of single-cell RNA-seq data. Nat Commun 9, 4768 (2018).
34. Han, X. et al. Mapping the Mouse Cell Atlas by Microwell-Seq. Cell 172, 1091–1107 e1017 (2018).
35. 10x Genomics. 1.3 Million Brain Cells from E18 Mice (2017).
36. Marques, S. et al. Oligodendrocyte heterogeneity in the mouse juvenile and adult central nervous system. Science 352, 1326–1329 (2016).
37. Pedregosa, F. et al. Scikit-learn: Machine learning in Python. Journal of machine learning research 12, 2825–2830 (2011).
38. Maaten, L.v.d. & Hinton, G. Visualizing data using t-SNE. Journal of machine learning research 9, 2579–2605 (2008).
39. McInnes, L., Healy, J. & Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint (2018).
40. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol (2018).
41. Barrett, T. et al. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res 41, D991–995 (2013).
42. Diehl, A.D. et al. The Cell Ontology 2016: enhanced content, modularization, and ontology interoperability. J Biomed Semantics 7, 44 (2016).
43. Muraro, M.J. et al. A Single-Cell Transcriptome Atlas of the Human Pancreas. Cell Syst 3, 385–394 e383 (2016).
44. Adam, M., Potter, A.S. & Potter, S.S. Psychrophilic proteases dramatically reduce single-cell RNA-seq artifacts: a molecular atlas of kidney development. Development 144, 3625–3632 (2017).
45. Guo, J. et al. The adult human testis transcriptional cell atlas. Cell Res 28, 1141–1157 (2018).
46. Baron, M. et al. A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure. Cell Syst 3, 346–360 e344 (2016).
47. Bach, K. et al. Differentiation dynamics of mammary epithelial cells revealed by single-cell RNA sequencing. Nat Commun 8, 2128 (2017).
48. Macosko, E.Z. et al. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell 161, 1202–1214 (2015).
49. Enge, M. et al. Single-Cell Analysis of Human Pancreas Reveals Transcriptional Signatures of Aging and Somatic Mutation Patterns. Cell 171, 321–330 e314 (2017).
50. Segerstolpe, A. et al. Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes. Cell Metab 24, 593–607 (2016).
51. Xin, Y. et al. RNA Sequencing of Single Human Islet Cells Reveals Type 2 Diabetes Genes. Cell Metab 24, 608–615 (2016).
52. Lawlor, N. et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes. Genome Res 27, 208–222 (2017).
53. Zheng, G.X. et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun 8, 14049 (2017).
54. Hashimshony, T. et al. CEL-Seq2: sensitive highly-multiplexed single-cell RNA-Seq. Genome Biol 17, 77 (2016).
55. Verboom, K. et al. SMARTer single cell total RNA sequencing. bioRxiv preprint, 430090 (2018).
56. Picelli, S. et al. Full-length RNA-seq from single cells using Smart-seq2. Nat Protoc 9, 171–181 (2014).
57. Klein, A.M. et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).
58. Wu, H. et al. Comparative Analysis and Refinement of Human PSC-Derived Kidney Organoid Differentiation with Single-Cell Transcriptomics. Cell Stem Cell 23, 869–881 e868 (2018).
59. Philippeos, C. et al. Spatial and Single-Cell Transcriptional Profiling Identifies Functionally Distinct Human Dermal Fibroblast Subpopulations. J Invest Dermatol 138, 811–825 (2018).
60. Park, J. et al. Single-cell transcriptomics of the mouse kidney reveals potential cellular targets of kidney disease. Science 360, 758–763 (2018).
61. Giraddi, R.R. et al. Single-Cell Transcriptomes Distinguish Stem Cell State Changes and Lineage Specification Programs in Early Mammary Gland Development. Cell Rep 24, 1653–1666 e1657 (2018).