Abstract
The advent of single-cell RNA sequencing (scRNA-seq) technologies has revolutionized transcriptomic studies. However, integrative analysis of scRNA-seq data remains a challenge, largely due to batch effects. We present single-cell Embedded Topic Model (scETM), an unsupervised deep generative model that recapitulates known cell types by inferring the latent cell topic mixtures via a variational autoencoder. scETM is scalable to over 10^6 cells and enables effective knowledge transfer across datasets. scETM also offers high interpretability and allows the incorporation of prior pathway knowledge into the gene embeddings. The scETM-inferred topics show enrichment in cell-type-specific and disease-related pathways.
Background
Advances in high-throughput sequencing technologies [1] provide an unprecedented opportunity to profile the transcriptomes of individual cells across various biological and pathological conditions, and have spurred the creation of several atlas projects [2–5]. Unsupervised clustering has emerged as a key application of scRNA-seq data, allowing cell types to be identified in a data-driven manner. Flexible, scalable, and interpretable computational methods are crucial for exploiting the full potential of the wealth of single-cell datasets and translating the transcription profiles into biological insights. Despite considerable progress in clustering method development for scRNA-seq data analysis [6–16], several challenges remain.
First, compared to bulk RNA-seq, scRNA-seq data commonly exhibit higher noise levels and drop-out rates, where the data only captures a small fraction of a cell’s transcriptome [17]. Changes in gene expression due to experimental design, often referred to as batch effects [18], can have a large impact on clustering [12, 18–20]. If not properly addressed, these technical artefacts may mask true biological signals in cell clustering.
Second, the partitioning of the cell population alone is insufficient to produce biological interpretation. In practice, annotating the cell clusters requires an extensive manual literature search, and the annotation quality may depend on users’ domain knowledge [20]. Therefore, an interpretable and flexible model is needed. In the current work, we define model interpretability as whether the model parameters can be directly used to associate the input features with latent factors or a target outcome. In particular, latent topic models are a popular approach in mining genomic data [21, 22], as one can use them to infer the topic distribution for both the samples and genomic features by probabilistically decomposing the samples-by-features matrix into samples-by-topics and topics-by-features matrices. However, their value in modeling scRNA-seq data has not been fully realized [23].
Third, model transferability is an important consideration. We consider a model as transferable if the learned knowledge manifested as the model parameters could benefit future data modeling. In the context of scRNA-seq data analysis, it translates to learning feature representations from one or more large-scale annotated reference datasets and applying the learned representations to a query dataset without annotation. As the number and size of scRNA-seq datasets continue to increase, there is an increasingly high demand for efficient exploitation and knowledge transfer from the existing reference datasets.
Several recent methods have attempted to address these challenges. Seurat [7] uses canonical correlation analysis to project cells onto a common embedding, then identifies, filters, scores and weights anchor cell pairs between batches to perform data integration. Harmony [24] iterates between maximum diversity clustering and a linear batch correction based on the mixture-of-experts model. Scanorama [10] performs all-to-all dataset matching by querying nearest neighbors of a cell among all remaining batches, after which it merges the batches with a Gaussian kernel to form a single cell panorama. These methods often do not scale to the entire genes-by-cells data matrices, or are vulnerable to the noise inherent to scRNA-seq read count data; hence they rely on feature (gene) selection and/or dimensionality reduction methods. They are also non-transferable, meaning the knowledge learned from one dataset cannot be easily transferred through model parameters to benefit the modeling of another dataset. LIGER [9] uses integrative non-negative matrix factorization to jointly factorize multiple scRNA-seq matrices across conditions using genes as the common axis, linking cells from different conditions by a common set of latent factors also known as metagenes. Although it relies on Seurat’s preprocessing pipeline, LIGER is weakly transferable in the sense that the global metagenes-by-genes matrix can be transferred when modeling new datasets, whereas in the case of Seurat both the correlation components and the anchor cell pairs must be recomputed.
Deep learning approaches, especially autoencoders, have demonstrated promising performance in scRNA-seq data modeling. scAlign [15] and MARS [25] encode cells with non-linear embeddings using autoencoders, which are naturally transferable across datasets. While scAlign minimizes the distance between the pairwise cell similarities in the embedding and original spaces, MARS looks for latent landmarks from known cell types to infer cells of unknown type. The variational autoencoder (VAE) [26] is an efficient probabilistic framework known to better account for noise compared to conventional autoencoders. scVAE-GM [11] changed the prior distribution of the latent variables in the VAE from Gaussian to Gaussian mixture, adding a categorical latent variable that clusters cells. Single-cell variational inference (scVI), another VAE-based method, models library size and takes into account batch effects in generating cell embeddings [6]. A key drawback of autoencoders for modeling scRNA-seq data is the lack of interpretability. These approaches often require posthoc analyses to interpret the learned model parameters and associate condition- and cell-type-specific gene signatures. To improve interpretability, a linear decoded VAE (hereafter referred to as scVI-LD) was proposed and included in the scVI software package [14].
In this paper, we present single-cell Embedded Topic Model (scETM), a generative topic model that facilitates integrative analysis of large-scale single-cell transcriptomic data. Our key contribution is the novel, efficient and scalable Bayesian inference framework which utilizes a transferable neural-network-based encoder while having an interpretable linear decoder. scETM simultaneously learns a set of highly interpretable cell embeddings, gene embeddings, topic embeddings, and batch effect embeddings from scRNA-seq data. The flexibility and expressiveness of the encoder network enable us to model extremely large raw scRNA-seq datasets. By the tri-factorization design, we are able to incorporate existing pathway information into gene embeddings during the model training to further improve interpretability, which is a salient feature compared to the related methods such as scVI-LD. This incorporation allows scETM to simultaneously discover interpretable cellular signatures and gene markers while integrating scRNA-seq data across conditions, subjects and experimental studies. We demonstrate that scETM offers state-of-the-art performance across a diverse range of datasets with desirable runtime and memory requirements. We also show scETM’s capability of effective knowledge transfer across datasets with different sequencing technologies and even cross-species. We then use scETM to discover biologically meaningful gene expression signatures and to differentiate known cell types as well as pathological conditions. We analyze scETM-inferred topics and show that several topics are enriched in cell-type-specific or disease-related pathways.
Finally, we directly incorporate known pathway-gene relationships (pathway gene sets) into scETM in the form of gene embeddings, and use the learned pathway-topic embedding to show the pathway-informed scETM (p-scETM)’s capability of learning biologically and pathologically meaningful information.
Results
scETM model overview
Topic models are a natural fit for scRNA-seq data modeling. Each sampled cell transcriptome can be viewed as a bag of genes, and its cell type identity can be inferred from its topic proportions. Each topic is a distribution over genes that captures a certain aspect of cell function. We choose the Embedded Topic Model (ETM) [27] as the backbone of our model, as it inherits the benefits of topic models and is especially effective for handling large and heavy-tailed vocabularies. The amortized inference process of ETM is very similar to that of VAEs, while the data modeling of the former is much more interpretable. We model the cells-by-genes read-count matrix by factorizing it into a cells-by-topics matrix θ and a topics-by-genes matrix β, which is further decomposed into topics-by-embedding α and embedding-by-genes ρ matrices (Fig. 1a,b). This tri-factorization design allows for the simultaneous embedding of cells, topics, and genes into low-dimensional spaces, and for exploring their relations in a highly interpretable way through automatically inferred latent topics. To account for biases across conditions or subjects, we introduce an optional batch correction parameter λ, which acts as an intercept term in the categorical softmax function to relieve the cell topic mixture θd of the burden of modeling batch variations. We infer the topic mixture θ of a cell (also referred to as the cell embedding) via a two-layer fully-connected neural network (Fig. 1c) using VAE [26]. Details are described in Methods.
(a) Probabilistic graphical model of scETM. We model the scRNA-profile read count matrix yd,g in cell d and gene g across S subjects or studies by a multinomial distribution with the rate parameterized by cell topic mixture θ, topic embedding α, gene embedding ρ, and batch effects λ. (b) Matrix factorization view of scETM. (c) Encoder architecture for inferring the cell topic mixture θ.
Clustering
We benchmarked scETM, along with seven state-of-the-art single-cell clustering or integrative analysis methods – scVI [6], scVI-LD [14], Seurat (integrated) [7], scVAE-GM [11], Scanorama [10], Harmony [24] and LIGER [9], on five published datasets, namely Mouse Pancreatic Islet (MP) [28], Human Pancreatic Islet (HP) [7], Tabula Muris (TM) [3], Alzheimer’s Disease dataset (AD), and Major Depressive Disorder dataset (MDD) [29]. Across all datasets, scETM performs on par with the state-of-the-art methods in terms of Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) (Table 1; Supp. Table S1). Specifically, it has the best clustering performance among the transferable and interpretable models. scETM stably yields competitive results, while the other methods fluctuate across the five datasets. Overall, Harmony and Seurat have slightly higher ARIs than scETM, with trade-offs in model transferability, interpretability, and/or scalability (more details next).
The clustering performance is measured by Adjusted Rand Index (ARI) between ground truth cell types and Leiden [74] clusters. NA is reported for models that did not converge. See Clustering performance benchmark of the existing methods section for experimental details. *Batch integration was turned off to prevent over-correction.
To further verify the clustering performance and validate our evaluation metrics, we visualized the cell embeddings using Uniform Manifold Approximation and Projection (UMAP) [30] (Fig. S2). This result demonstrates that scETM effectively captures cell-type-specific information, while accounting for artefacts arising from individual or technological variations. scETM is also robust to hyperparameter changes, requiring little or no hyperparameter tuning when applied to unseen datasets (Supp. Table S2). We also performed a comprehensive ablation analysis to validate our model choices. The ablation experiment demonstrates the necessity of key model components, such as the batch effect correction λ and batch normalization in the encoder. Normalizing gene expression as the input to the encoder also improves the performance (Supp. Table S3).
Scalability
A key advantage of scETM is its high scalability and efficiency. We demonstrate this by comparing the run time, memory usage, and clustering performance of the state-of-the-art models using their recommended pipelines when integrating a merged dataset consisting of MDD and AD (Fig. 2; see Efficiency and scalability benchmark of the existing methods section). Because of the simple model design and efficient implementation (sparse matrix representation and multithreaded data retrieval, to name a few tricks), scETM has the shortest run time among all deep-learning-based models. Specifically, on the largest dataset (148,247 cells), it runs 3-4 times faster than scVI and scVI-LD, and over 10 times faster than scVAE-GM. Notably, although the architectures are not exactly the same in these deep models, the run time is not heavily dependent on the architecture choices but rather on the implementation. Harmony and Scanorama are the only methods faster than scETM, yet they both operate on no more than a hundred principal components, while scETM can operate on all genes for better model transferability and interpretability.
The line styles in the plot indicate model inputs. The number of genes was fixed to 3000 in this experiment. See Efficiency and scalability benchmark of the existing methods section for experimental details.
Because of stochastic variational inference [26, 31, 32] and minibatch parameter updates, scETM takes almost constant run-time memory with respect to the sample size, with the increase attributed to the data loader. In contrast, the memory requirement of Seurat increases rapidly with the number of cells, due to the vast numbers of plausible anchor cell pairs in the two brain datasets. In accord with the results above, scETM consistently yields first-class clustering results, whereas Harmony and Scanorama show sub-optimal performance when dataset sizes vary. UMAP visual inspection of scVAE-GM embeddings suggests that scVAE-GM likely suffers from under-correction of batch effects (Supp. Fig. S3). The sudden drop of LIGER’s clustering performance on the largest benchmark dataset may be due to overfitting, because LIGER relies on frequentist numerical optimization of a least-squares objective, in contrast to the Bayesian inference used in ours and other approaches.
Transfer learning across single-cell datasets
A prominent feature of scETM is that its parameters, hence knowledge of modeling scRNA-seq data, are transferable across datasets. As an example, we trained an scETM model on the fluorescence-activated-cell-sorting-based Tabula Muris dataset (TM-FACS) from a multi-organ mouse single-cell atlas, and evaluated it using the MP data, which only contains mouse pancreatic islet cells (see Transfer learning with scETM section). Though the two datasets were obtained using different sequencing technologies, the model yields an encouragingly high ARI score of 0.94, considering that the ARI score is 0.95 if the model is directly trained on MP (Fig. 3a,b). Interestingly, the TM-FACS-pretrained model puts B cells, T cells and macrophages far away from other clusters and separates B cells and T cells from macrophages, which is not observed in the model directly trained on MP.
The embeddings of cells in the Mouse Pancreatic islet (MP) dataset inferred by three scETM models trained on (a) MP, (b) TM-FACS, (c) HP, respectively.
Our transfer learning can also be cross-species. Using the gene orthologs between human and mouse, scETM trained on the human pancreas (HP) dataset achieves a 0.79 ARI on MP (Fig. 3c), which surpassed scVI (0.39) and scVI-LD (0.49) on the same HP-MP transfer learning task. The performance is even better than scVAE-GM (0.66) trained directly on MP. This improvement is attributable to the gene embedding learning and the explicit batch effect correction in our scETM model. To assess scETM’s capability to embed unseen query cells and similar reference cells together in the embedding space, we trained a k-Nearest Neighbors classifier on the HP embeddings generated by the HP-pretrained scETM model, and evaluated it on the MP embeddings generated by the same HP-pretrained model, which was not trained on the MP data. The classifier achieves 79.8% accuracy in MP cell type prediction, demonstrating the capability of automatically annotating query scRNA-seq datasets using the pretrained scETM model on the reference scRNA-seq data. We expect that these results can be improved by further tuning of the model on the unannotated query data, or by the use of a compatible transfer learning framework such as MARS [25].
Gene set enrichment analysis of scETM topics
We next investigated whether the scETM-inferred topics are biologically relevant in terms of known human gene pathways. We conducted pathway enrichment analysis using pathDIP4, a data portal that integrates 24 major pathway databases [33]. For each topic, we selected the top 30 genes based on topic intensity as the input gene set and identified significantly enriched pathways based on a hypergeometric test with false discovery rate (FDR) below 0.05 [34]. We found that several topics learned from the human pancreatic islet dataset are significantly enriched in pathways relevant to pancreas functions, including insulin signalling pathway, fat digestion and absorption, starch and sucrose metabolism, etc. (Supp. Table S4). Topics learned from the AD and MDD datasets are also enriched in brain function-related pathways: about 64% and 37% of the topics for AD and MDD, respectively, have significant hits with neuronal system pathways (Supp. Table S5, S6).
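As an illustration of the enrichment test described above, the following minimal sketch computes the upper-tail hypergeometric p-value for the overlap between a topic's top genes and a pathway gene set. The function name and the example numbers are hypothetical, and the multiple-testing (FDR) correction applied in the analysis is not included here.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_universe, n_pathway, n_selected, n_overlap):
    """Return P(X >= n_overlap) for a hypergeometric draw.

    n_selected genes (e.g., the top 30 genes of a topic) are drawn without
    replacement from n_universe background genes, of which n_pathway belong
    to the pathway of interest.
    """
    total = comb(n_universe, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 10,000 background genes, a 50-gene pathway,
# and 5 of the top 30 topic genes falling inside that pathway.
p_value = hypergeom_enrichment_pvalue(10_000, 50, 30, 5)
```

The size of the background gene universe strongly affects the p-value, so it should match the set of genes the model was actually trained on.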
Interestingly, several topics are also enriched for disease-relevant pathways. In AD, the top 30 genes from topic 18 are enriched for Alzheimer’s Disease pathway itself, and the top 30 genes from topic 75 are highly enriched in amyloid fiber formation (Fig. 4a). Notably, amyloid fibrils are widely known to be associated with aging and AD, and β-amyloid plaques are among the major characteristics of AD brains [35–37]. We also found that the top 30 genes in topic 15 are enriched in the GABA synthesis pathway (FDR < 0.001), which is known to have an important role in AD pathogenesis [38, 39]. In MDD, topic 7 is enriched for neurodegenerative diseases such as Parkinson’s, Alzheimer’s and Huntington’s disease; topic 94 is enriched in toll-like receptor (TLR) pathway (FDR < 0.05), which is known to be associated with MDD severity [40, 41] (Fig. S1a).
(a) Gene-topic heatmap of the top 10 genes in each topic based on topic intensity. We annotated the top genes by the significantly enriched AD-related pathways per topic (rows). For visualization purposes, we divided the topic values by the maximum absolute value within the same topic. Only select topics are shown. (b) Topic intensities of cells (n=10,000) sub-sampled from the AD dataset. Topic intensities shown here are the Gaussian means before applying softmax. Only the select topics with the sum of absolute values greater than 1500 across all sampled cells are shown. The three color bars show disease conditions, cell types, and batch identifiers (i.e., subject IDs). (c) Differential expression analysis of topics across the 8 cell types and 2 clinical conditions. Z-scores of the two-sided t-tests are shown. Asterisks indicate Bonferroni q-value < 0.05 for one-sided t-test of up-regulated topics in each cell-type and two-sided t-test for disease-relevant topics.
Differential scETM topics in disease conditions and cell types
We sought to use the scETM topics to differentiate pathological conditions. We separated the cells derived from the AD subjects from the cells derived from the control subjects. We then performed two-sided t-tests to evaluate whether the two cell groups exhibit significant differences in terms of their topic expression (see Differential analysis of topic expression section). Here we consider the topic expression for each cell as the metagene expression because we projected the original gene expression of each cell onto the topic embedding (Fig. 4b), and we also observed that each of these topics is highly selective of a small fraction of the genes (Fig. 4a). We found that topics 12 and 58 are differentially expressed between the AD cells and control cells (Fig. 4c, d; t-test p-value ≈ 0). Interestingly, topic 58 is highly enriched for mitochondrial genes. Indeed, it is known that β-amyloids selectively build up in the mitochondria in the cells of AD-affected brains [42]. The MDD topics 52, 68, 77, 82, 86 also exhibit differential expression between the suicidal and healthy populations (Supp. Fig. S1c) and interesting neurological pathway enrichment (Supp. Table S6).
We also identified several cell-type-specific scETM topics from the AD and MDD datasets. In AD, as shown by both the cell embedding heatmap and the differential expression analysis (Fig. 4b), topics (or metagenes) 19, 50, 97 are up-regulated in oligodendrocytes, endothelial cells, and oligodendrocyte progenitor cells (OPCs), respectively (t-test Bonferroni q-value ≈ 0; Fig. 4, Supp. Fig. S4). Interestingly, two subpopulations of cells that exhibit high expression of topics 12 and 58 colocalize within the oligodendrocytes and one of the excitatory subclusters, respectively (Supp. Fig. S5). Meanwhile, a clear separation of AD and controls within those two subpopulations is present in the cell embedding. Among the AD cells, there is also a strong enrichment for the female subjects, which is consistent with the original finding [43]. For MDD, topics 1, 20 and 72 are up-regulated in astrocytes, oligodendrocytes and OPCs, respectively (Supp. Fig. S1c). This is consistent with the positive correlations observed in the heatmap (Supp. Fig. S1b).
Pathway-informed scETM topics
In the above analysis of the MDD dataset, we found that several topics are dominated by long non-coding RNAs (lincRNAs) (Fig. S1a). While previous studies have suggested that lincRNAs can be cell-type-specific [44], it remains difficult to interpret them [45]. This prompted us to incorporate the known pathway information in the form of gene embedding. In particular, we fixed the gene embedding ρ to a pathways-by-genes matrix obtained from the pathDIP4 pathway database (see Incorporation of pathway knowledge section) [33, 46] and learned only the pathways-by-topics embedding α, which provides a direct interpretation of disease-pathway associations. We tested our pathway-informed scETMs (p-scETM) on the HP, AD and MDD datasets. Without compromising the clustering performance (Supp. Table S8), p-scETM learned functionally meaningful topics as shown by the pathway-topic embedding α (Fig. 5). In the topic α-embedding inferred by p-scETM trained on HP, we found 6 topics with top pathways related to insulin signaling, and 6 topics related to nutrient digestion and metabolism (Fig. 5a). In the MDD α embedding, we found 8 topics with top pathways known as therapeutic targets for MDD treatment (Fig. 5b, Supp. Table S9) [47–55], 2 topics with top pathways related to MDD pathogenesis [53, 56], and 3 topics with top pathways correlated with MDD [57–59]. Notably, the top pathway in topic 40, “beta-2 adrenergic receptor signaling”, is also statistically enriched (p=0.021) in MDD genome-wide association studies (GWAS) [60].
(a) The pathway-topics heatmap of top 5 pathways in selected topics, inferred by a p-scETM model trained on HP. Pathways related to pancreas function, insulin signalling and digestion are highlighted. (b) The pathway-topics heatmap of top 5 pathways in selected topics, inferred by a p-scETM model trained on MDD. Pathways related to MDD pathogenesis and therapeutic targets are highlighted. (c) The pathway-topics heatmap of top 7 pathways in selected topics, inferred by a p-scETM model trained on AD. Pathways related to AD pathogenesis and therapeutic targets are highlighted.
In the AD topic α-embedding, we found 6 topics with top pathways related to AD treatment and 1 topic related to AD pathogenesis (Fig. 5c, Supp. Table S10) [38, 39, 61–65]. Importantly, the “Alzheimer disease-amyloid secretase” pathway, which is directly related to AD pathogenesis [66], is the seventh-highest expressed pathway in topic 9. Therefore, the p-scETM-inferred topics are highly related not only to the primary tissue types but also to the diseases of interest, although overall the former tends to be the predominant signal we observe in our analyses.
Discussion
As scRNA-seq technologies become increasingly affordable and accessible, large-scale datasets have emerged. This challenges traditional statistical approaches and calls for robust, reliable and scalable representation learning methods to mine the latent biological knowledge from the vast amount of scRNA-seq data. To address this challenge, we developed scETM and demonstrated its state-of-the-art performance on the unsupervised clustering task across diverse datasets. scETM demonstrates excellent capabilities of batch effect correction and knowledge transfer across datasets. Many integration methods require running on both reference and query datasets to perform posthoc analyses such as joint clustering and label transfer [7, 9, 10, 24]. In contrast, our method enables a direct knowledge transfer of the reference-based pretrained parameters in annotating a new dataset, which is more efficient than the existing methods. The single-cell Hierarchical Poisson Factorization (scHPF) model, recently proposed in [23], applies hierarchical Poisson factorization to discover interpretable gene expression signatures in an attempt to address the interpretability challenge. However, compared to our model, scHPF lacks the flexibility in learning the gene embedding and incorporating existing pathway knowledge, and is not designed to account for batch effects. Moreover, scETM has the benefits of both interpretability in the linear decoder and flexibility in the neural network encoder. Our qualitative experiments show that scETM topics preserve cell functional and state-specific biological signals within single-cell transcriptome profiles. By seamlessly incorporating the known pathway information in the gene embedding, p-scETM finds biologically and pathologically important pathways without the need for posthoc analyses. Together with its scalability and interpretability, this makes scETM a useful tool for large-scale single-cell transcriptome analysis.
Methods
scETM data generative process
To model scRNA-seq data distribution, we take a topic-modeling approach [67]. In our framework, each cell is considered as a “document”, each scRNA-seq read as a “token” in the document, and the gene that gives rise to the read is considered as a “word” from the vocabulary of size V. We assume that each cell can be represented as a mixture of latent cell types, which are commonly referred to as the latent topics. The original LDA model [67] defines a fixed set of K independent Dirichlet distributions β over a vocabulary of size V. Following the ETM model, here we decompose the unnormalized K × V topic distribution β* into the topic embedding α ∈ ℝ^{K×L} and gene embedding ρ ∈ ℝ^{L×V}, where L denotes the size of the embedding space. Therefore, the unnormalized probability of a gene belonging to a topic is proportional to the dot product between the embeddings of the topic and gene. Formally, the data generating process of each scRNA-seq profile d is:
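Draw a latent cell type proportion θd for a cell d from the logistic normal distribution θd ∼ L𝒩(0, I):
$$\delta_d \sim \mathcal{N}(0, I), \qquad \theta_{d,k} = \frac{\exp(\delta_{d,k})}{\sum_{k'=1}^{K} \exp(\delta_{d,k'})}$$
For each read i in cell d, draw the gene that gives rise to it from a categorical distribution, and count the reads per gene to obtain its expression:
$$w_{i,d} \sim \prod_{g=1}^{V} r_{d,g}^{[w_{i,d} = g]}, \qquad y_{d,g} = \sum_{i=1}^{N_d} [w_{i,d} = g]$$
Here Nd is the library size of cell d, wi,d is the index of the gene that gives rise to the ith read in cell d (with [wi,d = g] indicating that this gene is g), and yd,g is the total read count of gene g in cell d. The transcription rate rd,g is parameterized as follows:
$$r_{d,g} = \frac{\exp(\theta_d \alpha \rho_g + \lambda_{s(d),g})}{\sum_{g'=1}^{V} \exp(\theta_d \alpha \rho_{g'} + \lambda_{s(d),g'})}$$
Here θd is the 1 × K cell topic mixture for cell d, α is the global K × L cell topic embedding variable, ρg is an L × 1 gene-specific transcriptomic embedding, and λs(d),g is the batch-dependent and gene-specific scalar effect, where s(d) indicates the batch index for cell d. Notably, to model the sparsity of gene expression in each cell (i.e., only a small fraction of the genes have non-zero expression), we use the softmax function to normalize the transcription rate over all of the genes.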
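The generative process above can be simulated with a small sketch. It assumes that the transcription rate is the softmax over genes of θd α ρg + λs(d),g, as the definitions above indicate. All function names, toy dimensions and random parameter values are hypothetical, and plain Python lists stand in for the tensors of a real implementation.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def transcription_rates(theta, alpha, rho, lam):
    """Softmax over genes of theta (1 x K) @ alpha (K x L) @ rho (L x V) + lam."""
    K, L, V = len(alpha), len(rho), len(rho[0])
    topic_mix = [sum(theta[k] * alpha[k][l] for k in range(K)) for l in range(L)]
    logits = [
        sum(topic_mix[l] * rho[l][g] for l in range(L)) + lam[g] for g in range(V)
    ]
    return softmax(logits)


def sample_cell(alpha, rho, lam, library_size, rng):
    """Draw one cell: delta ~ N(0, I), theta = softmax(delta), reads ~ Cat(r)."""
    K, V = len(alpha), len(rho[0])
    delta = [rng.gauss(0.0, 1.0) for _ in range(K)]
    theta = softmax(delta)
    rates = transcription_rates(theta, alpha, rho, lam)
    counts = [0] * V
    # Each read picks one gene; the per-gene tallies form the count vector y_d.
    for g in rng.choices(range(V), weights=rates, k=library_size):
        counts[g] += 1
    return theta, rates, counts


rng = random.Random(0)
K, L, V = 3, 4, 6  # toy numbers of topics, embedding dimensions and genes
alpha = [[rng.gauss(0.0, 1.0) for _ in range(L)] for _ in range(K)]
rho = [[rng.gauss(0.0, 1.0) for _ in range(V)] for _ in range(L)]
lam = [0.0] * V  # batch intercept for the batch this cell belongs to
theta, rates, counts = sample_cell(alpha, rho, lam, library_size=500, rng=rng)
```

Because the softmax couples all genes, raising the rate of one gene necessarily lowers the others, which is how a fixed library size is shared across the vocabulary.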
scETM model inference
In scETM, we treat the latent cell type mixture θd for each cell d as the only latent variable. We treat the topic embedding α, the gene-specific transcriptomic embedding ρ, and the batch effect λ as point estimates. Let Y be the D × V gene expression matrix for D cells and V genes. The posterior distribution of the latent variables p(Θ|Y) is intractable. Hence, we took a variational inference approach using a proposed distribution q(δd) to approximate the true posterior. Specifically, we define the following proposed distribution: q(δ | y) = Πd q(δd|yd), where
$$q(\delta_d \mid y_d) = \mu_d + \mathrm{diag}(\sigma_d)\,\mathcal{N}(0, I), \qquad [\mu_d, \log \sigma_d^2] = \mathrm{NNET}(\tilde{y}_d; W_\theta).$$
Here ỹd is the normalized gene expression, computed as the read counts for each gene divided by the total reads in cell d. The function NNET(v; W) is a two-layer feed-forward neural network used to estimate the sufficient statistics of the proposed distribution for the cell topic mixture θd.
To learn the above variational parameters Wθ, we maximize the evidence lower bound (ELBO) of the log likelihood, which is equivalent to minimizing the Kullback-Leibler (KL) divergence between the true posterior and the proposed distribution: ELBO = 𝔼q[log p(Y|Θ)] − KL[q(Θ|Y)||p(Θ)]. The Bayesian model is learned by maximizing the reconstruction likelihood with regularization in the form of the KL divergence of the proposed distribution from the prior. For computational efficiency, we optimize the ELBO with respect to the variational parameters by amortized variational inference [26, 31, 32]. Specifically, for each minibatch of cells we draw a sample of the latent variables from the reparameterized Gaussian proposed distribution q(δ | y) [26], whose mean and variance are determined by the NNET function. We then use those draws as noisy estimates of the variational expectation in the ELBO. The optimization is then carried out by back-propagating the ELBO gradients into the variational parameters.
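A single-sample Monte Carlo estimate of the ELBO for one cell can be sketched as follows. This is a simplified illustration rather than the released implementation: the decoder is reduced to a hypothetical K × V matrix `beta` (standing in for the product αρ, with no batch term), the multinomial normalizing constant is dropped, and gradients are not computed. The KL term uses the standard closed form for a diagonal Gaussian against a standard normal prior.

```python
import math
import random


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(values):
    m = max(values)
    lse = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - lse for v in values]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def elbo_single_sample(counts, mu, log_var, beta, rng, kl_weight=1.0):
    """One-sample ELBO estimate for one cell (hypothetical helper)."""
    # Reparameterization trick: delta = mu + sigma * eps with eps ~ N(0, I).
    delta = [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)
    ]
    theta = softmax(delta)
    n_genes = len(beta[0])
    logits = [
        sum(theta[k] * beta[k][g] for k in range(len(theta))) for g in range(n_genes)
    ]
    log_rates = log_softmax(logits)
    # Multinomial log-likelihood up to an additive constant.
    log_lik = sum(y * lr for y, lr in zip(counts, log_rates))
    return log_lik - kl_weight * kl_to_standard_normal(mu, log_var)


rng = random.Random(1)
counts = [5, 0, 2, 1]                     # toy read counts for 4 genes
mu, log_var = [0.3, -0.2], [-1.0, -0.5]   # toy encoder outputs for 2 topics
beta = [[0.5, -1.0, 0.2, 0.0], [-0.3, 0.8, 0.1, -0.6]]
elbo_value = elbo_single_sample(counts, mu, log_var, beta, rng)
```

In practice the estimate is averaged over a minibatch of cells and maximized with an automatic-differentiation framework.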
scETM implementation details
We implemented scETM using the PyTorch library. We chose the encoder to be a 2-layer neural network, with hidden sizes of (256, 128), ReLU activations [68], 1D batch normalization [69], and 0.1 dropout rate between layers. We set the gene embedding dimension to 300, and the number of topics to 100. We optimize our model with the Adam optimizer and a 0.02 learning rate. To prevent over-regularization, we start with zero weight on the KL divergence and linearly increase the weight of the KL divergence in the ELBO loss function during the first 300 epochs. We show that our model is robust to changes in the above hyperparameters (Supp. Table S2). During the evaluation, we used the variational mean of the unnormalized topic mixture µd as the scETM cell embedding for cell d. With a minibatch size of 2000, scETM typically needs 5k-20k training steps to converge.
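The linear KL warm-up described above can be written as a small schedule function. This is a minimal sketch: the function name is hypothetical, and the final weight of 1.0 is an assumption, since the text only states that the weight starts at zero and increases linearly over the first 300 epochs.

```python
def kl_weight(epoch, warmup_epochs=300, max_weight=1.0):
    """Linearly ramp the KL weight from 0 to max_weight over warmup_epochs.

    max_weight=1.0 is an assumed final value; after the warm-up the weight
    stays constant.
    """
    if warmup_epochs <= 0:
        return max_weight
    return max_weight * min(1.0, epoch / warmup_epochs)


# Weights at a few example epochs of a hypothetical training run.
schedule = [kl_weight(epoch) for epoch in (0, 150, 300, 1000)]
```

The returned value multiplies the KL term of the ELBO at each epoch, so early training is driven mostly by the reconstruction likelihood.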
Transfer learning with scETM
We trained scETM on the MP dataset and visualized the cell embedding using UMAP (Fig. 3a). For TM-FACS, we first subset the genes of both TM-FACS and MP to their intersection (13263 genes). We then trained scETM on the processed TM-FACS and evaluated it on MP (Fig. 3b). For HP, we first matched the orthologous genes (12603 genes) based on the Mouse Genome Informatics database [70, 71]. We then trained and evaluated scETM on the aligned gene sets (Fig. 3c). The k was set to 5 for the k-NN classifier trained on the reference embeddings and the reference cell types and used to predict cell types of the query cells.
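The label-transfer step can be illustrated with a plain k-nearest-neighbors majority vote over cell embeddings. This sketch assumes Euclidean distance and majority voting, neither of which is specified in the text; the function name and the toy two-dimensional embeddings are hypothetical.

```python
import math
from collections import Counter


def knn_predict(ref_embeddings, ref_labels, query_embedding, k=5):
    """Predict a cell type by majority vote among the k nearest reference cells."""
    distances = sorted(
        (math.dist(ref, query_embedding), label)
        for ref, label in zip(ref_embeddings, ref_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy reference embeddings (stand-ins for embeddings of annotated cells).
ref_embeddings = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
    (5.0, 5.0), (5.1, 4.9), (4.8, 5.2),
]
ref_labels = ["alpha", "alpha", "alpha", "beta", "beta", "beta"]

# A query cell embedded by the same pretrained encoder.
predicted = knn_predict(ref_embeddings, ref_labels, (0.3, 0.3), k=5)
```

Both the reference and query embeddings must come from the same pretrained encoder for the distances to be comparable.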
Differential analysis of topic expression
We aimed to identify topics that are differentially associated with known cell type labels or disease conditions. For each label (e.g., AD positive), we first separated the cells into cells with the label and cells without the label. We then performed two-sided t-tests to evaluate whether the cells with the label exhibit significantly higher or lower topic expression relative to the cells without the label. Here we used the Gaussian topic expression (i.e., δ) without the softmax transformation because it is better suited to the normality assumption of the t-test. We determined a topic to be differentially expressed (DE) if the Bonferroni-corrected p-value is lower than 0.01 (i.e., q-value < 0.01). Supp. Table S7 summarizes the number of DE topics we identified for each cell type and disease condition from the AD and MDD data.
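The multiple-testing step can be sketched as follows, given one t-test p-value per topic for a single label. Correcting over the number of topics tested for that label is an assumption (the text does not state the size of the test family), and the function name is hypothetical.

```python
def bonferroni_de_topics(p_values, alpha=0.01):
    """Return indices of DE topics and the Bonferroni-adjusted p-values."""
    n_tests = len(p_values)
    # Bonferroni: multiply each p-value by the number of tests, capped at 1.
    adjusted = [min(1.0, p * n_tests) for p in p_values]
    de_topics = [i for i, q in enumerate(adjusted) if q < alpha]
    return de_topics, adjusted
```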
Incorporation of pathway knowledge
We downloaded the pathDIP4 pathway database from [46]. Pathway gene sets containing fewer than five genes were removed. We represent the pathway knowledge as a pathways-by-genes ρ matrix, where ρij = 1 if gene set i contains gene j, and ρij = 0 otherwise. For the fixed-rho version of p-scETM, we fix the gene embedding matrix ρ to the pathways-by-genes matrix.
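A minimal sketch of building the binary pathways-by-genes matrix is given below. It assumes the gene sets are available as a mapping from pathway name to gene symbols and that the five-gene filter is applied before intersecting with the dataset's genes; both points, and the function name, are illustrative assumptions.

```python
def build_pathway_gene_matrix(pathways, genes, min_genes=5):
    """Return (pathway names, binary rows) with rho[i][j] = 1 if set i has gene j."""
    gene_index = {gene: j for j, gene in enumerate(genes)}
    names, rho = [], []
    for name, members in pathways.items():
        if len(set(members)) < min_genes:
            continue  # drop gene sets with fewer than min_genes genes
        row = [0] * len(genes)
        for gene in members:
            j = gene_index.get(gene)
            if j is not None:  # genes absent from the dataset are ignored
                row[j] = 1
        names.append(name)
        rho.append(row)
    return names, rho
```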
Clustering performance benchmark of the existing methods
We assessed the performance of each method by two metrics: Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI). ARI [72] and NMI are widely used representatives of two families of clustering agreement measures, pair-counting and information-theoretic measures, respectively. A high ARI or NMI indicates a high degree of agreement between a given clustering result and the ground-truth cell type labels.
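For reference, the pair-counting definition of ARI can be written in a few lines of standard-library Python. This is an illustrative sketch rather than the implementation used in the benchmark; NMI is omitted because its value depends on the normalization variant, which the text does not specify.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from the contingency table of two labelings."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    # Number of cell pairs grouped together in both, in truth, and in prediction.
    sum_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_both - expected) / (max_index - expected)
```

The score is invariant to relabeling of clusters, so the cluster identifiers need not match the cell type names.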
All embedding plots were generated using the Python scanpy package [16]. We used UMAP [30] to reduce the dimension of the embeddings to 2 for visualization, and Louvain [73] and Leiden [74] to cluster the cell embeddings. During clustering, we tried multiple resolution values and reported the result with the highest ARI for each method. We ran all methods under their default pipeline settings (see Experimental details of other scRNA-seq methods), and we used the batch correction option whenever applicable to account for batch effects. All results were obtained on a compute cluster with Intel Gold 6148 Skylake CPUs and Nvidia V100 GPUs. We limited each experiment to 8 CPU cores, 128 GB RAM and 1 GPU.
Efficiency and scalability benchmark of the existing methods
To create a benchmark dataset for evaluating the run time of each method, we merged MDD and AD, keeping genes that appear in both datasets. We then selected the 3000 most variable genes using scanpy's highly_variable_genes(n_top_genes=3000, flavor='seurat_v3') function, and randomly sampled 14,000, 28,000, 70,000 and 148,247 (all) cells to create our benchmark datasets. The memory requirements reported in Fig. 2 were obtained by reading the VmRSS entry in /proc/[pid]/status at the end of each process. We kept the same experimental settings (RAM size, number of GPUs, etc.) as in the Clustering performance benchmark of the existing methods section above.
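Reading the resident memory of a process from /proc can be sketched as below. The helper names are hypothetical, the approach is Linux-specific, and the parsing is separated from the file access so that it can be checked on any platform.

```python
def parse_vmrss_kb(status_text):
    """Extract the VmRSS value (in kB) from the text of /proc/[pid]/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # kernel threads and some processes have no VmRSS entry


def read_vmrss_kb(pid="self"):
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vmrss_kb(handle.read())
```

Calling `read_vmrss_kb()` just before a process exits gives its resident set size at that moment, which is not necessarily its peak memory use.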
Funding
YL is supported by New Frontier Research Fund - Exploration (NFRFE-2019-00980). YZ is supported by Jacqueline Johnson Desoer Science Undergraduate Research Award (SURA).
Availability of data and materials
The scETM source code and the benchmarking workflows have been deposited in the GitHub repository (https://github.com/li-lab-mcgill/scETM). The datasets analysed during the current study are from publicly available repositories or data portals. The acquisition and quality control steps for all datasets are included in the supplementary information.
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Authors’ contributions
YL and JT conceived of the study. YZ, HC, YL analyzed and interpreted the data, wrote the manuscript, and wrote the code for scETM. HC optimized and completed the final scETM code. ZZ ran some initial experiments. YL and JT supervised the project. All authors approved the final manuscript.
Additional Files
Additional file 1
Supplementary Information, including Tables S1-3,7-11, Figures S1-4, and supplementary methods.
Additional file 2
Table S4: scETM 100-topic enrichment for Human Pancreas scRNA-seq data.
Table S5: scETM 100-topic enrichment for Alzheimer’s Disease snRNA-seq data.
Table S6: scETM 100-topic enrichment for Major Depressive Disorder snRNA-seq data.
Supplementary Methods
.1 Data processing
All of the single-cell datasets used in this study are from publicly available repositories or data portals. We describe below the acquisition and quality control (QC) for each of the datasets used in the current work.
.1.1 Human pancreatic islet
We obtained the human pancreatic islet dataset and the ground truth cell type labels from Satija Lab at the following link: https://satijalab.org/seurat/v3.0/integration.html (accessed 1 Dec 2020), originally deposited by Stuart et al. [7]. This dataset is a compilation of scRNA-seq data from five studies which can be accessed using the following Gene Expression Omnibus (GEO) accession numbers: GSE81076 (CelSeq), GSE85241 (CelSeq2), GSE86469 (Fluidigm C1), E-MTAB-5061 (SMART-Seq2), and GSE84133 (inDrops). A QC step was conducted by [7], and no additional QC was performed. In our benchmarking experiment, we use the different scRNA-seq technologies as the batch variable.
.1.2 Mouse pancreatic islet
We obtained the mouse pancreatic islet data and ground truth cell type labels from GSE84133 (inDrops) without conducting an additional QC step. There are 1,886 mouse cells from two mice of different strains, ICR and C57BL/6 [28]. The cell counts from the two strains are approximately equal. In our benchmarking experiment, we treated the mouse strain as the batch variable because of the different genetic backgrounds.
.1.3 Major Depressive Disorder (MDD)
We obtained the 10X Genomics-based MDD snRNA-seq dataset with ground truth cell type labels from GSE144136. A strict QC step was conducted in the original empirical study by [29], where cells with fewer than 110 detected genes were removed. The top 0.5% of cells based on the total number of unique molecular identifiers (UMIs) detected in each cell were also excluded because they are likely to be multiplets rather than single nuclei. No additional QC was performed. The MDD dataset consists of 78,886 cells from the dorsolateral prefrontal cortex of 34 male participants. Participants in the control group (n=17), who died of natural causes, and in the case group (n=17), who died by suicide, were matched for age (18–87 years), postmortem interval (12–93h) and brain pH (6–7.01) [29]. The number of cells from each donor is approximately the same.
.1.4 Alzheimer’s disease (AD)
We obtained the droplet-based AD snRNA-seq data and the corresponding ground truth cell type labels from Synapse (https://www.synapse.org/#!Synapse:syn18485175) under the doi 10.7303/syn18485175, and the metadata from https://www.synapse.org/#!Synapse:syn3157322. A strict QC step based on UMI counts and mitochondrial ratio values was conducted in the original empirical study by Mathys et al. [43]. The AD dataset consists of 70,634 cells from the prefrontal cortex of 48 individuals, both male and female, in the Religious Order Study (ROS) or the Rush Memory and Aging Project (MAP), two longitudinal cohort studies of aging and dementia. The case group consists of 24 individuals with high levels of β-amyloid and other pathological hallmarks of AD, and the control group consists of 24 individuals who have no or very low β-amyloid or other pathologies.
Study data were provided by the Rush Alzheimer’s Disease Center, Rush University Medical Center, Chicago. Data collection was supported through funding by NIA grants P30AG10161 (ROS), R01AG15819 (ROSMAP; genomics and RNAseq), R01AG17917 (MAP), R01AG30146, R01AG36042 (5hC methylation, ATACseq), RC2AG036547 (H3K9Ac), R01AG36836 (RNAseq), R01AG48015 (monocyte RNAseq), RF1AG57473 (single nucleus RNAseq), U01AG32984 (genomic and whole exome sequencing), U01AG46152 (ROSMAP AMP-AD, targeted proteomics), U01AG46161 (TMT proteomics), U01AG61356 (whole genome sequencing, targeted proteomics, ROSMAP AMP-AD), the Illinois Department of Public Health (ROSMAP), and the Translational Genomics Research Institute (genomic). Additional phenotypic data can be requested at www.radc.rush.edu.
.1.5 Tabula Muris
We obtained the Tabula Muris dataset with ground truth cell type labels from FigShare (https://figshare.com/projects/Tabula_Muris_Transcriptomic_characterization_of_20_organs_and_tissues_from_Mus_musculus_at_single_cell_resolution/27733) for the Version 2 release [3]. This dataset includes mouse single-cell transcriptome data sequenced by two technologies: microfluidic droplet-based, and fluorescence-activated cell sorting (FACS)-based. A QC cutoff was applied in the original empirical study, where only cells with at least 500 genes and 50,000 reads were kept. The droplet subset includes data for 422,803 droplets, 55,656 of which passed the QC cutoff. The TM-FACS subset contains data for 53,760 cells, 44,879 of which passed the QC cutoff.
.2 Experimental details of other scRNA-seq methods
Neural-network-based models, including scVI, scVI-LD, scVAE-GM and scETM, typically need at least 5000 gradient updates to converge. When running on small datasets, the number of gradient updates per epoch may be very small (even 1). In these cases, we increase the number of epochs T to ensure the model goes through at least 5000 gradient updates, i.e., T × ⌈N/B⌉ ≥ 5000, where N is the number of cells in the dataset and B is the mini-batch size.
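The epoch rule described above can be sketched as follows. It assumes that each epoch performs ⌈N/B⌉ gradient updates, i.e., that a partial final mini-batch counts as one update; the function name is hypothetical.

```python
import math


def epochs_for_min_updates(n_cells, batch_size, min_updates=5000):
    """Smallest number of epochs T giving at least min_updates gradient updates."""
    updates_per_epoch = math.ceil(n_cells / batch_size)  # assumed: partial batch counts
    return math.ceil(min_updates / updates_per_epoch)
```

For example, with the 1,886 mouse pancreas cells and a mini-batch size of 2000, each epoch is a single update, so this rule gives 5000 epochs.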
.2.1 Seurat v3
We downloaded Seurat v3 (version 3.1.5) from CRAN [8]. We followed the steps outlined by the integration workflow (https://satijalab.org/seurat/v3.2/integration.html), which includes NormalizeData, FindVariableFeatures, FindIntegrationAnchors, and IntegrateData. To make the comparisons more equitable, we set min.features=0 to avoid exclusion of cells. All other parameters were set as default. We noted that, with batch integration turned on, Seurat reports an error in the integration step due to the high number of anchors arising from the 48 individuals (the batch variable in AD), which is a known implementation issue with the standard Seurat v3 integration workflow [75]. We therefore turned off the batch integration for AD in the benchmarking experiments (see Clustering performance benchmark of the existing methods and Efficiency and scalability benchmark of the existing methods) and followed the steps described in the Guided Clustering Tutorial (https://satijalab.org/seurat/v3.2/pbmc3k_tutorial.html).
.2.2 Scanorama
We downloaded the source code from the GitHub repository brianhie/scanorama. We used the integrate_scanpy function for dataset integration and batch correction, as suggested by the guided tutorial. All parameters were set as default. The algorithm performs a PCA on the stacked datasets and uses 100 PCs for downstream computation.
.2.3 Harmony
We downloaded the source code from the GitHub repository slowkow/harmonypy, as suggested by the primary repository immunogenomics/harmony, and followed the preprocessing (normalization and top variable gene selection) described in the publication and the integration steps in the provided tutorial. We used the run_harmony function to obtain the corrected PCA embeddings and used 50 PCs as input. All other parameters were set as default.
.2.4 LIGER
We used the official implementation provided on the website of the Liger package. For the convenience of implementation, we followed the usage tutorial using Seurat Wrapper to process the raw data and then ran Liger with default parameters.
.2.5 scVI/scVI-LD
We downloaded the implementation from the GitHub repository YosefLab/scVI. We used the default model, which has one layer for both the encoder and decoder (for scVI-LD the decoder is a latent dimensions-by-genes matrix), 128 hidden units, 10 latent dimensions and a ZINB distribution for modeling the data. We chose 10⁻³ as the learning rate and trained on each unprocessed dataset for 400 epochs, following the provided tutorials. We changed the training batch size to 2000 for faster training. We obtained the cell embeddings via the get_latent method.
.2.6 scVAE
We downloaded the implementation from the GitHub repository scvae/scvae. We set the hidden units to be (256, 128) for the encoder. The decoder is symmetric to the encoder. Latent dimension was set to 128 to match scETM. We chose 10⁻⁴ as the learning rate and the NB distribution for modeling the data, following the authors’ recommendation. We trained on each unprocessed dataset for 400 epochs with a batch size of 250, including a 200-epoch warm-up for the KL divergence loss. In the scalability benchmark, we disabled the time-consuming per-epoch checkpoints to match other methods. The model did not converge on the Human Pancreatic Islet dataset, where the ELBO went to infinity. It failed to extract meaningful information from the Tabula Muris dataset, resulting in an ARI of 0.0.
Supplementary Figures
(a) Gene embedding heatmap of the top 10 genes in selected topics, where genes are ordered based on topic intensity. Only topics with differential expression with respect to cell types or MDD, or with significant enrichment in MDD-related pathways, are shown, and these properties are also reflected in the bottom annotations. Rows correspond to topic indices. (b) Cell embedding heatmap of cells (n=10,000) sampled from the MDD dataset. Only topics in which the sum of absolute intensity values across all of the sampled cells is above 1,500 are shown. Both rows and columns are clustered using average linkage hierarchical clustering and ordered accordingly. Blue arrows: topics enriched in MDD-related pathways; grey arrows: topics with DE in the MDD positive population; red arrows: topics with DE with respect to cell types. (c) DE analyses across 8 cell types and 2 clinical conditions. Z-scores of t-tests are shown. Rows correspond to topic indices. Red arrows indicate topics with DE with respect to cell types, and grey arrows indicate topics with DE with respect to MDD conditions. Asterisks indicate Bonferroni q-value < 0.05 for the one-sided t-test of up-regulated topics in each cell type and the two-sided t-test for disease-relevant topics.
Cell embedding visualization using UMAP on the MP (left) and HP (right) datasets.
UMAP visualization of scETM, Liger and scVAE-GM cell embeddings on the benchmark dataset (MDD-AD), which includes 148,247 cells and 3000 genes.
UMAP cell embedding visualization on the AD dataset, colored by differentially expressed topics (or metagenes) and ground truth labels for the cell types. Circled cell clusters were discussed in the main text (see Differential scETM topics in disease conditions and cell types section).
UMAP cell embedding visualization on the AD dataset, colored by differentially expressed topics (or metagenes) and AD/control or Male/Female labels. Circled cell clusters were discussed in the main text (see Differential scETM topics in disease conditions and cell types section).
Supplementary Tables
Normalized Mutual Information (NMI) between ground truth cell types and Leiden clusters on 5 benchmark scRNA-seq datasets. NA is reported for models that did not converge.
Robustness analysis of the scETM model. Changing the encoder architecture, gene embedding dimensions and number of topics has limited impact on model performance. We report the average ARI of three repeated trials. scETM was trained on MP for 6000 epochs and on HP for 2000 epochs.
Ablation study of the scETM model. We report the average ARI of three repeated trials. scETM was trained on MP for 6000 epochs and on HP for 2000 epochs.
Table S4: scETM 100-topic enrichment for Human Pancreas scRNA-seq data. Only pathway hits with FDR < 0.05 are included. The table is saved in Additional file 2.xls.
Table S5: scETM 100-topic enrichment for Alzheimer’s Disease snRNA-seq data. Only pathway hits with FDR < 0.05 are included. The table is saved in Additional file 2.xls.
Table S6: scETM 100-topic enrichment for Major Depressive Disorder snRNA-seq data. Only pathway hits with FDR < 0.05 are included. The table is saved in Additional file 2.xls.
Differential expression (DE) analysis summary of topics in AD and MDD data. + indicates up-regulation, and − indicates down-regulation.
Adjusted Rand Index (ARI) comparison of scETMs and p-scETMs in three human single cell transcriptomics datasets. Refer to Incorporation of pathway knowledge section for experimental details.
MDD-relevant pathways from the pathway-topic embedding inferred by p-scETM trained on the MDD dataset.
AD-relevant pathways from the pathway-topic embedding inferred by p-scETM trained on the AD dataset.