ABSTRACT
Single-cell RNA-seq and ATAC-seq analyses have been widely applied to decipher cell-type and regulation complexities. However, experimental conditions often confound biological variations when comparing data from different samples. For integrative single-cell data analysis, we have developed SCALEX, a deep generative framework that maps cells into a generalized, batch-invariant cell-embedding space. We demonstrate that SCALEX accurately and efficiently integrates heterogeneous single-cell data using multiple benchmarks. It outperforms competing methods, especially for datasets with partial overlaps, accurately aligning similar cell populations while retaining true biological differences. We demonstrate the advantages of SCALEX by constructing continuously expandable single-cell atlases for human, mouse, and COVID-19, which were assembled from multiple data sources and can keep growing through the inclusion of new incoming data. Analyses based on these atlases revealed the complex cellular landscapes of human and mouse tissues and identified multiple peripheral immune subtypes associated with COVID-19 disease severity.
INTRODUCTION
Single-cell RNA sequencing (scRNA-seq) and assay for transposase-accessible chromatin using sequencing (scATAC-seq) technologies enable decomposition of diverse cell-types and states to elucidate their function and regulation in tissues and heterogeneous systems1–4. Efforts like the Human Cell Atlas project5 and Tabula Muris Consortium6 are constructing a single-cell reference landscape for a new era of highly resolved cell research. With the explosive accumulation of single-cell studies, integrative analysis of data from experiments of different contexts is essential for characterizing heterogeneous cell populations7. However, potentially informative biological insights are often confounded by batch effects that reflect different donors, conditions, and/or analytical platforms8,9.
Integration methods have been developed to remove batch effects in single-cell datasets10–16. One common strategy is to identify similar cells or cell populations across batches. This includes the mutual nearest neighbors (MNN) method10, which identifies correspondent pairs of cells between two batches by searching for mutual nearest neighbors in gene expression. Scanorama11 generalizes the process of neighbor searching from within two batches to a multiple-batch manner. Seurat v213 applies canonical correlation analysis (CCA) to identify common cell populations in low-dimensional embeddings across data batches, while Seurat v314 introduces “cell anchors” to mitigate the problem of mixing non-overlapping populations, an issue experienced in Seurat v2. Harmony16 also applies population matching across batches, specifically through a fuzzy clustering algorithm.
It is notable that all of these cell similarity-based methods are local-based, wherein cell-correspondence across batches is identified through the similarity of individual cells or cell anchors/clusters. Accordingly, these methods all suffer from two common limitations. First, they are prone to mixing cell populations that only exist in some batches. This becomes a severe problem for the integration of datasets that contain non-overlapping cell populations in each batch (i.e., partially-overlapping data). Second, these methods can only remove batch effects from the current batches being assessed but cannot manage batch effects from additional, subsequently obtained batches. Each time a new batch is added, an entirely new integration process is therefore required that again examines the previous batches. This severely limits the capacity to integrate new single-cell sequencing datasets.
As an alternative to the cell similarity-based local methods, scVI17 applies a conditional variational autoencoder (VAE)18 framework to model the inherent distribution/structure of the input single-cell data. VAE is a deep generative method that comprises an encoder and a decoder, wherein the encoder projects all high-dimensional input data into a low-dimensional embedding, and the decoder recovers them back to the original data space. The VAE framework can maintain the same global internal data structure between the high- and low-dimensional spaces19. However, scVI includes a set of batch-conditioned parameters in its encoder, which restrains the encoder from learning a batch-invariant embedding space and limits its generalizability to new batches.
We previously applied VAE and designed SCALE (Single-Cell ATAC-seq Analysis via Latent feature Extraction) to model and analyze single-cell ATAC-seq data20. We found that the VAE framework in SCALE can disentangle cell-type-related and batch-related features in a low-dimensional embedding space. Here, having redesigned the VAE framework, we introduce SCALEX as a method for integration of heterogeneous single-cell data. We demonstrate that SCALEX integration is accurate, scalable, and computationally efficient for multiple benchmark datasets from scRNA-seq and scATAC-seq studies. As a specific advantage, SCALEX accomplishes data integration through projecting all single-cell data into a generalized cell-embedding space using a batch-free encoder and a batch-specific decoder. Since the encoder is trained to only preserve batch-invariant biological variations, the resulting cell-embedding space is a generalized one, i.e., common to all projected data. SCALEX is therefore able to accurately integrate partially-overlapping datasets without mixing of non-overlapping cell populations. By design, SCALEX runs very efficiently on huge datasets. These two advantages make SCALEX especially useful for the construction and research utilization of large-scale single-cell atlas studies, based on integrating data from heterogeneous sources. New data can be projected to augment an existing atlas, enabling continuous expansion and improvement of an atlas. We demonstrated these functionalities of SCALEX in the construction and analyses of atlases for human, mouse, and COVID-19 PBMCs.
RESULTS
Projecting single-cell data into a generalized cell-embedding space
The central goal of single-cell data integration is to identify and align similar cells across different batches, while retaining true biological variations within and across cell-types. The fundamental concept underlying SCALEX is disentangling batch-related components away from batch-invariant components of single-cell data and projecting the batch-invariant components into a generalized, batch-invariant cell-embedding space. To accomplish this, SCALEX implements a batch-free encoder and a batch-specific decoder in an asymmetric VAE framework18 (Fig. 1a, Methods). While the batch-free encoder extracts only biological-related latent features (z) from input single-cell data (x), the batch-specific decoder is responsible for reconstructing the original data from z by incorporating batch information back during data reconstruction.
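As a hedged sketch, the asymmetry described above can be written in the standard conditional-VAE form of the evidence lower bound. The notation is an assumption made for illustration (batch label b, encoder q_φ, decoder p_θ, prior p(z)); the exact loss, likelihood, and weighting used by SCALEX are specified in Methods and may differ.

```latex
\mathcal{L}(\theta, \phi; x, b)
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z, b) \right]
  - D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)
```

In this form the encoder term q_φ(z | x) does not depend on b, whereas the decoder term p_θ(x | z, b) does, which is the sense in which the encoder is batch-free and the decoder batch-specific.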
Supplying batch information to the decoder in data reconstruction allows the encoder to learn a batch-invariant data representation for each individual cell during model training, which, as a whole, defines a generalized low-dimensional cell-embedding space. This learning is also facilitated by random slicing of all input single cells from different batches into mini-batches. Each mini-batch is forced into alignment with the same data distribution under the restriction of KL-divergence in the same cell-embedding space21. SCALEX also implements Domain-Specific Batch Normalization (DSBN)22 (Methods), a multi-branch Batch Normalization23, in its decoder to support incorporation of batch-specific variations to reconstruct single-cell data.
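The idea of a multi-branch normalization layer can be illustrated with a minimal, single-feature Python sketch. It is not the SCALEX implementation: the class name `DomainSpecificBatchNorm1d`, the fixed `gammas` and `betas` values, and the toy activations are all hypothetical, and a real layer would learn its per-batch parameters and track running statistics.

```python
import math


class DomainSpecificBatchNorm1d:
    """Toy single-feature DSBN: each batch (domain) owns its own affine branch."""

    def __init__(self, gammas, betas, eps=1e-5):
        # One (gamma, beta) pair per domain. These fixed, hypothetical values
        # stand in for parameters that a real layer would learn in training.
        self.gammas = list(gammas)
        self.betas = list(betas)
        self.eps = eps

    def __call__(self, values, domain):
        # Normalize with the statistics of the cells passed in, which are
        # assumed to all come from the same domain.
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = self.gammas[domain] / math.sqrt(var + self.eps)
        return [scale * (v - mean) + self.betas[domain] for v in values]


dsbn = DomainSpecificBatchNorm1d(gammas=[1.0, 2.0], betas=[0.0, 0.5])
hidden = [1.0, 2.0, 3.0]             # the same decoder activations
out_batch0 = dsbn(hidden, domain=0)  # routed through the branch of batch 0
out_batch1 = dsbn(hidden, domain=1)  # routed through the branch of batch 1
```

The same activations produce different outputs depending on the branch selected, which is how a decoder could reintroduce batch-specific variation while the latent features stay shared.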
The design underlying SCALEX allows the encoder to function as a data projector that projects single cells of different batches into a generalized, batch-invariant cell-embedding space. SCALEX thus removes batch-related variations present in single-cell data while preserving batch-invariant biological signals in cell-embedding, making it an enabling tool for integration analyses of diverse single-cell datasets, without relying on searching for cell similarities.
SCALEX integration is accurate, scalable, and accommodates diverse data types
We first evaluated the data integration performance of SCALEX on multiple well-curated scRNA-seq datasets, including human pancreas (eight batches of five studies)24–28, heart (two batches of one study)29 and liver (two studies)30,31; as well as human non-small-cell lung cancer (NSCLC, four studies)32–35 and peripheral blood mononuclear cell (PBMC; two batches assayed by two different protocols)13. For comparison, we included several other methods in the analyses, including Seurat v3, Harmony, Conos, BBKNN, MNN, Scanorama, and scVI (Methods).
We used Uniform Manifold Approximation and Projection (UMAP)36 embeddings to visualize the integration performance of all methods (Methods). Note that all of the raw datasets displayed strong batch effects: cell-types that were common in different batches were separately distributed. Overall, SCALEX, Seurat v3, and Harmony achieved the best integration performance for most of the datasets by merging common cell-types across batches while keeping disparate cell-types apart (Fig. S1). MNN and Conos integrated many datasets but left some common cell populations not well aligned. BBKNN, Scanorama, and scVI often had unmerged common cell-types, and sometimes incorrectly mixed distinct cell-types together. For example, in the PBMC dataset (Fig. 1b), considering the T cell populations between the two batches, while SCALEX, Seurat v3, Harmony, and MNN integrations were effective, Scanorama both showed a larger misalignment and mixed all cell-types together without maintaining clear boundaries.
We quantified single-cell data integration performance using a silhouette score37 and a batch entropy mixing score10 (Methods). Briefly, the silhouette score assesses the separation of biological distinctions, and the batch entropy mixing score evaluates the extent of mixing of cells across batches. Overall, SCALEX outperformed all of the other methods as assessed by the silhouette score, and tied with Seurat v3 and Harmony as the best-performing methods based on the batch entropy mixing score (Fig. 1c). We note that SCALEX obtained a slightly lower batch entropy mixing score than Seurat v3 and Harmony on the liver dataset, which contains batch-specific cell-types and thus is a partially-overlapping dataset. However, Seurat v3 and Harmony may have obtained a high batch entropy mixing score by misaligning different cell-types. Indeed, because it considers only the degree of batch mixing and ignores cell-type differences, the batch entropy mixing score is not ideally suited for assessing batch mixing for partially-overlapping datasets.
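The two scores can be illustrated with a small, standard-library Python sketch. The silhouette function follows the usual definition, (b - a) / max(a, b) averaged over cells. The batch entropy function is a simplified stand-in, not the procedure in Methods: it averages the Shannon entropy of batch labels among each cell's k nearest neighbors and divides by the log of the number of batches. All function names and the toy coordinates are hypothetical.

```python
import math
from collections import Counter


def silhouette_score(points, labels):
    """Mean silhouette coefficient; cells in singleton clusters score 0."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        dists = {}  # cluster label -> distances from cell i to its members
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(other, []).append(math.dist(p, q))
        if lab not in dists or len(dists) < 2:
            scores.append(0.0)  # singleton cluster, or only one cluster overall
            continue
        a = sum(dists[lab]) / len(dists[lab])  # mean intra-cluster distance
        b = min(sum(d) / len(d) for l, d in dists.items() if l != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def batch_entropy_mixing(points, batches, k=3):
    """Simplified score: mean neighborhood batch entropy / log(n_batches)."""
    n_batches = len(set(batches))
    if n_batches < 2:
        return 0.0
    total = 0.0
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(p, points[j]))[:k]
        counts = Counter(batches[j] for j in nearest)
        total += -sum((c / len(nearest)) * math.log(c / len(nearest))
                      for c in counts.values())
    return total / (len(points) * math.log(n_batches))


# Toy embedding: two well-separated cell types, each sampled in two batches.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
cell_types = ["A", "A", "A", "A", "B", "B", "B", "B"]
mixed_batches = [0, 1, 0, 1, 0, 1, 0, 1]     # batches interleaved within types
unmixed_batches = [0, 0, 0, 0, 1, 1, 1, 1]   # batch coincides with cell type

sil = silhouette_score(points, cell_types)
mix_good = batch_entropy_mixing(points, mixed_batches)
mix_bad = batch_entropy_mixing(points, unmixed_batches)
```

In the second batch assignment the mixing score is zero even though the embedding is biologically correct, because each batch contains a different cell type. This mirrors the caveat above about partially-overlapping datasets.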
We also tested the scalability and computation efficiency of SCALEX on large-scale datasets by applying it to 1,369,619 cells from the human fetal atlas dataset (two data batches, Methods)38,39. SCALEX accurately integrated these two batches, showing good alignment of the same cell-types (Fig. S2, Fig. 1d). We then compared the computational efficiency of different methods using down-sampled datasets (of 10 K, 50 K, 250 K, 1 M) from the human fetal atlas dataset. SCALEX consumed almost constant runtime and memory that increased only linearly with data size, whereas MNN, Seurat v3, and Conos consumed runtime and memory that increased exponentially, and thus did not scale well beyond 250 K cells. Harmony consumed over 400 gigabytes (GB) of memory in analyzing the 1 M dataset, rendering it unsuitable for integration of datasets at this scale (Fig. 1e). Notably, the deep learning framework of SCALEX enables it to run very efficiently on GPU devices, requiring much reduced runtime (about 10 minutes and 16 GB of memory on the 1 M dataset).
Finally, SCALEX can be used to integrate scATAC-seq data as well as cross-modality data (e.g., scRNA-seq and scATAC-seq) (Methods). For example, SCALEX integrated the mouse brain scATAC-seq dataset (two batches assayed by snATAC and 10X)40 very well, aligning common cell subpopulations and separating distinct ones (Fig. 1f). We also integrated the cross-modality PBMC data between scRNA-seq and scATAC-seq41,42, and found that SCALEX could correctly integrate the two types of data, and could distinguish rare cells that are specific to scRNA-seq data, including pDC and platelet cells (Fig. 1g). Thus, SCALEX has broad integration capacity across various types of single-cell data.
SCALEX integrates partially-overlapping datasets
Partially-overlapping datasets present a major challenge for single-cell data integration for local cell similarity-based methods13,14, often leading to over-correction (i.e., mixing of distinct cell-types). As a global integration method that projects cells into a generalized cell-embedding space, SCALEX is expected to be immune to this problem. For example, the liver dataset is a partially-overlapping dataset where the hepatocyte population contains multiple subtypes specific to different batches: three subtypes are specific to LIVER_GSE124395, and two other subtypes only appear in LIVER_GSE115469 (Fig. S3). We noticed that SCALEX kept the five hepatocyte subtypes apart, whereas Seurat v3 mixed all five and Harmony mixed the hepatocyte-SCD and hepatocyte-TAT-AS1 cells (Fig. 2a).
To characterize the performance of SCALEX on partially-overlapping datasets, we constructed test datasets with a range of common cell-types, down-sampled from the six major cell-types in the pancreas dataset (Methods). SCALEX integration was accurate for all cases, aligning the same cell-types without over-correction, whereas both Seurat v3 and Harmony frequently mixed the cell-types, particularly for the low-overlapping cases (Fig. 2b, Fig. S4). When there was no common cell-type, both Seurat v3 and Harmony collapsed the six cell-types to three, mixing alpha with gamma cells, beta with delta cells, and acinar with ductal cells to varying extents. We repeated the cell-type down-sampling analysis from the 12 cell-types in the PBMC dataset as a more complex partially-overlapping example and observed similar results (Fig. S5), demonstrating that SCALEX is robust in retaining informative biological variations for partially-overlapping datasets.
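One way such test datasets could be built is sketched below in Python. This is an illustrative assumption, not the procedure in Methods: the functions `split_cell_types` and `downsample`, the alternating assignment of non-shared cell-types, and the dictionary layout of a cell are all hypothetical. Only the six cell-type names come from the text above.

```python
def split_cell_types(cell_types, n_common):
    """Return the cell-type sets kept in two batches that share n_common types."""
    common = cell_types[:n_common]
    rest = cell_types[n_common:]
    only_a = rest[0::2]  # non-shared types assigned alternately to batch A
    only_b = rest[1::2]  # and to batch B
    return set(common) | set(only_a), set(common) | set(only_b)


def downsample(cells, keep_types):
    """Keep only the cells whose annotated type is in keep_types."""
    return [cell for cell in cells if cell["cell_type"] in keep_types]


pancreas_types = ["alpha", "beta", "gamma", "delta", "acinar", "ductal"]

# One test case per overlap level, from no shared type to all six shared.
cases = {n: split_cell_types(pancreas_types, n) for n in range(7)}

# Toy batch with one cell per type, filtered for the zero-overlap case.
toy_batch = [{"cell_type": t} for t in pancreas_types]
batch_a_cells = downsample(toy_batch, cases[0][0])
```

With zero shared types the two batches are disjoint, which corresponds to the hardest case reported above, where similarity-based methods collapsed distinct cell-types.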
Projection of unseen data into an existing cell-embedding space
The accurate, scalable, and efficient integration performance of SCALEX depends on its encoder’s capacity to project cells from various sources into a generalized, batch-invariant cell-embedding space. We speculate that once a cell-embedding space has been constructed after integration of existing data, SCALEX should be able to use the same encoder to project additional (i.e., previously unseen) data onto the same embedding space. To test this hypothesis, we used the pancreas dataset. SCALEX integration removed the strong batch effect in the raw data and aligned the same cell-types together, while different cell-types were clearly distinguished (Fig. 3a, Fig. S6a). Cell-types were validated by the expression of their canonical markers, including rare cells such as Schwann cells and epsilon cells (Fig. S6b).
We projected three new batches43–45 for pancreas tissues (Fig. 3b) into this “pancreas cell space” using the same encoder trained on the pancreas dataset. After projection, most of the cells in the new batches were accurately aligned to the correct cell-types in the pancreas cell space, enabling their accurate annotation by cell-type label transfer (Fig. 3c, Methods). We benchmarked annotation accuracy by calculating the adjusted Rand Index (ARI)46, the Normalized Mutual Information (NMI)47, and the F1 score using the cell-type information in the original studies as a gold standard (Methods). The SCALEX annotations achieved the highest accuracy in comparisons with annotations using three other methods (Seurat v3, Conos, and scmap).
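Label transfer in a shared embedding and its evaluation can be sketched as follows. The k-nearest-neighbor majority vote is an assumption chosen for illustration, since the actual transfer procedure is described in Methods; the ARI function follows the standard pair-counting definition. Function names and the toy coordinates are hypothetical.

```python
import math
from collections import Counter


def knn_label_transfer(ref_points, ref_labels, query_points, k=3):
    """Annotate each query cell by majority vote of its k nearest reference cells."""
    predicted = []
    for q in query_points:
        order = sorted(range(len(ref_points)),
                       key=lambda i: math.dist(q, ref_points[i]))
        votes = Counter(ref_labels[i] for i in order[:k])
        predicted.append(votes.most_common(1)[0][0])
    return predicted


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand Index computed from the contingency table of two labelings."""
    n = len(true_labels)
    if n < 2:
        return 1.0
    sum_cells = sum(math.comb(c, 2)
                    for c in Counter(zip(true_labels, pred_labels)).values())
    sum_rows = sum(math.comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(math.comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_rows * sum_cols / math.comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every cell in one cluster in both
    return (sum_cells - expected) / (max_index - expected)


# Toy reference embedding with two annotated cell types.
ref_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
ref_labels = ["alpha", "alpha", "alpha", "beta", "beta", "beta"]

# Toy projected cells with known (gold-standard) labels.
query_points = [(0.2, 0.2), (5.5, 5.5), (0.5, 0.8)]
query_truth = ["alpha", "beta", "alpha"]

query_pred = knn_label_transfer(ref_points, ref_labels, query_points)
ari = adjusted_rand_index(query_truth, query_pred)
```

Because the vote only needs coordinates in the shared space, the reference does not have to be re-integrated when new cells are projected.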
Expanding an existing cell space by including new data
The ability to project new single-cell data into a generalized cell-embedding space allows SCALEX to readily extend this cell space. To verify this, we projected two additional melanoma data batches (SKCM_GSE72056, SKCM_GSE123139)48,49 onto the previously constructed PBMC space. The common cell-types were correctly projected onto the same locations in the PBMC cell space (Fig. 3d). For the tumor and plasma cells only present in the melanoma data batches, SCALEX did not project these cells onto any existing cell populations in the PBMC space; rather, it projected them onto new locations close to similar cells, with the plasma cells projected to a location near B cells, and the tumor cells projected to a location near HSC cells (Fig. 3e).
SCALEX projection enables post hoc annotation of unknown cell-types in the existing cell space using new data. We noted a group of cells previously uncharacterized in the pancreas dataset (Fig. 3a). We found that these cells displayed high expression levels for known epithelial genes (Methods). We therefore assembled a collection of epithelial cells from the bronchial epithelium dataset50. We then projected these epithelial cells onto the pancreas cell space and found that a group of antigen-presenting airway epithelial (SLC16A7+ epithelial) cells were projected onto the same location of the uncharacterized cells (Fig. 3f). This, together with the observation that both cell populations showed similar marker gene expression (Fig. 3g), indicates that these uncharacterized cells are also SLC16A7+ epithelial cells. SCALEX thus enables discovery science in cell biology by supporting exploratory analysis with large numbers of diverse datasets.
SCALEX supports construction of expandable single-cell atlases
The ability to combine partially-overlapping data onto a generalized cell-embedding space makes SCALEX a powerful tool to construct a single-cell atlas from a collection of diverse and large datasets. We applied SCALEX integration to two large and complex datasets—the mouse atlas dataset (comprising multiple organs from two studies assayed by 10X, Smart-seq2, and Microwell-seq6,51) (Fig. 4a) and the human atlas dataset (comprising multiple organs from two studies assayed by 10X and Microwell-seq39,52).
Despite the strong batch effects in the raw data, SCALEX integrated the three batches of the mouse atlas dataset into a unified cell-embedding space (Fig. 4b,c, Fig. S7a). Common cell-types (including B, T, and endothelial cells in all tissues and proximal tubule, urothelial, and hepatocytic cells in certain tissues) were well-aligned together at the same position in the cell space. Non-overlapping cell-types (such as sperm, Leydig, and small intestine cells from the Microwell-seq data, keratinocyte stem cells and large intestine cells in the Smart-seq2 data, and oligodendrocytes in the Smart-seq2 and Microwell-seq data) were located separately in the space, indicating that biological variations were preserved well (Fig. S7b).
Importantly, atlases generated with SCALEX can be used and further expanded by projecting new single-cell data to support comparative studies of cells both in the original atlas and in the new data. Illustrating this, we projected two additional data batches of aged mouse tissues from Tabula Muris Senis (Smart-seq2 and 10X)53 and two single tissue datasets (lung and kidney)54 onto the SCALEX mouse atlas space. We found that the same cell-types in the new data batches were correctly projected onto the same locations on the cell-embedding space of the initial mouse atlas (Fig. 4d), which was also confirmed by the accurate cell-type annotations for the new data by label transfer from the corresponding cell-types in the initial atlas (Fig. 4e, Methods). On the one hand, this mouse atlas can be used to accurately identify/characterize the cells in the new data based on their projected locations in the cell space; on the other hand, projection of new data enables ongoing (and informative) expansion of an existing atlas.
Following the same strategy, we also constructed a human atlas by SCALEX integration of multiple tissues from two studies (GSE134255, GSE159929) (Fig. S8a,b). SCALEX effectively eliminated the batch effects in the original data and integrated the two datasets in a unified cell-embedding space (Fig. S8c,d). Again, we were able to correctly project two additional human skin datasets (GSE130973, GSE147424)55,56 onto the human atlas cell-embedding space (Fig. S8e), and again accurately annotated these projected skin cells (Fig. S8f, Methods). These results illustrate that: i) SCALEX enables researchers to evaluate their project-specific single-cell datasets by leveraging existing information in large-scale (and ostensibly well annotated) cell atlases; and ii) it also enables atlas creators to informatively integrate new datasets and attendant biological insights from many research programs.
An integrative SCALEX COVID-19 PBMC atlas
Many single-cell studies have been conducted to analyze COVID-19 patient immune responses57–64. However, these studies often suffer from small sample size and/or limited sampling of various disease states58,64. For a comprehensive study, we collected data from multiple COVID-19 PBMC studies, involving 860,746 single cells and 10 batches from 9 studies57–63 (Fig. 5a, Fig. S9a), and used SCALEX to generate a COVID-19 PBMC atlas, identifying 22 cell-types, each of which was supported by canonical marker gene expression (Fig. 5b,c, Fig. S9b,c, Methods). Cells across different studies were integrated accurately with the same cell-types aligned together, confirming the integration performance of SCALEX (Fig. 5c, Fig. S9d).
We observed that some cell subpopulations were differentially associated with patient status (Fig. 5d). A subpopulation of CD14 monocytes (CD14-ISG15-Mono), specifically associated with COVID-19 patients, was characterized by its high expression of Type I interferon-stimulated genes (ISGs) and genes associated with immune-response-related GO terms (Fig. 5e,f). The frequency of CD14-ISG15-Mono cells increased significantly from healthy donors to mild/moderate and severe patients (Fig. 5g, Fig. S9e, Methods). Within the COVID-19 patients, we observed a significant decrease in ISG expression in CD14-ISG15-Mono cells between the mild/moderate and severe cases, indicating an apparently dysfunctional anti-viral immune response in severe COVID-19 patients (Fig. 5e). Specifically enriched in severe versus mild/moderate patients, a neutrophil subpopulation (NCF1-Immature_Neutrophil) lacked expression of the genes responsible for neutrophil activation but showed elevated expression of genes associated with viral-process-related GO terms (Fig. S10a,b). Also enriched in severe patients, a plasma cell subpopulation (MZB1-Plasma) displayed decreased expression of genes for antibody production and was enriched for GO terms of immune and inflammatory responses (Fig. S10c,d). Thus, the SCALEX COVID-19 PBMC atlas, generated by integrating a highly diverse collection of single-cell data from individual studies, identified multiple immune cell-types showing dysregulation during COVID-19 disease progression. Note that these trends could not have been detected in the small-scale, individual studies that served as the basis for our SCALEX COVID-19 PBMC atlas.
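The kind of frequency comparison described above can be sketched as a per-donor fraction averaged within each clinical group. The function name, the tuple layout of a cell, and every count in the toy input are made up for illustration and do not reproduce any result from the atlas; the statistical testing described in Methods is not shown.

```python
from collections import defaultdict


def subtype_fraction_by_group(cells, subtype):
    """Mean per-donor fraction of `subtype` cells within each clinical group."""
    totals = defaultdict(int)  # (group, donor) -> number of cells
    hits = defaultdict(int)    # (group, donor) -> number of `subtype` cells
    for donor, group, cell_type in cells:
        totals[(group, donor)] += 1
        if cell_type == subtype:
            hits[(group, donor)] += 1
    per_group = defaultdict(list)
    for (group, donor), n_cells in totals.items():
        per_group[group].append(hits[(group, donor)] / n_cells)
    return {group: sum(f) / len(f) for group, f in per_group.items()}


# Made-up toy input: (donor, group, cell_type) for 10 cells per donor.
toy_cells = (
    [("H1", "healthy", "CD14-ISG15-Mono")] * 1 + [("H1", "healthy", "other")] * 9
    + [("S1", "severe", "CD14-ISG15-Mono")] * 4 + [("S1", "severe", "other")] * 6
    + [("S2", "severe", "CD14-ISG15-Mono")] * 6 + [("S2", "severe", "other")] * 4
)
fractions = subtype_fraction_by_group(toy_cells, "CD14-ISG15-Mono")
```

Averaging per donor rather than pooling all cells keeps donors with many sequenced cells from dominating a group's estimate.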
Comparative analysis of the SCALEX COVID-19 PBMC atlas and the SC4 consortium study
Recently, a large-scale effort of the Single Cell Consortium for COVID-19 in China (SC4) has generated a single-cell atlas that contains over 1 million cells (including PBMCs and other tissues) from 171 COVID-19 patients and 25 healthy controls65 (Fig. S11a). We projected the consortium dataset into the cell-embedding space of the SCALEX COVID-19 PBMC atlas, and found that the cell-types of the two atlases were well-aligned in the embedding space (Fig. 5h,i, Fig. S11b,c).
Our analysis, based on the SCALEX COVID-19 PBMC atlas, yielded findings consistent with two conclusions from the SC4 study65. First, in both analyses diverse immune subpopulations displayed differential associations with COVID-19 severity. The proportions of CD14 monocytes, megakaryocytes, plasma cells, and pro T cells were elevated with increasing disease severity, while the proportion of pDC and mDC cells decreased (Fig. 5g). Second, we confirmed that the megakaryocyte and monocyte populations are associated with cytokine storms triggered by SARS-CoV-2 infection and are further elevated in severe patients66, based on calculating the same cytokine score and inflammatory score (defined in the SC4 study) for the cells of our SCALEX COVID-19 PBMC atlas (Fig. 5j, Methods).
Integration of the SC4 data further substantially improved both the scope and resolution of the SCALEX COVID-19 PBMC atlas. First, these data added macrophages and epithelial cells to the cell space, enabling investigation of their potential involvement in COVID-19. Second, the integration supported more precise characterization of specific cell subpopulations. For example, the megakaryocyte population, not distinguished in either single atlas, could be divided into two subpopulations in the combined atlas (Fig. 5h). An exploratory functional analysis of the differentially expressed genes in these two newly delineated megakaryocyte subpopulations (TUBA8-Mega and IGKC-Mega, Fig. S11d,e) revealed enrichment for the GO terms “humoral immune response” for IGKC-Mega cells yet enrichment for “negative regulation of platelet activation” for TUBA8-Mega cells (Fig. 5k). These results illustrate how the continuously expandable single-cell atlases generated using SCALEX capitalize on existing large-scale data resources and also facilitate discovery of biological and biomedical insights.
DISCUSSION
SCALEX provides a VAE framework for integration of heterogeneous single-cell data by disentangling batch-invariant components from batch-related variations and projecting the batch-invariant components into a generalized, low-dimensional cell-embedding space. By design, SCALEX models the inherent batch-invariant patterns of single-cell data, distinguishing it from previously reported integration methods based on cell similarities. SCALEX does not rely on the identification of common cell-types across batches, and therefore avoids the problem of cell-type over-correction, a severe problem for partially-overlapping datasets. SCALEX thus also overcomes issues of computational complexity in cell similarity-based methods; that is, the computational time required to identify similar cells may increase exponentially as the cell number increases.
These two features make SCALEX particularly useful for construction and integrative analysis of large-scale single-cell atlases based on very heterogeneous data (i.e., datasets acquired by different labs and using different single-cell analysis platforms). Our construction of human, mouse, and COVID-19 patient single-cell atlases, which aligned well with previously reported atlases generated from coordinated large-scale consortium efforts, demonstrates the particular ability of SCALEX to produce large-scale atlases from extant small-scale datasets. SCALEX achieves data integration by projecting all single cells into a generalized cell-embedding space using a universal data projector (i.e., the encoder). This data projector only needs to be trained once, and then can be used without retraining to continuously integrate new incoming data into an existing single-cell atlas. This continuous growth ability makes a SCALEX atlas an elastic resource, allowing the integration of many single-cell studies to support ongoing, very large-scale research programs throughout the life sciences and biomedicine.
While the number of single-cell studies is increasing enormously each year, best practices for experimental design and sample processing are not established, and there is no obviously dominant data-acquisition platform. SCALEX’s ability to informatively combine data from heterogeneous studies and platforms makes it particularly suitable for the current era of single-cell biological research. Finally, the ability to conduct exploratory analysis within a generalized cell space suggests that SCALEX should be particularly useful for large-scale integrative (e.g., pan-cancer) studies. We speculate that use of SCALEX to project single-cell datasets (including, for example, scATAC-seq and scRNA-seq) from highly diverse cancer types to construct a pan-cancer single-cell atlas may lead to the discovery of previously unknown cell types that are common to divergent carcinomas and that function in pathogenesis, malignant progression, and/or metastasis.
Footnotes
4 Co-first authorship