Panoramic stitching of heterogeneous single-cell transcriptomic data

Researchers are generating single-cell RNA sequencing (scRNA-seq) profiles of diverse biological systems1–4 and every cell type in the human body.5 Leveraging this data to gain unprecedented insight into biology and disease will require assembling heterogeneous cell populations across multiple experiments, laboratories, and technologies. Although methods for scRNA-seq data integration exist6,7, they often naively merge data sets together even when the data sets have no cell types in common, leading to results that do not correspond to real biological patterns. Here we present Scanorama, inspired by algorithms for panorama stitching, that overcomes the limitations of existing methods to enable accurate, heterogeneous scRNA-seq data set integration. Our strategy identifies and merges the shared cell types among all pairs of data sets and is orders of magnitude faster than existing techniques. We use Scanorama to combine 105,476 cells from 26 diverse scRNA-seq experiments across 9 different technologies into a single comprehensive reference, demonstrating how Scanorama can be used to obtain a more complete picture of cellular function across a wide range of scRNA-seq experiments.


Main Text
Individual single-cell RNA sequencing (scRNA-seq) experiments have already been used 1 to discover novel cell states 1,2 and reconstruct cellular differentiation trajectories 3,4 . Through 2 global efforts like the Human Cell Atlas 5 , researchers are now generating large, comprehensive 3 collections of scRNA-seq data sets that profile a diversity of cellular function, which promises to 4 make a substantial contribution to our understanding of fundamental biology and disease. 5 Assembling a large, unified reference data set, however, may be compromised by differences due 6 to experimental batch, sample donor, or sequencing technology. While recent approaches have 7 shown that it is possible to integrate scRNA-seq studies across multiple experiments 6,7 , these 8 approaches automatically assume that all data sets share at least one cell type in common 6 or that 9 the gene expression profiles share largely the same correlation structure across all data sets. 7 10 These methods therefore have limited utility when integrating collections of data sets with 11 considerable differences in cellular composition such as the Human Cell Atlas. 12 Here we present Scanorama, a strategy for efficiently integrating multiple scRNA-seq 13 data sets even when they are composed of heterogeneous cellular phenotypes. Scanorama takes 14 inspiration from computer vision algorithms for panorama stitching that identify images with 15 overlapping content, which are then merged into a larger panorama (Fig. 1a). 8 Analogously, 16 Scanorama automatically identifies only those scRNA-seq data sets with overlapping cell types 17 and can leverage those overlaps for batch-correction and integration (Fig. 1b), without also 18 merging data sets that do not overlap. Importantly, Scanorama is robust to different data set sizes 19 and sources, preserves data set-specific populations, and does not require that all data sets share 20 at least one cell population, e.g., using a cell control spiked into all data sets. 6

21
Our approach generalizes mutual nearest neighbors matching, a technique which finds 22 similar elements between two data sets, to instead find similar elements among a large number of 23 data sets. Originally developed for pattern matching in images 9 , finding mutual nearest neighbors 24 has also been used to identify common cell types between two scRNA-seq data sets at a time. 6 25 However, to align more than two data sets, existing methods 6,7 select one data set as a reference 26 and successively integrate all other data sets into the reference one at a time, which may lead to 27 suboptimal results depending on the order in which the data sets are considered (Supplementary 28 Fig. 1). In contrast, Scanorama is not vulnerable to this problem since it finds matches between 29 all pairs of data sets. Conceptually, we first form directed links from cells in one data set to their 30 nearest neighbors among all other data sets, where cells with more similar gene expression 31 profiles are also closer together, which we repeat for each data set. We then match cells between 32 all pairs of data sets only if they were mutually linked in the previous search step, which 33 increases the robustness of our matches and excludes links involving data set-specific 34 populations. Once cell type matches have been determined, they can then be used to merge data 35 sets together to create scRNA-seq "panoramas" (Methods). 36 Although searching for matching cells among all data sets may seem computationally 37 intractable, we introduce two key optimizations to greatly accelerate our matching procedure. 38 Instead of performing the nearest neighbor search in the high-dimensional gene space, we 39 compress the gene expression profiles of each cell into a low-dimensional embedding using an 40 efficient, randomized singular value decomposition (SVD) 10 of the cell-by-gene expression 41 matrix. Additionally, we use an approximate nearest neighbor search based on hyperplane 42 locality sensitive hashing 11 and random projection trees 12 to greatly reduce the nearest neighbor 43 query time both asymptotically and in practice (Methods). 44 To verify the merit of our approach, we first tested Scanorama on simulated data and a 45 small collection of scRNA-seq data sets. We simulated three data sets with four cell types in total 46 but where the first and third data sets had no cell types in common (Supplementary Fig. 2a-c). 47 We also obtained three previously generated 13 data sets: one of 293T cells, one of Jurkat cells, 48 and one with a 50:50 mixture of 293T and Jurkat cells (Fig. 2a). In both cases, we were able to 49 merge common cell types across data sets (Fig. 2b, Supplementary Fig. 2d) without also 50 merging disparate cell types together, as was the case when applying existing integration 51 methods to these data sets (Fig. 2c,d). 52 We then sought to demonstrate the ability of Scanorama to assemble a larger and more 53 diverse set of cell types. In total, we ran our pipeline on 26 scRNA-seq data sets representing 9 54 different technologies and containing a total of 105,476 cells (Fig. 3a, Supplementary Table 1), 55 each data set coming from a different scRNA-seq experiment. Scanorama correctly identifies 56 data sets with the same cell types and merges them together such that they cluster by cell type 57 instead of by experimental batch (Fig. 3a-c). Importantly, in contrast with existing methods, our 58 algorithm does not merge disparate cell types together (Fig. 3b,c; Supplementary Figure 3), 59 and correctly identifies a data set of mouse neurons as distinct from the cell types of all other 60 data sets (Fig. 3a). One of the panoramas identified by Scanorama consists of two data sets of 61 hematopoietic stem cells (HSCs) 14,15 which, once corrected for batch effects and plotted along 62 the first two principal components, reconstruct the expected HSC differentiation hierarchy ( Fig.   63 3d, Supplementary Fig. 4a,b). We also observe cell type-specific clusters within panoramas of 64 pancreatic islet cells (Fig. 3e, Supplementary Fig. 5,6) and peripheral blood mononuclear cells  The alignments made by Scanorama are also interpretable, allowing researchers to 78 examine the genes that significantly contribute to the alignment between two data sets and better 79 understand the technical or biological origin of discrepancies between experiments. For example, 80 Scanorama aligned two data sets consisting of the same cell type but separated by a known 81 biological difference, i.e., two macrophage data sets with and without Mycobacterium 82 tuberculosis (Mtb) infection. We found that the alignment of macrophage populations between 83 these data sets is directed by genes that are enriched for processes relating to bacterial infection

Data sets and data processing
We used the following publicly-available data sets:  Table 1).

Dimensionality reduction using randomized SVD
We compute a compressed, low-dimensional embedding of the gene expression values for each cell by taking the SVD of the combined cell-by-gene expression matrix, taking inspiration from different compressive techniques for other biological problems. 28 The SVD is normally very expensive to compute on large matrices, so we leverage an efficient, randomized approach to find an approximate SVD 10 , which is implemented in the fbpca Python package (http://fbpca.readthedocs.io/en/latest/). We use a reduced dimension of 100 in all of our experiments.

All-to-all data set matching
For each data set, we query for its cells' nearest neighbors among the cells of all remaining data sets in the low-dimensional embedding space. After repeating this for all data sets, we find all instances where a cell in one data set is the nearest neighbor of a cell in another data set, and vice versa. Additional algorithm descriptions are given in the Supplementary Materials.

Approximate nearest neighbors using locality sensitive hashing
To greatly accelerate our nearest neighbor queries, our algorithm conducts an approximate search based on locality sensitive hashing, where multiple trees of random hyperplanes, used as hash functions, divide the search space of the points in the query set. 11,12 We use the Annoy C++/Python package (https://github.com/spotify/annoy), a memory efficient implementation of this algorithm.

Nonlinear data set merging and panorama stitching
Once mutual nearest neighbor matches are identified, we merge data sets together into larger panoramas. We build upon the nonlinear batch-correction strategy of Haghverdi et al. 6  Gaussian kernel function upweights matching vectors belonging to nearby points. We order pairs of data sets based on the percentages of cells in the data sets that are involved in a matching and use this ordering to build panoramas of data sets by successively merging a data set into a panorama or using the pair of data sets to merge two panoramas together using the nonlinear correction procedure described above. Additional algorithm descriptions are given in the Supplementary Materials.

Interpreting alignments
Since aligning and merging two data sets is based on a set of matching vectors, we reasoned that the magnitude of these vectors along the dimensions of the gene expression space would allow us to identify genes with a significant contribution to the direction of the alignment. To interpret an alignment between two data sets, we averaged the matching vectors across all matchings and compared the absolute magnitude of each of the values in the mean vector to the corresponding absolute magnitudes of a "null set" of vectors. We constructed a null distribution by randomly pairing cells between the two data sets and computing the vectors obtained from each pair (excluding pairs in the matching itself). We used a null distribution of 100,000 vectors to calculate a permutation-based p-value for each of the genes. Genes were put into a target list at a false discovery rate (FDR) 29 of less than 0.05, for which we looked for gene ontology (GO) process enrichment among the background set of 5,216 genes using the GOrilla web tool (http://cbl-gorilla.cs.technion.ac.il/). 30

t-SNE visualization
We modified the implementation of t-Distributed Stochastic Neighbor Embedding (t-SNE) 31 in scikit-learn 32 by replacing the exact nearest neighbors search phase with an approximate nearest neighbors search using the same locality sensitive hashing algorithm and implementation as in our data set matching procedure. This modification was done to improve the runtime of the default scikit-learn t-SNE when visualizing our results and is included in the code package for our algorithm.

Clustering performance
Previous scRNA-seq analyses 6,33 have used the Silhouette Coefficient 34 as a quantitative measure of clustering performance. The Silhouette Coefficient is calculated using the mean of the distances from cell to all other cells of the same type ( ) and the mean of the distances from cell to all other cells that belong to the cell type that is nearest to the cell type of ( ). The We computed the Silhouette Coefficient using the t-SNE embeddings of all cells across our collection of 26 data sets integrated by Scanorama, scran MNN, and Seurat CCA (Supplementary Fig. 10). We use the Silhouette Coefficient implementation provided by scikitlearn.

Simulation of non-overlapping data sets
We simulated data sets by sampling from a 2-dimensional Gaussian mixture model with four components and isotropic noise. We then projected all of our data sets into a 100-dimensional space using a matrix with Gaussian noise entries. We also simulated batch effects for each data set by adding a vector of Gaussian noise to all entries in a data set, using a different random vector (i.e., a different batch effect) for each data set.

Panorama of 293T and Jurkat cells
We obtained three separate data sets consisting of 293T cells, Jurkat cells, and a 50:50 mixture of 293T and Jurkat cells from 10x Genomics. 13 These data sets were processed, aligned, and merged using the procedure described previously to give a total of 9,530 cells. We defined cell types using labels from the original study.

Panorama of hematopoietic stem cells (HSCs)
Two publicly available data sets 14,15 of HSCs were processed, aligned, and merged using the procedure described previously to give a total of 3,175 cells. We used the cell types that were reported by both studies, and we examined the expression of marker genes indicating erythropoiesis for additional validation (Supplementary Fig. 4c-e). We quantified the quality of our batch correction by computing the likelihood-ratio using the likelihood that the corrected MARS-Seq data set came from the same distribution as the uncorrected MARS-Seq data set ( 0 ) or from the same distribution as the corrected Smart-seq 2 data set ( 1 ), where we can more confidently reject the null hypothesis 0 if the likelihood-ratio ℒ( 0 ) sup{ℒ( 0 ),ℒ( 1 )} is very small. We modeled each distribution with a three-component Gaussian mixture model.

Panorama of pancreatic islets
Five publicly-available pancreatic islet data sets [22][23][24][25][26] were processed, aligned, and merged using the procedure described previously to give a total of 15,921 cells (Supplementary Fig. 5). We kmeans clustered the cells in the corrected gene expression space, obtaining 40 clusters, and assigned cell types to each cluster based on cell type labels provided by previous clustering analyses 7,23,25 and the relative expression levels of cell type-specific marker genes ( Supplementary Fig. 6). This allowed us to identify a cluster corresponding to a rare subpopulation of beta cells with upregulated ER stress genes, which we identified using previously inferred labels 7 for one of the data sets 22 and by confirming upregulation of the marker genes HERPUD1 and GADD45A in cells from all data sets within that cluster ( Supplementary Fig. 6g,h).

Panorama of peripheral blood mononuclear cells (PBMCs)
Ten publicly-available data sets 13,27 involving PBMCs, or cell types found in PBMCs, were processed, aligned, and merged using the procedure described previously to give a total of 47,994 cells. Cell types were either experimentally determined 13 using fluorescence activated cell sorting (FACS) or inferred by previous clustering analyses 6,7 (Supplementary Fig. 7), and we examined the expression levels of cell type-specific marker genes for additional validation (Supplementary Fig. 8).

Macrophage alignment interpretation
Four publicly available data sets 21 of macrophages, sorted by FACS for Mtb infection, were processed and aligned using the procedure described previously to give a total of 15,337 cells.
Once the matching vectors between the Mtb infected and uninfected data sets were determined, we used the procedure described previously to identify genes and functional processes that significantly contribute to the discrepancy between the two data sets (Supplementary Table 2).

Runtime and memory profiling
We used Python's time module to obtain runtime measurements for the alignment and merging portions of our algorithm and used the top program in Linux (Ubuntu 17.04) to make periodic memory measurements. We also randomly subsampled sets of 10,547 (10%), 26,369 (25%), and 52,738 (50%) cells from our total of 105,476 cells and measured the runtime and memory of our algorithm on the subsampled data. We compared computational resource usage to two methods, Seurat CCA 7 (with 15 canonical correlation vectors) and scran MNN 6 , using their default parameters. For a fair comparison, we used the same preprocessed data and only measured the resources required for the portions of the methods responsible for alignment and data set integration. We used R's proc.time function and Linux's top to measure runtime and memory usage, respectively, of these programs. All methods were limited to 10 cores and run on a 2.30 GHz Intel Xeon E5-2650v3 CPU with 384 GB of RAM.

Data and code availability
The data and code used in this study are available at http://scanorama.csail.mit.edu.      14,15 and removes any significant difference due to experimental batch (likelihood-ratio = 2.7e-902; Methods). Visualizing the corrected HSCs along the first two principal components captures the "pseudo-temporal" arrangement of the cell types, in which granulocyte-macrophage progenitors (GMP) and megakaryocyte-erythrocytes (MEP) are derived from common myeloid progenitors (CMP). (e) Within a panorama of pancreatic islet cells from five different data sets [22][23][24][25][26] , we can observe clusters of distinct cellular subtypes including a previously reported rare subpopulation of beta cells marked by ER stress gene upregulation (Supplementary Fig. 8g,h). (f) Scanorama is significantly faster than current methods for scRNA-seq integration. (g) Interpreting the alignment between Mtb infected and uninfected macrophages reveals genes significantly enriched for processes related to bacterial infection.