Cluster similarity spectrum integration of single-cell 1 genomics data 2 3

17 Technologies to sequence the transcriptome, genome or epigenome from thousands of 18 single cells in an experiment provide extraordinary resolution into the molecular states 19 present within a complex biological system at any given moment. However, it is a major 20 challenge to integrate single-cell sequencing data across experiments, conditions, batches, 21 timepoints and other technical considerations. New computational methods are required that 22 can simultaneously preserve biological signals, while also integrating samples. Here, we 23 propose an unsupervised reference-free data representation, Cluster Similarity Spectrum 24 (CSS), where each cell is represented by its similarities to clusters independently identified 25 across samples. We show that CSS can be used to assess cell heterogeneity and enable 26 differentiation trajectory reconstruction from cerebral organoid single-cell transcriptome data, 27 and to integrate data across individuals and experimental conditions. We compare CSS to 28 other integration algorithms and show that CSS performs comparably well. We also show 29 that CSS allows projection of single-cell genomic data of different modalities to the CSS30 represented reference atlas for visualization and cell type identity prediction. We think CSS 31 provides a straightforward and powerful approach to understand and integrate challenging 32 single-cell multi-omic data. 33 34


35
Recent advances in molecular, engineering, and sequencing technologies have enabled the 36 high-throughput measurement of transcriptomes and other genomic features in thousands of 37 cells in an experiment. Single-cell RNA sequencing (scRNA-seq) greatly enhances our 38 capacity to resolve the heterogeneity of cell types and cell states in biological samples, as 39 well as to understand how systems change during dynamic processes such as development. 40 However, current scRNA-seq technology only provides a molecular snapshot on limited 41 measured samples at one time. Joint analysis on many samples across multiple 42 experiments and different conditions is often required. In such a scenario, the biological 43 variation of interest is usually confounded by other factors, including sample sources and 44 experimental batches. This is particularly challenging for developing systems, where the 45 data contains distinct, mature cell types as well as intermediate cell states at different points 46 along various differentiation trajectories. Several computational integration methods, 47 including but not limited to Reference Similarity Spectrum (RSS) 1,2 , Harmony 3 , Seurat 4 , 48 LIGER 5 , Scanorama 6 and MNN 7 , have been developed to address some of these issues. 49 Among them, RSS represents each cell by its transcriptome similarities to a series of 50 reference samples. Harmony uses iterative clustering-correction procedure based on soft 51 clustering to correct for sample differences. Seurat introduces an anchor strategy, with 52 anchors between samples defined by canonical correlation analysis, to correct batch effect. 53 LIGER adapts integrative non-negative matrix factorization to identifying shared and dataset-54 specific factors for joint analysis. MNN identifies mutual nearest neighbors between two data 55 sets and derives cell-specific batch-correction vectors for integration. Scanorama 56 generalizes mutual nearest-neighbors matching in two data sets to identify similar elements 57 in multiple data sets in order to support integration of more than two data sets. 58 Benchmarking on these integration methods have revealed that each method presents 59 different performance in various scenarios, and there is no single magic bullet capable of 60 always dissecting out meaningful variations of interest 8 . 61 62 Here, we propose an unsupervised scRNA-seq data representation namely Cluster 63 Similarity Spectrum or CSS. Instead of using external references as in RSS, CSS considers 64 every cell cluster in each sample for integration as an intrinsic reference, and represents 65 each cell by its transcriptome similarities to clusters across samples. The underlying 66 hypothesis of both RSS and CSS is that the undesired confounding factors, e.g. read 67 coverage of different cells, introduce random perturbation to the observed transcriptomic 68 measures which are not correlated with cell type or cell state identities. Cells with the same 69 identity therefore share similar spectrums, after standardizing across similarities to different 70 references, i.e. similarity spectrum to normalize the global differences across cells 71 introduced by the random perturbation. Additionally, as CSS considers all clusters in 72 different samples as references, standardization is done separately across clusters of 73 different samples, so that global differences across samples can be largely eliminated. We 74 apply CSS to various scenarios focusing on data generated from cerebral organoids derived 75 from human induced pluripotent stem cells (iPSCs). We use CSS to integrate data from 76 different iPSC lines, human individuals, batches, modalities, and conditions. We show that 77 technical variation caused by experimental conditions or protocols can be largely reduced 78 with the CSS representation, and CSS has similar or even better performance compared to 79 other integration methods including Harmony, Seurat v3 and LIGER, which were highlighted 80 in the previous benchmarking 8 . We also show that CSS also allows projection of new data, 81 either scRNA-seq or scATAC-seq, to the CSS-represented scRNA-seq reference atlas for 82 visualization and cell type identity prediction. The CSS codes are available at 83 https://github.com/quadbiolab/simspec. 84 85

CSS integrates scRNA-seq data from different organoids, batches, human individuals 87
To calculate the CSS representation, clustering is first performed on transcriptome data from 88 cells in each sample separately, and average expression profiles are calculated for each 89 cluster (Fig. 1, Supplementary Fig. 1). Transcriptome similarity, represented as the 90 Spearman correlation between gene expression profiles, is then calculated between each 91 cell and each cell cluster. For each cell, the calculated similarities are normalized across 92 clusters of each sample by z-transform and concatenated, resulting in its CSS 93 representation. We applied CSS and other integration approaches to a complex cerebral 94 organoid scRNA-seq dataset 1 , where the data was affected by technical variation due to 95 organoid, batch, and iPSC line. Altogether the dataset contained scRNA-seq data from 96 twenty-two-month-old human cerebral organoids, each with a different cell type composition, 97 from seven different ESC/iPSC lines in four batches of in total eleven experiments (Fig. 2a). 98 The UMAP embedding without any integration method reveals variation among cells due to 99 cell type (e.g. brain region), batch, and individual. Based on marker gene expression there 100 were clear signatures of cell heterogeneity that were observed across samples, such as 101 differentiation trajectories from neural progenitor cells (NPC) to neurons, as well as 102 specification of neurons from different brain regions including cortical excitatory neurons 103 (ENs) and ganglionic eminence (GE)-derived inhibitory neurons (INs). In addition, different 104 organoids, especially organoids from different technical batches, also separated cells into 105 distinct clusters (Fig. 2b). Using RSS to fetal BrainSpan RNA-seq data largely integrated 106 data from different experimental batches, allowing interpretable cell type annotations 107 ( Supplementary Fig. 2, Fig. 2c). In this case, there are neuronal differentiation trajectories 108 from multiple brain regions including cortex, GE, and non-telencephalic regions. This RSS-109 based cell type annotation, as provided in the original study 1 , was considered as the 110 reference annotation for a comparison of CSS and other integration methods. 111 112 Next we applied CSS, Harmony, Seurat v3, and LIGER to this cerebral organoid dataset to 113 compare integration approaches. The UMAP embeddings of all four integration methods 114 show that each method largely improves cell mixing from different organoids and batches 115 (Fig. 2d). For each cortical EN and MGE/CGE IN, we calculated the proportions of 116 neighboring cells being of the same cell type but from different batches. Here, neighboring 117 cells were defined as cells with the smallest Euclidean distances to the cell in the integrated 118 space. All the four integration methods substantially increased such proportions, indicating 119 that they all performed well in cell mixing (Fig. 2e). However, Seurat v3 and LIGER suffered 120 from a severe over-correction problem, where cells from different stages of differentiation or 121 brain regions were mixed together ( Fig. 2d-e). On the other hand, cell type heterogeneity 122 remained when the data was integrated with CSS and Harmony, giving results comparable if 123 not better than RSS (Fig. 2d-e). Relative to Harmony, CSS resulted in an UMAP embedding 124 with larger distinction of brain region-specific differentiation trajectories (Fig. 2d), especially 125 the three major cell fates (dorsal telencephalic, ventral telencephalic, and non-telencephalic 126 cells), and less likelihood mixing of different cell types (Fig. 2e). 127 128

CSS integrates time course data 129
We next wanted to apply CSS to a time course scRNA-seq dataset of cerebral organoid 130 development from pluripotency (Fig. 3a) To assess whether CSS can integrate data that segregates due to random variation, we 142 applied CSS to the scRNA-seq time course data set from iPSC through four months of 143 human cerebral organoid development 1 (Fig. 3a). This dataset includes seventeen samples 144 from two PSC lines, covering seven different developmental stages. In comparison, 145 Harmony, Seurat v3 and LIGER were also applied to the same data set. The UMAP 146 embedding shows that when no integration was applied, cells of samples at different time 147 points, particularly PSC, embryoid body (EB), neuroectoderm (NEcto) and neuroepithelium 148 (NEpith) stages, separate from each other, forming distinct cell groups (Fig. 3b). Differences 149 between the two PSC lines are also substantial. 150 151 Strikingly, we find that CSS largely integrated cells of the early time points including from 152 PSC to NEpith stages, and cell of different lines, without disturbing later time points where 153 different neuronal types have emerged (Fig. 3c). In comparison, Seurat v3 and LIGER again 154 encountered severe over-correction problems and largely mixed PSCs with NPCs in 155 cerebral organoids ( Fig. 3d-f). Harmony preserved both temporal and cell type heterogeneity 156 in the data set, although the corresponding UMAP embedding lacked qualitative clarity ( Fig.  157 3d). 158 159 To quantitatively compare different integration results, we focused on the PSC/EB time 160 points, which are known to be distinct from cell types in the later time points 9,10 . First of all, 161 all the four integration methods intermixed cells from different PSC lines. However, 162 substantial proportions of non-PSC/EB cells became neighboring cells of the PSC/EB 163 population when Seurat v3 and LIGER were used (Fig. 3g), implying over-correction 164 problem. We also calculated average distances between each PSC/EB cell and cells at 165 different time points (Fig. 3g). The distances to PSC increased along the developmental time 166 course when three of the four tested integration methods with LIGER being the only 167 exception. In general, CSS and Harmony substantially outperformed Seurat v3 and LIGER. 168 Compared to Harmony, CSS showed weaker power in mixing cells from different PSC lines, 169 but better performance in discriminating cells at the later time points from the PSC/EB ones 170 (Fig. 3g). 171 172 Interestingly, although CSS in general enhanced the differences between PSC and other cell 173 types, likely by reducing intra-group diversity while retaining inter-group divergence, a 174 continuous linkage was observed from PSC/EB to NEcto cells (Fig. 3c) in the UMAP 175 embedding. Cells between the two cell groups may therefore represent the transition state of 176 neural induction. RNA velocity analysis with scVelo also supported the transition potential 177 from PSC to NEcto cells (Fig. 3h). We screened for cells at PSC/EB or NEcto time points 178 with at least 15 neighbors being both PSC/EB and NEcto cells (Fig. 3h). This resulted in 295 179 cells in total. Comparing those transition cells to both PSC/EB cells and NEcto cells resulted 180 in differentially expressed genes such as CYP26A1 and LITAF (Fig. 3h). Among them, 181 CYP26A1 encodes for a retinoic acid-metabolizing enzyme and has been shown to be 182 essential for body patterning and brain development 11-13 . 183 184

CSS integrates data across technical conditions 185
To determine if CSS can integrate data across technical conditions, we generated scRNA-186 seq data from fresh and methanol-fixed cerebral organoid single-cell suspensions. Methanol 187 fixation was applied to half of the dissociated cells prior to the scRNA-seq experiment.

CSS integrates scRNA-seq data generated by different technologies 203
We sought to access how CSS may perform when scRNA-seq data of different samples are 204 generated by different technologies. We compared scRNA-seq data generated by 10x 205 Genomics (one cerebral organoid, 4512 cells), Fluidigm C1 and Smart-Seq2 (C1/SS2) 206 technology (685 cells) 1 , and inDrop technology (5847 cells, generated in this study). Analysis 207 on the three data sets separately indicates their cell composition, allowing annotation of cells 208 into nine different groups, including NPC and neurons of dorsal forebrain, ventral forebrain 209 and non-telencephalic regions, choroid plexus, mesenchymal-like cells and epithelial-like 210 cells ( Fig. 4a-b). Joint analysis with no integration showed that cells primarily separated by 211 data sets by different technologies. 212 213 Next we integrated the three data sets using RSS (to fetal BrainSpan RNA-seq data), CSS, 214 Harmony, Seurat v3 or LIGER. All the methods, except RSS, largely improve cell mixing 215 from different data sets ( Fig. 4c-d), although over-integration appeared when LIGER was 216 applied. Focusing on the C1/SS2 data set which only represents 6.2% cells in the joint data 217 set, different methods show different pros and cons. Harmony, Seurat v3 and LIGER 218 performed comparably well in terms of mixing cells from different data sets and better than 219 CSS. On the other hand, CSS performed the best balancing cell mixing improvement and 220 retaining cell type heterogeneity (Fig. 4d). 221 222 CSS allows query data projection to the reference scRNA-seq atlas 223 As most of the integration methods require data-specific transformation of the expression 224 matrices, when new data is introduced, the integration procedure needs to be rerun. Since 225 CSS representation relies on the normalized expression levels without additional data 226 transformation, it allows projection of new query data to the CSS-represented reference 227 atlas. We tested the feasibility with the two-month-old cerebral organoid scRNA-seq data. In 228 brief, we first extracted cells of two organoids from the Sc102a1 iPSC line as the query data. 229 Next, we built a CSS-integrated reference atlas using the remaining scRNA-seq data (Fig.  230 5a). The corresponding CSS representations of the query cells in the Sc102a1 organoids 231 were then calculated. The projected UMAP embedding of the query cells, as well as their 232 projected cell type labels, were obtained with the intrinsic UMAP projection mechanism and 233 a k-nearest-neighbor (kNN, k=50, defined in the CSS-space) classifier, respectively. The 234 projected UMAP embedding matches with the fact that the two Sc102a1 cerebral organoids 235 mostly consist of cortical cells (Fig. 5b). More importantly, using the kNN classier to predict 236 cell type of query cells based on the reference cell annotation resulted in 89.4% of the cells 237 predicted to be the exact same cell type as they were annotated. Among the rest, another 238 8.9% query cells were predicted to be a different cell type as it was annotated, but at the 239 same cell fate branch, i.e. one of cortical cells, GE cells, and non-telencephalic cells (Fig.  240 5b). This result suggests that the CSS-based data projection procedure is technically 241 straightforward and reliable. 242 243 We further sought to determine whether this projection procedure can be applied using 244 scATAC-seq data as query data. We used scATAC-seq data (Fluidigm C1) of micro-245 dissected cortex-like structure in human cerebral organoids 1 (Fig. 5c). For each cell, its 246 chromatin accessibility pattern was summarized into gene activity profile, defined as the 247 enrichment of accessible regions in promoter and gene body of different genes. CSS 248 representations of those cells towards the scRNA-seq reference atlas were then calculated 249 based on their gene activity profiles. The projected UMAP embedding and cell type labels 250 were then compared to the annotation based on the scATAC-seq data (Fig. 5c). We found 251 that most of the cells (77.4%) in the scATAC-seq data were projected to the dorsal 252 telencephalic branch which matches with the experimental design. Using the cell type 253 annotation by clustering of the scATAC-seq data as the benchmark, most of the cells in the 254 scATAC-seq data that annotated as NPC (93.5%) projected to NPC in the reference atlas, 255 while the majority of the cells annotated as neurons projected to IP or neurons in the 256 reference (67.8%). These results suggest that the two annotation strategies cross validated 257 each other. CSS-based representation and projection of scATAC-seq to the scRNA-seq 258 reference atlas is therefore possible and can be helpful for cell type annotation and 259 interpretation of the scATAC-seq data. 260 261 262 Additionally, some technical variations may introduce differences on the measured 293 transcriptome that affects similarity patterns, and therefore won't be corrected by CSS 294 representation. 295 296 One useful feature of CSS representation is its ease of applicability to new data, which 297 makes projections of new data to the CSS-represented reference data feasible. In this study, 298 by splitting the human cerebral organoid scRNA-seq data into reference and query data and 299 applying CSS for data representation, we showed that the CSS projection of query data is 300 reliable and accurate, in terms of both the projected UMAP embedding or the transferred cell 301 type labels. What's more, our work on projecting human cerebral organoid scATAC-seq data 302 to the corresponding scRNA-seq reference further suggests that CSS could help linking 303 single-cell multi-omic data and assist with annotation across modalities. Altogether, these 304 features highlight the utility of CSS as a simple yet powerful approach for single-cell 305 sequencing data integration from complex samples. 306 307

Principles of RSS and CSS 310
In general, the calculation of RSS and CSS includes the following steps ( Supplementary Fig.  311 1). For RSS calculation, highly variable genes are first identified in the reference data. We acquired the human induced pluripotent stem cell (hiPSC) lines Kucg2 and Sojd3 from 338 the HipSci resource, cultured and differentiated them to cerebral organoids following the 339 same protocol as reported 1 . One organoid from each line at 116-day-old was dissociated as 340 described previously 1 . Cell suspension of each organoid was split into two aliquots. Cells in 341 one aliquot per line were pooled, together with cells of another human cerebral organoid of 342 line Wibj2, also from HipSci resource, and one chimpanzee cerebral organoid, and diluted 343 for an appropriate concentration to obtain approximately 10000 cells in one lane of a 10x 344 microfluidic chip device. Methanol fixation was applied to cells in the remaining aliquots, 345 following the same procedure as described previously 14 . Cells were kept at 4°C all the time. 346 Briefly, between 1 and 2 × 10 6 cells from a filtered single-cell suspension of cerebral 347 organoids, were pelleted at 300 x g for 5 min. based on genomic loci with diverged bases between human and chimpanzee, following the 399 same procedure as described 1 . Only human cells were used in the later analysis. 400 Demultiplexing of the three human lines in the unfixed sample was done using demuxlet 18 , 401 based on the genotyping information of lines downloaded from HipSci websites. Cells with 402 the best singlet prediction being Wibj2 with likelihood no less than 10 higher than the second 403 best singlet likelihood were discarded. 404 405 For the newly generated inDrops scRNA-seq data, preprocessing was done following the 406 Drop-seq tools procedure. In brief, quality control to the FASTQ files was done by removing 407 reads with multiple low-quality bases at cell or molecular barcodes, and polyA sequences 408 (with at least six consecutive As) trimmed using Drop-seq tools (v1.12). Remaining reads 409 were mapped to the hg19 human genome using STAR with default parameters. Count 410 matrices were made using Drop-seq tools. Cells with less than 10000 reads were dropped 411 from the analysis. 412 413 To integrate the human two-month-old cerebral organoid scRNA-seq data, RSS to the 414 human fetal BrainSpan RNA-seq reference data was calculated as described in the original 415 publication 1 . For CSS calculation, principal component analysis (PCA) was firstly applied to 416 the data considering the top 5000 highly variable genes. CSS was then calculated, with 417 different organoids as different samples for louvain clustering (with resolution of 0.6) 418 implemented in Seurat, which took the top-20 calculated principal components (PCs) as the 419 input. Harmony was applied with the default parameters and the top-50 PCs, calculated in 420 the same way as above, to integrate different organoids. Seurat v3 was applied following its 421 standard workflow of integration, using 5000 features for anchoring and top-30 PCs in the 422 weighting procedure. LIGER was applied following basic commands tutorial, with variance 423 threshold being 0.3, inner dimension of factorization being 20, convergence threshold being 424 5E-5, three restarts of integrative non-negative matrix factorization and clustering resolution 425 of 0.4. The same parameters were also applied to the integration of the developmental time 426 course scRNA-seq data of human cerebral organoids from PSC, and the integration of 427 scRNA-seq data of human cerebral organoids in different experimental conditions 428 (fixed/fresh). The only exception is the variance threshold in LIGER, which was set as 0.1 for 429 the developmental time course data, and 0.01 for the two experimental conditions data. 430 Such difference was made to keep the number of variable features used in LIGER 431 integration similar. 432 433 To integrate the three scRNA-seq data sets generated by different technologies, 5000 highly 434 variable genes were determined for each data set separately. Genes defined as highly 435 variable in at least two data sets after excluding genes reported as cell-cycle-related genes 436 were used for integration, accounted for 2984 genes, were used for integration. The 437 variance thresholds in LIGER were 0.3, 1 and 0.8 for the 10x, inDrop and C1/SS2 data sets, 438 respectively. Other parameters are the same as described above.  To calculate CSS representation of scATAC-seq data towards a scRNA-seq reference, we 500 firstly summarized peak accessibilities to gene activity scores for each cell. The detected 501 peaks were annotated using the R package ChIPseeker 19 , against the gene annotation of 502 UCSC (hg19). For each cell, the proportion of detected genic peaks, defined as peaks 503 annotated to be at the promoter, exonic or intronic region of genes, was calculated (denoted 504 as pi for cell i). For each gene with at least ten genic peaks detected in the data set, its 505 proportion of detected genic peak at each cell was also calculated (denoted as pi,j for gene j 506 in cell i). The gene activity score of gene j in cell i was then defined as the odds ratio pi,j / pi. 507 CSS representation was then calculated for each cell against the reference sample clusters, 508 by calculating, standardizing and concatenating the Spearman correlation coefficients 509 between the gene activity profile of cells in the scATAC-seq data and the average 510 transcriptomic profiles of the reference sample clusters.