Coexpression uncovers a unified single-cell transcriptomic landscape

Researchers stand to gain insight into complex biological systems by assembling multiple single-cell RNA-sequencing (scRNA-seq) studies to reveal a panoramic view of overarching biological structure. Unfortunately, many existing scRNA-seq analyses are limited by sensitivity to study-specific noise patterns, by lack of scalability to large datasets, or by integrative transformations that obscure biological relevance. We therefore introduce a novel algorithmic framework that analyzes groups of cells in coexpression space across multiple resolutions, rather than individual cells in gene expression space, to enable multi-study analysis with enhanced biological interpretation. We show that our approach reveals the biological structure spanning multiple, large-scale studies even in the presence of batch effects while facilitating biological interpretation via network and latent factor analysis. Our coexpression-based analysis enables an unprecedented view into two complex and dynamic processes—neuronal development and hematopoiesis—by leveraging a total of seven studies containing 1,460,527 cells from laboratories spanning three continents, yielding systems-level insight unattainable by any individual experiment. Our work demonstrates a path toward probing highly complex biological systems from emerging consortium-scale single-cell transcriptomics.


Introduction
Fundamental biological processes, like neuronal development or hematopoiesis, are of 2 broad importance but are also highly complex. While researchers can now functionally 3 interrogate such processes at a high resolution with single cell RNA-sequencing (scRNA-seq) 1-7 , 4 the underlying biology is often more dynamic and multi-faceted than what can be captured by a 5 single experiment. Instead, multiple laboratories assay different parts and stages of the process 6 using many separate scRNA-seq experiments. A major computational and analytic challenge is 7 to provide researchers with insight into the full biological landscape of interest (for example, 8 across the full range of development or differentiation) not previously accessible by any 9 individual experiment 8 . 10 Multi-study scRNA-seq analysis, however, remains challenging for a number of reasons. 11 Such analysis involves extracting overarching, systems-level insight, but it must do so within a 12 practical amount of computation. Moreover, biological signal in multi-study analysis is 13 confounded by study-specific noise patterns. This problem has motivated techniques for 14 computational batch effect correction [9][10][11][12][13][14][15] , but existing approaches integrate experiments using 15 transformations that obscure the biological relevance of individual data values, making it 16 difficult for downstream analyses to interpret the transformed result. Existing integrative 17 algorithms also aim to minimize inter-study variation, thus removing relevant differences that 18 would otherwise be useful to biological researchers. 19 To enable robust, consortium-scale scRNA-seq analysis, we reasoned that scRNA-seq 20 analysis of groups of cells in the gene coexpression space (captures the similarity of gene 21 expression changes between pairs of genes), rather than single cells in the gene expression space 22 (focuses on the expression patterns of individual genes), would be a more favorable paradigm. 23 Coexpression is more robust to experiment-specific noise patterns than gene expression 24 Preprint. Work in progress. measurements; not only are many coexpression measures (for example, Pearson correlation) 25 robust to affine transformation, some evidence suggests that gene coexpression and information 26 redundancy underlie cross-study replicability of single-cell experiments [16][17][18] . Coexpression also 27 provides a rich feature space with directly meaningful values that capture pairwise dependencies 28 among genes, allowing for graph-theoretic analysis of gene coexpression networks. There is a 29 wealth of existing literature, developed in both single-cell and bulk settings 19-23 , for inferring a 30 coexpression network and determining gene modules within a network. Previous work, however, 31 has not focused on analyzing the meaningful variation across multiple coexpression networks 32 over a large biological landscape. 33 Here we demonstrate that coexpression is a valuable paradigm for consortium-scale 34 scRNA-seq analysis. We develop a novel algorithmic framework, which we call Coscape, that 35 constructs a landscape of coexpression variation by piecing together information across multiple 36 studies and resolutions to capture meaningful changes in complex biological systems. We 37 leverage our coexpression paradigm to conduct unprecedented meta-analyses of large-scale 38 scRNA-seq datasets profiling mouse neuronal development and human hematopoiesis. We focus 39 on these biological systems because they have been extensively profiled across many large-scale 40 scRNA-seq studies [1][2][3][4][5][6][7] which have meaningful developmental differences that we do not wish to 41 completely remove. We analyze data from laboratories spanning three continents and containing 42 a total of 1,460,527 high quality cells and uncover rich, systems-level insight into the genes 43 involved in functions as diverse as neuronal activation, synaptic development, neuronal cell 44 division, lymphocyte activation, and coagulation. We obtain additional biological validation 45 from existing literature and from other data modalities including in situ hybridization and protein 46 interaction networks. We envision that the techniques and ideas outlined here will help enable 47 analyses that take advantage of a wealth of scRNA-seq data generated across diverse biological 48 systems. 49

50
Coexpression-based analysis using pan-resolution clustering 51 Our coexpression-based analysis is fundamentally based on statistics computed over 52 groups of cells, rather than individual cells. Coexpression is typically measured by computing a 53 gene-by-gene correlation matrix over a cluster of cells. Coexpression measurements, however, 54 may change with clustering resolution 16,24 and single cell datasets often have meaningful 55 multiresolution structure 25 . We therefore introduce a strategy that repeatedly clusters a dataset at 56 multiple resolutions and considers all clusters for downstream analysis (Figure 1a ; Methods); 57 we refer to this strategy as pan-resolution clustering (panclustering). Panclustering ensures that 58 our algorithm captures coexpression patterns across multiple resolutions, which, as we 59 demonstrate below, can increase the discovery of gene interactions corroborated by other 60 biological networks. 61 Our implementation of panclustering is based on the Louvain community detection 62 algorithm 26 , a common clustering method for scRNA-seq data. Louvain clustering iteratively 63 merges cells into cluster "communities" until convergence, which is controlled by a resolution 64 parameter 27 (higher resolutions tend to increase the number of communities). We obtain many 65 possible realizations of a Louvain clustering by repeating the algorithm with multiple resolution 66 parameters and, importantly, also keeping cluster information from each agglomerative iteration 67 (Figure 1a; Methods). Each cluster defines a single gene-by-gene correlation matrix, on which 68 we perform an additional sparsification step that sets low correlations to zero to both reduce the 69 influence of noisy associations and improve computational efficiency (Methods). We choose 70 Louvain clustering due to its asymptotic efficiency, since its runtime and space usage scales with 71 the size of the k-nearest neighbor (KNN) graph of cells (i.e., each cell is a node in the graph), 72 rather than quadratically in the number of cells as in other hierarchical clustering algorithms. 73 Preprint. Work in progress.
Each datapoint in the downstream analysis therefore represents a cluster of cells 74 featurized by coexpression. Importantly, we can then perform analyses like visualizing the KNN 75 graph of coexpression matrices (which we refer to as the "coexpression landscape"), arranging 76 the coexpression matrices into a trajectory, and finding common patterns within groups of 77 similar coexpression matrices via dictionary learning (Figure 1a). We call our overall algorithm 78 Coscape since it constructs and analyzes the coexpression landscape. Many of the parts of 79 Coscape have analogous versions within typical analyses of single cells in gene expression space 80 (Figure 1b), but here we demonstrate that similar lines of thinking can be transferred to pan-81 resolution clusters of cells in gene coexpression space. Unlike traditional scRNA-seq analysis, in 82 which information is largely separated according to study, Coscape pieces together information 83 across multiple studies to form a naturally unified landscape (Figure 1c). Such a landscape 84 becomes especially valuable when researchers seek to understand the meaningful biological 85 changes among different studies (for example, studies assaying different stages of development), 86 which would be difficult to preserve using traditional integrative methods 9-15 that attempt to 87 minimize any inter-study variation (Figure 1c).  (Supplementary Fig. 1). Study-specific and subcluster-specific structure is also present 119 after applying existing integrative algorithms based on mutual nearest neighbors matching 9 120 (Scanorama) or on learning a latent space parameterized by a variational autoencoder 14 (scVI) 121 ( Supplementary Fig. 1); these methods are representative of many others also based on nearest 122 neighbors matching [10][11][12] or on learning a joint latent space 13,15 . Visualization and analysis that 123 conveys such structure is not necessarily undesirable and may be useful in many cases, 124 especially when analysis is limited to a single scRNA-seq experiment. However, in cases when 125 we seek higher order, systems-level patterns spanning multiple datasets, such as those generated 126 across a consortium of institutions, we find that coexpression provides a naturally unified and 127 much more advantageous space. 128 Interpretation of coexpression landscape yields insight into neuronal development 129 A notable advantage of analysis in coexpression space is our ability to gain systems-level 130 insight into processes occurring throughout neuronal development from embryo to adult. Like 131 methods for clustering cells in the gene expression setting, we can facilitate interpretation by 132 clustering coexpression matrices that share similar structure and analyzing each cluster's unique, 133 representative patterns. We leverage a technique known as dictionary learning to discover 134 consistent patterns across many coexpression networks. Dictionary learning across covariance 135 matrices has been successfully applied to diverse problems, including information retrieval 30 and 136 functional brain profiling 31 , and can be naturally extended to single cell coexpression. Dictionary 137 learning is distinct from but analogous to methods like nonnegative matrix factorization (NMF) 138 for finding the components underlying a set of gene expression profiles. In our dictionary 139 learning setup, we represent each pan-resolution cluster as a sparse weighted sum of a few 140 underlying coexpression matrices, or "dictionary entries," each of which represents important 141 patterns reproduced across many coexpression matrices. We found that only six basis 142 coexpression matrices were required to achieve good reconstruction error of the full set of pan-143 resolution coexpression matrices (Methods). These basis coexpression matrices can also be 144 interpreted as networks with genes as nodes and edges between genes with nonzero 145 coexpression. 146 As a first interpretative step, we looked at genes involved in edges unique to each of the 147 six basis networks; analogous to "marker gene" analysis in expression space, we can refer to this 148 as "marker edge" analysis in coexpression space. We then looked for significant gene ontology 149 (GO) process enrichments 32 161 We sought to further characterize our coexpression networks by scoring genes on their 162 betweenness centrality, which is a general measure of node importance based on the number of 163 shortest paths containing a particular node, in each of the basis coexpression networks 33 164 (Methods). High betweenness genes of note include Bmp4 (dictionary entry 1, fetal), an 165 important neural stem cell morphogen 34 ; Cbln1 (dictionary entry 1, fetal), an important gene in 166 synaptic formation 35 ; Coro1a (dictionary entry 2, fetal/neonatal) and Snhg11 (dictionary entry 3, 167 fetal/neonatal), both involved in axon growth 36,37 ; and Htr2c (dictionary entry 4, 168 adolescent/adult), which encodes the serotonin receptor (Figure 2e-h). These high-betweenness 169 genes are more likely to be centrally located within the coexpression network or be involved in 170 multiple gene modules; we also note that many other possible node and edge centrality measures 171 can be applied to these networks to yield additional insights. We can also look at gene expression  178 We found additional validation for the genes that had the strongest correlation with 179 developmental pseudotime by using the Allen Developing Mouse Brain Atlas (ADMBA) 40 Fig. 3). We also observed that neither the sparsity nor size of the pan-resolution clusters was 203 strongly correlated with coexpression landscape structure, as quantified by diffusion pseudotime 204 (Spearman correlation of 0.38 and 0.19, respectively, compared to 0.80 for developmental age; n 205 = 2,380 pan-resolution clusters), and changes to sparsity did not substantially affect the structure 206 of our developmental landscape (Supplementary Fig. 3). 207 Coexpression-based developmental trajectories yields insight into hematopoiesis 208 We next sought to demonstrate the broad applicability of Coscape to other complex 209 biological phenomena besides neuronal development. To this end, we analyzed the coexpression 210 landscape of three large-scale hematopoietic datasets: 240,898 cells from bone marrow and 211 158,639 cells from cord blood, both generated by the Human Cell Atlas 7 , and 128,689 peripheral 212 blood mononuclear cells (PBMCs) 6 . From these tissues, we expect to observe cells at most stages 213 of hematopoiesis 41 including hematopoietic stem cells and erythroid progenitors, mostly in the 214 bone marrow and cord blood, to more mature lymphocytes and myeloid cells, mostly as PBMCs. 215 A large number of the PBMCs underwent fluorescence activated cell sorting (FACS) 216 prior to scRNA-seq, giving us experimentally-determined proteomic labels for a subset of the 217 data. We therefore labeled some clusters as containing a substantial amount of progenitor-218 associated (CD34 + ), myeloid-associated (CD14 + ), and lymphoid-associated (CD4 + , CD8 + , 219 CD19 + , CD56 + ) cell-surface marker expression (Figure 4, Supplementary Fig. 4); these labels 220 allowed us to see which parts of the coexpression landscape were more associated with 221 progenitor, myeloid, or lymphoid states. We applied the same dictionary-learning procedure 222 (Methods) to the hematopoietic coexpression landscape, yielding four main dictionary entries. 223 The first dictionary entry, which we call the progenitor coexpression network, corresponds to all 224 of the CD34 + -labeled clusters and also has high betweenness centrality scores for genes that have much more substantial study-specific and tissue-specific structure (Supplementary Fig. 5). In 251 contrast, coexpression finds the high-level, cross-dataset structure consistent with cellular 252 differentiation. 253 We also note that we observed lower amounts of erythroid and myeloid cells within the 254 PBMC dataset due to transcriptional quiescence and that, in general, the number of clusters does 255 not necessarily reflect the "true" in vivo proportion of the various cell lineages. However, by 256 combining information across multiple tissues and hundreds of thousands of cells, we are able to 257 obtain a more complete view of the hematopoietic coexpression landscape. 258 Coexpression across pan-resolution clusters has greater correspondence with other known gene-259 gene associations 260 While coexpression dictionary learning across many pan-resolution clusters highlighted a 261 wealth of biologically relevant genes, we looked to assess if the interactions captured by our 262 analysis also had any additional biological support, as well as if our particular pan-resolution 263 clustering strategy provided any advantage in uncovering biologically important interactions 264 over simpler baseline techniques. More specifically, because pan-resolution clustering discovers 265 associations across many resolutions which may not be discovered otherwise, we reasoned that 266 our coexpression networks might also have greater overlap with real gene-gene associations. 267 We therefore leveraged an existing strategy 23 (Figure 5a,b). We reasoned that this result is due to more discoverable gene-gene interactions 284 (as captured by coexpression) within the pan-resolution setting because coexpression changes in 285 strength with clustering resolution 16,24 . We also note that this result is not limited to the multi-286 study integration setting but can, in principle, also increase discovery of coexpressed genes 287 Preprint. Work in progress. within a single study. Many gene-gene interactions are discovered via pan-resolution clustering 288 but not by lower resolution methods (Figure 5c,d).  Table 1, also see Discussion). 300 We performed all of our analyses in a practical amount of computational time and   useful properties including interpretability, robustness, and scalability, other spaces may exist 363 that may work well according to the same criteria. 364 Coscape can also be used to newly probe many other biological systems, including 365 pancreatic islet cells 59 or lung cells 60 , that have been or will be deeply profiled using single cell 366 technologies. Reasoning about the relationship between coexpression and other functional 367 measurements of single cells, such as chromatin accessibility or methylation, also remains an 368 important future direction. We believe the algorithms and ideas presented here provide a 369 complementary and highly-informative way for researchers to study biological processes at 370 single-cell resolution and at multi-institution scale. We make our analysis pipelines and data 371 available at http://coscape.csail.mit.edu. 372 Preprint. Work in progress.

373
Mouse neuronal development dataset preprocessing 374 We obtained publicly available datasets from five large-scale, published scRNA-seq 375 studies of the mouse brain at different developmental timepoints [1][2][3][4][5] . We used only the cells that 376 passed the filtering steps of each respective study and additionally removed low-complexity or 377 quiescent cells with less than 500 unique genes. For the embryonic dataset from Cao et al. 1 , we 378 only considered cells that the study authors had assigned to the "neural tube and notochord" 379 trajectory. For the datasets from Zeisel et al. 4 and Saunders et al. 5 we only considered cells that 380 the study authors had labeled as neuronal. We then intersected the genes with the highest 381 variance-to-mean ratio (i.e., dispersion) within each study to obtain a total of around 2000 genes 382 that were highly variable across all studies. All studies provided data as digital gene expression 383 (DGE) counts, which we further log transform after adding a pseudo-count of 1.

384
Human hematopoiesis dataset preprocessing 385 We obtained publicly available datasets of cord blood and bone marrow cells from the 386 Human Cell Atlas 7 (https://preview.data.humancellatlas.org/) and PBMCs from Zheng et al. 6 387 (https://support.10xgenomics.com/single-cell-gene-expression/datasets). We removed cells with 388 less than 500 unique genes; we also noticed a large number of cells with high percentages of 389 ribosomal transcripts, which may indicate nontrivial amounts of ambient ribosomal RNA 390 contamination during the scRNA-seq experiment, so we only included cells with less than 50% 391 ribosomal transcripts in further analysis. As in the mouse neuronal dataset, we intersected the 392 genes with the highest dispersions within each study to obtain a total of around 2000 genes that 393 were highly variable across all studies. All studies provided data as digital gene expression 394 (DGE) counts, which we further log transform after adding a pseudo-count of 1.
We modify the Louvain clustering algorithm 26,27 (implemented at   397 https://github.com/vtraag/louvain-igraph) to store community information at each iteration. To 398 capture a range of potential clustering results, we rerun the Louvain clustering algorithm at a 399 diverse range of clustering resolutions (0.1, 1, and 10), storing the hierarchical cluster 400 information for each run. The three runs of Louvain clustering are done in parallel and we cluster 401 each study individually. To reduce the effect of noisy correlations, we consider clusters with a 402 minimum of 500 cells, which, combined with highly variable gene filtering (described below), 403 reduces the chance that a strong correlation is due to a few outlier cells. 404 Computing coexpression matrices 405 We compute the Pearson correlation matrix  Coexpression matrix dictionary learning 438 We formulated the dictionary learning problem for coexpression matrices by optimizing 439 argmin (1) ,…, (  where ( ) ∈ ℝ ≥0 is a sparse code of weights for pan-resolution cluster , is a sparsity-442 controlling parameter, = [ 1 ⋯ ⋯ ] ∈ ℝ ≥0 ( 2 )+ × is a dictionary of (vectorized) 443 coexpression matrices, and is a user-defined parameter indicating the number of dictionary 444 entries to learn. We used an iterative optimization algorithm that alternatively estimated 445 dictionary weights and dictionary entries using a least angle regression-based procedure 62  Interpretation of dictionary entries 450 We can interpret each dictionary entry as a coexpression network in which genes are 451 nodes and elements of define edge weights between those genes. We identify important genes 452 using statistics such as betweenness centrality 63 , which is the sum of the fraction of all-pairs 453 shortest paths that pass through some node. We use the networkx Python package 64  dimensional "semantic space" that places similar terms closer together 65 . We limit analysis to 462 patterns that are reproducible across many clusters and only consider dictionary entries that have 463 nonzero weights in at least ten pan-resolution clusters. 464 Gene interaction network overlap analysis 465 We obtained four "target" gene-gene were set to zero. The number of overlapping edges between a coexpression network and a target 479 graph was compared to a null distribution over the random graphs, which we use to compute a Z 480 score for each coexpression network. 481 Runtime and memory profiling 482 We

Data Availability
We used the following publicly available datasets: • Notochord and neural plate cells from Cao et al. 1  (a) Our algorithm for coexpression-based analysis of scRNA-seq data, which we refer to as Coscape. Cells are clustered at multiple resolutions, with groups of cells colored individually, resulting in an ensemble of clusters in which each cluster in each resolution defines a single gene-gene correlation matrix. These matrices are sparsified with a winner-take-all strategy in which weak correlations are set to zero. We use these sparsified correlation matrices as our coexpression features, where each datapoint in subsequent downstream analysis is a cluster of cells. The KNN graph of coexpression matrices forms the "coexpression landscape" that captures the topological relationships between pan-resolution clusters. Many downstream analyses are then possible, including trajectory learning and pseudotime assignment. Coexpression matrices are expressed as a combination of a few basis matrices, or "dictionary entries"; pairs of genes unique to a dictionary entry can be thought of as "marker edges," for which we can look at enriched gene ontology (GO) processes. (b) Many of these analyses take inspiration from analogs in gene expression space. For example, rather than visualizing pan-resolution clusters in coexpression space, standard analyses visualize single cells in expression space; rather than decomposing coexpression matrices via dictionary learning, the expression matrix is decomposed via algorithms such as nonnegative matrix factorization. Coexpression space, however, enjoys enhanced interpretability, multi-study robustness, and scalability to large-scale studies. (c) A conceptual illustration of the difference between attempting to extract biological information from single-studies, each profiling different parts of a larger biological system ("Traditional single-cell expression analysis"); integrative algorithms that attempt to minimize