Mapping multicellular programs from single-cell profiles

Tissue homeostasis relies on orchestrated multicellular circuits, where interactions between different cell types dynamically balance tissue function. While single-cell genomics identifies tissues’ cellular components, deciphering their coordinated action remains a major challenge. Here, we tackle this problem through a new framework of multicellular programs: combinations of distinct cellular programs in different cell types that are coordinated together in the tissue, thus forming a higher order functional unit at the tissue, rather than only cell, level. We develop the open-access DIALOGUE algorithm to systematically uncover such multi-cellular programs not only from spatial data, but even from tissue dissociated and profiled as single cells, e.g., by single-cell RNA-Seq. Tested on spatial transcriptomes from the mouse hypothalamus, DIALOGUE recovered spatial information, predicted the properties of a cell’s environment only based on its transcriptome, and identified multicellular programs that mark animal behavior. Applied to brain samples and colon biopsies profiled by scRNA-Seq, DIALOGUE identified multicellular configurations that mark Alzheimer’s disease and ulcerative colitis (UC), including a program spanning five cell types that is predictive of response to anti-TNF therapy in UC patients and enriched for UC risk genes from GWAS, each acting in different cell types, but all cells acting in concert. Taken together, our study provides a novel conceptual and methodological framework to unravel multicellular regulation in health and disease.


INTRODUCTION
The interplay between different cells in a tissue ecosystem is crucial for maintaining tissue homeostasis. While many diseases have been traditionally perceived as the result of the malfunction of a particular cell or cell type, mounting evidence (1)(2)(3)(4)(5) and new therapeutic strategies (3,6,7) have demonstrated the pivotal role of multicellular action in maintaining homeostasis and its dysregulation in a wide-range of diseases, thus opening new opportunities for intervention, diagnosis, disease monitoring and prevention. In parallel, advances in single cell RNA-seq (scRNA-seq) (8,9) and spatial transcriptomics (10)(11)(12) now allow us to systematically explore molecular profiles at single cell resolution across cell types (13), tissues (14,15), and disease states (16)(17)(18), in both isolated cells and intact tissues (10)(11)(12). However, despite these advances, deciphering multicellular regulation still remains a major challenge, limiting our ability to move from a cell-to a tissue-centric perspective.
One of the key current limitations is in our computational formulation of the problem and available analysis methods. While many computational methods have been developed to analyze single cell data the vast majority were geared to map and explore single cell states (13,(19)(20)(21)(22)(23)(24)(25). Those developed to study cell-cell interactions, are focused on reconstructing the tissue's spatial organization (e.g., by combining single cell and spatial data or based on assumptions on spatial patterning) (26-29), on inferring putative physical cell-cell interactions based on known receptor-ligand pairs and known signaling pathways (30-32), or on highlighting recurring cell type compositions of cellular neighborhoods using multiplex molecular spatial data (33,34). Thus, while these methods revealed important properties of cell biology and tissue structure, we still lack a computational basis to study tissue biology and functional multicellular processes at scale.
Here, we approach this problem in a new way, by introducing the concept of multicellular programs and developing the first algorithm to systematically uncover multicellular programs from single-cell data. We define multi-cellular programs as the combinations of different cellular programs in different cell types that are coordinated together in the tissue, thus forming a higher-order functional unit at the tissue-level, rather than only at the cell-level. We develop DIALOGUE, a computational approach for decoupling cell states through multicellular configuration identification, by using cross-cell-type associations to identify multicellular programs and map the cell transcriptome as a function of its environment. We apply DIALOGUE to a wide-range single-cell data types, obtained not only from spatially intact tissue (e.g., MERFISH) but even from dissociated cells (e.g., single-cell RNA-Seq). Tested on MERFISH data from the mouse hypothalamus (10) DIALOGUE recovered spatial information, predicted the properties of a cell's environment only based on its transcriptome, and identified multicellular programs that mark animal behavior. Next, applied to single-cell RNA-seq from colon biopsies, DIALOGUE identified a coordinated multicellular response across epithelial, innate and adaptive immune cells that arises in ulcerative colitis patients, predicts clinical responses to anti-TNF therapy, and spans multiple genes associated with UC risk by Genome Wide Association Studies (GWAS) but acting in different cells. Lastly, using single nucleus transcriptomes from the prefrontal cortex, DIALOGUE highlighted an Alzheimer's disease multicellular program that increases in the aging brain. Taken together, our framework and the DIALOGUE method provide a systematic way to study cellular crosstalk, along with its molecular underpinnings and phenotypic implications, as part of a transition from single cell to tissue biology.

DIALOGUE: Decoupling cell states through multicellular configuration identification
DIALOGUE leverages the reasoning that cells within the same microenvironment are exposed to similar cues, which may activate coordinated responses in different cell types at two (nonmutually exclusive) levels (Fig. 1A). First, cells of multiple types may simultaneously activate the same, cell-type-independent program (35) (Fig. 1A). Second, cells of different types may each activate a different, cell-type-specific program, but in a concerted fashion, either because they directly impact each other, or because they each respond distinctly to the same shared cue (Fig. 1A). Both cases should give rise to corresponding transcriptional programs across different cell typeswhere the expression of one set of genes in a certain cell is associated with the expression of the same or another set of genes in nearby cells or cells from the same sample. If we can find those patterns from data, we should be able to uncover multicellular configurations and their dependent cellular programs.
Given single-cell profiles from different spatial locations (in spatial profiling) or from different samples (in scRNA-seq), DIALOGUE treats different types of cells from the same micro/macroenvironment or sample as different representations of the same entity (Fig. 1B). It first applies penalized matrix decomposition (PMD) (36) to identify sparse canonical variates that transform the original features' space (e.g., genes, PCs, etc.) to a new feature space, where the different (cell-type-specific) representations are correlated across the different samples/environments; this is done based on the average signal across the different cell subsets in each niche/sample. It then identifies the specific genes that comprise these latent features by fitting multilevel (hierarchical) models; this step models the single-cell distributions and controls for potential confounders, such as gender, age, variation in sample preparation, technical variability etc. (Fig. 1B, Methods). As output, DIALOGUE provides multicellular programs, each composed of two or more co-regulated, cell-type-associated programs (Fig.   1B, Methods). DIALOGUE can identify numerous multicellular programs (MCPs; input parameter k < rank of the original feature space). In practice, the different MCPs are not correlated with one another (Methods), and the cross-cell-type correlations observed within an MCP decreases with k, such that the first few MCPs depict most of the multicellular co-expression.
Importantly, DIALOGUE does not depend on k and will find the same MCPs or a subset of them.
Using the multicellular programs, DIALOGUE can accurately identify mis-localized cells and predict the cell's environment solely based on its transcriptional state (without using any type of spatial information for training). It can decouple cell states to their niche/sample-dependent and independent programs, and further decompose the niche/sample-dependent programs into generic (shared) and cell-type-specific components (Fig. 1A,B). DIALOGUE then uses receptor-ligand interactions to propose cell-cell interactions that may mediate the novel multicellular programs that it identified (starting from a graph that depicts all the possible ligand-receptor interactions, and parsing it for each MCP to include its genes and short paths that link these genes; Methods). It further links the multicellular programs to genetic risk factors and to specific phenotypes of interest, thus providing a new way to uncover multicellular processes that mediate the genotype-to-phenotype mapping. As we show below, the multicellular programs identified by DIALOGUE are associated with complex phenotypes, ranging from animal behavior to clinical and pre-clinical disease manifestation.

DIALOGUE identifies generalizable multicellular configurations and recovers spatial information from MERFISH in the mouse brain
To test DIALOGUE, we first applied it to MERFISH data, where the in situ expression of 155 genes was measured at the single-cell level in ~1.1 million cells from the mouse hypothalamic preoptic region (10). We considered six broad cell types that were sufficiently represented in the datamicroglia, endothelial cells, astrocytes, oligodendrocytes, excitatory and inhibitory neuronsand analyzed all pairwise combinations (Supplementary Table 1).
DIALOGUE identified corresponding transcriptional programs that were highly correlated across different cell types within the same physical niche ( Fig. 2A,B, right, light blue bars).
In contrast, single genes show only a moderate correlation across different cell types within the same niche (Fig. 2C, Supp. Fig. 1A), and standard dimensionality reduction approaches, such as principle component analysis (PCA) and Non-negative Matrix Factorization (NMF) did not reveal spatial associations across cell types (Fig. 2C, Supp. Fig. 1A).
The MCPs that DIALOGUE learned generalized well and had predictive value in a train-test setting. Specifically, we learned MCPs, each with two cell-type-specific components, from MERFISH data from a training set of 9 animals, and then tested whether these MCPs can predict the properties of the cell's environment only from its own transcriptional state in a test set of 9 other animals. Indeed, using the learned model from the training data and the transcriptional state of a given cell (e.g., excitatory neurons) in the test data, DIALOGUE accurately predicted the expression level of the corresponding cellular programs in the unseen neighboring cells (e.g., microglia, astrocytes, etc., Pearson's r > 0.69, P < 1*10-30; AUROC: 0.79 -0.9, 0.82 median, Fig. 2A,B). We obtained similar results when withholding only males or only females (data not shown).
Next, we challenged DIALOGUE to recover such information by training it without any spatial coordinates, providing as input only samples that consisted of hundreds of cells from one tissue "patch" ("macro-environments" ~500 "in silico dissociated" cells; ~50-100µm2 each, Methods). We applied it on this cohort as before to learn programs for all pairs of cell types.
Despite not having spatial data for training, DIALOGUE nonetheless was able to use the cell's expression to predict the expression of the multicellular program in its unseen neighbors when tested on data from 9 other animals, which were not used in any part of the analysis other than to evaluate these predictions. For example, it successfully used the expression of inhibitory neurons to predict the state of their neighboring astrocytes (Fig. 2B). DIALOGUE performed well in this task both when predicting the microenvironment features (direct neighbors in a radius of ~15 cells; AUROC: 0.73 -0.91, median 0.82; P < 1*10-30, Mann-Whitney test) and macroenvironment features (AUROC: 0.83 -0.98, median 0.94; P < 1*10-30, Mann-Whitney test, Fig. 2A,B).
Importantly, DIALOGUE MCPs do not simply mirror cell sub-clusters, nor do they only reflect simple compositional variation across space, where different cell subtypes are localized in different regions in the hypothalamus. Specifically, when considering cell clustering based on previous analyses of this cohort (10) we find that cells with a high, low, or moderate expression of the different MCPs, span multiple clusters, and can be found as over-or under-expressed in each of these clusters (Fig. 2D). Second, the co-expression of the programs' components remained strong even when considering only pairs of specific cell subtypes (e.g., inhibitory neurons I-1 and excitatory neurons E-2; Supp. Fig. 1B) or after regressing out cell-subtypespecific variation in each cell type (Supp. Fig. 2). The MCPs thus reflect cellular programs that are distinct from those identified using standard dimensionality reduction and clustering procedures (10), and highlight variation in specific cellular components as opposed to only gross changes in tissue composition at the cell subtype level.

The multicellular programs in the mouse hypothalamus are linked to animal behavior
Many of the MCPs that DIALOGUE identified were strongly associated with particular animal behaviors. This was observed mostly for the leading MCP (MCP1; P < 0.05 for 12 out of 15 first MCPs, mixed-effects, BH FDR (37), Methods; Fig. 3A), which we focused on here (Fig.   3B,C). Subsequent MCPs (MCP2-5) showed higher intra-animal variation, resulting from spatial co-variation within the tissue (Fig. 3D,E).
The first MCPs (MCP1) that oligodendrocytes, excitatory and inhibitory neurons formed with other cell types were strongly repressed after parenting, aggression, or mating (P < 0.05, mixedeffects, BH FDR (37), Fig. 3A,B). All of these programs include Oprl1, encoding the nociceptin receptor, aligned with the key role of nociceptin in learning and emotional behaviors (38). We also find the involvement of hormonal signals, where the microglia-excitatory neurons MCP includes the genes encoding the receptors for prolactin (Prlr, neurons), progesterone (Pgr, neurons) and estrogen (Esr1, both compartments; Fig. 3B), with similar patterns in the microglia-inhibitory neuron MCP (Fig. 3B). These "hormonal" programs were equally active in female and male mice (P > 0.1, mixed-effects) and their neuronal and microglia components were strongly co-regulated even when considering only male or only female animals (r > 0.63, P < 1*10-30, Pearson correlation).
In contrast to the MCPs DIALOGUE identified, spatially-independent components did not show any significant association with animal behavior. To recover these spatially-independent components of cell states, we regressed out the spatially-associated programs from the gene expression and computed the PCs of the residuals (Methods). None of these residual components were associated with animal behavior, indicating that a large fraction of the biologically meaningful cell-cell variation in this system is captured by the MCPs identified.
Further examining the genes comprising the MCPs (Supplementary Table 1), we see that different programs have unique features that may reflect specific intercellular cross-regulation, while some genes recur across MCPs, either exclusively within the same cell type but in different MCPs, or across different types (Fig. 3C). For example, Selplg was in the microglia compartment in 4 out of the 5 (pairwise) MCP1s that involved microglia, but was not found in other cell types. Selplg encodes P-selectin glycoprotein ligand-1 (SELPLG), that, together with P-Selectin, constitutes a receptor/ligand complex involved in the recruitment of activated lymphocytes, and it is one the topmost (ranked 4) repressed genes in disease associated (DAMs) vs. homeostatic microglia (39). Other genes, such as Sox8 and Ttyh2, were repeatedly and specifically members of the astrocyte component of multiple MCPs. In contrast, some genes, as Mbp (myelin basic protein (40)) and Nnat (involved in brain development and synaptic plasticity (41)), were found in multiple cell types across multiple MCPs (Fig. 3B), suggesting that their expression may be spatially regulated in a cell-type-independent manner.

DIALOGUE recovers MCPs across five cells types from scRNA-seq data from colon biopsies of healthy individuals and ulcerative colitis patients
Next, we applied DIALOGUE to scRNA-Seq data that was collected from dissociated tissue specimens, without cell-level spatial coordinates, treating each sample as one "niche". The  Table 2

, Methods).
To test DIALOGUE in this setting, we first examined if it can identify "mis-localized cells" based on the assumption that such cells will not fit the state predicted by their macroenvironment, as reflected by DIALOGUE's multicellular programs. We trained DIALOGUE with an "in silico contaminated" dataset we generated, where we added to each sample ~50 cells (10 per cell type; 0.02-0.06% of the cell type) from an "adjacent sample" from the same individual. These "adjacent samples" could be from the same or different layer (LP/EPI), and have the same or a different clinical status (both healthy, or inflamed and uninflamed). Based on the resulting multicellular programs that DIALOGUE found in the "contaminated training set", it then computed an environment-score, which denotes for each cell to what extent its real state agrees with the one predicted by its neighbors (some of which may be contaminating; Methods).
DIALOGUE identified the mis-localized cells with high accuracy (Fig. 4A), as their environment-scores were significantly lower than those of the other cells (P < 1*10-10, t-test).
We further evaluated this when considering different types of contamination, and found that DIALOGUE was most accurate in spotting contamination from a different and distinct spatial location (i.e., LP vs. EPI, Supp. Fig. 3), and most successful for CD8 and CD4 T cells, and least so for macrophages (Fig. 4A, Supp. Fig. 3). These findings demonstrate the robustness of the approach to data contamination and the potential of using it to identify "infiltrators" to an established environment.
Applying our approach to the real (non-contaminated) data we identified five MCPs, each with five cell-type-associated components. The different cell-type components of each multicellular program were strongly co-expressed across the samples ( Fig. 4B; P < 1*10-3, mixed-effect model that accounts for gender, spatial compartment (EPI/LP), and cellular subtypes), even when considering only samples from UC patients or from healthy individuals (P < 1*10-3, mixed-effects). The majority of genes in each MCP were cell type specific (59-87%), although some were shared between the components of two or more cell types (Fig. 4C, Supplementary Table 2), potentially representing the context-dependent and -independent impact that the environment has on different cells, respectively.
In the ligand-receptor network of the UC multicellular program (Supplementary Table 3 Indicating its generalizability, the UC-multicellular program was associated with UC status also in an independent cohort of colon tissue samples obtained from 24 UC patients before and 4-6 weeks after their first treatment with a TNF-inhibitor (infliximab infusion) and in normal mucosa from 6 control patients (50). The UC multicellular program was substantially higher in the UC patients compared to the controls (P < 1*10-10, linear regression; AUC = 1) and was A multicellular program associated with Alzheimer's disease and the aging brain Finally, we applied DIALOGUE to 80,660 single-nucleus RNA-seq (snRNA-seq) profiles from the prefrontal cortex of 48 individuals with varying degrees of Alzheimer's disease (AD) pathology (17) (Methods). DIALOGUE identified a multicellular program that was substantially increased in AD patients (P < 1*10-3, mixed-effects, controlling for patient age and gender, Fig. 5A-C, Supplementary Table 4). The program spans components in inhibitory and excitatory neurons, oligodendrocytes, oligodendrocyte-precursor-cell (OPCs), astrocytes, and microglia, and its different cell type compartments are largely distinct, with minimal overlap (Fig. 5D; 90-95% of the MCP genes are in a single cell type component).
Further supporting its connection to AD, its different components were substantially induced in the disease state also in an independent cohort of bulk brain tissue samples collected from 350 individuals, with definitive diagnosis of AD pathology or no AD pathology (51,52) (AUROC = 0.65, P = 1.47*10-6, regression model, accounting for individual age and gender,

Methods).
This multicellular program includes the simultaneous repression of synaptic signaling genes in excitatory neurons (P < 1*10-3, hypergeometric test), of aerobic respiration genes in excitatory and inhibitory neurons (e.g., the TCA cycle, P < 1*10-5, hypergeometric test), and of genes that facilitate proper synaptic activation in oligodendrocytes and astrocytes (e.g., the excitatory amino acid transporter SLC1A2). Both its repressed and induced components include genes that have been previously linked to AD risk in genome-wide association studies (53-56) ( As one of the greatest risk factors for Alzheimer's is increasing age, we explored the possibility that the AD-program might become more prominent with age. Indeed, the up-and downregulated parts of the program were also respectively enriched with genes that have been previously found to be over-or under-expressed in the aging cortex (P < 1*10-5, hypergeometric test). Moreover, the AD-program as a whole (Fig. 5E), and each of its celltype compartments individually (Supp. Fig. 5), were strongly associated with age in bulk gene expression data that was obtained from both the frontal cortex (n = 455) and cerebellum (n = 456) of neurologically normal subjects across different ages (61). These findings thus demonstrate DIALOGUE's ability to highlight pathological multicellular configurations that are tightly linked to different risk factors.

DISCUSSION
In this study, we define the concept of multicellular programs and introduce it as a framework for studying tissue biology. We further develop the DIALOGUE method to recover multicellular programs, and apply it not only to spatial data but also to single cell genomics of dissociated tissue samples. The model it learns can predict the microenvironment of a cellat different length scalessolely from its expression profile, and reveals multicellular programs associated to diverse phenotypes properties. DIALOGUE is entirely data-driven, and does not rely on strong underlying assumptions. It refers to different cell types as different representations or views of the same entity and maps multicellular programs directly from the data in an efficient and regularized way. The multicellular programs DIALOGUE identifies may arise due to shared (latent) factors in the cells' micro-or macro-environment, due to shared (latent) genetic features, which impact different cell types in different ways, or due to a combination of genetic and environmental factors. All are instrumental for our understanding of homeostatic and disease states in tissue.
The AD-and UC-multicellular programs we identified included genes with germline variants associated with predisposition to these respective diseases, as well as genes that become increasingly over/under-expressed with age (in AD), drug treatment or response (in UC). Cellintrinsic age-related processes leading to somatic genetic changes or epigenetic aberrations, together with the external cues from the aging microenvironment may impact different cell types in different yet correlated ways and give rise to such multicellular programs. Given additional information (e.g., genotypes) or datasets where multiple samples are profiled from each individual, DIALOGUE can incorporate these sources of multicellular covariation and decouple their intertwined effects (Methods), discerning between genetic-related and environmental-related co-variation.
In the future, our approach can be extended to integrate the identified multicellular programs with models of transcriptional regulation (62) or intercellular signaling (30) to further explore their molecular basis. Integrating DIALOGUE with Perturb-Seq (63) data, especially when applied in vivo (64) (and measured in situ), or with human genetic data in larger cohorts, could provide a powerful approach to uncover the impact of genetic perturbations on the cell and its neighbors and set the stage for causal inference of gene function, vertically integrating from genetic causes, to single cells to tissue and clinical phenotypes.
While we focus here on multicellular transcriptional programs, DIALOGUE can be applied to any type of single cell data, from dissociated cells or in situ, including other molecular profiles or cell morphological features. Given multimodal data, it can identify both intracellular associations across different cellular modalities along with multicellular programs, and decouple the intracellular and multicellular processes that dictate different cellular properties (e.g., cell shape, transcriptome, proteome, etc.). DIALOGUE should be immediately applicable to existing datasets, allowing to extract multicellular programs from existing data, which was previously analyzed mostly at the single cell

AUTHOR CONTRIBUTIONS
L.J. and A.R. conceived the study. L.J. devised the method and performed the analyses, with guidance and input from A.R. L.J. and A.R. wrote the manuscript.

COMPETING INTERESTS
A.R. is a co-founder and equity holder of Celsius Therapeutics, an equity holder in Immunitas Therapeutics, and a scientific advisory board member of ThermoFisher Scientific, Syros Pharmaceuticals, Asimov, and Neogene Therapeutics. As of August 1, 2020, AR is a an employee of Genentech.

DIALOGUE. Overview.
Given single-cell data obtained across different spatial locations or samples, DIALOGUE treats different types of cells from the same micro/macroenvironment or sample as different representations of the same entity (Fig. 1). To select the tuning parameters, we defined 1 ,…, as the optimal value of the optimization function described above, when using a particular set of tuning parameters 1 , … , . We quality. The latter is tested using the following multilevel model where is the value of the latent feature in cell i in sample j, , is a cell-type-level covariate that controls for potential confounding factors (e.g., log-transformed number of reads), and is the sample j intercept, defined as where are sample-level covariates of sample j (e.g., patient age, gender), and ( ) is the average expression of gene g in cell type z in sample j.
DIALOGUE. Environment-score metric. The environment-score (Es) is defined based on the difference between the expected and observed cell state, given the identified multicellular programs.
where , is the value of feature k in cell i of cell type z in sample j, and ′ is the number of cells of cell type z' in sample j.
To test if this measure can identify mis-localized cells, we used the colon/IBD data (16), where multiple samples were obtained from each individual. We performed an in silico contamination of the data by adding to each sample ~50 cells (10 per cell type) from an "adjacent" sample obtained from the same individual. These "adjacent samples" could be from the same or different layer (LP/EPI), and have the same or a different clinical status (both healthy, or inflamed and uninflamed). Applied to this "contaminated" data, DIALOGUE identified MCPs spanning TA1, TA2, macrophages, CD8 and CD4 T cells, and computed the environmentscores as described above.
The predictions were tested per cell type (Fig. 4A, Supp. Fig. 3

DIALOGUE application to spatial transcriptomic data from the mouse hypothalamus.
We applied DIALOGUE to single-cell transcriptomes of 155 genes across ~1.1 million cells from the mouse hypothalamic preoptic region (10)  The "behavior" annotation was used to examine the association of the programs with animal behavior. "Cell class" annotations were used to assign cells to major cell types (e.g., oligodendrocytes) and subtype (e.g., immature oligodendrocytes 1), and the "neuron cluster_ID" was used to assign neurons to different neuronal subtypes (e.g., I-1 for inhibitory neurons cluster 1, E-2 for excitatory neurons cluster 2, etc.).
Because the number of observations substantially exceeded the number of features in this cohort, we used the scaled and centered gene expression matrix as the "original features space"  To capture disease-related variation, which in this case was subtler and potentially masked by other sources of cell-cell variations, PCs were first computed based on genes that showed at least some moderate association with the disease state (t-test p-value < 0.1, without correction for multiple hypotheses). The top 30 PCs were used as the original feature space (Fig. 1B, "input"), ensuring that the number of features will not exceed the number of samples.
Bulk RNA-Seq data from the ROS/MAP study was used to examine the AD program in a larger cohort of 638 cortex autopsies (51,52