Abstract
Cell-cell interactions are crucial for multicellular organisms as they shape cellular function and ultimately organismal phenotype. However, the spatial code embedded in the molecular interactions that drive and sustain spatial organization, and in the organization that in turn drives intercellular interactions across a living animal, remains to be elucidated. Here we use the expression of ligand-receptor pairs obtained from a whole-body single-cell transcriptome of Caenorhabditis elegans larvae to compute the potential for intercellular interactions through a Bray-Curtis-like metric. Leveraging a 3D atlas of C. elegans cells, we implement a genetic algorithm to select the ligand-receptor pairs most informative of the spatial organization of cells. Validating the strategy, the selected ligand-receptor pairs are involved in known cell-migration and morphogenesis processes and we confirm a negative correlation between cell-cell distances and interactions. Thus, our computational framework helps identify cell-cell interactions and their relationship with intercellular distances, and decipher molecular bases encoding spatial information in a whole animal. Furthermore, it can also be used to elucidate associations with any other intercellular phenotype and applied to other multicellular organisms.
Introduction
Cell-cell interactions (CCIs) are fundamental to all facets of multicellular life. They shape cellular differentiation and the functions of tissues and organs, which ultimately influence organismal physiology and behavior. Cell-cell communication (CCC) is a subtype of CCIs and involves a cell sending a signal to another cell, usually triggering downstream signaling events that culminate in altered gene expression. Thus, CCIs allow cells to coordinate their gene expression1, form spatial patterns of interaction2, and perform collective behaviors3. Signals passed between cells are often ligands which are received by receptors in other cells. Ligands can mediate CCC across a range of distances and they can encode positional information for cells within tissues, which is critical for cellular decision-making4. For instance, some ligands form a gradient that serves as a cue for cells to migrate5,6. Thus, studying CCIs elucidates how cells coordinate different functions depending on both the molecules mediating CCC and their spatial context.
CCIs and CCC can be inferred from transcriptomic data7. Computational analysis of CCIs usually consists of examining the coexpression of secreted proteins by a sender cell (e.g. ligands) and their cognate surface proteins in a receiver cell (e.g. receptors). To reveal active communication pathways from the coexpression of the corresponding ligand-receptor (LR) pairs, communication scores can be assigned to these interactions based on the RNA expression levels of genes encoding the secreted and receiver proteins8–12. RNA-based analyses have informed CCIs and their mediators in small communities of cells, such as embryos13,14 and tissues11,15–23. Moreover, CCI analyses have enabled the study of all cell types in the whole body of a multicellular organism in post-embryonic stages24.
CCIs allow cells to know their location in their communities, enabling a coordination of functions4,25. Molecules mediating these interactions are used by cells to encode and pass this spatial code. Thus, the study of CCIs can help decode spatial organization and function. However, studying CCIs in a spatial context cannot be directly done from conventional single-cell RNA-sequencing technologies (scRNA-seq) since spatial information is lost during tissue dissociation26. Nevertheless, previous studies have proved that gene expression levels still encode spatial information that can be recovered by adding extra information such as protein-protein interactions and/or microscopy data13,26–28. For example, RNA-Magnet inferred cellular contacts in the bone marrow by considering the coexpression of adhesion molecules present on cell surfaces28, while ProximID used gene expression coupled with microscopy of cells to construct a spatial map of cell-cell contacts in bone marrow27. Thus, as previously done for understanding a few multicellular niches13,29–31, one can study CCIs in a spatial context by adding appropriate information to the RNA-based CCI analysis. Hence, integrating CCI analyses with spatial information represents a great opportunity to deepen our understanding of intercellular communication in a spatial context. Moreover, this task holds potential to identify the signals that cells use to encode spatial organization.
Caenorhabditis elegans is an excellent model for studying CCIs in a spatial context32. Its cellular organization shows complexity comparable to higher-order organisms. It also has well-defined tissues and organs in post-embryonic stages. C. elegans can also be easily cultured and all individuals possess the same number and location of cells. Previous studies about its CCIs have focused on subsets of cells or very early developmental stages. For example, cell contacts were mapped by microscopy during embryogenesis33,34, which helped identify cell pairs using a specific Notch signaling pathway34. Thus, C. elegans represents a well-controlled experimental system that is sophisticated enough for performing CCI analyses. Importantly, a microscopy-based 3D atlas of cells35 and a single-cell transcriptome36 exist for C. elegans larvae, a stage at which most C. elegans organs have already formed and most cells have migrated to their final adult position37. Therefore, C. elegans can readily be exploited to decode the signals that define the spatial organization of cells in a whole living animal. To this end, here we compute CCIs from scRNA-seq data and assess how intercellular distance is associated with the potential of cells to interact. We also inspect which signals govern the CCC that occurs in different locations and ranges of distance across the C. elegans body. For the latter task, we built a list of ligand-receptor interactions, which to date is the most comprehensive one for CCI analyses of C. elegans. Our computational framework detects molecules that link CCIs and intercellular distances. Importantly, this approach can be extended to any other organism that has the necessary input data to study the molecular bases underlying intercellular distances and many other traits driving and sustaining cellular organization in multicellular organisms.
Results
Computing cell-cell interactions
Intercellular communication allows cells to coordinate their gene expression1 and to form spatial patterns of molecule exchange2. These events also allow cells to sense their spatial proximity, which is essential for both the formation and the homeostasis of tissues and organs4. For example, one mechanism includes sensing the occupancy of receptors by signals from surrounding cells4,38; higher occupancy can indicate greater proximity of communicating cells39. Thus, to represent a cell-cell potential of interaction that may respond to or drive intercellular proximity, we propose a Bray-Curtis-like score (Figure 1). This cell-cell interaction (CCI) score is computed from the expression of ligands and receptors to represent the molecular complementarity of a pair of interacting cells. Specifically, it weighs the number of LR pairs that both cells use to communicate by the aggregate total of complementary ligands and receptors each cell in the pair produces (see Methods and Figure 1b). The main assumption of our CCI score is that the smaller the intercellular distance, the more complementary is the production of the ligands and receptors in a pair of cells. In other words, for any given pair of cells, cells are defined as closer when a greater fraction of the ligands produced by one cell interacts with cognate receptors on the other cell and vice versa, as this increases their potential of interaction.
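The verbal description of the score can be made concrete with a small sketch. The form below is one plausible reading of that description, written in the style of a Bray-Curtis similarity; it is an assumption for illustration and not necessarily the exact expression given in Methods or implemented in cell2cell. The function name cci_score and the per-pair binary vectors are hypothetical.

```python
def cci_score(lig_a, rec_a, lig_b, rec_b):
    """Bray-Curtis-like complementarity of cells A and B (illustrative form).

    Each argument has one entry per LR pair in the list: lig_x[i] is 1 if
    cell x expresses the ligand of pair i, rec_x[i] is 1 if cell x expresses
    the receptor of pair i, and 0 otherwise.
    """
    # LR pairs usable in each direction: ligand in the sender AND
    # cognate receptor in the receiver.
    used_a_to_b = sum(l * r for l, r in zip(lig_a, rec_b))
    used_b_to_a = sum(l * r for l, r in zip(lig_b, rec_a))
    # Aggregate total of ligands and receptors that each partner produces.
    total = sum(lig_a) + sum(rec_b) + sum(lig_b) + sum(rec_a)
    if total == 0:
        return 0.0  # neither cell produces any listed ligand or receptor
    return 2.0 * (used_a_to_b + used_b_to_a) / total


# Toy example with three LR pairs (values are made up).
score = cci_score(lig_a=[1, 1, 0], rec_a=[0, 0, 1],
                  lig_b=[0, 0, 1], rec_b=[1, 0, 0])
print(score)
```

With binary inputs this form ranges from 0 (no usable LR pair) to 1 (every ligand produced by one cell meets its cognate receptor on the other), and swapping the two cells leaves the value unchanged, which matches the undirected nature of the score.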
For computing CCIs in C. elegans, besides the gene expression levels of its cells, a list containing the interactions between its ligands and receptors is needed. Although much is known about this organism, knowledge of its LR interactions remains scattered across literature or contained in protein-protein interaction (PPI) networks that include other categories of proteins. Thus, we generated a list for C. elegans that consists of 245 ligand-receptor interactions (Supplementary Table 1), which was built from interactions described in literature and high-confidence published PPIs (see Methods). Next, we used this list to determine the presence or absence of ligands and receptors in each cell identified in the single-cell transcriptome of C. elegans36, and ultimately the active LR pairs in all pairs of cells. To determine presence and absence of proteins, we used expression thresholding9,24, the most common strategy for analyzing CCIs and CCC due to its binary nature and easy interpretation7, and used the derived ligand and receptor scores as input of our CCI score to represent the overall potential of its cell types to interact.
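As a minimal sketch of expression thresholding, the snippet below binarizes a small expression table and lists the LR pairs that are active from a sender to a receiver. The threshold value, the cell-type labels and the expression values are hypothetical; only the gene names are taken from LR pairs mentioned in this work.

```python
def binarize_expression(expression, threshold):
    """Map {cell_type: {gene: level}} to {cell_type: {gene: 0 or 1}}.

    A gene is called present when its level reaches the threshold
    (the use of >= rather than > is an assumption of this sketch).
    """
    return {
        cell: {gene: int(level >= threshold) for gene, level in genes.items()}
        for cell, genes in expression.items()
    }


def active_lr_pairs(binary, lr_pairs, sender, receiver):
    """LR pairs whose ligand is present in sender and receptor in receiver."""
    return [
        (ligand, receptor)
        for ligand, receptor in lr_pairs
        if binary[sender].get(ligand, 0) and binary[receiver].get(receptor, 0)
    ]


# Hypothetical expression levels for two hypothetical cell types.
expression = {
    "cell_type_1": {"lin-44": 12.0, "cwn-1": 0.5},
    "cell_type_2": {"lin-17": 30.0, "mig-1": 2.0},
}
lr_pairs = [("lin-44", "lin-17"), ("cwn-1", "mig-1")]

binary = binarize_expression(expression, threshold=10.0)
active = active_lr_pairs(binary, lr_pairs, "cell_type_1", "cell_type_2")
print(active)
```

The resulting 0/1 values are the kind of ligand and receptor scores that a CCI score can take as input.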
To facilitate the application of the CCI analyses, we developed cell2cell, an open source tool to infer intercellular interactions and communication with any gene expression matrix and list of LR pairs as inputs (https://github.com/earmingol/cell2cell). Thus, we used our Bray-Curtis-like score to generate the first predicted network of CCIs in C. elegans that measures the complementarity of interacting cells given their active LR pairs (Figure 2a). Although we chose this score for representing a spatial-dependent complementarity of interaction, cell2cell is flexible in terms of the scoring strategies applied to decipher CCIs and CCC, so depending on the purpose of study other CCI scores can be used (e.g. the number of active LR pairs to represent the strength of the interaction).
Cell-type specific properties are captured by computed cell-cell interactions
After determining the potential for interaction between every pair of cell types from the single-cell transcriptome of C. elegans, we grouped the different cell types based on their interactions with other cells through an agglomerative hierarchical clustering (Figure 2a). This analysis generated clusters that seem to represent known roles of the defined cell types in their tissues. For instance, we found germline cells to have the lowest CCI potential with other cell types. This is consistent with the physical constraint on germline cells, which are surrounded by basal membranes that limit their intercellular communication with other cell types40,41 and may uncouple the coordination of their gene expression. For example, the decision of germline cells to enter mitosis or meiosis depends almost exclusively on CCIs with distal tip cells, especially through mediators of Notch signaling such as glp-142. In contrast, neurons have the largest potential for interactions with other cell types, suggesting that these cell types use a higher fraction of all possible communication pathways. This occurs especially in interactions between neurons and muscle cells (Figure 2a), which is consistent with the high molecule interchange that occurs at the neuromuscular junctions43, suggesting that our method is exposing complementary signals actually transmitted between cells.
Similarity between pairs of interacting cells was also analyzed given the LR pairs they use. By using UMAP44,45 to visualize the similarity they have (Figure 2b), we observe that pairs of interacting cells tend to be grouped by the sender cells (i.e. those expressing the ligands), but not by the receiver cells (i.e. those expressing the receptors). This result is consistent with previous findings that ligands are produced in a cell type-specific manner by human cells, but receptors are promiscuously produced46, which suggests that this phenomenon may be conserved across multicellular organisms.
Key signaling pathways link distance between cells with their potential of interaction
Given that our CCI score is undirected, it can also be compared with spatial properties such as the distance between cells, which represents a unique state that does not change with the order in which the cells in a pair are considered. Under the hypothesis that larger distances should decrease the potential of cells to interact, we expected our CCI scores to be negatively correlated with the Euclidean distances between cells. Thus, we used the Euclidean distances between cells, calculated from a 3D atlas of C. elegans (Supplementary Figure 1a), as our reference data to assess our methodology and assumptions in calculating our CCI score (see Computing cell-cell interactions).
We annotated each cell in the 3D atlas with a corresponding cell type in the scRNA-seq dataset (Supplementary Table 2), and computed the minimal Euclidean distances between each pair of cell types (Supplementary Figure 1b). In this case, the minimal distance was considered because this case would represent the maximal potential that cells have to interact. Thus, we next calculated the Spearman correlation between the CCI score matrix and the Euclidean distance matrix. This originally resulted in a correlation coefficient of −0.21 (P-value = 0.0016). Although the correlation was negative as expected, it was a low value. This may be due to noise introduced by comprehensively incorporating all LR pairs into the computation, which may include several LR pairs not necessarily encoding spatial information.
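The two steps of this comparison, taking the minimal Euclidean distance between cell types and correlating it with the CCI scores, can be sketched as follows. This is an illustrative standard-library implementation, not the code used for the analysis; the cell-type labels, coordinates and CCI values are made up.

```python
import math


def min_distance(coords_a, coords_b):
    """Smallest Euclidean distance between any cell of type A and any of type B."""
    return min(math.dist(p, q) for p in coords_a for q in coords_b)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical atlas: 3D coordinates of the cells annotated to each cell type.
atlas = {"A": [(0, 0, 0), (1, 0, 0)], "B": [(4, 0, 0)], "C": [(1, 2, 0)]}
# Hypothetical CCI scores for each unordered pair of cell types.
cci = {("A", "B"): 0.4, ("A", "C"): 0.7, ("B", "C"): 0.2}

pairs = sorted(cci)
distances = [min_distance(atlas[a], atlas[b]) for a, b in pairs]
scores = [cci[p] for p in pairs]
rho = spearman(scores, distances)
print(distances, rho)
```

In this toy example the closest pair has the highest score and the farthest pair the lowest, so the coefficient is at its negative extreme; real data give intermediate values such as the one reported above.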
Given that the CCI scores computed by using all LR pairs in our list led to a low correlation with the distances between cells, we hypothesized that there is a subset of key LR pairs that follow a spatial pattern of co-expression and would allow cells to specifically sense their spatial relationship with other cells or drive the spacing. In this scenario, a CCI score that is a function of only these LR pairs would better represent the potential of two cells to functionally interact and it is expected to better correlate with cell proximity. Under this hypothesis, we looked for key LR pairs of C. elegans that better capture the potential of cells to interact given their physical locations. To do so, we ran a genetic algorithm (GA) to maximize the correlation between the CCI score matrix and the Euclidean distance matrix by randomly generating different size subsets of the LR pairs in our complete list (Supplementary Figure 2a-b). This algorithm was run 100 times, obtaining in each case a different optimal list of LR pairs due to the stochastic nature of this algorithm (Supplementary Figure 2c). Nevertheless, across all solutions, an average Spearman coefficient of −0.67 ± 0.01 was obtained (shown as an absolute value in Supplementary Figure 2b) and the maximal correlation resulted in a value of −0.70 (P-value = 1.435 × 10−35). Thus, the resulting optimized subsets of LR pairs (hereinafter referred to as initial GA-LR pairs) may constitute good predictors of biological functions driving or sustaining intercellular proximity, and they support the hypothesis that a subset of the LR pairs would drive the distance-dependent potential of interaction between two cells.
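To illustrate the kind of search involved, the following is a minimal genetic algorithm over subsets of LR pairs encoded as 0/1 masks. It is a generic sketch and not the implementation used here: the population size, number of generations, mutation rate, truncation selection, one-point crossover and elitism are all assumptions, and the toy fitness function stands in for the real objective, which would be the absolute Spearman coefficient between the distance matrix and the CCI scores recomputed from the masked LR pairs.

```python
import random


def genetic_algorithm(n_pairs, fitness, pop_size=30, generations=40,
                      mutation_rate=0.05, seed=0):
    """Search for a 0/1 mask over n_pairs LR pairs that maximizes fitness."""
    rng = random.Random(seed)
    # Random initial subsets of varying size.
    population = [[rng.randint(0, 1) for _ in range(n_pairs)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:pop_size // 2]       # truncation selection
        children = [ranked[0][:]]              # elitism: keep the current best
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_pairs)    # one-point crossover
            child = mother[:cut] + father[cut:]
            # Flip each position with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
        best = max(population + [best], key=fitness)
    return best


# Toy objective (hypothetical): reward "informative" pairs, penalize the rest.
informative = {0, 3, 4, 7}


def toy_fitness(mask):
    return sum(1 if i in informative else -1
               for i, selected in enumerate(mask) if selected)


best = genetic_algorithm(12, toy_fitness)
print(best, toy_fitness(best))
```

Running the function with different seeds gives different solutions of similar fitness, which mirrors why repeated runs yield different optimal lists of LR pairs.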
To identify the specific biological roles that the initial GA-LR pairs may have, we used our functional annotations about the signaling processes they are involved in (see column “LR Function” in Supplementary Table 1). Specifically, for each initial GA-LR pair list, we computed the relative abundance of each signaling pathway (i.e. the number of LR pairs involved in a given pathway with respect to the total number of LR pairs in the list) (Figure 3a). Considering the relative abundance of these pathways in our complete list containing the 245 LR pairs, we used the distribution of abundances from the GA runs and performed a one-sided Wilcoxon signed-rank test to evaluate whether the fractions of each function either increased or decreased with respect to the fraction in all LR pairs (Figure 3b). Remarkably, LR pairs involved in Canonical RTK-Ras-ERK signaling, cell migration, Hedgehog signaling and mechanosensory mechanisms increased their relative abundance in the resulting subsets from the GA runs. Thus, the GA seems to be prioritizing LR pairs associated with processes such as cell patterning, morphogenesis and tissue maintenance47.
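The relative abundance defined above is a simple fraction, sketched here for a toy list. The five LR pairs and their pathway labels follow associations mentioned in this work, but treating them as a complete list and the particular subset chosen are assumptions made only for illustration.

```python
from collections import Counter


def relative_abundance(lr_pairs, annotation):
    """Fraction of the LR pairs in a list that belongs to each pathway."""
    counts = Counter(annotation[pair] for pair in lr_pairs)
    return {pathway: n / len(lr_pairs) for pathway, n in counts.items()}


# Toy annotation standing in for the "LR Function" column.
annotation = {
    ("lin-44", "cfz-2"): "Wnt signaling",
    ("cwn-1", "lin-17"): "Wnt signaling",
    ("cwn-1", "mig-1"): "Wnt signaling",
    ("smp-2", "plx-1"): "Cell migration",
    ("smp-2", "plx-2"): "Cell migration",
}
complete_list = list(annotation)
# Hypothetical output of one GA run.
ga_subset = [("smp-2", "plx-1"), ("smp-2", "plx-2"), ("cwn-1", "mig-1")]

background = relative_abundance(complete_list, annotation)
selected = relative_abundance(ga_subset, annotation)
for pathway in background:
    print(pathway, background[pathway], selected.get(pathway, 0.0))
```

Collecting the selected fraction of a pathway over all GA runs gives the distribution that is compared with the single background fraction; a one-sided signed-rank test on the differences, for instance with scipy.stats.wilcoxon and its alternative argument, would be one way to carry out that comparison.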
The established or predicted roles of the enriched initial GA-LR pairs are congruent with a role in establishing and/or sustaining the proximity of cells. This notion is further supported by the congruence between our predictions and targeted studies demonstrating the essential role in spatial organization of the initial GA-LR pairs detected in most of the GA runs. For instance: 1) smp-2/plx-1 mediate epidermal morphogenesis, as demonstrated by the defects in epidermal functions exhibited by C. elegans lacking smp-248; 2) cwn-1/mig-1 mediate cell positioning, as demonstrated by the abnormal migration of hermaphrodite-specific motor neurons in the mutants49,50; 3) K05F1.5/dma-1 mediate dendrite guidance, as K05F1.5 was described as a novel gene (named lect-2) that, similar to mnr-1 and dma-1, is key for dendrite guidance of sensory neurons innervating the muscle-skin interface51,52; and 4) let-756/ver-1, a homolog of the mammalian FGF-VEGF pathway, is essential for C. elegans development, especially in the L2 stage where let-756 has its peak of expression53, and for the positioning of ray 1 in the tail54.
Consensus GA-selected LR pairs contribute to the formation of spatial patterns of communication along the C. elegans body
Next, we looked for a core set of LR pairs that were representative of the optimal solutions generated by the GA. We first clustered LR pairs by their co-occurrence across the GA runs and then selected the cluster with members that were simultaneously present in a high fraction of the optimal subsets (Supplementary Figure 2c-d). This resulted in a consensus list of 37 LR pairs (Supplementary Table 3), hereinafter referred to as GA-LR pairs, whose combined appearance seemed to encode proximity across cell-cell interactions.
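A simplified sketch of how a consensus can be derived from repeated GA runs is shown below. It replaces the clustering by co-occurrence described above with a plain frequency cutoff, so it is a stand-in rather than the procedure actually used; the cutoff value, the run contents and the labels pairX and pairY are hypothetical.

```python
from collections import Counter
from itertools import combinations


def selection_frequency(runs):
    """Fraction of GA runs in which each LR pair was selected."""
    counts = Counter(pair for run in runs for pair in set(run))
    return {pair: n / len(runs) for pair, n in counts.items()}


def co_occurrence(runs):
    """Number of runs in which two LR pairs were selected together."""
    counts = Counter()
    for run in runs:
        for a, b in combinations(sorted(set(run)), 2):
            counts[(a, b)] += 1
    return counts


def consensus(runs, min_fraction=0.8):
    """LR pairs present in at least min_fraction of the runs."""
    freq = selection_frequency(runs)
    return sorted(pair for pair, f in freq.items() if f >= min_fraction)


# Hypothetical outputs of four GA runs.
runs = [
    ["smp-2/plx-1", "cwn-1/mig-1", "pairX"],
    ["smp-2/plx-1", "cwn-1/mig-1"],
    ["smp-2/plx-1", "pairY"],
    ["smp-2/plx-1", "cwn-1/mig-1", "pairY"],
]
core = consensus(runs, min_fraction=0.75)
together = co_occurrence(runs)[("cwn-1/mig-1", "smp-2/plx-1")]
print(core, together)
```

The co-occurrence counts form the matrix on which a clustering step could operate; the frequency cutoff alone ignores whether the retained pairs tend to appear in the same runs.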
This consensus list yields a Spearman coefficient of −0.63 (P-value = 2.629 × 10−27) between the CCI score matrix and the Euclidean distance matrix. To test whether the correlation stems from the specific LR pairs in the consensus list, we performed a series of permutation analyses (Supplementary Figure 3). We evaluated whether the correlation computed with the consensus list was stronger than the values in a null distribution built from random interactions between the ligands and receptors in the GA-LR interactions, either by randomly permuting the ligands and the receptors (Supplementary Figure 3a) or by shuffling their labels to keep the topology of the interactions (Supplementary Figure 3b). We also subsampled the complete list of LR pairs (Supplementary Table 1) to obtain random subsets of similar size to the list of GA-LR pairs (Supplementary Figure 3c). In each scenario, the randomized lists yielded a weaker negative Spearman correlation than the consensus list (P-value < 0.0001, see Supplementary Figure 3).
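The logic of these randomization tests can be sketched with two helpers: one that builds a null distribution by subsampling the complete list, and one that shuffles ligand and receptor labels while preserving the topology of the interactions. This is an illustrative sketch; the number of iterations, the add-one correction in the empirical P-value and the toy statistic are assumptions, and in the actual analysis the statistic would be the Spearman coefficient obtained from the randomized list.

```python
import random


def empirical_p_value(observed, null_values):
    """Share of null correlations at least as negative as the observed one.

    The +1 terms keep the estimate away from zero (an assumption here).
    """
    hits = sum(1 for value in null_values if value <= observed)
    return (hits + 1) / (len(null_values) + 1)


def subsample_null(all_pairs, subset_size, statistic, n_iter=1000, seed=0):
    """Statistic evaluated on random subsets of the complete LR list."""
    rng = random.Random(seed)
    return [statistic(rng.sample(all_pairs, subset_size))
            for _ in range(n_iter)]


def shuffle_labels(lr_pairs, rng):
    """Permute ligand names and receptor names, keeping the wiring intact."""
    ligands = sorted({lig for lig, _ in lr_pairs})
    receptors = sorted({rec for _, rec in lr_pairs})
    new_lig = dict(zip(ligands, rng.sample(ligands, len(ligands))))
    new_rec = dict(zip(receptors, rng.sample(receptors, len(receptors))))
    return [(new_lig[lig], new_rec[rec]) for lig, rec in lr_pairs]


# Toy statistic (hypothetical): mean of made-up per-pair contributions.
contribution = {f"pair{i}": -0.05 * (i % 5) for i in range(20)}


def toy_statistic(pairs):
    return sum(contribution[p] for p in pairs) / len(pairs)


null = subsample_null(sorted(contribution), 5, toy_statistic, n_iter=200)
p_value = empirical_p_value(-0.63, null)
shuffled = shuffle_labels([("a", "x"), ("a", "y"), ("b", "x")],
                          random.Random(1))
print(p_value, shuffled)
```

Because the label shuffle is a one-to-one renaming, each randomized ligand keeps the number of partners of the ligand it replaces, which is what preserving the topology means in this sketch.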
Interestingly, when compared to the complete list of LR pairs, the GA-LR pairs led to more heterogeneity in the cell-cell interaction potential (Figures 2a and 4a). The heterogeneity seems to stem from the proximity that cells have since cells are grouped by functional interactions within their tissues (Figure 4a). For example, the cells composing the pharynx (pharyngeal gland, epithelia, muscle and neurons) group together, which may reflect a pharynx-specific pattern of interactions between these cell types that may not translate to the cell types composing other organs or locating in other regions of the worm. Similarly, we observed seemingly functional associations between neurons and muscles. In particular, the interactions between muscle cells and GABAergic and cholinergic neurons presented a high CCI score, which may represent a protein-based priming for the known exchange of GABA and acetylcholine in neuromuscular junctions. We also found that the most complementary interaction of amphid/phasmid sheath cells was with seam cells. This is consistent with the role of seam cells during larval development, where they form adherens junctions with sheath cells and hypodermal cells and function as sockets for the phasmid sensilla55. Another interesting observation was the high cell-cell interaction score between oxygen sensing neurons and intestinal cells, which is consistent with the extensive communication between these cells to link oxygen availability with nutrient status56–58. Thus, the protein-protein interactions prioritized by the GA seem to capture cellular properties that define physical proximity, especially defining functional roles of tissues and organs.
To explore the spatial distribution of consensus GA-LR pairs along the body of C. elegans (Figure 5), we performed an enrichment analysis of CCC along its body. We first divided the C. elegans body into three sections, encompassing different cell types (Figure 5a). Then, we computed all pairwise CCIs within each section and counted the number of times that each LR pair was used. With this number, we performed a Fisher’s exact test on each section for a given LR interaction. We observed enrichment or depletion of specific LR pairs in different parts of the body (Figure 5b). Interestingly, we observed LR pairs enriched only in one section and depleted in the others and vice versa (Table 1), following a pattern mostly congruent with existing experimental data. For instance, col-99 shows prominent expression in the head, especially during the L1-L2 larval stages of development59, while LIN-44 is secreted by hypodermal cells exclusively in the tail during larval development60,61, both cases coinciding with the results in Table 1. By contrast, daf-7 is known to be expressed only by sensory neurons in the head62; however, our results suggest an enrichment in the tail (Table 1). This is likely due to the transcriptome including two types of sensory neurons that indeed express daf-7 (Figure 4b), suggesting that clustering of single cells cannot distinguish all subtypes of sensory neurons. Thus, mapping cell types in the 3D atlas resulted in all sensory neurons having a similar transcriptome across the body (Figure 5a), explaining this spurious result. Nevertheless, although the daf-7 example points to limitations of the current scRNA-seq methods and their analysis tools, the col-99 and lin-44 examples demonstrate that when cellular identification is sufficiently detailed, our strategy captures true biological spatial behaviors of gene expression and therefore of CCC.
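For reference, a two-sided Fisher’s exact test and the corresponding odds ratio can be computed from a 2×2 table with the standard library alone, as sketched below. The layout of the table (uses of one LR pair versus all other LR pairs, inside versus outside a body section) and the counts are assumptions chosen for illustration, since the exact contingency tables are not spelled out in the text.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    total = sum(prob(x) for x in range(lowest, highest + 1)
                if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, total)


def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c); infinite when b*c is zero."""
    return float("inf") if b * c == 0 else (a * d) / (b * c)


# Hypothetical counts: rows are "this LR pair" and "all other LR pairs",
# columns are "used in this section" and "used in the other sections".
table = (12, 3, 40, 60)
print(odds_ratio(*table), fisher_exact_two_sided(*table))
```

An odds ratio above 1 with a small P-value points to enrichment of the LR pair in the section, and one below 1 to depletion.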
To gain more understanding on the importance that the GA-LR pairs may have in defining spatially-constrained CCIs, we searched for LR pairs enriched or depleted across all cell pair interactions in any of the different distance-ranges of communication (Figure 6). We found five LR pairs that were either enriched or depleted in at least one of the three distance ranges given the corresponding pairs of cell types (FDR < 1%). Three of these LR pairs are associated with Wnt signaling (lin-44/cfz-2, cwn-1/lin-17 and cwn-1/mig-1) and the other two with cell migration (smp-2/plx-1 and smp-2/plx-2). As previously reported, semaphorins (encoded by smp-1, smp-2 and mab-20) and their receptors (plexins, encoded by plx-1 and plx-2) can control cell-cell contact formation63; while their mutants show cell positioning defects, especially along the anterior/posterior axis of C. elegans48,64. They are also key for axon guidance and cell migration65 and necessary for epidermal66 and vulval morphogenesis67. Members of the Wnt signaling are biomolecules known to act as a source of positional information for cells4. In C. elegans, cwn-1 and lin-44, for example, follow a gradient along its body, enabling cell migration49,68–70. Thus, the GA-LR pairs may influence local or longer-range interactions and help encode intercellular proximity.
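Because one test is performed per LR pair and distance range, the resulting P-values need a multiple-testing correction before applying an FDR cutoff. The sketch below uses the Benjamini-Hochberg procedure; the text does not state which FDR procedure was applied, so this choice is an assumption, and the P-values are made up.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P-value so that the adjusted
    # values stay monotone in the ranking.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical P-values from four enrichment tests.
p_values = [0.01, 0.04, 0.03, 0.005]
q_values = benjamini_hochberg(p_values)
significant = [q < 0.01 for q in q_values]
print(q_values, significant)
```

An LR pair would be reported as enriched or depleted in a distance range when its adjusted value falls below the chosen FDR cutoff, 1% in the analysis above.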
GA-selected mediators are enriched in morphology and cell migration phenotypes
As many LR interactions have not been spatially or functionally characterized in C. elegans, we anticipate that several pairs in our GA-selected list may either be “yet to be discovered functional interactions” or false positives. To minimize false positives, we incorporated phenotypic data. The rationale here being that phenotypic associations between the genes likely indicate coordinated function, which in this case may imply CCIs. Thus, we next examined the phenotypic associations between genes composing the GA-LR pair list. Given that, as defined before, these LR interactions are important for cell patterning, morphogenesis and tissue maintenance, we focused on phenotypes with annotations such as “morphology phenotype” and “cell migration”. We tested for enrichment of GA selected genes with a Fisher’s exact test, using the lists of genes whose mutants have been demonstrated to affect that phenotype. Using our complete list of LR pairs to define the background list of genes and those in the consensus list as the sample list, we observed that morphology and cell migration phenotypes have odds ratios of 4.83 (P-value = 0.027) and 3.07 (P-value = 0.0019), respectively, indicating an enrichment and therefore, supporting their role as drivers of spatial organization.
Among the genes associated with either morphology or cell migration phenotypes in our complete list of LR pairs, 51% were selected by our genetic algorithm (Figure 7). Remarkably, many of the GA-selected genes associated with morphology phenotypes are involved in the Wnt signaling pathway (lin-17, lin-18, lin-44 and mom-2) and the rest with cell adhesion (epi-1) and insulin signaling (daf-2). On the other hand, GA-selected genes associated with cell migration include Wnt signaling (cam-1, cfz-2, cwn-1, lin-17, lin-44, mig-1 and mom-2), cell migration pathways (mab-20, unc-5, unc-6, unc-129 and rig-6), cell adhesion (ddr-1, epi-1, ina-1, let-2, nid-1 and pat-3), Notch signaling (lag-2 and lin-12) and TGF-β signaling (dbl-1). Notably, previous studies have shown that these genes and their interactions are key in spatial allocation of cells. Examples include: 1) pat-3, which is involved in post-embryonic organogenesis and tissue function71; 2) the interaction between the discoidin domain receptor ddr-1 and the collagen col-99, which plays a role in axonal guidance and asymmetry establishment of the ventral nerve cord72; 3) lin-12, mab-20 and unc-6 and their respective receptors, which are involved in intestine morphogenesis73; and 4) the interaction between nid-1 and ptp-3, which participates in cell migration and axon guidance74. Cell adhesion molecules such as collagen and proteins from the immunoglobulin superfamily also play a role in intercellular contact and communication, encompassing genes such as let-2, cle-1, gpn-1, rig-6, wrk-1, unc-5 and ina-175–79. Interestingly, when considering the expression of genes associated with the phenotypes reported here, cell types seem to be clustered by spatial proximity of their lineage groups (Figure 7), suggesting that these genes may be markers of spatial properties.
Thus, the congruence between spatial proximity and biological function strongly supports the notion that the strategy presented here provides insights into the spatial code behind CCIs and CCC.
Discussion
Here we developed a computational strategy for inferring complementarity between cells given their ligand and receptor expression in scRNA-seq data. With this approach, we identified spatial properties in C. elegans associated with the potential of cells to interact and communicate. In particular, we found a negative Spearman coefficient between intercellular distance and CCIs computed with our Bray-Curtis-like score, a correlation that was stronger when inferring CCIs from ligand-receptor pairs selected with a genetic algorithm. Thus, these LR pairs proved to be informative of spatial properties and may direct how cells transmit this kind of information to other cells.
In this study, we also collected ligand-receptor interactions of C. elegans that were available in literature and PPI databases, building an essential resource for CCI analyses of C. elegans (Supplementary Table 1). Using this list, our CCI analysis led to results consistent with previous findings. For example, we found that interacting cells were grouped given cell type-specific production of ligands (Figure 2b), which was previously shown in a work that analyzed a communication network of human haematopoietic cells46. Our results are also consistent with experimental studies of C. elegans. For instance, the GA-driven selection of LR pairs significantly prioritized interactions participating in cell migration, Hedgehog signaling, mechanosensory mechanisms and canonical RTK-Ras-ERK signaling. Remarkably, these pathways and few other mediators, also involved in cell migration (e.g. members of Notch and TGF-β signaling), are crucial for the larval development of C. elegans80–83, coinciding with the cognate stages of the datasets we used.
Our analysis successfully recapitulated known biology regarding spatial organization of the cells in C. elegans and associated ligand-receptor interactions. Many of the ligands and receptors included in the GA-LR pairs (Supplementary Table 3 and Figure 4b) have been reported to contribute to cellular positioning, migration and/or organ morphogenesis, which may explain why the genetic algorithm selected them as encoders of spatial information. Netrin UNC-6 is an example ligand that serves as a cue for enabling DTC migration given its spatial pattern of expression84. Moreover, TGFβ-related signaling pathways regulate body size and therefore location of cells, sometimes involving proteins selected by the GA, such as DAF-7, DBL-1, SMA-6, SMA-10, LON-2, SRP-7, F14B4.1 and UNC-12985–87. Similarly, mediators associated with Wnt signaling are required for axon guidance88 and mutants of some of their encoding genes, such as cam-1 (ROR homolog that sequesters Wnts), cwn-1, lin-44, cfz-2, lin-17, lin-18 and mig-1, have previously been linked to defects of cell migration, positioning and patterning50,89–91. Thus, the spatial distributions and functions determined in previous studies provide support to our predictions, which suggests that analyzing CCIs and CCC through RNA-based approaches can decipher important spatial and functional properties.
Considering the inputs of our analyses, false positives may arise from the preprocessing of datasets. For instance, selecting a threshold value to consider ligands and receptors as expressed can affect the number of false positives and negatives9. In this regard, different values could be explored to infer the presence of biologically active protein, as previously addressed92,93. Moreover, other approaches such as using expression products to compute the usage of a ligand-receptor pair may also help10, but further adaptations would be needed to use them with our Bray-Curtis-like score. Our CCI score is also dependent on the input list of LR interactions, so including other LR pairs that were not considered in our complete list (Supplementary Table 1) could improve the predictions. More comprehensive lists might generate a better correlation or contain more important LR pairs for the GA selection than the ones selected here (Supplementary Table 3). Furthermore, lists of ligand-receptor pairs considering the formation of multimeric complexes can improve the reliability of the results11,94, so considering structural information of proteins may also improve the predictions. However, in contrast to mammals such as mice and humans, considerably less information is available in C. elegans for building comprehensive lists of ligand-receptor interactions containing multimeric complexes.
Although our strategy captured underlying mechanisms that are consistent with experimental evidence in the literature, our approach has limitations related, for instance, to the nature of the dataset. Conventional scRNA-seq technologies do not preserve spatial information, so labelling cells in a 3D atlas by using the cell types in the transcriptome might be a confounder. For example, C. elegans possesses subtypes of non-seam hypodermal cells whose gene expression varies with their antero-posterior location. However, the scRNA-seq dataset employed here contains only one type of non-seam hypodermal cell to represent all subtypes, so in our analyses they all share the same gene expression profile. An illustrative case where this represents an issue is lin-44, which is expressed by hypodermal cells located in the tail68,70 but not by those in other sections, in contrast to what our results showed for lin-44/lin-17 (Figure 5b). In addition, physical constraints, such as physical barriers between cells, are not considered in our analyses, which could lead to false positives. Examples of this class of false positives are the Notch pathway interactions between germline cells and cells other than the distal tip cell or sheath cells, since a basal lamina physically blocks interactions with other cell types41. Therefore, relying only on gene expression enables us to infer the LR interactions that a pair of cells can theoretically use but may not actually use, which may also explain the strong but imperfect correlation obtained between CCI scores and intercellular distances. Thus, emerging spatial transcriptomics methods are expected to be an important advance in the study of CCIs95, since they can distinguish specific cell subtypes given their locations and physical constraints.
The strategy presented in this work provides a framework to associate CCIs with phenotypes and detect ligand-receptor interactions that are crucial for those phenotypes. In contrast to previously proposed overall CCI scores7, ours is undirected. A benefit of this is that it can be integrated with any information that also represents an undirected relationship between two cells, such as the intercellular distance. When considering spatial information, for example, this holds the potential to recover spatial properties that are lost in traditional transcriptomics methods, either bulk or single-cell. It could also be used for other correlative or predictive purposes, such as identifying important LR pairs associated with a phenotype of interest. Importantly, substantial support for our predictions is found in the literature, which suggests that our strategy captures underlying mechanisms and functions of mediators associated with the spatial allocation of cells. Thus, although future experiments evaluating the spatial allocation of signals will help refine our predictions, the strategy presented here lays the foundation to unveil the code of cell-cell interactions and communication that defines the spatial distribution of cells across a whole animal body.
Methods
Single-cell RNA-seq data
A previously published single-cell RNA-seq dataset containing 27 cell types of C. elegans in the larval L2 stage was used as the transcriptome36. The cell types in this dataset include different kinds of neurons, sexual cells, muscles and organs such as the pharynx and intestine. We used the previously published preprocessed gene expression matrix for cell types36, in which values are expressed as transcripts per million (TPM).
Intercellular distances of cell types
A 3D digital atlas of cells in C. elegans in the larval L1 stage, encompassing the location of 357 nuclei, was used for spatial analyses of the respective cell types35. Each of the nuclei in this atlas was assigned a label according to the cell types present in the transcriptomics dataset, which resulted in a total of 322 nuclei with a label and therefore a transcriptome. To compute the distance between a pair of cell types, the Euclidean distance was computed between every pair of nuclei formed by taking one nucleus from each cell type. Then, the minimal distance among all pairs was used as the distance between the two cell types (Supplementary Figure 1a). In this step, it is important to consider that this map is for the L1 stage, while the transcriptome is for the L2 stage. However, we do not expect major differences in the reference location of cells between the two stages.
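The minimal-distance rule described above can be illustrated with a minimal sketch. The function name `min_celltype_distance` and the coordinates are hypothetical and chosen only for illustration; they are not taken from the atlas or from the analysis code.

```python
import numpy as np

def min_celltype_distance(coords_a, coords_b):
    """Smallest Euclidean distance between any nucleus of cell type A
    and any nucleus of cell type B (hypothetical helper)."""
    a = np.asarray(coords_a, dtype=float)  # shape (n_a, 3)
    b = np.asarray(coords_b, dtype=float)  # shape (n_b, 3)
    # Broadcast to all (n_a, n_b) nucleus pairs, one nucleus from each type.
    diffs = a[:, None, :] - b[None, :, :]
    pairwise = np.sqrt((diffs ** 2).sum(axis=-1))
    return float(pairwise.min())

# Made-up 3D coordinates for two cell types with two nuclei each.
nuclei = {
    "type_A": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "type_B": [(4.0, 4.0, 0.0), (1.0, 3.0, 4.0)],
}
distance_ab = min_celltype_distance(nuclei["type_A"], nuclei["type_B"])
```

Because the minimum is taken over unordered nucleus pairs, the resulting distance is symmetric in the two cell types.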
Generating a list of ligand-receptor interaction pairs
To build the list of ligand-receptor pairs of C. elegans, a previously published database of 2,422 human pairs9 was used as a reference to search for the respective orthologs in C. elegans. The search for orthologs was done using OrthoDB96, OrthoList97 and gProfiler98. Then, a network of protein-protein interactions for C. elegans was obtained from RSPGM99 and from high-confidence interactions in STRING-db (confidence score > 700 and supported by at least one line of experimental evidence)100. Ligand-receptor pairs were selected if one protein of an interaction was in the list of ortholog ligands and the other was in the list of ortholog receptors. Additionally, ligands and receptors mentioned in the literature were also considered (Supplementary Table 4). Finally, manual curation and functional annotation according to previous studies were performed, leading to our final list of 245 annotated ligand-receptor interactions, encompassing 127 ligands and 66 receptors (Supplementary Table 1).
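The selection rule applied to the protein-protein interaction network can be sketched as a simple filter. This is only an illustration: `select_lr_pairs` is a hypothetical helper, and the toy edges (including the placeholder names `geneX` and `geneY`) are not taken from RSPGM or STRING-db.

```python
def select_lr_pairs(ppi_edges, ligands, receptors):
    """Keep interactions in which one partner is an ortholog ligand and the
    other an ortholog receptor; returns sorted (ligand, receptor) tuples."""
    pairs = set()
    for p1, p2 in ppi_edges:
        # The network is undirected, so both orientations are checked.
        if p1 in ligands and p2 in receptors:
            pairs.add((p1, p2))
        if p2 in ligands and p1 in receptors:
            pairs.add((p2, p1))
    return sorted(pairs)

# Toy inputs for illustration only.
toy_edges = [("lin-17", "lin-44"), ("geneX", "geneY"), ("lin-44", "geneX")]
toy_ligands = {"lin-44"}
toy_receptors = {"lin-17"}
toy_pairs = select_lr_pairs(toy_edges, toy_ligands, toy_receptors)
```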
Communication and CCI scores
To detect active communication pathways and to compute CCI scores between cell pairs, it was first necessary to detect the presence or absence of each ligand and receptor. To do so, we used a threshold of 10 TPM, as previously described9: ligands and receptors with expression above this value were considered expressed (a binary value of 1 was assigned). Then, a communication score of 1 was assigned to each ligand-receptor pair with both partners expressed; otherwise a communication score of 0 was assigned. To compute the CCI scores, a vector for each cell in a pair of cells was generated as indicated in Figure 1. Using the respective vectors for both interacting cells, a Bray-Curtis-like score was calculated to represent the potential of interaction. This potential aims to measure how complementary the signals produced by the interacting cells are. To do so, our Bray-Curtis-like score considers the number of active LR pairs that a pair of cells has while also incorporating the potential that each cell has to communicate independently (Figure 1). In other words, this score normalizes the number of active LR pairs used by a pair of cells by the total number of ligands and receptors that each cell expresses independently. Unlike other CCI scores that represent a directed relationship between cells by considering, for instance, only the ligands produced by one cell and the receptors of another, our CCI score is undirected. To make our score undirected, it includes all ligands and receptors in cell A, and all cognate receptors and ligands, respectively, in cell B (Figure 1). Thus, pairs of cells interacting through all their ligands and receptors are represented by a value of 1 while those using none of them are assigned a value of 0.
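The following minimal sketch illustrates the binarization and an undirected Bray-Curtis-like score. The exact formula is defined in Figure 1, which is not reproduced here, so the form used below, twice the number of active partner matches divided by the total number of expressed partners in both cells, is an assumption chosen to be consistent with the description above. The function names and TPM values are hypothetical.

```python
import numpy as np

TPM_THRESHOLD = 10.0  # expression above this value counts as present

def binarize(tpm):
    """1 if expression is above the threshold, 0 otherwise."""
    return (np.asarray(tpm, dtype=float) > TPM_THRESHOLD).astype(int)

def cci_score(lig_a, rec_a, lig_b, rec_b):
    """Assumed Bray-Curtis-like score for cells A and B.

    lig_x[i] and rec_x[i] are the binary expression values in cell x of the
    ligand and the receptor of LR pair i.
    """
    vec_a = np.concatenate([lig_a, rec_a])  # ligands, then receptors, of A
    vec_b = np.concatenate([rec_b, lig_b])  # cognate receptors, then ligands, of B
    active = int(np.sum(vec_a * vec_b))     # matches with both partners expressed
    total = int(vec_a.sum() + vec_b.sum())  # what each cell expresses on its own
    return 2.0 * active / total if total > 0 else 0.0

# Made-up TPM values for three LR pairs.
lig_a, rec_a = binarize([50, 0, 12]), binarize([0, 30, 5])
lig_b, rec_b = binarize([0, 20, 0]), binarize([15, 0, 2])
score_ab = cci_score(lig_a, rec_a, lig_b, rec_b)
score_ba = cci_score(lig_b, rec_b, lig_a, rec_a)
```

In this toy case two partner matches are active, cell A expresses three partners and cell B two, so the score is 2·2/5 = 0.8 regardless of which cell is treated as A.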
Genetic algorithm for selecting ligand-receptor pairs that maximize correlation between physical distances and CCI scores
An optimal correlation between intercellular distances and CCI scores was sought through a genetic algorithm (GA). The objective function of this algorithm was the absolute value of the Spearman correlation between intercellular distances and the CCI scores computed from a given list of ligand-receptor pairs. In this case, only non-autocrine interactions were used (elements of the diagonal of the matrix with CCI scores were set to 0). The absolute value was considered because the correlation could be either positive or negative. A positive correlation would indicate that the ligand-receptor pairs used as inputs are preferably used by cells that are not close, while a negative value would indicate the opposite. The GA generated random subsets of the curated list of ligand-receptor pairs and used them as inputs to evaluate the objective function (as indicated in Supplementary Figure 2a). The maximization process was run 100 times, generating 100 different lists that resulted in an optimal correlation. Consensus ligand-receptor pairs were then selected according to their co-occurrence across the 100 runs of the GA and their presence in most of the runs (Supplementary Figures 2c-d).
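As an illustration of the objective function only (not of the full GA), the sketch below computes the absolute Spearman correlation between a CCI matrix and a distance matrix using `scipy.stats.spearmanr`. The function name `objective` and the toy matrices are hypothetical. As a simplification, this sketch compares only the off-diagonal upper triangle, which leaves autocrine entries out altogether instead of setting them to 0.

```python
import numpy as np
from scipy.stats import spearmanr

def objective(cci_matrix, distance_matrix):
    """Absolute Spearman correlation between CCI scores and distances,
    using each unordered pair of distinct cell types once."""
    cci = np.asarray(cci_matrix, dtype=float)
    dist = np.asarray(distance_matrix, dtype=float)
    upper = np.triu_indices_from(cci, k=1)  # excludes the diagonal
    rho, _ = spearmanr(cci[upper], dist[upper])
    return abs(rho)

# Toy symmetric matrices for four cell types: scores fall as distance grows.
toy_dist = np.array([
    [0, 1, 2, 3],
    [1, 0, 4, 5],
    [2, 4, 0, 6],
    [3, 5, 6, 0],
], dtype=float)
toy_cci = 1.0 / (1.0 + toy_dist)
fitness = objective(toy_cci, toy_dist)
```

A GA would call such a function once per candidate subset of ligand-receptor pairs, after recomputing the CCI matrix from that subset.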
Defining short-, mid- and long-range communication
The different ranges of distance used for CCC were defined by fitting a Gaussian mixture model to the distribution of distances between all pairs of cell types. This model was implemented using the scikit-learn library for Python101 with the number of components set to 3.
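A minimal sketch of this step with scikit-learn's `GaussianMixture` is shown below. The helper `classify_ranges`, the naming of components by increasing mean, and the synthetic distances are assumptions made for illustration; the source specifies only the library and the number of components.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def classify_ranges(distances, seed=0):
    """Fit a 3-component Gaussian mixture and label each distance as
    short-, mid- or long-range according to the component means."""
    x = np.asarray(distances, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=3, random_state=seed).fit(x)
    order = np.argsort(gmm.means_.ravel())  # component indices, by mean
    names = np.empty(3, dtype=object)
    names[order] = ["short", "mid", "long"]
    return names[gmm.predict(x)]

# Synthetic distances drawn around three well-separated values.
rng = np.random.default_rng(0)
toy_distances = np.concatenate([
    rng.normal(10, 1, 50),
    rng.normal(50, 1, 50),
    rng.normal(100, 1, 50),
])
labels = classify_ranges(toy_distances)
```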
Statistical analyses
For each function annotated in the list of ligand-receptor pairs (Supplementary Table 1), a one-sample Wilcoxon signed-rank test was used to evaluate whether its relative abundance increased or decreased with respect to the distribution generated with the GA runs. In this case, two one-tailed tests were performed for each function, one to evaluate an increase of the relative abundance and the other to assess a decrease. Finally, the smaller of the two P-values was retained and the respective change was assigned if the adjusted P-value passed the threshold.
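One way to set up the two one-tailed tests is sketched below with `scipy.stats.wilcoxon`. It assumes that the relative abundance of a function in each GA run is compared against a single reference abundance (for example, its abundance in the complete list); this reading, the helper name `abundance_change` and the numbers are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import wilcoxon

def abundance_change(run_abundances, reference_abundance):
    """Run both one-tailed signed-rank tests and keep the smaller P-value."""
    diffs = np.asarray(run_abundances, dtype=float) - reference_abundance
    p_up = wilcoxon(diffs, alternative="greater").pvalue
    p_down = wilcoxon(diffs, alternative="less").pvalue
    if p_up < p_down:
        return "increase", p_up
    return "decrease", p_down

# Made-up relative abundances of one function across ten GA runs.
toy_runs = [0.25, 0.31, 0.28, 0.35, 0.27, 0.33, 0.29, 0.36, 0.26, 0.32]
direction, p_value = abundance_change(toy_runs, reference_abundance=0.20)
```

The retained P-value would then be adjusted for multiple testing across functions before the change is assigned.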
A permutation analysis was done on the list of consensus ligand-receptor pairs obtained from the GA. To do so, three scenarios were considered: (1) a column-wise permutation (one column is for the ligands and the other for the receptors); (2) a label permutation (run independently on the ligands and the receptors); and (3) a random subsampling from the original list, generating multiple subsets with similar size to the consensus list. In each of these scenarios, the list of ligand-receptor interactions was permuted 10,000 times.
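The three scenarios could be sketched as follows. The source does not give implementation details, so each helper reflects one possible reading: the column-wise permutation shuffles the receptor column against the ligand column, the label permutation reassigns names within ligands and within receptors, and the subsampling draws from the complete curated list. All names and toy pairs are hypothetical.

```python
import numpy as np

def column_permutation(pairs, rng):
    """Shuffle the receptor column relative to the ligand column."""
    ligands = [lig for lig, _ in pairs]
    receptors = [rec for _, rec in pairs]
    return list(zip(ligands, rng.permutation(receptors).tolist()))

def label_permutation(pairs, rng):
    """Reassign names independently within ligands and within receptors."""
    ligands = sorted({lig for lig, _ in pairs})
    receptors = sorted({rec for _, rec in pairs})
    lig_map = dict(zip(ligands, rng.permutation(ligands).tolist()))
    rec_map = dict(zip(receptors, rng.permutation(receptors).tolist()))
    return [(lig_map[lig], rec_map[rec]) for lig, rec in pairs]

def random_subsample(full_list, size, rng):
    """Draw `size` pairs without replacement from the full list."""
    idx = rng.choice(len(full_list), size=size, replace=False)
    return [full_list[i] for i in idx]

rng = np.random.default_rng(0)
full = [("L1", "R1"), ("L2", "R2"), ("L3", "R1"), ("L4", "R3"), ("L5", "R2")]
consensus = full[:3]
by_column = column_permutation(consensus, rng)
by_label = label_permutation(consensus, rng)
subset = random_subsample(full, len(consensus), rng)
```

Repeating each function 10,000 times and re-evaluating the correlation for every permuted list would yield the null distributions for the three scenarios.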
All enrichment analyses in this work were performed with Fisher's exact test. In all cases, one P-value was obtained to assess enrichment and another to assess depletion. The analysis of enriched ligand-receptor pairs along the body of C. elegans (head, mid-body and tail) was performed by considering all pairs of cells in each section and counting the number of those cell pairs in which the corresponding ligand-receptor pair was used. The total number of pairs corresponded to the sum of cell pairs in all sections of the body. Similarly, the enrichment analysis performed for the different ranges of distance (short-, mid- and long-range) was done by considering all cell pairs in each range, with the total number of pairs being the sum of the pairs in each range. To evaluate the enrichment of phenotypes (obtained from the phenotype ontology association available in WormBase102), all genes in the GA-selected list were used as background. Then, the genes associated with the respective phenotype tested were used to assess the enrichment.
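A minimal sketch of the paired enrichment and depletion tests with `scipy.stats.fisher_exact` follows. The layout of the 2×2 contingency table, the helper name and the counts are assumptions made for illustration.

```python
from scipy.stats import fisher_exact

def enrichment_and_depletion(k_in_group, n_group, k_total, n_total):
    """One-tailed Fisher's exact tests for enrichment and depletion.

    k_in_group: cell pairs in the group (e.g. a body section) using the LR pair
    n_group:    all cell pairs in the group
    k_total:    cell pairs using the LR pair across all groups
    n_total:    all cell pairs across all groups
    """
    k_outside = k_total - k_in_group
    table = [
        [k_in_group, n_group - k_in_group],           # inside the group
        [k_outside, (n_total - n_group) - k_outside], # outside the group
    ]
    _, p_enriched = fisher_exact(table, alternative="greater")
    _, p_depleted = fisher_exact(table, alternative="less")
    return p_enriched, p_depleted

# Made-up counts: 9 of 10 cell pairs in the group use the LR pair,
# against 10 of 40 cell pairs overall.
p_enriched, p_depleted = enrichment_and_depletion(9, 10, 10, 40)
```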
When necessary, P-values were adjusted using the Benjamini-Hochberg procedure. In those cases, the significance threshold was set to FDR < 1% (adjusted P-value < 0.01).
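For reference, the Benjamini-Hochberg adjustment can be written in a few lines of NumPy. This is a generic sketch of the standard procedure with made-up P-values, not the code used for the analyses.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted P-value by n / rank.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest P-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

toy_pvalues = [0.001, 0.008, 0.039, 0.041]
adjusted = benjamini_hochberg(toy_pvalues)
significant = adjusted < 0.01  # FDR < 1%
```

With these four toy values, the adjusted P-values are 0.004, 0.016, 0.041 and 0.041, so only the first test passes the 1% threshold.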
Data availability
Single-cell RNA-seq dataset: Supplementary Table S4 in https://doi.org/10.1126/science.aam8940 and GitHub repository for the analyses of this work.
3D digital atlas of C. elegans: Supplementary Data 1 in https://doi.org/10.1038/nmeth.1366 and GitHub repository for the analyses of this work.
Lists of ligand-receptor pairs of C. elegans: The manually curated list containing 245 interactions is provided in Supplementary Table 1, while the consensus list from the GA selection, which contains 37 interactions, is available in Supplementary Table 3. Both lists are also available in the GitHub repository for the analyses of this work.
Code availability
All analyses performed in this work, their respective code (implemented in Python and Jupyter Notebooks) and instructions for using them are available in a public repository (https://github.com/LewisLabUCSD/Celegans-cell2cell). Similarly, cell2cell is available as an open-source tool in a GitHub repository (https://github.com/earmingol/cell2cell).
Author contributions
EA, CJ, EJO, and NEL conceived the work. EA and HMB annotated the cells in the previously published 3D digital atlas of C. elegans and analyzed their physical locations. EA, CJ, EJO and NEL analyzed the single-cell RNA-seq data. EA and IS generated the list of ligand-receptor pairs of C. elegans. EA developed cell2cell, performed the CCI analyses and created the corresponding GitHub repositories for both the tool and the analyses. JC helped with the visualization of ligand-receptor interactions. HH implemented enrichment analyses in cell2cell. EA, AG and EJO compiled C. elegans data and compared results of this work with previous findings in literature. EA wrote the paper and all authors carefully reviewed and edited the paper.
Competing interests
The authors declare no competing interests.
Acknowledgements
EA is supported by the Chilean Agencia Nacional de Investigación y Desarrollo (ANID) through its scholarship program DOCTORADO BECAS CHILE/2018 - 72190270 and by the Fulbright Commission. This work was further supported by NIGMS grant R35 GM119850 to NEL, a Lilly Innovation Fellows Award to CJ, Jefferson Foundation Award to AG, J Yang Foundation Fellowship to HH, PEW Charitable Trust Award to EJO, and generous funding from the W. M. Keck Foundation. We thank Ariel Pani for helpful comments.