Uncovering hypergraphs of cell-cell interaction from single cell RNA-sequencing data

Complex biological systems can be described as a multitude of cell-cell interactions (CCIs). Recent single-cell RNA-sequencing technologies have enabled the detection of CCIs and related ligand-receptor (L-R) gene expression simultaneously. However, previous data analysis methods have focused on only one-to-one CCIs between two cell types. To also detect many-to-many CCIs, we propose scTensor, a novel method for extracting representative triadic relationships (hypergraphs), which include (i) ligand-expression, (ii) receptor-expression, and (iii) L-R pairs. When applied to simulated and empirical datasets, scTensor was able to detect some hypergraphs including paracrine/autocrine CCI patterns, which cannot be detected by previous methods.

their functions. Nevertheless, more sophisticated methodologies are still required, 1 because CCI is the difference between the whole system and sum of their parts.  To assess the coexpression of known L-R genes, circle plots [24,33,36,44,45,50], 9 bigraph/Sankey diagrams [29,30,34,49,52,53], network diagrams [26-28, 31, 37, 10 40, 41, 44,46,54] and heatmaps [30,32,34,39,46,47,49] are often drawn. Some 11 studies have also introduced more systematical approaches for quantifying the de-12 gree of CCIs based on the L-R coexpression, such as the number of coexpressed 13 L-R pairs [23,24,26,27], Spearman correlation coefficients between L-R expres-14 sion profiles [26,31], original interaction scores between L-R coexpression [39], or 15 hypothetical test based on random cell-type label permutation [29,32,37,55]. 16 All the approaches described above implicitly suppose that CCIs are the one-to-17 one relationships between two cell types and that the corresponding L-R coexpres- 18 sion is observed as the cell-type-specific manner. In real empirical data, however, 19 the situation can be more complex; CCIs often exhibit many-to-many relationships 20 involving many cell types, and an particular L-R pair can also function across mul-21 tiple cell-type pairs. Therefore, in this work, we propose scTensor, which is a novel 22 method based on a tensor decomposition algorithm. Our method regards CCIs as 23 hypergraphs and extracts some representative triadic relationship from the data 24 tensor, which includes (i) ligand-expressing cell types, (ii) receptor-expressing cell 25 types, and (iii) L-R pairs. 26 product) of the two vectors is then calculated, and a matrix is generated. The matrix 1 can be considered as the similarity matrix of all possible cell-type combinations 2 using the L-R pair. Finally, for each L-R pair, the matrix is calculated, and a tensor 3 is generated as the combined matrices. In this work, this is called the CCI-tensor. 4 After the construction of a CCI-tensor, we perform non-negative Tucker decompo-5 sition (NTD [57, 58]), which is known as a tensor decomposition algorithm. We orig-6 inally implemented this algorithm and confirmed its convergence within a realistic 7 number of iterations (for more details, see Additional File 3). NTD assumes that the 8 CCI-tensor can be approximated by the summation of some representative CaH s. 9 NTD has the three rank parameters (R1, R2, and R3) and a CaH is calculated as 10 the outer product of the column vectors of three factor matrices A (1) ∈ R J×R1 , 11 A (2) ∈ R J×R2 , and A (3) ∈ R K×R3 calculated by NTD (Figure 1c). Each CaH -12 strength is calculated by the core tensor G(r1, r2, r3) ∈ R R1×R2×R3 of NTD. In this 13 work, each CaH is termed CaH(r1, r2, r3) = A data-driven way without the assumption of one-to-one CCIs. Therefore, it can also 19 detect many-to-many CCIs according to the data complexity. 20 Results and discussion 21 Evaluation of multiple CCI prediction 22 Accuracy of the detection of CCIs and the related L-R pairs 23 Here, we demonstrate the efficacy of scTensor by using the two simulation datasets 24 (Figure2a). Three different cell types are indicated as "A", "B", and "C". In the case 25 I dataset, all CCIs represent the one-to-one relationships between two cell types. 26 CCIs corresponding to A → B, B → C, and C → A are colored by red, blue, and 27 green, respectively. In contrast, the CCIs in the case II dataset represent many-to-1 many relationships involving many cell types such as A → B/C (red), C → A/B/C 2 (blue), and A/B/C → B/C (green). To evaluate whether such ground truth combi-  Again, note that the CaH s detected by scTensor are not just CCIs, but sets of 10 CCIs and their related L-R pairs. To the best of our knowledge, the label permu-11 tation method implemented in CellPhoneDB [37] is the only previous method that 12 can detect CCIs and their related L-R pairs simultaneously. To demonstrate the 13 efficacy of scTensor in terms of the detection of many-to-many CCIs, we also orig-14 inally implemented this method and compared it with scTensor (for more details 15 on the algorithm, see the Materials and Methods). Note that each combination of 16 a CCI and its related L-R pair is separately extracted by NTD of scTensor, but 17 under the label permutation method, the combinations are not separated and are 18 just sorted in ascending order of their P -values. Combinations with low P -values 19 indicate significant triadic relationships. 20 In the case I dataset, ground truth L-R sets are highly enriched according to the 21 measures of both methods, and the AUC values show that there is no difference 22 in their performance (Figure2b and c, left). On the other hand, in the case II 23 dataset, the label permutation method cannot correctly detect blue or green L-R 24 sets, and the AUC value becomes lower (Figure2b, right). In the scTensor analysis, 25 red, blue, and green L-R sets are separately extracted as three CaH s (Figure2c, 26 right), and the AUC values are still high. This is because, the label permutation 27 is implicitly hypothesized the CCI as a one-to-one relationship. Therefore, in the 1 case II dataset, many-to-many CCIs such as the CCIs corresponding to green L-R 2 sets, are hard to detect by the method. This is because for each L-R pair, mean 3 values for any combination of cell types are basically high in such situations, and a 4 P -value corresponding to a one-to-one CCI tends to be large (i.e., not significant); 5 accordingly, the observed L-R coexpression and the null distribution calculated 6 are hard to distinguish. In the analysis of real datasets presented later, however, 7 the L-R gene expression pairs are not always the cell-type specific, and it is more 8 natural that the CCI corresponding to the L-R has a many-to-many relationship. 9 This simulation shows that scTensor is a more general method for detecting CCIs 10 and their related L-R pairs at once, irrespective of whether a particular CCI is 11 one-to-one or many-to-many.

12
Biological interpretation of real datasets 13 To demonstrate the efficacy of scTensor in the analysis of empirical datasets, we 14 applied scTensor to four real scRNA-seq datasets (Table 1). First, we used the 15 scRNA-seq data derived from fetal germ cells (FGCs (♀)) and their gonadal niche 16 cells (Soma) from female human embryos (Germline Female [25]). As a known CaH 17 pattern, CCIs of Soma with FGCs (♀) involving the BMP signaling pathway are 18 reported, and scTensor accurately detects the CCIs and CCI-related L-R pairs 19 as CaH (4,2,5) ( Figure 3). Moreover, scTensor was able to extract some putative 20 CaH s such as CaH (3,4,14) and CaH (1,4,22). CaH (3,4,14) is the autocrine-type CCI 21 within FGCs (♀) and L-R pairs such as Wnt (WNT5A/WNT6) and some growth 22 factor genes (NGF/IGF/FGFR/VEGF), suggesting that this CaH is related to the 23 proliferation and differentiation of Soma. CaH (1,4,22) is the CCI corresponding to 24 FGCs (♀) and Soma, and most of the receptors are related to G protein-coupled 25 receptor (GPCR). The conjugated ligands are some neuropeptides, and this suggests 26 that the peptides are related to the activation of GPCR in FGCs (♀) by some 1 mechanism. 2 We also used the scRNA-seq data derived from FGCs (♂) and their gonadal 3 niche cells from male human embryos (Germline Male [25]). In contrast to the 4 Germline Female data, the original study reported that FGCs (♂) interact with 5 Soma through AMH-BMPR interactions, and scTensor also detected the triadic 6 relationship as CaH (1,1,10) ( Figure 4). The conjugate receptor AMHR2 is also de-7 tected in this CaH . In this dataset, like in Germline Female, in this data, autocrine-8 type CCIs such as CaH (3,3,9) and CCIs corresponding to FGCs (♂) and Soma 9 (CaH (1,3,16)) were detected. 10 Next, we used the scRNA-seq data derived from immune cells isolated from the 11 metastatic melanoma patients (Melanoma [59]). The meta-analysis of scRNA-seq 12 and TCGA datasets in the original published study claimed that T cell abundance 13 and the expression of complement system-related genes in cancer-associated fibrob-14 lasts (CAFs) are correlated, and CCIs between T cells and CAFs were therefore 15 inferred. scTensor detected these CCIs as CaH (1,1,3)/CaH (1,1,9) and comple-16 ment system-related genes such as C3, CXCL12, CFB, and C4A were also detected 17 ( Figure 5). scTensor was also able to capture a well-known CCI between T cells 18 and B cells as CaH (2,2,6), although this was not a focus of the original study; this 19 CCI is the antigen-presenting from B cells to T cells by class II major histocom-  ing CCIs were detected as CaH (2,2,19) by scTensor ( Figure 6). scTensor also 1 detected CaH (1,2,17) and CaH (3,2,21), which are the autocrine-type CCI among 2 macrophages and the CCI of NK cells with macrophages by known chemokine lig-  To demonstrate the applicability of using scTensor in the species that is not 6 mouse or human, we used a scRNA-seq data derived from zebrafish habenular  result suggests that scTensor may also be useful in spatial transcriptomics [61,62]. 19 scTensor and L-R database implementations as R/Bioconductor packages 20 All the algorithms and L-R lists are available as R/Bioconductor packages and a 21 web application described below. nomics Roadmap [76], ENCODE [77], and Human Protein Atlas [78]. Additionally, 16 in consideration of users who might want to experimentally investigate detected 17 CCIs, we embedded the hyperlinks to Connectivity Map (CMap [79]), which pro- 18 vides the relationships between perturbation by the addition of particular chemical 19 compounds/genetic reagents and succeeding gene expression change. 20 LRBase.XXX.eg.db-type packages 21 For data sustainability and the extension to the wide range of organisms, in 22 this work, we originally constructed L-R databases as R/Bioconductor packages 23 named LRBase.XXX.eg.db (where XXX represents the abbreviation for an organ-24 ism, such as "Hsa" for Homo sapiens). LRBase.XXX.eg.db currently provides the 25 L-R databases for 12 organisms (Table 2 and 3). The data process pipeline is almost 26 the same as that of the FANTOM5 project for constructing the putative L-R lists.

27
Precise differences between LRBase.XXX.eg.db and FANTOM5 are summarized in 1   Table S4 in Additional File 1.  zon S3), reports can be used as simple web applications, enabling the user to 20 share their results with collaborators or to develop an exhaustive CCI database. 21 We have already performed scTensor analyses using a wide-variety of scRNA- 22 seq datasets, including the five empirical datasets examined in this study (https: In this work, CCIs were regarded as CaH s, which are hypergraphs that represent 2 triadic relationships, and a novel algorithm scTensor for detecting such CaH was 3 developed. In evaluations with empirical datasets from previous CCI studies, pre-4 viously reported CCIs were also detected by scTensor. Moreover, some CCIs were 5 detected only by scTensor, suggesting that the previous studies may have over-6 looked some CCIs. To extend the use of scTensor to a wide range of organisms, 7 we also developed multiple L-R databases as LRBase.XXX.eg.db-type packages. types with semi-supervised manner. Such information will improve the accuracy 20 and extend the scope of the data. 21 We aim to tackle such problems and develop the framework further through  (v-10.5) were downloaded from https://stringdb-static.org/cgi/download.pl. 19 To unify the gene identifier as NCBI Gene ID, we retrieved the correspond- The simulated single-cell gene expression data were sampled from the negative bi-25 nomial distribution N B (F C gc m g , φ g ), where F C gc is the fold-change (FC) for gene 26 g and cell type c, and m g and φ g are the average gene expression and the dispersion 27 parameter of gene g, respectively. For the setting of differentially expressed genes 1 (DEGs) and non-DEGs, F C gc values were calculated based on the non-linear rela-2 tionship of FC and the gene expression level log 10 F C gc = a exp(−b log 10 (m g + 1)) 3 as follows:

11
For the setting of the case I datasets, 150 × 150 × 500 CCI-tensor was constructed.

12
For each cell type, 50 cells were established, and in total, three cell types were set. as non-DEGs. 20 For the setting of the case II datasets, 150 × 150 × 500 CCI-tensor were con- Construction of CCI-tensor 22 Here we assume that a data matrix Y ∈ R I×H is the gene expression matrix of scRNA-seq, where I is the number of genes and H is the number of cells. Next, the matrix Y is converted to cell-type mean matrix X ∈ R I×J , where J is the number of mean vectors for each cell type. The cell-type label is supposed to be specified by user's prior analysis such as clustering or confirmation of marker gene expression.
The relationship between the X and Y is described as below: where the matrix A ∈ R H×J converts cellular-level matrix Y to cell-type-level matrix X and each element of A is where n j is the number of cells belonging to j's cell type.  Finally, a J × J matrix is calculated as the outer product of x L and x R and 5 incrementally stored as a sub-tensor (frontal slice) of the CCI-tensor χ ∈ R J×J×K 6 as below: where K is the number of L-R pairs found in the row names of matrix X. Here, we suppose that the CCI-tensor has some representative triadic relationship.

10
To extract the triadic relationships from a CCI-tensor, here we consider perform-11 ing some tensor decomposition algorithms. There are two typical decomposition 12 methods; CANDECOMP/PARAFAC (CP) and Tucker decomposition [57, 58].

13
In CP decomposition, CCI-tensor χ is decomposed as follows: where × n is mode-n product, R is the rank of χ, and Λ is diagonal cubical tensor, 1 in which the element λ r on the superdiagonal can be non-zero. A (1) ∈ R J×R , 2 A (2) ∈ R J×R , and A (3) ∈ R K×R are factor matrices. a r • b r • c r is rank-1 tensor, 3 and the scalar λ r is the size of rank-1 tensor. The rank-1 tensor indicates the To deal with this problem, the application of Tucker decomposition can be considered next. In Tucker decomposition, a CCI-tensor is decomposed as follows: : where R1, R2, and R3 are the rank of mode-1,2, and 3, respectively. Unlike the 1 CP model, the constraint conditions of the Tucker model are relaxed, that is, three 2 factor matrices A (1) ∈ R J×R1 , A (2) ∈ R J×R2 , and A (3) ∈ R K×R3 can differ in their 3 numbers of columns and any combination of A :r1 , A :r2 , and A :r3 can be considered.

4
This is because G ∈ R R1×R2×R3 is a dense core tensor and any element, including and which cell type is not.

15
For the above reason, here we utilize NTD. Unlike Tucker decomposition based on singular value decomposition (SVD), NTD is based on non-negative matrix factorization (NMF), which is another matrix decomposition method. NMF is formalized as follows: The typical algorithm for optimizing the NMF problem is multiplicative updating where * is the element-wise (Hadamard) product.
These update rules are derived from the element-wise gradient descent with the 1 spatial form of the learning rate [58]. Starting with a random non-negative initial 2 value, update of W and H are updated iteratively until convergence. In this work, 3 MU with the KL-form, which shows a stable convergence with simulation data, is 4 used for following initialization step of NTD (for more details, see Additional File To extend the KL form of MU to NTD, we consider iterative updating is as follows: Additionally, the updating rule for core tensor G is: where is a small value included to avoid generating negative values. In the 1 nnTensor, 1e-10 is used.

11
To enhance the interpretation, each column vector is binarized in advance by types. This process generates 1,000 of synthetic L-R coexpression matrices and these 6 are used to generate the null distribution, that is, in a combination of cell types, the 7 proportion of the means which are "as or more extreme" than the observed mean 8 is the calculated as P -value. 9 Availability and requirements 10   Some cell images used in Figure 1 were previously presented by c 2016 DBCLS TogoTV. We thank Dr. Yoshihiro 7 Taguchi for discussions about the algorithms. We thank Mr. Akihiro Matsushima for their assistance with the IT 8 infrastructure for the data analysis. We are also grateful to all members of the Laboratory for Bioinformatics 9 Research, RIKEN Center for Biosystems Dynamics Research for their helpful advice.  Figure 1 CCI as a hypergraph (a) Previous studies have regarded CCIs as graphs, and the corresponding data structure is expressed as an adjacency matrix (left). In this work, CCIs are regarded as hypergraphs, and the corresponding data structure is a tensor. (b) The CCI-tensor is generated by users' scRNA-Seq matrices, cell-type labels, and L-R databases. NTD is used to extract CaHs from the CCI-tensor. (c) Each CaH(r1,r2,r3) is equal to the outer product of three vectors. A :r1 represents the ligand expression pattern, A :r2 represents the receptor expression pattern, and A :r3 represents the related L-R pairs pattern.