Abstract
Recent progress in single-cell genomics has generated multiple tools for cell clustering, annotation, and trajectory inference; yet, inferring their associated regulatory mechanisms is unresolved. Here we present scMomentum, a model-based data-driven formulation to predict gene regulatory networks and energy landscapes from single-cell transcriptomic data without requiring temporal or perturbation experiments. scMomentum provides significant advantages over existing methods with respect to computational efficiency, scalability, network structure, and biological application.
Availability: scMomentum is available as a Python package at https://github.com/larisa-msoto/scMomentum.git
The utilization of single-cell RNA sequencing (scRNA-seq) technologies in several research contexts, such as the Human Cell Atlas (HCA), has shown that different cell types have characteristic transcriptomic signatures [1,2]. The characterization of cell types from scRNA-seq data has been a constant challenge, although it is known that reconstructing gene-regulatory networks (GRNs) is a critical step towards understanding the establishment of cellular identity. The idea of modeling transcriptomes with systems biology approaches has generated numerous methods using bulk measurements [3]. However, remaining hurdles include extensive data requirements and limited interpretability of the resulting networks. Single-cell genomics offers the opportunity to resolve such challenges, making it possible to model developmental progressions with classic paradigms [4], such as the Waddington landscape [5].
Although scRNA-seq profiling produces many samples (cells), thus increasing statistical power, there are inherent limitations such as missing transcripts that complicate the distinction of signal from noise [6]. Several existing methods rely on identifying an accurate time progression of cells (pseudotime) [7,8,9]. Such an ordering limits their applicability to time-series experiments and introduces potential sources of error when inferring pseudotime on non-temporal datasets. Another common strategy is to use perturbation experiments to sample cause-effect events [10,11], but these methods are not scalable to interrogate different cell types en masse.
Here we explore the assumption that regulatory signals are specific and similar among cells belonging to the same cell type, as they would be confined to move around within their quasi-stable attractor. This enables us to use a linear approximation while still accounting for a non-zero velocity in the quasi-stable state. This formulation ensures computational efficiency and scalability. In such a linear model, every gene expressed in a given cell type could affect all the other genes and vice versa. The model treats the RNA velocity [12] of each gene as the outcome of the expression of all the genes regulating it, weighted by a directed interaction matrix (the inferred GRN), and of its corresponding degradation rate (Figure 1a). With both the weights and directions in hand, we can estimate an energy function [13] such that individual cellular energies capture their developmental potential. When projecting such energies on a two-dimensional grid, one can get a robust estimation of the developmental energy landscape recovered by those cells (Figure 1a).
To evaluate this idea, we simulated expression profiles for 500 genes in 20,000 cells using Dyngen [14]. We constructed two independent branching trajectories and used scVelo to estimate RNA velocity and degradation rate (Figure 1b). The inferred GRNs were used to derive the associated energy (Figure 1c). The cells from the branch starting at cluster 1 go towards a lower energy state, and the endpoint clusters are all at a lower energy state than that of their starting point. This agrees with Waddington’s classical proposition that cells will go “downhill” on the landscape as they differentiate. On the other hand, the branch starting at cluster 0 follows an upward trend, indicating the need for external inputs to progress along that branch instead of being in a poised state. To further explore the directionality of cells in the landscape, we projected the velocities on top of the energy landscape (Figure 1d). We found a flow from a high to a low energy state. Moreover, the branching is captured both by the velocities and by the energies (Figure 1d).
To assess the performance of scMomentum, we benchmarked it against GENIE3 [15] and GRNBOOST2 [16]. Since there is no widely accepted gold standard, we designed a metric that quantifies the extent of biological signal preservation. We assume that similar cell types have relatively close expression profiles (expression-derived distance). Consequently, if the derived networks are accurate enough, they would also be similar (network-derived distance). Thus, we test the correlation between cell type distance matrices using a Mantel test to account for the matrices’ spatial arrangement (symmetry and row-column relationships). scMomentum showed the highest correlation (Figure 1e), although GENIE3 and GRNBOOST2 also had significant correlations in this controlled setting. Thus, we found this to be a reasonable metric to assess the performance of different methods.
We evaluated scMomentum in an in-house-generated human hematopoiesis study and five public data sets (see Methods), totaling more than 200K cells (see supplement for details on the preprocessing steps). We found that the inferred networks preserved the distances between cell types in all the datasets that captured multiple developmental stages at once (dynamic behavior) (Figure 2a). As a negative control, we showed that this was not true for a non-dynamical dataset (mBA18, see Methods), highlighting the need for cellular dynamics when computing RNA velocity and using it to infer GRNs. We benchmarked our approach on the human hematopoiesis dataset and found that scMomentum was better at preserving cell type distances than existing methods (Figure 2b). Although the difference between the coefficients was relatively small, scMomentum holds a significant advantage in network structure and computational efficiency. We could predict GRNs for ~100K cells expressing 1,000 genes in ~1 minute on a machine with a 3.1 GHz Dual-Core Intel Core i5 processor, while GENIE3 and GRNBOOST2 were unable to finish within a day.
To assess our networks’ ability to capture cellular dynamics, we looked at cellular differentiation and response to targeted perturbations. The network distance matrix recovered trajectories on a multidimensional scaling (MDS) projection that resemble cell progressions along hematopoiesis (Figure 2c), suggesting that the networks capture cell-type-specific properties underlying their developmental progressions. Moreover, this result was robust to cellular noise and information loss (see Supplementary material). Then, we took a gene-centered approach and removed the 30% of genes with the largest eigenvalue-based centralities (see Methods) and found that the correlation between distance matrices was lower than that of a random perturbation (Figure 2d). This result shows that our networks detected a set of essential network regulators, a common feature of biological networks. This was not the case for other methods.
To investigate the developmental potential captured by our networks, we reconstructed an analog of the Waddington landscape for the lymphoid lineage (Figure 2e and f). Interestingly, HSC and CLP landscapes have the highest energy and lack steep regions, highlighting their multipotency and tendency to continue differentiating. The Pro-B cell landscape has a deep basin that contains the vast majority of cells (see Supplementary material), suggesting that they are “trapped” at this stage and might only progress along a confined path. The next stage is Immature B cells, which have the lowest energy and are positioned right underneath the basin of Pro-B cells, showing a possible direction of differentiation. Thus, our networks allow the reconstruction of energy landscapes capturing the corresponding cells’ biology [17].
A significant advantage of our approach is the extraction of weights and directions within the networks. To further assess its biological relevance, we computed an activator/repressor score by adding up all the positive/negative outgoing edges of every gene in the cells along both hematopoietic lineages (Figure 3). Notably, we analyzed Prothymosin Alpha (PTMA), a protein-coding gene involved in immune function modulation [18]. The expression and velocity of PTMA follow similar trends in both lineages. However, it has a strong activator score in CLPs and a strong repressor score in GMPs. This observation raises the intriguing possibility that some of the expression changes previously associated with alterations of PTMA [19] might be the effect of altered cellular development, rather than expression changes alone. The inferred weights are also useful to discover genes with dynamic properties within the network, providing insights on possible cell reprogramming undetectable by gene expression or velocity alone.
Here we showed that incorporating RNA velocity into the inference of cell-type-specific GRNs allows us to model the regulatory mechanisms underlying dynamic developmental processes. The resulting networks harbor cell-type-specific regulatory properties that made it possible to reconstruct an interpretable Waddington landscape analog. Conceptually, these networks capture different states corresponding to different cell types, each with momentum to move in a landscape. This interpretation, constructed from the information within GRNs, opens up the possibility to study how regulatory processes shape cellular development in multiple contexts [20]. Moreover, scMomentum has a significant computational advantage over previous approaches, which would facilitate its application to the vast compendium of existing scRNA-seq datasets and open the prospect of building a Cell Regulatory Atlas. To our knowledge, this is the first cell-type-specific GRN inference method that scales to large datasets and recovers directed, signed, and weighted GRNs in a data-driven manner.
Competing interests
Authors declare no competing interests.
Materials and Methods
Network inference
We derive networks in a cell-type-specific manner. Therefore, the method can readily be combined with a user-defined choice of appropriate clustering and annotation pipelines to process the data. For specific pre-processing details, see Supplementary Methods. For each cell type, we define the change in the expression of a gene over time as the contribution that all the other genes expressed in the cluster have on its mRNA expression, together with the degradation of its mRNA, as follows:
Where X is the cell-by-gene expression matrix, and V is the cell-by-gene velocity matrix. The diagonal matrix γ contains gene-specific degradation constants (both V and γ are retrieved from scVelo). W is the gene-by-gene weighted and directed adjacency matrix (the inferred GRN). Then we solve for W, obtaining the final model:
Since we solve Eqn. 1 as an overdetermined system using least squares, the network’s size is bounded by the number of cells in each cluster.
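The least-squares step can be illustrated with a minimal NumPy sketch. It is not the package's implementation: it assumes the linear model takes the form V = XW − Xγ (an orientation chosen here so that the stated matrix shapes agree), and the function name infer_grn is hypothetical.

```python
import numpy as np

def infer_grn(X, V, gamma):
    """Least-squares estimate of W under the ASSUMED model V = X @ W - X @ diag(gamma).

    X     : (cells, genes) expression matrix
    V     : (cells, genes) velocity matrix
    gamma : (genes,) degradation constants
    """
    X = np.asarray(X, dtype=float)
    V = np.asarray(V, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    n_cells, n_genes = X.shape
    # The system is overdetermined only if there are at least as many cells as genes.
    if n_cells < n_genes:
        raise ValueError("need at least as many cells as genes in the cluster")
    # X @ diag(gamma) scales each gene column by its degradation constant.
    target = V + X * gamma
    # Solve X @ W = target in the least-squares sense.
    W, *_ = np.linalg.lstsq(X, target, rcond=None)
    return W

# Synthetic check: recover a known W from noise-free data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
W_true = rng.normal(size=(5, 5))
gamma_demo = rng.uniform(0.1, 1.0, size=5)
V_demo = X_demo @ W_true - X_demo * gamma_demo
W_hat = infer_grn(X_demo, V_demo, gamma_demo)
```

The explicit shape check mirrors the statement that the network size is bounded by the number of cells in the cluster.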
Calculation of distances between clusters and between networks
For every cluster in the data set, we calculate the Euclidean distance to all the remaining clusters. In the expression-derived mode, for every pair of clusters, the distance between them is defined as:
Where ci and cj are the mean expression vectors of clusters Ci and Cj, respectively. This is used as a reference distance matrix.
In the network-derived mode, we define the distance between every pair of clusters as:
Where Wi and Wj denote the adjacency matrices derived from clusters Ci and Cj, respectively. To test the accuracy of the inferred networks, we calculate
Where mcorr refers to the Mantel correlation of distance matrices, which accounts for the inherent symmetry and row-column relationship of D. Although Wi and Wj have the same dimensions, they might not contain the same genes. To accommodate this when computing the correlations, we use the set of genes G in each network to find the universal set of genes Gu = Gi ∪ Gj. The uniform distribution is used to sample the entries of Wi corresponding to the genes of Gu that are missing from Gi, and those of Wj corresponding to the genes of Gu that are missing from Gj.
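The comparison of expression-derived and network-derived distances can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the network distance is taken to be the Frobenius norm of the difference between adjacency matrices defined over the same genes, the Mantel test is a plain permutation test on the Pearson correlation, and all function names are hypothetical.

```python
import numpy as np

def expression_distances(cluster_means):
    """Pairwise Euclidean distances between mean expression vectors (clusters x genes)."""
    c = np.asarray(cluster_means, dtype=float)
    diff = c[:, None, :] - c[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def network_distances(networks):
    """Pairwise Frobenius distances between same-shaped adjacency matrices (ASSUMED metric)."""
    W = np.stack([np.asarray(w, dtype=float) for w in networks])
    diff = W[:, None] - W[None, :]
    return np.sqrt((diff ** 2).sum(axis=(-2, -1)))

def mantel(D1, D2, n_perm=999, seed=0):
    """Permutation-based Mantel test: returns (correlation, one-sided p-value)."""
    D1 = np.asarray(D1, dtype=float)
    D2 = np.asarray(D2, dtype=float)
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(D1, k=1)          # upper triangle, no diagonal
    r_obs = np.corrcoef(D1[iu], D2[iu])[0, 1]
    n = D1.shape[0]
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(n)
        # Permute rows and columns together to respect the matrix structure.
        r = np.corrcoef(D1[p][:, p][iu], D2[iu])[0, 1]
        hits += int(r >= r_obs)
    return r_obs, (hits + 1) / (n_perm + 1)

# Toy data: six clusters, four genes, one random network per cluster.
rng = np.random.default_rng(1)
means_demo = rng.normal(size=(6, 4))
nets_demo = [rng.normal(size=(4, 4)) for _ in range(6)]
De_demo = expression_distances(means_demo)
Dn_demo = network_distances(nets_demo)
r_self, p_self = mantel(De_demo, De_demo, n_perm=199)
r_demo, p_demo = mantel(De_demo, Dn_demo, n_perm=199)
```

Permuting rows and columns jointly is what distinguishes the Mantel test from a naive correlation of matrix entries, since the entries of a distance matrix are not independent.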
Selection of genes
A critical step in any single-cell analysis is choosing the appropriate set of genes. We tested six different approaches to rank and select varying numbers of genes within each cluster, and used rm (Eqn. 4) to compare them. The ranking schemes were based on absolute gene velocity, signed gene velocity, velocity variance, expression and expression variance. We tested their combination with different network sizes, ranging from 50 to 500 genes in steps of 50. In each data set, we selected the combination of ranking scheme and network size with the highest rm value.
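A possible shape for the ranking step is sketched below. How each criterion is aggregated across the cells of a cluster is not stated here, so the per-gene mean or variance used in this sketch is an assumption, as are the scheme labels and the function name rank_genes.

```python
import numpy as np

def rank_genes(X, V, scheme="abs_velocity", n_genes=100):
    """Return indices of the top n_genes genes of one cluster under a ranking scheme.

    X, V : (cells, genes) expression and velocity matrices of the cluster.
    The per-gene aggregation over cells (mean or variance) is an ASSUMPTION.
    """
    X = np.asarray(X, dtype=float)
    V = np.asarray(V, dtype=float)
    scores = {
        "abs_velocity": np.abs(V).mean(axis=0),
        "signed_velocity": V.mean(axis=0),
        "velocity_variance": V.var(axis=0),
        "expression": X.mean(axis=0),
        "expression_variance": X.var(axis=0),
    }[scheme]
    order = np.argsort(scores)[::-1]   # highest score first
    return order[:n_genes]

# Tiny example: two cells, three genes.
X_demo = np.array([[1.0, 5.0, 2.0], [1.0, 7.0, 2.0]])
V_demo = np.array([[-3.0, 1.0, 0.0], [-3.0, 1.0, 0.0]])
top_abs = rank_genes(X_demo, V_demo, "abs_velocity", n_genes=2)
top_signed = rank_genes(X_demo, V_demo, "signed_velocity", n_genes=1)
top_expr_var = rank_genes(X_demo, V_demo, "expression_variance", n_genes=1)
```

The sweep over network sizes described above would then iterate n_genes over range(50, 501, 50) for each scheme and keep the combination with the highest rm.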
Landscape reconstruction
For every network Wi, an energy landscape is reconstructed using the Discrete Hopfield Network (DHN) formalism [1]. A DHN is formed by n neurons that can be in two different states, ON or OFF. At each discrete time step, a subset of these neurons changes state depending on the influence of all other neurons weighted by the interaction network Wi:
In the original paper [1] Hopfield demonstrated the existence of an energy function such that
If the interaction matrix Wij is symmetric, after evolving the system according to Eqn. 5 the system would move to the states corresponding to local minima of the function.
The original bottom-up idea underlying the DHN is that the W matrix can be constructed so that it stores certain fates, corresponding to local minima of the energy function to which the system would evolve.
Our top-down approach corresponds to analyzing the structure of the energy landscape after taking the GRN as the interaction matrix of a DHN in which each gene corresponds to a neuron. To analyze the energy of a cell, its expression must first be discretized to the ON and OFF states. To accomplish this, we consider a gene to be ON or OFF depending on whether its expression is above or below a predefined gene-specific threshold. This threshold can be taken as the mean value of that gene over all the cells, the median of the gene expression over all the cells, or other values that might be biologically important.
The landscape is displayed over a two-dimensional space for visualization purposes. Therefore, we use PCA or any other embedding algorithm for which an inverse mapping exists to represent the data in two dimensions.
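The discretization and energy evaluation can be sketched in a few lines. The sketch assumes the standard Hopfield energy E = −½ sᵀWs without a bias term and encodes ON/OFF as +1/−1; both choices, and the function names, are assumptions made for illustration.

```python
import numpy as np

def discretize(X, thresholds=None):
    """Map expression to ON (+1) / OFF (-1) using one threshold per gene.

    By default the threshold is the mean of each gene over all cells;
    the +1/-1 encoding is an ASSUMPTION of this sketch.
    """
    X = np.asarray(X, dtype=float)
    if thresholds is None:
        thresholds = X.mean(axis=0)
    return np.where(X > thresholds, 1.0, -1.0)

def hopfield_energy(S, W):
    """Energy of each discretized state (rows of S) under E = -1/2 * s^T W s."""
    S = np.atleast_2d(np.asarray(S, dtype=float))
    W = np.asarray(W, dtype=float)
    return -0.5 * np.einsum("ci,ij,cj->c", S, W, S)

# Two genes that activate each other: aligned states have lower energy.
W_demo = np.array([[0.0, 1.0], [1.0, 0.0]])
states_demo = discretize(np.array([[0.0, 2.0], [2.0, 0.0], [3.0, 3.0]]),
                         thresholds=np.array([1.0, 1.0]))
energies_demo = hopfield_energy(states_demo, W_demo)
```

In this toy network the cell with both genes ON sits at lower energy than the two cells with mismatched states, which is the behaviour expected from a mutually activating pair.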
First, the two-dimensional embedding is calculated and used to project the cells. A square containing all the projected cells is gridded into N × M grid-points, which are then pushed back to the high-dimensional space with the inverse of the embedding. These points are then discretized as in equation (7), and the energy for each point is calculated. In this way, the energy of each grid-point can be plotted as a surface over the 2D space containing the projections of the cells, for which the energy is also calculated and plotted on top of the grid.
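The grid construction can be sketched with PCA computed through an SVD, since PCA has a simple linear inverse. The sketch assumes the same +1/−1 discretization around the per-gene mean and the bias-free Hopfield energy E = −½ sᵀWs; the function name energy_landscape and its return layout are hypothetical.

```python
import numpy as np

def energy_landscape(X, W, n=50, m=50):
    """Energy surface over a 2D PCA embedding of the cells of one cluster.

    Returns grid coordinates (GX, GY), grid energies (n x m),
    the 2D cell projections and the per-cell energies.
    """
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float)
    mean = X.mean(axis=0)
    # PCA via SVD: the rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    P = Vt[:2].T                       # (genes, 2) projection matrix
    Z = (X - mean) @ P                 # (cells, 2) embedded cells
    # Square that contains all the projected cells.
    lo, hi = Z.min(axis=0), Z.max(axis=0)
    center, half = (lo + hi) / 2.0, (hi - lo).max() / 2.0
    gx = np.linspace(center[0] - half, center[0] + half, n)
    gy = np.linspace(center[1] - half, center[1] + half, m)
    GX, GY = np.meshgrid(gx, gy, indexing="ij")
    grid_2d = np.column_stack([GX.ravel(), GY.ravel()])
    grid_hd = grid_2d @ P.T + mean     # inverse PCA map back to gene space

    def energy(A):
        # ASSUMED discretization: ON (+1) above the per-gene mean, OFF (-1) otherwise.
        S = np.where(A > mean, 1.0, -1.0)
        return -0.5 * np.einsum("ci,ij,cj->c", S, W, S)

    return GX, GY, energy(grid_hd).reshape(n, m), Z, energy(X)

# Toy data: 30 cells, 4 genes, symmetric random interaction matrix.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(30, 4))
A_demo = rng.normal(size=(4, 4))
W_demo = (A_demo + A_demo.T) / 2.0
GX, GY, E_grid, Z_demo, E_cells = energy_landscape(X_demo, W_demo, n=5, m=7)
```

The returned arrays can be passed to any surface-plotting routine, with the cells drawn at (Z_demo[:, 0], Z_demo[:, 1], E_cells) on top of the grid surface.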
Network perturbations
For every network Wi, we calculate a vector of eigenvalues λi, and remove from Wi all the entries of the genes with the top 30% eigenvalues in λi. Then, we estimate the distance Dn,p between the perturbed networks to calculate mcorr(De, Dn,p), and use rm,p as a measure of network robustness.
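One way to read this perturbation is sketched below. It is an interpretation, not the authors' procedure: the per-gene score is assumed to be an eigenvector-centrality-style value taken from the leading eigenvector of |Wi|, and "removing the entries" of a gene is assumed to mean zeroing its row and column. The function name is hypothetical.

```python
import numpy as np

def remove_central_genes(W, fraction=0.3):
    """Zero the rows and columns of the most central genes of a network.

    Centrality is ASSUMED to be the magnitude of each gene's component in the
    leading eigenvector of |W| (an eigenvector-centrality-style score).
    Returns the perturbed matrix and the indices of the removed genes.
    """
    W = np.asarray(W, dtype=float)
    vals, vecs = np.linalg.eig(np.abs(W))
    lead = np.argmax(vals.real)              # dominant (Perron) eigenvalue
    centrality = np.abs(vecs[:, lead])
    k = int(round(fraction * W.shape[0]))
    top = np.argsort(centrality)[::-1][:k]   # most central genes first
    W_pert = W.copy()
    W_pert[top, :] = 0.0
    W_pert[:, top] = 0.0
    return W_pert, top

# Star-shaped toy network: gene 0 is connected to every other gene.
W_demo = np.zeros((10, 10))
W_demo[0, 1:] = 1.0
W_demo[1:, 0] = 1.0
W_pert_demo, removed_demo = remove_central_genes(W_demo, fraction=0.3)
```

A random-perturbation baseline would zero the same number of genes chosen uniformly at random and compare the resulting Mantel correlations.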
Gene-specific network scores
For a gene gk in a network Wi, where k = 1, …, Ni and Ni is the number of genes in Wi, we compute an activator score act as
Where Ek,q refers to the outgoing edge from gk to gq. Similarly, we calculate a repressor score rep as
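The two scores amount to summing the positive and the negative outgoing edges of each gene. A minimal sketch follows; it assumes that entry W[k, q] stores the edge from gene k to gene q, which is an orientation chosen for illustration, and the function name is hypothetical.

```python
import numpy as np

def activator_repressor_scores(W):
    """Per-gene sums of positive (activator) and negative (repressor) outgoing edges.

    ASSUMES W[k, q] is the weight of the edge from gene k to gene q.
    The repressor score is returned as a non-positive number.
    """
    W = np.asarray(W, dtype=float)
    act = np.where(W > 0, W, 0.0).sum(axis=1)
    rep = np.where(W < 0, W, 0.0).sum(axis=1)
    return act, rep

W_demo = np.array([[0.0, 2.0, -1.0],
                   [0.5, 0.0, 0.0],
                   [-3.0, -1.0, 0.0]])
act_demo, rep_demo = activator_repressor_scores(W_demo)
```

If the inferred matrix stores edges in the opposite orientation, the sums should be taken over axis=0 instead.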
Benchmark framework
We selected the best-performing set of genes as previously described and used it to infer networks in a cell-type-specific manner with GENIE3 [2], GRNBOOST2 [3] and PIDC [4] in the simulated and the in-house generated data sets. We discarded PIDC since the clusters often did not contain enough cells to meet its data requirements. We benchmarked the methods using the set of rm values estimated by comparing the networks derived from each method to the same De matrix.
Data
Simulated
We simulated a dataset with Dyngen v0.4.0 [5], using backbone_disconnected with left and right backbones set to backbone_consecutive_bifurcating. Models were initialized with the following parameters: num_cells 20,000, num_tfs 50, num_targets 200, num_hks 250.
Public
Human fetal forebrain (hFB18)
Human fetal forebrain cells from 10-week fetal tissue were generated in La Manno et al., 2018. This dataset is accessible from the SRA under the accession code SRP129388 [6].
Human PBMCs (hPB20)
10X Genomics 5K PBMC dataset downloaded from the company’s website [7].
Mouse developing spinal cord (mSC19)
Mouse embryos from stages E9.5 to E13.5 from Delile et al., 2019. Raw sequencing files were retrieved from the database ArrayExpress under the accession E-MTAB-7320 [8].
Mouse brain atlas (mBA18)
Adult mouse nervous system data set generated by Zeisel et al., 2018. It is deposited in the SRA under the accession code SRP135960 [9].
Human embryonic hematopoiesis (hED19)
Human embryonic sections were collected at Carnegie stages 12 to 14. The data generated in Zeng et al., 2019 is available at NCBI’s Gene Expression Omnibus (GEO) with the accession code GSE135202 [10].
Human hematopoiesis - In house (mBD20)
Cell sorting was performed using a FACSAria (BD Biosciences) and analyzed with FACSdiva software (BD Biosciences). Standard, strict forward scatter width versus area criteria were used to discriminate doublets and gate only singleton cells. Viable cells were identified by staining with 7-AAD (BD Biosciences). HSC cells were extracted from the bone marrow using the CD34+ membrane marker. The transcriptome of the cells was profiled using Single Cell 3’ Reagent Kits v3 (10X Genomics), according to the manufacturer’s instructions.
Sequenced libraries were demultiplexed, aligned to the human transcriptome (GRCh38/hg38) and quantified using Cell Ranger (3.0.1) from 10X Genomics. The output of the pre-processing pipeline consisted of UMI-derived expression matrices per cell. The quality control filters applied to the cells were: the number of detected genes, the number of UMIs, and the proportion of UMIs mapped to mitochondrial genes per cell. The thresholds for each of the single-cell libraries were selected based on the distribution of the variables enumerated. Count-based matrices were subjected to normalization, identification of highly variable genes, and removal of unwanted sources of variation using Seurat3 [11]. Next, cells were assigned to the different cell populations shown using SingleR [12]. The annotation was conducted using an in-house bulk RNA-seq dataset from the enumerated populations as a reference.
Acknowledgments
We acknowledge B. Li and J. Ye for initial guidance and discussion of the mathematical model. We also thank J. Ye for helping with the annotations of one of the public datasets. F.P. acknowledges funding from Instituto de Salud Carlos III (PI17/00701, PI20/01308 and CB16/12/00489) co-funded by FEDER grant, and the AGATA grant (0011-1411-2020-000011 and 0011-1411-2020-000010) from the Government of Navarra. NAK was supported by the Karolinska Institute’s funds and KA Wallenberg Foundation (KAW 2017.0077). L.M.S. and J.P.B-T. were supported by a VSRP fellowship from King Abdullah University of Science and Technology.
Footnotes
+ These authors conducted the work at King Abdullah University of Science and Technology but are now part of the Department of Human Genetics at McGill University, Montreal, QC H3A 0C7, Canada.