Abstract
Time-series single-cell RNA sequencing (scRNA-seq) data have opened a door to elucidate cell differentiation processes. In this context, the optimal transport (OT) theory has attracted attention to interpolate scRNA-seq data and infer the trajectories of cell differentiation. However, there remain critical issues in interpretability and computational cost. This paper presents scEGOT, a novel comprehensive trajectory inference framework for single-cell data based on entropic Gaussian mixture optimal transport (EGOT). By constructing a theory of EGOT via an explicit construction of the entropic transport plan and its connection to a continuous OT with its error estimates, EGOT is realized as a generative model with high interpretability and low computational cost, dramatically facilitating the inference of cell trajectories and dynamics from time-series data. The scEGOT framework provides comprehensive outputs from multiple perspectives, including cell state graphs, velocity fields of cell differentiation, time interpolations of single-cell data, space-time continuous videos of cell differentiation with gene expressions, gene regulatory networks, and reconstructions of Waddington’s epigenetic landscape. To demonstrate that scEGOT is a powerful and versatile tool for single-cell biology, we applied it to time-series scRNA-seq data of the human primordial germ cell-like cell (human PGCLC) induction system. Using scEGOT, we precisely identified the PGCLC progenitor population and the bifurcation time of the segregation. Our analysis suggests that a known marker gene TFAP2A alone is not sufficient to identify the PGCLC progenitor cell population, but that NKX1-2 is also required. In addition, we found that MESP1 and GATA6 may also be crucial for PGCLC/somatic cell segregation.
1 Introduction
The “epigenetic landscape” proposed by the renowned biologist C. H. Waddington is a well-known metaphor for describing cell differentiation and is a key concept in developmental biology [47]. In this conceptual model, cells begin as stem cells at the top of this landscape and differentiate into more specialized cell types as they move down the valleys during the development, with the ridges representing potential barriers that prevent transitions between cell types.
Although a useful concept, the actual shapes of the landscapes during differentiation processes have remained unclear in many biological systems. However, recent advances in genome-scale high-dimensional single-cell technologies, such as single-cell RNA sequencing (scRNA-seq) [16, 23], have opened an avenue for inferring the dynamics of cell differentiation in a data-driven manner, as well as for reconstructing Waddington’s landscape. This has made trajectory inference for cell differentiation a central topic in current single-cell and systems biology [21, 41, 48].
Many methods [4, 34] for trajectory inference have been developed using a single snapshot of the scRNA-seq data. Although a snapshot captures only one time point, the heterogeneity of the cell population allows these methods to identify changes in gene expression levels along the pseudo-time [12, 32, 39, 43, 53]. However, the overall dynamics of cell differentiation are very complex, so static trajectory inference of this kind has obvious limitations [51]. Recently, time-series scRNA-seq data have been used to overcome this difficulty and to gain more insight into the dynamics of cell differentiation. Nevertheless, because cells are destroyed at each measurement, the same cells cannot be followed over time, which severely impedes the matching of cell populations between adjacent time points. To deal with this issue, trajectory inference methods based on the optimal transport theory have attracted attention in recent years.
The optimal transport (OT) is a mathematical theory that provides distances and optimal matchings between probability distributions [36, 45, 46]. It has recently been applied in biology [55], especially in single-cell analysis [17, 27, 28], as well as in several other research fields [31]. Among them, Waddington-OT [38] is a well-known method for inferring the cell lineages by applying a static unbalanced optimal transport to the time-series scRNA-seq data. While it can predict cell lineages from the optimal matching of cell populations, since it does not learn the continuous distributions of cell populations (i.e., a non-generative model), we cannot gain much insight into the intermediate states in the differentiation process. It is also known that such optimal transports between cells do not sufficiently reflect those between the cell distributions [50].
On the other hand, optimal transport methods with generative models based on neural networks have also been reported, such as TrajectoryNet [42], JKONET [2], and PRESCIENT [56]. They can generate data in the intermediate states of the differentiation process. However, the neural networks used there introduce black boxes into these methods, making biological interpretation difficult.
In general, the computational cost of solving optimal transport problems, including the above methods, is very high and can be a bottleneck for trajectory inference. GraphFP [15] is a method that addresses this problem by combining dynamic optimal transports on cell state graphs with the nonlinear Fokker–Planck equation. The key to reducing the computational cost is the formulation using inter-cluster optimal transport. Owing to its construction, this method achieves high biological interpretability and low computational cost. However, since it deals with only the cell lineages of the cell clusters, it cannot infer the continuous state of the differentiation process (e.g., transitions during the merging/separation of cell clusters).
In this paper, we present scEGOT, a novel trajectory inference framework based on entropic Gaussian mixture optimal transport (EGOT). It aims to provide a comprehensive trajectory inference framework to infer the dynamics of cell differentiation from time-series single-cell data. The methodology is based on an inter-cluster optimal transport, where clustering and learning of the distributions are performed on cell populations in the gene expression space using the Gaussian mixture model (GMM), and each Gaussian distribution corresponds to a cell type. The main feature of this method is that it has a clear and rigorous correspondence to a continuous optimal transport. Moreover, its computational cost is significantly low owing to the inter-cluster optimal transport. Accordingly, we can continuously infer the intermediate states of the cell differentiation process at low computational cost. As a comprehensive framework, scEGOT can construct (i) cell state graphs, (ii) velocity fields of cell differentiation, (iii) time interpolations of single-cell data, (iv) space-time continuous videos of cell differentiation with gene expressions, (v) gene regulatory networks (GRNs), and (vi) Waddington’s landscape (Fig. 1).
As a biological application, we apply scEGOT to the time-series scRNA-seq data of the human primordial germ cell-like cell (human PGCLC) induction system. We demonstrate that scEGOT provides insights into the molecular mechanism of PGCLC/somatic cell segregation. In particular, using the functions of scEGOT, we elucidate the dynamics of the PGCLC differentiation and identify the PGCLC progenitor cell population. Furthermore, we find key genes such as NKX1-2, MESP1, and GATA6 that may play crucial roles during human PGCLC differentiation.
2 Theory
We present here the mathematical foundation of EGOT and its application to single-cell biology, called scEGOT. By generalizing [7], the EGOT is formulated by an entropic regularization of the discrete optimal transport, which is a coarse-grained model derived by taking each Gaussian distribution as a single point.
We first summarize the properties of the solution of the EGOT and then discuss the entropic transport plan constructed from the solution of EGOT. Furthermore, we show a correspondence between EGOT and the continuous optimal transport presented by [8]. This theoretical compatibility enables us to present a novel trajectory inference framework in scEGOT. Specifically, this framework allows us to construct a cell state graph and infer the intermediate states and velocity of the cell differentiation process, from which we can further infer the dynamic GRNs between adjacent time points and reconstruct Waddington’s landscape in the gene expression space, with high biological interpretability and low computational cost.
All proofs of the mathematical statements made here are provided in the supporting information.
2.1 Gaussian mixture model
Let $\{X_{t_i}\}_{i=1}^{I}$ denote the time-series data of the point clouds in $\mathbb{R}^n$ ($n \ge 1$) with $I$ time stages (possibly with different sample sizes). For each point cloud $X_{t_i}$, we apply a Gaussian mixture model (GMM) [26]. Then, we obtain
$$\mu_{t_i} = \sum_{k=1}^{K_i} \pi_{t_i}^{k}\, \mathcal{N}(m_{t_i}^{k}, \Sigma_{t_i}^{k}), \tag{1}$$
where $K_i$ denotes the number of clusters in the distribution at $t_i$, $\mathcal{N}(m_{t_i}^{k}, \Sigma_{t_i}^{k})$ is a Gaussian distribution with a mean vector $m_{t_i}^{k}$ and a covariance matrix $\Sigma_{t_i}^{k}$, and $\pi_{t_i}^{k}$ is a non-negative weight of each Gaussian distribution satisfying $\sum_{k=1}^{K_i} \pi_{t_i}^{k} = 1$. In the scRNA-seq data, $X_{t_i}$ denotes a cell population in the gene expression space $\mathbb{R}^n$ at the time $t_i$, where each cell is characterized by the $n$ values of the gene expressions, or those in a principal component analysis (PCA) space after the dimension reduction. Then, each Gaussian distribution $\mathcal{N}(m_{t_i}^{k}, \Sigma_{t_i}^{k})$ and its weight $\pi_{t_i}^{k}$ are regarded as a certain cell type and its existence probability, respectively.
2.2 Entropic Gaussian mixture Optimal Transport (EGOT)
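EGOT connects these Gaussian mixture distributions between adjacent times by solving the following optimization problem for the weights $\pi_{t_i}$ and $\pi_{t_{i+1}}$:
$$\min_{w \in \Pi(\pi_{t_i}, \pi_{t_{i+1}})} \sum_{k=1}^{K_i} \sum_{l=1}^{K_{i+1}} w_{k,l}\, W_2^2\big(\mathcal{N}(m_{t_i}^{k}, \Sigma_{t_i}^{k}), \mathcal{N}(m_{t_{i+1}}^{l}, \Sigma_{t_{i+1}}^{l})\big) - \varepsilon H(w), \tag{2}$$
where $\varepsilon > 0$ is the regularization parameter;
$$\Pi(\pi_{t_i}, \pi_{t_{i+1}}) = \big\{ w \in \mathbb{R}_{\ge 0}^{K_i \times K_{i+1}} \;\big|\; w \mathbf{1}_{K_{i+1}} = \pi_{t_i},\ w^{T} \mathbf{1}_{K_i} = \pi_{t_{i+1}} \big\}$$
is a subset of $K_i \times K_{i+1}$ matrices of non-negative real numbers; $\mathbf{1}_{K_i}$ expresses the $K_i$-dimensional vector whose entries are all 1; and $(\cdot)^{T}$ denotes the transpose. Here, $W_2$ is the $L^2$-Bures–Wasserstein distance [9, 11, 30] using the squared Bures metric [3] and $H$ is an entropy function defined as
$$H(w) = -\sum_{k=1}^{K_i} \sum_{l=1}^{K_{i+1}} w_{k,l} (\log w_{k,l} - 1).$$
The solution $w$ of the EGOT represents how close the Gaussian distributions of the adjacent time points are to each other. Thus, it indicates the similarity between the cell populations when applied to the scRNA-seq data. Since the discrete entropy $H$ is a strongly concave function, the objective function of (2) is a strongly convex function. Therefore, the optimization problem (2) has a unique optimal solution. In addition, by considering the Lagrangian associated with the problem (2) and the first-order optimality condition, the following proposition holds: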
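Proposition 1. The optimization problem (2) has a unique solution $w^{\varepsilon}$ given by
$$w^{\varepsilon}_{k,l} = \alpha_k \exp\!\Big(-\frac{1}{\varepsilon} W_2^2\big(\mathcal{N}(m_{t_i}^{k}, \Sigma_{t_i}^{k}), \mathcal{N}(m_{t_{i+1}}^{l}, \Sigma_{t_{i+1}}^{l})\big)\Big) \beta_l, \tag{3}$$
where $\alpha \in \mathbb{R}_{>0}^{K_i}$ and $\beta \in \mathbb{R}_{>0}^{K_{i+1}}$ are derived from the dual variables of the Lagrangian associated with the problem (2).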
We emphasize that by coarse-graining point clouds with Gaussian mixture distributions, the EGOT is significantly less computationally expensive than directly analyzing the optimal transports with full point clouds (e.g., Waddington-OT [38]). Furthermore, as we will see subsequently, EGOT can recover the solution and distance of a continuous optimal transport between Gaussian mixtures, which will provide deeper insights into the stochastic dynamics of cells in the gene expression space.
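As a minimal sketch of how problem (2) can be solved numerically, the following NumPy code computes the squared Bures–Wasserstein cost between Gaussian components and runs Sinkhorn iterations, which produce a plan of the scaling form (3). The function names, the toy means, covariances and weights, the value ε = 1, and the fixed iteration count are illustrative assumptions and not the scEGOT implementation. For a small ε such as 0.01, a log-domain variant would be needed to avoid numerical underflow.

```python
import numpy as np

def sqrtm_psd(S):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def bures_wasserstein_sq(m0, S0, m1, S1):
    """Squared 2-Wasserstein distance between N(m0, S0) and N(m1, S1)."""
    r0 = sqrtm_psd(S0)
    cross = sqrtm_psd(r0 @ S1 @ r0)
    return float(np.sum((m0 - m1) ** 2) + np.trace(S0 + S1 - 2.0 * cross))

def egot_plan(pi0, pi1, cost, eps, n_iter=500):
    """Sinkhorn iterations; returns w = diag(a) K diag(b) with K = exp(-cost/eps)."""
    kernel = np.exp(-cost / eps)
    a, b = np.ones_like(pi0), np.ones_like(pi1)
    for _ in range(n_iter):
        a = pi0 / (kernel @ b)      # match the row marginals
        b = pi1 / (kernel.T @ a)    # match the column marginals
    return a[:, None] * kernel * b[None, :]

# Toy mixtures (assumed values): 2 components at t_i, 3 at t_{i+1}, in R^2.
means0 = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
means1 = [np.array([0.0, 1.0]), np.array([4.0, 1.0]), np.array([2.0, 3.0])]
covs0 = [np.eye(2)] * 2
covs1 = [np.eye(2)] * 3
pi0 = np.array([0.5, 0.5])
pi1 = np.array([0.4, 0.4, 0.2])

cost = np.array([[bures_wasserstein_sq(m0, S0, m1, S1)
                  for m1, S1 in zip(means1, covs1)]
                 for m0, S0 in zip(means0, covs0)])
w = egot_plan(pi0, pi1, cost, eps=1.0)
```

The cost matrix has only K_i × K_{i+1} entries, independent of the number of cells, which is the source of the low computational cost discussed above.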
2.3 Convergence of EGOT and its connection to continuous OT
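In this section, we clarify the connection between EGOT and the following continuous optimal transport proposed by [8]:
$$MW_2^2(\mu_{t_i}, \mu_{t_{i+1}}) = \inf_{\gamma \in \Pi(\mu_{t_i}, \mu_{t_{i+1}}) \cap G_{2n}(\infty)} \int_{\mathbb{R}^n \times \mathbb{R}^n} \|x - y\|^2 \, d\gamma(x, y), \tag{4}$$
where $\Pi(\mu_{t_i}, \mu_{t_{i+1}})$ is a set of probability measures on $\mathbb{R}^n \times \mathbb{R}^n$ with the given marginals $\mu_{t_i}$ and $\mu_{t_{i+1}}$; $G_{2n}(K)$ is a set of probability measures on $\mathbb{R}^n \times \mathbb{R}^n$, which can be written as Gaussian mixtures with $K$ or fewer components; and $G_{2n}(\infty) = \bigcup_{K \ge 1} G_{2n}(K)$.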
Let us first consider the convergence of the EGOT (2) and the solution (3) as ε→0. By using an estimate similar to that for discrete entropic regularized optimal transports with Shannon entropy by Weed [49], we can prove the following proposition.
The solution wε (3) exponentially converges to an optimal transport plan of with the maximal entropy as ε→0. That is, there exists M > 0 independent of ε > 0 such that where w* is the solution of and ‖ · ‖1 denotes the ℓ1 norm of vectors (viewing w as a vector). Moreover, as ε→0 with the following estimate:
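Next, we define an entropic transport plan
$$\gamma^{\varepsilon}(x, y) = \sum_{k=1}^{K_i} \sum_{l=1}^{K_{i+1}} w^{\varepsilon}_{k,l}\, g_{m_{t_i}^{k}, \Sigma_{t_i}^{k}}(x)\, \delta_{y = T_{k,l}(x)}, \tag{8}$$
where $g_{m, \Sigma}$ is the density of the multivariate Gaussian distribution $\mathcal{N}(m, \Sigma)$ and $T_{k,l}$ is the optimal transport map between $\mathcal{N}(m_{t_i}^{k}, \Sigma_{t_i}^{k})$ and $\mathcal{N}(m_{t_{i+1}}^{l}, \Sigma_{t_{i+1}}^{l})$ given by
$$T_{k,l}(x) = m_{t_{i+1}}^{l} + C_{k,l}\,(x - m_{t_i}^{k})$$
with
$$C_{k,l} = (\Sigma_{t_i}^{k})^{-1/2} \big( (\Sigma_{t_i}^{k})^{1/2}\, \Sigma_{t_{i+1}}^{l}\, (\Sigma_{t_i}^{k})^{1/2} \big)^{1/2} (\Sigma_{t_i}^{k})^{-1/2}.$$
By applying the convergence result (5) and the estimate (7), we obtain the main theorem in this paper, which provides a correspondence between EGOT and the continuous optimal transport (4) (Fig. 2).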
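Theorem 3. The continuous optimal transport (4) has a minimizer $\gamma^{*}$, and the entropic transport plan $\gamma^{\varepsilon}$ converges to $\gamma^{*}$ in the sense of the narrow convergence, i.e., for any bounded continuous function $\varphi \in C_b(\mathbb{R}^n \times \mathbb{R}^n)$,
$$\int_{\mathbb{R}^n \times \mathbb{R}^n} \varphi \, d\gamma^{\varepsilon} \to \int_{\mathbb{R}^n \times \mathbb{R}^n} \varphi \, d\gamma^{*} \quad \text{as } \varepsilon \to 0.$$
Moreover, as $\varepsilon \to 0$ with the following estimate: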
Theorem 3 enables us to go back and forth between the discrete and continuous optimal transports through the entropic transport plan (8). Thus it provides a versatile trajectory inference framework (from continuous OT) with the low computational complexity (from discrete OT) (Fig. 3).
2.4 Cell state graph by EGOT
In the case of point clouds obtained from the time-series scRNA-seq data, each Gaussian distribution and its weight can be regarded as a certain cell type and its existence probability, respectively. Then, the solution of EGOT, which represents how much weight is transported between the Gaussian distributions at adjacent time points, can be interpreted as the degree of relationship between these cell types.
Based on this interpretation, we define a cell state graph of cell differentiation as follows. For each pair of adjacent times $t_i$, $t_{i+1}$, a complete weighted bipartite graph $(V_i, W_i)$ in $\mathbb{R}^n$ is constructed as
$$V_i = \{ m_{t_i}^{k} \}_{k=1}^{K_i} \cup \{ m_{t_{i+1}}^{l} \}_{l=1}^{K_{i+1}}, \qquad W_i = \Big\{ \frac{w^{\varepsilon}_{k,l}}{\sum_{l'=1}^{K_{i+1}} w^{\varepsilon}_{k,l'}} \Big\}_{k,l},$$
where the set $V_i$ of nodes consists of the mean vectors at times $t_i$ and $t_{i+1}$ representing the locations of the cell types, and the set $W_i$ of the weighted edges is given by the normalized solution of the EGOT (called transport rate) corresponding to the degree of relationship of the cell types. Then, the cell state graph $(V, W)$ is constructed by combining $(V_i, W_i)$ for all the time stages. From this cell state graph, we can study the state transitions of the cell population in the temporal evolution.
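The construction of the weighted edges can be sketched as follows. This example assumes that the transport rate is the EGOT solution normalized by the source weight, so that the outgoing rates of each source cluster sum to one. That normalization, the function names, the `min_rate` threshold, and the toy plan `w` are illustrative assumptions. The node labels follow the "day-cluster" naming used for the pathways in Section 3.4.

```python
import numpy as np

def transport_rates(w):
    """Row-normalize a transport plan: rate[k, l] = w[k, l] / sum_l' w[k, l']."""
    return w / w.sum(axis=1, keepdims=True)

def cell_state_edges(w, src_label, dst_label, min_rate=0.0):
    """List weighted edges (source node, target node, transport rate)."""
    rates = transport_rates(w)
    edges = []
    for k in range(rates.shape[0]):
        for l in range(rates.shape[1]):
            if rates[k, l] > min_rate:  # hypothetical pruning of weak edges
                edges.append((f"{src_label}-{k + 1}",
                              f"{dst_label}-{l + 1}",
                              float(rates[k, l])))
    return edges

# Toy plan between 2 source clusters and 3 target clusters (assumed values).
w = np.array([[0.40, 0.00, 0.10],
              [0.00, 0.40, 0.10]])
edges = cell_state_edges(w, "day0.5", "day1", min_rate=0.05)
```

Collecting such edge lists for every pair of adjacent time points gives the full graph (V, W).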
2.5 Entropic barycentric projection map and cell velocity
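The entropic transport plan $\gamma^{\varepsilon}(x, y)$ can be regarded as the weights to be transported from the point $x$ to $y$. By averaging over $y$, we can define an entropic barycentric projection map of the EGOT, which gives the mean destination of the mass transported from $x$, as follows:
$$T^{\varepsilon}(x) = \int_{\mathbb{R}^n} y \, d\gamma^{\varepsilon}_{x}(y), \tag{9}$$
where $\gamma^{\varepsilon}_{x}$ is the disintegration of the entropic transport plan $\gamma^{\varepsilon}$ with respect to the first marginal $\mu_{t_i}$, i.e., $d\gamma^{\varepsilon}(x, y) = d\gamma^{\varepsilon}_{x}(y)\, d\mu_{t_i}(x)$. Then, by (8), we can compute the entropic barycentric projection map $T^{\varepsilon}$ explicitly.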
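Proposition 4. The entropic barycentric projection map with respect to $\gamma^{\varepsilon}$ is expressed as
$$T^{\varepsilon}(x) = \frac{\sum_{k=1}^{K_i} \sum_{l=1}^{K_{i+1}} w^{\varepsilon}_{k,l}\, g_{m_{t_i}^{k}, \Sigma_{t_i}^{k}}(x)\, T_{k,l}(x)}{\sum_{k=1}^{K_i} \pi_{t_i}^{k}\, g_{m_{t_i}^{k}, \Sigma_{t_i}^{k}}(x)}, \tag{10}$$
where $g_{m, \Sigma}$ denotes the probability density function of $\mathcal{N}(m, \Sigma)$.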
From the definition, the entropic barycentric projection map $T^{\varepsilon}$ represents where a cell at time $t_i$ moves at time $t_{i+1}$ in the gene expression space. Accordingly, we define the rate of change of the gene expression for a cell $x$ from the time $t_i$ to $t_{i+1}$, called the cell velocity, as
$$v(x) = \frac{T^{\varepsilon}(x) - x}{t_{i+1} - t_i}. \tag{11}$$
The cell velocity $v(x)$ expresses which direction and how much the gene expression of the cell $x$ changes. A high speed $|v|$ implies a significant change of the cell in cell differentiation, whereas a low speed $|v|$ indicates that the gene expression hardly changes and is close to the steady state.
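The barycentric projection and the cell velocity can be evaluated pointwise from the mixture parameters and a transport plan. The sketch below assumes the standard closed-form optimal transport map between two Gaussians and weights each pair (k, l) by the plan entry and the density of source component k at x. The function names and the single-component toy example are illustrative assumptions, not the scEGOT API.

```python
import numpy as np

def sqrtm_psd(S):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def gaussian_pdf(x, m, S):
    """Density of N(m, S) at the point x."""
    d = x - m
    quad = d @ np.linalg.solve(S, d)
    norm = np.sqrt((2.0 * np.pi) ** m.size * np.linalg.det(S))
    return float(np.exp(-0.5 * quad) / norm)

def ot_map(x, m0, S0, m1, S1):
    """Optimal transport map from N(m0, S0) to N(m1, S1), evaluated at x."""
    r0 = sqrtm_psd(S0)
    r0_inv = np.linalg.inv(r0)
    C = r0_inv @ sqrtm_psd(r0 @ S1 @ r0) @ r0_inv
    return m1 + C @ (x - m0)

def barycentric_projection(x, w, means0, covs0, means1, covs1):
    """Density-weighted average of the pairwise maps T_{k,l}(x)."""
    dens = np.array([gaussian_pdf(x, m, S) for m, S in zip(means0, covs0)])
    numer = np.zeros_like(x, dtype=float)
    for k in range(w.shape[0]):
        for l in range(w.shape[1]):
            numer += w[k, l] * dens[k] * ot_map(x, means0[k], covs0[k],
                                                means1[l], covs1[l])
    return numer / float(w.sum(axis=1) @ dens)

def cell_velocity(x, w, means0, covs0, means1, covs1, t0, t1):
    """Finite-difference velocity (T(x) - x) / (t1 - t0)."""
    T_x = barycentric_projection(x, w, means0, covs0, means1, covs1)
    return (T_x - x) / (t1 - t0)

# Toy example (assumed values): one source and one target component.
means0, covs0 = [np.zeros(2)], [np.eye(2)]
means1, covs1 = [np.array([2.0, 0.0])], [4.0 * np.eye(2)]
w = np.array([[1.0]])
x = np.array([1.0, 1.0])
T_x = barycentric_projection(x, w, means0, covs0, means1, covs1)
v_x = cell_velocity(x, w, means0, covs0, means1, covs1, t0=0.0, t1=0.5)
```

In this toy case the map reduces to T(x) = m1 + 2x, so the cell at (1, 1) is sent to (4, 2).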
2.6 Entropic displacement interpolation and gene expression animation
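One of the great advantages of the optimal transport theory is that we can obtain the displacement interpolations between the probability distributions, allowing us to study the dynamics of cell differentiation. We define the following entropic displacement interpolation between $\mu_{t_i}$ and $\mu_{t_{i+1}}$ at $t_{i+s} = (1 - s)\,t_i + s\,t_{i+1}$, $s \in [0, 1]$:
$$\mu^{\varepsilon}_{t_{i+s}} = \sum_{k=1}^{K_i} \sum_{l=1}^{K_{i+1}} w^{\varepsilon}_{k,l}\, \mathcal{N}\big(m^{k,l}_{t_{i+s}}, \Sigma^{k,l}_{t_{i+s}}\big). \tag{12}$$
Here $w^{\varepsilon}$ is the solution (3), and $\mathcal{N}(m^{k,l}_{t_{i+s}}, \Sigma^{k,l}_{t_{i+s}})$ is the Gaussian distribution with
$$m^{k,l}_{t_{i+s}} = (1 - s)\, m_{t_i}^{k} + s\, m_{t_{i+1}}^{l}$$
and
$$\Sigma^{k,l}_{t_{i+s}} = \big( (1 - s) I_n + s\, C_{k,l} \big)\, \Sigma_{t_i}^{k}\, \big( (1 - s) I_n + s\, C_{k,l} \big),$$
where $I_n$ is the $n \times n$ identity matrix and
$$C_{k,l} = (\Sigma_{t_i}^{k})^{-1/2} \big( (\Sigma_{t_i}^{k})^{1/2}\, \Sigma_{t_{i+1}}^{l}\, (\Sigma_{t_i}^{k})^{1/2} \big)^{1/2} (\Sigma_{t_i}^{k})^{-1/2}.$$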
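Theorem 5. The probability distribution
$$\mu_{t_{i+s}} = \sum_{k=1}^{K_i} \sum_{l=1}^{K_{i+1}} w^{*}_{k,l}\, \mathcal{N}\big(m^{k,l}_{t_{i+s}}, \Sigma^{k,l}_{t_{i+s}}\big) \tag{13}$$
is the geodesic with the maximum entropy between $\mu_{t_i}$ and $\mu_{t_{i+1}}$ in the space $G_{2n}(\infty)$ equipped with the distance $MW_2$, where $w^{*}$ is the solution (6). Moreover, for any $s \in [0, 1]$, the entropic displacement interpolation $\mu^{\varepsilon}_{t_{i+s}}$ converges to $\mu_{t_{i+s}}$ narrowly as $\varepsilon \to 0$.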
From Theorem 5, we see that the entropic displacement interpolation (12) is an approximation of the geodesic (13) in the space $G_{2n}(\infty)$ equipped with the distance $MW_2$. Through this entropic displacement interpolation (12), we can generate an interpolation distribution between $\mu_{t_i}$ and $\mu_{t_{i+1}}$, which means that the interpolation of the adjacent scRNA-seq data $X_{t_i}$ and $X_{t_{i+1}}$ can be generated. Therefore, by successively constructing the interpolations for all the time stages, we can create an animation of the gene expressions to track the time-varying dynamics of cell differentiation.
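Generating interpolated cells amounts to sampling from a Gaussian mixture whose components are indexed by pairs (k, l) and weighted by the transport plan. The sketch below assumes the standard displacement interpolation between two Gaussians (linear in the means, and a congruence of the source covariance by the interpolated transport matrix). The function names, the toy parameters, and the sampling scheme are illustrative assumptions rather than the scEGOT implementation.

```python
import numpy as np

def sqrtm_psd(S):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def interpolate_gaussian(m0, S0, m1, S1, s):
    """Mean and covariance of the displacement interpolation at s in [0, 1]."""
    r0 = sqrtm_psd(S0)
    r0_inv = np.linalg.inv(r0)
    C = r0_inv @ sqrtm_psd(r0 @ S1 @ r0) @ r0_inv
    M = (1.0 - s) * np.eye(m0.size) + s * C
    cov = M @ S0 @ M
    return (1.0 - s) * m0 + s * m1, 0.5 * (cov + cov.T)  # symmetrize round-off

def sample_interpolation(w, means0, covs0, means1, covs1, s, n_cells, rng):
    """Draw n_cells points from the mixture sum_{k,l} w[k,l] N(m_s^{k,l}, S_s^{k,l})."""
    n_src, n_dst = w.shape
    picks = rng.choice(n_src * n_dst, size=n_cells, p=(w / w.sum()).ravel())
    cells = np.empty((n_cells, means0[0].size))
    for idx in np.unique(picks):
        k, l = divmod(int(idx), n_dst)
        m, S = interpolate_gaussian(means0[k], covs0[k], means1[l], covs1[l], s)
        mask = picks == idx
        cells[mask] = rng.multivariate_normal(m, S, size=int(mask.sum()))
    return cells

# Toy example (assumed values): one source cluster splitting into two targets.
rng = np.random.default_rng(0)
means0, covs0 = [np.zeros(2)], [np.eye(2)]
means1 = [np.array([2.0, 0.0]), np.array([0.0, 2.0])]
covs1 = [4.0 * np.eye(2), np.eye(2)]
w = np.array([[0.6, 0.4]])
m_half, S_half = interpolate_gaussian(means0[0], covs0[0], means1[0], covs1[0], 0.5)
cells = sample_interpolation(w, means0, covs0, means1, covs1, 0.5, 1000, rng)
```

Repeating the sampling on a grid of s values between every pair of adjacent time points yields the frames of the animation described above.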
2.7 Estimation of GRNs
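Once the cell velocity (11) representing the changes in the gene expressions is obtained, it can be used to estimate the gene regulatory network (GRN) that drives the cell differentiation dynamics. Let $\{x_j\}_{j=1}^{D} \subset \mathbb{R}^n$ be a cell population consisting of $D$ cells for which the GRN is to be obtained. We assume that the gene expressions in the cell population are driven by linear dynamics
$$\frac{dx_j}{dt} = A x_j, \quad j = 1, \dots, D,$$
where $A \in \mathbb{R}^{n \times n}$ is the matrix characterizing the GRN and represents the effect of the expression level of each gene on the expression dynamics of the other genes. By the cell velocity (11), we may assume that $dx_j/dt = v(x_j)$. Thus, we obtain
$$v(x_j) = A x_j, \quad j = 1, \dots, D. \tag{14}$$
Based on (14), the GRN matrix $A^{*}$ is estimated by solving the following regression problem:
$$A^{*} = \arg\min_{A \in \mathbb{R}^{n \times n}} \|V - A X\|_F^2 + \lambda \|A\|_F^2, \tag{15}$$
where $X = (x_1, \dots, x_D)$ and $V = (v(x_1), \dots, v(x_D))$ are $n \times D$ matrices, $\|\cdot\|_F$ denotes the Frobenius norm, and $\lambda > 0$ is a regularization parameter. In terms of each cell $x_j$, this can be expressed as
$$A^{*} = \arg\min_{A \in \mathbb{R}^{n \times n}} \sum_{j=1}^{D} \sum_{i=1}^{n} \Big( v_i(x_j) - \sum_{p=1}^{n} A_{i,p}\, (x_j)_p \Big)^2 + \lambda \sum_{i=1}^{n} \sum_{p=1}^{n} A_{i,p}^2, \tag{16}$$
where $v_i(x_j)$ is the $i$-th component of $v(x_j)$ and $(x_j)_p$ is the $p$-th component of $x_j$. By solving the problem (15) or (16) at each adjacent time point, we can estimate the dynamic GRNs driving cell differentiation.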
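For illustration, the regression of velocities on expressions can be solved in closed form when the penalty is a squared Frobenius norm (ridge regression). The ridge penalty, the function name `estimate_grn`, and the synthetic data below are assumptions made for this sketch. Cells are stored as rows, so the model reads V ≈ X Aᵀ.

```python
import numpy as np

def estimate_grn(X, V, lam):
    """Ridge estimate of A in v = A x.

    X, V: arrays of shape (D cells, n genes) holding expressions and velocities.
    Returns A of shape (n, n); A[i, p] is the effect of gene p on gene i.
    """
    n = X.shape[1]
    gram = X.T @ X + lam * np.eye(n)
    # Solve (X^T X + lam I) A^T = X^T V, then transpose.
    return np.linalg.solve(gram, X.T @ V).T

# Synthetic check (assumed values): velocities generated by a known matrix.
rng = np.random.default_rng(0)
A_true = np.array([[0.0, 0.5, 0.0],
                   [-0.3, 0.0, 0.0],
                   [0.0, 0.2, -0.1]])
X = rng.normal(size=(200, 3))
V = X @ A_true.T
A_hat = estimate_grn(X, V, lam=1e-8)
```

With noiseless velocities and a negligible penalty, the estimate recovers the generating matrix. On real data the penalty strength trades off fit against the size of the inferred interactions.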
2.8 Construction of Waddington’s landscape
From the cell velocity (11), we can construct Waddington’s landscape in the gene expression space. The Helmholtz–Hodge decomposition implies that the cell velocity can be written as
$$v = -\nabla \varphi + q, \tag{17}$$
where $\varphi$ is the gradient potential and $q$ is the divergence-free part. Since Waddington’s landscape is a gradient system, the potential $\varphi$ can be regarded as the realization of Waddington’s landscape in the gene expression space. To construct $\varphi$, we take the divergence operator in the equation (17) and obtain the following differential equation
$$-\Delta \varphi = \operatorname{div} v. \tag{18}$$
The partial differential equation (18) is not uniquely solvable in general since the boundary conditions are undefined. In the following, instead of solving (18) directly, we consider it as an equation on the k-nearest neighbor graph of cells and look for its least-squares and least-norm solution. In other words, we consider the potential $\varphi$ as the solution to the following minimization problem:
$$\varphi^{*} = \arg\min_{\varphi \in S} \|\varphi\|_2, \tag{19}$$
where $S$ is the set of solutions to the least-squares minimization problem
$$S = \arg\min_{\varphi} \|L \varphi - \operatorname{div} v\|_2^2.$$
Here, $L$ denotes the graph Laplacian on the k-nearest neighbor graph of the cells. The solution of the minimization problem (19) can be written as
$$\varphi^{*} = L^{\dagger} \operatorname{div} v, \tag{20}$$
where $L^{\dagger}$ denotes the Moore–Penrose pseudoinverse of $L$. Note that the divergence of the cell velocity $v$ on the right-hand side in (20) can also be computed concretely, as the cell velocity (11) is obtained explicitly. Thus, the following theorem holds.
The potential φ * is represented by with where (·, ·) denotes the standard Euclidean inner product of vectors and Ck denotes
Through this procedure, we can construct the gradient potential (20) of the cells as in Waddington’s landscape.
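The least-squares, least-norm solution on a k-nearest neighbor graph can be sketched with a dense pseudoinverse. The example assumes the combinatorial Laplacian L = D − W of a symmetrized, unweighted kNN graph, which plays the role of −Δ, and takes the divergence values as a given vector. The function names, these conventions, and the one-dimensional toy data are illustrative assumptions; a dense pseudoinverse is only practical for small numbers of cells.

```python
import numpy as np

def knn_graph_laplacian(cells, k):
    """Combinatorial Laplacian D - W of the symmetrized kNN graph of the rows."""
    n_cells = cells.shape[0]
    dist = np.linalg.norm(cells[:, None, :] - cells[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)              # a cell is not its own neighbor
    nbrs = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros((n_cells, n_cells))
    adj[np.repeat(np.arange(n_cells), k), nbrs.ravel()] = 1.0
    adj = np.maximum(adj, adj.T)                # make the graph undirected
    return np.diag(adj.sum(axis=1)) - adj

def landscape_potential(laplacian, div_v):
    """Least-squares, least-norm solution of L phi = div v via the pseudoinverse."""
    return np.linalg.pinv(laplacian, rcond=1e-10) @ div_v

# Toy check (assumed values): cells on a line, divergence built from a known potential.
cells = np.arange(10, dtype=float)[:, None]
L = knn_graph_laplacian(cells, k=2)
phi_true = (cells[:, 0] - 4.5) ** 2
div_v = L @ phi_true
phi = landscape_potential(L, div_v)
```

Because constants lie in the kernel of the Laplacian of a connected graph, the recovered potential equals the generating one up to an additive constant, and the least-norm choice fixes its mean to zero.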
3 Biological application of scEGOT to human PGCLC induction system dataset
This section presents a biological application of scEGOT to time-series single-cell gene expression data and its validation. In particular, we consider the human primordial germ cell-like cell (PGCLC) induction system in vitro. In this in vitro culture system, previous studies have shown that genes such as EOMES, GATA3, SOX17, and TFAP2C are critical for PGCLC differentiation [14, 18, 19, 37, 40]. However, the induction rate of PGCLC is only approximately 10 to 40%, and the precise molecular mechanisms underlying the PGCLC differentiation and PGCLC/somatic cell segregation remain poorly understood. Using the functions of scEGOT, we characterize the progenitor cell population and elucidate the molecular mechanism of PGCLC/somatic cell segregation.
3.1 Dataset
The induction starts with the differentiation of human induced pluripotent stem cells (hiPSC) into incipient mesoderm-like cells (iMeLC) in a 2D culture, followed by their transfer into a 3D culture medium as aggregates with the inclusion of BMP4 and other signals. The scRNA-seq dataset was generated using the 10X Genomics platform and consists of five time points (I = 5); we refer to iMeLC as day 0, and the 3D-cultured aggregates are denoted as day 0.5, day 1, day 1.5, and day 2 (Fig. 4A).
3.2 Preprocessing and clustering
As preprocessing for scEGOT, we applied the noise reduction method RECODE [13] and selected the 2,000 non-mitochondrial genes with the highest normalized variances (variances divided by means), followed by log scaling. We then performed PCA and applied scEGOT to the top 150 principal components (their cumulative contribution rate was 93.67%). We set the regularization parameter of EGOT to ε = 0.01.
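The gene selection and dimension reduction steps can be sketched in NumPy as follows. The noise reduction by RECODE and the removal of mitochondrial genes are omitted, `log1p` is assumed as the log scaling, and the function names and the synthetic count matrix are illustrative assumptions. On real data, the corresponding values would be 2,000 genes and 150 components.

```python
import numpy as np

def select_genes(counts, n_top):
    """Indices of the n_top genes with the highest variance-to-mean ratio."""
    mean = counts.mean(axis=0)
    var = counts.var(axis=0)
    ratio = np.divide(var, mean, out=np.zeros_like(mean), where=mean > 0)
    return np.sort(np.argsort(ratio)[::-1][:n_top])

def pca(X, n_components):
    """PCA scores and cumulative contribution rate via the SVD of centered data."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    return U[:, :n_components] * s[:n_components], np.cumsum(explained)[:n_components]

# Synthetic counts (assumed values): 50 cells x 10 genes, gene 3 overdispersed.
rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(50, 10)).astype(float)
counts[:, 3] *= 10.0

genes = select_genes(counts, n_top=4)
scores, cum_ratio = pca(np.log1p(counts[:, genes]), n_components=2)
```

The last entry of `cum_ratio` corresponds to the cumulative contribution rate quoted above for the retained components.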
We set the number of clusters Ki as 1, 2, 4, 5, and 5 for day 0, day 0.5, day 1, day 1.5, and day 2, respectively (Figs. 4 and S1A). In particular, by examining the expression of the signature genes, we annotated the clusters at day 2 with four cell types: PGCLC (NANOG+/SOX17+/TFAP2C+/PRDM1+), amnion-like cell (AMLC, TFAP2A+/TFAP2C+/GATA3+), endoderm-like cell (EDLC, GATA6+/SOX17+/FOXA2+), and extra-embryonic mesenchyme-like cell (EXMCLC, HAND1+/FOXF1+) (Figs. 4B and S1B–C). This annotation of the clusters is consistent with [6].
3.3 Verification of scEGOT interpolation
We verify the accuracy of the entropic displacement interpolation of scEGOT (12). We set the data at day 1 as the reference distribution and generated the interpolation distribution using the data at day 0.5 (two clusters) and day 1.5 (five clusters). The analysis shows that the interpolation distribution (1,000 cells) properly reproduces the reference distribution (Fig. 5A). This was quantitatively verified using the silhouette score (Fig. 5B), which shows that the interpolation by scEGOT overlaps with the reference data more closely than with the source and target data. Here, a silhouette score close to one indicates that the clusters are separated, whereas a score close to zero indicates that they overlap.
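To make the use of the silhouette score as an overlap measure concrete, the following sketch computes the mean silhouette score of two labeled point sets with Euclidean distances. The function name and the synthetic point sets are illustrative assumptions. In the verification above, the two groups would be the interpolated cells and the reference, source, or target cells.

```python
import numpy as np

def silhouette_two_groups(A, B):
    """Mean silhouette score of the two-group labeling of the rows of A and B.

    Each group must contain at least two points.
    """
    X = np.vstack([A, B])
    labels = np.array([0] * len(A) + [1] * len(B))
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.empty(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        a = dist[i, same].sum() / (same.sum() - 1)  # mean distance within own group
        b = dist[i, ~same].mean()                   # mean distance to the other group
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Synthetic check (assumed values): separated versus overlapping populations.
rng = np.random.default_rng(0)
separated = silhouette_two_groups(rng.normal(size=(200, 2)),
                                  rng.normal(size=(200, 2)) + 20.0)
overlapped = silhouette_two_groups(rng.normal(size=(200, 2)),
                                   rng.normal(size=(200, 2)))
```

A score near one for `separated` and near zero for `overlapped` mirrors the interpretation given above.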
3.4 Trajectory inference with cell state graph
In this section, we analyze the trajectory of cell differentiation in the human PGCLC induction system using the cell state graph (Fig. 6) generated by scEGOT and gain molecular insights into the mechanism that segregates PGCLCs from somatic lineages. The cell state graph shows four primary differentiation pathways leading to PGCLC, AMLC, EDLC, and EXMCLC as follows:
PGCLC: day0-1 → day0.5-1 → day1-1 → day1.5-1 →day2-1;
AMLC: day0-1 → day0.5-1 → day1-1 → day1.5-2 → day2-2;
EDLC: day0-1 → day0.5-1 → day1-1 → day1.5-3 → day2-3;
EXMCLC: day0-1 → day0.5-2 → day1-3 → day1.5-5 → day2-5.
The top path in the cell state graph (Fig. 6B) represents the PGCLC differentiation pathway. This path features key PGCLC markers such as NANOG, SOX17, TFAP2C and PRDM1 [14, 18, 37, 40] (Fig. S1C). The expression levels of these markers increase from day 0.5 to day 1.5. It is of note that major EXMCLC progenitors are segregated as early as day 0.5 (see Space-time continuous gene expression analysis section for further analysis). Additionally, EXMCLC can be generated through alternative pathways, most typically, day0-1 → day0.5-1 → day1-2 → day1.5-4 → day2-4, suggesting that day0.5-1 cells retain a competence to differentiate into EXMCLC. Indeed, while there are pathways from the PGCLC progenitors until day 1 to the somatic cell populations, there is no pathway in the opposite direction. This result is reminiscent of Weismann’s barrier [52].
To gain a deeper understanding of PGCLC/EXMCLC segregation, we examine the difference in the gene expression value between the clusters. The volcano plots (Fig. 7A–B) show the comparison between two branched clusters at day 0.5 (day0.5-1 and day0.5-2) and day 1 (day1-1 and day1-2) of the PGCLC and EXMCLC lineages. Notably, we found that TFAP2A is expressed in the PGCLC progenitor cell population (day0.5-1). Previous studies have shown that this gene plays an important role in PGCLC differentiation [5, 6, 57], suggesting that the cell state graph generated from scEGOT has the ability to capture the progenitor cell population of PGCLC differentiation.
We also discovered that NKX1-2 is highly expressed in the PGCLC progenitor compared to the somatic cells. This gene is known to be expressed during mesoderm development [44]. However, its role in PGCLC differentiation has been unknown. This finding will be further investigated in later sections using other functions of scEGOT.
Conversely, genes that are highly expressed in the EXMCLC pathway may also play a critical role in PGCLC/EXMCLC segregation. The early mesoderm markers (MESP1, MESP2) and the later mesoderm markers (GATA6, FOXF1) are significantly upregulated on day0.5-2. This shows that these somatic genes might act as repressors of PGCLC during the segregation.
Furthermore, when comparing the clusters day1-1 and day1-2, the former, which is on the way to the PGCLC pathway, shows enrichment of key PGCLC specification genes, namely NANOG, TFAP2C, SOX17, KLF4, SOX15 and PRDM1 (Fig. 7B). This indicates that the PGCLC path is more specified on day 1.
In mouse PGC, genes such as stella and fragilis are actively expressed, while HOX genes such as Hoxa1, Hoxb1 are repressed [20, 29, 35, 54]. On the other hand, we did not observe any HOX gene expression in human PGCLC. Interestingly, in our induction system, HOXA1, HOXB2, HOXB3, HOXB5 and HOXB6 are highly expressed in the EXMCLC lineage.
From the above observation, the cell state graph well captures the trajectories of human PGCLC induction and can potentially find key genes by analyzing differentially expressed genes among the clusters. In particular, the cell state graph succeeded in tracing PGCLC/somatic cell segregation and identified key genes that may enhance or suppress PGCLC differentiation, such as NKX1-2, MESP1, MESP2, and GATA6.
3.5 Velocity analysis
We investigated the dynamics of the developmental process in human PGCLC induction using the cell velocity (11), and compared our method with the RNA velocity [1, 24]. RNA velocity is a standard method to estimate the velocity field in the gene expression space from scRNA-seq data, and it is also widely used to infer cell trajectories [33].
The cell velocity properly indicated that the lineage fate is determined, and the transfer to other lineages (between EDLC, EXMCLC, AMLC, and PGCLC) terminates when the flow becomes stationary (Fig. 8A). On the other hand, the RNA velocity showed that the flow passes through the AMLC, EDLC, and PGCLC populations without stopping, and finally moves towards the EXMCLC lineage on day 2. However, this behavior is not consistent with biological observations (Fig. 8B).
Furthermore, scEGOT can provide the velocities for all the genes, whereas the RNA velocity cannot calculate the velocities for genes without a sufficient amount of detection of unspliced RNA [1, 24] (Fig. 8C). We also emphasize that since the cell velocity is a data-driven method, it can be applied not only to scRNA-seq data but also to any other time-series single-cell data, such as scATAC-seq data. We apply it to time-series scATAC-seq data for innate immune cells from mouse-draining lymph nodes (Fig. 8D). The flow of NK cells and monocytes from inactive to active states can be observed. Overall, the cell velocity allows us to perform a comprehensive velocity analysis.
3.6 Space-time continuous gene expression analysis
To study the cell differentiation dynamics, such as the bifurcation time of PGCLC/somatic cell segregation, we constructed the time interpolations of cell populations and the time-continuous gene expression dynamics (animation) using the entropic displacement interpolation (12) (Fig. 9 and Movies S1–S4). The result clearly shows the temporal evolution of cell differentiation and gene expression patterns for the marker genes of PGCLC. In particular, as early as day 0.25 and more clearly at day 0.75, the clusters NKX1-2+ and NKX1-2− are separated, and the former cluster moves to the PGCLC population (TFAP2C+/PRDM1+) (Movies S1, S3).
On the other hand, although the cluster TFAP2A+, which has been reported as the progenitor cell population of the PGCLC in previous studies [5, 6, 57], shows a similar tendency to the cluster NKX1-2+ until day 1, TFAP2A is also highly expressed in the AMLC population after this day, implying that TFAP2A alone cannot identify the PGCLC progenitor cell population. This analysis suggests that NKX1-2 is one of the earliest marker genes of the PGCLC progenitor cell population, and its expression guides the cell population to the PGCLC pathway. It also shows that the bifurcation of PGCLC/somatic cell segregation occurs as early as day 0.25.
3.7 GRN analysis
We inferred the GRNs for PGCLC induction. We computed the GRN matrix A* using the cells in the clusters on the PGCLC pathway, which is determined by the cell state graph analysis. We then extracted a subnetwork of the GRN specified by the key genes of the human PGCLC specification (PRDM1, SOX17, and TFAP2C) and the two candidate PGCLC progenitor marker genes TFAP2A and NKX1-2 (Fig. 10). We pruned the regulatory edges with weights less than 0.02. The regularization parameter λ > 0 was automatically determined by cross-validation.
On days 0–0.5, we found that both TFAP2A and NKX1-2 can activate SOX17 which is essential for PGCLC differentiation [14, 18]. However, on days 0.5–1, NKX1-2 continues to be able to activate the other PGCLC regulators, whereas TFAP2A does not interact with them. This result suggests that NKX1-2 may play a more essential role than TFAP2A during human PGCLC differentiation.
3.8 Reconstruction of Waddington’s landscape
Finally, we reconstructed Waddington’s landscape to validate the differentiation ability of cells in the PGCLC induction system. We constructed the k-nearest neighbor graph of cells (k = 15) on the top two principal components to compute the graph Laplacian. Fig. 11 shows the reconstructed Waddington’s landscape together with the expression values of the genes NKX1-2, MESP1, and GATA6.
The landscape (Fig. 11A) shows that the iMeLCs, which are characterized by high levels of pluripotent genes (SOX2, NANOG, POU5F1) and relatively low levels of early mesoderm genes (TBXT, MIXL1, EOMES) (Fig. S1B), are located at the top of the potential. This suggests that these cells have high plasticity, which is consistent with biological knowledge.
In addition, viewing the gene expression values on the landscape allows us to understand the contribution of the genes to cell differentiation. For instance, it can be seen that NKX1-2 contributes to the early stage of PGCLC (Fig. 11B). In contrast, MESP1 and GATA6 contribute to the early stage of EXMCLC and EDLC differentiation, respectively (Figs. 11C–D). Importantly, we emphasize that their expression levels are complementary at days 0.5-1, suggesting that these genes may play the role of landscape pegs, forming a barrier between the PGCLC and the somatic cell pathways.
4 Conclusion
We have formulated an entropic Gaussian mixture optimal transport (EGOT) and proved the mathematical properties of EGOT. Based on EGOT, we have also proposed a novel single-cell trajectory inference framework (scEGOT) that infers cell differentiation pathways and gene expression dynamics from time-series scRNA-seq data. Furthermore, we applied scEGOT to the human PGCLC induction system and validated it biologically.
scEGOT provided the following outputs in the biological application. The cell state graph, which represents the cell clusters and their transport rates as a graph model, showed the trajectories between the cell populations. The cell velocity (11), generated by the entropic barycentric projection map (9), expressed the cell differentiation flow and, in the case of human PGCLC, described more detailed and consistent structures than the conventional method. The scEGOT interpolation (12), which can generate a pseudo cell population at any intermediate time, simulated the temporal evolution of gene expression dynamics at single-cell resolution and identified the bifurcation time. The GRN analysis, derived by solving the regression problem (15) between the gene expression and the cell velocity, contributed to the discovery of candidate upstream genes that induce cell differentiation. The reconstruction of Waddington’s landscape, based on the potential of the gradient flow associated with the cell velocity, summarized the pathways of cell differentiation and represented the cell potency of each cell type. We note that all the outputs represented in the PCA coordinates can be replaced by those in coordinates generated by other dimensionality reduction methods, such as uniform manifold approximation and projection (UMAP) [25], which provide biological interpretations from different aspects (Fig. S2).
Through the abovementioned scEGOT analysis, we have identified genes such as NKX1-2, MESP1, and GATA6 as potential regulators for human PGCLC differentiation or its divergence towards somatic lineages. The biological functions of these genes in human PGCLC are still unknown and are expected to be revealed through biological experiments.
Looking back on this paper, we have revisited an epigenetic world that was once envisioned by Waddington. He viewed the process of cell development as a canal and stated that a cell within a region denoting a particular state would be carried through the canals to a single specified cell, which is the corresponding steady state. He modeled such a time transition as a phase-space diagram of development (Fig. 3 in [47]), which is a similar concept to the cell state graph. He then modeled the epigenetic landscape, which is a well-known potential model, by simply describing the phase-space diagram in a three-dimensional picture (Fig. 4 in [47]). He further explained that the chemical states of the genes and their regulatory system underlie the landscape model (Fig. 5 in [47]).
Overall, scEGOT has realized the images envisioned by Waddington through a comprehensive mathematical framework called EGOT and has succeeded in capturing complex cell differentiation systems and their dynamics by analyzing the cell state graph, cell velocity, interpolation analysis, gene regulatory network, and reconstruction of the epigenetic landscape.
We believe that scEGOT is useful for a broader range of biological data because it is a data-driven method. In other words, none of the scEGOT settings is specific to scRNA-seq data. For example, scATAC-seq data containing open chromatin information could be analyzed with scEGOT to provide a deeper explanation of cell differentiation and lineage fate determination at epigenome resolution (Fig. 8D). Indeed, how epigenetic regulation, such as the regulation of cis-regulatory elements and transposon elements, operates during cell differentiation is a fundamental question, and our approach will be instructive in addressing it.
Conflicts of Interest
The authors declare no conflict of interest.
Data availability
The scRNA-seq data of the human PGCLC induction system are available in the Gene Expression Omnibus (GEO) under accession number GSE241287.
Code availability
The Python software scEGOT is available as an open-source package at https://github.com/yachimura-lab/scEGOT.
Supporting information
Proofs of propositions and theorems
Proof of Proposition 1
Since the discrete entropy H is a strongly concave function, the objective function of (2) is a strongly convex function. Therefore, the optimization problem (2) has a unique optimal solution. Also, we consider the following Lagrangian associated with the problem (2): where and are the Lagrange multipliers and (·,·) denotes the standard Euclidean inner product of vectors. The first-order condition of (S1) yields that Thus the unique solution w = wε of (2) is given by Putting and in (S2), then we obtain which is the desired conclusion.
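The argument above follows the standard Lagrangian computation for entropic optimal transport. The following is a sketch under assumed notation, which may differ from that of the main text: C_{kl} denotes the transport cost between the k-th and l-th Gaussian components, π^0 and π^1 denote the mixture weights at the two time points, a and b denote the Lagrange multipliers, and the entropy is taken as H(w) = −Σ_{k,l} w_{kl}(log w_{kl} − 1).

```latex
\begin{align*}
\mathcal{L}(w, a, b)
  &= \sum_{k,l} w_{kl} C_{kl}
     + \varepsilon \sum_{k,l} w_{kl}\,(\log w_{kl} - 1)
     - \bigl(a,\; w\mathbf{1} - \pi^{0}\bigr)
     - \bigl(b,\; w^{\top}\mathbf{1} - \pi^{1}\bigr), \\
\frac{\partial \mathcal{L}}{\partial w_{kl}}
  &= C_{kl} + \varepsilon \log w_{kl} - a_k - b_l = 0, \\
w^{\varepsilon}_{kl}
  &= \exp\!\Bigl(\frac{a_k}{\varepsilon}\Bigr)
     \exp\!\Bigl(-\frac{C_{kl}}{\varepsilon}\Bigr)
     \exp\!\Bigl(\frac{b_l}{\varepsilon}\Bigr)
   = u_k\, K_{kl}\, v_l ,
\end{align*}
```

Here u_k = exp(a_k/ε), v_l = exp(b_l/ε), and K_{kl} = exp(−C_{kl}/ε) are shorthand introduced for this sketch. Imposing the two marginal constraints on this product form determines u and v, for example by Sinkhorn iterations.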
Proof of Proposition 2
We consider the convergence of EGOT and the solution (S3) as ε → 0. By using an estimate similar to that for discrete entropic regularized optimal transport with Shannon entropy by Weed [2], we can show that the solution (S3) of EGOT converges exponentially to an optimal transport plan of dG with the maximal entropy as ε → 0. That is, there exists M > 0 independent of ε such that By using the estimate (S4), we have From Jensen’s inequality and the fact that the function −log t is convex, we obtain This leads to the following estimate which is the desired conclusion.
Proof of Theorem 3
As shown in [1, Proposition 4, p.944], the discrete optimal transport problem coincides with the continuous optimal transport problem (4). Therefore, by Proposition 2, we obtain the following estimate: Moreover, for any bounded continuous function ϕ ∈ Cb(ℝn × ℝn), By using (S4), we obtain Thus, the entropic transport plan γε converges to γ* in the sense of narrow convergence.
Proof of Proposition 4
Recall that the definition of the entropic barycentric projection map is where is the disintegration of the entropic transport plan γε with respect to the first marginal , i.e., . Then, by (8), the probability density of is given by where denotes Thus, the entropic barycentric projection map (S5) can be computed as follows: which is the desired conclusion.
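To make the computation concrete, the following Python sketch evaluates a barycentric projection map for Gaussian mixtures. It assumes that the entropic transport plan is a weighted sum, with weights w[k, l], of the plans induced by the affine optimal transport maps between Gaussian components, as in the Gaussian mixture optimal transport of [1]. All function names and toy parameters are hypothetical and are not taken from the scEGOT package.

```python
import numpy as np


def spd_sqrt(S):
    """Symmetric square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T


def gaussian_pdf(x, m, S):
    """Density of N(m, S) evaluated at the point x."""
    d = x - m
    norm = np.sqrt((2.0 * np.pi) ** len(m) * np.linalg.det(S))
    return float(np.exp(-0.5 * d @ np.linalg.solve(S, d)) / norm)


def gaussian_ot_map(x, m0, S0, m1, S1):
    """Optimal transport map from N(m0, S0) to N(m1, S1): x -> m1 + A (x - m0)."""
    S0_half = spd_sqrt(S0)
    S0_half_inv = np.linalg.inv(S0_half)
    A = S0_half_inv @ spd_sqrt(S0_half @ S1 @ S0_half) @ S0_half_inv
    return m1 + A @ (x - m0)


def barycentric_projection(x, w, pi0, means0, covs0, means1, covs1):
    """Conditional mean of the target point given the source point x."""
    dens = np.array([gaussian_pdf(x, m, S) for m, S in zip(means0, covs0)])
    mixture_density = float(pi0 @ dens)  # density of the source mixture at x
    out = np.zeros(len(x))
    for k in range(len(means0)):
        for l in range(len(means1)):
            out += w[k, l] * dens[k] * gaussian_ot_map(
                x, means0[k], covs0[k], means1[l], covs1[l])
    return out / mixture_density


# Hypothetical toy example: two source and two target components in 2D.
pi0 = np.array([0.5, 0.5])
pi1 = np.array([0.7, 0.3])
w = np.outer(pi0, pi1)  # any coupling with marginals pi0 and pi1 could be used
means0 = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
covs0 = [np.eye(2), np.eye(2)]
means1 = [np.array([0.0, 2.0]), np.array([3.0, 2.0])]
covs1 = [np.eye(2), 0.5 * np.eye(2)]

x = np.array([0.5, -0.2])
T_x = barycentric_projection(x, w, pi0, means0, covs0, means1, covs1)
print(T_x - x)  # displacement assigned to the point x
```

Because the rows of w sum to the source weights, the coefficients of the component maps sum to one at every x, so the output is a convex combination of the component-wise transported points.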
Proof of Theorem 5
By the results of [1, Proposition 5 and Corollary 2, pp.946–948], the probability distribution is the geodesic with the maximum entropy between and in the space G2n(∞) equipped with the distance MW2, where w* is the solution of (6). Also, the proof that the entropic displacement interpolation converges to narrowly can be shown in the same way as Theorem 3. Indeed, for any bounded continuous function ϕ ∈ Cb(ℝn), Here, we used (S4).
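As an illustration of displacement interpolation between Gaussian mixtures, the following Python sketch builds the interpolated mixture at a time t in [0, 1] from a coupling w of the mixture weights and draws samples from it. It assumes that each pair of components is interpolated along the Gaussian Wasserstein geodesic, following the construction in [1]. The function names and toy parameters are hypothetical and are not taken from the scEGOT package.

```python
import numpy as np


def spd_sqrt(S):
    """Symmetric square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T


def interpolate_gaussian(m0, S0, m1, S1, t):
    """Mean and covariance at time t on the geodesic from N(m0, S0) to N(m1, S1)."""
    S0_half = spd_sqrt(S0)
    S0_half_inv = np.linalg.inv(S0_half)
    A = S0_half_inv @ spd_sqrt(S0_half @ S1 @ S0_half) @ S0_half_inv
    B = (1.0 - t) * np.eye(len(m0)) + t * A
    S_t = B @ S0 @ B.T
    return (1.0 - t) * m0 + t * m1, 0.5 * (S_t + S_t.T)  # symmetrize rounding error


def interpolate_mixture(w, means0, covs0, means1, covs1, t):
    """Mixture at time t with one Gaussian component per pair (k, l)."""
    weights, means, covs = [], [], []
    for k in range(len(means0)):
        for l in range(len(means1)):
            m_t, S_t = interpolate_gaussian(
                means0[k], covs0[k], means1[l], covs1[l], t)
            weights.append(w[k, l])
            means.append(m_t)
            covs.append(S_t)
    return np.array(weights), means, covs


def sample_mixture(weights, means, covs, n_samples, rng):
    """Draw n_samples points from the Gaussian mixture."""
    labels = rng.choice(len(weights), size=n_samples, p=weights / weights.sum())
    return np.stack([rng.multivariate_normal(means[i], covs[i]) for i in labels])


# Hypothetical toy example: a 2 x 2 coupling between components in 2D.
w = np.array([[0.4, 0.1], [0.1, 0.4]])
means0 = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
covs0 = [np.array([[2.0, 0.5], [0.5, 1.0]]), np.eye(2)]
means1 = [np.array([0.0, 2.0]), np.array([3.0, 2.0])]
covs1 = [np.array([[1.0, -0.3], [-0.3, 1.5]]), 0.5 * np.eye(2)]

rng = np.random.default_rng(0)
weights, means_t, covs_t = interpolate_mixture(w, means0, covs0, means1, covs1, t=0.5)
samples = sample_mixture(weights, means_t, covs_t, n_samples=200, rng=rng)
print(samples.shape)
```

Sampling from the interpolated mixture in this way corresponds to generating a pseudo cell population at an intermediate time point.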
Proof of Theorem 6
Theorem 6 can be shown by direct calculation as follows: By the definition of , we have Then, we get Also, in the same manner as above, by the definition of , we have Then, we obtain By substituting (S7) into (S6), we have where we set Ck as the final expression. Therefore, we obtain Also, let us recall that the optimal transport map between and is given by where Ak,l denotes The direct computation implies that the following equality holds: By (S8) and (S9), we obtain which is the desired conclusion.
Acknowledgements
This work was supported by the World Premier International Research Center Initiative (WPI) and AMED-CREST Grant (JP19gm1310002h, Y.H.). This work was partially supported by JSPS Grant-in-Aid for Early-Career Scientists (21K13822, T.Y.), JST PRESTO (JPMJPR2021, Y.I.), Grants-in-Aid for Specially Promoted Research from JSPS (17H06098, 22H04920, M.S.), Grants from the Open Philanthropy Project (2018-193685, M.S.), JSPS Grant-in-Aid for Transformative Research Areas (A) (22H05107, Y.H.), and JST MIRAI Program Grant (22682401, Y.H.).
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].