Mathematical modeling with single-cell sequencing data

Single-cell sequencing technologies have revolutionized molecular and cellular biology and stimulated the development of computational tools to analyze the data generated from these technology platforms. However, despite the recent explosion of computational analysis tools, relatively few mathematical models have been developed to utilize these data. Here we compare and contrast two approaches for building mathematical models of cell state-transitions with single-cell RNA-sequencing data with hematopoeisis as a model system; by solving partial differential equations on a graph representing discrete cell state relationships, and by solving the equations on a continuous cell state-space. We demonstrate how to calibrate model parameters from single or multiple time-point single-cell sequencing data, and examine the effects of data processing algorithms on the model calibration and predictions. As an application of our approach, we demonstrate how the calibrated models may be used to mathematically perturb normal hematopoeisis to simulate, predict, and study the emergence of novel cell types during the pathogenesis of acute myeloid leukemia. The mathematical modeling framework we present is general and can be applied to study cell state-transitions in any single-cell genome sequencing dataset. Author summary Here we compare and contrast graph- and continuum-based approaches for constructing mathematical models of cell state-transitions using single-cell RNA-sequencing data. Using two publicly available datasets, we demonstrate how to calibrate mathematical models of hematopoeisis and how to use the models to predict dynamics of acute myeloid leukemia pathogenesis by mathematically perturbing the process of cellular proliferation and differentiation. We apply these modeling approaches to study the effects of perturbing individual or sets of genes in subsets of cells, or by modeling the dynamics of cell state-transitions directly in a reduced dimensional space. We examine the effects of different graph abstraction and trajectory inference algorithms on calibrating the models and the subsequent model predictions. We conclude that both the graph- and continuum-based modeling approaches can be equally well calibrated to data and discuss situations in which one method may be preferable over the other. This work presents a general mathematical modeling framework, applicable to any single-cell sequencing dataset where cell state-transitions are of interest.

Introduction 1 A significant limitation of these approaches is if the graph structure and underlying 26 relationships between the cells is unknown. As shown in a comprehensive review of 27 trajectory inference methods by Saelens et al, most TI algorithms have difficulty 28 inferring even simple graphs which may include cycles or disconnected subgraphs [4]. 29 We suggest that a hypothesis-driven and mathematical approach to the analysis of 30 scRNA-seq data to study cell state transitions is warranted. 31 Moreover, single cell genomic sequencing suffers from a number of challenges in 32 analysis. Beyond the several choices to be made for even simple analyses such as 33 clustering or visualization, the data may be sparse and incomplete. Gene "drop outs" 34 and background signal (noise) can complicate differential expression and clustering 35 analyses. For this reason, analysis of these data have remained fairly superficial despite 36 the wealth of information contained in these high-dimensional datasets. Moreover, 37 results obtained from analysis of single cell sequencing datasets may be very sensitive to 38 choices in the method of analysis and parameters in algorithms. To date, single-cell 39 sequencing data have not been effectively leveraged as inputs into mathematical models. 40 Here we compare two approaches of mathematical modeling cell state-transitions 41 with scRNA-seq data. Building on our prior work [12], we model cell differentiation as 42 as a continuous process with reaction-diffusion-advection partial differential equations 43 (PDE) solved on: 1) an abstracted graph and 2) a continuous space. We compare and 44 contrast these two approaches with hematopoeisis as the model system. 45 This manuscript is structured as follows: first we present the PDE model on a graph 46 and in continuous space, then we apply the model to two datasets. We examine the 47 impact of various graph construction and trajectory inference methods on the geometry 48 of the cell state space, and solve the model on these geometries. We then use the model 49 the study the effects of perturbing 1) the graph structure 2) expression of select subsets 50 of genes 3) and cell state transition dynamics by perturbing flow on the graph or by 51 modifying the dynamics in the continuous space. We predict novel dynamics of leukemia 52 July 21, 2019 2/32 pathogenesis by perturbing normal hematopoeisis and conclude with a comparative 53 analysis of our approach and description of future directions for mathematical modeling 54 with single cell genomic sequencing data. A summary of our workflow is shown in Fig 1.  Step-by-step illustration of our modeling process. 1. Processed single-cell sequencing expression matrices are represented in a reduced dimension space through one of many dimension reduction techniques such as diffusion mapping, principle component analysis, t-SNE, or UMAP. 2. Temporal events (pseudotime trajectories) and cell clusters are inferred to construct the graph or continuum geometry of cell states. 3. From these representations, models are calibrated to the transport of cell distribution along the graph or in the cell state-space. 4. The models can then be utilized to perturb genes and cell states. The calibrated models predict cell state-transitions and the emergence of novel cell states. 56 Modeling cell state-transitions in a continuous cell state-space 57 In this section we develop PDE models that describes the cell dynamics in the 58 continuous phenotype space identified by dimension reduction techniques. For a given 59 single cell genomic sequencing data {g i } N i=1 , where N is the number of cell samples and 60 g i ∈ R G is a G-dimensional vector of gene expression level of the i-th cell, the 61 dimension reduction method can be written as an operator P : R G → Γ ⊂ R n where the 62 reduced dimensional space is truncated at the n-th dimension. Various dimension 63 reduction techniques exist to potentially employ for the mapping P, including principal 64 component analysis, diffusion mapping, and t-distributed stochastic neighbor PDE model of cell state-transitions on a reduced component space 71 We first develop a cell state model that describes the dynamics of cell distribution 72 u(t, θ) on the reduced component space Γ, where θ ∈ Γ is the variable that represents 73 continuous cell state. Three highly distinctive dynamic regimes of the cell states are 74 considered, that is, directed cell transition, random phenotypic instability, and 75 birth-death process. Such model can be written as an advection-diffusion-reaction PDE 76 that governs the cell distribution u(t, θ) as 77 ∂ t u(t, θ) = −∇ · (V (t, θ)u(t, θ)) + R(θ, u(t, θ)) + ∇ · (D(θ)∇u(t, θ)) ,

Materials and methods
with zero Dirichlet boundary condition. The three terms in our equation that involve 78 parameterized functions V , R, and D, represent cell differentiation, population growth, 79 and phenotypic instability, respectively.

80
Let us first describe the advection term V ∈ R n that represents directed cell 81 differentiation, where we divide it into two parts as V = v 1 + v 2 . The first part models 82 the attractor cell states of homeostasis. Assuming that the magnitude of phenotypic 83 instability is with a magnitude ν, that is, D(θ) = ν, one can compute the advection where U (θ) is the exponent of the homeostasis distribution in the exponential form 86 u s (θ) = e −U (θ) . Here, u s (θ) can be regarded as the cell landscape that the 87 hematopoiesis system desires to maintain. There are multiple methods to compute the 88 cell landscape, so called quasi-potential [14], that focuses on relative stability of multiple 89 attractors and models cell differentiation as transition between the attractor states.

90
Here, we compute the cell landscape empirically by assuming that the obtained 91 single-cell data is a representative subset of the entire hematopoiesis system, and by 92 using density approximation methods. In particular, we use kernel density 93 estimation [15] from the projected single-cell data θ i ∈ Γ. The second advection term 94 models an active cell differentiation. In particular, we consider the following form that is parameterized by the proliferation rate r(θ) and cell maturation [16]. c(θ) ∈ R n 96 represents the direction and magnitude of differentiation on the phenotype space, and 97 a(θ) is the proportion of cells that remains in cell type θ, while 1 − a(θ) cells further 98 differentiate in to matured states. This is can be counted from symmetric and 99 asymmetric stem cell division. In addition, we assume a signal parameter s v (t) that 100 controls the active differentiation term.

101
The reaction term represents the growth rate that consist of proliferation and 102 apoptosis. We consider the logistic growth term as following where r(θ) is the proliferation rate and d(θ) is the apoptosis term assuming a logistic 104 growth as d(θ, u) = min{ u us(θ) ,d}, whered models the maximum magnitude of 105 apoptosis rate. The second-order diffusion term represents the instability on the 106 phenotypic landscape of the cells that should be taken account into when modeling the 107 macroscopic cell density. We simply consider a constant term D(θ) = ν, and the value is 108 estimated as ν ≈ ε 2 /4δt with ε as a single step size in the phenotype space and δt as Although the continuum-based multi-dimensional model provides a framework to study 112 cell states, it is not always straightforward to map back locations in the space to novel 113 or otherwise unknown cell types. Moreover, a central feature of contemporary analysis 114 of scRNA-seq data is clustering and inferring relationships between clusters of known 115 cell types [4]. Therefore, we develop a model that can describe cell state-transition 116 dynamics on a graph that represents relationships between known cell types identified 117 with clusters, extended from our previous work in [12]. An immediate advantage of this 118 approach is that it is convenient to employ biological insights from well-known classical 119 discrete cell states.

120
The continuum differentiation cell states are assumed to be on the graph obtained 121 from the scRNA-seq data, for instance, using partition-based graph abstraction 122 algorithm [17]. We project the graph on the reduced component space, and denote the 123 nodes as {v k } nv k=1 and the edges as e ij connecting in the direction from the i-th to the 124 j-th node. For convenience of notation, the edges are also denoted as {e k } ne k=1 with the 125 index mapping I : J → {1, ..., n e } on the set of nontrivial edges (i, j) ∈ J . The end 126 points in the direction of cell transition are {a k } ne k=1 and {b k } ne k=1 , where The model follows the dynamics of the cell distribution on the graph, u(x, t), where 129 x ∈ e k is a one dimensional variable that parameterizes the differentiation continuum 130 space location. We annotate the cell distribution on each edge e k as u k (x, t) such that 131 u(x, t) = {u k (x, t)} ne k=1 , and model the cell density by an advection-diffusion-reaction 132 equation [18] as The three terms are similarly modeled as the multi-dimensional model in Eq (1), 134 representing cell differentiation, population growth, and phenotypic instability. To 135 summarize once more, the advection coefficient V k (x) models the cell differentiation and 136 the transition between the nodes, that is, different cell types. We model the advection 137 term in two parts as in the reduced component space model, 138 and compute them as Here, e −U k (x) is the homeostasis cell distribution on the graph, ν is the magnitude of 140 phenotypic instability, r k (x) is the proliferation rate, a k (x) is the self-renewal rate, and 141 s v (t) is the signal parameter. The transition rate per unit time (e.g., day −1 ) or 142 pseudotime can be taken into consideration in c k (x). Cell proliferation and apoptosis 143 can be modeled by the reaction coefficient R k (x) as Finally, the diffusion term that represents phenotypic fluctuation is taken as D k (x) = ν. 145 The magnitude of phenotypic fluctuation of the cell density is assumed as a random 146 process subject to Brownian motion with magnitude σ, so that the diffusion term 147 becomes ν = σ 2 /2 [18]. where respectively. In addition, we compute the number of cells in specific cell type by 163 assigning weight w k that corresponds to cells in the k-th cluster as k w k (θ) = 1. In the graph model, we assign the cell states along the edge to be the 165 cell type of the closest node, so that we can compute the number of the k-th node cell 166 type as Although we can understand the continuum of cell states by mapping cells in = corr(f , g j ), where f represents the vector of function evaluation at each single-cell data point θ i , An alternate quantity to examine is the 177 genes that are related to a certain direction in the reduced component space. For instance, the correlation between the j-th gene and the k-th reduced component and to a certain vector v = {v k } n k=1 can be computed as respectively. Regarding Eq (13) as global quantities, we can also compute the local 181 correlation on the subdomain of the reduced space Ω d by collecting the cell indices that 182 lie within the subdomain 184

185
In this section, we employ the framework developed in the previous section to the 186 mouse hematopoiesis cell data from Nestorowa et al [19] and Paul et al [20]. We obtain 187 the graph and space models of hematopoiesis cell state and focus on comparing the 188 strengths and weaknesses of the two models.

189
Comparing a continuous cell state-space with a cell state graph 190 The hematopoiesis single cell data from [19] and [20]  the cells that are the most stemlike in Paul data are common myeloid progenitors, that 204 is more matured than the ones in Nestorowa data, that includes the long-term and 205 short-term HSCs. In addition to the single-cell data, the figure also presents the 206 abstracted graphs obtained using partition-based graph abstraction [17]. Further  The homeostatic cell distribution u s on the reduced dimensional space is computed 211 by kernel density approximation [15,21]. The computed cell landscapes viewed in lineages to different cell types. Accordingly, counting the number of cells in each discrete 227 cell types is more straightforward, for instance, by integrating the cell distribution along 228 the edges half way as in Eq (11). Although the space model has ambiguity on classifying 229 the cells into discrete cell types, the number of cells in each discrete cell states can 230 computed by assigning weights to integrate as in Eq (10). Moreover, we emphasize that 231 the advantage of clear cell states in the graph model is also its limitation at the same 232 time, since it restricts the approach to only study the presumed cell types and lineages. 233 The advantage of space model is its potential of exploring novel cell states that 234 deviates from the known cell types. While the graph model cannot explore the cell states 235 that is not already included in the graph structure, the space model can immediately 236 study the abnormal trajectories and emergence of cells at any space location. We will 237 show later in our simulation that the hypothesis of genetic and epigenetic alteration can 238 be studied directly in the space model, without projecting it once more on the graph 239 structure. Moreover, the space model is more sensitive to genetic and epigenetic 240 variation than the graph model, although when the variation is large and a considerably 241 distinct cell type arises, the graph model can append another cluster node.

242
In the following sections, we consider mainly two application problems, namely, 243 normal hematopoiesis and abnormal hematopoiesis differentiation, resulting in myeloid 244 leukemia as example applications of our modeling approach.

245
Calibrating the mathematical models to normal hematopoiesis 246 We demonstrate that normal hematopoiesis can be visualized by both models on the using Nestorowa data [19] and Paul data [20].  The initial stem cells differentiate to progenitors and more matured cell types and recover the entire cell landscape. Numbers of cells in each type/cluster using the space model and the graph model (B) are successfully calibrated to the observed data so that at t = 100 each model predicts the correct cell ratios to within ±5% (C,D). We remark that the full capacity using Paul data is reached more rapidly compared to the Nestorowa data, due to its shorter distance in the reduced space.

258
The advantage of graph model is apparent that we can observe distinct cell states as 259 a mass of cells shifting from a node to distinct edges toward different cell types. For Ery lineage can be clearly separated in the graph models, while those can be ambiguous 262 in the two-dimensional distribution. Still, we can compute the number of cells in each 263 discrete cell types in both models as shown in Fig 3B. We observe that the number of 264 cells reaches the full capacity at later times around t = 100, with the intended ratio of 265 cell numbers in each discrete cell types approximating the given data [19] in Fig 3C. We 266 comment that the recovery is more rapid for larger values of ν and larger number of 267 initial stem cells ρ(0) (see S8 Fig). multiple hematopoietic lineages at various stages [24][25][26]. Most notable in such 283 leukemia pathogenesis and progression is the increased in abnormal myeloid progenitors, 284 which has an MEP-like immunophenotype and a CMP-like differentiation potential [25]. 285 Experimental studies [27] show that such MEP attains a predominant increase in 286 pre-megakaryocyte/erythroid (Pre-Meg/Ery) population (ranging from 5 to 12 fold) 287 accompanied by impaired erythroid lineage differentiation [28]. This refined phenotypic 288 Pre-Meg/Ery population consists partly of the CMP and MEP populations which are 289 identified using conventional markers [19,29].  (4) and (7) 294 that controls the local cell capacity. In addition to the over-proliferation, another aspect 295 of the leukemia pathogenesis of our interest is the impaired differentiation of erythroid 296 lineage differentiation, where it can be modeled by reducing the cell differentiation V (θ) 297 in Eq (5) and V k (x) in Eq (1) toward Ery (See S1 Appendix and [12] for the details).

298
The corresponding results are shown in Fig 4, where we modify the model to leukemia 299 progression at t = 10. The cell distribution changes from the normal cell landscape at    genes that are reported to be related to leukemia stem cells [30,31], although we 325 emphasize that these genes serve simply as examples, and are not intended to model the 326 precise biological process of AML pathogenesis. The j-th gene expression level of i cell, 327 g i j , is modified as g i j = 2 γj g i j , where γ j is the log 2 -fold change compared to the normal 328 cells, that ranges from -3.5 to 2.7. The full list of altered genes and magnitudes γ j are 329 shown in S1 Table. In addition, we consider the extreme case of gene alteration as the 330 maximum level log 2 ( g i j + 1) = 16 for up-regulated genes and log 2 ( g i j + 1) = 0 for LGALS3, LY86, and ANXA5. We show these subset of genes simply to illustrate the process. The normal single-cell data (blue circle) and modified gene expression (red square) are shown together, with the case of extreme levels (purple diamond).

423
We have shown how to construct and calibrate mathematical models of cell 424 state-transitions using single-cell sequencing data. We compare two approaches: solving 425 equations on graphs and solving equations on a continuous cell state-space. Each 426 approach has strengths and limitations. Selecting an approach for a given application or 427 dataset will depend on the type of biological data and the nature of the scientific 428 question.

429
When the modeling application and quantity of interest includes well-known cell 430 lineages and relation between discrete cell states, the graph model is more appropriate 431 due to its ability of distinguishing distinct cell lineages more clearly compared to the 432 continuum space model. Dynamics of cell number in discrete cell types, alteration of 433 proliferation and apoptosis in particular cell type, differentiation block, and emergence 434 of intermediate cell states can be straightforwardly quantified and studied. However, to 435 explore the cell states beyond the known cell lineages, the continuum space model is 436 more advantageous since it includes all pathological and intermediate cell states, rather 437 than confining the model into presumed cell lineages. Moreover, the continuum model 438 can incorporate a relatively small genetic and epigenetic alteration that the graph 439 abstraction may not recognize, and study abnormal trajectories that yield 440 unconventional cell states. 441 We selected genes to perturb to simulate AML based on genes known to be 442 associated with leukemia pathogenesis. We do not intend for this to be an accurate 443 model of the biological process, rather, as an illustration of how one may select sets of 444 genes and perturb them in a prescribed fashion in order to study the effect on cell 445 state-transition dynamics. This approach assumes that AML pathogenesis originates 446 from changes in gene expression in specific cell subsets, which is limited by our 447 identification of these genes based on literature. We acknowledge this is a limitation of 448 the modeling approach, although we also note that our model predictions are consistent 449 with known features of AML progression. Of particular note are works which use modeling and simulation to generate synthetic in 454 silico gene expression datasets [32]. These important approaches to mechanism based 455 mathematical modeling may also be used to study and predict the effects of 456 perturbations on cell state distributions. They may also be used as computational 457 controls to benchmark analysis tools and potentially to benchmark and compare 458 mathematical models, although using a model to benchmark other models can lead to 459 consistent but incorrect circular reasoning and caution is warranted. Another example 460 is Ferrall-Fairbanks and Papalexi et al, who use mathematical analysis to generate novel 461 quantifications of cell heterogeneity in cancer or immune cell subsets 462 respectively [33,34]. These methods may be used to map and interpret novel cell states 463 predicted by mathematical models or similarly as a method to interpret model-predicted 464 changes in cell heterogeneity following a perturbation. 465 Schiebinger et al compute and predict differentiation trajectories in cell development 466 using optimal transport (OT) [35]. This approach considers the optimal transport of 467 cells as a mass flowing along differentiation trajectories, and is conceptually the most 468 similar to our approach. As presented, Schiebinger et al do not use the OT framework 469 to examine perturbations of cell states or genes along the differentiation trajectory, 470 although this is possible with an OT model. Setty et al present a method to compute 471 cell fate probabilities [36], which may also be achieved by inferring cell state-transition 472 dynamics with lineage trees [37]. Fischer et al have demonstrated a method for inferring 473 population dynamics from single-cell sequencing data [38], Sharma et al use longitudinal 474 sequencing to study drug-induced infidelity in the stem cell hierarchy [39], and 475 Karaayvaz et al show how to use single-cell sequencing to examine drug resistance in 476 breast cancer [40]. These approaches and analysis methods may be used to inform and 477 potentially calibrate mathematical models of cell population dynamics or response to 478 treatment-induced perturbations.

479
Recently, vector fields derived from RNA velocity [41] have been used to infer 480 potential energy or fitness landscapes for cell state-transitions [42]. These approaches 481 may be used to inform the computational domain for mathematical models that we 482 present here, however, an important limitation of the RNA-velocity approach is 483 extrapolation of the vector field outside of the data range. This underscores the need for 484 hypothesis-based and model-guided approaches to inform the shape of these fields. This 485 limitation also applies to the rapidly growing field of deep learning approaches [43] to 486 analyze single-cell sequencing data, namely, whether the learning algorithm can 487 effectively make predictions to datasets which are not sufficiently similar to those upon 488 which it has been trained. We believe that the future likely involves a merger of 489 mathematical modeling with machine learning, in which mathematical models are used 490 to inform learning approaches and impute sparse data as has been recently shown by 491 Gaw and Rockne et al [44,45]. may not be directly evident from the data; for example, extrapolation of RNA velocity 502 fields beyond the dataset boundaries or to interpret and predict novel cell types which 503 may not otherwise be clearly identified with known canonical cell state markers.

504
Another advantage of our approach is the use of pseudo-time analysis of data collected 505 at a single timepoint to calibrate the models, however, the models can also be calibrated 506 directly to time-sequential single-cell datasets, which we expect to become more 507 commonly available as single-cell sequencing continues to be used as a tool to study cell 508 dynamics.

509
The disadvantages and limitations include: the potential for misleading or incorrect 510 inference due to poor data quality including drop-outs, small non-representative samples 511 of large heterogeneous populations, no physical or micro-environmental context, no 512 direct or physical interactions between cells, and the possibility of model predictions to 513 be sensitive to methods of dimension reduction, graph abstraction, state-space 514 construction, and potentially sequencing platform. Sensitivity of the modeling to 515 experimental and computational methods may be directly studied and mitigated as we 516 have shown in this work, however this remains potentially a significant source of 517 uncertainty and variability in the modeling calibration and predictions.

518
In terms of computational cost, the graph model is more efficient since it is a 519 multiple of one-dimensional cost, while the cost of implementing the space model 520 increases exponentially as the dimension of reduced space increases. In our simulation, 521 the continuum model runs approximately 10 times longer than the graph model with 8 522 nodes, therefore, the space approach will be reasonable when the reduced component 523 can be truncated at two-to three-dimension, unless the numerical method is carefully 524 implemented.

525
Future work and applications 526 Future applications of this approach is to explore hypothesis in the resolution of 527 single-cell genomics and study altered and novel cell states with genetic and epigenetic 528 alterations in various biological systems and pathogenesis [46]. Incorporating effects of 529 external perturbations such as therapy is our interest as well. Moreover, comparing the 530 model prediction to sampling/sequencing of perturbed biological system is our most 531 anticipated future work, for instance, we look forward to include scRNA-seq data of 532 leukemic progenitor cells.

533
There are opportunities for further enhancements in our model in improving the 534 model of cell landscape dynamics to accurately estimate cell transition pathways in the 535 reduced component space, for instance, minimum action paths [6] and bifurcation [7,47]. 536 The model can be improved by obtaining parameter functions or mappings of biological 537 quantities directly from single cell sequencing data, for example, more precisely infer the 538 proliferation rate function. Also, developing methodologies to obtain reduced 539 component space that captures desired characteristic of cell states [48] will help us 540 explore our approach for other biological settings where cell states are less clearly 541 characterized. Moreover, we propose to develop quantities, such as index of critical state 542 transitions [47,49], in the phenotype space that could be used to predict forthcoming 543 major alterations in development and diseases. We also expect to be able to infer the 544 potential landscape directly from the RNA velocity vector field [41,42].

546
In summary, despite the explosion of computational tools to analyze single-cell 547 sequencing data, there have been relatively few mathematical models developed which 548 utilize this data. Here we begin to explore the possibilities-and limitations-of 549 dynamical modeling with single-cell data. We hope this work paves the way for 550 mathematical models to be developed to guide the interpretation of these complicated 551 datasets as they begin to be collected after biological perturbations (ex. cancer, 552 treatment, altered developmental processes), sequentially over time, or sampled spatially 553 within biological tissues.

554
Supporting information 555 S1 assuming that the overall proliferation of intermediate cell states change gradually. In the multi-dimensional model, we compute the interpolation based on local means as where we takeθ = 0.04. The self-renewal rate functions a k (x) and a(θ) are computed 562 similarly. See S1 Fig for r(θ) and a(θ) computed for Nestorowa data.

563
To compute the multi-dimensional function on the continuum space from the single-cell data, we employ the kernel density method [15,21], that is a non-parametric way to estimate the density function based on a finite data sample. Using the single-cell samples in the reduced component space, {θ i } N i=1 , the method approximates the density function as where K is the kernel smoothing function that we take it as a Gaussian function and h 564 is the bandwidth. The optimal bandwidth to estimate normal densities can be 565 computed by (4σ 5 /3N ) 1/5 , whereσ is the standard deviation and N is the sample size, 566 July 21, 2019 20/32 and the optimal bandwidth for our data is computed as 0.0383 to 0.0456, however in 567 our simulation, we choose a slightly smaller value, h = 0.03, to reveal more features of 568 multiple modes. Fig 2(a,b) shows the results computing u s (θ) using Nestorowa and 569 Paul data, and S1 Fig(a) shows the corresponding v 1 (θ).

570
In the diffusion term, we explore the parameter ν so that the phenotypic instability 571 does not dominate the cell maturation. We compute the parameter in the range of 572 ν ≤ (L/T d ) 2 /4, where the distance in the diffusion space is L = 1 and the time that 573 HSC differentiates to the progenitors is T d = 5 ∼ 30 (day), that is, ν ≤ 0.0027 ∼ 0.01, 574 and we consider ν = 0.001. Quantifying the local phenotypic instability in the reduced 575 component space, and justifying this term is our future work.

576
To compute the reduced component space using dimension reduction approaches, we employ diffusion mapping. See [13,55] for the detail of the algorithm. We take the cosine distance, k(x i , x j ) = 1 − corr(x i , x j ) for the Nestorowa data and the gaussian distance for Paul data with σ = 50. From L(i, j) = k(x i , x j ), the diffusion mapping use parameter α to tune the influence of density of the data points as where D (α) (i, i) = j L (α) (i, j), and we choose α = 0.5. From the eigen-decomposition 577 of M φ = λφ and ordered eigenvalues 1 = λ 0 ≤ λ 1 ≤ λ 2 ≤ · · · , the corresponding right 578 eigenvectors, φ 1 , φ 2 , · · · are the diffusion components. We truncate the reduced space 579 at the second diffusion component, where the eigenvalues are λ 1 = 0.1039, λ 2 = 0.0326, 580 λ 3 = 0.0167, λ 4 = 0.0135 for Nestorowa data, and λ 1 = 2.4653e-03, λ 2 = 5.8338e-04, For the pseudotime inference, we use the algorithm developed in [56]. The diffusion distance between two cells are computed as  presented. We remark that, v 2 corresponds to the cell differentiation along the edges in 597 the graph model.