## Abstract

Temporal dynamics of gene expression are informative of changes associated with disease development and evolution. Given the complexity of high-dimensional temporal datasets, an analytical framework guided by a robust theory is needed to interpret time-sequential changes and to predict system dynamics. Herein, we use acute myeloid leukemia as a proof-of-principle to model gene expression dynamics in a transcriptome state-space constructed from time-sequential RNA-sequencing data. We describe the construction of a state-transition model to identify state-transition critical points that accurately predict leukemia development. We show that an analytical approach based on state-transition critical points identifies step-wise transcriptomic perturbations driving leukemia progression. Furthermore, the trajectories of genes and the geometry of the transcriptome state-space provide biologically relevant gene expression signals that are not synchronized in time, and allow quantification of the contribution of genes to leukemia development. Therefore, our state-transition model can synthesize information and identify critical points to guide interpretation of transcriptome trajectories and predict disease development.

**Graphical Abstract**

**In brief** The theory of state-transition is applied to acute myeloid leukemia (AML) to model transcriptome dynamics and trajectories in a state-space, and is used to identify critical points corresponding to critical transcriptomic perturbations that predict leukemia development.

**Highlights**

Leukemia transcriptome dynamics are modeled as movement in transcriptome state-space

State-transition model and critical points accurately predict leukemia development

Critical point-based approach identifies step-wise transcriptome events in leukemia

State-based geometric analysis provides quantification of leukemogenic contribution

## Introduction

A complex disease such as cancer evolves as a dynamic system wherein multilayer interconnected inputs collectively produce a disease state corresponding to a clinical phenotype. Identification of genomic alterations, including gene mutations, epigenetic changes and gene expression profiles, by high-throughput sequencing assays is becoming a part of the routine clinical assessment of cancer patients at diagnosis and subsequent follow-ups. With the potential for tens of thousands of statistically significant genomic alterations to be detected at any given time point, it is challenging to quantitatively determine which changes are biologically relevant. In addition, these profiles continuously change over time as a result of malignant cell transformation, cancer evolution, and ultimately, response to treatment. The analysis and interpretation of such a large volume of data comprising dynamic changes with regard to biological and clinical evolution pose a significant challenge. Although methods are available to analyze time-series genomic data (Bar-Joseph et al., 2012; Sanavia et al., 2015; Spies et al., 2017), to date it has been difficult to meaningfully interpret global gene expression changes over time and use them to predict system dynamics. To achieve the ultimate goal of predicting cancer system dynamics, it is of critical biological and clinical importance to develop a framework guided by a robust theory by which to analyze temporal genomic data.

A central challenge for interpreting time-sequential changes in gene expression is the identification and prioritization of the genes, pathways, and sampling timepoints that are necessary for understanding changes of interest (e.g., transition from health to cancer, cancer evolution, response to treatments). In theory, if we could continuously monitor the state of health of an individual through longitudinal data collection, we would observe critical inflection points at which the system transitions from one state to another (e.g., from health to cancer, from localized to metastatic disease, from therapy responsiveness to therapy refractoriness). In such a hypothetical scenario, any number of existing methods would suffice to analyze and identify specific moments in time—and corresponding genomic, clonal, or immunological events—critical to cancer dynamics. In practice, however, data are typically sparsely collected over time, and a system is unlikely to be observed precisely at the relevant critical points. Thus, existing genomic analysis approaches are limited in their ability to identify relevant alterations and interpret temporal dynamics in many real-world situations.

Here, we present an analytical approach based on state-transition theory, which allows us to conceptually fill in the gaps of incomplete and temporally sparse genomic data and identify time-dependent critical transcriptomic perturbations that predict cancer progression. We illustrate how state-transition theory can be applied to model and predict the development of acute myeloid leukemia (AML) based on time-series RNA-sequencing data collected at sparse timepoints. AML is a devastating malignancy of the hematopoietic system that may rapidly lead to bone marrow failure and death. Approximately 21,000 new patients are diagnosed with AML each year in the United States, and the latest 5-year overall survival rate remains at only 28% (https://seer.cancer.gov) (Noone et al., 2018). AML comprises multiple entities characterized by gene mutations and chromosomal abnormalities that drive leukemogenesis and predict prognosis and/or therapeutic response. Genomic studies such as The Cancer Genome Atlas have revealed various mutational landscapes in AML, highlighting patterns of cooperation and exclusivity among the gene mutations (Dohner et al., 2015). These gene mutations ultimately alter the expression of downstream target genes and are therefore associated with unique gene expression profiles representing functional networks in leukemic cell biology. Thus, changes of the global system-level gene expression (i.e., the transcriptome) may be viewed as an *emergent property of the system*, which is informative with regard to disease evolution.

To model AML development as a state-transition, we use a well-established conditional knock-in mouse model that mimics a subset of human AML driven by the fusion gene *CBFB-MYH11* (CM), corresponding to the cytogenetic rearrangement inv(16)(p13.1q22) or t(16;16)(p13.1;q22) [henceforth inv(16)]. Inv(16) is one of the most common recurrent cytogenetic aberrations and is found in approximately 5-12% of all patients with AML. Induction of CM expression disrupts normal hematopoietic differentiation, resulting in perturbed hematopoiesis in the bone marrow and an increased probability of state-transition from health to leukemia. We show here that temporal transcriptome dynamics in peripheral blood mononuclear cells (PBMCs) from CM mice are predictive of state progression represented as the movement of a transcriptome-particle in a two-dimensional state-space (i.e., normal hematopoiesis and leukemia). Critical points of these transcriptome trajectories in the state-space correspond to specific gene expression changes that contribute to the progression from a reference state of normal hematopoiesis to leukemia, thereby providing biological insights into critical time-dependent transcriptomic perturbations during leukemogenesis.

## Results

### State-transition dynamics of the transcriptome

State-transition theory has a rich mathematical foundation (Pavliotis, 2014) and has been broadly applied in various scientific fields, including chemistry, physics, and biology (Esteban et al., 2018; Folguera-Blasco et al., 2018; Herring et al., 2017; Hormoz et al., 2016; Pastushenko et al., 2018; Zhou et al., 2012). To our knowledge, it has not yet been applied to study cancer progression from a reference state of health (i.e., normal hematopoiesis) to a state of disease (i.e., leukemia). To apply state-transition theory to this setting, we started from the observation that the composition of PBMCs changes at different stages of leukemia development, treatment response, or post-treatment disease relapse. We considered blood rather than bone marrow as the organ of interest for our study because it is much more accessible for frequent sequential sampling and therefore our approach, if successful, may be easily applied in the clinical setting.

Changes in the cellular composition of PBMCs are obvious once disease is clinically present (defined here as at least 20% circulating blasts). However, current approaches (including diagnostic molecular testing) may not fully detect the subtle changes occurring in between time-sequential sampling before overt leukemia is present, and are therefore unable to predict with accuracy the trajectory of the disease development or regression at any discrete timepoint. Herein, we hypothesize that the PBMC transcriptome can be modeled by representing it as a particle undergoing Brownian motion in a double-well quasi-potential that determines the probability of the system to transition from a state of health to a state of leukemia (Figure 1A). To represent health to leukemia transition, we postulated that in a normal hematopoiesis state, a large energy barrier reduces the probability of the system to transition to a leukemia state. However, once hematopoiesis is perturbed by the expression of a leukemogenic event (i.e., the fusion gene *CM* expression in the case of our murine AML model), the double-well quasi-potential energy landscape is altered in a way that the energy barrier is lowered and the probability of transition from normal hematopoiesis to leukemia significantly increases (Figure 1B). Mathematically, this double-well quasi-potential can be represented as a 4th-degree polynomial with constant coefficients $\alpha_i$,

$$U(x) = \sum_{i=0}^{4} \alpha_i x^{i}.$$

The equation of motion of the particle in the double-well quasi-potential then takes the form of a stochastic differential equation

$$dX_t = -U'(X_t)\,dt + \sqrt{2\beta^{-1}}\,dB_t,$$

where $X_t$ denotes the location of the particle at time $t$ and $dB_t$ is a Brownian motion that is uncorrelated in time, $\langle dB_{t_i}, dB_{t_j} \rangle = \delta_{i,j}$, with $\delta_{i,j}$ being the Dirac delta function and $\beta^{-1}$ the diffusion coefficient.

In this double-well quasi-potential, we postulated the existence of three critical points, denoted as $c_1$, $c_2$, $c_3$ (Figure 1). The critical point $c_1^{*}$ (for control mice) represents a stable state of normal hematopoiesis whereas the critical point $c_1$ represents a state of *CM*-perturbed hematopoiesis in *CM* mice (i.e., CM activation but not overt leukemia). The critical point $c_2$ represents an unstable critical point that is a transition point in the dynamics from $c_1$ to $c_3$; $c_3$ represents a state of overt leukemia corresponding to at least 20% circulating AML blasts. Because the critical point $c_2$ is unstable, state-transition theory predicts that it would be unlikely to observe the system precisely at or very near this state. When the transcriptome-particle crosses the unstable critical point $c_2$, the velocity of the particle increases. This can be interpreted biologically as an acceleration of the transcriptome-particle toward the leukemia state defined by the stable critical point $c_3$. This prediction underscores the practical utility of state-transition theory in interpreting time-series genomic data. In fact, acquiring data precisely at each of the critical timepoints would be unlikely; therefore, theoretical constructs and mathematical predictions are necessary for contextualizing the relevance of data collected at any point or time near critical transitions.
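The behavior described above can be illustrated with a minimal simulation sketch. It assumes the overdamped form of the equation of motion (drift $-U'$, noise amplitude $\sqrt{2\beta^{-1}\,dt}$ per step) and the cubic form $U'(x) = \alpha(x - c_1)(x - c_2)(x - c_3)$ used for the quasi-potential in this section. The Euler-Maruyama integrator, the function names, and all parameter values are illustrative assumptions, not the authors' implementation or fitted values.

```python
import math
import random


def drift(x, alpha, c1, c2, c3):
    """Deterministic drift -U'(x), assuming U'(x) = alpha (x-c1)(x-c2)(x-c3)."""
    return -alpha * (x - c1) * (x - c2) * (x - c3)


def simulate(x0, alpha, c1, c2, c3, diffusion, dt, n_steps, seed=0):
    """Euler-Maruyama path of a particle in the double-well quasi-potential.

    `diffusion` plays the role of beta^{-1}; all arguments are hypothetical.
    """
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(n_steps):
        # Brownian increment with variance 2 * diffusion * dt
        noise = math.sqrt(2.0 * diffusion * dt) * rng.gauss(0.0, 1.0)
        x = x + drift(x, alpha, c1, c2, c3) * dt + noise
        path.append(x)
    return path


# Toy landscape with stable points at -1 and 1 and an unstable point at 0.
example_path = simulate(x0=-1.0, alpha=1.0, c1=-1.0, c2=0.0, c3=1.0,
                        diffusion=0.2, dt=0.01, n_steps=2000, seed=42)
```

With the noise switched off (`diffusion=0.0`), a particle started just past the unstable point runs to the far stable point, which is the deterministic counterpart of the acceleration after crossing $c_2$.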

To experimentally test our state-transition model based on transcriptome-particle dynamics from a state of normal hematopoiesis to a state of perturbed hematopoiesis and eventually to overt leukemia, we performed a longitudinal study using a well-established, conditional knock-in mouse model (*Cbfb^{+/56M}/Mx1-Cre;* C57BL/6) that mimics a subset of human AML driven by the fusion gene *CM* (Figure 2A). In the conditional *CM* knock-in mice, CM expression is induced via the activation of Cre-mediated recombination by intravenous administration of synthetic double-stranded RNA polyinosinic–polycytidylic acid [poly (I:C)] (Supplemental Figure S1) (Cai et al., 2016; Kuo et al., 2006). Induction of CM expression results in perturbed hematopoiesis, lower energy barrier and increased probability of state-transition from normal hematopoiesis to leukemia. We collected PBMC samples from a cohort of *CM*-induced mice (n = 7) and similarly treated littermate control mice lacking the transgene (n = 7) before induction (t = 0) and at one-month intervals after induction up to 10 months (t = 1-10) or when the mouse was moribund. All but one of the *CM*-induced mice developed AML within the 10-month duration of the experiment (Figure 2B). The one remaining *CM* mouse exhibited CM-perturbed pre-leukemic expansion of progenitor populations in the bone marrow but leukemia had not yet manifested by the end of the 10-month study (data not shown). All PBMC samples were subjected to RNA-sequencing (Figure 2C) and flow cytometry analyses.

Given the positions of critical points, the double-well quasi-potential can be constructed up to a constant of integration by evaluating $U(x) = \int U'(z)\,dz$, where $U' = dU/dx = \alpha(x - c_1)(x - c_2)(x - c_3)$, $\alpha$ is a scaling parameter, and $x$ is the spatial variable in the quasi-potential. The coefficients $\alpha_i$ can be expressed in terms of $\alpha$, $c_1$, $c_2$, $c_3$ by expanding $U'$ and integrating with respect to $x$. State-transition theory predicts that the energy barrier, defined as the energetic difference between the initial state and the transition state, will be lowered by the expression of the *CM* gene, resulting in significantly increased probability and rate of transition from normal hematopoiesis to leukemia in *CM*-induced mice compared to control mice. In support of our hypothesis, we calculated $\alpha = 4.85 \times 10^{-8}$ and observed that the energy barrier ($EB \equiv U(c_2) - U(c_1)$) is 0.99 (arbitrary units, A.U.) for *CM*-induced mice, approximately 6.5-fold lower than the energy barrier for control mice [6.45 (A.U.)].
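As a worked illustration of this construction, the sketch below expands $U'(x) = \alpha(x - c_1)(x - c_2)(x - c_3)$, integrates term by term with the constant of integration set to zero, and evaluates the energy barrier $U(c_2) - U(c_1)$. The function names and the critical-point values in the example are hypothetical; they are not the values estimated from the data.

```python
def quasi_potential(x, alpha, c1, c2, c3):
    """U(x) obtained by integrating U'(x) = alpha (x-c1)(x-c2)(x-c3).

    Expanding the cubic gives x^3 - s1 x^2 + s2 x - s3 with the elementary
    symmetric sums below; the integration constant is taken as zero.
    """
    s1 = c1 + c2 + c3
    s2 = c1 * c2 + c1 * c3 + c2 * c3
    s3 = c1 * c2 * c3
    return alpha * (x**4 / 4.0 - s1 * x**3 / 3.0 + s2 * x**2 / 2.0 - s3 * x)


def energy_barrier(alpha, c1, c2, c3):
    """EB = U(c2) - U(c1): height of the unstable point above the first well."""
    return (quasi_potential(c2, alpha, c1, c2, c3)
            - quasi_potential(c1, alpha, c1, c2, c3))


# Hypothetical critical points: moving c2 toward c1 lowers the barrier.
symmetric_barrier = energy_barrier(1.0, -1.0, 0.0, 1.0)
lowered_barrier = energy_barrier(1.0, -1.0, -0.5, 1.0)
```

In this toy setting, shifting the unstable point toward the initial well reduces the barrier, which is the qualitative effect attributed to *CM* expression in the text.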

### Construction of the leukemia transcriptome state-space

To represent the state-transition dynamics of the transcriptome from normal hematopoiesis to leukemia, we constructed a leukemia transcriptome state-space, in which the trajectory of the evolution from *CM*-perturbed hematopoiesis to development of overt AML could be geometrically represented. We initially performed genomic signal processing and dimension reduction analysis on the time-series RNA-sequencing data (Figure 2B). We constructed a matrix ($X$) such that each row corresponds to a sample and each column corresponds to a gene transcript level (log2 transformed counts per million reads) (Robinson et al., 2009). We then performed principal component analysis (PCA) to process and deconvolute the RNA-sequencing data into the components of the variance that were most clearly associated with normal hematopoiesis or leukemia progression. Principal components (PCs) were computed via singular value decomposition (Figure 3A), which is a matrix factorization method. The singular value decomposition is computed on the mean-centered data $X - \bar{X}$, where $\bar{X}$ represents the column mean, such that $X - \bar{X} = U \Sigma V^{*}$ ($*$ denotes the conjugate transpose, and $U$ is a square matrix not to be confused with the quasi-potential energy polynomial). The columns of the unitary matrix $U$ form an orthonormal basis for the sample space (i.e., the temporal dynamics of the transcriptome), the diagonal matrix $\Sigma$ contains the singular values, and the columns of the matrix $V^{*}$ correspond to the eigengenes, or loadings, of each gene in the transcriptome per PC.
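The decomposition described above can be sketched as follows, assuming a samples-by-genes matrix of log2 CPM values and using NumPy's `numpy.linalg.svd`. The function name `pca_svd` and the synthetic input are illustrative assumptions; the sketch does not reproduce the authors' preprocessing.

```python
import numpy as np


def pca_svd(X):
    """PCA of a samples-by-genes matrix via SVD of the column-centered data.

    Returns sample scores (state-space coordinates), the fraction of variance
    per component, the gene loadings (genes x components), and column means.
    """
    X = np.asarray(X, dtype=float)
    col_mean = X.mean(axis=0)
    Xc = X - col_mean                       # mean-center each gene
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                          # equals Xc @ Vt.T
    explained = s**2 / np.sum(s**2)         # variance fraction per component
    return scores, explained, Vt.T, col_mean


# Synthetic stand-in for a log2 CPM matrix (6 samples, 10 genes).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 10))
scores, explained, loadings, col_mean = pca_svd(X_demo)
state_space = scores[:, :2]                 # (PC1, PC2) coordinates per sample
```

Returning the column means alongside the loadings is a design choice made here so that later samples can be centered consistently before projection.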

We analyzed the singular values and found that 65% of the total variance was captured in the first 4 components, representing a majority of the variation in the data and corresponding to the PCA “elbow” (Figure 3B). A second elbow was identified at component 15 (see supplemental methods, Figure S2). A pairwise analysis of the first 4 components (Figure 3C) revealed that the second component (PC2) strongly correlated with the appearance of differentially expressed *Kit* (Figure 3D; supplemental Figure S4), which in this mouse model is a surrogate immunophenotypic marker for leukemic cells. We note that PCs are eigenvectors of the covariance matrix of the mean-centered data matrix X and are orthogonal by construction, and in fact the data projected along PC1 and PC2 have the smallest correlation coefficients (Figure 3D). We therefore chose PC1 and PC2 as orthogonal states (i.e., components) for the transcriptome state-space. We constructed a 2-dimensional state-space with the first (denoted as non-leukemic) and second (denoted as leukemic) sample components (*x*_{1}, *x*_{2}) = (*PC*1, *PC*2). We associated this space with the temporal dynamics of state-transition to leukemia so that the mean position of the reference (non-leukemic) state was located at PC2 = 0 and smoothly increased toward a leukemic state as the trajectory traveled south in the space (Figure 3E).

The second column of the loading matrix V* represents the eigengenes of leukemic progression (Figure 3F). Each gene is represented as a 2-dimensional vector whose components are its loadings on PC1 and PC2. This representation enables the geometric decomposition of each gene into non-leukemic and leukemic components, respectively (Figure 3F). This geometric analysis of eigengenes allowed us to interpret the leukemic component of genes based on their relative contribution to the leukemia state in differential expression analysis. We examined other dimension reduction methods to construct the state-space, but found them to be sub-optimal due to free parameters (e.g., diffusion mapping (Haghverdi et al., 2015)) or the inability to isolate leukemia trajectories with default settings (e.g., t-SNE (Pezzotti et al., 2016)) (see supplemental methods, Figure S5).

The construction of the state-space using PCA was not sensitive to variations in data-normalization method, sample number, or gene selection criteria, as shown by bootstrap cross-validation (see supplemental methods, Figure S6–S10). In fact, the geometry of the state-space could be inferred from time-series RNA-sequencing data derived from just 1 control mouse and 1 CM mouse (supplemental Figure S9) and was not changed by the exclusion of the known leukemia genes *Kit* or *CM* (supplemental Figure S10). These results demonstrate that PCA-based state-space construction is robust and reproducible regardless of variation in data-processing methods.

### Estimation of critical points in the transcriptome state-space

The PC-constructed transcriptome state-space allowed us to identify the position of the critical points in the state-space, to define the shape of the double-well quasi-potential and to predict the dynamics of the transcriptome-particle. We used K-means clustering to identify 3 clusters of the data along the leukemia axis (PC2) (Figure 4A). The centroid of all control samples was used as the point $c_1^{*}$. The centroids of the CM clusters K1 and K3 were used as estimates of the leukemia critical points $c_1$ and $c_3$, respectively. Because the critical point $c_2$ is unstable, it is unlikely to be correctly identified with a centroid approach. To estimate $c_2$, we constructed quasi-potentials over a range of candidate values of $c_2$ (Figure 4B, left), and computed the Boltzmann Ratio (BR) for each quasi-potential (Figure 4B, right). We assume the temperature is constant. We took $c_2$ to be the “upper” boundary (maximum value) of the $c_2$ cluster, which resulted in a BR of 81.4.
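The clustering step can be sketched with a small one-dimensional implementation of Lloyd's k-means algorithm applied to PC2 values. This is a library-free illustration under stated assumptions (deterministic quantile-based initialization, hypothetical PC2 values); the authors' clustering software and settings are not specified here.

```python
def kmeans_1d(values, k=3, n_iter=100):
    """Lloyd's k-means in one dimension.

    Returns (centroids, labels); centroids are ordered from low to high for
    well-separated data because of the quantile-based initialization.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: one centroid from the middle of each k-quantile bin.
    centroids = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]

    def nearest(v):
        return min(range(k), key=lambda j: abs(v - centroids[j]))

    for _ in range(n_iter):
        labels = [nearest(v) for v in values]
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # Keep the old centroid if a cluster happens to be empty.
            updated.append(sum(members) / len(members) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(v) for v in values]
    return centroids, labels


# Hypothetical PC2 coordinates for samples from induced mice.
pc2_demo = [0.0, 0.1, -0.1, 5.0, 5.2, 4.8, 10.0, 10.1, 9.9]
centroids_demo, labels_demo = kmeans_1d(pc2_demo, k=3)
```

Under the approach in the text, the outer centroids would serve as estimates of the stable points, while the unstable point is taken from the boundary of the middle cluster rather than from its centroid.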

We then mapped the state-space trajectory of each mouse in time and the PC2 location relative to the estimated critical points $c_1$, $c_2$, $c_3$ (Figure 4C; supplemental Figure S11). As early as one month (t = 1) following induction of CM (p<0.01) and despite the absence of any detectable circulating leukemic blasts, we were able to detect changes of the position in the transcriptome state-space (value of PC2) (Figure 4D), in line with the early hematopoietic perturbation driven by *CM* expression we previously reported (Cai et al., 2016; Kuo et al., 2009, 2006). Consistent with the prediction made using the state-transition theory, we observed acceleration of the transcriptome toward the leukemia state once the transcriptome-particle crossed the unstable critical point $c_2$. Indeed, circulating leukemic blasts (cKit^{+}) began to be detected (<5%) once the samples approached $c_2$ (Figure 4E). After crossing this point, the acceleration toward the leukemia state was confirmed by a rapid increase of leukemic blasts and manifestation of overt disease (Figure 4F). The accelerated movement after crossing $c_2$ was also supported by velocity calculated between each pair of time-sequential points in the 2D state-space (Figure 4G). Our results therefore support the applicability of the state-transition theory to cancer and demonstrate that leukemia development can be interpreted as transcriptome dynamics defined by critical points in the state-space trajectory.
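One plausible way to compute such a velocity is sketched below, under the assumption that it is the Euclidean distance between consecutive (PC1, PC2) positions of a mouse divided by the elapsed time. The exact definition used for Figure 4G is not given in this section, and the function name and inputs are hypothetical.

```python
import math


def velocities(times, points):
    """Speed between consecutive state-space positions of one mouse.

    `times` are sampling times (e.g., months) and `points` the matching
    (PC1, PC2) coordinates; returns one value per consecutive pair.
    """
    speeds = []
    for i in range(1, len(times)):
        distance = math.dist(points[i - 1], points[i])   # Euclidean distance
        speeds.append(distance / (times[i] - times[i - 1]))
    return speeds


# Hypothetical trajectory sampled at months 0, 1 and 3.
speeds_demo = velocities([0, 1, 3], [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)])
```

Dividing by the actual time gap, rather than assuming unit spacing, keeps the estimate meaningful when samples are collected at irregular intervals.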

### Prediction of leukemia progression

Because our model incorporates stochasticity that may be due to biological, experimental, technical, or time-sampling variations, the transcriptome trajectories cannot be precisely predicted by simply applying the equation of motion of the transcriptome-particle in the state-space double-well quasi-potential. However, the stochastic model can provide examples of trajectories (Figure 5A). A well-established approach to characterizing and predicting stochastic dynamics is to consider the evolution of the probability density function. The spatial and temporal evolution of the probability density for the position of the particle $p(x_2, t)$ is given by the solution of the corresponding Fokker-Planck (FP) equation

$$\frac{\partial p}{\partial t} = \frac{\partial}{\partial x_2}\left(U'(x_2)\,p\right) + \beta^{-1}\frac{\partial^{2} p}{\partial x_2^{2}}, \qquad (1)$$

where $\beta^{-1}$ is the diffusion coefficient, which may be estimated with a mean-squared displacement analysis of particle trajectories (see supplemental methods, Figure S11) (Pavliotis, 2014). Solution of the FP permits the direct calculation of the expected first arrival time from an initial point (i.e., perturbed hematopoiesis) in the state-space to a final point (i.e., leukemia). In our study, we calculated the mean time to develop leukemia for the cohort of mice by numerically solving the FP equation with the initial condition taken from the data (Eq. 1, Figure 5B-C). We calculate the probability of being in the leukemia state at time $t = t_i$ from a given initial position in time and space by integrating the probability density past $c_2$ from time $t = 0$. The mathematical model accurately predicted both the proportion of induced mice that will develop leukemia as well as the mean time for disease to manifest; in fact, the predicted and actual median survival time, defined as the median time to develop leukemia (at least 20% circulating blasts), were not significantly different (p >0.05) (Figure 5D).
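A minimal numerical sketch of this calculation is given below. It assumes the FP equation in flux form, with drift $-U'$ and diffusion coefficient $\beta^{-1}$, discretized by an explicit conservative finite-volume scheme with no-flux boundaries. The scheme, the boundary treatment, the point-mass initial condition, and every parameter value are illustrative assumptions rather than the authors' solver.

```python
def solve_fokker_planck(dU, diffusion, x_min, x_max, n_cells, x0, dt, n_steps):
    """Explicit finite-volume solution of dp/dt = d/dx(U' p) + D d2p/dx2.

    `dU` is a callable returning U'(x); `diffusion` stands for beta^{-1}.
    Probability starts concentrated in the cell nearest `x0`.
    Returns (cell centers, density p, cell width h).
    """
    h = (x_max - x_min) / n_cells
    centers = [x_min + (i + 0.5) * h for i in range(n_cells)]
    p = [0.0] * n_cells
    start = min(range(n_cells), key=lambda i: abs(centers[i] - x0))
    p[start] = 1.0 / h
    for _ in range(n_steps):
        # flux[i] is the probability flux through the left edge of cell i;
        # the first and last entries stay zero (no-flux boundaries).
        flux = [0.0] * (n_cells + 1)
        for i in range(1, n_cells):
            edge = x_min + i * h
            flux[i] = (-dU(edge) * 0.5 * (p[i - 1] + p[i])
                       - diffusion * (p[i] - p[i - 1]) / h)
        p = [p[i] - dt * (flux[i + 1] - flux[i]) / h for i in range(n_cells)]
    return centers, p, h


def probability_past(centers, p, h, c2):
    """Probability mass located beyond the unstable point c2."""
    return sum(pi * h for x, pi in zip(centers, p) if x > c2)


# Toy landscape U'(x) = x^3 - x with wells at -1 and 1 and a barrier at 0.
toy_dU = lambda x: x**3 - x
grid, density, width = solve_fokker_planck(toy_dU, 0.25, -2.0, 2.0, 80, -1.0, 0.001, 2000)
leukemia_probability = probability_past(grid, density, width, 0.0)
```

The time step must satisfy the usual explicit-scheme stability bound (roughly `dt <= h**2 / (2 * diffusion)`); an implicit scheme would be a reasonable alternative for finer grids. In this toy orientation the leukemia well lies at larger `x`, which is an arbitrary choice made for the example.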

### Biological interpretation of the critical points in the transcriptome state-space

Because state-transition theory enables the interpretation of time-series genomic data in terms of critical points, we hypothesized that the transcriptional events [differentially expressed genes (DEGs)] occurring at each critical point ($c_1$, $c_2$, $c_3$) represented critical biological alterations driving the transition from normal hematopoiesis to perturbed hematopoiesis and eventually to leukemia (Figure 6A). To identify these events, we partitioned the data such that each sample was associated with a unique critical point by calculating the nearest proximity of a given sample to the location of the critical points in the state-space (Figure 4A). We then identified DEGs by performing pairwise comparisons of gene expression at each of the critical points with that of the reference state (i.e., $c_1$ vs $c_1^{*}$; $c_2$ vs $c_1^{*}$; $c_3$ vs $c_1^{*}$) and with other critical points ($c_2$ vs $c_1$; $c_3$ vs $c_1$; $c_3$ vs $c_2$) using edgeR and a false discovery rate of 0.05 (supplemental Figure S13). The number of DEGs (defined as q<0.05, |log2(FC)|>1) for each comparison is summarized in Table 2 (see supplemental Table S1-6 for gene list). We then categorized the DEGs at each critical point as early events ($c_1$ vs $c_1^{*}$, not in $c_2$ vs $c_1$ or $c_3$ vs $c_2$), transitional events ($c_2$ vs $c_1^{*}$, not in $c_2$ vs or $c_3$ vs $c_2$), and persistent events (genes that remained as DEGs at all three of the critical points; $c_1$ vs $c_1^{*}$; $c_2$ vs $c_1^{*}$; $c_3$ vs $c_1^{*}$) (Figure 6A).
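The partitioning of samples by proximity to the critical points can be sketched as a nearest-neighbor assignment along PC2. The function name, the dictionary of critical-point positions, and the sample values are hypothetical and serve only to illustrate the rule described above.

```python
def assign_to_critical_points(pc2_values, critical_points):
    """Label each sample with the name of its closest critical point.

    `critical_points` maps a label (e.g., "c1") to a position on PC2.
    """
    labels = []
    for value in pc2_values:
        closest = min(critical_points,
                      key=lambda name: abs(value - critical_points[name]))
        labels.append(closest)
    return labels


# Hypothetical positions and samples along the leukemia axis.
critical_demo = {"c1": 0.0, "c2": -50.0, "c3": -100.0}
groups_demo = assign_to_critical_points([0.1, -40.0, -95.0], critical_demo)
```

The resulting groups would then be passed to a differential expression tool for the pairwise comparisons; that step is not sketched here.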

Gene Ontology (GO) analysis revealed insights into the biological and functional impact of DEGs identified on the basis of critical points in the transition from normal hematopoiesis to leukemia (supplemental Table S7-9). For early transcriptional events at $c_1$, the top three GO terms ranked by q-value (multiple-test corrected p-value) included extracellular matrix organization (GO-0030198), cellular response to cytokine stimulus (GO-0071345) and cytokine-mediated signaling pathway (GO-0070098) (Figure 6A; Figure S14A; Table S7). For the transitional events at $c_2$, the top three ranked GO terms included DNA metabolic processes (GO-0006259), DNA replication (GO-0006260), and G1-S transition of mitotic cell cycle (GO-0000082) (Figure 6A; Figure S14A; Table S8). DEGs occurring at $c_1$, which continued to be differentially expressed at the critical points $c_2$ and $c_3$, were defined as persistent events. Consistent with cKit expression marking the leukemia blasts, *Kit* up-regulation was among the persistent events seen at $c_1$, $c_2$, and $c_3$. For persistent events, the top three ranked GO terms included positive regulation of phosphatidylinositol 3-kinase activity (GO-0043552), positive regulation of phospholipid metabolic process (GO-1903727), and positive regulation of lipid kinase activity (GO-0090218) (Figure 6A; Figure S14A; Table S9). We note that hierarchical clustering correctly identified leukemia samples but failed to differentiate CM from controls at intermediate time points or to partition the samples temporally (supplemental Figure S12).

Given the 2-dimensional geometry of the transcriptome state-space, we were also able to decompose the contribution of each gene to leukemia progression by considering the second component of the eigengene vector. For instance, considering the expression of *Kit* and *CM* genes, we showed that the second component of both genes was negative, confirming their contribution to promote leukemia development (Figure 6B). Notably, the proangiogenic factor *Egfl7* (Papaioannou et al., 2017), leukemia-associated antigen *Wt1* (Adnan-Awad et al., 2019; Chaichana et al., 2009; Løvvik Juul-Dam et al., 2019) and the protein kinase *Prkd1* genes are among those showing a strong selectivity in the direction toward leukemia in the eigengene space (second component < 0) (Figure 6B). Consistent with a positive contribution promoting leukemia progression, all CM mice that developed leukemia (CM1-5, CM7) showed increasing expression of these leukemia eigengenes (*Kit*, *CM*, *Egfl7*, *Wt1*, *Prkd1*) and reproducible trajectories as they move from normal hematopoiesis ($c_1^{*}$) to leukemia ($c_3$) (Figure 6C). The trajectories of these leukemia eigengenes in the state-space were remarkably concordant for all CM leukemia mice (Figure 6C, top), in contrast to the nonsynchronous trajectories observed in time (Figure 6C, bottom). Therefore, geometric analysis through the transcriptome state-space provides a robust and meaningful approach to identify relevant gene expression signals and to examine the leukemogenic contribution of individual genes as well as the collective contribution of a set of genes.

With this geometric interpretation in hand, we quantified each GO term as the vector sum of each constituent eigengene, so that $G = \sum_i g_i$, where $g_i$ is the eigengene vector of the $i$-th constituent gene. The contribution of an individual GO term to the progression from normal hematopoiesis state to a leukemia state was also given by the second component of the summed vector, $G_2$, which represented the maximum contribution of an individual GO term pathway $G$ to leukemogenesis (Figure 6D; black vector). Because not all of the constituent genes for a GO term (black dots) were differentially expressed (pink dots), we considered only the subset of genes in the GO term that actively contributed to leukemogenesis as evidenced by differential expression (Figure 6D; pink vector). As such, each significantly enriched GO term could be quantitatively analyzed for its relative contribution to the development of leukemia as the sum total contributions of the DEGs in that particular GO term.
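The vector-sum quantification can be sketched as below, assuming each gene is stored as a pair of (PC1, PC2) loadings. The gene names, loading values, and function name are placeholders invented for illustration and do not correspond to any result in the study.

```python
def go_term_vector(loadings, genes):
    """Sum the 2-D eigengene vectors of the given genes.

    `loadings` maps a gene name to its (PC1, PC2) loadings; the second
    component of the returned vector is the leukemic contribution.
    """
    first = sum(loadings[g][0] for g in genes)
    second = sum(loadings[g][1] for g in genes)
    return (first, second)


# Placeholder loadings for three genes annotated to one hypothetical GO term.
loadings_demo = {"geneA": (0.10, -0.30), "geneB": (-0.20, -0.10), "geneC": (0.05, 0.20)}
all_genes = ["geneA", "geneB", "geneC"]
deg_subset = ["geneA", "geneB"]           # genes that are differentially expressed

full_vector = go_term_vector(loadings_demo, all_genes)    # all annotated genes
deg_vector = go_term_vector(loadings_demo, deg_subset)    # DEG-only contribution
```

Comparing the two vectors mirrors the distinction drawn in the text between the maximum contribution of a GO term and the contribution of its differentially expressed members.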

To evaluate the potential leukemogenic contribution of the GO terms corresponding to the step-wise perturbation, we also performed vector analysis of each GO pathway enriched in early, transitional, and persistent events and represented them as vectors in the state-space (Figure 6D). Notably, our analysis of $c_1$ early events revealed that some highly significantly enriched pathways exhibited contributions away from leukemia (north, $G_2$ > 0) (Figure 6E; Figure S15), suggesting the presence of a restorative force in initially CM-perturbed hematopoiesis, which attempts to revert the perturbed system back to the reference state of normal hematopoiesis. On the other hand, analysis of GO terms enriched for transitional and persistent events demonstrated an increasing magnitude and direction (angle) toward the leukemic state (Figure 6D, Figure S16-18). Evaluation of $G_2$ for all early, transitional, and persistent GO terms revealed a strong overall leukemogenic contribution given by transition events occurring at $c_2$ (Figure 6E), underscoring the unique biological insights gained by differential analysis based on state-space critical points.

Analysis of the leukemic transcriptome at *c*_{3} showed dysregulation of a large number (11,634) of genes (supplemental Figure S15B), making it difficult to perform pathway enrichment or to interpret in terms of contribution to leukemia. Thus, we filtered genes based on the angle subtended (*θ*) by the gene in the 2-dimensional state-space. The range of angles (*θ*) identified as being associated with leukemia evolution was defined as the minimal sector of a circle centered at (0,0) that included all leukemic samples as well as the mirror image of this sector along the abscissa axis of symmetry (supplemental Figure S19). This selection identified differentially expressed leukemia eigengenes (leukemia eigenDEG). The top three GO terms for leukemia eigenDEG ranked by q-value (multiple-test corrected p-value) included mitotic spindle organization (GO-0007052), centromere complex assembly (GO-0034508), and microtubule cytoskeleton organization involved in mitosis (GO-1902850) (Table S10), consistent with the hyper-proliferative phenotype, leukemic cell trafficking, and extramedullary tissue infiltration associated with late-stage disease.
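The angular filter can be sketched as follows, assuming angles are measured with `math.atan2` from the abscissa and that the leukemic samples span less than a half-plane without straddling the ±π branch cut. The function names and coordinates are hypothetical; the actual sector used in the study is shown in supplemental Figure S19.

```python
import math


def leukemia_sector(leukemic_samples):
    """Smallest angular interval (radians) containing all leukemic samples.

    Assumes the sample angles do not wrap around the +/- pi branch cut.
    """
    angles = [math.atan2(y, x) for x, y in leukemic_samples]
    return min(angles), max(angles)


def in_sector_or_mirror(vector, sector):
    """True if the vector's angle lies in the sector or in its mirror image.

    Reflection across the abscissa maps an angle theta to -theta.
    """
    low, high = sector
    theta = math.atan2(vector[1], vector[0])
    return low <= theta <= high or -high <= theta <= -low


# Hypothetical leukemic samples located in the lower half of the state-space.
sector_demo = leukemia_sector([(1.0, -1.0), (-1.0, -1.0)])
keep_gene = in_sector_or_mirror((0.0, -1.0), sector_demo)
```

Genes whose eigengene vectors pass this test would form the candidate set for the subsequent differential expression filter.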

Altogether, our state-transition model, leukemia transcriptome state-space, and leukemia eigengenes offer biologically relevant interpretation and insights into the complex multistep process of malignant transformation and disease evolution.

### Validation study with independent cohorts

To validate our state-transition mathematical model, state-space, and analytical approach, we collected PBMC RNA-sequencing data from two additional independent cohorts of control and CM mice. We collected validation (v) cohort 1 samples monthly for up to 6 months (vControl1-7; vCM1-9) and collected validation cohort 2 samples sparsely at 3 randomly selected timepoints (vControl8-9; vCM10-12) during leukemia progression. We performed PCA of the validation cohort 1 and 2 data, which again demonstrated that the majority of the variance was encoded in the first 4 PCs (supplemental Figure S20A) and the leukemia-related variance was encoded in PC2 (Figure 7A). We then evaluated our ability to map state-transition trajectories and predict leukemia development in the validation cohorts by projecting the data from the validation cohorts into the state-space constructed using the training cohort. We accomplished this by using the eigengenes from the singular value decomposition of the training data as follows: a matrix of new data, $\tilde{X}$, was projected into the state-space by multiplying by the matrix $V$ from the training data, so that the coordinates of the new data in the reference state-space were given by $PC = \tilde{X}V$. Because the matrices $\tilde{X}$ and $V$ must have the same dimension, and more importantly the weights of the genes in the matrix $V$ must match one-to-one with the genes in $\tilde{X}$, the projection may use only genes in the intersection of $X$ and $\tilde{X}$.
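The projection step can be sketched as below, assuming the training loadings are stored as a genes-by-components array together with the matching gene names. Centering the new data by the training column means is an assumption added here for consistency with a mean-centered PCA; it is optional in the sketch because the text does not state it. All names are hypothetical.

```python
import numpy as np


def project_into_state_space(X_new, genes_new, V, genes_train, train_mean=None):
    """Project new samples onto training eigengenes using shared genes only.

    X_new: samples x genes_new matrix; V: genes_train x components loadings.
    train_mean (optional): per-gene means of the training data, in
    genes_train order, subtracted before projection.
    """
    new_index = {g: j for j, g in enumerate(genes_new)}
    # Keep training genes that are also measured in the new data,
    # preserving the training order so weights and genes stay aligned.
    rows_train = [i for i, g in enumerate(genes_train) if g in new_index]
    cols_new = [new_index[genes_train[i]] for i in rows_train]
    X_sub = np.asarray(X_new, dtype=float)[:, cols_new]
    if train_mean is not None:
        X_sub = X_sub - np.asarray(train_mean, dtype=float)[rows_train]
    return X_sub @ np.asarray(V, dtype=float)[rows_train, :]


# Toy example: three training genes, two components, reordered new data.
V_demo = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
coords_demo = project_into_state_space(
    [[7.0, 2.0, 5.0]], ["g3", "g1", "g2"], V_demo, ["g1", "g2", "g3"])
```

Matching genes by name rather than by column position guards against silent misalignment when the validation data were quantified against a different gene list.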

Thus, we mapped the trajectory of each mouse in the validation cohorts in the PC2 space (Figure 7B). State-transition critical points *c*_{1}, *c*_{2}, and *c*_{3} were estimated using k-means clustering (k=3), using the same method that was used on the training cohort (Figure 4A-B). The trajectories of vControls remained at *c*_{1}, and vCM mice progressed toward a leukemic state at *c*_{3}, as in the original analysis. We detected CM mice at *c*_{1}, 1 month (t = 1) after induction of CM expression (supplemental data Figure S20B-C), similar to the initial dataset (Figure 4D). We observed similar state-space trajectories, in that we detected acceleration of the transcriptome toward the leukemia state once it crossed the unstable critical point *c*_{2} (Figure 7C).
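As a rough illustration of clustering positions along PC2 with k=3, the sketch below runs Lloyd's k-means algorithm on scalar values. How the cluster results are turned into critical points is defined with Figure 4A-B and is not reproduced here; reporting the sorted cluster centers, the function name `kmeans_1d`, the deterministic initialization, and the coordinates are all assumptions for illustration.

```python
def kmeans_1d(values, k=3, n_iter=100):
    """Lloyd's algorithm on scalar values; returns sorted cluster centers."""
    ordered = sorted(values)
    # Deterministic initialization: spread initial centers over the data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Keep the previous center if a cluster becomes empty.
        updated = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


# Hypothetical PC2 coordinates (arbitrary units), not data from the study.
pc2 = [-0.2, 0.0, 0.1, 0.3, 4.8, 5.0, 5.3, 9.7, 10.0, 10.4]
c1, c2, c3 = kmeans_1d(pc2, k=3)
```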

We solved the FP equation for probability density (Eq. 1) with initial conditions derived from the validation cohort data and parameters estimated from the training cohort, which accurately predicted the time to leukemia for all CM mice (n=9) that developed leukemia during the study period (p>0.05, Figure 7D). Three CM-induced mice in validation cohort 1 (vCM2, 3, 6) did not develop leukemia during the 6-month study period, and were mapped to positions in the state-space between *c*_{1} and *c*_{2} but did not cross the transition point *c*_{2} (Figure 7B), consistent with a delayed leukemia onset predicted by our model. These mice showed pre-leukemic states in the bone marrow (i.e., expansion of pre-leukemic progenitor populations) at the end of the study, indicating leukemia progression was taking place but had not yet manifested (see supplemental data Figure S21). Altogether, these results demonstrate our ability to reproduce state-transition trajectories, identify critical points and predict leukemia development in multiple independent cohorts. Thus, our state-transition mathematical model, transcriptome state-space, and analytical approach provide a robust framework to contextualize and interpret global and temporal changes in gene expression, and use them to predict system dynamics relevant to leukemia development.

## Discussion

Here we report the application of the theory and mathematics of state-transitions to interpret temporal genomic data and accurately predict AML development. As a proof of principle, we used time-sequential RNA-sequencing data from a well-characterized orthotopic mouse model of inv(16) AML and applied a mathematical model of state-transition from health to leukemia. We demonstrate the feasibility of predicting the trajectory of state dynamics and time to leukemia using time-sequential genomic data collected at sparse timepoints.

Recent studies have shown that somatic mutations may precede diagnosis of acute myeloid leukemia (AML) by months or years(Desai et al., 2018) and that deep sequencing of mutations can be used to differentiate age-related clonal hematopoiesis from pre-AML and predict AML risk in otherwise healthy individuals(Abelson et al., 2018). However, these approaches rely on population statistical analysis and do not provide a hypothesis-based theory which can predict specific time to leukemia development in individual patients or be translated to other cancer settings. One of the greatest challenges in analyzing time-sequential data is the fact that the signal(s) of interest (i.e., leukemia) often are not synchronized in time. Although methods designed to overcome this issue by time realignment exist (Liu and Müller, 2003; Liu and Yang, 2009; Sun et al., 2016, 2008), they require prior knowledge of the system dynamics. Here, we implemented a method that does not require a priori information, and allows for any nonlinear time dynamics, including transient changes to elements of the eigengene due to stochastic variations or biological fluctuations, including environmental conditions that may be random. Furthermore, our approach guides interpretation of temporal genomic data even when data are incomplete or sparse, as is often the case with longitudinal data from the clinic.

We constructed the leukemia transcriptome state-space based on PCA of RNA-sequencing data obtained from the peripheral blood over the course of leukemia progression. PCA is most often employed to study variance in a dataset and to cluster samples with similar levels of variance. When the PCs can be interpreted, the method provides both a means of reducing the dimensionality of the dataset and also a means to gain insight into the data. In our case, PCA yielded insight into the temporal dynamics of leukemia progression that are encoded in a single component (PC2). The corresponding transcriptome state-space allowed us to study leukemia dynamics by isolating the transcriptional events directly affecting the transition to a leukemia state. Using a mathematical model of state-transition, we demonstrated our ability to predict the time to develop leukemia from a state of perturbed hematopoiesis based on changes in the transcriptome of circulating blood cells over time.

Our results show that movement through the transcriptome state-space can be understood in terms of critical points—mathematically-derived inflection points—which provide a framework to predict the development of leukemia for any point in the space, at any timepoint in leukemia progression. The double-well quasi-potential mathematical model allows us to compute the probability of transition from one state to another for any point in the state-space by numerically solving the FP equation (Figure 5C). This approach may be generalized for any two-state system for which a distance measured between the states can be defined. The trajectories and dynamics in the state-space and the theory of state-transition provided disease-relevant context to guide our analysis of complex time-sequential gene expression data.
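To make the idea of computing a transition probability from a double-well model concrete, the following sketch integrates a one-dimensional Fokker-Planck equation with an explicit finite-volume scheme. It is an illustration under stated assumptions rather than the model fitted in this study: it assumes the common form ∂P/∂t = ∂/∂x[U′(x)P] + D ∂²P/∂x² with U′(x) = a(x − c1)(x − c2)(x − c3), and the parameter values, grid, Gaussian initial condition, and function names are all hypothetical.

```python
import math


def solve_fokker_planck(c1, c2, c3, x0, a=1.0, D=0.05, sigma=0.1,
                        x_min=-1.0, x_max=3.0, n=81, dt=1e-3, t_end=2.0):
    """Explicit finite-volume solution of dP/dt = d/dx[U'(x) P] + D d2P/dx2."""
    dx = (x_max - x_min) / (n - 1)
    x = [x_min + i * dx for i in range(n)]

    # Initial condition: narrow Gaussian centered on the start position x0.
    p = [math.exp(-0.5 * ((xi - x0) / sigma) ** 2) for xi in x]
    norm = sum(p) * dx
    p = [pi / norm for pi in p]

    def drift(xv):
        # -U'(x) for a double-well U with stationary points c1 < c2 < c3.
        return -a * (xv - c1) * (xv - c2) * (xv - c3)

    for _ in range(int(round(t_end / dt))):
        # Probability flux at cell interfaces; zero flux at both boundaries
        # keeps the total probability constant.
        flux = [0.0] * (n + 1)
        for i in range(n - 1):
            x_mid = 0.5 * (x[i] + x[i + 1])
            p_mid = 0.5 * (p[i] + p[i + 1])
            flux[i + 1] = drift(x_mid) * p_mid - D * (p[i + 1] - p[i]) / dx
        p = [p[i] - dt * (flux[i + 1] - flux[i]) / dx for i in range(n)]
    return x, p, dx


def probability_beyond(x, p, dx, threshold):
    """Probability mass located past `threshold` (e.g. the unstable point)."""
    return sum(pi for xi, pi in zip(x, p) if xi > threshold) * dx


# Hypothetical critical points and start positions, in arbitrary units.
c1, c2, c3 = 0.0, 1.0, 2.0
x, p_before, dx = solve_fokker_planck(c1, c2, c3, x0=0.8)
_, p_after, _ = solve_fokker_planck(c1, c2, c3, x0=1.2)
prob_before = probability_beyond(x, p_before, dx, c2)
prob_after = probability_beyond(x, p_after, dx, c2)
```

In this toy setting, a start position just short of the unstable point leaves most of the probability in the first well, whereas a start just past it sends most of the probability toward the second well. The flux form was chosen because it conserves total probability by construction; the explicit time step must stay small relative to the grid spacing for the scheme to remain stable.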

Through the analysis of DEGs at each critical point and by categorizing early, transitional, and persistent events according to GO classification, we identified critical step-wise transcriptomic perturbations during leukemogenesis. Early events are enriched for cellular response to cytokine stimulus and cytokine-mediated signaling pathway, consistent with previously reported altered cell signaling and impaired lineage differentiation induced by the *CM* oncogene (Cai et al., 2016; Kuo et al., 2009). Notably, our results revealed that early perturbations associated with critical point *c*_{1} are not necessarily contributing positively to leukemogenesis but may instead represent counteracting homeostatic response. In addition, the unstable critical point *c*_{2} was characterized by aberrant expression of many genes involved in DNA damage and DNA repair, consistent with the notion that additional cooperating mutations or epigenetic alterations are necessary for full leukemia development (Cai et al., 2016; Kuo et al., 2006). Furthermore, we identified genes that, although not uniquely associated to individual critical points, were persistently and differentially expressed at all critical points *c*_{1}, *c*_{2}, and *c*_{3} during the leukemia state-transition. These genes are mainly involved in signaling pathways that support cell proliferation and survival, and vector analyses demonstrated a direction of strong contribution toward PC2 (leukemia). These persistent events can be interpreted as a force cooperating with the *CM* oncogene to propel the movement of the transcriptome-particle from the reference state to the leukemia state.

Thus, in addition to the mathematical and physical interpretation of state-transition dynamics, our approach provides a possible biological connotation for the critical points identified by the state-transition theory. Furthermore, the location and trajectory of individual genes in the state-space allow assessment of the direction and the magnitude of leukemia contribution. For example, *Egfl7*, *Wt1* and *Prkd1* showed a strong selectivity in the direction toward leukemia and their expression level consistently increased during transition toward leukemia, particularly between *c*_{1} and *c*_{2} in the state-space. Unlike the nonsynchronous changes observed in time for individual CM leukemia mice, these leukemia eigengenes displayed consistently concordant trajectories in the leukemia state-space. These results reaffirm that our transcriptome state-space and approach indeed can infer information relevant to the phenotype of interest (i.e., leukemia), and highlight the utility of location in state-space to guide analysis and biological interpretation of temporal data. Notably, *Egfl7*, *Wt1* and *Prkd1* are among the persistent differentially expressed genes at all critical points *c*_{1}, *c*_{2}, and *c*_{3} during leukemia state-transition. *Egfl7* is known to be highly expressed and predict poor prognosis in AML(Papaioannou et al., 2017), and is also a host gene of *miR-126*, which is a miRNA signature associated with inv(16) AML(Li et al., 2008). Overexpression of *WT1* has been found to predict poor prognosis and minimal residual disease in AML(Adnan-Awad et al., 2019; Becker et al., 2010; Løvvik Juul-Dam et al., 2019). *Prkd1* encodes a serine/threonine protein kinase and is part of all top 3 ranked GO terms enriched for persistent events. The specific role of *PRKD1* in AML has not been described; however, it is known to promote invasion, cancer stemness and drug resistance in several solid tumors(Döppler et al., 2016; Kim et al., 2016; Zhang et al., 2018). These results highlight the relevance of biological insights provided by the geometry of the transcriptome state-space and critical points of state-transition theory.

State-transition theory and corresponding mathematical models have been applied to other systems and to other omics data platforms (e.g., epigenomics, miRomics)(Zadran et al., 2014, 2013; Zhou et al., 2012). However, our application of state-transition theory to the interpretation of cancer evolution is novel not only because it is applied to time-series data, but also because of the use of bulk RNA-sequencing data collected from the peripheral blood. Notably, this approach allows us to derive useful information about the state of the system as a whole, without concern for the heterogeneity related to additional mutations, clonal dynamics, or composition of cells within the sample. Given that additional mutations, epigenetic alterations, and clonal heterogeneity converge into gene expression profiles collectively representing perturbations that are ultimately responsible for the observed phenotypes, our use of RNA expression profiles represented as configurations of the system allows integration of informative quantities relevant to the disease state, without suffering from a combinatorial explosion of complexity. We showed that hierarchical clustering was unable to distinguish CM samples from controls or to partition the samples temporally. Other methods currently available to analyze time-series genomic data (Fischer et al., 2018; Spies et al., 2017; Spies and Ciaudo, 2015; Sun et al., 2016) were not able to: 1) predict the time to develop leukemia, 2) predict the effects of changes in sets of genes over time, or 3) quantify the relative importance of one gene set or pathway over another.
Other approaches (i.e., surprisal analysis) that use concepts of thermodynamic (non)equilibrium and statistical mechanics(Facciotti, 2013) are also useful tools for analyzing cellular transitions (e.g., epithelial to mesenchymal transition(Zadran et al., 2014) and early stages of carcinogenesis(Remacle et al., 2010)), but to our knowledge they have not been used to analyze temporal data and do not provide similar geometric- or critical point-based interpretation of the data.

Although data from a relatively simple mouse model of AML were used to develop this theoretical framework, we demonstrated that the results are reproducible in multiple animal cohorts (i.e., one training cohort and two validation cohorts) and that the robustness of this approach is not affected by variability in sampling time, frequency, sample preparation, or data normalization methods (supplemental Figures S6–S10). Future studies will include prospectively collected time-sequential samples from healthy donors and patients with leukemia. Furthermore, it would be informative to determine the extent to which AML driven by different oncogenes share biological connotations and interpretations derived from the perspective of state-transition critical points. This information will guide further refinement and application of the framework to analyze diverse patient samples in the clinic.

We show that early perturbation could be detected based on the transcriptome state-space and that early transcriptome trajectory could predict disease development using our mathematical model of state-transition. In the future, our state-transition dynamical model could be applied in the clinic to support proactive monitoring that detects early perturbations away from a reference state in a patient/individual over time, adding a mathematical and theoretical component to recent reports of longitudinal deep-sequencing to monitor changes in health over time(Schüssler-Fiorenza Rose et al., 2019). The reference state could be a healthy state prior to disease or could be a diseased state, in which case the orthogonal direction would be interpreted as transition back to a state of health upon therapeutic intervention. To this end, possible applications include monitoring of patients who achieved complete remission post-treatment and are closely followed for relapse, or patients with clonal hematopoiesis of indeterminate potential (CHIP) undergoing chemotherapy or radiation therapy for unrelated cancer and who are at high risk of developing therapy-related myeloid neoplasms(Bowman et al., 2018; Genovese et al., 2014; Sperling et al., 2017). This approach could also be applied to evaluation of antileukemia therapies, and to predict achievement and duration of treatment response at early time points. Therefore, our state-transition analysis approach has potential applications for prospective monitoring in disease prevention, early detection, and predicting response to therapy and disease relapse not only in leukemia but also in other types of cancer.

## Author Contributions

Conceptualization, R.C.R, S.B., Y-H.K and G.M.; Methodology, R.C.R, S.B., Y-H.K, D.F, D.O’M., H.W., D.M, X.W.; Investigation, J.Q., W-H.K, G.J.C, E.C, L.Z, A.M; Resources, Y-C.Y., Z.L.; Formal Analysis, R.C.R, S.B., Y-H.K; Writing – Original Draft, R.C.R and Y-H.K; Writing – Review & Editing, L.D.W, S.J.F, N.C., G.M.; Funding Acquisition, R.C.R, Y-H.K and G.M.

## Declaration of Interests

The authors declare no competing interests.

## STAR Online Methods

### Mice

To induce expression of CM, *Cbfb*^{+/56M}/*Mx1-Cre* C57BL/6 mice (4-8 weeks old, including both females and males) were injected with 14 mg/kg poly (I:C) (InvivoGen, San Diego CA) every other day for 7 doses. Similarly treated, wild-type, *Cbfb*^{+/56M} or *Mx1-Cre* littermates were used as controls. As previously described(Cai et al., 2016), induction of CM expression results in expansion of pre-leukemic hematopoietic stem and progenitor cell (HSPC) populations in the bone marrow and subsequent development of overt leukemia characterized by >20% cKit+ leukemia blasts, increased white blood cell counts, and splenomegaly. Peripheral blood samples were taken via the retro-orbital venous sinus before induction and monthly thereafter for the duration of the experiment. All mice were maintained in an AAALAC-accredited animal facility, and all experimental procedures were performed in accordance with federal and state government guidelines and established institutional guidelines and protocols approved by the Institutional Animal Care and Use Committee at the Beckman Research Institute of City of Hope.

### Flow cytometry

PBMCs were stained in PBS with 0.5% BSA for 15 min on ice with fluorescently labeled antibodies for cell surface markers, including cKit, CD3, B220, CD11b, CD11c, CD71, Ter119. Phenotypic HSPC populations were defined as LSK (Lin^{-}ckit^{+}Sca1^{+}); MP (Lin^{-}ckit^{+}Sca1^{-}); Pre-GM (Lin^{-} ckit^{+}Sca1^{-}CD41^{-}CD16/32^{-/lo}CD105^{-}CD150^{-}); GMP (Lin^{-}ckit^{+}Sca1^{-}CD41^{-}CD16/32^{+}CD150^{-}); Pre-Meg/E (Lin^{-}ckit^{+}Sca1^{-}CD41^{-}CD16/32^{-/lo}CD105^{-}CD150^{+}). All antibodies were purchased from BioLegend, BD Biosciences, or eBiosciences. Flow cytometry was performed using a 5-laser, 15-detector BD LSRII or LSRFortessa. Data analysis was performed using FlowJo (Tree Star, Ashland OR).

### RNA extraction, sequencing, and bioinformatics

Peripheral blood samples were accrued for all timepoints and allocated to randomized batches for RNA extraction. Total RNA was extracted from whole blood using the AllPrep DNA/RNA Mini Kit (Qiagen, Hilden, Germany); quality and quantity were estimated using a Bioanalyzer (Agilent, Santa Clara, CA). Samples with a RIN > 4.0 were included. External RNA Controls Consortium (ERCC) Spike-In Control Mix (Ambion, Foster City, CA) was added to all samples per the manufacturer’s recommendations, although these were not used for downstream analyses. Samples were allocated to randomized batches for library preparation, such that samples from each timepoint were distributed evenly over all sequencing runs. Sequencing libraries were constructed using the KapaHyper with RiboErase (Kapa Biosystems, Wilmington, MA), loaded onto a cBot system for cluster generation, and sequenced on a HiSeq 2500 (Illumina) with single-end 51-bp reads for mRNA-seq to a nominal depth of 40 million reads. To mitigate batch effects, samples were randomly assigned to one of eight flow cells such that each flow cell contained a sample from at least one mouse and one timepoint. Image processing and base calling were conducted using Illumina’s Real-Time Analysis pipeline.

Raw sequencing reads were processed with the nf-core RNASeq pipeline version 1.0 (Ewels et al., 2018). Briefly, trimmed reads were mapped using Spliced Transcripts Alignment to a Reference (STAR)(Dobin et al., 2013) to the GRCm38 reference amended with ERCC sequences and the human *MYH11* fusion gene sequence. Each library was subjected to extensive quality control, including estimation of library complexity, gene body coverage, and duplication rates, among other metrics detailed in the pipeline repository(Ewels et al., 2018). Reads were counted across genomic features using Subread featureCounts(Liao et al., 2014) and merged into a matrix of counts per gene for each sampled timepoint. *CM* fusion transcript expression was measured by counting reads that spanned the *CM* fusion boundary. Surrogate variable analysis was used to check for confounding experimental effects(Leek et al., 2018). None were apparent (data not shown); however, library preparation batch was used as a covariate in differential expression analyses using the Bioconductor package edgeR(Robinson et al., 2009) as implemented in SARTools (Dobin et al., 2013).

### Supplemental methods and information

In this supplement we provide additional data to support our methods and results. In particular, we: show details of our mouse model of AML; demonstrate the robustness of our state-space construction to normalization methods and gene selection criteria; compare dimension reduction methods (PCA, Diffusion Mapping, t-SNE, MDS); show bootstrap cross-validation of our predictions; include detailed illustrations of individual state-transition trajectories and potential energy landscapes; compare hierarchical clustering with our state-space approach; and finally, provide full lists of differentially expressed genes and gene ontology (GO) term enrichment. Each of the concepts is illustrated with a figure.

### Uncertainty quantification and sensitivity analysis

To evaluate the sensitivity of our state-space construction to variations in sample frequency, number of reads per gene, and number of timepoints, each of these quantities was varied independently (Figures S4-S6). Sample frequency was assessed by varying the gene inclusion criteria from 5 counts in at least 2 samples, to 5 counts in each of 5, 10, 30, 50, 70, 90, 110, 120, and all 132 samples. The most permissive criterion (5 counts in 2 samples) resulted in 21,482 genes, and the most restrictive criterion (5 counts in all 132 samples) resulted in 8,995 genes. The number of reads per gene was assessed by varying the log of the counts per million reads, log2(cpm), threshold in increments of 0.01, 0.05, 0.5, 5, 1, 3, 5, 7, 10, 15, and 20 for each of the sampling frequencies, resulting in 100 combinations. For a subset of combinations, the effect of normalization methods [e.g., trimmed means of M values (TMM), relative log expression (RLE), upper quartile (UQ), and RLE with ComBat regularization] on the state-space was examined. Sampling frequency during leukemia evolution was assessed by performing a leave-“x”-out cross-validation technique, in which x=70 samples were randomly identified and removed from the data set. The remaining 62 samples were used to predict the positions in the state-space for the 70 removed samples. This cross-validation procedure was performed 100 times and the absolute difference between the true state-space position and the predicted position was computed. The state-space was also constructed with and without the *CM* and *Kit* genes. There was no difference in state-space positions when the state-space was constructed without these genes, up to machine precision, 1×10^{-14} (Figure S9).
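The gene inclusion criterion can be expressed compactly, as in the sketch below. It assumes that "5 counts" means at least 5 raw reads in a sample; the function name `filter_genes`, its parameters, and the toy count table are hypothetical and are not taken from the analysis code.

```python
def filter_genes(counts, min_count=5, min_samples=2):
    """Keep genes with at least `min_count` reads in at least `min_samples` samples.

    counts: dict gene -> list of raw read counts, one entry per sample.
    """
    return sorted(
        gene for gene, row in counts.items()
        if sum(c >= min_count for c in row) >= min_samples
    )


# Toy count table with four samples (invented numbers).
toy_counts = {
    "geneA": [10, 7, 0, 6],
    "geneB": [5, 0, 0, 0],
    "geneC": [9, 9, 9, 9],
}
permissive = filter_genes(toy_counts, min_samples=2)  # most permissive setting
strict = filter_genes(toy_counts, min_samples=4)      # counts required in all samples
```

Sweeping `min_samples` from 2 up to the total number of samples mirrors the range of criteria described above, from most permissive to most restrictive.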

### Leukemia eigengene selection

Due to the large number of differentially expressed genes at the leukemic endpoint (c_{3}), genes were filtered based on the angle subtended (*θ*) by the gene in the 2-dimensional state-space. The range of angles (*θ*) identified as being associated with leukemia evolution was defined as the minimal sector of a circle centered at (0,0) that included all leukemic samples as well as the mirror image of this sector along the x-axis of symmetry (Figure S19).
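A minimal sketch of this angular filter is given below. It assumes that gene positions and leukemic sample positions are expressed in the same 2-dimensional coordinates, that angles are measured with `atan2` from the positive x-axis, and that the sector does not straddle the ±π branch cut; the function names and all coordinates are hypothetical.

```python
import math


def leukemia_sector(sample_xy):
    """Smallest angular range (radians) containing all leukemic samples."""
    angles = [math.atan2(y, x) for x, y in sample_xy]
    return min(angles), max(angles)


def in_sector_or_mirror(gene_xy, sector):
    """True if the gene angle lies in the sector or its reflection about the x-axis."""
    lo, hi = sector
    theta = math.atan2(gene_xy[1], gene_xy[0])
    return lo <= theta <= hi or -hi <= theta <= -lo


# Invented coordinates for three leukemic samples and four genes.
leukemic_samples = [(1.0, 1.0), (0.5, 2.0), (-0.2, 1.5)]
genes = {"g1": (0.1, 0.3), "g2": (0.2, -0.4), "g3": (1.0, 0.1), "g4": (-1.0, -0.1)}

sector = leukemia_sector(leukemic_samples)
selected = sorted(g for g, xy in genes.items() if in_sector_or_mirror(xy, sector))
```

In this toy example, `g1` falls inside the sector spanned by the leukemic samples and `g2` falls inside its mirror image, so both are retained, while `g3` and `g4` point away from the leukemia direction and are removed.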

## SUPPLEMENTAL DATA FIGURES

Table S1. Differentially expressed genes for *c*_{1} vs

Table S2. Differentially expressed genes for *c*_{2} vs

Table S3. Differentially expressed genes for *c*_{3} vs

Table S4. Differentially expressed genes for *c*_{2} vs *c*_{1}

Table S5. Differentially expressed genes for *c*_{3} vs *c*_{1}

Table S6. Differentially expressed genes for *c*_{3} vs *c*_{2}

## Acknowledgments

This work was supported in part by the National Institutes of Health under award number R01CA178387 (to Y.-H.K) and R01CA205247 (to Y.-H.K/G.M.) and the Gehr Family Center for Leukemia Research. Research reported in this publication included work performed in the Integrated Genomics Core, Bioinformatics Core, Analytical Cytometry Core, and Animal Resource Center supported by the National Cancer Institute of the National Institutes of Health under award number P30CA33572. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

## Footnotes

\* Lead contacts

This revision includes geometric analysis of genes and pathways in the 2D transcriptome/eigengene state-space, as well as the addition of 2 validation cohort experiments.