Ergodic patterns of cell state transitions underlie the reproducibility of embryonic development

The reproducibility of embryonic development is a remarkable feat of biological organization, but the underlying mechanisms are poorly understood. Clearly, gene regulatory networks are central to the orderly progression of development, but noisy molecular and cellular processes should reduce reproducibility. Here, we identify ergodicity, a type of dynamical stability, as underlying the reproducibility of development. In ergodic systems, a single timepoint measurement equals a time average. Focusing on the zebrafish tailbud, we define gene expression and cell motion states using a parallel statistical analyses of single cell RNA sequencing data and in vivo timelapse cell tracking data and a change point detection algorithm. Strikingly, the cell motion state transitions in each embryo exhibit the same patterns for both a single timepoint and a 2-3 hour time average. Both the cell motion and gene expression cell states exhibit balanced influx and outflux rates reflecting a spatiotemporal stability. Stated simply, these data indicate the pattern of changes in the tailbud doesn’t change. This ergodic pattern of cell state transitions may represent an emergent meta-state that links gene networks to the reproducible progression of embryogenesis.


Introduction 1
Aristotle first noted the astonishing reproducibility of embryogenesis in "Historia Animalium," 2 where he observed, "Generation from the egg occurs in an identical manner in all birds". At the 3 cellular level, development entails a reproducible series of cell state transitions representing 4 changes in gene expression state, physical state and cell fate. These processes can be noisy, for 5 example, cell migration can be either ordered or disordered, and such disorder is part of normal 6 orderly development . We now appreciate that gene networks control cell state transitions, but 7 these networks are comprised of stochastic molecular processes . How biological order emerges 8 from stochastic molecular events was the subject of Erwin Schrödinger's "What is Life? ". Despite 9 the remarkable progress in the field of developmental biology in recent decades, there is still a 10 gap in our understanding of the organizing processes that lie between genes and the reproducible 11 dynamics of developing embryos. 12 One framework for analyzing the reproducibility of embryonic development is that of ergodicity. 13 An ergodic system is one in which the average behavior of all objects at a single timepoint equals 14 the average behavior of a random sample of objects over a longer time interval. For example, 15 measurement of all gas molecules within a chamber at a single time point yields the same result 16 as the average of a random sample of gas molecules over the entire experimental time interval. 17 The ergodic hypothesis lacked a mathematical foundation until the development of the ergodic 18 theorem in 1932 . In the field of ergodic theory, the single timepoint average is typically referred 19 to as the "phase average". Ergodicity is a mathematical ideal and real systems are not truly 20 ergodic. Embryonic development is by definition not ergodic since the embryo changes as it 21 develops, but it is possible for ergodicity to exist over a short period of time. Ergodicity is implicitly 22 assumed in many biological experiments, yet it is rarely demonstrated. For example, ergodicity 23 of gene expression levels in clonal cell populations is only observed when accounting for cell age. 24 a neuronal-mesodermal axis from sox3 expressing neuronal cells to mespaa expressing anterior 1 PSM cells (Fig. 1B, arrow) . This approach avoids the requirement to define the NMP population 2 a priori. Instead, NMPs will be located in the middle of the pseudotime sequence and 3 differentiation will proceed towards both ends, i.e. neuronal to the left and mesodermal to the right 4 ( Fig. 1C). Marker genes for neuronal and mesodermal development map with respect to 5 pseudotime in the correct developmental sequence indicating that the procedure was successful. 6 To objectively define gene expression states, we extracted the wild-type data, and then utilized a 7 change point detection algorithm to divide pseudotime into a series of distinct states . The change 8 point algorithm identified five transition points (Fig. S2). These transition points (vertical lines in 9 Fig. 1C) divide the pseudotime sequence into six states that generally agree with those predicted 10 previously from marker gene expression . These transition points were mapped to the full 11 pseudotime sequence, and we calculated the relative abundance of each state in wild type, Wnt 12 inhibited embryos, Fgf inhibited embryos and Bmp inhibited embryos (Fig. 1D). 13 To determine whether this analysis of scRNAseq data accurately quantifies changes in cell state, 14 we mapped the transcriptional states back onto the embryo and measured their abundance using 15 simultaneous multicolor fluorescent in situ hybridization for marker genes for the first five states 16  To validate the scRNAseq analysis, we chose to test the predictions of changes in the abundance 22 of neuronal and PZ states. First, the scRNAseq predicts that Wnt inhibited embryos would have 23 more neuronal cells (Fig. 1E). This is consistent with reports that elimination of Wnt signaling 24 leads NMPs to exclusively adopt a neuronal fate . In our milder perturbation of Wnt signaling, 1/3 25 of embryos have an abnormal cap of neuronal tissue covering the embryos' posterior, confirming 1 the scRNAseq results (Fig. S3). 2 A second prediction of the scRNAseq analysis is that the PZ is smaller in BMP and Wnt inhibited 3 embryos but not in embryos subject to FGF inhibition. To test this prediction, we performed 4 fluorescent in situ hybridization for a PZ marker, tbx16, and a PSM marker, tbx6 (Fig 2B). In wild-5 type, BMP and FGF inhibited embryos, the tbx16 and tbx6 signal was measured along the 6 anterior-posterior axis of the embryo for both the left and right sides (Fig. 2C). The PZ/PSM 7 transition was set to the value derived from the scRNAseq analysis (20% of the maximum value 8 of tbx6) and then the PZ length was normalized to the total tailbud length. Consistent with the 9 scRNAseq analysis, BMP but not FGF inhibited embryos exhibited a decrease in PZ length ( Fig.  10 2D). Due to the bent body axis exhibited by the majority of Wnt inhibited embryos, the area of the 11 PZ and PSM were quantified. As predicted, Wnt inhibited embryos have a smaller PZ (Fig. 2E). 12 Thus, this approach to analyzing scRNAseq data accurately identifies cell states that can be 13 quantitatively mapped back onto the embryo. 14

Cell motion states 15
We hypothesized that the same computational techniques used to classify gene expression states 16 could be applied to cell motion data to objectively define the cell motion states (Fig. 3A). For this 17 purpose, we used tracking data from confocal timelapse imaging of cells in the DM through PSM 18 collected over 1-3 hours in wild-type embryos and embryos subject to signaling perturbations . As to facilitate pooling of data from multiple embryos together and to make the analysis analogous 24 to that of the scRNAseq data which had all spatial information removed by cell dissociation. This 25 procedure is successful solely using the statistics for cell velocity, average neighborhood cell 1 speed within 20 micron radius of each cell, acceleration, and displacement over 6 and 15 minutes 2 (Fig. 3B). The change point detection algorithm classifies the cells into four cell motion states (Fig.  3   S4). These states are roughly segregated in space and their sequence matches the known 4 developmental trajectory. Thus, cell migration states can be considered analogous to gene 5 expression states. 6 An ergodic pattern of cell motion states 7 The cell tracking data includes cell position, and we postulated that utilizing this information would 8 improve the cell state segmentation. We therefore created a cell state map for each embryo using 9 cell position and cell track displacement as inputs for pseudotime assembly (Fig. 3C, 3D and S5). 10 These pseudotime sequences were then segmented based on the aforementioned cell motion 11 statistics. This approach cleanly segmented the embryo into four cell states. As each embryo 12 contains tens of thousands of data points, we plotted only a sample of the data from either a single 13 time point, i.e. a phase average (Fig. 3C), or an identically sized selection randomly chosen from 14 all time points, i.e. a time average (Fig. 3D). The distribution of states is extremely similar in both 15 plots which is indicative of an ergodic system. The stability of the cell state pattern is evident in a 16 movie generated using the average of each timepoint of our longest wild-type dataset (Movie S1). 17 To obtain further evidence of ergodicity, we measured the cell abundance in each state over time 18 as well as the influx and outflux from these states. The expectation is that the size of these 19 domains would remain constant, and the fluxes would balance. We focused on the two states in 20 the middle of the sequence, the PZ and posterior PSM, since we have both their complete influx 21 and outflux data. Interestingly, the abundance and dynamics of these states can vary substantially 22 from embryo to embryo and treatment to treatment, but the fluxes are balanced in each embryo 23 ( Fig. 3E and F). The balanced fluxes would help maintain the ergodic pattern of cell state 1

transitions. 2
Given the stability of the migration state transitions, we wondered whether the transitions between 3 gene expression states were also ergodic. Since RNA sequencing is an endpoint assay that does 4 not readily lend itself to the calculation of time averages, we utilized RNA velocity, which considers 5 the relative amount of intron and exon RNA for each gene, to estimate the flux between states . 6 As expected, the overall RNA velocity is directed down the path of mesoderm differentiation ( Fig.  7 4A and S6). Flux was calculated as the proportion of cells that transitioned to a different state 8 ( Fig. 4B and S7). For wild-type, the influx generally matches the outflux suggesting that the size 9 of these domains is stable and that the patterns of cell state transitions may be ergodic. However 10 unlike in the cell motion states, the balance between influx and outflux can be altered by 11 perturbation of cell signaling. 12 A batch of sibling embryos at roughly, but not exactly the same stage in development, produce 13 very reproducible patterns of gene expression (Fig. 4C). In the tailbud, the consistency of this 14 pattern is remarkable given the dynamics of cell motion that are driving elongation of the body 15 axis. However, if the pattern of state transitions in cell motion is ergodic, it is not surprising that 16 the gene expression patterns are also likely ergodic. 17

Discussion 18
The ergodic pattern of cell state transitions may represent an emergent level of biological order 19 that mediates gene network actuation of the stereotypical progression of embryogenesis. Our 20 parallel analysis of gene expression and cell migration states using dimensional reduction and a 21 change point detection algorithm demonstrates that these cell state transitions can be objectively 22 defined and mapped back onto the embryo. While any time series dataset is well suited for an 23 analysis of ergodicity, starting with cell state identification enables detection of ergodicity in 1 complex datasets and reveals higher order ergodic patterns. 2 Ergodicity normally refers to a single stable state in which a dynamical system resides for a given 3 amount of time . The length of time that a system remains in this state is referred to as the sojourn 4 time. In this study, ergodicity refers to a pattern of successive cell states that remains stable over 5 a period of 2-3 hours. Thus counterintuitively, this ergodicity does not mean that the tailbud 6 doesn't change, but that the pattern of changes doesn't change. The ergodic pattern of cell state 7 transitions could be thought of as a "meta-state". 8 This biological ergodicity is a dynamic order that arises from the genome and the biochemical and 9 physical interactions among cells in space and time. This ergodicity is dependent upon the length 10 of the time interval being studied. If one were to combine data from a gastrula with data from an 11 embryo during body elongation, then there would likely be no ergodicity. Thus, there is a sojourn 12 time for a given pattern of cell state transitions that will scale with the developmental process YK led the data analysis, performed data interpretation and wrote the manuscript, DJ contributed 8 to the wet lab experiments, HK interpreted data and supervised data analysis, and SAH conceived 9 of and supervised the project, interpreted data and wrote the manuscript. 10

Declaration of Interests 11
The authors declare no competing interests. 12

Materials and Methods 13
Data and code availability 14 The scRNAseq data has been archived at NCBI GEO (accession no: GSE173894). 15

Zebrafish methods 16
Tüpfel-longfin zebrafish were raised according to standard protocols approved by the  Full-length, barcoded cDNA was then amplified by PCR to generate sufficient mass for library 1 construction. Enzymatic fragmentation and size selection were used to optimize the cDNA 2 amplicon size prior to library construction. R1 (read 1 primer sequence) were added to the 3 molecules during GEM incubation. P5, P7, a sample index, and R2 (read 2 primer sequence) 4 were added during library construction via End Repair, A-tailing, Adaptor Ligation, and PCR. The while Read 2 is used to sequence the cDNA fragment. Sequencing a Single Cell 3' Library 10 produces a standard Illumina BCL data output folder. The BCL data includes the paired-end Read 11 1 (containing the 16 bp 10x Barcode and 10 bp UMI) and Read 2 and the sample index in the i7 12 index read. 13

Preprocessing of scRNA sequencing data 14
We aligned the scRNA-seq data to Grcz11 and demultiplexed using Cell Ranger (10X 15 Genomics). After the generation of expression matrices for each sample, we utilized Seurat v3 16 for preprocessing and clustering of scRNA-seq data. First, we excluded cells with an ectopic 17 number of genes or exceeding a specified percentage of mitochondrial genes (Table S1) based 18 on visual inspection for the distribution of these statistics. After the filtering genes, we conducted 19 integration following Seurat's SCTransform integration. 20 We applied principal components analysis and embedded the 30-dimensional PCA coordinates 21 into 2 dimensional UAMP. We clustered cells by Seurat function "FindClusters" with a resolution 22 parameter of 0.5. 23

Pseudotime estimation of scRNA-seq 24
To recover cell state dynamics encoded in the gene expression data, we ordered a subset of 1 scRNA-seq cells which belong to the axis from ADM to PSM so that its ordering recapitulates the 2 developmental trajectory during body elongation. In particular, we embedded the z-scores of 30 3 dimensional PCA coordinates of cells belonging to specified clusters (Sox3+, Sox2+, DM, PZ, 4 pPSM and aPSM) into one dimensional UMAP coordinates. Here, we expected that the most 5 variable axis within gene expression space during this process would be the developmental 6 trajectory. For UMAP embedding, we used the "umap-learn" package in Python and set 7 "n_neighbors" as 400 and "min_dist" as 0.1. 8

Segmentation of scRNA-seq pseudotime 9
We segmented the pseudotime trajectory of scRNA-seq into several segments within which each 10 cell c has similar z-scores of 30-dimentional PCA coordinate ! in order to dissect the dynamics 11 along the progression of cell state transitions during zebrafish body elongation. We utilized a 12 Bayesian algorithm of change detection to find break points of segments " ( = 1, … , ) which 13 minimize the total error from the mean of profile of the segment ∑ " and ! is the discretized rank of the 15 estimated pseudo time. We discretized the pseudotime rank into 30 bins for computational 16 efficiency. We determined K as 5 scRNA-seq data using the elbow method which chose a 17 saturation point along the group variation curve as function of the number of clusters. 18

Flux analysis based on RNA velocity 19
We recovered cell state dynamics behind scRNA-seq data using scVelo which estimates the 20 velocity of RNA for each single cell Using a computed velocity vc of cell c with its single cell 21 transcriptome xc, we calculated the predicted transcriptome after a micro duration as ! ′ = ! + 22 ! . We set δ so that 3% of transcriptome xc changed during δ. We conducted PCA analysis on 23 a concatenated expression matrix of current and δ-elapsed transcriptome and used the 10-24 dimensional PCA coordinates of cell c at current and δ-elapsed time points, which we denoted as 1 ! and ! ′ We estimated the segment ′ ! which cell c after is belonging to as the current 2 segment ! ′ of the cell ′ whose PCA coordinates ! ′ are the nearest to the predicted δ-3 elapsed PCA coordinates ! ′ . We quantified the transition "," ′ RNA from segment k to segment ′ 4 as the count of cells whose current and δ-elapsed segments are k and ′ respectively. The 5 normalization is done for the total number of cells within the segment. We also defined the influx 6 rate and outflux rate of segment k as the normalized summation of transition from any segments 7 to k and from k to any segments and defined them as " respectively, where " ,+is number of cells in segment . 9

Estimation and segmentation of cell movement pseudotime 10
We recovered the pseudo dynamics of the cell movement properties by pseudotime estimation information each embryo was processed separately. We embedded the z-scores of 3D spatial 18 position and 3D displacement during 15 minutes into one dimensional order using UMAP. For 19 UMAP embedding, we used the "umap-learn" package in Python and set "n_neighbors" as 100 20 and "min_dist" as 0.1. We segmented both types of cell movement pseudotimes using the same 21 methodology for the segmentation of scRNA-seq pseudotime, except that the properties ! for 22 minimizing within-group variation are the z-scores of the speed, acceleration, magnitude of 23 neighborhood velocity, and displacement distance for 6 minutes and 15 minutes. We specified 24 the number of change points as 3 using the elbow method which chose a saturation point of 1 along the group variation curve as function of the number of clusters. 2

Flux analysis between cell movement segments 3
We quantified the flux between segments in cell movement data utilizing cell tracking information. 4 We counted the cells which stay at segment k at a time point t and transit to segment ′ at the 5 subsequent time point t + 1. In the same way as the previous section, we defined influx rate and 6 outflux rate of segment k as the normalized summation of transition from any segments to k and 7 that from k to any segments and defined them as 0," mov = ∑   Preprocessing of the microscopy images was done using ImageJ. The sox2 and tbx6 channels 1 were subtracted from each other to eliminate bleed through. Images were rotated to a consistent 2 orientation and a max intensity projection was created. Adaxial cells were identified in the DAPI 3 channel and manually removed from the image. The midline separating the embryo into left and 4 right halves was identified manually. Subsequent quantification was performed in Matlab. The 5 image was smoothed with a Gaussian filter, and the region of interest was thresholded using 6 Otsu's algorithm on both the tbx16 and tbx6 channels. For wildtype, BMP, and FGF inhibited John D'Errico. The mean intensity along the axis was calculated using a sliding window. The 20 same thresholds were used for the PZ and PSM as described previously. The boundaries for 21 these regions were taken to be a perpendicular dropped from the axis at the cutoff point. The 22 scaled PZ area was the PZ area divided by the area of the PZ plus PSM. 23 Statistics were calculated using Mann-Whitney's U.   Fig. S3.  Movie S1. Ergodic pattern of cell state transitions. Related to Figure 3. A timelapse was 2 generated using the phase average for each timepoint in the longest wild-type dataset. The 3 states were segmented using position and displacement.