Data-theoretical Synthesis of the Early Developmental Process

Biological development is often described as a dynamic, emergent process. Yet beyond the observation of gene expression in individual cells, it is hard to conceptualize large-scale patterns that confirm this description. We provide an example of combining theoretical insights with a data science approach. The availability of quantitative data allows us to examine aggregate trends across development, from the spatial organization of embryo cells to the temporal trends as they differentiate. The first half of this paper lays out alternatives to the gene-centric view of development: namely, the view that developmental genes and their expression determine the complexity of the developmental phenotype. Caenorhabditis elegans biology provides us with a highly-deterministic developmental cell lineage and clear linkage between zygote and cells of the adult phenotype. These properties allow us to examine time-dependent properties of the embryonic phenotype. We utilize the unique life-history properties of C. elegans to demonstrate how these emergent properties can be linked together by relational processes and data analysis. The second half of this paper focuses on the process of developmental cell terminal differentiation, and how terminally-differentiated cells contribute to structure and function of the adult phenotype. An analysis is conducted for cells that were present during discrete time intervals covering 200 to 400 minutes of embryogenesis, providing us with basic statistics on the tempo of the process in addition to the appearance of specific cell types and their order relative to developmental time. As with ideas presented in the first section, these data may also provide clues as to the timing for the initial onset of stereotyped and autonomic behaviors of the developing animal. Taken together, these overlapping approaches can provide critical links across life-history, anatomy and function.


Introduction
The understanding of development as a dynamic, emergent process stands at odds with our current understanding of large-scale developmental patterns. While there have been attempts to characterize these patterns using physical laws [1,2], accessing these phenomena with formal logical descriptions is also useful for purposes of both modeling and connections to molecular mechanisms. This will be done here through data science and the theory of relational biology. Therefore, this paper features a conceptualization of the relationship between cell differentiation, alternatives to a gene-centric view, and relational processes. Viewing biological development as a series of relational processes [3] provides a means to understand both causality in the absence of mechanism [4] and anticipatory systems [5]. Relational processes are built upon both a causal structure and categories that help us understand the relationship between structure and function. In this paper, we will build towards a formalism that bridges patterns observed as cells differentiate in an embryo. This will bring us closer to understanding the developmental process as an anticipatory system, or as an adaptive system that is prepared to incorporate multicellular phenomena such as the emergence of adult behaviors, adult plasticity, and even species-level evolvability.
Understanding embryogenetic systems in this way is not attainable using a gene-centric approach. To overcome this limitation of more traditional developmental biology research, we have engaged in work that demonstrates some of these relational processes during embryogenesis. We draw from the concept of a differentiation tree [6,7] to make connections between the developmental process and the emergence of the adult phenotype. In nematodes, differentiation trees are a means to relate the binary, mostly asymmetric cell divisions to the broader context of embryonic tissue differentiation. This provides us with a means to explore how developmental cell lineages correspond with specific genotypes, as well as changes that demonstrate the evolution of development. This paper will proceed by introducing the reader to cellular-level alternatives to reductionism in the study of development, an analysis of differentiation into terminal adult cell types, and an analysis of early development. We assume that a temporal analysis of early stages in the differentiation process (in this case, 200 to 400 minutes of C. elegans embryogenesis) can reveal much about the emergence of larger-scale processes and structures in the developmental phenotype [8,9]. The first and second points provide a means to better understand the connection between developmental cell lineages and their differentiated descendant cells. A discussion and synthesis of this analysis is presented, followed by the potential use of relational theory to interpret these results.

Cellular-level Alternatives to Gene-centrism
Our approach is based on quantitative characterization and phenomenological modeling of development. In this paper, both digital approaches to morphogenesis and cellular-level models serve as alternatives to a gene-centric approach. Through the use of cellular-level computational models, we can account for short-range interactions such as paracrine signaling and physical interactions. We can also combine the results from simulation and primary data sets into a model of selective interactions between cells and regions of the embryo. Our work on establishing an "interactome" for C. elegans embryos [10] is an example of the value of such local-to-global information. There is also value in establishing frameworks for multiple types of data, which might lead to inference or insight down the road. The availability of both molecular and cellular data at the single cell level in C. elegans provides a unique opportunity to ask questions such as how the physiology of embryogenesis unfolds in space.

Analysis of Development
In this section, we will discuss current initiatives and future directions in the analysis of development. This includes a discussion of developmental cell organization, an overview of developmental cell lineage and differentiation trees, segmentation/partitioning of imaging data, and the extraction of developmental dynamics.
Developmental Cell Organization. C. elegans has a mode of development called mosaic development. While this is different from embryonic regulative development in amphibians and mammals, in which many cells appear to have equivalent roles [11], there are many other examples of mosaic development throughout the tree of life. Mosaic development is a process whereby most developmental cells have determined fates. After the initial cleavages in C. elegans, there are six founder cells (AB, C, D, E, MS, P4) which go on to produce specialized lineages of cells with no variation across individuals. These sublineages contribute to various tissue and anatomical structures in the adult worm. C. elegans is eutelic, which means that there is a fixed number of somatic cells in the adult.
Developmental Cell Lineage Tree. The C. elegans lineage tree [12] describes the lineal order of descent for all developmental cells from the one-cell stage to terminal differentiation or cell death. The lineage tree is ordered along the anterior-posterior axis of the worm [13], and describes the lineage of descent leading to all cells in the adult worm. Sublineages (descendants of the founder cells) consist of multiple layers of cells, which diversify at fixed times before becoming terminally differentiated cell types, also at fixed times. The timings of these division events are rather uniform across layers of the tree, although there are some notable exceptions.
Developmental Cell Differentiation Tree. Lineage trees have been proven to be adequate data structures for organizing information about developmental cell descent. However, other intriguing sets of relationships between developmental cells exist, and require different modes of data organization and analysis. Alternative methods include meta-Boolean models [14], complex networks [10], algorithmic complexity [15], and scale-invariant power laws [16]. One method that relies upon simply reorganizing the lineage tree by the occurrence of differentiation waves is called differentiation tree analysis [11].

Pre-Hatch Morphogenesis and Timepoints
Pre-hatch morphogenesis in C. elegans is the period from fertilization to 400 minutes of embryogenesis at 25 °C. We begin sampling intervals at 200 minutes, at which time the only terminally differentiated cells are the germ cells. Our sampled time points are at intervals of 5 minutes from 200 to 300 minutes of embryogenesis, and at intervals of 50 minutes between 300 and 400 minutes of embryogenesis. This gives us a total of 23 time points: 21 points over the 200 to 300 minute interval, and two points post-300 minutes (350 and 400). Many of the major terminal cell types emerge between 200 and 300 minutes, making sparse sampling of the post-300 minute period adequate.
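The sampling schedule above can be written out explicitly. The following is a minimal sketch in Python; the variable names are ours and are not part of the original analysis.

```python
# Sampled time points (minutes of embryogenesis), following the schedule in the text:
# every 5 minutes from 200 to 300, then every 50 minutes up to 400.
dense = list(range(200, 301, 5))     # 200, 205, ..., 300
sparse = list(range(350, 401, 50))   # 350, 400
time_points = dense + sparse

print(len(dense), len(sparse), len(time_points))
```

Both endpoints of the 200 to 300 minute interval are included, which is why that interval contributes 21 points rather than 20.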

Timed Cell Lineage Data
Timed cell lineage data were acquired courtesy of Nikhil Bhatla and his lineage tree application (http://wormweb.org/celllineage). Cells represented in an embryo at a given time are determined by first calculating the lifespan of each cell in the lineage tree (i.e. the time at which each cell is born and the time at which it either divides or dies), and then identifying all cells alive at a given time. Terminally-differentiated cells were assumed never to die, unless specified by the data.
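The lifespan-based selection of cells can be illustrated with a short sketch. This is an illustration under stated assumptions, not the code used in the study: the Cell record, the cells_alive function, and the toy timings are hypothetical, and a cell is assumed to be present from its birth time up to (but not including) the time it divides or dies.

```python
from typing import NamedTuple, Optional

class Cell(NamedTuple):
    name: str
    birth: float           # time of birth, in minutes (toy values below)
    end: Optional[float]   # time of division or death; None if the cell persists

def cells_alive(cells, t):
    """Return names of cells born at or before t that have not yet divided or died."""
    return [c.name for c in cells
            if c.birth <= t and (c.end is None or t < c.end)]

# Toy lineage with made-up timings (not real C. elegans data).
toy = [
    Cell("P0", 0, 10),
    Cell("AB", 10, 25),
    Cell("P1", 10, 30),
    Cell("ABa", 25, None),   # treated as persisting, like a terminally-differentiated cell
    Cell("ABp", 25, None),
]
alive_at_26 = cells_alive(toy, 26)
print(alive_at_26)
```

Setting end to None mirrors the assumption that terminally-differentiated cells never die unless the data say otherwise.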

Cell Functional Annotations
The annotation of each terminally-differentiated cell's function was acquired courtesy of Stephen Larson and Mark Watts from the PyOpenWorm project (https://github.com/openworm/PyOpenWorm). Annotations were matched to each cell name using the Sulston nomenclature system [7], which resulted in a series of annotated cells (Ci) at time t. This also resulted in a function ti(x) for each cell (1, 2, ..., x). Text mining was then used to determine a given cell's functional class. The birth of functional classes over developmental time was determined using a binary classifier.

Functional Classes and Families
To look at differences within and between groups of terminally-differentiated cells, we used a two-tiered classification scheme. This consisted of functional classes and families. Functional classes are based on annotation identities, which are extracted as keywords found in the list of annotations. Families are groups of cells with the same first letter or prefix in their nomenclature identity (e.g. all cells with the nomenclature identity hyp belong to the same family). In the heat maps (Supplemental Figures 1-3), these categories are shown to largely overlap.
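The two-tiered scheme can be sketched as follows. This is a hypothetical illustration: the keyword list, the rule that the first matching keyword wins, and the use of the first character of the cell name as the family are our assumptions, not the procedure applied to the PyOpenWorm annotations.

```python
from collections import defaultdict

# Hypothetical keyword list. "interneuron" is listed before "neuron" because
# the latter is a substring of the former and would otherwise match first.
CLASS_KEYWORDS = ["interneuron", "neuron", "hypodermal", "muscle", "syncytium"]

def functional_class(annotation):
    """Return the first keyword found in the annotation text, else 'other'."""
    text = annotation.lower()
    for keyword in CLASS_KEYWORDS:
        if keyword in text:
            return keyword
    return "other"

def family(cell_name):
    """Family label: first character of the cell name, case-sensitive (an assumption)."""
    return cell_name[0]

def group_by_family(cell_names):
    """Group cell names by family label, preserving input order within each group."""
    groups = defaultdict(list)
    for name in cell_names:
        groups[family(name)].append(name)
    return dict(groups)

print(functional_class("Ring interneuron"))
print(group_by_family(["hyp7", "hyp4", "AIAL"]))
```

Keeping the family label case-sensitive separates, for example, families "h" and "A", which is consistent with the nomenclature identities referred to in the heat maps.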

CAST (Cell Alignment Search Tool)
The original methodology for the CAST alignment is described in [7]. In this analysis, we calculate pairwise CAST alignment for the current time point and the next time point. The CAST alignment yields an alignment score, which is divided by the maximum possible score to yield the CAST coefficient. The maximum possible score is equivalent to the length of the cell list for the next time point (the longer of the two cell lists in the pairwise comparison). The value of this coefficient can range from -1 to 1, and allows for a time-series of these pairwise comparisons to be compared.
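The normalization step can be sketched as below. Only the division of the alignment score by the length of the longer cell list follows the text. The scoring rule in toy_alignment_score (+1 for a positional match, -1 for a mismatch or gap) is a stand-in chosen so that the coefficient falls between -1 and 1; it is not the published CAST scoring, and all function names are hypothetical.

```python
from itertools import zip_longest

def toy_alignment_score(current_cells, next_cells):
    """Stand-in score (assumption): +1 per positional match, -1 per mismatch or gap."""
    score = 0
    for a, b in zip_longest(current_cells, next_cells):
        score += 1 if a == b else -1
    return score

def cast_coefficient(alignment_score, current_cells, next_cells):
    """Normalize an alignment score by the length of the longer cell list."""
    max_score = max(len(current_cells), len(next_cells))
    if max_score == 0:
        raise ValueError("both cell lists are empty")
    return alignment_score / max_score

# Toy functional-class series for two consecutive time points (made-up data).
current = ["hyp", "mu", "hyp"]
nxt = ["hyp", "mu", "neuron", "neuron"]
coefficient = cast_coefficient(toy_alignment_score(current, nxt), current, nxt)
print(coefficient)
```

Under this stand-in rule, identical lists give 1 and lists with no positional matches give -1, so coefficients from different pairs of time points share a common scale.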

Cluster Analysis and Information Content
A hierarchical cluster analysis was conducted using R version 3.3.1. The data were visualized using RStudio 0.99. The cluster vector matrix was extracted, transposed, and vectorized using Scilab 5.5.2. The cluster vector was then used to determine how many cells from each family (n=26) belong to each cluster. This allows the Shannon information for each cell family to be calculated.
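The information calculation for a single family can be sketched as follows. This is an illustration rather than the analysis code: the function name is ours, and the optional division by the logarithm of the number of clusters (which bounds the value at 1.0) is an assumption about how the values were scaled.

```python
import math
from collections import Counter

def shannon_information(cluster_labels, n_clusters=None):
    """Shannon entropy (bits) of the cluster memberships of one family's cells.

    If n_clusters is given, the result is divided by log2(n_clusters) so that
    it lies between 0 and 1 (a normalization we assume for illustration).
    """
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    if n_clusters is not None and n_clusters > 1:
        h /= math.log2(n_clusters)
    return h

# Toy example: cluster labels for the cells of two hypothetical families.
print(shannon_information([1, 1, 1, 1]))                 # all cells in one cluster
print(shannon_information([1, 2, 3, 4], n_clusters=4))   # cells spread evenly
```

A family whose cells all fall in one cluster carries no information in this sense, while a family spread evenly over many clusters approaches the maximum.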

Data and Visualization - Results
We conducted an analysis of publicly available data demonstrating the unfolding of adult morphology during embryogenesis. The first step in the analysis is to show the number of developmental and terminally-differentiated cells from 200-400 minutes. These data are available in tabular form for annotated nomenclature identities (Supplemental File 1) and for five distinct somatic cell types (Supplemental File 2). A more finely sampled demographic representation of the 200-300 minute interval is shown in Figure 1 and Supplemental Figure 1. Perhaps more surprising is that developmental cells are added along with an increasing number of terminally-differentiated cells until around 250 minutes of embryogenesis. At around the same time, there is an inflection point for developmental cell number and an increase in the number of terminally-differentiated cells in the embryo.
In general, Figure 1 also provides two critical pieces of information about developmental dynamics. Figure 1A shows that the number of cells increases 2.5-fold over the 200 to 300 minute interval. This finding also suggests a periodicity in the rate of expansion in the number of cells of the embryo. In Figure 1A, it appears that there are periods of relative stasis and periods where the rate of division and differentiation increases. One of these apparent periods of stasis is from 235 to 270 minutes for terminally-differentiated cells, and 245 to 270 minutes for all cells. The latter count includes both developmental and terminally-differentiated cells, so the difference in stasis time is likely due to changes in developmental cell number (Figure 1B). After 285 minutes, the C. elegans embryo is increasingly dominated by terminally-differentiated cells, as the number of developmental cells decreases. There are roughly the same number of developmental cells at the beginning and end of this time interval. However, in the middle of this interval (from roughly 230 to 285 minutes), there is an increase in the number of developmental cells. This is probably to feed the large increase in terminally-differentiated cells in the subsequent time periods (from roughly 285 to 350 minutes).
The heat map visualization gives us a rough guide to the amount of heterogeneity in each functional class with respect to time of birth. For some functional classes (nomenclature identity "h"), the birth of cells overwhelmingly occurs early in the 200 to 400 minute window of development. In other functional classes (nomenclature identity "i"), there is structured variation with respect to birth time. A third set of functional classes (nomenclature identities "A" and "M") also demonstrates variation in timing between cells. Supplemental File 3 shows the descriptive statistics for each family and functional class of cell present in the embryo up to 400 minutes of embryogenesis.
As they both represent cell types that form the emerging connectome, a comparison of neurons and interneurons in terms of their emergence time is warranted. In Supplemental Figure 4, we compare the joint distribution of emergence time for three types of differentiated cell (neurons, interneurons, and hypodermal) in two comparisons. In Supplemental Figure 4A, we directly compare neurons and interneurons. Supplemental Figure 4B shows an evaluation of interneurons and hypodermal cells in the same manner. In the case of Supplemental Figure 4A, neurons emerge in a bimodal fashion (with a majority of terminally-differentiated neurons being born from 290-400 minutes). By contrast, interneurons seem to almost always emerge after 280 minutes. Critically, there is an overlap in terms of terminal differentiation between the two cell types. This may reveal an interdependency between the two cell types. By contrast, Supplemental Figure 4B shows a difference in mode between interneurons and hypodermal cells, with their frequency of emergence being almost inverse with respect to the 200 to 400 minute time interval.
The "Interneuron" functional class in Figure 4 shows the phenomenon of structured variation in more detail. In the heat map, the emergence of cells at different points in time looks like jagged teeth across the cell identity (vertical) axis. This represents the birth of axial variants of the same cell type at slightly different points in time. Figure 5 shows the relationship between syncytium and muscle cells. For the most part, syncytium emerges earlier in time than do muscle cells. However, there is a group of embryonic body wall (mu bod) cells born just after the first wave of syncytia. More closely resembling the timing of neuronal cells, these body wall cells differentiate much earlier than the other embryonic body wall cells in our dataset.
Looking more closely at axial variants with the same identity, we can see that while some axial variants emerge at the same time (e.g. AIAL and AIAR, left/right homologues of amphid interneurons), others emerge 5-15 minutes apart. Examples of these include SMBDL and SMBVL (dorsal/ventral homologues of ring/motor interneurons) and RIPR and RIPL (right/left homologues of ring/pharynx interneurons). We can also look at the relationship between the time of birth and number of cells per functional class. To discover patterns in these data, we conducted a hierarchical cluster analysis on the birth times for each terminally-differentiated cell. Supplemental File 4 provides an overview of the relationship between cluster membership and nomenclature family. The cluster analysis provides us with a set of 17 distinct clusters which we can use to classify each cell. Given this information, we asked whether cells from the same nomenclature family belonged to the same cluster. Supplemental Figure 5 shows the variation in information content across nomenclature families. The closer the value is to 1.0, the greater the information (i.e. cells from a single family are represented in a greater number of clusters).
Supplemental Figure 5 demonstrates that there are four types of nomenclature families: 1) relatively high information content with few members, 2) relatively low information content with few members, 3) relatively high information content with many members, and 4) relatively low information content with many members. This can be determined quantitatively by classifying the families based on whether their information content and cell number are above or below the median value of each. Finally, we can examine the series of terminally-differentiated cells that emerge at different time points as a CAST alignment [20]. CAST alignments provide an assessment of gaps in series of functionally-related cells as well as potential periods of stasis in the differentiation process (Supplemental File 5). Supplemental Figure 6 shows us the pattern for the 200 to 400 minutes of C. elegans embryogenesis time-series. In this time-series, we see a large fluctuation in the CAST coefficient between the 205-210 minute interval and the 240-245 minute interval. There are subsequent fluctuations in the CAST coefficient that become increasingly sharp after the 240-245 minute interval. This may be due to a transient period of stasis in differentiation shown in Figure 1.
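The median-based classification of nomenclature families into four types can be sketched as follows. The function name and toy values are hypothetical, and families whose value equals the median are placed in the "low" or "few" group, which is our assumption since the text does not specify how ties are handled.

```python
from statistics import median

def classify_families(info, size):
    """Label each family by whether its information content and cell number
    exceed the median across families (ties count as low/few; an assumption)."""
    info_median = median(info.values())
    size_median = median(size.values())
    labels = {}
    for fam in info:
        level = "high" if info[fam] > info_median else "low"
        count = "many" if size[fam] > size_median else "few"
        labels[fam] = f"{level} information, {count} members"
    return labels

# Toy values for four hypothetical families (not taken from the supplemental files).
info = {"A": 0.9, "h": 0.2, "M": 0.7, "i": 0.1}
size = {"A": 40, "h": 30, "M": 5, "i": 3}
labels = classify_families(info, size)
print(labels)
```

With an even number of families the median falls between the two middle values, so each toy family lands cleanly in one of the four types.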

Data and Visualization - Discussion
We have presented an analysis and visualization of cellular differentiation at a critical time period in C. elegans embryogenesis. The 200 to 400 minute interval is the time between the first appearance of non-germline terminally-differentiated cells and the comma stage of development [17]. It is during the first part of this time period that the major differentiated cell categories are established. We have characterized this period by examining the ratio of developmental cells to terminally-differentiated cells, the different cell families and the relative timing of their differentiation, and the variation in timing within and between functional classes.
Looking between functional classes also reveals information about how larger-scale structures are built (e.g. the nervous system). For example, Figure 6 shows the relationship between interneurons and neurons (A) and interneurons and hypodermal cells (B). In Figure 6A, the appearance of neurons is multimodal with respect to time (one early group and a larger later group). By contrast, almost all interneurons appear after 275 minutes. The timing of hypodermal cells is even more striking in comparison to interneurons as shown in Figure 6B. In this case, a large group of hypodermal cells appear before the sampled interneurons, while a smaller group of hypodermal cells appear alongside the sampled interneurons. These types of comparisons can provide clues as to the emergence of organs as well as other functional networks of cells (e.g. the connectome).
The first consideration for further study is the behavioral relevance of structured differentiation. As autonomic (e.g. pharyngeal pumping) and other basic behaviors emerge from the developing embryo [18], we can ask questions regarding the minimal set of cells required for initiation of a given behavior, the appearance of cells essential to turning on that behavior, and whether or not behavioral emergence involves more than terminally differentiated cells.
The second consideration is how the process of development can be represented as a spatiotemporal process (Figure 7). While this is foremost a data visualization problem, it is also critical in showing how the adult phenotype is modular with respect to developmental time. In a number of cases, we can observe a multitude of its components terminally differentiated well before the initiation of function. Figure 7 shows a differentiation map, which is based on the differentiation tree analysis of embryogenesis [11]. Each map is a 2-D representation of cell division as a spatial process. The extent of each differentiation map corresponds to the number of divisions in the lineage tree. Each cell is located by its position on the anterior-posterior (x) axis and the left-right (y) axis (in embryo units, AU). The lines between cells provide information about the change in position between a mother cell (e.g. AB) observed at time 0 and daughter cells (e.g. ABa, ABp) observed at time 1. Information about GFP area tells us whether the line leads to the smaller cell of the division (red) or the larger cell of the division (green). Insets for each differentiation map show the corresponding differentiation tree. For purposes of space, we truncated the 64-cell trees at 32 nodes (4 divisions).

Differentiation waves involve propagation of either a contraction or expansion of the apical surfaces of cells in a given epithelial tissue. In the case of mosaic development (such as in C. elegans), tissues are replaced with individual cells [11]. In other words, an asymmetric cell division involves a single-cell contraction wave, resulting in the smaller cell, accompanied by a single-cell expansion wave, resulting in the larger cell. An exception to this involves the small proportion of cell divisions in C. elegans that are symmetric, resulting in tissues containing two cells [10]. This set of rules allows us to bring regulative and mosaic development under one theory, the difference being that in regulative embryos tissues consist of many cells, whereas in mosaic embryos tissues consist of one cell.

Synthesis of Data and Visualization
Structures describing the differentiation process (such as the differentiation wave) provide a means to determine the emergence of function in embryogenesis. In the model organism C. elegans, a deterministic developmental trajectory [19,20] combined with available secondary data can be used to determine when terminally-differentiated cells appear and their relationship to both cell lineages and the adult phenotype. In this paper, we will ask the following question: in what order do distinct cells emerge within and between tissue types at multiple time points in pre-hatch morphogenesis? These data can provide insights into how movement and other behaviors first turn on, such as in cases where a specific cell is required for a generalized behavior or response [21]. In general, there is a great deal known about why the temporal emergence of C. elegans tissues and organs from terminally-differentiated cells is tightly regulated. However, a systems-level analysis and visualization of these cells could allow us to understand which cell types and anatomical features are necessary and/or sufficient for the emergence of autonomic behaviors and functional phenotypes.
In C. elegans, cell division patterns directly correspond to cell fate [22]. Furthermore, the timing and ordered emergence of cells making up a specific tissue or organ is highly regulated at the molecular level. Heterochronic timing and associated heterochronic genes are major drivers of C. elegans embryogenesis, particularly since the developmental process is more discrete than in vertebrates [23]. Cellular behaviors such as reorientation and contraction accompany the multi-step morphogenesis of various anatomical structures [24]. The coordination of cell division timing is a complex relationship related to developmental timing, and leads to asynchrony of divisions between sister cells [25]. The pace of cell division itself is an important regulator critical for the normal formation of tissues and organs [26]. The failure of normal development outside a specific temperature range, such as has been observed in amphibians [27], could be investigated in C. elegans at the single cell level.
This time-dependent type of single-cell developmental regulation has consequences for differentiated cells that comprise specific tissues and organs. For example, every cell has a unique pattern of transcriptional regulation in embryonic development [28]. The dynamic regulation of each developmental cell [29] leads to differentiated cells with diverse functions [28]. A key to better understanding the coordination of cellular differentiation in development is to look at differential transcription within and between cells [30]. The timing of cell division and differentiation events appears to influence which parts of a tissue or organ form before others and to ensure proper function [31]. There is also a functional role for certain types of cells, which thus must be present at a certain stage of embryogenesis for proper anatomical function and the onset of behaviors. For example, glial cells are all-purpose cells that play a critical role in the onset of movement and autonomic behaviors [32]. The presence, and more importantly absence, of actin molecules in cells that make up certain anatomical structures can affect their formation and function [33].

Relevance to Relational Biology
This study serves as a first step towards developing mathematical formalisms to describe relational processes. One type of relationship involves quantitative representations of cells in space and time. These include both identity relations (attributes of the cell) and developmental processes. We can apply the logical structure of category theory to represent these properties. The formalization of categories includes objects that constitute sets and arrows that define functions. Examples of objects include terminally-differentiated cell families, functional cell groups, and birth-time cohorts. Functions that constitute formal biological categories include cell division events, changes in spatial location, and transitions of an object through the developmental process. Taking a relational biological view of embryonic development will allow for a common language to be used across species and different patterns of development.
Figure 7 also provides us with an early version of relational C. elegans development. The differentiation map is a relational graph, which allows us to assess the relationship between biological objects in different temporal contexts. Another representation consistent with relational biology would use algebraic terms and a specific notation. This allows us to predict limits and alternate states, even in the absence of data. A relational framework also provides a means to develop computational programming languages that aid in the analysis of data and the discovery of subtle features of the developmental process. Thus, the relational approach enables the discovery of higher-order mathematical structures in a developmental system. Before this can be realized, there is more work to be done in terms of compositional associativity, particularly with respect to the objects and arrows of formalized categories. Pair categories can also be used to better understand bilateral sets of cells and their structure and function. By combining sets and arrows, category theory also allows for the identification of initial and terminal objects for use in establishing formal modeling relations. Other structures such as organismic sets and supercategories might also be used to clarify relational processes. While the analysis presented here provides insights into a relatively simple developmental system (C. elegans), applying such formalisms to similar data in more complex developmental contexts will advance the science of relational biology, and strengthen our ability to predict and understand these systems.
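As a toy illustration of the objects and arrows discussed in this section, the sketch below treats cells as objects and cell division events as arrows that can be composed into lines of descent. It is a hypothetical encoding for illustration only, not a formalism proposed by this study: the Arrow class and the helper functions are our own names, and only the early divisions P0 to AB to ABa to ABal are used.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Arrow:
    """An arrow between two objects (cells), recording the division steps it spans."""
    source: str
    target: str
    path: tuple = ()

    def then(self, other):
        """Compose this arrow with another that starts where this one ends."""
        if self.target != other.source:
            raise ValueError("arrows are not composable")
        return Arrow(self.source, other.target, self.path + other.path)

def identity(obj):
    """Identity arrow on an object: no division steps."""
    return Arrow(obj, obj, ())

def division(mother, daughter):
    """Arrow for a single division event from mother to daughter."""
    return Arrow(mother, daughter, ((mother, daughter),))

f = division("P0", "AB")
g = division("AB", "ABa")
h = division("ABa", "ABal")

# Composition is associative, and identity arrows leave an arrow unchanged.
left = f.then(g).then(h)
right = f.then(g.then(h))
print(left == right, identity("P0").then(f) == f)
```

In this encoding the zygote behaves like an initial object for the lineage, since every cell is reached from P0 by exactly one composite arrow.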

Competing Interests
We have no competing interests.

Funding
No external funding sources were used to write this paper or conduct the studies herein.