RT States: systematic annotation of the human genome using cell type-specific replication timing programs

The replication timing (RT) program has been linked to many key biological processes including cell fate commitment, 3D chromatin organization and transcription regulation. Significant technology progress now allows to characterize the RT program in the entire human genome in a high-throughput and high-resolution fashion. These experiments suggest that RT changes dynamically during development in coordination with gene activity. Since RT is such a fundamental biological process, we believe that an effective quantitative profile of the local RT program from a diverse set of cell types in various developmental stages and lineages can provide crucial biological insights for a genomic locus. In the present study, we explored recurrent and spatially coherent combinatorial profiles from 42 RT programs collected from multiple lineages at diverse differentiation states. We found that a Hidden Markov Model with 15 hidden states provide a good model to describe these genome-wide RT profiling data. Each of the hidden state represents a unique combination of RT profiles across different cell types which we refer to as “RT states”. To understand the biological properties of these RT states, we inspected their relationship with chromatin states, gene expression, functional annotation and 3D chromosomal organization. We found that the newly defined RT states possess interesting genome-wide functional properties that add complementary information to the existing annotation of the human genome. AUTHOR SUMMARY The replication timing (RT) program is an important cellular mechanism and has been linked to many key biological processes including cell fate commitment, 3D chromatin organization and transcription regulation. Significant technology progress now allows us to characterize the RT program in the entire human genome. Results from these experiments suggest that RT changes dynamically across different developmental stages. Since RT is such a fundamental biological process, we believe that the local RT program from a diverse set of cell types in various developmental stages can provide crucial biological insights for a genomic locus. In the present study, we explored combinatorial profiles from 42 RT programs collected from multiple lineages at diverse differentiation states. We developed a statistical model consist of 15 “RT states” to describe these genome-wide RT profiling data. To understand the biological properties of these RT states, we inspected the relationship between RT states and other types of functional annotations of the genome. We found that the newly defined RT states possess interesting genome-wide functional properties that add complementary information to the existing annotation of the human genome.

states and other types of functional annotations of the genome. We found that the newly defined RT states possess interesting genome-wide functional properties that add complementary information to the existing annotation of the human genome.

INTRODUCTION
The functional significance of genome organization at the sequence level of many regions in the human genome remains poorly understood despite the completion of the human genome sequencing project more than 15 years ago. Large international consortia such as ENCODE (1), REMC (2) and IHEC (3) have demonstrated that there are many functional elements present throughout the genome that play interesting and important roles in structuring the human genome. The discovery of these functional elements highlights the importance of defining, cataloguing, and characterizing the diverse regions that remain poorly understood. It is important to advance the field through a variety of methods, integrating different experimental designs and findings, allowing the definition of structure like topologically associating domain (TADs) (4)(5)(6) or Lamin Associated Domains (LADs) (7,8).
Alternatively, statistical modelling applied on these experimental data created a new, composite types of genomic elements. As a prime example, Ernst and Kellis exploited ChIP-seq data of histone modifications to classify the genome into chromatin states using the Hidden Markov Model (HMM) (9)(10)(11). They found that many of these chromatin states are correlated with particular cis-regulatory elements, such as enhancers or insulators. The appealing feature of the chromatin state concept is its ability to crystallize complex and biologically meaningful patterns into a few numbers using a large collection of epigenomic profiles that contain lots of variation, overlap and redundancy. We believe the same strategy can be applied to other types of data. In this study, we annotate the genome using a rich set of genome-wide replication timing (RT) programs derived from multiple human cell types (12,13). The RT programs represent the genome duplication occurring in a defined temporal order in human cells.
In humans, the RT program is dynamic and cell type-specific. The difference between types of differentiated cells, and even stem cells and their progenitors can be dramatic. For example, half of the genome undergoes RT changes during cell differentiation of embryonic stem cells (14,15). These dynamic changes in RT occur in units of 400-800 kb known as replication domains (RDs) (12,16,17).
RT alterations have also been linked to many diseases (18)(19)(20)(21). Recent studies have revealed an association between RT and higher-order chromosomal structure (TADs and sub-nuclear compartment A and B) (17,(22)(23)(24)(25)(26), as well as gene expression regulation (24), supporting the hypothesis that the genome is partitioned into smaller replication units of replication (27). In general, early replication is correlated with elevated transcriptional activity (17,28). Researchers have classified RDs as constitutive or developmentally regulated, depending on how consistent the replication occurs across different cell types (17,28). RD boundaries are stable across different cell types, but the changes of RT on the developmental domains are correlated with changes in subnuclear position, chromatin structure and transcriptional state.
The introduction of RDs helps put context to a genomic locus from the RT perspective, but the concept is qualitative. As more RT profiles of more cell types becomes available, we gain the opportunity to characterize RT dynamics in a more refined and quantitative fashion. In this study, we use 42 RT profiles collected from 25 cell types and differentiation intermediates (24), and apply a Hidden Markov Model (HMM) based approach to break down the genome into 15 RT states. The goals are two-fold: first, identify frequent combinatorial RT patterns among a compendium of cell types; second, annotate the genome by assigning an RT state to every locus. Note that unlike chromatin states (9)(10)(11), RT states defined here remain unchanged across cell types since they represent the behaviour of chromatin across a variety of cell types, rather than the composition of chromatin in a particular cell type. The 25 cell types and the differentiation intermediates have been strategically selected to represent the diversity among all the cell types in human and represent developmental stages of the three germ layers (24). Finally, the RT states described here represent a more accurate and quantitative means of defining constitutive early, constitutive late or developmental RDs, which adds a novel quantitative annotation of the functional properties of the human genome.

RT value distribution along the genome.
RT values from Repli-chip (12,13) are the log2 ratios of signals derived from early and late samples measured by microarrays from published studies (24). About 85% of the genome is covered by RT values. The dynamic range has been normalized which ranges from -2.9 to 2.9 as previously described (24,29), and their distributions in all cell types show a similar bi-modal pattern with two peaks around -1 and 1, representing respectively the genome part replicate late and early during replication ( Fig 1A). Higher RT values (>0.3) were classified as early replicating; low values (<-0.3) were classified as late replicating (24). Alignment of RT values from 42 data sets along a chromosome shows the global preservation of RT during development and also some deviations among cell types ( Fig 1B). As examples, we show some regions with early RT values in hESC cells but late in other cell types (the green rectangle in Fig 1B). The opposite pattern can be observed in the blue rectangle ( Fig 1B). Such patterns illustrate the importance of statistical modelling of RT program to identify recurrent and biologically meaningful combinatorial patterns of RT values. The y axe is the RT value between -1.5 and 1.5 in this region; the x axe is the genome coordinate.
The red dash line show the RT value of zero. The rectangles highlighted variations among cell lines.
In the green box we observe a region with early RT value in pluripotent and MED cell lines but late or close to zero RT values in Lymphoid, Myeloid and Erythroid cell lines. In the blue box we observe a region that shows early RT values in Mesoderm, Liver and Pancreas cell lines but late RT values in pluripotent and MED cell lines. The same key color is used in A and B panel.

Genome decomposition into RT states.
The HMM requires the specification of the number of hidden states. In this study, we chose to = 15 balance complexity and biological interpretation. We have tested several different numbers of states (6, 10, 20, 25 and 30) and found the results are largely comparable (Supplementary Figure 1 to 5). In addition, we found that using 15 states, clustered the cell lines according to their genome-wide RT profiles in a manner consistent with the lineage origin of each cell type (Fig 2) (24). We analysed the spatial organisation of the RT states, i.e., how often does the switch (RT transition of contiguous regions) between different RT states occur. By looking at the RT changes in neighbouring genomic regions, we found that the RT states tend to stay unchanged for long stretches of chromosomes. It is evident from the transition matrix that the probability of staying in the same state is very high and immediately adjacent chromosomal segments are mostly classified into the same RT state (Supplementary Figure 6A). Analysis of the average of contiguous bins for each RT states shows that the two largest domains are observed for the Early1 (618 kb) and Late3 (1.16 Mb). The other RT states have value between 267 kb and 500 kb (Supplementary Figure 6D). This is consistent with the previous findings that RDs are usually hundreds of kilobases in length (12,16,17).

Overview of the 15 RT states.
From Fig 2A,  We also analysed the distribution of RT states based on the gene density per chromosome (Fig 3).
We then analysed the base composition of the defined RT state and found a monotonically increasing trend in AT percentage from early RT states to late RT states: Early1 RT state has around 52% AT, compared to about 64% AT in the Late3 RT state (Supplementary Figure 6E). This is expected since most of the late RT states are located in gene-poor regions, which are marked with high AT content (31).  Table 4).
RT states clearly illustrate that replication is consistently executed in an orderly fashion. And the RT profiles show distinct pattern between HGD and LGD chromosomes. Perhaps the HGD chromosomes replicated faster than the LGD chromosomes since the latter ones are less accessible. The imbalance of the Early and Late domains on the HGD and LGD chromosomes are similar to that of the A and B compartments discovered on the Hi-C experimental data (6,32,33). We speculate that the distinctive and elegant periodic pattern we observed in RT indicates the chromosomal properties are highly regulated and the pattern may contribute to its ability to be readily folded in 3D space in an ordered fashion (34).

Annotation features of distinct RT states.
To characterize the different RT states identified by our HMM method, we compared them to the previously described chromatin states annotated based on epigenetic modifications derived from ChIP-seq data (9). Additionally, we also analysed the gene expression activity inside different RT states.
First, we investigated the abundance of the four types of gene families (protein coding, ncRNA, lncRNA and pseudogenes) in each of the 15 RT states. The results are summarized in Fig 4. We saw that the proportion of bins with annotation is high in Early states and decreased steadily when moving from Early states to Late states. Only about 50% of the Late state bins possess at least one gene annotation (of four types), down from 98% in Early states (Fig 4A). In all RT states, protein-coding genes are the most frequently detected gene annotations except for the Late3 RT state where around 40% of the gene annotations are lncRNAs and only 35% are protein-coding genes. We found proteincoding gene annotation frequency decreased steadily across RT states from Early (60%) to Late (35%) (Fig 4B). Frequency of ncRNA annotations is stable throughout all the states ( Fig 4C) and lncRNA frequency increased steadily from Early1 (≈ 15%) to Late3 (≈ 35%) (Fig 4D). Pseudogenes annotation is similar in Early and Dyn1-7 states and higher in the other states ( Fig 4E). From the above, we concluded that early RT is positive correlated with the density of protein-coding genes, negatively correlated with the density of lncRNAs, but not correlated with the density of ncRNAs and pseudogenes.

Specific chromatin architecture is detected by combining RT states and chromatin states.
It is of interest to compare the relationship between previously defined chromatin states (9) and newly defined RT states. Since chromatin states are cell type specific, we hope to show how the dynamic DNA chromatin organisation changes with respect to the DNA replication profiles in different cell lines.

Gene expression and GO term enrichment analysis in the RT states.
For each cell line, we computed the average gene expression (in RPKM) for each RT state (Fig 5B).
We first observed that the cell lines of the same type tend to cluster together. As expected, proteincoding genes located inside regions belonging to the three Early states are highly expressed. Genes To find out if there is any specific pathway or gene set that is enriched in different RT domains, we conducted a GO term enrichment analysis on the genes overlapping with each of the 15 RT states.
The 20 most significantly enriched GO terms for each RT state are presented in Supplementary Table  5. We also calculated the average RT state score (since RT state number is ordinal) for each GO term and report result in Supplementary Table 5. We observed that Early states are enriched in processes involved in housekeeping pathways, like cell division, apoptosis, and transcription regulation (Table 1   and 2 and Supplementary Table 5). That is consistent with the RT score analysis which showed lower values for the same pathways (apoptotic process (3.99), protein phosphorylation (3.74), cell division  Table 5). Dyn RT states are enriched in transcription regulation. Finally, we also observed heart, brain, epidermal growth factor development and nervous system development GO terms enriched for Dyn RT states, which is consistent with the fact that the Dyn RT states are the RT states which change between different cell types.  Representation of sub-compartment domains and LAD in the RT states.

Genome-wide chromatin conformation capture (Hi-C) experiments (32) have revealed Topologically
Associated Domains (TADs) (4-6). The majority of these domains were conserved across cell types, but the inter-TAD interactions are found to be cell type-specific (4,32,36). Comparing TAD and RT states allows us to analyze changes in the 3D chromatin organization that occur during cell differentiation which respect to RT states. Furthermore in higher organization of the genome was identified, A and B compartments characterized by the spatial segregation of open and closed chromatin (32).

Rao and collaborators analyzed the A and B compartments and detected six sub-compartments
based on their contact, architecture and epigenetics features: A1 and A2 (part of the A compartment) and B1, B2, B3 and B4 (part of the B compartment). In this analysis, we ignored the B4 compartment, which is defined only for by a handful of regions, containing many KRAB-ZNF super-family genes and 0.01% of the bins of each RT possess overlapped with this sub-compartment (33). The A1 subcompartment is characterised by a high density of actively expressed genes, and enriched with permissive histone marks (H3K36me3, H3K79me2, H3K27ac, and H3K4me1 and H3K4me3) are associated with active transcription (33). We found the A1 sub-compartment highly enriched in Early1 and Early2 RT states (85% and 62% of bins in these two states belong to this sub-compartment respectively) and the proportion reduced steadily to almost zero for the B2 and B3 sub-comportments.
In addition of the afore mentioned euchromatin marks, the A2 sub-compartment is also associated with the repressive mark H3K9me3, and is enriched in the states Early3 (59% of the bins) and Dyn1 to Dyn4 (Dyn1: 39%, Dyn2: 56%, Dyn3: 45%) ( Fig 6A) and depleted in all others. These five RT states are mainly dynamics and the genes located within these RT states are differentially expressed among different the cell lines analyzed (Fig 5B), so it is expected to have these different states potentially associated with repressive chromatin marks.
The RT states enriched at A1 and A2 sub-compartments are also depleted in LADs (Fig 6A and 6B Dyn6: 25%) (Fig 6A). The sub-compartments B2 and B3, which correspond to regions that are not replicated until the end of S phase, show the opposite pattern of sub-compartment A1, which is enriched in late RT states including Dyn9 (27% and 64% respectively), Late1 to Late3 (Late1: 34% and 59%, Late2: 32% and 65%, and Late3: 27% and 71%, respectively) (33) and depleted in early RT states ( Fig 6A). sub-compartments are also enriched in LADs. As an example, around 60% of bins in Late RT states (Dyn9, Late1-3) contain LADs. In contrast, only approximately 2.5% of bins in Early1 RT state overlap with LADs ( Fig 6B). Therefore, the Late RT states possess all the characteristics of condensed heterochromatin, late replication and dense in LAD domains. Some of the LADs are cell-type specific and defined as facultative LADs (fLADs), and others are conserved across different cell types and are referred to as constitutive LADs (cLADs) (8). We speculate the subset of LADs with Late RT states (Dyn9, Late1-3) that are likely to be cLADs, and are enriched in B3 sub-compartment (Fig 6A and B) and A/T rich, gene poor regions (Fig 4 and Supplementary Figure 6). The fLADs, which are cell specific, on the other hand, could be localized more in the Dyn RT states, and enriched in B1 and B2 sub-compartment, which are correlated with the facultative heterochromatin. Various computational methods have been proposed to identify RDs using replication timing profile data (16,26,44). In a recent study, Liu et al. described an advanced deep neural network (DNN)based method named DNN-HMM that is capable of identifying multiple RDs genome-wide (22). Using manually labelled training data, Liu and colleagues apply DNN on Repli-seq data to assign four major types (and 14 subtypes) of RDs throughout the genome. On top of the methodology, the biggest difference between our method and DNN-HMM (and other existing RD calling methods) is that DNN-HMM is designed to call RDs in a single cell type using replication profiling data from that cell type whereas we are trying to define and call RT states across a compendium of cell types, emphasizing the homogeneity and heterogeneity of the RT programs among cells from different developmental lineages and stages. Each of the 15 RT states represent a particular combination of RDs from multiple cell types. As a results, the 15 RT states can be used to annotate the diverse properties of the genome holistically, not just for a single cell type. We believe our method represents the first attempt to jointly model multiple RT profiling data in order to reveal underlying biological insights.
Many studies have been conducted using epigenetics to annotate the genome, especially the noncoding part, RT states can add a new dimension to the annotation, and can elucidate the changes in organisation during differentiation and development. This work could also shed light on the dynamics of the 3D chromatin organisation during differentiation by correlating the RT states and the Hi-C data.
Data with better resolution (smaller size of fragments) of the RT data will allow for better definition of the domains of replication and refinement of our models. This could permit future analysis of the RT of specific promoters involved in cell differentiation.

Replication timing data
We analysed 42 genome-wide RT datasets of Repli-chip previously published (24) . We also have parameters which denote the probability that the state of the first Three-dimensional chromosomal organization analysis. We use high resolution Hi-C data (GEO accession number: GSE63525) obtained from human lymphoblastoid cells (GM12878) produced by Rao and collaborators (33). We count the number of contact domains found in regions corresponding to each of the 15 RT states. The same method was used to analyse the Lamin Associated Domains (LADs) in fibroblasts. The LADs data was obtained from ENCODE (7).