Abstract
Genome organization is critical for setting up the spatial environment of gene transcription, and substantial progress has been made towards its high-resolution characterization. The underlying molecular mechanism for its establishment is much less understood. We applied a deep-learning approach, variational autoencoder (VAE), to analyze the fluctuation and heterogeneity of chromatin structures revealed by single-cell super-resolution imaging and to identify a reaction coordinate for chromatin folding. This coordinate monitors the progression of topologically associating domain (TAD) formation and connects the seemingly random structures observed in individual cohesin-depleted cells as intermediate states along the folding pathway. Analysis of the folding landscape derived from VAE suggests that well-folded structures similar to those found in wild-type cells remain energetically favorable in cohesin-depleted cells. The interaction energies, however, are not strong enough to overcome the entropic penalty, leading to the formation of only partially folded structures and the disappearance of TADs from contact maps upon averaging. Implications of these results for the molecular driving forces of chromatin folding are discussed.
Introduction
Three-dimensional genome organization is expected to play a crucial role in transcription, DNA replication, and repair (1–5). Significant progress has been made towards its high-resolution characterization as a result of advances in chromosome-conformation-capture based methods such as Hi-C (6, 7). These methods approximate the 3D distance between pairs of genomic loci using contact frequencies measured via proximity ligation and have revealed many conserved features of genome packaging (8–11). The emerging picture is a hierarchical organization for interphase chromosomes that ranges from chromatin loops and topologically associating domains (TADs) to compartments at kilobase and megabase scales, respectively (12–17).
Hi-C and related techniques have also provided insight into the dynamical folding process for the establishment of genome organization. In particular, the extrusion model was proposed to explain numerous features of chromatin loops and TADs observed in Hi-C contact maps (18, 19). It provides a detailed hypothesis on the folding process driven by CCCTC-binding factor (CTCF) and cohesin molecules. Several predictions of the extrusion model have been validated with perturbative Hi-C experiments using cells that are depleted of these molecules (20–24). Due to its unavoidable ensemble averaging, however, Hi-C cannot capture the heterogeneity within a cell population, and the average picture it presents may be insufficient to uncover the full complexity of genome folding (25, 26).
Many questions on genome folding remain outstanding and necessitate the development of additional experimental techniques and theoretical tools of interpretation. Recently, Zhuang and coworkers applied a super-resolution tracing method (27–29) to characterize single-cell chromatin structures and observed substantial cell-to-cell variation for TAD boundaries (30). Upon cohesin depletion, in agreement with population Hi-C experiments (24), these studies suggest that TADs disappear in ensemble averaged distance matrices. Remarkably, however, chromatin domains persist in individual cells. The biological implications of these imaging results remain to be explored, and it is unclear what folds the chromatin in cells that lack cohesin molecules. The large set of single-cell structures provides unprecedented details into chromatin organization but calls for the use of statistical mechanical approaches for its interpretation.
Here we combine the energy landscape theory that has found great success in studying protein folding (31–34) with deep learning techniques to investigate the mechanism of genome folding. Specifically, we apply the variational autoencoder (VAE) (35) to analyze single-cell imaging data and infer a one-dimensional reaction coordinate for chromatin folding. This coordinate captures the variation of TAD boundaries in wild-type (WT) configurations and establishes connections among the seemingly random structures in cohesin-depleted cells. It suggests that these structures are intermediate states along the folding pathway to chromatin configurations that bear a striking resemblance to those found in WT cells. We further demonstrate that the probability distribution estimated from the VAE can provide an accurate approximation of the energetic cost for chromatin folding. Energy landscape analysis suggests that the formation of WT-like structures remains favorable even in cohesin-depleted cells but is penalized by the configurational entropy. A phase separation mechanism potentially contributes to chromatin folding in these cells as supported by the presence of distinct histone modification patterns across the TAD boundary.
Results
Deep generative model identifies the reaction coordinate for chromatin folding
Chromatin folding refers to the dynamical process during which chromatin experiences a large-scale reorganization in its 3D conformation, and transitions from extended, unfolded configurations (reactant) to collapsed and folded structures (product). It is inherently a high-dimensional process, the complexity of which makes it challenging to develop intuition towards the folding mechanism. Great insight can be gained by projecting this process onto the so-called reaction coordinate, a one-dimensional variable that monitors the progression from reactant to product (36). The reaction coordinate is a key concept that has significantly advanced our understanding of condensed phase chemical reactions (37), including protein folding (38–41). The identification of the reaction coordinate itself, however, is nontrivial and often requires kinetic measurements of the folding process. Though significant progress has been made in live-cell imaging (42–45), monitoring chromatin with high spatial and temporal resolution remains out of reach.
Approximate definitions of the reaction coordinate can be obtained using dimensionality reduction analysis of a large set of configurations that connects the reactant to the product, and have provided mechanistic insight into a wide range of biomolecular processes (39, 46–48). In this study, we apply similar ideas to determine the chromatin folding coordinate by analyzing an ensemble of structures obtained from single-cell super-resolution imaging (30). Specifically, we used the deep learning framework VAE to derive a deep generative model (49–51). Compared to existing approaches, the generative model not only compresses the data into a low-dimensional space for reaction coordinate analysis, but also provides an estimation of the probability for each configuration. This quantitative aspect is crucial for connecting with the energy landscape theory, as discussed in later sections.
We carried out the analysis on a chromatin region (Chr21:34.6Mb-37.1Mb) of the human HCT116 cell line (Fig. 1A). The average distance matrices suggest that this region adopts two pronounced TADs in WT cells with a boundary at 36.1 Mb, and the TADs disappear upon cohesin depletion (see Fig. 1B). Chromatin structures determined for both WT and cohesin-depleted cells (30) were included to produce the generative model. By mixing the structures from two cell types, we ensure the inclusion of both folded and unfolded configurations and that the largest variance in the dataset corresponds to the folding transition. We converted the 3D positions from super-resolution imaging into binarized contact matrices to provide rotationally and translationally invariant representations for chromatin (see Methods Section for details). We then applied VAE over the binarized representations to find two optimal latent variables in an unsupervised manner with an encoder that compresses the contact matrices and a decoder that reconstructs the inputs (Fig. 2A).
As shown in Fig. 2B, we found an apparent separation between WT (red) and cohesin-depleted (green) cells in the two-dimensional latent space. Therefore, VAE succeeds in uncovering the distinction between the two sets of chromatin conformations. From the two latent variables, we further defined a one-dimensional folding coordinate as the distance from the decision boundary that best separates the two cell types (Fig. 2B). We identified the boundary with the support vector machine (52), and WT and cohesin-depleted cells exhibit the largest difference along the direction perpendicular to the boundary. Projecting chromatin configurations onto the folding coordinate leads to a clear separation between the corresponding probability distributions as well (Fig. 2C), supporting its usefulness in separating the reactant from the product of the folding transition.
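The projection onto the folding coordinate can be illustrated with a minimal sketch. It assumes a linear decision boundary w·z + b = 0 in the two-dimensional latent space (the SVM kernel is not stated in the text), and the weights used in the demo are made-up values rather than fitted ones. The function name `folding_coordinate` is hypothetical.

```python
import numpy as np

def folding_coordinate(z, w, b):
    """Signed distance of latent points z, shape (n, 2), from the line w.z + b = 0.

    Assumption: the boundary is linear, so the distance is (w.z + b) / |w|.
    """
    z = np.atleast_2d(np.asarray(z, dtype=float))
    w = np.asarray(w, dtype=float)
    return (z @ w + b) / np.linalg.norm(w)

# Hypothetical boundary parameters, for illustration only (not fitted to data).
w_demo = np.array([3.0, 4.0])
b_demo = -5.0
coords = folding_coordinate([[3.0, 4.0], [0.0, 0.0]], w_demo, b_demo)
```

In practice w and b would come from the trained classifier; which side of the boundary is positive is a convention, and the values quoted in the text suggest that WT-like structures sit on the positive side.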
We further examined whether the one-dimensional variable can serve as a good reaction coordinate and provide mechanistic insight into chromatin folding. A key difference between chromatin structures from WT and cohesin-depleted cells is the presence of TADs. A simple variable that captures this distinction can be defined as the fraction of chromatin segment pairs that form contacts within the domains. As shown in Fig. 2D, the two variables are indeed highly correlated. The folding coordinate, therefore, faithfully tracks the progression of TAD formation. However, at both large and small values, the correlation is weak, suggesting that the folding coordinate may reveal additional complexity of the reaction beyond the intuitive contact formation.
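One possible reading of the intra-domain contact fraction is sketched below: among all segment pairs that lie in the same domain, the fraction that are in contact. The single-boundary, two-domain layout and the function name `intra_domain_contact_fraction` are assumptions made for illustration; the exact normalization used for Fig. 2D is not specified here.

```python
import numpy as np

def intra_domain_contact_fraction(Q, boundary):
    """Fraction of same-domain segment pairs that are in contact.

    Q        : (n, n) symmetric binary contact matrix
    boundary : index of the first segment of the second domain (assumed layout)
    """
    Q = np.asarray(Q)
    i, j = np.triu_indices(Q.shape[0], k=1)       # all distinct pairs i < j
    same_domain = (i < boundary) == (j < boundary)
    return Q[i, j][same_domain].mean()

# Toy example: 4 segments, domains {0, 1} and {2, 3}.
Q_demo = np.zeros((4, 4), dtype=int)
Q_demo[0, 1] = Q_demo[1, 0] = 1   # intra-domain contact
Q_demo[0, 3] = Q_demo[3, 0] = 1   # inter-domain contact, ignored by the measure
fraction = intra_domain_contact_fraction(Q_demo, boundary=2)
```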
Folding coordinate reveals TAD formation in cohesin-depleted cells
To better understand the physical meaning of the folding coordinate, especially at large absolute values, we grouped chromatin structures from individual cells and built average distance matrices along the coordinate using either WT or cohesin-depleted cells. The numbers of cells at various values of the folding coordinate are listed in Tables S1 and S2.
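The grouping step can be sketched as follows, assuming that structures are binned by their folding coordinate and that distance matrices are averaged within each bin. The bin edges, the function name `average_by_coordinate`, and the use of a plain mean are illustrative assumptions.

```python
import numpy as np

def average_by_coordinate(dist_mats, coords, bin_edges):
    """Average distance matrices within bins of the folding coordinate.

    Returns {bin_index: (number_of_cells, mean_distance_matrix)}.
    Bin indices follow np.digitize: index i covers bin_edges[i-1] <= x < bin_edges[i].
    """
    dist_mats = np.asarray(dist_mats, dtype=float)
    coords = np.asarray(coords, dtype=float)
    bins = np.digitize(coords, bin_edges)
    averages = {}
    for b in np.unique(bins):
        mask = bins == b
        averages[int(b)] = (int(mask.sum()), dist_mats[mask].mean(axis=0))
    return averages

# Toy data: three 2x2 "distance matrices" with constant entries.
mats = np.array([np.full((2, 2), 1.0), np.full((2, 2), 3.0), np.full((2, 2), 10.0)])
groups = average_by_coordinate(mats, [-1.0, -0.9, 0.5], [-1.6, -0.8, 0.0, 0.8])
```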
As shown in Fig. 3A, for WT cells, we find that the folding coordinate captures the heterogeneity of chromatin organization both within a single TAD and across TAD boundaries. For example, chromatin in most cells with the folding coordinate less than 1.2 exhibits two TADs with a separating boundary at 36.1 Mb. This boundary coincides with the one found in the average distance matrix (Fig. 1) and in Hi-C contact map (24). The contacts within each TAD, however, can vary significantly as the reaction coordinate increases. In particular, the emergence of sub-TADs gives rise to more compact chromatin with decreased spatial distances, and correspondingly, the colormap varies from red to yellow. Interestingly, we also find a significant population of cells, i.e., those with the folding coordinate larger than 1.2, with a shifted TAD boundary at 36.4 Mb. This chromatin reorganization could alter the regulatory environment for genes (e.g., RCAN1 and KCNE1) within this region and may impact their expression profiles.
Remarkably, for cohesin-depleted cells (Fig. 3B), variation in distance matrices along the folding coordinate highlights the gradual formation of chromatin structures with striking resemblance to those found in WT cells. For example, for cells with folding coordinate values between −1.6 and −0.8, the chromatin segment appears to adopt open, extended configurations and there is no prominent feature in the distance matrices. At large values (∼ 0.4), chromatin adopts two domain-like structures with a boundary identical to that found in WT cells. We note that the observed structural ordering only becomes apparent after averaging and the conformational ensembles at individual folding coordinates can exhibit substantial heterogeneity (see Figs. S2-S4).
Close examination of the distance matrices reveals additional subtlety of chromatin folding in cohesin-depleted cells. In particular, though both share similar TAD boundaries, the folded chromatin structures in cohesin-depleted cells are less compact and do not exhibit fine sub-TADs as those from WT cells. In addition, the folding coordinate also uncovers off-pathway configurations at values less than −1.6. In these cells, chromatin exhibits a single domain at the end of the genomic region with a boundary quite different from that of WT cells. This domain must unfold before chromatin can transition into WT-like structures.
The folding coordinate, therefore, provides a fresh perspective on the heterogeneity intrinsic to single-cell imaging data. The ensemble of chromatin structures from cohesin-depleted cells appears to be well described with a single folding transition that leads to the formation of WT-like configurations. The seemingly random organizations observed in individual cells are, in fact, related to each other as intermediate states along the folding pathway and only differ in the degree of foldedness. What drives the folding transition in cohesin-depleted cells? Why doesn’t chromatin from these cells fully commit to the well-folded WT-like structures? In the next two sections, we attempt to address these questions using the energy landscape theory (31, 32), which has already provided significant insight into the folding of interphase and metaphase chromosomes (53–55).
Deep generative models recover the energy landscape of in silico chromatin models
An advantage of the VAE is that it provides an estimation for the probability of each individual chromatin structure represented as a binary contact matrix Q. It is tempting then to define a quantity as –logPVAE(Q) and connect it with the corresponding free energy from statistical mechanics, F(Q). To our knowledge, there has been no prior evaluation of the performance of VAE in reproducing the free energy of a microscopic model.
Before we go on to evaluate the accuracy of the VAE, it is, however, useful to first clarify the physical meaning of F(Q) and how it differs from the interaction energy U(r) used in traditional computer simulations. Statistical mechanics suggests that we can define
$$F(Q) = U[Q(r)] - T S(Q), \tag{1}$$
where Q(r) indicates the mapping from the Cartesian space r to the binarized contact space Q. S(Q) corresponds to the entropy arising from the loss of information during the mapping (coarse-graining) process (56, 57). Though U[Q(r)] can be easily determined, computing the entropy itself is a challenging task, making a direct comparison between –logPVAE(Q) and F(Q) impractical. One way to circumvent this challenge is to evaluate the difference of the two quantities from a reference system. In particular,
$$\Delta F(Q) \equiv F(Q) - F_{\mathrm{ref}}(Q) = U[Q(r)] - U_{\mathrm{ref}}[Q(r)] \equiv \Delta U[Q(r)]. \tag{2}$$
The second equality holds if the entropic functional of the reference system is the same as that of the system of interest. Under such a condition, validating the VAE is equivalent to comparing the difference in –logPVAE(Q) between the two systems with ΔU(Q).
To determine the relevant quantities and evaluate the accuracy of the VAE, we first carried out two computer simulations to collect 3D structures for a reference and a chromatin-like polymer model. The interaction energy in the reference model was fine-tuned to ensure that the average distance between neighboring beads and the overall size of the polymer are comparable to those measured experimentally for chromatin. For the chromatin-like model, in addition to the potential energy defined in the reference system, we introduced attractive interactions for beads within the first and second half of the polymer to promote the formation of domain-like structures. Snapshots of the reference and chromatin-like polymers are provided in Figs. 4A and 4B, with the simulated average distance matrices shown on the side. Because the two systems share the same basal interactions that define the polymer topology, their entropic functionals should be identical.
We then trained two VAE models using a total of 100000 configurations for each polymer. From these two models, we calculated the difference in –logPVAE(Q) between the chromatin-like and the reference model for each one of the chromatin-like configurations. We further determined the corresponding ΔU[Q(r)] by evaluating the potential energy differences in the Cartesian space. As shown in Fig. 4C, the two quantities are significantly correlated with each other, with a Pearson coefficient of 0.53. The slope of the linear fit for the data is larger than 1, with a value of 2.2. This deviation could potentially be a result of the maximization of a lower bound, rather than the true likelihood function, in the VAE framework.
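The comparison between the VAE-derived free energy differences and ΔU can be summarized with a Pearson coefficient and a linear fit. The sketch below uses synthetic numbers only; the function name `compare_energies` and the choice of ΔU as the independent variable of the fit are assumptions, since the axes of Fig. 4C are not described in the text.

```python
import numpy as np

def compare_energies(delta_u, delta_f):
    """Pearson coefficient and least-squares line delta_f ~ slope * delta_u + intercept."""
    delta_u = np.asarray(delta_u, dtype=float)
    delta_f = np.asarray(delta_f, dtype=float)
    pearson = np.corrcoef(delta_u, delta_f)[0, 1]
    slope, intercept = np.polyfit(delta_u, delta_f, 1)
    return pearson, slope, intercept

# Synthetic, perfectly linear data (not taken from the study).
du = np.array([0.0, 1.0, 2.0, 3.0])
df = 2.0 * du + 1.0
r, slope, intercept = compare_energies(du, df)
```

A slope different from 1 with a high correlation would indicate that the VAE ranks configurations consistently while misestimating the energy scale.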
Balance between enthalpy and entropy dictates TAD formation
Encouraged by its accuracy in reproducing the microscopic energy of a synthetic system, we applied VAE over the WT and the cohesin-depleted imaging data separately to derive the corresponding chromatin energy landscapes. We note that these landscapes are deemed effective as chromatin exhibits slow dynamics (45, 58, 59) and is subject to perturbations driven by ATP-powered molecular motors (60). Nevertheless, provided that they can reproduce the corresponding steady-state distributions, effective landscapes are powerful concepts for characterizing non-equilibrium systems (61, 62).
Before analyzing the derived energies, we performed additional tests for the probability distributions estimated by VAE models and evaluated their accuracy in reproducing the measured statistics of chromatin conformation. First, we simulated a total of 10000 chromatin contact matrices by converting randomly distributed latent space variables into contacts using the VAE decoder networks. From these matrices, we computed the average contact frequencies ⟨Qi⟩ and the pairwise correlation between contacts ⟨QiQj⟩. As shown in Figs. 5A-D, values determined from VAE models match well with those from imaging data for both WT and cohesin-depleted cells. It is worth pointing out that a simple independent model fails to capture the cooperativity among chromatin contacts, as evidenced by the deviation between ⟨Qi⟩⟨Qj⟩ and ⟨QiQj⟩ (Figs. 5C and D). Finally, we found that VAE models also capture the higher-order collective behavior of chromatin contacts, and the probability distributions of the folding coordinate obtained from simulated contact matrices agree well with the experimental values (Figs. 5E and F).
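The statistics used in this test can be sketched as below, assuming each contact matrix has been flattened into a binary vector. The function name `contact_statistics` and the toy data are illustrative; the point is that ⟨QiQj⟩ differs from the independent-model product ⟨Qi⟩⟨Qj⟩ whenever contacts form cooperatively.

```python
import numpy as np

def contact_statistics(samples):
    """First and second moments of binary contacts.

    samples : (n_structures, n_contacts) array of flattened contact matrices
    Returns <Q_i>, <Q_i Q_j>, and the independent-model prediction <Q_i><Q_j>.
    """
    X = np.asarray(samples, dtype=float)
    mean_q = X.mean(axis=0)
    pair_q = X.T @ X / X.shape[0]
    independent = np.outer(mean_q, mean_q)
    return mean_q, pair_q, independent

# Toy ensemble with two perfectly coupled contacts.
toy = np.array([[1, 1], [0, 0], [1, 1], [0, 0]])
mean_q, pair_q, independent = contact_statistics(toy)
```

For this toy ensemble the pairwise average is twice the independent-model value, which is the kind of deviation the comparison in Figs. 5C and D is designed to expose.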
Therefore, both the tests on in silico models and the experimental data support a quantitative interpretation of the energy landscape inferred from VAE. We next examined the change of various VAE energies along the folding coordinate by averaging over chromatin structures from both WT and cohesin-depleted cells. As shown in Fig. 6, consistent with the observed low probability of TAD-like domains, the free energy –log[PVAE(Q)] favors unfolded chromatin configurations with negative folding coordinate values for cohesin-depleted cells. However, its difference from the homopolymer free energy introduced in the previous section becomes more negative along the folding coordinate. This quantity, according to Eq. 2, measures the strength of specific interactions in chromatin relative to the generic potential of a homopolymer. Since the homopolymer energy itself is weakly attractive and decreases along the folding coordinate (Fig. S5), the specific chromatin interactions favor folded structures even in cohesin-depleted cells. Therefore, the formation of two-domain-like structures is indeed energetically stable but must be penalized by the configurational entropy to result in an overall unfavorable free energy. For WT cells, on the other hand, both the free energy and the potential energy stabilize TADs over unfolded structures.
Conclusions and Discussion
We applied a state-of-the-art deep learning framework to analyze single-cell imaging data on chromatin organization. By projecting the 3D configurations onto low-dimensional latent variables, we identified a reaction coordinate that tracks the progression of TAD formation. Our analysis suggests that the seemingly random structures from individual cohesin-depleted cells can be viewed as intermediate states along the folding transition. Connecting VAE models with the energy landscape theory further reconciles the clear intent of folding with the lack of commitment. The TAD-like structures remain energetically favorable upon cohesin depletion, driving the formation of chromatin contacts in individual cells. The penalty from the configurational entropy, however, prevents the formation of the full set of contacts to stabilize an entire TAD, resulting in the disappearance of well-defined domains in average distance matrices.
What are the physicochemical interactions that stabilize the folded WT-like structures in cohesin-depleted cells? Numerous studies have demonstrated the importance of phase separation or compartmentalization in genome organization (63–71). Different regions of the chromatin could adopt distinct post-translational modifications on histone proteins. Such differences, and potentially in combination with the presence of additional intrinsically disordered proteins, could drive the collapse of chromatin into non-overlapping domains in 3D space. An analysis of the underlying combinatorial patterns of twelve histone marks (72) indeed supports this hypothesis. As shown in Figs. 1A and 3B, the five states defined using the software chromHMM (73) partition the chromatin into active and inactive segments at the position roughly corresponding to the TAD boundary. We note that the presence of different chromatin types is not obvious with a coarser classification. As shown in Fig. S1, consistent with the analysis based on Hi-C data (24), this region is assigned as a single active A compartment when only two states were used. Additional experiments could provide further insight into the importance of this weak compartmentalization boundary marked with different histone modification patterns in folding the chromatin.
Methods
Imaging data processing
Single-cell super-resolution imaging data were obtained from Ref. (30), with a total of 11631 and 9526 chromatin structures for WT and cohesin-depleted cells, respectively. Though the experiments were performed at a 30 kb resolution, we carried out all our analysis at the 90 kb resolution for more accurate estimation of the probability distributions from VAE. We built the distance matrices from 3D positions of every third imaged chromatin segment and converted them into binary contacts with a cutoff of 450 nm. The contact probability between neighboring genomic segments at the 90 kb resolution is about 0.8. For chromatin segments with missing imaging positions, we filled in the corresponding entries in contact matrices with random numbers generated based on the sequence-separation specific average contact probabilities derived from imaging data.
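The conversion from 3D positions to binary contacts can be sketched as follows. The sketch assumes positions in nanometers with no missing entries, treats a pair as in contact when its distance is strictly below the cutoff (whether the comparison is strict is not stated), and omits the random filling of missing positions. The function name `contact_matrix` is hypothetical.

```python
import numpy as np

def contact_matrix(xyz, cutoff=450.0, stride=3):
    """Binary contact matrix from segment positions.

    xyz    : (n_segments, 3) positions in nm at the imaging resolution
    cutoff : distance threshold in nm
    stride : keep every `stride`-th segment (3 maps 30 kb to 90 kb)
    """
    coarse = np.asarray(xyz, dtype=float)[::stride]
    diff = coarse[:, None, :] - coarse[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return (dist < cutoff).astype(int)

# Toy chain: after striding, segments sit at x = 0, 400 and 1000 nm.
demo_xyz = np.array([[0, 0, 0], [10, 0, 0], [20, 0, 0],
                     [400, 0, 0], [0, 0, 0], [0, 0, 0],
                     [1000, 0, 0]], dtype=float)
demo_contacts = contact_matrix(demo_xyz)
```

Because the matrix depends only on pairwise distances, it is unchanged by rotating or translating the structure, which is the invariance the binarized representation is meant to provide.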
We performed additional tests to confirm that the results shown in Figs. 2 and 3 are robust to the cutoff for binarization (see Fig. S6) and resolution of the data (see Fig. S7).
Variational autoencoder
VAE attempts to compress imaging data (Q) into the low dimensional latent space, z, with an encoding neural network (q(z|Q)). Quality of the latent space is ensured with the simultaneous optimization of a decoding network (p(Q|z)) that aims to faithfully reconstruct the original imaging data from latent variables. The probability of a chromatin configuration represented in the binary contact matrix can be formally defined as
$$P_{\mathrm{VAE}}(Q) = \int p(Q|z)\, p(z)\, dz, \tag{3}$$
where p(z) is the prior distribution for latent variables. Directly computing this probability is intractable, however, and we used variational inference to give a lower bound on the (log) probability
$$\log P_{\mathrm{VAE}}(Q) \ge \mathbb{E}_{q(z|Q)}\left[\log p(Q|z)\right] - D_{\mathrm{KL}}\left[q(z|Q)\,\|\,p(z)\right]. \tag{4}$$
The two terms in the above equation correspond to the reconstruction error calculated using cross entropy and the Kullback-Leibler divergence between the posterior and prior distribution of latent variables.
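The two terms of the bound can be made concrete with a small numerical sketch. It is written in NumPy rather than PyTorch, and it assumes a Bernoulli decoder, a diagonal Gaussian posterior, and a standard normal prior; these forms are common choices but are not confirmed by the text. The function name `negative_elbo` is hypothetical.

```python
import numpy as np

def negative_elbo(q_true, q_recon, mu, log_var, eps=1e-12):
    """Single-sample estimate of the negative lower bound on log P(Q).

    q_true  : binary contacts (flattened)
    q_recon : decoder output probabilities for the same contacts
    mu, log_var : mean and log-variance of the assumed Gaussian posterior q(z|Q)
    """
    q_true = np.asarray(q_true, dtype=float)
    q_recon = np.clip(np.asarray(q_recon, dtype=float), eps, 1.0 - eps)
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    # Reconstruction term: binary cross entropy summed over contacts.
    cross_entropy = -np.sum(q_true * np.log(q_recon)
                            + (1.0 - q_true) * np.log(1.0 - q_recon))
    # KL divergence between N(mu, exp(log_var)) and the standard normal prior.
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return cross_entropy + kl

# Posterior equal to the prior, uninformative decoder: the loss is 2 * log(2).
loss = negative_elbo([1, 0], [0.5, 0.5], mu=[0.0, 0.0], log_var=[0.0, 0.0])
```

The returned value is an upper bound on –log P(Q) under these assumptions, which is why energies derived from it can deviate systematically from the true free energy.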
We implemented VAE models in PyTorch (74) and employed the stochastic gradient descent method with the Adam optimizer (75) to derive parameters with a batch size of 500. A total of 1000 epochs with a learning rate of 0.001 was used for model training to ensure the convergence of the loss function. One hidden layer with 200 nodes was used for both the encoding and decoding neural network. Two latent variables were used to define the folding coordinate for better interpretation. For more accurate estimation of probability distributions, we increased the latent variables to a total of 25 for results shown in Figs. 4-6.
Polymer simulations
We carried out two 50 million-step-long polymer simulations using the molecular dynamics package LAMMPS (76). These simulations were performed with reduced units with τ, σ, and ϵ as the time, length and energy unit, respectively. The timestep was set to dt = 0.01τ. Langevin dynamics with a damping coefficient of γ = 0.5τ was used to maintain the temperature at T = 1.0. We saved polymer structures at every 500 steps to collect a total of 100000 configurations from each simulation. Simulated polymer configurations were then converted to contact matrices with a cutoff of 3.0σ for VAE model parameterization. The cutoff was chosen to ensure that the simulated contact probability between neighboring beads is comparable to the experimental value.
The polymer consists of 28 beads to mimic the chromatin region at 90 kb resolution. The energy function for the reference model is defined as
$$U_{\mathrm{ref}}(r) = U_b(r) + U_{sc}(r) + U_{nb}(r). \tag{5}$$
Ub(r) is the harmonic bonding potential between neighboring beads with an equilibrium distance of 2.0σ and a spring constant of 1.0 ϵ/σ². Usc(r) is a soft-core potential applied to all the non-bonded pairs to account for the excluded volume effect and to allow for chain crossing (53, 68). It is equivalent to a capped off Lennard-Jones potential and only incurs a finite energetic cost for overlapping beads.
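As an illustration of the bonding term, the sketch below evaluates a harmonic bond energy along a chain in reduced units. It assumes the form ½k(r − r0)² summed over neighboring beads; simulation packages differ on whether the factor of ½ is absorbed into the spring constant, so the prefactor here is an assumption. The function name `bond_energy` is hypothetical.

```python
import numpy as np

def bond_energy(xyz, r0=2.0, k=1.0):
    """Total harmonic bond energy of a linear chain (reduced units).

    xyz : (n_beads, 3) bead positions in units of sigma
    r0  : equilibrium bond length in sigma
    k   : spring constant in epsilon / sigma^2
    Assumed form: 0.5 * k * (r - r0)^2 per bond.
    """
    xyz = np.asarray(xyz, dtype=float)
    bond_lengths = np.linalg.norm(xyz[1:] - xyz[:-1], axis=1)
    return 0.5 * k * np.sum((bond_lengths - r0) ** 2)

# Three beads on a line with bond lengths 2 and 3 sigma.
energy = bond_energy([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
```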
Unb(r) is a weak collapsing potential with the following form where rc = 3.0σ and η = 10.0. α = –0.04 ϵ was chosen such that the number of contacts formed by the reference polymer is comparable to that for chromatin.
Polymer beads in the chromatin-like model experience additional specific interactions besides those defined in Eq. 5. In particular, an attractive potential similar to Unb(r) with α = –0.1ϵ was applied between beads within the first or second half of the polymer to promote domain formation.
Acknowledgement
We thank Xingqiang Ding for helpful discussions. This work was supported by the National Science Foundation (Grant MCB-1715859) and the National Institutes of Health (Grant 1R35GM133580-01).