A unified framework for inferring the multi-scale organization of chromatin domains from Hi-C

Chromosomes are giant chain molecules organized into an ensemble of three-dimensional structures characterized with its genomic state and the corresponding biological functions. Despite the strong cell-to-cell heterogeneity, the cell-type specific pattern demonstrated in high-throughput chromosome conformation capture (Hi-C) data hints at a valuable link between structure and function, which makes inference of chromatin domains (CDs) from the pattern of Hi-C a central problem in genome research. Here we present a unified method for analyzing Hi-C data to determine spatial organization of CDs over multiple genomic scales. By applying statistical physics-based clustering analysis to a polymer physics model of the chromosome, our method identifies the CDs that best represent the global pattern of correlation manifested in Hi-C. The multi-scale intra-chromosomal structures compared across different cell types uncover the principles underlying the multi-scale organization of chromatin chain: (i) Sub-TADs, TADs, and meta-TADs constitute a robust hierarchical structure. (ii) The assemblies of compartments and TAD-based domains are governed by different organizational principles. (iii) Sub-TADs are the common building blocks of chromosome architecture. Our physically principled interpretation and analysis of Hi-C not only offer an accurate and quantitative view of multi-scale chromatin organization but also help decipher its connections with genome function.

C hromosome conformation capture (3C) and its derivatives, which are used to identify chromatin contacts through the proximity ligation techniques (1,2), take center stage in chromosome research (3,4). Square-block and checkerboard patterns manifested in Hi-C data provide glimpses into the organization of chromatin chains inside cell nuclei. Despite cellto-cell variations inherent to Hi-C data, which is in practice collected from a heterogeneous cell population, the cell-type specificity of chromosome architecture gleaned from Hi-C is still clear. Furthermore, the change of Hi-C pattern with the transcription activity and the phase of cell cycle underscores the functional roles of chromosome structure in gene regulation (5)(6)(7)(8)(9)(10)(11)(12)(13)(14). Given that pathological states of chromatin are also manifested in Hi-C (15,16), accurate characterization of chromatin domains (CDs) from Hi-C data is of great importance in advancing our quantitative understanding of the genome function.
Currently there are many algorithms available to identify CDs from Hi-C data, making important contributions to understanding the intra-chromosome architecture. However, CDs identified using different algorithms or parameters display significant variations, and there is no generally accepted definition for the above-mentioned CD at each scale. For example, the average size of a TAD varies from 100 kb to 2 Mb depending on the specific algorithm being used. Furthermore, still lacking is a unified algorithm to characterize CDs at multiple scales. In many of the existing algorithms specialized in finding CDs at particular genomic scales, Hi-C data should first be formatted at specific resolution (18,19,22). Although there are methods (32,33) developed for identifying hierarchical domain structures of chromosomes, their algorithms rely on local pattern recognition analyses (18,19,22,34), not implementing the physical viewpoint that chromosomes are a three dimensional object made of a long polymer (6,14,(35)(36)(37)(38)(39)(40).
Here, we interpret Hi-C data as pairwise contact probability matrix resulting from polymer networks whose inter-monomer distances are harmonically restrained. The cross-correlation matrix derived from Hi-C is used as the sole input for Multi-CD, the algorithm that we have developed to identify CD at varying genomic scales. Agreement of CD solutions from Multi-CD with the previous knowledge on chromatin organization as well as with information from bio-markers indicates the reliability of Multi-CD. A single algorithm-based solutions of CD allow us to assess the multi-scale structure of chromatin from a more unified perspective. We assert that amid the rapidly expanding volume of Hi-C data (10)(11)(12), Multi-CD holds good promise to quantitative and principled determination of chromatin organization. Adjacent TADs are merged to meta-TAD (6), and individual TAD is further decomposed into sub-TADs (19,(28)(29)(30)(31).
phase is slow (39,40,48,49), such that the system remains in local mechanical equilibrium over an extended period of time, as captured by the stable patterns in the Hi-C data (14). Although a great amount of cell-to-cell variation is expected for a population of cells (39,49,50), fluorescence measurement indicates that the spatial distances between pairs of chromatin segments can be well described by the gaussian distribution (21, 51-53) (see Fig. S1). This motivates us to model the chromosome structure using a gaussian polymer network whose configuration fluctuates around its local mechanical equilibrium state (54,55). See SI Appendix for more justifications for the use of the Gaussian polymer network model.
For a polymer chain whose long-range pairwise interactions are restrained via harmonic potentials with varying stiffness, the distance between each pair of segments i and j is written in the following form P (rij; γij) = 4 √ π γ 3/2 ij r 2 ij exp(−γijr 2 ij ), [1] with γ −1 ij = 2(σii+σjj −2σij), where σij is the positional covariance determined by the topology of polymer network (14,42). The contact probability between the two segments, pij, is the probability that their pairwise distance rij is below a cutoff rc. In other words, it is calculated using pij = rc 0 dr P (r; γij), where P (r; γij) is the distance distribution in Eq. 1. Importantly, this model establishes a one-to-one mapping from the contact probability pij to the parameter γij, which allows one to determine the covariance matrix {σij} and consequently the cross-correlation matrix, (C)ij = σij/ √ σiiσjj (see Materials and Methods for mathematical details). As a result, the following transformation from pij to (C)ij is conceived: pij −→ γij −→ σij −→ (C)ij. [2] The resulting cross-correlation matrix C is used as the input data for our domain inference algorithm (see Fig. 2).
We propose a principled interpretation of Hi-C data as a contact probability matrix, which can be derived from a mathematically tractable yet physically meaningful model of gaussian polymer network. As a pre-processing method, our approach can replace the common but arbitrary use of nonlinear (most often logarithmic) scaling of the Hi-C data.
Modeling the correlations with the group model. Clustering a correlation matrix into a finite number of correlated groups is a general problem discussed in diverse disciplines. We adapted a statistical mechanical formalism known as the "group model," developed for identifying the correlated groups of companies from empirical data of stock market price fluctuations (56)(57)(58). Without ambiguity, the formalism can be applied to the clustering of correlated genomic segments in a chromosome.
Let us assume that each genomic segment i ∈ {1, 2, · · · , N } belongs to a chromatin domain si. Then the vector s = (s1, s2, . . . , sN ) can be called the domain solution for the N segments. For example, a state s = (1, 1, 1, 2, 2, 3) describes a structure where the 6 genomic segments are clustered into 3 domains. Indexing of the domains is arbitrary. If there are K distinct domains in the solution, we can always index the domains such that si ∈ {1, 2, · · · , K}.
We also assume that the cross-correlation matrix C, captured by Hi-C, is essentially described by the correlation of a set of hidden variables {xi} where xi represents the "genomic state" of the i-th chromatin segment. Without loss of generality we can only consider the case where xi has zero mean and unit variance; or simply standardize as (xi − xi )/σx i → xi. Adapting the formalism in Refs. (56,57), we assume that each xi obeys the following stochastic equation where ηs i and i are two independent and identically distributed (i.i.d) random variables with ηs i , i ∼ N (0, 1), that are linked to the domain (si) and the individual segment (i) respectively. The parameter gs i (≥ 0) is associated with each domain si, such that a larger gs i indicates a stronger contribution from the domain-dependent variable ηs i . The cross-correlation between two segments i and j is written as In light of Eq. 4, the first term of Eq. 3 on the right hand side contributes to intra-CD correlation of the si-th CD with increasing gs i ; the second term of Eq. 3 corresponds to a noise that randomizes the intra-domain correlation. Regarding (C)ij calculated from Hi-C as the correlation between the stochastic variables xi and xj, each representing the genomic state of segment (Eq. 4), namely, (C)ij ⇔ xixj , [5] we use C as an input data for the clustering approach prescribed by the group model. Our goal is to find the domain solution s = (s1, s2, · · · , sN ) that best represents the pattern in the correlation matrix C, with an appropriate set of strength parameters g = (g1, g2, · · · , gK ), where K is the number of distinct domains in the solution. Using the group model described above, we can calculate the likelihood p(C|s, g) that the observed correlation 2 matrix C was drawn from an underlying set of domains s with strengths g (see SI Appendix for derivation). The loglikelihood is written as a sum over all domains in the solution s: [6] where n k = N i=1 δ s i ,k is the size of domain k, and c k = N i,j=1 Cijδ s i ,k δ s j ,k is the sum of all intra-domain correlation elements. The log-likelihood in Eq. 6 is maximized atĝ k = (c k − n k )/(n 2 k − c k ) for each k, allowing us to consider the reduced likelihood p(C|s) ≡ maxg p(C|s, g) = p(C|s,ĝ).
For convenience, we define an energy-like cost function E(s|C) for a domain solution s given a correlation matrix C, such that the problem of finding the maximum-likelihood solution s is equivalent to finding the s that minimizes E(s|C). Specifically, it is useful to write the likelihood function as p(C|s) ∝ exp(−E(s|C)/T ) to resemble a Boltzmann distribution, with where the new parameter T > 0 corresponds to the thermodynamic temperature. The best domain solution can be found by using the simulated annealing method where T is slowly decreased.
Inferring the domain solutions at multiple scales: a tunable group model. Besides evaluating how well a domain solution s explains the correlation pattern in the data, which is captured by the likelihood, we also want to impose an additional preference to more parsimonious solutions. This is done by introducing a prior distribution, p(s). Here we use the form p(s) = exp(−λK(s)/T ), where K(s) is a function that gives a larger value for a more fragmented solution s (larger number of domains), and λ (≥ 0) is a parameter that scales the overall strength of the prior. Specifically, K(s) is defined such that log K(s) is the entropy of s: This quantity measures the effective number of domains; for example, K(s) = K when the domain sizes are uniform. Taking a Bayesian approach, we combine the likelihood and the prior, to construct the posterior distribution according to the Bayes rule: p(s|C) ∝ p(C|s) p(s). The posterior is written in terms of an effective Hamiltonian H: Finding the maximum a posteriori (MAP) estimate, s * = argmax s p(s|C), is equivalent to solving a global minimization problem for H(s|C). As the s-space is expected to be highdimensional and is likely characterized with multiple local minima, we use simulated annealing (59) to find the energyminimizing s * (Fig. S2) (see Materials and Methods for the details).
Our algorithm, Multi-CD, is a principled method for inferring the chromatin domain (CD) solutions at multiple scales. The algorithm of Multi-CD is repeatedly applied to C to determine the best domain solutions at different values of λ (outer red box). At each λ, the best domain solution is found through a simulated annealing, in which the effective temperature T is gradually decreased (inner blue box). We use a MCMC sampling method to approximate the posterior distribution p(s|C) ∝ exp(−H(s|C; λ)/T ) and to find the domain solution s at each fixed temperature T . The boxes in red and blue represent the two levels of iterations varying λ and T , respectively, in the algorithm.
The group model becomes "tunable" at the level of inference through the parameter λ: as λ is increased, the resulting estimate s * tends to have a fewer number of domains. In parallel to the statistical physics problem of a grand-canonical ensemble, T is the effective temperature of the system, and λ amounts to the negative chemical potential. observed that the onset of heterogeneity was related to the appearance of non-local domains (Fig. 3c).
(iii) We quantified the goodness of each CD solution by comparing its corresponding binary matrix against the Hi-C data in terms of the normalized mutual information (nMI; see Materials and Methods for details). There is a scale, λ * , at which the diagonal block pattern manifested in Hi-C data is most accurately captured. In Fig. 3g, the best solution was found at λ * ≈ 30 for GM12878, HUVEC, and NHEK; the λ * 's were identified at larger values for K562 and KBM7. As an interesting side note, K562 and KBM7 belong to immortalized leukemia cell lines, whereas the other three cell types are normal cells; the different statistical property of Hi-C patterns manifested in λ * may hint at a link between the pathological state and a coarser organization of the chromosome.
(iv) The CD solutions inferred by Multi-CD, especially the families of local CDs, appeared to be conserved across different cell types ( Fig. 3h-i). We quantified the extent of domain conservation, in terms of the Pearson correlation (see Materials and Methods), averaged over all pairs of different cell types. Domain conservation was strong for smaller domains at λ ≤ 30 ( n 1.5 Mb), with the strongest conservation at λ = 10 (Fig. 3h). The CDs at λ = 10 are shown in Fig. 3i for five different cell types.
(v) Finally, we quantified the similarity between pairs of CD solutions obtained at different scales, again using the similarity measure based on Pearson correlation. In the case of GM12878, the family of CD solutions is divided into two regimes; the smaller-scale CD solutions from a range of 10 ≤ λ ≤ 40 are correlated among themselves, and the larger-scale CD solutions from λ > 40 as well. CD solutions below and above λ ≈ 40 are not correlated with each other (Fig. 3j; also see Fig. S6 for the other cell lines).
The division boundary in (v) is found at a λ value in the sim-ilar range with the best-clustering scale λ * , and the crossover λcr from local/homogeneous to non-local/heterogeneous CDs (compare Fig. S6 to Fig. 3f-g). Hereafter we will refer to the two regimes as the family of local CDs with homogeneous size distribution (λ λ * ), and the family of non-local CDs with heterogenous size distribution (λ λ * ).
TAD-like organizations in the family of local CDs. We identified at least three important scales in the family of local CDs. First of all, there was a scale at which domain conservation was maximized across different cells (λ = 10). This observation is consistent with the widely accepted notion that TADs are the most well-conserved, common organizational and functional unit of chromosomes, across different cell types (27,60). Thus, for the example from human chromosome 10, we identify the CDs found at this scale λ = 10 as the TADs. The average domain size at λ = 10 was n ≈ 0.9 Mb, which agrees with the typical size of TADs as suggested by previous studies (22,23). The goodness of clustering, on the other hand, was maximized at a larger scale, λ ≈ 30 ( n ≈ 1.5 Mb) for GM12878. The CDs at this scale turned out to be aggregates of multiple TADs in the genomic neighborhood, from visual inspection (see Fig. 3c), or as quantified in terms of our nestedness score (see Materials and Methods). We therefore identify these CDs as the "meta-TADs", a higher-order structure of TADs, adopting the term of Ref. (6). In contrast to a previous analysis that extended the range of meta-TADs to the entire chromosome (6), we use the term meta-TAD exclusively for the larger-scale local CDs, distinguishing them from the non-local structures (i.e., compartments, discussed below). We note, however, that the terminologies of TADs and the meta-TADs are still not definitive -a recently proposed algorithm based on structural entropy minimization (61) found that the "best" solutions were found at ∼ 2 Mb domains, which is consistent with our findings, although these domains were called the TADs in Ref. (61).
Finally, a trivial but special scale is λ = 0, where no additional preference for coarser CDs is imposed. The CDs at this scale are supposed to best explain the local correlation pattern that is reflected in the strong Hi-C signals near the diagonal. These smaller CDs are almost completely nested in the TADs and the meta-TADs; we can therefore call them the sub-TADs. We also confirmed that the sub-TAD solutions were not limited by the resolution of the Hi-C data; sub-TADs were robustly reproduced from a finer, 5-kb Hi-C (Fig. S8).
The first three panels in Fig. 3c shows three representative TAD-like CD solutions at λ ≤ λ * : sub-TADs (λ = 0; smallest CDs), TADs (λ = 10, strongest domain conservation), and meta-TADs (λ = λ * = 30, largest nMI). The nested structure is reminiscent of the hierarchically crumpled structure of chromatin chains (37,62).  database (63), we identified the regulatory elements for this gene (enhancers and promoters) in the interval between 26.65 and 27.15 Mb. Notably, our Multi-CD solutions show that the interval associated with the regulatory elements is fully enclosed in the same TAD in GM12878 and KBM7, whereas it is split into different TADs in the other three cell lines (Fig. 3k). The observation suggests that, for a gene to be expressed, it is critical that all regulatory elements are within the same TAD; this is consistent with the understanding that TADs define the functional boundaries for genetic interactions (5,6,17,20).

Compartments as the best domain solution that coexists with
TAD-like domains. The super-Mb sized domains are generally defined as the compartments in the chromosome organization (27). Compartments are characterized by the checkerboard pattern in off-diagonal part of correlation matrix, being highly non-local CDs. Our formulation of the group model has the flexibility for dealing with the non-local CDs naturally. However, a naïve application of Multi-CD by increasing λ did not identify the compartments; some non-local CDs were found (Fig. 3c), but they do not correspond to alternating patterns characteristic to compartments.
We hypothesize that compartments correspond to a secondary CD solution, that coexists with a best solution that was already identified. Assuming a statistical independence between the two solutions and the additivity of cross-correlations (extension of Eq. 3), the inference of the secondary solution is reduced to a standard Multi-CD applied to a modified input data, which is essentially the result of taking out the best CD solution from the original correlation matrix C (see SI Appendix). Here we consider a simplified version of this problem, and remove from C a diagonal band of width 2 Mb, similar to the size of meta-TADs (Fig. 4a). Applied to the modified Hi-C, Multi-CD successfully captures the non-local correlations, and identifies two large compartments with alternating patterns (Fig. 4b). The correspondence is clearer when the indices of segments are re-ordered ( Fig. 4c). Because the larger CD (k = 1) shows a greater number of contacts ( Fig. S9), it can be associated with the B-compartment, which is usually more compact; k = 2 is associated with the A-compartment. Further validation of the two compartments will be presented below, through comparisons with epigenetic markers.
To compare the goodness of our compartments with existing methods, we calculated the nMI against the C O/E matrix, the conventional form for compartment identification (see Materials and Methods) (18). We find that Multi-CD outperforms GaussianHMM (19), a widely accepted benchmark, in capturing the large-scale structures in Hi-C (Fig. 4d).

Multi-scale, hierarchical organization of chromatin domains.
Now that we identified four classes of CD solutions, namely sub-TADs, TADs, meta-TADs and compartments, we examined their hierarchical relationships. Note that these CDs were obtained independently at the respective λ values, not through a hierarchical merging. Sub-TADs or TADs are almost always nested inside a meta-TAD, and TADs inside a meta-TAD, whereas there are mismatches between the TAD-like domains and the compartments. We quantified this relationship in terms of a nestedness score h, such that h = 0 indicates the chance level and h = 1 a perfect nestedness (see Materials and Methods), along with a visual comparison of each pair of CD solutions (Fig. 5a). This analysis confirms that the hierarchy between any pair of TAD-like domains (sub-TADs, TADs, and meta-TADs) is appreciably strong. On the other hand, the hierarchical links between the TAD-like domains and compartments are much weaker, which is again consistent with the recent reports that TADs and compartments are organized by different mechanisms (64,65).
Although the nestedness score between sub-TADs and compartments (nestedness score h = 0.67) is not so large as those among the pairs of TAD-based domains, it is still greater than those between TADs and compartments (h = 0.55) or between meta-TADs and compartments (h = 0.43). Thus, sub-TAD can be considered a common building block of the chromatin architecture (see Fig. 5b).

Validation of domain solutions from Multi-CD.
The CD solutions from Multi-CD are in good agreement with the results of several existing methods. Specifically, our CDs correspond to the previously proposed sub-TADs (19) at λ = 0, to the TADs (22) at λ ≈ 10, and to the compartments (19) at λ ≈ 90 (see Fig. S10). When assessed in terms of the nMI, Multi-CD outperforms the corresponding alternatives (ArrowHead (19), DomainCaller (22), GaussianHMM (19) for sub-TADs, TADs, meta-TADs) at the respective scales (Fig. 6a).
In order to further validate the biological relevance of the CD solutions from Multi-CD, we compared them with several biomarkers that are known to be correlated with the spatial organization of the genome (66). All results shown here are for chr10 of GM12878.
First, we calculated how much the boundaries of our sub-TAD and TAD solutions are correlated with the CTCF signals, which are known to be linked to TAD boundaries (22, 23) (Fig. 6b). We quantified this in terms of a correlation function, χ(d), where d is the genomic distance between a domain boundary and each CTCF signal (see Materials and Methods). The correlation function shows a strong enrichment of CTCF signals at domain boundaries (high peak of correlation at d ≈ 0), as well as precision (fast decay of correlation as d increases). Multi-CD performs similarly to ArrowHead and DomainCaller in terms of the enrichment at the boundary, and does better in terms of the precision (Fig. 6b). Specifically, when fitted to exponential decays, the correlation lengths are 34 kb (λ = 0) and 143 kb (λ = 10) for Multi-CD, compared to 900 kb for the two previous methods (Fig. 6b).
Next, we compared our compartment solutions (CDs at λ = 90, shown in Fig. 4b) with the replication timing profiles (Repli-Seq), which are known to correlate differently with the A-and B-compartments (7,67). Our inferred compartments exhibit the anticipated patterns of replication timing (Fig. 6c); the Acompartment shows an activation of replication signals in the early-phases (G1, S1, S2) and a repression in the later phases (S3, S4, G2), whereas the B-compartment shows an opposite trend. There is a clear anti-correlation between the replication patterns in the two compartments along the replication cycle ( Fig. 6c), as quantified in terms of the Pearson correlation (Fig. 6d). Comparison to other epigenetic markers, such as the pattern of histone modifications, further confirms the association of our CD solutions with the A/B-compartments (Fig. S11).

Discussion
Multi-CD has many essential advantages that will make it a useful tool for the study of chromatin organization. As a computational algorithm, Multi-CD includes two core steps: the pre-processing of raw Hi-C data into a correlation matrix, and the inference of chromatin domain (CD) solutions from the correlation matrix. The pre-processing is based on a model of gaussian polymer network, allowing a physically justifiable interpretation of the Hi-C data.
The domain identification problem is formulated by combining the group model with a Bayesian inference for the CD solution. The formulation of Multi-CD that optimizes the recognition of global pattern appeared in Hi-C naturally deals

Pearson correlation
Repli-Seq signal with non-local CDs, which differentiates Multi-CD from previous methods that focus on local features in Hi-C, such as CD boundaries or loops enriched with higher contact frequencies.
Moreover, Multi-CD can find CDs across a wide range of scales without having to adjust or down-sample the Hi-C data to match the scale of CDs to be identified, which is an important improvement over many existing methods. An important feature of Multi-CD, as emphasized in the name, is that it provides a unified framework to identify CDs at multiple scales, where the scales of the CDs are tuned by a single parameter λ. The resulting family of CD solutions allow quantitative comparisons between CD solutions at different scales. The analysis revealed special scales at which the CD solutions are particularly interesting: sub-TADs (λ = 0), TADs (λ = 10, where domain conservation was strongest), and meta-TADs (λ = 30, where the correlation pattern was best captured). At larger scales, we found that compartments (λ = 90) emerge as a secondary solution that can be inferred after removing the local signals in the correlation matrix that correspond to the TAD-like solutions. We confirmed that Multi-CD successfully reproduces, or even outperforms, the existing methods to identify CDs at the specific scales. Importantly, Multi-CD achieves this performance through a single unified algorithm, which not only identifies the specific CD solutions accurately, but also allows a comparative analysis of the multi-scale family of solutions.
In particular, we characterized the hierarchical organization of the chromatin by quantifying the similarity and the nestedness between CD solutions at two different scales. We showed that the characteristics of CD solutions shared by the local, TAD-like domains do not precisely hold together in the non-local, compartment-like domains. This finding is consistent with the recent studies which report that compartments and TADs are formed by different mechanisms of motor-driven active loop extrusion and microphase separation, and that they do not necessarily have a hierarchical relationship (65,(68)(69)(70). Meanwhile, the sub-TADs are nested in each of the other three solutions, including the compartments (Fig. 5), indicating that sub-TADs are the fundamental building blocks of the higherorder CD organization. In fact, the existence of sub-TADs is robust when a finer-resolution Hi-C is considered. Applying Multi-CD on Hi-C data at 5-kb resolution, we clearly recover the sub-TADs that are consistent with the sub-TADs obtained from the 50-kb Hi-C (see Fig. S8).
While there are methods that report hierarchical CDs (32,33), Multi-CD makes significant advances both algorithmically and conceptually. Multi-CD can detect non-local domains with better flexibility instead of finding a set of intervals. Multi-CD also avoids the high false-negative rate that is typical of the previous method (e.g., TADtree (32)) that focuses on the nested domain structure (Fig. S12). Further, employing an appropriate prior to explore the solution space effectively, Multi-CD can avoid the problem encountered in Armatus (33) which skips detection of domains in some part of Hi-C data while its single scale parameter is varied (Fig. S12).
Multi-CD is a method of great flexibility that can be readily applied to analyze any dataset that exhibits pairwise correlation patterns. However, two cautionary remarks are in place for more careful interpretation of the results. (i) In general, the relevant values of λ depend on the resolution of the input Hi-C dataset, as well as on the cell type. While λ is a useful parameter that allows comparative analysis, its specific value does not carry any biological significance. Although we referred to a specific CD solution by the corresponding value of λ in the current analysis (Fig. 3), the lesson should not be that TADs, for instance, always correspond to the particular value of λ; instead, TADs should be identified as the most conserved CD solutions across cell types after scanning a range of λ's.
(ii) Multi-CD is agnostic about whether the collected data is homogeneous or heterogeneous. Application of Multi-CD to single-cell Hi-C data, and the subsequent interpretation of the result, would be straightforward; however, if the input Hi-C data were an outcome of a mixture of heterogeneous subpopulations, the solution from Multi-CD would correspond to their superposition. This is a fundamental issue inherent to any Hi-C data analysis method. Despite the presence of cell-to-cell variations, the population-averaged pattern manifest in Hi-C carries a rich set of information that is specific to the cell type. The need for interpretable inference methods that can extract valuable insights into the spatial organization of the genome, including ours, is still high.
To recapitulate, in order to glean genome function from Hi-C data that varies with the genomic state (10-13), a computationally accurate method to identify CD structures is of vital importance. Multi-CD is a physically principled method that identifies multi-scale structures of chromatin domains by solving the global optimization problem. We find the chromatin domains identified from Multi-CD in excellent match with biological data such as CTCF binding sites and replication timing signal. Quantitative analyses of CD structures identified across multiple genomic scales and various cell types offer general physical insight into chromatin organization inside cell nuclei.

Interpretation of Hi-C data.
Normalization and contact probability. Here we describe how the Hi-C data can be interpreted as a set of contact probabilities for pairs of genomic segments, p ij . Typically, a Hi-C matrix have widely varying row-sums; for example, the net count of the i-th segment in the experiment is much larger than the net count of the j-th segment. To marginalize out this site-wise variation and only focus on the differential strengths of pairwise interactions, the raw Hi-C matrix Mraw is normalized to have uniform row and column sums. This is achieved using the Knight-Ruiz (KR) algorithm (71), which finds a vector v = (v 1 , · · · , v N ) for calculating (M) ij = v i v j (Mraw) ij , such that each row (column) in M sums to 1.
We assume that the normalized Hi-C signal is proportional to the contact probability: (M) ij ∝ p ij . Note that p ij is the probability that the two segments i and j are within a contact distance, and the rows of the contact probability matrix (P) ij = p ij is not required to sum to 1. Because the proportionality constant is unknown a priori, however, we have a free parameter to choose. We do this by fixing the average nearest-neighbor contact probability, p 1 = p i,i+1 . We expect thep 1 to be relatively close to 1, assuming that nearest-neighbor contact is likely; but not exactly 1, because there are variations among the nearest-neighbor Hi-C signal. In this work we chosep 1 = 0.9. The resulting contact probability matrix P is given as (P) ij = min(1,p ij ), withp ij = (p 1 /µ)(M) ij , where µ = M i,i+1 is the Hi-C signals averaged over the nearestneighbors. Atp 1 = 0.9, in our case, the fraction of over-saturated elements (p ij > 1) was sufficiently small.
Building correlation matrix from Hi-C. A chromosome can be regarded as a polymer chain containing N monomers, each of which corresponds to the i-th genomic segment and its spatial position is written as r i . Adapting the random loop model (RLM) (42), we interpret chromosome conformation as described by an ideal polymer network, with cross-links between spatially close segments. In RLM, we describe the spatial positions of the polymer segments using a gaussian distribution with zero mean and and a covariance matrix Σ, with elements (Σ) ij = σ ij = δr i · δr j . It follows that the distance r ij = |r i − r j | between two monomers i and j can be described in the form of a weighted gaussian function (Eq. 1) where the variance (2γ ij ) −1 is associated with the covariance matrix elements as γ −1 ij = 2(σ ii + σ jj − 2σ ij ). The contact probabilities can be calculated from the distribution of pairwise distances (Eq. 1), by saying that two segments i and j are in contact when their distance r ij is below a cutoff, rc. In other words, we write [10] where erf(x) = 2 √ π x 0 dte −t 2 . The value of γ ij is uniquely determined for each p ij ; once we have the γ ij 's, we can reconstruct the covariance matrix {σ ij }. Because the diagonals are underdetermined in Hi-C (self-contacts are not reported), we assume a uniform variance σ ii = σ jj = σc along the diagonal. Note that although the value of γ ij depends on the choice of rc, its effect is only to scale the γ ij 's as γ ij → r 2 c γ ij , and consequently the σ ij 's. Finally, we normalize the covariance matrix to build the correlation matrix C: It appears that σc sets the overall intensity of C. Here, we chose the value of σc as the median of 1/4γ ij , i.e., σc = median(1/4γ ij ). This cancels out the scaling effect of rc in σ ij , so that the choice of rc does not affect the ultimate construction of the correlation matrix C.

The observed/expected (O/E) matrix and its Pearson correlation matrix.
The O/E matrix was used to account for the genomic distancedependent contact number due to random polymer interactions in chromosome (18). Each pair (i, j) in O/E matrix is calculated by taking the count number M ij (observed number) and dividing it by average contacts within the same genomic distance d = |i − j| (expected number). Since the expected number could be noisy, one smooths it out by increasing the window size (see refs (18,19) for further details). In this study, we used the expected number obtained from (19). The Pearson correlation matrix of the O/E (C O/E ) represents the overall contact pattern through pairwise correlation coefficients between segments. In sampling the space of CD solutions, a move from a state s to another state s is defined such that the two CD solutions (s, s ) differ only by one genomic segment. More precisely, because a CD solution is invariant upon permutations of the domain indices, the distance between s and s is uniquely defined as the minimal number of mismatches over all possible domain index permutations.

Metropolis
To ensure that the sampling is properly conducted, we continue the sampling until each chain collects ttot ≥ 5τ * samples in the CD solution space. Here τ * is the "relaxation time" defined as the number of steps it takes until the autocorrelation function R(τ ), drops significantly: τ * = argmin τ |R(τ ) − 1/e|. The autocorrelation function is calculated from the value of the total cost function H, as [12] where st is the t-th sample in the chain, and µ and σ are the mean and standard deviation of {H(st|C)}, respectively. In Eq.12, the running time average is taken over all the pairs of samples with the time gap of τ . We note that sampling is the computational bottleneck for our method; our stop condition (at 5τ * ) was chosen conservatively for accurate solutions. In practice, the sampling time can be reduced at the cost of an increased batch size (number of different initial configurations, as described below in simulated annealing), which is significantly cheaper if parallel computing is used. Simulated annealing. The simulated annealing process is described below.
Initialization. An initial configuration s (0) is generated in two random steps. First, the total number of CDs, K, is drawn randomly from the set of integers {1, · · · , N }. Then, each genomic segment i ∈ {1, · · · , N } is allocated randomly into one of the CDs, k ∈ {1, 2, · · · , K}. The initial temperature T 0 is determined such that the acceptance probability for the "worst" move around s (0) is 0.5.
Iteration. At each step r, the temperature is fixed at Tr. We sample the target distribution pr(s|C) ∝ exp(−H(s|C)/Tr), using the Metropolis-Hastings sampler described above. For the next step r + 1, the temperature is lowered by a constant cooling factor c cool ∈ (0, 1), such that the next temperature is T r+1 = c cool · Tr. We used c cool = 0.95 in this study.
Final solution. The annealing is repeated until the temperature reaches T f . We used T f = 0.03. Then we quench the system to the closest local minimum by performing gradient descent. Because there is still no guarantee that the global minimum is found, we tried a batch of at least 10 different initial configurations and chose the final state s * that gives the minimal H(s * |C).

Analysis on subsets of Hi-C data.
Our method allows the user to break down the Hi-C data into subsets, as long as the CDs are localized within the subsets (Fig. S13). This saves the algorithm from the large memory requirement of dealing with the entire intrachromosomal Hi-C (for example, Hi-C of chromosome 10 has 2711 bins in 50-kb resolution). For the analysis of the 50-kb resolution Hi-C data in this paper, we used subsets of the data that correspond to 40-Mb ranges along the genome, or 800 bins.

Similarity of two distinct CD solutions using Pearson correlation.
To measure the extent of similarity between two CD solutions s and s , we evaluate the Pearson correlation. The binary matrices B and B that represent the two CD solutions, are defined such that the matrix element are all 1's within the same CD and 0 otherwise. i.e., (B) ij = B ij = δs i s j . The similarity between B and B is quantified using the Pearson correlation where δBδB = (B ij −B)(B ij −B ) i =j , and (δB) 2 = (B ij − B) 2 i =j . The average · i =j runs over all distinct pairs.

Normalized mutual information.
We use the mutual information to evaluate how well a CD solution s captures the visible patterns in the pairwise correlation data. We consider the binary grouping matrix (B) ij = B ij = δs i ,s j for the CD solution of interest, and compare it to the input data matrix (A) ij = A ij . In this study, either log 10 M or C O/E was used for A. Treating the matrix elements a ∈ A and b ∈ B as two random variables, we construct the joint distribution [14] where · i =j is an average over all distinct pairs. The Kronecker delta for the continuous variable a is defined in a discretized fashion:  [15] and the normalized mutual information (nMI), H(B), [16] where H(X) = − x∈X p(x) log p(x) is the marginal entropy.

Nestedness of CD solutions.
Here we define a measure to quantify the nestedness between two CD solutions, s (assumed to have smaller domains on average) and s (larger domains). The idea is the following: s is perfectly nested in s if, whenever two sites belong to a same domain in s, they also belong to a same domain in s . For each domain k ∈ s, we consider the best overlap of this domain k on the other solution s : [17] where v k,k is the number of overlapping sites between two domains k ∈ s and k ∈ s , and n k is the size of domain k. The highest score h 1 (k → s ) = 1 is obtained when domain k is fully included in one of the domains in s . The null hypothesis corresponds to where the domains in s and s are completely uncorrelated, in which case h 1 only reflects the overlap "by chance". The chance levelh 1 (k → s ) is calculated by making n k random draws from s ; we averaged over 100 independent trials. We normalize the score aŝ such thatĥ 1 (k → s ) = 0 indicates the chance level, andĥ 1 (k → s ) = 1 means a perfect nestedness. Finally, we define the nestedness score h(s → s ) for the entire CD solution as a weighted average: Correlation between CTCF signal and domain boundaries. The validity of domain boundaries, determined from various CD-identification methods including Multi-CD, is assessed in terms of their correlation with the CTCF signal. Suppose that the CTCF signal at genomic segment i is given as φ CTCF (i). Then, we can consider an overlap function between φ CTCF (i) and a CD-boundary indicating function ψ DB (i), where ψ DB (i) = 1 if the i-th segment is precisely at the domain boundary; ψ DB (i) = 0, otherwise. We evaluated a distance-dependent, normalized overlap function χ(d), defined as [20] where δφ CTCF = φ CTCF − φ CTCF . If the domain boundaries determined from Multi-CD is well correlated with TAD-capturing CTCF signal, a sharply peaked and large amplitude overlap function (χ(d)) is expected at d = 0.  (B) ||h|. [21] Data availability. All data used in the paper were obtained from publicly available repositories. See SI Appendix.

Correlation between epigenetic marks and compartments.
Code availability. The Matlab software package and associated documentation are available online (https://github.com/multi-cd).

ACKNOWLEDGMENTS.
We thank Roger Oria Fernandez for critical comment on our script deposited in GitHub. This work was supported in part by a KIAS Individual Grant at Korea Institute for Advanced Study (No. CG035003 to C.H.). We thank the Center for Advanced Computation in KIAS for providing computing resources.

Supporting Information Appendix
A. Data acquisition.
Hi-C data. Hi-C data were obtained through GEO data repository (GSE63525-celltype-primary) (19), where celltype is replaced by one of the five different cell types that were considered in our analysis (GM12878, HUVEC, NHEK, K562, and KBM7).
Biological markers. The domain solutions from Multi-CD were compared with known biological markers. We obtained these data mostly from the ENCODE project (72). Specifically, we used the enrichment data of the transcriptional repressor CTCF measured in a Chip-Seq assay from http://hgdownload. cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUwTfbs/ wgEncodeUwTfbsGm12878CtcfStdPkRep1.narrowPeak.gz.
We binned the CTCF assay at 50-kb resolution (to match the Hi-C format). If there are multiple signal enrichments in a single bin, we took the average value. Because each CTCF signal has a finite width, there are occasional cases where a signal ranges across two bins; in those cases we evenly divided the signal strength into the two bins. The Repli-seq signals in the six phases G1, S1, S2, S3, S4, and G2 were obtained from http://hgdownload.cse.ucsc.edu/goldenPath/ hg19/encodeDCC/wgEncodeUwRepliSeq/, and were averaged over 50-kb windows along the genome to construct the replication timing profiles. The 11 histone mark signals were obtained from http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/integration_data_ jan2011/byDataType/signal/jan2011/bigwig/, and undergone the same preprocess as Repli-seq. The RNA-seq data for the four cell lines GM12878, HUVEC, NHEK and K562 were also obtained from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/ wgEncodeCaltechRnaSeq/. RNA-seq for the cell line KBM7 were separately obtained from https://opendata.cemm.at/barlowlab/2015_ Kornienko_et_al/hg19/AK_KBM7_2_WT_SN.F.bw. Information about the APBB1IP gene was obtained from the GeneCards database https://www.genecards.org/cgi-bin/carddisp.pl?gene=APBB1IP.

B. Justifications for the Gaussian polymer network for modeling chromosomes.
Here we provide additional justifications for the use of harmonic potentials in the effective Hamiltonian, and consequently, a gaussian distribution for pairwise distances (Eq. 1).
The concept of an effective Hamiltonian consisting of harmonic potential terms is not new; it has been widely employed to study a variety of systems, including the phase transition of vulcanized macromolecules with increasing numbers of crosslinks (73, 74), and the fluctuation dynamics of native proteins (gaussian network model, (75)). Furthermore, a slightly modified, but essentially identical, form of Hamiltonian was used to study the dynamics of folding/unfolding transitions of a single RNA molecule under external force (generalized Rouse model, (76)).
Whereas the success of the gaussian polymer network model does not necessarily guarantee its extension to the modeling of chromosomes, our use of a gaussian distribution for the pairwise distance between two segments in the polymer is empirically justified. The Gaussian-like pairwise distance distributions reported by fluorescence measurements of the chromosome (Fig. S1), and the agreement of the 3D structural properties inferred by a modeling approach that shares the same philosophy (heterogeneous loop model or HLM, (14)), suggest that a harmonic spring network provides a reasonable approximation of the energy landscape for the mixture of those subpopulations.
As a side note, it is worth highlighting the versatility of the Gaussian polymer network model in representing the complex topology of chromosome conformation. For the conventional Rouse chain whose monomers along the backbone are constrained by an energy hamiltonian H = (k/2) N −1 i (r i+1 − r i ) 2 with a uniform spring constant k, it is straightforward to show that r 2 ij ∼ |i − j|. Furthermore, if two monomers are in close proximity to form a contact (r ij < rc), then one can obtain the contact probability between monomers i and j in the chain backbone as p ij = rc 0 dr ij P (r ij ) ∼ |i − j| −3/2 . However, adding just a few non-nearest-neighbor interactions to the Rouse model makes the results highly nontrivial. To illustrate this, we explicitly compared the contact probability map of a linear Gaussian chain (Rouse chain), and those of Gaussian polymer network models with varying numbers of non-nearest-neighbor interactions, which were calculated from the HLM-generated structural ensemble (14) (see Fig. S14). The statistical behavior of Gaussian polymer network model differs from that of the linear "Gaussian" chain. The mean square distance r 2 ij no longer scales linearly with the separation s ≡ |i − j| (see Fig. S14e), and the contact probability p ij (or p(s)) is no longer described with a simple scaling relation (Fig. S14f). The simple modification to the Rouse model, resulting in the Gaussian polymer network model, allows one to explore many different issues of chromosomes. In fact, our recent work based on HLM (14) demonstrated several case studies, substantiating the various experimental measurements on chromosome conformation by solving the inverse-problem of inferring chromosome structures from Hi-C data.

C. Derivation of the likelihood function.
Here we show how to derive the likelihood function, Eq. 6.

Problem. We want to compute
[S1] with the following assumptions: • x ∈ R N is a sequence of normalized and uncorrelated observations, with zero mean x = 0 N and unit covariance Cov(x) = I N .
• s = (s 1 , · · · , s N ) is a clustering map that assigns each site i ∈ {1, · · · , N } to a cluster index s i ∈ {1, · · · , K}. Without loss of generality, we can assume that s i ≤ s j whenever i < j (ordered indexing).

D. Modification of the correlation matrix for the secondary domain solution.
Here we show how the correlation matrix can be modified to solve for a secondary domain solution (for example compartments), by taking out the contribution from the primary domain solution (for example the meta-TADs), which is supposed to be known already.
We extend the group model to consider a bivariate grouping, i → (s i , u i ), where each genomic locus i simultaneously belongs to a primary group s i and a secondary group u i , presumably at different scales. Generalizing Eq. 3, we assume a linear model where ξu i and hu i are respectively the random variable and the grouping strength parameter that correspond to the secondary group u i . If we further assume that s i and u i are statistically independent, the pairwise correlation between two loci i and j can be written as the contributions from different groups are additive. We can do a straightforward algebra to rearrange the above model as such that the right-hand side becomes the single-group model. The left-hand side of this expression is a normalized residual of the correlation, which we will call C res ij . If we have already inferred the primary group s i and the corresponding strengthg without considering the secondary group, the correspondence to this two-group model (due to normalization) is given as g = (1 + h)g. Substituting this, and replacing the model correlation x i x j with the data C ij , the left-hand side of the previous expression is simplified to [S14] C res is written in terms of the original data C and the primary solution s (andg) only; it is independent of the unknown secondary solution u (and h). Now u is the solution of a modified single-group problem, using this residual correlation C res as the input data.        We plot the genomic distance-dependent contact number for the two large domains k = 1 and k = 2, from the solution at λ = 90 for chromosome 10 in GM12878. At short genomic distance, the domain solution of k = 1 is characterized with a greater number of contacts than k = 2, which suggests that k = 1 domain is locally more compact. We therefore associate the first domain k = 1 to the B-compartment, and the second domain k = 2 to the A-compartment.