Multi-CD: Multi-scale discovery of Chromatin Domains

,


INTRODUCTION
Chromosome conformation capture (3C) and its derivatives, which are used to identify chromatin contacts through the proximity ligation techniques [1,2], take center stage in studying the organization and function of chromosomes [3,4]. It is clear from the genomewide interaction profiles of Hi-C data that details of chromosome architecture not only vary with cell type but also with the transcription activity and the phase of cell cycle, underscoring the functional roles of chromosome structure in gene expression and regulation [5][6][7][8][9][10][11][12][13]. Since pathological states of chromatin are also manifested in Hi-C [14,15], accurate characterization of chromatin domains (CDs) from Hi-C data is of utmost importance.
Before discussing our new method and algorithm, we give a brief overview of the current knowledge on scale-dependent organization of chromatin [6,[16][17][18][19][20][21]. Chromosomes packaged inside the nucleus are first segregated into their own territories (Fig. 1a) [18]. At the scale of O(10) Mb, alternating blocks of active and inactive chromatin are phase-separated into two megabase sized aggregates, Sub-megabase to megabase sized chromatin folds into TADs. According to recent studies, adjacent TADs are merged to meta-TAD [6], whereas individual TAD is further decomposed into sub-TADs [17, 28-31]. . It was suggested that the proximal TADs in genomic neighborhood aggregate into a higher-order structural domain termed "meta-TAD" [6], although such proposal has not fully been accepted by the community. At smaller genomic scale, each TAD is split into sub-structures called sub-TADs that display more localized contacts [17, 28-31] (Fig. 1c). A number of algorithms make important contributions toward understanding the intra-chromosome architecture by identifying CDs from Hi-C data. However, CDs identified using different algorithms or parameters display significant variations, and currently there is no universally accepted definition of CD at each genomic scale. For example, the average size of a TAD varies from 100 kb to 2 Mb depending on the specific algorithm being used. Furthermore, still lacking is a unified algorithm to characterize CDs at multiple scales. Many of the existing algorithms, specialized at finding CDs at particular genomic scales, require that the Hi-C data be formatted at specific resolutions that match the scale of the target domains [16,17,24]. Although more recent methods, such as TADtree [32] and Armatus [33], considered hierarchical domain structures of chromosomes, most existing methods are motivated from the viewpoint of local pattern recognition analyses [16,17,24,34], and do not take into account the underlying physical nature of chromosomes that chromosomes are a long polymer organized in 3 dimensions [6,[35][36][37][38][39][40].
Here, we interpret Hi-C data as the pairwise contact probability of a locally equilibrated polymer network whose inter-loci distances are restrained with varying strengths of interactions. Just like Hi-C map changes along cell cycle but is stably defined over a certain time interval of cell cycle phase, we assume that the strengths of inter-loci interactions are stably maintained over the time interval. Based on this model of the locally equilibrated model of polymer network, we derive the cross-correlation matrix from Hi-C and use it as the sole input for the algorithm of identifying CD at varying genomic scales (Multi-CD). The algorithm includes a tuning parameter which enables us to control the average domain size. We demonstrate the utility of Multi-CD by applying it to Hi-C data from various cell lines as well as to that of a particular cell line over multiple genomic scales. We also confirm that the CD structures identified at multiple genomic scales are consistent with the previous knowledge on chromatin organization, as well as with information from bio-markers. This study will show that amid the rapidly expanding volume of Hi-C data [10][11][12], Multi-CD holds good promise to more quantitative and principled determination of chromatin organization.

Overview of Multi-CD
The primary goal of this study is to extract information of CDs from Hi-C data at varying genomic scale of interest. First, we translate the Hi-C data into a cross-correlation matrix of polymer network, by noting that chromosomes are in essence a polymer network with pairwise interactions between the loci [35,41,42]. The three dimensional structure of chromosome is deemed in nonequilibrium steady state with "activity" where the energy source of ATP is in-ternally injected to the system and is dissipated as heat. As long as the system remains in local mechanical equilibrium over an extended time scale, one can assume a stable structural ensemble manifested in Hi-C data. Using this assumption of local mechanical equilibrium, we model the dynamic fluctuation of polymer network configuration, representing chromosome structure, around its local equilibrium ensemble by using a sum of harmonic potentials, writing down the distance distribution between two loci i and j into gaussian: with γ ij = (σ ii + σ jj − 2σ ij )/2, where σ ij (= δr i · δr j ) is the positional covariance, determined by the topology of polymer network [42]. Indeed, distance distributions measured using fluorescence measurement between chromatin loci justifies this hypothesis (see Fig. S1). Importantly, our interpretation of chromosome conformation as a locally equilibrated, quasi-stable polymer network enables a one-to-one mapping of the contact probability p ij = rc 0 dxP (x; γ ij ) to the positional covariance σ ij , and hence to the cross-correlation matrix, (C) ij = σ ij / √ σ ii σ jj (see Methods). The cross-correlation matrix C normalizes the wide numerical range of the original Hi-C counts into the range between −1 and 1.
Clustering a correlation matrix into a finite number of correlated groups is a general problem discussed in diverse disciplines. Here, we adapted a formalism known as the "group model," developed for identifying the correlated groups of companies from empirical data of stock market price fluctuations [43][44][45]. Without ambiguity, the formalism can be applied to the clustering of correlated genomic loci in a chromosome. For a given correlation matrix C, the group model finds the optimal solution of clustered loci groups (domains) that best explains the pattern manifested in C. The domain solution for N loci can be written as a vector s = {s 1 , s 2 , . . . , s N }, where s i denotes the domain index for locus i. Technically, this procedure involves finding a vector s that maximizes the posterior distribution p(s|C) for a given correlation data C; the optimal CD solution is found as s * = argmax s p(s|C). Maximizing the posterior distribution in the form of p(s|C) ∝ e −H(s|C)/T is equivalent to minimizing the cost function (or the effective Hamiltonian) H(s|C). We consider the cost function of the form H(s|C) = E(s|C) + λK(s), where E(s|C) quantifies the goodness of clustering, and K(s) with λ(≥ 0) promotes simpler CD solutions by penalizing the effective number of clusters (see Methods). This gives rise to a "tunable" group model that allows us to flexibly control the average size of domain solutions by changing the parameter λ. In light of the grandcanonical ensemble in statistical mechanics, T is the effective temperature of the system, and λ amounts to the chemical potential.
Our "tunable group model," Multi-CD, applied to Hi-C data, is used to identify the four major CD families, namely, sub-TADs, TADs, meta-TADs, and compartments.

Discovery of chromatin domains at multiple scales
We applied Multi-CD to 50-kb resolution Hi-C of chromosome 10 from five different cell lines: GM12878, HUVEC, NHEK, K562, KBM7 ( Fig. 2a-b). Given a Hi-C matrix, we first obtained the crosscorrelation matrix C (Fig. 2c), and used Multi-CD to identify a set of CDs for each fixed value of λ (Fig. 2d). This resulted in a family of CD solutions at varying λ, with coarser CDs at larger values of λ. We note that the family of CD solutions for a given cell line are divided into two regimes. In the case of GM12878 (Fig. 2e), CD solutions can be partitioned into two groups below and above λ ≈ 40. Solution families for other four cell lines are also similarly  (c) The domain solution for compartments obtained using the original Hi-C data in (b) are re-ordered with the cluster (compartment) index. The two largest compartments (k = 1, 2), corresponding to B (k = 1) and A (k = 2) compartments, are depicted in the lower triangle. Clearly separated B-and A-compartments emerge from the correlation matrix CO/E (upper triangle) when the rows and columns of CO/E are also re-ordered in accordance with the domain solution (lower triangle divided (Fig. S2).
From the Multi-CD analysis for all five cell lines of chromosome 10, we find the following features which are also shared by chr 4, 11, and 19 ( Fig. S4) (i) In each of the cells, the average domain size n increased monotonically with λ ( Fig. 2f).
(ii) There is a crossover point at λ = λ cr where the distribution of domain sizes suddenly changes. The variability of domain size, quantified in terms of the index of dispersion, D = σ 2 n / n , is below 1 for small λ (< λ cr ), which means that the domain size is regular, but it exhibits transition at λ cr ≈ 30 − 40 ( n cr ≈ 1.6 Mb) for GM12878, HUVEC, and NHEK; and at λ cr ≈ 60 − 70 ( n cr ≈ 2.2 Mb) for K562 and KBM7 (Fig. 2g).
(iii) The goodness of CD solution quantified by the normalized mutual information (nMI, see Methods for its definition) against Hi-C data is maximized at λ * = 30 in all the cell types, except for K562 (λ * = 50) (Fig. 2h).
(iv) The extent of domain conservation is quantified in terms of the average cell-to-cell similarity over all the cell-type pairs, where the similarity is evaluated using the Pearson correlation (see Methods). We found strong cell-to-cell domain conservation in the range of 0 < λ ≤ 30, which corresponds to CD sizes n 1.5 Mb (Fig. 2i). The maximal extent of domain conservation across the cell lines is found at λ = 10 ( Fig. 2i), at which the average domain size is n ≈ 0.9 Mb, which is close to the typical size of a TAD suggested in the previous studies [24,25]. This finding is consistent with the widely accepted notion that TADs are the most well-conserved, common organizational and functional unit of chromosomes, across different cell types [21]. Thus, we identify the CDs found at λ = 10 as the TADs for human chromosome 10 (see Fig. 2j; also see Fig. S3 for domain solutions at λ = 0 and λ = 40). Fig. 2h shows that there is a special value of λ * ≈ 30 at which the CD solution best captures the pattern of Hi-C data. The family of CD solutions are also divided into two regimes at λ ≈ λ * .

Two families of chromatin domains
TAD-based chromatin organization at λ ≤ λ * . What do the CDs at λ * = 30 represent? First, the CDs at this scale have an average size of n * ≈ 1.5 Mb (Fig. 2h), which is slightly greater than the size of TADs ( n TAD ≈ 0.9 Mb); Second, for GM12878 the CD solutions showing high similarity at 10 < λ < 40 ≈ λ * can be grouped together ( Fig. 2e. See also the similarity matrices calculated for HUVEC and NHEK in Fig. S2). Based on these observations, we surmise that CDs at λ * ≈ 30 are associated with a higher-order structure of TADs, a "meta-TAD", which results from an aggregate consisting of multiple TADs in genomic neighborhood [6]. In contrast to the previous analysis which extended the range of meta-TAD to entire chromosome via hierarchical clustering analysis [6], the meta-TAD implicated from Multi-CD is confined in a finite range, so that it is well discerned from compartments and at the same time is more correlated with TADs (Fig. 2e). Notably, the pattern of CDs identified at λ < λ * is localized (see Fig. 2d, λ = 0, 10, 30). Our algorithm identifies the diagonal blocks of Hi-C data as the subsets of a hierarchically crumpled structure of chromatin chain [37, 46]. Compartment-like chromatin organization at λ > λ * . The super-Mb sized domains are generally defined as the compartment in the chromosome organization [21]. In this scale, a direct application of Multi-CD to the cross-correlation matrix C (as in Fig. 2) is dominated by the strong local correlation from the loci pairs in genomic neighborhood. A simple and effective solution to capture the compartment-like structures is to exclude a narrow band along the diagonal of the Hi-C matrix ( Fig. 3a; also see Methods). Then we can apply Multi-CD to identify two large compartments with alternating patterns (Fig. 3b). The results from Multi-CD success- fully capture the non-local correlations, as clearly seen with a reordering of indices (Fig. 3c). It is natural to associate a domain (k = 1) showing a greater number of contacts ( Fig. S5) with the compartment B, which is usually more compact than compartment A; and k = 2 with compartment A. Further validation of the two compartments, by comparisons to epigenetic markers, will be presented below.
Conventionally, in order to identify chromosome compartments, the existing methods based on principal component analysis (PCA) use the Pearson correlation matrix of a low-resolution Hi-C data (C O/E ) [16] (see Methods), whose heatmap is typically featured with checkerboard pattern (upper triangular part of Fig. 3b). For the case of compartment, instead of making direct comparison with Hi-C data, we used the C O/E matrix to evaluate the goodness of compartment-like CD solutions, again calculating the normalized mutual information (nMI). We note that Multi-CD outperforms GaussianHMM [17], a widely accepted benchmark, in capturing the large-scale structures in Hi-C (Fig. 3d).

Hierarchical organization of chromatin domains
We examined the extent of hierarchical relationship between the four classes of CD solutions obtained at varying λ. From the diagram in which sub-TADs, TADs, meta-TADs and compartments are overlaid on top of each other (Fig. 4a), it is visually clear that sub-TADs or TADs almost never fail to be included inside the boundary of meta-TAD, whereas there are mismatches between the domain boundaries of meta-TADs and compartments. We evaluated the extent of overlap or domains-within-domains type of hierarchy between two domain solutions by means of the hierarchy scores (h) which quantifies the extent of inclusion of smaller domains into larger domains (see Methods).
Based on the hierarchy scores, calculated over the CD solutions from Hi-C data of GM12878 (Fig. 4b), we found the basic principles for chromatin organization: (i) The hierarchy scores between the pairs of TAD-related domains (sub-TADs, TADs, and meta-TADs) are all > 0.96, which is appreciably greater than that of any pair of TAD-related domains with compartments. (ii) The hierarchical links of TADs and meta-TADs with compartments are relatively weak. This implies that TADs or meta-TADs are not necessarily the components of compartments, which is also consistent with the recent reports that TADs and compartments are organized by different mechanisms [47,48]. (iii) Although the hierarchy score between sub-TADs and compartments (h = 0.85) is not so large as those among the pairs of TAD-based domains, it is still greater than the hierarchy scores between TADs and compartments (h = 0.77) or between meta-TADs and compartments (h = 0.69). Thus, sub-TAD can be considered a good candidate for a common building block of the chromatin architecture.

Validation of domain solutions from Multi-CD
The CD solutions from Multi-CD are in good agreement with the previously proposed CDs, obtained from several different methods. Specifically, CDs correspond to the sub-TADs [17] in the priorfree solution at λ = 0, to the TADs [24, 49] at λ ≈ 10, and to the compartments [17] at λ ≈ 90 (see Fig. S6). When assessed in terms of nMI of acquired CD solutions against the input Hi-C data, Multi-CD outperforms other conventional methods (Arrow-Head, DomainCaller, GaussianHMM) in identifying three distinct CD families (Fig. 5a).
In order to further validate the biological relevance of the CD solutions from Multi-CD, we compared them with several biomarkers that are known to be correlated with the spatial organization of the genome [50].
First, we calculated how much our domain boundaries obtained at λ = 10 are correlated with the CTCF signals which are known to capture TAD boundaries [24, 25] (Fig. 5b). Compared to the correlation (or extent of overlap quantified by χ(d). See Eq.21 and Fig. 5b) of CD solutions for λ = 0 with CTCF signals, the overlap at the domain boundary (d ≈ 0) is stronger for solutions at λ = 10 and 20, which are in the parameter range where Multi-CD identifies TADs. We also observe that Multi-CD identifies TAD boundaries that are more sharply correlated with the CTCF binding sites than those identified by two popular methods, ArrowHead [17] and DomainCaller [24] (Fig. 5b). Specifically, when fitted to  The replication activities in the two compartments are anti-correlated. In the early phase of cell-cycle (G1, S1, S2) the replication of A-compartment is more active than B-compartment, but an opposite trend is observed in the later phase (S3, S4, G2). exponential function, the correlation lengths are 34 kb (λ = 0), 143 kb (λ = 10), and 234 kb (λ = 20); whereas the correlation lengths obtained from ArrowHead and DomainCaller are 900 kb (Fig. 5b).

Multi-CD
Next, we compared our compartment-like domains with the replication timing profile (GM12878 Repli-Seq data) [7,51]. The large-scale domains from Multi-CD (at λ = 90) are in good agreement with the patterns of replication timing anticipated for the A/B compartments, which exhibits anti-correlated activation/repression along the replication cycle ( Fig. 5c-d). Specifically, the replication signal in the Multi-CD-identified compartment A (blue shade in Fig. 5c) is active in the early phases (G1, S1, S2), whereas it is repressed (or deactivated) in the late phases (S3, S4, G2). An entirely opposite trend is observed for B-compartment (red shade in Fig. 5c): the replication activity in B compartment is repressed in the early phases (G1, S1, S2), but is activated in the late phases (S3, S4, G2). The Pearson correlation between the replication signals ( Fig. 5d) quantitatively confirms the clear contrast between the replication timing of A/B compartments. Correlation between A/B compartments determined by Multi-CD and compartmentalized patterns of the histone modifications are also shown in Fig. S7, which validates the domain solutions of compartments identified by Multi-CD.

Chromatin organization and its link to gene expression
To demonstrate that chromatin organization is closely linked to gene expression, we overlaid the RNA-seq profiles on the 10 Mb range of the TAD solutions identified for the chr10 of five cell lines (GM12878, HUVEC, NHEK, K562, KBM7) ( Fig. 2j). At around 26.8 Mb position of this chromosome, we found APBB1IP gene, that is transcriptionally active in GM12878 and KBM7 but not in HUVEC, NHEK and K562 (Fig. 6). We also consulted the Gene- Hancer database [52] to identify the regulatory elements for this gene (enhancers and promoters) in the interval between 26.65 and 27.15 Mb. Remarkably, our Multi-CD solutions show that the interval associated with the regulatory elements is fully enclosed in a single TAD in GM12878 and KBM7, whereas it is split into two different domains in the other three cell lines (Fig. 6). This suggests that for the gene expression it is critical to have all regulatory elements within the same TAD, which is consistent with the understanding that TADs are the functional boundaries for genetic interactions [5,6,18,20].

DISCUSSION
What fundamentally differentiates Multi-CD from other approaches rests on the algorithm by which the pattern of CD is identified. In the conventional methods, local features of Hi-C data, such as CD boundaries or loops enriched with higher contact frequencies, are key for CD-identification, and it is usually required that the Hi-C data is prepared in a specific range of resolution that matches the scale of domain to be identified. In contrast, Multi-CD solves the problem of global pattern clustering as its basic algorithm for CD discovery. Therefore, Multi-CD can find CDs across a wide range of scales without resorting to a coarse-grained version of Hi-C data or to a particular bin size at each scale of interest. Moreover, as Multi-CD is based on a physical model of polymer network, the method can offer a physically relevant interpretation of Hi-C data.
Multi-CD uses a tuning parameter λ, which is tantamount to the "chemical potential" in statistical thermodynamics, to set the average domain size, giving rise to λ-dependent CD solution for a given Hi-C data. nMI comparing CD solutions with the 50-kb resolution Hi-C data is maximized at λ ≈ 30, which corresponds to ∼ 1.5 Mb in domain size (length) (Fig. 2f). Notably, 1.5 Mb, the average size of CD that we can best read off from the 50-kb resolution Hi-C data [17] used in this study, is also similar to the domain size detected by a recently proposed TAD detection algorithm called deDoC [53]. In essence, the concept of "graph structural entropy" used in deDoC is also based on global pattern recognition. The authors of deDoC, who developed deDoC as a TAD detection algorithm, have concluded that their ∼ 2-Mb-sized domain solution from their analyses on 40-kb data of Dixon et al. [24] was the best solution for TAD, based on their finding that deDoC identified domain solution displayed the lowest structural entropy in comparison with all the five other TAD detection algorithms they tested. Interestingly, we also found that the best domain solution from varying λ, assessed in terms of nMI with Hi-C heatmap, was when λ ≈ 30, which corresponds to the genomic length of 1.5 Mb; however, we do not conclude CD solution at λ = 30 represents the solution for canonical TAD. Instead, we surmise the domain solution at λ = 30 is for meta-TAD, an aggregate of TADs in genomic neighborhood. As indicated by the domain solutions from Multi-CD at varying λ, the extent of domain conservation across different cell types are maximized at λ = 10 ( n = 0.8 Mb). To be consistent with the general notion that TADs are the functional unit of chromosome conserved across different cell types and species [24], CD solution obtained at λ = 10 is better interpreted as the solution for TAD.
We showed that the characteristics of CD solutions shared by the TAD-like domains do not precisely hold together in compartmentlike domains. This finding is consistent with the recent insightful studies which report that compartments and TADs do not necessarily have a hierarchical relationship because they are formed by different mechanisms of motor-driven active loop extrusion and microphase separation [48, 54-56]. Notably, even when clear mismatches are present between the meta-TAD and compartment, the sub-TADs are, in most of the cases, a part of the compartment (Fig. 4). This finding points to sub-TADs as the fundamental building blocks of the higher domain organization. In fact, the existence of sub-TADs is robust even when a higher resolution Hi-C data is analyzed. From a clustering analysis on 5-kb resolution HiC data, the boundaries of ∼ 300-kb-sized sub-TAD are clear and consistent with those obtained from 50-kb resolution Hi-C (see Fig. S8).
Although there are methods that report hierarchical CDs [32, 33], Multi-CD makes significant advances both algorithmically and conceptually. Multi-CD formulates the problem in terms of a statespace model of the high-dimensional domain solution vector, s, instead of finding a set of intervals, so that it can detect non-local domains with better flexibility. Multi-CD also avoids the high falsenegative rate that is typical of the previous method (e.g., TADtree [32]) that focuses on the nested domain structure (Fig. S9). Further, employing an appropriate prior to explore the solution space effectively, Multi-CD can avoid the problem encountered in Armatus [33] which skips detection of domains in some part of Hi-C data while its single scale parameter is varied (compare Fig. S9e and Fig. S9f).
In order to glean genome function from Hi-C data that vary with genomic state [10][11][12][13], a computationally efficient and accurate method to identify CD structures is of vital importance. In summary, we developed Multi-CD, a novel and versatile method for CD-identification. The method identifies multi-scale structures of chromatin domains by solving the global optimization problem. We find that the chromatin domains identified from Multi-CD are in excellent match with biological data such as CTCF binding sites and replication timing signal, supplementing the existing methods. Quantitative analyses of CD structures identified by this unified algorithm across multiple genomic scales and various cell types not only offer general physical insight into how chromatin is organized in the nucleus but also will be of practical use to decipher broad spectrum of Hi-C data obtained under various conditions.

ONLINE CONTENT
All methods and additional information are available in Methods and Supplementary Figures.

Data acquisition
Hi-C data. We applied Multi-CD on the 50-kb-resolution Hi-C data of chr10 from five different cell types (GM12878, HUVEC, NHEK, K562, and KBM7). The data were obtained through GEO data repository (GSE63525-cell type-primary) [17].
Biological markers. The domain solutions from Multi-CD were compared with known biological markers. We obtained these data mostly from the ENCODE project [60]. Specifically, we used the enrichment data of the transcriptional repressor CTCF measured in a Chip-Seq assay from http://hgdownload.cse. ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUwTfbs/ wgEncodeUwTfbsGm12878CtcfStdPkRep1.narrowPeak.gz. We binned the CTCF assay at 50-kb resolution (to match the Hi-C format). If there are multiple signal enrichments in a single bin, we took the average value. Because each CTCF signal has a finite width, there are occasional cases where a signal ranges across two bins; in those cases we evenly divided the signal strength into the two bins. The Repli-seq signals in the six phases G1, S1, S2, S3, S4, and G2 were obtained from http://hgdownload.cse.ucsc. edu/goldenPath/hg19/encodeDCC/wgEncodeUwRepliSeq/, and were averaged over 50-kb windows along the genome to construct the replication timing profiles. The 11 histone mark signals were obtained from http://ftp.ebi.ac.uk/pub/ databases/ensembl/encode/integration_data_jan2011/ byDataType/signal/jan2011/bigwig/, and undergone the same preprocess as Repli-seq. The RNA-seq data for the four cell lines GM12878, HUVEC, NHEK and K562 were also obtained from http://hgdownload.cse.ucsc.edu/ goldenpath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/. RNA-seq for the cell line KBM7 were separately obtained from https://opendata.cemm.at/barlowlab/2015_Kornienko_ et_al/hg19/AK_KBM7_2_WT_SN.F.bw.

Pre-processing of Hi-C data
Normalization and contact probability. We performed Knight-Ruiz (KR) normalization [61] on Hi-C data, using the normalization vectors provided by Ref.
[17] along with the Hi-C matrices. The resulting matrix elements are in similar orders of magnitude as in the original Hi-C, unlike in a standard application of KR normalization where they are normalized to row-sums of 1. In order to use the KR-normalized Hi-C map, M, as the contact probability matrix P, defined element-wise as (P) ij = p ij , we first divided M by the mean value of the greatest matrix elements typically concentrated near the diagonal band of M-matrix, i.e., µ = 1 , and next multiplied a constant value µ c . In other words, we rescaled M ij into p ij = (µ c /µ)M ij . We set µ c = 0.9 and regarded the elements of P greater than 1 as outliers and set them to 1, which effectively filters the unusually high contact signals from the actual data. For the contact probability for a pair of loci, we used p ij , a rescaled and high-intensity-signal-filtered version of M ij .
Correlation matrix from Hi-C. A chromosome can be regarded as a polymer chain containing N monomers, each of which (i-th monomer or locus) corresponds to the i-th genomic segment and its spatial position is written as r i . Employing the idea of the random loop model (RLM) [42], which has been proposed for modeling chromosome conformation, we interpret that chromosome conformation is described with an ideal polymer network crosslinked at multiple sites, which is deemed a chromosome conformation in local mechanical equilibrium. In RLM, the position vector of the chromosome r = (r 1 , r 2 , · · · , r N ) obeys gaussian distribution with zero mean r = 0 and covariance matrix Σ, the probability of relative distance P (r ij ) between i and j-th monomers in 3D is given as Eq. 1, or where γ ij = 1/2(σ ii + σ jj − 2σ ij ) and σ ij ≡ (Σ) ij . Provided that the contact between two monomers is formed when their distance r ij is within a certain cutoff distance r c , the contact probability (p ij ) can be calculated as with erf(x) = 2 √ π x 0 dte −t 2 . Note that p ij is a monotonically increasing function of γ ij (≥ 0). Therefore, given covariance matrix Σ, we can explicitly calculate the contact probability p ij through Eq.(2). In inverse problem, the covariance matrix Σ can be inferred from contact probability matrix P. However, this inverse problem requires additional assumption about the variance of each monomer σ ii from the definition of γ ij = 1/2(σ ii + σ jj − 2σ ij ). We assume that all variances have identical value (σ ii = σ jj = σ c ), which generates the following normalized covariance matrix (i.e. correlation matrix, C) The parameter σ c sets the overall intensity of C. Here, we set this variable as the median of 1/2γ ij to maintain the balance of C ij . The primary goal of this study is to extract information of CDs from Hi-C correlation matrix (C). In fact, a very similar problem has been posed for stock market price fluctuations [43,44]. Adapting the formalism in References [43, 44], we assumed that x i , which stands for the "genomic state" (or "transcriptional state") of i-th locus, obeys the following stochastic equation in terms of the standardized variable ξ i , i.e., ξ i = (x i − x i )/σ xi where s i denotes the index of CD to which the i-th locus is clustered, and the parameter g si (−1/2 ≤ g si ≤ ∞) defines the strength of intra-CD correlation; η si and i are the independent and identically-distributed (i.i.d) random variables with zero mean and unit variance, η si , i ∼ N (0, 1). From Eq.4, the cross-correlation between the loci i and j is written as Therefore, in light of Eq.5, the first term of Eq. 4 on the right hand side represents intra-CD variation of the s i -th CD where intradomain correlation increases with g si ; the second term of Eq. 4 corresponds to a noise that randomizes the intra-domain correlation dictated by the first term. By matching Eq. 3 with Eq. 5 one can use the cross-correlation matrix C from σ ij as an input for Multi-CD.

Removal of the diagonal band for identifying compartments.
The Hi-C matrix shows that the interaction strength is highly concentrated near the diagonal elements, which makes it difficult to identify the compartment characterized with the long-range interaction pattern. To circumvent this issue, the previous methods have either intentionally reduced the resolution of Hi-C data (usually to 1 Mb) [16] or used only inter-chromosomal interactions [17]. In this study, as a similar motivation we ignore the diagonal band of C. Specifically, we set all elements in C to zero if its genomic distance (|i − j|) is smaller than l c . We scanned l c and found that the CD solutions most consistent with the compartment are obtained for l * c = 40 ( 2 Mb in 50-kb resolution). This value is almost identical to the crossover value of domain size n * which sets the boundary between TAD-like and compartment-like domains.

The observed/expected (O/E) matrix and its Pearson correlation matrix.
The O/E matrix was used to account for the genomic distance-dependent contact number due to random polymer interactions in chromosome [16]. Each pair (i, j) in O/E matrix is calculated by taking the count number M ij (observed number) and dividing it by average contacts within the same genomic distance d = |i − j| (expected number). Since the expected number could be noisy, one smooths it by increasing the window size (see refs [16,17] for further details). In this study, we used the expected number obtained from [17]. The Pearson correlation matrix of the O/E (C O/E ) represents the overall contact pattern through pairwise correlation coefficients between loci.

Identifying chromatin domains
Our goal is to find the chromatin domain (CD) solution s = {s 1 , s 2 , · · · , s N }, as well as a set of parameters for the intra-CD correlation G = {g 1 , g 2 , · · · , g Kmax }, that best represent the pattern in the Hi-C data. Each s is a state in the CD solution space. For each genomic locus i ∈ {1, 2, · · · , N }, the element of the CD solution s i = k indicates that the i-th locus belongs to the domain k (= 1, 2, · · · , K max ). It is always ensured that the elements of each s spans a set of consecutive integers from 1 to K max , where K max is the number of distinct domains in the solution. For example, a state s = {1, 1, 1, 2, 2, 3} describes a structure where the 6 loci are clustered into 3 domains as {{s 1 , s 2 , s 3 }, {s 4 , s 5 }, {s 6 }} with the corresponding strength of three intra-domain correlation G = {g 1 , g 2 , g 3 }.
Likelihood: goodness of clustering. To formulate the clustering problem, we first consider the likelihood of data, at a given CD solution s with strength parameters G: [45] p(data|s, G) where · · · η, denotes an average over the gaussian noises for η si and i . The data dependence of the likelihood is written simply in terms of the correlation matrix C (Eq. 6). With standard calculations involving Gaussian integrals, the corresponding loglikelihood can be written as: (1 + g k ) n k − g k c k 1 + g k n k −n k log(1 + g k ) + log(1 + g k n k )] , where n k = N i=1 δ si,k is the size of domain k, and c k = N i,j=1 C ij δ si,k δ sj ,k is the sum of all intra-domain correlation elements. Conveniently, the likelihood p(C|s, G) is maximized at g k = (c k − n k )/(n 2 k − c k ) for a given {s, C}. We define the cost function E(s|C) as the negative log-likelihood at this likelihoodmaximizing G: such that max G p(C|s, G) = exp(−E(s|C)). The cost function evaluates the (negative) goodness of clustering for a given CD solution s.
Prior: preference to simpler solutions. Because of the structural hierarchy inherent to chromosome and the ensemble characteristic of the Hi-C measurement, it is still an issue to define CDs at a certain length scale of interest. In order to construct a unified formalism that can control the overall domain size in a CD solution, we introduce a prior of the form p(s) = exp(−λK(s)), where K(s) increases with the complexity of the solution s. Specifically, we define K(s) as such that it measures the effective number of CDs from the domain size distribution. For example, in the limit where all CDs are of the same size, K(s) = K max . This formulation is also equivalent to adding a regularizer to the cost function E, such that the total cost function H becomes: where the parameter λ controls the relative weight of K(s) with respect to E(s|C).

Posterior distribution.
Then the posterior distribution is given by the following Bayes rule: We remark that this formulation is analogous to the grand canonical ensemble in statistical mechanics. The total cost function H(s|C) can be thought of as the effective Hamiltonian of the system; E(s|C) amounts to the energy of the system, and K(s) the effective number of particles (in this case CDs) with a chemical potential of λ. It is natural to introduce an effective temperature T , so that the probability of having a state s is given as A higher temperature T makes the distribution flatter; in other words, it tempers the distribution. This is useful for an efficient posterior inference through simulated annealing.
Metropolis-Hastings sampling. We use Markov chain Monte Carlo (MCMC) sampling to find the maximum value of the posterior distribution, or equivalently the minimum value of the total cost function H. The standard Metropolis-Hastings (MH) routine was used, such that at each trial move from the current state s to the next state s , the move is accepted with a probability min(1, α), where α(s, s ) = exp [−(H(s |C) − H(s|C))/T ]. We used "single mutation" proposals, as described below.
To ensure that a steady state is reached, we continue the sampling until each chain collects t tot ≥ 5τ * samples in the CD solution space. Here τ * is the "relaxation time" defined as the number of steps it takes until the autocorrelation function R(τ ), drops significantly: τ * = argmin τ |R(τ ) − 1/e|. The autocorrelation function is calculated from the value of the total cost function H, as where s t is the t-th sample in the chain, and µ and σ are the mean and standard deviation of {H(s t |C)}, respectively. The running time average in Eq.13, · · · t = 1 ttot−τ ttot−τ 0 dt(· · · ) , is taken over all the pairs of samples with the time gap of τ . We stop the sampling as soon as the total sampling time is five times longer than the relaxation time (t tot > 5 · τ * (t tot )), so that the state s t (or H(s t |C)) is practically in steady states.
Single mutation in the CD solution space. In the space of CD solutions, we define a single mutation as a move from a state s to another state s , such that the two CD solutions (s, s ) differ only by one genomic locus. In other words, it is a move with distance d(s, s ) = 1, where the distance between the two CD states is defined as the number of loci with differing domain memberships, d(s, s ) = Simulated annealing. We use the simulated annealing to explore the high-dimensional CD solution space which is also likely characterized with multiple local minima. We start from a finite temperature T = T 0 > 0 and slowly decrease it to T → 0, letting the system relax toward the global minimum of the configurational landscape of H (Fig. S10). The simulated annealing process is described below.
Initialization. An initial configuration s (0) is generated in two random steps. First, the total number of CDs, K max , is drawn randomly from the set of integers {1, · · · , N }. Then, each genomic locus i ∈ {1, · · · , N } is allocated randomly into one of the CDs, k ∈ {1, 2, · · · , K max }. The initial temperature T 0 is determined such that the acceptance probability for the "worst" move around s (0) is 0.5. Specifically, it is given as T 0 = argmin T |exp(−∆H 1 /T ) − 0.5|, where ∆H 1 = max s∈S1(s0) {H(s|C) − H(s (0) |C)} is the energy difference to the least favorable move among the set of all single mutations.
Iteration. At each step r, the temperature is fixed at T r . We sample the target distribution p r (s|C) ∝ exp(−H(s|C)/T r ), using the Metropolis-Hastings sampler described above. For the next step r + 1, the temperature is lowered by a constant cooling factor c cool ∈ (0, 1), such that the next temperature is T r+1 = c cool · T r . We used c cool = 0.95 in this study.
Final solution. The annealing process is repeated until the temperature reaches a small (but finite) value T f . We used T f = 0.03. Then we quench the system to the closest local minimum by performing a "zero-temperature" sampling, in which a proposed move is always accepted if it lowers the cost function. This process is simply to remove any remaining fluctuation from the finite temperature, which is usually very small at this point. Because there is still no guarantee that the global minimum is found, we tried a batch of at least 10 different initial configurations and chose the final state s * that gives the minimal H(s * |C).

Analysis on subsets of Hi-C data.
Our method allows the user to break down the Hi-C data into subsets, as long as the CDs are localized within the subsets (Fig. S11). This saves the algorithm from the large memory requirement of dealing with the entire intrachromosomal Hi-C (for example, Hi-C of chromosome 10 has 2711 bins in 50-kb resolution). For the analysis of the 50-kb resolution Hi-C data in this paper, we used subsets of the data that correspond to 40-Mb ranges along the genome, or 800 bins.
The overall schematic involving the algorithm of Multi-CD. A schematic diagram of Multi-CD is provided in Fig. S12.

Analysis and evaluation of domain solutions
Index of dispersion. The index of dispersion for the domain size distribution is defined as D = σ 2 n / n , where n is the average size of a domain, and σ 2 n is the variance. It measures how clustered or dispersed a given distribution is, compared to a normal distribution. If D < 1, it indicate that the domain sizes are all very similar. If D > 1, on the other hand, it means that the domain size distribution is over-dispersed and heterogeneous, which may be the case when there are a few large domains and many small ones.

Similarity of two distinct CD solutions using Pearson correlation.
In order to measure the extent of similarity between two CD solutions s and s , we consider the Pearson correlation at the level of loci pairs. We start by constructing the binary matrices B and B that represent the two CD solutions, where (B) ij = B ij = δ sisj , such that the matrix element are all 1's within the same CD and 0 otherwise. Considering the lower triangular elements of B, we can calculate the meanb = where the element-wise covariance is cov(

Normalized mutual information.
We use the mutual information to evaluate how well a CD solution s captures the visible patterns in the pairwise correlation data. We consider the binary grouping matrix (B) ij = B ij = δ si,sj for the CD solution of interest, and compare it to the input data matrix (A) ij = A ij . In this study, either log 10 M or C O/E was used in the place of A. Assuming that we can treat the matrix elements a ∈ A and b ∈ B as two random variables, we can construct the joint distribution p(a, b) by binning and counting, as: where the Kronecker delta for the continuous variable a should be understood in a discretized fashion. That is, δ Aij ,a = 1 if A ij ∈ [a, a + ∆a) and 0 otherwise, where we used ∆a = [max {A ij } − min {A ij }] /100 to discretize the values into 100 bins. It is straightforward to obtain the marginal distributions as p(a) = b p(a, b) and p(b) = a p(a, b). We can use the standard definitions to calculate the marginal entropies, H(a) = − a p(a) log(p(a)) and (p(b)), as well as the mutual information Note that the sum runs over the discretized values of a that are the endpoints of the bins used for counting, and over {0, 1} for the binary variable b. Finally, the normalized mutual information (nMI) is defined as Hierarchy score. We define the hierarchy score to quantify the hierarchical relationship between two CD solutions, s and s . We assume that we know the average domain sizes for the two solutions: we will say that s is a set of smaller CDs and s a set of larger CDs. Then the perfect hierarchy condition can be defined as in the following statement: if two loci i, j belong to the same CD in the smaller-scale s (s i = s j ), then they also belong to the same CD in the larger-scale s (s i = s j ). Here we extend this idea to evaluate the extent of overlap of s to s . To begin, for each domain k ∈ s in the smaller-scale CD solution, the best overlap of this domain on s is quantified by the single-domain hierarchy score h 1 : We get a maximum score h 1 (k → s ) = 1 if there is a k ∈ s such that the smaller domain k ∈ s is completely included in the larger domain k ∈ s . On the other hand, the worst score is obtained when the domains in the two CD solutions s and s are completely uncorrelated, in which case h 1 only reflects the overlap "by chance". The chance level is written ash 1 (s ) = n s /N , where n s = i δ s i ,k k ∈s is the average domain size of the larger solution s . This naturally defines a normalized scorê Consequently, the hierarchy score h(s → s ) of the entire CD solution s on s is calculated as a weighted sum of h 1 as where n k = i δ si,k is the size of domain k in s. In this study, we are interested in h(s λ1 → s λ2 ) for CD solutions evaluated at two distinct values λ 1 < λ 2 , knowing that the average domain sizes are n s λ 1 < n s λ 2 .

Correlation between CTCF signal and domain boundaries.
The validity of domain boundaries, determined from various CDidentification methods including Multi-CD, is assessed in terms of their correlation with the CTCF signal. We write φ CTCF (i) to indicate the CTCF signal at locus i. We also define a binary variable ψ DB (i) that indicates the boundaries of a CD solution s, such that ψ DB (i) is 1 if the i-th locus is precisely in the domain boundary, ψ DB (i) = 0, otherwise (ψ DB (i) = (1 − δ si,si−1 δ si,si+1 )). We evaluated a distance-dependent, normalized overlap function χ(d), defined as where δφ CTCF = φ CTCF − φ CTCF and the approximation sign is used because of N N −d ≈ 1 for N d. If the domain boundaries determined from a CD-identification method is correlated with TADcapturing CTCF signal, a sharply peaked and large amplitude overlap function (χ(d)) is expected at d = 0.

Correlation between epigenetic marks and compartments.
We calculate the correlation of our compartment solutions with the epigenetic marks. Given a compartment solution s with two large domains A and B, we consider two binary vectors q (A) and q (B) , such that q . In other words, we set q (A) i = +1 if the i-th locus belongs to compartment A, and q (A) = −1 if not. Let h be a set of epigenetic marks measured across the genome, such that h i indicates the value at the i-th locus. Then we calculate the Pearson correlations, c A = (h · q (A) )/|h||q (A) | and c B = (h · q (B) )/|h||q (B) |, between the epigenetic mark h and the two compartments, respectively.

Code availability
The Matlab software package and associated documentation are available online (https://github.com/multi-cd). Fig S1. Distance distributions of loci pairs are described by Gaussian.          Figure S5: Intra-domain contact profiles for the two compartment solutions. We plot the genomic distance-dependent contact number for the two large domains k = 1 and k = 2, from the solution at λ = 90 for chromosome 10 in GM12878. At short genomic distance, the domain solution of k = 1 is characterized with a greater number of contacts than k = 2, which suggests that k = 1 domain is locally more compact. We therefore associate the first domain k = 1 to the B-compartment, and the second domain k = 2 to the A-compartment.