## Abstract

The probabilities that two loci in chromosomes, separated by a certain genomic length, are in contact can be inferred using the chromosome conformation capture method and related Hi-C experiments. How to go from such contact maps to an ensemble of three-dimensional structures, which is an important step in understanding the way nature has solved the packaging of chromosomes with hundreds of millions of base pairs in tight spaces, is an open problem. We created a theory based on polymer physics and the maximum entropy principle, leading to the HIPPS (Hi-C-Polymer-Physics-Structures) method, which allows us to go from contact maps to 3D structures. It is difficult to calculate the mean distance between loci *i* and *j* from the contact probability because the contact exists only in an unknown fraction of the cell population. Despite this massive heterogeneity, we first prove that there is a theoretical lower bound connecting the contact probability, ⟨*p*_{ij}⟩, and the mean distance, ⟨*r̄*_{ij}⟩, via a power-law relation. We show, using simulations of a precisely solvable model, that the overall organization is accurately captured by constructing the distance map from the contact map even when the cell population is heterogeneous, thus justifying the use of the lower bound. Building on these results and using the mean distance matrix, whose elements are ⟨*r̄*_{ij}⟩, we use the maximum entropy principle to reconstruct the joint distribution of spatial positions of the loci, which creates an ensemble of structures for the 23 chromosomes from lymphoblastoid cells. The HIPPS method shows that the conformations of a given chromosome are highly heterogeneous even in a single cell type. Nevertheless, the differences in the heterogeneity of the same chromosome in different cell types (normal as well as cancerous cells) can be quantitatively discerned using our theory.

## INTRODUCTION

The question of how chromosomes are packed in the tight space of the cell nucleus has taken center stage in genome biology, largely due to the spectacular advances in experimental techniques. In particular, the routine generation of a large number of probabilistic contact maps for many species using the remarkable Hi-C technique [1–6] has provided us with a glimpse of the genome organization. This in turn has opened several avenues of research with the hope of understanding the many features associated with chromosomes, such as how they are packaged in the nucleus and how the chromosome organization affects the dynamics and, eventually, function. A high contact count between two loci means that they interact with each other more frequently than loci with a low contact count. Thus, the Hi-C data describe the chromosome structures in statistical terms, expressed approximately as a matrix whose elements indicate the probability that two loci separated by a specific genomic distance are in contact. The Hi-C data provide only a two-dimensional (2D) representation of the multidimensional organization of the chromosomes. How to go beyond the genomic contact information to the 3D distances between the loci, and eventually to the spatial location of each locus, is an important unsolved problem. Imaging techniques, such as Fluorescence *In Situ* Hybridization (FISH) and its variations, are the most direct way to measure the spatial distances and coordinates of the genomic loci [7]. Currently, however, these techniques are limited in scope because they provide information on only a small number of loci in a given experimental setup. Is it possible to harness the power of the two methods to construct, at least approximately, the 3D structures of chromosomes? Here, we answer this question in the affirmative by building on the precise results for an exactly solvable Generalized Rouse Model for chromosomes [8, 9], and by using certain unusual polymer physics principles governing genome organization.

Several data-driven approaches have been developed to go from Hi-C data to the 3D structure of genomes [10–17] (see the summary in [18] for additional related studies). Although these methods are insightful, they do not predict the physical dimensions of the organized chromosomes, nor have they been validated, especially when the structures are highly heterogeneous. These are difficult problems to solve when structures are inferred from Hi-C data using solely data-driven approaches, without the physical considerations reflected in the polymeric features of the chromosomes. One problem is the difficulty in reconciling Hi-C (contact probability) and FISH (spatial distance) data [19–22]. For example, in interpreting the Hi-C contact map, one makes the intuitively plausible assumption that loci with high contact probability must also be spatially close. However, it has been demonstrated using Hi-C and FISH data that high contact frequency does not always imply proximity in space [19–22]. Because the cell population is heterogeneous, even though the cells are synchronized in the Hi-C experiments, a given contact is not present with unit probability in all the cells. Elsewhere [9], we showed that the heterogeneity in the genome organization is the reason for the absence of a one-to-one relation between the contact probability and the spatial distance between a pair of loci. The inconsistency between Hi-C and FISH experiments makes it difficult to extract the ensemble of 3D structures of chromosomes using Hi-C data alone, without taking into account the physics driving the condensed state of genomes. Even if one were to construct polymer models that produce results consistent with the contact map inferred from Hi-C, certain features of the chromosome structures would be discordant with the FISH data, reflecting the heterogeneous genome organization [23].

Despite the difficulties alluded to above, we have created a theory, based on polymer physics concepts and the principle of maximum entropy, to determine the 3D organization solely from the Hi-C data. The resulting physics-based, data-driven method, which translates Hi-C data through polymer physics to the average 3D coordinates of each locus, is referred to as HIPPS (Hi-C-Polymer-Physics-Structures). The purposes of developing and applying the HIPPS method are twofold. (1) We first establish that there is a theoretical lower bound connecting the contact probability and the mean 3D distance in the presence of heterogeneity in the genome organization. We prove this concept by using the Generalized Rouse Model for Chromosomes (GRMC), for which accurate simulations can be performed. (2) However, the mean spatial distances, ⟨*r*_{ij}⟩, between the loci do not give the needed 3D structures. In addition, it is important to determine the variability in chromosome structures because massive conformational heterogeneity has been noted both in experiments [23, 30] and computations [9]. In order to solve this non-trivial problem, we use the principle of maximum entropy to obtain the ensemble of individual chromosome structures. The HIPPS method, which allows us to go from the Hi-C contact map to the three-dimensional coordinates, *x*_{i} (*i* = 1, 2, 3, …, *N*_{c}), where *N*_{c} is the length of the chromosome, may be summarized as follows. First, we construct the mean distances ⟨*r*_{ij}⟩ between all *i* and *j* using a power-law relation connecting ⟨*p*_{ij}⟩, the probability that the loci *i* and *j* are in contact measured in Hi-C experiments, and ⟨*r*_{ij}⟩. The justification for the power-law relation is established using the GRMC and polymer physics concepts. Then, using the maximum entropy distribution *P*({*x*_{i}}) with ⟨*r*_{ij}⟩ as constraints, we obtain an ensemble of chromosome 3D structures (the 3D coordinates for all the loci).

We tested the HIPPS procedure rigorously using the GRMC, which accounts for the massive heterogeneity noted in recent experiments [23]. The application of our theory to decipher the 3D structure of chromosomes from any species is limited only by the experimental resolution of the Hi-C technique. Comparisons with experimental data for sizes and volumes of chromosomes derived from the calculated 3D structures are made to validate the theory. Our method predicts that the structures of a given chromosome within a single cell and in different cell types are heterogeneous. Remarkably, the HIPPS method can detect the differences in the extent of heterogeneity of a specific chromosome among both normal and cancer cells.

## RESULTS

### Inferring the mean distance matrix from the contact probability matrix (P) for a homogeneous cell population

The elements, *r̄*_{ij}, of the mean distance matrix give the *mean* spatial distance between loci *i* and *j*. Note that *r*_{ij} is the distance value for one realization of the genome conformation in a homogeneous population of cells. In this case, a given contact is present with non-zero probability in the entire cell population. The elements *p*_{ij} of the **P** matrix are the contact probabilities between loci *i* and *j*. We first establish a power-law relation between *r̄*_{ij} and *p*_{ij} in a precisely solvable model. For the Generalized Rouse Model for Chromosomes (GRMC), described in Appendix A, the relation between *r̄*_{ij} and *p*_{ij} is given by,

$$
p_{ij} = \mathrm{erf}\left(\frac{2 r_c}{\sqrt{\pi}\,\bar{r}_{ij}}\right) - \frac{4}{\pi}\,\frac{r_c}{\bar{r}_{ij}}\,\exp\left(-\frac{4 r_c^{2}}{\pi\,\bar{r}_{ij}^{2}}\right), \tag{1}
$$

where erf(·) is the error function, and *r*_{c} is the threshold distance for determining if a contact is established. This equation provides a way to calculate the mean distance matrix directly from the contact matrix (**P**) by inverting Eq.1 for each pair of loci. Note that **P** is inferred only approximately from Hi-C experiments. In addition, there are uncertainties in determining both *r*_{c}, due to systematic errors, and *p*_{ij}, due to inadequate sampling, which restrict the use of Eq.1 in practice. In light of these considerations, we address the following questions: (a) How accurately can one solve the inverse problem of going from **P** to the mean distance matrix? (b) Does the inferred distance matrix faithfully reproduce the topology of the spatial organization of chromosomes? We use the GRMC to answer these questions.
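The inversion of Eq.1 has no closed form, so it has to be carried out numerically. The following is a minimal sketch, not the implementation used in this work. It assumes the Gaussian-chain form of the contact probability, $p = \mathrm{erf}(x) - (2/\sqrt{\pi})\,x\,e^{-x^{2}}$ with $x = 2r_c/(\sqrt{\pi}\,\bar{r})$, and inverts it by bisection; the function names and the bracketing interval are illustrative choices.

```python
import math

def contact_probability(r_mean, r_c):
    """Contact probability for a pair whose distance follows a Gaussian-chain
    (Maxwell) distribution with mean r_mean; r_c is the contact threshold."""
    x = 2.0 * r_c / (math.sqrt(math.pi) * r_mean)
    return math.erf(x) - (2.0 / math.sqrt(math.pi)) * x * math.exp(-x * x)

def mean_distance_from_contact(p, r_c, iters=200):
    """Invert contact_probability by bisection on a logarithmic scale.
    The bracket [1e-6 r_c, 1e6 r_c] is an assumption of this sketch."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    lo, hi = 1e-6 * r_c, 1e6 * r_c
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if contact_probability(mid, r_c) > p:
            lo = mid  # distance too small: the probability is still too high
        else:
            hi = mid
    return math.sqrt(lo * hi)

# Round trip: distance -> probability -> distance
p_example = contact_probability(3.0, 1.0)
r_example = mean_distance_from_contact(p_example, 1.0)
```

For distances much larger than the threshold, this form reduces to $p \approx \frac{32}{3\pi^{2}}(r_c/\bar{r})^{3}$, that is, a power law with exponent 3.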

To answer these two questions, we first constructed the distance map by solving Eq.1 for *r̄*_{ij} for every pair with contact probability *p*_{ij}. The **P** matrix is calculated using simulations of the GRMC, as described in Appendix B. For such a large polymer, some contacts are almost never formed even in long simulations, resulting in *p*_{ij} ≈ 0 for some loci. This would erroneously suggest that *r̄*_{ij} → ∞ is a solution to Eq.1. Indeed, this situation arises often in the Hi-C experimental contact maps, where *p*_{ij} ≈ 0 for many *i* and *j*. To overcome the practical problem of dealing with *p*_{ij} ≈ 0 for several pairs, we apply the block average (a coarse-graining procedure) to **P** (described in Appendix C), which decreases the size of **P**. The procedure overcomes the problem of having to deal with vanishingly small values of *p*_{ij} while simultaneously preserving the information needed to solve the inverse problem using Eq.1.
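The block average can be illustrated with a short sketch. The details of Appendix C are not reproduced here, so the choice below (a plain mean over non-overlapping *b* × *b* blocks, dropping any incomplete trailing block) is an assumption made for illustration, and `block_average` is a hypothetical helper name.

```python
import numpy as np

def block_average(P, b):
    """Coarse-grain a square matrix by averaging over non-overlapping b x b blocks.
    Trailing rows and columns that do not fill a complete block are dropped."""
    P = np.asarray(P, dtype=float)
    n = (P.shape[0] // b) * b
    # Axes after the reshape: (block row, row in block, block column, column in block)
    return P[:n, :n].reshape(n // b, b, n // b, b).mean(axis=(1, 3))

P_example = np.arange(16, dtype=float).reshape(4, 4)
P_coarse = block_average(P_example, 2)
```

Averaging replaces isolated zero entries by the mean of their neighborhood, which is why vanishing contact probabilities become less frequent after coarse-graining.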

The simulated and constructed distance maps are shown in the lower and upper triangle, respectively (Fig.1a). We surmise from Fig.1a that the two distance maps are in excellent agreement with each other. There is a degree of uncertainty for the loci pairs with large mean spatial distance (elements far away from the diagonal in Fig.1a,b) due to the unavoidable noise in the contact probability matrix **P**. The Spearman correlation coefficient between the simulated and theoretically constructed maps is 0.97, which shows that the distance matrix can be accurately constructed. However, a single correlation coefficient is not sufficient to capture the topological structure embedded in the distance map. To further assess the global similarity between the distance maps from theory and simulations, we used the Ward Linkage Matrix (WLM) [24], which we previously used to determine the spatial organization in interphase chromosomes [25]. Fig.1c shows that the constructed distance map indeed reproduces the hierarchical structural information accurately. These results together show that the mean distance matrix, in which the elements represent the mean distance between the loci, can be calculated accurately, as long as **P** is determined unambiguously. As is well known, this is not possible in Hi-C experiments, which renders the problem of going from **P** to the mean distance matrix, and eventually to the precise three-dimensional structure, extremely difficult.
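As an illustration of the comparison described above, the sketch below computes a Spearman correlation between two symmetric maps using only their upper-triangle entries. Restricting the comparison to pairs with *i* < *j* and averaging the ranks of tied values are assumptions of this sketch; the helper names are hypothetical.

```python
import numpy as np

def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they occupy."""
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)
    return (last_rank - (counts - 1) / 2.0)[inverse]

def spearman_between_maps(D1, D2):
    """Spearman correlation over the upper-triangle (i < j) entries of two maps."""
    D1 = np.asarray(D1, dtype=float)
    D2 = np.asarray(D2, dtype=float)
    iu = np.triu_indices(D1.shape[0], k=1)
    return np.corrcoef(average_ranks(D1[iu]), average_ranks(D2[iu]))[0, 1]

idx = np.arange(6)
D_toy = np.abs(np.subtract.outer(idx, idx)).astype(float)
rho_same_order = spearman_between_maps(D_toy, D_toy ** 3)
```

Because the coefficient depends only on ranks, any monotonic distortion of the distances leaves it unchanged, which is one reason a single correlation value cannot capture the full topology of a map.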

### A bound for the spatial distance inferred from contact probability

The results in Fig.1 show that for a homogeneous system (specific contacts are present in all realizations of the polymer), the mean distance matrix can be faithfully reconstructed solely from **P**. However, the discrepancies between FISH and Hi-C data for several loci pairs [26] suggest that the cell population is heterogeneous, which means that the contact between loci *i* and *j* is present in only a fraction of the cells. In this case, which one has to contend with in practice [9, 23], the one-to-one mapping between the contact probability and the mean 3D distance (as shown by Eq.1) does not hold, leading to the paradox [19, 20] that high contact probability does not imply small inter-loci spatial distance.

Heterogeneity in genome organization implies that, given the contact probability, one can no longer determine the mean 3D distance uniquely, which implies that for certain loci the results of Hi-C and FISH must be discordant. Recently, we solved the Hi-C-FISH paradox by calculating the extent of cell population heterogeneity using FISH data and concepts in polymer physics. The distribution of subpopulations could be used to reconstruct the Hi-C data. For a mixed population of cells, the mean spatial distance ⟨*r̄*_{ij}⟩ and the contact probability ⟨*p*_{ij}⟩ between two loci *i* and *j* are given by,

$$
\langle \bar{r}_{ij} \rangle = \sum_{m=1}^{S} \eta_{m,ij}\, \bar{r}_{m,ij}, \tag{2}
$$

$$
\langle p_{ij} \rangle = \sum_{m=1}^{S} \eta_{m,ij}\, p_{m,ij}, \tag{3}
$$

where *r̄*_{m,ij} and *p*_{m,ij} are the mean spatial distance and contact probability between *i* and *j* in the *m*^{th} subpopulation, respectively. In the above equations, *S* is the total number of distinct subpopulations, and *η*_{m,ij} is the fraction of the subpopulation *m*, which satisfies the constraint Σ_{m} *η*_{m,ij} = 1. Although there exists a one-to-one relation between *p*_{m,ij} and *r̄*_{m,ij} in each subpopulation *m*, it is not possible to determine ⟨*p*_{ij}⟩ solely from ⟨*r̄*_{ij}⟩ without knowing the values of each *η*_{m,ij}, and *vice versa*.

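More generally, if we assume that there exists a continuous spectrum of subpopulations, ⟨*r̄*_{ij}⟩ and ⟨*p*_{ij}⟩ can be expressed as,

$$
\langle \bar{r}_{ij} \rangle = \int K(\bar{r}_{ij})\, \bar{r}_{ij}\, d\bar{r}_{ij}, \tag{4}
$$

$$
\langle p_{ij} \rangle = \int Q(p_{ij})\, p_{ij}\, dp_{ij}, \tag{5}
$$

where *r̄*_{ij} and *p*_{ij} are the mean spatial distance and the contact probability associated with a single population. *K*(*r̄*_{ij}) and *Q*(*p*_{ij}) are the probability density distributions of *r̄*_{ij} and *p*_{ij} over subpopulations, respectively.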

We have shown [9] that the paradox arises precisely because of the mixing of different subpopulations. The values of *η*_{m,ij}, *K*, or *Q*(*p*_{ij}) in Eqs. 2-5 could, in principle, be extracted from the distribution of spatial distances, which can be measured using imaging techniques. However, this distribution is usually unavailable or the data are sparse, which leads to the question: Despite the lack of knowledge of the composition of cell populations, can we provide an approximate but reasonably accurate relation between ⟨*p*_{ij}⟩ and ⟨*r̄*_{ij}⟩? In other words, rather than answer the question (a) posed in the previous section precisely, as we did for the homogeneous GRMC, we are seeking an approximate solution. The GRMC calculations provide the needed insights to construct the approximate relation to calculate the distance matrix from the contact probability matrix.

### A key inequality

Let us consider a special case where there are only two distinct discrete subpopulations, and the relation between *r̄* and *p*, written as *p* = *f*(*r̄*), is given by Eq. 1. According to Eqs. 2-3, we have ⟨*r̄*⟩ = *ηr̄*_{1} + (1 − *η*)*r̄*_{2} and ⟨*p*⟩ = *ηp*_{1} + (1 − *η*)*p*_{2}. Note that *f*^{−1} exists since *f* is a monotonic function. Fig.2a gives a graphical illustration of the inequality ⟨*r̄*⟩ ≥ *f*^{−1}(⟨*p*⟩). This inequality states that the mean spatial distance of the whole population has a lower bound of *f*^{−1}(⟨*p*⟩), which is the mean spatial distance inferred from the measured contact probability ⟨*p*⟩ as if there is only one homogeneous population. This is a powerful result, which is the theoretical basis for constructing the HIPPS method, allowing us to go from Hi-C data to 3D organizations.

The inequality shows that a theoretical lower bound for ⟨*r̄*_{ij}⟩ exists, given the value of ⟨*p*_{ij}⟩, regardless of the composition of the whole cell population. In fact, such an inequality can be generalized to an arbitrary discrete or continuous distribution of subpopulations. Let us assume that for a homogeneous system, there exists a convex and monotonically decreasing function, *φ*, relating the contact probability *p* and the mean spatial distance *r̄* (we neglect the suffix *ij* for better readability). Note that *φ* takes the form of Eq. 1 for the GRMC. It can be shown that the following inequality holds (Appendix D),

$$
\langle \bar{r} \rangle \ge \phi^{-1}\left(\langle p \rangle\right). \tag{6}
$$

The above equation (Eq.6) shows that the lower bound of the mean spatial distance of a heterogeneous population is given by the mean spatial distance computed from the measured contact probability as if the cell population is homogeneous. The equality holds exactly only when the population of cells is precisely homogeneous. This finding is remarkably useful in predicting the approximate spatial organization of chromosomes from the Hi-C contact map, as we demonstrate below. For the GRMC, according to Eq. 6, the mean distance ⟨*r̄*_{ij}⟩ is bounded from below by the value obtained by inverting Eq. 1 at the measured ⟨*p*_{ij}⟩; the inequality illustrated in Fig.2a is the special case in which only two distinct discrete subpopulations are present. Thus, the precisely solvable model suggests that the approximate power law relating ⟨*p*_{ij}⟩ and ⟨*r̄*_{ij}⟩ could be used as a starting point in constructing the spatial distance matrices using only the Hi-C contact map for chromosomes.
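The content of the lower bound can be checked numerically for a discrete mixture. The sketch below is only an illustration: it replaces the exact GRMC relation by a simple convex, decreasing stand-in, $\bar{r} = p^{-1/3}$ in units of the contact threshold, and the subpopulation fractions and contact probabilities are arbitrary example values.

```python
def distance_from_probability(p, alpha=3.0, scale=1.0):
    """Illustrative convex, decreasing map from contact probability to mean
    distance (a power-law stand-in, not the exact GRMC relation)."""
    return scale * p ** (-1.0 / alpha)

def mixture_averages(fractions, probabilities):
    """Population averages of p and of the mean distance for discrete
    subpopulations with the given fractions (which should sum to one)."""
    mean_p = sum(eta * p for eta, p in zip(fractions, probabilities))
    mean_r = sum(eta * distance_from_probability(p)
                 for eta, p in zip(fractions, probabilities))
    return mean_p, mean_r

# Two subpopulations: 30% of cells with p = 0.5, 70% with p = 0.01
mean_p, mean_r = mixture_averages([0.3, 0.7], [0.5, 0.01])
bound = distance_from_probability(mean_p)  # distance inferred as if homogeneous
```

In this example the true population mean distance is about twice the value inferred from the averaged contact probability, and the two coincide only when every subpopulation has the same contact probability.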

### Validation of the lower bound relating ⟨*p*_{ij}⟩ and ⟨*r̄*_{ij}⟩ in a heterogeneous cell population

In order to investigate the effect of heterogeneity (contacts between loci *i* and *j* do not exist in all the cells) on the quality of the mean distance map (DM) constructed from the contact probability matrix ⟨**P**⟩, we simulated a model system with two distinct cell populations. One has all the CTCF-mediated loops present (with fraction *η*), and the other is a polymer chain without any loop constraints (with fraction 1 − *η*). We used the lower bound, obtained by inverting Eq.1 at the measured ⟨*p*_{ij}⟩, to infer ⟨*r̄*_{ij}⟩ from ⟨*p*_{ij}⟩. The results, shown in Fig.2b,c,d, provide a numerical verification of the theoretical lower bound linking contact probability and mean spatial distance. Fig.2b shows the scatter plot of ⟨*r̄*_{ij}⟩ versus ⟨*p*_{ij}⟩ from the simulation, with the theoretical lower bound shown for comparison. All the points lie above the lower bound, which shows that it holds. Using the lower bound, the DMs in Fig.2d are calculated from the simulated ⟨**P**⟩. The comparison between the inferred and the simulated DMs (middle and bottom in Fig.2d) shows that the difference between the constructed and simulated DMs is largest near the loops, resulting in an underestimate of the spatial distances in the proximity of loops. This occurs because the constructed DM is computed from the simulated ⟨**P**⟩, which is sensitive to the heterogeneity of the cell population. The difference matrices show that, although the constructed DM underestimates the spatial distances around the loops, most of the pairwise distances are hardly affected. This exercise for the GRMC justifies the use of the lower bound as a practical guide to construct the DM from ⟨**P**⟩.

To show that the distance map constructed using the lower bound gives a good global description of the chromosome organization, we also calculated the often-used quantity ⟨*R*(*s*)⟩, the mean spatial distance as a function of the genomic distance *s*, as an indicator of average structure (Fig.2c). The calculated ⟨*R*(*s*)⟩ differs only negligibly from the simulation results. Notably, the scaling of ⟨*R*(*s*)⟩ versus *s* is not significantly changed (inset in Fig.2c), strongly suggesting that constructing the distance map using the lower bound gives a good estimate of the average size of the chromosome segment.
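Given a mean distance map, ⟨*R*(*s*)⟩ follows from averaging over all pairs with the same genomic separation, that is, over the *s*-th off-diagonal. A minimal sketch, with a hypothetical function name and a toy map used only for illustration:

```python
import numpy as np

def mean_distance_vs_separation(D):
    """Average a symmetric distance map over each off-diagonal.
    Element s - 1 of the result is the mean distance at genomic separation s."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    return np.array([np.diagonal(D, offset=s).mean() for s in range(1, n)])

# Toy map in which the distance equals the genomic separation
idx = np.arange(5)
D_linear = np.abs(np.subtract.outer(idx, idx)).astype(float)
R_of_s = mean_distance_vs_separation(D_linear)
```

Large separations are averaged over few pairs (a single pair for *s* = *N* − 1), so the tail of ⟨*R*(*s*)⟩ is the noisiest part of the curve.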

To further assess the quality of the constructed distance maps, we calculated the WLMs for the heterogeneous system with *η* = (0.1, 0.3, 0.5, 0.7, 0.9, 1.0) (see Fig.S1). The results are consistent with the visual comparison of the distance maps; the calculated maps for large *η* agree significantly better with the simulations than those for small values of *η*. This is also reflected in the distance correlation [27] between the reconstructed and simulated WLMs (blue curve in Fig.S1b), which increases from ≈0.8 for *η* < 0.7 to ≈1.0 for *η* > 0.7. In contrast, the distance correlation coefficient between the reconstructed and simulated distance maps (red curve in Fig.S1b) stays around 0.95 for all values of *η*, which would not allow us to distinguish between different models.

It is worth noting that even for small values of *η*, the distance correlation coefficient is 0.8, which is a high value. This is consistent with the result shown in Fig.2c that the constructed distance map gives a rough but reasonable global estimate of the structural organization even though it may deviate from the exact result in details. Taken together, these results show that the reconstructed distance map provides a fairly accurate description of the conformations in spite of the presence of heterogeneity in the conformations.

The distance correlation gives a global description of the similarity between the simulated and inferred DMs. To further investigate the degree of similarity at different length scales, we computed the Adjusted Mutual Information (AMI) scores between the clustering results obtained from the simulated and constructed WLMs by varying the number of clusters (Fig.S1). A small number of clusters corresponds to the large-scale hierarchical organization, whereas a higher number of clusters reveals the structure on small length scales. For *η* ≤ 0.7, the AMI scores are low (Fig.S1) for a small number of clusters and increase up to around 0.8 as the number of clusters increases. For *η* > 0.7, the AMI scores remain around 0.9 throughout the range of the number of clusters.

### Inferring 3D organization of interphase chromosomes from experimental Hi-C contact map

To apply the insights from the GRMC results to obtain the 3D organization of chromosomes, we conjecture that a power-law relation, first suggested using imaging experiments [7] and subsequently established by us [25], relating the contact probability between two genomic loci, ⟨*p*_{ij}⟩, and the mean distance, ⟨*r̄*_{ij}⟩, holds generally for chromatin. Thus, we write,

$$
\langle \bar{r}_{ij} \rangle = \Lambda\, \langle p_{ij} \rangle^{-1/\alpha}, \tag{7}
$$

where *α* and Λ are unknown coefficients. Again, note that ⟨·⟩ and the overbar represent the average over subpopulations and the average over individual conformations in a single subpopulation, respectively. In a homogeneous system, the equalities ⟨*r̄*⟩ = *r̄* and ⟨*p*⟩ = *p* hold. For the GRMC, Λ = *r*_{c} and *α* = 3.0. From the ensemble Hi-C experiments, ⟨*p*_{ij}⟩ can be inferred. For a self-avoiding polymer, *α* ≈ 3.71 for two interior loci that are in contact (see Appendix E). Based on experiments [7] and simulations using the Chromosome Copolymer Model [25], a tentative suggestion could be made for a numerical value of *α* ≈ 4.0. Given the paucity of data needed to determine *α*, we follow the experimental lead [7] and set it to 4.0, which is an unusually large value not associated with any known polymer model. We show below that the power-law relation given in Eq.7 provides a way to infer the approximate 3D organization of chromosomes from the experimental Hi-C contact map.
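The power-law conversion from a contact map to a mean distance map is simple to state in code. The sketch below assumes the form $\langle \bar{r}_{ij} \rangle = \Lambda \langle p_{ij} \rangle^{-1/\alpha}$ with *α* = 4; the function name, the default `scale=1.0` (arbitrary length units), and the handling of zero contacts (left as NaN instead of an infinite distance) are illustrative choices rather than part of the method.

```python
import numpy as np

def distance_map_from_contacts(P, alpha=4.0, scale=1.0):
    """Convert a contact probability map into a mean distance map using a
    power law, distance = scale * p ** (-1 / alpha).
    Pairs with p == 0 are returned as NaN; the diagonal is set to zero."""
    P = np.asarray(P, dtype=float)
    D = np.full(P.shape, np.nan)
    observed = P > 0
    D[observed] = scale * P[observed] ** (-1.0 / alpha)
    np.fill_diagonal(D, 0.0)
    return D

P_toy = np.array([[1.0, 1.0 / 16.0, 0.0],
                  [1.0 / 16.0, 1.0, 1.0 / 81.0],
                  [0.0, 1.0 / 81.0, 1.0]])
D_toy_map = distance_map_from_contacts(P_toy)
```

With *α* = 4, a sixteen-fold drop in contact probability corresponds to only a two-fold increase in distance, so the inferred distances are comparatively insensitive to noise in the contact map.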

### Experimental validation of Eq.7 and choice of *α*

To further show that Eq.7 with *α* = 4 is accurate, we calculated the square of the radius of gyration, *R*_{g}^{2}, of all the 23 chromosomes from the inferred mean distances. The dashed line in Fig.3a is a fit of *R*_{g} as a function of chromosome size, which yields *R*_{g} ∼ *N*_{c}^{0.27}, where *N*_{c} is the length of the chromosome. For a collapsed polymer, *R*_{g} ∼ *N*_{c}^{1/3}, and for an ideal polymer, *R*_{g} ∼ *N*_{c}^{1/2}. To ascertain if the unusual value of 0.27 is reasonable, we computed the volume of each chromosome using (4/3)*πR*_{g}^{3} and compared the results with experimental data [28]. The scaling of chromosome volumes versus *N*_{c} of the predicted 3D chromosome structures using HIPPS is also in excellent agreement with the experimental data (Fig.3b). The exponent 0.27 ≲ 1/3 suggests that the chromosomes adopt highly compact, space-filling structures, which is also vividly illustrated in Fig.4.

Since the value of Λ (Eq.7) is unknown, we estimate it by minimizing the error between the calculated chromosome volumes and experimental measurements. We find that Λ = 117 nm, which is the approximate size of a locus of 100 kbps (the resolution of the Hi-C map used in the analysis). It is noteworthy that the genome density computed using this value of Λ, 100 × 10^{3}/((4/3)*π*Λ^{3}) bps nm^{−3} ≈ 0.015 bps nm^{−3}, is consistent with the typical average genome density of the human cell nucleus, 0.012 bps nm^{−3} [29]. The value of Λ does not change the scaling but only the absolute size of the chromosomes.
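The density estimate quoted above is a one-line calculation: the number of base pairs in one locus divided by the volume of a sphere of radius Λ. A short check, with a hypothetical helper name:

```python
import math

def genome_density(bps_per_locus, radius_nm):
    """Base pairs per cubic nanometer for one locus occupying a sphere of the given radius."""
    return bps_per_locus / ((4.0 / 3.0) * math.pi * radius_nm ** 3)

# One 100 kbps locus in a sphere of radius 117 nm
density = genome_density(100e3, 117.0)
```

Because the density scales as Λ^{−3}, a change of a few percent in Λ shifts the estimate by roughly three times that percentage.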

### Generating ensembles of 3D structures using the maximum entropy principle

The great variability in the genome organization has been noted before [9, 23, 30]. To investigate the structural heterogeneity of the chromosomes, we ask the question: how can we generate an ensemble of structures consistent with the mean pair-wise spatial distances between the loci? More precisely, what is the joint distribution of the positions of the loci, *P*({*x*_{i}}), subject to the constraint that the mean pair-wise distance between loci *i* and *j* is ⟨*r̄*_{ij}⟩? Generally, there exists an infinite number of distributions *P*({*x*_{i}}) satisfying the constraint of mean pair-wise spatial distances. By adopting the principle of maximum entropy, we seek to find the *P*^{MaxEnt}({*x*_{i}}) with the maximum entropy among all possible *P*({*x*_{i}}). The maximum entropy principle has been previously used in the context of genome organization [31, 32] for different purposes. We note parenthetically that preserving the constraints of mean pairwise distances is equivalent to preserving the constraints of mean squared pairwise distances. In practice, we found that using the constraints of squared distances, ⟨*r̄*_{ij}^{2}⟩, yields better convergence. Recall that the *P*^{MaxEnt}({*x*_{i}}) with respect to the constraints of the mean squared pairwise spatial distances is,

$$
P^{\mathrm{MaxEnt}}(\{x_i\}) = \frac{1}{Z}\exp\Big(-\sum_{i<j} k_{ij}\, \lVert x_i - x_j \rVert^{2}\Big), \tag{8}
$$

where *k*_{ij} are the Lagrange multipliers that are chosen so that the average values ⟨∥*x*_{i} − *x*_{j}∥^{2}⟩ match ⟨*r̄*_{ij}^{2}⟩, which could be either inferred from the Hi-C contact map or directly measured in FISH experiments; *Z* is the normalization factor. The merit of the maximum entropy distribution (Eq.8) is that it is both data-driven and physically meaningful, since the parameters *k*_{ij} are inferred from experimental data and the term *k*_{ij}∥*x*_{i} − *x*_{j}∥^{2} can be viewed as the pair-wise potential energy between the loci. Indeed, Eq. 8 is exactly the same as the generalized Rouse model [8], where *k*_{ij} are the spring constants between genomic loci.

The procedure used to generate an ensemble of 3D chromosome structures is the following. First, we compute the mean spatial distance matrix from the contact map using Eq. 7 with *α* set to 4.0. The value of the scaling factor, Λ = 117 nm, is calculated using additional experimental constraints (see the previous section). Recall that Λ only sets the length scale but has no effect on the conformational ensemble of the chromosome. Using the iterative scaling algorithm, we obtain the values of *k*_{ij} (Appendix G). Once the values of *k*_{ij} are obtained, *P*^{MaxEnt} can be directly sampled as a multivariate normal distribution, thus generating an ensemble of chromosome structures. Fig.5a shows the comparison between the inferred DM and the DM for Chromosome 1 obtained using the maximum entropy principle. It is visually clear that the two DMs are in excellent agreement (see Fig.S2-S7 for the other chromosomes). We should emphasize that the maximum entropy method described here can, in principle, achieve an exact match with the inferred DM. The small discrepancies are due to (1) the quality of convergence and (2) the intrinsic error in the Hi-C map and the inferred DM derived from it.
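The sampling step can be sketched as follows, under stated assumptions. The distribution is taken to be $P \propto \exp(-\sum_{i<j} k_{ij}\lVert x_i - x_j\rVert^{2})$, the multipliers *k*_{ij} are treated as given (the iterative scaling of Appendix G is not reproduced), the network of nonzero *k*_{ij} is assumed to connect all loci, and the center of mass is held fixed by discarding the zero mode. All function names are hypothetical.

```python
import numpy as np

def laplacian_from_springs(K):
    """Connectivity (graph Laplacian) matrix built from pairwise multipliers k_ij."""
    K = np.triu(np.asarray(K, dtype=float), 1)
    K = K + K.T  # symmetric with zero diagonal
    return np.diag(K.sum(axis=1)) - K

def expected_squared_distances(K):
    """Analytic <|x_i - x_j|^2> in three dimensions for the Gaussian ensemble."""
    L = laplacian_from_springs(K)
    C = 0.5 * np.linalg.pinv(L)  # covariance of one Cartesian component
    d = np.diag(C)
    return 3.0 * (d[:, None] + d[None, :] - 2.0 * C)

def sample_structures(K, n_samples, seed=None):
    """Draw structures of shape (n_samples, N, 3) by sampling each normal mode
    of the Laplacian independently; the translational zero mode is dropped."""
    rng = np.random.default_rng(seed)
    lam, V = np.linalg.eigh(laplacian_from_springs(K))
    keep = lam > 1e-10 * lam.max()
    std = 1.0 / np.sqrt(2.0 * lam[keep])  # exp(-lam a^2) gives variance 1 / (2 lam)
    amplitudes = rng.standard_normal((n_samples, 3, int(keep.sum()))) * std
    return np.einsum("sdm,im->sid", amplitudes, V[:, keep])

# Two loci joined by a single spring with k = 2, for which <r^2> = 3 / (2 k) = 0.75
K_pair = np.array([[0.0, 2.0], [2.0, 0.0]])
structures = sample_structures(K_pair, 20000, seed=0)
sampled_r2 = np.mean(np.sum((structures[:, 0] - structures[:, 1]) ** 2, axis=1))
```

Since the mean squared distances are available in closed form through the pseudo-inverse, a fitting procedure for *k*_{ij} can compare them with the targets without drawing any samples; sampling is needed only to produce the ensemble itself.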

### Characteristics of 3D chromosome structures

The 3D conformations are specified by *x*_{i}, *i* = 1, 2, 3, …, *N*_{c}, where *N*_{c} is the number of loci at a given resolution (the centromeres are discarded due to the lack of information about them in the Hi-C contact map). The values of *N*_{c} for all the 23 chromosomes are given in Table S1. We generated an ensemble of 1,000 structures for each of the 23 human interphase chromosomes using the procedure described above. Fig.4a shows typical conformations with the average value of the radius of gyration for each chromosome. Visually, it is clear that there is considerable shape heterogeneity among the chromosomes. To quantify the shape of chromosomes, we obtain the distribution of the relative shape anisotropy *κ*^{2} (Appendix H). Fig.4b shows the violin plots of *κ*^{2} for all the 23 chromosomes, ordered by the value of ⟨*κ*^{2}⟩. The chromosomes exhibit considerable variations in *κ*^{2}. Chromosome 13 is the most spherical, and chromosomes 19, 9, and 21 have the most elongated shapes.
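For a single conformation, the relative shape anisotropy can be computed from the eigenvalues of the gyration tensor. The sketch below uses the standard polymer-physics definition, $\kappa^{2} = 1 - 3(\lambda_1\lambda_2 + \lambda_2\lambda_3 + \lambda_3\lambda_1)/(\lambda_1 + \lambda_2 + \lambda_3)^{2}$, which is assumed here to coincide with the one in Appendix H; the function name is hypothetical.

```python
import numpy as np

def relative_shape_anisotropy(X):
    """kappa^2 of one conformation X with shape (N, 3): 0 for a spherically
    symmetric arrangement of loci, 1 for loci lying on a straight line."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)            # remove the center of mass
    T = Xc.T @ Xc / len(X)             # 3 x 3 gyration tensor
    lam = np.linalg.eigvalsh(T)        # its trace, lam.sum(), equals R_g^2
    pair_sum = lam[0] * lam[1] + lam[1] * lam[2] + lam[2] * lam[0]
    return 1.0 - 3.0 * pair_sum / lam.sum() ** 2

rod = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
octahedron = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                       [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)
kappa2_rod = relative_shape_anisotropy(rod)
kappa2_octahedron = relative_shape_anisotropy(octahedron)
```

Applying the function to every structure in an ensemble gives the distribution of *κ*^{2} for a chromosome.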

We can draw important conclusions from the calculated 3D structural ensemble with some biological implications that we mention briefly.

#### Compartments and microphase separation

The probabilistic representations for Chromosome 1 are shown in Fig.5b,c,d, where we align all the conformations and superimpose them. First, we note that such a probabilistic representation demonstrates clear hierarchical folding of chromosomes, where the loci with small genomic distance (similar color) are also close in space (Fig.5b, see Fig.11 for the other chromosomes). Long-range mixing between the loci is avoided, supporting the notion of the crumpled globule [33–35]. Furthermore, the reconstructed structure of the chromosomes shows clear microphase separation (different colors are segregated). The segregated regions are referred to as A and B compartments (Fig.5c, see Fig.12 for the other chromosomes), representing two epigenetic states (euchromatin and heterochromatin), which we obtained using spectral clustering [25]. Each compartment predominantly contains loci belonging to either euchromatin or heterochromatin. Contacts within each compartment are enriched between either euchromatin or heterochromatin epigenetic states. In the Hi-C data, the compartments appear as a prominent checkerboard pattern in the contact maps. Fig.5c shows that the two compartments are spatially separated and organized in a polarized fashion, which is fully consistent with multiplexed FISH and single-cell Hi-C data [30].

#### Mapping ATAC-seq to 3D structures

Advances in sequencing technology have been used to infer epigenetic information in chromatin without the benefit of integrating with structures. In particular, the assay for transposase-accessible chromatin using sequencing (ATAC-seq) provides chromatin accessibility, which in turn provides insights into gene regulation and other functions. With the structures determined by HIPPS in hand, we mapped the ATAC-seq data (see Appendix I for details on the processing of ATAC-seq data) onto the ensemble of conformations for Chromosome 1 from GM12878 cells (Fig.5d). The results also show a microphase separation pattern between regions with high and low ATAC signal (Fig.5d). It appears that accessibilities in Chromosome 1 for various functions (such as nucleosome positioning and transcription factor binding regions) may be spatially segregated. Such segregation between loci with high ATAC signal and loci with low ATAC signal is also visually clear in other chromosomes (Fig.13). Remarkably, these results, derived from the HIPPS method, follow directly from the Hi-C data *without* creating a polymer model with parameters that are fit to the experimental data.

### Structural Heterogeneity

To investigate the heterogeneity in chromosome conformations, we examined the variations among the 1,000 conformations generated for chromosome 5. First, as a global structural characteristic, we computed the radius of gyration, *R*_{g}, of each individual structure. Fig.6a shows the histogram, *P*(*R*_{g}), together with examples of compact, intermediate, and expanded conformations. We then asked how much the organization of the A/B compartments varies. Specifically, we want to know whether the A/B compartments are spatially separated in a single cell. To answer this question, we first define a quantitative measure of the degree of mixing between A/B compartments, *Q*_{k},
where *k* is the number of nearest neighbors of locus *i*. In Eq. 9, *n*_{A}(*i*; *k*) and *n*_{B}(*i*; *k*) are the numbers of neighboring loci belonging to the A compartment and the B compartment, respectively, among the *k* nearest neighbors of locus *i* (*n*_{A}(*i*; *k*) + *n*_{B}(*i*; *k*) = *k*). With *N*_{c} = (*N*_{A} + *N*_{B}), the fraction of loci in the A compartment is *N*_{A}/*N*_{c} and *N*_{B}/*N*_{c} is the fraction in the B compartment, where *N*_{A} and *N*_{B} are the numbers of A and B loci, respectively. The *k* nearest neighbors of *i* are computed as follows. First, the distances from *i* to all loci are calculated. From these distances, the *k* smallest values are chosen, and this process is repeated for all *i*. Note that *Q*_{k} is length-scale invariant because it depends only on which loci are the nearest neighbors and not on the distances themselves, which allows us to compare structures with different values of *R*_{g}. Note that *Q*_{k} = 2 or 0 for perfect demixing and mixing between A and B compartments, respectively. Fig.6b shows *Q*_{k} and *P*(*Q*_{k}) histograms for different values of *k*. The distribution is clearly skewed toward large values, indicating demixing of the A and B compartments on the population level. At the same time, the distribution shows that there exists a small fraction of single-cell chromosome conformations with *Q*_{k} values close to 0.8, implying that the compartment organization of the chromosome exhibits a degree of heterogeneity.
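
The nearest-neighbor counting step described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the function name, the input layout (a list of coordinates plus a list of "A"/"B" labels), and the choice to exclude locus *i* from its own neighbor list are all assumptions. Only the counts *n*_{A}(*i*; *k*) and *n*_{B}(*i*; *k*) are computed; the normalization that turns them into *Q*_{k} is not reproduced here.

```python
import math

def neighbor_counts(coords, labels, k):
    """For each locus i, count A and B loci among its k nearest neighbors.

    coords: list of (x, y, z) tuples, one per locus (hypothetical input format).
    labels: list of "A"/"B" compartment labels, same length as coords.
    Returns a list of (n_A, n_B) pairs with n_A + n_B == k.
    """
    n = len(coords)
    counts = []
    for i in range(n):
        # Distances from locus i to every other locus (i itself is excluded).
        dist = [(math.dist(coords[i], coords[j]), j) for j in range(n) if j != i]
        # Keep the k smallest distances.
        nearest = sorted(dist)[:k]
        n_a = sum(1 for _, j in nearest if labels[j] == "A")
        counts.append((n_a, k - n_a))
    return counts

# Toy example: two well-separated clusters, one per compartment.
coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 0, 0), (11, 0, 0), (10, 1, 0)]
labels = ["A", "A", "A", "B", "B", "B"]
counts = neighbor_counts(coords, labels, k=2)
```

Because only the ranking of distances enters, rescaling all coordinates by a common factor leaves the counts unchanged, which is the length-scale invariance noted in the text.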

### Chromosome organizations in different cell types

Since single chromosome conformations in a single cell exhibit extensive variations, it is natural to wonder how structurally heterogeneous a given chromosome is in different cell types, and whether the HIPPS method can quantify these differences at the single-cell level. We are searching for differences in the heterogeneity of a specific chromosome in different cell types. From a physical viewpoint, it is difficult to answer this question precisely because the structural heterogeneity of a chromosome in a given cell type could overwhelm the analysis. Furthermore, one has to contend with high-dimensional data (each conformation has 3*N* coordinates) in the ensemble of conformations.

In order to delineate the differences in the heterogeneities in the conformations of a specific chromosome in different cell types, we used a machine learning method for large data analysis [37]. To compare two single chromosome conformations, we first normalized the distance matrix such that . By doing so, we eliminate the effect of the overall size of the individual chromosome conformation, thus allowing us to compare them in terms of only their conformations. We generated 1,000 structures for chromosome 21 from each of 7 cell types using Hi-C data [6]. Fig.7a shows the t-SNE (t-Distributed Stochastic Neighbor Embedding) plot [37] for the 7,000 individual chromosome conformations from the 7 different cell types (1,000 conformations for each cell type). It is clear that the structural ensembles of chromosome 21 from different cell types have different degrees of overlap with each other. IMR-90 (fibroblast), HUVEC (umbilical vein endothelium), and GM12878 (lymphoblastoid), which are normal human cells, form compact, distinct clusters with negligible overlap with each other. In Fig.7a the conformations of chromosome 21 in the 2D t-SNE representation are shown as blue (IMR-90), red (HUVEC), and green (GM12878) dots. In sharp contrast, the conformations of the same chromosome in HMEC (breast epithelial cell), K562 (myeloid leukemia cell in bone marrow), NHEK (epidermal keratinocytes, a type of skin cell), and KBM7 (a different leukemia cell) cells display very large variations. They are not as compact, and their phase space structure in terms of the low-dimensional t-SNE coordinates shows overlapping regions (Fig.7a).

To further investigate the characteristics of chromosome organization in different cell types, we computed the values of *F* (*k*), which quantifies the multi-body long-range interactions of the chromosome structure. We define *F* (*k*) as,
where *k* is the number of nearest neighbors, and *m*_{i}(*k*) is the set of loci that are the *k* nearest neighbors of locus *i*; *F*_{0}(*k*) = (1/2)(1 + *k/*2) is the value of *F* (*k*) for a straight chain. From Eq.10, it follows that the presence of long-range interactions increases the value of *F* (*k*). It is worth noting that *F* (*k*) can also be viewed as a measure of how well the linear order along the genome is preserved in the 3D structure. Fig.7b shows the distributions of *F* (*k*) for each cell type. GM12878 cells show the greatest enrichment of long-range multi-body clusters, whereas NHEK and HMEC cells show the least. However, there is extensive overlap between different cell types for *F* (*k*). Remarkably, we find that there are substantial variations in the structural ensembles of chromosome 21, and by implication others as well, not only within a single cell but also among single cells belonging to different tissues. From our perspective, it is most interesting that the HIPPS method, when combined with machine learning techniques, can quantitatively predict the differences.
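
The straight-chain reference value can be checked with a short sketch. It assumes, as an interpretation of the definition rather than a statement of Eq.10, that *F*_{0}(*k*) is the mean genomic separation |*i* − *j*| over the *k* spatially nearest neighbors of an interior locus, with *k* even. The helper name and the coordinate layout are hypothetical.

```python
import math

def mean_genomic_separation(coords, i, k):
    """Mean |i - j| over the k loci j that are spatially nearest to locus i."""
    dist = sorted((math.dist(coords[i], coords[j]), j)
                  for j in range(len(coords)) if j != i)
    return sum(abs(i - j) for _, j in dist[:k]) / k

# Straight chain with unit spacing along the x axis.
chain = [(float(i), 0.0, 0.0) for i in range(101)]
k = 8
observed = mean_genomic_separation(chain, i=50, k=k)
expected = 0.5 * (1 + k / 2)  # F_0(k) as given in the text
```

For a straight chain the nearest neighbors of an interior locus are its *k*/2 genomic neighbors on each side, so the mean separation is (2/*k*)(1 + 2 + ... + *k*/2) = (1/2)(1 + *k*/2). Any spatial neighbor that is far away along the sequence raises this average.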

## DISCUSSION AND CONCLUSION

Using theory, based on polymer physics and the principle of maximum entropy, and precise numerical simulations of a non-trivial model, we have provided an approximate solution to the problem of how to construct an ensemble of the three-dimensional coordinates of each locus from the measured probabilities (*p*_{ij}) that two loci are in contact. The key finding that makes our theory possible is that *p*_{ij} is related to the mean spatial distance between loci *i* and *j* through a power law, which is in accord with experiments [7] as well as accurate polymer models for interphase chromosomes [25]. The inferred mean spatial distances are then used to obtain an ensemble of structures using the maximum entropy principle. Our approach, which is both physically motivated and data-driven, is self-consistently accurate for the precisely solvable GRMC. The physically well-tested theory, leading to the HIPPS method, allowed us to take the Hi-C contact map and create an ensemble of three-dimensional chromosome structures without any underlying model. Using the HIPPS method we constructed the 3D organization of the twenty-three human chromosomes solely from the Hi-C contact maps. We believe that our theory may be combined with sparse data from Hi-C and FISH experiments to produce the 3D structures of chromosomes for any species.

The limitation of many population-based experimental approaches for producing the 3D organization is their inability to extract single-cell information. Due to the apparent heterogeneity in the cell population [9, 20], the Hi-C map, as an ensemble-averaged quantity, does not contain information about the fluctuations in the organization of genomes. The Hi-C map and the mean distance map derived from it only characterize the *averaged* structure. In other words, there may not exist a typical single-cell genome that can be described by the Hi-C map, and hence by the mean distance map derived from it. Using the maximum entropy principle, we are able to generate an ensemble of structures from the mean distance map, consistent with observations from imaging data. It is worth noting that our use of the maximum entropy principle, with pairwise distances as constraints, leads to a joint distribution of loci coordinates without assuming a predefined energy function.

The HIPPS method may also suffer from the same problem because the mean distance map is inferred from the ensemble-averaged Hi-C map. Thus, we suggest that actual single-cell experimental measurements are crucial for deciphering the single-cell genome organization. This can also be reasoned from the following arguments, using our simple mixture model system as an example. Every trajectory can be described by either a chain containing all the loops or a chain that is devoid of loops. Therefore, averaging over an ensemble of cells may not be meaningful from an *in vivo* perspective. Using the maximum entropy principle described in this work, a broad unimodal distribution is obtained instead of the bimodal distribution that characterizes two distinct sub-populations. This problem can be overcome by using distributions of distances, instead of mean values, as constraints under the maximum entropy principle. However, such distributions can only be obtained from single-cell measurements. Nevertheless, the theoretical lower bound that we have derived provides a way forward to obtain 3D organization from the contact map alone, perhaps even from single-cell Hi-C data.

The HIPPS method could be improved in at least two ways. First, the theory relies on Eq.7, which relates the average contact probability between two loci to the mean distance between them. Even though choosing *α* = 4.0 in Eq.7 provides a reasonable description of the sizes of all the chromosomes, it should be treated as a tentative estimate. More precise data, accompanied by an analytically solvable polymer model containing consecutive loops, as is prevalent in the chromosomes, could produce more accurate structures. Second, as the resolution of the Hi-C map improves, the contact matrix will not only grow in size but will also become increasingly sparse because of the intrinsic heterogeneity of the chromosome organization. Thus, methods for dealing with sparse matrices will have to be utilized in the HIPPS method for extracting chromosome structures.

We should emphasize that if the chromosome structures are used in conjunction with an underlying model, with energy functions that produce the patterns in ensemble-averaged Hi-C data, then the HIPPS method could be used to predict single-cell structures, which would shed light on the heterogeneous organization of chromosomes. Ultimately, this might well be the single most important utility of our theory.

## Acknowledgements

We are grateful to the National Science Foundation (CHE 19-00093) and the Collie-Welch Regents Chair (F-0019) for supporting this work.

## APPENDIX A: SIMULATION DETAILS

The GRMC is a variant of a model introduced previously [8] as a caricature of physical gels. Recently, we used the GRMC [9] as the basis to characterize the massive heterogeneity in chromosome organization. The energy function for the GRMC is [9],

For the bonded stretch potential, we use,
where *a* is the equilibrium bond length. The interaction between the loop anchors is modeled using,
where the spring constant may be associated with the CTCF facilitated loops. The labels {*p, q*} represent the indices of the loop anchors, which are taken from the Hi-C data [6].

The energy function for the ideal Rouse chain simulated in this work is,
which is obtained from the energy function for GRMC by eliminating the loop constraints (setting *ω* = 0 in Eq.13).

In order to accelerate conformational sampling, we performed Langevin dynamics simulations at low friction [38]. The total number of monomers, *N*, is 10,000. We simulated each trajectory for 10^{8} time steps and saved a snapshot every 10,000 time steps, which yields 10^{4} snapshots per trajectory. We generated ten independent trajectories, which are sufficient to obtain reliable statistics, as illustrated in Fig.S8.

## APPENDIX B: DATA ANALYSES OF THE SIMULATION DATA

The contact probability between the *m*^{th} and *n*^{th} loci in the simulation is calculated using,
where Θ(·) is the Heaviside step function, *r*_{c} is the threshold distance for determining the contacts, the summation is over the snapshots along each trajectory, *M* is the total number of independent trajectories, and *T* is the number of snapshots in a single trajectory. The mean spatial distance between the *m*^{th} and the *n*^{th} loci in the simulations is calculated using,

The objective is to calculate ⟨*R*_{mn}⟩ from *P*_{mn}, and to determine whether, in so doing, we get reasonably accurate results. Because these quantities can be computed precisely for the GRMC, the [*P*_{mn}, ⟨*R*_{mn}⟩] relationship can be rigorously tested.
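
The two estimators described in this appendix can be sketched as below. This is an illustration under stated assumptions, not the analysis code used here: the nested-list layout of the trajectories, the function name, and the strict inequality used for the Heaviside step are hypothetical choices, and every trajectory is assumed to hold the same number of snapshots so that the normalization equals *M* × *T*.

```python
import math

def contact_and_distance(trajectories, m, n, r_c):
    """Estimate P_mn and <R_mn> from saved snapshots.

    trajectories: list of M trajectories; each is a list of T snapshots;
    each snapshot is a list of (x, y, z) positions (hypothetical layout).
    """
    contacts = 0
    total_dist = 0.0
    n_snap = 0
    for traj in trajectories:
        for snap in traj:
            d = math.dist(snap[m], snap[n])
            # Heaviside step: count a contact when the pair is within r_c.
            contacts += 1 if d < r_c else 0
            total_dist += d
            n_snap += 1
    return contacts / n_snap, total_dist / n_snap

# Toy data: M = 2 trajectories, T = 2 snapshots, 2 loci each.
trajs = [
    [[(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (3, 0, 0)]],
    [[(0, 0, 0), (2, 0, 0)], [(0, 0, 0), (6, 0, 0)]],
]
p_mn, r_mn = contact_and_distance(trajs, m=0, n=1, r_c=2.5)
```

In the toy data the pair distances are 1, 3, 2, and 6, so two of the four snapshots fall within the threshold of 2.5.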

## APPENDIX C: BLOCK AVERAGE

Fig.8 shows the procedure used for the block average when dealing with several vanishing (or very small) contact probabilities, *P*_{mn}. Such a method could be used for (almost) any sparse matrix. Let the original contact matrix (CM) have size *N × N*. By setting a coarse-grained level *n*, the original CM is divided into blocks, each of size *n × n*. The new coarse-grained CM, of size (*N/n*) *×* (*N/n*), is constructed such that each of its elements is the arithmetic average of the elements in the corresponding block. We then demonstrate that this coarse-graining procedure does not alter the structural information embedded in the original CM.
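
A minimal sketch of the block average is given below, assuming the matrix is stored as a list of lists and that *N* is divisible by *n*. The function name is hypothetical, and the handling of a matrix whose size is not a multiple of *n* (here, an error) is an assumption not specified in the text.

```python
def block_average(cm, n):
    """Coarse-grain an N x N contact matrix into (N/n) x (N/n) blocks.

    cm: square matrix as a list of lists; N is assumed divisible by n.
    Each output element is the arithmetic mean of one n x n block.
    """
    size = len(cm)
    if size % n != 0:
        raise ValueError("matrix size must be divisible by block size")
    m = size // n
    coarse = [[0.0] * m for _ in range(m)]
    for a in range(m):
        for b in range(m):
            block = [cm[i][j]
                     for i in range(a * n, (a + 1) * n)
                     for j in range(b * n, (b + 1) * n)]
            coarse[a][b] = sum(block) / len(block)
    return coarse

cm = [
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.5, 0.0],
    [0.0, 0.5, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
coarse = block_average(cm, n=2)
```

Averaging fills in isolated zero entries as long as at least one element of the block is nonzero, which is why the procedure helps with sparse maps.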

## APPENDIX D: DERIVATION OF A LOWER BOUND FOR THE SPATIAL DISTANCE

Let us use an overbar (as in *r̄*) and angular brackets ⟨·⟩ as notations for the average over the genome conformations in a single homogeneous population and the average over the individual subpopulations, respectively. Here, *r̄*_{ij} and *p*_{ij} are the *mean* spatial distance and the contact probability between loci *i* and *j* for a single homogeneous (sub)population, while ⟨*r̄*_{ij}⟩ and ⟨*p*_{ij}⟩ are the *mean* spatial distance and the contact probability between loci *i* and *j* measured for the whole population. It is easy to see that if the population is homogeneous, we have ⟨*r̄*_{ij}⟩ = *r̄*_{ij} and ⟨*p*_{ij}⟩ = *p*_{ij}.

In this appendix, we prove that there exists a theoretical lower bound for the population-averaged mean spatial distance, ⟨*r̄*_{ij}⟩, for a given ⟨*p*_{ij}⟩. We assume that for a homogeneous population, where only one population is present, there exists a convex and monotonically decreasing function *φ* relating the contact probability between two loci and their mean spatial distance, *r̄*_{ij} = *φ*(*p*_{ij}). For better readability, we drop the subscript *ij* from now on. For a heterogeneous population, the contact probability is calculated as,
where *K* is the distribution of for all the subpopulations, and is the distribution of spatial distance for a single subpopulation given its mean value . *r*_{c} is the threshold distance for determining the contact. Note that by definition. is the probability measure of *p* over individual subpopulation. Since *φ* is a convex function, according to Jensen’s inequality, we have,

Replacing *φ*(*p*) by the mean spatial distance *r̄* of the corresponding subpopulation, we have,

Eq. 19 shows that the lower bound for the population-averaged mean spatial distance, ⟨*r̄*⟩, is the mean spatial distance inferred from ⟨*p*⟩ as if the population of genome conformations were homogeneous, i.e., as if there were only a single population.
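
The Jensen step behind this bound can be written compactly as follows. This is a sketch in assumed notation, which may differ from the symbols of the numbered equations: $\mu$ denotes the probability measure of $p$ over the subpopulations, $\bar{r} = \varphi(p)$ is the mean spatial distance within a subpopulation, and angular brackets denote the average over subpopulations.

$$
\langle \bar{r} \rangle \;=\; \int \varphi(p)\, d\mu(p) \;\ge\; \varphi\!\left( \int p\, d\mu(p) \right) \;=\; \varphi\big(\langle p \rangle\big)
$$

The inequality is Jensen's inequality for a convex $\varphi$, and it becomes an equality when $\mu$ is concentrated on a single value of $p$, which is the homogeneous case.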

To demonstrate the validity of Eq. 19, we consider the special case where there are two distinct discrete sub-populations with weights *η* and 1 *− η*. In this case, we have ⟨*r̄*⟩ = *ηr̄*_{1} + (1 *− η*)*r̄*_{2} and ⟨*p*⟩ = *ηp*_{1} + (1 *− η*)*p*_{2}. Note that *r̄*_{1} = *φ*(*p*_{1}) and *r̄*_{2} = *φ*(*p*_{2}). Let us denote *p*_{1} = *x* and *p*_{2} = *y*. Given the value of the contact probability ⟨*p*⟩, we show that the lower bound for ⟨*r̄*⟩ is *φ*(⟨*p*⟩). This is equivalent to the optimization problem,
where *f* (*x, y*) = *ηφ*(*x*) + (1 *− η*)*φ*(*y*) and *g*(*x, y*) = *ηx* + (1 *− η*)*y −*⟨*p* ⟩. The Lagrangian is ℒ(*x, y, λ*) = *f* (*x, y*) *− λg*(*x, y*), where *λ* is the Lagrange multiplier. The condition *∇*_{x,y,λ} *ℒ*(*x, y, λ*) = 0 gives *φ*′(*x*) = *φ*′(*y*) = *λ*, so that *x* = *y* for a strictly convex *φ*, and convexity ensures that this stationary point is a minimum of *f* (*x, y*). Thus, we proved that ⟨*r̄*⟩ is minimized when *p*_{1} = *p*_{2} and that its minimum value is *φ*(⟨*p*⟩). This is also graphically illustrated in Fig.2a in the main text.
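
The two-subpopulation argument can be checked numerically with a short sketch. The power-law form of the map from contact probability to distance, written here as `phi(p) = p ** (-1 / alpha)` with *α* = 4.0, is an illustrative assumption (any convex, decreasing function would do), and the function names and the values of the weight and of the population-averaged contact probability are hypothetical.

```python
def phi(p, alpha=4.0):
    """Illustrative convex, decreasing map from contact probability to distance."""
    return p ** (-1.0 / alpha)

def mixture_mean_distance(p1, p2, eta):
    """Population-averaged distance for two subpopulations with weights eta, 1 - eta."""
    return eta * phi(p1) + (1.0 - eta) * phi(p2)

eta, p_avg = 0.3, 0.2
lower_bound = phi(p_avg)

# Scan p1 while holding the population-averaged contact probability fixed.
gaps = []
for step in range(1, 60):
    p1 = step / 100.0
    p2 = (p_avg - eta * p1) / (1.0 - eta)
    if 0.0 < p2 <= 1.0:
        gaps.append((mixture_mean_distance(p1, p2, eta) - lower_bound, p1))

smallest_gap, p1_at_min = min(gaps)
```

Along the scan the averaged distance never drops below the homogeneous estimate, and the gap closes only where the two subpopulations have the same contact probability.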

## APPENDIX E: CONNECTION BETWEEN THE CONTACT PROBABILITY AND MEAN SPATIAL DISTANCE

For a self-avoiding homopolymer, the distance distribution between two monomers along a polymer chain is [39],
where *r* is the distance between two monomers and *r̄* is the mean distance between them, *g* is the “correlation hole” exponent, and *δ* is related to the Flory exponent *ν* by *δ* = 1/(1 − *ν*). Given the contact threshold *r*_{c}, the contact probability *p* between the two monomers is

When the contact threshold is small compared to the size of the chain , the integral can be approximately evaluated as,

Thus, the contact probability between two monomers, *p*, is connected to their mean distance, *r̄*, by a scaling exponent *−*(3 + *g*), that is, *p* ∼ *r̄*^{−(3+*g*)}. For an ideal chain, *g* = 0, we recover the asymptotically exact relation *p* ∼ *r̄*^{−3}. For a self-avoiding chain, we need to consider three cases [39]: (i) the two monomers are at the two ends of the chain; (ii) one monomer is in the chain interior, while the other is at the end; (iii) both monomers are located in the central part of the chain. The correlation hole exponents corresponding to the three cases [39] are *g*_{1} = 0.273, *g*_{2} = 0.46 and *g*_{3} = 0.71. Thus, we have *p* ∼ *r̄*^{−3.273} for the contact between the two ends of a self-avoiding chain, *p* ∼ *r̄*^{−3.46} for the contact between two monomers in case (ii), and *p* ∼ *r̄*^{−3.71} for the contact between two monomers located in the chain interior.
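
The scaling exponent can be verified numerically with the sketch below. It assumes a distance distribution whose shape in the scaled variable *x* = *r*/*r̄* is proportional to *x*^{2+*g*} exp(−*x*^{*δ*}), with all prefactors set to one, so `r_bar` acts as a length scale rather than the exact mean. The function name, the integration grid, the cutoff at *x* = 10, and the value *ν* = 0.588 for a self-avoiding chain are assumptions made for illustration.

```python
import math

def contact_probability(r_bar, r_c, g, delta, n_grid=20000):
    """Integrate f(x) ~ x^(2+g) * exp(-x^delta), x = r / r_bar, from 0 to r_c / r_bar.

    The normalization is obtained numerically by integrating f up to x = 10.
    """
    def f(x):
        return x ** (2.0 + g) * math.exp(-(x ** delta))

    def integrate(upper):
        # Midpoint rule on a uniform grid.
        h = upper / n_grid
        return h * sum(f((i + 0.5) * h) for i in range(n_grid))

    return integrate(r_c / r_bar) / integrate(10.0)

g, nu = 0.71, 0.588          # interior-interior exponent from the text; nu assumed
delta = 1.0 / (1.0 - nu)
p_near = contact_probability(r_bar=20.0, r_c=1.0, g=g, delta=delta)
p_far = contact_probability(r_bar=40.0, r_c=1.0, g=g, delta=delta)
slope = math.log(p_far / p_near) / math.log(40.0 / 20.0)
```

When the threshold is much smaller than the length scale, the exponential factor is close to one over the integration range, and the logarithmic slope approaches −(3 + *g*).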

For polymers in poor solvents (likely more relevant to the human interphase chromosomes), the value of *g* is not well known. Using simulations, Bohn et al. [40] showed that for an equilibrium collapsed homopolymer chain, *g* = *−*0.11 for the two ends of the chain. This leads to *p* ∼ *r̄*^{−2.89} for the relation between the contact probability of the two ends of an equilibrium homopolymer globule and their mean distance. But the values of *g* for scenarios (ii) and (iii) are unknown. In addition, the copolymer nature and out-of-equilibrium states of chromosomes further complicate the theoretical calculations. Hence, the theoretical estimate of the relation between *p* and *r̄* for chromosomes is not known rigorously. Nevertheless, we expect, based on the arguments given here, that a power law connecting *p* and *r̄* ought to exist. We determine the precise relation based on experimental data and our previous study [25].

## APPENDIX G: ITERATIVE SCALING ALGORITHM FOR MAXIMUM ENTROPY PRINCIPLE

Here, we describe the algorithm for obtaining the *k*_{ij}s in Eq.8. The algorithm we adopted is iterative scaling.

Denoting by *k*_{ij}(*t*) the value of *k*_{ij} at the *t*^{th} iteration, we update it according to,
where *r* is the learning rate. The update depends on two quantities: the average squared pairwise distance at the *t*^{th} iteration and the targeted squared pairwise distance. Generally, the average squared pairwise distance can be estimated by simulation under the values of *k*_{ij}(*t*). In this particular case, it can be numerically computed since *P* ^{MaxEnt} is a multivariate normal distribution.

To demonstrate the effectiveness of the algorithm, Fig.9 shows the comparison between targeted average distance matrix and simulated average distance matrix at different iteration steps. It is clear that after a sufficient number of steps, the simulated distance matrix converges to the targeted one with high accuracy.
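
The idea of the iteration can be illustrated with a deliberately reduced toy problem: a single pair of loci whose separation vector follows a three-dimensional Gaussian proportional to exp(−*k* *r*^{2}), for which the mean squared distance is 3/(2*k*). The additive update rule, the sign convention, the function names, and the parameter values below are assumptions for illustration; they are not taken from Eq.8 or from the update equation of this appendix.

```python
def mean_sq_distance(k):
    """<r^2> for a 3D Gaussian P(r) ~ exp(-k r^2): three components, each with variance 1/(2k)."""
    return 3.0 / (2.0 * k)

def fit_spring_constant(target_sq, rate=0.5, k0=1.0, n_iter=200):
    """Toy additive update: stiffen the spring when the pair is too far apart."""
    k = k0
    for _ in range(n_iter):
        k = k + rate * (mean_sq_distance(k) - target_sq)
    return k

k_fit = fit_spring_constant(target_sq=1.0)
```

In this toy case the iteration settles at the value of *k* for which the computed mean squared distance matches the target. For the full problem the same comparison is made for every pair (*i*, *j*), with the averages computed from the multivariate normal distribution.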

## APPENDIX H: RELATIVE SHAPE ANISOTROPY

To quantify the shape of each chromosome conformation, we calculate the relative shape anisotropy (*κ*^{2}) as follows,
where *λ*_{1,2,3} are the eigenvalues of the gyration tensor. The bounds for *κ*^{2} are 0 *≤ κ*^{2} *≤* 1, where 0 corresponds to a highly symmetric conformation and 1 corresponds to a rod.
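
A minimal sketch of this calculation is shown below. It assumes the standard definition *κ*^{2} = 1 − 3(*λ*_{1}*λ*_{2} + *λ*_{2}*λ*_{3} + *λ*_{3}*λ*_{1})/(*λ*_{1} + *λ*_{2} + *λ*_{3})^{2}, which is consistent with the bounds stated above but is not copied from the equation in this appendix. The function name and the test shapes are hypothetical.

```python
def relative_shape_anisotropy(coords):
    """kappa^2 = 1 - 3 (l1 l2 + l2 l3 + l3 l1) / (l1 + l2 + l3)^2.

    Uses tensor invariants, so no explicit eigenvalue solver is needed:
    l1 + l2 + l3 = tr(S) and l1 l2 + l2 l3 + l3 l1 = (tr(S)^2 - tr(S^2)) / 2.
    """
    n = len(coords)
    center = [sum(c[a] for c in coords) / n for a in range(3)]
    # Gyration tensor S_ab = (1/n) * sum_i (r_i - center)_a (r_i - center)_b
    s = [[sum((c[a] - center[a]) * (c[b] - center[b]) for c in coords) / n
          for b in range(3)] for a in range(3)]
    trace = s[0][0] + s[1][1] + s[2][2]
    trace_sq = sum(s[a][b] * s[b][a] for a in range(3) for b in range(3))
    pair_sum = 0.5 * (trace ** 2 - trace_sq)
    return 1.0 - 3.0 * pair_sum / trace ** 2

rod = [(float(i), 0.0, 0.0) for i in range(10)]
octahedron = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
kappa_rod = relative_shape_anisotropy(rod)
kappa_sym = relative_shape_anisotropy(octahedron)
```

The collinear points reproduce the rod limit and the octahedron, whose gyration tensor has three equal eigenvalues, reproduces the symmetric limit.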

## APPENDIX I: PROCESSING ATAC-SEQ DATA

Each monomer (locus) in the generated 3D structures is assigned a value representing its ATAC signal. We use the ATAC BED file from GEO repository GSE47753. The original data, however, needed to be processed before it could be used in our model. The procedure is illustrated in Fig.10. Each line in the BED file corresponds to an ATAC peak, associated with the peak value and the start and end genomic positions of the segment. In our model, each monomer represents a 100 kbp genome segment. We count how many base pairs overlap between the segment represented by the monomer in our model and the segment in the ATAC-seq data. The contribution to the monomer’s ATAC signal value is computed proportionally from the peak value. For instance, suppose a segment in the ATAC data has a peak value of 100 and a length of 50 kbp, and it overlaps a given monomer over 30 kbp. Then the contribution of this segment to the monomer’s ATAC signal is (30/50) × 100 = 60. If a monomer has no corresponding data in the ATAC BED file, we assign it a peak value of zero.
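
The overlap weighting can be sketched as follows. The function name and the tuple layout of the peaks are hypothetical, the intervals are treated as half-open, and summing the contributions of several peaks that overlap the same monomer is an assumption; the text specifies only the contribution of a single segment.

```python
def atac_signal(monomer_start, monomer_end, peaks):
    """Sum overlap-weighted peak values for one monomer.

    peaks: iterable of (start, end, value) in base pairs, half-open [start, end).
    Monomers without any overlapping peak get a signal of zero.
    """
    signal = 0.0
    for start, end, value in peaks:
        overlap = min(monomer_end, end) - max(monomer_start, start)
        if overlap > 0:
            signal += (overlap / (end - start)) * value
    return signal

# Worked example from the text: a 50 kbp segment with peak value 100
# that overlaps a 100 kbp monomer by 30 kbp.
peaks = [(80_000, 130_000, 100.0)]
signal = atac_signal(100_000, 200_000, peaks)
empty = atac_signal(300_000, 400_000, peaks)
```

Here `signal` reproduces the contribution of 60 from the worked example, and `empty` is zero because the second monomer has no overlapping peak.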