Abstract
Modern experimental techniques such as Hi-C make it possible to measure the probability that different chromosomal regions are close in space. Usually, these measurements are characterized by the scaling of the contact probability as a function of the genomic distance between regions. In this work, we introduce a multifractal analysis of chromosomal contact maps. Our analysis shows that Hi-C maps display a non-trivial multifractal spectrum. We introduce a simple analytical model that describes the structure of chromosomes as a hierarchical set of domains nested in each other and we solve it exactly. The predicted multifractal spectrum is characterized by a phase transition between two phases with different fractal dimension, in excellent agreement with experimental data. These results support the view that there is no privileged level in the hierarchy of conformational domains. Within the same model, the scaling exponent of the contact probability can also be calculated. We argue that such procedure leads to a much more precise estimate of the contact probability exponent than previous approaches. By applying this method to experimental data, we demonstrate that it can capture subtle conformational differences among chromosomes that are robust in different realizations of the same experiment.
I. INTRODUCTION
During cellular interphase, mammalian chromosomes assume a globular structure in the nucleus [1, 2]. Their conformational properties can be studied in vivo with a set of techniques called chromosome conformation capture [3], most notably their version called Hi-C [4]. Results of Hi-C experiments can be represented as matrices whose elements are proportional to the contact probability in space between pairs of chromosome regions [5].
A way to characterize the information contained in Hi-C matrices is by studying the decay of the contact probability pij of two chromosomal regions i and j with respect to their genomic distance |i − j|. Empirically, such decay is well described by a scaling law
The fitted contact probability exponent for mouse chromosomes is β ≈ 0.75 [6]. Such low value is incompatible with simple equilibrium homopolymeric models. An explanation proposed for such low exponent is that the chromosome fiber is not at thermodynamic equilibrium, but populates a ‘crumpled globule’ with β ≈ 1 [4, 7]. Similarly low scaling exponents can also be obtained if the interaction in the polymer are mediated by other molecules [8] or by an active loop-extrusion mechanism [9]. Low exponents can also emerge in heteropolymers as consequences of finite-size effects, amplified by the heterogeneity of monomer interactions [10].
Another distinct feature of Hi-C matrices is the presence of blocks along the diagonal with large values of the contact probability. Physically, these blocks corresponding to structured regions of the chromosome with density larger than the surrounding. Historically, structures on the multi-megabase scale are called ‘compartments’, sub-megabase structures are called ‘topological associating domains’ (TADs) and those at the kilobase scale are called ‘contact domains’. It was suggested that the physical interaction that stabilizes compartments is different than that which stabilizes smaller structures [11].
Despite this distinction, a quantitative analysis of Hi-C matrices points to a continuous hierarchy of structures nested into each other. Although the level of the hierarchy corresponding to TADs is biologically more important than the others because of its relevance for gene regulation [12], it does not seem to be privileged from the conformational point of view [13]. It has been recently suggested that, because of this hierarchical structure, the properties of Hi-C matrices can be studied by an iterative coarse-graining procedure [14].
In this paper, we perform a multifractal analysis of Hi-C matrices of mouse embryonic stem cells [5, 15]. The theory of multifractals has been developed to study heterogeneous systems characterized by scale invariance such as chaotic systems and turbulent flows [16–18]. The theory of multifractals describes the scaling of moments of the probability distribution of a system as a function of a coarse-graining resolution. This procedure leads to a family of scaling exponents (the multifractal spectrum) as a function of the moments. A linear multifractal spectrum indicates that the system is homogeneous, i.e. all of its parts are characterized by the same scaling behavior. Our analysis shows that the multifractal spectrum of mouse Hi-C data presents two different linear regimes, one at low moments with the properties of a two-dimensional object and one at high moments characterized by a fractal dimension between one and two. In the theory of multifractals, such scenario is analogous to a phase transition in statistical physics, with the moments playing the role of an inverse temperature [19].
To understand this observation, we introduce a hierarchical domain model of Hi-C matrices, characterized by a fractal hierarchy of block domains along the diagonal. We calculate analytically the multifractal spectrum of the model. Strikingly, the analytical solution explains the phase transition and provides an excellent fit to the experimental multifractal spectrum. This suggests that the multifractal spectrum of Hi-C matrix directly reflects the nested organization of domains within the chromosomes. Within the model, we also calculate exactly the scaling exponent β. We find that the determination of β from the multifractal spectrum more robust than with conventional methods.
II. SCALING PROPERTIES OF HI-C MAPS
We illustrate the main idea of our work within the example of a Hi-C map of chromosome 1 in mouse embryonic stem cells [15], Fig. 1A. Darker red pixels in the figure represent higher contact probabilities. We seek to statistically characterize the information contained in this map.
A common way to approach this problem is to study the decay of the contact probability with the genomic distance, i.e. the distance from the diagonal [4, 7]. This approach is sketched in Fig. 1B for the same data. The decay of contact probability at distance x ≲ 106 bp is well described by the power law of Eq. (1). The fitted value of the exponent β is in general smaller than one; in this example β ≈ 0.78.
By carefully observing Fig. 1A, it is possible to discern domains of different sizes, sometimes nested into each other, corresponding to a hierarchy of domains in the chromosome. In principle, it is unclear whether this complex structure relates with the power-law decay of the contact probability in Fig.1B. To shed light on this issue, we take the original Hi-C map and construct a hierarchy of coarse-grained versions of it, Fig.1C. Such coarser maps are obtained by considering a grid of resolution ϵ and making a two-dimensional histogram of the Hi-C map. Geometrical structures of linear size smaller than ϵ are not resolved in these maps.
The key idea of a multifractal [17, 18] is to analyze scaling of moments of the coarse-grained Hi-C maps as a function of the mesh size ϵ. To this aim, we introduce a partition function where pi(ϵ) is the density in bin i at resolution ϵ. For convenience, we always consider normalized maps, so that Z(1, ϵ) = Σijpij (ϵ) = 1 for all choices of ϵ. In practice, the smallest possible value of ϵ is the resolution of the experimental map, in our case ϵ = 4 104 bp. The name “partition function” for Z(q, ϵ) originates from an analogy with equilibrium statistical mechanics, where the exponent q plays the role of an inverse temperature. Indeed in an inhomogeneous system, for q → 0 (high temperature), all states give similar contributions to the sum in Eq. (2), whereas for large q (low temperature) the sum is dominated by relatively few terms characterized by highest values of the measure pij.
For a scale-invariant system, one expects a power-law scaling of the partition function as a function of the mesh size ϵ, at least for ϵ small enough
We observed a beautiful power-law behavior as predicted by Eq. (3) in chromosomes, Fig.1D. The function K(q) is the multifractal spectrum and characterizes the scaling behavior with resolution of the partition function [18]. To gain intuition about the multifractal spectrum, it is useful to first consider the simple case of a homogeneous, possibly fractal object of physical dimension D. Such object can be filled a number N ~ ϵ−D of bins of size ϵ, where ~ denotes the leading order for small ϵ. Since the object is homogeneous, such bins would have comparable weight. Normalization then implies that in the filled bins pij ~ ϵD. Substituting these estimates into Eq. (3) we obtain
This means that homogeneous objects are characterized by a linear multifractal spectrum, with a slope equal to the fractal dimension. Conversely, a non-linear multifractal spectrum signals that the geometric set behaves in a different way depending on the weight attributed to the measure of each element. The multifractal spectrum of chromosome 1 is markedly nonlinear, Fig. 1E. In the following, we shall see that an exactly solvable model can account for all these observations in terms of a single free parameter.
III. HIERARCHICAL DOMAIN MODEL
To rationalize the multifractal spectrum of chromosomes, we introduce a hierarchical domain model able to generate synthetic contact matrices similar to the experimental ones. The model is defined in terms of an iterative transformation of a measure on the unit square [0,1] × [0,1]. At each iteration n, the unit square is divided into smaller squares of size ϵ = 2−n. We use ϵ for the side of the square since this length plays the same role as the coarse graining resolution for experimental contact maps. Inside each of these squares the measure is uniform, so that the measure can be represented by a 2n × 2n matrix whose entries are the probabilities pij in each square. The transformation is constructed in such a way that the measure remains normalized at each iteration. The first iteration reads with a > b > 0. Normalization requires b = 1/2 − a. This means that, effectively, the model is defined by a single free parameter a. Because of the normalization and the requirement that the measure should be concentrated along the diagonal, such parameter falls in the rangey
In each of the following iterations, the blocks along the diagonal are divided in 4 sub-blocks by multiplying by the same matrix of Eq. (5), whereas the off-diagonal blocks are divided in 4 identical sub-blocks. Following this procedure, the measure at the second iteration reads and the third iteration reads
Physically, the parameter a represents the “weight” of TADs compared to the rest of the Hi-C map. Matrices with larger values of a have a measure more concentrated along the diagonal, whereas matrices with lower a are characterized by a more uniform measure. In particular, in the limiting case a = 1/4 the matrix is completely uniform at all iterations, whereas for a = 1/2 the matrix is characterized by a uniform measure on the diagonal and all the off-diagonal elements are zero. Examples of measures after different numbers of iterations are shown in Fig. 2.
A. Multifractal scaling
We now write the partition function Z(q, n), corresponding to the definition of Eq. (2) with ϵ = 2−n. For example, we have
The general expression of the partition function is where δkn is the Kronecker delta. We express the partition function in the form where we defined the exponent
We now seek for the maximum of the exponent. Assuming n − k − 1 ≥ 0 and deriving the exponent with respect to k we obtain which is positive for q > qc = (ln 2)/(ln 4a). This means that, for q ≥ qc, the leading term is either ξ(n) or ξ(n−1). Since ξ(n) − ξ(n − 1) = q ln(a/b), the maximum of the exponent is attained at k = n.
Fr large n, the partition function of Eq. (11) is dominated by the term corresponding to the maximum of the exponent. We therefore substitute, for large n, Z(q, n) ~ exp[ξ(n)] ~ ϵK(q) with the length scale ϵ = 2−n. We directly obtain from which
For a = 1/2 the distribution is linear and we have D = 1. For a = 1/4 the distribution is uniform on the square, so that D = 2. Between these two limiting cases, Eq. (15) predicts that the distribution is monofractal with a dimension D = − ln a/ ln 2 between 1 and 2.
Let us now figure out the scaling in the “high temperature phase” where −ln 2 + q ln 4a < 0. In this case, the term dominating the scaling is k = 0, so that which implies
This means that, since the scaling is determined by the terms far from the diagonal in the high temperature phase, the spectrum is that of a regular two-dimensional set.
Summarizing, the saddle-point calculation predicts a multifractal spectrum characterized by two linear regimes: one with slope − log(a)/log(2) for q < qc = (ln 2)/(ln 4a), Eq. (15), and one with slope 2 for q < qc, Eq. (17). Such predictions are very well confirmed by numerical simulations, Fig. 3A, with very small discrep-ancies for high values of q and low values of a arising from finite size effects. These results confirm the validity of our saddle point approximation.
Strikingly, our theory predicts extremely well also the multifractal spectra of real chromosomes, as illustrated in Fig. 3B for the first three chromosomes, with a value of a ≈ 0.425 and very little variability among the three chromosomes. We will examine in more details the differences among chromosomes in the next section.
To verify whether the multifractal spectrum in Fig. 3 is truly a signature of a hierarchical mechanism and not a more generic statistical feature, we numerically computed the spectra of two null models: one charac-terized by a power-law decay plus blocks at a single scale Pij = |i − j|−0.8 + bij (where bij is 1 if i and j belongs to the same block of linear size 64 and zero otherwise), and another characterized by the same power law plus white noise with zero mean and standard deviation 0.1. In the latter case, the noise is truncated to ensure that probabilities are always non-negative. The solution of Eqs. (15) and (17) provides a much better fit to the hierarchical model (sum of residuals ≈ 0.04) than either block model (sum of residuals ≈ 0.8) or the power law plus noise model (sum of residuals ≈ 0.3), data not shown. The sums of residuals for the fits to the first three chromosomes are in order 0.049, 0.051 and 0.043. Taken together, these observations support the idea that the specific multifractal spectrum of the hierarchical model is compatible with that observed in chromosomes and significantly different than that of other simple models that are lacking a hierarchy of domains.
B. Scaling of the contact probability
We now study the decay of the contact probability with the genomic distance d = |i − j |. In our framework, the contact probability P(d; n) at the n-th iteration can be expressed as the total probability of blocks at a distance from the diagonal equal to d
To find an explicit form for P(d; n) we use our ex-pression of the partition function, Eq. (11), for q = 1. The only difference between P(d; n) and Z(1,n) is the Kronecker delta in (18), that selects a subset of terms contributing to the partition function. In particular, we note from the structure of the matrix that each term with a given power of a first increases linearly with d up to a maximum, then decreases linearly to zero. Following this observation, we introduce the triangular function
Using this definition permits to write P(d; n) in the form
We now approximate the sum with an integral
In the first integral, the derivative of the exponent with respect to k is equal to ln(8a) and is therefore always positive due to Eq. (6). The derivative of the exponent in the second integral is equal to which vanishes at
One has 1/2 < d 2(k*−n) < 1, so that the maximum of the exponent falls within the integration domain. Approximating the integral with the maximum of the integrand yields for large n. This implies that the contact probability scales with the exponent
In the allowed range of a given by Eq. (6), the contact probability exponent predicted by Eq. (25) falls in the range β ∈ [0,1]. The prediction of Eq. (25) is supported by numerical simulations, see Fig. 4. This confirms the validity of our saddle-point approximation.
IV. ESTIMATES OF CONTACT PROBABILITY EXPONENT
An important practical consequence of our theory is that it provides an alternative method to estimate the contact probability exponent β via Eq. (25). This requires estimating the scaling exponents of the partition function. In practice, the partition function exhibits a much cleaner scaling than the contact probability, as seen by comparing Fig. 1D with Fig. 1B. In this Section we scrutinize in more details the robustness of fits of the exponent β with the two methods. We also compare the experimental results in [15] with two different realization of a more recent experiment [5].
For both methods, the value of β is affected by the choice of the power-law fitting range. We fixed a lower cutoff for the power law fits equal to 105 bp, slightly higher than the minimum resolution of the data, and plotted the prediction of the exponent β as a function of the upper cutoff, Fig. 5A. We found that exponent displays a stronger dependence on the upper cutoff for the direct power law fit than for the multifractal method. Moreover, comparing different experiments [5] provides quite similar values of the exponents β when the multifractal method is used, whereas a large variability is observed with the power law fit.
Such more precise inference of the value of β can be appreciated when comparing results of different measures of Hi-C exponents. The multifractal method predicts much less variability of the exponent β and much more consistent values across different experiments, with an average β ≈ 0.77, Fig. 5B. Comparing different replicates of the same experiment, the exponents predicted by the multifractal method present a higher correlation (Pearson r ≈ 0.998) compared with those obtained by a direct power law fit (r ≈ 0.994), Fig. 5C.
V. CONCLUSIONS
In this paper, we introduced a multifractal analysis of Hi-C data as a powerful statistical tool to characterize the scaling properties of chromosomes. We have shown that the multifractal spectrum computed for chromosomes of mouse embryonic stem cells has a broken-line shape. To understand this observation we introduced a hierarchical domain model, whose analytical solution is in striking quantitative with the experimental spectrum of in terms of a single free parameter.
Our solution directly implicates a power-law scaling of the contact probability with the genomic distance with exponents β lower than one, also in agreement with experimental observations. A comparison with other null models characterized by similar contact probability exponents but without a hierarchy of domains shows that the multifractal method is sensitive enough to discard them. This implies that the predicted form of the multifractal spectrum provides a more stringent prediction than the contact probability exponent alone. The analysis proposed here constitutes an excellent benchmark to select among different polymer models that provide similar values of β [7–10].
Our results provide a strong indication that scaling properties of chromosomes are a direct consequence of the hierarchical structure of chromosome domains [13, 22]. Recent work has suggested that such domain structure is generated by a “hierarchical folding” mechanism, mediated by different proteins such as cohesin and CCCTC-binding factors [23]. The precise mechanism underlying folding of these domains has been subject of debate [24]. It will be interesting to test whether the activity of these factors can produce self-similar structures compatible with our observations.
The shape of the multifractal spectrum is controlled by a single parameter, that also controls the contact probability exponent β via Eq. (25). We demonstrated that the determination of β from the fit of the multifractal spectrum is a much more robust way of characterizing the Hi-C map than the direct fit of β, highlighting structural difference among chromosomes that we consistently observed in different realizations of the same experiment. The analysis of the multifractal spectrum is simple and robust enough to become a routinely tool to analyze contact maps, both from polymer models and experiments.
Acknowledgments
We thank Yinxiu Zhan and Luca Giorgetti for help with the data.