## Abstract

Chromatin conformation capture assays output a value for each pair of positions that quantifies their strength of interaction in the nucleus. Hi-C signals lack the property of variance stabilization. That is, a difference between 0 and 200 reads usually has a very different statistical importance from a difference between 2000 and 2200 reads. This deficiency impedes analysis and visualization of Hi-C data.

Here, we propose an approach called VSS-Hi-C that normalizes Hi-C data to produce variance-stabilized signals. It does so by learning the empirical relationship between the mean and variance of Hi-C values and transforming these values such that the mean and variance of the data no longer depend on each other. We show that VSS-Hi-C outperforms other transformation approaches in stabilizing the variance of the data, and that using VSS-Hi-C-transformed signals improves downstream analyses such as identifying chromosomal subcompartments. Using VSS-Hi-C additionally eliminates the need for complex methods such as negative binomial modeling in downstream analysis and enables expressive visualization of Hi-C data.

## 1 Introduction

The 3D conformation of the genome plays a central role in many cellular functions including gene expression and DNA replication [1, 2, 3, 4]. 3D interactions occur as a result of biological functions including promoter-enhancer interactions, polymer looping, nuclear compartmentalization, phase separation and others [5, 6, 7, 1, 8]. Chromatin conformation capture assays such as Hi-C measure chromosome conformation by quantifying the interactions between pairs of loci in the genome. These sequencing-based assays output a read count for each pair of genomic loci, indicating the loci’s strength of interaction.

Like all biological data, Hi-C contains sources of experimental variability that must be mitigated and normalized to enable accurate analysis. Many approaches have been developed to normalize for sources of noise and bias in Hi-C data, including mappability, restriction site density, G/C content, read depth, 1D distance, random polymer looping and others [9, 10, 11, 12, 13].

One source of experimental variability in Hi-C data is not addressed by existing normalization methods: interaction counts have a nonuniform mean-variance relationship. For example, a contact pair having 0 reads in one sample and 200 in the other sample is usually considered a more significant difference than a pair having 2000 reads in one and 2200 in the other. More generally, lower interaction counts tend to have lower variability than higher interaction counts. This nonuniform mean-variance relationship poses a challenge to downstream analyses. For instance, considering the difference in interaction counts between different samples is a poor measure of the difference in interaction strength.

To address this nonuniform mean-variance relationship in Hi-C data, some methods, such as those used for detecting significant chromatin interactions, employ techniques like negative binomial distribution modeling [14, 15, 16, 17]. However, due to the complexity of optimizing and implementing negative binomial models, many approaches employ Gaussian-based models and implicitly assume a uniform mean-variance relationship. This issue also afflicts any analysis that employs a mean-squared error (MSE) metric for quantifying performance, because this metric is equivalent to the log likelihood of a uniform-variance Gaussian model. Examples of Gaussian- or MSE-based models include identifying topologically associating domains (TADs) [18], identifying genome compartments or subcompartments [19, 20], enhancing Hi-C data resolution [21], detecting loops [22, 23] and identifying promoter-enhancer interactions [24].

To mitigate the nonuniform mean-variance relationship, many of the Gaussian-based methods employ transformations such as the log transformation (log(*x* + *c*) for a constant *c*, usually 1) or the inverse hyperbolic sine transformation. These transformations stabilize the variance only when the data follow a specific mean-variance relationship (Methods). We found that this assumption is violated in Hi-C data, so the transformed signals still have a nonuniform mean-variance relationship (Results).

Accounting for the mean-variance relationship is also crucial for visualization. Large outliers inherent to nonuniform variance often dominate the viewing scale of plots. This problem is of particular importance to Hi-C data, which is typically visualized as a heatmap, where the choice of color scale can radically change a plot’s interpretation (Fig 1a). More generally, Euclidean distance in a 2D plot corresponds to the log likelihood of a difference in a uniform-variance Gaussian model. Although these problems can be partially mitigated by carefully choosing a maximum viewing range, doing so corresponds to a crude linear+flat transformation.

For many other data sets such as RNA-seq, researchers precisely mitigate this mean-variance relationship using a variance-stabilizing transformation (Methods). We recently demonstrated the importance of doing so for 1D genomic data sets like ChIP-seq and developed a tool called VSS for this purpose [25].

Here, we extend our previous work and propose a method called VSS-Hi-C that stabilizes the variance of Hi-C contact strength. This method learns the empirical mean-variance relationship of the Hi-C matrices and transforms the Hi-C contact strength using a transformation based on this learned mean-variance relationship. We show that VSS-Hi-C transformed matrices have a fully stabilized mean-variance relationship, in contrast to other transformation methods. Moreover, we illustrate that variance-stabilized signals are beneficial for downstream analyses like identifying topological domains and subcompartments.

## 2 Methods

### 2.1 Hi-C data

We acquired three in situ Hi-C data sets for the GM12878, K562 and KBM7 cell lines generated by the Lieberman Aiden group [1] (GEO accession number GSE63525; also available from the ENCODE consortium under accession numbers ENCSR968KAY, ENCSR545YBD and ENCSR987EPR, respectively). These Hi-C interactions were mapped to the hg19 reference genome. In this study, we use two measures of contact strength [1, 26]: (1) the “observed” contact matrix, which represents the raw read counts of the interactions between different genomic regions, and (2) the “observed over expected” (O/E) contact matrix, defined by dividing each observed signal by the average for pairs at the given 1D distance. This step accounts for the increased number of interactions between nearby genomic loci. We considered both 10kb and 100kb resolution contact matrices in the evaluations in this study.
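The O/E definition above can be illustrated with a minimal sketch. It assumes a square, symmetric intra-chromosomal matrix held as nested Python lists and takes the expected value at each 1D distance to be the plain mean of the corresponding diagonal; the function name is hypothetical, and the expected vectors produced by Juicer may be computed differently.

```python
def observed_over_expected(observed):
    """Return the O/E matrix for a square, symmetric contact matrix."""
    n = len(observed)
    # Expected value at 1D distance d: mean of the d-th diagonal (assumption).
    expected = []
    for d in range(n):
        diagonal = [observed[i][i + d] for i in range(n - d)]
        expected.append(sum(diagonal) / len(diagonal))
    oe = []
    for i in range(n):
        row = []
        for j in range(n):
            e = expected[abs(i - j)]
            # Keep the entry at 0 when no contacts exist at this distance.
            row.append(observed[i][j] / e if e > 0 else 0.0)
        oe.append(row)
    return oe
```

A matrix whose entries depend only on distance maps to a matrix of ones, which is the sense in which O/E removes the distance effect.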

Following previous work [1], we normalized the Hi-C contact matrices using the Knight and Ruiz (KR) matrix-balancing algorithm [27] in all analyses in this study. After extracting the raw and O/E matrices, we used the Juicer tool [28] to extract the KR-normalized contact matrices. KR normalization ensures that the sums of each row and column of the contact matrix are equal, which accounts for sources of bias such as G/C content and mappability.
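To make the balancing property concrete, the sketch below equalizes row and column sums of a small symmetric matrix by repeated rescaling. This is a simple iterative-correction scheme chosen for brevity, not the Knight-Ruiz solver that Juicer applies; the function name and iteration count are illustrative assumptions.

```python
def balance(matrix, n_iter=500):
    """Rescale a symmetric non-negative matrix until all row sums are equal.

    Simple iterative correction (an assumption for illustration),
    not the Knight-Ruiz algorithm.
    """
    a = [[float(v) for v in row] for row in matrix]
    n = len(a)
    for _ in range(n_iter):
        sums = [sum(row) for row in a]
        mean_sum = sum(sums) / n
        # Per-locus bias: row sum relative to the average row sum.
        bias = [s / mean_sum if s > 0 else 1.0 for s in sums]
        # Divide each entry by the biases of both loci, preserving symmetry.
        a = [[a[i][j] / (bias[i] * bias[j]) for j in range(n)] for i in range(n)]
    return a
```

Because the same bias is applied to rows and columns, the result stays symmetric, so equal row sums imply equal column sums.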

### 2.2 Repli-seq and ChIP-seq data

We acquired six-phase Repli-seq data for the GM12878 and K562 cell lines from the ENCODE consortium (encodeproject.org), with ENCODE accession numbers ENCSR218XP and ENCSR591OX, respectively.

Moreover, we obtained ChIP-seq data for ten histone modifications (H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9me3, H4K20me1) in the GM12878 and K562 cell lines from the Roadmap Epigenomics data portal (https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/). We also acquired H2A.Z ChIP-seq and DNase data for the GM12878 and K562 cell lines. These data sets are quantified as fold enrichment, defined as the ratio of the observed read count to that of an Input control.

### 2.3 Identifying the mean-variance relationship

Our variance-stabilizing transformation depends on determining the mean–variance relationship for the input Hi-C datasets. We identify this relationship using multiple replicates of the same experiment. We do so for each pair of chromosomes separately.

We first convert the Hi-C matrices to contact signal list vectors. That is, for a Hi-C contact matrix $X_{p \times q}$, we obtain a contact signal list $x$ of length $pq$. Suppose there are $M$ distinct replicates for an experiment. We define two vectors, $x^{(base)}$ and $x^{(aux)}$, that capture replicated contact signal lists. Specifically, for each distinct pair of 1-D contact signal lists $x^{(i)}$, $x^{(j)}$ where $i \neq j$, we concatenate $x^{(i)}$ to $x^{(base)}$ and $x^{(j)}$ to $x^{(aux)}$. Thus, $x^{(base)}$ and $x^{(aux)}$ are each lists of length $M(M-1)(N/r)^2$, for $M$ replicates and a 1-D Hi-C contact signal list of length $(N/r)^2$ in a genome with $N$ genomic positions and Hi-C matrix resolution $r$. Base-aux pairs represent every possible pair of replicated Hi-C contact signals.

Consider $x_i^{(base)}$ and $x_i^{(aux)}$ as the observed contact signals at position $i$. For a given 1-D Hi-C contact signal list, our model assumes that there is an unknown distribution of contact strength at every position $i$. Let $\mu_i = \mathrm{mean}(x_i)$ be the mean for a given contact signal $x_i$. We further assume that there is a relationship $\mathrm{var}(x_i) = \sigma(\mu_i)^2$ between the mean and variance of these distributions. We are interested in identifying the mean-variance relationship $\sigma(\mu)$. To estimate $\sigma(\mu)$, note that $x_i^{(base)}$ is an unbiased estimate of $\mu_i$, and that $(x_i^{(aux)} - \mu_i)^2$ is an unbiased estimate of $\sigma(\mu_i)^2$.
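The construction of the base and aux vectors can be sketched as follows. The sketch assumes that "each distinct pair" means every ordered pair of replicates, so that each replicate serves as base against every other replicate; the helper names are hypothetical.

```python
from itertools import permutations


def flatten(matrix):
    """Turn a p-by-q contact matrix into a contact signal list of length p*q."""
    return [value for row in matrix for value in row]


def build_base_aux(replicates):
    """Concatenate every ordered pair (i, j), i != j, of replicate signal lists.

    Assumption: ordered pairs, giving M * (M - 1) concatenated copies.
    """
    base, aux = [], []
    for i, j in permutations(range(len(replicates)), 2):
        base.extend(replicates[i])
        aux.extend(replicates[j])
    return base, aux
```

With three replicates of length 2, each output has length 3 × 2 × 2 = 12, and position *i* in both lists always refers to the same pair of loci.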

To identify the mean-variance relationship, we employ the approach that we introduced in our previous study [25]. Briefly, we first sort the contact intensities by the value of $x^{(base)}$. Then, we divide the sorted contact intensities into bins that each contain $b$ contact intensities. For each bin $j$, we compute $\hat\mu_j$ and $\hat\sigma_j^2$ as

$$\hat\mu_j = \frac{1}{b}\sum_{i \in I_j} x_i^{(base)} \quad \text{and} \quad \hat\sigma_j^2 = \frac{1}{b}\sum_{i \in I_j} \left(x_i^{(aux)} - \hat\mu_j\right)^2,$$

where $I_j$ is the set of contact positions in bin $j$. We employ a weighted average of the bins to increase the robustness of these estimates by defining

$$\tilde\sigma_j^2 = \frac{\sum_{k=-w}^{w} 2^{-b|k|/\beta}\, \hat\sigma_{j+k}^2}{\sum_{k=-w}^{w} 2^{-b|k|/\beta}}.$$

That is, for a bin $j$, we take the weighted average of the $2w + 1$ nearby bins, assigning weight $2^{-b|k|/\beta}$ to bin $j + k$. We define the window size as $w = -\beta \log(0.01) / (b \log 2)$, which includes the bins with weight at least 0.01. Here, $\beta$ is the bandwidth parameter, which controls the influence of distant bins: lower values of $\beta$ concentrate the smoothing on a small number of nearby bins, whereas higher values of $\beta$ let more distant bins contribute to the weighted average for bin $j$.
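The binning and smoothing steps can be sketched as below. The per-bin statistics (bin mean from the base values, bin variance from squared deviations of the aux values around that mean) and the symmetric weight 2^(-b|k|/β) are assumptions made for this illustration, and the function names are hypothetical. Bins near the ends of the sorted list simply use the neighbours that exist.

```python
import math


def bin_mean_variance(base, aux, b):
    """Sort positions by base value and summarize consecutive bins of b positions."""
    order = sorted(range(len(base)), key=lambda i: base[i])
    means, variances = [], []
    # A trailing partial bin (fewer than b positions) is dropped.
    for start in range(0, len(order) - b + 1, b):
        idx = order[start:start + b]
        mu = sum(base[i] for i in idx) / b
        var = sum((aux[i] - mu) ** 2 for i in idx) / b
        means.append(mu)
        variances.append(var)
    return means, variances


def smooth_variances(variances, b, beta):
    """Weighted average over up to 2w+1 neighbouring bins, weight 2^(-b|k|/beta)."""
    w = int(-beta * math.log(0.01) / (b * math.log(2)))
    smoothed = []
    for j in range(len(variances)):
        num = den = 0.0
        for k in range(-w, w + 1):
            if 0 <= j + k < len(variances):
                weight = 2.0 ** (-b * abs(k) / beta)
                num += weight * variances[j + k]
                den += weight
        smoothed.append(num / den)
    return smoothed
```

The smoothed pairs (bin mean, smoothed bin variance) are the points to which a curve would then be fitted.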

To account for the bias-variance trade-off, we considered multiple combinations of bin size, *b*, and bandwidth, *β*. Larger values of these parameters include more observations, drawn from more distant bins, when calculating the variance estimate for each bin, which leads to higher variance among the pooled positions. Conversely, lower values of *b* and *β* concentrate on a smaller number of adjacent bins, so the positions in each set *I _{j}* are more similar and, consequently, there is less variance among the bins.

To fit a curve to the identified mean-variance trend, we used a smoothing spline function in R, which implements a regularized regression over the natural spline basis.

To identify the optimal values of the parameters *b* and *β*, we performed a hyperparameter search (Figure 3).

### 2.4 Calculating variance-stabilized signals

Variance-stabilized contact strength signals can be computed using

$$t(x) = \int^{x} \frac{1}{\hat\sigma(u)}\, du$$

for a learned mean-variance relationship, where $x$ is an untransformed contact strength signal and $\hat\sigma(u)$ is the learned standard deviation for a contact signal with mean $u$.
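A transformation of this kind can be evaluated numerically once a standard-deviation curve is available. The sketch below assumes the standard variance-stabilizing form, the integral of 1/σ(u) from a lower limit up to x, and approximates it with the trapezoid rule. The function name, the `lower` limit and the step count are illustrative choices, and `sigma` stands in for the fitted curve.

```python
import math


def variance_stabilize(x, sigma, lower=0.0, steps=10000):
    """Approximate t(x) = integral from `lower` to x of 1 / sigma(u) du."""
    if x == lower:
        return 0.0
    h = (x - lower) / steps
    # Trapezoid rule: half weight on the two end points.
    total = 0.5 * (1.0 / sigma(lower) + 1.0 / sigma(x))
    for i in range(1, steps):
        total += 1.0 / sigma(lower + i * h)
    return total * h


# With sigma(u) = u the transformation reduces to a log: t(e^2) is close to 2.
demo = variance_stabilize(math.e ** 2, sigma=lambda u: u, lower=1.0)
```

The demo mirrors the special case in which the standard deviation is proportional to the mean, where the integral is a logarithm.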

This transformation stabilizes the variance among the contact signal matrices, meaning that for each pair of loci *i*, var(*t*(*x _{i}*)) is constant.
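A first-order (delta-method) argument indicates why such a transformation flattens the variance. The sketch assumes that the transformation has the standard variance-stabilizing form $t(x) = \int^{x} 1/\hat\sigma(u)\,du$, so that $t'(\mu) = 1/\hat\sigma(\mu)$, and that the learned curve $\hat\sigma$ is close to the true $\sigma$:

$$
t(x_i) \approx t(\mu_i) + t'(\mu_i)\,(x_i - \mu_i)
\quad\Longrightarrow\quad
\mathrm{var}\big(t(x_i)\big) \approx t'(\mu_i)^2\, \sigma(\mu_i)^2 = \frac{\sigma(\mu_i)^2}{\hat\sigma(\mu_i)^2} \approx 1 .
$$

For example, if $\sigma(\mu) = s\mu$ for a constant $s$, then $t(x) = \frac{1}{s}\log x$ up to an additive constant, which recovers the log transformation as a special case.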

### 2.5 Validation of experiments

In the experiments that are performed on inter-chromosomal contact matrices, VSS-Hi-C models are trained on contact matrices between chromosomes 3 and 4 and tested on contact matrices between chromosomes 1 and 2. For the subcompartment experiment analysis, VSS-Hi-C models are trained on inter-chromosomal contact matrices between chromosomes 7 and 8 and tested on the odd-even matrices. In the analyses that require intra-chromosomal contact matrices, VSS-Hi-C models are trained on chromosome 2 intra-chromosomal contact matrices and tested on chromosome 1 intra-chromosomal contact matrices.

### 2.6 Alternative transformations

To stabilize the variance of data, existing approaches mainly employ a log or arcsinh transformation. These transformations stabilize the variance when the data follow a particular mean-variance relationship [29]. Specifically, log($x$) is variance-stabilizing when $\sigma(\mu) = s\mu$ for some constant $s$, and arcsinh($x$) is variance-stabilizing when $\sigma(\mu) = s\sqrt{1 + \mu^2}$ [30].

In addition to the log and arcsinh transformations, another approach, the Haar-Fisz transformation [31], was used by [9] to stabilize the variance of Hi-C contact profiles. Briefly, assume that a vector $v = (v_0, v_1, \ldots, v_{N-1})$ with $v_i \geq 0$ and $N = 2^J$ needs to be variance-stabilized. The Haar-Fisz algorithm works as follows. It first takes the Haar discrete wavelet transform of the vector $v$ and defines $F_i^j$ as

$$F_i^j = \begin{cases} 0 & \text{if } s_i^j = 0, \\ d_i^j / \sqrt{s_i^j} & \text{otherwise,} \end{cases}$$

where $s_i^j = \left(s_{2i}^{j-1} + s_{2i+1}^{j-1}\right)/2$ and $d_i^j = \left(s_{2i}^{j-1} - s_{2i+1}^{j-1}\right)/2$. It recursively computes $s_i^j$, $d_i^j$ and $F_i^j$ for all $i \in \{0, \ldots, 2^{J-j} - 1\}$ and $j \in \{1, \ldots, J\}$, and sets $s^0 = v$. Finally, to obtain the Haar-Fisz transformation operator $\lambda$, it applies the inverse Haar discrete wavelet transform to the transformed vector $(s^J, F^J, \ldots, F^1)$ to extract the vector $u$, where $u = \lambda v$ [31]. We used the “hft” function in R to apply the Haar-Fisz transformation to Hi-C contact profiles.
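The Haar-Fisz steps can be sketched in a few lines. The sketch assumes the usual definitions from the Haar-Fisz literature (pairwise averages as smooth coefficients, pairwise half-differences as detail coefficients, and details divided by the square root of the matching smooth coefficient); it is an illustration, not a port of the R “hft” function, whose conventions may differ.

```python
import math


def haar_fisz(v):
    """Haar-Fisz transform of a non-negative vector whose length is a power of two."""
    n = len(v)
    levels = n.bit_length() - 1
    if n == 0 or 2 ** levels != n:
        raise ValueError("length must be a power of two")
    s = [float(x) for x in v]
    fisz = []  # fisz[j - 1] holds the Fisz-scaled details of level j
    for _ in range(levels):
        half = len(s) // 2
        smooth = [(s[2 * i] + s[2 * i + 1]) / 2 for i in range(half)]
        detail = [(s[2 * i] - s[2 * i + 1]) / 2 for i in range(half)]
        # Divide each detail by sqrt(local mean); use 0 where the mean is 0.
        fisz.append([d / math.sqrt(m) if m > 0 else 0.0
                     for d, m in zip(detail, smooth)])
        s = smooth
    # Inverse Haar transform, using the scaled details in place of the originals.
    u = s
    for f in reversed(fisz):
        u = [value for m, d in zip(u, f) for value in (m + d, m - d)]
    return u
```

A constant vector is returned unchanged because all detail coefficients are zero, and the overall mean of the output always equals the mean of the input.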

### 2.7 Variance instability evaluation

Following our previous work [32], we employ a variance-instability metric to evaluate whether a transformation achieves a uniform mean-variance relationship. Consider $t(x_i^{(base)})$ and $t(x_i^{(aux)})$ as the transformed contact strength signals at the $i$th position in the 1D contact signal list. We first sort the signals by increasing values of $t(x_i^{(base)})$. Then, we divide all positions into $B$ bins, each containing $b = 10000$ signal values. We define the variance-instability metric as the variance, across bins $j$, of the mean squared difference between replicates for positions in bin $j$, divided by a normalization factor computed from $\sigma_1$ and $\sigma_2$, the standard deviations of $t(x^{(base)})$ and $t(x^{(aux)})$, respectively.

The normalization factor accounts for the variance of the transformed contact strength signals, so that for any constant *α*, *t*(*x*) and *αt*(*x*) have the same variance-instability value. Large values of the variance-instability metric indicate that the variance of the signals is unstable.
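A metric with these properties can be sketched as follows. The exact normalization is an assumption here: the sketch divides the across-bin variance of the mean squared replicate difference by (σ1·σ2)², which is one choice that makes the value identical for *t*(*x*) and *αt*(*x*). The function name is hypothetical.

```python
from statistics import pstdev, pvariance


def variance_instability(t_base, t_aux, b=10000):
    """Across-bin variance of the mean squared replicate difference, normalized.

    Assumption: normalization by (sigma_1 * sigma_2) ** 2 for scale invariance.
    """
    order = sorted(range(len(t_base)), key=lambda i: t_base[i])
    msd = []
    # A trailing partial bin (fewer than b positions) is dropped.
    for start in range(0, len(order) - b + 1, b):
        idx = order[start:start + b]
        msd.append(sum((t_base[i] - t_aux[i]) ** 2 for i in idx) / b)
    scale = pstdev(t_base) * pstdev(t_aux)
    return pvariance(msd) / scale ** 2
```

If replicate differences have the same magnitude in every bin, the metric is zero; differences that grow with signal strength give a positive value.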

### 2.8 Evaluation through clustering interchromosomal Hi-C contacts to assign labels to subcompartments

To evaluate the utility of VSS-Hi-C-transformed signals, we evaluated how useful they are for defining subcompartments. Genomic regions can be categorized into three main clusters: active, inactive centromere-proximal and inactive centromere-distal [13]. Further, another study [1] showed that genomic regions can be segregated into at least six distinct subcompartments, which are associated with different patterns of histone marks.

We used the method of Rao et al. [1] to identify subcompartments from an input Hi-C contact matrix. Briefly, to assign labels to distinct subcompartments in the Hi-C matrix, we separated odd and even chromosomes to eliminate the intra-chromosomal contact profiles. We constructed a 100kb resolution odd-even matrix consisting of inter-chromosomal contact matrices, such that loci from the even chromosomes appear on the columns of the matrix and loci from the odd chromosomes appear on the rows. To cluster loci based on their contact patterns, we employed the GaussianHMM clustering algorithm on this odd-even matrix. To extract clusters for loci on the odd chromosomes, we treated the rows of the matrix as samples and the columns as features and then ran the clustering algorithm. To identify clusters for loci on the even chromosomes, we transposed the matrix and repeated the same procedure. The parameter *k* defines the number of clusters into which contact patterns are categorized. Rao et al. [1] found that k-means and hierarchical clustering approaches gave the same results as the GaussianHMM algorithm.
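The rows-as-samples and transpose steps can be illustrated with a small k-means sketch (k-means being one of the alternatives the text reports as giving the same results). This is plain Lloyd's algorithm with a naive initialization from the first *k* rows, chosen only to keep the example deterministic; it is not the clustering implementation used in the study.

```python
def kmeans(rows, k, n_iter=100):
    """Lloyd's algorithm; centers start at the first k rows (naive assumption)."""
    centers = [list(rows[i]) for i in range(k)]
    labels = None
    for _ in range(n_iter):
        # Assign each row to its nearest center (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - m) ** 2 for a, m in zip(row, centers[c])))
            for row in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Move each center to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_odd_even(matrix, k):
    """Cluster odd-chromosome loci (rows), then even-chromosome loci (columns)."""
    row_labels = kmeans(matrix, k)
    transposed = [list(col) for col in zip(*matrix)]
    col_labels = kmeans(transposed, k)
    return row_labels, col_labels
```

Label numbers are arbitrary, so results from different runs or transformations should be compared by grouping rather than by label value.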

Having assigned distinct labels to subcompartments for different genomic loci, we adopted a variance-explained metric from [33] to quantify the signal variance that is explainable by the predicted labels. Let $B$ be the resolution of the odd-even inter-chromosomal Hi-C matrix. We first divide the genomic signals into equally spaced bins of length $B$ and calculate the average signal in each bin. Let $l_i \in \{1, \ldots, k\}$ ($k$ distinct labels) be the subcompartment label assigned to genomic locus $i$, and let $s_i$ be the average signal in bin $i$. For each distinct label $l$, we compute the mean $\mu_l$ of the signals with that label, and for each genomic locus $i$ we define the predicted signal as $\hat{s}_i = \mu_{l_i}$.

We defined the variance-explained metric [33] for any given transformation as the difference between the total variance of the signal $s$ and the variance of the prediction residuals $s - \hat{s}$. Higher variance-explained values (bounded by [0, 1]) indicate more agreement between the signals and the predicted subcompartment labels.
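The variance-explained computation can be sketched as below. To keep the value in [0, 1], the sketch divides the difference between total and residual variance by the total variance; that normalization is an assumption made for this illustration, and the function name is hypothetical.

```python
from statistics import pvariance


def variance_explained(signal, labels):
    """Fraction of signal variance captured by per-label means.

    Assumption: (total - residual) / total, so the result lies in [0, 1].
    """
    signal = [float(s) for s in signal]
    groups = {}
    for s, lab in zip(signal, labels):
        groups.setdefault(lab, []).append(s)
    label_mean = {lab: sum(vals) / len(vals) for lab, vals in groups.items()}
    # Predicted signal: the mean of the label assigned to each locus.
    predicted = [label_mean[lab] for lab in labels]
    residuals = [s - p for s, p in zip(signal, predicted)]
    total = pvariance(signal)
    return (total - pvariance(residuals)) / total
```

Labels that separate the signal perfectly give 1, while a single label shared by all loci gives 0.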

In this study, we constructed a 100kb resolution inter-chromosomal odd-even matrix with chromosomes 1, 3 and 5 in the rows and chromosomes 2, 4 and 6 in the columns. We also applied the k-means clustering algorithm with *k* clusters, for *k* ∈ {3, 4, 5, 6, 7}.

### 2.9 Evaluation relative to topologically-associating domain (TAD) enrichment for structural proteins and histone marks

Identifying self-interacting genomic regions known as topologically-associating domains (TADs) is of great importance, as their rearrangement or disruption can cause disease by affecting the expression of adjacent genes [1, 34, 35, 36]. To evaluate how useful variance-stabilized signals are for identifying TAD boundaries, we followed previous work on evaluating TAD callers [36] and used a fold-change metric that tests whether TAD boundaries are more enriched in ChIP-seq peaks than their flanking positions. Zufferey et al. [36] assess peak enrichment by dividing the genome into equally spaced bins and computing the average number of peaks in each bin. They calculate fold change as *FC* = (peak/background) − 1, where “peak” is the average number of peaks in the region surrounding the TAD boundary (10kb radius from the boundary) and “background” is the average number of peaks in 5kb intervals in two regions of length 100kb located 400kb away from the TAD boundary.
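The fold-change calculation can be sketched for a single boundary as follows. Several details are assumptions made for illustration: peaks are given as single genomic coordinates, both the boundary window and the flanks are counted in 5kb bins, and the two background regions are taken as the 100kb stretches starting 400kb upstream and downstream of the boundary. The function names are hypothetical.

```python
def mean_peaks_per_bin(peaks, start, end, bin_size=5_000):
    """Average number of peaks per bin in the half-open interval [start, end)."""
    n_bins = (end - start) // bin_size
    count = sum(1 for p in peaks if start <= p < start + n_bins * bin_size)
    return count / n_bins


def fold_change(peaks, boundary, radius=10_000, offset=400_000,
                flank_len=100_000, bin_size=5_000):
    """FC = peak / background - 1 for one TAD boundary (flank layout assumed)."""
    peak = mean_peaks_per_bin(peaks, boundary - radius, boundary + radius, bin_size)
    left = mean_peaks_per_bin(peaks, boundary - offset - flank_len,
                              boundary - offset, bin_size)
    right = mean_peaks_per_bin(peaks, boundary + offset,
                               boundary + offset + flank_len, bin_size)
    background = (left + right) / 2
    if background == 0:
        raise ValueError("no background peaks; fold change is undefined")
    return peak / background - 1
```

A value of 0 means the boundary is no more peak-dense than its distant flanks; positive values indicate enrichment at the boundary.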

For the sake of consistency, we took the proteins and corresponding ChIP-seq peak files from [36]. We acquired peak files for the GM12878 cell line for the chromatin insulator protein CTCF and for two core subunits of the cohesin complex, SMC3 and RAD21, from the ENCODE consortium (encodeproject.org). To derive a peak set for CTCF, we used the intersection of four experiments with ENCODE accession numbers ENCSR000DRZ, ENCSR000DKV, ENCSR000DZN and ENCSR000AKB. For RAD21, we used the intersection of peaks from experiments ENCSR000BMY and ENCSR000EAC. For SMC3, peaks were derived from experiment ENCSR000DZP. Moreover, we acquired peak files for the K562 cell line for CTCF, SMC3 and RAD21 with accessions ENCFF002CEL, ENCFF483CZB and ENCFF002CXU, respectively.

Zufferey et al. [36] compared different approaches for identifying TADs against multiple criteria, including robustness to resolution and normalization, cost-effectiveness, concordance with other TAD callers, enrichment for biological features and computational efficiency. Considering all metrics, the TopDom TAD caller [37] satisfied all criteria. In this study, we used the TopDom R package [38] to identify TADs in Hi-C contact matrices. Briefly, TopDom belongs to the category of TAD callers that compute a linear score for each bin, summarizing the distribution of contacts around that bin in the Hi-C contact profile. The only parameter of TopDom is the window size, which sets the number of bins used to compute binSignals when identifying TADs. We used the recommended window size *w* = 5.

In addition to the critical role of cohesin and CTCF binding in TAD formation, it is also important to determine whether TADs are enriched for specific histone marks. Studies have shown that TADs are enriched for either H3K27me3 or H3K36me3 marks. To evaluate the utility of VSS-Hi-C-transformed matrices with respect to this enrichment, we followed the methodology introduced by [36]. Briefly, Zufferey et al. [36] use fold-change-over-control ChIP-seq signals for H3K27me3 and H3K36me3 to quantify the ratio between these two histone marks. Let *m* be the average size of the TADs identified by a TAD caller. They divide the ChIP-seq signals into equally spaced intervals of length 0.1*m*. Then, for each interval, they calculate LR, the log10 ratio between the H3K27me3 and H3K36me3 signals. Next, to quantify whether an identified TAD is significant, they compute the average observed LR for each TAD. They then shuffle the LR values across all TADs 10 times to obtain a distribution of randomized average LR values for each TAD. Finally, they apply the Benjamini-Hochberg (BH) procedure to control the false discovery rate (FDR) of the TAD-specific p-values, which are computed by comparing each within-TAD LR with the distribution of shuffled LRs. The fraction of TADs with a BH-corrected p-value smaller than 0.1 is reported as the fraction of significant TADs identified by the TAD caller. Following [36], we acquired H3K27me3 and H3K36me3 ChIP-seq signals for GM12878 from experiments with ENCODE accession numbers ENCSR000DRX and ENCSR000DRW. We also used H3K27me3 and H3K36me3 ChIP-seq signals for K562 from experiments with ENCODE accession numbers ENCSR000AKQ and ENCSR000AKR, respectively.
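The final step of this procedure, turning TAD-specific p-values into BH-corrected values and reporting the fraction below 0.1, can be sketched as follows. The sketch takes the p-values as given (how they are derived from the shuffled LR distribution is not reproduced here), and the function names are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def fraction_significant(pvalues, alpha=0.1):
    """Fraction of TADs whose BH-corrected p-value is below alpha."""
    adjusted = benjamini_hochberg(pvalues)
    return sum(1 for p in adjusted if p < alpha) / len(adjusted)
```

For the p-values 0.01, 0.04, 0.03 and 0.5, three of the four corrected values fall below 0.1, giving a fraction of 0.75.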

## 3 Results

### 3.1 Differences between replicates are stabilized after transformation

To evaluate whether existing units for Hi-C contact matrices have stable variance, we identified the mean-variance trend for a number of existing data sets (Figure 1c).

We found that the variance of the Hi-C matrices has a strong dependence on the mean (Figure 1c). That is, pairs with low interaction intensity have smaller variance across replicates than pairs with higher interaction intensity (Figure 1c). Moreover, the mean-variance trend does not match the relationship assumed by the currently used log(*x* + 1) and asinh(*x*) transformations (Methods). This means that neither of these transformations fully stabilizes the variance of the Hi-C contact matrices (Figure 1c). Furthermore, the mean-variance relationship differs between data sets, meaning that no single fixed transformation can stabilize the variance for all data sets.

Following our previous work on stabilizing the variance in sequencing-based genomic assays [25], we evaluated the consistency of the mean squared between-replicate differences using a variance-instability metric (Methods). Lower values of this metric indicate that a given transformation stabilizes the variance in a given data set. We evaluated the performance of different transformation approaches on both observed and observed over expected (O/E) contact matrices and on both inter- and intra-chromosomal contacts. We found that signals transformed by VSS-Hi-C have lower variance instability (that is, they are more stabilized) than those produced by the other transformation approaches in all cases (Figure 2a). While Haar-Fisz performs comparably to VSS-Hi-C on interchromosomal and O/E intrachromosomal signals, VSS-Hi-C greatly outperforms all other approaches on raw intrachromosomal signals. This results from the fact that VSS-Hi-C directly stabilizes variance, whereas other transformations use ad hoc heuristics that attempt to do so.

### 3.2 Transformed signals improve subcompartment calling

To evaluate whether variance-stabilized Hi-C matrices can improve subcompartment calling, we quantified the agreement between predicted sub-compartments and genome regulatory activities and replication timing.

Replication time is defined as the order within S phase that different segments of DNA are replicated. It has been shown that there is a correlation between replication timing and chromosome structure [39]. That is, for example, inactive chromatin domains replicate later than the active domains [40]. Moreover, the correlation between genome compartmentalization and regulatory activities such as histone modification enrichment have been shown in [1, 20, 33, 41].

Following [33], to evaluate the agreement of genome subcompartments called using transformed Hi-C data with replication timing and histone modifications, we used the variance-explained metric (Section 2.8). Briefly, we called subcompartments using k-means clustering, following [1], and evaluated the degree to which these subcompartments agree with 1D data sets (Methods).

We found that signals transformed by VSS-Hi-C yield better agreement between Repli-seq data and predicted subcompartments for the odd-even matrix in O/E contact matrices (SI Figure 4). Moreover, VSS-Hi-C performs better than the other transformation methods in observed matrices when *k* = 6, while for the other numbers of labels all transformation approaches perform similarly. There is very little agreement between Repli-seq signals and predicted labels when using untransformed signals (Figure 2b).

Similarly, when comparing subcompartments to 1D epigenomic data, VSS-Hi-C outperforms other approaches in O/E contact matrices (SI Figure 5), although it does not improve the agreement between 1D signals and predicted labels significantly in the observed contact matrices except for *k* = 7 (Figure 2c).

To evaluate the utility of transformed Hi-C data for identifying topologically-associating domains (TADs), we applied two evaluation metrics defined in [36] (Section 2.9). The first metric evaluates whether identified TAD boundaries are enriched for the known structural proteins SMC3 and RAD21 (Section 2.9). We found that VSS-Hi-C outperforms log and Haar-Fisz transformations according to this metric, although it is outperformed by asinh (Figure 2e).

The second metric evaluates whether identified TADs are uniformly activated or repressed as measured by the ratio of the transcription-associated histone mark H3K36me3 to the repression-associated mark H3K27me3 (Section 2.9). We found that all signals perform similarly in this evaluation (Figure 2d).

These results indicate that the choice of signals has a large impact on TAD calling, although variance-stabilized signals are not clearly superior according to these metrics. Notably, the TAD caller in question, TopDom, was originally designed to be used with untransformed data; it is likely that a TAD caller designed to take advantage of variance-stabilized data could achieve even better performance.