ABSTRACT
The metazoan genome is compartmentalized in megabase-scale areas of highly interacting chromatin known as topologically associating domains (TADs), typically identified by computational analyses of Hi-C sequencing data. TADs are demarcated by boundaries that have been shown to be largely conserved across cell types and even across species. Increasing evidence suggests that the seemingly invariant TADs may exhibit some plasticity in certain cases and their boundary strength can vary. However, a genome-wide characterization of TAD boundary strength in mammals is still lacking. In this study, we use fused two-dimensional lasso as a machine-learning method to first improve Hi-C contact matrix reproducibility and subsequently categorize TAD boundaries based on their strength. We demonstrate that increased boundary strength is associated with elevated levels of CTCF and that TAD boundary insulation scores may differ across cell types. Intriguingly, we also found that super-enhancer elements are preferentially insulated by strong boundaries. Presumably, genetic or epigenetic inactivation of strong boundaries may lead to loss of insulation around super-enhancers, disrupt the physiological transcriptional program and cause disease.
INTRODUCTION
The advent of proximity-based ligation assays has allowed us to probe three-dimensional chromatin organization at unprecedented resolution [1, 2]. Hi-C, a high-throughput chromosome conformation capture variant, has allowed genome-wide identification of chromatin-chromatin interactions [3]. Hi-C is prone to biases, and multiple algorithms have been developed for Hi-C bias correction, including probabilistic modelling methods [4], Poisson or negative binomial normalization [5] and the widely popular Iterative Correction and Eigenvalue decomposition method (ICE) [6], which assumes “equal visibility” of genomic loci. A similar iterative method named Sequential Component Normalization was introduced by Cournac et al. [7]. Additional efficient correction methods have been developed to handle high-resolution Hi-C datasets [8]. Hi-C has revealed that the metazoan genome is organized in areas of active and inactive chromatin known as A and B compartments, respectively [3]. These are further compartmentalized in super-TADs [9], topologically associating domains (TADs) [10–12] and sub-TADs [13], as well as gene neighbourhoods [14]. Several algorithms have already been developed to reveal this hierarchical chromatin organization, including Directionality Index (DI) [10], Armatus [15], TADtree [16], Insulation Index (Crane) [17], IC-Finder [18] and others. TADs are megabase-scale areas of highly interacting chromatin, demarcated by CTCF-enriched boundaries, and are highly conserved across species and cell types [10, 19].
Genome compartmentalization in TADs confines enhancer-promoter interactions within the same domain [10, 12, 20], and during cell differentiation most changes have been shown to occur within TADs [21]. TAD boundaries have been found to be rich in tRNA genes, transposable elements, CCCTC-binding factor (CTCF), cohesin complex and other structural proteins [10–12]. More recently, proteins involved in chromatin remodelling such as BRG1 (an ATPase driving SWI/SNF activity), as well as topoisomerase complexes, have been implicated in boundary formation through regulation of chromatin compaction [22]. Whereas TADs are seemingly invariant, mounting evidence suggests that TAD boundaries can vary in strength, ranging from permissive TAD boundaries that allow more inter-TAD interactions to more rigid (strong) boundaries that clearly demarcate adjacent TADs [23]. Recent studies have shown that, in Drosophila, exposure to heat shock resulted in local changes in certain TAD boundaries, leading to TAD merging, which is believed to have physiological consequences [24]. A recent study showed that during motor neuron (MN) differentiation in mammals, TAD and sub-TAD boundaries in the Hox cluster are not rigid and their plasticity is linked to changes in the expression of Hox cluster genes during differentiation [25]. It has also been demonstrated that boundary strength is positively associated with the occupancy of certain structural proteins, including CTCF [10]. Although a handful of studies have demonstrated that not all boundaries are equal and that they can vary in strength in organisms like Drosophila, no study has yet addressed the issue of boundary strength in mammals and how it may be related to potential boundary disruptions and aberrant gene activation in diseases like cancer.
Here we introduce a new method based on fused two-dimensional lasso [26] in order to: (a) improve the correlation of Hi-C contact matrices, (b) reveal the multiple levels of chromatin organization and (c) categorize TAD boundaries based on their corresponding strength.
MATERIALS AND METHODS
Hi-C datasets
In order to develop a method that successfully handles variation in Hi-C data and improves reproducibility, we carefully selected our Hi-C datasets to represent technical variation due to the execution of the experiments by different laboratories and/or the usage of different enzymes. We ensured that our datasets included samples with at least ~40 million intra-chromosomal read pairs and that each Hi-C experiment was performed in biological replicates, using either one restriction enzyme (HindIII or MboI) (H1 cells and their derivatives [21], K562, KBM7 and NHEK cells [27] and in-house generated CUTLL-1), or two enzymes (HindIII and MboI) (GM12878 [27], IMR90 [10, 28]), in order to examine the consistency of predicted Hi-C interactions across different enzymes.
Calculation of same-enzyme and cross-enzyme correlations
To evaluate the performance of our method, we calculated two types of correlation for Hi-C matrices: a) same-enzyme correlation, which corresponds to all pairs of Hi-C replicates prepared with the same restriction enzyme, and b) cross-enzyme correlation, which corresponds to all the sample pairs where the same Hi-C sample was prepared with two different enzymes (e.g. HindIII/MboI). Pearson correlation coefficients were calculated either on the filtered, ICE-corrected [6] or scaled (see below) Hi-C contact matrices (Pearson) or on the distance-normalized ones (Pearson (z-score)).
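The correlation between two replicate matrices can be sketched as follows. This is a minimal illustration, not the HiC-bench implementation: the function name `pearson_within_distance` and the choice to restrict the comparison to bin pairs within a maximum distance (expressed in bins) are assumptions made for the example.

```python
import numpy as np

def pearson_within_distance(a, b, max_dist_bins):
    """Pearson correlation between two symmetric contact matrices,
    using only upper-triangle bin pairs with |j - i| <= max_dist_bins."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    i, j = np.triu_indices(a.shape[0])      # upper triangle incl. diagonal
    keep = (j - i) <= max_dist_bins         # restrict by genomic distance
    return np.corrcoef(a[i[keep], j[keep]], b[i[keep], j[keep]])[0, 1]
```

The same function can be applied to raw, bias-corrected or distance-normalized matrices, so that the two correlation flavours differ only in their input.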
Generation of scaled Hi-C contact matrices
In order to improve the cross-enzyme (and same-enzyme) correlation of Hi-C matrices we accounted for the total number of read pairs and the “effective length” [4]. More specifically, the scaled number of reads corresponding to interactions between the Hi-C matrix bins ij (yij) is defined by the formula: where xij is the original number of interactions between the bins i and j, effi, the effective length for the bin i, effj the effective length for the bin j, and N is the total number of read pairs.
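As a rough illustration of this kind of scaling, the sketch below divides each count by the effective lengths of the two bins and by the total number of read pairs. This is an assumption about the general form only: any constant factors in the original formula are not reproduced here, and the function name `scale_matrix` is hypothetical.

```python
import numpy as np

def scale_matrix(x, eff, n_pairs):
    """Scale raw counts x[i, j] by effective lengths eff[i], eff[j]
    and by the total number of read pairs (assumed form, no constants)."""
    x = np.asarray(x, dtype=float)
    eff = np.asarray(eff, dtype=float)
    return x / np.outer(eff, eff) / n_pairs
```

Under this form, a library sequenced twice as deeply (all counts and the total doubled) yields the same scaled matrix, which is the property that makes samples of different depth comparable.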
Distance normalization
Genomic loci that are further apart in terms of linear distance on DNA tend to give fewer interactions in Hi-C maps than loci that are closer. For intra-chromosomal interactions, this effect of genomic distance should be taken into account. Consequently, the interactions were distance-normalized using a z-score that was calculated taking into account the mean Hi-C counts for all interactions at a given distance d and the corresponding standard deviation. Thus, the z-score for the interaction between the Hi-C contact matrix bins i and j (zij) is given by the following equation:
$$z_{ij} = \frac{y_{ij} - \mu(d)}{\sigma(d)}$$
where yij corresponds to the number of interactions between the bins i and j, μ(d) to the mean (expected) number of interactions for distance d=|j-i| and σ(d) to the corresponding standard deviation. The larger the difference between the observed (yij) and expected (μ(d)) number of interactions, the higher the corresponding z-score.
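The distance normalization described above can be sketched per diagonal of the contact matrix. This is a minimal example under two assumptions: the matrix is symmetric, and diagonals with zero standard deviation (for instance the single-element corner diagonal) are left at zero. The function name `distance_zscore` is hypothetical.

```python
import numpy as np

def distance_zscore(y):
    """Z-score each entry against the mean and standard deviation of
    all entries at the same distance d = |j - i| (same diagonal)."""
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    z = np.zeros_like(y)
    for d in range(n):
        idx = np.arange(n - d)
        vals = y[idx, idx + d]              # all interactions at distance d
        sd = vals.std()
        if sd > 0:                          # skip constant diagonals
            zd = (vals - vals.mean()) / sd
            z[idx, idx + d] = zd
            z[idx + d, idx] = zd            # keep the matrix symmetric
    return z
```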
Fused two-dimensional lasso
While our naïve scaling approach successfully increased the cross-enzyme and same-enzyme correlation of Hi-C matrices, we sought to improve the correlation even further. We used fused two-dimensional lasso, an optimization-based machine learning technique widely used to analyse noisy datasets, especially images [26]. This technique is well suited for identifying topological domains based on contact maps generated by Hi-C sequencing experiments for two reasons: (a) Hi-C datasets are inherently noisy, and (b) topological domains are continuous DNA segments of highly interacting loci that would ideally appear as solid squares along the diagonal of Hi-C contact matrices. In practice, topological domains map to squares of different length along the diagonal of the Hi-C contact matrix, but they are not solid as they contain several gaps, i.e. scattered regions on those squares that show little or no interaction. Two-dimensional fused lasso addresses this issue by penalizing differences between neighbouring elements in the contact matrix. The amount of smoothing is controlled by the penalty parameter λ (lambda), as described in the equation:
$$\min_{\hat{y}} \; \frac{1}{2}\sum_{i,j}\left(y_{ij} - \hat{y}_{ij}\right)^2 + \lambda \sum_{i,j}\left(\left|\hat{y}_{i+1,j} - \hat{y}_{i,j}\right| + \left|\hat{y}_{i,j+1} - \hat{y}_{i,j}\right|\right)$$
where y is the original (i.e. observed) contact matrix, and ŷ is the estimated contact matrix such that the objective function described above is minimized. In the interest of computational efficiency, we applied one-dimensional lasso on the Hi-C contact matrices in order to estimate the matrices for high values of λ and obtain the full hierarchy of TAD boundaries. Using one-dimensional lasso instead of the two-dimensional version had no negative impact on the correlations of Hi-C contact matrices between replicates (Supplemental Figure 1).
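To make the effect of λ concrete, the following sketch solves a one-dimensional fused lasso problem on a single signal, assuming the standard objective 0.5 * sum((y - x)^2) + lam * sum(|x[i+1] - x[i]|). It uses projected gradient descent on the dual problem, which is simple but slow; it is an illustration only and not the solver used in this study. How the one-dimensional version is applied to a full contact matrix is not specified here, so the example operates on a plain vector. The function name `fused_lasso_1d` is hypothetical.

```python
import numpy as np

def fused_lasso_1d(y, lam, n_iter=2000):
    """Approximate solution of the 1D fused lasso (total variation)
    problem by projected gradient descent on its dual."""
    y = np.asarray(y, dtype=float)
    u = np.zeros(len(y) - 1)                # one dual variable per neighbour pair

    def dt(u):
        # Transpose of the first-difference operator applied to u.
        return np.concatenate(([0.0], u)) - np.concatenate((u, [0.0]))

    for _ in range(n_iter):
        x = y - dt(u)                       # current primal estimate
        # Step size 0.25 is safe because the operator norm is below 4.
        u = np.clip(u + 0.25 * np.diff(x), -lam, lam)
    return y - dt(u)

signal = np.array([0.0, 0.2, -0.1, 1.0, 1.1, 0.9])
smoothed = fused_lasso_1d(signal, lam=0.3)  # neighbouring values are pulled together
```

With λ = 0 the input is returned unchanged, and for sufficiently large λ every entry collapses to the overall mean, mirroring the behaviour described for the contact matrix.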
Classification of boundaries based on fused two-dimensional lasso
We applied two-dimensional fused lasso to categorize TAD boundaries based on their strength. The rationale behind this categorization is that topological domains separated by more “permissive” (i.e. weaker) boundaries [29] will tend to fuse into larger domains when lasso is applied, compared to TADs separated by well-defined, stronger boundaries. Following this strategy, we categorized boundaries into multiple groups ranging from the most permissive to the strongest boundaries. The boundaries that were lost when the λ value was increased from 0 to 0.25 fall in the first category (λ=0), the ones lost when λ was increased to 0.5 fall in the second category (λ=0.25), etc.
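The categorization rule can be sketched as follows, assuming TAD boundaries have already been called at each λ value and that a boundary is represented by the same bin position across λ values (in practice some positional tolerance would likely be needed). The function name `classify_boundaries` and the input layout are hypothetical.

```python
def classify_boundaries(boundaries_by_lambda):
    """Assign each boundary the largest lambda at which it is still
    present before it is first lost at the next lambda value."""
    lambdas = sorted(boundaries_by_lambda)
    category = {}
    for lam, nxt in zip(lambdas, lambdas[1:]):
        lost = set(boundaries_by_lambda[lam]) - set(boundaries_by_lambda[nxt])
        for b in lost:
            category.setdefault(b, lam)     # keep the first loss only
    # Boundaries surviving the highest lambda form the strongest class.
    for b in boundaries_by_lambda[lambdas[-1]]:
        category.setdefault(b, lambdas[-1])
    return category

calls = {0.0: {100, 200, 300, 400}, 0.25: {100, 300, 400}, 0.5: {100, 400}}
strength = classify_boundaries(calls)       # boundary 200 falls in the weakest class
```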
Association of CTCF levels with boundary strength
We obtained CTCF ChIP-sequencing data for the cell lines utilized in this study (with the exception of KBM7 for which no publicly available dataset was available) and we uniformly re-processed all data using HiC-bench [30]. Total CTCF levels at each TAD boundary were calculated and their normalized distributions for each boundary category (weak to strong) were plotted in boxplots in order to demonstrate the association of increased boundary strength with increased levels of CTCF binding.
Association of boundary strength with super-enhancers and repeat elements
Super-enhancers were called using H3K27ac ChIP-seq data from GEO, ENCODE and in-house generated data. Reads were first aligned with Bowtie2 v2.3.1 [31] and then HOMER v4.6 [32] was used to call super-enhancers, all with standard parameters. For each super-enhancer in each sample, we identified the corresponding TAD and its TAD boundaries. We then computed (per sample) the percentage of super-enhancers that are surrounded by boundaries belonging to each boundary category, demonstrating that most super-enhancers are insulated by strong boundaries.
RESULTS
Comprehensive re-analysis of published high-resolution Hi-C datasets
We identified publicly available human Hi-C datasets (described in the Materials and Methods section) that fulfilled the following criteria: (i) two biological replicates and (ii) sufficient sequencing depth to robustly identify topologically associating domains (TADs) as described in our TAD calling benchmark study [30]. All datasets were then comprehensively re-analysed using HiC-bench. Quality assessment analysis revealed that the samples varied considerably in terms of total numbers of reads, ranging from ~150 million reads to more than 1.3 billion (Figure 1A). Mappable reads were over 96% in all samples. The percentages of total accepted reads corresponding to cis (ds-accepted-intra, dark green) and trans (ds-accepted-inter, light green) (Figure 1B) also varied widely, ranging from ~17% to ~56%. Duplicate read pairs (ds-duplicate-intra and ds-duplicate-inter; red and pink respectively), non-uniquely mappable (multihit; light blue), single-end mappable (single-sided; dark blue) and unmapped reads (unmapped; dark purple) were discarded. Self-ligation products (ds-same-fragment; orange) and reads mapping too far (ds-too-far; light purple) from restriction sites or too close to one another (ds-too-close; orange) were also discarded. Only double-sided uniquely mappable cis (ds-accepted-intra; dark green) and trans (ds-accepted-inter; light green) read pairs were used for downstream analysis. Despite the differences in sequencing depth and in the percentages of useful reads across samples, all samples had enough useful reads for TAD calling and thus none of them was excluded from downstream analysis. However, due to the wide differences in sequencing depth, and to ensure fair comparisons of Hi-C matrices in this study, all datasets were down-sampled such that the number of usable intra-chromosomal read pairs was ~40 million for each replicate.
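Down-sampling to a common depth can be illustrated on a matrix of counts. The sketch below is an assumption about one way to do it, not the procedure used in this study, which presumably sub-sampled the read pairs themselves: it draws a fixed number of contacts without replacement from the upper triangle using NumPy's multivariate hypergeometric sampler. The function name `downsample_counts` is hypothetical.

```python
import numpy as np

def downsample_counts(counts, target, seed=0):
    """Sample `target` contacts without replacement from a symmetric
    integer contact matrix, preserving symmetry."""
    counts = np.asarray(counts, dtype=np.int64)
    iu = np.triu_indices(counts.shape[0])   # each bin pair counted once
    upper = counts[iu]
    if upper.sum() <= target:
        return counts.copy()                # already at or below target depth
    rng = np.random.default_rng(seed)
    sampled = rng.multivariate_hypergeometric(upper, target)
    out = np.zeros_like(counts)
    out[iu] = sampled
    return out + np.triu(out, 1).T          # mirror the strict upper triangle
```

Sampling without replacement guarantees that no bin pair ends up with more contacts than it had in the original matrix.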
Assessment of same-enzyme and cross-enzyme reproducibility of Hi-C contact matrices
Although it has been demonstrated in the literature that Hi-C libraries are prone to enzyme biases (see Introduction), no systematic large-scale study has investigated in detail the reproducibility of Hi-C contact matrices. Here, we attempt to address this question using the most comprehensive Hi-C dataset that is currently available, as described in the previous section. More specifically, we focus on multiple factors that may play an important role in reproducibility: first, we separately consider biological replicates of Hi-C libraries generated with the same or different restriction enzymes; second, we study the impact of Hi-C matrix resolution (i.e. bin size); third, we assess reproducibility as a function of the distance between interacting loci. Pearson correlation coefficients were calculated for each pair of replicates (same- or cross-enzyme) on Hi-C contact matrices estimated by three methods: (i) naïve filtering (i.e. matrix generation by simply using double-sided accepted intra-chromosomal read pairs from Figure 1A), (ii) iterative correction (ICE), which has already been demonstrated to improve cross-enzyme correlation, and (iii) our own simple scaling method that only corrects for effective length bias (see Methods for details). Importantly, correlations were computed both on the actual matrices and on the distance-normalized matrices (see Methods for details), as Hi-C interactions are typically concentrated around the diagonal of the Hi-C contact matrix, and values drop exponentially as the distance between the interacting pairs increases. Distance-normalized matrices account for the expected Hi-C read count as a function of distance and may therefore reveal real distal interactions.
The results of our benchmark analysis are summarized in Figure 1C: the left panel summarizes the correlations between replicates generated by the same restriction enzyme, whereas the right panel shows the correlations between replicates generated by different restriction enzymes.
In both scenarios, as expected, correlations drop quickly as finer resolutions (from 100kb to 20kb) are considered, especially in the distance-normalized matrices. The same conclusion applies for increasing distance (from 2Mb to 10Mb) between interacting loci, demonstrating that long-range interactions require ultra-deep sequencing in order to be detected reliably. To elaborate on this point, we repeated the analysis after retaining only those samples with two replicates of at least 70 million or 110 million usable intra-chromosomal reads and resampling them down to 80 million or 120 million per replicate (Supplemental Figure 2 and Supplemental Figure 3 respectively). Both conclusions hold true at the new sequencing depths and are independent of the Hi-C contact matrix estimation method. Finally, bias-correction methods (ICE and our scaling approach) indeed improved cross-enzyme correlation over the naïve filtering method. Interestingly, this improvement came at the expense of lower correlations in the same-enzyme case. More specifically, we observed that the larger the gain in cross-enzyme correlations, the greater the loss in same-enzyme correlations (ICE method) (Figure 1C).
Fused lasso improves same-enzyme and cross-enzyme correlations of Hi-C contact matrices
Motivated by the poor performance of all methods at fine resolutions and by the surprising trade-off whereby correcting for enzyme-related biases improves cross-enzyme correlation at the expense of same-enzyme correlation, we applied fused two-dimensional lasso (see Methods for details), a well-studied image denoising method, to generate Hi-C contact matrices with increased consistency between replicates. Briefly, two-dimensional fused lasso utilizes a parameter λ which penalizes differences between neighboring values in the Hi-C contact matrix. The effect of parameter λ is demonstrated in Figure 2A, where we show an example of the application of fused two-dimensional lasso on a Hi-C contact matrix focused on an 8Mb locus on chromosome 8 for different values of parameter λ. To evaluate the performance of fused lasso, as done in the previous section, we calculated same-enzyme and cross-enzyme Pearson correlations between Hi-C contact matrices generated from different replicates. Pearson correlation coefficients were calculated either for iteratively-corrected (ICE) or scaled Hi-C contact matrices and compared to the naïve filtering approach. The results are summarized in Figure 2B. Increasing λ improves correlation independent of resolution, restriction enzyme and bias-correction method, demonstrating the robustness of our approach. Similarly, fused two-dimensional lasso improves the reproducibility of distance-normalized matrices, as demonstrated in Figure 3.
Fused lasso reveals a TAD hierarchy linked to TAD boundary strength
After demonstrating that parameter λ helps improve reproducibility of Hi-C contact matrices independent of the bias-correction method, we further hypothesized that increased values of λ may define distinct classes of TADs with different properties. For this reason, we now allowed λ to range from 0 to the maximum possible value (after a finite value of λ, the entire Hi-C matrix attains a constant value independent of the value of λ). For efficient computation, we used a one-dimensional approximation of the two-dimensional lasso solution (see Methods for details and Supplemental Figure 1). We then identified TADs at multiple λ values using HiC-bench, and we observed that the number of TADs is monotonically decreasing with the value of λ (Figure 4A), suggesting that by increasing λ, we are effectively identifying larger TADs encompassing smaller TADs detected at smaller λ values. Equivalently, certain TAD boundaries “disappear” as λ is increased. Therefore, we hypothesized that TAD boundaries that disappear at lower values of λ are weaker (i.e. lower insulation score) whereas boundaries that disappear at higher values of λ are stronger (i.e. higher insulation score). To test this hypothesis, we identified the TAD boundaries that are “lost” at each value of λ, and generated the distributions of the insulation scores as defined by the ratio score described in HiC-bench. Indeed, as hypothesized, TAD boundaries lost at higher values of parameter λ are associated with higher TAD insulation scores (Figure 4B). We then stratified TAD boundaries into six classes according to their strength, independently in each Hi-C dataset used in this study and generated a heatmap representation including all TAD boundaries and their associated class across all samples (Figure 4C,D). 
Hierarchical clustering correctly grouped replicates and related cell types independent of enzyme biases or batch effects related to the lab that generated the Hi-C libraries, suggesting that TAD boundary strength can be used to distinguish cell types. Equivalently, this finding suggests that, although TAD boundaries have been shown to be largely invariant across cell types, a certain subset of TAD boundaries may exhibit varying degrees of strength in different cell types. As expected, TAD boundary strength was found to be positively associated with CTCF levels, suggesting that stronger CTCF binding confers stronger insulation (Figure 4E). SINE elements have also been shown to be enriched at TAD boundaries [10]. Apart from confirming this finding, we extended it and demonstrated that Alu elements (the most abundant type of SINE elements) are enriched at stronger TAD boundaries, whereas, interestingly, L1 elements (a subset of LINE elements) are enriched at weaker TAD boundaries (Figure 4F). A comprehensive analysis of all major repetitive element subtypes can be found in Supplemental Figure 4. Finally, we investigated the proximity of super-enhancers to TAD boundaries of different strength. Intriguingly, we found that super-enhancers are preferentially insulated by strong TAD boundaries (Figure 4G). Super-enhancers are thought to be cell-specific and to drive expression of key genes. Thus, a potential explanation of our finding is that super-enhancers should only target genes confined in the same TAD, while remaining strongly insulated from genes in adjacent TADs. Genetic or epigenetic inactivation of strong boundaries may lead to loss of insulation around super-enhancers, disrupt the physiological transcriptional program and cause disease.
DISCUSSION
Multiple recent studies have revealed that the metazoan genome is compartmentalized in boundary-demarcated functional units known as topologically associating domains (TADs). TADs are highly conserved across species and cell types. A few studies, however, provide compelling evidence that specific TADs, despite being largely invariant, exhibit some plasticity. Given that TAD boundary disruption has been recently linked to aberrant gene activation and multiple disorders including developmental defects and cancer, categorization of boundaries based on their strength and identification of their unique features becomes of particular importance. In this study, we developed a method based on fused two-dimensional lasso in order to categorize TAD boundaries based on their strength. We demonstrated that our method: (a) improves the correlation of Hi-C contact matrices irrespective of the Hi-C bias correction method used, (b) reveals multiple levels of chromatin organization and (c) successfully identifies boundaries of variable strength, with strong predicted boundaries exhibiting certain expected features, such as elevated CTCF levels and increased insulating capacity. We also demonstrated that boundaries of similar strength are largely conserved across the samples included in this study; however, a subset of TAD boundaries displays varying levels of insulation strength across samples. By performing an integrative analysis of estimated boundary strength with super-enhancers in matched samples, we observed that super-enhancers are preferentially insulated by strong boundaries. Based on this observation, we believe that strong boundaries prevent the aberrant activation of genes residing in adjacent TADs by constituting a physical barrier between the gene promoters and the super-enhancer elements.
We predict that, although weak boundaries would be more prone to disruption, in many cancers strong boundaries are actually disrupted either genetically or epigenetically, leading to aberrant activation of oncogenes by enhancers as recently demonstrated [33–36]. In future work, we will further characterize boundaries of variable strength, reveal their features and help with the identification of targets for pharmacological intervention, in order to restore disrupted boundaries.
AUTHOR CONTRIBUTIONS
YG and CL performed computational analyses and generated figures. AT, AL and PK conceived this study. PN performed the CUTLL-1 Hi-C experiments. PN and IA offered biological insights and helped with the interpretation of Hi-C data. AT designed and implemented the method. CL and AT wrote the manuscript. All authors read and approved the final manuscript.
FUNDING
The study was supported by the American Cancer Society [RSG-15-189-01-RMC to AT] and a Leukemia & Lymphoma Society New Idea Award [8007-17 to AT]. NYU Genome Technology Center (GTC) is a shared resource, partially supported by the Cancer Center Support Grant [P30CA016087] at the Laura and Isaac Perlmutter Cancer Center.
ACKNOWLEDGEMENTS
We would like to thank all members of the Tsirigos and Aifantis Laboratories for critical evaluation of the manuscript. We would like to thank the Applied Bioinformatics Laboratories (ABL) at the NYU School of Medicine for providing bioinformatics support and helping with the analysis and interpretation of the data. This work has used computing resources at the NYU High Performance Computing Facility (HPCF). We also thank the Genome Technology Center (GTC) for expert library preparation and sequencing. This shared resource is partially supported by the Cancer Center Support Grant, P30CA016087, at the Laura and Isaac Perlmutter Cancer Center.