Genome-wide identification and characterisation of HOT regions in the human genome

Li, Hao; Liu, Feng; Ren, Chao; Bo, Xiaochen; Shu, Wenjie

doi:10.1186/s12864-016-3077-4

Research article
Open access
Published: 15 September 2016

Genome-wide identification and characterisation of HOT regions in the human genome

Hao Li¹,
Feng Liu¹,
Chao Ren¹,
Xiaochen Bo¹ &
…
Wenjie Shu¹

BMC Genomics volume 17, Article number: 733 (2016) Cite this article

2474 Accesses
6 Citations
6 Altmetric
Metrics details

Abstract

Background

HOT (high-occupancy target) regions, which are bound by a surprisingly large number of transcription factors, are considered to be among the most intriguing findings of recent years. An improved understanding of the roles that HOT regions play in biology would be afforded by knowing the constellation of factors that constitute these domains and by identifying HOT regions across the spectrum of human cell types.

Results

We characterised and validated HOT regions in embryonic stem cells (ESCs) and produced a catalogue of HOT regions in a broad range of human cell types. We found that HOT regions are associated with genes that control and define the developmental processes of the respective cell and tissue types. We also showed evidence of the developmental persistence of HOT regions at primitive enhancers and demonstrate unique signatures of HOT regions that distinguish them from typical enhancers and super-enhancers. Finally, we performed a dynamic analysis to reveal the dynamical regulation of HOT regions upon H1 differentiation.

Conclusions

Taken together, our results provide a resource for the functional exploration of HOT regions and extend our understanding of the key roles of HOT regions in development and differentiation.

Background

Recent studies in Caenorhabditis elegans [1, 2], Drosophila melanogaster [3–7], and humans [8–10] have identified a class of mysterious genomic regions that are bound by a surprisingly large number of transcription factors (TFs) that are often functionally unrelated and lack their consensus binding motifs. These regions are called HOT (high-occupancy target) regions or “hotspots”. In C. elegans, 22 different TFs were used to identify 304 HOT regions bound to 15 or more TFs [1]. Using the binding profiles of 41 different TFs, nearly 2000 HOT regions were identified in D. melanogaster, and each is bound by an average of 10 different TFs [5]. Many regions that are bound by dozens of TFs were also identified in a small number of human cells [9, 10]. The broad presence of these regions in metazoan genomes suggests that they might reflect a general property of regulatory genomes. However, how hundreds of TFs coordinate clustered binding to regulatory DNA to form HOT regions across cell types and tissues is still unclear. Furthermore, the function of HOT regions in gene regulation remains unclear [11, 12], and their proposed roles include functioning as mediators of ubiquitously expressed genes [1], sinks or buffers for sequestering excess TFs [4], insulators [5], DNA origins of replication [5], and patterned developmental enhancers [7]. In addition, the effects of these regions on human diseases and cancer remain unknown. Thus, it is important to systematically analyse HOT regions in a large variety of cell types and tissues and to further understand their functional roles in the control of specific gene expression programs.

Resolving these challenges requires knowledge of the ensemble of all TF bindings in a cell. However, even predicting where a single TF binds in the genome has proven challenging. Computational motif discovery in regulatory DNA is a commonly used strategy for identifying candidate TF binding sites (TFBSs) for TFs with known binding motifs, which are represented as position weight matrices (PWMs). Numerous algorithms have been developed for discovering motifs, such as FIMO (Find Individual Motif Occurrences) [13] and HOMER (Hypergeometric Optimization of Motif EnRichment) [14]. It has been reported that TFBSs tend to be DNase I hypersensitive, and only a fraction of the human genome is accessible for TF binding [15]. Remarkably, HOT regions correlate with decreased nucleosome density and increased nucleosome turnover and are primarily associated with open chromatin [1, 5, 6]. DNase I hypersensitive sites (DHSs) in chromatin have been used extensively to mark regulatory DNA and to map active cis-regulatory elements in diverse organisms [16–18]. Recent advances in Next-Generation Sequencing (NGS) technologies have enabled genome-wide mapping of DHSs in mammalian cells [19–21], laying the foundations for comprehensive catalogues of human regulatory DNA regions. Thus, DHSs, combined with motif discovery algorithms, could be used in a very powerful approach for identifying a large repertoire of TFs in diverse cell and tissue types with high precision. This approach is likely to be widely applicable for investigating cooperativity among TFs that control diverse biological processes.

Here, we have developed a computational method for the genome-wide mapping of HOT regions in human genome. We have characterised and validated HOT regions in embryonic stem cells (ESCs). Additionally, we have created a catalogue of HOT regions for 154 different human cell and tissue types and have shown that these regions are associated with genes encoding cell-type-specific TFs and other components that play important roles in cell-type-specific developmental processes. We have shown evidence for the developmental persistence of HOT regions at primitive enhancers and have demonstrated unique signatures of HOT regions that distinguish them from typical enhancers and super-enhancers. Importantly, our epigenetic analysis revealed a preliminary view of dynamic epigenetic regulation of HOT regions upon cell differentiation.

Results

Identification and validation of HOT regions

Recently, we used Gaussian kernel density estimation across the binding profiles of 542 TFs to identify TFBS-clustered regions and defined a “TFBS complexity” score based on the number and proximity of contributing TFBSs for each TFBS-clustered region [22]. A preliminary inspection of these regions with different TFBS complexity revealed an unusual feature: Although the vast majority (~90 %) of the TFBS-clustered regions exhibited only low TFBS complexity, a small portion of the TFBS-clustered regions exhibited much higher TFBS complexity scores (Additional file 1: Fig. S1A and Additional file 2: Table S1). The former were called LOT (low-occupancy target) regions, whereas the latter were called HOT (high-occupancy target) regions.

HOT regions were initially defined as regions with high occupancy of TFs and were identified by the binding peaks of many TFs using ChIP-seq data in previous reports [1–10]. However, we defined HOT regions by the colocalisation of a large number of TF motif binding sites and identified them using the TF motif scanning method iFORM [23] on DHSs with PWMs for 542 TFs. To validate our identified HOT regions, we compared them with the experimental HOT regions that were defined by the ENCODE Consortium, which assessed more than 100 TFs from approximately 500 ChIP-seq experiments in more than 70 cell types [24, 25]. We performed a GSC (genome structure correction) analysis between our HOT regions and the experimental HOT regions in five cell types, including H1-hESC, K562, Hep-G2, HeLa-S3, and GM12878 cells (Fig. 1a and Additional file 1: Fig. S1B). The GSC results indicated that our HOT regions were significantly enriched and depleted compared to the experimental HOT and LOT regions, respectively. In addition, our LOT regions were significantly enriched and depleted compared to the experimental LOT and HOT regions, respectively. Furthermore, we used the experimental HOT and LOT regions to validate our HOT regions with ROC (receiver operating characteristic) curves and the corresponding area under the curve (AUC) in each cell type. Our results demonstrated that our HOT regions have powerful discrimination ability between the experimental HOT and LOT regions (Additional file 1: Fig. S1C).

To further verify whether TFs indeed bound within the HOT regions, we counted the occurrence rates of peaks in the ChIP-seq data that corresponded to diverse TFs that were located within our HOT regions and the experimental HOT regions. We found that the number of TFs that colocalised within our HOT regions (median = 9 and mean = 8.18 in H1 cells) was much greater than the number of TFs that colocalised within the experimental HOT regions (median = 2 and mean = 3.14 in H1 cells) (Fig. 1b and Additional file 1: Fig. S1D). Our results suggest that our HOT regions are strongly skewed relative to the experimental HOT regions toward occupancy by a large number of transcription factors identified via ChIP-seq experiments by the ENCODE Consortium. Additionally, with the increase in the TFBS complexity of our HOT regions, the number of TFs that colocalised within our HOT regions also increased (Fig. 1c and Additional file 1: Fig. S1E).

Previous studies have revealed that some ChIP-seq binding peaks of TFs do not contain the DNA sequence motifs of the corresponding TFs; these peaks are designated motifless binding peaks of the TFs [24, 25]. We explored the relationship between the motifless binding peaks of all TFs and our identified HOT regions. We identified 62,764, 87,582, 129,795, 47,384, and 92,592 motifless binding peaks in H1-hESC, K562, Hep-G2, HeLa-S3, and GM12878 cells, respectively. We compared these motifless binding peaks with the HOT regions that we identified within TF ChIP-seq binding peaks for each cell line. We determined that the proportion of the motifless binding peaks intersecting with the experimental HOT regions (average 25 %) was larger than that of the motifless binding peaks intersecting with our HOT regions (average 17 %) (Additional file 1: Fig. S1F). However, the proportion of motifless HOT regions in our HOT regions was much larger than that of motifless HOT regions in the experimental HOT regions (36 % vs 20 %, on average) (Additional file 1: Fig. S1G). This result reflects the much smaller number and longer length of our HOT regions, Furthermore, GSC analysis demonstrated that the statistical z-scores of the intersections of the motifless binding peaks with our HOT regions and the experimental HOT regions were greater than 57 (corresponding to a p-value < 2.2 × 10⁻³⁰⁸) and 220 (corresponding to a p-value < 2.2 × 10⁻³⁰⁸), respectively (Fig. 1d and Additional file 2: Table S2). Our results indicate that although our HOT regions are essentially different from the experimental HOT regions, motifless HOT regions represent a substantial fraction of both our HOT regions and the experimental HOT regions.

Taken together, these findings strongly validate our identified HOT regions. Additionally, the TFBS complexity adequately represented the colocalisation of multiple TFs within our HOT regions. Furthermore, our results clearly clarified the discrimination between our HOT regions and experimental HOT regions that have been reported in the literature [24, 25].

General characterisation of HOT regions

Using a uniform processing pipeline, we created a catalogue of 59,986 distinct HOT regions across 154 cells/tissues studied under the ENCODE Project [24]. Collectively, these HOT regions span 18.8 % of the genome (Additional file 2: Table S3). To assess the rate of discovery of new HOT regions, we performed a saturation analysis as described in a previous study [24] and predicted saturation at approximately 107,184 HOT regions, suggesting that we have discovered more than half (59,986 out of 107,184, 56.0 %) of the estimated total number of HOT regions (Additional file 1: Fig. S2A). An additional location analysis of these 59,986 HOT regions demonstrated that HOT regions were more likely localised to genic regions (intron and exon) and less likely localised to intergenic regions compared with LOT regions (Additional file 1: Fig. S2B). Furthermore, HOT regions are typically much more cell-selective than LOT regions (Fig. 1e, 1^st column). Promoter proximal HOT regions typically exhibit high accessibility across cell types, with the average proximal HOT region detected in 21 cell types; however, distal HOT regions are largely cell selective, with the average distal HOT region detected in 7 cell types (Fig. 1e, 2^nd and 3^rd columns).

We further characterised these HOT regions using multiple data types and showed that they are enriched for active histone markers and depleted for repressive histone markers (Additional file 1: Fig. S3A), they are highly transcribed (Additional file 1: Fig. S3B), they extensively overlap with the transcriptional regulators that control cell development and differentiation (Additional file 1: Fig. S3C, S3F), they exhibit distinct sequence signatures (Additional file 1: Fig. S3D), and their neighbouring genes illustrate functional enrichment linked to the developmental processes of the respective cell and tissue types (Additional file 1: Fig. S3E, Additional file 2: Tables S4-S5.

Associations with functional regulatory elements

To gain understanding of the functional roles of HOT regions, it would be valuable to explore their associations with previously validated regulatory elements. First, we explored the extent to which HOT regions associate with microRNAs, which comprise a major class of regulatory molecules and have been extensively studied, resulting in the consensus annotation of hundreds of conserved microRNA genes [26]. Of 2633 annotated microRNA transcriptional start sites (TSSs), 1667 (63 %) coincide with a HOT region. The accessibility of HOT regions at microRNA promoters was highly promiscuous compared with GENCODE TSSs (Fig. 1e, 4^th column) and showed cell lineage organisation, paralleling the known regulatory roles of well-annotated lineage-specific microRNAs (Fig. 2a). Next, we investigated the association between HOT regions and transposon sequences. A surprising number of these sequences contain highly regulated HOT regions (Fig. 1e, 5^th column, and Additional file 2: Table S6), which is compatible with the cell type-specific transcription of repetitive elements detected using ENCODE RNA sequencing data [27]. The examples shown in Fig. 2b also illustrate the strong cell-selectivity of chromatin accessibility observed for each major class of repeats. Furthermore, we compared HOT regions with an extensive compilation of 373 experimentally validated distal, non-promoter cis-regulatory elements, such as insulators, locus control regions (LCRs), transcription initiation platforms (TIPs), and more. This analysis revealed that the overwhelming majority (76 %) of these elements are encompassed within HOT regions (Additional file 2: Table S7) and typically show strong cell selectivity (Fig. 1e, 6^th column, Fig. 2c and Additional file 1: Fig. S4A). Finally, we explored the extent to which HOT regions associate with different classes of DNA methylation depleted regions, including low methylation regions (LMRs), unmethylated regions (UMRs), and DNA methylation valleys (DMVs). These DNA methylation-depleted regions have been reported to function as cis regulatory elements that are strongly associated with transcription factor genes and developmental genes [28, 29]. Our GSC analysis demonstrated that LMRs, UMRs and DMVs were highly enriched within HOT regions (Additional file 1: Fig. S4B–D) and typically showed strong cell selectivity (Fig. 1e, 7–9th column and Fig. 2d). Together, our results suggested that HOT regions are highly associated with the functional regulatory elements that play key developmental roles in a manner that is typically cell-type-specific.

HOT regions at embryonic enhancers

As HOT regions drive genes that control and define cell development, it is reasonable to surmise that in definitive cells, HOT regions could be persistently associated with enhancers that are active during early development. We compiled 882 early developmental enhancers that were identified through a comparative genome analysis and experimental validation of in vivo enhancer activity in transgenic mice [30]. Each of these enhancers displayed reproducible tissue-staining patterns in one or more embryonic tissues at embryonic day 11.5 (Fig. 3a). Of these 882 non-promoter human enhancers, GSC analysis demonstrated that a surprising proportion (308/882, 35 %, z-score = 14.9, corresponds to a p-value < 6.5 × 10⁻⁴⁹) occurred within HOT regions in at least one definitive human cell type. To quantify the tissue activity spectra of these embryonic enhancers, we systematically examined their lacZ expression patterns in transgenic mice and related these patterns to HOT region patterning at the same elements across different definitive cell types (Fig. 3b). For example, an enhancer that is selectively active in embryonic forebrain tissue (Fig. 3a, 1^st column) was selectively found in HOT regions within cells derived from human ESCs (Fig. 3b, 1^st column), and an enhancer that is selectively active in embryonic blood vessels (Fig. 3a, 2^nd column) was selectively found in HOT regions within endothelial cells (Fig. 3b, 2^nd column). In contrast, an enhancer with extremely broad tissue activity (Fig. 3a, 7^th column) was found in HOT regions in nearly all definitive cell types (Fig. 3b, 7^th column). These findings were further confirmed across the spectrum of enhancers (Fig. 3c-d). A total of 62.5 % of enhancers that are active in embryonic blood vessels were found in HOT regions in endothelial cells, whereas only 26.3 % of all other embryonic enhancers were found in endothelial HOT regions. Similarly, 59.3 % of enhancers that are active in embryonic heart tissue were found in HOT regions within cells derived from human heart and great vessel structures, whereas only 30.7 % of all other embryonic enhancers were found in HOT regions in these cell types.

Distal HOT regions in super-enhancers

Recently, Richard A. Young and his colleagues identified an unusual class of enhancer domains, called super-enhancers, that drive the high-level expression of genes that control and define cell identity and disease [31–33]. To elucidate the relationship between HOT regions and super-enhancers, we assessed the HOT regions and super-enhancers from the same 14 cell types. We performed a GSC analysis between distal HOT regions and super-enhancers in these cell types and found that HOT regions were significantly enriched with super-enhancers (Fig. 4a). To determine whether HOT regions might cooperate with super-enhancers to regulate cell type-specific gene regulation, we performed a colocalisation analysis of these two types of regions in 14-by-14 cell line combinations, as previously described [34, 35] (Fig. 4b). The diagonally matched cell line enrichment values (>1.00 for all comparisons) were much larger than the off-diagonal mismatched cell line values (<1.00 for all comparisons), indicating that cell type-specific HOT regions tended to strongly colocalise with super-enhancers that function in the corresponding cell types. Furthermore, we compared the densities of chromatin markers, TFs, and RNA polymerase II between HOT regions, enhancers, and super-enhancers. All of these elements exhibited similar DNase I hypersensitivity. As expected, enhancer markers, such as H3K27ac, H3K4me1, and P300, were significantly enriched within enhancers and super-enhancers compared to HOT regions. In addition, RNA Pol II was significantly enriched within enhancers and super-enhancers compared to HOT regions. Notably, HOT regions demonstrated simultaneous significant enrichment of the bivalent markers H3K4me3 and H3K27me3, whereas enhancers and super-enhancers showed both enrichment of H3K4me3 and depletion of H3K27me3 compared to the background genome (Fig. 4c). A much higher proportion of HOT regions (28 %) were marked with both H3K4me3 and H3K27me3, whereas only 2 % of super-enhancers were marked with both H3K4me3 and H3K27me3. Finally, we characterised super-enhancer-associated and HOT region-associated genes by gene ontology (GO) analysis. Our results revealed that super-enhancer-associated genes are linked to biological processes that largely define the identities of the respective cell and tissue types, which is highly consistent with the results of a previous study [31]. However, HOT region-associated genes are linked to biological processes that largely define the development and differentiation of the respective cell and tissue types (Additional file 1: Figs. S3E and S5).

Dynamics of HOT regions during H1 differentiation

To preliminarily explore the dynamic changes in HOT regions upon development and differentiation, we examined the HOT regions during the differentiation of the hESC line H1 to mesendoderm (ME), neural progenitor cells (NPC), trophoblast-like cells (TBL), and mesenchymal stem cells (MSC). We first identified 5410, 5602, 4250, 3408, and 6719 HOT regions in H1, ME, NPC, TBL, and MSC, respectively, as well as 4002 HOT regions in a control for terminally differentiated cells (IMR90).

We first examined the dynamic changes in HOT regions and correlated these dynamic HOT regions with their associated gene expression. During H1 differentiation, we defined “gained” HOT regions as those belonging to H1-derived cell types but not occurring in H1 cells (Fig. 5a). We identified 3264, 1784, 2133 and 4452 “gained” HOT regions in ME, NPC, TBL, and MSC, respectively. More than half of the HOT regions were “gained” in H1-derived cells upon H1 differentiation, except in NPC (Fig. 5a). We observed that the genes associated with “gained” HOT regions in H1-derived cells were highly expressed compared to those of H1 cells upon H1 differentiation (Fig. 5b, p-value < 1.38 × 10⁻⁵, Kolmogorov-Smirnov test). Furthermore, the genes associated with “gained” HOT regions played key roles in development-related biological processes in the respective H1-derived cells in a cell-type-specific manner (Fig. 5a and Additional file 2: Table S8).

Then, we scanned TF motifs in HOT regions of H1 and its derived cell types using iFORM [23] and identified 106, 153, 178, 75 and 35 enriched TF genes in ME, NPC, TBL, MSC and IMR90 cells, respectively (Additional file 1: Materials and Methods, Additional file 2: Table S9). These enriched TF genes were highly expressed in H1-derived cell types relative to H1 cells (p-value < 0.01, Kolmogorov-Smirnov test), respectively (Fig. 5d). Further GO analysis demonstrated that these enriched TF genes play key roles in the regulation of transcription, gene expression, and metabolic process during the differentiation of H1 into its derived cell types (Additional file 2: Table S9). Taken together, our results improve our understanding of HOT region dynamics during development and differentiation.

Epigenetic regulation of HOT regions upon H1 differentiation

To further investigate the epigenetic regulatory mechanism of HOT regions upon development and differentiation, we examined the potential role of histone modifications at HOT regions upon H1 differentiation.

Previously, bivalent genes marked by both H3K4me3 and H3K27me3 were shown to be highly associated with developmental genes [36]. Intriguingly, an analysis of the data from previous studies [37, 38] showed that the genes associated with HOT regions during H1 differentiation appeared to be highly enriched in bivalent genes (p-value < 1.77 × 10⁻⁹, hypergeometric test, Additional file 2: Table S10). We then asked whether HOT regions undergo dynamic epigenetic regulation upon development and differentiation. We examined the dynamic epigenetic modifications at these HOT regions upon H1 differentiation and found that over 50 % of HOT regions in H1 cells were bound simultaneously by H3K4me3 and H3K27me3. We named these regions bivalent HOT regions. Interestingly, the number of bivalent HOT regions decreased upon H1 differentiation. In differentiated cells, a large portion of HOT regions were bound only by H3K4me3 or H3K27me3 relative to H1 cells (Fig. 6a). This is in good agreement with the opinion that bivalent genes become monovalent upon cell differentiation [36]. Furthermore, over three-quarters of bivalent HOT regions in H1 cells were “lost”, whereas less than one-quarter of bivalent HOT regions in H1 cells were “shared” during the differentiation of H1 into its derived cell types (Fig. 6b). Moreover, a considerable proportion (ranging from 10 to 38 %) of “lost” bivalent regions were differentiated into “activated” HOT regions bound by H3K4me3 only, whereas only a very small proportion of these regions differentiated into HOT regions bound only by H3K27me3 alone (Fig. 6b).

We further explored the dynamic epigenetic signals upon H1 differentiation. We found that in H1-derived cells, the “activated” HOT regions were marked by a higher level of H3k4me3 and a lower level of H3K27me3 compared to the H1 bivalent HOT regions. Moreover, the “repressed” HOT regions in H1-derived cells were marked by a lower level of H3k4me3 and a higher level of H3K27me3 compared to the H1 bivalent HOT regions. The levels of H3K4me3 and H3K27me3 in the activated HOT regions of H1-derived cells increased and decreased during differentiation, respectively. A gene expression analysis of the “activated” and “repressed” HOT regions further confirmed these findings (Fig. 6c). Furthermore, we performed GO analysis of the bivalent HOT regions in H1 and the activated HOT regions in H1-derived cells and found that the genes associated with these regions were strongly enriched for the functional categories “regulation of transcription”, “metabolic process” and “differentiation” (Additional file 2: Table S11). Taken together, our findings reveal a preliminary view of the dynamic epigenetic regulation of HOT regions, which were strongly associated with developmental genes and had key roles upon cell differentiation.

Discussion and conclusions

Previous studies have revealed regions in worms [1, 2], flies [3–7], and humans [8–10] with heavily clustered TF binding that have been termed HOT regions. These reports [1–10] identified HOT regions by the binding peaks of many TFs using ChIP-seq data, whereas we defined HOT regions by a large number of TF motif binding sites on DHSs in DNase-seq data. This change was made because previous investigations of experimental HOT regions were restricted by the currently limited amount of TF ChIP-seq data and by the consistency of HOT regions defined by identical combinations of TFs across diverse cells/tissues. Although the identifications of HOT regions were based on different data, both definitions demonstrate that HOT regions are a novel class of genomic regions that are bound by a surprisingly large number of TFs and contain numerous TF motifs. Importantly, our identification of HOT regions using TF motif discovery on DHSs can greatly extend the repertoire of both TFs and cell types in the genome, thus greatly enhancing our understanding of HOT regions.

We have extended our understanding of HOT regions by demonstrating that ESC HOT regions are highly transcribed and by identifying the population of TFs, cofactors, chromatin regulators, and core transcription machinery that occupy these domains in ESCs. ESCs were chosen for identifying components of HOT regions because the TFs, cofactors, chromatin regulators, and noncoding RNAs that control the ESC state and that contribute to the gene expression programmes of pluripotency and self-renewal are likely better understood than those for any other cell type [39–41]. HOT regions are occupied by a large portion of enhancer-associated RNA polymerase II and its associated cofactors and chromatin regulators, which may explain how these molecules contribute to the high transcriptional levels of genes associated with HOT regions. Furthermore, the levels of RNA detected in HOT regions vastly exceed the levels of RNA in LOT regions, and recent evidence suggests that these enhancer RNAs (eRNAs) may contribute to gene activation [42–49]. Several additional important insights were gained by studying how more than 40 TFs, cofactors, chromatin regulators, and components of the core transcription machinery occupy HOT regions and LOT regions in ESCs. All of the enhancer-binding TFs are enriched in HOT regions, with some so highly enriched that they distinguish HOT regions from LOT regions.

By uncovering characteristic sequence signatures of HOT regions, our computational analysis revealed that more than one quarter of enriched TF motifs exhibited significantly enriched binding within HOT regions; the majority of these TF motifs play essential roles in development and cell differentiation. Strikingly, 12 of 34 TFs (p-value = 0.0012, binomial test) that showed specifically enriched binding within LOT regions were housekeeping TFs. In combination with previous observations that HOT regions are depleted in the bound TFs’ motifs [1, 3–5] compared with regions bound by single TFs, our findings suggest that HOT regions have distinct sequence features that distinguish them from LOT regions and the genome background. Moreover, these findings suggest that information regarding HOT regions is encoded in the DNA sequence.

We have generated a catalogue of HOT regions and their associated genes in a broad spectrum of human cell and tissue types. HOT regions tend to be cell type-specific, and the genes associated with these elements are linked to biological processes that largely define the development and differentiation of the respective cell and tissue types. Genes that encode candidate key developmental TFs and noncoding RNAs, such as microRNAs, are among those associated with HOT regions. Thus, the HOT region catalogue should be a valuable resource for further studies of the transcriptional control of cell development and differentiation [50–53]. Based on the catalogue of HOT regions, our further study has explored the association of GWAS SNPs and HOT regions, and our findings have illustrated the key roles of HOT regions in human disease and cancer [54].

An association analysis between HOT regions and embryonic enhancers presented direct evidence of the systematic developmental persistence of tissue-selective early developmental enhancers at HOT regions and of the persistent imprint of enhancer roles on the formation of cross-cell-type patterning of HOT regions in definitive cells. Additionally, we found that super-enhancers were highly enriched in HOT regions across diverse cell types, and cell type-specific super-enhancers tend to strongly colocalise with the HOT regions that function in the corresponding cell types. Furthermore, all enhancer markers, including DNaseI, H3K27ac, H3K4me1, enhancer-binding TFs and chromatin regulators, are enriched at HOT regions but have lower levels of enrichment that distinguish them from super-enhancers. Strikingly, we observed the paradoxical coexistence of permissive and repressive histone marks, H3K4me3 and H3K27me3, in HOT regions. Although GO analysis revealed that super-enhancers drive the expression of genes that define the identity of the respective cell and tissue types, HOT regions are associated with biological processes that largely define the development and differentiation of the respective cell and tissue types. Together, our results suggest that HOT regions might therefore represent a novel class of enhancers because they contain many discriminatory features that are different from enhancers or super-enhancers. The activities of HOT regions and super-enhancers are both defined by the colocalisation of TFs in these regions but on different genomic scales of colocalisation. A recent study [55] described the relationship between hotspots and super-enhancers in the early phase of adipogenesis, demonstrating that hotspots are highly enriched in large super-enhancers and revealing that hotspots and super-enhancers function as two levels of regulatory hubs that serve to integrate external stimuli through cooperativity between TFs on chromatin. These findings are highly consistent with ours.

Finally, we examined the dynamics changes in HOT regions and correlated their associated genes and enriched TF genes during the differentiation of H1 into its derived cell types. We observed that these associated genes and enriched TF genes were highly expressed in H1-derived cell types relative to H1 cells and play key roles during development or differentiation. We further examined the dynamic epigenetic regulation at HOT regions during H1 differentiation. We found that a large proportion of HOT regions in H1 cells demonstrated a bivalent state, and the portion of the bivalent HOT regions decreased during the differentiation of the hESC line H1 into ME cells, TBL cells, NPCs, and MSCs. Many of these bivalent HOT regions were differentiated into activated regions. Additionally, we demonstrated that the activated regions showed an inverse relationship with the levels of H3k4me3 and H3k27me3. Our results present a preliminary view of the dynamic regulation of HOT regions upon cell differentiation and will improve our understanding of HOT region dynamics during development and differentiation.

Taken together, our findings provide a resource for the functional exploration of HOT regions and extend our understanding of the key roles of HOT regions during development and differentiation.

Methods

Data sets

The DNaseI Hypersensitivity by Digital DNaseI data were obtained from the Duke and UW ENCODE groups. Histone modifications according to ChIP-seq data were downloaded from the Broad histone ENCODE group. TFs according to ChIP-seq data were obtained from the HAIB and SYDH TFBS ENCODE groups. DNase-seq and ChIP-seq data in both peak file and bam file formats were used in this study. Gene annotations were obtained from the GENCODE data (V15) [56]. All these data were collected from the ENCODE Project [24], and the use of these data strictly adhered to the ENCODE Consortium Data Release Policy. The data used for epigenomic analysis of HOT regions in H1 cells and four H1-derived lineages were obtained from a recent study [29].

Identifying TFBS-clustered regions and HOT regions

Position weight matrices (PWMs) of 542 TFs, which corresponded to 796 motif models, were collected from the TRANSFAC [57], JASPAR [58], and UniPROBE [59] databases. The genomic sequence under DHSs from the hg19 genome was used as the input for iFORM [23] using a custom library of all 796 motifs scanned for motif instances at a p-value threshold of 10⁻¹⁸ (corresponding to the FIMO threshold of 10⁻⁵). For each TF, instances of multiple TF motifs were combined to generate the corresponding TFBSs.

An established method [5] was used to perform Gaussian kernel density estimations across the genome (bandwidth 3 kb, centred on each TFBS). Each peak of the density profile was denoted as a TFBS-clustered region. To determine the complexity of each TFBS-clustered region, the Gaussian kernalised distance from a peak to each TFBS that contributed at least 0.1 to the strength was determined. The window around each TFBS-clustered region was derived by finding the maximum distance (in bp) from the TFBS-clustered region to a contributing TF and adding 1.5 kb (one half of the bandwidth). Each window was centred on a TFBS-clustered region.

To identify HOT regions, we first ranked all the TFBS-clustered regions in a cell type by increasing and plotting the TFBS complexity (Fig. 1a). This plot revealed a clear point in the distribution of the TFBS-clustered regions at which the complexity signal began to increase rapidly. To geometrically define this point, we first scaled the data such that the x and y axes were from 0 to 1. We then found the x axis point for which a line with a slope of 1 was tangent to the curve. We defined the TFBS-clustered regions above this point to be HOT regions and the TFBS-clustered regions below that point to be LOT regions. The pipeline for identifying HOT or LOT regions was applied uniformly to datasets from 349 samples, including 154 cell types studied under the ENCODE Project [24]. The classification of the TFBS-clustered regions as a HOT or LOT region in each cell type for diverse human cells and tissues can be found in Additional file 2: Table S2.

Validation of HOT regions with ChIP-seq

We downloaded publicly available HOT regions defined based on the ChIP-seq data from the ENCODE Consortium obtained in five cell types, including K562, Hep-G2, HeLa-S3, H1-hESC, and GM12878 cells [24, 25]. First, we used GSC (genome structure correction) analysis to assess the performance of predicting HOT regions. The GSC statistic [60, 61] was used to calculate the confidence intervals (CIs) for the experimental HOT regions that were expected to contain our HOT regions by chance. This statistic provides a conservative correction to standard tests. Additionally, we applied the ROC (receiver operating characteristic) curves and the corresponding area under the curve (AUC) to validate our HOT regions with the experimental HOT and LOT regions in the five cell types as the “gold-standard” data sets.

To further verify whether TFs indeed bound within the identified HOT regions, we collected uniform ChIP-seq peaks corresponding to multiple TFs from the ENCODE Project in the five cell types. We counted the occurrence rates of ChIP-seq peaks of diverse TFs that were located within our HOT regions and experimental HOT regions. Additionally, we explored the correlations between the TFBS complexity of our HOT regions and the number of TF peaks that were located within our HOT regions.

Identifying motifless binding peaks

For each TF, we collected the ChIP-seq binding peaks in each cell line and scanned the TF motifs of each binding peak using iFORM [23]. We identified the binding peaks that do not have DNA sequence motifs of the corresponding TFs using the method presented in a previous study [25]. Then, we collected the peaks of all TFs to construct the set of motifless binding peaks for each cell line.

We then associated the motifless binding peaks with our HOT regions and experimental HOT regions, respectively. Because our HOT regions and experimental HOT regions were identified in the whole human genome but the motifless binding peaks were defined from ChIP-seq binding peaks, we first identified the subsets of our HOT regions and experimental HOT regions located within these ChIP-seq binding peaks for each cell line. We then determined the association of the motifless binding regions with the subsets of HOT regions and used GSC analysis to measure the statistical significance of these associations.

Master list and annotation for HOT regions

The HOT regions from 154 cell types were consolidated into a master list of 59,986 unique, non-overlapping HOT region positions by first merging these regions across cell types. Then, for each resulting interval of merged regions, the HOT region with the highest TFBS complexity was selected for the master list. Any HOT regions overlapping the regions selected for the master list were then discarded. The remaining HOT regions were merged, and the process was repeated until each original TFBS-clustered region was either incorporated into the master list or discarded.

Genomic annotations from GENCODE annotations (V15) [56], i.e., Basic, Comprehensive, PseudoGenes, 2-way PseudoGenes, and PolyA Transcripts, were used. The promoter (proximal) class of each GENCODE-annotated TSS was defined as a region from the master list within 1 kb of the TSS. The exon class was defined as any HOT region not in the promoter class that overlapped a GENCODE-annotated “CDS” segment by at least 75 bp. The UTR class was defined as a HOT region not in the promoter or exon class that overlapped a GENCODE-annotated “UTR” segment by at least 1 bp. The intron class was defined as a GENCODE segment annotated as “gene” with all “CDS” segments. The intron class also covered any HOT regions not defined by other categories that overlapped introns by at least 1 bp.

The distal category was defined as the HOT regions located at least 2 kb away from any GENCODE-annotated TSS. Repeat categories for the LINE, SINE, LTR, and DNA repeat classes were obtained from the UCSC RepeatMasker track annotations. The miRNA category counts for each miRNA annotated by miRBase (version 20) [26, 62] were defined by the closest master list HOT regions within 1 kb upstream and downstream of the miRNA TSS.

The cell-type number was defined for each HOT region by annotating the master list with the number of cell types with overlapping HOT regions. The plots in Fig. 1e were generated using the R function “geom_violin” from the “ggplot2” package, which summarises the distribution of cell type numbers for distinct categories of HOT regions. The distribution of cell types containing a HOT region was calculated separately for HOT regions observed in 154 cell types.

Association analysis of functional regulatory elements

The microRNA coordinates were downloaded from miRBase (version 20) [26] and used to map microRNAs to their genomic locations. We used the method described in a recent study [15] to assign TSSs for 2633 microRNA loci. RepeatMasker data were downloaded from the hg19 rmsk table associated with the UCSC Genome Browser. There are 1395 distinctly named repeats in 56 families in 21 repeat classes. The data were analysed by repeat family because this procedure gives a granularity suitable for display. A number of the classes are structural classes rather than classes derived from transposable elements. Bedops utilities [63] were used to count the number of repeat elements that overlapped at least 1 bp with HOT regions. The HOT regions from 154 cell types/tissues were tested for overlap with repeat families. Additional file 2: Table S5 shows overlap statistics for families of elements with at least 5000 overlapping HOT regions. Additionally, an extensive compilation of 373 experimentally validated distal, non-promoter cis-regulatory elements, including insulators, locus control regions, and so on, were taken from a recent study [15] (Additional file 2: Table S6). Finally, we collected low methylation regions (LMRs), unmethylated regions (UMRs), and DNA methylation valleys (DMVs) from a recent study [29].

Comparison of HOT regions with known enhancers

We downloaded the data for tests of human enhancers in a mouse developmental model [30, 64] from the VISTA enhancer browser http://enhancer.lbl.gov/ and received permission to use the embryonic mouse images.

To calculate the enrichment of super-enhancers in a HOT region, we performed a GSC analysis between HOT regions and super-enhancers, as shown in Fig. 4a. To perform the colocalisation analysis on HOT regions and super-enhancers in N-by-N (N = 14) cell line combinations as similarly described in a previous study [34, 35], we collected a catalogue of super-enhancers in 14 human cell types from recent studies. The counts were divided by the corresponding row sum and column sum and multiplied by the matrix sum to obtain enrichment values using the same approach as the χ ² test. We plotted the enrichment factor for each histone modification in an N-by-N heat map.

Dynamics of HOT regions upon H1 differentiation

Using the DNase-seq data obtained from a recent study [29], we identified 5410, 5602, 4250, 3408, and 6719 HOT regions in H1-hESCs (H1), mesendoderm (ME), neural progenitor cells (NPCs), trophoblast-like cells (TBL), and mesenchymal stem cells (MSCs), respectively. As a control for terminally differentiated cells, we also identified 4002 HOT regions in IMR90, a primary human foetal lung fibroblast cell line. Each HOT region was assigned to the closest genes annotated in GENCODE (V15) [56] by determining the distance from the centre of the HOT region to the TSS of each GENCODE gene.

HOT regions that were “shared” upon H1 differentiation were defined as HOT regions belonging to both H1 cells and H1-derived cell types (using bedops –e −25 %). HOT regions “lost” upon H1 differentiation were defined as HOT regions belonging to H1 cells but not found within H1-derived cells (using bedops –n −25 %). HOT regions that were “gained” upon H1 differentiation were defined as regions belonging to H1-derived cells but not found within H1 cells (using bedops –n −25 %). The genes associated with “gained” HOT regions were further restricted to those that expressed (RPKM > 1) in either H1 or its derived cells.

We scanned TF motifs in HOT regions of H1 and H1-derived cells using iFORM [23]. We calculated the TF density of each cell and took the ratio of densities between H1 and its derived cells. We defined as enriched TF gene those TF genes with ratio > 1 and expressed > 1 (RPKM > 1) in either H1 cells or its derived cells.

MACS (Zhang et al., 2008) was used to identify H3k4me3 peaks using default parameters. For H3k27me3, which typically exhibits broad enrichment, we used RSEG [65] to identify enriched regions with “-i 20 –b 100 –v –mode 2”. Upon H1 differentiation, “activated” HOT regions were defined as HOT regions bearing H3k4me3 peaks only, whereas “repressed” HOT regions were defined as HOT regions bearing either H3k27me3 peaks only or no marker.

We used DAVID [66] to perform GO analysis of genes associated with “gained” HOT regions, TF-enriched genes, and genes associated with “activated” HOT regions.

Accession numbers

The identified HOT regions across human cell and tissue types have been deposited in the Gene Expression Omnibus under the accession ID GSE54296.

References

Gerstein MB, Lu ZJ, Van Nostrand EL, Cheng C, Arshinoff BI, Liu T, et al. Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science (New York, NY). 2010;330(6012):1775–87.
Article CAS Google Scholar
Araya CL, Kawli T, Kundaje A, Jiang L, Wu B, Vafeados D, et al. Regulatory analysis of the C. elegans genome with spatiotemporal resolution. Nature. 2014;512(7515):400–5.
Article CAS PubMed PubMed Central Google Scholar
Moorman C, Sun LV, Wang J, de Wit E, Talhout W, Ward LD, et al. Hotspots of transcription factor colocalization in the genome of Drosophila melanogaster. Proc Natl Acad Sci U S A. 2006;103(32):12027–32.
Article CAS PubMed PubMed Central Google Scholar
Macarthur S, Li XY, Li J, Brown JB, Chu HC, Zeng L, et al. Developmental roles of 21 Drosophila transcription factors are determined by quantitative differences in binding to an overlapping set of thousands of genomic regions. Genome Biol. 2009;10(7):R80.
Article PubMed PubMed Central Google Scholar
Roy S, Ernst J, Kharchenko PV, Kheradpour P, Negre N, Eaton ML, et al. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science (New York, NY). 2010;330(6012):1787–97.
Article CAS Google Scholar
Negre N, Brown CD, Ma L, Bristow CA, Miller SW, Wagner U, et al. A cis-regulatory map of the Drosophila genome. Nature. 2011;471(7339):527–31.
Article CAS PubMed PubMed Central Google Scholar
Kvon EZ, Stampfel G, Yanez-Cuna JO, Dickson BJ, Stark A. HOT regions function as patterned developmental enhancers and have a distinct cis-regulatory signature. Genes Dev. 2012;26(9):908–13.
Article CAS PubMed PubMed Central Google Scholar
Yan J, Enge M, Whitington T, Dave K, Liu J, Sur I, et al. Transcription factor binding in human cells occurs in dense clusters formed around cohesin anchor sites. Cell. 2013;154(4):801–13.
Article CAS PubMed Google Scholar
Chen RA, Stempor P, Down TA, Zeiser E, Feuer SK, Ahringer J. Extreme HOT regions are CpG-dense promoters in C. elegans and humans. Genome Res. 2014;24(7):1138–46.
Article CAS PubMed PubMed Central Google Scholar
Foley JW, Sidow A. Transcription-factor occupancy at HOT regions quantitatively predicts RNA polymerase recruitment in five human cell lines. BMC Genomics. 2013;14:720.
Article CAS PubMed PubMed Central Google Scholar
Furlong EE. Molecular biology: A fly in the face of genomics. Nature. 2011;471(7339):458–9.
Article CAS PubMed Google Scholar
Blaxter M. Genetics. Revealing the dark matter of the genome. Science (New York, NY). 2010;330(6012):1758–9.
Article CAS Google Scholar
Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics (Oxford, England). 2011;27(7):1017–8.
Article CAS Google Scholar
Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell. 2010;38(4):576–89.
Article CAS PubMed PubMed Central Google Scholar
Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, et al. The accessible chromatin landscape of the human genome. Nature. 2012;489(7414):75–82.
Article CAS PubMed PubMed Central Google Scholar
Gaszner M, Felsenfeld G. Insulators: exploiting transcriptional and epigenetic mechanisms. Nat Rev Genet. 2006;7(9):703–13.
Article CAS PubMed Google Scholar
Gross DS, Garrard WT. Nuclease hypersensitive sites in chromatin. Annu Rev Biochem. 1988;57:159–97.
Article CAS PubMed Google Scholar
Li Q, Harju S, Peterson KR. Locus control regions: coming of age at a decade plus. Trends Genet. 1999;15(10):403–8.
Article PubMed Google Scholar
Boyle AP, Davis S, Shulha HP, Meltzer P, Margulies EH, Weng Z, et al. High-resolution mapping and characterization of open chromatin across the genome. Cell. 2008;132(2):311–22.
Article CAS PubMed PubMed Central Google Scholar
Hesselberth JR, Chen X, Zhang Z, Sabo PJ, Sandstrom R, Reynolds AP, et al. Global mapping of protein-DNA interactions in vivo by digital genomic footprinting. Nat Methods. 2009;6(4):283–9.
Article CAS PubMed PubMed Central Google Scholar
John S, Sabo PJ, Thurman RE, Sung MH, Biddie SC, Johnson TA, et al. Chromatin accessibility pre-determines glucocorticoid receptor binding patterns. Nat Genet. 2011;43(3):264–8.
Article CAS PubMed Google Scholar
Chen H, Li H, Liu F, Zheng X, Wang S, Bo X, et al. An integrative analysis of TFBS-clustered regions reveals new transcriptional regulation models on the accessible chromatin landscape. Sci Rep. 2015;5:8465.
Article CAS PubMed PubMed Central Google Scholar
Ren C, Chen H, Liu F, Li H, Bo X, Shu W: iFORM: incorporating Find Occurrence of Regulatory Motifs. bioRxiv 2016. http://biorxiv.org/content/early/2016/03/16/044214.
Consortium TEP. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74.
Article Google Scholar
Yip KY, Cheng C, Bhardwaj N, Brown JB, Leng J, Kundaje A, et al. Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biol. 2012;13(9):R48.
Article CAS PubMed PubMed Central Google Scholar
Kozomara A, Griffiths-Jones S. miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res. 2014;42(Database issue):D68–73.
Article CAS PubMed Google Scholar
Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, et al. Landscape of transcription in human cells. Nature. 2012;489(7414):101–8.
Article CAS PubMed PubMed Central Google Scholar
Stadler MB, Murr R, Burger L, Ivanek R, Lienert F, Scholer A, et al. DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature. 2011;480(7378):490–5.
CAS PubMed Google Scholar
Xie W, Schultz MD, Lister R, Hou Z, Rajagopal N, Ray P, et al. Epigenomic analysis of multilineage differentiation of human embryonic stem cells. Cell. 2013;153(5):1134–48.
Article CAS PubMed PubMed Central Google Scholar
Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, et al. In vivo enhancer analysis of human conserved non-coding sequences. Nature. 2006;444(7118):499–502.
Article CAS PubMed Google Scholar
Hnisz D, Abraham BJ, Lee TI, Lau A, Saint-Andre V, Sigova AA, et al. Super-enhancers in the control of cell identity and disease. Cell. 2013;155(4):934–47.
Article CAS PubMed Google Scholar
Loven J, Hoke HA, Lin CY, Lau A, Orlando DA, Vakoc CR, et al. Selective inhibition of tumor oncogenes by disruption of super-enhancers. Cell. 2013;153(2):320–34.
Article CAS PubMed PubMed Central Google Scholar
Whyte WA, Orlando DA, Hnisz D, Abraham BJ, Lin CY, Kagey MH, et al. Master transcription factors and mediator establish super-enhancers at key cell identity genes. Cell. 2013;153(2):307–19.
Article CAS PubMed PubMed Central Google Scholar
Xi H, Shulha HP, Lin JM, Vales TR, Fu Y, Bodine DM, et al. Identification and characterization of cell type-specific and ubiquitous chromatin regulatory structures in the human genome. PLoS Genet. 2007;3(8):e136.
Article PubMed PubMed Central Google Scholar
Chen H, Tian Y, Shu W, Bo X, Wang S. Comprehensive identification and annotation of cell type-specific and ubiquitous CTCF-binding sites in the human genome. PLoS One. 2012;7(7):e41374.
Article CAS PubMed PubMed Central Google Scholar
Bernstein BE, Mikkelsen TS, Xie X, Kamal M, Huebert DJ, Cuff J, et al. A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell. 2006;125(2):315–26.
Article CAS PubMed Google Scholar
Zhao XD, Han X, Chew JL, Liu J, Chiu KP, Choo A, et al. Whole-genome mapping of histone H3 Lys4 and 27 trimethylations reveals distinct genomic compartments in human embryonic stem cells. Cell Stem Cell. 2007;1(3):286–98.
Article CAS PubMed Google Scholar
Pan G, Tian S, Nie J, Yang C, Ruotti V, Wei H, et al. Whole-genome analysis of histone H3 lysine 4 and lysine 27 methylation in human embryonic stem cells. Cell Stem Cell. 2007;1(3):299–312.
Article CAS PubMed Google Scholar
Ng HH, Surani MA. The transcriptional and signalling networks of pluripotency. Nat Cell Biol. 2011;13(5):490–6.
Article CAS PubMed Google Scholar
Orkin SH, Hochedlinger K. Chromatin connections to pluripotency and cellular reprogramming. Cell. 2011;145(6):835–50.
Article CAS PubMed PubMed Central Google Scholar
Young RA. Control of the embryonic stem cell state. Cell. 2011;144(6):940–54.
Article CAS PubMed PubMed Central Google Scholar
Kim TK, Hemberg M, Gray JM, Costa AM, Bear DM, Wu J, et al. Widespread transcription at neuronal activity-regulated enhancers. Nature. 2010;465(7295):182–7.
Article CAS PubMed PubMed Central Google Scholar
Lai F, Orom UA, Cesaroni M, Beringer M, Taatjes DJ, Blobel GA, et al. Activating RNAs associate with Mediator to enhance chromatin architecture and transcription. Nature. 2013;494(7438):497–501.
Article CAS PubMed PubMed Central Google Scholar
Orom UA, Derrien T, Beringer M, Gumireddy K, Gardini A, Bussotti G, et al. Long noncoding RNAs with enhancer-like function in human cells. Cell. 2010;143(1):46–58.
Article CAS PubMed PubMed Central Google Scholar
Ling J, Ainol L, Zhang L, Yu X, Pi W, Tuan D. HS2 enhancer function is blocked by a transcriptional terminator inserted between the enhancer and the promoter. J Biol Chem. 2004;279(49):51704–13.
Article CAS PubMed Google Scholar
Kaikkonen MU, Spann NJ, Heinz S, Romanoski CE, Allison KA, Stender JD, et al. Remodeling of the enhancer landscape during macrophage activation is coupled to enhancer transcription. Mol Cell. 2013;51(3):310–25.
Article CAS PubMed PubMed Central Google Scholar
Mousavi K, Zare H, Dell’orso S, Grontved L, Gutierrez-Cruz G, Derfoul A, et al. eRNAs promote transcription by establishing chromatin accessibility at defined genomic loci. Mol Cell. 2013;51(5):606–17.
Article CAS PubMed PubMed Central Google Scholar
Lam MT, Cho H, Lesch HP, Gosselin D, Heinz S, Tanaka-Oishi Y, et al. Rev-Erbs repress macrophage gene expression by inhibiting enhancer-directed transcription. Nature. 2013;498(7455):511–5.
Article CAS PubMed PubMed Central Google Scholar
Li W, Notani D, Ma Q, Tanasa B, Nunez E, Chen AY, et al. Functional roles of enhancer RNAs for oestrogen-dependent transcriptional activation. Nature. 2013;498(7455):516–20.
Article CAS PubMed PubMed Central Google Scholar
Zhou Q, Brown J, Kanarek A, Rajagopal J, Melton DA. In vivo reprogramming of adult pancreatic exocrine cells to beta-cells. Nature. 2008;455(7213):627–32.
Article CAS PubMed Google Scholar
Lee TI, Young RA. Transcriptional regulation and its misregulation in disease. Cell. 2013;152(6):1237–51.
Article CAS PubMed PubMed Central Google Scholar
Graf T, Enver T. Forcing cells to change lineages. Nature. 2009;462(7273):587–94.
Article CAS PubMed Google Scholar
Cherry AB, Daley GQ. Reprogramming cellular identity for regenerative medicine. Cell. 2012;148(6):1110–22.
Article CAS PubMed PubMed Central Google Scholar
Li H, Chen H, Liu F, Ren C, Wang S, Bo X, et al. Functional annotation of HOT regions in the human genome: implications for human disease and cancer. Sci Rep. 2015;5:11633.
Article PubMed PubMed Central Google Scholar
Siersbaek R, Rabiee A, Nielsen R, Sidoli S, Traynor S, Loft A, et al. Transcription factor cooperativity in early adipogenic hotspots and super-enhancers. Cell Rep. 2014;7(5):1443–55.
Article CAS PubMed Google Scholar
Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012;22(9):1760–74.
Article CAS PubMed PubMed Central Google Scholar
Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, et al. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006;34(Database issue):D108–10.
Article CAS PubMed Google Scholar
Portales-Casamar E, Thongjuea S, Kwon AT, Arenillas D, Zhao X, Valen E, et al. JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res. 2010;38(Database issue):D105–10.
Article CAS PubMed Google Scholar
Robasky K, Bulyk ML. UniPROBE, update 2011: expanded content and search tools in the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Res. 2011;39(Database issue):D124–8.
Article CAS PubMed Google Scholar
Bickel PJ, Boley N, Brown JB, Huang H, Zhang NR: Subsampling methods for genomic inference. Ann Appl Stat. 2010;4(4):1660-97. http://projecteuclid.org/euclid.aoas/1294167794.
Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, et al. Identification and analysis of functional elements in 1 % of the human genome by the ENCODE pilot project. Nature. 2007;447(7146):799–816.
Article CAS PubMed Google Scholar
Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ. miRBase: tools for microRNA genomics. Nucleic Acids Res. 2008;36(Database issue):D154–8.
CAS PubMed Google Scholar
Neph S, Kuehn MS, Reynolds AP, Haugen E, Thurman RE, Johnson AK, et al. BEDOPS: high-performance genomic feature operations. Bioinformatics (Oxford, England). 2012;28(14):1919–20.
Article CAS Google Scholar
May D, Blow MJ, Kaplan T, McCulley DJ, Jensen BC, Akiyama JA, et al. Large-scale discovery of enhancers from human heart tissue. Nat Genet. 2012;44(1):89–93.
Article CAS Google Scholar
Song Q, Smith AD. Identifying dispersed epigenomic domains from ChIP-Seq data. Bioinformatics (Oxford, England). 2011;27(6):870–1.
Article CAS Google Scholar
da Huang W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4(1):44–57.
Article CAS Google Scholar

Download references

Acknowledgements

We wish to thank the ENCODE Project Consortium for making their data publicly available. This work was supported by grants from the Major Research plan of the National Natural Science Foundation of China (No. U1435222), the Program of International S&T Cooperation (No. 2014DFB30020) and the National High Technology Research and Development Program of China (No. 2015AA020108).

Authors’ contributions

WS conceived the project. WS, XB and SW designed all experiments. HL, HC, and FL performed the experiments. All authors analysed the data and contributed to the manuscript preparation. WS wrote the manuscript.

Competing interests

The authors declare no competing interests.

Author information

Authors and Affiliations

Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, 100850, China
Hao Li, Feng Liu, Chao Ren, Xiaochen Bo & Wenjie Shu

Authors

Hao Li
View author publications
You can also search for this author in PubMed Google Scholar
Feng Liu
View author publications
You can also search for this author in PubMed Google Scholar
Chao Ren
View author publications
You can also search for this author in PubMed Google Scholar
Xiaochen Bo
View author publications
You can also search for this author in PubMed Google Scholar
Wenjie Shu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Xiaochen Bo or Wenjie Shu.

Additional files

Additional file 1:

Supplementary materials. (PDF 2308 kb)

Additional file 2: Table S1.

H1hESC TFBS Clusters info. Table S2. Comparison of motifless binding peaks and HOT regions. Table S3. Information of HOT regions in 154 files. Table S4. Key TF genes. Table S5. Developmental TF genes. Table S6. Repetitive elements in HOT regions. Table S7. 1046 Validated elements in HOT regions. Table S8. gene list of associated “gained” HOT regions. Table S9. Enriched TF gene list. Table S10. Enrichment of bivalent genes in HOT regions. Table S11 GO analysis of activated HOT regions in H1-derived cells. (XLSX 766 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Li, H., Liu, F., Ren, C. et al. Genome-wide identification and characterisation of HOT regions in the human genome. BMC Genomics 17, 733 (2016). https://doi.org/10.1186/s12864-016-3077-4

Download citation

Received: 29 October 2015
Accepted: 08 September 2016
Published: 15 September 2016
DOI: https://doi.org/10.1186/s12864-016-3077-4

Genome-wide identification and characterisation of HOT regions in the human genome

Abstract

Background

Results

Conclusions

Background

Results

Identification and validation of HOT regions

General characterisation of HOT regions

Associations with functional regulatory elements

HOT regions at embryonic enhancers

Distal HOT regions in super-enhancers

Dynamics of HOT regions during H1 differentiation

Epigenetic regulation of HOT regions upon H1 differentiation

Discussion and conclusions

Methods

Data sets

Identifying TFBS-clustered regions and HOT regions

Validation of HOT regions with ChIP-seq

Identifying motifless binding peaks

Master list and annotation for HOT regions

Association analysis of functional regulatory elements

Comparison of HOT regions with known enhancers

Dynamics of HOT regions upon H1 differentiation

Accession numbers

References

Acknowledgements

Authors’ contributions

Competing interests

Author information

Authors and Affiliations

Corresponding authors

Additional files

Additional file 1:

Additional file 2: Table S1.

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Genomics

Contact us