ABSTRACT
Cellular identity relies on cell type-specific gene expression profiles controlled by cis-regulatory elements (CREs), such as promoters, enhancers and anchors of chromatin interactions. CREs are unevenly distributed across the genome, giving rise to distinct subsets such as individual CREs and Clusters Of cis-Regulatory Elements (COREs), also known as super-enhancers. Identifying COREs is challenging because technical and biological factors introduce variability in the distribution of distances between CREs within a given dataset. To address this issue, we developed a new unsupervised machine learning approach termed Clustering of genomic REgions Analysis Method (CREAM) that outperforms the Ranking Of Super Enhancer (ROSE) approach. Specifically, CREAM-identified COREs are enriched in CREs strongly bound by master transcription factors according to ChIP-seq signal intensity, are proximal to highly expressed genes, are preferentially found near genes essential for cell growth and are more predictive of cell identity. Moreover, we show that CREAM enables subtyping of primary prostate tumor samples according to their CORE distribution across the genome. We further show that COREs are enriched compared to individual CREs at topologically associating domain (TAD) boundaries, and that these boundary COREs are preferentially bound by CTCF and factors of the cohesin complex (e.g., RAD21 and SMC3). Finally, applying CREAM to transcription factor ChIP-seq data reveals CTCF- and cohesin-specific COREs preferentially at TAD boundaries rather than within TADs. CREAM is available as an open source R package (https://CRAN.R-project.org/package=CREAM) to identify COREs from cis-regulatory annotation datasets from any biological sample.