Abstract
Super-enhancers are a newly proposed concept: clusters of enhancers that can drive cell-type-specific gene expression and are crucial in cell identity. Many disease-associated sequence variations are enriched in the super-enhancer regions of disease-relevant cell types. Thus, super-enhancers can be used as potential biomarkers for disease diagnosis and therapeutics. Current studies have identified super-enhancers for more than 100 cell types in human and mouse. However, no centralized resource that integrates all these findings is available yet. We developed dbSUPER (http://bioinfo.au.tsinghua.edu.cn/dbsuper/), the first integrated and interactive database of super-enhancers, with the primary goal of providing a resource for further study of transcriptional control of cell identity and disease by archiving computationally produced data. These data can be easily sent to the Galaxy, GREAT and Cistrome web servers for further downstream analysis. dbSUPER provides a responsive and user-friendly web interface to facilitate efficient and comprehensive searching and browsing. The data can be downloaded and exported in a variety of formats and visualized in the UCSC Genome Browser, where custom tracks are added automatically. Further, dbSUPER lists genes associated with super-enhancers and links to various databases, including GeneCards, UniProt and Entrez. Our database also provides an overlap analysis tool to check the overlap of user-defined regions with the current database. We believe dbSUPER is a valuable resource for the bioinformatics and genetics research community.
Database URL: http://bioinfo.au.tsinghua.edu.cn/dbsuper
Introduction
Enhancers are cis-regulatory elements of DNA that enhance the transcription of target genes by communicating with the core promoter through several mechanisms, including looping, tracking, linking and relocation models (1–6). These enhancers contain binding sites for sequence-specific transcription factors (TFs), which help to recruit coactivators and RNA polymerase II to target genes (2–5). Since the discovery of the first enhancer in the animal virus SV40 (7), there has been tremendous development of technology and methodology to study the role of enhancers in gene expression. There can be thousands of active enhancers operating in a single mammalian cell, and in total there may be ∼1 million enhancers in the human genome (8, 9). Many approaches are applied to identify enhancers genome-wide, such as chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) for the coactivator protein p300 (10) and for histone modifications including H3K4me1 and H3K27ac (11, 12). H3K27ac was used as an enhancer mark to identify enhancers in hESC (13) and mESC (14), and these enhancers were further grouped into active and poised enhancers.
Recently, a small set of enhancers that span large regions of the genome in a clustered manner and are occupied by high levels of Mediator (Med1), master transcription factors and coactivators were named super-enhancers (15, 16). These super-enhancers drive cell-type-specific gene expression programs (15, 17), and many disease-associated sequence variations are especially enriched in these regions of disease-relevant cell types (16, 17). In cancer cells, super-enhancers are associated with key genes with known oncogenic function, including MYC (16, 17). Initially, super-enhancers were discovered through ChIP-seq experiments for the master transcription factors and Mediator (Med1) in five mouse cell types (15). Further, super-enhancers were identified in 86 human cell and tissue types using H3K27ac, and GWAS data showed that disease-associated sequence variations are particularly enriched in these regions (17). A parallel study observed similar patterns through integrated analysis of human pancreatic islet data with nine cell types from ENCODE; these regions, which are greater than 3 kb and tissue-specific, were named “stretch enhancers” (18). Further downstream computational and in vivo analysis revealed that these stretch enhancer regions are key chromatin features for cell-type-specific gene expression programs, and that sequence variation in stretch enhancers affects the risk of major common human diseases (18).
The master transcription factors, including Oct4, Sox2 and Nanog, regulate the pluripotency of embryonic stem cells (ESCs). In mouse embryonic stem cells (mESC), 231 super-enhancers were identified based on the occupancy of Med1 ChIP-seq signal from a list of 8,794 co-bound regions of Oct4, Sox2 and Nanog (15). Deleting a 13 kb Sox2 super-enhancer with the CRISPR genome-editing technique showed that this super-enhancer was responsible for over 90% of Sox2 gene expression (19). A recent study demonstrated that hotspots of transcription factors in the early phase of adipogenesis are highly enriched in super-enhancer regions, which drive adipogenic-specific gene expression (20). Wang et al. found a large number of dynamic NOTCH1 (a master regulatory protein) sites in super-enhancer regions (21). They observed that 83% of NOTCH1 sites overlap with H3K27ac ChIP-seq peaks and demonstrated the importance of the Notch super-enhancer interaction in gene expression (21). Another study discovered super-enhancers by profiling BRD4 ChIP-seq signal in multiple cancer cells and found considerable loss of BRD4 at super-enhancer regions after treating cancer cells with the BET-bromodomain inhibitor JQ1 (16). Recently, a large-scale collaborative study revealed highly asymmetric loading of BRD4 at super-enhancers in diffuse large B-cell lymphoma (DLBCL) cells and showed that the genes regulated by super-enhancers are particularly sensitive to JQ1 inhibition (22). JQ1 treatment significantly decreased the growth of DLBCL cells engrafted in mice and improved the survival of the mice (23). Plutzky et al. extended the current understanding of super-enhancer function with their discovery that super-enhancers can act as fast switches to enable rapid cell state transitions (24, 25). Kwiatkowski et al. discovered that the transcription-targeting drug THZ1, a CDK7 inhibitor, preferentially reduces the expression of genes associated with super-enhancers (26).
A follow-up in vivo study in small cell lung cancer (SCLC) showed the association of super-enhancers with proto-oncogenes and SCLC identity genes, and showed that the transcription-targeting drug THZ1 preferentially targets super-enhancer-driven genes (27, 28). Mansour et al. further extended the functional importance of super-enhancers by showing that somatic mutations introduce binding sites for the MYB transcription factor, which create a powerful super-enhancer that mediates the overexpression of oncogenes in T-cell acute lymphoblastic leukaemia (T-ALL) (29, 30). Very recent studies linked the off-target activity of activation-induced deaminase (AID) to the process of convergent transcription (31), and these AID targets are mainly grouped within super-enhancers and regulatory clusters (32). Current research clearly demonstrates the importance and potential application of super-enhancers, as they can play key roles in cell identity and disease. The concept of super-enhancer/stretch enhancer is still evolving but has already gained extensive attention in the community. More systematic and comprehensive studies will lead to a more accurate understanding of the concept as well as its functions.
Since the discovery of super-enhancers, a large amount of data has been generated by profiling ChIP-seq signal for the Mediator complex (Med1), master transcription factors (MyoD, T-bet and C/EBPα) (15), H3K27ac (17, 22, 27) and BRD4 (16, 22) in different tissue and cell types. The data produced by these studies is shared with the public in the literature. However, a centralized database that integrates all produced and new data is needed to streamline further downstream analysis and to answer many potential biological questions related to these newly discovered regions. Hence, we developed a user-friendly and interactive database of super-enhancers by integrating all the produced and new data, with the aim of providing a resource that helps bioinformaticians and biologists to perform further analysis and study of transcriptional control of cell identity. We named the database dbSUPER and have made it available for academic use at http://bioinfo.au.tsinghua.edu.cn/dbsuper/. It can help the research community to search, browse, export, send and download super-enhancer-related data in a more systematic way. The database will be updated with the latest progress in the field, and we hope it will be a helpful tool to better enable downstream analysis of super-enhancers and their role in gene regulation.
Material and methods
Data sources
For the current version of dbSUPER, data was collected from a variety of sources (15, 17, 22, 27) and also produced by using the published pipeline (17). After preprocessing, we store this data in a MySQL-based database for fast and efficient querying. We collected 2,558 super-enhancer regions for 5 mouse tissue and cell types, including mESC, pro-B cells, myotubes, Th cells and macrophages, in the mouse genome (15). These super-enhancers were identified by ranking ChIP-seq signals for Med1 for mESC and pro-B cells, and for MyoD, T-bet and C/EBPα for myotubes, Th cells and macrophages, respectively (15). We collected 58,283 super-enhancer regions for 86 human tissue and cell types in the human genome, which were identified using H3K27ac ChIP-seq signal-based ranking (17). We added super-enhancers for three small cell lung cancer (SCLC) cell lines, NCI-H69, GLC16 and NCI-H82, which were identified by profiling ChIP-seq signal for H3K27ac (27). We further integrated super-enhancers for six diffuse large B-cell lymphoma cell lines (Ly1, DHL6, Ly3, HBL1, Ly4 and Toledo) and one human tonsil sample, which were identified by profiling ChIP-seq signal for H3K27ac and BRD4 (22). In total, the current database contains 66,033 super-enhancers (mean size of 33,588 bp and a mean of 654 super-enhancers per cell type) in 96 human and 5 mouse tissue/cell types. A detailed list of all tissue/cell types, including the number of super-enhancers, mean size (bp) and identification method used for each, can be found in Supplementary Table S1.
Computational methods for super-enhancer identification
An overview of super-enhancer identification and data integration is presented in Figure 1. To identify super-enhancers, ChIP-seq was first performed for Mediator, master transcription factors or enhancer surrogates. ChIP-seq data sets were aligned to the mouse (mm9) and human (hg19) genomes using Bowtie (version 0.12.9) (33). MACS (version 1.4.1) was used to identify enriched regions as enhancers with a threshold p-value of 1 × 10⁻⁹ (34). The ROSE (Rank Ordering of Super-Enhancers) algorithm (https://bitbucket.org/young_computation/rose) was used to separate super-enhancers from enhancers (15, 16). First, ROSE stitches enhancers together if they are within 12.5 kb of each other. Next, these stitched enhancers are ranked based on the ChIP-seq occupancy of Med1, H3K27ac, BRD4, MyoD, T-bet or C/EBPα, which reveals a geometrical inflection point and establishes a cutoff that separates super-enhancers from typical enhancers (15–17). An implementation of the ROSE algorithm and a more detailed definition of the super-enhancer concept can be found in the literature (15, 16, 35). The current database is developed using the hg19 assembly for the human genome and mm9 for the mouse genome. For any data not available in these assemblies, we converted the coordinates using the liftOver tool of the UCSC Genome Browser (36).
ChIP-seq data for many chromatin regulators and coactivators, including p300 (10), is used for enhancer identification. In the case of super-enhancers, however, we separate them from a list of enhancers based on ChIP-seq occupancy for certain factors, including Mediator (Med1). Initially, super-enhancers were identified using Med1 ChIP-seq signal (15), but comparable results can be achieved using H3K27ac (17) and BRD4 (16). So far, different ChIP-seq-based ranking methods have been used to identify super-enhancers, but a conceptually appealing definition and a set of functionally important features are yet to come (35).
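The stitching and ranking steps described in this section can be illustrated with a minimal Python sketch. It is not the ROSE implementation. It assumes each enhancer already carries a single ChIP-seq signal value, sums the signals of stitched enhancers, and approximates the inflection point as the point where a line of slope 1 is tangent to the rank-versus-signal curve after both axes are scaled to [0, 1]. The function names and the tuple layout are hypothetical.

```python
from typing import List, Tuple

# (chrom, start, end, ChIP-seq signal); the layout is an assumption for this sketch
Region = Tuple[str, int, int, float]

STITCH_DISTANCE = 12_500  # bp, the stitching distance quoted in the text


def stitch_enhancers(enhancers: List[Region], max_gap: int = STITCH_DISTANCE) -> List[Region]:
    """Merge enhancers on the same chromosome that lie within max_gap of each other."""
    stitched: List[Region] = []
    for chrom, start, end, signal in sorted(enhancers):
        if stitched and stitched[-1][0] == chrom and start - stitched[-1][2] <= max_gap:
            _, prev_start, prev_end, prev_signal = stitched[-1]
            # Assumption: the stitched signal is the sum of its constituents.
            stitched[-1] = (chrom, prev_start, max(prev_end, end), prev_signal + signal)
        else:
            stitched.append((chrom, start, end, signal))
    return stitched


def tangent_cutoff(signals: List[float]) -> float:
    """Signal value at the point where a slope-1 line touches the scaled ranked curve."""
    if not signals:
        raise ValueError("at least one signal value is required")
    ordered = sorted(signals)
    n, top = len(ordered), ordered[-1]
    if n < 2 or top <= 0:
        return top
    # With x = rank / (n - 1) and y = signal / top, the slope-1 tangent point
    # of a convex increasing curve is the point that minimizes y - x.
    best = min(range(n), key=lambda i: ordered[i] / top - i / (n - 1))
    return ordered[best]


def split_super_enhancers(stitched: List[Region]) -> Tuple[List[Region], List[Region]]:
    """Return (super-enhancers, typical enhancers) using the tangent cutoff."""
    cutoff = tangent_cutoff([region[3] for region in stitched])
    supers = [region for region in stitched if region[3] > cutoff]
    typical = [region for region in stitched if region[3] <= cutoff]
    return supers, typical
```

Stitched regions whose signal lies above the cutoff are treated as super-enhancers and the rest as typical enhancers. The published pipeline also handles details that this sketch omits, such as signal normalization.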
Assigning genes to super-enhancers
Transcriptionally active genes were assigned to super-enhancers using a simple proximity rule. It is known that enhancers tend to loop and associate with target genes in order to activate their transcription (37), while most of these interactions occur within a distance of ∼50kb of the enhancer (38). Hence, using a distance threshold of 50kb, all transcriptionally active genes (TSSs) are assigned to super-enhancers within a 50kb window. This approach identified a large proportion of true enhancer/promoter interactions in embryonic stem cells (39).
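The proximity rule can be sketched in a few lines of Python. This is an illustration only: it assumes that the 50 kb window is measured from the boundaries of the super-enhancer (the text does not say whether the boundaries or the center are used), and the identifiers, gene names and coordinates in the usage example are hypothetical.

```python
from typing import Dict, List, Tuple

WINDOW = 50_000  # bp, the distance threshold quoted in the text

SuperEnhancer = Tuple[str, str, int, int]  # (id, chrom, start, end)
ActiveTSS = Tuple[str, str, int]           # (gene symbol, chrom, TSS position)


def assign_genes(
    super_enhancers: List[SuperEnhancer],
    active_tss: List[ActiveTSS],
    window: int = WINDOW,
) -> Dict[str, List[str]]:
    """Map each super-enhancer ID to the active genes whose TSS lies within the window."""
    assignments: Dict[str, List[str]] = {}
    for se_id, chrom, start, end in super_enhancers:
        # Assumption: the window extends from both boundaries of the region.
        assignments[se_id] = [
            gene
            for gene, gene_chrom, tss in active_tss
            if gene_chrom == chrom and start - window <= tss <= end + window
        ]
    return assignments


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    ses = [("SE_1", "chr1", 100_000, 120_000)]
    tss = [("GENE_A", "chr1", 60_000), ("GENE_B", "chr1", 175_000), ("GENE_C", "chr2", 110_000)]
    print(assign_genes(ses, tss))
```

A nested loop is adequate for a sketch. For genome-scale data, sorting the TSS positions per chromosome and using binary search would be the usual refinement.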
Database features
General web interface and database
The web interface of dbSUPER provides an interactive solution for searching, browsing, visualizing, downloading, exporting and transferring the data to other public servers. To access all these features, the interface provides a navigation bar on the left side and a footer. A quick search box is available on the home page for fast searching and browsing. The database also provides an advanced search feature to filter the super-enhancers based on more detailed criteria. To keep browsing organized, dbSUPER displays the data in paginated, sortable and responsive tables. The responsive feature adapts the table layout to the resolution of the user's device, adding a plus sign at the beginning of each row so that the data fits the screen. By clicking the super-enhancer ID, users can view general details, details about associated genes, FASTA sequences and links to external sources including UCSC (40), NCBI RefSeq (41), Entrez Gene (42), GeneCards (43), UniProt (44) and Wikipedia. The data can be downloaded in different formats including BED, FASTA and UCSC custom tracks. The user-queried data can be exported to Excel, CSV and PDF files and copied to the clipboard. To make downstream analysis faster and more efficient, the dbSUPER interface provides links to the Galaxy (45), GREAT (46) and Cistrome (47) servers to send data with one click. The interface also provides a link to visualize data in the UCSC Genome Browser (36) by adding custom tracks automatically. The overlap analysis tool allows users to check the overlap of their regions of interest with the current database and outputs the overlapped regions in a responsive table. Further, dbSUPER plots the distribution of overlapped regions for each cell/tissue type, and the overlapped regions can be exported to different formats. In the following sections, we explain these features in detail. Figure 1 illustrates the general workflow, features and user interface of the database.
Searching and browsing
dbSUPER supports comprehensive and user-friendly searching in several ways to bring the data to users more productively. Figure 2 illustrates the interactive searching and browsing activity of dbSUPER. The home page of the website provides a quick search utility. Using this utility, users can query the database for genes of interest, cell/tissue types and enhancer identification marks. The quick search uses jQuery-based auto-completion to guide users to the vocabulary available in the database. After the user clicks the search button, a new page displays the queried data in a responsive table. The database can be browsed for each tissue or cell type by clicking the “Browse Database” tab on the left-side navigation menu.
An “Advanced Search” link is available as an option for more detailed searches. The results are displayed in a dynamic table with sorting and filtering options. In this table (Figure 2C), each row is a super-enhancer and each column contains region-specific information, including the ID maintained by our database, genomic locus, size, associated gene, method used to rank enhancers, ChIP-seq signal, rank based on ChIP-seq signal strength, cell/tissue type, genome and a link to the UCSC Genome Browser. If a user browses the database on a mobile device such as a smartphone or tablet, a “+” sign appears at the beginning of each row and some information is hidden. The hidden information is not displayed unless the user touches the “+” sign, as shown in Figure 2E. This avoids horizontal scrolling by making the data fit the screen. By default, each page displays 25 records, and the user can view the remaining records using the pagination controls on the bottom right of the table. The number of records per page can be set to 10, 25, 50 or 100 using the “records per page” dropdown menu. The tabular data can be further filtered using the search box on the top right, and the user can sort the data by any field of interest. Details about each super-enhancer can be viewed by clicking the super-enhancer ID. Besides the general details about the super-enhancer, the page also lists details about the associated gene, including gene symbol, chromosome, transcription start site, transcription end site, strand and number of exons. Further, dbSUPER provides external links to the UCSC Genome Browser (40) (http://genome.ucsc.edu/), NCBI Gene (41, 42) (http://www.ncbi.nlm.nih.gov/gene/), GeneCards (43) (http://www.genecards.org), UniProt (44) (http://uniprot.org) and Wikipedia (http://en.wikipedia.org) so that users can use those sources to further study the selected super-enhancer.
For each super-enhancer region, a FASTA sequence file can be viewed and downloaded.
Data download and export
We provide all data in multiple formats, including BED, FASTA and UCSC custom tracks, for users to download. Downloading can be performed either from the download page or while browsing cell/tissue-specific data in the browse section. The user-queried data can be exported as Excel, CSV and PDF files using the respective buttons on the top right of each data table. dbSUPER also provides one-click features to copy data tables to the clipboard and to print them. The data can be provided freely in relational files upon request.
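For readers unfamiliar with the two region formats mentioned above, the following Python sketch shows how a list of regions could be written as BED lines and as a UCSC custom track (a `track` header line followed by BED lines). It only illustrates the general formats. The exact layout of the files served by dbSUPER is not described in the text, and the region ID, track name and coordinates used here are hypothetical.

```python
from typing import List, Tuple

# (chrom, start, end, name); BED coordinates are 0-based and half-open
BedRegion = Tuple[str, int, int, str]


def to_bed(regions: List[BedRegion]) -> str:
    """Render regions as tab-delimited BED4 lines without a header."""
    return "".join(f"{chrom}\t{start}\t{end}\t{name}\n" for chrom, start, end, name in regions)


def to_custom_track(regions: List[BedRegion], track_name: str, description: str) -> str:
    """Render regions as a UCSC custom track: one track line, then BED lines."""
    header = f'track name="{track_name}" description="{description}"\n'
    return header + to_bed(regions)


if __name__ == "__main__":
    # Hypothetical region and labels, for illustration only.
    example = [("chr1", 100000, 133588, "SE_example_1")]
    print(to_custom_track(example, "example_SE", "Example super-enhancers"), end="")
```

Keeping the BED writer separate from the track header means the same region list can feed both a plain download and a genome browser upload.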
Linking with other web servers and visualization
In order to provide a one-stop solution for searching and to facilitate downstream analysis, including functional annotation and visualization, we provide features to transfer data directly from our database to external web servers without downloading. Currently, dbSUPER supports three web servers to which data can be sent directly: Galaxy (45), GREAT (46) and Cistrome (47). Supplementary Figure S1 shows a demo run for the cell type LNCaP.
Linking with Galaxy server
The web interface provides a handy facility to directly send the data for each cell/tissue type to Galaxy (45) web server for further downstream analysis. Galaxy is a very useful public web server, which can be used for intensive data analysis using many integrated tools, creating pipelines, storing data and sharing analyses with others. This feature can be found under “Visualize and Send Data” tab of the user-interface. When the user clicks the Galaxy logo next to cell/tissue type of interest, dbSUPER will add a BED file to Galaxy history.
Linking with GREAT server
To perform functional prediction of super-enhancers by analysing the GO (Gene Ontology) annotations of the nearby genes and assigning biological meaning to them, we linked dbSUPER to GREAT (46) web server and provided the one-click facility to load data from our database to GREAT, and to perform the cell-type specific analysis. This feature can also be found under “Visualize and Send Data” tab of the user-interface and by clicking the GREAT logo next to cell/tissue type of interest.
Linking with Cistrome server
dbSUPER provides features to send data directly to the Cistrome Analysis Pipeline (47) to perform correlation analyses, gene expression analyses and motif discovery. Currently, Cistrome requires users to register to perform the analysis, so users need to register and log in at http://cistrome.org/ap/ before loading data from our database into Cistrome. This feature can also be found on the “Visualize and Send Data” page.
Visualizing in UCSC genome browser
A single super-enhancer region or the super-enhancers of an individual cell/tissue type can be visualized in the UCSC Genome Browser (40). This feature can be found on the “Visualize and Send Data” page and on the browse page of the dbSUPER user interface. Once a user clicks the visualization icon, dbSUPER takes the user to the UCSC Genome Browser, and a custom track is added to the browser session automatically. Super-enhancers for more than one cell type or sample can be visualized together by simply adding them to the UCSC Genome Browser, as the session keeps the previously added tracks.
Overlap analysis tool
We provide an overlap analysis tool to annotate user-submitted regions with the super-enhancers available in dbSUPER. We use the intersectBed tool from the BEDTools suite (48) to find super-enhancers in dbSUPER that overlap the submitted regions. The user is required to define a minimum percentage of overlap before running the analysis. By default, a super-enhancer in dbSUPER must overlap with user-defined regions by at least 10% to be reported as an overlapping super-enhancer. Users can also define the minimum percentage of overlap on both the dbSUPER regions and the uploaded regions. The overlap analysis can be performed by clicking the “Overlap Analysis” tab and uploading regions of interest in BED format. The BED file should be tab-delimited and without a header. After submission, the user receives an email with a private link to the results once the computation has finished. The analysis may take a while, depending on the number of regions uploaded. Two donut plots are generated: one shows the ratio of overlap within the individual tissue/cell types, and the other shows a total overlap map. All the overlapped regions can be downloaded in BED format and are also displayed in tabular form, from which they can be exported as CSV, Excel and PDF files. Supplementary Figure S2 shows the output of the overlap analysis tool with the necessary steps.
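The overlap criterion can be made concrete with a small pure-Python sketch. It does not call intersectBed and is not the dbSUPER implementation. It assumes that the minimum percentage is measured against the length of the super-enhancer and is evaluated against one submitted region at a time. The function names are hypothetical.

```python
from typing import List, Tuple

Query = Tuple[str, int, int]               # (chrom, start, end)
SuperEnhancer = Tuple[str, int, int, str]  # (chrom, start, end, id)


def parse_bed(text: str) -> List[Query]:
    """Parse tab-delimited BED text without a header; only the first three columns are used."""
    regions = []
    for line in text.splitlines():
        if not line.strip():
            continue
        chrom, start, end = line.split("\t")[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def overlap_bp(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of shared bases between two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def overlapping_super_enhancers(
    super_enhancers: List[SuperEnhancer],
    queries: List[Query],
    min_percent: float = 10,
) -> List[SuperEnhancer]:
    """Report super-enhancers covered by a single query region for at least min_percent of their length."""
    hits = []
    for chrom, start, end, se_id in super_enhancers:
        length = end - start
        for q_chrom, q_start, q_end in queries:
            shared = overlap_bp(start, end, q_start, q_end)
            # Compare as shared * 100 >= percent * length to avoid a division.
            if q_chrom == chrom and shared > 0 and shared * 100 >= min_percent * length:
                hits.append((chrom, start, end, se_id))
                break
    return hits
```

With the default of 10%, a 1,000 bp super-enhancer is reported when a submitted region on the same chromosome covers at least 100 bp of it.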
Technical background
The current version of dbSUPER was developed using MySQL 5.5 (http://www.mysql.com) and runs on Linux-based Apache servers. We used PHP 5.3 (http://www.php.net/) for server-side scripting. The interactive and responsive user interface was designed and built using Bootstrap 3 (http://www.getbootstrap.com), a popular responsive development framework comprising HTML, CSS and JavaScript. The user interface is responsive, which means that it detects the user's device and changes its structure and shape according to the device resolution in order to optimize the data view. This makes the interface compatible with a variety of devices and browsers with different screen resolutions, so the database can be browsed and searched from devices including smartphones and tablets. Although we recommend Google Chrome, Firefox and Safari for best results, the database also supports other recent standard web browsers, including IE version 8 and greater. We aim to improve the accessibility and user interactivity of dbSUPER by asking for user feedback through the contact page on our website. We also anonymously track user interactions with our website, including clicks and browser and device information. This helps us to learn which parts of our database are most important and which parts need to be improved.
Availability
The dbSUPER database is freely available for the research community using the web link (http://bioinfo.au.tsinghua.edu.cn/dbsuper). The users are not required to register or login to access any feature available in the database.
Discussion
Super-enhancers or stretch enhancers are cell-type specific and are associated with the key genes that drive cell-type-specific expression, and are linked to biological processes which define the cell identity. We integrated the information of these regions and their associated genes in the dbSUPER database. dbSUPER provides a rich collection of features including (i) Fast searching, browsing and visualization; (ii) Downloading and exporting data in different formats including BED, FASTA, UCSC custom tracks and CSV, Excel, PDF files; (iii) Linking with external web servers including Galaxy, GREAT, and Cistrome and sending data directly to perform downstream analyses; (iv) Providing the associated genes with links to various databases including GeneCards, UniProt and Entrez, and (v) An overlap analysis tool to check the overlap of user submitted regions with dbSUPER. The overall goal of this database is to provide a comprehensive resource and a set of interactive analysis tools to facilitate the further study of super-enhancers and their functions. The responsive user-friendly web interface facilitates efficient and comprehensive searching and browsing of the data. While there are still many unclear questions on the concept of super-enhancer and even controversy in its exact molecular definition, such an organized collection of all existing data in one compact database provides researchers a handy platform for studying those questions.
Currently, dbSUPER contains 66,033 super-enhancers for 96 human and 5 mouse tissue/cell types. Research on super-enhancers is progressing very fast, and we will keep adding more data to the database as they become available. In the future, we are also interested in collecting published research on in vivo validation of the computationally defined super-enhancers. We hope that, as more cell-type-specific validated data becomes available, we can construct highly reliable supervised predictive models for super-enhancers. Currently, we are working on adding more features, such as motif analysis, SNP information, tissue-specificity analysis and the use of additional datasets to find super-enhancers for other cell/tissue types. Those features will further extend the value of the database. More powerful user and session management modules are also under consideration, which will enable users to save their results and sessions and share them with their collaborators or the community. Based on the current progress in the field, we believe that dbSUPER will be of particular interest to people working on the molecular and systems biology of cancer and other diseases.
Supplementary data
Supplementary Data are available online.
Funding
This work is supported in part by the National Basic Research Program of China [2012CB316504], Hi-tech Research and Development Program of China [2012AA020401] and NSFC grant [91010016].
Acknowledgements
We would like to thank Dr. Richard Young at the Whitehead Institute for Biomedical Research, MIT, for his useful suggestions and comments about the database and for sharing the data produced by his research lab. We also thank the anonymous reviewers for their helpful suggestions.