Abstract
Summary Better protocols and decreasing costs have made high-throughput sequencing experiments accessible even to small experimental laboratories. However, comparing one or a few experiments generated by an individual lab with the vast amount of relevant data freely available in the public domain can be limited by a lack of bioinformatics expertise. Although several tools, including genome browsers, allow such comparisons at the level of a single gene, they do not provide a genome-wide view. We developed Heat*seq, a web tool that allows genome-scale comparison of high-throughput experiments (ChIP-seq, RNA-seq and CAGE) provided by a user with data in the public domain. Heat*seq currently contains over 12,000 experiments across diverse tissue and cell types in human, mouse and Drosophila. Heat*seq displays interactive correlation heatmaps, with the ability to dynamically subset datasets to contextualise user experiments. High-quality figures and tables are produced and can be downloaded in multiple formats.
Availability Web application: http://www.heatstarseq.roslin.ed.ac.uk/. Source code: https://github.com/gdevailly.
Contact Guillaume.Devailly@roslin.ed.ac.uk; Anagha.Joshi@roslin.ed.ac.uk
1 Introduction
High-throughput sequencing is becoming routine for many biological assays, including transcriptome analysis through RNA sequencing (RNA-seq) and identification of transcription factor (TF) binding sites through chromatin immuno-precipitation followed by sequencing (ChIP-seq). Additionally, collaborative projects such as Bgee (Bastian et al.), ENCODE (Bernstein et al., 2012), and Roadmap Epigenomics (Kundaje et al., 2015) have generated genome-wide datasets across hundreds of cell types or tissues. Although these large datasets are freely available in the public domain, the lack of computational tools accessible to experimental scientists with no or only elementary computational skills prevents the data from being used to their full potential for discovery.
Although genome browsers, including the summary tracks provided by many consortia, are extremely useful for studying a few genes, promoters or single nucleotide polymorphisms, they lack a genome-wide overview. Only a few public resources, such as the CODEX database (Sánchez-Castillo et al., 2015a) and the BLUEPRINT GenomeStats tool (Zerbino et al., 2014), allow a genome-wide comparison with user data. We therefore developed Heat*seq, a free, open-source web application providing fast and interactive comparison against high-throughput sequencing experiments in the public domain. The application provides clustered correlation heatmaps that summarise global similarities between all samples in a dataset and the user sample. Heat*seq provides over 12,000 publicly available genome-wide experiments in human, mouse and Drosophila. In summary, Heat*seq is an interactive web tool that allows users to contextualise their sequencing data with respect to vast amounts of public data in a few minutes, without requiring any programming skills.
2 Methods
2.1 Data collection
We collected gene expression (RNA-seq), TF ChIP-seq and CAGE (Cap Analysis of Gene Expression) data (over 4000 individual experiments) from Bgee (Bastian et al.), Blueprint epigenome (Pradel et al., 2015), CODEX (Sánchez-Castillo et al., 2015b), ENCODE (Bernstein et al., 2012), FANTOM5 (Forrest et al., 2014), FlyBase (Attrill et al., 2016), GTEx (Lonsdale et al., 2013), modENCODE (Celniker et al., 2009) and Roadmap Epigenomics (Bernstein et al., 2010), in human, mouse and Drosophila. Data formatting was done in R (scripts available on GitHub). The source of each dataset is listed in Table S1. Heatmaps represent Pearson’s correlation values between experiments, calculated from a different matrix for each data type: a Genes × Experiments numeric matrix of normalised gene expression values for expression data; a Genomic regions × Experiments binary matrix indicating the presence or absence of a peak for TF ChIP-seq data; and a Genomic regions × Experiments numeric matrix of expression values for CAGE data. Importantly, we constructed a metadata table that provides a web link to the original data and allows users to subset each dataset.
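The correlation step described above can be illustrated with a minimal sketch. It is written in plain Python for illustration only and is not the authors' R code. The function names (`pearson`, `correlate_with_dataset`) and the toy peak matrix are hypothetical. For binary presence/absence vectors, Pearson's correlation is equivalent to the phi coefficient.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / sqrt(var_x * var_y)

def correlate_with_dataset(user_vector, dataset):
    """Correlate a user vector with every experiment (column) of a dataset.

    `dataset` maps an experiment name to a vector defined over the same
    rows (genes or genomic regions) as `user_vector`.
    """
    return {name: pearson(user_vector, column) for name, column in dataset.items()}

# Hypothetical toy ChIP-seq data: 6 genomic regions, 1 = peak present, 0 = absent.
peaks = {
    "exp_A": [1, 1, 0, 0, 1, 0],
    "exp_B": [0, 0, 1, 1, 0, 1],
    "exp_C": [1, 1, 0, 1, 1, 0],
}
user = [1, 1, 0, 0, 1, 0]
correlations = correlate_with_dataset(user, peaks)
```

The same function applies unchanged to numeric expression vectors (RNA-seq or CAGE), provided the user vector and the dataset columns share the same row order.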
2.2 Web-application development
Heat*seq is an open-source, interactive R Shiny application that computes correlation values between the user file and each experiment in a dataset.
It offers an a posteriori linear up-scaling of correlation values to correct for systematic biases due to different library preparation methods, data analysis methods or sample quality (Figure S1). Detailed user instructions are available on the application website.
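The text does not give the scaling formula, so the following is only one possible reading of a linear up-scaling, sketched in Python under explicit assumptions: all user correlations are multiplied by a single factor so that the largest one reaches a chosen target (for example, the highest correlation observed between two distinct public experiments). The function name `upscale_correlations` and the parameter `target_max` are hypothetical and are not taken from the Heat*seq source code.

```python
def upscale_correlations(user_cors, target_max):
    """Linearly scale user correlations so that the largest equals target_max.

    ASSUMPTION: this is an illustrative interpretation of "linear up-scaling",
    not the formula used by Heat*seq. Values are only ever scaled up, and the
    result is clamped to the valid correlation range [-1, 1].
    """
    current_max = max(user_cors.values())
    if current_max <= 0:
        return dict(user_cors)  # no positive correlation to scale towards
    factor = target_max / current_max
    if factor <= 1:
        return dict(user_cors)  # already at or above the target: leave unchanged
    return {name: min(1.0, max(-1.0, r * factor)) for name, r in user_cors.items()}

# Made-up correlations between a user sample and three public experiments.
raw = {"exp_A": 0.4, "exp_B": 0.2, "exp_C": -0.1}
scaled = upscale_correlations(raw, target_max=0.8)
```

Because a single multiplicative factor is applied, the ranking of public experiments relative to the user sample is preserved; only the colour scale of the heatmap row would change.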
3 Results
3.1 Application description
Heat*seq comprises three applications, one per supported data type: HeatRNAseq, HeatChIPseq and HeatCAGEseq. Data upload, correlation calculation and heatmap generation take about a minute. Importantly, users can interactively subset relevant experiments using the metadata information (e.g. cell type, TF name). The interactive heatmap also allows users to select different clustering methods and to zoom in and out. High-resolution figures and tables can be downloaded in multiple formats. Thus, Heat*seq provides a global overview of the relationships between public experiments and the user data. Four user scenarios are discussed below.
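As a rough illustration of metadata-based subsetting, the sketch below filters experiments by metadata fields and then ranks the retained experiments by their correlation with the user sample. It is written in Python for illustration only. The field names (`tf`, `cell_type`), experiment identifiers and correlation values are all made up, and the helper functions are hypothetical rather than part of Heat*seq.

```python
def subset_by_metadata(metadata, **criteria):
    """Return the names of experiments whose metadata match every criterion."""
    return [
        name
        for name, fields in metadata.items()
        if all(fields.get(key) == value for key, value in criteria.items())
    ]

def top_matches(user_cors, names, n=5):
    """Return the n selected experiments most correlated with the user sample."""
    selected = [(name, user_cors[name]) for name in names if name in user_cors]
    return sorted(selected, key=lambda item: item[1], reverse=True)[:n]

# Hypothetical metadata table: one entry per public experiment.
metadata = {
    "exp_1": {"tf": "ER", "cell_type": "cell_X"},
    "exp_2": {"tf": "ER", "cell_type": "cell_Y"},
    "exp_3": {"tf": "MYC", "cell_type": "cell_X"},
}
# Made-up correlations between the user sample and each public experiment.
user_cors = {"exp_1": 0.35, "exp_2": 0.62, "exp_3": 0.10}

er_names = subset_by_metadata(metadata, tf="ER")
best = top_matches(user_cors, er_names, n=1)
```

Restricting the comparison to one factor or cell type in this way mirrors the kind of selection a user makes in the interface before inspecting the heatmap.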
3.2 User scenarios
3.2.1 User data quality control
We compared an RNA-seq sample from neocortex at 10 days post-partum (Ray et al., 2015) with Bgee mouse RNA-seq data using HeatRNAseq. The five highest correlation values (Pearson correlation coefficient, PCC > 0.9) all correspond to Bgee brain samples (Table S2). Thus, Heat*seq can be used as a fast quality check for next-generation sequencing data.
3.2.2 Cell context identification
We compared an oestrogen receptor alpha (ERα) ChIP-seq experiment in MCF7 cells (Zhuang et al., 2015) with the ENCODE TFBS dataset, restricted to the ENCODE ER ChIP-seq experiments. The binding pattern of ERα in MCF7 cells was more similar to its binding pattern in T-47D cells than in ECC-1 cells (Figure 1A). MCF7 and T-47D were both derived from mammary tumours, whereas ECC-1 is an endometrial cell line.
Figure 1. A. ERα ChIP-seq in MCF7 cells from Zhuang et al. is closer to ENCODE ERα ChIP-seq in T-47D than in ECC-1 cells. B. BRF1 and RNA PolIII bind tRNA genes, but not BRF2. C. c-MYC ChIP-seq experiments in H1-hESC from UT-A and Stanford show low correlation. D. Two erythroblast RNA-seq samples from BLUEPRINT are closely related to endothelial cells, while the five others are not.
3.2.3 New hypotheses by data integration
Comparing CpG islands (CGIs) from UCSC (Karolchik et al., 2004) with public experiments using HeatChIPseq showed that RNA polymerase II and TAF1 (Table S4) were enriched at CGIs, consistent with about 50% of human gene promoters containing a CGI (Illingworth and Bird, 2009). Interestingly, we also identified factors that avoid CGIs, including MAFK, GATA3 and ZNF274.
Similarly, using HeatChIPseq, tRNA promoters were highly correlated with the binding of RNA polymerase III and its co-factors BDP1, RPC155 and BRF1 (Table S4). Interestingly, comparison with BRF family data revealed that BRF1, but not BRF2, was bound at tRNA genes (Figure 1B).
3.2.4 Public data assessment
Heat*seq can also be used to assess data in the public domain, as the following two examples illustrate. First, a MYC ChIP-seq experiment in H1-hESC cells does not cluster with other ENCODE MYC ChIP-seq experiments (Figure 1C), including an H1-hESC sample from a different experimental group (Devailly et al., 2015).
Second, two out of seven erythroblast RNA-seq samples from the Blueprint Epigenome consortium are more correlated with endothelial cells than with the rest of the erythroblast samples (Figure 1D).
4 Conclusion
With Heat*seq, comparing RNA-seq, ChIP-seq or CAGE experiments to hundreds of publicly available datasets becomes a trivial task. Researchers can now investigate the relationships between various high-throughput sequencing experiments quickly and interactively, without requiring any programming skills. Such analyses can be used to assess data quality and cell variability, and to generate novel regulatory hypotheses.
Funding
AJ is a Chancellor’s fellow and the AJ lab is supported by institute strategic funding from the Biotechnology and Biological Sciences Research Council (BBSRC, BB/J004235/1). GD is funded by the People Programme (Marie Curie Actions FP7/2007-2013) under REA grant agreement No PCOFUND-GA-2012-600181.
Conflict of Interest
None declared.
Acknowledgements
We would like to thank Barry Horne for the R shiny server set-up and administration and the Edinburgh R user group (EdinbR) for their advice and support.