Abstract
The 3D organization of the genome and epigenetic marks play important roles in gene expression, DNA repair, and chromosome segregation. Understanding how structure and composition of the chromatin fiber contribute to function requires integrated analysis of multiple genomics datasets from various techniques, experimental conditions, and cell states. Genome browsers facilitate such analysis, yet currently visualize only a few regions at a time and lack statistical functions that are often necessary to extract meaningful information. Here, we present HiCognition, a visual exploration and machine-learning tool based on a new genomic region set concept, which enables detection of patterns and associations between 3D chromosome conformation and collections of 1D genomics profiles of any type. By revealing how transcriptional activity and cohesin subunit isoforms contribute to chromosome conformation, we showcase how the flexible user interface and machine learning tools of HiCognition can help understand the relationship between structure and function of the genome.
Regulated expression, maintenance, and propagation of genetic information depend not only on the DNA sequence, but also on the thousands of different proteins and posttranslational modifications that enrich at specific sites of the genome. The regulation and function of genomes further depend on an intricate organization of DNA in 3D space1,2, established by DNA looping3, chromatin phase separation4–6, and potentially other processes. How 3D genome organization relates to local variation in chromatin composition, DNA sequence, and physiological function is a key question for understanding the function of complex genomes. The advent of techniques mapping function, composition, and 3D organization genome-wide provides rich sources of complex data to address this challenge. Curated public repositories of various functional and 3D genomics data, e.g., Encyclopedia of DNA Elements (ENCODE)7,8 and 4Dnucleome9, provide opportunities for experimentalists to assess their data in the context of multi-dimensional epigenetic and spatial signatures. However, the challenge of extracting meaningful information from large sets of complex data has hampered progress.
A common approach to identifying biologically relevant patterns is to study relationships between multiple independent experiments, representing different assays, molecular components, cell states, or treatments. For example, the observation that the protein complex cohesin enriches at insulation sites of transcriptional regulation10 and at the boundaries of topologically associated domains (TADs)11 has inspired models for how the genome is organized by cohesin-mediated loop extrusion12–14, with broad implications for various processes3. Detecting associations between multiple genomics datasets is facilitated by genome browsers such as the UCSC genome browser15–17, which provide side-by-side views of functional genomics data and support user interaction by panning and zooming. However, available genome browsers visualize only a small number of regions at a time, which restricts the assessment of large genomes and highly heterogeneous signals in genomic profiles. To facilitate visualization and grouping of small multiples of genomic regions, a set of tools has recently been developed to leverage the concept of visual piling18,19. While these tools allow detection of patterns in single genomic tracks, they do not support integration of different data sources and have performance limitations with large sets of genomic views.
Systematic analysis of correlations in multiple independent genomics datasets often starts by defining a specific type of genomic region based on a common function (e.g., genes) or experimental observation (e.g., ChIP-seq peaks). Owing to the necessity to interface different datatypes and to combine algorithms from different sources, the analysis of genomic region sets is typically performed by script-based approaches20–22. While script-based analysis provides flexible access to powerful statistics and machine learning tools23–25, it often takes a lot of time and requires advanced programming expertise to adapt workflows for investigation of new biological questions. Many wet-lab biologists have limited expertise in scripting or programming and therefore delegate advanced data analysis tasks to dedicated computer scientists, which represents a severe bottleneck in testing and developing new hypotheses.
Here, we present HiCognition, a tool for interactive visualization and statistical analysis of 3D genomics data and other (epi)genetic profiles based on a region set concept. HiCognition combines a visual exploration interface with high-performance data processing and statistical and machine learning tools. Thereby, HiCognition allows biologists without programming skills to systematically explore their large multi-dimensional genomics data, providing unprecedented opportunities for discovering fundamental mechanisms underlying the organization and function of the genome.
Results
Exploring genomic region sets in multi-dimensional feature space
In contrast to conventional 3D genome browsers like JuiceBox17 or HiGlass16, which visualize a specific subregion of the genome that can be panned or zoomed, HiCognition has been designed for interactive analysis of large sets of genomic regions that are pre-defined by the user before data exploration. The genomic region set approach of HiCognition allows users to address biological questions about how a specific type of region is composed, regulated, and organized in 3D space. The genomic region set can be freely defined by the user, for example, based on a common function (e.g., genes, enhancers, or origins of replication), based on molecular composition (e.g., regions with specific histone modifications or enrichment sites of proteins), or based on 3D organization (e.g., loops or topologically associated domains). The region set is provided as input data to HiCognition by a file containing genome coordinates. HiCognition then allows the user to explore associations between the genomic region set and large collections of genomics features, which can be downloaded from public repositories or from lab-internal experiments.
In HiCognition, genomic features can contain any type of numerical data associated with genomic coordinates26–28, including two-dimensional data like chromosome conformation contact maps (e.g., from Hi-C29 or SPRITE30,31), or one-dimensional data such as protein binding profiles (e.g., ChIP-seq32 or Cut&Run33 read densities), chromatin accessibility measurements (e.g., ATAC-seq34 or MNase-seq35), transcriptional activity (e.g., GRO-seq36), or replication timing measurements (e.g., Repli-seq37). Moreover, genomic features can contain data from unperturbed conditions as well as data obtained after genetic or chemical treatments, or data from different cell states (e.g., cell cycle stage or differentiation state), thereby enabling queries of how specific types of regions respond to perturbations or state transitions. HiCognition combines an intuitive and configurable graphical user interface with statistics and machine learning methods to enable interactive exploration of multi-dimensional genomics data within versatile workflows.
HiCognition supports data analysis by three basic approaches (Fig. 1a):
Exploring average distributions: HiCognition visualizes average magnitudes of genomic signals within the region window, whereby the features can be interactively selected by the user.
Exploring region heterogeneity: HiCognition visualizes genomic signals of individual regions to visually explore heterogeneity in the region set. Moreover, multi-dimensional cluster analysis and visualization of region distributions in embedding plots allows identification of region sub-sets with common properties.
Enrichment analysis: HiCognition automatically detects features that are enriched or depleted in the specific region set under investigation relative to the genome-wide average. It further shows where within the genomic region window individual features are particularly enriched or depleted. This enables the discovery of regulatory, functional, or spatial patterns characteristic of the region set under investigation.
The user interface of HiCognition is based on a widget architecture that allows easy configuration of views. These widgets represent genomic features and are arranged within widget collections that are associated with a specific genomic region set (Fig. 1b). This arrangement maps the abstract region set concept to a specific user interface component, allowing users to construct views that integrate different genomic features to understand the properties of a genomic region set. Specifically, following import and pre-processing of region and feature data sets, HiCognition widgets generate average feature signal plots of all regions, as well as stacked representations of individual regions, whereby the graphical user interface allows interactive adjustment of region size, resolution, look-up table, contrast, etc. For automatic detection of genomic features enriched in the region set, HiCognition provides a widget for locus overlap analysis (LOLA38), which is displayed as a ranked feature plot. For the analysis of heterogeneity within the region set, a clustering and embedding widget automatically groups regions based on similarity in multi-dimensional feature space and represents their distribution in embedding plots. The embedding plots are interactive and display feature patterns for individual region clusters to allow fast, interactive exploration of heterogeneity within the region set. Overall, this widget architecture with interactive visualization integrates improved versions of domain-specific tools38 and creatively applies state-of-the-art machine learning for embeddings39 and clustering.
HiCognition is implemented as a web-based tool that allows performant analysis of large datasets and interactive exploration of aggregation results. The software is open source and fully containerized, such that it can run on centralized servers or locally. An integrated database for region sets and features makes HiCognition a hub for various data types from public or private sources, whereby a session concept allows sharing of insights as fully customizable views and analysis workflows with others. A showcase server for hands-on experience can be accessed at https://www.hicognition.com/app.
Revealing common patterns in region sets
To exemplify the power of HiCognition’s region set approach, we analyzed the chromatin fiber organization around all transcriptional start sites (TSS) of protein-coding genes annotated in the human genome40. TSS are known to frequently contact upstream and downstream regions; at the same time, TSS insulate against contacts between upstream and downstream genomic regions41–45. Using published ChIP-seq data from HeLa cells8,46, we first visualized the distribution of two key architectural regulators, cohesin (based on its subunit Structural Maintenance of Chromosomes 3, SMC3) and CCCTC-binding factor (CTCF) using HiCognition’s 1D average widget. A prominent enrichment of both proteins at TSS (Fig. 2a, panel i) supports a role of cohesin-mediated DNA looping in shaping the conformation around TSS10,41,47,48.
To assess the 3D organization of protein-coding genes, we next visualized the genome-wide average contact probability around TSS using the 2D average widget and published Hi-C data49 (Fig. 2a, panel ii). Prominent stripes emerging from the TSS towards upstream and downstream regions indicate frequent interactions of TSS with distal genomic regions. Moreover, contacts within regions upstream or downstream of the TSS were much more frequent than between upstream and downstream regions (Fig. 2a, visible as red and blue areas, respectively), as previously observed41–44. Thus, HiCognition allows simple visualization of genome-wide averages for region-type-specific conformations.
To assess the contribution of cohesin-mediated looping to the conformation at TSS, we next used the 2D average widget to visualize published Hi-C data obtained from cells depleted of Nipped-B-like protein (NIPBL)49, a cofactor essential for cohesin-mediated loop extrusion50,51 (Fig. 2a, panel iii). The stripes emerging from TSS and the square regions of high contact probability that were characteristic of unperturbed controls were almost completely suppressed in the Hi-C maps obtained from NIPBL-depleted cells, indicating a key role of cohesin-mediated looping in establishing these structures, consistent with previous observations41,48. Thus, HiCognition enables fast and interactive side-by-side visualization of genome-wide average profiles across various techniques and experimental conditions.
Understanding heterogeneity within region sets
Understanding the relationship between chromatin fiber composition, 3D conformation, and physiological function has remained challenging owing to the heterogeneity of regions defined by a common feature under investigation. HiCognition’s region set approach allows fast and simple visualization of regional heterogeneity and supports interactive clustering of these regions based on multiple genomic features. To demonstrate how HiCognition’s flexible widget architecture can be used for heterogeneity analysis of region sets, we investigated how histone posttranslational modification patterns relate to chromosome conformation around genes. Using the Stacked lineprofiles widget, we visualized for the genome-wide set of TSS regions the ChIP-seq read densities of two histone posttranslational modifications, H3K9ac and H3K27me3, which enrich at transcriptionally active or inactive chromatin, respectively52,53. Sorting the line profiles by H3K9ac abundance showed that only about half of the TSS regions were enriched for this mark (Fig. 2b, panel i). Moreover, displaying stacked line profiles of H3K27me3 ChIP-seq read density in a separate widget and sharing the sort order between widgets showed that TSS regions enriched in H3K9ac are depleted of H3K27me3 (Fig. 2b, panel ii). Thus, coupling multiple widgets by sorting allows intuitive visual assessment of correlations between genomic features.
Next, we aimed to identify region subsets with distinct histone modification profiles for the study of the corresponding Hi-C conformations, considering an extended set of ten different histone posttranslational modifications (see methods for details). HiCognition’s Embedding widget visualizes regional heterogeneity based on multi-dimensional feature values, which can contain linear profiles such as ChIP-seq data or Hi-C contact matrices (Fig. 2b, panel iii). Besides visualizing heterogeneity, the Embedding widget automatically groups regions into clusters by feature similarity. The features enriched or depleted in each cluster are interactively displayed by pointing to clusters. We selected two clusters enriched either in marks for transcriptionally active chromatin or transcriptionally repressed chromatin (Fig. 2b-d) to create two new region subsets for analysis of the corresponding Hi-C conformations.
Using the 2D average widget and the Hi-C data of HeLa cells, we observed pronounced high-contact stripes and insulation around TSS for the region subset enriched in active chromatin marks, whereas these Hi-C structural features were entirely absent in the region subset enriched in repressive histone marks (Fig. 2e, f, panels i), consistent with previous script-based analyses of mouse stem cell data41. To investigate how cohesin-mediated DNA looping contributes to chromosome conformation at TSS residing in transcriptionally active or inactive chromatin, we visualized average Hi-C maps of NIPBL-depleted cells, using published data49. For the region subset enriched in transcriptionally active histone marks, we found strong reduction of stripes and insulation around TSS, whereas the region subset with repressive marks was unaffected by NIPBL depletion (Fig. 2e, f, panels ii). Together, these data suggest that cohesin-mediated DNA looping establishes a specific chromosome architecture around transcriptionally active TSS but not at inactive TSS. Thus, HiCognition’s flexible widget architecture enables simple and powerful analysis workflows to explore regional heterogeneity and to detect interactions between different types of genomics data.
Discovering new associations with HiCognition
Public repositories such as ENCODE8 or the 4Dnucleome9 contain thousands of different genomics data sets derived from diverse technologies, cell types, and experimental conditions. The difficulty of interpreting such complex data has prompted the development of various computational methods to detect associations between specific types of regions and features describing the chromatin fiber, such as GREAT54, the Encode ChIP-seq significance tool55, GenometriCorr56 and Locus Overlap Analysis (LOLA)38. HiCognition provides an improved implementation of LOLA, extended by interactive exploration of feature enrichment in distinct genomic sub-bins obtained from a region set. We exemplify association analysis with HiCognition’s Lola widget by investigating how cohesin subunit isoforms relate to chromosome conformation.
Cohesin contains three core subunits that form a ring, and an associated Stromal Antigen (STAG) subunit of which vertebrates encode two isoforms, STAG1 and STAG257–60. Previous script-based analysis of ChIP-seq profiles and Hi-C data showed that STAG2-cohesin predominantly forms loops at active TSS, whereas STAG1-cohesin predominantly contributes to the formation of TADs58,61–63. Here, we aim to recapitulate these findings and search for new associations by the automated machine learning tools and interactive workflows of HiCognition. We created a region set centered on all 34,857 SMC3 ChIP-seq peaks and then clustered SMC3 regions based on the abundance of STAG1 and STAG2, using the Embedding widget and published ChIP-seq data63 (Fig. 3a, b). Comparing ChIP-seq read densities with the 1D average widget showed that the region subset enriched in STAG1 contained less SMC3 than the region subset enriched in STAG2 (Fig. 3c, d).
To visualize the chromosome conformation around these region subsets, we used the 2D average widget and published Hi-C data49. Strikingly, the STAG1-enriched sites had much more pronounced long-range contacts than the STAG2-enriched sites (Fig. 3c, d, panels iii), despite the lower abundance of the core cohesin subunit SMC3 at STAG1-enriched sites (Fig. 3c, d, panels ii). To determine in which genomic context STAG1- or STAG2-enriched sites predominantly reside, we used the Lola widget to analyze 11 region sets including histone posttranslational modifications, TAD boundaries, and the cohesin-associated protein Sororin that is required for cohesion maintenance in G264,65. This analysis showed that STAG1-enriched sites predominantly reside at TAD boundaries, whereas STAG2-enriched SMC3 peaks predominantly reside in chromatin bearing marks of active transcription (Fig. 3c, d, panels iv), supporting the previously reported distinct localization and function of cohesin bound to STAG1 or STAG2, respectively58,61–63. Moreover, STAG1-enriched cohesin sites also prominently overlapped with Sororin sites detected by ChIP-seq in G2 phase of the cell cycle46, indicating a previously unrecognized association between genomic sites of sister chromatid cohesion and genomic sites where STAG1-enriched cohesin forms long-range loops in G1. Importantly, HiCognition’s region-set-based approach and flexible widget architecture enable detection of such complex associations within a few minutes. Thus, HiCognition allows biologists untrained in genomic analysis to rapidly perform their own analyses, discover new associations, and generate new hypotheses, greatly reducing the bottleneck between data generation and interpretation.
Discussion
HiCognition extends interactive genome exploration to comprehensive views of genome-wide region sets defined by a common property. Its flexible user interface and integrated statistics and machine learning tools support the detection of common patterns, heterogeneity, and associations in complex genomics datasets representing 3D conformation, epigenetic profiles, and functional readouts. A fast and computationally efficient implementation allows real-time browsing through thousands of genomic regions, thereby accelerating hypothesis testing on genomics data from various experimental techniques, experimental conditions, or cell states.
HiCognition’s rich online documentation and containerized distribution supporting desktop as well as server installations provide easy access for both experienced developers and beginner analysts. The integrated database and interfaces to widely used file formats allow assessment of a biologist’s own data in the context of the vast amount of public data available from resources like ENCODE or 4Dnucleome. HiCognition’s streamlined workflows and visualization concepts enable users to address a broad range of biological questions, yet the focus on usability limits customizability compared to approaches that simply provide a graphical interface to command-line tools66 or custom scripts67. Via the export of region set coordinates derived from clustering and association analysis, however, HiCognition can be seamlessly integrated with script-based analysis for extended functionality. Hence, HiCognition allows biologists lacking programming skills to rapidly reduce the space of possible hypotheses before applying more time-consuming methods. Furthermore, the software’s modular design and open-source implementation in Python provide an extendable framework for the development of new machine learning algorithms and visualization concepts. Therefore, we foresee that HiCognition will serve as a bridge between the experimentalists who formulate biological hypotheses and specialized computer scientists implementing script-based analysis workflows.
While HiCognition’s potential is exemplified here by an analysis of epigenetic marks and topological structures formed by cohesin, the software is applicable to any type of 1D or 2D genomics data. Its ease of use and data integration based on the region set concept will provide new opportunities for discovering relationships between structure and function of the genome.
Author contributions
Conception: M.M., C.C.H.L., D.W.G.; software design and implementation: M.M., C.C.H.L.; data analysis and interpretation: D.W.G., M.M., R.R.S.; manuscript writing: M.M., C.C.H.L., D.W.G.; funding acquisition and supervision: D.W.G.
Competing interests
The authors declare no competing interests.
Methods
Software architecture
HiCognition is a containerized application (https://github.com/docker/compose) and is designed as a server-client web app to minimize set-up requirements and to facilitate use by non-technical users after set-up (Fig. S1a).
The backend portion of HiCognition is implemented as a Flask webserver (https://github.com/pallets/flask) with NGINX (https://github.com/nginx) as a reverse proxy that operates in conjunction with a MySQL database (https://github.com/mysql) to persist metadata and data preprocessing results. The server utilizes a Redis task queue (https://github.com/rq/rq) to offload time-intensive computation tasks to an adjustable number of worker containers. The communication between these workers and the main server is implemented via network requests (when submitting a task) and the MySQL database (when registering a task as complete). This organization allows the operation of the worker containers on separate machines that could, in principle, be started on demand.
The frontend part of HiCognition is implemented in JavaScript and uses the Vue.js framework (https://github.com/vuejs/vue) to manage components and implement reactivity. The visualizations are custom-designed for each type of data widget (see below for details) and are implemented either using the data-driven visualization library D3.js (https://github.com/d3/d3) or in case of more demanding visualizations using PixiJS (https://github.com/pixijs/pixijs).
For implementation details of the HiCognition architecture, see the GitHub repository (https://github.com/gerlichlab/hicognition) and the accompanying documentation page (https://gerlichlab.github.io/hicognition/docs/).
Point- and interval-regions
As genomic data frequently span multiple length scales16,17, visualization concepts have to adapt to this challenge. HiCognition solves this problem by precomputing a “resolution-stack” for each genomic region-set (Fig. S1b). This precomputation is adapted for two types of genomic regions supported by HiCognition:
Point-regions are specified by center coordinates and the region surrounding the center position can be adjusted interactively for analysis and visualization. This enables the user to zoom in and out of genomic regions when viewing data to discover genomic effects at multiple length scales.
Interval-regions are specified by start and end coordinates and each region includes 20% neighboring regions on either side. The processing bin size for this region type is adjusted by normalization to the interval size, and is thus different for differently sized regions. Interval-regions allow investigation of length-independent patterns, for example profiles of genes that are scaled to transcription start and termination sites.
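The two region types can be illustrated with a minimal Python sketch. This is not HiCognition's implementation: the `Window` class, the function names, the fixed count of 100 bins per interval, the example window sizes, and the reading of "20%" as 20% of the interval length are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    chrom: str
    start: int
    end: int
    bin_size: int


def point_region_window(chrom, center, half_width, bin_size):
    """Window of +/- half_width around a center coordinate, with a fixed bin size."""
    start = max(0, center - half_width)
    return Window(chrom, start, center + half_width, bin_size)


def interval_region_window(chrom, start, end, n_bins=100, flank_percent=20):
    """Interval padded on both sides; the bin size scales with the interval length.

    Assumption: the flank is flank_percent of the interval length, and every
    interval is divided into the same (hypothetical) number of bins.
    """
    length = end - start
    flank = length * flank_percent // 100
    padded_start = max(0, start - flank)
    padded_end = end + flank
    bin_size = max(1, (padded_end - padded_start) // n_bins)
    return Window(chrom, padded_start, padded_end, bin_size)


# A toy "resolution stack" for one point-region: several window sizes,
# each paired with its own bin size (values are illustrative only).
resolution_stack = [
    point_region_window("chr1", 1_000_000, half_width, bin_size)
    for half_width, bin_size in [(50_000, 5_000), (200_000, 20_000), (1_000_000, 50_000)]
]

# Two intervals of different length end up with the same number of bins
# but different bin sizes, which makes their profiles comparable.
short_gene = interval_region_window("chr1", 100_000, 110_000)
long_gene = interval_region_window("chr1", 500_000, 600_000)
```

In this sketch, a 10 kb interval is processed in 140 bp bins and a 100 kb interval in 1,400 bp bins, so both yield profiles of equal length.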
Data management and preprocessing
HiCognition contains a dataset manager that stores available datasets as well as finished pre-computations in a MySQL database. The user interface of HiCognition distinguishes between two principal types of data – genomic regions of interest and genomic features that are available for precomputation (Fig. S2a). Users can add and view datasets in an interactive table that allows filtering and editing (Fig. S2b).
HiCognition supports the most common input data formats for genomic regions and features. Specifically, genomic regions can be added as bed-files15, 1D-features as bigwig files68 and 2D-features as cooler files20. These files can be uploaded one at a time or using a bulk upload feature (see our documentation at https://gerlichlab.github.io/hicognition/docs/data_management/ for details).
To select a region-set of interest, the user can submit preprocessing tasks using the preprocessing dialogues and get an overview of running and finished computations via the dataset viewer of the genomic regions (Fig. S2c). Once pre-computation of a combination of a region-set of interest and a genomic feature has finished, it is available for display.
Many preprocessing steps involve analysis of genomic feature collections, for example, when calculating enrichment amongst a set of candidate features or embedding regions based on the values of multiple features (see below for details). In HiCognition, users can create feature collections in a specific dialogue window and select them for preprocessing and display.
HiCognition also supports adding and managing multiple genome assemblies to analyze and compare data generated for different genome assemblies and species.
Data and workflow sharing
HiCognition allows storing specific arrangements of widgets, widget collections, and the corresponding data under display as named sessions. This is possible due to an implementation of the HiCognition analysis view as declarative configurations stored in the Vuex frontend storage (https://github.com/vuejs/vuex/). Here, the arrangement, settings, and data sources loaded in a particular widget are stored as JavaScript objects, and HiCognition reacts to changes therein by adjusting the displayed view. This makes it easy to restore saved sessions from configuration objects stored in the database and to share saved sessions with collaborators through a static link.
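The idea of a session as a declarative configuration can be sketched as follows. The sketch is written in Python rather than the JavaScript used by the frontend, and all keys and values (`collections`, `widgets`, the widget type and feature names) are hypothetical and do not reflect HiCognition's actual schema.

```python
import json

# A hypothetical session: everything needed to rebuild the view is plain data.
session = {
    "name": "TSS overview",
    "collections": [
        {
            "region_set": "protein_coding_tss",
            "widgets": [
                {"type": "average_1d", "feature": "SMC3_chipseq", "row": 0, "col": 0},
                {"type": "average_2d", "feature": "hic_control", "row": 0, "col": 1},
            ],
        }
    ],
}

# Serializing the configuration gives something a database row could hold ...
stored = json.dumps(session, sort_keys=True)

# ... and deserializing it restores the same arrangement on another client.
restored = json.loads(stored)
```

Because the view is fully described by data, restoring or sharing a session reduces to loading such an object again.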
Widgets and visualization concepts
HiCognition uses widget-collections as a container to display specific visualizations (Fig. 1b). A widget collection has a single region-set that is shared by all its contained widgets. Each widget in the collection represents a genomic feature or a collection of genomic features and provides a suitable visualization for the respective data (Fig. 1b).
1D-average widget
The 1D-average widget displays the average magnitude of a 1D genomic feature, as for example ChIP-seq reads, for the selected region set in the widget collection as a line plot. The preprocessing algorithm extracts snippets of the relevant genomic feature for each genomic region and calculates the average value over all snippets along the relative genomic offset.
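The averaging step can be sketched in a few lines of Python. This is an illustration only: the function name, the toy values, and the choice to skip missing values (`None`) are assumptions and are not taken from the HiCognition source.

```python
from statistics import fmean


def average_profile(snippets):
    """Average equally long 1D snippets position by position.

    Missing values (None) are skipped; this handling is an assumption
    of the sketch.
    """
    n_positions = len(snippets[0])
    if any(len(s) != n_positions for s in snippets):
        raise ValueError("all snippets must have the same number of bins")
    profile = []
    for offset in range(n_positions):
        column = [s[offset] for s in snippets if s[offset] is not None]
        profile.append(fmean(column) if column else None)
    return profile


# Toy snippets: one row per genomic region, one column per relative offset.
snippets = [
    [0.0, 2.0, 8.0, 2.0, 0.0],
    [1.0, 4.0, 6.0, None, 1.0],
    [2.0, 3.0, 7.0, 4.0, 2.0],
]
profile = average_profile(snippets)
```

The same per-position logic carries over to 2D features, where the average is taken per pixel instead of per offset.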
2D-average widget
The 2D-average widget displays the average magnitude of a 2D-genomic feature, for example a Hi-C contact probability map, for the selected region set in the widget collection as a 2D heatmap. The preprocessing algorithm extracts snippets of the 2D-genomic feature for each rectangular genomic region and calculates the average value over all snippets for each pixel.
Stacked line profile widget
The stacked line profile widget displays individual examples of 1D-genomic features for the selected region set in the widget collection as a 2D heatmap. Within this heatmap, each row represents a specific genomic region. The preprocessing algorithm extracts the relevant genomic feature snippets for each genomic region (subsampled to contain a maximum of 1000 regions) and “stacks” them vertically to form a matrix for display.
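Stacking, subsampling, and sharing a sort order between two widgets could look roughly like the sketch below. The function names, the random seed, the sort criterion (mean signal per row), and the toy values are assumptions; only the cap of 1000 regions is taken from the text.

```python
import random


def stack_profiles(snippets, max_regions=1000, seed=0):
    """Subsample to at most max_regions rows; return (row_ids, matrix)."""
    row_ids = list(range(len(snippets)))
    if len(row_ids) > max_regions:
        # A fixed seed selects the same regions for every feature.
        row_ids = sorted(random.Random(seed).sample(row_ids, max_regions))
    return row_ids, [snippets[i] for i in row_ids]


def sort_order_by_mean(matrix):
    """Row order from highest to lowest mean signal."""
    return sorted(range(len(matrix)), key=lambda r: -sum(matrix[r]) / len(matrix[r]))


def apply_order(matrix, order):
    """Reorder the rows of a stacked matrix."""
    return [matrix[r] for r in order]


# Toy values for two features measured on the same three regions.
feature_a = [[0, 1, 0], [5, 9, 5], [2, 3, 2]]
feature_b = [[6, 7, 6], [0, 0, 1], [3, 3, 3]]

_, stack_a = stack_profiles(feature_a)
_, stack_b = stack_profiles(feature_b)
order = sort_order_by_mean(stack_a)      # computed on one feature ...
sorted_a = apply_order(stack_a, order)
sorted_b = apply_order(stack_b, order)   # ... and shared with the other
```

With the toy values, rows that rank highest for the first feature rank lowest for the second, which is the kind of anticorrelation that a shared sort order makes visible.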
1D-feature embedding widget
The 1D-feature embedding widget displays the distribution of genomic regions based on a collection of 1D genomic features. The results are displayed as a 2D-histogram, where points close on the plot represent genomic regions with similar feature profiles. The dimensionality reduction algorithm UMAP39 is used with default parameters to embed the high-dimensional regions into a two-dimensional space suitable for display.
This widget also automatically groups region neighborhoods by k-means clustering, with 10 (“large neighborhood”) or 20 (“small neighborhood”) clusters, respectively, in the embedded space. The normalized intensity of the features for each cluster is then calculated and used to interactively display the distribution of features within the selected clusters by mouse hovering. Users can create new regions from interesting subsets by clicking on a subset and giving it a name in the relevant dialogue.
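The clustering and per-cluster summary can be sketched with a plain Lloyd-style k-means in pure Python. The sketch assumes that the 2D embedding coordinates have already been computed (the UMAP step is not shown), uses two clusters instead of 10 or 20, starts from evenly spaced input points, and omits feature normalization. All names and values are illustrative and not HiCognition's code.

```python
import math


def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm on 2D points; centroids start at evenly spaced input points."""
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


def cluster_feature_means(features, labels, k):
    """Mean of every feature column within each cluster (None for empty clusters)."""
    means = []
    for c in range(k):
        rows = [row for row, label in zip(features, labels) if label == c]
        means.append([sum(col) / len(rows) for col in zip(*rows)] if rows else None)
    return means


# Toy embedding coordinates (one point per region) and two features per region.
embedded = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
features = [[9, 1], [8, 0], [10, 2], [1, 7], [0, 9], [2, 8]]

labels, centroids = kmeans(embedded, k=2)
means = cluster_feature_means(features, labels, k=2)
```

The per-cluster means are the kind of summary that can be shown when hovering over a cluster, and the labels identify the regions that would form a new region subset.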
2D-feature embedding widget
The 2D-feature embedding widget displays the distribution of genomic regions using a single 2D genomic feature. The results are displayed as a 2D-histogram, where points next to each other exemplify genomic regions with similar 2D-feature values. The widget implements a hover interaction that shows the 2D average with respect to the selected genomic feature for the selected subset. Users can create new regions from interesting subsets by clicking on a subset and giving it a name in the relevant dialogue.
The preprocessing algorithm extracts snippets of the 2D genomic feature for each genomic region in the region set. These snippets are then smoothed using a Gaussian filter and down-sampled to be of size 10 × 10. Here, the smoothing kernel size and standard deviation depend on the interpolation factor: Where I is the interpolation factor, m is the size of the quadratic snippet, f is the target size of the down-sampled matrix (in this case 10), K is the size of the smoothing kernel, and σ is the standard deviation of the Gaussian filter. The smoothing and down-sampling operations are done using OpenCV (https://github.com/opencv/opencv). Note that since the snippets can be of different sizes (see above for details), the interpolation factor and smoothing function can differ for different extracted snippets. The down-sampled matrix is then flattened and treated as image features for each of the genomic regions, resulting in a matrix where each row corresponds to a genomic region in the region set and each column to one of the pixel features (100 in total). Then, the matrix is embedded into a 2D space using UMAP39 (https://github.com/lmcinnes/umap), and clustering is performed as for the 1D-feature embedding widget. The representation for each cluster that is displayed to the user is the 2D average of all contained matrix snippets in the original pixel space.
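How a set of 2D snippets becomes a feature matrix with 100 columns can be sketched as follows. Note the substitution: this sketch replaces the Gaussian smoothing and OpenCV resizing described above with simple block averaging, and it does not reproduce the kernel size or standard deviation relationship. It also requires the snippet size to be a multiple of the target size. Function names and toy snippets are assumptions.

```python
def block_mean_downsample(matrix, target=10):
    """Down-sample a square matrix to target x target by averaging equal blocks.

    Block averaging stands in for Gaussian smoothing plus resizing in this sketch.
    """
    size = len(matrix)
    if size % target != 0:
        raise ValueError("this sketch needs the snippet size to be a multiple of target")
    factor = size // target
    small = []
    for bi in range(target):
        row = []
        for bj in range(target):
            block = [
                matrix[i][j]
                for i in range(bi * factor, (bi + 1) * factor)
                for j in range(bj * factor, (bj + 1) * factor)
            ]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def flatten(matrix):
    """Turn a 2D matrix into one row of pixel features."""
    return [value for row in matrix for value in row]


# Two toy 20 x 20 snippets, one per genomic region.
snippets = [
    [[float(i + j) for j in range(20)] for i in range(20)],
    [[float(i * j) for j in range(20)] for i in range(20)],
]

# One row per region, one column per pixel of the 10 x 10 matrix (100 in total).
feature_matrix = [flatten(block_mean_downsample(s)) for s in snippets]
```

A matrix of this shape is what a dimensionality reduction method such as UMAP would then embed into two dimensions.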
Association widget
The association widget allows users to quantify, for a given region set, the extent to which other sets of independent genomic regions overlap with it, based on the LOLA method38. This allows detection of associations between different types of genomics data, for example between ChIP-seq peaks and Hi-C structures such as TAD boundaries.
This widget consists of two visualizations: the upper bar chart shows the maximum enrichment of all regions within a collection, and the lower chart shows the enrichment values for a selected bar, ranked by enrichment. A significantly faster Python reimplementation of LOLA38 (https://github.com/Mittmich/pylola) allows calculating the association not just at the level of whole regions of interest but for each individual bin of these regions. Specifically, we use a bin as the target region, the regions in the selected collection as query regions, and all genome-wide bins of that size as the universe. The reported values correspond to the odds ratio of the underlying contingency table for each combination of target, query, and universe.
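The odds ratio of such a contingency table can be sketched in a few lines of plain Python. This is an illustrative example and not the pylola API: the function names, the half-open interval convention, and the assumption that the target bins are a subset of the universe are introduced here for illustration only.

```python
def overlaps_any(region, queries):
    """True if `region` overlaps at least one query interval (half-open coordinates)."""
    chrom, start, end = region
    return any(c == chrom and s < end and start < e for c, s, e in queries)


def odds_ratio(targets, queries, universe):
    """Odds ratio of the 2x2 table (target vs. background) x (overlap vs. no overlap).

    ASSUMPTION: every target bin is also contained in `universe`.
    """
    target_set = set(targets)
    background = [r for r in universe if r not in target_set]
    a = sum(overlaps_any(r, queries) for r in target_set)  # target bins with overlap
    c = len(target_set) - a                                # target bins without overlap
    b = sum(overlaps_any(r, queries) for r in background)  # background bins with overlap
    d = len(background) - b                                # background bins without overlap
    if b == 0 or c == 0:
        # The ratio is undefined or unbounded when a denominator cell is empty.
        return float("inf") if a * d > 0 else float("nan")
    return (a * d) / (b * c)


# Toy universe: ten 10-bp bins on one chromosome.
universe = [("chr1", s, s + 10) for s in range(0, 100, 10)]
targets = [("chr1", 0, 10), ("chr1", 10, 20)]
queries = [("chr1", 5, 8), ("chr1", 50, 55)]
print(odds_ratio(targets, queries, universe))  # 7.0
```

In the toy example, one of two target bins and one of eight background bins overlap a query region, giving an odds ratio of (1 × 7) / (1 × 1) = 7.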
Use-cases
Data sources
All data sets used for analysis in the current study have been obtained from public repositories as listed in the following table:
Preparation of datasets for HiCognition
All ChIP-seq data were directly imported into HiCognition based on data from public repositories, except for the SMC3 and Sororin ChIP-seq peaks, which were detected by the following procedure in the published ChIP-seq read profiles from Ladurner et al.46:
Deep (Illumina) sequencing results of ChIP-seq libraries were downloaded from ENA (ID: SAMEA5988740) and mapped against the human hg19 reference assembly using bowtie or bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml), counting only uniquely mappable reads and allowing 0-2 mismatches. The resulting alignments from two replicates each were processed with the MACS peak-calling algorithm (version 1.4.2) with a P-value threshold of 1e-10 or 1e-5, using control inputs from the same cell line. Peak overlaps were calculated using multovl 1.3 (https://github.com/aaszodi/multovl), treating overlaps as unions and including unique peaks from both replicates. Since two neighboring peaks from one dataset occasionally overlap with a single peak in the other dataset, the output of such an overlap is displayed as a connected genomic site and merged into a single data entry.
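The union-style merging described above, in which two neighboring peaks bridged by a single peak from the other replicate collapse into one entry, can be illustrated with a short plain-Python sketch. It does not reproduce multovl itself; the function name and the half-open coordinate convention are assumptions for illustration.

```python
def union_merge(peaks_a, peaks_b):
    """Merge peaks from two replicates, keeping unique peaks and uniting overlapping ones.

    Peaks are (chromosome, start, end) tuples in half-open coordinates (an assumption).
    """
    merged = []
    for chrom, start, end in sorted(list(peaks_a) + list(peaks_b)):
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            # Overlaps the previous entry: extend it to the union of both intervals.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


replicate_1 = [("chr1", 100, 200), ("chr1", 250, 300)]
replicate_2 = [("chr1", 180, 260), ("chr2", 10, 20)]
print(union_merge(replicate_1, replicate_2))
# [('chr1', 100, 300), ('chr2', 10, 20)]
```

Here the two neighboring peaks of the first replicate are connected by the single peak of the second replicate and reported as one site, while the peak unique to the second replicate is retained.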
To derive protein-coding genes split by their direction of transcription, the GENCODE annotations for hg19 (GRCh37) were downloaded and filtered for entries of type “gene” and gene type “protein_coding”. These genes were then split into genes with strand “+”, named “forward”, and genes with strand “-”, named “reverse”. The transcriptional start sites of these genes were then defined as the start or end of these intervals, respectively, and saved as bed files. The script for this preprocessing step can be found in the HiCognition GitHub repository (https://github.com/gerlichlab/hicognition/blob/master/publication/scripts/convert_genes.ipynb). For the use-case figures, the transcriptional start sites of “forward” oriented genes were used.
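The filtering and strand-splitting logic can be sketched as follows. This is a minimal illustration and not the notebook from the repository: the function name is hypothetical, and representing each transcriptional start site as a 1-bp BED interval (converting 1-based GTF coordinates to 0-based half-open BED coordinates) is an assumption.

```python
def tss_from_gtf_lines(lines):
    """Return 1-bp TSS intervals of protein-coding genes, split by strand.

    ASSUMPTION: input lines follow the GENCODE GTF layout (9 tab-separated columns,
    1-based inclusive coordinates, a `gene_type "..."` attribute).
    """
    tss = {"forward": [], "reverse": []}
    for line in lines:
        if line.startswith("#"):
            continue  # skip header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed lines
        chrom, feature, start, end, strand, attributes = (
            fields[0], fields[2], int(fields[3]), int(fields[4]), fields[6], fields[8],
        )
        if feature != "gene" or 'gene_type "protein_coding"' not in attributes:
            continue
        if strand == "+":
            tss["forward"].append((chrom, start - 1, start))  # TSS at the gene start
        elif strand == "-":
            tss["reverse"].append((chrom, end - 1, end))      # TSS at the gene end
    return tss


example = [
    "##description: toy annotation",
    'chr1\tTOY\tgene\t1000\t5000\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";',
    'chr1\tTOY\tgene\t7000\t9000\t.\t-\t.\tgene_id "G2"; gene_type "protein_coding";',
    'chr1\tTOY\tgene\t100\t200\t.\t+\t.\tgene_id "G3"; gene_type "lincRNA";',
    'chr1\tTOY\ttranscript\t1000\t5000\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";',
]
print(tss_from_gtf_lines(example))
# {'forward': [('chr1', 999, 1000)], 'reverse': [('chr1', 8999, 9000)]}
```

Each list can then be written out as one tab-separated BED line per interval.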
Showcase server
To give readers a fast hands-on experience of HiCognition, we implemented a showcase server (www.hicognition.com/app). On this server, the login for individual users is deactivated. We uploaded and preprocessed all the datasets in this paper so that readers can explore them independently, and we provide all saved sessions used for the figures in this paper. On this server, the upload and preprocessing functionality is deactivated.
Acknowledgments
The authors thank Jan-Michael Peters, Paul D. Batty, Federico Teloni, Zsuzsanna Takacs, and Sofia Kolesnikova for comments on the manuscript. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No 101019039), from the Austrian Academy of Sciences, and the Vienna Science and Technology Fund (WWTF; project nr. LS17-003).