HiCORE: Hi-C analysis for identification of core chromatin loops with higher resolution and reliability

Genome-wide chromosome conformation capture (3C)-based high-throughput sequencing (Hi-C) has enabled identification of genome-wide chromatin loops. Because the Hi-C map with restriction fragment resolution is intrinsically associated with sparsity and stochastic noise, Hi-C data are usually binned at particular intervals; however, the binning method has limited reliability, especially at high resolution. Here, we describe a new method called HiCORE, which provides simple pipelines and algorithms to overcome the limitations of single-layered binning and predict core chromatin regions with 3D physical interactions. In this approach, multiple layers of binning with slightly shifted genome coverage are generated, and interacting bins at each layer are integrated to infer narrower regions of chromatin interactions. HiCORE predicts chromatin looping regions with higher resolution and contributes to the identification of the precise positions of potential genomic elements.

Author Summary
Hi-C analysis has made it possible to obtain information on the 3D interactions of genomes. While various approaches have been developed for the identification of reliable chromatin loops, binning methods have seen only limited improvement. Here, we developed the HiCORE algorithm, which generates multiple layers of bin arrays and specifies core chromatin regions with 3D interactions. We validated our algorithm and demonstrated its advantages over the conventional binning method. Overall, HiCORE facilitates the prediction of chromatin loops with higher resolution and reliability, which is particularly relevant in the analysis of small genomes.


Introduction
Hi-C analysis maps all possible genomic interactions in a manner dependent on three-dimensional (3D) distance. In this method, cross-linked chromatin is digested into fragments by a restriction enzyme [1]. Then, restriction fragments are proximity-ligated to generate a library of chimeric circular DNA. Paired-end sequencing and mapping of reads to a reference genome identifies interacting restriction fragments and measures their interaction frequencies [2]. Unfortunately, despite the high sequencing depth of typical Hi-C experiments, single-fragment-resolution contact matrices are too sparse to analyze. Hence, to ensure efficient and reliable data processing, Hi-C data are usually binned with fixed intervals at low resolution (>5 kb) [3][4][5].

Because the resolution of the Hi-C map is a critical issue for 3D genome conformation studies, multiple studies have focused on improving the resolution of raw Hi-C matrices [6][7][8][9][10]. Although many cutting-edge technologies have been implemented, including machine learning-based computational methods, the binning strategy is still an important determinant of Hi-C resolution. Two major types of binning methods have been used in practice: fixed interval binning and fragment unit binning. Fixed interval binning allows intervals to have a regular size (e.g., 5 kb) without considering restriction fragment size, whereas fragment unit binning has an interval defined by the size of restriction fragment(s) [5]. Although fixed interval binning is a practical approach that facilitates efficient analysis of Hi-C data, fixed intervals mismatch restriction fragments of variable sizes, especially at a high resolution, creating bias in chromatin loop detection [5]. Thus, fragment unit binning methods have also been improved in parallel for high-resolution Hi-C analysis.
Because single-fragment-resolution analysis is limited by contact matrix sparsity, several restriction fragments are assembled to generate a single bin as an alternative method (here called multi-fragment binning) [6,11] [16,18]. Because the methods of the latter category do not account for a high-order 3D folding structure, they require a more robust post-analysis to determine the reliability of chromatin loop prediction.

In this study, we developed the HiCORE algorithm, which includes an advanced

HiCORE generated multiple layers of bin arrays, in which a single bin was defined through serial assembly of fragments (multi-fragment binning) to have a size above a certain threshold. The first layer was generated by forward multi-fragment binning initiated from the starting point (5'-end) of the genome (forward binning), and the second layer was created by reverse multi-fragment binning initiated from the end point (3'-end) of the genome (reverse binning). Additional layers were generated by bidirectional binning from randomly selected positions of the genome (random binning). Bin arrays at each layer could be shifted slightly relative to one another, and thus have different genome coverages (Fig 1). An interaction-frequency matrix of fragment-resolution Hi-C data was assigned to bin arrays at each layer, and bin pairs with statistically significant interactions were identified using the Fit-HiC2

To validate the relevance of HiCORE in specifying chromatin looping regions, we generated multiple layers of bin arrays, in which a single bin was defined through multi-fragment binning to have a size above 700 bp. The threshold bin size was determined by the raw matrix resolution of the original Hi-C data [14]. We integrated information about 3D chromatin (conventional method) revealed that the average size of all specified chromatin regions with 3D contacts was 4.6 fragments (Fig 2A). However, the HiCORE-generated multiple layers enabled further specification of chromatin looping regions. Compared with single-layered binning, the average size of all specified chromatin regions with 3D contacts was reduced by ~50% (to 2.2 fragments) after integration of 20 layers (Fig 2A), demonstrating that chromatin

In support, chromatin looping regions were better specified throughout the genome, compared with conventional single-layered Fit-HiC2 analysis. For example, while three chromatin loops were predicted by the conventional Fit-HiC2 analysis within the Chr1:590-599 kb region (Fig 3A), HiCORE selected a reliable chromatin loop (shown in blue) and defined the chromatin looping region with higher resolution (Fig 3B): one side specified at a single-fragment unit and the other at a 2-fragment unit.

HiC2 method identified 185 chromatin loops at single-fragment resolution (Fig 4A).
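The narrowing effect of integrating layers can be illustrated with a minimal sketch. The text does not spell out the exact integration rule, so the code below assumes the simplest possible one: when several layers call the same loop with slightly shifted bins, the core region of each anchor is taken as the set of fragments shared by all of those bins. The names `core_region` and `core_loop`, and the representation of a bin as an inclusive range of fragment indices, are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

# A bin is an inclusive range of restriction-fragment indices (assumed representation).
Bin = Tuple[int, int]
Loop = Tuple[Bin, Bin]  # (anchor A, anchor B)


def core_region(bins: List[Bin]) -> Optional[Bin]:
    """Return the fragments shared by all bins, or None if they share none."""
    first = max(b[0] for b in bins)
    last = min(b[1] for b in bins)
    return (first, last) if first <= last else None


def core_loop(calls: List[Loop]) -> Optional[Loop]:
    """Intersect both anchors of one loop called in several binning layers.

    Assumption: plain intersection stands in for the integration step;
    the published algorithm may weight or filter layers differently.
    """
    anchor_a = core_region([call[0] for call in calls])
    anchor_b = core_region([call[1] for call in calls])
    if anchor_a is None or anchor_b is None:
        return None
    return (anchor_a, anchor_b)


# Three layers call the same loop with slightly shifted 3- to 5-fragment bins.
layer_calls = [
    ((10, 13), (40, 44)),
    ((11, 14), (42, 45)),
    ((11, 12), (41, 43)),
]
print(core_loop(layer_calls))  # ((11, 12), (42, 43))
```

In this toy input each anchor shrinks to two fragments, which is the kind of narrowing that multiple shifted layers make possible and a single layer cannot.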

Furthermore, the average size of HiCORE-specified chromatin regions with 3D contacts (for 1987 chromatin loops) was much smaller than that obtained by the conventional single-layered Fit-HiC2 analysis (for 185 chromatin loops) (Fig 4B). The chromatin loops predicted by HiCORE also had lower q-values (Fig 4A), indicating that HiCORE enables the successful identification  (Fig 2A).

The inconsistency was possibly due to the sparsity of 3D contact matrices in smaller bins, which caused low reproducibility at each layer and thereby led to exclusion of the corresponding interactions during HiCORE analysis. Thus, HiCORE can specify the core chromatin looping regions, but optimal analysis can usually be performed above the matrix resolution.

FLC locus was empirically validated [11,23], but this loop was predicted by only a few binning layers (Fig 5). This observation indicates that no single binning layer sufficiently covers genomic interactions, and that comprehensive integration of chromatin looping information predicted by multi-layered binning with different genome coverage is required. HiCORE can provide a platform for data integration and optionally suggest a wide range for possible chromatin looping regions (Fig 1). Furthermore, using HiCORE, chromatin

Fragments with a size below a given cutoff length are merged with neighboring fragments.

The merged fragments are assembled into a single bin (multi-fragment bin). This binning strategy is applied to all fragments except for the end fragment. For 'forward binning', from the 5' end of the chromosome, a restriction fragment shorter than the cutoff length is merged with the following fragment (in the 3'-direction) until the merged fragment is longer than the cutoff length. For 'reverse binning', from the 3' end of the chromosome, a restriction fragment shorter than the cutoff length is merged with the preceding fragment (in the 5'-direction) until the merged fragment is longer than the cutoff length. For 'random binning', HiCORE randomly selects restriction fragments for construction of a binning layer.
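The three binning modes described above can be sketched as follows. This is an illustrative reading of the text, not the HiCORE source code. The function names, the representation of a bin as an inclusive range of fragment indices, and the handling of leftover end fragments (kept as a short final bin) are all assumptions. The bidirectional variant takes an explicit `seed_index`, where a real run would presumably draw that position at random.

```python
from typing import List, Tuple

# A bin is an inclusive range of restriction-fragment indices (assumed representation).
Bin = Tuple[int, int]


def forward_bins(frag_lengths: List[int], cutoff: int) -> List[Bin]:
    """Merge fragments in the 3'-direction until each bin is longer than `cutoff`."""
    bins: List[Bin] = []
    start, size = 0, 0
    for i, length in enumerate(frag_lengths):
        size += length
        if size > cutoff:
            bins.append((start, i))
            start, size = i + 1, 0
    if size > 0:
        # Leftover end fragment(s) stay as a short final bin (assumption).
        bins.append((start, len(frag_lengths) - 1))
    return bins


def reverse_bins(frag_lengths: List[int], cutoff: int) -> List[Bin]:
    """Same merging rule, but started from the 3' end and moving in the 5'-direction."""
    n = len(frag_lengths)
    mirrored = forward_bins(frag_lengths[::-1], cutoff)
    return sorted((n - 1 - last, n - 1 - first) for first, last in mirrored)


def bidirectional_bins(frag_lengths: List[int], cutoff: int, seed_index: int) -> List[Bin]:
    """Bin outward in both directions from `seed_index` (hypothetical helper)."""
    left = reverse_bins(frag_lengths[:seed_index], cutoff)
    right = [(a + seed_index, b + seed_index)
             for a, b in forward_bins(frag_lengths[seed_index:], cutoff)]
    return left + right


lengths = [400, 400, 400, 400, 400]  # toy fragment sizes in bp
print(forward_bins(lengths, 700))    # [(0, 1), (2, 3), (4, 4)]
print(reverse_bins(lengths, 700))    # [(0, 0), (1, 2), (3, 4)]
```

Even on this toy chromosome the forward and reverse layers are offset by one fragment, which is the shifted genome coverage that the multi-layer approach relies on.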