Intuitive interpretation of heterochromatin and euchromatin through rapid Hi-C analysis

Hi-C is a technique that provides contact frequencies between pairs of loci on chromosomes. The conventional classification of heterochromatin and euchromatin based on Hi-C data is performed by principal component analysis; however, it requires long computational times and does not provide insight into the difference in contact frequencies between heterochromatin and euchromatin. Here, we propose a simple, intuitive and rapid method named the scaled contact number (SCN), which allows the contact frequencies to be visually interpreted and heterochromatin and euchromatin to be classified based on Hi-C results in a few minutes for long chromosomes at 1-kb resolution. The robustness of SCN was validated by confirming that SCN calculated from reduced reads gives almost the same results as the original SCN. Overall, the approach described herein considerably decreases the time and computing power required to analyze Hi-C data and further provides mechanistic insight indicating that euchromatin has more contacts than heterochromatin.


Introduction
Hi-C technology has enhanced our understanding of chromosome structure and provided clues relating chromosome structure to function [1]. A number of chromosome models that are compatible with Hi-C data have been suggested [1][2][3][4][5][6][7]. After normalization of the Hi-C matrix, chromosomes were classified into A (euchromatin) and B (heterochromatin) compartments according to the sign of the first or second eigenvector obtained by principal component analysis (PCA) of the Pearson correlation matrix. Normalization is critical to obtain correct eigenvectors [8], and several different normalization methods to remove unwanted biases have therefore been proposed [9][10][11][12][13][14]. However, once it became possible to perform Hi-C at 1-kb resolution [15], a technical problem in processing large amounts of Hi-C data emerged, as the diagonalization of the Pearson correlation matrix used in PCA requires very large amounts of memory and computational time [16].
Accordingly, fast and efficient methods of processing Hi-C data were developed [16][17][18], but they still take a few days [19]. Another problem remains regarding the classification procedures themselves.
Although the PCA-based classification of heterochromatin and euchromatin has worked well in practice, this mathematical procedure does not reveal the relation between the compartments and contact frequency. Moreover, the sign of the eigenvector after diagonalization is arbitrary [8]. Non-PCA-based classification methods have recently been suggested [19][20][21], but the relation between the compartments and contact frequency remains unclear.
Here, to achieve rapid processing of a large amount of data without requiring a large amount of memory and to reveal the relation between the compartments and contact frequency, we propose a simple and intuitive classification method. The clarified difference in the interactions of compartments allows us to perform a rough classification by visually inspecting Hi-C results. This method requires only 3.4 MB of memory and takes less than 6 minutes to process one of the largest available Hi-C datasets (chr3 of the human B-lymphocyte cell line GM12878 [15]) at 1-kb resolution using a standard laptop computer. Benchmarks of the processing time and a comparison with CscoreTool [19] are shown in Supplementary Fig. 1.

Results
To test our classification method against the conventional method [1], we first performed PCA on GM12878 (restriction enzyme: MboI) [15], for which Hi-C was obtained at several resolutions, including 1 kb. The Hi-C matrix of chr14 at 100-kb resolution and the corresponding eigenvectors obtained via the conventional procedure [1] are shown in Fig. 1a. The scaled contact number (SCN), a metric developed here that sums each column of the Hi-C matrix (see Methods), is plotted at the bottom of panel (a). SCN is very similar to the eigenvector, suggesting the accuracy of the new classification. The similarity is quantified later. This procedure allows us to calculate SCN at the highest resolution (1 kb) without requiring high computational power or large amounts of memory. SCNs of chr14 at 1, 5, 10, 25, 50 and 100 kb are shown in Supplementary Fig. 2. There is large fluctuation in the SCNs at 1 and 5 kb, and it is not clear whether this fluctuation is noise or reflects real local differences between heterochromatin and euchromatin. The fluctuation is suppressed at 10 kb, and the SCNs at 25 and 50 kb are essentially the same as those at 100 kb.
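As a rough illustration of the column-summing idea described above, the following Python sketch accumulates column sums from a stream of contact records and then rescales them. The function name, the (i, j, count) input layout and, in particular, the division by the mean column sum are assumptions made for this example; the text above states only that columns are summed, so the scaling actually used for SCN may differ.

```python
def scaled_contact_number(contacts, n_bins):
    """Column sums of a symmetric contact matrix, divided by their mean.

    `contacts` yields (i, j, count) records with i <= j (upper triangle).
    Dividing by the mean is an ASSUMED scaling, chosen so that an
    average locus scores 1; it is not taken from the paper.
    """
    column_sums = [0.0] * n_bins
    for i, j, count in contacts:
        column_sums[i] += count
        if i != j:
            # The matrix is symmetric, so an off-diagonal contact
            # contributes to the columns of both loci.
            column_sums[j] += count
    mean_sum = sum(column_sums) / n_bins
    if mean_sum == 0:
        raise ValueError("no contacts found")
    return [c / mean_sum for c in column_sums]


# Made-up upper-triangle records for a 4-bin toy chromosome (not real data).
toy_contacts = [
    (0, 0, 10), (0, 1, 6), (0, 2, 2), (0, 3, 2),
    (1, 1, 10), (1, 2, 2), (1, 3, 2),
    (2, 2, 4), (2, 3, 1),
    (3, 3, 4),
]
toy_scn = scaled_contact_number(toy_contacts, n_bins=4)
```

Because only one running total per bin is kept, the memory used by this sketch grows with the number of bins rather than with the number of bin pairs.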
To quantify the correlation between SCNs and eigenvectors, their relationship was plotted, and the results showed a good correlation (Fig. 1b). This was also the case for the low-resolution
We also compared the neighboring region contact index (NCI), a metric recently suggested by Fujishiro and Sasai for evaluating the correlation of contacts of a locus with its neighbours [20], with the eigenvectors. The NCIs showed a high correlation with the eigenvectors at 100 kb but almost no correlation at 1 Mb (Supplementary Fig. 5). Therefore, we concluded that SCN is a more robust and meaningful measure than NCI.
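The comparison above rests on the Pearson correlation coefficient between two per-bin tracks. A minimal sketch of that calculation is given below; the function name and the two short example tracks are made up for illustration and are not values from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant track")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-bin tracks (made-up numbers, not real Hi-C results).
scn_track = [1.4, 1.2, 0.9, 0.6, 0.7, 1.3]
eigenvector_track = [0.8, 0.5, -0.1, -0.9, -0.6, 0.6]
r = pearson(scn_track, eigenvector_track)
```

Since the sign of a PCA eigenvector is arbitrary, a strongly negative coefficient would indicate the same agreement as a strongly positive one after flipping the eigenvector.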
To reveal the biological meaning of SCN, the value of SCN at which the eigenvector equals 0 (the SCN intercept) was plotted. The SCN intercept was consistently found to be 1 (Fig. 2a and Supplementary Fig. 4d), independent of the chromosome number or resolution. Thus, chromosome regions with an SCN > 1 or < 1 correspond to euchromatin and heterochromatin, respectively (Fig. 2b). Based on these results, heterochromatin and euchromatin can be roughly classified visually based on Hi-C results, without PCA, as illustrated in Fig. 2c. A higher than expected contact probability corresponds to euchromatin, which is open and more likely to contact other chromosome regions, while a lower probability corresponds to heterochromatin, which is packed and folded and thus contacts other chromosome regions infrequently (Fig. 2d). Thus, SCN is a clear indicator for distinguishing between heterochromatin and euchromatin interactions.
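The threshold rule stated above (SCN > 1 for euchromatin, SCN < 1 for heterochromatin) can be written as a short sketch. The function name, the label strings and the handling of values exactly equal to the threshold are assumptions made for this example.

```python
def classify_by_scn(scn_values, threshold=1.0):
    """Label each bin from its SCN value using a fixed threshold.

    Bins exactly at the threshold are left "unassigned"; that choice is
    an assumption of this sketch, since the text only covers > 1 and < 1.
    """
    labels = []
    for value in scn_values:
        if value > threshold:
            labels.append("euchromatin")
        elif value < threshold:
            labels.append("heterochromatin")
        else:
            labels.append("unassigned")
    return labels


# Made-up SCN values for five bins.
example_labels = classify_by_scn([1.38, 1.05, 1.0, 0.62, 0.9])
```

Unlike the sign of a PCA eigenvector, which is arbitrary, this rule needs no additional information to decide which side of the threshold is euchromatin.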
It is important to further verify the robustness of this approach. As indicated earlier, the Pearson correlation coefficient for GM06990 is lower than that for GM12878, so one may assume that the number of reads affects SCN (chr14 has 755k reads for GM06990 and 186M reads for GM12878). The reads of chr14 of GM06990 and GM12878 were randomly selected and reduced to 10%, 1%, and 0.1%. The SCNs calculated by using 10% and 1% of the data from GM12878 are indistinguishable from the original SCN, and the SCN from 0.1% of the data is very similar to the original SCN with some differences (Supplementary Fig. 6a). On the other hand, the SCN from 10% of the data from GM06990 is very similar to the original SCN with some differences, and the SCNs from 1% and 0.1% differ from the original SCN (Supplementary Fig. 6b). Accordingly, all SCNs with reduced data from GM12878 and SCNs with 10% data from GM06990 have Pearson correlation coefficients similar to that of the original SCN. Given that a lower number of reads (0.1% of the GM12878 data, 186k) resulted in a higher Pearson correlation coefficient than the original GM06990 data (755k), the correlation with the eigenvector does not depend only on the number of reads. Rather, SCN is influenced by the quality of Hi-C, since it is just a sum of the columns of Hi-C matrices. In the Hi-C map of GM12878 (Fig. 1a), topologically associating domains (TADs) are clearly visible, while
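The read-reduction test described above can be mimicked on a small contact matrix with the sketch below. It keeps each recorded contact independently with a given probability, which is an assumption about how the random selection was done; the function name, the seed parameter and the toy matrix are likewise hypothetical.

```python
import random


def downsample_contacts(matrix, fraction, seed=0):
    """Randomly thin a symmetric matrix of integer contact counts.

    Each contact is kept with probability `fraction` (an assumed scheme;
    the kept total is therefore only approximately `fraction` of the
    original). The upper triangle is thinned and mirrored so that the
    result stays symmetric.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    n = len(matrix)
    thinned = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            kept = sum(1 for _ in range(matrix[i][j]) if rng.random() < fraction)
            thinned[i][j] = kept
            thinned[j][i] = kept
    return thinned


# Made-up symmetric toy matrix (not real Hi-C data).
toy_matrix = [
    [10, 6, 2, 2],
    [6, 10, 2, 2],
    [2, 2, 4, 1],
    [2, 2, 1, 4],
]
half = downsample_contacts(toy_matrix, 0.5, seed=1)
```

Recomputing SCN on the thinned matrix and comparing it with the SCN of the original matrix then gives a small-scale version of the robustness check.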

Discussion
We developed a simple method for classifying heterochromatin and euchromatin based on Hi-C data that works even on laptops and can be readily implemented. Although it seems contrary to the conventional understanding, the results obtained via this method clearly demonstrated that the contact frequency of heterochromatin is lower than expected, while that of euchromatin is higher. This implies that euchromatin is flexible, easily interacts with other regions, and is accessible to the surface of heterochromatin. That is, euchromatin has frequent contacts in various locations. In contrast, heterochromatin is folded and less flexible than euchromatin and has a defined structure. Heterochromatin is in frequent or persistent contact within the defined structures, but chromosome parts inside the structures are difficult to access. Accordingly, heterochromatin has a lower contact frequency. Thus, SCN not only rapidly classifies heterochromatin and euchromatin but also offers insight into the different contact frequencies between heterochromatin and euchromatin.
There are several methods to identify TADs [22]. SCN appears to have some relation to TADs, as discussed above, and thus may contribute to improved identification methods. For example, a sudden change in SCN would be a sign of a TAD boundary. This classification method also has implications for developing simulation models. There have been a few simulation models for heterochromatin and euchromatin [20,23,24], where beads in the simulation model were assigned to heterochromatin or euchromatin using Hi-C data at 1-500 kb resolution. Using SCN, it is possible to assign beads at a resolution of 1-5 kb, corresponding to 5-25 nucleosomes, which is a requisite for future model development. At 1-kb or 5-kb resolution, large fluctuations were seen in SCN.
Thus far, there is no information to determine if these fluctuations are noise or are truly capturing a heterochromatin/euchromatin difference, but such information is expected to be obtained by future simulation studies. Eventually, SCN will contribute to the development of a more realistic chromosome model implementing heterochromatin and euchromatin.
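The suggestion above that a sudden change in SCN may mark a TAD boundary can be sketched as a simple scan over neighbouring bins. This is not a validated TAD caller; the function name and the jump threshold are hypothetical, and at 1-kb or 5-kb resolution the fluctuations noted above would likely produce many spurious candidates.

```python
def scn_jump_positions(scn_values, min_jump):
    """Return indices where SCN changes by at least `min_jump` from the previous bin.

    `min_jump` is a hypothetical, user-chosen threshold; the text gives
    no value for what counts as a sudden change.
    """
    return [
        i + 1
        for i in range(len(scn_values) - 1)
        if abs(scn_values[i + 1] - scn_values[i]) >= min_jump
    ]


# Made-up SCN track with two abrupt changes.
candidates = scn_jump_positions([1.3, 1.3, 0.6, 0.6, 1.2], min_jump=0.5)
```

Each returned index is the first bin after a jump, so it can be read as a candidate boundary between that bin and the one before it.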

Methods
Hi-C matrices ($m_{ij}$) record the counts of contacts between loci $i$ and $j$. Accordingly, the sum of the column contents ($\sum_i m_{ij}$)
Figure caption (fragment): interaction. The thickness of the red arrows indicates the contact frequency inferred from SCNs.