bioRxiv
New Results

HiC-Spector: A matrix library for spectral and reproducibility analysis of Hi-C contact maps

Koon-Kiu Yan, Galip Gürkan Yardimci, William S. Noble, Mark Gerstein
doi: https://doi.org/10.1101/088922
Koon-Kiu Yan
1Program in Computational Biology and Bioinformatics, University of Washington, Seattle
2Department of Molecular Biophysics and Biochemistry, University of Washington, Seattle
Galip Gürkan Yardimci
4Department of Genome Sciences, University of Washington, Seattle
William S. Noble
4Department of Genome Sciences, University of Washington, Seattle
5Department of Computer Science and Engineering, University of Washington, Seattle
Mark Gerstein
1Program in Computational Biology and Bioinformatics, University of Washington, Seattle
2Department of Molecular Biophysics and Biochemistry, University of Washington, Seattle
3Department of Computer Science, Yale University, University of Washington, Seattle

Abstract

Summary: Genome-wide proximity-ligation assays such as Hi-C have opened a window onto the 3D organization of the genome. In so doing, they present data structures that differ from conventional 1D signal tracks. To exploit the 2D nature of Hi-C contact maps, matrix techniques like spectral analysis are particularly useful. Here, we present HiC-spector, a collection of matrix-related functions for analyzing Hi-C contact maps. In particular, we introduce a novel reproducibility metric for quantifying the similarity between contact maps based on spectral decomposition. The metric successfully separates contact maps derived from Hi-C data from biological replicates, pseudo-replicates and different cell types.

Availability: Source code in Julia and the documentation of HiC-spector can be freely obtained at https://github.com/gersteinlab/HiC_spector

Contact: pi@gersteinlab.org

1 Introduction

Genome-wide proximity ligation assays such as Hi-C have emerged as powerful techniques to understand the 3D organization of the genome (Lieberman-Aiden et al., 2009; Kalhor et al., 2011). While these techniques offer new biological insights, they demand different data structures and present new computational questions (Dekker et al., 2013; Ay and Noble, 2015). For instance, a basic question of particular practical importance is, how can we quantify the similarity between two Hi-C data sets? In particular, given two experimental replicates, how can we determine if the experiments are really reproducible?

Data from Hi-C experiments are usually summarized by so-called chromosomal contact maps. After the genome is partitioned into equally sized bins, a contact map is a matrix whose elements store the population-averaged co-location frequencies between pairs of loci. Because the data take the form of a matrix, mathematical tools like spectral analysis can be extremely useful in understanding these chromosomal contact maps. The aim of this project is to provide a set of basic analysis tools for handling Hi-C contact maps. In particular, we introduce a simple but novel metric to quantify the reproducibility of the maps using spectral decomposition.
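To make the binning step concrete, the following is a minimal sketch in Python with NumPy, not code from HiC-spector. The function name `contact_map`, the 0-based coordinates and the read-pair input format are assumptions chosen for illustration.

```python
import numpy as np

def contact_map(pos1, pos2, chrom_length, bin_size):
    """Count read pairs per pair of equally sized bins (hypothetical helper).

    pos1, pos2: 0-based genomic coordinates of the two ends of each read pair
    on the same chromosome (an assumed input format).
    """
    n_bins = -(-chrom_length // bin_size)          # ceiling division
    i = np.asarray(pos1) // bin_size               # bin index of the first end
    j = np.asarray(pos2) // bin_size               # bin index of the second end
    W = np.zeros((n_bins, n_bins))
    np.add.at(W, (i, j), 1)                        # accumulate repeated pairs correctly
    off_diag = i != j
    np.add.at(W, (j[off_diag], i[off_diag]), 1)    # mirror to keep W symmetric
    return W

# Four toy read pairs on a 30 bp "chromosome" with 10 bp bins.
W = contact_map([5, 15, 25, 5], [12, 18, 3, 14], chrom_length=30, bin_size=10)
```

`np.add.at` is used instead of fancy-index assignment because several read pairs can fall into the same pair of bins, and each must be counted.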

2 Algorithms

We represent a chromosomal contact map by a symmetric and non-negative adjacency matrix $W$. The matrix elements represent the frequencies of contact between genomic loci and therefore serve as a proxy for spatial proximity: the larger the value of $W_{ij}$, the shorter the distance between loci $i$ and $j$. The starting point of spectral analysis is the Laplacian matrix $L$, which is defined as $L = D - W$. Here $D$ is a diagonal matrix in which $D_{ii} = \sum_j W_{ij}$ (the coverage of bin $i$ in the context of Hi-C). As in many other applications, the Laplacian matrix further takes a normalized form $\mathcal{L} = D^{-1/2} L D^{-1/2}$ (Chung, 1997). It can be verified that 0 is an eigenvalue of $\mathcal{L}$, and the set of eigenvalues of $\mathcal{L}$ ($0 \le \lambda_0 \le \lambda_1 \le \dots \le \lambda_{n-1}$) is referred to as the spectrum of $\mathcal{L}$. As in common dimensionality reduction procedures, the first few eigenvalues are of particular importance because they capture the basic structure of the matrix, whereas the very high eigenvalues are essentially noise. The normalized Laplacian matrix is closely related to random walk processes taking place in the underlying graph of $W$. In fact, the first few eigenvalues correspond to the slower decay modes of the random walk process, and capture the large-scale structure of the contact map.
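The construction above can be sketched in a few lines of Python with NumPy. This is an illustrative sketch rather than the HiC-spector implementation; the function name `normalized_laplacian` and the toy matrix are assumptions.

```python
import numpy as np

def normalized_laplacian(W):
    """Return D^{-1/2} (D - W) D^{-1/2} for a symmetric, non-negative W."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                      # D_ii: coverage of bin i
    if np.any(d <= 0):
        raise ValueError("every bin needs non-zero coverage; drop empty bins first")
    s = 1.0 / np.sqrt(d)
    L = np.diag(d) - W                     # unnormalized Laplacian
    return s[:, None] * L * s[None, :]     # scale rows and columns by D^{-1/2}

# Toy 4-bin contact map (hypothetical values).
W = np.array([[0, 4, 1, 0],
              [4, 0, 2, 1],
              [1, 2, 0, 5],
              [0, 1, 5, 0]], dtype=float)
lap = normalized_laplacian(W)

# eigh is appropriate for symmetric matrices; eigenvalues come back in ascending order.
eigvals, eigvecs = np.linalg.eigh(lap)
```

For a connected map, the smallest eigenvalue is 0 up to floating-point error, and all eigenvalues of the normalized Laplacian lie between 0 and 2.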

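Given two contact maps $W^A$ and $W^B$, we propose to quantify their similarity by decomposing their corresponding Laplacian matrices $\mathcal{L}^A$ and $\mathcal{L}^B$ respectively and then comparing their eigenvectors. Let $\{\lambda_0^A, \lambda_1^A, \dots, \lambda_{n-1}^A\}$ and $\{\lambda_0^B, \lambda_1^B, \dots, \lambda_{n-1}^B\}$ be the spectra of $\mathcal{L}^A$ and $\mathcal{L}^B$, and $\{v_0^A, v_1^A, \dots, v_{n-1}^A\}$ and $\{v_0^B, v_1^B, \dots, v_{n-1}^B\}$ be their sets of normalized eigenvectors. A distance metric $S_d$ is defined as

$$S_d(A, B) = \sum_{i=0}^{r-1} \lVert v_i^A - v_i^B \rVert$$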

Here $\lVert \cdot \rVert$ represents the Euclidean distance between the two vectors. The parameter $r$ is the number of leading eigenvectors picked from $\mathcal{L}^A$ and $\mathcal{L}^B$. In general, $S_d$ provides a metric to gauge the similarity between two contact maps. The corresponding eigenvectors $v_i^A$ and $v_i^B$ are more correlated if A and B are two biological replicates as compared to the case when they are two different cell lines (see Supplemental Information).

As for the choice of $r$: as in principal component analysis, the leading eigenvectors contribute more than the lower ranked eigenvectors. In fact, we observe that the Euclidean distance between a pair of high-order eigenvectors is the same as the distance between a pair of unit vectors whose components are randomly sampled from a standard normal distribution (see Supplemental Information). In other words, the high-order eigenvectors are essentially noise terms, whereas the signal is stored in the leading vectors. As a rule of thumb, we found the choice $r = 20$ is good enough for practical purposes. Furthermore, as the distance between a pair of randomly sampled unit vectors presents a natural limit $\ell = \sqrt{2}$, we rescale the distance metric into a reproducibility score $Q$, ranging from 0 to 1, by

$$Q = 1 - \frac{1}{r} \frac{S_d}{\ell}$$
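A minimal Python/NumPy sketch of the score described above follows. It is not the HiC-spector source: the function names are hypothetical, and two details are assumptions. Bins without coverage in either map are dropped before decomposition, and because an eigenvector is only defined up to sign, each pair is compared under the better of the two sign choices.

```python
import numpy as np

def leading_eigenvectors(W, r):
    """First r eigenvectors (smallest eigenvalues) of the normalized Laplacian of W."""
    d = W.sum(axis=1)
    if np.any(d <= 0):
        raise ValueError("a bin has zero coverage after filtering")
    s = 1.0 / np.sqrt(d)
    lap = np.eye(len(d)) - s[:, None] * W * s[None, :]
    _, vecs = np.linalg.eigh(lap)          # unit-norm columns, ascending eigenvalues
    return vecs[:, :r]

def reproducibility_score(WA, WB, r=20):
    """Rescaled eigenvector distance Q between two contact maps of equal shape."""
    WA = np.asarray(WA, dtype=float)
    WB = np.asarray(WB, dtype=float)
    if WA.shape != WB.shape:
        raise ValueError("contact maps must have the same shape")
    # Assumption: keep only bins covered in both maps.
    keep = (WA.sum(axis=1) > 0) & (WB.sum(axis=1) > 0)
    WA, WB = WA[np.ix_(keep, keep)], WB[np.ix_(keep, keep)]
    if r > WA.shape[0]:
        raise ValueError("r exceeds the number of usable bins")
    VA, VB = leading_eigenvectors(WA, r), leading_eigenvectors(WB, r)
    # Assumption: resolve the sign ambiguity by taking the smaller distance.
    dist = np.minimum(np.linalg.norm(VA - VB, axis=0),
                      np.linalg.norm(VA + VB, axis=0))
    Sd = dist.sum()
    return 1.0 - Sd / (r * np.sqrt(2.0))

# Two unrelated synthetic symmetric count matrices (toy data, not Hi-C).
rng = np.random.default_rng(0)
A = rng.poisson(5.0, size=(50, 50)).astype(float)
A = A + A.T
B = rng.poisson(5.0, size=(50, 50)).astype(float)
B = B + B.T
q_same = reproducibility_score(A, A, r=20)
q_diff = reproducibility_score(A, B, r=20)
```

With the sign alignment, each per-eigenvector distance is at most $\sqrt{2}$, so the score stays within the range 0 to 1. A map compared with itself scores 1.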

As shown in Figure 1, the reproducibility scores between pseudo-replicates are greater than the scores for real biological replicates, which are greater than the scores between maps from different cell lines.

Figure 1. Reproducibility scores for three sets of Hi-C contact map pairs. Contact maps came from Hi-C experiments performed in 11 cancer cell lines by the ENCODE consortium (https://www.encodeproject.org/). Biological replicates refer to a pair of replicates of the same experiment. Pseudo-replicates are obtained by pooling the reads from two replicates together and performing down-sampling. There are 11 pairs of biological replicates, 33 pairs of pseudo-replicates, and 110 pairs of maps between different cell types. Each box shows the distribution of Q in 23 chromosomes.
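The pooling and down-sampling step for pseudo-replicates can be illustrated at the level of binned counts. The sketch below, in Python with NumPy, is an assumption about one reasonable procedure, not the pipeline used for the figure: it keeps each pooled read pair independently with probability `fraction` (binomial thinning). The function name and parameters are hypothetical.

```python
import numpy as np

def pseudo_replicates(counts_a, counts_b, fraction=0.5, seed=0):
    """Pool two symmetric integer count matrices and draw two thinned copies."""
    rng = np.random.default_rng(seed)
    pooled = np.asarray(counts_a, dtype=np.int64) + np.asarray(counts_b, dtype=np.int64)
    upper = np.triu(pooled)                    # count each bin pair once, diagonal included

    def thin():
        u = rng.binomial(upper, fraction)      # keep each read pair with prob. `fraction`
        return u + np.triu(u, 1).T             # mirror to restore symmetry

    return pooled, thin(), thin()

# Toy symmetric count matrices standing in for two replicates.
rng = np.random.default_rng(1)
x, y = rng.poisson(3, (6, 6)), rng.poisson(3, (6, 6))
a = np.triu(x) + np.triu(x, 1).T
b = np.triu(y) + np.triu(y, 1).T
pooled, rep1, rep2 = pseudo_replicates(a, b)
```

Sampling only the upper triangle and mirroring it ensures that the two symmetric entries of a bin pair always receive the same count.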

Apart from the reproducibility score, HiC-spector provides a number of matrix algorithms useful for analyzing contact maps. For instance, it includes a function for performing matrix balancing using the Knight-Ruiz algorithm (Knight and Ruiz, 2012), which is widely used as a normalization procedure for contact maps (Imakaev et al., 2012). In addition, it includes functions for estimating the average contact frequency with respect to genomic distance, as well as for identifying the so-called A/B compartments (Lieberman-Aiden et al., 2009) using the corresponding correlation matrix.
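The last two analyses can be sketched as follows in Python with NumPy. This is a simplified illustration of the general approach (distance-averaged contacts, observed/expected ratio, correlation matrix, leading eigenvector), not the HiC-spector functions; the function names and the toy two-compartment matrix are assumptions.

```python
import numpy as np

def mean_contact_by_distance(W):
    """Average contact frequency on each diagonal, i.e. at each bin separation."""
    W = np.asarray(W, dtype=float)
    return np.array([np.diagonal(W, offset=k).mean() for k in range(W.shape[0])])

def compartment_vector(W):
    """Leading eigenvector of the correlation matrix of observed/expected contacts."""
    W = np.asarray(W, dtype=float)
    idx = np.arange(W.shape[0])
    # Expected contact for each entry, looked up by genomic separation |i - j|.
    expected = mean_contact_by_distance(W)[np.abs(idx[:, None] - idx[None, :])]
    with np.errstate(divide="ignore", invalid="ignore"):
        oe = np.where(expected > 0, W / expected, 0.0)
        corr = np.nan_to_num(np.corrcoef(oe))  # constant rows would give NaN
    _, vecs = np.linalg.eigh(corr)
    return vecs[:, -1]                          # eigenvector of the largest eigenvalue

# Toy map: contacts decay with distance and are doubled within the same block label.
labels = np.repeat([0, 1, 0, 1], 10)
idx = np.arange(labels.size)
decay = 1.0 / (1.0 + np.abs(idx[:, None] - idx[None, :]))
W = decay * np.where(labels[:, None] == labels[None, :], 2.0, 1.0)

profile = mean_contact_by_distance(W)
pc1 = compartment_vector(W)
```

Bins are then assigned to one of two groups by the sign of the entries of `pc1`. Which sign corresponds to A or B cannot be decided from the matrix alone and usually requires an external annotation.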

3 Implementation and Benchmark

HiC-spector is a library written in Julia, a high-performance language for technical computing. The bottleneck in evaluating the reproducibility metric is matrix diagonalization, so the runtime depends strongly on the size of the contact maps. We found the performance sufficient for most practical purposes. For instance, given a pair of contact maps of human chromosome 1 with a bin size of 40 kb, it takes 80 seconds on a laptop with a 2.8 GHz Intel Core i7 and 16 GB of RAM.

Funding

This work has been supported by NIH award U41 HG007000.

Conflict of Interest

The authors declare no conflict of interest.

Acknowledgements

We thank the 3D Nucleome subgroup in the ENCODE consortium for processing the Hi-C data and for discussion.

References

  1. Ay, F., and Noble, W.S. (2015). Analysis methods for studying the 3D architecture of the genome. Genome Biol. 16, 183.
  2. Chung, F. (1997). Spectral graph theory (American Mathematical Society).
  3. Dekker, J., Marti-Renom, M.A., and Mirny, L.A. (2013). Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nat. Rev. Genet. 14, 390–403.
  4. Imakaev, M., Fudenberg, G., McCord, R.P., Naumova, N., Goloborodko, A., Lajoie, B.R., Dekker, J., and Mirny, L.A. (2012). Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat. Methods 9, 999–1003.
  5. Kalhor, R., Tjong, H., Jayathilaka, N., Alber, F., and Chen, L. (2011). Genome architectures revealed by tethered chromosome conformation capture and population-based modeling. Nat. Biotechnol. 30, 90–98.
  6. Knight, P.A., and Ruiz, D. (2012). A fast algorithm for matrix balancing. IMA J. Numer. Anal. drs019.
  7. Lieberman-Aiden, E., van Berkum, N.L., Williams, L., Imakaev, M., Ragoczy, T., Telling, A., Amit, I., Lajoie, B.R., Sabo, P.J., Dorschner, M.O., et al. (2009). Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome. Science 326, 289–293.
Posted November 21, 2016.
Subject Area

  • Bioinformatics