Abstract
Analysis of Hi-C data has shown that the genome can be divided into two compartments called A/B compartments. These compartments are cell-type specific and are associated with open and closed chromatin. We show that A/B compartments can be reliably estimated using epigenetic data from two different platforms, the Illumina 450k DNA methylation microarray and DNase hypersensitivity sequencing. We do this by exploiting the fact that the structure of long range correlations differs between open and closed compartments. This work makes A/B compartments readily available in a wide variety of cell types, including many human cancers.
Background
Hi-C, a method for quantifying long-range physical interactions in the genome, was introduced by Lieberman-Aiden et al. [2009], and reviewed in Dekker et al. [2013]. A Hi-C assay produces a so-called genome contact matrix which – at a given resolution determined by sequencing depth – measures the degree of interaction between two loci in the genome. In the last 5 years, significant efforts have been made to obtain Hi-C maps at ever increasing resolutions [Dixon et al., 2012, Jin et al., 2013, Naumova et al., 2013, Pope et al., 2014, Rao et al., 2014, Dixon et al., 2015]. Currently, the highest resolution maps are 1kb [Rao et al., 2014]. Existing Hi-C experiments have largely been performed in cell lines or for samples where unlimited input material is available.
In Lieberman-Aiden et al. [2009] it was established that at the megabase scale, the genome is divided into two compartments, called A/B compartments. Interactions between loci are largely constrained to occur between loci belonging to the same compartment. The ‘A’ compartment was found to be associated with open chromatin and the ‘B’ compartment with closed chromatin. Lieberman-Aiden et al. [2009] also showed that these compartments are cell-type specific, but did not comprehensively describe differences between cell types across the genome. In most subsequent work using the Hi-C assay, the A/B compartments have received little attention; the focus has largely been on describing smaller domain structures using higher resolution data. Recently, it was shown that 36% of the genome changes compartment during mammalian development [Dixon et al., 2015] and that these compartment changes are associated with gene expression; they conclude “that the A and B compartments have a contributory but not deterministic role in determining cell-type-specific patterns of gene expression”.
The A/B compartments are estimated by an eigenvector analysis of the genome contact matrix after normalization by the observed-expected method [Lieberman-Aiden et al., 2009]. Specifically, boundary changes between the two compartments occur where the entries of the first eigenvector change sign. The observed-expected method normalizes bands of the genome contact matrix by dividing by their mean. This effectively standardizes interactions between two loci separated by a given distance by the average interaction between all loci separated by the same amount. It is critical that the genome contact matrix is normalized in this way, for the first eigenvector to yield the A/B compartments.
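The normalization and eigenvector step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the contact matrix is assumed to be symmetric and already binned, and correlating the normalized rows before the eigen-analysis is an assumption of this sketch.

```python
import numpy as np

def observed_expected(contact):
    """Divide every diagonal band of a symmetric contact matrix by its mean."""
    contact = np.asarray(contact, dtype=float)
    n = contact.shape[0]
    normalized = np.zeros_like(contact)
    for d in range(n):
        band = np.diagonal(contact, offset=d)  # all pairs of loci d bins apart
        mean = band.mean()
        if mean > 0:
            i = np.arange(n - d)
            normalized[i, i + d] = band / mean
            normalized[i + d, i] = band / mean  # mirror to the lower triangle
    return normalized

def first_eigenvector(symmetric):
    """Unit-norm eigenvector belonging to the largest eigenvalue."""
    values, vectors = np.linalg.eigh(symmetric)
    return vectors[:, np.argmax(values)]

def hic_compartment_vector(contact):
    # Assumption of this sketch: correlate the normalized rows first.
    correlation = np.corrcoef(observed_expected(contact))
    return first_eigenvector(correlation)
```

Compartment boundaries would then be read off where consecutive entries of the returned vector change sign; the overall sign of the vector remains arbitrary.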
Open and closed chromatin can be defined in different ways using different assays such as DNase hypersensitivity or ChIP sequencing for various histone modifications. While Lieberman-Aiden et al. [2009] established that the ‘A’ compartment is associated with open chromatin profiles from various assays, including DNase hypersensitivity, it was not determined to which degree these different data types measure the same underlying phenomena, including whether the domain boundaries estimated using different assays coincide genomewide.
In this manuscript, we show that we can reliably estimate A/B compartments as defined using Hi-C data by using the Illumina 450k DNA methylation microarray data [Bibikova et al., 2011] as well as DNase hypersensitivity sequencing [Crawford et al., 2006, Boyle et al., 2008]. Both of these data types are widely available on a large number of cell types. In particular the 450k array has been used to profile a large number of primary samples, including many human cancers; more than 20,000 samples are readily available through the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) [TCGA]. We show that our methods can recover cell type differences. This work makes it possible to study A/B compartments comprehensively across many cell types, including primary samples, and to further investigate the relationship between genome compartmentalization and transcriptional activity or other functional readouts.
As an application, we show how the somatic mutation rate in prostate adenocarcinoma is different between compartments and we show how the A/B compartments change between several human cancers; currently The Cancer Genome Atlas does not include assays measuring chromatin accessibility. Furthermore, our work reveals unappreciated aspects of the structure of long-range correlations in DNA methylation and DNase hypersensitivity data. Specifically, we observe that both DNA methylation and DNase signal are highly correlated between distant loci, provided that the two loci are both in the closed compartment.
Results
A/B compartments are highly reproducible and are cell-type specific
We obtained publicly available Hi-C data on EBV-transformed lymphoblastoid cell lines (LCLs) and fibroblast cell lines and estimated A/B compartments by an eigenvector analysis of the normalized Hi-C contact matrix (Methods). The contact matrices were preprocessed with ICE [Imakaev et al., 2012] and normalized using the observed-expected method [Lieberman-Aiden et al., 2009]. As in Lieberman-Aiden et al. [2009], we found that the eigenvector divides the genome into two compartments based on the sign of its entries. These two compartments have previously been found to be associated with open and closed chromatin; in the following we will use open to refer to the ‘A’ compartment and closed to refer to the ‘B’ compartment. The sign of the eigenvector is arbitrary; in this manuscript we select the sign so that positive values are associated with the closed compartment (Methods). In Figure 1 we show estimated eigenvectors at 100kb resolution from chromosome 14 across two cell types measured in multiple laboratories with widely different sequencing depth, as well as variations in the experimental protocol. We observe a very high degree of correspondence between replicates of the same cell type; on chromosome 14, the correlation between eigenvectors from experiments in the same cell type is greater than 96% (ranges from 96.3% to 98.4%). The agreement, defined as the percentage of genomic bins which are assigned to the same compartment in two different experiments, is greater than 92% (ranges from 92.6% to 96.0%).
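The two summary statistics used throughout, correlation and compartment agreement, can be computed as in the following sketch. The function names are hypothetical, and both eigenvectors are assumed to be defined on the same bins with the same sign convention.

```python
import numpy as np

def eigenvector_correlation(u, v):
    """Pearson correlation between two eigenvectors on the same bins."""
    return float(np.corrcoef(u, v)[0, 1])

def compartment_agreement(u, v):
    """Fraction of bins assigned to the same compartment (same sign)."""
    u, v = np.asarray(u), np.asarray(v)
    return float(np.mean((u > 0) == (v > 0)))
```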
Using high resolution data does not change the estimated A/B compartments as seen in Figure 2.
Figure 1 shows the A/B compartments are cell-type specific, with a variation between cell types which exceeds technical variation in the assay; this has been previously noted [Lieberman-Aiden et al., 2009, Dixon et al., 2015]. The correlation between eigenvectors from different cell types is around 60%, in contrast to 96+% between eigenvectors from the same cell type.
In the remainder of the manuscript, we use the most recent data, i.e., EBV-2014 and IMR90-2014, to represent eigenvectors and A/B compartments derived from Hi-C data in these cell types.
Predicting A/B compartments from DNA methylation data
To estimate A/B compartments using other types of epigenetic data we first concentrate on DNA methylation data assayed using the Illumina 450k microarray platform. Data from this platform is widely available across many different primary cell types. To compare with existing Hi-C maps, we obtained data from 288 EBV-transformed lymphoblastoid cell lines from the HapMap project [Heyn et al., 2013].
DNA methylation is often described as related to active and inactive parts of the genome. Most established is high methylation in a genic promoter leading to silencing of the gene [Deaton and Bird, 2011]. As a first attempt to predict A/B compartments from DNA methylation data, we binned the genome and averaged methylation values across samples and CpGs inside each bin. Only CpGs more than 4kb away from CpG islands were used; these are termed open sea CpGs (Methods). We found that high levels of average methylation were associated with the open compartment and not the closed compartment; this might be a consequence of averaging over open sea probes. Figure 3 depicts data from such an analysis for lymphoblastoid cell lines on chromosome 14 at a 100kb resolution and shows some agreement between estimated compartments from Hi-C and this analysis, with a correlation of 56.3% and a compartment agreement between datasets of 71.7%. In this analysis, we implicitly assume that there is no variation in compartments between different individuals in the same cell type.
Surprisingly, we found that we could improve considerably on this analysis by doing an eigenvector analysis of a suitably processed between-CpG correlation matrix (Figure 3). This matrix represents correlations between any two CpGs measured on the 450k array, with the correlation being based on biological replicates of the same cell type. The correlation eigenvector shows strong agreement with the Hi-C eigenvector; certainly higher than with the average methylation vector (Figure 3). Quantifying this agreement, we find that the correlation between the two vectors is 84.8% and the compartment agreement is 83.8% on chromosome 14. Genomewide, the correlation is 70.9% and the agreement is 79% (Table 1); we tend to perform worse on smaller chromosomes. Again, this analysis implicitly assumes lack of variation in compartments between biological replicates.
Closely examining differences between the 450k-based predictions and the Hi-C-based estimates, we find that almost all disagreements between the two methods occur when entries in one of the two eigenvectors are close to zero; in other words, where there is uncertainty about the compartment in either of the two analyses. Excluding bins where the 450k-based prediction is close to zero, that is, bins that have an absolute eigenvector value less than 0.01, we get an agreement of 88.8% (14.2% of the bins excluded). Excluding bins where either the 450k-based prediction is close to zero or the Hi-C eigenvector is close to zero, we get an agreement of 93% (24.8% of the bins excluded).
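The filtering described above can be illustrated with a short sketch that reports both the agreement on the retained bins and the fraction of bins excluded. The function name and the `filter_ref` switch are hypothetical; the 0.01 cutoff is the one quoted in the text.

```python
import numpy as np

def filtered_agreement(pred, ref, cutoff=0.01, filter_ref=False):
    """Agreement after dropping bins whose eigenvector value is near zero."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    keep = np.abs(pred) >= cutoff            # drop uncertain predicted bins
    if filter_ref:
        keep &= np.abs(ref) >= cutoff        # also drop uncertain reference bins
    agreement = np.mean((pred[keep] > 0) == (ref[keep] > 0))
    excluded = 1.0 - keep.mean()
    return float(agreement), float(excluded)
```

Calling it once with `filter_ref=False` and once with `filter_ref=True` corresponds to the two exclusion schemes compared in the paragraph.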
Our processing of the correlation matrix is as follows (details are in Methods and rationale is explained below): in our correlation matrix, we only include so-called ‘open sea’ CpGs; these CpGs are more than 4kb away from CpG islands. Next, we bin each chromosome into 100kb bins and compute which open sea CpGs are inside each bin; this will vary between bins due to the design of the 450k microarray. To get a single number representing the correlation between two bins, we take the median of the correlations of the individual CpGs located in each bin. We obtain the first eigenvector of this binned correlation matrix and gently smooth the signal by using two iterations of a moving average with a window size of 3 bins.
The sign of the eigenvector is chosen so that the correlation between the eigenvector and the column sums of the correlation matrix is positive; this ensures that positive values of the eigenvector are associated with the closed compartment (see Methods).
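The processing steps above (median-binned correlations, first eigenvector, sign selection, smoothing) can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: the function names are hypothetical, `values` is assumed to be a samples-by-CpGs matrix of open sea probes on one chromosome, every bin is assumed to contain at least one probe, and padding with edge values during smoothing is a choice made here.

```python
import numpy as np

def binned_correlation(values, bin_ids, n_bins):
    """Median of CpG-level correlations for every pair of bins."""
    corr = np.corrcoef(values, rowvar=False)          # CpG x CpG
    members = [np.flatnonzero(bin_ids == b) for b in range(n_bins)]
    if any(m.size == 0 for m in members):
        raise ValueError("every bin needs at least one probe in this sketch")
    binned = np.empty((n_bins, n_bins))
    for i in range(n_bins):
        for j in range(i, n_bins):
            block = corr[np.ix_(members[i], members[j])]
            binned[i, j] = binned[j, i] = np.median(block)
    return binned

def smooth(x, window=3, iterations=2):
    """Repeated moving average; edges are padded with the end values."""
    kernel = np.ones(window) / window
    half = window // 2
    for _ in range(iterations):
        x = np.convolve(np.pad(x, half, mode="edge"), kernel, mode="valid")
    return x

def compartment_eigenvector(binned):
    """First eigenvector, oriented so positive values mark closed bins."""
    values, vectors = np.linalg.eigh(binned)
    vec = vectors[:, np.argmax(values)]
    if np.corrcoef(vec, binned.sum(axis=0))[0, 1] < 0:
        vec = -vec                                    # fix the arbitrary sign
    return smooth(vec)
```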
Long-range correlations in DNA methylation data predict A/B compartment changes between cell types
To examine how well the predictions based on long-range correlations in 450k data capture differences between cell types, we obtained publicly available 450k data from 62 fibroblast samples [Wagner et al., 2014], and compared them to Hi-C data from the IMR90 cell line. Note that the fibroblast cell lines assayed on the 450k platform are from primary skin, in contrast to the IMR90 cell line, a fetal lung fibroblast. Figure 4 and Table 1 show our ability to recover the A/B compartments in fibroblasts; it is similar to our performance for EBV-transformed lymphocytes.
To firmly establish that the high correlation between our predicted compartments using DNA methylation and Hi-C data is not due to chance, we compared the predicted compartments in EBV transformed lymphocytes and fibroblasts to Hi-C data from different cell types, including the K562 cell line which serves as a somewhat independent negative control. In Figure 5, we show the correlation and agreement between the two sets of predicted compartments and Hi-C data from the three cell types. There is always a decent agreement between predicted compartments of any two cell types, but the agreement is consistently higher when the prediction is from data from the same cell type as the Hi-C data.
How to best quantify differences in A/B compartments is still an open question. Lieberman-Aiden et al. [2009] used 0 as a threshold to differentiate the two compartments. Considering the difference of two eigenvectors derived in different cell types, it is not clear that functional differences exist exactly when the two eigenvectors have opposite signs; instead, functional differences might be associated with changes in the magnitude of the eigenvectors reflecting a genomic region being relatively more open or closed. We note that the genomic region highlighted as cell-type specific, and validated by FISH, in Lieberman-Aiden et al. [2009], is far away from zero in one condition and has small values fluctuating around zero in the other condition.
Following this discussion, we focus on estimating the direction of change in eigenvectors between different cell types. Figure 4 shows estimated differences between Hi-C and 450k eigenvectors for two cell types. Large differences between the two vectors are replicated well between the two data types, but there is disagreement when the eigenvectors are close to zero. This is to be expected; there is technical variation in such a difference even between Hi-C experiments (Figure 1). Using the data displayed in Figure 1 we find that the technical variation in the Hi-C data is such that 98% of genomic bins have an absolute value less than 0.02. Using this cutoff for technical variation, we find that the correlation between the two difference vectors displayed in Figure 4 is 85% when restricted to the 24% of genomic bins where both vectors have an absolute value greater than 0.02. The signs of the differential vectors are also in high agreement; they agree in 90% of the genomic bins exceeding the cutoff for technical variation. In contrast, the correlation is 61% when the entire chromosome is included, reflecting that the technical noise is less correlated than the signal.
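The comparison of difference vectors can be sketched as below. The function and argument names are hypothetical; the inputs are assumed to be eigenvectors for two cell types from each data type, defined on the same bins, and the 0.02 cutoff is the technical-variation threshold quoted above.

```python
import numpy as np

def compare_differences(hic_a, hic_b, pred_a, pred_b, cutoff=0.02):
    """Compare between-cell-type differences from Hi-C and from predictions."""
    d_hic = np.asarray(hic_a, dtype=float) - np.asarray(hic_b, dtype=float)
    d_pred = np.asarray(pred_a, dtype=float) - np.asarray(pred_b, dtype=float)
    # Keep bins where both differences exceed the technical-variation cutoff.
    keep = (np.abs(d_hic) > cutoff) & (np.abs(d_pred) > cutoff)
    correlation = np.corrcoef(d_hic[keep], d_pred[keep])[0, 1]
    sign_agreement = np.mean((d_hic[keep] > 0) == (d_pred[keep] > 0))
    return float(correlation), float(sign_agreement), float(keep.mean())
```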
The structure of long-range correlations in DNA methylation data
To understand why we are able to predict open and closed compartments using the 450k array, we studied the structure of long-range correlations in DNA methylation data. First, we note that entries in our binned correlation matrix (within a chromosome) do not decay with distance between bins (Figure 6a). This is in contrast to a Hi-C contact matrix, which has repeatedly been shown to decay with distance as expected (Figure 6b). However, for the first eigenvector to define open and closed compartments, the Hi-C contact matrix needs to be normalized using the observed-expected method [Lieberman-Aiden et al., 2009]. This normalization has the consequence that values in the matrix no longer decay with distance (Figure 6c).
In Figure 7 we show density plots of binned correlations on chromosome 14, stratified in two ways. The first stratification separates correlations between bins which are both in the open compartment, both in the closed and finally cross-compartment correlations. This stratification shows that we have a large amount of intermediate correlation values (0.2-0.5), but only between bins which are both in the closed compartment. The second stratification separates open sea probes and CpG resort probes (probes within 4kb of a CpG island, see Methods). This stratification shows that we only have intermediate correlation values for open sea probes; CpG resort probes are generally uncorrelated. In conclusion, we have the following structure of the binned correlation matrix: most of the matrix contains correlation values around zero (slightly positive), except between two bins both in the closed compartment, which have an intermediate correlation value of 0.2-0.5. This shows why an eigen analysis of the binned correlation matrix recovers the open and closed compartments, see Figure 8 for an illustration.
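The first stratification can be illustrated with a small sketch that splits the off-diagonal entries of a binned correlation matrix by the compartment status of the two bins. The function name is hypothetical, and `is_closed` is assumed to be a boolean vector derived from a reference eigenvector (for example, from Hi-C).

```python
import numpy as np

def stratify_correlations(binned, is_closed):
    """Split between-bin correlations by compartment membership of each pair."""
    is_closed = np.asarray(is_closed, dtype=bool)
    rows, cols = np.triu_indices_from(binned, k=1)   # each pair of bins once
    a, b = is_closed[rows], is_closed[cols]
    values = binned[rows, cols]
    return {
        "closed-closed": values[a & b],
        "open-open": values[~a & ~b],
        "cross": values[a ^ b],
    }
```

Density plots of the three returned arrays would correspond to the first stratification described in the paragraph.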
The lack of decay of correlation with distance extends even to trans-chromosomal correlations, again with a clear difference between correlations within the open compartment and the closed compartment (Figure 9).
To understand what drives the correlation between closed compartments, we carefully examined the DNA methylation data in these genomic regions. Figure 10 shows a very surprising feature of the data, which explains the long-range correlations. In this figure, we have arbitrarily selected 10 samples and we plot their methylation levels across a small part of chromosome 14, with each sample having its own color. Data from both EBV-transformed lymphocytes and fibroblasts are depicted. While the same coloring scheme has been used for both cell types, there is no overlap between the samples assayed in the different experiments. The figure shows that the 10 samples have roughly the same ranking inside each region in the closed compartment. This illustrates a surprising genome-wide ranking between samples in the closed compartment.
To gain more insight into whether this ranking was caused by technological artifacts or whether it reflects real differences between the biological replicates, we obtained data where the exact same HapMap samples were profiled in two different experiments using the Illumina 27k methylation array. This array design is concentrated around CpG islands, but we determined that 5,599 probes are part of the 450k array and annotated as open sea probes. For these probes, we determined which were part of the closed compartment and we computed the sample-specific average methylation in this compartment, as a proxy for the observed ranking described above. In Figure 11a, we show that the correlation of these measurements between hybridization duplicates from the same experiment is high (92.7%). In Figure 11b we show that these measurements replicate well between different experiments (correlation of 74.4%).
The striking global ranking between different samples using the open sea probes in the closed compartment could not be explained by the bisulfite conversion control probes, by batch, or by background noise. Neither the medians nor the means were significantly associated with any of the different control probe types.
Finally, using the 27k data, we show that the eigenvector replicates between a 450k experiment and a 27k experiment using the same cell type (EBV), but different samples (correlation of 89%, see Figure 12). As a control, we compared to a 450k derived eigenvector for a different cell type (fibroblast) and observed weak correlation (40%). We note that the eigenvector derived from the 27k experiment is based on far fewer probes; we do not recommend using 27k data to estimate compartments. This result shows that the estimated genome compartments do not depend on the design of the microarray and suggests that our observations are common across methylation assays.
Notes on processing of the DNA methylation data
We have analyzed a wide variety of DNA methylation data both from the Illumina 450k and Illumina 27k microarrays. Which kind of data is publicly available (raw or processed) varies between datasets. Where possible, we have preferred to process the data ourselves starting from the Illumina IDAT files. However, for several datasets, we had to use the original authors' preprocessing pipeline; see the Methods section for details.
We examined the impact of preprocessing methods on the estimated eigenvectors by using functional normalization [Fortin et al., 2014b], quantile normalization adapted to the 450k array [Aryee et al., 2014] and raw (no) normalization; we did not find any substantial changes in the results. The agreement between the eigenvectors using the different preprocessing methods is greater than 94% and we note that the agreement with Hi-C data is best using functional normalization. This might be caused by the ability of functional normalization to preserve large differences in methylation between samples [Fortin et al., 2014b], which is what we observe in the closed compartment.
We examined the resolution of our approach using data from the 450k methylation array. As resolution increases, the number of bins with zero or few probes per bin increases. In Figure 13 we show the tradeoff between bins with zero probes and agreement with Hi-C data. This figure shows that a reasonable lower limit of resolution is 100kb. We note that the compartments estimated from Hi-C data do not change with increased resolution (Figure 2).
An application to prostate cancer
We applied these methods to Illumina 450k data on prostate adenocarcinoma (PRAD) from The Cancer Genome Atlas (TCGA). Quality control shows both normal and cancer samples to be of good quality. Since the normal prostate samples represent uncultured primary samples, we confirmed that this dataset had the same information in its long-range correlation structure as established above (Figure 14; compare with Figure 10).
We obtained a list of curated somatic mutations from TCGA and used them to compute simple estimates of the mutation rate in each 100kb bin of the genome. Since the list of somatic mutations was obtained using whole-exome sequencing (WXS), we obtained the relevant list of capture regions and used this list to compute mutation rates. We compared the somatic mutation rate to the eigenvector estimating the open and closed compartments and found the mutation rate to be elevated in the closed compartment. This confirms previous observations about the relationship between mutation rates and open and closed chromatin [Makova and Hardison, 2015], including cancer [Schuster-Böckler and Lehner, 2012, Polak et al., 2015]. Of particular interest was the mutation rate in genomic regions belonging to different compartments in normal and tumor samples. Table 2 shows these mutation rates computed using bins where the associated eigenvector value has a magnitude greater than 0.01; this was done to discard bins where the compartment association could be considered ambiguous. The table shows that regions of the genome changing from open to closed compartment in tumors have a similar mutation rate to regions which are in the closed compartment in both tumor and normals. Said differently: changes in compartments are associated with changes in somatic mutation rate. To our knowledge, this is the first time a cancer-specific map of open and closed compartments based on primary samples has been derived; existing analyses depend on chromatin assays performed in ENCODE and Epigenomics Roadmap samples [Schuster-Böckler and Lehner, 2012, Polak et al., 2015].
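A per-compartment mutation rate of the kind tabulated in Table 2 could be computed along the following lines. This is a sketch under assumptions: the function name is hypothetical, per-bin mutation counts and captured base pairs are assumed to be precomputed, positive eigenvector values denote the closed compartment, and pooling counts over bins (rather than averaging per-bin rates) is a choice made here.

```python
import numpy as np

def mutation_rates(mutations, captured_bp, eig_normal, eig_tumor, cutoff=0.01):
    """Pooled mutation rate for each (normal, tumor) compartment combination."""
    mutations = np.asarray(mutations, dtype=float)
    captured_bp = np.asarray(captured_bp, dtype=float)
    eig_normal = np.asarray(eig_normal, dtype=float)
    eig_tumor = np.asarray(eig_tumor, dtype=float)
    # Discard ambiguous bins and bins without any captured bases.
    keep = ((np.abs(eig_normal) > cutoff) & (np.abs(eig_tumor) > cutoff)
            & (captured_bp > 0))
    rates = {}
    for normal_closed in (False, True):
        for tumor_closed in (False, True):
            group = (keep
                     & ((eig_normal > 0) == normal_closed)
                     & ((eig_tumor > 0) == tumor_closed))
            if group.any():
                key = ("closed" if normal_closed else "open",
                       "closed" if tumor_closed else "open")
                rates[key] = mutations[group].sum() / captured_bp[group].sum()
    return rates
```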
Compartments across human cancers
Using the method we have developed in this manuscript, it is straightforward to estimate A/B compartments across a wide variety of human cancers using data from TCGA. Figure 16 displays the smoothed first eigenvectors for chromosome 14 at 100kb resolution for eleven different cancers. Regions of similarity and differences are readily observed. We emphasize that TCGA does not include assays measuring chromatin accessibility such as DNase or various histone modifications. The extent to which these differences are associated with functional differences between these cancers is left for future work.
Compartment prediction using DNase hypersensitivity data
Lieberman-Aiden et al. [2009] established a connection between A/B compartments and DNase data, mostly illustrated by selected loci. Based on these results we examined the degree to which we can predict A/B compartments using DNase hypersensitivity data. This data, while widely available from resources such as ENCODE, does not encompass as wide a variety of primary samples as the Illumina 450k methylation array.
We obtained DNase-seq data on 70 samples [Degner et al., 2012] from EBV-transformed lymphocytes from the HapMap project, as well as 4 experiments on the IMR90 cell line performed as part of the Roadmap Epigenomics project [Bernstein et al., 2010]. We computed coverage vectors for each sample and adjusted them by library size. For each sample, we computed the signal in each 100kb genomic bin and averaged this signal across samples. The resulting mean signal is highly skewed towards positive values in the open compartment, and we therefore centered the signal by the median. The median was chosen as this has the best compartment agreement with Hi-C data. Figure 17 shows the result of this procedure, slightly modified for display purposes (the sign was changed to let high values be associated with closed compartment; additionally very low values were thresholded). A good visual agreement is observed for both cell types; the correlation between Hi-C and the average DNase signal on chromosome 14 is 72% for EBV and 76% for IMR90 with a compartment agreement of 83% for EBV and 86% for IMR90.
Inspired by the success of considering long-range correlations for the 450k data, we computed a DNase correlation matrix by computing the Pearson correlation matrix of the binned DNase signal; in contrast to the 450k data, we did not bin the correlation matrix as the signal matrix was already binned. The first eigenvector of this correlation matrix is highly skewed; we centered it by its median. Figure 17 shows the result of this procedure. We obtained a correlation between this centered eigenvector and the Hi-C eigenvector of 78% for EBV and 76% for IMR90 and a compartment agreement of 86% for EBV and 83% for IMR90. These results are comparable to what we obtain using the average DNase signal. It might be notable that the correlation based method works better for the EBV data which contains biological replicates and worse for the IMR90 dataset which is based on growth replicates of the same cell line.
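Both DNase-based scores can be sketched in a few lines. The function name is hypothetical; `signal` is assumed to be a samples-by-bins matrix of library-size-adjusted coverage in 100kb bins, with no bin that is constant across samples. The sign of the returned eigenvector is arbitrary and would still need to be oriented, and the display adjustments mentioned above (sign change, thresholding) are not reproduced.

```python
import numpy as np

def dnase_compartment_scores(signal):
    """Median-centered average signal and median-centered first eigenvector."""
    signal = np.asarray(signal, dtype=float)
    mean_signal = signal.mean(axis=0)                 # average across samples
    average_score = mean_signal - np.median(mean_signal)
    correlation = np.corrcoef(signal, rowvar=False)   # bin x bin Pearson matrix
    values, vectors = np.linalg.eigh(correlation)
    vec = vectors[:, np.argmax(values)]
    return average_score, vec - np.median(vec)
```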
To examine why the correlation-based approach works for DNase data, we performed the same investigation as for the 450k datasets. In Figure 18 we show the distribution of correlations stratified by compartment type. As for the DNA methylation data, the DNase data has high positive correlations between bins in the closed compartment, although the correlations in the DNase data are much higher. For the DNA methylation data, correlations were close to zero between loci when at least one locus was in the open compartment. In contrast to this, the DNase data show an almost uniform distribution of correlation values when one of the two loci is in the open compartment.
Figure 19 suggests that, like DNA methylation, the DNase signal is ranked in the same way between samples in every region that is part of the closed compartment. There is a tendency for the ranking to be reversed in the open compartment.
Discussion
In this work, we show how to estimate A/B compartments using long-range correlations of epigenetic data. We have comprehensively evaluated the use of data from the Illumina 450k DNA methylation microarray for this purpose; such data is widely available on many primary cell types. Using data from this platform, we can reliably estimate A/B compartments in different cell types, as well as changes between cell types.
This result is possible because of the structure of long-range correlations in this type of data. Specifically, we found that correlations are high between two loci both in the closed compartment and low otherwise, and do not decay with distance between loci. This result only holds true for array probes measuring CpGs located more than 4kb from CpG islands, so-called open sea probes. This high correlation is the consequence of a surprising ranking of DNA methylation in different samples across all regions belonging to the closed compartment. We have replicated this result in an independent experiment using the Illumina 27k DNA methylation microarray.
We have furthermore established that A/B compartments can be estimated using data from DNase hypersensitivity sequencing. This can be done in two ways: first by simply computing the average DNase signal in a genomic region, and second by considering long-range correlations in the data, like for 450k array data. Again, we exploited the structure of long range correlations in this type of epigenetic data and, like the case for DNA methylation data, we found that correlations between loci both in the closed compartment are high, whereas correlations between other loci are approximately uniformly distributed. Again, this correlation is caused by a ranking of DNase signal in different samples across all regions belonging to the closed compartment.
Our approach is based on computing the first eigenvector of the (possibly binned) correlation matrix. It is well known that this eigenvector is equal to the first left singular vector from the singular value decomposition of the data matrix. The right singular vector of this matrix is in turn equal to the first eigenvector of the sample correlation matrix; also called the first principal component. This vector has been shown to carry fundamental information about batch effects [Leek et al., 2010]. Because of this relationship, we are concerned that our method might fail when applied to experiments that are heavily affected by batch effects; we recommend careful quality control of this issue before analysis.
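The relationship between the eigenvector and the singular value decomposition can be checked numerically. The sketch below uses a hypothetical helper and assumes the data matrix is arranged as loci by samples, with each locus standardized across samples so that the scaled cross-product equals the Pearson correlation matrix.

```python
import numpy as np

def leading_vectors(data):
    """Correlation matrix, its first eigenvector, and the first left singular vector."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=1, keepdims=True)
    scaled = centered / centered.std(axis=1, ddof=1, keepdims=True)
    correlation = scaled @ scaled.T / (data.shape[1] - 1)
    values, vectors = np.linalg.eigh(correlation)
    eigenvector = vectors[:, np.argmax(values)]
    left, _, _ = np.linalg.svd(scaled, full_matrices=False)
    return correlation, eigenvector, left[:, 0]

rng = np.random.default_rng(0)
corr, eigvec, left_sv = leading_vectors(rng.normal(size=(6, 20)))
# The two unit vectors agree up to sign, so |dot product| is 1.
print(abs(eigvec @ left_sv))
```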
The reason our method works is a surprising, consistent ranking of different samples across all regions belonging to the closed compartment (and only the closed compartment). By comparison with additional 27k methylation array experiments, we have shown that this ranking is not a technical artifact caused by (for example) hybridization conditions. In recent work, we studied colon cancer and EBV transformation of lymphocytes using whole-genome bisulfite sequencing (WGBS) [Hansen et al., 2011, 2014]. Using WGBS, it is easy to estimate the average methylation level across all CpGs in the genome; we call this global methylation. In these two systems, we observed global hypomethylation as well as an increased variation in global methylation levels in colon cancer and EBV-transformed lymphocytes when compared to normal matched samples from the same person. However, we saw minimal variation in global methylation between 3 normal samples in both systems. This work might explain why EBV-transformed samples from different people show a consistent ranking genomewide. But it does not explain why the same observation is made in fibroblasts and normal primary prostate (however, the latter could be affected by contamination of the normal tissue with the adjacent cancer). More work is needed to firmly establish whether this observation holds true for most primary tissues or might be a consequence of oncogenesis or manipulation in culture. We note that the cause of the ranking does not matter; as long as the ranking is present it can be exploited to reconstruct A/B compartments.
The functional implications of A/B compartments have not been comprehensively described; we know they are associated with open and closed chromatin [Lieberman-Aiden et al., 2009], replication timing domains [Ryba et al., 2010, Pope et al., 2014], changes during mammalian development and are somewhat associated with gene expression changes [Dixon et al., 2015]. Our work makes it possible to more comprehensively study A/B compartments, especially in primary samples. We have illustrated this with a brief analysis of the relationship between A/B compartments and somatic mutation rate in prostate adenocarcinoma.
Competing Interests
The authors declare that they have no competing interests.
Materials and Methods
Infinium HumanMethylation450 BeadChip
We use the standard formula β = M / (M + U + 100) for estimating percent methylation (the beta value) given the unmethylated and methylated intensities U and M. Traditionally, the term M-value is used for the logit transform of the beta value, and we do the same.
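As an illustration, the two quantities can be sketched in a few lines of Python. This is not the code used for the analysis (which was done in R); the function names are hypothetical, the offset of 100 is the value conventionally used for Illumina arrays, and the base-2 logit is an assumption about the logarithm base.

```python
import math

def beta_value(meth, unmeth, offset=100.0):
    """Percent methylation from methylated (M) and unmethylated (U) intensities.

    The offset stabilizes the ratio when both intensities are small;
    100 is the conventional choice and is an assumption here.
    """
    return meth / (meth + unmeth + offset)

def m_value(beta):
    """Logit transform of a beta value (base 2 assumed)."""
    return math.log2(beta / (1.0 - beta))

# A probe with M = 3000 and U = 900 gives beta = 0.75.
example_beta = beta_value(3000.0, 900.0)
example_m = m_value(example_beta)
```

A beta value of 0.5 maps to an M-value of 0, and the transform is symmetric around that point, which is one reason M-values are convenient for correlation analyses.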
With respect to CpG density, the 450k array probes fall into four categories that are related to CpG islands. CpG island probes (30.9% of the array) are probes located in CpG islands, shore probes (23.1%) are probes within 2 kb of CpG islands, and shelf probes (9.7%) are probes between 2 kb and 4 kb from CpG islands. Open sea probes (36.3%) are the rest of the probes. We use the term CpG resort probes to refer to the union of island, shore and shelf probes; in other words, non-open sea probes.
Methylation Data
Also see Table 3.
450k-Fibroblast dataset: The study contains 62 samples from primary skin fibroblasts from Wagner et al. [2014]. The raw data (IDAT files) are available on GEO (Accession number: GSE52025).
450k-EBV dataset: The study contains 288 samples from EBV-transformed lymphoblastoid cell lines (LCL) [Heyn et al., 2013] from three HapMap populations: 96 African-American, 96 Han Chinese-American and 96 Caucasian. The data are available on GEO (Accession number: GSE36369).
27k-EBV Vancouver: The study contains 180 samples from EBV-transformed lymphoblastoid cell lines (LCL) [Fraser et al., 2012] from two HapMap populations: 90 individuals from Northern European ancestry (CEU), and 90 individuals from Yoruban (West African) ancestry (YRI). The processed data are available on GEO (Accession number: GSE27146).
27k-EBV London: The study contains 77 EBV-transformed lymphoblastoid cell lines (LCL) assayed in duplicate [Bell et al., 2011]. Individuals are from the Yoruba HapMap population, and 60 of them are also part of the 27k-EBV Vancouver dataset. The raw data (IDAT files) are available on GEO (Accession number: GSE26133).
450k-PRAD Normal, 450k-PRAD Tumor: At the time of download, the dataset contained 340 prostate adenocarcinoma tumor samples from The Cancer Genome Atlas (TCGA) [TCGA] along with 49 matched normal samples. We used the Level 1 data (IDAT files) available through the TCGA Data portal.
Processing of the methylation data
For the 450k-Fibroblast and 450k-PRAD datasets, we downloaded the IDAT files containing the raw intensities. We read the data into R using the illuminaio package [Smith et al., 2013]. For data normalization, we used the minfi package [Aryee et al., 2014] to apply the noob background subtraction and dye-bias correction [Triche et al., 2013] followed by functional normalization [Fortin et al., 2014b]. We have previously shown [Fortin et al., 2014b] that functional normalization is an adequate between-array normalization when global methylation differences are expected between individuals. For the 450k-EBV dataset, only the methylated and unmethylated intensities were available, and therefore we did not apply any normalization. For the 27k-EBV London dataset, IDAT files were available, and we applied the noob background correction and dye-bias correction as implemented in the methylumi package [Triche et al., 2013]. For the 27k-EBV Vancouver dataset, IDAT files were not available and therefore we used the provided quantile normalized data as discussed in Fraser et al. [2012].
For quality control of the samples, we used the packages minfi and shinyMethyl [Aryee et al., 2014, Fortin et al., 2014a] to investigate the different control probes and potential batch effects. All arrays in all data sets passed the quality control. After normalization of the 450k array, we removed 17,302 loci that contain a SNP with an annotated minor allele frequency greater than or equal to 1% in the CpG site itself or in the single-base extension site. We used the UCSC Common SNPs table based on dbSNP 137. The table is included in the minfi package.
For the analysis of the 27k array data, we only considered probes that are also part of the 450k array platform (25,978 probes retained in total) and applied the same probe filtering as discussed above.
Construction of 450k correlation matrices
For each chromosome, we start with a p × n methylation matrix M of p normalized and filtered loci and n samples. We use M-values as methylation measures. We compute the p × p matrix of pairwise probe correlations C = cor(M′), and further bin the correlation matrix C at a predefined resolution k by taking the median correlation between the CpGs contained in each pair of bins. Because of the probe design of the 450k array, some of the bins along the chromosome do not contain any probes; these bins are removed. As discussed in the Results section, the correlations between open sea probes are the most predictive of A/B compartments, and therefore the correlation matrix is computed using only those probes (36.3% of the probes on the 450k array). The inter-chromosomal correlations are computed similarly.
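The binning step can be illustrated with a small Python sketch. It is a simplified stand-in for the R implementation, not a reproduction of it: the function names are hypothetical, probes are assumed to have non-constant M-values, and within-bin self-correlations are included in the median on the diagonal (a simplification that the text does not specify).

```python
from statistics import median

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def binned_correlation(positions, mvalues, bin_size):
    """Median probe-probe correlation for every pair of non-empty bins.

    positions: genomic coordinate of each probe on one chromosome.
    mvalues:   one list of M-values (across samples) per probe.
    Returns the retained bin indices and the binned correlation matrix.
    """
    bins = {}
    for idx, pos in enumerate(positions):
        bins.setdefault(pos // bin_size, []).append(idx)
    bin_ids = sorted(bins)  # bins without probes never appear, so they are dropped
    matrix = []
    for bi in bin_ids:
        row = []
        for bj in bin_ids:
            cors = [pearson(mvalues[i], mvalues[j])
                    for i in bins[bi] for j in bins[bj]]
            row.append(median(cors))
        matrix.append(row)
    return bin_ids, matrix

# Toy example: four probes, four samples, bins of 100 bp.
positions = [10, 20, 150, 360]
mvalues = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, 2, 3, 4]]
bin_ids, binned = binned_correlation(positions, mvalues, bin_size=100)
```

In the toy example the bin with index 2 contains no probe and is therefore absent from `bin_ids`, mirroring the removal of empty bins described above.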
Processing of the Hi-C data
For the Hi-C datasets EBV-2014, K562-2014 and IMR90-2014 from Rao et al. [2014], we used the raw observed contact matrices that were constructed from all read pairs that map to the human genome hg19 with a MAPQ ≥ 30. These contact matrices are available in the supplementary files of the GEO deposition (GSE63525). For the IMR90-2013 dataset from Jin et al. [2013], we used the online deposited non-redundant read pairs that were mapped with Bowtie [Langmead et al., 2009] to human genome hg18 using only the first 36 bases. For the EBV-2009 and K562-2009 datasets from Lieberman-Aiden et al. [2009], we used the mapped reads deposited on GEO (GSE18199). Reads were mapped to human genome hg18 using Maq, as described in [Lieberman-Aiden et al., 2009]. For the Fibro-Skin dataset from McCord et al. [2013], we merged the reads from two individuals with normal cells (Father and Age-Matched control). We used the processed reads of the GEO deposition (GSE41763) that were mapped using Bowtie2 to the hg18 genome in an iterative procedure called ICE previously described in Imakaev et al. [2012].
For the EBV-2012 dataset from Selvaraj et al. [2013] and the Fibro-HFF1 dataset from Naumova et al. [2013], we downloaded the SRA experiments containing the FASTQ files of the raw reads. We mapped each end of the paired reads separately using Bowtie to the hg18 genome with the --best mode enabled. We kept only paired reads with both ends mapping to the genome.
For all datasets but the Hi-C datasets from Rao et al. [2014], we used the liftOver tool from UCSC to lift the reads to the human genome hg19 version for consistency with the 450k array. Reads from Rao et al. [2014] were already mapped to the hg19 genome.
Construction of Hi-C matrices
As a first step, we build for each chromosome an observed contact matrix C at resolution k whose (i, j)-th entry contains the number of paired-end reads with one end mapping to the i-th bin and the other end mapping to the j-th bin. The size of the bins depends on the chosen resolution k. We remove genomic bins with low coverage, defined as bins whose total read count is less than 10% of the total number of reads in the matrix divided by the number of genomic bins. This filtering also ensures that low mappability regions are removed.
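A minimal Python sketch of these two steps is given below for illustration. The function names are hypothetical, and two details are assumptions not fixed by the text: the matrix is stored symmetrically, and the coverage of a bin is taken to be its row sum.

```python
def contact_matrix(read_pairs, n_bins, bin_size):
    """Symmetric matrix of read-pair counts between genomic bins.

    read_pairs: (position_of_end_1, position_of_end_2) on one chromosome.
    """
    mat = [[0] * n_bins for _ in range(n_bins)]
    for a, b in read_pairs:
        i, j = a // bin_size, b // bin_size
        mat[i][j] += 1
        if i != j:
            mat[j][i] += 1  # keep the matrix symmetric
    return mat

def low_coverage_filter(mat, fraction=0.1):
    """Drop bins whose row sum is below fraction * (total count / number of bins)."""
    n = len(mat)
    total = sum(sum(row) for row in mat)
    threshold = fraction * total / n
    keep = [i for i in range(n) if sum(mat[i]) >= threshold]
    filtered = [[mat[i][j] for j in keep] for i in keep]
    return keep, filtered

# Toy example: three bins of 100 bp, the last one without any reads.
pairs = [(5, 15), (5, 105), (15, 115), (105, 110)]
observed = contact_matrix(pairs, n_bins=3, bin_size=100)
kept_bins, observed_filtered = low_coverage_filter(observed)
```

The returned `kept_bins` list records which original bins survive, which is needed later when eigenvectors from different data types are compared on their shared bins.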
To correct for coverage and unknown sources of bias, we implemented the iterative correction procedure called ICE [Imakaev et al., 2012] in R. This procedure forces bins to have the same experimental visibility. We applied the normalization procedure on a per-chromosome basis and noted that for each Hi-C dataset, the iterative normalization converged in fewer than 50 iterations. For the purpose of estimating A/B compartments, we further normalized the genome contact matrix by the observed-expected procedure [Lieberman-Aiden et al., 2009], where each band of the matrix is divided by the mean of the band. This procedure accounts for spatial decay of the contact matrix.
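The observed-expected step lends itself to a short illustration. The Python sketch below covers only that step (ICE is not shown); the function name is hypothetical, a band is taken to be one diagonal of the symmetric matrix, and bands with zero mean are set to zero, which is an assumption.

```python
def observed_expected(mat):
    """Divide each band (diagonal at a fixed bin distance) by its mean."""
    n = len(mat)
    out = [[0.0] * n for _ in range(n)]
    for d in range(n):  # d = distance between bins
        band = [mat[i][i + d] for i in range(n - d)]
        mean = sum(band) / len(band)
        for i in range(n - d):
            value = mat[i][i + d] / mean if mean > 0 else 0.0
            out[i][i + d] = value
            out[i + d][i] = value  # mirror to keep the matrix symmetric
    return out

# Toy example: counts decay with distance from the diagonal.
toy_counts = [[4, 2, 1],
              [2, 8, 3],
              [1, 3, 6]]
toy_oe = observed_expected(toy_counts)
```

After this step every band has mean 1, so values above 1 indicate more contact than expected at that genomic distance and values below 1 indicate less.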
DNase-Seq Data
Also see Table 5.
DNase-EBV dataset: The study contains 70 biological replicates of EBV-transformed lymphoblastoid cell lines (LCL) [Degner et al., 2012] from the HapMap Yoruba population. The data are deposited on GEO (GSE31388) and raw files are available through http://eqtl.uchicago.edu/dsQTL_data/RAW_DATA_HDF5/.
DNase-IMR90 dataset: The dataset is composed of 4 technical replicates of the IMR90 fetal lung fibroblast cell line available on GEO (GSE18927).
Processing of the DNase-Seq data
For the DNase-EBV dataset from Degner et al. [2012], we downloaded the raw reads in the HDF5 format for both the forward and reverse strands. We converted the reads to bedGraph, lifted the reads to the hg19 genome and converted the files to bigWig files using the UCSC tools. For the DNase-IMR90 dataset, we used the raw data already provided in the bigWig format. Reads were mapped to the hg19 genome. For both datasets, data were read into R using the rtracklayer package [Lawrence et al., 2009]. We normalized each sample by dividing the DNase score by the total number of reads to adjust for library size.
Construction of DNase signal and correlation matrices
For each sample, we construct a normalized DNase signal at resolution 100kb by taking the integral of the coverage vector in each bin. This was done using bigWig files and the rtracklayer package in R [Lawrence et al., 2009]. All DNase datasets have the same read length within experiment (EBV/IMR90). This results in a p × n signal data matrix where p is the number of bins for the chromosome, and n the number of samples. The average DNase signal is the across-sample mean of this matrix. The DNase correlation matrix is the p × p Pearson correlation matrix of the signal matrix.
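For illustration, the per-bin integral and the library-size adjustment can be sketched in Python as follows. The function names are hypothetical, and the coverage track is assumed to be given as half-open (start, end, score) runs, as in a bedGraph file. Because both operations are linear, the order in which they are applied does not change the result.

```python
def binned_signal(intervals, bin_size, n_bins):
    """Integrate a coverage track over fixed-size bins.

    intervals: (start, end, score) runs with half-open coordinates.
    A run that crosses a bin boundary is split between the bins it overlaps.
    """
    signal = [0.0] * n_bins
    for start, end, score in intervals:
        pos = start
        while pos < end:
            b = pos // bin_size
            stop = min(end, (b + 1) * bin_size)
            if b < n_bins:
                signal[b] += score * (stop - pos)
            pos = stop
    return signal

def library_size_normalize(signal, total_reads):
    """Divide the signal by the total number of reads of the sample."""
    return [s / total_reads for s in signal]

# Toy example: two bins of 100 bp, one run crossing the bin boundary.
track = [(0, 50, 2.0), (90, 120, 1.0)]
raw_signal = binned_signal(track, bin_size=100, n_bins=2)
norm_signal = library_size_normalize(raw_signal, total_reads=10)
```

Stacking the normalized vectors of all samples column by column gives the p × n signal matrix from which the average signal and the correlation matrix are derived.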
Eigenvector analysis
To obtain eigenvectors of the different matrices from Hi-C, DNA methylation and DNase data, we use the non-linear iterative partial least squares (NIPALS) algorithm implemented in the mixOmics package in R [Dejean et al., 2014]. Each eigenvector is smoothed by a moving average with a 3-bin window, except for the 450k data, where we apply two iterations of the moving average smoother.
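The following Python sketch illustrates the idea with a bare-bones NIPALS iteration for the first component and a moving average smoother. It is not the mixOmics implementation: the function names are hypothetical, any centering of the matrix is left to the caller, and the smoother shrinks its window at the two ends of the vector, which is an assumption about edge handling.

```python
def nipals_first_component(X, max_iter=500, tol=1e-9):
    """First loading vector of X via the NIPALS iteration."""
    n, m = len(X), len(X[0])
    # Start from the column with the largest sum of squares.
    col = max(range(m), key=lambda j: sum(X[i][j] ** 2 for i in range(n)))
    t = [X[i][col] for i in range(n)]
    p = [0.0] * m
    for _ in range(max_iter):
        tt = sum(v * v for v in t)
        p = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        norm = sum(v * v for v in p) ** 0.5
        p = [v / norm for v in p]  # unit-length loading vector
        t_new = [sum(X[i][j] * p[j] for j in range(m)) for i in range(n)]
        diff = sum((a - b) ** 2 for a, b in zip(t_new, t))
        t = t_new
        if diff < tol:
            break
    return p

def moving_average(x, window=3, iterations=1):
    """Moving average smoother; the window is truncated at the ends."""
    half = window // 2
    for _ in range(iterations):
        x = [sum(x[max(0, i - half):i + half + 1]) /
             len(x[max(0, i - half):i + half + 1]) for i in range(len(x))]
    return x

# Toy correlation matrix with two anti-correlated groups of bins.
toy = [[1.0, 1.0, -1.0],
       [1.0, 1.0, -1.0],
       [-1.0, -1.0, 1.0]]
first_ev = nipals_first_component(toy)
smoothed = moving_average([0.0, 3.0, 0.0, 3.0, 0.0])
```

For the 450k data the smoother would be called with `iterations=2`. The overall sign of the returned vector is arbitrary and has to be fixed separately.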
When we compare eigenvectors from two different types of data, we only consider bins that exist in both data types; some bins are filtered out in a data-type dependent manner, for example because of absence of probes or low coverage. This operation slightly reduces the number of bins we consider in each comparison.
Because the sign of the eigenvector is arbitrarily defined, we use the following procedure to define a consistent sign across different chromosomes, datasets and data types. For Hi-C data and DNase data, we correlate the resulting eigenvector with the eigenvector from Lieberman-Aiden et al. [2009], changing sign if necessary to ensure a positive correlation. For DNA methylation data, we use the fact that the long-range correlations are significantly higher for the closed-closed interactions. We therefore ensure that the eigenvector has a positive correlation with the column sums of the binned correlation matrix, changing sign if necessary. This procedure results in positive values of the eigenvector being associated with closed chromatin and the ‘B’ compartment as defined in Lieberman-Aiden et al. [2009] (in that paper, negative values are associated with the closed compartment).
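The sign convention can be illustrated with a short Python sketch. The function names are hypothetical; the same helper serves both cases, with the reference being either a previously published eigenvector (Hi-C, DNase) or the column sums of the binned correlation matrix (DNA methylation).

```python
def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def orient(eigenvector, reference):
    """Flip the eigenvector if it correlates negatively with the reference."""
    if pearson(eigenvector, reference) < 0:
        return [-v for v in eigenvector]
    return list(eigenvector)

def column_sums(matrix):
    """Column sums of a (binned correlation) matrix."""
    return [sum(col) for col in zip(*matrix)]

# Methylation-style example: bins 0 and 1 are strongly correlated with each
# other (closed-like), so they should end up with positive values.
toy_corr = [[1.0, 0.8, 0.1],
            [0.8, 1.0, 0.2],
            [0.1, 0.2, 1.0]]
raw_ev = [-0.5, -0.6, 0.6]
oriented_ev = orient(raw_ev, column_sums(toy_corr))
```

Here the raw eigenvector is negatively correlated with the column sums and is therefore flipped, so that the two highly inter-correlated bins receive positive values.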
To measure the similarity between two eigenvectors, we use two measures: correlation and compartment agreement. The correlation measure is the Pearson correlation between the smoothed eigenvectors. The compartment agreement is defined as the percentage of bins that have the same eigenvector sign, interpreted as the percentage of bins that belong to the same genome compartment (A or B) as predicted by the two eigenvectors. Occasionally, this agreement is restricted to bins with an absolute eigenvector value greater than 0.01 to discard uncertain bins.
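The compartment agreement can be sketched in Python as follows (the correlation measure is a plain Pearson correlation and is not repeated here). The function name is hypothetical, and when a threshold is given the sketch requires both eigenvectors to exceed it in absolute value, which is one possible reading of the restriction.

```python
def compartment_agreement(e1, e2, min_abs=None):
    """Percentage of bins in which two eigenvectors have the same sign.

    min_abs: optional threshold; bins where either eigenvector has an
    absolute value at or below it are discarded as uncertain (assumption:
    the threshold is applied to both vectors).
    """
    pairs = list(zip(e1, e2))
    if min_abs is not None:
        pairs = [(a, b) for a, b in pairs
                 if abs(a) > min_abs and abs(b) > min_abs]
    same = sum(1 for a, b in pairs if (a > 0) == (b > 0))
    return 100.0 * same / len(pairs)

# Toy example: the third bin is close to zero in the first eigenvector.
ev_a = [0.5, -0.3, 0.005, 0.2]
ev_b = [0.4, 0.2, -0.2, 0.1]
agreement_all = compartment_agreement(ev_a, ev_b)
agreement_confident = compartment_agreement(ev_a, ev_b, min_abs=0.01)
```

Discarding the near-zero bin raises the agreement in this toy example, which shows why the restricted version is reported separately from the unrestricted one.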
Because open chromatin regions have very high DNase signal in comparison to closed chromatin regions, the DNase signal distribution is highly skewed to the right; therefore we center both the average signal and the first eigenvector by subtracting their respective median, before computing the correlation and agreement.
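A small Python illustration of the centering step follows. The function name is hypothetical; the example merely shows why the median is a better center than the mean for a right-skewed signal when the sign is used to assign compartments.

```python
from statistics import mean, median

def median_center(x):
    """Subtract the median so that half of the bins fall on each side of zero."""
    m = median(x)
    return [v - m for v in x]

# Right-skewed toy signal: one bin with very high DNase signal.
dnase = [1.0, 2.0, 3.0, 4.0, 40.0]
centered_by_median = median_center(dnase)
centered_by_mean = [v - mean(dnase) for v in dnase]

# Number of bins with a positive value under each centering.
n_pos_median = sum(1 for v in centered_by_median if v > 0)
n_pos_mean = sum(1 for v in centered_by_mean if v > 0)
```

With mean centering a single extreme bin pulls the center upward and leaves only that bin positive, whereas median centering keeps the split between positive and negative bins balanced.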
Somatic mutations in PRAD
We obtained a list of somatic mutations in PRAD from the TCGA data portal (https://tcga-data.nci.nih.gov/tcga/). Several lists exist; we used the Broad Institute curated list (broad.mit.edu IlluminaGA curated DNA sequencing level2.maf). To obtain capture regions, we queried the CGHub website (https://cghub.ucsc.edu) and found that all samples were profiled using the same capture design described in the file whole exome agilent 1.1 refseq plus 3 boosters.targetIntervals.bed obtained from the CGHub bitbucket account.
Somatic mutation rates in each 100kb genomic bin were computed as the number of mutations inside each bin, divided by the length of the capture regions inside the bin.
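This computation can be sketched in Python as below. The function name is hypothetical, capture regions are assumed to be non-overlapping half-open intervals that lie within the binned range, and bins without any captured bases are reported as `None` (an assumption about how such bins are handled).

```python
def mutation_rates(mutation_positions, capture_regions, bin_size, n_bins):
    """Mutations per captured base pair in each genomic bin."""
    counts = [0] * n_bins
    for pos in mutation_positions:
        counts[pos // bin_size] += 1

    captured = [0] * n_bins
    for start, end in capture_regions:
        pos = start
        while pos < end:  # split regions that cross a bin boundary
            b = pos // bin_size
            stop = min(end, (b + 1) * bin_size)
            captured[b] += stop - pos
            pos = stop

    return [c / length if length > 0 else None
            for c, length in zip(counts, captured)]

# Toy example: three bins of 100 bp, the last one without capture regions.
rates = mutation_rates(mutation_positions=[10, 20, 150],
                       capture_regions=[(0, 40), (90, 110)],
                       bin_size=100, n_bins=3)
```

Dividing by the captured length rather than by the bin size matters for exome data, since only a small and variable fraction of each 100kb bin is actually sequenced.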
Software
Methods for performing the analysis of 450k methylation arrays described in this manuscript will be added to the minfi package [Aryee et al., 2014].
Acknowledgments
Thanks to John Muschelli who made our Obs/Exp normalization function a thousand times faster. The results shown here are in whole or part based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/.