Abstract
Chromatin conformation constitutes a fundamental level of eukaryotic genome regulation. However, our ability to examine its biological function and role in disease is limited by the large amounts of starting material required to perform current experimental approaches. Here, we present Low-C, a Hi-C method for low amounts of input material. By systematically comparing Hi-C libraries made with decreasing amounts of starting material we show that Low-C is highly reproducible and robust to experimental noise. To demonstrate the suitability of Low-C to analyse rare cell populations, we produce Low-C maps from primary B-cells of a diffuse large B-cell lymphoma patient. We detect a common reciprocal translocation t (3;14) (q27;q32) affecting the BCL6 and IGH loci and abundant local structural variation between the patient and healthy B-cells. The ability to study chromatin conformation in primary tissue will be fundamental to fully understand the molecular pathogenesis of diseases and to eventually guide personalised therapeutic strategies.
Introduction
The three-dimensional organisation of chromatin in the nucleus plays a fundamental role in regulating gene expression, and its misregulation has a major impact in developmental disorders (Lupiáñez et al. 2015; Franke et al. 2016) and diseases such as cancer (Hnisz et al. 2016). The development of chromosome conformation capture (3C) (Dekker et al. 2002) assays and, in particular, their recent high-throughput variants (e.g. Hi-C), have enabled the examination of 3D chromatin organisation at very high spatial resolution (Rao et al. 2014; Lieberman-Aiden et al. 2009). However, the most widely used current experimental approaches rely on the availability of a substantial amount of starting material – on the order of millions of cells – below which experimental noise and low sequencing library complexity become limiting factors (Belaghzal, Dekker, and Gibcus 2017). Thus far, this restricts high-resolution analyses of population Hi-C to biological questions for which large numbers of cells are available and limits the implementation of chromatin conformation analyses for rare cell populations such as those commonly obtained in clinical settings. While single-cell approaches exist (Nagano et al. 2013; Ramani et al. 2017; Stevens et al. 2017; Flyamer et al. 2017), they typically operate on much lower resolutions than population-based approaches and require an extensive set of specialist skills and equipment that might be out of reach for the average genomics laboratory.
Recently, two methods have been developed to measure chromatin conformation using low amounts of starting material (Z. Du et al. 2017; Ke et al. 2017). However, the lack of a systematic comparison of the data obtained with these approaches and conventional in situ Hi-C limits our understanding of the technical constraints imposed by the amounts of starting material available. In addition, it remains to be demonstrated whether these methods could be directly applied to samples of clinical interest, such as tumour samples.
Here, we present Low-C, an improved in situ Hi-C method that allows the generation of high-quality genome-wide chromatin conformation maps using very low amounts of starting material. We validate this method by comparing chromatin conformation maps for a controlled cell titration, demonstrating that the obtained maps are robust down to 1,000 cells of starting material and are able to detect all conformational features – compartments, topologically associating domains (TADs) and loops – similarly to maps produced with a higher number of cells. Finally, we demonstrate the applicability of Low-C to clinical samples by generating chromatin conformation maps of primary B-cells from a diffuse large B-cell lymphoma (DLBCL) patient. Computational analysis of the data allows us to detect patient-specific translocations and substantial amounts of variation in topological features.
Results
Low-C: A Hi-C method for low amounts of input material
We first sought to develop a Hi-C method for low amounts of input material. To do so, we modified the original in situ Hi-C protocol (Rao et al. 2014), which recommends 5-10 million (M) starting cells, to allow for much smaller quantities of input material. The modifications are subtle, involving primarily changes in reagent volume and concentrations, as well as timing of the individual experimental steps (Fig. 1a, Methods, Supplementary Table 1). The combined changes, however, are highly effective, allowing us to produce high-quality Hi-C libraries from starting cell numbers as low as one thousand (1k) cells.
To assess the feasibility and limitations of Low-C, we prepared libraries for progressively lower numbers of mouse embryonic stem cells (mESC) using two different restriction enzymes (Supplementary Table 2). Each library was deep-sequenced to an average depth of 100-150×10^6 reads and processed using a computational Hi-C pipeline with particular emphasis on the detection and filtering of experimental biases (Methods). The ratios of the number of cis- and trans-contacts (Lajoie, Dekker, and Kaplan 2015) indicate a high library quality for all samples (Methods) (Supplementary Table 3). Visual inspection of normalised Hi-C maps for 1M to 1k cells revealed a high degree of similarity between Low-C samples, with TADs clearly identifiable at a resolution of 50kb (Fig. 1b, Supplementary Fig. 1a). To determine the degree of similarity between samples, we computed correlations of all contact intensities against the 1M sample, which showed very high levels of reproducibility (Fig. 1b, Pearson correlation coefficient R≥0.95 in all cases). To evaluate the overall level of reproducibility with other protocols, we compared a pooled Low-C dataset, in which samples up to 50k cells were merged to account for differences in sequencing depth, to a previously published mESC Hi-C dataset (Dixon et al. 2012). This comparison revealed a strong contact intensity correlation (R=0.97), which was further confirmed by a principal component analysis that displayed strong clustering of Low-C samples and high similarity of Low-C to other mESC datasets (Supplementary Fig. 1b). In addition, we performed aggregate TAD and aggregate loop analysis (Flyamer et al. 2017) on the 1k and 1M samples (Fig. 1c,e), which revealed highly consistent TAD (Fig. 1d) and loop strengths (Fig. 1f) across datasets. Overall these results suggest that Low-C is a robust method to generate chromatin conformation maps using small amounts of input material.
Low-C data have similar properties as conventional Hi-C
We next wanted to ensure that the number of input cells does not limit the range of observations one can obtain from a Hi-C matrix. In a Hi-C experiment, each DNA fragment can only be observed in a single ligation product, limiting the number of possible contacts of the corresponding genomic region to twice the number of input cells (in a diploid cell line). This raises the concern for Low-C that low-probability contacts – such as those in far-cis – would be lost for very small numbers of cells. To test this, we calculated the correlation of contact intensities at increasing distances for the 100k, 10k, and 1k against the 1M sample. Reassuringly, while the expected decrease in correlation with distance was apparent, the decrease in contact correlation is independent of the input cell number (Fig. 2a), indicating that the loss of low-probability contacts was not a limiting factor for input cell numbers as low as one thousand. Furthermore, the remaining differences in correlation disappeared when comparing sub-sampled matrices to the same number of valid pairs (Supplementary Fig. 2a), suggesting that sequencing depth, and not the initial number of cells, is the main determinant of the correlation coefficients. We also confirmed that diversity of Hi-C contacts, measured as the absolute number of unique fragment pairs in a Low-C experiment, is not affected by the amount of input cells, but it is primarily a function of sequencing depth (Supplementary Fig. 2b-c).
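The distance-stratified comparison described above can be illustrated with a minimal sketch. It assumes two normalised, equally binned square contact matrices and computes one Pearson correlation per diagonal (that is, per genomic separation in bins). The function name, the toy matrices and the per-diagonal stratification are illustrative assumptions; the text does not specify how distances were grouped.

```python
import numpy as np

def correlation_by_distance(ref, other, max_offset):
    """Pearson correlation between matching diagonals of two square matrices.

    Diagonal k holds all contacts between bins separated by k bins.
    Hypothetical helper for illustration only.
    """
    result = []
    for k in range(max_offset + 1):
        a = np.diagonal(ref, offset=k)
        b = np.diagonal(other, offset=k)
        keep = np.isfinite(a) & np.isfinite(b)
        # Too few points or constant values: correlation is undefined.
        if keep.sum() < 3 or np.std(a[keep]) == 0 or np.std(b[keep]) == 0:
            result.append(float("nan"))
        else:
            result.append(float(np.corrcoef(a[keep], b[keep])[0, 1]))
    return result

# Toy data: a symmetric "reference" map and a slightly noisy copy.
rng = np.random.default_rng(0)
ref = rng.random((20, 20))
ref = ref + ref.T
noisy = ref + rng.normal(scale=0.01, size=ref.shape)
corr = correlation_by_distance(ref, noisy, max_offset=5)
```

Plotting such values against the separation for each low-input sample is one way to check whether the decay with distance depends on the number of input cells.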
To explore the limits of Low-C, we performed an extensive characterisation of the properties of these libraries. Previous work had identified systematic biases in Hi-C data that can serve as read-outs for the efficiency of Hi-C library generation (Yaffe and Tanay 2011; Jin et al. 2013; Cournac et al. 2012) (Fig. 2b, Supplementary Fig. 3a-f). Most notably, PCR duplicates indicate low library complexity – a limitation that has been previously described when trying to scale down the Hi-C protocol (Belaghzal, Dekker, and Gibcus 2017) – while an excess of different types of ligation products, such as self-ligated fragments, can point to problems in the digestion and ligation steps (Methods). Unsurprisingly, given the higher need for amplification, we find that PCR duplicates increase with lower amounts of starting material, with roughly 20% of read pairs identified as duplicates in the 1k sample (Fig. 2b). Ligation errors, however, remained more or less constant across samples, irrespective of the number of cells (Supplementary Fig. 4). Other low-input Hi-C datasets (Z. Du et al. 2017 a; Ke et al. 2017) display similar biases (Fig. 2a,b), confirming that decreasing library complexity appears to be the strongest limitation on the lowest number of input cells that is feasible for low-input Hi-C approaches.
Compartments, TADs and loops can be detected in Low-C data
Next, we set out to ensure that not only the Hi-C maps themselves, but also measures derived from them are reproducible and unaffected by differences in input cell number. To do so, we calculated several established and widely used Hi-C measures on the Low-C matrices at 50kb resolution, namely: the profile of expected contacts at increasing distances between genomic regions (Lieberman-Aiden et al. 2009) (Fig. 3a, Supplementary Fig. 5); the correlation matrix and its first eigenvector, used to derive AB compartments (Lieberman-Aiden et al. 2009) (Fig. 3b, Supplementary Fig. 6); and the insulation score (Crane et al. 2015), commonly used to infer TADs and TAD boundaries (Kruse et al. 2016) (Fig. 3c). All three examples of Hi-C measures are consistent with results from conventional Hi-C and showed high reproducibility between the 1M and 1k samples with no apparent dependence on the number of input cells, demonstrating that Low-C libraries are highly consistent and reproducible for input cell numbers as low as one thousand cells.
Generation of Low-C maps for DLBCL primary tissue
Given our ability to obtain high-quality chromatin conformation maps using low amounts of input material, we sought to determine whether the technique could be applied in a real-world scenario where the amount of starting material is likely to be the limiting factor in obtaining chromatin contact maps. To test this, we performed Low-C on a diffuse large B-cell lymphoma (DLBCL) sample and on normal B-cells extracted from a healthy donor as a control (see Methods). Generating chromatin contact maps with low amounts of input material is beneficial not only because it allows the 3D chromatin conformation to be examined directly in the diseased cells, but also because it maximises the availability of tissue for other procedures and minimises patients’ burden from having to undergo repeated biopsies to obtain extra material.
Patient and donor CD20+ lymphocytes were isolated from lymph nodes and blood, respectively, using a magnetic microbead-labelled CD20+ antibody and magnetic-activated cell sorting (MACS) (Yan et al. 2009) (Fig. 4a, Methods). We confirmed that the cell fixation procedure did not affect the efficiency of MACS sorting and that we were able to correctly distinguish CD20+ from CD20-cells in a mixture of HBL1 and Jurkat cells (Supplementary Fig. 7) and in the peripheral blood mononuclear cells (PBMCs) from the control sample (Supplementary Fig. 8), respectively before and after formaldehyde fixation. Using the MACS approach, we were able to isolate the majority of B-cells from the control sample and the cell line mixture, although a non-critical fraction of B-cells was lost during the process (Supplementary Fig. 9). The same was true for the patient sample where fixation did not affect the surface molecules needed for MACS sorting (Supplementary Fig. 10) and where the eluted CD20+ cell population was made up of 95.5% B-cells (Supplementary Fig. 10f). We then performed Low-C on approximately 50k cells from each of the patient and control samples and deep-sequenced the resulting libraries to approximately 500 million (patient) and 300 million (control) reads (Supplementary Table 2). The resulting chromatin maps show a high degree of similarity between the patient and control B-cells (Fig. 4b). TADs (ordinary and loop domains) and loops are clearly distinguishable in the maps, and de novo loop calling using HICCUPS (Rao et al. 2014) and subsequent aggregate loop analysis (Fig. 4c) confirms that these can be identified automatically with high confidence. Overall these results confirm that Low-C can be successfully used in a clinical setup to obtain high-quality chromatin conformation maps directly from primary patient tissue.
Identification of structural variation in patient Low-C data
Structural variation and, in particular, genome rearrangements are a characteristic feature in many cancers (Weischenfeldt et al. 2013). Since chromatin contact maps have an intrinsic bias for detecting interactions that happen in the proximal linear sequence (Lieberman-Aiden et al. 2009), Hi-C-like data can be used to detect structural variation (Harewood et al. 2017; Lin et al. 2018; Lupiáñez et al. 2015; Franke et al. 2016; Hnisz et al. 2016; van de Werken et al. 2012; Krijger and de Laat 2016; Zepeda-Mendoza et al. 2015; Simonis et al. 2009). In order to detect potential translocations in the DLBCL sample in a fully automated and unbiased manner, we performed virtual 4C (V4C) for the patient and control data. Specifically, we considered each 25kb bin in the genome in turn as a viewpoint to detect cases that display significant amounts of signal anywhere in the genome of the DLBCL cells that do not appear in control B-cells (Methods). Hi-C maps at locations of putative structural variations were then browsed manually to remove false positives.
Most prominently, this analysis identified two regions of interest on chromosome 3q27 separated by ~8Mb, with significant interactions with chromosome 14q32 (Fig. 5a-c). As expected, a normal V4C profile was observed around the viewpoint in chromosome 3 for both regions in both control and DLBCL cells. In contrast, the V4C profile found for the interacting regions on chromosome 14q32 was only apparent in the patient data, suggesting that the interactions are patient-specific. A closer examination of the genes located in the patient-interacting regions revealed that the first viewpoint (Fig. 5, magenta shaded region) lies directly at the BCL6 gene, a transcription factor known to be affected in DLBCL, while the interacting region on chromosome 14q32 lies at the immunoglobulin heavy-chain (IGH) locus (Fig. 5d, e), suggesting a t (3;14) (q27;q32) reciprocal translocation. Translocations involving BCL6 are among the most commonly observed rearrangements in DLBCL (C Bastard et al. 1992; Kramer et al. 1998; Offit et al. 1994), with one study reporting a ~30% (14/46) penetrance in DLBCL patients (Christian Bastard et al. 2018). The second viewpoint with significant interactions towards the telomeric end of chromosome 3 (Fig. 5, green shaded region) interacts with a more centromeric location on chromosome 14. The pattern of interaction signal decay over linear distance in the trans-chromosome interactions map suggests a breakpoint around 195.2Mb (Fig. 5d,f, black triangles) and allows us to manually reconstruct the most likely rearrangement of these regions in DLBCL from the Hi-C data: the telomeric ends of both chromosomes are involved in a reciprocal translocation, with breakpoints around chr3:187.7Mb and chr14:105.9Mb (Fig. 5h). To validate our data, we performed a fluorescence in situ hybridisation (FISH) analysis that confirmed a rearrangement of the BCL6 gene (Fig. 5i), providing orthogonal validation of the Hi-C findings.
In addition, the lack of Hi-C signal between the breakpoints in chromosome 3 and 14 suggests that the regions chr3:187.7-195.2Mb and chr14:105.6-105.9Mb have been lost on one pair of chromosomes, generating regions of loss of heterozygosity in the remaining chromosome. Interestingly, we find another smaller rearrangement involving ANXA3 on chromosome 4 and EDAR2 on chromosome X (Supplementary Fig. 11). Misregulation of ANXA3 is known to promote tumour growth, metastasis and drug resistance in both breast cancer (R. Du et al. 2018) and hepatocellular carcinoma (Tong et al. 2015). In summary, our results demonstrate that Low-C can be used directly on primary tissue to detect patient-specific chromosomal rearrangements in an unbiased manner.
Extensive rewiring of chromatin organisation in DLBCL cells
Visual comparison of the patient and control chromatin contact maps revealed numerous local structural differences. For example, the region undergoing loss of heterozygosity reported above (chr3:187.7-195.2Mb; Fig. 5d, arrow) displays a clear gain of TAD structure encompassing the genes TP63, a member of the p53 family of transcription factors that has been previously associated with cancer, and the tumour protein p63 regulated gene-1 (TPRG1) which lies in the same de novo established TAD, suggesting their potential co-regulation. To evaluate the overall extent of changes in chromatin conformation at the TAD structure level between the two samples, we used the insulation score (Crane et al. 2015) to determine TAD boundaries in both samples and looked for regions with broad changes in the Hi-C signal (Methods). Using a conservative threshold, we detected 648 regions in the genome with notable changes in local Hi-C contacts (Supplementary Table 4). Out of these, 37 appear to be de novo TADs, which in many cases overlap with known disease-related genes such as PTPRG (LaForgia et al. 1991) (Fig. 6c), APBB2 (Deffenbacher et al. 2012) (Fig. 6d), and TEAD1 (Zhou et al. 2017; Schmid et al. 2015) (Fig. 6e). Overall, we observe the majority of changes in TAD structure to be patient-specific gains, whereas the loss of TADs present in normal B-cells in the patient is a relatively rare event (Fig. 6b). Altogether, our results demonstrate that Low-C can be used to study chromatin contact differences between patient samples at the TAD level and that there are significant differences in TAD structure between DLBCL and normal B-cells.
Discussion
The development of high-throughput genome-wide techniques to measure chromatin conformation has been instrumental in furthering our understanding of the biological importance of the three-dimensional organisation of chromatin in the nucleus. In addition to providing a local environment where enhancer-promoter interactions can orchestrate the correct deployment of gene expression programmes during development, the three-dimensional chromatin conformation is fundamental to establishing proper spatial boundaries that provide enhancer insulation and limit enhancer function to those genes that need to be regulated. Chromatin conformation at the level of TADs seems to be fairly static for fully differentiated cells (Nora et al. 2012; Dixon et al. 2012, 2015), although dynamic changes in TAD structure can be observed during development in organisms ranging from Drosophila to mammals (Hug et al. 2017; Bonev et al. 2017; Z. Du et al. 2017; Ke et al. 2017), highlighting their dynamic behaviour.
A current limitation for our understanding of these dynamic changes and the potential differences in 3D chromatin conformation between tissues or in a disease context is the high amount of material that is usually necessary to perform these experiments. While single-cell Hi-C methods exist, these are usually only able to capture a small fraction of the chromatin contacts that occur across the genome. This results in sparse chromatin maps of low resolution that usually rely on TAD calls made using standard Hi-C maps, limiting their applicability in comparing samples or finding de novo TADs.
Here, we introduce Low-C, an improved Hi-C method that allows the generation of high-resolution chromatin contact maps using low amounts of input material. Beyond existing low-input Hi-C approaches (Z. Du et al. 2017; Ke et al. 2017), we perform a thorough comparison of Low-C maps and their derived measurements in a controlled environment to systematically demonstrate that Low-C is not affected by biases originating from the amount of starting material. We also show that the method is robust and applicable to mammalian samples down to one thousand cells without compromising the quality of the resulting datasets. Therefore, our results establish Low-C as an efficient method to study chromatin conformation for rare cell populations, where the collection of material currently necessary to perform population-based Hi-C protocols is infeasible. These include transient developmental stages (Hug et al. 2017; Z. Du et al. 2017; Ke et al. 2017), as well as systems of medical relevance, such as primary tissue from patient samples, where an examination of changes in chromatin conformation between healthy and disease cells might shed light on the etiology of the disease.
To demonstrate the usability of this approach in a real-world scenario, we generated Low-C maps for a DLBCL patient sample. Since changes in chromatin contact profiles and genomic rearrangements can be detected very easily through Hi-C approaches (Engreitz et al. 2012), we developed an unbiased approach to systematically detect translocations using these data, uncovering a known reciprocal translocation in this patient biopsy. This, together with recent reports of similar approaches in other tumour types (Harewood et al. 2017) highlights the clinical applicability of this technology. An added benefit of our approach when compared with previous work in primary tissue samples is the generation of high-quality genome-wide chromatin interaction maps, which allows us to examine the level of variability between cells in health and disease. In fact, we detect a large amount of variation at the TAD level, in particular in the DLBCL sample, which gains a significant amount of structure. Interestingly, in several cases the emergence of novel chromatin structural features coincides with the genomic location of genes previously associated with cancer, such as TP63 and ANXA3. Whereas the current maps do not allow us to determine cause or consequence for these changes, a broader examination of these changes in larger cohorts of patient samples, together with an integrative analysis of gene expression and chromatin states might provide insight into the causal relationships between these in a disease-specific and patient-specific manner.
Despite the increased applicability of our method, there are still a number of factors to take into consideration when planning such experiments. First, tissue heterogeneity or the presence of healthy cells in biopsies can become an issue with increasingly lower cell numbers. Specifically, the lower the input cell number, the greater the impact of contamination or variability in sample composition will be on the averaged chromatin structures visible in the Hi-C maps. These might obfuscate or increase the uncertainty about specific structural observations. In our DLBCL analysis, we set out to minimise these effects by coupling Low-C to efficient cell sorting techniques. Second, decreasing library complexity is still the current limiting factor for low-input Hi-C studies (Belaghzal, Dekker, and Gibcus 2017), and a significant number of PCR duplicates is to be expected when reducing the amount of starting material. Third, a further general limitation for bulk Hi-C methods, regardless of the initial cell input, is that long-range three-dimensional contacts between gene promoters and enhancers are likely to be missed, since they usually happen within the context of TAD interactions. Therefore, to study these important interactions, which have been shown to affect gene regulation and are associated with the risk for various types of diseases (Javierre et al. 2016; Martin et al. 2015), it might be useful to couple Low-C with capture or promoter-capture techniques (Hughes et al. 2014; Dryden et al. 2014; Mifsud et al. 2015; Jäger et al. 2015), which would allow the retrieval of these specific interactions.
In summary, our data demonstrate that it is feasible to obtain high-quality genome-wide chromatin contact maps from low amounts of input material. We anticipate that its robustness and relatively simple implementation will make Low-C an attractive option that will help bring the analysis of chromatin architecture within reach of personalised clinical diagnostics.
Methods
Low-C protocol
We followed the general protocol for in situ Hi-C as described previously (Rao et al. 2014), which we adapted for use on low cell numbers. The main differences were adjustments in the volume of the reactions, a shortening of the digestion step, a removal of biotin from the unligated fragments, and an alternative strategy for size-selection during library preparation. For a detailed step-by-step protocol please see Fig. 1a and Supplementary Table 1.
Cell culture
mESC OG2 cells were cultured as described previously (Shi et al. 2008), FACS-sorted, selected for positive eGFP expression and collected in PBS. Cells were then pelleted (300 g, 4°C for 10 min) and resuspended in 1 ml PBS.
Patient and Control samples processing
Peripheral blood mononuclear cells (PBMCs) were obtained either from a blood extraction from a healthy donor or from a lymph node biopsy from a DLBCL patient. The patient sample came from the Department of Clinical Pathology at the Robert-Bosch-Hospital in Stuttgart (Germany) and informed consent for its retrospective analysis was approved by the ethics committee of the Medical Faculty, Eberhard-Karls-University and University Hospital Tübingen (reference no. 159/2011BO2). Control PBMCs came from a donor at the Department of Medicine A, Hematology, Oncology and Pneumology, University Hospital Muenster in Muenster (Germany).
Control PBMCs were isolated from the in-between layer by density gradient centrifugation with Biocoll (Biochrom AG, Germany) and were then frozen at −80°C for preservation. The patient sample came from a biopsy of a lymph node. Briefly, the biopsy was immediately cut into pieces, homogenized and re-suspended generating a cell suspension that was then frozen and kept at −80°C, as previously described (Staiger et al. 2017).
Once the samples were thawed, cells were cross-linked in 1% formaldehyde and quenched with a 2.5M glycine solution (for details see the detailed Low-C protocol in Supplementary Table 1). A test to ensure that formaldehyde fixation does not affect the surface molecules was performed before and after fixation (Supplementary Figure 8). The viability of the surface molecules on a mixture of HBL1 and Jurkat cells was assessed by staining with CD19-PE (FL2) and CD20-FITC (FL1) antibodies (Supplementary Figure 7).
B-cells were then isolated by MACS-sorting (Yan et al. 2009) using a positive selection kit (Miltenyi Biotec, 130-091-104). Briefly, CD20+ cells were labelled using magnetic coated CD20 MicroBeads, and the cell suspension was loaded onto a MACS LS column (Miltenyi Biotec, 130-041-306) and placed in a magnetic field generated by a MACS Separator. The CD20+ cells were retained in the column while the flow through (unlabelled cells) was eliminated. The column was then removed from the MACS Separator, and the magnetically retained CD20+ cells were eluted and collected into a 15 ml Falcon tube. The performance of the MACS sorting was assessed by checking the presence and proportion of B-cells in the flow through as well as in the eluted portion for the control PBMCs (Supplementary Figure 9a-b, e-f), for the mixed cell population sample (Supplementary Figure 9c-d, g-h) and for the patient sample (Supplementary Figure 10).
Once the eluted samples were recovered, we proceeded with the lysis and the rest of the Hi-C library preparation as described in detail in Supplementary Table 1.
Bioinformatics processing of Low-C and Hi-C libraries
Prior to mapping, the two mates of each read pair were scanned for MboI ligation junctions, which indicate sequencing through a Hi-C ligation product. If a junction was found, the read was split. Reads were then mapped independently to the M. musculus reference genome (mm10) using BWA-MEM (0.7.17), which may also result in split reads where the ends map to different locations in the genome. Reads that did not align uniquely to the genome or that had a mapping quality lower than 3 were filtered out. Read pairs in which one read was filtered out were discarded.
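The junction scan can be sketched as follows. MboI cuts before GATC, so a filled-in and blunt-ligated junction reads GATCGATC. Where exactly a read is cut is not stated in the text; the sketch assumes a cut in the middle of the junction, so that each part keeps one copy of the restriction site. The function name is hypothetical.

```python
# Ligation junction produced by MboI digestion, fill-in and blunt-end ligation.
MBOI_JUNCTION = "GATCGATC"

def split_at_junctions(read, junction=MBOI_JUNCTION):
    """Split a read sequence at every ligation junction.

    Assumption: the cut is placed at the midpoint of the junction.
    Returns a list of sub-sequences; a read without junction is returned whole.
    """
    half = len(junction) // 2
    parts = []
    start = 0
    pos = read.find(junction, start)
    while pos != -1:
        cut = pos + half
        parts.append(read[start:cut])
        start = cut
        pos = read.find(junction, start)
    parts.append(read[start:])
    return parts

example = split_at_junctions("AAAAGATCGATCTTTT")
```

Each resulting part would then be mapped independently, as described above.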
For the remaining read pairs, there are three possibilities: (i) none of the two reads in a mate pair was split in the pre-processing or mapping step (see above), (ii) one read in the pair was split, resulting in 3 mapped reads with the same ID, and (iii) one read in a pair was split multiple times or both reads were split at least once, resulting in more than 3 reads with the same ID. In case (iii) the mate pair is filtered out, as the exact interacting genomic location cannot be determined; in case (ii) the pair is considered valid if two reads map to the same genomic location (within 100bp), otherwise it is discarded; case (i) is considered valid.
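The three cases can be expressed as a small decision function. This is a sketch under assumptions: each mapped read is reduced to a hypothetical (chromosome, position) tuple, the 100bp tolerance is taken from the text, and in case (ii) the returned pair consists of one of the two co-located reads plus the remaining read, a choice the text does not spell out.

```python
from itertools import combinations

def resolve_pair(segments, tolerance=100):
    """Decide whether the mapped reads sharing one ID form a valid pair.

    segments: list of (chromosome, position) tuples, one per mapped read.
    Returns a tuple of two loci, or None if the pair is discarded.
    """
    if len(segments) == 2:
        # Case (i): no read was split.
        return segments[0], segments[1]
    if len(segments) == 3:
        # Case (ii): valid only if two reads map to the same location.
        for i, j in combinations(range(3), 2):
            a, b = segments[i], segments[j]
            if a[0] == b[0] and abs(a[1] - b[1]) <= tolerance:
                remaining = segments[3 - i - j]
                return a, remaining
        return None
    # Case (iii), or fewer than two reads: ambiguous, discard.
    return None

unsplit = resolve_pair([("chr1", 100), ("chr2", 500)])
split_once = resolve_pair([("chr1", 1000), ("chr2", 5000), ("chr2", 5050)])
```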
Restriction fragments in the genome were identified computationally using known restriction sequences of MboI and HindIII, and the remaining pairs of reads were assigned to the restriction fragments.
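A minimal sketch of computational fragment identification is given below. The recognition sequences and cut positions (MboI cuts before GATC, HindIII cuts after the first A of AAGCTT) are properties of the enzymes; the function names, the half-open coordinate convention and the toy sequence are illustrative assumptions.

```python
import re
from bisect import bisect_right

SITES = {"MboI": "GATC", "HindIII": "AAGCTT"}
# Offset of the cut within the recognition sequence (top strand).
CUT_OFFSET = {"MboI": 0, "HindIII": 1}

def restriction_fragments(sequence, enzyme):
    """Return fragments as half-open (start, end) intervals along a sequence."""
    site = SITES[enzyme]
    offset = CUT_OFFSET[enzyme]
    cuts = [m.start() + offset
            for m in re.finditer(f"(?={site})", sequence.upper())]
    inner = [c for c in cuts if 0 < c < len(sequence)]
    bounds = [0] + inner + [len(sequence)]
    return list(zip(bounds[:-1], bounds[1:]))

def fragment_index(fragments, position):
    """Index of the fragment containing a 0-based position."""
    starts = [start for start, _ in fragments]
    return bisect_right(starts, position) - 1

frags = restriction_fragments("AAGATCTTGATCAA", "MboI")
```

With fragments for a whole chromosome computed once, `fragment_index` assigns each mapped read to its fragment by binary search.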
Obtaining valid pairs of reads
Pairs were filtered out if: i) the mapped reads’ distance to the nearest restriction site was larger than 5kb, ii) both reads mapped to the same fragment, or iii) the orientation and distance of reads indicated a ligation or restriction bias (Jin et al. 2013; Cournac et al. 2012). Briefly, paired reads mapping in the same direction on the chromosome likely originate from a pair of fragments that had a cut restriction site between them and that had subsequently ligated – these were considered valid. Paired reads mapping in opposite directions may indicate that the reads map to a single large fragment with one or more uncut restriction sites. In this case, pairs facing inward would have originated from an unligated, pairs facing outward from a self-ligated fragment. At large genomic distances, there are approximately equal numbers of same and opposite orientation pairs. At shorter distances, there is an increased likelihood of uncut restriction sites between two reads, and pairs in opposite direction are filtered out. For every dataset, both the inward and outward ligation cut-offs have been fixed at 10kb.
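The orientation-based part of this filter can be sketched as below. Reads are represented as hypothetical (chromosome, position, strand) tuples; a pair is inward if the leftmost read is on the forward strand and the rightmost on the reverse strand, and outward in the opposite case. The 10kb cut-offs come from the text, while treating a distance exactly equal to the cut-off as valid is an assumption.

```python
def classify_orientation(read1, read2):
    """Return 'trans', 'same', 'inward' or 'outward' for a read pair."""
    (chrom1, pos1, strand1), (chrom2, pos2, strand2) = read1, read2
    if chrom1 != chrom2:
        return "trans"
    if strand1 == strand2:
        return "same"
    left_strand = strand1 if pos1 <= pos2 else strand2
    return "inward" if left_strand == "+" else "outward"

def passes_ligation_filter(read1, read2,
                           inward_cutoff=10_000, outward_cutoff=10_000):
    """Keep same-orientation and trans pairs; keep opposite-orientation
    pairs only if they are at least the cut-off distance apart."""
    kind = classify_orientation(read1, read2)
    if kind in ("trans", "same"):
        return True
    distance = abs(read1[1] - read2[1])
    cutoff = inward_cutoff if kind == "inward" else outward_cutoff
    return distance >= cutoff

near_inward = passes_ligation_filter(("chr1", 100, "+"), ("chr1", 500, "-"))
far_inward = passes_ligation_filter(("chr1", 100, "+"), ("chr1", 50_000, "-"))
```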
In addition, pairs were marked as PCR duplicates if another pair existed in the library that mapped to the same locations in the genome, with a tolerance of 2bp. In those cases, only one of the duplicate pairs for a given locus was retained for downstream processing. Finally, the genome was partitioned into equidistant bins and fragment pairs were assigned to bins using a previously described strategy (Rao et al. 2014). The resulting contact matrix was filtered for low-coverage regions (with less than 10% of the median coverage of all regions) and corrected for coverage biases using Knight-Ruiz matrix balancing as described before (Rao et al. 2014; Knight and Ruiz 2013). Bins that had no contacts due to filtering were marked as “unmappable”.
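Duplicate marking with a positional tolerance can be sketched as follows. The sketch assumes that each pair is given as a hypothetical (chrom1, pos1, chrom2, pos2) tuple with the two ends already ordered consistently, and that strand is ignored; neither detail is specified in the text. A pair is dropped if both of its ends lie within the tolerance of an already retained pair.

```python
def remove_duplicates(pairs, tolerance=2):
    """Return the pairs that remain after removing PCR duplicates.

    pairs: iterable of (chrom1, pos1, chrom2, pos2) tuples.
    """
    kept = []
    for pair in sorted(pairs):
        chrom1, pos1, chrom2, pos2 = pair
        duplicate = False
        # kept is sorted, so only the most recent entries can be close enough.
        for k_chrom1, k_pos1, k_chrom2, k_pos2 in reversed(kept):
            if k_chrom1 != chrom1 or pos1 - k_pos1 > tolerance:
                break
            if k_chrom2 == chrom2 and abs(pos2 - k_pos2) <= tolerance:
                duplicate = True
                break
        if not duplicate:
            kept.append(pair)
    return kept

unique = remove_duplicates([
    ("chr1", 100, "chr1", 5000),
    ("chr1", 101, "chr1", 5002),  # within 2bp of the first pair at both ends
    ("chr1", 100, "chr1", 9000),
    ("chr1", 104, "chr1", 5000),  # first end is 4bp away, so not a duplicate
])
```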
Cis/trans ratio calculation
The cis/trans ratio is calculated as the ratio of the number of valid intra-chromosomal contacts (cis) to the number of valid inter-chromosomal contacts (trans). When comparing different species, this ratio will be affected by genome size and the number of chromosomes. We therefore also provide a “species-normalised” cis/trans ratio by multiplying the trans value by the ratio of possible intra-chromosomal to inter-chromosomal contacts f (the ratio of the number of intra-chromosomal pixels in the Hi-C map to the number of inter-chromosomal pixels).
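A small worked sketch of both ratios is given below. The function names are hypothetical, and pixels are counted over the full square Hi-C map; counting only the upper triangle would be an equally plausible reading and would give a slightly different f.

```python
def pixel_ratio(bins_per_chromosome):
    """f: number of intra-chromosomal pixels / number of inter-chromosomal pixels."""
    total = sum(bins_per_chromosome)
    intra = sum(n * n for n in bins_per_chromosome)
    inter = total * total - intra
    return intra / inter


def cis_trans_ratio(cis, trans):
    """Plain ratio of valid cis to valid trans contacts."""
    return cis / trans


def normalised_cis_trans_ratio(cis, trans, bins_per_chromosome):
    """Species-normalised ratio: the trans count is multiplied by f."""
    f = pixel_ratio(bins_per_chromosome)
    return cis / (trans * f)
```

For example, a genome with two chromosomes of equal size has f = 1, so the normalised ratio equals the plain ratio; with fewer, larger chromosomes f grows and the normalised ratio shrinks accordingly.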
Observed/expected (OE) Hi-C matrix generation
For each chromosome, we obtain the expected Hi-C contact values by calculating the average contact intensity for all loci at a certain distance. We then transform the normalized Hi-C matrix into an observed/expected (OE) matrix by dividing each normalized observed value by its corresponding expected value.
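The OE transformation for a single chromosome can be sketched as follows. This is a simplified illustration under the assumption of a dense, square NumPy matrix; it averages over all bins on each diagonal and does not exclude unmappable bins, which a production implementation would likely do. The function names are hypothetical.

```python
import numpy as np


def expected_by_distance(matrix):
    """Mean contact value on each diagonal (index = distance in bins)."""
    n = matrix.shape[0]
    return np.array([np.diagonal(matrix, offset=d).mean() for d in range(n)])


def observed_expected(matrix):
    """Divide every entry by the expected value at its distance."""
    matrix = np.asarray(matrix, dtype=float)
    expected = expected_by_distance(matrix)
    n = matrix.shape[0]
    i, j = np.indices((n, n))
    exp_matrix = expected[np.abs(i - j)]
    oe = np.zeros_like(matrix)
    # Entries with an expected value of zero are left at zero.
    np.divide(matrix, exp_matrix, out=oe, where=exp_matrix > 0)
    return oe
```

A matrix whose values depend only on distance from the diagonal yields an OE matrix of ones, which is a convenient sanity check.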
Aggregate TAD/loop analysis
In general, average feature analysis is performed by extracting subsets of the OE matrix (these can be single regions along the diagonal, or region pairs corresponding to matrix segments off the diagonal) and averaging all resulting sub-matrices. If the sub-matrices are of different size, they are interpolated to a fixed size using “imresize” with the “nearest” setting from the Scipy Python package.
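The aggregation step can be illustrated with the sketch below. Note that it does not call “imresize”, which is no longer part of recent SciPy releases; instead, plain NumPy index selection is used as a stand-in for nearest-neighbour resizing, and its pixel choice may differ from that of the original function. The names `resize_nearest` and `aggregate` are hypothetical.

```python
import numpy as np


def resize_nearest(sub, size):
    """Nearest-neighbour resize of a 2D array to (size, size)."""
    rows = np.arange(size) * sub.shape[0] // size
    cols = np.arange(size) * sub.shape[1] // size
    return sub[np.ix_(rows, cols)]


def aggregate(oe, region_pairs, size):
    """Average OE sub-matrices after resizing them to a common shape.

    region_pairs: list of ((row_start, row_end), (col_start, col_end))
    in bin coordinates, end exclusive.
    """
    total = np.zeros((size, size))
    for (r0, r1), (c0, c1) in region_pairs:
        total += resize_nearest(oe[r0:r1, c0:c1], size)
    return total / len(region_pairs)
```

Regions along the diagonal are expressed by passing the same interval for rows and columns; off-diagonal features such as loops use two different intervals.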
TADs and loop anchors in Fig. 1 have been obtained from (. TADs and loop anchors in Fig. 4 have been called de novo from their respective datasets (see below). The region size for TADs has been chosen as 3x TAD size, centred on the TAD, and aggregate analyses have been performed in 25kb matrices. The region size around loop anchors has been chosen as 400kb in 25kb matrices.
TAD strength is calculated as in (. Briefly, we calculate the sum of values in the OE matrix in the TAD-region and the sum of values for the two neighbouring regions of the same size divided by two. The TAD strength is then calculated as the ratio of both numbers.
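A minimal sketch of this calculation is shown below. The text leaves open exactly which matrix blocks constitute the "neighbouring regions"; here they are assumed to be the two off-diagonal blocks that hold contacts between the TAD and the equally sized flanking intervals upstream and downstream. The function name `tad_strength` is hypothetical.

```python
import numpy as np


def tad_strength(oe, start, end):
    """Ratio of within-TAD OE signal to the mean flanking OE signal.

    start/end are bin indices (end exclusive). ASSUMPTION: the flanking
    signal is taken from the TAD-to-upstream and TAD-to-downstream blocks.
    """
    size = end - start
    if start - size < 0 or end + size > oe.shape[0]:
        raise ValueError("flanking regions extend beyond the matrix")
    within = oe[start:end, start:end].sum()
    upstream = oe[start - size:start, start:end].sum()
    downstream = oe[start:end, end:end + size].sum()
    return within / ((upstream + downstream) / 2)
```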
Loop strength is calculated as in (. Briefly, we first calculate the sum of all values in the 300kb region of the Hi-C matrix centred on the loop anchors. As a comparison, we calculate the same value for two control regions, substituting one of the loop anchors for an equidistant region in the opposite direction. The loop strength is then calculated as the original sum of values divided by the average sum of values in the two control regions.
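The control-region construction can be made concrete with the following sketch. For a loop between anchors a and b (in bins, a < b) separated by d = b - a, the two controls are taken at (a - d, a) and (b, b + d). The window half-width in bins is a free parameter here and is an assumption, as are the function names; loops too close to the matrix edge are rejected rather than padded.

```python
import numpy as np


def window_sum(matrix, i, j, half_width):
    """Sum of the square window centred on entry (i, j)."""
    return matrix[i - half_width:i + half_width + 1,
                  j - half_width:j + half_width + 1].sum()


def loop_strength(matrix, a, b, half_width=6):
    """Loop signal divided by the mean signal of two control regions."""
    d = b - a
    n = matrix.shape[0]
    if a - d - half_width < 0 or b + d + half_width >= n:
        raise ValueError("control regions extend beyond the matrix")
    loop = window_sum(matrix, a, b, half_width)
    control_up = window_sum(matrix, a - d, a, half_width)
    control_down = window_sum(matrix, b, b + d, half_width)
    return loop / ((control_up + control_down) / 2)
```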
Expected values vs. distance
Intra-chromosomal Hi-C matrix entries (50kb resolution) were binned by distance to the diagonal and divided by the total number of possible contacts at each distance. The resulting average counts were plotted against distance in a log-log plot.
AB compartments
For each chromosome separately, the Hi-C matrix was converted to an OE matrix (see above). The OE matrix was then converted into a correlation matrix, where each entry (i, j) represents the Pearson correlation between row i and j of the OE matrix. Finally, the signs of the first eigenvector entries were used to call compartments.
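The correlation and eigenvector steps can be sketched with NumPy as follows. The sketch assumes that the OE matrix contains no constant rows (for example unmappable bins), since these would produce undefined correlations, and the function name is hypothetical. The sign of an eigenvector is arbitrary, so the decomposition alone does not determine which sign corresponds to which compartment.

```python
import numpy as np


def compartment_eigenvector(oe):
    """First eigenvector of the row-wise Pearson correlation matrix."""
    corr = np.corrcoef(oe)  # entry (i, j): correlation of rows i and j
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    # eigh returns eigenvalues in ascending order; take the largest.
    return eigenvectors[:, np.argmax(eigenvalues)]
```

Bins with a positive entry are then assigned to one compartment and bins with a negative entry to the other.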
Insulation score and TAD boundaries
The insulation score was calculated as described before (Crane et al. 2015), by averaging contacts in a quadratic sliding window along the diagonal of the Hi-C matrix. Insulation scores were then divided by the chromosomal average and log2-transformed. Boundaries were calculated from the vector of insulation scores as previously described (Crane et al. 2015; Hug et al. 2017a). Aggregate TAD plots in Fig. 4, and the insulation and TAD intensity difference plots in Fig. 6 use the intervals between two consecutive boundaries as input.
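A simplified version of the insulation score calculation is sketched below. The exact window placement is an assumption: for bin i, the window covers contacts between the `window` bins upstream and the `window` bins downstream of i. Boundary calling from the resulting vector is not shown, and the function name is hypothetical.

```python
import numpy as np


def insulation_score(matrix, window):
    """log2 of the windowed mean contact value relative to its average.

    Bins too close to the chromosome ends are returned as NaN.
    """
    n = matrix.shape[0]
    raw = np.full(n, np.nan)
    for i in range(window, n - window):
        # Square window of contacts spanning bin i.
        raw[i] = matrix[i - window:i, i + 1:i + window + 1].mean()
    return np.log2(raw / np.nanmean(raw))
```

Local minima of this vector correspond to regions that few contacts cross, which is why they are used as candidate boundaries.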
De novo loop calling
Loops in the DLBCL and B-cell samples have been called using an in-house implementation of HICCUPS (Rao et al. 2014). Briefly, for each entry in the Hi-C matrix, HICCUPS calculates several enrichment values over different local neighbourhoods (termed “donut”, “lower-left”, “horizontal” and “vertical”; for definition of the neighbourhoods see the original publication). Each enrichment value is associated with an FDR value for assessing statistical significance. We call loops at a matrix resolution of 25kb and perform filtering exactly as described, only retaining loops that (i) are at least 2-fold enriched over either the donut or lower-left neighbourhood, (ii) are at least 1.5-fold enriched over the horizontal and vertical neighbourhoods, (iii) are at least 1.75-fold enriched over both the donut and lower-left neighbourhood, and (iv) have an FDR <= 0.1 in every neighbourhood. We thus obtain 10,093 loops in the DLBCL and 13,213 loops in the B-cell samples, comparable to the number of loops identified originally in GM12878 cells (Rao et al. 2014).
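The four filtering criteria translate directly into a predicate on the per-neighbourhood values. The sketch below is illustrative only and is not the in-house implementation: the dictionary layout, the key names and the function name `passes_filters` are assumptions, and the enrichment and FDR values are taken as already computed.

```python
NEIGHBOURHOODS = ("donut", "lower_left", "horizontal", "vertical")


def passes_filters(enrichment, fdr, max_fdr=0.1):
    """Apply criteria (i)-(iv) to one candidate loop.

    enrichment and fdr map each neighbourhood name to the observed/expected
    fold enrichment and the FDR value, respectively.
    """
    donut = enrichment["donut"]
    lower_left = enrichment["lower_left"]
    horizontal = enrichment["horizontal"]
    vertical = enrichment["vertical"]
    return (
        max(donut, lower_left) >= 2.0           # (i) either donut or lower-left
        and min(horizontal, vertical) >= 1.5    # (ii) both horizontal and vertical
        and min(donut, lower_left) >= 1.75      # (iii) both donut and lower-left
        and all(fdr[name] <= max_fdr for name in NEIGHBOURHOODS)  # (iv)
    )
```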
Identification of structural rearrangements in DLBCL
To generate a list of candidate regions that may have undergone structural rearrangements in DLBCL, we performed Virtual 4C (V4C) for each Hi-C bin of the DLBCL matrix at 50kb resolution (viewpoint), looking for peaks of signal away from the original viewpoint (target) that were not present in normal B-cells.
Specifically, in a Hi-C matrix M of size NxN, we examined each bin i, with i ∈ [0, N-1]. If any of the bins in the interval [i-7, i+7] is unmappable (see above), bin i is not considered for further analysis, as we found that regions with mappability issues typically give rise to false-positive rearrangements. We then obtained the vector v of Hi-C signal as row i of M. The viewpoint peak height is then given by v_i. An entry v_j, with j ≠ i, is considered a peak if it is larger than 0.15*v_i and larger than 99.5% of all other values in v (the latter criterion was introduced to filter out highly noisy V4C profiles). Peaks closer than 50 bins to i are discarded as local enrichment of contacts.
V4C peaks are called as above for the DLBCL and the B-cell samples. We consider a peak as a putative rearrangement if it only occurs in the DLBCL, but not the B-cell sample. The final list of <100 putative rearrangements could then be inspected by eye in the local and inter-chromosomal Hi-C, eliminating highly noisy Hi-C regions and likely false-positives. Finally, this left just 14 peaks, of which 4 could be attributed to the ANXA3, and 10 to the t (3;14) rearrangements discussed in the manuscript.
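The peak-calling rule for a single viewpoint can be sketched as follows. This is a hedged illustration rather than the code used in this study: the function names are hypothetical, the 99.5% criterion is implemented as the 99.5th percentile of the row with the viewpoint entry removed (one possible reading of "all other values"), and peaks are compared between samples by exact bin position.

```python
import numpy as np


def v4c_peaks(matrix, i, unmappable, rel_height=0.15, percentile=99.5,
              min_distance=50, flank=7):
    """Return target bins j that form V4C peaks for viewpoint bin i."""
    n = matrix.shape[0]
    lo, hi = max(0, i - flank), min(n, i + flank + 1)
    if unmappable[lo:hi].any():
        return []  # viewpoint lies in or next to an unmappable region
    v = matrix[i]
    background = np.percentile(np.delete(v, i), percentile)
    peaks = []
    for j in range(n):
        if abs(j - i) < min_distance:
            continue  # local enrichment of contacts, not a distal peak
        if v[j] > rel_height * v[i] and v[j] > background:
            peaks.append(j)
    return peaks


def putative_rearrangements(dlbcl_peaks, bcell_peaks):
    """Peaks found in the DLBCL sample but not in the B-cell sample."""
    return sorted(set(dlbcl_peaks) - set(bcell_peaks))
```

Running `v4c_peaks` for every bin of both matrices and passing the per-viewpoint results to `putative_rearrangements` yields the candidate list that is then inspected manually.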
Hi-C difference matrices
Plots highlighting differences between DLBCL and B-cell samples (Fig. 6) have been obtained by subtracting B-cell from DLBCL Hi-C matrices at 50kb resolution. Pixels without signal in either dataset are removed for clarity.
TAD intensity difference calculations
To quantify the changes in TAD formation and intensity that occur from B-cell to DLBCL (Fig. 6a), we first merged boundaries in both samples (see above), and then calculated the average Hi-C signal between all possible pairs of contacts in-between two consecutive boundaries. This was done separately for the two datasets, and the TAD intensity difference for each region was calculated as the difference in average Hi-C signal of DLBCL and B-cell.
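A minimal sketch of this calculation is given below. It assumes that the merged boundaries are already available as a sorted list of bin indices (how boundaries are merged is not specified here) and that both matrices share the same binning; the function name is hypothetical.

```python
import numpy as np


def tad_intensity_difference(dlbcl, bcell, boundaries):
    """Difference in mean Hi-C signal (DLBCL minus B-cell) per interval.

    Each interval spans two consecutive boundaries, end exclusive.
    """
    differences = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        dlbcl_mean = dlbcl[start:end, start:end].mean()
        bcell_mean = bcell[start:end, start:end].mean()
        differences.append(dlbcl_mean - bcell_mean)
    return np.array(differences)
```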
Correlations
All reported correlations are Pearson correlations. Corresponding plots were made on log-transformed counts using the “hexbin” plotting function from the matplotlib library version 2.0.0 in Python (matplotlib.org).
The distance correlations in Fig. 2a have been obtained as follows: All intra-chromosomal contacts in a Hi-C map are first binned by distance. Bins are defined as [0-250kb), [250kb-500kb), [500kb-750kb), … in the 50kb resolution maps, [0-500kb), [500kb-1Mb), [1Mb-1.5Mb), … in the 100kb resolution maps, and [0-1Mb), [1Mb-2Mb), [2Mb-3Mb), … in the 250kb resolution maps. For each library (100k, 10k, 1k, Dixon et al., Du et al.) correlations to the 1M sample between all corresponding contact strengths in each bin are calculated. The x axis has been scaled to omit very large distances at which correlations become erratic due to the sparsity of the Hi-C matrix.
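The distance-binned correlation can be sketched for a single chromosome as follows. The sketch assumes two dense, symmetric matrices with identical binning and uses only the upper triangle; `bins_per_group` is the width of a distance bin in matrix bins (for example 5 for 250kb groups at 50kb resolution). The function name is hypothetical.

```python
import numpy as np


def distance_correlations(a, b, bins_per_group):
    """Pearson correlation between two maps within each distance group."""
    n = a.shape[0]
    i, j = np.triu_indices(n)
    group = (j - i) // bins_per_group
    result = []
    for g in range(group.max() + 1):
        selected = group == g
        x = a[i[selected], j[selected]]
        y = b[i[selected], j[selected]]
        if x.size < 2 or x.std() == 0 or y.std() == 0:
            result.append(np.nan)  # correlation undefined
        else:
            result.append(np.corrcoef(x, y)[0, 1])
    return np.array(result)
```

Groups at large distances contain few entries, which is one reason the correlations become erratic there.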
Fluorescent in situ hybridisation analysis
Interphase-FISH for BCL6 (Vysis Break apart FISH probe kit, Abbott Molecular Diagnostics, Germany) was performed on 4 μm thick tissue sections cut from FFPE archival tissue blocks as previously described (Horn et al. 2014).
Data availability
The in situ Hi-C data generated in this study have been deposited in ArrayExpress and will be available upon publication.
Previously published Hi-C datasets used in this study are available in Gene Expression Omnibus (GEO; Rao et al. 2014: GSE63525; Dixon et al. 2012: GSE35156; Du et al. 2017: GSE82185) and Genome Sequence Archive (GSA) (Ke et al. 2017: PRJCA000241).
Genome annotations have been downloaded from GENCODE, version 27.
Author contributions
J.M.V. conceived and supervised the study. N.D. performed in situ Hi-C experiments. K.K. performed computational analyses. T.E., G.O. and G.L. provided clinical samples and assisted with B-cell isolation. A.S. performed FISH on clinical samples. N.D., K.K. and J.M.V. analysed and interpreted the data. N.D., K.K. and J.M.V. wrote the manuscript. All authors participated in the discussion of the results, and commented on and approved the manuscript.
Competing financial interests
The authors declare no competing financial interests.
Acknowledgments
This research was funded by the Max Planck Society. We thank Caitlin McCarthy and Hans Schöler for kindly providing the OG2 mouse stem cells (B6; CBA-Tg (Pou5f1-EGFP)2Mnn/J; stock number 004654) used in this experiment and the Genomics core facility of the Medical Faculty Muenster for sequencing.