Predicting chromatin conformation contact maps

Over the past 15 years, a variety of next-generation sequencing assays have been developed for measuring the 3D conformation of DNA in the nucleus. Each of these assays gives, for a particular cell or tissue type, a distinct picture of 3D chromatin architecture. Accordingly, making sense of the relationship between genome structure and function requires teasing apart two closely related questions: how does chromatin 3D structure change from one cell type to the next, and how do different measurements of that structure differ from one another, even when the two assays are carried out in the same cell type? In this work, we assemble a collection of chromatin 3D datasets, each represented as a 2D contact map, spanning multiple assay types and cell types. We then build a machine learning model that predicts missing contact maps in this collection. We use the model to systematically explore how genome 3D architecture changes, at the level of compartments, domains, and loops, between cell types and between assay types.


Introduction
The spatial conformation of the genome inside a cell ("chromatin architecture") plays a key role in controlling cell type-specific cellular processes, such as regulation of gene expression and control of replication timing. An important aspect of this control involves bringing distal regulatory elements close, in 3D space, to the genes that they regulate. Consequently, aberrant changes to chromatin architecture can lead to disease as genes become improperly regulated [1][2][3][4]. Hence, understanding endogenous chromatin architecture in each cell type and tissue in the body is crucial for understanding which alterations can disrupt these programs and what their anticipated effects are.
Accordingly, a variety of technologies have been developed to measure genome 3D conformation in a high-throughput fashion, allowing scientists to more fully explore the relationship between genome structure and function (Table 1). In the most commonly used genome-wide 3D conformation assay, dilution Hi-C, DNA-DNA contacts are counted by first fixing DNA using a cross-linker, thereby linking genomic regions that are close together [5]. The DNA is then cut using an endonuclease, and fragments of cross-linked DNA are ligated and sequenced to produce counts. A variety of modifications to this general protocol have been proposed subsequently. For example, in situ Hi-C [7], micro-C [8], and DNase Hi-C [10] aim to improve the resolution and signal-to-noise ratio of Hi-C-based assays by either performing ligation in the nucleus (in situ Hi-C) or using a higher-resolution nuclease (micro-C or DNase Hi-C). ChIA-PET [6], PLAC-seq [9] and HiChIP [12] use immunoprecipitation to target DNA-DNA contacts that are mediated by a protein of interest. In contrast, SPRITE [11] and GAM [13] do not rely on ligation. SPRITE uses "split pooling," in which cross-linked molecules are tagged with the same barcodes to identify neighboring genomic regions. GAM uses cryosectioning in conjunction with sequencing and identifies genomic regions as neighboring if they are frequently observed in the same cryosections. The output of each such assay can be summarized in a contact map, a matrix containing counts of observed contacts between pairs of genomic loci.

(Table 1: dilution Hi-C, 2009 [5]; ChIA-PET, 2009 [6]; in situ Hi-C, 2014 [7]; Micro-C, 2015 [8]; PLAC-seq, 2016 [9]; DNase Hi-C, 2016 [10]; DNA SPRITE, 2018 [11].)

The National Institutes of Health 4D Nucleome Network (4DN) [14,15] has collected chromosome conformation experiments from multiple labs involving many cell or tissue types (generically, "biosamples") and assays. However, it is not feasible to perform every type of assay in every biosample,
because chromosome conformation experiments are complex and costly, and there are many biosamples that one may wish to interrogate. Therefore, we decided to pursue the hypothesis that machine learning methods can provide a draft characterization of chromosome conformation across a wide range of biosamples and assays for which experimental data is not yet available, using only experiments that have already been performed. We can summarize the available experiments in a 2D matrix, with eight rows corresponding to different 3D chromatin assays and 11 columns corresponding to different biosamples (Figure 1). Each matrix entry is a contact map, and among the 88 possible contact maps, 4DN has carried out 41. Our goal is to accurately impute the missing 47 contact maps.
Such predicted contact maps have a variety of possible uses. A systematic analysis of a complete collection of contact maps may provide global insights into how important aspects of chromatin architecture, e.g., TAD boundaries and loops, differ or are similar across biosamples and assays. In some cases, it may not be possible to run multiple 3D chromatin assays on a particular sample because of the amount of material each experiment requires. Interesting features in a predicted contact map may suggest hypotheses for follow-up experiments or may help in prioritizing future 3D chromatin assays. Conversely, if the predicted contact map for a given biosample or assay closely resembles one or more existing contact maps, that suggests that conducting the experiment may not be worthwhile. In these cases, predicted contact maps for other assay types may give an idea of what the results would have been had the experiments been carried out.
Several methods have been proposed previously to predict Hi-C contact maps, or portions thereof. Some methods, such as Akita [16], DeepC [17] and Orca [18], take as input the DNA sequence and produce as output a corresponding contact map. Akita and DeepC predict only contacts close to the diagonal of the matrix, whereas Orca can make predictions for an entire chromosome at a time. However, these sequence-only models can only make predictions for the set of experiments that they were trained on and so cannot generalize to cell types or tissues without experimental data. Alternatively, HiC-Reg [19] uses a random forest regression model to predict Hi-C contact counts based on epigenomic features of the pair of anchoring genomic intervals. Similarly, Epiphany [20] predicts Hi-C contacts from 1D epigenomic data, including measurements of chromatin accessibility, CTCF binding, and several types of histone modifications. Alternatively, DeepChIA-PET [21] predicts ChIA-PET contacts from a combination of Hi-C and ChIP-seq data. Despite being able to generalize to biosamples that they were not trained on, these methods are limited to making predictions for a single type of contact map, and they require that a fixed set of epigenomic experiments have been performed.
In contrast to these approaches, we consider the complementary problem of imputing unseen contact maps given only a collection of available contact maps. This imputation task is conceptually similar to previous work on imputing 1D epigenomic data [22][23][24]. In that setting, large compendia of 1D epigenomics experiments were organized into a three-dimensional tensor with the axes being biosample, assay, and genomic position, and computational models learned from the structure of the available measurements to make predictions of unseen experiments. In our setting, the data is organized into a four-dimensional tensor, where the additional axis is the second genomic position, because contact maps consist of interactions between pairs of positions. We introduce Sphinx, a deep tensor factorization method for imputing entire contact maps. Sphinx is composed of learned representations of biosamples, assays, and genomic positions, together with a neural network that combines these factors to make predictions. We evaluate Sphinx in a cross-validated setting, demonstrating that it outperforms several baseline methods. We then systematically apply Sphinx to the full 4DN dataset in a prospective fashion, and we examine the predicted contact matrices on three levels: contact decay profiles, compartments, and topologically associating domains (TADs). The results suggest an increased capability to compare contact matrices, as well as a smoothing effect on similarity between biosamples and assays.
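To make the tensor framing concrete, the following sketch organizes a toy collection of contact maps as a masked 4D array indexed by assay, biosample, and a pair of genomic bins. The dimensions and the single registered experiment are hypothetical, chosen only to mirror the eight-assay-by-11-biosample layout described above.

```python
import numpy as np

# Toy dimensions: 8 assays x 11 biosamples x n x n genomic bins.
n_assays, n_biosamples, n_bins = 8, 11, 50

# The collection of contact maps as a masked 4D tensor: tensor[a, b] is the
# contact map for assay a in biosample b; observed[a, b] marks availability.
tensor = np.zeros((n_assays, n_biosamples, n_bins, n_bins))
observed = np.zeros((n_assays, n_biosamples), dtype=bool)

# Register one hypothetical observed experiment (random counts for illustration).
rng = np.random.default_rng(0)
counts = rng.poisson(3, size=(n_bins, n_bins)).astype(float)
counts = (counts + counts.T) / 2          # contact maps are symmetric
tensor[0, 0] = counts
observed[0, 0] = True

# The imputation targets are the unobserved (assay, biosample) pairs.
targets = np.argwhere(~observed)
```

With 88 cells in the assay-by-biosample matrix and one observed entry, the remaining 87 cells are imputation targets; in the paper's setting, 41 entries are observed and 47 are targets.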

Sphinx
The deep matrix factorization approach was implemented in a similar fashion to the Avocado imputation approach [24]. We used PyTorch version 2.0.1 with Python 3.10.12 for the implementation. We created embedding layers for cell type, assay, and genomic position. For each prediction, we select the two genomic position factors associated with the pair of positions, along with the cell type and assay factors. These are concatenated and used as input to a series of fully connected layers with equal numbers of hidden nodes. The final output layer produces a single float value. We constrain the first position index to be less than the second to enforce symmetry of the prediction; these predictions are then reflected across the diagonal. A schematic of the architecture of the Sphinx model is shown in Figure 2. We conducted a hyperparameter search using a partial grid search spanning the initial learning rate, the dropout rate, and the numbers of cell type factors, assay factors, position factors, layers, and nodes (Figure 3). The Adam optimizer [25] was used to optimize our model. We randomly chose 500 of the 9,126 hyperparameter combinations to train models, and each model was trained for at most 50 epochs using a batch size of 10,000 examples, each example being one matrix entry of one experiment. Each epoch consisted of 526 batches. Batches were generated from randomly selected off-diagonal elements of each of the training experiments. Validation error was calculated at the end of each epoch, and model selection was based on the best validation error. The loss curves from the nine models with the lowest validation error are shown in Figure S1, with their associated hyperparameter values in Table S1. The Sphinx Python code is available, with an Apache license, at https://github.com/Noble-Lab/Sphinx.
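As a rough illustration of the architecture (not the released PyTorch implementation), the following NumPy sketch shows the forward pass: look up one factor per biosample, assay, and each of the two genomic positions, concatenate them, and pass the result through a small fully connected network to produce one predicted contact count. The ordering trick for symmetry follows the description above; all sizes are illustrative, and the weights here are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (in the paper these are tuned by grid search).
N_BIOSAMPLES, N_ASSAYS, N_BINS = 11, 8, 556
D_BIO, D_ASSAY, D_POS, HIDDEN = 4, 4, 8, 32

# Learned factor tables: one row per biosample, assay, or genomic bin.
bio_emb = rng.normal(size=(N_BIOSAMPLES, D_BIO))
assay_emb = rng.normal(size=(N_ASSAYS, D_ASSAY))
pos_emb = rng.normal(size=(N_BINS, D_POS))

# A two-layer MLP mapping the concatenated factors to one contact count.
d_in = D_BIO + D_ASSAY + 2 * D_POS
W1, b1 = rng.normal(size=(d_in, HIDDEN)), np.zeros(HIDDEN)
W2, b2 = rng.normal(size=(HIDDEN, 1)), np.zeros(1)

def predict(biosample, assay, i, j):
    """Predicted normalized contact count for the bin pair (i, j)."""
    # Order the two position indices (i <= j) so the prediction is symmetric;
    # the full map is then obtained by reflecting across the diagonal.
    i, j = min(i, j), max(i, j)
    x = np.concatenate([bio_emb[biosample], assay_emb[assay],
                        pos_emb[i], pos_emb[j]])
    h = np.maximum(x @ W1 + b1, 0.0)      # ReLU hidden layer
    return (h @ W2 + b2).item()           # single float output
```

Because the two position factors are always presented in sorted order, `predict(b, a, i, j)` and `predict(b, a, j, i)` are identical by construction.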

Data sets
We systematically downloaded all available contact map datasets from the 4D Nucleome web portal (https://data.4dnucleome.org/) and assembled them into a biosample-by-assay type matrix. Rows or columns with a single entry were iteratively deleted until every row and column contained at least two non-missing entries. The final data set contains eight assay types (Table S1), 11 biosamples (Table 2), and 44 contact maps (Figure 1 and Supplementary Table S2). We trained the model on only one chromosome at 100 kb resolution due to computational constraints. Chromosome 19 was chosen due to its moderate size. Prior to analysis, each contact map is normalized as follows. First, we log(x + 1) transform each contact map entry to reduce the high dynamic range of the contact counts while avoiding taking the log of 0. Second, we discard rows and columns of each chromosome matrix with low marginal counts. This is done by creating a separate vector of marginal counts for each contact map. We then discard any bin whose marginal count is more than 10 mean absolute deviations from the median marginal count in any one of the contact maps in the training set. In practice, this procedure eliminates 31 of the 587 bins, leaving 556 bins. To avoid leakage, the same set of bins (identified using the training data) is also eliminated from the validation and test sets. Third, we normalize the remaining log(x + 1)-transformed contacts to sum to 10^5 to avoid detecting differences in read depth. The data was randomly partitioned into training, validation, and test sets as illustrated in Figure 1. For each assay, we randomly chose experiments one by one and assigned them to the training, test, and validation sets, in that order. This ensured that when an assay had only two examples, the training and test sets each received one example of that assay. This resulted in 17 training experiments, 10 validation experiments, and 14 test experiments.
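A minimal sketch of this normalization pipeline, assuming each contact map is a square NumPy array of raw counts (the function names are ours, not from the Sphinx code):

```python
import numpy as np

def bins_to_keep(train_maps, n_mads=10.0):
    """Keep a bin only if its marginal count stays within n_mads mean absolute
    deviations of the median marginal count in every training contact map."""
    keep = np.ones(train_maps[0].shape[0], dtype=bool)
    for m in train_maps:
        marg = m.sum(axis=0)                      # marginal counts per bin
        med = np.median(marg)
        mad = np.median(np.abs(marg - med))       # median absolute deviation
        keep &= np.abs(marg - med) <= n_mads * mad
    return keep

def normalize_contact_map(raw, keep=None):
    """log(x + 1)-transform, prune low-coverage bins, rescale to a fixed total."""
    logm = np.log1p(raw.astype(float))            # tame the dynamic range
    if keep is not None:                          # bins chosen on training data
        logm = logm[np.ix_(keep, keep)]
    return logm * (1e5 / logm.sum())              # total mass fixed at 10^5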

Baseline methods
Because no previously published method addresses this imputation task directly, we compare against three baseline methods: the row mean, the column mean, and the cross mean. In previous imputation studies, this type of mean predictor provides a strong baseline [24,26]. The row mean is the average of all observed data that shares the same assay as the queried matrix. The column mean is the average of all observed data that shares the same biosample. The cross mean is the average of all observed data that shares either the biosample or the assay with the queried matrix.
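The three baselines can be sketched as follows, with the observed collection represented as a dictionary mapping (assay, biosample) pairs to contact matrices (an assumed data layout, not the paper's code):

```python
import numpy as np

def _mean_of(maps):
    """Element-wise mean of a list of equally shaped contact matrices."""
    return np.mean(np.stack(maps), axis=0)

def row_mean(obs, assay, biosample):
    """Average of all observed maps with the same assay (matrix row)."""
    return _mean_of([m for (a, b), m in obs.items() if a == assay])

def col_mean(obs, assay, biosample):
    """Average of all observed maps with the same biosample (matrix column)."""
    return _mean_of([m for (a, b), m in obs.items() if b == biosample])

def cross_mean(obs, assay, biosample):
    """Average of all observed maps sharing either the assay or the biosample."""
    return _mean_of([m for (a, b), m in obs.items()
                     if a == assay or b == biosample])
```

For example, with observed maps for (Hi-C, GM12878), (Hi-C, HeLa), and (micro-C, GM12878), the cross-mean prediction for the unobserved (micro-C, HeLa) cell averages the two maps that share its assay or its biosample.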

Computing the contact decay, eigenvector, and insulation score
To compute the contact decay profile, we computed the average along each of the diagonals of the contact matrix. To compute the eigenvector, we used NumPy version 1.25.2. To compute the insulation score, we used a window size of 30 and computed the mean of sliding-window diagonal submatrices of the contact matrix, sliding the window by one bin at each step.
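The three feature computations can be sketched as follows. The eigenvector function here takes the leading eigenvector of the symmetric contact matrix as a compartment proxy; the paper's exact preprocessing before the eigendecomposition may differ.

```python
import numpy as np

def contact_decay(m):
    """Mean contact value at each distance from the diagonal."""
    n = m.shape[0]
    return np.array([np.mean(np.diagonal(m, k)) for k in range(n)])

def compartment_eigenvector(m):
    """Leading eigenvector of the symmetric contact matrix (compartment proxy)."""
    vals, vecs = np.linalg.eigh(m)          # eigh: ascending eigenvalues
    return vecs[:, np.argmax(vals)]

def insulation_score(m, window=30):
    """Mean of the window x window diagonal submatrix centered on each bin,
    sliding one bin at a time; edges where the window overhangs are NaN."""
    n = m.shape[0]
    half = window // 2
    scores = np.full(n, np.nan)
    for i in range(half, n - half):
        scores[i] = m[i - half:i + half, i - half:i + half].mean()
    return scores
```

The paper uses `window=30` at 100 kb resolution, i.e., a 3 Mb sliding window.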

Sphinx accurately imputes unobserved contact maps
To validate our modeling approach, we began by measuring how well the model predicts previously unseen data. To this end, we randomly selected a subset of contact maps to use for validation and testing (Section 2.2). As our primary performance measure, we used the MSE between the observed and imputed contact maps, which corresponds to the loss function optimized by Sphinx. Because, to our knowledge, this type of imputation has not been previously reported in the literature, we designed several baseline methods against which to compare the performance of Sphinx. These involve averaging over rows, columns, or both (i.e., averaging in a cross-shaped pattern centered on the target matrix) (Section 2.3).
Our experiment began by using the validation set to select hyperparameters for Sphinx. In this step, we defined a seven-dimensional grid of hyperparameters: the learning rate, the numbers of cell type factors, assay factors, and position factors, the number of layers, the number of nodes per layer, and the dropout rate. Because this grid is too large to explore completely, we opted to randomly select 500 points on the hyperparameter grid and use those as the basis for our selection.
The results of this hyperparameter search suggest that Sphinx outperforms the baseline methods across a variety of hyperparameter settings (Figure 3). Among the 500 sampled hyperparameter settings, Sphinx's validation MSE is less than the cross-mean MSE in 306 cases. The best-performing hyperparameter combination achieves an MSE of 0.052, which is 6.32% lower than the MSE achieved by the best-performing baseline (MSE of 0.056 from the cross-mean model).
Next, we evaluated how well the selected hyperparameters generalize to the test data. This analysis suggests that Sphinx performs comparably to the best-performing baseline (cross-mean). In particular, computing the overall MSE on the test set, we find that Sphinx outperforms the baseline in 8 out of 14 cases (Figure 4).

Imputation model reveals similarities between biosamples
Having established that Sphinx produces predictions comparable to the cross-mean baseline, we next set out to systematically apply Sphinx to the full dataset. The goal here is to provide a data resource that enables exhaustive comparison of all cell lines and all assay types. Without imputation, many comparisons would be incomplete. For example, based on the pattern of missingness in Figure 1, if we want to compute the similarity between the GM12878 and HeLa cell lines, we could do so only on the basis of dilution Hi-C and in situ Hi-C data. In particular, this comparison would not make use of six data sets: HeLa data from DNase Hi-C and dilution Hi-C, and GM12878 data from DNA SPRITE, PLAC-seq, and two ChIA-PET experiments. Furthermore, some comparisons would be impossible to compute: in our dataset, three pairs of assay types have not been carried out in any common cell line: DNA SPRITE and DNase Hi-C, as well as DNase Hi-C and each of the two ChIA-PET assays. Accordingly, we used the previously trained model to impute contact maps for all unobserved assay-cell type combinations (Figure 1). For each imputed contact map, we also computed three types of features: contact decay profiles, eigenvectors, and insulation scores (Figures 6-8). For comparison, we also carried out the imputation and feature calculation using the cross-mean model. In each case, we see good qualitative agreement between the observed and imputed values in the test and validation sets.
Finally, based on these imputed values, we computed complete pairwise correlation matrices for cell types and assay types (Figure 9). To obtain these correlations, we first calculated the contact decay profile, eigenvector, or insulation score profile of each of the imputed matrices (Section 2.4). Next, we selected two cell types (or assay types), c1 and c2. We then computed the mean Pearson correlation between the experiments (c1, a) and (c2, a) for each of these data modalities, where a is a shared assay (or cell type). We also computed these mean correlation values using the unimputed data, skipping any experiments that were not available. The correlation matrix based on imputed values is complete and thus allowed us to carry out hierarchical clustering of cell types and assay types, which is not possible using the unimputed data due to the presence of missing values. We noted that in the unimputed correlation matrix, the entries with the greatest dissimilarities are often based on only a few experiments. When we used the imputed matrices to compute the correlation matrix, these values often became less extreme. For example, in the contact decay profile assay similarities, dilution Hi-C appears to be dissimilar from PLAC-seq and DNase Hi-C (unimputed Pearson correlations of 0.89 and 0.86, respectively, compared to imputed values of 0.96 and 0.97), and DNase Hi-C is dissimilar from in situ Hi-C (unimputed Pearson correlation of 0.83, compared to 0.95). However, only one cell type is shared between dilution Hi-C and PLAC-seq, and likewise between dilution Hi-C and DNase Hi-C. In the case of DNase Hi-C and in situ Hi-C, only two experiments are shared. This sparsity makes these computed similarities potentially less reliable. In practice, we observed that in these cases the imputed correlations are higher than the original correlations.
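A sketch of the mean-correlation computation for one pair of cell types, assuming each experiment has already been reduced to a 1D feature vector such as its contact decay profile (the helper name and dictionary layout are ours):

```python
import numpy as np

def mean_pearson(features, c1, c2):
    """Mean Pearson correlation between the feature vectors of (c1, a) and
    (c2, a) over every assay a available for both cell types.
    `features` maps (cell_type, assay) -> 1D feature vector."""
    rs = []
    for (c, a), v in features.items():
        if c == c1 and (c2, a) in features:
            rs.append(np.corrcoef(v, features[(c2, a)])[0, 1])
    # With unimputed data some pairs share no assay, leaving the entry undefined;
    # imputation makes every pair computable.
    return np.nan if not rs else float(np.mean(rs))
```

Swapping the roles of cell types and assays in the dictionary keys gives the assay-by-assay version of the same computation; the resulting complete matrix can then be passed to a standard hierarchical clustering routine.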

Discussion
Although many chromatin 3D contact maps have been collected for combinations of assays and human cell types, many combinations remain unobserved. This missingness motivates a computational method to impute unobserved experiments in order to generate hypotheses and enable downstream analyses that require complete observations. We first developed baseline methods that average over experiments sharing the same assay, the same cell type, or either. We then improved on these baselines using our machine learning model, which learns to correct the regions where the baseline predictions are most likely to err. We demonstrated improvement over the baseline methods in many of the validation cases and in more than half of the test cases (Figures 3, 4).
In our matrix of 88 potential contact maps from the 4DN web portal, we produced imputed experiments for the 47 unobserved experiments (Figure 5). We then conducted an analysis to determine similarities between assays and cell types across the contact decay profile, eigenvector, and insulation score (Figure 9). These analyses enable determination of characteristics unique to individual cell types and assays. We demonstrated that without the imputed data, some comparisons cannot be made due to insufficient data. Furthermore, we demonstrated that some aberrant correlations found in the unimputed data were tempered in the imputed data.
A limitation of this work is that the number of available experiments is currently somewhat small, limiting both the training and evaluation of such models. However, we anticipate that, as the cost of sequencing continues to decrease, the number of available experiments, and their quality, will increase over time. In parallel, we expect that the number of biosamples of interest will continue to increase as methods for isolating cells from conditions of interest become more sophisticated. Put together, methods such as Sphinx will become increasingly valuable over time as a means to provide draft characterization in the face of incomplete experimental coverage.

Figure 1:
Figure 1: Available contact maps from 4DN. Each panel displays a contact map for chromosome 19 from a particular assay and a particular biosample, where the number of normalized log counts is displayed as a color. Only cell types with at least two experiments and assays with at least two experiments were included. Each of the 44 non-missing contact maps has a colored border to indicate whether it was used for training (green), validation (orange), or testing (red).

Figure 2:
Figure 2: The Sphinx model. Each dimension of the 4D input is encoded using latent factors, and the concatenated factors are processed by a multi-layer perceptron. The final output of the model is the predicted normalized contact count.

Figure 3:
Figure 3: Hyperparameter search results. Sphinx was trained using 500 randomly selected settings of seven different hyperparameters. Each panel plots the validation set MSE (x-axis) for various values of a particular hyperparameter (y-axis). MSEs are shown as horizontal lines for the three baseline methods: cross-mean (red line), same cell type (magenta), and same assay type (green). All analyses are based on data from chromosome 19.

Figure 4:
Figure 4: Comparison of Sphinx to the cross-mean baseline. Each panel plots the test set MSE achieved by Sphinx (x-axis) versus the cross-mean baseline (y-axis). Each point corresponds to one of the 14 assay/cell type combinations in the test set.

Figure 5:
Figure 5: Sphinx imputes contact maps for unobserved assay-cell type combinations. The plot is similar to Figure 1, except that the unobserved contact maps (purple outline) are imputed by Sphinx. Each plot is a contact map for chromosome 19, where the number of normalized log counts is displayed as a color.

Figure 6:
Figure 6: Imputation enables visualization of unobserved contact decay profiles. Contact decay profiles for training (green outline), validation (orange outline), test (red outline), and unobserved data (purple outline) are shown. The inset shows an example of one contact decay profile, with the distance from the diagonal (x-axis) plotted against the intensity (y-axis), which is the mean value of all normalized contacts at that distance from the diagonal. Axes are consistent across all panels.

Figure 7:
Figure 7: Imputation enables visualization of unobserved eigenvector profiles. Eigenvector profiles for training (green outline), validation (orange outline), test (red outline), and unobserved data (purple outline) are shown. The inset shows an example of the eigenvector profile, with the position (x-axis) plotted against the eigenvector value (y-axis). Axes are consistent across all panels.

Figure 8:
Figure 8: Imputation enables visualization of unobserved insulation score profiles. Insulation score profiles for training (green outline), validation (orange outline), test (red outline), and unobserved data (purple outline) are shown. The inset shows an example of the insulation score profile, with the position (x-axis) plotted against the insulation (y-axis). Axes are consistent across all panels.

Figure S1:
Figure S1: Loss curves demonstrate sufficient training epochs. The loss curves for the nine hyperparameter combinations with the lowest validation error are shown. The associated hyperparameter combinations are listed in Table S1. The Sphinx training loss (blue), Sphinx validation loss (orange), cross-mean baseline training loss (green), cross-mean validation loss (purple), and same-assay validation loss (red) are shown.

Table 1:
Properties of assays measuring 3D genome architecture. The table only includes assay types generated by 4DN and included in this study.