Abstract
In interphase, the human genome sequence folds in three dimensions into a rich variety of locus-specific contact patterns. Here we present a deep convolutional neural network, Akita, that accurately predicts genome folding from DNA sequence alone. Representations learned by Akita underscore the importance of CTCF and reveal a complex grammar underlying genome folding. Akita enables rapid in silico predictions for sequence mutagenesis, genome folding across species, and genetic variants.
Main text
Recent research has advanced our understanding of the proteins driving, and the sequences underpinning, 3D genome folding in mammalian interphase, including the interplay between CTCF and cohesin1, and their roles in development and disease2. Still, while disruption of single bases can alter genome folding, in other cases folding is surprisingly resilient to large-scale deletions and structural variants3,4. Thus, predicting the consequences of perturbing any individual CTCF site, or other regulatory element, on local genome folding remains a challenge.
Previous machine learning approaches have either (1) relied on epigenomic information as inputs5–7, which does not readily allow for predicting effects of DNA variants, or (2) predicted derived features of genome folding (e.g. peaks8,9), which depend heavily on minor algorithmic differences10. Making quantitative predictions from sequence poses a substantial challenge: base pair information must be propagated to megabase scales where locus-specific patterns become salient in chromosome contact maps.
Convolutional neural networks (CNNs) have emerged as powerful tools for modeling genomic data as a function of DNA sequence, directly learning DNA sequence features from the data. CNNs now make state-of-the-art predictions for transcription factor binding, DNA accessibility, transcription, and RNA-binding11–14. DNA sequence features learned by CNNs can be subsequently post-processed into interpretable forms15. Recently, Basenji16 demonstrated that CNNs can process very long sequences (∼131kb) to learn distal regulatory element influences, suggesting that genome folding could be tractable with CNNs.
Here we present Akita, a deep CNN that transforms input DNA sequence into predicted locus-specific genome folding. Akita takes in ∼1Mb (2²⁰ bp) of DNA sequence and predicts contact frequency maps for all pairs of ∼2kb (2048 bp) bins within this region. Crucially, this allows Akita to predict the effects of mutating single base pairs. We trained Akita with five of the highest-quality Hi-C and Micro-C datasets as targets (Table 1), focusing on the locus-specific patterns evident in log(observed/expected) maps and minimizing the mean squared error (MSE) between predictions and targets.
The Akita architecture consists of a ‘trunk’ based on the Basenji16,17 architecture to obtain 1D representations of genomic sequence, followed by a ‘head’ that transforms these representations into 2D maps of genome folding (Fig. 1a, Methods). In the ‘head’, we first averaged the representations of genomic bins i and j. Averaging produced slightly better generalization accuracy than several alternatives, including concatenation (Supplemental Fig. 1, Supplemental Note 1). As genomic distance can impact regulatory element communication, we appended a positional encoding of the distance between bins. Drawing inspiration from CNNs used in image processing, we computed multiple layers of dilated residual 2D convolutions, re-symmetrizing after each block. Finally, we compared the upper triangular regions of target and predicted maps. We reasoned the trunk would enable Akita to learn DNA motifs and how they combine into a grammar for genome folding. In turn, the head would recognize relationships between these features and propagate this information across the map, while accounting for dependencies between neighboring bins.
Akita learned a predictive representation of genome folding from DNA sequence (overall 0.14 MSE, 0.61 Pearson, 0.56 Spearman on held-out test data). On a region-by-region basis, Akita captured the variety of patterns seen experimentally (Fig. 1b,c) and displayed a bimodal distribution of correlations. Many of the lower correlations represented correct predictions for featureless experimental maps, indicating that correlations, while interpretable, underestimated model performance on this task. Indeed, Akita’s predictions also captured the strength of locus-specific folding seen experimentally (Supplemental Fig. 2). By simultaneously training on all five datasets in a multi-task framework, Akita achieved greater accuracy for each dataset than models trained on that dataset alone (Supplemental Fig. 3). Still, Akita predicted limited cell-type-specific differences (Supplemental Fig. 4). We hypothesize this was constrained by the extent of cell-type-specific differences currently ascertainable in experiments; even the dramatic cellular transformation of cardiomyocyte differentiation displayed minimal Hi-C differences upon exit from the ESC state and mainly similarities thereafter18,19. Unless otherwise noted, we thus focused our following analyses on Akita’s predicted outputs for HFF Micro-C20, the training dataset with the strongest locus-specific folding.
Akita predicted more prominent patterns in regions with greater CTCF binding and DNase hypersensitivity (Supplemental Fig. 2). Visually, salient patterns in predicted maps often aligned with CTCF ChIP-seq peaks (Fig. 2b). However, CTCF motifs were too prevalent to observe a correspondence at the bin level (Supplemental Fig. 5). Fortunately, Akita enabled us to ascertain their influence via in silico mutagenesis; while training Akita was computationally intensive, the effects of sequence changes can be predicted in seconds. Akita predicted greatly diminished locus-specific patterns upon CTCF motif mutagenesis (Fig. 2e). Still, Akita predicted some patterns would persist, and these often aligned with DNase hypersensitive sites that lacked evidence of strong CTCF binding. Inverting all CTCF motifs produced very different predictions, redistributing rather than abrogating contact patterns (Fig. 2d, Supplemental Fig. 6). This indicated that Akita learned sequence features specifying an orientation-specific grammar of the CTCF sites most crucial for genome folding.
To explore the role of CTCF in Akita’s predictions genome-wide, we mutagenized the CTCF motifs in each region of the test set. The majority of mutagenized regions showed weaker locus-specific patterns (Fig. 3a), reminiscent of changes seen experimentally following acute CTCF degradation21,22. Performing a similar mutagenesis for each motif in the JASPAR transcription factor database23 revealed that CTCF had the strongest impact. The second largest effect was for CTCFL, which binds a motif very similar to CTCF’s but is typically inactive in somatic cells. For the remaining motifs, mutagenesis either imperceptibly disrupted genome folding or the predicted impact directly tracked the number of overlaps with CTCF motif positions (Supplemental Fig. 5). These results argue that no other transcription factor with a known motif plays as large a role as CTCF in genome architecture, and that CTCF-independent aspects of genome architecture emerge from a combinatorial interplay between different DNA-binding factors.
We next investigated Akita’s ability to predict how genetically engineered mutations alter genome folding. As Akita makes predictions for 1Mb sequences and is not influenced by information beyond this window, we sought an example where a <100kb variant had a dramatic effect on genome folding. At the Lmo2 locus in HEK293T cells24, two domains are separated by a boundary positioned at a cluster of three CTCF-bound sites (Fig. 3c). In cells with a ∼25kb deletion encompassing this boundary, the two domains merge. Making the same deletion in silico recapitulated this effect in the predicted map (Fig. 3c). Leveraging Akita’s ability to rapidly assay sequence perturbations, we examined a combinatorial set of in silico deletions in the Lmo2 locus (Supplemental Fig. 7). We found that deleting any individual CTCF site minimally altered predictions. Our model thus predicts this boundary is formed by redundant CTCF sites, a phenomenon observed experimentally at other genomic locations3,4.
Given the overall similarity of human and mouse genome folding25, we reasoned the mouse genome could provide evolutionarily perturbed sequences to further test Akita (Fig. 3b). Using mouse DNA sequences as input, we compared predictions from our human-trained model (hESC output) with mESC Hi-C data26. These cross-species predictions generally recapitulated mouse genome folding (Supplemental Fig. 8, median Spearman R: 0.50). Intriguingly, regions with poorer predictions contained more B2-SINE elements, which dramatically expanded in murid lineages and carry CTCF sites27. Mutagenizing B2-SINE elements improved our predictions for mouse genome folding (median Spearman R 0.55 vs 0.50). This suggests either that the mouse genome specifically mitigates the influence of these elements, or that Akita did not learn their true influence due to the lack of B2-SINEs in the human genome. These results are consistent with recent observations that the ChAHP complex hinders CTCF binding within murine B2-SINE elements28 and highlight opportunities for sequence-based modeling to uncover species-specific regulatory strategies.
An appealing hypothesis for future work is that neural networks with layers that better reflect the molecular and physical mechanisms organizing genomes will make more accurate and generalizable predictions. For the initial layers, convolutions naturally extend11–13 position weight matrix approaches for capturing the biophysics of protein-DNA interactions. The architectures and layers that might best reflect the process of loop extrusion, believed to organize mammalian interphase chromosomes29, or other mechanisms of genome organization remain open questions. The near future promises exciting progress: recently, a similar CNN model, deepC, was posted to bioRxiv30. While deepC has a similar ‘trunk’ to Akita, it differs greatly in the architecture of the ‘head’, data pre-processing, and training schemes (Supplemental Note 2). Future work will benefit from comparing these approaches, continuing to explore the space of alternatives, and incorporating high-quality data as it becomes available.
In summary, we present Akita, a model that predicts genome folding using only DNA sequence as an input. In the future, we envision that end-to-end sequence-to-genome-folding approaches will advance our ability to design functional screens, model enhancer-promoter interactions, prioritize causal variants in association studies, and predict the impacts of rare and de novo variants.
Methods
Code availability
All code used for training Akita is available at: https://github.com/calico/basenji/tree/tf2_hic/. The trained model is available at: https://github.com/calico/basenji/tree/tf2_hic/manuscripts/akita.
Training Data
To obtain Hi-C data conducive for convolutional neural network learning, we reprocessed five of the highest-quality publicly available human Hi-C and Micro-C datasets to 2048 bp (2¹¹ bp) bins using distiller (https://github.com/mirnylab/distiller-nf)35 to map to hg38 and cooler36 to perform genome-wide iterative correction37.
To focus on locus-specific patterns and mitigate the impact of sparse sampling present in even the currently highest-resolution Hi-C maps, we: adaptively coarse-grain, normalize for the distance-dependent decrease in contact frequency, take a natural log, clip to (−2,2), linearly interpolate missing bins, and convolve with a small 2D gaussian filter (sigma=1, width=5). The first through third steps use cooltools functions (https://github.com/mirnylab/cooltools). Interpolation of low-coverage bins filtered out in typical Hi-C pipelines was crucial for learning with log(observed/expected) Hi-C targets, greatly outperforming replacing these bins with zeros.
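The sketch below illustrates these preprocessing steps with plain numpy/scipy, substituting simple per-diagonal averaging for the cooltools implementations and omitting adaptive coarse-graining; the function and parameter names are illustrative rather than the released code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def preprocess_map(raw, clip=2.0):
    """Sketch of target preprocessing: observed/expected normalization,
    natural log, clipping to (-2, 2), interpolation of missing bins,
    and smoothing with a small 2D gaussian (sigma=1, width=5).
    The real pipeline uses cooltools, including adaptive
    coarse-graining, which this sketch omits."""
    mat = raw.astype(float).copy()
    n = mat.shape[0]
    # Divide each diagonal by its mean to remove the distance-
    # dependent decrease in contact frequency (observed/expected).
    for k in range(n):
        idx = np.arange(n - k)
        expected = np.nanmean(mat[idx, idx + k])
        if expected > 0:
            mat[idx, idx + k] = mat[idx, idx + k] / expected
            mat[idx + k, idx] = mat[idx, idx + k]
    with np.errstate(divide="ignore", invalid="ignore"):
        mat = np.clip(np.log(mat), -clip, clip)
    # Linearly interpolate missing bins rather than zero-filling,
    # which was crucial for learning with log(obs/exp) targets.
    missing = np.isnan(mat)
    if missing.any():
        good = ~missing
        mat[missing] = np.interp(
            np.flatnonzero(missing), np.flatnonzero(good), mat[good])
    return gaussian_filter(mat, sigma=1, truncate=2)  # kernel width ~5
```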
To prepare the Hi-C data for training, we divided the human genome into large virtual contigs and assigned them to training, validation, and test sets with an 80/10/10 split. We broke the chromosomes at assembly gaps, large unmappable regions, and consecutive stretches of ≥10 filtered-out Hi-C bins (in any target dataset). Within the contigs, we extracted 2²⁰ bp (∼1Mb) sequences, striding by 2¹⁸ bp (∼262kb) for the training set and 2¹⁹ bp (∼524kb) for the validation and test sets. This procedure resulted in 7,008 training, 419 validation, and 413 test sequences.
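A minimal sketch of the window-extraction arithmetic (illustrative names; the released pipeline additionally handles contig bookkeeping and bin filtering):

```python
SEQ_LEN = 2**20       # ~1 Mb model input
TRAIN_STRIDE = 2**18  # ~262 kb: overlapping windows for training
EVAL_STRIDE = 2**19   # ~524 kb: for validation and test contigs

def extract_windows(contig_start, contig_end, stride):
    """Tile (start, end) model-input coordinates across a virtual
    contig with the given stride."""
    return [(s, s + SEQ_LEN)
            for s in range(contig_start, contig_end - SEQ_LEN + 1, stride)]
```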
Model architecture
We created a neural network architecture to predict 2D Hi-C maps from 1D DNA sequences that consists of two major components. First, we processed the 1D DNA sequence using a ‘trunk’ that applies a series of convolutions, following previous work on convolutional neural networks for DNA sequence analysis. Second, we applied a ‘head’ that transforms the 1D representations to 2D for Hi-C prediction. We implemented the model using the Basenji software16,17, which is written in TensorFlow40 and Keras41.
More specifically, the ‘trunk’ includes:
Convolution with 96 filters of size 11-by-4 to transform the 1-hot encoded DNA sequence, followed by batch normalization, ReLU, and width 2 max pooling.
Convolution tower that iteratively performs convolution with 96 filters of width 5, batch normalization, ReLU, and width 2 max pooling to arrive at 512 vector representations of the sequence in 2048bp windows.
Dilated residual convolution tower that iteratively performs dilated convolution with geometrically increasing dilation rate, adding the new representation back into the old. This block spreads information about relevant sequence elements and global context across the sequence16.
Bottleneck width 1 convolution with 48 filters.
To convert these 1D representations to 2D for the Hi-C ‘head’, we averaged the representations for every pair of genomic bins i and j. This operation transforms a tensor with dimensions [512 length, 48 filters] into a tensor with dimensions [512 length, 512 length, 48 filters]. We also concatenated a positional encoding of the distance between bins, |i−j|, and applied a (1,1) convolution block to finalize the transition to 2D. Next, we treat this map as a 2D image and run multiple layers of dilated residual 2D convolutions with geometrically increasing dilation rate, re-symmetrizing after each step. Finally, we apply one last linear transformation to make predictions for the 5 datasets.
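The averaging-based transition from 1D to 2D can be sketched as follows in TensorFlow; tensor shapes follow the description above, but this is an illustration rather than the Basenji implementation:

```python
import tensorflow as tf

def one_to_two(x):
    """Average per-bin representations [batch, 512, 48] for every
    pair of bins (i, j) to form [batch, 512, 512, 48], then append
    |i - j| as a positional encoding channel."""
    batch = tf.shape(x)[0]
    length = tf.shape(x)[1]
    avg = 0.5 * (tf.expand_dims(x, 2) + tf.expand_dims(x, 1))
    idx = tf.range(length, dtype=x.dtype)
    dist = tf.abs(idx[:, None] - idx[None, :])        # |i - j|
    dist = tf.tile(dist[None, :, :, None], [batch, 1, 1, 1])
    return tf.concat([avg, dist], axis=-1)

def symmetrize(maps):
    """Re-symmetrize [batch, L, L, filters] maps after each
    2D convolution block."""
    return 0.5 * (maps + tf.transpose(maps, [0, 2, 1, 3]))
```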
Intuitively, the initial transformation from 1D to 2D should be able to recognize genomic features with important relationships for Hi-C prediction, such as two boundary elements, and the subsequent 2D convolutions serve to disseminate that recognition to the surrounding region. Intriguingly, similar sequence-to-map architectures have recently been successful for protein contact map prediction42.
Training Approach
We computed a mean squared error loss from the targets and predictions, considering only the upper triangular portion of the matrices. We fit the model parameters using stochastic gradient descent with momentum for ∼60 epochs, taking steps in batches of 2 sequences.
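A minimal sketch of this masked loss, assuming maps shaped [batch, L, L, targets]; the strict upper-triangle mask here is illustrative of, not identical to, the released code:

```python
import tensorflow as tf

def upper_tri_mse(y_true, y_pred):
    """MSE over the strictly upper triangular entries of predicted
    and target maps shaped [batch, L, L, targets]."""
    L = y_true.shape[1]
    ones = tf.ones((L, L))
    # 1s strictly above the diagonal, 0s elsewhere.
    mask = tf.linalg.band_part(ones, 0, -1) - tf.linalg.band_part(ones, 0, 0)
    mask = mask[None, :, :, None]
    sq_err = tf.square(y_true - y_pred) * mask
    return tf.reduce_sum(sq_err) / tf.reduce_sum(mask * tf.ones_like(sq_err))
```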
Data augmentation was critical to avoid overfitting and maximize generalization accuracy on unseen sequences. Each time we processed a sequence, we stochastically shifted the input sequence by up to ±11 bp and randomly reverse complemented the DNA, flipping the Hi-C map to match.
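A sketch of this augmentation on string sequences follows; that the reverse complement is applied with probability 0.5 and that shifted-in positions are padded with N are our assumptions here:

```python
import numpy as np

RC_TABLE = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def augment(seq, hic_map, max_shift=11, rng=np.random):
    """Stochastically shift the input sequence by up to +/-11 bp
    (padding with N), and randomly reverse complement the DNA
    while flipping the Hi-C map to match."""
    shift = rng.randint(-max_shift, max_shift + 1)
    if shift > 0:
        seq = "N" * shift + seq[:-shift]
    elif shift < 0:
        seq = seq[-shift:] + "N" * (-shift)
    if rng.rand() < 0.5:
        seq = seq.translate(RC_TABLE)[::-1]
        # Reverse complementing reverses bin order on both map axes.
        hic_map = hic_map[::-1, ::-1]
    return seq, hic_map
```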
We stopped training when validation loss had not improved for 12 epochs, and we took the model parameters that had achieved the minimum validation loss forward as the final model. We performed a search over learning rate, momentum, gradient norm clipping, dropout probability, and convolution filters using the Dragonfly Bayesian optimization toolkit (https://github.com/dragonfly/dragonfly)43.
Comparison with 1D features
For comparison to 1D features of the epigenome, we downloaded processed bigWigs for the relevant cell types from the ENCODE data portal31 and binned them into 2048 bp profiles.
In silico motif mutagenesis
To perform in silico motif mutagenesis, we intersect our test set regions with motif positions using bedtools44. We then generate multiple randomized sequences, replacing the DNA sequence at motif positions with randomly generated DNA sequences of the same length. We then calculate the average disruption as mean((pred − predΔmotif)²), and the change in signal as mean(pred²) − mean(predΔmotif²). Motif names were plotted with adjustText (https://github.com/Phlya/adjustText)45. Maps in Fig. 2 are shown as averages over 10 randomized sequences; JASPAR-wide analyses in Fig. 3a were averaged over 3 randomized sequences.
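The scoring can be sketched as below on predicted log(observed/expected) maps; function names are illustrative:

```python
import numpy as np

def permute_motifs(seq, motif_intervals, rng=np.random):
    """Replace each (start, end) motif interval with random DNA of
    the same length (one randomized copy; scores are averaged over
    several such copies)."""
    seq = list(seq)
    for start, end in motif_intervals:
        seq[start:end] = rng.choice(list("ACGT"), size=end - start)
    return "".join(seq)

def motif_scores(pred, pred_mut):
    """Disruption and signal-change scores from predicted
    log(observed/expected) maps before and after mutagenesis."""
    disruption = np.mean((pred - pred_mut) ** 2)
    signal_change = np.mean(pred ** 2) - np.mean(pred_mut ** 2)
    return disruption, signal_change
```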
In silico CTCF motif inversions
We determine motif intersections for in silico inversions as for motif mutagenesis. We then merge overlapping motifs and replace the sequences in these intervals with their reverse complements.
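A sketch of the inversion step on a string sequence (illustrative; interval merging is assumed to have already been applied):

```python
RC_TABLE = str.maketrans("ACGT", "TGCA")

def invert_motifs(seq, merged_intervals):
    """Replace each merged motif interval with its reverse
    complement, leaving flanking sequence unchanged."""
    seq = list(seq)
    for start, end in merged_intervals:
        sub = "".join(seq[start:end])
        seq[start:end] = sub.translate(RC_TABLE)[::-1]
    return "".join(seq)
```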
Predictions for mouse DNA sequences
To test the accuracy of Akita’s predictions for mouse DNA sequences, we obtained mESC Hi-C data from Bonev et al. 201726, mapped reads to mm10, and otherwise processed the data as for human datasets. Positions of B2-SINE elements were downloaded from the UCSC Genome Browser (RepeatMasker34). B2-SINE mutagenesis was performed as described for motifs.
5C data processing
To test Akita’s ability to predict experimentally induced deletions, we obtained processed 5C data for the Lmo2 locus from Hnisz et al. 201624, re-binned fragments to 2048 bp bins, and otherwise performed the same processing into log(observed/expected) maps as for Hi-C data above.
In silico deletions
As Akita makes predictions for a fixed input size, making a deletion in silico requires both removing the DNA sequence to be deleted and supplying the model with an equal amount of additional DNA sequence. Here we centered on the position of the deletion and symmetrically extended the start and end to maintain the size of the input.
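A sketch of this centering logic (a hypothetical helper; 'chrom_seq' stands for any string-indexable chromosome sequence):

```python
SEQ_LEN = 2**20  # fixed model input size

def delete_in_silico(chrom_seq, del_start, del_end):
    """Apply a deletion while keeping the input at SEQ_LEN: center
    the window on the deletion and symmetrically extend the flanks
    to compensate for the removed bases."""
    del_len = del_end - del_start
    center = (del_start + del_end) // 2
    # Window of SEQ_LEN + del_len bases centered on the deletion.
    start = center - SEQ_LEN // 2 - del_len // 2
    end = start + SEQ_LEN + del_len
    window = chrom_seq[start:end]
    # Excise the deleted interval in window coordinates.
    return window[:del_start - start] + window[del_end - start:]
```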
Supplemental Notes
Supplemental Note 1: lessons from previous architectures
Before arriving at the model described above, we considered various alternative designs in the space of possible model architectures and data pre-processing schemes.
Some of our early attempts at predicting 3D genome folding from sequence involved predicting slices of the Hi-C matrix (‘virtual-4C’) directly from the outputs of the trunk, with the idea that predicting a contact vector of length N could require fewer parameters in the ‘head’ than predicting a contact map of size NxN. While such virtual-4C models readily learned boundaries, we found they failed to learn sharp peaks. We also found that predictions of these models were often asymmetric (i.e., pred(i,j) ≠ pred(j,i)), likely because the virtual-4C architecture we considered did not encode a symmetry constraint.
We also tested the performance of models that replace the dilated convolution layers in the trunk with bidirectional LSTMs, popular layers for capturing long-range dependencies in natural language processing, while preserving roughly the same time per training epoch. This architecture performed slightly more poorly on both training and validation data, and we did not pursue it further. We also explored the utility of separable convolutions to reduce the number of parameters in the convolutional tower: we found little benefit in the rate of learning and a slight loss in accuracy.
Supplemental Note 2: differences with DeepC
In Schwessinger et al.30, the authors report successful predictions of Hi-C maps at 10kb resolution using a similar deep convolutional neural network approach, deepC. While deepC has a similar ‘trunk’ to Akita, it differs greatly in the architecture of the ‘head’, data pre-processing, and training schemes. First, deepC uses non-linearly quantile normalized targets, rather than log(observed/expected) targets, which may emphasize learning peaks rather than insulation. Second, deepC predicts a ‘zig-zag pole’ target directly with a dense layer from the output of their model’s trunk, which implicitly encodes distance but requires a large number of parameters, rather than predicting a dense patch of a map. Third, we focused on higher-resolution predictions (2048bp bins vs. 10kb bins). Finally, deepC currently requires pre-training a model on a large set of epigenomic profiles, and transferring weights to their full model. It is possible that strict transfer learning could limit the richness of representations that a deep CNN can learn for 3D genome folding; for example, a CTCF profile may not contain information about the directionality of motifs under its peaks, which is important for predicting genome folding.
Acknowledgements
The authors thank Vikram Agarwal, Han Yuan, and Elphege Nora for detailed feedback. GF and KSP were funded by Gladstone Institutes, the National Heart, Lung and Blood Institute (grant #HL098179), and the National Institute of Mental Health (grant #MH109907).