## Abstract

Transcription factors bind complex regulatory DNA sequence patterns in a combinatorial manner to modulate gene expression. Deep neural networks (DNNs) can learn these cis-regulatory grammars encoded in regulatory DNA sequences associated with transcription factor binding and chromatin accessibility. Several feature attribution methods have been developed for estimating the predictive importance of individual features (nucleotides or motifs) in any input DNA sequence to its associated output prediction from a DNN model. However, these methods do not reveal higher-order, epistatic feature interactions encoded by the models. We present a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence. DFIM accurately identifies ground truth motif interactions embedded in simulated regulatory DNA sequences. DFIM identifies synergistic interactions between GATA1 and TAL1 motifs from *in vivo* TF binding models. DFIM reveals epistatic interactions involving nucleotides flanking the core motif of the Cbf1 TF in yeast from *in vitro* TF binding models. We also apply DFIM to regulatory sequence models of *in vivo* chromatin accessibility to reveal interactions between regulatory genetic variants and proximal motifs of target TFs as validated by TF binding quantitative trait loci. Our approach makes significant strides in improving the interpretability of deep learning models for genomics.

## 1. Introduction

Genome-wide biochemical profiling experiments have revealed millions of putative regulatory elements in diverse cell states. These massive datasets have spurred the development of deep neural network (DNN) models to predict cell-type specific or context-specific molecular phenotypes such as TF binding, chromatin accessibility and gene expression from DNA sequence^{1–3}. Beyond high prediction accuracy, the primary appeal of DNNs is that they are capable of learning predictive sequence features and modeling non-linear feature interactions directly from raw DNA sequence without any prior assumptions. Hence, interpreting these purported black box models could reveal novel insights into the combinatorial regulatory code.

Advances in feature attribution methods for DNNs have enabled the identification of predictive cis-regulatory patterns in DNA sequences used as input to the models. Feature attribution methods estimate the contribution (or importance) of features, such as individual nucleotides or contiguous subsequences (e.g. motifs), in an input DNA sequence to a model’s output prediction. A perturbation-based, forward-propagation approach known as in-silico mutagenesis (ISM) quantifies the importance of a nucleotide in an input DNA sequence as the maximal change in the output prediction from the DNN model when the observed nucleotide (e.g. a G) at that position is mutated to any of the alternative bases (e.g. A, C or T). ISM has been used to score the effects of genetic variants in regulatory DNA sequences^{1–3}. However, ISM is computationally inefficient since each perturbation at every position in an input sequence requires a separate forward propagation to the output through the network. ISM also fails to highlight important features masked by saturation due to buffering interactions with other features (e.g. multiple motif instances in a sequence that buffer each other) ^{4}. SHAP is a perturbation-based feature attribution method that borrows from game theory^{5}. Max-Ent is a feature attribution method that uses a Markov chain Monte Carlo algorithm to find the maximum-entropy distribution of inputs that produced a similar hidden representation to the chosen input^{6}. While SHAP and Max-Ent show improved sensitivity and specificity relative to ISM, they do not scale efficiently to comprehensively characterize feature importance across millions of regulatory sequences. An alternative family of computationally efficient backpropagation approaches decompose the output prediction corresponding to an input sequence by recursively propagating contribution scores through the layers of the DNN from the output to the input. 
One single backpropagation pass provides the contribution of all nucleotides in an input DNA sequence to the output prediction. The gradient of the output with respect to each nucleotide in the input DNA sequence – known as a saliency map ^{7} – is one such estimate of importance and has been used to identify predictive nucleotides in regulatory DNA sequences. Other related approaches such as DeepLIFT ^{4} and integrated gradients ^{8} differ in the definition of the importance score that is backpropagated and provide improved sensitivity in the presence of saturation effects. DeepLIFT ^{4} has also been shown to be an efficient approximation of SHAP scores ^{5}.

Current feature attribution methods only provide the importance of individual features. They do not highlight predictive, higher-order feature interactions that are encoded in the DNN model. Perturbation-based approaches such as ISM cannot scale to comprehensively score all pairwise and higher-order interactions between nucleotides or subsequence features. Recently, an efficient algorithm was proposed to calculate SHAP-based pairwise feature interaction scores^{9} specifically from tree-based ensemble models. However, computing SHAP interactions from neural network models between all pairs of features in regulatory DNA sequences is computationally inefficient (Section 2.5) and cannot scale to reveal comprehensive interaction maps across millions of regulatory sequences.

Here, we present an efficient approach called Deep Feature Interaction Maps (DFIM) to estimate pairwise interactions between features (nucleotides or subsequences) in an input DNA sequence mapped to an associated regulatory phenotype by a neural network. We define a novel Feature Interaction Score (FIS) between any pair of features (source feature and target feature) in an input DNA sequence as the change in the importance score of the target feature when the source feature is perturbed, while keeping all the other features in the sequence intact. By leveraging efficient backpropagation-based feature attribution methods, we can efficiently compute FIS between all pairs of nucleotides or predictive motifs across large sets of input DNA sequences. Aggregate summary statistics of the pairwise Feature Interaction Score across multiple sequences provide insights into common, shared patterns of feature interactions. We benchmark DFIM in controlled simulations that explicitly encode motif interactions. We use DFIM to reveal synergistic interactions between GATA1 and TAL1 motifs from *in vivo* TF binding models. We apply DFIM to reveal epistatic interactions involving nucleotides flanking the core motif of the Cbf1 TF in yeast from *in vitro* TF binding models. We also apply DFIM to regulatory sequence models of *in vivo* chromatin accessibility to reveal interactions between regulatory genetic variants and proximal motifs of target TFs as validated by TF binding quantitative trait loci.

## 2. Methods

We assume that we have trained a deep neural network to accurately map one-hot encoded DNA sequences *X* of length *L* to a categorical (binary or multiclass classification) or continuous (regression) output *O*. Let *Y* refer to the scalar predicted output *O* from the neural network for regression tasks. For classification tasks, let *Y* refer to the scalar input to the final output sigmoid of the neural network.

### 2.1 Nucleotide-resolution Feature Interaction Score (FIS)

We are given a one-hot encoded input DNA sequence *X*_{0} ∈ {0,1}^{4 × L} i.e. a matrix of size [4, *L*] such that *X*_{0}[*b*, *p*] = 1 for the observed nucleotide *b* ∈ {*A*, *C*, *G*, *T*} at position *p* ∈ [1, *L*] (**Fig. 1**).

First, we compute *C*_{X0}, a matrix of size [4, *L*] that contains the **importance** (or contribution) of every nucleotide (rows) at each position in the sequence (**Fig. 1(Step 1)**). While our approach extends to any other efficient feature attribution method, for the analyses in this paper, we show results using both DeepLIFT ^{4} and gradient saliency maps ^{7} as importance scores. In **gradient-based saliency maps**, for a specific input sequence *X*_{0}, the output *Y*_{0} can be approximated by a first-order Taylor expansion *Y*_{0} ≈ Σ_{p, b} *w*_{0}[*b*, *p*] · *X*_{0}[*b*, *p*], where *w*_{0} is the partial derivative (gradient) of *Y* with respect to the input sequence variable *X* evaluated at *X*_{0} i.e. *w*_{0} = ∂*Y*/∂*X* |_{X = X0}. It is worth noting that the entire gradient matrix *w*_{0} can be computed efficiently in one backpropagation pass. We then perform a point-wise multiplication of the gradient matrix *w*_{0} with the one-hot encoded observed input sequence *X*_{0} to obtain the importance scores for the observed nucleotides *b* at each position *p* i.e. *C*_{X0}[*b*, *p*] = *w*_{0}[*b*, *p*] · *X*_{0}[*b*, *p*]. Only the observed nucleotides at each position can have non-zero values. **DeepLIFT contribution scores** quantify the sensitivity of the output to finite changes in the input^{4}. This is in contrast to gradients, which measure the sensitivity of the output to infinitesimal changes in the input. Specifically, the DeepLIFT algorithm backpropagates a score (analogous to gradients) which is based on comparing the activations of all the neurons in the network for the actual input sequence *X*_{0} to those obtained when using neutral ‘reference’ sequences^{4}. We use dinucleotide-shuffled versions of *X*_{0} as reference sequences unless otherwise specified.
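The gradient-times-input computation described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the quadratic `toy_output_and_gradient` model, its parameters `W` and `b`, and the helper names are hypothetical stand-ins for a trained DNN, whose gradient would normally come from one backpropagation pass in a deep learning framework.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a [4, L] matrix with rows ordered A, C, G, T."""
    x = np.zeros((4, len(seq)))
    for p, base in enumerate(seq):
        x[BASES.index(base), p] = 1.0
    return x

def toy_output_and_gradient(x, W, b):
    """Hypothetical stand-in model: Y = v^T W v + b^T v, with v = x flattened.

    Returns Y and dY/dX (shape [4, L]). A real DNN would obtain this
    gradient from one backpropagation pass instead of a closed form.
    """
    v = x.ravel()
    y = v @ W @ v + b @ v
    grad = (W + W.T) @ v + b
    return y, grad.reshape(x.shape)

def gradient_times_input(x0, W, b):
    """C[b, p] = w0[b, p] * X0[b, p]; zero for unobserved nucleotides."""
    _, w0 = toy_output_and_gradient(x0, W, b)
    return w0 * x0

rng = np.random.default_rng(0)
x0 = one_hot("ACGTGA")
W = rng.normal(size=(x0.size, x0.size))
b = rng.normal(size=x0.size)
C = gradient_times_input(x0, W, b)  # importance matrix of shape [4, 6]
```

Because the gradient is multiplied point-wise by the one-hot input, each column of `C` has at most one non-zero entry, namely the score of the observed nucleotide.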

Our goal is to query the neural network to estimate the interaction between the observed nucleotide at one position in the sequence (source feature) and the observed nucleotide at some other position (target feature) in the sequence. Let (*α*, *s*) represent the **source feature** i.e. the observed source nucleotide *α* ∈ {*A*, *C*, *G*, *T*} at a source position *s* such that *X*_{0}[*α*, *s*] = 1. Let (*β*, *t*) represent the **target feature** i.e. the observed target nucleotide *β* ∈ {*A*, *C*, *G*, *T*} at some target position *t* such that *X*_{0}[*β*, *t*] = 1.

Intuitively, we define the **Feature Interaction Score** *FIS*_{X0}((*β*, *t*) | (*α*, *γ*, *s*)) of the target feature with the source feature as the change in the importance score of the target feature (*β*, *t*) when the source feature (*α*, *s*) is mutated to a different nucleotide (*γ*, *s*). To compute FIS, we create a **new mutated sequence** *X*_{0}′ from *X*_{0} where we switch the observed nucleotide *α* at source position *s* to a different mutant nucleotide *γ* ∈ {*A*, *C*, *G*, *T*} ∖ {*α*}, while keeping the nucleotides at all other positions as they were in *X*_{0} (**Fig. 1(Step 2)**). We then compute the importance matrix *C*_{X0′} for *X*_{0}′ as we did for *X*_{0} (**Fig. 1(Step 3)**). The *FIS* of the target feature with the source feature is defined as

$$FIS_{X_0}\big((\beta, t) \mid (\alpha, \gamma, s)\big) = C_{X_0}[\beta, t] - C_{X_0'}[\beta, t]$$

Since only two backpropagation passes are required to compute *C*_{X0}[*β*, *t*] and *C*_{X0′}[*β*, *t*] for all *t* ∈ [1, *L*], we can efficiently compute the *FIS* of all target features *FIS*_{X0}(∗ | (*α*, *γ*, *s*)) in a sequence with a specific source feature mutation (**Fig. 1(Step 4)**). Note that the *FIS* is a directional interaction score of the target with the source. In some cases, we may only be interested in the magnitude of the score rather than its sign. In such cases, we use the **absolute value of the** *FIS*.
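As an illustration of the two-pass computation, the following sketch scores every target position for one source mutation. It assumes the sign convention "importance in the original sequence minus importance in the mutated sequence", uses gradient-times-input as the importance score, and replaces the trained network with a hypothetical quadratic toy model (`W`, `b`) in which only A at position 0 and C at position 2 interact; none of these names come from the DFIM codebase.

```python
import numpy as np

def importance(x, W, b):
    """Gradient-times-input for the toy model Y = v^T W v + b^T v."""
    v = x.ravel()
    grad = ((W + W.T) @ v + b).reshape(x.shape)
    return grad * x

def mutate(x, s, gamma):
    """Copy of x with position s (0-based) switched to base index gamma."""
    x_mut = x.copy()
    x_mut[:, s] = 0.0
    x_mut[gamma, s] = 1.0
    return x_mut

def fis_all_targets(x0, s, gamma, W, b):
    """FIS of every target position for the source mutation (s -> gamma).

    Entry t equals C_X0[beta_t, t] - C_X0'[beta_t, t], where beta_t is the
    nucleotide observed at t in the original sequence. Entry s itself is
    just the source's own importance and is usually ignored.
    """
    c_ref = importance(x0, W, b)                    # pass 1: original
    c_mut = importance(mutate(x0, s, gamma), W, b)  # pass 2: mutated
    return ((c_ref - c_mut) * x0).sum(axis=0)

# Toy example: sequence A, G, C, T (row indices 0=A, 1=C, 2=G, 3=T).
L = 4
x0 = np.zeros((4, L))
for p, base in enumerate([0, 2, 1, 3]):
    x0[base, p] = 1.0
W = np.zeros((4 * L, 4 * L))
W[0 * L + 0, 1 * L + 2] = 1.0  # couple (A, position 0) with (C, position 2)
b = np.zeros(4 * L)

fis_from_pos0 = fis_all_targets(x0, s=0, gamma=3, W=W, b=b)
fis_from_pos1 = fis_all_targets(x0, s=1, gamma=3, W=W, b=b)
```

In this toy setting, mutating position 0 removes the importance of the C at position 2, whereas mutating the non-interacting position 1 leaves all other importance scores unchanged.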

We define the **maximal Feature Interaction Score** (*maxFIS*) of the target feature with the source feature as the maximal *FIS* marginalized over all possible values of the mutant nucleotide *γ* at the source feature (*α*, *s*) i.e. *maxFIS*_{X0}((*β*, *t*) | (*α*, *s*)) = *max*_{γ} *FIS*_{X0}((*β*, *t*) | (*α*, *γ*, *s*)) (**Fig. 1(Step 5)**).

**A nucleotide-resolution Deep Feature Interaction Map (DFIM)** summarizes the *maxFIS* scores for all pairs of source and target features in an input DNA sequence (**Fig. 1(Step 6)**).
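A nucleotide-resolution map of this kind could be assembled as in the sketch below. It is illustrative only: it takes the signed maximum over the three mutant nucleotides (using the absolute value instead would be an equally plausible reading), defines FIS as original minus mutated importance, and again relies on a hypothetical quadratic toy model rather than a trained network.

```python
import numpy as np

def importance(x, W, b):
    """Gradient-times-input for the toy model Y = v^T W v + b^T v."""
    v = x.ravel()
    return ((W + W.T) @ v + b).reshape(x.shape) * x

def dfim(x0, W, b):
    """Return an [L, L] map with maxFIS[s, t] (rows: source, cols: target)."""
    n_pos = x0.shape[1]
    c_ref = importance(x0, W, b)
    out = np.full((n_pos, n_pos), -np.inf)
    for s in range(n_pos):
        alpha = int(np.argmax(x0[:, s]))        # observed source base
        for gamma in range(4):
            if gamma == alpha:
                continue
            x_mut = x0.copy()
            x_mut[:, s] = 0.0
            x_mut[gamma, s] = 1.0
            c_mut = importance(x_mut, W, b)
            fis = ((c_ref - c_mut) * x0).sum(axis=0)
            out[s] = np.maximum(out[s], fis)    # max over mutant bases
    return out

# Toy example: sequence A, G, C, T with one encoded pairwise interaction.
L = 4
x0 = np.zeros((4, L))
for p, base in enumerate([0, 2, 1, 3]):
    x0[base, p] = 1.0
W = np.zeros((4 * L, 4 * L))
W[0 * L + 0, 1 * L + 2] = 1.0  # (A, position 0) interacts with (C, position 2)
b = np.zeros(4 * L)
M = dfim(x0, W, b)
```

The loop performs three mutated passes per source position, so the number of importance computations still grows linearly with sequence length.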

### 2.2 Aggregate statistics of nucleotide-resolution FIS over multiple input sequences

In order to analyze the prevalence of the *FIS* between a source position *s* and target position *t* across a collection of input sequences {*X*_{i}}, we first identify the subset of sequences *S* = {*X*_{i}} that have identical source nucleotides at the source position and identical target nucleotides at the target position i.e. ∀ *X*_{i}, *X*_{j} ∈ *S*, *X*_{i}[*α*, *s*] = *X*_{j}[*α*, *s*] = 1 AND *X*_{i}[*β*, *t*] = *X*_{j}[*β*, *t*] = 1. We then compute aggregate statistics such as the mean of the FIS or absolute FIS corresponding to each ((*β*, *t*) | (*α*, *γ*, *s*)) over all sequences in the subset *S* (see **Fig. 8** as an example).
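The subsetting and averaging step can be sketched as follows. The function name `aggregate_fis` and its arguments are hypothetical, and the sketch assumes that a per-sequence FIS value for the chosen (*β*, *t*) and (*α*, *γ*, *s*) has already been computed and stored in `fis_values`.

```python
import numpy as np

def aggregate_fis(seqs, fis_values, s, alpha, t, beta, use_abs=False):
    """Mean (absolute) FIS over sequences sharing alpha at s and beta at t.

    seqs:       list of equal-length DNA strings
    fis_values: one precomputed FIS value per sequence
    Returns (mean, number of sequences in the subset S).
    """
    mask = np.array([seq[s] == alpha and seq[t] == beta for seq in seqs])
    vals = np.asarray(fis_values, dtype=float)[mask]
    if use_abs:
        vals = np.abs(vals)
    mean = float(vals.mean()) if vals.size else float("nan")
    return mean, int(mask.sum())

seqs = ["ACGT", "ACGA", "TCGT", "ACTT"]
fis_values = [1.0, -3.0, 5.0, 7.0]
mean_fis, n = aggregate_fis(seqs, fis_values, s=0, alpha="A", t=2, beta="G")
mean_abs_fis, _ = aggregate_fis(seqs, fis_values, 0, "A", 2, "G", use_abs=True)
```

Only the first two sequences carry A at position 0 and G at position 2, so the signed mean and the absolute mean are taken over those two values alone.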

### 2.3 Motif-resolution Feature Interaction Score

We are often interested in the FIS of a specific target motif {(*β*_{p}, *t*_{p}), …, (*β*_{q}, *t*_{q})} i.e. a specific subsequence of nucleotides {*β*_{p} … *β*_{q}} at a specific subset of contiguous target positions {*t*_{p} … *t*_{q}} with a source nucleotide-resolution feature (*α*, *s*) (i.e. specific source nucleotide at specific source position) such as a regulatory single nucleotide variant (SNV). In such a case, we compute the *FIS* of a target motif with a source nucleotide feature as the difference of the sum of importance scores across all target nucleotides {(*β*_{p}, *t*_{p}), …, (*β*_{q}, *t*_{q})} in the target motif in the original sequence *X*_{0} and the mutated sequence (obtained by mutating (*α*, *s*) in *X*_{0} to (*γ*, *s*)).

To compute the *FIS* of a target motif {(*β*_{p}, *t*_{p}), …, (*β*_{q}, *t*_{q})} with a source motif {(*α*_{k}, *s*_{k}), …, (*α*_{ℓ}, *s*_{ℓ})} (see **Fig. 3** as an example), we use a different source mutation method. One option would be to use the maximal *FIS* of the target motif over all possible single nucleotide mutations of each position in the source motif. However, this procedure is computationally infeasible for long motifs. We instead generate one mutant sequence *X*_{0}′, where we mutate the one-hot encoding (where rows 1-4 correspond to A, C, G, T) of all positions {*s*_{k} … *s*_{ℓ}} in the source motif to the expected background GC nucleotide frequency *f*_{GC} i.e. the mutant sequence has *X*_{0}′[(2,3), *s*] = *f*_{GC}/2 and *X*_{0}′[(1,4), *s*] = (1 − *f*_{GC})/2 for every source position *s*. The *FIS* of the target motif with the source motif is once again the difference of the sum of importance scores across all target nucleotides {(*β*_{p}, *t*_{p}), …, (*β*_{q}, *t*_{q})} in the target motif feature between the original sequence *X*_{0} and the mutated sequence.
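The motif-to-motif variant can be sketched as below. Several details are assumptions made for illustration: the background column assigns *f*_{GC}/2 to each of C and G and (1 − *f*_{GC})/2 to each of A and T, FIS is taken as original minus mutated importance, gradient-times-input serves as the importance score, and a hypothetical quadratic toy model stands in for the trained network.

```python
import numpy as np

def importance(x, W, b):
    """Gradient-times-input for the toy model Y = v^T W v + b^T v."""
    v = x.ravel()
    return ((W + W.T) @ v + b).reshape(x.shape) * x

def background_mutate(x, positions, f_gc):
    """Replace the one-hot columns at `positions` by background frequencies."""
    column = np.array([(1 - f_gc) / 2, f_gc / 2, f_gc / 2, (1 - f_gc) / 2])
    x_mut = x.copy()
    x_mut[:, positions] = column[:, None]   # rows ordered A, C, G, T
    return x_mut

def motif_fis(x0, source_positions, target_positions, f_gc, W, b):
    """Summed importance change over the target motif's observed bases."""
    c_ref = importance(x0, W, b)
    c_mut = importance(background_mutate(x0, source_positions, f_gc), W, b)
    diff = (c_ref - c_mut) * x0             # keep observed nucleotides only
    return float(diff[:, target_positions].sum())

# Toy example: sequence A, G, C, T; (A, pos 0) interacts with (C, pos 2).
L = 4
x0 = np.zeros((4, L))
for p, base in enumerate([0, 2, 1, 3]):
    x0[base, p] = 1.0
W = np.zeros((4 * L, 4 * L))
W[0 * L + 0, 1 * L + 2] = 1.0
b = np.zeros(4 * L)

fis_interacting = motif_fis(x0, [0, 1], [2, 3], f_gc=0.46, W=W, b=b)
fis_control = motif_fis(x0, [1], [2, 3], f_gc=0.46, W=W, b=b)
```

Only one mutated pass is needed per source motif regardless of motif length, which is the motivation given in the text for preferring this scheme over exhaustive single-nucleotide mutation.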

### 2.4 Statistical significance of FIS

Given a continuous distribution of *FIS* across a collection of input sequences, we define statistically significant interactions based on an empirical null distribution of scores from dinucleotide-shuffled versions of the input sequences. For each dinucleotide-shuffled input sequence, we compute *FIS* for all nucleotide pairs. We fit a Gaussian distribution to this empirical null distribution of *FIS* scores. *FIS* values with a *p*-value below 0.05 with respect to this null distribution are considered statistically significant. We use the Benjamini-Hochberg procedure for multiple hypothesis correction. **SFig. 1 (bottom row)** demonstrates how the null model can be used to identify responding motifs in the context of a longer sequence.
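The significance procedure can be sketched with NumPy and the standard library. The text does not state whether the test is one- or two-sided, so the two-sided form used here is an assumption, as are the function names; the dinucleotide shuffling that produces `null_fis` is not shown.

```python
import math
import numpy as np

def gaussian_null_pvalues(fis, null_fis):
    """Two-sided p-values of `fis` under a Gaussian fitted to `null_fis`."""
    mu = np.mean(null_fis)
    sigma = np.std(null_fis, ddof=1)        # sample standard deviation
    z = (np.asarray(fis, dtype=float) - mu) / sigma
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity starting from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

null_fis = np.array([-1.0, 0.0, 1.0])       # toy null scores (mean 0, sd 1)
pvals = gaussian_null_pvalues([0.0, 1.959964], null_fis)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = adjusted < 0.05
```

In practice the null would contain FIS values from many shuffled sequences rather than three toy numbers.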

### 2.5 Comparison of DFIM to SHAP Interaction Scores and pairwise ISM interaction scores

For an input sequence with *F* features (nucleotides/motifs), SHAP interaction scores scale quadratically to compute all pairwise interactions giving a complexity of *O*(*F*^{2})^{9}. A pairwise ISM-based interaction score (Supp. Methods), defined as the difference between the ISM score obtained by jointly mutating two features and the sum of the ISM scores of individual features, also has a complexity of *O*(*F*^{2}). For DFIM, we require one backpropagation pass to obtain importance scores for the original sequence. Then for each of the *F* source features, we need one more backpropagation pass to obtain *FIS* of that source with ALL target features. Thus, DFIM exhibits a complexity of *O*(*F*) scaling linearly in the number of features. Our proposed FIS is essentially an efficient approximation of SHAP interaction scores. Further, in contrast to SHAP interaction scores and pairwise ISM interaction scores which are necessarily symmetric over the source and target, FIS is directional and can produce asymmetric interaction scores.
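The scaling argument can be made concrete by counting network passes. The counting scheme below is a simplification assumed for illustration: one mutation per feature, one joint mutation per unordered pair for pairwise ISM, and one reference pass in both cases.

```python
def dfim_backprop_passes(num_features, mutations_per_source=1):
    """One pass for the original sequence plus one per source mutation."""
    return 1 + num_features * mutations_per_source

def pairwise_ism_forward_passes(num_features):
    """Reference pass, all single mutations, and all joint pair mutations."""
    singles = num_features
    pairs = num_features * (num_features - 1) // 2
    return 1 + singles + pairs

F = 200  # e.g. every nucleotide of a 200 bp sequence treated as a feature
dfim_cost = dfim_backprop_passes(F)
dfim_cost_all_mutants = dfim_backprop_passes(F, mutations_per_source=3)
ism_cost = pairwise_ism_forward_passes(F)
```

Even when all three alternative bases are tried at every source position, the DFIM count stays linear in *F*, while the pairwise ISM count grows quadratically.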

## 3. Results

### 3.1 Benchmarking FIS on ground-truth motif interactions embedded in simulated data

To benchmark *FIS*, we simulated 60K random DNA sequences (0.46 G/C frequency) of length 200 bp. We divided these into 3 sets of 20K sequences. We randomly embedded 1 or 2 instances of the ELF1 motif (using the highest affinity sequence from the Position Weight Matrix ^{10}) in the sequences in Set 1, 1 or 2 instances of the SIX5 motif ^{10} in Set 2 and 1 or 2 instances of both ELF1 **and** SIX5 motifs in Set 3. We further independently embedded 0 or 1 instances of the AP1 and TAL1 motifs in a random subset of sequences across all 3 sets^{10} (Supp. Methods). We then set up a binary classification task where all sequences in Set 3 (ELF1 and SIX5) were labeled as positive and all other sequences from Sets 1 and 2 were labeled as negatives (**Fig. 2A**). We trained a Convolutional Neural Network (CNN) with one convolutional and one dense layer (Supp. Methods). We achieved 100% classification accuracy on a held-out validation set of sequences, indicating that the model had learned the necessary interaction between ELF1 and SIX5. We computed motif-resolution *FIS* (Section 2.3) for all pairs of embedded motif instances (SIX5, ELF1, AP1 and TAL1) for all sequences in the positive class (i.e. Set 3). We used DeepLIFT with a fixed GC reference for computing importance scores since the underlying sequences were generated using a fixed GC background. Only pairs of SIX5 and ELF1 motifs (positive control) showed strong *FIS* (**Fig. 2B, green distribution**) compared to all other pairs of motifs (negative controls), demonstrating that *FIS* can effectively discriminate ground truth interactions learned by a neural network. We further assessed the significance of these interactions using an empirical null distribution from dinucleotide-shuffled sequences (Section 2.4) and found that the vast majority of true ELF1-SIX5 interactions are significant (*p* < 0.05), even after multiple hypothesis correction. 
None of the other motif pairs show statistically significant interactions (**SFig. 2A,B**). The results are replicated using gradient saliency maps as importance scores (**SFig. 2C,D**).
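A simplified version of this simulation design can be sketched as follows. The motif strings are placeholders rather than the actual ELF1 and SIX5 sequences, exactly one instance of each motif is embedded, the AP1/TAL1 distractors are omitted, and a later embedding may overwrite an earlier one; all function names are hypothetical.

```python
import numpy as np

MOTIF_A = "CCGGAAGT"  # placeholder string, NOT the actual ELF1 sequence
MOTIF_B = "GGGTATCA"  # placeholder string, NOT the actual SIX5 sequence

def random_sequence(length, gc, rng):
    """Random DNA string with total G/C frequency `gc`."""
    probs = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choice(list("ACGT"), size=length, p=probs))

def embed(seq, motif, rng):
    """Overwrite a random window of `seq` with `motif`; return (seq, start)."""
    start = int(rng.integers(0, len(seq) - len(motif) + 1))
    return seq[:start] + motif + seq[start + len(motif):], start

def simulate(n_per_set, length=200, gc=0.46, seed=0):
    """Three sets: motif A only, motif B only, both (label 1)."""
    rng = np.random.default_rng(seed)
    seqs, labels = [], []
    for motifs in ([MOTIF_A], [MOTIF_B], [MOTIF_A, MOTIF_B]):
        for _ in range(n_per_set):
            seq = random_sequence(length, gc, rng)
            for motif in motifs:
                seq, _ = embed(seq, motif, rng)
            seqs.append(seq)
            labels.append(int(len(motifs) == 2))
    return seqs, labels

seqs, labels = simulate(n_per_set=5)
```

Labeling only the double-motif set as positive forces a classifier to learn the conjunction of the two motifs rather than either motif alone.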

### 3.2 Uncovering epistatic motif interactions of co-binding TFs from CNN models of *in vivo* TF binding

We analyzed CNN models of *in vivo* TF binding to investigate epistatic interactions between motifs of co-binding TFs. We trained a multi-task CNN model to classify 1 kbp sequences centered at GATA1, GATA2 and TAL1 ENCODE ChIP-seq peaks (positive class) in erythroid K562 cells from all other chromatin accessible DNase-seq peaks in K562 (negative class) ^{11}. The CNN model with 5 convolutional layers (25 convolutions, size 10), a max pooling layer (size 25) and a sigmoid activation (Supp. Methods) achieved a mean auROC of 0.953 and a mean auPRC of 0.459 across all three tasks on the held-out test set. Next, we identified all matches to the known motifs of GATA1 and TAL1 in all ChIP-seq peak sequences (Supp. Methods). We then computed motif-resolution *FIS* (using DeepLIFT with shuffled reference as importance scores) for all pairs of GATA1, TAL1 motif instances across all sequences using GATA1 as the source motif. We observed several instances with strong *FIS* between proximal GATA1 and TAL1 motifs, which corroborates their experimentally validated co-binding interactions^{12} (**Fig. 3A**). To understand the relationship between the distance between motif instances and their interaction scores, we binned GATA1 and TAL1 motif pairs into 4 distance bins: within 20bp (n=13,004), 20-50bp (n=18,898), 50-100bp (n=28,684), and 100-200bp (n=21,1154). We compared the distribution of *FIS* for the motif pairs across the bins. As expected, TAL1 and GATA1 motifs in close proximity (<20 bp) showed significantly higher interaction scores than all three other bins (*p*<1e-16, Mann Whitney test for all 3 comparisons) (**Fig. 3B**). Interestingly, however, we observed some strong long-range interactions between motifs as far as 70 bp apart (**Fig. 3B**), an observation corroborated by a recent analysis of SNP effects on TAL1 ChIP-seq signal in erythroid cells that found that GATA1 motif mutations impact TAL1 binding at distances as great as 75 bp^{13}. 
The interactions were also symmetric, such that mutating TAL1 produced a similar distribution of *FIS* on GATA1 (**SFig. 4**).

### 3.3 Discovering interactions between regulatory variants and their target TF motifs from CNN models of *in vivo* chromatin accessibility

DNNs mapping regulatory DNA sequences to TF binding and chromatin accessibility have been previously used to score the predicted in-silico allelic effects of putative regulatory genetic variants based on ISM^{1–3}. Here, we instead use *FIS* to investigate an orthogonal question – What proximal sequence features are affected by (interact with) regulatory genetic variants? Tehranchi *et al.* developed a pooling-based approach to identify thousands of SNVs that have allelic effects on TF binding (as measured by ChIP-seq) across a large collection of genotyped lymphoblastoid human cell-lines^{14}. They provide coordinates, effect sizes, reference/alternative alleles and the allele with stronger binding for statistically significant binding QTLs (bQTLs) and non significant background SNVs in ChIP-seq peaks for JUND, NFKB, SPI1, STAT1 and POU2F1. This dataset provides an excellent resource to investigate the feature interactions of bQTLs. Further, we wondered if we could discover bQTL feature interactions for different TFs from a single DNN model trained to predict chromatin accessibility (instead of TF binding) from sequence.

Hence, we trained a multi-task (18 tasks) CNN model to map 1kbp length DNA sequences to binary chromatin accessibility profiles across 16 primary hematopoietic cell types (with ATAC-seq data) and 2 ENCODE cell-lines (with DNase-seq data) including the GM12878 lymphoblastoid cell-line (LCL) (Supp. Methods). The model achieved high performance on the test set (average auPRC = 0.69, auROC=0.91). We used the LCL task to investigate bQTL feature interactions using *FIS* (DeepLIFT with shuffled reference as importance score). We restricted our analysis to the statistically significant (allelic binding *p*<5e-05 as recommended by Tehranchi *et al.*) bQTLs that overlapped the DNase-seq peaks in GM12878.

To understand proximal interactions, for each bQTL, we used *FIS* to estimate the effect of mutating the reference allele to all alternate alleles at the source QTL on every target nucleotide +/− 15 bp around the QTL. First, we observed strong positive (**Fig. 4-left**) and negative (**Fig. 4-right**) interactions of bQTLs with nucleotides of overlapping target TF motifs. The direction of the allelic effect (stronger or weaker ChIP-seq signal) of the reference and alternate bQTL alleles on TF binding also matched the predicted direction of change (stronger or weaker motif score). For example, a significant JUND bQTL at chr22:42925130 falls in a high affinity JUND binding motif (**Fig. 5A**). The reference A allele has higher binding than the alternative G allele with *p*-value 1.71e-140 in the Tehranchi *et al.* study. *FIS* predicts that the G allele (weaker allelic binding) but not the A allele (stronger allelic binding) will destroy the importance of the entire JUND motif.

Next, we also found several TF-bQTLs in the flanking nucleotides of weak affinity motif matches of the target TF that have significant interaction effects with the entire motif. For example, a significant SPI1 bQTL at chr1:94169843 has reference allele T (with stronger binding) and alternate allele C. The bQTL is in the flanking nucleotides of a low affinity SPI1 site where only the core “GGAA” matches the canonical motif. *FIS* predicts that the C allele (weaker binding) destroys the importance scores of the core GGAA element (**Fig. 5B, Top**). Tehranchi *et al.* and several other studies have reported that a large fraction (70-90%) of QTLs do not overlap high affinity instances of canonical TF motifs. We hypothesize that several of these QTLs may be affecting flanking nucleotides of weak affinity matches of TF motifs. Finally, while most bQTLs with statistically significant *FIS* exhibit the maximal absolute interaction with other nucleotides within 10 bp of the bQTL, we also observe strong and significant longer-range interactions at distances ranging from 20-200 bp (**Fig. 6A**). For example, an SPI1 bQTL has a significant interaction with a proximal SP1 motif but also a strong interaction with a RUNX1 motif 20 bp away (**Fig. 6B**). SPI1 QTLs were also found to affect motifs 100s of base pairs away (**SFig. 5**).

As a negative control, for each TF, we also evaluated the *FIS* of a matched number of conservative control SNVs from the Tehranchi *et al.* study that overlap the TF’s ChIP-seq peaks and LCL DNase-peaks with least significant allelic effects on binding (allelic binding *p* ≈ 1). For each bQTL and control SNV, we recorded its maximal absolute *FIS* (maxAbsFIS) over all target nucleotides +/-15 bp around the SNV. For all the TFs, we found that the bQTLs exhibit significantly (Mann Whitney test) stronger maxAbsFIS than control SNVs (**Fig. 7**), indicating that *FIS* may be an alternative approach to ISM to identify putative regulatory variants.
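The maxAbsFIS summary for a single variant can be sketched as below, given a vector of FIS values over target positions for that variant. Whether the variant's own position is excluded from the window is not stated in the text, so the `skip_source` flag (default: exclude) is an assumption, as is the function name.

```python
import numpy as np

def max_abs_fis(fis_by_target, snv_pos, flank=15, skip_source=True):
    """Largest |FIS| among targets within `flank` bp of the variant."""
    fis = np.abs(np.asarray(fis_by_target, dtype=float))
    lo = max(0, snv_pos - flank)
    hi = min(fis.size, snv_pos + flank + 1)
    window = fis[lo:hi].copy()
    if skip_source:
        window[snv_pos - lo] = 0.0  # ignore the variant's own position
    return float(window.max())

fis = np.zeros(100)
fis[50] = 9.0    # the variant itself
fis[40] = -2.0   # target 10 bp upstream
fis[70] = 5.0    # target 20 bp downstream, outside the default window
score = max_abs_fis(fis, snv_pos=50)
```

Comparing the distribution of such scores between bQTLs and matched control SNVs, for example with a Mann-Whitney test, corresponds to the analysis described above.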

### 3.4 Discovering interactions between nucleotides flanking the core sequence motif of the Cbf1 TF in yeast from *in vitro* binding DNN models

Paralogous TFs have recently been shown to have distinct sequence affinity preferences for nucleotides flanking the core canonical binding motifs. Le and Shimko *et al.* recently developed a microfluidics-based *in vitro* TF binding assay called BET-seq to investigate this question^{15}. They used the BET-seq assay to measure high-resolution *in vitro* binding affinity landscapes of the yeast TFs Cbf1 and Pho4 to a high complexity library of > 1 million DNA sequences with a fixed central core E-box sequence (CACGTG) and 5 variable flanking nucleotides on either side. They trained a feed-forward neural network to predict relative binding affinity (ΔΔ*G*) for each of the TFs from the 10bp flanking sequences (using a flattened one-hot encoding) in the library^{15} (**Fig. 8A**). The model architecture consisted of 3 dense layers of sizes 500, 500 and 250 with ReLU activation followed by batch normalization and dropout (p=0.25), with a final dense output layer having a linear activation. They used a distillation approach to interpret the NN model by fitting a linear model with all mononucleotide features across all positions and all dinucleotide features across all pairs of positions to the output predictions of the NN. They found that dinucleotide features were critical for the linear model to have a good fit (*r*^{2} > 0.95), especially for Cbf1^{15}. They then estimated the contributions of all pairwise interaction terms by comparing the mononucleotide+dinucleotide linear model to a mononucleotide-only linear model. Cbf1 was found to exhibit significant interactions between several pairs of flanking nucleotides^{15}.

We instead used DFIM to directly query the Cbf1 neural network model and estimate pairwise nucleotide-resolution *FIS* between all pairs of nucleotides at all positions for all sequences in the library (**Fig. 8B**). We compute aggregate statistics (mean) of the absolute nucleotide-resolution FIS (Section 2.2) for all pairs of nucleotide features across the 5,000 sequences with strongest binding affinity (lowest measured ΔΔG). We obtain four (40 × 40) DFIMs where each map corresponds to one of the 4 bases {A,C,G,T} as the observed source nucleotide. The rows in each 40 × 40 map correspond to 4 mutant bases × 10 source positions, while the columns correspond to 4 target bases × 10 target positions. To ease interpretation, we compute a marginalized 40 × 40 DFIM that records, for each source base, source position, target base and target position, the maximal average *FIS* score over the 3 potential mutant bases for that source base (**Fig. 8C**). We observe that the marginalized aggregate DFIM for the high binding affinity sequences exhibits several strong interactions between flanking nucleotides (**Fig. 8C**). The map corroborates several of the strongest interactions identified by Le and Shimko using the distillation approach, such as the strong interaction between a T at the −1 position and an A at the +1 position. Our maps also identify novel interactions such as a strong interaction between T at −1 and T at +2. In contrast, the aggregate DFIMs across 5,000 sequences with weakest binding affinity (highest measured ΔΔG) exhibit uniformly weak interaction scores (**SFig. 6**).
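The marginalization over mutant bases can be sketched as an array reduction. The five-axis layout of `mean_fis`, the NaN convention for undefined entries (mutant base equal to source base), and the base-major ordering of rows and columns in the flattened map are assumptions made for this illustration.

```python
import numpy as np

def marginalize_over_mutants(mean_fis):
    """Collapse the mutant-base axis by taking the maximum.

    mean_fis[a, g, s, b, t] holds the mean FIS for source base a mutated
    to g at source position s, read out at target base b and target
    position t. Entries with g == a are undefined and expected to be NaN.
    Returns a (4 * n_pos) x (4 * n_pos) map indexed by (a, s) and (b, t).
    """
    n_base, _, n_pos, _, _ = mean_fis.shape
    best = np.nanmax(mean_fis, axis=1)  # shape [a, s, b, t]
    return best.reshape(n_base * n_pos, n_base * n_pos)

rng = np.random.default_rng(0)
mean_fis = rng.random((4, 4, 10, 4, 10))
for a in range(4):
    mean_fis[a, a] = np.nan             # a base cannot mutate to itself
M = marginalize_over_mutants(mean_fis)  # marginalized 40 x 40 map
```

Row `a * 10 + s` and column `b * 10 + t` of `M` then hold the strongest average interaction of source base `a` at position `s` with target base `b` at position `t`.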

## 4. Discussion

We present an efficient method called Deep Feature Interaction Maps (DFIM) to identify epistatic interactions between all pairs of nucleotides or motif features in any DNA sequence input to a deep learning model for regulatory genomics. Our method accurately recovers ground truth interacting motifs in simulated regulatory DNA sequences. When applied to deep learning models of *in vivo* TF binding, we recover known proximal interactions between motifs of interacting co-factors while also discovering long-range interactions between motifs as far as 75 bp apart. We interpret deep learning models trained on *in vitro* TF binding to discover extensive interactions between pairs of nucleotides in sequences flanking core TF binding motifs. Finally, we interpret deep learning models of *in vivo* chromatin accessibility to generate nucleotide-resolution interaction maps for non-coding regulatory sequences surrounding SNVs (bQTLs) that affect binding of transcription factors. Our maps link binding QTLs to nearby sequence features including high and low affinity matches to the canonical binding site of the TF whose binding is disrupted. We also find bQTLs interacting with motifs of multiple co-binding TFs. These epistatic interactions seem to capture both cooperation and competition. While our primary focus in this manuscript is on interpreting feature interactions in DNA sequence inputs, DFIM can easily be generalized to other data modalities.

Partial dependence plots are commonly used to understand the sensitivity of a prediction to one or more features ^{16}. DFIM serves as a complementary approach to understand the predictive higher-order, non-linear interactions between features. DFIM is most efficient when estimating all pairwise interactions between predetermined features such as known binding sites or SNVs, or a sparse set of de novo discovered predictive features with significant importance scores. However, DFIM also scales well to estimate interactions between all nucleotides in large sets of sequences because it leverages efficient backpropagation-based feature attribution methods. While DFIM is generally compatible with any efficient feature attribution method, we have not evaluated our approach on all such methods. However, we have found overall strong replication of DFIM results and associated conclusions using two separate importance scores, namely DeepLIFT and gradient saliency maps (**SFig. 2, SFig. 3**). This suggests that DFIM could generalize to other importance scoring approaches.

There are several potential caveats to using DFIM including some that are independent of the methodology itself. For example, DFIM will only work well with a high performance deep learning model that has been appropriately trained, with a suitable architecture, and if the importance scoring methods appropriately capture salient features. DFIM depends on these two cornerstones and so will only perform as well as the model and the importance scores. Changes in model architecture can also change the interactions encoded by the model and thus the interactions learned with DFIM. In addition, DFIM is limited to detecting interactions within the same input sequence (i.e. for a model with a 1 kbp input, the interactions must be within that 1kbp). Despite these mentioned caveats, the case studies we present here showcase the utility of DFIM to provide a nuanced view into the combinatorial code of regulatory DNA sequences through the lens of predictive neural network models.

### Code

We have made code for this project publicly available at: https://github.com/kundaielab/dfim. We include Jupyter Notebooks to demonstrate several use cases.

## Acknowledgements

We would like to thank Johnny Israeli and Nathan Boley for their help training a TAL1, GATA1, GATA2 transcription factor binding model. We would also like to thank Avanti Shrikumar for discussions during early development of the methodology. PG was supported by a BioX Stanford Interdisciplinary Graduate Fellowship (SIGF). AK was supported by NIH grants 1DP2GM123485, 1U01HG009431 and 1R01HG00967401.

## Footnotes

Contact: pgreens{at}stanford.edu, akundaje{at}stanford.edu