FactorNet: a deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data

Because of the large number of transcription factors (TFs) and cell types, experimentally querying the binding profiles of all TF/cell type pairs is infeasible given constraints in time and resources. To address this issue, we developed a convolutional-recurrent neural network model, called FactorNet, to computationally impute the missing binding data. FactorNet trains on binding data from reference cell types to make accurate predictions on testing cell types by leveraging a variety of features, including genomic sequences, genome annotations, gene expression, and single-nucleotide resolution sequential signals, such as DNase I cleavage. To the best of our knowledge, this is the first deep learning method to study the rules governing TF binding at such a fine resolution. With FactorNet, a researcher can perform a single sequencing assay, such as DNase-seq, on a cell type and computationally impute dozens of TF binding profiles. This is an integral step toward reconstructing the complex networks underlying gene regulation. While neural networks can be computationally expensive to train, we introduce several novel strategies to significantly reduce the overhead. By visualizing the neural network models, we can interpret how the model predicts binding, which in turn reveals additional insights into regulatory grammar. We also investigate the variables that affect cross-cell type predictive performance to explain why the model performs better on some TF/cell type pairs than others, and offer insights for improving performance in this field. Our method ranked among the top four teams in the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge.


Introduction
... Real-valued single-nucleotide signal values are concatenated as extra rows to this matrix. A convolution layer with rectifier activations transforms the input matrix into an output matrix with a row for each convolution kernel and a column for each position in the input (the input length minus the kernel width, plus one). Each kernel is effectively a sequence motif. Max pooling downsamples the output matrix along the spatial axis, preserving the number of channels. The subsequent recurrent layer contains long short-term memory (LSTM) units connected end-to-end in both directions to capture spatial dependencies between motifs. Recurrent outputs are densely connected to a layer of rectified linear units. These activations are likewise densely connected to a sigmoid layer that applies a nonlinear transformation to yield a vector of probability predictions of the TF binding calls. An identical network, sharing the same weights, is also applied to the reverse complement of the sequence (bottom). Finally, the respective predictions from the forward and reverse complement sequences are averaged together, and these averaged predictions are compared via a loss function to the true target vector. Although not pictured, we also include a sequence-distributed dense layer between the convolution and max pooling layers to capture higher-order motifs.
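To make the shapes concrete, the following NumPy sketch (not the actual FactorNet implementation, which uses Keras) walks a toy input through a rectified convolution and a max pooling stage; the kernel count, kernel width, and pooling size here are arbitrary choices for illustration.

```python
import numpy as np

def conv_relu(x, kernels):
    """Valid 1-D convolution of a (channels, L) input with a bank of
    (n_kernels, channels, k) kernels, followed by a rectifier."""
    n_kernels, channels, k = kernels.shape
    L = x.shape[1]
    out = np.zeros((n_kernels, L - k + 1))
    for i in range(L - k + 1):
        window = x[:, i:i + k]                  # (channels, k) slice
        out[:, i] = (kernels * window).sum(axis=(1, 2))
    return np.maximum(out, 0.0)                 # rectifier (ReLU)

def max_pool(x, pool):
    """Downsample along the spatial axis, preserving the channel axis."""
    n, L = x.shape
    L_out = L // pool
    return x[:, :L_out * pool].reshape(n, L_out, pool).max(axis=2)

# Toy input: 4 one-hot DNA rows plus 1 real-valued signal row, length 20.
rng = np.random.default_rng(0)
x = rng.random((5, 20))
kernels = rng.standard_normal((8, 5, 6))   # 8 motif detectors of width 6
h = conv_relu(x, kernels)                  # shape (8, 15): L - k + 1 columns
p = max_pool(h, 3)                         # shape (8, 5): spatial downsampling
```

Note how the convolution output has one column per valid position (20 - 6 + 1 = 15) and max pooling shrinks only the spatial axis, matching the description above.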

To predict cell type-specific TF binding, we developed FactorNet, which combines elements of the aforementioned algorithms. FactorNet trains a DNN on data from one or more reference cell types for which the TF or TFs of interest have been profiled, and this model can then predict binding in other cell types. The FactorNet model builds upon the DanQ CNN-RNN hybrid architecture by including additional real-valued coordinate-based signals, such as DNase-seq signals, as features. FactorNet is similar to a recently developed method called DeepCpG, which integrates sequence context and neighboring methylation rates to predict single-cell DNA methylation states using a CNN and a bidirectional RNN (Angermueller et al., 2017). We also extended the DanQ network into a "Siamese" architecture that accounts for reverse complements (Figure 1). This Siamese architecture applies identical networks to both strands to ensure that the forward and reverse complement sequences return the same outputs, effectively halving the total amount of training data and ultimately improving training efficiency and predictive accuracy. Both networks share the same weights.
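The reverse-complement weight sharing can be sketched in a few lines of NumPy. With rows ordered A, C, G, T, reversing a one-hot matrix along both axes yields the reverse complement, and averaging a shared network's outputs on both strands makes the prediction strand-invariant by construction; `net` below is a stand-in for any scoring function, not the actual FactorNet network.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """(4, L) one-hot matrix with rows ordered A, C, G, T."""
    m = np.zeros((4, len(seq)))
    for j, base in enumerate(seq):
        m[BASES.index(base), j] = 1.0
    return m

def reverse_complement(x):
    # Reversing the row axis swaps A<->T and C<->G; reversing the
    # column axis reverses the positions.
    return x[::-1, ::-1]

def siamese_predict(net, x):
    """Average the shared network's outputs on both strands."""
    return 0.5 * (net(x) + net(reverse_complement(x)))

# The reverse complement of AACG is CGTT.
assert np.array_equal(reverse_complement(one_hot("AACG")), one_hot("CGTT"))
```

Because `reverse_complement` is an involution, `siamese_predict` returns identical values for a sequence and its reverse complement, which is exactly the invariance described above.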

Siamese networks are popular for tasks that involve finding similarity or a relationship between two comparable objects. Two examples are signature verification (Bromley et al., 1993) and assessing sentence similarity (Mueller and Thyagarajan, 2016). Another recent method, TFImpute (Qin and Feng, 2017), shares many similarities with FactorNet. Like FactorNet, TFImpute is intended to impute missing TF binding datasets, and it uses a CNN-RNN architecture and a weight-sharing strategy to handle reverse complements. TFImpute is a sequence-only method and is therefore more comparable to DeepSEA, DeepBind, and Basset.

Unlike FactorNet, TFImpute does not directly accept cell type-specific data like DNase-seq as model inputs. We submitted the FactorNet model to the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge (https://www.synapse.org/ENCODE), where it placed among the top four ranked teams. All results discussed in this paper are derived from data in the Challenge. The Challenge takes a crowdsourcing approach to identify optimal strategies for solving the problem of TF binding prediction.

Predictive performance varies across transcription factors

ChIP-seq peaks are called with SPP (Kharchenko et al., 2008), which uses the fold-change signal and peak shape to score and rank peaks. An additional processing step scores peaks according to an irreproducible discovery rate (IDR), which is a measure of consistency between replicate experiments. Bins are labeled positive if they overlap a peak that meets the IDR threshold of 5%. The IDR scores are not always monotonically associated with the fold-changes. Nevertheless, we expect that performance scores from the fold-change signal classifiers should serve as overly optimistic upper bounds for benchmarking. Commensurate with these expectations, the auPR scores of the FactorNet models are less than, but positively correlated with, the respective auPR scores of the ChIP-seq fold-change signal classifiers (Figure 2A). Interestingly, this pattern does not extend to the auROC scores, and in more than half of the cases the FactorNet auROC scores are greater (Figure 2B). These results are consistent with previous studies showing that the auROC can be unreliable and overly optimistic in an imbalanced class setting (Saito and Rehmsmeier, 2015), a common occurrence in genomic applications (Quang and Xie, 2016), motivating the use of alternative measures like the auPR that ignore the overly abundant true negatives.
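The auROC's optimism under class imbalance is easy to reproduce. In this small sketch (assuming scikit-learn is available; the labels and scores are made up), a classifier ranks ten unbound bins above the single bound bin out of a thousand: the auROC still looks excellent, while the auPR exposes the poor precision.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# One bound bin among 1,000 unbound bins; ten negatives outscore it.
y_true = np.array([1] + [0] * 1000)
y_score = np.concatenate(([0.5], np.linspace(0.9, 0.81, 10), np.full(990, 0.1)))

auroc = roc_auc_score(y_true, y_score)           # 990/1000 = 0.99: looks excellent
aupr = average_precision_score(y_true, y_score)  # 1/11 ~= 0.09: reveals poor precision
```

The auROC only asks how often the positive outranks a random negative (990 of 1,000 negatives score lower), while the auPR is dominated by the ten false positives that precede the lone true positive.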

We can also visualize the FactorNet predictions as genomic signals that can be viewed alongside the ChIP-seq signals and peak calls (Figure 2C and S1). Higher FactorNet prediction values tend to coalesce around called peaks, forming peak-like shapes in the prediction signal that resemble the peaks in the original ChIP-seq signal. The visualized signals also demonstrate the differences in signal noise across the ChIP-seq datasets. The NANOG/iPSC ChIP-seq dataset, for example, displays a large amount of signal outside of peak regions, unlike the HNF4A/liver ChIP-seq dataset, which has most of its signal focused in peak regions.

The ENCODE-DREAM challenge data, documentation, and results can be found on the Challenge homepage: https://www.synapse.org/ENCODE.

Interpreting neural network models

Using the same heuristic as DeepBind (Alipanahi et al., 2015) and DanQ (Quang and Xie, 2016), we visualized several kernels from a HepG2 multi-task model as sequence logos by aggregating subsequences that activate the kernels (Figure 3A). The kernels significantly match motifs associated with the target TFs.
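The aggregation heuristic can be sketched as follows: slide a kernel across one-hot sequences, keep every window whose activation exceeds a fraction of the maximum, and sum the kept windows into a position frequency matrix. This is a simplified stand-in for the DeepBind/DanQ procedure; the kernel below is an arbitrary toy detector that rewards the subsequence TGA.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    m = np.zeros((4, len(seq)))
    for j, base in enumerate(seq):
        m[BASES.index(base), j] = 1.0
    return m

def kernel_to_pfm(kernel, sequences, frac=0.7):
    """Aggregate high-activation windows into a position frequency matrix."""
    k = kernel.shape[1]
    windows, activations = [], []
    for seq in sequences:
        x = one_hot(seq)
        for i in range(x.shape[1] - k + 1):
            w = x[:, i:i + k]
            windows.append(w)
            activations.append((kernel * w).sum())
    activations = np.array(activations)
    keep = activations > frac * activations.max()
    return np.sum([w for w, kept in zip(windows, keep) if kept], axis=0)

# Toy kernel: +1 for T at position 0, G at position 1, A at position 2.
kernel = np.zeros((4, 3))
kernel[BASES.index("T"), 0] = kernel[BASES.index("G"), 1] = kernel[BASES.index("A"), 2] = 1.0

pfm = kernel_to_pfm(kernel, ["TTGACC", "CTGAAT"])
```

Only the two TGA windows clear the 70%-of-maximum cutoff, so the resulting counts recover T, G, A as the consensus at the three positions.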

Furthermore, the aggregated DNase I signals also inform us of the unique "footprint" signatures the models use to identify true binding sites at single-nucleotide resolution. After visualizing and aligning all the kernels, we confirmed that the model learned a variety of motifs (Figure 3B). A minority of kernels display very little sequence specificity while recognizing regions of high chromatin accessibility (Figure 3C).

We can also interpret the model with saliency maps (Simonyan et al., 2013). To generate a saliency map, we compute the gradient of the output category with respect to the input sequence. By visualizing the saliency maps of a genomic sequence, we can identify the parts of the sequence the neural network finds most relevant for predicting binding, which we interpret as sites of TF binding at single-nucleotide resolution. Using a liver HNF4A peak sequence and the HNF4A predictor model as an example, the saliency map highlights a subsequence overlapping the summit that strongly matches the known canonical HNF4A motif, as well as two putative binding sites upstream of the summit on the reverse complement (Figure 3D).

Figure 2 (caption fragment). A genome browser (Kent et al., 2002) screenshot displays the ChIP-seq fold-change signal, FactorNet predictions, and peak calls for four TF/cell type pairs in the TYW3 locus. Confidently bound regions are more heavily shaded than ambiguously bound regions.

Figure 3 (caption fragment). Kernels are matched against known motifs (Mathelier, A. et al., 2016) using TOMTOM (Gupta et al., 2007). Mean normalized DNase I cleavage signals and their maximum values are displayed above the aligned logos. E-values measure similarity between query and target motifs, corrected for multiple hypothesis testing. All kernels are converted to sequence logos and aligned with RSAT (Medina-Rivera et al., 2015). The heatmaps are ordered by this alignment and colored according to the motif information content (IC) (B) or mean DNase I cleavage signal (C) at each nucleotide position. (D) Normalized liver DNase I cleavage signal and saliency maps of aligned stranded sequences centered on the summit of a liver HNF4A peak in the TYW3 locus (Figure 2C). Negative gradients are converted to zeros. We visualized saliency maps with the DeepLIFT visualizer (Shrikumar et al., 2017).
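The gradient computation behind a saliency map can be illustrated with a toy differentiable predictor. The sketch below uses a hand-set weight matrix `w` (a hypothetical stand-in for a trained network) so the gradient can be written analytically; as in the figure, negative gradients are zeroed, and the gradient is masked by the observed one-hot input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def saliency(x, w):
    """Gradient of p = sigmoid(sum(w * x)) with respect to the input x,
    masked to the observed bases and with negative values zeroed."""
    p = sigmoid((w * x).sum())
    grad = p * (1.0 - p) * w        # analytic d p / d x
    return np.maximum(grad * x, 0.0)

# One-hot sequence of length 8; the "motif" occupies positions 2-4.
x = np.zeros((4, 8))
x[[0, 2, 1, 3, 0, 0, 2, 1], np.arange(8)] = 1.0
w = np.zeros((4, 8))
w[[1, 3, 0], [2, 3, 4]] = 2.0       # rewards the bases actually present at 2-4

s = saliency(x, w)
per_position = s.sum(axis=0)        # nonzero only at the motif positions
```

Positions 2-4 light up because only there does a positive weight coincide with an observed base, mirroring how a saliency map highlights the motif-matching subsequence.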

Data variation influences predictive performance
In cases where two or more testing cell types are available for the same TF, we also observe some rather large disparities in performance. With the exception of FOXA1, FactorNet consistently performs worse for liver than for other cell types, with the difference in auPR reaching as much as 33.5% in the case of EGR1 (Table 1). Variation in data quality across cell type-specific datasets may partially explain these performance differences. The DNase-seq data, which is arguably the most informative cell type-specific feature for binding prediction, varies widely in terms of sequencing depth and signal-to-noise ratio (SNR) across the cell types, which we measure as the fraction of reads that fall into conservative peaks (FRiP) (Figure S2A). Notably, liver displays the lowest SNR with a FRiP score of 0.05, which is consistent with its status as a primary tissue; all other cell types are cultured cell lines.
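FRiP itself is straightforward to compute. The sketch below (hypothetical read positions and peak intervals, not Challenge data) counts the fraction of 5' cut sites falling inside any peak:

```python
import numpy as np

def frip(read_positions, peaks):
    """Fraction of reads whose 5' cut site falls inside any peak.
    `peaks` is a list of (start, end) half-open intervals."""
    reads = np.asarray(read_positions)
    in_peak = np.zeros(len(reads), dtype=bool)
    for start, end in peaks:
        in_peak |= (reads >= start) & (reads < end)
    return in_peak.mean()

# 4 of the 10 reads fall inside the two peaks -> FRiP = 0.4
reads = [5, 12, 18, 40, 55, 61, 62, 88, 93, 99]
peaks = [(10, 20), (60, 65)]
```

A low FRiP, as in the liver dataset, means most cleavage events land outside called peaks, i.e. the signal is diffuse rather than concentrated at accessible sites.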

To further scrutinize the effect data variation has on performance, we trained several FactorNet single-task models and plotted the learning curves to monitor for overfitting (Figure S2B). Learning curves trace the predictive performance of neural networks on training and validation sets. They are useful for identifying signs of overfitting, a common problem in machine learning. These learning curves focus on the GM12878 and HeLa-S3 cell types, using one cell type for training and the other as a validation set. We selected these two cell types because they are the only two reference cell types for E2F1, on which FactorNet performed particularly poorly. In addition, the HeLa-S3 DNase-seq read count and FRiP score are both almost twice those of the GM12878 DNase-seq data.

From the learning curves of the E2F1 model trained on GM12878, we observe evidence of overfitting. The HeLa-S3 cross-cell type validation loss reaches a minimum value within four training epochs, after which it increases until it reaches a steady state value. In contrast, the GM12878 within-cell type validation loss steadily decreases past the first four epochs and remains much smaller than the HeLa-S3 validation loss throughout training. At first, we speculated the gap to be caused by the differences in the cell type DNase-seq data; however, based on the learning curves for other TFs, this may not be the sole reason. In the cases of GABPA and TAF1, the differences in validation losses are much smaller. One possible explanation for these results is the difference in the ChIP-seq protocols between the GM12878 and HeLa-S3 datasets. Unlike the other three TFs, the GM12878 and HeLa-S3 E2F1 ChIP-seq datasets were generated using two different antibodies: ENCAB037OHX and ENCAB000AFU, respectively. Both ZNF143 ChIP-seq datasets were generated using the same antibody (ENCAB000AMR), but the model trained on HeLa-S3 displays an unusually high validation loss difference. We speculate this is because the GM12878 ZNF143 ChIP-seq dataset was generated using both single-end 36 bp and paired-end 100 bp reads while the HeLa-S3 ZNF143 ChIP-seq dataset was generated using only single-end 36 bp reads. Given that paired-end 100 bp reads can map to genomic regions that are unmappable for the shorter 36 bp reads, we suspect that differences in read types can introduce significant dataset-specific artifacts.

Given the differences in the GM12878 and HeLa-S3 E2F1 ChIP-seq datasets resulting from the use of different antibodies, we investigated whether a model exclusively trained on one cell type could improve our predictive performance for the E2F1/K562 testing set. To do so, we retrained single- and multi-task models exclusively on either GM12878 or HeLa-S3 and evaluated cross-cell type binding performance on the E2F1/K562 testing set. By contrast, the E2F1 model used at the conclusion of the Challenge had been trained on data from both reference cell types. The K562 E2F1 ChIP-seq dataset was generated using the antibodies ENCAB037OHX and ENCAB851KCY, the former of which was also used for GM12878. Hence, we expected the GM12878 model to be a better predictor of K562 E2F1 binding sites than the other two models, which we find to indeed be the case (Figure S2C-D). Although we managed to improve upon our previous E2F1 model, the cross-cell type performance for E2F1 is still inadequate, especially compared to TFs like CTCF. Predicting binding for TFs in the E2F family is notoriously difficult because members of this protein family share almost identical binding motifs, which in turn makes distinguishing between multiple members of the same family difficult. For TFs that are part of a large family sharing similar sequence binding preferences, we conjecture that performance will be limited regardless of the choice of cell type or antibody.

Comparing single- and multi-task training

Although a thorough comparison between single- and multi-task training is beyond the scope of this paper, our results on E2F1 show that single- and multi-task models can differ in performance. Specifically, the cross-cell type auPR of the single-task GM12878 model is more than 10% greater than that of its respective multi-task model (Figure S2C and Figure S2D).
In our experience, the cross-cell type performance of each training method depends on the TF/cell type pair. For example, when we retrained single-task and multi-task models for NANOG using H1-hESC as a reference cell type and evaluated the models on iPSC, the multi-task model's auPR score was over 16% greater than that of the single-task model (Figure S3).

While we initially assumed that multi-task training confers an advantage by introducing additional information through the multiple labels, at least in the case of NANOG there are too many confounding variables to immediately conclude this. One of these confounding variables is the difference in training data between single-task and multi-task models.

We expect that predictive performance for many TF/cell type pairs can be improved by redoing experiments with higher-quality antibodies.

Alternatively, ChIP-exo, a modification of ChIP-seq that uses exonucleases to degrade contaminating non-protein-bound DNA fragments (Rhee and Pugh, 2011), may improve the quality of ChIP signals. Next, we investigated the variation in the DNase-seq datasets. We found that the DNase-seq datasets greatly differ in terms of sequencing depth and SNR (Figure S3). While we do correct for the variation in sequencing depth by normalizing the cleavage signals to 1x coverage, we do not correct for the variation in SNR. The performance loss is most pronounced for the liver cell type, which has the DNase-seq dataset with the lowest SNR. However, differences in DNase-seq SNR do not fully account for differences in predictive performance. By studying several within- and cross-cell type validation curves, we also concluded that differences in antibodies and read lengths can introduce significant dataset-specific biases (Figure S2B). Accordingly, we can improve performance by omitting less compatible cell type datasets (Figure S2C-D).

We also compared single- and multi-task training frameworks. Several deep learning methods, including DeepSEA and Basset, primarily use multi-task training, which involves assigning multiple labels, corresponding to different chromatin markers, to the same DNA sequence. The authors of these methods propose that multi-task training improves efficiency and performance. FactorNet supports both types of training. Our results do not entirely favor one method over the other. For the E2F1/K562 cross-cell type testing set, the GM12878 single-task model outperformed the GM12878 multi-task model (Figure S2C); however, for the NANOG/iPSC cross-cell type testing set, the H1-hESC multi-task model outperformed the H1-hESC single-task model (Figure S3).
In the latter case, the performance gap can be narrowed by changing the proportion of negative to positive training samples in the single-task framework, suggesting that any additional gain granted by the multiple labels is eclipsed by the choice of negative sets. Nevertheless, ensembling single- and multi-task models together appears to be an effective method of improving predictive performance, at least if antibodies and read lengths are kept consistent.
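One simple way to combine heterogeneous models, used for the bagged models described in the supplementary figures, is rank averaging. The sketch below (toy scores, not actual model outputs) averages each model's within-set ranks, which sidesteps differences in score calibration between single- and multi-task models:

```python
import numpy as np

def rank_average(*score_vectors):
    """Ensemble several models' scores by averaging their ranks
    (0 = lowest score) rather than their raw values, making the
    combination robust to differences in score calibration."""
    ranks = [np.argsort(np.argsort(s)) for s in score_vectors]
    return np.mean(ranks, axis=0)

single = np.array([0.9, 0.1, 0.8, 0.3])   # e.g. single-task scores
multi  = np.array([0.7, 0.2, 0.95, 0.4])  # e.g. multi-task scores
combined = rank_average(single, multi)    # [2.5, 0.0, 2.5, 1.0]
```

Only the ordering of each model's scores matters: a model that emits scores near 1 and one that emits scores near 0.5 contribute equally.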

Another avenue we can explore for improving the model is hyperparameter tuning. We selected the hyperparameters for the models in this work arbitrarily, for demonstration and uniformity purposes (Tables S1-S3). Although we have not yet implemented them, distributed computing hyperparameter tuning algorithms (Bergstra et al., 2013) can systematize hyperparameter selection and improve performance.

One of the chief criticisms of neural networks is that they are "black box" models. While neural networks can achieve high performance in predictive tasks, the exact reasons why are not always entirely clear. Countering these criticisms, we can visualize and interpret aspects of the FactorNet model. By converting network kernels to motifs, we show that FactorNet can recover motifs that are known to contribute to binding (Figure 3A). DNase I footprint patterns help discriminate true binding sites from putative sites that simply match a motif. Previous TF binding prediction methods, such as Centipede, require users to supply motifs. FactorNet relaxes this strong assumption and essentially performs de novo motif discovery during the learning process to identify the sequence patterns that are most useful for binding prediction. Saliency maps can also help elucidate the complex regulatory grammar that governs TF binding by visualizing the spatial positions and orientations of multiple binding sites that work together to recruit TFs (Figure 3D).

Our adherence to standardized file formats also makes FactorNet robust. For example, FactorNet can readily accept other genomic signals that were not included as part of the Challenge but are likely relevant to TF binding prediction, such as conservation and methylation. Along these same lines, if we were to refine our pre-processing strategies for the DNase-seq data, we could easily incorporate the improved features into our model as long as the data are available as bigWig files (Kent et al., 2010). Other sources of open chromatin information, such as ATAC-seq (Buenrostro et al., 2015) and FAIRE-seq (Giresi et al., 2007), can also be used to replace or complement the existing DNase-seq data. In addition, FactorNet is not necessarily limited to TF binding predictions. If desired, users can provide BED files of positive intervals to train predictive models for other markers, such as histone modifications. As more epigenetic datasets are added to data repositories, FactorNet is in a prime position to integrate both new and existing datasets.
In conclusion, FactorNet is a very flexible framework that lends itself to a variety of future research avenues. The techniques that we introduced in this paper will also be useful for the field of machine learning, especially since neural network models are becoming increasingly popular in genomics. Some of the design elements of FactorNet were motivated by specific properties inherent in the structure of the data, and many of these properties are shared by data found in other applications of machine learning. For example, the directional nature and modularity of DNA sequences prompted us to search for a model that can discover local patterns and long-range interactions in sequences, which led us to ultimately select a hybrid neural network architecture that includes convolution and bidirectional recurrence. Natural language processing problems, such as topic modeling and sentiment analysis, can also benefit from such an architecture, since language grammar is directional and modular. Another unique aspect of the data that guided our design is the double strandedness of DNA, which prompted us to adopt a Siamese architecture to handle pairs of input sequences (Figure 1). Protein-protein interaction prediction also involves sequence pairs and would likely benefit from a similar framework. Our heuristics for reducing training time and computational overhead will serve as useful guidelines for other applications involving large imbalanced data, especially if recurrent models are utilized. We therefore expect that FactorNet will be of value to a wide variety of fields.

The TF binding prediction problem is evaluated as a two-class binary classification task. For each test TF/cell type pair, the following performance measures are computed:

1. auROC. The area under the receiver operating characteristic curve is a common metric for evaluating classification models. It is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.

2. auPR. The area under the precision-recall curve is more appropriate in the scenario of few relevant items, as is the case with TF binding prediction (Quang and Xie, 2016). Unlike the auROC metric, the auPR metric does not take into account the number of true negatives called.

3. Recall at fixed FDR. The recall achievable while keeping the false discovery rate below a given level measures the fraction of relevant items recovered among all relevant items. This metric is often used in applications such as fraud detection, in which the goal may be to maximize the recall of true fraudsters while tolerating a given fraction of customers being falsely identified as fraudsters. The ENCODE-DREAM Challenge computes this metric for several FDR values.
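Recall at a fixed FDR can be computed directly from the ranked predictions. A minimal sketch (with toy labels and scores): sort by score, track cumulative precision and recall, and report the best recall among thresholds whose FDR (one minus precision) stays within the budget.

```python
import numpy as np

def recall_at_fdr(y_true, y_score, fdr=0.1):
    """Highest recall achievable at any score threshold whose
    false discovery rate (1 - precision) is at most `fdr`."""
    order = np.argsort(-np.asarray(y_score))
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                       # true positives at each cutoff
    predicted = np.arange(1, len(y) + 1)    # total calls at each cutoff
    precision = tp / predicted
    recall = tp / y.sum()
    ok = (1.0 - precision) <= fdr
    return recall[ok].max() if ok.any() else 0.0

y_true  = [1, 1, 0, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
```

At an FDR budget of 10% only the first two calls qualify (recall 2/3); relaxing the budget to 25% admits the cutoff after the fourth call, recovering all three positives.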

As illustrated in Figure 1, the FactorNet Siamese architecture operates on both the forward and reverse complement sequences to ensure that both strands return the same outputs during both training and prediction. Although a TF might only physically bind to one strand, this information cannot usually be inferred directly from the peak data. Thus, the same set of labels is assigned to both strands in the evaluation step.

Features and data pre-processing

FactorNet works directly with standard genomic file formats and requires relatively little pre-processing.

BED files provide the locations of reference TF binding sites, and bigWig files (Kent et al., 2010) provide dense, continuous data at single-nucleotide resolution. bigWig values are included as extra rows that are appended to the four-row one-hot DNA input matrix. FactorNet can accept an arbitrary number of bigWig files as input features, and we found the following signals to be highly informative for prediction:

1. DNase I cleavage. For each cell type, reads from all DNase-seq replicates were trimmed down to the first nucleotide on the 5' end, pooled, and normalized to 1x coverage using deepTools (Ramírez et al., 2014).

Our implementation is written in Python, utilizing the Keras 1.2.2 library (Chollet et al., 2015) with the Theano 0.9.0 (Bastien et al., 2012; Bergstra et al., 2010) backend. We used an NVIDIA Titan X Pascal GPU for training.
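The input layout described above, a 4-row one-hot matrix with signal tracks appended as extra rows, can be sketched as follows (the track values are made up for illustration):

```python
import numpy as np

BASES = "ACGT"

def build_input(seq, tracks):
    """Stack a 4-row one-hot DNA matrix with one extra row per
    real-valued signal track (e.g. a bigWig-derived vector)."""
    onehot = np.zeros((4, len(seq)))
    for j, base in enumerate(seq):
        onehot[BASES.index(base), j] = 1.0
    rows = [onehot] + [np.asarray(t, dtype=float)[None, :] for t in tracks]
    return np.vstack(rows)

seq = "ACGTAC"
dnase = [0.0, 0.2, 1.5, 3.1, 0.4, 0.0]   # toy 1x-normalized cleavage signal
x = build_input(seq, [dnase])            # shape (4 + 1 tracks, len(seq))
```

Each additional bigWig track simply appends one more row, so the convolution kernels see sequence and chromatin accessibility jointly at every position.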

FactorNet supports single- and multi-task training. Both types of neural network models are trained using the Adam algorithm (Kingma and Ba, 2014) with a minibatch size of 100 to minimize the mean multi-task binary cross-entropy loss function on the training set. We also include dropout (Srivastava et al., 2014) to reduce overfitting. One or more chromosomes are set aside as a validation set. Negative bins are sampled and streamed for training, and the negative-to-positive ratio is an adjustable hyperparameter (see Table S1).

Supporting Information

Table S1. Summary and description of the hyperparameters used for the single-task models in Figure S2B.

Figure S1. FactorNet cross-cell type predictions are comparable to ChIP-seq signals and peaks. A genome browser shot similar to Figure 2C.

Figure S2. Variation in cell type-specific datasets influences cross-cell type predictive performance. (A) IGV (Thorvaldsdóttir et al., 2013) browser screenshot displays pooled DNase I cleavage signal and conservative DNase-seq peaks for eight cell types. The inset is a magnified view of the MTPN promoter, a known NRF1 binding site. (B) Each plot displays learning curves of single-task models trained on either GM12878 or HeLa-S3. We generated within- and cross-cell type validation sets by extracting an equal number of positive and negative bins from the validation chromosomes. The difference between the smallest within- and cross-cell type validation losses is displayed in each plot. (C and D) Precision-recall curves of single- and multi-task models evaluated on the E2F1/K562 testing set, trained exclusively on either GM12878 or HeLa-S3. Dotted lines indicate points of discontinuity. Model weights were selected based on the within-cell type validation loss on chr11. We generated single-task scores by bagging scores from two single-task models initialized differently. Final bagged models ensemble respective single- and multi-task models.
Figure S3. Comparison of single- and multi-task training. Cross-cell type precision-recall curves of single-task and multi-task NANOG binding prediction models trained on H1-hESC and evaluated on iPSC. Model weights were selected based on the within-cell type validation loss on chr11. We generated single-task scores by bagging scores from two single-task models initialized differently. The three single-task models differ in the ratio of negative-to-positive bins per training epoch. The bagged models are the rank average scores from the multi-task model and one of the three single-task models. auPR scores are in parentheses. Both training and testing ChIP-seq datasets use the ENCAB000AIX antibody.