Abstract
Genes are regulated by cis-regulatory sequences, which contain transcription factor (TF) binding motifs in specific arrangements (syntax). To understand how motif syntax influences TF binding, we train a deep learning model, BPNet, that uses DNA sequence to predict base-resolution ChIP-nexus binding profiles of four pluripotency TFs: Oct4, Sox2, Nanog, and Klf4. We interpret the model to accurately map hundreds of thousands of motifs in the genome, learn predictive motif representations, and identify rules by which specific motifs interact. We find that instances of strict motif spacing are largely due to retrotransposons, but that soft motif syntax influences TF binding in a directional manner. Most strikingly, Nanog shows a strong preference for binding with helical periodicity. We validate our model using CRISPR-induced point mutations, demonstrating that interpretable deep learning models are a powerful approach to uncover the motifs and syntax of cis-regulatory sequences.
Footnotes
1. We now validate BPNet's predictions computationally and experimentally using independent data:
   i) BPNet's predictions were experimentally validated using CRISPR/Cas9 experiments (new Figure 6 and Figure S21).
   ii) We used recently published ATAC-seq data after Oct4 or Sox2 depletion to show that BPNet outperforms traditional methods (motif recall and signal prediction in new Figure 2G,H and new Figure S12).
2. We performed a more rigorous comparison of our method versus a traditional PWM approach (updated Figure S9).
3. We showcase the robustness of the method, including the syntax we derived. Training BPNet on 5 different chromosome folds with different random seeds yields consistent downstream analyses (new Figures S6, S17, S19). We excluded the possibility of mappability bias on model predictions (new Figure S1).
4. We tried to make the method and results more accessible to the reader by rewriting the manuscript and adding a Q&A section.