Abstract
Transcription factors (TFs) are important contributors to gene regulation. They bind specifically to short DNA stretches known as transcription factor binding sites (TFBSs), which are located in regulatory regions (e.g. promoters), and thereby influence a target gene’s expression level. Computational biology has contributed substantially to understanding regulatory regions by developing numerous tools, including tools for de novo motif discovery. While those tools primarily focus on determining and studying TFBSs, the surrounding sequence context is often given less attention. In this paper, we attempt to fill this gap by adopting a so-called convolutional restricted Boltzmann machine (cRBM) that captures redundant features from the DNA sequences. The model uses an unsupervised learning approach to derive a rich, yet interpretable, description of the entire sequence context. We evaluated the cRBM on a range of publicly available ChIP-seq peak regions and investigated its capability to summarize heterogeneous sets of regulatory sequences in comparison with MEME-ChIP, a popular motif discovery tool. In summary, our method yields a considerably more accurate description of the sequence composition than MEME-ChIP, providing a summary of both strong TF motifs and subtle low-complexity features.