Abstract
Although convolutional neural networks (CNNs) have been applied to a variety of computational genomics problems, there remains a large gap in our understanding of how they work. Here we perform systematic experiments on synthetic sequences to reveal principles of how CNN architecture influences the internal representations of genomic sequence motifs that it learns. We focus our study on the representations learned by first-layer convolutional filters. We find that deep CNNs tend to learn distributed representations of partial sequence motifs. However, we demonstrate that a CNN's architecture can be modified to learn more interpretable localist representations, i.e., whole motifs. We then validate that the representation learning principles established from synthetic sequences generalize to in vivo sequences.