ABSTRACT
Many statistical methods have been developed to infer the binding motifs of a transcription factor (TF) from a subset of its numerous binding regions in the genome. We refer to such regions, e.g. detected by ChIP-seq, as binding sites. The sites with strong binding signals are selected for motif inference. However, binding signals do not necessarily indicate the existence of target motifs. Moreover, even strong binding signals can be spurious due to experimental artifacts. Here, we observe that such uninformative sites without target motifs tend to be “crowded” -- i.e. have many other TF binding sites present nearby. In addition, we find that even if a crowded site contains recognizable target motifs, it can still be uninformative for motif inference due to the presence of interfering motifs from other TFs. We propose using less crowded and shorter binding sites in motif interference and develop specific recommendations for carrying this out. We find our recommendations substantially improve the resulting motifs in various contexts by 30%-70%, implying a “less-is-more” effect.
Competing Interest Statement
The authors have declared no competing interest.