TY  - JOUR
T1  - THiCweed: fast, sensitive motif finding by clustering big data sets
JF  - bioRxiv
DO  - 10.1101/104109
SP  - 104109
AU  - Ankit Agrawal
AU  - Leelavati Narlikar
AU  - Rahul Siddharthan
Y1  - 2017/01/01
UR  - http://biorxiv.org/content/early/2017/01/29/104109.abstract
N2  - Motivation: Thousands of ChIP-seq (chromatin immunoprecipitation followed by sequencing) datasets are now publicly available that provide genome-wide binding profiles for hundreds of transcription factors in various species and various cell types, with thousands to hundreds of thousands of “peaks” per dataset. Transcription factors commonly bind to regions of DNA that have short conserved sequence patterns, or “motifs”. Ab initio motif finding is a well-established problem in computational biology, but such large data sets are challenging for most existing tools. Additionally, it is common for the target proteins to bind indirectly to DNA via co-factors, with the result that the ChIP-seq peaks contain a mixture of motifs. Few tools exist to deal effectively with this problem. Results: We describe a new approach to motif finding that models the problem as one of clustering bound regions based on sequence similarity. We take an iterative “top-down” approach of repeatedly subdividing an initial single large cluster of all input sequences into smaller and smaller clusters, while also exploring shift and reverse-strand matches of sequences to clusters. Our implementation is significantly faster than any other ChIP-seq-oriented motif-finding program we tested, able to process 5,000 sequences of 100bp length in a few minutes, or 30,000 sequences in 1-2 hours, on a desktop computer using a single CPU core. On synthetic data it outperforms all programs except one (MuMoD) on accuracy; compared to MuMoD it is somewhat less accurate but orders of magnitude faster. It is designed to perform well with “window” sizes much larger than the length of a typical binding site (7-15 base pairs), and we commonly run it with window sizes of 50bp or more. On actual genomic data it successfully recovers literature motifs, but also uncovers highly complex sequence characteristics in flanking DNA, and in many cases recovers secondary motifs (of possible, and sometimes documented, biological significance) even when they occur in less than 5% of the input sequences. We suggest that this is a powerful new approach to the analysis of ChIP-seq data. Availability: The software is open source and available at http://www.imsc.res.in/~rsidd/thicweed/ under the two-clause BSD license. Contact: rsidd{at}imsc.res.in
ER  - 