TY  - JOUR
T1  - A novel k-mer set memory (KSM) motif representation improves regulatory variant prediction
JF  - bioRxiv
DO  - 10.1101/130815
SP  - 130815
AU  - Yuchun Guo
AU  - Kevin Tian
AU  - Haoyang Zeng
AU  - Xiaoyun Guo
AU  - David K. Gifford
Y1  - 2017/01/01
UR  - http://biorxiv.org/content/early/2017/04/26/130815.abstract
N2  - The representation and discovery of the sequence specificities of transcription factors (TFs) is critical for understanding gene regulatory networks and interpreting the impact of disease-associated non-coding genetic variants. We present a novel motif representation, the K-mer Set Memory (KSM), which consists of a set of aligned k-mers that are over-represented at TF binding sites, and a new method called KMAC for de novo discovery of KSMs. We find that KSMs are more accurate at predicting in vivo binding sites than position weight matrix models (PWMs) and other more complex motif models across a large set of ChIP-seq experiments. KMAC also identifies correct motifs in more experiments than four state-of-the-art motif discovery methods. In addition, KSM-derived features outperform both PWM-derived and deep-learning-model-derived sequence features in predicting eQTL regulatory activities. Finally, we have applied KMAC to 1488 ENCODE TF ChIP-seq datasets and created a public resource of KSM and PWM motifs.
ER  -