TY  - JOUR
T1  - An Experiment in Learning the Language of Sequence Motifs: Sequence Logos vs. Finite-State Machines
JF  - bioRxiv
DO  - 10.1101/143024
SP  - 143024
AU  - Alexandre P Francisco
AU  - Travis Gagie
AU  - Dominik Kempa
AU  - Leena Salmela
AU  - Sophie Sverdlov
AU  - Jarkko Toivonen
AU  - Esko Ukkonen
Y1  - 2017/01/01
UR  - http://biorxiv.org/content/early/2017/05/27/143024.abstract
N2  - Position weight matrices (PWMs) are the standard way to model binding site affinities in bioinformatics. However, they assume that symbol occurrences are position independent and hence do not take into account symbol co-occurrence at different sequence positions. To address this problem, we propose to construct finite-state machines (FSMs) instead. Because FSMs grow too quickly as a function of the number of sequences to reveal any useful structure, we use a modified version of the Evidence-Driven State Merging (EDSM) heuristic to reduce the number of states. We tested our approach on sequence data for the transcription factor HNF4 and found that the constructed FSMs provide small representations and an intuitive visualization. Furthermore, the FSM was better than PWMs at discriminating the positive and negative sequences in our data set.
ER  - 