PT - JOURNAL ARTICLE
AU - Alexandre P Francisco
AU - Travis Gagie
AU - Dominik Kempa
AU - Leena Salmela
AU - Sophie Sverdlov
AU - Jarkko Toivonen
AU - Esko Ukkonen
TI - An Experiment in Learning the Language of Sequence Motifs: Sequence Logos vs. Finite-State Machines
AID - 10.1101/143024
DP - 2017 Jan 01
TA - bioRxiv
PG - 143024
4099 - http://biorxiv.org/content/early/2017/05/27/143024.short
4100 - http://biorxiv.org/content/early/2017/05/27/143024.full
AB - Position weight matrices (PWMs) are the standard way to model binding site affinities in bioinformatics. However, they assume that symbol occurrences are position independent and hence do not take into account symbol co-occurrence at different sequence positions. To address this problem, we propose constructing finite-state machines (FSMs) instead. Because FSMs grow too quickly as a function of the number of sequences to reveal any useful structure, we use a modified version of the Evidence-Driven State Merging (EDSM) heuristic to reduce the number of states. We tested our approach on sequence data for the transcription factor HNF4 and found that the constructed FSMs provide small representations and an intuitive visualization. Furthermore, the FSM was better than PWMs at discriminating the positive and negative sequences in our data set.
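
The position-independence limitation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the toy sites, the pseudocount, and the helper names (`build_pwm`, `pwm_score`, `build_prefix_tree`, `accepts`) are all hypothetical, and the prefix tree is only a stand-in for a simple unmerged automaton, with no EDSM state merging applied. The toy data has two perfectly correlated positions, so a PWM scores an unseen combination exactly as highly as an observed one, while the automaton rejects it.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def build_pwm(seqs, pseudocount=1.0):
    """Per-position log-probabilities from equal-length sequences (assumed helper)."""
    total = len(seqs) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        pwm.append({a: math.log((counts[a] + pseudocount) / total) for a in ALPHABET})
    return pwm

def pwm_score(pwm, seq):
    """Sum of independent per-position log-probabilities."""
    return sum(col[ch] for col, ch in zip(pwm, seq))

def build_prefix_tree(seqs):
    """Nested-dict prefix tree: a trivial automaton accepting exactly the input set."""
    root = {}
    for s in seqs:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root

def accepts(tree, seq):
    """True only if seq follows existing transitions and ends at a leaf."""
    node = tree
    for ch in seq:
        if ch not in node:
            return False
        node = node[ch]
    return not node  # an empty dict marks the end of a stored sequence

# Hypothetical toy sites: position 0 and position 1 always co-vary (A with T, G with C).
sites = ["AT", "AT", "GC", "GC"]
pwm = build_pwm(sites)
tree = build_prefix_tree(sites)

# The PWM cannot tell the observed "AT" from the never-observed "AC".
print(math.isclose(pwm_score(pwm, "AT"), pwm_score(pwm, "AC")))
# The automaton keeps the dependency between positions.
print(accepts(tree, "AT"), accepts(tree, "AC"))
```

The prefix tree memorises its input and does not generalise, which is one reason a state-merging step such as the one the abstract mentions is needed in practice.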