RT Journal Article
SR Electronic
T1 An Experiment in Learning the Language of Sequence Motifs: Sequence Logos vs. Finite-State Machines
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 143024
DO 10.1101/143024
A1 Alexandre P Francisco
A1 Travis Gagie
A1 Dominik Kempa
A1 Leena Salmela
A1 Sophie Sverdlov
A1 Jarkko Toivonen
A1 Esko Ukkonen
YR 2017
UL http://biorxiv.org/content/early/2017/05/27/143024.abstract
AB Position weight matrices (PWMs) are the standard way to model binding site affinities in bioinformatics. However, they assume that symbol occurrences are position independent and, hence, they do not take into account the co-occurrence of symbols at different sequence positions. To address this problem, we propose to construct finite-state machines (FSMs) instead. Because unreduced FSMs grow too quickly with the number of sequences to reveal any useful structure, we use a modified version of the Evidence-Driven State Merging (EDSM) heuristic to reduce the number of states. We tested our approach on sequence data for the transcription factor HNF4 and found that the constructed FSMs provide small representations and an intuitive visualization. Furthermore, the FSM was better than PWMs at discriminating the positive and negative sequences in our data set.
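
The position-independence limitation described in the abstract can be illustrated with a small Python sketch. It is not the authors' method: the toy sequences, the pseudocount, and all function names are assumptions made for illustration, and the automaton built here is only an unmerged prefix tree (the kind of structure state-merging heuristics such as EDSM typically start from), not the modified EDSM procedure the abstract mentions.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def build_pwm(seqs, pseudocount=1.0):
    """Estimate per-position log-probabilities; each column is independent."""
    total = len(seqs) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        pwm.append({a: math.log((counts[a] + pseudocount) / total)
                    for a in ALPHABET})
    return pwm

def pwm_score(pwm, seq):
    """Sum of column log-probabilities: no term couples two positions."""
    return sum(col[ch] for col, ch in zip(pwm, seq))

def build_prefix_tree(seqs):
    """Build a prefix-tree acceptor: one state per distinct prefix."""
    transitions = {}   # (state, symbol) -> next state
    accepting = set()
    next_state = 1     # state 0 is the start state
    for s in seqs:
        state = 0
        for ch in s:
            key = (state, ch)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        accepting.add(state)
    return transitions, accepting

def accepts(fsm, seq):
    """Follow transitions from the start state; reject on a missing edge."""
    transitions, accepting = fsm
    state = 0
    for ch in seq:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state in accepting

# Hypothetical toy data: the two positions always carry the same symbol.
positives = ["AA", "CC"]
negatives = ["AC", "CA"]

pwm = build_pwm(positives)
fsm = build_prefix_tree(positives)

for seq in positives + negatives:
    print(seq, round(pwm_score(pwm, seq), 3), accepts(fsm, seq))
```

In this toy setting the PWM assigns the same score to all four sequences, because each column sees A and C equally often, while the prefix tree accepts only the two co-occurring pairs. A prefix tree merely memorises its training set, which is why a state-merging step is needed before such a machine can generalise or stay small.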