MotifCut: regulatory motifs finding with maximum density subgraphs

Bioinformatics. 2006 Jul 15;22(14):e150-7. doi: 10.1093/bioinformatics/btl243.

Abstract

Motivation: DNA motif finding is one of the core problems in computational biology, for which several probabilistic and discrete approaches have been developed. Most existing methods formulate motif finding as an intractable optimization problem and rely either on expectation maximization (EM) or on local heuristic searches. Another challenge is the choice of motif model: simpler models such as the position-specific scoring matrix (PSSM) impose biologically unrealistic assumptions such as independence of the motif positions, while more involved models are harder to parametrize and learn.

Results: We present MotifCut, a graph-theoretic approach to motif finding leading to a convex optimization problem with a polynomial time solution. We build a graph where the vertices represent all k-mers in the input sequences, and edges represent pairwise k-mer similarity. In this graph, we search for a motif as the maximum density subgraph, which is a set of k-mers that exhibit a large number of pairwise similarities. Our formulation does not make strong assumptions regarding the structure of the motif and in practice both motifs that fit well the PSSM model, and those that exhibit strong dependencies between position pairs are found as dense subgraphs. We benchmark MotifCut on both synthetic and real yeast motifs, and find that it compares favorably to existing popular methods. The ability of MotifCut to detect motifs appears to scale well with increasing input size. Moreover, the motifs we discover are different from those discovered by the other methods.

Availability: MotifCut server and other materials can be found at motifcut.stanford.edu.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms*
  • Base Sequence
  • Binding Sites
  • Chromosome Mapping / methods
  • Computer Simulation
  • DNA / genetics*
  • Models, Genetic*
  • Models, Statistical
  • Molecular Sequence Data
  • Protein Binding
  • Regulatory Elements, Transcriptional / genetics*
  • Sequence Alignment / methods
  • Sequence Analysis, DNA / methods*
  • Software*
  • Transcription Factors / genetics*

Substances

  • Transcription Factors
  • DNA