Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters

J Mol Biol. 1987 Feb 20;193(4):723-50. doi: 10.1016/0022-2836(87)90354-8.

Abstract

We present a statistical-mechanical selection theory for the sequence analysis of a set of specific DNA regulatory sites that makes it possible to predict the relationship between individual base-pair choices in the site and specific activity (affinity). The theory is based on the assumption that specific DNA sequences have been selected to conform to some requirement for protein binding (or activity), and that all sequences that can fulfil this requirement are equally likely to occur. In most cases, the number of specific DNA sequences that are known for a certain DNA-binding protein is very small, and we discuss in detail the small-sample uncertainties that this leads to. When applied to the binding sites for cro repressor in phage lambda, the theory can predict, from the sequence statistics alone, their rank order binding affinities in reasonable agreement with measured values. However, the statistical uncertainty generated by such a small sample (only 6 sites known) limits the result to order-of-magnitude comparisons. When applied to the much larger sample of Escherichia coli promoter sequences, the theory predicts the correlation between in vitro activity (k2KB values) and homology score (closeness to the consensus sequence) observed by Mulligan et al. (1984). The analysis of base-pair frequencies in the promoter sample is consistent with the assumption that base-pairs at different positions in the sites contribute independently to the specific activity, except in a few marginal cases that are discussed. When the promoter sites are ordered according to predicted activities, they seem to conform to the Gaussian distribution that results from a requirement for maximal sequence variability within the constraint of providing a certain average activity. The theory allows us to compare the number of specific sites with a certain activity to the number that would be expected from random occurrence in the genome. 
While strong promoters are "overspecified", in the sense that their probability of random occurrence is very low, random sequences with weak promoter-like properties are expected to occur in very large numbers. This leads to the conclusion that functional specificity is based on other properties in addition to primary sequence recognition; some possibilities are discussed. Finally, we show that the sequence information, as defined by Schneider et al. (1986), can be used directly (at least in the case of equilibrium binding sites) to estimate the number of protein molecules that are specifically bound at random "pseudosites" in the genome. (ABSTRACT TRUNCATED AT 400 WORDS)
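The idea of ranking sites "from the sequence statistics alone" can be illustrated with a minimal sketch. It is not taken from the paper: the abstract gives no formulas, so the estimator below (log ratio of pseudocount-corrected base counts, summed independently over positions) is an assumption, as are all function names. The example sites are invented toy sequences, not real cro operators. Scores are dimensionless and only meaningful for rank ordering. The information measure assumes equiprobable bases in the genome and omits any small-sample correction.

```python
import math
from collections import Counter

BASES = "ACGT"


def count_matrix(sites):
    """Count each base at each position of a set of aligned, equal-length sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(s[i] for s in sites) for i in range(length)]


def energy_matrix(sites, pseudocount=1.0):
    """Per-position discrimination scores (assumed estimator, dimensionless).

    The most frequent base at a position scores 0; rarer bases score higher.
    The pseudocount keeps unobserved bases finite in small samples.
    """
    matrix = []
    for counts in count_matrix(sites):
        top = max(counts[b] for b in BASES)
        matrix.append(
            {b: math.log((top + pseudocount) / (counts[b] + pseudocount))
             for b in BASES}
        )
    return matrix


def site_energy(seq, matrix):
    """Sum per-position scores, assuming positions contribute independently."""
    return sum(column[base] for column, base in zip(matrix, seq))


def rank_sites(sites, pseudocount=1.0):
    """Order sites from lowest score (predicted strongest) to highest."""
    matrix = energy_matrix(sites, pseudocount)
    return sorted(sites, key=lambda s: site_energy(s, matrix))


def sequence_information(sites):
    """Total information in bits: sum over positions of 2 minus Shannon entropy."""
    total = 0.0
    n = len(sites)
    for counts in count_matrix(sites):
        entropy = -sum(
            (counts[b] / n) * math.log2(counts[b] / n)
            for b in BASES if counts[b] > 0
        )
        total += 2.0 - entropy
    return total


if __name__ == "__main__":
    toy_sites = ["ACGT", "ACGT", "ACGA", "TCGT"]  # invented example data
    matrix = energy_matrix(toy_sites)
    for site in rank_sites(toy_sites):
        print(site, round(site_energy(site, matrix), 3))
    print("information (bits):", round(sequence_information(toy_sites), 3))
```

With only a handful of sites, as in the six-site cro case, the counts are noisy. The choice of pseudocount then noticeably shifts the scores, which is one way to see why the abstract limits such predictions to order-of-magnitude comparisons.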

Publication types

  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Base Composition
  • Base Sequence
  • Binding Sites
  • Biological Evolution
  • DNA*
  • DNA-Binding Proteins*
  • Models, Genetic
  • Operator Regions, Genetic*
  • Promoter Regions, Genetic*
  • Repressor Proteins
  • Statistics as Topic
  • Viral Proteins
  • Viral Regulatory and Accessory Proteins

Substances

  • DNA-Binding Proteins
  • Repressor Proteins
  • Viral Proteins
  • Viral Regulatory and Accessory Proteins
  • phage repressor proteins
  • DNA