Novel 2D maps and coupling numbers for protein sequences. The first QSAR study of polygalacturonases; isolation and prediction of a novel sequence from Psidium guajava L.

FEBS Lett. 2006 Feb 6;580(3):723-30. doi: 10.1016/j.febslet.2005.12.072. Epub 2006 Jan 4.

Abstract

The development of 2D graph-theoretic representations for DNA sequences has been very important for the qualitative and quantitative comparison of sequences. Calculating numeric features for these representations is useful for DNA-QSAR studies. Most graph-theoretic representations identify each of the four bases with a unit step along one axis direction in 2D space. In the case of proteins, twenty amino acids instead of four bases have to be considered. This fact has limited the introduction of useful 2D Cartesian representations and the corresponding sequence descriptors to encode protein sequence information. In this study, we overcome this problem by grouping amino acids into four groups: acidic, basic, polar and non-polar amino acids. Identifying each group with one of the four axis directions determines a novel 2D representation and numeric descriptors for protein sequences. A Markov model is then used to calculate new numeric descriptors of the protein sequence. These descriptors are called herein the sequence 2D coupling numbers (zeta(k)). In this work, we calculated the zeta(k) values for 108 sequences of different polygalacturonases (PGs) and for 100 sequences of other proteins. A Linear Discriminant Analysis model derived here (PG = 5.36·zeta(1) - 3.98·zeta(3) - 42.21) successfully discriminates between PGs and other proteins. The model correctly classified 100% of the subset of 81 PG and 75 non-PG protein sequences used to train it. The model also correctly classified 51 out of 52 (98.07%) of the protein sequences used as an external validation series. Using different amino acid groupings and/or axis orientations gives different results, so these choices should be explored for other databases. Finally, to illustrate the use of the model, we report the isolation of a novel sequence (AY908988) from Psidium guajava L. by our group and the prediction of its PG action. This prediction agrees very well with the sequence alignment results obtained with the BLAST methodology. These findings illustrate the possibilities of the sequence descriptors derived from this novel 2D sequence representation in protein sequence QSAR studies.
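The four-group walk and the reported discriminant function can be illustrated with a minimal Python sketch. Several parts are assumptions, not details taken from the abstract: the assignment of individual residues to the acidic, basic, polar and non-polar groups (glycine in particular is placed differently by different authors), the axis direction given to each group, and the names `GROUPS`, `STEPS`, `walk_2d` and `pg_score`. The Markov-model calculation of the coupling numbers zeta(k) is not described in enough detail here to reproduce, so the sketch takes zeta(1) and zeta(3) as given inputs.

```python
from typing import Dict, List, Tuple

# ASSUMPTION: one common grouping of the 20 standard amino acids.
# The abstract names the four groups but not their membership.
GROUPS: Dict[str, str] = {
    "acidic": "DE",
    "basic": "KRH",
    "polar": "STNQCYG",
    "nonpolar": "AVLIMFWP",
}

# ASSUMPTION: axis orientation chosen for illustration only.
# The abstract notes that other orientations give different results.
STEPS: Dict[str, Tuple[int, int]] = {
    "acidic": (0, 1),     # +y
    "basic": (0, -1),     # -y
    "polar": (-1, 0),     # -x
    "nonpolar": (1, 0),   # +x
}

# Lookup table from one-letter residue code to its unit step.
RESIDUE_STEP: Dict[str, Tuple[int, int]] = {
    aa: STEPS[group] for group, residues in GROUPS.items() for aa in residues
}


def walk_2d(sequence: str) -> List[Tuple[int, int]]:
    """Return the 2D lattice path of a protein sequence, starting at the origin.

    Each residue moves the walker one unit along the axis direction
    assigned to its group. Unknown residue codes raise KeyError.
    """
    x, y = 0, 0
    path = [(x, y)]
    for aa in sequence.upper():
        dx, dy = RESIDUE_STEP[aa]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


def pg_score(zeta1: float, zeta3: float) -> float:
    """Evaluate the discriminant function reported in the abstract.

    The coefficients come from the abstract; how zeta1 and zeta3 are
    computed (the Markov model) is not reproduced here.
    """
    return 5.36 * zeta1 - 3.98 * zeta3 - 42.21


if __name__ == "__main__":
    print(walk_2d("DKSA"))
    print(pg_score(10.0, 2.0))
```

In a typical two-class discriminant analysis a positive score would indicate the PG class, but the abstract does not state the decision threshold, so that reading is also an assumption.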

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Amino Acid Sequence
  • Image Processing, Computer-Assisted
  • Molecular Sequence Data
  • Plant Proteins / genetics*
  • Psidium / enzymology
  • Psidium / genetics*
  • Sequence Analysis, DNA*
  • Software*

Substances

  • Plant Proteins

Associated data

  • GENBANK/AY908988