Brief communication
The challenge of annotating protein sequences: The tale of eight domains of unknown function in Pfam
Introduction
Whole genome sequencing projects have yielded sequences of a large number of proteins for which no direct experimental information is available. Many methods have been used for the structural and functional annotation of such proteins (Watson et al., 2005, Lee et al., 2007). The most common approach to function prediction is ‘inheritance through homology’ (Lee et al., 2007). The Pfam database (Finn et al., 2006) is an important tool in homology detection, since it provides a collection of curated protein families with functional annotation. Pfam has now become one of the mainstays of modern genome annotation, having been used to characterize a large number of genomes including the human genome (Lander et al., 2001, Venter et al., 2001).
However, not all protein families in Pfam are well characterized. Specifically, Pfam contains a subset of families, known as domains of unknown function (DUFs), which form a significant fraction of the Pfam database (Bateman et al., 2004). Functional annotation of these domains is required for the complete characterization of the protein family space, and will aid in the annotation of whole genome sequences.
A recently developed technique, CSSM-BLAST (Goonesekere and Lee, 2008), was successful in relating proteins from many such DUFs to proteins of known structure and function within the midnight zone of sequence identity (9–20%). Detection of a remote homolog in this manner annotates each DUF with a structure and with membership in a superfamily of homologs (Murzin et al., 1995, Orengo et al., 1997). However, homology does not necessarily imply conservation of function (Dessailly et al., 2009, Conant and Wolfe, 2008). In the case of enzymes, homologous proteins belonging to the same superfamily (<50% sequence identity) (Gerlt and Babbitt, 1998) may catalyze different overall reactions while retaining a common mechanistic strategy (Glasner et al., 2006), or may have completely lost enzymatic function (Todd et al., 2002b). Closer scrutiny is therefore required to probe functionality beyond membership in a superfamily (Babbitt et al., 1995).
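To make the identity ranges above concrete, the following is a minimal sketch of how percent identity might be computed from a pairwise alignment and compared with the 9–20% midnight-zone range quoted in the text. The function names, the gap character, and the choice of denominator (columns where both sequences have a residue) are assumptions made for illustration; other conventions for the denominator exist, and this is not the procedure used by CSSM-BLAST.

```python
MIDNIGHT_ZONE = (9.0, 20.0)  # percent identity range quoted in the text


def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over columns where both sequences have a residue.

    Assumption: the two strings are rows of the same pairwise alignment,
    so they have equal length and use `gap` for gap positions.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns with a gap in either sequence
        aligned += 1
        if a.upper() == b.upper():
            matches += 1
    return 100.0 * matches / aligned if aligned else 0.0


def in_midnight_zone(identity: float) -> bool:
    """True if the identity falls inside the 9-20% range (inclusive)."""
    low, high = MIDNIGHT_ZONE
    return low <= identity <= high
```

For example, `percent_identity("ACDE", "ACDF")` counts three matches over four aligned columns. A pair inside the midnight zone would typically need more than raw identity to support a homology claim, which is the point the paragraph makes.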
We have used a combination of sequence analysis and model building to provide functional annotation to domains from eight Pfam DUFs. Our study also links several DUFs to proteins that have previously been identified as therapeutic targets against pathogens.
Materials and methods
The program CSSM-BLAST (Goonesekere and Lee, 2008) was used to create a profile for each protein chain in the RCSB Protein Data Bank (Berman et al., 2000) as follows: the environment polarity of each residue in each chain of 39,952 PDB files obtained from RCSB was calculated using the program SHEBA (Jung and Lee, 2000). All single chains, and all chains with more than 50 amino acids in multi-chain files, were used as input to the CSSM-BLAST program to generate position specific substitution matrix
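The chain selection rule stated above (every chain from single-chain files, plus chains longer than 50 residues from multi-chain files) can be sketched as follows. This is an illustrative sketch only: the input layout (a dictionary mapping an entry identifier to its chains and sequences) and all names are hypothetical, and the code does not call SHEBA or CSSM-BLAST.

```python
from typing import Dict, List, Tuple

MIN_MULTI_CHAIN_LENGTH = 50  # multi-chain files: keep chains longer than this


def select_chains(entries: Dict[str, Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return (entry_id, chain_id) pairs selected as profile inputs.

    `entries` is a hypothetical structure: entry id -> {chain id: sequence}.
    """
    selected = []
    for entry_id, chains in entries.items():
        is_single_chain = len(chains) == 1
        for chain_id, sequence in chains.items():
            # Single chains are always kept; chains from multi-chain
            # files are kept only if they exceed the length threshold.
            if is_single_chain or len(sequence) > MIN_MULTI_CHAIN_LENGTH:
                selected.append((entry_id, chain_id))
    return selected
```

Note that the comparison is strict, so a chain of exactly 50 residues in a multi-chain file is excluded, matching the wording "more than 50 amino acids".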
Results and discussion
PDB homologs for proteins from eight Pfam domains of unknown function are given in Table 1. These assignments were made using highly significant CSSM-BLAST E-values (E-value < 10⁻²⁵) and Sov scores (Sov score > 50%; Geourjon et al., 2001) (Table 1). Results from two other homology detection programs, I-TASSER (Zhang, 2008) and HHPred (Soding et al., 2005), are also given in Table 1. The homologous relationships detected by CSSM-BLAST were further investigated to determine the functional
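The two acceptance criteria quoted above (E-value below 10⁻²⁵ and Sov score above 50%) amount to a simple conjunction of thresholds. The sketch below shows one way such a filter might look; the `Hit` record, its field names, and the identifiers in the usage example are hypothetical placeholders, not output from CSSM-BLAST or values from Table 1.

```python
from dataclasses import dataclass
from typing import Iterable, List

E_VALUE_CUTOFF = 1e-25  # threshold quoted in the text
SOV_CUTOFF = 50.0       # percent, threshold quoted in the text


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one DUF-to-PDB match."""
    duf: str
    pdb_chain: str
    e_value: float
    sov: float  # Sov score in percent


def is_confident(hit: Hit) -> bool:
    """A hit passes only if both thresholds are met (strict inequalities)."""
    return hit.e_value < E_VALUE_CUTOFF and hit.sov > SOV_CUTOFF


def confident_hits(hits: Iterable[Hit]) -> List[Hit]:
    """Keep the hits that satisfy both criteria, preserving input order."""
    return [hit for hit in hits if is_confident(hit)]
```

As a usage illustration with made-up values, `is_confident(Hit("DUF_X", "chainA", 1e-30, 62.0))` passes both tests, whereas a hit with the same E-value but a Sov score of 45.0 is rejected. Requiring both conditions reflects the text's use of Sov as a check on alignment quality in addition to statistical significance.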
Conclusions
Inferring function often relies on the identification of conserved residues, which in turn depends on accurate alignment of sequences. This is non-trivial even when structures are known (Kim and Lee, 2007) and becomes a challenge at low sequence identities. We find that the use of Sov scores (Geourjon et al., 2001) not only helps establish homology, but can also serve as a check for alignment quality. Detecting functionally equivalent residues that are unconserved with respect to
Acknowledgement
We acknowledge the thoughtful comments of one reviewer, which resulted in improvements to the manuscript (Table 2). K.S. and K.O. were supported by SURP grants from UNI.
References (45)
X-ray crystal structure of Staphylococcus aureus FemA. Structure (2002)
Exploiting structural classifications for function prediction: towards a domain grammar for protein function. Curr. Opin. Struct. Biol. (2009)
Structural basis for the mechanism of Ca(2+) activation of the di-heme cytochrome c peroxidase from Pseudomonas nautica 617. Structure (2004)
Modeller: generation and refinement of homology-based protein structure models. Methods Enzymol. (2003)
Mechanistically diverse enzyme superfamilies: the importance of chemistry in the evolution of catalysis. Curr. Opin. Chem. Biol. (1998)
Evolution of enzyme superfamilies. Curr. Opin. Chem. Biol. (2006)
Conjugative plasmid protein TrwB, an integral membrane type IV secretion system coupling protein. Detailed structural features and mapping of the active site cleft. J. Biol. Chem. (2002)
SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995)
The crystal structure of pyroglutamyl peptidase I from Bacillus amyloliquefaciens reveals a new structure for a cysteine protease. Structure (1999)
The ATPase activity of the DNA transporter TrwB is modulated by protein TrwA: implications for a common assembly mechanism of DNA translocating motors. J. Biol. Chem. (2007)
Plasticity of enzyme active sites. Trends Biochem. Sci.
Sequence and structural differences between enzyme and nonenzyme homologs. Structure
Predicting protein function from sequence and structural data. Curr. Opin. Struct. Biol.
A functionally diverse enzyme superfamily that abstracts the alpha protons of carboxylic acids. Science
Evolutionary lines of cysteine peptidases. Biol. Chem.
The Pfam protein families database. Nucleic Acids Res.
FemA, a host-mediated factor essential for methicillin resistance in Staphylococcus aureus: molecular cloning and characterization. Mol. Gen. Genet.
The Protein Data Bank. Nucleic Acids Res.
Protein structure prediction servers at University College London. Nucleic Acids Res.
Turning a hobby into a job: how duplicated genes find new functions. Nat. Rev. Genet.
The PyMol Molecular Graphics System
MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res.