Brief communication
The challenge of annotating protein sequences: The tale of eight domains of unknown function in Pfam

https://doi.org/10.1016/j.compbiolchem.2010.04.001

Abstract

The Pfam database is an important tool in genome annotation, since it provides a collection of curated protein families. However, a subset of these families, known as domains of unknown function (DUFs), remains poorly characterized. We have related sequences from DUF404, DUF407, DUF482, DUF608, DUF810, DUF853, DUF976 and DUF1111 to homologs in PDB, within the midnight zone (9–20%) of sequence identity. These relationships were extended to provide functional annotation by sequence analysis and model building. Also described are examples of residue plasticity within enzyme active sites, and change of function within homologous sequences of a DUF.

Introduction

Whole genome sequencing projects have yielded sequences of a large number of proteins for which no direct experimental information is available. Many methods have been used for the structural and functional annotation of such proteins (Watson et al., 2005, Lee et al., 2007). The most common approach to function prediction is ‘inheritance through homology’ (Lee et al., 2007). The Pfam database (Finn et al., 2006) is an important tool in homology detection, since it provides a collection of curated protein families with functional annotation. Pfam has now become one of the mainstays of modern genome annotation, having been used to characterize a large number of genomes including the human genome (Lander et al., 2001, Venter et al., 2001).

However, not all protein families in Pfam are well characterized. Specifically, Pfam contains a subset of families, known as domains of unknown function (DUFs), which form a significant fraction of the Pfam database (Bateman et al., 2004). Providing a functional annotation for such domains is required for the complete characterization of the protein family space, and will aid in the annotation of whole genome sequences.

A recently developed technique, CSSM-BLAST (Goonesekere and Lee, 2008), was successful in relating proteins from many such DUFs to proteins of known structure and function within the midnight zone of sequence identity (9–20%). Detecting a remote homolog in this manner annotates each DUF with a structure and with membership in a superfamily of homologs (Murzin et al., 1995, Orengo et al., 1997). However, homology does not necessarily imply conservation of function (Dessailly et al., 2009, Conant and Wolfe, 2008). In the case of enzymes, homologous proteins belonging to the same superfamily (<50% sequence identity) (Gerlt and Babbitt, 1998) may catalyze different overall reactions while retaining a common mechanistic strategy (Glasner et al., 2006), or may have completely lost enzymatic function (Todd et al., 2002b). Closer scrutiny is therefore required to probe functionality beyond membership in a superfamily (Babbitt et al., 1995).
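To make the 9–20% "midnight zone" concrete, the following is a minimal sketch of how percent identity could be computed from a pairwise alignment and checked against that range. It is an illustration only, not the procedure used by CSSM-BLAST: the function names `percent_identity` and `in_midnight_zone` are hypothetical, and the choice to count only columns without gaps in the denominator is an assumption (other definitions of percent identity exist and give different values).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the gap-free columns of a pairwise alignment.

    Assumption: both inputs are already aligned, have equal length,
    and use '-' as the gap character.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences carry a residue.
    pairs = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a != "-" and b != "-"
    ]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def in_midnight_zone(identity: float, low: float = 9.0, high: float = 20.0) -> bool:
    """True if the identity lies in the 9-20% range quoted in the text."""
    return low <= identity <= high


if __name__ == "__main__":
    # Toy alignment: 1 identical residue out of 10 aligned columns.
    identity = percent_identity("ACDEFGHIKL", "AWWWWWWWWW")
    print(f"{identity:.1f}% identity, midnight zone: {in_midnight_zone(identity)}")
```

At such low identities the value depends strongly on the alignment itself, which is why the text treats a remote hit as a starting point for closer functional scrutiny rather than as an annotation in its own right.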

We have used a combination of sequence analysis and model building to provide functional annotation to domains from eight Pfam DUFs. Our study also links several DUFs to proteins that have previously been identified as therapeutic targets against pathogens.

Section snippets

Materials and methods

The program CSSM-BLAST (Goonesekere and Lee, 2008) was used to create a profile for each protein chain in the RCSB Protein Data Bank (Berman et al., 2000) as follows. The environment polarity of each residue in each chain of 39,952 PDB files obtained from RCSB was calculated using the program SHEBA (Jung and Lee, 2000). All single chains, and all chains with more than 50 amino acids in multi-chain files, were used as input to the CSSM-BLAST program to generate position specific substitution matrix

Results and discussion

PDB homologs for proteins from eight Pfam domains of unknown function are given in Table 1. These assignments have been made using the highly significant CSSM-BLAST E-values (E-value < 10^−25) and Sov scores (Geourjon et al., 2001) (Sov score > 50%) (Table 1). Results from two other homology detection programs, I-TASSER (Zhang, 2008) and HHPred (Soding et al., 2005), are also given in Table 1. The homologous relationships detected by CSSM-BLAST were further investigated to determine the functional

Conclusions

Inferring function often relies on the identification of conserved residues, which in turn is dependent on accurate alignment of sequences. This is non-trivial even when structures are known (Kim and Lee, 2007) and becomes a challenge at low sequence identities. We find that the use of Sov scores (Geourjon et al., 2001) not only helps establish homology, but can also serve as a check for alignment quality. Detecting functionally equivalent residues that are unconserved with respect to

Acknowledgement

We acknowledge the thoughtful comments of one reviewer, which resulted in improvements to the manuscript (Table 2). K.S. and K.O. were supported by SURP grants from UNI.

References (45)

  • A.E. Todd et al. Plasticity of enzyme active sites. Trends Biochem. Sci. (2002)
  • A.E. Todd et al. Sequence and structural differences between enzyme and nonenzyme homologs. Structure (2002)
  • J.D. Watson et al. Predicting protein function from sequence and structural data. Curr. Opin. Struct. Biol. (2005)
  • P.C. Babbitt. A functionally diverse enzyme superfamily that abstracts the alpha protons of carboxylic acids. Science (1995)
  • A.J. Barrett et al. Evolutionary lines of cysteine peptidases. Biol. Chem. (2001)
  • A. Bateman. The Pfam protein families database. Nucleic Acids Res. (2004)
  • B. Berger-Bachi. FemA, a host-mediated factor essential for methicillin resistance in Staphylococcus aureus: molecular cloning and characterization. Mol. Gen. Genet. (1989)
  • H.M. Berman. The Protein Data Bank. Nucleic Acids Res. (2000)
  • K. Bryson. Protein structure prediction servers at University College London. Nucleic Acids Res. (2005)
  • G.C. Conant et al. Turning a hobby into a job: how duplicated genes find new functions. Nat. Rev. Genet. (2008)
  • W.L. DeLano. The PyMol Molecular Graphics System (2002)
  • R.C. Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. (2004)