Automatic annotation of eukaryotic genes, pseudogenes and promoters

Victor Solovyev; Peter Kosarev; Igor Seledsov; Denis Vorobyev

doi:10.1186/gb-2006-7-s1-s10

Genome Biol. 2006;7 Suppl 1(Suppl 1):S10.1-12. doi: 10.1186/gb-2006-7-s1-s10. Epub 2006 Aug 7.

Authors

Victor Solovyev¹, Peter Kosarev, Igor Seledsov, Denis Vorobyev

Affiliation

¹ Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK. victor@cs.rhul.ac.uk

Abstract

Background: The ENCODE gene prediction workshop (EGASP) was organized to evaluate how well state-of-the-art automatic gene-finding methods reproduce the manual and experimental gene annotation of the human genome. We used Softberry gene-finding software to predict genes, pseudogenes and promoters in 44 selected ENCODE sequences representing approximately 1% (30 Mb) of the human genome. Predictions of gene-finding programs were evaluated in terms of their ability to reproduce the ENCODE-HAVANA annotation.

Results: The Fgenesh++ gene prediction pipeline can identify 91% of coding nucleotides with a specificity of 90%. Our automatic pseudogene finder (the PSF program) found 90% of the manually annotated pseudogenes and some new ones. The Fprom promoter prediction program identifies 80% of TATA promoter sequences with one false positive prediction per 2,000 base pairs (bp) and 50% of TATA-less promoters with one false positive prediction per 650 bp. It can be used to identify transcription start sites upstream of the annotated coding parts of genes found by gene prediction software.
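The abstract does not define its nucleotide-level accuracy measures. The sketch below assumes the convention common in gene-finding evaluations, in which sensitivity is TP / (TP + FN) and "specificity" is TP / (TP + FP), i.e. the fraction of predicted coding nucleotides that are annotated as coding. The function names and the toy coordinates are hypothetical and are not taken from the paper or from Softberry software.

```python
def intervals_to_positions(intervals):
    """Expand inclusive (start, end) coding intervals into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the nucleotide level.

    Assumption: "specificity" follows the gene-finding convention
    TP / (TP + FP), not the TN-based statistical definition.
    """
    ann = intervals_to_positions(annotated)
    pred = intervals_to_positions(predicted)
    tp = len(ann & pred)   # coding in both annotation and prediction
    fn = len(ann - pred)   # annotated coding nucleotides that were missed
    fp = len(pred - ann)   # predicted coding nucleotides not in the annotation
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, specificity


# Toy example: one annotated exon and one prediction shifted by 10 bp.
sn, sp = nucleotide_accuracy([(1, 100)], [(11, 110)])
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

In the toy example, 90 of the 100 annotated nucleotides are recovered and 90 of the 100 predicted nucleotides are correct, so both measures equal 0.90.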

Conclusion: We review our software and underlying methods for identifying these three important structural and functional genome components and discuss the accuracy of predictions, recent advances and open problems in annotating genomic sequences. We have demonstrated that our methods can be used effectively for initial automatic annotation of eukaryotic genomes.

Publication types

Evaluation Study

MeSH terms

Animals
Base Sequence
Chromosome Mapping
Computational Biology / methods*
Genes*
Genomics / methods*
Humans
Molecular Sequence Data
Promoter Regions, Genetic*
Pseudogenes*
Sequence Analysis, DNA
Sequence Analysis, Protein
Sequence Analysis, RNA
Software