Revisiting the protein-coding gene catalog of Drosophila melanogaster using 12 fly genomes

  1. Michael F. Lin (1),
  2. Joseph W. Carlson (2),
  3. Madeline A. Crosby (3),
  4. Beverley B. Matthews (3),
  5. Charles Yu (2),
  6. Soo Park (2),
  7. Kenneth H. Wan (2),
  8. Andrew J. Schroeder (3),
  9. L. Sian Gramates (3),
  10. Susan E. St. Pierre (3),
  11. Margaret Roark (3),
  12. Kenneth L. Wiley, Jr. (4),
  13. Rob J. Kulathinal (3),
  14. Peili Zhang (3),
  15. Kyl V. Myrick (4),
  16. Jerry V. Antone (4),
  17. Susan E. Celniker (2),
  18. William M. Gelbart (3,4), and
  19. Manolis Kellis (1,5,6)
  1. Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02139, USA;
  2. Berkeley Drosophila Genome Project, Department of Genome Biology, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA;
  3. FlyBase, The Biological Laboratories, Harvard University, Cambridge, Massachusetts 02138, USA;
  4. Department of Molecular and Cellular Biology, Harvard University, Cambridge, Massachusetts 02138, USA;
  5. MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts 02139, USA

Abstract

The availability of sequenced genomes from 12 Drosophila species has enabled the use of comparative genomics for the systematic discovery of functional elements conserved within this genus. We have developed quantitative metrics for the evolutionary signatures specific to protein-coding regions and applied them genome-wide, resulting in 1193 candidate new protein-coding exons in the D. melanogaster genome. We have reviewed these predictions by manual curation and validated a subset by directed cDNA screening and sequencing, revealing both new genes and new alternative splice forms of known genes. We also used these evolutionary signatures to evaluate existing gene annotations, resulting in the validation of 87% of genes lacking descriptive names and identifying 414 poorly conserved genes that are likely to be spurious predictions, noncoding, or species-specific genes. Furthermore, our methods suggest a variety of refinements to hundreds of existing gene models, such as modifications to translation start codons and exon splice boundaries. Finally, we performed directed genome-wide searches for unusual protein-coding structures, discovering 149 possible examples of stop codon readthrough, 125 new candidate ORFs of polycistronic mRNAs, and several candidate translational frameshifts. These results affect >10% of annotated fly genes and demonstrate the power of comparative genomics to enhance our understanding of genome organization, even in a model organism as intensively studied as Drosophila melanogaster.

Footnotes

  • 6 Corresponding author.

    6 E-mail manoli{at}mit.edu; fax (617) 262-6121.

  • [Supplemental material is available online at www.genome.org. Additional supplemental materials are available online at http://compbio.mit.edu/fly/genes/. Full-length cDNA sequence data from this study have been submitted to GenBank under accession nos. BT029554–BT029635, BT029637–BT029727, BT029940–BT029957, BT030133–BT030144, BT030416–BT030421, and BT030448–BT030452. RT-PCR amplicon and primer sequence data have been submitted to GenBank under accession nos. ES439769–ES439782.]

  • Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.6679507

    • Received May 7, 2007.
    • Accepted September 21, 2007.
  • Freely available online through the Genome Research Open Access option.
