The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes

  1. Kim D. Pruitt1,9,
  2. Jennifer Harrow2,
  3. Rachel A. Harte3,
  4. Craig Wallin1,
  5. Mark Diekhans3,
  6. Donna R. Maglott1,
  7. Steve Searle2,
  8. Catherine M. Farrell1,
  9. Jane E. Loveland2,
  10. Barbara J. Ruef4,
  11. Elizabeth Hart2,
  12. Marie-Marthe Suner2,
  13. Melissa J. Landrum1,
  14. Bronwen Aken2,
  15. Sarah Ayling5,
  16. Robert Baertsch3,
  17. Julio Fernandez-Banet2,
  18. Joshua L. Cherry1,
  19. Val Curwen2,
  20. Michael DiCuccio1,
  21. Manolis Kellis6,7,
  22. Jennifer Lee1,
  23. Michael F. Lin6,7,
  24. Michael Schuster8,
  25. Andrew Shkeda1,
  26. Clara Amid4,
  27. Garth Brown1,
  28. Oksana Dukhanina1,
  29. Adam Frankish2,
  30. Jennifer Hart1,
  31. Bonnie L. Maidak1,
  32. Jonathan Mudge2,
  33. Michael R. Murphy1,
  34. Terence Murphy1,
  35. Jeena Rajan2,
  36. Bhanu Rajput1,
  37. Lillian D. Riddick1,
  38. Catherine Snow2,
  39. Charles Steward2,
  40. David Webb1,
  41. Janet A. Weber1,
  42. Laurens Wilming2,
  43. Wenyu Wu1,
  44. Ewan Birney8,
  45. David Haussler3,
  46. Tim Hubbard2,
  47. James Ostell1,
  48. Richard Durbin2 and
  49. David Lipman1
  1. National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland 20894, USA;
  2. Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom;
  3. Center for Biomolecular Science and Engineering, University of California, Santa Cruz, California 95064, USA;
  4. Zebrafish Information Network, University of Oregon, Eugene, Oregon 97403-5291, USA;
  5. The University of Manchester, Faculty of Life Sciences, Manchester Interdisciplinary Biocentre, Manchester M1 7DN, United Kingdom;
  6. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA;
  7. Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02141, USA;
  8. European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, United Kingdom

    Abstract

    Effective use of the human and mouse genomes requires reliable identification of genes and their products. Although multiple public resources provide annotation, they use different methods, which can result in similar but not identical representations of genes, transcripts, and proteins. The collaborative consensus coding sequence (CCDS) project tracks identical protein annotations on the reference mouse and human genomes with a stable identifier (CCDS ID), and ensures that they are consistently represented on the NCBI, Ensembl, and UCSC Genome Browsers. Importantly, the project coordinates manual review of protein annotations that are inconsistent between sites, as well as of annotations for which new evidence suggests a revision is needed, in order to converge progressively on a complete protein-coding set for the human and mouse reference genomes while maintaining a high standard of reliability and biological accuracy. To date, the project has identified 20,159 human and 17,707 mouse consensus coding regions from 17,052 human and 16,893 mouse genes. Three evaluation methods indicate that the entries in the CCDS set are highly likely to represent real proteins, more so than annotations from contributing groups that are not included in CCDS. The CCDS database thus centralizes the function of identifying well-supported, identically annotated protein-coding regions.
