PT  - JOURNAL ARTICLE
AU  - Ulrich Omasits
AU  - Adithi R. Varadarajan
AU  - Michael Schmid
AU  - Sandra Goetze
AU  - Damianos Melidis
AU  - Marc Bourqui
AU  - Olga Nikolayeva
AU  - Maxime Québatte
AU  - Andrea Patrignani
AU  - Christoph Dehio
AU  - Juerg E. Frey
AU  - Mark D. Robinson
AU  - Bernd Wollscheid
AU  - Christian H. Ahrens
TI  - An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics
AID  - 10.1101/153213
DP  - 2017 Jan 01
TA  - bioRxiv
PG  - 153213
4099  - http://biorxiv.org/content/early/2017/06/21/153213.short
4100  - http://biorxiv.org/content/early/2017/06/21/153213.full
AB  - Accurate annotation of all protein-coding sequences (CDSs) is an essential prerequisite to fully exploit the rapidly growing repertoire of completely sequenced prokaryotic genomes. However, large discrepancies among the number of CDSs annotated by different resources, missed functional short open reading frames (sORFs), and overprediction of spurious ORFs represent serious limitations. Our strategy towards accurate and complete genome annotation consolidates CDSs from multiple reference annotation resources, ab initio gene prediction algorithms and in silico ORFs in an integrated proteogenomics database (iPtgxDB) that covers the entire protein-coding potential of a prokaryotic genome. By extending the PeptideClassifier concept of unambiguous peptides for prokaryotes, close to 95% of the identifiable peptides imply one distinct protein, largely simplifying downstream analysis. Searching a comprehensive Bartonella henselae proteomics dataset against such an iPtgxDB allowed us to unambiguously identify novel ORFs uniquely predicted by each resource, including lipoproteins, differentially expressed and membrane-localized proteins, novel start sites and wrongly annotated pseudogenes. Most novelties were confirmed by targeted, parallel reaction monitoring mass spectrometry, including unique ORFs and variants identified in a re-sequenced laboratory strain that are not present in its reference genome. We demonstrate the general applicability of our strategy for genomes with varying GC content and distinct taxonomic origin, and release iPtgxDBs for B. henselae, Bradyrhizobium diazoefficiens and Escherichia coli as well as the software to generate such proteogenomics search databases for any prokaryote.