RT Journal Article
SR Electronic
T1 ProteomeGenerator: A framework for comprehensive proteomics based on de novo transcriptome assembly and high-accuracy peptide mass spectral matching
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 236844
DO 10.1101/236844
A1 Paolo Cifani
A1 Avantika Dhabaria
A1 Akihide Yoshimi
A1 Omar Abdel-Wahab
A1 John T. Poirier
A1 Alex Kentsis
YR 2017
UL http://biorxiv.org/content/early/2017/12/19/236844.abstract
AB Modern mass spectrometry now permits genome-scale and quantitative measurements of biological proteomes. However, analyses of specific specimens are currently hindered by the incomplete representation of biological variability of protein sequences in canonical reference proteomes, and the technical demands for their construction. Here, we report ProteomeGenerator, a framework for de novo and reference-assisted proteogenomic database construction and analysis based on sample-specific transcriptome sequencing and high-resolution and high-accuracy mass spectrometry proteomics. This enables assembly of proteomes encoded by actively transcribed genes, including sample-specific protein isoforms resulting from non-canonical mRNA transcription, splicing, or editing. To improve the accuracy of protein isoform identification in non-canonical proteomes, ProteomeGenerator relies on statistical target-decoy database matching augmented with spectral-match calibrated sample-specific controls. We applied this method for the proteogenomic discovery of splicing factor SRSF2-mutant leukemia cells, demonstrating high-confidence identification of non-canonical protein isoforms arising from alternative transcriptional start sites, intron retention, and cryptic exon splicing, as well as improved accuracy of genome-scale proteome discovery. Additionally, we report proteogenomic performance metrics for the current state-of-the-art implementations of SEQUEST HT, Proteome Discoverer, MaxQuant, Byonic, and PEAKS mass spectral analysis algorithms. Finally, ProteomeGenerator is implemented as a Snakemake workflow, enabling open, scalable, and facile discovery of sample-specific, non-canonical and neomorphic biological proteomes (https://github.com/jtpoirier/proteomegenerator).