TY - JOUR
T1 - MetaNovo: a probabilistic approach to peptide and polymorphism discovery in complex mass spectrometry datasets
JF - bioRxiv
DO - 10.1101/605550
SP - 605550
AU - Matthys G Potgieter
AU - Andrew JM Nel
AU - David L. Tabb
AU - Suereta Fortuin
AU - Shaun Garnett
AU - Jerome M. Wendoh
AU - Jonathan M Blackburn
AU - Nicola J Mulder
Y1 - 2019/01/01
UR - http://biorxiv.org/content/early/2019/07/11/605550.abstract
N2 - The characterization of complex mass spectrometry data obtained from metaproteomics or clinical studies presents unique challenges and potential insights in fields as diverse as the pathogenesis of human disease, the metabolic interactions of complex microbial ecosystems involved in agriculture, and climate change. Previous approaches essentially rely on prior expectation or knowledge of likely sample composition in order to construct focused search libraries, but this is potentially limiting in many cases. Here we present a novel software pipeline to directly estimate the proteins and species present in complex mass spectrometry samples at the level of expressed proteomes, using de novo sequence tag matching and probabilistic optimization of very large sequence databases prior to target-decoy search. We validated our pipeline against the results obtained from the recently published MetaPro-IQ pipeline on 8 human mucosal-luminal interface samples, finding comparable numbers of peptide and protein identifications when searching relatively small databases. We then showed that, using an unbiased search of the entire release of UniProt (ca. 90 million protein sequences), MetaNovo identified a bacterial taxonomic distribution similar to that found using a small, focused matched metagenome database, while also identifying proteins in the samples that are derived from other organisms and are missed by 16S or shotgun sequencing and by previous metaproteomic methods.
Using MetaNovo to analyze a set of single-organism human neuroblastoma cell-line samples (SH-SY5Y) against UniProt, we achieved an MS/MS identification rate during target-decoy search comparable to that obtained using the UniProt human Reference proteome, with 22583 (85.99 %) of the total set of identified peptides shared between the two searches. Taxonomic analysis of 612 peptides not found in the canonical set of human proteins yielded 158 peptides unique to the Chordata phylum as potential human variant identifications. Of these, 40 had previously been predicted and 9 identified using whole genome sequencing in a proteogenomic study of the same cell-line. Data are available via ProteomeXchange with identifier PXD014214. The MetaNovo software is available from GitHub and can be run as a standalone Singularity or Docker container available from the Docker Hub.
ER -