RT Journal Article
SR Electronic
T1 MetaNovo: a probabilistic approach to peptide and polymorphism discovery in complex mass spectrometry datasets
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 605550
DO 10.1101/605550
A1 Thys Potgieter
A1 Andrew JM Nel
A1 David L. Tabb
A1 Suereta Fortuin
A1 Shaun Garnett
A1 Jonathan Blackburn
A1 Nicola Mulder
YR 2019
UL http://biorxiv.org/content/early/2019/04/11/605550.abstract
AB The characterization of complex mass spectrometry data obtained from metaproteomics or clinical studies presents unique challenges and potential insights in fields as diverse as the pathogenesis of human disease, the metabolic interactions of complex microbial ecosystems involved in agriculture, or climate change. However, accurate peptide identification requires representative sequence databases, which typically rely on prior knowledge or matched genome sequencing, and are often error-prone. We present a novel software pipeline to directly estimate the proteins and species present in complex mass spectrometry samples at the level of expressed proteomes, using de novo sequence tag matching and probabilistic optimization of very large sequence databases prior to target-decoy search. We validated our pipeline against the results obtained from the recently published MetaPro-IQ pipeline on 8 human mucosal-luminal interface samples, obtaining comparable numbers of peptide and protein identifications as well as novel identifications. We showed that, using the entire release of UniProt, we were able to identify a taxonomic distribution similar to that obtained with a matched metagenome database, with improved identifications of certain taxa. Using MetaNovo to analyze a set of single-organism human neuroblastoma cell line samples (SH-SY5Y) against UniProt, we achieved an MS/MS identification rate during target-decoy search comparable to that obtained using the UniProt human Reference proteome, with 22583 (85.99%) of the total set of identified peptides shared between the two searches.
Taxonomic analysis of 612 peptides not found in the canonical set of human proteins yielded 158 peptides unique to the Chordata phylum as potential human variant identifications. Of these, 40 had previously been predicted and 9 identified using whole-genome sequencing in a proteogenomic study of the same cell line. The MetaNovo software is available from GitHub or can be run as a standalone Docker container available from Docker Hub.