Phoenix Enhancer: proteomics data mining using clustered spectra

Motivation: Spectrum clustering has been used to enhance proteomics data analysis: by using clusters of identified spectra, some originally unidentified spectra can potentially be identified, and individual peptides can be evaluated to find potential misidentifications. The Phoenix Enhancer provides an infrastructure to analyze tandem mass spectra and the corresponding peptides in the context of previously identified public data. Based on PRIDE Cluster data and a newly developed pipeline, four functionalities are provided: i) evaluate the original peptide identifications in an individual dataset, to find low confidence peptide spectrum matches (PSMs) which could correspond to misidentifications; ii) provide confidence scores for all originally identified PSMs, to help users evaluate their quality (complementary to getting a global false discovery rate); iii) identify potential new PSMs for originally unidentified spectra; and iv) provide a collection of browsing and visualization tools to analyze and export the results. In addition to the web-based service, the code is open-source and easy to re-deploy on local computers using Docker containers.
Availability: The Phoenix Enhancer service is available at http://enhancer.ncpsb.org. All source code is freely available on GitHub (https://github.com/phoenix-cluster/) and can be deployed on Cloud and HPC architectures.


Introduction
Mass spectrometry (MS) has become the main technology in proteomics research and, consequently, proteomics data is growing rapidly. With more and more researchers sharing their data through public data repositories like PRIDE Archive (Perez-Riverol, et al., 2019) and iProX (Ma, et al., 2019), proteomics has become a "big data" field. As of 1 September 2019, PRIDE Archive hosts 13,389 datasets, representing 94,757 assays for acquiring proteomics data.
Unidentified spectra in public datasets ("dark matter") account on average for 75% of the spectra measured in each MS experiment (Griss, et al., 2016; Perez-Riverol, et al., 2018). The main reason behind the low number of identified spectra is that, during the peptide identification step (Vaudel, et al., 2014), many derived peptides are either not present in the sequence database (e.g. sequence variants, or incomplete genome sequences) or contain unexpected PTMs. In 2016, by clustering all PRIDE Archive data, we were able to identify almost 20% of previously unidentified spectra (Griss, et al., 2016).
In this manuscript, we present the Phoenix Enhancer, a platform that enables researchers to compare their identified and unidentified spectra against previously published datasets in PRIDE Cluster. Four functionalities are provided: i) evaluate the original peptide identifications in an individual dataset, to find low confidence peptide spectrum matches (PSMs) which could correspond to misidentifications; ii) provide confidence scores for originally identified PSMs, to help users evaluate their quality (complementary to getting a global false discovery rate); iii) identify potential new PSMs for originally unidentified spectra; and iv) provide a collection of browsing and visualization tools to analyze and export the results.

M. Bai et al. bioRxiv preprint, doi: https://doi.org/10.1101/846303. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.

Design and Implementation
Phoenix Enhancer is designed to enhance proteomics identifications by evaluating the quality of original PSMs, or finding new potential PSMs, through searching the query spectra against a library of clustered spectra (Figure 1). It provides three core components: the front-end web interface, a RESTful API and the data analysis pipeline. Users can upload query files, which include MS/MS spectra with/without identifications, and set the search parameters (Supplementary information, Note 1). The pipeline then searches the query spectra against the spectral clusters, scores the previous PSMs and the newly recommended PSMs, and finally writes the scored PSMs into the MySQL database. The RESTful API or the Phoenix Enhancer web interface can be used to retrieve or browse the final results. The analysis pipeline produces three major reports: i) potentially incorrect identifications, each with a confidence score and a newly suggested sequence if possible; ii) newly identified PSMs for previously unidentified spectra; iii) previous identifications that were confirmed with a high confidence score.
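The split of scored PSMs into the three major reports can be illustrated with a minimal sketch. It is not the authors' implementation: the `ScoredPSM` fields, the report names and the `CONFIDENCE_CUTOFF` value are all hypothetical, chosen only to make the three categories concrete.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoff; the source does not state the value used by the service.
CONFIDENCE_CUTOFF = 0.5


@dataclass
class ScoredPSM:
    spectrum_id: str
    previous_peptide: Optional[str]     # None for an originally unidentified spectrum
    recommended_peptide: Optional[str]  # peptide suggested from the matched cluster, if any
    confidence: float                   # score of the previous peptide if present, else of the recommendation


def assign_report(psm: ScoredPSM, cutoff: float = CONFIDENCE_CUTOFF) -> str:
    """Return the (hypothetical) name of the report a scored PSM falls into."""
    if psm.previous_peptide is None:
        # Originally unidentified spectrum: reportable only if a peptide was suggested.
        return "new_identification" if psm.recommended_peptide else "unmatched"
    if psm.confidence >= cutoff:
        # The previous identification is supported by the matched cluster.
        return "high_confidence_previous"
    # Low score for the previous peptide: flag it, a suggestion may be attached.
    return "potential_incorrect"
```

For example, `assign_report(ScoredPSM("s1", None, "PEPTIDEK", 0.9))` places an originally unidentified spectrum with a suggested peptide in the new-identification report.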

Data analysis pipeline
When a file containing the MS/MS spectra with/without identifications is uploaded, the Phoenix Enhancer pipeline converts the files to mzML. MS/MS spectra are searched against the PRIDE Cluster archives using SpectraST (Lam, et al., 2007) in "archive searching mode" (Supplementary information, Note 2). After picking the matches whose scores are higher than or equal to SpectraST's default threshold, Phoenix Enhancer calculates the confidence scores for the 3 types of matched spectra/PSMs: previous PSMs get confidence scores for their previously assigned peptides; unidentified spectra or those in low confidence PSMs get new recommended peptides with confidence scores, corresponding to the peptide sequence which has the highest ratio within the cluster (called the major peptide sequence).
Confidence scores can be used to assess the quality of the PSMs and to help users find novel PSMs. The details of the confidence score calculation are in Supplementary information, Note 2.
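To make the notion of a major peptide sequence concrete, the sketch below picks the most frequent peptide among the identified spectra of a cluster and reports its ratio. It covers only this selection step; the confidence score itself is defined in Supplementary information, Note 2 and is not reproduced here. The function name and the input layout (one peptide string per identified spectrum in the cluster) are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple


def major_peptide(cluster_peptides: List[str]) -> Tuple[str, float]:
    """Return the most frequent peptide in a cluster and its ratio.

    `cluster_peptides` holds one peptide sequence per identified spectrum
    in the cluster (a hypothetical input layout, assumed for this sketch).
    """
    if not cluster_peptides:
        raise ValueError("cluster contains no identified spectra")
    counts = Counter(cluster_peptides)
    peptide, n = counts.most_common(1)[0]
    # Ratio = share of the cluster's identified spectra assigned to this peptide.
    return peptide, n / len(cluster_peptides)


# Illustrative cluster: three spectra agree, one carries a different sequence.
peptide, ratio = major_peptide(["PEPTIDEK", "PEPTIDEK", "PEPTLDEK", "PEPTIDEK"])
```

In this toy cluster, `PEPTIDEK` would be the peptide recommended for an unidentified query spectrum that matches the cluster.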

RESTful API and Web
The interactive web interface and the RESTful API allow users to: i) upload files for analysis and set analysis parameters; ii) show the resulting PSMs in tables and charts (Supplementary information, Note 3); iii) filter the results using species; iv) compare the query spectrum to the matched cluster's consensus spectrum (Figure 2); v) check the details of the matched cluster, including comparing the spectra inside a cluster to its consensus spectrum; vi) download analysis result files for further analyses.
In addition to the source code on GitHub, we provide our four components (web, web service, pipeline and MySQL server) as BioContainers images at Docker Hub (da Veiga Leprevost, et al., 2017).

Benchmark datasets
We tested Phoenix Enhancer's pipeline using 71 datasets from PRIDE Archive. 39 out of the 71 (called "inside datasets" below) are already included in PRIDE Cluster and are used to test the accuracy of the pipeline.
We found that 0.172% of the previously identified spectra in all 39 "inside datasets" are incorrectly matched by the pipeline, and for 36 out of the 39 "inside datasets" the incorrect match rate is less than 1%.
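The two kinds of figures reported here, a rate pooled over all datasets and a count of datasets below a per-dataset limit, can be computed as in the sketch below. It assumes the overall rate is pooled over spectra rather than averaged over datasets, which the text suggests but does not state. The function name and the counts are invented for illustration and are not the benchmark data.

```python
from typing import Dict, Tuple


def incorrect_match_summary(
    counts: Dict[str, Tuple[int, int]], max_rate: float = 0.01
) -> Tuple[float, Dict[str, float], int]:
    """Summarize incorrect matches.

    `counts` maps a dataset id to (identified_spectra, incorrectly_matched);
    every dataset is assumed to have at least one identified spectrum.
    """
    total_identified = sum(n for n, _ in counts.values())
    total_incorrect = sum(k for _, k in counts.values())
    # Rate pooled over all spectra of all datasets.
    pooled_rate = total_incorrect / total_identified
    # Rate computed separately for each dataset.
    per_dataset = {name: k / n for name, (n, k) in counts.items()}
    below_limit = sum(1 for rate in per_dataset.values() if rate < max_rate)
    return pooled_rate, per_dataset, below_limit


# Made-up counts, for illustration only (not taken from the benchmark).
example = {"A": (1000, 2), "B": (500, 10), "C": (2000, 1)}
pooled, rates, n_below = incorrect_match_summary(example)
```

A pooled rate weights large datasets more heavily, so it can stay low even when a few small datasets have a rate above the per-dataset limit.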

Conclusion
In summary, we believe that Phoenix Enhancer is a valuable tool to take advantage of spectral clustering results for deeper proteomics data analysis, especially in validating the confidence of specific peptide biomarkers, and in finding interesting new potential biomarkers in repository datasets. The proposed Phoenix Enhancer is easy to deploy and reuse in local HPC and Cloud environments. The framework enables smaller research efforts to utilize the data mining potential of repository scale spectral clusters, and it provides new visualization, inspection, and validation tools for the results of such data mining efforts.