IsoProt: A fully reproducible one-stop-shop for the analysis of iTRAQ/TMT data

Mass spectrometry coupled with isobaric labelling provides fast large-scale comparison of protein abundances over multiple conditions. To date, no one-stop-shop software solution exists that enables non-bioinformatics experts to carry out the full analysis of the acquired raw data with minimal intervention. Most pipelines for such analyses combine different software tools and in-house programs. In addition, differing and frequently updated versions of these tools, together with compatibility issues between apparently interoperable tools, make it very difficult to ensure reproducible data analysis in proteomics. We present a fully reproducible software protocol for the complete analysis of data from one of the most popular types of proteomics experiments. The protocol uses only open-source tools installed in a portable container environment and provides a user-friendly, interactive browser interface for configuring and executing the different operations. An example use case is provided that can be used for testing and for adaptation to the user's own data sets. This setup will yield identical results on any computer analysing isobarically labelled MS data.


Introduction
Isobaric labelling has become one of the most common methods for quantitative mass spectrometry-based proteomics experiments. A major advantage is that it allows researchers to multiplex samples and thereby reduce instrument runtime and eliminate variability caused by the mass spectrometer itself. The two methods currently available for these experiments, Tandem Mass Tag (TMT, Proteome Science) and Multiplexed Isobaric Tagging Technology for Relative Quantitation (iTRAQ (1)), essentially differ only in the reporter masses they generate and therefore do not require separate, method-specific software tools.
Even though isobaric labelling has become a standard method in many laboratories, no easy-to-use software solutions exist to analyse these data. This is particularly problematic when dealing with more complex experimental designs that include multiple runs on the mass spectrometer, such as multiple instances of differently labeled multiplexed samples.
As a result, many research groups rely on unpublished in-house scripts to process their experiments, which greatly hampers reproducibility. Even in the case of fully documented workflows, the use of different software versions, or even only different versions of the underlying software libraries, can dramatically influence the final results. Therefore, considerable effort is often required to replicate the computational processing of this common and widely used proteomics approach.
Workflow software suites and workflow managers, such as Proteome Discoverer (Thermo Scientific), OpenMS (2), and to a certain extent MaxQuant (3) with Perseus (4), allow the user to create a complete analysis workflow within a single software package. An obvious but seldom executed step to increase reproducibility is to save this analysis workflow and deposit it in a public repository such as PRIDE Archive (5) together with the original data. However, full together with its computational environment.
Furthermore, currently available workflow tools are rather complex to use or do not provide much flexibility in the statistical analysis of the data and in the supported experimental designs. For example, a standard OpenMS workflow to analyse isobarically labelled experiments can quickly grow to 15 different nodes, each with its own settings. Moreover, to replicate the analysis, users have to install or buy the respective software and again ensure that all software versions (such as Proteome Discoverer node versions, or external OpenMS tools) are identical.
In an effort to simplify proteomics data analysis and provide fully reproducible data analysis workflows, we launched the ProtProtocols project ( https://protprotocols.github.io ) under the umbrella of the European Bioinformatics Community (EuBIC). Based on the Biocontainers project (6), the protocols are shipped as containerized Docker images that include all necessary software tools. Docker containers are lightweight, isolated runtime environments, comparable in purpose to virtual machines, that encapsulate all the software required for the protocol to run. This ensures that the versions of all used software are tied to the protocol version and that the user does not have to install any separate tools. Hence, 100% reproducibility can be achieved by using the same protocol version on any computer with a Docker environment.
Here, we present IsoProt, a ProtProtocol designed for the analysis of isobarically labelled experiments. In addition to a user-friendly web interface, it provides accurate statistical analyses over a wide range of common experimental designs.

General implementation
All software was installed in a Docker image to ensure full reproducibility on each computer system supported by Docker. To simplify the installation and usage of our protocols we created the free, open-source "ProtProtocol docker-launcher" ( https://github.com/ProtProtocols/docker-launcher ). It provides an easy-to-use graphical user interface that can automatically install the protocol (once Docker is installed) and launch the image. As it is written in Java it supports the major operating systems Windows, Mac OSX, and Linux. Therefore, many technical difficulties surrounding the use of Docker are hidden from the user.
The complete protocol is run through a Jupyter notebook ( http://jupyter.org ) corresponding to one web page in the browser. All relevant parameters can be set through common graphical user elements created through Jupyter widgets. Therefore, the user interface is highly similar to most available search engines. All steps are documented to facilitate usage. The complete source code as well as additional documentation of the protocol is freely available through https://github.com/ProtProtocols/IsoProt .

Proteomics software
IsoProt handles the entire analysis pipeline from mass spectra given as peak lists to a set of proteins that is differentially regulated between the given experimental conditions (Figure 1A). We used SearchGUI (7) and PeptideShaker (8) to perform peptide identification and validation, with MS-GF+ (9) as database search engine. Protein summarization and quantification are handled by R scripts based on the MSnbase R library (10). R scripts furthermore generate figures for quality control and perform statistical tests (limma library, (11)) according to the experimental design.

Input files
The only files required for the analysis are mass spectra as peak lists (MGF format) and a FASTA file containing the protein sequences; we recommend the UniProt version of the FASTA format. Databases can already contain decoy sequences (following the SearchGUI instructions, http://compomics.github.io/projects/searchgui.html ); otherwise the decoy database is created automatically. The files can be copied into the Docker file structure or directly mirrored onto the /data folder, which is done automatically by our docker-launcher application.
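The automatic decoy step can be pictured with a minimal Python sketch. It assumes that decoys are built by reversing each target sequence and tagging the accession; the `_REVERSED` suffix and the helper names `parse_fasta` and `add_reversed_decoys` are illustrative assumptions, not the SearchGUI implementation.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into a {header: sequence} dictionary."""
    entries: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            entries[header] = ""
        elif header is not None:
            entries[header] += line
    return entries


def add_reversed_decoys(entries: Dict[str, str],
                        suffix: str = "_REVERSED") -> Dict[str, str]:
    """Return the targets plus one reversed decoy per target.

    The suffix is a hypothetical naming convention chosen for this sketch.
    """
    combined = dict(entries)
    for header, sequence in entries.items():
        accession, _, description = header.partition(" ")
        decoy_header = accession + suffix
        if description:
            decoy_header += " " + description
        combined[decoy_header] = sequence[::-1]
    return combined


fasta_text = ">sp|P1|TEST_MOUSE example\nMKTAY\nIAK\n"
database = add_reversed_decoys(parse_fasta(fasta_text))
```

Searching targets and decoys together is what later allows the false discovery rate of the identifications to be estimated.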

Analysis parameters
interface integrated into the Jupyter notebook. In the first section, the user has to set database search related parameters such as precursor and fragment ion tolerance, the FASTA sequence database to use, the labelling agent used, and the fixed and variable modifications to consider.
Based on the selected labelling method and the detected folder structure, the interface to enter the experimental design is generated. The protocol currently supports two setups: 1) all MGF files are placed in the input directory and are part of the same (fractionated) run, or 2) samples were spread over several runs on the mass spectrometer, in which case the MGF files of each run are placed in a separate subdirectory (Figure 1B-C). The experimental design user interface then allows the user to enter names for the sample groups (for example "treatment" and "control") and names for the samples (one name per channel and subdirectory), and to assign each sample to one of the groups. Most importantly, the protocol supports up to 20 sample groups and can thereby model complex experimental designs.
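To make the structure of such an experimental design concrete, the following Python sketch represents it as a plain table. The run names, channel labels, sample names, and the function `group_samples` are hypothetical and do not reflect how IsoProt stores the design internally; only the limit of 20 groups is taken from the text.

```python
from collections import defaultdict

# Hypothetical design: two runs stored in subdirectories "run1" and "run2".
# Each entry: (run subdirectory, reporter channel, sample name, sample group).
DESIGN = [
    ("run1", "126", "NI_1", "control"),
    ("run1", "127", "ECM_1", "treatment"),
    ("run2", "126", "NI_2", "control"),
    ("run2", "127", "ECM_2", "treatment"),
]

MAX_GROUPS = 20  # upper limit on sample groups stated in the text


def group_samples(design, max_groups=MAX_GROUPS):
    """Collect sample names per group and check basic consistency."""
    seen = set()
    groups = defaultdict(list)
    for run, channel, sample, group in design:
        # Every channel of a run may carry exactly one sample.
        if (run, channel) in seen:
            raise ValueError(f"channel {channel} in {run} assigned twice")
        seen.add((run, channel))
        groups[group].append(sample)
    if len(groups) > max_groups:
        raise ValueError("too many sample groups")
    return dict(groups)


groups = group_samples(DESIGN)
```

Keeping one sample per channel and subdirectory mirrors the rule described above and makes replicates from different runs explicit.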
Finally, the user is asked to enter parameters related to the analysis of the quantitative data.
Once all required information is entered, the search and the analysis are directly controlled through buttons in the user interface.

Test data sets
To evaluate the performance of our analysis workflow, we processed the data from three publicly available datasets. We downloaded the respective RAW files from PRIDE Archive (5) and converted them into the MGF file format using ProteoWizard's msconvert tool (13) when no MGF peak list files were available. UniProt, were used for spectra identification.
Quantitative analysis was done using the R Bioconductor package MSnbase version 2.7.1 (10) . Protein summarization was performed using the medpolish method. Modified peptides were not used for quantitation. Only proteins with at least 2 identified peptides were accepted for further analysis. Differential expression was assessed using the R Bioconductor package limma version 3.34 (11) .
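As an illustration of the summarization step, the sketch below implements Tukey's median polish in plain Python on a small matrix of log intensities (rows are peptide-level measurements, columns are reporter channels). It is a generic textbook version written for this example, not the MSnbase code, and the function name and toy data are assumptions.

```python
from statistics import median


def median_polish(matrix, n_iter=10, tol=1e-9):
    """Tukey's median polish; returns (overall, row_effects, col_effects)."""
    resid = [row[:] for row in matrix]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep the row medians out of the residuals.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [value - row_med[i] for value in resid[i]]
        shift = median(col_eff)
        overall += shift
        col_eff = [effect - shift for effect in col_eff]
        # Sweep the column medians out of the residuals.
        col_med = [median(resid[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        overall += shift
        row_eff = [effect - shift for effect in row_eff]
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff


# Toy data: three peptides of one protein measured in three channels.
log_intensities = [
    [10.0, 10.5, 9.5],
    [11.0, 11.5, 10.5],
    [12.0, 12.5, 11.5],
]
overall, peptide_effects, channel_effects = median_polish(log_intensities)
# Protein abundance per channel = overall level + channel effect.
protein_abundance = [overall + effect for effect in channel_effects]
```

Because only medians are used, a single aberrant peptide measurement has little influence on the summarized protein abundance.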

Cerebral malaria pathogenesis
The study uses TMT6 labeling to compare mouse blood with different stages of cerebral malaria (d3, ECM) to non-infected mice (NI) (15) . Four replicates of each of the three sample types were arranged in TMT6 sets and run separately. Peak list data files (MGF file format) were downloaded from PRIDE (PXD003772).
The analysis was again performed using IsoProt version 0.2 (see above) with the precursor tolerance set to 10 ppm and the fragment tolerance to 0.05 Daltons. One missed cleavage was allowed. Carbamidomethylation, TMT 6-plex of K, and TMT 6-plex of peptide N-term were set as fixed modifications. Oxidation of M was set as variable modification. PSMs were filtered at a target FDR of 0.01 using the target-decoy approach. SwissProt sequences from mouse (January 2018) were used for spectra identification. Only proteins with at least 2 identified peptides were accepted for further analysis.
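The target-decoy filtering mentioned here can be sketched generically as follows. This is a simplified illustration under the assumption that the FDR at a score cut-off is estimated as the number of decoy hits divided by the number of target hits; the validation performed by PeptideShaker is more elaborate, and the function name and toy scores are hypothetical.

```python
def filter_psms(psms, fdr_threshold=0.01):
    """Return the scores of target PSMs accepted at the given FDR.

    psms: list of (score, is_decoy) tuples, higher score = better match.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among all PSMs scoring at least this well.
        fdrs.append(decoys / max(targets, 1))
    # q-value: lowest FDR at which a PSM would still be accepted.
    qvalues = fdrs[:]
    for i in range(len(qvalues) - 2, -1, -1):
        qvalues[i] = min(qvalues[i], qvalues[i + 1])
    return [score for (score, is_decoy), q in zip(ranked, qvalues)
            if not is_decoy and q <= fdr_threshold]


# Toy example (a real search yields thousands of PSMs).
toy_psms = [(10, False), (9, False), (8, False), (7, False),
            (6, True), (5, False), (4, True), (3, True)]
accepted_loose = filter_psms(toy_psms, fdr_threshold=0.25)
accepted_strict = filter_psms(toy_psms, fdr_threshold=0.10)
```

With the loose threshold the target scoring 5 is still accepted, whereas the strict threshold stops before the first decoy.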

Non-muscle invasive and muscle-invasive bladder cancer
The study compares tumor tissue samples from non-muscle invasive and muscle-invasive bladder cancer (16) . MGF files were downloaded from PRIDE Archive (PXD002170).
The analysis was again performed using IsoProt version 0.2 (see above) with the precursor tolerance set to 10 ppm and the fragment tolerance to 0.05 Daltons. One missed cleavage was allowed. Carbamidomethylation, iTRAQ 8-plex of K, iTRAQ 8-plex of Y, and iTRAQ 8-plex of peptide N-term were set as fixed modifications. Oxidation of M was set as variable modification. PSMs were filtered at a target FDR of 0.01 using the target-decoy approach.
Sequences from sp_human.fasta (SwissProt human proteome, January 2017) were used for spectra identification. Only proteins with at least 2 identified peptides were accepted for further analysis.

Results
IsoProt enables end users to run the full data analysis of iTRAQ/TMT experiments in a straightforward and reproducible way. The protocol can be applied to different experimental designs, including multiple runs on the mass spectrometer and multiple differently labeled samples. This is exemplified by a detailed description of the example workflow and an evaluation of the results from three selected studies.
Apart from allowing data analysis through a few mouse clicks, the open layout of the protocol allows complex adjustments and modifications at all stages of the workflow.

A fully reproducible environment
The protocol can be run on any computer with a functional Docker environment, by just downloading and running the available Docker image. This is fully automated through our "ProtProtocol docker-launcher" tool ( https://github.com/ProtProtocols/docker-launcher ).
Hence, the protocol avoids platform- and operating-system-specific installation issues and provides identical results independent of the operating system, its configuration, and the computer hardware.
Every IsoProt release has a stable version number that points to a specific Docker image.
Therefore, by citing the IsoProt version number used, it will always be possible to exactly restore the analysis environment, including the versions of all software tools.
Once the protocol has been executed, it is possible to save it, including all generated figures, as a standard HTML or PDF page. Therefore, the complete analysis workflow can easily be made available, for example at the time of review, and be viewed with a standard web browser. For an overview of the visualizations, see Figure 2.

Simple example workflow
Functionality and output of IsoProt can be tested using the available example data set which were the lowest amount of proteins were spiked. In actual experiments, it is often unknown whether a value is missing at random or not, which is why our pipeline does not use any imputation. The complete output of our pipeline can be found in Supplementary File 1.

Cerebral malaria pathogenesis
The authors investigated differences in the plasma proteome between healthy and malaria-infected mice (two stages). The two available TMT 6-plex sets were considered to contain independent samples. IsoProt quantified more protein groups (324 versus 289) when requiring a minimum of 2 unique PSMs and an identification FDR < 1%. For the further comparison, we restricted the IsoProt output to the 214 uniquely identified proteins (no peptides shared with other proteins).
In the original study, statistical testing was carried out separately for the two TMT runs, yielding a total of 54 proteins (more precisely 43, as 11 were detected in both runs) found to be differentially regulated between Plasmodium berghei ANKA (PbA)-infected (d8 post-infection, labelled ECM) and non-infected (labelled NI) mice (Mann-Whitney U test, p ≤ 0.001, no correction for multiple testing). We found a total of 41 differentially regulated proteins (FDR < 0.01) and an overlap of only 20 proteins with the original study.
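The difference between thresholding raw p-values and thresholding FDR-adjusted values can be illustrated with a short Python sketch of the Benjamini-Hochberg procedure. That this is the adjustment applied in IsoProt is an assumption made for the example (it is the default adjustment in limma), and the p-values are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the
    # adjusted values stay monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Invented p-values for five proteins.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.042]
adjusted_p = benjamini_hochberg(raw_p)

significant_raw = sum(p <= 0.01 for p in raw_p)
significant_adjusted = sum(p < 0.01 for p in adjusted_p)
```

In this toy case two proteins pass the uncorrected threshold but only one remains after adjustment, which shows how the choice of correction alone can change the reported protein list.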
Given the rather different statistical testing, we looked into the proteins that were determined as differentially regulated by only one of the two methods. All but four proteins found differentially regulated in the original study were quantified by IsoProt and showed similar abundances in both analyses (see Figure 4A). Proteins found significantly regulated only in the original study were not found significant by IsoProt mostly due to low fold-changes in the quantitative analysis (Figure 4B).
We investigated the two proteins that differed most between the two types of analyses.
Retinol-binding protein 4 (Q00724) was the protein with the lowest FDR among the proteins found differentially regulated by IsoProt but not in the original study. Figure 4C shows PSM measurements for the 2 TMT runs of this protein (scaled for better comparison).

Summarized protein abundances (thick lines) by median summarization with outlier removal show that the PSMs of peptides with less differential behavior were removed. By merging the observations of the two TMT runs, IsoProt gains more statistical power and thus provides evidence for regulatory behavior of this protein.
On the other hand, protein disulfide-isomerase (P09103) was the protein with the highest FDR (least significant) among the proteins found significantly changing in the original study (TMT-1) but not by IsoProt (Figure 4D). Given that high abundances were observed in only one of the two ECM replicates in TMT-1, manual interpretation at least would not consider this protein to be regulated. The PSMs measured in the second run (TMT-2) confirm this observation.
The complete output of our pipeline can be found in Supplementary File 2.

Non-muscle invasive and muscle-invasive bladder cancer
IsoProt quantified 1,145 protein groups when restricting to a minimum of 2 unique peptides and 1% FDR, compared to 1,092 in the original study (minimum of 2 peptides, Occam's razor principle for peptide inference and 1% FDR). The two analyses had an overlap of 662 proteins.
We then compared the mean log-ratios between the two cancer subtypes (four replicates each). Although only the bioinformatics workflows differed, relatively large differences were observed between the estimated log-fold changes (Figure 5A, Pearson's correlation of 0.78).
Statistical testing identified one differentially regulated protein (15-hydroxyprostaglandin dehydrogenase, FDR < 0.01) after correction for multiple testing, which was not carried out in the original study. When comparing uncorrected p-values, the majority of significant proteins were different between the two studies (Figure 5B and C, colored points indicate p<0.05 in the other respective study). This striking difference in the statistical results can be explained when looking at the distribution of protein abundances (Figure 5D and E). A deeper look into the original analysis showed that the authors normalized the ratios between cancer subtypes after protein approach is to normalize the different channels (i.e. individual samples) on the (measured) PSM or (aggregated) peptide level prior to the aggregated analysis of these measurements on the protein level and, most importantly, prior to merging any independent (i.e. replicate) measurements. Strong deviations of individual channels which are visible on the peptide level were thus discarded in the original study. The complete output of our pipeline can be found in Supplementary File 3.
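Channel normalization on the PSM level, applied before any protein summarization, can be sketched in Python as below. Median centering of log2 intensities is an assumption chosen for illustration; the passage does not name the normalization method used by IsoProt, and the function name and data are hypothetical.

```python
import math
from statistics import median


def normalize_channels(psm_intensities):
    """Median-center each reporter channel on the log2 scale.

    psm_intensities: rows are PSMs, columns are channels (raw intensities).
    Each channel is shifted so that all channel medians coincide.
    """
    logged = [[math.log2(value) for value in row] for row in psm_intensities]
    n_channels = len(logged[0])
    channel_medians = [median(row[j] for row in logged)
                       for j in range(n_channels)]
    target = median(channel_medians)
    return [[row[j] - channel_medians[j] + target for j in range(n_channels)]
            for row in logged]


# Toy data: the second channel was loaded with twice as much material,
# so every PSM is twice as intense there without any biological change.
raw = [
    [100.0, 200.0],
    [400.0, 800.0],
    [1600.0, 3200.0],
]
normalized = normalize_channels(raw)
```

After normalization both channels carry the same values for every PSM, so the loading difference no longer appears as a fold change once the PSMs are aggregated to proteins.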

Discussion
While the availability of proteomics techniques is continuously increasing, researchers often do not have the required bioinformatic support to analyse the data. Here, we present a simple, statistically accurate protocol for one of the most commonly used quantitative approaches in proteomics. This should enable users to analyse even complex experimental setups with just a few mouse clicks.
Lack of reproducibility in general, and in bioinformatics workflows specifically, is a growing concern. Seemingly small changes to a workflow, such as details of the normalisation method, can have dramatic effects on the final result. Due to the many steps and settings that make up a complex workflow, it is usually impossible to fully describe such a workflow in the methods section of a research paper. The IsoProt protocol allows users to save the complete analysis workflow as a simple HTML page, which can then be submitted in addition to the classical methods section of a paper.
Finding exactly the same software version used in a paper is often a major obstacle when replicating bioinformatic analyses. Often, this older version is no longer compatible with the available operating system or is no longer available at all. By encapsulating protocols into Docker containers, the complete setup including all software versions can be referenced through a single protocol version number. This allows anyone to replicate the results at any later stage without having to worry that older software might no longer work. Once a given version of the protocol is downloaded, users can be sure that it will behave in exactly the same way on all supported platforms.
The use of Docker makes the protocol highly portable. Docker currently supports Windows, Linux and Mac OS, making our protocol truly multiplatform. The fact that the protocol can be installed through a single command makes it trivial to move the setup from one machine to another. With our "ProtProtocol docker-launcher" tool the protocol can even be installed with the click of a single button. This should greatly reduce the effort of setting up a complex proteomics analysis environment and relieve the often already strained IT support.
Applying IsoProt to available data showed that subtle differences in the data analysis can lead to considerable differences in the final results. These differences are very difficult to spot during the review of a paper since (bioinformatics) methods can only be summarized in the manuscript. If researchers are able to make the complete workflow available in an easy-to-read format, such points can already be discussed at the time of review, which increases the quality and credibility of both the study and the journal.
All of these developments are available as free and open-source software. Thereby, we encourage other researchers to use the ProtProtocol infrastructure as starting point to develop their own analysis workflows and make them available to the community. All our tools are modularized and prepared to support and simplify such external developments.
Since Docker has become an industry standard for containerized applications, long-term support for these developments seems likely.
In summary, we have presented an environment for fully reproducible data analysis and exemplified its power by means of a fully functional software for the analysis of data from mass spectrometry experiments with isobaric labeling.