MaxQuant and MSstats in Galaxy enable reproducible cloud-based analysis of quantitative proteomics experiments for everyone

Quantitative mass spectrometry-based proteomics has become a high-throughput technology for the identification and quantification of thousands of proteins in complex biological samples. Two de facto standard tools, MaxQuant and MSstats, allow for the analysis of raw data and finding proteins with differential abundance between conditions of interest. To enable accessible and reproducible quantitative proteomics analyses in a cloud environment, we have integrated MaxQuant (including TMTpro 16/18plex), Proteomics Quality Control (PTXQC), MSstats and MSstatsTMT into the open-source Galaxy framework. This enables the web-based analysis of label-free and isobaric labeling proteomics experiments via Galaxy’s graphical user interface on public clouds. MaxQuant and MSstats in Galaxy can be applied in conjunction with thousands of existing Galaxy tools and integrated into standardized, shareable workflows. Galaxy tracks all metadata and intermediate results in analysis histories, which can be shared privately for collaborations or publicly, allowing full reproducibility and transparency of published analyses. To further increase accessibility, we provide detailed hands-on training materials. The integration of MaxQuant and MSstats into the Galaxy framework enables their usage in a reproducible way on accessible large computational infrastructures, hence realizing the foundation for high-throughput proteomics data science for everyone.


INTRODUCTION
Mass spectrometry-based proteomics is a standard technique for the identification and relative quantification of thousands of proteins in complex samples. A common aim is to identify proteins that are differentially abundant between conditions of interest. Two standard software tools for data-dependent acquisition (DDA)-based quantitative proteomics are MaxQuant 1,2 and MSstats 3,4 . Together they allow for a typical quantitative shotgun proteomics analysis workflow. MaxQuant is standalone freeware that takes raw data as input and performs protein identification and quantification. MaxQuant supports all common protein quantification methods such as label-free, label-based and isobaric labeling 1,5 . MSstats is a Bioconductor R package for finding proteins that are differentially abundant between conditions. It uses flexible linear models to analyze label-free proteomics experiments with complex designs 3 . Recently, MSstatsTMT was released for the statistical modeling of isobaric labeling quantification data, e.g., iTRAQ (isobaric tags for relative and absolute quantitation) or TMT (tandem mass tag) 4 .
Typically, a quantitative proteomics analysis requires several steps. First, all software needs to be installed, often on a shared lab workstation with sufficient computational power. Next, the MaxQuant run is started; once it has finished, the results may be inspected manually or with dedicated software such as the PTXQC R package, which generates a quality control report 6 . Afterwards, the MaxQuant result files are loaded into the R programming environment for processing and statistical analysis in MSstats. In contrast to many other proteomics software tools, MaxQuant and MSstats are compatible with powerful computational infrastructures such as high-performance computing (HPC) systems and cloud environments 7,8 . This compatibility is required because technical advancements in sensitivity, mass resolution and acquisition speed lead to larger file sizes and an increasing number of samples per experiment 7 . With the steadily expanding availability of instrumentation, proteomics experiments are increasingly widespread and complex, which emphasizes the need for easily accessible and scalable software solutions. However, even for HPC- and cloud-compatible software, monetary hurdles and the technical complexity of software installation and maintenance severely hamper access to high-throughput analysis 7 . Reproducible, and thus trustworthy, high-throughput analyses require even more computational skills to control software versions and dependencies 7 . Even the most detailed methods section cannot ensure reproducible analyses unless every researcher has access to the same software, software version, computing environment and computational resources 9 .
Here, we present the integration of MaxQuant and MSstats into the Galaxy framework to enable accessible and reproducible quantitative proteomics analyses in a cloud environment. The biomedical data analysis platform Galaxy is an open-source, free-to-use web-based service with a graphical user interface. It schedules pre-installed tools on large compute resources and public clouds while recording all provenance data, from parameters and tool versions to information on analysis workflows 10 . It allows sharing of complete analysis histories and workflows and therefore provides a platform on which high-throughput analyses can be executed and repeated by everyone. Galaxy already offers thousands of tools for many different omics domains, including a variety of tools for explorative proteomics, such as msconvert 11 , SearchGUI 12 , PeptideShaker 13 , OpenMS 14 , OpenSwath 15 , and DIAumpire 16 . Thus, MaxQuant and MSstats in Galaxy not only enable classical DDA-based quantitative proteomics analyses but may also be integrated with other Galaxy tools into standardized, shareable workflows. With the integration of MaxQuant and MSstats into the Galaxy framework, we enable every researcher to run quantitative proteomics analyses of a virtually unlimited number of files on public cloud infrastructures.
To integrate the tools, we first built Bioconda recipes and Biocontainers 17,18 . These allow for easy software installation with full version control in any (Linux-based) environment. Therefore, they are also beneficial for applications outside of the Galaxy framework. Within the Galaxy framework, Bioconda recipes and Biocontainers allow installation of multiple versions of the same tool and easy switching between them, which allows full reproducibility even of older analyses. Lastly, we built so-called Galaxy wrappers that define the input parameters in Galaxy's graphical user interface and link them to the software executables.

MaxQuant in Galaxy
Two MaxQuant tools were integrated into the Galaxy framework: one uses the mqpar.xml parameter file as input, while the other allows setting parameters directly in the tool user interface. Both MaxQuant tools offer the same options for raw data, database input files and output files. Raw data is accepted in the Thermo RAW file format as well as in the open standard formats mzXML and mzML 19,20 , which can be obtained by converting any vendor-specific RAW format with the msconvert software 11 that is also available in Galaxy. Single or multiple FASTA files are allowed as database input, and the 'parse rules' can be adjusted directly in the user interface without an additional configuration step. All common MaxQuant files are offered as output options. The PTXQC R script is directly integrated into the MaxQuant tools and allows the optional creation of a QC report following the MaxQuant run. The 'MaxQuant (using mqpar.xml)' tool runs MaxQuant with the input parameters specified in an mqpar.xml parameter file that was created beforehand, e.g., by using the traditional MaxQuant software. The intended use case is to scale easily from a local installation to large compute resources using Galaxy. In addition to the selection of input files, the only parameters that have to be set are the 'parse rules' for the FASTA file, the PTXQC parameters and the selection of output files.
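To illustrate what a FASTA 'parse rule' does, the following minimal sketch applies a regular expression to a FASTA header line and extracts the protein identifier. The rule shown is of the kind commonly used for UniProt-style headers; the example header and the helper name `apply_parse_rule` are hypothetical, and Python's `re` module serves only as a stand-in for MaxQuant's own regular expression handling.

```python
import re
from typing import Optional

# Parse rule of the kind used for UniProt-style headers (assumed for illustration).
UNIPROT_ID_RULE = r">.*\|(.*)\|"


def apply_parse_rule(header: str, rule: str) -> Optional[str]:
    """Return the first capture group of `rule` in `header`, or None if no match."""
    match = re.search(rule, header)
    return match.group(1) if match else None


# Hypothetical FASTA header line.
example_header = ">sp|P02768|ALBU_HUMAN Albumin OS=Homo sapiens OX=9606 GN=ALB"
print(apply_parse_rule(example_header, UNIPROT_ID_RULE))
```

A header without the expected delimiters yields no identifier, which is why the parse rule has to match the header style of the FASTA file in use.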
Since mqpar.xml files do not always exist or might need complicated adjustments, we have built an additional 'MaxQuant' tool that allows specifying the most crucial parameters directly in the Galaxy user interface. The tool is separated into five categories: Input options, Search options, Protein quantification, Parameter group and Output options. In contrast to the original MaxQuant software, the experimental setup, which includes file name, experiment name, fraction and post-translational modifications, needs to be specified in a tab-separated values file outside Galaxy. Custom modifications cannot be configured by the user; they need to be added by a Galaxy tool developer, but once installed they remain available in all subsequent tool versions. We have integrated the modifications for TMTpro-16plex and TMTpro-18plex, allowing users to apply these quantification options directly without any additional installation steps. Inside the Galaxy tool, the specified parameters are transferred via an additional Python script into the mqpar.xml parameter file, which is then used to launch MaxQuant.
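The transfer of an experimental setup table into a parameter file can be pictured with the following minimal sketch. It is not the script used by the Galaxy tool: the column names are assumed from MaxQuant's experimental design template, the XML template is a heavily simplified stand-in for a real mqpar.xml, and the function name `fill_template` and all file names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical experimental design table (tab-separated). The column names
# are assumed to follow MaxQuant's experimental design template.
DESIGN_TSV = (
    "Name\tFraction\tExperiment\tPTM\n"
    "sample1.raw\t1\tcontrol\tFalse\n"
    "sample2.raw\t1\ttreated\tFalse\n"
)

# Heavily simplified stand-in for an mqpar.xml template (assumption);
# a real parameter file holds many more settings.
TEMPLATE_XML = """<MaxQuantParams>
  <numThreads>1</numThreads>
  <filePaths/>
  <experiments/>
  <fractions/>
  <ptms/>
</MaxQuantParams>"""


def fill_template(design_tsv: str, template_xml: str, num_threads: int) -> ET.Element:
    """Copy the experimental design and a thread count into the XML template."""
    root = ET.fromstring(template_xml)
    root.find("numThreads").text = str(num_threads)
    for row in csv.DictReader(io.StringIO(design_tsv), delimiter="\t"):
        # One child element per raw file, in the same order in every list.
        ET.SubElement(root.find("filePaths"), "string").text = row["Name"]
        ET.SubElement(root.find("experiments"), "string").text = row["Experiment"]
        ET.SubElement(root.find("fractions"), "short").text = row["Fraction"]
        ET.SubElement(root.find("ptms"), "boolean").text = row["PTM"]
    return root


root = fill_template(DESIGN_TSV, TEMPLATE_XML, num_threads=4)
print(ET.tostring(root, encoding="unicode"))
```

Keeping the per-file lists in the same order is the key design point of the sketch, since each raw file is matched to its experiment and fraction by position.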

MSstats in Galaxy
Two MSstats Galaxy tools were built based on the Bioconductor R packages MSstats and MSstatsTMT, which analyze quantitative proteomics data from label-free and isobaric labeling experiments, respectively. The MSstats and MSstatsTMT Galaxy tools cover the entire statistical analysis workflow, from importing and converting results from quantitative proteomics software to protein summarization, protein quantification and group comparison. For this workflow, the full set of parameters is adjustable via the Galaxy tool interface, and MSstats converters for MaxQuant, OpenSwath (only MSstats), OpenMS and Proteome Discoverer (only MSstatsTMT) are included. As in the original software, an additional file is needed to specify experimental annotations such as condition and biological and technical replicates. Quantitative proteomics data from unsupported software such as Skyline and Progenesis can be converted and annotated outside Galaxy, for example in a text editor, into the MSstats-specific table format.
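As a minimal sketch of what such an annotation file might contain for a label-free experiment, the following snippet writes one row per run as tab-separated text. The column names are assumed from the MSstats annotation format for MaxQuant input; the run names, conditions and the helper `build_annotation` are hypothetical.

```python
import csv
import io

# Column names assumed from the MSstats annotation format for MaxQuant input.
COLUMNS = ["Raw.file", "Condition", "BioReplicate", "IsotopeLabelType"]

# Hypothetical runs: (run name, condition, biological replicate).
runs = [
    ("patient01", "tumorA", 1),
    ("patient02", "tumorA", 2),
    ("patient03", "tumorB", 3),
    ("patient04", "tumorB", 4),
]


def build_annotation(runs):
    """Return the annotation table as tab-separated text, one row per run."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for raw_file, condition, bio_replicate in runs:
        # "L" marks a label-free (light) run.
        writer.writerow([raw_file, condition, bio_replicate, "L"])
    return buffer.getvalue()


annotation_tsv = build_annotation(runs)
print(annotation_tsv)
```

The resulting text can be saved as a file and uploaded to a Galaxy history alongside the quantification results.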
The desired comparison between conditions requires an additional tab-separated values file that defines the comparison matrix. For each analysis step, the user can select the result tables and visualizations of interest.
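A comparison matrix can be thought of as one row per comparison and one column per condition, with +1 for the condition in the numerator, -1 for the condition in the denominator and 0 elsewhere. The sketch below builds such a table under these assumptions; the header label `names`, the alphabetical column order, the condition names and both helper functions are illustrative choices rather than requirements confirmed by the tools.

```python
import csv
import io


def build_comparison_matrix(conditions, comparisons):
    """Build contrast rows: +1 numerator, -1 denominator, 0 for all other conditions."""
    ordered = sorted(conditions)  # fixed column order (alphabetical, an assumption)
    rows = []
    for name, (numerator, denominator) in comparisons.items():
        if numerator not in ordered or denominator not in ordered:
            raise ValueError(f"unknown condition in comparison {name!r}")
        weights = {condition: 0 for condition in ordered}
        weights[numerator] = 1
        weights[denominator] = -1
        rows.append([name] + [weights[condition] for condition in ordered])
    return ["names"] + ordered, rows


def to_tsv(header, rows):
    """Serialize the matrix as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical conditions and comparisons.
conditions = ["control", "knockdown1", "knockdown2"]
comparisons = {
    "knockdown1_vs_control": ("knockdown1", "control"),
    "knockdown2_vs_control": ("knockdown2", "control"),
}
header, rows = build_comparison_matrix(conditions, comparisons)
print(to_tsv(header, rows))
```

Each row sums to zero, which is a quick sanity check that a comparison contrasts two conditions rather than reporting an absolute level.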

Access to MaxQuant and MSstats in Galaxy
To enable every researcher to perform reproducible and scalable quantitative proteomics analyses, we have integrated two de facto standard tools, MaxQuant and MSstats, into the Galaxy framework. Following the modular tool structure of the Galaxy framework, we have built four new tools: 'MaxQuant' including PTXQC functionality, 'MaxQuant (using mqpar.xml)' including PTXQC functionality, 'MSstats' and 'MSstatsTMT'. These tools are available via the Galaxy toolshed 25 , which is the central tool repository from which Galaxy administrators can install the tools on any Galaxy server, including the more than 125 public Galaxy servers. The described tools are already installed on several public Galaxy servers, where everyone can create a free user account and use the graphical user interface to adjust tool parameters and run the tools on public computing infrastructure. Only internet access and a web browser are needed to access these public Galaxy instances. In the case of the European Galaxy server (https://usegalaxy.eu), thousands of cores and dozens of terabytes of RAM are available (de.NBI cloud), allowing for the comprehensive analysis of large proteomics datasets (Figure 3). Table S1 provides links to the tools in the Galaxy toolshed and on the European Galaxy server.

Quantitative proteomics in the Galaxy framework
In combination, MaxQuant and MSstats enable protein quantification and differential abundance analysis of label-free, TMT and iTRAQ proteomics data. Within Galaxy, MaxQuant and MSstats can be operated individually or together in a workflow. Workflows require the user to start the analysis only once, because the generation of the MaxQuant results automatically triggers MSstats to continue with the analysis. Regardless of whether the analysis is performed step by step or via workflows, Galaxy histories are generated. A history contains all intermediate and result files together with all metadata required for transparency and reproducibility, such as tool names, tool versions, parameters and input files. Histories and workflows can be shared either privately with collaborators or publicly, for example as part of peer-reviewed publications.
Several hundred to thousands of Galaxy tools are pre-installed on every Galaxy server and enable high levels of interoperability. Therefore, the MaxQuant and MSstats Galaxy tools seamlessly integrate into the existing tool landscape of proteomics 26-28 , metabolomics 29-31 and many more omics disciplines, allowing complex, large-scale proteomics and multi-omics analyses 32 in a reproducible manner (Figure 2). MSstats is the first Galaxy tool specialized for statistical analysis of quantitative proteomics data. It is compatible not only with MaxQuant but also with other proteomics software available in Galaxy, such as OpenMS and OpenSwath 33 , and therefore expands the analysis options inside Galaxy. The tab-separated values file outputs of MaxQuant and MSstats are compatible with the many text manipulation tools in Galaxy that allow, for example, filtering, sorting, computing, summarizing and visualization. All tab-separated values files are furthermore compatible with other downstream Galaxy tools such as protein annotation, Gene Ontology (GO) annotation and enrichment analysis.

[Figure caption, beginning missing: ... dark blue), visualization (yellow) and protein annotation (green). Users only need to load the required input files (raw, FASTA, MSstats annotation file and MSstats comparison matrix) into a new analysis history and select them as inputs for the workflow. All workflow steps then run automatically; email notification can be enabled for when selected tools have finished.]
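The kind of filtering and sorting that can be applied to the tab-separated result tables is sketched below for a group comparison table. The column names are assumed to follow the MSstats group comparison output; the table excerpt, the thresholds and the function name `significant_proteins` are hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical excerpt of a group comparison result table (tab-separated).
# Column names are assumed to follow the MSstats group comparison output.
RESULT_TSV = (
    "Protein\tLabel\tlog2FC\tadj.pvalue\n"
    "P1\tA_vs_B\t2.5\t0.001\n"
    "P2\tA_vs_B\t0.3\t0.200\n"
    "P3\tA_vs_B\t-1.8\t0.010\n"
    "P4\tA_vs_B\t1.2\tNA\n"
)


def significant_proteins(tsv_text, max_adj_p=0.05, min_abs_log2fc=1.0):
    """Keep rows passing both thresholds, sorted by adjusted p-value (ascending)."""
    kept = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        try:
            log2fc = float(row["log2FC"])
            adj_p = float(row["adj.pvalue"])
        except ValueError:
            continue  # skip rows with non-numeric entries such as "NA"
        if adj_p < max_adj_p and abs(log2fc) >= min_abs_log2fc:
            kept.append((row["Protein"], log2fc, adj_p))
    return sorted(kept, key=lambda item: item[2])


hits = significant_proteins(RESULT_TSV)
for protein, log2fc, adj_p in hits:
    print(protein, log2fc, adj_p)
```

Within Galaxy, the same steps would be carried out with the existing text manipulation tools rather than with custom code.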

Training material with example datasets
To facilitate the usage of these newly built quantitative proteomics tools in Galaxy, we created three accompanying tutorials that showcase the application of MaxQuant and MSstats for different use cases. All three trainings are available online via the central repository of the Galaxy Training Network 21 (Table S1) and provide example datasets and step-by-step explanations that enable hands-on training.
The first training is tailored towards researchers who are not yet familiar with MaxQuant. Two human serum samples are analyzed. One sample was depleted of the most abundant serum proteins, and the training aims to find out which of the samples was depleted and how successful the depletion was. To answer this question, a label-free MaxQuant analysis is performed, followed by inspecting the quality control report from the PTXQC tool and by filtering, sorting, computing and visualizing the properties of both datasets (Figure 3a).
The second training explores a realistic label-free dataset consisting of skin cancer tissue samples from 19 patients 22 . The training starts with a label-free analysis in MaxQuant, followed by statistical analysis in MSstats to find differentially abundant proteins between two types of skin cancers. Several follow-up steps are performed to filter and visualize the result and annotate proteins of interest (Figure 3b).
The third training explores a realistic, fractionated TMT dataset consisting of 12 high pH fractions from a human cell line experiment 23 . The training starts with the TMT11-plex analysis in MaxQuant followed by the statistical analysis in MSstatsTMT to find differentially abundant proteins between knockdown and control cells (Figure 3c).
The trainings serve not only as self-study material; owing to their detailed descriptions and hands-on design, they are also a valuable resource for teaching proteomics data analysis to researchers and (undergraduate) students as part of their curriculum 24 .

CONCLUSION
The integration of MaxQuant and MSstats into the Galaxy framework allows easily accessible, reproducible and scalable quantitative proteomics data analysis. An internet connection and a web browser suffice to run these tools on public clouds. The availability of many other omics tools in the Galaxy framework allows the integration of MaxQuant and MSstats into more complex, even multi-omics analyses in a single analysis platform. In addition, the Galaxy framework enables the highest levels of reproducible research, from tool version control to the storage of all metadata and intermediate results. For the first time, MaxQuant in combination with MSstats can thus be run in an accessible and reproducible way, in parallel on large infrastructures, which is the next step towards true high-throughput proteomics.