Abstract
Label-Free Quantification (LFQ) of shotgun proteomics data is a popular and robust method for the characterization of relative protein abundance between samples. Many analytical pipelines exist to automate this analysis, and some tools exist for the subsequent representation and inspection of their results. Mass Dynamics 1.0 (MD 1.0) is a web-based analysis environment that can analyze and visualize LFQ data produced by software such as MaxQuant. Unlike other tools, MD 1.0 utilizes a cloud-based architecture that stores researchers' data, enabling them not only to automatically process and visualize their LFQ data, but also to annotate and share their findings with collaborators and, if they choose, to easily publish results to the community. With a view toward increased reproducibility and standardisation in proteomics data analysis, and to streamline collaboration between researchers, MD 1.0 requires minimal parameter choices and automatically generates quality control reports to verify experiment integrity. Here, we demonstrate that MD 1.0 provides reliable results for protein expression quantification, emulating Perseus on benchmark datasets over a wide dynamic range.
The MD 1.0 platform is available globally via: https://app.massdynamics.com/.
Contact: webb{at}wehi.edu.au
Introduction
Proteomics can be defined as the application of technologies for the identification and quantification of the protein content of complex biological samples. Over the past few decades, proteomics research, and the complexity of the experimental questions it addresses, has developed rapidly and is having a growing impact on biological and medical research. In particular, applications such as understanding mechanisms of action in disease progression and therapeutic intervention, detecting diagnostic markers, identifying candidates for vaccine production, and characterizing pathogenic mechanisms and gene expression patterns have been of growing importance in advancing many areas of medical research1,2.
Mass spectrometry via LC-MS/MS is the leading technology in proteomics research, with Label-Free Quantification (LFQ) and isotope-based labelling being two approaches that facilitate the interrogation and measurement of large numbers of analytes in even very complex samples. Isotope-based labelling methods such as SILAC and TMT have provided the gold standard for protein quantification, but are limited in their applicability to samples that cannot be easily labelled and by the cost of the reagents required. By contrast, LFQ is a simpler, more economical and scalable method, though it requires considered experimental design in order to achieve robust biological insights3.
An exemplar of the growing use and adoption of LFQ-based approaches is the sheer variety of analytical software packages that have been developed to support LFQ experiments, comprehensively reviewed recently by Tang et al.4; the most widespread of these is MaxQuant5. MaxQuant's success can be attributed at least in part to the holistic nature of its analytical tools, covering a breadth of steps including feature extraction, database search, protein identification and quantification, which must otherwise be achieved using a combination of other tools.
While data processing is essential for the success of proteomic analysis, objective quality control reporting and reproducible downstream analysis are equally important. The Perseus computational platform is the analytical counterpart to MaxQuant and provides users with a highly flexible framework for post-processing and visualising results6,7. Despite the diverse suite of tools offered by Perseus (accompanied by supporting documentation and online tutorials), the sheer amount and breadth of functionality can be overwhelming for users requiring straightforward analysis of LFQ proteomics data. For users who frequently repeat identical analytical procedures for collaborators, Perseus can be more labour- and time-intensive than tools designed to automate the same process, such as LFQ-Analyst6,8.
Since the publication of LFQ-Analyst in 2019, there have been numerous publications of automated LFQ pipelines, including Eatomics9 and ProVision10. While these resources differ in their scope and degree of automation, they are among a growing list of attempts to shift toward standardization through automation, with straightforward statistical analysis and visualization performed as an output of the LFQ pipeline. Whilst these attempts have undoubtedly made headway toward more accessible and reliable analytical processes, large challenges remain, such as scale, storage, sharing and platform sustainability, that will inhibit broad adoption among the community. MD 1.0 addresses these needs by meeting multidimensional criteria in reliability, ease-of-use and transparency, while offering functionality designed to facilitate collaboration and sharing. A qualitative survey of the tools mentioned above and MD 1.0 is provided in Supplementary Table S1.
There are a number of measures that can be taken to ensure that tools facilitating LFQ proteomics maintain reliability of analysis and interpretability of results. Firstly, the data analysis and statistical approaches should be demonstrated to produce accurate and reliable results in accordance with accepted best practice in statistical testing. Secondly, the specific implementation of these methods should be demonstrated to work correctly, and to continue to do so over the life of the tool; this ensures that the code is not accidentally compromised during further development. Thirdly, automated quality control metrics and figures should be incorporated, providing confidence that results on non-benchmark datasets can be reasonably interpreted. By meeting these three criteria, a scientific tool can be safely shared with users of varying degrees of expertise.
However, reliability of results is only one side of the coin when it comes to developing useful, empowering tools for the proteomics field. Interviews with 100 scientists in the field of mass spectrometry11 found that "free" tools often impose significant hidden costs in the time spent learning to use the software and in the manual tuning of parameters. Though the opportunity costs borne by users are hard to quantify, they represent a genuine cost of use.
It is likely that an awareness of these hidden costs is a large driver of the number of tools created to automate the analysis of LFQ data, although calls for robust, user-friendly automation in proteomics date back to 199912. Automated tools are not only less time-consuming, but also reduce the number of points of failure that must be investigated during development, benchmarking and comparison. For many practical reasons, automation simplifies tools, making them more accessible to users and scientific developers.
Moreover, codifying and open-sourcing analysis and benchmarking allows for a standardisation that addresses the reproducibility difficulties that plague the broader scientific field.
Tools which automate highly complicated tasks inevitably comprise large amounts of code, which must be available for the tool to be reproducible (a necessary, but not sufficient, condition). Tools such as OpenMS13 achieve reproducibility by open-sourcing their code, enabling thorough and ongoing peer review as the tools grow over time.
MD 1.0 attempts to meet these standards of reliability, automation (and thereby ease-of-use) and transparency in the specific domain of LFQ analysis, while providing further functionality designed to assist with resource integration, annotation, sharing and collaboration.
Traditional and ubiquitous methods for sharing LFQ data include transfers of text via email or messenger services (e.g. Slack) and the use of file storage tools like Dropbox or Google Drive. These tools present obstacles to effective collaboration, such as separating data from quality control metrics, which can obscure interpretability. By allowing an entire experiment to be shared with ease, MD 1.0 aims to make supervision and collaboration easier, so that scientists can ask for and receive assistance while performing their experiments.
Methods
Raw data processing with MaxQuant
MaxQuant v1.6.17.0 was used with default parameters, except for an LFQ min. ratio of 1 and enabling match between runs. Output files used for quality control include msms.txt, peptides.txt, modificationSpecificPeptides.txt, proteinGroups.txt and evidence.txt, whilst only proteinGroups.txt was used for quantification. Where MaxQuant labelled spiked proteins as potential contaminants, proteinGroups.txt was manually edited to remove this label so that these rows would be retained in subsequent processing.
The results and parameters files can be downloaded from the supplementary files and from the Mass Dynamics platform at the following addresses:
iPRG2015: https://app.massdynamics.com/p/56a6d7f7-c129-4d7a-a221-d2ef3b8ff4c3
Dynamic Benchmark Range Dataset: https://app.massdynamics.com/p/152bd07f-ddd6-4943-bf62-298648b43bd3
HER2: https://app.massdynamics.com/p/b832f17a-a9f3-4890-8533-2d69835f814e
Please note that public experiment links provide a limited feature set due to "read-only" access. Downloading the input files from the supplementary information and uploading them as new experiments will allow access to all features.
MD 1.0 LFQ Processing
Statistical analysis is performed using R version 3.6.0.
An experiment design file is generated from user input during experimental set-up in the application prior to processing, which is thereafter automatic.
Proteins corresponding to reverse sequences, potential contaminants and proteins only identified by site are filtered out. Intensities provided in proteinGroups.txt are converted to the log2 scale and missing values are imputed using the MNAR ("Missing Not At Random") method, drawing from a distribution down-shifted by 1.8 standard deviations from the observed mean with a width of 0.3 standard deviations, as recommended in the Perseus protocol. Protein groups where more than 50% of intensities are imputed in both conditions are excluded from the quantitative analysis. The probability of differential expression is calculated using the function lmFit from the Bioconductor package limma, followed by eBayes with default settings except that "robust" and "trend" are set to TRUE. P-values are adjusted for false discovery using the Benjamini-Hochberg method.
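The imputation and p-value adjustment steps above can be sketched as follows. This is an illustrative Python sketch with hypothetical function names, not the production R pipeline (which uses limma for the moderated statistics); it assumes per-sample (column-wise) imputation with the downshift and width parameters given in the text:

```python
import numpy as np

def impute_mnar(log2_intensities, shift=1.8, width=0.3, seed=0):
    """Perseus-style MNAR imputation: replace missing log2 intensities
    with draws from a normal distribution down-shifted from the observed
    values, per sample (column): N(mean - shift*sd, (width*sd)^2)."""
    rng = np.random.default_rng(seed)
    x = np.array(log2_intensities, dtype=float)
    for j in range(x.shape[1]):
        col = x[:, j]
        observed = col[~np.isnan(col)]
        mu, sd = observed.mean(), observed.std(ddof=1)
        missing = np.isnan(col)
        col[missing] = rng.normal(mu - shift * sd, width * sd, missing.sum())
    return x

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    adjusted = np.empty(n)
    prev = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        prev = min(prev, p[i] * n / rank)
        adjusted[i] = prev
    return adjusted
```

In the platform itself the moderated test statistics come from limma's lmFit/eBayes; the sketch above only illustrates the imputation distribution and the multiple-testing correction.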
This code is provided in the following github repository: https://github.com/MassDynamics/lfq_processing
In-application graphics indicate that an FDR cutoff of 5% and a fold change of at least 2 are required for a result to be deemed significant.
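This significance rule can be sketched as a small predicate (a hypothetical helper for illustration; a 2-fold change corresponds to |log2 FC| ≥ 1):

```python
import math

def is_significant(log2_fc, adj_p, fdr=0.05, min_fold_change=2.0):
    """Return True when a protein passes both the FDR cutoff and the
    minimum absolute fold change (2-fold change = |log2 FC| >= 1)."""
    return adj_p < fdr and abs(log2_fc) >= math.log2(min_fold_change)
```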
Implementation
MD 1.0 is composed of two separate components: a web component using a modern software stack (JavaScript and Rails) and a processing component built using R and Elastic Compute Cloud (EC2) on Amazon Web Services (AWS). The combination of these two components ensures that processing is repeatable and runs in a computing environment with identical software, parameters and code, while maintaining the privacy and security of user data. Users have the option to share their data with specific users or publicly.
Perseus Processing
All proteinGroups.txt files were analysed with Perseus version 1.6.14.0. Proteins corresponding to reverse sequences, contaminants and proteins only identified by site were removed. Intensities were transformed to the log2 scale and missing values were imputed as in the MD 1.0 protocol. Student's t-tests were performed for each pairwise comparison with an S0 value of 0 and Benjamini-Hochberg FDR correction. Tests were only performed when more than 50% of values were not imputed in at least one group. Session files are available in the supplementary material.
Benchmarking Datasets
Different LFQ benchmarking datasets were chosen to verify that MD 1.0 recovers near-identical results to Perseus: two datasets with ground truth, PXD0002793 and PXD00205714, and one without, PXD01098115, which constitutes a "real world" scenario.
PXD000279 ("dynamic range dataset") contains raw data for 2 samples (4 replicates each), each enriched with one of the two "Universal Protein Standards" (UPS1 and UPS2), which test LFQ accuracy over a large dynamic range.
PXD002057 ("HER2 dataset") contains raw data for an experiment with 2 samples, each with 3 replicates. These two samples come from 2 cancer cell lines: a parental SKBR3 cell line and a derived cell line that is resistant to human epidermal growth factor receptor 2 (HER2)-targeted therapy.
Lastly, PXD010981 ("iPRG2015 dataset") contains the raw data for the iPRG2015 benchmarking study. This dataset is composed of 4 samples in which 200 ng of tryptic digest of S. cerevisiae (ATCC strain 204508/S288c) was spiked with different quantities of six individual protein digests (Ovalbumin, Myoglobin, Phosphorylase b, Beta-Galactosidase, Bovine Serum Albumin and Carbonic Anhydrase), according to the schema in that publication (also present in the benchmarking GitHub repository).
In order to ensure that certain spiked proteins (P02768 and P06396 for the dynamic range dataset and P44683 for the iPRG2015 data) were not removed as contaminants, the proteinGroups.txt outputs were manually edited prior to benchmarking analysis. The resulting input files are provided in the supplementary information.
Figure Generation and Results Comparison
All results figures and tables were produced using bespoke Python scripts utilizing packages including pandas for data manipulation, plotly and matplotlib for graphics, and scipy for Pearson correlation calculations.
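The similarity statistic used throughout the comparison is the Pearson correlation (computed with scipy in the benchmarking scripts); an equivalent minimal sketch with only numpy, for illustration, is:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two estimate vectors,
    e.g. MD 1.0 vs Perseus log fold changes for shared proteins."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()  # centre both vectors
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))
```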
This code is provided in the following github repository: https://github.com/MassDynamics/lfq_benchmark
Results and Discussion
Overview of MD 1.0: User Interface, Experiment Creation and Sharing
Mass Dynamics 1.0 is a web-based, integrated and automated analysis and collaboration environment that facilitates LFQ experiments. MD 1.0 enables users to upload MaxQuant output files and visualize the quantitative analysis. The user interface begins at the Experiments page, after users sign up and log in (Figure 1).
To create a new experiment, users click the Create Experiment button on the landing page, which takes them to the experiment creation page. Here they are prompted to upload the MaxQuant .txt files that will be used for quantification and quality control reporting. The experiment files are then presented to the user, who is prompted to allocate them into replicate groups. They can then add an experiment description and complete the experiment creation step. Computation usually takes a matter of minutes, and users are sent an email notification when their experiment has completed.
In the experiment view (Figure 2), users can choose between the following tabs: Analysis, QC Report, Insights, Results Files, Export and Experiment Design. Analysis contains the protein expression volcano plot and tables. The QC Report contains a number of experimental quality control plots. The Insights tab contains a list of comments and notes introduced by the user. The Results Files tab contains a list of output files, each of which can be downloaded, including the tab-separated files produced by the quantitative analysis scripts. The Export tab enables users to export the volcano plot with all "candidate proteins" (proteins that have been selected and added to at least one selection list) highlighted and annotated. Lastly, the Experiment Design tab indicates which files have been assigned to each experimental condition.
As the experiment data is stored securely in the cloud, sharing is as simple as giving collaborators access to the same website. After an experiment is completed, users can select the Share button and enter the email address of someone they would like to share their experiment with. That address is then sent an invitation to join MD 1.0 containing a link to the shared experiment. Insights associated with particular proteins, which are accessible from all protein-level analysis interfaces, persist between users accessing the same experiment and are thus shared along with the experiment data.
MD 1.0 facilitates analysis and annotation of Label Free Quantitation Results
After an experiment has finished processing, the Analysis tab can be used to visualize results in the form of a volcano plot (Figure 2b). Next to the volcano plot is a table listing each protein with its gene, estimated fold change, p-value and adjusted p-value. From this table, proteins can be added to lists, which can be used to filter results to proteins of interest or to group up- and down-regulated proteins. A protein in a list can be given an annotation, in which the user comments on the protein with text and/or hyperlinks that are then presented in the Insights tab.
MD 1.0 allows users to perform Over-Representation Analysis with Reactome
While MD 1.0 automatically links protein accession codes from table views to Uniprot.org16, the Pathways tab (Figure 3) provides further integration with an external knowledge base. The Pathways tab uses the Reactome17 API content service to provide Over-Representation Analysis (ORA) results to users. For each candidate list, one API call is made to perform the ORA, while a second retrieves the complete list of proteins in each resulting pathway. A volcano plot is then shown with the proteins in that pathway and experiment, with the "candidate proteins" labelled and the corresponding estimated False Discovery Rate (FDR) provided.
MD 1.0 Quality Control Report produces diagnostic figures to assess experiment health
A feature of MD 1.0 is the automated generation of a quality control report, accessible in the Quality Control (QC) Report tab of the experiment view (see Supplementary Files).
The report contains three sections: Experiment Health, Feature Completeness and Identifications. Experiment Health contains principal component analysis scatter plots of the first two principal components and a scree plot for all proteins, differentially expressed proteins, modification-specific peptides and all peptides. Quantitative CV (coefficient of variation) distributions and sample intensity correlation plots are then produced for the protein, peptide and modification-specific peptide tables. The Feature Completeness section presents the percentage of missing measurements in a histogram at the protein, peptide and modification-specific peptide levels, and a histogram of the percentage of all measurements missing at the LC-MS run (file) level. Lastly, the Identifications section reports counts per file of detected PSMs, modification-specific peptides, peptides and proteins. Complete QC reports are provided both in the application and in the supplementary information.
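Two of the headline QC metrics above can be sketched in a few lines of Python (an illustrative sketch with hypothetical function names; the report itself is generated automatically by the platform):

```python
import numpy as np

def cv_per_feature(intensities):
    """Coefficient of variation (sd/mean) per protein/peptide row,
    ignoring missing values (NaN)."""
    x = np.asarray(intensities, dtype=float)
    mean = np.nanmean(x, axis=1)
    sd = np.nanstd(x, axis=1, ddof=1)
    return sd / mean

def percent_missing_per_run(intensities):
    """Percentage of missing measurements in each LC-MS run (column)."""
    x = np.asarray(intensities, dtype=float)
    return 100.0 * np.isnan(x).mean(axis=0)
```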
Due to the automated nature of the QC report and data processing, users are able to quickly review and assess experimental results, confidently interpret them, or share them with collaborators via other features.
MD 1.0 reproduces Perseus Results Reliably on Sample Datasets
In order to determine the reliability of the MD 1.0 automated workflow for LFQ quantification, we analysed the same experiments in Perseus and MD 1.0. Two of these experiments, the iPRG2015 and dynamic range datasets, contain ground truth, whereas the last dataset, from a study on breast cancer resistant cells, constitutes a more realistic real-world scenario with no ground truth.
We compared Perseus and MD 1.0 results on these datasets using continuous and discrete measures of accuracy (for the ground-truth datasets) and of similarity (for all datasets).
Table 1 and Table 2 show the confusion matrices for Perseus and MD 1.0 performance on the iPRG2015 and dynamic range datasets respectively. We defined a "true positive" as a protein whose estimated and true abundance ratios were both greater than 2 in the same direction, with an adjusted p-value of less than 0.05. "False negatives", where the true protein abundance ratio was greater than 2 but the estimated value was not, or the adjusted p-value was greater than 0.05, were rare, but occurred once in the dynamic range dataset (in common between the two pipelines) and three times for Perseus in the iPRG2015 dataset. Table 1 and Table 2 provide more details on these missed detections.
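Under these definitions, a single protein's call can be classified as follows (an illustrative sketch; the thresholds follow the text: a 2-fold change, i.e. |log2 FC| ≥ 1, and adjusted p < 0.05):

```python
def classify_call(true_log2_fc, est_log2_fc, adj_p, alpha=0.05):
    """Label one protein as TP/FP/TN/FN for the confusion matrix.
    A call is 'positive' when |estimated log2 FC| >= 1 with adj_p < alpha
    and the estimate agrees in direction with the truth; the truth is
    'positive' when |true log2 FC| >= 1."""
    truly_changed = abs(true_log2_fc) >= 1.0
    called = adj_p < alpha and abs(est_log2_fc) >= 1.0
    same_direction = true_log2_fc * est_log2_fc > 0
    if truly_changed:
        return "TP" if (called and same_direction) else "FN"
    return "FP" if called else "TN"
```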
Pearson correlation was used to measure the similarity between the estimated log fold change and the true log fold change (according to the benchmark dataset descriptions). Scatter plots of these values are provided in Figure 4a-d. In all cases the Pearson correlation was greater than 0.9, suggesting very high accuracy.
The Pearson correlations between MD 1.0 and Perseus estimated log fold changes were 0.998, 1.0 and 0.901 for the dynamic range, iPRG2015 and HER2 studies respectively (Figures 4-6), whereas the associated log10 adjusted p-values varied slightly more, with Pearson correlations of 0.925, 0.978 and 0.901 respectively. This shows that MD 1.0 produces results consistent with those achievable on the Perseus platform, but with a completely automated workflow.
Conclusion and Outlook
Mass Dynamics 1.0 is a web-based, integrated and automated analysis and collaboration environment that facilitates label-free quantitative experiments. It currently accepts MaxQuant output and is designed to be adaptable to accept data from other preprocessing tools. The output of MD 1.0 is served to the user via web pages containing interactive figures, tables and downloads. The analysis performed by MD 1.0 is inherently reproducible, both within the platform and elsewhere, and can be easily shared or published to the community.
With the intention of broadening access to a wider range of users from computational and non-computational backgrounds, this environment provides many features that facilitate quality control, reproducibility, automation and transparency.
We demonstrated that MD 1.0 reliably and accurately reproduces results similar to the Perseus protocol on known benchmarking datasets, while producing comprehensive quality control reports that allow users to be confident about the quality of their experiments. MD 1.0 therefore constitutes a reliable, straightforward and streamlined alternative to the Perseus platform for performing LFQ analysis.
Mass Dynamics is well placed to begin expanding MD 1.0 beyond LFQ in terms of the analyses available. Enrichment analyses such as Over-Representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA) might be achieved via integrations with third-party databases such as GO, DAVID, KEGG or DrugBank. User interface improvements may provide more opportunities for annotation and sharing utilities. Lastly, leveraging the cloud-storage element of the MD 1.0 implementation, insights may be gained by cross-referencing experiment data and insights.
Author Contributions
Giuseppe Infusini wrote the MD 1.0 processing pipeline. Aaron Triantafyllidis built the web interface and set up the Amazon Web Services (AWS) compute and cloud storage infrastructure. Joseph Bloom performed the benchmarking analysis, prepared the figures and wrote the manuscript. Paula Burton, Giuseppe Infusini and Andrew Webb reviewed the manuscript. Paula Burton and Aaron Triantafyllidis interviewed and researched existing work in this space.
Competing financial interests
The authors Giuseppe Infusini, Aaron Triantafyllidis, Paula Burton and Andrew Webb declare that they are founders of Mass Dynamics, a for-profit enterprise, delivering software as a service in the processing, analysis and sharing of proteomics data. Joseph Bloom is an employee of Mass Dynamics.
Supplementary Figures and Tables
Acknowledgements
The authors would like to thank members of the Future Industries Institute at the University of South Australia for their feedback and support during the development of MD+ Discovery. The authors would also like to thank the proteomics community for their ongoing development of comprehensive tools and repositories for the analysis and sharing of proteomics data, without which the work in this manuscript would be impossible.