ABSTRACT
Glycomics targets released glycans from proteins, lipids and proteoglycans. High-throughput glycomics is based on mass spectrometry (MS), which increasingly depends on the exchange of data with databases and the use of software. This requires an agreed format for accurate recording of experiments, developing consistent storage modules and granting public access to glycomic MS data. The introduction of the MIRAGE (Minimum Information Required for A Glycomics Experiment) reporting standards for glycomics was the first step towards automating glycomic data recording. This report describes a glycomic e-infrastructure utilizing a well-established glycomics recording format (GlycoWorkbench) and a dedicated web tool for submitting MIRAGE-compatible MS information into a public experimental repository, UniCarb-DR. The submission of data to UniCarb-DR should be a part of the submission process for publications with glycomics MSn that conform to the MIRAGE guidelines. The structure of this pipeline allows submission of most MS workflows used in glycomics.
Author contributions: M.A.R.-M. designed and led the development of the database; J.M., O.H. and P.A. contributed to the UniCarb-DR implementation and designed the database and web interface; C.J. and N.G.K. contributed to the design of the upload forms; C.J., V.V., K.M., C.A., R.L.M. and T.Z. contributed data for beta-testing of the data submission tool and feedback; N.P.A., D.S. and K.F.A.-K. contributed to setting up the communication with GlyTouCan; F.Le. contributed to the design of Proteios for glyco MS data; W.B.S., D.K., P.M.R., M.W., C.K., N.H.P., K.F.A.-K., F.Li. and N.G.K. formed the advisory core that contributed feedback on repository design and advised on the implementation of glycomic MS/reporting guidelines and the design of databases and coding; K.F.A.-K. and N.G.K. jointly received the funds for UniCarb-DR and GlyTouCan communication; F.Li. and N.G.K. jointly received funds for development of the UniCarb-DR infrastructure and structured the manuscript; N.G.K. proposed the UniCarb-DR concept and managed the project. M.A.R.-M., F.Li. and N.G.K. jointly wrote the manuscript with input from the other authors. All authors signed off on the final version.
INTRODUCTION
Posttranslational modifications of proteins play an essential role in modifying the 20 amino acids used in protein synthesis, thereby extending the functions and regulating the activities of the synthesized proteins. A census of all possible protein forms, now commonly called proteoforms, was recently estimated1. In this renewed view of protein diversity, glycoforms are expected to play a major role. Glycosylation is the basis of several biological events that include structural stability, recognition, immunological responses, cancer development and the attachment of pathogens to host cells as the first step in the process of infection2, 3. Furthermore, defects in glycosylation result in a range of human diseases, including Congenital Disorders of Glycosylation (CDG) caused by mild loss of function in N-linked4 and O-linked oligosaccharide5 biosynthesis, both of which result in severe illness, organ failure and premature death.
Mass spectrometry (MS) has become the central tool for the study of protein glycosylation, largely due to its speed, high sensitivity and the structural information it delivers6, 7. The range of experimental techniques and the nature of MS systems mean that large amounts of data are generated. Glycan profiling by MS of free and released glycans makes use of a considerable variety of experimental techniques and instrumentation to increase speed, depth and efficiency. This is summarized in Figure 1 (adapted from8), where a generic workflow for glycan analysis is shown as well as the different options available for each task.
A factual assessment process to acknowledge the existence of a literature-reported glycan and its localization (e.g. tissue and/or protein) demands a thorough description of experimental conditions and instrumental settings. The reproducibility of results also requires a disclosure of how the structural assignment was performed, e.g. manually or software-assisted. Therefore, to ensure transparency and reproducibility, there is a need for comprehensive reporting of experimental methods in publications. Other omics fields have addressed these concerns, which motivated the respective science communities to develop guidelines for the reporting, collecting and distributing of data and information. This started with MIAME, launched for the handling of microarrays9. The MIAPE guidelines for proteomics followed shortly after10. Since these early efforts, many guidelines have been published in a wide range of applications, such as STRENDA in enzymology11 or CIMR in metabolomics12, 13. There are currently 120 reporting guidelines published and registered with the BioSharing portal14. The glycomics community proposed the MIRAGE (Minimum Information Required for A Glycomics Experiment) project in 2011, supported by experts from the diverse areas of glycomics research15. This has resulted in the following set of guidelines: sample preparation16 (doi:10.3762/mirage.1), MS analysis17 (doi:10.3762/mirage.2), glycan microarray analysis18 (doi:10.3762/mirage.3), and liquid chromatography guidelines (manuscript in preparation).
These glycomic guidelines need to be viewed in the larger picture of e(lectronic)-infrastructure. In 2008 an NIH “Frontiers in glycomics” work group stressed in its white paper the need for a curated glyco-structure database with long-term funding19. This database should contain associated information about experimental and biosynthetic data. Attempts to create this “super” glycomic database were pioneered already in the 1980s with Carbank20, and have been followed by GlycomeDB21, GLYCOSCIENCES.de22, GlycosuiteDB23 and UniCarbKB24. The latest release of GlyConnect (https://glyconnect.expasy.org) at the Swiss Institute for Bioinformatics provides a long-term solution for a stable and financially supported database of glycoconjugates (manuscript in preparation). There is also now an agreement in the glyco-research field that all proposed glycan structures should be registered with a unique identifier in the GlyTouCan registry. This provides a foundation for developing complementary repositories, where unique glycan records can be associated with additional information, including incorporation of the MIRAGE reporting guidelines.
Presently, glycomic experimental MS data are collected and stored locally in individual research laboratories. Data sharing between laboratories that perform comparable experiments with subtle differences in methods is often precluded by incomplete recording of experimental protocols as well as conflicting hardware and software parameters. However, glycomics is practiced in laboratories worldwide that address glycobiological questions in cancer, inflammation and infection, to name only a few, and critically they require tools to share their findings and data. This is especially true now that powerful software tools can interpret large, complex MS datasets, making it impossible to disclose all data as part of a traditional publication. The database UniCarb-DB was set up in 200925, 26 to establish a framework for providing access to experimental MS data, including both fragmentation spectra with associated structures and metadata about biological origin. Currently, it contains structural and fragmentation data of O-glycans, N-glycans and glycosaminoglycans obtained in positive and negative ion modes. Since its introduction, several versions of UniCarb-DB have been released, mainly to improve the glycomic data quality, increase the number of entries and advance the usability of the application. Reference MS fragmentation spectra of glycans are also being assembled at the NIST Glycan Mass Spectral Reference Library (https://chemdata.nist.gov/glycan/spectra).
Progress in software development for glycomics has been slow but steady. The early GlycosidIQ automated the comparison of observed fragments with theoretical glycofragments derived from a structural database27. This approach has been adopted in commercial software28, and expanded into matching glycopeptide data in the Glycomaster DB application29. Other approaches convert spectra into structures relying on spectral libraries26, 30. More advanced tools for glycomics use partial de novo sequencing31, including the recently published Glycoforest32. High-throughput glycomic annotation (GRITS Toolbox, www.grits-toolbox.org/) and quantitation tools33 are now available and increase the need for a common data exchange format, so that these data can be publicly accessible and scrutinized. If these data can be provided in an agreed format, the validation and curation of structures deposited into the GlyTouCan registry will become feasible.
In this paper, we propose a workflow to collect, process and store experimental data in compliance with the MIRAGE MS and sample preparation guidelines in UniCarb-DR (DR = Data Repository), which benefits from the previously developed UniCarb-DB framework of quality LC-MS/MS data and structural assignments25, 26. UniCarb-DR incorporates both the MIRAGE MS and sample preparation guidelines. It also provides an electronic submission tool that guides users through initial data validation to ensure all required information is provided. Data is entered in a structured form (template, http://unicarb-dr.biomedicine.gu.se/generate) that can be submitted to UniCarb-DR together with GlycoWorkbench files, including structures, spectra, fragmentation annotation and metadata with scoring parameters, spectral quality and the use of orthogonal methods for structural assignments.
METHODS
System overview and implementation
The UniCarb-DR repository is based on the UniCarb-DB database format25, 26, adapted to include tables and layouts for MIRAGE information. The repository uses PostgreSQL as its database management system. The UniCarb-DR web application is supported by the Play Framework. The Play Framework makes use of the MVC paradigm, where the elements of an application adopt one of three roles: Model, View or Controller. The Model is written in Java and represents the data and how the data is manipulated. The View is the layer that is displayed to users in the web interface. In UniCarb-DR, the View is written in Scala and JavaScript and uses the jQuery, Bootstrap and SpeckTackle libraries for data visualization. The Controller layer, also written in Java, controls the data that flows to the Model and updates the View when the data change in response to user actions.
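The division of roles in the MVC paradigm can be illustrated with a minimal sketch. It is written in Python for brevity and is not the UniCarb-DR code, which uses Java and Scala under the Play Framework; the class names, function names and record fields below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StructureModel:
    """Model: holds the data and the rules for changing it."""
    records: list = field(default_factory=list)

    def add_record(self, name: str, precursor_mz: float) -> None:
        # Hypothetical validation rule, for illustration only.
        if precursor_mz <= 0:
            raise ValueError("precursor m/z must be positive")
        self.records.append({"name": name, "precursor_mz": precursor_mz})


def render_view(model: StructureModel) -> str:
    """View: turns the current model state into what the user sees."""
    lines = [f"{r['name']}: m/z {r['precursor_mz']:.2f}" for r in model.records]
    return "\n".join(lines) if lines else "(no records)"


class SubmissionController:
    """Controller: receives a user action, updates the model, refreshes the view."""

    def __init__(self, model: StructureModel):
        self.model = model

    def submit(self, name: str, precursor_mz: float) -> str:
        self.model.add_record(name, precursor_mz)
        return render_view(self.model)
```

In this arrangement the view never modifies data directly, so the validation rules stay in one place regardless of how the interface changes.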
Testing of the MIRAGE glycomic workflow
In order to develop and test the MIRAGE parameter on-line form and the submission tool, we selected beta-test sites that generated glycomic LC-MS2 and MS2 data from N-linked, O-linked and proteoglycan-type protein oligosaccharides (http://unicarb-dr.biomedicine.gu.se/references). MIRAGE data spreadsheets were generated via the described on-line submission form available at http://unicarb-dr.biomedicine.gu.se/generate, where LC parameters were also recorded. Generated spreadsheets from this submission are available in Supplementary Material. Individual centroided MS2 spectra were copied manually into GlycoWorkbench34 .gwp files together with the identified structures assigned from peak matching or manual interpretation. Examples of GlycoWorkbench files are also available in Supplementary Material. Structures were assigned based on MS2 spectra and/or retention time, and the quality of matching was manually validated. Reference LC-MS .raw files were uploaded to Swegrid via the Proteios submission portal, which also converted the data into .mzML format.
RESULTS
UniCarb-DR application to support using MIRAGE guidelines
In order to implement the glycomic MIRAGE guidelines15, we identified and adopted the two existing guidelines for glycomic sample preparation16 and MS17. In addition, an HPLC experimental record was developed, expanding on the guidelines to enable recording of LC-MS parameters. For this first version, efforts targeted the most essential implementations (the qualitative/structural information), and thus the quantification aspects of the guidelines were only addressed on a superficial level. This is justified considering that workflows for quantitative glycomics are still evolving and basic methods and software tools are not yet available.
To support the e-workflow, we created the UniCarb-DR (Data Repository) (http://unicarb-dr.biomedicine.gu.se/) that facilitates submission of glycomics MSn data conforming to the MIRAGE guidelines as part of a publication submission process (Figure 2). This will serve as the interim storage of experimental MS fragment data and structures before curation and transition into the UniCarb-DB database. A submission tool allows the user to browse and re-enter submitted data before it is uploaded to the repository. Since most data is likely to be submitted before publication, the user will need to refer to it as a “manuscript”. If data is uploaded after publication, the PubMed ID (PMID) available from (https://www.ncbi.nlm.nih.gov/pubmed/) will be required. The deposition of data in the repository first requires the registration and login as a user at http://unicarb-dr.biomedicine.gu.se/login and then providing a number of files and information (Figure 2) including:
A compiled file with MIRAGE data (proposed format is spreadsheets).
Compiled information about structures (proposed format is GlycoWorkbench)34.
Location of publicly accessible unprocessed MS files
Unique structure identifier (this information is automatically generated by communication with GlyTouCan structural repository35).
The specific information for each of these four items is described below.
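As an illustration of the kind of completeness check a submission tool could run on the items listed above, the following Python sketch verifies that the user-supplied parts are present before upload. It is a hypothetical example rather than the UniCarb-DR implementation: the `Submission` class, its field names and the file-extension rules are assumptions. The GlyTouCan identifier is not checked because it is generated automatically after submission.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Submission:
    mirage_spreadsheet: Optional[str]  # path to the MIRAGE record (.xlsx)
    gwp_files: List[str]               # GlycoWorkbench .gwp files
    raw_data_uri: Optional[str]        # public location of unprocessed MS files
    reference: str                     # "manuscript" or a PubMed ID (PMID)


def missing_items(sub: Submission) -> List[str]:
    """Return a list of human-readable problems; empty if the bundle is complete."""
    problems = []
    if not sub.mirage_spreadsheet or not sub.mirage_spreadsheet.lower().endswith(".xlsx"):
        problems.append("MIRAGE spreadsheet (.xlsx)")
    if not sub.gwp_files or not all(f.lower().endswith(".gwp") for f in sub.gwp_files):
        problems.append("GlycoWorkbench file(s) (.gwp)")
    if not sub.raw_data_uri:
        problems.append("location of unprocessed MS files")
    # Unpublished data is referred to as "manuscript"; published data needs a PMID.
    if sub.reference != "manuscript" and not sub.reference.isdigit():
        problems.append("reference ('manuscript' or a numeric PMID)")
    return problems
```

A complete bundle yields an empty list, so the caller can simply test `if not missing_items(sub)` before proceeding to upload.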
Step 1) MIRAGE record-Spreadsheets
Experimental data needs to be provided in spreadsheet form with data fields reflecting the principal structure of the MIRAGE guidelines. Prefilling and downloading of the MIRAGE-compliant spreadsheets is possible in the web form (http://unicarb-dr.biomedicine.gu.se/generate). Three different spreadsheets are available: 1) sample preparation, 2) LC and 3) MS guidelines. These can be generated individually or combined into one file with several sheets. These spreadsheets can be modified off-line using common software packages such as Excel 2007 or later (saved as xlsx file type).
Step 2) MIRAGE record-GlycoWorkbench
The open source software GlycoWorkbench, developed by EuroCarbDB to assist manual interpretation of MS data34, is used to generate initial glycan structures. GlycoWorkbench provides a straightforward interface to draw glycan structures in different cartoon formats (SNFG, Oxford, IUPAC) using the embedded GlycanBuilder module34. Glycan structures are stored in a linear format (.gws) for easy parsing and recording into databases. All recorded data is stored in an XML format (.gwp file extension) (Figure 3). GlycoWorkbench allows the recording of individual structures as a “Scan” with associated fragment data (fragment list imported from MS software as centroided data). The first level should be MS2. Hence, a GlycoWorkbench file that includes both several structures and MS2 data needs to be defined by several “Scans” (one for each structure) directly under the “Workspace” item.
Figure 3 shows the sections that are typically included in a .gwp file. A tag is delimited by the “<” and “>” symbols and defines the different elements in a file. These elements are delimited by a start tag, e.g. <scan>, and an end tag, e.g. </scan>. The example shown in Figure 3 represents a single structure, somewhat simplified to highlight important MIRAGE tags. In order to be MIRAGE-compatible, we introduced a ‘Notes’ section for recording of orthogonal assignment methods, scoring and validation (see section “Interpretation of the MIRAGE guidelines for parameter organization”). The format of the ‘Notes’ section needs to be respected in order to upload its information to UniCarb-DR. The proposed format for the Figure 3 ‘Notes’ section is expanded in Supplementary Material.
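The "one Scan per structure, directly under the Workspace" layout lends itself to simple XML parsing. The Python sketch below reads a deliberately simplified, hypothetical document; the element and attribute names (`workspace`, `scan`, `structure`, `notes`, `peak`) are assumptions made for illustration and do not reproduce the actual .gwp schema, which should be taken from Figure 3 and the Supplementary Material.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a .gwp file (not the real schema).
SIMPLIFIED_GWP = """
<workspace>
  <scan name="structure-1">
    <structure>placeholder-gws-string-1</structure>
    <notes>orthogonal method: retention time</notes>
    <peak mz="384.15" intensity="100.0"/>
    <peak mz="222.10" intensity="35.5"/>
  </scan>
  <scan name="structure-2">
    <structure>placeholder-gws-string-2</structure>
    <notes></notes>
  </scan>
</workspace>
"""


def read_scans(xml_text):
    """Return one dict per scan found directly under the root element."""
    root = ET.fromstring(xml_text)
    scans = []
    for scan in root.findall("scan"):  # direct children only, not nested sub-scans
        peaks = [(float(p.get("mz")), float(p.get("intensity")))
                 for p in scan.findall("peak")]
        scans.append({
            "name": scan.get("name"),
            "structure": scan.findtext("structure", default=""),
            "notes": (scan.findtext("notes") or "").strip(),
            "peaks": peaks,
        })
    return scans


for s in read_scans(SIMPLIFIED_GWP):
    print(s["name"], len(s["peaks"]), "peaks")
```

The second scan in the example carries a structure but no peaks, mirroring the case of a structure recorded without fragment data.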
Step 3) MIRAGE record-Native MS files deposit
There is currently no organization comparable to the proteomic PRIDE36 and the ProteomeXchange consortium37 committed to hosting glycomics/glycoproteomics MS experimental data. Without this option, we conceived a new scheme for how glycomic MS data could be made available. We adapted a web-based proteomic MS Laboratory Information Management System called Proteios Software Environment (http://www.proteios.org)38. Using this system, files can be managed and uploaded to storage accessible through WebDAV. The concept implementation uses Swestore (http://www.snic.se/allocations/swestore/) as online storage with regulated file access through dCache (https://www.dcache.org). From Proteios it is subsequently possible to export an entire project with all the required data into appropriate formats for curation. Native MS data (.raw and mzML) is made publicly available at Swestore URIs (Uniform Resource Identifiers) that can be provided in the MIRAGE spreadsheets before submission to UniCarb-DR. Storing the unprocessed data as part of an open access policy requires long-term commitment from national/international life science data storage organizations. In order to be compliant with MIRAGE guidelines, the organizations or institutions that provide long-term data access beyond individually funded glycomic efforts will have to be identified. We are currently engaged with JPOST (Japan ProteOme STandard Repository) to develop a pipeline to push unprocessed glycomic MS data from Proteios for permanent storage to a repository designed for glycomic and glycoproteomic data (GlycoPOST, unpublished).
Step 4) MIRAGE record-Glycan structure registration
GlyTouCan35 is a glycan structure repository promoted by the glycocommunity as the prime location for generating unique identifiers for glycan structures. It is recommended that glycan structures should be submitted to this repository as part of a publication of glycomic data. In order to avoid duplicate submissions to both UniCarb-DR and GlyTouCan we developed a tool that assesses whether the structures submitted to UniCarb-DR are already deposited in GlyTouCan. In this case the GlyTouCan ID provides a link to UniCarb-DR. If a UniCarb-DR submitted structure is not available in GlyTouCan, a new ID will be generated and communicated to UniCarb-DR. This process will commence after the submission of data to UniCarb-DR.
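The lookup-then-register logic described here can be sketched as follows. The example uses an in-memory stand-in for a structure registry and does not call or reproduce the GlyTouCan API; the class, the function names and the temporary identifier scheme are all hypothetical.

```python
import hashlib


class InMemoryRegistry:
    """Stand-in for a structure registry; NOT the GlyTouCan API."""

    def __init__(self):
        self._ids = {}

    def lookup(self, structure):
        """Return the identifier of a known structure, or None."""
        return self._ids.get(structure)

    def register(self, structure):
        # Hypothetical identifier scheme, for illustration only.
        new_id = "TMP" + hashlib.sha1(structure.encode()).hexdigest()[:5].upper()
        self._ids[structure] = new_id
        return new_id


def resolve_identifier(structure, registry):
    """Return (identifier, newly_registered) for a submitted structure."""
    existing = registry.lookup(structure)
    if existing is not None:
        return existing, False              # already deposited: reuse the ID
    return registry.register(structure), True  # not found: request a new ID
```

Submitting the same structure twice returns the same identifier, with the second call reporting that nothing new was registered, which is the behaviour needed to avoid duplicate entries.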
Interpretation of the MIRAGE guidelines for parameter organization
The MIRAGE guidelines are generic and flexible in order to collect information from different types of experiments aiming to study glycoconjugates. However, the use of commonly defined vocabularies is required for databasing in order to easily compare data within UniCarb-DR as well as data shared with other glycomic and life science databases. To preserve the flexibility of the MIRAGE guidelines in the reporting, we provide “free text” fields to describe experiments. A rigorous reporting language is implemented only for key MIRAGE parameters (tissue, MS device). Inspired by the organization of PRIDE39, four different types of formats of the MIRAGE parameters were encoded in UniCarb-DR (Table 1) and are outlined in Supplementary Material.
Upload of MIRAGE compatible MS fragmentation spectra to UniCarb-DR
MIRAGE-compliant data sets generated by the web interface, along with data stored in .gwp files of both individual fragment spectra and structures, can themselves be submitted as supplementary data in a given publication. We also propose uploading this collected and structured glycomic information (spreadsheets and .gwp files). The upload allows fully and partly assigned structures, as can be seen in Figure 4. The reporting of orthogonal methods (i.e. NMR, HPLC retention time mapping, and chemical/enzymatic treatment) also justifies that UniCarb-DR accepts structures where MS but not MSn has been collected. In Figure 4 there are examples from reference 1 in UniCarb-DR (http://unicarb-dr.biomedicine.gu.se/references/1) of assigned structures both with and without associated fragment data. The latter structures would have been assigned based on retention time (RT) and biosynthetic knowledge about the constituting monosaccharides, linkage position and configuration. We have assembled an expandable list of treatments and orthogonal methods for isolation/characterisation (Table 1 and supplementary spreadsheet). Current records in UniCarb-DR have been uploaded using data generated in various laboratories by researchers in the author list. During this process we noted that the MIRAGE guidelines were focused on the overall description of the experiment to encourage the use of the submission tool implementing the MIRAGE guidelines. However, the requirement to record the full information about individual structures (e.g. scoring and orthogonal method validation) is time-consuming. Hence, UniCarb-DR also accepts data with only partial MIRAGE records for an individual structure, i.e. at least the parent ion mass has to be provided, while information about scoring and validation may be omitted.
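The acceptance rule for individual structure records can be expressed compactly. The following Python sketch is an illustration under assumed field names (`precursor_mz`, `scoring`, `orthogonal_methods`), not the repository's actual validation code: a record without a parent ion mass is rejected, a record with the mass but without scoring or validation details is accepted as partial, and a record with all three is full.

```python
def classify_record(record):
    """Return 'rejected', 'partial' or 'full' for one structure record."""
    # The parent ion mass is the minimum requirement for any record.
    if record.get("precursor_mz") is None:
        return "rejected"
    has_scoring = bool(record.get("scoring"))
    has_validation = bool(record.get("orthogonal_methods"))
    return "full" if has_scoring and has_validation else "partial"


examples = [
    {"precursor_mz": 749.3, "scoring": {"coverage": 0.8},
     "orthogonal_methods": ["retention time"]},
    {"precursor_mz": 749.3},
    {"scoring": {"coverage": 0.8}},
]
for rec in examples:
    print(classify_record(rec))
```

Keeping the partial category explicit allows such records to be flagged for later completion instead of being silently treated as fully documented.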
DISCUSSION
The lack of an established formalized description of glycomic experiments may cause progress in the field to stall. The community needs to rely on data sharing, yet how to share remains an open question. Here, we propose a solution for a glycomics MS format using spreadsheets in combination with GlycoWorkbench files. This proposed format is one step closer to enforcing MIRAGE-compliant scientific publications in glycomics. Past experience in introducing guidelines for glycomic studies as part of publications (http://www.mcponline.org/site/misc/glycomic.xhtml) has shown that if there is a clear pathway and format, researchers will conform. With the tools developed in this report, we propose that glycomic MS recording can be adopted at the start of a project, where the spreadsheet can be completed and modified as the project evolves. Furthermore, the use of GlycoWorkbench files for saving glycomic structural discoveries can be implemented for housekeeping. The spreadsheet and .gwp formats are flexible enough to support a variety of glycomics MS applications with only limited modifications of the templates provided. Hence, journal editors can confidently request that authors are MIRAGE compatible by providing the files as supplementary data. With an increasing awareness of MIRAGE formats, a grassroots movement of those performing both research and reviewing of scientific publications will insist that not only their own data but also the data of others are MIRAGE compliant in scientific publications.
While both spreadsheets and .gwp formats are flexible and can easily be expanded to adapt to various workflows as part of a publication supplement, the UniCarb-DR upload requires a known number of variables in defined formats. Hence, the e-upload will have to be further developed to support various glycomic workflows in addition to the current LC-MS and MS module. The GlycoWorkbench structure format has been adopted in other glycomic commercial (Glycoquest, Bruker, Bremen, Germany) and academic (GRITS Toolbox, http://www.grits-toolbox.org/) software projects. Hence, automated submission to UniCarb-DR is likely to be easily implemented for these. Other workflows utilized in glycomics, such as permethylation followed by MS with or without coupled separation, are also easily implemented if the spectra are from single isomers. With several isomers present in one spectrum, these data can still be recorded in GlycoWorkbench (several structures recorded in one “Scan”), but to accommodate upload, the UniCarb-DR format will need to be modified. Similar concerns relate to workflows involving multiple MSn, even though the flexibility of GlycoWorkbench allows this to be recorded as sub-“Scans”. Other glycomic MS workflows (e.g. ion mobility MS) may need smaller or larger adaptations of UniCarb-DR. We request help from the community to identify additional major glycomic workflows so that we can adapt the submission accordingly. Templates for these alternative workflows may require modifications regarding software and experiment. We are also planning to modify the Glycoforest32 output to allow automated submission to UniCarb-DR. The Glycoforest module generating consensus spectra could be used in the curation process of UniCarb-DR. Furthermore, the Glycoforest module for matching and clustering spectra could use the repository to score similarities between samples. This could be achieved through access to stored native MS data from an outside location with associated metadata.
The commitment to store glycomic MS datasets is essential. Interpretation of glycomic LC-MS data is based only on the knowledge of the interpreters32, be it software, a human researcher or both. Hence, glycomic native MS data should be considered as libraries that can be read repeatedly to harvest new data and to ask new questions. This becomes even more important as glycomics evolves similarly to proteomics, where data-independent acquisition40 could be utilized as a means to generate data from clinical or other reference samples. These glycomic libraries could be used for harvesting information for hypothesis-driven glycomics. Similar to the PRIDE and ProteomeXchange initiatives, the glycomic community needs to voice the unanimous opinion that this is needed and target both national and international life science e-infrastructure organizations. The MIRAGE board has already identified this requirement by introducing the demand for raw data deposition in the guidelines.
A pipeline for curation of data from UniCarb-DR to UniCarb-DB is changing the perception of how curated glycomic structural databases will be generated. The top-down approach of a database generator and curator trawling the literature for information will shift to researchers submitting and managing their own data. Researchers and curators will need software tools to help in the curation process. Utilization of the metadata in the reporting guidelines, together with evaluation of the accompanying publication as well as previous and future knowledge, can aid the curation process. This process must remain objective and transparent in that information can only be added; it can be neither deleted nor altered without permission from the data supplier. The curation process will be strengthened by the MIRAGE information in order to generate unbiased statements about data quality. The mission of UniCarb-DR and -DB is to support the development of a knowledgebase of glycan structures by providing the pipeline for glycomic experimental MS data.
Acknowledgements
This work was financed by the European Union FP7 GastricGlycoExplorer ITN (No 316929), the Swedish Research Council (621-2013-5895), The Swedish Foundation for International Cooperation in Research and Higher Education (STINT) initiation grant (IB2015-5931) and institutional grant (IG2010-2050). SIB is supported by the Swiss Federal Government through the State Secretariat for Education, Research and Innovation (SERI). ExPASy is maintained by the web team of the Swiss Institute of Bioinformatics and hosted at the Vital-IT Competency Center. The MIRAGE project is supported by Beilstein-Institut.