MSnbase, efficient and elegant R-based processing and visualisation of raw mass spectrometry data

We present version 2 of the MSnbase R/Bioconductor package. MSnbase provides infrastructure for the manipulation, processing and visualisation of mass spectrometry data. We focus on the new on-disk infrastructure, which allows the handling of large raw mass spectrometry experiments on commodity hardware, and illustrate how the package is used for elegant data processing, method development and visualisation.


Introduction
Mass spectrometry is a powerful technology to assay chemical and biological samples. It is used in routine applications with well-characterised protocols, such as in clinical settings, as well as a development platform, with the aim to improve on existing protocols and devise new ones. The complexity and diversity of mass spectrometry yield complex data of considerable size that require non-trivial processing before producing interpretable results. The complexity and size of these data constitute a significant challenge for protocol development: in addition to the development of sample processing and mass spectrometry methods that yield the raw data, it is essential to process, analyse, interpret and assess these new data to demonstrate the improvement in the technical, analytical and computational workflows.
Practitioners have a diverse catalogue of software tools at their disposal. These range from low-level software libraries, aimed at programmers to enable the development of new applications, to more user-oriented applications with graphical user interfaces, which provide a more limited set of functionalities to address a defined scope. Examples of software libraries include the Java-based jmzML 1 or the C/C++-based ProteoWizard. 2 Thermo Scientific Proteome Discoverer (Thermo Fisher Scientific), MaxQuant 3 and PeptideShaker 4 are among the most widely used user-centric applications.
In this software note, we present version 2 of the MSnbase 5 software, available from the Bioconductor 6 project. The package, like other software such as the Python-based pyOpenMS, 7 spectrum_utils 8 or Pyteomics, 9 offers a platform that lies between low-level libraries and end-user software. MSnbase provides a flexible R 10 command-line environment for metabolomics and proteomics mass spectrometry-based applications. It lays out a sound infrastructure to work with raw mass spectrometry data from MS files in mzML, mzXML, mzData or ANDI-MS/netCDF format, as well as with quantitative and proteomics identification data. The package enables manipulation (for example subsetting, filtering, or accessing specific parts thereof), detailed step-by-step processing (for example smoothing and centroiding of profile-mode MS data, or normalisation and imputation of quantitative data), analysis and visualisation of these data, and the development of novel computational mass spectrometry methods. 11 Extensive documentation and use cases are provided in package vignettes 12 and workflows. 13 Here, we focus on the new developments pertaining to raw mass spectrometry data handling and processing.

Infrastructure for raw data
In MSnbase, mass spectrometry experiments are handled as MSnExp objects. While the implementation is more complex, it is useful to schematise a raw data experiment as being composed of raw data, i.e. a collection of individual spectra, as well as spectrum-level metadata (Figure 1). Each spectrum is composed of m/z values and associated intensities. The metadata are represented by a single table with variables along the columns and each row associated to a spectrum. Among the metadata available for each spectrum are the MS level, acquisition number, retention time, precursor m/z and intensity (for MS level 2 and above), and many more.
The main feature in version 2 of the MSnbase package was the addition of different backends for raw data storage, namely in-memory and on-disk. The following code chunk demonstrates how to import data from an mzML file to create two MSnExp objects that store the data either in memory or on disk.

library("MSnbase")
raw_mem <- readMSData("file.mzML", mode = "inMemory")
raw_dsk <- readMSData("file.mzML", mode = "onDisk")

Both modes rely on the mzR 2 package to access the spectra (using the mzR::peaks() function) and the metadata (using the mzR::header() function) in the data files. Because the on-disk backend does not hold all the spectra data in memory, direct manipulations of these data are not possible. We thus implemented a lazy processing mechanism for this backend that caches any data manipulation operations in a processing queue in the object itself. These operations are then applied only when the user accesses m/z or intensity values. As an additional advantage, operations on subsets of the data become much faster, since data manipulations are applied only to data subsets instead of the full data set at once. Also, on-disk data access is parallelized by data file, ensuring a higher performance of this backend over conventional in-memory data representations. As an example, the following short analysis pipeline, which can equally be applied to in-memory or on-disk data, retains MS2 spectra acquired between 1000 and 3000 seconds, extracts the m/z range corresponding to the TMT 6-plex range, and focuses on the MS2 spectra with a precursor intensity greater than 11 × 10^6 (the median precursor intensity).

ms <- ms %>%
    filterRt(c(1000, 3000)) %>%
    filterMz(c(120, 135))

ms[precursorIntensity(ms) > 11e6, ]
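The processing queue mechanism can be sketched in a few lines of plain R. This is a simplified illustration only, not MSnbase's actual implementation; the new_lazy(), queue_op() and get_values() helpers are invented for this sketch.

```r
## Simplified sketch of a lazy processing queue (illustration only,
## not MSnbase's implementation).
new_lazy <- function(intensities)
    list(data = intensities, queue = list())

## Cache an operation in the queue instead of applying it immediately.
queue_op <- function(x, fun) {
    x$queue <- c(x$queue, fun)
    x
}

## Apply all cached operations only when the values are accessed.
get_values <- function(x)
    Reduce(function(d, fun) fun(d), x$queue, init = x$data)

x <- new_lazy(c(0, 10, 250, 3))
x <- queue_op(x, function(i) ifelse(i < 50, 0, i))  # zero out low intensities
x <- queue_op(x, function(i) i / max(i))            # scale to the maximum
get_values(x)  # both operations are executed only here
```

Subsetting such an object is cheap because only the metadata and the queue need to be handled; the cached operations are later applied to the requested subset only.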
As shown in Figure 2 (c), this lazy mechanism is significantly faster than its application on in-memory data. The advantageous reading and execution times and memory footprint of the on-disk backend are made possible by retrieving only spectra data from the selected subset, hence avoiding access to the full raw data. Once access to the spectra m/z and intensity values becomes mandatory (for example for plotting), the in-memory backend becomes more efficient, as illustrated in Figure 2 (d). The benefit of accessing data in memory is however reduced by underlying copies that are performed during the subsetting operation.
When subsetting an in-memory MSnExp into a new, smaller in-memory MSnExp instance, the matrices that contain the spectra for the new object are copied, thus leading to increased execution time and (transient, if the original data are replaced) memory usage. Figure 2 (d) shows that the larger the subset, the smaller the benefits of an in-memory backend become. The example with the 6103 spectra, corresponding to the full data (i.e. all spectra are already in memory and there is no memory management overhead), is representative of memory access only and constitutes the best-case scenario.
The on-disk backend has become the preferred backend for large data, and the only viable alternative when the size of the data exceeds the available RAM and/or when several MS levels are to be loaded and handled simultaneously. The in-memory backend can still prove useful in cases when small MS2-only data are to be analysed, and will remain available in future versions of MSnbase.

Prototyping
The MSnExp data structure and its interface constitute an efficient prototyping environment for computational method development. We illustrate this by demonstrating how to implement the BoxCar 15 acquisition method. In a nutshell, BoxCar acquisition aims at improving the detection of intact precursor ions by distributing the charge capacity over multiple narrow m/z segments, thus limiting the proportion of highly abundant precursors in each segment. A full scan is reconstructed by combining the respective adjacent segments of the BoxCar acquisitions. The MSnbaseBoxCar package 16 is a small package that demonstrates this. The simple pipeline is composed of three steps, described below and illustrated with code from MSnbaseBoxCar in the following code chunk.
1. Identify and filter the groups of spectra that represent adjacent BoxCar acquisitions (Figure 3 (b)). This can be done using the filterString metadata variable, which identifies BoxCar spectra by their adjacent m/z segments, with the bc_groups() function, and filtering the relevant spectra with filterBoxCar().
2. Remove any signal outside the BoxCar segments using the bc_zero_out_box() function from MSnbaseBoxCar (Figures 3 (c) and (d)).
After processing of the BoxCar data, the final object can either be further analysed using MSnbase or written back to disk as an mzML file using writeMSData() for processing with other tools.
All the functions for the processing of BoxCar spectra and segments in MSnbaseBoxCar were developed using existing functionality implemented in MSnbase, illustrating the flexibility and adaptability of the MSnbase package for computational mass spectrometry method development.
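The reconstruction of a full scan from adjacent segments can be sketched on toy peak tables. The combine_segments() helper and the segment values below are invented for illustration; MSnbaseBoxCar performs this step with MSnbase's combineSpectra() function.

```r
## Toy BoxCar segments as m/z-intensity tables (values invented).
seg1 <- data.frame(mz = c(400.1, 410.2), intensity = c(100, 250))
seg2 <- data.frame(mz = c(425.0, 435.7), intensity = c(80, 40))
seg3 <- data.frame(mz = c(450.3, 460.9), intensity = c(120, 60))

## Concatenate the segments and order the peaks by m/z to
## reconstruct a single full scan (cf. combineSpectra() in MSnbase).
combine_segments <- function(...) {
    full <- do.call(rbind, list(...))
    full[order(full$mz), ]
}

full_scan <- combine_segments(seg2, seg1, seg3)
nrow(full_scan)  # 6 peaks spanning all three segments
```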

Visualisation
The R environment is well known for the quality of its visualisation capacity. This also holds true for mass spectrometry. 18-21 Here, we conclude the overview of version 2 of the MSnbase package by highlighting the flexibility of the software to visualise and assess the efficiency of raw data processing.
The first public commit to the MSnbase GitHub repository was in October 2010. Since then, the package has benefited from 12 contributors 22 who added various features, some particularly significant ones such as the on-disk backend described herein. Contributions to the package are explicitly encouraged, rewarded by an official contributor status and governed by a code of conduct.
According to MSnbase's Bioconductor page, there are 36 Bioconductor packages that depend on, import or suggest it. Among these are pRoloc 23 to analyse mass spectrometry-based spatial proteomics data, msmsTests, 24 DEP, 25 DAPAR and ProStaR 26 for the statistical analysis of quantitative proteomics data, RMassBank 27 to process metabolomics tandem MS files and build MassBank records, MSstatsQC 28 for longitudinal system suitability monitoring and quality control of targeted proteomic experiments, and the widely used xcms 29 package for the processing and analysis of metabolomics data. MSnbase is also used in non-R/Bioconductor software, such as for example IsoProt, 30 which provides a reproducible workflow for iTRAQ/TMT experiments. The BioContainers 31 project offers a dedicated container for the MSnbase package, thus facilitating the reuse of the package in third-party pipelines. MSnbase currently ranks 101 out of 1823 packages based on the monthly downloads from unique IP addresses, tallying over 1000 downloads from unique IP addresses each month.
As is custom with Bioconductor packages, MSnbase comes with ample documentation.
Every user-accessible function is documented in a dedicated manual page. In addition, the package offers 5 vignettes, including one aimed at developers. The package is checked nightly on the Bioconductor servers: it implements unit tests covering 72% of the code base and, through its vignettes, also provides integration testing. Questions from users and developers are answered on the Bioconductor support forum as well as on the package GitHub page. The package provides several sample and benchmarking datasets, and relies on other dedicated experiment packages such as msdata 32 for raw data or pRolocdata 23 for quantitative data.
MSnbase is available on Windows, Mac OS and Linux under the open source Artistic 2.0 license and easily installable using standard installation procedures.
The growth of MSnbase and the user support provided over the years attest to the core maintainers' commitment to long-term development, and to the quality and maintainability of the code base.

Discussion
We have presented here some important functionality of MSnbase version 2. The new on-disk infrastructure enables large-scale data analyses, 33 either using MSnbase directly or through packages that rely on it, such as xcms. We have also illustrated how MSnbase can be used for standard data analysis and visualisation, and how it can be used for method development and prototyping.
The version of MSnbase used in this manuscript is 2.14.2. The main features presented here have been available since version 2.0. The code to reproduce the analyses and figures in this article is available at https://github.com/lgatto/2020-msnbase-v2/.

Associated Content
Supplementary file 1: script documenting the processing of 1182 mzXML files (5,773,464 spectra) using MSnbase.

Introduction
This document describes the handling of mass spectrometry data from large experiments using the MSnbase package, and more specifically its on-disk backend. For demonstration purposes, the MassIVE data set MSV000080030 is used. It consists of over 1,000 mzXML files from swab samples collected from the hands and various personal objects of 80 volunteers.

Data handling and analysis with MSnbase
In this section we demonstrate data handling and access by MSnbase on a large experiment consisting of more than 1,000 data files.
To reproduce the analysis described in this document, download the MSV000080030 folder from ftp://massive.ucsd.edu/MSV000080030/ and place it into the same folder as this document.
Below we load the required libraries and define the files to be analyzed. The data set consists of 1182 mzXML files. We next load the data using the two different MSnbase backends, "inMemory" and "onDisk". For the in-memory backend, due to its larger memory requirements, we import the data only from a subset of the files.
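The file-definition step can be sketched as follows. The directory and files below are created on the fly purely for illustration; with the real data, fls would point at the downloaded mzXML files of the experiment.

```r
## Create a few dummy files to illustrate defining the file set
## (with the real data, dir() would point at the MSV000080030 folder).
d <- file.path(tempdir(), "MSV000080030")
dir.create(d, showWarnings = FALSE)
file.create(file.path(d, c("sample1.mzXML", "sample2.mzXML", "README.txt")))

fls <- dir(d, pattern = "mzXML$", full.names = TRUE)
length(fls)  # 2: only the mzXML files are retained
```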
ms_dsk <- readMSData(fls, mode = "onDisk")

Below we count the number of spectra per MS level of the whole experiment. Note that the in-memory MSnExp object contains only MS2 spectra (in total 2140520) from a subset of the data files. However, the data import was much slower: over ~12 hours for the in-memory backend, while creating the on-disk object from the full data set took ~3 hours.
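Counting spectra per MS level amounts to tabulating the MS-level metadata, i.e. table(msLevel(ms_dsk)) on an MSnExp object; the idiom is sketched here on an invented MS-level vector.

```r
## With an MSnExp one would call table(msLevel(ms_dsk));
## the MS levels below are invented for illustration.
ms_levels <- c(1, 2, 2, 2, 1, 2, 3, 2, 2)
table(ms_levels)  # number of spectra per MS level
```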
Next we subset the on-disk object to contain the same set of spectra as the in-memory MSnExp and compare their memory footprints. For this combined subsetting and data access operation the on-disk backend performed better than the in-memory MSnExp, while also requiring much less memory.
Next we extract all MS2 spectra with a retention time between 50 and 60 seconds and a precursor m/z of 108.5362 (+/- 5 ppm). This subsetting operation is performed on the on-disk MSnExp object representing the full experiment with the 1182 data files/samples. To assess the performance of the following operations we use system.time calls that record the elapsed time in seconds. In total length(ms_sub) spectra were selected from 928 data files/samples. The plot below shows the data for the first spectrum. Since there seems to be quite some background noise in the MS2 spectrum, we next remove peaks with an intensity below 50 by first replacing their intensities with 0 (with the removePeaks call) and subsequently removing all 0-intensity peaks from each spectrum with the clean call. In addition, we normalize each spectrum by dividing the spectrum's intensities by its maximum intensity.
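The subsetting and cleaning steps above can be sketched on plain vectors. The 5 ppm window computation is generic arithmetic; the thresholding, cleaning and normalisation mimic MSnbase's removePeaks(), clean() and normalize() methods on an invented toy spectrum.

```r
## 5 ppm tolerance window around the precursor m/z of 108.5362.
mz_target <- 108.5362
tol <- mz_target * 5 / 1e6
mz_range <- c(mz_target - tol, mz_target + tol)

## Toy MS2 spectrum (values invented).
mz  <- c(100.1, 100.9, 101.5, 102.2, 103.0)
int <- c(12, 300, 45, 150, 8)

## removePeaks() analogue: set intensities below 50 to 0.
int[int < 50] <- 0
## clean() analogue: drop the 0-intensity peaks from both vectors.
keep <- int > 0
mz  <- mz[keep]
int <- int[keep]
## normalize() analogue: divide by the maximum intensity.
int <- int / max(int)
int  # now 1.0 and 0.5
```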

MSnbase relies on the mzR package 2 to import raw mass spectrometry data from one of the many community-maintained open standards formats (mzML, mzXML, mzData or ANDI-MS/netCDF) and provides a rich and principled interface to manipulate such objects. The code chunk below illustrates such an object as displayed in the R console, with an enumeration of the metadata fields.

> show(ms)
MSn experiment data ("OnDiskMSnExp")
Object size in memory: 0.54 Mb
- - - Spectra data - - -
 MS level(s): 1 2 3
 Number of spectra: 994
 MSn retention times: 45:27 - 47:6 minutes
- - - Processing information - - -

Figure 1: Schematic representation of what is referred to by raw data: a collection of mass spectra and a table containing spectrum-level annotations along the rows. Raw data are imported from one of the many community-maintained open standards formats (mzML, mzXML, mzData or ANDI-MS/netCDF).
The former is the legacy storage mode, implemented in the first version of the package, that loads all the raw data and the metadata into memory upon creation of the in-memory MSnExp object. This solution doesn't scale for modern large datasets, and was complemented by the on-disk backend. The on-disk backend only loads the metadata into memory when the on-disk MSnExp is created, and accesses the spectra data (i.e. m/z and intensity values) in the original files on disk only when needed (see below and Figure 2 (d)), such as for example for plotting. There are two direct benefits to using the on-disk backend, namely faster reading and a reduced memory footprint. Figure 2 shows 5-fold faster reading times (a) and over a 10-fold reduction in memory usage (b).

Figure 2: (a) Reading time (triplicates, in seconds) and (b) data size in memory (in MB) to read/store 1, 5 and 10 files containing 1431 MS1 (on-disk only) and 6103 MS2 (on-disk and in-memory) spectra. (c) Filtering benchmark assessed over 10 iterations on in-memory and on-disk data containing 6103 MS2 spectra. (d) Access time to spectra for the in-memory (left) and on-disk (right) backends for 1, 10, 100, 1000, 5000 and all 6103 spectra. Benchmarks were performed on a Dell XPS laptop with an Intel i5-8250U 1.60 GHz processor (4 cores, 8 threads), 7.5 GB RAM and an SSD drive, running Ubuntu 18.04.4 LTS 64-bit. The data used for the benchmarking are a TMT 4-plex experiment acquired on an LTQ Orbitrap Velos (Thermo Fisher Scientific), available in the msdata package and described in 14.

3. Using the combineSpectra function from MSnbase, combine the cleaned BoxCar spectra into a new, full spectrum (Figure 3 (e)).

Figure 3: BoxCar processing with MSnbase. (a) Standard full scan with (b) three corresponding BoxCar scans showing the adjacent segments. Figure (c) shows the overlapping intact BoxCar segments and (d) the same segments after cleaning, i.e. where peaks outside of the segments were removed. The reconstructed full scan is shown on panel (e). Spectra visualisations, as shown here, rely on the ggplot2 17 package.