In silico approach to accelerate the development of mass spectrometry-based proteomics methods for detection of viral proteins: Application to COVID-19

The novel coronavirus disease first identified in 2019 in Wuhan, China (COVID-19) has become a serious global public health concern. One current issue is the ability to adequately screen for the virus causing COVID-19 (SARS-CoV-2). Here we demonstrate the feasibility of shotgun proteomics as a SARS-CoV-2 screening method, through the detection of viral peptides in proteolytically digested body fluids. Using in silico methods, we generated trypsin-based shotgun proteomics methods optimized for LCMS systems from 5 commercial instrument vendors (Thermo, SCIEX, Waters, Shimadzu, and Agilent). First, we generated protein FASTA files and their protein digest maps. Second, the FASTA files were used to generate spectral libraries based on experimental data. Third, transition lists were derived from the spectral libraries using the vendor neutral and open Skyline software environment. Finally, we identified 17 post-translational modifications using linear motif modeling.


Introduction
The detection of viral proteins in body fluids can be a rapid and specific diagnostic for infection in severe acute respiratory syndrome (SARS). 1-3 During the 2003 SARS outbreak, non-MS based methods of protein detection proved to be more successful 4,5 than LCMS methods. [6][7][8] Non-MS based methods, such as western blots, enzyme-linked immunosorbent assays (ELISAs), and protein arrays, rely on antibodies for the detection of proteins. Given recent studies concerning high variability in antibody production, LCMS-based methods are an attractive alternative approach for the rapid identification of small molecules, proteins, and peptides in clinical settings where consistency is paramount. 9,10 In the 15 years since the 2003 SARS outbreak, LCMS technology has experienced a revolution led primarily by increases in the speed, sensitivity, and resolution of MS instruments. Today, protein array and antibody-based methods are falling out of favor in both research and clinical diagnostics, due in large part to the improvements in LCMS technology. 11,12 A review of this growth by Grebe and Singh described a clinical lab with no LCMS systems in 1998 that completed over 2 million individual LCMS clinical assays in 2010. 13 Incremental improvements in rapid sample preparation techniques, chromatography, and data processing have also contributed to the increasing use of LCMS-based clinical testing. A 2013 study demonstrated the level of advance by identifying 4,000 yeast proteins in one hour of LCMS run time, or approximately 67 proteins/min, a rate 100 times faster than studies a decade prior. 14 LCMS methods can measure protein quantity by either the intact protein (protein-centric) or the analysis of proteolytic peptides (peptide-centric).
While the mass to charge (m/z) ratio of a peptide or protein (MS1) may be a specific diagnostic in some materials, the majority of LCMS methods employ tandem MS in which the peptide or protein parent ion is subjected to gas phase collision to produce fragment ions. The measurement of the fragment ions (MS2) has a higher specificity and lower level of false positives and is the method of choice in clinical diagnostics. 12,13 Protein-centric assays, such as intact parallel reaction monitoring or proteoform monitoring, perform well on small GTPases like KRAS, but larger proteins cannot currently be monitored with the same relative ease and accuracy. 15 For this reason, peptide-centric assays are currently the dominant proteomics approach.
In peptide-centric assays, commonly called "bottom-up" or "shotgun" proteomics, proteins are subjected to proteolytic cleavage to produce smaller peptide sequences prior to LCMS, often using chemically modified trypsin. [16][17][18] Proteolytic digestion of a protein mixture increases the total number of molecules present and thereby increases the relative background noise of the sample. However, the production of multiple peptides from the protein of interest typically results in the production of highly selective peptide targets. Through a priori selection of peptide targets that are biologically unique or in some way chemically distinct, extremely specific assays can be rapidly designed. 19,20 When multiple selective peptides are available for a given protein, the independent peptide measurements can be combined to provide replicate measurements to increase certainty in protein presence and relative abundance.
A universal rule in all LCMS assays is that of "breadth vs. depth": simply stated, increasing the number of targets in an experiment results in a decrease in the overall sensitivity of each measurement. The most common example is that untargeted assays, which may observe thousands of peptide ions per experiment, invariably have a higher limit of detection (LOD) and quantification (LOQ) compared to assays where a smaller number of ions are targeted. [21][22][23] Improvements in each subsequent generation of hardware can mitigate this compromise, but the improvement is limited. Today, the only way to truly offset this rule is to increase the total LCMS run time. A 2014 study by Majchrzykiewicz-Koehorst et al. described an untargeted LCMS assay that could discern three viruses in samples, both in vitro and ex vivo. However, the run time for this experiment was six hours per sample using nanoESI-quadrupole TOF technology. 24 Targeted peptide-centric assays are advantageous when sensitivity is paramount over the quantity of identified targets. Targeted assays often rely on tandem MS with high speed, but relatively low accuracy, quadrupoles. 21 Quadrupoles can be used to select ions for fragmentation with quantification of fragment ions by other quadrupoles in selected reaction monitoring (SRM). They can also be used in conjunction with high resolution systems in selected ion monitoring (SIM) and parallel reaction monitoring (PRM). SRM relies on fragmentation, which requires a priori knowledge of the mass to charge ratio (m/z) of both the peptide ion of interest and its dominant ions produced during fragmentation. Commonly, the collision energy is optimized for each peptide fragment in order to maximize efficiency and signal. SIM uses higher resolution scans in lieu of fragmentation, which requires a priori only an approximate m/z for the peptide. In SIM targeted assays, the peptide ion's exact mass is extracted post-run during data processing.
Two studies have shown that SIM scans with ≥60,000 resolution produce selectivity comparable to SRM. 25,26 PRM targeted assays combine fragmentation and high resolution molecule monitoring. In contrast to SIM, isolated ions are subjected to fragmentation. Post-run data processing then calculates quantification from the high-resolution accurate mass of the fragment ions. For proteins with only one available peptide target, PRM is the most selective technique in modern proteomics. 27,28 Untargeted assays can be divided into two broad categories, data dependent (DDA) and data independent analyses (DIA). In DDA, the masses of all ions are observed in a relatively wide m/z range (MS1); the MS1 peptide ions meeting user-defined thresholds are subjected to fragmentation. 29 The resulting fragment ions are scanned, typically in a wide range for obtaining the most complete coverage of the peptide fragment ions. Common user-defined thresholds in DDA include selecting for peptides with: (1) sufficient signal for sequencing and (2) isotopic distributions matching models of typical peptides for that m/z. 30 The ion selection is performed in real time and automatically by the instrument, and has required less user input in method design with each generation of hardware. 31,32 DDA data analysis methods require the least a priori information of all current LCMS methods. Although full de novo identification of peptide sequences is possible, it remains computationally demanding, thus typical post-instrument analysis relies on peptide search engines. Peptide search engines require the user to provide a database containing the protein sequences the user expects. As the MS1 and MS2 data is directly compared to this expected database, results are impacted by the accuracy of the database as well as by the availability of comprehensive proteome sequence information for the organism(s).
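The real-time precursor selection described above for DDA can be illustrated with a minimal sketch. The class name, function name, default thresholds, and the exclusion list are hypothetical, and a simple charge-state filter stands in for the isotopic-distribution check; this is not the acquisition logic of any vendor.

```python
from dataclasses import dataclass


@dataclass
class Ms1Peak:
    mz: float         # observed mass to charge ratio
    intensity: float  # signal in arbitrary units
    charge: int       # charge state assigned from the isotope envelope


def select_precursors(peaks, top_n=10, min_intensity=1e4,
                      allowed_charges=(2, 3, 4), excluded_mz=(), tol=0.01):
    """Return up to top_n MS1 peaks to fragment, most intense first.

    All defaults are illustrative assumptions, not instrument settings.
    """
    candidates = [
        p for p in peaks
        if p.intensity >= min_intensity          # sufficient signal
        and p.charge in allowed_charges          # peptide-like charge state
        and not any(abs(p.mz - m) <= tol for m in excluded_mz)
    ]
    candidates.sort(key=lambda p: p.intensity, reverse=True)
    return candidates[:top_n]


# Toy MS1 scan (made-up values).
scan = [
    Ms1Peak(500.25, 5e5, 2),
    Ms1Peak(600.30, 2e6, 3),
    Ms1Peak(450.10, 5e3, 2),   # below the intensity threshold
    Ms1Peak(700.40, 9e5, 1),   # singly charged, filtered out
]
for peak in select_precursors(scan, top_n=2):
    print(peak.mz, peak.intensity)
```

Passing recently fragmented m/z values as `excluded_mz` mimics the kind of exclusion list commonly used to avoid re-sequencing the same ion in consecutive cycles.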
Much of proteomics research relies on peptide search engines that predate the LCMS instrumentation by decades. Such software can consider multiple scenarios, such as alternative charge states, but cannot account for post-translational modifications (PTMs) unless defined by the user a priori. 33 Recent developments in next-generation search engines (e.g., MetaMorpheus, 34 Bolt, 35 Byonic, 36 and MSFragger 37) can identify PTMs without the user providing predicted PTMs, but these tools have yet to be widely adopted by the field.
Unlike DDA, DIA does not perform automatic real-time decision making processes. Instead, MS1 and MS2 scans are acquired with set mass widths (MS2 windows) that cover the entire peptide mass range of interest. The size of the MS2 windows varies based on the instrument speed and sensitivity, but the same instrument run method may be applied to any peptide experiment. Thus, DIA involves minimal optimization of the instrument run method, but this generalized data collection approach puts the onus on the post-run data analysis of the resulting MS2 fragments.
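As a rough illustration of how fixed MS2 windows tile a mass range, the sketch below generates a window scheme from a start m/z, an end m/z, and a window width. The function name, the 400 to 1000 m/z range, and the 25 m/z width are assumptions chosen for the example, not values taken from the methods provided here.

```python
def dia_windows(mz_start, mz_end, width, overlap=0.0):
    """Return (lower, upper) isolation windows covering mz_start..mz_end.

    Consecutive windows share `overlap` m/z units at their edges.
    """
    if width <= overlap:
        raise ValueError("width must be larger than overlap")
    windows = []
    lower = mz_start
    while lower < mz_end:
        upper = min(lower + width, mz_end)  # clip the final window
        windows.append((lower, upper))
        if upper >= mz_end:
            break
        lower = upper - overlap
    return windows


# Assumed example: 400-1000 m/z in 25 m/z steps gives 24 windows.
scheme = dia_windows(400, 1000, 25)
print(len(scheme), scheme[0], scheme[-1])
```

Wider windows shorten the cycle but admit more co-isolated precursors per MS2 scan, which is the trade-off between instrument speed and window size noted above.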
Large MS2 windows may display peptide ion fragments originating from hundreds of unique peptide ions. Peptides are detected in DIA MS2 windows by matching the experimental results against peptide fragment spectral libraries. Spectral libraries are annotated peptide fragments, typically produced by previous DDA experiments. MS2 spectra selected for libraries typically contain complete fragment sequence coverage, with an MS2 fragment representing the product of breaking each peptide bond within the peptide of interest. In a common DIA workflow, a portion of all peptides from the study are pooled and chemically fractionated before being subjected to DDA LCMS analysis. The DDA experiments are used to create the spectral library for the experiment and each individual sample is separately analyzed by DIA. Quantification of the peptides and proteins occurs in the individual DIA experiments and the spectral libraries from the pools serve as a reference for identification of the peptides being quantified. [38][39][40] While spectral libraries are essential for DIA, they can also be utilized in DDA and can result in the most specific SRM and PRM targeted assays. Peptide fragmentation follows specific energetic patterns, resulting primarily in fragments caused by separation at the peptide bond. It is therefore possible to create theoretical spectral libraries in silico from peptide sequence alone. Theoretical spectral libraries are especially useful when the biological samples are unavailable. New tools that employ deep learning algorithms have been demonstrated to produce theoretical MS2 spectra superior to previous prediction models and, in the absence of true experimental data, are the best resources currently available. 41,42 These deep learning algorithms can learn from vast libraries of experimental data to predict the fragmentation patterns of new peptide sequences that they are given.
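Because fragmentation occurs primarily at the peptide bond, the m/z positions of the expected fragments follow from sequence alone. The sketch below computes singly charged b and y ion m/z values and the precursor m/z for an unmodified peptide using standard monoisotopic residue masses. It predicts fragment positions only, not intensities, which is what deep learning predictors add; the function names are illustrative and the code is not part of any tool cited here.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O
PROTON = 1.007276   # mass of a proton


def precursor_mz(peptide, charge):
    """m/z of the unmodified peptide ion at the given charge state."""
    neutral = sum(MONO[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i] is b(i+1) (N-terminal side); y_ions[i] is y(i+1)
    (C-terminal side). One pair exists per peptide bond.
    """
    masses = [MONO[aa] for aa in peptide]
    n = len(masses)
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[-i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions


b, y = fragment_ions("PEPTIDE")
print(precursor_mz("PEPTIDE", 2))
print([round(m, 4) for m in b])
print([round(m, 4) for m in y])
```

A precursor m/z paired with one of its fragment m/z values is the kind of a priori information an SRM transition requires.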
One such algorithm, PROSIT, uses the vast synthetic human peptide libraries from the ProteomeTools project 43 for its training dataset. Due to the high quality of the 450,000 synthetic peptides experimentally fragmented in ProteomeTools to date, PROSIT has been demonstrated to create spectral libraries that are, in some cases, superior to experimentally derived in-house spectral libraries. 44
In this study, we describe an in silico approach to develop LCMS screening assays of virus peptides in human samples. Currently, there is no publicly available proteomics data from COVID-19 infected human tissues and we are restricted from access to these materials. Our in silico developed materials facilitate both global and targeted analysis by providing all necessary materials for both DDA and DIA investigation of these materials through the production of FASTA databases, spectral libraries, and a list of predicted PTMs investigators should consider when searching with historic peptide search engines. The peptide spectral libraries are further utilized to create transition lists optimized for hardware from 5 instrument vendors and complete PRM methods for all 3 quadrupole Orbitrap architectures. All materials and methods described herein are available as supplemental material to this publication as described in Table 1. The supplemental material includes protein sequences (FASTA files), predicted PTMs, theoretical MS2 spectral libraries, instrument methods, and targeted method data processing templates.
Our work demonstrates not only the feasibility of this approach, but also its ability to rapidly develop methods even in the face of limitation of access to sample experimental data. We use the example of SARS-CoV-2 viral protein detection to underscore the utility of this approach in responding to an urgent public health crisis.

Coronavirus FASTA databases
At the date of this writing, only theoretical protein sequences for SARS-CoV-2 are available. These sequences are being acquired and annotated and are the result of translation of genomic sequence information. All sequences in this study were obtained from NCBI (accession: txid2697049, https://www.ncbi.nlm.nih.gov/protein/?term=txid2697049). Using Proteome Discoverer 2.4 (Thermo), the protein sequences were combined into a single protein FASTA database (2019-nCOVpFASTA1; Supplemental Information), and added to human proteome sequences (UniProt SwissProt Human database; downloaded 2/15/2020) to produce a database including both human and COVID-19 protein sequences (Human_plus_2019-nCOVpFASTA2; Supplemental Information).
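The databases here were assembled in Proteome Discoverer. For readers without that software, combining a viral and a host FASTA file can be sketched with the Python standard library alone. The function names and the toy records below are hypothetical (the sequences are not real proteins), and treating the first word of each header as the accession is an assumption.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def merge_fasta(*fasta_texts):
    """Concatenate databases, keeping the first record per accession."""
    seen, merged = set(), []
    for text in fasta_texts:
        for header, seq in read_fasta(text):
            accession = header.split()[0]  # assumed: first word is the ID
            if accession not in seen:
                seen.add(accession)
                merged.append((header, seq))
    return merged


def write_fasta(records, width=60):
    """Serialize records back to FASTA text with wrapped sequence lines."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Toy inputs (made-up sequences, for illustration only).
viral = ">V1 toy viral protein\nMKRPAAKGG\nAAK\n"
human = ">H1 toy human protein\nMEEPQSDPSV\n"
print(write_fasta(merge_fasta(viral, human)))
```

Searching against the combined database rather than the viral sequences alone lets host peptides compete for spectra, which helps limit false viral identifications.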

Publicly available proteomics data from human samples infected with other coronaviruses
Publicly available experiments on other coronaviruses were found by searching the ProteomeXchange Consortium web interface (http://www.proteomexchange.org/). 45,46 Clarification of the identity of unpublished data from Pacific Northwest National Laboratory (PNNL) was provided by Dr. Michael Monroe.

PRM and SRM method development
For SRM transitions, the 2019-nCOVpFASTA1 (Supplemental Information) was imported into Skyline v20.1.0.28 (University of Washington) along with the PROSIT tryptic peptide spectral library. Peptide settings and transitions were optimized within Skyline to reflect the vendor optimization requirements. For Agilent systems, the 20 ms default dwell time was selected for the transition settings. For SCIEX instruments, the same dwell time was utilized as well as automatic optimization of the declustering potential and compensation voltage from the transition settings menu. For Waters, Thermo, and Shimadzu systems, no further settings were required for transition list generation. All transition lists were exported as unscheduled 15 min methods. For PRM methods, three peptides were selected for each protein due to the increased time per scan relative to SRM methods. Instrument-specific Skyline files are included in the Supplemental Information as described in Table 1.

Prediction of PTMs
PTMs were predicted for the 2019-nCOVpFASTA1 proteins (Supplemental Information) using the ModPred web interface (www.modpred.org; accessed 1/31/2020). 50 All PTMs available at the date of analysis were selected as theoretical sites, and the Basic non evolutionary model was applied. ModPred ranks each PTM as high, medium, or low confidence as previously described. 50 The ModPred web interface can accept a maximum of 5,000 amino acids. In order to analyze YP_009724389.1 the predicted sequence had to be divided into five sequences, using a 100 amino acid overlap to avoid disrupting potential large motifs. The results of ModPred were compiled into a single spreadsheet with all modifications of all confidence levels. A second sheet was created that contained only the high confidence PTMs predicted, as well as a final summary for the counting of predicted high confidence PTM occurrence, provided as Supplemental Material as described in Table 1.
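Splitting a long sequence into overlapping pieces, as described above, can be sketched as follows. The function name is hypothetical and the toy call uses small numbers for readability. Each piece is returned with its start offset so that predicted site positions can be mapped back to the numbering of the full-length protein.

```python
def split_with_overlap(sequence, max_len, overlap):
    """Split sequence into pieces of at most max_len residues.

    Consecutive pieces share `overlap` residues. Returns a list of
    (start_offset, piece) tuples, with 0-based offsets into `sequence`.
    """
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    pieces, start = [], 0
    while True:
        end = min(start + max_len, len(sequence))
        pieces.append((start, sequence[start:end]))
        if end == len(sequence):
            break
        start = end - overlap  # step back so motifs at the cut survive
    return pieces


# Toy example: pieces of at most 4 residues sharing 1 residue.
print(split_with_overlap("ABCDEFGHIJ", max_len=4, overlap=1))
# [(0, 'ABCD'), (3, 'DEFG'), (6, 'GHIJ')]
```

Sites predicted within an overlap region appear in two pieces and should be counted once after mapping back to full-length coordinates.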

Theoretical peptides
In shotgun proteomics, proteins are first digested into smaller peptide fragments that are more easily detected by the instrument. Given its widespread use, high efficiency and speed of digestion, we chose to develop methods that exclusively use the proteolytic enzyme trypsin, which produces "tryptic peptides." Sequencing grade trypsin exhibits high efficiency cleavage at unmodified (1) arginine and (2) lysine residues unless followed by a proline. Trypsin also has the advantage of leaving a terminal basic residue at the cleavage site, which increases the likelihood of complete fragment ion coverage from the charged terminal. 18 For these reasons, trypsin is utilized unless the protein sequence has an abnormally high or low number of lysine or arginine residues. A very high frequency of the residues (such as lysine-rich proteins) will create very short peptides that could be uninformative for protein identification. A very low frequency of the residues will create very large peptides, or undigested ("intact") proteins in some cases, that are difficult to detect and fragment. Our theoretical trypsin digest of the 2019-nCOVpFASTA1 database produced tryptic peptides with average lengths of 8 to 18 amino acids. These results indicate that trypsin digestion is an appropriate choice for detection of these viral proteins.
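The cleavage rule stated above (cut after arginine or lysine unless the next residue is proline) is simple enough to sketch directly. The function below is an illustrative in silico digest, not the digestion routine of any software used in this study; its name and the `missed_cleavages` and `min_length` parameters are assumptions, and the input sequence is a made-up toy.

```python
def tryptic_digest(sequence, missed_cleavages=0, min_length=1):
    """Return tryptic peptides in order of their position in the protein.

    Cleaves C-terminal to K or R unless the following residue is P.
    Peptides spanning up to `missed_cleavages` uncut sites are included.
    """
    # Step 1: fully cleaved pieces.
    pieces, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])  # C-terminal remainder

    # Step 2: join neighbouring pieces to model missed cleavages.
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptide = "".join(pieces[i:j + 1])
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides


print(tryptic_digest("MKRPAAKGG"))
# ['MK', 'RPAAK', 'GG']
```

Averaging `len(p)` over the peptides of each protein gives the kind of average-length summary reported above.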

Example targeted methods
To widen use of our developed methods, we used the Skyline software to create transition lists optimized specifically for each of the Skyline-compatible triple quadrupole instruments. While minor modifications are required for SCIEX, Agilent, and Shimadzu instruments, Waters Xevo and Thermo instruments use identical parameters for transition list design. Most modern triple quadrupole instruments are capable of 500 SRMs/sec and fully permit the use of 2,000-transition lists, as provided here. For older instruments that lack this scan speed or that require higher dwell times, the transition lists included in the supplemental methods may be reduced by the end user accordingly.
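The relationship between list size, acquisition rate, and cycle time can be checked with simple arithmetic. The sketch below reads the figures above as roughly 500 SRMs per second (2 ms per transition) and a list of about 2,000 transitions; the 30 s chromatographic peak width is an assumed value for illustration, and the function names are hypothetical.

```python
def cycle_time_s(n_transitions, dwell_ms, overhead_ms=0.0):
    """Seconds needed to measure every transition once (unscheduled)."""
    return n_transitions * (dwell_ms + overhead_ms) / 1000.0


def points_per_peak(peak_width_s, cycle_s):
    """Number of measurements of each transition across one LC peak."""
    return peak_width_s / cycle_s


# 500 SRMs/sec corresponds to 2 ms per transition.
cycle = cycle_time_s(n_transitions=2000, dwell_ms=2.0)
print(cycle)                         # 4.0 seconds per cycle
print(points_per_peak(30.0, cycle))  # 7.5 points, assuming a 30 s peak
```

Because cycle time grows linearly with both the number of transitions and the dwell time, an instrument that needs longer dwell times reaches the same cycle time only with a proportionally shorter list, which is why the lists may need to be reduced on older hardware.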
PRM methods monitor multiple transitions simultaneously but at a time cost. The highest scan speed currently available in Orbitrap instruments is 48 scans per second and is only available on the Exploris 480 system (data not shown). In order to achieve maximum sensitivity, higher fill times are often required for these instruments. We chose to utilize three peptides/protein for these methods. Alternative peptides can be selected from the Skyline files provided (Supplemental Information) or by selecting peptide mass targets from the other transition lists.

Untargeted methods
The PROSIT spectral libraries (Supplemental Material) enable the interrogation of DIA data and may be used for DDA experiments that employ tools such as MSPepSearch (NIST). 51 DDA data requires only the protein FASTA file and a list of PTMs that may be present in the sample. Our analysis using ModPred predicted 17 possible PTMs (Figure 1). Amidation was the most frequent predicted PTM, but we could find no known biological mechanism for it in a survey of the literature. Palmitoylation, the second most frequent predicted PTM, is a well characterized viral PTM with critical functions in human immunodeficiency virus (HIV), human herpes virus (HHV), and influenza virus infectivity. [52][53][54]
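Ranking PTM types by how often they are predicted with high confidence amounts to a filtered tally. The sketch below shows one way to do this. The record layout (`protein`, `position`, `ptm`, `confidence`) is a hypothetical stand-in for an exported prediction table, not ModPred's actual output format, and the records are made-up toy values rather than results from this study.

```python
from collections import Counter


def count_high_confidence(predictions):
    """Return [(ptm, count), ...] for high-confidence sites, most frequent first."""
    counts = Counter(
        p["ptm"] for p in predictions
        if p["confidence"].lower() == "high"
    )
    return counts.most_common()


# Toy records (invented for illustration only).
toy = [
    {"protein": "P1", "position": 12, "ptm": "Amidation", "confidence": "High"},
    {"protein": "P1", "position": 40, "ptm": "Amidation", "confidence": "High"},
    {"protein": "P2", "position": 7, "ptm": "Palmitoylation", "confidence": "High"},
    {"protein": "P2", "position": 9, "ptm": "Palmitoylation", "confidence": "Low"},
]
print(count_high_confidence(toy))
# [('Amidation', 2), ('Palmitoylation', 1)]
```

The resulting list of PTM names is what a user would supply as variable modifications to a search engine that cannot discover PTMs on its own.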

LCMS datasets for other coronaviruses
The ProteomeXchange Consortium is an open-access platform for the rapid sharing of proteomics data. The prominent proteomic technical journals, the Journal of Proteome Research and Molecular and Cellular Proteomics, strictly require that all unprocessed instrument files and processed results are made publicly available through these services. While attempting to obtain coronavirus data, we identified unpublished metabolomics, lipidomics, and proteomics data generated from human samples infected with Middle East respiratory syndrome coronavirus (MERS-CoV). We provide this data, as well as a list of other proteomic studies of note, for use in comparative studies with SARS-CoV-2.

Conclusions
Using in silico methods, we have developed methods for the detection of SARS-CoV-2 in human samples. In vitro validation of this method is required and outside the scope of this paper given our lack of access to such samples. We have provided the minimum materials for data processing for both DDA and DIA untargeted proteomics methods: FASTA databases, spectral libraries, and predicted PTMs for consideration. To broaden the number of labs that can apply our methods, we optimized run parameters for widely used LCMS systems compatible with Skyline, representing instruments from five companies. We will continue to refine these resources and post updates to these methods to LCMSmethods.org and invite researchers anywhere in the world to contact us for assistance in further optimization to address this emerging threat.