ABSTRACT
Mass spectrometry-based proteomics is constantly challenged by the presence of contaminant background signals. In particular, protein contaminants from reagents and sample handling are often abundant and impossible to avoid. For data-dependent acquisition (DDA) proteomics, an exclusion list can be used to reduce the influence of protein contaminants. However, protein contamination has not been evaluated and is rarely addressed in data-independent acquisition (DIA). How protein contaminants influence proteomics data is also unclear. In this study, we established protein contaminant FASTA and spectral libraries that are applicable to all proteomic workflows and evaluated the impact of protein contaminants on both DDA and DIA proteomics. We demonstrated that including our contaminant libraries can reduce false discoveries and increase protein identifications without influencing quantification accuracy in various proteomic software platforms. With the pressing need to standardize proteomic workflows in the research community, we highly recommend including our contaminant FASTA and spectral libraries in all bottom-up proteomics workflows. Our contaminant libraries and a step-by-step tutorial for incorporating them into different DDA and DIA data analysis platforms are freely accessible at https://github.com/HaoGroup-ProtContLib and can be a valuable resource for proteomics researchers.
INTRODUCTION
Mass spectrometry (MS)-based proteomics is constantly challenged by exogenous contaminants and interferences that can be introduced into samples throughout the experimental workflow. Contaminants from polymers, detergents, solvents, ion sources, and other additives are often singly charged and can be avoided by careful selection of reagents or removed by an ion mobility MS interface (e.g. FAIMS).1–3 However, contaminant proteins and peptides are almost impossible to eliminate from proteomics workflows. For example, keratins from researchers’ skin and hair can be found on all surfaces and in dust during sample handling.4 Rodent and sheep keratins can originate from animal facilities and wool clothing. Residual cell culture medium can lead to bovine protein contamination. Protein digestion enzymes (e.g. trypsin and Lys-C) and the organisms used to produce them (bacterial or bovine) can introduce protein contaminants into bottom-up proteomics workflows.1 Additionally, bovine serum albumin (BSA), immobilized antibodies, and affinity tags (e.g. streptavidin, FLAG, HA) from affinity columns/beads also represent major contaminants in immunoassays and affinity purification MS.5,6 These exogenous contaminant proteins/peptides can compete with real sample peptides in the MS ion source, occupy cycle time in the mass analyzer, reduce the number of useful peptide spectra, and hinder the detection of low-abundance proteins from complex biological samples.
Due to the negative effects of protein contaminants in MS-proteomics, various methods have been implemented to combat this problem. Keratin contamination can be reduced by using a laminar flow hood and fastidiously wiping down surfaces with ethanol and water.7 However, it is almost impossible to eliminate keratins from proteomic experiments. Contamination from digestion enzymes and affinity tags can be reduced by carefully optimizing the amount of enzymes and beads, but such practices may not be feasible for MS facilities or for biological samples available in limited amounts. Sample type-specific interferences have been evaluated previously, such as non-specific interactions in affinity purification and contamination in plasma proteomics.5,8 Note that these interference proteins are sample-specific and may actually be proteins of interest in other proteomic experiments. Therefore, they cannot be marked as exogenous contaminant proteins for all proteomic experiments. For data-dependent acquisition (DDA) proteomics, an exclusion list can be used to prevent specific ions from being isolated for MS/MS fragmentation.9 However, an exclusion list is specific to the LC-MS gradient and instrument, which makes it difficult to transfer across different MS platforms and laboratories. Peptides with similar m/z and retention times could also be accidentally excluded in complex biological samples.9–12 Additionally, contaminant FASTA libraries can be used in DDA data analysis to mark and remove contaminant proteins from the dataset.13 Although widely used, the current contaminant FASTA files used in Mascot, MaxQuant,14 or other proteomics platforms have not been updated in years.
Despite various strategies to reduce the influence of protein contaminants in DDA proteomics, protein contamination has not been evaluated and is rarely addressed in data-independent acquisition (DIA) proteomics. Many exogenous contaminants from different species cannot be identified unless they are included in FASTA or spectral libraries. A DDA exclusion list is not compatible with DIA because all co-eluting peptides within a pre-determined isolation window are fragmented together regardless of precursor intensities. Due to the wide isolation window in DIA, we hypothesized that contaminant proteins/peptides can be especially problematic if left unaddressed, leading to false identifications in DIA proteomics.
DIA data analysis can be conducted using spectral library-based software tools (e.g. OpenSWATH15, Spectronaut16, DIA-NN17, Skyline18, EncyclopeDIA19, MaxDIA20) or library-free strategies with in-silico digested pseudo peptide spectra based on FASTA protein sequences (e.g. DirectDIA21, DIA-NN17, DIA-Umpire22, PECAN23,24, DeepDIA25). While contaminant FASTA libraries are widely implemented in DDA data analysis, they have not been updated for years and are rarely used for DIA data analysis.26 To our knowledge, there is no contaminant spectral library that is publicly available and universally applicable for DIA proteomics.
In this study, we created a series of contamination-only samples to establish contaminant protein spectral libraries. We generated a new contaminant FASTA library containing 381 protein contaminants commonly found in bottom-up proteomic experiments. We then evaluated how protein contaminants influence identification and quantification in DDA and DIA proteomics. The benefits and applicability of these contaminant libraries were demonstrated in various DDA and DIA data analysis platforms. These contaminant FASTA and spectral libraries are freely accessible at https://github.com/HaoGroup-ProtContLib with a straightforward user manual to promote standardized and reproducible data analysis and reporting pipelines in the broad proteomics community.27–29
MATERIALS AND METHODS
Generation of Contaminant-Only Samples
We generated a series of contaminant-only samples by adding different proteolytic enzymes, as well as commonly used beads coated with affinity tags, to the lysis buffer (1 M urea in 50 mM Tris-HCl). The proteolytic enzymes used here include sequencing grade Trypsin, Trypsin Gold, Trypsin/Lys-C, and Lys-C from Promega. The beads used here include Sera-Mag streptavidin magnetic beads (Cytiva), Anti-Flag M2 affinity agarose beads (Sigma), and EZview Red anti-HA affinity agarose beads (Sigma). Clean ungloved hands were purposely rubbed together above these samples to increase keratin contamination.
Human Cell Culture and Mouse Brain Tissues
HEK293 cells were maintained in DMEM/F12 HEPES medium (Gibco) containing 10% fetal bovine serum (FBS). Mouse brain samples were obtained from wild-type mice (C57/B6) under protocols approved by the George Washington University Institutional Animal Care and Use Committee. HEK cells and mouse brain samples were lysed in 8 M urea, 50 mM Tris-HCl buffer and sonicated for 15 min in an ice-cold water bath using a QSonica Q700 Sonicator with alternating cycles of 1 min on and 30 s off. Protein lysates were clarified by 15 min of centrifugation at 12,000 rpm at 4 °C and stored at -80 °C. Total protein concentrations were determined using a detergent-compatible colorimetric protein assay (DCA, BioRad).
Proteomic Sample Preparation
The routine bottom-up proteomic workflow was conducted for contaminant-only samples, HEK cells, and mouse brain lysates as described previously.30,31 Briefly, disulfide bonds were reduced with 5 mM Tris(2-carboxyethyl)phosphine (TCEP) for 30 min and alkylated with 15 mM iodoacetamide for 30 min in the dark, followed by 5 mM TCEP for 10 min, all on a ThermoMixer shaking at 1,200 rpm at 37 °C. Protein digestions were conducted using various enzymes (contaminant-only samples) or Trypsin/Lys-C (HEK and mouse samples) for 18 hours at 37 °C on a ThermoMixer and quenched with 10% trifluoroacetic acid until pH < 3. Peptides were then desalted on a Waters Oasis HLB Plate using the manufacturer’s protocol, dried in a SpeedVac, and stored at -30 °C.
LC-MS/MS Analysis for DDA and DIA Proteomics
Peptide samples were analyzed on a Dionex UltiMate 3000 RSLCnano system coupled with a Thermo Fisher Q-Exactive HF-X mass spectrometer. The mobile phase buffer A was 0.1% formic acid in water, and buffer B was 0.1% formic acid in acetonitrile. HEK cells and mouse brain samples were injected onto an Acclaim PepMap C18 trap column (3 µm, 100 Å, 75 µm × 2 cm) and further separated on an Easy-spray PepMap C18 column (2 µm, 100 Å, 75 µm × 75 cm) with a flow rate of 0.2 µL/min, an LC gradient of 210 min, and a column temperature of 55 °C. Contaminant-only samples were analyzed with a 15 cm PepMap C18 column, a flow rate of 0.3 µL/min, and an LC gradient of 120 min. For DDA analysis, MS scans were acquired from m/z 380 to 1,500 with a resolving power of 120K (at m/z 200 FWHM), an automatic gain control (AGC) target of 1 × 10⁶, and a maximum injection time (maxIT) of 50 ms. Precursors were isolated with a window of m/z 1.4 and fragmented with a normalized collision energy (NCE) of 30%, a resolving power of 7.5K for MS/MS, and a maxIT of 40 ms. For DIA analysis, MS scans were acquired from m/z 400 to 1,000 at a resolving power of 60K, an AGC target of 1 × 10⁶, and a maxIT of 30 ms. The precursor isolation window was set to m/z 8.0 (staggered) with 75 sequential DIA MS/MS scans between m/z 400 and 1,000 at a resolving power of 30K, an AGC target of 5 × 10⁵, a maxIT of 30 ms, and an NCE of 30%.
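The DIA window scheme described above (8.0 m/z windows tiling m/z 400 to 1,000) can be checked with a short calculation. The following Python sketch is illustrative only: the function names are hypothetical, and the half-window shift used for the staggered cycle is an assumption about a common overlapping-window layout, not a detail stated in the instrument method.

```python
def dia_windows(mz_start=400.0, mz_end=1000.0, width=8.0):
    """Return (lower, upper) bounds of contiguous isolation windows."""
    n_windows = round((mz_end - mz_start) / width)
    return [(mz_start + i * width, mz_start + (i + 1) * width)
            for i in range(n_windows)]


def staggered_cycle(windows, width=8.0):
    """Shift every window by half its width (assumed staggering scheme)."""
    shift = width / 2
    return [(lower + shift, upper + shift) for lower, upper in windows]


windows = dia_windows()
shifted = staggered_cycle(windows)
print(len(windows), windows[0], windows[-1], shifted[0])
```

With the stated parameters this yields 75 windows, matching the 75 sequential DIA MS/MS scans reported in the method.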
Repository Data from ProteomeXchange
Two repository datasets from the ProteomeXchange website were downloaded and reanalyzed using our contaminant libraries. Repository dataset A is a HepG2 human cell DIA dataset (PXD022589) containing 27 raw files.20 Dataset B is a fractionated mouse cortex DIA dataset (PXD005573) containing 12 raw files.32 Additionally, a fractionated HEK and HeLa cell DDA dataset (PXD001468) was used to generate a spectral library for library-based DIA data analysis.33
DDA Proteomics Data Analysis
A common contaminant protein sequence library (e.g. enzymes, keratins, affinity tags, bovine proteins) was created by updating existing contaminant lists available online, searching for new contaminants on the UniProt website, and combining them into a new FASTA library (Supporting Information FASTA and Table S1). All DDA proteomic datasets in this study were analyzed with both MaxQuant (2.0.2.0) and Thermo Fisher Proteome Discoverer (2.4.1.15) software with similar parameters. Contaminant-only samples were analyzed with the contaminant FASTA library only. HEK cells and mouse brain samples were analyzed using the Swiss-Prot Homo sapiens database (reviewed) and Mus musculus database (reviewed), respectively, with and without our contaminant FASTA library. The false discovery rate (FDR) for protein and peptide-spectrum match (PSM) identifications was set to 0.01. Trypsin or LysC was specified as the enzyme with a maximum of two missed cleavages. Precursor tolerance was set to 20 ppm. The fixed modification was cysteine carbamidomethylation, and variable modifications were methionine oxidation and protein N-terminus acetylation.
DIA Proteomics Data Analysis
Spectronaut software
A contaminant-only spectral library was generated from the set of contaminant-only DDA proteomic datasets using the Pulsar search engine in Spectronaut 15.34 Mouse and human spectral libraries were also generated using Pulsar. Two spectral libraries were generated for each sample type, with and without including the contaminant-only samples in Pulsar (Supplemental Table S2). The “Library Generation Step” of Pulsar was conducted using “BGS Factory Settings”. Specific trypsin digestion was set with a maximum of two missed cleavages. Carbamidomethylation of cysteine was set as a fixed modification, and up to three variable modifications (oxidation of methionine and acetylation of the protein N-terminus) were allowed. PSM, peptide, and protein FDR were set to 0.01. Both library-based and library-free (DirectDIA) analyses were performed in Spectronaut 15 using default settings. The quantification step was modified to perform an interference correction that used only identified peptides to train the machine-learning model. No cross-run normalization or imputation was used.
DIA-NN software
Both spectral library-based and library-free DIA analyses were performed in DIA-NN (v1.8).17 Raw data files were converted to the open .mzML format using the msConvert tool of the ProteoWizard package.35 Library-based analysis was conducted in DIA-NN using the spectral libraries established above in Spectronaut Pulsar. All settings were left at their defaults and matched those used in Spectronaut. For library-free analysis, FASTA digest was selected. The spectral libraries were also included to train the deep learning model.
Post-Data Analysis Filtering
To increase the confidence of protein/peptide identifications, proteins that were identified with only one precursor or with an intensity below 10 were removed from all datasets using R. Contaminant proteins can be easily filtered out of the results by searching for the “Cont_” prefix in the UniProt protein group identifier. Contaminant proteins were removed before calculating the coefficient of variation and Spearman’s correlation for the proteomic datasets.
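The filtering steps above were performed in R; the logic can be sketched in a few lines of Python for readers who prefer it. This is a minimal illustration under stated assumptions: the field names (`protein_group`, `n_precursors`, `intensity`), the semicolon-delimited protein group format, the example rows, and the choice to flag a group when any of its members carries the prefix are all hypothetical and not taken from the original analysis scripts.

```python
CONTAMINANT_PREFIX = "Cont_"


def filter_low_confidence(rows, min_precursors=2, min_intensity=10.0):
    """Drop proteins with only one precursor or an intensity below 10."""
    return [row for row in rows
            if row["n_precursors"] >= min_precursors
            and row["intensity"] >= min_intensity]


def is_contaminant(protein_group):
    """Flag a group if any member accession starts with the prefix (assumed rule)."""
    return any(accession.startswith(CONTAMINANT_PREFIX)
               for accession in protein_group.split(";"))


def remove_contaminants(rows):
    """Keep only protein groups without a contaminant-tagged member."""
    return [row for row in rows if not is_contaminant(row["protein_group"])]


# Illustrative rows only; accessions and values are not from the study.
rows = [
    {"protein_group": "P04406", "n_precursors": 5, "intensity": 2.1e6},
    {"protein_group": "Cont_P02769", "n_precursors": 12, "intensity": 8.4e7},
    {"protein_group": "Q9Y6K9", "n_precursors": 1, "intensity": 3.0e4},
    {"protein_group": "P62258;Cont_P35527", "n_precursors": 3, "intensity": 5.5e5},
    {"protein_group": "O75533", "n_precursors": 4, "intensity": 6.0},
]
confident = filter_low_confidence(rows)
sample_only = remove_contaminants(confident)
print([row["protein_group"] for row in sample_only])
```

Whether a mixed protein group (sample and contaminant accessions together) should be discarded is a design choice; the stricter rule is used here only for simplicity.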
Data Availability
All raw files have been deposited to the ProteomeXchange Consortium with the dataset identifier PXD031139. The protein contaminant libraries and step-by-step user tutorial are also freely accessible at https://github.com/HaoGroup-ProtContLib.
RESULTS AND DISCUSSION
Building the Contaminant Protein FASTA and Spectral Libraries
Exogenous contaminant proteins originating from reagents and sample handling are mostly shared across all bottom-up proteomic experiments. Therefore, we aimed to build exogenous contaminant protein libraries that can be used in all bottom-up proteomics (Figure 1). Although widely used for DDA proteomics, the lists of common protein contaminants from the Mascot and MaxQuant platforms have not been updated for years and contain many incorrect UniProt IDs. Some sample-specific interference proteins were also incorrectly listed as contaminant proteins. Therefore, we first built a new contaminant FASTA library by manually merging several contaminant lists available online, updating their UniProt entry IDs, deleting non-contaminant proteins, searching for new contaminant proteins on UniProt, and combining them into a new FASTA file. Our new contaminant FASTA library contains 381 contaminant proteins including all human keratins and skin-derived proteins, common bovine contaminants from cell culture and affinity columns, various proteolytic enzymes, affinity tags, and other contaminants (Supplemental FASTA and Table S1). Compared to the MaxQuant contaminant list, our new FASTA library is up to date for all UniProt IDs and contains an additional 183 contaminant proteins (Figure 2A). This new FASTA library can be used for both DDA and DIA proteomics. We also added a “Cont_” prefix to each contaminant entry in the FASTA library, allowing contaminant proteins to be easily filtered and removed from the result files.
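For readers who want to tag their own in-house contaminant sequences in the same spirit, the prefixing step can be sketched as follows. This is a hedged illustration, not the procedure used to build the released library: the function names are hypothetical, and placing the prefix directly before the accession in a UniProt-style `>db|ACCESSION|NAME` header is an assumption, so the exact header layout of the published FASTA file may differ.

```python
def add_prefix(header, prefix="Cont_"):
    """Insert the prefix before the accession of a FASTA header line."""
    if not header.startswith(">"):
        raise ValueError("not a FASTA header line")
    body = header[1:]
    parts = body.split("|", 2)
    if len(parts) == 3:
        # UniProt-style header: >db|ACCESSION|ENTRY_NAME description
        database, accession, rest = parts
        if not accession.startswith(prefix):
            accession = prefix + accession
        return f">{database}|{accession}|{rest}"
    # Fallback for simple headers: prefix the first token.
    return header if body.startswith(prefix) else ">" + prefix + body


def tag_fasta(lines, prefix="Cont_"):
    """Apply the prefix to every header line; sequence lines are unchanged."""
    return [add_prefix(line, prefix) if line.startswith(">") else line
            for line in lines]


# Illustrative record; the sequence is a short placeholder fragment.
example = [">sp|P02769|ALBU_BOVIN Albumin", "MKWVTFISLL"]
print(tag_fasta(example))
```

Applying the function twice leaves an already tagged header unchanged, which makes it safe to rerun on a partially tagged file.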
To establish comprehensive contaminant protein spectral libraries for DIA proteomics, we created a series of contaminant-only samples using various proteolytic enzymes and affinity purification beads. We validated the presence of each contaminant peptide by creating spectral libraries in MaxQuant, Proteome Discoverer, and Spectronaut Pulsar (Figure 1). Hundreds of contaminant peptides were detected throughout the LC-MS gradient (Figure 2B and Supplemental Table S3). Since trypsin and LysC are the two most used enzymes for bottom-up proteomics, we created two DIA spectral libraries using Spectronaut Pulsar: one for tryptic contaminant peptides and one for LysC-digested contaminant peptides. These spectral libraries are built from highly confident fragment ions assigned to each peptide sequence (Figure 2C, Supplemental Table S4) and are also freely accessible on ProteomeXchange (PXD031139). We further examined the contaminant datasets and found that human keratins and the Lys-C enzyme produce the largest number of contaminant PSMs (Supplemental Table S3). LysC provides higher cleavage efficiency at lysine and is therefore often used in combination with trypsin to improve digestion efficiency.36 However, LysC contains almost twofold more arginine/lysine residues than trypsin and can therefore generate numerous contaminant peptides, as shown in Figure 2. As expected, the quality of enzymes also influences the number of contaminant peptides. Trypsin Gold produced fewer contaminant PSMs than sequencing grade trypsin. Additionally, bovine protein contaminants (e.g. albumin) were identified in all affinity purification beads despite pre-washing steps. Streptavidin-coated beads generated overwhelming streptavidin peptide signals, which is consistent with our previous findings.6 These exogenous contaminant proteins often originate from a different species and will not be identified unless the contaminant FASTA library is included in the data analysis workflow.
Contaminant Peptides can Cause False Discoveries in DIA Proteomics
Contaminant FASTA libraries have been widely used for DDA proteomics but are rarely included in DIA data analysis.26,30,37,38 Since DIA uses a much wider precursor isolation window (4-15 Da) compared to DDA (0.4-2 Da), contaminant peptides in DIA are more likely to be co-eluted and co-fragmented with other peptides. If not addressed properly, contaminant peptides can cause false identifications of peptides/proteins. To evaluate the influence of contaminant peptides, we analyzed several DIA proteomic datasets with and without our contaminant FASTA library. As shown in Figure 3A, when the contaminant FASTA library was not included in data analysis, a contaminant Lys-C peptide was misidentified as a KIF20B peptide due to numerous shared peptide fragments. After including the contaminant library, the peak picking algorithm identified an additional y3 ion and assigned the fragmentation spectra to Lys-C instead of KIF20B with higher confidence and a lower peptide q-value. A similar scenario occurred for another bovine contaminant protein, SERPINA1, which was misidentified as CFAP100 (Figure 3B). Including the contaminant library allowed the identification of three additional fragments and the correct assignment to the SERPINA1 contaminant peptide. Furthermore, as contaminant peptides elute throughout the LC gradient and mass range (Figure 2B), many contaminant peptides can be co-eluted and co-fragmented with real peptides of interest. Figure 3C and 3D show examples of co-eluted and co-fragmented contaminant peptides in our DIA datasets with an 8 Da isolation window. When larger isolation windows are used, such as 10 Da or 12 Da, more peptides including contaminants will be co-fragmented to generate highly convoluted MS/MS spectra.39,40 Contaminant peptides with high abundance can also suppress the detection of low-abundance peptides by competing with them in the ion source and mass analyzer. Therefore, including a contaminant library can greatly reduce false discoveries in DIA data, particularly for complex biological samples. However, careful optimization of the experimental workflow and DIA parameters is still crucial and fundamental to reducing the number and abundance of contaminant signals.
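The effect of isolation window width on co-fragmentation can be made concrete with a toy calculation. The sketch below compares a 1.4 m/z DDA window centered on a precursor with the fixed 8 m/z DIA window that contains the same precursor. All m/z values and function names are hypothetical, and the example ignores retention time by assuming every listed ion co-elutes with the target.

```python
import math


def dda_window(precursor_mz, width=1.4):
    """DDA isolation window centered on the selected precursor."""
    return (precursor_mz - width / 2, precursor_mz + width / 2)


def dia_window(precursor_mz, mz_start=400.0, width=8.0):
    """Fixed DIA window (on a grid starting at mz_start) containing the precursor."""
    index = math.floor((precursor_mz - mz_start) / width)
    lower = mz_start + index * width
    return (lower, lower + width)


def co_isolated(window, other_mzs):
    """Return the co-eluting ions that fall inside the isolation window."""
    lower, upper = window
    return [mz for mz in other_mzs if lower <= mz < upper]


target_mz = 652.30                        # hypothetical peptide of interest
contaminant_mzs = [650.10, 654.80, 661.00]  # hypothetical co-eluting contaminants

print(co_isolated(dda_window(target_mz), contaminant_mzs))
print(co_isolated(dia_window(target_mz), contaminant_mzs))
```

In this toy case none of the contaminant ions fall inside the DDA window, whereas two of the three share the DIA window and would contribute fragments to the same MS/MS spectrum.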
Including Contaminant Protein Library Improves both DDA and DIA Proteomics
Contaminant libraries can be integrated into DDA and DIA data analysis workflows via different strategies. DDA and library-free DIA analyses only require the contaminant FASTA protein sequences, but library-based DIA analysis requires both FASTA and spectral libraries. A contaminant spectral library can be generated in two ways: 1) an integrated spectral library built from contaminant-only raw data and custom proteomics data together; 2) two separate spectral libraries for contaminant and custom proteomics data. The contaminant FASTA file is also required when building these spectral libraries. In Spectronaut software, multiple spectral libraries can be included during data analysis. We found that the integrated spectral library performs similarly to two separate libraries, with slightly higher total protein identifications in some datasets (Supplemental Figure S1). Either method gives better results than analysis without the contaminant library. However, many other DIA software platforms do not allow the inclusion of multiple spectral libraries and thus require an integrated spectral library. Fortunately, including the additional contaminant FASTA and spectral libraries did not increase the software processing time for multiple DDA (Proteome Discoverer, MaxQuant) and DIA (DIA-NN, Spectronaut) platforms.
To demonstrate the benefits of contaminant protein libraries for both DDA and DIA proteomics, HEK cells and mouse brain samples were analyzed in DDA and DIA workflows in various data analysis software (Figure 4). The overall increases in protein and peptide identifications were around 0.9% and 1.3%, respectively, across all software and sample types. After removing the contaminants, more peptide IDs and similar protein IDs were achieved when the data were analyzed with the contaminant library. Among the DIA platforms, library-free DIA-NN generated the highest number of protein and peptide IDs, possibly due to the deep learning model implemented in its search algorithm. Besides in-house generated proteomics data, we also analyzed repository datasets with and without the contaminant library. Two DIA repository datasets were downloaded from ProteomeXchange: repository dataset A from HepG2 human cell samples20 and dataset B from mouse brain samples32. Higher numbers of proteins and peptides were identified when the data were analyzed with contaminant libraries (Figure 5). In particular, after removing contaminants, almost 5% more proteins and peptides were identified when the data were analyzed with contaminant libraries in the Spectronaut library-based platform. Benefiting from the additional contaminant spectral library, library-based DIA platforms provided a greater increase in identifications than library-free platforms. Many bovine contaminant proteins were identified in repository dataset A, similar to our in-house generated HEK cell dataset, which can be traced back to the fetal bovine serum (FBS) used for human cell culture. To minimize contamination from cell culture medium, we highly recommend two to three PBS washes during cell harvest.
Since our contaminant libraries can improve protein/peptide identifications, we further assessed protein quantification with and without contaminant libraries. The coefficients of variation (CVs) of all quantified proteins from HEK cells (Figure 6A) and mouse brain samples (Figure 6B) were calculated after removing the contaminant proteins. No significant differences were observed with and without contaminant libraries. DIA-NN resulted in more protein identifications but higher CVs compared to the Spectronaut platform. Library-based methods provided less variation and better reproducibility than library-free methods, consistent with other reported studies.19,41 Protein intensities were not exactly the same when the data were analyzed with or without contaminant libraries, but they correlated very well, with Spearman’s correlation close to 1 (Figure 6C and 6D). These results demonstrated that contaminant libraries did not influence DIA protein quantification.
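The two quantification metrics used above can be written out explicitly. The following Python sketch uses only the standard library and is an illustration rather than the analysis code used in this study: the function names are hypothetical, the CV is computed with the sample standard deviation (an assumption, since the estimator is not specified in the text), and Spearman’s correlation is computed as the Pearson correlation of average ranks.

```python
from statistics import mean, stdev


def cv_percent(intensities):
    """Coefficient of variation (%) across replicate intensities."""
    return 100.0 * stdev(intensities) / mean(intensities)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return covariance / (spread_x * spread_y)


def spearman(x, y):
    """Spearman's rank correlation between two intensity vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative values only: the same proteins quantified with and
# without the contaminant library.
without_library = [1.0e5, 2.3e6, 4.1e4, 7.7e5]
with_library = [1.1e5, 2.2e6, 4.0e4, 7.9e5]
print(cv_percent([90.0, 100.0, 110.0]), spearman(without_library, with_library))
```

Because Spearman’s correlation depends only on rank order, small shifts in absolute intensity between the two analyses leave it at 1 as long as the ordering of proteins is preserved.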
CONCLUSIONS
This study filled a critical gap in bottom-up proteomics by establishing and evaluating contaminant protein libraries to reduce false discoveries and improve identifications in both DDA and DIA proteomics. Although the software platforms used here (Spectronaut, DIA-NN, MaxQuant, Proteome Discoverer) are not an exhaustive list of all available data analysis platforms, we believe that our contaminant libraries can be universally applied to all bottom-up DIA and DDA proteomics software. In fact, we provide a step-by-step tutorial on how to best incorporate our contaminant FASTA and spectral libraries into many other software platforms such as Skyline18, MaxDIA20, and PECAN23 (Supplemental Tutorial). We will also continue updating and enriching our contaminant libraries to include sample type-specific contaminant libraries. These freely accessible contaminant FASTA and spectral libraries can be valuable resources for proteomic researchers and facilitate the standardization of proteomic data analysis across different laboratories.
SUPPLEMENTAL INFORMATION
Supplemental FASTA. Contaminant protein FASTA with Cont_ prefix.
Supplemental Tutorial. Tutorial for using contaminant libraries for DDA and DIA data analysis in various proteomics software.
Supplemental Figure S1. Evaluation of different methods to build contaminant spectral libraries in Spectronaut software.
Supplemental Table S1. Contaminant protein information in the FASTA library and potential source of contaminations.
Supplemental Table S2. List of the established FASTA and Pulsar spectral libraries.
Supplemental Table S3. Summary of the protein contaminants identified in contaminant-only samples.
Supplemental Table S4. List of the contaminant peptides and fragments in Pulsar spectral libraries.
AUTHOR CONTRIBUTIONS
A.M.F. and L.H. designed the study and wrote the manuscript with input and revisions from all coauthors. A.M.F., J.N., and A.M. conducted the experiments. A.M.F. performed data analysis. All authors have read and agreed to the published version of this manuscript.
CONFLICTS OF INTEREST
The authors declare no competing financial interests.
ACKNOWLEDGEMENTS
This study is supported by NIH grant R01NS121608. L.H. acknowledges the ORAU Ralph E. Powe Junior Faculty Enhancement Award. A.M.F. acknowledges the ARCS-Metro Washington Chapter Scholarship and the Bourbon F. Scribner Endowment Fellowship. We thank the Vertes Lab and Lu Lab at GW for access to the SpeedVac instrument and mouse brain samples.