The Personalized Proteome: Comparing Proteogenomics and Open Variant Search Approaches for Single Amino Acid Variant Detection

Discovery of variant peptides such as single amino acid variant (SAAV) in shotgun proteomics data is essential for personalized proteomics. Both the resolution of shotgun proteomics methods and the search engines have improved dramatically, allowing for confident identification of SAAV peptides. However, it is not yet known if these methods are truly successful in accurately identifying SAAV peptides without prior genomic information in the search database. We studied this in unprecedented detail by exploiting publicly available long-read RNA seq and shotgun proteomics data from the gold standard reference cell line NA12878. Searching spectra from this cell line with the state-of-the-art open modification search engine ionbot against carefully curated search databases resulted in 96.7% false positive SAAVs and an 85% lower true positive rate than searching with peptide search databases that incorporate prior genetic information. While adding genetic variants to the search database remains indispensable for correct peptide identification, inclusion of long-read RNA sequences in the search database contributes only 0.3% new peptide identifications. These findings reveal the differences in SAAV detection that result from various approaches, providing guidance to researchers studying SAAV peptides and developers of peptide spectrum identification tools.


Introduction
Proteomes display significant inter-individual variability 1,2 and personal proteomes may delineate disease risk and pave the way for personalized disease prevention and treatment.
Previously, scientists looked for protein evidence of a small number of variants in particular and resorted to targeted proteomics approaches such as selected reaction monitoring (SRM) [9][10][11][12] . Alternatively, BLAST-like query tools such as peptimapper and PepQuery 13,14 or database tools like XMAn v2 15 and dbSAP 16 can be used to investigate single events 17,18 .
Proteogenomics, the integration of genome and transcriptome information, is a more holistic and higher-throughput form of mass spectrometry-(MS-) based detection of variant peptides.
A main limiting factor of SAAV peptide (called 'variant peptide' for the remainder of the manuscript) detection with shotgun proteomics is the tandem mass spectrometry (MS/MS) technology itself. Since MS/MS spectra are generally too noisy to call a peptide sequence de novo, current MS/MS analysis methods rely on a database of known peptides. This limits the ability to detect unknown peptides such as variant peptides. The most flexible way to detect variant peptides is an exhaustive search; allowing any possible amino acid substitution at any position in the peptide sequence 19,20 . However, this strategy increases the search space immensely to a point where it is no longer useful in practice. The larger search space leads to ambiguity in peptide identification and thus a high number of false positive hits 21,22 . Therefore, more careful curation of sequences in the search database pays off.
Databases of peptides containing variants from dbSNP have been created to facilitate the search for SAAVs 3,16 , and simply adding these variant peptides to the database showed promise early on 3,23 . Not all dbSNP variants however, are expected to be found in every sample, and including them all may lead to false identifications 24 . In addition, rare and unique variants may be overlooked. A proteogenomics approach where only those variant peptides predicted from genome or transcriptome information are added to the peptide search databases, can improve their detection. Proteogenomics pipelines have streamlined this process of incorporating personal genome information into a proteomic search database [25][26][27][28][29] . In addition, there is evidence that including correct sequence variant information, including often-overlooked sample-specific indels and frameshifts, improves variant peptide identification workflows 30 .
Yet, false discovery rate (FDR) correction is needed to compensate for the increase of database size and complexity 21,22 . When searching for evidence of specific peptides such as variant peptides, an additional subset specific FDR correction should be made 31 .
In addition to SAAVs, alternative splicing may also introduce sample specific peptides.
Alternative splicing is commonplace as 90% of genes undergo alternative splicing 32 . Since protein reference databases do not cover all protein isoforms produced by alternative splicing, sample-specific transcriptome information is advantageous. Typically, the information on alternatively spliced sequences comes from RNA sequencing. Short-read RNA seq is however not ideal for properly capturing the complete splicing patterns and the resulting open reading frames (ORFs). Traditionally, this is circumvented by including 3-or 6-frame translations of the sample's transcriptome. However, this approach was found to expand the database far too much for eukaryotic organisms, leaving few remaining hits after FDR correction 33 . Studies utilizing long-read RNA seq frequently discover previously unannotated transcript structures.
Thus, full-length transcripts may add essential information for correct ORF prediction and peptide identification.
An emerging alternative to proteogenomics methods for the detection of variant peptides is the 'open search' method. This allows unexpected post-translational modifications and amino acid substitutions in the peptide spectrum match, while maintaining accurate FDR and a workable computation time. Using sequence tag-based approaches, the search space is narrowed with de novo sequence tags, which makes room for the addition of all possible SAAV peptides in the search space [34][35][36][37][38] . These methods were historically not as effective as classical proteogenomics searches in finding variant peptides, since there is difficulty in discerning between post-translationally modified and SAAV peptides. However, this situation has recently improved with the inclusion of optimized probabilistic models 39 . One implementation of the tag-based method improved with such models is ionbot (manuscript in preparation; compomics.com/ionbot), which is a machine learning search engine that uses MS2PIP 40 and ReSCore 41 to significantly improve the accuracy of peptide match scoring.
The main objective of this study is to compare a previously established proteogenomics approach based on long-read sequencing with a recently-developed open search method for the detection of true variant peptides. In simpler terms, we compare a genome-informed search space with typical spectrum identification settings to a genome-uninformed search space with advanced identification settings. We aim to understand the power of, and potential biases associated with, using an open search method without prior information about the genome. For this, we make use of high-confidence nucleotide sequencing and (ultra)-deep proteomics data from a gold standard cell line NA12878. Using correct ORFs from the long-read transcriptome and high-confidence phased variants belonging to this cell line, we gain a unique perspective on exactly what advantages can be gained by each approach.

Creation of the search databases
Two sets of two search databases were created from the two sources of sequences: the ONT transcriptome and GENCODE-predicted coding transcriptome sequences. The first set contained sequences of each of the sources individually (referred to as 'ONT' or 'Ref'). The second set consisted of the union of the two sources, one database with NA12878-specific variants included (variant-containing, VC) and one database without (variant-free, VF). A simple depiction can be found in Figure 1A, while the detailed full workflow can be found in Figure S1. Each database had MaxQuant 44 contaminant sequences appended before search.  ionbot identifications presumed to be variant peptides (and variant peptide decoys) underwent subset-specific FDR correction for both combination databases, but the exact subset of variant peptides differed between the two searches due to different assumptions. The assumption in the VF database is that variants in the genome are unknown, so all predicted variant peptides (and predicted variant decoy peptides) were pooled for FDR correction. In the VC database, only known variant peptides (and corresponding decoy peptides) are pooled for FDR correction. Q value calculation and cutoff (q < 0.01) were performed with an in-house python script (distribution can be seen in Figure S2). Retention time predictions were calculated with DeepLC 49 . All scripts referred to in this manuscript can be found in the GitHub repository (https://github.com/cmbi/NA12878-saav-detection).

Search database makeup
The main goal of this study is to evaluate the added value of transcriptomics data for SAAV identification in proteomics data. In this evaluation, SAAV identification with and without transcriptomics prior knowledge is compared for a state-of-the-art open search engine. To this end, we searched the NA12878 deep shotgun proteomics data set with four distinct search databases corresponding to two comparisons, as outlined in Figure 1A. The first comparison was between databases based on the Oxford Nanopore (ONT) long-read transcriptome, the GENCODE reference proteome (referred to as Ref) or the combination of the two (referred to as combi, Figure 1B)     Aside from the contributions from the ONT-only sequences, it is also interesting to investigate protein identifications that were not found in the ONT transcriptome. While these should theoretically not be present, roughly 20% of identified peptides are exclusively matched with the ENCODE transcripts ( Figure 2). As expected, this percentage is smaller than the 42% of peptides in the search database that are exclusive to GENCODE transcripts, but still a significant fraction. This suggests that it is best to still use a reference transcript database, even if there is full transcriptome sequencing data available.

Variant-containing method allows detection of many more genome-supported variant peptides
We     There were 402 unique false negative peptides observed ( Figure 5A). These false negatives peptides were classified as variant peptides by VC method but not by VF, although they were contained in the VF search space. Identifying causes of false negatives requires investigation of how the VC peptides were identified with the VF method. There was no particular length peptide that was mis-identified more than others in general, despite the difference in detectible peptide length ( Figure S3). The peptide identifications were similar between the VF and VC methods. In general, length correlated highly between the identifications of the two methods (R 2 =0.9071, p=0). When comparing individual peptide identifications per method for mismatches and length difference, the largest source of error was a 1 aa length difference.
Nonvariant peptides with a 1 aa length difference from the variant peptide were being identified instead of the correct variant peptide in >30% of the false negatives ( Figure S3). Another possible source of false negative errors that was investigated is SAAVs being mistaken for unexpected post-translational modifications. In the false negative set, this did not appear to be an issue. The false negative VF identifications had approximately the same rate of unexpected PTMs ( Figure S3).  To further understand how false negatives could occur, we compared the peptide matching scores of the false negative spectra for the VF and VC search methods ( Figure 5B). Higher scores indicate higher confidence in assignment of spectra. VC scores for false negative peptides were generally higher than the VF scores (mean score ratio VC/VF = 1.31). However, a large fraction of the false negatives received comparable scores in the VC and VF search methods. This could indicate a ranking problem: the variant peptide received a score equal to another peptide, to which the peptide spectrum was ultimately assigned. Delta retention time can often be a useful independent validator when score disagrees between the different search methods. Despite high retention time discrepancies in this particular data set, observed retention time aligns relatively well with predicted retention time for those spectra that received higher scores in VC.
The genome-supported variants are a tiny fraction of the high confidence variant peptide predictions from the VF database, indicating a high false positive rate ( Figure 6A). We   A closer inspection of genome-unsupported variants reveals potential sources of confusion for variant prediction algorithms, leading to false positive identifications. There was a high level of concordance of peptides matched to these spectra in general. Two thirds of spectra that corresponded to genome-unsupported variant peptide identifications by VF had the same base peptide identifications in both the VF and VC searches. Mass shifts predicted to be SAAV in VF were commonly predicted to be 'unexpected' PTMs by the VC method ( Figure 6C). A common PTM mistaken as a SAAV in VF was threonine oxidation, but many PTMs contributed to this mix-up. There was no clear trend to the identification errors, underlining the difficulty of correctly classifying minor mass shifts corresponding to PTMs and SAAVs.

Evaluation of the variant peptides' SNPs of origin
The detection of variant peptides is ultimately a means to understanding which single nucleotide variants (SNVs) are expressed on the protein level. By incorporating SNVs into predicted ORFs, we ended up with a theoretical set of 34,968 variant peptides originating from 9,298 SNVs from all chromosomes, of which 5,989 are heterozygous variants.
In the case of a heterozygous variant, both variant peptides and their reference counterparts can be identified in some ratio. A ratio different from 0.5 may be indicative of preferred expression of one of the alleles on the protein level, otherwise known as ASPE (allele specific protein expression). Presence and magnitude of ASPE is potentially key information that can be used to understand biological mechanisms. However, technical biases of search methodology may invalidate potential findings by distorting these ratios. For the VF method, the reference peptide was identified more frequently than the variant peptide (p=0.013, oneway ANOVA). The opposite was true for the VC method.
Homozygous variants can be used as a type of control to understand the bias in search methods, since we know that only one of the two alleles can be expressed. In case of homozygous variants, the variant peptide is expected to be present in all cases -with no reference counterpart. This was observed for the VC but not for the VF method ( Figure 7A).
Thus, without prior information about zygosity, the VF method tends to be conservative in identifying SAAV peptides, resulting in a higher likelihood of the reference peptide than its variant counterpart.   One significant subgroup of heterozygous variants was particularly biased towards the alternative allele. Fourty-four out of 183 variant peptides supported by more than two PSMs did not have any detected reference counterparts. One third of these variants had a substitution involving arginine or lysine (tryptic cleavage sites). One gene, HLA-DBQ1, had two alternative alleles instead of one reference and one alternative. In general, the score distribution for these highly biased group was lower than the score distribution for all VC detected variant peptides.
The allele frequencies of this group were not different to those of the overall alternative-biased group (p=0.5, ANOVA). There was also no correlation between the RNA expression of these

Discussion
Here The fact that around a quarter of peptide identifications cannot be attributed to the transcriptomics data is rather surprising. There are a couple possible explanations. Using transcriptomics data from different cells than the proteomics data (different labs and different year) will unavoidably cause some discrepancies 50 . This could also be attributed to protein stability in the cell, as proteins are detectable for some time after RNA have already been degraded 51 . Also notable is the fact that including the transcriptome sequences did not seem to add significantly to the peptide detections; the proportion of novel peptides found was lower than the proportion of novel transcripts found. As this cell line/organism is so well studied, it is likely that the vast majority of present proteins have already been characterized. For other cell types and organisms with more novel transcripts, adding (full length) transcriptomes may lead to more peptide identifications.  62 . This in combination with rapidly improving data-independent acquisition removes detection limitations of lowabundance or otherwise difficult to detect peptides 63 , which is currently a considerable hurdle in SAAV peptide detection 55 . Including open search is clearly useful and bound to get more accurate. This study used ionbot as the sole predictor of unexpected modifications/SAAVs, and comparison between identification tools was difficult as no other identification software tested reported the precise reporter ions per matched spectra (to be able to separate TMT tags corresponding to different cell lines). A study to compare methods given these updates is certainly warranted and ensemble methods may eventually be used to even more accurately predict these unexpected modifications/SAAVs.
One important implication of correctly detecting SAAVs is the ability to observe allele specific expression on the protein level. A targeted proteomics approach has recently been described to study ASPE (allele specific protein expression) with high confidence 64 . It found no correlation between RNA and protein level ASE for the few variants studied, highlighting the utility of having higher throughput methods to study this phenomenon. One simple way to measure ASPE when using a proteogenomics approach is by comparing the spectral counts for the SAAV and its reference counterpart, since a reference counterpart usually has equal detectability by MS/MS 52 . Here we found low correlation between the abundance of the variant and reference counterparts regardless of VF or VC method. This is potentially indicative for a high level of ASPE. In contrast, 54 demonstrated a high correlation between variant and reference peptides. This may be attributed to the low stringency associated with using the multitier search strategy for SAAV detection. We found no correlation between ASE and ASPE was found in this study which is consistent with the findings of Shi et al.

Conclusions
Our study provides guidance for the detection of variant peptides that shape the personal proteome. While personal genomes currently seem indispensable for the characterization of personal proteomes, new computational and analytical tools and new file formats to accommodate personal proteome information will allow us to get the fullest picture possible of the individual proteome, even without personal genome information.