Rigorous estimation of post-translational proteasomal splicing in the immunopeptidome

Recently, de novo peptide sequencing has made it possible to gain new insights into the human immunopeptidome without relying on peptide databases, while identifying peptides of unknown origin. Many recent studies have attributed post-translational proteasomal splicing as the origin of those peptides. Here, we describe a peptide source assignment workflow to rigorously assign the source of de novo sequenced peptides and find that the estimated extent of post-translational splicing in the immunopeptidome is much lower than previously reported. Our approach demonstrates that many peptides that were thought to be post-translationally spliced are likely linear peptides, and many peptides that were thought to be trans-spliced could be cis-spliced. We believe our approach furthers the understanding of post-translationally spliced peptides and thus improves the characterization of immunopeptidome which plays a critical role in the immune response to antigens in cancer, autoimmune disease, and infections.


21
Antigen presenting cells use major histocompatibility (MHC) complexes I or II to present peptides to 22 CD8+ or CD4+ T cells, respectively 1 . Characterization of the peptides presented to T cells, known as the 23 immunopeptidome, is of major interest for the study of infectious disease, autoimmunity as well as cancer 24 immunotherapy 2,3 . Identifying cancer associated peptides that illicit an immune response can point to safe and 25 effective targets for cancer immunotherapy. The discovery and characterization of the immunopeptidome can 26 be achieved using a multitude of technologies such as whole-exome sequencing, RNA sequencing, ribosome 27 profiling and tandem mass spectrometry (MS/MS) based peptide sequencing. While next-generation 28 sequencing approaches can characterize the potential immunopeptidome, only direct detection of peptides, 29 like in MS/MS, can provide experimental evidence for the existence of peptides presented by MHC complexes. 30 By combining information about potential and measured MHC-bound peptides, it is possible to reduce false-31 positives and accurately characterize the immunopeptidome 4-6 . When using next generation sequencing 32 approaches, peptides sequences stemming from somatic missense and indel mutations can be determined. 78910

92
For each potential source of peptides described above (Figure 1, databases), including cisand trans-93 splicing of peptides, we set out to estimate the chances of finding a match randomly. To estimate the false 94 discovery rate (FDR) associated with each source, we asked how many randomly generated sequences could be 95 found in each source. We generated 8-12mer peptide sequences (1,000 per length) in two ways: random 96 sequences uniformly sampling all amino acids (referred to below as uniform random) (Supplementary Table S1) 97 or sequences with frequencies of amino acids matching those found in vertebrates 23,24 (referred to below as 98 weighted random)(Supplementary Table S2). 99 We used the simulated sequences to estimate the FDR of each hypothesized sources of peptides ( Figure  100 2A). Using the uniform random sequences, fourteen out of 5,000 peptides (0.28%) were found in the expanded 101 human proteome (UniProt, OpenProt, lncRNAs, miRNAs and HERVs). When searching in canonically non-coding 102 regions of the human genome using BLAT, 178 out of 5,000 (3.3%) of random peptides could be mapped. We 103 also sought to estimate the FDR when searching for peptides mapping to the human proteome with a single 104 mismatch; we found that 192 out of 5,000 (3.9%) peptides could be mapped. When searching for peptides that 105 may come from non-human organisms in the BLAST database, 604 out of 5,000 (12.1%) sequences could be 106 mapped. Finally, when we search for cis-spliced peptides in the expanded human proteome, 1,936/5,000 (38%) 107 could be mapped; for trans-spliced peptides, 3,598/5,000 (71%) could be mapped (Figure 2A). For weighted 108 random peptides, 50% and 68% of peptides could be assigned as cisor trans-spliced, respectively (Figure 2A). 109 We also estimated the FDR for the hybridfinder pipeline, which we recapitulated ( Figure S2A). We found 110 that none of the simulated peptides were assigned as linear, since hybridfinder only matches peptides to the 111 UniProt database, not the expanded database we designed. However, for the uniform random sequence set, 112 772/5,000 (15%) could be assigned as cis-spliced and 2,329 of the remaining 4,228 (55%) could be assigned as 113 trans-spliced. A total of 3,101/5,000 (62%) of uniform random sequences could be assigned as cisor trans-114 spliced peptides ( Figure S2B). For weighted random peptides, 78% of simulated peptide sequences could be 115 assigned as cisor trans-spliced ( Figure S2B). This high FDR indicates that the search space of spliced peptides 116 within the human genome is so large that post-translational splicing events in the immunopeptidome are likely 117 over-estimated. 118 We set out to design an enhanced peptide mapping pipeline to assign sources for peptides in order of 119 decreasing FDR. When using either set of simulated data to order peptide sources by FDR, the enhanced pipeline 120 searches for peptide sources in the following order: 1) the expanded human proteome database (assigned as 121 linear), 2) the non-coding regions of the human genome using BLAT (assigned as linear), 3) single mismatch 122 peptides in the expanded human proteome (assigned as linear), 4) the BLAST database (assigned as linear), 5) 123 cis-spliced and 6) trans-spliced peptides in the expanded human proteome. When ordered in series, 4,495/5,000 124 (90%) of uniform random sequences and 4,847/5,000 (97%) of weighted random sequences are found with this 125 pipeline ( Figure 2B); while this FDR is high, researchers can choose an appropriate threshold and exclude 126 mapped peptides from high-FDR sources. Indeed, previous studies have excluded searches for trans-spliced 127 peptides due to the presumed high FDR and assumed rarity of occurrence 15,25 . 128 129 Re-analysis of monoallelic cell line data using the peptide source assignment workflow 130 We compared peptide identification by the peptide source assignment workflow versus hybridfinder on 131 peptides eluted from MHC complexes on the data set from the hybridfinder publication: immunopeptidomics 132 from a collection of cell lines engineered to express a single HLA allele 6 . For example, we found that for the cell 133 line expressing HLA-A*02:04, of the 1,075 peptides classified as spliced by hybridfinder, 215 could be classified 134 as linear from the expanded human database, 120 could be classified as linear with one mismatch, and 301 135 could be classified as linear from the BLAST database; overall 636/1,075 (60%) of putatively spliced peptides 136 were reclassified as linear. Additionally, 133 of the peptides that were classified as trans-spliced could be 137 reclassified as cis-spliced using an expanded human ( Figure 3A). Across all cell lines, 36% of putative cis-spliced 138 peptides could be reclassified as linear, and 45.9% of putative trans-spliced peptides could be reclassified as 139 linear or cis-spliced ( Figure 3B). The peptide source assignment workflow we present here shows that putative 140 spliced peptides are likely peptides stemming from mutated DNA sequences, non-canonically spliced RNA 141 sequences, non-canonically translated regions of the human genome, mismatched human sequences or 142 bacterial proteins. Altogether, 20% of peptides are assigned as spliced peptides with the workflow presented 143 here, down from 29% using hybridfinder ( Figure 3B).

144
At each step, we used the random peptides mapping results to estimate how many peptides were likely 145 found by chance. When compared to weighted random peptides, more peptides detected in cell lines were 146 assigned as linear (P < 1e-308, two-sided Fisher's exact test), linear with a single mismatch (P = 3.16e-05), linear 147 from the BLAST database (P = 0.00126), cis-spliced (P = 9.9e-92) and trans-spliced (P = 3.29e-73) ( Figure 3C).

148
More peptides were found that would be expected by searching for random peptides, this could indicate that 149 each source is truly contributing to the immunopeptidome found in each cell line.

151
Peptides that map throughout the human genome are enriched for expressed regions 152 We wanted to examine where peptides that map within the human genome, but outside of the UniProt 153 proteome, land in terms of genomic annotations. The first three steps of the pipeline can map a peptide to 154 regions of the human genome. In the first step, peptides that map exclusively in the OpenProt 18 database land 155 in ORFs that are not in the UniProt human proteome. We analyzed the location of these peptides in the human 156 genome 26 (Figure 4A and Supplementary Table S3), and found that, compared to the locations of all proteins in 157 OpenProt, these peptides are enriched in exons, promoters and 5`UTRs ( Figure 4B). The exonic enrichment is 158 likely due to out-of-frame translation. In the subsequent step, peptides are mapped to a 6-frame translation of 159 the human genome, though since OpenProt includes all proteins originating from ORFs longer than 30 amino 160 acids, these peptides must come from ORFs shorter than 30 amino acids. The genomic annotation distribution 161 of peptides mapped at this step is closer to that of the human genome, i.e. the majority of peptides map to 162 intergenic or intronic regions ( Figure 4C and Supplementary Table S3), possibly indicating these peptides 163 assignments are contaminated with random matching. However, peptides from proteomic data sets were more 164 enriched for exons, promoters and 5`UTRs than peptides from the random uniform or weighted simulations 165 ( Figure 4D). In the third step of the enhanced pipeline, peptides can be mapped to the expanded human 166 proteome with a single mismatch ( Figure 4E and Supplementary Table S3). Peptides mapped at this step show 167 stronger enrichment in exons, introns and promoters than the enrichment found from peptides in the weighted 168 or uniform simulated data sets (Figure 4F). At each step, there is consistent depletion of intergenic regions and 169 enrichment of transcribed regions, as has been found in other studies focused on unidentified peptides in the 170 immunopeptidome 27,28 . The enrichment of transcribed sequences supports the idea that the peptides assigned 171 in these steps of the pipeline are correctly assigned, even though they do not map to proteins in the UniProt 172 database. 173 174 Peptides identified by BLAST are not enriched for any microbial genus

175
The search within the BLAST database has the highest FDR for linear peptides in the peptide source 176 assignment workflow. While peptides from cell lines had modestly more matches in the BLAST database than 177 would be expected based on the uniform or weighted random data (Figure 3C)

201
With the increasing number of peptides identified in immunopeptidomics experiments using de novo 202 sequencing, the need for better characterization of the immunopeptidome is more pressing than ever. Previous 203 studies attributed proteasome-catalyzed peptide splicing (PCPS) as the sole source of peptides of unknown 204 protein identity. We present a peptide source assignment workflow that assigns parental proteins of de novo 205 sequenced peptides from several sources with lower False Discovery Rate (FDR) than the set of all possible PCPS 206 peptides. We found that 32% of putative PCPS peptides can be explained by single mismatches with known 207 proteins, translational of supposedly untranslated parts of the human genome, or bacterial and viral peptides. 208 Not surprisingly, the majority of peptides are encoded by known expressed regions. Finally, we identified a 209 recurrent out-of-frame peptide in the tumor suppressor gene FAM96A that could be of interest as a cancer 210 immunotherapy target. 211 Our pipeline has certain limitations; it is likely that a subset of peptides will be miscategorized in the 212 pipeline described here. It is also possible that, by checking all possible permutations of leucine and isoleucine, 213 due to their identical mass, we assign some peptides to incorrect proteins. We recommend that researchers 214 incorporate expression and develop custom proteogenomic databases from RNA and whole exome sequencing 215 to limit ambiguity of matches. In addition, as found in the data sets explored above, peptides found during the 216 search of the BLAST step could all represent spurious matches, without evidence for a plausible source. In 217 addition, there are additional sources of matches to de novo peptides sequences that, due to the complexity it 218 would add to the search space, we have not added to our pipeline, such as PTM peptides. While there is 219 undoubtedly some misidentification due to not accounting for PTM peptides, the addition of every possible PTM 220 on every peptide would lead to too many possibilities to search. 221 Results from the peptide source assignment workflow presented here indicate that the extent of PCPS 222 peptides may have been highly overestimated. The scope of the search space introduced when searching for 223 cis-and trans-spliced peptides is enormous, and a large fraction of putatively spliced peptides can be explained 224 by incomplete or inaccurate proteome database curation. While the contribution of PCPS is possibly 225 overestimated in the past, our workflow annotates many more peptides as spliced than would be expected by 226 chance. By checking other low-FDR sources first, we anticipate that the peptides that we annotate as PCPS are 227 more likely to be enriched for actual post-translational spliced peptides. Importantly, While the total FDR of our 228 workflow is extremely high, armed with the knowledge of the FDR of each step, researchers can filter out 229 peptides assigned at steps with an FDR that is unacceptably high. 230 231 METHODS: 232 Datasets

233
Simulated random peptides 234 We simulated two sets of peptide sequences for FDR estimation. We used the "random" built-in 235 python library to produce sets of 8-12 length amino acid sequences, 1,000 peptides for each length, a total of 236 5,000 random peptides in each set. For the first simulated peptide sequence set, all amino acids have an equal 237 probability of being incorporated into a sequence; we refer to this set as "uniform random". In the second set, 238 the amino acids have a probability of being incorporated that matches their frequency in vertebrates; we refer 239 to this set as "weighted random". 240 HLA-monoallelic immunopeptidomics 241 For the MS/MS data from HLA-I monoallelic cell lines, we downloaded the peptides from the 242 supplementary

262
We use this database when the workflow searches for linear human peptides and single-mismatched human 263 peptides (steps 1 and 3), as well as in the search for cis-and trans-spliced peptides. 264 265 The peptide source assignment workflow 266 We hypothesized possible sources from which peptides in immunopeptidomics experiments may be 267 found, then measured the FDR inherent in each source using the simulated random datasets described above. 268 We then ordered the steps of the workflow in order of ascending FDR to construct the workflow. The steps 269 applied to each de novo-sequenced peptide are as follows: 270 Step 1: Search for perfect sequence matches in the expanded human proteome database (described 271 above). Leucine (L) and isoleucine (I) have the same mass; therefore it is impossible to differentiate them in de 272 novo search sequencing 31 . To account for this, for a given peptide containing I/L we considered all permutations 273 of I and L residues. For example, for the peptide "ATTSLLHN" we have four possible permutations: ATTSLLHN, 274 ATTSLIHN, ATTSILHN, and ATTSIIHN. If the algorithm finds a perfect match for any permutation, the peptides 275 is annotated as "Linear", and all possible protein sources of the peptide are included in the output. Otherwise 276 the algorithm moves to step 2. 277 Step 2: Search for a perfect match in any of the six frames of the translated human genome using BLAT 32 . 278 We used the following commands: 279 blat -t=dnax -q=prot -minScore=7 -stepSize=1 hg38.2bit Fasta_query output.psl 280 psl2bed < output.psl > perfect_match.bed 281 282 If a perfect match is found, that peptide is annotated as "Linear" and possible source sequences are 283 included in the output. Otherwise the peptide is passed to the step 3. 284 Step 3: Peptides are mapped to the expanded human proteome database, this time allowing one 285 mismatch using this code: 286 blat -t=prot -q=prot -minScore=7 -stepSize=1 combined_DB.processed.fasta Fasta_query 287 output_blat_hits.psl"

288
If a sequence with a single mismatch is found, the peptide is annotated as "one mismatch". Otherwise 289 the peptide is passed to Step 4. 290 Step 4: Sequences are mapped to other organisms using the BLAST NCBI tool. If any perfect matches are 291 found the results are annotated as "Blast Linear". 292 Step 5: For the remaining peptides, the algorithm generates all possible splits of the peptide where the 293 length of the smaller piece is larger than 1. Then it looks for matches of both fragments in all human sequence 294 databases. If there is a match for both chunks in the same protein, the tool annotates the peptide as "cis-295 spliced". Otherwise, if there are hits for both fragments in two different proteins, the tool annotates the peptide 296 as "trans-spliced". The rest of peptides that do not have any matches are assigned as not available (N/A  c) The Fisher's exact test p-values measuring the enrichment of how many peptides were able to be assigned to 419 a source at each step, compared to how many would be expected based on how many were assigned of the 420 simulated random sequences.  Step 3

433
The linear peptide sequence matches perfectly to their parental protein, fragments of cis-spliced 434 peptides are from the same protein, and the trans-spliced peptide fragments are from different proteins 435