Abstract
HLA-I molecules play a central role in antigen presentation. They typically bind 9- to 12-mer peptides and their canonical binding mode involves anchor residues at the second and last positions of their ligands. To investigate potential non-canonical binding modes we collected in-depth and accurate HLA peptidomics datasets covering 54 HLA-I alleles and developed novel algorithms to analyze these data. Our results reveal frequent (442 unique peptides) and statistically significant C-terminal extensions for at least eight alleles, including the common HLA-A03:01, HLA-A31:01 and HLA-A68:01. High resolution crystal structure of HLA-A68:01 with such a ligand uncovers structural changes taking place to accommodate C-terminal extensions and helps unraveling sequence and structural properties predictive of the presence of these extensions. Scanning viral proteomes with the new C-terminal extension motifs identifies many putative epitopes and we demonstrate direct recognition by human CD8+ T cells of a C-terminally extended epitope from cytomegalovirus.
Introduction
Human Leukocyte Antigen class I (HLA-I) molecules play a major role in immune defense mechanisms by presenting to T cells peptides from the intracellular matrix (1). Peptides presented on HLA-I molecules originate mainly from proteasomal degradation of self or pathogen-derived proteins. These peptides are first translocated to the endoplasmic reticulum. There, they can load on HLA-I molecules provided their sequence is compatible with HLA-I binding motifs. Peptide-HLA-I complexes are then transported to the cell surface where they can elicit T cell recognition, primarily upon presentation of non-self peptides.
Most HLA-I alleles preferentially bind 9- to 12-mer peptides, although important differences have been observed between alleles in terms of the preferred length of their ligands (2–6). The majority of alleles accommodate peptides with anchor residues at the second and last positions. From a structural point of view, anchor residues point directly into the HLA-I peptide binding groove. Their importance for HLA-I peptide interactions is also reflected at the sequence level, where alignments of HLA-I ligands display clear specificity at the second and last positions for most alleles. 9-mer HLA-I ligands are characterized by a linear binding mode. For longer peptides, numerous crystal structures have shown the presence of a bulge at middle positions, protruding outside of the HLA-I binding site in order to accommodate the additional residues between the two anchor positions (e.g., ref. (7, 8)).
Over the years, anecdotal evidences of C-terminal extensions beyond the last anchor position have been observed among HLA-A02:01 ligands and the first crystal structure with such a ligand was published in 1994 (PDB: 2CLR) (9). More recently, analysis of HLA peptidomics data obtained by Mass Spectrometry (MS) from cell lines transfected with soluble HLA-A02:01 and infected with Toxoplasma gondii revealed several C-terminal extensions and showed that, for this allele, C-terminal extensions were mainly found among peptides coming from pathogens and may involve cross-presentation (10, 11). X-ray crystallography revealed distinct structural mechanisms in HLA-A02:01 to accommodate C-terminal extensions (10, 11). N-terminal extensions have also been recently observed in HLA-B57:01 (12) and HLA-B58:01 (13). However, N- or C-terminal extensions have not been much investigated in other HLA-I alleles based on unbiased HLA-I ligand datasets (see some predictions for HLA-C04:01 in (14)). As such, it remains unclear whether they occur frequently and, if yes, whether they can be recognized by CD8 T cells.
Here, we introduce a novel statistical approach to rigorously investigate N- and C-terminal extensions in large datasets of naturally presented HLA-I ligands obtained by in depth and accurate HLA peptidomics profiling of cell lines and tissue samples covering more than 50 HLA-I alleles. Our work reveals widespread C-terminal extensions for at least eight HLA-I molecules (HLA-A02:03, HLA-A02:07, HLA-A03:01, HLA-A31:01, HLA-A68:01, HLA-A68:02, HLA-B27:05 and HLA-B54:01), and we identify both sequence and structural features in HLA-I alleles predictive of the presence of C-terminal extensions. A new crystal structure of HLA-A68:01 in complex with such a ligand uncovers structural changes to accommodate the C-terminal extensions. Scanning viral proteomes with our new motifs describing C-terminal extensions further enabled us to demonstrate direct CD8 T cell recognition of an HLA-A03:01 restricted C-terminally extended epitope from cytomegalovirus (CMV).
Results
Unbiased investigation of C- and N-terminal extensions
To investigate the presence of C- and N-terminal extensions in HLA-I ligands, we collected recent pooled and mono-allelic HLA peptidomics datasets from seven studies covering 43 samples, 54 different HLA-I alleles and 109,953 unique peptides (3, 15–20) (see Materials and Methods and Table S1). These datasets were all generated with <1% false discovery rate and were not filtered with any predictor. We hypothesized that N- or C-terminal extensions among naturally presented endogenous HLA-I ligands may be determined by identifying peptides of ten or more amino acids that do not match motifs expected for ligands following the bulge model (Figure 1A). For pooled HLA peptidomics data, 9-mer HLA binding motifs were identified and annotated with our recent motif deconvolution algorithm (4, 20) (see example in Figure 1B and Materials and Methods). Importantly, these motifs were identified without relying on HLA-I ligand predictors and therefore represent a fully unbiased view of the 9-mer HLA peptidome. Focusing on the first three and the last two positions in the 9-mer motifs, we then built Position Weight Matrices (PWM) modeling bulges, N-terminal extensions or C-terminal extensions for each of the 9-mer motifs (Materials and Methods and Figure 1C). For 10-mers, bulges were modeled by incorporating five non-specific positions between the first three and the last two residues. N-terminal extensions were modeled by adding four non-specific positions in the middle and one non-specific position at the N-terminus. C-terminal extensions were modeled by adding four non-specific positions in the middle and one non-specific position at the C-terminus (Materials and Methods and Figure 1C). Alleles displaying anchor residues at middle positions (e.g., HLA-B08:01) were not considered as 10-mer ligands are much less frequent for them (4).
Using the three models (bulge, N- and C-terminal extensions) derived from each motif observed among 9-mer ligands, we then scored all 10-mer peptides (see Figure 1C and Materials and Methods). Peptides that displayed a much higher score for exactly one allele and one model were assigned to this allele and the corresponding model. For the special case of mono-allelic cell lines (3), comparison was only performed among bulge, N- and C-terminal extension models of the same allele. To determine statistical significance of N- or C-terminal extensions, we further developed a null model representing the expected 10-mer HLA-I ligands assuming only bulges (Materials and Methods). We finally retrieved all sets of peptides predicted to follow N- or C-terminal extensions associated to a given allele in a given sample that could not be explained by statistical fluctuations in the data (Z-score > 2). For each set of peptides the motif was modeled with a PWM, graphically represented as a sequence logo. This resulted in 15 motifs of C-terminal extensions (for a total of 396 unique 10-mer peptides) and no motif of N-terminal extensions (Figure 2A and Table S3). Seven motifs corresponded to C-terminal extensions associated to HLA-A03:01 across different HLA peptidomics samples, two motifs to HLA-A68:01 and one motif to each of the other alleles (i.e., HLA-A02:03, HLA-A02:07, HLA-A31:01, HLA-A68:02, HLA-B27:05 and HLA-B54:01, Figure 2A, see also Figure S1, Table S2 and S3). We first note that motifs describing C-terminal extensions associated to the same allele (i.e., HLA-A03:01 or HLA-A68:01) in different datasets displayed a very high similarity among each other (rectangles in Figure 2A). This highlights the remarkable reproducibility of our predictions across our very heterogeneous set of studies performed in different places, with different protocols and on different cell lines or tissue samples. For all predicted C-terminal extensions that passed our statistical threshold, we show the average frequency of such ligands in Figure 2B (Materials and Methods). Overall, our results predicted widespread (i.e. from 3% to 16% of the 10-mer peptides) C-terminal extensions.
To investigate whether C-terminal extensions may extend for more than one amino acid, we applied our approach to longer peptides (i.e., 11-mers, allowing for two unspecific positions at the N- or C-terminus) and found statistically significant evidences of C-terminal extensions for HLA-A02:07 and HLA-B27:05 for a total of 46 unique 11-mer peptides (Figure 2C-D, Table S2 and S3). Of note, the motifs are consistent with the 10-mer C-terminal extensions previously observed (Figure 2A). A trend was also observed for HLA-A02:01, HLA-A02:03, HLA-A03:01, HLA-A68:02 and HLA-B54:01 (see Table S3), although it did not pass our thresholds on the number of ligands or the Z-score, and was not consistently observed across samples. For other alleles, we did not observe statistical evidences of C-terminal extensions of two residues among 11-mers, although we cannot exclude that rare C-terminal extensions could be missed in MS data because of the competition with shorter ligands. When analyzing even longer ligands (12-mers), we did not find anything statistically significant, but the number of such ligands identified by MS is much smaller so that it may be difficult to confidently identify C-terminal extensions with our approach and distinguish them from potential noise in MS data.
In vitro validation
To experimentally test our predictions, we selected several 10-mer ligands predicted to follow the C-terminal extension binding mode for three of the most frequent alleles with such predictions (i.e., HLA-A03:01, HLA-A31:01, HLA-A68:01). We then mutated either the last or second-to-last residue and experimentally measured the stability of the wild-type and the two mutated peptides (Materials and Methods). 9-mer peptides without the predicted C-terminal extensions were used as positive controls. As expected mutating the last residue of the 10-mer peptides had little effect on their binding affinity, while mutating the second to last residue significantly decreased the stability of the complexes (Figure 3). These data strongly suggest that, for all these 10-mer peptides, the second to last residue is playing the role of the anchor residue and the last residue is extending beyond the canonical C-terminus of HLA-I ligands. To test whether longer C-terminal extensions may bind to HLA-A03:01, we added all amino acids not compatible with the bulge model at the C-terminus of one of the C-terminally extended 10-mer ligands used in Figure 3 (KLAYTLLNKL*) and measured the stability of these 11-mer peptides (Figure S2). Our results indicate that most of the 11-mers did bind, although with lower stability compared to the 10-mer peptide (i.e. most complexes were fully dissociated at time 72h). This suggests that longer C-terminal extensions can extend for more than one residue in HLA-A03:01 ligands, although this likely corresponds to a small fraction of the actual HLA peptidome, as suggested by their low frequency in MS data.
Robustness to noise
To explore the robustness of our findings with regard to noise in HLA peptidomics data, we rerun our whole pipeline (i.e., motif identification (4) and annotation (20) in 9-mers as well as C-terminal extension predictions in 10-mers) adding 5% of randomly selected peptides from the human proteome to all datasets. Remarkably, the predicted C-terminal extensions remained basically unchanged (Figure S3). In particular, we did not observe any new predicted C-terminal extension motifs that would arise by chance from the random peptides. This analysis clearly suggests that C-terminal extensions predicted in this work are not resulting from contaminations in HLA peptidomics data.
Crystal structures of C-terminally extended HLA-I ligands
To investigate the structural mechanisms underlying C-terminal extensions uncovered in this work, we generated a high resolution (1.6Å) crystal structure of HLA-A68:01 in complex with a 10-mer peptide (ETSPLTAEKL, see Figure 3) predicted to follow the C-terminal extension binding mode (see Materials and Methods, Figure 4A and Figure S4). As expected, the lysine at P9 (yellow sidechain) filled the F pocket and superimposed nicely with the last residue (K9) of canonical 9-mer ligands of HLA-A68:01 (blue sidechain in Figure 4A, PDB:4HWZ) (21). More importantly, in order to accommodate the C-terminal extension (L10), the Y84 sidechain was flipped by 90 degrees and the two alpha helices around the F pocket moved away from each other (Figure 4A), as measured by the distance between C-alpha atoms of residues 80 and 143 (D80-143=11.1Å versus D80-143=10.1Å for HLA-A68:01 in complex with a 9-mer peptide). Interestingly, the same flip had been observed in one of the HLA-A02:01 structures in complex with C-terminally extended ligands (10) (see Figure 4B). However, in contrast to HLA-A02:01 where the sidechain of the first residue of the C-terminal extension points towards the solvent (K12 in PDB:5DDH in Figure 4B (10), see also L10 in PDB:5FA4 and D10 in PDB:5F7D (11)), in the case of HLA-A68:01, the C-terminal extension sidechain (L10) filled the pocket created by the flip of Y84 sidechain (see also Figure S5). This result is fully consistent with the specificity for hydrophobic residues in the C-terminal extension motif of HLA-A68:01 (Figure 2A).
Propensity of HLA-I alleles for C-terminal extensions
To investigate the molecular determinants of C-terminal extensions, we aligned the sequences of all HLA-I alleles considered in this work and checked whether some amino acid patterns in residues surrounding the F pocket would characterize alleles with C-terminal extensions (Materials and Methods). Clear differences were observed at specific positions (Figure 4C) and showed overlap with properties characterizing HLA-A alleles, as expected from the higher proportion of HLA-A alleles with C-terminal extensions. Interestingly, glycine is known to destabilize alpha helices and the presence of glycine at position 79 (star in Figure 4C) in most HLA-I alleles displaying C-terminal extensions may endow the α1 helix with the flexibility required to accommodate such extensions (Figure 4A). To test whether these amino acid patterns may help predict whether an allele is more likely to display C-terminal extensions, we trained a logistic regression and performed a rigorous cross-validation, using as features all amino acids surrounding the C-terminus of 9-mer ligands (Materials and Methods). An average Area Under the ROC Curve (AUC) of 0.85 could be obtained, suggesting that alleles accommodating C-terminal extensions may be reasonably well predicted from their sequence. Lower accuracy was reached in a simple model where HLA-A alleles are predicted to display C-terminal extensions and HLA-B/C alleles are not (AUC=0.76). To further investigate molecular mechanisms allowing for C-terminal extensions, we surveyed available X-ray structures of HLA-I alleles considered in this work (mainly in complex with canonical 9-mer ligands and not including any C-terminal extensions, see Table S4). We then computed the distance between the two alpha helices surrounding the C-terminus of canonical ligands (see before and Table S4). Interestingly, we observed that, on average, alleles with C-terminal extensions showed naturally larger distances between these two helices already when interacting with canonical 9-mer peptides (P=0.002, Wilcoxon rank-sum test, Figure 4D). Using this distance to predict alleles accommodating C-terminal extensions among those with available crystal structures led to an AUC of 0.92. Of note, HLA-B51:01 is characterized by much higher frequency of 8-mer ligands compared to other HLA-I alleles (3, 5, 22). As for HLA-A01:01, we observed a trend in some samples (e.g., the mono-allelic cell line, see Table S2), but not in other samples (e.g., Melanoma/Mel_12, although most 10-mers are expected to come from HLA-A01:01 since HLA-B08:01 and HLA-C07:01 poorly bind 10-mers). This suggests that C-terminal extensions for HLA-A01:01, if present, show lower stability and may not be reproducible across samples. We finally point out that the clear patterns in both sequence and structural properties of HLA-I alleles displaying C-terminal extensions provides an additional and independent validation of our predictions in Figure 2 based only on HLA peptidomics data.
Explicitly incorporating C-terminal extensions in HLA-I ligand predictors
C-terminal extensions have not been routinely investigated in previous studies and the training set of most existing HLA-I ligand predictors typically does not include them, even if recent version of NetMHC tools can mathematically handle them (23, 24). We therefore collected all C-terminally extended ligands uncovered in this work and retrained our predictor MixMHCpred (20) using multiple PWMs to model C-terminal extensions (25, 26) (Materials and Methods). To validate our new algorithm, we took advantage of the fact that HLA-A03:01 and HLA-A68:01 alleles were present in multiple datasets (see Table S1) and attempted to re-predict all the 10-mer peptides of these datasets, excluding data from the dataset used for testing in the training of our predictor (see Materials and Methods). As negative data, we included 4-fold decoy (i.e., 10-mers randomly selected from the human proteome) and computed both the Positive Predictive Value corresponding to the top 20% predictions (PPV, which in this case is also equivalent to the recall since the number of true positive is equal to the number of predictions) and the AUC (see Materials and Methods). Overall, we observed that explicitly modeling C-terminal extensions increased the performance compared to the previous version of our predictor (MixMHCpred1.0) (20), as well as other widely used HLA-I ligand predictors that did not include unbiased MS data in their training set (23, 24) (Figure 5A and Figure S6). The improvement came mainly from higher scores for ligands displaying the C-terminal extensions, as shown in Figure 5B for the HLA-A03:01 10-mer ligands isolated from a mono-allelic cell line (3) (P=1.0x10-7 for [R/K][A/F/I/L/M/Y]-COOH 10-mer peptides, Wilcoxon signed-rank test).
Almost no difference between MixMHCpred1.0 and MixMHCpred1.1 could be observed when testing our algorithm on other non-MS datasets used in previous benchmarking studies. However, we anticipate that C-terminal extensions are very rare in these datasets since most of the known HLA-I ligands had been first predicted with former versions of HLA-I ligand predictors that did not include C-terminal extensions. For instance, when analyzing IEDB data (27) for HLA-A03:01 ligands coming from earlier studies than those considered in this work, we could not see any statistical evidence of C-terminal extensions, suggesting that such ligands had not been tested in binding assays, or had been filtered in MS data.
Identification of immunogenic C-terminally extended epitopes
To investigate the immunological relevance of HLA-I ligands displaying C-terminal extensions, we used our new predictor and scanned both human cancer testis antigens and viral proteomes with the motifs characterizing 10-mer C-terminal extensions for HLA-A03:01 and HLA-A68:01 (Materials and Methods). This analysis revealed many putative epitopes, including 10-mer peptides from the PRAME cancer testis antigen (RLWGSIQSRY for HLA-A03:01 and ETLSITNCRL for HLA-A68:01) as well as several CMV, EBV, HIV, HPV, Influenza and Yellow Fever peptides. To confirm that these peptides can indeed bind to HLA-A03:01 or HLA-A68:01, we measured the binding stability and found values falling within the range of CD8 T cell epitopes (see Table 1).
We then focused on one CMV peptide binding to HLA-A03:01 (star in Table 1). We stimulated a peripheral blood mononuclear cells (PBMCs) sample from a healthy donor (both HLA-A03:01 and CMV seropositive) with the 10-mer peptide for 12 days. Subsequently we observed cytokine production by IFNγ-ELISpot after re-challenge with the CMV 10-mer peptide for 16h (Figure 5C), suggesting that this epitope is potentially immunogenic in human. To determine whether CD8 T cells could directly recognize the C-terminally extended 10-mer peptide, we constructed HLA-A03:01 tetramers loaded with the 10-mer peptide. Analyzing CD8 T cells with such tetramers revealed a population of CD8 T cells that directly bound to the C-terminally extended 10-mer peptide/HLA-A03:01 complex (Figure 5D). This clearly indicates that the presence of C-terminal extensions in HLA-I ligands does not prevent TCR recognition and shows, for the first time to our knowledge, that C-terminally extended peptides can form bona fide CD8 T cell epitopes.
Discussion
MS analysis of HLA-I ligands provides an unbiased view of HLA-I peptidomes that is not restricted by a priori assumptions on HLA-I binding specificity. MS-based approaches have therefore a high potential to identify new properties of HLA-I-peptide interactions. Here we capitalized on this premise to explore non-canonical binding modes of HLA-I ligands. Out of the 54 alleles considered in this work, we found clear statistical evidences of C-terminal extensions for eight of them and validated these predictions at the biochemical and structural level for the most frequent of these alleles in Caucasian populations.
In addition to providing statistical evidences of C-terminal extensions, our work enabled us to characterize new motifs describing this non-canonical binding mode. We observed that the C-terminal extensions are often characterized by the presence of hydrophobic amino acids, which could be structurally rationalized for HLA-A68:01. We also note that, for three out of eight alleles (i.e., HLA-A03:01, HLA-A31:01 and HLA-A68:01), positively charged amino acids are found at the anchor residue filling the F pocket. This positively charged residue interacts with D77 and D116 in our HLA-A68:01 structure, which is consistent with the observation that Aspartate is preferentially observed at these positions in alleles accommodating C-terminal extensions (Figure 4C). The clearly distinct binding specificity between the anchor residue and the C-terminal extension makes these cases especially amenable for the statistical model that we developed to analyze HLA peptidomics data. However, charged residues at the C-terminal anchor positions only characterize a minority of HLA-I alleles, and the majority of alleles show preference for hydrophobic residues. In these cases, and assuming that the preference for hydrophobic amino acids at the C-terminal extension is conserved, sequenced-based algorithms cannot unambiguously determine whether a 10-mer peptide with 2 hydrophobic amino acids at the last two positions follows the bulge or the C-terminal extension binding mode. Moreover, we anticipate that competition with high affinity (mainly 9-mer) ligands can mask cases of less frequent C-terminal extensions in HLA peptidomics data. Therefore our estimates of the number of alleles that accommodate C-terminal extensions correspond to lower bounds and we cannot exclude that this non-canonical binding mode may be observed in other alleles. In particular, we cannot exclude that some C-terminal extensions among HLA-A02:01 ligands may be present in our data with two hydrophobic residues at the last positions. Nevertheless, the smaller distance between the two alpha helices of HLA-A02:01 (Figure 4D) suggests that C-terminal extensions for this allele should be rarer, or involve other processing and presentation mechanisms, like cross-presentation (10, 11). Along this line, we note that statistically significant C-terminal extensions are identified by our algorithm in the set of Toxoplasma gondii HLA-A02:01 ligands (10) (see Table S5), confirming that our approach is able to capture previous evidences of C-terminal extensions.
Finally, we stress that contaminations from co-eluting or wrongly identified peptides, as well as challenges in aligning small peptides or in the deconvolution of pooled HLA peptidomics datasets, can easily result in cases that look like N- or C-terminal extensions by chance. This is the reason why we developed the statistical framework described in this work and extensively tested the robustness of our predictions with respect to noise in the data.
Despite these inherent limitations of sequence-based approaches, it is likely that several alleles show very few if any C-terminal extensions. For instance, six alleles from our list showed specificity at the last anchor residue that is not restricted to hydrophobic amino acids (mainly tyrosine, see Figure S7) but we did not see any trend of C-terminal extensions among their 10-mer ligands, both in pooled and in mono-allelic HLA peptidomics data (see Table S2). Among them, all that had available X-ray data displayed a smaller distance between the two alpha helices (D80-134), which is consistent with our observation in Figure 4D.
Our model also incorporated the possibility to have N-terminal extensions but we did not find any such event, although N-terminal extensions have been recently reported (12, 13). Inspection of existing structures of HLA-I molecules in complex with 9-mer peptides shows that the C-terminal carboxyl group of 9-mer HLA-I ligands is often partly solvent exposed. Conversely, the N-terminal amide group points toward the binding site and is significantly less surface accessible in most HLA-I structures. This likely explains why N-terminal extensions appear to be much less frequent, although we cannot exclude that our approach may miss some N-terminal extensions if the specificity at P2 is the same as at P3 in the N-terminally extended ligands. Interestingly, this appears to be the case for N-terminal extensions reported for HLA-B57:01 and may explain why these extensions could not be easily detected with sequence-based approaches only and have been first identified by X-ray crystallography (12).
Our demonstration of direct recognition of C-terminally extended HLA-I ligands by human CD8 T cells (Figure 5D) shows that C-terminal extensions are compatible with TCR binding. C-terminal extensions may further play a role in the binding of KIR proteins, which are known to recognize mainly amino acids surrounding the C-terminus of canonical HLA-I ligands (12, 28, 29). As such, we anticipate that inclusion of C-terminal extensions the new version of our HLA-I ligands predictor for alleles displaying frequent C-terminal extensions may help uncovering new CD8 T cell epitopes in viruses or tumors, as well as potential targets of NK cells.
Overall, our results reveal frequent C-terminal extensions in at least eight out of the 54 HLA-I alleles analyzed in this study and highlight the power of unbiased HLA peptidomics data together with new algorithms to unravel novel properties of HLA-I molecules. Our demonstration of direct T cell recognition of such epitopes suggests that C-terminal extensions may be clinically relevant for infectious diseases or cancer immunotherapy.
Materials and Methods
Collection of HLA peptidomics data
HLA peptidomics data used in this study came from seven different studies (3, 15–20) (see Table S1). In all these studies, peptides were eluted from class I specific antibody purified HLA molecules. None of these data had been filtered with existing predictors, and therefore all had the potential to reveal non-canonical ligands. We only included MS samples generated with < 1% false discovery rate, with available HLA-I typing information and in which HLA-I motifs could be clearly annotated by our motif deconvolution approach for pooled HLA peptidomics data (4, 20). In total, our dataset comprises 43 samples covering 54 different HLA-I alleles (see Table S1) for a total of 109,953 unique peptides (9- to 12-mers).
Analysis of non-canonical binding modes
Analysis of 9-mer HLA peptidomes
To analyze non-canonical binding modes in pooled HLA peptidomics studies that include ligands from up to six different HLA-I molecules, we developed the following pipeline (Figure 1B and 1C). We first characterized the different motifs of the 9-mer HLA peptidomes using our recent motif deconvolution algorithm (4). This algorithm has the major advantage of not depending on a priori knowledge of HLA-I binding specificity, and therefore can be applied even in the presence of poorly characterized HLA-I alleles. Each 9-mer peptide was then assigned to its corresponding motif and PWMs were built for each group of peptides, including random counts based on BLOSUM62 (20).
Prediction of N- and C-terminal extensions
To distinguish between the three different models describing longer peptides (i.e. bulge, N-terminal extensions, C-terminal extensions), we developed the algorithm illustrated in Figure 1C. We first reasoned that longer peptides following the bulge model should have conserved binding specificity around the anchor residues at the second and last positions. We therefore modeled these peptides with the binding specificity of 9-mers at the first three and last two amino acids, and unspecific position in the middle. In the absence of a priori information about the amino acid preferences at terminal extensions, we modeled them as with one unspecific position at the N- or C-terminus. To enable meaningful comparison between the scores of the bulge and the terminal extension models, we did not impose any constrain at middle positions in the N- or C-terminal extension models. Each longer peptide (i.e., L-mer, with L equal to 10, 11 or 12) was then scored with all models (i.e. bulge, N- or C-terminal extensions) derived from each motif identified in the 9-mer HLA peptidome. Scores for the different models were computed as: for the bulge model; for the C-terminal extension model and for the N-terminal extension model, where M stands for the 20x9 PWM derived from 9-mer ligands for a given allele, X stands for the sequence of a L-mer peptide, S equals to L-9, and P(Xi) stands for the frequency of amino acid Xi in the human proteome. Peptides were then assigned to one model and one allele if their score with this allele and this model was higher than a threshold T1 and no other score for all the other possible models of any allele was larger than the largest score minus T2. Here we chose values of T1 and T2 equal to 2.0, since over 95% of the longer peptides found in our HLA peptidomics datasets had a score larger than this value for at least one allele and one model.
Alleles with anchor residues at positions 4 to 7 (i.e., HLA-B08:01, HLA-B14:01 and HLA-B14:02) were excluded, as they display non-conserved binding motifs between 9- and 10-mers and much less 10-mer ligands (4).
Null model
To account for possible noise in MS data and focus on predictions that showed statistical significance, we developed a null model representing the expected HLA peptidome from the bulge model, and described here for the case of 10-mer ligands. Starting with a list 100’000 10-mer peptides randomly selected from the human proteome, we selected those that passed the threshold T1 for the bulge model of at least one motif. The list of peptides was further randomly filtered so as to have the same number of peptides assigned to each allele as in the actual 10-mers HLA peptidomics data. We then re-predicted these peptides using all three models for each motif (Figure 1C). This enabled us to assess how many 10-mer peptides generated from the bulge model of each allele could by chance be assigned to the N- or C-terminal extensions of any allele when considering the three models. The simulations were repeated 100 times to derive a Z-score. Only alleles with at least 50 10-mer peptides assigned to them, at least 10 peptides assigned to the C- or N-terminal extension model and with a Z-score larger than 2 were included in our predictions. Similar predictions were obtained using values ranging between 1.5 and 2.5 for the threshold T1, or using a different threshold T1 for each motif given by the score corresponding to the top 2% predictions in a large set of 100’000 10-mer peptides randomly selected from the human proteome (Table S6).
The fractions of C-terminal extensions shown in Figure 2B and 2D were computed as the number of peptides unambiguously assigned to the C-terminal extension mode of a given allele, divided by the number of peptides assigned to any of the three models for the same allele.
Sequence logos were generated with the LOLA software (http://baderlab.org/Software/LOLA).
In vitro binding assays
All peptides shown in Figure 3, Figure S2 and Table 1 were synthesized with free N and C-termini (1mg of each peptide, > 80% purity). Peptides were incubated separately with denaturated HLA alleles refolded by dilution in the presence of biotinylated beta-2 microglobulin proteins at temperature T=4°C for 48 hours. The solution was then incubated at 37°C. Samples were retrieved at time t=0h, 8h, 24h, 48h and t=72h. Stable complexes indicating interactions between HLA-I molecules and the peptides were detected by ELISA. Signals for 9-mer ligands at time t=0h were used for renormalization in Figure 3 and Figure S2, while negative controls consisted of absence of peptides. Half-lives (Table 1) were computed as ln(2)/koff, where koff were determined by fitting exponential curves to the light intensity values obtained by ELISA, after removing the background signal (i.e., no peptides). Two independent replicates were performed for each measurement.
Expression and purification of HLA-A68:01 and β2M
The heavy chain HLA-A68:01 and the light chain β2M were purified from inclusion-bodies (30) with some small modifications. Briefly, recombinant expression plasmids were transformed into BL21 (DE3) (HLA-A68:01) or XA90 strain (β2M) bacteria. The cells were grown overnight in LB (Luria-Bertani medium) twice concentrated supplemented with 50 µg ml-1 kanamycin or 100 µg ml-1 ampicillin respectively at 37 °C. One litre of pre-warmed LB was inoculated with 10 ml of the overnight culture and was incubated at 37 °C. At an OD600 nm of 0.6, expression was induced for 8 h at 37 °C with 1.0 mM isopropyl-β-D-thiogalactopyranoside (IPTG). Cells were then harvested by centrifugation (8700xg, 15 min, 4 °C, Beckman Coulter Avanti J-20 XP centrifuge), and then re-suspended in lysis buffer (10 mM TRIS-HCl, pH 8.0 at 20 °C complemented with 100 µg ml-1 Lysozyme (Sigma Aldrich), 250 units of Benzonase (Novagen), 1 mM EDTA and 1:1000 (v/v) Protease Inhibitor Cocktail III (Calbiochem)). After 20 min of continuous rocking at 22 °C, cells were lysed 3 times at 4 °C using a Basic Z-Model Cell Disrupter (Constant Systems Ltd, UK). Inclusion-bodies were isolated by centrifugation (16000xg for 1 h at 4 °C, JA 25.50 rotor, on a Beckman Coulter Avanti J-20 XP centrifuge). The supernatant was decanted and the contaminated material was removed by scrapping the outer rings (viscous and dark coloured) leaving the more compact and lighter coloured inner ring (containing the protein of interest) intact. The pellet was then fully dispersed in 20 ml of washing buffer (10 mM TRIS-HCl, pH 8.0 at 20 °C) complemented with 1:1000 (v/v) Protease Inhibitor Cocktail III (Calbiochem) and centrifuged again for 10 min (16000xg at 4 °C, JA 25.50 rotor, on a Beckman Coulter Avanti J-20 XP centrifuge). The washing/scrapping steps were repeated 5 times (until the outer rings vanished). The pellet containing the recombinant protein was then dissolved in 20 ml of solubilisation buffer (100 mM TRIS-HCl, pH 8.0 at 20 °C, 8 M urea). Insoluble material was precipitated by centrifugation (16000xg for 1 h at 4 °C, JA 25.50 rotor, on a Beckman Coulter Avanti J-20 XP centrifuge). Solubilized HLA-A68:01 heavy chain was immediately flash frozen in liquid nitrogen and stored at -80 °C. The recombinant β2M protein in urea was refolded by dialysis against 10 mM TRIS-HCl pH7.0 using SnakeSkin® Dialysis tubing (3.500 MWCO, Thermo Scientific) and purified by ion exchange on Hi-Trap Q HP (5 ml, GE Healthcare) column, with a linear gradient from 0 to 100 mM NaCl. Fractions containing pure β2M were dialysed overnight against water, concentrated with a 10 MWCO concentrator (Amicon® Ultra, MILLIPORE) to 2 mg ml-1, flash frozen in liquid nitrogen and stored at -80 °C. HLA:β2M:peptide complex assembly
The protein complex was reconstituted by dilution of the denatured HLA-A68:01 heavy chain (3 μM) and β2M (6 μM) in presence of the peptide ([H]-E-T-S-P-L-T-A-E-K-L-[OH], 10 μM) into 200 ml of refolding buffer (100 mM TRIS-HCl, pH 8.0 at 20 °C, 400 mM L-Arginine HCl (Sigma Aldrich), 2 mM EDTA, 5 mM reduced L-glutathione (Sigma Aldrich), 0.5 mM oxidized L-glutathione (Sigma Aldrich) and with 1:1000 (v/v) Protease Inhibitor Cocktail III (Calbiochem)). The refolding mixture was incubated at 10 °C during 36 h under constant stirring. Every 12 h another batch of the denatured HLA-A68:01 heavy chain (3 μM) was added to the mix. The 200 ml were then concentrated to 5 ml using a 3 MWCO concentrator (Amicon® Ultra, MILLIPORE). The concentrated protein mixture was further submitted to exclusion size chromatography (HiLoad™ 16/60 Superdex™ 75 prep grade, on Äkta Pure GE Healthcare) and each peak collected was submitted to SDS-PAGE gel and mass spectrometry (Agilent 6530 QTOF (Agilent Technologies Inc. - Palo Alto, CA)) to assess the simultaneous presence of the 3 components of the complex. Fractions of interest were then stored at 4°C until further use.
Crystallization
Prior to crystallization, the buffer of the protein complex was exchanged to 25 mM MES pH6.5 at 20 °C and 150 mM NaCl on a Superdex™ 200 Increase 10/300 GL column using an Äkta Pure system (GE Healthcare). The complex was then concentrated to 10.96 mg ml-1 final using a 3 kDa MWCO concentrator (Amicon® Ultra, MILLIPORE). Aliquots of the complex were set up for crystallization using a mosquito® crystallization robot (TTP Labtech, Royston UK). Coarse screens were typically setup onto Greiner 3-well plates using three different drop ratios of precipitant to protein per condition (100+50 nl, 75+75 nl and 50+100 nl). Crystallization was carried out using the sitting drop vapor diffusion method at 4 °C. Crystals were grown by mixing 100 nl of the protein complex (10.96 mg/ml) with 50 nl of reservoir solution containing 0.1 M HEPES pH 7.5, 12 % PEG3350, 0.005 M CoCl2, 0.005 M NiCl2, 0.005 M CdCl2 and 0.005 M MgCl2. Diffraction quality crystals grew within a few days.
Data Collection and Structure Refinement
Crystals were cryo-protected using the well solution supplemented with additional ethylene glycol and were flash frozen in liquid nitrogen. Data were collected at Diamond beamline I24 on a Pilatus3 6M detector at a wavelength of 0.96864 Å. Indexing and integration was carried out using XDS (31) and scaling was performed with SCALA (32). Initial phases were calculated by molecular replacement with PHASER (33) using a model of HLA/β2M peptide complex (PDB:5T6X). Initial models were built by ARP/wARP (34) followed by manual building in COOT (35). Refinement was carried out in REFMAC5 (36). Thermal motions were analyzed using TLSMD (37) and hydrogen atoms were included in late refinement cycles. Data collection and refinement statistics can be found in Table S7. The model and structure factors have been deposited with PDB accession code: 6EI2 (see Preliminary_Validation_report.pdf).
HLA-I sequence and structure analysis
For all 54 alleles considered in this work, the sequence of the peptide binding domains were retrieved from IMGT database (38) and aligned with MUSCLE (39). To investigate the molecular determinants of C-terminal extensions, we selected all amino acids surrounding the F pocket (75-81, 84, 114-118, 142-147, following residue numbering in HLA-I structures). The eight positions displaying the largest changes in amino acid frequencies (Jensen-Shannon divergence larger than 0.1) are shown in Figure 4C. Using 20-dimensional encoding for each amino acid, we trained a logistic regression model based on glmnet package in R (40). Four-fold cross-validation was performed and repeated 10 times for different random seeds to compute AUCs.
Available HLA-I crystal structures were then collected for alleles considered in this work. In total, 20 alleles among those with available HLA peptidomics data had experimental X-ray crystal structures. For each allele, a reference structure was selected by prioritizing complexes with 9-mer ligands and highest resolution (see Table S4). In each structure, the distance between the C-alpha of residues 80 and 143 was computed (Figure 4D and Table S4). The AUC of a predictor using as input feature this distance was 0.92, which is comparable to the AUC of the predictor based on HLA-I sequences when considering only the 20 HLA-I alleles with available crystal structures (AUC=0.93, three-fold cross-validation).
HLA-I ligand predictors explicitly incorporating C-terminal extensions
For each allele displaying C-terminal extensions, we retrained our predictor based on HLA peptidomics data modeling C-terminal extensions as multiple motifs. In practice, all 10-mer ligands predicted to follow the C-terminal extension mode of each allele were treated separately and a PWM (MC) was built with them (20). All other 10-mer ligands (non-C-terminal extensions) were used to build another PWM describing the bulge model (Mb). For alleles with C-terminal extensions, the score of a 10-mer peptide X=(X1, … X10) with this predictor is then computed as: where wC, respectively wb, stands for the fraction of ligands predicted to follow C-terminal extensions, respectively bulges, and PMS stands for the background distribution of amino acids in HLA peptidomics data (see (20)). For alleles without C-terminal extensions, the predictions did not change compared to the previous version of our predictor (20).
Benchmarking of the algorithm was done using HLA-A03:01 and/or HLA-A68:01 positive samples, since these are the two HLA-I alleles displaying C-terminal extensions in multiple samples in our collection of HLA peptidomics studies. Careful cross-validation was performed, where for each sample used as testing set, the predictor was trained only on the data from the other samples. To this end, we further excluded from our analysis samples where some other HLA-I alleles did not appear in other samples (i.e., Fibroblast and RA957). For each sample, 4-fold excess of random peptides from the human proteome were added as negatives and all peptides were ranked with our new predictor, using for each peptide the highest score across the different HLA-I alleles present in the sample. The fraction of positives found in the top 20% of the predictions as well as the AUC were then computed (Figure 5A and Figure S6). The performance of the predictor explicitly modeling C-terminal extensions (MixMHCpred1.1) was compared with the former version of MixMHCpred (v1.0, based on a single PWM for each allele) (20), NetMHC4.0 (24) and NetMHCpan3.0 (23).
Human PBMC, ELISpot assays and Multimer analysis
Healthy volunteers donated circulating leukocytes according to the standards of the Blood Transfusion Center in Epalinges, Switzerland (Service Vaudois de Transfusion Sanguine). Samples from patients were obtained under written informed consent following study protocol approval by the Human Research Ethics Committee of the Canton de Vaud (Switzerland).
PBMCs were peptide stimulated in vitro for 12 days in vitro in presence of 100 U/ml IL-2. Subsequently, the Elispot was performed using the ELISpotPRO kit for Human IFN-γ from MABTECH (3420-2APT-10), following the standard supplier instructions. 100’000 cells per well were re-challenged with the peptide for 16h. The spots were analysed by the iSpot Robot ELISpot reader (AutoImmun Diagnostika GMBH).
Monomeric HLA-A03:01/TVRSHCVSKI complexes were prepared by refolding procedures using heavy chain HLA-A03:01 and the light chain β2M, as described before. The heavy chain contained added BSP (BirA enzyme Substrate Peptide). The enzymatic biotinylation of HLA-A3:01-BSP was performed over night at 25°C with ATP, biotin and the biotin ligase Bir A. Multimer complexes were prepared by mixing biotinylated HLA-A03:01/TVRSHCVSKI monomers with PE-labeled streptavidin (Invitrogen).
CD8 T cells of the CMV 10-mer peptide stimulated PBMC after 12 days in vitro stimulation were stained with a PE labeled HLA-A03:01/TVRSHCVSKI multimer and co-stained with anti-CD8 antibody (BC A94683) and DAPI for dead cell exclusion. Multimer+ CD8+ T cells were analysed at the BD ARIA III instrument equipped with the FACS Diva software. The analysis was performed with the FlowJo software (FLOWJO.LLC). The negative control consists of an EBV-derived peptides (RVRAYFYSKV)/HLA-A03:01 tetramer and a PBMC sample from another HLA-A03:01 and EBV seropositive healthy donor, for which we did not observe any recognition by CD8 T cells.
Code availability
The code for C- and N-terminal extensions (MHCpExt) and the new version of the predictor (MixMHCpred1.1) are available at https://github.com/GfellerLab.
Contributions
D.G. designed the study, developed the algorithms, analyzed the data and wrote the manuscript. P.G., P.B., N.M. and J.S. performed the experiments. S.P and P.F performed the X-ray crystallography. D.E.S, and G.C. contributed with reagents, M.B.-S., and D.E.S provided feedback and manuscript revision.
Acknowledgment
We thank Julien Racle and Santiago Carmona for a critical reading of the manuscript. DG acknowledges the financial support of CADMOS. We thank Anne Wilson and her collaborators of the flow cytometry facility, and the members of our laboratories for their support. Peptides were synthesized at the Protein and Peptide Chemistry Facility. The authors are grateful for support received by the SGC, a registered charity (number 1097737) that receives funds from AbbVie, Bayer Pharma AG, Boehringer Ingelheim, Canada Foundation for Innovation, Eshelman Institute for Innovation, Genome Canada, Innovative Medicines Initiative (EU/EFPIA) (ULTRA-DD Grant 115766), Janssen, Merck & Co., Novartis Pharma AG, Ontario Ministry of Economic Development and Innovation, Pfizer, S~ao Paulo Research Foundation-FAPESP, Takeda, and the Wellcome Trust (092809/Z/10/Z). P.F., and S.P. are supported by a Wellcome Career Development Fellowship (095751/Z/11/Z). We also thank Diamond Light Source for access to beamline I24 (proposal number mx15433), which contributed to the results presented here. Computations were performed at the Vital-IT (http://www.vital-it.ch) Center for high-performance computing of the Swiss Institute of Bioinformatics.
Footnotes
↵* DG should be considered both as first and last author.