Predicting Peptide-MHC Binding Affinities with Imputed Training Data

Predicting the binding affinity between MHC proteins and their peptide ligands is a key problem in computational immunology. State of the art performance is currently achieved by the allele-specific predictor NetMHC and the pan-allele predictor NetMHCpan, both of which are ensembles of shallow neural networks. We explore an intermediate between allele-specific and pan-allele prediction: training allele-specific predictors with synthetic samples generated by imputation of the peptide-MHC affinity matrix. We find that the imputation strategy is useful on alleles with very little training data. We have implemented our predictor as an open-source software package called MHCflurry and show that MHCflurry achieves competitive performance to NetMHC and NetMHCpan.


Introduction
In most vertebrates, cytotoxic T-cells enforce multi-cellular order by killing infected or cancerous cells. Each organism possesses a poly-clonal army of T-cells which collectively are able to distinguish unhealthy cells from healthy ones. This amazing feat is achieved through the winnowing and expansion of clonal T-cell populations possessing highly specific T-cell receptors (TCRs) (1). Each distinct TCR recognizes a small number of similar peptides bound to an MHC molecule on the surface of a cell (2). Though there are many steps in "antigen processing" (3), it has become apparent that MHC binding is the most restrictive step. Peptide-MHC affinity prediction is the well-studied problem of predicting the binding strength of a given peptide and MHC pair (4). Early approaches focused on "sequence motifs"(5), followed by regularized linear models, linear models with interaction terms such as SMM with pairwise features (6), and more recently the NetMHC family of predictors, a collection of related models based on ensembles of neural networks. Two of these predictors, NetMHC (7) and NetMHCpan (8), have emerged as the methods of choice across multiple fields of study within immunology, including virology (9), tumor immunology (10), and autoimmunity (11).
NetMHC is an allele-specific method which trains a separate predictor for each allele's binding dataset, whereas NetMHCpan is a pan-allele method whose inputs are vector encodings of both a peptide and a subsequence of a particular MHC molecule. The conventional wisdom is that NetMHC performs better on alleles with many assayed ligands whereas NetMHCpan is superior for less well-characterized alleles (12).
In this paper we explore the space between allele-specific and pan-allele prediction by imputing the unobserved values of peptide-MHC affinities for which we have no measurements and using these imputed values for pre-training of allele-specific binding predictors.

Data and evaluation metrics
Two datasets were used from a recent paper studying the relationship between training data and pMHC predictor accuracy (13). The training dataset (BD2009) contained entries from IEDB (14) up to 2009 and the test dataset (BLIND) contained IEDB entries from between 2010 and 2013 which did not overlap with BD2009 ( Throughout this paper we will evaluate a pMHC binding predictor using three different metrics: • AUC: Area under the ROC curve. Estimates the probability that a "strong binder" peptide (affinity ≤ 500nM) will be given a stronger predicted affinity than one whose ground truth affinity is > 500nM.
• F 1 score: Measures trade-off between sensitivity and specificity for predicting "strong binders" with affinities ≤ 500nM.
• Kendall's τ : Rank correlation across the full spectrum of binding affinities.

Comparison of imputation algorithms as predictors
A dataset of peptide-MHC affinities for n peptides and a alleles may be thought of as a n × a matrix where peptide/allele pairs without measurements are missing values. The task of predicting values at these positions is known as matrix completion or imputation (depending on the community and data source). We investigated the performance of several imputation algorithms as a standalone solution to the peptide-MHC affinity prediction problem. The algorithms considered were: • meanFill: Replace each missing pMHC binding affinity with the mean affinity for that allele. This is a very simple imputation method which serves as a baseline against which other methods can be compared.
• knnImpute (15): Each missing entry X ij is imputed using the values in the k closest columns with observation in row i. Similarity between alleles is computed as e −d 2 st , where d st is the mean squared difference between observed entries of alleles s and t.
• softImpute (16): A singular value thresholding method which iteratively estimates a lowrank matrix completion without forcing the pre-specification of a particular solution rank. Instead, the softImpute method is parameterized by a shrinkage value λ that is subtracted from each singular value.
• MICE (17): Average multiple imputations generated using Gibbs sampling from the joint distribution of columns.
We evaluated the performance of these methods using three-fold cross validation on BD2009, only considering peptides which occurred in at least three alleles and excluding alleles with less than five measurements (

Neural network architecture
Each MHCflurry predictor is a feed-forward neural network containing (1) an embedding layer which transforms amino acids to learned vector representations, (2) a single hidden layer with tanh nonlinearity, (3) a sigmoidal scalar output. This network is implemented using Keras (19).
Three-fold cross validation on the training set was used to select the hyper-parameters. The best model had 32 output dimensions for the amino acid vector embedding, a hidden layer size of 64, a dropout rate of 50%, and 250 training epochs. These hyper-parameters achieved reasonable performance across alleles, but it's likely that performance could be further improved by setting the hyper-parameters separately for each allele.

Data encoding
Like the NetMHC family of predictors (20), MHCflurry uses fixed length 9mer inputs which requires peptides of other lengths to be transformed into multiple 9mers. Shorter peptides are mapped onto 9mer query peptides by introducing a sentinel "X" at every possible position, whereas longer peptides are cut into multiple 9mers by removing consecutive stretches of residues. The predicted affinity for a non-9mer peptide is the geometric mean of the predictions for the generated 9mers. When n training samples derive from a single non-9mer sequence then their weights are adjusted to 1/n.

Training
For each allele, we train a MHCflurry model using the measured peptide affinities for the allele and the values imputed by MICE based on other alleles in the training set. As training progresses, we place quadratically decreasing weight on the imputed values.
A randomly generated peptide is unlikely to bind a given MHC strongly, but a data acquisition bias toward strong binders in the training set can lead models to assign a high affinity to most peptides. As a form of regularization, we augment the training set at each epoch to include random peptides with affinity set to be maximally weak. The number of random negative peptides is 20% of the training size (without imputation). At each training epoch, a fresh set of random peptides is generated.

Results
We evaluated the effect of imputation by drawing subsets of the BD2009 training set for the wellcharacterized allele HLA-A*02:01. Predictors were trained on a range of simulated training set sizes and tested on the remaining data ( Figure 1). We find that imputation gives a modest improvement up to approximately 100 training samples. With more training data there is no benefit to imputation.

Discussion
Imputing training data shows promise in cross-validation as a way to improve performance on alleles with few observations, but only seems to help for very small training sizes (≤ 100). Unfortunately, none of the alleles included in the BLIND dataset had fewer than 100 samples in BD2009, and only one had fewer than 200. Thus, additional work is required to assess the accuracy of MHCflurry and other predictors on alleles with scarce training data. Additionally, we need to further investigate the interaction between imputation parameters, the decay schedule for the weights of imputed samples, and stopping criteria for training individual allele-specific predictors.