Comparing a language model and a physics-based approach to modify MHC Class-I immune-visibility for the design of vaccines and therapeutics

Protein therapeutics already have an arsenal of applications that include disrupting protein interactions, acting as potent vaccines, and replacing genetically deficient proteins. Therapeutics must avoid triggering unwanted immune-responses towards the therapeutic protein or viral vector proteins. In contrast, vaccines must support a robust immune-reaction targeting a broad range of pathogen variants. Therefore, computational methods modifying proteins’ immunogenicity without disrupting function are needed. While many components of the immune-system can be involved in a reaction, we focus on Cytotoxic T-lymphocytess (CTLs). These target short peptides presented via the MHC Class I (MHC-I) pathway. To explore the limits of modifying the visibility of those peptides to CTL within the distribution of naturally occurring sequences, we developed a novel machine learning technique, CAPE-XVAE. It combines a language model with reinforcement learning to modify a protein’s immune-visibility. Our results show that CAPE-XVAE effectively modifies the visibility of the HIV Nef protein to CTL. We contrast CAPE-XVAE to CAPE-Packer, a physics-based method we also developed. Compared to CAPE-Packer, the machine learning approach suggests sequences that draw upon local sequence similarities in the training set. This is beneficial for vaccine development, where the synthetic sequence should be representative of the real viral population. Additionally, the language model approach holds promise for preserving both known and unknown functional constraints, which are essential for the immune-modulation of therapeutic proteins. In contrast, CAPE-Packer, emphasizes preserving the protein’s overall fold and can reach greater extremes of immune-visibility, but falls short of capturing the sequence diversity of viral variants available to learn from. Source code https://github.com/hcgasser/CAPE (Tag: v1.1)


Introduction
The recent SARS-CoV-2 pandemic has shown that RNA vaccines are safe, quick to develop and can reliably induce the production of viral proteins by patients.This success has spurred development of more vaccines against other pathogens [1].Additionally, similar technologies could also be used to stimulate production of other proteins exhibiting a broad range of functions.As of May 4, 2023, the pharmaceutical company Moderna is advancing mRNA therapeutics targeting conditions including autoimmune hepatitis, cancer, and genetic disorders like propionic acidemia and cystic fibrosis [1].Although an immune response to the SARS-CoV-2 spike protein vaccine was intended, an unintended immune response to a protein designed to provide a beneficial function could compromise therapy effectiveness or, in extreme cases, trigger autoimmune reactions.Hence, cost-efficient computational methods that design proteins to retain their beneficial functions while eliciting the desired immune reaction, are crucial to the success of these therapeutics.
In particular, deimmunization of proteins has garnered attention [2,3,4].So far, deimmunization research has mainly been directed towards immune-reactions by Antibodys (Abs).One approach is to remove Ab surface epitopes.For example, Bootwala et al. demonstrated a significant reduction in Ab binding using computational protein resurfacing [5].Another strategy, utilized by Yachnin et al. [6], aims to decrease the presentation of MHC Class II (MHC-II) epitopes crucial for B-cell activation.
mRNA therapeutics and vaccines translate into proteins within cells, so flexible computational methods are needed to modify their immune visibility, particularly to the MHC-I pathway -the cell's internal antigen presentation system.Also, in the context of gene therapy, effective MHC-I deimmunizers are essential, as demonstrated by anti-transgene immunity [7,8] and examples where vector capsid proteins flag hepatocytes for CTL destruction [9,10] as well as examples of unwanted immune responses to the bacterial-derived CRISPR-Cas9 gene editing machinery [11].Conversely, vaccines -particularly those targeting rapidly mutating viruses with highly diverse quasi-species like HIV [12] -need to include sequences that expose the immune-system to a wide array of MHC-I immune-visible peptides, mirroring the diversity found in the viral population.This is underscored by Borrow et al. [13] who found evidence for rapid intra-patient evolution of HIV to evade CTL recognition leading to the conclusion that a key aim of HIV-vaccines should be to elicit strong CTL responses against multiple codominant viral epitopes.Nevertheless, immune-reaction to an antigen is a complex process and many aspects are not fully understood yet.Therefore, methods need to be flexible to incorporate the latest developments.
This article introduces the first stage of developing the Controlled Amplitude of Present Epitopes (CAPE) framework, which is focused on managing MHC-I visibility.In this article we make the following novel contributions.(1) We develop CAPE-XVAE, a flexible and novel machine learning (ML)-based approach that combines the abilities of Variational Autoencoders (VAEs) and reinforcement learning (RL), to generate MHC-I immune-visibility modified protein sequences.Its flexibility will allow extension to other components of the immune system in the future.(2) We introduce CAPE-Packer, a modification of an existing physics-based approach from Yachnin et al. [6], enabling it to reduce or increase MHC-I immune-visibility.And finally, (3) we apply these two novel methods to the HIV Nef protein, comparing their ability to modify immune-visibility while maintaining stability and similarity to naturally occurring sequences.We find that CAPE-XVAE effectively modifies the immune-visibility of generated Nef proteins, while maintaining sequence similarity to natural proteins.In contrast CAPE-Packer achieves even greater immune-visibility modification, albeit it generates proteins markedly different from naturally occurring ones in sequence space.
Box 1: Terms and concepts • Support sequences and natural sequences: Often applying an analysis to all sequences in the dataset would be prohibitively computationally expensive.To still be able to describe the behavior of all sequences in the dataset, we once randomly selected 300 sequences from the whole dataset and apply the analysis just to those.We refer to these 300 sequences as support sequences, while we use natural sequences to refer to all sequences in the whole dataset.
• Closest support sequence: For a given (query) sequence, closest support sequence refers to the support (subject) sequence with the highest pairwise TMscore.If the query sequence is a support sequences, the query sequence is excluded from the potential subject sequences.
Box 2: More terms and concepts • MHC-I immune-visibility (also just 'immune-visibility' or 'visibility'): The human immune-system is comprised of many components.Our study focuses on the MHC-I pathway which is used by CTL to detect destruction targets (see 2.1.1).
We define the visibility of a protein to be the number of different 8, 9 and 10-mers within a protein sequence, that are predicted to be presented on the cell's surface.Overlapping k-mers are counted separately.Also, if a k-mer is presented by several MHC-I alleles, then each one is independently increasing the visibility number by one.For this study we selected the following common MHC-I alleles to tailor it towards a hypothetical patient: HLA-A*02:01, HLA-A*24:02, HLA-B*07:02, HLA-B*39:01, HLA-C*07:01, HLA-C*16:01.Hence, a sequence's visibility is the same as the sum of peptide rewards (see Subsection 2.3) under the increased visibility profile (see inside box below) over all 8 to 10 amino acid long peptides within the sequence.
Importantly, while being visible/presented on the cell's surface is a necessary condition for a peptide to cause an immune-reaction by CTLs, this is by no means a sufficient one (see 2.1.1).
• Visibility profiles -baseline/reduced /increased /inc-nat: We are trying to modify how the protein is visible to the immune-system.We use the word baseline to refer to generated sequences that were produced by a model without any immune-visibility modification.For CAPE-XVAE, this means just using the VAE decoder without any modification step.For CAPE-Packer this means using an energy term that does not consider the visibility of the designed sequence.In contrast, we use reduced/increased to refer to synthetic sequences with (lower/higher) modified immune-visibility.In inc-nat (increased-natural ) we modified the protein's visibility profile by maximizing naturally occurring visible peptides and minimizing artificial ones.(see also Subsection 2.3) • Highly/lowly immune-visible sequences: All natural or designed sequences that have an immune-visibility of equal or above/below the 99.5%/0.5% percentile of immune-visibility in the natural sequences.

Material and methods
This section introduces the methods and dataset employed in this paper, with essential terms summarized in Boxes 1 and 2. The first subsection delves into the MHC-I pathway.Following that, we present an approximation method designed to accelerate the prediction of peptides presented through this pathway.We then introduce the dataset used in our analysis before explaining how each peptide's reward is calculated.Subsequently, we shed light on the inner workings of our ML approach, CAPE-XVAE.This is followed by a description of our physics-based approach, CAPE-Packer.Finally, we present a comprehensive overview of our evaluation process.

The MHC-I pathway and predicting its presented peptides
The MHC-I pathway is key in immune response, presenting cellular peptides to T cells.Detailed exploration of this pathway and its broader immune-system integration are reviewed elsewhere, beyond the scope of this paper.Nevertheless, we provide a concise overview and refer the interested reader to Murphy and Weaver [14] for more in-depth information before introducing how we assessed peptide presentation via this pathway.

The MHC-I pathway
Within most human cells ribosomes constantly produce proteins, while proteasomes constantly degrade them into small fragments, known as peptides.Some of these get transferred into the cell's endoplasmic reticulum, where some bind to MHC-I proteins (typically 8 to 10 amino acids (AAs) long [14,), forming pMHC complexes.These are transported and displayed on the cell's surface for surveillance by CTL.CTL use a T-cell receptor (TCR) to bind to these visible/presented pMHC.Due to random genetic recombination, the TCRs within a person's population of CTL vary -leading to differences in the specific pMHC they bind to.To prevent auto-immunity, TCR undergo selection in the thymus to be tolerant of the person's own naturally occurring peptides while still binding specifically to foreign peptides.These foreign peptides can originate from viral proteins, cancer proteins, or other intra-cellular pathogens.Additionally, CTL develop from naive CD8 T cells via priming.These T cells circulate the bloodstream visiting lymph nodes, where they transiently bind to antigen-presenting cells (typically dendritic cells).If a naive CD8 T cell encounters its cognate antigenic peptide presented on an antigen-presenting cell and receives costimulation [14, Fig. 9.22], the T cell proliferates and its progeny differentiate into CTL.If a CTL detects a foreign peptide on a cell's surface, it initializes the destruction of this cell.[14, Chapter 9] MHC-I alleles exhibit variation in peptide specificity.To ensure the presentation of a broad range of peptides on the cell surface and make them visible to CTL, evolution has led to the expression of a diverse set of MHC-I proteins across the human population.In fact, three main genes (HLA-A, HLA-B, and HLA-C) are present in humans, allowing each individual to express up to six different MHC-I proteins [14,Chapter 6].
From the above, it becomes evident that visibility to CTL of a peptide is a necessary but not a sufficient condition for an immune-reaction.For example, presented self-peptides, as well as foreign peptides for which no matching TCR exists, will not trigger an immune reaction.Additionally, an immune-reaction requires prior activation of the CTL.Therefore, visibility and immune-reaction are not the same.

Assessing antigen presentation via MHC Class I
For the estimation of MHC-I immune-visibility, our study employs netMHCpan-4.1 [15].This tool represents the latest iteration in a series of MHC-I prediction models dating back to the early 2000s.Based on a given peptide sequence and a specific MHC-I allele, the model produces a score.Higher values indicate greater confidence in the peptide's presentation by the MHC-I protein.The score's reversed rank relative to random peptides, known as EL-rank, serves us to assess MHC-I presentation of the peptide.Peptides with an EL-rank of below 2% are considered "weak binding" and those below 0.5% are deemed "strong binding"1 .
In our exploration of methods to integrate the immune-visibility constraint into the protein design process, we encountered challenges with an approach suggested by Yachnin et al. [6].This method involves numerous peptide presentation prediction requests, rendering direct use of netMHCpan in CAPE-Packer impractical (see Subsection 2.5).To reduce the execution time of the immune modification process in CAPE-Packer and CAPE-XVAE (while still relying directly on netMHCpan during evaluation), we approximated the behavior of netMHCpan using a position weight matrix (PWM) based approach.
To construct the PWMs, we initially generated a uniform sample of three million random peptides of lengths 8, 9, and 10 AAs.Subsequently, we employed netMHCpan to calculate the EL-rank for all considered MHC-I alleles, selecting peptides with an EL-rank below 2%.For each position within these peptides, we computed the distribution of AAs, obtaining a PWM for each MHC-I allele and peptide length.Next, we scored all the random peptides using these PWMs.This involved summing up the logarithmic probabilities assigned by the PWM to each AAs at their corresponding positions within the peptide.The scores were then used to establish rankings for peptides, equivalent to the EL-rank.

Protein dataset
We applied our protein design methods to the HIV Nef protein.HIV's urgent need for effective vaccines as well as Nef's relatively short sequence length, slightly above 200 AAs, and the availability of approximately 25 thousand different sequence examples make it a suitable test case.On 6 June 2023, we retrieved unaligned HIV Nef sequences from the Los Alamos National Laboratory HIV Sequence Database2 .To avoid using dysfunctional sequences or sequences with sequencing errors, we only considered sequences between 200 and 220 AAs length.Many sequences occur multiple times and we assigned proportionally higher weights during CAPE-XVAE training to those.While 'X'-containing sequences (unknown AA, absent in complete sequences) were retained for training, they were excluded from the evaluation process.We randomly partitioned our dataset into the following subsets:

Peptide reward
As the protein is designed, newly generated peptides receive a reward depending on the targeted immune-visibility profile (see Box2).It is also influenced by how many of the considered MHC-I alleles present the peptide via the MHC-I pathway (M ) and whether the peptide is present in the natural sequences or not (see Box1).The reward for a given peptide is computed as the product of the corresponding value in

CAPE-XVAE
This model is a VAE (see 2.4.1).During the VAE-training stage (see 2.4.2) its encoder is trained to map a protein sequence into a low dimensional latent space, based on which its auto-regressive (AR) decoder aims to reconstruct the protein sequence.Afterwards, the immune-visibility profile modification procedure (see 2.4.3)employs RL to modify the protein's immune-visibility by updating its latent space representation in the direction of the desired MHC-I immunevisibility profile (see Box2).The improved latent space representation can then be converted to a sequence by the decoder.Similar to Dauparas et al. [16] we train the decoder to generate sequences in any ordering -allowing certain parts of the sequence to be fixed upfront.This as well as the ability to adjust the modification procedure to a wide range of immune-visibility profiles, makes CAPE-XVAE a particularly flexible approach for protein design.
After presenting CAPE-XVAE's architecture, we delve into its VAE training process.Finally, the immune-visibility modification procedure is outlined.CAPE-XVAE is based on the VAE architecture [17] and, therefore, comprises the three main components listed below: an encoder, a latent space, and a decoder (see Figure 1a).The encoder and the decoder were realized as transformers [18].More details on the various sub-components can be found in the Supplement S.1.2.
• Encoder: CAPE-XVAE's encoder consists of an Embedding Layer, an Encoder Stack, a Cross Attention Block and two linear transformations embedding the stack's output into latent space.Along with a learnt positional encoding, each AA in the input protein sequence x is first encoded as a d dimensional vector using a learnt embedding layer.The Encoder Stack (comprising multiple Self Attention Blocks) processes these into a same length sequence of contextually encoded vectors.
To create an information bottleneck, first a Cross Attention Block converts this sequence into set of L < N (currently L = 1) vectors (each of dimension d).Subsequently, two linear transformations further compress each latent vector to l < d dimensions.The first transformation (µ) generates a mean vector for the latent's conditional distribution p(z|x), while the second one (σ) produces its log variance vector.
• Latent Space: To generate sequences, first a vector z needs to be sampled in latent space.Sampling from the latent's conditional distribution p(z|x) is appropriate when z should represent a sequence similar to x (e.g. during training).In contrast, when we are only interested in generating any sequence similar to the training set, sampling from the unconditional distribution p(z) is possible.For the loss function, we assume p(z) to be a standard normal distribution.However, we acknowledge that this assumption may not accurately represent the distribution of natural sequences in latent space.So, when generating new baseline (see Box2) examples, we calculate the mean and covariance matrix of the latent vectors of the training set examples and sample from a Normal distribution with those parameters.
• Decoder: As input, it receives the latent vector, and the AAs in the sequences sampled so far (or the whole sequence x during training).It comprises an embedding layer for the AA sequence, a transformer decoder stack (see Supplement S.1.2) and a head for AA prediction.We embed each sequence element with a learnt token embedding, a learnt positional embedding and a learnt next position embedding.The next-position embedding allows the prediction of protein sequences in arbitrary order (not only in N to C direction).Then, the decoder stack consists of multiple cross-attention blocks.These blocks can attend to the latent vector z and already generated tokens but, importantly for training, they cannot attend to not yet generated ones.The sampling is initialized with a start/stop token in the AA sequence.For training this is followed by the ground truth tokens, while during inference the predicted tokens are used (dashed line Figure 1a).

VAE-training stage
Training a VAE in conjunction with an AR decoder is challenging, due to "KL-vanishing" (Supplement S.1).To address this, we adopted the approach proposed by Li et al. [19] (see Supplement S.1.1).So, our training process (see Figure 1b) follows the steps below: 1. Auto encoder (AE) training: We trained CAPE-XVAE like an AE for 500 epochs (KL divergence weight β in VAE loss function is zero).This focuses training solely on reconstructing the input sequence resulting in very meaningful latent representations.

2.
Reset the decoder: Afterwards, the decoder is reset to its initialization.This counters the risk of the decoder learning faster than the encoder.

Transition to VAE training:
Over the next 250 epochs the KL-divergence weight in the loss function is increased linearly until it reaches one.

VAE training:
Then we continue training the model for an additional 750 epochs.This phase optimizes the full VAE objective with free bits.We fixed the free bits parameter λ as one.
To further increase the decoder's reliance on the latent variable, we train the decoder to sample the protein sequence in random permutations of the ordering -not just N-C direction.This means that, while the normal next token prediction task is: here are amino acids 1, 2, 3 and 4 -please give me amino acid 5, we train the model on the following task: here are amino acids 22, 11, 59 and 37 -please give me amino acid 17.Further details about the hyper-parameter search, the eventually obtained hyper-parameters and the training procedure can be found in the Supplement S.1.4.

Immune-visibility profile modification procedure
We modify the immune-visibility profile of sampled proteins by combining the capabilities of VAEs and RL techniques.VAEs excel at assigning reasonable protein sequences to probable points in a latent space.However, as seen in [20], directly sampling from this latent space typically leads to only marginal reductions in immune-visibility.Therefore, we would like to relay immune-visibility information through the decoder to the latent space.Since back-propagation through the decoder's discrete output is not directly feasible, we formulate our task as a RL problem (see Supplement S.1.3).Unlike in a typical RL setting, we do not modify the parameters of CAPE-XVAE.Instead, we adjust the latent vector to better align with our desired immune-visibility profile (see Box2).Given that, in theory, all probable vectors in latent space should be mapped to plausible proteins by the decoder, we only need to ensure that the modification does not shift the latent vector to an improbable region of the latent space.
We formulate the RL problem as follows: CAPE-XVAE's decoder represents the agent.The state is the protein sequence produced so far.CAPE-XVAE's next token prediction represents the policy.We also define an immune-visibility dependent reward function (see 'Episode' below).The episode ends when the decoder generates a stop token.Overall, the modification process follows the following steps (Figure 1c): 1. Initialization: First a random natural sequence x (see Box1) is sampled from the dataset and a conditional latent distribution p(z|x) = N (z, exp(σ)) obtained from the encoder.From this an initial latent vector z (0) ∼ N (z, exp(σ)) is sampled.To ensure that the modified latent variable stays in a likely area of the latent space, we force it to be within a certain range m • σ around z (0) .To do so, we use an auxiliary latent vector w (j) in an auxiliary latent space W , which is mapped into the sphere with radius m • σ in latent space surrounding z (0) .We use Manhattan distance and m is currently set to 2.
2. Episode: At the beginning of each episode j the current auxiliary w (j) is converted to a latent variable z (j) , which remains constant throughout the episode.The decoder uses this to sample 64 AA sequences in parallel in N-C direction.The temperature used for this is linearly decreased from 1.2 to 0.1 over the modification phase.The gradient of the log-probability of the sampled AA with respect to the auxiliary latent variables is calculated and saved.Also, for each added AA the reward function is called and its result saved as this step's reward.In addition to the current state, e.g.MGGKWSKSSSEAKWPAVRERMRRTEPADG), and the current action (e.g.V) the reward function receives the selected MHC-I alleles as input.As most MHC-I epitopes have length 8, 9 or 10 AAs, only the terminal 8-, 9-and 10-mers of the combined sequence (e.g.MRRTEPADGV, RRTEPADGV, RTEPADGV) are examined.The reward at the corresponding step in the trajectory is then the sum of the peptide rewards (see Subsection 2.3) for all the terminal k-mers.This penalizes or rewards presented peptides as soon as they are fully sampled -but not at all before.
3. Updating auxiliary latent variables: At the end of each episode j, the auxiliary latent variable w (j) is updated to w (j+1) using the REINFORCE algorithm (see Supplement S.1.3).As baseline, we use the average across the batch of the future rewards for each step.Then the algorithm goes back to step 2 until 100 updating steps have been completed.

CAPE-Packer
The underlying physics-based idea is, that proteins will fold into their energy minimizing state.So, CAPE-Packer searches for the energy minimizing sequence for a fixed 3D structure.By considering immune-visibility in the energy function (score function), the search process can be steered towards a desired immune-visibility profile.The implementation of CAPE-Packer builds upon the framework developed by Yachnin et al. [6] utilizing the Rosetta Packer [21,22] (see 2.5.1) for protein design.Yachnin et al. modify Rosetta's score function to discourage the Rosetta Packer protein design algorithm from incorporating MHC-II epitopes.

Rosetta software for macro-molecular modeling and Rosetta Packer
Rosetta is a versatile software suite for modeling and analyzing protein structures.Next to others, it comprises algorithms dealing with structure prediction, docking and membrane proteins 3 .It has been developed for more than two decades and is extensively used in laboratories worldwide [23].At its core lies the score (energy) function.This is the weighted sum of terms describing van der Waals energies, hydrogen bonds, electrostatics, disulfide bonds, residue solvation, backbone torsion angles, sidechain rotamer energies, and an average unfolded state reference energy [23].Rosetta also features a key algorithm called the Packer, serving as a side-chain optimization tool.It identifies score function minimizing rotamers for a given sequence and fixed backbone structure (3D structure) [24].Additionally, the Packer can be used for sequence design by considering different side-chain AA identities.In fact, this is its standard behavior and the Packer designs at every position, allowing for all 20 standard amino acid rotamers (position dependent subsets of eligible rotamers can be specified).The space of possible rotamer combinations expands rapidly, making enumeration infeasible.Therefore, Monte Carlo simulation is utilized.Starting from a random initialization, random changes are made to the rotamers.If the change decreases the score, it is always accepted.Conversely, if it increases the score, acceptance is stochastic with a probability that decreases with the energy [24].

Rosetta Packer adaptations for immune-visibility modifications
Yachnin et al. [6] have suggested an additional term for the score function, that takes into account the number of peptides in the protein that get presented on the cell's surface.In their case, they used MHC-II.We adapted their tool for MHC-I.The method only considers 9-mer peptides and out of the box it is only implemented using a PWM approach as well as a lookup database.While they mention that they have experimented with a tensorflow based presentation predictor, no implementation of this is known to us.In fact Mulligan [24] state that "In order to be compatible with the Packer, an energy term must either be residue-level pairwise-decomposable, or must otherwise be very fast to compute and update as rotamer substitutions are considered".Therefore, we think the reason for the lack of a more advanced predictor backbone is, that the Rosetta Packer requires many, many score function evaluations -and, hence, peptide presentation evaluations.
For a short 203 amino acid long Nef sequence, we counted roughly 42 million peptide evaluation requests.This must be multiplied by the number of considered major histocompatibility complex (MHC) alleles to arrive at the actual number of predictor calls.Using one of the modern and larger MHC-I presentation prediction state-of-the-art models -like MHCflurry 2.0 [25] or netMHCpan 4.1 [15] -is, therefore, very time intensive.
We implemented a new Rosetta class MHCEpitopePredictorClient that inherits from MHCEpitopePredictor and receives 9-mer evaluation requests from Yachnin et al.'s framework.Our client then forwards these peptides together with the MHC-I alleles and the profile to modify towards to a server that we also implemented.This server returns the peptide's reward (see Subsection 2.3).This number is multiplied by −1 and then integrated into the energy function used by the Rosetta Packer that ultimately designs the protein.
Being based on the Rosetta Packer, CAPE-Packer requires a 3D structure of the protein as input.AlphaFold generated 3D structures of the support sequences (see Box1) were used for this.We design at every AA position in the sequence, without any restrictions to admissible AAs.

Evaluation
We assessed the generated sequences based on three criteria: immune-visibility, similarity to natural sequences and stability.Each criterion is detailed below.It is noteworthy that, in the absence of experimentally determined 3D structures, we employed AlphaFold [26] to predict the necessary 3D structures as needed.

Visibility
To assess how MHC-I visible (see Box2) the proteins are to CTL, we used netMHCpan-4.1 [15].We used the same MHC-I alleles during the evaluation step as we had used for the modification process.The visibility number is calculated as outlined in Box2.

Similarity
We assess the similarity of the generated sequences to natural ones with regards to three criteria: local-, structural-and functional similarity.
• Local similarity is used to demonstrate the extent to which local sequence patterns in protein design are captured.
To assess whether there is a similarity in parts of the sequences to the natural sequences (see Box1), we checked short sub-sequences of length k (k-mers) in the sequence against all k-mers occurring in the natural sequences.The percentage that is also present in the natural sequences we define as the k-mer similarity.9-mer similarity is of particular interest since most presented peptides have 9 AA length, making it especially relevant for vaccine design, where we try to design sequences that more broadly capture epitopes across variants of a protein.To assess the overlap between k-mers we define the following metrics: (a) the k-mer recall of a query sequence (a generated or natural sequence) wrt a subject sequence (a natural sequence) is the percentage of k-mers in the subject sequence that are also present in the query sequence; (b) the k-mer precision of a query sequence wrt a subject sequence is the percentage of k-mers in the query sequence that are also present in the subject sequence; (c) for a specific k-mer, k-mer precision is defined as the percentage of the natural sequences that include this k-mer.
• Structural similarity is used to compare the overall fold of the predicted structure of the designed sequence to that of its closest support sequence (see Box1).The TMscore between the two structures was calculated using TMalign [27].The TMscore ranges between zero and one.A score of one signifies a perfect match while scores below 0.2 are associated with randomly chosen unrelated proteins.Scores higher than 0.5 assume generally the same fold in SCOP/CATH4 .
• Functional similarity is used to determine whether the designed protein is likely to preserve function.Function is assessed via the ability of a classifier to distinguish the generated from the support sequences.In a first step, we used the GO term predictor of Boadu, Cao, and Cheng [28] to predict the scores of 11,156 molecular function GO categories for each of our generated sequences as well as the support sequences.After normalization, we performed a PCA keeping the first 19 components which cover 99% of the observed variance.Based on this dataset we then trained a Gaussian Naive Bayes classifier (called GO based Sequence Pack Classifier ) to distinguish between 8 potential sequence origins: support, CAPE-XVAE-baseline/reduced /increased /inc-nat, CAPE-Packer-baseline/reduced /increased.As we do not want to perform prediction but simply want to assess the classifier's ability to distinguish between the classes, no holdout dataset was used.

Stability
To assess the stability of our generated protein sequences we select a subset of these and the support sequences and run a Molecular Dynamics (MD) simulation on this.We then calculate the maximum root mean squared distance (RMSD) (based only on CA-atoms) between the confirmations observed during the simulation and the original AlphaFold prediction.We use OpenMM to run the MD simulation for 2 microseconds (2 femtoseconds step-size) at a temperature of 25 degrees Celsius using the amber99sb force field and an implicit solvent model (amber99 obc).Positions were reported every 4 picoseconds.Due to the high computational requirements, we only performed this analysis on a sub-sample of sequences.

2D sequence space visualizations
We used t-SNE to visualize the distribution of sequences in two-dimensional space.First, we construct a 300-dimensional representation for each sequence.Each dimension represents the dissimilarity between the sequence and one of the support sequences (see Box1).We compute this dissimilarity based on the pairwise alignment score of the sequence with the support sequence.Gotoh global alignment algorithm [29] as implemented by the biopython [30] PairwiseAligner (using BLOSUM62, open gap score of -20 and extend gap score of -10) was used for this.The dissimilarity is then defined as the difference between the alignment score of the sequence with itself and the alignment score of the sequence with the support sequence.To further reduce the dimension of these vectors to only two, we use t-SNE with a perplexity of 30 and running for 500 iterations.We have observed similar results for a broad range of perplexity numbers.
3.1 Differential distribution in sequence space: CAPE-XVAE sequences distribute more naturally than CAPE-Packer ones We first explored the landscape of naturally occurring and CAPE-XVAE and CAPE-Packer generated sequences (Figure 2, see Subsection 2.7).Notably, the baseline sequences (see Box2) produced by CAPE-XVAE are dispersed across the natural sequence space in Figure 2ab, showing no aggregation in any particular region.However, The modification step removes them from several areas of the sequence space and moves some even outside the range occupied by natural sequences.This is not observed for CAPE-XVAE-inc-nat sequences.
Starkly, the sequences generated by CAPE-Packer all aggregate outside the natural background distribution of sequences (Figure 2bdf ).Despite appearing close in t-SNE plots (Figure 2a-f ), which is a cluster-preserving but not distancepreserving dimension reduction technique, the sequences outside the natural background distribution are highly dissimilar from each other.In fact, Figure 2h illustrates that the CAPE-Packer baseline sequences are markedly different from all other sets and even differ from each other, while support sequences are similar to themselves (Figure 2g).

CAPE-XVAE designs incorporate many natural peptides
In Figure 3 we compare the generated sequences to naturally occurring ones with respect to immune-visibility (Figure 3a, see Box2), local similarity (Figure 3b-e) and structural similarity (Figure 3f, see 2.6.2).These results indicate that CAPE-XVAE and CAPE-Packer are effective in altering MHC-I immune-visibility (Figure 3a, Table 3) while maintaining structural similarity.
In Figure 3a we see that CAPE-Packer can produce extremely immune-visibility modified sequences.CAPE-Packer is only cognizant of protein fold, without access to the natural grammar of protein function.It seems unlikely, that longer natural peptides are randomly sampled during Rosetta Packer's Monte Carlo simulation process (see Subsection 2.5), preventing it from incorporating those (Figure 3b) even when rewarded -like in the case of CAPE-Packer-inc-nat.This limits CAPE-Packer's usefulness in vaccine design.In fact, CAPE-Packer-inc-nat is effectively turned into CAPE-Packerreduced and will, therefore, be ignored from here on.With regards to CAPE-Packer-reduced, Yachnin et al. [6,Figure 2b] observed a slightly higher reduction in MHC-II visibility for de-immunizing human erythropoietin when using the same weighting of the immune-visibility term in the score function as us (see 2.5).With regards to structural similarity, Figure 3f shows that, despite their unnatural looking sequences, there are CAPE-Packer proteins with high TMscores.
In comparison, CAPE-XVAE sequences tend to be less immune-visibility modified (Figure 3a).The incorporation of artificial visible peptides enables CAPE-XVAE-increased to achieve higher visibility than CAPE-XVAE-inc-nat. Figure 3b demonstrates that nudging CAPE-XVAE-inc-nat to only incorporate natural visible peptides also seems to cause it to incorporate more of the natural k-mers in general.
The analysis of k-mer similarity for highly and lowly immune-visible (see Box2) CAPE-XVAE generated sequences reveals that, even within strongly modified sequences, many exhibit high k-mer similarity to natural sequences (Figure 3c-e).
In particular CAPE-XVAE-inc-nat designed sequences demonstrate high values.This conservative behavior of CAPE-XVAEinc-nat may be beneficial in vaccine design, which might benefit from representative sequences that sample the breadth of potential MHC-I epitopes (visible 8, 9 and 10-mers) observed in the pathogen.Even for generated sequences with reduced visibility, remaining close to natural proteins in sequence space may increase the chances that complex nuances of protein function are retained in the synthetic sequence.
The structural similarity of CAPE-XVAE designs is more dispersed than for CAPE-Packer designs (Figure 3f ).Very interestingly, CAPE-XVAE-inc-nat designs tend boast higher TMscores than CAPE-XVAE-baseline ones.Applying some form of RL inspired latent space search might, therefore, be beneficial in the protein design process of VAE approaches in general.Modifying immune-visibility comes at a cost: the more the sequences diverge, the more they lose local similarities to the training set, though their stability remains unaffected.Figure 4 demonstrates, that the further we move away from the data distribution, the less similar the generated sequences become at the peptide length pertinent to antigen presentation (9-mers).Stability does not seem to be affected as assessed using a sub-sample of RMSDs from computationally expensive MD simulations.Despite this trade-off, we find among CAPE-XVAE designs many that contain significant 9-mer similarity even with high visibility.In contrast, 9-mer similarity is not present for any of the CAPE-Packer generated sequences.

mean visibility baseline reduced increased inc-nat
For CAPE-XVAE, the proportion of high TMscore (above 0.8) proteins generated decreases with distance to the natural visibility range, while this seems not to be the case with CAPE-Packer.

Distinguishing between designed and natural sequences is difficult based on predicted function
We explored whether CAPE-XVAE and CAPE-Packer designs are likely to retain natural functionality (see Functional similarity 2.6.2).For each sequence we first predicted the scores for various molecular functions.This long vector representation for each sequence was reduced by principal components analysis.The first two components Figure 5 (explaining 68% of variance) visualize the challenge posed to distinguish many of the generated sequences from the support sequences.Certainly, many CAPE-XVAE generated sequences would be challenging to distinguish this way from natural sequences.
We then trained a GO based Sequence Pack Classifier (Subsection 2.6.2) using the first 19 principal components covering 99% of variance.This classifier also struggles to distinguish between the various sources as demonstrated in the confusion matrix (Table 4).CAPE-Packer sequences are easier to distinguish from natural sequences than CAPE-XVAE generated sequences where only 60% of support sequences are correctly identified as support sequences, while it considers 28% of them to be CAPE-XVAE-baseline sequences.It also classifies 28% of CAPE-XVAE-baseline as well as many CAPE-XVAE modified sequences as support sequences.In fact, a third of the CAPE-XVAE-inc-nat sequences are classified as natural.
It appears that while as a whole there are differences in predicted function between natural and generated sequences.More importantly, a considerable proportion of the proposed CAPE-XVAE sequences closely resemble natural ones in terms of predicted molecular function.In practical applications, these sequences may be more desirable to advance.

CAPE-XVAE-inc-nat designs tend to have high mean recall and precision
While the intricacies of vaccine design are best managed by vaccinologists with the capacity to experimentally test various strategies, CAPE-XVAE offers a tool to suggest chimeric sequences capable of exposing the immune-system to a representative set of epitopes, potentially enhancing population-wide protection.As CAPE-XVAE proposes similar sequences to natural ones, we were interested in understanding how diversely potential epitopes (k-mers predicted to be immune-visible) were sampled from the natural sequences.Figure 6 shows how the natural sequences compare to CAPE-XVAE-inc-nat designed sequences with respect to how well they recall naturally occurring visible peptides.First, we observe that even the sequences with the highest recalls in expectation include a third of the visible peptides of a randomly chosen natural sequence.Furthermore, we find that although the highest recall value can still be found in naturally occurring sequences, the set of designed ones has a significantly higher average mean recall as determined by a two-sided t-test.Interestingly, for highly visible sequences, this is even more pronounced.In conclusion, we believe proteins designed using language model (LM) can be valuable constituents of vaccines due to their ability to tap into the array of natural viral variant epitopes -even for less common ones.Figure 7 then explores the ability of the highest mean recall CAPE-XVAE-inc-nat sequence with a TMscore of above 0.95 to capture the naturally occurring k-mer and epitope diversity.We find that all visible 9-mers in the CAPE-XVAE designed sequence can indeed be found in the natural sequences (Figure 7b).Only 3 of the visible 9-mers (red boxes in Figure 7a) were present in less than 0.1% of the natural sequences (Figure 7b).In contrast, six of these were present in more than 50% of all natural sequences.Finally, Figure 7c shows the distribution of the visible 8, 9 and 10-mer recall.We find that only a few sequences have less than 10% of their visible peptides being present in the designed sequence.CAPE-XVAE supports the design of vaccines that contain real epitopes existing across variants of a virus.This figure shows an epitope map of the highest mean recall CAPE-XVAE-inc-nat designed sequence (see Figure 6).Its predicted visibility is 103 and its mean recall 31%.The GO based Sequence Pack Classifier predicts it to be a support sequence.We investigate where each epitope is derived from within the natural distribution of input sequences.(a) We present 9-mer peptides in the designed Nef protein sequence using a sliding window approach advancing 1 amino acid at a time.The N-Terminal 9-mer is top left progressing one amino acid at a time to the C-Terminal 9-mer on the bottom right.The colors signify how often these 9-mers occur among the natural sequences (see k-mer precision in 2.6.2).A red box signifies that this 9-mer is predicted to be presented by at least one MHC allele (a visible 9-mer).(b) The labels of the stacked bar chart are the number of visible 9-mers [red boxes in (a)] in a precision category (signifying the percentage of natural sequences that contain this visible 9-mer).(c) For all natural sequences we identify the visible 8-, 9-and 10-mers and calculate the percentage of those present in the designed sequence.The histogram shows the distribution of this number over all natural sequences (see k-mer recall in 2.6.2).

Limitations
The results presented are limited by several factors.First, we are heavily reliant on the state-of-the-art AlphaFold folding prediction model to produce accurate structures.The assessment of structural and functional similarity of the designed sequences to naturally occurring ones relies on these.AlphaFold was trained on naturally occurring sequences -in particular for CAPE-Packer designs, which are very divergent sequences, it is questionable whether predicted structures will remain accurate -and if they even were able to fold into those.Next, we are using a GO term predictor to assess functional similarity.On occasions, even a single mutated amino acid can lead to a loss of function.GO term predictors that are trained to predict the function of working proteins may miss those subtle differences.However, we are not aware of any ML method in protein design that performs a major modification to a protein and still proposes a functional protein in each attempt.In reality, the protein design process involves many filtering steps.The GO based Sequence Pack Classifier can be seen as another way to filter sequences.The difficulty of assessing designs is a challenge for the whole field and ultimately only confirmation in the wet lab on the output of tools like ours will deliver definite results.
Finally, our model does not directly control immune-reaction by CTLs, but only visibility to CTLs (see 2.1.1).Furthermore, other immune-system components (e.g.Abs) were not considered.

Conclusions
Designed protein-based medicines hold transformative potential for healthcare, serving as potent vaccines and therapeutics for various diseases.Ensuring these medicines do not inadvertently trigger an immune response, except when intended (as with vaccines), is crucial.This paper focuses on the design of proteins with controlled visibility to CTLs via the MHC Class I pathway.
We introduce CAPE-XVAE, a latent space controlled language model trained on natural sequences to propose new synthetic constructs that are more visible to the immune-system when needed, and stealthy when not.We compare its outputs to those generated by another method we developed -the physics-based CAPE-Packer which is based on the deimmunization protocol developed by Yachnin et al. [6] and ignorant of the target protein's natural sequence space.
We find that both methods can modify immune-visibility compared to natural sequences while maintaining high levels of structural similarity.As CAPE-Packer is not limited by the diversity in the target protein's natural sequence space, we find that it is more flexible in adjusting immune-visibility.However, designing proteins within the boundaries of natural sequence space is crucial for two main reasons.Firstly, it helps in preserving the known and potentially undiscovered functions of the protein being designed by conserving local functional constraints.Secondly, for vaccines, it allows to incorporate the diversity of in particular sequence-based MHC-I epitopes.That CAPE-XVAE can capture this local sequence similarity is demonstrated by the substantial proportion of naturally occurring short sub-sequences within its generated sequences (k-mer similarity).In fact, it generated sequences with particularly high visible peptide recall, allowing it to capture the diversity of potential epitopes found in viral protein sequence variants.This contrasts with CAPE-Packer, which is unable to incorporate those.
We conclude that methods like CAPE-Packer that disregard sequence information will be useful when designing completely synthetic proteins de novo under immune-visibility constraints.The designer needs to have full knowledge of the functions the protein is supposed to fulfill as well as the mechanism to achieve those.Functional regions can then be fixed, before allowing the algorithm to design around them.However, when we want to modify naturally occurring proteins where no full information about functionality and its mechanism is available, the hope is for machine learning algorithms to infer these from countless existing examples.
Flexible models trained from natural sequences and able to incorporate naturally occurring epitopes, like CAPE-XVAE, will be particularly useful for vaccine design.CAPE-XVAE can be easily adapted to modify for a broad range of immunevisibility profiles (see Box2) dependent on the latest developments in vaccine design.Generating chimeric sequences could potentially make escape mutations more unlikely and capturing the diversity in naturally occurring sequences may provide better, more universal vaccines in the future, compared to those that use just one strain of a virus.The concept will need validation, and this will be the subject of future studies.
However, with regards to reducing immune-visibility, we have also observed that CAPE-XVAE is not able to completely hide proteins from CTLs in our example case.As it solely relies on the distribution of natural protein sequences it might struggle to propose sub-sequences that need to be introduced to achieve invisibility if they do not exist in the training data.Because of this, we think that new ML-based approaches will need to be developed implementing a tunable and hybrid approach.These methods would incorporate structural information in addition to sequence.They could also utilize information from other proteins in addition to the target protein and be able to diverge by a defined and tunable amount in sequence space.Eventually, more complex immune-visibility profiles -considering immunogenicity and immunodominance also with regards to other components of the immune system (e.g.MHC-II) -need to be explored, but using a feedback loop with wet-lab validation.In contrast, the Cross Attention Block applies attention two times.First, to self-attend over the residual stream sequence (this is typically done in a masked way -not attending over 'future' sequence elements that have not been generated yet).And second, to attend over a sequence that is different from the residual stream.

Details
In Figure 8 we depict the details of the transformer blocks that are used to implement the encoder and decoder.Multiple sequential Self Attention Blocks form an Encoder Stack while multiple sequential Cross Attention Blocks form a Decoder Stack.
Each transformer block updates a Residual Stream -a sequence of vectors received from the precedent transformer block or from an embedding layer.While the Self Attention Block (see Figure 8a) only considers information in the Residual Stream for updating the Residual Stream, the Cross Attention Block (see Figure 8b) also incorporates information from a separate stream (red lines in Figure 8b) into it.

S.1.3 Reinforcement Learning
We want to backpropagate gradient information from the MHC-I presentation predictor, through the CAPE-XVAE decoder, to the latent space.However, directly backpropagating through the discrete decoder output is not possible, so we require an alternative method.One such approach is RL, which involves learning how to take actions to maximize a reward signal [35].RL has been utilized for training the model parameters of sequence-to-sequence models [36].In contrast, we employ RL to train the latent vector z, which steers the decoders behavior.
The typical RL setting involves an agent observing its current state s t and based on this taking an action a t in an environment.The agent follows a policy π θ , parameterized by θ, to select an action, which can be stochastic.After performing the action, the environment provides the agent with a reward signal r t based on the immediate benefit of the action in the environment.An episode consists of sequential iterations of this process until a terminal state is reached.The complexity of RL arises from the fact that actions taken at earlier stages of an episode can have repercussions on future rewards, either increasing or decreasing them.Consequently, the agent's objective is to maximize the (sometimes discounted) expected sum of rewards (J(θ) = E π(θ) t r t ) over the course of the episode.Additionally, we define the state value function q π (s, a), which represents the expected value of discounted rewards when taking action a in state s and subsequently following policy π.The state-value function serves as a useful estimate for evaluating the long-term desirability of actions within a given state.
The policy function π θ (s t ) is crucial for maximizing the expected reward J.To find its optimal parameters θ (in our case the latent vector), the gradient ascent method shown in Equation 1 can be used: Directly back-propagating each trajectory through the discrete tokens is not possible.Fortunately, the REINFORCE

Figure 1 :
Figure 1: Overview of the CAPE-XVAE model architecture, training and immune-visibility modification process.(a) Sketch of the CAPE-XVAE architecture.On the left side is the encoder, in the middle the latent space and on the right side the decoder.More details can be found in the Supplement S.1.(b) Shows the evolution of the KL-divergence term weight during CAPE-XVAE training.(c) Depicts the steps followed during the immune-visibility modification process

Figure 2 :
Figure 2: Differential distribution of CAPE-XVAE and CAPE-Packer sequences in natural sequence space.The density of the natural sequences used to train the models is depicted pink in the background.(a,b) The dark red/blue crosses correspond to the sampled sequences obtained through CAPE-XVAE/CAPE-Packer without modification process (baseline).(c,d) Show the reduced immune-visibility results.The red/blue crosses are the finally obtained CAPE-XVAE-reduced /CAPE-Packer-reduced sequences.(e,f) Here increased and increased-natural immune-visibility results are displayed.CAPE-XVAE designs are in orange and CAPE-Packer ones in blue.The light/dark crosses show the finally obtained inc-nat/increased sequences.(g,h) The distribution of the average dissimilarity to the 300 support sequences (g, see Box1) as well as to the CAPE-Packer-baseline sequences (h).Colors correspond to the other plots

Figure 3 :Figure 4 :Figure 5 :
Figure 3: CAPE-XVAE Alters MHC Class I immune-visibility while maintaining natural sub-sequence characteristics.(a) Visibility: Shows the distribution of the visibility of the various sequence sets.The visibility is calculated equivalently to how the reward function is calculated for the increased modification profile (See Subsection 2.4.3).The dashed pink line signifies the mean of the natural sequences.(b) K-mer Similarity: Dependent on k-mer length, this chart shows the average proportion of k-mers with that length in the sequences that can also be found in the natural sequence data.One means that all k-mers can be found in the data and zero means that none of the k-mers can be found in the data.(c-e) Boxplots show the k-mer length dependent distribution of k-mer similarity for the CAPE-XVAE modified sequences that are either in the bottom (reduced ) (c) or top (d,e) 0.5 percentile (increased, inc-nat) of natural sequences.(f) Shows the distribution of TM Score as calculated by TMalign[27] between the sequences and their closest natural sequence.Color scheme continues from Figure2

Figure 6 :
Figure6: Average recall and precision of natural and CAPE-XVAE-inc-nat designed sequences.Each dot in the scatterplots represents a natural or designed sequence.Its x-coordinate shows the sequences average precision over all natural sequences, while its y-coordinate depicts the sequences average recall over all natural sequences (8-, 9-and 10-mers).The boxplots on the right and bottom depict the distributions of average recall and precision for the two sources.The blue p-values are the t-test p values for the null hypothesis that the mean recalls of natural and designed sequences are the same.While plot (a) depicts the situation considering all sequences, (b) is restricted to highly immune-visible sequences (see Box2) mer recall distribution over natural sequences

Figure 7 :
Figure7: Language models in vaccine design: borrowing from the diversity of existing protein variants.CAPE-XVAE supports the design of vaccines that contain real epitopes existing across variants of a virus.This figure shows an epitope map of the highest mean recall CAPE-XVAE-inc-nat designed sequence (see Figure6).Its predicted visibility is 103 and its mean recall 31%.The GO based Sequence Pack Classifier predicts it to be a support sequence.We investigate where each epitope is derived from within the natural distribution of input sequences.(a) We present 9-mer peptides in the designed Nef protein sequence using a sliding window approach advancing 1 amino acid at a time.The N-Terminal 9-mer is top left progressing one amino acid at a time to the C-Terminal 9-mer on the bottom right.The colors signify how often these 9-mers occur among the natural sequences (see k-mer precision in 2.6.2).A red box signifies that this 9-mer is predicted to be presented by at least one MHC allele (a visible 9-mer).(b) The labels of the stacked bar chart are the number of visible 9-mers [red boxes in (a)] in a precision category (signifying the percentage of natural sequences that contain this visible 9-mer).(c) For all natural sequences we identify the visible 8-, 9-and 10-mers and calculate the percentage of those present in the designed sequence.The histogram shows the distribution of this number over all natural sequences (see k-mer recall in 2.6.2).

Figure 8 :
Figure 8: Transformer blocks add to a residual stream received from a preceding block or embedding layer.(a) The Self Attention Block uses self-attention only over the single input sequence itself.(b)In contrast, the Cross Attention Block applies attention two times.First, to self-attend over the residual stream sequence (this is typically done in a masked way -not attending over 'future' sequence elements that have not been generated yet).And second, to attend over a sequence that is different from the residual stream.

Table 1 :
Number of unique sequences in dataset splits

Table 2 :
Rewards for a single peptide presentation: Rows signify if a peptide is present in the natural sequences or not.Columns represent the immune-visibility profile we aim to modify towards.

Table 3 :
Mean Visibility: Rows represent the sequence sources and columns the modification profile (see Box2)

Table 4 :
Confusion matrix: The rows signify the actual sequence origin, while the columns their predicted one