Computing the inducibility of B cell lineages under a context-dependent model of affinity maturation: Applications to sequential vaccine design

A key challenge in B cell lineage-based vaccine design is understanding the inducibility of target neutralizing antibodies. We approach this problem through the use of detailed stochastic modeling of the somatic hypermutation process that occurs during affinity maturation. Under such a model, sequence mutation rates are context-dependent, rendering standard probability calculations for sequence evolution intractable. We develop an algorithmic approach to rapid, accurate approximation of key marginal sequence likelihoods required to inform modern sequential vaccine design strategies. These calculated probabilities are used to define an inducibility index for selecting among potential targets for immunogen design. We apply this approach to the problem of choosing targets for the design of boosting immunogens aimed at elicitation of the HIV broadly-neutralizing antibody DH270min11.


Introduction
Vaccination aims to induce antibodies -the secreted versions of B cell receptors -that can neutralize pathogens and thus provide protection against infection.In natural infection, B cells evolve their antigen receptors to specifically recognize pathogens through progressive rounds of mutation and selection based on affinity to antigen.October 12, 2023 1/18 However, rapidly mutating viruses such as HIV, influenza and SARS-CoV-2 can escape the B cell response through viral diversification.In such cases, it is of great interest to develop vaccination strategies to elicit broadly neutralizing antibodies (bnAbs).However, bnAbs are rarely elicited in infection due to a variety of factors [1].For example, HIV bnAbs typically originate from B cells with low precursor frequencies in the human B cell receptor repertoire [2,3].They also typically have high numbers of mutations, some of which may be essential for broad neutralization but be made infrequently by activation-induced cytidine deaminase (AID), the enzyme responsible for mutating the B cell receptor [4].Acquisition of these improbable mutations is a key bottleneck for bnAb induction [4,5].In these situations, traditional vaccine design strategies have proven ineffective at eliciting bnAbs.This has motivated the development of advanced vaccine design strategies which aim to use known bnAbs as templates to design immunogens that can direct B cell evolution towards their induction [2,[6][7][8][9].
One promising HIV vaccine design approach, called sequential prime-boosting, aims to first induce low-frequency bnAb B cell precursors with a priming immunogen, and then mature those clonal lineages with one or more additional, distinct boosting immunogens.A key step of this approach is the inference of a reconstructed B cell receptor (BCR) sequence of the bnAb precursor, referred to as the unmutated common ancestor (UCA), from observed, clonally-related sequences [6,10].Recent progress has been made on the first component of this sequential prime-boost approach through the design of priming immunogens that can initiate precursors of bnAb B cell clonal lineages [11].However, the design of boosting immunogens that can select for sets of required, improbable mutations in the initiated bnAb B cell clonal lineages remains a central challenge [2].
Current experimental approaches to develop sequential prime boost vaccine regimens are highly labor-and time-intensive, relying on cycles of immunogen design, immunization of animal models, and B cell receptor repertoire sequencing to assess whether desired mutations were elicited [5,8,12].An alternative approach, which we adopt here, is the use of computational models to accelerate the design process.Specifically, we consider the use of stochastic models of the somatic hypermutation process to better inform the design of immunogens and development of vaccine regimens to efficiently guide a maturing bnAb response by vaccination.
Critical to this approach is the recognition that fixation of mutations is the product of two separate forces: mutation and selection.While vaccination with carefully chosen immunogens can introduce targeted selection, it is hypothesized that one of the primary difficulties in eliciting bnAbs (in HIV, say) is the low frequency with which essential mutations occur [4].If this frequency is sufficiently low, there are unlikely to be any instances of the mutation in the B cell population on which this selection can act.The use of stochastic models of the somatic hypermutation process to understand this frequency is therefore critical.A key step in formalizing this problem is answering the question: given an ancestral B cell receptor sequence x within a B cell clonal lineage, what is the probability of obtaining a specified target sequence y along the lineage under a realistic model of affinity maturation.However, because mutation of B cell receptors is sequence context-dependent due to biases in mutational targeting and substitution by AID, mutations do not arise independently during B cell evolution.This context dependence raises the key technical challenge to be addressed in this paper, namely that the context-dependent nature of somatic hypermutation makes standard approaches for computing P (x → y) under molecular evolution models intractable.
October 12, 2023 2/18 Challenges in Sequential Prime Boost Vaccine Design Practical limitations on the number of boosting immunogens that can be administered mean that optimizing the sets of mutations to be selected by each immunogen is an important consideration in the design of a sequential prime-boosting vaccine regimen.
Due to the sequence context-dependence of somatic hypermutation, the order in which the immunogens are administered and select for desired mutations within a vaccine regimen will also affect the probability of eliciting a bnAb response.Additionally, since it is not known a priori what the success of the chosen immunogens will be, the vaccine designer must make choices about which mutations to target first without information about the complete vaccine regimen.The ability to calculate P (x → y) enables us to compute more generally P (x (1) → x (2) → . . .x (p−1) → y) for any ordered set of intermediate sequences x (2) , . . ., x (p−1) , a key step for such design choices.The ability to accurately calculate these full-length BCR sequence evolution probabilities, and to rank BCR sequences by their inducibility, has a number of practical applications in the sequential vaccine design process.We outline three common scenarios encountered in designing sequential prime-boost vaccine strategies where such calculations can be used to inform the design of specific immunogens: (Design scenario I) In ab initio lineage-based vaccine design, the first step is to design a priming immunogen that optimally engages bnAb precursor B cells and maximizes the probability that B cells will evolve along bnAb maturation pathways.We assume that the priming immunogen already binds with high affinity to the unmutated precursor bnAb B cells.Given that a limited number of mutations can be selected by any one immunogen, the challenge is to identify the set of mutations to target for selection first, in order to give the B cells the highest chance of success to eventually mature into bnAbs.
(Design scenario II) At intermediate stages of the design process, the vaccine designer has developed an initial set of immunogens in a vaccine regimen and has evaluated the B cell response to this partial regimen by sequencing the antibody repertoire of immunized bnAb UCA knock-in mice [5,8,[12][13][14].By measuring the frequency of occurrence of targeted mutations in immunized knock-in mice, the vaccine designer can determine which of the mutations necessary for broad neutralization are induced by immunization with the regimen.Often the regimen is partially successful, in that a subset of the necessary mutations are selected, leaving others still in need of selection by addition of subsequent immunogens to the regimen.During this iterative process, the probability of evolving the target bnAb conditional on the observed intermediate(s) can be used to identify the set of mutations to target for selection with the next boosting immunogen, in order to maximize the probability of bnAb induction.2 Methods

Background and Motivation
The ARMADiLLO model [4] is a recently developed model for forward simulation of the somatic hypermutation process in affinity maturation.The simulation procedure uses a set of mutation rates and base frequencies derived from NGS data [15].Unlike the continuous-time Markov chain (CTMC) models common in molecular evolution, sites in the ARMADiLLO model evolve in discrete jumps, a process that can be viewed as the skeleton of a time-inhomogeneous CTMC.However, a critical aspect of the somatic hypermutation model encoded in ARMADiLLO is the sequence context-dependence of mutation and substitution rates, arising from the sequence targeting preferences of the AID enzyme.Just as in dependent-site models of sequence evolution [16][17][18], calculating p(y | x) under the ARMADiLLO model is computationally difficult due to the dependence among sites which precludes the use of Felsenstein's pruning algorithm [19].
Although forward simulation of the ARMADiLLO model has been used to estimate the probability of individual mutations arising by chance [4], this approach becomes prohibitively expensive when trying to calculate p(y | x) for a specific, full sequence y.In Section 2.3, we develop an importance sampling algorithm for tractably approximating p(y | x).Our approach samples mutation orderings (trajectories) of observed mutations rather than entire path histories as in previous dependent-site models, and handles the multimodality that can plague other sampling approaches.
Section 3 demonstrates that our method can be used to easily approximate a large number of such transition probabilities quickly.

The ARMADiLLO Model of Somatic Hypermutation
Let x = (x 1 , x 2 , . . ., x n ) and y = (y 1 , y 2 , . . ., y n ) denote two nucleotide sequences.Let d H (x, y) = n i=1 δ(x i ̸ = y i ) denote the Hamming distance between x and y and let xi = (x i−2 , x i−1 , x i+1 , x i+2 ) for i ∈ {1, . . ., n} denote the two-nearest-neighbor context of site x i .(For i = 1 and i = n, we assume that the two left and right flanking nucleotides, respectively, are fixed).Each site x i is assigned a mutability score m(x i ; xi ) and a set of substitution probabilities π(x i , x ′ i ; xi ), where π(x i , x i ; xi ) = 0 and b∈N π(x i , b; xi ) = 1 for N = {A, G, C, T }.The ARMADiLLO algorithm is given in Algorithm 1.Note that the transition probability under ARMADiLLO is defined by a probability matrix Q (i) with entries appearing in (1) correspond to the probability of selecting site x i for mutation and transitioning to nucleotide base b, respectively.For information on how the mutability scores and transition probabilities are estimated, see [15].

Algorithm 1 ARMADiLLO
Input: Initial Sequence x and Mutation Number r ≥ 1 Algorithm 1 was used [4] to estimate the marginal probability p(y i | x) of a codon at a given location in x transitioning to a target amino acid by simulating many terminal nucleotide sequences y (1) , . . ., y (M ) and calculating the proportion pyi = 1 M M j=1 1 yi (y (j) ) of times that the target amino acid occurs at the desired location.Because the emphasis is on the marginal probability p(y i | x) at a single site, a reliable estimate of this probability appears to be obtainable with a feasible number of simulations.However, as noted this approach does not scale to computation of p(y | x) for full-length sequences y.Instead, we develop an efficient importance sampling algorithm for evaluating full-length sequence probabilities of this form.

Estimation using Importance Sampling
Given two sequences x and y such that r = d H (x, y), we wish to approximate p(y | x) under Algorithm 1.To do this, we sample orderings in which the r mutations occur in the transition from x to y.More formally, let S = (s 1 , . . ., s r ) for s j ∈ {1, . . ., n} be the set of sites at which the two sequences differ.Let S r be the set of all permutations of the elements of S. The goal is to approximate where ; x σ(0:j−1) , xσ(0:j−1) , y σ(sj ) ; xσ(0:j−1) denotes, for a given σ ∈ S r , the value of the ith nucleotide after k updates to x according to the permutation σ.That is (so e.g.x σ(0) i = x i , and x σ(0:r) i = y i ).We refer to σ ∈ S r as a 'path' from x to y.
Let Q(σ) = Z −1 q q(σ) where Z q = σ∈Sr q(σ) denote a distribution on S r (the instrumental distribution) which can be easily sampled.Then the corresponding importance sampling estimator for (2) is for σ (1) , . . ., σ (N ) iid ∼ Q(σ).The performance of this importance sampling algorithm depends strongly on the choice of q, with max σ π/q controlling Var(p IS (y | x)).Because π is unnormalized here, we require Z q in order to recover Z π := p(y | x).Consequently, we need to choose Q complex enough to approximate π reasonably well but simple enough so that Z q can be feasibly computed.
Recall that under the ARMADiLLO model the mutability score m(x i ; xi ) for a site x i depends only on its two-nearest-neighbor context xi .However, x i mutates to a ∈ N with probability equal to its normalized mutability score m(a; xi ).This makes the normalization Z π for the ARMADiLLO path distribution intractable because the product of normalization terms appearing in π(σ) induces long-range dependence among sites that do not have overlapping pentamers, and consequently the probability that x i mutates depends on all sites x 1:n := (x 1 , . . ., x n ).This suggests replacing m(a; x, xi ) with m(a; xi ) in π(σ) to make Z π tractable; we use this form as our instrumental distribution Q.In particular, we set ; xσ(0:j−1) , y σ(sj ) ; xσ(0:j−1) ).
Since the context xi plays the most important role in determining the mutation probability for x i , this provides an instrumental distribution Q that closely approximates π while also yielding a tractable normalizing constant Z q .
To see that Z q is tractable, call two subsets s, s ′ ⊂ S separated if for any s ∈ s and s ′ ∈ s ′ we have |s − s ′ | > 2. Under q, the probability of a permutation is invariant with respect to the re-ordering of mutations belonging to different separated sets.More formally, let (s 1 , . . ., s k ) be a partition of S into separated sets.Let r j = |s j | and let S rj be the set of all permutations of the elements of s j (so |S rj | = r j !) with σ j = (σ j (s j1 ), . . ., σ i (s jr j )) denoting a corresponding permutation σ j ∈ S rj .Let Sr = S r1 × . . .× S r k be the product group consisting of all k-tuples (σ 1 , . . ., σ k ) with σ j ∈ S rj and note that Provided q(σ i ) ̸ = q(σ j ) for i ̸ = j (this will generally be the case), then the equivalence classes Putting it all together, we have by ( 2), (3), and ( 6) the estimator for σ (1) , . . ., σ Sampling from q proceeds by enumerating all σ ∈ Sr and evaluating (4).Z q is then obtained by summing and multiplying by C(r, k) as in (6).We can then sample index i with probability q(σ i )/Z q for σ i ∈ Sr corresponding to one of the equivalence classes [σ i ].Finally, we sample uniformly from the equivalence class [σ i ] to obtain the desired sample from q.

Results
We begin with a simulation study to evaluate the performance of our approach, followed by the application to design calculations for HIV-I immunogen design.

Simulation Study
We first evaluate the approximation algorithm of Section 2 on a set of example sequences where the true transition probabilities can be calculated exactly.Test sequences are chosen to span a range of values of both the largest separated set r ⋆ := max j r j and the quantity α := r 2 /n which quantifies the number of mutations r relative to the size of the sequence n.(We expect that the variance of our estimator increases with α.)This set of test sequences demonstrates the performance of the October 12, 2023 7/18 algorithm as the complexity of the problem varies.We investigate the performance of the estimator (7) as a function of r ⋆ and α, measured in terms of both coefficient of variation and effective sample size (ESS): .
In all test sequences, the number of mutations is kept low (r = 10) so that the exact value of the transition probability can be computed for comparison.
For each test sequence, the algorithm was run 1000 times using a sample size N = 1000 in each run.Table 1 shows the results, including the exact value, and the mean and relative standard deviation of the 1000 estimates.Across all values of r ⋆ and α the transition probability is estimated accurately up to at least the second significant digit.The coefficient of variation decays with α as expected, and appears to be unaffected by r ⋆ .The most difficult case is when α and r ⋆ are large (  1. Results for estimating transition probabilities on test sequences.1000 runs of the algorithm were performed for each sequence, each using N = 1000.Shown are exact calculations (True prob.), mean estimate over the 1000 replicates, mean effective sample size, and standard deviation of estimates relative to the mean (CoV).We see that probabilities are estimated accurately across all values of r ⋆ and α, despite the problems getting more challenging (lower ESS, higher CoV) when α and r ⋆ are large.

Application to B-Cell Evolution
We now return to the problem of sequential immunogen design and the scenarios described in Section 1 encountered by vaccine designers in which the transition probabilities between the UCA sequence and target mature antibody sequences can be used to inform the design of specific immunogens.
For example, Fig 1 shows (see also Table 3 below) that the order of mutation selection in a clonal lineage can significantly affect the probability of obtaining specific bnAb maturation pathways.Since multiple immunogens may be required to direct the evolution of B cells when a large number of mutations must be acquired for bnAb activity [13,20], the ordering of prime and boosting immunogens -each selecting distinct sets of mutations -within a lineage-based vaccine regimen may critically affect the probability of successful bnAb elicitation.It is therefore of great interest to determine a mutation ordering that maximizes this elicitation probability for a specified target bnAb.This in turn can be used to develop a sequence of immunogens to target bnAb mutations in the most probable order of acquisition.

Transition probabilities for protein sequences
In calculating maturation pathway probabilities, consideration of both nucleotide and the amino acid sequences is critical.The UCA sequence is the result of recombination of germline-encoded gene segments and addition of non-templated nucleotides at the gene segment junctions, providing a defined starting nucleotide sequence.However, selection acts upon the BCR at the protein level, where binding to the immunogen is determined by the amino acid sequence, and thus all nucleotide sequences giving rise to the target bnAb amino acid sequence must be accounted for.Because AID acts at the nucleotide level, the ARMADiLLO model describes transitions between nucleotide sequences.
Marginalization is then required to obtain probabilities of amino acid sequences.In what follows then, we compute the probability that a known unmutated common ancestor (UCA) nucleotide sequence transitions to a target amino acid sequence.For purposes of vaccine design, it will also be of interest to consider "intermediate" amino acid sequences along potential mutation pathways, as potential targets for sequential immunogen design.We formalize these calculations below.
Let a(y) denote the amino acid sequence arising from a nucleotide sequence y and A(y) = {z : a(z) = a(y)} the equivalence class of nucleotide sequences giving rise to the same amino acid sequence.Denote by a i (y) the ith amino acid in a(y), and let c i (y) denote the number of codons that encode a i (y).We estimate transition probabilities from an inferred UCA (initial) nucleotide sequence x to a any terminal nucleotide sequence z ∈ A(y), where y is an observed bnAb (target nucleotide sequence).
Let m = {m 1 , . . ., m k } for m i ∈ {1, . . ., n 3 } be the set of amino acid sites where a(x) and a(y) disagree and m 1 , . . ., m p be all k q subsets of m of size q, with elements Denote by x ij the jth element of N(x, m i , y).So x ij is an intermediate nucleotide sequence on a trajectory from x to some sequence in A(y) which is equal to x outside of (the codons indexed by) m i and matches a(y) at all sites m i (i.e. a mi (x ij ) = a mi (y)).The superscript j indexes the possible combinations of codons giving rise to a mi (x ij ) = a mi (y), of which there are J i := q r=1 c mi r (y) for i = 1, . . ., p.
Let z be an end-of-trajectory sequence for x ij , meaning that z ∈ A(y) and z differs from x ij only at nucleotide positions corresponding to amino acid sites m \ m i of the mutations not yet acquired by We estimate the transition probability of x i,j to each z ∈ Z(x, y, i, j) by assuming exactly r = d H (x ij , z) mutational events occur (i.e.unmutated nucleotide positions remain fixed, and no reversions).Starting from intermediate x ij for i = 1, . . ., p and j = 1, . . ., J i , this yields transition probability Similarly, the probability of obtaining the initial amino acid mutations in m i is given by October 12, 2023 10/18 where again the conditioning indicates that exactly r mutations occur in the transition from x to x ij , with all other nucleotide positions remaining fixed.Finally, we calculate the joint probability of first obtaining initial amino acid mutations m i on the way to obtaining the full set of mutations m as the joint probability: Conditioning on the number of mutational events that occur and assuming unmutated nucleotide positions remain fixed in our estimates amounts to ignoring synonymous mutations and multiple-mutation reversions.We make these assumptions for computational tractability.Indeed, if n a = n 3 is the length of a(y), then the number of terminal sequences that give rise to a(y) can grow exponentially in n a via synonymous mutations due to the redundancy in the genetic code.

Inducibility of a minimal set of critical mutations in an HIV bnAb
We applied this approach to study the inducibility of critical mutations in DH270.6, an HIV bnAb.Here the target sequence y is the heavy chain sequence of DH270min11, an antibody engineered from the DH270.6 bnAb to contain only those amino acid mutations determined to be functionally important for neutralization breadth [21].Here x is the corresponding estimated UCA sequence for the DH270.6 clone [22] obtained by clonal lineage reconstruction using Clonalyst [10].(Alternatively, a distribution over UCAs accounting for reconstruction uncertainty may be obtained by probabilistic methods such as Partis [23][24][25]; see Discussion.)In this case, x and y differ by seven nucleotides, and a(x) and a(y) differ by six amino acids.The DH270min11 mutations are given in Table 2. Notice that the largest separated set is of size r ⋆ = 2 and α = 0.005 since n = 382.Of the observed amino acid mutations, only one requires multiple nucleotide substitutions in the corresponding codon.We first consider the critical mutations individually in turn.So q = 1, and m i consists of only a single amino acid location (i.e.we let m i = {m i }, i = 1, . . ., k, and J i = c mi (y)), so x ij is equal to x except at a single amino acid site.In total, there are 6 i=1 c mi (y) = 19 intermediate sequences x ij corresponding to the total number of codons that encode the six amino acids where a(x) and a(y) differ.Table 3  where ⌊⌉ denotes rounding to the nearest integer.This index equates sequences whose evolution probabilities are of the same order of magnitude, and facilitates direct comparison between potential targets with practically significant differences in inducibility.A lower inducibility index therefore indicates a sequence that has higher a priori probability of arising in the absence of selection.3. Transition probability results for q = 1.Initial amino acid mutations and corresponding codons are given in the first and second column.The starred codons correspond to the actual codons observed in the DH270min11 sequence.In the third column, we give the probability of transitioning to any nucleotide sequence z ∈ A(y) (i.e.any nucleotide sequence that gives rise to the same amino acid sequence as y) conditional on first obtaining the amino acid mutation m i via codon j.In the fourth column, we give the estimate of the probability of obtaining the initial amino acid mutation in m i by codon j.The weighted estimate in the fifth column is the product of the third and fourth columns.The sixth column is the average effective sample size of the transition probability estimates.
We see that the transition probability p(A(y) | x ij ) to the terminal DH270min11 sequence can be maximized by acquiring the G110Y mutation first.G110Y requires two base substitutions to occur within its codon for the amino acid transition from glycine (G) to tyrosine (Y).Comparing the calculated probabilities, we conclude that a vaccine regimen that successfully selects for the G110Y mutation first would increase the calculated probability of induction by 5 orders of magnitude, an order of magnitude (or more) improvement over selecting any of the other mutations first.
The vaccine designer can then use this information (Design scenario I) to aim for a priming immunogen that both binds with high affinity to the DH270 UCA (to initiate October 12, 2023 12/18 the clonal lineage), but also with even higher affinity to the UCA+G110Y mutation, in order to select for G110Y and guide the B cell response along the most probable bnAb maturation pathway.
As noted, the G110Y mutation requires two base changes.Multiple required base changes within a codon will typically result in a low transition probability of a targeted amino acid.However, it also provides an opportunity for the vaccine designer to accelerate its acquisition.For example, for the G110Y mutation, a single base change in codon 110 (GGT, glycine) of the DH270.6UCA can transition through either GAT (aspartic acid) or TGT (cysteine).Our calculations indicate that the transition through aspartic acid is ≈ 1.5× more probable than the alternative path through cysteine.
Therefore, adding an immunogen to the vaccine regimen that selects for the intermediate amino acid state of aspartic acid at position 110 could accelerate induction of the critical and highly improbable G110Y mutation.

Multiple simultaneous mutations in immunogen design calculations
In design scenarios 1 and 2, we may often wish to consider the induction of multiple simultaneous mutations.Here we consider initial target mutation sets of size q = 2 or q = 3. Tables 4 and 5 list the path probabilities conditional on all initial pairs and triplets of DH270min11 mutations.We see that different initial subsets differ by orders of magnitude in their probabilities, while the overall joint probabilities obtained as the products differ only by small constants (this also reassures that the approximation error is small in each case).We observe that (31D, 55T), (31D, 51M), and (31D, 98T) are the most probable pairs of mutations to arise in the absence of a selecting immunogen (have the highest "inducibility"), and (31D,51M,55T) and (31D,55T,98T) the most probable triples.Whereas (57R,110Y), (98T,110Y) and (51M,110Y) are the least probable pairs to arise by chance, and we therefore might expect them to be more difficult to elicit via immunogen selection.Similarly for the triples (57R,98T,110Y) and (51M,57R,110Y).
Conversely, if we were able to design a priming immunogen to elicit the mutation pair (57R,110Y) or triplet (57R,98T,110Y), it would be of high impact as we would expect this to maximize the probability of obtaining the full mature bnAb using a boosting immunogen.5. Transition probability results for q = 3.The first column indicates which amino acids are targeted first.The second column corresponds to the probability of transitioning to z ∈ A(y) conditional on first obtaining the amino acid mutations in m i .The third column corresponds to the probability of first obtaining the amino acid mutations in m i .The fourth column corresponds to the probability of transitioning from x to any z ∈ A(y) (not the product of the second and third columns).The final column gives the average minimum effective sample size for the estimates of p(z | x ij , r = d H (x ij , z)), where the minimum is across all z ∈ Z(x, y, i, j) and the average is across j = 1, . . ., J i .

Comparison with experimental results
It is interesting to compare these results with recently obtained experimental data [20], where we have immunized DH270 UCA knock-in mice with a priming immunogen, sequenced their heavy chain BCR repertoires, and measured the frequency of the 6 DH270min11 mutations both individually and in combination.The G110Y mutation is observed to be the second least frequent mutation selected by our priming regimen.Our probability calculations (Table 3) indicate that of all individual mutations, G110Y selection maximizes bnAb maturation probability.Thus, adding a boosting immunogen to this vaccine regimen that can select for G110Y would be an optimal strategy for maximizing the probability of bnAb elicitation.
(Design scenario III) In our repertoire sequencing data, the I51M mutation is the lowest frequency of the six DH270min11 mutations.Contrasting this with our probability calculations (Table 3),which estimate that I51M is the third most probable mutation in the absence of selection.Such differences between model calculations and observed frequencies may be indicative of selection effects; thus one explanation for the low I51M frequency observed in immunized mice is that our priming immunogen lacks  6. Starting from (G31D, R98T), the most frequent pair observed in immunized mice, and transitioning to the next pair.

Design scenario II
When information is available about the performance of the first immunogen(s) in a sequential boost vaccine regimen, the vaccine designer can use the estimated transition probabilities to make decisions about which boosting immunogen(s) to administer next in the series.From our experimental data, the highest frequency mutation pair induced by our priming immunogen was (G31D, R98T).
For current purposes, we assume a single immunogen can select for only one pair of mutations, and use the model calculations to choose which pair of mutations should be selected for with the next boosting immunogen.Table 6 shows the estimated transition probabilities starting from the UCA + (G31D,R98T), for all remaining 4  2 pairs of mutations.We observe that the transition probability to the DH270min11 sequence is maximized upon acquiring (57R,110Y).Thus, the optimal sequential boosting strategy is to use a first boosting immunogen to select for G57R and G110Y followed by a second boosting immunogen to select for S55T and I51M.

Conclusion
We have introduced a model-based approach to sequential immunogen design using the ARMADiLLO model of somatic hypermutation.To calculate design-relevant marginal sequence transition probabilities in the face of context-dependent mutation, we have developed a fast and accurate Monte Carlo approximation scheme.We have demonstrated that this model performs well on test sequences of varying complexity.
Finally, we have applied this approach to answer questions of great current significance regarding mutation targeting for boosting immunogen design for ongoing efforts to elicit the HIV bnAb DH270min11.These results are now being used to guide efforts for immunogen design in the Duke Human Vaccine Institute.
Our approach relies on knowledge of a precursor sequence.Typically this is obtained as an estimated UCA obtained from a set of clonally related sequences.However, the process of inferring the UCA retains residual uncertainty both due to choices regarding which sequences to include, and the probabilistic information content of the sequences themselves.Although we use the Clonalyst procedure here to produce a single maximum likelihood UCA, other methods [23] are available which account for (some of) the reconstruction uncertainty by providing a distribution over UCA sequences, and our approach easily extends to use such information.However those methods rely on phylogenetic computations which become intractable in the face of context-dependent mutation, and so do not currently account for this aspect of the somatic hypermutation process, which has been our focus here.Applying the accurate, rapid approximation of October 12, 2023 15/18 marginal sequence likelihoods developed here to the problem of lineage reconstruction in the face of context-dependent mutation is a promising area for future work.We expect that this may only impact results in the CDRH3 region, as the posterior probabilities on V genes and even alleles are expected to be close to one, but it may indeed be important for the CDRH3 (R98T and G110Y are in the CDRH3, for example).Another source of uncertainty in the inferred UCA arises from the selection of sequences to define the clone itself; we are exploring approaches to account for context-dependence in this step as well.Finally, given the speed of computation, our approach here could be extended to sets of bnAb UCAs that define an entire precursor class, i.e. a common set of precursor sequences evolving to a common set of paratopic features that define that bnAb class's ability to recognize a conserved site of vulnerability on a pathogen.

Figure 1 .
Figure 1.The use of estimated transition probabilities to inform sequential prime-boost vaccine design.A) CDRH2 sequence of the HIV bnAb CH235.12 inferred UCA.AID hot spots (high mutability) are highlighted in red and cold spots (low mutability) highlighted in blue, with amino acid alignment of the UCA and CH235.12mature sequence shown below.Dots represent sequence matches and denote unmutated amino acid positions.B) Graph of amino acid transitions for a subset of three mutations at sites 54, 55, and 57.Arrow widths are proportional to estimated transition probabilities (shown).Based on the highest probability full path (yellow), C) the vaccine designer can choose the order of immunogens in a sequential prime-boost vaccine regimen to maximize the probability of induction of the three targeted mutations.
y)}, where N denotes the set of nucleotides.(The condition n m c i = x m c i is a slight abuse of notation, indicating that all the codons outside of the set m i are equal.)So N(x, m i , y) is the set of nucleotide sequences of length n which match x in all positions m c i , and which map to the same amino acids as y in positions m i .(N is a set due to the redundancy of the genetic code.) Description of the amino acid changes in the DH270min11 sequence.Columns show (in order): (1) amino acid change; (2) UCA sequence codon at change site; (3) corresponding codon observed in the DH270min11 sequence (mutations underlined); (4) minimal number of base substitutions required for amino acid change;(5) number of possible terminal codons that encode amino acid change.
lists the calculated path probabilities to DH270min11 conditional on each of the 6 individual amino acid mutations occurring first.To aid in the interpretability of results, we define an inducibility index I(y) = ⌊− log 10 P (x → y)⌉ October 12, 2023 11/18

Table 1 ,
first row), whereas the ideal case is when both of these quantities are small (last row), and we see this difficulty reflected in the average effective sample size.