Deep learning model of somatic hypermutation reveals importance of sequence context beyond targeting of AID and Polη hotspots

B-cells undergo somatic hypermutation (SHM) of the Immunoglobulin (Ig) variable region to generate high-affinity antibodies. SHM relies on the activity of activation-induced deaminase (AID), which mutates C>U preferentially targeting WRC (W=A/T, R=A/G) hotspots. Downstream mutations at WA Polymerase η hotspots contribute further mutations. Computational models of SHM can describe the probability of mutations essential for vaccine responses. Previous studies using short subsequences (k-mers) failed to explain divergent mutability for the same k-mer. We developed the DeepSHM (Deep learning on SHM) model using k-mers of size 5-21, improving accuracy over previous models. Interpretation of DeepSHM identified an extended DWRCT (D=A/G/T) motif with particularly high mutability. Increased mutability was further associated with lower surrounding G content. Our model also discovered a conserved AGYCTGGGGG (Y=C/T) motif within FW1 of IGHV3 family genes with unusually high T>G substitution rates. Thus, a wider sequence context increases predictive power and identifies novel features that drive mutational targeting.

to higher binding B cell receptors to cognate antigen occurs in the GC light zone, thus, producing 39 a diverse repertoire of high-affinity antibodies (Methot and Di Noia, 2017;Pilzecker and Jacobs, 40 2019; Rajewsky, 1996). The mutagenic enzyme, activation-induced deaminase (AID), initiates 41 SHM (Muramatsu et al., 2000) by converting cytosine (C) to uracil (U) in single-stranded DNA 42 (ssDNA), resulting in a U:G (guanine) mismatch . AID displays 43 preferential targeting at WRC/GYW "hotspot" motifs (where W=A/T, R=A/G, Y=C/T, and the 44 underlined base indicates the mutated base in the top and bottom strand, respectively), whereas 45 SYC/GRS "coldspots" (S=C/G) are significantly less targeted Rogozin and 46 Diaz, 2004; Rogozin and Kolchanov, 1992;Yu et al., 2004). If left unrecognized, U mismatches 47 will act as a template T and be replicated over (Pilzecker and Jacobs, 2019). The resulting C>T 48 transition mutation is commonly referred to as the DNA "footprint" of AID (Liu et al., 2008). 49 Downstream DNA repair further contributes to antibody diversity that is mediated by low-fidelity 50 polymerases. During non-canonical base-excision repair (ncBER), the U:G mismatch is 51 recognized and excised by uracil-DNA glycosylase (UNG), resulting in an abasic site (Rada et al., 52 2004). Repair of these abasic sites by REV1 can cause both transition and transversion mutations 53 at C:G base-pairs (Jansen et al., 2006). In the case of non-canonical mismatch repair (ncMMR), 54 the U:G mismatch is recognized by the MSH2/MSH6 heterodimer. Next, EXO1 exonuclease is 55 recruited to create a patch of ssDNA, which then allows error-prone polymerases, particularly 56 Polymerase eta (Polη), to resynthesize. Polη is known to create mutations at neighboring adenine 57 (A) and thymine (T) sites of the initial AID-induced lesion, most notably at WA/TW hotspot motifs 58 (Matsuda et al., 2001;Mayorov et al., 2005). 59 Several computational models have been developed for the SHM process and intrinsic 60 biases exhibited by key proteins such as AID and Polη. These models have mainly utilized k-mer 61 The objective of our analysis was to use supervised deep learning to build an accurate 105 convolutional neural network (CNN) for SHM and, as much as possible, identify novel features 106 contributing to mutability. We chose CNNs because we still expected mutation frequency to 107 depend on recurring motifs that might occur at any position in the sequence (most obviously, AID 108 hotspots), a task CNNs are well suited to. The workflow of our network consists of an input layer 109 that processes a k-mer subsequence represented in its one-hot encoding format (i.e. a 4×k matrix 110 of zeros and ones), followed by a convolution layer and two fully connected layers as the hidden 111 layers, and finally the output layer of size 4×1 or 1×1, depending on the task that is being predicted 112 (Figure 1, see Methods). Several hyperparameters, including dropout rate and learning rate, were 113 fine-tuned with our model as well (Supplementary Table 1). We defined a model that would 114 separate mutations on each strand (which are predominantly at C and A on the top strand and at G 115 and T on the bottom strand) at the input level. To achieve this, we identified a simple solution 116 using padding that assigns a row in each channel of the convolution layer output separately to each 117 strand (Figure 1). CNNs are also often used together with attribution methods such as Integrated 118 Gradients, to help with interpretation of the results. 119 120 Figure 1. DeepSHM model architecture. Each model had an input layer, one convolution layer, two fully connected 121 layers, and an output layer. The input layer was a 4×k dimensional one-hot encoded matrix (k is length of 122 subsequence). The dimension of the output layer was dependent on the task: substitution (4×1), mutation frequency 123 (1×1), or weighted substitution (4×1). For the convolutional layer, 'same' padding was used to allow the model to 124 process top and bottom strand mutations separately. With 'same' padding, the output of each convolutional channel 125 has the same shape as the input (4×k) with the following properties: the first and the fourth rows are populated with 126 zeros only (there was no real input, only padding; cyan and magenta rows); the input used for the second (light blue) 127 row contained two rows of padding and two data rows corresponding to A or C nucleotides only; and similarly, the 128 input used for the third (green) row also contained two rows of padding and two rows of data corresponding to G or 129 T nucleotides. Since AID and Polη target C and A sites respectively, this approach was taken with the expectation of 130 helping the model distinguish top and bottom strands.

131
As a starting point, we trained two CNN models, which we collectively refer to as 132 DeepSHM (Deep learning on SHM), to separately predict mutation frequency and substitution 133 rates, calculated from previously published B cell repertoire data containing non-productively 134 rearranged and clonally independent VDJ coding sequences , for varying k-mer 135 lengths (see Methods). We trained both models independently using different combinations of k-136 mer lengths and hyperparameters as listed in Supplementary Table 1. We found that for 137 predicting mutation frequency, 15-mers were moderately better than 9-mers (purple boxplots in 138 Figure 2A, Mann-Whitney U test: P<2.2×10 -16 ) and that further extending the motif length to 21 139 did not improve accuracy since both produced an overall maximum correlation (across 140 hyperparameters) of 0.76 (Figure 2A, Table 1). Thus, using k-mers of length 15 or longer 141 outperformed shorter lengths, specifically 5-mers and 9-mers (Table 1), suggesting that an 142 extended DNA motif can better model the SHM process. However, using longer k-mers did not 143 substantially improve the model that predicts SHM substitution bias alone, achieving an average 144 correlation of 0.55 for 15-mers (green boxplots in Figure 2B, Table 1), but which is similar for 145 different lengths. For the interpretability analysis below, we chose to use the best 15-mer models 146 to keep the k-mer length consistent for comparisons across all models. In order to check if the 147 performance of the models leading to the best results was consistent, we also trained 30 different 148 iterations of each model, keeping the hyperparameters fixed but using different random seeds. We 149 found the standard deviation across correlations was very small, at 0.002 for the mutation 150

157
We next sought to compare DeepSHM against the widely used S5F model that is based on 158 5-mer motifs (Yaari et al., 2013). To ensure a fair comparison, we generated an S5F targeting 159 model using the same data set that was used to train DeepSHM, as well as the same cross-validation 160 scheme (see Methods). Using the same test set splits as above, we found that there was an average 161 correlation of 0.66 between the predicted S5F model mutabilities and empirical mutation 162 frequency, and an average correlation of 0.50 for predicted S5F substitution scores and empirical 163 substitution rates (red dashed lines in Figure 2, Table 1). The substitution model slightly (but 164 statistically significantly) outperformed S5F for all k-mer values we analyzed. The mutation 165 frequency model achieved a modest improvement over S5F using 5-mers as an input, and this 166 difference became more evident for 9-, 15-, and 21-mers (Figure 2A, Table 1). We also similarly 167 computed 30 iterations (using different random seeds) of the best 15-mer models for both mutation 168 frequency and substitution models, and found these iterations to have significantly greater 169 accuracy than S5F both individually and in aggregate (P<1. To identify associations between mutation frequency and specific substitutions, we further 178 constructed a DeepSHM model to predict the "weighted substitution" of a k-mer, i.e., the product 179 of the percentage of each observable substitution type (e.g G>N) and the mutation frequency of 180 the k-mer (see Methods). Note that this weighted substitution metric is a vector representing the 181 four ordered DNA bases, with a "0" placed at the position that matches the middle nucleotide of 182 the k-mer. Since weighted substitution constitutes aspects of both the observable mutation 183 frequency and substitution rate of the middle nucleotide of a given k-mer, we were able to evaluate 184 DeepSHM on each metric separately. Although this model made poorer substitution rate 185 predictions on average (varying hyperparameters) than S5F (Table 1), the best model performed 186 similarly to S5F for substitution rates while, surprisingly, performing slightly better than any 187 model in predicting mutation frequency. Cross-validation in this instance produced a range of 188 average correlations between 0.50-0.52 for predicted substitution rates -a level similar to that of 189 S5F ( Figure 2B, Table 1). On the other hand, DeepSHM of weighted substitution values was 190 marginally better at predicting mutation frequency for 15-mers (correlation: 0.77) than the 191 previous standalone model that was tasked to learn mutation frequency only as well as being better 192 than S5F. (Figure 2A, Table 1). Since the weighted substitution model was able to perform at a 193 level slightly better to the standalone mutation frequency model for longer k-mers and substantially 194 better for shorter (5-mer, 9-mer), this suggested a possible association between the projected 195 substitution bias of a site and overall mutability and furthermore, that interpretability methods 196 might uncover these (see below). to make full use of the data, we merged all of the 15-mer data into one training set, and then trained 208 three new individual models (one for each output type) using the hyperparameters which 209 previously led to the best cross-validation results. The analyses we present below are derived from 210 the DeepSHM models that were trained using this merged data set. 211 We began by identifying features learned by DeepSHM that predicted weighted 212 substitutions. Since weighted substitution is a measure of both mutation frequency and substitution 213 bias, the embedding should capture both metrics simultaneously. Each point in the resulting t-SNE 214 embedding in Figure 3A represents a single 15-mer and is colored according to its corresponding 215 mutation frequency. We identified several clusters of 15-mers that are mostly grouped by similar 216 mutation rates, including those expressing high mutability. Clusters with mid to high mutation 217 frequencies are similarly within close proximity but displayed no obvious groupings other than 218 being mostly located towards the center. When we considered the middle nucleotide of each 15-219 mer, we observed that these clusters also shared the same middle nucleotide (Figure 3B), 220 suggesting that the network identified as a key feature the "0" value in the weighted substitution 221 output vector that is associated with the middle nucleotide. 222 we observed WRCT to be the most highly mutated in each case (Supplementary Figure 2). 243 Previous studies identified WRCY/RGYW (Y=C/T, R=A/G) and later WRCH/DGYW 244 (H=A/C//T, D=A/G/T) to be a better predictor of mutability at C:G bases (Rogozin and Diaz, 2004; 245 Rogozin and Kolchanov, 1992). However, we discovered some inconsistencies with these 246 definitions, as AGCC was found to be the least mutated of the AGCN motifs and WRCG was not 247 always the least mutated, on both strands. Overall, these early hotspot definitions may have been 248 too broad, and WRCT/AGYW is a more consistent predictor of AID targeting. Lastly, we also 249 noted that among the least mutated k-mer clusters, many were G-rich in their surrounding context 250 (for example, C clusters 8 and 9, A cluster 4, T cluster 21), and particularly for G (G clusters 12 251 and 14) (Supplementary Table 2

256
As a complementary way to find sequence motifs associated with mutability, we used TF-257 MoDISco (Transcription Factor Motif DIScovery), a program for identifying recurring motif 258 patterns in genomic data (see Methods) (Shrikumar et al., 2018). We applied TF-MoDISco to the 259 standalone model that predicts only mutation frequency because we reasoned that sequence 260 features related to mutability would be more easily identifiable since the model is only required to 261 learn a single task. TF-MoDISco uses importance scores, which can be derived from many 262 machine learning methods, to produce a set of unique motifs learned by the model (see Methods). 263 We began by analyzing the importance scores derived from Integrated Gradients (Sundararajan et  264 al., 2017) of 15-mers whose middle nucleotide conformed a WRC/GYW AID hotspot motif. As 265 expected, the positively contributing sites in the set of ensuing motifs aligned with the hotspot 266 motifs (Figure 4A, B). In addition, TF-MoDISco again revealed a preference for having a T base 267 at the +1 position of the WRC (WRCT, Figure 4A) and an A base at the -1 position of the GYW 268 (AGYW, Figure 4B). 269 In addition to WRCT/AGYW being a well-represented motif identified by TF-MoDISco, 270 as measured by having positive contributions to mutability (above horizontal axis on Figure 4A), 271 we also noticed many neighboring C and G bases contained negative contributions (below 272 horizontal axis on Figure 4A), most evidently at the C located at the -3 position of the WRC 273 hotspot, and the G located at the +3 position of the GYW hotspot ( Figure 4B). Here, the negative 274 contribution at the -3 position signifies that having a C at that position reduces mutational targeting 275 to the middle C. By the same token, a mutation that changes the -3 position from C will increase 276 the likelihood of the middle C subsequently being targeted. This observation supports our recently 277 published study where we observed a strong positive "mutual association" -a correlation metric 278 describing the impact of mutating one site and its effect at another site -between CC (or GG) pairs 279 distanced by two nucleotides (Krantsevich et al., 2021). In that study we were able to explain most 280 of such correlations in terms of overlapping AID and/or Polη hotspots, with the CNNC/GNNG 281 motif being one the exceptions which we suggested might be explained by AID processivity (Pham 282 et al., 2003;Storb et al., 2009). However, the TF-MoDISco analysis suggests a different 283 explanation in which the absence of a C in the -3 position might be a part of an extended AID 284 hotspot, defining CWRC as being similar to a sequential overlap motif, which we previously 285 defined (Krantsevich et al., 2021) as a motif in which an initial mutation creates a new hotspot that 286 previously did not exist. Here, although the WRC hotspot did previously exist, a mutation in the 287 first C would create a DWRC (D=A/G/T) motif, potentially with higher mutability. 288 We next sought to determine whether adding the 5' D or 3' T context of the canonical WRC 289 hotspot is more influential in terms of increasing its susceptibility to AID mutagenesis. To address 290 this, we increased the hotspot specificity step by step, starting from CWRCV (V=A/C/G) and 291 assessed the impact a single change in the motif at either the first C or V site, causing a DWRCV 292 or CWRCT intermediate hotspot to form respectively, has on mutability ( Figure 5A). We found 293 that both DWRCV and CWRCT intermediate hotspots were shown to mutate significantly more 294 than CWRCV (Figure 5B). We also discovered that the mutability of the DWRCT hotspot, which 295 contains the extended hotspot in both 5' and 3' directions, was significantly higher than both 296 intermediate hotspots (Figure 5B). Performing a pairwise comparison between the mutation 297 frequency of all 16 individual (D/C)WRC(T/V) contexts further confirmed that those containing 298 both a 5' D and 3' T were significantly more mutated than the remaining hotspot motifs, with 299 DAGCT being the most mutated ( Figure 5C, Supplementary Table 3). Additionally, the next 300 three successively mutated hotspots followed a CWRCT context, overall suggesting the 3' T to be 301 more impactful to AID recognition than the 5' D, but the addition of both substantially increases 302 targeting in human V regions. 303

313
In addition, another secondary motif that unexpectedly emerged from the TF-MoDISco 314 analysis of WRC/GYW 15-mers did not contain a positively contributing C nucleotide; rather it 315 conformed to a TA Polη hotspot (Figure 4A, bottom). Having a TA hotspot appear while 316 specifically analyzing only 15-mers containing WRC hotspots reveals the importance of attracting 317 Polη to these areas. This finding is consistent with our previous analysis highlighting the 318 importance of co-localization of AGCT overlapping AID hotspots and Polη hotspots within the 319 CDRs Wei et al., 2015). 320 The TA motif also emerged when we applied TF-MoDISco to all 15-mers conforming to 321 either a WA (Supplementary Figure 3A) or TW Polη hotspot (Supplementary Figure 3B). In 322 addition to our model identifying both the TA and AA hotspot motifs as important, it also identified 323 a TAT/ATA motif as a special case for both strands. Further analysis showed that WAT/ATW 324 hotspots mutate significantly more than their WAV/BTW counterparts (Figure 6). Thus, while TA 325 hotspots consistently have higher mutability than AA, the presence of a 3' T individually increases 326 the mutability of each of these Polη hotspots. 327 We next applied the same t-SNE methodology to the DeepSHM model that predicted only 336 mutation frequency. We found that the organization of the subsequent embedding followed a 337 direction of descending mutation frequency, with the highest mutating 15-mers located at the mid-338 to upper-right portion of the plot (Figure 7A). A cluster of low-mutating 15-mers was also isolated 339 in the upper-left (Figure 7A), which was enriched with ~76% of FW1 15-mers ( Figure 7B).   Table 2) where we observed several clusters with G-rich k-mers and low 352 mutation frequencies. If we further separate the mutation frequencies into categories defined by 353 the middle nucleotide, we find that G content has a consistent negative correlation regardless 354 (column G of Supplementary Figure 4B).

More generally, A and T richness (columns A and T 355
of Supplementary Figure 4B) shows a consistent positive or sometimes non-significant 356 correlation, whereas C and G richness shows a consistent negative (or non-significant) correlation. 357 In summary, it appears that low-mutating sites generally have a high local GC (and particularly G) 358 content, and conversely, that highly targeted sites display an elevated local AT (particularly A) 359 content. 360  The resulting t-SNE embedding from this model identified four main clusters, as well as two much 377 smaller satellite clusters, with each cluster containing 15-mers that share a common middle 378 nucleotide ( Figure 8A). A distinction between 15-mers with high and low mutation frequencies 379 could also be observed based on their location on opposite ends of the cluster, especially for 380 clusters containing either a C or G middle nucleotide, with high-mutating 15-mers typically located 381 on the side closest to the center ( Figure 8B). Since the model was tasked with learning the 382 distributed substitution rates of each 15-mer, we next sought to evaluate the embedding by the rate 383 of each individual substitution type (e.g. C>T). In certain clusters, a similar gradient of high to low 384 substitution rates could also be seen as we observed for mutation frequency (Figure 8C-F). For 385 instance, we noticed the rate of G>T substitutions increasing from the side nearest to the origin 386 towards the outer boundaries of the cluster (top-right cluster in Figure 8F), which was associated 387 with a shift towards decreasing mutation frequencies in the same cluster while proceeding in the 388 same direction (Figure 8B). To evaluate this trend more closely, we analyzed three human IGHV 389 genes from different families for which we had the most data (IGHV1-18, IGHV3-23, IGHV4-390 34), so as to include sites with low mutation frequencies at high coverage, and calculated the 391 correlation between mutation frequency and rate of substitution for each substitution type. As an 392 example, for IGHV3-23 we found the most significant negative correlations to be at C>A

405
In the t-SNE analysis of the substitution model, we also discovered two small clusters of 406 15-mers containing a C and T as their middle nucleotide (Figure 8A) that did not group with their 407 respective larger clusters, suggesting that these particular sites might have distinct substitution 408 patterns. Generating the consensus sequence of the outlier T cluster revealed a partially conserved 409 AGYCTGGGGG sequence (Figure 9A). When we examined these subsequences more closely, 410 we discovered that they were located only in IGHV3 family genes at either position 21 or position 411 45 according to the IMGT unique numbering system (Lefranc, 2001) (Supplementary Figure 6). 412 The motif was also surprisingly common. At position 21 it appeared in 37 different alleles (across 413 19 genes) and was fully conserved in all alleles. Coincidentally, the motif also appeared in 37 414 different alleles (across 18 genes) at position 45, although it differed slightly at the +3 and +4 415 positions ( Figure 9A). These two sets of alleles only partially overlap, such that 15 alleles had the 416 motif at both positions 21 and 45. Thus, this specific motif in FW1 of the IGHV3 family genes 417 appears to be highly conserved evolutionarily, suggesting a possible functional role. The rates of 418 substitution at these sites were also found to be highly biased towards creating T>G mutations, 419 with an average T>G rate of about 0.66 at position 21, and an even greater rate of 0.89 at position 420 45 (background rate: 0.28 ± 0.23) ( Figure 9B, Table 2). A previous study using Sanger sequencing 421 data that was limited to IGHV3-23 and the pseudogene IGHV3-h had noted similarly high T>G 422 substitution rates at positions 21 (for IGHV3-h) and 45 (for IGHV3-23) (Ohm-Laursen and 423 Barington, 2007). Although the T subjected to mutation at both positions did not conform to a 424 bottom strand TW Polη hotspot, these genes at position 45 displayed a relatively high average 425 mutation frequency of 0.17 ± 0.08 (Table 2), which is somewhat unusual given that mutations are 426 generally more biased towards the CDRs than FW regions (Cohen et al., 2011;Shapiro et al., 427 2002), and that we reported above that many sites within FW1 tended to display low mutability 428 ( Figure 7A, B). 429

436
While examining the C outlier cluster (Figure 8A), we found the consensus sequence to 437 be more diverse compared to the outlier with a middle T (Supplementary Figure 7A). The 438 sequence variation seen here was partly due to the fact that the 15-mers that constituted this cluster 439 belonged to many other IGHV families besides IGHV3 and across different sub-regions of the IgV 440 (Supplementary Figure 6). On the other hand, we noticed some overlap between both outlier 441 clusters since, in some cases, the C corresponded to positions 20 and 44 that preceded the middle 442 T of the other outlier cluster (Supplementary Figure 7A, Table 2). We further found these sites 443 to have a similar elevated C>G substitution rate (mean rate of 0.62 compared to background mean 444 of 0.33, P<2.2×10 -16 ) (Supplementary Figure 7B, Table 2), suggesting the model distinguished 445 sites with a general preference to create G mutations. 446 Given that the sites with strikingly high T>G and C>G substitution rates we identified here 447 are in adjoining G-rich sub-regions ( Figure 9A, Supplementary Figure 7A), we evaluated the 448 possible influence these mutations might have on the formation of G-quadruplex (G4) structures. 449 In a recent study, we assessed the potential for DNA G4 structures to form in the IgV region, using 450 a pre-trained deep learning model that computes the G4 potential of a linear DNA sequence (Tang 451 and MacCarthy, 2021). There we found that the IGHV3 family had the highest propensity to form 452 stable G4s in the top strand. We now sought to assess the overall mutational effect on G4 assembly 453 of the IGHV3 sites that are biased towards G. Following the methodology of our previous study, 454 we calculated the difference between the predicted G4 potential of the germline with that of the 455 sequence with a single mutation at either position 21 or 45. Here, we found that a T>G mutation 456 at position 21 elevated average G4 potentials to a very high value of 0.84 ± 0.10 compared to a 457 germline value (already relatively high) of 0.54 ± 0.19 , whereas the same mutation occurring at 458 position 45 displayed a far smaller average increase of 0.05 ± 0.04 ( Figure 9C, Table 2). As for 459 the remaining cases, there seemed to be little effect of C>G mutations on G4 potential (  Table 2. Summary statistics on outlier C and T clusters from Figure 8A. 463 nucleotides at the +3 or +4 positions ( Figure 9A) which was that these sites also displayed 464 high A>G substitution rates (0.79 ± 0.11; Table 2, positions 48 and 49). This hypothetical mutation 465 also caused a moderate, though substantial, increase in G4 potential (0.17 ± 0.02, In this study, we leveraged deep learning to gain novel insights into SHM, a key process 474 in antibody affinity maturation. We trained multiple deep learning models using a convolutional 475 neural network (CNN) framework to analyze DNA k-mer subsequences of various lengths, ranging 476 from 5 to 21 nts, derived from human IGHV germline sequences. Using a high-quality data set 477 containing non-productive B cell repertoire data, the model was tasked to learn two focal aspects 478 of SHM: the frequency of mutation at a given site, and the spectrum of mutations that can arise at 479 this site (substitution). Understanding the propensity of a site to mutate and the underlying 480 substitution biases that ensue can lead to a better understanding of how AID is recruited to and 481 targets the Ig V region, as well as the associated downstream DNA repair mechanisms that follow 482 AID deamination. 483 We began by developing three models, collectively referred to as DeepSHM, to predict 484 separate tasks for a given k-mer: observable mutation frequency; distributed substitution rates; and 485 a combination of both measures (weighted substitution). We found that predicting substitution 486 rates did not substantially depend on the k-mer size, while 15-mers were optimal for predicting 487 mutation frequencies (Figure 2, Table 1). Additionally, DeepSHM predicted both substitution 488 rates and mutation frequencies more accurately than the widely used S5F targeting model for all 489 k-mer sizes we evaluated (k = 5, 9, 15 and 21) ( Table 1). Even though we were able to outperform 490 S5F in representing substitution biases, the correlation between our predictions and empirical data 491 was moderate (~0.55), suggesting that the processes underlying SHM substitution biases may be 492 more fundamentally random than mutational site targeting alone. Error-prone DNA repair 493 processes downstream of AID are highly complex. For example, while Polη is biased towards 494 making WA>WG mutations (Zhang et al., 2014) and plays a dominant role in generating mutations 495 at A:T sites, many A:T mutations still occur in its absence (Saribasak et al., 2009) that are mediated 496 by other polymerases (Maul et al., 2016). Similarly complex, BER is biased towards transversions 497 but can also repair faithfully, with a further dependence on hotspot mutability (Pérez-Durán et al., 498 2012). Thus, downstream repair processes may simply be too complex, or genuinely random, to 499 be captured well by a model that depends on sequence context alone. 500 In order to uncover some of the hidden features learned by DeepSHM, we analyzed the 501 output, or encodings, obtained from the penultimate layer of the network predicting weighted 502 substitution using input 15-mers, and performed t-SNE, a method of dimensionality reduction, to 503 visualize the encodings in two dimensions. The subsequent embedding formed clusters of 15-mers 504 that were distinguished by mutation frequency and middle nucleotide (Figure 3A, B). Individual 505 clusters containing a C or G middle nucleotide that were associated with high mutability, assumed 506 to be relevant to AID hotspots, revealed a strong preference for a T base at the +1 position of the 507 top strand AID WRC (W=A/T, R=A/G) hotspot, including for WAC motifs that are not part of a 508 WGCW motif, and similarly, an A base at the -1 position of the bottom strand GYW (Y=C/T) 509 context (Figure 3C, D). As an alternative way to identify sequence features, we applied TF-510 MoDISco (see Methods) to reveal recurrent genomic patterns using importance scores extracted 511 from the model for each 15-mer. This approach confirmed the importance of the T base at the +1 512 position of WRC ( Figure 4A) and the A base at the -1 position of the bottom strand GYW hotspot 513 ( Figure 4B). An early study by Rogozin and Diaz reported the WRCH/DGYW (H=A/C/T, 514 D=A/G/T) to be a good predictor of mutability at C:G bases (Rogozin and Diaz, 2004), but we 515 found WRCT to be a more consistent definition. The authors of the S5F model also supported the 516 WRCH definition since they found their model can capture the higher mutability rate seen at 517 certain WRCA motifs (Yaari et al., 2013), presumably at the AGCA overlapping hotspot. 518 However, previous hotspot definitions have largely failed to describe targeting beyond the -2 519 position of the WRC motif. We further identified having a C at the -3 position of WRC or a G at 520 the +3 position of GYW as a strong negative contribution, i.e., as a reduced effect on targeting. 521 Thus, our results suggest the typical AID hotspot definition might be extended to DWRCT 522 (D=A/G/T). Comparing the mutation frequencies of the individual DWRCT hotspot motifs 523 showed the 3' T to be more important for AID recognition than the 5' D alone, however, together 524 they have a synergistic effect that makes mutability between 1.8-fold (for TAC) and 4.7-fold (for 525 TGC) higher ( Figure 5C, Supplementary Table 3). 526 We next applied the same t-SNE methodology on the two developed standalone models 527 that separately predicted either the mutation frequency or substitution rates of the 15-mer middle 528 nucleotides. The t-SNE embedding on the independent DeepSHM model predicting only mutation 529 frequency revealed a significant negative correlation between the mutability of a site and the 530 surrounding GC content of the 15-mer (Figure 7D). This finding alternatively suggests that highly 531 mutated sites may have evolved to have a richer local AT content. This in vivo result is consistent 532 with earlier in vitro results that considered AID targeting on artificial substrates (Abdouni et al., 533 2018). 534 On the other hand, the t-SNE embedding stemming from the standalone substitution model 535 hinted at plausible associations, both positive and negative, between mutation frequency and 536 certain transition and transversion mutations (Figure 8B-F). We next analyzed multiple genes 537 representing different IGHV families containing the largest amounts of mutation data in order to 538 avoid any potential sites with few observable mutations, such as coldspots. We observed a negative 539 correlation between mutability and substitution rates specifically for C>A and G>T transversion 540 mutations (Figure 8, Supplementary Figure 6) and, on the other hand, positive correlations for 541 C>T and G>A transitions (Supplementary Figure 6). The trend for increased transition mutations 542 at highly mutating AID hotspots mediated by UNG2 had previously been observed in experiments 543 using 3T3 (mouse fibroblast) cells (Pérez-Durán et al., 2012), although the particular bias against 544 C:G>A:T transversions was not apparent. Previous work has also shown that UNG2 is cell-cycle 545 regulated, possibly mediated by FAM72A (Feng et al., 2020), and active primarily during G1 546 (Sharbeen et al., 2012). Although AID is also primarily active during G1, it may sometimes persist 547 for slightly longer than UNG2 and thus highly targeted sites may avoid BER especially when the 548 mutations occur just before the cell enters S phase, which would lead to fixation of C>T transitions 549 via replication bypass. Alternative polymerases may also be preferentially recruited to some sites. 550 For example, in DT40 (chicken) B-cell lines, the POLD3 subunit of Polymerase delta (Polδ) has 551 been proposed as a specific mechanism for both C>A and G>T mutations (Hirota et al., 2015;552 Pilzecker and Jacobs, 2019). 553 Additionally, we investigated two outlier clusters from the substitution model embedding 554 that contained 15-mers having a C and T middle nucleotide that did not group with their respective 555 larger clusters (Figure 8A). A closer analysis revealed that the T outlier contained a highly 556 conserved AGYCTGGGGG consensus sequence that was derived from two independent sites 557 located in FW1 from multiple IGHV3 alleles ( Figure 9A, Table 2). Both outlier clusters also 558 displayed significantly elevated T>G ( Figure 9B, Table 2) and C>G substitution rates 559 (Supplementary Figure 7, Table 2) respectively. In our recent study on G-quadruplexes (G4s) 560 in IGHV genes, we observed the IGHV3 family to form G4s more favorably on the top strand, as 561 measured by their predicted G4 potential using a pre-trained CNN model (Tang and MacCarthy, 562 2021). Given the strong preference for creating G mutations in these FW1 sub-regions, we 563 evaluated the impact of these mutations on G4 potential. In some cases, the resulting G mutation 564 led to a strong increase in G4 potential, particularly at position 21 ( Figure 9C, Table 2), whereas 565 for other sites, the effect was mostly negligible ( Table 2). Notably however, a high A>G 566 substitution rate was also observed at the +3 or +4 positions (Figure 9A), which were also 567 associated with increase in G4 potential ( Table 2). These biased A>G mutations may further be 568 related to previous work that found that a repeated mutation that occurs in one IGHV allele often 569 matches the sequence variant of a different allele (Saini and Hershberg, 2015). Alternatively, these 570 mutations may be related to R-loop initiation, which forms in G-rich non-template DNA, possibly 571 forming in FW1 of these IGHV3 genes. Studies have found that reducing G-density in mammalian 572 Ig Switch regions compromises class-switch recombination efficiency and R-loops from forming 573 (Roy et al., 2008;Zhang et al., 2014). The high rate of T>G and C>G transversions also suggests 574 that particular repair enzymes may be recruited to these sub-regions during SHM. 575 576 Limitations of the study 577 578 In principle, a wider range of k-mers, as well as a greater variety of neural network architectures, 579 might have been considered for this study. However, since the tuning of each model takes a 580 substantial amount of computational resources and time, we considered a reduced number of 581 models. Additionally, we limited this study to consider data only for human, the species for 582 which we had high quality (UMI barcoded) data in high abundance, although the approach could 583 be extended to other species such as mouse in future work. 584 585 Acknowledgements 586 We would like to extend our gratitude to Sergio Roa Gómez for his comments and suggestions. 587