Selection analysis identifies unusual clustered mutational changes in Omicron lineage BA.1 that likely impact Spike function

Among the 30 non-synonymous nucleotide substitutions in the Omicron S-gene are 13 that have only rarely been seen in other SARS-CoV-2 sequences. These mutations cluster within three functionally important regions of the S-gene at sites that will likely impact (i) interactions between subunits of the Spike trimer and the predisposition of subunits to shift from down to up configurations, (ii) interactions of Spike with ACE2 receptors, and (iii) the priming of Spike for membrane fusion. We show here that, based on both the rarity of these 13 mutations in intrapatient sequencing reads and patterns of selection at the codon sites where the mutations occur in SARS-CoV-2 and related sarbecoviruses, prior to the emergence of Omicron the mutations would have been predicted to decrease the fitness of any genomes within which they occurred. We further propose that the mutations in each of the three clusters therefore cooperatively interact to both mitigate their individual fitness costs, and adaptively alter the function of Spike. Given the evident epidemic growth advantages of Omicron over all previously known SARS-CoV-2 lineages, it is crucial to determine both how such complex and highly adaptive mutation constellations were assembled within the Omicron S-gene, and why, despite unprecedented global genomic surveillance efforts, the early stages of this assembly process went completely undetected.


Introduction
The Omicron (B.1.1.529) SARS-CoV-2 variant of concern (VOC) identified in Southern Africa in late November 2021 1 is the product of extensive evolution within an infection context that has so far yielded at least three genetically distinct viral lineages (BA.1, BA. 2  (2) long-term evolution in one or more chronically infected people -similar to the proposed origin of lineages such as Alpha and C. 1

.2 2 3 4 -may have left intermediate forms unsampled within
one or a few individual(s); and (3) reverse zoonosis to a non-human host, followed by undetected spread and diversification therein prior to spillover of some sub-lineages back into humans 5 . At present there is no strong evidence to support or reject any of these hypotheses on the origin of Omicron, but as new data are collected, its origin may be more precisely identified.
Regardless of the route that Omicron took to eventual community transmission, the genome of the BA.1 lineage that caused surges of infections globally in late 2021 and early 2022, accumulated 53 mutations relative to the Wuhan-Hu-1 reference strain, with 30 nonsynonymous substitutions in the Spike-encoding S-gene alone ( Figure 1). Here, we characterize the selective pressures that may have acted during the genesis of the BA.1 lineage and curate available data on the likely adaptive value of the BA.1 S-gene mutations. We were particularly interested in identifying BA.1 S-gene codon sites displaying evolutionary patterns that differed from those of other SARS-CoV-2 lineages (including variation of SARS-CoV-2 in individual hosts), and closely related non-human sarbecoviruses. We use these comparisons to identify which BA.1 S-gene mutations might contribute to recently discovered shifts relative to other SARS-CoV-2 variants in the way that BA.1 interacts with human and animal ACE2 receptors and is primed by cellular proteases to mediate cellular entry [6][7][8][9][10] . Our analysis identifies three clustered sets of mutations in the Spike protein, involving amino acids substitutions at 13 sites previously highly conserved across other SARS-CoV-2 lineages and other sarbecoviruses. The dramatic about-face in evolutionary dynamics at the 13 codon sites encoding these amino acids indicates that the mutations at these sites in BA.1 are likely interacting with one another, that the combined effects of these interactions are likely adaptive, and that these adaptations likely underlie at least some of the recently discovered shifts in BA.1 Spike function. selection when considering all SARS-CoV-2 genomic data prior to the discovery of Omicron (Table 1, Figure 2, https://observablehq.com/@spond/selection-profile). For context, this fraction of positively selected sites (0.53) is approximately four times higher than the fraction of all SARS-CoV-2 S-gene sites that have ever shown any signals of positive selection (0.14).
The observed substitutions at four of these sixteen sites (K417N 16 , N501Y [17][18][19] , H655Y, P681H 20 ) and a two-nucleotide deletion at one additional site (Δ69-70 21 ) are among the nineteen "501Y meta-signature" Spike mutations that are likely highly adaptive within the context of 501Y lineage viruses such as the Alpha, Beta and Gamma VOCs 15 . Given that the BA.1 mutations at these sites converge on those seen in these other VOCs, they are likely to be adaptive in BA.1 lineage viruses as well (sites coloured red in Figure 3).
A further four BA.1 S-gene mutations are found in SARS-CoV-2 sequences belonging to other VOC lineages, and are either VOC lineage defining mutations (majority mutations), or are lower frequency mutations that have increased in frequency >2 fold between early and late VOC lineage circulation periods within sampled sequences belonging to these lineages (A67V in Alpha and Beta, T95I in Beta and Gamma, T478K in Beta, and N679K in Gamma; https://observablehq.com/@spond/sc2-selection-trends): an indication that these mutations too are likely adaptive in BA.1 lineage viruses (Table 1). Additionally, three other BA.1 Sgene mutations either: (1) occur at the same codon sites as Alpha, Beta, Gamma or Delta lineage defining mutations but encode a different amino acid than these other lineages (E484A in BA.1 and E484K in Beta and Gamma); or (2) occur at the same codon sites as mutations in VOC lineages that increased in frequency > 2 fold between early and late VOC lineage circulation periods but encode a different amino acid than these other lineages (N440K in BA.1 and N440S in Alpha; S477N in BA.1 and S477I in Beta and Gamma).
Lastly, the S/D796Y mutation occurs at one of the four sites identified as potential locations of adaptation in human beta-coronaviruses via the analysis of convergent evolutionary patterns and functional impact (Table 1) 11 and a mutation at this site has previously been inferred to be potentially adaptive within the context of a chronic SARS-CoV-2 infection 22 . All of these mutations likely have a substantial impact on the phenotype of BA.1 lineage viruses (coloured orange in Figure 3). (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022. Figure 2. Selection signals that were evident at BA.1 amino acid change sites in other SARS-CoV-2 lineages prior to the emergence of Omicron. All SARS-CoV-2 near full-length genome sequences present in GISAID 12 on 21 November 2021 that passed various quality control checks were split up into three month sampling windows and analysed using the FEL method restricted to internal tree branches 13 implemented in Hyphy 2.5 14 . This method was also used in 15 . Red circles show sites under positive selection (selection favouring changes at amino acid states encoded at these sites). Blue circles show sites under negative selection (selection against non-synonymous changes). When no circle is shown, the corresponding site offered no statistical evidence for nonneutral evolution at a given time point. The areas of circles indicate the statistical strength of the selection signal (and not the actual strength of selection) within sequences sampled in the three months preceding the 1st day of the indicated months. Note that none of these analyses included any Omicron sequences, hence selection signals are derived solely from other SARS-CoV-2 lineages.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made  In this rendering of the trimer, one subunit is shown in the "up" or "open" configuration while interacting with human ACE2 23 . The other two subunits are in the "down" or "closed" configurations. Amino acids are color coded according to their likely contribution to viral adaptation in a Wuhan-Hu-1-like genetic background based on (1) patterns of synonymous and nonsynonymous substitutions at the codons encoding these amino acids in non-Omicron sequences, (2) patterns of mutational convergence between viruses in different VOCs and (3) increases in the frequency over time of VOC sub-lineages encoding amino acids that match those found in BA.1. NTD = N-terminal domain, RBD = Receptor binding domain; RBM = receptor binding motif. Locations of sites in the three clusters of BA.1 mutations that are rarely seen and fall at either negatively selected (dark blue) and neutrally evolving (light blue) sites. An interactive version of this figure can be found here: https://observablehq.com/@stephenshank/sars-cov-2-ace2-protein-interaction-and-evolution-for-omicr The mutations occurring at the 14 BA.1 Spike codons which display either evidence of negative selection or no evidence of selection (neutral evolution), have rarely been seen within previously sampled sequences (bold rows in Table  1; https://observablehq.com/@spond/omicron-mutations-tables) indicating the action of strong purifying selection due to functional constraints. Despite the rarity of these mutations in assembled genomes, it is not uncommon to find them in within-patient sequence datasets ( Figure 4), often at sub-consensus allelic frequencies. This indicates that, with the possible exceptions of S/S371L, S/N764K, S/N856K and S/Q954H, the mutations at these sites are not rare simply because they are unlikely to occur (note the sizes and numbers of dots corresponding to these mutations in Figure 4), but rather because whenever they do occur they are unlikely to either increase sufficiently in frequency to be transmitted (note the predominantly light orange/yellow colours of the dots corresponding to these mutations in Figure 4), or increase sufficiently in frequency among transmitting viruses to be detected by genomic surveillance.
On their own, none of these 14 BA.1 mutations at codon sites that have previously been evolving either neutrally or under negative selection prior to November 2021 would be expected to provide SARS-CoV-2 with any selective advantage. If the BA.1 mutations observed at the ten negatively selected S-gene codon sites had occurred in the Wuhan-Hu-1 sequence, it is very likely that they would have been selected against. Specifically, since the start of the pandemic Spike proteins tended to function best whenever they had amino acids at these ten sites that were the same as those in the Spike encoded by the Wuhan-Hu-1 sequence.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022.  24 . The areas of the circles indicate the proportions of raw sequence datasets (per 1,000 samples) where a mutation away from the Wuhan-Hu-1 consensus sequence was called. The colour of the circle indicates the median intrapatient allele frequency (AF) in datasets for which each mutation was detected. Mutations occurring at lower AFs are only present in a subpopulation of viruses in a particular host. The data has been generated by calling variants from read-level data of 230,506 samples from COG-UK, Estonia, Greece, Ireland, and South Africa: PRJEB37886, PRJEB42961 (and multiple other bioprojects with the study title: Whole genome sequencing of SARS-CoV-2 from Covid-19 patients from Estonia), PRJEB44141, PRJEB40277 and PRJNA636748. Note that S371L is the result of two nucleotide substitutions in codon S/371 and was never detected in intrapatient samples. S371F represents an intermediate mutation between the Wuhan-Hu-1 state and that of BA.1 and is presented here for completeness. . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022. ; https://doi.org/10.1101/2022.01.14.476382 doi: bioRxiv preprint It is clear that the amino acids encoded by 13 of the 14 mutated codon sites in the BA.1 Sgene that either show evidence of negative selection or no evidence of any selection, cluster within three regions of the Spike three dimensional structure (dark blue sites in Figure 3): 1. Cluster region 1 in the RBD (green sites in Figure 5): codons/amino acids S/339, S/371, S/373 and S/375; may be targeted by some class 4 neutralizing antibodies 25 . S/371L alone impacts, but probably does not provide escape from, binding of some antibodies in all four neutralizing antibody classes 26  mutations are likely to enhance interactions between the S1 and S2 subunits of the BA.1 Spike and are likely to contribute to reduced S1 shedding following proteolytic cleavage of the polybasic S1/S2 site 10,31 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022. ; (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022. ; https://doi.org/10.1101/2022.01.14.476382 doi: bioRxiv preprint To determine whether patterns of selection at the Omicron/BA.1-specific sites are broadly consistent with those occurring in the horseshoe bat-infecting SARS-related coronaviruses, in the Sarbecovirus subgenus to which SARS-CoV-2 belongs, we examined patterns of synonymous and non-synonymous substitutions in 167 publicly available Sarbecovirus genomes. Accounting for recombination, we tested for selection signatures at all 44 codons encoding amino acids that differ between Wuhan-Hu-1 and BA.1 (https://observablehq.com/@spond/ncos-evolution-nov-2021). We specifically focused the analyses on selection signals in the subset of sarbecoviruses that are more closely related to SARS-CoV-2 in each recombination-free part of their genome: a group of sequences we refer to as the nCoV clade 32 . Depending on the recombination-free genome region being considered, this clade was represented by between 15 and 27 sequences. We refer to the remaining sarbecoviruses as the non-nCoV sequences.
Of the 44 codon sites considered, 26 are detectably evolving under negative selection (FEL p-value <0.05; 14 13 ) and one (S/417) under positive selection (MEME p-value <0.05; 33 ) in the nCoV clade. This positive selection signal at S/417 reflects an encoded amino acid change from an ancestral V that is present in all background sequences, to a K that is specific to the nCoV clade. A K is also encoded at this site in Wuhan-Hu-1 but has since changed multiple times in various SARS-CoV-2 lineages: for example, to an N during the genesis of lineages such as Omicron and Beta and to a T during the genesis of the Gamma lineage.
We were, however, particularly interested in whether the cluster 1, 2 and 3 mutation sites in . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022. ; https://doi.org/10.1101/2022.01.14.476382 doi: bioRxiv preprint

What can the sarbecoviruses tell us about the biological consequences of the rarely seen BA.1 mutations?
Despite the observation that, even among sarbecoviruses, BA.1 mutations seen in cluster regions 1, 2 and 3 are only rarely seen, the instances where they do occur might be illuminating. For example, among the bat-infecting sarbecoviruses, the BA.1 S/G339D substitution (in cluster region 1) has primarily to date been found among the bat-infecting viruses within a clade ( Figure 6) that does not use ACE2 as a cell entry receptor 34 . The change in receptor binding function in these viruses is, however, most likely due to two RBM deletions that are also specific to this clade. As with SARS-CoV-2, the amino acids encoded at cluster region 2 sites (all of which fall within the RBM) vary substantially between different sarbecoviruses but without any associated signals of positive selection at these sites within the nCoV clade. Notably, the same BA.1 encoded amino acids at codon S/493R and S/505H also co-occur in a clade of sarbecoviruses that are closely related to SARS-CoV (virus accessions: KY417144, OK017858, KY417146, OK017852, OK017855, OK017853, OK017854, OK017856, OK017857); although S/493R (AY613951 and AY613948) and S/505H (MN996532, LC556375) can also occur independently. Besides the various Omicron sublineages, S/493R and S/505H are not found as a pair in any SARS-CoV-2 sequences. These mutations occurring along the same branch of the sarbecovirus tree ( Figure 6) suggests that, rather than favouring changes at the sites individually, selection may favour simultaneous changes to S/493R and S/505H due to these residues together having a greater combined fitness benefit than the sum of their individual effects: a type of interaction between genome sites referred to as positive epistasis.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022. ; https://doi.org/10.1101/2022.01.14.476382 doi: bioRxiv preprint The region 3 cluster sites are conserved across the sarbecoviruses with almost all known viruses having the same residues at these sites as the Wuhan-Hu-1 SARS-CoV-2 strain.
This supports the hypothesis that, when considered individually, the mutations seen at these fusion domain sites in BA.1 are likely to be maladaptive.

BA.1 mutations at neutral or negatively selected S-gene sites might only be adaptive when they co-occur
Given both the apparent selective constraints on mutations arising at the cluster region 1, 2 and 3 sites in SARS-CoV-2 and other sarbecoviruses, and the rarity of observed mutations at these sites among the millions of assembled SARS-CoV-2 genomes (despite evidence that individually such mutations do regularly occur during within-host evolution; Figure 4), it is very likely that BA.1 mutations at cluster region, 1, 2 and 3 sites are maladaptive when present on their own. Nevertheless, the presence of mutations at these sites in BA.1, a . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022. ; https://doi.org/10.1101/2022.01.14.476382 doi: bioRxiv preprint lineage of viruses that is clearly highly adapted, suggests that these mutations might interact with one another such that, when present together, they become adaptive. Therefore while individually the mutations might decrease the fitness of any genome in which they occur, collectively they might compensate for one-another's deficits to yield a fitter virus genotype.
Positive epistasis of this type has, in fact, already been demonstrated between the cluster 2 mutation, S/Q498R, and the pivotal mutation of the 501Y SARS-CoV-2 lineages, S/N501Y. If mutations in the three cluster regions do epistatically interact with one another, then one might expect that selection would favour their co-occurrence either within individual SARS-CoV-2 genome sequences that have so far been sampled, or as minor variants within unassembled intrapatient sequence data. We failed to detect such associations in any systematic manner (Figure 7). While there are individual pairs of BA.1 mutations that cooccur more frequently than expected by chance (e.g. 440K in the presence of T95I), they do not involve cluster 1, 2, and 3 mutations. Furthermore, many of the BA.1 mutation pairs occur together less frequently than expected by chance (e.g. 478K and 501Y). Rather than reflecting an absence of epistasis between the cluster 1, 2, and 3 mutation sites our failure . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022. ; https://doi.org/10.1101/2022.01.14.476382 doi: bioRxiv preprint to detect the co-occurrence of Omicron mutation pairs at these sites simply reflects the rarity of these mutations within both assembled SARS-CoV-2 genome sequences and raw intrapatient sequence datasets (Figure 4).

Figure 7.
Patterns of co-occurrence of BA.1 amino-acid residues in circulating SARS-CoV-2 S-gene haplotypes from other lineages (data up to October 15, 2021). Only mutations occurring in at least 10 haplotypes are shown. All sequences having exactly the same S-gene sequence count as a single unique haplotype; instead of counting raw sequence numbers, this approach focuses on the number of unique genetic backgrounds in which pairs of codons co-occur. Circles show odds ratios for finding the mutation on the X axis when the mutation on the Y axis is also present (vs when it is not present). Red circles depict OR > 1, while blue circles 1/OR for OR < 1. Black circles on the right show the fraction of globally sampled SARS-CoV-2 S-gene haplotypes which carry the corresponding mutation.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022.  The detected coevolution between these site pairs supports the hypothesis that at least some mutations within each of the three cluster regions are epistatically interacting with one another, and, therefore, that the combined fitness impacts of the mutations in each of the cluster regions are likely more positive than the sum of the individual impacts of each mutation alone.

How might mutations in the three cluster regions impact spike function?
Whether or not epistasis is restricted to a few site-pairs within the three cluster regions or is extensively operating between mutations within and/or between these regions, the amino acid changes caused by these mutations likely represent a substantial remodelling of two  (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022. ; sites that differ between the BA.1 lineage viruses and Wuhan-Hu-1. The line plot shows antibody binding escape measured by deep mutational scanning of the Wuhan-Hu-1 RBD 16 , averaged across 36 monoclonal antibodies (8 class 1, 13 class 2, 7 class 3, and 8 class 4 antibodies). Sites that are mutated in the BA.1 relative to Wuhan-Hu-1 are indicated and colored according to the predicted antigenic effect of mutations at that site (strong, moderate, or minimal). An interactive version of this plot is available at https://jbloomlab.github.io/SARS2_RBD_Ab_escape_maps/.

How and why have so many apparently maladaptive mutations been assembled within Omicron?
Given the manifest viability of BA.1 and the other Omicron sub-lineages there is a pressing need to understand how and why they accumulated so many mutations that, on their own at least, are apparently either selectively neutral or maladaptive. The genetic distance between the Omicron sublineages and their nearest known SARS-CoV-2 relatives implies that the The Omicron progenitor could have spent this period of intensive or prolonged evolution in a region that carries out minimal genomic surveillance and/or where access to, or utilization of, health care resources is low (the surveillance failure hypothesis). Alternatively, this viral evolution could have taken place within a long-term infection (or possibly serial long-term infections; the chronic infection hypothesis), or during spread within a non-human host population (the reverse-zoonosis hypothesis). Combinations of these evolutionary modes are also a possibility. We will only be able to distinguish between these hypotheses with more data.
Currently, the simple existence of three distinct Omicron lineages best supports the surveillance failure hypothesis at least for the latter stages of Omicron evolution following the divergence of the BA.1, BA.2 and BA.3 lineages from their most recent common ancestor. However, if similarly divergent SARS-CoV-2 variants are discovered in either longterm human infections or in other animal species, these would support the other hypotheses.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022. ; https://doi.org/10.1101/2022.01.14.476382 doi: bioRxiv preprint Relative to evolution during normal SARS-CoV-2 person-to-person transmission, evolution within the context of either long-term infections or an alternative animal host could potentially have occurred at an accelerated pace 22,42 . In the context of either chronic infections of immunosuppressed individuals 4,22,29 , or animals that naturally sustain long-term SARS-CoV-2 infections (such as may be the case for white tailed deer given the extraordinarily high frequencies of ongoing SARS-CoV-2 infections discovered in these 43,44 ), purifying selection may have been relaxed somewhat relative to that occurring during normal human-to-human transmission: enough so for genomes carrying suboptimal combinations of epistatically interacting mutations to remain viable while fitter combinations were discovered via additional mutations and genetic recombination. In addition, chronic

It remains unclear whether mutations in cluster regions 1, 2 and 3 are showing signs of reversion
Whatever the process that yielded the three clusters of rarely seen mutations in the Omicron progenitor, now that it is being transmitted among people, any deleterious immune evasion mutations it has accumulated might be substantially less tolerable. Likewise, some of the . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022. We found no molecular evidence for negative selection at any of these sites. At all sites, the vast majority of changes, measured either as fractions in all consensus genomes, or substitutions along internal branches of the phylogenetic tree of representative sequences, involve reversions to Wuhan-Hu-1 amino-acid states. At all sites, a fraction of sampled genomes have missing data (fully or partially unresolved nucleotides; Table 2). For key sites in RBD, this fraction is very high and, crucially, there is a strong correlation (R 2 = 0.773) between the percentage missing data at a site and the number of reversion mutations inferred at that site ( Figure 10). When multiplexing multiple samples in single sequencing runs, it is likely that known primer dropout issues for BA.1 sequences 48 can result in the amplification of environmental SARS-CoV-2 nucleic acid templates (e.g. from Delta lineages) that contaminate sample preparation laboratories and sequencing devices. When sequence reads derived from these contaminating templates are amplified to a similar degree to (or a greater degree than) BA.1 templates for a given region and are then used to . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022. ; https://doi.org/10.1101/2022.01.14.476382 doi: bioRxiv preprint assign nucleotide states in assembled genomes, apparent reversion mutations could result.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022. ; https://doi.org/10.1101/2022.01.14.476382 doi: bioRxiv preprint Table 2  (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022. ; https://doi.org/10.1101/2022.01.14.476382 doi: bioRxiv preprint It is therefore unsurprising that for every BA.1 mutation in cluster regions 1, 2 and 3 we found multiple instances of reversions occurring along internal tree branches (a mean of 4.7 reversions per site; computed over internal tree branches in the reduced haplotype tree; Table 2). However, we noted that this pattern was also apparent for all of the other BA. were within clusters of three to four contiguous mutations: a degree of clustering that is significantly higher than would be expected for random independent mutations (permutation p-value < 0.001) ( Figure 10). This pattern would, however, be expected with the widespread use of sequencing primers that are poorly suited to BA.1 sequencing.
When we account for the association between sequence coverage and reversion mutation counts, it is apparent that in the S-gene we do not see more reversion mutations at cluster region 1, 2 and 3 codon sites than at other BA.1 lineage-defining mutation sites ( Figure 10).
It therefore follows that, by this metric, the cluster 1, 2 and 3 mutations are, with the possible exception of that at S/339 (Figure 10), not obviously less adaptive during the ongoing diversification of BA.1 than are the other S-gene BA.1 lineage-defining mutations.
Despite not supporting one origin hypothesis over another, our inability to convincingly demonstrate unusually frequent reversions of cluster region 1, 2 and 3 mutations, remains consistent with the hypothesis that these mutations are broadly adaptive when they occur in the combinations found in BA.1 lineage viruses.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022. ; Figure 10. Association between the proportion of sequences with missing data at a BA.1 mutation site and the number of reversion mutations seen at that site. This significant association between missing data and reversion mutation counts (dotted blue trendline with Pearson's R 2 = 0.773; p < 0.01) is likely attributable to miscalled nucleotides at BA.1 mutation sites whenever read coverage is low during sequencing. Under conditions when PCR/sequencing primers are not optimal for the amplification of BA.1 sequence, non-BA.1 SARS-CoV-2 genetic material contaminating sequencing instruments and other laboratory equipment used for sample preparation, will occasionally yield more amplions/sequence reads than those from the intended BA.1 target sequences. Wherever the nucleotide states of these contaminant amplicons are different than those of the intended BA.1 target, they will frequently yield base miscalls during sequence assembly that, if the miscalled base corresponds with an ancestral state, will be misinterpreted as reversion mutations. Compared to BA.1 lineage-defining mutations in the S-gene at codon sites that are positively selected (red dots), the thirteen mutations at negatively selected or neutrally evolving cluster region 1,2 and 3 sites (blue dots) actually have a lower than average number of detectable reversion mutations (note how the blue dots predominantly fall below the blue trend line). Only one of these 13 mutations (at codon S/339) has a number of reversions that might be higher than expected given the percentage missing data for the codons where the mutations occur.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022. ; https://doi.org/10.1101/2022.01.14.476382 doi: bioRxiv preprint

Conclusion
Regardless of how the complement of mutations in the three cluster regions was assembled, their presence in BA.1 together with indirect evidence that the mutations are epistatically interacting is concerning. As with the concomitant emergence of the Alpha, Beta and Gamma VOCs in late 2020, part of the reason that the emergence of Omicron was a surprise is that the evolvability of SARS-CoV-2 is still deeply under-appreciated. It is becoming increasingly apparent that the evolutionary processes that yielded BA.1 involved balancing multiple fitness trade-offs: (1) between immune escape 6,9,26,49,50 and affinity for human and/or animal ACE2 proteins [7][8][9][10]30 ; (2) between efficient proteolytic priming with TMPRSS2 which expedites cellular entry via the cell surface [6][7][8] and increased resistance to endosomal restriction factors (such as IFITM proteins) which enable more efficient cellular entry via the endocytic route 7 ; (3) between preferred tropism for cells in the upper respiratory tract and preferred tropism for cells in the lower respiratory tract 7,8 , and (4) between increased propensity for Spike monomers to switch from the down to up configurations and overall Spike trimer stability 37,39 . Fortunately, the collection of mutations in BA.1 appear at present to have tilted the balance of these and other trade-offs towards the virus having decreased clinical severity in humans 51,52 .
It remains unclear what roles epistatic interactions between the BA.1 S-gene cluster region 1, 2 and 3 mutations have played in resolving these trade-offs. It is evident, however, that the extensive mutational changes in BA.1 that have collectively yielded these resolutions are as similar to "normal" stepwise mutational changes seen in previous variants as antigenic shifts are to antigenic drifts 53 . The evolutionary dynamics of the clustered rarely seen mutations in the RBD and fusion domains of BA.1 lineage viruses suggest that -rather than merely supporting minor tweaks in the antigenicity of Spike, its ACE2 binding affinity or its membrane fusion properties -these mutations are likely pivotal to the big observed shifts in how BA.1 Spike proteins function.
While a threat in its own right, BA.1 is also a warning. It demonstrates that complex evolutionary remodelling of important functional elements of SARS-CoV-2 are not just possible, but are potentially already occurring unnoticed in other poorly sampled lineages.
We should not complacently assume that the balance of fitness trade-offs achieved by the extensively evolved VOCs that succeed BA.1 will be similarly tilted towards lower severity.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022. ; https://doi.org/10.1101/2022.01.14.476382 doi: bioRxiv preprint

Global analyses of selection
Unless specified otherwise, all analyses were performed on single gene (e.g. S) or peptide products (e.g. nsp3); since genes/peptides are the targets of selection. Global SARS-CoV-2 gene/peptide datasets were compiled (from GISAID; 12 ), processed and analysed at monthly intervals for evidence of selection acting on individual codon sites as in 15 . Results of these analyses at codons where Omicron mutations occur can be visualized using an Observable notebook at https://observablehq.com/@spond/sars-cov-2-selected-sites.

Analyses of intrapatient SARS-CoV-2 diversity
Intrahost allelic variation seen at BA.1 amino acid mutation sites was analysed in 282788 annotated (i.e. with detailed associated metadata) publically available SARS-CoV-2 raw sequencing datasets from the UK, Greece, Estonia, Ireland and South Africa between March 2020 and September 2021 all of which were processed and analyzed using the standardized variant calling pipeline described in 24

Analyses of selection in sarbecoviruses related to SARS-CoV-2
The whole genome sequences of 167 members of the were reconstructed using IQTREE2 56 (GTR+I+F+G4 model) and selection signals specific to the nCoV clade branches were inferred using the FEL 13 and MEME 57 methods as in 58 .
Results of these analyses for all gene regions can be explored using the observable notebook at https://observablehq.com/@spond/ncos-evolution-nov-2021.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022. ; https://doi.org/10.1101/2022.01.14.476382 doi: bioRxiv preprint

Analyses of selection in the BA.1 sublineage
Because the codon-based selection analyses that we performed gain no power from including identical sequences, and minimal power from including sequences that are essentially identical, we filtered BA.1 and reference (GISAID) sequences using pairwise genetic distances complete linkage clustering with the tn93-cluster tool (https://github.com/veg/tn93). All groups of sequences that were within D genetic distance (Tamura-Nei 93) of every other sequence in the group were represented by a single (randomly chosen) sequence in the group. We set D at 0.0001 for lineage-specific sequence sets, and at 0.0015 for GISAID reference (or "background") sequence sets. We restricted the reference set of sequences to those sampled before Oct 15th, 2020.
We inferred a maximum likelihood tree from the combined sequence dataset using raxml-ng using default settings (GTR+G model, 20 starting trees). We partitioned internal branches in the resulting tree into two non-overlapping sets used for testing and annotated the Newick tree. Because of a lack of phylogenetic resolution in some of the segments/genes, not all analyses were possible for all segments/genes. In particular, this is true when lineage BA.1 sequences were not monophyletic in a specific region, and no internal branches could be labeled as belonging to the focal lineage.
Analyses in this setting need to account for a well-known feature of viral evolution 41 where terminal branches include "dead-end" (maladaptive or deleterious on the population level) 14 mutation events within individual hosts which have not been "seen" by natural selection, whereas internal branches must include at least one transmission event. However, because our tree is reduced to only include unique haplotypes, even leaf nodes could represent "transmission" events, if the same leaf haplotype was sampled more than once (and the vast majority were). The branches leading to these repeatedly sampled haplotypes were therefore also included in the analyses.
We performed an additional analysis on BA.1 sequences, which includes data available in GISAID up to Jan 5th, 2022. The workflow for intrahost gene analysis is as follows (code available at https://github.com/veg/omicron-selection; please note the scripts require the GISAID FASTA files and are not robust to changes in input format).
1. Obtain GISAID sequences annotated as BA.1 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022. ; https://doi.org/10.1101/2022.01.14.476382 doi: bioRxiv preprint 2. Map them to the reference S-gene using bealign (part of the BioExt Python package). bealign -r CoV2-S input.fasta output.bam; bam2msa output.bam S.mapped.fasta 3. Identify all sequences that are identical up to ambiguous nucleotides using tn93cluster (these are the unique haplotypes). . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 18, 2022. ; https://doi.org/10.1101/2022.01.14.476382 doi: bioRxiv preprint