Natural selection in the evolution of SARS-CoV-2 in bats, not humans, created a highly capable human pathogen

RNA viruses are proficient at switching host species, and evolving adaptations to exploit the new host’s cells efficiently. Surprisingly, SARS-CoV-2 has apparently required no significant adaptation to humans since the start of the COVID-19 pandemic, with no observed selective sweeps since genome sampling began. Here we assess the types of natural selection taking place in Sarbecoviruses in horseshoe bats versus SARS-CoV-2 evolution in humans. While there is moderate evidence of diversifying positive selection in SARS-CoV-2 in humans, it is limited to the early phase of the pandemic, and purifying selection is much weaker in SARS-CoV-2 than in related bat Sarbecoviruses. In contrast, our analysis detects significant positive episodic diversifying selection acting on the bat virus lineage SARS-CoV-2 emerged from, accompanied by an adaptive depletion in CpG composition presumed to be linked to the action of antiviral mechanisms in ancestral hosts. The closest bat virus to SARS-CoV-2, RmYN02 (sharing an ancestor ∼1976), is a recombinant with a structure that includes differential CpG content in Spike; clear evidence of coinfection and evolution in bats without involvement of other species. Collectively our results demonstrate the progenitor of SARS-CoV-2 was capable of near immediate human-human transmission as a consequence of its adaptive evolutionary history in bats, not humans.


Introduction
In December 2019 an outbreak of pneumonia cases in the city of Wuhan, China, was linked to a novel coronavirus. Evolutionary analysis placed this new virus to humans in the same subgenus of Betacoronaviruses, the Sarbecoviruses 1 , that the SARS virus belongs to, and it was named SARS-CoV-2 to reflect its status as a sister lineage to the first SARS virus 2 . This novel lineage represents the seventh known human-infecting member of the Coronaviridae. The initial outbreak of human cases of the virus was connected to the Huanan Seafood Wholesale Market in Wuhan 3 , and while related viruses have been found in horseshoe bats 4 and pangolins 5 , their divergence represents decades of evolution 6 leaving the direct origin of the pandemic unknown. In addition to elucidating the transmission route from animals to humans, key questions for assessing future risk of emergence are: (i) what is the extent of evolution required for a bat virus to transmit to humans, and (ii) what subsequent evolution must occur for efficient transmission once the virus is established within the human population?
The first SARS virus outbreak in 2002/2003, causing approximately 8,000 infections, and its reemergence in late 2003, causing four infections, were linked to Himalayan palm civets and raccoon dogs in marketplaces in Guangdong province 7,8 . Later it became clear that these animals had been conduits for spillover to humans and not true viral reservoirs 9 . Extensive surveillance work subsequently identified related viruses circulating in horseshoe bats in China, some of which can replicate in human cells 10,11 . The bat viruses most closely related to SARS-CoV-1, can bind to human angiotensin converting enzyme (ACE2, the receptor SARS-CoV-2 also uses for cell entry), while the addition of protease is required for the more divergent bat viruses tested (red and grey lineages respectively, Supplementary Figure 1) 12 . Collectively these results demonstrate that, unlike other RNA viruses assumed to acquire adaptations after switching to a new host species for efficient replication and/or spreading, the Sarbecoviruses -which already transmit frequently among bat species 13 -can exploit the generalist properties of their ACE2 binding ability, facilitating successful infection of non-bat species, including humans. A main difference between SARS-CoV-1 and -2 is the increased binding affinity for human ACE2 14 permitting more efficient use of human cells and the upper respiratory tract, and on average lower severity but, paradoxically -due to greater penetrance into the human population -higher disease burden.
There is intense interest in the mutations emerging in the SARS-CoV-2 pandemic [15][16][17] . Although the vast majority of observed genomic change is expected to be 'neutral' 18,19 , mutations with functional significance to the virus will likely arise, as they have in many other viral epidemics and pandemics 20 . In SARS-CoV-2 amino acid replacements in the Spike protein could reduce the efficacy of vaccines, replacements in proteases and polymerases could result in acquired drug resistance, and other mutations could change the biology of the virus, e.g., enhancing its transmissibility or severity 21 , contributing to adaption to us, a new host species. A main way to begin to understand the functional impact of mutations is to characterise the selective regime they are under. Mutations which are under positive selection are of particular interest as they are more likely to reflect a functional change. However, identifying mutations under positive selection from frequency data alone can be misleading, as allele frequencies in viral pandemics are significantly driven by biased sampling, founder effects and superspreading events 17 . Exponentially growing populations can increase in average fitness 22 , however they are also expected to exhibit elevated genetic drift, with deleterious mutations surfing expansion waves 23 . Here, we investigate the ability of bat Sarbecoviruses to readily infect non-bat species through a comprehensive search for signatures of positive selection (a measure of molecular adaptation) in the virus circulating in humans since the COVID-19 outbreak began, and contrast this to historic selection acting on related bat viruses.

2.
We first analyse selection acting on the encoded amino acids of SARS-CoV-2 using 50,240 QC filtered genome sequences from the GISAID database as of the 28th of June 2020, representing a sample of the variants circulating in humans (see Methods). Purifying selection acts more strongly on nonsynonymous sites, supported by the estimate that the nonsynonymous substitution rate (dN) was only 4% of speed of the synonymous substitution rate (dS) in the divergence between SARS-CoV-2 and bat virus RaTG13 24 . We used the phylogenetically corrected SLAC counting method 25 to calculate "frequency-stratified" dN/dS estimates. This qualitative analysis revealed that, on average, the evolutionary process is weakly purifying, with point estimates of dN/dS ranging from 0.6 to 1.0, and generally declining with increasing frequency, with the exception of the high frequency variants (seen in more than 3,000 sequences), where dN/dS is 1.4, indicating weak positive selection ( Figure 1A). The vast majority of observed mutations occur at low frequency, with 90% of mutations observed in fewer than 15 of the 50,240 sequences ( Figure 1A Subsequently, we have been performing daily analyses of the data excluding these terminal branches (http://covid19.datamonkey.org). As more sequences are deposited in public databases ( Figure 1B), we expect to see increasingly more variation, with 89.6% of codons in Spike (S) containing some variation in most recent analyses ( Figure 1C). Similar increasing trends are seen for sites with amino-acid variants (82.2%) and variants shared by more than one sequence (40.4% for amino-acids). However, the number of unique sequences has increased more slowly than the number of sequenced genomes ( Figure 1B), and the fraction of sites inferred to be under positive selection remains below 1%. This observation shows SARS-CoV-2 is evolving relatively slowly with no dramatic increases in selective pressures occurring over this period. This pattern is further confirmed by the relatively stable behavior of gene-wide dN/dS estimates ( Figure 1D), both on internal branches, which include successful transmissions, and terminal branches, which sometimes include intra-host "dead-ends" or deleterious variation 26 .
Even in Spike, which is being assiduously scrutinized for selection due to its immunogenic and phenotypic importance, overall selective pressure is stable over time, and consistent with weak There are individual sites in the SARS-CoV-2 genomes which possess statistically detectable signatures of episodic diversifying selection along internal tree branches. In S and RdRp, there are four sites that appear to be selected in data prior to April 15 (early pandemic, Figure 1E).
These include Spike 614 and RdRp 323 which are in nearly perfect linkage disequilibrium (R 2 >0.99), with the former extensively discussed by others as having some functional significance 28 , while the latter received much less attention. The majority of these sites, however, are not detected until later in the pandemic, and most are transient is because mildly deleterious mutations in viral outbreaks fail to persist over longer time periods, being gradually purged by purifying selection 26,30 . Exponential growth, consistent with the SARS-CoV-2 trajectory (Supplementary Figure 2), is known to reduce the efficacy of purifying selection 23 , with deleterious mutations able to surf expansion waves. It is likely that many of these putatively deleterious segregating mutations will ultimately be lost, reducing the long term dN/dS, though their persistence will be influenced by future demographic patterns 23 , and the effect of lockdowns on the circulating variants.

Figure 1. (A) Estimates of molecular adaptation (dN/dS) for 50,240 SARS-CoV-2 genome sequences based on the counting SLAC method 25 (black circles) and the number of variants as a function of their frequency (grey bars). (B)
Cumulative number of Spike gene sequences in GISAID passing QC filters from 5 th April to 28  To search for branch-specific evidence of selection in the non-recombinant regions, we first used the aBSREL method 36 . Our analysis found that diversifying selection left its imprints primarily in the deepest branches of the nCoV clade or lineage leading to it, with no evidence of selection in the terminal branch leading to SARS-CoV-2 ( Figure 2). This is consistent with the non-human progenitor of SARS-CoV-2 requiring little or no novel adaptation to successfully infect humans.  Table 3). Interestingly ORF3a has six sites with evidence of positive selection. Little is known about the function of this accessory gene, however, it could be related to immune evasion, like that of other short accessory genes (e.g. Orf3b) 40 . Still, this analysis does not allow attributing specific functional relevance to these individual sites.  Table 4). Three categories of individual sites (conserved, negatively selected, positively selected) are shown as tracks in or above the schematic. For positively selected sites, colouring reflects the fraction of the branches in the nCoV clade inferred to be under selection at the site (gray: smaller, red: larger); further discussion of some of these sites is provided in the text.
The BUSTED[S] method also partitioned synonymous rate variation into three rate classes across the sites. The majority of regions showed large, in some cases more than 20-fold, differences between rate classes, with all three classes representing a substantial proportion of sites for most regions (Supplementary Figure 6), with varying degrees of autocorrelation. This suggests that strong purifying selection is acting on some synonymous sites (e.g., conserved motifs or RNA features), and some synonymous mutations in the SARS-CoV-2 genome may not be selectively neutral or occur at sites that are hypervariable. Some synonymous rate variation may also be attributed to the 5' and 3' context-specific mutation rate variation observed in SARS-CoV-2 29 .
Patterns of CpG depletion in the nCoV clade. Genome composition measures, such as dinucleotide representation and codon usage can also be an informative tool for characterising the host history of a virus 41 . Various host antiviral mechanisms accelerate the depletion of CpG dinucleotides (a cytosine followed by a guanine in the 5' to 3' direction) in virus genomes. This is thought to be primarily mediated either through selective pressures by a CpG-targeting mechanism involving the Zinc finger Antiviral Protein (ZAP) 42 or C to U hypermutation by APOBEC3 cytidine deaminases 43 , and recent work has demonstrated that the CpG binding ZAP protein inhibits SARS-CoV-2 replication in human lung cells 44 . These forces are likely to vary across tissues and between hosts. Thus, a smaller or greater level of CpG depletion in particular viral lineages may be indicative of a switch in the evolutionary environment of that lineage or its ancestors. Although care must be taken to not over-interpret these results 45 .
We examined the CpG representation using the corrected Synonymous Dinucleotide Usage (SDUc) framework, controlling for amino acid abundance and single nucleotide composition bias in the sequences 46 . In Orf1ab, the longest ORF in the genome, all viruses show CpG underrepresentation (SDUc < 1), while viruses in the nCoV clade have even lower CpG levels than the other Sarbecoviruses (Supplementary Figure 7). To further examine this trend, we fitted CpG representation as traits in a phylogenetic comparative approach, using two long putatively nonrecombinant regions in the alignment (NRR1 and NRR2; see Methods). Our method identified an adaptive shift favouring CpG suppression in the lineage leading to nCoV clade ( Figure 3A). This is coupled with an inflated substitution rate in this lineage compared to the rest of the phylogeny ( Figure 3A). We propose that the elevated rate specific to the nCoV clade ancestral lineage can explain some of the variation between different Coronaviridae substitution rates, previously attributed to a time-dependent evolutionary rate phenomenon 6 . This shift may indicate a change of evolutionary environment, e.g., host or tissue preference, since CpG depletion following host switches has been observed in other human infecting RNA viruses, such as Influenza B 41 .  causing viruses are here for the long-term.

SARS-CoV-2 GISAID sequence filtering.
To reduce the impact of sequencing errors on selection analysis the data from GISAID was filtered by excluding all sequences which meet any of the following criteria: any sequence of length less than 29000 nucleotides; any sequence with a non-human host (e.g., bat, pangolin); any sequences with ambiguous nucleotides in excess of 0.5% of the genome; any sequences with greater than 1% divergence from the longest sampled sequence (Wuhan-Hu-1); and any sequence with stop codons. Unique (identical sequences are collapsed into a single group) codon-aligned sequences are translated to amino-acids and aligned with MAFFT 54 . The amino-acid alignment is mapped back to the constituent nucleotide sequences. Only unique haplotypes are retained for comparative phylogenetic analyses, since including identical copies is not informative for these types of inference.  Table 6) were aligned using MAFFT version 7 (L-INS-i) 54 . Subsequent manual corrections were made on the protein alignments and pal2nal 53 was used to convert them to codon alignments.

SARS-CoV
Phylogenies for each codon alignment were inferred using RAxML with a GTR+Γ nucleotide substitution model 55 . As this metric is susceptible to error when used for short coding sequences, we applied SDUc on the longest ORF, Orf1ab, of all the viruses. To estimate the extent of phylogenetic independence between synonymous sites across SDUc datapoints, we measured the pairwise synonymous divergence (Ks) between viruses. Pairwise Ks values were calculated using the seqinr R package 57 which utilises the codon model of Li (1993) 58  We partitioned the coding regions of NRR1 and NNR2 by codon position and specified an independent general time-reversible (GTR) substitution model with gamma-distribution rate variation among sites for each of the three partitions. We used a constant size coalescent model as tree prior and specified a lognormal prior with mean = 6.0 and standard deviation = 0.5 on the population size. Three independent MCMC analyses were run for 250 million states for each data set. We used the BEAGLE library v3 62 to increase computational performance. Continuous parameters were summarized and effective sample sizes were estimated using Tracer 63 .

Sarbecovirus
Divergence time estimates for the nCoV clade are summarised in Supplementary Table 5. Trees were summarized as maximum clade credibility (MCC) trees using TreeAnnotator and visualized using FigTree.
Identifying shifts in CpG content. Shifts in CpG content were identified using a phylogenetic comparative method that infers adaptive shifts in multivariate correlated traits 64 . This approach models trait evolution on phylogenies using an Ornstein-Uhlenbeck (OU) process and uses a computationally tractable version of the full multivariate OU model (scalar OU) for multivariate traits. Estimates of the shift positions are obtained using an Expectation-Maximization (EM) algorithm. The shift positions are estimated for various numbers of unknown shifts and a lassoregression model selection procedure identifies the optimal number of shifts. We applied the procedure to the MMC trees for NRR1 and NRR2 with their respective CpG SDUc values (ln transformed as required by the phylogenetic approach). The trees were forced to be ultrametric, as required for the scalar OU model, by extending all the external edges of the trees to match the most recently sampled tip. Using this procedure, we identified three and two CpG shifts in both NRR1 and NRR2 respectively (Supplementary Figure 9), with the shift on the branch ancestral to the nCoV clade being the only consistent one identified in both genomic regions ( Figure 3A).