Evidence of significant natural selection in the evolution of SARS-CoV-2 in bats, not humans

RNA viruses are proficient at switching to novel host species due to their fast mutation rates. Implicit in this assumption is the need to evolve adaptations in the new host species to exploit their cells efficiently. However, SARS-CoV-2 has required no significant adaptation to humans since the pandemic began, with no observed selective sweeps to date. Here we contrast the role of positive selection and recombination in the Sarbecoviruses in horseshoe bats to SARS-CoV-2 evolution in humans. While methods can detect some evidence for positive selection in SARS-CoV-2, we demonstrate these are mostly due to recombination and sequencing artefacts. Purifying selection is also substantially weaker in SARS-CoV-2 than in the related bat Sarbecoviruses. In comparison, our results show evidence for positive, specifically episodic selection, acting on the bat virus lineage SARS-CoV-2 emerged from. This signature of selection can also be observed among synonymous substitutions, for example, linked to ancestral CpG depletion on this bat lineage. We show the bat virus RmYN02 has recombinant CpG content in Spike pointing to coinfection and evolution in bats without involvement of other species. Our results suggest the non-human progenitor of SARS-CoV-2 was capable of human-human transmission as a consequence of its natural evolution in bats.

This analysis reported ten sites as showing significant evidence of positive selection across the pandemic phylogeny (Table 1). Due to the low diversity in these 396 SARS-CoV-2 samples, there is limited power to confidently estimate the synonymous and nonsynonymous substitution rate for each codon. This means that statistical power to identify positive selection in the form of dN/dS>1 for any given codon is limited, and the posterior distribution should be flat. The presence of statistically significant signatures of positive selection is therefore somewhat surprising. To understand the specific mutational patterns that might explain these significant results, we looked at where in the phylogeny these putatively positively selected mutations were occurring. For all but two of the ten positively selected codons, this signal was being driven by apparent convergent evolution (or homoplasy) in the tree, with the same mutation occurring in parallel across the phylogeny. To investigate whether this observation was truly due to independent events or because of recombination signatures in the SARS-CoV-2 outbreak tree, we first determined if the samples with these convergent mutations were geographically correlated. As selective pressure acting on an untreatable novel zoonotic virus is likely to be globally shared (adaptation to humans), but recombination requires co-localisation of viruses in time and space, geographic clustering would be a good indication that these mutations are not independent.

Supplementary
The homoplasies driving ORF8 L84S and ORF1ab L1599S mutations were both found in South Korean isolates, and each of the two instances of ORF M D3G and ORF N I292T were found in the Netherlands. This geographic clustering was suggestive of recombination and was investigated further.

Recombination or selection.
Two of the ten FUBAR-flagged sites, Spike codons 860 and 861, did not show any homoplasies. Both signals could be attributed to the same run of four neighbouring U to A mutations spanning the two codons. These mutations were found in only a single sample, EPI_ISL_408485 from Beijing, and have not been observed since (as of 8/5/2020). This suggests that they were either sequencing errors, or a large single mutation spanning two codons, which has not subsequently spread. Multiple nucleotide changes within a single codon should be rare and sequencing error is a plausible explanation.
The positive selection signature at NSP6 codon 37 can be explained by multiple homoplasies of G to U mutations at nucleotide 11083. This mutation is found in four distinct haplotypes.
Both the ORF8 codon 84 and ORF1ab codon 1599 positive selection signals appear to be raised by a single South Korean sample (GISAID accession 413017). This sample possesses two derived mutations either side of a hypothesised breakpoint. These pairs of derived mutations belong to samples with different haplotypes (Supplementary figure 2). Therefore the 413017 sample appears to be a recombinant between sample 413018 and either 413513 or 412871. As both 413017 and 413018 were sequenced by the same lab and released at the same time, this recombination event may be an artefactual product of lab cross-contamination.
figure 3A), suggesting it is not the result of recombination. No newly sequenced samples uploaded up to 27/4/2020 containing the D614G mutation clustered with the Wuhan 412982 sample, suggesting that this haplotype did not spread or that this homoplasy is driven by sequencing error. Additional sequences displaying apparent convergent evolution at this site have since been sequenced, and these have been taken as evidence of positive selection 3. However, given that this mutation now occurs in 59% of sequenced samples (as of 27/4/20), it will be one of the mutations most likely to be variable if multiple viral genotypes are present following lab contamination or in mixed infections, and so most prone to being shuttled onto new backgrounds by recombination. Therefore, whilst high frequency mutations are the most important to study, they are also the most prone to misleading homoplasies, and must be analysed with the most caution.
The N ORF 292 site detected by FUBAR is driven by a similar apparent convergent evolution event. However, both samples exhibiting the same derived I to T mutation (Figure 3B; GISAID IDs 413570 and 413574) were sequenced by the same Dutch lab and released at the same time, again suggesting that lab cross-contamination is a likely driver. However, unlike South Korean sample 413017, there is only one shared derived mutation (codon 292), and therefore the genomic evidence for recombination in these samples is weaker.
The Spike V367F signal was driven by apparent convergent evolution between four French samples sequenced in January and a Hong Kong sample 412028, which shows shared variation either side of the homoplasy, suggesting it is not a recombinant (Supplementary figure 3C).
Looking through more recent data shows additional homoplasies in a simple neighbour joining tree. Additionally, newly generated sequences since the FUBAR analysis cluster around the Hong Kong sample, further suggesting it is not a lab sequencing error.
Subsequent informal scans of newer data have revealed evidence of additional lab recombination events (see Supplementary figure 3D).
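The mosaic pattern used above to flag a putative recombinant (derived mutations matching one haplotype on one side of a breakpoint and a different haplotype on the other) can be illustrated with a minimal sketch. This is not the procedure used in the analysis; the function name, the representation of each genome as a set of derived mutation positions, and the positions in the usage note are all hypothetical.

```python
def find_breakpoints(query, parent_a, parent_b):
    """Return candidate breakpoint intervals (last_left_site, first_right_site).

    Each argument is a set of genome positions carrying a derived mutation
    relative to a common reference (hypothetical representation).
    A breakpoint is reported when the query matches parent_a at every
    informative site to the left and parent_b at every informative site
    to the right, with at least one informative site on each side.
    """
    # Only sites where the two candidate parents differ are informative.
    informative = sorted(parent_a ^ parent_b)
    intervals = []
    for i in range(1, len(informative)):
        left, right = informative[:i], informative[i:]
        left_ok = all((s in query) == (s in parent_a) for s in left)
        right_ok = all((s in query) == (s in parent_b) for s in right)
        if left_ok and right_ok:
            intervals.append((left[-1], right[0]))
    return intervals


# Hypothetical positions, chosen only to illustrate the pattern.
parent_a = {1000, 5000}
parent_b = {20000, 25000}
mosaic = {1000, 5000, 20000, 25000}
print(find_breakpoints(mosaic, parent_a, parent_b))
```

With only one informative site shared between a query and a candidate parent, as for the two Dutch samples, a check of this kind has very little to work with, which is why the genomic evidence is weaker in that case. The sketch also cannot distinguish true recombination from cross-contamination or mixed infection.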

Frequency-based analysis of SARS-CoV-2 polymorphisms
In addition to searching for positive selection, we investigated if signatures of purifying selection on segregating variation in the current SARS-CoV-2 data could be observed (sequences as of 14/5/20). We compared the relative frequencies of nonsynonymous and synonymous mutations in the pandemic data. Codons with multiple mutations present were discarded from the analysis to avoid ambiguity in the order of mutations, and simplify synonymous/nonsynonymous classification.
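As a minimal sketch of the classification and filtering step described above, the following assumes mutations are supplied as (codon index, reference codon, alternative codon, frequency) tuples; this input layout, the function names, and the frequency bin edges are assumptions for illustration, not the format used in the analysis. It reports the raw proportion of nonsynonymous mutations per frequency bin and does not normalise by the number of synonymous and nonsynonymous sites, so it is not a dN/dS estimate.

```python
from bisect import bisect_left
from collections import defaultdict

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify(ref_codon, alt_codon):
    """Label a codon change by whether the encoded amino acid changes."""
    same = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
    return "synonymous" if same else "nonsynonymous"


def nonsyn_fraction_by_bin(mutations, bin_edges):
    """Proportion of nonsynonymous mutations per frequency bin.

    bin_edges are ascending upper bounds; frequencies must not exceed
    the last edge. Codons with more than one distinct alternative codon
    are discarded, mirroring the filter described in the text.
    """
    alts = defaultdict(set)
    for idx, _ref, alt, _freq in mutations:
        alts[idx].add(alt)

    counts = defaultdict(lambda: {"synonymous": 0, "nonsynonymous": 0})
    for idx, ref, alt, freq in mutations:
        if len(alts[idx]) > 1:
            continue  # ambiguous order of mutations, skip this codon
        upper = bin_edges[bisect_left(bin_edges, freq)]
        counts[upper][classify(ref, alt)] += 1

    return {
        upper: c["nonsynonymous"] / (c["synonymous"] + c["nonsynonymous"])
        for upper, c in sorted(counts.items())
    }
```

For example, `classify("CTT", "CTC")` returns `"synonymous"` because both codons encode leucine.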
Most mutations of both classes are at very low frequency (main text Figure 2), indicative of the viral population expansion that the pandemic has undergone. dN/dS was approximately 0.6 in singletons, suggesting that 40% of nonsynonymous mutations are strongly deleterious and therefore never observed in the population. There is a weak observable trend towards a higher proportion of mutations being synonymous at the highest frequency intervals, suggestive of some ongoing selection against circulating amino acid replacements in the pandemic. This observation may be partially driven by sequencing errors, which are not transmitted and so are at low frequency. These sequencing errors are likely to have a dN/dS value of 1, which may make the estimate that 40% of amino acid replacements are strongly deleterious an underestimate of the true value. However, the decline in nonsynonymous/synonymous ratio occurs across the range of frequencies, suggesting that sequencing errors alone are not driving the trend. It is important to consider that the observed frequencies are likely to differ from true global frequencies due to biased sampling of infections in the pandemic 4, and so we caution against overinterpretation of specific mutation frequencies.
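The arithmetic behind the 40% figure, and the direction of the bias from sequencing errors, can be made explicit with a short sketch. It assumes synonymous mutations are neutral, so that 1 - dN/dS approximates the fraction of nonsynonymous mutations removed by selection, and it models errors as a fraction of the observed synonymous singletons with dN/dS = 1. The error fraction of 0.1 used below is purely illustrative and is not an estimate from the data; the function names are hypothetical.

```python
def fraction_deleterious(dn_ds):
    """Fraction of nonsynonymous mutations never observed, assuming
    synonymous mutations are neutral."""
    return 1.0 - dn_ds


def corrected_dn_ds(observed_dn_ds, error_fraction):
    """dN/dS of real mutations after removing sequencing errors.

    If a fraction e of observed synonymous singletons are errors with
    dN/dS = 1, then observed = true * (1 - e) + e, which rearranges to
    true = (observed - e) / (1 - e).
    """
    return (observed_dn_ds - error_fraction) / (1.0 - error_fraction)


observed = 0.6
print(fraction_deleterious(observed))

# Illustrative error fraction only, not an estimate from the data.
true_ratio = corrected_dn_ds(observed, 0.1)
print(fraction_deleterious(true_ratio))
```

Under these assumptions, any nonzero error fraction lowers the corrected dN/dS below 0.6 and so raises the implied deleterious fraction above 40%, consistent with the statement that 40% may be an underestimate.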