SARS-CoV-2 genome evolution exposes early human adaptations

The set of mutations observed at the outset of the SARS-CoV-2 pandemic may illuminate how the virus will adapt to humans as it continues to spread. Viruses are expected to quickly acquire beneficial mutations upon jumping to a new host species. Advantageous nucleotide substitutions can be identified by their parallel occurrence in multiple independent lineages and are likely to result in changes to protein sequences. Here we show that SARS-CoV-2 is acquiring mutations more slowly than expected for neutral evolution, suggesting purifying selection is the dominant mode of evolution during the initial phase of the pandemic. However, several parallel mutations arose in multiple independent lineages and may provide a fitness advantage over the ancestral genome. We propose plausible reasons for several of the most frequent mutations. The absence of mutations in other genome regions suggests essential components of SARS-CoV-2 that could be the target of drug development. Overall this study provides genomic insights into how SARS-CoV-2 has adapted and will continue to adapt to humans. SUMMARY In this study we sought signals of evolution to identify how the SARS-CoV-2 genome has adapted at the outset of the COVID-19 pandemic. We find that the genome is largely undergoing purifying selection that maintains its ancestral sequence. However, we identified multiple positions on the genome that appear to confer an adaptive advantage based on their repeated evolution in independent lineages. This information indicates how SARS-CoV-2 will evolve as it diversifies in an increasing number of hosts.


INTRODUCTION 22
A better understanding of the origin and evolution of the SARS-CoV-2 pandemic may 23 help to mitigate disease outbreaks. Initial genome comparisons point toward a proximal origin 24 in horseshoe bats (1), with possible intermediate hosts including pangolins, cats, and ferrets (2-25 4). Differences between host species are believed to present many barriers to host switching, 26 likely resulting in a virus that is initially maladapted to a new host (5,6). Viruses are expected to 27 quickly adapt to a new species via mutations that increase transmissibility and decrease the 28 serial interval (7). Yet, relatively little is known about the mode and tempo of evolution at the 29 start of many epidemics, including the SARS-CoV-2 pandemic. Owing to the relatively high 30 mutation rates of RNA viruses, comparison of genome sequences may reveal a wealth of 31 information even at early stages of an outbreak (8). 32 Previous studies suggest that new environments provide the opportunity to develop a 33 large number of beneficial mutations, which may accrue quickly due to natural selection (9). 34 High rates of adaptation have been observed for viruses propagated in cell lines belonging to 35 new host species (10). The possibility of an adaptive advantage over the ancestor gives rise to 36 three outcomes: (i) increasing frequencies of beneficial mutations, (ii) parallel evolution, where 37 the same mutation rises to detectable frequencies in different lineages, and (iii) positive 38 selection, where the number of non-synonymous changes exceeds the number of synonymous 39 changes in protein coding regions of the genome. Mutations can rise in frequency for reasons 40 other than positive selection, such as arising by chance near the outset of a pandemic or after 41 undergoing a bottleneck when invading a new population of susceptible hosts. It is also 42 challenging to accurately determine mutant frequencies due to an uneven sampling of genomes. Therefore, repeated parallel evolution is a clearer signal of adaptation than changes 44 in frequency. Evidence of parallel evolution also enables identification of beneficial mutations 45 before they have had time to substantially rise in frequency, which is particularly useful at the 46 beginning of a pandemic. 47 At the other end of the spectrum is the mode and tempo of evolution that is expected 48 when an organism is well-poised to enter a new environment or has spent a long time adapting 49 to a specific context. In such cases we anticipate evolution to occur slowly because further 50 mutation would not provide an advantage over recent ancestors. This would appear in the 51 genome as neutral or purifying selection, where the number of non-synonymous changes does 52 not exceed the number of synonymous changes at protein coding sites. Invasion into an 53 environment where an organism was already well-suited would cause a bottleneck event that 54 results in a homogenous population and a low rate of divergence due to purifying, or negative 55 selection. Therefore, the frequency and type of mutations at the outset of a pandemic can 56 provide insight into the ways in which a pathogen is initially well-adapted or maladapted, 57 potentially informing the development of therapies that target its weaknesses and avoid 58 resistance evolution. 59

SARS-CoV-2 is undergoing purifying selection 61
Based on previous studies from similar coronaviruses, SARS-CoV-2 is anticipated to have 62 a relatively low per-base pair mutation rate for RNA viruses, resulting in approximately 1 63 mutation per genome replication (11)(12)(13). We estimated an average viral replication time of 6 64 hours based on SARS-CoV-2 titers and the known replication times of similar viruses (14).

Mutational and selection biases across the SARS-CoV-2 genome 74
Beneficial mutations can be readily identified in laboratory evolution experiments 75 through parallel mutations that arise in multiple independent replicates (16,17), also known as 76 homoplasies. We applied this intuition to search for beneficial mutations in the SARS-CoV-2 77 genome as it diversifies across many hosts. As shown in Figure 2, we observed 6,028 positions 78 with substitutions out 29,903 nucleotides in the genome (Fig. 2a). Of these, 2,070 positions had 79 more than one independent substitution, 1,858 of which were located in coding regions. 80 Substitutions displayed a strong bias toward guanine (G) and cytosine (C) being replaced with 81 uracil (U) in the genome (Fig. S1). U replacing C was 2.5-fold more common than the reverse, 82 and U replacing G was 6.4-fold more common than the reverse. C to U transitions accounted for 83 31% of substitutions, and may result from effects of the APOBEC3G gene causing deamination 84 of C to U (18). G to U substitutions represented 43% of transversions, although the cause of 85 their relatively high frequency was unknown. 86 Non-synonymous substitutions represented 66% of substitutions in coding regions. 87 Based on the frequency of codons and observed substitution rates, we estimated that non-88 synonymous substitutions would represent 71% of substitutions without selection against 89 changes to the protein sequence. Therefore, the slight bias toward non-synonymous 90 substitutions is lower than expected and consistent with purifying selection being the dominant 91 mode of evolution. However, the skew toward non-synonymous substitutions varied across the 92 genome (Fig. 2a), reaching a peak within the gene coding for the nucleocapsid protein N that 93 protects the viral RNA. In contrast, genes with a bias toward synonymous substitutions, such as 94 the membrane protein M, indicate their protein sequence is relatively constrained. 95 We observed several conserved regions that displayed a relative lack of mutations ( the RNA template (19) and the predicted binding sites of many antivirals (20,21). The 100 conserved region within S encompasses the central helix (22), which is believed to initiate the 101 fusion of viral and host membranes (23). These conserved regions may offer reasonable drug 102 targets because they are more likely to avoid the evolution of drug resistance. 103

Evidence of adaptation at multiple genome positions 104
The observation of parallel evolution in independent lineages enables us to pinpoint 105 specific genome positions that likely increase the fitness of SARS-CoV-2 in the human host. The 106 extreme 5' and 3' ends of the genome contained the highest concentration of parallel 107 substitutions (Fig. 2a). Despite their high frequencies, these substitutions were observed exclusively in genomes originating from the same laboratory, which suggests they are 109 sequencing errors rather than authentic mutations. Therefore, we chose to focus on 110 substitutions found in genomes from at least four of the 529 contributing laboratories to 111 mitigate the presence of lab-specific sequencing errors. 112 We observed two substitutions with more than 30 cases of parallelism across SARS-CoV-113 2 genomes (Fig. 2b). The most frequent substitution occurred 50 different times at position 114 11,083, which results in a non-synonymous change (L37F) in nsp6, a transmembrane protein 115 localized to the endoplasmic reticulum and implicated in formation of autophagosomes (24)(25)(26)(27). 116 The substitution at 11,083 occurred nearby another frequent substitution at position 11,074 117 that is synonymous. Both substitutions were conversions to uracil at sites adjacent to eight 118 consecutive uracils in the genome (Fig. 3), suggesting they may occur more frequently due to an 119 increased mutation rate at homopolymeric sites (28). A similar conversion to uracil at position 120 21,575 is located in the middle of 7 other uracils and results in a non-synonymous change (L5F) 121 to the protein sequence of S (Fig. 3). Three other substitutions were adjacent to at least 3 122 uracils in the genome: positions 9,474, 26,681, and 28,253. The high frequency of substitutions 123 next to poly(U) tracts is likely due to increased mutation rates at these positions, although we 124 cannot rule out that they may also have adaptive significance. 125 The next most frequent substitution occurred 16 times at position 16,887 and results in 126 a synonymous change to nsp3. There is presently no evidence that this mutation is involved in 127 RNA base pairing, and it is located in a region of the genome with relatively little conserved RNA 128 secondary structure (29). The most frequent non-C-to-U substitution was A10323G, which 129 results in a non-synonymous change (K90R) to the protease nsp5. This amino acid replacement is distally located from the active site and the nsp5 dimer interface (30), suggesting it may not 131 be of adaptive significance. We observed a similar substitution, A21137G, which results in a 132 non-synonymous change (K160R) to nsp16. However, this residue is distant from the active site 133 and from the nsp16/nsp10 interface (31), suggesting its replacement could be of little 134 consequence. 135 Two different nonsynonymous mutations in the N gene, encoding the nucleocapsid 136 protein, repeatedly evolved in a disordered linker domain between structural capsid elements 137 (32). These mutations, R185C and T205I, alter a region acting as an RNA chaperone that 138 facilitates template switching and RNA synthesis during replication (33,34). Similarly intriguing, 139 we observed divergent nucleotide substitutions, G28077C/U, that both result in the same non-140 synonymous change (V62L) to ORF8. This region of ORF8 is missing in some SARS-related 141 viruses (Fig. 3), and underwent repeated deletions during the SARS-CoV epidemic (35). ORF8 is 142 known to rapidly evolve and the necessity of its role in the human host remains contentious 143 (36). 144 We sought to determine whether any of these eighteen highly parallel mutations (Fig.  145 2b) could be attributable to common sequencing errors. We reasoned that sequencing errors 146 would be randomly distributed across the phylogenetic tree, whereas adaptive mutations are 147 likely to expand in size along a specific lineage. Therefore, we calculated the probability of 148 finding a mutant clade of size R or larger by chance given each substitution's observed 149 frequency. For example, the substitution G11074U had a largest clade size of R=4, for which we 150 observed 24 mutants among 12,435 genomes. In this case, the probability of observing four or 151 more adjacent mutants on the phylogenetic tree is much less than 10 -6 . Extremely small p-values were found for all parallel substitutions reported here, except the mutation at 9,474 153 which was only supported by singletons (R=1). 154 In this study, we determined that SARS-CoV-2 is evolving predominantly under purifying 155 selection that purges most mutations since they are deleterious. This suggests that SARS-CoV-2 156 was well-poised to invade the human population, although it continues to adapt to humans 157 through specific mutations that may accumulate in individual genomes as SARS-CoV-2 158 continues to evolve. The few highly parallel substitutions that we observed offer intriguing 159 avenues for further investigation, as most are cryptic and located in poorly characterized 160 regions of the SARS-CoV-2 genome. Notably, some genes acquired relatively few mutations, 161 which implies strict sequence constraints that may focus drug development strategies against 162 these gene products. The paucity of mutations overall suggests that coronaviruses are well-163 suited to jumping between hosts and caution should be taken to avoid direct or indirect contact 164 with their animal reservoirs. This is further corroborated by the relatively small number of 165 genome positions that have undergone multiple parallel substitutions despite a plentiful supply 166 of mutations. 167

Genome collection and comparison 169
Complete (> 29,000 nucleotide) SARS-CoV-2 genomes were downloaded from GISAID 170 (37) on May 2 nd , 2020. Genomes with more than 500 degeneracies (e.g., N's) were removed, 171 resulting in a collection of 12,435 genomes, of which 12,285 had a known date of collection 172 that we used as a proxy for the duration of growth relative to the first genome (2019- [12][13][14][15][16][17][18][19][20][21][22][23][24]. 173 Genomes were aligned to the SARS-CoV-2 reference genome (NC_045512.2) using the distance was defined as the number of positions differing from the reference genome without 176 considering insertions or deletions, which were very infrequent. 177 To create Figure 4, viruses closely related to SARS-CoV-2 were selected from a recent 178 study (1) and supplemented with one sequence derived from a pangolin host in another study 179 (2). Genomes were aligned and a maximum likelihood tree was created using DECIPHER with 180 the best fitting evolutionary model. 181

Identification of parallel substitutions across independent lineages 182
Starting from the set of all SARS-CoV-2 genomes, we constructed a multiple sequence 183 alignment, matrix of pairwise nucleotide identity, and rooted neighbor joining tree using 184 DECIPHER. Sequences were compared at each site in the reference sequence to identify 185 independent substitutions on the phylogenetic tree. That is, we mapped mutations onto tips of 186 the phylogenetic tree and propagated them back till they coalesced at a common ancestor 187 (edge). This enabled us to count the number of independent substitutions that were inherited 188 by one or more strains. To increase robustness to the tree topology, we ignored single 189 reversions to the ancestral character that occurred within a clade sharing a derived character. 190 This process resulted in an integer representing the number of parallel substitutions 191 occurring at each position in the reference sequence. We determined that eight or more 192 independent substitutions was statistically significant for C to U transitions (p < 0.001, Poisson 193 distribution and Bonferroni correction) given the observed mutations rates and assuming 194 mutations are randomly distributed along the genome. All other substitutions (e.g., G to U) 195 required fewer cases of parallelism to achieve the same degree of statistical significance.