Molecular Evolution of SARS-CoV-2 Structural Genes: Evidence of Positive Selection in Spike Glycoprotein

SARS-CoV-2 caused a global pandemic in early 2020 and has so far resulted in more than 8,000,000 infections and 430,000 deaths worldwide. Four structural proteins, envelope (E), membrane (M), nucleocapsid (N) and spike (S) glycoprotein, play key roles in the entry of SARS-CoV-2 into human cells and in virion assembly. However, how the genes encoding them evolve during human-to-human transmission is largely unknown. In this study, we screened and analyzed roughly 3090 SARS-CoV-2 isolates from the GenBank database. We determined the number of alleles of each of the four genes: 16 for E, 40 for M, 131 for N and 173 for S. Phylogenetic analysis shows that global SARS-CoV-2 isolates cluster into three to four major clades based on the protein sequences of these genes. No intragenic recombination event was detected among the different alleles. However, purifying selection has acted on the evolution of these genes. Analysis of the full genomic sequences of these alleles with codon-substitution models (M8, M3 and M2a) and likelihood ratio tests (LRTs) in the codeML package reveals that codon 614 of the S glycoprotein has been subjected to strong positive selection pressure, and a persistent D614G mutation is identified. Positive selection on the D614G mutation is further confirmed by the internal fixed effects likelihood (IFEL) and Evolutionary Fingerprinting methods implemented in the HyPhy package. In addition, another potential positive selection site is identified at codon 5, in the signal sequence of the S protein. The allele containing the D614G mutation has undergone significant expansion during the SARS-CoV-2 global pandemic, implying better adaptability of isolates carrying the mutation. In contrast, expansion of the L5F allele is relatively restricted. The D614G mutation is located in subdomain 2 (SD2) of the C-terminal portion (CTP) of the S1 subunit. Protein structural modeling shows that the D614G mutation may disrupt a salt bridge between S protein monomers, increase their flexibility, and in turn promote receptor binding domain (RBD) opening, virus attachment and entry into host cells. Because it is located in the signal sequence of the S protein, the L5F mutation may facilitate protein folding, assembly, and secretion of the virus. This is the first evidence of positive Darwinian selection in the spike gene of SARS-CoV-2; it contributes to a better understanding of the adaptive mechanism of this virus and helps provide insights for developing novel therapeutic approaches and effective vaccines that target the mutation sites.
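The likelihood ratio tests mentioned above compare a codon model that allows positively selected sites (for example M8 or M2a) with a nested null model that does not. The following is a minimal sketch of that arithmetic only, not the authors' pipeline: the log-likelihood values in the usage note are hypothetical, the function names are illustrative, and the choice of 2 degrees of freedom reflects the comparisons commonly made for M8 and M2a (4 for M3 against the one-ratio model). It relies on the closed form of the chi-square survival function for even degrees of freedom, so it needs only the standard library.

```python
import math


def chi2_sf_even_df(stat: float, df: int) -> float:
    """Chi-square survival function P(X >= stat), closed form for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form is only valid for positive even df")
    if stat <= 0:
        return 1.0
    half = stat / 2.0
    term, total = 1.0, 1.0
    # Sum of (stat/2)^k / k! for k = 0 .. df/2 - 1
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def likelihood_ratio_test(lnl_null: float, lnl_alt: float, df: int):
    """Return (2 * delta lnL, p-value) for a pair of nested models."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return stat, chi2_sf_even_df(stat, df)
```

With hypothetical log-likelihoods of -1000.0 for the null model and -990.0 for the selection model, `likelihood_ratio_test(-1000.0, -990.0, df=2)` gives a test statistic of 20.0 and a p-value of about 4.5e-5, which would reject the null model at conventional thresholds.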


SARS-CoV-2 S gene is under positive selection at a definitive
to April 20, 2020 (17 weeks). Detailed information on these isolates, including collection date, collection region and accession or biosample numbers, is summarized in S3 and S4 Tables.

In 173 [...] respectively), carry 614D in the S protein, while the first SARS-CoV-2 isolate with the D614G mutation in our collected dataset is GZMU0019, isolated from a patient with COVID-19 on February 5, 2020 (week 7 in our dataset). After that, except for weeks 9 and 10 (possibly due to the small number of samples and sampling deviation), a clear trend emerges in which an increasing proportion of isolates carry the D614G mutation in the S protein. In week 17, the last week of our dataset, 91.11% of SARS-CoV-2 isolates carry this mutation (S3 Table, Fig 6A).
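The weekly proportions described above can be tabulated with a few lines of code. The sketch below is illustrative only: the isolate records are invented, the function names are hypothetical, and the dataset start date of December 24, 2019 is an assumption chosen because it places February 5, 2020 in week 7 and April 20, 2020 in week 17, as stated in the text. The authors' actual week boundaries are not given here.

```python
from collections import defaultdict
from datetime import date

ASSUMED_START = date(2019, 12, 24)  # assumption, not taken from the paper


def week_index(collected: date, start: date = ASSUMED_START) -> int:
    """1-based week number counted from the assumed dataset start date."""
    return (collected - start).days // 7 + 1


def weekly_614g_frequency(isolates, start: date = ASSUMED_START):
    """Map week number -> fraction of isolates with G at S codon 614.

    `isolates` is an iterable of (collection_date, residue_at_614) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # week -> [n_614G, n_total]
    for collected, residue in isolates:
        week = week_index(collected, start)
        counts[week][1] += 1
        if residue == "G":
            counts[week][0] += 1
    return {week: g / n for week, (g, n) in sorted(counts.items())}


# Hypothetical toy records, for illustration only
toy_isolates = [
    (date(2020, 2, 5), "G"),
    (date(2020, 2, 6), "D"),
    (date(2020, 4, 18), "G"),
    (date(2020, 4, 19), "G"),
    (date(2020, 4, 20), "G"),
    (date(2020, 4, 20), "D"),
]
frequencies = weekly_614g_frequency(toy_isolates)
```

Here `frequencies` maps week 7 to 0.5 and week 17 to 0.75 for the toy records; with real collection dates the same tabulation yields the kind of per-week series plotted in Fig 6A.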

Further analysis reveals that the frequency of D614G mutation in the S gene was steadily increasing when combining data from week 6 to 17 (S3 Table,

From structural studies of both SARS-CoV and SARS-CoV-2, the receptor binding domain (RBD), located at the C-terminus of S1, and the adjacent N-terminal domain (NTD) are relatively flexible, a feature required for receptor recognition and subsequent membrane fusion [47,48].

We found that the D614G mutation is located in subdomain 2 (SD2), which lies C-terminal to the RBD and close to the two potential cleavage sites between S1 and S2 [48] (Fig 7A). Considering that positive selection is usually beneficial to the survival of the individual carrying the mutation, we speculate that the D614G mutation may facilitate a conformational change that promotes receptor binding or membrane fusion [5,44], in turn improving infection efficiency. In the latest cryo-electron microscopy (cryo-EM) structure of the SARS-CoV-2 S protein, the negatively charged side chain of D614 points towards the positively charged side chain of K854 from the neighboring monomer (Fig 7B) [48]. The distance between the closest atoms of the two residues is 2.6 Å, which is an optimal distance for forming a salt bridge (Fig 7C). In the modelled structure with the D614G mutation, the distance increases to 5.2 Å (Fig 7D), which would potentially abolish the salt bridge and destabilize the S trimer relative to the wild type. It has been reported that the human receptor ACE2 binds to an "open" conformation of the S protein, in which the RBD moves away from the core structure and exposes its receptor binding surface. The entire S trimer then undergoes a series of dramatic conformational changes, including cleavage between S1 and S2, dissociation of S1 and post-fusion transformation of S2 [49,50]. Changes such as mutations at the cleavage sites and the addition of internal crosslinks in the S trimer keep the protein in a stable, "closed" conformation in which the receptor binding surface of the RBD is inaccessible [48,51]. Therefore, we hypothesize that the highly transmissible D614G mutation, driven by positive selection, promotes accessibility of the RBD through the loss of a critical salt bridge between S protein monomers, which subsequently triggers membrane fusion upon ACE2 binding.

We present modern molecular evolution analyses of a large and comparative set of SARS-CoV-2 structural gene sequences, derived from an international collection of SARS-CoV-2 isolates.
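The salt-bridge argument made earlier for D614 and K854 rests on the minimum distance between charged side-chain atoms of the two residues (2.6 Å in the wild-type structure versus 5.2 Å in the D614G model). As a minimal sketch of that measurement, assuming a 4.0 Å cutoff (a commonly used criterion, not one stated in the text) and using invented coordinates rather than atoms from the actual structure:

```python
import math
from itertools import product


def min_atom_distance(atoms_a, atoms_b) -> float:
    """Smallest Euclidean distance (in Å) between two sets of 3D points."""
    return min(math.dist(p, q) for p, q in product(atoms_a, atoms_b))


def may_form_salt_bridge(acid_atoms, base_atoms, cutoff: float = 4.0) -> bool:
    """True if the closest acid/base atom pair lies within the cutoff.

    The 4.0 Å default is an assumed, commonly used threshold.
    """
    return min_atom_distance(acid_atoms, base_atoms) <= cutoff


# Hypothetical coordinates (Å), for illustration only
asp_carboxylate_oxygens = [(0.0, 0.0, 0.0), (2.2, 0.0, 0.0)]
lys_nz_close = [(0.0, 2.6, 0.0)]    # mimics the 2.6 Å wild-type contact
nearest_atom_far = [(0.0, 5.2, 0.0)]  # mimics the 5.2 Å modelled distance

wild_type_contact = may_form_salt_bridge(asp_carboxylate_oxygens, lys_nz_close)
mutant_contact = may_form_salt_bridge(asp_carboxylate_oxygens, nearest_atom_far)
```

With these toy coordinates `wild_type_contact` is true and `mutant_contact` is false. In practice the atom coordinates would be read from the deposited cryo-EM model and the mutant model, and glycine has no side-chain carboxylate, so the closest remaining backbone atoms would be used instead.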

Distinct phylogenetic patterns of the four structural proteins of SARS-CoV-2 are depicted. Protein [...] SARSr-CoV, suggesting the evolutionary conservation of these two genes. In contrast, relatively high genetic variation is observed in the N and S proteins among SARS-CoV-2 isolates, implying extensive [...] potentially affect the assembly and secretion of SARS-CoV-2. The L5F mutation may need to be monitored closely in case another expansion occurs. As the S protein is a key target for SARS-CoV-2 vaccines, therapeutic antibodies, and diagnostics, the D614G mutation of S deserves closer attention. Because the exact mechanism remains unclear, further study should focus on the exact function of these mutation sites and how they affect the expansion of the mutated alleles of SARS-CoV-2.

Song W, Zhou H, Xu J, Chen S, Xiang Y, et al. Cryo-electron microscopy structures of the SARS-CoV spike glycoprotein reveal a prerequisite conformational state for