Abstract
The SARS-CoV-2 Spike protein needs to be in an open-state conformation to interact with ACE2 as part of the viral entry mechanism. We utilise coarse-grained normal-mode analyses to model the dynamics of Spike and calculate transition probabilities between states for 17081 Spike variants. Our results correctly model an increase in open-state occupancy for the more infectious D614G via an increase in flexibility of the closed-state and decrease of flexibility of the open-state. We predict the same effect for several mutations on Glycine residues (404, 416, 504, 252) as well as residues K417, D467 and N501, including the N501Y mutation, explaining the higher infectivity of the B.1.1.7 and 501.V2 strains. This is, to our knowledge, the first use of normal-mode analysis to model conformational state transitions and the effect of mutations thereon. The specific mutations of Spike identified here may guide future studies to increase our understanding of SARS-CoV-2 infection mechanisms and guide public health in their surveillance efforts.
1. Introduction
The coronavirus pandemic has emerged as a major and urgent issue affecting individuals, families and societies as a whole. Among all outbreaks of aerosol-transmissible diseases in the 21st century, the COVID-19 pandemic, caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) virus [1,2], has the highest cumulative numbers of infections and deaths: 83 million infections and over 1.8 million deaths, according to the World Health Organization (WHO) epidemiological report of January 5, 2021 [3]. Recent WHO reports also show significant weekly increases in the number of infections and deaths as countries start to face upcoming waves of the disease. In 2003, the SARS coronavirus (SARS-CoV) pandemic caused 8,098 infections and 774 deaths before it was brought under control [4,5]. In 2012, the Middle East respiratory syndrome-related coronavirus (MERS-CoV) outbreak caused 2,499 infections and 858 deaths, presenting the highest fatality rate [6]. SARS-CoV-2, SARS-CoV and MERS-CoV, like coronaviruses in general, have considerable mutation rates, which may contribute to future outbreaks. For instance, SARS-CoV-2 is estimated to have a mutation rate close to those of MERS-CoV [7] and SARS-CoV [8], as well as other RNA viruses, with a median of 1.12 × 10−3 mutations per site per year [9]. The high mutation rate may in part be responsible for the zoonotic nature of these viruses and points to a clear risk that still-undetected members of the Coronaviridae family will make the jump from their traditional hosts to humans in the future.
The SARS-CoV-2 Spike protein (Uniprot ID P0DTC2) is responsible for anchoring the virus to the host cell. The entry receptor for SARS-CoV-2 and other lineages of human coronaviruses is the human cell-surface protein angiotensin converting enzyme 2 (ACE2) (Uniprot ID Q9BYF1) [10]. Therefore, studying the Spike protein family is essential to understand the evolution of coronaviruses.
SARS-CoV-2 Spike is a homo-trimeric glycoprotein, with each chain built from subunits S1 and S2, delimited by a furin cleavage site at residues 682-685. The S1 subunit comprises the N-terminal Domain (NTD), located at the periphery of the extramembrane end, and the Receptor Binding Domain (RBD), the most flexible site, located in the central part of this same end. The S2 subunit consists of the fusion peptide (FP), heptad repeat 1 (HR1), heptad repeat 2 (HR2), the transmembrane domain (TM), and the cytoplasmic tail (CT) (Figure 1). The interaction between Spike and ACE2 requires Spike to be in its open conformation, in which the RBD is extended [11]. The study of the binding properties between Spike and ACE2, although important, cannot explain all the nuances of the infection mechanism. An example of this limitation is the comparison between SARS-CoV and SARS-CoV-2, which have different rates of infection even though they share similar Spike-ACE2 affinities [12]. These facts lead us to consider the contribution of the dynamics of the Spike protein to the infection process.
Computational structural biology methods have grown in both accuracy and usability over the years and are increasingly accepted as part of an integrated approach to tackle problems in molecular biology. Such integration speeds up research, reduces the need for infrastructure, reagents and human resources, and allows us to evaluate increasingly large data sets. Computational approaches are being used extensively in the study of SARS-CoV-2 and its mechanisms of infection [13-15]. Among these, we highlight studies of the dynamic properties of the Spike protein, of antibody recognition and of the search for therapeutic interventions [16-18].
Several aspects of the dynamics of the Spike protein are being currently studied, with a range of particular goals: to evaluate the docking of small molecules to the RBD domain [19], to search for alternative target binding-sites for vaccine development [20], to understand residue-residue interactions and their effects on conformational plasticity [21] and to investigate the flexibility of different domains in particular conformational states [22].
Normal Mode Analysis (NMA) methods are being employed in the study of different conformational states [23] and of different coronavirus variants [24]. These methods, however, are limited with respect to their ability to study the effects of mutations on dynamics due to the fact that such methods are either extremely taxing on computational resources (e.g., molecular dynamics) or agnostic to the nature of amino acids (e.g., traditional coarse-grained NMA methods). In the past, our group developed a coarse-grained NMA method called ENCoM (for Elastic Network Contact Model) that is more accurate than alternative coarse-grained NMA methods due to the explicit consideration of the chemical nature of amino acids and their interactions and consequently their effect on dynamics [25]. ENCoM performs better than other NMA methods on traditional applications and is the only coarse-grained NMA method capable of predicting the effect of mutations on protein stability and function as a result of dynamic properties [26-28].
In this study, we use the ENCoM method to study the dynamics of the Spike protein, considering different conformational states and several sequence variants observed during the current pandemic, as well as through large-scale analysis of in silico mutations. Experimental analysis of the effect of the SARS-CoV-2 Spike mutation D614G and the comparison between SARS-CoV and SARS-CoV-2 Spike proteins show unique dynamic characteristics that correlate with epidemiological and experimental data on infection. The present work shows that we can replicate such results computationally, suggesting that rigidity or flexibility of different Spike conformational states affects infectivity. We present a high throughput analysis of simulated single amino acid mutations on dynamic properties to seek potential hotspots and individual Spike variants that may be more infectious and therefore may guide public health decisions if such variants were to appear in the population. We also introduce a Markov model of occupancy of molecular states with transition probabilities derived from our analysis of dynamics that recapitulates experimental data on conformational state occupancies. This is the first application of an NMA method that derives transition probabilities from normal modes and employs them in a dynamic system to predict the occupancy of different conformational states. We model the occupancy of several variants and highlight those that may be useful in studying future epidemiological trends that can be responsible for new outbreaks.
2. Methods
2.1 Spike protein models
We performed our analyses using the cryo-EM structures of the SARS-CoV-2 Spike protein in the open (PDB ID 6VYB) and closed (PDB ID 6VXX) states. The open (prefusion) state was designed with an abrogated Furin S1/S2 cleavage site and two consecutive proline mutations that improve expression [29]. Despite the mutations, the engineered structures correctly represent the conformational states of Spike, as confirmed by independently solved structures [23,30]. The PDB structures used for the SARS-CoV comparison were 5X58 and 5X5B for the closed state and the one-RBD-open state, respectively [31].
We removed heteroatoms, water molecules, and hydrogen atoms from the PDB structures. Missing residues were reconstructed using template-based loop reconstruction and refinement with Modeller [32]. Single amino acid mutants were generated using FoldX4 [33]. ΔΔSvib and occupancy calculations were performed with reconstructed closed and one-RBD-open structures using 6VXX and 6VYB as templates. These engineered structures contain the GSAS sequence in the Furin cleavage site as well as two prolines in positions 986 and 987. In order to minimize potential artefacts in the calculations due to modelling errors, we chose to model all mutations and perform all subsequent calculations using the above engineered structures and sequences unless otherwise noted. That is to say, when we refer to the wild type SARS-CoV-2 Spike protein in our calculations, it is the Spike protein with the above alterations in the Furin recognition site as well as the pair of prolines. The alternative would have required modelling 6 additional mutations to ‘de-engineer’ the structures of the open and closed states, increasing the possibility of modelling artefacts.
For the parameter fitting used in the calculation of occupancies, we utilized the following experimentally determined structures for which occupancy data exists as follows (acronyms described in results): S-GSAS/WT: 7KDG,7KDH; S-GSAS/D614G: 7KDI,7KDJ [30]; S-R/x2: 6ZOX; S-R/PP/x1: 6ZOY,6ZOZ; S-R: 6ZP0; S-R/PP: 6ZP1,6ZP2 [34].
2.2 Dynamic analyses
We analysed dynamic properties of the Spike protein with ENCoM [25]. ENCoM employs a potential energy function that includes a pairwise atom-type non-bonded interaction term and thus makes it possible to consider the effect of the specific nature of amino-acids on dynamics. NMA explores protein vibrations around an equilibrium conformation by calculating the eigenvectors and eigenvalues associated with different normal modes [35-37]. Representing each protein residue as a single point, for a given conformation of a protein with N amino acids, we obtain 3N −6 nontrivial eigenvectors. Each eigenvector represents a linear, harmonic motion of the entire protein in which each amino acid moves along a unique 3-dimensional Euclidean vector. The associated eigenvalues rank the eigenvectors in terms of energetic accessibility, lower values corresponding to global, more easily accessible motions.
NMA calculations allow us to computationally estimate b-factors associated with the protein structure, as shown in Equation 1 for the ith residue, which in turn are related to local flexibility. Higher predicted b-factors denote more flexible positions. Individually calculated b-factors are combined in a vector for a protein sequence or part thereof and called Dynamic Signature.
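As a hedged illustration of how such predicted b-factors can be obtained, the sketch below uses the form common to coarse-grained NMA, in which the value for residue i is the sum over nontrivial modes of the squared displacement of that residue divided by the eigenvalue of the mode. The function name, the flat (x, y, z) layout of the eigenvectors and the toy values are assumptions made for illustration; they are not taken from ENCoM or NRGTEN.

```python
def predicted_b_factors(eigenvalues, eigenvectors):
    """Relative flexibility per residue from normal modes (illustrative sketch).

    eigenvalues  : nontrivial eigenvalues, all strictly positive
    eigenvectors : one flat list of 3N displacements per mode,
                   ordered (x1, y1, z1, x2, y2, z2, ...)  [assumed layout]
    """
    n_residues = len(eigenvectors[0]) // 3
    b_factors = [0.0] * n_residues
    for lam, mode in zip(eigenvalues, eigenvectors):
        for i in range(n_residues):
            x, y, z = mode[3 * i: 3 * i + 3]
            # Low-eigenvalue (easily accessible) modes contribute the most.
            b_factors[i] += (x * x + y * y + z * z) / lam
    return b_factors


# Toy system with two residues and two modes (hypothetical numbers).
toy_eigenvalues = [1.0, 2.0]
toy_modes = [
    [1, 0, 0, 0, 0, 0],  # mode 1 moves residue 1 only
    [0, 0, 0, 0, 1, 0],  # mode 2 moves residue 2 only
]
print(predicted_b_factors(toy_eigenvalues, toy_modes))  # prints [1.0, 0.5]
```

The list returned for a full protein, or for a segment of it, plays the role of the Dynamic Signature described above.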
The eigenvectors and associated eigenvalues can also be used to obtain a variable called vibrational entropy that can be used to compare the relative stability of two states. For example, by measuring the difference of vibrational entropy (ΔSvib) between a mutant and a wild type (WT), one can calculate how much a mutation affects the overall flexibility and stability of the mutant relative to the WT. The ΔSvib value predicted by ENCoM is positive when the mutation makes the protein more flexible and negative when the mutation makes the protein more rigid. The differences between the ΔSvib values for closed and open states were calculated for each mutant (ΔΔSvib = ΔSvib (open) – ΔSvib (closed)) in order to evaluate individual mutations according to a single score. Vibrational Entropy calculations are dependent on the thermodynamic β factor, that for pseudo-physical models such as ENCoM serves as a scaling factor. This term was optimized to fit experimental Gibbs free energy differences [38] and established as β = 1. The vibrational contribution of the entropic components of the free energy is calculated as described in Eq. 2 [39] in units of J.K-1, where N is the total number of amino acids in the protein, vi is the vibrational frequency and KB is the Boltzmann constant. Equation 3 shows the association between eigenvalues and vibrational frequency.
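The expressions of Eq. 2 and Eq. 3 are not reproduced in the text, so the following is only a minimal sketch under stated assumptions: it uses the standard harmonic-oscillator entropy summed over the nontrivial modes, with the frequency of each mode taken as the square root of its eigenvalue divided by 2π, and with all physical constants absorbed into the scaling factor β. The result is therefore in units of the Boltzmann constant, and the function name is hypothetical.

```python
import math


def vibrational_entropy(eigenvalues, beta=1.0):
    """Vibrational entropy in units of k_B (illustrative sketch).

    Assumptions: frequency nu_i = sqrt(lambda_i) / (2*pi); Planck's constant
    and temperature are folded into the scaling factor beta.
    """
    total = 0.0
    for lam in eigenvalues:
        nu = math.sqrt(lam) / (2.0 * math.pi)
        x = beta * nu
        # Harmonic-oscillator entropy: x / (e^x - 1) - ln(1 - e^-x)
        total += x / math.expm1(x) - math.log(-math.expm1(-x))
    return total


# Stiffer modes (larger eigenvalues) carry less vibrational entropy.
soft = vibrational_entropy([1.0, 2.0])
stiff = vibrational_entropy([4.0, 8.0])
print(soft > stiff)
```

Evaluating such a function for the mutant and for the wild type in the same conformational state, and taking the difference, gives a ΔSvib for that state; the difference between the open-state and closed-state values then gives ΔΔSvib as defined above.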
The Najmanovich Research Group Toolkit for Elastic Networks (NRGTEN) [38], with the latest implementation of ENCoM, also includes a function to evaluate state occupancies by calculating transition probabilities between different states. A probability Pj of moving along each eigenvector j can be obtained using a Boltzmann distribution given its associated eigenvalue λ j and a scaling factor γ.
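The exact expression used in NRGTEN is not reproduced in the text. As a minimal sketch, assuming the usual Boltzmann form in which the weight of mode j is proportional to exp(−λj/γ) and the weights are normalised over all modes, the probabilities could be computed as follows. The function name is hypothetical and does not correspond to an NRGTEN call.

```python
import math


def mode_probabilities(eigenvalues, gamma):
    """Boltzmann probability of moving along each mode (illustrative sketch).

    Assumes P_j proportional to exp(-lambda_j / gamma), normalised over modes.
    """
    # Subtracting the smallest eigenvalue leaves the probabilities unchanged
    # and avoids numerical underflow when gamma is small.
    lam_min = min(eigenvalues)
    weights = [math.exp(-(lam - lam_min) / gamma) for lam in eigenvalues]
    partition = sum(weights)
    return [w / partition for w in weights]


probs = mode_probabilities([0.5, 1.0, 2.0], gamma=1.0)
print(probs)
```

Under this assumption, smaller values of γ concentrate the probability on the lowest-eigenvalue modes, whereas larger values spread it more evenly across modes.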
Consider two conformations A and B of the same protein and the vector EA→B, which represents the conformational change going from conformation A to conformation B. The overlap between each normal mode Mj computed from conformation A and the EA→B vector is a value between 0 and 1 describing how well that normal mode recapitulates the conformational change required to go from one state to the other [40].
We can then calculate the transition probability of going from conformation A to conformation B as the weighted sum of the Boltzmann probability Pj of each normal mode Mj times the overlap between that normal mode and the conformational change EA→B.
The reverse probability PB→A can be computed in the same fashion, giving an indication of which conformation is favored between the two.
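The two steps above can be sketched in a few lines. The overlap is taken here as the absolute value of the normalised dot product between a mode and the displacement vector, which is the usual definition and is assumed to match the one in [40]; the transition probability is then the sum over modes of the Boltzmann probability times the overlap. Function names and toy vectors are hypothetical.

```python
import math


def overlap(mode, displacement):
    """Overlap in [0, 1] between one normal mode and a conformational change."""
    dot = sum(m * e for m, e in zip(mode, displacement))
    norm_mode = math.sqrt(sum(m * m for m in mode))
    norm_disp = math.sqrt(sum(e * e for e in displacement))
    return abs(dot) / (norm_mode * norm_disp)


def transition_probability(mode_probs, modes, displacement):
    """Sum over modes of P_j times the overlap of mode j with the change A -> B."""
    return sum(p * overlap(m, displacement)
               for p, m in zip(mode_probs, modes))


# Toy example: the first mode points exactly along the conformational change
# and the second is orthogonal to it (hypothetical numbers).
modes_from_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
change_a_to_b = [2.0, 0.0, 0.0]
print(transition_probability([0.75, 0.25], modes_from_a, change_a_to_b))  # prints 0.75
```

The reverse probability uses the modes computed from conformation B together with the displacement vector pointing from B to A.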
A simple way of computing the occupancies of these conformations from the transition probabilities is to use a Markov model. Each conformation is represented by a state, and the transition probabilities between states are computed as described above. We add a constant k to all states as the probability of staying in that state. Since all states must have outgoing transition probabilities that sum to 1, we normalize these values after the addition of k. For a two-state Markov chain representing the open and closed states of the Spike protein, we obtain the diagram shown in Figure 2. All transition probabilities are computed using ENCoM and Eq. 6. The parameters k and γ need to be optimized for the system being studied as they are not directly coupled to physical quantities because of the pseudo-physical, coarse-grained nature of the ENCoM model. Once the parameters are set, there is a unique equilibrium solution that gives the occupancies of the two states. This approach could be easily generalized to a Markov model with more than two states, where the transition between any two states is computed exactly as described above if that transition is deemed possible.
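For the two-state case, the equilibrium occupancies have a closed form, which the following sketch computes. It assumes that each state has a self-transition weight k and a single outgoing transition, that the two are normalised to sum to 1, and that the occupancies are the stationary distribution of the resulting chain. The function name and input values are hypothetical and are not results from this work.

```python
def two_state_occupancies(p_closed_to_open, p_open_to_closed, k):
    """Equilibrium occupancies (closed, open) of a two-state Markov chain.

    Each state stays put with weight k and leaves with its raw transition
    probability; both are normalised so that each row sums to 1.
    """
    leave_closed = p_closed_to_open / (k + p_closed_to_open)
    leave_open = p_open_to_closed / (k + p_open_to_closed)
    total = leave_closed + leave_open
    # Stationary distribution of a two-state chain: each state is occupied
    # in proportion to the rate of arriving from the other state.
    occupancy_open = leave_closed / total
    occupancy_closed = leave_open / total
    return occupancy_closed, occupancy_open


closed, opened = two_state_occupancies(0.2, 0.6, k=0.5)
print(closed, opened)
```

With more than two states the same normalisation applies to each row of the transition matrix, and the occupancies are obtained as the stationary distribution of that matrix rather than from a closed-form ratio.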
3. Results and Discussion
3.1. Dynamic Signature of different Spike variants
3.1.1. G614 and D614 dynamic comparison
An important event in the progression of the COVID-19 pandemic was the appearance of the D614G variant in mid-February 2020 in Europe. The fast spread of this variant raised the possibility that this mutation conferred advantages relative to other forms of the virus in circulation at the time [41,42]. Studies revealed that the variant indeed has greater infectivity, triggering higher viral loads [43,44]. Several hypotheses have emerged to explain the mechanisms behind this higher infectivity, primarily focused on possible effects on the Furin cleavage site [30,45,46], but more recently also considering possible important dynamic differences [44,47,48].
In order to test if Dynamic Signatures reveal differences between Spike variants, we analysed the 13741 sequences of the protein available on May 08 in the COVID-19 Viral Genome Analysis Pipeline, enabled by data from GISAID [49,50]. The mutant Spike proteins harboring mutations (Table S1) were modelled in the open and closed states. Dynamic Signatures were calculated for each mutant in both states and clustered (Figure 3). Mutations in positions that had no occupancy in the original templates used for the open and closed states (positions 5, 8, and 1263) were ignored.
Analysis of the effect of mutations on the Dynamic Signature shows that the D614G mutation produces similar dynamic patterns largely independent of the other mutations accumulated, and that these patterns are distinct from those of the wild type and other mutants in both the open and closed states. The dynamic characteristics of D614G are very specific and cannot be obtained with random mutations (Figure S1, Table S2). Performing the clustering using segments of the Dynamic Signature representing lengths of 100 amino acids identifies a section of the Spike protein, from around position 250 to around position 750, that is responsible for the unique characteristics that the mutation D614G confers to the dynamics of the Spike protein (data not shown). This section of Spike includes part of the N-Terminal Domain (NTD) and all of the RBD domain.
When checking the difference between the Dynamic Signatures of the wild type D614 and the mutant G614 we observe that for the closed conformation, the pattern tends towards negative values, indicating that this mutation makes the closed state more flexible, especially around the position of the mutation. On the other hand, for the open B chain conformation the pattern is positive for the open RBD, the same chain NTD and the adjacent chain NTD, indicating that this mutation makes these areas of the open conformation more rigid (Figure 4).
This result led us to hypothesize that a more flexible closed state would favor the opening of Spike and that a more rigid open state would disfavor its closing, thus shifting the conformational equilibrium towards the open state and favouring interaction with ACE2, leading to increased cell entry. Mutating position 614 to every other amino acid, we observe a correlation in the closed state between residue size and flexibility. Namely, smaller amino acids tend to make the closed state more flexible. However, we do not observe the opposite effect on the open state. Mutation of D614 to Glutamine, which is similar to Aspartate, barely shows any effect. Nevertheless, we can see that other amino acids have a similar effect as Glycine, such as Proline and Threonine (Figure S2).
3.1.2. Comparison of the Dynamic Signatures of Spike from SARS-CoV and SARS-CoV-2
It has been previously observed that RBD flexibility in SARS-CoV influences binding to ACE2 and facilitates fusion with host cells [51]. Thus, considering the lesser infectivity of SARS-CoV relative to SARS-CoV-2 and our aforementioned results for the D614G mutation, we expected the SARS-CoV Spike to be more rigid in the closed state and more flexible in the open state relative to Spike from SARS-CoV-2. This is indeed the case (Figure 5). The dynamic signature values of SARS-CoV are smaller than those of SARS-CoV-2 in several areas throughout the closed structure, indicating that when in the closed state, the SARS-CoV Spike protein is more rigid. For the open state we can see that SARS-CoV open RBD and adjacent NTD are significantly more flexible than for SARS-CoV-2 Spike.
3.2. Vibrational entropy
It is possible to combine the trend of a Dynamic Signature into a single value to represent the overall flexibility of any given mutation and compare it to the WT. This can be achieved with ΔSvib, calculated with Eq. 2 for each state (see materials and methods). For any given state, positive ΔSvib values represent mutants that relative to the wild type make the protein more rigid, whereas negative values of ΔSvib describe mutations that cause the protein to be more flexible in the given state relative to the wild type. In the case of the mutation D614G, we obtain ΔSvib (open) = 5.26×10−2 J.K-1 and ΔSvib (closed) = −9.27×10−2 J.K-1 with a ΔΔSvib (calculated as ΔSvib (open) – ΔSvib (closed)) of 1.45×10−1 J.K-1.
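As a worked check of the arithmetic, using only the D614G values quoted above:

```latex
\Delta\Delta S_{vib}
  = \Delta S_{vib}(\text{open}) - \Delta S_{vib}(\text{closed})
  = 5.26\times10^{-2} - \left(-9.27\times10^{-2}\right)
  = 1.453\times10^{-1}
  \approx 1.45\times10^{-1}\ \mathrm{J\,K^{-1}}
```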
We generated in silico the 19 possible single mutations in each position from residue 14 to residue 913 and calculated ΔSvib (open), ΔSvib (closed) and ΔΔSvib. Other positions were ignored due to uncertainties in modelling or the fact that they are not expected to have a pronounced effect on dynamics [23]. It should be noted that Spike cannot accommodate the vast majority of such single mutations, particularly in its core, as these would lead to unstable or misfolded conformations. However, those that occur near the surface are more likely to represent single residue variations of the Spike protein that lead to a stable, correctly folded protein. Therefore, the stability of specific mutations highlighted in this work, unless otherwise stated (such as those already observed experimentally or within the RBD domain as stated below), needs to be validated experimentally.
The heatmap in Figure 6A shows ΔSvib values associated with mutations on the closed conformational state (left) and open conformational state (right). Lighter colors represent high ΔSvib values, meaning that the specific mutant is more flexible than the WT, and darker colors represent low ΔSvib values, meaning that the specific mutant is more rigid than the WT. The second heatmap (Figure 6B) shows ΔΔSvib values, or Difference Scores, highlighting positions and specific mutations with great contrast between their effect on the open and closed states. In this representation, blue mutants are more rigid in the closed state and more flexible in the open state, therefore candidates for less infectious mutants, and red mutants are more flexible when closed and more rigid when open, candidates for more infectious mutants.
In Figure 7 we map ΔΔSvib values (Figure 6B) on the structure of Spike, colored according to the median value for each position with the same color scheme as the heatmap. Of the 17081 single mutations considered, we show the top 64 mutants with ΔΔSvib>0.3 (Tables 1 and S3) as well as the bottom 20 in terms of ΔΔSvib values (Table S3). The mutants with predicted open state occupancy higher than that of the wild type are presented in Table 1. The Dynamic Signature comparison for 3 of the most infectious candidates (Figure 8A) and 3 of the least infectious candidates (Figure 8B) shows some of the patterns that could lead to a greater or lesser effect on infectivity. For instance, in Figure 8A we can see that high scores can come from a large flexibility of the closed state, a very large rigidity of the open state, or a contribution of both. We can also observe that these effects can differ between chains and can more strongly affect the NTD, the RBD, or both. Finally, these single mutants also show how a point mutation can have widespread impacts across the protein.
3.3. Conformational state occupancies
We calculated forward and reverse transition probabilities between the open and closed states (Eq. 4, 5 & 6) from the calculated normal modes and used the Markov model described in Materials and Methods to calculate the equilibrium occupancies for each state in wild type and mutant Spike proteins. It is unclear if any additional conformational states other than those with either all three RBD domains in the closed state or only one RBD open state are biologically relevant. Specifically, Yurkovetskiy et al. [44] observed an occupancy for states with two or three RBD domains in the open conformation, but these were not observed by Gobeil et al. [30] and Xiong et al. [34] or taken into consideration in several other structural studies [20-24]. As such, we employ the two-state model shown in Figure 2, with one state representing all three RBD domains closed and the second state representing one RBD open. We calculated the robustness of this Markov occupancy model utilizing 60 different reconstructed structures, varying the positions of loops and with minor differences in the core structure, representing the closed state and the open state for each chain. The results are equivalent no matter what specific structural template is used to represent each of the two states above.
The Markov model calculation of occupancies requires two parameters (see Materials and Methods) that were optimized based on experimental data for six Spike variants. These variants were: S-GSAS/D614, an engineered Spike with the sequence GSAS in the furin cleavage site and no 614 mutation; S-GSAS/G614, with the same furin site modifications and the D614G mutation [30]; S-R, the Spike protein with the original furin site RRAR; S-R/x2, with added S383C and D985C mutations inducing a disulfide bond; S-R/PP, engineered with two prolines in positions 986 and 987; and S-R/PP/x1, in which the mutations G413C and V987C were introduced into the double-proline sequence to induce a disulfide bond [34]. It is worth stressing that all 6 variants used to calibrate the two parameters affecting the occupancy were modelled on the same open and closed state conformations. All differences in observed occupancies and the agreement with experimental occupancy data came about as a consequence of the effect of the mutations on the normal modes and derived transition probabilities and not as a result of structural differences between variants. We obtained a good fit to the experimental results with k and γ of 0.5 and 0.001, respectively (Pearson correlation = 0.89, p-value = 1.94×10−2). Predicted occupancies of the open and closed states for each of the six variants above, as well as the experimental data, are presented in Table 2.
We used these data to calculate occupancy differences for each variant (Figure 9). The range of variation of our predicted occupancies is small compared to that of experimental values. We believe that, given the limitations of our coarse-grained model as well as additional phenomena that ultimately affect occupancy, our predictions reflect only a fraction of the myriad factors contributing to the occupancy. Nonetheless, our predictions correctly capture the pattern of relative variations of occupancy observed in the experimental data. To ensure that the calculated correlation is not due to chance, we simulated random sets of occupancies for the 6 sequence variants and calculated simulated correlations for the 110 different combinations of k and γ, in order to determine whether the observed correlations represent an actual signal in the data or could be obtained randomly with different values for the parameters k and γ. We observed a marked shift towards higher correlations for the data representing our predicted occupancies when compared to the Gaussian noise data (Figure S3), suggesting that the predicted occupancies are not due to chance.
The computational resources needed for the calculation of occupancies for all 8250 mutations with ΔΔSvib>0 are beyond our current capabilities. We therefore set a threshold of ΔΔSvib>0.3 to select candidates for the calculation of occupancies. This threshold corresponds to 64 mutations (Table 1, in red). Using the parameters k and γ obtained above, we calculated occupancies for these 64 mutants as well as the 20 mutants with the lowest ΔΔSvib values (Table 1, in blue). In Figure 10A we show the difference in occupancy between the open and closed states using a non-linear scale adapted to better show the results around the wild type occupancy. Whereas ΔΔSvib values for particular mutations may hint at a more flexible closed state and more rigid open state, this is a global measure that may not reflect the necessary pattern of flexibility across the structure that leads to effective transition probabilities between the open and closed states. Yet, for the most part, ΔΔSvib can predict the shifts in occupancy, showing a clear distinction between the 64 mutants predicted using ΔΔSvib as shifting occupancy towards the open state and the 20 mutants predicted to shift the equilibrium towards the closed state (p-value=2.04×10−6). Figure 10B shows the location in the structure of the mutants in Table 1. We can see that the least infectious candidates (blue) are positioned in the interfaces between NTD and RBD domains, while the most infectious candidates, especially the ones validated by the occupancy prediction (dark red), are more concentrated in the interfaces between different RBD domains.
Residue G252 stands out as capable of accommodating a large number of mutations (C, D, E, H, M, P, Q, S, T, W) that shift the occupancy in favour of the open state. The fact that variants in this position do not seem to be prevalent in outbreaks to date points to the possibility that this position may be under additional functional constraints that prevent the emergence of variants. A number of other Glycine residues could also accept mutations that we predict to increase the occupancy of the open state: G72W; G404W; G413M; G416E,W; and G404I. In fact, three of the top four mutations are mutations on Glycine. A number of other potential mutations are adjacent to the Glycine residues above, namely R403S and K417D,E,G,P. Additionally, D467P,W and I468T are also in positions that are adjacent to others that can accommodate mutations that may lead to a conformational shift favouring the open state. The mutation that favours the open state the most in our calculations is N501W, with ΔSvib (open) = 6.02×10−1 J.K-1 and ΔSvib (closed) = 2.30×10−1 J.K-1 and a resulting ΔΔSvib value of 3.72×10−1 J.K-1, leading to occupancies, compared to those of the wild type (in parentheses), of 62.7% (25.8%) and 37.3% (74.2%) for the open and closed states respectively. It is important to stress, as discussed in methods, that the calculations are performed using structures containing a modified Furin recognition site and prolines in positions 986 and 987. Furthermore, the contribution of vibrational entropy changes is one among potentially several effects whose overall importance remains to be determined. Therefore, relative changes in occupancy are relevant whereas the specific values are less so.
The COG-UK consortium (https://www.cogconsortium.uk/about/) monitors the appearance and spread of new strains of SARS-CoV-2. COG-UK recently detected a strain containing the mutation N501Y that has been observed to be spreading rapidly at the time of writing. We believe that shifts in occupancy may be in part responsible for its emergence. According to our calculations, the N501Y mutant shows ΔSvib (open) = −1.60×10−2 J.K-1 and ΔSvib (closed) = 2.37×10−1 J.K-1, with ΔΔSvib = 2.53×10−1 J.K-1. The predicted occupancies for the N501Y mutant compared to those of the wild type (in parentheses) are 54.3% (25.8%) and 45.7% (74.2%) for the open and closed states, respectively. Therefore, the N501Y mutant shows a marked increase of the occupancy of the open state relative to other mutations. Additionally, this mutation was also shown to increase binding affinity to the ACE2 receptor relative to the wild type, with a Δlog10 (KD,app) of 0.24 [52]. Therefore, we predict that N501Y has a strong potential to contribute to increased transmission. The calculations above were performed in the context of D614. However, the double mutant representing the N501Y mutation in the context of G614 also shows an increase in the occupancy of the open state, to 35.06%. The recently observed A222V mutation [53], on the other hand, shows no propensity in our analysis to alter the occupancy of states, with a negative ΔΔSvib of −1.64×10−2 J.K-1. Predicted occupancies for A222 and V222 are nearly identical in the context of either D614 (WT) or the mutant containing G614.
Notice that N501Y has a ΔΔSvib value of 2.53×10−1 J.K-1 that is slightly below the 3.00×10−1 J.K-1 threshold, suggesting that there may be many other mutations with ΔΔSvib values below our set threshold that turn out to have augmented occupancies for the open state relative to the wild type.
D614G shows that changes in the occupancy of conformational states can impact infectivity despite unchanged or even weaker binding affinities [44]. A recent study [52] on binding and expression of Spike mutations within the RBD (positions 331 to 531) shows that several (but not all, see below) of the mutations that we predicted to have increased occupancy of the open state are associated with a decrease of binding affinity with ACE2. Incidentally, the data also show that the mutations in Table 1 within the RBD produce stable and properly folded Spike proteins. As shown for D614G, infection does not rely on binding affinity alone, and even a strain with higher dissociation rates from ACE2 can bring about fitness advantages.
The mutation N501W is predicted to have the largest effect in augmenting the occupancy of the open state relative to the wild type. This mutation is associated with stronger binding to ACE2 (Δlog10(KD,app)=0.11) [52] relative to the wild type Spike (but weaker than N501Y). Furthermore, N501W appears to have increased expression relative to the wild type, with a Δlog(MFI) of 0.1 compared to a decrease in relative expression of −0.14 for N501Y [52]. The authors note that changes in expression correlate with folding stability [52]. However, even with a Δlog(MFI) of −0.14, N501Y is viable and spreading. Therefore, N501W might be even more stable and infective.
We consider all mutations with increased predicted occupancy of the open state in Table 1 as good candidates for further experimental validation to better understand the role of binding and dynamics of Spike and their role in SARS-CoV-2 infectivity. Furthermore, we suggest that their appearance in outbreaks should be closely monitored.
3.4. SARS-CoV-2 Variants B.1.1.7 and 501.V2
The mutation N501Y above appears in both the B.1.1.7 variant first observed in the UK [54] and the 501.V2 variant first observed in South Africa [55], both of which are rapidly spreading around the globe. These two strains contain additional mutations in Spike. Namely, B.1.1.7 contains N501Y, A570D, D614G, P681H, T716I, S982A, D1118H and deletions at positions 69, 70 and 144. As the number of normal modes is related to the number of amino acids, we are unable to model deletions while still making comparisons with the wild type strain given the nature of the quantities calculated (Eqs. 2 and 6). Therefore, the deletions of three residues at positions 69, 70 and 144 that are present in B.1.1.7 were not modelled here. 501.V2 includes the mutations L18F, D80A, D215G, R246I, K417N, E484K, N501Y, D614G and A701V. The dynamic signatures for both B.1.1.7 and 501.V2 show a strong rigidification of the open state and added flexibility of the closed state (Supplementary Figures S4 and S5, respectively), leading to ΔΔSvib values of 5.30×10−1 J.K-1 and 6.45×10−1 J.K-1 and open state occupancies of 36.2% and 35.8% for B.1.1.7 and 501.V2, respectively. Both variants show an increase in occupancy of approximately 40% relative to the wild type (25.8%). Despite our preference for modelling the smaller number of mutations, and therefore using the engineered structure containing the modified Furin binding site and proline modification, we also modelled B.1.1.7 (except the deletions) and 501.V2 using the original sequence of Spike. In that case we obtain 33.0% and 33.6% occupancy for B.1.1.7 and 501.V2, respectively.
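The "approximately 40%" figure is a relative, not an absolute, change in open-state occupancy. As a small worked check, the following Python sketch recomputes it from the occupancies quoted above; the function name `relative_increase` is a hypothetical helper introduced here for illustration.

```python
WT_OPEN = 25.8  # wild-type open-state occupancy (%), as quoted in the text

def relative_increase(variant_pct, reference_pct=WT_OPEN):
    """Change in occupancy expressed as a percentage of the reference value."""
    return 100.0 * (variant_pct - reference_pct) / reference_pct

# Open-state occupancies (%) reported for the engineered structure.
occupancies = {"B.1.1.7": 36.2, "501.V2": 35.8}

for name, open_pct in occupancies.items():
    print(f"{name}: {relative_increase(open_pct):.1f}% relative increase")
```

This gives roughly 40.3% for B.1.1.7 and 38.8% for 501.V2, corresponding to absolute gains of 10.4 and 10.0 percentage points.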
3.5. Polyclonal human serum antibody escape
Recently the Bloom group utilised human serum antibodies from subjects that recovered from COVID-19 and tested mutations in the RBD for their capacity to escape recognition [56], i.e. mutations leading to weaker binding to polyclonal serum antibodies. The observed patterns of escape vary between subjects, but a number of positions and specific mutations at those positions are relevant to the present study. Positions Y369, N448, F456, Y473 and F486 are noteworthy, as specific mutations at these locations not only allow varying levels of escape in particular subjects [56] but also lead to positive values of ΔΔSvib above a threshold of 0.1 J.K-1 (Table S4). Among these, the mutations N448G; Y473 mutations to A, Q and T; and lastly, F486E all show occupancy of the open state modestly higher than that of the wild type (Table S4). The mutations noted, by virtue of potentially increasing infectivity as well as displaying varying levels of escape from immune responses, may give the virus an evolutionary edge and therefore should be closely monitored.
3.6. Data Availability
Raw data and structures used to build the images presented here are available in a GitHub repository (https://github.com/nataliateruel/data_Spike). All vibrational entropy results are available for visualisation and analysis through a link to the dms-view open-access tool, available on GitHub [57] through the same URL above. On dms-view, it is possible to visualise the effects of different mutations for each residue of the Spike protein and visualise these on the 3D structure of Spike. Each site has 20 ΔΔSvib values, one of them being zero (corresponding to the amino acid found in the wild type). The option max will show the top ΔΔSvib score for each position. It therefore shows which mutation at that specific position represents the candidate with the highest predicted infectivity, as defined here in terms of a propensity to higher occupancy of the open state. The option min will show the lowest score for each position and the mutation associated with the least infectious predicted candidate. The option median returns the median score, presenting a general trend for any given position, and var shows the variance between the results for each position, highlighting sites in which mutations to different residues lead to a broader range of ΔΔSvib values. Furthermore, for the mutations for which occupancy was calculated, the data can be accessed through the same menu. As new occupancy data are calculated, they will be added to this resource. Readers interested in the occupancy of particular mutations not yet available are invited to contact the authors via email or through the GitHub repository. When selecting a specific point on the first panel, it is possible to access all ΔΔSvib values on the second panel and see the highlighted position in 3D on the structural representation.
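The four per-site summaries described above can be sketched in a few lines of Python. This is only an illustration of what max, min, median and var mean for one site, not the dms-view implementation: the function `site_summary` is hypothetical, the use of the population variance is an assumption, and the example input contains only the three ΔΔSvib values for position 501 quoted in the text (wild type N, Y and W) rather than all 20.

```python
from statistics import median, pvariance

def site_summary(scores):
    """Summarise one site. `scores` maps amino acid -> ΔΔSvib (J.K-1),
    with the wild-type amino acid set to 0.0."""
    values = list(scores.values())
    best = max(scores, key=scores.get)   # mutation with the top ΔΔSvib
    worst = min(scores, key=scores.get)  # mutation with the lowest ΔΔSvib
    return {
        "max": (best, scores[best]),
        "min": (worst, scores[worst]),
        "median": median(values),
        "var": pvariance(values),  # assumption: population variance
    }

# Partial data for position 501: only the values quoted in the text.
site_501 = {"N": 0.0, "Y": 0.253, "W": 0.372}
summary = site_summary(site_501)
print(summary["max"], summary["median"])
```

Because 17 of the 20 amino acids are omitted, the min, median and var obtained here are not the true values for the site; only the max entry (W) matches the statement in the text that N501W is the most favourable mutation in the calculations.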
The Najmanovich Research Group Toolkit for Elastic Networks (NRGTEN) including the latest ENCoM implementation is freely available at (https://github.com/gregorpatof/nrgten_package).
4. Conclusions
SARS-CoV-2 mutations are still arising and spreading around the world. The A222V mutation, reportedly responsible for many infections, emerged in Spain during the summer of 2020 and has since spread to neighbouring countries [53]. In Denmark, new strains related to SARS-CoV-2 transmission in mink farms were confirmed in early October by the WHO and shown to be caused by specific mutations not previously observed, with the novelty of back-and-forth transmission between minks and humans [58]. A new strain containing N501Y first appeared in the UK and is now on the rise worldwide at the time of writing. Such occurrences point to the possibility that new mutations in SARS-CoV-2 may bring about more infectious strains.
Using the methods described in this paper, it is possible to predict potential variants that might have an advantage over the wild type virus, insofar as these are the result of changes in occupancy of states and within the limitations of the simplified coarse-grained model employed here. In our analyses, flexibility properties and conformational state occupancy probabilities contribute to the infectivity of SARS-CoV-2. Our results explain the behaviour of the D614G strain and the increased infectivity of SARS-CoV-2 relative to SARS-CoV, and offer a possible explanation for the rise of new strains such as those harbouring the N501Y mutation.
The results we present on SARS-CoV-2 Spike mutations have several limitations. First and foremost, some of the in silico mutations discussed may not be thermodynamically stable, or may affect expression, cleavage, or binding to ACE2. In addition, our approach does not consider that Spike is, in fact, a glycoprotein and that the sugar molecules may have an effect on dynamics. However, the remarkable agreement between our model and experimental observations shows that the simplified model of Spike and the coarse-grained methods used here allow us to calculate dynamic properties of Spike that are relevant to understanding infection and epidemiological behaviour. It is important to keep in mind that all of the mutations that we discuss in Table 1 that lie between positions 331 and 531 within the RBD were already experimentally validated and are viable [52]. However, we highlight the need for experimental validation of our predictions, particularly for those candidates that we believe would help elucidate the extent of the effect of the conformational dynamics of Spike on infectivity. Beyond in vitro biophysical studies, experimental alternatives exist, such as using pseudo-type viruses or virus-like particles, that would not require studying gain-of-function mutations using intact viruses. Alternatively, loss-of-function mutations can be created with intact viruses and compared to the wild type SARS-CoV-2 virus to validate the role of dynamics in infectivity.
To the best of our knowledge, this is the first time that a Normal Mode Analysis method has been used to model the effect of mutations on the occupancy of conformational states. This opens a new opportunity in computational biophysics to create dynamic models of transitions between conformational states of proteins that are based on physical properties and sensitive to sequence variations. We hope that our results help public health surveillance programs decide on the risk posed by new strains, help inform the research community in understanding SARS-CoV-2 infection mechanisms and open new possibilities in computational biophysics to study protein dynamics.
Acknowledgements
OM is the recipient of a PhD fellowship from the Fonds de Recherche du Québec − Nature et Technologie (FRQ-NT). RN is a Fonds de Recherche du Québec - Santé (FRQ-S) Senior Fellow, a member of the Réseau Québécois de Recherche sur les Médicaments (RQRM) and the Quebec Network for Research on Protein Function, Engineering and Applications (PROTEO). The authors would like to dedicate this work to the memory of Mordechai Najmanovich, Z”L, father of RN, who passed away from complications due to COVID-19 on November 26, 2020. RN would like to thank all healthcare workers, particularly ICU nurses and physicians at the Avista Adventist Hospital in Louisville, Colorado, for their efforts.
Footnotes
1. We updated the abstract.
2. We added redistributed text into a new section 3.4 to clarify the text.
3. We added a new section 3.5 with new data.
4. We updated the data in the introduction regarding the most recent numbers of COVID-19 infections and deaths worldwide.
5. We added two figures and a table to the supplementary data.
6. We removed the supplementary data from the main manuscript file into its own file.