Bifurcations and mutation hot-spots in SARS-CoV-2 spike protein

The spike protein is a most promising target for the development of vaccines and therapeutic drugs against the SARS-CoV-2 infection. But the apparently high rate of mutations makes the development of antiviral inhibitors a challenge. Here a methodology is presented to try and predict mutation hot-spot sites, where a small local change in spike protein’s structure can lead to a large scale conformational effect, and change the protein’s biological function. The methodology starts with a systematic physics based investigation of the spike protein’s Cα backbone in terms of its local topology. This topological investigation is then combined with a statistical examination of the pertinent backbone fragments; the statistical analysis builds on a comparison with high resolution Protein Data Bank (PDB) structures. Putative mutation hot-spot sites are identified as proximal sites to bifurcation points that can change the local topology of the Cα backbone in an essential manner. The likely outcome of a mutation, if it indeed occurs, is predicted by a comparison with residues in best-matching PDB fragments together with general stereochemical considerations. The detailed methodology is developed using the already observed D614G mutation as an example. This is a mutation that could have been correctly predicted by the present approach. Several additional examples of potential hot-spot residues are identified and analyzed in detail, some of them are found to be even better candidates for a mutation hot-spot than D614G. Significance statement A novel approach to predict mutation hot-spots in SARS-CoV-2 spike protein is presented. The approach introduces new topology based techniques to biophysical protein research. For a proof-of-concept the approach is described with the notorious D614G mutation of the spike protein as an example. It is shown that this mutation could have been correctly predicted by the present methods. Several additional mutation hot-spots are then identified and a number of them are shown to be topologically similar to the observed D614G mutation. The methodology can be used to design effective drugs and antibodies against the spike protein. It can also be employed more generally, whenever one needs to search for and identify mutation hot-spots in a protein.


INTRODUCTION
The COVID-19 is a global public health emergency that continues to spread across the world. The disease is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). It belongs to the same β genus with the SARS-CoV and MERS-CoV viruses that caused epidemics in 2002 and 2012 respectively. All three viruses are notorious for causing a disease with severe symptoms and even death. Several studies of the SARS-CoV-2 virus have been published including investigations on the source of infection [1][2][3][4][5], the mechanism of transmission [6][7][8][9][10], and the structure and function of its various proteins [11][12][13][14][15][16][17].
Of particular interest to vaccine and therapeutic drug development is the transmembrane spike glycoprotein that assembles into a homo-trimer to cover the virion surface, giving the virus its distinctive crown-like look. The spike protein monomer starts with a short signal peptide at the N-terminal, followed by two larger subunits called S1 and S2. The subunits mediate binding to host cell and membrane fusion, respectively. The Figure 1 identifies the major functional domains in a monomeric component of the spike protein [11,12]: The S1 subunit consists of residues between the sites 14-685. It starts with a N-terminal domain (NTD, residues 14-305) that participates to conformational changes during attachment. The NTD is followed by the receptor binding domain (RBD, residues 319-541) that initiates the viral entry by recognizing and binding to the host receptor angiotensin-converting enzyme 2 (ACE2). The junction segment between the S1 and S2 subunits includes several cleavage sites that can be hydrolyzed and cleaved by a variety of host proteases including furin and in particular TM protease serine 2, that prime the spike protein for membrane fusion [18]. The S2 subunit comprises the rest of the protein. It starts with the fusion peptide (FP, residues 788-806) that initiates the fusion. The two heptapeptide repeat sequences HR1 (residues 912-984) and HR2 (residues 1163-1213) form a six-helical bundle in the homotrimer that is essential for the viral fusion and entry. The transmembrane domain (TM with residues 1213-1237) anchors the protein to the viral membrane, and the S2 subunit ends in a cytoplasm domain tail (CT, residues 1237-1273).
There are two principal conformational states in the spike protein, called the closed state and the open state. The major difference between these two states is in the relative positioning of the RBD [13]. The closed state is the native prefusion state, in this state the RBD is surrounded by the NTD. The virus is also glycosylated with polysaccharide molecules that help it evade the host immune system. When the virus interacts with the host cell, the spike protein transits from the closed state to the open state with a major conformational change taking place in the S1 subunit. In the open state the RBD becomes exposed and can bind with the ACE2 of the cell membrane. The proteases of the host cell membrane then search for a cleavage site in the junction region between the S1 and S2 subunits, and hydrolyze the protein by cutting off the S1 subunit. This exposes the FP that resides in the head of the S2 subunit, and the FP undergoes a conformational change inserting itself into the host cell membrane. The fusion core that forms the junction between the HR1 and the HR2 then bends, and the viral envelope and the cell membrane become pulled together for fusion that completes the cell invasion [19].
The spike protein has a key role in processes that range from receptor recognition to viral attachment and entry into host cell. Therefore it has become a prevalent target in both vaccine and therapeutic research that combat the SARS-CoV-2 virus. In particular the RBD domain and the HR1/HR2 domain are in the prime foci of these endeavors. It appears that the RBD is part of a highly mutable region, and a mutation in the RBD can change the interactions between the spike protein and the ACE2 of the host cell membrane.
With limited understanding of mutation effects, the RBD can be a demanding target for the development of broad-spectrum antiviral inhibitors. The HR1/HR2 domain plays an important role in the infection itself. It is assumed to be relatively conserved, which makes the HR1/HR2 domain an attractive target for the development of fusion inhibitors to prevent a SARS-CoV-2 infection [20].
Apparently, there are many putative mutations that can take place in the SARS-CoV-2 virus. But most of them are likely biologically insignificant. This is because many mutations can only minimally alter the shape of the affected protein. Therefore they tend to have a limited impact on its biological functionality. Many mutations are also right out harmful to the virus and thus do not prevail. Probably the most prominent mutation that has taken place in the spike protein with documented epidemiological consequences, is the D→G substitution that occurred at its site 614, near the junction between the subunits S1 and S2. Apparently this mutation converted SARS-CoV-2 into a more transmissible form, and the D614G mutated virus now dominates in the global COVID-19 pandemic [21]: Since the G amino acid has a smaller size than the D amino acid, it is plausible that the D→G substitution increased the conformational flexibility of the spike protein around the S1-S2 junction domain. This may have enhanced the efficiency of the RBD to bind with the ACE2, increasing the exposure of the S1-S2 junction for cleavages by proteases of the host cell.
Here the goal is to introduce methodology that can help to identify those sites along the spike protein backbone, where a small localized change can cause a large conformational effect. If a mutation takes place at such a conformational hot-spot site, it can substantially alter the protein's biological function.
The approach builds on a geometric scrutiny of the spike protein's Cα backbone. As a piecewise linear polygonal chain the Cα backbone is subject to the universal rules that govern the shape of all space curves. In the case of a regular space curve the shape is determined by an interplay of its curvature and torsion; these are the two geometric quantities that describe how the curve bends and how it twists. An essential change in the local topology of a curve can only take place at a bifurcation point, where either the curvature or the torsion vanish. In the case of a protein, when a mutation occurs at a residue that is proximal to such a bifurcation point, it has an increased potential to affect the protein's conformation and hence also change its biological activity. Thus the identification and analysis of these mutation hot-spot sites in the spike protein can help to predict its future evolution. It can aid the development of structure based therapeutic drugs and vaccines.
The Methods section first summarizes the known results that govern the geometry and the local topology of a smooth space curve. These results are then adapted to the special case of a discrete, piecewise linear chain such as the protein Cα backbone. In particular, the two curve specific bifurcations that can alter the local topology of a curve [22][23][24] [13] and 6XS6 (D614G mutation) [25] are used. The potential mutation hot-spots are first identified by a tabulation of all those sites that are proximal to a flattening point. The details of the methodology are then worked out in the case of the known D614G mutation; the D→G substitution is correctly predicted by the present methodology. Additional examples are then analyzed in detail, including the identification of the likely amino acid substitution that may take place if a mutation occurs at a hot-spot.

Regular space curves and the Frenet Equation
The geometry of a regular, analytic space curve that is not a straight line is governed by the Frenet equation [26]. To describe this equation, consider a parametric representation x(s) ∈ R 3 of a space curve, with s ∈ [0, L] the arc-length parameter and L its fixed length; the curve can be open or closed. The curve is self-avoiding so that x(s 1 ) = x(s 2 ) for all s 1 = s 2 . The unit length tangent vector is The unit length binormal vector is and the unit length normal vector is Together, the three vectors (n, b, t) define the right-handed orthonormal Frenet frame at the regular point x(s) of the curve. The Frenet equation relates the frames along the curve, The curvature κ(s) is a measure how the curve bends on the osculating plane that is spanned by t(s) and n(s), and the torsion τ (s) measures how the curve deviates from this osculating plane.
The Frenet frames can be introduced whenever the curvature κ(s) is non-vanishing. In the limit where the curvature is very small in comparison to the torsion, but does not vanish The limit is a framed, straight line. The framing rotates around the line at a rate and direction that is determined by τ (s).
Consider a curve that is subject to continuous local deformations and can change its shape freely; in the case of an open curve it is assumed that the end points remain fixed.
The shape changes are subject to the following rules [22][23][24]: When the shape of a curve changes freely, an isolated inflection point generically occurs at some instance. When an inflection point appears the curve undergoes a bifurcation that is called an inflection point perestroika [22]. This bifurcation can change the number of The discrete Frenet frame formalism [27] describes the geometry of a piecewise linear chain with vertices r i (i = 1, ..., N ) that in the case of a protein backbone are the space coordinates of the Cα atoms. A line segment defines the unit tangent vector It points from the center of the i th Cα atom towards the center of the (i + 1) st Cα atom, in the case of a protein backbone. The unit binormal vector is and the unit normal vector is and in the case of a protein these two vectors are virtual in that they do not point towards any particular atom. The orthonormal triplet (n i , b i , t i ) defines a discrete version of the and the torsion angles are It is notable that the value of the bond angle κ i is evaluated from three, and the value of the torsion angle τ i is evaluated from four consecutive vertices.
Conversely, when the values of the bond and torsion angles are all known, the discrete Frenet equation computes the Frenet frame at the vertex r i+i from the frame at the preceding vertex r i .
A continuum limit of the discrete Frenet equation coincides with the continuum Frenet equation (4) [27].
It should be noted that besides the Frenet framing, a protein backbone can also be framed in other ways, for example using the peptide planes: A peptide plane frame is built starting with the vector t i and the unit normal vector of the ensuing peptide plane. Unlike the Frenet frames that can not be introduced if a bond angle vanishes, the peptide plane frames can always be introduced along a protein Cα backbone. When both frames exist they are closely related [27].
The fundamental range of the bond angles is κ i ∈ [0, π] and in the case of the torsion angles τ i ∈ [−π, π). For visualization purposes, the bond angles κ i can be identified with the latitude angle of a two-sphere that is centered at the i th Cα atom; the north pole coincides with the inflection point κ i = 0. The torsion angles τ i ∈ [−π, π) correspond to the longitudinal angle on the sphere, it increases in the counterclockwise direction around the tangent vector and the value τ i = 0 of flattening points coincides with the great semi circle that starts from north pole and passes through the tip of the normal vector n to the south pole. The sphere can be stereographically projected onto the complex (x, y) plane. A projection from the south pole is as shown in figure 2: The north pole i.e. the point of inflection with κ = 0 becomes mapped to the origin (x, y)=(0, 0) and the south pole κ = π is sent to infinity. The Cα backbone can be visualized on the stereographically projected two sphere as follows [28]: At each Cα atom, introduce the corresponding discrete Frenet frames (6)- (8).
The base of the i th tangent vector t i that is located at the position r i of the i th Cα atom coincides with the centre of a two-sphere with the vector t i pointing towards its north pole.
describes a right-handed α-helix, and values for which describes a β-strand.
In  (5). Notably, κ i = 1 and τ i → ±π corresponds to an ideal, straight β-strand. Therefore a point on the line τ i → ±π (or in its immediate vicinity) will be called a β-point in the sequel, even when the bond angle has a value that is different from its β-strand value κ ≈ 1.
The multi-valuedness of the torsion angle τ i affects the local topological invariants, in the case of a discrete chain. This is exemplified in the  .

RESULTS
The Figure      In the case of a bond angle, the value is considered small when κ i < 0.5. General arguments state that inflection points are not generic, and there are indeed very few sites that are close to an inflection point. The smallest value is κ i = 0.39 and it is located at the site 103 that also appears in Table I. Thus sites that are proximal to an inflection point are not considered separately.
The local geometry of all the residues that are proximal to a flattening point has been studied and the potential for a mutation hot-spot at a residue that is proximal to a flattening point has been estimated. For this, a combination of statistical analysis and stereochemical constraints has been utilized. The Table I also summarizes these findings. The statistical analysis employs the classification scheme of Protein Data Bank structures described in [31]. This scheme decomposes a Cα backbone into fragments that consist of backbone segments with six successive sites; in accordance with Eq. (9), (10) a fragment with six sites determines three pairs of bond and torsion angles. In [31] the fragments that appear in high resolution PDB structures have been organized into disjoint clusters. To assign a cluster to a fragment, there must be at least one other fragment in the same cluster within a prescribed RMS cut-off distance; in [31] the cut-off is 0.2Å. Two clusters are then disjoint, when the RMSD between any fragment in the first cluster and any fragment in the second cluster exceeds this RMS cut-off distance. It was found that around 38% of protein loops in the high resolution PDB structures can be decomposed into fragments that belong to twelve disjoint clusters, labeled I-XII in [31]. When fragments from an additional set of 30 disjoint clusters are included, the coverage increases to ∼ 52% [31].
In the Table I both cluster sets I-XII and 1-30 appear; the notation of [31] is used throughout in the following. Beyond these two sets, the clusters become increasingly smaller, and in the present study those smaller clusters have not been considered. The somewhat low resolution of the available spike protein structures in comparison to the very high resolution structures used in [31] does not justify a more detailed scrutiny.
The clusters in the Table I have been identified as follows: A pair of bond and torsion angles (κ i , τ i ) at the i th site of the spike protein that is a proximal site to a flattening point can be assigned to three different clusters. The first cluster describes the bond and torsion angles for the sites (i − 2, i − 1, i); this cluster is labelled P (for Preceding) in the Table   I. The second cluster that is labelled A (for Adjacent) in the Table I describes the angles for sites (i − 1, i, i + 1). The third cluster is labelled F (for Following) and it describes the angles at spike protein sites (i, i + 1, i + 2). The cluster that provides the best match to the spike protein is listed in the Table I and it is determined as follows: Let (x a , y a , z a ) denote the six space coordinates of a segment that corresponds to three consecutive pairs of (κ, τ ) values in the spike protein. Let (x k,a , y k,a , z k,a ) be the corresponding six space coordinates of a k th fragment in each of the clusters of [31]. The best matching cluster in Table I is the one that contains a fragment with the minimal root-mean-square distance (RMSD) to the given spike protein segment (in units ofÅngström): Once the best matching cluster is determined, the corresponding statistical distribution of amino acids found in [31] is used to identify those amino acids that are most probable to appear at the i th site of the spike protein, in case a mutation occurs. If the size of this statistically most probable amino acid is smaller than the size of the present amino acid at the i th site, a substitution by mutation is considered to be sterically possible. But if the size is larger, a substitution likely requires an extended rearrangement of the spike protein conformation and this can be energetically costly. The Table I lists the most probable substitutions that are predicted in this manner. The adjacent histogram displays the amino acid distribution in the best matching cluster, using the data that is obtained from [31].  :(613, 614, 615), in the background of the best matching cluster #26 in [31]. The most common amino acid in the cluster is G. c) The spectrum for following sites F: (614, 615, 616), in the background of the best matching cluster #10 in [31]. The most common amino acid in the cluster is G.
The Figure 5 a) shows the bond and torsion angles for the sites P: (612,613,614) of the spike protein. The best matching preceding (P) cluster is also shown, it is the cluster #XII of [31]. From Table I the RMSD (16) between the spike protein segment and the cluster is ∆ min ≈ 0.76Å which is clearly larger than the 0.2Å cut-off used in [31]. But the resolution at which the spike protein PDB structures (6VXX, 6VYB and 6XS6) have been measured is also clearly larger than the resolution of the structures used in [31]. The statistical analysis [31] shown in Figure 5 a) proposes that the most probable mutations at site 614D are to S, A, N and G; the amino acid E can be excluded on sterical grounds since it has a larger size than D.
The best matching cluster for the adjacent sites A:(613,614,615) shown in Figure 5 b) is the cluster #26 of [31], with ∆ min ≈ 0.75Å. The statistical analysis [31] now proposes that the most probable mutation at site 614 is a D→G substitution; the probability for any other amino acid substitution is very low. In particular the probability for D itself is low, suggesting instability.
Finally, the best matching cluster for the following sites F:(614,615,616) shown in Figure   5 c) is #10. The RMSD has now a somewhat lower value ∆ min ≈ 0.41Å. The statistical analysis [31] proposes that the most probable substitution at site 614 is again D→G. The probability for any other substitution is very low. In particular the probability for D itself is low, again suggesting instability of the residue. Since G is the only amino acid that consistently appears in all three clusters and since there is no obvious steric hindrance for a D→G mutation, the prediction of the present analysis is that a D→G mutation is probable at the site 614 of spike protein. This is the notorious D614G mutation that has already been observed.
A comparison of the three PDB structures 6VXX, 6VYB with 6XS6 shows that apparently the mutation has not caused any change in local topology, at least according to available structures. But since the site 614 is located near the junction between S1 and S2 subunits, the D→G mutation in combination with the proximity of a flattening point has probably increased the chain flexibility in the junction segment so that it is now more prone to a cleavage with enhanced infectiousness as a consequence.
A survey of potential mutation hot-spots in spike protein The Table I  , in the background of the best matching cluster #7 in [31]. b) The spectrum for A:(58, 59, 60), in the background of the best matching cluster #26 in [31]. c) The spectrum for F:(59, 60, 61), in the background of the best matching cluster #XI in [31]. The most common amino acid in all three clusters is G.  (101, 102, 103), in the background of the best matching cluster #7 in [31]. The most common amino acids in the cluster is T, followed by I. Panel b) The spectrum for A: (102, 103, 104), in the background of the best matching cluster #26 in [31]. The most common amino acid in the cluster is A, followed by G. Panel c) The spectrum for F: (103, 104, 105), in the background of the best matching cluster #XI in [31]. The most common amino acid in the cluster are T and L, followed by A and V. The conclusion os the present analysis is that the site 103G is a potential G→A mutation hot-spot.  [31]. The most common amino acids in the cluster is A, followed by D. Panel b) The spectrum for A:(286, 287, 288), in the background of the best matching cluster #VI in [31]. The most common amino acid in the cluster is A, followed by G. Panel c) The spectrum for F:(287, 288, 289), in the background of the best matching cluster #25 in [31]. The most common amino acid in the cluster are T and V, followed by A and L.
The RMSD values (16) are all very small so that the three clusters identified in the Figures 9 are an excellent match. The statistical analysis in all three clusters show that A has a very high probability at the site 287; both T and V are also likely substitutions in the following cluster, and G has also propensity in the adjacent cluster. But D that is now located at the site, is not very prominent in any of the clusters. Thus the site 287 is a very good candidate for a D→A mutation hot-spot.

The case of 316S
The site 316S is located in the junction between the NTD and the RBD domains. A mutation hot-spot in this junction is of interest as it can have a large effect to the transition between the closed and open states, the way how the RBD becomes exposed to ACE2. The site 316S is the only potential hot-spot mutation site in this junction that has been identified in the present study.  [31]. The most common amino acids in the cluster is A, followed by D. Panel b) The spectrum for A:(315, 316, 317), in the background of the best matching cluster #VI in [31]. The most common amino acid in the cluster is A, followed by G. Panel c) The spectrum for F:(316, 317, 318), in the background of the best matching cluster #25 in [31]. The most common amino acid in the cluster are T and V, followed by A and L.
The RMSD values (16) [31]. Panel b) The spectrum for A:(463, 464, 465), in the background of the best matching cluster #26 in [31]. Panel c) The spectrum for F:(464, 465, 466), in the background of the best matching cluster #XI in [31]. The most common amino acid in all three clusters is G.

The case of 1080A
According to the histogram in Figure 4 [31]. The most common amino acids in the cluster is A. Panel b) The spectrum for A:(1079, 1080, 1081), in the background of the best matching cluster #I in [31]. The most common amino acid in the cluster is A, followed by G. Panel c) The spectrum for F: (1080, 1081, 1082), in the background of the best matching cluster #25 in [31]. The most common amino acid in the cluster are T and V, followed by A and L.
at this site, is also the most probable one and with very high probability in the case of the adjacent cluster. Thus, it appears that a recent mutation may have taken place at this site with the amino acid A as the substitution. The statistical analysis of the third cluster shows a relatively high propensity for T, V and L but all three are subject to steric hindrances and they also have a very low propensity in the adjacent cluster. The conclusion is that even though the vertex is proximal to a bi-flattening perestroika, it does not qualify as a mutation hot-spot. It is more likely that due to proximity of the bi-flattening perestroika, the site has exceptional flexibility with an important role in the fusion process. The residue 1080A can be a good target for the development of fusion inhibitors.

DISCUSSION
The spike protein is a prominent target for the development of vaccines and therapeutic drugs against the SARS-CoV-2 virus. But for a durable antiviral one needs to know, how to identify those amino acids that are prone for a mutation and how to predict the biological consequences of any potential mutation. The methodology that is presented here aims to pinpoint those hot-spot residues where a mutation can be expected to have sub-stantial conformational consequences. Since conformation is pivotal for a protein's function, the knowledge of hot-spots is important for better understanding of how the spike protein operates and how it can be incapacitated. For this the local topology of spike protein's Cα backbone is scrutinized; the local backbone topology can only change when a structural bifurcation occurs at a critical site. In the case of a protein backbone a pre-requisite for such a bifurcation is the presence of a flattening point. Therefore, the flattening points along the spike protein backbone are first localized and classified. The geometry in the neighborhood of a flattening point is then investigated to deduce its potential for a mutation hot-spot.
The detailed methodology is developed using the notorious D614G mutation site of the spike protein as an example; the site is proximal to a flattening point and the present approach correctly predicts its D→G mutation. It is found that several topologically very similar potential hot-spot sites are located in particular in the N-terminal domain and in the fusion core between the two heptapeptide repeat sequences. A number of these are analyzed in detail, as examples. Those in the N-terminal domain are found to be good candidates for mutation hot-spots, while the potential hot-spot sites in the junction between the heptapeptide repeat sequences appear to be more stable. In particular, the residue 1080A that has been analyzed as an example seems to be stable against mutation. But it seems to be prone to a change in local topology, due to its proximity to a bi-flattening point. This can make the residue 1080A into a good target for the development of structure based fusion inhibitors.

CONCLUSIONS
The topological techniques that are employed here have already proven themselves to be most powerful in several Physics problems. Since conformation is pivotal for a protein's function, topology should be a most effective tool also in protein research. The present methodology combines the investigation of local Cα backbone topology with a statistical analysis of PDB structures. This brings a downside, as the present structural data on the spike protein has been measured with quite a low resolution; the PDB structures that are used in the present study have been determined using electron microscopy with a resolution no better than around 2.8 -3.5Å and there are even some missing residues in available structures. The low resolution causes also some uncertainty in the correct identification of sites that are proximal to a flattening point, for a conclusive investigation better resolution spike protein structures are needed. At the same time the statistical analysis that is used here for the identification of the pertinent PDB cluster is also preliminary, as it utilizes a somewhat limited set of structural clusters that are measured with a very high resolution.
Therefore, besides a need for improvement in the structural data on spike protein, a further development and refinement of the statistical methodology is also needed to ensure a higher reliability of the present methodology.

AUTHOR CONTRIBUTIONS
AJN proposed the research; XP performed the calculations; AJN and XP analyzed the data and wrote the manuscript.