Local topology, bifurcations and mutation hot-spots in proteins with SARS-CoV-2 spike protein as an example

Novel topological methods are introduced to protein research. The aim is to identify hot-spot sites where a bifurcation can change the local topology of the protein backbone. Since the shape of a protein is intimately related to its biological function, a mutation that takes place at such a bifurcation hot-spot has an enhanced capacity to change the protein’s biological function. The methodology applies to any protein but it is developed with the SARS-CoV-2 spike protein as a timely example. First, topological criteria are introduced to identify and classify potential mutation hot-spot sites along the protein backbone. Then, the expected outcome of a substitution mutation is estimated for a general class of hot-spots, by a comparative analysis of the backbone segments that surround the hot-spot sites. This analysis employs the statistics of commensurable amino acid fragments in the Protein Data Bank, in combination with general stereochemical considerations. It is observed that the notorious D614G substitution of the spike protein is a good example of such a mutation hot-spot. Several topologically similar examples are then analyzed in detail, some of them are even better candidates for a mutation hot-spot than D614G. The local topology of the recently observed N501Y mutation is also inspected, and it is found that this site is prone to a different kind of local topology changing bifurcation.


Introduction
Topological methods are among the most versatile tools available to predict, model and analyze a wide range of Physics related phenomena, from theories of fundamental interactions to models of condensed matter and dynamical systems [1][2][3][4]. However, despite the apparently rich and intriguing topology of many biomolecules, thus far topological methods have been sparsely applied to biophysical problems. Among the notable exemptions are the analysis of enzyme action on DNA and the challenge to understand knots and their biological role in folded proteins [5].
Here a novel topological methodology, apparently with no previous physical applications, is introduced and developed for biophysical purposes. The methodology builds on inspection of bifurcations that can change the local topology of a protein backbone. Bifurcations are the common cause for qualitative changes in a physical system, bifurcation theory describes how a small change in an input parameter can cause a large scale change in the system [4,6]. Accordingly it is proposed that in the case of a protein backbone, an amino acid site that is proximal to a bifurcation where the local topology can change, can be a mutation hot-spot where a single substitution can cause a large scale conformational change with potentially substantial biological effects: Even though such a mutation hot-spot is not necessarily more prone for a mutation than any other site along the backbone, when a mutation causes a bifurcation that changes the local topology it can also have a more profound biological impact. Thus the present methodology, that employs bifurcations to identify and classify potential mutation hot-spot sites, can have a value to a wide range of future investigations.
Indeed, the methodology that is developed here is very general. It is applicable to any protein structure, even though it is presented with the spike protein of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) as an example. The choice reflects the current urgency to understand the function of the virus that causes COVID-19, a global public health emergency that continues to spread across the world. Several studies of the SARS-CoV-2 virus have been published including investigations on the source of infection [7][8][9][10][11], the mechanism of transmission [12][13][14][15][16], and the structure and function of its various proteins [17][18][19][20][21][22][23]. These studies also detail the biophysical and biochemical properties of the spike protein, a transmembrane glycoprotein that assembles into a homo-trimer to cover the virion surface and gives the virus its distinctive crown-like look. For the present purposes the ensuing short introduction to the spike protein structure and function is sufficient: The spike protein has three different stages. The prefusion stage, the intermediate stage, and the post-fusion stage. Before the fusion with a host cell takes place the spike protein is in the prefusion stage, and the present study focuses solely on the protein in this stage. The monomer structure of the prefusion stage is as follows: Starting from the N-terminal, first there is a short signal peptide. This is followed by two larger subunit, S1 and S2. The subunit S1 can recognize the host cell and bind to the receptor angiotensin-converting enzyme 2 (ACE2). The subunit S2 can bind to the membrane of the host cell, to mediate the fusion between the virus and the host cell. The Figure 1A identifies the major functional domains in these subunits in a monomeric spike protein [17,18]: The S1 subunit consists of residues between the sites 14-685. It starts with a N-terminal domain (NTD) with residues 14-305 [25]. The NTD is followed by the receptor binding domain (RBD, residues 319-541) [26]. The junction segment between the S1 and S2 subunits includes several cleavage sites [27]. The S2 subunit that comprises the rest of the protein, contains the fusion peptide (FP, residues 788-806) followed by two heptapeptide repeat sequences (HR1, residues 912-984) and (HR2, residues 1163-1213), the transmembrane domain (TM, residues 1213-1237) and the cytoplasm domain tail (CT, residues 1237-1273) [24].
In the prefusion stage the protein has two principal conformational states, called the closed state and the open state. The closed state is the native prefusion state of the spike protein, and a model of this closed stage is shown in Figure 1B. When the virus starts interacting with the host cell the spike protein transits from the closed state to the open state, with a major conformational change that takes place in the S1 subunit [19].
The spike protein has a pivotal role in processes that range from receptor recognition to viral attachment and entry into host cell. It is a major target in both vaccine research and therapeutic research that combat the SARS-CoV-2 virus. But the spike protein evolves and mutates continuously which makes it a demanding target for the development of antiviral inhibitors. Among the prominent examples of a mutation A three-dimensional model of the full length closed stage spike protein, based on the PDB structure 6VXX (closed state) and adapted from [24].
in the spike protein analyzed here, with documented epidemiological consequences, is the D→G substitution that occurred at its site 614, near the junction between the subunits S1 and S2. Apparently this mutation converted SARS-CoV-2 into a more transmissible form, and the D614G mutated virus now dominates in the global COVID-19 pandemic [30]. Another, more recent mutation that is also analyzed here, with similarly substantial epidemiological consequences that still remain to be fully understood, is the N→Y substitution that occurred at site 501 in the RDB domain [31]. Unfortunately, due to missing residues in the available data the even more recent substitutions that affect the sites 484 and 677 [32] can not yet be analyzed.
The goal of the present article is to develop general methodology that can help to understand the structural effects of these and other mutations, and to identify additional mutation hot-spot sites along the spike protein backbone, those where a single substitution mutation can have a large scale conformational effect. The methodology starts with a geometric scrutiny of a protein's Cα backbone. For this, the Methods section first summarizes the known mathematical results on geometry and local topology of a smooth space curve. The focus is on the two curve specific bifurcations that can alter the local topology of a curve, these were introduced by Arnol'd [33][34][35] who called them the inflection point perestroika and the bi-flattening perestroika; see also [36,37]. The Methods section then adapts these bifurcations to the case of a generic protein Cα backbone.
The Results section applies the general methodology to identify and classify mutation hot-spots. The SARS-CoV-2 spike protein is used as a timely example but any protein backbone can be analyzed in the same manner. For the spike protein, the Protein Data Bank structures 6VXX (closed state) 6VYB (open state) [19] and 6XS6 (D614G mutation) [38] are used. First, a general class of local topologies with a mutation hot-spot is identified. It consists of backbone segments that surround a site that is proximal to a flattening point: The D614G mutation occurred at such a hot-spot, thus the details of the methodology are worked out with the site 614 of the spike protein as an example. All similar hot-spot sites of the spike protein, with a site proximal to a flattening point, are tabulated in the currently available PDB structures. Additional examples are analyzed, including the identification of the likely amino acid substitution that may take place if a mutation occurs at such a hot-spot. Finally, the site of the N501Y substitution is identified as a different kind of mutation hot-spot, and analyzed as an additional example of the present approach.

Local topology of regular space curves
The bifurcations that can change the local topology of a space curve were introduced and analyzed by Arnol'd in a series of articles [33][34][35]; see also [36,37]. This subsection summarizes the main results. The starting point is in the Frenet framing that governs the geometry of a regular, analytic space curve that is not a straight line [39]. To describe this framing, consider a parametric representation x(s) ∈ R 3 of a space curve, with s ∈ [0, L] the arc-length parameter and L its fixed length. The unit length tangent vector is The unit length binormal vector is and the unit length normal vector is Together, the three vectors (n(s), b(s), t(s)) define the right-handed orthonormal Frenet frame at the regular point x(s) of the curve. They govern the geometry of the curve in terms of curvature κ(s) that describes how the curve bends on the osculating plane spanned by t(s) and n(s), and torsion τ (s) that measures how the curve deviates from this osculating plane.
The Frenet frames can be introduced whenever the curvature κ(s) is non-vanishing. In the specific limit where the curvature is very small in comparison to the torsion, but does not vanish κ(s) τ (s) → 0 the frames obey d ds This limit, that not discussed in [33][34][35][36][37] but turns out to be relevant to protein backbones, describes a framed, straight line with framing that rotates around the line at a rate and direction that is determined by τ (s). In the following it is assumed that the curve is open and its shape can change freely by local deformations, but the end points are always fixed. The shape changes are governed by the following rules [33][34][35][36][37]: At a point where the curvature κ(s) vanishes, the Frenet frame can not be defined; here it is assumed that κ(s) has only isolated simple zeroes. The points where κ(s) = 0 are called the inflection points of the curve. A local deformation that retains the osculating plane can move an inflection point along the curve. But if a deformation lifts a point with κ(s) = 0 off the osculating plane the inflection point becomes removed. This implies that the co-dimension of an inflection point is two: An inflection point is not a local topological invariant of the curve, and a generic curve does not have any inflection points.
A point where the torsion vanishes τ (s) = 0 is a flattening point. Unlike an inflection point, at least one single simple flattening point is generically present in a curve. A single simple flattening point is a local topological invariant that can not be removed by any continuous local deformation of the curve, it can only be moved along the curve.
The three vectors (n(s), b(s), t(s)) determine a framing of the curve, and either b(s) or n(s) or their linear combination can be chosen as the framing vector. The self-linking of the curve describes how it links with a nearby curve that is obtained by pushing points of the original curve along the framing vector. The self-linking number is a local topological invariant of the curve, in the absence of an inflection point the self-linking number can not change.
When the shape of a curve changes freely, an isolated inflection point generically occurs at some instance. When an inflection point appears the curve undergoes a bifurcation that is called an inflection point perestroika [33][34][35]. This bifurcation can change he number of flattening points: Since the torsion τ (s) changes its sign at a simple flattening point, and since the curvature κ(s) is generically not zero, an inflection point perestroika commonly changes the number of simple flattening points by two.
When the shape of a curve changes so that a pair of flattening points comes together they combine into a single bi-flattening point. A bi-flattening point can then be removed by a further, generic local deformation of the curve. Similarly, a bi-flattening point can first be created by a proper local deformation of a curve, and when the curve is further deformed the bi-flattening point can resolve into two separate simple flattening points. When either of these occur the curve undergoes a bifurcation that is called a bi-flattening perestroika [33][34][35].
Inflection point perestroika and bi-flattening perestroika are the only two bifurcations where the number of flattening points can change. Furthermore, according to [33][34][35][36][37]the number of flattening points and the self-linking number that is determined by the Frenet framing are the only two curve specific local topological invariants that can be assigned to a curve.
The number of flattening points and the self-linking number are independent topological invariants, but in the presence of an inflection point they can interfere with each other. For example, when a curve is deformed so that two simple flattening points become combined and disappear in a bi-flattening perestroika, the self-linking number in general does not change. However, if the bi-flattening perestroika occurs in conjunction of an inflection point perestroika the self-linking number can change: If the torsion is initially positive and two flattening points combine and disappear with the passage of an inflection point, the self-linking number increases by one. But if the torsion is initially negative the self-linking number decreases by one.
Note that the limiting case (6) is excluded in these analyses but it will become important in the sequel, in applications to protein backbones.

The geometry and local topology of a protein Cα backbone
The relations that are described in the previous subsection are valid for (thrice) continuously differentiable curves. In this subsection they are adapted to the case of a Cα backbone that determines a piecewise linear polygonal chain, with Cα atoms at the vertices.
Various shape changes are common in a biologically active protein.
A polygonal chain such as the Cα backbone can always be thought as a limiting case of a regular, analytic space curve. Thus the changes in the shape of the Cα backbone are subject to same rules that govern the local topology of any regular space curve. In particular, the three essential shape deformations that can change the local topology i.e. discrete variants of inflection point perestroikas, bi-flattening perestroikas, and changes in a self-linking number should all have a profound role in physiological processes.
The discrete Frenet frame formalism is developed in [40]. The formalism describes the geometry of a piecewise linear chain with vertices r i (i = 1, ..., N ) that in the case of a protein backbone are the space coordinates of the Cα atoms. A line segment defines the unit tangent vector It points from the center of the i th Cα atom towards the center of the (i + 1) st Cα atom. The unit binormal vector is and the unit normal vector is In the case of a protein these two vectors are virtual in that they do not point towards any particular atom. The orthonormal triplet (n i , b i , t i ) defines a discrete version of the Frenet frames (1)-(3) at each position r i along the chain. In lieu of the curvature κ(s) and the torsion τ (s) there are now their discrete versions, the bond angles κ i and the torsion angles τ i . The values of these angles can be computed from the discrete Frenet frames. The bond angles are and the torsion angles are It is notable that the value of the bond angle κ i is evaluated from three, and the value of the torsion angle τ i is evaluated from four consecutive vertices. Conversely, when the values of the bond and torsion angles are all known, the discrete Frenet equation [40] computes the Frenet frame at the vertex r i+i from the frame at the preceding vertex r i . In a continuum limit [40] the discrete Frenet equation becomes the continuum Frenet equation [39]. The fundamental range of the bond angles is κ i ∈ [0, π] and in the case of the torsion angles τ i ∈ [−π, π). For visualization purposes, the bond angles κ i can be identified with the latitude angle of a two-sphere that is centered at the i th Cα atom; the north pole coincides with the inflection point κ i = 0. The torsion angles τ i ∈ [−π, π) correspond to the longitudinal angle on the sphere, the value increases in the counterclockwise direction around the tangent vector and the value τ i = 0 of flattening points coincides with the great semi circle that starts from north pole and passes through the tip of the normal vector n to the south pole. The sphere can be stereographically projected onto the complex (x, y) plane. A projection from the south pole is as shown in figure 2: The north pole i.e. the point of inflection with κ = 0 becomes mapped to the origin (x, y)=(0, 0) and the south pole κ = π is sent to infinity. The Cα backbone can be visualized on the stereographically projected two sphere as follows [41]: At each Cα atom, introduce the corresponding discrete Frenet frames (7)- (9). The base of the i th tangent vector t i that is located at the position r i of the i th Cα atom coincides with the centre of a two-sphere with the vector t i pointing towards its north pole. Now, translate the sphere from the location of the i th Cα atom to the location of the (i + 1) th Cα atom, without any rotation of the sphere with respect to the i th Frenet frames. Identify the direction of t i+1 , i.e. the direction towards the Cα atom at site r i+2 from the site r i+1 , on the surface of the sphere in terms of the ensuing spherical coordinates (κ i , τ i ). When this construction is repeated for all the protein structures in Protein Data Bank that have been measured with better than 2.0 A resolution, the result can be summarized by the map that is shown in figure 2 b). The color intensity correlates directly with the statistical distribution of the (κ i , τ i ); red is large, blue is small and white is none. The map describes the direction of the Cα carbon at r i+2 as it is seen at the vertex r i+1 , in terms of the Frenet frames at r i .
Approximatively, the statistical distribution in figure 2 b) is concentrated within an annulus that corresponds to the latitude angle values (in radians) κ = 0.57 and κ = 1.82 shown in the Figure. The exterior of the annulus is a sterically excluded region. The entire interior is sterically allowed, but there are very few entries in this region. The four major secondary structure regions, α-helices, β-strands, left-handed α-helices and loops, are identified according to their PDB classification. For example, (κ, τ ) values (in radians) for which describes a right-handed α-helix, and values for which describes a β-strand.
In the case of a regular space curve both κ(s) and τ (s) are smooth functions and the inflection points and the flattening points are easily identified as the points where κ(s) = 0 and τ (s) = 0. In a crystallographic protein structure where the Cα positions are experimentally determined, an inflection point is detectable as a very small value of the bond angle κ i at the proximal Cα vertex. Similarly, the presence of a simple flattening point can be deduced from a very small torsion angle value at the proximal vertex, accompanied by a sign change in τ between two neighboring vertices; if the sign of τ does not change the proximal vertex has the character of a bi-flattening point. Accordingly, when searching for Cα atoms where essential shape changes such as inflection point or bi-flattening perestroikas can take place, the natural points to start are the neighborhoods of vertices where either κ i ≈ 0 or τ i ≈ 0. These are the likely locations where a small change in the shape of the backbone can change its local topology, with a potentially substantial change in the protein's biological function.
From Figure 2 b) one observes that inflection points i.e. small κ i values are extremely rare in crystallographic protein structures. Indeed, a generic space curve does not have any inflection points. At the same time general arguments state that generically at least one flattening point can be expected to be present. As shown in Figure 2 b) flattening points where τ i ≈ 0 do appear even though they are relatively rare in protein structures. Moreover, it is observed from the Figure that at a flattening point the bond angle values are mostly either κ i ≈ 1 or κ i ≈ π/2.
Since the torsion angles are defined mod(2π) as implied in (15), in the case of discrete Frenet frames there is an additional structure: There is the line τ i = ±π in Figure 2 b) where the torsion angle has a 2π discontinuity, hence it can change sign by crossing the line. This multivaluedness is absent in regular space curves, with τ (s) a single-valued continuous function. But the limits τ i → ±π can be thought of as the small curvature and large torsion limits of the equations (6). Notably, κ i = 1 and τ i → ±π corresponds to an ideal, straight β-strand. Therefore a point on the line τ i → ±π (or in its immediate vicinity) will be called a β-flattening point in the sequel.
The multi-valuedness of the torsion angle τ i affects the local topological invariants, in the case of a discrete chain. This is exemplified in the   Finally, it is noted that the Figure 2 b) is akin the Newman projection of stereochemistry. The vector t i which is denoted by the red dot at the center of the figure, points along the backbone from the proximal Cα at r i towards the distal Cα at r i+1 , and the colour intensity displays the statistical distribution of the r i+2 direction. Moreover, unlike the Ramachandran map the figure 2 b) provides non-local information on the backbone geometry. The Ramachandran map can only provide localized information in the immediate vicinity of a single Cα carbon, but the information content in the figure 2 b) map extends over several peptide units. As shown in [43], the Cα backbone bond and torsion angles (κ i , τ i ) are sufficient to reconstruct the entire backbone, but the Ramachandran angles are not. .

Results
It turns out that the spike protein site 614 of the notorious D→G substitution is a good example of a site that is proximal to a flattening point. This motivates to select the flattening points of the spike protein as a starting point for presenting the general methodology; the methodology is quite independent of this choice. A detailed analysis of other potential bifurcation structures, those that engage a site that is proximal to an inflection point or a β-flattening point, is presented elsewhere. However, in the sequel it is also proposed that the recently observed N501Y mutation can provide an example that engages the inflection point. First, the sites along the SARS-CoV-2 spike protein Cα backbone that are proximal to a flattening point are classified. Their local neighborhoods are then inspected by comparisons with Figures 3, to investigate the potential bifurcations.
The Table 1 lists all those SARS-CoV-2 spike protein Cα-sites that are proximal to a flattening point in the PDB structures 6VXX (closed state) and 6VYB (open state). Here a torsion angle value is determined to be proximal to a flattening point when |τ i | < 0.2. In terms of distance, this is less than the radius of a carbon atom. The histogram in Figure 4 shows the distribution of the flattening point sites in relation to spike protein backbone. There are relatively many entries in the NTD and RBD domains, and in the fusion core between the HR1 and HR2 domains. Notably the number of proximal sites is also different in the closed and open states of the spike protein, there are more proximal sites in the open state. Thus a transition between the two states involves changes in local topology that affect in particular the NTD and RBD domains, and the HR1-HR2 junction.  In the case of a bond angle, here the value is considered small when κ i < 0.5. General arguments state that inflection points are not generic, and Figure 2 b) shows that there are indeed very few sites that are close to an inflection point. For the spike protein the smallest value is κ i = 0.39 and it is located at the site 103 that also appears in Table 1.
The local geometry of all the individual residues that are proximal to a flattening point has been investigated and the potential for a mutation hot-spot at a residue that is proximal to a flattening point has been estimated, by comparison to Figures 3. For this, a combination of statistical analysis and stereochemical constraints has been utilized. The Table 1 also summarizes these findings.
The statistical analysis employs the classification scheme of Protein Data Bank structures described in [44]. This scheme decomposes a Cα backbone into fragments that consist of backbone segments with six successive sites; in accordance with Eq. (10), (11) a fragment with six sites determines three pairs of bond and torsion angles. In [44] the fragments that appear in high resolution PDB structures have been organized into disjoint clusters. To assign a cluster to a fragment, there must be at least one other fragment in the same cluster within a prescribed RMS cut-off distance; in [44] the cut-off is 0.2Å. Two clusters are then disjoint, when the RMSD between any fragment in the first cluster and any fragment in the second cluster exceeds this RMS cut-off distance. It was found that around 38% of protein loops in the high resolution PDB structures can be decomposed into fragments that belong to twelve disjoint clusters, labeled I-XII in [44]. When fragments from an additional set of 30 disjoint clusters are included, the coverage increases to ∼ 52% [44].
In the Table 1 both cluster sets I-XII and 1-30 appear; the notation of [44] is used throughout in the following. Beyond these two sets, the clusters become increasingly smaller, and in the present study those smaller clusters have not been considered. The somewhat low resolution of the available spike protein structures in comparison to the very high resolution structures used in [44] does not justify a more detailed scrutiny.
The clusters in the Table 1 have been identified as follows: A pair of bond and torsion angles (κ i , τ i ) at the i th site of the spike protein that is a proximal site to a flattening point can be assigned to three different clusters. The first cluster describes the bond and torsion angles for the sites (i − 2, i − 1, i); this cluster is labelled P (for Preceding) in the Table 1. The second cluster that is labelled A (for Adjacent) in the Table 1 describes the angles for sites (i − 1, i, i + 1). The third cluster is labelled F (for Following) and it describes the angles at spike protein sites (i, i + 1, i + 2). The cluster that provides the best match to the spike protein is listed in the Table 1 and it is determined as follows: Let (x a , y a , z a ) denote the six space coordinates of a segment that corresponds to three consecutive pairs of (κ, τ ) values in the spike protein. Let (x k,a , y k,a , z k,a ) be the corresponding six space coordinates of a k th fragment in each of the clusters of [44]. The best matching cluster in Table 1 is the one that contains a fragment with the minimal root-mean-square distance (RMSD) to the given spike protein segment (in units ofÅngström): Once the best matching cluster is determined, the corresponding statistical distribution of amino acids found in [44] is used to identify those amino acids that are most probable to appear at the i th site of the spike protein, in case a mutation occurs. If the size of this statistically most probable amino acid is smaller than the size of the present amino acid at the i th site, a substitution by mutation is considered to be sterically possible. But if the size is larger, a substitution likely requires an extended rearrangement of the spike protein conformation and this can be energetically costly. The Table 1 lists the most probable substitutions that are predicted in this manner.

Example: The D614G mutation of spike protein
The Table 1 identifies the site 614 of the spike protein, where the D→G substitution has occurred, as a site that is proximal to a flattening point. Thus, this site serves as a good example to describe the present methodology. The local topology and its potential bifurcations can be analyzed using figures such as Figure 5. Each of the three Figures depicts the three bond and torsion angle pairs for the three backbone segments (P, A and F respectively) that include the angles of the site 614. The legend is as follows: In figures such as Figure 5, the vertices that mark the Cα sites of spike protein are always identified by stars. These stars are color-coded and organized according to increasing site number following the Cyan-Magenta-Yellow (CMY) color table. The dotted background comprises the Cα sites of all the fragments in the corresponding best matching cluster. The background sites are always ordered similarly using the Red-Green-Blue (RGB) color table. The adjacent histogram displays the amino acid distribution in the best matching cluster, using the data that is obtained from [44].  (612, 613, 614), in the background of the best matching cluster #XII in [44]. The most common amino acids in the cluster are S, D, A, E, N, G. b) The spectrum for adjacent sites A:(613, 614, 615), in the background of the best matching cluster #26 in [44]. The most common amino acid in the cluster is G. c) The spectrum for following sites F:(614, 615, 616), in the background of the best matching cluster #10 in [44]. The most common amino acid in the cluster is G.
The Figure 5 a) shows the bond and torsion angles for the sites P: (612,613,614) of the spike protein. The best matching preceding (P) cluster is also shown, it is the cluster #XII of [44]. From Table 1 the RMSD (17) between the spike protein segment and the cluster has the value ∆ min ≈ 0.76Å. This is clearly larger than the 0.2 A cut-off used in [44]. Nevertheless, the cluster is included in the analysis since the resolution at which the spike protein PDB structures (6VXX, 6VYB and 6XS6) have been measured is also clearly larger than the resolution of the structures used in [44]. This justifies that despite the relatively large value of ∆ min the present analysis proceeds with cluster #XII, keeping in mind that the ∆ min is not very small.
The statistical analysis [44] shown in Figure 5 a) proposes that the most probable mutations at site 614D are to S, A, N and G; the amino acid E can be excluded on sterical grounds since it has a larger size than D.
The best matching cluster for the adjacent sites A:(613,614,615) shown in Figure 5 b) is the cluster #26 of [44], with ∆ min ≈ 0.75Å which is again relatively large. The statistical analysis [44] now proposes that the most probable mutation at site 614 is a D→G substitution; the probability for any other amino acid substitution is very low. In particular the probability for D itself is low, suggesting instability.
Finally, the best matching cluster for the following sites F:(614,615,616) shown in Figure 5 c) is #10. The RMSD has now a somewhat lower value ∆ min ≈ 0.41Å. The statistical analysis [44] proposes that the most probable mutation at site 614 is again D→G substitution. The probability for any other substitution is very low. In particular the probability for D itself is low, again suggesting instability of the residue.
The combined spike protein chain shown in the three Figures 5, starting from site 612 in 5 a) and ending at 616 in 5 c), encircles the inflection point once in clockwise direction. Thus the folding index has value +1 and, except for the direction and the location of the end points, the topology of the trajectory is similar to that in Figure 3 b). In particular, both a flattening point and a β-flattening point occur at neighboring segments along the chain. This proposes that the site 614 is a potential mutation hot-spot, prone to a change in the local topology by an inflection point bifurcation. The mutation can cause a bifurcation that can change the local topology from that resembling Figure 3 b) to one that resembles either Figure 3 a) with a bi-flattening point or Figure 3 c) with a pair of β-flattening points. Since G is the only amino acid that consistently appears in all three clusters and since there is no obvious steric hindrance for a D→G mutation, the prediction of the present analysis is that a D→G mutation is probable at the site 614 of spike protein, if a mutation indeed occurs; this is the notorious D614G mutation that has already been observed.
A comparison of the three PDB structures 6VXX, 6VYB with 6XS6 shows that apparently the mutation has not caused any change in local topology, at least according to available structures using the available resolution. But since the site 614 is located near the junction between S1 and S2 subunits, the D→G mutation in combination with the proximity of a flattening point has probably increased the chain flexibility in the junction segment so that it is now more prone to a cleavage with enhanced infectiousness as a consequence.

A survey of flattening point hot-spots in spike protein
The Table 1 proposes that there is quite a large number of potential mutation hot-spots in the spike protein that can be similar to 614D, with a proximal flattening point. In the sequel a selection of these sites is analyzed. The examples are representative, but not necessarily the most probable hot-spot sites. There are three examples from the NTD domain, with site numbers 59, 103 and 287. The example at the site 103 is added since this is the site with the lowest bond angle value along the entire spike protein backbone. There is one example that is located in the junction between NTD and RBD domains, with site number 316. There is one example in the RBD domain, with site number 464. An example from the fusion core between HR1 and HR2 domains with site number 1080 is also presented. Finally, the local topology around the site 501 where the N→Y substitution has recently been observed [31] is also analyzed and its bifurcation potential is investigated.

The residue 59F
The Figures 6 show the neighborhood of the site 59F, located in the NTD domain of subunit S1. The Figures reveal that the topology of the trajectory from site 57 to 61 is very similar to that in the case of site 614, shown in Figures 5: There is a residue that is proximal to a flattening point and right after it there is a residue that is proximal to a β-flattening point. The folding index has value +1 since the trajectory encircles the inflection point in clockwise direction. Thus, as in the case of 614, the site 59 is prone to an inflection point bifurcation such as those described in Figures 3. The RMSD values (17) are all quite small, indeed clearly smaller than in the case of 614D, so that the three clusters that are identified in the Figures 6 are a very good match. The statistical analysis of all three clusters show that G has a very high probability at the site 59; both S and D have some propensity albeit much smaller than G while the probability of the existing amino acid F is very small. Thus the site 59 is a very good candidate for a F→G mutation hot-spot.  (57, 58, 59), in the background of the best matching cluster #7 in [44]. b) The spectrum for A:(58, 59, 60), in the background of the best matching cluster #26 in [44]. c) The spectrum for F: (59, 60, 61), in the background of the best matching cluster #XI in [44]. The most common amino acid in all three clusters is G.

The case of 103
The Figures 7 show the neighborhood of the site 103G, located in the NTD domain of subunit S1. Here the situation is somewhat exceptional, since the residue 103 is already the smallest amino acid G.
The chain between sites 101 and 105 is shown in Figures 7. It has one vertex near  (101, 102, 103), in the background of the best matching cluster #7 in [44]. The most common amino acids in the cluster is T, followed by I. Panel b) The spectrum for A: (102, 103, 104), in the background of the best matching cluster #26 in [44]. The most common amino acid in the cluster is A, followed by G. Panel c) The spectrum for F:(103, 104, 105), in the background of the best matching cluster #XI in [44]. The most common amino acid in the cluster are T and L, followed by A and V.
a β-flattening point at 102. This is immediately followed by the vertex 103 that is proximal both to a flattening point and to the inflection point. The following vertex 104 has also a very small bond angle value. The overall shape of the trajectory suggests a mutation hot-spot with inflection point perestroika that converts the site 103 from a vertex that is proximal to the flattening point into a vertex that is proximal to a β-flattening point. It is also plausible that there has been a recent mutation with ensuing inflection point perestroika, that has converted the local topology by moving the vertex 103 from the vicinity of a β-flattening point to the vicinity of a flattening point. The statistical analysis shows that A, which is the smallest amino acids after G, has the highest propensity in the case of the adjacent cluster, shown in Figure 7 b). The propensity of A is also larger than that of G in the following cluster shown in However, RMSD value is not very small, and the Figure also shows that cluster #9 can not be good match to the spike protein segment P:(101,102,103): The distance between the observed τ value at the site 103 deviates from the τ -values in the cluster by some 150 degrees. Thus the conclusion is that the cluster #9 should be used with care, for a mutation outcome prediction.
The Figure 8 a) shows the present 3D spike protein structure in the neighborhood of the site 103. In the Figure 8 b) the amino acid G has been replaced by A; the effect of this substitution is estimated using a crude energetic analysis with Chimera [45]. An inspection of the interatomic distances show that A can be substituted for G without encountering steric clashes. But in the case of the other amino acids T, V and L that also have a high propensity in the cluster of Figure 7 c), a substitution using Chimera leads to steric clashes .   Fig 8. a) The present all-atom structure of the spike protein in the neighborhood of site 103. b) The substitution of A in place of 103G, as predicted by Chimera [45]. There are no apparent steric hinders for the substitution.
The conclusion of the present analysis is that the site 103G is a potential G→A mutation hot-spot.

The case of 287D
The Figures 9 show the neighborhood of the site 287D, located in the NTD domain of subunit S1. The site is both preceded and followed by a site that is proximal to a   [44]. The most common amino acids in the cluster is A, followed by D. Panel b) The spectrum for A:(286, 287, 288), in the background of the best matching cluster #VI in [44]. The most common amino acid in the cluster is A, followed by G. Panel c) The spectrum for F:(287, 288, 289), in the background of the best matching cluster #25 in [44]. The most common amino acid in the cluster are T and V, followed by A and L.
The RMSD values (17) are all very small so that the three clusters identified in the Figures 9 are an excellent match. The statistical analysis in all three clusters show that A has a very high probability at the site 287; both T and V are also likely substitutions in the following cluster, and G has also propensity in the adjacent cluster. But D that is now located at the site, is not very prominent in any of the clusters. Thus the site 287 is a very good candidate for a D→A mutation hot-spot.

The case of 464F
This site is located in the RBD domain of the subunit S1, and a mutation hot-spot can affect the binding to ACE2. Unfortunately, there are missing residues in the PDB data, only data for chain A in open state (PDB structure 6VYB) is available. The chain shown in Figures 11 is very similar to that in the case of 59F shown in Figures 6. The RMSD values are very low so that the three clusters are an excellent match. The statistical analysis shows that the site is a very good hot-spot candidate for a F→G mutation.  [44]. Panel b) The spectrum for A:(463, 464, 465), in the background of the best matching cluster #26 in [44]. Panel c) The spectrum for F: (464, 465, 466), in the background of the best matching cluster #XI in [44]. The most common amino acid in all three clusters is G.

The case of 1080A
According to the histogram in Figure 4 the fusion core between the HR1 and HR2 domains has a large number of sites that are proximal to a flattening point. However, the detailed investigation does not reveal any excellent candidate for a mutation hot-spot. An example is the site 1080A, the analysis results are summarized in Figures 12.  [44]. The most common amino acids in the cluster is A. Panel b) The spectrum for A: (1079, 1080, 1081), in the background of the best matching cluster #I in [44]. The most common amino acid in the cluster is A, followed by G. Panel c) The spectrum for F:(1080, 1081, 1082), in the background of the best matching cluster #25 in [44]. The most common amino acid in the cluster are T and V, followed by A and L.
The RMSD value for preceding cluster P:(1078,1079,1080) is not very good, but the value is excellent both for the adjacent A:(1079,1080,1081) and following F:(1080,1081,1082) clusters. Moreover, the chain shows that the site 1080 is an excellent candidate for a bi-flattening perestroika. But the statistical analysis reveals that the amino acid A, presently at this site, is also the most probable one and with very high probability in the case of the adjacent cluster. Thus, it appears that a recent mutation may have taken place at this site with the amino acid A as the substitution. The statistical analysis of the third cluster shows a relatively high propensity for T, V and L but all three are subject to steric hindrances and they also have a very low propensity in the adjacent cluster. The conclusion is that even though the vertex is proximal to a bi-flattening perestroika, it does not qualify as a mutation hot-spot. It is more likely that due to proximity of the bi-flattening perestroika, the site has exceptional flexibility with an important role in the fusion process. The residue 1080A can be a good target for the development of fusion inhibitors. Notably both the segment connecting 499 to 500, and the segment connecting 501 to 502 pass very close to the inflection point κ = 0. This suggests that the local topology can be prone to a variant of an inflection point bifurcation that causes a transition either into the topology akin that in Figure 3 a) or that in Figure 3 b). Fig 14. a) The present all-atom structure of the spike protein in the neighborhood of site 501. b) The effect of N→Y substitution as predicted by Chimera [45]. Besides a small change in the orientation of 505Y, despite the substantially larger size of Y in comparison to N there is no apparent steric hinder for the N→Y substitution at site 501.
In Figures 13 a) and b) the RMSD values (17) are very small; in the case of Figure  13 a) the value is ∆ min = 0.2 and in the case of Figure 13 ) the value is ∆ min = 0.16. Thus both spike protein segments are well represented by the ensuing clusters. But in the case of Figure 13 c) the value is much larger, ∆ min = 0.7 so that this segment is not well represented by the cluster. According to all three histograms in Figures 13 the likelihood of the observed N→Y substitution at site 501 should be lower than substitutions N→A or N→D by the similarly hydrophobic A and D. Furthermore, since Y has a much larger size than N, a priori the N→Y substitution requires an extensive conformational rearrangement which can be energetically costly. But a crude energetic Chimera [45] analysis of the neighborhood around the site 501N that is summarized in Figures 14 reveals that there is a "pocket" inside the spike protein structure that is large enough to accommodate Y with no major conformational re-arrangement; only a change in the orientation of 505Y is observed. Since the N501Y mutation is not in necessarily in contrast with stereochemical considerations, the energetic cost of a substitution can be minor .   Fig 15. a) The fragments of cluster #26 where Y appears in the last (Blue) site can be divided into two disjoint subsets, labeled A and B. b) The subsets A and B differ by a ∼ 180 degree rotation (flip) of the peptide plane between the second (Green) and third (Blue) site, around the connecting Frenet frame tangent vector t.
The likely bifurcation that can accompany the N→Y substitution can be deduced by an analysis of all those fragments in the (Preceding) cluster #26 of Figure 13 a) where Y is in the third (Blue) position; this is the cluster where Y has the highest prevalence. The (Adjacent) cluster #XI can be analyzed similarly. But in the (Following) cluster #28 where the spike protein segment has a relatively low quality match, there is no Y.
This analysis starts with Figure 15 a): The initial (Red ) dots of the fragments in the cluster #26 are naturally divided into two disjoint subsets. There are 42 backbone fragments that start in the subset labeled A in the Figure 15, and there are 4 fragments that start in the subset labeled B, with Y in the last (Blue) position in all fragments. The fragments in the subset A all pass very close the inflection point κ = 0, in a manner which is very similar to the 499-501 segment in Figure 13. The fragments in the subset B all cross the τ = 0 line in a manner that is topologically similar to the A-B-C segments in Figures 3 a) and b) (except for orientation). A comparison of the two subsets A and B in the Figure 15 b) reveals that there is a transition akin the peptide plane flip observed in [46]. For this, define the normal vector of the peptide plane between the Cα(i − i) and Cα(i), as the cross product between the vector t i and the vector that points from Cα(i − i) to the O(i) atom of the corresponding peptide plane. Similarly, define the normal vector of the peptide plane between Cα(i) and Cα(i + 1). Then, evaluate the angle between these two normal vectors. The result is shown in the histogram of Figure 15 b): For all entries in the subset A of Figure 15 a) the angle has a value which is very close to −π/2 while for all entries in the subset B the angle has a value that is very close to +π/2. The sign is determined by comparison to the virtual plane that is defined by Cα(i − 1), Cα(i) and Cα(i + 1).
The present analysis proposes that the peptide plane between sites 500 and 501 of spike protein can be prone to an inflection point bifurcation where the peptide plane flips, corresponding to a rotation of ∼ 180 degrees around the Frenet frame vector t between the ensuing Cα atoms. Whether this bifurcation, or one that can occur between sites 501 and 502, plays a role in the transmissibility of the N501Y mutation can be determined once structural data becomes available.
Finally, there are recent mutations that have taken place in the spike protein with considerable epidemiological consequences, including the sites 484 and 677. Unfortunately, the structural data at and around these sites is presently missing in the available Protein Data Bank structures, thus the potential bifurcations that may be associated with these mutations can not yet be characterized.

Conclusions
Topological techniques are commonly regarded to be among the most effective ones for addressing a wide range of physical problems. In particular in the case of a protein, where conformation is pivotal to biological function, topology should be a most valuable tool. However, thus far applications of topological methods in protein studies, and more generally in the study of filamental (bio)molecules such as DNA and RNA and other (bio)polymers, have been relatively sparse. The topological methodology that is developed here, for the purposes of protein studies, is designed to investigate bifurcations that can change the local topology of a space curve such as the protein backbone; similar studies on proteins have not been performed previously. The particular problem that is addressed here, is to try and pinpoint those hot-spot residues where a mutation can have the most profound conformational consequences: The hot-spot residues are those sites that are proximal to bifurcations where the local topology can change. The present methodology identifies these hot-spots using local topological considerations in combination with a comparative analysis that uses Protein Data Bank structures. But the methodology does not aim to actually predict, whether mutations at these hot-spot residues are more likely than at other sites. This needs to be deduced by other methods. The underlying mathematical structure is relatively new, it was developed mainly by Arnol'd starting late 1990's. Prior to the present study it has thus far not found any (bio)physical applications. Thus the present study serves also as an invitation to apply these powerful methods of local topology to relate the structure and function of proteins, and to apply them to study the physical properties of biomolecules and filamental structures also more widely.
The methodology is developed using the spike protein of the SARS-CoV-2 virus as an example. Any other protein could be used equally but the choice of the spike protein is timely, reflecting the current pandemic situation. This choice also brings some downside, as the available structural data on the spike protein has been measured February 21, 2021 27/31 with relatively low resolution and is also partially lacking, at the time of writing: The Protein Data Bank structures that are used in the present study have been determined using electron microscopy with a resolution no better than around 2.8 -3.5Å. Missing residues in the available spike protein structures also limits the study. For example the recent epidemiologically important mutations that have been observed at the sites 484 and 677 could not be included in the study, due to lacking experimental data. The low resolution causes also some uncertainty in the proper identification of sites that are proximal to a flattening point. At the same time the statistical analysis that is used here for the identification of the pertinent PDB clusters is also preliminary, as it is based on a somewhat limited set of structural clusters that are measured with ultra high resolution. Therefore, a further development and refinement of the underlying statistical methodology is desired. Finally, the present article concerns only the local topology of a protein backbone. A more complete analysis should combine topological investigations with energetic studies. But that brings a practical limitation, as the presently available all-atom molecular force fields are still not very accurate and demand substantial computational resources. Here Chimera has been employed, as it is a method that can provide crude energetic estimates in the case complex proteins such as the spike protein, without undue need for computer resources.