Abstract
The understanding of the molecular mechanisms driving the fitness of the SARS-CoV-2 virus and its mutational evolution is still a critical issue. We built a simplified computational model, called SpikePro, to predict the SARS-CoV-2 fitness from the amino acid sequence and structure of the spike protein. It contains three contributions: the viral transmissibility predicted from the stability of the spike protein, the infectivity computed in terms of the affinity of the spike protein for the ACE2 receptor, and the ability of the virus to escape from the human immune response based on the binding affinity of the spike protein for a set of neutralizing antibodies. Our model reproduces well the available experimental, epidemiological and clinical data on the impact of variants on the biophysical characteristics of the virus. For example, it is able to identify circulating viral strains that, by increasing their fitness, recently became dominant at the population level. SpikePro is a useful instrument for the genomic surveillance of the SARS-CoV-2 virus, since it predicts in a fast and accurate way the emergence of new viral strains and their dangerousness. It is freely available in the GitHub repository github.com/3BioCompBio/SpikeProSARS-CoV-2.
1. Introduction
Despite mitigation measures put in place around the world to slow the rapid spread of the SARS-CoV-2 virus, the CoViD-19 pandemic continues to have devastating global effects, with more than 130,000,000 people infected and almost 3,000,000 deaths [1]. Considerable effort and resources have been devoted in the last year to developing vaccines and new therapeutics in response to the SARS-CoV-2 infection [2,3]. Several vaccines such as mRNA-1273 [4], BNT162b2 [5], AZD1222 [6], Sputnik V [7], Ad26 [8] and NVX-CoV2373 [9] have proven to be safe and efficacious against the viral agent and have recently been approved by the regulatory agencies for emergency use. Thanks to these developments, large-scale vaccine administration is now ongoing throughout the world.
Moreover, while the pathogenic mechanisms of the viral infection are still unclear, effective therapeutic agents have been developed. For example, neutralizing antibodies (nAbs) targeting the viral spike protein or human convalescent plasma have been employed in clinical practice by passively transferring them to patients [10–13]. This therapy generally leads to an improvement of the disease conditions and to a reduction of viral load.
The increase in viral immunity at the population level due to infection, vaccination or passive immunization via nAbs clearly results in a stronger selection pressure on the SARS-CoV-2 virus [14,15]. This causes the emergence of new variants of the virus which are able to escape from the immune response. Many computational and experimental studies currently focus on understanding these escape mechanisms in the SARS-CoV-2 viral infection [16–20] and on setting up SARS-CoV-2 immune surveillance of the world’s population to track and eventually limit the spread of potentially escaping variants [21–26].
However, the prediction of how SARS-CoV-2 evolves under this selective pressure is far from obvious. Indeed, even though SARS-CoV-2 has a moderate mutation rate compared to other RNA viruses due to its more accurate replication [27], tracking viral dynamics in the huge space of possible variant combinations (including also deletions and insertions) under the influence of human immunity makes predictions highly challenging. Extensive large-scale monitoring of SARS-CoV-2 evolution and host immunity will help to better understand these issues [27].
In this paper, we performed an extensive computational analysis of the mutational mechanisms that lead to the emergence of SARS-CoV-2 strains with increased fitness, with the aim of better understanding the molecular mechanisms that drive viral adaptation and escape from the human immune system. We performed in silico mutagenesis experiments and predicted the impacts of mutations in the spike protein on its stability and on its affinity for nAbs and for the angiotensin-converting enzyme 2 (ACE2), known to be the SARS-CoV-2 entry point into the cell. We validated these predictions on viral variants for which experimental, epidemiological or clinical data have been obtained, and especially on the variants that are emerging and rapidly spreading to become prevalent genotypes. Our predictions are of utmost importance to help monitor the future evolutionary dynamics of SARS-CoV-2 and to identify the emergent strains whose spread will have to be limited via either the design of new vaccines or new mitigation measures.
2. Methods
2.1. Spike protein structures
The spike protein or S-protein of the SARS-CoV-2 virus (Uniprot code P0DTC2) is a homotrimeric glycoprotein attached to the viral membrane. It can adopt two forms, a closed and an open form. The transition between these forms increases the solvent exposure of the protein’s receptor-binding domain (RBD), which encompasses residues 333-526 and mediates the fusion of the membranes of the virus and its host.
The 3-dimensional (3D) structures of the two forms have been experimentally resolved by cryo-electron microscopy (cryo-EM) and are deposited in the Protein DataBank (PDB) [28]. The closed form, with PDB code 6VXX, has a resolution of 2.80 Å [29], and the open form, 6VYB, has a resolution of 3.20 Å. These structures thus have a fairly low resolution and do not contain all the residues of the spike protein. To get structures of the closed and open forms without missing residues, we modelled the complete amino acid sequence using the PDB structures 6VXX and 6VYB as templates and the homology modelling webserver SWISS-MODEL [30].
More accurate structures, resolved by X-ray crystallography, are available for the RBD of the spike protein. We used the PDB structure 6M0J [31] for this region, which contains the RBD bound to ACE2, with a resolution of 2.45 Å.
Furthermore, we set up a dataset of spike protein/nAb complexes taken from [32], referred to as 𝒟 nAb. We used the following selection criteria:
Human monoclonal nAbs generated in response to SARS-CoV-2 infection;
nAbs targeting the spike protein;
nAbs/spike protein complexes available in the PDB, with X-ray structure of resolution ≤ 3.2 Å.
𝒟 nAb contains 31 structures of nAbs/spike protein complexes, listed in the GitHub repository github.com/3BioCompBio/Sp CoV-2. These nAbs exclusively target the RBD of the spike protein, and are assumed to mimic the diversity of the human immune B-cell repertoire.
2.2. Spike protein stability
To compute the change in folding free energy upon point mutations in the spike protein, we used the PoPMuSiC algorithm [33], which is based on the 3D structure of the target protein and a combination of statistical mean-force potentials. We applied it to the modelled structures of the open and closed forms of the spike protein, and to the experimental structure of the RBD. The final value of the change in folding free energy caused by a mutation i, ΔΔGS, was defined as follows: for mutations of residues in the RBD, we considered the predictions based on the 6M0J structure of the RBD; for mutations of other residues, we averaged the predicted energy values obtained from the two models built from the low-resolution structures 6VYB and 6VXX.
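The selection rule described above can be sketched in a few lines. This is only an illustration: the function and argument names are hypothetical, the ΔΔG values are assumed to have been obtained beforehand from a predictor such as PoPMuSiC, and the RBD boundaries are the residue range 333-526 quoted in Section 2.1.

```python
RBD_START, RBD_END = 333, 526  # RBD residue range quoted in Section 2.1


def stability_ddg(position, ddg_rbd, ddg_open, ddg_closed):
    """Return the folding free energy change (kcal/mol) retained for one mutation.

    position   -- residue number in the spike protein sequence
    ddg_rbd    -- value predicted on the RBD X-ray structure (6M0J);
                  may be None for residues outside the RBD
    ddg_open   -- value predicted on the modelled open form
    ddg_closed -- value predicted on the modelled closed form
    (All names are hypothetical; the values come from an external predictor.)
    """
    if RBD_START <= position <= RBD_END:
        # Inside the RBD, the higher-resolution X-ray structure is preferred.
        return ddg_rbd
    # Elsewhere, average the predictions from the two full-length models.
    return 0.5 * (ddg_open + ddg_closed)
```

For example, `stability_ddg(614, None, -1.0, -0.5)` falls outside the RBD and returns the average of the open-form and closed-form predictions.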
2.3. Spike protein/ACE2 binding affinity
For the changes in binding affinity upon single-site mutations, we used the BeAtMuSiC predictor [34], which is a linear combination of free energy values predicted by PoPMuSiC on the protein complex and on the separate partners. We applied BeAtMuSiC to predict the effect of variants in the viral spike protein on its binding affinity for the ACE2 receptor of the host, which allows entry of the SARS-CoV-2 virus into cells. For this purpose, we considered the X-ray structure 6M0J of the RBD/ACE2 receptor complex [31] as input, and computed the change in binding free energy ΔΔGACE2 of the RBD/ACE2 complex upon mutations i in the RBD. Mutations in the spike protein but outside of the RBD were assumed to have no effect on ACE2 binding.
2.4. Spike protein/nAb binding affinity
The changes in binding affinity between the spike protein and the 31 nAbs from the 𝒟nAb set caused by point mutations in the spike protein were also estimated using BeAtMuSiC [34]. We computed the effect of each mutation i on the binding affinity of each nAb/spike protein complex p, and computed their mean value over the complexes from 𝒟nAb:

$$\Delta\Delta G_{\mathrm{nAb}}^{i} = \frac{1}{n_i} \sum_{p=1}^{n_i} \Delta\Delta G_{\mathrm{nAb}}^{i,p}$$

where ni is the number of structures that include the mutation i. Indeed, the structures of the nAb/spike protein complexes do not cover exactly the same region of the spike protein.
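As a minimal sketch of this averaging step, the snippet below takes per-complex predictions for one mutation and averages only over the complexes whose structure contains the mutated residue. The dictionary layout, the use of `None` for uncovered residues and the function name are assumptions made for illustration.

```python
def mean_nab_ddg(ddg_by_complex):
    """Average the binding free energy change over the complexes covering a mutation.

    ddg_by_complex -- dict mapping a complex identifier to the predicted change
                      in binding free energy (kcal/mol), or to None when the
                      mutated residue is absent from that structure
                      (hypothetical layout).
    Returns None when no complex covers the mutated residue.
    """
    covered = [value for value in ddg_by_complex.values() if value is not None]
    if not covered:
        return None
    # len(covered) plays the role of n_i in the text.
    return sum(covered) / len(covered)
```

With three complexes, one of which lacks the residue, `mean_nab_ddg({"a": 1.0, "b": None, "c": 0.5})` averages over the two covering structures only.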
2.5. SARS-CoV-2 fitness
Viral fitness is related to how efficiently the virus produces infectious progeny [35]. It is a fairly complex function of different characteristics, including the transmissibility of the virus, its infectivity and its ability to escape from the host’s immune response. We estimated the fitness Φi of a variant i of the SARS-CoV-2 virus on the basis of a simplified model which only takes into account the spike protein. More precisely, we defined it as a product of three fitness contributions:

$$\Phi^{i} = \phi_{S}^{i} \, \phi_{\mathrm{ACE2}}^{i} \, \phi_{\mathrm{nAb}}^{i}$$

where ϕS, ϕACE2 and ϕnAb represent the relative propensities of the mutant virus to be transmitted, to infect the host, and to escape the host’s immune system. These propensities are assumed to be higher for spike protein variants that are stabler [36] (ΔΔGS < 0), that have greater binding affinity for the ACE2 receptor [37] (ΔΔGACE2 < 0), and that have lower binding affinity for nAbs (ΔΔGnAb > 0), respectively. We thus defined the fitness contributions ϕS and ϕACE2 of a mutation i to be positive decreasing functions of ΔΔGS and ΔΔGACE2, respectively, and ϕnAb to be a positive increasing function of ΔΔGnAb. More precisely: where µS, µACE2, µnAb, βS, βACE2 and βnAb are parameters. The choice of the ϕ-functions and parameters is justified as follows:
Mutations i that strongly destabilize the spike protein (ΔΔGS ≫ 0) or its binding to ACE2 (ΔΔGACE2 ≫ 0), or that stabilize its binding with nAbs (ΔΔGnAb ≪ 0), have a fitness close to zero.
Mutations that stabilize the spike protein (ΔΔGS < 0) or its binding to ACE2 (ΔΔGACE2 < 0), or that destabilize binding to nAbs (ΔΔGnAb > 0), have an evolutionary advantage and a fitness higher than one.
To avoid excessively high fitness values, we cut the exponential growth of the ϕ-functions for ΔΔGi = β, with β = βS = βACE2 = βnAb chosen to be −1, similarly to what has been proposed in [38].
The folding free energy changes predicted by PoPMuSiC have been shown to be biased towards destabilizing mutations [39,40]. To correct for this effect, the µS parameter has been chosen to be equal to 0.5. The changes in binding free energy predicted by BeAtMuSiC have an analogous bias, as they are constructed from PoPMuSiC scores. Following the BeAtMuSiC construction detailed in [34], a bias in the PoPMuSiC energy value of 0.5 results in a bias in the BeAtMuSiC energy value of 0.19. We thus fixed µS = 0.50 and µACE2 = µnAb = 0.19.
We set by definition the fitness value of the wild-type equal to one: Φwt = 1.
The global viral fitness, which takes into account multiple mutations in the spike protein, is defined as the product of the fitness values of all point mutations i:

$$\Phi = \prod_{i=1}^{m} \Phi^{i} \tag{4}$$

where m corresponds to the total number of mutations in the spike protein relative to the wild-type strain. Note that, in doing so, we considered the mutations as independent and discarded possible epistatic effects.
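To make the fitness model concrete, here is a rough sketch in Python. The functional form is an assumption: the text only states that the ϕ-functions are positive, monotonic, exponential and capped at ΔΔG = β, so the snippet uses a shifted exponential with a flat cap. The mirrored cutoff at −β for ϕnAb and the convention that a missing ΔΔG (for instance ACE2 binding for a mutation outside the RBD) contributes a factor of one are guesses made for illustration. Only the numerical values of µ and β are taken from the text.

```python
import math

MU_S, MU_ACE2, MU_NAB = 0.50, 0.19, 0.19  # bias corrections quoted in the text
BETA = -1.0                               # cutoff quoted in the text


def phi_decreasing(ddg, mu, beta=BETA):
    """ASSUMED form for phi_S and phi_ACE2: exp(-(ddG - mu)), flat below ddG = beta."""
    if ddg is None:  # assumption: no prediction means no effect on fitness
        return 1.0
    return math.exp(-(max(ddg, beta) - mu))


def phi_increasing(ddg, mu, beta=BETA):
    """ASSUMED form for phi_nAb: exp(ddG - mu), flat above ddG = -beta (a guess)."""
    if ddg is None:
        return 1.0
    return math.exp(min(ddg, -beta) - mu)


def mutation_fitness(ddg_s, ddg_ace2, ddg_nab):
    """Fitness of one point mutation as the product of the three contributions."""
    return (phi_decreasing(ddg_s, MU_S)
            * phi_decreasing(ddg_ace2, MU_ACE2)
            * phi_increasing(ddg_nab, MU_NAB))


def strain_fitness(mutations):
    """Global fitness of a strain: product over its point mutations.

    mutations -- iterable of (ddg_s, ddg_ace2, ddg_nab) tuples.
    An empty iterable (the wild type) gives 1, matching the convention in the text.
    """
    return math.prod(mutation_fitness(*m) for m in mutations)
```

Treating the mutations as independent is what allows the simple product in `strain_fitness`; any epistatic coupling would require a different combination rule.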
3. Results
3.1. Computational pipeline
During viral evolution, SARS-CoV-2 and our immune system are constantly engaged in what is known as a cat-and-mouse game, where SARS-CoV-2 attempts to increase its fitness by increasing its transmissibility and infectivity and/or by escaping from the human immune response. To quantitatively describe the viral fitness landscape, we developed a simplified model in which we focused on the spike protein. This protein, which protrudes from the virus surface, is a crucial component of the infection, as its binding to the ACE2 receptor of the host mediates the virus entry into the cells. The binding affinity of the spike protein for ACE2 has thus been related to SARS-CoV-2 infectivity [37]. The stability properties of the spike protein itself are another key element in the viral infection which has been related to the viral transmissibility [36].
Moreover, the spike protein is a major inducer of the host’s immune response [18,26]. We mimicked the effect of the immune system on the SARS-CoV-2 virus through a set of 31 nAb/spike protein complexes contained in the dataset 𝒟nAb (see Section 2.1). We observed that these nAbs target exclusively the RBD of the spike protein and that the epitopes cover almost the entire RBD surface, as shown in Fig. 1. A recent investigation suggests that RBD-binding antibodies are the major contributors to the neutralizing activity in convalescent human plasma [18,26]. This justifies our approximation of considering the nAbs of the set 𝒟nAb as representative of the immune response.
To estimate the global viral fitness F of spike protein variants in terms of transmissibility, infectivity and escape from the host’s immune response, we computed it, using physics-based approaches, as a product of three fitness contributions, related to the change in stability of the spike protein upon amino acid substitution (ϕS), and to its change in binding affinity for ACE2 (ϕACE2) and for neutralizing antibodies (ϕnAb), respectively, as defined in Eqs (1)-(3). The effects of multiple mutations on fitness are considered independent and are thus simply multiplied (Eq. 4). The fitness contributions are in turn expressed in terms of the change upon mutation of the folding free energy of the spike protein (ΔΔGS) and of its binding affinity for ACE2 (ΔΔGACE2) and for nAbs (ΔΔGnAb), using the PoPMuSiC [33] and BeAtMuSiC [34] algorithms (see Sections 2.2-2.4).
In order to identify mutations in the spike protein that increase or decrease the SARS-CoV-2 transmissibility or infectivity, or that facilitate or block the escape from the protective immunity elicited by the infection, we constructed a computational pipeline of three steps, schematically represented in Fig. 2, in which we estimated ΔΔGS and ϕS, ΔΔGACE2 and ϕACE2, and ΔΔGnAb and ϕnAb. Using this pipeline, we performed large-scale computational mutagenesis experiments, in which we introduced essentially all possible point mutations in the spike protein and predicted their effect on viral fitness. In what follows, we compare these predictions with a large series of available experimental, epidemiological and clinical data on the SARS-CoV-2 infection and evolution.
Our prediction pipeline, called SpikePro, is freely available as an easy-to-use C++ program, which needs a variant spike protein sequence in FASTA format as input. It outputs the sequence alignment with the reference spike protein (Uniprot code P0DTC2), the list of all point mutations introduced and the predicted overall viral fitness F. It can be downloaded from github.com/3BioCompBio/SpikeProSARS-CoV-2.
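The step that turns an alignment into a list of point mutations can be illustrated as follows. This is not the SpikePro implementation, whose internals are not described here; it is a hypothetical Python sketch that assumes the variant and reference sequences have already been aligned to equal length, with `-` marking gaps, and that only substitutions are reported.

```python
def point_mutations(ref_aligned, var_aligned):
    """List substitutions such as 'D614G' from two pre-aligned sequences.

    ref_aligned, var_aligned -- aligned sequences of equal length, '-' for gaps.
    Positions are numbered along the reference; insertions and deletions
    are skipped in this simplified sketch.
    """
    if len(ref_aligned) != len(var_aligned):
        raise ValueError("aligned sequences must have the same length")
    mutations = []
    position = 0
    for ref_aa, var_aa in zip(ref_aligned, var_aligned):
        if ref_aa != "-":
            position += 1  # only reference residues advance the numbering
        if ref_aa != "-" and var_aa != "-" and ref_aa != var_aa:
            mutations.append(f"{ref_aa}{position}{var_aa}")
    return mutations
```

On a toy alignment, `point_mutations("MFVD", "MFVG")` reports the single substitution at reference position 4.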
3.2. Spike protein stability and SARS-CoV-2 transmissibility
We performed a large in silico mutagenesis experiment to study the influence of mutations on spike protein stability and thus on viral transmissibility [36]. Using PoPMuSiC [33], we computed the change in folding free energy ΔΔGS of all possible single-site mutations i in the spike protein, and the corresponding fitness contribution ϕS defined in Eq. (3).
As a first check of our method, we analyzed the relation between the predicted ΔΔGS values for all point mutations in the RBD and the measured effects of these variants on the spike protein expression [41]. These measurements were done using a yeast surface display platform, in which protein expression was quantitatively determined at large scale via flow cytometry. Even though protein expression and stability are only partially correlated, we found a good Pearson correlation coefficient of −0.51 between the measured expression and the predicted values, which can be considered as the first validation of our approach.
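For reference, the Pearson coefficient used in this and the following validations can be computed as in the sketch below (plain Python, no external library). The input lists are placeholders for paired predicted and measured values; a negative coefficient is expected here because destabilizing mutations have positive predicted free energy changes and lower expression.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)
```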
To analyze the relation between stability predictions and epidemiological data, we compared the computed spike protein stability changes with the observed mutation rate Ri. We estimated Ri as the number of occurrences of each point mutation i in the set of about 7.8 × 10^5 SARS-CoV-2 spike protein sequences collected in the GISAID database [42], divided by the number of residues in the spike protein. We analyzed Ri as a function of the predicted ΔΔGS values for all possible mutations i in the whole spike protein. As seen in Fig. 3.a, the majority of mutations that became dominant during the evolutionary trajectory show a slight increase of the spike protein stability, with ΔΔGS between −1 and 0 kcal/mol. A smaller number of dominant variants have their stability slightly decreased, with ΔΔGS between 0 and 1 kcal/mol. Outside of this free energy interval, the rate Ri is almost vanishing.
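The rate Ri as defined above (occurrence count divided by the number of residues) could be computed along these lines. The input format, one list of mutation labels per sequenced strain, is an assumption for illustration; the default length of 1273 residues is that of the reference spike protein (Uniprot P0DTC2).

```python
from collections import Counter

SPIKE_LENGTH = 1273  # number of residues in the reference spike protein (P0DTC2)


def mutation_rates(strains, n_residues=SPIKE_LENGTH):
    """Compute R_i for every observed point mutation.

    strains -- iterable of iterables of mutation labels such as 'D614G',
               one per sequenced strain (hypothetical input layout).
    Returns a dict mapping each mutation to its occurrence count divided
    by the number of residues, following the definition in the text.
    """
    # set() guards against a mutation being listed twice for one strain.
    counts = Counter(m for strain in strains for m in set(strain))
    return {mutation: count / n_residues for mutation, count in counts.items()}
```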
Moreover, we found a very good agreement between the predicted fitness and the Ri rate, as seen in Fig.3.b. Indeed, variants that are predicted to be fitter than the wild type protein, and especially the variants i with , have a high Ri rate, which means that they circulate a lot and got fixed during viral evolution. We will deepen this point in Sections 3.6-3.7.
It is important to underline that we did not fit any parameters of our model on the SARS-CoV-2 data. Thus, this prediction as well as all the predictions presented in the following sections are truly blind predictions.
Finally, it is also instructive to analyze the localization of the variants fixed through viral evolution in the 3D structure of the spike protein. The mean values of Ri in the core (solvent accessibility <20%), in partially buried regions (20%-50%) and at the surface (>50%) are equal to 0.06, 0.06 and 0.23, respectively. This indicates that variants that got fixed are mainly situated in solvent-exposed regions, where they can play a key role in modulating binding with other biomolecules. Variants in buried or partially buried regions are less often observed, as these areas are more constrained from a structural point of view and are usually not involved in function.
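The binning by solvent accessibility can be sketched as follows. The cutoffs (20% and 50%) are those given in the text; the record layout and function names are hypothetical, and the accessibility values are assumed to have been computed beforehand from the 3D structure.

```python
def accessibility_class(acc_percent):
    """Bin a residue by solvent accessibility (in %), using the cutoffs in the text."""
    if acc_percent < 20:
        return "core"
    if acc_percent <= 50:
        return "partially buried"
    return "surface"


def mean_rate_by_class(records):
    """Average R_i within each accessibility class.

    records -- iterable of (solvent accessibility in %, R_i) pairs,
               one per mutation (hypothetical layout).
    """
    totals = {}
    for acc, rate in records:
        key = accessibility_class(acc)
        running_sum, n = totals.get(key, (0.0, 0))
        totals[key] = (running_sum + rate, n + 1)
    return {key: running_sum / n for key, (running_sum, n) in totals.items()}
```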
3.3. Spike protein/ACE2 binding affinity and SARS-CoV-2 infectivity
We analyzed here the impact of variants on the binding of the spike protein with the ACE2 receptor. For all possible point substitutions i in the spike protein, we computed the change in binding affinity of the spike protein/ACE2 complex, ΔΔGACE2, using the BeAtMuSiC program [34]. Based on the ΔΔGACE2 values, we estimated the fitness contribution ϕACE2, aimed at modeling infectivity. Indeed, a higher binding affinity between the spike protein and ACE2 results in a higher efficiency of virus entry into the host’s cells [37], which in turn leads to an increase of SARS-CoV-2 infectivity.
We compared the predicted binding free energy values with the experimentally characterized binding properties of thousands of variants introduced in the RBD of the spike protein using a yeast surface display platform, in which binding to ACE2 was quantitatively determined via flow cytometry [41]. Such deep mutagenesis scanning techniques are excellent tools to estimate biophysical quantities on a large scale. However, even though the average accuracy is reasonably good, the measured quantities are often noisy [43].
A good agreement was found between the computed values and the large-scale measured binding affinity properties, with a Pearson’s correlation coefficient of −0.46. This result is very good, especially as not only the computed but also the experimental values have limited accuracy. It clearly underlines the quality of our prediction approach.
3.4. Spike protein/nAb binding affinity and immune escape
Immune evasion is the well-known mechanism used by viruses to evade the immune system of their host, thus making their replication and spreading more efficient [44]. This mechanism involves a series of strategies such as spontaneous mutations that result in the inactivation of nAbs [45] or in the inhibition of pattern-recognition receptors initiating signalling pathways [46].
To represent the diversity of the B-cell receptor repertoire and to mimic the effect of the human immune response, we considered the set 𝒟nAb of more than 30 nAbs, of which the 3D structures with the RBD of the spike protein have been experimentally resolved (see Section 2.1). We performed a large-scale in silico mutagenesis experiment by introducing all possible point mutations i in the RBD of the spike protein and by computing with BeAtMuSiC [34] the resulting change in binding free energy ΔΔGnAb, averaged over all spike protein/nAb complexes that contain the mutation, as well as the associated fitness contribution ϕnAb (see Eqs (2)-(4)). With this procedure, we identified key spike protein variants that are likely to either help or destroy the neutralization activity of the nAbs.
In a first stage, we performed validation tests on BeAtMuSiC’s predictions. We compared them with deep mutagenesis scanning data measuring the impact of mutations in the RBD on their escape fractions from two nAbs, REGN10933 and REGN10987, which are often administered as a cocktail to COVID-19 patients [47]. The escape fractions were estimated using a high-throughput yeast-surface-display platform, in which folded RBDs were expressed on the yeast cell surface and the fraction of cells that express mutant RBDs and that are bound to nAbs was measured [19]. Per-mutant escape fraction values close to zero indicate that the variant is bound to nAbs while values close to one indicate that it is not.
The structures of the complexes formed by the spike protein and REGN10933 or REGN10987 nAbs have recently been resolved (PDB code 6XDG). They target two different structural epitopes in the RBD of the spike protein. We did not include these structures in our set 𝒟nAb as they have been resolved via cryo-EM technique at only 3.9 Å of resolution. We predicted the changes in binding affinity ΔΔGi of the two spike protein/nAb complexes caused by all RBD mutations i for which experimental escape fractions were available. Despite the low resolution of the 3D structures, we found very good Pearson correlation coefficients of 0.48 and 0.43 between the per-mutant escape fractions and the computed changes in affinity ΔΔGi for REGN10933 and REGN10987 nAbs, respectively.
In a second stage, we estimated the fitness contributions of all possible mutations i in the spike protein’s RBD on the basis of the predicted changes in binding free energy for the set of 31 good-resolution nAbs/spike protein complexes collected in 𝒟nAb. We made here and in what follows the strong approximation that these 31 nAbs represent the diversity of the human nAb repertoire. To validate this model, we compared the estimated fitness contributions with a series of data obtained from in vivo experiments aimed to study the viral escape from nAbs.
We started by considering the set of 22 variants of the spike protein for which the neutralizing activity of six nAbs has been experimentally tested in terms of the relative degree of resistance (in %) of the growth of each mutant virus in the presence or in the absence of each of these nAbs [48]; we considered the average percentage over the six nAbs tested. Low percentages identify variants that escape much more from nAbs than the wild type virus and high percentages, variants that only weakly affect the wild-type spike protein/nAbs affinity. We predicted correctly 18 out of the 22 variants as having ϕnAb fitness values greater than one; the last four variants have ϕnAb 0.9. Detailed results are reported in Table 1 for the five variants shown to have the broadest in vitro neutralizing spectrum [48]. Our results reproduce quite well the in vitro trends: variants that are likely to escape from at least some nAbs tend to have fitness values larger than one. Note, moreover, that the antibodies tested in [48] are different from the nAbs of our 𝒟nAb set. Because of that, we did not expect such a good match between the experiments and our predictions. This result indicates that the set 𝒟nAb is truly representative of the antibody repertoire neutralizing the SARS-CoV-2 virus.
The response to the viral infection drastically depends on the ensemble of nAbs present in the host, given that each nAb behaves differently with respect to wild-type and variant strains. In agreement with this, the predicted change in binding free energy is found to strongly depend on the considered variant and nAb/spike protein complex, as clearly seen in Fig. 4. Remember that it is the average over all the nAbs that is used to define the fitness contribution ϕnAb and thus the overall immune escape ability.
We also validated our fitness predictions ϕnAb against the large-scale experimental estimation of the immune escape fractions of about 2,000 variants, averaged over a set of 17 nAbs [49]; these nAbs are not in the set 𝒟nAb. We found a reasonably good overall Pearson correlation coefficient of 0.29 between ϕnAb and measured escape fractions. Looking in more detail, the residues whose mutations most affect nAb binding belong to two regions of the RBD: the 443–450 and 484–490 loops, which are situated on either side of the ACE2 binding interface [49]. Using our set 𝒟nAb of nAbs, we predicted the second region as potentially leading to immune escape with a ϕnAb value of 1.6. The nAb escaping capability is predicted to be weaker for the first region, with ϕnAb = 1.1.
A 3D representation of the per-residue fitness contributions in the RBD of the spike protein, averaged over all possible mutations at each position, is shown in Fig. 5. This figure is very useful to identify residues whose mutation is likely to lead to the escape from the 𝒟nAb set of nAbs.
3.5. Immune escape from polyclonal human sera
We examined to what extent our method reproduces the impact of variants on the neutralizing activity of polyclonal human sera. Note that such activity depends on a wide range of factors among which inter-patient variability and time since infection [49]. Our computational approach is obviously unable to capture all intricate dependencies but rather, we expect it to detect general trends.
We used deep mutagenesis scanning data from [49], in which the escape fractions of about 2,000 single-site RBD variants were assessed on the neutralizing activity of plasma samples taken from 17 SARS-CoV-2-infected individuals, at different time points after infection. We calculated the correlation between the escape fraction for each variant averaged over the patients and post-infection time points and the predicted fitness contributions ϕnAb computed from the 𝒟nAb set of nAbs. We obtained a reasonably good Pearson correlation coefficient of 0.35 between the predicted and measured quantities.
Only a few residues appear to contribute substantially to the escape mechanisms, when averaged over the whole plasma sample collection. Indeed, only 23 residues have an average escape fraction greater than 3%. Our predictions for these residues are in very good agreement with experiments: we obtained an average per-residue ϕnAb equal to 1.5. Residue F456 shows almost perfect agreement: it has the highest measured escape fraction, and also has the highest predicted ϕnAb value, equal to 2.2. Almost all substitutions at that position are predicted to strongly impact binding in the majority of the spike protein/nAb complexes analyzed.
Finally, it is interesting to compare the measured immune escaping fractions in polyclonal plasma discussed in this section with the experimentally characterized escape fractions in the set of nAbs studied in [49] and discussed in Section 3.4. We found that their linear correlation coefficient is equal to 0.4, which indicates there are differences between the tested cocktail of nAbs and serum plasma. Possible explanations include the scarcity of the tested antibodies in the polyclonal plasma, or the subdominance of the epitopes they target [49].
3.6. Overall variant fitness, transmissibility, infectivity and immune escape
We focused on five SARS-CoV-2 variants most frequently observed worldwide, as reported in the GISAID database [42] in March 2021, and predicted their fitness; the results are shown in Table 2.
The most frequently observed spike protein variant involves the substitution of aspartic acid at position 614 into glycine, situated outside the RBD. This variant quickly became dominant after its appearance in early 2020 [36,50]. We correctly predicted a substantial increase of fitness for this variant with respect to wild type, which is driven by an increased stability of the spike protein . We hypothesize that this stabilization leads to a higher person-to-person viral transmissibility, as also suggested in [36,50,51] and observed in vivo [51]. In the latter study, a stabilization of the spike protein was measured upon D614G substitution via a strengthening of the S1-S2 subunit interactions, where S1 is the receptor binding subunit containing the RBD and S2 is the membrane fusion subunit. In contrast, this variant was shown to alter neither the binding of the spike protein to ACE2 nor the antibody neutralization, as it is situated outside the RBD [51]. We also correctly reproduced this result, with fitness values of (Table 2). The overall predicted fitness is thus ΦD614G = 3.7.
Two other variants, A222V and P681H, show similar albeit less pronounced trends. Our results predict an increase in transmissibility , but to a lesser extent than D614G. Experimental data are in agreement with the weaker impacts of these variants on the spike protein fitness and the viral transmissibility compared to D614G [52,53]. The A222V variant has been related to the large outbreaks in Europe in early summer 2020, while P681H is associated to the so-called UK lineage (B.1.1.7) that appeared in UK in late 2020 and is now becoming dominant in Europe in the current outbreaks.
Finally, N501Y is also a widely spread variant appearing in all major lineages, i.e. UK (B.1.1.7), Brazilian (P.1) and South African (B.1.351) lineages. We predicted this variant as having a high overall fitness Φ due to a combination of increased fitness contributions and , but a . In other words, we predicted this variant to be more transmissible and infectious than the wild type but to have no impact on the response of the human immune system. More precisely, we predicted N501Y as improving the stability of the spike protein RBD and its binding affinity for ACE2; the latter property is also suggested by another computational study [54]. No clinical data suggest that N501Y is able to escape from the immune post-vaccination response [55], which tends to support our prediction results.
3.7. Viral evolution and overall fitness
We applied our prediction pipeline to analyze SARS-CoV-2 evolution, focusing on the spike protein. We started by predicting the viral fitness F of all the SARS-CoV-2 strains collected in the GISAID database from December 2019 till March 2021, which amounts to about 7.8 × 10^5 strains. We subdivided the strains according to the month of collection and computed the per-month average of viral fitness. The results are reported in Fig. 6.a as a function of time. Clearly, we predict an increase of the viral fitness since the beginning of the infection in December 2019, in agreement with epidemiological results. This result once again demonstrates the quality of our computational pipeline.
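The per-month averaging can be illustrated with a short sketch. It assumes each strain has already been assigned a fitness value and a collection month encoded as a 'YYYY-MM' string; both the encoding and the function name are choices made for this example.

```python
from collections import defaultdict


def monthly_mean_fitness(strains):
    """Average the predicted fitness of strains grouped by collection month.

    strains -- iterable of (month, fitness) pairs, with month as 'YYYY-MM'
               (hypothetical encoding, chosen so that string order is
               chronological order).
    Returns a dict ordered by month.
    """
    buckets = defaultdict(list)
    for month, fitness in strains:
        buckets[month].append(fitness)
    return {month: sum(values) / len(values)
            for month, values in sorted(buckets.items())}
```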
Note that to predict the future evolution of the fitness Φ, it is necessary to take into account different parameters such as the varying repertoire of human nAbs and the effect of vaccination. While the fitness contributions ϕS and ϕACE2 are expected to reach a plateau when the spike protein sequence becomes optimal for stability and for binding to ACE2, the cat-and-mouse game played by the virus and its host leads the host to continuously adapt its B-cell repertoire to the new variants of the virus, so that ϕnAb certainly increases with respect to the old nAbs, but not with respect to the new nAbs. In total, the overall fitness Φ is expected to plateau after some time, or at least increase less.
We analyzed in more detail the evolution of the partial distribution function of the per-month averaged fitness in Fig. 6.b. In January 2020, the population was dominated by the wild-type strain, whose fitness Φ is by definition equal to one. The effect of the D614G spike protein variant with a predicted Φ ∼ 4 is observed from May 2020, while in October of the same year, additional mutations with Φ = 2.0, such as A222V, started to be fixed in the population, leading to a further increase of Φ. In March 2021, the distribution became dominated by new variants, i.e. the UK, South African and Brazilian variants, with a much higher fitness than both the wild-type and D614G strains.
Finally, we carefully checked that our large-scale mutagenesis predictions are not biased towards high fitness values. Indeed, such a bias could cause a trivial increase in fitness upon evolution and lead to erroneous interpretations. To verify this, we created 2 × 10⁶ viral strains by inserting either three or five random mutations in the wild-type spike protein and assumed that they got fixed with probability one independently of their fitness value; the number of random mutations was chosen based on the average number of single variants per strain in the GISAID database, which is between three and four. We then computed the fitness Φ for all these random variant strains. For comparison, we plotted the Φ distribution of the real variant strains observed in the GISAID database. The fitness distributions of the two simulated sets and of the real viral strains are completely different, as shown in Fig. 7. When three or five random mutations are inserted in the spike protein, the Φ distributions have a median value of 0.32 and 0.12, respectively; moreover, 79% and 86% of the mutated strains have a lower overall fitness Φ than the wild-type virus. In contrast, the distribution of real strains has a median of 4.8 and basically all the strains (99.5%) have a predicted fitness higher than the wild type. This analysis further supports the unbiased nature and validity of our computational approach.
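The logic of this random-mutation control can be sketched as follows. The sketch is not the SpikePro implementation: the per-variant fitness table is filled with synthetic placeholder values, the helper names are hypothetical, and the fitness of a strain is assumed here to be the product of the fitness values of its single variants.

```python
import random
from statistics import median


def random_strain_fitness(table, n_mutations, rng):
    """Fitness of a strain carrying n random single variants.

    `table` maps position -> {mutant residue: single-variant fitness}.
    Assumption of this sketch: single-variant fitness values combine
    multiplicatively, and each position is mutated at most once.
    """
    positions = rng.sample(sorted(table), n_mutations)
    fitness = 1.0  # the wild type has fitness one by definition
    for pos in positions:
        residue = rng.choice(sorted(table[pos]))
        fitness *= table[pos][residue]
    return fitness


def summarize(values, wild_type=1.0):
    """Return the median fitness and the fraction of strains below wild type."""
    below = sum(1 for v in values if v < wild_type)
    return median(values), below / len(values)


# Synthetic placeholder table: these are NOT SpikePro predictions.
rng = random.Random(42)
toy_table = {
    pos: {aa: rng.uniform(0.2, 1.3) for aa in "ADKLS"} for pos in range(1, 201)
}

for n in (3, 5):
    sample = [random_strain_fitness(toy_table, n, rng) for _ in range(10_000)]
    med, frac = summarize(sample)
    print(f"{n} random mutations: median {med:.2f}, {frac:.0%} below wild type")
```

Applying the same `summarize` step to the fitness values of observed strains would give the two numbers used in the comparison above, namely the median and the fraction of strains less fit than the wild type.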
4. Conclusion
Here we set up and validated SpikePro, a simple computational model that predicts the impact of spike protein variants on the SARS-CoV-2 fitness and, more specifically, on viral transmissibility, infectivity and the ability to escape from the host's immune system. Moreover, the program is easy to use and can be freely downloaded from github.com/3BioCompBio/SpikeProSARS-CoV-2. SpikePro allows identifying, with good accuracy and in a few seconds, new SARS-CoV-2 variants with high fitness which need to be closely monitored by health agencies. It thus has a central role to play in the genomic surveillance programs of the new SARS-CoV-2 strains, especially in the near future with the growing number of people vaccinated and thus the larger selective pressure on the virus [56].
We thoroughly analyzed and validated SpikePro on a wide series of experimental, epidemiological and clinical data available. Despite the simplicity of the model, the approximations made, and the absence of parameters that were fitted to optimize the accuracy of the predictions, the SpikePro pipeline reproduces well the collected data. Whether the validation is performed on large-scale mutagenesis data, nAb cocktails or polyclonal human sera, whether the comparison involves the fitness of the spike protein, of the spike protein/ACE2 complex, or of a series of spike protein/nAb complexes, the results are very good with correlation coefficients in the 0.3 to 0.5 range.
In addition, SpikePro predicts a high overall fitness value for the frequently occurring variants such as the UK, Brazilian or South-African variants and correctly identifies the main fitness contributions. It also reproduces quite well the overall fitness evolution of the SARS-CoV-2 virus over the past pandemic year.
It has to be emphasized that the SpikePro model, besides being able to reproduce known results, has a true prediction potential in describing and interpreting the effect of new spike protein variants that could be fixed in the near future and the future SARS-CoV-2 evolution, owing to the physical description of the fitness in terms of free energy contributions, which are estimated using the well-known structure-based PoPMuSiC and BeAtMuSiC predictors [33,34].
Despite the progress we made towards a better understanding of the molecular mechanisms underlying the SARS-CoV-2 fitness, we made some approximations in the construction of our model which we will try to relax in future studies. For example, we did not take into account possible amino acid deletions or insertions in the spike protein, although they certainly influence the viral fitness. It would also be interesting to take into account epistatic effects. Indeed, as more and more variants get fixed, interactions between them are expected to become non-negligible. Furthermore, the model should be extended to other proteins of the SARS-CoV-2 virus, such as the non-structural protein 1 (Nsp1), which also contributes to immune evasion [57], rather than considering the spike protein only. Finally, when more nAb/spike protein complexes are resolved at high resolution, they will enrich our set 𝒟nAb and better describe the B-cell receptor repertoire. Considering a weighted combination of the effects of RBD variants on all nAbs depending on different factors such as time and vaccination status would further improve our method in mimicking the immune response and its temporal evolution.
Author Contributions
Conceptualization, F.P. and M.R.; formal analysis and investigation F.P. and M.R.; methodology and validation, F.P.; writing–original draft preparation, F.P. and M.R.; writing–review and editing F.P. and M.R. All authors have read and agreed to the published version of the manuscript.
Funding
This work is funded by the F.R.S.-FNRS Fund for Scientific Research through a COVID—Exceptional Research Project.
Conflict of Interest
The authors declare that they have no conflict of interest.
Data availability
The SpikePro algorithm is freely available on GitHub (https://github.com/3BioCompBio/SpikeProSARS-CoV-2).
Acknowledgements
FP and MR are Postdoctoral Researcher and Research Director, respectively, at the F.R.S.-FNRS Fund for Scientific Research.