ABSTRACT
The ongoing COVID-19 pandemic caused by SARS-CoV-2 has accounted for millions of infections and deaths across the globe. Genome sequences of SARS-CoV-2 are being published daily in public databases, and the availability of these genome datasets has allowed unprecedented access into the mutational patterns of SARS-CoV-2 evolution. We made use of this genomic information to conduct phylogenetic analysis and identify lineage-specific mutations. The catalogued lineage-defining mutations were analysed for their stabilizing or destabilizing impact on viral proteins. We recorded persistence of the D614G, S477N, A222V and V1176F variants and a global expansion of the PANGOLIN variant B.1. In addition, retention of the Q57H (B.1.X), R203K/G204R (B.1.1.X), T85I (B.1.2-B.1.3), G15S+T428I (C.X) and I120F (D.X) variations was observed. Overall, we recorded a striking balance between stabilizing and destabilizing mutations, consistent with well-maintained protein structures. With selection pressures in the form of newly developed vaccines and therapeutics set to mount in the coming months, mapping viral mutations and recording their impact on key viral proteins will be crucial to pre-emptively catch any escape mechanism that SARS-CoV-2 may evolve.
STUDY IMPORTANCE As large numbers of SARS-CoV-2 genome sequences are shared in publicly accessible repositories, scientists can carry out a detailed evolutionary analysis of the virus since its initial isolation in Wuhan, China. We investigated the evolutionarily associated mutational diversity overlaid on the major phylogenetic lineages circulating globally, using 513 representative genomes. We detailed the phylogenetic persistence of key variants facilitating global expansion of the PANGOLIN variant B.1, including the recent, fast-expanding B.1.1.7 lineage. The stabilizing or destabilizing impact of the catalogued lineage-defining mutations on viral proteins indicates their possible involvement in balancing protein function and structure. A clear understanding of this mutational profile is of high clinical significance for catching any vaccine escape mechanism, as the same proteins are crucial components of vaccines recently approved and in development. In this direction, our study provides a framework and baseline data upon which further analysis can be built as newer variants of SARS-CoV-2 continue to appear.
INTRODUCTION
The emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in Wuhan, China and the subsequent global spread has brought the world to a standstill (1). Over the course of 11 months, the coronavirus disease 2019 (COVID-19) pandemic has caused more than 81 million confirmed cases in 220 countries with close to 1,770,000 fatalities (2). Initially, and rightly too, the efforts were focused on minimising the number of cases and deaths due to COVID-19 (3). This included fast-tracking the search for and development of novel treatment and prevention options (4). Today, however, as vaccine candidates have started showing promising results, there is a cautious shift towards assessing the efficacy of vaccine candidates with respect to the circulating diversity of SARS-CoV-2 and its continuously evolving genetic variants (5).
Functional mutations that help the virus to adapt to the recent host-shift events are hypothesised to drive the evolution of transmissibility and virulence in SARS-CoV-2 (6). Shortly after the first isolated SARS-CoV-2 genome from China was published, >30,500 distinct mutations were catalogued in the CoV-GLUE database (http://cov-glue.cvr.gla.ac.uk/) among globally circulating strains of this virus (7). Variations in the genetic makeup are key determinants in measuring the evolutionary distance and stability of SARS-CoV-2 relative to the first sequenced isolate (8). Moreover, tracking the evolution of SARS-CoV-2 since its introduction in humans is a high-priority undertaking to prevent future waves of this pandemic from escaping global preparedness (9). Since many vaccine candidates currently under development are derived from the first available SARS-CoV-2 sequences, recurrent genetic changes may have an unforeseen impact on their sustained effectiveness in the longer term (10).
The availability of whole genome sequences of SARS-CoV-2 in public repositories such as the Global Initiative on Sharing All Influenza Data (GISAID) and the real-time data visualisation pipeline NextStrain (https://nextstrain.org) offers a great opportunity for scientists to track the evolutionary path of this virus (11, 12). The Phylogenetic Assignment of Named Global Outbreak LINeages tool (PANGOLIN) has been the most widely used tool for assigning lineages to newly emerging variants. PANGOLIN (https://cov-lineages.org/pangolin.html) has also been deployed in establishing the transmission patterns of various clones of this virus (13). Since coronaviruses frequently recombine, tracking the evolution and assigning lineages has been challenging (13, 14). As a result, multiple studies that tracked the evolution of SARS-CoV-2 have been hugely controversial. For example, doubts have been cast on the claim that a more aggressive L type emerged from S type strains (14). Similarly, the hypothesis that the rapid spread of the D614G variant of SARS-CoV-2 indicates a possible fitness advantage has been questioned (15-17). Therefore, in the current and highly sensitive global circumstances due to this pandemic, a detailed map of mutations highlighting their prospective role in therapeutics and vaccine development can prepare us better for future waves of continuously evolving SARS-CoV-2. In this study, we present a catalogue of the most important genomic mutations recorded in SARS-CoV-2 between December 2019 and November 2020 and their possible impact on the stability of protein candidates that form a crucial part of vaccines and also constitute the most common therapeutic targets.
MATERIAL AND METHODS
Data acquisition and curation
In total, we retrieved 7000 genomes from the GISAID EpiCoV database (https://www.gisaid.org/). Datasets flagged as complete (>28,000 bp) were screened and then manually curated to exclude low quality/coverage sequences and duplicates. Sequence metadata was retrieved, and only genomes with recorded sampling time and location were chosen for the study. Lineages were assigned from the alignment file using the Phylogenetic Assignment of Named Global Outbreak LINeages tool PANGOLIN v1.07 (https://github.com/hCoV-2019/pangolin). We selected a subset of 513 genomes (Supplementary Table S1) representing all major PANGOLIN lineages and common mutations for optimal output of the phylogenetic tree.
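The screening criteria above can be pictured as a simple filter. The sketch below is illustrative only: the curation in this study was partly manual, and the `GenomeRecord` type, the `curate` function and the ambiguous-base cutoff `MAX_N_FRACTION` are hypothetical names and values introduced here, not taken from the study.

```python
from dataclasses import dataclass
from typing import Optional

MIN_COMPLETE_LENGTH = 28_000  # "complete" threshold stated in the text (bp)
MAX_N_FRACTION = 0.05         # hypothetical low-quality cutoff, not from the study


@dataclass
class GenomeRecord:
    accession: str
    sequence: str
    collection_date: Optional[str]
    location: Optional[str]


def curate(records):
    """Keep complete, unique genomes that carry sampling date and location."""
    seen_sequences = set()
    kept = []
    for rec in records:
        seq = rec.sequence.upper()
        if len(seq) <= MIN_COMPLETE_LENGTH:
            continue  # not flagged as complete
        if seq.count("N") / len(seq) > MAX_N_FRACTION:
            continue  # many ambiguous bases (assumed proxy for low coverage)
        if not rec.collection_date or not rec.location:
            continue  # sampling metadata incomplete
        if seq in seen_sequences:
            continue  # exact duplicate of a genome already kept
        seen_sequences.add(seq)
        kept.append(rec)
    return kept
```

Exact-string matching is the simplest possible duplicate check; a real workflow might instead deduplicate on accession or on sequence after masking.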
Phylogenetic analysis
Genome sequences were aligned against the original Wuhan-Hu-1 genome (Accession: NC_045512) using the multiple sequence alignment tool MAFFT (v6.240) (18). Subsequently, the error-prone 5’-UTR and 3’-UTR regions were masked and the genome size was adjusted without losing key sites. A maximum likelihood (ML) tree was generated using IQTREE v.1.6.1 (http://www.iqtree.org/) under the GTR nucleotide substitution model with 1000 bootstrap replicates (19). The ML tree was visualised and labelled using the Interactive Tree of Life software iTOL v.3 (20).
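As a minimal sketch of the UTR masking step, the function below replaces the untranslated ends of a reference-aligned sequence with a mask character. It assumes the alignment preserves NC_045512 coordinates (no inserted columns) and uses the annotated UTR boundaries of that reference (5'-UTR 1-265, 3'-UTR 29675-29903); the exact columns masked in this study are not stated, so these boundaries and the name `mask_utrs` are assumptions for illustration.

```python
FIVE_PRIME_UTR_END = 265        # last base of the 5'-UTR in NC_045512 (1-based)
THREE_PRIME_UTR_START = 29675   # first base of the 3'-UTR in NC_045512 (1-based)


def mask_utrs(aligned_seq, mask_char="N"):
    """Mask UTR columns of a sequence aligned in reference coordinates."""
    n = len(aligned_seq)
    left_len = min(FIVE_PRIME_UTR_END, n)
    # Index of the first masked 3' column, clamped so slices never overlap.
    right_start = max(min(THREE_PRIME_UTR_START - 1, n), left_len)
    return (mask_char * left_len
            + aligned_seq[left_len:right_start]
            + mask_char * (n - right_start))
```

Masking rather than trimming keeps every downstream position in reference coordinates, which simplifies comparison with published mutation positions.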
Mutation Profiling
To identify the genetic variants, assembled genomes were mapped against the reference (Wuhan-Hu-1; Accession: NC_045512) using the Snippy mapping and variant calling pipeline (https://github.com/tseemann/snippy) (21). Among the SNPs, missense (nonsynonymous) SNPs were extracted using custom-written bash scripts and manually curated as per the CoV-GLUE database (http://cov-glue.cvr.gla.ac.uk/). Specifically, we considered eleven lineage-defining mutations and 59 major missense mutations in four major structural proteins: Envelope protein (E), Membrane glycoprotein (M), Nucleocapsid phosphoprotein (N) and Spike protein (S). These 70 amino acid substitutions in SARS-CoV-2 mutants were subjected to structural analysis to examine their potential impact on protein stability.
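The missense extraction was done with custom bash scripts that are not reproduced here. The Python sketch below is only an illustrative equivalent: it assumes a tab-separated variant table with an `EFFECT` column whose annotation begins with the consequence term (for example `missense_variant`), and the function name `missense_rows` and the toy input rows are hypothetical.

```python
import csv
import io


def missense_rows(tab_text, effect_column="EFFECT"):
    """Return rows whose effect annotation begins with 'missense_variant'."""
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    return [row for row in reader
            if (row.get(effect_column) or "").startswith("missense_variant")]


# Toy input for illustration only; not output from the study.
EXAMPLE = (
    "POS\tREF\tALT\tEFFECT\tGENE\n"
    "3037\tC\tT\tsynonymous_variant c.2772C>T p.Phe924Phe\tORF1ab\n"
    "23403\tA\tG\tmissense_variant c.1841A>G p.Asp614Gly\tS\n"
)

for row in missense_rows(EXAMPLE):
    # Print position, gene and the protein-level change of each missense SNP.
    print(row["POS"], row["GENE"], row["EFFECT"].split()[-1])
```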
Structural analysis
The structural impact of mutations was assessed using the COVID-3D server (http://biosig.unimelb.edu.au/covid3d), which integrates analytics on mutation-based structural changes in a protein. Vibrational entropy/VE (ΔΔS) and unfolding Gibbs free energy/FE (ΔΔG) were considered as markers to ascertain the stability of the variants. Gibbs free energy/FE (ΔΔG) values from the Site Directed Mutator (SDM), DUET and DynaMut tools available in the COVID-3D server were considered (22, 23). The change in vibrational entropy energy (ΔΔSVib ENCoM) between wild-type and mutant protein was calculated using DynaMut (24). VE explains the occupation probabilities of protein residues in an energy landscape based on average configurational entropies. A considerable decrease in VE increases the rigidity of the protein (25). FE, on the other hand, describes the free energy alterations while unfolding a kinetically stable protein (24). Positive and negative values of ΔΔG indicate stabilizing and destabilizing mutations, respectively. DynaMine (http://dynamine.ibsquare.be/) was employed to validate the stability profiles through residue level (sequence-based) dynamics. Backbone N-H S2 order parameter values (atomic bond vector’s movement restrictions) were generated according to the molecular reference frame. These N-H S2 order parameter values are evaluated from experimentally determined NMR chemical shifts. A value above 0.8 is considered highly stable, values between 0.6 and 0.8 can be considered functionally contextual, and values <0.6 are highly flexible (26).
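The two decision rules in this section can be written down compactly. The sketch below assumes the sign convention stated for ΔΔG (positive is stabilizing) and treats S2 values below 0.6 as flexible, 0.6 to 0.8 as context-dependent and above 0.8 as stable; the function names and the handling of exact boundary values are illustrative choices, not part of the cited tools.

```python
def classify_ddg(ddg):
    """Label a mutation from the sign of its predicted unfolding ddG."""
    if ddg > 0:
        return "stabilizing"
    if ddg < 0:
        return "destabilizing"
    return "neutral"


def classify_s2(s2):
    """Map a backbone N-H S2 order parameter onto three flexibility bands."""
    if s2 > 0.8:
        return "highly stable"
    if s2 >= 0.6:
        return "functionally contextual"
    return "highly flexible"
```

For instance, `classify_ddg(0.228)` returns `"stabilizing"` and `classify_s2(0.7)` returns `"functionally contextual"`.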
RESULTS
Diversity of SARS-CoV-2 Genomes
Of the 7000 SARS-CoV-2 genomes screened, we constructed a robust phylogenetic tree from 513 strategically selected genomes that reflected the most complete diversity among the isolates by covering all the PANGOLIN lineages. Lineage assignment with the PANGOLIN tool indicated the circulation of seven distinct lineages and/or sub-lineages such as A, B.1, B.1.1, B.1.1.1, B.2, B.3, B.4 and B.6. This is in line with the phylogenetic groupings by GISAID (S, L, V, O, G, GH and GR) (Figure 1). As the epidemic has progressed and mutations have accumulated, further subdivision of major lineages into sub-lineages has been observed. Overall, a total of 61 lineages and sub-lineages have been found to be circulating concurrently in multiple countries around the world. In general, numerous introductions of different variants were observed across the globe, with a few sub-lineages (C.2, D.2) being restricted to certain regions. While the B.1.113 lineage, for example, has been exclusively reported from India, lineages C.2 and D.2 have been geographically confined to South Africa and Australia, respectively.
Maximum likelihood phylogenetic tree inferred from 513 SARS-CoV-2 genomes. The tree was constructed from a multiple genome sequence alignment (MAFFT) against the Wuhan-Hu-1 strain (Accession: NC_045512). Tips are coloured by the major lineages assigned by PANGOLIN. The respective lineages assigned by GISAID and the origin of each sequence are shown as colour strips. The scale bar indicates the distance in substitutions per site.
Major amino acid substitutions
Mutation mapping showed a total of 106 amino acid substitutions (missense mutations in >5 genomes) from a representative set of 513 genomes. The analysis also revealed 36 mutations that were found in >5% of genome sequences, while 12 major substitutions were lineage-defining mutations (Figure 1). The first major mutation to appear was L84S in ORF8 (present in 8.6% of the genomes), which has defined the A lineage (i.e., clade S in GISAID classification). The subsequent amino acid substitutions L37F in ORF3a and G251V in nsp6 were found to be present in 13.3% and 1.4% of genomes, respectively. Although the combination of G251V and L37F was initially considered a defining mutation pattern for the B.2 – B.6 lineages (clade V in GISAID classification), more detailed analysis has shown that isolates carrying the G251V mutation are distributed in other lineages too. The predominant lineage-defining mutations in the whole dataset were D614G (85.5%) and P323L (85.5%), which originally appeared in late January 2020 (Figure 2). Other major mutations noted are Q57H (26.5%), R203K/G204R (33%), G15S (12%), I120F (11.5%) and T85I (14%).
Schematic representation of the major evolutionary events/amino acid substitutions that gave rise to SARS-CoV-2 variants, in sequential order
Dominance of D614G variant
Two mutations have become consensus: D614G in S (nucleotide 23,403, A to G) and P323L (also known as P4715L) in nsp12 (nucleotide 14,408, C to T). These mutations were present in 80.5% of the sequences and have defined the B.1 lineage (G in GISAID classification). The widely discussed D614G variant is speculated to have been introduced in Europe at the end of January (EPI_ISL_422424) before becoming globally dominant. Genomes with D614G mutations were assigned as B.1 by PANGOLIN or GH/GR by GISAID. Notably, the founder lineage B.1 and its sub-lineages B.1.X, B.1.1.X, D.X and C.X, which carry both D614G and P323L mutations, have become the dominant variants across the world (87% of the global collection as per CoV-GLUE as of 30th November 2020).
As the pandemic has progressed, several other major substitutions affecting the protein structure have appeared. These are Q57H (nucleotide 25,563, G to T) in ORF3a, the R203K + G204R combination (starting at nucleotide 28,881, GGG to AAC) in Nucleocapsid and T85I (nucleotide 1059, C to T) in ORF1a. The region-specific sub-lineages C.1, C.2, D.1 and D.2 were found to cumulatively harbour multiple mutations. The amino acid substitutions T428I and G15S in ORF1a were reported in sub-lineages C.1 and C.2, while the S477N substitution in the S protein, along with I120F in nsp2, specifically established the sub-lineage D.2 (Figure 1).
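The link between a nucleotide position and the amino acid label can be checked with a short calculation. The sketch below maps a genome position to its codon number within a gene and translates the reference and mutated codons; it assumes the spike coding sequence starts at position 21563 of NC_045512 and that the reference codon for residue 614 is GAT, both supplied by hand here rather than read from the genome. The function name `describe_substitution` is hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, built in TCAG order for each codon position.
CODON_TABLE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

SPIKE_CDS_START = 21563  # assumed first base of the S gene (1-based)


def describe_substitution(genome_pos, cds_start, ref_codon, alt_base):
    """Return an amino acid label such as 'D614G' for a single-base change."""
    offset = genome_pos - cds_start
    codon_number = offset // 3 + 1      # 1-based residue number in the protein
    pos_in_codon = offset % 3           # 0, 1 or 2
    alt_codon = (ref_codon[:pos_in_codon] + alt_base
                 + ref_codon[pos_in_codon + 1:])
    return f"{CODON_TABLE[ref_codon]}{codon_number}{CODON_TABLE[alt_codon]}"


print(describe_substitution(23403, SPIKE_CDS_START, "GAT", "G"))  # prints D614G
```

Position 23,403 is the second base of codon 614, so the A to G change turns GAT (aspartate) into GGT (glycine).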
Structural analysis of SARS-CoV-2 mutants
The possible structural consequences of the eleven lineage-defining missense mutations identified in this study were investigated. Among these mutations, three were considered stabilizing for the respective protein structure while six were destabilizing (Table 1). The significance of these mutations in evolutionary selection cannot be predicted solely from ΔΔG, the change in free energy. Hence, for a more precise interpretation, the correlation of ΔΔG, ΔΔS and N-H S2 order parameter values (Supplementary Table S2) of the proteins has been taken into account based on fine local alterations in structures. All lineage-defining mutations except two have reduced the vibrational entropies of the proteins, thereby decreasing the flexibility of the structures (Table 1).
Lineage-defining SNPs and their impact on protein structures
Additionally, the impact of mutations in key structural proteins that could potentially allow a pathogen to escape available treatment and prevention regimes was investigated. Among the 59 major missense mutations, our analysis using both the SDM and DUET servers predicted 16 missense mutations as stabilizing and 23 missense mutations as destabilizing the protein structure. Twenty major mutations were predicted to be neither stabilizing nor destabilizing, as the ΔΔG values provided by the SDM and DUET servers were contradictory (Table 2).
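The two-predictor rule described above can be sketched as follows: a mutation is called only when both ΔΔG predictions share a sign, and is otherwise treated as inconclusive. The function name, the mutation labels and the numerical values below are hypothetical and serve only to show the logic; they are not values from Table 2.

```python
from collections import Counter


def consensus_call(ddg_sdm, ddg_duet):
    """Combine two ddG predictions into a single stability call."""
    if ddg_sdm > 0 and ddg_duet > 0:
        return "stabilizing"
    if ddg_sdm < 0 and ddg_duet < 0:
        return "destabilizing"
    return "inconclusive"  # predictors disagree (or a value is exactly zero)


# Hypothetical (SDM, DUET) ddG pairs, for illustration only.
examples = {
    "mut_A": (0.4, 0.2),
    "mut_B": (-1.1, -0.3),
    "mut_C": (0.5, -0.6),
}

calls = {name: consensus_call(*pair) for name, pair in examples.items()}
tally = Counter(calls.values())
print(calls)
print(tally)
```

Requiring agreement is a conservative choice: it trades coverage for fewer calls that depend on a single predictor.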
Balance of stabilizing and destabilizing mutations
Overall, across both datasets, 70 amino acid substitutions in SARS-CoV-2 were tested for stability, of which 19 were stabilizing, 29 were destabilizing and 22 showed inconclusive results. Computational prediction of the effect of amino acid substitutions in SARS-CoV-2 thus revealed a balance of stabilization and destabilization of the proteins.
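The combined totals follow from adding the lineage-defining set (eleven mutations) to the structural-protein set (59 mutations). The short check below reproduces that arithmetic; note that the count of inconclusive lineage-defining mutations is inferred here by subtraction and is not stated directly in the text.

```python
LINEAGE_DEFINING_TOTAL = 11
STRUCTURAL_TOTAL = 59

lineage = {"stabilizing": 3, "destabilizing": 6}
# Inferred by subtraction (11 - 3 - 6); an assumption, not a reported value.
lineage["inconclusive"] = LINEAGE_DEFINING_TOTAL - sum(lineage.values())

structural = {"stabilizing": 16, "destabilizing": 23, "inconclusive": 20}

combined = {label: lineage[label] + structural[label] for label in structural}
print(combined)                # {'stabilizing': 19, 'destabilizing': 29, 'inconclusive': 22}
print(sum(combined.values()))  # 70
```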
Among the amino acid substitutions examined, the stabilizing mutation in the S protein was predicted to increase the rigidity of its structure (Figure 3; Supplementary Figure S1). The increased rigidity may provide a stable conformation to the protein that may positively influence the binding of the spike protein to the ACE2 receptor (27). The major mutations D614G and S477N were located in potential epitope regions (codons 469–882), with S477N positioned in the receptor-binding domain (RBD) of the S protein (residues 319–541).
Schematic representation of SARS-CoV-2 genome organization, the major amino acid substitutions and stability of amino acid changes. Stabilizing mutations are coloured green, Destabilizing mutations are coloured red and mutations that neither stabilize nor destabilize are coloured in yellow.
The most frequent amino acid substitutions were observed in the N protein, in which the variants S194L, D103Y, P13L, S197L, M234I and S188L were predicted to be stabilizing according to both analytical servers (Table 2). In contrast, the M and E proteins accounted for the fewest amino acid substitutions. The amino acid change in M (T175M) indicated a stabilizing effect, while E did not carry any stabilizing variant. Structural analysis of double (D614G + S477N; D614G + A222V) and triple (D614G + S477N + A222V) mutation patterns in the S protein indicated ΔΔG values of 0.228, 0.195 and 0.129, respectively (Table 3). This signifies that accumulation of spike mutations in D614G-bearing lineages could potentially affect the stability of the spike and may therefore influence its binding affinity towards the ACE2 receptor.
Impact of double and triple mutation in the spike protein
DISCUSSION
Since the beginning of the COVID-19 pandemic, whole genome sequence-based phylogenetic inference has been heavily utilized in tracing viral origins and transmission chains (28). However, as the virus has evolved with time, genomic data is being increasingly used in guiding infection risk and control strategies. Several genomic mutations have been mapped that seem to be of advantage to the virus (29). In parallel, numerous vaccine candidates have been designed using genomic data from the original SARS-CoV-2 strain of Wuhan, and many are now approved for use or in late-stage trials (30, 31). Based on immunological data obtained from infected and recovered patients, rightly, almost all COVID-19 vaccine candidates of today are based on the original SARS-CoV-2 spike protein or its RBD domain (32 – 34). However, as vaccines are introduced and successful treatment options become available, it is vital that we carefully monitor the mutations in the immunogenic region of the SARS-CoV-2 genome (35). Mapping these changes on protein structure will allow pre-emptive forecasting of the direction of change in vaccine effectiveness and guide future preparedness efforts. We analysed the impact of recurrent amino acid replacements on the genomic evolution and proteome stability of SARS-CoV-2 from its introduction in December 2019 to November 2020. Our analysis found an intriguing balance of stabilizing and destabilizing mutations, which may have allowed SARS-CoV-2 to evolve and persist without losing pathogenicity.
SARS-CoV-2 is considered a slowly evolving virus as it possesses an inherent proofreading mechanism to repair mismatches during its replication. This is believed to have a crucial role in maintaining the stability and integrity of the viral genome (36, 37). Our analysis confirmed the previously recorded positive natural selection of the D614G, S477N (38), A222V and V1176F (39) variants and a global expansion of the PANGOLIN variant B.1 (11). In addition, we also observed positive natural selection of the Q57H (B.1.X), R203K/G204R (B.1.1.X), T85I (B.1.2-B.1.3), G15S+T428I (C.X) and I120F (D.X) variants (Figure 2).
Apart from the eleven clade-defining mutations, some of the major missense mutations were in the four structural proteins (E, M, N and S). When these mutations (n=59) were analysed for their impact on the respective protein structures, the spike glycoprotein, more specifically its RBD domain, was found to be most vulnerable to frequent mutations. This may be related to the immunological observation that most neutralizing anti-SARS-CoV-2 antibodies have been found to target the RBD domain of the S protein (40, 41). Consistent with this finding, a total of 4170 missense mutations have been reported in the spike protein, with 683 in the RBD domain alone (when CoV-GLUE was accessed on 12th December 2020). Computational prediction of the effect of amino acid substitutions in the E, M, N and S proteins revealed a balance of stabilization and destabilization of the proteins. While viral populations carrying mutations with higher stabilizing effects (positive ΔΔG values) would be expected to become the dominant variants, it is interesting to note that destabilizing mutations in the major protein targets of SARS-CoV-2 have also generated variants that have been hugely successful. For example, many of the favourably selected variants such as L18F, L5F (Spike), R203K, G204R and A220V (Nucleocapsid) were found to destabilize the respective protein structure (Table 1). As destabilizing mutations are known for their crucial functional roles, a trade-off between stabilizing and destabilizing mutations may balance protein function and structure in ways that are not yet fully understood (42, 43).
In our study, the effect of mutations on the respective proteins was primarily estimated from the physical change in free energy on a single ‘native’ protein conformation. To allow the most robust correlation of mutations with molecular evolution, the mutational effects when the protein is in the unfolded state and the possibility of structural adjustment of the folded state in response to the mutation need to be explored in future studies as more structural dynamic information becomes available (44). While our study highlights the value of ΔΔG analyses as a reference frame for evolutionary evaluation, molecular evolution is likely a consequence of a complex amalgamation of changes in free energy, entropy, solvent accessibility, etc. (45). As data on these unchecked parameters become available, predictions of the evolutionary selection of mutations with respect to the phylogeny can be confirmed. By presenting preliminary data linking free energy and phylogeny, our study should help streamline the scope of future studies by providing a baseline matrix.
The currently circulating spike variants or RBD variants need to be taken into account while evaluating vaccine candidates or neutralizing monoclonal antibodies against SARS-CoV-2 (46). Mapping the viral mutations that escape antibody binding is essential for assessing the efficacy of therapeutic and prophylactic anti-SARS-CoV-2 agents (38, 47). Recently generated experimental evidence suggests that leading vaccines (mRNA-1273, BNT162b1 and ChAdOx1a) and two potent neutralizing antibodies (REGN10987 and REGN10933) are unlikely to be affected by the dominant variant D614G (32, 33, 48-50). As all three candidate vaccines encode the RBD or part of the spike protein as antigens, the viral population is expected to try to escape by altering the positioning of the respective antigens (51) once vaccination-induced selection pressure is in place. Notably, complete escape mutation maps of 3,804 of the 3,819 possible RBD amino acid mutations against ten human monoclonal antibodies are already in place (38, 51). The antigenic effect of key RBD mutations against the REGN-COV2 cocktail (REGN10933 and REGN10987) showed that the N439K and K444R variants escaped neutralization only by REGN10987, while E406W escaped both individual REGN-COV2 antibodies and the cocktail (47). Similar strategies should be adopted to map all antibody resistance mutations against neutralizing antibodies elicited after vaccination. Once mutation escape maps are available for all successful vaccine candidates, vaccine roll-out strategies should be carefully planned to counter geographically confined escape mutants.
CONCLUSION
Our study highlights the importance of continued genomic surveillance, mutation mapping, stability analysis and cataloguing of potential escape mutations in the pre- and post-vaccination period of SARS-CoV-2 for designing epidemiologically sound vaccination programs. The currently observed mutation pattern and subsequent phylogenetic diversification of SARS-CoV-2 seem to be strongly influenced by negative and positive selection pressures. The overall variation in SARS-CoV-2 sequences is currently low compared to many other RNA viruses. One possible reason for the slow rate of mutation is the widespread absence of neutralizing antibodies or comparable selective pressure. Once the virus population is challenged with vaccine candidates or therapeutic monoclonal antibodies, the currently known epitopes on the surfaces of SARS-CoV-2 proteins are likely to undergo rapid forced change for survival. Thus, the prevalence of such possible escape mutations needs to be monitored even more carefully after vaccination if we are to remain ahead of this rapidly shifting pandemic curve.
DATA AVAILABILITY
The genome sequences used in this study are available in the Global Initiative on Sharing All Influenza Data (GISAID) under the accession IDs listed in Supplementary Table S1.
SUPPLEMENTARY DATA
Supplementary Data are available online.
FUNDING
This work received no specific external funding and was carried out using the resources of the host institute.
CONFLICT OF INTEREST
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
TABLE AND FIGURE LEGENDS
Heat map showing the stabilizing and destabilizing mutations of SARS-CoV-2 proteins based on the predicted ΔΔG values. The scale of heatmap ranged from −2 (Blue) to +2 (Red). Beige colour in the heat map indicates neutral ΔΔG values.
Supplementary Table S1: List of SARS-CoV-2 genome sequences downloaded from GISAID with accession IDs and metadata
Supplementary Table S2 : Residue level backbone stability values of lineage-defining mutations of SARS-CoV-2
ACKNOWLEDGEMENT
The authors would like to thank Mr. Soumya Basu (ICMR, Senior research Fellow) for his contribution and helpful advice in the structural analysis. The authors gratefully acknowledge the Department of Clinical Microbiology, Christian Medical College and Hospital, Vellore, Tamil Nadu, India, for providing all the necessary computational facilities for this work. We are grateful to the staff of Christian Medical College for their assistance with data curation.