Abstract
SARS-CoV-2 is responsible for the pandemic of Severe Acute Respiratory Syndrome, which has caused infection and death in millions of people worldwide, with the highest number of cases and deaths in the USA. The current study focuses on understanding the population-specific variations that contribute to the high rate of infection in specific geographical regions, which may help in developing appropriate treatment strategies for the COVID-19 pandemic. Rigorous phylogenetic network analysis of 245 complete SARS-CoV-2 genomes inferred five central clades, named a (ancestral), b, c, d and e (subtypes e1 and e2), showing both divergent and linear types of evolution. Clades d and e2 were found to consist exclusively of USA strains and carried the highest number of known mutations. Clades were distinguished by ten co-mutational combinations in the proteins Nsp3, ORF8, Nsp13, S, Nsp12, Nsp2 and Nsp6, generated by Amino Acid Variations (AAV). Our analysis revealed that only 67.34% of SNP mutations were carried over to the amino acid (phenotypic) level. The T1103P mutation in Nsp3 was predicted to increase protein stability in 238 strains, the exceptions being six strains that were marked as ancestral type, whereas the co-mutations P5731L and Y5768C in Nsp13 were found together in 64 USA genomes, highlighting their 100% co-occurrence. The docking study showed that the D7611G mutation reduced binding of the Spike protein to ACE2 but improved its interaction with TMPRSS2, which may contribute to the high transmissibility of USA strains. In addition, we found that the host proteins MYO5A, MYO5B and MYO5C had the maximum number of interactions with the viral hub proteins (Nucleocapsid, Spike and Membrane). Thus, blocking the internalization pathway by inhibiting MYO-5 proteins could be an effective strategy for COVID-19 treatment. The functional annotations of the Host-Pathogen Interaction (HPI) network were found to be highly associated with hypoxia and thrombotic conditions, confirming the vulnerability of patients and the severity of infection.
We also examined the presence of CpG islands in the regions encoding the Nsp1 and N proteins, which may confer on SARS-CoV-2 the ability to enter the host cell and trigger methyltransferase activity inside it.
Introduction
In December 2019, a novel RNA virus, Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), belonging to the Coronaviridae family (betacoronavirus), emerged as the cause of an outbreak of pneumonia, named COVID-19, in the Chinese city of Wuhan (Li et al., 2020). COVID-19 was declared a pandemic by the WHO on March 11, 2020 (Astuti and Ysrafil, 2020). Major outbreaks were reported in many locations in China, the USA, Italy, Spain, Japan, and South Korea. To date, it has spread to more than 200 countries, with more than 30 thousand deaths and 6 million reported active cases worldwide (https://www.worldometers.info/coronavirus/).
SARS-CoV-2 is a single stranded RNA virus with a genome size ranging from 29.8 kb to 29.9 kb (Khailany et al., 2020). The genomic repertoire of SARS-CoV-2 comprises of 10 open reading frames (ORFs) encoding 27 proteins (Abduljalil and Abduljalil, 2020). ORF1ab encodes for 16 non-structural proteins (Nsp) whereas structural proteins include spike (S), envelope (E), membrane (M), and nucleocapsid (N) proteins (Pyrc et al., 2007; Yang and Leibowitz, 2015). In addition, the genome of SARS-CoV-2 comprises of ORF3a, ORF6, ORF7a, ORF7b, ORF8 and ORF9 genes encoding six accessory proteins, flanked by 5’ and 3’ UTRs (Khailany et al., 2020). In our previous study (Kumar et al., 2020), a higher mutational rate in the genomes from different geographical locations around the world by accumulation of Single Nucleotide Polymorphisms (SNP) was reported. Even during these early stages of the global pandemic, genomic surveillance has been used to differentiate circulating strains into distinct, geographically based lineages (Forster et al., 2020). However, the ongoing analysis of this global dataset suggests no consolidated significant links between SARS-CoV-2 genome sequence variability, virus transmissibility and disease severity.
It is known that mutations at both the genomic and protein levels form a “Hormonical Orchestra” (Yu et al., 2019) that drives evolutionary change, which calls for a detailed study of SARS-CoV-2 mutations to understand its successful invasion and infection. A previous analysis of the mutational profiles of SARS-CoV-2 isolates reported very high mutation rates, suggesting that the isolates are more virulent and cause significant harm to their hosts (Mandal et al., 2020). Thus, in the present study, we selected 245 genomic sequences of SARS-CoV-2 to decipher their phylogenetic relationships, trace them to SNPs at the nucleotide and amino acid (Amino Acid Variation, AAV) levels, and perform structural re-modelling. Our results revealed the evolutionary relationships among the strains and predicted Nsp3 as a mutational hotspot for SARS-CoV-2. We further extended the study to understand the mechanism of host immunity evasion through Host-Pathogen Interaction (HPI) analysis and examined the interactions with host proteins by docking studies. We identified sparsely distributed hubs that may interfere with and control network stability as well as other communities/modules. A large number of low-degree nodes were attracted toward each hub, which is strong evidence that these few hubs control the topological properties of the network (Nafis et al., 2015). We also analyzed the transfer of genomic SNPs to the amino acid level and the association of CpG islands with the pathogenicity of SARS-CoV-2. CpG islands have long been connected with epigenetic regulation and act as hotspots for methylation (Jones, 2012; Shiraishi et al., 2002; Hoelzer et al., 2008). Here too, the conservation of CpG islands towards the extremities of all the genomes considered in the present analysis indicates their importance in evading host immunity.
Our study showed an overall depiction of SARS-CoV-2 variations and interactions that eventually may lead to development of rational therapeutic measures and medication against COVID-19.
Material and Methods
Selection of genomes, annotations and phylogeny construction
Publicly available genomes of SARS-CoV-2 were obtained from the NCBI database (https://www.ncbi.nlm.nih.gov/genbank/sars-cov-2-seqs/). As of March 31, 2020, only 375 SARS-CoV-2 genomes were available in the database. The data were screened for ambiguous bases using the N-analysis program, and on this basis 245 complete and clean genomes of SARS-CoV-2 were selected for further analysis (Supplementary Info 1). A manually annotated reference database was generated using the GenBank file of severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/SH01/human/2020/CHN (Accession number: MT121215.1), and open reading frames (ORFs) were predicted against the formatted database using prokka (-gcode 1) (Seemann, 2014). The genomic sequences included in the analysis belong to different countries, namely USA (168), China (53), Pakistan (2), Australia (1), Brazil (1), Finland (1), India (2), Israel (2), Japan (5), Vietnam (2), Nepal (1), Peru (1), South Korea (1), Spain (1) and Sweden (1). Whole-genome nucleotide and protein sequences were aligned using mafft (Katoh et al., 2013) with 1000 iterations. The resulting alignments were processed for phylogeny construction using BioEdit software (Hall et al., 2011). The nucleotide-based phylogeny was annotated and visualized on the iTOL server (Letunic and Bork, 2006), while the amino acid-based phylogeny was visualized and annotated using GrapeTree (Zhou et al., 2018).
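The screening step for ambiguous bases can be illustrated with a minimal Python sketch. This is not the N-analysis program used in the study; the function names, the zero-tolerance cutoff and the toy records are assumptions made purely for illustration.

```python
def ambiguous_fraction(seq: str) -> float:
    """Return the fraction of characters in seq that are not A, C, G or T."""
    seq = seq.upper()
    if not seq:
        return 1.0  # treat an empty record as fully ambiguous
    unambiguous = sum(seq.count(base) for base in "ACGT")
    return 1.0 - unambiguous / len(seq)


def select_clean_genomes(genomes: dict, max_ambiguous: float = 0.0) -> dict:
    """Keep only genomes whose ambiguous-base fraction is within the cutoff.

    The default cutoff of 0.0 (no ambiguous bases at all) is an assumption,
    not a threshold stated in the text.
    """
    return {
        name: seq
        for name, seq in genomes.items()
        if ambiguous_fraction(seq) <= max_ambiguous
    }


# Toy records with hypothetical identifiers (not real accessions).
genomes = {
    "toy_genome_1": "ATGCGTACGTTAGC",
    "toy_genome_2": "ATGCNNNNGTTAGC",  # contains four ambiguous bases
    "toy_genome_3": "atgcgtacgttagc",  # lower case is accepted
}
clean = select_clean_genomes(genomes)
print(sorted(clean))
```

In practice the sequences would be read from the downloaded FASTA files, and a small non-zero cutoff could be chosen if a few ambiguous positions are tolerable.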
Genotyping based on SNP/AAV
To detect nucleotide and amino acid variations (AAV) among the 245 genomes of SARS-CoV-2, nucleotide and amino acid sequence alignments, respectively, were performed against the reference genome. Changes in nucleotides and amino acids were calculated as point variations and recorded. The interpolation and visualization were carried out using computer programs written in Python. Co-mutations were predicted and clustering was performed using MicroReact (Argimon et al., 2016).
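The recording of point variations against a reference can be sketched as follows. This is a minimal illustration that assumes the sequences are already aligned to equal length; the function name, the decision to skip gaps and ambiguous bases, and the toy sequences are assumptions and do not reproduce the programs used in the study.

```python
from collections import Counter


def point_variations(reference: str, query: str) -> list:
    """List (position, ref_base, query_base) for two aligned sequences.

    Positions are 1-based alignment columns. Columns where either sequence
    has a gap or an ambiguous base are skipped (an assumption of this sketch).
    """
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    variations = []
    pairs = zip(reference.upper(), query.upper())
    for position, (ref_base, query_base) in enumerate(pairs, start=1):
        if ref_base == query_base:
            continue
        if ref_base in "ACGT" and query_base in "ACGT":
            variations.append((position, ref_base, query_base))
    return variations


# Toy aligned sequences with hypothetical names.
reference = "ATGCTTACGA"
isolates = {
    "toy_a": "ATGCTTACGA",  # identical to the reference
    "toy_b": "ATGTTTACGA",  # C>T at column 4
    "toy_c": "ATGTTTACGG",  # C>T at column 4 and A>G at column 10
    "toy_d": "ATG-TTACGN",  # gap and N are ignored
}

# Frequency of each point variation across all isolates.
frequency = Counter()
for name, seq in isolates.items():
    for variation in point_variations(reference, seq):
        frequency[variation] += 1

for (position, ref_base, alt_base), count in sorted(frequency.items()):
    print(f"{position}\t{ref_base}>{alt_base}\t{count}")
```

The same comparison applies unchanged to aligned protein sequences if the allowed alphabet is extended to the twenty amino acid letters.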
Data and Computer programs
The genomic analyses were performed using programs written in Python with the Biopython library (Cock et al., 2009). The computer programs and the updated SNP profiles of SARS-CoV-2 isolates are available upon request.
Construction of the Host-Pathogen Interaction Network of SARS-CoV-2
To identify host-pathogen interactions, we queried SARS-CoV-2 proteins against host-pathogen interaction databases, namely Viruses.STRING v10.5 (Cook et al., 2018) and HPIDB 3.0 (Ammari et al., 2016), to predict their direct interactions with human as the principal host. The HPI network was constructed and visualized using Cytoscape v3.7.2 (Shannon et al., 2003). In the constructed network, proteins with the highest degree, which interact with several other signaling proteins, are taken to have a key regulatory role as hubs. Using NetworkAnalyzer (Assenov et al., 2008), a plugin of Cytoscape v3.7.2, we identified the hub proteins and subjected them to functional analysis. The network was functionally annotated using the stringApp and its enrichment function (Doncheva et al., 2019), a Cytoscape plugin, with the Reactome, GO, InterPro, KEGG and Pfam databases. This analysis allows a more precise understanding of the biological functions involved and provides valuable clues for biologists.
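The degree-based identification of hubs can be illustrated with a short Python sketch. It does not use Cytoscape or NetworkAnalyzer; the edge list below is a toy example with placeholder host names, and the choice to report the top three nodes is an assumption.

```python
from collections import defaultdict


def degree_table(edges) -> dict:
    """Return the degree (number of distinct interaction partners) per node."""
    neighbours = defaultdict(set)
    for source, target in edges:
        if source == target:
            continue  # ignore self-loops
        neighbours[source].add(target)
        neighbours[target].add(source)
    return {node: len(partners) for node, partners in neighbours.items()}


# Toy undirected edge list; these pairings are illustrative only and are not
# the interactions reported in the study.
edges = [
    ("N", "host_1"),
    ("N", "host_2"),
    ("N", "host_3"),
    ("S", "host_1"),
    ("S", "host_4"),
    ("M", "host_1"),
]

degrees = degree_table(edges)

# Rank nodes by decreasing degree, breaking ties alphabetically.
ranked = sorted(degrees, key=lambda node: (-degrees[node], node))
hubs = ranked[:3]
print(hubs, [degrees[node] for node in hubs])
```

A host node that is connected to several viral proteins, like `host_1` here, ranks alongside the viral hubs, which is the pattern used later to single out host proteins of interest.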
Computational structural analysis on wild-type and mutant SARS-CoV-2 proteins
SARS-CoV-2 protein sequences were retrieved from the NCBI genome database, and pairwise sequence alignment of wild-type and mutant proteins was carried out with the Clustal Omega tool (Sievers et al., 2011). The wild-type and mutant homology models of the S protein, Nsp12 and Nsp13 were constructed using SWISS-MODEL (Waterhouse et al., 2018), whereas the 3D structures of ORF8, ORF3a, Nsp2, Nsp3 and Nsp6 were predicted using the Phyre2 server (Kelley et al., 2015). The 3D structures of the host proteins TMPRSS2, RPS6, ATP6V1G1 and MYO5C were generated using SWISS-MODEL, and the ACE2 structure was retrieved from the PDB database (PDB ID: 6M17). These structures were energy minimized with the Chiron energy minimization server (Ramachandran et al., 2011). The effect of each mutation was analyzed using HOPE (Venselaar et al., 2010) and I-Mutant (Capriotti et al., 2006). The I-Mutant method predicts the change in protein stability caused by a mutation. The docking studies of wild-type and mutant SARS-CoV-2 proteins with host proteins were carried out using the PatchDock server (Schneidman-Duhovny et al., 2005). Structural visualization and analysis were carried out using PyMOL 2.3.5 (Jacobson et al., 2002).
Analysis of CpG regions
SARS-CoV-2 genomes were analysed for the presence of CpG regions that can be targeted for methylation-induced gene silencing. To locate the CpG regions, the MethPrimer 2.0 (http://www.urogene.org/methprimer2/) and CpG Plot (http://www.ebi.ac.uk/Tools/emboss/cpgplot/) programs were used, although the two programs gave somewhat different results. Both programs were run with the following parameters: a sequence window longer than 100 bp, a GC content of ≥50%, and an observed/expected CpG dinucleotide ratio of ≥0.60. The presence of common CpG islands was confirmed by performing BLAST against the above reference strain.
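The three criteria stated above can be expressed as a small Python check on a single sequence window. This is a sketch of the criteria only, not a re-implementation of MethPrimer or CpG Plot, which scan sliding windows and merge hits; the function names are assumptions, and the observed/expected ratio is computed with the commonly used formula (number of CpG × window length) / (number of C × number of G).

```python
def cpg_stats(window: str) -> tuple:
    """Return (GC content, observed/expected CpG ratio) for one window."""
    window = window.upper()
    length = len(window)
    if length == 0:
        return 0.0, 0.0
    c_count = window.count("C")
    g_count = window.count("G")
    cpg_count = window.count("CG")  # CpG dinucleotides cannot overlap
    gc_content = (c_count + g_count) / length
    expected = c_count * g_count / length
    obs_exp = cpg_count / expected if expected else 0.0
    return gc_content, obs_exp


def is_cpg_island(window: str, min_length: int = 100,
                  min_gc: float = 0.50, min_obs_exp: float = 0.60) -> bool:
    """Apply the thresholds given in the text to a single window."""
    gc_content, obs_exp = cpg_stats(window)
    return (
        len(window) > min_length  # "longer than 100 bp"
        and gc_content >= min_gc
        and obs_exp >= min_obs_exp
    )


# Synthetic windows used only to exercise the thresholds.
print(is_cpg_island("CG" * 60))  # 120 bp, CpG rich
print(is_cpg_island("AT" * 60))  # 120 bp, no G or C
print(is_cpg_island("CG" * 40))  # 80 bp, too short
```

Differences between tools can arise from how windows are stepped and merged, which this sketch does not model.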
Results and Discussion
Phylogenetic relationship between different SARS-CoV-2 strains
In our previous study, we reported a mosaic pattern of phylogenetic clustering of 95 genomes of SARS-CoV-2 isolated from different geographical locations (Kumar et al., 2020). Strains belonging to one country were found clustered with strains from distant countries but not with those from neighboring ones. Taking a cue from that study, we constructed the phylogenetic relatedness of 245 strains of SARS-CoV-2 from the USA, China, and several other countries, including Spain, Vietnam, Peru, Finland and Pakistan, to unravel the association of evolutionary patterns among SARS-CoV-2 strains with their geographical locations and to examine their mosaic phylogenetic arrangement. The majority of strains from the USA clustered together, but comparatively high divergence was found in strains isolated from China and Japan. Japanese strains were scattered and formed clusters with strains from the USA, Pakistan, Vietnam, Taiwan, and China. Even with few genome sequences available, strains from Japan, Vietnam and Peru showed a highly scattered pattern and formed close associations with USA and Chinese strains. The patients from whom strains were reported in Taiwan (MT192759), Australia (MT007544), South Korea (MT039890), Nepal (MT072688) and Vietnam (MT192773, MT192772) had travel histories from Wuhan, China (Cheng et al., 2020). However, a strain from Pakistan (MT240479), which clustered with the Japanese strains, was isolated from a patient with a travel history from Iran. Indian strains (MT050439, MT012098), isolated from patients who had travelled from Dubai, clustered with Chinese strains. Later reports confirmed many cases of SARS-CoV-2 in Dubai originating from China (https://www.newsbytesapp.com/timeline/India/58169/271167/coronavirus-2-positive-cases-detected-in-delhi-telangana). Thus, a clear landscape of phylogenetic relationships could be obtained, reflecting mosaic clustering patterns in accordance with the travel history of patients (Figure 1A).
However, these results contradict the genomic analysis of SARS-CoV-2 by Forster et al. (2020), who predicted linear/directive evolution from ancestral node a to nodes b and c. We report here both divergent (from ancestral node a to b, c and e) and directive (node c to d) evolution among the SARS-CoV-2 strains (Figure 1B and Figure 3B). Since the genome-based phylogeny did not highlight amino acid level changes, we constructed a whole-proteome alignment-based phylogeny to ascertain the variations among the SARS-CoV-2 strains at the phenotypic level; it clustered the 245 strains into five major clades, a to e (Figure 1B). The first cluster, clade-a, had the maximum number of nodes (46), including the reference node, and contained strains from Nepal (MT072688), Pakistan (MT262993) and Taiwan (MT192759) along with 15 strains from the USA and 27 strains from China. It also had mutated daughter nodes (highlighted by # in Figure 3B for the corresponding nodes) radiating outwards, belonging to China, Finland (MT020781), India (MT012098), Japan (LC534419, LC529905), Taiwan (MT066176), Vietnam (MT192772-3), Brazil (MT126808), Australia (MT007544), South Korea (MT039890) and Sweden (MT093571), along with seven USA strains (Figure 1B). This clade represented the ancestral node, as it harbored the oldest known SARS-CoV-2 strain from China and laid the foundation for the rest of the mutated daughter strains worldwide, marking the onset of divergence in SARS-CoV-2. Three significantly diverged network nodes originated from the ancestral clade-a and were marked as clade-b, c and e (Figure 1B). For clade-b, the central node included only four strains, of which two were from the USA (MT184912, MT276328) and one each from Israel (MT276597) and Japan (LC528233). Its major descendant radiations belonged to Japan (LC528232, LC534418), Pakistan (MT240479), the USA (MT184913, MT184910, MN997409) and China (MT049951, MT226610).
One of the Chinese strains in clade-b (MT226610) had the longest branch length, which makes the strain very distinct (harboring 25 mutations) and indicates an exceptionally high rate of evolution. In the clade-c lineage, the small central node comprised strains from Taiwan (MT066175), the USA (MT246667, MT233526, MT020881, MT985325, MT020880) and China (MN938384, LR757995). Interestingly, one strain each from Spain (MT233523) and India (MT050493) was also found radiating as a daughter node from the central one. The clade-d lineage, which originated from the clade-c lineage, consisted only of USA strains in both the central nodes and the radiations. Importantly, two strains (MT263416, MT246471) were found to be the most divergent, with varied mutations, suggesting a high rate of evolution among USA strains that might be linked with their high pathogenicity. Clade-e bifurcated into two sub-clades (e1 and e2) through a significant set of mutations. Sub-clade e1 included six strains from the USA and one from Israel (MT276598), with radiating nodes from Peru (MT263074) and the USA (MT276327), whereas sub-clade e2 had 32 strains, all from the USA. Thus, the formation of five major evolutionary clades and sub-clades based on the amino acid phylogeny deserves attention when assessing divergence among SARS-CoV-2 strains. This divergence indicates random evolution of SARS-CoV-2 and suggests network expansion into five clades, in contrast to the directed evolution proposed earlier by Forster et al. (2020).
Genotyping and variation estimation
To understand the implications of the mosaic pattern of transmission and the evolutionary lineage clustering (clades a-e), we carried out Single Nucleotide Polymorphism (SNP) genotyping of the 245 genome sequences, recording mutation counts along with their frequency at specific genomic locations. Mutational changes at the phenotypic level were also assessed as Amino Acid Variations (AAV). Interpolation of the SNP/AAV data by frequency, genomic position and type of SNP/AAV (Figure 2B) highlighted a large mutational diversity among the virus isolates. We identified a total of 12 SNP types (A>G, A>C, A>T, C>A, C>G, C>T, G>A, G>C, G>T, T>A, T>C, T>G) accounting for mutations at 297 genomic locations (Figure 2A, 2B). The overall pattern of SNPs suggested the C>T transition as the most common mutation across the entire genomic set (Figure 2A); however, the highest frequency was recorded for T>C transitions (Figure 2B). Based on the SNP frequencies, we analyzed 14 major locations in the genomes of SARS-CoV-2 for potential mutations generating different allelic forms of genes (Table 1). A C>T SNP was first observed at location 67 in the 5’ UTR region of the leader sequence with a frequency of 45, followed by Nsp2 at two locations (885 and 2863) with frequencies of 29 and 44, respectively. Nsp3/PL-PRO and Nsp8 marked the highest frequency, 238 SNP counts of T>C, at locations 5852 and 12299. Another T>C SNP was observed in ORF8 with a frequency of 88 at location 27973. C>T SNPs were found in Nsp4 and Nsp12 with frequencies of 88 and 44 at locations 8608 and 14234, respectively. The non-structural protein Nsp13 was unusual in harboring two different SNP types (C>T and A>G) at three different locations (17573, 17684, 17886) with relatively high frequencies of 68, 63 and 63, respectively. An A>G SNP in the S (Spike) protein was found with a frequency of 43.
A low count of G>T transversions fell in ORF3a and Nsp6, with frequencies of 32 and 21, respectively (Table 1). However, not all SNPs are reflected as a phenotypic change at the protein level, and they must therefore be assessed at the translational level for their phenotypic effect. Although 297 genomic locations harbored SNPs, corresponding AAV were found at only 200 genomic locations, accounting for a 67.34% conversion efficiency. Out of the 14 high-frequency SNPs, only 9 mutations [Nsp2 (T265I), Nsp3 (S1920P), Nsp6 (L3605F), Nsp12 (P4618L), Nsp13 (P5731L, Y5768C), S (D7611G), Orf3a (Q8327H), Orf8 (L9033S)] were reflected at the protein level, with the highest frequency of 238 in Nsp3 (Table 1). These proteins are known to play various regulatory roles, and mutations at the amino acid level can therefore modulate their activity drastically. Specifically, Nsp3 is the largest and an essential component of the replication complex encoded by the SARS-CoV-2 genome (Lei et al., 2018), and along with Nsp2 it forms a transcriptional complex in the endosome of the infected host cell (Wu et al., 2020). Nsp6 is a multiple-spanning transmembrane protein located in the ER, where it induces autophagosomes via an omegasome intermediate (Cottam et al., 2014). Interestingly, the L3605F mutation causes stiffness in the secondary structure of Nsp6 and leads to low stability of the protein structure in the most recent sequences from Asia, America, Oceania and Europe (Benvenuto et al., 2020). Nsp12 and Nsp13 are the key replicative enzymes, which require Nsp6, Nsp7 and Nsp10 as cofactors. In Nsp12, the RNA-dependent RNA polymerase (RdRp), the presence of the bulkier leucine side chain at location 4618 is likely to create greater stringency for base pairing to the templating nucleotide, thus modulating polymerase fidelity (Sexton et al., 2016).
Nsp13 contains a helicase domain, allowing efficient strand separation of extended regions of double-stranded RNA and DNA (Pachetti et al., 2020). Dual mutations in Nsp13 have been reported to have a profound effect on its activity. The P5731L mutation leads to increased affinity of the helicase-RNA interaction, whereas Y5768C is a destabilizing mutation that increases molecular flexibility and decreases the affinity of helicase binding to RNA (Begum et al., 2020). The two mutations are therefore antagonistic in nature. Thus, the ORF1ab polyprotein of SARS-CoV-2 encompasses a mutational spectrum in which signature mutations for Nsp2, Nsp3, Nsp6, Nsp12 and Nsp13 have been predicted. Amino acid mutations in the S, ORF3a and ORF8 proteins were also observed, with frequencies of 45, 34 and 89, respectively. The mutation in the Spike protein (D7611G) has been reported to outcompete other preexisting subtypes, including the ancestral one. This mutation generates an additional serine protease (elastase) cleavage site in the Spike protein (Bhattacharya et al., 2020), which is discussed in more detail in later sections. The ORF3a mutation (Q8327H) is located near the TNF receptor associated factor-3 (TRAF-3) region and has been reported as a marker of molecular differences for the delineation of many genomes, including Indian SARS-CoV-2 genomes (Hassan et al., 2020). The amino acid change in the ORF8 sequence (L9033S) has been proposed to be preserved (Koyama et al., 2020); it is therefore critical to examine its biological function in SARS-CoV-2.
Our results showed that the mutations (SNPs and AAV) in the virus were not uniformly distributed. The genotyping study identified a few mutations at specific locations in the SARS-CoV-2 genomes that occurred with high frequency, suggesting high selective pressure at those sites. Thus, by SNP count, mutations can be predicted as location-specific but not type-specific. Highly frequent AAV might be associated with changes in the transmissibility and virulence of SARS-CoV-2. Therefore, high-frequency AAV in the Spike protein, RdRp, helicase and ORF3a are important factors to consider while developing vaccines against the fast-evolving strains of SARS-CoV-2.
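Two calculations from this section can be made concrete with a short Python sketch: the classification of each SNP type as a transition (purine to purine or pyrimidine to pyrimidine) or a transversion, and the SNP-to-AAV conversion percentage derived from the 297 SNP locations and 200 AAV locations reported above. The function and variable names are assumptions made for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def snp_class(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or transversion."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref == alt or ref not in bases or alt not in bases:
        raise ValueError(f"not a valid substitution: {ref}>{alt}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# The 12 SNP types listed in the text.
snp_types = ["A>G", "A>C", "A>T", "C>A", "C>G", "C>T",
             "G>A", "G>C", "G>T", "T>A", "T>C", "T>G"]
classes = {snp: snp_class(*snp.split(">")) for snp in snp_types}
transitions = sorted(snp for snp, kind in classes.items() if kind == "transition")
print("transitions:", transitions)

# Conversion of SNP locations into amino acid variations, using the counts
# reported in the text (297 SNP locations, 200 with a corresponding AAV).
snp_locations = 297
aav_locations = 200
conversion = 100 * aav_locations / snp_locations
print(f"SNP-to-AAV conversion: {conversion:.2f}%")
```

Four of the twelve substitution types are transitions, and G>T is one of the eight transversions.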
Prevalence of Co-mutation in SARS-COV-2 evolution
Interestingly, we observed co-mutations in Nsp13 at locations 5768 (Nsp13_1) and 5731 (Nsp13_2) that occurred together in 64 genomes, all belonging to the USA. The AAV reported above (Table 1) were analyzed further and found to occur in 10 different permutations, varying from single to multiple mutated protein combinations. Complete details of these co-mutation combinations are given in Table 2. The co-mutations were mapped onto the divergent phylogeny to indicate the evolutionary divergence among the 245 strains. The phylogram (Figure 1B) showed clear divergence of strains from the parent strain due to accumulation of mutations at different levels of human-to-human transmission. We found that co-mutations in Nsp3, ORF8, Nsp13, S, Nsp12, Nsp2 and Nsp6 were responsible for this divergence.
These co-mutations were linked with lineage clades a to e, highlighting their prevalence among them (Figure 1B). In clade-a, 40 genomes harbored mutations only in the Nsp3 protein, while six isolates belonging to the USA (MT262993, MT044258, MT159716, MT259248, MT259267) and Pakistan (MT263424) showed no mutation, confirming that their lineage is the same as that of the reference/ancestral genome from China. Nsp3 was therefore marked as the first mutational hotspot for accumulating amino acid mutations in SARS-CoV-2. Strains from Brazil (MT126808) and the USA (MT276331), descended from clade-a, harbored Nsp3/Nsp6 as the first co-mutation. Clade-b had an additional mutation in ORF8 along with Nsp3 and Nsp6, with three descendant strains from the USA and China. The most distant Chinese strain (MT226610) clustered in clade-b and harbored an additional 25 AAV, making it the most divergent strain in the network, as reported above in Figure 1B. Clade-c, descended from clade-a, had a different set of co-mutations in the Nsp3 and ORF8 proteins. Clade-d, descended further from clade-c, had two mutations in Nsp13 (5768/5731) in addition to those in Nsp3/ORF8. Two USA strains in the cluster radiating from clade-d harbored an additional Nsp6 mutation, marking them as more divergent with scope for further evolution. Sub-clade e1 held another new set of co-mutations, Nsp3/S/Nsp12. The highest number of co-mutations was found in sub-clade e2, with the combination Nsp3/Nsp2/Nsp12/S/ORF3a prevalent in 30 genomes belonging to the USA, suggesting that they are active carriers of the evolutionary force driving SARS-CoV-2 divergence (Figure 3A and 3B). The presence of the Nsp3 mutation (S1920P) in 238 strains underlined the origin of this mutation from the reference strain and highlighted the first divergence in SARS-CoV-2. In future, the availability of more genomes from the USA may clarify the evolutionary relationships associated with these co-mutations.
Our results suggest that co-mutations are a major evolutionary force driving pathogenicity among strains isolated in different geographical regions, which may be responsible for the higher or lower virulence among them.
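The grouping of genomes by co-mutation combination, and the co-occurrence of two mutations, can be sketched in Python as follows. The data structure, function names and toy genome identifiers are assumptions made for illustration; the actual combinations are those listed in Table 2.

```python
from collections import Counter


def comutation_profiles(mutations_by_genome: dict) -> Counter:
    """Count how many genomes carry each distinct combination of mutated proteins."""
    return Counter(frozenset(mutated) for mutated in mutations_by_genome.values())


def co_occurrence(mutations_by_genome: dict, first: str, second: str) -> float:
    """Fraction of genomes carrying `first` that also carry `second`."""
    carriers = [m for m in mutations_by_genome.values() if first in m]
    if not carriers:
        return 0.0
    return sum(second in m for m in carriers) / len(carriers)


# Toy input: genome identifier -> set of proteins with an amino acid variation.
# Identifiers are hypothetical; labels follow the naming used in the text.
toy = {
    "toy_1": {"Nsp3"},
    "toy_2": {"Nsp3", "ORF8"},
    "toy_3": {"Nsp3", "ORF8", "Nsp13_1", "Nsp13_2"},
    "toy_4": {"Nsp3", "ORF8", "Nsp13_1", "Nsp13_2"},
    "toy_5": set(),  # no variation relative to the reference
}

profiles = comutation_profiles(toy)
for combination, count in profiles.most_common():
    label = "/".join(sorted(combination)) or "none"
    print(f"{label}: {count}")

print(co_occurrence(toy, "Nsp13_1", "Nsp13_2"))
```

A co-occurrence of 1.0 in both directions corresponds to the 100% co-occurrence described for the two Nsp13 mutations.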
The assessment of mutations in SARS-CoV-2 proteins
Amino acid variations were predicted in eight SARS-CoV-2 proteins (Nsp2, Nsp3, Nsp6, Nsp12, Nsp13, S, Orf3a, Orf8) (Table 1). To identify their potential functional role, we carried out structural analysis of the proteins. Pairwise sequence alignment of wild-type and mutant proteins provided the exact locations of, and changes in, the amino acids. The GMQE and QMEAN values of the models ranged from 0.45 to 0.72 and from −1.43 to −2.81, respectively. The sequence identity ranged from 34% to 99%, which suggests that the models were constructed with acceptable confidence and quality (Figure 6). The I-Mutant ΔΔG tool predicts whether a mutation largely destabilizes the protein (ΔΔG < −0.5 kcal/mol), largely stabilizes it (ΔΔG > 0.5 kcal/mol) or has a weak effect (−0.5 ≤ ΔΔG ≤ 0.5 kcal/mol). The protein stability analysis showed that the identified mutations decreased the stability of seven proteins (Nsp2, Nsp6, Nsp12, Nsp13, S, Orf3a, Orf8), whereas the Nsp3 mutation (T1103P) was predicted to increase protein stability (Figure 6A-H). Further, to explore the role of the mutations in SARS-CoV-2 proteins, we carried out HOPE analysis. The D614G mutation in the S protein could disturb the rigidity of the protein, and the hydrophobicity of glycine would affect the formation of an intramolecular hydrogen bond with G594. In ORF8 and Nsp3, the mutated position was not conserved, so the mutation is unlikely to affect or damage protein function. The P409L mutation in Nsp13 was located in the RNA virus helicase C-terminal domain. Since proline is a very rigid amino acid and induces a particular backbone conformation that might be required at this position, this mutation could disturb the domain and abolish its function. The L37F (Nsp6) and T85I (Nsp2) mutations were at highly conserved positions and could thus profoundly damage the function of the respective proteins. The P227L (Nsp12) mutation was in the RNA-binding domain located on the surface of the protein; modification of this residue could disturb interactions with other molecules or other parts of the protein.
In conclusion, the Nsp3 mutation, which appeared in all co-mutation combinations and was predicted to increase protein stability in 238 strains, might be associated with their increased pathogenicity. We therefore attempted to highlight the effects of these mutations on host-pathogen interactions.
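The ΔΔG thresholds used to interpret the stability predictions can be written as a small Python function. This only encodes the three categories described in the text and does not run I-Mutant; the function name and the numeric values in the example are hypothetical and are not results from the study.

```python
def classify_ddg(ddg: float, cutoff: float = 0.5) -> str:
    """Classify a predicted stability change (ΔΔG, kcal/mol).

    ΔΔG < -cutoff  -> largely destabilizing
    ΔΔG >  cutoff  -> largely stabilizing
    otherwise      -> weak effect
    """
    if ddg < -cutoff:
        return "largely destabilizing"
    if ddg > cutoff:
        return "largely stabilizing"
    return "weak effect"


# Hypothetical ΔΔG values chosen only to exercise each category.
for value in (-1.2, -0.5, 0.1, 0.5, 0.8):
    print(f"{value:+.1f} kcal/mol: {classify_ddg(value)}")
```

Values exactly at ±0.5 kcal/mol fall in the weak-effect category, matching the inclusive bounds given in the text.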
Modelling of Host-Pathogen Interaction Network and its Functional Analysis
The HPI network of SARS-CoV-2 (HPIN-SARS-CoV-2) contained 56 nodes, including 5 viral and 51 host proteins, and 58 edges (Figure 5). The degree of each node (the number of edges per node) was calculated from the HPI network. The existence of a few main hubs, namely N, S and M, and the attraction of a large number of low-degree nodes toward each hub are strong evidence that these few hubs control the topological properties of the network. N has a degree of 37, while S and M have degrees of 16 and 8, respectively. These viral proteins are the main hubs that regulate the network. Based on the degree distribution, the viral protein N showed the highest pathogenicity, followed by S and M. N is a highly conserved major structural component of the SARS-CoV virion, is involved in pathogenesis and is used as a marker in diagnostic assays (Xia et al., 2020). Another structural protein, S (spike glycoprotein), attaches the virion to the cell membrane by interacting with the host receptor, initiating infection (Belouzard et al., 2012). The M protein, a component of the viral envelope, plays a central role in virus morphogenesis and assembly via its interactions with other viral proteins (Garoff et al., 1998). Interestingly, we found that four host proteins, MYO5A, MYO5B, MYO5C and T, had the maximum number of interactions with the viral hub proteins. MYO5A, MYO5B and MYO5C interacted with all three viral hub proteins (N, S and M), whereas T interacted with two (S and M), showing a significant relationship with persistent infections caused by SARS-CoV-2.
MYO5A, MYO5B and MYO5C are Class V myosin (myosin-5) molecular motors that function as organelle transporters (Roland et al., 2011; Sasaki et al., 1995). Myosin proteins have been found to play a crucial role in coronavirus assembly and budding in infected cells (Neuman et al., 2008). These cytoskeletal proteins are important during internalization and subsequent intracellular transport of viral proteins. At the entry stage, S interacts with the host ACE2 receptor, which internalizes the virus into the endosomes of the host cell and induces conformational changes in the S glycoprotein (Belouzard et al., 2012). Inhibition of MYO5A, MYO5B and MYO5C has been found to be efficient in blocking the internalization pathway; thus, this target could be used for the development of a new treatment for SARS-CoV-2 (Dewerchin et al., 2014). Patients suffering from COVID-19 develop two major conditions in the severe stage, thrombosis and hypoxia, which act as silent killers (Bikdeli et al., 2020; Negri et al., 2020, https://doi.org/10.1101/2020.04.15.20067017). Hypoxia, a condition in which the oxygen level of the body falls drastically, results in elevated expression of the T protein (Shao et al., 2015; Yoon et al., 2006). The T protein (Brachyury/TBXT) is a transcription factor involved in regulating genes required for mesoderm formation and differentiation, and thus plays an important role in pathogenesis. The detailed functional analysis of HPIN-SARS-CoV-2 was mapped onto the radiological findings from severely infected COVID-19 patients and non-survivors. It has been reported that the levels of fibrin-degrading proteins, fibrinogen and D-dimer were 3-4 fold higher than in healthy individuals, reflecting coagulation activation from infection/sepsis, cytokine storm and impending multiple organ failure (Tang et al., 2020; Shi et al., 2020; Han et al., 2020; Li et al., 2020).
In our network, we found that 24 proteins (ANGPTL1, TNN, FGL2, ANGPTL6, TNC, FCN3, FCN2, ANGPTL4, FGB, FGA, ANGPT2, ANGPTL5, FGG, TNR, ANGPTL3, FCN1, FIBCD1, ANGPTL2, ANGPTL7, ANGPT4, MFAP4, FGL1, TNXB and ANGPT1) are associated with the above etiology (Figure 5C). We also found interactions of SMAD family proteins and SUMO1 with the N protein, which may lead to inhibition of apoptosis of infected lung cells (Zhao et al., 2008). The interactome study reveals a significant role of the identified host proteins in viral budding and the related symptoms of COVID-19.
Mutations in SARS-CoV-2 proteins inhibit viral penetration into the host
To validate the effect of phenotypic variation (AAV), significant host protein interactions from HPIN-SARS-CoV-2 were considered for in silico docking studies. Docking of the S protein (wild type and mutant) with ACE2, TMPRSS2 and one of the myosin proteins (MYO5C) was analyzed. Recent studies have shown that SARS-CoV-2 uses ACE2 for entry and the serine protease TMPRSS2 for S protein priming (Wrapp et al., 2020). The non-structural proteins (Nsp12, Nsp13, Nsp2, Nsp3 and Nsp6) of the ORF1A and ORF1AB polyproteins were docked with the host proteins RPS6 and ATP6V1G1. The docking results showed that the mutant S protein could not bind efficiently with ACE2 and MYO5C, whereas the mutation slightly promoted binding with TMPRSS2 (Table 3, Figure 6 and Figure 5C). TMPRSS2 has been detected in both nasal and bronchial epithelium by immunohistochemistry (Bertram et al., 2012) and is reported to occur largely in alveolar epithelial type II cells, which are central to SARS-CoV-2 pathogenesis (Furong et al., 2020). The wild-type S protein forms 16 hydrogen bonds and 1058 non-bonded contacts with ACE2, whereas the mutant protein forms 12 hydrogen bonds and 738 non-bonded contacts (Figure 6). This result suggests that the D614G mutation in the S protein could affect viral entry into the host. Similarly, the mutations present in Nsp12, Nsp13, Nsp2, Nsp3 and Nsp6 of SARS-CoV-2 could inhibit the interaction with RPS6, whereas all of these mutations except Nsp6 (L3605F) promote binding with ATP6V1G1. RPS6 contributes to the control of cell growth and proliferation (Chauvin et al., 2014), so a loss of interaction with RPS6 could probably inhibit the production of viruses. Overall, the structural and interactome analyses suggest that the identified mutations (Nsp2 (T265I), Nsp3 (S1920P), Nsp6 (L3605F), Nsp12 (P4618L), Nsp13 (P5731L, Y5768C), S (D7611G)) in SARS-CoV-2 might play an important role in modifying the efficacy of viral entry and its pathogenesis.
However, these observations require critical re-evaluation as well as experimental work to confirm the in silico results.
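To put the reported S-ACE2 interface counts on a common scale, the relative change between the wild-type and mutant complexes can be computed directly. The following is a minimal sketch that uses only the hydrogen-bond and non-bonded-contact counts quoted above; the helper name `percent_change` is illustrative and not part of the original analysis.

```python
def percent_change(wild_type: float, mutant: float) -> float:
    """Relative change of the mutant value with respect to the wild type, in percent."""
    return (mutant - wild_type) / wild_type * 100.0


# Interface counts for the S-ACE2 complex as reported in the text: (wild type, mutant).
contacts = {
    "hydrogen bonds": (16, 12),
    "non-bonded contacts": (1058, 738),
}

for name, (wt, mut) in contacts.items():
    print(f"{name}: {wt} -> {mut} ({percent_change(wt, mut):+.1f}%)")
```

Under these numbers the mutant complex loses a quarter of its hydrogen bonds and roughly 30% of its non-bonded contacts. Contact counts are only a coarse proxy for binding strength, so this ratio does not replace docking scores or experimental affinities.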
Regulation of SARS-CoV-2 pathogenicity by CpG island
Our genotyping analysis showed a high frequency (45) of SNPs in the 5’ UTR region (Table 1), and a recent study also suggested that suppression of GC content could play a vital role in specific antiviral activities (Xia, 2020). As seen in the SNP analysis, the common transitions C>T and G>A alter the GC content of SARS-CoV-2 (Table 1). This directed the analysis towards understanding the role of CpG islands, which are involved in silencing of transcription and down-regulation of viral replication (Vivekanandan et al., 2010). Viral infections upregulate host DNA methyltransferase genes (DNMTs), and their overexpression leads to methylation of host CpG islands along with the viral CpGs (Vivekanandan et al., 2010). Because an increased frequency of CpG motifs can serve as a pathogen-associated molecular pattern (PAMP) or damage-associated molecular pattern (DAMP), both of which are potent inducers of strong innate immune responses (Barber, 2011; Frieman, 2008), we profiled CpG islands in the SARS-CoV-2 genomes and assessed their significance. We found that CpG islands were consistently present in two regions of the genome, at nucleotide positions 285-385 (101 bp) and 28,324-28,425 (102 bp). The islands were detected in all 245 genomes analyzed in the present study, with 100% sequence conservation in 237 genomes (Figure 7 A).
In the remaining 8 genomes, five genomes (MT246474.1 (G to A substitution at position 354 with respect to the reference genome); MT276329.1, MT276330.1 and MT276598.1 (C to T substitution at position 313) and MT246455.1 (G to T substitution at position 332)) showed a point mutation in the 5’ CpG island, whereas three genomes (MT159718.1 (C to T substitution at position 28409); MT159717.1 and MT184911.1 (G to T substitution at position 28378)) showed a point mutation in the 3’ CpG island (Figure 7 D). Interestingly, all of these sequences originate from the USA. On locating the CpG islands with respect to the proteins, we found that they lie at two prime locations within the genome, one in Nsp1 (Figure 7 B) and the other within the Nucleocapsid (N) protein (Figure 7 C). Both proteins were previously reported to interact with the 5’ UTR region and to play crucial roles in viral replication and gene expression (Guan et al., 2012; Yang and Leibowitz, 2015; Galan et al., 2005). The most pivotal role of the N protein is encapsidation of the viral gRNA, which leads to formation of the ribonucleoprotein complex (RNP), a vital step in the assembly of viral particles (Cong et al., 2017).
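The tool and thresholds used to call the CpG islands are not stated in this section, so the following is only a minimal sketch of one common way to scan a genome for CpG-rich windows. It assumes the widely used criteria of GC content of at least 50% and an observed/expected CpG ratio of at least 0.6, with a 100-nt window chosen to match the roughly 100 bp islands reported above. The function names, window size and thresholds are all assumptions for illustration.

```python
def cpg_stats(window: str) -> tuple[float, float]:
    """Return (GC fraction, observed/expected CpG ratio) for one sequence window."""
    seq = window.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Observed/expected CpG = (#CpG * window length) / (#C * #G)
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def find_cpg_windows(genome: str, window: int = 100, step: int = 1,
                     min_gc: float = 0.5, min_obs_exp: float = 0.6):
    """Return 1-based inclusive (start, end) of every window passing both thresholds."""
    hits = []
    for start in range(0, len(genome) - window + 1, step):
        gc, obs_exp = cpg_stats(genome[start:start + window])
        if gc >= min_gc and obs_exp >= min_obs_exp:
            hits.append((start + 1, start + window))
    return hits


def merge_windows(hits):
    """Merge overlapping or adjacent windows into candidate island intervals."""
    merged = []
    for start, end in hits:
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy sequence (not viral data): a CpG-rich block flanked by AT repeats.
toy = "AT" * 50 + "CG" * 50 + "AT" * 50
print(merge_windows(find_cpg_windows(toy)))
```

On a real genome the sequence would be read from a FASTA file, and the resulting intervals depend strongly on the window size and thresholds, so they should not be expected to reproduce the reported coordinates exactly.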
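As a quick consistency check, the reported substitution positions can be mapped onto the two island intervals given above. The sketch below uses only the accessions, positions and coordinates quoted in the text; the data layout and the helper name `locate` are illustrative assumptions.

```python
# CpG island coordinates (1-based, inclusive) as reported in the text.
CPG_ISLANDS = {
    "5' island": (285, 385),
    "3' island": (28324, 28425),
}

# (accession, position, reference base, variant base) as reported in the text.
SUBSTITUTIONS = [
    ("MT246474.1", 354, "G", "A"),
    ("MT276329.1", 313, "C", "T"),
    ("MT276330.1", 313, "C", "T"),
    ("MT276598.1", 313, "C", "T"),
    ("MT246455.1", 332, "G", "T"),
    ("MT159718.1", 28409, "C", "T"),
    ("MT159717.1", 28378, "G", "T"),
    ("MT184911.1", 28378, "G", "T"),
]


def locate(position: int):
    """Return the name of the island containing the position, or None."""
    for name, (start, end) in CPG_ISLANDS.items():
        if start <= position <= end:
            return name
    return None


# Group accessions by the island their substitution falls in.
by_island = {}
for accession, position, ref, alt in SUBSTITUTIONS:
    by_island.setdefault(locate(position), []).append(accession)

# Island lengths implied by the coordinates.
island_lengths = {name: end - start + 1 for name, (start, end) in CPG_ISLANDS.items()}

for name, accessions in by_island.items():
    print(name, island_lengths.get(name), accessions)
```

All eight substitutions fall inside one of the two intervals, five in the 5’ island and three in the 3’ island, and the interval lengths agree with the stated 101 bp and 102 bp.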
The Nsp1 protein of coronaviruses plays a regulatory role in transcription and viral replication (Cong et al., 2017). It is known to interact with the 5’ UTR of host cell mRNA to induce its endonucleolytic cleavage (Huang et al., 2011; Narayanan et al., 2015), thus inhibiting host gene expression (Kamitani et al., 2009). It also plays an important role in blocking IFN-dependent antiviral signaling pathways, leading to dysregulation of the host immune system (Kamitani et al., 2006; Wathelet et al., 2007; Law et al., 2007). CpG sites can be targeted by Zinc Finger Antiviral Proteins, which mediate antiviral restriction through CpG motif detection (Bick et al., 2003; Liu et al., 2015; Chiu et al., 2018). Apart from this, CpG oligodeoxynucleotides (ODNs) are known to act as adjuvants and are already established as potent stimulators of the host immune system (Campbell, 2017; Becker, 2005; Yuan, 2017; Singh et al., 2016; Yu et al., 2018). Moreover, recent studies on the influenza A and Zika virus genomes have shown that increasing the number of CpG dinucleotides in the viral genome impairs viral infection (Gaunt et al., 2016; Trus et al., 2020). Our results show conserved CpG islands in Nsp1 and the N protein across 245 genomes of SARS-CoV-2, which indicates a role in pathogenesis; these islands could be targeted by Zinc Finger Antiviral Proteins or exploited to design CpG-recoded vaccines.
Conclusions
The genomic and proteomic survey of 245 SARS-CoV-2 strains, reported from a subset of the population of different countries, reflected global transmission during the outbreak of COVID-19. The viral phylogenetic network with five clades (a-e) provided a landscape of the current stage of the epidemic, in which major divergence was observed in USA strains. From this we propose genotypes linked to geographic clades, in which signature SNPs can be used to track and monitor the epidemic. Demarcation of co-mutations in the SARS-CoV-2 strains also highlighted the evolutionary relationships among the viral proteins. Our results suggest that co-mutations are indicative of AAV-induced pathogenicity, leading to multiple mutations embedded in a few genomes. However, co-mutations are still evolving, and more combinations can be predicted with a larger dataset. High-frequency AAV mutations were present in critical proteins, including Nsp2, Nsp3, Nsp6, Nsp12, Nsp13, S, Orf3a and Orf8, which could be considered for designing a vaccine. Comparative analysis of proteins from wild-type and mutated strains showed positive selection of the mutation in Nsp3 but not in the rest of the mutants. The HPI model can be used as a fundamental basis for understanding structure-guided pathogenesis inside the host cell. The interactome study showed MYO-5 proteins as key host partners and also highlighted the key role of the N, S and M viral proteins in conferring SARS-CoV-2 pathogenicity. The mutation in the S protein could affect viral entry through looser binding with ACE2. The presence of CpG islands in the N and Nsp1 proteins could play a critical role in the regulation of pathogenesis. Our multi-omics approach, combining genomics, proteomics, interactomics and structural biology, provided an opportunity for a better understanding of the COVID-19 pandemic and can be considered in ongoing vaccine development programs.
Authors' Contributions
RL, VG, SH and MV conceived and designed the study. VG, HV, SH, NS, KP and MZM performed the analysis and developed the figures. VG, SH, MV and KP wrote the manuscript, and RL, RK, HV, US, PH and SS helped in shaping the manuscript.
Conflict of Interest
The authors declare no conflict of interest.
Acknowledgements
VG acknowledges Phixgen Pvt. Ltd. for a research fellowship. MV and SS acknowledge Dr. P. Hemalatha Reddy, Principal, Sri Venkateswara College, University of Delhi, for her constant support and encouragement. RL and US also acknowledge The National Academy of Sciences, India, for support under the NASI-Senior Scientist Platinum Jubilee Fellowship Scheme. NS acknowledges the Council of Scientific and Industrial Research (CSIR), New Delhi, for a doctoral fellowship. KP thanks Hub of Bioinformatics for providing support. SH would like to thank Jaypee Institute of Information Technology, Noida, India, for providing support. HV would like to thank Ramjas College, University of Delhi, Delhi, for providing support. RK acknowledges Magadh University, Bodh Gaya, for providing support. MZM acknowledges the Department of Health Welfare, Government of India, for financial support under the young scientist scheme. PH would like to thank Maitreyi College, University of Delhi, Delhi, for providing support.