Evidence of the Recombinant Origin and Ongoing Mutations in Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2)

The recent global outbreak of viral pneumonia designated Coronavirus Disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has threatened global public health and made it urgent to investigate its source. Whole-genome analysis of SARS-CoV-2 revealed ~96% genomic similarity with bat CoV (RaTG13), and the two clustered together in the phylogenetic tree. Furthermore, RaTG13 also showed 97.43% spike protein similarity with SARS-CoV-2, suggesting that RaTG13 is the closest strain. However, the receptor-binding domain (RBD) and the key amino acid residues thought to be crucial for human-to-human and cross-species transmission are homologous between SARS-CoV-2 and pangolin CoVs. These results suggest that SARS-CoV-2 is a recombinant virus of bat and pangolin CoVs. Moreover, this study also reports mutations in the coding regions of 125 SARS-CoV-2 genomes, signifying its capacity for evolution. In short, our findings propose that homologous recombination occurred between bat and pangolin CoVs, which triggered cross-species transmission and the emergence of SARS-CoV-2, and that SARS-CoV-2 is still evolving and adapting during the ongoing outbreak.


INTRODUCTION
The family Coronaviridae comprises large, enveloped, single-stranded, positive-sense RNA viruses that can infect a wide range of animals, including humans (Guan et al., 2003). The viruses are further classified into four genera: alpha, beta, gamma, and delta coronavirus (King et al., 2012). So far, all coronaviruses (CoVs) identified in humans belong to the genera alpha and beta. Among them, betaCoVs are of particular importance. Several novel strains of highly infectious betaCoVs have emerged in human populations in the past two decades and have caused severe health concerns all over the world. Severe acute respiratory syndrome coronavirus (SARS-CoV) was first recognized in 2003, causing a global outbreak (Zhong, 2004; Peiris et al., 2004; Cherry, 2004). It was followed by another outbreak in 2012, caused by a novel coronavirus strain designated Middle East respiratory syndrome coronavirus (MERS-CoV) (Lu et al., 2013). Both CoVs were zoonotic pathogens that evolved in animals. Bats in the genus Rhinolophus are a natural reservoir of coronaviruses worldwide, and it is presumed that both SARS-CoV and MERS-CoV were transmitted to humans through intermediate mammalian hosts (Li et al., 2005a; Bolles et al., 2011; Al-Tawfiq and Memish, 2014). Recently, the emergence of another pandemic, termed Coronavirus Disease 2019 (COVID-19) by the World Health Organization (WHO) and caused by a novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has been reported (Zhu et al., 2020). To date, more than 174,000 people have been infected and over 6,600 have died, with transmission clusters worldwide, including China, Italy, South Korea, Iran, Japan, USA, France, Spain, Germany, and several other countries, causing alarming global health concern.
The large trimeric spike glycoprotein (S) located on the surface of CoVs is crucial for viral infection and pathogenesis. It is subdivided into an N-terminal S1 subunit and a C-terminal S2 subunit. The S1 subunit is specialized in recognizing receptors on the host cell and comprises two separate domains, located at the N- and C-termini, which can fold independently and facilitate receptor engagement (Masters, 2006). The receptor-binding domains (RBDs) of most CoVs are located at the S1 C-terminus and enable attachment to the host receptor (Li et al., 2005b). The host specificity of the virus particle is determined by the amino acid sequence of the RBD, which usually differs among CoVs. Therefore, the RBD is a core determinant of the tissue tropism and host range of CoVs. This article presents SARS-CoV-2 phylogenetic trees together with a comparison and analysis of the genome, spike protein, and RBD amino acid sequences of different CoVs, in order to deduce the source and etiology of COVID-19 and the evolutionary relationships among SARS-CoV-2 strains in humans.

Phylogenetic classification of SARS-CoV-2 and its closely related CoVs
To determine the evolutionary relationship of SARS-CoV-2, phylogenetic analysis was performed on whole-genome sequences of different CoVs from various hosts. The maximum-likelihood (ML) phylogenetic tree is shown in Figure 1, which illustrates four main groups representing the four genera of CoVs: alpha, beta, gamma, and delta. In the phylogenetic tree, strains of SARS-CoV-2 (colored red) cluster together and belong to the genus Betacoronavirus. Among the beta-CoVs, SARS-CoV, civet SARS-CoV, bat SARS-like CoVs, bat/RaTG13 CoV, and SARS-CoV-2 clustered together, forming a clade distinct from the MERS-CoVs. The clade is further divided into two branches, one of which comprises all SARS-CoV-2 strains clustered together with Bat/Yunnan/RaTG13 CoV, forming a monophyletic group. Bat/Yunnan/RaTG13 exhibited ~96% genomic similarity with SARS-CoV-2. This indicates that SARS-CoV-2 is closely related to Bat/Yunnan/RaTG13 CoV.
The ML phylogenetic tree shows that CoVs from bat sources are found at the internal nodes or in the clades neighboring SARS-CoV-2. This suggests that bat CoVs, particularly Bat/Yunnan/RaTG13, are the source of SARS-CoV-2, and that the virus emerged and was transmitted from bats to humans through recombination and transformation events in an intermediate host.

Detection of putative recombination within the spike protein
To explore the emergence of SARS-CoV-2 in humans, we investigated the CoV S-protein and its RBD, as they are responsible for determining the host range (Table 1). The S-protein amino acid sequence identity between SARS-CoV-2 and related beta-CoVs showed that bat/Yunnan/RaTG13 shares the highest similarity, 97.43%. However, the amino acid sequence identity of the SARS-CoV-2 RBD with that of bat/Yunnan/RaTG13 is only 89.57%. In contrast, beta-CoVs from pangolin sources (pangolin/Guandong/1/2019 and pangolin/Guangdong/lung08) showed the highest RBD amino acid sequence identity with SARS-CoV-2, 96.68% and 96.08% respectively. These observations point to homologous recombination events within the S-protein gene between bat and pangolin CoVs.
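The identity percentages compared above can be illustrated with a minimal sketch. It assumes the two sequences are already aligned to equal length and that columns containing a gap in either sequence are excluded from the denominator; the tool and gap convention actually used for Table 1 are not stated here, and the function name `percent_identity` and the toy fragments are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: gap columns ('-') are skipped; other conventions
    (e.g. dividing by the full alignment length) give different values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gap columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy aligned fragments (made up for illustration, not real RBD sequences).
toy_a = "NLDSKVGGNYNYLYRLF"
toy_b = "NLDSKVGGNYNYLFRLF"
print(round(percent_identity(toy_a, toy_b), 2))
```

Applying the same function to the full S-protein and then to the RBD columns alone is what allows a strain to rank highest on one measure and lower on the other.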
Similarity plot analysis of CoV genome sequences from bat, pangolin, and human also indicated a possible recombination within the S-protein gene of SARS-CoV-2 (Figure S1).
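A similarity plot of this kind can be sketched as a sliding-window identity profile of a query against each candidate parent; a switch in which parent is closest along the alignment is the pattern read as a recombination signal. The sketch below is illustrative only: the window and step sizes, the function names, and the toy sequences are assumptions, not the settings or software used for Figure S1.

```python
import math


def window_identity(a: str, b: str) -> float:
    """Percent identity of two equal-length aligned segments, skipping gap columns."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return math.nan  # window made up entirely of gaps
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def similarity_profile(query: str, parent: str, window: int = 200, step: int = 20):
    """Return [(window_start, percent_identity), ...] along the alignment."""
    if len(query) != len(parent):
        raise ValueError("aligned sequences must have equal length")
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = max(len(query) - window, 0)
    return [
        (start, window_identity(query[start:start + window],
                                parent[start:start + window]))
        for start in range(0, last_start + 1, step)
    ]


def closest_parent(query: str, parents: dict, window: int = 200, step: int = 20):
    """For each window, name the parent with the highest identity to the query."""
    if not parents:
        raise ValueError("at least one parent sequence is required")
    profiles = {name: similarity_profile(query, seq, window, step)
                for name, seq in parents.items()}
    starts = [start for start, _ in next(iter(profiles.values()))]

    def score(name, i):
        value = profiles[name][i][1]
        return -1.0 if math.isnan(value) else value  # rank all-gap windows last

    return [(start, max(profiles, key=lambda n: score(n, i)))
            for i, start in enumerate(starts)]


# Toy example: the query matches "bat" in the first half and "pangolin" in the second.
toy_parents = {"bat": "AAAAGGGG", "pangolin": "TTTTCCCC"}
print(closest_parent("AAAACCCC", toy_parents, window=4, step=4))
```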
The amino acid changes in the S-protein of SARS-CoV-2 were further analyzed against SARS-CoV, pangolin, and bat CoVs, including pangolin/Guandong/1/2019, pangolin/Guangdong/lung08, and bat/Yunnan/RaTG13 (Figure 2). Despite the low homology between SARS-CoV-2 (Wuhan-Hu-1_MN908947) and SARS-CoV (SARS_AAR07630), they share many homologous regions in the S-protein. The five key amino acid residues of the S-protein at positions 442, 472, 479, 480, and 487 of SARS-CoV are described as lying at the angiotensin-converting enzyme 2 (ACE2) receptor complex interface and are thought to be crucial for human-to-human and cross-species transmission (Li et al., 2005b; Wu et al., 2012). Figure 2b and Table S1 show that all key amino acid residues of the RBD (except at two positions) are completely homologous between SARS-CoV-2 (Wuhan-Hu-1_MN908947) and the pangolin CoVs (pangolin/Guandong/1/2019 and pangolin/Guangdong/lung08), supporting our postulated recombination event in the S-protein gene. Although all five crucial ACE2-binding residues of SARS-CoV-2 differ from those of SARS-CoV, their hydrophobicity and polarity are similar, giving the same S-protein structural conformation and an identical RBD 3-D structure (Xu et al., 2020). In addition, the six critical residues in the MERS-CoV RBD that bind its receptor, dipeptidyl peptidase 4 (DPP4), are all different in SARS-CoV and SARS-CoV-2-related coronaviruses (Figure 2a).
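Comparing residues at numbered positions across strains requires mapping each position in the reference numbering (here, SARS-CoV) to a column of the multiple alignment, since gaps shift the columns. The following is a minimal sketch of that lookup; the function names, the dictionary layout of the alignment, and the toy sequences are hypothetical and do not reproduce the data behind Figure 2 or Table S1.

```python
def column_of_position(aligned_ref: str, position: int) -> int:
    """Map a 1-based residue position in the ungapped reference to a 0-based alignment column."""
    seen = 0
    for col, ch in enumerate(aligned_ref):
        if ch != "-":
            seen += 1
            if seen == position:
                return col
    raise ValueError(f"position {position} is beyond the reference length")


def residues_at(alignment: dict, ref_name: str, positions: list) -> dict:
    """Return {sequence_name: {position: residue}} for positions in reference numbering."""
    ref = alignment[ref_name]
    columns = {pos: column_of_position(ref, pos) for pos in positions}
    return {
        name: {pos: seq[col] for pos, col in columns.items()}
        for name, seq in alignment.items()
    }


# Toy alignment (made up): the gap in the reference shifts positions 3 and 4 by one column.
toy_alignment = {
    "reference": "AC-DE",
    "strain_x":  "ACGDF",
}
print(residues_at(toy_alignment, "reference", [3, 4]))
```

With a real alignment, the positions list would hold the five SARS-CoV interface positions named above, and the output rows could be compared directly between strains.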

Ongoing mutations in SARS-CoV-2 during its spread
We also investigated some of the important evolutionary and phylogenetic aspects of SARS-CoV-2 (Table 2). Among the different orfs of SARS-CoV-2, orf1a was the most variable segment, with a total of 44 distinct amino acid substitutions. It was followed by the spike (S) orf, with 13 amino acid substitutions. In contrast, orf6 and orf7b were the most conserved regions, with no amino acid changes. In addition, orf10, E, M, and orf7a tended to be more conserved, with only one or two amino acid substitutions.
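Per-orf counts of this kind can be obtained by comparing each translated orf from every genome against a reference protein and collecting the distinct substitutions. The sketch below assumes ungapped protein sequences of the same length as the reference and ignores ambiguous residues (`X`); the function name and the toy peptides are hypothetical, and the procedure is an illustration rather than the pipeline used to build Table 2.

```python
def distinct_substitutions(reference: str, variants: list) -> list:
    """Collect distinct amino acid substitutions (e.g. 'K2R') relative to a reference.

    Assumptions: sequences are ungapped proteins of the same length as the
    reference; positions with a gap or an ambiguous 'X' are ignored.
    """
    found = set()
    for seq in variants:
        if len(seq) != len(reference):
            raise ValueError("variant length differs from reference")
        for pos, (ref_aa, var_aa) in enumerate(zip(reference, seq), start=1):
            if ref_aa != var_aa and "-" not in (ref_aa, var_aa) and var_aa != "X":
                found.add((pos, ref_aa, var_aa))
    # Sort by position so the output is stable and easy to read.
    return [f"{ref_aa}{pos}{var_aa}" for pos, ref_aa, var_aa in sorted(found)]


# Toy peptides (made up): two distinct substitutions across three genomes.
toy_reference = "MKLS"
toy_variants = ["MKLS", "MRLS", "MRLT"]
subs = distinct_substitutions(toy_reference, toy_variants)
print(subs, len(subs))
```

Running this once per orf and taking the length of each result gives a per-orf tally of distinct substitutions.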
With the global spread of SARS-CoV-2, its amino acid sequences have also diverged considerably (Figure 3). RNA viruses usually have a high rate of genetic mutation, which drives their evolution and gives them increased adaptability (Lin et al., 2019). To further explore SARS-CoV-2 evolution in humans, we performed a phylogenetic analysis of the aforementioned SARS-CoV-2 genomes in relation to their amino acid substitutions.

Sequence data collection
One hundred and twenty-five newly sequenced SARS-CoV-2 complete genomes were obtained from the Global Initiative on Sharing All Influenza Data EpiFlu™ database (GISAID EpiFlu™) and GenBank. Genome sequences of closely related beta-CoVs from different hosts were also collected and analyzed together with SARS-CoV-2. Open reading frames (orfs) of the CoV genomes were predicted using ORFfinder (v0.4.3) with default parameters, ignoring nested orfs.
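The idea of predicting orfs while ignoring nested ones can be sketched as follows. This is not ORFfinder and does not reproduce its algorithm or defaults: it is a simplified illustration that scans only the three forward frames (an assumption that suits a positive-sense genome), requires an ATG start, uses the standard stop codons, and takes an illustrative `min_len` value. All function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_len: int = 75) -> list:
    """Return (start, end) half-open intervals of ATG-to-stop orfs on the forward strand.

    Within a frame, each orf runs from the first ATG after the previous stop
    to the next in-frame stop codon (stop included in the interval).
    """
    seq = seq.upper().replace("U", "T")  # accept RNA input as well
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if end - start >= min_len:
                    orfs.append((start, end))
                start = None
    return sorted(orfs)


def drop_nested(orfs: list) -> list:
    """Discard any orf whose interval lies entirely inside another orf."""
    kept = []
    # Earlier starts first; for equal starts, longer orfs first.
    for start, end in sorted(orfs, key=lambda o: (o[0], -o[1])):
        if any(k_start <= start and end <= k_end for k_start, k_end in kept):
            continue
        kept.append((start, end))
    return kept


# Toy sequence (made up): a single three-codon orf.
print(drop_nested(find_orfs("ATGAAATAA", min_len=9)))
```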