The genetic variant analyses of SARS-CoV-2 strains; circulating in Bangladesh

Genomic mutation of the virus may impact the viral adaptation to the local environment, their transmission, disease manifestation, and the effectiveness of existing treatment and vaccination. The objectives of this study were to characterize genomic variations, non-synonymous amino acid substitutions, especially in target proteins, mutation events per samples, mutation rate, and overall scenario of coronaviruses across the country. To investigate the genetic diversity, a total of 184 genomes of virus strains sampled from different divisions of Bangladesh with sampling dates between the 10th of May 2020 and the 27th of June 2020 were analyzed. To date, a total of 634 mutations located along the entire genome resulting in non-synonymous 274 amino acid substitutions in 22 different proteins were detected with nucleotide mutation rate estimated to be 23.715 substitutions per year. The highest non-synonymous amino acid substitutions were observed at 48 different positions of the papain-like protease (nsp3). Although no mutations were found in nsp7, nsp9, nsp10, and nsp11, yet orf1ab accounts for 56% of total mutations. Among the structural proteins, the highest non-synonymous amino acid substitution (at 36 positions) observed in spike proteins, in which 9 unique locations were detected relative to the global strains, including 516E>Q in the boundary of the ACE2 binding region. The most dominated variant G614 (95%) based in spike protein is circulating across the country with co-evolving other variants including L323 (94%) in RNA dependent RNA polymerase (RdRp), K203 (82%) and R204 (82%) in nucleocapsid, and F120 (78%) in NSP2. These variants are mostly seen as linked mutations and are part of a haplotype observed in Europe. Data suggest effective containment of clade G strains (4.8%) with sub-clusters GR 82.4%, and GH clade 6.4%. Highlights We have sequenced 137 and analyzed 184 whole-genomes sequences of SARS-CoV-2 strains from different divisions of Bangladesh. A total of 634 mutation sites across the SARS-CoV-2 genome and 274 non-synonymous amino acid substitutions were detected. The mutation rate of SARS-CoV-2 estimated to be 23.715 nucleotide substitutions per year. Nine unique variants were detected based on non-anonymous amino acid substitutions in spike protein relative to the global SARS-CoV-2 strains.


Introduction
Most of the coronaviruses are naturally hosted by bats (Cui et al., 2019, and it is postulated that coronaviruses jumped to humans from the bat reservoir (Dominguez et al., 2007, Li et al., 2005. Besides bat, coronaviruses also identified in several avian and mammals hosts (Cavanagh, 2007, Ismail et al., 2003. Only six coronaviruses have been known to cause infection in humans until late 2019, and these include SARS-CoV, HCoV-OC43, CoV-HKU1, HCoV-NL63, HCoV-229E, and MERS-CoV. A seventh human coronavirus, named SARS-CoV-2, which is closely related to SARS-CoV-1 major cause of pandemic in the modern history of humans. Based on the whole genome sequencing data, it has been shown that the nucleotides bases of the SARS-CoV-2 virus have 96.2% similarity with that of a bat SARS-related coronavirus (RaTG13) (Paraskevis et al., 2020). However, it has a low resemblance to that of MERS-CoV (~50%) or SARS-CoV (~79%) (Gralinski andMenachery, 2020, Zhu et al., 2020).
Coronaviruses mostly associated with mild symptoms (Su et al., 2016) with two marked exceptions that caused a major epidemic. The first coronavirus mediated outbreak that caused Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV) occurred in 2002, with infected over 8,000 cases and killed 800 (Graham and Baric, 2010) and not known to circulate since 2003. The other coronavirus that caused Middle East Respiratory Syndrome (MERS-CoV) causing sporadic infections, mostly in the Arabian peninsula in 2012, a less infective but highly lethal virus, which infected 2,294 laboratory-confirmed cases and killed 858 (Cui et al., 2019, Graham andBaric, 2010) including 38 deaths in South Korea (Lee et al., 2016). As of the 16 th of July 2020, the SARS-CoV-2 virus causative agent of  has spread to over 216 countries or regions and had caused over 15785641 cases with 640016 deaths worldwide (Who, 2020). The first COVID-19 case was reported in Bangladesh on the 8 th of March 2020, and from there on, more than 223453 confirmed cases have been declared with 2928 death (Iedcr, 2020). As COVID-19 is responsible for enormous human casualties and economic loss posing a serious threat to Bangladesh and globally, an understanding of the ongoing situation and the development of strategies to contain the virus's spread are urgently needed.
Novel beta coronaviruses are positive-sense single-stranded non-segmented enveloped RNA viruses. They belong to some of the largest genomes among RNA viruses, with around 25-32 kb. The genome of circulating novel coronavirus comprises 29903 bases include a ('-cap, 'untranslated region (UTR), ORFs, '-UTR, and '-poly(A) tail). The ORF 1ab that encodes 16 non-structural proteins (nsps) represents around 71.5 percent of the entire genome, while the remaining ORFs encode structural and accessory proteins. The orf1ab translated to 7097 amino acids (aa) long two viral polyproteins (pp) pp1a and pp1ab by host ribosomes. ORF1a encodes 4406 aa long polyproteins, including two cysteine proteases, a papain-like protease (PL pro ), and a 3C-like protease (3CL pro ). PL pro cleaves after glycine residue with recognition site LXGG and generates the first three mature non-structural proteins (nsp), nsp1, 2, and 3. 3CL pro hydrolyzes after amino acid Glutamine residue is responsible for cleavage of the remaining 11 sites resulting in the release of a total of mature 16 non-structural proteins (nsps).
Many nsps assemble to form a replicase-transcriptase complex that is responsible for genomic RNA replication and transcription of the sub-genomic RNAs. The sub-genomic RNA, which is encoded within the 'ends of the genome serves as mRNAs for the transcription and subsequent translation of the structural and other accessory genes. The primary structural proteins, including envelope protein (E), participate in viral assembly and budding. The membrane protein (M) defines the shape of the viral envelope, nucleocapsid protein (N) binds to the RNA genome and is also involved in viral assembly and budding.
Similar to SARS-CoV, spike protein mediates attachment of the virus to the host cell surface receptors Angiotensin-Converting Enzyme 2 (ACE2) and facilitate viral entrance into host cells (Letko et al., 2020, Zhou et al., 2020. Therefore, the spike protein largely determines host tropism and infectivity of a coronavirus. Usually, a higher rate of mutation occurred in typical RNA viruses that result in a different version of the viral population with diverse genomes. As SARS-CoV-2 infects human, replicate its copy, and transmits to more individuals they accumulate difference between the progeny and the original viral genome. It has been estimated for coronavirus that the average evolutionary rate is approximately 10 -4 nucleotide substitutions per site per year (Su et al., 2016). When a different variant of viral populations, infecting patients, may contribute to differences in clinical outcomes among the patients. It is, therefore, demand the constant surveillance of the newly arisen viral variant by continuously monitoring the sequences of SARS-CoV-2 from different patients. This study aimed to analyze the variants being circulated in Bangladesh due to mutation at nucleotides and amino acids level.

Novel coronavirus detection and sequencing
The extraction of the viral nucleic acid from the nasopharyngeal specimen was performed After enrichment, the prepared libraries were quantified, pooled, and loaded onto the MiniSeq™ System with an output of 2 × 76-bp paired-end reads for sequencing.

Bioinformatic analysis for generating sequencing data
FASTQ data were exported from the local run manager to BaseSpace Hub, Illumina.
DRAGEN RNA Pathogen Detection V3.5.14 generated consensus FASTA. FASTQ data were further analyzed by the IDSeq platform to check the consensus FASTA coverages (Kalantar et al., 2020). Consensus FASTA files were uploaded in Genome Detective Virus Tools (Vilsker et al., 2018) to check the virus open reading frame and mutations at both nucleotide and amino acids level. The data was post-analyzed by comparing it with the China National Center for Bioinformation (Members and Partners, 2019) and Nextstrain.org.

Dataset construction
We have sequenced a 137 whole-genome of SARS-CoV-2 from COVID-19 patients in Bangladesh. A total of 188 datasets were constructed, including 51 viral sequences from Bangladeshi patients sequenced by other research groups that were submitted to the GISAID database between May 15th and July 10 th , 2020. Among the isolates 101 from Dhaka, 31 from Chattogram, 22 from Rajshahi, 22 from Rangpur, 9 from Sylhet, and 3 from Khulna division. The reference SARS-CoV-2 sequence, which was isolated in Wuhan, China, downloaded from GeneBank (NC_045512.3)4. Four sequences (EPI_ISL_435057) were excluded from the analysis due to harboring an extreme number of unique mutations, which can result from sequencing errors.

Nucleotide substitution analysis
SARS-CoV-2 isolate sequences from Bangladesh were compared to the reference SARS-CoV-2 sequence (NC_045512.3) using nucleotide substitutions. Sequence alignment for constructed 184 datasets was performed using Multiple Sequence Comparison by Log-Expectation (MUSCLE) software and MEGAX (Kumar et al., 2018). Aligned sequences were separately displayed in the AliView to verify that the sequences were in the frame.
Degenerated nucleotide was removed, and all the sequences length were maintained equally by removing 'UTR manually (265 nt) from5' end and 'UTR (229 nucleotides) from 'ends.
Nucleotides substitution in both5' UTR and 'UTR were separately observed. The amino acid substitution and all nucleotides substitutions were reconfirmed separately by utilizing opensource software Genome detective (Vilsker et al., 2018), and China National Center for Bioinformation (CNCB)

Variant analysis
Analysis of 184 SARS-CoV-2 highlights a total of 2242 nucleotide mutation events insertional, and 4 indels (Fig 2a). Of the 575 SNP mutations, thymine replaces cytosine, guanine, and adenine 376 times. At the same time, thymine replaced by cytosine, adenine, and guanines all together by only 65 times (Fig. 2b). The highest mutation was observed in the nsp3 or papain protease; it undergoes 48 amino acid mutations (Fig.3). Although orf1ab variants of SARS-CoV-2 based on a mutation in spike proteins have been found in Bangladesh. These variants include 13S>I, 14Q>H, 26P>L, 140F del, 144Y del, 211N>Y, 516E>Q, 594G>S, and 660Y>F. The change of amino acids glutamic acid at 516 positions to glutamine occurred within the receptor-binding domain (Fig. 3). All the mutations were highlighted in figure, also highlighted the mutations at 516, and 518 positions of spike proteins (Fig. 4). Another critical variant has arisen in Bangladesh due to changes of amino acids glycine to acidic serine at position 594 of spike protein S1 subunit ( Table 2). G614 variants were found in 96% of cases, 60 mutations were observed in 98 positions of the spike protein in out of 186 sequenced SARS-CoV-2; however, 36 mutations (61%) have rendered the amino acid changes in the spike protein.
Nucleotides cytosine has changed into uracil at 241 of 5'UTR found in 95% of cases.
Mutation in the nucleotide of nsp2 at 1163 position has changed the Isoleucine to Phenylalanine in 78.3% of cases. In nsp3, mutation at position 3037 found in 94.6% of cases; however, this mutation did not change the amino acid. We also observed nucleotide cytosine at 14408 changed to thymine (uracil), which has altered amino acid proline to Leucine at 323 positions of RNA dependent RNA polymerase. This mutation was observed in 96.7% of cases. The Arg residue at 203 positions of nucleocapsid has changed into Lys residue in 82.6 of cases, while Gly residue at 204 of nucleocapsid changed to Arg residue in 79% of cases. 3CLpro cuts the polyproteins translated from viral RNA to yield functional viral proteins.
One of the best-characterized drug targets among coronaviruses is the main protease (Mpro, also called 3CLpro) (Anand et al., 2003). Inhibiting the activity of this enzyme would block viral replication. Because no human proteases with a similar cleavage specificity are known, such inhibitors are unlikely to be toxic.

Discussion
The first Bangladeshi cases of COVID-19 caused by SARS-Cov-2 were reported on the 8 th of March 2020, and so far, above 2500 deaths have been documented due to this pandemic. We have analyzed 184 out of 188 SARS-CoV-2 whole-genome sequences available in GISAID published from Bangladesh dated 10 th of July 2020. Off the 188 SARS-CoV-2 whole-genome sequenced published to date, four whole-genome sequences (EPI_ISL_445217, EPI_ISL_468071, EPI_ISL_468072, and EPI_ISL_477140) were found very low in quality hence not genetically analyzed in this study. Off the 184 samples examined, variant G614 based on a mutation in spike proteins found in 178 samples, which determines the virus clades G along with frequently co-evolving variants L323, were also found in 178 samples.
Other mutations found in 152 samples were K203, and R204 based on mutations in the nucleocapsids of the virus. Overall, 152 sequenced SARS-CoV-2 belong to the GR clade, 12 GH clade, and 9 G clade indicating the viruses are mostly linked to the European origin. The variants C241T located in the non-coding 5'UTR' were found in 175 samples. Mutation at 3037 positions of the genome found in 174 samples where cytosine changed to thymine located in the nsp3 or M pro did not change amino acids. Other most frequent variants were C14408T located in the RNA dependent RNA polymerase rendered the amino acid Pro to Leu found in 178 samples. These mutations have been detected across Europe, which mostly belongs to GISAID clade G. This viral journey from Wuhan to Bangladesh, lasting 5 months, is documented by 12.2 mutations events per sample, whereas a common mutation per sample worldwide is 7.23 (Mercatelli and Giorgi, 2020).
As more genomes from Bangladesh are made publicly available, analysis of genetic variants has revealed the highest diversity occurring in nsp3 along with spike proteins, nucleocapsids, ORF3a, and ORF8 across the samples. The primary determinant of the host range and pathogenicity of SARS-CoV-2 is its surface-associated S protein. Spike is a trimeric glycoprotein composed of three chains, each chain consisting of two subunits. The S1 subunit located at N-terminal while the S2 subunit located at the C-terminal. The S1 subunit is responsible for recognizing and binding to the host cell receptor ACE2 while the S2 subunit is specializing for membrane fusion. Compared with the S1, the S2 subunit shows much lower variability (Masters, 2006). Although 1721 mutation sites are identified so far in S protein, only 906 lead to amino acid changes. Amino acids 336 to 516 of 1274 amino acid long spike protein form the binding domain, which mediates coronavirus attachment with ACE2 receptors. It has been found that 111 changed amino acids located in the Receptor Binding Domain (RBD). Among them, 8 out of 17 critical amino acids are responsible for protein interactions. As of 12 th of July 2020, a total of 60 mutation sites were identified in the spike protein, of which 36 lead to amino acid changes. The most common mutations which dominated and spread all over Bangladesh were G614 found in 178 samples, followed by L26 and F5 found in 5 and 4 samples, respectively.
G614 variant of SARS-CoV-2 isolates predominates over time in locales indicating that this change in the spike protein enhances viral transmission. In a study, it has been shown that retroviruses pseudotyped with SG614 can infect angiotensin-converting enzyme expressing cells significantly more effective way than the SD614 . However, it has been shown in a separate study that lower RT-qPCR CT value, which is indicative of higher titer of coronavirus variant G614 linked to higher viral load at the upper respiratory tract and thus attenuated disease severity (Korber et al., 2020). Nucleocapsid (N) protein is encoded by a highly variable region of the genes N gene, responsible for the formation of the helical nucleocapsid. N protein can elicit both cellmediated and humoral immune response and thus has potential value in vaccine development (Ding et al., 2016). In the N gene, one nucleotide change was detected at position G2881A found in 152 samples, and two nucleotides changes in the adjacent codons found in 149 samples. These changes have led to alter the two amino acids replacing Arg to Lys, and Gly to Arg at position 203 and 204, respectively. However, the mutations, as mentioned above, did not yet to be associated with changes in viral transmissibility or pathogenicity.
It has suggested based on selection analysis that as the virus is transmitted between humans, the genes of both spike and nucleocapsid undergo episodic selection (Benvenuto et al., 2020).
The positive selection for parts of any emerging virus is typical (Sironi et al., 2015).
Mutations in the spike or nucleocapsids genes, and subsequently, their adaptation could affect virus stability and pathogenicity (Baric et al., 1997).
SARS-CoV-2 has eight accessory proteins including orf3a, orf3b, orf6, orf7a, orf7b, orf8, orf9b, and orf14 (Ceraolo and Giorgi, 2020). Although ORF14 was not detected in 184 sequencing data analyzed in this study. Among the accessory proteins, ORF3a contains six functional domains (I to VI), which may link to viral infectivity, virulence, and ion channel formation (Issa et al., 2020). In a previous study based on 2782 whole-genome sequencing data, it has shown that 51 different non-synonymous amino acid substitutions in the 3a proteins (Issa et al., 2020). Whereas, non-synonymous 25 amino acid substitutions were detected in ORF3a in this study. Besides, 21 non-synonymous mutations were detected in ORF8.