Sequencing of SARS CoV2 in local transmission cases through the Oxford Nanopore MinION platform from Karachi, Pakistan

The first case of severe acute respiratory syndrome coronavirus 2 (SARS CoV2) infection was imported to Pakistan in February 2020; since then, 10,258 deaths have been recorded. The virus has been mutating, and local transmission cases from different countries vary due to host dependent viral adaptation. Many distinct clusters of variant SARS CoV2 have been defined globally. In this study, the epidemiology of SARS CoV2 was studied and locally transmitted SARS CoV2 isolates from Karachi were sequenced to compare them and identify any possible variants. Real time PCR was performed on nasopharyngeal specimens to confirm SARS CoV2, with the Orf1ab and E genes as targets. Viral sequencing was performed on the Oxford Nanopore Technologies MinION platform. Isolates from the first and second waves of the COVID-19 outbreak in Karachi were compared. The overall positivity rate for PCR was 26.24%, with the highest number of positive cases in June. Approximately 37.45% of PCR positive subjects were aged between 19 and 40 years. All the isolates belonged to the GH clade and shared the missense mutation D614G in the spike protein, which has been linked to an increased transmission rate worldwide. Another spike protein mutation, A222V, coexisted with D614G in the virus from the second wave of COVID-19. Based on the present findings, it is suggested that the locally transmitted virus from Karachi varies from those reported from other parts of Pakistan. Slight variability was also observed between viruses from the first and second waves. Variability in any potential vaccine target may result in failed trials; therefore, information on any local viral variants is always useful for effective vaccine design and/or selection.

Author's summary
Despite precautionary measures, the COVID-19 pandemic is causing deaths all over the world. Continuous mutation of the viral genome is making it difficult to design vaccines. Variability in the genome is host dependent, and data sharing has revealed that variants from different geographical locations may harbor different mutations. Keeping this in mind, the current study focused on the epidemiology of SARS CoV2 in symptomatic and asymptomatic COVID-19 suspected cases, including the impact of age and gender. The locally transmitted SARS CoV2 isolates from Karachi were sequenced to compare them and identify any possible variants. The sequenced viral genomes varied from the sequences already submitted from Pakistan, thereby confirming that slightly different viruses were causing infections during different time periods in Karachi. All belonged to the GH clade with the D614G, P323L and Q57H mutations. The virus from the second wave also carried the A222V mutation, making it more distinct. This information can be useful in selecting or designing a vaccine.


Introduction:
Corona virus disease (COVID-19) is a transmissible infectious disease caused by a newly emerged betacoronavirus, SARS CoV2, which originated as a result of viral spillover from animals [1]. It is a positive sense, enveloped, single stranded RNA virus of the genus Betacoronavirus [2]. The exact origin of this virus, whether from bat, pangolin or any other mammal, is still under debate [1,3]. Metagenomic analysis of SARS CoV2 has revealed that it is a distinct virus very closely related to SARS CoV. The COVID-19 pandemic started in December 2019 after the first reports from Wuhan, China [3,4]. To date it has affected 188 countries with a death rate of 2.31%. The death rate for SARS CoV2 infection is lower than that for SARS CoV, but its transmission capability is higher [5]. The global elderly population was most affected, with higher mortality due to acute respiratory distress syndrome (ARDS). The first SARS CoV2 genome was sequenced and published in December 2019 by Wang et al [6]. The genome of SARS CoV2 is ~29.9 kb long, with orf1ab at the 5' end and the spike protein (S), envelope protein (E) and matrix protein (M) coded at the 3' end. Seven viral accessory proteins are also coded by the ORF3a, ORF6, ORF7a, ORF7b, ORF8a, ORF8b and ORF10 genes [7]. The virus has sixteen non-structural proteins (NSP1-NSP16). The infection initiates with lower respiratory discomfort, which progresses to pneumonia, often causing sudden death [4,8]. The virus establishes itself by binding through its receptor binding spike protein to angiotensin-converting enzyme 2 (ACE2) receptors in the lungs [4]. Data from various studies support that a virus induced exaggerated immune reaction, or cytokine storm, extensively damages the tissues of ACE2 receptor expressing organs in the host [2].
In the recent outbreak of COVID-19, asymptomatic carriers were capable of transmitting the virus to healthy humans. As reported earlier, during the SARS CoV outbreak in 2002/2003, variant viruses evolved due to possible transformation events within the host [9]. Likewise, many mutations have been reported in SARS CoV2 since its first report. A large number of SARS CoV2 sequences have been deposited in the respective repositories since the beginning of the pandemic [10,11]. Due to relatively lower mutation rates, different branches or clades have been defined [12]. The clinical significance of all these clades is yet to be defined. Data sharing is the key to understanding the emergence of variants and the geographic epidemiology of SARS CoV2 variants during the current pandemic [13]. Clinical correlation of these variants with disease transmission dynamics, treatment responsiveness and fatality rates has also been extensively studied [14]. Among the various sequencing platforms, third generation whole genome sequencing of SARS CoV2 with the Oxford Nanopore MinION technology has gained popularity. The advantage of this platform is that long reads of the virus genome are obtained, and the time to data acquisition and analysis is reduced compared to other methodologies [15]. The objective of the present study was to evaluate the epidemiology of SARS CoV2 in symptomatic and asymptomatic COVID-19 suspected cases, including the impact of age and gender. Furthermore, locally transmitted SARS CoV2 isolates from Karachi were sequenced to compare them and identify any possible variants.

Results:
Two thousand and sixty five (2065) PCR tests were performed from May to November 2020 at the COVID-19 Lab of NIBD. The overall positivity rate for PCR was 26.24%. The highest number of positive cases, with increased viral load (lower Ct values), was observed during the month of June (Fig 1a). The Ct value for SARS CoV2 ranged from 10.8 to 34.32 in June, with a median Ct value of 24.2 (Fig 1b). After the first wave of COVID-19 in Karachi, positive cases declined in August and September and then increased in October, which had the lowest median Ct value (20.21) of the May to November period (Fig 1a). A large number of patients with COVID-19 like symptoms were negative for SARS CoV2; their symptoms were probably caused by some other viral or bacterial infection. The commonest symptom was weakness in both the PCR positive and PCR negative groups. The frequency of SARS CoV2 positivity in males (27.5%) was slightly higher than in females (26.28%). Approximately 37.45% of PCR positive subjects were aged between 19 and 40 years.
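As a rough illustration of how the two summary statistics used here (overall positivity rate and monthly median Ct) can be derived from per-test records, a minimal Python sketch follows. The record layout and the handful of toy values are assumptions made for the example; they are not the study data, and the study itself used SPSS.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy records: (month, PCR result, Orf1ab Ct or None if negative).
# These are illustrative values only, not the study dataset.
records = [
    ("May", "positive", 27.4),
    ("May", "negative", None),
    ("June", "positive", 10.8),
    ("June", "positive", 24.2),
    ("June", "positive", 34.32),
    ("June", "negative", None),
    ("October", "positive", 20.21),
    ("October", "negative", None),
]


def positivity_rate(rows):
    """Percentage of all tests that were PCR positive."""
    positives = sum(1 for _, result, _ in rows if result == "positive")
    return 100.0 * positives / len(rows)


def median_ct_by_month(rows):
    """Median Ct of positive tests, grouped by month."""
    by_month = defaultdict(list)
    for month, result, ct in rows:
        if result == "positive" and ct is not None:
            by_month[month].append(ct)
    return {month: median(values) for month, values in by_month.items()}


if __name__ == "__main__":
    print(f"Positivity rate: {positivity_rate(records):.2f}%")
    for month, ct in median_ct_by_month(records).items():
        print(f"{month}: median Ct {ct}")
```

A lower median Ct in a given month corresponds to a higher average viral load among the positive samples of that month.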

Viral whole genome sequencing
Four samples with low Ct values <20 (range: 10.08-19.69) were selected. Of the four sequenced samples, two were from asymptomatic cases and the other two from patients with mild or moderate COVID-19. The age range of the selected cases was 22-58 years. Full viral genome sequences from these confirmed cases of local transmission of SARS CoV2 in Karachi during June (peak of the first wave) and October (initial phase of the second wave) were obtained through the ONT MinION platform. The sequencing details with GISAID accession numbers are listed in Table 1. The genome size obtained was 29,903 nucleotides, with depth of coverage between 2976 and 3653 (Table 1).

Phylogenetic profiling:
The whole genome sequences obtained were aligned with the reference genome and 564 global sequences. Strains NIBD 01-PAK-KHI and NIBD 02-PAK-KHI clustered with variants from Bangladesh and India, with descendants predominantly from Saudi Arabia, India and England. New nodes were defined for NIBD 03-PAK-KHI and NIBD 04-PAK-KHI, with divergence of 14 and 27 respectively. Both of these clustered with separate sequences from New Zealand that have a presumed ancestral connection with isolates from Germany.
Phylogenetic relatedness with previously submitted sequences from Pakistan, India, Saudi Arabia, the Netherlands, England and New Zealand was observed, with a pairwise mean genetic distance of 0.011.
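To make the pairwise mean genetic distance concrete, the sketch below computes it as the average p-distance (proportion of differing sites) over all pairs of aligned sequences. The choice of p-distance is an assumption for illustration; the distance model actually used in the study is not stated here, and the short sequences are toy data.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in VALID_BASES and b in VALID_BASES:
            compared += 1
            if a != b:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def mean_pairwise_distance(sequences):
    """Average p-distance over all unordered pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy alignment, for illustration only.
    aligned = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC"]
    print(round(mean_pairwise_distance(aligned), 4))
```

On a ~29.9 kb genome, a mean distance of 0.011 under this definition would correspond to roughly 1 differing site per 100 compared positions.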

Mutational analysis
The mutational analysis revealed the presence of 21 synonymous mutations, 15 non-synonymous mutations and 2 non-frameshift substitutions altogether, spanning the 5'UTR, spike, orf1ab, orf1a, orf3a, orf7a, orf8, orf10, N and M protein genes (Table 2). All four sequences were grouped in clade 20A (GISAID: GH), since the characteristic mutation in the 5'UTR, 241C>T, and the SNVs 3037C>T, 23403A>G (S-D614G) and 25563G>T (Q57H) were observed. As per the GISAID database, four previously submitted sequences from Pakistan also belonged to the GH clade, but the sub-lineage signatures vary, with 32 additional mutations observed in the sequences obtained during the present study from local transmission cases in Karachi (Table 2). Hence, all the genome sequences were unrelated to the previously reported cluster of SARS CoV2 from Pakistan except for four isolates, 3 from Islamabad and 1 from Kohat, belonging to GH clade lineage B (Table 1; Fig 3). NIBD4-PAK-KHI, obtained from a health care worker, varied from the other isolates and had the highest number of mutations (Table 2).
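The relationship between a genomic coordinate such as 23403A>G and its protein-level name (D614G) can be sketched as follows. The gene start coordinates and reference codons are assumptions taken from the commonly used reference genome NC_045512.2 and should be checked against the annotation actually used; the helper names are hypothetical.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Assumed 1-based start coordinates of the coding sequences on NC_045512.2.
CDS_START = {"S": 21563, "ORF3a": 25393}


def locate(genome_pos, cds_start):
    """Return (residue number, 0-based index within the codon)."""
    offset = genome_pos - cds_start
    return offset // 3 + 1, offset % 3


def amino_acid_change(gene, genome_pos, ref_codon, alt_base):
    """Name the substitution caused by a single-base change.

    ref_codon is the reference codon covering genome_pos (an assumed input).
    """
    residue, index = locate(genome_pos, CDS_START[gene])
    alt_codon = ref_codon[:index] + alt_base + ref_codon[index + 1:]
    return f"{CODON_TABLE[ref_codon]}{residue}{CODON_TABLE[alt_codon]}"


if __name__ == "__main__":
    print(amino_acid_change("S", 23403, "GAT", "G"))      # spike, 23403A>G
    print(amino_acid_change("S", 22227, "GCT", "T"))      # spike, 22227C>T
    print(amino_acid_change("ORF3a", 25563, "CAG", "T"))  # orf3a, 25563G>T
```

In practice this translation is done by an annotation tool against the full reference; the sketch only shows the coordinate arithmetic behind the mutation names.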
An additional clade GV specific nonsynonymous variant, 22227C>T (A222V) in the spike protein, was also present in NIBD4-PAK-KHI along with D614G. About 7% of all GISAID sequences belonged to the GV clade, which is characterized by the presence of this SNV. The virus NIBD4-PAK-KHI is the first variant of the GH clade harboring 22227C>T isolated in October from Asia. Close but divergent sequence homology was detected with a variant virus from New Zealand collected in November 2020 (GISAID accession no: EPI_ISL_682284; Fig 3).
[15]. To the best of our knowledge, this is the first study from Pakistan on the SARS CoV2 genome exploiting the potential of this third generation sequencing method.
The evolution of SARS CoV2 is nonrandom and human host dependent variants are evolving.
SARS CoV2 proteins are heterogeneous, with a large number of variable amino acid substitutions that have either no impact or a significant impact on viral transmission and transcription within the human host [17]. Many prevalent mutations have been defined for SARS CoV2, with a signature hotspot mutation for each distinct clade [18]. The lineages of the studied viruses were defined through the PANGOLIN pipeline, and all were placed in lineage B.1, with sub-lineage differences listed in Table 1. 23403A>G is a widely documented hotspot mutation in the spike protein which replaces aspartic acid with glycine at position 614, altering the viral antigenic properties [19]. Variants with D614G evade initial immune recognition by the host, resulting in production of autoantibodies, and facilitate higher transmission, infectivity and case fatality rate (CFR). All the sequenced viruses of the current study had this mutation along with Q57H in the orf3a gene, grouping them in Nextstrain clade 20A (GISAID: GH; Table 2). The coexisting Q57H has previously been reported to reduce the virulence of D614G; therefore, the prevalence of the GH clade in Pakistan may be a probable reason for the low mortality rate (2.1%) for COVID-19 cases during July to September. Furthermore, the Q57H amino acid substitution causes truncation of the orf3b gene via introduction of a stop codon at amino acid 13, giving rise to variants deficient in full length orf3b [20,21]. These variants were prevalent in Asian and North American countries, including Saudi Arabia, Indonesia, South Korea, Israel, Egypt, USA and Colombia [5,22,23]. All the Pakistani viruses sequenced between June and November 2020 belonged to the GH clade with the Q57H substitution (Fig 3). The orf3b is a potential serological target for most vaccines; hence, this serological difference should be taken into account while selecting a vaccine for SARS CoV2 in Pakistan.
Another prevalent mutation in Pakistani GH isolates is the non-frameshift substitution P323L in NSP12, which has been linked to higher severity [20]. Almost 100% coexistence of D614G, P323L and C241T has been reported, which is in line with the present observations. This coexistence favors viral replication, infectivity, transmission and manipulation of the host machinery [24]. Of the 39 mutations observed in the SARS CoV2 genomes sequenced during the present study, 33 SNVs were not observed in previously submitted sequences from Pakistan, defining a separate local cluster of SARS CoV2 accumulating within Karachi (Fig 2). This may be because of host driven genetic drift within the viral genome of locally transmitted SARS CoV2 [25].
The spike A222V mutation was reported from Spain during spring 2020, and clade GV (20A EU1) of SARS CoV2 was defined with increased infectivity and fatality potential [26]. This variant has been reported with a low prevalence rate of 1.5% in healthcare workers globally. In either case, caution is warranted, since the infectivity potential of this variant is higher throughout Europe.

Conclusion:
Hence, it can be concluded that the second wave of COVID-19 may not be clinically distinct, but host driven genetic evolution of the virus may impact its infectivity and CFR. Future comparative genomic studies with a substantial number of viruses from the first and second waves are suggested in order to understand the possible evolutionary origin, genomic variability and trajectory of anticipated future waves.

Methods:
This study was conducted at National Institute of Blood Diseases & Bone Marrow

Variant calling and phylogenetic profiling
Variant calling was performed using BCFtools v. 1.9. Further variant annotation was done with ANNOVAR. The consensus sequences were generated by mapping the variants to the reference genome using BCFtools, followed by submission to the GISAID database and NCBI. The initial phylogenetic analysis was performed using 570 genome sequences retrieved from GISAID EpiCoV™. Fast alignment was performed using MAFFT.
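The steps above could be scripted along the following lines. This is only a sketch of one common BCFtools and MAFFT workflow (pileup, call, index, consensus, then alignment); the exact subcommands and options used in the study are not stated, and all file names are hypothetical. The functions only assemble the shell command lines and do not run them.

```python
def variant_calling_commands(reference, bam, prefix):
    """Build shell command lines for calling variants and making a consensus.

    Assumed workflow: bcftools mpileup | call, then index, then consensus.
    """
    vcf = f"{prefix}.vcf.gz"
    return [
        # Pile up reads against the reference and call variant sites only.
        f"bcftools mpileup -f {reference} {bam} | bcftools call -mv -Oz -o {vcf}",
        # Index the compressed VCF so that consensus can read it.
        f"bcftools index {vcf}",
        # Apply the called variants to the reference to get a consensus genome.
        f"bcftools consensus -f {reference} -o {prefix}.consensus.fa {vcf}",
    ]


def alignment_command(fasta, output):
    """Build a MAFFT command line that lets MAFFT pick its own strategy."""
    return f"mafft --auto {fasta} > {output}"


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for line in variant_calling_commands("reference.fa", "sample.bam", "sample"):
        print(line)
    print(alignment_command("all_genomes.fa", "all_genomes.aligned.fa"))
```

Printing the commands first makes it easy to review them before running them in a shell or passing them to a workflow manager.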

Statistical analysis
Median Ct values per month for Orf1ab gene were recorded from May to November 2020.
For statistical analyses and graphs SPSS version 22 was used.