Abstract
To date, the SARS-CoV-2 genome has been considered genetically more stable than SARS-CoV or MERS-CoV. Here we report a 382-nt deletion covering almost the entire open reading frame 8 (ORF8) of SARS-CoV-2 obtained from eight hospitalized patients in Singapore. The deletion also removes the ORF8 transcription-regulatory sequence (TRS), which in turn enhances the downstream transcription of the N gene. We also found that viruses with the deletion have been circulating for at least four weeks. During the SARS-CoV outbreak in 2003, a number of genetic variants were observed in the human population [1], and similar variation has since been observed across SARS-related CoVs in humans and bats. Overwhelmingly these viruses had mutations or deletions in ORF8, that have been associated with reduced replicative fitness of the virus [2]. This is also consistent with the observation that towards the end of the outbreak sequences obtained from human SARS cases possessed an ORF8 deletion that may be associated with host adaptation [1]. We therefore hypothesise that the major deletion revealed in this study may lead to an attenuated phenotype of SARS-CoV-2.
On 1 December 2019, a novel coronavirus emerged from Hubei province in China and infected people visiting the Huanan seafood market in Wuhan [3]. The virus demonstrated efficient human-to-human transmission within mainland China and subsequently spread across many countries. The virus was soon identified as 2019-nCoV, more recently designated SARS-CoV-2, while the disease is referred to as COVID-19. On 30 January 2020, the World Health Organization declared a Public Health Emergency of International Concern. As of 10 March 2020, the COVID-19 outbreak has led to 113,702 confirmed cases and 4,012 deaths globally [4]. Zoonotic viruses have crossed the species barrier from animals to humans and the success of an interspecies transmission is the ability of a novel virus to adapt, via genetic mutation events, to a new host and cause sustained transmission that can lead to a significant outbreak.
Nasopharyngeal swabs collected from hospitalized patients positive for SARS-CoV-2 in Singapore were subjected to next generation sequencing (NGS) analysis, with and without passaging in Vero-E6 cells. The NGS data revealed a 382-nt deletion towards the 3’ end of the viral genomes obtained from multiple patients (Suppl. Table 1). To confirm this observation, specific PCR primers were designed flanking the deleted region and Sanger sequencing performed (Suppl. Fig. 1). This verified the deletion at positions 27,848 to 28,229 of the SARS-CoV-2 genome. Interrogation of the NGS assemblies of these 382-nt deletion variants (referred to hereafter as Δ382) indicated that the virus populations were homogenous.
Apart from the 382-nt deletion, the genome organization of the Δ382 viruses is identical to that of other SARS-CoV-2 (Fig. 1A). Closer examination of the deletion indicated that it spans an area of the ORF7b and ORF8 that includes the ORF8 transcriptional regulator sequence (TRS), eliminating ORF8 transcription (Fig. 1B). The ORF8 region has been identified as an evolutionary hotspot of SARSr-CoVs. For instance, sequences of human SARS-CoV TOR2 and LC2 exhibit 29-nt and 415-nt deletions, whereas those of bat SARS-CoV JTMC15 have discontinuous deletions in ORF7/8 (Fig. 1B) [5-8]. The ORF8 region of SARS-CoVs has been shown to play a significant role in adaptation to the human host following interspecies transmission [7] and virus replicative efficiency [2]. Similarly the disruption of ORF8 region in Δ382 viruses may be a result of human adaptation after the emergence of SARS-CoV-2.
To investigate the possible effects of the TRS deletion, we calculated and compared the total number of transcripts per million (TPM) of each gene between wild-type (WT) and Δ382 viruses. For each gene, we counted unambiguous reads that uniquely mapped to the gene-specific transcripts that include the joint leader and TRS sequences [9]. We analysed three WT and two Δ382 sequences for which two independent NGS library preparations were available. Our results indicate differential patterns of TPM level between WT and Δ382 viruses (Fig. 2A). The Δ382 sequences displayed a greater level of TPMs in the ORF6, ORF7a and N genes compared to WT. More specifically, the Δ382 N gene, downstream of ORF8, showed a significant increase in the TPM level. To further determine the impact of the 382-nt deletion on the transcription of the N gene, we conducted qPCR to measure the relative abundance of the RNAs from 2 genes (E and N) located upstream and downstream of the deleted region, respectively. Results shown in Figure 2B confirm the enhancement of N gene transcription in Δ382 viruses. This enhancement was consistently observed regardless of the sample type, whether directly from swab samples or passaged viruses.
The global phylogeny of SARS-CoV-2 (Suppl. Fig. 2) revealed a comb-like appearance in the phylogenetic tree and a general lack of phylogenetic resolution, reflecting the high similarity of the virus genomes. This tree topology is a typical characteristic of a novel interspecies transmission, wherein the viruses are infecting immunologically naïve populations with sustained human-to-human transmission, as previously shown for the early stages of the 2009 H1N1 pandemic [10]. Notably, we observe an emerging virus lineage (tentatively designated as “Lineage 1” in Suppl. Fig. 2) that consists of sequences from various newly affected countries with recent virus introductions. All viruses from this recent lineage possess an amino acid substitution at residue 250 (from lysine to serine) on the ORF8 gene [11], which is known to undergo strong positive selection during animal-to-human transmission [7].
To estimate the divergence times among lineages of SARS-CoV-2, we reconstructed a dated phylogeny based on 137 complete genomes, including data from this study. The estimated rate of nucleotide substitutions among SARS-CoV-2 viruses is approximately at 8.68×10−4 substitutions per site per year (95% HPD: 5.44×10−4 –1.22×10−3), which is moderately lower than SARS-CoV ([12]: 0.80–2.38×10−3) and MERS-CoV (with a mean rate [13] of: 1.12×10−3 and 95% HPD: 58.76×10−4 –1.37×10−3) as well as human seasonal influenza viruses (ranging 1.0–5.5×10−3 depending on individual gene segments [10, 14-16]. In contrast, the estimated rate of SARS-CoV-2 viruses is greater than estimates for human coronaviruses by a magnitude order of 4×10−4 [17].
The mean TMRCA estimates indicate the introduction of SARS-CoV-2 into humans (Fig. 3. node A) occurred in the middle of November 2019 (95% HPD: 2019.77–2019.94) (Table 1), suggesting that the viruses were present in human hosts approximately one month before the outbreak was detected. A single amino-acid mutation was fixed on the ORF8 region of all full-genome Lineage 1 viruses (Fig. 3, node B), with the TMRCA estimate of 22 December 2019 (95% HPD: 2019.9–2020.0). The Δ382 viruses form a monophyletic group within Lineage 1, and dating estimates indicate that they may have arisen in the human population in Singapore around 07 February 2020 (95% HPD: 2020.12–2020.06), consistent with the date of collection of the Δ382 positive samples (17–19 February 2020). The three Δ382 viruses shared a high level of nucleotide similarities (99.9%) with only 5 nucleotide differences between them. Although the dated phylogeny indicates that Δ382 viruses clustered together and are possibly derived from a single source, there is a general lack of statistical support in SARS-CoV-2 phylogenies due to its recent emergence, and evolutionary inferences should be made with caution.
In this report we describe the first major evolutionary event of the SARS-CoV-2 virus following its emergence into the human population. Although the biological consequences of this deletion remain unknown, the alteration of the N gene transcription would suggest this may have an impact on virus phenotype. Recent work has indicated that ORF8 of SARS-CoV plays a functional role in virus replicative fitness and may be associated with attenuation during the early stages of human-to-human transmission [2]. Given the prevalence of a variety of deletions in the ORF8 of SARSr-CoVs, it is likely that we will see further deletion variants emerge with the sustained transmission of SARS-CoV-2 in humans. Future research should focus on the phenotypic consequences of Δ382 viruses in the transmission dynamics of the current epidemics and the immediate application of this genomic marker for molecular epidemiological investigation.
Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data availability are available at [Article DOI].
Data availability
The new sequences generated in this study have been deposited in GISAID database under accession numbers EPI_ISL_407987, EPI_ISL_407988, EPI_ISL_410535 to EPI_ISL_410537 for WT viruses and EPI_ISL_414378 to EPI_ISL_414380 for Δ382 viruses.
Author contributions
YCFS, DEA, LFW, GJDS designed and supervised research. BEY, SK, JGHL, LS, GZY, YSL, DCL collected and provided samples. DEA, ML, YZ conducted experiments. YCFS, ZF, JJ, IHM, GJDS performed analyses. YCFS, ZF, ML, LFW, GJDS wrote the paper. All authors reviewed and approved the manuscript.
Methods
Ethics statement
This study was undertaken as part of the national disease outbreak and the response and the protocols were approved by the ethics committee of the National Healthcare Group. Patient samples were collected under PROTECT (2012/00917), a multi-centred Prospective Study to Detect Novel Pathogens and Characterize Emerging Infections. Work undertaken at the Duke-NUS Medical School ABSL3 laboratory was approved by the Duke-NUS ABSL3 Biosafety Committee, National University of Singapore and Ministry of Health Singapore.
Virus culture, RNA extraction and sequencing
Clinical samples from infected patients were collected from public hospitals in Singapore from January–February 2020. Material from clinical samples was used to inoculate Vero-E6 cells (ATCC®CRL-1586™). Total RNA was extracted using E.Z.N.A. Total RNA Kit I (Omega Bio-tek) according to manufacturer’s instructions and samples analysed by real-time quantitative reverse transcription-PCR (RT-qPCR) for the detection of SARS-CoV-2 as previously described [18]. Whole genome sequencing was performed using next-generation sequencing (NGS) methodology. The cDNA libraries were constructed using TruSeq RNA Library Prep Kit (Illumina) according to the manufacturer’s instructions and sequenced on an Illumina MiSeq System. Raw NGS reads were trimmed by Trimmomatic v0.39 [19] to remove adaptors and low-quality bases. Genome sequences were assembled and consensus sequences obtained using the BWA algorithm in UGENE v.33. To verify the presence of the deletion in the SARS-CoV-2 genome, we designed two specific PCR primers (F primer: 5’-TGTTAGAGGTACAACAGTACTTT-3’ and R primer: 5’-GGTAGTAGAAATACCATCTTGGA-3’) targeting the ORF7–8 regions. For samples with low Ct values, a hemi-nested PCR was then performed with primers (5’-TGTTTATAACACTTTGCTTCACA-3’) and (5’-GGTAGTAGAAATACCATCTTGGA-3’). The PCR mixture contained the cDNA, primers (10µM each), 10x Pfu reaction buffer (Promega), Pfu DNA polymerase (Promega) and dNTP mix (10mM, Thermo Scientific). The PCR reaction was carried out with the following conditions: 95°C for 2 min, 35 cycles at 95°C for 1 min, 52°C for 30 sec, 72°C for 1 min and a final extension at 72°C for 10 min in a thermal cycler (Applied Biosystems Veriti). Deletions in the PCR products were visualized by gel electrophoresis and confirmed by Sanger sequencing. Three full Δ382 genomes were generated and designated BetaCoV/Singapore/12/2020, BetaCoV/Singapore/13/2020 and BetaCoV/Singapore/14/2020 along with five WT SARS-CoV-2 genomes previously generated in our laboratories and deposited in GISAID (see Suppl. Table 1).
Genomic characterization
To characterize and map the deletion regions of SARS-CoV-2 viruses, we compared with available SARS-CoV-2 and SARs-CoV related genomes from bat SARs-CoV-JTMC15 (accession number: KY182964) and humans SARs-CoV-Tor 2 (accession number: AY274119) and LC2 (accession number: AY394999). The viral genome organizations of Wuhan-Hu-1 (accession number: MN908947) and Singapore SARS-CoV-2 (Singapore/2 /2020: EPI_ISL_407987) comprise the following gene order and lengths: ORF1ab (open-reading frame) replicase (21291 nt), spike (S: 3822 nt), ORF3 (828 nt), envelope (E: 228 nt), membrane (M: 669 nt), ORF6 (186 nt), ORF7ab (498 nt), ORF8 (366 nt) and nucleocapsid (N: 1260 nt).
To understand possible effects of the TRS deletion on the ORF8 region, we analysed the total number of transcripts per million (TPM) of each gene between wildtype and Δ382 variants. For each gene, unambiguous reads which uniquely mapped to the specific joint leader and TRS sequences are counted. We analysed three patients of SARS-CoV-2 wild type and two patients of SARS-CoV-2 variants and independent NGS library preparations were performed twice for each sample. The Wuhan-Hu-1 was used as the reference sequence for genomic coordinates. Leader RNA sequence is situated at the 5’ UTR region of the genome and the core sequence consists of 26 nt: UCUCUAAACGAACUUUAAAAUCUGUG. Each of the coding genes is preceded by a core transcription regulatory sequence (TRS) which is highly conserved (i.e. ACGAAC), resembling the leader core sequence. The leader RNA is joined to the TRS regions of different genes and the joining is required for driving subgenomic transcription. Coding sequence (CDS), leader and TRS sequences annotations were generated in Geneious and followed published SARS-CoV studies [9]. To characterize the differential levels of TRS for each gene, 70 nt leader sequence and 230 nt downstream of each TRS sequence were annotated individually to form a 300bp leader-TRS transcript for the splicing-aware aligners in R package rtracklayer (v 1.44). NGS raw fastq reads were then mapped to the reference genome by Geneious RNA-Seq mapper with the annotation of splice-junctions for the leader-TRS. For each gene, the TPM of leader-TRS was then calculated in Geneious by excluding ambiguous reads which may come from other TRS. The resulting TPM data was generated in Geneious and graphs plotted using ggplot 2 (v3.2.0) in R v3.6.1. Wilcoxon one-sided tests were performed in R to test the significant differences between WT and Δ382 samples. We also compared individual Ct values of E and N genes between WT and Δ382 viruses using qRT-PCR assays as previously described [18]. Samples used in this analysis included nasopharyngeal swabs from three WT patient samples and both nasopharyngeal swabs and passaged samples from two Δ382 patients (n=7). Significance between groups was assessed using a one-way analysis of variance (ANOVA) in PRISM v8.3.1.
Phylogenetic analyses
All available genomes of SARS-CoV-2 with associated virus sampling dates were downloaded from the GISAID database. Genome sequence alignment was performed and preliminary maximum likelihood phylogenies of complete genome were reconstructed using RAxML with 1,000 bootstrap replicates in Geneious R9.0.3 software (Biomatters Ltd). Any sequence outliers were removed from subsequent analyses. To reconstruct a time-scaled phylogeny, an uncorrelated lognormal relaxed-clock model with an exponential growth coalescent prior and the HKY85+Γ substitution model was used in the program BEAST v1.10.4 [20] to simultaneously estimate phylogeny, divergence times and rates of nucleotide substitution. Four independent Markov Chain Monte Carlo (MCMC) runs of 100 million generations were performed and sampled every 10,000 generations. The runs were checked for convergence in Tracer v1.7 [21] and that effective sampling size (ESS) values of all parameters was >200. The resulting log and tree files were combined after removing appropriate burn-in values using LogCombiner [20], and the maximum clade credibility (MCC) tree was subsequently generated using TreeAnnontator [20].
Acknowledgements
This study was supported by the Duke-NUS Signature Research Programme funded by the Ministry of Health, Singapore, the National Medical Research Council under its COVID-19 Research Fund (NMRC Project No. COVID19RF-001) and by National Research Foundation Singapore grant NRF2016NRFNSFC002-013 (Combating the Next SARS-or MERS-Like Emerging Infectious Disease Outbreak by Improving Active Surveillance). We thank Viji Vijayan, Benson Ng and Velraj Sivalingam of the Duke-NUS Medical School ABSL3 facility for logistics management and assistance. We thank all scientific staff who assisted with processing clinical samples, especially Velraj Sivalingam, Adrian Kang, Randy Foo, Wan Ni Chia and Akshamal Gamage. We thank Su Ting Tay, Ming Hui Lee and Angie Tan from the Duke-NUS Medical School Genome Biology Facility for expert and rapid technical assistance.