Consistent and High-Frequency Identification of an Intra-Sample Genetic Variant of SARS-CoV-2 with Elevated Fusogenic Properties

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has a non-segmented, positive-sense, single-stranded RNA genome of ~30,000 nucleotides. Although its RNA-dependent RNA polymerase exhibits exonuclease proofreading activity, viral sequence diversity can be induced by replication errors and host factors. These variations can be observed in the population of viral sequences isolated from infected host cells and are not necessarily reflected in the genome of transmitted founder viruses. We profiled intra-sample genetic diversity of SARS-CoV-2 variants using 15,289 high-throughput sequencing datasets from infected individuals and infected cell lines. Most of the genetic variations observed, including C->U and G->U, were consistent with errors due to heat-induced DNA damage during sample processing and/or sequencing protocols. Despite this high mutational background, we confidently identified intra-variable positions recurrent in the samples analyzed, including several positions at the end of the gene encoding the viral S protein. Notably, most of the samples possess a C->A nonsense mutation resulting in an S protein lacking the last 20 amino acids (SΔ20). Here we demonstrate that SΔ20 exhibits increased cell-to-cell fusion and syncytium formation. Our findings suggest the consistent emergence of high-frequency viral quasispecies that are not horizontally transmitted but are involved in intra-host infection and spread.

Author summary

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and its associated disease, COVID-19, have caused significant worldwide mortality and an unprecedented economic burden. Here we studied the intra-host genetic diversity of SARS-CoV-2 genomes and identified a high-frequency and recurrent nonsense mutation yielding a truncated form of the viral spike protein, both in human COVID-19 samples and in cell culture experiments. Using a functional assay, we observed that this truncated spike protein displays an elevated fusogenic potential and forms syncytia. Given the high frequency at which this mutation independently arises across various samples, it can be hypothesized that this truncating mutation provides a selective advantage to viral replication and may also have a role in pathogenesis in humans.


sequences, and the geographical distribution of clades was established. Because they induce an abundance of missense rather than synonymous or nonsense mutations, it was suggested that regions of the SARS-CoV-2 genome were actively evolving and might contribute to pandemic spreading [20]. It was observed that variations are mainly comprised of transition mutations (purine->purine or pyrimidine->pyrimidine) with a prevalence of C->U transitions and might occur within a sequence context reminiscent of APOBEC-mediated deamination (i.e., [AU]C[AU]; [21,22]). Consequently, it was proposed that host editing enzymes might be involved in coronavirus genome editing [23,24].

Consensus mutations are only part of the genetic landscape of RNA viruses. Replication of RNA viruses typically produces quasispecies in which the viral RNA genomes do not exist as a single sequence entity but as a population of genetic variants [25]. These mutations are most frequently caused by the error-prone nature of the viral RdRps and by host RNA editing enzymes, such as APOBECs and ADARs [26]. However, the RdRp complex of large RNA viruses, such as coronaviruses, sometimes possesses exonuclease proofreading activity and consequently has a lower error rate [25,27]. Quasispecies may sometimes exhibit diminished replicative fitness or deleterious mutations and exert different roles that are not directly linked to viral genomic propagation [28]. Mutations that form the intra-host genetic spectrum have been shown to help viruses evade cytotoxic T cell recognition and neutralizing antibodies and also render viruses more resistant to antiviral drugs [28]. These mutations can also be involved in modulating the virulence and transmissibility of the quasispecies [28].

In this study, we focused on assessing intra-genetic variations of SARS-CoV-2. We analyzed high-throughput sequencing datasets to profile the sequence diversity of SARS-CoV-2 variants within distinct sample populations. We observed high genetic intra-variability of the viral genome. By comparing variation profiles between samples from different donors and cell lines, we identified highly conserved subspecies that independently and recurrently arose in different datasets and, therefore, in different individuals. We further analyzed the dominant variant SΔ20 in a functional assay and demonstrate that this truncated spike protein enhances syncytium formation. Here we provide evidence for the existence of a consistently emerging variant identified across geographical regions that may influence intra-host SARS-CoV-2 infectivity and pathogenicity.

High intra-genetic variability of the SARS-CoV-2 genome in infected individuals.

To assess the extent of SARS-CoV-2 sequence intra-genetic variability, we analyzed 15,224 publicly available high-throughput sequencing datasets from infected individuals. The raw sequencing reads were mapped to the SARS-CoV-2 isolate Wuhan-Hu-1 reference genome, and the composition of each nucleotide at each position on the viral genome was generated. Consensus

The analysis of the type of nucleotide changes within samples revealed that 52.2% were transitions (either purine->purine or pyrimidine->pyrimidine) and 47.8% were transversions (purine->pyrimidine or pyrimidine->purine). Notably, the most frequent nucleotide variations were C->U transitions (43.5%), followed by G->U transversions (28.1%) (Fig 1B); together, these two types account for 71.6% of all variations. Since editing by host enzymes depends on the sequence context, we extracted two nucleotides upstream and downstream from each genomic position corresponding to variations and generated sequence logos. Our results indicated a high number of As and Us around all variation types and sites (62.1 ± 3.4%; Fig 1B). Because the SARS-CoV-2 genome is composed of 62% A/U, this suggests that the observed number of As and Us around variation sites is mainly due to the A/U content of the viral genome, that no motifs are enriched around these sites and that these intra-genetic variations are likely not originating from host editing [...] Table 1). Amongst these, four transversions (at nt 25,324, 25,334, 25,336 and 25,337) located at the 3' end of the S gene are the most recurrent variations (inset of Fig 1C and Table 1).
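The classification and sequence-context steps described above can be illustrated with a minimal sketch. It assumes that each variation is available as a (1-based position, reference base, variant base) tuple against a reference written in the RNA alphabet. The function names, the toy sequence and the toy variations are hypothetical and are not taken from the pipeline used in this study.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "U"}


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base change."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def flanking_context(genome: str, pos: int, width: int = 2) -> str:
    """Return `width` bases upstream plus `width` bases downstream of a
    1-based position; the variable base itself is excluded."""
    i = pos - 1
    if i - width < 0 or i + width >= len(genome):
        raise ValueError("position too close to a genome end")
    return genome[i - width:i] + genome[i + 1:i + 1 + width]


def au_fraction(seq: str) -> float:
    """Fraction of A and U bases in a sequence."""
    return sum(base in "AU" for base in seq) / len(seq)


# Hypothetical toy data (not SARS-CoV-2 sequence).
toy_genome = "AUGCUAACGUUAGC"
toy_variations = [(4, "C", "U"), (9, "G", "U"), (7, "A", "G")]

for pos, ref, alt in toy_variations:
    context = flanking_context(toy_genome, pos)
    kind = classify_substitution(ref, alt)
    print(pos, f"{ref}->{alt}", kind, context, f"{au_fraction(context):.2f}")
```

Comparing the average A/U fraction of such flanking windows with the genome-wide A/U content is one simple way to check whether the surrounding bases are enriched beyond the background composition.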

To further investigate variations in a more controlled system, and to determine whether [...] ACE2 were co-transfected with a plasmid expressing GFP and with or without a plasmid expressing wild-type S or SΔ20 under a cytomegalovirus (CMV) major immediate early promoter. As expected, in the absence of the S protein (i.e., pCAGGS alone), syncytia formation was not observed (Fig 5A). [...] (Fig 5A and 5B). Our results indicate not only that SΔ20 also induces fusion, but also that the resulting cytoplasmic masses are larger than those induced by the wild-type S protein (Fig 5A and 5B). To complement this [...]

[...] Oxo-dG), that could cause high levels of C->A and G->U mutations and promote the hydrolytic deamination of C->U [30-33,35,37,39,40]. It was previously reported that these types of mutations occur at low frequency, that they are mostly detected when sequencing is performed on only one DNA strand and that they are highly variable across independent experiments [32,34].

Consequently, most transversions observed in our analysis are likely due to heat-induced damage, RNA extraction, storage, shearing and/or RT-PCR amplification errors. However, we identified several positions with intra-sample variability recurrent in several independent samples, both from infected individuals and infected cells. They were detected at moderate to high frequencies, ranging from 2.5% to 39.3% per sample (Table 1 and [...]

[...] represents the distribution using red dots to represent the samples having this intra-genetic variation and a blue violin to show the distribution of the data.
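To make the notion of a recurrent intra-sample variable position concrete, the following minimal sketch computes, for one genomic position, the fraction of reads supporting each non-reference base in each sample and then counts the samples in which a base reaches a given frequency. The 2.5% cutoff simply echoes the lower bound of the frequency range reported above and is used here for illustration only; the sample names, read counts and function names are hypothetical.

```python
from collections import Counter


def variant_frequencies(counts: dict[str, int], ref_base: str) -> dict[str, float]:
    """Fraction of reads supporting each non-reference base at one position."""
    depth = sum(counts.values())
    if depth == 0:
        return {}
    return {
        base: n / depth
        for base, n in counts.items()
        if base != ref_base and n > 0
    }


def recurrent_variants(samples, ref_base, min_freq=0.025):
    """Count the samples in which each non-reference base reaches `min_freq`.

    `min_freq` is an illustrative cutoff, not a threshold stated in the text.
    """
    recurrence = Counter()
    for counts in samples.values():
        for base, freq in variant_frequencies(counts, ref_base).items():
            if freq >= min_freq:
                recurrence[base] += 1
    return recurrence


# Hypothetical read counts at a single position whose reference base is C.
toy_samples = {
    "sample_1": {"C": 900, "A": 100},
    "sample_2": {"C": 700, "A": 300},
    "sample_3": {"C": 995, "A": 1, "U": 4},
}

print(recurrent_variants(toy_samples, ref_base="C"))
```

In this toy input, the A variant passes the cutoff in two of the three samples, whereas the low-level A and U reads in the third sample fall below it, as would be expected for sporadic background errors.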

Bars represent averages ± standard deviations of five independent experiments, and the p-value (pv) was calculated using a paired t-test. Schematic of BiFC created with BioRender.com.
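For readers unfamiliar with the paired t-test mentioned in this legend, the sketch below shows how the test statistic is obtained from matched measurements across independent experiments. The values are invented for illustration and are not data from this study, and the function name is hypothetical. Only the t statistic and its degrees of freedom are computed here; the p-value would then be read from Student's t distribution, for example with SciPy's `scipy.stats.ttest_rel`.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b.

    The test works on the per-pair differences: t is their mean divided
    by their standard error. Raises ZeroDivisionError if all differences
    are identical, since the standard error is then zero.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Invented measurements from five paired experiments (illustration only).
toy_wild_type = [10, 12, 11, 13, 9]
toy_truncated = [13, 14, 15, 15, 13]

t_value, dof = paired_t_statistic(toy_truncated, toy_wild_type)
print(f"t = {t_value:.3f}, degrees of freedom = {dof}")
```

Pairing matters because each experiment contributes one measurement per condition, so between-experiment variability cancels out in the differences.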