Limited SARS-CoV-2 diversity within hosts and following passage in cell culture

Since the first reports of pneumonia associated with a novel coronavirus (COVID-19) emerged in Wuhan, Hubei province, China, there have been considerable efforts to sequence the causative virus, SARS-CoV-2 (also referred to as hCoV-19) and to make viral genomic information available quickly on shared repositories. As of 30 March 2020, 7,680 consensus sequences have been shared on GISAID, the principal repository for SARS-CoV-2 genetic information. These sequences are primarily consensus sequences from clinical and passaged samples, but few reports have looked at diversity of virus populations within individual hosts or cultures. Understanding such diversity is essential to understanding viral evolutionary dynamics. Here, we characterize within-host viral diversity from a primary isolate and passaged samples, all originally deriving from an individual returning from Wuhan, China, who was diagnosed with COVID-19 and subsequently sampled in Wisconsin, United States. We use a metagenomic approach with Oxford Nanopore Technologies (ONT) GridION in combination with Illumina MiSeq to capture minor within-host frequency variants ≥1%. In a clinical swab obtained from the day of hospital presentation, we identify 15 single nucleotide variants (SNVs) ≥1% frequency, primarily located in the largest gene – ORF1a. While viral diversity is low overall, the dominant genetic signatures are likely secondary to population size changes, with some evidence for mild purifying selection throughout the genome. We see little to no evidence for positive selection or ongoing adaptation of SARS-CoV-2 within cell culture or in the primary isolate evaluated in this study. Author Summary Within-host variants are critical for addressing molecular evolution questions, identifying selective pressures imposed by vaccine-induced immunity and antiviral therapeutics, and characterizing interhost dynamics, including the stringency and character of transmission bottlenecks. Here, we sequenced SARS-CoV-2 viruses isolated from a human host and from cell culture on three distinct Vero cell lines using Illumina and ONT technologies. We show that SARS-CoV-2 consensus sequences can remain stable through at least two serial passages on Vero 76 cells, suggesting SARS-CoV-2 can be propagated in cell culture in preparation for in-vitro and in-vivo studies without dramatic alterations of its genotype. However, we emphasize the need to deep-sequence viral stocks prior to use in experiments to characterize sub-consensus diversity that may alter outcomes.

To understand how serial passaging SARS-CoV-2 affects genomic variation, we sequenced 118 virus populations after each passage using the same SISPA metagenomics approach we used 119 to characterize the original biological specimen. Passaged sample names and cell lines are 120 described in the methods. An in-house pipeline (available at: 121 https://github.com/katarinabraun/SARSCoV2_passage_MS) was applied to trim out primer 122 sequences, bioinformatically deplete host reads, and generate alignment files, which contained 123 all reads mapping to the SARS-CoV-2 Madison reference genome (MT039887.1). At the 124 consensus level, SARS-CoV-2 does not accumulate genetic variation after two passages on 125 Vero 76 cells (Fig 2). We also examined deletions ≥ 1% frequency and ≥ 3 nt in length. We found 126 no evidence of deletions that fit these criteria in any of the cell culture isolates. 127

Most minor variants are found in the largest genes -ORF1a and ORF1b 128
To characterize patterns of sub-consensus diversity, we looked at SNVs at or above 1% 129 frequency in only the Illumina reads. We previously established that this conservative cutoff 130 ensures that only bona fide mutations are considered [15,16]. All minor variant analyses and 131 figures were completed using the Illumina SNV data as these data are higher average quality 132 and ideal for analysis involving low-frequency variants (Fig 3). Seventy-five percent of all minor 133 variants we identify fall in ORF1a and ORF1b, which together take up 72.8% of the length of the 134 28kb coding genome. ORF1a and ORF1b encode the replicase machinery [7]. We account for 135 differences in gene size by normalizing variants to kilobase gene length (variants / kb-gene-136 length -"v/kbgl") [10]. The highest density of variants was reported in smaller genes like 137 envelope, ORF7a, and ORF10 (S2 Table). We also show that through each passage, variant 138 density in ORF1a and ORF1b increases. There were no SNVs ≥ 1% in the spike gene in the 139 primary NP swab, but low-frequency SNVs (all <5%) were identified in spike following passage 140 in cell culture (Fig 3). Outside of ORF1a and ORF1b, the other genes in the primary NP swab are clonal above the 1% threshold, with the exception of one low-frequency SNV   In addition to assessing the fate of individual minor variants, we were also interested in 188 evaluating population dynamics using diversity metrics. Specifically, we calculated genewise 189 diversity using π , the average number of pairwise differences per nucleotide site among a set of sequences, for each gene in each sample. Overall, genewise nucleotide diversity is very low 191 compared to other RNA viruses, consistent with low mutation rates in coronaviruses due to RNA 192 proofreading machinery [18,19]. Genewise diversity was very low in the primary NP swab and 193 was only measurable in ORF1a (9 SNVs), ORF1b (5 SNVs) and N (1 SNV). Genewise diversity 194 is more varied in the passaged samples (Fig 5). Interestingly, π is highest in ORF7a in these 195 samples -although this signal seems to be primarily driven by the small size of this gene. To Vero 76 ORF10, p1 Vero E6 envelope (E), and p1 Vero STAT-1 KO ORF3a. 205

Comparison of Illumina and ONT ability to capture minor variant frequencies 206
We examined the concordance between SNV calls at the same sites, irrespective of frequency, 207 determined by Illumina and ONT workflows. To begin, we used a stringent cutoff of 10% 208 frequency for ONT SNVs. We then called variants at percentage frequencies decreasing by 209 0.5% (eg. calling 8% variants, then 7.5%, etc) until the variants called by ONT no longer 210 matched Illumina variants irrespective of frequency at these sites (Fig 6, S1  Illumina and ONT at lower frequencies in the passaged samples because viral titer in vitro 219 typically exceeds viral titer in vivo resulting in higher average coverage in the passaged samples 220 required to support minor variant calls at lower frequencies. Oxford Nanopore kits SQK-LSK109. Samples were barcoded using the Oxford Nanopore Native 327 Barcodes (EXP-NBD104 and EXP-NBD114), and pooled to a total of 140ng prior to being run 328 on the appropriate flow cell (FLO-MIN106) using the 72hr run script. 329

Nextera XT Illumina library preparation and sequencing 330
Amplified cDNA was purified using a 1:1 concentration of AMPure XP beads (Beckman Coulter, 331 Brea, CA, USA) and eluted in 48µL of water. PCR products were quantified using Qubit dsDNA 332 high-sensitivity kit (Invitrogen, USA) and were diluted to a final concentration of 0.2 ng/µl (1 ng 333 in 5 µl volume). Each sample was then made compatible for deep sequencing using the Nextera 334 fragmented and tagged with short oligonucleotide adapters, followed by 14 cycles of PCR for 336 template indexing. Samples were purified using two consecutive AMPure bead cleanups (0.5x 337 and 0.7x) and were quantified once more using Qubit dsDNA high-sensitivity kit (Invitrogen, 338 USA). The average sample fragment length and purity was determined using Agilent High were pooled with seven other samples (not included in this manuscript) and were denatured to a 346 final concentration of 14pM with a PhiX-derived control library accounting for 1% of total DNA 347 and was loaded onto a 600-cycle v3 flowcell. Average quality metrics were recorded, reads 348 were demultiplexed, and FASTQ files were generated on Illumina's BaseSpace platform. 349

Sequence read mapping and variant calling by ONT 350
Seventy-two hours after sequencing was initiated, raw sequencing reads were demultiplexed 351 using qcat (https://github.com/nanoporetech/qcat). In order to deplete host sequences, 352 sequencing reads are mapped against host genome and transcript references, and unmapped 353 reads are saved. Reads were then trimmed by 30bp on each side to discard SISPA primer 354 sequences. In this step, reads with quality scores  SNPGenie adapts the Nei and Gojobori method of estimating nucleotide diversity (π), and its 384 synonymous (π S ) and nonsynonymous (π N ) partitions from next-generation sequencing data 385 [31]. As most random nonsynonymous mutations are likely to be disadvantageous, we expect 386 π N = π S indicates neutrality suggesting that allele frequencies are determined primarily by 387 genetic drift.     per gene segment, we report the density of variants normalized to gene kilobase length. 540