Within-patient genetic diversity of SARS-CoV-2

SARS-CoV-2, the virus responsible for the current COVID-19 pandemic, is evolving into different genetic variants by accumulating mutations as it spreads globally. In addition to this diversity of consensus genomes across patients, RNA viruses can also display genetic diversity within individual hosts, and co-existing viral variants may affect disease progression and the success of medical interventions. To systematically examine the intra-patient genetic diversity of SARS-CoV-2, we processed a large cohort of 3939 publicly-available deeply sequenced genomes with specialised bioinformatics software, along with 749 recently sequenced samples from Switzerland. We found that the distribution of diversity across patients and across genomic loci is very unbalanced with a minority of hosts and positions accounting for much of the diversity. For example, the D614G variant in the Spike gene, which is present in the consensus sequences of 67.4% of patients, is also highly diverse within hosts, with 29.7% of the public cohort being affected by this coexistence and exhibiting different variants. We also investigated the impact of several technical and epidemiological parameters on genetic heterogeneity and found that age, which is known to be correlated with poor disease outcomes, is a significant predictor of viral genetic diversity. Author Summary Since it arose in late 2019, the new coronavirus (SARS-CoV-2) behind the COVID-19 pandemic has mutated and evolved during its global spread. Individual patients may host different versions, or variants, of the virus, hallmarked by different mutations. We examine the diversity of genetic variants coexisting within patients across a cohort of 3939 publicly accessible samples and 749 recently sequenced samples from Switzerland. We find that a small number of patients carry most of the diversity, and that patients with more diversity tend to be older. We also find that most of the diversity is concentrated in certain regions and positions of the virus genome. In particular, we find that a variant reported to increase infectivity is among the most diverse positions. Our study provides a large-scale survey of within-patient diversity of the SARS-CoV-2 genome.


Introduction
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the cause of COVID-19, spread globally from its origins in Wuhan, China, toward the end of 2019, resulting in the World Health Organisation declaring COVID-19 a pandemic in March 2020. Initially diagnosed as a pneumonia of unknown origin , huge research efforts have drastically developed our clinical knowledge of COVID-19 concerning its etiology [1] , mode of transmission [2] , identification of individuals at risk [3] and potential treatment strategies [4] .
COVID-19 is a respiratory disease marked by a variety of symptoms including a persistent cough, fatigue and anosmia [5] , however, a subpopulation of SARS-CoV-2-infected individuals remains asymptomatic [6] . The large variation in the severity of experienced symptoms, along with the virus incubation time ranging from 5-14 days [7,8] , may have contributed to its rapid propagation. At the end of September 2020, there have been more than 34 million confirmed cases with over a million deaths worldwide [9] . SARS-CoV-2 belongs to the family Coronaviridae and is the latest of three zoonotic coronaviruses that have spilled over to infect humans in the past two decades following SARS-CoV in 2003 and MERS-CoV in 2012 [10] . The SARS-CoV-2 single-stranded RNA genome consists of at least 13 open reading frames (ORFs) spanning 29,903 nucleotides [11] . ORF1ab codes for a polyprotein from which the non-structural proteins originate, including the RNA-dependent RNA polymerase; this is then followed by the Spike (S) gene, the Envelope (E) gene, the Matrix (M) gene, the Nucleocapsid (N) gene and a host of accessory genes [12] . Much like the related SARS-CoV, the Spike protein present on the surface of the SARS-CoV-2 virion is central for its access to the target host cell via binding of the hACE2 receptor, S protein priming and finally fusion of the viral and target cell membranes [12][13][14] . Zeigler et al. identified co-expression of hACE2 and TMPRSS2 (a host protease required for viral entry) in lung type II pneumocytes, nasal goblet secretory cells, and ileal absorptive enterocytes; these are currently thought to be the target cells of SARS-CoV-2 leading to COVID-19 [15] . Viral isolation and sequencing efforts throughout the world have permitted the interrogation of the SARS-CoV-2 genome, providing a deeper understanding of the evolution of the virus, its proximal origin and revealing patterns of its global spread [16,17] . These efforts are largely centered on reverse transcription of the viral RNA genome followed by PCR amplicon and hybrid capture based sequencing using Oxford Nanopore and Illumina sequencers [18] . The sequences generated are largely deposited in data repositories such as the sequence read archive (SRA) and GISAID [19] . The availability of these sequences has allowed for platforms such as Nextstrain [20] to continually update and track viral evolution and spread, and to illuminate the viral genetic diversity among carriers [21][22][23][24] . At this inter-host level, epidemiological studies often employ per-patient consensus sequences, which summarize each patient's virus population into a single sequence and ignore minor variants. However, RNA viruses generally exhibit high mutation rates due to error-prone viral RNA polymerases, which typically leads to the presence of various viral variants within a single host [25] . The resulting intra-host diversity has been shown to affect disease progression [26] , tissue tropism [27] , transmission risk [28] , transmission heterogeneity [29] , and treatment outcome [30,31] in various RNA viruses. Recent findings indicate that the mutation rate of SARS-CoV-2 is about as high as the one observed in the SARS-CoV genome (0.80-2.38 × 10 -3 nucleotide substitutions per site per year) [32,33] . Although coronaviruses have evolved a proofreading capability attributed to Nonstructural protein 14 resulting in lower mutation rates than other positive-sense ssRNA viruses [34][35][36] , the study of the intra-host genetic diversity of the novel SARS-CoV-2 virus remains important to gain a deeper understanding of its evolution and transmission dynamics and possible implications on its pathology. Previous studies highlight that intra-host genetic diversity in clinical samples is indeed prevalent, and some genomic regions that are susceptible to alterations in the SARS-CoV-2 virus have been identified [37,38] . Accounting for intra-host diversity can also improve the resolution of phylogenomic analyses of SARS-CoV-2 [39] and our ability to detect selection pressure [40] .
To assess the within-host genetic diversity of different viral variants, deep-coverage sequencing is needed to access low-frequency subclonal mutations. However, calling low-frequency mutations reliably from deep sequencing data remains challenging because of various amplification and sequencing errors. S everal computational methods have been proposed for this task [41,42] , and specialised bioinformatics pipelines have been developed to streamline and automate the analysis from raw sequencing reads to consensus sequences and single-nucleotide variants [43,44] . Here, we employ V-pipe [44] , a workflow that includes quality control, read mapping and mutation calling, to robustly identify genetic diversity of viral genomes. We examine the intra-host diversity across a large cohort of 3939 public SARS-CoV-2 samples and, as an independent validation cohort, 749 additional samples which have recently been sequenced to study the spread of SARS-CoV-2 in Switzerland [45] . We analyze the distribution of genetic diversity across the genome, including uncovering highly diverse bases and small regions, and we explore the diversity across patient samples. By integrating technical and epidemiological covariates, we show that within-patient viral genetic diversity increases with age of the infected patient. a) b) Figure 1: (a) Different viral variants may coexist in the same host such that the sequencing reads contain a mixture of the different components and their SNVs. We employ the bioinformatics pipeline V-pipe for automated end-to-end processing of the raw sequencing data and calling of SNVs and their frequencies [44] . (b) An example output of the workflow for a single sample (SRR11953858) displaying variations in the coverage (blue histogram, right y-axis) and the frequencies of the different SNVs (black lollipops, left y-axis) across the genome (x-axis). Along with clonal mutations with frequencies near 100%, and many low-frequency variants, several other mutations have frequencies of around 40% and 60% in this example.

Results
We downloaded sequencing data from the SRA from a total of 5934 SARS-CoV-2 samples, which had undergone deep sequencing with Illumina technology. Samples with low (median below 1000) or highly variable (lower quartile below 100 or upper quartile above 10,000) coverage were removed ( Figure S1). The remaining 3940 samples were processed using V-pipe [44] (Figure 1a). For each sample, we obtained the coverage across the genome and all mutated positions, including all variant bases and their frequencies ( Figure  1b). A single sample with no detected mutations at all was removed to leave a final public cohort of 3939 SARS-CoV-2 samples. In addition, we analyze a new set of 749 sequences derived from samples collected in Switzerland. These sequences were generated as part of our sequencing effort described in [45] . While [45] uses only the consensus sequences for a phylodynamic analysis, here we directly analyse the entire set of raw deep-sequencing reads to assess intra-host diversity. This uniformly produced set of sequences, which we make public on the SRA, serves as a validation cohort. As a measure of intra-host genetic diversity, we computed the entropy of the nucleotide distribution for each sample and each genomic locus (Materials and Methods). Low entropy indicates a highly conserved site in the patient's virus population, while high entropy is indicative of variation among nucleotides.
The quartiles of the coverage over the public samples ( Figure S2) display typical coverages of over a thousand reads at each genomic site (in line with the coverage filters) meaning that we can detect low-frequency SNVs down to 1-2% [46] (or using estimates from the Lander-Waterman model [47] ). However, there are regular dips at the primer locations [48] , and noticeable wider dips between genomic positions around 19 to 23 kb. This region covers the end of gene ORF1ab encoding the endoribonuclease and 2'-O-methyltransferase, as well as much of the S gene producing the S1 subunit, needed for the initial attachment of the virus to the target cell.
We observe a wide spectrum of diversity both across the genome and across samples ( Figure 2, focussing on the positions across the genome mutated in at least 4% of the samples). In particular there are subclonal regions visible (as darker contiguous bars in Figure 2) at the start of ORF1ab , in the Spike S gene and across much of the Matrix M gene. There are also highly diverse individual bases and small regions (Tables S1-S4), particularly in the public data.
Next, we summarize genetic diversity per gene by computing the average entropy across all positions of the gene. We observe roughly similar diversity for each gene in the public cohort, except for gene M which is much more diverse, and ORF7b displays hardly any diversity (Figure 3a), while we observe the same trends, but larger differences across genes in the Swiss cohort ( Figure 3b).
The most diverse site in the public data (Table S1 and Figure 3a), which has clonal or subclonal mutations in nearly half the samples, is at position 11075 in the region of the ORF1ab gene coding for non-structural protein 6, a transmembrane protein containing 7 transmembrane helices. Most of the diversity at this position is derived from low-frequency deletions, which would alter downstream amino acids and introduce a premature stop codon. Among the mutated samples, deletions occur at an average rate of 2.62%. 12.5% of mutated samples include variants with a T>C substitution, which occurs with an average frequency of 1.75% among those samples (corresponding to an overall average rate of 0.22% among all mutated samples). In the reference genome, the codon incorporates a Phenylalanine amino acid (Phe35 of nsp6), however, any variation here will lead to a change in the inserted amino acid; for example, a T>C mutation results in the incorporation of a Leucine.
Another residue in close proximity to Phe35, namely Leu37 of nsp6, has often been found to be replaced by a Phenylalanine residue in recent sequences from Europe, Asia and America [49] . Here, we also identify a high entropy at position 11083 inside the codon of Leu37 (Table S1), which is affected in around a quarter of samples. Any alteration from the reference base G to either a T or C results in a Phenylalanine residue as opposed to a Leucine. Here we observe a large average frequency of Thymine bases (44.65%) and common deletions (4.93%) which also lead to a Phenylalanine residue and a premature stop codon.
Genomic position 23403 is the site at which an A>G mutation causes the well-known D614G amino acid alteration in the Spike region. Studies suggest that this alteration provides a fitness advantage and becomes the more prevalent variant in the population over time [50] , and it is also the dominant variant in our public cohort. We find that it is among the 100 most diverse positions in the SARS-CoV-2 genome (rank 95 in our cohort) with 36.0% of our samples having mutations there relative to our cohort consensus G, and 29.7% having diversity of different variants coexisting. Among the mutations, there is an average frequency of 90.4% A, 9.5% G and 0.1% deletions. In the Swiss cohort, position 23403 has the second highest diversity among all individual bases (Table S2). We also assessed genetic diversity over consecutive genomic regions (Materials and Methods). The most diverse regions in both the public and Swiss data (Tables S3 and S4, adjacent sites all with a log entropy above 3.0 for the public data and above 2.0 for the Swiss data) include a span of 16 bases at the start of the ORF1ab gene ( Figure 3). This region forms part of the nsp1 protein which shuts down host mRNA translation and protein production via interactions with the 40S and 80S ribosomal subunits. A notable consequence of this nsp1 induced inhibition of translation and protein production is the reduction in innate immune response activity, specifically interferon signalling [51] .
The next most diverse genomic region (Tables S3 and S4) consists of two adjacent positions at 29187 and 29188 within a single codon in the N gene forming the nucleoprotein. The reference bases at those two positions are C and A, respectively, which results in the incorporation of an Alanine residue at position 305 of the nucleoprotein (Ala305). However, due to the redundancy of the genetic code, alterations in position 29188 alone do not lead to an alteration on the protein level.
The remaining regions in Table S3 with a length greater than two lie within the M and S genes, and are similarly present in the Swiss cohort (Table S4). The 41 nucleotides from 26780 to 26820 lead to the incorporation of 15 amino acids from Cys86 to Phe100 forming a section of the matrix protein's transmembrane region. The matrix protein is known to be central in viral assembly with several interacting partners including itself, the envelope protein, the nucleoprotein and the spike protein [52,53] . The spike protein, when primed, is divided into the S1 and S2 subunits with S1 being essential for hACE2 recognition and S2 mediating viral entry into the host cell. Another region of high entropy comprises 30 nucleotides within the Spike gene which incorporates 11 amino acids to the S1 subunit of SARS-Cov-2 ranging from Ile664 to Tyr674. These amino acids lie between the receptor binding domain and Furin cleavage site needed for S protein priming [54,55] .
The distribution of total entropy per base has a long tail (Figure 4a) resulting in locations with extremely high diversity, such as the examples discussed above, while the majority of the genome is relatively conserved. Taking a logarithmic transform of the entropy we can observe much more detail in the distribution, with a roughly normal distribution and a slight negative skew in the public data (Figure 4a inset). Considering the entropy per sample instead provides a picture of how diversity is distributed among samples. We again find a long tail (Figure 4b) with certain individuals having vastly more internal diversity than the majority of samples dominated by clonal variants and low-frequency mutations. Detecting diversity in each sample is heavily dependent on the sequencing technology, sample processing, and sequencing depth. For example, all the most diverse samples (Table S5) come from one SRA study (SRP253798) which makes up 29% of the entire cohort. The samples with the highest detected diversity had between 10% and 30% of the genome affected, with the vast majority of their mutations (over 99.8%) being subclonal and the coexistence of more than one character (base or deletion) in the mapped reads.
Like the entropy, the number of clonally or subclonally mutated positions per sample also has a long tail ( Figure S3) with, in the public data, for example, quartiles at 119, 211 and 378.5 positions being mutated (corresponding to 0.40%, 0.71% and 1.27% of the genome) but a maximum of 8705 positions (29.11% of the genome).
Finally, we tested the hypothesis that epidemiological parameters are related to viral genetic diversity. Specifically, to determine whether the host's age, sex or geographical location predict the diversity of their virus population, we performed regression modelling on the subset of samples for which we have such information. This resulted in 1043 samples from Australia which were sequenced with paired-end amplicon sequencing with PCR amplification. We also adjusted for technical parameters of the sequencing to avoid confounding (Materials and Methods). We found that sex is not a significant predictor of diversity ( p = 0.50). By contrast, age is significantly associated with intra-host viral genetic diversity ( p = 4 × 10 -4 ) with each decade increasing the total entropy by 8.6% on average (Table 1, Figure S5).
Clinical covariates were not available for the Swiss cohort, but sequencing date was utilized as a proxy for decreasing age, since testing expanded to younger populations over time ( Figure S6). The significant decrease in diversity over time ( p < 10 -7 ; Table 2, Figure  S7) in the Swiss cohort therefore corroborates the association of increased diversity with age uncovered in the public cohort.

Discussion
We processed a large cohort of 3939 deeply sequenced public SARS-CoV-2 genomes, and 749 samples from Switzerland, with the bioinformatics software V-pipe [44] to uncover within-patient genetic diversity. We observe a heavy tail distribution in diversity per sample and per position indicating that much of the diversity is concentrated in small numbers of sites and patients. The most diverse small regions were consistent between the public and Swiss data, though the individual bases were more varied.
Detecting and quantifying intra-patient genetic diversity from deep sequencing data is technically challenging. It may be heavily influenced by the sample preparation and how it is sequenced, along with possible artifacts arising from the process, so that extremely diverse patients may not be comparable across cohorts. Accounting for such technical parameters of the sequencing process and coverage, which affects the detection limit of diversity, we find that age is a significant predictor of diversity in the public data. The model predicts that on average genetic diversity increases by 8.6% every ten years. The increase in diversity with host age is corroborated in the Swiss data. Age has previously been associated strongly with worse disease outcome and higher death rates [56] , along with concomitant comorbidities [57] . With high-quality clinical and genetic data, it will be highly relevant to see whether diversity is a cause or a consequence of disease progression. Likewise, with future transmission network data it will be interesting to uncover whether diversity increases infectiousness, as for influenza [28] .
The detection of subclonal mutations is affected by the sequencing depth at each position, and across the genome depending on the amplification and capture of RNA for sequencing some regions may be more poorly resolved. The coverage distribution therefore affects our ability to detect highly diverse bases. With this caveat, the most diverse gene is the Matrix M gene while highly diverse positions include a mix of low-frequency variants common to a quarter of the cohort or more, and rarer high-frequency subclonal mutations in around 5% of the cohort. The observation of common low-frequency and less common high-frequency genetic variants is in line with previous research on both intra-and inter-host genetic diversity of SARS-CoV-2 [21,39] . The D614G variant, which appears to increase infectivity and is becoming more dominant over time [50] is the dominant variant in our public cohort. It also exhibits high intra-host diversity with 29.7% of the cohort experiencing subclonal mutations with the different variants coexisting. This diversity is mimicked in the data from Switzerland, where the D614G variant is actually encoded by the second most diverse genomic position.

Public data
We retrieved data from the Sequence Read Archive ( https://www.ncbi.nlm.nih.gov/sra ) on June 10, 2020. Samples with the term '"Severe acute respiratory syndrome coronavirus 2"[Organism] OR "Sars-Cov-2[All Fields]"' were filtered and only one copy of duplicates with the same BioSample ID was retained. We used the meta file to further filter the samples by Illumina technology. This resulted in 5934 samples which were associated with downloadable data.

Data from samples collected in Switzerland
Sample collection and sequencing are detailed in [45] . Briefly, we obtained RNA samples extracted from nasal swab tests which had previously tested positive on RT-qPCR from Viollier AG laboratory and sequenced them at the Genomics Facility Basel. We performed reverse transcription using random hexamers and PCR amplified the resulting DNA with primers from the artic-ncov2019 protocol [ https://github.com/artic-network/artic-ncov2019/tree/master/primer_schemes/nCoV-2019/V3 ]. We prepared libraries from these 4000 bp-long amplicons using Illumina TruSeq adapter sequences and sequenced them on an Illumina MiSeq System (Paired-end sequencing, 2x 251 cycle). The phylogenetic relationship between the consensus sequence of 681 samples from this collection has been previously analyzed as part of the subset of GISAID data available for Switzerland until July 10, 2020 in [24] and results based on the consensus sequences of the full dataset are presented in [45] . Here we focus on the raw reads directly and analyse the deep sequencing data to uncover within host diversity. The collection of swabs analysed here spans a time period from Mar 4, 2020 to August 13, 2020, with samples from across Switzerland. The analysed reads are available in the SRA (see Data Availability below).

Filtering
Samples were subset by applying a coverage filter (minimum lower quartile: 100, minimum median: 1000, maximum upper quartile: 10000). After this filtering, 3940 public samples and 749 samples from Switzerland were retained for later analyses.

Data processing
We used V-pipe ( [44] ; sars-cov2 branch of https://github.com/cbg-ethz/V-pipe ) to call variants for each sample using ShoRAH [60] and default settings, including discarding deletions with a frequency below 0.5%. One sample had no remaining variants detected and was excluded giving a final cohort size of 3939. For each called variant per sample, we computed the relative frequency f k of each character k (nucleotide A, C, G, T, or deletion) among the mapped reads at that position. The entropy is then computed as summarising the five frequencies in a single measure of diversity. The entropy is zero whenever one character has a frequency of 100%, and is maximised when all the characters have equal frequency 1/5, giving a maximum value of log (5) (Table S3 and S4). We compute the total entropy per sample by summing over bases (Table S5).

Regression modelling
To evaluate which covariates are predictive of diversity, we build a regression model on the public data of log total entropy on age, sex, and country of origin. To adjust for the possible effects of coverage and sequencing technology on the diversity, we include factors for paired-end or single-end sequencing, the assay type, the library selection and the SRA study. We also include the logarithm of the median coverage. Since not only the average coverage, but its variability may affect the ability to detect SNVs, we further include the IQR of the coverage, divided by the median m , and again log transformed: log(entropy) ~ age + sex + country + sequencing factors + log(median coverage) + log(IQR/ m ) Filtering the samples which have age and sex information left 1060 samples, of which all but 17 were from Australia. We therefore retained just those 1043 from Australia, and removed the country dependence from the regression. All remaining samples were paired-end amplicon sequencing with PCR amplification from the study SRP253798, so those factors were also removed to provide the final regression: log(entropy) ~ age + sex + log(median coverage) + log(IQR/ m ) Of the 1043 samples, 480 (46.0%) were female and 563 (54.0%) were male, while the distribution of ages ( Figure S4) has a median value of 46 and lower and upper quartiles at 29 and 60. The results of the regression are listed in Table 1.  For the data from Switzerland, we do not have clinical covariates, only the date of sequencing. However, we can use the sequencing date as a proxy for age, because over time, the age distribution has progressively reduced in Switzerland as a whole ( Figure S6). All samples were processed in the same way, so for the regression we model log(entropy) ~ date + log(median coverage) + log(IQR/ m ) with the results in Table 2.  For visualisation purposes, we regress patient age on sample date from data collected about positive tests in Switzerland as a whole ( Figure S6). The model is then used to predict the age of our 749 deeply sequenced samples based on the sequencing date. Against the predicted age, we plot the log total entropy, adjusted for the coverage covariates, to show how the negative correlation with date ( Table 2) corroborates an increase in entropy with age ( Figure S7).

Code availability
The source code to process the samples and perform and reproduce the analyses is available on GitHub ( https://github.com/cbg-ethz/SARS-CoV-2_Analysis ) in the form of multiple Snakemake [61] workflows.

Data availability
The public data is available from the SRA, as described in the Methods and Materials section. The Swiss data has been added to the SRA with accession number PRJEB38472, scheduled to be publicly available from November 1, 2020. Figure S1. The median coverage, along with the lower and upper quartiles for the 5934 samples downloaded from the Sequence Read Archive. The dashed grey lines correspond to the coverage filters used to subset the samples before further processing (minimum lower quartile: 100, minimum median: 1000, maximum upper quartile: 10000).        Figure S5: The dependence of log total entropy on age for the 1043 public samples, after adjustment for sex and sequencing coverage covariates in the regression modelling. Figure S6: The distribution of ages of positive Covid-19 tests in Switzerland as a whole, covering the period where our cohort of deeply sequenced samples was collected. Figure S7: The dependence of log total entropy on predicted age for the 749 Swiss samples, after adjustment for sequencing coverage covariates in the regression modelling. The predicted age is constructed from a linear model of age on date built with the data displayed in Figure S6.