Abstract
Phylogenomic analysis of SARS-CoV-2 genomes available from public repositories suggests the presence of 4 prevalent groups of viral episomes (super-clades), which are mostly associated with outbreaks in distinct geographic locations (China, USA and Europe). While levels of genomic variability between SARS-CoV-2 isolates are limited, to our knowledge it is not clear whether the observed patterns of variability in viral super-clades reflect ongoing adaptation of SARS-CoV-2, or merely genetic drift and founder effects. Here, we analyze more than 1100 complete, high quality SARS-CoV-2 genome sequences, and provide evidence for the absence of distinct evolutionary patterns/signatures in the genomes of the currently known major clades of SARS-CoV-2. Our analyses suggest that the presence of distinct viral episomes at different geographic locations is consistent with founder effects, coupled with the rapid spread of this novel virus. We observe that while cross-species adaptation of the virus is associated with hypervariability of specific genomic regions (including the RBD domain of the spike protein), the more variable genomic regions between extant SARS-CoV-2 episomes correspond with the 3’ and 5’ UTRs, suggesting that at present viral protein coding genes are unlikely to be subject to different adaptive evolutionary pressures in different viral strains. All in all, although this study cannot be conclusive, also in light of the limited variability of SARS-CoV-2, we believe that the evidence presented here is strongly consistent with the notion that the biased geographic distribution of SARS-CoV-2 isolates is not associated with adaptive evolution of this novel pathogen.
Introduction
The SARS-CoV-2 pandemic (Poon and Peiris, 2020) poses the greatest global health and socioeconomic threat since the Second World War. Complete genomic sequences of viral isolates from diverse geographic sites have rapidly been made available through dedicated resources (Shu and McCauley, 2017, Goodacre et al, 2018), facilitating comparative genomics studies, identification of putative therapeutic targets (Zhou et al, 2020, Chen et al, 2020, Robson, 2020) and the development of effective prevention and monitoring strategies (Qiang et al, 2020). Analyses of available genomic sequences, according to GISAID EpiCoV, suggest three major viral clades, S, V and G, which collectively circumscribe more than 69% of the characterised isolates. Strikingly, these clades show a markedly biased prevalence in different areas, with the S clade accounting for more than 72% of the viral isolates characterized in the USA, and the G clade comprising more than 74% of those that have been sequenced in Europe.
Importantly, while the G clade was initially considered to be composed of viral strains which were not observed in China, the availability of additional genomic sequences suggests that this clade too is nested within isolates from Shanghai. Although comparative analyses suggest that genomic variability between different isolates of SARS-CoV-2 is generally low (Lu et al, 2020, Zhang et al, 2020, Tang et al, 2020), the fact that distinct viral episomes show a highly biased geographic distribution is potentially alarming, since at present it is not completely clear whether frequent variants reflect adaptive processes that could result in the emergence of novel, and potentially more virulent, strains. Moreover, at present it is unclear whether the genomic variability of major clades of SARS-CoV-2 and their biased geographic distribution could explain, at least in part, the apparently differing rates of lethality observed worldwide (Baud et al, 2020).
In the present study, exploiting curated viral genomic sequences, we present analyses of more than 1100 complete SARS-CoV-2 genomes, collected from 5 continents and more than 45 countries. By contrasting evolutionary patterns associated with the most prevalent viral clades with those observed between closely related viral strains isolated from various species, we provide insights into the evolutionary dynamics and adaptation of SARS-CoV-2 like viruses to different hosts and into the evolutionary patterns of the major clades of SARS-CoV-2. We show that while the majority of the genomic variants that discriminate between major viral clades cause non-synonymous substitutions in protein coding genes, including genes implicated in the modulation of the virulence of SARS-CoV-2 such as the spike protein and the RNA dependent RNA polymerase (Weiss and Navas-Martin, 2005), the major clades of SARS-CoV-2 show nearly identical patterns of genomic variability, as well as the absence of signatures that are normally associated with adaptive evolution at protein coding loci. Indeed, the major clades of SARS-CoV-2 are identified only by a limited number of clade-specific genetic variants and show very modest variability. Notably, variable sites are enriched in the 5’ and 3’ non-coding regions of the genome, unlike the genomic sites that are hyper-variable between closely related strains with distinct host specificities.
While, in light of the limited variability of SARS-CoV-2, our observations cannot be considered conclusive, we believe that the available data are broadly consistent with the notion that the biased geographic distribution of SARS-CoV-2 isolates is not associated with adaptive evolution of this novel pathogen, but rather with extensive founder effects coupled with the rapid spread of the virus in diverse geographic zones.
Results
A total of 1113 complete, high quality SARS-CoV-2 genomic sequences, as well as the genomes of 2 SARS-CoV-2 like viruses isolated from non-human hosts (bats and pangolins; Zhou et al, 2020, Matthew et al, 2020), were retrieved from the GISAID EpiCoV portal on March 24th 2020. Associated metadata (Supplementary Table S1) show that the isolates included cover more than 45 countries in 5 continents. As expected, the geographic distribution of the data closely reflects the geographic prevalence of the pandemic, although notable exceptions include the limited public data from Italy, one of the early hotspots of the pandemic.
Genomic sequences obtained from GISAID EpiCoV were aligned to the reference SARS-CoV-2 assembly (RefSeq accession NC_045512.2; O’Leary et al, 2016) using Nucmer (Marçais et al, 2018). A custom Perl script was used to process individual alignments and build a phenetic matrix of presence/absence of the variants. Substitution patterns of nucleotide residues, as shown in Supplementary Table S2 and Supplementary Figure S1, show a clear prevalence of C->T substitutions with respect to the presumably ancestral Wuhan isolate, with C->T representing 38% of all the observed distinct variants and an almost 4-fold enrichment of C->T with respect to T->C. Strikingly, the same pattern is not recovered when the SARS-CoV-2 genome is compared with genomic sequences of closely related viral specimens isolated from non-human hosts, suggesting a specific error/substitution pattern of the SARS-CoV-2 RNA dependent RNA polymerase. This observation is confirmed even when only polymorphic sites common to 2 or more genomes are considered. Intriguingly, analyses of the substitution patterns of the coronavirus associated with the 2003-2004 SARS outbreak (Chinese SARS Molecular Epidemiology Consortium, 2004, Song et al, 2005) do not show the same tendency towards increased C->T substitutions.
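As an illustration of how a substitution spectrum of this kind can be tallied, the following minimal Python sketch counts each reference->alternative substitution type among distinct variants and derives a fold enrichment. The function names, the input layout (position, reference base, alternative base) and the toy variants are assumptions made for illustration only; they do not reproduce the custom Perl script or the data used in this study.

```python
from collections import Counter


def substitution_spectrum(variants):
    """Fraction of each ref->alt substitution type among distinct variants.

    `variants` is an iterable of (position, ref_base, alt_base) tuples;
    duplicates (the same variant seen in several genomes) are counted once.
    """
    counts = Counter(
        f"{ref}->{alt}"
        for _pos, ref, alt in set(variants)
        if len(ref) == 1 and len(alt) == 1 and ref != alt  # SNVs only
    )
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()} if total else {}


def fold_enrichment(spectrum, forward="C->T", reverse="T->C"):
    """Ratio between the frequencies of two reciprocal substitution types."""
    rev = spectrum.get(reverse, 0.0)
    return spectrum.get(forward, 0.0) / rev if rev else float("inf")


# Hypothetical toy input, not taken from the study.
toy_variants = [
    (100, "C", "T"), (200, "C", "T"), (300, "C", "T"),
    (400, "T", "C"), (500, "G", "T"),
    (100, "C", "T"),  # same variant observed in a second genome
]
toy_spectrum = substitution_spectrum(toy_variants)
toy_fold = fold_enrichment(toy_spectrum)
```

Counting each variant once, rather than once per genome, mirrors the statement that frequencies refer to distinct variants; restricting the input to variants shared by 2 or more genomes would give the second comparison described above.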
Clustering of viral episomes based on 844 genetic variants present in at least 2 genomes (Figure 1; see Materials and Methods) delineates 4 super-clades of viral strains, consistent with the classification of the isolates proposed by the GISAID EpiCoV portal (Shu and McCauley, 2017). However, other, smaller sub-clades, characterized by additional or alternative genomic variants, are also observed. Overall, our analyses confirm the limited variability of the SARS-CoV-2 genome (Supplementary Figure S2), with an average of 5.2 polymorphic sites observed between individual isolates and the reference genome. Of the 983 sites that were polymorphic in at least 1 genome considered in the study, 310 (31.5%) were synonymous substitutions, 545 (55.4%) were non-synonymous substitutions and 115 were associated with non-coding regions (5’ and 3’ UTR). The majority of polymorphic sites (703/1017, 69.12%) are associated with individual viral isolates (singletons) (Supplementary Figure S3). Interestingly, we notice that the average number of polymorphic sites in SARS-CoV-2 isolates is significantly (Wilcoxon p-value ≤ 1e-05) lower than the average number of variable sites between late-phase isolates of the SARS epidemic of 2003.
Importantly, all the major clades identified by this analysis show a highly biased geographic distribution (Figure 1), again consistent with previous observations. In particular, we notice that one super-clade is prevalently associated with isolates from the Americas (and corresponds with the S clade, as identified in GISAID EpiCoV), while another corresponds, for the most part, to Chinese isolates, and shows relatively little variability with respect to the reference genome. The third super-clade is composed of viral strains prevalent in Europe (the G clade as identified in GISAID EpiCoV), while the fourth super-clade, which is more limited in size, corresponds to the GISAID V clade, and contains a majority of viral strains isolated from Europe. Cluster stability metrics, based on the Dunn index as computed by means of the clValid R package (Brock et al, 2008), were strongly consistent with the presence of 9 clusters in the phenetic matrix of viral isolates. Clustering of viral isolates based on the k-means algorithm with 9 clusters (Figure 1) demonstrates a clear separation of viral episomes, with cluster 7 (Supplementary Table S3) corresponding to the genome sequences of the SARS-CoV-2 like strains isolated from non-human hosts. Cluster 1 is formed by episomes that have limited variability with respect to the reference genome; clusters 2, 5 and 9 incorporate all the strains included in the European clade G; while clusters 3 and 4 correspond to the GISAID EpiCoV clade S, which is prevalent in the U.S. Cluster 8 is composed of 35 viral strains, mainly isolated in Europe, and corresponds to the V clade as described in GISAID EpiCoV. Finally, cluster 6, which was not previously described as a group, is formed by 40 genomes which do not seem to be associated with a specific geographic location.
Notably (Table 1 and Supplementary Table S4), we observe that the majority of polymorphic sites that discriminate between the 9 clusters identified by our analyses are associated with non-synonymous amino acid substitutions, and several of these substitutions occur in viral genes that have been implicated in the modulation of the virulence of SARS-associated coronavirus, including the spike protein and the RNA dependent RNA polymerase (nsp12) (Weiss and Navas-Martin, 2005). In order to test the possibility that these variations could reflect early hints of adaptive evolution of SARS-CoV-2 strains, the evolutionary dynamics of viral super-clades were contrasted with patterns of evolution that are observed between closely related viral strains with different host specificity.
As shown in Figure 2, comparison of intra-cluster variability, performed using only the 713 polymorphic sites that are associated with a single viral isolate, clearly demonstrates similar levels of variation between all the 9 SARS-Cov-2 clusters identified in this study, with a slight (but not statistically significant, Wilcoxon test p-value 0.129) increase in variability for strains included in cluster 6.
In order to identify regions of the genome which could be subject to distinct evolutionary pressures, plots of local genomic variability along the complete genomic sequence of SARS-CoV-2 were prepared by computing the proportion of polymorphic sites identified in each of the viral super-clades (cluster 1, clusters 2+5+9, clusters 3+4) on sliding genomic windows of 100 bp in size, overlapping by 50 bp. Clusters 6 and 8 were excluded from these analyses because of the relatively limited number of genomes that they include.
As shown in Figure 3, the observed patterns are remarkably similar between the 3 larger viral super-clades, suggesting similar evolutionary dynamics. In particular we note that polymorphic sites are significantly enriched (Adjusted Fisher test p-value ≤1e-15 and ≤1e-12 respectively) in both the 5’ and 3’ UTR regions, while protein coding loci show considerably reduced variability.
Similar analyses performed by comparing the reference SARS-CoV-2 genome with the genomes of closely related coronaviruses with different host specificity (Figure 3 and Figure S4) show the presence of distinct patterns of variation in viral episomes that are likely to be subjected to adaptive evolutionary pressures, with respect to those observed within major clades of SARS-CoV-2. Indeed, hyper-variable genomic regions (Adjusted Fisher test p-value ≤1e-10 and ≤1e-5 respectively for comparisons of SARS-CoV-2 like viruses isolated from bats and pangolins) between these genomes clearly correspond with protein coding loci, and in particular with the RBD protein domain of the S1 subunit of the spike protein, the domain which mediates the recognition of the host ACE2 receptor. Analyses of dN/dS patterns (Supplementary Table S5) indicate that, as expected, this particular protein domain seems to be subject to strong diversifying selection between viral isolates with different host specificity. Intriguingly, while we do not observe an increased variability of the RBD domain in SARS-CoV-2 isolates, it is important to notice that the gene coding for the spike protein shows the highest level of variability among protein coding genes, suggesting that this gene could be subject to relaxed selective pressure in coronaviruses, probably to facilitate adaptation to different hosts.
Taken together, our analyses of variability profiles of SARS-CoV-2 and SARS-CoV-2 like coronaviruses are consistent with the notion that regions of the genome showing increased variability are associated with relaxed purifying selection and/or increased diversifying selection. In this respect it is interesting to note that the 5’ and 3’ UTRs, which are the most variable regions of the genome between major SARS-CoV-2 strains, are among the least variable genomic regions when SARS-CoV-2 and SARS-CoV-2 like strains with a different host specificity are compared.
Discussion
Notwithstanding the limited variability of the genome, which prevents more detailed evolutionary analyses, our analyses provide no evidence for distinct evolutionary constraints acting on SARS-Cov-2 genomes.
Although these results cannot be conclusive, the observation that similar dynamics of genomic variability are observed across clades, and that variable sites are prevalently located in the 3’ and 5’ UTR regions of SARS-CoV-2, indicates that the observed differences in the genomic sequences of the major viral clades are unlikely to be the result of adaptive selection.
However, we notice that the spike protein also evolves under less constrained evolutionary dynamics in SARS-CoV-2. Preliminary analyses, which are however based only on a limited number of variable sites (118/5096), would suggest a dN/dS ratio of ∼ 0.7, a value that is well above the value recovered for genes of similar size in the genome and that would indicate that this gene is subject to weaker evolutionary constraints than the other protein coding genes. This possibly reflects a mechanism for rapid adaptation to a wider range of hosts, as suggested for example by Menachery et al (2016).
Notwithstanding some limitations, our comparative analyses are consistent with the hypothesis that the biased geographic distribution and the allelic differences observed between major SARS-CoV-2 clades are not the result of an adaptive evolutionary process, but are more consistent with founder effects on viral populations, coupled with the rapid spread of this novel virus.
Although our analyses do not suggest distinct evolutionary patterns, it remains unclear whether the genetic variants that discriminate between major viral clades could be related to differences in the virulence/pathogenicity of these clades. To address this issue it will be crucial to collect patient metadata and to sequence more genomes, in order to enable retrospective statistical analyses.
Materials and methods
The complete collection of high quality, complete SARS-CoV-2 genomes and associated metadata was accessed from the GISAID EpiCoV (Shu and McCauley, 2017) platform on March 24th 2020. Genomes were aligned to the reference assembly of SARS-CoV-2 as available from RefSeq (O’Leary et al, 2016; RefSeq accession NC_045512.2) by means of the nucmer (Marçais et al, 2018) program. Viral genomes of the SARS 2003 outbreak were retrieved from the NCBI virus database (Goodacre et al, 2018). Assignment of strains to the 3 (early/middle/late) phases of the epidemic follows Song et al, 2005. Polymorphic sites were identified by using the show-snps utility of the nucmer package. Output files were processed by means of a custom Perl script and incorporated in a phenetic matrix, with variable positions on the rows and viral isolates in the columns. For all the isolates considered in the study, values of 1 were used to indicate the presence of a variant, and values of 0 its absence.
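The construction of the phenetic matrix can be sketched as follows. This is a minimal Python illustration, not the custom Perl script used in the study; it assumes that the show-snps output has already been parsed into (isolate, position, reference base, alternative base) records, and all function names and toy records are hypothetical.

```python
def build_phenetic_matrix(records):
    """Build a presence/absence matrix from parsed variant records.

    `records` is an iterable of (isolate, position, ref, alt) tuples.
    Returns (variants, isolates, matrix): variants index the rows and
    isolates index the columns; a cell is 1 when the isolate carries
    the variant and 0 otherwise.
    """
    records = list(records)
    isolates = sorted({iso for iso, _pos, _ref, _alt in records})
    variants = sorted({(pos, ref, alt) for _iso, pos, ref, alt in records})
    col = {iso: j for j, iso in enumerate(isolates)}
    row = {var: i for i, var in enumerate(variants)}
    matrix = [[0] * len(isolates) for _ in variants]
    for iso, pos, ref, alt in records:
        matrix[row[(pos, ref, alt)]][col[iso]] = 1
    return variants, isolates, matrix


def split_by_support(variants, matrix):
    """Separate singletons from variants present in at least 2 isolates."""
    singletons = [v for v, r in zip(variants, matrix) if sum(r) == 1]
    shared = [v for v, r in zip(variants, matrix) if sum(r) >= 2]
    return singletons, shared


# Hypothetical toy records, not taken from the study.
toy_records = [
    ("iso_A", 100, "C", "T"),
    ("iso_B", 100, "C", "T"),
    ("iso_B", 250, "G", "A"),
    ("iso_C", 400, "A", "G"),
]
toy_variants, toy_isolates, toy_matrix = build_phenetic_matrix(toy_records)
toy_singletons, toy_shared = split_by_support(toy_variants, toy_matrix)
```

In this sketch an isolate identical to the reference yields no records and therefore no column; a complete implementation would need the full list of isolates so that such genomes appear as all-zero columns.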
dN/dS rates were computed on aligned CDS sequences using the KaKs_Calculator tool (Zhang et al, 2006), allowing for the selection of the most appropriate substitution model based on the Akaike information criterion. The GY (Goldman and Yang, 1994) model was the preferred model in all the settings tested here. Only proteins longer than 100 amino acid residues and with more than 50 polymorphic sites in the CDS were considered in this analysis; for SARS-CoV-2 this was limited to nsp12, nsp3, nsp4 and the spike protein.
Variability with respect to the reference NC_045512.2 SARS-CoV-2 assembly was computed on sliding windows of 100 bp, overlapping by 50 bp, by counting the proportion of variable genomic sites contained in each window (sites in the window, divided by the total number of variable sites), using a custom Perl script. A Fisher exact test, contrasting the local variability in a window with the average variability in the genome, was used to identify hypervariable regions. P-values were corrected using the Benjamini-Hochberg procedure for the control of the False Discovery Rate.
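The windowed analysis described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the custom Perl script used in the study: the 2x2 contingency table (variable versus invariant sites, inside versus outside the window) and the one-sided alternative are assumptions, since the exact layout of the test is not specified here. Under that layout the one-sided Fisher exact p-value equals the upper tail of a hypergeometric distribution, which is what the helper computes. All function names and the toy positions are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, genome_len, n_variable, window_len):
    """P(X >= k) when drawing `window_len` sites from a genome of
    `genome_len` sites, `n_variable` of which are variable. This is the
    one-sided Fisher exact p-value for enrichment in the window."""
    upper = min(n_variable, window_len)
    numerator = sum(
        comb(n_variable, i) * comb(genome_len - n_variable, window_len - i)
        for i in range(k, upper + 1)
    )
    return numerator / comb(genome_len, window_len)


def window_variability(positions, genome_len, size=100, step=50):
    """Return (start, end, proportion, p_value) per window; 1-based coords."""
    sites = sorted(set(positions))
    total = len(sites)
    results = []
    for start in range(0, max(genome_len - size, 0) + 1, step):
        end = min(start + size, genome_len)
        k = sum(1 for p in sites if start < p <= end)
        proportion = k / total if total else 0.0
        p_value = hypergeom_upper_tail(k, genome_len, total, end - start)
        results.append((start + 1, end, proportion, p_value))
    return results


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest p-value downwards
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical toy data: 6 variable sites clustered at the 5' end.
toy_windows = window_variability([5, 10, 15, 20, 25, 30, 500, 900], 1000)
toy_adjusted = benjamini_hochberg([w[3] for w in toy_windows])
```

Exact integer arithmetic via `math.comb` keeps the sketch free of external dependencies; for genome-scale inputs a statistics library would be the more practical choice.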
Functional effects of the genetic variants identified from genome alignments were predicted by means of a custom Perl script, based on the reference annotation of the NC_045512.2 SARS-CoV-2 assembly.
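The logic of such a prediction can be sketched in Python for single nucleotide substitutions. This is a simplified illustration, not the custom Perl script used in the study: it assumes a single forward-strand CDS with no ribosomal frameshift, uses the standard genetic code, and handles neither indels nor frameshift ("FS") variants. The function name, the toy sequence and its coordinates are hypothetical; the "S", "NS" and "Non-coding" labels follow those used in Supplementary Table S4.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def classify_substitution(genome, cds_start, cds_end, pos, alt):
    """Classify a single-base substitution at 1-based position `pos`.

    `cds_start` and `cds_end` are 1-based inclusive CDS coordinates.
    Returns (effect, residue_number, amino_acid_change), where effect is
    "S", "NS" or "Non-coding".
    """
    if not cds_start <= pos <= cds_end:
        return ("Non-coding", None, None)
    offset = pos - cds_start                       # 0-based offset in CDS
    codon_start = cds_start - 1 + (offset // 3) * 3  # 0-based genome index
    ref_codon = genome[codon_start:codon_start + 3]
    i = offset % 3                                 # position within codon
    alt_codon = ref_codon[:i] + alt + ref_codon[i + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    effect = "S" if ref_aa == alt_aa else "NS"
    return (effect, offset // 3 + 1, f"{ref_aa}->{alt_aa}")


# Hypothetical toy genome: 2 nt of 5' UTR, CDS ATG GCT TTA, 2 nt of 3' UTR.
toy_genome = "AAATGGCTTTAGG"
silent = classify_substitution(toy_genome, 3, 11, 8, "C")    # GCT -> GCC
missense = classify_substitution(toy_genome, 3, 11, 7, "T")  # GCT -> GTT
utr = classify_substitution(toy_genome, 3, 11, 1, "G")
```

Applying the same function to each annotated CDS in turn, with its own coordinates, would extend the sketch to a multi-gene annotation.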
Determination of the optimal clustering solution was performed based on the Dunn Index metrics, as computed by the clValid R package (Brock et al, 2008).
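For reference, the Dunn index is the ratio between the smallest distance separating two clusters and the largest within-cluster diameter, so that higher values indicate compact, well separated clusters. The Python sketch below computes it for a given partition of binary presence/absence profiles. It is only an illustration of the metric: the use of the Hamming distance, the function names and the toy profiles are assumptions, and the clValid implementation may differ in its details.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def dunn_index(points, labels, dist=hamming):
    """Dunn index: minimum inter-cluster distance divided by the maximum
    intra-cluster diameter, for `points` partitioned by `labels`."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least 2 clusters")
    max_diameter = max(
        (dist(a, b)
         for members in clusters.values()
         for a, b in combinations(members, 2)),
        default=0,
    )
    min_separation = min(
        dist(a, b)
        for m1, m2 in combinations(clusters.values(), 2)
        for a in m1
        for b in m2
    )
    if max_diameter == 0:
        return float("inf")
    return min_separation / max_diameter


# Hypothetical toy profiles: one tuple of 0/1 variant calls per isolate
# (i.e. one column of the phenetic matrix).
toy_profiles = [
    (1, 1, 0, 0, 0, 0),
    (1, 1, 1, 0, 0, 0),
    (0, 0, 0, 1, 1, 1),
    (0, 0, 0, 0, 1, 1),
]
good_partition = dunn_index(toy_profiles, [0, 0, 1, 1])
bad_partition = dunn_index(toy_profiles, [0, 1, 0, 1])
```

Selecting the number of clusters then amounts to computing the index for the partitions obtained at each candidate value and retaining the one with the highest score.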
Graphical representation of the data and basic statistical analyses were performed by means of the standard libraries of the R programming language.
Supplementary Figures and Tables Legends
Supplementary Table S1. List of viral isolates included in the study. The table is in the same format as the submission acknowledgment table available from the GISAID EpiCoV website. The accession number of each isolate is reported in the first column, followed by the name of the viral strain (2nd column) and the geographic location where the virus was isolated (3rd column). Column 4 reports the collection date, Column 5 the Lab that isolated the virus, and Column 6 the Lab that submitted the sequence. Finally, Column 7 reports the complete reference for the submission.
Supplementary Table S2. Rates of nucleotide substitution as identified from genome alignment. For every possible single nucleotide substitution (rows), the table reports the frequency of that substitution in Column 1: alignments between SARS-CoV-2 genomes; Column 2: equivalent to Column 1, but computed only on polymorphic sites common to at least 2 genomes; Column 3: alignment between all the SARS-CoV isolates from the late phase of the SARS 2003 epidemic; Column 4: alignment between the SARS-CoV-2 reference genome and the genome of a SARS-CoV-2 like virus isolated from a bat (RaTG13 genome); Column 5: alignment between the SARS-CoV-2 reference genome and the genome of a SARS-CoV-2 like virus isolated from a pangolin.
Supplementary Table S3. Clustering of viral isolates. For each viral isolate included in the analysis, the table reports: the accession number (column 1), the total number of sites in the genome that were polymorphic with respect to the reference genome (column 2), the cluster to which the strain was assigned (column 3), and the geographic location from which the viral strain was isolated (column 4).
Supplementary Table S4. Functional annotation of polymorphic sites. The table lists all the 983 polymorphic sites identified from the comparison of 1113 SARS-CoV-2 genomes with the RefSeq assembly. Column 1 reports the genomic coordinate, followed by the sequence on the reference assembly (Column 2) and the alternative sequence (Column 3). For variants associated with protein coding genes, columns 4 and 5 are used to indicate the affected amino acid residue in protein coordinates, and the predicted change in amino acid sequence. Column 6 indicates the predicted functional effect: “S” indicates a silent substitution, “NS” a nonsynonymous substitution, “FS” a frameshift, and “Non-coding” that the variant is associated with a non protein coding region of the genome. Column 7 reports the corresponding gene/or functionally annotated genomic element, and column 8 indicates the number of viral isolates in each cluster that have the polymorphic site. The equivalent information for every cluster (number of isolates that have the variant) is reported in columns 9 to 16.
Supplementary Table S5. dN/dS ratio. The table reports the dN/dS ratio, computed by means of the KaKs_Calculator, for all the protein coding genes longer than 100 aa residues and with more than 50 variable sites associated with their CDS. Proteins are indicated by their gene symbol on the rows. “spike_rbd” indicates the spike receptor binding domain; “concat_no_spike” indicates a supergene formed by all the proteins in the genome with the exclusion of the spike protein.
Supplementary Figure S1. Heatmap of nucleotide substitution frequencies. The heatmap displays nucleotide substitution frequencies, as derived from whole genome alignment, as indicated in Table S2. Frequencies are reported in each cell of the heatmap. A gray (low) to blue (high) gradient of color is used.
Supplementary Figure S2. Histogram of the number of variable sites identified in A) any SARS-CoV-2 genome included in this study, with respect to the reference SARS-CoV-2 genome; B) SARS-CoV genomes from the late phase of the 2003 epidemic, with respect to the SARS-CoV reference genome.
Supplementary Figure S3. Prevalence of polymorphic sites in viral isolates. The number of viral genomes is reported on the X axis. The log-scaled count of polymorphic sites supported by that number of genomic sequences is indicated on the Y axis.
Supplementary Figure S4. Plot of genomic variability between the reference SARS-CoV-2 genome and the SARS-CoV-2 like virus isolated from pangolin specimens. Genomic coordinates are represented on the X axis, and the number of variable sites per window on the Y axis.
Acknowledgements
We thank ELIXIR Italy for providing the computing and bioinformatics facilities and Edward C. Holmes for his expert advice.