Abstract
Since the outbreak of the COVID-19 pandemic, the SARS-CoV-2 coronavirus has accumulated a substantial amount of genome compositional heterogeneity through mutation and recombination, which can be summarized by means of a measure of Sequence Compositional Complexity (SCC). To test evolutionary trends that could inform us about the adaptive process of the virus to its human host, we compute SCC in high-quality coronavirus genomes from across the globe, covering the full span of the pandemic. By using phylogenetic ridge regression, we find trends for SCC in the short time-span of SARS-CoV-2 pandemic expansion. In early samples, we find no statistical support for any trend in SCC values over time, although the virus genome appears to evolve faster than Brownian Motion expectation. However, in samples taken after the emergence of Variants of Concern with higher transmissibility, and controlling for phylogenetic and sampling effects, we detect a declining trend for SCC and an increasing one for its absolute evolutionary rate. This means that the decay in SCC itself accelerated over time, and that increasing fitness of variant genomes led to a reduction of their genome sequence heterogeneity. Therefore, our work shows that phylogenetic trends, typical of macroevolutionary time scales, can also be revealed on the shorter time spans typical of viral genomes.
1 Introduction
Pioneering works showed that RNA viruses are excellent material for studies of evolutionary genomics (Domingo, Webster, and Holland 1999; Moya, Holmes, and González-Candelas 2004; Worobey and Holmes 1999). Now, with the outbreak of the COVID-19 pandemic, this has become a key research topic. Despite the difficulties of inferring reliable phylogenies of SARS-CoV-2 (Pipes et al. 2021; Morel et al. 2020), as well as the controversy surrounding the first days and location of the pandemic (Worobey 2021; Koopmans et al. 2021), the most parsimonious explanation for the origin of SARS-CoV-2 seems to lie in a zoonotic event (Holmes et al. 2021). Direct bat-to-human spillover events may occur more often than reported, although most remain unknown (Sánchez et al. 2021). Bats are known as the natural reservoirs of SARS-like CoVs, and early evidence exists for the recombinant origin of bat (SARS)-like coronaviruses (Hon et al. 2008). A genomic comparison between these coronaviruses and SARS-CoV-2 has led to the proposal of a bat origin for the COVID-19 outbreak (Y. Z. Zhang and Holmes 2020). Indeed, a recombination event between the bat coronavirus and either an origin-unknown coronavirus (Ji et al. 2020) or a pangolin virus (T. Zhang, Wu, and Zhang 2020) would lie at the origin of SARS-CoV-2. Bat RaTG13 virus best matches the overall codon usage pattern of SARS-CoV-2 in orf1ab, spike, and nucleocapsid genes, while the pangolin P1E virus has a more similar codon usage in the membrane gene (Gu et al. 2020). Other intermediate hosts have been identified, such as RaTG15, and this knowledge is essential to prevent the further spread of the epidemic (Liu et al. 2020).
Despite its proofreading mechanism and the brief time-lapse since its appearance, SARS-CoV-2 has already accumulated a substantial amount of genomic and genetic variability (Elbe and Buckland-Merrett 2017; Hadfield et al. 2018; Hamed et al. 2021; Hatcher et al. 2017; Dorp et al. 2020; McBroome et al. 2021), which is due both to its recombinational origin (Naqvi et al. 2020) and to mutation and additional recombination events accumulated later (Cyranoski 2020; Jackson et al. 2021). Recent phylogenetic estimates of the substitution rate of SARS-CoV-2 suggest that its genome accumulates around two mutations per month. However, Variants of Concern (VoCs) can have 15 or more defining mutations and it is hypothesized that they emerged over the course of a few months, implying that they must have evolved faster for a period of time (Tay et al. 2022). Notably, RNA viruses can also accumulate high genetic variation during individual outbreaks (Pybus, Tatem, and Lemey 2015), showing mutation and evolutionary rates up to a million times higher than those of their hosts (Islam et al. 2020). Synonymous and non-synonymous mutations (Banerjee et al. 2020; Cai, Cai, and Li 2020), as well as mismatches and deletions in translated and untranslated regions (Islam et al. 2020; Young et al. 2020), have been tracked in the SARS-CoV-2 genome sequence.
Particularly interesting changes are those increasing viral fitness (Holmes et al. 2021; Dorp et al. 2020; Zhou et al. 2020), such as mutations giving rise to epitope loss and antibody escape mechanisms. These have mainly been found in evolved variants isolated from Europe and the Americas, and have critical implications for SARS-CoV-2 transmission, pathogenesis, and immune interventions (Gupta and Mandal 2020). Some studies have shown that SARS-CoV-2 is acquiring mutations more slowly than expected for neutral evolution, suggesting that purifying selection is the dominant mode of evolution, at least during the initial phase of the pandemic time course. Parallel mutations in multiple independent lineages and variants have been observed (Dorp et al. 2020), which may indicate convergent evolution, and are of particular interest in the context of adaptation of SARS-CoV-2 to the human host (Dorp et al. 2020). Other authors have reported some sites under positive pressure in the nucleocapsid and spike genes (Benvenuto et al. 2020). All this research effort has made it possible to track these changes in real time. The CoVizu project (https://filogeneti.ca/covizu/) provides a visualization of the global diversity of SARS-CoV-2 genomes.
Base composition varies at all levels of the phylogenetic hierarchy and throughout the genome, caused by active selection or passive mutation pressure (Mooers and Holmes 2000). The array of compositional domains in a genome can be potentially altered by most sequence changes (i.e., synonymous and non-synonymous nucleotide substitutions, insertions, deletions, recombination events, chromosome rearrangements, or genome reorganizations). Compositional domain structure can be altered either by changing nucleotide frequencies in a given region or by changing the nucleotides at the borders separating two domains, thus enlarging/shortening a given domain, or changing the number of domains (Bernaola-Galván, Román-Roldán, and Oliver 1996; Keith 2008; Oliver et al. 1999; Wen and Zhang 2003). Ideally, a genome sequence heterogeneity metric should be able to summarize all the mutational and recombinational events accumulated by a genome sequence over time (Bernaola-Galván et al. 2004; Fearnhead and Vasilieou 2009; Román-Roldán, Bernaola-Galván, and Oliver 1998).
In many organisms, the patchy sequence structure formed by the array of compositional domains with different nucleotide composition has been related to important biological features, i.e., GC content, gene and repeat densities, timing of gene expression, recombination frequency, etc. (G Bernardi et al. 1985; Oliver et al. 2004; Giorgio Bernardi 2015; Bernaola-Galván, Carpena, and Oliver 2008). Therefore, changes in genome sequence heterogeneity may be relevant on evolutionary and epidemiological grounds. Specifically, evolutionary trends in genome heterogeneity of the coronavirus could reveal adaptive processes of the virus to the human host.
To this end, we computed the Sequence Compositional Complexity, or SCC (Román-Roldán, Bernaola-Galván, and Oliver 1998), an entropic measure of genome-wide heterogeneity, representing the number of domains and nucleotide differences among them, identified in a genome sequence through a proper segmentation algorithm (Bernaola-Galván, Román-Roldán, and Oliver 1996). By using phylogenetic ridge regression, a method able to reveal both macro- (Serio et al. 2019; Melchionna et al. 2019) and micro-evolutionary (Moya et al. 2020) trends, we present here evidence for a long-term tendency of decreasing genome sequence heterogeneity in SARS-CoV-2. The trend is shared by its most important VoCs (Alpha and Delta) and greatly accelerated by the recent rise to dominance of Omicron (Du, Gao, and Wang 2022).
2 Results
2.1 Genome heterogeneity in the coronavirus
The first SARS-CoV-2 coronavirus genome sequence obtained at the onset of the pandemic (2019-12-30) was divided into eight compositional domains by our compositional segmentation algorithm (Bernaola-Galván, Román-Roldán, and Oliver 1996; Oliver et al. 1999; Bernaola-Galván, Carpena, and Oliver 2008; Oliver et al. 2004), resulting in an SCC value of 5.7E-03 bits per sequence position (Figure 1).
From then on, descendant coronaviruses have shown substantial variation in the number, length, and nucleotide composition of their compositional domains, which is reflected in a variety of SCC values. The number of segments ranges between 4 and 10, while SCC values range between 2.71E-03 and 6.8E-03 bits per sequence position. The strain name, the collection date, and the SCC values for each analyzed genome are shown in Supplementary Tables S1-S18 available in Zenodo (https://doi.org/10.5281/zenodo.6844917).
2.2 Temporal evolution of SCC over the coronavirus pandemic time course
To characterize the temporal evolution of SCC over the entire time course of the coronavirus pandemic (December 2019 to March 2022), we downloaded from GISAID/Audacity (Khare et al. 2021; Elbe and Buckland-Merrett 2017; Shu and McCauley 2017) a series of random samples of high-quality genome sequences over consecutive time lapses, each starting at the outbreak of the COVID-19 pandemic (December 2019) and progressively including younger samples up to March 2022 (Table 1). In each sample, we filtered and masked these sequences using the GenBank reference genome MN908947.3 to eliminate sequence oddities (Hodcroft et al. 2021). Non-duplicated genomes were aligned with MAFFT (Katoh and Standley 2013), and the best ML timetree was then inferred using IQ-TREE 2 (Minh et al. 2020) and rooted to the GISAID reference genome (hCoV-19/Wuhan/WIV04/2019|EPI_ISL_402124|2019-12-30). The proportion of variant genomes in each sample was also determined (Table 1, columns 5-8).
Finally, we sought temporal trends in SCC values and evolutionary rates by using the function search.trend in the R package RRphylo (Silvia Castiglione et al. 2018), contrasting the realized slope of SCC versus time regression to a family of 1,000 slopes generated under the Brownian Motion (BM) model of evolution, which models evolution with no trend in either the SCC or its evolutionary rate. We found that SARS-CoV-2 genome sequence heterogeneity did not follow any trend in SCC during the first year of the pandemic time course, as indicated by the non-significant SCC against time regressions in any sample ending before December 2020 (Table 1). With the emergence of variants in December 2020 (s1573, Table 1), the genome sequence heterogeneity started to decrease significantly over time. In contrast to the decreasing trend observed for SCC, a clear tendency towards faster evolutionary rates occurred throughout the study period, indicating that the virus increased in variability early on but took on a monotonic trend in declining SCC as VoCs appeared. These results were robust to several sources of uncertainty, including those related to the algorithms used for multiple alignment or to infer phylogenetic trees (see the section ‘Checking results reliability’ in Supplementary Material). In summary, these analyses show that statistically significant trends for declining heterogeneity began between the end of December 2020 (s1573) and March 2021 (s1871) corresponding with the emergence of the first VoC (Alpha), a path that continued with the successive emergence of other variants.
2.3 Relative contribution of individual variants to the SARS-CoV-2 evolutionary trends
2.3.1 SCC trends of variants
We estimated the relative contribution of the three main VoCs (Alpha, Delta, and Omicron) to the trends in SARS-CoV-2 evolution by picking samples both before (s726, s730) and after (s1871, s1990) their appearance. The trends for SCC and its evolutionary rate in sample s1990, which includes a sizeable number of Omicron genomes, are shown in Figure 2. In all these samples, we tested trends for variants individually (as well as for the samples’ trees as a whole) while accounting for phylogenetic uncertainty, by randomly altering the phylogenetic topology and branch lengths 100 times per sample (see Materials and Methods, and Supplementary Material for details). These precautions seem necessary to us to ensure accuracy in the conclusions based on the SARS-CoV-2 phylogenies we inferred (Wertheim, Steel, and Sanderson 2022). In agreement with the previous analyses (seventeen consecutive bins, see Table 1), we found strong support for a decrease in SCC values through time along phylogenies including variants (s1871, s1990) and no support for any temporal trend in older samples. Just four out of the 200 random trees produced for samples s726 and s730 showed a trend in SCC evolution. The corresponding figure for the two younger samples is 186/200 significant and negative instances of declining SCC over time (Table 2). This ∼50-fold increase in the likelihood of finding a consistent trend in declining SCC over time is shared unambiguously by all tested variants (Alpha, Delta, and Omicron; Table 3). Yet, Omicron shows a significantly stronger decline in SCC than the other variants (Table 3), suggesting that the trends starting with the appearance of the main variants became stronger with the emergence of Omicron by the end of 2021.
We tested the difference between the slope of the SCC versus time regression computed for all variants pooled into a single group and the slope computed for all other strains pooled together. The test was performed using the function emtrends available within the R package emmeans (Lenth 2022). We found the slope for the group that includes all variants to be significantly larger than the slope for the other strains (estimate = −0.772E-08, P-value = 0.006), again pointing to the decisive effect of VoCs on the SCC temporal trend.
2.3.2 SCC evolutionary rates of variants
SCC evolutionary rate (absolute magnitude of the rate) tends to increase over time (Table 2). The slope of SCC rates through time regression for Omicron was always significantly lower than the slope computed for the rest of the tree (Table 3). This was also true for Alpha and Delta, although with much lower support.
3 Discussion
Here we show that despite its short length (29,912 bp for the reference genome) and the short time-lapse analyzed (28 months), the coronavirus RNA genome sequences can be segmented (Fig. 1) into 4-9 compositional domains (∼0.27 segments per kbp on average). Although such segment density is lower than in free-living organisms, like cyanobacteria where we observe an average density of 0.47 segments per kbp (Moya et al. 2020), it may suffice for comparative evolutionary analyses of compositional sequence heterogeneity in these genomes, which might shed light on the origin and evolution of the COVID-19 pandemic.
In early samples (i.e., collected before the emergence of variants), we found no statistical support for any trend in SCC values over time, although the virus as a whole appears to evolve faster than BM expectation. However, in samples taken after the first VoC with higher transmissibility (Alpha) appeared in the GISAID database (December 2020), we started to detect statistically significant downward trends in SCC (Table 1). Concomitantly with the temporal decay in SCC, its absolute evolutionary rate kept increasing with time, meaning that the decline in SCC itself accelerated over time. In agreement with this notion, although declining SCC is an evolutionary path shared by variants, the nearly threefold increase in rates intensified after the appearance of the most recent VoC (Omicron) in late 2021, which shows a much faster decline in SCC than the other variants (Table 3). These results indicate the existence of a driven, probably adaptive, trend in the variants toward a reduction of genome sequence heterogeneity. Furthermore, the emergence of VoCs may also be associated with an episodic increase in the substitution rate of around 4-fold the background phylogenetic rate estimate (Tay et al. 2022). It is well established that variant genomes have accumulated a higher proportion of adaptive mutations, which allows them to neutralize host resistance or escape host antibodies (Thorne et al. 2021; Venkatakrishnan et al. 2021; Mlcochova et al. 2021), consequently gaining higher transmissibility (a paradigmatic example is the recent outbreak of the Omicron variant). The sudden increases in fitness of variant genomes may also be due to the gathering of co-mutations that become prevalent world-wide compared to single mutations, being largely responsible for their temporal changes in transmissibility and virulence (Ilmjärv et al. 2021; Majumdar and Niyogi 2021).
In fact, more contagious and perhaps more virulent VoCs share mutations and deletions that have arisen recurrently in distinct genetic backgrounds (Richard et al. 2021). We show here that these increases in fitness of variant genomes, associated with a higher transmissibility, lead to a reduction of their genome sequence heterogeneity, thus explaining the general decay of SCC in line with the pandemic expansion.
We conclude that the accelerated loss of genome heterogeneity in the coronavirus is promoted by the rise of high viral fitness variants, leading to adaptation to the human host, a well-known process in other viruses (Bahir et al. 2009). Further monitoring of the evolutionary trends in current and new co-mutations, variants, and recombinant lineages (Ledford 2022; Straten et al. 2022; Callaway 2022) by means of the tools used here will help to elucidate whether and to what extent the evolution of genome sequence heterogeneity in the virus impacts human health.
4 Materials and Methods
4.1 Data retrieval, filtering, masking and alignment
We retrieved random samples of high-quality coronavirus genome sequences (EPI_SET_20220604yp, available at https://doi.org/10.55876/gis8.220604yp) from the GISAID/Audacity database (Khare et al. 2021; Elbe and Buckland-Merrett 2017; Shu and McCauley 2017). MAFFT (Katoh and Standley 2013) was used to align each random sample to the genome sequence of the isolate Wuhan-Hu-1 (GenBank accession MN908947.3); the alignments were then filtered and masked to avoid sequence oddities (Hodcroft et al. 2021). In order to check the reliability of the results (see Supplementary Material), we also analyzed a further 3,059 genomes of the SARS-CoV-2 Nextstrain global dataset (Hadfield et al. 2018) downloaded from https://nextstrain.org/ncov/open/global?f_host=Homo%20sapiens on 2021-10-08.
4.2 Phylogenetic trees
The best ML timetree for each random sample in Table 1 was inferred using IQ-TREE 2 (Minh et al. 2020), using the GTR nucleotide substitution model (Tavaré 1986; Rodríguez et al. 1990) and the least square dating (LSD2) method (To et al. 2016), finally rooting the timetree to the GISAID coronavirus reference genome (EPI_ISL_402124, hCoV-19/Wuhan/WIV04/2019, WIV04).
4.3 Compositional segmentation algorithm
To divide the coronavirus genome sequence into an array of compositionally homogeneous, non-overlapping domains, we used a heuristic, iterative segmentation algorithm (Bernaola-Galván, Román-Roldán, and Oliver 1996; Oliver et al. 1999; Bernaola-Galván, Carpena, and Oliver 2008; Oliver et al. 2004). We chose the Jensen-Shannon divergence as the divergence measure between adjacent segments, as it can be directly applied to symbolic nucleotide sequences. At each iteration, we used a significance threshold (s = 0.95) to split the sequence into two segments whose nucleotide composition is homogeneous at the chosen significance level, s. The process continued iteratively over the new resulting segments while sufficient significance continued to appear.
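As a rough illustration of the splitting step described above, the following Python sketch computes the length-weighted Jensen-Shannon divergence between the two halves of a candidate cut and returns the cut that maximizes it. The function names (shannon_entropy, js_divergence, best_split) are hypothetical and are not taken from the authors' software; the statistical significance test against the threshold s and the recursive application to the resulting segments are deliberately omitted, and the brute-force scan is far slower than a production implementation would be.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (bits) of the symbol composition of seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def js_divergence(left, right):
    """Length-weighted Jensen-Shannon divergence between two adjacent segments."""
    n = len(left) + len(right)
    return (shannon_entropy(left + right)
            - (len(left) / n) * shannon_entropy(left)
            - (len(right) / n) * shannon_entropy(right))


def best_split(seq, min_len=1):
    """Return (position, divergence) of the cut with maximal JS divergence.

    Brute-force scan over all cut positions (illustration only).
    """
    best_pos, best_div = None, -1.0
    for i in range(min_len, len(seq) - min_len + 1):
        div = js_divergence(seq[:i], seq[i:])
        if div > best_div:
            best_pos, best_div = i, div
    return best_pos, best_div


# Toy sequence (not real data): two perfectly homogeneous halves.
toy = "A" * 50 + "G" * 50
cut, divergence = best_split(toy)
```

In a full segmentation, the cut would be accepted only if its divergence were significant at the chosen level, and the procedure would then be repeated on each of the two resulting segments.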
4.4 Computing the Sequence compositional complexity (SCC)
Once each coronavirus genome sequence was segmented into an array of statistically significant, homogeneous compositional domains, its genome sequence heterogeneity was measured by computing the Sequence Compositional Complexity, or SCC (Román-Roldán, Bernaola-Galván, and Oliver 1998). SCC increases with both the number of segments and the degree of compositional differences among them. Thus, SCC is analogous to other biological complexity measures, particularly to that described by McShea and Brandon (McShea and Brandon 2010), in which an organism is more complex if it has a greater number of parts and a higher differentiation among these parts. It should be emphasized that SCC is highly sensitive to any change in the RNA genome sequence, either nucleotide substitutions, indels, genome rearrangements, or recombination events, all of which could alter the number of domains or the compositional nucleotide differences among them.
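A minimal sketch of how such a measure can be computed is given below. It assumes the usual entropic form of SCC, namely the entropy of the whole sequence minus the length-weighted mean entropy of its segments, which yields a value in bits per sequence position; this formula is our reading of the cited definition rather than something spelled out in this section, and the function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (bits) of the symbol composition of seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def scc(seq, boundaries):
    """Assumed SCC: H(whole sequence) minus the length-weighted mean of H(segment).

    boundaries holds the internal cut positions produced by a segmentation step.
    """
    cuts = [0] + sorted(boundaries) + [len(seq)]
    segments = [seq[a:b] for a, b in zip(cuts, cuts[1:])]
    n = len(seq)
    weighted = sum((len(s) / n) * shannon_entropy(s) for s in segments)
    return shannon_entropy(seq) - weighted


# Toy sequences (not real data).
two_domains = scc("A" * 50 + "G" * 50, [50])          # two contrasting domains
four_domains = scc("AAAACCCCGGGGTTTT", [4, 8, 12])    # four contrasting domains
no_domains = scc("A" * 50 + "G" * 50, [])             # unsegmented sequence
```

Under this assumed form, an unsegmented sequence has an SCC of zero, and the value grows as more, or more strongly differentiated, domains are found, which matches the behaviour described above.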
4.5 Phylogenetic ridge regression
To search for trends in SCC values and evolutionary rates over time, phylogenetic ridge regression was applied using the RRphylo R package V. 2.5.8 (Silvia Castiglione et al. 2018). The estimated SCC value for each tip or node in the phylogenetic tree was regressed against its age (the phylogenetic time distance, which represents the time distance between the first sequence ever of the virus and the collection date of individual virus isolates); the regression slope was then compared to BM expectation (which models evolution according to no trend in SCC values and rates over time) by generating 1,000 slopes simulating BM evolution on the phylogenetic tree, using the function search.trend (S Castiglione et al. 2019) in the RRphylo R package.
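The logic of comparing an observed slope with a Brownian Motion null distribution can be sketched as follows. This is not the RRphylo implementation: the ridge regression step that estimates values at internal nodes is skipped, the tree is a hypothetical toy given as a parents-first list, the BM rate sigma is fixed by assumption rather than estimated from the data, and all function names are ours.

```python
import random


def node_ages(parents, lengths):
    """Root-to-node time for each node; nodes must be listed parents-first."""
    ages = []
    for parent, length in zip(parents, lengths):
        ages.append(0.0 if parent is None else ages[parent] + length)
    return ages


def simulate_bm(parents, lengths, sigma, rng):
    """One BM realisation: each branch adds a N(0, sigma^2 * length) increment."""
    values = []
    for parent, length in zip(parents, lengths):
        if parent is None:
            values.append(0.0)
        else:
            values.append(values[parent] + rng.gauss(0.0, sigma * length ** 0.5))
    return values


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def trend_test(parents, lengths, observed, sigma=1.0, n_sim=1000, seed=0):
    """Return the observed slope and the fraction of BM slopes at or below it."""
    rng = random.Random(seed)
    ages = node_ages(parents, lengths)
    obs_slope = ols_slope(ages, observed)
    null = [ols_slope(ages, simulate_bm(parents, lengths, sigma, rng))
            for _ in range(n_sim)]
    p_lower = sum(s <= obs_slope for s in null) / n_sim
    return obs_slope, p_lower


# Hypothetical toy tree: node 0 is the root; values decline steeply with age.
parents = [None, 0, 0, 1, 1, 2, 2]
lengths = [0.0, 1.0, 1.0, 1.0, 2.0, 2.0, 1.0]
observed = [-10.0 * a for a in node_ages(parents, lengths)]
slope, p_lower = trend_test(parents, lengths, observed)
```

A small p_lower indicates that the observed slope is more negative than almost all slopes obtained under trendless BM evolution on the same tree, which is the sense in which a declining trend is called significant here.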
4.6 Comparing the effects of variants on the evolutionary trend
In order to explicitly test the effect of variants and to compare the variants with one another, we selected four different trees and SCC data (s730, a727, s1871, s1990) from Table 1. In each sample, we accounted for phylogenetic uncertainty by producing 100 dichotomous versions of the initial tree, removing polytomies with the RRphylo function fix.poly (Silvia Castiglione et al. 2018). This function randomly resolves polytomous clades by adding non-zero length branches to each new node and equally partitioning the evolutionary time attached to the new nodes below the dichotomized clade. Each randomly fixed tree was used to evaluate temporal trends in SCC and its evolutionary rates occurring on the entire tree and on individual variants, if present, by applying search.trend. Additionally, for the larger phylogenies (i.e., s1871 and s1990 lineage-wise trees) half of the tree was randomly sampled and half of the tips were removed. This way we avoided biasing the results due to different tree sizes.
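The idea of producing many randomly dichotomized versions of a tree can be illustrated with a topology-only sketch. It does not reproduce fix.poly: branch lengths, and therefore the partitioning of evolutionary time among the new nodes, are ignored, the tree is a hypothetical nested tuple of tip labels, and the function names are ours.

```python
import random


def resolve_polytomies(tree, rng):
    """Return a strictly bifurcating copy of a nested-tuple tree (tips are strings)."""
    if isinstance(tree, str):
        return tree
    children = [resolve_polytomies(child, rng) for child in tree]
    while len(children) > 2:
        # Join two randomly chosen children under a new internal node.
        i, j = sorted(rng.sample(range(len(children)), 2))
        right = children.pop(j)  # pop the larger index first so i stays valid
        left = children.pop(i)
        children.append((left, right))
    return tuple(children)


def leaves(tree):
    """List the tip labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def is_binary(tree):
    """True if every internal node has exactly two children."""
    if isinstance(tree, str):
        return True
    return len(tree) == 2 and all(is_binary(child) for child in tree)


# Hypothetical tree with a five-way and a three-way polytomy.
polytomous = ("A", "B", "C", "D", ("E", "F", "G"))
replicates = [resolve_polytomies(polytomous, random.Random(seed))
              for seed in range(100)]
```

Each replicate keeps the same tips but resolves the polytomies differently, so running the trend test on every replicate gives a sense of how sensitive the result is to the unresolved parts of the tree.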
5 Supplementary Material
Additional details regarding the methods used in this study are provided in the Supplementary Information and in the supplementary data files available in Zenodo (https://doi.org/10.5281/zenodo.6844917).
6 Funding
This project was funded by grants from the Spanish Minister of Science, Innovation and Universities (former Spanish Minister of Economy and Competitiveness) to J.L.O. (Project AGL2017-88702-C2-2-R) and A.M. (Project PID2019-105969GB-I00), a grant from Generalitat Valenciana to A.M. (Project Prometeo/2018/A/133) and co-financed by the European Regional Development Fund (ERDF). The most time-demanding computations were done on the servers of the Laboratory of Bioinformatics, Dept. of Genetics & Institute of Biotechnology, Center of Biomedical Research, 18100, Granada, Spain.
8 Data availability
The data underlying this article are available in Zenodo at https://zenodo.org/, and can be accessed with https://zenodo.org/record/6844917.
9 Contributions
J.L.O., M.V. and A.M. designed research; J.L.O., P.B., F.P., C.G.M, S.C., P.R., M.V. and A.M. performed research. J.L.O., P.B., F.P., C.G.M, S.C., P.R., M.V. and A.M. analyzed data; J.L.O., M.V., A.M. and P.R. drafted the paper. All authors read and approved the final manuscript.
10 Competing interests
The authors declare no competing interests.
7 Acknowledgements
We gratefully acknowledge all data contributors, i.e., the Authors and their Originating laboratories responsible for obtaining the specimens, and their Submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID Initiative, on which this research is based. A complete list acknowledging all originating and submitting laboratories is available in the GISAID’s EpiCoV database (Khare et al. 2021; Elbe and Buckland-Merrett 2017; Shu and McCauley 2017) (EPI_SET_ID: EPI_SET_20220604yp; DOI: https://doi.org/10.55876/gis8.220604yp). In the same way, we gratefully acknowledge the authors, originating and submitting laboratories of the genome sequences we used for the analysis of the SARS-CoV-2 Nextstrain global dataset (Hadfield et al. 2018), downloaded on 2021-10-08; a complete acknowledgement list is shown in Supplementary Table S19 available in Zenodo (https://zenodo.org/record/6844917).
Footnotes
E-mails: José L. Oliver, oliver{at}ugr.es
Pedro Bernaola-Galván, rick{at}uma.es
Francisco Perfectti, fperfect{at}ugr.es
Cristina Gómez Martín, c.a.gomezmartin{at}amsterdamumc.nl
Silvia Castiglione, silviacastiglione2{at}gmail.com
Pasquale Raia, pasquale.raia{at}unina.it
Miguel Verdú, Miguel.Verdu{at}ext.uv.es
Andrés Moya, Andres.Moya{at}uv.es