Abstract
During the spread of the COVID-19 pandemic, the SARS-CoV-2 coronavirus underwent mutation and recombination events that altered its genome compositional structure, thus providing an unprecedented opportunity to check an evolutionary process in real time. The mutation rate is known to be lower than expected for neutral evolution, suggesting purifying selection and convergent evolution. We begin by summarizing the compositional heterogeneity of each viral genome by computing its Sequence Compositional Complexity (SCC). To analyze the full range of SCC diversity, we select random samples of high-quality coronavirus genomes covering the full span of the pandemic. We then search for evolutionary trends that could inform us on the adaptive process of the virus to its human host by computing the phylogenetic ridge regression of SCC against time (i.e., the collection date of each viral isolate). In early samples, we find no statistical support for any trend in SCC values, although the viral genome appears to evolve faster than Brownian Motion (BM) expectation. However, in samples taken after the emergence of high fitness variants, and despite the brief time span elapsed, a driven decreasing trend for SCC and an increasing one for its absolute evolutionary rate are detected, pointing to a role for purifying selection in the evolution of SCC in the coronavirus. The higher fitness of variant genomes leads to adaptive trends of SCC over pandemic time in the coronavirus.
Introduction
Given the difficulty of observing evolution directly over long periods, test-tube experiments have proved a particularly powerful tool for examining evolutionary dynamics. Richard Lenski’s long-term evolution experiment (LTEE), in which a laboratory population of Escherichia coli has been sampled through 60,000 generations, shows the relationships between rates of genomic evolution and organismal adaptation (Barrick et al., 2009; Good et al., 2017). Experimental evolution of a major evolutionary innovation (the origin of multicellularity) has also been carried out both in experimentally tractable model organisms (Ratcliff et al., 2012) and in a unicellular relative of animals (Burnetti and Ratcliff, 2022). In the same way, computer simulations of digital organisms have revealed important aspects of evolutionary dynamics (Adami, Ofria and Collier, 2000). Now, the outbreak of the COVID-19 pandemic provides an unprecedented opportunity to check for phylogenetic trends by analyzing a natural evolutionary process in real time, which could provide helpful information on the adaptive process of the viral genome to its human host.
Pioneering work showed that RNA viruses are excellent material for studies of evolutionary genomics (Domingo, Webster and Holland, 1999; Worobey and Holmes, 1999; Moya, Holmes and González-Candelas, 2004). Despite the difficulties of inferring reliable phylogenies of SARS-CoV-2 (Morel et al., 2020; Pipes et al., 2021), as well as the controversy surrounding the first days and location of the pandemic (Koopmans et al., 2021; Worobey, 2021), the most parsimonious explanation for the origin of SARS-CoV-2 seems to lie in a zoonotic event (Holmes et al., 2021; Balloux et al., 2022). Direct bat-to-human spillover events may occur more often than reported, although most remain unknown (Sánchez et al., 2022). Bats are known as the natural reservoirs of SARS-like CoVs, and early evidence exists for the recombinant origin of bat (SARS)-like coronaviruses (Hon et al., 2008). A genomic comparison between these coronaviruses and SARS-CoV-2 has led to the proposal of a bat origin for the COVID-19 outbreak (Zhang and Holmes, 2020). Indeed, a recombination event between the bat coronavirus and either an origin-unknown coronavirus (Ji et al., 2020) or a pangolin virus (Zhang, Wu and Zhang, 2020) would lie at the origin of SARS-CoV-2. Bat RaTG13 virus best matches the overall codon usage pattern of SARS-CoV-2 in orf1ab, spike, and nucleocapsid genes, while the pangolin P1E virus has a more similar codon usage in the membrane gene (Gu et al., 2020). Other intermediate hosts have been identified, such as RaTG15, and this knowledge is essential to prevent the further spread of the epidemic (Liu et al., 2020).
Despite its proofreading mechanism and the brief time-lapse since its appearance, SARS-CoV-2 has accumulated a considerable amount of genomic and genetic variability (Elbe and Buckland-Merrett, 2017; Hatcher et al., 2017; Hadfield et al., 2018; Dorp et al., 2020; Islam et al., 2020; Hamed et al., 2021; McBroome et al., 2021), dramatically impacting viral nucleotide composition and genome organization. Synonymous and non-synonymous mutations (Banerjee et al., 2020; Cai, Cai and Li, 2020; González-Candelas et al., 2021), as well as mismatches and deletions in translated and untranslated regions (Islam et al., 2020; Young et al., 2020), have been tracked in the SARS-CoV-2 genome. This variability may be related both to its recombinational origin (Naqvi et al., 2020) and to the mutation and additional recombination events accumulated later (Cyranoski, 2020; Jackson et al., 2021).
Recent phylogenetic estimates of the substitution rate of SARS-CoV-2 suggest that its genome accumulates around two mutations per month. However, Variants of Concern (VoCs) can have 15 or more defining mutations, and it is hypothesized that they emerged over the course of a few months, implying that they must have evolved faster for a period of time (Tay et al., 2022). Notably, RNA viruses can also accumulate high genetic variation during individual outbreaks (Pybus, Tatem and Lemey, 2015), showing mutation and evolutionary rates up to a million times higher than those of their hosts (Islam et al., 2020).
Particularly interesting are those changes increasing viral fitness (Dorp et al., 2020; Garvin et al., 2020; Zhou et al., 2020; Holmes et al., 2021), such as mutations giving rise to epitope loss and antibody escape mechanisms. These have mainly been found in evolved variants isolated from Europe and the Americas, and have critical implications for SARS-CoV-2 fitness (transmission, pathogenesis, and immune interventions (Gupta and Mandal, 2020; Loucera et al., 2022)). Some studies have shown that SARS-CoV-2 is acquiring mutations more slowly than expected for neutral evolution, suggesting that purifying selection is the dominant mode of evolution, at least during the initial phase of the pandemic time course. Parallel mutations in multiple independent lineages and variants have been observed (Dorp et al., 2020), which may indicate convergent evolution, and are of particular interest in the context of adaptation of SARS-CoV-2 to the human host. Survival analysis of mutations in the SARS-CoV-2 genome revealed that 27 of them were significantly associated with higher mortality of patients (Loucera et al., 2022). Other authors have reported some sites under positive pressure in the nucleocapsid and spike genes (Benvenuto et al., 2020). This impressive research effort has allowed all these changes to be tracked in real time. The CoVizu project (https://filogeneti.ca/covizu/) provides a near real-time visualization of SARS-CoV-2 global diversity, the COVID-19 CG website (Chen et al., 2021) tracks SARS-CoV-2 mutations and lineages by locations and dates of interest, while the CoV-Spectrum website (Chen et al., 2022) supports the identification of new SARS-CoV-2 variants of concern and the tracking of known variants. Another recently developed tool (Sanderson, 2022) allows the visualization of mutation-annotated trees of millions of SARS-CoV-2 sequences (https://cov2tree.org/).
Nucleotide compositional biases throughout the genome have been identified at all levels of the phylogenetic hierarchy, including RNA viruses (Gaunt and Digard, 2022), being caused either by active selection or passive mutation pressure (Mooers and Holmes, 2000). The array of compositional domains in a genome can potentially be altered by most sequence changes (i.e., synonymous and non-synonymous nucleotide substitutions, insertions, deletions, recombination events, chromosome rearrangements, or genome reorganizations). Compositional domain structure can be altered either by changing nucleotide frequencies in a given region or by changing the nucleotides at the borders separating two domains, thus enlarging or shortening a given domain, or changing the number of domains (Bernaola-Galván, Román-Roldán and Oliver, 1996; Oliver et al., 1999; Wen and Zhang, 2003; Keith, 2008). Ideally, a metric of nucleotide compositional heterogeneity should be able to summarize all the mutational and recombinational events accumulated by a genome sequence over time (Román-Roldán, Bernaola-Galván and Oliver, 1998; Bernaola-Galván et al., 2004; Fearnhead and Vasilieou, 2009).
In many organisms, the patchy sequence structure formed by the array of compositional domains with different nucleotide composition (i.e., GC content) has been related to important biological features, such as gene and repeat densities, timing of gene expression, and recombination frequency (Bernardi et al., 1985; Oliver et al., 2004; Bernaola-Galván, Carpena and Oliver, 2008; Bernardi, 2015). Therefore, changes in sequence compositional heterogeneity may be relevant on evolutionary and epidemiological grounds. Specifically, the existence of evolutionary trends in the compositional complexity of the coronavirus could reveal adaptive processes of the virus to the human host.
To search for such trends, we computed the Sequence Compositional Complexity, or SCC (Román-Roldán, Bernaola-Galván and Oliver, 1998), an entropic measure of nucleotide compositional heterogeneity, representing the number of domains and nucleotide differences among them, identified in a genome sequence through a proper segmentation algorithm (Bernaola-Galván, Román-Roldán and Oliver, 1996). By using phylogenetic ridge regression, a method able to reveal both macro-(Melchionna et al., 2019; Serio et al., 2019) and micro-evolutionary (Moya et al., 2020) trends, we present here evidence for long-term adaptive tendencies of decreasing sequence compositional heterogeneity, and an increasing one for its evolutionary rate, in SARS-CoV-2. Both trends are shared by its most important VoCs (Alpha and Delta), being greatly accelerated by the recent rise to dominance of Omicron (Du, Gao and Wang, 2022).
Results
Sequence compositional complexity in the coronavirus
The first SARS-CoV-2 coronavirus genome sequence obtained at the onset of the pandemic (2019-12-30) was divided into eight compositional domains by the compositional segmentation algorithm (Bernaola-Galván, Román-Roldán and Oliver, 1996; Oliver et al., 1999, 2004; Bernaola-Galván, Carpena and Oliver, 2008), resulting in an SCC value of 5.7E-03 bits per sequence position (Figure 1).
From then on, descendant coronaviruses have shown substantial variation in the number, length, and nucleotide composition of their domains, which is reflected in a variety of SCC values. The number of segments ranges between 4 and 10, while SCC ranges between 2.71E-03 and 6.8E-03 bits per sequence position. The strain name, the collection date, and the SCC values for each analyzed genome are shown in Supplementary Tables S1-S18 available in the open repository Zenodo (https://doi.org/10.5281/zenodo.6844917).
Temporal evolution of SCC over the coronavirus pandemic
To characterize the temporal evolution of SCC over the time course of the coronavirus pandemic, we downloaded from GISAID/Audacity (Elbe and Buckland-Merrett, 2017; Shu and McCauley, 2017; Khare et al., 2021) a series of random samples of high-quality genome sequences over consecutive time lapses, each starting at the outbreak of COVID-19 (December 2019) and progressively including younger samples up to March 2022 (Table 1). In each random sample, we filtered and masked the genome sequences using the GenBank reference genome MN908947.3 to eliminate sequence oddities (Hodcroft, Domman, et al., 2021). Non-duplicated genome sequences were aligned with MAFFT (Katoh and Standley, 2013), and the best ML timetree was then inferred using IQ-TREE 2 (Minh et al., 2020) and rooted to the GISAID reference genome (hCoV-19/Wuhan/WIV04/2019|EPI_ISL_402124|2019-12-30). The proportion of variant genomes in each sample was determined with Nextclade (Aksamentov et al., 2021) (Table 1, columns 5-8).
Finally, we sought temporal phylogenetic trends in SCC values and evolutionary rates by using the function search.trend in the RRphylo R package (Castiglione et al., 2018), contrasting the realized slope of SCC versus time regression to a family of 1,000 slopes generated under the BM model of evolution, which models evolution with no trend in either the SCC or its evolutionary rate. We found that SARS-CoV-2 sequence compositional heterogeneity did not follow any trend in SCC during the first year of the pandemic time course, as indicated by the non-significant SCC against time regressions in any sample ending before December 2020 (Table 1). With the emergence of variants in December 2020 (s1573, Table 1), the SCC started to decrease significantly over time. In contrast to the decreasing trend observed for SCC, a clear tendency towards faster evolutionary rates occurred throughout the study period, indicating that the virus increased in variability early on but took on a monotonic trend in declining SCC as VoCs appeared. These results were robust to several sources of uncertainty, including those related to the algorithms used for multiple alignment or to infer phylogenetic trees (see the section ‘Checking results reliability’ in Supplementary Information). In summary, these analyses show that statistically significant trends for declining SCC began between the end of December 2020 (s1573) and March 2021 (s1871) corresponding with the emergence of the first VoC (Alpha), a path that continued with the successive emergence of other variants. This may suggest a role for purifying selection in the evolution of SCC in the coronavirus.
Relative contributions of individual variants to the SARS-CoV-2 evolutionary trends
SCC trends of variants
We estimated the relative contribution of the three main VoCs (Alpha, Delta, and Omicron) to the trends in SARS-CoV-2 evolution by picking samples both before (s726, s730) and after (s1871, s1990) their appearance. The trends for SCC and its evolutionary rate in sample s1990, which includes a sizeable number of Omicron genomes, are shown in Figure 2. In all these samples, we tested trends for variants individually (as well as for the samples’ trees as a whole) while accounting for phylogenetic uncertainty, by randomly altering the phylogenetic topology and branch lengths 100 times per sample (see Methods, and Supplementary Information for details). These precautions seem necessary to us to ensure accuracy in the conclusions based on the SARS-CoV-2 phylogenies we inferred (Wertheim, Steel and Sanderson, 2022). In agreement with the previous analyses (seventeen consecutive bins, see Table 1), we found strong support for a decrease in SCC values through time along phylogenies including variants (s1871, s1990) and no support for any temporal trend in older samples. Just four out of the 200 random trees produced for samples s726 and s730 showed a trend in SCC evolution. The corresponding figure for the two younger samples is 186/200 significant and negative instances of declining SCC over time (Table 2). This roughly 47-fold increase (186 versus 4 of 200 trees) in the likelihood of finding a consistent trend in declining SCC over time is shared unambiguously by all tested variants (Alpha, Delta, and Omicron; Table 3). Yet, Omicron shows a significantly stronger decline in SCC than the other variants (Table 3), suggesting that the trends starting with the appearance of the main variants became stronger with the emergence of Omicron by the end of 2021.
We tested the difference between the slope of the SCC versus time regression computed for all variants pooled into a single group and the corresponding slope for all other strains grouped together. The test was performed using the function emtrends available within the R package emmeans (Lenth, 2022). We found the slope for the group that includes all variants to be significantly larger than the slope for the other strains (estimate = −0.772 × 10^−8, P-value = 0.006), further pointing to the decisive effect of VoCs on the SCC temporal trend.
SCC evolutionary rates of variants
SCC evolutionary rate (absolute magnitude of the rate) tends to increase over time (Table 2). The slope of SCC rates through time regression for Omicron was always significantly lower than the slope computed for the rest of the tree (Table 3). This was also true for Alpha and Delta, although with much lower support.
Discussion
Here we show that despite its short length (29,912 bp for the reference genome) and the brief time-lapse analyzed (28 months), the coronavirus RNA genome sequences can be segmented (Fig. 1 and Supplementary Tables S1-S18) into 4-10 compositional domains (~0.27 segments per kbp on average). Although such segment density is lower than in free-living organisms, such as cyanobacteria, where an average density of 0.47 segments per kbp was observed (Moya et al., 2020), it may suffice for comparative evolutionary analyses of SCC in these genomes, which might shed light on the origin and evolution of the COVID-19 pandemic.
In early samples (i.e., collected before the emergence of variants), we found no statistical support for any trend in SCC values over time, although the virus as a whole appears to evolve faster than BM expectation. However, in samples taken after the first higher fitness VoC with higher transmissibility (Alpha) appeared in the GISAID database (December 2020), we started to detect statistically significant downward trends in SCC (Table 1). Concomitantly to the temporal decay in SCC, its absolute evolutionary rate kept increasing with time, meaning that the decline in SCC itself accelerated over time. In agreement with this notion, although declining SCC is an evolutionary path shared by variants, the nearly threefold increase in rates intensified after the appearance of the most recent VoC (Omicron) in late 2021, which shows a much faster decline in SCC than the other variants (Table 3). These results indicate the existence of a driven, probably adaptive, trend in the variants toward a reduction of SCC.
The emergence of VoCs has been associated with an episodic increase in the substitution rate of around 4-fold the background phylogenetic rate estimate (Tay et al., 2022). It is also known that variant genomes have accumulated a higher proportion of adaptive mutations, which allows them to neutralize host resistance or escape host antibodies (Mlcochova et al., 2021; Thorne et al., 2021; Venkatakrishnan et al., 2021), consequently gaining higher transmissibility (a paradigmatic example is the recent outbreak of the Omicron variant). The sudden increases in fitness of variant genomes may be due to the gathering of co-mutations that become prevalent worldwide compared to single mutations, being largely responsible for their temporal changes in transmissibility and virulence (Ilmjärv et al., 2021; Majumdar and Niyogi, 2021). In fact, more contagious and perhaps more virulent VoCs share mutations and deletions that have arisen recurrently in distinct genetic backgrounds (Richard et al., 2021). We show here that these increases in fitness of variant genomes, associated with a higher transmissibility, lead to a reduction of their sequence compositional heterogeneity, thus explaining the general decay of SCC in line with the pandemic expansion. We conclude that the accelerated loss of SCC in the coronavirus is promoted by the rise of high viral fitness variants, leading to adaptation to the human host, a well-known process in other viruses (Bahir et al., 2009). Further monitoring of the evolutionary trends in current and new co-mutations, variants, and recombinant lineages (Callaway, 2022; Ledford, 2022; Straten et al., 2022) by means of the tools used here will help to elucidate whether and to what extent the evolution of SCC in the virus impacts human health.
Methods
Data retrieval, filtering, masking and alignment
The sequences of the random samples of high-quality coronavirus genomes we retrieved from the GISAID/Audacity database (Elbe and Buckland-Merrett, 2017; Shu and McCauley, 2017; Khare et al., 2021) were compiled as EPI_SET_20220604yp, which is available at https://doi.org/10.55876/gis8.220604yp. MAFFT (Katoh and Standley, 2013) was used to align each random sample to the genome sequence of the isolate Wuhan-Hu-1 (GenBank accession MN908947.3); the alignments were then filtered and masked to avoid sequence oddities (Hodcroft, Domman, et al., 2021). In order to check the reliability of our results (see the section ‘Checking results reliability’ in Supplementary Information), we also analyzed another 3,059 genomes of the SARS-CoV-2 Nextstrain global dataset (Hadfield et al., 2018) downloaded from https://nextstrain.org/ncov/open/global?f_host=Homo%20sapiens on 2021-10-08.
Phylogenetic trees
The best ML timetree for each random sample in Table 1 was inferred using IQ-TREE 2 (Minh et al., 2020), using the GTR nucleotide substitution model (Tavaré, 1986; Rodríguez et al., 1990) and the least square dating (LSD2) method (To et al., 2016), finally rooting the timetree to the GISAID coronavirus reference genome (EPI_ISL_402124, hCoV-19/Wuhan/WIV04/2019, WIV04).
Compositional segmentation algorithm
To divide the coronavirus genome sequence into an array of compositionally homogeneous, non-overlapping domains, we used a heuristic, iterative segmentation algorithm (Bernaola-Galván, Román-Roldán and Oliver, 1996; Oliver et al., 1999, 2004; Bernaola-Galván, Carpena and Oliver, 2008). We chose the Jensen-Shannon divergence as the divergence measure between adjacent segments, as it can be directly applied to symbolic nucleotide sequences. At each iteration, we used a significance threshold (s = 0.95) to split the sequence into two segments whose nucleotide composition is homogeneous at the chosen significance level, s. The process continued iteratively over the new resulting segments while sufficient significance continued to appear.
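A single splitting step of this kind can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the cited works: the function names (shannon_entropy, best_cut) are hypothetical, and the significance test against the threshold s is deliberately omitted, so the sketch only locates the cut point that maximizes the Jensen-Shannon divergence between the left and right subsequences.

```python
from collections import Counter
from math import log2


def shannon_entropy(counts, total):
    """Shannon entropy (bits) of a composition given as symbol counts."""
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)


def best_cut(seq):
    """Return (position, divergence) of the cut maximizing the
    Jensen-Shannon divergence between seq[:position] and seq[position:].

    Hypothetical helper for illustration; no significance test is applied.
    """
    n = len(seq)
    total = Counter(seq)
    h_whole = shannon_entropy(total, n)
    left = Counter()
    best_pos, best_jsd = None, -1.0
    for i in range(1, n):
        left[seq[i - 1]] += 1
        right = total - left  # Counter subtraction keeps only positive counts
        jsd = (
            h_whole
            - (i / n) * shannon_entropy(left, i)
            - ((n - i) / n) * shannon_entropy(right, n - i)
        )
        if jsd > best_jsd:
            best_pos, best_jsd = i, jsd
    return best_pos, best_jsd


# Toy sequence with one sharp compositional boundary at position 50.
position, divergence = best_cut("A" * 50 + "G" * 50)
```

In a full segmentation, the same step would be applied recursively to each resulting segment, accepting a cut only when its divergence passes the significance criterion.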
Computing the Sequence Compositional Complexity (SCC)
Once each coronavirus genome sequence was segmented into an array of statistically significant, homogeneous compositional domains, its nucleotide compositional heterogeneity was measured by computing the Sequence Compositional Complexity, or SCC (Román-Roldán, Bernaola-Galván and Oliver, 1998). SCC increases with both the number of domains in the genome and the degree of compositional differences among them. Thus, SCC is analogous to other biological complexity measures, particularly to that described by McShea and Brandon (McShea and Brandon, 2010), in which an organism is more complex if it has a greater number of parts and a higher differentiation among these parts. It should be emphasized that SCC is highly sensitive to any change in the RNA genome sequence, whether nucleotide substitutions, indels, genome rearrangements, or recombination events, all of which could alter the number of domains or the differences in nucleotide frequencies among them.
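The dependence of SCC on the number of domains and on their compositional differences can be illustrated with a short Python sketch. It assumes that SCC takes the entropic form H(S) − Σ (n_i/N) H(S_i), that is, the entropy of the whole sequence minus the length-weighted entropies of its domains; this form is our reading of the cited definition and should be checked against Román-Roldán, Bernaola-Galván and Oliver (1998). The function names and the toy sequences are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(seq):
    """Shannon entropy (bits per position) of the symbol composition of seq."""
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())


def scc(seq, cuts):
    """Assumed SCC form: whole-sequence entropy minus the length-weighted
    mean entropy of the domains delimited by the cut positions in `cuts`."""
    edges = [0, *sorted(cuts), len(seq)]
    n = len(seq)
    within = sum(
        (b - a) / n * entropy(seq[a:b]) for a, b in zip(edges, edges[1:])
    )
    return entropy(seq) - within


two_domains = scc("A" * 50 + "G" * 50, [50])       # two sharply different domains
four_domains = scc("AAAACCCCGGGGTTTT", [4, 8, 12])  # four sharply different domains
no_domains = scc("A" * 50 + "G" * 50, [])           # unsegmented sequence
```

Under this assumed form, an unsegmented sequence has zero complexity, and the value grows as more domains with more distinct compositions are resolved.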
Phylogenetic ridge regression
To search for trends in SCC values and evolutionary rates over time, phylogenetic ridge regression was applied using the RRphylo R package V. 2.5.8 (Castiglione et al., 2018). The estimated SCC value for each tip or node in the phylogenetic tree was regressed against its age (the phylogenetic time distance, which represents the time distance between the first sequence ever of the virus and the collection date of individual virus isolates); the regression slope was then compared to BM expectation (which models evolution according to no trend in SCC values and rates over time) by generating 1,000 slopes simulating BM evolution on the phylogenetic tree, using the function search.trend (Castiglione et al., 2019) in the RRphylo R package.
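The logic of comparing a realized slope with slopes simulated under BM can be sketched in Python. This is a deliberately simplified stand-in and not the RRphylo implementation: it uses ordinary least squares on tip values only, a toy tree given as (child, parent, branch length) triples, and a BM rate (sigma2) supplied by the user, whereas in practice the rate would have to be estimated from the data. All names and values are hypothetical.

```python
import random


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def node_ages(edges):
    """Root-to-node time distance; edges must list parents before children."""
    age = {"root": 0.0}
    for child, parent, length in edges:
        age[child] = age[parent] + length
    return age


def simulate_bm(edges, sigma2, rng):
    """One Brownian-motion realization along the tree (no trend)."""
    value = {"root": 0.0}
    for child, parent, length in edges:
        value[child] = value[parent] + rng.gauss(0.0, (sigma2 * length) ** 0.5)
    return value


def trend_p_value(edges, tips, observed, n_sim=1000, sigma2=1.0, seed=0):
    """Two-sided Monte Carlo p-value of the observed slope against BM slopes."""
    rng = random.Random(seed)
    age = node_ages(edges)
    x = [age[t] for t in tips]
    obs = ols_slope(x, [observed[t] for t in tips])
    null = []
    for _ in range(n_sim):
        sim = simulate_bm(edges, sigma2, rng)
        null.append(ols_slope(x, [sim[t] for t in tips]))
    extreme = sum(abs(s) >= abs(obs) for s in null)
    return obs, (extreme + 1) / (n_sim + 1)


# Hypothetical toy tree and trait values, for illustration only.
toy_edges = [("n1", "root", 1.0), ("t1", "n1", 1.0), ("t2", "n1", 2.0), ("t3", "root", 4.0)]
toy_values = {"t1": -2.0, "t2": -3.0, "t3": -4.0}
slope, p_value = trend_p_value(toy_edges, ["t1", "t2", "t3"], toy_values)
```

The sketch conveys only the Monte Carlo comparison against a trendless null; the ridge-regression estimation of rates and ancestral states performed by RRphylo is not reproduced here.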
Comparing the effects of variants on the evolutionary trend
In order to explicitly test the effect of variants and to compare variants with each other, we selected four different trees and SCC data (s730, a727, s1871, s1990) from Table 1. In each sample, we accounted for phylogenetic uncertainty by producing 100 dichotomous versions of the initial tree, removing polytomies with the RRphylo function fix.poly (Castiglione et al., 2018). This function randomly resolves polytomous clades by adding non-zero length branches to each new node and equally partitioning the evolutionary time attached to the new nodes below the dichotomized clade. Each randomly fixed tree was used to evaluate temporal trends in SCC and its evolutionary rates occurring on the entire tree and on individual variants, if present, by applying search.trend. Additionally, for the larger phylogenies (i.e., s1871 and s1990 lineage-wise trees), half of the tree was randomly sampled and half of the tips were removed. In this way, we avoided biasing the results due to different tree sizes.
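The idea of randomly resolving polytomies can be illustrated with the following Python sketch. It is not the fix.poly algorithm: the tree representation (a dictionary of children plus a dictionary of branch lengths), the function name, and the rule for assigning lengths are assumptions made for illustration. Here each new internal branch takes a fixed fraction of the shorter of the two branches it joins, and both are shortened by the same amount so that root-to-tip distances are preserved; branch lengths are assumed to be positive.

```python
import random
from itertools import count


def resolve_polytomies(children, length, rng, frac=0.5):
    """Randomly turn every node with more than two children into a series of
    bifurcations. Returns new (children, length) dictionaries.

    Hypothetical simplification: the new branch gets frac * min(length of the
    two joined branches), which is subtracted from both joined branches.
    """
    children = {k: list(v) for k, v in children.items()}
    length = dict(length)
    new_ids = count()
    for node in list(children):
        kids = children[node]
        while len(kids) > 2:
            a, b = rng.sample(kids, 2)       # pick two lineages to join
            new = f"new{next(new_ids)}"
            shift = frac * min(length[a], length[b])
            length[new] = shift              # non-zero if inputs are positive
            length[a] -= shift
            length[b] -= shift
            children[new] = [a, b]
            kids.remove(a)
            kids.remove(b)
            kids.append(new)
    return children, length


# Toy star tree: one polytomy with four tips, each at distance 1.0 from the root.
star_children = {"root": ["a", "b", "c", "d"]}
star_length = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
resolved_children, resolved_length = resolve_polytomies(
    star_children, star_length, random.Random(1)
)
```

Repeating the call with different random seeds yields a set of alternative dichotomous trees, which is the role the 100 randomly fixed trees play in the analysis described above.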
Supplementary Information
Additional details regarding the methods used in this study are provided in the Supplementary Information and in the supplemental data files available in the open repository Zenodo (https://doi.org/10.5281/zenodo.6844917).
Author contributions
J.L.O., M.V. and A.M. designed research; J.L.O., P.B., F.P., C.G.M, S.C., P.R., M.V. and A.M. performed research. J.L.O., P.B., F.P., C.G.M, S.C., P.R., M.V. and A.M. analyzed data; J.L.O., M.V., A.M. and P.R. drafted the paper. All authors read and approved the final manuscript.
Funding
This project was funded by grants from the Spanish Ministry of Science, Innovation and Universities (former Spanish Ministry of Economy and Competitiveness) to J.L.O. (Project AGL2017-88702-C2-2-R) and A.M. (Project PID2019-105969GB-I00), and by a grant from Generalitat Valenciana to A.M. (Project Prometeo/2018/A/133), and was co-financed by the European Regional Development Fund (ERDF). The most time-demanding computations were done on the servers of the Laboratory of Bioinformatics, Dept. of Genetics & Institute of Biotechnology, Center of Biomedical Research, 18100, Granada, Spain.
Availability of data and materials
The data underlying this article are available in the open repository Zenodo at https://zenodo.org/, and can be accessed with https://zenodo.org/record/6844917.
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare no competing interests.
Acknowledgements
We gratefully acknowledge all data contributors, i.e., the Authors and their Originating laboratories responsible for obtaining the specimens, and their Submitting laboratories for generating the genetic sequence and metadata and sharing them via the GISAID Initiative, on which this research is based. A complete list acknowledging all originating and submitting laboratories is available in GISAID’s EpiCoV database (Elbe and Buckland-Merrett, 2017; Shu and McCauley, 2017; Khare et al., 2021) (EPI_SET_ID: EPI_SET_20220604yp; DOI: https://doi.org/10.55876/gis8.220604yp). In the same way, we gratefully acknowledge the authors and the originating and submitting laboratories of the genome sequences we used for the analysis of the SARS-CoV-2 Nextstrain global dataset (Hadfield et al., 2018), downloaded on 2021-10-08; a complete acknowledgement list is shown in Supplementary Table S19 available in Zenodo (https://zenodo.org/record/6844917).