Abstract
Outstanding evolutionary questions remain about the recent emergence in Hubei province of the coronavirus SARS-CoV-2/hCoV-19, which caused the COVID-19 pandemic, including (1) the relationship of the new virus to the SARS-related coronaviruses, (2) the role of bats as a reservoir species, (3) the potential role of other mammals in the emergence event, and (4) the role of recombination in viral emergence. Here, we address these questions and find that the sarbecoviruses – the viral subgenus responsible for the emergence of SARS-CoV and SARS-CoV-2 – exhibit frequent recombination, but the SARS-CoV-2 lineage itself is not a recombinant of any viruses detected to date. To apply phylogenetic methods to date the divergence events between SARS-CoV-2 and the bat sarbecovirus reservoir, recombinant regions of a 68-genome sarbecovirus alignment were removed with three independent methods. Bayesian evolutionary rate and divergence date estimates were consistent for all three recombination-free alignments and robust to two different prior specifications based on HCoV-OC43 and MERS-CoV evolutionary rates. Divergence dates between SARS-CoV-2 and the bat sarbecovirus reservoir were estimated as 1948 (95% HPD: 1879-1999), 1969 (95% HPD: 1930-2000), and 1982 (95% HPD: 1948-2009). Despite intensified characterization of sarbecoviruses since SARS, the lineage giving rise to SARS-CoV-2 has been circulating unnoticed in bats for decades and has been transmitted to other hosts such as pangolins. The occurrence of a third significant coronavirus emergence in 17 years, together with the high prevalence and virus diversity in bats, implies that these viruses are likely to cross species boundaries again.
In Brief
The Betacoronavirus SARS-CoV-2 is a member of the sarbecovirus subgenus, which shows frequent recombination in its evolutionary history. We characterize the extent of this genetic exchange and identify non-recombining regions of the sarbecovirus genome using three independent methods to remove the effects of recombination. Using these non-recombining genome regions and prior information on coronavirus evolutionary rates, the three approaches place the most likely divergence date of SARS-CoV-2 from its most closely related available bat sequences between 1948 and 1982.
Key Points
RaTG13 is the closest available bat virus to SARS-CoV-2; a sub-lineage of these bat viruses is able to infect humans. Two sister lineages of the RaTG13/SARS-CoV-2 lineage infect Malayan pangolins.
The sarbecoviruses show a pattern of deep recombination events, indicating that there are high levels of co-infection in horseshoe bats and that the viral pool can generate novel allele combinations and substantial genetic diversity; the sarbecoviruses are efficient ‘explorers’ of phenotype space.
The SARS-CoV-2 lineage is not a recent recombinant, at least not involving any of the bat or pangolin viruses sampled to date.
Non-recombinant regions of the sarbecoviruses can be identified, allowing phylogenetic inference and dating. We constructed three such regions using different methods.
We estimate that RaTG13 and SARS-CoV-2 diverged 40 to 70 years ago. There is a diverse unsampled reservoir of generalist viruses established in horseshoe bats.
While an intermediate host responsible for the zoonotic event cannot be ruled out, the relevant evolution for spillover to humans very likely occurred in horseshoe bats.