Abstract
SARS-CoV-2, the cause of the ongoing pandemic, is the third zoonotic coronavirus identified in the last twenty years. Previously, four other known coronaviruses moved from animal reservoirs into humans and now cause primarily mild-to-moderate respiratory disease. The emergence of these viruses likely involved a period of intense transmission before becoming endemic, highlighting the recurrent threat to human health posed by animal coronaviruses. Enzootic and epizootic coronaviruses of diverse lineages pose a significant threat to livestock, as most recently observed for virulent strains of porcine epidemic diarrhea virus (PEDV) and swine acute diarrhea-associated coronavirus (SADS-CoV). Unique among RNA viruses, coronaviruses encode a proofreading exonuclease (ExoN) that lowers point mutation rates to increase the viability of large RNA virus genomes, at the cost of limiting virus adaptation via point mutation. This limitation can be overcome by high rates of recombination that facilitate rapid increases in genetic diversity. To compare dynamics of recombination between related sequences, we developed an open-source computational workflow (IDPlot) to measure nucleotide identity, locate recombination breakpoints, and infer phylogenetic relationships. We analyzed recombination dynamics among three groups of coronaviruses with impacts on livestock or human health: SARSr-CoV, Betacoronavirus-1, and SADSr-CoV. We found that all three groups undergo recombination with highly diverged viruses, disrupting phylogenetic relationships and revealing contributions of unknown coronavirus lineages to the genetic diversity of established groups. Dynamic patterns of recombination impact inferences of relatedness between diverse coronaviruses and expand the genetic pool that may contribute to future zoonotic events. These results illustrate the limitations of current sampling approaches for anticipating zoonotic threats to human and animal health.
Introduction
In the 21st century alone three zoonotic coronaviruses have caused widespread human infection: SARS-CoV in 2002 [1], MERS-CoV in 2012 [2], and SARS-CoV-2 in 2019 [3]. Four other coronaviruses, OC43, 229E, NL63, and HKU1 are endemic in humans and cause mild-to-moderate respiratory disease with low fatality rates, though they may cause outbreaks of severe disease in vulnerable populations [4–7]. Like SARS-CoV-2, SARS-CoV, and MERS-CoV, these endemic viruses emerged from animal reservoirs. The origins of 229E and NL63 have been convincingly linked to bats, much like the 21st century novel coronaviruses [8–10]. In a striking parallel, both MERS-CoV and 229E appear to have emerged from bats into camelids, established a new persistent reservoir, and then spilled over into humans [11–14]. In contrast, the viral lineages that include OC43 and HKU1 originated in rodents [15,16], though very limited rodent sampling leaves us with a poor understanding of deep evolutionary history of these viruses. Given the short infectious period of human coronavirus infections, the establishment of endemicity was likely preceded by a period of intense and widespread transmission on regional or global scales. In other words, SARS-CoV-2 is likely the fifth widespread coronavirus epidemic or pandemic involving a still-circulating virus, though the severity of the previous four cannot be reliably ascertained.
Livestock are similarly impacted by spillover of coronaviruses from wildlife reservoirs. Three viruses closely related to OC43, bovine coronavirus (BCoV), equine coronavirus (ECoV) and porcine hemagglutinating encephalomyelitis virus (PHEV) are enzootic or epizootic in cows, horses, and pigs respectively [17]. Since 2017, newly emerged swine acute diarrhea syndrome-associated coronavirus (SADS-CoV) has caused significant mortality of piglets over the course of several outbreaks [18,19]. Sampling of bats proximal to impacted farms determined that SADS-CoV outbreaks are independent spillover events of SADSr(elated)-CoVs circulating in horseshoe bats [20]. Molecular studies of SADS-CoV have identified the potential for further cross-species transmission, including the ability to infect primary human airway and intestinal cells [21,22].
Emergence of novel viruses requires access to new hosts, often via ecological disruption, and the ability to efficiently infect these hosts, frequently driven by adaptive evolution. Uniquely among RNA viruses, coronavirus genomes encode a proofreading exonuclease that results in a significantly lower mutation rate for coronaviruses compared to other RNA viruses [23,24]. This mutational constraint is necessary for maintaining the stability of the large (27-32 kb) RNA genome but limits the evolution of coronaviruses via point mutation. The high recombination rate of coronaviruses compensates for the adaptive constraints imposed by high-fidelity genome replication [24,25]. The spike glycoprotein in particular has previously been identified as a recombination hotspot [26]. Acquisition of new spikes may broaden or alter receptor usage, enabling host-switches or expansion of host range. Additionally, it may result in evasion of population immunity within established host species, effectively expanding the pool of susceptible individuals. Recombination in other regions of the genome is less well-documented but may also influence host range, virulence, and tissue tropism, and likely contributed to the emergence of SARS-CoV [27,28].
To study the dynamics of recombination among clinically significant coronavirus lineages we developed a novel web-based software, IDPlot, that incorporates multiple analysis steps into a single user-friendly workflow. Analyses performed by IDPlot include multiple sequence alignment, nucleotide similarity analysis, and tree-based breakpoint prediction using the GARD algorithm from the HyPhy genetic analysis suite [29]. IDPlot also allows the direct export of sequence regions to NCBI Blast to ease identification of closest relatives to recombinant regions of interest.
Using IDPlot, we analyzed recombination events in three distinct lineages of coronaviruses: SARS-CoV-2-like viruses, OC43-like viruses (Betacoronavirus-1) in the Betacoronavirus genus, and the SADSr-CoV group of alphacoronaviruses. In all three groups, we found clear evidence of frequent recombination resulting in closely related viruses exhibiting a high degree of divergence in discrete genomic regions. Recombination was particularly enriched around and within the spike gene. Across all three groups, recombination has occurred with uncharacterized coronavirus lineages, indicating that coronavirus diversity remains considerably under-sampled and that rapid adaptations due to recombination may have unpredictable consequences for human and/or animal health. The potential for viruses to rapidly acquire novel phenotypes through such recombination events underscores the importance of a more robust and coordinated ecological, public health, and research response to the ongoing pandemic threat of coronaviruses.
Results
Coronavirus phylogenetic relatedness is variable across genomes
Coronavirus genomes, at 27-32 kilobases (kb) in length, are among the largest known RNA genomes, surpassed only by invertebrate viruses in the same Nidovirales order [30,31]. The 5’ ~20 kb of the genome comprises open reading frames 1a and 1b, which are translated directly from the genome as polyproteins pp1a and pp1ab and proteolytically cleaved into constituent proteins (Figure 1A) [32]. Orf1ab is among the most conserved genes and encodes proteins essential for replication, including the RNA-dependent RNA-polymerase (RdRp), 3C-like protease (3ClPro), helicase, and methyltransferase. Given the high degree of conservation in this region, coronavirus species classification is typically determined by the relatedness of these key protein-coding regions [33]. The 3’ ~10 kb of the genome contains structural genes including those encoding the spike and the nucleocapsid proteins, as well as numbered accessory genes that are unique to coronavirus genera and subgenera [34]. In contrast to the relative stability of the replicase region of the genome, the structural and accessory region, and in particular the spike glycoprotein, have been identified as recombination hotspots [26].
We set out to characterize the role of recombination in generating diversity across the coronavirus phylogeny. A classic signature of recombination is differing topology and/or branch lengths of phylogenetic trees depending on what genomic regions are analyzed. To identify lineages of interest for recombination analysis, we built a maximum-likelihood phylogenetic tree of full-length RdRp-encoding regions of representative alpha and betacoronaviruses, which contain all human and most mammalian coronaviruses (Figure 1B). From this tree we chose to further investigate the evolutionary dynamics of three clinically significant groups of coronaviruses: SARS-CoV-2 like viruses (blue) from within SARSr(elated)-CoV, among which recombination has been reported though not characterized in detail, endemic and enzootic OC43-like viruses of Betacoronavirus-1 (BetaCoV1) (red), and SADSr-CoVs (magenta).
Within each group there is little diversity within RdRp: 94-99% nt identity among the SADSr-CoVs, >97% nt identity within Betacoronavirus-1, and 91-99% among the SARS-CoV-2-like viruses (Figure S2). In contrast, spike gene phylogenetic trees of each group show much more diversity as reflected in extended branch lengths and/or changes in tree topology, though the latter are constrained by limitations in sampling (Figure 1B-D). Together, these results provide strong support for recombination of highly divergent spike genes into viruses within these groups.
IDPlot Facilitates Nucleotide Identity and Recombination Analysis
To further investigate recombination-driven diversity among these viruses we developed IDPlot, which incorporates several distinct analysis steps into a single Nextflow workflow [35] and generates a comprehensive HTML report to facilitate interpretation and further analysis. First, IDPlot generates a multiple sequence alignment using MAFFT (Figure 2A) [36] with user-assigned reference and query sequences. Default window size for sliding window analysis is 500 nucleotides, but is customizable. In its default configuration, IDPlot then generates an average nucleotide identity (ANI) plot, also displaying the multiple sequence alignment with differences to the reference sequence (colored vertical lines) and gaps (gray boxes) clearly highlighted. The plot is zoomable, and selected sequence regions can be exported directly to NCBI BLAST. Users can also choose to run GARD, the recombination detection program from the HyPhy suite of genomic analysis tools [29]. If GARD is implemented (Figure 2B), distinct regions of the multiple sequence alignment are depicted between the alignment and the ANI plot, and phylogenetic trees for each region are generated using FastTree2 (Figure 2C) [37] and displayed (Figure 2E).
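The sliding-window identity calculation described above can be illustrated with a minimal sketch. This is not IDPlot's implementation: the function name `sliding_window_identity`, the `step` parameter, and the gap handling (columns gapped in both sequences are skipped, a gap in only one sequence counts as a mismatch) are assumptions made for illustration. The inputs are assumed to be two rows taken from the same multiple sequence alignment, so they have equal length.

```python
def sliding_window_identity(reference, query, window=500, step=1):
    """Percent nucleotide identity of `query` to `reference` in sliding windows.

    Both inputs are rows of the same alignment (equal length, '-' for gaps).
    Returns a list of (window_start, percent_identity) tuples.
    Gap handling is an illustrative assumption, not IDPlot's documented rule.
    """
    if len(reference) != len(query):
        raise ValueError("aligned sequences must have the same length")
    if window < 1 or step < 1:
        raise ValueError("window and step must be positive")

    # If the alignment is shorter than the window, use one window over all of it.
    last_start = max(len(reference) - window, 0)
    profile = []
    for start in range(0, last_start + 1, step):
        matches = 0
        compared = 0
        ref_slice = reference[start:start + window]
        query_slice = query[start:start + window]
        for ref_base, query_base in zip(ref_slice, query_slice):
            if ref_base == "-" and query_base == "-":
                continue  # column is a gap in both rows: carries no information
            compared += 1
            if ref_base.upper() == query_base.upper():
                matches += 1
        identity = 100.0 * matches / compared if compared else float("nan")
        profile.append((start, identity))
    return profile
```

Plotting the second element of each tuple against the first gives an identity profile of the kind shown in the ANI plot; a sharp, sustained drop in one query relative to the others is the pattern that motivates breakpoint analysis.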
A significant barrier to effective use of GARD is that because it ultimately presents multiple (sometimes dozens of) iterations, model choice and therefore the selection of breakpoints for further analysis can be challenging. To alleviate this issue the IDPlot output includes a graph with cumulative Akaike information criterion (ΔAIC-c) on the y-axis and the GARD iteration on the x-axis (Figure 2D). GARD uses ΔAIC-c to indicate the degree of fit improvement afforded by successive iterations, and this graph allows the user to easily determine when improvements become increasingly marginal, which is often accompanied by prediction of spurious breakpoints. Upon selection of a GARD iteration, the display switches to show the associated phylogenetic trees (Figure 2E). The ability to export sequences directly to BLAST enables the user to search for sequence identity in GenBank to regions sometimes highly divergent from reference sequences.
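The reasoning behind the ΔAIC-c graph can be sketched in code. In IDPlot the choice of iteration is made visually by the user; the function below is only a hypothetical heuristic that mimics that judgment. It assumes the input is a list of cumulative ΔAIC-c improvements, one per GARD iteration, and the name `last_meaningful_iteration` and the 5% cutoff `min_fraction` are illustrative assumptions, not part of GARD or IDPlot.

```python
def last_meaningful_iteration(cumulative_delta_aicc, min_fraction=0.05):
    """Return the 0-based index of the last iteration with a non-marginal gain.

    `cumulative_delta_aicc` is assumed to hold the cumulative improvement in
    AIC-c after each successive iteration. An iteration is treated as marginal
    when its incremental gain is below `min_fraction` of the total improvement
    (an arbitrary, illustrative cutoff).
    """
    if not cumulative_delta_aicc:
        raise ValueError("at least one iteration is required")

    total = cumulative_delta_aicc[-1]
    if total <= 0:
        return 0  # no overall improvement: fall back to the first iteration

    chosen = 0
    previous = 0.0
    for index, value in enumerate(cumulative_delta_aicc):
        gain = value - previous
        if gain / total < min_fraction:
            break  # improvements have become marginal; stop here
        chosen = index
        previous = value
    return chosen
```

A rule like this can suggest a starting point, but breakpoints from the selected iteration still need to be checked against the identity plot and the per-region trees, since a fixed cutoff cannot distinguish a weak real signal from a spurious one.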
SARS-CoV-2-like virus recombination with distant SARSr-CoVs
To test and validate IDPlot as a tool for examining the recombination dynamics of coronaviruses, we initially conducted an analysis of SARS-CoV-2-like viruses within SARSr-CoV. We chose these viruses as our initial IDPlot case study because recombination has been previously described [38,39], though not characterized in detail. This provided the opportunity to evaluate IDPlot against a known framework but also advance our understanding of the role recombination has played in the evolution of these clinically significant viruses.
Prior to 2019 the SARS-CoV-2 branch within SARSr-CoV was known only from a single, partial RdRp sequence published in 2016 [40]. Upon the discovery of SARS-CoV-2 this RdRp sequence was extended to full genome-scale [3] and additional representatives from bats and pangolins have since been identified [39,41,42]. However, this singularly consequential lineage remains under-sampled and its evolutionary history largely obscured. Most attention on these viruses to date has focused on the recent evolutionary history of SARS-CoV-2 with respect to possible animal reservoirs and recombinant origins. Much less attention has been paid to analyzing the evolution of known close relatives, the bat viruses RaTG13 and RmYN02, and PangolinCoV/GD19.
Using IDPlot, we did not uncover evidence in support of the idea that SARS-CoV-2 arose via recombination, consistent with previously published work [38]. RaTG13 shows consistently high identity across the genome with the only notable dip comprising the receptor-binding domain in the C-terminal region of spike S1 (Figure 3A), which is proposed to originate via either recombination or diversifying selection [38]. However, the limited sampling in the SARS-CoV-2-like lineage results in weak phylogenetic signals unable to distinguish between rapid mutational divergence and recombination producing the low ANI in the RaTG13 receptor binding domain.
In contrast, PangolinCoV/GD19 and RmYN02 show one and two significant drops in ANI, respectively. Phylogenetic analysis of the PangolinCoV/GD19 recombinant region captures the signal for both that virus (Figure 3A, 3C, S3C) and RmYN02 RR1, showing that both viruses fall onto separate branches highly divergent from SARS-CoV-2 and RaTG13 (Figure 3C) with only 81% and 74% nucleotide identity to the closest sequences in GenBank, respectively (Figure 3D, S3A). These findings identify three unique spike genes among SARS-CoV-2 and its three closest known relatives (Figure 3D), indicative of recombination with SARSr-CoV lineages that remain to be discovered despite being the focus of intense virus sampling efforts over the last eighteen years, since the emergence of SARS-CoV.
In addition to spike, RmYN02 contains a second recombinant region that encompasses the 3’ end of Orf7b and the large majority of Orf8 (Figure 3A, S3A). Orf8 is known to be highly dynamic in SARSr-CoVs. SARS-CoV underwent an attenuating 29 nt deletion in Orf8 in 2002-2003 [43] and Orf8 deletions have been identified in numerous SARS-CoV-2 isolates as well [44–46]. In bat SARSr-CoVs intact Orf8 is typically though not always present but exhibits a high degree of phylogenetic incongruence. Additionally, the progenitor of SARS-CoV encoded an Orf8 gene gained by recombination [28,47]. The BtCoV/RmYN02 Orf8 has only 50% nt identity to SARS-CoV-2 Orf8 and groups as a distantly related member of the branch containing SARS-CoV (Figure 3E), exhibiting just 80% nucleotide identity to the closest known sequence. Although the precise function of Orf8 is unknown, there is some evidence that like other accessory proteins it mediates immune evasion [43]. Therefore, recombination in Orf8 has the potential to alter virus-host interactions and may, like spike recombination, impact host range and virulence.
This analysis confirmed that IDPlot allows us to characterize recombination events in detail with a single workflow. We demonstrate that multiple SARS-CoV-2-like viruses have recombined with unsampled SARSr-CoV lineages, limiting our ability to assess sources of genetic diversification for these viruses. This under-sampling limits the incisiveness of both laboratory and field investigations of these viruses.
OC43-like viruses encode divergent spikes acquired from unsampled betacoronaviruses
After validating IDPlot for recombination analysis of coronaviruses, we used it to characterize recombination among the viruses in the Betacoronavirus-1 (BetaCoV1) group, which includes the human endemic coronavirus OC43 and closely related livestock pathogens bovine coronavirus (BCoV), equine coronavirus (ECoV), porcine hemagglutinating encephalomyelitis virus (PHEV), and Dromedary camel coronavirus HKU23 (HKU23). Due to the apparent low virulence of OC43 and limited sampling of the lineage, these viruses receive relatively little attention outside agricultural research. However, this lineage has produced a highly transmissible human virus, can cause severe disease in vulnerable adults, and is poorly sampled [4]. An ancestral BCoV is believed to be the progenitor of the other currently recognized BetaCoV1 viruses with divergence dates estimated at 100-150 years ago for OC43/PHEV [48] and 50 years ago for HKU23 [49]. Recombination with other betacoronaviruses has been previously described for HKU23, so we excluded it from our analysis [50]. The closest outgroup to BetaCoV1, rabbit coronavirus HKU14 (RbCoV/HKU14), was reported to group closely with ECoV in some regions [51], but no detailed recombination analysis of the relationship between these viruses has been previously described.
We conducted IDPlot analysis of OC43 and these related enzootic viruses of livestock (Figure 4A) and identified at least six major recombination breakpoints in the ECoV genome. The largest divergent region (Region 2) is >6 kilobases (Figure 4A). This region encompassing ~20% of the genome exhibits only ~75% nt identity to the reference sequence, just ~81% identity to any known sequence, and occupies a distant phylogenetic position relative to RdRp (Figure 4B-C, S4A, S4C-D). In contrast to previous reports that ECoV clusters closely with RbCoV/HKU14 in this region [51], our analysis reveals that this region of ECoV was acquired via recombination from a viral lineage not documented in GenBank.
Striking variability in ANI within Region 2 led us to conduct a more detailed analysis. IDPlot did not predict internal Region 2 breakpoints, so we conducted a manual analysis guided by the IDPlot multiple sequence alignment, phylogenetic trees for each proposed sub-region, and BLAST analysis to further dissect differing evolutionary relationships for sub-regions. We found at least six and possibly seven distinct sub-regions (Figure S5). Nucleotide identity to top BLAST hits is highly variable (<70% to >90%), as is identity of the hits themselves, with genetic contribution from RbCoV/HKU14-like viruses, BCoV-like viruses, and distant unsampled lineages (Figure S5). Together, this demonstrates that Region 2 was not acquired via a single recombination event but rather represents a mosaic of known and unknown viral lineages that share an overlapping ecological niche with ancestral ECoV.
Another major recombinant ECoV region, Region 6, includes the entire NS2 and HE genes as well as the majority of the spike gene (Figure 4A, S4A). Within this region on the multiple sequence alignment, we also identified a recombination event encompassing the majority of the PHEV spike gene, though this required re-running IDPlot without ECoV to simplify the analysis (Figure 4A, S4A). Both ECoV Region 6 and the PHEV recombinant region occupy relatively distant nodes on a phylogenetic tree (Figure S4G, J) and exhibit <80% sequence identity to the reference sequence or any sequence in GenBank (Figure S4A), indicating they are derived from independent recombination events. Finally, we identified a third recombinant region, Region 4, in which ECoV exhibited high nucleotide identity with RbCoV/HKU14 (Figure S4A, E), further demonstrating the highly mosaic nature of the ECoV genome.
Our analysis of equine coronavirus offers a remarkable example of the degree and speed of divergence facilitated by the high recombination rates among coronaviruses. Previous genomic characterization of ECoV suggested that it is the most divergent member of BetaCoV1 based on nucleotide identity and phylogenetic positioning of full-length Orf1ab. However, in the >10 kilobase Region 3 that accounts for ~1/3 of the entire genome (Figure 4A) ECoV exhibits the highest nucleotide identity to BCoV in our dataset (98.5%) (Figure 4A, S4D), which is inconsistent with it having diverged earlier than OC43 and PHEV. The latter viruses are estimated to have shared a common ancestor with BCoV 100-150 years ago [48], suggesting that all of the observed ECoV recombination has occurred more recently. Our discovery of recombinant regions of unknown origin suggests that unsampled viral lineages have occupied overlapping ecological niches with ECoV and presumably continue to circulate. Basal members of the subgenus that includes BetaCoV1 have been identified exclusively in rodents (Figure 1B), suggesting they are a natural reservoir for these viruses. Although relatively little attention has been directed to these viruses, studies of BCoV and ECoV cross-neutralization suggest population immunity to OC43 may provide only limited protection against infection mediated by these novel spikes [52]. Although no recent zoonotic infections from this lineage have been documented, the genomic collision of these viruses with yet-undiscovered viruses may warrant a reassessment of their potential to diversify the pool of potential pandemic coronaviruses.
SADSr-CoVs encode highly diverse spike and accessory genes
In 2017 a series of highly lethal diarrheal disease outbreaks on Chinese pig farms was linked to a novel alphacoronavirus, swine acute diarrhea syndrome-associated coronavirus (SADS-CoV) [20,53], which is closely related to the previously described BtCoV/HKU2 [54]. Sampling of horseshoe bats near affected farms revealed numerous SADSr-CoVs with >95% genome-wide nucleotide identity, suggesting porcine outbreaks were due to spillover from local bat populations. To gain a better view of the genetic diversity among these viruses, we conducted IDPlot analysis of a prototypical SADS-CoV isolate (FarmA) and seven bat SADSr-CoVs sampled at different times before and after the first outbreaks in livestock (Figure 5A) using bat SADSr-CoV/162140 as a reference sequence. Three notable observations emerged from the identity plot: (1) like ECoV, BtCoV/RfYN2012 exhibits evidence of recombination in the 5’ end of Orf1ab; (2) the spike region of the genome is highly variable, as previously reported [20]; and (3) the 3’ end of the genome also exhibits considerable diversity (Figure 5A).
To confirm the recent common ancestry of SADSr-CoVs in our data set we conducted nucleotide identity and phylogenetic analyses of the RdRp, 3ClPro, helicase, and methyltransferase NTD-encoding regions of Orf1ab. All viruses exhibit 94-100% nucleotide identity to the reference SADSr-CoV/162140 in these regions of the genome (Figure S2B, S6B-D, S7B-D). In contrast, BtCoV/RfYN2012 recombinant region 1 (RR1) has <70% identity to the reference or any known sequence (Figure S6F, S7A), providing evidence that an uncharacterized alphacoronavirus lineage circulates in horseshoe bats and frequently recombines with SADSr-CoVs.
The spike gene is a striking recombination hotspot among SADSr-CoVs. Due to the clustering of putative breakpoints surrounding the 5’ end, 3’ end, and middle of spike, we ran IDPlot on subsets of three viruses: SADSr-CoV/162140 (reference), SADSr-CoV/141388 or SADS-CoV/FarmA, and a virus of interest from the larger dataset. We found breakpoints delineating six distinct and highly divergent spike genes among the eight analyzed viruses (Figure 5B), which reflects recombination events encompassing either the entire spike or the S1 subunit that mediates receptor binding. There are three unique full-length spikes (BtCoV/RfYN2012, HKU2r-BtCoV/160660, BtCoV/HKU2) with 63-73% nucleotide identity to the reference sequence and two unique S1 domains (SADSr-CoVs/8462 and 8495) with <80% identity to the reference (Figure 5B, S7A). Some of these regions match with high identity to partial sequences in GenBank (indicated by an asterisk in Figure 5B), which may be either the source of the recombinant spike or different isolates of the same virus for which a full-length genome is available. Other spikes in this dataset are clearly divergent from any other known sequence.
In addition to spike, accessory proteins that target innate immunity can play important roles in host range and pathogenesis [34]. We found a second recombination hotspot surrounding the accessory gene Orf7a, which rivals spike gene diversification. Specifically, our dataset contained five distinct Orf7a genes, some of which lack any closely related sequences in GenBank (Figure 5C, S7A).
The SADSr-CoV lineage is rapidly diversifying via recombination, particularly in the spike and Orf7a accessory genes. We observed that numerous viruses with 95-99% identity in conserved Orf1ab regions contain highly divergent spike and accessory genes, which may shift host range and virulence in otherwise nearly isogenic viruses. These findings highlight that viruses sampled to date represent only a sliver of existing coronavirus diversity and that coronaviruses can change rapidly, drastically, and unpredictably via recombination with both known and unknown lineages. The SADSr-CoVs exemplify the potential of coronaviruses to rapidly evolve through promiscuous recombination.
Discussion
We developed IDPlot to explore the role of recombination in the diversification of coronaviruses. Coronaviruses are ubiquitous human pathogens with vast and underexplored genetic diversity. SARS-CoV-2 is the second SARSr-CoV known to infect humans and the fifth zoonotic coronavirus known to sweep through the human population following HCoVs 229E, NL63, HKU1, and OC43 [9,10,15,48,55,56]. Most effort in evaluating the threat to human health posed by coronaviruses has been dedicated to discovery of novel SARSr-CoVs in wildlife, yet prior to the SARS-CoV-2 pandemic this group of viruses went largely undetected. Much less attention has been paid to other groups that have produced human coronaviruses such as the undersampled Betacoronavirus-1 and emerging livestock viruses such as the SADSr-CoVs, which exhibit potential to infect humans and already have significant economic impacts.
We initially used the SARS-CoV-2-like viruses to test and validate IDPlot and in the process characterized recombination among these viruses in greater detail than previously reported. The observed variability in arrangements of PangolinCoV/GD19 and RmYN02 on a SARSr-CoV phylogenetic tree (Figure 3B-C, 3E, S3) depending on the region being sampled is a classic recombination signal easily observed in the IDPlot output. We also analyzed recombination dynamics for viruses in BetaCoV1 and among SADSr-CoVs. Broad similarities emerge from these studies. Most recombination appears to involve the spike gene and/or various accessory genes. However, in both BetaCoV1 and among SADSr-CoVs we detected recombination events in Orf1ab as well. Spike and accessory gene recombination events are particularly notable given the potential to influence host range and pathogenesis.
This preliminary analysis showed that IDPlot is a powerful new pipeline for sequence identity analysis, breakpoint prediction, and phylogenetic analysis. Existing workflows for nucleotide similarity analysis are proprietary, lack the ability to identify the phylogenetic incongruence that is a signature of recombination, and do not support direct export of genomic regions for BLAST analysis. IDPlot automates and streamlines this multi-step analysis with few barriers to use. Nevertheless, there are opportunities for further improvement. First, the analysis of recombination breakpoints implemented in GARD is of limited value for resolving unique breakpoints in close proximity, as observed surrounding and within SADSr-CoV and other spike genes, necessitating the use of small sets of sequences. Second, GARD is computationally intensive. It is configured as an optional step in IDPlot, so multiple sequence alignments and nucleotide identity plots can be rapidly generated in a local environment. However, for GARD analysis we relied on a high-performance computing cluster to expedite the process. In the future, we anticipate adding other, less intensive breakpoint prediction algorithms to the IDPlot options menu. Future advances in computational methods may also improve the ability to resolve unique breakpoints clustered in genomic regions that are recombination hotspots, most notably the spike gene.
Our IDPlot analyses revealed new evidence of extensive recombination-driven evolution in other coronavirus groups. Wildlife sampling indicates that SADSr-CoVs are a large pool of closely related viruses circulating in horseshoe bat populations at high frequency. This is the same genus of bats that harbors SARSr-CoVs, suggesting that the ecological conditions for SADSr-CoV spillover into humans may be in place. The relatedness of these viruses means they have had little time to diverge via mutation, but we find they are rapidly diversifying due to recombination, acquiring spike and accessory genes from unsampled viral lineages. These findings demonstrate that rather than a single threat to human health posed by SADS-CoV, there is a highly diverse reservoir of such viruses in an ecological position and with diversity reminiscent of SARSr-CoVs. We found a similar dynamic at play among BetaCoV1, which are under-sampled to an even greater degree and receive far less attention. Nevertheless, these viruses are involved in genetic exchange with unsampled lineages, with unpredictable consequences.
Our findings bear on strategies for anticipating and countering future zoonotic events. SARSr-CoVs garner considerable attention, with an intense focus on viruses able to infect human cells using ACE-2 as an entry receptor. However, RmYN02 demonstrates that viruses can toggle between spikes that recognize ACE-2 or different entry receptors but still infect the same hosts and continue to undergo recombination. Work to prepare for future zoonotic SARSr-CoVs must account for the possibility that the threat will come from coronaviruses only distantly related to SARSr-CoVs undergoing frequent recombination and distributing genetic diversity across the phylogenetic tree of coronaviruses.
More attention to the evolutionary dynamics of BetaCoV1 and SADSr-CoVs is also warranted. Both groups originate in wildlife (rodents and horseshoe bats, respectively) and are enzootic or epizootic in livestock. BetaCoV1 includes a pandemic virus that swept the human population, OC43, while SADS-CoV efficiently infects primary human respiratory and intestinal epithelial cells [22]. Our ability to anticipate threats from both groups would benefit from additional sampling, with BetaCoV1 being particularly undersampled. Increased surveillance at wildlife-livestock interfaces, including among agricultural workers, is needed for early detection of novel viruses coming into contact with humans. Due to recombination, prior infection with a virus such as OC43 cannot be presumed to be protective against even closely related viruses that can encode highly divergent spikes, as demonstrated in our analysis. Similarly, efforts to develop medical countermeasures against SADS-CoV should consider the full breadth of diversity among related viruses, while aiming for broadly effective vaccines and therapeutics.
Using IDPlot, we identified extensive diversity among coronavirus spike and accessory genes with potential implications for future pandemics. From the standpoint of understanding coronavirus evolution, frequent recombination events often reshuffle phylogenetic trees and can obscure evolutionary relationships. The extent to which viruses in current databases contain genomic regions with no known close relatives makes clear that coronavirus diversity is vast and poorly sampled, even for viruses circulating in well-studied locations. This proximity raises the possibility of recurrent zoonoses of coronaviruses encoding divergent spike and accessory genes. Therefore, preparedness efforts should consider a broad range of virus diversity rather than risk a more narrow focus on close relatives of coronaviruses that most recently impacted human health.
Methods
Virus Sequences
All sequences were downloaded from GenBank with the exception of PangolinCoV/GD19 and BtCoV/RmYN02, which were acquired from the Global Initiative on Sharing All Influenza Data (GISAID) database (https://www.gisaid.org).
IDPlot
IDPlot is initiated by the user designating reference and query sequences. A .gff3 annotation file can also be included in the input. The first step of IDPlot is multiple sequence alignment using MAFFT [36] with default parameters. The size of the sliding window is customizable and was set to 500 for all of our analyses. For recombination analysis we ran GARD [29] as an optional step, utilizing the multiple sequence alignment generated by MAFFT. Trees for each GARD iteration are generated and displayed using FastTree 2 [37]. The entire output is then exported into a chosen directory as idplot.html, as well as .json files containing raw GARD data. More detailed information on IDPlot is available in the GitHub repository at https://github.com/brwnj/idplot.
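The sliding-window identity measurement described above can be illustrated with a minimal Python sketch. This is not IDPlot's own code: the function name `sliding_identity`, the `step` parameter, and the gap handling (columns where both sequences have a gap are skipped, and a gap opposite a base counts as a mismatch) are assumptions made for illustration only.

```python
def sliding_identity(ref, query, window=500, step=1):
    """Percent identity between two aligned sequences in sliding windows.

    `ref` and `query` are rows taken from the same multiple sequence
    alignment, so they must have equal length. Gap handling here is an
    assumption, not necessarily what IDPlot does: columns where both
    rows are gaps are ignored, and a gap opposite a base is a mismatch.
    Returns a list of (window_start, percent_identity) tuples, with
    percent_identity set to None when a window has no comparable columns.
    """
    if len(ref) != len(query):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(ref) - window + 1, step):
        matches = 0
        compared = 0
        for a, b in zip(ref[start:start + window], query[start:start + window]):
            if a == "-" and b == "-":
                continue  # column carries no information for this pair
            compared += 1
            if a.upper() == b.upper():
                matches += 1
        identity = 100.0 * matches / compared if compared else None
        results.append((start, identity))
    return results


# Toy example with a window of 4 columns instead of 500.
print(sliding_identity("ACGTACGT", "ACGTTCGT", window=4, step=4))
# [(0, 100.0), (4, 75.0)]
```

Abrupt drops or rises in identity along the genome are the kind of signal that motivates the subsequent breakpoint analysis.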
Phylogenetic validation of breakpoints
Putative breakpoints were further tested by maximum-likelihood phylogenetic analysis using PhyML [60]. For Betacoronavirus-1, RbCoV/HKU14 and MHV (as a root) were aligned with the four viruses in the IDPlot dataset. For SADSr-CoVs, we chose HCoV-229E as the root and aligned it with the eight viruses in our dataset. We rooted the SARSr-CoVs with BtCoV/BM48-31/BGR/2008. Given the better sampling of SARSr-CoV, we included more diversity in that alignment to enhance phylogenetic signal. The signal for BetaCoV1 and SADSr-CoV is constrained by sampling limitations. We extracted breakpoint-defined regions from the alignment and generated ML phylogenetic trees using a GTR substitution model and 100 bootstraps. “Up” and “Dn” regions are the 500 nucleotides upstream or downstream of a proposed 5’ or 3’ breakpoint, respectively. In the case of SADSr-CoV, the clustering of breakpoints around the 5’ and 3’ ends of spike precluded using unique Up and Dn regions for each recombination event. Instead, we used the N-terminal section of nsp16 (MTase) and the M gene, respectively. For BtCoV/RmYN02 RR2 and Orf8 phylogenetic testing, we excluded SARSr-CoVs that have a deletion in Orf8. RmYN02 UpRR2 also does not include BtCoV/WIV1 because it has a unique open reading frame (Orfx) insert in this region and so does not align with SARSr-CoVs lacking this Orfx.
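The extraction of breakpoint-defined regions can be sketched as follows. This is a hypothetical helper, not the procedure used in the analysis: the name `breakpoint_regions`, the 0-based half-open coordinate convention, and the choice to measure the 500-nucleotide flanks in alignment columns (rather than ungapped nucleotides) are all assumptions. The resulting sub-alignments would then be passed to a tree-building program such as PhyML.

```python
def breakpoint_regions(alignment, bp5, bp3, flank=500):
    """Slice an alignment into Up, recombinant (RR), and Dn sub-alignments.

    `alignment` maps sequence names to aligned sequences of equal length.
    `bp5` and `bp3` are the proposed 5' and 3' breakpoints, given here as
    0-based alignment columns (an assumed convention): the recombinant
    region is columns [bp5, bp3), Up is the `flank` columns before bp5,
    and Dn is the `flank` columns from bp3 onward. Flanks are truncated
    at the ends of the alignment.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have equal length")
    length = lengths.pop()
    if not 0 <= bp5 < bp3 <= length:
        raise ValueError("breakpoints must satisfy 0 <= bp5 < bp3 <= length")
    spans = {
        "Up": (max(0, bp5 - flank), bp5),
        "RR": (bp5, bp3),
        "Dn": (bp3, min(length, bp3 + flank)),
    }
    return {
        region: {name: seq[start:end] for name, seq in alignment.items()}
        for region, (start, end) in spans.items()
    }


# Toy example with 3-column flanks instead of 500.
toy = {"virusA": "AAAACCCCGGGG", "virusB": "AAATCCCTGGGA"}
regions = breakpoint_regions(toy, bp5=4, bp3=8, flank=3)
print(regions["Up"])  # {'virusA': 'AAA', 'virusB': 'AAT'}
print(regions["RR"])  # {'virusA': 'CCCC', 'virusB': 'CCCT'}
print(regions["Dn"])  # {'virusA': 'GGG', 'virusB': 'GGG'}
```

Comparing the tree topology of the RR sub-alignment against those of the Up and Dn sub-alignments is what tests whether a proposed breakpoint pair marks a genuine change in phylogenetic relationships.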
BLAST analysis
To identify the source of recombinant regions, we used NCBI Blastn with default parameters, excluding the query sequence from the search. For SADSr-CoVs, partial spike sequences frequently appear as top hits. We included these hits and denoted them with an asterisk when reporting the results.
Acknowledgements
We thank Edward Holmes for providing BtCoV/RmYN02 and PangolinCoV/GD19 genome sequences for use in our analysis and Zoë Hilbert for critical reading of the manuscript. Research and contributing authors were supported by National Institutes of Health (NIH) grants HG006693, HG009141, and GM124355 to A.R.Q. and GM134936 to N.C.E. Additional support includes a Burroughs Wellcome Fund Investigators in the Pathogenesis of Infectious Disease Award (1015462; N.C.E.) and a SARS-CoV-2 research pilot grant from the Immunology, Inflammation, and Infectious Disease (3i) Initiative, University of Utah Health.