Abstract
Despite massive investment in research on reservoirs of emerging pathogens, it remains difficult to rapidly identify the wildlife origins of novel zoonotic viruses. Viral surveillance is costly but rarely optimized using model-guided prioritization strategies, and predictions from a single model may be highly uncertain. Here, we generate an ensemble of seven network- and trait-based statistical models that predict mammal-virus associations, and we use model predictions to develop a set of priority recommendations for sampling potential bat reservoirs and intermediate hosts for SARS-CoV-2 and related betacoronaviruses. We find nearly 300 bat species globally could be undetected hosts of betacoronaviruses. Although over a dozen species of Asian horseshoe bats (Rhinolophus spp.) are known to harbor SARS-like viruses, we find at least two thirds of betacoronavirus reservoirs in this bat genus might still be undetected. Although identification of other probable mammal reservoirs is likely beyond existing predictive capacity, some of our findings are surprisingly plausible; for example, several civet and pangolin species were highlighted as high-priority species for viral sampling. Our results should not be over-interpreted as novel information about the plausibility or likelihood of SARS-CoV-2’s ultimate origin, but rather these predictions could help guide sampling for novel potentially zoonotic viruses; immunological research to characterize key receptors (e.g., ACE2) and identify mechanisms of viral tolerance; and experimental infections to quantify competence of suspected host species.
Main text
Coronaviruses are a diverse family of positive-sense, single-stranded RNA viruses, found widely in mammals and birds1. They have a broad host range, a high mutation rate, and the largest genomes of any RNA viruses, but they have also evolved mechanisms for RNA proofreading and repair, which help to mitigate the deleterious effects of a high recombination rate acting over a large genome2. Consequently, coronaviruses fit the profile of viruses with high zoonotic potential. There are seven human coronaviruses (two in the genus Alphacoronavirus and five in Betacoronavirus), of which three are highly pathogenic in humans: SARS-CoV, SARS-CoV-2, and MERS-CoV. These three are zoonotic and widely agreed to have evolutionary origins in bats3–6.
Our collective experience with both SARS-CoV and MERS-CoV illustrates the difficulty of tracing specific animal hosts of emerging coronaviruses. During the 2002–2003 SARS epidemic, SARS-CoV was traced to the masked palm civet (Paguma larvata)7, but the ultimate origin remained unknown for several years. Horseshoe bats (family Rhinolophidae: Rhinolophus) were implicated as reservoir hosts in 2005, but their SARS-like viruses were not identical to circulating human strains4. Stronger evidence from 2017 placed the most likely evolutionary origin of SARS-CoV in Rhinolophus ferrumequinum or potentially R. sinicus8. Presently, there is even less certainty in the origins of MERS-CoV, although spillover to humans occurs relatively often through contact with dromedary camels (Camelus dromedarius). A virus with 100% nucleotide identity in a ∼200 base pair region of the polymerase gene was detected in Taphozous bats (family Emballonuridae) in Saudi Arabia9; however, based on spike gene similarity, other sources treat HKU4 virus from Tylonycteris bats (family Vespertilionidae) in China as the closest-related bat virus10,11. Several bat coronaviruses have shown close relation to MERS-CoV, with a surprisingly broad geographic distribution from Mexico to China12,13,14,15.
Coronavirus disease 2019 (COVID-19) is caused by severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), a novel virus with presumed evolutionary origins in bats. Although the earliest cases were linked to a wildlife market, contact tracing was limited, and there has been no definitive identification of the wildlife contact that resulted in spillover nor a true “index case.” Two bat viruses are closely related to SARS-CoV-2: RaTG13 bat CoV from Rhinolophus affinis (96% identical overall), and RmYN02 bat CoV from Rhinolophus malayanus (97% identical in one gene but only 61% in the receptor binding domain and with less overall similarity)6,16. The divergence time between these bat viruses and human SARS-CoV-2 has been estimated as 30–70 years17, suggesting that the main host(s) involved in spillover remain unknown. Evidence of viral recombination in pangolins has been proposed but is unresolved17. SARS-like betacoronaviruses have been recently isolated from Malayan pangolins (Manis javanica) traded in wildlife markets18,19, and these viruses have a very high amino acid identity to SARS-CoV-2, but only show a ∼90% nucleotide identity with SARS-CoV-2 or Bat-CoV RaTG1320. None of these host species are universally accepted as the origin of SARS-CoV-2 or a progenitor virus, and a “better fit” wildlife reservoir could likely still be identified. However, substantial gaps in betacoronavirus sampling across wildlife limit actionable inference about plausible reservoirs and intermediate hosts for SARS-CoV-221.
Identifying likely reservoirs of zoonotic pathogens is challenging22. Sampling wildlife for the presence of active or previous infection (i.e., seropositivity) represents the first stage of a pipeline for proper inference of host species23, but sampling is often limited in phylogenetic, temporal, and spatial scale by logistical constraints24. Given such restrictions, modeling efforts can play a critical role in helping to prioritize pathogen surveillance by narrowing the set of plausible sampling targets25. For example, machine learning approaches have generated candidate lists of likely, but unsampled, primate reservoirs for Zika virus, bat reservoirs for filoviruses, and avian reservoirs for Borrelia burgdorferi26–28. In some contexts, models may be more useful for identifying which host or pathogen groups are unlikely to have zoonotic potential29. However, these approaches are generally applied individually to generate predictions. Implementation of multiple modeling approaches collaboratively and simultaneously could reduce redundancy and apparent disagreement at the earliest stages of pathogen tracing and help advance modeling work by addressing inter-model reliability, predictive accuracy, and the broader utility (or inefficacy) of such models in zoonosis research.
Because SARS-like viruses (subgenus Sarbecovirus) are only characterized from a small number of bat species in publicly available data, current modeling methods are poorly tailored to exactly infer their potential reservoir hosts. In this study, we instead conduct two predictive efforts that may help guide the inevitable search for known and future zoonotic coronaviruses in wildlife: (1) broadly identifying bats and other mammals that may host any Betacoronavirus and (2) specifically identifying species with a high viral sharing probability with the two Rhinolophus species carrying the closest known wildlife relatives of SARS-CoV-2. To do this, we developed a standardized dataset of mammal-virus associations by integrating a previously published mammal-virus dataset30 with a targeted scrape of all GenBank coronavirus accessions and their associated hosts. Our final dataset spanned 710 host species and 359 virus genera, including 107 mammal hosts of betacoronaviruses as well as hundreds of other (non-coronavirus) association records. We harmonized our host-virus data with a mammal phylogenetic supertree31 and over 60 ecological traits of bat species27,32,33. Using these standardized data, six subteams generated seven predictive models of host-virus associations, including four network-based and three trait-based approaches. These efforts generated seven ranked lists of suspected bat hosts of betacoronaviruses and five ranked lists for other mammals. Each ranked list was scaled proportionally and consolidated in an ensemble of recommendations for betacoronavirus sampling and broader eco-evolutionary research (ED Figure 1).
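The consolidation step can be illustrated with a minimal sketch. The text states only that each ranked list was scaled proportionally and combined; the specific choices below (dividing each rank by the length of its list, then averaging across the models that scored a species) are assumptions made for illustration, and the function names, species labels, and scores are hypothetical.

```python
from statistics import mean


def proportional_ranks(scores):
    """Map each species to rank / n, where rank 1 is the highest score.

    Ties are broken alphabetically so the result is deterministic.
    """
    ordered = sorted(scores, key=lambda sp: (-scores[sp], sp))
    n = len(ordered)
    return {sp: (i + 1) / n for i, sp in enumerate(ordered)}


def ensemble_ranking(model_scores):
    """Average proportional ranks over the models that scored each species.

    Averaging is an assumed consolidation rule, not necessarily the one
    used in the study. Lower consensus values mean higher priority.
    """
    scaled = [proportional_ranks(s) for s in model_scores]
    species = set().union(*scaled)
    consensus = {sp: mean(r[sp] for r in scaled if sp in r) for sp in species}
    return sorted(consensus, key=lambda sp: (consensus[sp], sp))


# Hypothetical toy scores from two models with different species coverage.
model_a = {"sp_a": 0.9, "sp_b": 0.5, "sp_c": 0.1}
model_b = {"sp_a": 0.2, "sp_b": 0.8}
ranking = ensemble_ranking([model_a, model_b])
```

Scaling by list length lets models that cover different numbers of species (for example, in-sample network models and out-of-sample trait models) contribute on a common 0 to 1 scale.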
In our ensemble, we draw on two popular approaches to identify candidate reservoirs and intermediate hosts of betacoronaviruses. Network-based methods estimate a full set of “true” unobserved host-virus interactions based on a recorded network of associations (here, pairs of host species and associated viral genera). These methods are increasingly popular as a way to identify latent processes structuring ecological networks34–36, but they are often confounded by sampling bias and can only make predictions for species within the observed network (i.e., those that have available virus data; in-sample prediction). In contrast, trait-based methods use observed relationships concerning host traits to identify species that fit the morphological, ecological, and/or phylogenetic profile of known host species of a given pathogen and rank the suitability of unknown hosts based on these trait profiles28,37. These methods may be more likely to recapitulate patterns in observed host-pathogen association data (e.g., geographic biases in sampling, phylogenetic similarity in host morphology), but they more easily correct for sampling bias and can predict host species without known viral associations (out-of-sample prediction).
Predictions of bat betacoronavirus hosts derived from network- and trait-based approaches displayed strong inter-model agreement within-group, but less with each other (Figure 1A,B). In-sample, we identified bat species across a range of genera as having the highest predicted probabilities of hosting betacoronaviruses, distributed in distinct families in both the Old World (e.g., Hipposideridae, several subfamilies in the Vespertilionidae) and the New World (e.g., Artibeus jamaicensis from the Phyllostomidae; Figure 1C). Out-of-sample, our multi-model ensemble more conservatively limited predictions to primarily Old World families such as Rhinolophidae and Pteropodidae (Figure 1D). Of the 1,037 bat hosts not currently known to host betacoronaviruses, our models identified between 1 and 720 potential hosts based on a 10% omission threshold (90% sensitivity). Applying this same threshold to our ensemble predictions, we identified 291 bat species that are likely undetected hosts of betacoronaviruses. These include approximately half of bat species in the genus Rhinolophus not currently known to be betacoronavirus hosts (30 of 61), compared to 16 known hosts in this genus. Given known roles of rhinolophids as hosts of SARS-like viruses, our results suggest that SARS-like virus diversity could be undescribed for around two-thirds of the potential reservoir bat species.
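A 10% omission threshold can be sketched as follows: the cutoff is set at the predicted probability below which at most 10% of known hosts fall, and unknown species at or above that cutoff are flagged as likely undetected hosts. This is a sketch under assumptions, not the study's code; the function names and all probabilities are hypothetical.

```python
import math


def omission_threshold(known_host_probs, omission=0.10):
    """Return the cutoff that omits at most `omission` of known hosts.

    With the default, at least 90% of known hosts have a predicted
    probability at or above the returned value (90% sensitivity).
    """
    ordered = sorted(known_host_probs)
    n_omitted = math.floor(omission * len(ordered))
    return ordered[n_omitted]


def predicted_hosts(unknown_probs, threshold):
    """List species without a known association that meet the cutoff."""
    return sorted(sp for sp, p in unknown_probs.items() if p >= threshold)


# Hypothetical predicted probabilities for ten known hosts.
known = [0.05, 0.30, 0.40, 0.55, 0.60, 0.70, 0.80, 0.85, 0.90, 0.95]
cutoff = omission_threshold(known)

# Hypothetical predictions for species not currently known to be hosts.
unknown = {"sp_x": 0.35, "sp_y": 0.10, "sp_z": 0.30}
flagged = predicted_hosts(unknown, cutoff)
```

Because the cutoff depends on how each model scores the known hosts, the same rule can flag very different numbers of species across models, consistent with the wide range reported above.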
Our multi-model ensemble predicted undiscovered betacoronavirus bat hosts with striking geographic patterning (Figure 2). In-sample, the top 50 predicted bat hosts were broadly distributed and recapitulated observed patterns of bat betacoronavirus hosts in Europe, parts of sub-Saharan Africa, and southeast Asia, although our models also predicted greater-than-expected richness of likely bat reservoirs in the Neotropics and North America. In contrast, the top out-of-sample predictions clustered in Vietnam, Myanmar, and southern China.
Because only trait-based models were capable of out-of-sample prediction, the differences in geographic patterns of our predictions likely reflect distinctions between the network- and trait-based modeling approaches, which we suggest should be considered qualitatively different lines of evidence. Network approaches proportionally upweight species with high observed viral diversity, recapitulating sampling biases largely unrelated to coronaviruses (e.g., frequent screening for rabies lyssaviruses in vampire bats, which have been sampled in a comparatively limited capacity for coronaviruses14,38–40). Highly ranked species may also have been previously sampled without evidence of betacoronavirus presence; for example, Rhinolophus luctus and Macroglossus sobrinus from China and Thailand, respectively, tested negative for betacoronaviruses, but detection probability was limited by small sample sizes41–43. In contrast, trait-based approaches are constrained by their reliance on phylogeny and ecological traits, and the use of geographic covariates made models more likely to recapitulate existing spatial patterns of betacoronavirus detection (i.e., clustering in southeast Asia). However, their out-of-sample predictions are, by definition, inclusive of unsampled hosts44, which potentially offer greater return on viral discovery investment.
Multi-model ensemble predictions also clustered taxonomically along parallel lines. Applying a graph partitioning algorithm (phylogenetic factorization) to the bat phylogeny45, we found that in-sample predictions were on average lowest for the Yangochiroptera (Figure 3). This makes intuitive sense, because this clade does not include the groups known to harbor the majority of betacoronaviruses detected in bats (e.g., Rhinolophus, Hipposideridae). Out-of-sample predictions were lower in the New World superfamily Noctilionoidea and the emballonurids, whereas several subfamilies of Old World fruit bats46, including the Rousettinae, Cynopterinae, and Eidolinae, had higher mean probabilities of betacoronavirus hosting. Lastly, our ensemble also identified the Rhinolophus genus as having greater mean probabilities (ED Table 1).
These clade-specific patterns of predicted probabilities across extant bats could be particularly applicable for guiding future surveillance. On the one hand, betacoronavirus sampling in southeast Asian bat taxa (especially the genus Rhinolophus) may have a high success of viral detection but may not improve existing bat sampling gaps47. On the other hand, discovery of novel betacoronaviruses in Neotropical bats or Old World fruit bats could significantly revise our understanding of the bat-virus association network. Such discoveries would be particularly important for global health security, given the surprising identification of a MERS-like virus in Mexican bats14 and the likelihood that post-COVID pandemic preparedness efforts will focus disproportionately on Asia despite the near-global presence of bat betacoronaviruses.
Although our ensemble model of potential bat betacoronavirus reservoirs generated strong and actionable predictions, our mammal-wide predictions were largely uninformative. In particular, minimal inter-model agreement (ED Figure 2) indicated a lack of consistent, biologically meaningful findings. Major effects of sampling bias were apparent from the top-ranked species, which were primarily domestic animals or well-studied mesocarnivores (ED Figure 2B). Phylogenetic factorization mostly failed to find specific patterns in prediction (ED Table 2): in-sample, mean predictions primarily confirmed betacoronavirus detection in the remaining Laurasiatheria (e.g., ungulates, carnivores, pangolins, hedgehogs, shrews), although nested clades of marine mammals (i.e., cetaceans) were less likely to harbor these viruses, as expected given betacoronavirus epidemiology and their predominance in terrestrial mammals. Our mammal predictions thus reflect a combination of detection bias and poor performance of network methods on limited data that likely signals the limits of existing predictive capacity. Our dataset contained only 30 non-bat betacoronavirus hosts, many of which were identified during sampling efforts following the first SARS outbreak7. Although the laurasiatherians are likely to include more potential intermediate hosts than other mammals, the high diversity of this clade restricts insights for sampling prioritization, experimental work, or spillover risk management.
Given the unresolved origins of SARS-CoV-2 and significant motivation to identify other SARS-like coronaviruses and their reservoir hosts for pandemic preparedness21, we further explored our only model that could generate out-of-sample predictions for all mammals48. This model uses geographic distributions and phylogenetic relatedness to estimate viral sharing probability. Where one or more (potential) hosts are known, these sharing patterns can be interpreted to identify probable reservoir hosts48. Because Rhinolophus affinis and R. malayanus host viruses that are closely related to SARS-CoV-26,16, we used their predicted sharing patterns to identify possible reservoirs of sarbecoviruses. In doing so, we aimed to work around a major data limitation: fewer than 20 sarbecovirus hosts were recorded in our dataset, a sample size that would preclude most modeling approaches.
For both presumed bat host species of sarbecoviruses, the most probable viral sharing hosts were again within the Laurasiatheria. Although bats—especially rhinolophids—unsurprisingly assumed the top predictions given phylogenetic affinity with known hosts (ED Table 3, ED Figure 3), several notable patterns emerged in the rankings of other mammals. Pangolins (Pholidota) were disproportionately likely to share viruses with R. affinis and R. malayanus (ED Figure 4); the Sunda pangolin (Manis javanica) and Chinese pangolin (M. pentadactyla) were in the top 20 predictions for both reservoir species (ED Table 4). This result is promising given the much-discussed discovery of SARS-like betacoronaviruses in M. javanica18. The Viverridae were also disproportionately well-represented in the top predictions (ED Figure 5), most notably the masked palm civet (Paguma larvata), which was identified as an intermediate host of SARS-CoV49,50 (ED Table 4).
The ability of our virus sharing model to capture known patterns of coronavirus hosts using only two predictor variables is encouraging, and implies that mammal phylogeography has played a predictable role in historical betacoronavirus spillover. Moreover, these findings lend credibility to other predictions of SARS-CoV-2 sharing patterns and host susceptibility. Many of the model’s top predictions were mustelids (i.e., ferrets and weasels), and the most likely viral sharing partner for both Rhinolophus species was the hog badger (Arctonyx collaris; ED Table 4). Taken together with reports of SARS-CoV-2 spread in mink farms51, these results highlight the relatively unexplored potential for mustelids to serve as betacoronavirus hosts. Similarly, identification of several deer and Old World monkey taxa as high-probability hosts in our clade-based analysis (ED Figure 3) meshes with the observation of high binding of SARS-CoV-2 to ACE2 receptors in cervid deer and primates52. Felids (especially leopards) also ranked relatively high in our viral sharing predictions (ED Table 4, ED Figure 5), which is of particular interest given reports of SARS-CoV-2 susceptibility among cats53. However, we caution that this model was the only approach in our ensemble that could generate out-of-sample prediction across mammals, and therefore its predictions lacked confirmation (and filtering of potential spurious results) by other models that were designed and implemented independently.
Several limitations apply to our work, most notably the difficulty of empirically verifying predictions. Although some virological studies have incidentally tested specific hypotheses (e.g., filovirus models and bat surveys27,54, henipavirus models and experimental infections23,55), model-based predictions are almost never subject to systematic verification or post-hoc efforts to identify and correct spurious results. Greater dialogue between modelers and empiricists is necessary to systematically confront the growing set of predicted host-virus associations with experimental validation or field observation. Scotophilus heathii, Hipposideros larvatus, and Pteropus lylei, all highly predicted bat species in our out-of-sample rankings, have been reported positive for betacoronaviruses in the literature43,56; however, resulting sequences were not annotated to genus level in GenBank. These results support the idea that our models identified relevant targets correctly but also highlight an evident limitation of the workflow. Whereas an automated approach was the ideal method to systematically compile over 30,000 samples on timescales commensurate with ongoing efforts to trace SARS-CoV-2 in wildlife, we suggest this discrepancy highlights the need for careful virological work downstream at every stage of the modeling process, including the development of hybrid manual-automated data pipelines.
Additionally, overcoming underlying model biases that are driven by historical sampling regimes will require coordinated efforts in field study design. Bat sampling for betacoronaviruses has prioritized viral discovery39,40,57–59, but limitations in the spatial and temporal scale (and replication) of field sampling have likely created fundamental gaps in our understanding of infection dynamics in bat populations24. Limited longitudinal sampling of wild bats suggests betacoronavirus detection is sporadic over time and space56,60, implying strong seasonality in virus shedding pulses61. Carefully tailored spatial and temporal sampling efforts for priority taxa identified here, within the Rhinolophus genus or other high-prediction bat clades, will be key to identifying the environmental drivers of betacoronavirus shedding from wild bats and possible opportunities for contact between bats, intermediate hosts, and humans.
Future field studies will undoubtedly be important to understand viral dynamics in bats but are inherently costly and labor-intensive. These efforts are particularly challenging during a pandemic, as many scientific operations have been suspended, including field studies of bats in some regions to limit possible viral spillback from humans. However, various alternative efforts could both advance basic virology and allow testing model predictions. General open access to viral association records, including GenBank accessions and the upcoming release of the USAID PREDICT program’s data, could answer open questions and allow updates to our sampling prioritization (including potentially modeling at subgenus level, with greater data availability). Museum specimens and historical collections from diverse research programs also offer key opportunities to retrospectively screen samples from bats and other mammals for betacoronaviruses and to enhance our understanding of complex host-virus interactions62. Large-scale research networks, such as GBatNet (Global Union of Bat Diversity Networks) and its member networks, could provide diverse samples and ensure proper partnerships and equitable access and benefit sharing of knowledge across countries63,64. Whole-genome sequencing through initiatives such as the Bat1K Project (https://bat1k.ucd.ie) would facilitate fundamental and applied insights into the immunological pathways through which bats can apparently harbor many virulent viruses (including but not limited to betacoronaviruses) without displaying clinical disease65,66.
To expedite such work, we have made our binary predictions of host-virus associations for all seven models and all 1,000+ bat species publicly available (Supplementary Table 1). Such results are provided both in the spirit of open science and with the hope that future viral detection, isolation, or experimental studies might confirm some of these predictions or rule out others55. In ongoing collaborative efforts, we aim to consolidate results from field studies that address these predictions (e.g., serosurveys) and to track GenBank submissions to expand the known list of betacoronavirus hosts. In several years, we intend to revisit these predictions as a post-hoc test of model validation, which would represent the first effort to test the performance of such models and assess their contribution to basic science and to pandemic preparedness.
It is crucial that our predictions be interpreted as a set of hypotheses about potential host-virus compatibility rather than strong evidence that a particular mammal species is a true reservoir for betacoronaviruses. In particular, susceptibility is only one aspect of host competence22,67, which encompasses the diverse genetic and immunological processes that mediate within-host responses following exposure68. SARS-CoV-2 in particular may have a broad host range52, given hypothesized compatibility with the ACE2 receptor in many mammal species, but this only adds to the extreme caution with which any data should be used to implicate a potential wildlife reservoir of the virus, given that rapid interpretation of inconclusive molecular evidence has likely already generated spurious reservoir identifications69,70. Future efforts to isolate live virus from wildlife or to experimentally show viral replication would more robustly test whether predicted host species actually play a role in betacoronavirus maintenance in wildlife55.
Without direct lines of virological evidence, we note that our sampling prioritization scheme also does not implicate any given mammal species in SARS-CoV-2 transmission to humans. Care should be taken to communicate this, especially given the potential consequences of miscommunication for wildlife conservation. The bat research community in particular has expressed concern that negative framing of bats as the source of SARS-CoV-2 will impact public and governmental attitudes toward bat conservation71. In zoonotic virus research on bats, studies often over-emphasize human disease risks72 and rarely mention ecosystem services provided by these animals73. Skewed communication can fuel negative responses against bats, including indiscriminate culling (i.e., reduction of populations by selective slaughter)74, which has already occurred in response to COVID-19 even outside of Asia (where spillover occurred)75.
To minimize potential unintended negative impacts for bat conservation, public health and conservation responses should act in accordance with substantial evidence suggesting that culling has numerous negative consequences, not only threatening population viability of threatened bat species in shared roosts76 but also possibly increasing viral transmission within the very species that are targeted77,78. Instead, bat conservation programs and long-term ecological studies are necessary to help researchers understand viral ecology and find sustainable solutions for humans to live safely with wildlife. From another perspective, policy solutions aimed at limiting human-animal contact could potentially prevent virus establishment in novel species (e.g., as observed in mink farms51), especially in wildlife that may already face conservation challenges (e.g., North American bats threatened by an emerging disease, white-nose syndrome74,79). At least four bat species with confirmed white-nose syndrome symptoms or that can be infected by the fungal pathogen (Eptesicus fuscus, Myotis lucifugus, M. septentrionalis, Tadarida brasiliensis) are in our list of the 291 bat species most likely to be betacoronavirus hosts, and both Myotis species have already been heavily impacted by this fungal epidemic with over 90% reductions in their populations80.
Substantial investments are already being planned to trace the wildlife origins of SARS-CoV-2. However, the intermediate progenitor virus may never be isolated from samples contemporaneous with spillover, and it may no longer be circulating in wildlife. MERS-CoV circulates continuously in camels81 and SARS-CoV persisted in civets long enough to seed secondary outbreaks49,50, but the limited description of Pangolin-CoV symptoms suggests high mortality, potentially indicating a more transient epizootic such as Ebola die-offs in red river hogs (Potamochoerus porcus)18. In lieu of concrete data, our study provides no additional evidence implicating any particular species—or any particular pathway of spillover (e.g., wildlife trade, consumption of hunted animals)—as more or less likely. No specific scenario can be confirmed or rigorously interrogated by ecological models, and we explicitly warn against misinterpretation or misuse of our findings as evidence for adjacent policy decisions. Although policies that focus on particular potential reservoir species or target human-wildlife contact could reduce future spillovers, they will have a negligible bearing on the ongoing pandemic, as SARS-CoV-2 is highly transmissible within humans (e.g., unlike MERS-CoV or other zoonoses that are sustained in people by constant reintroduction). SARS-CoV-2 is likely to remain circulating in human populations until a vaccine is developed, regardless of immediate actions regarding wildlife. COVID-19 response must be informed by the best consensus evidence available and prioritize solutions that address immediate reduction of transmission through public health and policy channels. Meanwhile, we hope our proposed wildlife sampling priorities will help increase the odds of preventing the future emergence of novel betacoronaviruses.
Methods
The underlying conceptual aim of this study was to produce and synthesize several different models that predict and rank candidate reservoir species—each with different methods, assumptions, and framings—and to rapidly synthesize these into a consensus list. We broadly structured our study around two modeling targets: (1) produce rankings of likely bat hosts of betacoronaviruses and (2) identify potential non-bat mammal hosts. We developed a novel dataset that merged existing knowledge about the broader mammal-virus network with targeted data collection about coronaviruses; implemented seven modeling methods; synthesized these into an ensemble; and post-hoc identified taxonomic patterns in prediction using phylogenetic factorization.
Host-Virus Association Data
Entries were downloaded from GenBank on March 27th 2020 using the following search terms: Coronavirus, Coronaviridae, Orthocoronavirinae, Alphacoronavirus, Betacoronavirus, Gammacoronavirus, and Deltacoronavirus. Data were sorted using a Python script that saved all available metadata regarding accession number, division, submission date, entry title, organism, genus, genome length, host classification, country, collection date, PubMed ID, journal containing associated publication, publication year, genome completeness, and the gene sequenced. The dataset was cleaned to remove duplicate entries, using GenBank accession number, and entries that did not correspond to viral sequences, using GenBank division. After cleaning, 31,473 entries remained, of which 25,628 had metadata regarding host species.
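The two cleaning rules can be sketched in a few lines. This is an illustrative reconstruction rather than the script used in the study: the record layout (dictionaries with `accession`, `division`, and `host` keys) and the accession strings are hypothetical, while `VRL` is the GenBank division code for viral sequences.

```python
def clean_records(records):
    """Keep the first record per accession, restricted to viral entries.

    Assumes each record is a dict with hypothetical keys "accession"
    and "division"; "VRL" is GenBank's viral division code.
    """
    seen = set()
    cleaned = []
    for rec in records:
        accession = rec["accession"]
        if accession in seen or rec.get("division") != "VRL":
            continue
        seen.add(accession)
        cleaned.append(rec)
    return cleaned


def with_host(records):
    """Subset of records that carry non-empty host metadata."""
    return [rec for rec in records if rec.get("host")]


# Hypothetical records: one duplicate, one non-viral entry, one without host.
raw = [
    {"accession": "ACC0001", "division": "VRL", "host": "Rhinolophus affinis"},
    {"accession": "ACC0001", "division": "VRL", "host": "Rhinolophus affinis"},
    {"accession": "ACC0002", "division": "MAM", "host": None},
    {"accession": "ACC0003", "division": "VRL", "host": ""},
]
cleaned = clean_records(raw)
hosted = with_host(cleaned)
```

Counting `cleaned` and `hosted` on the full download would correspond to the two totals reported above.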
Data from GenBank were merged with the Host-Pathogen Phylogeny Project (HP3) dataset30. The HP3 dataset consists of 2,805 associations between 754 mammal hosts and 586 virus species, compiled from the International Committee on Taxonomy of Viruses (ICTV) database, and manually cleaned over a period of five years. Data collection on HP3 began in 2010 and has been static since 2017, but it still represents the most complete dataset on the mammal virome published with a high standard of data documentation. Several recent studies have used the HP3 dataset to produce statistical models of viral sharing or zoonotic potential29,48,82, making it a comparable reference for a multi-model ensemble study.
Because of naming inconsistencies both within GenBank and between the two datasets (HP3 and GenBank), we used a two-step pipeline for taxonomic reconciliation. Viral names were matched to the ICTV 2019 master species list, up to the subgenus level. Host species names were matched against GBIF using their species API with an automated Julia script, and processed to a fully cleaned set of names. This led to a harmonized dataset representing a global list of mammal-virus associations, from which the bat-coronavirus data can be extracted for downstream and specific modeling efforts. Because the HP3 dataset used an older version of the ICTV master list, and because not all host names in the GenBank metadata could be matched by the GBIF species API (or could be resolved unambiguously to the species level), some host-virus interactions were lost; this reinforces the need for careful curation of taxonomic metadata if they are to enable and support predictive pipelines.
Predictor Data
Phylogeny
We used a supertree of extant mammals to unify modeling approaches incorporating host phylogeny31. Although more recent mammal supertrees exist, we used this particular phylogeny for consistency with trait datasets and several of the modeling frameworks included in our ensemble. We manually matched select bat species names between our edge list and this particular phylogeny. This included reverting any Dermanura to their former Artibeus designation (i.e., D. phaeotis, D. cinerea, D. tolteca)83, switching Tadarida species to either Mops or Chaerephon species (i.e., Tadarida condylura to Mops condylurus, Tadarida plicata to Chaerephon plicatus, Tadarida pumila to Chaerephon pumilus)84, and renaming Myotis pilosus to the more recent Myotis ricketti. Chaerephon pusillus was considered its own species but is now synonymous with Chaerephon pumilus84. Minor discrepancies between virus data and our phylogeny were also corrected (Hipposideros commersonii to Hipposideros commersoni [although more recently changed to Macronycteris commersoni], Rhinolophus hildebrandti to Rhinolophus hildebrandtii, Neoromicia nana to Neoromicia nanus). In other cases, some recently revised genera in our edge list were modified to match former genera in the mammal supertree: Parastrellus hesperus to Pipistrellus hesperus, and Perimyotis subflavus to Pipistrellus subflavus85. Lastly, some names in our edge list missing from the mammal supertree represent former subspecies being raised to full species rank, and names were reverted accordingly: Artibeus planirostris to Artibeus jamaicensis, Miniopterus fuliginosus to Miniopterus schreibersii, Triaenops afer to Triaenops persicus, and Carollia sowelli to Carollia brevicauda. Although we acknowledge that each of these is now recognized as a distinct species, in all cases our synonymized names are thought to be either sister taxa or very closely related.
Ecological traits
We used a previously published dataset of 63 ecological traits describing the morphology, life history, biogeography, and diet of 1,116 bat species. These data are drawn from a combination of PanTHERIA32, EltonTraits33, and the IUCN Red List range maps, and were previously cleaned in a study producing predictions of bat reservoirs of filoviruses27. Four redundant variables (two for human population density, mean potential evapotranspiration in range, and body mass) were eliminated prior to analyses, favoring variables with higher completeness.
Correction for sampling bias
To correct for sampling bias, in the style of several previous studies30,82, we used the number of peer-reviewed citations available on a given host as a measure of scientific sampling effort. We used the R package easyPubMed to scrape the number of citations in PubMed returned when searching each of the 1,116 bat names in the trait data on April 10, 2020.
Modeling Approaches
Our team produced an ensemble of seven statistical models (ED Tables 5 and 6), and applied them to generate a predictive set of seven models for bats and five for other mammals. Four use a network-theoretic component (k-nearest neighbors, linear filtering, trait-free plug-and-play, and scaled phylogeny), while three primarily used ecological traits as predictors (boosted regression trees, Bayesian additive regression trees, and neutral phylogeographic).
All seven approaches were used to generate predictions about potential bat hosts of betacoronaviruses. A subset of five were used to recommend potential non-bat mammal hosts of betacoronaviruses (k-nearest neighbor, linear filtering, scaled phylogeny, trait-free plug-and-play, and neutral phylogeographic). We did not use trait-based models to predict non-bat hosts, because assigning pseudoabsences to the vast majority (∼3500 or more) of mammal species would likely lead to largely uninformative predictions, weighed against the 109 known betacoronavirus hosts (79 bats and 30 other mammals).
Network-based model 1: k-Nearest Neighbors recommender
We follow the methodology previously developed for the recommendation of species feeding interactions86. This method builds a recommender system internally based on the k-NN algorithm, under which candidate hosts are recommended for a virus from a pool constituted by the hosts of the k viruses with which it has the greatest overlap. Overlap (host sharing) is measured using Tanimoto similarity, which is the cardinality of the intersection of two sets divided by the cardinality of their union. To obtain the pairwise similarity between two viruses, this divides the number of shared hosts by the cumulative number of hosts. The k nearest neighbors of a virus are the k other viruses with which it has the highest Tanimoto similarity.
Hosts are then recommended by counting how many times they appear in these k neighbors, a quantity that ranges from 1 to k. We can impose arbitrary cutoffs by limiting the recommendations to the hosts that occur in at least k, k-1, etc, viruses. Previous leave-one-out validation of this model revealed that it is particularly effective for viruses with a reduced number of hosts, which is likely to be the case for emerging viruses. Furthermore, the performance of this model was not significantly improved by the addition of functional traits, making it acceptable to run on the association data only.
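The recommender described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the implementation used in the study: the virus and host names are placeholders, and the helper names tanimoto and recommend_hosts are hypothetical.

```python
from collections import Counter

# Toy virus -> host sets; all names are placeholders.
hosts_of = {
    "virus_A": {"h1", "h2", "h3"},
    "virus_B": {"h2", "h3", "h4", "h5"},
    "virus_C": {"h3", "h4"},
    "virus_D": {"h6"},
}


def tanimoto(a, b):
    """Size of the intersection of two sets divided by the size of their union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_hosts(target, hosts_of, k=2):
    """Count how often each candidate host appears among the k nearest viruses."""
    known = hosts_of[target]
    sims = sorted(
        ((tanimoto(known, hs), v) for v, hs in hosts_of.items() if v != target),
        reverse=True,
    )
    neighbors = [v for _, v in sims[:k]]
    counts = Counter()
    for v in neighbors:
        counts.update(hosts_of[v] - known)  # only hosts not already known
    return neighbors, counts


neighbors, counts = recommend_hosts("virus_A", hosts_of, k=2)
print(neighbors, counts.most_common())
```

A cutoff of the kind mentioned above corresponds to keeping only hosts whose count is at least some value between 1 and k. Swapping the roles of hosts and viruses in the input dictionary gives the second run of the model, which recommends viruses for a host.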
This model has been run two times; first, by measuring the similarity of viruses, and recommending hosts; second, by measuring the similarity of hosts, and recommending viruses. In all cases, only results for betacoronaviruses are reported.
The outcome of this model should be subject to caution, as leave-one-out validation revealed that the success rate (i.e. ability to recover one interaction that has been removed) remained lower than 50% even when using k=8, and dropped as low as 5% when using k=1 (the nearest-neighbor algorithm). This strongly suggests that the dataset of reported host-virus associations is extremely incomplete; therefore, the identification of the nearest neighbors can be biased by under-reported interactions, and this can result in noise in the prediction. This noise can be particularly important when the kNN technique operates on viruses, of which the bat dataset has only 15.
Network-based model 2: Linear filter recommender
Following Stock et al.87, we used a previously developed linear filter to infer potential missing interactions. This recommender system assumes that networks tend to be self-similar, and uses this information to generate a score for an unobserved interaction that is a linear combination of the status of the interaction (relative weight of 1/4), the relative degree of host and virus, and the observed connectance of the network (all with relative weights of 1). As we are concerned with ranking interactions rather than examining the absolute value of the score, the penalization coefficient associated with the interaction being presumed absent could be omitted with no change in the ranking, but has been set to a low value instead. The scores returned by the linear filter are not directly related to the probability of the interaction existing in this context, but higher scores still indicate interactions that are more likely to exist. Indeed, known hosts of betacoronavirus typically scored higher.
We used the zero-one-out approach to assess the performance of this model on the entire dataset. In all cases, non-interactions ranked lower than positive interactions even when entirely removing the penalization coefficient from the linear filter parameters, which suggests that the network structure (degree and connectance) is capturing a lot of information as to which species can interact. Note that as opposed to the k-NN method outlined above, the linear filter is symmetrical, i.e. it captures the properties of both host and virus at once.
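As a rough illustration of the scoring rule described above, the sketch below computes a linear-filter score for every cell of a small binary host-by-virus matrix. It assumes the four relative weights (1/4, 1, 1, 1) are normalized to sum to one, which is our reading rather than something stated in the text; the matrix and the function name linear_filter are hypothetical.

```python
def linear_filter(Y, weights=(0.25, 1.0, 1.0, 1.0)):
    """Score each cell of a binary matrix Y (rows: hosts, columns: viruses).

    The score is a weighted sum of the cell value, the column mean (relative
    degree of the virus), the row mean (relative degree of the host), and the
    overall connectance. Weights are normalized to sum to 1 (an assumption).
    """
    total = sum(weights)
    a1, a2, a3, a4 = (w / total for w in weights)
    n, m = len(Y), len(Y[0])
    row_mean = [sum(row) / m for row in Y]
    col_mean = [sum(Y[i][j] for i in range(n)) / n for j in range(m)]
    connectance = sum(sum(row) for row in Y) / (n * m)
    return [
        [
            a1 * Y[i][j] + a2 * col_mean[j] + a3 * row_mean[i] + a4 * connectance
            for j in range(m)
        ]
        for i in range(n)
    ]


# Toy network: 3 hosts by 3 viruses.
Y = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
scores = linear_filter(Y)
# An unobserved pair whose host and virus both have partners elsewhere
# scores higher than a pair involving an unconnected host and virus.
print(scores[1][1], scores[2][2])
```

Because only the ordering of scores is used, any positive rescaling of the weights leaves the resulting ranking unchanged.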
Network-based model 3: Plug and play
For network problems, the “plug and play” model is a statistical approach that formulates Bayes’ theorem for link prediction around the conditional density of traits of known associations compared to traits of every possible association in a network. The conditional density function is measured by using non-parametric kernel density estimators (implemented with the R package np), and the conditional ratio between them is used to estimate link “suitability”, a scale-free ratio. Compared to other machine learning methods that fit to training data iteratively, plug and play is comparatively simple, and directly infers the most likely extensions of observed patterns in data. The plug and play was originally developed to forecast missing links in host-parasite networks36, but has since been used to model species distributions88 and predict the global spread of human infectious diseases89. We used this model here to estimate suitability of host-virus interactions by first modeling the entire estimated network of host-virus interaction suitability, and ranking hosts that are not infected by betacoronaviruses by their estimated suitability for betacoronaviruses.
The “plug and play” model is trained using either matched pairs of host and pathogen ecological, morphological, or phylogenetic traits36, or by using a latent approach89 which considers the mean similarity of pathogens in their host ranges and the mean similarity of hosts in their pathogen communities as ‘traits’. We decided to use the latent approach, as host trait data was far more available than viral trait data. Further, the taxonomic scale considered for host (species) and virus (genus) differed, making the resolution of potential trait data different enough to potentially confound trait-based approaches in this modeling framework.
Relative suitability of a host-virus association, as estimated by the “plug and play” model, is formulated as a density ratio estimation problem. The suitability of a host-virus association is quantified as the quotient of the distribution of latent trait values when an association was recorded over the distribution of all the latent trait values. As an attempt to control for sampling effort of mammal and bat host species, we included PubMed citation counts for host species (as described above) in the estimation of host-virus suitability. We explored host-pathogen suitability using the entire mammal-virus associations dataset, to maximize the available information on the network’s structure, and ranked host-pathogen pairs by their relative suitability value. From the final predictions, we subset out bat-specific predictions. When predicting, we set citation counts to the mean of training data, as a sampling bias correction.
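The density-ratio idea can be sketched for a single latent trait. This is a simplified illustration, not the study's implementation: the study used the R package np, whereas the sketch below hand-rolls a one-dimensional Gaussian kernel density estimator with a fixed, arbitrarily chosen bandwidth, ignores citation counts, and uses made-up trait values.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built from `sample`."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
        )

    return density


def suitability(x, observed, background, bandwidth=0.5):
    """Ratio of the trait density among recorded links to that among all pairs."""
    f_observed = gaussian_kde(observed, bandwidth)
    f_background = gaussian_kde(background, bandwidth)
    return f_observed(x) / f_background(x)


# Made-up latent trait values for every possible host-virus pair ...
all_pairs = [0.1, 0.2, 0.4, 0.5, 0.9, 1.0, 1.1, 1.3]
# ... and for the subset of pairs with a recorded association.
linked = [0.9, 1.0, 1.1, 1.3]

high = suitability(1.0, linked, all_pairs)
low = suitability(0.2, linked, all_pairs)
print(high, low)
```

Values above one mark trait values that are over-represented among recorded associations relative to all possible pairs, and values below one mark under-represented trait values; unrecorded pairs can then be ranked by this ratio.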
Network-based model 4: Scaled-phylogeny
We apply the network-based conditional model of Elmasri et al.90 for predicting missing links in bipartite ecological networks. The full model combines a hierarchical Bayesian latent score framework which accounts for the number of interactions per taxon, and a dependency among hosts based on evolutionary distances. To predict links based on evolutionary distance, the probability of a host-parasite interaction is taken as the sum of evolutionary distances to the documented hosts of that parasite. This allocates higher probabilities when a few closely related hosts, or many distantly related hosts interact with a parasite. In this way phylogenetic distances are combined with individual affinity parameters per taxa to model the conditional probability of an interaction.
In ecological studies, it is common to use time-scaled phylogenies to quantify evolutionary distance among species91. We may use these fixed evolutionary distances for link prediction, but parasite taxa are known to be more or less constrained by phylogenetic distances among hosts92. Further, phylogenies are hypotheses about evolutionary relationships and have uncertainties in the topology and relative distances among species93. Rather than treating phylogenetic distances as fixed, Elmasri et al.90 re-scale the phylogeny by applying a macroevolutionary model of trait evolution. While any evolutionary model that re-scales the covariance matrix may be used, we use the early-burst model, which allows evolutionary change to accelerate or decelerate through time94. This allows different emphasis to be placed on deep versus recent host divergences when predicting links.
We apply the model to a network of associations among host species and viral genera, and the mammal supertree, which allows us to leverage information from across the network to predict undocumented bat-betacoronavirus associations. We fit sets of models, applying both the full model, and the phylogeny-only model to both the bat-viral genera associations, and the mammal-viral genera associations. For each data-model combination we fit the model using ten-fold cross-validation holding out links for which there is a minimum of two observed interactions. The posterior interaction matrices resulting from each of the ten models are then averaged to generate predictions for all links in the network, with betacoronaviruses subset to generate the ensemble predictions.
To assess predictive performance, we attempted to predict the held-out interactions, and calculated AUC scores by thresholding predicted probabilities per fold, and taking an average across the 10 folds. In addition to AUC, we also assessed the model based on the percent of documented interactions accurately recovered. For the bat-viral genera data the full model resulted in an average AUC of 0.82 and recovered an average of 90.1% of held-out interactions, while the phylogeny-only model showed increased AUC (0.86), but a decreased proportion of held-out interactions recovered (84.5%). Interestingly, the models for bat-virus genera associations had marginally worse predictive performance compared to the same models run on the larger network of mammal-virus associations (full model: AUC 0.88, 84.4% positive interactions recovered; phylogeny-only model: AUC 0.88, 88.8% positive interactions recovered), indicating that predicting bat-betacoronavirus associations may benefit from including data on non-bat hosts. The models also estimated the scaling parameter (eta) of the early-burst model to be positive (average eta=7.92 for the full model run on the bat subset), indicating accelerating evolution compared to the input tree (ED Figure 6). This means that recent divergences are given more weight than deeper ones for determining bat-viral genera associations, which is consistent with recent work on viral sharing48,95.
Trait-based model 1: Boosted regression trees
Previous work has been highly successful in predicting zoonotic reservoirs using a combination of taxonomic, ecological, and geographic traits as predictors. This approach has been previously used to identify wildlife hosts of filoviruses27,96, flaviviruses28,97, henipaviruses23, Borrelia burgdorferi26, to predict mosquito vectors of flaviviruses98, and to predict rodent reservoirs and tick vectors of zoonotic viruses37,99. These approaches treat the presence of a specific virus (or genus of viruses) or a zoonotic pathogen as an outcome variable, with negative values given for species not known to be hosts (pseudoabsences), and use machine learning to identify the characteristics that predispose animals to hosting pathogens of concern. By predicting the probability a given pseudoabsence is a false negative, the method can infer potential undetected or undiscovered host species.
This approach has almost exclusively been implemented using boosted regression trees (BRT), a classification and regression tree (CART) machine learning method that became popular a decade ago for species distribution modeling.100 Boosted regression trees develop an ensemble of classification trees which iteratively explain the residuals of previous trees, up to a fixed tree depth (usually between 3 and 5 splits). The incorporation of boosting allows the model, as it is fit, to progressively better explain poorly-fit cases within training data.
We used boosted regression trees to identify trait profiles that predict bat hosts of betacoronaviruses, including all trait predictors from the trait database that met baseline coverage (< 50% missing values) and variation (< 97% homogenous) thresholds. For all model fitting, we specified a Bernoulli error distribution for our binary response variable and applied 10-fold cross validation to prevent overfitting (R package gbm). We started by fitting a global model to our full dataset, first specifying learning rate = 0.01 (shrinks the contribution of each tree to the model) and tree complexity = 4 (controls tree depth) as per default values and subsequently tuning to minimize cross validation error.
We reduced the variable set by calling the gbm.simp() function, which computes and compares the mean change in cross validation error (deviance) produced by dropping different sets of least-contributing predictors. The final simplified model included 23 variables, plus citation counts, which we added to correct for sampling bias.
We applied bootstrap resampling to estimate uncertainty, using our tuned model to fit 1000 replicate models. For each model, training sets were assembled by randomly selecting with replacement 79 bat-coronavirus associations from the set of reported bat hosts and 79 pseudoabsences. Trained models were used to generate relative influence coefficients for trait predictors and coronavirus host probabilities across all bat species. Partial dependence plots display relative influence coefficients and bootstrapped confidence intervals for the top ten contributing trait predictors. The medians of host probabilities were ranked and used to identify the top ten candidate host species. When predicting, we set citation counts to the mean of training data, as a sampling bias correction.
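The resampling scheme above (balanced draws with replacement, one model per draw, and a median prediction per species) can be sketched as follows. This is an illustration only: the study fitted boosted regression trees with the R package gbm, whereas the stand-in model here is a one-trait logistic rule, and the species names, trait values, and function names are all hypothetical.

```python
import math
import random
import statistics


def balanced_bootstrap(positives, pseudoabsences, n, rng):
    """Draw n known hosts and n pseudoabsences, each with replacement."""
    pos = [rng.choice(positives) for _ in range(n)]
    neg = [rng.choice(pseudoabsences) for _ in range(n)]
    return pos, neg


def fit_stand_in(pos, neg, trait):
    """Stand-in for a fitted tree model: logistic rule on a single trait."""
    midpoint = (
        statistics.mean(trait[s] for s in pos)
        + statistics.mean(trait[s] for s in neg)
    ) / 2
    return lambda species: 1 / (1 + math.exp(-(trait[species] - midpoint)))


def bootstrap_medians(positives, pseudoabsences, trait, n=79, replicates=1000, seed=1):
    """Median predicted host probability per species across bootstrap replicates."""
    rng = random.Random(seed)
    predictions = {species: [] for species in trait}
    for _ in range(replicates):
        pos, neg = balanced_bootstrap(positives, pseudoabsences, n, rng)
        model = fit_stand_in(pos, neg, trait)
        for species in trait:
            predictions[species].append(model(species))
    return {s: statistics.median(p) for s, p in predictions.items()}


# Made-up single trait per species; sp1 and sp2 play the role of known hosts.
trait = {"sp1": 2.0, "sp2": 1.8, "sp3": 0.2, "sp4": 0.1, "sp5": 1.9}
positives = ["sp1", "sp2"]
pseudoabsences = ["sp3", "sp4", "sp5"]

medians = bootstrap_medians(positives, pseudoabsences, trait)
candidates = sorted(pseudoabsences, key=medians.get, reverse=True)
print(candidates)
```

The spread of each species' replicate predictions, rather than only the median, is what would supply the bootstrapped confidence intervals.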
Trait-based model 2: Bayesian additive regression trees
A similar workflow to trait-based model 1 was implemented using Bayesian additive regression trees (BART), an emerging machine learning tool that has similarities to more popular methods like random forests and boosted regression trees. BART adds several layers of methodological innovation, and performs well in bakeoffs with other advanced machine learning methods. Several features make BART very convenient for modeling projects like these, including several easy-to-use implementations in R packages, built-in capacity to impute and predict on missing data, and easy construction of variable importance and partial dependence plots.
Like other classification and regression tree methods, BART assigns the probability of a binary outcome variable by developing a set of classification trees - in this case, a sum-of-trees model - that split data (“branches”) and assign values to terminal nodes (“leaves”). Whereas other similar methods generate uncertainty by adjusting data (e.g. random forests bootstrap training data and fit a tree to each bootstrap; boosted regression trees are usually implemented with iterated training-test splits to generate confidence intervals), BART generates uncertainty using an MCMC process. An initial sum-of-trees model is fit to the entire dataset, and then rulesets are adjusted in a limited and stochastic set of ways (e.g., adding a split; switching two internal nodes), with the sum-of-trees model backfit to each change. After a burn-in period, the cumulative set of sum-of-trees models is treated as a posterior distribution. This has some advantages over other methods, like boosted regression trees or random forests. In particular, posterior width directly measures model uncertainty (rather than approximating it by permuting training data), and a single model can be run (instead of an ensemble trained on smaller subsets of training data), allowing the model to use the full training dataset all at once.101
Unlike many Bayesian machine learning methods, BART is easily implemented out-of-the-box, due to a limited set of customization needs. Three main priors control the fitting process: one usually-uniform prior on variable importance, one two-parameter negative power distribution on tree depth (preventing overfitting), and an inverse chi-squared distribution on residual variance. A set of well-performing priors from the original BART study102 are widely used across R implementations for out-of-the-box settings, but can be further adjusted relative to modeling needs. In this study, we implemented BART models using a Dirichlet prior for variable importance (DART), a specification that is designed for situations with high dimensionality data that probably reflects a small number of true informative predictors. This often produces a much more reduced model without going through a stepwise variable selection process, which can be slow and very subject to stochasticity.101
We implemented this approach using the BART package in R, using the bat-virus association dataset to generate an outcome variable, and the bat traits dataset as predictors. BART models were implemented with 200 trees and 10,000 posterior draws, using every trait feature that was at least 50% complete and < 97% homogenous (taken from TBM1).
We tried four total implementations, based on two decisions: BART uncorrected and corrected for citation counts (BART-u, BART-c), and DART uncorrected and corrected for citation counts (DART-u, DART-c). All four models performed well, with little variation in predictive power measured by the area under the receiver operating characteristic curve calculated on training data (BART-u: AUC = 0.93; BART-c: AUC = 0.93; DART-u: AUC = 0.93; DART-c: AUC = 0.90; ED Figure 7). Across all models, spatial variables had a high importance, including some regionalization (extent of range) and some variables capturing larger geographic range sizes, as did a diet of invertebrates (pulling out the phylogenetic signal of insectivorous bats; ED Figure 8).
All models identified a number of “false negative” hosts that would be suitable based on a 10% false negative classification threshold for known betacoronavirus hosts (implemented with the R package ‘PresenceAbsence’). BART-u identified 217 missing hosts, BART-c identified 279 missing hosts, DART-u identified 222 missing hosts, and DART-c identified 384 missing hosts, suggesting that this model most penalized overfitting as intended. As a result, we considered this model the most rigorous and powerful for inference, and used DART-c in the final model ensemble. We predicted across all 1,040 bats without recorded betacoronavirus associations, and ranked predicted probability. When predicting, we set citation counts to the mean of training data, as a sampling bias correction.
Trait-based model 3: Phylogeographic neutral model
We used a previously published pairwise viral sharing model48 to predict potential betacoronavirus hosts based on the sharing patterns of known hosts in a published dataset30. We used a generalised additive mixed model (GAMM), which was fitted in the first half of 2019 using the mgcv package, with pairwise binary viral sharing (0/1 denoting if a species shares at least one virus) as a response variable. Explanatory variables include pairwise proportional phylogenetic distance and geographic range overlap (taken from the IUCN species ranges), with a multi-membership random effect to control for species-level sampling biases. The model was then used to predict the probability that a given species pair share at least one virus across 4196 placental mammals with available data, producing a predicted viral sharing network that recapitulates a number of known macroecological patterns, as well as predicting reservoir hosts with surprising accuracy48. Subsetting this predicted sharing matrix, we listed the rank order of hosts most likely to share with all known betacoronavirus hosts in our datasets.
Rhinolophus-specific implementation of Trait-based model 3
We then repeated this process with sharing patterns of Rhinolophus affinis and R. malayanus specifically. Given the strong phylogenetic effect, the top 139 predictions were bat species: predominantly rhinolophids and hipposiderids. The top 20 predictions for both R. malayanus and R. affinis are displayed in ED Tables 3 and 4. Notable predictions included the hog badger Arctonyx collaris (Carnivora: Mustelidae), which was examined for SARS-CoV antibodies in 2003 and is reported in wildlife markets7,103; a selection of civet cats (Carnivora: Viverridae) including Viverra species; the binturong (Arctictis binturong); and the masked palm civet (Paguma larvata), the latter of which was implicated in the chain of emergence for SARS-CoV49,50; and pangolins (Pholidota: Manidae) including Manis javanica and Manis pentadactyla, which have been hypothesised to be part of the emergence chain for SARS-CoV-218,19.
Alongside these high-ranked species-level predictions, we visually examined how predictions varied across all mammal orders and families using the whole dataset (ED Figure 5). Pangolins (Pholidota), treeshrews (Scandentia), carnivores (Carnivora), hedgehogs (Erinaceomorpha), and even-toed ungulates (Artiodactyla) had high mean predicted probabilities. Investigating family-level sharing probabilities revealed that civets (Viverridae) and mustelids (Mustelidae) were responsible for the high Carnivora probabilities, and mouse deer (Tragulidae) and bovids (Bovidae) were mainly responsible for high probabilities in the Artiodactyla (ED Figure 6).
Consensus Methods and Recommendations
Combining and ranking predictions
For seven models predicting bat hosts of betacoronaviruses, and five models predicting mammal hosts of betacoronaviruses, we combined predictions—generated using the same standardized data—into one standardized dataset. All mammal models were trained on data including bats, but predictions were subset to exclude bats to focus on likely intermediate hosts.
Each model’s unique output (a non-intercomparable mix of different definitions of suitability or probability of association) was transformed into proportional rank, where lower rank indicates higher evidence for association out of the total number of hosts examined. By rescaling all results to proportional ranks between zero and one, we also allowed comparison of in-sample and out-of-sample predictions across all models. Proportional ranks were averaged across models to generate one standardized list of predictions. This absorbed much of the variation in model performance (ED Figure 1) and produced a set of rankings that performed well.
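The rank transformation and averaging can be sketched as below. The sketch makes two assumptions that the text does not specify: ties are broken arbitrarily by sort order, and each host is averaged only over the models that scored it. Model names, host names, and scores are made up.

```python
def proportional_ranks(scores):
    """Map raw scores to ranks in (0, 1]; the highest score gets the lowest rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    n = len(ordered)
    return {host: (i + 1) / n for i, host in enumerate(ordered)}


def ensemble_rank(model_scores):
    """Average each host's proportional rank over the models that scored it."""
    sums, counts = {}, {}
    for scores in model_scores.values():
        for host, rank in proportional_ranks(scores).items():
            sums[host] = sums.get(host, 0.0) + rank
            counts[host] = counts.get(host, 0) + 1
    return {host: sums[host] / counts[host] for host in sums}


# Two hypothetical models reporting outputs on different scales.
model_scores = {
    "model_probability": {"host_a": 0.9, "host_b": 0.2, "host_c": 0.5},
    "model_suitability": {"host_a": 12.0, "host_b": 30.0, "host_c": 1.0},
}

ensemble = ensemble_rank(model_scores)
for host in sorted(ensemble, key=ensemble.get):
    print(host, round(ensemble[host], 3))
```

Because only rank order enters the average, a model reporting probabilities and a model reporting unbounded suitability scores contribute on the same footing.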
We elected not to withhold any “test” data to measure model performance, given that each method deployed in the ensemble has been independently and rigorously tested and validated in previous publications. Instead, to maximize the amount of available training data for every model, we used full datasets in each model and measured performance on the full training data.
For bats, the final ensemble of models spanned a large range of performance on the training data, measured by the area under the receiver operating characteristic curve (AUC; Network 1: 0.624; Network 2: 0.987; Network 3: 0.514; Network 4: 0.726; Trait 1: 0.850; Trait 2: 0.902; Trait 3: 0.762), indicating that it was possible to suitably detect differences in model performance on the full data. The total ensemble of proportional ranks performed moderately well (AUC = 0.791). We used known betacoronavirus associations to threshold each model and the ensemble predictions based on a 10% omission threshold (90% sensitivity), and we again found a wide range in the number of predicted undiscovered bat hosts of betacoronaviruses (Network 1: 162 species; Network 2: 1; Network 3: 111; Network 4: 44; Trait 1: 425; Trait 2: 384; Trait 3: 720; total ensemble: 291 species). Given concerns about mammal model performance and biological accuracy (see Main Text), we elected not to apply this exercise to mammal hosts at large.
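Both quantities reported above can be sketched on toy data. The code is a hand-rolled illustration, not the routine used in the study: it assumes scores are oriented so that higher values mean stronger evidence (for proportional ranks, one minus the rank), and all scores and function names are made up.

```python
import math


def auc(pos_scores, neg_scores):
    """Chance that a random known host outscores a random non-host (ties count half)."""
    wins = sum(
        (p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


def omission_threshold(pos_scores, omission=0.10):
    """Score cutoff that leaves out at most `omission` of the known hosts."""
    ordered = sorted(pos_scores)
    n_omitted = math.floor(omission * len(ordered))
    return ordered[n_omitted]


# Made-up scores for known hosts and for species without a recorded association.
known_hosts = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.2]
unrecorded = [0.9, 0.58, 0.5, 0.4, 0.3, 0.1]

model_auc = auc(known_hosts, unrecorded)
cutoff = omission_threshold(known_hosts)
predicted_undiscovered = [s for s in unrecorded if s >= cutoff]
print(model_auc, cutoff, len(predicted_undiscovered))
```

Species without a recorded association that score at or above the cutoff are the ones counted as predicted undiscovered hosts, which is why the count depends strongly on how well each model separates known hosts from the rest.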
To visualize the spatial distribution of predicted bat hosts, we used the IUCN Red List database of species geographic distributions. We took the top 50 ranked in-sample predictions and top 50 ranked out-of-sample predictions and combined these range maps to visualize species richness of top predicted hosts (Figure 3).
Phylogenetic factorization of ensemble models
We used phylogenetic factorization to flexibly identify taxonomic patterns in the consensus proportional rankings of likely hosts of SARS-CoV-2. Phylogenetic factorization is a graph-partitioning algorithm that iteratively partitions a phylogeny in a series of generalized linear models to identify clades at any taxonomic level (e.g., rather than a priori comparing strictly among genera or family) that differ in a trait of interest45. Using the mammal supertree, we used the phylofactor package to partition proportional rank as a Gaussian-distributed variable. We determined the number of significant phylogenetic factors using Holm’s sequentially rejective 5% cutoff for the family-wise error rate. We applied this algorithm across our four final ensemble prediction datasets: in-sample bat ranks, out-of-sample bat ranks, in-sample mammal ranks, and out-of-sample mammal ranks.
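Holm's sequentially rejective procedure itself is short enough to sketch. The code below shows the generic procedure at a 5% family-wise error rate on made-up p-values; how the phylofactor package applies it to successive factors is not reproduced here, and the function name holm_reject is hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Return one boolean per p-value: True where Holm's procedure rejects the null."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        # Compare the (step + 1)-th smallest p-value with alpha / (m - step).
        if p_values[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # stop at the first failure; all larger p-values are retained
    return reject


# Made-up p-values, one per candidate factor.
p_values = [0.001, 0.04, 0.03, 0.012]
decisions = holm_reject(p_values)
print(decisions, "significant factors:", sum(decisions))
```

The number of True entries plays the role of the number of significant phylogenetic factors retained.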
Using network and trait-based models within-sample, we identified only one bat clade with substantially different consensus proportional rankings, the Yangochiroptera ( compared to 0.42 for the remaining bat phylogeny, the Yinpterochiroptera). Out of sample, using only trait-based models, we instead identified seven bat clades with different propensities to include unlikely or likely bat hosts of betacoronaviruses. Subclades of the New World superfamily Noctilionoidea broadly had higher proportional ranks , indicating lower predicted probability of being hosts, as did the Emballonuridae . In contrast, several subfamilies of the Old World fruit bats (Pteropodidae), including the Rousettinae, Cynopterinae, and Eidolinae, all had lower mean ranks . Our models also collectively identified the Rhinolophidae as having lower ranks .
Using network models within-sample across non-volant mammals, we identified four clades with different proportional ranks. The largest clade was the Laurasiatheria (Artiodactyla, Perissodactyla, Carnivora, Pholidota, Soricomorpha, and Erinaceomorpha), which had lower proportional ranks (higher risk; ). Nested within this clade, the Cetacea had greater proportional ranks , indicating lower risk. A large subclade of the Murinae (Old World rats and mice) also had lower ranks . Out of sample, using only the biogeographic viral sharing model, we instead identified 15 clades with different proportional ranks. The first clade identified large swaths of the Muridae as having higher risk as well as the Laurasiatheria . Old World primates had weakly lower risk , as did the Sciuridae . The Cetacea and Pinnipedia both had greater proportional ranks ( and ). Old World porcupines (Hystricidae) and the Erinaceidae (Paraechinus, Hemiechinus, Mesechinus, Erinaceus, Atelerix) both had greater risk ( and ), while the Afrosoricida had higher ranks .
To assess potential discrepancy between taxonomic patterns in model ensemble predictions and those of simply host betacoronavirus status itself, we ran a secondary phylogenetic factorization treating host status as a Bernoulli-distributed variable, with the same procedure applied to determine the number of significant phylogenetic factors. To assess sensitivity of taxonomic patterns to sampling effort, we ran phylogenetic factorization with and without square-root transformed PubMed citations per species as a weighting variable (ED Figure 9).
Without accounting for study effort, phylogenetic factorization of betacoronavirus host status identified one significant clade across the bat phylogeny, the Yangochiroptera, as having fewer positive species (4.71%) than the paraphyletic remainder (12.12%). When accounting for study effort, however, the single clade identified by phylogenetic factorization changed, with a subclade of the family Pteropodidae (the Rousettinae) having a greater proportion of positive species (28.6%). For non-volant mammals, phylogenetic factorization identified only one clade, the family Camelidae, as having more positive species (75%) than the tree remainder (0.68%).
Phylogenetic factorization of Rhinolophidae virus sharing
Because phylogenetic patterns in predictions from our viral sharing model could vary across other taxonomic scales beyond order and family, we also used phylogenetic factorization to more flexibly identify host clades with different propensities to share viruses with R. affinis and R. malayanus. We partitioned rank as a Gaussian-distributed variable and again determined the number of significant phylogenetic factors using Holm’s sequentially rejective 5% cutoff.
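As a rough illustration of the cutoff, Holm's sequentially rejective procedure can be sketched as follows. This is the generic textbook step-down test, not the code used in the study; the function name `holm_significant_count` and the example p-values are hypothetical.

```python
from typing import Sequence


def holm_significant_count(p_values: Sequence[float], alpha: float = 0.05) -> int:
    """Return how many hypotheses Holm's step-down procedure rejects."""
    m = len(p_values)
    rejected = 0
    # Step through the p-values from smallest to largest.
    for k, p in enumerate(sorted(p_values)):
        # The k-th smallest p-value (0-based) is compared with alpha / (m - k).
        if p > alpha / (m - k):
            break  # stop at the first failure; all larger p-values are retained
        rejected += 1
    return rejected


# Hypothetical p-values for four candidate phylogenetic factors.
example_p = [0.001, 0.020, 0.040, 0.300]
print(holm_significant_count(example_p))
```

Because the threshold loosens at each step but the procedure stops at the first non-rejection, the family-wise error rate stays at or below `alpha` across all candidate factors.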
Within the Chiroptera, we identified 10 clades with different propensities to share viruses with R. affinis and 5 clades with different propensities to share viruses with R. malayanus. For both bats, the top clade was the family Rhinolophidae, reinforcing phylogenetic components of the biogeographic model and highlighting the greater likelihood of viral sharing (mean rank for R. affinis, for R. malayanus). For R. affinis, several individual bat species had lower risks of viral sharing (e.g., Myotis leibii, ; Pteropus insularis, ; Nyctimene aello, ; Chaerephon chapini, ). The Megadermatidae, Nycteridae, and Hipposideridae (under which the PanTHERIA dataset includes the genus Rhinonicteris, although this is now considered a separate family, the Rhinonycteridae104) collectively had greater likelihood of viral sharing , as did the Vespertilionidae .
Across the non-volant mammals, we identified 7 clades with different propensities to share viruses with R. affinis and only 1 clade with different propensities to share viruses with R. malayanus. For both bat species, the first and primary clade was the Ferungulata (Artiodactyla, Perissodactyla, Carnivora, Pholidota, Soricomorpha, and Erinaceomorpha), which had lower ranks (higher viral sharing; ). For viral sharing with R. affinis, the Sciuridae was more likely to share viruses , as was the Scandentia and many members of the Colobinae . However, members of the tribe Muntiacini (genera Elaphodus and Muntiacus) had especially high likelihoods of viral sharing and low rank .
Data and Code Availability
The standardized data on betacoronavirus associations, and all associated predictor data, are available from the VERENA consortium’s GitHub (github.com/viralemergence/virionette). All modeling teams contributed an individual repository with their methods, which are available in the organizational directory (github.com/viralemergence). All code for analysis, and a working reproduction of each author’s contributions, is available from the study repository (github.com/viralemergence/Fresnel).
Extended Data
Acknowledgements
We thank Heather Wells for generously sharing thoughtful comments and code. The VERENA consortium is supported by L’Institut de Valorisation de Données (IVADO) through Université de Montréal. DJB was supported by an appointment to the Intelligence Community Postdoctoral Research Fellowship Program at Indiana University, administered by Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. Department of Energy and the Office of the Director of National Intelligence.
Bibliography
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
- 22.
- 23.
- 24.
- 25.
- 26.
- 27.
- 28.
- 29.
- 30.
- 31.
- 32.
- 33.
- 34.
- 35.
- 36.
- 37.
- 38.
- 39.
- 40.
- 41.
- 42.
- 43.
- 44.
- 45.
- 46.
- 47.
- 48.
- 49.
- 50.
- 51.
- 52.
- 53.
- 54.
- 55.
- 56.
- 57.
- 58.
- 59.
- 60.
- 61.
- 62.
- 63.
- 64.
- 65.
- 66.
- 67.
- 68.
- 69.
- 70.
- 71.
- 72.
- 73.
- 74.
- 75.
- 76.
- 77.
- 78.
- 79.
- 80.
- 81.
- 82.
- 83.
- 84.
- 85.
- 86.
- 87.
- 88.
- 89.
- 90.
- 91.
- 92.
- 93.
- 94.
- 95.
- 96.
- 97.
- 98.
- 99.
- 100.
- 101.
- 102.
- 103.
- 104.