Divide and conquer – machine-learning integrates mammalian, viral, and network traits to predict unknown virus-mammal associations

Our knowledge of viral host ranges remains limited. Completing this picture by identifying unknown hosts of known viruses is an important research aim that can help identify zoonotic and animal-disease risks. Furthermore, such understanding can be used to mitigate viral spill-over from animal reservoirs into human populations. To address this knowledge gap, we apply a divide-and-conquer approach which separates viral, mammalian and network features into three unique perspectives, each predicting associations independently to enhance predictive power. Our approach predicts over 20,000 unknown associations between known viruses and mammalian hosts, suggesting that current knowledge underestimates the number of associations in wild and semi-domesticated mammals by a factor of 4.3, and the average mammalian host-range of viruses by a factor of 3.2. In particular, our results highlight a significant knowledge gap in the wild reservoirs of important zoonotic and domesticated mammals’ viruses: specifically, lyssaviruses, bornaviruses and rotaviruses.


INTRODUCTION
Thousands of viruses are known to affect mammals, with recent estimations indicating that less than 1% of mammalian viral diversity has been discovered to date 1. Some of these viruses have a very narrow host range, whereas others such as rabies and West Nile viruses have very wide host ranges (rabies can theoretically infect any mammal 2). Host range is an important predictor of whether a virus is zoonotic 3

This network presents a global view of how these viruses are shared between mammalian hosts. This sharing exhibits certain characteristics (e.g. DNA vs RNA viruses 10; bats vs rodents 11) which could only be captured at the global level. Various network topological features have been exploited to provide significant insight into patterns of pathogen sharing 11, disease emergence and spill-over events 12, and as a means to predict missing links in a variety of host-pathogen networks 13 including helminths 14 and viruses 15. Here we express the topology of our virus-mammal network in terms of counts of potential motifs 16. Motifs 17 are miniature subgraphs which constitute the building blocks of larger, more complex networks 18. Motifs express specific functions or topological features of the underlying network, and have been used to capture complex and indirect interactions in a variety of systems including biology 19-21, ecology 22,23 and disease emergence 24.

Our framework utilises 6,331 associations between 1,896 viruses and 1,436 terrestrial mammals, representing 0.23% of all possible associations between our mammals and their viruses. It assesses how much these associations are underestimated by predicting which unknown species-level associations are likely to exist in nature (or already exist but are as yet undocumented). We aggregate these predictions to enhance estimation of the host-range of known mammalian viruses, and to highlight variation in the degree of underestimation at the level of mammalian order (particularly in wild and semi-domesticated species), and viral group (Baltimore classification), family, and genus. In addition, we highlight significant knowledge gaps in mammalian reservoirs of known zoonoses and equivalent viruses in important domesticated mammals. By investigating this underestimation from three separate points of view, we enhance the overall predictive performance and capture local (at the level of a single viral or mammalian species), as well as global (aggregated) variations in our knowledge gaps.

Our framework to predict unknown associations between known viruses and their mammalian reservoirs comprised three distinct perspectives: viral, mammalian and network. Each perspective produced predictions from a different vantage point (that of each virus, each mammal, and the network

WNV. Host species were ordered by mean probability of predictions, and the top 60 were selected. Circles represent the following information in order: 1) whether association is known (found in EID2) or not (potential or undocumented

Domestication
Might influence sharing of viruses between host groups. Domesticated mammals and humans might share more viruses with each other than with related wild species.

Ecological traits
Morphological traits (body mass)
A key feature in terms of metabolism and adaptation to the environment.

Mammalian host range of viruses
Relative importance of viral features: our multi-perspective approach enabled us to assess the relative importance of viral traits (Table 1)

Table 3 lists the results of our framework at Baltimore group level and selected family and transmission routes of our viruses.

(Table 2) to our viral models. We were also able to capture variations in how these features contribute to our viral models at various levels (e.g. Baltimore classification, or transmission route) as highlighted in Figure 3 (

(known viruses that are yet to be associated with these mammals). Our framework highlighted differences in the number of viruses predicted per order (Table 4).

increase respectively, and for rodents we predict a 7.42-fold and a 7.7-fold increase, respectively. The increase in associations indicated a non-uniform knowledge-gap across mammalian virus reservoirs.
For bats the largest fold increase was in group III viruses with an 8.72-fold increase, whereas in rodents the highest fold increase was in group V viruses, a 6.23-fold increase.

known virus species; these expanded to 51 viruses using our framework). Linked to this is wealthier countries producing a larger research volume, and hence interactions common within or of importance to such countries are more likely to be described.

Practical limitations: infectious agents of endangered and rare mammalian species, and mammalian species found predominantly in remote regions are less likely to be characterised due to difficulties in sampling these mammals in their natural habitats. The same applies to viruses with restrictive geographical range, or those that are less common in mammals (e.g. avian pathogens). Our framework was able to expand the host range of rare viruses, found in one or two mammals (N=1,450) from 1,619 to 4,174 hosts (~2.16 average increase per rare virus). It also increased the virus range of rarely studied mammals (n=954) from 1,150 to 4,318 viruses (~3.21 average fold increase per host).

Biological reasons: virus-mammal associations which produce more visible or marked effects are more likely to be studied. Examples include fertility or physically observable interactions being over-studied, whilst potentially important asymptomatic interactions, or interactions where cross-immunity from related viruses masks observable symptoms, may remain unnoticed and hence understudied. Furthermore, co-evolution between virus and primary host often results in a less severe phenotype, whilst the same virus in an incidental host may result in more marked and hence more studied disease. Studied examples include Ebola virus presenting minimal symptoms in bats but extreme symptoms in humans; analogous interactions where the former host may have been unobserved are likely to be plentiful. For example, our framework indicated that 19 additional species of bats could be carriers of Ebola virus in addition to 10 known species.

The novelty of our approach lies in the separation of perspectives: by isolating the viral, mammalian and network perspectives we were able to further our understanding of mammalian reservoirs of known viruses as follows: 1) our novel divide-and-conquer approach explored the explanatory power, by means of variable importance, of a comprehensive set of mammalian and viral traits. Uniquely, we incorporated geospatial features extrapolated from an extensive collection of global data on climate, environmental, agricultural, and mammalian diversity variables. 2) We consolidated these viral and mammalian traits with network topological features, expressed in terms of potential motifs. By counting potential motifs (in which an unknown virus-mammal association, or link, may feature) we were able to quantify the topology of our network and incorporate this topology into the prediction process in an explainable, measurable, and extendible way. 3) Our voting approach, despite being more conservative than its components (our 3 perspectives, supplementary results 2-4), was able to bridge a significant gap in our knowledge of reservoirs of mammalian viruses (18,920 associations between wild and semi-domesticated mammalian species and known viruses).

There remain, however, key areas for further improvement. Prediction of novel viruses and their potential threat to humans, livestock and wildlife is an increasingly important and active research area. Where an established virus is increasing its range beyond the native region (e.g. due to climatic or demographic factors), our framework provides a powerful means to assess potential hosts it has yet to come into contact with. However, for completely novel (e.g. SARS-CoV-2) or never-studied viruses our approach cannot predict potential associations. Future work may be able to enhance the predictive

Conversely, it will be interesting to further expand the pathogens included in our models to incorporate bacterial, protozoal, helminth and fungal pathogens.

Virus-host species associations and bipartite network formulation
We extracted species-level virus-mammal associations from the Enhanced Infectious Diseases Database (EID2) 8. We recursively aggregated virus-mammal associations: a mammal that was found to host a strain or subspecies of a virus was considered a host of the corresponding virus species. We further checked these species-level associations for accuracy and to eliminate laboratory-produced associations and spurious instances. This resulted in 6,331 associations between 1,896 viruses and 1,436 terrestrial mammals. We transformed these associations into a bipartite network in which nodes represent either virus or mammal species, and links indicate associations between mammalian and viral species (Supplementary Note 1).

Multi-perspective framework to predict unknown virus-mammal associations
Our framework trained and selected a set of supervised classifiers in each of the above perspectives as discussed below. It then consolidated the results of the best performing classifiers using voting, whereby an unknown (potential or unobserved/undocumented) association was selected if it was predicted by at least two of the three perspectives.
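The voting rule described above (an unknown association is selected when at least two of the three perspectives predict it) can be illustrated with a minimal Python sketch. The function name `vote`, the data layout, and the toy virus and mammal labels are assumptions made for illustration only; they are not taken from the framework's actual implementation, which was written in R.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (virus, mammal); hypothetical labels


def vote(predictions: Dict[str, Set[Pair]], min_votes: int = 2) -> Set[Pair]:
    """Return the pairs predicted by at least `min_votes` perspectives."""
    counts: Dict[Pair, int] = {}
    for predicted_pairs in predictions.values():
        for pair in predicted_pairs:
            counts[pair] = counts.get(pair, 0) + 1
    return {pair for pair, n in counts.items() if n >= min_votes}


# Toy predictions per perspective (placeholders, not results from the study).
toy_predictions = {
    "viral": {("virus_A", "mammal_1"), ("virus_A", "mammal_2")},
    "mammalian": {("virus_A", "mammal_1"), ("virus_B", "mammal_3")},
    "network": {
        ("virus_A", "mammal_2"),
        ("virus_B", "mammal_3"),
        ("virus_C", "mammal_1"),
    },
}

selected = vote(toy_predictions)
```

In this toy case the pair predicted by the network perspective alone is dropped, which reflects why voting is more conservative than any single perspective.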
Our framework is flexible in terms of machine-learning algorithms selected, classifiers trained, and features engineered for each perspective. It avoids overfitting as it approaches the problem from various perspectives, and effectively consolidates ensembles of classifiers trained on subsets of the underlying data. In addition, no constituent model of our framework has been trained with all available data at any time. Finally, our framework enables the incorporation of hosts where only one virus has been detected to date (via perspectives 2 and 3), and viruses where only one host has been discovered (via perspectives 1 and 3).

In contrast with our mammalian and viral perspectives, the network linking known viruses with their mammalian hosts presents a global view of how these viruses are shared amongst mammalian hosts. Here we capture the topology of this bipartite network by means of counts of potential motifs 16. Figure 4 illustrates 3, 4, and 5 node motifs which might appear in a bipartite virus-mammal association network. These motifs capture important indirect pathways between viruses and their mammalian hosts. These pathways vary from simple generalisations capturing whether a virus has a wide range of hosts or not (m3.1, m4.1, and m5.1), or if the mammal is exposed to many viruses (m3.2, m4.6, m5.20), to more complex pathways (e.g. two host species sharing 80% of their viruses with each other; three viruses sharing 50% of their hosts with each other). These pathways might indicate if an unknown association is more likely to exist in nature or not, and are only capturable, and most importantly quantifiable, at the global level as encapsulated by our network perspective. Capturing these indirect pathways, by means of motifs, enables us to apply supervised machine learning algorithms to make predictions directly from the network structure, which is not captured by the other two perspectives. Motifs are usually associated with specific frequency thresholds 18. However, here we follow previous work 16 in removing this restriction. We simply count the number of occurrences of the motifs outlined in figure 4, as discussed below, and then let the machine learning algorithms detect which motifs are particularly important to the problem of predicting links in our network.

Associations between viruses and mammals have two states: known (documented in our sources) or unknown. Unknown associations represent the gaps in our knowledge: they could exist in nature but be undocumented, or could arise in the future. Therefore, we force-insert each possible virus-mammal association (N=2,722,656) prior to counting motifs (hence termed potential motifs).
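The idea of counting potential motifs for a candidate link can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the three motif types counted here (a star around the virus, a star around the mammal, and a four-node cycle closed by the inserted link) are generic examples chosen for brevity and do not reproduce the motif catalogue of figure 4; the function names and the toy network are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_network(links: Iterable[Tuple[str, str]]):
    """Build adjacency sets for a bipartite virus-mammal network."""
    hosts_of: Dict[str, Set[str]] = defaultdict(set)
    viruses_of: Dict[str, Set[str]] = defaultdict(set)
    for virus, mammal in links:
        hosts_of[virus].add(mammal)
        viruses_of[mammal].add(virus)
    return hosts_of, viruses_of


def potential_motif_counts(virus, mammal, hosts_of, viruses_of):
    """Count illustrative motifs that would contain the link (virus, mammal)
    if that link were inserted into the network."""
    other_hosts = hosts_of.get(virus, set()) - {mammal}
    other_viruses = viruses_of.get(mammal, set()) - {virus}
    # Four-node cycles: virus - h - w - mammal, closed by the inserted link.
    four_cycles = sum(
        len((viruses_of[h] - {virus}) & other_viruses) for h in other_hosts
    )
    return {
        "virus_star": len(other_hosts),     # mammal - virus - other host
        "mammal_star": len(other_viruses),  # virus - mammal - other virus
        "four_cycle": four_cycles,
    }


# Toy network (placeholder labels, not data from the study).
toy_links = [("v1", "m1"), ("v1", "m2"), ("v2", "m2"), ("v2", "m3"), ("v3", "m3")]
hosts_of, viruses_of = build_network(toy_links)
counts = potential_motif_counts("v1", "m3", hosts_of, viruses_of)

# Number of candidate pairs for the reported 1,896 viruses and 1,436 mammals.
n_candidate_pairs = 1896 * 1436
known_percent = round(100 * 6331 / n_candidate_pairs, 2)
```

Repeating this count for every candidate pair yields one feature row per possible association; the product 1,896 × 1,436 reproduces the reported N=2,722,656, and the 6,331 known associations correspond to about 0.23% of these pairs.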

In order to incorporate these motifs as features from which a supervised classifier can learn, we applied the following approach. For each virus-host association (link), whether known or unknown, we counted the number of instances of each of our motifs (as categorised in figure 4) in which the association might feature (presented as dashed red line in figure 4). In other words, for each virus-host association we "inserted" the corresponding link into our network and counted all potential motifs in which this link might feature if it actually existed 16. This enabled us to create a training set of all potential host-virus associations of our 1,896 viruses and 1,436 mammals and the counts of their potential motifs. We then trained a number of machine-learning algorithms with this dataset as detailed in following subsections.

Research effort: We incorporated research effort on mammal and virus species into our network perspective models. We calculated research effort as the total number of sequences and publications of each species as indexed by EID2 8.

Multi-perspective prediction pipeline of unknown virus-mammal associations
As highlighted above, our framework comprised three perspectives: mammalian, viral and network. Each of these perspectives trained a set of models with different features (tables 1 and 2, and figure 4 respectively), and hence required its own pipeline as described below (Supplementary Note 5).

Mammalian and viral perspectives:
Class balancing: On average each virus in our dataset affected 3.45 mammals (~0.24%), and each mammalian host was affected by 4.41 viruses (~0.24%). This presented an imbalance in our data, whereby only a small percentage of possible associations are known. We dealt with this issue in two ways: first, we excluded any virus (N=1,281) which was found in only one mammal species from our virus models pipeline (viral perspective), and we excluded any mammal (N=758) which is only affected by one virus from our mammal models pipeline (mammalian perspective). Second, we deployed SMOTE (Synthetic Minority Over-sampling Technique) 52,53 to rebalance the classes prior to training each of our viral (N=8×556) and mammalian (N=8×699) models.

SVM with Polynomial Kernel (SVM-P), and Naive Bayes. We selected these classifiers due to their robustness, scalability, availability, and overall performance 54,55. All models were trained and tested via the caret R package 56.
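The class-balancing step described above relies on SMOTE, which creates synthetic minority-class instances by interpolating between a minority instance and one of its nearest minority neighbours. The following pure-Python sketch illustrates that interpolation idea only; it is a simplified stand-in written for this text, not the SMOTE implementation used in the study, and the function name `smote_like`, the parameter defaults, and the toy feature vectors are assumptions.

```python
import random
from typing import List, Sequence


def smote_like(
    minority: Sequence[Sequence[float]], n_new: int, k: int = 2, seed: int = 0
) -> List[List[float]]:
    """Generate `n_new` synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours.
    Requires at least two minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` (squared Euclidean distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> target
        synthetic.append([a + gap * (b - a) for a, b in zip(base, target)])
    return synthetic


# Toy example: 3 known associations (minority) versus 8 unknown pairs (majority).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
n_majority = 8
synthetic = smote_like(minority, n_new=n_majority - len(minority))
balanced_minority = minority + synthetic
```

After oversampling, the minority class has as many instances as the majority class in this toy example, and every synthetic point lies on a segment between two original minority points.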