Abstract
Novel infectious diseases continue to emerge within human populations. Predictive studies have begun to identify pathogen traits associated with emergence. However, emerging pathogens vary widely in virulence, a key determinant of their ultimate risk to public health. Here, we use structured literature searches to review the virulence of each of the 214 known human-infective RNA virus species. We then use a machine learning framework to determine whether viral virulence can be predicted by ecological traits including human-to-human transmissibility, transmission routes, tissue tropisms and host range. Using severity of clinical disease as a measurement of virulence, we identified potential risk factors using predictive classification tree and random forest ensemble models. The random forest model predicted literature-assigned disease severity of test data with 90.3% accuracy, compared to a null accuracy of 74.2%. In addition to viral taxonomy, the ability to cause systemic infection, having renal and/or neural tropism, direct contact or respiratory transmission, and limited (0 < R0 ≤ 1) human-to-human transmissibility were the strongest predictors of severe disease. We present a novel, comparative perspective on the virulence of all currently known human RNA virus species. The risk factors identified may provide novel perspectives in understanding the evolution of virulence and elucidating molecular virulence mechanisms. These risk factors could also improve planning and preparedness in public health strategies as part of a predictive framework for novel human infections.
Author Summary Newly emerging infectious diseases present potentially serious threats to global health. Although studies have begun to identify pathogen traits associated with the emergence of new human diseases, these do not address why emerging infections vary in the severity of disease they cause, often termed ‘virulence’. We test whether ecological traits of human viruses can act as predictors of virulence, as suggested by theoretical studies. We conduct the first systematic review of virulence across all currently known human RNA virus species. We adopt a machine learning approach by constructing a random forest, a model that aims to optimally predict an outcome using a specific structure of predictors. Predictions matched literature-assigned ratings for 28 of 31 test set viruses. Our predictive model suggests that higher virulence is associated with infection of multiple organ systems, nervous systems or the renal systems. Higher virulence was also associated with contact-based or airborne transmission, and limited capability to transmit between humans. These risk factors may provide novel starting points for questioning why virulence should evolve and identifying causative mechanisms of virulence. In addition, our work could suggest priority targets for infectious disease surveillance and future public health risk strategies.
Blurb Comparative analysis using machine learning shows specificity of tissue tropism and transmission biology can act as predictive risk factors for virulence of human RNA viruses.
Introduction
The emergence of novel infectious diseases continues to represent a threat to global public health. Emerging pathogens have been defined as those newly recognised infections of humans following zoonotic transmission, or those increasing in incidence and/or geographic range [1]. High-profile examples of emerging pathogens include the discovery of the novel MERS coronavirus from cases of respiratory illness in 2012 [2], and the expansion of the range of Zika virus across the South Pacific and the Americas [3]. The emergence of previously unseen viruses means that the set of known human viruses continually increases by around 2 species per year [4,5]. Initial comparative studies identified trends among emerging human pathogens, for example, increased risk of emergence for pathogens with broad host ranges, and RNA viruses [6–9]. However, more recent comparative analyses have focused on risk factors for specific pathogen traits, such as transmissibility [10–12]. Here, we focus on understanding the ecological determinants of pathogen virulence, using all currently recognised human RNA viruses as a study system.
Emerging RNA viruses vary widely in their virulence, with some never having been associated with human disease at all. For example, Zaire ebolavirus causes severe haemorrhagic fever with outbreaks, including the 2014 West African outbreak showing case fatality ratios of ∼60% or more [13,14]. In contrast, human infections with Reston ebolavirus have never exhibited any evidence of disease symptoms [15]. Applying the comparative approach to understand the ecology of virulence could offer valuable synergy with studies of emergence, towards prioritisation and preparedness in the detection of potential new human viruses [16].
Few comparative analyses have addressed the risk factors driving human pathogen virulence to date (but see [17–19]), and none have exhaustively investigated virulence across the breadth of all currently recognised human RNA viruses. Several hypotheses regarding how pathogen ecology affects virulence have been derived from theoretical models of evolution. For example, the trade-off hypothesis was developed based on the assumption that rate of transmission between individuals may increase as a function of virulence, but there will be a consequential increase in host mortality (or decrease in host recovery as the inverse of mortality). As a result, pathogen fitness will be subject to trade-off between virulence and transmissibility over a longer infectious window [20,21]. The trade-off hypothesis is highly debated as it is difficult to empirically characterise due to dependency on many other aspects of host-pathogen coevolution [22,23]. However, comparative analysis has been suggested as one method to assess evidence for a virulence-transmission trade-off [22]. Based on these core principles, we hypothesised that limited capability to transmit between humans may act as a predictive risk factor for virulence. We also note that evolutionary trade-offs will only apply to coevolved host-virus relationships and that many human viruses result from zoonotic cross-species transmission without onward transmission or adaptation. In these cases, ‘coincidental’ non-adapted virulence may result [24,25], and as above, we hypothesised that limited human-to-human transmissibility may predict higher virulence.
Transmission route may also influence the evolution of virulence. Ewald [18] suggested that vector-borne pathogens should be less constrained by costs of virulence, i.e. morbidity and immobilisation of the vertebrate host does not impede transmission if it occurs through an arthropod vector. We therefore hypothesised a vector-borne transmission route would predict higher virulence.
Several studies have also suggested a link between host range and virulence. Assuming an evolutionary trade-off exists between virulence and transmission rate, higher virulence may result in pathogens with narrower host ranges following selection pressures to increase transmission rate within the specialist host(s) [19]. Furthermore, the degree of virulence in experimental infections with Drosophila C virus was more similar between closely related hosts [26]. Though similar ideas have not yet been formally tested for human infections, parasite infectivity correlates with phylogenetic relatedness among primates [27]. We hypothesised infection of non-human primates as a specific related host taxon would predict higher virulence. Finally, although yet unexplored via theoretical models, it may be an intuitive expectation that systemic infections present with more severe disease than local infections. A broader tissue tropism could therefore also predict higher virulence.
We aimed to determine patterns of virulence across the breadth of all known human RNA viruses. We then aimed to use predictive machine learning models to ask whether ecological traits of viruses can act as predictive risk factors for virulence in humans. Specifically, we examined hypotheses that viruses would be more highly virulent if they: lacked transmissibility within humans; had vector-borne transmission routes; had a narrow host range including non-human primates; or had greater breadth of tissue tropisms.
Results
Virulence of Human RNA Viruses
Following [5], as of 2015 there were 214 RNA virus species containing viruses capable of infecting humans, spanning 55 genera and 21 families (with one species unassigned to a family). Using a two-category system, 58 of these were rated as causing ‘severe’ clinical disease and 154 as ‘nonsevere’ following systematic literature review (Fig 2, see also S1 Table, S2 Table). Two virus species could not be assigned a disease severity rating and were excluded from all analyses (Hepatitis delta virus, which is reliant on Hepatitis B virus coinfection; and Primate T-lymphotropic virus 3, which may be associated with chronic disease like other T-lymphotropic viruses, but has not been known in humans long enough for cohort observations). Disease severity differed between viral taxonomic families (Fisher’s exact, 1000 simulations, p < 0.001), with Arenaviridae, Filoviridae and Hantaviridae having the highest fractions of severe-rated virus species (Fig 2). Fatalities were reported in healthy adults for 64 viruses and in vulnerable individuals only for an additional 26 viruses, whilst 8 viruses rated ‘nonsevere’ had severe strains, 6 of which belonged to the family Picornaviridae.
Classification Tree Risk Factor Analysis
To find predictive risk factors for virulence, we firstly divided the 212 virus species into a training set (n = 181) and test set (n = 31) based on taxonomy and severity in order to minimise potential biases from trait imbalances. Using the training set, we then constructed a single classification tree that aimed to optimally classify viruses in virulence based on their ecological traits. The final pruned classification tree included variables relating to transmissibility, tissue tropism and taxonomy (Fig 2). Severe disease was predicted by the model for four generalised groups: i) viruses with a neural or systemic primary tropism with limited human-to-human transmissibility (excluding orthomyxoviruses, phenuiviruses and reoviruses); ii) viruses known to have a renal tropism (primary or otherwise); iii) hantaviruses; and iv) retroviruses with sustained human-to-human transmissibility.
Random Forest Risk Factor Analysis
Although the illustrated classification tree identified several risk factors, this represents one of many possible trees, as tree structure is dependent on the exact sampling partition between training and test data. We therefore constructed a random forest model containing 5000 individual trees, each built using a bootstrapped sample of the training data and a randomly restricted subset of predictors.
Aggregated over these bootstrapped trees, the most informative predictor variables for classifying virulence were taxonomic family and primary tissue tropism (Fig 4). However, transmission route, human-to-human transmissibility level, and having a known neural or renal tropism were also relatively informative, broadly mirroring the risk factors observed in the single tree. Host range predictors were generally uninformative.
To quantify the effects of the most informative risk factors, partial dependences were extracted from the random forest, describing the marginal predicted probabilities of severe virulence associated with each virus trait (Fig 5, S3 Table). Averaging across other predictors, viruses having tissue tropisms within neural, renal or systemic across multiple organ systems presented the highest risk of severe virulence, whilst respiratory and gastrointestinal tropisms presented the lowest risk. An increased probability of severe virulence was also observed for viruses transmitted by direct contact or respiratory routes, and those with known but limited human-to-human transmissibility.
Model Performance in Predicting Viral Virulence
Although the single classification tree model predicted the training set well, it did not appear generalisable to novel data within the test set. The single tree correctly predicted virulence ratings from literature-based criteria for 24 of 31 viruses in the test set giving a resulting accuracy of 77.4% (95% confidence interval [CI]: 58.9% - 90.4%), no evident improvement on the null model assigning all viruses as nonsevere (null accuracy = 74.2%). The random forest gave better predictive accuracy, correctly predicting virulence ratings for 28 of 31 test set viruses (accuracy: 90.3%, 95% CI: 74.3% - 98.0%), significantly greater than the null accuracy (exact binomial one-tailed test, p = 0.025). The random forest also achieved superior performance when considering sensitivity, specificity, True Skill Statistic, and the negative predictive value as a performance measure prioritising correct classification of ‘severe’-rated viruses (Table 1). The random forest also outperformed the classification tree in AUROC, area under the receiver operating characteristic curve (Table 1, Fig 3).
All misclassifications from the random forest occurred within the genus Flavivirus (S2 Table). Within the test set, there were two flaviruses rated as severe from literature protocols that were predicted to be nonsevere (Rio Bravo virus, Yellow fever virus), and one nonsevere flavivirus predicted to be severe (Usutu virus).
The observed predictor importances and risk factor directions were robust to constructing random forest models for subsets of viruses, removing those with low-certainty data or data from serological evidence only (S1 Fig, S2 Fig), and similar performance diagnostics were obtained (S5 Table). Redefining our virulence measure to integrate information on known fatalities and differences with subspecies or strains in an ordinal ranking system (S5 Table) did not improve predictive performance (S6 Table). Using alternative virulence measurements, the most informative variables and virus traits predicting severity showed good agreement with that of the main analysis (S3 Fig, S4 Fig) though when definitions of ‘severe’ virulence were widened, hepatic tropism became an informative predictor towards disease severity.
Discussion
We present the first comparative analysis of virulence across all known human RNA virus species to our knowledge. We find that disease severity is non-randomly distributed across virus families and that beyond taxonomy, severe disease is predicted by risk factors of tissue tropism, and to a lesser extent, transmission route and level of human-to-human transmissibility. In both the classification tree and random forest, viruses were more likely to be predicted to cause severe disease if they caused systemic infections, had neural or renal tropism, transmitted via direct contact or respiratory routes, or had limited capability to transmit between humans (0 < R0 ≤ 1). These risk factors were robust to alternative modelling methods, alternative definitions of virulence, and exclusions of poor quality data.
Ecology and Evolution of Risk Factor Traits
Primary tissue tropism was the most informative non-taxonomic risk factor (Fig 4) and the first split criteria in the classification tree (Fig 2), with specific neural tropism and generalised systemic tropism predicting severe disease (Fig 5). Few evolutionary studies have directly predicted how tissue tropism should influence virulence. The identified risk factor tropisms could be explainable as a simple function of pathology occurring in multiple or sensitive tissues respectively, increasing intensity of clinical disease. However, it has been suggested that an excessive, non-adapted virulence may result if infections occur within non-target tissues that do not contribute to transmission [28]. Furthermore, the evolutionary determinants of tissue tropism themselves are not well understood [29]. Tissue tropism should be a key consideration for future comparative and evolutionary modelling efforts.
We also found viruses primarily transmitted by direct contact and respiratory routes to have a higher predicted probability of severe virulence than viruses transmitted by more indirect faecal-oral or vector-borne routes. Contrastingly, Ewald [18] reported a positive association between virulence and vector-borne transmission in comparative analyses pooling several microparasite types, including a limited range of viruses, and suggested virulence has fewer costs to viral evolutionary fitness if vector transmission can occur independent of host health and mobility. The opposite association we observe may imply that even if transmission occurs via an indirect route such as through an arthropod vector, virulence could bring ultimate fitness costs due to host mortality before encountering a vector, fomite, etc..
The relationship between virulence and transmissibility appears more complex. Firstly, the random forest model suggested a lower risk of severe virulence for viruses with sustained human-to-human transmissibility (level 4) (Fig 5). This would lend support towards hypothesised virulence-transmissibility trade-offs [20–22] and suggests that the adaptation necessary to develop efficient human-to-human transmissibility could result in attenuation of virulence in RNA viruses. Sustained transmissibility appeared to positively predict severe disease for a specific subset of four viruses in the single classification tree (Fig 2), all retroviruses causing chronic syndromes (HIV 1 and 2, Primate T-lymphotropic virus 1 and 2), which are likely subject to different evolutionary dynamics – if disease occurs after the infectious period, virulence brings fewer costs to pathogens from host mortality, essentially ‘decoupling’ from transmission [24]. We note only three non-chronic level 4 viruses rated severe: Severe acute respiratory syndrome-related coronavirus, Yellow fever virus, and Zaire ebolavirus.
Secondly, cross-species infections incapable of onward transmission (sometimes termed ‘dead-end’ infections) have been predicted to result in higher virulence as without any evolutionary selection, viral phenotypes within that host will be non-adapted, i.e. a ‘coincidental’ by-product [24,25]. However, we did not observe viruses incapable of human-to-human transmissibility to be more virulent, the highest risk instead being observed for viruses with self-limited transmissibility. This may suggest that if virulence is entirely unselected in dead-end infections, ultimate levels of virulence could also feasibly turn out to be ‘coincidentally’ low.
Taxonomic family being a highly informative predictor in the random forest implies that there is a broad phylogenetic signal to virulence, but it is also highly likely that the explanatory power represents a proxy for many other phylogenetically-conserved viral traits that are challenging to implement in comparative analyses of this scale, such as variation at the proteomic, transcriptomic or genomic level; or further data beyond simple categorisations, e.g. specific arthropod vector species. Untangling these sources of variation from different scales of traits will be a critical next step in predictive modelling of viral virulence.
Analytical Limitations
We acknowledge several limitations to the quality of our data, as with any broad comparative analysis. Risk factor data was problematic or missing for certain viruses, e.g. natural transmission route for viruses only known to infect humans by accidental occupational exposure, and tissue tropism for viruses only known from serological evidence. However, the consistency of findings between alternative, stricter definitions of virulence and data subsets removing viruses with suspected data quality issues suggests scarcity of data does not bias our analyses.
Virulence also exhibits substantial variation at the sub-species level, i.e. between strains or variants. For example, severity of Lassa virus disease superficially varies with infection route and geography, though this appears to be driven by variation between genotypes [30]. Confirmatory analyses at a finer resolution would validate our identified risk factors, e.g. phylogenetic trait models of individual genera or species. Furthermore, clinical symptoms are also subject to traits of the host individual, e.g., immunocompetence, age, microbiome [31,32]. Our risk factor analysis brings a novel, top-down perspective on virulence at the broadest level, though caution must be exerted in extrapolating the risk factors we find to dynamics of specific infections.
Implications for Public Health
The value of predictive modelling as an inexpensive and rapid tool for risk assessments during early emergence is increasingly recognised [16]. Instances where machine learning model predictions do not match outcomes could indicate likely candidates for outcome class changes, e.g. future reservoir hosts for zoonotic disease [33]. Severe virulence was predicted for one virus rated ‘nonsevere’ from literature protocols, Usutu virus, potentially suggesting the capability for more severe disease to be recognised in future.
However, our models have restricted function in predicting the virulence of a newly identified virus. Although taxonomy is easily accessible and applicable to give simple virulence estimates, the most informative non-taxonomic predictor, tissue tropism, is not likely to be known with confidence before clinical observations of virulence. One way to address this paucity of data lies in the potential predictability of tissue tropism from cell receptors, and more challengingly, cell receptors from viral sequence data [34], an increasingly accessible information source during early emergence following advances in genomic sequencing methods [35]. However, the exact links between tissue tropism, cell receptors, and sequences are currently a critical knowledge gap, but a potentially powerful focus for future predictive efforts. A further key area will be the possibility to directly infer virulence itself from other aspects of sequence data, e.g. genome composition biases, which have recently demonstrated the potential to predict reservoir host taxa and arthropod vectors via machine learning [36].
More widely, our analysis brings a novel focus that complements comparative models predicting other aspects of the emergence process, such as zoonotic transmission [8,9,27,33], propagation within humans [10,11] or geographic hotspots [37,38]. After continued calls for model-informed strategy, predictive studies are now beginning to shape surveillance and prevention with respect to emerging zoonoses [16,39], with virulence being been suggested as a factor to direct viral surveillance [40], albeit in non-human hosts. The virulence risk factors we identify suggest that broadly targeting direct contact or respiratory transmission interfaces within ecological systems and/or tailoring detection assays towards certain virus families (e.g. Hantaviridae) or tissues (e.g. neural tissue) could contribute to a viable strategy to detect future virulent zoonoses.
Conclusion
This work adds to the comparative and predictive modelling efforts surrounding emerging infectious diseases. Here, we contribute a novel focus in ecological predictors of virulence of human RNA viruses, which can be combined in holistic frameworks with other models such as those predicting emergence dynamics. As a predictive model, the featured random forest offers valuable inference into the evolutionary determinants of virulence in newly emerging infections. We propose that future predictive studies and preparedness initiatives with respect to emerging diseases should carefully consider potential for human virulence.
Materials and Methods
Data Collection
For each of the 214 recognised human-infective RNA virus species following standardised data compilation efforts and critical assessment protocols [5], data on virulence and potential risk factors were collected via a systematic search and review of clinical and epidemiological literature. The following were consulted in turn: clinical virology textbooks [41–43]; references from the dataset described by [5]; literature searches using Google Scholar (search terms: 1) [virus name] AND human, 2) [virus name] AND human AND case, 3) [virus name] AND human AND [fatal* OR death], 4) [virus name] AND human AND [tropi* or isolat*]. Searches 3 and 4 were carried out only when fatality or tropism data respectively were not already found from previous sources. Data collection and virus name search terms included the full species name, any synonyms or subspecies (excluding vaccine strains) and the standard virus abbreviation as given by ICTV Online Virus Taxonomy [44].
Although many possible measurements of virulence have been proposed [45,46], even simple metrics like case fatality ratio (CFR) have not been calculated for the majority of human RNA virus species. Therefore, virulence was rated using a simple two-category measure of severity of typical disease in humans. We rated viruses as ‘severe’ if they firstly had ≥5% CFR where data was available (159/214 viruses including those with zero CFR), otherwise, we rated viruses as ‘severe’ if they had frequent reports of hospitalisation, were associated with significant morbidity from certain conditions (haemorrhagic fever, seizures/coma, cirrhosis, AIDS, hantavirus pulmonary syndrome, HTLV-associated myelopathy) or were explicitly described as “severe” or “causing severe disease” (S1 Table, S2 Table). We rated viruses as ‘nonsevere’ if none of these conditions were met. Note that this led to ‘nonsevere’ ratings for some viruses with clinically severe, but rare syndromes, e.g. Dengue virus can cause haemorrhagic dengue fever, though this is much rarer than typical acute dengue fever [41,42]. To address this, data were also collected on whether the virus has caused fatalities in vulnerable individuals (defined as age 16 and below or 60 and above, immunosuppressed, having co-morbidities, or otherwise cited as being ‘at-risk’ by sources for specific viruses) and in healthy adults, and whether any ‘nonsevere’ virus has atypically severe strains (for example, most infections with viruses within the species Human enterovirus C cause mild disease; however, poliovirus, which causes severe paralytic disease, is also classified under this species). These were examined both individually and within a composite six-rank system (S5 Table).
Data were compiled for four main risk factors: transmission route(s) and tissue tropisms, sourced from literature search exercises as described; and extent of human-to-human transmissibility and host range, sourced directly from [5]. Although evolutionary theories also predict virulence to vary with other traits, e.g. environmental survivability [47], paucity of data or nestedness within taxonomic family prevented their inclusion in our analysis. Transmission route was defined as the primary route the virus is transmitted by, classified as either vector-borne (excluding mechanical transmission), direct contact, faecal-oral or respiratory transmission. Tissue tropism was specified the primary organ system the virus typically infects or targets, classified as either neural, gastrointestinal, hepatic, respiratory, circulatory, vascular, or ‘systemic’ (primary tropism within multiple organ systems). We accepted isolation of the virus, viral proteins or genetic material, or diagnostic symptoms of the virus (such as characteristic histological damage) as evidence of infection within an organ system but did not accept generalised symptoms such as inflammation. However, many human viruses were isolated from blood with no further evidence of any specific tissue tropisms (n = 69). Therefore, we also included an additional ‘viraemia’ category in this variable to indicate only blood presence was known. Binary variables were also constructed denoting whether viruses were ever known to utilise a) more than one transmission route/tissue tropism, and b) each individual transmission route and tropism, including additional categories that were never among the primary routes/tropisms (food-borne and vertical transmission; renal, cardiac, joint, reproductive, sensory, skin, muscular and endocrine tropism).
Human-to-human transmissibility was specified using infectivity/transmissibility levels, based on previous conceptual models and a systematic compilation and review of evidence [4,5,12]. Level 2 denotes a virus capable of infecting humans but not transmitting between humans (R0 = 0), level 3 denotes a virus with limited human-to-human transmissibility (0 < R0 ≤ 1); and level 4 denotes a virus with sustained human-to-human transmissibility (R0 ≥ 1). Host range was specified as either ‘narrow’ (infection known only within humans or humans plus non-human primates) or ‘broad’ (infection known in mammals or animals beyond primates) [5]. Binary variables were also sourced as to whether infection was known within a) humans only, b) non-human primates, c) other mammals and d) birds. All virulence and risk factor data pertained to natural or unintentional artificially-acquired human infection only and data from intentional human infection, animal infection, and in vitro infection were not considered. Viral taxonomy was included in analyses by specifying both genome type and taxonomic family as predictors. All virulence and risk factor data are available via Figshare [48].
Machine Learning Risk Factor Analysis
Firstly, the 212 retained virus species were split into a training set for model fitting and test set for model evaluation at an approximate 75:25 ratio using stratified random sampling based on taxonomic family and virulence rating. Fisher’s exact tests confirmed equal representation of families (p = 0.991) and virulence ratings (p > 0.999) between training and test data. Comparative risk factor analyses were firstly carried out by constructing a classification tree using the R package ‘rpart’ v4.1-11 [49]. Classification trees are a simple form of machine learning models that aim to optimally classify data points into their correct category of outcome variable based on a structure of binary predictor splits. Tree-based methods are well-suited for comparative analyses where confounding often results from taxonomic signal or suites of otherwise co-occurring traits as their high structure can intuitively fit complex non-linear interactions and local effects.
A tree model was fitted to the training set to predict virulence ratings by ‘recursive partitioning’, the repeated splitting of the dataset using every possible binary permutation of each predictor, and retaining the split that minimises the Gini impurity [50], defined as for outcome variable x with n possible ratings and p(xi) denoting proportion of data with rating i, which is equal to zero for perfectly separated data. To prevent overfitting, the tree was pruned back to the optimal branching size, taken as most common consensus size over 1000 repeats of 10-fold cross-validation. To validate the predictive power of the classification tree, predictions of virulence rating were generated when applied to the test set. Tree accuracy was then calculated comparing the proportion of correct predictions compared to literature-assigned ratings (assuming these to be 100% accurate as the ‘gold standard’ or ‘ground truth’). As virulence ratings were imbalanced (i.e. only a minority of viruses cause severe disease, so correct nonsevere classifications are likely to be achieved by chance), accuracy was directly compared to the null model, i.e. a model with no predictors that predicted ‘nonsevere’ for all viruses. Additional diagnostics of interest (sensitivity, specificity, negative predictive value, and True Skill Statistic [60]) were also obtained.
Although classification trees have the advantage of presenting an interpretable schematic of risk factor effects and directions, individual tree structures may be sensitive to particular data points and have no intuitive measures of uncertainty. Therefore, we constructed a random forest, an ensemble collection of a large number of bootstrapped classification trees [51]. Having many predictor variables compared to the relatively limited and fixed number of human-infective RNA virus species, random forests handle such ‘large p, small n’ data architecture much more easily than traditional regression frameworks [52]. Missing data in all predictors was imputed using the R package ‘missForest’ v1.4 [53]. Then, using the R package ‘randomForest’ v4.6-12 [53], a random forest was created containing 5000 individual trees, each built upon a bootstrapped sample of the training data and restricted to test a randomly selected subset of predictors (k = 5) at each split during construction and convergence confirmed by inspection. Predictive power of the random forest model was evaluated using the test set as for the classification tree and receiver operating characteristic curves were visualised and area under curves calculated to directly compare the two machine learning methodologies.
Due to their high structuring, random forest models cannot give a simple parametric predictor effect size and direction (e.g., an odds ratio). Instead, potential virulence risk factors were evaluated using two metrics: variable importance and partial dependence. Variable importance is calculated as the mean decrease in Gini impurity following tree splits on the predictor and can be considered as how informative the risk factor was towards correctly predicting virulence. Partial dependence is calculated as the mean relative change in log-odds of predicting severe virulence, which were converted to predicted probabilities of severity associated with each risk factor. Partial dependences describe marginal effects averaging across any influence of other predictors and as such, a single estimate may not reflect any complex risk factor interactions. Therefore, to test hypotheses regarding virulence risk factors, we present both random forest partial dependences and the less robust but more accessible single classification tree for its ease of interpretation in risk factor structure, and directly compare the statistical validity of both methods by plotting receiver operating characteristic curves. All modelling was carried out in R v 3.4.3 [54], with a supporting R script available via Figshare [48].
Acknowledgements
We thank Jarrod Hadfield, Samantha Lycett, and Daniel Streicker for helpful discussion, and Alex Bhattacharya, Christopher McCaffery, David McCulloch, Conor O’Halloran, Claire Taylor and Feifei Zhang for assistance in data collection.