Abstract
Individuals differ widely in their contribution to the spread of infection within and across populations. Three key epidemiological host traits affect infectious disease spread: susceptibility (propensity to acquire infection), infectivity (propensity to transmit infection to others) and recoverability (propensity to recover quickly). Interventions aiming to reduce disease spread may target improvement in any one of these traits, but the necessary statistical methods for obtaining risk estimates are lacking. In this paper we introduce a novel software tool called SIRE (standing for “Susceptibility, Infectivity and Recoverability Estimation”), which allows simultaneous estimation of the genetic effect of a single nucleotide polymorphism (SNP), as well as non-genetic influences on these three unobservable host traits. SIRE implements a flexible Bayesian algorithm which accommodates a wide range of disease surveillance data comprising any combination of recorded individual infection and/or recovery times, or disease status measurements. Different genetic and non-genetic regulations and data scenarios (representing realistic recording schemes) were simulated to validate SIRE and to assess their impact on the precision, accuracy and bias of parameter estimates. This analysis revealed that with few exceptions, SIRE provides unbiased, accurate parameter estimates associated with all three host traits. For most scenarios, SNP effects associated with recoverability can be estimated with highest precision, followed by susceptibility. For infectivity, many epidemics with few individuals give substantially more statistical power to identify SNP effects than the reverse. Importantly, precise estimates of SNP and other effects could be obtained even in the case of incomplete, censored and relatively infrequent measurements of individuals’ infection or survival status, albeit requiring more individuals to yield equivalent precision. SIRE represents a new tool for analysing a wide range of experimental and field disease data with the aim of discovering and validating SNPs and other factors controlling infectious disease transmission.
1 Introduction
In the era of rapid expansion in the human population resulting in increasing demands on food security, effective solutions that reduce the spread of infectious diseases not only in humans, but also in plants and livestock, are urgently needed. Failure of stringent biosecurity measures [2, 3] and emergence of anti-microbial resistance [4, 5] and escape mutants to viral vaccines [6, 7] demonstrate that infectious diseases cannot be combatted by conventional biosecurity and pharmaceutical interventions alone.
The advent of genome wide high density single-nucleotide polymorphism (SNP) chip panels has already led to a remarkable range of discoveries regarding the genetic regulation and biology of diseases and translation towards innovative therapeutics [8]. In agriculture, SNP chip panels have revolutionized breeding practices by facilitating genomic selection [9, 10]. In the infectious disease context genomic selection may effectively prevent or reduce disease spread by providing a means to identify and select against individuals with high genetic risk of becoming infected or transmitting infections purely based on their genetic make-up, without the need of exposing them to infectious pathogens [11]. However, to date the full host genetic basis underlying infectious disease transmission is still poorly understood.
Epidemiological models are widely used to identify risk factors for disease spread in populations. Indeed, modelling disease transmission in genetically heterogeneous populations is well established (see e.g.[12, 13]). Particularly relevant are so-called compartmental models in which individuals are classified as, for example, susceptible to infection, infected and infectious, or recovered (or alternatively dead). Transitions between these states are determined by three key individual traits: susceptibility, the relative risk of an uninfected individual to become infected when exposed to a typical infectious individual or infectious material excreted from such an individual, infectivity, the propensity of an individual, once infected, to transmit infection to a typical (average) susceptible individual, and recoverability, the propensity of an individual, once infected, to recover or die) [14, 15]. As demonstrated by numerous simulation studies, host genetic variation in any one of these traits, if correctly identified, could be exploited to reduce infectious disease spread within and across populations [15–18]. However, despite their strong epidemiological importance, the genetic regulation and co-regulation of these three host traits is largely unexplored. Whereas a plethora of studies have identified substantial heritable variation and SNPs associated with host susceptibility [18], remarkably little is known about the genetic regulation of host recoverability and infectivity, despite emerging evidence that genetic variation in these traits exists [19, 20]. In particular, it is currently not known to what extent infectivity is genetically controlled, despite compelling evidence that super-spreaders, defined as a small proportion of individuals responsible for a disproportionally large number of transmissions, are a common phenomenon in epidemics [21–23]. This shortcoming is largely because appropriate statistical methods for estimating genetic and also non-genetic (treatment) effects for all three key epidemiological traits controlling disease transmission from infectious disease data are currently lacking.
In many conventional genome-wide association studies (GWAS) [24], target traits for genetic improvement are measured directly, so establishing genetic associations is relatively straightforward. In the epidemiological setting, however, the susceptibility, infectivity and recoverability of individuals are not measured directly. Rather their effects are manifested in the infection and recovery times of individuals in the epidemic (or epidemics) as a whole. Furthermore, most conventional GWAS assume that an individual’s infection status is controlled by its own genetic susceptibility and environmental effects. From an epidemiological viewpoint however, an individual’s disease phenotype (e.g. infected or not) may not only depend on its own susceptibility and recoverability genes, but also on the infectiousness of other individuals in the same contact group, i.e. their infectivity and recoverability genes [25]. This complex interdependence between underlying and observable traits poses challenges for existing methods.
The motivation behind this paper is to introduce new statistical and computational methods that utilise information derived from observation of epidemics and trait interdependence to estimate, for the first time, genetic and other systematic effects for all three underlying epidemiological host traits. This requires combining statistical, epidemiological and genetic modelling principles. Analysis of incomplete epidemic data to draw inferences on epidemiological parameters is well established [26, 27]. However, analysing such data to draw joint inferences on both the disease epidemiology and host genetic variation has proven challenging [25, 28]. Recent studies have expanded conventional quantitative genetics threshold models to enable joint genetic evaluation of cattle susceptibility to, and recoverability from, mastitis [29, 30], which led to identification of novel SNPs and candidate genes associated with these traits [19]. However, because infectivity acts on group members rather than the focal individual itself, applying these technique to estimate genetic effects for infectivity is problematic.
Alternative approaches have focused on disentangling susceptibility from infectivity effects. For example, Anacleto et al. [31] developed a Bayesian inference approach to produce genetic risk estimates for host susceptibility and infectivity from epidemic time to infection data, assuming that susceptibility and infectivity are under polygenic control (i.e. they are determined by a large number of genes, each with small effect). This approach, however, does not incorporate genetic variation in recoverability, and does not estimate SNP effects. An alternative approach, based on the assumption that susceptibility and infectivity are controlled by two single bi-allelic genetic loci [32, 33], used a generalized linear model (GLM) to estimate relative allelic effects on host susceptibility and infectivity. Whilst an important contribution, this approach focused on the disease status of individuals at the end of each epidemic (i.e. discarding potentially useful information from the infection and recovery times themselves). It also failed to incorporate variation in recoverability, and relied on a number of simplifying assumptions which were found to produce biased estimates under certain circumstances. A variant of this approach [34], which adopted a GLM to analyse time-series data on individual disease status, illustrated the benefits of longitudinal records of individuals’ infection status for improving prediction accuracies of SNP effects, although it still relied on a number of simplifications that may compromise prediction accuracies and lead to unwanted bias. A further shortcoming of previous approaches [32–34] is that they ignore potential pleiotropic effects, i.e. SNPs affecting more than one epidemic trait. This seems unrealistic, since, for example, SNPs that control within host pathogen replication may also lower the risk that infection can establish, i.e. reduce susceptibility, and simultaneously reduce pathogen shedding and hence infectivity, and speed up recovery.
In this study we present a novel software tool called SIRE (standing for “susceptibility, infectivity and recoverability estimation”) that implements a Bayesian inference approach to simultaneously estimate the effects of a single SNP (importantly capturing any pleiotropy), together with that of other fixed effects (such as e.g. sex, breed or vaccination status) on host susceptibility, infectivity and recoverability from temporal epidemic data. This approach can be applied to a wide range of epidemic data, collected at the level of individuals, and accounts for different types of uncertainty in a statistically consistent way (e.g. censoring of data or imperfect diagnostic tests), and permits the incorporation of prior knowledge. We validate SIRE for a variety of simulated epidemic scenarios, comprising not only the ideal case in which infection and recovery / death times of each individual are known exactly, but also under more realistic scenarios in which epidemics are only partially observed.
2 Materials and methods
2.1 Data structure and the underlying genetic-epidemiological model
SIRE applies to individual-level disease data originating from one or more contact groups in which infectious disease is transmitted from infectious to susceptible individuals through contact. This data can come from well controlled disease transmission experiments or from much less well controlled field data (which may be less complete, but readily available in larger quantity).
In the context of disease transmission experiments in plants or livestock, epidemics are initiated by means of artificially infecting a proportion of “seeder” individuals which go on to transmit their infection to susceptible individuals sharing the same contact group. In field data contact groups may consist of animal herds, or any group of individuals sharing the same environment such as a pasture, pen, cage or pond, and infection is assumed to invade the group by some external, usually unknown, means (e.g. by the unintentional spread of infected material, or the introduction of an infected individual from elsewhere). For simplicity it is assumed that throughout the observation period groups are closed, i.e. no births, migrations, or transmission of disease between groups. This assumption generally holds for experimental studies and also for most common field situations, where a movement ban is imposed after disease notification [35].
The dynamic spread of disease within a contact group is modelled using a so-called SIR model [36]. Individuals are classified as being either susceptible to infection (S), infected and infectious (I), or recovered/removed/dead (R). Under the simple SIR model for homogeneous populations, the time-dependent force of infection for a susceptible individual j (i.e. the probability per unit time of becoming infected) is given by λj(t) = βI(t), which is the product of an average transmission rate β and I(t), the number of infected individuals at time t. To incorporate individual-based variation in host susceptibility and infectivity, this simple expression for λj(t) is replaced by an individual force of infection (see [31] for a formal derivation)
Here gj characterises the fractional deviation in individual j’s susceptibility as compared to that of the population as a whole (e.g. gj=0.1 corresponds to individual j being ≃10% more susceptible than the population average), fi characterises the corresponding quantity for individual i’s infectivity, and the sum in Eq.(1) goes over all individuals infected at time t sharing the same contact group z as individual j (note, this sum varies as a function of t as individuals become infected and recover). The term Gz in Eq.(1) accounts for the fractional deviation in disease transmission for group z. This incorporates group-specific factors that influence the overall speed of an epidemic in one contact group relative to another (e.g. animals kept in different management conditions, environmental differences, or variation in pathogen strains with differing virulence). Whilst variation in Gz may be small for a well-controlled challenge experiment, this may not be the case in real field data. Gz is assumed to be a random effect with standard deviation σG. The exponential dependencies in Eq.(1) ensure that λj is strictly positive and allow for the possibility that some groups or individuals are much more/less susceptible/infectious than others, i.e. it can accommodate potential super-spreaders.
Whilst in Eq. (1) infection is modelled as a Poisson process with individual infection rates λj [18, 20], the recovery process is modelled by assuming that the time taken for individual m to recover after being infected is drawn from a gamma distribution with an individual-based mean wm and shape parameter k (which for simplicity is assumed to be the same across individuals). This mean recovery time is expressed as where γ represents an average recovery rate across the population and rm describes the fractional deviation from this for individual m. This approach is taken to allow the recovery probability distribution to adopt a more biologically realistic profile compared with the exponential distribution often assumed (see electronic supplementary material Appendix A for further details).
Following standard quantitative genetics theory [37], the individual-based deviations in susceptibility g, infectivity f and recoverability r (which are vectors with elements relating to each individual) are decomposed into the following contributions
SNP effects
The model assumes that a specific locus defined by a SNP (potentially) plays an important contribution to the trait values (note, repeated analysis can be performed on different SNPs of interest). Assuming a diploid genomic architecture with biallelic SNP implies three SNP genotypes: AA, AB and BB. The SNP contribution to the traits for individual j depends on j’s genotype in the following way:
The parameters ag, af and ar capture the relative differences in trait values between AA and BB individuals, and are subsequently referred to as the “SNP effects” for susceptibility, infectivity and recoverability, respectively (e.g. if ag is positive, individuals with an AA genotype will, on average, be more susceptible to disease than those with a BB genotype). The scaled dominance factors Δg, Δf and Δr characterise the trait deviations between the heterozygote AB individuals and the homozygote mean (a value of 1 corresponds to complete dominance of the A allele over the B allele and −1 when the reverse is true, whereas absence of dominance is represented by a value of 0) [38].
Fixed effects
The design matrix X and fixed effect vectors bg, bf and br in Eq.(3) allow for other known sources of variation to be accounted for (e.g. breed, sex or vaccination status). Following convention, an additional fixed effect is added to account for trait mean, which is explicitly chosen to ensure the population averages of g, f and r are zero (remembering that the average effects are already captured by the parameters β and γ).
Residual contributions
Here ε=(εg, εf, εr) accounts for all other contributions to the traits (i.e. coming from genetic effects not captured by the SNP under consideration, as well as any non-genetic environmental variation). We assume that for each individual the three trait residuals are drawn from a single multivariate normal distribution with zero mean and 3×3 covariance matrix Σ.
Including these correlations is important because it allows for the possibility that, for example, more susceptible individuals may also, on average, be more infectious and recover at a slower rate (on top of any correlations which may also arise from the SNP and fixed effects). Note that in this study, which focuses on the estimation of SNP effects, there is no explicit distinction between random genetic and environmental effects, although the model could be extended to incorporate estimation of these polygenic effects. It is thus assumed that individuals are randomly distributed across the groups with respect to the genetic effects on the epidemiological traits not captured by the SNP. Also note that Eq.(3) does not contain random group effects for the individual epidemiological traits. This is because the group effect has already been incorporated in the expression of the individual force of infection in Eq.(1). In other words, it is assumed that the group environment is the dominant mechanism affecting the speed at which infection spreads within a group rather than group specific factors affecting individuals’ susceptibility, infectivity or recoverability.
2.2 Bayesian inference
Based on the description above, the model contains the following set of parameters: θ=(β, γ, k, ag, af, ar, Δg, Δf, Δr, bg, bf, br, εg, εf, εr,, Σ, G, σG). We denote the complete set of infection and recovery event times for all individuals as ξ over the observed duration of the epidemics [39]. Typically ξ is not precisely known, and so we consider the general case in which ξ represents a set of latent model variables. The nature of the actual observed data y will be problem dependant. For example, in some instances recovery or removal (e.g. due to death) times will be precisely known but infection times completely unknown. In other instances infection and recovery times will both be unknown, but results from disease diagnostic tests provide information regarding disease status at particular points in time. The framework presented in this paper is flexible to these various possibilities.
Application of Bayes’ theorem implies that the posterior probability distribution for model parameters and latent variables is given by where individual components are defined as follows:
Observation model π(y|ξ) – the probability of the data given a set of event times ξ. The expression for the observation model depends on the nature of the data being observed. In many contexts this simply takes the values one or zero depending on whether ξ is consistent with y or not. For example a perfect disease diagnostic test showing that an individual is infected would be only consistent with ξ containing an infection event on that individual prior to the time of the test and a recovery event after the time of the test. Similarly, if data y indicates that an individual becomes infected at a particular point in time, this is only consistent provided ξ also contains this infection event. When imperfect disease diagnostic test results are available the observation model includes the sensitivity and specificity of the test to account for this uncertainty in the data. In summary, the observation model depends on the data collection process and constrains the possible event sequences ξ, and this, in turn, informs the model parameters θ.
Latent process likelihood L(ξ|θ) – the probability of ξ being sampled from the model given parameters θ. This can be derived from the genetic-epidemiological model described in the previous section [26, 27] (see Appendix B for details), and is given by
The functional dependence of L(ξ|θ) on the parameters θ is expressed in terms of the force of infections λj in Eq.(1) and mean recovery times wm in Eq.(2), which themselves depend in g, f and r in Eq.(3). The product z goes over all contact groups and within each contact group: j goes over individuals that become infected excluding those which initiate epidemics [40], m goes over individuals that become infected including those which initiate epidemics and e goes over both infection and recovery events (with corresponding event times te). Here the notation j∈z indicates that j goes over all those individuals j in contact group z, and e∈Ez indicates that e goes over all events Ez. The force of infection λj is calculated immediately prior to individual j becoming infected. The gamma distributed probability density function FΓ for recovery events gives the probability an individual is infected for duration δtm given a mean duration wm and shape parameter k. The time dependent total rate of infection events Λz in contact group z immediately prior to event time te is given by where the sum in s goes over all susceptible individuals in group z at that time.
An important point to mention is that Eq.(6) is calculated on an unbounded time line. In situations in which data is censored, the observation model restricts events that occur within the observed time window, but other events can exist outside of this observed region [41].
Priorπ(θ) – the state of knowledge prior to data y being considered. To account for the prior assumption that residuals ε in Eq.(3) are multivariate normally distributed and that the vector of group effects G in Eq.(1) are random effects, π(θ) can be decomposed into where θ−ε,G includes all parameters with the exception of ε and G and
Here j goes over each individual and εj =(εg,j, εf,j, εr,j)T is a three dimensional vector giving the residual contributions to the susceptibility, infectivity and recoverability of j. Σ is a 3×3 covariance matrix (which describes not only the overall magnitude of the residual contributions, but also any potential correlations between traits). Finally, the product z in Eq.(9) goes over all contact groups and Gz represents the group-based fractional deviation in transmission rate, which is assumed to be independent between groups and normally distributed with standard deviation σG.
The default prior for θ−ε,G (which can be modified if necessary) is largely uninformative but does place upper and lower bounds on many of the key parameters to stop them straying into biologically unrealistic values (details are given in Appendix C).
Samples for θ and ξ from the posterior are generated by means of an adaptive Markov Chain Monte Carlo (MCMC) schemes which implements optimised random walk Metropolis-Hastings updates for most parameters and posterior-based proposals [1] to aid fast mixing of the residual parameters (details are given in Appendix D).
2.3 SIRE
SIRE is a desktop application that implements the Bayesian algorithm outlined above. It is freely available to download from the supplementary material or at www.mkodb.roslin.ed.ac.uk/EAT/SIRE.html (with versions for Windows, Linux and Mac). An easy to use point and click interface allows for data tables to be imported in a variety of formats and graphical outputs are dynamically displayed as inference is performed. The core of SIRE utilises efficient C++ code and allows for running MCMC chains on multiple CPU cores.
SIRE takes as input any combination of information about infection times, recovery times, disease status measurements, disease diagnostic test results, genotypes of SNPs or any other fixed effects (see screenshot in Fig 1a), details of which individuals belong to which contact groups and any prior specifications (Fig 1b). The output from SIRE consists of posterior trace plots for model parameters θ, distributions (Fig 1c), visualisation of infection and recovery times ξ, dynamic population estimates and summary statistics (means and 95% credible intervals) as well as MCMC diagnostic statistics (Fig 1d). Posterior distribution graphs can be exported from SIRE and also files containing posterior samples of θ and ξ for further analysis using other tools. The user guide for SIRE is available in the electronic supplementary material and on the website.
2.4 Data scenarios
SIRE is flexible to many possible inputs. Reflecting real-world datasets this paper considers five potential data scenarios (DS):
DS1: Infection and recovery times for all individuals exactly known
This represents the best case scenario for inferring parameter values. For example, appearance of symptoms or visual or behavioural signs may indicate the onset of infection, and recovery/removal times are given by the time of death.
DS2: Only recovery times known
Often “recovery” in compartmental SIR models represents the death and removal of individuals. Consequently DS2 is pertinent to cases in which the only measurable quantity is the time at which individuals die. For example, disease challenge experiments in aquaculture routinely record time of death rather than infection times, which are usually difficult to measure [42].
DS3: Only infection times known
Whilst less common than DS2, in some instances data provides information regarding when individuals become infected but not when they recover. For example in human epidemics, patients may go to the doctor when they become ill, but no records will be kept on when they recover.
DS4: Disease status periodically checked
DS4 represents the most common scenario for monitoring infectious disease spread in livestock or plant populations, where each individual is periodically checked to establish its disease status. Under DS4 the point at which epidemics start is usually unknown, as well as the infection and recovery times of individuals themselves. However the diagnostic test results place constraints on these quantities. For example, if an individual is found to be uninfected at one sampling time and infected at the next sampling time this means that infection must have occurred at some point in the intervening period (note here we assume perfect diagnostic tests but SIRE also allows for imperfect diagnostic test results to be used, provided the sensitivity and specificity of the tests are known).
DS5: Time censored data
This data scenario relates to situations in which epidemics are not observed over their entire time period. For example a disease transmission experiment being carried out may be terminated early, due to cost or other factors (e.g. animal welfare), even though epidemics have not completely died out.
3 Assessment of performance and data requirements
In this section we apply SIRE to simulated datasets in order to 1) test the extent to which the inferred posterior parameter distributions agree with their true values, and 2) investigate how the precision, accuracy and bias of inferred model parameters depends on the type of data available.
Initially the focus of results will be on DS1 (which although rarely applies in practice, still provides useful insights for software validation and application) and later in section 3.5 consideration is given to DS2-5.
3.1 Illustrative example simulation and inference
We first demonstrate the performance of SIRE assuming complete information of individuals’ infection and recovery times, for a representative but complex set of parameters with regards to the genetic and non-genetic regulation of the three epidemiological host traits. Subsequently we investigate how these results change under different parameter and data scenarios.
Simulations
Individuals were randomly assigned into Ngroup different contact groups, with each group containing Gsize individuals. The SNP under investigation was assumed to be in Hardy-Weinberg equilibrium [38] with an A allele frequency of p=0.3. For the effect sizes we used the values ag=0.4, af=0.3, ar=−0.4, representing a relatively large pleiotropic effect (which confers higher susceptibility for AA compared to BB individuals, as well as slightly higher infectivity and reduced recoverability). The choice of Δg=0.4, Δf=0.1, Δr=−0.3 for the scaled dominance factors represents partial, but not strong, dominance of either the A or B allele. For simplicity we included only a single fixed effect, e.g. sex, of arbitrary moderate size bg0=0.2, bf0=0.3, br0=−0.2 with individuals in the population randomly selected to be male or female. The residual variances were chosen to be Σgg=Σff=Σrr=1, corresponding to a large variation in traits between individuals (perhaps larger than is biologically realistic, but here we want to demonstrate that inference of the SNP effects is still possible despite significant variation in trait values arising from other sources). In line with the direction of the SNP effects, the covariances were chosen to be Σgf=0.3, Σgr=−0.4 and Σfr=−0.2, representing a potential scenario in which individuals that are more susceptible are also more infectious and recover at a slower rate and vice-versa). To accommodate variation in epidemic speed across groups, we set the standard deviation in the group effects to σG=0.5. Finally, the average transmission rate was chosen to be β=0.3/Gsize (selected because it led to a substantial fraction of individuals becoming infected and including Gsize such that the basic reproductive ratio R0 remained independent of group size, i.e. frequency dependent transmission) and an average recovery rate γ=0.1 with shape parameter k=5 (corresponding to the infection duration being relatively highly peaked around a mean of 10 time units).
Simulated epidemic data was generated by means of a Doob-Gillespie algorithm [43] modified to account for non-Markovian recovery times (details of this procedure are given in Appendix F). A typical output for one simulated epidemic in a single contact group Ngroup=1 with Gsize=50 individuals is shown in Fig 2. Whilst the simulation itself is generated on an individual basis, this graph summarises dynamic variation in the susceptible, infectious and recovered populations, categorised by SNP genotype. It reveals classic epidemic SIR model behaviour: a single infected individual passes its infection on to others, triggering a rapidly spreading infection process throughout the population until the epidemic eventually dies out as a result of the susceptible population becoming largely exhausted and the remaining infected population recovering. Note that in closed groups not all susceptible individuals become infected. In this particular case some AB and BB individuals remain uninfected at the end of the epidemic. The absence of AA individuals partly stems from natural stochasticity in the system, but also partly from the fact that ag=0.4 is positive, i.e. AA individuals are more susceptible to disease and so on average less likely to remain uninfected. Consequently we can link the genetic composition in the final state of the epidemic to the expected value for ag (which, based on this particular dataset, is more likely positive than negative). Over and above information from the final state, however, there is much to be gained from also accounting for the infection and recovery event times themselves. The Bayesian approach adopted in this paper utilises all this information to extract the best available parameter estimates.
The information content from a single epidemic is generally insufficient to estimate the large number of parameters in the model. Therefore we next simulated a more realistic dataset (using the same parameter set as above) made up of 1000 individuals split into Ngroup=20 contact groups, each containing Gsize=50 individuals. The infection and recovery event times from this simulation were then used as input data into SIRE (scenarios in which infection and recovery times are not known precisely are discussed later in section 3.5).
Parameter estimates
Fig 3 shows the inferred posterior probability distributions for all parameters in θ corresponding to the simulated multi-group scenario described above. The actual parameter values used to generate the data (see vertical black dashed lines in Fig 3) consistently lie within regions of high posterior probability. The standard deviations (SDs) in these distributions characterise the precision with which parameters can be estimated:
Population average parameters (Figs 3a-c) – The recovery rate γ has the greatest precision (smallest relative SD), followed by the transmission rate β. Whilst the distribution for the shape parameter k is wide, it is clearly able to discount the possibility of an exponential recovery duration (i.e. k=1), which has a very low posterior probability, over a more peaked distribution (i.e. k>1).
SNP effects (Figs 3d-f) – The estimated recovery SNP effect ar is highly peaked around its true value of −0.4 (Fig 3f). Importantly this distribution has an extremely low posterior probability at ar=0. Indeed, since ar=0 does not lie within the 95% credible interval it can be concluded, to a high degree of certainty, that the SNP is associated with recoverability. The same is true for the susceptibility SNP effect ag in Fig 3d, albeit with a wider posterior probability distribution. This difference is for two reasons: firstly the recovery process involves only ar, whereas the infection process involves both ag and af (leading to potential confounding between these parameters) and secondly the recovery processes is gamma distributed which has a smaller standard deviation than the more dispersed Poisson process governing infection. The infectivity SNP effect af in Fig 3e exhibits a much wider probability distribution than the other two SNP effects. The fact that zero does lie within the 95% posterior credible interval (which goes from −0.35 to 2.1) means that no certain association with infectivity can be made in this particular example. Figs 3d-f illustrates a general principle that was common in the vast majority of subsequent simulation scenarios: SNP effects associated with recoverability are most precisely estimated, followed by susceptibility, and finally infectivity [44].
Scaled dominance factor (Figs 3g–i) – Compared to the SNP effects themselves, precision of the scaled dominance parameters is relatively poor, and actually reduces as the size of the SNP effects goes down (results not shown), which makes sense in the limit of zero SNP effect size, because here no information about dominance is available. Estimating them accurately, therefore, either requires very large SNP effects or substantially more data.
Fixed effects (Figs 3j–l) – Since SNP effects are also a type of fixed effect, the same comments as above also apply for other fixed effects.
Residual covariance matrix and random group effect (Figs 3m–s) – Interestingly, it was possible to obtain relatively good estimates for elements in the residual covariance matrix. Again, the familiar pattern is observed whereby quantities related to recoverability are more precisely estimated than those related to susceptibility, with infectivity the least precise. Finally, the variance of the group effect could be estimated with similar precision as that for susceptibility (Figs 3s & m).
3.2 Dependence on parameter values
The previous section showed an illustrative example for a particular parameter set. Here we assess what happens when different parameters in the model are altered. This was achieved by means of taking the following “base” set of parameters and then changing each parameter separately (fixing all others) [45]. Fig 4 shows scatter plots (each referring to a different selected parameter) of the posterior means (crosses) with corresponding 95% credible intervals inferred from a single simulated dataset using the true selected parameter value on the x-axis. Plots in which most crosses lie near to the diagonal line imply that inference is able to accurately capture the true parameter values. Table 1 shows the corresponding prediction accuracy, measured as the correlation between the inferred and true parameter values. Except for Δf for which prediction accuracy was only 34%, prediction accuracies for all other parameters ranged from 69-99%. In line with the discussion above, parameters associated with recoverability have generally higher predication accuracies than those associated with susceptibility, which are again higher than those for infectivity.
Bias indicates systematic differences between the true parameter values and those inferred from the data. Bias was measured by fitting regression lines through the posterior means in Fig 4 (as a function of the true parameter value). The corresponding y-intercept and slope values are shown in Table 1, where a zero y-intercept and a slope of one indicate absence of bias. Whilst the majority of observed y-intercepts tended to be very small, the slope for some of the parameters is markedly less than one (most notably for Δf). The reason for this is as follows. When Bayesian analysis reveals insufficient information regarding a parameter, its distribution follows that of the prior (which are uniform for all the parameters in this particular study, as described in Appendix C). This behaviour happens irrespective of the parameter’s true value, leading to a plot in Fig 4 that would be entirely flat (i.e. a slope of zero). Therefore, the slopes of less than one in Fig 4 simply reflect a lack of data, which is essentially another manifestation of a lack of parameter precision. Consequently, bias reduces as the amount of data increases (provided the model being fitted is the correct one).
From the point of view of this paper, the probability distributions which are of greatest interest are the SNP effects. Noting the sizes of the error bars across Figs 4d-f demonstrate that the precisions of the parameter estimates are largely independent of the values of the parameters themselves, a result which can be supported analytically [46]. This implies that the precision of SNP effects calculated using the base set of parameters in Eq.(10) is expected to be generally applicable to any other parameter set [47] (e.g. the average SDs in Table 1 for the base parameter set are very similar to the SDs shown in Fig 3).
Consequently, the remainder of this paper focuses on investigating how SNP effect estimates are affected by contact group structure and the nature of the measured data using this base set of parameters. We focus first on outlining the behaviour with respect to key design features, e.g. group size, number of individuals per group and allele frequency, and then go on to consider how observations of the system influence what can be learned.
3.3 Dependence on the number and size of contact groups
The crosses in Fig 5 shows how SDs in the SNP effects change as a function of the number of individuals Gsize within each contact group (here Ngroup=10 contact groups are assumed). The SD in ag reduces as the number of individuals in each contact group Gsize increases (Fig 5a). Importantly this relationship scales as a line of slope −½ (note the log scales on this plot), corresponding to precision increasing by a factor of two as the number of individuals is increases by a factor of four (in line with what would be expected from central limit theorem). Fig 5a provides insights into how many individuals would need to be observed in order to be able to make an association with a susceptibility SNP effect of a given size. For example, in order to detect an association with a susceptibility SNP of effect size ag = 0.4, Gsize=20 individuals per contact group, and so Gsize×Ngroup=200 individuals in total would be needed to assure that the 95% credibility interval does not contain zero (assuming approximate normality for the posterior distribution), as illustrated by that black dashed line in Fig 5a. Fig 5c shows the same scaling relationship for identifying recoverability SNP effects, but this time only Gsize×Ngroup=100 individuals are needed to make associations for recovery SNP effects (reflecting the fact that ar can be inferred more precisely, as mentioned previously). A very different state of affairs, however, is observed in Fig 5b. Here we see that not only is the infectivity SNP effect af poorly estimated, but also its precision does not markedly improve even when the number of individuals in each contact group Gsize is substantially increased.
Instead of varying Gsize and fixing the number of contact groups Ngroup, we now fix Gsize=10 and vary Ngroup. Results for this are shown in Fig 6 (represented by the crosses). This reveals a similar behaviour as seen before for the SD in ag and ar, but crucially we find the SD in the infectivity SNP effect af now also scales with the familiar line of slope −½. The reason for this behaviour lies in the fact that infectivity is an indirect genetic effect, i.e. an individual’s infectivity SNP affects the disease phenotype of group members rather than its own disease phenotype [48–50]. More intuitively, this can be explained as follows. Susceptibility and recoverability SNPs of an individual directly affect its own measured disease phenotype (the former affecting its infection time and the latter affecting its recovery time). Therefore the information on which these two quantities can be inferred is expected to scale with the total number of individuals. On the other hand, as an individual’s infectivity SNP acts on all susceptible individuals sharing the same contact group, it affects the epidemic dynamics as a whole. In fact much of the information regarding infectivity comes from the overall speed of epidemics. For example, if those contact groups containing individuals with more A alleles consistently experience epidemics which are faster than those with fewer A alleles, this provides evidence that the A allele confers greater infectivity than the B allele (the situation is further complicated by the fact that differences in susceptibility can also cause this behaviour, however the algorithm can independently estimate ag, so removing this potential confounding). Because information about the infectivity SNP effect comes from epidemic-wide behaviour, it is expected to scale linearly with the number of contact groups Ngroup (Fig 6b), but not with the number of individuals per contact group Gsize (Fig 5b).
Finally, we investigate the case in which we fix the total number of individuals to Gsize×Ngroup=1000 whilst simultaneously varying Gsize and Ngroup, as shown in Fig 7 (see crosses). In Fig 7a we find very little variation in the precision of ag. Interestingly, the results in Fig 7b clearly demonstrate that larger numbers of contact groups containing fewer individuals help to reduce the SD in the infectivity SNP effect af. In the case of Gsize=2 the posterior SDs in ag and af are actually the same due to the symmetry of this particular setup (i.e. each group consists of exactly one infected and one susceptible individual). Lastly, Fig 7c shows that the SD in ar is largely independent of Gsize. This is because recovery is solely an individual-based process, and so happens independently of others sharing the same contact group (although in cases in which R0 is small, differences may result from variation in the fraction of individuals which actually become infected).
3.4 Dependence on allele frequency
So far we have assumed a fixed A allele frequency p=0.3 in the population. Fig 8 demonstrates what happens when this is no longer the case by varying p, which in turn changes the Hardy-Weinberg equilibrium frequencies for the three genotypes. We find that the curves are symmetric around a minimum of p=0.5 and remain remarkably flat over a large region. They only increase substantially when the minor allele frequency drops below around 10%. This result shows that the statistical power to establish SNP effects dramatically reduces when they are rare, which is consistent with observations from conventional GWAS analyses [51].
3.5 Different data scenarios
This section shows results from the various data scenarios introduce in section 2.4, in which the infection and recovery times of all individuals are not known precisely:
DS2: Only recovery times known
Since ag and af relate to the infection process, naïvely it might be expected that because infection times are unknown then nothing can be inferred about these SNP effects. This section, however, clearly demonstrates this not to be the case. The reason lies in the fact that whilst infection times are latent variables, the distribution from which they are sampled is informed by the available recovery data through the likelihood in Eq.(6).
The square symbols in Fig 5a denote the posterior SDs in the susceptibility SNP effect ag under DS2. Compared to the best case scenario DS1, the SD in ag increases as a result of having to infer probable infection times for individuals (as opposed to knowing them exactly). The number of individuals per group needed to identify an association for a susceptibility SNP effect of ag =0.4 is now Gsize=80 (see dashed purple line in Fig 5a), as opposed to Gsize=20 in the case of DS1. Consequently to achieve an equivalent precision for ag under DS2 requires around 4 times as many individuals. In the case of the infectivity SNP effect af, this factor becomes approximately 4.2 (see Fig 6b, assuming a large number of contact groups), and for the recoverability it is 1.9 (see Fig 5c). These factors were found to be remarkably consistent across a broad range of group numbers and sizes (results not shown).
Estimates of prediction accuracies and bias for the case of DS2 were obtained as described in section 3.2, and results are presented in Appendix H. Compared to DS1 (Fig 4 and Table 1), The prediction accuracies tend to be slightly lower (but still above 0.5 in the majority of cases and above 0.9 for some parameters) and the bias slightly higher, reflecting the reduction in data. However, similar patterns with regards to which parameters are associated with lower prediction accuracy and bias emerge as was seen for DS1 (Fig 4).
In summary our analysis of DS2 clearly demonstrates that even when infection times are unknown, accurate inference regarding all SNP effects can be made, given sufficient data.
DS3: Only infection times known
The triangles in Figs 5, 6 and 7 show results under DS3 for different group sizes and group compositions. Here the SDs in the SNP effects for susceptibility ag and infectivity af are found to be almost the same as for DS1 (because uncertainty in recovery times only has a very weak impact on uncertainty in the infection process). However the SD for the recovery SNP effect ar is much larger, meaning that little can be inferred regarding SNP-based differences in recoverability. This is because under DS3 the only indirect information regarding recovery times comes from the very early stages of each epidemic (e.g. we know that the first infected individual cannot recover before the second individual becomes infected). This explains why SDs for recovery SNP effects decrease at a rate of −½ (on the log-scale) as the number of contact groups Ngroup increases (i.e. the triangles in Fig 6c scale with the black line) but not when the number of individuals per contact group Gsize is changed (see Fig 5c).
DS4: Disease status periodically checked
Fig 9 shows results under DS4 assuming a time interval between checks of Δt. When Δt=0 (as shown on the left of this figure) the DS4 results are the same as in DS1 (because here infection and recovery times are effectively exactly known). On the other hand as checking becomes less and less frequent, the SDs in the SNP effects rise. A surprising feature is that this reduction in statistical power is perhaps less than might be expected. The vertical lines in Fig 9 represent two key timescales: 〈tI〉 is the average infection time as measured from the beginning of the epidemic and 〈tR〉 is the average recovery time (these quantities are found by averaging over a large number of simulated replicates). We see that statistical power only marginally reduces even when disease diagnostic checking is performed on a similar timescale as the epidemics as a whole. This means that results assuming DS1 which are either analytical (explored in a follow up paper [46]) or numerical, as looked at in sections 3.1–3.4, remain relevant in realistic data scenarios.
The limit on the right hand side of Fig 9 shows the situation in which there is no information regarding infection and recovery times (i.e. only the initial and final states of the epidemic are observed). Unfortunately it was found to be difficult to probe this regime using SIRE due to mixing problems arising in the MCMC algorithm [52] (principally because the number of possible parameter sets and event sequences consistent with a given final outcome is vast).
The results here emphasise the fact that even relatively infrequent disease status checks provide useful data from which accurate inferences regarding SNP effects can be drawn.
DS5: Time censored data
In Fig 10a it is assumed the infection and recovery times are exactly known but only up to some final time tend (subsequent to which no further data is available). We find that very little information is lost when restricting tend to around the average recovery time 〈tR〉. This is largely because most individuals recover before 〈tR〉 as a consequence of a small number of individuals having very low recoverability (which itself arises because of the large residual variance Σrr=1 assumed here). Given that 〈tR〉 is usually substantially less than the total epidemic time, from a practical point of view terminating disease transmission experiments prior to the end of the epidemic when no new infections occur, (and perhaps performing further replicates) may be beneficial. However, the effectiveness of this approach would depend on a large assumed variation in recoverability in the population, which a priori may be unknown.
Fig 10b shows the opposite scenario, in which contact groups are observed from an initial starting time tstart after the start of the epidemic up until its termination. This scenario may apply to field outbreaks, where sampling occurs only after notification of the outbreak. Here again we see a reduction in statistical power with increasing tstart, but this reduction is not substantial until around the average infection time. This result is surprising, but it turns out that whilst none of the events before tstart are actually measured (which may include a large proportion of the total number of infection events), the disease status of all the individuals at tstart can be accurately inferred (because the final state is known and all the subsequent events from tstart are also known, the state at tstart is exactly specified) and this encapsulates almost the same amount of information as when the event times are precisely known.
General data scenario
It should be noted that the data scenarios DS1-5 considered are not comprehensive. Any combination of infection time, recovery time, disease status data and diagnostic test results can be used as inputs into SIRE. Furthermore SIRE accounts for additional uncertainties in cases in which data is missing on some individuals and where diagnostic tests are imperfect.
4 Discussion
The availability of dense genome-wide SNP panels has revolutionized human medicine and has paved the way for genetic disease control in agriculture. With declining genotyping costs, discovery of new disease susceptibility loci has increased exponentially over recent years, and evidence for their effective utilization in personalized medicine and livestock and plant breeding programmes continues to emerge [53–56]. However, there is increasing awareness amongst researchers and policy makers that disease susceptibility is not the only host genetic trait controlling disease incidence and prevalence in populations, and in particular that host genetic infectivity and recoverability may also constitute important improvement targets for reducing disease spread [19–22, 57, 58]. Yet, genetic loci associated with host recoverability reported in the literature are sparse, and to the best of our knowledge no infectivity SNP has yet been identified. This, perhaps, is unsurprising given that phenotypic measurements of recoverability and infectivity, such as individuals’ recovery or pathogen shedding rates are rarely available in practice and statistical inference methods to accurately infer these from available epidemic data are still in their infancy. In line with the lack of suitable statistical methods, little is known about what type and number of measurements are needed to produce unbiased and precise estimates of SNP effects for these ‘new’ epidemiological host trait phenotypes.
In this paper we developed a Bayesian methodology to allow simultaneous estimation of SNP effects for host susceptibility, recoverability and infectivity from temporal epidemic data. This methodology was validated with data from simulated epidemics, which were also used to assess how different parameter values and data scenarios representing different recording schemes in field or experimental studies may affect the estimates of SNP effects and other parameters influencing transmission dynamics. The sophisticated Bayesian algorithm outlined in this paper has been implemented into a user-friendly software tool called SIRE, which allows computationally efficient analyses to be performed by anyone with relevant epidemiological data (as shown in Appendix G, outputs typically take a few minutes of CPU time per 1000 individuals).
Our results indicate that it is possible to obtain simultaneous unbiased estimates of SNP effects for all three epidemiological host traits, in addition to that of other fixed or random effects influencing disease transmission, from temporal epidemic data. Across simulated data scenarios we found that recoverability SNP effects are generally (with few exceptions) easiest to identify, followed by susceptibility and then infectivity SNP effects. In the latter case a large number of contact groups with few individuals provide much more information than the reverse. Simulations of different data scenarios representing optimal (perfect and complete data) and practically feasible recording schemes produced the following relevant insights: firstly, even when only recovery (or death) times of individuals are known inference of SNP effects is still possible, albeit requiring around four times as many individuals to gain equivalent precision as for perfect data. Secondly, only knowing infection times marginally reduces statistical power to detect SNP effects for susceptibility and infectivity, but recovery SNP effects become difficult to detect. Thirdly, when data consists of periodic measurements of individuals’ disease status it was found that even relatively infrequent measurements (e.g. on a similar timescale as the entire epidemic) yields SNP effects with high precision, given sufficient data. Lastly, precise estimates of SNP effects could still be obtained with censored epidemic data.
For model validation, we chose a complex inter-dependence structure for the model parameters by assuming that the SNP under consideration is associated with all three epidemiological host traits (i.e. pleiotropy), but with different allele substitution effects and different modes of dominance. Furthermore, we assumed that the traits are also influenced by other fixed effects, have large residual variance (introducing much noise into the system) and are correlated, and that environmental group effects influence the within-group transmission dynamics. This choice represents an extremely challenging system in which to estimate SNP effects and in practice most real world examples are likely to be considerably less challenging as simpler structures and reduced variation/better control of variation will improve the quality of the parameter estimates.
The results from different data scenarios indicate a log-log scaling relationship with slope −½ between the precision (as measured by the SD in the posterior) of SNP effect estimates, and group size or number of groups (this relationship in analytically confirmed in a follow up paper [46]). For the majority of the simulations presented here, a moderate total population size of 1000 or less individuals was assumed. The corresponding posterior standard deviations for estimated SNP effects were generally above 0.01, and in the case of infectivity effects, more often above 0.1. This would suggest that for datasets comprising of 1000 individuals or less, SIRE is only able to detect SNPs of large effects on the epidemiological host traits, but identification of SNPs of small to moderate effects on this trait requires significantly more data, in particular for infectivity.
We chose a dataset comprising of 1000 individuals partly because of computational efficiency but also because generating datasets of this size seems feasible for transmission experiments in plants and most domestic livestock species, in particular aquaculture species [20, 59, 60]. However, many existing field data, in particular in dairy cattle populations with routine genotyping and frequent recordings of disease phenotypes e.g. for mastitis, bovine Tuberculosis, and other infectious diseases [61–63] already exceed this number by several orders of magnitude. As genotyping costs continue to fall and automated recording systems are applied at rapidly increasing frequency in agriculture [64, 65], the possibility of identifying SNPs with small to moderate effects on the epidemiological host traits, and their mode of dominance, which was poorly estimated for the given sample size, would appear to be well within reach in the near future.
It is widely recognised that disease traits are for the most part polygenic, i.e. regulated by many genes each with small effect, and hence that SNPs with large effect on disease phenotypes are the exception rather than the norm [10, 62]. This is partly due to the fact that observed disease phenotypes, such as individuals’ binary infection status or infection time are the result of many interacting biological processes, each controlled by a different set of genes or genetic pathways and characteristics of the wider population. Hence the impact of an individual gene on the disease phenotype is diluted. In contrast, the relative impact of a particular gene on traits that are more closely related to specific biological processes, such as e.g. pathogen entry, replication or shedding affecting susceptibility, recoverability or infectivity, respectively, may be higher [66]. Therefore, it is not unreasonable to assume that SNPs with moderate to large effects on these epidemiological traits, and in particular on host infectivity, may indeed exist. Evolutionary theory suggests that alleles that confer low susceptibility to infection or fast recoverability from infection are subject to strong directional selection when individuals are commonly exposed to infection [67]. Hence, such beneficial alleles tend to become fixed within only a few generations, and consequently, SNPs with large effects on disease susceptibility or recoverability would be expected to occur primarily only in populations that have not experienced strong selection pressure for these traits. This is exemplified in the case of Infectious Pancreatic Necrosis (IPN) in farmed Atlantic salmon that have only undergone a few generations of selection, where a single SNP explains most of the variation in mortality of fish exposed to the IPN virus [60, 68]. In contrast, selection pressure on infectivity is expected to be relatively low, since an individual’s infectivity genes affect the disease phenotype of group members rather than its own disease phenotype [34, 48, 69]. Therefore, infectivity SNPs with large effect may indeed exist, and may now be identifiable with the methods presented here.
The approach developed in this study and integrated into SIRE complement and succeed previous studies that aimed to develop statistical methods for estimating genetic effects for the different host epidemiological traits [25, 30–32, 34]. The key novelty of our approach lies in its ability to estimate genetic and non-genetic effects associated with all three epidemiological host traits from a range of temporal epidemic data, even when that data is incomplete.
Applications
Many disease challenge experiments and field studies have identified SNPs with moderate to large effects on measurable disease resistance phenotypes [54, 55, 70]. However, the role of these SNPs on transmission dynamics is often poorly understood. For example, it is generally not known whether individuals that carry the beneficial allele for e.g. surviving infectious challenge are less likely to become infected (i.e. less susceptible), or more prone to surviving infection (e.g. due to better recoverability), and also less prone to transmitting infection, once infected (i.e. less infective). From an epidemiological perspective, SNPs with favourable pleiotropic effects on all three host epidemiological traits are highly desirable for preventing or mitigating disease spread [71]. In contrast alleles associated with better survival in existing GWAS would only bring the expected epidemiological benefits if they don’t simultaneously confer greater infectivity. In other words, knowing the SNP effects for all three underlying epidemiological host traits is desirable for effective employment of genetic disease control. Based on the results in this paper, SIRE can readily be applied to disentangle such SNP effects using data from transmission experiments or field studies. Furthermore, although this paper focused on estimating SNP effects, SIRE could also immediately be applied to estimating breed, age, sex, treatment or vaccination effects, or any other factor that may affect disease spread, even if genetic information is absent.
Limitations of the current approach and future work
One of the potential practical limitations for accurately estimating infectivity SNP effects is that they require a large number of epidemic groups. Previous work has shown that experimental designs can have a significant impact on the precision and accuracy with which model parameters can be estimated (as demonstrated to some extent in this paper and also investigated for indirect genetic effects in numerous other studies [49]). In particular, theoretical studies indicate that significant improvement in estimates of infectivity effects can be achieved by appropriately grouping genetically related individuals [32, 34]. Whilst this paper focused entirely on a fixed A allele frequency p across groups, a follow up paper [46] will show that appropriate variation in genotypes within and across contact groups can lead to substantial improvements in the precision of the infectivity SNP effect af, without the need for large numbers of epidemic contact groups (interestingly, the susceptibility and recoverability SNP effects cannot be substantially improved in this way).
A tool such as SIRE that can accurately estimate the effects of single SNPs on hitherto inaccessible epidemiological traits presents an important first step towards creating a statistically consistent scheme for performing GWAS on epidemiological traits using potentially incomplete data. GWAS, however, typically contains additional features beyond the scope of the simple single SNP analysis presented here. In particular, the current software focuses on one SNP at a time for estimating genetic effects for susceptibility, infectivity and recovery, but ignores the contributions of other genes on these traits. In the current model design these are incorporated into the residual effects. This simplifying assumption may have little impact for appropriately designed transmission experiments, but may lead to biased estimates of SNP effects if genetically similar individuals are not randomly distributed across groups. Theory also suggests that the required sample size for GWAS increases with the number of loci affecting the trait under consideration [72]. Hence, further model development is required for enabling GWAS for the three underlying epidemiological host traits. Previous work in our group developed a Bayesian algorithm for estimating polygenic effects for host susceptibility and infectivity from incomplete epidemic data [31]. Combining both approaches may prove a useful way forward to allow estimation of genetic effects under all realistic genetic architectures and population structures.
In summary, this paper introduces, for the first time, software that can estimate genetic and non-genetic effects for susceptibility, infectivity and recoverability simultaneously. This user-friendly tool can be applied to a range of experimental and field data and will help move genetic disease control significantly forward, beyond the focus on genetic improvement of resistance alone.
Funding
This research was funded by the Strategic Research programme of the Scottish Government’s Rural and Environment Science and Analytical Services Division (RESAS). ADW’s contribution was funded by the BBSRC Institute Strategic Programme Grants (BB/J004235/1 (ISP1), BBS/E/D/20002172 (ISP2.1) & BBS/E/D/30002275 (IPS3.1)).
APPENDIX A: Recovery dynamics
Fig A shows examples of gamma distributions for three different values of shape parameter k=1, 2 and 10. A common approach when modelling epidemics is to assume k=1, corresponding to the Markovian assumption of constant recovery probability. From a biological point of view, however, this is rather unrealistic. Typically after infection, the host’s immune system takes some time to respond, e.g. to generate the appropriate antibodies to fight off the infection. Thus, for many real diseases k may be large.
Consequently, incorporation of a gamma distributed infection duration (which is characterised by two parameters, a mean and a shape parameter, instead of just one for the exponential distribution) allows the model to more realistically capture true disease dynamics.
APPENDIX B: Derivation of the likelihood
The likelihood in Eq.(6) represents the probability the model in section 2.1 generates a certain set of events ξ, assuming a given set of parameters θ. This can be calculated by multiplying the probabilities for each of the individual sampling steps used in the simulation procedure, as described in Appendix F.
During initialisation of a given contact group, the infection duration for the initially infected individual j is sampled with probability
We now consider each event e in turn. In step 1 of Appendix F, the inter-event time Δt is sampled from the exponential distribution (where Λe(te) is the total infection rate Eq.(7) evaluated immediately prior to te, the time of event e). Two possibilities exist for event e:
It is a recovery event. Considering the algorithm in Appendix F, step 2(a) would have been branched to, and this only happens if the sampled value for Δt is greater than the observed inter-event time te−te−1. The probability of this is calculated from the following integral:
It is an infection event. Again, considering the algorithm in Appendix F, step 2(b) would have been branched to, and this happens with probability (which comes from the exponential distribution above). The individual j which becomes infected is selected with probability λj/Λe. Finally, the infection duration of that individual δtj is sampled from a gamma distribution with probability density function FΓ(δtj | wj, k). Combining these three contributions gives an overall probability:
Multiplying the results from Eqs.(B1), (B2) and (B3) for all the infection and recovery events leads to the likelihood for a single contact group of: where j goes over individuals that become infected (excluding those which initiate epidemics), m also goes over individuals that become infected but including those which initiate epidemics and e goes over both infection and recovery events (with corresponding event times te).
Since contact groups are assumed to be independent, the likelihood for multiple contact groups is simply the product of each separate one, as shown in Eq.(6).
APPENDIX C: Prior definitions for model parameters
The default prior in SIRE (which can be altered), which is used for all the results in this paper, is largely uninformative but does place upper and lower bounds on many of the key parameters to stop them straying into biologically unrealistic regimes. Bounding parameters in this way was found to be especially important when considering relatively uninformative data scenarios when unbounded flat priors could lead to improper posterior probability distributions. Specifically, a uniform prior between −2.3 and 2.3 was chosen for ag. This corresponds to assuming that it is biologically unrealistic for a single SNP to change the susceptibility of individuals by more than a factor of 100 between AA and BB individuals1. An identical uniform prior was also placed on af, ar and on each of the fixed effects in bg, bf and br. Similarly, a uniform prior between −3.45 and 3.45 was placed on each of the residuals εg, εf and εr, This larger range reflects a potential factor of 1000 variation across individuals (chosen to be larger as residual contributions account for all other SNPs as well as non-genetic factors, as opposed to just the effect of the single SNP under analysis).
The scaled dominance factors Δg, Δf, Δr, were chosen to have uniform priors between 1 and −1, i.e. going from complete dominance of A to complete dominance of B 2. The prior for the shape parameter k was chosen to be uniform between 1 and 10, where 1 represents a Poisson random process and 10 represents a situation in which recovery times of individuals are highly concentrated around their mean.
Because parameters β and γ depend on the timescale over which the epidemic is measured3, which is partly pathogen specific, placing informative priors on β and γ by default is not appropriate (although it can be done). Instead a uniform prior between 0 and 20 was placed on the equivalent basic reproductive ratio R0 4, which is a dimensionless quantity.
Appendix D: MCMC procedure
Markov chain Monte Carlo (MCMC) produces a list of (correlated) parameter θq and event time ξq samples drawn from the posterior probability distribution in Eq.(5), where q indexes sample number.
Initialisation – MCMC is initially started with some set of parameters θ1 and event times ξ1 consistent with the data y. In the case of parameters, θ1 is simply sampled from the prior distribution π(θ). For events, initialisation depends on the available data: DS1) events are defined by the data, DS2) infection times are exponential sampled backwards in time from the first observed recovery time5, DS3) recovery times are exponentially sampled forwards in time from the final observed infection time6, DS4) infection and recovery times are sampled uniformly in the time intervals identified between successive diagnostic tests, and DS5) events within the observation period are defined by the data, infection events prior to this time period are exponentially sampled backward in time and similarly recoveries events after this time period are exponentially sampled forward in time.
Iteration - MCMC operates by proposing changes to the model parameters θ and event times ξ and accepting or rejecting these changes in accordance with a Metropolis-Hastings probability. In this way the underlying dynamics are able to explore all potential possibilities consistent with the observations.
A single MCMC “update” consists of making the following sets of proposals:
MCMC UPDATE
Parameters – Each individual parameter in θ=(β, γ, k, ag, af, ar, Δg, Δf, Δr, bg, bf, br, εg, εf, εr,, Σ, G, σG, denoted by θj, is considered in turn. A proposed value is drawn from a normal distribution centred on the parameter’s current chain value with all other parameters in θp remaining the same as in θq (if this produces an inconsistent value, e.g. β becomes negative, the proposal is immediately rejected). The proposal is accepted with Metropolis-Hastings probability If accepted, we set θq+1= θp else θq+1= θq with ξq+1= ξq.
Tuning Jj in Eq.(C1) is important. If it is too large, very few proposals will be accepted and if too small, mixing will be slow. Motivated by adaptive MCMC [73, 74], a robust heuristic method for optimising Jj within the burn-in period is as follows. Initially, Jj is set to a small quantity. Each time a proposed change on parameter j is accepted Jj is updated according to and when rejected
These numerical factors are chosen for two reasons: Firstly, the updates in Eqs.(C3) and (C4) balance each other out when acceptance occurs around 33% of the time, leading to a steady state solution for Jj. Secondly, they are chosen to be sufficiently close to 1 to prevent large fluctuations in Jj, but sufficiently far to allow the steady state solution to be found within the burn-in period.
In addition to the single parameter proposals outlined above, joint proposals are also made on residuals along with their corresponding covariance matrix, as described in Appendix E. The reason these joint updates are necessary is that Σ and ε are often highly correlated within the model (as determined by Eq.(9)). In many cases the data provides little information regarding Σ itself, so these correlations can lead to extremely slow mixing.
Events – In situations in which ξ is not precisely known (i.e. data scenarios other than DS1), the unknown latent event times must be stochastically changed in accordance with the posterior distribution. Each individual j is considered in turn. In the case of proposing changes to the infection time of j, the following normal distribution is sampled from: with all other event times in ξp=(tI,tR) remaining the same as in ξq. If this proposed infection time exceeds the recovery time the proposal is immediately rejected (infections cannot occur after recoveries). The proposal in Eq.(C5) is accepted with probability
If accepted, we set ξq+1= ξp else ξq+1= ξq with θq+1= θq.
A similar proposal to Eq.(C5) is used to change recovery times. The jumping parameters JI,j and JR,j are again tuned to give an average acceptance probability of around 33% (using the same method as in Eqs.(C3) and (C4) above). Furthermore, there are data scenarios in which it is not possible to tell whether an individual has become infected or not. In these cases it is necessary to include additional proposals which insert infection/recovery event pairs and, conversely, other proposals which remove them.
Each of the proposals above have been optimised to calculate only those parts of the likelihood that change, e.g. if the infection time of individual j is altered then only the change in likelihood in the period between then initial and proposed times within j’s contact group needs to be calculated. Furthermore, when making some sets of proposals computational speed is substantially increased by pre-calculating terms in the likelihood which remain unchanged.
Appendix E: Posterior-based proposals for residuals
This appendix describes special joint proposals in the residuals (Σ,ε) which are used to aid MCMC mixing. Specifically three type of proposal are considered: 1) those which randomly make changes to Σgg, Σgf, or Σgr and stochastically alter εg (with everything else kept fixed), 2) those which randomly make changes to Σff, Σgf, or Σrr and stochastically alter εf (with everything else kept fixed), and 3) those which randomly makes changes to Σrr, Σgr, or Σfr and stochastically alter εr (with everything else kept fixed). Here we describe just one of these possibilities, but others can be found by suitably permuting indices.
So-called “posterior-based proposals” (PBPs) [1] consist of three steps:
Step 1 – A proposed is made to one of the parameters in the model. In this example with the value of tuned to give an average acceptance probability of around 33% (using the same method as in Eqs.(C3) and (C4) above).
Step 2 – Each element εg,j in the εg is considered in turn. The values of means μq,j and μp,j and standard deviations σq,j and σp,j are calculated such that the normal distributions they characterise approximate, to some level of accuracy, the true posterior distributions for εg,j in the initial and proposed states (details on how this is done are discussed below). If σp,j>σq,j then for each individual j the proposed residuals are sampled using where κ is a tuneable constant set to 0.03, and when σp≤σq
The somewhat complicated expressions in Eqs.(C8) and (C9) are specifically designed such that if the posterior is truly represented by the normal approximations (as characterised by μq,j, μp,j, σq,j and σp,j), the proposal is accepted with probability 1.
Step 3 – The proposed combination is accepted or rejected with Metropolis-Hastings probability where j goes over all individuals and is the Gaussian probability density function.
Optimisation
For efficient implementation of this approach it is necessary to generate μq,j, μp,j, σq,j and σp,j in a suitable manner. One approximation is to set them to values implied by prior in Eq.(9): where M is the inverse of the covariance matrix Σ−1. This approximation leads to so-called “model-based proposals” (MBP). A more accurate approximation (because it takes into account the observed events, which themselves are informed by the data) is given by and this gives a first order “posterior-based proposal” (PBP). Although Eq.(C13) is computationally slower to calculate than Eq.(C12) (because it contains gradients in the log-likelihood), it can lead to larger jumps in parameter space compared to a corresponding MBP resulting in improved mixing. Empirically it was found that applying PBPs to parameters relating to susceptibility and recoverability and MBPs to parameters relating to infectivity was the fastest approach to take.
Appendix F: Simulation
The Doob-Gillespie algorithm [43] provides a means of taking into account inherent stochasticity in Markovian compartmental models (i.e. models for which the transition rates depend solely on the current state of the system). The model used in this paper combines Markovian infection transitions with more realistic non-Markovian recovery dynamics. Below we describe how these recovery events are incorporated into the standard Doob-Gillespie framework.
The purpose of this procedure is to build up a time ordered sequence of infection and recovery event times indexed by event number e. The following notation is used: te is the event time, xe is the event type (infection in. or recovery re.), je is the affected individual, and and are the infection and recovery times for individual j, respectively.
Initialization: Each epidemic is assumed to be started by one (or potentially more) initially infected individual j at some initial time point tinit. The infection duration δtj for this individual is drawn from a gamma distribution parameterised in terms of a mean and shape parameter: (note, the dependency of wj on θ is given through Eqs. (2) and (3)). This allows us to set and . Individual j is then placed onto a list 𝓡, which represents all currently infected individuals. We set event index to e=1.
Step 1: Calculate the time to the next infection event. This is done by first evaluating the total transition rate that any individual becomes infected where the sum s goes over all currently susceptible individuals and the force of infection λs (which gives the probability per unit time of s becoming infected) is given by Eq.(1). In accordance with a Poisson process, the time to the next infection event is generated by drawing a sample from the exponential distribution Λe−ΛΔt. In practice, this is achieved by selecting an inter-event time using where u is a (uniform) randomly generated number between 0 and 1. The new event time is then defined by
Step 2: Choosing the event type. For the SIR model two possibilities exist:
If tnew is greater than the smallest recovery time of all the individuals in 𝓡, which we label jmin, then we remove jmin from 𝓡 and set
Otherwise, we set and select the actual individual that becomes infected with probability
The infection duration for je is sampled using Eq.(D1), and the infection and recovery times are set to
Individual je is then placed onto the list 𝓡.
Step 3: Increment e and jump to step 1 if there are any remaining infected individuals.
End: Insert recovery times for any remaining individuals j in 𝓡 incrementing e after each addition.
Appendix G: Computational speed estimate for SIRE
Fig G shows the CPU time SIRE takes to estimate the SNP effects as a function of the total number of individuals (this is based on a single 2GHz core). We find that the SNP effect associated with infectivity takes the longest to accurately estimate. The approximate linear scaling (represented by the solid black lines) means that SIRE is expected to take around one minute per 1000 individuals to generate 100 representative samples from the posterior under DS1 and around 10 minutes for DS2.
Appendix H: Parameter prediction accuracy under DS2
Fig H shows results when repeating the analysis is section 3.2 of the paper, but this time taking the scenario in which infection times are unknown (i.e. DS2). The corresponding regression analysis on these points is presented in Table H. The average SDs in Table H are smaller than those in Table 1, reflecting the reduction in statistical power under this data scenario. A factor of two difference in SD corresponds to four times as many individuals required for equivalent accuracy.
Footnotes
↵† Deceased
↵1 This factor comes from the exponential dependency in Eq.(1) coupled with the result e2×2.3 =100.
↵2 If over dominance is considered a possibility, this prior distribution would be extended.
↵3 E.g. if measurements are made in hours then β and γ would be very different to if they were made in days.
↵4 Defined by R0=β(〈N〉−1)/γ, where 〈N〉 is the average population size of contact groups as taken from the data, this represents the number of infections one typically infectious individual generates on average over the course of its infectious period in an otherwise uninfected population.
↵5 The standard deviation in recovery times is used to set the decay rate.
↵6 The standard deviation in infection times is used to set the decay rate.