Estimating individuals’ genetic and non-genetic effects underlying infectious disease transmission from temporal epidemic data

Christopher M. Pooley; Glenn Marion; Stephen C. Bishop; Richard I. Bailey; Andrea B. Doeschl-Wilson

doi:10.1101/618363

Abstract

Individuals differ widely in their contribution to the spread of infection within and across populations. Three key epidemiological host traits affect infectious disease spread: susceptibility (propensity to acquire infection), infectivity (propensity to transmit infection to others) and recoverability (propensity to recover quickly). Interventions aiming to reduce disease spread may target improvement in any one of these traits, but the necessary statistical methods for obtaining risk estimates are lacking. In this paper we introduce a novel software tool called SIRE (standing for “Susceptibility, Infectivity and Recoverability Estimation”), which allows simultaneous estimation of the genetic effect of a single nucleotide polymorphism (SNP), as well as non-genetic influences on these three unobservable host traits. SIRE implements a flexible Bayesian algorithm which accommodates a wide range of disease surveillance data comprising any combination of recorded individual infection and/or recovery times, or disease status measurements. Different genetic and non-genetic regulations and data scenarios (representing realistic recording schemes) were simulated to validate SIRE and to assess their impact on the precision, accuracy and bias of parameter estimates. This analysis revealed that with few exceptions, SIRE provides unbiased, accurate parameter estimates associated with all three host traits. For most scenarios, SNP effects associated with recoverability can be estimated with highest precision, followed by susceptibility. For infectivity, many epidemics with few individuals give substantially more statistical power to identify SNP effects than the reverse. Importantly, precise estimates of SNP and other effects could be obtained even in the case of incomplete, censored and relatively infrequent measurements of individuals’ infection or survival status, albeit requiring more individuals to yield equivalent precision. 
SIRE represents a new tool for analysing a wide range of experimental and field disease data with the aim of discovering and validating SNPs and other factors controlling infectious disease transmission.

1 Introduction

In the era of rapid expansion in the human population resulting in increasing demands on food security, effective solutions that reduce the spread of infectious diseases not only in humans, but also in plants and livestock, are urgently needed. Failure of stringent biosecurity measures [2, 3] and emergence of anti-microbial resistance [4, 5] and escape mutants to viral vaccines [6, 7] demonstrate that infectious diseases cannot be combatted by conventional biosecurity and pharmaceutical interventions alone.

The advent of genome wide high density single-nucleotide polymorphism (SNP) chip panels has already led to a remarkable range of discoveries regarding the genetic regulation and biology of diseases and translation towards innovative therapeutics [8]. In agriculture, SNP chip panels have revolutionized breeding practices by facilitating genomic selection [9, 10]. In the infectious disease context genomic selection may effectively prevent or reduce disease spread by providing a means to identify and select against individuals with high genetic risk of becoming infected or transmitting infections purely based on their genetic make-up, without the need of exposing them to infectious pathogens [11]. However, to date the full host genetic basis underlying infectious disease transmission is still poorly understood.

Epidemiological models are widely used to identify risk factors for disease spread in populations. Indeed, modelling disease transmission in genetically heterogeneous populations is well established (see e.g. [12, 13]). Particularly relevant are so-called compartmental models in which individuals are classified as, for example, susceptible to infection, infected and infectious, or recovered (or alternatively dead). Transitions between these states are determined by three key individual traits: susceptibility (the relative risk of an uninfected individual becoming infected when exposed to a typical infectious individual or infectious material excreted from such an individual), infectivity (the propensity of an individual, once infected, to transmit infection to a typical (average) susceptible individual), and recoverability (the propensity of an individual, once infected, to recover or die) [14, 15]. As demonstrated by numerous simulation studies, host genetic variation in any one of these traits, if correctly identified, could be exploited to reduce infectious disease spread within and across populations [15–18]. However, despite their strong epidemiological importance, the genetic regulation and co-regulation of these three host traits is largely unexplored. Whereas a plethora of studies have identified substantial heritable variation and SNPs associated with host susceptibility [18], remarkably little is known about the genetic regulation of host recoverability and infectivity, despite emerging evidence that genetic variation in these traits exists [19, 20]. In particular, it is currently not known to what extent infectivity is genetically controlled, despite compelling evidence that super-spreaders, defined as a small proportion of individuals responsible for a disproportionally large number of transmissions, are a common phenomenon in epidemics [21–23]. This shortcoming is largely because appropriate statistical methods for estimating genetic and also non-genetic (treatment) effects for all three key epidemiological traits controlling disease transmission from infectious disease data are currently lacking.

In many conventional genome-wide association studies (GWAS) [24], target traits for genetic improvement are measured directly, so establishing genetic associations is relatively straightforward. In the epidemiological setting, however, the susceptibility, infectivity and recoverability of individuals are not measured directly. Rather their effects are manifested in the infection and recovery times of individuals in the epidemic (or epidemics) as a whole. Furthermore, most conventional GWAS assume that an individual’s infection status is controlled by its own genetic susceptibility and environmental effects. From an epidemiological viewpoint however, an individual’s disease phenotype (e.g. infected or not) may not only depend on its own susceptibility and recoverability genes, but also on the infectiousness of other individuals in the same contact group, i.e. their infectivity and recoverability genes [25]. This complex interdependence between underlying and observable traits poses challenges for existing methods.

The motivation behind this paper is to introduce new statistical and computational methods that utilise information derived from observation of epidemics and trait interdependence to estimate, for the first time, genetic and other systematic effects for all three underlying epidemiological host traits. This requires combining statistical, epidemiological and genetic modelling principles. Analysis of incomplete epidemic data to draw inferences on epidemiological parameters is well established [26, 27]. However, analysing such data to draw joint inferences on both the disease epidemiology and host genetic variation has proven challenging [25, 28]. Recent studies have expanded conventional quantitative genetics threshold models to enable joint genetic evaluation of cattle susceptibility to, and recoverability from, mastitis [29, 30], which led to identification of novel SNPs and candidate genes associated with these traits [19]. However, because infectivity acts on group members rather than the focal individual itself, applying these techniques to estimate genetic effects for infectivity is problematic.

Alternative approaches have focused on disentangling susceptibility from infectivity effects. For example, Anacleto et al. [31] developed a Bayesian inference approach to produce genetic risk estimates for host susceptibility and infectivity from epidemic time to infection data, assuming that susceptibility and infectivity are under polygenic control (i.e. they are determined by a large number of genes, each with small effect). This approach, however, does not incorporate genetic variation in recoverability, and does not estimate SNP effects. An alternative approach, based on the assumption that susceptibility and infectivity are controlled by two single bi-allelic genetic loci [32, 33], used a generalized linear model (GLM) to estimate relative allelic effects on host susceptibility and infectivity. Whilst an important contribution, this approach focused on the disease status of individuals at the end of each epidemic (i.e. discarding potentially useful information from the infection and recovery times themselves). It also failed to incorporate variation in recoverability, and relied on a number of simplifying assumptions which were found to produce biased estimates under certain circumstances. A variant of this approach [34], which adopted a GLM to analyse time-series data on individual disease status, illustrated the benefits of longitudinal records of individuals’ infection status for improving prediction accuracies of SNP effects, although it still relied on a number of simplifications that may compromise prediction accuracies and lead to unwanted bias. A further shortcoming of previous approaches [32–34] is that they ignore potential pleiotropic effects, i.e. SNPs affecting more than one epidemic trait. This seems unrealistic, since, for example, SNPs that control within-host pathogen replication may also lower the risk that infection can establish, i.e. reduce susceptibility, and simultaneously reduce pathogen shedding and hence infectivity, and speed up recovery.

In this study we present a novel software tool called SIRE (standing for “susceptibility, infectivity and recoverability estimation”) that implements a Bayesian inference approach to simultaneously estimate the effects of a single SNP (importantly capturing any pleiotropy), together with that of other fixed effects (such as e.g. sex, breed or vaccination status) on host susceptibility, infectivity and recoverability from temporal epidemic data. This approach can be applied to a wide range of epidemic data, collected at the level of individuals, and accounts for different types of uncertainty in a statistically consistent way (e.g. censoring of data or imperfect diagnostic tests), and permits the incorporation of prior knowledge. We validate SIRE for a variety of simulated epidemic scenarios, comprising not only the ideal case in which infection and recovery / death times of each individual are known exactly, but also under more realistic scenarios in which epidemics are only partially observed.

2 Materials and methods

2.1 Data structure and the underlying genetic-epidemiological model

SIRE applies to individual-level disease data originating from one or more contact groups in which infectious disease is transmitted from infectious to susceptible individuals through contact. This data can come from well controlled disease transmission experiments or from much less well controlled field data (which may be less complete, but readily available in larger quantity).

In the context of disease transmission experiments in plants or livestock, epidemics are initiated by means of artificially infecting a proportion of “seeder” individuals which go on to transmit their infection to susceptible individuals sharing the same contact group. In field data contact groups may consist of animal herds, or any group of individuals sharing the same environment such as a pasture, pen, cage or pond, and infection is assumed to invade the group by some external, usually unknown, means (e.g. by the unintentional spread of infected material, or the introduction of an infected individual from elsewhere). For simplicity it is assumed that throughout the observation period groups are closed, i.e. no births, migrations, or transmission of disease between groups. This assumption generally holds for experimental studies and also for most common field situations, where a movement ban is imposed after disease notification [35].

The dynamic spread of disease within a contact group is modelled using a so-called SIR model [36]. Individuals are classified as being either susceptible to infection (S), infected and infectious (I), or recovered/removed/dead (R). Under the simple SIR model for homogeneous populations, the time-dependent force of infection for a susceptible individual j (i.e. the probability per unit time of becoming infected) is given by λ_j(t) = βI(t), which is the product of an average transmission rate β and I(t), the number of infected individuals at time t. To incorporate individual-based variation in host susceptibility and infectivity, this simple expression for λ_j(t) is replaced by an individual force of infection (see [31] for a formal derivation):

λ_j(t) = β exp(G_z + g_j) Σ_i exp(f_i).    (1)

Here g_j characterises the fractional deviation in individual j’s susceptibility as compared to that of the population as a whole (e.g. g_j=0.1 corresponds to individual j being ≃10% more susceptible than the population average), f_i characterises the corresponding quantity for individual i’s infectivity, and the sum in Eq.(1) goes over all individuals infected at time t sharing the same contact group z as individual j (note, this sum varies as a function of t as individuals become infected and recover). The term G_z in Eq.(1) accounts for the fractional deviation in disease transmission for group z. This incorporates group-specific factors that influence the overall speed of an epidemic in one contact group relative to another (e.g. animals kept in different management conditions, environmental differences, or variation in pathogen strains with differing virulence). Whilst variation in G_z may be small for a well-controlled challenge experiment, this may not be the case in real field data. G_z is assumed to be a random effect with standard deviation σ_G. The exponential dependencies in Eq.(1) ensure that λ_j is strictly positive and allow for the possibility that some groups or individuals are much more/less susceptible/infectious than others, i.e. it can accommodate potential super-spreaders.
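The individual force of infection can be illustrated with a short Python sketch. It assumes Eq.(1) takes the multiplicative exponential form λ_j = β exp(G_z + g_j) Σ_i exp(f_i), which is what the description of exponential dependencies suggests; the function name and the example values are illustrative only and are not part of SIRE.

```python
import math

def force_of_infection(beta, G_z, g_j, f_infected):
    """Assumed form of Eq.(1): beta * exp(G_z + g_j) * sum_i exp(f_i).

    f_infected holds the infectivity deviations f_i of the individuals that
    are currently infectious in the same contact group as individual j.
    """
    return beta * math.exp(G_z + g_j) * sum(math.exp(f_i) for f_i in f_infected)

# With all deviations set to zero the expression reduces to beta * I(t).
lam_homog = force_of_infection(beta=0.006, G_z=0.0, g_j=0.0,
                               f_infected=[0.0, 0.0, 0.0])

# A single highly infectious group-mate (f_i = 2) dominates the sum.
lam_super = force_of_infection(beta=0.006, G_z=0.0, g_j=0.1,
                               f_infected=[2.0, 0.0, 0.0])
```

With three infectious group-mates and no individual variation the sketch returns β·3, matching the homogeneous expression λ_j(t) = βI(t).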

Whilst in Eq. (1) infection is modelled as a Poisson process with individual infection rates λ_j [18, 20], the recovery process is modelled by assuming that the time taken for individual m to recover after being infected is drawn from a gamma distribution with an individual-based mean w_m and shape parameter k (which for simplicity is assumed to be the same across individuals). This mean recovery time is expressed as

w_m = exp(−r_m) / γ,    (2)

where γ represents an average recovery rate across the population and r_m describes the fractional deviation from this for individual m. This approach is taken to allow the recovery probability distribution to adopt a more biologically realistic profile compared with the exponential distribution often assumed (see electronic supplementary material Appendix A for further details).
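A minimal sketch of the recovery model follows. It assumes the mean recovery time has the form w_m = exp(−r_m)/γ, so that a positive r_m (higher recoverability) shortens the infectious period; the function names are hypothetical. A gamma distribution with shape k and mean w_m has scale w_m/k, which is how the standard-library sampler is parameterised below.

```python
import math
import random

def mean_recovery_time(gamma, r_m):
    """Assumed form of Eq.(2): w_m = exp(-r_m) / gamma."""
    return math.exp(-r_m) / gamma

def sample_recovery_duration(gamma, r_m, k, rng):
    """Draw an infectious period from a gamma distribution with mean w_m, shape k."""
    w_m = mean_recovery_time(gamma, r_m)
    # random.gammavariate(alpha, beta) uses shape alpha and scale beta,
    # so the mean alpha * beta equals w_m when beta = w_m / k.
    return rng.gammavariate(k, w_m / k)

rng = random.Random(1)
durations = [sample_recovery_duration(gamma=0.1, r_m=0.0, k=5, rng=rng)
             for _ in range(20000)]
sample_mean = sum(durations) / len(durations)  # close to 1 / gamma = 10
```

Setting k=1 in this sketch recovers the exponential distribution that the gamma model generalises.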

Following standard quantitative genetics theory [37], the individual-based deviations in susceptibility g, infectivity f and recoverability r (which are vectors with elements relating to each individual) are decomposed into the following contributions

SNP effects

The model assumes that a specific locus defined by a SNP (potentially) plays an important contribution to the trait values (note, repeated analysis can be performed on different SNPs of interest). Assuming a diploid genomic architecture with biallelic SNP implies three SNP genotypes: AA, AB and BB. The SNP contribution to the traits for individual j depends on j’s genotype in the following way:

The parameters a_g, a_f and a_r capture the relative differences in trait values between AA and BB individuals, and are subsequently referred to as the “SNP effects” for susceptibility, infectivity and recoverability, respectively (e.g. if a_g is positive, individuals with an AA genotype will, on average, be more susceptible to disease than those with a BB genotype). The scaled dominance factors Δ_g, Δ_f and Δ_r characterise the trait deviations between the heterozygote AB individuals and the homozygote mean (a value of 1 corresponds to complete dominance of the A allele over the B allele and −1 when the reverse is true, whereas absence of dominance is represented by a value of 0) [38].
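The genotype coding can be sketched as follows. This assumes the conventional quantitative-genetics parameterisation in which AA, AB and BB individuals receive contributions a, aΔ and −a respectively; that coding is an assumption made here for illustration (it is consistent with Δ=1 making AB equal to AA and Δ=−1 making AB equal to BB), and the function name is hypothetical.

```python
def snp_contribution(genotype, a, delta):
    """SNP contribution to a trait under an assumed +a / a*delta / -a coding.

    a     : SNP effect for the trait (a_g, a_f or a_r)
    delta : scaled dominance factor for the trait (1 = A fully dominant,
            -1 = B fully dominant, 0 = no dominance)
    """
    coding = {"AA": 1.0, "AB": delta, "BB": -1.0}
    return a * coding[genotype]

# Susceptibility values used later in the simulations: a_g = 0.4, delta_g = 0.4.
contributions = {gt: snp_contribution(gt, a=0.4, delta=0.4)
                 for gt in ("AA", "AB", "BB")}
```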

Fixed effects

The design matrix X and fixed effect vectors b_g, b_f and b_r in Eq.(3) allow for other known sources of variation to be accounted for (e.g. breed, sex or vaccination status). Following convention, an additional fixed effect is added to account for trait mean, which is explicitly chosen to ensure the population averages of g, f and r are zero (remembering that the average effects are already captured by the parameters β and γ).

Residual contributions

Here ε=(ε_g, ε_f, ε_r) accounts for all other contributions to the traits (i.e. coming from genetic effects not captured by the SNP under consideration, as well as any non-genetic environmental variation). We assume that for each individual the three trait residuals are drawn from a single multivariate normal distribution with zero mean and 3×3 covariance matrix Σ.

Including these correlations is important because it allows for the possibility that, for example, more susceptible individuals may also, on average, be more infectious and recover at a slower rate (on top of any correlations which may also arise from the SNP and fixed effects). Note that in this study, which focuses on the estimation of SNP effects, there is no explicit distinction between random genetic and environmental effects, although the model could be extended to incorporate estimation of these polygenic effects. It is thus assumed that individuals are randomly distributed across the groups with respect to the genetic effects on the epidemiological traits not captured by the SNP. Also note that Eq.(3) does not contain random group effects for the individual epidemiological traits. This is because the group effect has already been incorporated in the expression of the individual force of infection in Eq.(1). In other words, it is assumed that the group environment is the dominant mechanism affecting the speed at which infection spreads within a group rather than group specific factors affecting individuals’ susceptibility, infectivity or recoverability.
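To illustrate the residual model, the sketch below draws correlated residuals (ε_g, ε_f, ε_r) from a zero-mean multivariate normal distribution using a Cholesky factor of Σ, and adds them to SNP and fixed-effect terms. The additive combination mirrors the three contributions described above; the helper names are hypothetical, the covariance values are those used later in the simulations, and the centring that makes population averages zero is omitted for brevity.

```python
import math
import random

def cholesky(S):
    """Lower-triangular L with L L^T = S, for a small positive-definite matrix."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(S[i][i] - s)
            else:
                L[i][j] = (S[i][j] - s) / L[j][j]
    return L

def sample_residuals(L, rng):
    """One draw of (eps_g, eps_f, eps_r) ~ N(0, Sigma), where L L^T = Sigma."""
    z = [rng.gauss(0.0, 1.0) for _ in range(len(L))]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(L))]

def trait_value(snp_term, fixed_term, residual):
    """Additive combination of SNP, fixed-effect and residual contributions."""
    return snp_term + fixed_term + residual

Sigma = [[1.0, 0.3, -0.4],
         [0.3, 1.0, -0.2],
         [-0.4, -0.2, 1.0]]
L = cholesky(Sigma)
rng = random.Random(42)
draws = [sample_residuals(L, rng) for _ in range(20000)]
# Empirical second moments (the residual means are zero by construction).
cov_gf = sum(d[0] * d[1] for d in draws) / len(draws)
cov_gr = sum(d[0] * d[2] for d in draws) / len(draws)
```

With a negative Σ_gr, individuals drawn as more susceptible than average tend also to have lower recoverability, which is the correlation structure described above.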

2.2 Bayesian inference

Based on the description above, the model contains the following set of parameters: θ=(β, γ, k, a_g, a_f, a_r, Δ_g, Δ_f, Δ_r, b_g, b_f, b_r, ε_g, ε_f, ε_r, Σ, G, σ_G). We denote the complete set of infection and recovery event times for all individuals as ξ over the observed duration of the epidemics [39]. Typically ξ is not precisely known, and so we consider the general case in which ξ represents a set of latent model variables. The nature of the actual observed data y will be problem dependent. For example, in some instances recovery or removal (e.g. due to death) times will be precisely known but infection times completely unknown. In other instances infection and recovery times will both be unknown, but results from disease diagnostic tests provide information regarding disease status at particular points in time. The framework presented in this paper is flexible to these various possibilities.

Application of Bayes’ theorem implies that the posterior probability distribution for model parameters and latent variables is given by

π(θ, ξ | y) ∝ π(y|ξ) L(ξ|θ) π(θ),    (5)

where individual components are defined as follows:

Observation model π(y|ξ) – the probability of the data given a set of event times ξ. The expression for the observation model depends on the nature of the data being observed. In many contexts this simply takes the values one or zero depending on whether ξ is consistent with y or not. For example a perfect disease diagnostic test showing that an individual is infected would be only consistent with ξ containing an infection event on that individual prior to the time of the test and a recovery event after the time of the test. Similarly, if data y indicates that an individual becomes infected at a particular point in time, this is only consistent provided ξ also contains this infection event. When imperfect disease diagnostic test results are available the observation model includes the sensitivity and specificity of the test to account for this uncertainty in the data. In summary, the observation model depends on the data collection process and constrains the possible event sequences ξ, and this, in turn, informs the model parameters θ.
Latent process likelihood L(ξ|θ) – the probability of ξ being sampled from the model given parameters θ. This can be derived from the genetic-epidemiological model described in the previous section [26, 27] (see Appendix B for details), and is given by

L(ξ|θ) = ∏_z [ ∏_{j∈z} λ_j × ∏_{m∈z} F_Γ(δt_m | w_m, k) × ∏_{e∈E_z} exp(−Λ_z(t_e) (t_e − t_{e−1})) ].    (6)

The functional dependence of L(ξ|θ) on the parameters θ is expressed in terms of the forces of infection λ_j in Eq.(1) and mean recovery times w_m in Eq.(2), which themselves depend on g, f and r in Eq.(3). The product z goes over all contact groups and within each contact group: j goes over individuals that become infected excluding those which initiate epidemics [40], m goes over individuals that become infected including those which initiate epidemics and e goes over both infection and recovery events (with corresponding event times t_e, where t_{e−1} denotes the time of the preceding event in the same group). Here the notation j∈z indicates that j goes over all those individuals j in contact group z, and e∈E_z indicates that e goes over all events E_z. The force of infection λ_j is calculated immediately prior to individual j becoming infected. The gamma distributed probability density function F_Γ for recovery events gives the probability an individual is infected for duration δt_m given a mean duration w_m and shape parameter k. The time dependent total rate of infection events Λ_z in contact group z immediately prior to event time t_e is given by

Λ_z(t_e) = Σ_s λ_s(t_e),    (7)

where the sum in s goes over all susceptible individuals in group z at that time.
An important point to mention is that Eq.(6) is calculated on an unbounded time line. In situations in which data is censored, the observation model restricts events that occur within the observed time window, but other events can exist outside of this observed region [41].
Prior π(θ) – the state of knowledge prior to data y being considered. To account for the prior assumption that residuals ε in Eq.(3) are multivariate normally distributed and that the vector of group effects G in Eq.(1) are random effects, π(θ) can be decomposed into

π(θ) = π(θ_−ε,G) π(ε|Σ) π(G|σ_G),    (8)

where θ_−ε,G includes all parameters with the exception of ε and G and

π(ε|Σ) = ∏_j N(ε_j | 0, Σ),    π(G|σ_G) = ∏_z N(G_z | 0, σ_G²).    (9)

Here j goes over each individual and ε_j =(ε_g,j, ε_f,j, ε_r,j)^T is a three dimensional vector giving the residual contributions to the susceptibility, infectivity and recoverability of j. Σ is a 3×3 covariance matrix (which describes not only the overall magnitude of the residual contributions, but also any potential correlations between traits). Finally, the product z in Eq.(9) goes over all contact groups and G_z represents the group-based fractional deviation in transmission rate, which is assumed to be independent between groups and normally distributed with standard deviation σ_G.

The default prior for θ_−ε,G (which can be modified if necessary) is largely uninformative but does place upper and lower bounds on many of the key parameters to stop them straying into biologically unrealistic values (details are given in Appendix C).
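The observation model π(y|ξ) for a single diagnostic test can be sketched as below. With a perfect test the function returns one or zero according to whether the result is consistent with the individual's event times; with an imperfect test it returns the sensitivity, specificity or their complements. The function name and argument layout are hypothetical and are not taken from SIRE.

```python
import math

def diagnostic_result_probability(result_positive, t_test,
                                  t_infection=math.inf, t_recovery=math.inf,
                                  sensitivity=1.0, specificity=1.0):
    """Probability of one test result given an individual's event times.

    An individual counts as infected at t_test if it became infected at or
    before the test and had not yet recovered. Times default to infinity,
    meaning the event never happens.
    """
    infected = t_infection <= t_test < t_recovery
    p_positive = sensitivity if infected else 1.0 - specificity
    return p_positive if result_positive else 1.0 - p_positive

# Perfect test: a positive result at t=5 is consistent with infection at t=3
# and recovery at t=8, but not with infection at t=6.
consistent = diagnostic_result_probability(True, 5.0, t_infection=3.0, t_recovery=8.0)
inconsistent = diagnostic_result_probability(True, 5.0, t_infection=6.0, t_recovery=8.0)

# Imperfect test: a false negative occurs with probability 1 - sensitivity.
false_negative = diagnostic_result_probability(False, 5.0, t_infection=3.0,
                                               t_recovery=8.0, sensitivity=0.9)
```

Multiplying such terms over all tests and individuals would give the contribution of diagnostic test data to π(y|ξ), assuming test results are independent given ξ.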

Samples for θ and ξ from the posterior are generated by means of an adaptive Markov Chain Monte Carlo (MCMC) scheme which implements optimised random walk Metropolis-Hastings updates for most parameters and posterior-based proposals [1] to aid fast mixing of the residual parameters (details are given in Appendix D).
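For readers unfamiliar with random walk Metropolis-Hastings updates, a generic single-parameter sketch is given below. It is not SIRE's sampler: it uses a fixed step size rather than an adaptive one, omits posterior-based proposals, and targets a toy log-posterior (a normal density centred on 0.4) chosen purely for illustration.

```python
import math
import random

def random_walk_metropolis(log_post, theta0, step, n_iter, rng):
    """Sample from exp(log_post) using symmetric Gaussian random-walk proposals."""
    theta = theta0
    lp = log_post(theta)
    samples = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        lp_prop = log_post(proposal)
        # Symmetric proposal: accept with probability min(1, exp(lp_prop - lp)).
        # 1 - random() lies in (0, 1], which avoids taking log(0).
        if math.log(1.0 - rng.random()) < lp_prop - lp:
            theta, lp = proposal, lp_prop
        samples.append(theta)
    return samples

def toy_log_posterior(a):
    """Toy target: normal density with mean 0.4 and standard deviation 0.1."""
    return -0.5 * ((a - 0.4) / 0.1) ** 2

rng = random.Random(0)
samples = random_walk_metropolis(toy_log_posterior, theta0=0.0, step=0.2,
                                 n_iter=20000, rng=rng)
kept = samples[2000:]                      # discard burn-in
posterior_mean = sum(kept) / len(kept)     # should be close to 0.4
```

In an adaptive scheme the step size would be tuned during burn-in, for example to reach a target acceptance rate.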

2.3 SIRE

SIRE is a desktop application that implements the Bayesian algorithm outlined above. It is freely available to download from the supplementary material or at www.mkodb.roslin.ed.ac.uk/EAT/SIRE.html (with versions for Windows, Linux and Mac). An easy to use point and click interface allows for data tables to be imported in a variety of formats and graphical outputs are dynamically displayed as inference is performed. The core of SIRE utilises efficient C++ code and allows for running MCMC chains on multiple CPU cores.

SIRE takes as input any combination of information about infection times, recovery times, disease status measurements, disease diagnostic test results, genotypes of SNPs or any other fixed effects (see screenshot in Fig 1a), details of which individuals belong to which contact groups and any prior specifications (Fig 1b). The output from SIRE consists of posterior trace plots for model parameters θ, distributions (Fig 1c), visualisation of infection and recovery times ξ, dynamic population estimates and summary statistics (means and 95% credible intervals) as well as MCMC diagnostic statistics (Fig 1d). Posterior distribution graphs can be exported from SIRE, as can files containing posterior samples of θ and ξ for further analysis using other tools. The user guide for SIRE is available in the electronic supplementary material and on the website.

Fig 1. SIRE software.

Illustrative screenshots of the software package: (a) Different data sources can be imported by loading user defined data tables (text or csv files), (b) prior specification can be made on parameters, (c) posterior distributions can be visualised as inference is being performed, and (d) summary statistics and MCMC diagnostics.

2.4 Data scenarios

SIRE is flexible to many possible inputs. Reflecting real-world datasets this paper considers five potential data scenarios (DS):

DS1: Infection and recovery times for all individuals exactly known

This represents the best case scenario for inferring parameter values. For example, appearance of symptoms or visual or behavioural signs may indicate the onset of infection, and recovery/removal times are given by the time of death.

DS2: Only recovery times known

Often “recovery” in compartmental SIR models represents the death and removal of individuals. Consequently DS2 is pertinent to cases in which the only measurable quantity is the time at which individuals die. For example, disease challenge experiments in aquaculture routinely record time of death rather than infection times, which are usually difficult to measure [42].

DS3: Only infection times known

Whilst less common than DS2, in some instances data provides information regarding when individuals become infected but not when they recover. For example in human epidemics, patients may go to the doctor when they become ill, but no records will be kept on when they recover.

DS4: Disease status periodically checked

DS4 represents the most common scenario for monitoring infectious disease spread in livestock or plant populations, where each individual is periodically checked to establish its disease status. Under DS4 the point at which epidemics start is usually unknown, as well as the infection and recovery times of individuals themselves. However the diagnostic test results place constraints on these quantities. For example, if an individual is found to be uninfected at one sampling time and infected at the next sampling time this means that infection must have occurred at some point in the intervening period (note here we assume perfect diagnostic tests but SIRE also allows for imperfect diagnostic test results to be used, provided the sensitivity and specificity of the tests are known).
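The way periodic checks constrain event times under DS4 can be sketched as follows, assuming perfect diagnostic tests and status codes 'S', 'I' and 'R'. The function names and data layout are hypothetical; the sketch only derives the intervals within which infection and recovery must lie.

```python
import math

def infection_time_bounds(checks):
    """Bounds (lower, upper) on infection time from (time, status) checks.

    Infection must occur after the last check at which the individual was
    still susceptible and no later than the first check at which it was
    infected or recovered. An upper bound of infinity means infection was
    never observed.
    """
    lower, upper = -math.inf, math.inf
    for t, status in checks:
        if status == "S":
            lower = max(lower, t)
        else:
            upper = min(upper, t)
    return lower, upper

def recovery_time_bounds(checks):
    """Bounds (lower, upper) on recovery time from the same checks."""
    lower, upper = -math.inf, math.inf
    for t, status in checks:
        if status == "R":
            upper = min(upper, t)
        else:
            lower = max(lower, t)
    return lower, upper

# Weekly checks: susceptible at days 0 and 7, infected at 14, recovered at 21.
checks = [(0.0, "S"), (7.0, "S"), (14.0, "I"), (21.0, "R")]
infection_bounds = infection_time_bounds(checks)   # (7.0, 14.0)
recovery_bounds = recovery_time_bounds(checks)     # (14.0, 21.0)
```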

DS5: Time censored data

This data scenario relates to situations in which epidemics are not observed over their entire time period. For example a disease transmission experiment being carried out may be terminated early, due to cost or other factors (e.g. animal welfare), even though epidemics have not completely died out.

3 Assessment of performance and data requirements

In this section we apply SIRE to simulated datasets in order to 1) test the extent to which the inferred posterior parameter distributions agree with their true values, and 2) investigate how the precision, accuracy and bias of inferred model parameters depend on the type of data available.

Initially the focus of results will be on DS1 (which although rarely applies in practice, still provides useful insights for software validation and application) and later in section 3.5 consideration is given to DS2-5.

3.1 Illustrative example simulation and inference

We first demonstrate the performance of SIRE assuming complete information of individuals’ infection and recovery times, for a representative but complex set of parameters with regards to the genetic and non-genetic regulation of the three epidemiological host traits. Subsequently we investigate how these results change under different parameter and data scenarios.

Simulations

Individuals were randomly assigned into N_group different contact groups, with each group containing G_size individuals. The SNP under investigation was assumed to be in Hardy-Weinberg equilibrium [38] with an A allele frequency of p=0.3. For the effect sizes we used the values a_g=0.4, a_f=0.3, a_r=−0.4, representing a relatively large pleiotropic effect (which confers higher susceptibility for AA compared to BB individuals, as well as slightly higher infectivity and reduced recoverability). The choice of Δ_g=0.4, Δ_f=0.1, Δ_r=−0.3 for the scaled dominance factors represents partial, but not strong, dominance of either the A or B allele. For simplicity we included only a single fixed effect, e.g. sex, of arbitrary moderate size b_g0=0.2, b_f0=0.3, b_r0=−0.2 with individuals in the population randomly selected to be male or female. The residual variances were chosen to be Σ_gg=Σ_ff=Σ_rr=1, corresponding to a large variation in traits between individuals (perhaps larger than is biologically realistic, but here we want to demonstrate that inference of the SNP effects is still possible despite significant variation in trait values arising from other sources). In line with the direction of the SNP effects, the covariances were chosen to be Σ_gf=0.3, Σ_gr=−0.4 and Σ_fr=−0.2, representing a potential scenario in which individuals that are more susceptible are also more infectious and recover at a slower rate, and vice-versa. To accommodate variation in epidemic speed across groups, we set the standard deviation in the group effects to σ_G=0.5. Finally, the average transmission rate was chosen to be β=0.3/G_size (selected because it led to a substantial fraction of individuals becoming infected and including G_size such that the basic reproductive ratio R₀ remained independent of group size, i.e. frequency dependent transmission) and an average recovery rate γ=0.1 with shape parameter k=5 (corresponding to the infection duration being relatively highly peaked around a mean of 10 time units).
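As a worked illustration of the genotype set-up, the sketch below computes Hardy-Weinberg genotype frequencies for p=0.3 and samples genotypes for 1000 individuals. The function names are hypothetical and the sampling is a plain multinomial draw, which may differ from the procedure actually used to generate the simulated data.

```python
import random

def hardy_weinberg_frequencies(p):
    """Genotype frequencies for a biallelic SNP with A allele frequency p."""
    q = 1.0 - p
    return {"AA": p * p, "AB": 2.0 * p * q, "BB": q * q}

def sample_genotypes(p, n, rng):
    """Draw n genotypes independently from the Hardy-Weinberg frequencies."""
    freqs = hardy_weinberg_frequencies(p)
    return rng.choices(list(freqs), weights=list(freqs.values()), k=n)

freqs = hardy_weinberg_frequencies(0.3)   # AA 0.09, AB 0.42, BB 0.49
genotypes = sample_genotypes(0.3, 1000, random.Random(0))

G_size = 50
beta = 0.3 / G_size                       # 0.006 for groups of 50
gamma = 0.1
# In the homogeneous limit (all individual and group deviations zero),
# beta * G_size / gamma gives a reproductive ratio of about 3, whatever G_size is.
r0_homogeneous = beta * G_size / gamma
```

Only about 9% of individuals are expected to carry the AA genotype at this allele frequency, so information about AA individuals is comparatively sparse in small groups.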

Simulated epidemic data was generated by means of a Doob-Gillespie algorithm [43] modified to account for non-Markovian recovery times (details of this procedure are given in Appendix F). A typical output for one simulated epidemic in a single contact group N_group=1 with G_size=50 individuals is shown in Fig 2. Whilst the simulation itself is generated on an individual basis, this graph summarises dynamic variation in the susceptible, infectious and recovered populations, categorised by SNP genotype. It reveals classic epidemic SIR model behaviour: a single infected individual passes its infection on to others, triggering a rapidly spreading infection process throughout the population until the epidemic eventually dies out as a result of the susceptible population becoming largely exhausted and the remaining infected population recovering. Note that in closed groups not all susceptible individuals become infected. In this particular case some AB and BB individuals remain uninfected at the end of the epidemic. The absence of AA individuals partly stems from natural stochasticity in the system, but also partly from the fact that a_g=0.4 is positive, i.e. AA individuals are more susceptible to disease and so on average less likely to remain uninfected. Consequently we can link the genetic composition in the final state of the epidemic to the expected value for a_g (which, based on this particular dataset, is more likely positive than negative). Over and above information from the final state, however, there is much to be gained from also accounting for the infection and recovery event times themselves. The Bayesian approach adopted in this paper utilises all this information to extract the best available parameter estimates.
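An event-driven simulation with non-exponential recovery times can be sketched as below. This is not the procedure of Appendix F: it is a simplified, homogeneous version (no SNP, fixed, residual or group effects) written under the assumption that each recovery time is drawn from the gamma distribution at the moment of infection, while infections occur at the constant rate β·S·I between events. The function name is hypothetical.

```python
import heapq
import random

def simulate_sir(n, beta, gamma, k, rng):
    """Homogeneous SIR epidemic in one closed group of n individuals.

    Returns a time-ordered list of (time, event) pairs, where event is
    'infection' or 'recovery'. One individual is infected at time 0.
    """
    t = 0.0
    susceptible = n - 1
    scale = 1.0 / (gamma * k)              # gamma mean = k * scale = 1 / gamma
    recoveries = [rng.gammavariate(k, scale)]  # min-heap of scheduled recoveries
    events = [(0.0, "infection")]
    while recoveries:
        rate = beta * susceptible * len(recoveries)
        t_inf = t + rng.expovariate(rate) if rate > 0 else float("inf")
        if t_inf < recoveries[0]:
            # The next event is an infection: schedule that individual's recovery.
            t = t_inf
            susceptible -= 1
            heapq.heappush(recoveries, t + rng.gammavariate(k, scale))
            events.append((t, "infection"))
        else:
            # A recovery comes first. The infection rate changes, so the
            # exponential waiting time is redrawn on the next pass, which is
            # valid because the exponential distribution is memoryless.
            t = heapq.heappop(recoveries)
            events.append((t, "recovery"))
    return events

events = simulate_sir(n=50, beta=0.006, gamma=0.1, k=5, rng=random.Random(3))
n_infections = sum(1 for _, kind in events if kind == "infection")
```

Extending the sketch to individual forces of infection would replace the single rate β·S·I with the sum of λ_j over susceptible individuals, choosing the newly infected individual with probability proportional to λ_j.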

Fig 2. Simulated epidemic profiles.

This graph shows epidemic profiles for the three SNP genotypes (i.e. AA, AB or BB), where S_g, I_g, R_g indicate the number of susceptible, infected and recovered individuals of genotype g, respectively. This example is simulated using a single contact group containing G_size=50 individuals, of which one is initially infected. The model parameters θ are: β=0.006, γ=0.1, k=5, a_g=0.4, a_f=0.3, a_r=−0.4, Δ_g=0.4, Δ_f=0.1, Δ_r=−0.3, b_g0=0.2, b_f0=0.3, b_r0=−0.2, Σ_gg=1, Σ_gf=0.3, Σ_gr=−0.4, Σ_ff=1, Σ_fr=−0.2, Σ_rr=1, σ_G=0.5 and the A allele has frequency p=0.3. Note, the step jumps in curves result from discrete disease status transitions in individuals.

The information content from a single epidemic is generally insufficient to estimate the large number of parameters in the model. Therefore we next simulated a more realistic dataset (using the same parameter set as above) made up of 1000 individuals split into N_group=20 contact groups, each containing G_size=50 individuals. The infection and recovery event times from this simulation were then used as input data into SIRE (scenarios in which infection and recovery times are not known precisely are discussed later in section 3.5).

Parameter estimates

Fig 3 shows the inferred posterior probability distributions for all parameters in θ corresponding to the simulated multi-group scenario described above. The actual parameter values used to generate the data (see vertical black dashed lines in Fig 3) consistently lie within regions of high posterior probability. The standard deviations (SDs) in these distributions characterise the precision with which parameters can be estimated:

Population average parameters (Figs 3a-c) – The recovery rate γ has the greatest precision (smallest relative SD), followed by the transmission rate β. Whilst the distribution for the shape parameter k is wide, it is clearly able to discount the possibility of an exponential recovery duration (i.e. k=1), which has a very low posterior probability, over a more peaked distribution (i.e. k>1).
SNP effects (Figs 3d-f) – The estimated recovery SNP effect a_r is highly peaked around its true value of −0.4 (Fig 3f). Importantly this distribution has an extremely low posterior probability at a_r=0. Indeed, since a_r=0 does not lie within the 95% credible interval it can be concluded, to a high degree of certainty, that the SNP is associated with recoverability. The same is true for the susceptibility SNP effect a_g in Fig 3d, albeit with a wider posterior probability distribution. This difference is for two reasons: firstly the recovery process involves only a_r, whereas the infection process involves both a_g and a_f (leading to potential confounding between these parameters), and secondly the recovery process is gamma distributed, which has a smaller standard deviation than the more dispersed Poisson process governing infection. The infectivity SNP effect a_f in Fig 3e exhibits a much wider probability distribution than the other two SNP effects. The fact that zero does lie within the 95% posterior credible interval (which goes from −0.35 to 2.1) means that no certain association with infectivity can be made in this particular example. Figs 3d-f illustrate a general principle that was common in the vast majority of subsequent simulation scenarios: SNP effects associated with recoverability are most precisely estimated, followed by susceptibility, and finally infectivity [44].
Scaled dominance factor (Figs 3g–i) – Compared to the SNP effects themselves, precision of the scaled dominance parameters is relatively poor, and actually reduces as the size of the SNP effects goes down (results not shown), which makes sense in the limit of zero SNP effect size, because here no information about dominance is available. Estimating them accurately, therefore, either requires very large SNP effects or substantially more data.
Fixed effects (Figs 3j–l) – Since SNP effects are also a type of fixed effect, the same comments as above also apply for other fixed effects.
Residual covariance matrix and random group effect (Figs 3m–s) – Interestingly, it was possible to obtain relatively good estimates for elements in the residual covariance matrix. Again, the familiar pattern is observed whereby quantities related to recoverability are more precisely estimated than those related to susceptibility, with infectivity the least precise. Finally, the variance of the group effect could be estimated with similar precision to that for susceptibility (Figs 3s & m).
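The association criterion used above (whether zero lies inside the 95% credible interval) can be sketched as follows for a set of posterior samples. The helper names are hypothetical, an equal-tailed interval is assumed (the text does not specify the interval type), and the samples below are synthetic normal draws used as stand-ins, not SIRE output.

```python
import numpy as np

def credible_interval(samples, level=0.95):
    """Equal-tailed credible interval from posterior samples (interval type is an assumption)."""
    tail = 100 * (1 - level) / 2
    lo, hi = np.percentile(samples, [tail, 100 - tail])
    return lo, hi

def excludes_zero(samples, level=0.95):
    """True if zero lies outside the credible interval, i.e. an association can be claimed."""
    lo, hi = credible_interval(samples, level)
    return bool(lo > 0 or hi < 0)

# Synthetic stand-ins for posterior samples: one narrow posterior away from zero, one wide
rng = np.random.default_rng(0)
narrow = rng.normal(-0.4, 0.1, size=10_000)
wide = rng.normal(0.0, 1.0, size=10_000)
print(excludes_zero(narrow), excludes_zero(wide))
```

The posterior SD reported alongside each panel of Fig 3 would correspond to `np.std(samples)` for the same samples.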

Fig 3. Parameter posterior distributions.

Probability distributions for model parameters inferred from a simulated dataset which consisted of exact infection and recovery times (DS1) for N_group=20 contact groups each containing G_size=50 individuals. The parameter values in Fig 1 were used for the simulation (denoted by the vertical black dashed lines). The standard deviations (SD) give a measure of precision.

3.2 Dependence on parameter values

The previous section showed an illustrative example for a particular parameter set. Here we assess what happens when different parameters in the model are altered. This was achieved by taking the “base” set of parameters in Eq.(10) and then changing each parameter separately (fixing all others) [45]. Fig 4 shows scatter plots (each referring to a different selected parameter) of the posterior means (crosses) with corresponding 95% credible intervals inferred from a single simulated dataset using the true selected parameter value on the x-axis. Plots in which most crosses lie near to the diagonal line imply that inference is able to accurately capture the true parameter values. Table 1 shows the corresponding prediction accuracy, measured as the correlation between the inferred and true parameter values. Except for Δ_f, for which prediction accuracy was only 34%, prediction accuracies for all other parameters ranged from 69-99%. In line with the discussion above, parameters associated with recoverability generally have higher prediction accuracies than those associated with susceptibility, which are again higher than those for infectivity.


Table 1. Prediction accuracy, bias and precision for the parameter estimates.

Other columns relate to the sub-plots in Fig 4 (see Fig 4 caption for information about the underlying data). Prediction accuracy is defined as the correlation between the inferred and true parameter values. The y-intercept and slope were obtained by fitting regression lines through the data points in Fig 4 (a y-intercept of zero and slope of one indicates no bias). Av. SD gives the average posterior standard deviation across all data points as an indicator for precision of parameter estimates. Subscripts g, f and r refer to susceptibility, infectivity and recovery, respectively.

Fig 4. Prediction accuracy and bias.

The inferred posterior distributions for parameters compared to their true value. Simulated data was generated using the base parameter set in Eq.(10) except for a single parameter which was singled out in each of the sub-plots above*. Each cross corresponds to the inferred posterior mean (with error bars indicating 95% credible intervals) of the selected parameter (whose true value is on the x-axis) when SIRE is applied to a single simulated dataset consisting of infection and recovery times (DS1) from N_group=20 contact groups each containing G_size=50 individuals. A description of the model parameters, together with calculated prediction accuracies (correlation between true and inferred value), and bias (represented by intercept and slope of regression lines fitted to the data points), and average standard deviations are given in Table 1. (*Additionally for (g) a_g=0.4, (h) a_f=0.4 and (i) a_r=0.4, such that dominance has an effect).

Bias indicates systematic differences between the true parameter values and those inferred from the data. Bias was measured by fitting regression lines through the posterior means in Fig 4 (as a function of the true parameter value). The corresponding y-intercept and slope values are shown in Table 1, where a zero y-intercept and a slope of one indicate absence of bias. Whilst the majority of observed y-intercepts tended to be very small, the slope for some of the parameters is markedly less than one (most notably for Δ_f). The reason for this is as follows. When Bayesian analysis reveals insufficient information regarding a parameter, its posterior distribution follows that of the prior (which is uniform for all the parameters in this particular study, as described in Appendix C). This behaviour happens irrespective of the parameter’s true value, leading to a plot in Fig 4 that would be entirely flat (i.e. a slope of zero). Therefore, the slopes of less than one in Fig 4 simply reflect a lack of data, which is essentially another manifestation of a lack of parameter precision. Consequently, bias reduces as the amount of data increases (provided the model being fitted is the correct one).
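The three summary quantities discussed here (prediction accuracy as a correlation, plus the intercept and slope of a fitted regression line) can be computed as in the sketch below. The function name `accuracy_and_bias` is hypothetical and the input values are a toy example, not results from this study: posterior means are constructed to sit halfway between the true value and a prior mean of zero, mimicking the shrinkage towards the prior described above.

```python
import numpy as np

def accuracy_and_bias(true_values, posterior_means):
    """Correlation (prediction accuracy) and regression intercept/slope (bias)."""
    x = np.asarray(true_values, dtype=float)
    y = np.asarray(posterior_means, dtype=float)
    accuracy = np.corrcoef(x, y)[0, 1]
    slope, intercept = np.polyfit(x, y, deg=1)
    return accuracy, intercept, slope

# Toy example: estimates shrunk halfway towards a prior mean of zero
true_values = np.linspace(-1.0, 1.0, 11)
posterior_means = 0.5 * true_values
accuracy, intercept, slope = accuracy_and_bias(true_values, posterior_means)
print(accuracy, intercept, slope)  # slope is approximately 0.5, intercept approximately 0
```

In this toy case the correlation is perfect even though the slope is well below one, which shows why prediction accuracy and bias are reported separately.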

From the point of view of this paper, the probability distributions which are of greatest interest are those of the SNP effects. The sizes of the error bars across Figs 4d-f demonstrate that the precisions of the parameter estimates are largely independent of the values of the parameters themselves, a result which can be supported analytically [46]. This implies that the precision of SNP effects calculated using the base set of parameters in Eq.(10) is expected to be generally applicable to any other parameter set [47] (e.g. the average SDs in Table 1 for the base parameter set are very similar to the SDs shown in Fig 3).

Consequently, the remainder of this paper focuses on investigating how SNP effect estimates are affected by contact group structure and the nature of the measured data using this base set of parameters. We focus first on outlining the behaviour with respect to key design features, e.g. number of contact groups, number of individuals per group and allele frequency, and then go on to consider how observations of the system influence what can be learned.

3.3 Dependence on the number and size of contact groups

The crosses in Fig 5 show how SDs in the SNP effects change as a function of the number of individuals G_size within each contact group (here N_group=10 contact groups are assumed). The SD in a_g reduces as the number of individuals in each contact group G_size increases (Fig 5a). Importantly this relationship scales as a line of slope −½ (note the log scales on this plot), corresponding to precision increasing by a factor of two as the number of individuals is increased by a factor of four (in line with what would be expected from the central limit theorem). Fig 5a provides insights into how many individuals would need to be observed in order to be able to make an association with a susceptibility SNP effect of a given size. For example, in order to detect an association with a susceptibility SNP of effect size a_g=0.4, G_size=20 individuals per contact group, and so G_size×N_group=200 individuals in total, would be needed to ensure that the 95% credible interval does not contain zero (assuming approximate normality for the posterior distribution), as illustrated by the black dashed line in Fig 5a. Fig 5c shows the same scaling relationship for identifying recoverability SNP effects, but this time only G_size×N_group=100 individuals are needed to make associations for recovery SNP effects (reflecting the fact that a_r can be inferred more precisely, as mentioned previously). A very different state of affairs, however, is observed in Fig 5b. Here we see that not only is the infectivity SNP effect a_f poorly estimated, but also its precision does not markedly improve even when the number of individuals in each contact group G_size is substantially increased.
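The sample size reasoning above can be written as a short calculation. Assuming the posterior is approximately normal and the SD scales as N^(−½), an effect of size a is detectable when 1.96×SD < |a|, so the required number of individuals is N_ref×(1.96×SD_ref/|a|)². The function name `required_individuals` is hypothetical; the reference SD used below is the value implied by the statement that a_g=0.4 is just detectable with 200 individuals, not a number read from Fig 5.

```python
import math

def required_individuals(effect_size, sd_ref, n_ref, z=1.96):
    """Individuals needed so that z * SD equals |effect_size|, assuming SD scales as N^(-1/2)."""
    return n_ref * (z * sd_ref / abs(effect_size)) ** 2

# Reference point implied by the text: a_g = 0.4 is just detectable with 200 individuals
sd_ref = 0.4 / 1.96
print(required_individuals(0.4, sd_ref, n_ref=200))  # approximately 200
print(required_individuals(0.2, sd_ref, n_ref=200))  # approximately 800
```

Halving the effect size therefore quadruples the number of individuals required. This extrapolation applies to susceptibility and recoverability; it does not hold for infectivity when only G_size is increased.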

Fig 5. Variation in precision of the SNP effect estimates with group size G_size.

Posterior standard deviations (SDs) in SNP effects for (a) susceptibility a_g, (b) infectivity a_f and (c) recoverability a_r from simulated data with N_group=10 contact groups each containing G_size individuals (which is varied). Different symbols represent different data scenarios: DS1) Both the infection and recovery times for individuals are known, DS2) only recovery times are known, and DS3) only infection times are known. Each symbol represents the average posterior SD over 50 simulated data replicates with the error bar denoting 95% of the stochastic variation about this value, i.e. 95% of posterior SDs lie within the interval (note, they do not represent posterior credible intervals, as in Fig 4). The solid black line indicates a slope of −½ and the dashed black and purple lines indicate the sample size required for identifying a SNP with effect size 0.4 for the trait under consideration (see main text for further explanation). Parameter values are given in Eq.(10).

Instead of varying G_size and fixing the number of contact groups N_group, we now fix G_size=10 and vary N_group. Results for this are shown in Fig 6 (represented by the crosses). This reveals similar behaviour to that seen before for the SD in a_g and a_r, but crucially we find the SD in the infectivity SNP effect a_f now also scales with the familiar line of slope −½. The reason for this behaviour lies in the fact that infectivity is an indirect genetic effect, i.e. an individual’s infectivity SNP affects the disease phenotype of group members rather than its own disease phenotype [48–50]. More intuitively, this can be explained as follows. Susceptibility and recoverability SNPs of an individual directly affect its own measured disease phenotype (the former affecting its infection time and the latter affecting its recovery time). Therefore the information on which these two quantities can be inferred is expected to scale with the total number of individuals. On the other hand, as an individual’s infectivity SNP acts on all susceptible individuals sharing the same contact group, it affects the epidemic dynamics as a whole. In fact much of the information regarding infectivity comes from the overall speed of epidemics. For example, if those contact groups containing individuals with more A alleles consistently experience epidemics which are faster than those with fewer A alleles, this provides evidence that the A allele confers greater infectivity than the B allele (the situation is further complicated by the fact that differences in susceptibility can also cause this behaviour; however, the algorithm can independently estimate a_g, thereby removing this potential confounding). Because information about the infectivity SNP effect comes from epidemic-wide behaviour, it is expected to scale linearly with the number of contact groups N_group (Fig 6b), but not with the number of individuals per contact group G_size (Fig 5b).

Fig 6. Variation in precision of the SNP effect estimates with number of groups N_group.

Posterior standard deviations (SDs) in SNP effects for (a) susceptibility a_g, (b) infectivity a_f and (c) recoverability a_r from simulated data with N_group contact groups (which is varied) each containing G_size=10 individuals. Different symbols represent different data scenarios: DS1) Both the infection and recovery times for individuals are known, DS2) only recovery times are known, and DS3) only infection times are known. Each symbol represents the average posterior SD over 50 simulated data replicates with the error bar denoting 95% of the stochastic variation about this value. The black line indicates a slope of −½. Parameter values are given in Eq.(10).

Finally, we investigate the case in which we fix the total number of individuals to G_size×N_group=1000 whilst simultaneously varying G_size and N_group, as shown in Fig 7 (see crosses). In Fig 7a we find very little variation in the precision of a_g. Interestingly, the results in Fig 7b clearly demonstrate that larger numbers of contact groups containing fewer individuals help to reduce the SD in the infectivity SNP effect a_f. In the case of G_size=2 the posterior SDs in a_g and a_f are actually the same due to the symmetry of this particular setup (i.e. each group consists of exactly one infected and one susceptible individual). Lastly, Fig 7c shows that the SD in a_r is largely independent of G_size. This is because recovery is solely an individual-based process, and so happens independently of others sharing the same contact group (although in cases in which R₀ is small, differences may result from variation in the fraction of individuals which actually become infected).

Fig 7. Variation in precision of the SNP effect estimates with partitioning into groups.

Posterior standard deviations (SDs) in SNP effects for (a) susceptibility a_g, (b) infectivity a_f and (c) recoverability a_r from simulated data with N_group contact groups each containing G_size individuals, both of which are varied such that the total population N_group×G_size is fixed to 1000. Different symbols represent different data scenarios: DS1) Both the infection and recovery times for individuals are known, DS2) only recovery times are known, and DS3) only infection times are known. Each symbol represents the average posterior SD over 50 simulated data replicates with the error bar denoting 95% of the stochastic variation about this value. Parameter values are given in Eq.(10).

3.4 Dependence on allele frequency

So far we have assumed a fixed A allele frequency p=0.3 in the population. Fig 8 demonstrates what happens when this is no longer the case by varying p, which in turn changes the Hardy-Weinberg equilibrium frequencies for the three genotypes. We find that the curves are symmetric around a minimum at p=0.5 and remain remarkably flat over a large region. They only increase substantially when the minor allele frequency drops below around 10%. This result shows that the statistical power to establish SNP effects reduces dramatically when the minor allele is rare, which is consistent with observations from conventional GWAS analyses [51].
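The dependence of genotype frequencies on p follows directly from Hardy-Weinberg equilibrium, as the short sketch below illustrates (the function name `hwe_frequencies` is hypothetical). It shows how few minor-allele homozygotes are expected in a population of 1000 individuals once the allele becomes rare, and that swapping p for 1−p simply exchanges the roles of AA and BB.

```python
def hwe_frequencies(p):
    """Hardy-Weinberg genotype frequencies (AA, AB, BB) for A allele frequency p."""
    return p**2, 2 * p * (1 - p), (1 - p) ** 2

n_individuals = 1000
for p in (0.5, 0.3, 0.1, 0.05):
    f_aa, f_ab, f_bb = hwe_frequencies(p)
    # Expected number of individuals of each genotype
    print(p, round(f_aa * n_individuals, 1), round(f_ab * n_individuals, 1),
          round(f_bb * n_individuals, 1))
```

At p=0.1, for instance, only about 10 of 1000 individuals are expected to be AA, so there is little information with which to compare the AA genotype against the others.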

Fig 8. Variation in precision of the SNP effect estimates with allele frequency p.

Posterior standard deviations (SDs) in SNP effects for (a) susceptibility a_g, (b) infectivity a_f and (c) recoverability a_r from simulated data with N_group=20 contact groups each containing G_size=50 individuals. Different symbols represent different data scenarios: DS1) Both the infection and recovery times for individuals are known, DS2) only recovery times are known, and DS3) only infection times are known. Each symbol represents the average posterior SD over 50 simulated data replicates with the error bar denoting 95% of the stochastic variation about this value. Parameters used are given in Eq.(10).

3.5 Different data scenarios

This section shows results from the various data scenarios introduced in section 2.4, in which the infection and recovery times of all individuals are not known precisely: