A sub-exponential branching process to study early epidemic dynamics with application to Ebola

Exponential growth is a mathematically convenient model for the early stages of an outbreak of an infectious disease. However, for many pathogens (such as Ebola virus) the initial rate of transmission may be sub-exponential, even before transmission is affected by depletion of susceptible individuals. We present a stochastic multi-scale model capable of representing sub-exponential transmission: an in-homogeneous branching process extending the generalised growth model. To validate the model, we fit it to data from the Ebola epidemic in West Africa (2014–2016). We demonstrate how a branching process can be fit to both time series of confirmed cases and chains of infection derived from contact tracing. Our estimates of the parameters suggest transmission of Ebola virus was sub-exponential during this epidemic. Both the time series data and the chains of infections lead to consistent parameter estimates. Differences in the data sets meant consistent estimates were not a foregone conclusion. Finally, we use a simulation study to investigate the properties of our methodology. In particular, we examine the extent to which the estimates obtained from time series data and those obtained from chains of infection data agree. Our method, based on a simple branching process, is well suited to real-time analysis of data collected during contact tracing. Identifying the characteristic early growth dynamics (exponential or sub-exponential), including an estimate of uncertainty, during the first phase of an epidemic should prove a useful tool for preliminary outbreak investigations. Author Summary Epidemic forecasts have the potential to support public health decision making in outbreak scenarios for diseases such as Ebola and influenza. In particular, reliable predictions of future incidence data may guide surveillance and intervention responses. Existing methods for producing forecasts, based upon mechanistic transmission models, often make an implicit assumption that growth is exponential, at least while susceptible depletion remains negligible. However, empirical studies suggest that many infectious disease outbreaks display sub-exponential growth early in the epidemic. Here we introduce a mechanistic model of early epidemic growth that allows for sub-exponential growth in incidence. We demonstrate how the model can be applied to the types of data that are typically available in (near) real-time, including time series data on incidence as well as individual-level case series and chains of transmission data. We apply our methods to publically available data from the 2014–2016 West Africa Ebola epidemic and demonstrate that early epidemic growth was sub-exponential. We also investigate the statistical properties of our model through a simulation re-estimation study to identify it performance characteristics and avenues for further methodological research.


40
Physical systems can rarely support exponential growth for extended periods; during 41 an epidemic, depletion of susceptible individuals leads to reduced transmission and, if 42 intervention measures have not already done so, cause incidence to decline. Despite re-43 cent work showing the initial transmission of many diseases is sub-exponential, it is still 44 common to see epidemics represented by models in which transmission grows exponen- 45 tially Viboud et al. (2016). This is concerning because exponential growth is extremely 46 sensitive to its growth rate parameter, which can inflate the variance of forecasts. During an outbreak of a novel pathogen, uncertainty in the growth rate is almost guaranteed. 48 Furthermore, the likely impact of an intervention, such as social-distancing or deploy-49 ment of a vaccine, is likely to be highly sensitive to the estimate for the growth rate 50 parameter. 51 The quantitative models used in epidemiology vary, from simple phenomenological 52 models Lega and Brown (2016); Nouvellet et al. (2018) to complex agent-based simula-53 tions Ajelli et al. (2016);Merler et al. (2015). 54 Typically, the simpler phenomenological models -while able to produce exponen-55 tial or sub-exponential growth -lack the mechanistic underpinning to answer relevant 56 question (e.g., what will be the effect of vaccinating 20% of the population?) and so 57 have arguably limited application in outbreak investigations. At the other end of the 58 complexity spectrum, agent-based models, with their high biological fidelity, allow for 59 conceptually simple explorations of the impact of interventions. However, there is a 60 cost: their complexity makes them difficult to reason about mathematically. They are 61 also computationally intensive, making statistical analysis and so assessment of the early 62 growth characteristics and potential impact of interventions, challenging.
Here, in the context of early outbreak investigation, we demonstrate how an in-64 homogeneous branching process formulation can overcome some of the challenges described above: the mismatch between exponential growth of transmission and obser-66 vations, and the difficulty of finding a model with a mechanistic basis which is still 67 mathematically tractable. A temporal in-homogeneity in the branching process ensures 68 the generation sizes grow algebraically (in expectation), instead of the typical geomet-69 ric/exponential growth. Branching processes can be viewed as either a tree, where it 70 describes who-infected-whom, or as a time series to describe the total number of cases 71 through time. As such, they are a good example of a multi-scale model. Unlike many 72 of the complex mechanistic models, the simplicity of the branching process means it is 73 possible to reason about them quantitatively and work with them computationally. 74 We explore the use and properties of this model from three perspectives. First, we  (2015). Third, to investigate the extent to which one might expect the previous 86 result (i.e., obtaining similar estimates from each data type) to generalise, we performed 87 a simulation study. The goal of this simulation study was not to investigate the utility 88 of each data type for estimating the parameters per se, but to ask whether or not both 89 data types, when derived from the same epidemic, produce concordant estimates.  91 We derive the branching process in terms of a generic cumulative incidence function, 92 i.e., a function describing the total number of cases that have occurred by a given time. 93 We then consider the special case of a cumulative incidence function previously used to  and Z g the total number of infectious individuals in that generation, i.e., the sum of the 100 X i g−1 . We derive an in-homogeneous branching process where the expected generation 101 sizes are f g = EZ g .

102
Usually, the expected number of infectious individuals in a branching process grows if EX = µ then EZ g = µ g . The branching process derived below has expected generation 105 sizes (i.e., the EZ g ) which can follow any given monotonically increasing function. The 106 notation used in this construction is summarised in Table 1.

107
Let C(t) be the expected cumulative incidence by time t, i.e., the number of infections 108 we would expect to occur by time t. Evaluated at multiples of the serial interval, C yields 109 the generation sizes, f g for g = 1, 2, . . .
where ∆ g is the time of the gth generation. The first value of this sequence is f 0 = Z 0 , 111 the number of infectious individuals in the first generation. Then, assuming the X i g are 112 independent with mean µ g = f g+1 /f g we observe The solution to this recurrence is So EZ g = f g from the definition of µ g .

115
In summary, by fixing the expected value of the offspring distribution (in terms of the 116 generation) we obtain a branching process which, on average, has an expected cumulative 117 incidence C. This construction enables us to capture the behaviour of a phenomenologi-118 cal model which is known to fit observations better than exponential/geometric growth, 119 while maintaining a mechanistic foundation because it explicitly represents the individ-120 uals in the population. The construction above assumes a cumulative incidence function, C. We use the gener- is constant, for p = 1/2 (when m = 2) the incidence grows linearly (since the incidence 132 is the derivative of the cumulative incidence by definition), p = 2/3 provides quadratic 133 growth and with p = 1 we recover exponential growth in incidence.
so as k → ∞ we recover the Poisson distribution. Since the mean value is determined process against extinction in the likelihood during fitting to mitigate this bias. In short be viewed as a sequence of generation sizes, Z 0:g . We refer to this representation of the 162 process as the population view. As we will see, the ability to represent a process as both 163 a tree and a time series is very useful when making use of multiple data types.

164
First we consider the likelihood function for the time series data, conditioned against 165 extinction over the observed generations. We extend the notation introduced in sec-166 tion 2.1 to specify the (geographic) location (denoted by j): X i g,j denotes the number of 167 infections caused by the ith member of the gth generation in location j, and Z g,j denotes 168 the number of cases in generation g in location j. For ease of notation, we will often 169 drop the indices where they are clear by context.

170
By definition X g has a negative binomial distribution with mean µ g and shape pa- (2) Since Z g |Z g−1 is the sum of Z g−1 independent X g−1 , the MGF is given by the product and hence Z g |Z g−1 is also negative binomial with mean µ g−1 Z g−1 and shape parame-174 ter kZ g−1 . Since the generation sizes form a Markov chain and each location is assumed 175 to have an independent epidemic, the likelihood of all of the time series data is where N j denotes the number of generations that were observed in location j.

177
In order to condition the process against extinction during the observed generations 178 we use the probability that the process is extinct by the time of the last observation 179 at that location. To compute this probability, we consider the probability-generating 180 function (PGF) of the generation sizes, G Zg (t). Since we are working with the PGF 181 (rather than the MGF) the sum of Z g−1 independent X g−1 leads to the composition Iterating this g times, and noting that G Z0 (t) = t Z0 leads to where G Xg (t) = (k/(k + µ(1 − t))) k . The composition produces a complicated ex-184 pression, but for moderate g this is not an issue computationally. The probability of 185 extinction is the zero-th order coefficient of the PGF, hence the probability of extinction 186 by generation g is G Zg (0).
Putting the previous results together, we obtain the conditional likelihood for the 188 observed times series: For secondary infections data, the probability of extinction conditional upon partial We conclude this section with a few remarks on computation. When computing 196 with the expressions above for the likelihoods, the log-likelihood is used to avoid under-197 flow/overflow issues. Moreover, the use of a probabilistic programming language (such 198 as Stan as used here) will handle this expression and its gradient in a numerically stable 199 way. Therefore in practice, beyond specifying the process as a graphical model, the only 200 requirement is to implement the computation of the extinction probability.  The WHO data also includes approximate locations for each case. Using this infor-214 mation, we extracted another time series specific to Conakry, the capital city of Guinea.  In the case of Conakry, it is important to note that there are important differences 222 between the data sets. The time series is specific to confirmed cases while the tree 223 contains both confirmed and probable cases. And the number of cases in the time series 224 is far greater than the number in the infection tree.
2.6. Inference method for the time series model 226 The confirmed cases of EVD in the three West African countries were modelled as time 227 series of generation sizes using the population-level formulation of the branching process.

228
We considered a hierarchical model in which the model parameters for each country are . 256 We used the population view of the branching process to model the time series of con-257 in the tree) and X i g is their number of secondary infections (the out-degree of the node).

262
The prior distribution used is shown in Table 3. Fitting the model to each data set 263 allows us to investigate whether these views of the same epidemic are consistent. to determine whether the computation had converged or whether a numerical issue had been encountered (in which case the simulation and optimisation were repeated).  tively. Figure 3b shows the posterior mass for p has accumulated around 0.5 for all three 302 countries; in the model, this corresponds to approximately linear growth in the incidence.

303
Another way to view this would be that the cumulative incidence had quadratic growth.

304
Recall from Equation (1)  data suggests slower, but more rapidly accelerating growth than the secondary infections 319 data. We consider potential causes for these differences in the Discussion. mates. There is a clear bias and increased variability in the estimates derived from the 338 secondary infections data for both the growth rate and the dispersion parameter. As  with any Bayesian analysis, it is important to understand the impact of the prior distri-340 bution; in the absence of any data, the MAP would be the mode of the prior distribution.

368
In the case of this Ebola epidemic, the time series data was available long before the 369 infection tree. However, obtaining comprehensive time series of disease is challenging, 370 and it is interesting to know that there are alternative data sets which may be useful and 371 already part of the data collected during intervention measures such as contact tracing.

372
Moreover, our observations do not guarantee that we can rely on the agreement between 373 the inference methods in general which is, in part, why we also carried out the simulation 374 study. should not be ignored, however, given there will also be a level of uncertainty on these 384 estimates, they will still give broadly consistent characterisations of the epidemic. to the data collection process that precludes this being reversed. In fact, it seems plausi-431 ble that in active surveillance programs and with increased use of sequencing, secondary 432 infections data may become available before time series, and so the method we present