Estimating disease spread using structured coalescent and birth-death models: A quantitative comparison

Understanding how disease transmission occurs between subpopulations is critically important for guiding disease control efforts irrespective of whether the subpopulations represent geographically separated people, age or risk groups. The structured coalescent (SC) and the multitype birth-death (MBD) model can both be used to infer migration rates between subpopulations from phylogenies reconstructed from pathogen genetic sequences. However, the two classes of phylodynamic methods rely on different assumptions. Here, we report on a simulation study which compares inferences made using these models for a variety of migration rates in both endemic diseases and epidemic outbreaks. For the epidemic outbreak, we found that the MBD recovers the true migration rates better than the SC regardless of migration rate. We hypothesize that the inaccurate SC estimates stem from the its assumption of a constant population size. For the endemic scenario, our analysis shows that both models obtain a similar coverage of the migration rates, while the SC provides slightly narrower posterior intervals. Irrespective of the scenario, both models estimate the root location with similar coverage. Our study provides concrete modelling advice for infectious disease analysts. For endemic disease either model can be used, while for epidemic outbreaks the MBD should be the model of choice. Additionally, our study reveals the need to develop the SC further such that varying population sizes can easily be taken into account. Author summary Controlling an infectious disease requires us to quantify and understand how it spreads through pools of susceptible individuals, defined by their belonging to different geographical regions, age or risk groups. Rates of pathogen movement between these pools can be inferred from pathogen phylogenies which are themselves reconstructed from pathogen genetic sequences collected from infected individuals. Two popular foundations for such models are the multitype birth-death model and the structured coalescent. Although these models fulfill the same purpose, they differ in their assumptions and can, hence, produce contrasting results. To assess the appropriateness of the models in different situations, we performed a simulation study. We find that, for endemic diseases, both models are able to estimate the migration parameters reliably. For epidemic outbreaks, however, the multitype birth-death model obtains better estimates of the migration rates. We hypothesize that the structured coalescent’s inaccurate estimates for the epidemic scenario arise because it assumes a constant number of infected individuals through time.

Recent years have seen a rise in both the use of phylodynamic models and their 10 complexity. Early models focused on inferring important population dynamic 11 parameters, e.g. the reproductive number, from pathogen genetic data for a panmictic 12 population [1]. However, a population is often structured into subpopulations, e.g. an 13 infected individual is more likely to transmit to other individuals that are in the same 14 region, risk group or share another attribute [2]. More recent models [3][4][5][6][7] can estimate 15 the population dynamics within subpopulations and the migration rates between them. 16 For concreteness, we will regard subpopulations as consisting of infected individuals in 17 distinct geographical regions in the remainder of the paper. 18 In a Bayesian phylodynamic analysis of an infectious disease dataset, the input data 19 is an alignment of pathogen sequences collected from infected individuals. To 20 reconstruct the phylogenetic tree together with the DNA substitution parameters, and 21 the past population dynamics, a variety of methods are available [8][9][10]. An important 22 component of such an analysis is the tree prior. When formulated in terms of a 23 generative model, this allows researchers to characterise the posterior distribution of the 24 population dynamic parameters and, for structured populations, the migration rates 25 between the subpopulations. 26 Two popular models for the tree prior in structured populations are the multitype 27 birth-death (MBD) [11][12][13] and the structured coalescent (SC) [14][15][16]. Both describe a 28 stochastic process, that generates a distribution of phylogenetic trees for a given 29 parameter set. Nevertheless, the MBD and the SC differ substantially for a number of 30 reasons. Firstly, the MBD explicitly models how the sampled sequences were collected 31 from the total population. On the one hand, this allows the MBD to use additional 32 signal stemming from the sampling times [17]. On the other hand, its results can be 33 biased, if the sampling process is unknown and wrongly specified in the analysis. The 34 SC generally conditions on the number and timing of the samples, even though 35 exceptions exist [17][18][19]. Hence, commonly used implementations (such as the SC in 36 BEAST 2) cannot make use of this information. Secondly, the two models treat 37 population size changes differently. For the original SC, the population size has to stay 38 constant in each region. If additional predictor data is available, the population 39 dynamic parameters are allowed to vary through time in a piecewise constant 40 fashion [20]. In contrast, the MBD allows the population size to change stochastically remains approximately at the same baseline level. But during an epidemic outbreak, the 44 population grows exponentially. At the beginning of an outbreak, stochastic variation 45 dominates the dynamics [21]. 46 Given the models' different strengths and weaknesses it is not necessarily 47 straightforward to identify which one should be used in a given application. For 48 unstructured populations, coalescent and birth-death models have been compared in 49 simulation studies [17,22]. However, an analysis of the potential biases in the more 50 complex case of structured populations is lacking. In this paper, we investigate the 51 performance of the SC, as implemented in MultiTypeTree [3] (in particular assuming 52 constant population sizes and conditioning on sampling), and the MBD, as implemented 53 in BDMM [5]

59
To compare the structured coalescent (SC) and the multitype birth-death (MBD) 60 models, we analysed their performance on simulated data mimicking endemic and 61 epidemic, infectious diseases. In the endemic scenario, we simulated phylogenetic trees 62 using the SC with a constant population size (Fig 1, left panel). For the epidemic 63 outbreak, we used the MBD with exponential growth (Fig 1, right  Then, we applied the SC and the MBD to infer the posterior distribution over the 68 migration rates. We evaluate their performance using three metrics. Firstly, the 69 coverage assesses how often the true migration rate is contained within the 95% highest 70 posterior density (HPD) interval of the posterior. As a method can recover a parameter 71 arbitrarily well by having wider HPD intervals, we also compare the relative HPD width. 72 Additionally, we determine the relative root mean square error (RMSE) of each method. 73 Here, a relative metric is attained by dividing the original metric by the true migration 74 rate. For every migration case, each method infers 100 posterior distributions (one for 75 every simulated tree). Therefore, we report the mean metrics and their standard 76 deviation per parameter case and model.   are collected, respectively.

106
Overall, the MBD performs better than the SC over all scenarios and settings. As an 107 October 30, 2020 4/21  Further, the RMSE of m rb decreases from ≈ 76 to 10 but it increases for m br from 35 to 114 ≈ 130. Hence, a high sampling proportion does not change the results qualitatively.

115
In summary, the MBD performs better for epidemic diseases or generally, when the 116 population size increases exponentially. Migration rate estimates under the SC show low 117 coverage and high RMSE. We hypothesise that these inaccurate estimates are due to 118 the SC's assumption of a constant population size. In the following section, we will 119 investigate how exactly the SC fits the data when population structure and size changes 120 are present.

122
The SC and exponential growth 123 To understand the source of the SC's inaccurate estimates, we visually inspected the 124 coloured trees it infers (example in S6 Fig, panel A). For fast migration, we observed 125 that migration mostly flows one-way, e.g. only from red to blue in the exemplary tree. 126 This is corroborated by the inferred migration rates (S6 Fig, panel B), which are 127 negatively correlated (τ = −0.7, p-value< 0.001). Further, the root location is tied to 128 the higher migration rate, e.g. right of the black line in S6 Fig, panel B, the higher 129 migration rate m rb always coincides with the root location r. That means the SC infers 130 a migration process along the tree that starts in either of both locations and then only 131 allows migration into the other location. This strongly deviates from the true migration 132 process (S6 Fig, panel A). Further, only the coalescence rate from the root location is 133 constrained, e.g. λ r , right of the vertical line in panel D while the HPD interval of the 134 other can span between [1 − 0.003]. 135 We do not find those patterns in the MBD estimates (S6 Fig, panel C, E). Hence, we 136 can exclude that signal for the above mentioned correlations was present in the 137 simulated data. For slow migration, the anticorrelation within and between migration 138 and coalescent rates becomes less pronounced in the SC analysis (see S8 Fig). Thus, 139 given stronger signal for population structure the SC can better estimate the migration 140 rates in the presence of an exponentially growing population (see also increased 141 coverage in Fig 2).

142
In summary, given weak signal for population structure (fast migration), the SC 143 infers a specific migration process along the tree that deviates strongly from the truth. 144 If the population is more structured, this pattern becomes less pronounced.

145
Root location inference 146 The root state provides information on the potential source of an epidemic. Here, we 147 compare how well the models recover the root location based on trees from the endemic 148 and epidemic scenario. We calculate the posterior probability for the root node being in 149 location r and b. Note that P(root in r) = 1 − P(root in b). We assign the root a 150 location if the posterior probability for either location is > 50% or > 90%. For the trees 151 with a marked root location we report how often the true root location was assigned in 152

165
There are several reasons why the models recover the root state better from the 166 epidemic trees. A major influence are the sampling times. In the epidemic scenario, we 167 sample continuously through time resulting in samples close to the root (compare Fig 1). 168 In contrast, the endemic trees only have samples close to the present. Hence, while for 169 the epidemic tree a simple parsimony assignment of the root state will often be correct, 170 for the endemic trees the lineage locations have to be correctly backtraced throughout 171 the tree. Additionally, the endemic trees are on average longer than the epidemic trees, 172 which, all other things being equal, is bound to increase the uncertainty in lineage 173 location.

174
Overall, both models recover the root state similarly well. This finding is interesting, 175 given that the SC estimates inaccurate migration rates from epidemic trees. It appears 176 that the inference of the root location is more robust against model misspecifications. Inference of the root location. We show how the SC and the MBD recover the root location from endemic and epidemic trees. We vary the required level of certainty in the root assignment between > 50% and > 90%. The colouring displays how many trees meet the requirement. The recovery is displayed for every migration case. The red dashed lines indicates the recovery for optimal (1.0) and random recovery (0.5).

178
We compared how well the structured coalescent (SC) and the multitype birth-death 179 (MBD) models infer migration rates and the root location from phylogenetic trees. We 180 tested their performance for different parameter regimes, inluding endemic versus 181 epidemic disease, varying migration rates and sampling intensities.

182
For endemic diseases (constant population size), both methods recover the true 183 migration rates similarly well. While the MBD shows a higher coverage for medium 184 migration, the SC consistently obtains narrower posterior intervals of the migration rate. 185 Our results are consistent with previous analyses of coalescent and birth-death models 186 for unstructured populations by Boskova et al. [22]. There, coalescent models obtained 187 narrower posterior intervals than the birth death models when estimating the growth 188 rate from coalescent trees. As Boskova et al., we hypothesize that the birth death  For epidemic outbreaks (exponentially growing populations), the MBD infers 193 migration rates with higher coverage and lower RMSE than the SC across parameter 194 regimes. The traditional, constant-rate SC obtains these inaccurate estimates, because 195 it cannot account for exponential growth. We show, that it infers a particular kind of migration process that deviates from the truth when signal for population structure is 197 weak. Our findings motivate further research to combine the exponential growth 198 coalescent with population structure.

199
For both the endemic and epidemic scenario, we found that it matters for migration 200 rate inference whether the SC prior is placed on the forward or backward migration rate. 201 Hence, practitioners should consider whether their prior knowledge is best expressed in 202 terms of the emigration (forward migration) or the immigration (backward migration) 203 rate. We implemented an option to place the prior on the forward migration rate in SC 204 analyses in the MTT package.

205
Both models infer the root location with similar coverage throughout parameter 206 regimes. This is interesting because the SC estimates inaccurate migration rates in the 207 epidemic scenario. Therefore, our simulations reveal that inferring the root state is more 208 robust compared to the migration rates. 209 We show here that phylodynamic models can reliably extract information about 210 population structure from phylogenetic trees. Such phylodynamic analyses complement 211 confirmed case data analyses, which by themselves reveal little information about the 212 underlying population structure. We envision that a future framework allowing for a 213 simultaneous, coherent analysis of both data types will provide rapid and reliable 214 information on epidemic spread in structured populations which is extremely valuable 215 information for authorities to design public health interventions.

217
Simulations 218 We simulated phylogenetic trees, that capture important characteristics of an endemic 219 disease and an epidemic outbreak. An outbreak is characterised by an exponential 220 increase in the number of cases of a disease. When case counts are still low, they vary in 221 a stochastic way. The outbreak can be adequately modelled by the multitype  The SC with two compartments i and j is parameterised by the pairwise coalescence 262 rates λ i , λ j and the backward migration rates q ij , q ji . The SC parameters quantify 263 processes that occur backward in time, i.e. from the tips of the tree toward the root.

264
The coalescence rate is the rate at which a pair of lineages coalesces as we move from 265 the tips of the tree toward the root. The migration rate q ij is the rate at which an 266 individual from location i moves to location j backward in time. We start the 267 simulation at the leaves of the tree in the present and merge the lineages together as we 268 go into the past. For each tree, we draw 50 leaf times for each location uniformly at 269 random between (0, 10). For the MBD we defined the migration rates as fast, medium 270 and slow with respect to the birth rate. For the SC we use [24] to determine the 271 coalescence rate which corresponds to the birth rate and the final population size 272 employed under the MBD model, which we used in the epidemic simulations: The factor 1 2 arises because we assume that the final population size N f consists of 274 individuals from either location in equal shares. Hence, we set λ i = λ j = 2 75 . Then, we 275 October 30, 2020 9/21 convert the forward migration rates from the epidemic simulation into the backward 276 migration rates [25]: Hence, we set q ij to {1, We fit the models with packages implemented in the BEAST 2 [26] Bayesian inference 281 framework. We use the BDMM [5] package for the MBD and the MultiTypeTree [3] 282 package for the SC inference. In the MBD analysis, we infer the posterior for three 283 parameters for each location i: the birth rate β i , the sampling proportion ψ i and the 284 forward migration rate m ij . We fix the death rates in both locations δ r = δ b = 1 to 285 avoid unidentifiability constraints.

286
In the SC inference, we estimate the posterior distribution of two parameters for 287 each location i: the coalescence rate λ i and the backward migration rate q ij . To 288 compare the inferred migration rates between both models, we convert the SC's 289 backward migration rates q ji to forward migration rates m ij by rearranging Eq. 3.

290
In both sets of analyses, we fix the tree to the truth, because we are only interested 291 in the differences arising from our tree prior choice (MBD or SC). The prior 292 distributions on all parameters can be found in S1 Table and S2 Table. All MCMC 293 analyses were run for 10 8 steps and 10% of burn-in was discarded. Additionally, we 294 verified that the effective sample size was higher than 200 for all parameters, to ensure 295 the runs were converged.

Summary statistics 297
To compare the posterior distributions on m ij inferred by the SC and the MBD, we 298 chose the following three metrics: coverage, relative 95% highest posterior density 299 (HPD) width and relative root mean square error (RMSE). The coverage quantifies, how 300 often the true parameter is contained in the credible interval of the posterior. Here, we 301 use the 95% HPD, which is the narrowest region of the posterior, that holds 95% of the 302 probability mass. The second metric, HPD width, is the width of that interval and 303 measures the precision of the estimate. Lastly, we compute the RMSE between the 304 inferred median and the true value, which quantifies how accurate the estimate is. Both 305 the HPD width and the RMSE are divided by the true migration rate and hence we 306 obtain their relative versions. Therefore we can compare them across the different 307 migration cases.

308
For each model and parameter case, we have 100 independent MCMC chains (one 309 for each tree). For each parameter of interest we compute the posterior median and 310 boundaries of the posterior 95% HPD interval from a MCMC chain (for a given tree).

311
Given this summary data, for each of the previously mentioned metrics we compute the 312 mean and standard deviation by bootstrapping over 1000 replicates using the R Boot 313 package [27]. We use bootstrapping because the distribution of the posterior medians is 314 not necessarily normal.

316
For each tree, we compute the posterior probability of the root being in location r and b. 317 Note that P(root in r) = 1 − P(root in b). We consider two methods for assigning the 318 root state. First, if the posterior probability for root state r is above 50%, we assign the 319 October 30, 2020 10/21 root the state r, and otherwise b. Second, only if the posterior probability for a root 320 state is > 90%, we assign that particular root state; otherwise we assume that the root 321 state cannot be determined. For every scenario and parameter case we compute the 322 recovery, defined as the number of times the root was assigned to the correct location. 323 We use bootstrapping to determine the mean and standard deviation or the recovery as 324 before.   Table. October 30, 2020 20/21