Improved multi-type birth-death phylodynamic inference in BEAST 2

The multi-type birth-death model with sampling is a phylodynamic model which enables quantification of past population dynamics in structured populations, based on phylogenetic trees. The BEAST 2 package bdmm implements an algorithm for numerically computing the probability density of a phylogenetic tree given the population dynamic parameters under this model. In the initial release of bdmm, analyses were limited to trees consisting of up to approximately 250 genetic samples for numerical reasons. We implemented important algorithmic changes to bdmm which dramatically increase the number of genetic samples that can be analyzed, and improve the numerical robustness and efficiency of the calculations. Being able to use bigger datasets leads to improved precision of parameter estimates. Furthermore, we report on several model extensions to bdmm, inspired by properties common to empirical datasets. We apply this improved algorithm to two partly overlapping datasets of Influenza A virus HA sequences sampled around the world, one with 500 samples, the other with only 175, for comparison. We report and compare the global migration patterns and seasonal dynamics inferred from each dataset. Availability The latest release with our updates, bdmm 0.3.5, is freely available as an open access package of BEAST 2. The source code can be accessed at https://github.com/denisekuehnert/bdmm.

j ∈ {1 . . . d}, with birth rate λ ij,k , migrates to type j with rate m ij,k (with m ii,k = 0), dies with 81 rate µ i,k , and is sampled with rate ψ i,k . At time t k , each individual of type i is sampled with 82 probability ρ i,k . Upon sampling (either with rate ψ i,k or probability ρ i,k ), the individual is 83 removed from the infectious pool with probability r i,k . We summarize all birth-rates λ ij,k in λ, 84 migration rates m ij,k in m, death rates µ i,k in µ, sampling rates ψ i,k in ψ, sampling probabilities 85 ρ i,k in ρ and removal probabilities r i,k in r, i, j ∈ {1, . . . , d}, k ∈ {1, . . . , n}. The model 86 described in Kühnert et al. [2016] is a special case of the above, assuming that migration rates are 87 constant through time (i.e. do not depend on k), removal probabilities are constant through time 88 and across types (i.e. do not depend on k and i), and that ρ i,k = 0 for k < n and i ∈ {1 . . . d}.

89
This process gives rise to complete trees on sampled and non-sampled individuals with types 90 being assigned to all branches at all times (Figure 1, left). Following each branching event, one 91 offspring is assigned to be the "left" offspring, and one the "right" offspring, each assignment has 92 probability 1 2 . In the figure, we draw the branch with assignment "left" on the left and the branch 93 with assignment "right" on the right. Such trees are called oriented trees, and considering On the other hand, if we discard these annotations but keep the types of the sampled 100 individuals, we call the resulting object a sample-typed (or tip-typed) tree (Figure 1, right).

101
Below, we state the probability density of the sampled tree (i.e. the sample-typed or branch-typed 102 tree) given the multi-type birth-death parameters λ, m, µ, ψ, ρ, r, T . This probability density is 103 obtained by integrating probability densities g from the leaf nodes (or "tips"), backwards along 104 all edges (or "branches"), to the origin of the tree. Our notation here is based on previous work

105
[ Kühnert et al., 2016, Stadler et al., 2013, and the probabilities p i,k (t) and g e i,k (t) relate to E and  Every branching event in the sampled tree gives rise to a node with degree 3 (i.e. 3 branches are 108 attached). Every sampling event gives rise to a node of degree 2 (called "sampled ancestor") or 1 109 (called "tip", as defined above). A sampling event at time t = t k , k ∈ {1, . . . , n}, is referred to as 110 Complete tree Branch-typed tree Sample-typed tree T 0 time t 1 Figure 1: Complete tree (left) and sampled trees (middle and right) obtained from a multi-type birth-death process with two types. The orange and blue dots on the trees represent sampled individuals and are coloured according to the type these individuals belong to. A ρ-sampling event happens at time t 1 . The grey squares represent degree-2 nodes added to branches crossing this event. ρ-sampling also happens at present (time T ). As seen in the complete tree, the first three individuals who were sampled were not removed from the population upon sampling, while the four individuals giving rise to the later samples were removed upon sampling.
a ρ-sampled node. All other nodes corresponding to samples are referred to as ψ-sampled nodes.

111
Further, degree-2 nodes are put at time t k on all lineages crossing time t k , k = 1, . . . , n − 1 as 112 shown at time t 1 in Figure 1. In a branch-typed tree, a node of degree 2 also occurs on a lineage at

115
We highlight that in bdmm, we assume that the most recent sampling event happens at time T .

116
This is equivalent to assuming that the sampling effort was terminated directly after the last 117 sample was collected, and overcomes the necessity for users to specify the time between the last 118 sample and the termination of the sampling effort at time T .

119
The derivation of the probability density of a sampled tree under the extended multi-type 120 birth-death model is developped in Supplementary Information (SI) (section S1).
Number of samples Number of samples the migration rates inferred from the Southern to Tropical locations arises between the two analyzed in [Kühnert et al., 2016]. Overall, we observe, as could be expected, that analysing a dataset with more samples gives a more long-term picture of the global transmission patterns and 204 reduces the general uncertainty on parameter estimates.

205
With the addition of so-called ρ-sampling events in the past, intense sampling efforts limited to 206 short periods of time (leading to many samples being taken nearly simultaneously) can be easily

226
In summary, we expect the new release of bdmm to become a standard tool for phylodynamic 227 analysis of sequencing data and phylogenetic trees from structured populations.