Scalable gradients enable Hamiltonian Monte Carlo sampling for phylodynamic inference under episodic birth-death-sampling models

Birth-death models play a key role in phylodynamic analysis for their interpretation in terms of key epidemiological parameters. In particular, models with piecewise-constant rates varying at different epochs in time, which we refer to as episodic birth-death-sampling (EBDS) models, are valuable for their reflection of changing transmission dynamics over time. A challenge that persists with current time-varying model inference procedures, however, is their lack of computational efficiency. This limitation hinders the full utilization of these models in large-scale phylodynamic analyses, especially when dealing with high-dimensional parameter vectors that exhibit strong correlations. We present here a linear-time algorithm to compute the gradient of the birth-death model sampling density with respect to all time-varying parameters, and we implement this algorithm within a gradient-based Hamiltonian Monte Carlo (HMC) sampler to alleviate the computational burden of conducting inference under a wide variety of structures of, as well as priors for, EBDS processes. We assess this approach using three different real-world data examples, including the HIV epidemic in Odesa, Ukraine, seasonal influenza A/H3N2 virus dynamics in New York State, USA, and the Ebola outbreak in West Africa. HMC sampling exhibits a substantial efficiency boost, delivering a 10- to 200-fold increase in minimum effective sample size per unit time in comparison to a Metropolis-Hastings-based approach. Additionally, we show the robustness of our implementation both in allowing for flexible prior choices and in modeling the transmission dynamics of various pathogens by accurately capturing the changing trend of the viral effective reproductive number.


S2.3 Implementation Algorithm
We implement a recursive algorithm to compute the necessary gradient of the log-likelihood with respect to the rate parameters. Intermediate quantities are stored between epochs to alleviate the computational burden. The detailed algorithm, based on the equations listed in Section 2.5 and the previous sections of this supplement, is shown below.
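A minimal sketch of this bookkeeping pattern may help fix ideas. In the Python below, `epoch_term` and `epoch_grad` are placeholders for the per-epoch expressions of Section 2.5 and earlier supplement sections, not the actual formulas, and all names are ours:

```python
def loglik_gradient(K, epoch_term, epoch_grad):
    """Accumulate the log-likelihood gradient epoch by epoch.

    `epoch_term` returns the intermediate quantities for epoch k given
    those stored for epoch k - 1, and `epoch_grad` maps them to that
    epoch's gradient contribution.  Because each epoch reuses the cached
    quantities of its predecessor rather than recomputing them, the
    sweep is linear in the number of epochs K.
    """
    grad = [0.0] * K
    cache = None  # no intermediate quantities before the first epoch
    for k in range(K):
        cache = epoch_term(cache, k)   # update, reusing the stored quantities
        grad[k] = epoch_grad(cache, k)
    return grad
```

The single pass with cached intermediates is what yields linear, rather than quadratic, cost in the number of epochs.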

S3.1 HIV dynamics in Odesa, Ukraine
We base the prior settings for the compound parameters on previous work (Vasylyeva et al. 2020) and roughly match their priors by adopting the following prior distributions on each of the rate parameters. Note that the sampling proportion was fixed to 0 before the first sampling date in their study, so we also set the sampling rate to 0 for the last two epochs for consistency.

S3.3 Ebola epidemic in West Africa
We assume a constant death rate, µ, for this data set, and we employ an empirical Bayes approach to set the rate priors.

S5 Computational complexity of the nodewise likelihood
The computational complexity of evaluating node-based representations of the likelihood is much less explicit. First, we need to write out an equivalent, node-wise expression for the likelihood of Equation 1. It will be helpful to distinguish different types of samples. In particular, let us denote the serially-sampled tips ū_ψ, with a particular serially-sampled tip being ū_ψ,i. With a slight abuse of notation, let us denote the intensively-sampled tips ū_ρ, with ū_ρ,i denoting the vector of intensively-sampled tips at the i-th intensive-sampling event. Then we can write the likelihood node-wise. The complexity here is not immediately apparent for a number of reasons. For one, the complexity appears to depend on the relative proportion of samples of different types, which affects the number of values of p_k(t) and q_k(t) that must be computed. Importantly, the complexity of computing those p_k(t) and q_k(t) is not immediately apparent either, and these costs are somewhat hard to disentangle, as p_k(t_i) builds recursively on p_{k−1}(t_i) and q_k(t) depends on p_k(t).

S5.1 Node lookups
Regardless of such ambiguities, all nodes in the tree require an interval lookup. For births, the lookup is required to find the correct λ_k term to use. For samples, the lookup either finds the appropriate sampling rate (for serial samples) or determines to which intensive-sampling event a sample belongs (for intensive samples). The time requirement depends on the algorithm; for a binary search it is O(log(K)), making the total lookup cost O(N log(K)).
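For concreteness, the lookup step can be sketched in Python as follows (a sketch with names of our choosing, assuming a sorted list of the interior epoch boundaries):

```python
import bisect

def epoch_index(t, epoch_boundaries):
    """Return the index k of the epoch containing time t.

    `epoch_boundaries` is a sorted list of the K - 1 interior epoch
    boundary times.  One binary search costs O(log K).
    """
    return bisect.bisect_right(epoch_boundaries, t)

def epoch_indices(node_times, epoch_boundaries):
    """One lookup per node in the tree: O(N log K) in total."""
    return [epoch_index(t, epoch_boundaries) for t in node_times]
```

For example, with boundaries at times 1.0 and 2.0, nodes at times 0.5, 1.5, and 2.5 fall in epochs 0, 1, and 2, respectively.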
S5.2 How many computations of q_k(t) are required?
In the worst, but most common, case, there are no intensive-sampling events and q_k(t) must be computed at the times of all samples, all births, and all epoch times (note that even when ρ_i is 0, there is a term L(t_i) log(q_{i−1}(t_i)) which must be computed in the final summation).
In the best case, all samples occur at intensive-sampling events, and q_k(t) only needs to be computed at the times of all births and all epoch times. Both cases are O(N + K), though there is a factor of two's worth of variation in front of the N depending on which end of this spectrum a tree falls. Calling the cost of computing q_k(t) Q, the contribution to the complexity here is O(Q(N + K)).

S5.3 How many computations of p_k(t) are required?
The likelihood contains a number of explicit computations of p_k(t) in the terms pertaining to (both serially- and intensively-) sampled tips. When all samples are serial samples, there are O(N) direct computations of p_k(t), while when all samples are intensive samples, there are O(K). Taking the cost of computing p_k(t) to be P, the addition to the cost here is between O(PN) and O(PK).
S5.4 What is the cost of computing p_k(t) and q_k(t)?
We have thus far shown that the cost of computing the nodewise likelihood appears to be between O(N log(K) + Q(N + K) + PN) and O(N log(K) + Q(N + K) + PK). But this is not particularly revealing without considering P and Q.
While q_k(t) depends on p_l(t) for l < k through A and B, once A_k and B_k have been computed, let us assume (as we did when evaluating the cost of the interval-wise likelihood) that the additional cost of q_k(t) beyond that of p_k(t) is O(1); in other words, let us assume that O(Q(N + K)) = O(P(N + K)). This makes the implied cost of the nodewise likelihood between O(N log(K) + P(N + K) + PN) and O(N log(K) + P(N + K) + PK), both of which simplify to O(N log(K) + P(N + K)).
Naïvely, we might choose to compute p_k(t) recursively every time we need it, which is O(K²). In this case, the implied cost of the nodewise likelihood is O(N log(K) + NK + K²).

S5.5 Precomputing A and B
One can instead choose to pre-compute the A_k and B_k, as once these are computed the cost to compute p_k(t) and q_k(t) becomes O(1). Working backwards from the present allows recomputation to be avoided. As we did when we approximated the cost of the interval-wise likelihood, we will take the cost of each update (computing A_k and B_k from their predecessors) to be O(1). Thus, the cost of the precomputation is O(K). This puts the implied cost of computing the nodewise likelihood at O(N log(K) + N + K).
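The single sweep can be sketched generically in Python; `init` and `step` below stand in for the model-specific base case and O(1) per-epoch update, which we keep abstract, and the names are ours:

```python
def precompute_AB(K, init, step):
    """Pre-compute the per-epoch constants in one O(K) sweep.

    `init` is the base case at the present, and `step` maps the cached
    constants of one epoch to those of the next.  Each entry is computed
    exactly once from its predecessor, so later p_k(t) / q_k(t)
    evaluations can treat the constants as O(1) lookups instead of
    unwinding the recursion from scratch on every call.
    """
    AB = [init]
    for k in range(1, K):
        AB.append(step(AB[-1], k))  # one O(1) update per epoch
    return AB
```

With the K cached pairs in hand, each of the O(N + K) evaluations of p_k(t) and q_k(t) becomes a constant-time lookup, matching the O(N log(K) + N + K) total above.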

S5.6 Counting lineages at epoch times
Regardless of whether the model includes intensive sampling (that is, regardless of whether ρ = 0), one must compute L(t_i) for all epoch times. This can be solved essentially the same way as the subintervals are obtained, at a cost of O(N + N log(N)). Alternately, it can be obtained by counting the number of births and sampled tips older (or younger) than each epoch time, at a cost of O(KN). This makes the lower end of the computational cost once again a range.
In practice, the constants in front of all the sorting and node-lookup terms appear to be so small as to be unnoticeable in real-world computation. We demonstrate this in our timing experiments in the next section. Thus, for all practical purposes, the likelihood appears to be O(N + K) regardless of representation, as long as one avoids recursive computation of p_k(t).
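The counting strategy can be sketched in Python as follows; this is our own illustration (names are ours), assuming node ages increase into the past and a tree with an origin above the root, and using one sort plus per-epoch binary searches rather than a linear scan per epoch:

```python
import bisect

def lineage_counts(birth_ages, tip_ages, epoch_ages):
    """Compute L(t_i) for each epoch age t_i.

    Starting from one lineage at the origin, each birth older than t
    adds a lineage at t and each sampled tip older than t removes one.
    Sorting the event ages once costs O(N log N); each epoch then needs
    two binary searches, avoiding the O(KN) cost of scanning all events
    for every epoch.
    """
    births = sorted(birth_ages)
    tips = sorted(tip_ages)
    counts = []
    for t in epoch_ages:
        older_births = len(births) - bisect.bisect_right(births, t)
        older_tips = len(tips) - bisect.bisect_right(tips, t)
        counts.append(1 + older_births - older_tips)  # 1 for the origin lineage
    return counts
```

For a three-tip tree with births at ages 3 and 2 and tips at ages 0, 0, and 1, the count is 2 lineages at age 2.5 and again 2 at age 0.5 (the tip at age 1 having terminated one of the three).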

S6 Timing Experiments
With the reformulation of the likelihood and the derivation of analytical gradients, our method gains notably in speed, as we highlight in this section. For a comprehensive assessment, we compare our approach with four other specialized packages for EBDS model inference. To assess the scalability of the aforementioned methods in terms of likelihood/gradient calculation, we simulated a set of trees under the EBDS model with an increasing number of tips. To investigate the scalability of the different methods with respect to the number of sequences, we fix the number of epochs to 5 for both likelihood and gradient calculations.
Regarding scalability with respect to the number of epochs, we adjust the model by progressively increasing the number of epochs. To keep other variables constant, we maintain the tree topology and set the number of tips to 12 for likelihood computation (in scenarios where K ≫ N, this allows us to negate the effect of N in O(N + K)). For gradient calculations, we set the number of tips to 8,198. For methods that employ just-in-time (JIT) compilation, including BEAST, BEAST2, and VBSKY, we run a short MCMC chain or variational inference algorithm to compute the likelihood or gradient across 100,000 iterations and take the average run time.
In our analysis, we observe that for likelihood computations, the implementations in BEAST, BEAST2, and RevBayes offer similar speed performance when adjusting both the number of sequences and the number of epochs. In contrast, the TreePar package consistently lags, being several hundred times slower than its counterparts across all tested scenarios. It is also the sole implementation that exhibits quadratic scaling with the number of epochs. The algorithms of BEAST, BEAST2, and RevBayes demonstrate approximately linear scaling relative to both tree size and the number of model epochs. It is worth noting that RevBayes delivers the quickest calculation speed, which might be attributed to the inherent speed advantages of precompiled code, particularly for quick likelihood calculations in our context. Results for TreePar with more than 512 epochs are not included, as TreePar fails to process such large models.
In terms of gradient calculations, our analytical gradients deployed within BEAST are remarkably faster than the VBSKY approach using automatic differentiation. The gradient computation scales approximately linearly with the number of sequences for both BEAST and VBSKY. However, with respect to the number of epochs, the scaling remains linear for BEAST but appears quadratic for VBSKY. We further confirm that the runtime slowness exhibited in VBSKY is not due to memory issues or JIT compilation difficulty. Therefore, our analysis demonstrates that analytically calculating the gradients of the EBDS likelihood is critical for improving the running time of gradient-based methods.
Parameter | Symbol | Prior
Log-scale sampling rate at present | | Normal (Mean = −0.77, SD = 1.17)
Age of phylogeny | t_or | Normal (Mean = 12.5, SD = 15.0)
Exponent of the MRF | α | Fixed to 2.0
Transformed global scale of the MRF | ϕ | Gamma (Shape = 1.0, Scale = 1.0)
Local scale of MRF | ν_k | Fixed to 1.0

We use the empirical Bayes approach proposed by Magee et al. (2020) to set the prior on the first log-birth-rate and log-sampling-rate in our Bayesian bridge MRF models. The prior for the constant death rate is obtained from an estimate of the plausible duration of the infectious period, with 95% confidence intervals covering 8 to 40 days (Velásquez et al. 2015). The detailed prior distributions can be found in the table below:

Parameter | Symbol | Prior
Log-scale sampling rate at present | | Normal (Mean = 1.27, SD = 0.58)
Age of phylogeny | t_or | Normal (Mean = 1.89, SD = 15.0)
Exponent of the MRF | α | Fixed to 0.25
Transformed global scale of the MRF | ϕ | Gamma (Shape = 1.0, Scale = 1.0)
Local scale of Bayesian bridge MRF | ν_k | Exponentially tilted stable distributions
Slab width of Bayesian bridge MRF | ξ | Fixed to 2.0

Figure S1: HIV virus: Median (solid line) and 95% credible intervals (shaded areas) of the (a) birth rate, (b) death rate, and (c) sampling rate estimates through time.

Figure S2: Influenza virus: Median (solid line) and 95% credible intervals (shaded areas) of the (a) birth rate, (b) death rate, and (c) sampling rate estimates through time.

Figure S3:
These comparisons concern likelihood calculations and include the BDSKY (Stadler et al. 2013) package within BEAST2 (Bouckaert et al. 2019), the TreePar (Stadler et al. 2013) package in R (R Core Team 2021), and RevBayes (Höhna et al. 2016). Furthermore, we present a benchmark comparing the gradient calculation efficiency of the automatic differentiation implemented in the VBSKY (Ki & Terhorst 2022) package using the JAX library (Bradbury et al. 2018), isolated from the variational inference procedure, against our algorithm-based analytical gradients implemented in BEAST.

Figure S4: Speed of implementations for the likelihood calculations with an increasing number of sequences (left plot) or number of epochs (right plot) for the EBDS model. Note that the time and the number of sequences/epochs are laid out on a logarithmic scale with base 2.

Figure S5: Speed of implementations for the gradient calculations with an increasing number of sequences (left plot) or number of epochs (right plot) for the EBDS model.

Table S2: Prior specifications for the EBDS model in the Influenza virus analysis

Table S3: Prior specifications for the EBDS model in the Ebola virus analysis

S4 Inferred trajectories for birth/death/sampling rates