No-U-Turn sampling for phylogenetic trees

The inference of phylogenetic trees from sequence data has become a staple of evolutionary research. Bayesian inference of such trees is predominantly based on the Metropolis-Hastings algorithm, which is known to be inefficient for high-dimensional and correlated data. Gradient-based algorithms exist to speed up such inference. Building on recent research that uses gradient-based approaches for the inference of phylogenetic trees in a Bayesian framework, I present an algorithm which is capable of performing No-U-Turn sampling for phylogenetic trees. As an extension of Hamiltonian Monte Carlo methods, No-U-Turn sampling comes with the same benefits, such as proposing distant new states with a high acceptance probability, but eliminates the need to manually tune hyperparameters. Evaluated on real data sets, the new sampler converges faster to the target distribution. The results also indicate that the new algorithm traverses a higher number of topologies during sampling than traditional Markov chain Monte Carlo approaches. This new algorithm thus leads to a more efficient exploration of the posterior distribution of phylogenetic tree topologies.

Author summary

Phylogenetic trees are important for our understanding of evolutionary relationships. But even for a small number of entities the number of possible trees is immense. In order to search this large space efficiently and analyze the distributions of these trees, different algorithmic solutions have been proposed. A phylogenetic tree is defined not only by the way it groups the entities but also by the lengths of its branches. This two-part nature of phylogenetic trees complicates the exploration of these distributions. Building on research and algorithmic ideas for non-tree-like data and recent developments on the mathematics of trees, I propose a new method which is able to traverse the space of possible trees more rapidly. By using a search strategy guided by the data and the underlying evolutionary model, the algorithm is better able to sample from the desired distribution. I show that the new algorithm indeed samples from the correct distribution, and does so with a higher efficiency than other algorithms.

The inference of phylogenetic trees has long been a research focus for scholars from many different disciplines.
The successful application of a Hamiltonian Monte Carlo algorithm depends on a careful tuning of these two hyperparameters, the step-size ε and the number of leapfrog steps L. To alleviate this problem, the No-U-Turn Sampler (NUTS) can be used. [13] This algorithm determines the optimal step-size during the burn-in phase of the experiment and adaptively sets the number of steps, maximizing the distance between two successive proposals.

Transferring these methods to phylogenetic research is not straightforward. The reason lies in the nature of the trees. A phylogenetic tree is defined by a discrete part, the topology, and a continuous part, the branch lengths. Previous research has shown that the space of trees can be formulated as a continuous space. [14] This formulation of the tree space enabled the development of a sampler based on Hamiltonian dynamics. [15] Ji et al. [16] also propose gradient-based methods for statistical phylogenetics. However, the focus of their work is the inference of branch-specific evolutionary rates, so their project is not so much concerned with the problem of the two-part nature of the trees. Building on the idea of the No-U-Turn sampler and the Hamiltonian-style sampler for trees, I propose a No-U-Turn sampler for phylogenetic trees. The algorithm functions in a similar fashion to the classical No-U-Turn sampler and maintains its two important aspects, the tuning of the step-size and the adaptive setting of the number of steps. The crucial step is to identify a suitable distance measure. The performance of the algorithm is tested on two different data sets.

Hamiltonian dynamics provide a mathematical framework to model the dynamics of a physical system over time using the system's energy.

In two dimensions, we can visualize the dynamics as that of a frictionless puck that slides over a surface of varying height.
The state of the system consists of the position of the puck, given by a 2D vector q, and the momentum of the puck (its mass times its velocity), given by a 2D vector p. The potential energy, U(q), of the puck is proportional to the height of the surface at its current position, and its kinetic energy, K(p), is equal to |p|^2/(2m), where m is the mass of the puck. [12, p. 2]

The momentum of the puck will help it to slide up a rising slope as long as its kinetic energy is above zero. At the point where the kinetic energy becomes zero it will naturally slide down again. In a non-physical setting, the position of the puck is equated with the variable of interest and the potential energy with the negative log probability density of that variable. Only the momentum variable is set artificially. [12] By using the Hamiltonian equations, the development of p and the parameters θ over time t can be determined for d dimensions:

dθ_i/dt = ∂H/∂p_i,    dp_i/dt = −∂H/∂θ_i,    i = 1, …, d,

where H is the Hamiltonian function taking p and θ as arguments. The function H(θ, p) can be written as the sum of the potential and the kinetic energy,

H(θ, p) = U(θ) + K(p).

In practice, the continuous dynamics are discretized with the leapfrog integrator: in one update step, the leapfrog approximation is carried out for L steps of size ε. This describes the Hamiltonian Monte Carlo (HMC) algorithm. The discretization procedure necessarily introduces a slight error into the computation. The error depends on ε and L and can theoretically grow out of bounds depending on L. [17] In real-world applications, however, this is usually not an issue. [13] The leapfrog integrator fulfills all properties which are necessary for valid updates of the parameters (see Neal (2011) [12] for formal details). Algorithm 1 shows the general procedure of the Hamiltonian Monte Carlo method including the leapfrog approximation.
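As an illustration, the procedure can be sketched in Python for a toy target. The target (a standard normal, so U(q) = |q|^2/2 and ∇U(q) = q), the step-size and the number of steps are assumptions chosen for this sketch, not part of the algorithm description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: standard normal, so U(q) = |q|^2 / 2 and grad_U(q) = q.
def U(q):
    return 0.5 * np.dot(q, q)

def grad_U(q):
    return q

def leapfrog(q, p, eps, L):
    """Simulate the Hamiltonian dynamics for L leapfrog steps of size eps."""
    q, p = q.copy(), p.copy()
    p -= 0.5 * eps * grad_U(q)        # initial half step for the momentum
    for _ in range(L - 1):
        q += eps * p                  # full step for the position
        p -= eps * grad_U(q)          # full step for the momentum
    q += eps * p
    p -= 0.5 * eps * grad_U(q)        # final half step for the momentum
    return q, p

def hmc_step(q, eps=0.1, L=20):
    """One HMC update: resample the momentum, simulate, accept or reject."""
    p = rng.standard_normal(q.shape)  # artificial momentum, K(p) = |p|^2 / 2
    q_new, p_new = leapfrog(q, p, eps, L)
    # Metropolis correction compensates for the discretization error.
    dH = (U(q) + 0.5 * np.dot(p, p)) - (U(q_new) + 0.5 * np.dot(p_new, p_new))
    return q_new if np.log(rng.uniform()) < dH else q
```

Running many such updates yields samples whose empirical mean and variance approach those of the target distribution.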

Algorithm 1: HMC pseudocode, following [13], with L denoting the logarithm of the joint density of the variable θ.

What remains for the user to do is to specify L and ε. In practice, setting these two parameters can be tedious. As Hoffman & Gelman (2014) [13] point out, a too small ε
will waste precious computation time, while a too large ε will produce inaccurate samples. Similarly, if L is too small, HMC mirrors random-walk behavior, and if L is too large, HMC will start to loop back and retrace its steps. In some severe cases, if L is too large, some of the formal properties making the leapfrog method a valid sampler may even break down. [12] Thus fine-tuning L and ε is very important.
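The effect of ε on the discretization error can be demonstrated on a one-dimensional toy example (a standard normal target, an assumption made for this sketch): the error in the Hamiltonian stays small for a small step-size but diverges once ε exceeds the stability limit of the integrator.

```python
# Leapfrog on a 1-D standard normal target (U(q) = q^2 / 2), tracking how far
# the Hamiltonian drifts from its initial value after L steps of size eps.
def energy_error(eps, L):
    q, p = 1.0, 1.0
    H0 = 0.5 * (q * q + p * p)
    for _ in range(L):
        p -= 0.5 * eps * q            # half step for the momentum
        q += eps * p                  # full step for the position
        p -= 0.5 * eps * q            # half step for the momentum
    return abs(0.5 * (q * q + p * p) - H0)
```

For this target the stability limit is ε = 2: `energy_error(0.01, 100)` is negligible, while `energy_error(2.5, 100)` blows up, mirroring the breakdown described above.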

NUTS uses a stochastic optimization technique to optimize ε during the burn-in. [13,18] The starting value for ε is set by either doubling or halving ε until the probability of accepting an HMC update with L = 1 crosses 0.5. Usually the seed value for ε is 1. By monitoring the acceptance probability of the proposed HMC moves during burn-in, ε is altered using the dual averaging method [19] such that the overall acceptance probability approximates a given target value.
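The initialization heuristic can be sketched as follows. The standard normal target and the name `find_reasonable_eps` are assumptions of this sketch; the doubling/halving rule itself follows the description above.

```python
import numpy as np

# Toy target: standard normal, U(q) = |q|^2 / 2, grad_U(q) = q.
def U(q):
    return 0.5 * np.dot(q, q)

def grad_U(q):
    return q

def find_reasonable_eps(q, rng=np.random.default_rng(1)):
    """Double or halve eps until the acceptance probability of a single
    leapfrog step (L = 1) crosses 0.5, starting from the seed value 1."""
    eps = 1.0
    p = rng.standard_normal(q.shape)

    def accept_prob(eps):
        # one leapfrog step
        p1 = p - 0.5 * eps * grad_U(q)
        q1 = q + eps * p1
        p1 = p1 - 0.5 * eps * grad_U(q1)
        log_ratio = (U(q) + 0.5 * np.dot(p, p)) - (U(q1) + 0.5 * np.dot(p1, p1))
        return min(1.0, np.exp(log_ratio))

    # a = +1 -> keep doubling, a = -1 -> keep halving
    a = 1.0 if accept_prob(eps) > 0.5 else -1.0
    while accept_prob(eps) ** a > 2.0 ** (-a):
        eps *= 2.0 ** a
    return eps
```

The returned value is only a starting point; during burn-in it is then refined by dual averaging as described above.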

In contrast to the original HMC sampling scheme, where the number of leapfrog steps L is a fixed parameter, the number of steps in NUTS can change in each generation. Importantly, L is not set randomly but rather such that it maximizes the distance between successive proposals while also avoiding a loop back. Furthermore, varying the number of steps accounts for the shape of the probability density surface in the current region. The algorithm proceeds by building a set of candidate states following the slice sampling approach [20], i.e. a proposed state is added to the set only if it lies within the slice. This set building procedure continues until one of two stopping criteria is met. The state reached by simulating the dynamics forwards in time will be called θ+, and θ− will be the result of a backwards-in-time simulation. Starting from the initial state, the time direction is chosen randomly. For the first iteration, the Hamiltonian dynamics are simulated for one step. For the next iteration, the dynamics are simulated for two steps, and for the next for four steps, so every new iteration doubles the number of leapfrog steps taken. This procedure implicitly builds a balanced binary tree with the leaves corresponding to pairs of position-momentum vectors q and p. [13] More formally, let C be the set of candidate states and let B ⊇ C be the set of states a leapfrog integrator visits during an iteration. Hoffman & Gelman [13] show that as long as the initial state of the current generation is an element of C, and all other elements of C lie in the current slice and do not violate the stopping criteria, sampling a random element from C is a valid update. Building up C in this way is only valid if the method doing so preserves volume. Neal (2011) [12] shows that the leapfrog integrator does preserve volume, so it can be used to propose new states for C.
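The stopping rule that detects a loop back can be written down compactly: doubling continues only while further simulation in either time direction would still increase the distance between the two endpoints. The vectors in the usage note are illustrative values for this sketch.

```python
import numpy as np

def no_u_turn(theta_minus, p_minus, theta_plus, p_plus):
    """U-turn criterion of Hoffman & Gelman [13]: keep doubling only while
    the momenta at both endpoints still point away from each other along
    the line connecting the leftmost and rightmost states."""
    dtheta = theta_plus - theta_minus
    return np.dot(dtheta, p_minus) >= 0 and np.dot(dtheta, p_plus) >= 0
```

For endpoints moving apart, e.g. momenta `[1, 0]` at both ends of the segment from `[0, 0]` to `[2, 0]`, the criterion holds and doubling continues; if the rightmost momentum flips to `[-1, 0]`, the trajectory has started to turn back and the doubling stops.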
While the algorithm for building candidate sets and sampling from them fulfills all necessary conditions for being a valid sampler and also proposes states distant from the starting state, Hoffman & Gelman [13] modify their initial algorithm slightly to improve its memory efficiency and to let it make larger jumps on average.

The set of all binary phylogenetic trees with N leaves will be called T_N. An element of the set T_N is a pair (τ, q), where τ defines the topology and each q_e ∈ q is the length of a particular edge. Let Γ be the set of all τ. The size of q equals the number of edges of τ. Pendant edges are those incident to leaves; any other edge is called internal.

Consider a function f(τ) → q which maps each edge of τ to a specific length. This construction of a continuous tree space gives rise to an intuitive neighboring relation between the trees embedded in this space.

Imagine reducing the length of an internal edge of a particular tree to zero. In this case, the two nodes connected by this edge would collapse into one vertex of degree four. By expanding the resulting degree-four vertex into an edge and two vertices of degree three, a new tree topology can be created. This operation is also known as a nearest neighbor interchange (NNI) operation (see Fig. 1). [15,24] By using this relation, ...
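A minimal sketch of the NNI rearrangements around one internal edge, assuming a deliberately simple representation: the pair `((A, B), (C, D))` stands for an internal edge with subtrees A, B attached on one side and C, D on the other. Collapsing the edge and re-expanding it yields the two alternative topologies.

```python
def nni_neighbors(edge):
    """Return the two alternative topologies reachable from the given
    internal edge by a nearest neighbor interchange: collapsing the edge
    merges its endpoints into one degree-four vertex, and re-expanding
    that vertex regroups the four attached subtrees."""
    (A, B), (C, D) = edge
    return [((A, C), (B, D)),   # first NNI rearrangement
            ((A, D), (B, C))]   # second NNI rearrangement
```

For example, the edge `(("a", "b"), ("c", "d"))` has the two NNI neighbors `(("a", "c"), ("b", "d"))` and `(("a", "d"), ("b", "c"))`.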
Fig 1. The rightmost tree can be generated from the leftmost by an NNI move. The tree in the middle displays the situation where the length of the branch t_1 has been reduced to 0, from which the new tree can be generated.
Importantly, Billera, Holmes & Vogtmann [14] show that this space of trees has certain properties such that a geodesic between two trees exists. Thus, the distance between two trees in this space is properly defined. Amenta et al. [25] show how the ...

... where P_{x,y}(t) is the probability of transitioning from state y to state x in time t, and E(τ, q) is the set of all edges of the tree, such that (u, v) ∈ E(τ, q), where u is the ...

Since the space within one orthant is continuous, the traditional HMC procedure can be applied. It only becomes problematic when the boundary of an orthant is reached, which means that one particular element of the branch length vector becomes zero or even negative. Through the way the space of trees is defined, it is possible to interpret the scenario in which an element of q becomes zero. In the case of a rooted binary tree there are three topologies, two generated via an NNI operation and the original, which share this particular facet. The sampler picks one of these three orthants at random and continues the simulation in the newly selected orthant. The momentum corresponding to that particular coordinate is then negated. The Hamiltonian remains unaffected by the described technique. [15] This defines the "leap-prog" integrator, the probabilistic counterpart to the standard leapfrog integrator. If multiple elements of q hit zero in a single update step, the random selection process is carried out for each of these elements. Thus, it is possible to visit multiple tree topologies in just one leap-prog step.
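A simplified sketch of this boundary handling in Python. The topology object, the `nni_alternatives` helper and the reflection of the crossing coordinate are assumptions made for this sketch, not the exact integrator of Dinh et al. [15]; the sketch only illustrates the random orthant choice and the momentum negation.

```python
import numpy as np

def leapprog_position_step(q, p, eps, topology, nni_alternatives, rng):
    """One position update with boundary handling: whenever a branch length
    crosses zero, reflect it back into the orthant, negate the matching
    momentum coordinate, and pick uniformly among the three topologies
    (the original plus its two NNI rearrangements) sharing that facet."""
    q = q + eps * p
    p = p.copy()
    for e in np.where(q < 0)[0]:
        q[e] = -q[e]            # reflect the branch length into the orthant
        p[e] = -p[e]            # negate the momentum of that coordinate
        # hypothetical helper returning the two NNI alternatives at edge e
        topology = rng.choice([topology] + nni_alternatives(topology, e))
    return q, p, topology
```

Since the kinetic energy depends on p only through |p|^2, negating a momentum coordinate leaves the Hamiltonian unchanged, as stated above.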

As an extension to this simple algorithm, Dinh et al. [15] note that due to the discontinuity of the space of trees, the potential energy function on τ and q, which is the likelihood function, is not differentiable on the whole space, which may lead to a loss of accuracy that cannot be neglected. To circumvent this problem the authors propose a smooth approximation for (τ, q). [15] This smoothing function is applied to each element of q and ensures that the gradient vanishes when the element approaches 0, making the derivative continuous across orthant boundaries. The Hamiltonian dynamics defined above uphold the reversibility, volume preservation and k-accessibility properties of traditional Hamiltonian dynamics. [15]

No-U-Turn Sampling in the Tree Space

There are two aspects of NUTS which need to be adapted to define the phylogenetic No-U-Turn sampler (p-NUTS). The first is to provide an HMC sampler in the space of trees and the second is to define an appropriate stopping rule for the NUTS tree building procedure.

Replacing the leapfrog integrator in the No-U-Turn sampler with the leap-prog integrator described by Dinh et al. [15] is straightforward. Since the leap-prog integrator preserves volume and is reversible, like the leapfrog integrator, the replacement does not impact the target distribution. [15] Thus, using the leap-prog integrator ...

The second part of the No-U-Turn sampler is the automated tuning of the step-size. Hoffman & Gelman [13] propose to adapt the step-size with respect to a target acceptance probability. For the p-NUTS sampler there is a second target quantity which should influence the optimal step-size: the number of NNI moves. Let H_NUTS be the average acceptance probability statistic proposed by Hoffman & Gelman [13]. Then H_t = δ − H_NUTS, with δ as the target average acceptance probability, is a statistic describing the behavior of the MCMC which can be used as a condition for the dual averaging scheme. In order to incorporate the number of NNI moves, let η be the target number of NNI moves and H_PN be the estimate of the average number of NNI moves during the last iteration of the tree building process. Let H_p be the statistic describing the behavior of the MCMC with regard to the number of NNI moves.
This statistic is nonincreasing in the step-size, which is a necessary condition, and can thus be used to describe the behaviour of the MCMC as well. In order to combine H_t and H_p ...
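The dual averaging scheme referenced here can be sketched generically: given a sequence of statistics H_1, …, H_t (for instance H_t = δ − H_NUTS), the log step-size is nudged so that the running average of H approaches zero. The constants μ, γ, t0 and κ follow values commonly used with NUTS, and `dual_averaging` is a name chosen for this sketch.

```python
import numpy as np

def dual_averaging(H_stats, mu=np.log(10 * 0.1), gamma=0.05, t0=10, kappa=0.75):
    """Dual averaging [19]: adapt the log step-size from a stream of
    statistics H_t so that their running average is driven toward zero.
    mu is the log of ten times the initial step-size (here 0.1)."""
    log_eps_bar, H_bar = 0.0, 0.0
    for t, H in enumerate(H_stats, start=1):
        H_bar += (H - H_bar) / (t + t0)              # running average of H
        log_eps = mu - np.sqrt(t) / gamma * H_bar    # aggressive proposal
        eta = t ** (-kappa)
        log_eps_bar = eta * log_eps + (1 - eta) * log_eps_bar  # smoothed value
    return np.exp(log_eps_bar)
```

Positive statistics (acceptance below the target δ) drive the step-size down; negative statistics drive it up, which is exactly the monotonicity that H_t and H_p are required to have.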

The first step in assessing the correctness of the new sampler is to check whether it samples from the correct distribution; the second step is to assess the speed of convergence of the new p-NUTS algorithm. Assuming that MrBayes indeed samples from the correct distribution, MrBayes was run for 10^6 generations. Since the aim is not to infer a perfect phylogenetic model for the three different data sets, but to test how well p-NUTS can sample tree structures, the only parameters inferred are the trees and the equilibrium frequency of the presence/absence of the binary coding character.
Table 2 provides a summary of the results of the different analyses. Overall, the ... (see Fig. 2).

The next step in evaluating p-NUTS is assessing its speed of convergence. In order to do so, I will focus only on the tree length. Figure 3 shows the development of the shrink factor against the number of iterations in the MCMC runs. As can be seen from the plot, the shrink factor rapidly approaches 1 for each of the p-NUTS runs. This is somewhat expected, as already the original NUTS sampler is able to produce nearly independent samples in every generation; this behavior is mirrored by p-NUTS. The result is in line with the observations made for the probabilistic path sampler. [15]

Fig 3. Development of the shrink factor against the percentage of sampled iterations for each of the data sets. [31,32]

Another statistic supports the effectiveness of the p-NUTS sampler. The numbers in Table 3 list the average number of NNI moves per generation for the p-NUTS sampler. Although these are not necessarily different topologies, the p-NUTS sampler still traverses many more topologies per generation than MrBayes. In addition to calculating the likelihood function, p-NUTS requires several evaluations of the gradient, which adds further computational burden. Implemented in the Julia programming language [33], the current version of p-NUTS can calculate 50000 generations in one and a half hours for the Dravidian data set, which amounts to roughly 555 generations per minute. 7 In addition to the costly p-NUTS sampler for the phylogenetic tree, the equilibrium frequencies are sampled using a slice sampler.

5 Care has to be taken in setting η. If the value is too large, the step-size parameter might get too large and the algorithm becomes unstable (see Neal 2011 [12]).
6 It has to be kept in mind that p-NUTS has a lower bound on the branch lengths through the usage of the smoothing parameter δ. This may lead to slightly different estimates of the exact branch lengths of the tree.
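The shrink factor used here is the potential scale reduction factor of Gelman & Rubin. For a scalar statistic such as the tree length it can be computed as follows; `shrink_factor` is a name chosen for this sketch.

```python
import numpy as np

def shrink_factor(chains):
    """Gelman-Rubin potential scale reduction factor for m chains of
    length n, given as an (m, n) array of a scalar statistic (e.g. the
    tree length per generation). Values close to 1 indicate convergence."""
    chains = np.asarray(chains)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()       # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)     # between-chain variance
    var_hat = (n - 1) / n * W + B / n           # pooled variance estimate
    return np.sqrt(var_hat / W)
```

Chains sampling the same distribution give a factor close to 1, while chains stuck in different regions inflate the between-chain variance and push the factor well above 1.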
The algorithm presented in this paper can be used to infer a posterior distribution of phylogenetic trees based on a gradient-based sampling scheme. Ji et al. [16] also propose gradient-based methods for statistical phylogenetics. 9 However, the focus of their paper is the inference of branch-specific evolutionary rates. Although based on a similar algorithmic idea, the focus of the two projects is clearly different. In addition to their results, Ji et al. [16] show that the gradient calculation can be done in time linear in the number of sampled sequences. Experiments were conducted with both an ...

The paper proposes an algorithm which is capable of performing No-U-Turn sampling for phylogenetic trees. Despite the complicated structure of the space of trees, Billera, Holmes & Vogtmann [14] were able to define the space of trees in such a way that Hamiltonian Monte Carlo sampling is possible. The leap-prog approach for sampling phylogenetic trees, brought forward by Dinh et al. [15], lends itself naturally to building an algorithm for phylogenetic No-U-Turn sampling. The results show that better mixing and faster convergence are indeed achieved by using p-NUTS to sample phylogenetic trees.

7 Times are measured on an AMD Ryzen 9 3900X CPU with 12 cores at 2.9 GHz, although the particular version of p-NUTS used at most 5 cores.
8 On a side note, MrBayes is implemented in the C programming language. To compare p-NUTS and MrBayes on an equal footing, both approaches would need to be written in the same programming language. This, however, would likely not change the conclusion of the runtime comparison, because of the increased complexity of p-NUTS. See Hoffman & Gelman (2014) [13] for a short discussion of the run time of Hamiltonian Monte Carlo methods.
9 See also Kenney & Hu [34] for derivations of the gradient of the likelihood function.