Markov Katana: a Novel Method for Bayesian Resampling of Parameter Space Applied to Phylogenetic Trees

Phylogenetic inference requires a means to search phylogenetic tree space. This is usually achieved using progressive algorithms that propose and test small alterations in the current tree topology and branch lengths. Current programs search tree topology space using branch-swapping algorithms, but proposals do not discriminate well between swaps likely to succeed or fail. When applied to datasets with many taxa, the huge number of possible topologies slows these programs dramatically. To overcome this, we developed a statistical approach for proposal generation in Bayesian analysis, and evaluated its applicability for the problem of searching phylogenetic tree space. The general idea of the approach, which we call ‘Markov katana’, is to make proposals based on a heuristic algorithm using bootstrapped subsets of the data. Such proposals induce an unintended sampling distribution that must be determined and removed to generate posterior estimates, but the cost of this extra step can in principle be small compared to the added value of more efficient parameter exploration in Markov chain Monte Carlo analyses. Our prototype application uses the simple neighbor-joining distance heuristic on data subsets to propose new reasonably likely phylogenetic trees (including topologies and branch lengths). The evolutionary model used to generate distances in our prototype was far simpler than the more complex model used to evaluate the likelihood of phylogenies based on the full dataset. This prototype implementation indicates that the Markov katana approach could be easily incorporated into existing phylogenetic search programs and may prove a useful alternative in conjunction with existing methods. The general features of this statistical approach may also prove useful in disciplines other than phylogenetics. We demonstrate that this method can be used to efficiently estimate a Bayesian posterior.

make proposals based on a heuristic algorithm using bootstrapped subsets of the data. Such 23 proposals induce an unintended sampling distribution that must be determined and removed to 24 generate posterior estimates, but the cost of this extra step can in principle be small compared to 25 the added value of more efficient parameter exploration in Markov chain Monte Carlo analyses. 26 Our prototype application uses the simple neighbor-joining distance heuristic on data subsets to 27 propose new reasonably likely phylogenetic trees (including topologies and branch lengths). The 28 evolutionary model used to generate distances in our prototype was far simpler than the more 29 complex model used to evaluate the likelihood of phylogenies based on the full dataset. This 30 prototype implementation indicates that the Markov katana approach could be easily 31 incorporated into existing phylogenetic search programs and may prove a useful alternative in 32 conjunction with existing methods. The general features of this statistical approach may also 33 prove useful in disciplines other than phylogenetics. We demonstrate that this method can be 34 used to efficiently estimate a Bayesian posterior. 2009). Despite its importance, phylogenetic inference is difficult partly because searching tree 42 space is an NP-hard problem (Bodlaender 1992;Brocchieri 2001). Distance-based methods such 43 as neighbor-joining (NJ; (Saitou 1987)) are fast and often provide good approximate results but 44 are considered less reliable than the computationally expensive (Hershkovitz 1998;Takahashi 45 2000; Whelan 2001) likelihood-based methods (maximum likelihood, ML, and Bayesian or 46 posterior probability, PP). While distance methods generate a single tree using heuristic 47 approaches, likelihood methods must search tree space, generally by running an optimization 48 scheme or Markov chain Monte Carlo (MCMC). Tree space is often searched using various 49 forms of branch swapping (Felsenstein 1981;Huelsenbeck 1997Huelsenbeck , 2001Sullivan 2005;50 Anisimova 2006). A cautious approach to interpreting results from traditional branch-swapping 51 algorithms is warranted, particularly for trees with sequences from many taxa (Mossel 2005). 52 The principle confounding effect in phylogenetic inference is that multiple substitutions 53 may occur at the same site. Distance-based methods are inferior to likelihood-based methods in 54 accurately inferring multiple substitutions (Felsenstein 1984;Huelsenbeck 1996;Xia 2006). 55 Distance-based methods are also far more strongly biased by long-branch attraction and cannot 56 fully incorporate the advantages of site-specific models of evolution (Huelsenbeck 1995(Huelsenbeck , 199757 Pollock 1998). Another major class of phylogenetic analysis, based on the principle of maximum 58 parsimony, will not be considered here because parsimony methods are far slower than distance 59 methods, and they do not accurately model evolutionary processes despite having the same 60 biases and inaccuracies as distance methods. The computational limitations of likelihood-based 61 methods become far more severe with large amounts of sequence data from highly diverse sets 62 of organisms (Pollock 2000;Sanderson 2003; A. J. de Koning 2010). For example there are 63 2.75*10 76 possible topologies relating 50 taxa (Felsenstein 2004), making exhaustive approaches 64 impossible. Branch-and-bound searches can reduce the tree space to be examined for smaller 65 trees but are insufficient for large datasets because the number of tree topologies is still too large 66 (Hendy 1982). Thus, heuristic searches must be used for large trees, evaluating trees that are 67 proximal to reasonably likely trees that have already been found. These searches are currently 68 often performed using branch-swapping algorithms such as nearest-neighbor interchange (NNI), 69 subtree pruning and regrafting (SPR) and tree bisection and reconnection (TBR) (e.g. Ronquist 70 2003; Salemi 2009). The number of NNI, SPR, and TBR neighbors of any topology increase 71 respectively as linear, quadratic, and cubic functions of the number of taxa, and the trees 72 proposed are not necessarily of similar likelihood to the known tree. Therefore, many highly 73 improbable trees are evaluated in branch-swapping algorithms, and the correct solution is not 74 guaranteed due to the presence of local optima in tree space (Mossel 2005). Branch length 75 optimization (or posterior equilibration) must also be performed after branch swapping and is an 76 additional source of computational cost. 77 Several heuristic approaches have been developed to release tree searches from local 78 optima. Ratchet methods employ multiple initial trees perturbed by bootstrap resampling to 79 ensure a less-overlapping tree space in subsequent optimizations using branch swapping (Nixon 80 1999;Vos 2003). The partial stepwise addition (PSA) approach enables escape from local 81 optima by removing some taxa during the topology search (Whelan 2007). Simulated annealing (SA; Kirkpatrick 1983) and Metropolis-coupled Markov chain Monte Carlo (MCMCMC; Geyer 83 1991) manipulate a likely range of proposed tree acceptances in a single heuristic search or in 84 multiple interacting chains, respectively. Genetic algorithms (GAs) simulate the population 85 dynamics of tree topologies using likelihood as a fitness parameter (Matsuda 1995). These 86 methods outperform simple heuristic searches in at least some contexts. All approaches listed 87 above employ branch swapping to explore tree space and therefore suffer from inefficiency due 88 to the decoupling of topology proposals from the likelihoods of the topologies. 89 Here we consider whether the beneficial features of Bayesian analyses under relatively 90 complex models can be profitably combined with the speed of distance methods based on 91 relatively simple models. The key to our approach is that rather than using branch swapping to 92 explore phylogenetic tree space, distance-based trees predicted from partially sampled sequences

104
The 495 amino acid COI 1249-taxon mitochondrial gene alignment from Goldstein et al 105 was used (Goldstein 2015). 10 taxa were arbitrarily selected from the alignment to use for testing 106 and are shown in Table 1. (1) 118 where and are respectively the number of possible rooted and unrooted topologies with x 119 taxa, N is the total number of taxa being evaluated, and s is the smaller number of taxa that are 120 segregated on one side or the other of branch (Pickett 2005). 121

123
In initial runs, the NJ algorithm often generated unrealistically short branches, so to 124 counteract this we lengthened the shortest branches by adding a random number from 0 to 2 125 substitutions (a branch length increase of 0 to 2/495). This limited the effect of these implausibly 126 short branches in the proposal mechanism. Short branches were still possible, but extremely 127 short branches were not as likely to be proposed. 128

130
A bootstrap sampling procedure (Felsenstein 1985;Zharkikh 1995) was employed to 131 sample sites in the alignment that were then used to calculate distance matrices. Although 132 complete and partial bootstrapping has been used extensively in phylogenetic studies to evaluate 133 branch support and tree confidence (Efron 1996;Alfaro 2003), we used it solely to generate a 134 broad distribution of reasonably likely trees based on the NJ heuristic. Note that while partial 135 sampling is more common when employing the related jackknife approach, bootstrapping 136 approaches such as that employed here sample with replacement, rather than without 137 replacement as in the jackknife. Depending on the number of sites sampled (the sample size), 138 trees produced from partial sequence samples can be quite different from the ML tree of the 139 entire alignment and considerably less likely (Castoe 2009). Evaluating the posterior distribution 140 with an importance sampling approach using these trees is not feasible because the extreme 141 variation in likelihoods among trees means that a few trees would dominate the weighted importance sampling average (Kuhner 1995). Instead, it is necessary to use a progressive 143 Markov chain approach to evaluate the posterior, such as the Metropolis-Hastings algorithm 144 (Hastings 1970), in which the proposed sample depends on the current sample (Fig. 1) (3) 170 We note that this procedure is identical to obtaining a NJ partial bootstrap, but by running 171 the Markov chain with a given jump size we can obtain the connectedness among topologies, 172 providing a natural topological distance measure. 173 We then recognize that 174 , (4) 175 where , , and , are the likelihood, the genealogy sampling bias induced by 176 NJ proposals with sampling fraction f, and the prior, respectively. Here we assume a flat prior 177 across all tree topologies. The next step is to divide the uncorrected topology posterior by the 178 sampling distribution induced by the proposals to obtain 179 .
(5) 180 We normalized the corrected posteriors by dividing by the sum of all corrected posteriors over 181 all topologies sampled. 182 It may sometimes be useful and possibly more accurate to calculate branch (a.k.a. a 183 can be calculated directly (see Methods). 196

Implementation of the Markov katana
We began by analyzing a 10-taxon Cytochrome C Oxidase subunit 1 (COI) amino acid 198 alignment (495 residues) that was chosen so that there would be a moderate level of topological 199 uncertainty in the posterior (Fig. 2). Preliminary evaluations indicated that NJ trees on 200 bootstrapped data have a distribution of topologies that are relatively similar among distance 201 types (Sup. Fig. S1). Although there is considerable noise to the estimates for very small 202 frequencies, and there is a slight shift towards higher frequencies with the Markov katana 203 difference NJ, overall the two measures have a nearly linear relationship. This gave us 204 confidence that NJ trees based on differences rather than corrected distances might be 205 sufficiently accurate for our purposes, so to keep the NJ calculations as simple and fast as 206 possible for initial testing, distances were generated using the simple difference matrix. The 207 likelihood of the proposed tree topology and branch lengths were then evaluated using the 208 mtMam substitution rate model (Yang 1998) on the entire sequences. Continuing to keep things 209 simple for initial testing, we used a flat prior, although we imagine that most future 210 implementations will want to incorporate other priors here, such as the commonly used 211 exponential priors on branch lengths (Yang 2005). 212 To understand the differences in topology sampling bias estimates obtained using 213 different sample sizes, f, Markov katana was run with sample fractions ranging from 100% (495 214 sites) down to 20% (99 sites). The topology sampling biases for smaller f become somewhat 215 more even, with the least frequent topologies about 10x more frequent for f = 20% than for 216 f=100% (Fig. 3). At the same time, the number of topologies with sampling probabilities greater 217 than 10 -6 increased from sample sizes (Fig. 4). The 232 that the answer would have been similar without the correction. It is probably best to use the 249 correction anyway, because in more complicated situations it may make more of a difference, 250 and it is not too much trouble to obtain and is correct. 251 The corrected posterior estimates appear to be most noisy when the sampling distribution 252 estimate is small and therefore poorly estimated. This is not entirely surprising given that the 253 sampling distribution is in the denominator. Because the sampling distribution calculations are 254 computationally inexpensive (they do not require a likelihood calculation), it is possible to obtain 255  Acceptance probabilities varied widely depending on jump size. In general, 5-10 sites 292 appears to be a minimum, and 50 sites is probably a maximum. With smaller sample sizes (e.g., 293 80% shown here), the jump size is a larger proportion of the sample and reduces the acceptance 294 probability more rapidly. Jump sizes bigger than 50 have somewhat greater probability of 295 making large jumps in topology space (Fig. 6), at the cost of reduced probabilities of jumps to 296  Table 2). The size of the 303 node represents the relative 304 posterior of the topology, and 305 the edges of the graph indicate 306 NNI distances of one or two 307 between the tree topologies. 308 The tree topology space of this test data was clearly divided into two clusters of trees shown by 309 the intragroup connections and the few intergroup connections. Given the connectivity of the 310 network, other tree topology sampling procedures may have difficulty jumping between groups. 311

312
We have demonstrated here that the Markov katana bootstrapping approach to 313 phylogenetic tree searching can be a highly effective means for finding Bayesian posterior 314 topologies and branches. It is able to take advantage of the speed of approximate distance-based 315 methods to propose new trees, but retains the reliability of Bayesian methods. Many previous 316 phylogenetic tree-search methods use the provided sequences for only the likelihood 317 calculations, but Markov katana introduces a new way to explore tree space informed by the 318 sequences. Including the sequence data in the tree search improves the fraction of high likelihood 319 trees proposed and allows efficient jump proposals between even distant topologies. 320 For the 10-taxon dataset, the NJ algorithm is extremely fast, and the overall speed of the 321 MK computation was limited by the likelihood calculations. As the number of taxa grows 322 beyond ~200, the NJ algorithm slows dramatically and dominates computation times (data not 323 shown). This could be alleviated using fast heuristic NJ algorithms or external programs such as 324 RapidNJ that are optimized for large alignments (Simonsen 2008). Our current implementation 325 calculates the distance contribution of each site only once and so is not hindered by the 326 complexity of the distance measure. We did not see a great difference in the proposal bias for the 327 two distance measures we compared, but further exploration of the performance of alternative 328 distances may in some cases be warranted. 329 We used PAML for the likelihood calculations, but any program that computes 330 likelihoods could potentially be used. The simplicity and adjustability of the approach means that 331 it could be easily incorporated into existing sequence analysis packages (e.g., MrBayes, PAUP*, 332     Table 2. Posterior probability for topologies with substantial representation in the 536 uncorrected posterior for the 10-taxon dataset. The topologies are labeled for reference in Figure  537