A linear-time algorithm to sample the dual-birth model

The ability to sample models of tree evolution is essential in the analysis and interpretation of phylogenetic trees. The dual-birth model is an extension of the traditional birth-only model and allows for sampling trees of varying degrees of balance. However, for a tree with n leaves, the tree sampling algorithm proposed in the original paper is 𝒪(n log n). I propose an algorithm to sample trees under the dual-birth model in 𝒪(n), and I provide a fast C++ implementation of the proposed algorithm.


I. INTRODUCTION
The dual-birth model is a birth-only model in which each branch is a Poisson process with one of two states: active, in which splits occur with rate λ b , and inactive, in which splits occur with rate λ a . It is typically parameterized by λ = λ a + λ b and r = λ a /λ b [1]. When a branch splits, one child branch is active and the other is inactive.
Simulating trees under the dual-birth model can provide null distributions describing neutral evolutionary process, which can be rejected when inferring trees from the data. Further, Maximum Likelihood (ML) tree inference techniques tend to infer significantly overbalanced trees when true trees are extremely unbalanced [1], and the dual-birth model allows for researchers to simulate trees of varying balance by adjusting its r parameter, making the dual-birth model useful for simulation experiments that study ML tree inference techniques. The algorithm originally proposed is O(n log n), where n is the number of leaves of the tree, but by exploiting statistical properties of Poisson processes and the exponential distribution, the sampling problem can be solved in O(n).

A. Linear-time algorithm description
Let exp (λ) denote an exponential random variable with rate λ. Let bern(p) denote an exponential random variable with probability p. Let x denote the size of set x. The algorithm starts with two sets of leaves, active (with just the root) and inactive (initially empty). The algorithm iterates while the termination condition has not yet been met. The time until the next split is sampled from exp (λ b active + λ a inactive ). The next splitting leaf is chosen by first determining its state by sampling bern (λ b active /(λ b active + λ a inactive )) (success = active, failure = inactive) and then uniformly choosing a random leaf from the corresponding set. The chosen leaf's time is set to the sampled split time, the leaf is removed from its set, two children are created, and one child is placed in active and the other in inactive. When the termination condition has been met, the times of all remaining leaves in active and inactive are set to the end time.

B. Algorithm correctness
The algorithm begins with a single active leaf at time 0. Because each branch is a Poisson process, the time until the branch splits is an exponential random variable with the branch's rate. Therefore, at a given time t with n a active leaves and n i inactive leaves, the time until the next splitting event is the minimum of n a exponential random variables, each with rate λ b , and n i exponential random variables, each with rate λ a .
The minimum of k exponential random variables X 1 , . . . , X k with rates λ 1 , . . . , λ k is an exponential random variable with rate k i=1 λ i . Therefore, the time until the next splitting event is an exponential random variable with rate n a λ b + n i λ a .
Given k exponential random variables X 1 , . . . , X k with rates λ 1 , . . . , λ k , the probability that X i is the minimum of all k random variables is λ i / k i=1 λ i . Because only one terminal branch can be the next to split, the events that pendant branch j (for all j) is the next to split are mutually exclusive. Therefore, the probability that the next branch to split is an active branch is p = n a λ b /(n a λ b +n i λ a ). Once the state of the next branch to split is determined (by sampling a Bernoulli random variable with probability p), because all leaves in that state have the same rate (and are thus equally likely to be the next to split), the next splitting leaf can be chosen uniformly from the set of leaves in the chosen state.

III. RESULTS
My implementation of the proposed algorithm closely follows the theoretical expectations of the dual-birth model (Fig. 1) and is orders of magnitude faster than the original implementation of the initially-proposed algorithm (Fig. 2).

IV. DISCUSSION
The proposed algorithm scales linearly with the number of leaves, and its C++ implementation is fast and does not have any non-STL dependencies, making it easy to build on all environments. My hope is that the fast implementation I provide can be wrapped into existing tools for the convenience of users as well as developers.

AVAILABILITY
My C++ implementation of the proposed algorithm can be found in the following GitHub repository: https://github.com/niemasd/Dual-Birth-Simulator ACKNOWLEDGMENT I thank Siavash Mirarab for his guidance in the original dual-birth model project.