PT - JOURNAL ARTICLE
AU - Trefzer, Axel
AU - Stamatakis, Alexandros
TI - Compressing Streams of Phylogenetic Trees
AID - 10.1101/440644
DP - 2018 Jan 01
TA - bioRxiv
PG - 440644
4099 - http://biorxiv.org/content/early/2018/10/15/440644.short
4100 - http://biorxiv.org/content/early/2018/10/15/440644.full
AB - Bayesian Markov-Chain Monte Carlo (MCMC) methods for phylogenetic tree inference, that is, inference of the evolutionary history of distinct species using their molecular sequence data, typically generate large sets of phylogenetic trees. The trees generated by the MCMC procedure are samples of the posterior probability distribution that MCMC methods approximate. Thus, they generate a stream of correlated binary trees that need to be stored. Here, we adapt state-of-the-art algorithms for binary tree compression to phylogenetic tree data streams and extend them to also store the required meta-data. On a phylogenetic tree stream containing 1,000 trees with 500 leaves including branch length values, we achieve a compression rate of 5.4 compared to the uncompressed tree files and of 1.8 compared to bzip2-compressed tree files. For compressing the same trees, but without branch length values, our compression method is approximately an order of magnitude better than bzip2. A prototype implementation is available at https://github.com/axeltref/tree-compression.git.