Robust Design for Coalescent Model Inference

Kris V Parag; Oliver G Pybus

doi:10.1101/317438

Abstract

The coalescent process models how unobserved changes in the size of a population influence the genealogical patterns of sequences sampled from that population. Estimating these hidden population size changes from reconstructed sequence phylogenies is an important problem in many biological fields. Often, population size is described by a piecewise-constant function, with each piece serving as a parameter to be estimated. Estimate quality depends both on the statistical inference method used and on the experimental protocol employed, which controls variables such as sampling and parametrisation. While there is a burgeoning literature focussed on inference method development, there is surprisingly little work on experimental design. Moreover, the existing design studies are largely simulation based, and therefore cannot provide provable or general designs. As a result, many existing protocols are heuristic or method specific. We examine three key design problems: temporal sampling for the skyline demographic coalescent model; spatial sampling for the structured coalescent; and time discretisation for sequentially Markovian coalescent models. In all cases we find that (i) working in the logarithm of the parameters to be inferred (e.g. population size), and (ii) distributing informative (e.g. coalescent) events uniformly among these log-parameters, is provably and uniquely robust. ‘Robust’ means that both the total and maximum uncertainty on our estimates are minimised and independent of their unknown true values. These results provide the first rigorous support for some known heuristics in the literature. Given its persistence among models, this two-point design may be a fundamental coalescent paradigm.
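The two-point design can be illustrated with a minimal numerical sketch. It is not the authors' derivation, and it rests on a simplifying assumption: each piece j of the piecewise-constant population size contributes m_j coalescent events with exponentially distributed waiting times whose rate is inversely proportional to N_j. Under that assumption the Fisher information about N_j is m_j / N_j^2, whereas the Fisher information about log N_j is simply m_j. All function names below are hypothetical, and the inverse Fisher information is used as a stand-in for estimate uncertainty.

```python
from itertools import product


def fisher_information(events, sizes, log_scale):
    """Per-piece Fisher information under the simplified model (an assumption).

    events[j] is the number of coalescent events assigned to piece j,
    sizes[j] is the true (unknown in practice) population size N_j.
    """
    if log_scale:
        # Information about log N_j does not depend on N_j itself.
        return [float(m) for m in events]
    # Information about N_j on the natural scale depends on its true value.
    return [m / n ** 2 for m, n in zip(events, sizes)]


def uncertainty(events, sizes, log_scale=True):
    """Return (total, maximum) of the inverse Fisher information across pieces."""
    variances = [1.0 / i for i in fisher_information(events, sizes, log_scale)]
    return sum(variances), max(variances)


def allocations(total, pieces):
    """Yield every split of `total` events over `pieces`, at least one each."""
    for combo in product(range(1, total + 1), repeat=pieces):
        if sum(combo) == total:
            yield combo


def best_allocation(total, pieces, sizes, log_scale=True):
    """Brute-force search for the allocation with the smallest uncertainty."""
    return min(
        allocations(total, pieces),
        key=lambda a: uncertainty(a, sizes, log_scale),
    )


if __name__ == "__main__":
    true_sizes = [10.0, 100.0, 1000.0]  # illustrative values only
    print(best_allocation(12, 3, true_sizes, log_scale=True))
    print(best_allocation(12, 3, true_sizes, log_scale=False))
```

In this sketch the log-scale search selects the uniform split of events regardless of the values in `true_sizes`, while the natural-scale search returns a split that shifts as those values change, which is the sense in which the log-uniform design does not depend on the unknown truth.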

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.