The effect of ambiguous data on phylogenetic estimates obtained by maximum likelihood and Bayesian inference

Syst Biol. 2009 Feb;58(1):130-45. doi: 10.1093/sysbio/syp017. Epub 2009 May 22.

Abstract

Although an increasing number of phylogenetic data sets are incomplete, the effect of ambiguous data on phylogenetic accuracy is not well understood. We use 4-taxon simulations to study the effects of ambiguous data (i.e., missing characters or gaps) in maximum likelihood (ML) and Bayesian frameworks. By introducing ambiguous data in a way that removes confounding factors, we provide the first clear understanding of one mechanism by which ambiguous data can mislead phylogenetic analyses. We find that in both ML and Bayesian frameworks, among-site rate variation can interact with ambiguous data to produce misleading estimates of topology and branch lengths. Furthermore, within a Bayesian framework, priors on branch lengths and rate heterogeneity parameters can exacerbate the effects of ambiguous data, resulting in strongly misleading bipartition posterior probabilities. The magnitude and direction of the ambiguous data bias are a function of the number and taxonomic distribution of ambiguous characters, the strength of topological support, and whether the model is correctly specified. The results of this study have major implications for all analyses that rely on accurate estimates of topology or branch lengths, including divergence time estimation, ancestral state reconstruction, tree-dependent comparative methods, rate variation analysis, phylogenetic hypothesis testing, and phylogeographic analysis.
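
The abstract does not spell out how ambiguous characters enter the likelihood. As a minimal sketch, assuming the common convention that a missing character or gap contributes a conditional likelihood of 1 for every state (so the site likelihood sums over all possible states at that tip), the calculation for a single site on a 4-taxon tree ((A,B),(C,D)) could look like the following. The Jukes-Cantor model, the function names, and the branch lengths are illustrative assumptions, not the simulation setup or software used in the study, and among-site rate variation is left out for brevity.

```python
import math

STATES = "ACGT"

def jc69_prob(same: bool, t: float) -> float:
    """Jukes-Cantor transition probability over a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def tip_partial(symbol: str) -> list[float]:
    """Conditional likelihoods at a tip; any non-ACGT symbol is ambiguous."""
    if symbol in STATES:
        return [1.0 if s == symbol else 0.0 for s in STATES]
    return [1.0] * 4  # assumed convention: ambiguous data fit every state

def propagate(partial: list[float], t: float) -> list[float]:
    """Likelihood of the data below a branch, given the state at its top."""
    return [
        sum(jc69_prob(i == j, t) * partial[j] for j in range(4))
        for i in range(4)
    ]

def site_likelihood(site: str, branch_lengths: tuple) -> float:
    """Likelihood of one site on the tree ((A,B),(C,D)).

    branch_lengths = (tA, tB, tC, tD, t_internal); all values hypothetical.
    """
    a, b, c, d = (tip_partial(s) for s in site)
    t_a, t_b, t_c, t_d, t_int = branch_lengths
    left = [x * y for x, y in zip(propagate(a, t_a), propagate(b, t_b))]
    right = [x * y for x, y in zip(propagate(c, t_c), propagate(d, t_d))]
    # The model is reversible, so the root can sit at the right internal node.
    left_seen_from_right = propagate(left, t_int)
    return sum(0.25 * l * r for l, r in zip(left_seen_from_right, right))

if __name__ == "__main__":
    lengths = (0.1, 0.5, 0.1, 0.5, 0.05)
    print(site_likelihood("AC?T", lengths))
    print(sum(site_likelihood("AC" + s + "T", lengths) for s in STATES))
```

Under this convention the two printed values agree, because an ambiguous tip is equivalent to summing the likelihood over every state that tip could have taken. A site with an ambiguous character therefore carries no direct information about the branch leading to that taxon, which is one reason parameters estimated from the other sites can have a disproportionate influence there.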

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Animals
  • Bayes Theorem
  • Biological Evolution
  • Likelihood Functions
  • Models, Genetic*
  • Phylogeny*
  • Urodela / classification*
  • Urodela / genetics