Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites

J Theor Biol. 2015 Jun 7;374:35-47. doi: 10.1016/j.jtbi.2015.03.006. Epub 2015 Mar 17.

Abstract

The inference of the evolutionary history of a collection of organisms is a problem of fundamental importance in evolutionary biology. The abundance of DNA sequence data arising from genome sequencing projects has led to significant challenges in the inference of these phylogenetic relationships. Among these challenges is the inference of the evolutionary history of a collection of species based on sequence information from several distinct genes sampled throughout the genome. It is widely accepted that each individual gene has its own phylogeny, which may not agree with the species tree. Many possible causes of this gene tree incongruence are known. The best studied is incomplete lineage sorting, which is commonly modeled by the coalescent process. Numerous methods based on the coalescent process have been proposed for the estimation of the phylogenetic species tree given DNA sequence data. However, use of these methods assumes that the phylogenetic species tree can be identified from DNA sequence data at the leaves of the tree, although this has not been formally established. We prove that the unrooted topology of the n-leaf phylogenetic species tree is generically identifiable given observed data at the leaves of the tree that are assumed to have arisen from the coalescent process under a time-reversible substitution process with the possibility of site-specific rate variation modeled by the discrete gamma distribution and a proportion of invariable sites.
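
As an illustration of how incomplete lineage sorting produces gene trees that disagree with the species tree, the sketch below evaluates the standard coalescent probabilities for the simplest case: a rooted three-species tree ((A,B),C) with one sampled lineage per species. It is not taken from the paper, which treats unrooted n-leaf trees with substitution models. The function name `gene_tree_probs` and the parameter `t` (internal branch length in coalescent units) are assumptions introduced here for illustration.

```python
import math


def gene_tree_probs(t):
    """Gene tree topology probabilities for species tree ((A,B),C).

    t is the length of the internal branch in coalescent units
    (hypothetical parameter name). One lineage is sampled per species.
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    # Probability that the A and B lineages fail to coalesce
    # within the internal branch of the species tree.
    p_no_coalescence = math.exp(-t)
    # If they fail, all three lineages enter the root population and
    # each of the three topologies is equally likely.
    discordant = p_no_coalescence / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,  # matches the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


if __name__ == "__main__":
    for t in (0.1, 1.0, 5.0):
        print(t, gene_tree_probs(t))
```

Under these assumptions, short internal branches make the three topologies nearly equiprobable, while long internal branches make the gene tree almost always match the species tree.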

Keywords: Algebraic statistics; Identifiability; Phylogenetics.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms
  • DNA / analysis
  • Evolution, Molecular
  • Genetic Speciation*
  • Markov Chains
  • Models, Biological*
  • Mutation
  • Phylogeny*
  • Probability
  • Sequence Analysis, DNA
  • Time Factors

Substances

  • DNA