RT Journal Article
SR Electronic
T1 Avoiding ascertainment bias in the maximum likelihood inference of phylogenies based on truncated data
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 186478
DO 10.1101/186478
A1 Asif Tamuri
A1 Nick Goldman
YR 2017
UL http://biorxiv.org/content/early/2017/09/09/186478.abstract
AB Some phylogenetic datasets omit data matrix positions at which all taxa share the same state. For sequence data this may be because of a focus on single nucleotide polymorphisms (SNPs) or the use of a technique such as restriction site-associated DNA sequencing (RADseq) that concentrates attention onto regions of differences. With morphological data, it is common to omit states that show no variation across the data studied. It is already known that failing to correct for the ascertainment bias of omitting constant positions can lead to overestimates of evolutionary divergence, as the lack of constant sites is explained as high divergence rather than as a deliberate data selection technique. Previous approaches to using corrections to the likelihood function in order to avoid ascertainment bias have either required knowledge of the omitted positions, or have modified the likelihood function to reflect the omitted data. In this paper we indicate that the technique used to date for this latter approach is a conditional maximum likelihood (CML) method. An alternative approach, unconditional maximum likelihood (UML), is also possible. We investigate the performance of CML and UML and find them to have almost identical performance in the phylogenetic SNP dataset context. We also make some observations about the nucleotide frequencies observed in SNP datasets, indicating that these can differ systematically from the overall equilibrium base frequencies of the substitution process. This suggests that model parameters representing base frequencies should be estimated by maximum likelihood, and not by empirical (counting) methods.
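
The conditional maximum likelihood (CML) correction mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used conditioning form, in which the likelihood of each retained site is divided by the probability that a site is variable under the model. The function name `cml_log_likelihood`, its arguments, and the numeric values are hypothetical and are not taken from the paper.

```python
import math


def cml_log_likelihood(site_log_liks, p_constant):
    """Condition a log-likelihood on every retained site being variable.

    site_log_liks: per-site log-likelihoods of the retained (variable) sites,
                   computed under the model as if no sites had been omitted.
    p_constant:    model probability that a site is constant across all taxa
                   (assumed to be supplied by the caller).
    """
    if not 0.0 <= p_constant < 1.0:
        raise ValueError("p_constant must lie in [0, 1)")
    n_sites = len(site_log_liks)
    uncorrected = sum(site_log_liks)
    # Dividing each site likelihood by (1 - p_constant) subtracts
    # log(1 - p_constant) once per retained site on the log scale.
    return uncorrected - n_sites * math.log1p(-p_constant)


# Illustrative values only: two variable sites, half of all sites expected constant.
site_log_liks = [math.log(0.1), math.log(0.2)]
corrected = cml_log_likelihood(site_log_liks, p_constant=0.5)
```

In an actual analysis, both the per-site likelihoods and `p_constant` depend on the tree and model parameters, so the corrected quantity would be recomputed at every step of the optimisation rather than applied once.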