Are profile mixture models over-parameterized?

Hector Baños; Edward Susko; Andrew J. Roger

doi:10.1101/2022.02.18.481053

Abstract

Site heterogeneity in the amino acid substitution process reflects biochemical constraints on the range of admissible amino acids at specific sites. Phylogenetic models of protein sequence evolution that do not account for site heterogeneity are more prone to long-branch attraction artifacts.

Profile mixture models are used to model site heterogeneity. Even though estimates of the model, tree, and mixing parameters are statistically consistent, the performance of these models on short alignments is unclear. Here we explore the behavior of tree topology estimates and marginal cumulative distributions with short simulated alignments. We find that over-parameterization is not a problem for complex profile mixture models and that simple models behave poorly. Misspecification of the frequency distributions is not a problem provided the estimated cumulative distribution function adequately approximates the true one. We also find that misspecification of the exchangeabilities can severely affect parameter estimation and that an increase in likelihood does not necessarily reflect better tree estimation. Although the inclusion of more taxa often helps, it can hurt estimation if the exchangeabilities are badly misspecified.
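For orientation, the site likelihood under a profile mixture model can be written in the standard finite-mixture form below. This is a sketch under assumed notation, not notation taken from the paper: x_i is the alignment column at site i, T the tree with branch lengths, R the exchangeability matrix, \pi_c the amino acid frequency profile of class c, w_c its mixing weight, and C the number of classes. Rate heterogeneity across sites is omitted for brevity.

```latex
% Site likelihood as a weighted sum over C frequency classes (assumed notation)
L_i(T, R, w, \pi) = \sum_{c=1}^{C} w_c \, P(x_i \mid T, R, \pi_c),
\qquad w_c \ge 0, \quad \sum_{c=1}^{C} w_c = 1

% Log-likelihood of an alignment with n sites, assuming independent sites
\log L = \sum_{i=1}^{n} \log L_i(T, R, w, \pi)
```

Each class shares the same tree and exchangeabilities and differs only in its frequency profile, which is why misspecifying R and misspecifying the profiles can have different consequences.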

Finally, we explore the effects of including an ‘F-class’ with the overall amino acid frequencies of the dataset as an additional class in the profile mixture model. Surprisingly, the F-class does not seem to help parameter estimation significantly, and it can decrease the probability of correct tree estimation, depending on the scenario, despite the fact that it tends to improve likelihood scores. We also investigate the effect of the F-class on several empirical data sets.
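To make the F-class construction concrete, the following minimal sketch computes the overall amino acid frequencies of a toy alignment and appends them as one extra class to an existing set of profiles. The function names, the toy sequences, the uniform placeholder profiles, and the fixed `f_weight` are all hypothetical and chosen for illustration; in practice the class weights would typically be estimated from the data rather than fixed.

```python
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # the 20 standard amino acids

def overall_frequencies(sequences):
    """Pooled amino acid frequencies over all sequences.

    Gaps and ambiguity codes are ignored (an assumption of this sketch).
    """
    counts = Counter(
        ch for seq in sequences for ch in seq.upper() if ch in AMINO_ACIDS
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found in the alignment")
    return [counts[aa] / total for aa in AMINO_ACIDS]

def add_f_class(profiles, weights, f_profile, f_weight):
    """Append f_profile as an additional mixture class.

    Existing weights are rescaled by (1 - f_weight) so that all weights
    still sum to 1. f_weight is a hypothetical fixed value here.
    """
    if not 0.0 < f_weight < 1.0:
        raise ValueError("f_weight must lie strictly between 0 and 1")
    scaled = [w * (1.0 - f_weight) for w in weights]
    return profiles + [f_profile], scaled + [f_weight]

# Toy alignment (hypothetical data): three taxa, four columns, one gap.
alignment = ["ARND", "ARNA", "A-ND"]
f_profile = overall_frequencies(alignment)

# Two placeholder classes with uniform profiles and equal weights.
uniform = [1.0 / 20] * 20
profiles, weights = add_f_class([uniform, uniform], [0.5, 0.5], f_profile, 0.2)
```

After this call the mixture has three classes, the last of which carries the empirical frequencies of the dataset.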