TY - JOUR T1 - Overtraining often results in topologically incorrect species trees with maximum likelihood methods JF - bioRxiv DO - 10.1101/140780 SP - 140780 AU - Damien M. de Vienne AU - Fran Supek AU - Toni Gabaldon Y1 - 2017/01/01 UR - http://biorxiv.org/content/early/2017/05/22/140780.abstract N2 - Background Overtraining occurs when an optimization process is applied for too many steps, leading to a model describing noise in addition to the signal present in the data. This effect may affect typical approaches for species tree reconstruction that use maximum likelihood optimization procedures on a small sample of concatenated genes. In this context, overtraining may result in trees better describing the specific evolutionary history of the sampled genes rather than the sought evolutionary relationships among the species.Results Using a cross-validation-like approach on real and simulated datasets we showed that overtraining occurs in a significant fraction of cases, leading to species trees that are more distant from a gold-standard reference tree than a previously considered (and rejected) solution in the optimization process. However, we show that the shape of the likelihood curve is informative of the optimal stopping point. As expected, overtraining is aggravated in smaller gene samples and in datasets with increased levels of topological variation among gene trees, but occurs also in controlled, simulated scenarios where a common underlying topology is enforced.Conclusions Overtraining is frequent in species tree reconstruction and leads to a final tree that is worse in describing the evolutionary relationships of the species under study than an earlier (and rejected) solution encountered during the likelihood optimization process. This result should help develop specific methods for species tree reconstruction in the future, and may improve our understanding of the complexity of tree likelihood landscapes. ER -