PT - JOURNAL ARTICLE
AU - Dominic A. Evangelista
AU - Sabrina Simon
AU - Megan M. Wilson
AU - Akito Y. Kawahara
AU - Manpreet K. Kohli
AU - Jessica L. Ware
AU - Benjamin Wipfler
AU - Olivier Béthoux
AU - Philippe Grandcolas
AU - Frédéric Legendre
TI - Phylogenetic Synecdoche Demonstrates Optimality of Subsampling and Improves Recovery of the Blaberoidea Phylogeny
AID - 10.1101/601237
DP - 2019 Jan 01
TA - bioRxiv
PG - 601237
4099 - http://biorxiv.org/content/early/2019/04/27/601237.short
4100 - http://biorxiv.org/content/early/2019/04/27/601237.full
AB - Phylogenomics seeks to use next-generation data to robustly infer an organism’s evolutionary history. Yet the practical caveats of phylogenomics motivate investigation of improved efficiency, particularly when the quality of phylogenies is questionable. To achieve improvements, one goal is to maintain or enhance the quality of phylogenetic inference while severely reducing dataset size. We approach this goal by designing an optimized subsample of data, using an experimental design whose results are evaluated on the basis of phylogenetic synecdoche: a comparison of phylogenies inferred from a subsample with phylogenies inferred from the entire dataset. We examine locus mutation rate, saturation, evolutionary divergence, rate heterogeneity, selection, and a priori information content as traits that may determine optimality. Our controlled experimental design is based on 265 loci for 102 blaberoidean cockroaches and 22 outgroup species. High phylogenetic utility is demonstrated by loci with high mutation rate, low saturation, low sequence distance, low rate heterogeneity, and low selection. We found that some phylogenetic information content estimators may not be meaningful for assessing information content a priori. We use these findings to design concatenated datasets with an optimized subsample of 100 loci.
The tree inferred from the optimized subsample alignment was largely identical to that inferred from all 265 loci but with less evidence of long-branch attraction and improved statistical support. In sum, optimized subsampling can improve tree quality while reducing data collection costs and yielding a 4-6x speedup in tree inference and bootstrapping.