TY - JOUR
T1 - Accuracy through Subsampling of Protein EvolutioN: Analyzing and reconstructing protein divergence using an ensemble approach
JF - bioRxiv
DO - 10.1101/170787
SP - 170787
AU - Roman Sloutsky
AU - Kristen M. Naegle
Y1 - 2017/01/01
UR - http://biorxiv.org/content/early/2017/08/01/170787.abstract
N2 - Mapping the history of gene duplications which gave rise to a protein family encoded in a genome (a set of paralogs) can be critical to understanding how those proteins function in their host cells today. However, since each member of a family is recapitulated in the genomes of related species (a set of orthologs), selection of sequences to be included in the history reconstruction is non-trivial. Reconstruction is extremely sensitive to the choice of sequences, which is deeply problematic given no mechanism exists for assessing the accuracy of individual reconstructions. Here, we capitalize on the variability of phylogenetic tree reconstruction to selected input sequences, by subsampling from the available ortholog sequences of a protein family to create an ensemble of trees, which explores the space of plausible tree topologies. We hypothesize that the most consistent topological features across an ensemble are more likely to be true and propose a tree reconstruction algorithm (ASPEN) based on this hypothesis. We simulate 600 protein families over known phylogenies, with varying branch lengths, and compare the accuracy of ASPEN reconstructions to those of traditional phylogeny inference methods. We find that ASPEN trees are more accurate than trees reconstructed traditionally. Additionally, we develop an observable metric calculated from subsampling, reconstruction Precision, for assessing the likely accuracy of a traditional, single-alignment all-sequence reconstruction of the divergence history for a set of paralogs.
Together these findings suggest that an ensemble of imperfect reconstructions can provide more accurate insight than any individual reconstruction.
ER -