Abstract
Multiple sequence alignments (MSAs) are widely used to infer evolutionary relationships, enabling predictions of structure, function, and phylogeny. Standard practice is to construct one MSA by some preferred method and use it in further analysis; however, undetected MSA bias can be problematic. I describe Muscle5, a novel algorithm which constructs an ensemble of high-accuracy MSAs with diverse biases by perturbing a hidden Markov model and permuting the guide tree. Confidence in an inference is assessed as the fraction of the ensemble which supports it. In phylogenetic tree estimation, I show that ensembles can confidently resolve topologies that have low bootstrap support under standard methods, and conversely that some topologies with high bootstrap support are incorrect. Applied to the phylogeny of RNA viruses, ensemble analysis shows that recently adopted taxonomic phyla are probably polyphyletic. Ensemble analysis can improve confidence assessment in any inference from an MSA.
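The confidence measure stated above (the fraction of the ensemble supporting an inference) can be illustrated with a minimal sketch. This is not Muscle5 code: the function name `ensemble_confidence`, the representation of each tree as a set of clades, and the toy taxa are all assumptions made for illustration only.

```python
from typing import FrozenSet, List, Set

# Hypothetical representation: a clade is the set of leaf labels it contains,
# and each tree is summarised by the set of clades it implies.
Clade = FrozenSet[str]


def ensemble_confidence(clade: Clade, ensemble: List[Set[Clade]]) -> float:
    """Return the fraction of ensemble replicates whose tree contains `clade`."""
    if not ensemble:
        raise ValueError("ensemble must contain at least one replicate")
    supporting = sum(1 for tree_clades in ensemble if clade in tree_clades)
    return supporting / len(ensemble)


# Toy ensemble: clade sets from trees estimated on four different MSAs
# of the same sequences (made-up data, for illustration only).
ab = frozenset({"A", "B"})
ac = frozenset({"A", "C"})
cd = frozenset({"C", "D"})
toy_ensemble = [{ab, cd}, {ab, cd}, {ab}, {ac}]

confidence_ab = ensemble_confidence(ab, toy_ensemble)  # 3 of 4 replicates
```

In this toy case, three of the four replicate trees contain the clade {A, B}, so its ensemble confidence is 0.75. The same counting applies to any inference that can be checked independently on each MSA in the ensemble.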
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
The original submission was too technical in the main article, had poor narrative flow, and did not make a clear enough case for novelty and advance over the state of the art. In the revision, I try to write for a typical biologist who uses alignment and tree software but is not an algorithm specialist. Technical details are moved to Methods and Supplementary Material as far as possible. I provide a summary of necessary technical terms in Table 1 to aid the reader. The ensemble bootstrap is de-emphasised, and more results are presented to demonstrate the advantages of ensemble analysis over bootstrapping.