PT - JOURNAL ARTICLE
AU - Alexander G Lucaci
AU - Jordan D Zehr
AU - Sergei L. Kosakovsky Pond
TI - Evolutionary shortcuts via multi-nucleotide substitutions and their impact on natural selection analyses
AID - 10.1101/2022.12.02.518889
DP - 2022 Jan 01
TA - bioRxiv
PG - 2022.12.02.518889
4099 - http://biorxiv.org/content/early/2022/12/03/2022.12.02.518889.short
4100 - http://biorxiv.org/content/early/2022/12/03/2022.12.02.518889.full
AB - Inference and interpretation of evolutionary processes, in particular of the types and targets of natural selection affecting coding sequences, are critically influenced by the assumptions built into statistical models for such analyses. If certain aspects of the substitution process (even when they are not of direct interest) are presumed absent or are modeled with too crude a simplification, estimates of key model parameters can become biased, often systematically, and lead to poor statistical performance. Here, we performed a detailed characterization of how modeling instantaneous multi-nucleotide (or multi-hit, MH) substitutions impacts dN/dS-based inference of episodic diversifying selection at the level of the entire alignment. The inclusion of MH reduces the rate at which positive selection is called (1.37-fold, or 26.8%) based on the analysis of N = 9,861 empirical datasets, while offering a significantly better statistical fit to sequence data in 8.37% of cases. Through additional simulation studies, we show that this reduction is not simply due to loss of power because of additional model complexity. After a detailed examination of 21 benchmark alignments and a new high-resolution analysis showing which parts of the alignment provide support for positive selection, we reveal that MH substitutions occurring along shorter branches in the tree are largely responsible for discrepant results in selection detection. 
Our results add to the growing body of literature that examines decades-old modeling assumptions and finds them to be problematic for biological data analysis. Because multi-nucleotide substitutions have a significant impact on natural selection detection even at the level of an entire gene, we recommend that routine selection analyses of this type consider their inclusion. To facilitate this procedure, we developed a simple model-testing framework for selection detection that can screen an alignment for positive selection while accounting for two biologically important confounding processes: synonymous rate variation and multi-nucleotide instantaneous substitutions. Competing Interest Statement: The authors have declared no competing interest.