Abstract
Traditional statistical methods based on maximum likelihood and Bayesian inference can detect positive selection from an interspecific phylogeny and a codon sequence alignment under their model assumptions, but they are prone to false positives caused by alignment errors and can lack power. These problems are particularly pronounced at high levels of indels and divergence. Leveraging the feature-detection capabilities of convolutional neural networks (CNNs), we achieve higher accuracy in detecting selection across a specific range of phylogenetic scenarios and evolutionary modes. This advantage is particularly evident with noisy data prone to misalignments: our method shows some ability to account for these errors, whereas most statistical frameworks fail to do so in a tractable manner. We explore generalisability and identify future avenues to achieve broader utility. Once trained, our CNN model is faster at test time than traditional statistical methods, making it a scalable alternative for large-scale, multi-gene analyses. In addition to binary classification (inferring the presence or absence of positive selection during the evolution of the sequences), we use saliency maps to understand what the model learns and examine how this could be leveraged for sitewise inference of positive selection.
Competing Interest Statement
The authors have declared no competing interest.