Abstract
Selective pressure acts on codon usage, optimizing multiple overlapping signals that are only partially understood. To study codon selection, we trained artificial intelligence (AI) models to predict the codons of genes from their amino acid sequences in four model organisms: the eukaryotes Saccharomyces cerevisiae and Schizosaccharomyces pombe and the bacteria Escherichia coli and Bacillus subtilis. We evaluated our models on large sets of proteins of varying lengths and expression levels. Our models significantly outperformed the naïve frequency-based approach, demonstrating that there are dependencies between codons within a gene sequence that can be learned to better predict evolutionarily selected codon usage. The prediction accuracy advantage of our models is greater for highly expressed genes than for genes expressed at low levels, and it is greater in bacteria than in eukaryotes, suggesting that there is a monotonic relationship between selective pressure for complex codon patterns and effective population size. In addition, in S. cerevisiae and the bacteria, our models were more accurate for longer proteins than for shorter ones, indicating that patterns related to co-translational protein folding were likely learned by the AI system. Gene functionality and conservation were also important determinants of model performance. Finally, we showed that using information encoded in homologous proteins has only a minor effect on the accuracy of our predictions, perhaps due to complex codon-usage codes in genes undergoing rapid evolution. In summary, our study, which employs contemporary AI methods, offers a new perspective on codon usage patterns and a novel tool to optimize codon usage in endogenous and heterologous proteins.
Competing Interest Statement
The authors have declared no competing interest.