Abstract
Selective pressure acts on codon usage, optimizing multiple, overlapping signals that are only partially understood. To study codon selection, we trained artificial intelligence (AI) models to predict the codons of genes given their amino acid sequence in four model organisms: the eukaryotes Saccharomyces cerevisiae and Schizosaccharomyces pombe and the bacteria Escherichia coli and Bacillus subtilis. We evaluated our models on large sets of proteins of varying lengths and expression levels. Our models significantly outperformed the naïve frequency-based approach, demonstrating that there are dependencies between codons within a gene sequence that can be learned to better predict evolutionarily selected codon usage. The prediction accuracy advantage of our models is greater for highly expressed genes than for those expressed at low levels, and it is greater in bacteria than in eukaryotes, suggesting a monotonic relationship between the selective pressure for complex codon patterns and effective population size. In addition, in S. cerevisiae and the bacteria, our models were more accurate for longer proteins than for shorter ones, indicating that patterns related to co-translational protein folding were likely learned by the AI system. Gene functionality and conservation were also important determinants of the performance of our models. Finally, we showed that using information encoded in homologous proteins has only a minor effect on the accuracy of our predictions, perhaps due to complex codon-usage codes in genes undergoing rapid evolution. In summary, our study employing contemporary AI methods offers a new perspective on codon usage patterns and a novel tool to optimize codon usage in endogenous and heterologous proteins.
Introduction
Although each of the 61 sense codons found in all known organisms encodes exactly one of the 20 amino acids, 18 amino acids are encoded by multiple (two to six) codons. Thus, many different codon sequences encode the same amino acid sequence. The selection of codons used impacts protein production indirectly, by influencing the availability of tRNAs and free ribosomes in the cell [1–3], and directly, by influencing mRNA structure and stability [3, 4], transcription [5], splicing [6], and translation kinetics [7, 8]. Translation kinetics, in turn, influence co-translational protein folding and regulation [8, 9].
Certain codons are preferred over others, a phenomenon known as codon bias [1, 9], and this bias differs not only across species [8] but also within species, in a manner that depends on expression level and protein length [4, 9, 10]. Codon usage even differs along individual genes; for example, usage patterns at the beginning of coding regions differ from those at protein domain boundaries [11, 12]. Use of rare codons can slow translation [9, 13] and enable the generation of functionally or structurally stable proteins [3]. Because codon usage influences the efficiency and accuracy of protein synthesis, it is recognized as a code within the genetic code that is subject to evolutionary selection [3, 10].
Computationally predicting the codon encodings of proteins in different organisms holds practical value that extends beyond the curiosity of emulating a fundamental evolutionary process. Codon usage can have a significant effect on protein levels in different organisms [14–17]. Production of heterologous (non-host) proteins for use in protein science and biotechnology [18], in bacterial cell factories [19], as vaccines [20], or for agricultural purposes [21] requires optimization of the coding sequence. The disparities in codon biases between the original and host organisms necessitate adjusting the codons to the biases of the new host. Dozens of measures aim to model codon usage patterns (reviewed in [22]); these measures, however, are usually limited to capturing local statistics of the codon distribution.
To rigorously approach this prediction challenge, one should establish distinct training and test sets [23]. The straightforward naïve Bayes predictor offers a meaningful baseline: it predicts, for each amino acid, the most frequently used codon (as observed in the training set), and it is the most accurate among predictors that overlook codon interactions [24]. However, there may be patterns of interactions among codons, possibly even those distant in the primary sequence, and learning these may yield better predictions [3, 25]. To learn codon patterns with the limited data available per organism, sophisticated tools that learn data distributions are needed [26].
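To make this baseline concrete, the following minimal Python sketch implements such a frequency-based predictor. The helper aa_of (mapping a codon to its amino acid) and the data-loading step are assumptions for illustration, not part of the original work:

```python
from collections import Counter, defaultdict

def fit_frequency_baseline(coding_seqs, aa_of):
    """Learn, per amino acid, the most frequent codon in the training set.

    `coding_seqs`: in-frame nucleotide coding sequences;
    `aa_of`: hypothetical helper mapping a codon to its amino acid.
    """
    counts = defaultdict(Counter)
    for seq in coding_seqs:
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            counts[aa_of(codon)][codon] += 1
    # Keep only the single most frequent codon per amino acid.
    return {aa: c.most_common(1)[0][0] for aa, c in counts.items()}

def predict_codons(baseline, protein):
    """Predict a coding sequence while ignoring all codon interactions."""
    return [baseline[aa] for aa in protein]
```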
Knowing the codons used in an orthologous protein in another organism may aid in prediction. For example, orthologous proteins may have a codon at a particular position that differs due to usage bias but functions to induce translation pausing crucial for proper protein folding. If codons with unique functions can be identified and mimicked, the prediction may be more accurate [7, 19]. In support of the utility of this information, it was shown that position-dependent clusters of optimal and nonoptimal codons are conserved among orthologous proteins [27, 28], and that codon usage relates to protein structure, with nonoptimal codons aligning with domain boundaries [7–9, 29]. Indeed, these observations underlie the design of codon harmonization tools that predict a codon sequence for optimal incorporation of a given amino acid sequence in a non-native organism. Codon harmonization tools are not, however, designed for orthologous proteins but rather for those with the same amino acid sequence [30, 31]. In other words, they are not intended to predict the evolutionarily selected codons in one organism based on the codons of a different, albeit orthologous, protein. In practical terms, this means that codon harmonization tools mimic readily described properties like frequency rank [30, 31].
Deep networks, and transformers in particular [26], have emerged as the tool of choice for learning complex distributions like those characterizing codon usage. In prior research, deep networks have been used to address challenges in this domain. Most closely related is the work by Yang et al., who used deep networks (BiLSTM) to predict codon sequences of highly expressed proteins, but their predictions failed to improve upon the frequency-based baseline [23]. Other studies suggest, however, that there is a signal within codon sequences that deep networks can harvest: Tunney et al. used a feed-forward network to successfully predict ribosome density from the sequence neighborhood [32], and several studies have demonstrated the utility of deep networks (RNN, T5, and BiLSTM) to optimize the gene expression levels of two to four proteins [33–35] and to predict other aspects of gene expression [36–39].
Here, we take a data-driven approach to investigate evolutionarily selected codon sequences in four organisms: the eukaryotes S. cerevisiae and S. pombe, and the bacteria E. coli and B. subtilis. We explored two scenarios: prediction of the codon sequence from the amino acid sequence, and prediction of the codon sequence in one organism given the codon sequence of an orthologous protein in another organism. We used mBART, a transformer-based encoder-decoder architecture that extends BART, which was developed for a single natural language [40, 41]. mBART learns a shared model for multiple languages (e.g., English and French), allowing it both to generate text in these languages and to convert (i.e., translate) text from one language to another. In our setting, the analog of a language is an organism, and our mBART-trained models can both generate codon sequences for multiple organisms and mimic the codons of an orthologous protein. Our best models outperformed the frequency-based baseline, suggesting that there are patterns of codon interactions between residues (including non-neighboring residues) that can be learned. By studying improvement over this baseline as a function of expression level and protein length in different organisms, we gained insights into evolutionary selection processes. Furthermore, in S. cerevisiae and E. coli, we compared the accuracy gain of our model over the frequency-based baseline for functional sets of proteins grouped by Gene Ontology (GO) annotations and found that the accuracy gain was higher than average for some molecular functions and biological processes. The novel artificial intelligence (AI) tool introduced here, with publicly available code and an easy-to-use web interface (https://www.aa2codons.info/), will enable future investigations related to evolutionary selection of codon sequences.
Results
(1) Training of mBART models to predict codons from amino acid sequence using masking and mimicking
We consider two tasks: masking, which is prediction of codons from the amino acid sequence, and mimicking, which is prediction of codons based on the codons of an orthologous protein in another organism. The rationale for the mimicking mode is that the rate of translation elongation depends on the codon used [42], and the nonuniform rate may be important for co-translational protein folding [7, 43, 44]. Thus, as codons of orthologous proteins in two organisms may encode similar elongation rates, the codon sequence of the orthologous protein may be useful for predicting codon usage in a protein of interest.
We trained several mBART models [40, 41] to support masking and mimicking tasks in four well-studied model organisms: S. cerevisiae, S. pombe, E. coli, and B. subtilis. Following standard machine learning practices, we divided the protein data from these organisms into three distinct sets: ∼70% in the training set, ∼10% in the validation set, and ∼20% in the test set. All three sets included proteins with a wide range of expression levels. At the amino acid level, none of the proteins in the test set were closely related to those in the training set (based on amino-acid sequence clustering using CD-HIT [45] with a threshold of 0.7). The test set included 1240 S. cerevisiae, 1024 S. pombe, 812 E. coli, and 855 B. subtilis proteins (of which all but 496, 463, 271, and 247, respectively, have measured expression levels). Thus, evaluation of the models was conducted under stringent conditions, both in that the test set included a significant number of proteins with a wide range of expression levels and lengths and in that the training set was not similar at the amino acid level to proteins in the test set (see Methods for more detail).
Figure 1 illustrates the training procedure and the input format of the mBART models. In masking mode, the input data is (only) the amino acid sequence of the target protein; in mimicking mode, the input data is the amino acid sequence and the codons of an orthologous protein (Figure 1A). The input format with two concatenated sequences supports both tasks and includes tokens indicating the organisms of the proteins (Figure 1B). We trained multiple models, each with a specific window size (Figure 1C). Pre-processing and post-processing steps are illustrated in Figure 1D. The performance of the trained models was evaluated based on the accuracy of predicting the codons of all proteins in the test set. We observed that the frequency-based model trained on highly expressed proteins was more accurate when predicting the codons of highly expressed proteins (Figure 2); therefore, we added a 6-class classification token for the expression level of the protein within its organism (omitted from the illustration in Figure 1A for brevity). In masking mode, the input is two copies of the same amino acid sequence and the organism token. In mimicking mode, the input is a gap-infused alignment of two orthologous proteins, where the first sequence is the codons of the source protein and its organism, and the second sequence is the amino acids of the target protein and its organism. Some of the codons in the target sequences may be passed as input for context (i.e., they are not predicted). For example, during pre-training, only 30% of the positions are masked by amino acid tokens and predicted by the model.
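As an illustration of this input format, the sketch below assembles the two concatenated sequences with organism and expression-level tokens. The token names and the exact concatenation order are illustrative assumptions; the released tokenizer defines the actual vocabulary and layout:

```python
def build_encoder_input(src_org, src_tokens, tgt_org, expr_class, tgt_protein):
    """Assemble the two concatenated sequences of Figure 1B (illustrative).

    In masking mode `src_tokens` is a second copy of the target's amino
    acids; in mimicking mode it is the gap-infused codon sequence of the
    ortholog. Token names (<s_cerevisiae>, <expr_3>, ...) are placeholders.
    """
    return ([f"<{src_org}>"] + list(src_tokens) +          # source + organism
            [f"<{tgt_org}>", f"<expr_{expr_class}>"] +     # target organism + expression class
            list(tgt_protein))                             # target amino acids
```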
(2) Masking-mode mBART predictions have better accuracy than the frequency-based baseline suggesting that long-range codon interaction patterns can be learned
Figure 2 shows the accuracies of codon predictions for the test-set proteins in masking mode for different models, as a function of expression level in the four organisms. The mBART models were pre-trained on the masking and mimicking tasks and then fine-tuned (FT) on the masking task. The models varied by window size, with windows of 10, 30, 50, and 75 codons evaluated. For comparison, we determined the accuracies of the baseline frequency-based models, which predict the most frequently used codon for each amino acid in each organism, computed over all training-set proteins (cyan) or over the 10% most highly expressed training-set proteins (magenta). The mBART models are more accurate than the frequency-based models, suggesting that there are patterns that can be learned from the long-range relationships among codons.
We sort the proteins by expression level and show their ranked position along the x-axis; Figure S1 shows the same data with expression values along the x-axis. In both cases, the data is smoothed with a Gaussian kernel (window of 50 proteins), and the horizontal solid lines indicate the average accuracy for the proteins with no measured expression. Figures S2 and S3 show the accuracy differences between our models and the frequency-based baselines. Table 1 lists the p-values of statistical tests comparing the accuracies of pairs of models. Each p-value is from a paired-sample t-test of the null hypothesis that the paired differences (e.g., the accuracy of the mBART prediction minus the accuracy of the frequency-based prediction) come from a normal distribution with zero mean and unknown variance. All p-values for comparisons with the baseline models are (far) smaller than 0.05 (Table 1), thus rejecting the null hypothesis and suggesting that the accuracies of the mBART predictions are indeed better than those of the frequency-based predictions. Figure 3 compares the perplexity of these models on the test-set proteins, as a function of the expression level rank of the proteins (with a Gaussian kernel smoothing of 50 proteins). Perplexity is a measure commonly used in AI to assess model predictions; it is the exponentiated average of the cross-entropy loss, so better models have lower perplexity.
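For readers who wish to reproduce these statistics, the paired-sample t-test described above can be run with SciPy; the argument names below are placeholders for per-protein accuracy vectors in matching order:

```python
from scipy import stats

def compare_models(acc_mbart, acc_freq):
    """Paired-sample t-test on per-protein accuracy differences.

    Null hypothesis: the differences come from a normal distribution
    with zero mean (and unknown variance).
    """
    t_stat, p_value = stats.ttest_rel(acc_mbart, acc_freq)
    return t_stat, p_value
```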
Figures 2, S1-S3 and Table 1 show that the mBART models predict more accurately than the frequency-based models, demonstrating that there are patterns that can be learned from the long-range relationships among codons. The perplexity measures in Figure 3 mirror the accuracies in Figure 2, in that where the mBART models are accurate they also have low perplexity. For both the frequency-based models and the mBART models, it is easier to predict the codons of more highly expressed proteins. In S. cerevisiae, the accuracy advantage of mBART is the smallest of the four organisms and is concentrated in the highly expressed proteins. In S. pombe, the best mBART models are more accurate than the frequency-based models, and in the bacteria, the mBART models offer the most dramatic improvement in prediction accuracy with respect to the frequency-based models, across all levels of protein expression.
We evaluated the accuracies of the mBART and the frequency-based models as a function of the protein length ranking (Figures 4, S4, and S5). There is not a consistent relationship between the accuracy of the best model and protein length. In S. cerevisiae, E. coli, and B. subtilis, longer proteins are more accurately predicted; in S. pombe this is not the case. The accuracy advantage of the mBART models is more pronounced in the shorter proteins of the eukaryotic organisms and the longer proteins of the bacterial organisms (Figure S5).
We also compared accuracies directly, without conditioning on another parameter as we did for expression level and length, to evaluate which window size performs best. The difference between models with different window sizes is marginal (Figure S6). Analysis of the average accuracies of the top-k most accurately predicted proteins, with k ranging from 1 to the size of the test set, revealed that the longer the window size, the greater the potential for learning long-range relationships among codons (Figure S7). However, if the long-range signal is insignificant, using increasingly longer window sizes has no benefit and may lead to decreased accuracy, as there are fewer long windows in the data set to train on. Thus, while the mBART model with the 75-codon window has a marginal advantage from the cumulative perspective, overall the best model is the one with the 30-codon window (Figure S7).
To determine whether similarities between the training and test sets artificially boost prediction accuracies, we identified similarities at the amino-acid level that remained after we used CD-HIT clustering for the training set/test set split. To do this, we BLAST-aligned (with an E-value threshold of 10⁻²) each test protein to the training set. We then calculated its average percent identity to the training set by identifying, for each residue, the aligned segment with the highest percent identity and averaging these values over the protein residues. This analysis showed that 31% of the test set proteins have no sequence identity to proteins in the training set (28.1%, 28.1%, 30.3%, and 37.2% for S. cerevisiae, S. pombe, E. coli, and B. subtilis, respectively; Figure S8). The accuracies of the predictions of the mBART and frequency-based models as a function of the average identity were then calculated (Figure S9). The accuracy gap between the mBART and frequency-based models was similar regardless of whether or not the test set protein had close neighbors in the training set, suggesting that the performance of the mBART models is not artificially boosted by these similarities.
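A sketch of this average-identity computation as we describe it, assuming the BLAST hits of one test protein have already been parsed into (query_start, query_end, percent_identity) tuples (e.g., from tabular -outfmt 6 output):

```python
def average_identity(protein_len, hits):
    """Average per-residue percent identity of a test protein vs. the training set.

    `hits`: list of (query_start, query_end, pident) tuples, 1-based and
    inclusive. For each residue we keep the highest percent identity among
    the aligned segments covering it; residues covered by no hit contribute
    0, so a protein with no BLAST hits gets an average identity of 0.
    """
    best = [0.0] * protein_len
    for start, end, pident in hits:
        for i in range(start - 1, end):
            best[i] = max(best[i], pident)
    return sum(best) / protein_len
```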
(3) Mimicking-mode mBART predictions are only marginally better than masking-mode predictions
To identify orthologous segments that can be used in mimicking-mode predictions, we used BLAST (with an E-value cutoff of 10⁻²). Figure S10 shows histograms quantifying the similarities among the orthologous segments in the test set using percent identity and log10(E-values). The average percent identities among the orthologous segments in the test set are 32%, 33%, 30%, and 29% for S. cerevisiae, S. pombe, E. coli, and B. subtilis, respectively. During training, both proteins of each aligned pair came from the training set; in testing, we evaluated pairs of orthologous proteins that were both in the test set.
We evaluated different mBART models in masking-mode inference and mimicking-mode inference on the test set of orthologous segments. First, we considered the same models (FT on the masking task with windows of 30 and 50 residues) and the same masking-mode inference as described above, only on this different test set. We further fine-tuned these two models on the masking and mimicking tasks, and evaluated these refined models in masking-mode inference and in mimicking-mode inference. Finally, we also calculated the frequency-based baseline (similarly to what is described above, only on this test set), and a naïve frequency-mimicking model in which we mimic the codon with the same frequency rank as in the orthologous protein (see Methods for details). Figure 5 shows the accuracies of the predictions as a function of the sequence identity to the ortholog; Figure S11 shows these accuracies with respect to the frequency-based baseline, and Figure S12 shows the same data, sorted by accuracy along the x-axis.
The mBART model's mimicking-mode predictions have accuracies that are very similar to the masking-mode predictions. The mimicking mode resulted in a modest accuracy boost in the two eukaryotic species and for orthologs that have segments with high amino-acid sequence identity. Prediction accuracy may be higher in such segments simply because the codons used there are easier to predict, namely the most frequently used ones. In the masking-mode frequency-based predictions, ortholog codons are not used, and hence percent identity is only meaningful in terms of characterizing the protein as a conserved one. In the eukaryotic species, the accuracies of both the naïve frequency-based mimicking model and the mBART models rose above the frequency-based baseline as the percent identity of the orthologs increased. This implies that there is a signal in the codons of the orthologs, especially when the percent identity is high, that improves prediction accuracy. However, the masking-based predictions have almost identical accuracies, so this extra information does not further improve the mBART predictors. It should be noted that even when there is a signal, the information from the orthologs can be detrimental. For example, although the naïve frequency-based mimicking model improved due to this signal, it was still less accurate than the frequency-based masking model. Our mimicking-based predictions have similar, and in some cases improved, accuracy compared to the naïve frequency-based mimicking model.
We calculated p-values to test the null hypothesis that the differences between the accuracies of two models come from a normal distribution with zero mean and unknown variance, using the paired-sample t-test. The p-values when comparing the accuracies of the mBART model in mimicking mode and the frequency-based mimicking model are less than 10⁻⁵ for all four organisms (Table S1), so the null hypothesis should be rejected. In contrast, the p-values for comparison of the mBART model in masking mode versus in mimicking mode are above 0.05 in all cases (Table S1), indicating that there is no significant difference between these models.
(4) mBART outperforms the frequency-based baseline more significantly for certain types of proteins
To determine whether prediction accuracies depend on molecular function or type of biological process, we used the GO-XL slim classifications of S. cerevisiae and E. coli proteins and considered annotations for which there were at least 10 proteins in the test sets of these organisms. Each annotation is placed along the x-axis, and the accuracy differences are shown as box plots with overlaid scatter points. We used the Mann-Whitney rank sum test to determine whether the accuracy difference values for the proteins with a specific GO annotation are likely to come from the same population of values as the rest of the organism's test-set proteins (Table S2), as sketched below.
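A minimal sketch of this test using SciPy; the accuracy-difference vectors are placeholders for the per-protein differences (mBART minus frequency baseline) inside and outside the GO term:

```python
from scipy import stats

def go_term_test(diffs_in_term, diffs_rest):
    """Mann-Whitney rank sum test: are the accuracy differences for
    proteins with a given GO annotation drawn from the same population
    as those of the remaining test-set proteins?
    """
    u_stat, p_value = stats.mannwhitneyu(diffs_in_term, diffs_rest,
                                         alternative="two-sided")
    return u_stat, p_value
```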
For most GO-annotation categories, the accuracy boost of the mBART model is not different from the background of all test-set proteins in either organism, but some categories stand out (Figures 6 and S13-S16). In S. cerevisiae, the mBART model outperformed the frequency-based baseline for (1) proteins with the molecular function 'structural constituent of ribosome', (2) proteins with the nucleic acid binding functions 'rRNA binding', 'RNA binding', and 'DNA binding', and (3) proteins with a molecular function that falls under the broad category of catalytic activity, specifically 'transferase activity', 'nucleotidyltransferase activity', 'nuclease activity', and 'peptidase activity'. The E. coli molecular function categories in which the mBART model considerably outperformed the frequency-based baseline are (1) 'structural molecular activity', (2) 'RNA binding', and (3) both the broader 'catalytic activity' and, more specifically, 'catalytic activity acting on DNA'. Considering biological processes, the mBART model was significantly more accurate than the frequency-based baseline for S. cerevisiae proteins involved in (1) 'ribosomal large subunit biogenesis', (2) the broad category of translation, specifically 'cytoplasmic translation' and 'regulation of translation', (3) the broad category of nucleic acid metabolic process, specifically 'DNA recombination' and 'rRNA processing', and (4) 'transposition'. For E. coli, mBART was significantly more accurate for proteins involved in (1) 'ribosome biogenesis', a parent category of the 'ribosomal large subunit biogenesis' identified in S. cerevisiae, (2) 'cytoplasmic translation', a process also identified in S. cerevisiae, and (3) 'protein-containing complex assembly'.
Discussion
We have employed contemporary tools from AI to predict codons in genes from the eukaryotes S. cerevisiae and S. pombe and the bacteria E. coli and B. subtilis given the protein amino acid sequences. Our approach is data-driven: mBART models were trained on data from each organism and evaluated on a diverse test set. Because the difficulty of codon prediction varies, evaluation on a large and diverse test set, rather than on only a few proteins as has been done previously [23], results in a more comprehensive test of predictive power and allows us to study the accuracy of the models as a function of protein expression, length, conservation, and functionality. The lower bound for the accuracy of the model is the frequency-based model, and we expect the effective upper bound to be significantly lower than 100%, because not all positions are under strong evolutionary pressure [44]. Given this, the improvement reported here relative to the frequency-based model is very significant. Our mBART models are publicly available as source code and trained models and through a web-based, user-friendly interface (https://www.aa2codons.info/).
Learning the statistical patterns of codon usage in an organism is challenging because the amount of available data is limited by the number of proteins in the organism. Previous studies highlighted associations between the frequencies of neighboring codons and attributed these to their effect on ribosomal pausing, frameshifting, and other gene expression steps [25, 46, 47]. Here, we used mBART, a transformer-based architecture [26], to learn correlations in codon use between residues up to 30 to 75 codons apart and thereby improve predictions. Codon usage patterns inferred by our models are therefore related not only to translation but also to gene expression steps including transcription, splicing, methylation, RNA processing, mRNA stability, and genomic stability [6]. That the model with a 30-codon window performed better than models with longer window sizes may be due to statistical aspects, such as the amount of data relative to the number of parameters in the model. However, this window size may also be optimal because it is closest to the length of the ribosomal exit tunnel [48].
That our trained models can better learn the codon patterns of a subset of proteins suggests that their codon usage is more constrained, which may be because the encodings of these proteins are under more pronounced evolutionary selection. Here, we studied how the accuracy and perplexity of our trained models vary as a function of the expression levels, lengths, conservation, and GO annotations of the proteins. It has been suggested that selection on codon usage increases with the expression level of a gene [22, 49]. One reason for this is that we expect a silent mutation to have a greater effect on organism fitness in highly expressed genes [22]. Indeed, the codons of high-expression proteins in all four organisms are more accurately predicted, both by the naïve frequency-based model and by our models, and with lower perplexity. This suggests that high-expression proteins are under increased evolutionary selection not only for simple codon patterns but also for complex long-range codon usage patterns. We also observed that in most organisms (excluding S. pombe) the gap between the performance of our models and the frequency-based approach is greater for long proteins than for short ones. It is possible that complex signals related to the regulation of co-translational folding, needed for tighter control in multi-domain proteins [50, 51], are "encoded" by codon identities at distant positions and thus better detected by our models. It is also possible that our models detect gene expression codes encoded in longer genes beyond those important for co-translational folding; these could be codes related to binding sites of transcription factors or RNA binding proteins [6].
Our models performed better for genes with conserved orthologs than for those without. The latter are either new or have undergone very rapid evolution. This suggests that evolution first shapes the amino acid sequence and later the complex codon usage patterns. It may also suggest that older genes tend to include more complex gene expression codes than newer ones, possibly as a result of selection for tighter regulation, and it is these complex codes that are detected and exploited by our models.
The gap between the performances of our models and the frequency-based baseline varied among different GO functional groups. This result may be related to the factors mentioned above that seem to introduce complex codes into the coding sequence: expression levels, gene age and conservation, and gene length. For example, genes related to the translation process (e.g., those encoding ribosomal proteins), which were better predicted by our models, are known to be highly expressed and old. DNA and RNA binding proteins are also known to include ancient domains, and this may explain the better performance of our models in predicting their codons [52]. Our GO-based analysis suggests that our models can be useful for studying and annotating novel genes, given that predictive performance is associated with functionality.
Finally, our models excel over the frequency-based approach more significantly in bacteria than in the yeast species evaluated. This may be due to the larger effective population size in bacteria [53]. A larger population should induce stronger selection pressure on codon usage, which in turn results in complex long-range signals that can be captured by our models but not by the frequency-based model. Also, the mean number of ribosomes per mRNA in bacteria is greater than in eukaryotes, making ribosomal traffic jams more common and possibly triggering stronger selection for complex codon usage signals [54–57]. Moreover, horizontal gene transfer occurs in bacteria [58], which may accelerate the evolutionary rate.
We trained our mBART models to mimic, i.e., to predict the codons of a protein given the codon encodings of an ortholog. An accurate mimicking tool is useful for predicting codons that are optimal in a non-native host. To design a mimicking tool for protein codons, the first step is to characterize patterns of codon usage in orthologous proteins. Then, the learned insights (or AI-models) can be used, given the codon encoding of a protein in a source organism, to predict a codon encoding for a similar protein in a target organism. The final, sometimes overlooked step, is to evaluate the predicted codon encodings in the target organism.
Previous studies used "hand-crafted" measures to characterize patterns of codon usage in orthologous proteins: Pechmann and Frydman devised a translational efficiency scale and applied it to yeast species to show evolutionary conservation of codon optimality in eukaryotes [27]. Jacobs and Shakhnovich measured local rare-codon enrichment and studied its conservation across multiple-sequence alignments [28]. Chaney et al. used the MinMax measure to identify clusters of rare codons and showed that they are conserved among homologous proteins across eukaryotic and bacterial species [59]. These measures are only a few of the many available codon bias indices [22]. In contrast to these hand-crafted features, our mBART models learn patterns from the data, both indirectly, by training a single model on the masking task for multiple organisms, and directly, by training on the mimicking task from alignment data. As in all contemporary AI-based models, the patterns learned help improve accuracy during inference, even though we do not have an explicit description of what the models learned.
Evaluating mimicking tools is challenging because there is a discrepancy between their desired use and the data on which we can evaluate them. Their intended use is prediction of the optimal codons for expression of a protein in a non-native host, whereas evaluation can only be carried out on evolutionary data: the codon encodings in two organisms of merely similar, namely non-identical, proteins. This meaningful distinction has two important consequences. First, because evolutionary data does not include pairs of identical protein sequences in two organisms as in the desired use, current tools, like CHARMING [31] and CodonWizard [30], do not evaluate their measures, or even predict codons, based on the codon encodings of a similar, yet non-identical, protein ortholog. Second, it is unclear what the correct threshold is, in terms of the amino-acid sequence identity of the ortholog, for approximating this desired use. We observed that mimicking predictions are not consistently better than the masking predictions. This may be because we evaluated our mimicking models on orthologs that are too remote. Alternatively, it may be that it is not easier to learn to mimic than to predict the codon encoding from the amino acid sequence. Nonetheless, the accuracies of our mimicking-mode predictions improved as the amino acid sequence identity increased, approximating the desired use more closely. Thus, we believe that our mimicking-based predictions will be useful for optimizing expression of proteins in a non-native host, and we provide the code and a web interface for this task.
In summary, we used AI to study codon usage bias, a challenging topic in evolution. Our approach can be used to design the coding sequence of heterologous genes and also to better understand the evolution of codons and the encoding of expression in transcripts. We believe that future research on codon usage will involve extensive application of AI-based approaches, like the one presented here.
Methods
Dataset
The NCBI coding sequences of four organisms, S. cerevisiae, S. pombe, E. coli, and B. subtilis, were divided into three disjoint sets, for training, validation, and testing. We first clustered their translated amino acid sequences with CD-HIT (flags -d 10000 -c 0.7 -bak 1). Clusters were not split between two sets (e.g., training and test), as sketched below. Within each of the three sets (i.e., training, validation, test), we used BLAST (with an E-value threshold of 0.01) to identify similar sequences. Finally, we associated each protein with its measured expression level [60], if available, and with its category in the 6-class classification: [0-25%, 25-50%, 50-75%, 75-90%, 90-100%, no measured expression].
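A minimal sketch of a cluster-aware split under these constraints; the exact assignment procedure used in the paper is not specified, so this greedy scheme is only illustrative:

```python
import random

def cluster_aware_split(clusters, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split CD-HIT clusters (lists of protein IDs) into train/validation/test
    so that no cluster is divided between two sets. Fractions approximate
    the ~70/10/20 split described in the text.
    """
    rng = random.Random(seed)
    clusters = clusters[:]
    rng.shuffle(clusters)
    total = sum(len(c) for c in clusters)
    splits = [[], [], []]
    bounds = (fractions[0], fractions[0] + fractions[1])
    seen = 0
    for cluster in clusters:
        frac = seen / total
        idx = 0 if frac < bounds[0] else (1 if frac < bounds[1] else 2)
        splits[idx].extend(cluster)   # whole cluster goes to one set
        seen += len(cluster)
    return splits  # [training, validation, test]
```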
Models
We trained and share mBART models with fixed window sizes (10, 30, 50, and 75) using the Huggingface infrastructure [61]. mBART has an encoder-decoder architecture that receives two inputs, one for the encoder and one for the decoder; the decoder is auto-regressive. Our models have six attention layers and eight attention heads for both the encoder and the decoder. The auto-regressive nature of the decoder means that it predicts probabilities for the next codon given the encoder inputs and all the previous predictions. To reduce training time, we used teacher forcing, where the decoder predicts probabilities for the next codon given the input to the encoder and the correct previous codons (as opposed to its own predictions, which would be used in non-teacher-forcing mode). Critically, during inference, the codon sequence is not given, and a fixed-length window of predictions, of the same size as the input, is generated auto-regressively by choosing the highest-probability codon restricted to the amino-acid sequence we want to generate.
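The following sketch instantiates an mBART model with these dimensions via the Huggingface transformers API. The vocabulary and hidden sizes are illustrative assumptions; the released checkpoints and tokenizer define the real values:

```python
from transformers import MBartConfig, MBartForConditionalGeneration

# Six attention layers and eight heads per encoder/decoder, as in the text;
# vocab_size and d_model below are placeholders, not the published values.
config = MBartConfig(
    vocab_size=128,               # codons + amino acids + special tokens
    d_model=512,
    encoder_layers=6,
    decoder_layers=6,
    encoder_attention_heads=8,
    decoder_attention_heads=8,
)
model = MBartForConditionalGeneration(config)
```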
Training
Our training protocol had three steps: pretraining, masking-mode fine-tuning, and mimicking-mode fine-tuning. Each training step used a batch size of 32, 15000 warmup steps, a learning rate of 1e-4, 15% label smoothing, and early stopping on the cross-entropy loss of the validation set. In pretraining, we randomly chose 30% of the input codons and masked each with a token of its encoded amino acid (see the sketch below). The training-set windows were chosen randomly (both the sequence and the position within the sequence) on the fly during batch creation. To avoid noisy comparisons between the validation steps used for early stopping, we pre-calculated and fixed the validation-set windows. In mimicking-mode training, we masked only the codons of the aligned target; note that these may differ from the amino acids of the orthologous sequence.
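A sketch of the pre-training corruption step, assuming codons are already tokenized; aa_token_of is a hypothetical helper mapping a codon to the token of its encoded amino acid:

```python
import random

def mask_codons(codons, aa_token_of, mask_fraction=0.3):
    """Pre-training corruption: replace a random 30% of codon tokens with
    the token of the amino acid they encode; the model is trained to
    restore the original codons.
    """
    n_mask = int(len(codons) * mask_fraction)
    masked = set(random.sample(range(len(codons)), n_mask))
    return [aa_token_of(c) if i in masked else c
            for i, c in enumerate(codons)]
```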
Inference
Testing during inference was on coding sequences with a wide range of lengths, yet our mBART models were trained (only) on fixed-size windows. Naively applying the models trained on fixed-size windows to longer sequences led to poor performance. Thus, we used our models only to infer predictions in windows of the size on which they were trained and derived a prediction for the full-length sequence from these partial predictions. Given a sequence and a model trained on a fixed window size, we predicted for all (sliding) windows of that size in the full sequence (using the 'generate' function). Then, for each position in the full-length protein, we averaged the logits predicted at that position in the sliding windows in which it appears and normalized these averages with log softmax. The final prediction was the highest-probability codon among those that encode the amino acid at that position.
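A sketch of this window-combination step, assuming the per-window logits have already been collected into a dictionary keyed by window start position, and that every position is covered by at least one window. The helpers codons_of (synonymous codons of an amino acid) and token_id (codon to vocabulary index) are hypothetical:

```python
import torch

def combine_windows(window_logits, protein, codons_of, token_id):
    """Derive a full-length prediction from per-window logits.

    `window_logits[start]`: (window_size x vocab_size) tensor of logits
    for the window beginning at `start`. For each position, average the
    logits over all covering windows, normalize with log-softmax, and
    keep the highest-probability codon among those encoding the amino
    acid at that position.
    """
    vocab = next(iter(window_logits.values())).shape[-1]
    sums = torch.zeros(len(protein), vocab)
    counts = torch.zeros(len(protein), 1)
    for start, logits in window_logits.items():
        sums[start:start + len(logits)] += logits
        counts[start:start + len(logits)] += 1
    log_probs = torch.log_softmax(sums / counts, dim=-1)
    return [max(codons_of(aa), key=lambda c: log_probs[i, token_id(c)].item())
            for i, aa in enumerate(protein)]
```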
Frequency-rank mimicking
For each position in the target sequence for which we wanted to predict the codon, we had a codon in the aligned orthologous segment. If the amino acid in the target sequence was aligned to a gap, then we predicted the most frequently used codon for that amino acid. If it was aligned to an identical amino acid, then we mimicked the frequency rank of the codon in the aligned position; for example, if that codon was the most frequently used codon in the other organism, we predicted the most frequently used codon in the target organism. Because it is the same amino acid, the number of possible codons is the same, and rank mimicking is straightforward. Similarly, if the aligned amino acid in the other organism has the same number of codons encoding it as the target amino acid, rank mimicking is straightforward. If the aligned amino acid has a different number of codons, we selected the codon with the highest rank that was lower than or equal to the rank of the mimicked codon.
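A sketch of this rank-mimicking rule for a single aligned position; ranks_src and ranks_tgt are assumed precomputed tables listing each amino acid's codons from most to least frequent (rank 0 first):

```python
def mimic_codon(target_aa, source_aa, source_codon, ranks_src, ranks_tgt):
    """Frequency-rank mimicking for one aligned position.

    `ranks_src[aa]` / `ranks_tgt[aa]`: synonymous codons of `aa` in the
    source / target organism, ordered most frequent first.
    """
    if source_codon is None:                 # target aligned to a gap
        return ranks_tgt[target_aa][0]       # most frequent target codon
    src_rank = ranks_src[source_aa].index(source_codon)
    tgt_codons = ranks_tgt[target_aa]
    # Clamp the rank when the two amino acids have different codon counts.
    return tgt_codons[min(src_rank, len(tgt_codons) - 1)]
```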
Evaluation metrics
The prediction accuracy was calculated as the fraction of positions along the protein sequence at which the codon was correctly predicted, based on the true, evolutionarily selected codon sequence of each protein. A perfect prediction has an accuracy of 1, and a prediction that failed in all positions has an accuracy of 0. We also measured the perplexity of a model given the true, evolutionarily selected codon sequence: perplexity was computed by averaging the negative log-probabilities assigned by the predictor to the true codons and then exponentiating this average. In the case of the frequency-based baseline, the probability assigned to each codon is its empirical frequency, so the perplexity equals the exponentiated average negative log-likelihood.
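Both metrics are straightforward to compute; a minimal sketch:

```python
import math

def accuracy(pred_codons, true_codons):
    """Fraction of positions whose codon was predicted correctly."""
    hits = sum(p == t for p, t in zip(pred_codons, true_codons))
    return hits / len(true_codons)

def perplexity(true_codon_probs):
    """Exponentiated average negative log-probability assigned by the
    model to the true (evolutionarily selected) codons."""
    nll = [-math.log(p) for p in true_codon_probs]
    return math.exp(sum(nll) / len(nll))
```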
GO slim annotations for E. coli and S. cerevisiae
GO slim annotations were downloaded from QuickGO (https://www.ebi.ac.uk/QuickGO/) in November 2023. GO terms were merged according to the GO slim data (all annotations related to one GO slim term were merged). We used the generic GO slim for E. coli and the yeast GO slim for S. cerevisiae. Gene names were converted to unique names using UniProt ID mapping (https://www.uniprot.org/id-mapping). We removed from the analysis GO slim terms that contained fewer than 10 genes.
Data availability
Our models and datasets are available through the Huggingface hub: models: https://huggingface.co/siditom, datasets: https://huggingface.co/datasets/siditom/SCPECBS3, tokenizer: https://huggingface.co/siditom/tokenizer-codon_optimization-refined_expr. A Google Colab script for model inference (along with documentation) is also available: https://colab.research.google.com/drive/1ocbWMWcTHgGSQvgPtRuvY2CBxIVSHUxX?usp=sharing.
Acknowledgements
We thank Prof. Rita Osadchy of the University of Haifa and Michael Peeri, Lorna Bakhit, and Prof. Nir Ben-Tal of Tel-Aviv University for insightful discussions. SBE and TT were supported by the Safra Center for Bioinformatics at Tel-Aviv University, and TS and RK were supported by the Data Science Research Center at the University of Haifa.