Conserved codon adaptation in highly expressed genes is associated with higher regularity in mRNA secondary structures

The redundancy of the genetic code allows for a regulatory layer to optimize protein synthesis by modulating translation and degradation of mRNAs. Patterns in synonymous codon usage in highly expressed genes have been studied in many species, but scarcely in conjunction with mRNA secondary structure. Here, we analyzed over 2,000 expression profiles covering a range of strains, treatments, and developmental stages of five model species (Escherichia coli, Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, and Mus musculus). By comparative analyses of genes constitutively expressed at high and low levels, we revealed a conserved shift in codon usage and predicted mRNA secondary structures. Highly abundant transcripts and proteins, as well as high protein per transcript ratios, were consistently associated with less variable and shorter stretches of weak mRNA secondary structures (loops). Genome-wide recoding showed that codons with the highest relative increase in highly expressed genes, often C-ending and not necessarily the most frequent, enhanced formation of uniform loop sizes. Our results point at a general selective force contributing to the optimal expression of abundant proteins as less variable secondary structures promote regular ribosome trafficking with less detrimental collisions, thereby leading to an increase in mRNA stability and a higher translation efficiency.


INTRODUCTION
Diverging frequencies of synonymous codons can be found in all domains of life and vary with the subsets of genes or gene regions compared. Ever since codon bias in highly expressed genes was shown to correlate with tRNA abundance profiles (optimal codons) (1-6), the general concept that codon use influences translational efficiency was adopted. More recent studies reveal that codon usage offers organisms an additional regulatory layer to govern a multitude of processes ranging from transcription, mRNA stability, translation initiation and elongation, to protein folding (7,8).
An increasingly recognized mechanism to optimize the biogenesis of proteins is the optimization of translation speed along mRNA transcripts by adjusting codon use to subcellular tRNA concentrations (7,8). Slowly translated codons at the beginning of mRNAs have been suggested to reduce ribosome traffic jams at later stages during elongation of abundant proteins (9,10). Clustering of identical synonymous codons -irrespective their match with optimal codons -may enhance translation speed by recycling of nearby recharged tRNA molecules (11). Other forces shaping the codon landscape include the use of non-optimal codons to reduce the elongation rate to facilitate co-translational protein folding (12)and translocation of membrane proteins (13,14). Alternatively, the kinetics of codon-anticodon pairing can also be tuned by modulating tRNA concentrations. Oscillating tRNA concentrations with anti-codons that enhance or repress translation speed have been implicated in tweaking up-and downregulation of gene expression during cell proliferation and differentiation.
An additional regulatory layer that controls the rate of translation is the secondary structure of mRNAs. Strong structures (low free energy) are demonstrated to slow down ribosomes and decrease the speed of translation (15)(16)(17)(18)(19)(20)(21). Generally, strong structures at the 5' end of mRNAs are considered as disadvantageous for translation initiation (20,22). Genome-wide analyses demonstrate wellconserved patterns across prokaryotes and eukaryotes towards reduced stability of mRNA structures at the initiation site, especially for GC-rich genes (23). However, weak folding of 5' mRNA structures in Escherichia coli and Saccharomyces cerevisiae could not be correlated with expression levels of individual genes, and have been suggested to enhance the translation efficiency in a global manner (17). Another systematic trend in mRNA structures is described for the region directly downstream of the start codon in which strong mRNA structure is seen in S. cerevisiae, probably as a mechanism to slow down ribosome speed at the beginning and to avoid ribosome jamming during elongation (10,24).
In addition, it has been suggested that strong mRNA structures facilitate translational pauses to accommodate co-translational folding of compact proteins (25). Recently, it has been shown that certain mRNA regions appear to be subjected to conserved selective forces (26). Studying more than 500 organisms revealed in most phyla a strong folded region downstream of the start codon and weak secondary structures at the start and the end of coding regions (26).
Codon optimality can also have a profound effect on mRNA stability (8,27,28). Replacing optimal codons in budding yeast by less favorable codons significantly reduces mRNA stability, while reverse substitutions lead to a decreased decay (27). This positive correlation between codon optimality and mRNA stability is well-conserved and has been observed in E. coli (29), Xenopus (30), zebrafish (30), as well as in mammalian cell lines (31). In Saccharomyces cerevisiae (32) the DEAD-box protein Dhh1p is required for the coupling between codon content and mRNA decay (32). Dhh1p is thought to sense Page 4 of 23 the speed of ribosomes during elongation as it binds preferentially to transcripts enriched with slowly translated codons, thereby activating mRNA degradation. Together these findings indicate that codon usage may optimize gene expression via two distinct mechanisms that are linked via the dynamics of ribosome trafficking along transcripts: the kinetics of codon-anticodon pairing and mRNA turnover. The potential impact of ribosome speed on gene expression has been supported by analyzing codon choice in weak and strong mRNA secondary structures (33). In E. coli and S. cerevisiae optimal codons are preferentially used in regions with strong secondary structures, and codons matching with less abundant tRNAs are located in lowly structured regions (33). These opposing effects are considered as a mechanism to smoothen overall translation rates to optimize gene expression by reducing detrimental ribosome collisions (33).
Here we searched for patterns in codon usage and predicted mRNA secondary structures in constitutively highly expressed genes. Bias in codon usage has frequently been reported for highly expressed genes (7,22,34,35), but scarcely in conjunction with structural features in mRNA folding.
In addition, many studies (2,3,7,(36)(37)(38) focused on preferential codon usage in highly expressed genes and the match with cognate tRNA species (optimal codons). Here, we studied codon adaptation by analyzing relative changes in codon frequencies in highly expressed genes when compared with lowly expressed genes. This approach enables a comparative analysis of codon adaptation across species despite differences in optimal codons. By genome-wide comparisons of highly and lowly expressed genes, we found remarkably conserved patterns over five model species: Escherichia coli, Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, and Mus musculus. In most species the relative changes in codon usage in highly expressed genes was associated with a reduction of mean and maximum loop sizes. This shift in mRNA secondary structures was found for highly abundant transcripts and proteins, as well as for high protein per transcript ratios. By recoding the genomes of the five model species, we show that codons with the highest relative increase in highly expressed genes reduces the variance and length of the loop sizes. Except for mouse, the overrepresentation of C-ending and avoidance of A-and T-ending codons in highly expressed genes was a commonly observed pattern across all species. Our findings suggest that the conserved bias in codon usage and associated smoothening of mRNA secondary structures reflect a general selection mechanism to enhance the expression level of abundant proteins, most likely via reducing detrimental ribosome collisions and promoting regular translation rates along transcripts.
Page 5 of 23

Data analysis
Data analysis was done in "R" (version 3.4.2, x64) with custom written scripts (39). Data organization was mainly done using the tidyverse packages (40), and most plots were generated using ggplot2 (41 For analysis of the transcript data, the log2 intensities were transformed to a z-score by applying the formula = , − where the average (µ) log2 intensity over all spots of species j (j = 1, 2, …, 5) was subtracted from the log2 intensity (y) of spot i (varied per species) of species j and divided by the standard deviation (σ) over all the spots of species j. This assured that all data was comparable in range across species.

Proteome collection and transformation
Protein abundances for each of the five species were obtained from the protein abundance database PaxDb (version 4) (42,43). Abundances with a value of 0 were excluded from the set. As for the gene expression data, the protein abundance data was rank normalized for analysis of highest versus lowest protein abundances. Within each species the abundance was ranked as a percentage (0% the lowest and 100% the highest abundance).For analysis of the protein data, log2 transformed abundances were transformed to a z-score, as for the transcript abundance data.

Protein per transcript ratio calculations
Page 6 of 23 We collected both transcript abundance data and protein abundance data to calculate a protein per transcript ratio (PTR). This was calculated using the z-score data = where PTR is the protein per transcript ratio of gene i, for which the log2 intensity transcript abundance (y) was divided by the log2 protein abundance (z). The PTR could only be calculated for genes where both transcript and protein abundance data was available.

Coding sequences and mRNA structure prediction
The coding sequences (CDS) of all genes of five species were downloaded from sequence or genome at 20 o C, using the parameters of Andronescu, 2007 (49).

mRNA sequence and structure statistics
Several statistics were taken from the mRNA sequence: gene length, absolute codon usage, absolute nucleotide usage (absolute number of A, T, C, and G) and the relative amount of each nucleotide per transcript, and nucleotide per codon position (e.g. the number of A at position 1), again absolute and relative.
The relative synonymous codon frequency (RSCF) for each codon was calculated across all selected transcripts using where Ncodon is the number of occurrences of a particular codon from the set of synonymous codons and Nsynonymous represents the number of occurrences of the synonymous codons, and i stands for a transcript of the n selected transcripts that contain the particular amino acid the synonymous codons encode. Thus, the values for RSCF vary between 0 and 1.
From the predicted mRNA structures ten parameters were taken: free energy (kcal/mol), For some of the factors a correction was applied for gene-length, these were: energy, number of bound nucleotides, number of unbound nucleotides, and number of transitions from stem to loop.

Relative synonymous codon frequencies linked to mRNA abundance, protein abundance, and protein per transcript
Changes in frequencies for each codon were calculated using where the relative change in codon frequency (ΔRSCF) was calculated for the average codon frequency in the 5% highest expressed genes (RSCFhigh) versus the 5% lowest expressed genes (RSCFlow). This resulted in a measure for changes in codon usage linked to expression levels. This method was applied for transcript abundances (relative synonymous codon use transcript abundance; ΔRSCF-TA), protein abundances (relative synonymous codon use protein abundance; ΔRSCF-PA), and protein per transcript ratios (relative synonymous codon use protein per transcript ratio; ΔRSCF-PTR). Comparison of the different ΔRSCF values within species was done using Pearson correlation as to account for ΔRSCF amplitude differences. To compare the direction of ΔRSCF associations between species, Spearman correlations were used. Spearman was used as the amplitudes of the ΔRSCF between species differed strongly.

mRNA structures linked to mRNA abundance, protein abundance, and protein per transcript ratio
To evaluate the predicted mRNA structures in a similar way as shifts in codon frequencies, the relationship between mRNA structure and expression was calculated using where the structural bias (ΔS) was calculated for the average structure value in the 5% highest expressed genes (xhigh) versus 5% lowest expressed genes (xlow). This method was applied for transcript abundances, protein abundances, and protein per transcript ratio. Comparison of the different structural biases within species was done using Pearson correlation as to account for ΔRSCF amplitude differences.

Permutation analyses of codon biases and mRNA structures
To assign probabilities to the observed trends, a permutation approach was used. Within each species the synonymous codons were randomly distributed over the amino acids. This resulted in an mRNA pool that encoded the same proteins and contained the same overall codon use. This procedure was conducted 100 times for each species, after which the permutated mRNA pool was folded and scored according to the procedures described above.

Page 8 of 23
The resulting distributions were used to determine whether the trends in codon bias linked to TA, PA, and PTR were likely to be random or the result of a non-random process. We ranked the relative synonymous codon use values from each permutation, the 2 nd highest and 2 nd lowest were taken as the 95% confidence interval. Based on these values the significances of the codon biases were assessed. Like the trends in codon bias, these distributions were also used to determine if mRNA structure properties linked to TA, PA, and PTR were likely to be random or the result of a non-random process.

Structure evaluation through mRNA sequence recoding based on preferred codons
To evaluate the impact of changes in codon usage on mRNA structure, the entire coding sequence library of each species was recoded by replacing each codon with the synonymous codon that showed the highest relative change (ΔRSCF) upon comparison of highly and lowly expressed genes. This recoding process was conducted for codons that were selected by analysing transcript abundances, protein abundances as wells protein per transcript ratios. A comparison between the native and recoded mRNAs was made per transcript by where the structural change (ΔS) was calculated for the average structure value in recoded genes (xrecoded) versus native encoded genes (xnative). The ΔRSCF-identified codons used for recoding that showed the highest relative changes can be found in Supplementary Table S5. It is noted that these codons are not necessarily the most frequent synonymous codons. To test if recoded mRNA structures differed from the native encoded mRNA structures, a paired t-test was conducted on the untransformed structure features. In total, 11 structure features over five species with three different recodings were tested: codons that were selected by analysing ΔRSCF-TA, ΔRSCF-PA, and ΔRSCF-PTR data. A Bonferroni correction was applied to correct for multiple testing.

Conserved adaptation in codon usage in highly abundant transcripts
To determine whether global patterns in codon change can be identified in abundant mRNAs, we analysed for each of five model species at least 250 transcript profiles (Supplementary Table S1). After rank-normalization, the expression levels across developmental stages, tissues, strains and treatments were averaged to determine the relative ranking of the genes within each species. Next, the highest and lowest 5% expressed transcripts were selected and the relative synonymous codon frequency was determined per transcript ( Figure 1A) and used to calculate to average relative codon frequency (RSCF) within the two sets of transcripts. To evaluate the relative change in codon usage between the two classes of transcript abundances (TA), the log2 ratio of the two RSCF values was taken as a measure

Transcript and protein abundances, as well as protein per transcript ratios reveal similar patterns in codon adaptation across species
To evaluate the codon usage in relation to protein levels, we retrieved data of the five species from the protein abundance database PaxDb (42,43). The same analysis as for the transcript abundance was  Table S3).
In conclusion, despite the fact that largely non-overlapping gene pools were sampled ( Figure 2C  To determine the influence of the conserved codon adaptation on secondary mRNA structures, the genomes of the five model species were recoded by selecting for each amino acid a codon that increased most in frequency at high extremes as identified by the TA, PA, or PTR analyses. It should be noted that these selected codons were not necessarily the most frequently used codons in highly expressed genes. In several cases, codons with the highest frequency in highly expressed genes did not show the highest relative increase. Depending on the species and the type of analysis, in three to seventeen cases the most frequent codon was not the same as the codon with highest value for ΔRSCF (Supplementary Figure S7).
For most amino acids the selected codons largely overlapped between the three types of analyses, not only within species but also between species (Supplementary Table S4). After recoding, the secondary structures were predicted, and the structural characteristics of each recoded mRNA was compared with those of its native structure. This comparison revealed that codons showing the highest relative increase at high transcript abundances (TA) caused a significant bias in secondary structures by increasing the physical stability of the secondary mRNA structure, the number of stem to loop transitions, and the number of bound nucleotides (Bonferroni adjusted paired t-test, p < 0.05; Figure 4).  Table S5).
In conclusion, recoding the genes of the five model species demonstrated that codons with the largest relative frequency changes at high TA, PA and PTR levels enhanced the formation of regular secondary structures, mainly through reducing the size and variation of the loops.

Interdependence between mRNA structural features
To obtain insight in the interdependence between the mRNA structural features analysed in this study, The most likely explanation is that the shift towards C-ending codons and avoidance of A and T-ending codons enhances the propensity to form stable secondary structures, which reduces loop size but apparently does not necessarily result in an increase in stem size.
Codon bias across multiple organisms has been well documented in literature. Already in the 1980s studies reported on tRNA abundances correlating with codon usage (5,50). Codon adaptation has been linked to both transcript and protein abundance (8,22). Genome-wide codon bias is driven by weak selective forces, mutation and genetic drift (6). While selection has been demonstrated to play a dominant role of in non-mammalian species (36,51), in mammalian species mutational biases are more prominent (52,53) and the contribution of selection is more disputed (6,22). This correlates well with our findings where the conserved bias in codon use and mRNA secondary structures in M. musculus is less pronounced than in the other four model species. The approach used in this work identifies codon adaptation by calculating the ratio of the relative coding frequencies (RSCF) in highly and lowly expressed genes. Our computational method minimizes the contribution of mutation bias (54) and focuses on the effect of selection, and is conceptually similar to that presented in (54), although it does not distinguish between A/G-ending or T/C-ending codons. Other methods to identify codon-biases exist, including codon adaptation index (CAI) (37), translation adaptation index (tAI) (55), and codon deviation coefficient (CDC) (56). All methods have their interpretations and biases, which are still topic of debate (54). A key advantage of the current approach is that it simply compares the relative enrichment of the use of a specific synonymous codon in a group of genes defined by a measurement of abundance. The pronounced outcome of our analyses may be attributed to evolutionary forces that Page 14 of 23 act both on highly and lowly expressed genes. While traditionally selective forces have been proposed to be the most dominant at high expression levels (6), recent work has suggested that codon usage in lowly expressed genes is subjected to selection and reduces the stability of the mRNA secondary structures (57).
Codon optimality, and hence translation rate, is thought to promote mRNA stability via regulating the rate of elongation (8,58). The positive relationship between translation rate and mRNA stability has been reported for a diverse range of organisms ranging from E. coli (29), Drosophila (59), zebrafish (30), Xenopus (30), to humans (60,61). In S. cerevisiae the DEAD-box protein Dhh1p has been shown to monitor ribosome densities and to target inefficiently translated mRNAs for degradation (32). In view of these observations, we assume that the observed bias in mRNA secondary structures contributes to the stability of highly abundant transcripts by reducing variation in ribosome speed along transcripts.
The translation rate in weakly structured regions (loops) is relatively high and may lead to congestion of ribosomes at the start of highly structured regions, thereby leading to collisions and enhanced mRNA decay. An alternative explanation, not mutually exclusive, is that larger loop sizes with inherent higher ribosome speed have larger stretches of 'naked' mRNA and are less protected by ribosomes that shield mRNAs from degradation by endonucleases (62). Since codon usage and mRNA secondary structures are intertwined, smoothening of the elongation rate is expected to be a delicate interplay between optimizing codon composition and mRNA folding. The existence of such an interplay has been reported for E. coli and S. cerevisiae where a reduction of fluctuating translation rates is achieved by an increase in optimal codons in highly structured regions (stems), and a preferential use of non-optimal codons in weakly structured regions (loops) (33).
The notion that the avoidance of local fluctuations in ribosome densities enhances mRNA stability may explain that both high transcript abundances and high protein per transcript ratios are associated with a similar bias in codon usage and inherent mRNA secondary structures. Local fluctuations in translation rates leading to ribosome collisions and stalling may trigger no-go decay of the transcript and degradation of the partly synthesised protein (63)(64)(65)(66)(67)(68). Modelling studies (69) emphasize that varying rates of translation may be unproductive and indicate that evenly spaced distances between ribosomes are more beneficial for optimal elongation rates. Thus, the bias in mRNA secondary structures associated with a high protein per transcript ratio indicates that smoothening of secondary structures may enhance translation efficiencies. It is stressed that the amount of protein per transcript seems a fair proxy for translation efficiency, as genome-wide protein turnover plays a rather small role in the explanation of protein abundance variation (70,71). Remarkably, while smoothening of the secondary structures is positively linked with transcript levels (TA) as well as protein per transcript ratios (PTR), the underlying gene pools hardly overlap. These data imply that our findings are independent of the function and sequence structure of the underlying genes indicating that two different pathways converge by selecting codons and inherent mRNA structures that promote regular translation rates along transcripts. It is noted that protein abundances (PA) are the output of transcript levels (TA) and the number of polypeptides synthesized per mRNA (PTR) and, consequently, reveal a similar bias at high extremes as the TA and PTR analyses, which is also reflected in the large overlap of the underlying genes at high expression levels.
In sum, we suggest that the high correlations between all three types of analyses (TA, PA and PTR) can be explained by selection towards mRNA structures that facilitate a uniform ribosome speed during elongation. However, the strength and relevance of the underlying mechanism remains elusive. A multitude of selective forces (7, 8, 20-22, 72, 73)  Irrespective the underlying mechanisms, our results provide an additional guideline in codon optimization strategies for the design of synthetic genes. Although many successes have been reported, results are unpredictable and optimizing codon use remains a challenge (74). To increase the production of proteins in heterologous expression systems, several methods have been developed that optimize expression by adapting the CAI's index to the expression host, together with adjusting sequence features that often include GC content, avoidance of Shine-Dalgarno-like sequences, transcriptional terminator sites and RNase recognition sites (74). Our findings can be used to extend existing algorithms to design synthetic genes leading to higher expression levels by selecting codons that lead to more uniform mRNA secondary structures.
Altogether, our analysis across four kingdoms of life reveals a broad selection pressure towards more uniform mRNA structures. Genome-wide recoding of the mRNAs with codons that are increase most in highly expressed genes results in a shift towards smaller and less variable mRNA loop sizes.
Our data indicate that selection on regular mRNA secondary structures may contribute to shaping the codon landscape and plays next to codon-anticodon adaptation and various other mechanisms (7,8,72,73), a significant role in enhancing the synthesis of highly abundant proteins. A plausible mechanism to explain our observations is that more regular mRNA structures promote gene expression by reducing detrimental fluctuations in ribosome trafficking that may attenuate both mRNA stability and translation efficiency.

AVAILABILITY
Scripts and the underlying transcriptome and structural database assembled to derive the figures and tables presented in the paper are accessible via a git repository: https://git.wur.nl/published_papers/sterken_codon_2020.

Page 16 of 23
Accession numbers of the transcriptome studies used can be found in Supplementary Table S1.

SUPPLEMENTARY DATA
Supplementary Data are available along with the publication.

ACKNOWLEDGEMENT
We want to thank Aska Goverse and Geert Smant for their comments on early versions of this manuscript.