The Use of GC-, Codon-, and Amino Acid-frequencies to Understand the Evolutionary Forces at a Genomic Scale

It is well known that the GC content varies enormously between organisms; this is believed to be caused by a combination of mutational preferences and selective pressure. Within coding regions, the variation of GC is more substantial in position three and smaller in position one and two. Less well known is that this variation also has an enormous impact on the frequency of amino acids as their codons vary in GC content. For instance, the fraction of alanines in different proteomes varies from 1.1% to 16.5%. In general, the frequency of different amino acids correlates strongly with the number of codons, the GC content of these codons and the genomic GC contents. However, there are clear and systematic deviations from the expected frequencies. Some amino acids are more frequent than expected by chance, while others are less frequent. A plausible model to explain this is that there exist two different selective forces acting on the genes; First, there exists a force acting to maintain the overall GC level and secondly there exists a selective force acting on the amino acid level. Here, we use the divergence in amino acid frequency from what is expected by the GC content to analyze the selective pressure acting on codon frequencies in the three kingdoms of life. We find four major selective forces; First, the frequency of serine is lower than expected in all genomes, but most in prokaryotes. Secondly, there exist a selective pressure acting to balance positively and negatively charged amino acids, which results in a reduction of arginine and negatively charged amino acids. This results in a reduction of arginine and all the negatively charged amino acids. Thirdly, the frequency of the hydrophobic residues encoded by a T in the second codon position does not change with GC. Their frequency is lower in eukaryotes than in prokaryotes. Finally, some amino acids with unique properties, such as proline glycine and proline, are limited in their frequency variation.


Introduction
The GC-frequency varies significantly between different and within genomes, both in coding and non-coding regions [1]. The reason behind this is not entirely understood. However, it is likely due to a combination of a balance between mutational preferences, selective pressure and evolutionary history [1]. In general mutational preferences decrease GC levels in most organisms [2,3], while GC-biased gene conversion (gBGC) can contribute to higher GC levels [4]. The environment of the organism can also influence the GC level as GC levels are higher in thermophiles [5]. Further, there is a phylogenetic signal so that closely related organisms mostly have similar GC-levels [6]. Finally, differences in DNA polymerase subunit III might correlate with differences in GC [7].
Here, we are not trying to answer the long-disputed origin of the difference in GC content. Instead, we assume that there is some mechanism driving the GC content of a particular organism towards an optimal level. Thereafter, we ask how does this affect the proteomes by examining the frequency of amino acids, nucleotides 1/ 30   1st base   2nd base  3rd base  T  C  A  G   T   TTT  TCT  TAT  TGT  T  Therefore, in general, amino acids encoded by more codons are more frequent, and amino acids encoded by GC-rich codons are more frequent in GC-rich genomes [8][9][10] leading to massive variation in the frequency of amino acids in different organisms [11,12]. For instance, the positively charged amino acids Arg and Lys vary between 2% and 10% in frequency. Arg is more frequent in GC-rich organism, and Lys is more frequent in GC-poor organisms. The codon table, see Figure 1, is surprisingly well conserved since early life. The same 61 codons encode the twenty amino acids in most organisms. However, some variations exist. For instance, in eukaryotes, one of the stop codons can encode selenium methionine [13], and other variations exist among Mycoplasma, Spiroplasma, Ureaplasma and Mesoplasma [14]. The redundancy in the codon table means that for many amino acids, the third position does not change the amino acid. Therefore, the overall GC content can change by using different nucleotides in the third position without affecting the proteome. Further, the codons have evolved in such a way that the general properties of the amino acids are determined mainly by the codon in position two [15].
In addition to codon frequency and GC level, there exist many factors that contribute to the frequency of amino acids [16][17][18]. The cost of amino acid synthesis might affect their frequency [19,20], and some amino acids, such as serine, can be toxic at high levels [21]. In addition to purifying effects to reduce the frequency of one amino acid, an organism might require a minimum frequency of amino acids with specific properties, while other amino acids, such as alanine, might be allowed to vary more freely [22].
Below, we analyze the frequency of amino acids, codons and nucleotides in different genomes. We show that the codon frequency does not fully explain amino acid frequencies, i.e. other factors also affect the amino acid frequencies. Some amino acids, such as serine, are consistently less frequent than expected, while others, such 2/30 as glutamate, are more frequent. Further, some amino acids, such as proline, are less dependent on GC than expected, indicating that there are limits to how much they can vary. The picture that emerges is that there on a genomic perspective there exists two selective forces, one that adjusts the GC content to a certain level and one that given a certain GC level adjusts amino acids frequencies. By detailed analysis, we can obtain an understanding of the forces acting on the amino acids.

Datasets
The dataset used in this study originates from the complete bacterial, archaeal and eukaryotic proteomes in UniProt [23] as of December 2017. All genomes from Mycoplasma, Spiroplasma, Ureaplasma, and Mesoplasma were ignored as they have another codon usage -which influences the expected amino acid frequencies. The final dataset contains 36,098,162 protein sequences from 8,546 genomes, divided into 7,197 bacterial, 351 archaeal, and 998 eukaryotic species. For each genome, the GC content of the genome and the length was obtained from NCBI. Further, the DNA and amino acid sequences of each gene were downloaded. The processed datasets, as well as all scripts, are available from this repository [24].

Statistics
For each protein, we calculated amino acid-, GC-, codon-and nucleotide-frequencies. Average, maximum and minimum frequencies for each genome in the dataset are presented in Table S1.
ANOVA type 2 F-tests [25] were used to identify the contribution differences between the kingdoms, compensating for differences in GC content, see Table ??. Using each codon/amino acid/nucleotide as the dependent variable and the GC content is used as the independent variable, the difference between kingdoms was tested. Here, it should be mentioned that even tiny differences are statistically significant, as the dataset is large. Further, differences between eukaryotes and bacteria dominate the ANOVA test as these are the most prominent groups.

Expected frequencies
It is necessary to define the expected frequency of amino acid (AA i ) to identify any selective pressure. Therefore, we define models to estimate the expected amino acid frequencies (AA i X ) assuming different scenarios. The simplest model, the codon model, assumes that the frequency of an amino acid is solely determined by the number of codons encoding that amino acid: where Codons i is the number of codons for amino acid i and 61 is the number of codons excluding stop codons.
Alternatively, the amino acid frequencies may be dependent on GC (i.e. there exist some other mechanism that determines the GC content of a genome) leading to the expected amino acid frequency AA i GC at a certain GC level to be: 3/30 where Codons i represents the codons for amino acid i and x the three nucleotides in that codon, GC is the fraction GC in the genome and δ(N (x) ∈ (A, T )) is a delta function that is one if the nucleotide N (x) is A or T, and zero if not. Further, AA ST OP GC (GC) is the expected frequency of stop-codons given GC as defined here: where the stopcodons sums over the three stop codons. However, as we show below there are other parameters that also affect the amino acid frequencies. To take several scenarios into account, we use the following formulae to estimate the amino acid frequency (AA i ) for the amino acid i at a given GC level: Here, AA i GC (GC) is the expected frequency of amino acid i at the GC as defined in equation 2. AA i GC (50%) is the expected frequency at GC=50%, and W i , and K i are two parameters that are optimized for each amino acid. The reason to use this function, and not simply AA i = w i * GC + k i is to have a consistent definition of the parameters W i , and K i . In particular, the parameter K i is useful to estimate over-, and under-representation of an amino acid.
Using equation 4, we can model different scenarios. If W i = 0 and K i = 0, then equation 4 describes the expected frequency from the number of codons as in equation 1 (the codon model). If W i = 1 and K i = 0 the equation describes the expected amino acid frequency at a certain GC level as in equation 2. If W i = 0 while K i is optimized, this describes the average amino acid frequency in all genomes and then K i represents a shift from the expected frequency. Finally, we can optimize both W i and K i and obtain the amino acid levels using two parameters (the twopar model). Here, to be more realistic, we limit W i to be between 0 and 1. Also here K i represents the shift from the expected frequency.
To compare the different models to estimate the amino acids, we use the Pearson correlation coefficient [26] and the average error between the estimated and observed frequencies of all twenty amino acids, see Figure S1.

Linear regressions
To estimate the GC frequency from amino acid frequency, we used sklearn [27]. Given the amino acid frequency of one or more amino acids in a protein or a proteome, the model was trained to predict the GC level of the coding region of a proteome. In addition we trained the same model to predict the GC level from a single protein. Here 25,000 randomly selected proteins were used.

Results and Discussion
The GC frequency can vary tremendously between organisms, see Figure 2. In our set of proteomes, the beta proteobacteria Candidatus Zinderia insecticola has the lowest GC content with 13.5% and Geodermatophilus nigrescens has the highest (75.9%), see Table S1. The mechanism causing these differences is not entirely known, but factors such as mutation rate, crossover rate, thermodynamical stability and phylogenetic memory contribute [4]. Anyhow, in this study, we will not focus on the GC difference. Instead, we will analyze how the difference in GC levels affect the proteomes and use divergence from expected frequencies to analyze the selective pressures at the proteome level.

GC distributions
First, some notes about the overall GC content. Both prokaryotic kingdoms have a bimodal GC distribution with one peak around 40% and the second at 70% [28], see Figure 2a. In contrast, Eukaryotes have a single, Figure 2. Distribution of GC contents in the three different kingdoms. In (a) the GC content in the whole genome is plotted against the GC content in the coding regions. In (b-d). the GC content in the three codon positions are plotted against the GC in the coding regions.
less wide, peak of GC content. Therefore, the standard deviation in the prokaryotes is larger (12% vs 8%), while the average GC levels are similar (48%-51% for the coding regions), see Table S1.

GC coding vs non-coding
For both prokaryotes, the genomic GC level (from NCBI) and the GC level of the coding regions (from Uniprot) are almost identical and perfectly correlated (CC=0.998), while for Eukaryotes the levels differ slightly but are still strongly correlated (CC=0.89), see Table S1. Eukaryotes have a higher GC content in coding regions (49% vs 44%), see Figure 2a. The GC level in the coding regions is similar to the average level observed in prokaryotes. Therefore, we believe that comparing features with the GC content of the coding region is more appropriate. Further is also simplifies the analysis of codon and nucleotide frequencies.  In the codon table, seven (Phe, Leu, Val, Pro, Thr, Ala, and Gly) out of the twenty amino acids are determined by position one and two, see Figure 1. Further, the two first bases and a combination of TC or AG in position three determines eight other amino acids (Tyr, His, Gln, Asn, Lys, Asp, Glu, and Cys). Two amino acids (Met and Trp) have only one codon, and Ile uses the three ATX codons not encoding Met. The remaining two amino acids, serine and arginine, are encoded by two groups of codons with different nucleotides in position one and two. Finally, there are three stop codons that all have a T in its first position (T1).
Given the position in the codon table for amino acids with similar properties, it is clear that in particular, 6/30 position two determines the properties of the amino acid [15]. For instance, all codons with T2 encode hydrophobic amino acids, while both negatively charged amino acids have A2.

GC in different positions.
The GC content differs between the three codon positions, see Figure 2 and Table S1. In all positions, the GC content is strongly correlated with the overall GC content (Cc > 0.93). The average GC content is lower in position two than in the other two positions. Further, in position one and two, the variation in GC content is much more restricted than in position three. The highest GC level in position three is 97% and the lowest 3%, compared to 18% and 58% in position two. The difference between the positions means that the GC variation in position three is significantly higher than in the other parts of the genome. A model to explain the variation of GC in the three positions can be formulated as follows: In an organism, there exists a selective pressure to have a certain optimal GC content (in the coding regions). For some organisms, this optimal level is very high or very low, i.e. extreme. However, the selective pressure acting on amino acids frequencies makes it impossible to have extreme GC levels in position one and two. Therefore, to obtain extreme overall GC levels in these organisms, it is necessary to over-compensate in position three. In theory, if the GC content is limited to 50% in position one and two but varying in position three, this allows the genomic GC to vary between 33 and 67%. However, the GC content in one and two also varies, and amino acid frequencies also change; therefore, the overall GC content can vary between 13% and 76%.
Although the average GC content is similar in all three positions, it is clear that the nucleotide frequencies are not, see Figure 3, 4 and Table S1. The differences are largest for position one and two. In position one, G1 and A1 are more frequent than the other two nucleotides, while in position two A2 and T2 are most frequent. For a more detailed understanding of these differences, we will analyze the frequencies of each nucleotide in each position, starting with position one.

Position 1
In position one, it can be seen that G1 is most frequent, and T1 is least frequent (average frequency is 16.7%). However, it should be remembered that all three stop codons have a T1, so the expected T1 frequency is not 25% but only 20.3%. In addition, serine, which has four out of six codons with T1, is one of the most underrepresented amino acids, as we have described before [12]. G1 encodes for VADEG, these amino acids are all over-represented compared to random, see Figure 5.

Position 2
Position two is the most conserved position when it comes to GC content. It is also clear that A2 and T2 are more frequent than G2 and C2, on average about 30% vs 20%, see Table S1 and Figure 3. Further, T2 is almost independent of the GC content in all genomes, but consistently lower in eukaryotes than prokaryotes, see Figure 4. The constant level of T2 guarantees a stable amount of the non-polar, and β-sheet forming amino acids (FLIVM). G2 is rare and have a limited range. G2 encodes several amino acids with unique properties, such as glycine (the smallest amino acid) and cysteine (that can form disulphide bonds), but also arginine, tryptophan and one of the stop codons. The rareness can be contributed to the 40% (6 out of 15) of the non-stop G2 codons that encode arginine, and the frequency of arginine is underrepresented, see Figure 5. Finally, A2, that encodes primarily charged and polar amino acids (YHQNEDK), and C2, that encodes APST are allowed to vary more freely than the other two nucleotides in position 2.

Position 3
In general, it is believed that position three in a codon is not under selective pressure as it only rarely affects the amino acid, Figure 1. However, if no selective pressure acted on position three random drift would make all nucleotides equally frequent in that position, and clearly, they are not, see Figure 4. In contrast, the nucleotides in position three vary much more than in the other positions. The frequency of most nucleotides varies between 1% and 60%, supporting the idea that the genomic GC preference governs nucleotide frequencies. C3 is most frequent in position three but least frequent in the other two positions, see Figure 3.

Amino acid frequency vs GC
To be able to identify the selective pressures acting on amino acid frequencies, it is necessary to estimate the expected amino acid frequencies without any selective pressure. Therefore, it is necessary to have a model to describe the expected amino acid frequency for a genome. Following the speculations above we do assume that: There exist some evolutionary process that strives the GC content of a genome to be adapted, but that is independent of the selective pressure acting at the amino acid level. It is then possible to model the expected amino acid frequencies assuming that protein-coding regions would be random in the absence of any selective pressure at the amino acid level. A simple explanation of the variation of amino acid frequency would be that it is just decided by the number of codons coding for an amino acid as in equation 1. Nevertheless, a better agreement is observed when taking the GC into account and calculate the expected amino acid frequencies, given the GC of the genome, as in equation 2. Below, we use this equation to estimate the expected frequencies of the amino acids. Figure 5 shows the amino acid frequencies of each amino acid against the GC content of the coding regions

Frequency of low GC amino acids depends strongly on GC
The frequency of all the amino acids with less than one-third of GC in their codons, i.e. Ile, Lys, Asn, Phe and Tyr, show a strong correlation with GC, see the top row in Figure 5. The frequency of these amino acids vary from 1-2% at high GC up to 19% at low GC and the correlation with GC is 0.83 to 0.93, see Table S1. The lowest correlations against GC are for Tyr and Phe, which have a flatter distribution than expected from the GC frequency alone.

The frequencies of amino acids low GC dependency are independent of GC
Next, there are 11 amino acids with a GC content in their codons between one-and two-thirds. None of these shows a strong dependency of GC, but the correlations with GC are rather high for Valine (CC=0.72) and Trp (CC=0.74). More notably, some of these amino acids are more frequent than expected from the codons and some less.

Frequency of all high GC codons strongly depends on GC
Finally, the amino acids with more than two-third of GC in their codons are also strongly dependent on the GC content (CC> 0.85). Shifts can be seen as Arg is less frequent than expected. The frequencies of Gly and Pro also appears to be limited to be within a specific range.

Systematic shifts
From the studies above, it is clear that there exist systematic divergences of amino acid frequencies for some amino acids. In general, the divergences are (a combination) of two types, shifts and decreased GC dependency.
A shift refers to that the amino acid frequency is consistently over-or under-represented (as for serine), while the decreased GC dependency refers to a decreased dependency of GC, i.e. a flatter distribution (as for proline), see Figure 5. It is tempting to speculate that a shift would indicate that there exist a selective pressure for that amino acid to be more or less frequent, while a decreased GC dependency indicates that there exists a selective pressure to keep that amino acid at a constant level.
To identify systematic shifts, we have used equation 4 (with different limitations to the parameters). Here, K describes a shift up or down from what is expected by random and W describes the strength of the dependency with GC (one is perfectly correlated, and zero indicates no dependency). The parameter W is, therefore, only relevant for the amino acids with GC-rich or GC-poor codons, see Figure S3. Figure 6 shows that the shifts (K) are consistent independent of what model is used. Arginine, serine, cysteine and proline are under-represented while glutamate, aspartate, lysine and alanine are over-represented, see Figure 6. These shifts are also clearly observable in Figure 5. The variation between the kingdoms is small, but the shifts are in general smaller in Eukaryotes. The average error for the GC model, equation 2, is 1.4% in Eukaryotes vs 1.9% in Bacteria and 2.1% in Archaea.

The intricate balance of charged residues.
The positively charged amino acids Lys and Arg are like Siamese twins, one has GC-rich codons, and one GC-poor, both are positively charged, and they can often (but not always) perform similar roles in a protein.
One notable difference is that six codons encode arginine compared with two for lysine, i.e. arginine should be three times as frequent at 50% GC. However, arginine is consistently less frequent than expected from GC while lysine is more frequent, compensating for the difference in codons, see Figure 5 and 6. The total number of Arg+Lys is rather constant but decreases slightly with GC, see Figure 7.
The negative amino acids (Asp and Glu) are, in contrast, not very GC dependent and are constant in GC, see Figure 5. Notably, as a group, the negatively charged amino acids are much more frequent compared to what is expected by random, see Figure7 and 6. The shifts are therefore most likely a consequence of that there are eight codons for positively charged amino acids compared to only four for the negatively charged amino acids and that the overall charge of the proteome is close to neutral independent on GC content, see Figure 7.

Limited frequency ranges.
In addition to amino acids consistently over-or under-represented, there exist amino acids that are limited in their variation. In Figure 5 and S3, it can be seen that five amino acids are less dependent on GC than expected. Isoleucine, tyrosine and phenylalanine are all less frequent than expected at low GC and more frequent at high GC. Similarly, Pro and Gly are both more frequent than expected at low GC and less frequent at high GC. Given the unique properties of Tyr/Phe (aromatic) and Gly/Pro (secondary structure breakers), it is not surprising that there exist boundaries to their frequency variations.

Differences between kingdoms
Although most amino acids and codon frequencies are similar in the three different kingdoms, there exist some differences to be noted. We have earlier reported that serine and proline are more frequent in eukaryotes, and that isoleucine is less frequent [12]. Here, we confirm that these differences are among the most significant differences between the kingdoms using an ANOVA test, see Table 1. However, other differences can also be detected.
In Table 1, it can be seen that two features dominate the difference between the kingdoms, increased serine frequency in eukaryotes and decreased T2 frequency in eukaryotes. As mentioned above, T2 codes for the hydrophobic amino acids phenylalanine, leucine, isoleucine, methionine and valine.
If we ignore differences in codons, following next in importance is the increase in eukaryotic proline frequency and decrease in isoleucine frequency [12]. These are then followed a decrease in glycine and an increase in cysteine in the eukaryotes. Finally, G1 is less frequent in eukaryotes than in prokaryotes. G1 encodes valine, alanine, aspartate, glutamate and glycine, and all these are slightly less frequent in eukaryotes than in the prokaryotes.
All the codons that are highest ranked in the ANOVA test are coding for one of the amino acids discussed above. It is interesting to note that CCA codon explains the most of the proline increase.

Archaea
For many features, such as glutamate and aspartate frequencies, it can be seen that the archaea kingdom is divided into two groups. Brief analysis indicates that this roughly correlates with the phylum Euryarchaeota and other archaea. Euroarchaeota have more proteins (2170 vs 1620), higher GC (50% vs 45%) and more Asp (6.3% vs 4.9%) and Glu (8.2% vs 7.2%) but less Lys (7.3% vs 5.8%). Although interesting, a detailed analysis of these differences is beyond the goals of this study.

Predicting GC from amino acid frequencies
Is it possible to predict the GC frequency from amino acid frequencies? We show that even the frequency of one amino acids, such as asparagine or alanine, in the proteome, can predict the GC level with an error of less than 5% and a correlation coefficient of 0.95, see Figure 8. If the frequency of all twenty amino acids is included, the error drops below 2%, and the correlation coefficient is 0.99.  Even the frequency of amino acids for a single protein is informative of the GC level of the entire proteome. The sequence of a single protein can predict the GC level with an average error of 5% and a correlation coefficient above 0.90. This can, for instance, be used to detect laterally transferred genes directly from amino acid sequences if the genomic sequence was not available.

Conclusions
Here, we study the relationship between GC content of organisms and frequencies in their coding regions. We highlight that amino acid frequencies differ significantly in high and low-GC genomes and that their frequencies are primarily dependent on the number of codons and the GC content of their codons. But there are also significant differences.
To explain this, we propose that there exist an (unknown) mechanism acting to maintain the GC level in an organism. This can be seen by the fact that the third position varies much more than the others and by the differences in nucleotide frequencies in the different codon positions. Next, we propose that there is also a selective pressure changing amino acid frequencies from what is expected by chance. This mechanism decreases the frequencies of arginine and serine in all organisms, while lysine, aspartate and glutamate are more frequent than expected by chance. Further, this mechanism limits the influence of GC on the frequency of tyrosine, phenylalanine, glycine, proline and isoleucine.
We also note that the selective pressure acts to; (i) Keep a balance of negatively and positively charges amino acids in all genomes (except some Euroarchaeota). This is maintained by an intriguing by the underrepresentation of arginine and overrepresentation of negatively charged amino acids. (ii) Maintaining the hydrophobic residues at a constant level by keeping a constant fraction of Thymine in the second codon position.
Finally, we also show that two most significant factors differ between eukaryotes and prokaryotes are: (a) Eukaryotes have more serine residues and (b) less of codons with a T in position two (T2), which results in fewer hydrophobic residues (FLIVM).  Figure S1. Correlation between expected and observed amino acid frequency for all amino acids over all genomes. Figure S2. In position three perfect correlation, i.e. GC determines everything 18/30