1 Abstract
It is well known that GC content varies enormously between organisms; this is believed to be caused by a combination of mutational preferences and selective pressure. Within coding regions, the variation in GC is larger at codon position three and smaller at positions one and two. Less well known is that this variation also has an enormous impact on the frequency of amino acids, because their codons differ in GC content. For instance, the fraction of alanine in different proteomes varies from 1.1% to 16.5%. In general, the frequency of an amino acid correlates strongly with the number of codons that encode it, the GC content of these codons and the genomic GC content. However, there are clear and systematic deviations from the expected frequencies: some amino acids are more frequent than expected by chance, while others are less frequent. A plausible model to explain this is that two different selective forces act on the genes: first, a force acting to maintain the overall GC level and, second, a selective force acting at the amino acid level. Here, we use the divergence of amino acid frequencies from what is expected given the GC content to analyze the selective pressure acting on codon frequencies in the three kingdoms of life. We find four major selective forces. First, the frequency of serine is lower than expected in all genomes, most markedly in prokaryotes. Second, there exists a selective pressure acting to balance positively and negatively charged amino acids, which results in a reduction of arginine and all the negatively charged amino acids. Third, the frequency of the hydrophobic residues encoded by a T in the second codon position does not change with GC, and their frequency is lower in eukaryotes than in prokaryotes. Finally, some amino acids with unique properties, such as glycine and proline, are limited in their frequency variation.
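To make the notion of an "expected" amino acid frequency concrete, the following is a minimal sketch of one possible null model. It is an illustrative assumption and not necessarily the model used in this work: every codon position is taken to have the same GC content, nucleotides are drawn independently with P(G) = P(C) = GC/2 and P(A) = P(T) = (1 - GC)/2, and stop codons are excluded before normalising. The function name `expected_aa_frequencies` is hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def expected_aa_frequencies(gc):
    """Expected amino acid frequencies for a given GC fraction (0..1).

    Assumption (illustrative only): all three codon positions share the
    same GC content and nucleotides are independent of each other.
    """
    base_prob = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    freqs = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":  # skip stop codons
            continue
        prob = base_prob[codon[0]] * base_prob[codon[1]] * base_prob[codon[2]]
        freqs[aa] = freqs.get(aa, 0.0) + prob
    total = sum(freqs.values())  # renormalise over sense codons only
    return {aa: f / total for aa, f in freqs.items()}


if __name__ == "__main__":
    for gc in (0.3, 0.5, 0.7):
        f = expected_aa_frequencies(gc)
        print(f"GC={gc:.1f}  Ala={f['A']:.3f}  Lys={f['K']:.3f}  Ser={f['S']:.3f}")
```

Under this simplified model, alanine (encoded by GCN codons) becomes more frequent as GC rises, while lysine (AAA, AAG) becomes less frequent. Comparing observed proteome frequencies with such expectations is one way to quantify the deviations discussed above; a more realistic model would treat the three codon positions separately, since GC varies most at position three.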