Codon usage is a stochastic process across genetic codes of the kingdoms of life

DNA encodes protein primary structure using 64 different codons to specify 20 different amino acids and a stop signal. To uncover rules of codon use, ranked codon frequencies have previously been analyzed in terms of empirical or statistical relations for a small number of genomes. These descriptions fail on most genomes reported in the Codon Usage Tabulated from GenBank (CUTG) database. Here we model codon usage as a random variable. This stochastic model provides accurate, one-parameter characterizations of 2210 nuclear and mitochondrial genomes represented with > 10^4 codons/genome in CUTG. We show that ranked codon frequencies are well characterized by a truncated normal (Gaussian) distribution. Most genomes use codons in a near-uniform manner. Lopsided usages are also widely distributed across genomes but less frequent. Our model provides a universal framework for investigating determinants of codon use.

The 'language' by which genomes describe proteins has received theoretical interest ever since the genetic code was discovered. In particular, a degeneracy of vocabulary is intriguing: the 64 different codons of the DNA genetic code outnumber the different amino acids and stop signal that they encode by a factor of ≈ 3. How do biological organisms deal with or exploit such degeneracy? Does biased use of synonymous codons encode information beyond amino acid sequence (2-6)? Information flowing from genome to encoded proteins can be monitored at its source by counting how often certain codons are used by a genome. Normalizing codon counts into frequencies and ordering the frequencies into a descending series generates frequency-rank plots that are comparable across genomes in spite of differences in genome size, genetic code, or varying bias in the use of synonymous codons. Such plots provide a global perspective for investigative and theoretical work regarding the organization of the coding DNA and the machineries of translation.
Toward a mathematical interpretation of frequency-rank plots, various empirical formal descriptions have been proposed: a power of rank (Zipf's law) (7,8), an exponential of rank (9-11), or a combination of exponential and linear relations (1). Statistical relations have also been formulated: the first statistical model of codon use was proposed by Borodovsky et al (12,13), and a more recent one by Naumis et al (14,15). Most of these descriptions do not capture the inflected tail of frequency-rank plots, a feature observed in many genomes. Models based on additive or multiplicative contributions to codon use have been devised to better describe the tail of such frequency-rank plots (1,14,15).
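For orientation, the two simplest empirical forms named above can be written in generic notation. The constants A, α, B, and β are symbols introduced here for illustration and are not taken from the cited works; the combined exponential-linear form of (1) is not reproduced.

```latex
y(r) = A\, r^{-\alpha} \quad \text{(power of rank, Zipf's law)}, \qquad
y(r) = B\, e^{-\beta r} \quad \text{(exponential of rank)}, \qquad 1 \le r \le 64 .
```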
In this paper we present a systematic study of codon use based on the Codon Usage Tabulated from GenBank (CUTG) database (16). Codon frequency is interpreted as a random variable, and the ranks of codon frequencies are interpreted in terms of cumulative probability. Cumulative probability is described by a truncated normal distribution, which is used in our analysis as an empirical description rather than as the consequence of a specific stochastic model. We analyze codon use in 2210 genomes throughout the CUTG database. Differences in codon use among diverse genomes are captured in a single adjustable parameter. We establish a near-continuous variety of codon use that gravitates strongly toward uniform use of codons in all kingdoms of life.

Methods
CUTG database entries comprising at least 10^4 codons are analyzed. Observed relative codon frequencies y of an organism are ordered so that rank r = 1 is given to the largest frequency: y(r), 1 ≤ r ≤ 64. The discrete rank-frequency series r(y) is interpreted in terms of the continuous rank-frequency function (eqn. 1), where Φ is the normal distribution in the standardized variable t = (y − µ)/σ, that is, the cumulative probability of events in the range −∞ through t. That distribution is naturally truncated here to the range of possible codon frequencies, 0 ≤ y ≤ 1, or t_min = −µ/σ through t_max = (1 − µ)/σ. We normalize the truncated distribution, take its complement, and map it linearly to ranks to construct the continuous rank function R(y), which we superimpose on the discrete rank series of observed codon frequencies, r(y). We rotate the conventional frequency-rank plot (e.g., Fig. 1B, C) counterclockwise by 90 degrees, so that frequency becomes the independent variable (abscissa) and rank the dependent variable (ordinate, proportional to the cumulative probability of codon frequency).
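The construction described in this paragraph (normalize the truncated distribution, take the complement, map linearly to ranks) can be sketched in Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the particular linear mapping used here, R = N (1 − F) with N = 64, is an assumption, since the text only states that the mapping is linear.

```python
import math

def normal_cdf(t):
    """Standard normal cumulative distribution Phi(t)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def rank_function(y, mu, sigma, n_codons=64):
    """Continuous rank R(y) from a normal distribution truncated to 0 <= y <= 1.

    Assumed mapping (not confirmed by the text): R = n_codons * (1 - F),
    where F is the normalized truncated cumulative probability,
    so that R(0) = n_codons and R(1) = 0.
    """
    t = (y - mu) / sigma          # standardized variable
    t_min = -mu / sigma           # lower truncation point (y = 0)
    t_max = (1.0 - mu) / sigma    # upper truncation point (y = 1)
    denom = normal_cdf(t_max) - normal_cdf(t_min)
    cumulative = (normal_cdf(t) - normal_cdf(t_min)) / denom
    return n_codons * (1.0 - cumulative)
```

With this convention, large frequencies map to small ranks, matching the ordering in which rank r = 1 is given to the largest frequency. For distributions restricted to the far right tail, a formulation based on `math.erfc` would be numerically safer than the `math.erf` form shown here.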
Since the observed frequencies y_i are normalized (Σ_i y_i = 1), the average codon frequency is 1/N_codons = 1/64. A truncated distribution to be superimposed on these observed frequencies thus must have a mean of 1/64. In the case of the truncated normal distribution, this requires
$$\mu + \sigma\,\frac{\varphi(t_{\min}) - \varphi(t_{\max})}{\Phi(t_{\max}) - \Phi(t_{\min})} = \frac{1}{64}, \tag{2}$$
where φ(t) = dΦ(t)/dt. Note that location µ and scale σ also enter the relation through the definition of t.
Since frequency normalization establishes a relation between the normal-distribution parameters µ and σ, these cannot be chosen independently in fitting R(y) (eqn. 1) to the observed ranks r(y_i). Only one of these parameters is free. We vary σ to minimize Σ_i (r(y_i) − R(y_i))^2, computing from eqn. 2 the value of µ that corresponds to each choice of σ.
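The one-parameter fit described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the rank function is taken as R(y) = N (1 − F(y)), the constraint on µ is the standard mean of a truncated normal distribution set equal to 1/64, µ is found by bisection, and σ is chosen by a simple scan over a log-spaced grid. All function names, the bisection bracket, and the grid limits are hypothetical choices.

```python
import math

N_CODONS = 64
SQRT2 = math.sqrt(2.0)

def pdf(t):
    """Standard normal density phi(t)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def mass(t_lo, t_hi):
    """Phi(t_hi) - Phi(t_lo), written with erfc to stay accurate in the right tail."""
    return 0.5 * (math.erfc(t_lo / SQRT2) - math.erfc(t_hi / SQRT2))

def truncated_mean(mu, sigma):
    """Mean of a normal(mu, sigma) distribution truncated to 0 <= y <= 1."""
    t_min, t_max = -mu / sigma, (1.0 - mu) / sigma
    return mu + sigma * (pdf(t_min) - pdf(t_max)) / mass(t_min, t_max)

def solve_mu(sigma, target=1.0 / N_CODONS):
    """Bisection for the mu that gives the truncated distribution the mean `target`.

    The bracket [-30*sigma, target] is an assumption of this sketch;
    it is adequate for sigma below roughly 0.4.
    """
    lo, hi = -30.0 * sigma, target
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if truncated_mean(mid, sigma) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def rank_function(y, mu, sigma):
    """Assumed form R(y) = N * (1 - F(y)), F = normalized truncated normal CDF."""
    t_min, t_max = -mu / sigma, (1.0 - mu) / sigma
    t = (y - mu) / sigma
    return N_CODONS * mass(t, t_max) / mass(t_min, t_max)

def sum_of_squares(freqs_desc, sigma):
    """Sum over codons of (observed rank - model rank)^2 for a given sigma."""
    mu = solve_mu(sigma)
    return sum((r - rank_function(y, mu, sigma)) ** 2
               for r, y in enumerate(freqs_desc, start=1))

def fit_sigma(freqs, sigma_lo=1e-3, sigma_hi=0.3, n_grid=400):
    """Scan a log-spaced grid of sigma; return (sigma, mu, sse) at the minimum."""
    freqs_desc = sorted(freqs, reverse=True)
    step = math.log(sigma_hi / sigma_lo) / (n_grid - 1)
    grid = [sigma_lo * math.exp(k * step) for k in range(n_grid)]
    best = min(grid, key=lambda s: sum_of_squares(freqs_desc, s))
    return best, solve_mu(best), sum_of_squares(freqs_desc, best)
```

Calling `fit_sigma(freqs)` on a list of 64 relative codon frequencies returns the grid value of σ with the smallest sum of squares together with the matching µ. A grid scan is used here only because it is easy to verify; a one-dimensional minimizer could refine the result.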

Results
In interpreting ranked frequencies of codon occurrence one must keep in mind that ranking abstracts the distribution of frequencies from the identities of the codons. Codons occupying particular frequency ranks in one genome typically occupy different ranks in other genomes.
Consider the human nuclear and mitochondrial genomes (Fig. 1A). Both sets of codon frequencies are ranked here in the order of decaying occurrence in the nuclear genome. The mitochondrial genome evidently does not follow the nuclear ranking. The mitochondrial codons need to be arranged in a much different order to form a decaying sequence. If the codon frequencies of these genomes are ranked, the same rank generally corresponds to different codons of the genetic code.
When the nuclear and mitochondrial codon frequencies of Fig. 1A are individually ranked in decaying order, they reveal two different patterns of codon use (symbols in Fig. 1B, C). The nuclear human codons are used with quite uniform frequencies, as evident at middle ranks; fewer codons are used substantially more often (low ranks) or substantially more rarely (high ranks) (Fig. 1B).
In the nuclear genome, the 19 lowest-ranked (most frequently used) codons receive about one half of the total usage. In the mitochondrial genome, the 12 lowest-ranked codons receive about one half of the total usage (Fig. 1C).
Mitochondrial codon usage is less balanced among codons than nuclear codon usage.
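The balance measure used in this comparison, the number of top-ranked codons that together receive half of the total usage, can be computed directly from codon counts. The sketch below is illustrative: the function name is hypothetical, and the threshold of exactly one half is an assumption standing in for the "about one half" of the text.

```python
def codons_for_half_usage(counts):
    """Return how many of the most frequently used codons reach half of total usage.

    `counts` maps codon -> observed count. Integer arithmetic is used so that
    the half-usage threshold is tested without floating-point rounding.
    """
    ranked = sorted(counts.values(), reverse=True)  # rank 1 = most used codon
    total = sum(ranked)
    running = 0
    for rank, count in enumerate(ranked, start=1):
        running += count
        if 2 * running >= total:
            return rank
    return len(ranked)
```

For perfectly uniform usage of 64 codons the function returns 32; smaller values indicate usage concentrated on fewer codons.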
The red lines in Fig. 1B, C represent frequency-rank relations computed from the model that we present in this paper. Codon frequency is interpreted here as a random variable (rather than a probability), and the ranks associated with these frequencies as (unnormalized) estimates of the cumulative probabilities of the frequencies. We fit that cumulative distribution by a truncated Gaussian distribution (eqn. 1). The model reproduces the two different forms of human codon use. These are determined by a single free parameter, the location µ of the Gaussian distribution. The other parameter of that distribution, the scale σ, is fixed by the parameter µ and the requirement that the predicted frequency distribution be normalized (eqn. 2).
The solid black lines are derived from the observed nuclear data (Fig. 2A). The range of the truncated distribution for the nuclear codon frequencies includes the position µ of the normal distribution, whereas the range of the mitochondrial frequency distribution is restricted to the right tail of the normal distribution. We apply the truncated normal distribution model to several genomes that are common subjects of study (Fig. 3, symbols). To assess variations of codon usage among and within different kingdoms of life, we summarize the values of the parameter µ found for nuclear genomes by database division (Fig. 4E).
A joint cumulative histogram of µ is shown in Fig. 4D.

Discussion and Perspectives
We show here that the ranked frequencies of codon occurrence of a large number of genomes are well characterized by truncated Gaussian distributions with a single adjustable parameter.
With regard to the accuracy of description and the number of required parameters, we improve substantially on several previously proposed approaches (Fig. S1 and S2). Fig. S3A and S3B show an unstructured relationship, as if the Naumis et al model fits the data in an ad-hoc manner. Our model gives a comparable fit with a single free parameter (Fig. S2).