Entropy and codon bias in HIV-1
===============================

* Aakash Pandey

## Abstract

HIV is a rapidly evolving virus with a high mutation rate. In a heterologous gene expression system, codon usage has to be optimized for the host to achieve efficient expression. Although DNA viruses show a correlation in codon bias with their hosts, HIV genes show low correlation with the human host in both nucleotide composition and codon usage bias, which limits the efficient expression of HIV genes. Despite this variation, HIV infects hosts efficiently and multiplies in large numbers. In this study, I performed an information-theoretic analysis of nine genes of HIV-1 based on the codon statistics of the whole HIV genome, of individual genes, and of human genes. Although these genes are poorly adapted to the codon usage bias of the human host, I found that the Shannon entropies of the nine genes based on the overall codon statistics of the HIV-1 genome are very similar to the entropies calculated from human codon usage, possibly suggesting co-evolution of HIV-1 with human genes. Similarly, for the HIV-1 whole-genome sequence analyzed, the codon statistics of the third reading frame show the highest bias, representing minimum entropy and hence maximum information.

Keywords
*   HIV-1 Genome
*   Shannon Entropy
*   Codon Usage Bias
*   Codon Adaptation Index
*   Expected Codon Adaptation Index

## Introduction

Every organism has its own pattern of codon usage. Not all synonymous codons for a particular amino acid are used equally: some are used frequently, whereas the use of others is limited, and the pattern depends on the species [1] [2]. Differences in codon usage have also been observed among genes of the same organism [3]. Codon bias has been linked to the levels of specific tRNAs, which are mainly determined by the number of genes that code for each tRNA [4]. The choice of codon affects the expression level of genes, as can be seen in the expression pattern of transgenes. Gustafsson *et al.* showed that the use of particular codons can increase the expression of a transgene by over 1,000-fold [5]. In bacteria, gene expressivity correlates with codon usage [6]. Although bacteriophages have been shown to use codons that are preferred by their hosts [7], the codon usage pattern of RNA viruses seems to differ from that of their hosts [8]. Despite this variation, HIV can multiply effectively in human T cells. Although the codon usage of the early genes (*tat, rev, nef*) shows higher correlation with human codon usage [9], the late genes show little correlation. This raises the question of how such variation in codon usage still allows efficient viral gene expression. van Weringh *et al.* showed that the tRNA pool differs between HIV-1-infected and uninfected cells [9]. They speculated that HIV-1 modulates the tRNA pool of the host to make it suitable for efficient translation of its genome, but the extent to which such modulation helps translation is still unknown.

Since Shannon published his groundbreaking paper “A Mathematical Theory of Communication” [10], there have been several attempts to apply information theory to living systems. Shannon used the term *information* differently from classical information theorists. Here, I use information in Shannon's sense to see whether an information-theoretic analysis leads to novel insights into the problem. According to Shannon, for a set of possible events with probability distribution $\{p_1, p_2, p_3, \ldots, p_n\}$, the entropy or uncertainty is given by

![Formula][1]

This is in fact the observed entropy of a sequence with the given probability distribution. $H$ is at its maximum when all $p_i$ are equal; in this condition the information content is zero. The amount of information or ‘negentropy’ in a sequence can then be given as ![Formula][2], where $H_{obs}$ is the entropy obtained from the given probability distribution [11]. DNA comprises four nucleotides, A, G, C and T, whose distribution pattern varies among species. Gatlin deduced information content from this distribution pattern [12] and from transition probability values obtained from nearest-neighbor data [13]. More recently, the information-theoretic value of a given DNA sequence was obtained using the Shannon formula as a double sum [14],

![Formula][3]

Here, $n_{aa}$ is the number of distinct amino acids, $n_{syncod}(i)$ is the number of synonymous codons (or microstates) for amino acid $i$ (or macrostate), whose value ranges from 1 to 6, and $P(i,j)$ is the probability of synonymous codon $j$ for amino acid $i$. Information is not absolute; it depends on the environment. The same DNA sequence may therefore represent different amounts of information depending on the environment it is in, or on the machinery that interprets it. I exploit this to calculate the Shannon entropy of nine genes of HIV-1 based on the codon distribution of the viral genome, of the individual genes, and of the host (human codon usage frequency). Information is calculated from the codon distribution for each of the three possible reading frames. To the best of my knowledge, no such study has been carried out yet. Viruses have overlapping genes, which are speculated to increase the density of genetic information [15]. These genes are read by ribosomal frameshifting [16]. For the nine genes, I have also calculated the intrinsic entropy of the sequence, defined as the entropy based on its own codon usage (i.e. codon usage within the same gene), for comparison with the other entropy values.
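The entropy, negentropy and double-sum definitions above can be sketched in Python as follows. This is a minimal illustration, not the code used in the study; the function names are hypothetical, the maximum entropy is assumed to be $\log_2 n$ for $n$ possible events, and the double sum simply adds $-P(i,j)\log_2 P(i,j)$ over whatever probabilities are supplied.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits; zero-probability events contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_content(probs):
    """Negentropy: maximum entropy (uniform case) minus observed entropy."""
    probs = list(probs)
    h_max = math.log2(len(probs))  # assumption: H_max = log2(number of events)
    return h_max - shannon_entropy(probs)

def double_sum_entropy(p_by_amino_acid):
    """Sum of -P(i,j) * log2 P(i,j) over amino acids i and their codons j.

    p_by_amino_acid maps an amino acid to a dict {codon: probability}.
    """
    return sum(
        shannon_entropy(codon_probs.values())
        for codon_probs in p_by_amino_acid.values()
    )

# Four equally likely nucleotides: maximum entropy, zero information.
uniform = [0.25, 0.25, 0.25, 0.25]
print(shannon_entropy(uniform), information_content(uniform))
```

A strongly biased distribution, such as `[1.0, 0.0, 0.0, 0.0]`, gives zero entropy and therefore the maximum information under this definition.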

Heterologous expression systems, such as viruses, use the host translational machinery for their replication. They are under evolutionary pressure to adapt to the host tRNA pool. To estimate the degree of evolutionary adaptation of viral codon usage to that of the host, the Codon Adaptation Index (CAI) can be used [17]. For sequences with a highly biased nucleotide composition, however, interpreting CAI can be tricky [18]. To determine whether a CAI value is statistically significant and arises from codon preferences, rather than being merely an artifact of nucleotide composition bias, the expected CAI (eCAI) can be used as a threshold value for comparison [19].
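For orientation, CAI as defined by Sharp and Li [17] is the geometric mean of the relative adaptiveness of the codons in a gene, where the relative adaptiveness of a codon is its frequency in a reference set divided by the frequency of the most used synonymous codon. The sketch below follows that definition under stated assumptions: the function names are hypothetical, the standard genetic code is used, and Met, Trp and stop codons are skipped. It is not the CAIcal implementation, and it does not compute eCAI.

```python
import math
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def relative_adaptiveness(ref_counts):
    """w(codon) = reference count / count of the most used synonymous codon."""
    best = {}
    for codon, n in ref_counts.items():
        aa = CODON_TABLE[codon]
        best[aa] = max(best.get(aa, 0), n)
    return {
        codon: n / best[CODON_TABLE[codon]]
        for codon, n in ref_counts.items()
        if best[CODON_TABLE[codon]] > 0
    }

def cai(codons, weights):
    """Geometric mean of w over the codons of a gene.

    Met, Trp and stop codons are skipped, as are codons with no weight.
    """
    logs = [
        math.log(weights[c])
        for c in codons
        if CODON_TABLE.get(c) not in ("M", "W", "*") and weights.get(c, 0) > 0
    ]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")

# Toy reference set (made-up counts, for illustration only).
w = relative_adaptiveness({"GCT": 10, "GCC": 40, "AAA": 20, "AAG": 20})
print(cai(["ATG", "GCT", "GCC"], w))
```

A gene that uses only the most frequent codon for every amino acid has a CAI of 1 under this definition.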

## Methodology

The DNA sequences were obtained from the NCBI database in FASTA format. For each sequence, codon statistics were obtained by entering the sequence into the online Sequence Manipulation Suite [20], using the standard genetic code as the parameter. The number and fraction of each possible codon were noted. The first nucleotide was then deleted to shift the reading frame by +1, so as to include the other possible codon patterns, and the numbers and fractions were noted again. The process was repeated for the +2 reading frame. As any of the reading frames can contain the gene of interest, the statistics of all three reading frames were used to calculate the Shannon entropy separately. The calculation assumes that the message is read linearly, without slippage of the reading frame (RF). The fraction was normalized within each amino acid, not by the total number of codons. Thus,

![Formula][4]

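The frame shifting and per-amino-acid normalization described above can be sketched in Python as follows. The study itself used the online Sequence Manipulation Suite; this is only an illustrative equivalent, with hypothetical function names, in which each codon count is divided by the total count of codons for the same amino acid.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_counts(seq, frame=0):
    """Count codons read linearly from offset `frame` (0, 1 or 2)."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    # Triplets containing ambiguous bases are ignored.
    return Counter(c for c in codons if c in CODON_TABLE)

def synonymous_fractions(counts):
    """Fraction of each codon among the codons for the same amino acid."""
    totals = Counter()
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n
    return {codon: n / totals[CODON_TABLE[codon]] for codon, n in counts.items()}

# Toy sequence (not an HIV-1 gene), read in all three frames.
seq = "ATGGCTGCCGCTTAA"
for frame in range(3):
    print(frame, synonymous_fractions(codon_counts(seq, frame)))
```

In frame 0 the toy sequence contains GCT twice and GCC once, so the alanine fractions are 2/3 and 1/3 regardless of how many other codons are present.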
First, a row matrix was constructed with fractions of synonymous codons used.

![Formula][5]

The fractions in the matrix were treated as microstates to calculate the Shannon entropy, and a second matrix was constructed from the Shannon entropy of each fraction in the distribution.

![Formula][6]

The total Shannon entropy of the sequence is then calculated as:

![Formula][7]

Here $N_i$ is the number of occurrences of a particular codon in the gene of interest. This calculation was performed for all three RFs.
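One plausible reading of the total entropy is sketched below, assuming that each codon contributes its entropy term $-f \log_2 f$, taken from a reference codon distribution, weighted by its count $N_i$ in the gene. This weighting is an assumption made for illustration, and the function name and toy values are hypothetical.

```python
import math

def weighted_entropy(gene_counts, ref_fractions):
    """Sum over codons of N_i * (-f_i * log2 f_i), in bits.

    gene_counts:   {codon: count in the gene of interest}
    ref_fractions: {codon: synonymous fraction in the reference distribution}
    """
    total = 0.0
    for codon, n in gene_counts.items():
        f = ref_fractions.get(codon, 0.0)
        if f > 0:  # codons with f = 0 or f = 1 contribute nothing
            total += n * (-f * math.log2(f))
    return total

# Toy values, for illustration only.
gene = {"GCT": 2, "GCC": 1, "ATG": 1}
reference = {"GCT": 0.5, "GCC": 0.25, "ATG": 1.0}
print(weighted_entropy(gene, reference))
```

Swapping the reference distribution (a reading frame of the viral genome, the gene itself, or human codon usage) while keeping the gene counts fixed gives the different entropies compared in this study.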

The correlation coefficient between each gene’s codon statistics and human codon usage statistics was calculated. For two genes, *vpr* and *vpu*, the correlation coefficients were calculated again after removing the codon data for amino acids that are absent from that gene. Shannon entropy was then calculated for all nine genes using the human codon usage statistics. Intrinsic entropy, which is the entropy based on each gene's own codon statistics, was also calculated. Again, the assumption is that there is no slippage of the reading frame during translation of the message. Thus, the codon statistics of the single reading frame beginning at the start codon were used to calculate intrinsic entropy. Similarly, average entropy was calculated by averaging the fractions of synonymous codons over all three RFs. The overall percentage GC content, the position-specific GC content of the codons of the nine genes, and the CAI values were calculated on the online site [http://genomes.urv.cat/CAIcal/](http://genomes.urv.cat/CAIcal/) [21], again using the standard genetic code as the parameter. The expected codon adaptation index was calculated on the E-CAI server ([http://genomes.urv.es/CAIcal/](http://genomes.urv.es/CAIcal/)) using the Markov chain method and the standard genetic code as parameters. Human codon usage statistics were obtained from the online site ([http://genomes.urv.cat/CAIcal/CU\_human_nature.html](http://genomes.urv.cat/CAIcal/CU_human_nature.html)). Computations were performed in R.
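The correlation step can be illustrated with a short Python sketch. It assumes a Pearson correlation coefficient computed over codon fractions and treats codons missing from the gene as having a fraction of zero; the function names and the `exclude` parameter are hypothetical and stand in for removing the codons of amino acids absent from a gene.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(vx * vy)

def codon_usage_correlation(gene_fractions, ref_fractions, exclude=()):
    """Correlate gene codon fractions with reference (e.g. human) fractions.

    Codons listed in `exclude` are dropped before the calculation.
    """
    codons = [c for c in sorted(ref_fractions) if c not in exclude]
    xs = [gene_fractions.get(c, 0.0) for c in codons]
    ys = [ref_fractions[c] for c in codons]
    return pearson(xs, ys)

# Toy fractions, for illustration only.
gene = {"GCT": 0.1, "GCC": 0.2, "GCA": 0.3, "GCG": 0.4}
human = {"GCT": 0.2, "GCC": 0.4, "GCA": 0.6, "GCG": 0.8}
print(codon_usage_correlation(gene, human))
```

Passing the codons of an absent amino acid in `exclude` mirrors the recalculation described for *vpr* and *vpu*.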

## Results and Discussion

There is a high correlation between the codon usage of the HIV-1 early genes (*tat, rev, nef*) and human codon usage (Fig 1), but the correlation is low for the other genes: *env, gag, pol, vif, vpr, vpu* (Table 1); *vpu* and *vif* have the lowest correlation with human codon usage. The degree of correlation also differs between the codon statistics of the three RFs and the nine genes; a high correlation probably suggests that the gene is located in that RF.

![Fig 1:](https://www.biorxiv.org/content/biorxiv/early/2016/08/27/052274/F1.medium.gif)

Fig 1: Scatter plots of the codon fractions of the *tat, rev* and *nef* genes of HIV-1 against human codon usage fractions, with correlation coefficients of 0.72, 0.62 and 0.73, respectively. These three genes, which are also the early genes, have the highest correlation with human codon usage.

[Table 1.](http://biorxiv.org/content/early/2016/08/27/052274/T1)

Table 1. Correlation coefficients calculated among human codon usage, codon usage of HIV-1 genes and codon statistics for all three reading frames of HIV-1 genome

The entropy calculations show a general trend for the sequences analyzed. First, the +2 frameshifted reading frame shows lower entropy than the other two reading frames. This marked distinction of the third reading frame among the three possible reading frames of the sequence analyzed is surprising. As it has the lowest entropy of the three reading frames, a sequence with the codon usage pattern of the third RF represents the highest information (in Shannon’s sense). This probably suggests a genome-wide conservation of codon usage for that reading frame, but the reason is unclear. The *env, rev, pol* and *vpu* genes have their highest correlation with the third reading frame as compared to the other two reading frames. Similarly, *gag* and *vpr* also have a high correlation coefficient with it. If we use the codon statistics of the third RF to calculate the Shannon entropy, we get the minimum entropy and hence the maximum information. But we then run into a problem, as this third RF shows the lowest correlation coefficient with the human codon usage pattern. So there has to be a balance between these contrasting goals: maximizing information or maximizing correlation. Take *gag*, for example: it shows high correlation with both RF0 and RF2, both of which have lower correlation with human codon usage. This means that the expression of *gag* is affected by this choice of codons. In fact, the ratio of native to optimized codons determines HIV-1 *gag* expression [22]. This also supports the speculation that codon bias leads to sub-optimal expression in infected cells. There is, in fact, good evidence that HIV-1 gene expression is not maximal but is fine-tuned to allow regulation of diverse processes [23]. Further evidence of sub-optimal expression is that introducing codon-optimized genes, which are better adapted to the host tRNA pool, led to higher expression [24][25][26].
From the entropy calculation based on human codon usage, we can see that *vpu* and *vif* have the lowest entropy, but as they have a low correlation with human codon usage, their expression is limited. Codon optimization of these genes increases their expression level [27]. However, a high correlation does not imply that the gene is in that reading frame. Such bias may or may not affect biological function, but it is likely that this distinction of lower entropy has some evolutionary importance.

Intrinsic entropy differs greatly from the other entropy values, as it shows the lowest values. Such low intrinsic entropies may be significant for free-living organisms, as lower entropies suggest higher bias. But for heterologous expression systems such as HIV-1, entropy H probably gives the best entropy values for the genes analyzed, as the host (human) codon usage statistics were used for the calculation. Average entropy, which is closer to entropy H than intrinsic entropy is, gives a better representation of the entropy and hence of the amount of information a gene contains inside a human host. Although there is great variation in synonymous codon usage statistics between HIV-1 genes and human genes, the entropy values for the HIV-1 genes based on the overall codon distribution of the HIV genome are almost the same as those calculated from human codon usage statistics (Table 2). Even for the *vpu* gene, which has a very low correlation coefficient (0.26), the entropy values based on the overall codon statistics of the HIV-1 genome and on human codon usage statistics are similar: 36.17 and 38.69 bits, respectively. Even if we remove the data for amino acids that have no codon at all in that gene, the correlation coefficient is still low. In the *vpu* gene, codons for cysteine, threonine and phenylalanine are absent. If we remove those data, we get a correlation coefficient of 0.41, which is still low. However, this removal does not affect the entropy calculation. Similarly, in the *vpr* gene, codons for cysteine are absent. After removing those data, the new correlation coefficient is 0.55, and the average entropy and entropy H are close: 42.77 and 43.00 bits, respectively. HIV is a highly variable virus which undergoes rapid mutation. Although HIV cannot match its codon bias to that of the host, it can have a stable codon usage pattern.
To maintain its overall codon statistics, it has to maintain its nucleotide composition, which is the determinant of codon bias [28]. It has been shown that, despite its high mutation rate, the biased nucleotide composition of HIV is constant over time [29]. If the genes had the same codon bias as the host, it might lead to their highest expression. But this is not desirable, as it would not allow efficient tuning of the virus's complex processes. If the codon bias were completely different from that of the host, it might result in very low expression, putting the virus's ability to survive in the host into question. So HIV has to find a solution that results in sub-optimal expression of its genes. From the calculations in Table 2 (Entropy H and Average Entropy), it seems that HIV has found a solution in which its codon bias differs from that of the host, allowing sub-optimal expression, while still representing the same level of information as can be obtained from the codon bias of its host. This might suggest that, despite having a nucleotide composition different from that of humans, HIV-1 has co-evolved with human genes to represent the same level of information.

[Table 2:](http://biorxiv.org/content/early/2016/08/27/052274/T2)

Table 2: Entropies of HIV-1 genes based on various codon distributions

Besides having similar entropy, all the genes have high CAI values (Table 3), although with varying GC content. All the CAI values are greater than 0.6, with *vpu* having the lowest value (0.62) and *tat* the highest (0.77). A CAI well above 0.5 is usually considered to show a good level of adaptation to the host. However, these values should be interpreted with care, as by themselves they may not reveal the level of adaptation; they may instead be due to bias in nucleotide composition. To know whether these values actually represent adaptation, we need a threshold determined by the bias in nucleotide composition, above which CAI can be said to represent the level of adaptation. For that, the expected CAI (eCAI) is calculated and compared [19]. From these comparisons, none of the genes seems to be well adapted to the human codon usage pattern. We can note that the GC content of all the genes is below average. Also, the GC content of the second and third nucleotides of the codons shows the greatest variability. Thus, we can conclude that the nucleotides at these positions, rather than selection pressure on codons, are the determinant of codon bias in HIV genes.
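Overall and position-specific GC content of the kind discussed above can be computed directly from a coding sequence. The sketch below is a minimal illustration with a hypothetical function name; it is not the CAIcal implementation, and it assumes the sequence starts in frame and ignores any trailing incomplete codon.

```python
def gc_by_position(seq):
    """Percentage GC overall and at codon positions 1, 2 and 3."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # drop a trailing incomplete codon
    if usable == 0:
        raise ValueError("sequence is shorter than one codon")
    result = {}
    for pos in range(3):
        bases = seq[pos:usable:3]  # every third base, starting at this position
        result[f"GC{pos + 1}"] = 100.0 * sum(b in "GC" for b in bases) / len(bases)
    result["GC"] = 100.0 * sum(b in "GC" for b in seq[:usable]) / usable
    return result

# Toy sequence, for illustration only.
print(gc_by_position("ATGGCC"))
```

For the toy sequence, the first and second codon positions are each 50% GC and the third position is 100% GC.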

[Table 3:](http://biorxiv.org/content/early/2016/08/27/052274/T3)

Table 3: Codon adaptation index (CAI) and GC content of HIV-1 genes.

## Conclusion

Despite many studies, the HIV genome still holds several mysteries. HIV is evolving along with its human host. However, it is not clear why its nucleotide composition and synonymous codon usage bias differ so greatly from those of its host. From the comparison of CAI with eCAI, we can conclude that HIV genes are poorly adapted to the tRNA pools of humans. It can therefore be inferred that the selection pressure on HIV to adapt to tRNA pools is minimal compared with the rapid mutation of its genome [8]. Because of this, HIV evolves as a separate entity, although there is selection pressure at different levels. It is not clear whether nucleotide composition bias can give rise to the asymmetry in the observed information content across the three possible reading frames. However, despite large differences in nucleotide composition and synonymous codon usage bias, HIV genes seem to have evolved to represent the same level of information as that obtained from the codon bias of human genes. How HIV attains such uniformity, despite differing from its host, is yet another mystery this study has brought to the surface. Further work is needed to bring these differences together and give a clear picture of the evolution of the HIV genome.

## Conflict of interest

The author declares no conflict of interest.

## Footnotes

*   aakash.biophys{at}gmail.com

*   Received May 9, 2016.
*   Accepted August 27, 2016.


*   © 2016, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/)

## References

1.  Grantham R, Gautier C, Gouy M, Mercier R, Pavé A. Codon catalog usage and the genome hypothesis. Nucl Acids Res. 1980;8(1): 197–197. doi:10.1093/nar/8.1.197

2.  Grantham R, Gautier C, Gouy M, Jacobzone M, Mercier R. Codon catalog usage is a genome strategy modulated for gene expressivity. Nucl Acids Res. 1981;9(1): 213–213. doi:10.1093/nar/9.1.213

3.  Sharp P, Cowe E, Higgins D, Shields D, Wolfe K, Wright F. Codon usage patterns in Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster and Homo sapiens; a review of the considerable within-species diversity. Nucl Acids Res. 1988;16(17): 8207–8211. doi:10.1093/nar/16.17.8207. PMID: 3138659

4.  Hershberg R, Petrov D. Selection on Codon Bias. Annu Rev Genet. 2008;42(1): 287–299. doi:10.1146/annurev.genet.42.110807.091442. PMID: 18983258

5.  Gustafsson C, Govindarajan S, Minshull J. Codon bias and heterologous protein expression. Trends in Biotechnology. 2004;22(7): 346–353. doi:10.1016/j.tibtech.2004.04.006. PMID: 15245907

6.  Gouy M, Gautier C. Codon usage in bacteria: correlation with gene expressivity. Nucl Acids Res. 1982;10(22): 7055–7074. doi:10.1093/nar/10.22.7055. PMID: 6760125

7.  Lucks J, Nelson D, Kudla G, Plotkin J. Genome Landscapes and Bacteriophage Codon Usage. PLoS Computational Biology. 2008;4(2): e1000001.

8.  Jenkins G, Holmes E. The extent of codon usage bias in human RNA viruses and its evolutionary origin. Virus Research. 2003;92(1): 1–7. doi:10.1016/S0168-1702(02)00309-X. PMID: 12606071

9.  van Weringh A, Ragonnet-Cronin M, Pranckeviciene E, Pavon-Eternod M, Kleiman L, Xia X. HIV-1 Modulates the tRNA Pool to Improve Translation Efficiency. Molecular Biology and Evolution. 2011;28(6): 1827–1834. doi:10.1093/molbev/msr005. PMID: 21216840

10. Shannon C. A Mathematical Theory of Communication. Bell System Technical Journal. 1948;27: 379–423, 623–656. doi:10.1002/j.1538-7305.1948.tb00917.x

11. Brillouin L. Science and information theory. New York: Academic Press; 1962.

12. Gatlin L. The information content of DNA. Journal of Theoretical Biology. 1966;10: 281–300. doi:10.1016/0022-5193(66)90127-5. PMID: 5964394

13. Josse J, Kaiser A, Kornberg A. Enzymatic synthesis of deoxyribonucleic acid: VIII. Frequencies of nearest neighbor base sequences in deoxyribonucleic acid. Journal of Biological Chemistry. 1961;236: 864–875.

14. Zeeberg B. Shannon Information Theoretic Computation of Synonymous Codon Usage Biases in Coding Regions of Human and Mouse Genomes. Genome Research. 2002;12(6): 944–955.

15. Lamb R, Horvath C. Diversity of coding strategies in influenza viruses. Trends in Genetics. 1991;7(8): 261–266. PMID: 1771674

16. Wilson W, Braddock M, Adams S, Rathjen P, Kingsman S, Kingsman A. HIV expression strategies: Ribosomal frameshifting is directed by a short sequence in both mammalian and yeast systems. Cell. 1988;55(6): 1159–1169. doi:10.1016/0092-8674(88)90260-7. PMID: 3060262

17. Sharp P, Li W. The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications. Nucl Acids Res. 1987;15(3): 1281–1295. doi:10.1093/nar/15.3.1281. PMID: 3547335

18. Grocock R, Sharp P. Synonymous codon usage in Pseudomonas aeruginosa PA01. Gene. 2002;289(1–2): 131–139. doi:10.1016/S0378-1119(02)00503-6. PMID: 12036591

19. Puigbò P, Bravo I, Garcia-Vallvé S. E-CAI: a novel server to estimate an expected value of Codon Adaptation Index (eCAI). BMC Bioinformatics. 2008;9(1): 65. doi:10.1186/1471-2105-9-65. PMID: 18230160

20. Codon Usage [Internet]. Bioinformatics.org. 2016 [cited 4 April 2016]. Available from: [http://www.bioinformatics.org/sms2/codon\_usage.html](http://www.bioinformatics.org/sms2/codon_usage.html)

21. Puigbò P, Bravo I, Garcia-Vallvé S. CAIcal: A combined set of tools to assess codon usage adaptation. Biology Direct. 2008;3(1): 38.

22. Kofman A, Graf M, Bojak A, Deml L, Bieler K, et al. HIV-1 gag expression is quantitatively dependent on the ratio of native and optimized codons. Tsitologiia. 2003;45: 86–93. PMID: 12683241

23. Marzio G, Vink M, Verhoef K, de Ronde A, Berkhout B. Efficient Human Immunodeficiency Virus Replication Requires a Fine-Tuned Level of Transcription. Journal of Virology. 2002;76(6): 3084–3088.

24. Haas J, Park E, Seed B. Codon usage limitation in the expression of HIV-1 envelope glycoprotein. Current Biology. 1996;6(3): 315–324. doi:10.1016/S0960-9822(02)00482-7. PMID: 8805248

25. Anson D, Dunning K. Codon-Optimized Reading Frames Facilitate High-Level Expression of the HIV-1 Minor Proteins. Molecular Biotechnology. 2005;31(1): 85–88.

26. Ngumbela K, Ryan K, Sivamurthy R, Brockman M, Gandhi R, Bhardwaj N, et al. Quantitative Effect of Suboptimal Codon Usage on Translational Efficiency of mRNA Encoding HIV-1 gag in Intact T Cells. PLoS ONE. 2008;3(6): e2356. doi:10.1371/journal.pone.0002356. PMID: 18523584

27. Nguyen K, Llano M, Akari H, Miyagi E, Poeschla E, Strebel K, et al. Codon optimization of the HIV-1 vpu and vif genes stabilizes their mRNA and allows for highly efficient Rev-independent expression. Virology. 2004;319(2): 163–175. doi:10.1016/j.virol.2003.11.021. PMID: 15015498

28. Bronson E, Anderson J. Nucleotide composition as a driving force in the evolution of retroviruses. J Mol Evol. 1994;38(5): 506–532. doi:10.1007/BF00178851. PMID: 8028030

29. van der Kuyl A, Berkhout B. The biased nucleotide composition of the HIV genome: a constant factor in a highly variable virus. Retrovirology. 2012;9(1): 92. doi:10.1186/1742-4690-9-92. PMID: 23131071

 [1]: /embed/graphic-1.gif
 [2]: /embed/graphic-2.gif
 [3]: /embed/graphic-3.gif
 [4]: /embed/graphic-4.gif
 [5]: /embed/graphic-5.gif
 [6]: /embed/graphic-6.gif
 [7]: /embed/graphic-7.gif