The amino acid composition of a protein influences its expression

The quantity of each protein in a cell is only partially correlated with its gene transcription rate. Independent influences on protein synthesis levels include mRNA sequence motifs, aminoacyl-tRNA synthesis levels, elongation factor action, and protein susceptibility to degradation. Here we report two novel forms of interaction between the amino acid composition of a protein and its expression level. In animals, the differing origins of amino acids define a nutritional classification system and indicate their potential for scarcity – essential amino acids (EAA) are solely obtained from dietary supply, non-essential amino acids (NEAA) from biosynthetic supply, and conditionally essential amino acids (CEAA) from both. Accessing public proteomic datasets, we demonstrate that CEAA sequence composition is inversely correlated with expression – a rule of supply that is further magnified by rapid cellular proliferation. Similarly, proteins with the most extreme compositions of EAA are reduced in abundance. Homeostatic responses to malnutrition may result from the reductions in expression of extreme composition proteins participating in biological systems such as taste and food-seeking behaviour, oxidative phosphorylation, and chemokine function. The rule can also influence general human phenotypes and disease susceptibility: stature proteins are enriched in CEAAs, and a curated dataset of over 700 cancer proteins is significantly under-represented in EAAs. A second rule, whereby individual amino acids influence protein expression, is also described. This rule is shared across all kingdoms of life and rooted in the immutable structural and encoding parameters of each amino acid. Species-specific environmental survival pathways are shown to be enriched in proteins with amino acid compositions favouring higher expression according to this rule.
These two rules of protein expression regulation promise new insights into systems biology, evolutionary studies, experimental research design, and public health intervention.


Introduction
The regulated transcription of mRNA from DNA, and subsequent translation into effector proteins, underlies all of life's dynamic processes. However, a typical gene's levels of mRNA and protein only show a correlation of 0.6 [1][2][3], indicating the presence of DNA-independent regulatory influences on translation. Those influences are complex and incompletely understood [4][5][6] but include mRNA sequence motifs, compatibility between mRNA codon choice and corresponding tRNA-amino acid availability [7,8] and the complex regulation of translation initiation, elongation and termination.
The established mTORC1 signalling pathway elicits molecular and cellular changes in response to nutritional state via the monitoring of certain amino acid concentrations [9].
However, the direct impact of global amino acid scarcity on protein translation is underexplored, despite supply characteristics defining an important amino acid classification system in animals [10]. That classification comprises essential amino acids (EAA) required from diet, non-essential amino acids (NEAA) obtained through biosynthesis, and an ill-defined intermediate class, conditionally essential amino acids (CEAA), requiring supplementation from diet during development and periods of stress or illness [11,12]. Over 500 million years ago the new animal kingdom was, in part, distinguished by a coordinated inactivation of biosynthetic pathways for the EAA class [13][14][15]. The resulting switch from an autotrophic to auxotrophic lifeway obliged the direct or secondary sourcing of EAAs from a diet of prototrophic plants, or heterotrophic prey that fed on those plants. The opportunity for increased biological complexity offered by the energetic efficiency of a higher trophic level has been of demonstrable advantage to animals but it created vulnerability to situational deficits in dietary supply of EAA and possibly CEAA. Such deficits would likely act by limiting tRNA-amino acid synthesis and availability, slowing the protein translation rate, and decreasing protein expression, with inevitable phenotypic consequences.
Here we present evidence obtained from quantitative proteomics datasets that a protein's amino acid composition correlates with its expression in two ways. Firstly, we show that the proportions of the three nutritional classes of amino acids in an animal protein exert an influence on expression reflecting extrinsic supply and intrinsic amino acid biosynthesis constraints, respectively. A second form of intrinsic amino acid composition effect on expression is also described that is shared across all kingdoms of life and derived from the universal structural and encoding parameters of amino acids. We propose that evolution has harnessed both the extrinsic and intrinsic effects to select protein compositions that confer advantageous expression responses during environmental stress. These two rules offer intriguing new insights into the environmental influences on protein evolution and expression regulation.

Extrinsic effects of amino acid classes on protein expression
We first examined how the amino acid nutritional class composition of every human protein influences its expression. Mass spectrometry-derived expression levels of 9,399 liver proteins were accessed from the public proteomics repository, PaxDB [16] (Methods). Fig 1a shows proteins ranked from low-to-high frequency of each of the three nutritional classes plotted against a conservative moving median protein expression level. A greater compositional frequency of CEAA (fCEAA) was generally associated with a modest decrease in protein expression, whereas greater EAA (fEAA) was associated with increased expression.
However, at the extremes of composition, the outcome was more complex, with high fEAA and fCEAA both repressing expression, and low values releasing constraints on expression. At both extremes, we interpret the apparent fNEAA influence on expression to be merely the passive consequence of active fCEAA or fEAA constraint effects. These findings are striking in two regards: they are naïve of mRNA expression information, and data are derived from human donors without known dietary amino acid deficiency.
An alternative approach to data visualisation, focusing on relative amino acid composition, was applied to data from eight human organs and a lung alveolar basal epithelial adenocarcinoma cell line, A549. The accessed organ expression datasets (PaxDB) had already been normalised and integrated from several analyses. Ranked protein expression levels were plotted against smoothed relative EAA:CEAA:NEAA proportions for each protein (Methods, Fig 1b-g, and Supplementary Fig 1). The right side of every component image is largely conserved in appearance: high-level protein expression requires amino acid composition to be near to the population means (fEAA, 0.41; fCEAA, 0.28; fNEAA, 0.31). By contrast, the left side of each image was observed to fall into one of three distinct profiles. In the first (Fig 1b, Supplementary Fig 1), liver, heart, and male and female gonads show a marked increase in proteins with high CEAA proportion at lower expression levels (green arrow). This suggests the existence of a biosynthetic shortfall of these amino acids that results in reduced expression of such proteins. In [...] [18], #id: 878737823), with cells described as actively proliferating, a substantial increase in CEAA and EAA proportions (green and red arrows) was observed at lower expression levels, indicating that proteins with greater proportions of those two amino acid classes were inefficiently translated. A suspected interaction between proliferation rate, nutritional class and expression was therefore examined in PaxDB expression data from 26 cell lines. Wide provenance, intrinsic cell line expression differences, and uncertain culture conditions at the time of protein isolation required an objective means to stratify cell line data by proliferation rate.
A normalised protein expression level was derived for the established proliferation marker Ki67 (MKI67) within each cell line (this value correlated well with expression of cell division proteins such as CKS2, KIF23, POLA1, CDC45 and SPC24; data not shown). Cell lines were stratified into 3 groups by this proliferation rate proxy, and their proteins into 5 fEAA or fCEAA classes. When total protein expression levels within each of the resulting fifteen subdivisions were summed and plotted (Fig 1h and 1i), we saw evidence for expression influenced by two forms of amino acid supply kinetics. Firstly, a proportionately negative influence on expression exists across the full range of fCEAA, which is further intensified (green arrow) by the amino acid demands of rapid proliferation (Fig 1h). Secondly, a largely proliferation-independent, positive effect of increasing fEAA on protein expression levels was observed, which switches to negative only for those proteins with the very highest fEAA (perhaps determined by the limits of EAA availability in media and its cellular uptake) (Fig 1i). Human-specific inhibitory effects were seen for isoleucine (I) and proline (P), and methionine [...]
The pan-species nature of these amino acid effects was effectively demonstrated by using the E. coli MLR model from Table 1 and Supplementary File 1 to predict trends in global human protein expression (Fig 2). A moving average expression describing three orders of magnitude was observed when the liver MLR model value for each protein was plotted against that protein's true liver expression (Fig 2a). The bacterial MLR model, applied to human liver expression, was still able to describe a moving average expression spanning two orders of magnitude (Fig 2b). To explain these universal effects of amino acids on protein expression we considered three fundamental properties as candidate influences.
The first property was the number of synonymous codons assigned to each amino acid, in what we hypothesised was a proxy for the effect of codon choice on translation. Secondly, three related models of amino acid biosynthetic cost were applied to determine if metabolic economisation has directed protein evolution towards 'thrifty' amino acid composition and protein expression levels. In the Akashi [20] and Wagner [21] models, the cost of synthesis for each amino acid is measured as high-energy phosphate bond equivalents and, in the Zhang model [22], this is further refined by including amino acid degradation constants. The third assessed property was the composite 'Dufton score' [23], assigned to amino acids based on their spatial volume and chemical complexity, encapsulating biosynthetic, structural, and functional parameters. [...] Dufton score (p = 5.2 × 10⁻¹³), but no significant influence of energetic cost. Combined in a single model, these two simple and immutable amino acid properties were sufficient to generate a moving average describing almost three orders of magnitude of liver expression (Fig 2d).
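As an illustration of the first property, the sketch below averages the number of synonymous codons over the residues of a protein. The codon counts are those of the standard genetic code; the function name and the choice of a simple per-residue mean are assumptions made for illustration, not a description of how the property entered the regression.

```python
# Synonymous codon counts per amino acid in the standard genetic code
# (61 sense codons in total).
CODON_COUNT = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3,
    "C": 2, "D": 2, "E": 2, "F": 2, "H": 2, "K": 2, "N": 2, "Q": 2, "Y": 2,
    "M": 1, "W": 1,
}


def mean_codon_degeneracy(sequence):
    """Mean synonymous codon count per residue (hypothetical summary statistic)."""
    counts = [CODON_COUNT[aa] for aa in sequence.upper() if aa in CODON_COUNT]
    if not counts:
        raise ValueError("sequence contains no standard amino acids")
    return sum(counts) / len(counts)
```

A per-protein value of this kind could then be used as one predictor alongside a structural score; the weighting of each would come from the fitted model, which is not reproduced here.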

Protective biological systems at extremes of amino acid composition
A prediction from these findings would be that animal proteins with the most extreme EAA or CEAA amino acid compositions would be the first to experience translational inefficiency in a state of amino acid deficiency. We theorised that such proteins might have retained counterintuitively extreme compositions to sense and respond to environmental adversity.
When the fEAA and fCEAA composition of 20,397 human proteins was visualised (Fig 3a), outlier proteins were found to intersect with our understanding of the biology and pathology of animal survival during malnourishment. The fCEAA outlier group primarily consisted of proteins with roles in the formation of connective tissue, skin, hair, and their maturation enzymes: collagens, elastin, keratin-associated proteins, late cornified envelope proteins, small cysteine, glycine and proline repeat-containing proteins, fibrillins, fibulins, lysyl oxidase, and latent-transforming growth factor beta-binding proteins (Fig 3b and 3c) [24,25]. Gastrointestinal excretion and the production of skin and hair are the three principal routes of irretrievable amino acid loss from the body [26], so the relative paucity of EAAs in these proteins may be a resource conservation strategy. Furthermore, their extreme CEAA composition may act as a translational regulator for life processes that can be temporarily sacrificed or permanently downscaled to conserve energy and amino acid reserves. Anorexia nervosa can be accompanied by thinning hair, brittle nails, and deterioration in skin health [27] and, separately, periods of bodily stress or illness are frequently recorded as discontinuous nail growth in the form of Beau's lines or 'pitting'. Other proteins with high fCEAA that might be susceptible to the effects of dietary deficiency include all 14 metallothioneins involved in heavy metal-binding and oxidative stress responses, several members of the serine-/arginine-rich splicing factor family, oxytocin (a hormone involved in all aspects of reproduction from sexual arousal, to uterine contraction in labour, mother-offspring bonding, and milk production) and multiple components of the Notch cell fate and differentiation signalling pathway (including DLL3, NOTCH4, and JAG2).
Extreme fEAA was observed in both leptin (LEP) and the melanocortin receptor proteins (MC1R-MC5R) (Fig 3d), components of an established hypothalamic signalling pathway that responds to increased levels of adiposity by promoting satiety. This pathway is also linked to onset of puberty and stature [28]. GRPR (gastrin-releasing peptide receptor) also possesses high fEAA and a similar appetite-suppressing role. We speculate that their conventional signalling pathways are augmented by 'hard-wired' protein synthesis inhibition during EAA deprivation: expression reduction of any of these components would act to increase appetite drive, potentially leading to recuperative ingestion of amino acids. Most members of the large olfactory receptor subfamily [29] exhibit high EAA density. Similarly, bitter taste receptors of the TAS2R subfamily have an extremely high EAA composition (Fig 3e), with member TAS2R20 ranked 9th in the entire proteome. This contrasts with the proteome-average EAA composition of the umami and sweet taste receptors of the TAS1R subfamily. The TAS2R family fulfilled an important survival function in human prehistory by allowing detection and rejection of potentially toxic substances in foraged food. We hypothesise that the proteins responsible for these two sensory modalities have evolved fragility of expression during dietary EAA deficiency. The resulting reduction in bitter taste and smell acuity may lower food discrimination or aversion, offering access to a greater range of foodstuffs potentially containing EAAs (histidine, tryptophan and valine salts are all bitter-tasting [30]). There is tentative evidence that prolonged anorexia nervosa blunts taste sensitivity [31].
Three other protein families have significant representation at the extremes of fEAA: a group of fatty acid metabolism proteins (Fig 3d), 14 protein components of the mitochondrial electron transport chain complexes (Fig 3d), and all 26 protein members of the CC-/CX-chemokine and chemokine receptor families that control chemotaxis and other immune cell functions (Fig 3e).
Proteins at the four MLR score distribution extremes were analysed via The Gene Ontology Resource [32] using Panther [33] to identify significant enrichment for specific biological processes (gene ontologies: GO). Next, MLR- scores and MLR+ scores were collated from the full set of proteins associated with identified gene ontology terms and statistically compared to the whole protein population using a two-tailed z-test. For E. coli, we observed significant enrichment within the less negative MLR-, more positive MLR+ protein sector for 251 proteins designated under the 'response to abiotic stimulus' GO term (MLR- p = 2.6 × 10⁻⁵, MLR+ p = 1.1 × 10⁻⁶). The two least negative MLR- proteins in the entire E. coli proteome fall within this environment-detection category: acid shock protein (asr) and cold shock-like protein (cspC). Also significant were 116 proteins under the 'translation' GO term (MLR- p = 1.1 × 10⁻[...] subject to bias from the presence of multiple paralogs. Reanalysis of the paralog-rich translation term, collapsing multiple paralogs to a single averaged archetype, still yielded significance (MLR- p = 0.002, MLR+ p = 5.5 × 10⁻⁸). Translation-related GO terms also presented statistically significant MLR score biases in S. cerevisiae, A. thaliana (root), and H. sapiens (liver). Likewise, the environmental detection-related GO terms response to high light intensity, water deprivation, and cold acclimation (A. thaliana), and detoxification and response to stress (H. sapiens), all showed significant MLR score biases (Supplementary File 2).

Amino acid composition and disease
In humans, gene-environment (GxE) interactions modifying disease risk and phenotypic expressivity may be encountered by proteins with extreme fEAA/fCEAA composition.
Malnourishment in early life is currently experienced by 1 in 5 of the world's population, affecting stature, intellectual ability, future fertility, and risk of chronic illnesses, with the WHO reporting 145 million children with stunted height in 2020 [34,35]. In the first approach to examine potential EAA/CEAA deficiency influences on disease risk, the online DisGeNET catalogue of genes associated with 8,383 diseases [36] was queried to identify extreme fEAA/fCEAA composition proteins which also had robust aetiopathological roles supported by at least 10 distinct disease annotations (Supplementary File 1). Proteins linked to cancers (e.g., the tumour suppressor, CDKN2A), CNS disorders (e.g., the myelin constituent, PMP22), and developmental disorders (e.g., skeleton and tooth development protein, SLC10A7) were represented at fCEAA and fEAA extremes. For fCEAA, there were many connective tissue disorders (due to the collagen protein family), as well as several proteins linked to miscarriage (COL5A1, IGFBP6, LGALS3); for fEAA, proteins were linked to immunological disorders, primarily due to the chemokine family and their receptors.
In a second, multigenic approach, five conditions (cancer, male infertility, female infertility, tooth abnormality, obesity) and one phenotype (stature/height) were chosen as established indicators of malnourishment or, in the case of cancer, selected because of a pathology defined by aberrant proliferation. Risk proteins for each disorder were compiled from the literature or public databases and two-tailed Z-tests performed to determine if risk protein lists showed average fCEAA or fEAA values significantly deviating from the entire proteome (Supplementary File 1). Significant findings were observed for cancer and stature.

Discussion
We have established both extrinsic and intrinsic rules by which a protein's amino acid composition can influence its expression. The profile of extrinsic CEAA inhibitory effects on expression during baseline and proliferative cellular conditions offers the first rigorous molecular definition of this historically ignored nutritional class. One consequence of our findings is that the proliferative state of laboratory cell lines (largely unreported in publication methods) may be a confounding factor for experimental replication of functional or expression studies, for half of all proteins. Expression of proliferation biomarker MKI67 may be a useful benchmark for such studies. By contrast, only modest consequences were observed for EAA-enriched proteins. Determining the true scale of extrinsic EAA influences on protein expression in vitro and in vivo will require experimentation with amino acid-deficient culture media/feeds.
The remarkable second finding that individual amino acids affect protein expression in a largely conserved manner across species appears to be a consequence of the intrinsic amino acid properties of size, structural complexity, and codon allocation. It is presumed that these intrinsic effects act at the ribosome during translation. We observed that proteins participating in translation and in species-specific environmental stress responses were significantly underrepresented in amino acids with negative influence on expression and, conversely, overrepresented in those with positive influence. We suggest this selective drive has ensured that survival-enhancing proteins can be rapidly, robustly, and highly expressed, even in challenging cellular and environmental conditions. Human and animal proteins at the extremes of EAA/CEAA composition may also have evolved as an advantageous strategy to survive extrinsic nutritional scarcity. As well as the described effects on hair/nail/skin production and food-seeking behaviours, the modulation of collagen protein expression may be a key [...] CEAAs such as cysteine, or negative MLR coefficient amino acids such as serine, isoleucine, and tryptophan are now also worthy of investigation. By contrast, restricting amino acids in diet is an emerging concept in cancer treatment, capitalising on the specific demands of tumour cells. Our earlier findings on proliferation demands suggested that this 'hallmark' [42] would manifest as reduced fCEAA in cancer risk proteins. In fact, cancer proteins exhibited an extraordinary under-representation of EAA, suggesting that restricted essential amino acid supply to the tumour microenvironment may be a major determinant of protein expression, genotype-phenotype correlation, and clonal selection in cancer [43]. In tumours, expression of high fEAA proteins involved in mitochondrial oxidative respiration and chemokine function may thus be compromised.
This would be consistent with the Warburg effect [42] which describes the metabolic shift within tumours from oxidative respiration to glycolysis, and it may also contribute to the extensive chemokine/receptor-mediated interactions between tumour cells, stromal cells and macrophages [44]. Multiple amino acids have been trialled in restriction studies [45], mostly on the basis of gross abundance, so the detailed findings reported here may guide future dietary protocols in cancer treatment.

Data import and basic amino acid frequency analysis
From Uniprot.org, one representative human protein sequence per gene (totalling 20,397) was downloaded from the Reference Human proteome (ID: UP000005640) in FASTA format.
Microsoft Excel text analysis formulas were applied to calculate the total number of amino acid residues, the frequency of each individual amino acid, and the relative proportions of the EAA/CEAA/NEAA nutritional classes present within each protein (Supplementary File 1). For example, a total of 34 EAA residues within a protein of 299 residues generates an fEAA of 0.11. Frequencies of amino acids or amino acid classes were used to remove the confounder of protein length differences. Similar processes were carried out for the Escherichia coli (UP000635675), Saccharomyces cerevisiae (UP000002311), and Arabidopsis thaliana (UP000006548) proteomes.
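The frequency calculation can be sketched outside Excel as follows. The class membership shown is an assumption: it follows a commonly used nutritional classification and is not listed in this section, and the function name is hypothetical.

```python
from collections import Counter

# Assumed class membership (one-letter codes); not stated in this section.
EAA = set("HILKMFTWV")   # essential
CEAA = set("RCQGPY")     # conditionally essential
NEAA = set("ADNES")      # non-essential


def class_frequencies(sequence):
    """Return (fEAA, fCEAA, fNEAA) for a one-letter protein sequence."""
    counts = Counter(sequence.upper())
    # Only the 20 standard residues contribute to the denominator.
    total = sum(counts[aa] for aa in EAA | CEAA | NEAA)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    f_eaa = sum(counts[aa] for aa in EAA) / total
    f_ceaa = sum(counts[aa] for aa in CEAA) / total
    f_neaa = sum(counts[aa] for aa in NEAA) / total
    return f_eaa, f_ceaa, f_neaa


# Worked example from the text: 34 EAA residues in a 299-residue protein.
print(round(34 / 299, 2))  # 0.11
```

Dividing by the residue total is what removes the protein-length confounder: the three fractions always sum to 1 regardless of sequence length.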

Protein expression correlation with amino acid classification
Protein expression data were imported as simple .txt files into Excel from publicly available datasets in PaxDB. Species-specific protein identifiers in the expression data were converted into universal UniProt or UniPARC identifiers using the VLOOKUP command accessing imported conversion tables, allowing correlation with the amino acid/amino acid class frequencies of each protein.
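The identifier conversion step is an exact-match lookup followed by a join. A minimal sketch using plain dictionaries is given below; the function name, argument names and the example identifiers are hypothetical placeholders rather than real PaxDB or UniProt entries.

```python
def attach_expression(expression_by_species_id, id_to_uniprot, freq_by_uniprot):
    """Join expression values to composition data via a conversion table.

    Mirrors an exact-match VLOOKUP: identifiers that cannot be converted,
    or that have no composition record, are skipped.
    """
    rows = []
    for species_id, abundance in expression_by_species_id.items():
        uniprot_id = id_to_uniprot.get(species_id)
        if uniprot_id is None or uniprot_id not in freq_by_uniprot:
            continue
        rows.append((uniprot_id, abundance, freq_by_uniprot[uniprot_id]))
    return rows


# Placeholder identifiers for illustration only.
expression = {"species_id_1": 120.0, "species_id_2": 3.5}
conversion = {"species_id_1": "ACC_A"}
composition = {"ACC_A": (0.41, 0.28, 0.31)}
print(attach_expression(expression, conversion, composition))
```

Skipping unmatched identifiers silently is a design choice made for brevity; in practice it is worth counting the dropped records to check how much of the dataset survives the conversion.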

Moving median/average expression analysis
Nutritional amino acid class frequency was ranked and plotted against the moving median liver protein expression level (Fig 1a; periodicity of 469, approximately 5% of the total). The multiple linear regression (see below) model score for each protein was plotted against liver expression value (log scale) and an Excel moving average trendline applied (periodicity of 100 proteins) (Fig 2a/b/d).
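The moving median can be sketched as below. Whether the original window was trailing or centred is not stated, so this version simply takes every full window of consecutive ranked proteins; both function names are hypothetical.

```python
from statistics import median


def moving_median(values, window):
    """Median of each full window of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [median(values[i:i + window])
            for i in range(len(values) - window + 1)]


def ranked_moving_median(frequencies, expression, window):
    """Rank proteins by class frequency, then smooth their expression."""
    order = sorted(range(len(frequencies)), key=lambda i: frequencies[i])
    ranked_expression = [expression[i] for i in order]
    return moving_median(ranked_expression, window)
```

With 9,399 liver proteins, a window of 469 corresponds to roughly 5% of the ranked list. A median is less sensitive than a mean to the few very highly expressed proteins, which is presumably why it is described as conservative.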

Tissue and cell line plots of changing EAA/CEAA/NEAA proportions across expression levels
For a tissue or cell line, both the numerical expression levels and the fEAA, fCEAA, and fNEAA for each protein were separately converted into ranks. Cumulative average proportions for each amino acid class were calculated from lowest-to-highest expressed proteins and, in parallel, highest-to-lowest expressed proteins; the average of the pair of values was calculated for each individual protein and plotted (Fig 1b-g, Supplementary Figure 1, Supplementary File 1). This method produced smoothed plots of trends in relative amino acid class representation as a function of ranked protein expression level.
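The two-direction cumulative averaging can be sketched as follows for one amino acid class. The input is assumed to be the per-protein class value (rank or proportion) already ordered from lowest to highest expression; the function name is hypothetical.

```python
from itertools import accumulate


def bidirectional_cumulative_mean(values_by_expression_rank):
    """Average of the low-to-high and high-to-low cumulative means."""
    values = list(values_by_expression_rank)
    # Cumulative mean walking up from the lowest-expressed protein.
    forward = [total / (i + 1)
               for i, total in enumerate(accumulate(values))]
    # Cumulative mean walking down from the highest-expressed protein,
    # re-reversed so that index i again refers to protein i.
    backward = [total / (i + 1)
                for i, total in enumerate(accumulate(reversed(values)))][::-1]
    return [(f + b) / 2 for f, b in zip(forward, backward)]


print(bidirectional_cumulative_mean([1, 2, 3]))  # [1.5, 2.0, 2.5]
```

Each end of the curve is dominated by the cumulative mean that starts at that end, so local trends among the lowest- and highest-expressed proteins remain visible while the middle of the plot is heavily smoothed.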

Multiple Linear Regression and models
Multiple linear regression (MLR) in the Excel 'Data Analysis' ToolPak add-in was used to identify intrinsic parameters or individual amino acids with significant correlation to protein expression level, and their respective coefficients (Supplementary File 1). Statistically significant (p<0.05) MLR coefficients were normalised across organs and species in Table 1.
MLR findings allowed the construction of models which generated relative numerical expression predictions for each protein based on coefficients and amino acid frequencies.
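A model of this kind reduces to a weighted sum of amino acid frequencies. The sketch below uses made-up coefficients purely for illustration (the fitted values are in Table 1 and Supplementary File 1), and the function names are hypothetical.

```python
def mlr_score(aa_frequencies, coefficients, intercept=0.0):
    """Relative expression score: intercept + sum(coefficient * frequency)."""
    return intercept + sum(coefficients.get(aa, 0.0) * freq
                           for aa, freq in aa_frequencies.items())


def split_mlr_score(aa_frequencies, coefficients):
    """Return (MLR+, MLR-): contributions of positive and negative coefficients."""
    plus = sum(c * aa_frequencies.get(aa, 0.0)
               for aa, c in coefficients.items() if c > 0)
    minus = sum(c * aa_frequencies.get(aa, 0.0)
                for aa, c in coefficients.items() if c < 0)
    return plus, minus


# Illustrative (not fitted) coefficients and frequencies.
example_coefficients = {"A": 2.0, "S": -1.0}
example_frequencies = {"A": 0.5, "S": 0.5}
print(mlr_score(example_frequencies, example_coefficients))        # 0.5
print(split_mlr_score(example_frequencies, example_coefficients))  # (1.0, -0.5)
```

Because the score is relative, it is suited to ranking proteins and plotting trends against measured expression rather than to predicting absolute abundance.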

Statistical tests of gene ontology and disease candidate lists
Two-tailed z-tests were applied to groupings derived from MLR+ (combined coefficients increasing expression) and MLR- (combined coefficients decreasing expression) data, or fEAA/fCEAA/fNEAA data.
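One plausible form of this test compares the mean of a protein grouping with the mean of the whole population, using the population standard deviation. The exact formulation used is not given here, so the sketch below is an assumption, and the function name is hypothetical.

```python
import math
from statistics import mean, pstdev


def two_tailed_z_test(subset, population):
    """One-sample z-test of a subset mean against the population mean.

    Returns (z, p), where p is the two-tailed normal p-value.
    """
    n = len(subset)
    sigma = pstdev(population)  # population standard deviation
    if n == 0 or sigma == 0:
        raise ValueError("need a non-empty subset and a non-constant population")
    z = (mean(subset) - mean(population)) / (sigma / math.sqrt(n))
    # 2 * (1 - Phi(|z|)) expressed through the complementary error function.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p
```

For example, `two_tailed_z_test(go_term_scores, all_scores)` (both names hypothetical) would test whether the proteins under one GO term, or one disease list, deviate from the proteome-wide average of a score such as MLR+ or fEAA.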