Abstract
As gene sequences change through evolution, and as the abundances of different proteins change through development, the distinct elemental composition of the proteins at different times can be represented as an overall chemical reaction. Compositional and thermodynamic analysis of these reactions leads to novel insight on biochemical changes and enables predictions of intensive variables including redox potential. The stoichiometric hydration state (nH2O) refers to the number of H2O in theoretical reactions to form the proteins from a set of thermodynamic components. By analyzing published phylostratigraphy and transcriptomic and proteomic datasets, I found that nH2O of proteins decreases on evolutionary timescales (from single-celled organisms to metazoans) and on developmental timescales in Bacillus subtilis biofilms. Moreover, values of nH2O computed for a developmental proteome of fruit flies are aligned with organismal water content from larva to adult stages. I present a thermodynamic model for the equilibrium chemical activity of target proteins in a genomic background. Conditions that maximize the activity of the target proteins are found by optimizing the values of water activity and oxygen fugacity, which are then combined to calculate effective values of redox potential (Eh). The effective Eh values during evolution range between values reported for mitochondria, the cytosol, and extracellular compartments. These results suggest a central role for water, and water activity, in the biochemistry of evolution and development.
1. Introduction
Evolutionary developmental biology (evo-devo) addresses the evolution of developmental mechanisms across organisms. Even in organisms with similar genetic makeup, differences in gene regulation lead to phenotypic variation and as a result affect the evolvability of particular biological lineages [1]. A well known example from evo-devo regards the spatial and temporal expression patterns of homeobox (Hox) genes, which regulate the morphogenesis of animals [2]. In these systems, there is a need to better understand the expression dynamics of large numbers of genes and proteins. This question has motivated the production of genome-wide transcriptomic and proteomic datasets in model organisms including animals and bacteria [3,4]. However, integration of these datasets with primary biochemical features has lagged. Water content is one of the most basic biochemical characteristics that changes through development [5,6]. In this study, I apply a compositional analysis to proteomic data to gain information about the stoichiometry of water in biochemical reactions at the proteome level, which can be related to measured water content. Subsequently, a thermodynamic analysis reveals the influence of water activity on the relative stabilities of proteins and leads to a new model for biochemical redox potentials over evolutionary and developmental time scales.
Water activity can be defined as the water vapor pressure in a system divided by the vapor pressure of pure water at the same temperature (e.g. [7–9]). In origin-of-life research, geological environments with low water activity have been proposed to reduce or overcome the energetic barriers to polymerization of biomolecules in an aqueous environment [10]. This concept has received renewed attention recently for serpentinizing systems, which could not only provide redox disequilibria to drive the abiotic synthesis of various organic molecules, but also provide pore spaces with reduced water activity [11]. Among extant organisms, some species of fungi and bacteria are capable of growth at environmental water activities (aw) of 0.65 or lower [12]; a lowering of extracellular aw is also associated with decreased cytoplasmic water content and activity [13,14]. The control of water activity is important for food microbiology [8], and the characterization of water activity in extraterrestrial settings helps to identify promising targets for exploration in astrobiology [7,15].
Examples from other areas of biology highlight the relevance of dynamic water content in biological systems. There is a progressive loss of water during development from prenatal to adult forms in mammals [6,16,17]. Conversely, relatively high water content has been recognized for over a century as a biochemical characteristic of cancer tissue [18–22], and some authors have highlighted parallel trends of higher water content in both cancer and embryonic tissue [23–25]. Models are available to predict water activity from the water content of foods [8], but there is a pressing need to relate observations of water content in normal and pathological states to water activity ([26]; see also [24, p. 189]), which is a better thermodynamic measure of the potential for hydration in biochemical reactions.
Although biomolecular conformations (i.e. the structures resulting largely from non-covalent interactions) have for some time been known to be affected by the reduced water activity associated with macromolecular crowding, which is a characteristic of cellular interiors [9,27,28], the possibility that water activity might also be a controlling factor in covalent biochemical reactions has been largely overlooked. Instead, the assumption of unit water activity explicitly enters many biochemical and geobiochemical models, which set aH2O = 1 in thermodynamic calculations [29–31].
In contrast, many petrological studies use thermodynamic models that link water activity with observable mineral assemblages in rocks. The consideration of both water activity and oxygen fugacity (a thermodynamic measure of oxidation potential) is essential for understanding melting and magmatic processes [32] as well as lower-temperature metasomatic processes including serpentinization [33]. Notably, there is a straightforward conversion between oxygen fugacity and redox potential expressed in the Eh scale. For instance, partial pressures of O2 between 10^−60 and 10^−83.1 correspond to Eh values between approximately −70 and −410 mV at pH 7, 25 °C, and aH2O = 1 [34, p. 176].
A basic assumption for geochemistry is that knowledge of the initial and final states of a system is sufficient to describe a process that can be represented by thermodynamic models [35, p. 50]. Similarly, some models for the energetics of biomass synthesis in early Earth systems consider overall anabolic reactions from inorganic precursors without reference to the actual mechanisms that may be involved [36]. It follows that from a geochemical perspective, changes in protein identity and abundance – whether through evolution, ontogeny, or in cell culture experiments – would be best represented as an overall chemical reaction. This type of approach could be described as “geobiochemistry” to distinguish it from the textbook version of biochemistry. The traditional biochemical approach conceptualizes proteins as “molecular machines” that catalyze and control metabolic reactions [30,37], but does not step back to take a broader view of chemical changes in the proteome itself.
By applying a compositional analysis to proteomic data, it can be shown that water is consumed as a reactant in the differential expression of proteins of many cancer types compared to normal tissues [26]. This observation suggests that increasing water activity may be a driver for the observed biochemical changes at the proteome level, but a thermodynamic model should be developed to quantify this prediction.
In this study, I further develop the biological motivation and theoretical foundation for applying thermodynamics to patterns of protein occurrence and abundance. This is done through compositional and thermodynamic analysis of proteins associated with phylostrata, which represent the origin of genes at particular times in evolution. I also explore patterns of protein expression in developmental model systems, including biofilms and fruit flies. The common pattern that emerges is water loss from proteins over various timescales. The results are consistent with the hypothesis that physicochemical conditions exert a major influence on proteome dynamics during evolution and development.
2. Results
2.1. Compositional analysis of proteins in evolution
Several studies have associated patterns of gene expression in cancer with phylogenetically earlier genes [38,39]. The phylostratigraphic analysis used in these studies assigns ages of genes based on the latest common ancestor whose descendants have all the computationally detected homologs of that gene. To analyze the evolutionary trends of oxidation and hydration state of proteins, I used 16 phylostrata (PS) for human protein-coding genes given by Trigos et al. [38]. The mean lengths of proteins coded by genes in each phylostratum are plotted in Fig. 1A. There is an initial rise in protein length, leading up to Eukaryota, which is consistent with the previously reported greater median protein length in eukaryotes than prokaryotes [40]. The large decrease of protein length in later phylostrata could in part be an artifact of BLAST-based homology searches [41]. Other studies have detected generally shorter sequences for younger genes; whether this reflects a significant source of phylostratigraphic bias should be kept in mind [42].
The abundances of elements in the primary sequence of proteins (C, H, N, O, S) can be represented as a 5-dimensional compositional vector. To visualize particular projections of this compositional space, it is helpful to use compositional metrics with meaningful chemical definitions. The metrics used here are carbon oxidation state (ZC) and stoichiometric hydration state (nH2O). The carbon oxidation state represents the average charge on all carbon atoms in the molecule, given nominal charges of the other atoms (H+1, N−3, O−2, S−2). Assuming that the heteroatoms (N, O, S) are bonded only to H or C, and not to each other, the carbon oxidation state of amino acids and proteins can be computed directly from the elemental abundances (note that this assumption precludes consideration of disulfide bonds and some types of post-translational modifications) [43,44]. In contrast, the stoichiometric hydration state is the coefficient on H2O in mass-balanced reactions representing the theoretical formation of the protein from a set of thermodynamic components (also termed basis species). So that nH2O and ZC can be viewed as independent variables, it is important to choose basis species where their covariation is reduced. Accordingly, the basis species glutamine, glutamic acid, cysteine, H2O, and O2 (denoted “QEC”) were chosen for this analysis [26,44].
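The carbon oxidation state described above can be illustrated with a short calculation. The following Python sketch is not the canprot implementation used in this study; the function names are hypothetical, and it only applies the nominal charges listed in the text to a neutral elemental formula. It is applied to the formula of chicken egg-white lysozyme quoted in Section 2.2.

```python
import re

# Nominal oxidation numbers of the non-carbon atoms, as listed in the text.
NOMINAL_CHARGE = {"H": 1, "N": -3, "O": -2, "S": -2}

def parse_formula(formula):
    """Parse a simple formula such as 'C613H959N193O185S10' into element counts."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[element] = counts.get(element, 0.0) + (float(number) if number else 1.0)
    return counts

def carbon_oxidation_state(formula, charge=0):
    """Average oxidation state of carbon.

    The charges on carbon must balance the net charge of the molecule,
    so ZC = (net charge - sum of nominal charges of other atoms) / nC.
    """
    counts = parse_formula(formula)
    other = sum(NOMINAL_CHARGE[el] * n for el, n in counts.items() if el != "C")
    return (charge - other) / counts["C"]

# Lysozyme formula from Section 2.2; glutamine is one of the QEC basis species.
zc_lysozyme = carbon_oxidation_state("C613H959N193O185S10")
zc_glutamine = carbon_oxidation_state("C5H10N2O3")
print(round(zc_lysozyme, 4), round(zc_glutamine, 4))
```

Because ZC depends only on elemental ratios, the same value is obtained for the whole-protein formula and for the per-residue formula.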
Fig. 1A reveals distinct evolutionary patterns of oxidation state and hydration state of proteins. ZC forms a strikingly smooth hump between PS 1 and 11 then increases rapidly to the maximum at PS 14, followed by a smaller decline to PS 16, which corresponds to Homo sapiens. In contrast, nH2O shows an overall decrease through time, although there are notable positive jumps between PS 3 and 4 and between PS 7 and 8.
I also considered proteins grouped into eight gene ages (GA) reported by Liebeskind et al. [45] based on consensus tables for different age-estimation algorithms. Compared to the Trigos phylostrata, the Liebeskind gene ages have three steps between cellular organisms and Eukaryota, providing a greater resolution in earlier evolution, and stop at Mammalia, which corresponds to Trigos PS 10. Keeping in mind the different resolutions and scales of the Trigos phylostrata and Liebeskind gene ages, the two datasets show similar maxima for ZC and protein length near Eumetazoa (or Opisthokonta, which is not one of the Trigos phylostrata), and an overall decrease of nH2O during evolution (Fig. 1B).
2.2. Theoretical prediction of relative stabilities of proteins
Up to now, we have seen the chemical composition of proteins represented in terms of selected compositional metrics: oxidation and hydration state. How can these metrics be related to environmental conditions: oxidation and hydration potential? Here I describe a thermodynamic method for predicting the environmental oxidation and hydration potentials that stabilize particular proteins compared to others. Stability in this context refers not to protein conformation (i.e. 3-dimensional structure determined by non-covalent interactions), but to the energetics of the overall formation of proteins from the basis species. The theoretical formation reactions considered here represent the distinct elements and covalent bonds (i.e. specific amino acid residues) in the primary sequences of different proteins.
To perform this calculation, I used the chemical affinity (A), which is the negative of the non-standard Gibbs energy change of the reaction [46, p. 143] and can be computed from (e.g. [47])

$$ A = 2.303\,RT \log (K/Q), \tag{1} $$

where K is the equilibrium constant and Q is the activity product for the formation reaction for a particular protein. The right-hand side is multiplied by the natural logarithm of 10 (≈2.303) so that all other logarithmic values are common logarithms. Because it includes the chemical activities of all the species in the reaction, Q incorporates the sensitivity to oxidation and hydration potential, which are represented by oxygen fugacity (fO2) and water activity (aH2O). On the other hand, K is a function of the standard Gibbs energy of the reaction and therefore of temperature and pressure.
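An example of a balanced formation reaction is shown below for chicken egg-white lysozyme (UniProt: LYSC_CHICK), with coefficients rounded to three decimal places.

$$ 0.515\,\mathrm{C_5H_{10}N_2O_3} + 0.389\,\mathrm{C_5H_9NO_4} + 0.078\,\mathrm{C_3H_7NO_2S} \rightarrow \mathrm{C_{4.752}H_{7.434}N_{1.496}O_{1.434}S_{0.078}} + 0.879\,\mathrm{H_2O} + 0.471\,\mathrm{O_2} \tag{2} $$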
The chemical formula of the whole protein (C613H959N193O185S10) is divided by the length (129) to give the per-residue formula that is a product of the reaction. The only other species in the reaction are the basis species chosen to project the elemental composition into chemical space: glutamine (C5H10N2O3), glutamic acid (C5H9NO4), cysteine (C3H7NO2S), H2O, and O2. The standard Gibbs energy (ΔGf°) of the protein calculated using amino acid group additivity is also divided by the protein length to give the per-residue ΔGf°, which is combined with the standard Gibbs energies of the other species in the reaction to calculate the standard Gibbs energy of reaction (ΔGr°), and from that, logK. By using the subcrt() function in the CHNOSZ package [48], the logK for Reaction 2 is computed to be −39.84. The methods for amino acid group additivity for proteins are described in [49]. It would be possible to include pH effects in this model by considering ionization of protein sidechain and terminal groups [49,50], but in order to focus on the contributions of hydration and oxidation potential, the present calculations are only concerned with proteins treated as neutral species.
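The coefficients of a formation reaction follow from mass balance alone: with five elements and five basis species, there is exactly one solution. The sketch below solves this linear system for the per-residue lysozyme formula using exact fractions. It is a standalone illustration with hypothetical function names, not the CHNOSZ or canprot code used for the paper. Positive values are moles of a basis species consumed per residue; negative values mean the species appears on the product side.

```python
from fractions import Fraction

# Elemental composition (C, H, N, O, S) of each QEC basis species named in the text.
BASIS = {
    "glutamine":     (5, 10, 2, 3, 0),
    "glutamic acid": (5, 9, 1, 4, 0),
    "cysteine":      (3, 7, 1, 2, 1),
    "H2O":           (0, 2, 0, 1, 0),
    "O2":            (0, 0, 0, 2, 0),
}

def solve(matrix, rhs):
    """Solve a square linear system exactly by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [[Fraction(x) for x in row] + [Fraction(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [x / aug[col][col] for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

def basis_coefficients(formula_counts, length=1):
    """Moles of each basis species needed to form one mole of the per-residue formula."""
    names = list(BASIS)
    # Rows are elements (C, H, N, O, S); columns are basis species.
    matrix = [[BASIS[name][i] for name in names] for i in range(5)]
    rhs = [Fraction(count, length) for count in formula_counts]
    return dict(zip(names, solve(matrix, rhs)))

# Lysozyme: C613 H959 N193 O185 S10, 129 residues (values from the text).
coeffs = basis_coefficients((613, 959, 193, 185, 10), length=129)
for name, value in coeffs.items():
    print(f"{name:14s}{float(value):8.3f}")
```

In this convention the coefficient on H2O is the stoichiometric hydration state computed by pure mass balance on the stated formula; published canprot values may differ if terminal groups are treated differently.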
Calculation of the chemical affinities requires values for the activities of all species in the reaction. The values of log fO2 and log aH2O are used here as exploratory variables, so they were assigned a range of values at equal intervals in order to construct a 2-dimensional grid that is used for plotting the relative stabilities of the proteins. The chemical activities of the other basis species were assigned using mean concentrations of amino acids in human plasma [51]; expressed as logarithms of concentrations in mol/l, these are −3.2 for glutamine, −4.5 for glutamic acid, and −3.6 for cysteine [52]. The initial (non-equilibrium) activity of the per-residue formula for each protein was set to unity; the equilibrium activities were calculated as described next.
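As a worked illustration of how the affinity depends on the two exploratory variables, the sketch below evaluates A/(2.303RT) = log K − log Q for the per-residue lysozyme formation reaction at a single grid point. The value of log K and the amino acid activities are taken from the text. The stoichiometric coefficients are an assumption of this sketch: they come from mass balance on the per-residue formula with the QEC basis species, rounded to three decimals. The function name is hypothetical.

```python
LOG_K = -39.84  # log K of the per-residue lysozyme formation reaction (from the text)

# Stoichiometric coefficients, products positive and reactants negative.
# Assumed here from mass balance on C613H959N193O185S10 / 129 with the QEC basis.
STOICH = {
    "protein": 1.0,
    "H2O": 0.879,
    "O2": 0.471,
    "glutamine": -0.515,
    "glutamic acid": -0.389,
    "cysteine": -0.078,
}

def affinity_per_2303RT(log_aH2O, log_fO2, log_a_protein=0.0):
    """Dimensionless affinity A/(2.303RT) = log K - log Q for one protein."""
    log_activity = {
        "protein": log_a_protein,   # unit activity of the per-residue formula
        "H2O": log_aH2O,
        "O2": log_fO2,
        "glutamine": -3.2,          # log activities of amino acids from the text
        "glutamic acid": -4.5,
        "cysteine": -3.6,
    }
    log_q = sum(STOICH[s] * log_activity[s] for s in STOICH)
    return LOG_K - log_q

# One point of the 2-dimensional grid: pure water and a strongly reducing fO2.
example = affinity_per_2303RT(log_aH2O=0.0, log_fO2=-70.0)
print(round(example, 3))
```

Because H2O and O2 are on the product side of this reaction, lowering either log aH2O or log fO2 raises the affinity for forming lysozyme; comparing such values among proteins is what determines their relative stabilities.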
The relative stabilities of proteins (all represented by per-residue formulas) were quantified using the Boltzmann distribution written as [50]

$$ \frac{a_i}{\sum_i a_i} = \frac{e^{A_i/RT}}{\sum_i e^{A_i/RT}}, \tag{3} $$

where a is activity, A is affinity, and i designates a single protein in a system of any number of proteins. For convenience in the calculation, the total activity in the result (∑ ai) is fixed at unity; this in combination with the assumption of unit activity coefficients means that the ai for each protein represents its fractional degree of formation in equilibrium with all other proteins. The predominant protein is the one with the highest predicted activity, but all the proteins in the equilibrium model have finite values of activity. The next step is to find the conditions that maximize the activity of particular target proteins in a model system.
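The Boltzmann distribution can be sketched in a few lines. The function below takes affinities expressed as A/(2.303RT), converts them to A/RT, and returns activities that sum to a chosen total. The input values are hypothetical and serve only to show the behavior; this is not the CHNOSZ routine used in the study.

```python
import math

LN10 = math.log(10)

def equilibrium_activities(affinities_per_2303RT, total_activity=1.0):
    """Boltzmann distribution: a_i / sum(a) = exp(A_i/RT) / sum_j exp(A_j/RT)."""
    # Convert A/(2.303RT) to A/RT, then shift by the maximum for numerical stability.
    exponents = [LN10 * a for a in affinities_per_2303RT]
    top = max(exponents)
    weights = [math.exp(x - top) for x in exponents]
    total = sum(weights)
    return [total_activity * w / total for w in weights]

# Hypothetical affinities for three proteins at one grid point.
toy_affinities = [-10.5, -10.0, -12.0]
activities = equilibrium_activities(toy_affinities)
print([round(a, 4) for a in activities])
```

A difference of one unit in A/(2.303RT) corresponds to a tenfold difference in equilibrium activity, so the predominant protein dominates the sum while the others retain small but finite activities.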
2.3. Maximizing stabilities of target proteins on a genomic background
In order to thermodynamically characterize the changes in protein composition between phylostrata, model proteins for each phylostratum (referred to here as PS model proteins) were generated by computing the mean amino acid composition of all proteins in each phylostratum. Equilibrium calculations for the 16 PS model proteins for the Trigos phylostrata are displayed on a log aH2O–log fO2 diagram (Fig. 2A). Predominance fields for only a few proteins are visible on the diagram; the other model proteins with lower activity are “hiding” under the plane of the diagram. As indicated both for the predominant proteins (by the colored fields in the diagram) and for all 16 PS model proteins (by the points), the activities maximize at extreme values of log aH2O and log fO2; those for the predominant proteins actually maximize at infinite values of log aH2O and/or log fO2. Therefore, it is not possible to use this model system to find particular values of water activity and oxygen fugacity that characterize each phylostratum.
What happens if we add to the system many different human proteins in addition to the 16 PS model proteins? It is likely that some of the human proteins will be more stable than the PS model proteins, so that none of the latter will predominate. The large number of human protein sequences (e.g. 16,974 in the Trigos phylostrata dataset that can be mapped to the UniProt database) prevents running the complete calculation on a laptop computer with 16 GB of RAM, so random samples of human proteins were used here. The calculated predominance diagram for a system of 16 PS model proteins together with 200 randomly sampled human proteins reveals that a relatively small number of human proteins predominate in equilibrium. The 16 PS model proteins are among the less stable proteins, and their activities are maximized at particular values of log aH2O and log fO2 (Fig. 2B).
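The reason a background makes the maximum finite can be seen in a toy calculation. At unit protein activity, A/(2.303RT) for each protein is linear in log aH2O and log fO2, with slopes given by its coefficients on H2O and O2. A target whose slopes lie between those of the background proteins then has an interior maximum of its equilibrium activity. The sketch below uses entirely hypothetical proteins, with constants chosen so that all affinities coincide at one point; it illustrates the grid search only and is not the calculation behind Fig. 2B.

```python
import math

LN10 = math.log(10)

def make_protein(n_h2o, n_o2, crossing=(1.0, -65.0), level=-10.0):
    """Return (n_H2O, n_O2, constant) for a hypothetical protein.

    The constant is chosen so that A/(2.303RT) equals `level` at the
    `crossing` point (log aH2O, log fO2); this is a toy construction.
    """
    x0, y0 = crossing
    return (n_h2o, n_o2, level - (n_h2o * x0 + n_o2 * y0))

def target_fraction(target, background, log_aH2O, log_fO2):
    """Equilibrium activity of the target (total activity = 1) at one grid point."""
    proteins = [target] + list(background)
    exponents = [LN10 * (c + nh * log_aH2O + no * log_fO2) for nh, no, c in proteins]
    top = max(exponents)
    weights = [math.exp(x - top) for x in exponents]
    return weights[0] / sum(weights)

def maximize_on_grid(target, background, h2o_grid, o2_grid):
    """Return (activity, log aH2O, log fO2) at the grid point that maximizes the target."""
    return max((target_fraction(target, background, x, y), x, y)
               for x in h2o_grid for y in o2_grid)

# Hypothetical target with slopes in the middle of four background proteins.
TARGET = make_protein(-0.8, -0.6)
BACKGROUND = [make_protein(nh, no) for nh in (-1.0, -0.6) for no in (-0.8, -0.4)]
H2O_GRID = [-4.0 + 0.5 * i for i in range(21)]   # log aH2O from -4 to 6
O2_GRID = [-75.0 + 1.0 * i for i in range(21)]   # log fO2 from -75 to -55

best_fraction, best_log_aH2O, best_log_fO2 = maximize_on_grid(
    TARGET, BACKGROUND, H2O_GRID, O2_GRID)
print(best_fraction, best_log_aH2O, best_log_fO2)
```

Without the background, the target's affinity would simply keep increasing toward one edge of the grid, which corresponds to the behavior described for the system containing only the PS model proteins.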
2.4. Maximum activity analysis applied to phylostrata
The method described above was used to predict optimal values of water activity and oxygen fugacity to maximize the activity of the PS model proteins (the target proteins) against the background of thousands of human proteins. The values of log aH2O and log fO2 that maximize the activity of the 16 PS model proteins are shown in Fig. 3A and B. In a single calculation, a random sample of 2000 human proteins was used. The sampling and equilibrium calculations were performed 100 times; the means of all runs are represented by the red lines in Fig. 3A and B. The mean values of log aH2O are also plotted in Fig. 3C, with droplines at Cellular organisms, Eukaryota, Eumetazoa, and Mammalia to indicate coinciding gene ages in the Liebeskind dataset.
It is not surprising to find that the trends in hydration and oxidation potentials depicted in Fig. 3A and B are similar to those in the corresponding compositional metrics, nH2O and ZC, in Fig. 1. For instance, the earliest phylostratum (cellular organisms) has both the highest nH2O and log aH2O, and PS 1–10 hover around moderate values of these quantities. According to the thermodynamic model, this corresponds to about log aH2O = 0, indicated by the horizontal dashed line in Fig. 3C. After PS 10 (Mammalia), there is a decrease in both nH2O and log aH2O. However, although PS 16 (Homo sapiens) has the lowest nH2O of any phylostratum (Fig. 1A), the thermodynamic model predicts a small rise in log aH2O for PS 16 compared to PS 15. Looking at the oxidation trends, PS 11 (Theria) has both the lowest ZC and lowest log fO2.
It should be kept in mind that oxygen fugacity is a thermodynamic quantity that implies nothing about the actual mechanism of oxidation or reduction [35]. For instance, in geological systems where there is no free O2, the processes that actually provide oxygen come from other reactions in the environment [55]. A practical use of oxygen fugacity is as a thermodynamic parameter that can be used to calculate other parameters that are easier to measure [56, p. 245]. Likewise, the theoretical values of log aH2O and log fO2 shown here simply represent thermodynamic measures of hydration and oxidation potential that maximize the activities of the target proteins. The practical value of these parameters is demonstrated below by combining them to calculate Eh, which is a more common scale of redox potential in biochemistry.
2.5. Effective redox potential and implications for early cellular evolution
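Effective values of redox potential (Eh) can be obtained by considering the half-cell reaction for H2O:

$$ \mathrm{H_2O} \rightleftharpoons \tfrac{1}{2}\,\mathrm{O_2} + 2\,\mathrm{H^+} + 2\,e^- \tag{4} $$

At equilibrium, K = Q, where K and Q are the equilibrium constant and activity product. For Reaction 4, this gives

$$ \log K = \tfrac{1}{2}\log f_{\mathrm{O_2}} - 2\,\mathrm{pH} - 2\,\mathrm{pe} - \log a_{\mathrm{H_2O}}, \tag{5} $$

where pe = − log ae− and pH = − log aH+. Values of Eh can then be calculated using

$$ \mathrm{Eh} = \frac{2.303\,RT}{F}\,\mathrm{pe}, \tag{6} $$

where R, T, and F are the gas constant, temperature, and Faraday constant.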
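The conversion from oxygen fugacity and water activity to Eh can be sketched numerically by solving the law of mass action for the water half-cell reaction for pe and then scaling by 2.303RT/F. The value of log K used below (about −41.55 at 25 °C and 1 bar) is an assumption of this sketch, based on the standard Gibbs energy of formation of liquid water; it is not quoted in the text, and the function name is hypothetical.

```python
R = 8.314462618       # gas constant, J / (mol K)
F = 96485.33212       # Faraday constant, C / mol
LN10 = 2.302585093    # natural logarithm of 10

# Assumed log K for H2O = 1/2 O2 + 2 H+ + 2 e- at 25 °C and 1 bar.
LOG_K_WATER = -41.55

def effective_eh(log_fO2, log_aH2O=0.0, pH=7.0, T=298.15, log_K=LOG_K_WATER):
    """Effective redox potential in volts.

    Rearranging log K = 1/2 log fO2 - 2 pH - 2 pe - log aH2O gives
    pe = 1/4 log fO2 - 1/2 log aH2O - pH - 1/2 log K, and Eh = (2.303 RT / F) pe.
    """
    pe = 0.25 * log_fO2 - 0.5 * log_aH2O - pH - 0.5 * log_K
    return LN10 * R * T / F * pe

# Lower stability limit of water at pH 7 and unit water activity.
eh_reducing = effective_eh(-83.1)
# Same oxygen fugacity with a tenfold higher water activity.
eh_wetter = effective_eh(-83.1, log_aH2O=1.0)
print(round(eh_reducing, 3), round(eh_wetter, 3))
```

With log fO2 = −83.1, pH 7, and unit water activity, the result is close to −0.41 V, in line with the value quoted in the Introduction. The rearranged expression also shows that Eh rises by about 15 mV per unit increase in log fO2 and falls by about 30 mV per unit increase in log aH2O at 25 °C.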
Fig. 3D shows theoretical values of Eh for the PS model proteins calculated using Eqs. (5)–(6) with pH = 7. The effective Eh is a composite variable; it is elevated by either increasing log fO2 or decreasing log aH2O. As with log fO2, Eh exhibits a broad hump between PS 1 and 11, but the whole profile is tilted up, reflecting the overall evolutionary decrease of log aH2O. Several measurements for selected redox pairs in cells and plasma are shown for comparison [53,54]. The PS 1–11 hump begins and ends close to the cytosolic Eh of the NADH/NAD+ redox pair and maximizes near the Eh for cytosolic GSH/GSSG measured in erythrocytes. Between PS 12 and 15 there is a rapid rise toward Eh values characteristic of GSH/GSSG in plasma, followed by a return in PS 16 to the redox potential for cytosolic GSH/GSSG.
The first four Liebeskind consensus gene ages [45] are cellular organisms (GA 1), Euk_Archaea (GA 2; the common ancestor of Eukaryota and Archaea), Euk+Bac (GA 3; genes present only in Eukaryotes and Bacteria, representing horizontal transfer to Eukaryotes), and Eukaryota (GA 4), and therefore provide greater resolution over these stages of evolution than the Trigos phylostrata. Figs. 3E and F show the results for the maximum activity analysis applied to the Liebeskind gene ages. There is an increase of both nH2O (Fig. 1B) and log aH2O between GA 1 and 2, and these values are higher than those for all subsequent gene ages. The latter hover near log aH2O = 0 and have effective Eh values that range between cytosolic NADH/NAD+ and GSH/GSSG (Fig. 3F).
The Euk_Archaea group is of particular interest because it represents the common ancestor of Eukaryota and Archaea. Although proteins coded by Euk_Archaea genes (GA 2) are slightly more oxidized than those in GA 1 (Fig. 1B), their higher nH2O (and optimal log aH2O) means that they are stabilized by lower redox potentials, close to −300 mV. This redox potential approaches that of the NADH/NAD+ system in mitochondria [54]. The low redox potential suggested by these results could reflect a reductive overall cellular physiology, typical of archaeal cells, that operated before the endosymbiotic transfer of mitochondria [57]. Subsequently, subcellular conditions outside of the mitochondria could become more oxidizing; the release from a reductive chemistry might explain the rise in effective Eh at GA 3 and later.
2.6. Compositional and thermodynamic analysis of biofilm development
Temporal patterns of gene and protein expression during development have been documented for a growing number of organisms. A recent study reports data for Bacillus subtilis biofilms, which have been compared to developing embryos [4]. Those authors combined gene and protein abundances with phylostrata assignments to compute a transcriptome age index (TAI) and proteome age index (PAI) for timepoints in the biofilm development. Here I just used the reported abundances to calculate the weighted mean amino acid composition for proteins at each developmental stage, which are used as the model proteins for the compositional and thermodynamic analysis described below.
Futo et al. [4] described three periods of biofilm growth: early (6H–1D), mid (3D–7D), and late (1M–2M). Timepoints of 2D and 14D are regarded as transitional stages between these periods, and are marked by vertical lines in Fig. 4. In the early period of development, there is a steep decline in the mean protein length. Note that this is computed simply by combining the lengths of canonical protein sequences from the UniProt database with normalized gene or protein expression values reported by Futo et al. [4]; no phylostrata values are used in this or any of the following calculations. The late stage of biofilm development, where only transcriptomic data are available, shows another steep decline in mean length of the corresponding proteins (Fig. 4A).
I used the model proteins (i.e. mean amino acid compositions) at each timepoint to compute values of ZC and nH2O. The proteome-based model proteins exhibit decreasing nH2O in the early to mid developmental period (6H to 7D) (Fig. 4B). Except for an initial rise starting from the liquid culture (LC), the time course for the transcriptome-based model proteins also exhibits decreasing nH2O. In the late developmental period, the transcriptome-based model proteins show an even greater decrease in nH2O; unfortunately, proteomic data for this period are not available for comparison. Through early and mid development there is relatively little variation in ZC of the model proteins, but the transcriptome-based model proteins become much more oxidized in the late period (Fig. 4C).
The trends of nH2O are closely reflected in values of log aH2O computed from the thermodynamic analysis. The values of log aH2O are positive at all but the last timepoint (Fig. 4D). This implies that the biofilm requires an elevated internal hydration potential for early growth, but the internal conditions of the biofilm eventually reach a near-equilibrium state with the aqueous medium, where the activity of H2O is very close to unity.
Optimal values of log aH2O were combined with those of log fO2 (Fig. 4E) to compute effective values of Eh using Eqs. (5)–(6). The effective values of Eh rise through development, particularly in the later stages (Fig. 4E), which could be an indication of the attainment of more oxidizing conditions within the biofilm. The hypothesis that aging biofilms progressively become more oxidized could be tested with oxygen microelectrode and/or redox potential measurements. These types of measurements have been reported for some species such as Geobacter sulfurreducens [58], but I was unable to find any in the literature for aging B. subtilis biofilms.
2.7. Organismal water content and stoichiometric hydration state of proteins in development of fruit flies
The fruit fly, Drosophila melanogaster, is a widely used invertebrate model organism in genetics and developmental biology. A recent study provides proteomic data for developmental stages of D. melanogaster, including embryogenesis, larva, pupa, and adults [3]. The changes of water content and other biochemical constituents in the development of D. melanogaster from larval to adult stages, when grown on chemically defined axenic medium, were reported in [5]. As larvae progress through different instars (i.e. a few days post-hatching), the water content shows some variation around 80%. The water content then decreases sharply to 66% in the prepupal stage [5]. After this, the data of [5] show a rise of water content to greater than 70% in adults (Fig. 5A).
The stoichiometric hydration state of proteins computed for the fly developmental proteome is plotted in Fig. 5B. The nH2O is nearly constant during embryogenesis and three instars of larva (L1, L2, L3). There is a sudden drop to much lower nH2O in stage L3c, which is described as “L3 crawling larva” [3]. The pupae collected on different days (P1 to P5) exhibit a somewhat higher nH2O, but still lower than the embryos. The nH2O rises in young adults and then again in old adults, which have values close to those of the embryos and early larva.
The strong decrease of proteomic nH2O in the crawling larva (L3c) is aligned with the decrease of water content in the prepupal stage, which is when the larva leaves the medium [5]. Likewise, the rise of water content in the adult fly to higher values (Fig. 5A), but less than that of the larva, is likely reflected in the trend of nH2O for young adults (Fig. 5B). No distinction was made between young and old adults in Ref. [5], so it is not possible to compare the continued rise of nH2O with their water content data. Overall, the nH2O computed from the proteome of developing D. melanogaster appears to be tightly coupled to organismal water content.
The somewhat higher nH2O of proteomes of adult males than those of females can be compared with water contents of 7 to 42 day old D. melanogaster, expressed as a percentage of total mass, calculated using data from Figure 4 of [59]. The percent water content for females ranges from 64.1 to 66.2, and that for males from 66.3 to 68.4. Similarly, in humans, adult males have a higher percent water content than females [17]. These observations suggest that the higher nH2O of proteomes of adult male flies in Fig. 5B reflects actual physiological differences between the sexes.
3. Discussion
The main findings of this study can be grouped into two themes: compositional and thermodynamic analysis. The results show that proteomes may be biochemically connected to the environment through multiple thermodynamic components (O2 and H2O) that vary over a range of timescales.
3.1. Compositional analysis
The compositional analysis uncovers decreasing stoichiometric hydration state of proteins in evolution, as represented by phylostratigraphic ages, and in development of Bacillus subtilis biofilms. The pattern for the developmental proteome of Drosophila melanogaster is not a uniform decrease, but suggests a more cyclical nature. The large decrease in nH2O at the crawling larva stage (L3c) is aligned with the measured prepupal decrease in organismal water content [5]. This strong association between measured water content and values computed from the proteome supports the biological relevance of the compositional analysis performed in this study.
By considering proteins as chemical species, this study emphasizes the importance of water loss as a major feature of both evolutionary and developmental processes. A recent analysis of proteins coded by environmental metagenomes points to decreased nH2O for particle-associated compared to free-living fractions from marine and freshwater samples [44]. Additionally, in cell culture, lower nH2O is characteristic of the proteomes of cells grown as 3D spheroids or aggregates, compared to 2D monolayers [26]. Taken together, these findings support the hypothesis that water loss, as measured by the stoichiometric hydration state of proteomes, is a feature of cell-cell interactions, and possibly more rigid cellular or multicellular structures.
3.2. Thermodynamic analysis
The main theoretical advance in this study follows from the observation that both O2 and H2O are involved in the water half-cell reaction. This is the rationale for using not only oxygen fugacity but also water activity as thermodynamic variables in the derivation of an effective redox potential expressed as Eh.
The analysis described here is related to thermodynamic models in geochemistry that involve “perfectly mobile components”, which are represented by chemical potentials, instead of components defined by bulk composition [33,60]. By treating the chemical potentials of O2 and H2O as exploratory variables, optimal values can be found to maximize the predicted chemical activity of the target model proteins in a genomic background. These potentials are then combined with the law of mass action for the water half-cell reaction to calculate an effective redox potential (Eh). A notable finding is that the model protein for the putative common ancestor of eukaryotes and Archaea is characterized by effective Eh values that approach those of the NADH/NAD+ redox couple in present-day mitochondria. With further developments, this method could lead to a new way of looking at the evolutionary trends of protein composition that can uncover clues about subcellular chemical conditions in the past.
A possible concern about the thermodynamic analysis is the large range of water activity values in the model. The theoretical values of aH2O reach much lower values than those of saturated salt solutions (e.g. saturation of NaCl corresponds to 0.755 water activity [12]). At the other extreme, the theoretical values can be greater than 1, which represents an unphysical condition since pure water has unit activity. It may be possible to obtain aH2O > 1 in molecular dynamics simulations of mixtures of H2O and organic media due to oversaturation of H2O and cluster formation in a nonpolar solvent, but such results were regarded as anomalous by other authors [61].
One finding that may serve as a “reality check” is that the systems analyzed here tend toward log aH2O = 0 with time (see Fig. 3E and Fig. 4D). This tendency suggests a buffering effect, whereby after a relatively long time (in either evolution or development) proteomes naturally adjust toward equilibrium in a largely aqueous system.
4. Materials and Methods
All figures were created in R [62] using the contributed packages canprot version 1.1.0 [26] (available on the Comprehensive R Archive Network (CRAN)) and CHNOSZ [48] (development version > 1.4.0, which is available at https://r-forge.r-project.org/projects/chnosz/). Specifically, functions in the canprot package were used for calculating compositional metrics from amino acid compositions of proteins, and CHNOSZ was used for the thermodynamic calculations. The code to make the figures for this paper is available in the “evdevH2O.R” file and “evdevH2O” vignette in the JMDplots package version 1.2.5 [63].
Phylostrata were obtained from the supporting information of Trigos et al. [38] and the “main_HUMAN.csv” file of Liebeskind et al. [45,64]. Liebeskind et al. did not use phylostrata numbers, so gene ages 1 (oldest) to 8 (most recent) were assigned here (see Fig. 1B) corresponding to the names in the “modeAge” column of the source file. The Ensembl gene identifiers in the Trigos dataset were converted to UniProt accession numbers using the UniProt mapping tool [65]; in the case of duplicate UniProt accession numbers, the first matching phylostratum was used. The files with phylostrata assignments and UniProt IDs are available in the canprot package.
Transcriptomic and proteomic data for growing B. subtilis biofilms were taken from Supplementary file S10 of [4], specifically the tables named “Input values for calculating TAI” and “Input values for calculating PAI”. The values were used without modification. Data for the Drosophila developmental proteome were extracted from Supplemental Table S1 of [3]. The values in the columns for imputed log2 LFQ intensity were exponentiated, then mean values were computed for each time point (4 replicates). For both the B. subtilis and Drosophila datasets, protein IDs were mapped using the UniProt mapping tool [65]. The canonical protein sequences were downloaded from UniProt, and the read.fasta() function in the CHNOSZ package was used to generate the amino acid compositions of the proteins. These were combined with the transcriptomic or proteomic abundances to compute weighted means for amino acid composition at each time point. The JMDplots package has the computed mean amino acid compositions, which are used as input for making the figures in this paper. The intermediate files (transcriptomic or proteomic abundances with UniProt mappings, and amino acid compositions of proteins) and R scripts to generate the mean amino acid compositions are available separately (https://github.com/jedick/devodata).
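The abundance-weighted averaging described above can be sketched as follows. The protein identifiers, amino acid counts, and intensities are hypothetical toy values, and the function names are not taken from the JMDplots or devodata scripts; the sketch only mirrors the stated steps: exponentiate the log2 intensities, average the replicates, and use the result to weight the amino acid compositions.

```python
from statistics import mean

def abundance_from_log2(log2_replicates):
    """Exponentiate log2 intensities, then average across replicates."""
    return mean(2.0 ** value for value in log2_replicates)

def weighted_mean_composition(compositions, abundances):
    """Abundance-weighted mean amino acid composition (a 'model protein').

    compositions: {protein_id: {amino_acid: count}}
    abundances:   {protein_id: non-negative weight}
    Only proteins present in both inputs are used.
    """
    shared = [pid for pid in compositions if pid in abundances]
    total = sum(abundances[pid] for pid in shared)
    amino_acids = sorted({aa for pid in shared for aa in compositions[pid]})
    return {
        aa: sum(abundances[pid] * compositions[pid].get(aa, 0) for pid in shared) / total
        for aa in amino_acids
    }

# Hypothetical data for one time point with four replicates per protein.
compositions = {
    "protA": {"Ala": 10, "Gly": 5},
    "protB": {"Ala": 2, "Gly": 8, "Cys": 1},
}
log2_lfq = {
    "protA": [20.0, 20.0, 20.0, 20.0],
    "protB": [21.0, 21.0, 21.0, 21.0],
}
abundances = {pid: abundance_from_log2(values) for pid, values in log2_lfq.items()}
model_protein = weighted_mean_composition(compositions, abundances)
print(model_protein)
```

In this toy case protB is twice as abundant as protA, so its composition counts twice in the mean. The resulting fractional amino acid counts play the role of the model protein passed to the compositional and thermodynamic calculations.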
Funding
This research received no external funding.
Conflicts of Interest
The author declares no conflict of interest.