Distinct trends in chemical composition of proteins from metagenomes in redox and salinity gradients

Thermodynamic influences on the chemical compositions of proteins have remained enigmatic despite much work that demonstrates the impact of environmental conditions on amino acid frequencies. Here, we show that the dehydrating effect of salinity is apparent in protein sequences inferred from metagenomes and metatranscriptomes. The stoichiometric hydration state (nH2O), derived from the number of water molecules in theoretical reactions to form proteins from a particular set of basis species (glutamine, glutamic acid, cysteine, O2, H2O), decreases along salinity gradients including the Baltic Sea, Amazon River and ocean plume, and other samples from freshwater and marine environments. Analysis of other metagenomic datasets shows that differences in carbon oxidation state, rather than stoichiometric hydration state, are a stronger indicator of redox gradients than salinity gradients. These compositional metrics can help to identify thermodynamic effects in the distribution of proteins along chemical gradients at scales from geologic systems to cells.


Introduction
How microbial communities adapt to environmental gradients is a major challenge at the intersection of geochemistry, microbiology, and biochemistry. Patterns of amino acid usage in proteins are important indicators of microbial adaptation, and amino acid composition at the genome level is well known to depend on growth temperature 1 . Furthermore, measures of evolutionary distance and community composition based on protein sequences predicted from metagenomic sequencing are strongly associated with environmental temperature and pH 2 . It is widely held that the effect of amino acid substitutions on the structural stability of proteins is a major factor affecting amino acid usage in thermophiles 1,3 . Similarly, a large body of work has demonstrated amino acid signatures associated with proteins from halophilic organisms 4-7 . The most common interpretation of these trends is that particular amino acid substitutions are selected through evolution to increase the stability and solubility of the folded conformation and enhance other structural properties such as flexibility 5 .
A complementary approach to interpreting patterns of amino acid composition is based on the energetics of amino acid synthesis. Energetic costs in terms of ATP requirements have been used to model protein expression levels in bacterial and yeast cells 8,9 . Although ATP demands depend on environmental conditions 8 , a current limitation of ATP-based models is that they are derived for specific biosynthetic pathways, such as whether yeast are grown in respiratory or fermentative (i.e. aerobic or anaerobic) conditions 9 . A different class of models, based on thermodynamic analysis of the overall Gibbs energy of reactions to synthesize metabolites from inorganic precursors, quantifies the energetics of the reactions in terms of temperature, pressure, and chemical activities of all the species in the reactions, including those that define pH and oxidation-reduction potential 10 . Notably, the overall energetics for amino acid synthesis become more favorable, but to a different extent for each amino acid, between cold, oxidizing seawater and hot, reducing hydrothermal solution 11 . A recent systems biology study demonstrates tradeoffs between Gibbs energy of alternative pathways for amino acid synthesis and cofactor use efficiency (which affects ATP costs) in E. coli and suggests that pathway thermodynamics play a role in thermophilic adaptation 12 . Nevertheless, energetic models have not made much headway in relating metagenomic and geochemical data. This may be because few studies have asked whether specific changes in the chemical composition of biomolecules reflect specific environmental conditions.
To address this gap, here we claim that compositional analysis of proteins provides evidence for distinct adaptations to two types of environmental conditions: redox and salinity gradients. Because redox reactions are inherent in many aspects of metabolism, while hydration and dehydration reactions are essential for the synthesis of biomacromolecules 13 , our approach is shaped by the assumption that O 2 and H 2 O are two primary components that link environmental conditions to the energetics of biomolecular synthesis. Thermodynamic considerations predict that redox gradients supply a driving force for changes in oxidation state of biomolecules (similar reasoning applies to oxygen content of proteins 14 ), while salinity gradients, through the dehydrating potential associated with osmotic effects, exert a force that selectively alters the hydration state of biomolecules.
To test these predictions, we used two compositional metrics, the carbon oxidation state (Z C ) and stoichiometric hydration state (n H 2 O ). Z C is computed from the chemical formulas of organic molecules, and takes values between the extremes of -4 (CH 4 ) and +4 (CO 2 ), although the range for particular classes of biomolecules is much smaller 15 . n H 2 O is derived from the number of water molecules in theoretical formation reactions of proteins from basis species 16,17 . Through analysis of representative metagenomic and metatranscriptomic datasets, we show that Z C and n H 2 O are most closely aligned with environmental redox and salinity gradients, respectively. These findings apply to freshwater and marine environments, but metagenomic and protein expression trends for hypersaline environments and halophiles deviate from the thermodynamic predictions, most likely due to optimizations of hydrophobicity and isoelectric point or specialized physiological adaptations.
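As a minimal sketch of how Z C follows from a chemical formula, the snippet below assumes the conventional oxidation numbers H = +1, N = -3, O = -2 and S = -2, so that carbon balances the remainder of the molecular charge. The function name carbon_oxidation_state is hypothetical and is not taken from the code accompanying this paper.

```python
from fractions import Fraction


def carbon_oxidation_state(c, h, n=0, o=0, s=0, charge=0):
    """Average oxidation state of carbon for the formula C_c H_h N_n O_o S_s.

    Assumption: H = +1, N = -3, O = -2, S = -2, and the oxidation numbers
    of all atoms sum to the net charge of the molecule.
    """
    if c <= 0:
        raise ValueError("formula must contain at least one carbon atom")
    return Fraction(charge - h + 3 * n + 2 * o + 2 * s, c)


# The two extremes quoted in the text
print(carbon_oxidation_state(c=1, h=4))        # methane, CH4
print(carbon_oxidation_state(c=1, h=0, o=2))   # carbon dioxide, CO2
# An amino acid for comparison: glycine, C2H5NO2
print(carbon_oxidation_state(c=2, h=5, n=1, o=2))
```

Exact fractions are used so that the limiting values of -4 and +4 are reproduced without rounding.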
In our previous study 18 , we compared one broad class of geochemical conditions (redox gradients) with one compositional metric for proteins (carbon oxidation state). Here, we expand the geobiochemical framework to two dimensions by considering another set of environments (salinity gradients) and another compositional metric (stoichiometric hydration state). A long-term research goal is to extend this framework to as many dimensions as there are thermodynamic components plus temperature and pressure. Further background on the concepts used in this paper, and their limitations, is provided below.

Conceptual background
Intracellular conditions are maintained at physiological levels, so shouldn't the comparisons be made with intracellular or local measurements of redox potential and salinity?
Available data show that physicochemical conditions in cells are not constant, but may be maintained in a narrower range relative to the environment. The best-studied example is pH. In a compilation of data for different organisms, Figure 1 of ref. 19 shows that cytoplasmic pH varies from ca. 4.5 to 9.5 for external pH between 1 and 11.
Cell membranes are permeable to uncharged species like hydrogen 19 , supporting the argument that oxidation-reduction conditions in the cytoplasm are affected by the external environment 20 . Oxygen diffuses rapidly through lipid membranes, depending on their composition and structure, and rates of diffusion increase with temperature 21 . Cell membranes are also permeable to water 22 . For E. coli, which grows most rapidly at about 0.3 OsM (osmolarity), increasing the extracellular osmotic strength from 0.1 to 1.0 OsM (approximately the osmotic concentration of seawater; BioNumbers 23 BNID 100802) reduces the amount of free cytoplasmic water by more than half 22 . Halophiles, which thrive at even higher salinities, accumulate inorganic salts or organic solutes to maintain osmotic balance with the environment 6, 24 . This does not mean that intracellular hydration potential is constant. Intracellular water activity estimated from freezing point and cell composition data closely follows that of the growth medium, but is often offset to lower values 25 , perhaps due to macromolecular crowding effects 24 .
This brief review shows that the physicochemical conditions in cell interiors, at least under experimental conditions, are neither constant nor equal to the environment, but are sensitive to the environment.
Ideally, we would like to compare the compositions of biomolecules to conditions actually measured inside cells or in the immediate surroundings of cells, but these measurements are generally not available for microbial communities in their natural environments, so we make comparisons with large-scale geochemical gradients, except for different layers of the Guerrero Negro microbial mat, where metagenomic and chemical data are available on the scale of millimeters.
The number of water molecules in formation reactions from basis species is not related to the water molecules released by the condensation of amino acids in protein synthesis. What does this number mean?
In the thermodynamics literature, a "formation reaction" represents the composition of a chemical species, either in terms of elements 26 , or in terms of other species 27 . When these other species are restricted in number to the minimum needed to represent the composition of all possible species in the system, they constitute a set of "basis species", which can be thought of as the building blocks of the system, similar to the concept of thermodynamic components 28 . Therefore, a formation reaction from basis species is a mass-balanced, but non-unique, stoichiometric representation of the chemical composition of the protein. This type of reaction in general does not correspond to any biosynthetic mechanism, so to avoid confusion, we refer to these formation reactions as "theoretical formation reactions"; the number of water molecules in the theoretical formation reactions is the "stoichiometric hydration state".
Proteins are synthesized by polymerization of L-alpha-amino acids, which is a condensation (dehydration) reaction. They are not synthesized directly from CO 2 , NH 3 29 , where the energy cost of proteins per amino acid in cancer cells was evaluated by averaging the contributions for amino acids making up the protein sequences. That paper refers to other papers that quantify the energetic cost for proteins using amino acid contributions in bacteria (E. coli) and yeast (S. cerevisiae) 8,9 . In those papers, the energetic costs of proteins were computed per amino acid; likewise, in this paper, Z C and n H 2 O are defined as per-carbon and per-residue values, respectively, enabling comparisons between proteins of different length.
What are the evolutionary driving forces that may affect the chemical composition? Is it higher protein stability?
Thermodynamic models define the "cost" of a protein as a function of not only amino acid composition but also environmental conditions. Conceptually, this follows from Le Chatelier's principle, in that increasing the chemical activity of a reactant (on the left-hand side of a reaction) drives the reaction toward the products, or in more general terms, that the overall Gibbs energy of a reaction depends on the activities of species in the reaction 10,30 . Consider two proteins with different amino acid compositions, and therefore also different chemical compositions and theoretical formation reactions.
The formation of the protein with more water as a reactant is theoretically favored by increasing the water activity, whereas the formation of the protein with more oxygen as a reactant is favored by increasing the oxygen activity. The water and oxygen activity are thermodynamic measures of hydration and oxidation potential and can be converted to other scales. Note that the number of O 2 in theoretical formation reactions is closely related to the compositional metric used here, average oxidation state of carbon (see Methods).
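The mass-action argument above can be written out as a short worked relation. This is a sketch in standard chemical thermodynamics notation rather than an equation taken from this paper; it assumes that the basis species B_i appear as reactants with coefficients n_i, that a_i denotes their activities, and, in the last line, that the two proteins being compared have equal activities.

```latex
% Theoretical formation reaction with basis species as reactants:
%   \sum_i n_i B_i \rightarrow \mathrm{protein}
\begin{align}
\Delta G_r &= \Delta G_r^{\circ} + RT \ln Q, \qquad
  Q = \frac{a_{\mathrm{protein}}}{\prod_i a_i^{\,n_i}} \\
\left( \frac{\partial \Delta G_r}{\partial \ln a_{\mathrm{H_2O}}} \right)_{T,\,P,\,a_{j \neq \mathrm{H_2O}}}
  &= -\, n_{\mathrm{H_2O}} \, RT \\
\Delta G_{r,2} - \Delta G_{r,1}
  &= \left( \Delta G_{r,2}^{\circ} - \Delta G_{r,1}^{\circ} \right)
     - RT \sum_i \left( n_{i,2} - n_{i,1} \right) \ln a_i
\end{align}
```

Under these assumptions, raising the water activity lowers the Gibbs energy of formation most for the protein with the larger n H 2 O, and the same reasoning applies to O 2 and oxygen activity.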
This reasoning provides the theoretical justification for using chemical composition as an indicator for molecular adaptation to specific environmental conditions, but does not replace interpretations based on structural considerations. Halophilic organisms exhibit well-documented patterns of amino acid usage, including lower hydrophobicity and higher abundance of acidic residues, that impart greater stability, solubility, and flexibility of proteins 5 . These adaptations are reflected in lower values of the grand average of hydropathicity (GRAVY) 5,7 and/or isoelectric point of proteins (pI) 6 . In the Results, we compare the compositional analysis with GRAVY and pI and describe their different advantages.
It is well known that amino acid composition is affected by temperature and pH. In general, it is not controlled by a single environmental parameter. How can we tell if other environmental parameters affect oxidation and hydration state?
The redox gradients in hydrothermal systems are also temperature gradients (e.g. the mixing of seawater and hydrothermal fluid), and we have not attempted to disentangle the effects of temperature and redox conditions. However, our previous analysis of other redox gradients, including stratified hypersaline lakes, indicates that carbon oxidation state of biomolecules can vary even in systems where temperature changes are much smaller 18 . We claim that changes in oxidation state can be associated with one thermodynamic component of the system, and our goal in the present study is to explore the differences between this and one other component (represented by hydration state). Future work should also account for the effects of pH and temperature, which is possible using thermodynamic models for proteins 31 .

Is there an evolutionary or biosynthetic reason for choosing the QEC basis species?
The basis species used in this study for deriving the stoichiometric hydration state of proteins are glutamine, glutamic acid, cysteine, O 2 , and H 2 O (denoted QEC). The primary reason for choosing these basis species is to reduce the covariation between the metrics for oxidation and hydration state; that covariation is a mathematical consequence of projecting the elemental formulas of proteins into a particular compositional space, and may not reflect meaningful differences of chemical composition. There is nothing implied by the choice of basis species about evolutionary or biosynthetic mechanisms, and any set of basis species is thermodynamically valid, as long as they are the minimum number needed to represent the chemical composition of all the species in the system 28 . However, it is most convenient to select basis species that correspond to the controlling variables of the system. The QEC basis species have a biological rationale, since glutamine and glutamic acid are often identified as highly abundant metabolites and have been characterized as "nodal point" metabolites 32 . Other considerations are described in the first part of the Results.
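To make the idea of a theoretical formation reaction concrete, the sketch below balances an arbitrary formula C_c H_h N_n O_o S_s with the five QEC basis species. It is only an illustration of the mass balance, solved here by substitution; the function name qec_coefficients is hypothetical, and the residual correction applied to n H 2 O in the Methods is not included.

```python
from fractions import Fraction


def qec_coefficients(c, h, n, o, s):
    """Coefficients of the QEC basis species that balance C_c H_h N_n O_o S_s.

    Basis species and formulas: glutamine C5H10N2O3, glutamic acid C5H9NO4,
    cysteine C3H7NO2S, H2O, O2. The five element balances are solved by
    substitution: S fixes cysteine, C and N fix Gln and Glu, then H and O
    fix H2O and O2.
    """
    n_cys = Fraction(s)                                           # S balance
    n_gln_plus_glu = Fraction(c - 3 * s, 5)                       # C balance
    n_gln = n - n_cys - n_gln_plus_glu                            # N balance
    n_glu = n_gln_plus_glu - n_gln
    n_h2o = (h - 10 * n_gln - 9 * n_glu - 7 * n_cys) / 2          # H balance
    n_o2 = (o - 3 * n_gln - 4 * n_glu - 2 * n_cys - n_h2o) / 2    # O balance
    return {"Gln": n_gln, "Glu": n_glu, "Cys": n_cys, "H2O": n_h2o, "O2": n_o2}


# Glycine, C2H5NO2, as a small worked case
for species, coefficient in qec_coefficients(c=2, h=5, n=1, o=2, s=0).items():
    print(species, coefficient)
```

Negative coefficients are expected and simply mean that the basis species appears on the product side of the theoretical reaction; they carry no mechanistic meaning.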

Choice of basis species
In this study we are concerned with the chemical formulas of primary amino acid sequences, not structural H 2 O molecules or the "hydration shell" of proteins. We aim to find a projection of the
There are no thermodynamic restrictions on the choice of basis species, but a biologically meaningful set is likely to comprise metabolites that have high network connectivity, i.e. are involved in reactions with many other metabolites. Reactions involving glutamine and glutamic acid, or its ionized form, glutamate, are major steps of nitrogen metabolism 33,34 . Either methionine or cysteine would provide the required sulfur for the system, but cysteine is relevant as a constituent of the glutathione molecule, which has important roles in cellular redox chemistry 32 .
Table 1.

Comparison of redox and salinity gradients
To search for the hypothesized dehydration signal in metagenomic data, we began with redox gradients as a negative control. Submarine hydrothermal vents are zones of complex interactions between reduced endmember fluids and relatively oxidized seawater 36,37 . Terrestrial hydrothermal systems, such as the hot springs in Yellowstone National Park, USA, provide a source of reduced fluids that are oxidized by degassing and mixing with air and surface groundwater as well as biological activity including sulfide oxidation 38 . Redox gradients can also develop at shorter length scales. The surface of the Guerrero Negro microbial mat (Baja California Sur, Mexico) is exposed to ~1 m deep hypersaline water that has ca. 200 μM dissolved oxygen, but in the mat, oxygen rises during the daytime and is depleted within a few millimeters, giving way to anoxic, then sulfidic conditions 39 .
Based on metagenomic data for these redox gradients 4, 40, 42-44 , Figure 1 shows that the carbon oxidation state of proteins increases dramatically in the outflow channel of Bison Pool (Fig. 1A) and between fluids from diffuse hydrothermal vents and relatively oxidizing seawater (Fig. 1B). It is noteworthy that intact polar lipids extracted from the microbial communities of Bison Pool and other alkaline hot springs also exhibit downstream increases in carbon oxidation state 41 . The Z C of proteins increases more subtly toward the surface in the upper few mm of the Guerrero Negro microbial mat; it also increases at greater depths, perhaps due to heterotrophic degradation and/or horizontal gene transfer 18 (Fig. 1C). Furthermore, an evolutionary trajectory associated with the occurrence of different homologs of nitrogenase (Nif) in anaerobic and aerobic organisms is characterized by increasing Z C of the proteomes of these organisms 20 (Fig. 1D).
With the exception of Guerrero Negro, these datasets exhibit larger changes in carbon oxidation state than stoichiometric hydration state (Fig. 2A). This is an expected outcome, as the redox gradients considered here do not have large changes in salinity. Concentrations of Cl-, a conservative ion, increase by less than 10% (6.1 to 6.6 mM) in the outflow of Bison Pool due to evaporation 40 .
Turning our attention to a salinity gradient, the Baltic Sea exhibits a freshwater to marine transition over 1800 km, but dissolved oxygen at the surface is at or near saturation with air 45 , so this transect does not represent a redox gradient. For protein sequences derived from metagenomes in the 0.1-0.8 μm size fraction, we find large changes in stoichiometric hydration state along the Baltic Sea transect, but relatively small differences in the carbon oxidation state (Fig. 2B). This effect holds for samples from both the surface and chlorophyll a maximum (9-30 m deep).

Multifactorial hydration effects
Metagenomic and metatranscriptomic data for different particle sizes are available for the Baltic Sea.
The 0.1-0.8 μm and 0.8-3.0 μm size fractions represent free-living bacteria, while the 3.0-200 μm fraction contains particle-associated bacteria with larger average genome sizes and greater inferred metabolic and regulatory capacity 45 . Fig. 3 shows that proteins derived from metagenomes of larger particles have lower n H 2 O than those from the smallest size fraction. A plausible physical explanation is that the interiors of larger particles are sequestered to some extent from the surrounding aqueous environment, but evolutionary trends between unicellular and multicellular organisms could be another factor (see Discussion).
The Guerrero Negro microbial mat offers another opportunity to compare exposed and interior environments. Unlike Z C , which has a V-shaped pattern (Fig. 1C), n H 2 O decreases throughout the mat, but the changes are most pronounced in the upper few millimeters (Fig. 2A). We speculate that this

Compositional trends in river and hypersaline environments
The Amazon River and ocean plume provide another example of a freshwater to marine transition, with salinities that range from below the scale of practical salinity units (PSU) in the river to 23-36 PSU in the plume 48,49 . We used published metagenomic and metatranscriptomic data for filtered samples classified as free-living (0.2 to 2.0 μm) and particle-associated (2.0 to 156 μm) 48,49 . River samples form a tight cluster on a plot of stoichiometric hydration state against carbon oxidation state of proteins; plume samples in the free-living size fraction are dispersed to lower Z C whereas the particle-associated fraction shows very low values of n H 2 O (Fig. 4A). Between the river and plume metatranscriptomes, n H 2 O decreases noticeably but there is little overall difference in carbon oxidation state (Fig. 4B). The metatranscriptomes of particle-associated samples also have generally lower n H 2 O than the free-living samples.
higher Z C (Fig. 4C). To interpret these results, we considered other factors that are known to influence the amino acid compositions of proteins in halophiles.

Comparisons with hydropathy and isoelectric point
"Salt-in" halophilic organisms have proteins with relatively low isoelectric point that remain soluble in high salt concentration 52 . Notably, proteins with a lower pI also tend to have relatively high Z C due to higher abundances of aspartic acid and glutamic acid, which are relatively oxidized (see refs. 11 and 55 and Fig. 6 in Methods). Consequently, the lower pI characteristic of "salt-in" organisms is also associated with an increase of carbon oxidation state (Fig. 4C). Because of the dominance of the pI shift, the increase of Z C in this case is not an indicator of an environmental redox gradient.
Hydrophobic amino acids have high values on the hydropathicity scale (GRAVY) as well as relatively low values of Z C 55 . Therefore, values of GRAVY and Z C for proteins are negatively correlated, but there is very little correlation between GRAVY and n H 2 O (Supplementary Figure S1). Marine metagenomes have a lower GRAVY than most of the freshwater samples, and hypersaline metagenomes are shifted to both lower GRAVY and pI (Fig. 4F). However, there are irregular trends in the Amazon River data.
Compared to the river, the plume metagenomes exhibit lower GRAVY and either higher or lower pI (Fig. 4D). Similarly, other authors have reported that although lower pI is a signature of many hypersaline environments, it does not distinguish marine from low-salinity environments 56 . On the other hand, the plume metatranscriptomes do show decreased pI but no discernible trend in GRAVY (Fig. 4E).
Considering all the variables and datasets shown in Fig. 4, only n H 2 O exhibits a consistent trend, toward lower values, in marine compared to freshwater samples. However, the decrease of stoichiometric hydration state between freshwater and marine environments does not continue into hypersaline environments.

Compositional analysis of differentially expressed proteins
To probe the physiological basis for the patterns found in metagenomes and metatranscriptomes, we considered the changes in transcripts and protein expression levels when cells are exposed to hyperosmotic conditions in the laboratory. Differentially expressed proteins reported in 13 studies on non-halophilic bacterial and eukaryotic cells were summarized in a previous publication 17 . We augmented that compilation with data from an additional seven studies on bacteria and archaea 57-63 , including four halophiles (Table 2). The reported significantly differentially expressed genes or proteins were mapped to UniProt accession numbers 66 ; proteins with unavailable UniProt IDs, those with ambiguous expression (appearing in both the down- and up-regulated groups), and duplicates were removed.
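The filtering rules described above (drop unmapped entries, duplicates, and proteins reported as both down- and up-regulated) can be sketched as follows. This is an illustrative reimplementation, not the code used for the paper; the function name clean_expression_lists and the accession strings are made up for the example.

```python
def clean_expression_lists(down_ids, up_ids):
    """Return (down, up) ID lists without unmapped, duplicate, or ambiguous entries.

    down_ids, up_ids: iterables of UniProt accessions, with None or "" standing
    for proteins that could not be mapped (an assumption of this sketch).
    """
    def unique_mapped(ids):
        seen, kept = set(), []
        for acc in ids:
            if acc and acc not in seen:   # drop unmapped entries and duplicates
                seen.add(acc)
                kept.append(acc)
        return kept

    down, up = unique_mapped(down_ids), unique_mapped(up_ids)
    ambiguous = set(down) & set(up)       # listed as both down- and up-regulated
    down = [acc for acc in down if acc not in ambiguous]
    up = [acc for acc in up if acc not in ambiguous]
    return down, up


# Made-up accessions used only to exercise the function
down, up = clean_expression_lists(
    ["P00001", "P00002", None, "P00002", "P00003"],
    ["P00003", "P00004", "", "P00004"],
)
print(down)  # ['P00001', 'P00002']
print(up)    # ['P00004']
```

Order of first appearance is preserved so that the cleaned lists remain traceable to the source tables.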
The combined data are plotted in Fig. 5A
Figure 5. (A) ... Table 2). The dashed outline indicates the 50% credible region for highest probability density for proteomics data for Bacteria and Archaea under hyperosmotic conditions (code adapted from the HPDregionplot function in R package emdbook 64 with kernel density estimates that were computed with function kde2d in R package MASS 65 ). (B) Mean differences of GRAVY and pI for the same datasets.
the hyperosmotic stress experiments, including microbial proteomes and transcriptomes and eukaryotic proteomes. Notably, the datasets cluster near a difference in n H 2 O of ca. -0.02, which is similar to the differences between freshwater and marine metagenomes and metatranscriptomes described above. An opposite pattern emerges for increasing salinity from hypoosmotic conditions to optimal growth salinity for some halophilic organisms, indicated by large squares in Fig. 5 closer to the general pattern for hyperosmotic stress experiments in non-halophilic organisms.
The mean value of GRAVY increases for differentially expressed proteins in most hyperosmotic stress experiments (Fig. 5B). This trend is opposite that found for halophilic adaptation 5 and the metagenomic comparisons described above. We conclude that n H 2 O is a more generally applicable metric, since it records decreasing hydration state along salinity gradients in the Baltic Sea and Amazon River, between other freshwater and marine metagenomes, and in protein expression under hyperosmotic stress.

Discussion
Based on mass-action effects in thermodynamics, we predicted that carbon oxidation state of proteins (Z C ) should increase toward more oxidizing conditions and that stoichiometric hydration state (n H 2 O ) should decrease toward higher salinity. We found that proteins inferred from metagenomes in redox gradients associated with submarine hydrothermal systems and a Yellowstone hot spring exhibit large changes of Z C , whereas regional salinity gradients, including the Baltic Sea freshwater-marine transect hyperosmotic conditions. We speculate that the net uptake or release of water from proteomes in response to changing growth conditions could affect cell metabolism, perhaps contributing to the effects of metabolically produced water on microbial growth at low water activities 67 . Similarly, cancer tissues are characterized by up-regulation of unicellular genes 69 as well as proteins with higher n H 2 O compared to normal tissue 17 . More work is needed to confirm whether hydration state decreases through the evolution of multicellularity, as suggested by these preliminary observations.
Another important evolutionary transition is the emergence of heterotrophic metabolism, which is considered to be a later innovation than autotrophic core metabolism 13,33 (Fig. 2A). If decreasing stoichiometric hydration state is a common theme across evolutionary transitions, then the relatively high n H 2 O in the proteomes of organisms carrying the ancestral nitrogenase Nif-D 20 (Fig. 2A) is not unexpected. The findings of this study underscore an opportunity for integrating hydration state into evolutionary models that already consider changes in oxidation state or oxygen content of proteins 14,20 .

Prediction of protein sequences
Protein sequences were predicted from metagenomic reads using a previously described workflow 18 .
Briefly, reads were trimmed, filtered, and dereplicated using scripts adapted from the MG-RAST pipeline 70 . For metatranscriptomic datasets, ribosomal RNA sequences were removed using SortMeRNA 71 . Protein-coding sequences were identified using FragGeneScan 72 , and the amino acid sequences of the predicted proteins were used in further calculations. For large datasets, only a portion of the available reads was processed (at least 500,000 reads; see Supplementary Table S1). This reduces the computational requirements without noticeably affecting the calculated average compositions 18 .

Average oxidation state and stoichiometric hydration state
(Fig. 6A), but a relatively weak correlation between Z C and n O 2 (Fig. 6B). The QEC basis provides a much stronger association between Z C and n O 2 (which can both be interpreted as metrics for oxidation state) and also greatly reduces the coupling between Z C and n H 2 O (Fig. 6C-D). However, there is a small remaining negative correlation for amino acids (Fig. 6D), which is also visible in whole-proteome data for humans and E. coli (Fig. 6E-F). We derived values of n H 2 O by taking the residuals of the linear model for amino acids (Fig. 6D), then subtracting a constant so that the mean for all human proteins equals zero. This derivation, which we refer to as "rQEC", gives the residual-corrected stoichiometric hydration state for each amino acid, which is plotted in Fig. 6G and listed in Table 1. Even with the residual correction for amino acids, there remain slightly positive and negative correlations for human and E. coli proteins (Fig. 6H-I). As noted above, the mean n H 2 O for human proteins was defined to be zero; the mean for proteins in E. coli is somewhat greater, at 0.014.
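The residual correction can be illustrated with a short sketch. It assumes that the linear model in question regresses the QEC-derived n H 2 O of the 20 standard amino acids on their Z C by ordinary least squares; the function names are hypothetical, and the constant that centers human proteins at zero is left as a user-supplied parameter (shift) because it depends on proteome data not reproduced here.

```python
# Chemical formulas (C, H, N, O, S) of the 20 standard amino acids
AA_FORMULAS = {
    "A": (3, 7, 1, 2, 0), "R": (6, 14, 4, 2, 0), "N": (4, 8, 2, 3, 0),
    "D": (4, 7, 1, 4, 0), "C": (3, 7, 1, 2, 1), "Q": (5, 10, 2, 3, 0),
    "E": (5, 9, 1, 4, 0), "G": (2, 5, 1, 2, 0), "H": (6, 9, 3, 2, 0),
    "I": (6, 13, 1, 2, 0), "L": (6, 13, 1, 2, 0), "K": (6, 14, 2, 2, 0),
    "M": (5, 11, 1, 2, 1), "F": (9, 11, 1, 2, 0), "P": (5, 9, 1, 2, 0),
    "S": (3, 7, 1, 3, 0), "T": (4, 9, 1, 3, 0), "W": (11, 12, 2, 2, 0),
    "Y": (9, 11, 1, 3, 0), "V": (5, 11, 1, 2, 0),
}


def zc(c, h, n, o, s):
    """Average oxidation state of carbon for a neutral formula."""
    return (-h + 3 * n + 2 * o + 2 * s) / c


def qec_nh2o(c, h, n, o, s):
    """Number of H2O in the formation reaction from the QEC basis species."""
    n_gln_plus_glu = (c - 3 * s) / 5          # C balance (cysteine carries S)
    n_gln = n - s - n_gln_plus_glu            # N balance
    n_glu = n_gln_plus_glu - n_gln
    return (h - 10 * n_gln - 9 * n_glu - 7 * s) / 2   # H balance


def residual_corrected_nh2o(shift=0.0):
    """Residuals of the least-squares line n_H2O ~ Z_C, minus a constant shift."""
    aas = sorted(AA_FORMULAS)
    x = [zc(*AA_FORMULAS[a]) for a in aas]
    y = [qec_nh2o(*AA_FORMULAS[a]) for a in aas]
    x_mean, y_mean = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = {a: yi - (intercept + slope * xi) - shift
                 for a, xi, yi in zip(aas, x, y)}
    return residuals, slope


residuals, slope = residual_corrected_nh2o()
print("slope of n_H2O vs Z_C:", round(slope, 4))
for aa in "GEKW":
    print(aa, round(residuals[aa], 4))
```

By construction the residuals are uncorrelated with Z C across the 20 amino acids, which is the property the correction is meant to achieve; values produced here are not a substitute for those listed in Table 1.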

Compositional metrics for proteins and metagenomes
The stoichiometric hydration state of proteins was calculated by multiplying the values from Table 1 by the frequencies of the amino acids and dividing the sum by the number of amino acids. The average oxidation state of carbon was also calculated from the amino acid values (see Table 1).
Means and standard deviations of Z C and n H 2 O were calculated for 100 random subsamples of protein sequences from each metagenomic or metatranscriptomic dataset. The numbers of sequences included in the subsamples were chosen to give a total length closest to 50,000 amino acids on average.
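A minimal sketch of these calculations is given below. Several details are assumptions made for illustration only: Z C is computed as a carbon-weighted average over residues (the water lost in polymerization does not change Z C), the per-amino-acid n H 2 O values must be supplied by the user (for example from Table 1), each subsample is pooled before the metric is computed, and sequences are drawn without replacement within a subsample. All function names are hypothetical.

```python
import random
import statistics

# Chemical formulas (C, H, N, O, S) of the 20 standard amino acids
AA_FORMULAS = {
    "A": (3, 7, 1, 2, 0), "R": (6, 14, 4, 2, 0), "N": (4, 8, 2, 3, 0),
    "D": (4, 7, 1, 4, 0), "C": (3, 7, 1, 2, 1), "Q": (5, 10, 2, 3, 0),
    "E": (5, 9, 1, 4, 0), "G": (2, 5, 1, 2, 0), "H": (6, 9, 3, 2, 0),
    "I": (6, 13, 1, 2, 0), "L": (6, 13, 1, 2, 0), "K": (6, 14, 2, 2, 0),
    "M": (5, 11, 1, 2, 1), "F": (9, 11, 1, 2, 0), "P": (5, 9, 1, 2, 0),
    "S": (3, 7, 1, 3, 0), "T": (4, 9, 1, 3, 0), "W": (11, 12, 2, 2, 0),
    "Y": (9, 11, 1, 3, 0), "V": (5, 11, 1, 2, 0),
}


def protein_zc(seq):
    """Per-carbon Z C of a sequence (carbon-weighted mean over its residues)."""
    numerator, carbon = 0, 0
    for aa in seq:
        c, h, n, o, s = AA_FORMULAS[aa]
        numerator += -h + 3 * n + 2 * o + 2 * s   # equals c * Z_C of this amino acid
        carbon += c
    return numerator / carbon


def protein_nh2o(seq, nh2o_values):
    """Per-residue n H2O: composition-weighted mean of per-amino-acid values."""
    return sum(nh2o_values[aa] for aa in seq) / len(seq)


def subsample_stats(sequences, metric, n_subsamples=100,
                    target_length=50_000, seed=0):
    """Mean and standard deviation of a metric over random subsamples."""
    rng = random.Random(seed)
    mean_length = sum(len(s) for s in sequences) / len(sequences)
    # Number of sequences whose expected total length is closest to the target
    k = max(1, min(len(sequences), round(target_length / mean_length)))
    values = [metric("".join(rng.sample(sequences, k)))
              for _ in range(n_subsamples)]
    return statistics.mean(values), statistics.stdev(values)


# Toy input: short made-up sequences and a reduced target length
toy = ["GAVL", "DEKR", "MSTQ", "FWYH", "GGAA", "LLVV"]
print(subsample_stats(toy, protein_zc, target_length=12))
```

For n H 2 O, the same helper can be called with metric=lambda s: protein_nh2o(s, values), where values is a dictionary of per-amino-acid numbers supplied by the user.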

Amino acid composition of proteomes of Nif-bearing organisms
Amino acid compositions of all proteins for each bacterial, archaeal, and viral taxon in the NCBI Reference Sequence (RefSeq) database 74  less than those identified in ref. 20. Note that values of Z C calculated here (Fig. 1D) are lower than those shown in Fig. 5 of ref. 20. This difference is associated with the weighting by carbon number (described above), which was not done in ref. 20.

GRAVY and pI
The grand average of hydropathicity (GRAVY) was calculated using published hydropathy values for amino acids 75 . The isoelectric point was calculated using published pK values for terminal groups 76  pI were calculated for subsampled sequences as described above.
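For orientation, GRAVY is simply the mean hydropathy of the residues in a sequence. The sketch below assumes the widely used Kyte-Doolittle scale; the text only states that published hydropathy values were used, so the choice of scale here is an assumption, as is the function name gravy. The isoelectric point is not sketched because it depends on the particular set of pK values.

```python
# Kyte-Doolittle hydropathy values (assumed scale; see the note above)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


# A hydrophobic and an acidic toy peptide (made-up sequences)
print(round(gravy("LIVFA"), 3))
print(round(gravy("DEDEK"), 3))
```

Lower GRAVY values indicate less hydrophobic proteins, which is the direction reported for halophilic adaptation in the text.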

Data Availability
All metagenomic and metatranscriptomic data analyzed here were obtained from public databases using the accession numbers listed in Supplementary Table S1  code used to make all the figures in this paper; see the "gradH2O" vignette in the package.