Local and global chemical shaping of bacterial communities by redox potential

Thermodynamics predicts a positive correlation between environmental redox potential and oxidation state of molecules; if found for microbial communities it would imply a new kind of deterministic eco-evolutionary process. This study examines evidence for local- and global-scale correlations between oxidation-reduction potential (ORP or Eh) in environmental samples and carbon oxidation state (ZC) of estimated bacterial and archaeal community proteomes. Seventy-nine public datasets for seven environment types (river & seawater, lake & pond, alkaline spring, hot spring, groundwater, sediment, and soil) were analyzed. Taxonomic abundances inferred from high-throughput 16S rRNA gene sequences were combined with NCBI Reference Sequence (RefSeq) proteomes to estimate the amino acid compositions and chemical formulas (CcHhNnOoSs) of community proteomes, which yield ZC. Alkaline hot springs have the lowest ZC for both bacterial and archaeal domains of any environment. Positive global correlations between redox potential and ZC are found for bacterial communities in lake & pond, groundwater, and soil environments, but not archaeal communities, suggesting a broad ecological signal of chemical shaping in Bacteria.

and has important roles for assessing bioremediation Zecchin et al., 2021) and soil disinfestation (Poret-Peterson et al., 2020) strategies. Although measurements in natural systems generally reflect "mixed potentials" that have no clear thermodynamic interpretation (Bricker, 1982;Tan et al., 2017), the appearance of Eh in numerous microbial ecology studies calls for a better theoretical development of the ecological relevance of this parameter. The primary advantage of Eh for a global-scale assessment of redox effects in microbial communities is that it can be measured over extreme ranges of redox conditions. In contrast, oxygen and hydrogen reach unmeasurable concentrations in anoxic and oxic environments, respectively, which makes either one of them alone less useful as a universal scale of redox conditions.
Redox potential is often one of a number of physicochemical variables that is associated with differences in microbial community composition. Although multivariate techniques can be used to establish correlations between microbial community structure and physicochemical variables, these methods do not have a foundation in chemical concepts, and therefore remain correlative in nature (Fig. 1). Perhaps a hybrid approach that combines elements of biology and chemistry and is informed by the tradition of thermodynamic modeling in geochemistry could shed light on the ecoevolutionary implications of redox potential.
We propose the term "chemical shaping" for the hypothesis that the elemental composition of microbial communities is linked to environmental redox conditions. Testing this hypothesis requires elemental compositions, which are available for selected microbial species grown in the laboratory (see Popovic et al., 2021), but not for microbial communities in natural environments. To overcome this limitation, it can be recognized that gene sequences carry chemical information in the elemental compositions of the proteins they code for (Dick and Tan, 2021). When microbial abundances inferred from 16S rRNA gene sequences are coupled with reference proteomes of organisms with sequenced genomes, this permits an estimate of the chemical formula of the community proteome, represented as C c H h N n O o S s . A basic chemical metric, oxidation state of carbon (Z C ), can be computed for proteins as it can for any organic molecule; for the primary sequence of proteins the equation is (Dick et al., 2020) Because Z C is a linear combination of elemental ratios, it is a dimensionless quantity, but it can be conceived as a molar electron / carbon ratio, or e -/ C.
The thermodynamic notion of relative stability allows predicting the direction of irreversible reactions, making it an extremely valuable tool in geochemistry. Completely analogous to how the presence of oxygen increases the relative stability of oxidized minerals compared to more reduced ones, the chemical shaping hypothesis predicts that the redox Data sources and Eh corrected to pH 7 In general, datasets were considered for inclusion in the analysis if demultiplexed 16S rRNA sequence data could be found on the Sequence Read Archive (SRA) (Leinonen et al., 2011) or MG-RAST (Meyer et al., 2008) and data were available for at least 5 biological samples with an associated Eh range of at least 100 mV before correction to pH 7. With some exceptions, studies were included where Eh values for individual samples, rather than for means of sample groups, were available. Brief descriptions, BioProject or MG-RAST accession numbers, and references for the datasets are listed in Table 1.
Values of redox potential, assumed to be stated with respect to the standard hydrogen electrode (SHE), were taken from the original publications as detailed in Table S1. Where needed, values were extracted from plots using the program g3data (Frantz and Novak, 2013). Values of temperature and pH were also compiled if available; otherwise their values were assumed to be 25°C and pH 7 for the conversion to Eh7. Occasionally, redox potential values were obtained from different publications than those that describe the sequencing data; see Table 1 for references and Table S1 for additional details. Textual descriptions of sample type (e.g. water or sediment), location (e.g. geographic location or depth in water or sediment columns), sampling time (e.g. date or season), or treatment group for soil and sediment mesocosms were also recorded as applicable to each study.
Because of the dependence of electrical potential on pH -known as the Nernst slope or Nernst factor -reported Eh values were corrected to values at pH 7 with the equation (Husson et al., 2016) where Eh7 denotes the corrected value and dEh/dpH is the theoretical slope, which is equal to -59.16 mV/pH unit at 25°C . Values of the Nernst slope at other temperatures were calculated from -2.303RT/F, where R, T, and F are the gas constant, temperature in Kelvin, and Faraday constant.
Sequence data processing and chemical metrics In most cases, one sequencing run per sample was used. The biological material was DNA except for the Great Salt Lake sediment study (Boyd et al., 2017), which used RNA reverse transcribed to cDNA. For Bison Pool Hot Spring in Yellowstone National Park (Swingley et al., 2012), 16S rRNA sequence data were obtained from the authors of that study. The dataset of Power et al. (2018) includes 16S rRNA sequence data for 925 water samples of hot springs in New Zealand; for this study we used data for randomly selected 42 acidic samples (pH < 3) and 39 circumneutral to alkaline Table 1 Datasets analyzed in this study. The name of each dataset is followed by the BioProject or MG-RAST accession number(s) and one or two literature references; in most cases the second reference is for redox potential data. "NA" indicates data that were not obtained from SRA or MG-RAST (see text for details).   Dykes et al., 2021 andLeinonen et al., 2011) samples (pH > 6). A multi-year dataset with 256 samples is available for sediments in the Bay of Biscay (Lanzén et al., 2021); for this study, we used data for 55 samples collected in 2017.

River & Seawater
Sequence data were processed with a pipeline consisting of removal of adapter sequences, merging of forward and reverse reads where applicable, quality and length filtering, singleton removal, subsampling to 10000 sequences for each sample, reference-based chimera detection, and taxonomic classification. Length filtering was specified as a minimum read length of 200 and maximum length of 600 for 454 pyrosequencing datasets (--fastq minlen and --fastq maxlen options of vsearch), or as a truncation length for Illumina datasets (--fastq trunclen option), which was manually selected for each dataset as approximately the highest round-number value that could be used while retaining the majority of reads. Specific settings used for each dataset are given in Table S2. After length, quality, and singleton filtering, runs were subsampled at a depth of 10000 sequences so that the chimera detection and taxonomic classification steps could be performed in a reasonable amount of time. The pipeline depends on the following software and databases: fastq-dump from the NCBI SRA Toolkit with the --clip option to remove adapter sequences (has effect only for 454 datasets), vsearch (Rognes et al., 2016) for merging, singleton and chimera detection, seqtk (Li, 2020) for extracting non-singletons, SILVA SSURef NR99 database version 138.1 (Quast et al., 2012) for chimera detection, and RDP Classifier version 2.13 (Cole et al., 2014) for taxonomic classification with the default confidence threshold of 80%. After classification using the -h option of the RDP Classifier to produce counts of reads assigned to each taxon, the results for all samples in a dataset were merged using the merge-count command of the RDP Classifier.
The -h option of the RDP Classifier produces a hierarchical file where sequences are assigned at multiple taxonomic levels. For subsequent processing, assignments at the root or domain level or to Chloroplast or Eukaryota were omitted, and the lowest assigned taxonomic level (from genus to phylum) for each remaining sequence was used for mapping.
Taxonomic classifications in the RDP output were mapped to the NCBI taxonomy represented in RefSeq by automatic name matching or by custom mappings for particular taxa; see Dick and Tan (2021)  To calculate Z C values, taxonomic abundances in bacterial and archaeal domains were considered separately. For sequencing runs for libraries generated using domain-specific bacterial or archaeal primers, custom R functions were used to retain only the bacterial or archaeal taxonomic assignments, respectively, in the output of the RDP Classifier.
The use of domain-specific primers is indicated by the "Domain" column in the metadata files generated here. If studies used universal primers (often 515F/806R), R functions were used to extract the RDP assignments within each domain.
Samples with less than 50 sequences with lowest-level taxonomic assignments from genus-to phylum-level for either Bacteria or Archaea were discarded from further analysis for that domain. According to these criteria, all datasets but one  have bacterial sequences, and many datasets have both bacterial and archaeal sequences (see Table   S1). Within each domain, the abundances of all mappable lowest-level taxonomic assignments were multiplied by Z C of the reference proteomes, summed, and divided by total abundance to estimate Z C of the bacterial or archaeal community.

Computer code for data analysis and visualization
The code and output files for the steps described above are available in the JMDplots package on GitHub (https:// github.com/jedick/JMDplots). The sequence processing pipeline is implemented in the script chem16S/pipeline.R.
Metadata including accession numbers, sample ID and description, Eh, T and/or pH are saved in .csv files for each study in the orp16S/metadata directory. The .tab files produced by the RDP merge-count command are saved in the orp16S/RDP directory. The amino acid compositions of reference proteomes, their taxonomic names and ranks, and scripts used to generate these files from the the RefSeq database are in the refseq directory (see Dick and Tan, 2021 for details). The getmdat() function retrieves metadata for a dataset using a bibliographic key and implements the Eh7 correction, the getRDP() function implements the selection of assignment counts at lowest taxonomic levels and includes a "lineage" argument to select e.g. only Bacteria or Archaea, the getmap() function maps RDP taxonomic names to NCBI names, and the getmetrics() function implements the calculation of Z C for the estimated community The function orp16S S1() in the JMDplots package is used to make the plots for each dataset in Figure S1 and outputs .csv files for sample data and linear regression coefficients that correspond to Datasets S1 and S2. These datasets are used by other functions in the package to make each figure (e.g. orp16S4() for Fig. 4), and the vignette orp16S.Rmd runs each function to reproduce the figures in this paper. Studies are identified in the package, supplementary material, and plot titles of Fig. 4 by bibliographic reference keys, which are defined in the BibTeX file in Dataset S3.

Winogradsky columns
To establish the feasibility of these methods, we analyzed data for Winogradsky columns, which are well known as teaching and research tools to illuminate basic concepts of microbiology by monitoring the biological and chemical changes in enriched sediments and overlying water in a controlled laboratory setting (Rundell et al., 2014). Winogradsky columns are also candidates for "model systems" for microbial communities that may facilitate transferability of results between studies (Widder et al., 2016). As far as we are aware, there are no published studies that provide 16S rRNA data and ORP and pH measurements along depth profiles in the same Winogradsky columns, so we have used data from different sources with the aim of uncovering general features. Therefore, the following analysis is based on the direction of change of redox potential and estimated community Z C with depth but not their specific values, which likely differ depending on source material and laboratory parameters. values corrected to pH 7, denoted as Eh7, are shifted to lower values than the reported redox potential, but still show a decreasing trend with depth. Mature Winogradsky columns prepared using sediment from non-acidic ponds also exhibit decreases of redox potential with depth in the sediment (Pibernat and Abellà, 1993;Pagaling et al., 2014), and more reducing conditions in sediment compared to overlying water (see Figure 5 of Widder et al., 2016 and Figure S1A of Pagaling et al., 2017). It follows that a decrease of redox potential with depth is a general feature of mature Winogradsky columns.
According to the chemical shaping hypothesis, the elemental composition of microbial communities is predicted to become more reduced with depth in mature Winogradsky columns. Data supporting this prediction are shown in the columns. It can also be noted by observing the relative lengths of the whiskers in Fig. 2B that the sediment-water interface (SWI) is the layer with the greatest variability of Z C . This is concordant with the relatively low similarity of communities in replicate samples for the SWI as revealed by UNIFRAC distances (Rundell et al., 2014) and further emphasizes that Z C is a chemical representation of the taxonomic composition of communities.
The decrease of estimated carbon oxidation state with depth in Winogradsky columns supports the thermodynamic prediction that the communities are chemically shaped by redox potential. The rest of this study is devoted to analyzing communities in a wide variety of environments in order to assess the broader applicability and limitations of this prediction.

Analysis of local-scale data for different environment types
We searched the literature for studies that report both 16S rRNA and redox potential (ORP or Eh) data. Several types of laboratory-based studies were excluded from this search, so that most of the included studies are field-based, but mesocosms and laboratory experiments on soils and some sediments were included.
The locations of study sites are shown in Fig. 3. For each study, a plot of Eh7 and bacterial Z C was made; if archaeal sequences were available, a second plot for archaeal Z C was made. Figure S1 has scatterplots with linear regressions for all datasets, and selected datasets are shown in Fig. 4 (Luan et al., 2020), hot springs in the Southern Tibetan Plateau , water samples from the Three Gorges Reservoir (Gao et al., 2021), and wells in the Mahomet Aquifer, Illinois, USA (Yamamoto et al., 2019). Studies used field-collected samples except for those indicated by square symbols, which are for laboratory experiments or mesocosms. The map is drawn with the Winkel Tripel projection and was created using the R package oce and its included coastlineWorld dataset, together with shapefiles for the North American Great Lakes obtained from USGS (2010). Geographic coordinates of sampling sites were taken from publications or from the BioSample metadata on NCBI; see the source code of the orp16S3() function in the JMDplots package for details.
absolute value of slope greater than or less than 0.01 V -1 , respectively. In the following paragraphs, datasets are referred to as having a positive (slope > 0.01 V -1 ) or negative (slope < -0.01 V -1 ) correlation, or no correlation (|slope| < 0.01 V -1 ). The linear regressions do not capture complex patterns in the data, but the sign of the slope enables straightforward testing of the prediction that the correlation between Eh7 and Z C is positive.
The following paragraphs describe the results in more detail for bacterial communities. The plots in Figure S1 reveal that many archaeal communities show different trends, but for conciseness, the data for Archaea will only be described in the section on global-scale analysis.

River & Seawater and Lake & Pond
In multiple datasets for water bodies, spatial variation in redox potential is associated with corresponding changes in Z C of the bacterial communities. Several stratified water columns, including the PHOX2 station in the western Black Sea (Sollai et al., 2019), the Sansha Yongle Blue Hole in the South China Sea (He et al., 2020), and Ursu Lake in Romania (Baricz et al., 2021) were previously found to exhibit decreasing Z C of estimated community proteomes with depth (Dick and Tan, 2021). In the present study, we find that Z C for bacterial communities exhibits a positive correlation with Eh7 calculated from redox potential measurements taken from van Vliet et al. (2021)  . The negative correlation at Peña de Hierro speculatively could be associated with the unusual geochemistry of this lake compared to others in the IPB, particularly microoxic rather than anoxic conditions at depth (Grettenberger et al., 2020). The other datasets that were analyzed but not mentioned here, including several for time series during cyanobacterial bloom or ice formation, yield linear regressions with an absolute value of slope that is less than 0.01 V -1 .

Alkaline Spring and Groundwater
The Alkaline Spring datasets represent water samples from serpentinizing systems associated with weathering of ultramafic peridotites. Positive correlations between Eh7 and Z C are exhibited by samples from the Voltri Massif, Italy (Kamran et al., 2021), which include two spring sites with different ranges of redox potential and a more oxidizing creek sample, samples from wells with different borehole depths in the Coast Range Ophiolite Microbial Observatory (CROMO), California, USA (Sabuda et al., 2020) and the Samail Ophiolite, Oman (Rempfert et al., 2017) (Fig. 4), and depth-resolved water samples from the Samail Ophiolite collected using a packer system (Nothaft et al., 2021). In contrast, there is relatively little variation of bacterial community Z C over an Eh7 range of ca.  -Rangel et al., 2020). In the latter dataset, the canal has the lowest range of both measured redox potential and estimated bacterial Z C , which extends to lower values than the other datasets for groundwater and other environment types, except those for alkaline hot springs (Fig. 5A).
In contrast, there is a negative correlation between bacterial Z C and Eh7 for groundwater and surface water samples from the Golmud area in the Qaidam Basin, China (Guo et al., 2019), in the vicinity of a gold mine in Phetchabun and Pichit provinces in Thailand (Sonthiphand et al., 2021a) (Fig. 4), among shallow and deep groundwater samples in Lower Chao Praya Basin (Sonthiphand et al., 2021b), and in the Po Plain, Northern Italy (Zecchin et al., 2021). Notably, arsenic was identified as a major driver of community structure in several of these studies (Sonthiphand et al., 2021a(Sonthiphand et al., , 2019Zecchin et al., 2021). A novel hypothesis to emerge from the present analysis is that arsenic contamination may interfere with the redox-based chemical shaping of the bacterial communities.

Sediment and Soil
The 15 field-based studies of sediments are approximately evenly divided between those with positive and negative correlations. The datasets with positive correlations include some for marine (Jorgensen et al., 2012;Gallardo et al., 2016), coastal/estuarine (Lanzén et al., 2021) (Fig. 4), freshwater (Han et al., 2019;Cai et al., 2019;Zhang et al., 2021b), and hypersaline (Boyd et al., 2017) sediments. Several of these reflect depth profiles in sediment cores (Jorgensen et al., 2012;Gallardo et al., 2016;Zhou et al., 2017), whereas others include both sediment and water samples (Han et al., 2019), and one study investigated depth profiles in both the water column and sediment of the Jinchuan River in China .
The datasets with negative correlations represent marine (Sánchez-Soto Jiménez et al., 2018;Molina et al., 2021;Wu et al., 2021), coastal (Zhou et al., 2017), freshwater (Zárate et al., 2020;von Hoyningen-Huene et al., 2019), and hypersaline sediments (Vavourakis et al., 2019). The dataset with the most negative fitted slope is that for Daya Bay, China , where samples were collected from the surface, middle, and bottom of sediment cores. In addition, multiple lake and ocean datasets are characterized by higher Eh7 for overlying water than sediment (Han et al., 2019;von Hoyningen-Huene et al., 2019;Vavourakis et al., 2019;Molina et al., 2021); most of these exhibit negative correlations with Z C , except for Honghu Lake (Han et al., 2019), which shows a relatively high Z C for water. A systematic difference in redox potential between water and sediment is not evident for river datasets Zárate et al., 2020).
Two datasets from laboratory experiments with sediments show positive correlations between Eh7 and Z C : sediments and water from the Baltic Sea Swedish Coast subjected to incubation under oxic or anoxic conditions (Broman et al., 2017) and depth profiles in estuarine sediment mesocosms with different macroinvertebrate ecosystem engineers (Wyness et al., 2021).
Most of the soil datasets considered here (11 of 13) were obtained from laboratory or mesocosm studies. The absolute values of the Z C -Eh7 slope for two of these are less than 0.01 V -1 (Zhu et al., 2018;Xu et al., 2020), and the remainder show positive correlations between Eh7 and Z C . The experimental designs include biochar addition in soil remediation experiments (Song et al., 2017), acetate addition to urban wetland soil (Brigham et al., 2018), water management and cyclic drainage and imbibition (Pronk et al., 2020;Chuang et al., 2020), anaerobic or reductive soil disinfestation (Poret-Peterson et al., 2020;Li et al., 2021), inoculation with soil of an artificially petroleum hydrocarbon-contaminated sediment with different terminal electron acceptors , sampling of bulk soil and mature, elongation, and tip zones of rhizosphere of rice plants (Dai et al., 2020), and amendment of paddy soils with rice husk, charred husk, or calcium silicate (Dykes et al., 2021).
There are two field-based soil datasets. East Asia Paddy Soil sampled along a regional transect (Luan et al., 2020) has a slightly negative fitted slope between Eh7 and Z C . A large positive correlation is evident for samples of Cd- , which is 1000 times the slope against Eh7 (mV). The lower-left legends show the Pearson correlation coefficient (r) with P-value indicated by stars (*P < 10 -2 , **P < 10 -5 , ***P < 10 -20 ). The text is grayed out for environments with fewer than 5 datasets. (D) Z C -Eh7 plots and linear regressions for Bacteria and Archaea for all environment types except hot springs.

Global-scale analysis
Linear regressions between Eh7 and bacterial Z C for all datasets are compiled in Fig. 5A in separate panels for (1) River & Seawater, Lake & Pond, Alkaline Spring, Hot Spring and (2) Groundwater, Sediment, Soil. The slopes are positive for most datasets in the first group and also for most groundwater and soil datasets. This can be seen more easily in Fig. 5B, where the values of slope are plotted against the logarithm of number of samples in each dataset. The plot for the second group can tentatively be interpreted to show that larger datasets tend to be characterized by more positive slopes.
In Fig. 5C, data from all datasets in each environment type are combined and used to assess global correlations between Eh7 and Z C . For Bacteria, the global analysis for each environment type generates a positive slope, but only the Lake & Pond, Groundwater, and Soil groups are deemed significant according to an arbitrary P-value cutoff of < 0.01. In contrast to Bacteria, Archaea show a negative or no correlation for different environment types at a global scale.
Because of the lower number of datasets with archaeal sequences for each environment type, they may not represent true global patterns, and environment types with less than five available datasets are grayed out in Fig. 5C.
Combining data for all environment types except hot springs yields significant positive and negative correlations between Eh7 and Z C for Bacteria and Archaea, respectively (Fig. 5D). However, the data for Archaea are dominated by a few environment types (Lake & Pond, Sediment, and Soil); as will be seen below, adding data for hot springs increases the slope of the global linear fit for both Bacteria and Archaea.
Redox potential and other environmental factors in hot spring communities Hot spring and river samples in the Copahue Volcano-Río Agrio area in Argentina (Lopez Bedogni et al., 2020;Massello et al., 2020) yield a small negative slope of the linear regression between Eh7 and bacterial Z C (Fig. 6A). The other hot spring datasets have some of the highest slopes of linear fits between Eh7 and Z C for any environment, but they are also characterized by a wide range of fitted Z C between datasets. The environmental factors that contribute to this variation are explored below.
A dataset for water samples in the Dallol geothermal area, Kenya (Belilla et al., 2019), has the highest range of fitted Z C . This area is described as a polyextreme environment that is characterized by hypersaline conditions in addition to high heat and low pH. It is likely that high Z C arises not because of higher redox potential, but because of the greater number of acidic residues in proteomes of halophilic Archaea that represent adaptations to stabilize the three-dimensional structure of proteins (Dick et al., 2020). After the dataset for the Dallol geothermal area, those for other acidic springs and sediment samples have somewhat lower ranges of fitted Z C .
The datasets with the lowest ranges of fitted Z C are those for water samples from neutral or alkaline hot springs in Southwestern Yunnan (Guo et al., 2021), the Eastern Tibetan Plateau (Guo et al., 2020) (see also Fig. 4), the circumneutralto-alkaline subset of New Zealand hot springs (Power et al., 2018), and Bison Pool in Yellowstone National Park, USA (Swingley et al., 2012). The biofilm sampling for DNA analysis of Bison Pool took place in 2005 (Swingley et al., 2012) while redox potential measurements were made in summer of 2009 (Table S1 and Dick and Shock, 2011); Eh measurements are not available for the highest-temperature site of Bison Pool, so a value measured in the neighboring and chemically similar (Boyer et al., 2020) Mound Spring was used. Notably, the range of Z C for communities at Bison Pool estimated from 16S rRNA gene sequences in this study (ca. -0.22 to -0.16; Figure S1) largely overlaps with the range inferred from shotgun metagenomic sequencing (ca. -0.21 to -0.155; Shock, 2011 andDick et al., 2019), lending support to the accuracy of using estimated community proteomes from 16S rRNA sequencing. Archaeal abundances in most hot spring datasets yield estimated Z C that show a small positive association with Eh7, except for the highly negative slope for circumneutral to alkaline New Zealand hot springs (Fig. 6A). When taxonomic abundances of both domains are combined, an overall positive association between Eh7 and Z C remains. This is especially the case for the Dallol water and Japan sediment datasets, where the combined abundances show a higher positive slope than either Archaea or Bacteria alone. This is possible evidence that the relative abundances of the domains themselves, and not only of taxa within the bacterial domain, are chemically shaped by redox conditions in these environments. Additional caution is needed in interpreting the results for the Dallol area because of the likelihood that bacterial sequences in the most acidic samples are contaminants and that archaeal sequences are carried on wind-blown dust (Belilla et al., 2019).
Finally, adding the hot spring data to the global analysis for all environments increases the fitted slope for both Bacteria and Archaea relative to the global correlations without hot springs (compare Figs. 5D and 6B). For both domains, alkaline hot springs have the lowest range of Z C compared to other environments. Although non-hydrothermal alkaline springs hosted in ophiolites are also known for very reducing conditions, their range of Z C is higher than that for alkaline hot springs (Fig. 5A). It therefore appears that the combination of high temperature, high pH, and reducing conditions in alkaline hot springs may be a crucial combination that selects for organisms with more reduced proteins than any other environment.

Discussion
Although no taxonomic data were explicitly shown, all of the results described above represent a combination of taxonomic abundances with reference proteomes to estimate a chemical formula for the community proteome of each sample.
Because Z C at the community level is determined by protein-coding genomic sequences and relative abundances of different organisms, its predicted correlation with redox potential, while being based on thermodynamic concepts, also has theoretical implications for evolution and ecology. More broadly, Z C is a quantitative chemical representation of microbial communities that allows predictions about eco-evolutionary differences that are not possible with conventional methods of diversity analysis and sequence comparison.
Our initial results for Winogradsky columns (Fig. 2) demonstrated that Z C in microcosm communities decreases, as predicted, as redox potential decreases with depth. Analysis of multiple field-based lake and seawater datasets showed that lower redox potential with depth in stratified water bodies is almost always associated with decreasing Z C of the estimated bacterial community proteomes. This mirrors the earlier findings of Dick and Tan (2021) for oxygen gradients in stratified systems. We then performed global comparisons for different environment types considered separately and together ( Fig. 5C-D). The positive correlations between Eh7 and Z C for Bacteria in various environment types, especially lakes and ponds, groundwater, and soils, and in the pan-environmental comparison provide ample support for the chemical shaping hypothesis.
In contrast to Bacteria, Archaea do not show a strong signal of chemical shaping, and even display a negative correlation for some environment types (Fig. 5C). However, greater uncertainty of archaeal abundances due to primer biases and the occurrence in some environments of some phylogenetic groups, such as Woesearchaeota and Pacearchaeota in the DPANN superphylum (Ortiz-Alvarez and Casamayor, 2016), which are identified by the RDP Classifier but are not currently available in RefSeq, mean that the current findings for Archaea remain provisional. Importantly, hot springs exhibit a wider range of Z C than other environment types, and when they are included in the global analysis, the fitted slopes for both Bacteria and Archaea increase, and the correlation for Archaea becomes marginally positive (Fig. 6D).
Therefore, compared to other environments, microbial communities in alkaline hot springs may represent the greatest evolutionary and ecological divergence toward more reduced proteins under the influence of reducing conditions.
Multiple physicochemical factors, and not only redox potential, are drivers of microbial community composition in many environments. For instance, salinity gradients have a major influence on microbial community structure in sediments of the Great Salt Lake (GSL), USA (Boyd et al., 2017). The analysis in this study does not control for these other factors. Nevertheless, a positive correlation between Eh7 and Z C for both Archaea and Bacteria in the GSL sediment dataset ( Figure S1) speaks to the potential for detecting a signal of redox potential from among other environmental drivers. The positive slopes of linear regressions for many datasets from aquatic and soil environments (Fig. 5B) are consistent with the prediction of local chemical shaping by redox potential, which may be superimposed on the effects of other environmental factors. This is particularly evident for different hot spring datasets, which are offset from one another due to differences in salinity, pH, and sample type, but with one exception exhibit positive correlations between Eh7 and bacterial Z C at the local scale (Fig. 6).
A chemical representation of microbial communities naturally suggests a strategy to model deterministic outcomes that can be predicted by thermodynamics. This type of assessment might complement approaches that depend on variation partitioning or null models to discern stochastic and deterministic community assembly processes (Stegen et al., 2015) but do not explicitly account for chemical variability. Although the influence of ecological determinism is thought to scale with body size, so that microorganisms display relatively low degrees of determinism in community assembly (Farjalla et al., 2012;Luan et al., 2020), the results of this study suggest that numerous bacterial communities are chemically shaped to some extent through a previously unrecognized nexus between environment and proteome composition.

Funding
No specific funding was received for this work.

Conflicts of interest
The authors declare no conflict of interest.

Data and code availability
All data analyzed in this study are from public sources; see Table 1 for accession numbers for sequence data and Table   S1 for sources of redox potential values. All data and code generated in this study are in the JMDplots R package that is available on GitHub (https://github.com/jedick/JMDplots); see Section Computer code for data analysis and visualization for details.

Supplementary material
Table S1 Dataset summaries: Bibliographic study key, name, number of samples with bacterial and archaeal sequences, sample type (soil, sediment, water, or biological), ranges of T, pH, Eh, and Eh7, sign of Z C -Eh7 correlation for Bacteria, source of Eh data, notes. Figure S1 Z C -Eh7 scatterplots and linear regressions for all datasets.
Dataset S1 Data and metadata for each sample: Study key, environment type, lineage (Bacteria or Archaea), sample name, SRA or MG-RAST accession number, grouping description, sample group, values of Eh, Eh7, and Z C .
Dataset S2 Coefficients of linear regressions and Eh7 ranges for each dataset and domain. Figure S1 and Datasets S1 and S2 were generated using the orp16S S1() function in the JMDplots package.
Dataset S3 BibTeX bib file for data sources.