Chemical Links Between Redox Conditions and Estimated Community Proteomes from 16S rRNA and Reference Protein Sequences

Environmental influences on community structure are often assessed through multivariate analyses in order to relate microbial abundances to separately measured physicochemical variables. However, genes and proteins are themselves chemical entities; in combination with genome databases, differences in microbial abundances directly encode for chemical variability. We predicted that the carbon oxidation state of estimated community proteomes, obtained by combining taxonomic abundances from published 16S rRNA gene sequencing datasets with reference microbial proteomes from the NCBI Reference Sequence (RefSeq) database, would reflect environmental oxidation-reduction conditions. Analysis of multiple datasets confirms the geobiochemical predictions for environmental redox gradients in hydrothermal systems, stratified lakes and marine environments, and shale gas wells. The geobiochemical signal is largest for the steep redox gradients associated with hydrothermal systems and between injected water and produced fluids from shale gas wells, demonstrating that microbial community composition can be a chemical proxy for environmental redox gradients. Although estimates of oxidation state from 16S amplicon and metagenomic sequences are correlated, the 16S-based estimates show stronger associations with redox gradients in some environments.


Introduction
The environmental impacts on community structure are a major area of investigation within microbial ecology. The elucidation of environmental drivers of community structure depends on two types of data: environmental DNA sequencing (e.g., 16S amplicon or shotgun metagenomes) and physicochemical measurements. Many studies have used multivariate techniques, particularly ordination, to relate microbial Jeffrey M. Dick jeff@chnosz.net Jingqiang Tan tanjingqiang@csu.edu.cn can start with the taxonomic composition of microbial communities, use reference databases to estimate the elemental formulas of community members, and compare Z C to measured oxygen concentration or other indicators of environmental redox conditions. Two specific predictions that can be made with this "chemical shaping" hypothesis are that where an environment is more reducing, the community proteomes would have a lower Z C , and where an environment is more saline, which is a condition associated with lower water activity, the community proteomes would have a lower stoichiometric hydration state (n H 2 O ). This choice of variables reflects the recognition of oxidationreduction and hydration-dehydration reactions as major types of biochemical transformations in metabolism and, by extension, at the ecosystem scale. These chemical metrics are normalized by number of carbon atoms (Z C ) or by protein length (n H 2 O ), so they can be compared across samples and even across vastly different environments.
We previously analyzed Z C and n H 2 O from shotgun metagenomic data for redox gradients and salinity gradients [19,20]. There are many more publicly available 16S rRNA datasets, which is the motivation for the first part of this study to develop methods for combining taxonomic profiles from 16S rRNA gene sequences with reference proteomes in order to estimate community proteomes. We then make a series of comparisons of oxygen concentration and carbon oxidation state in stratified water bodies, identify large differences in Z C between relatively oxidized and reduced conditions within hydrothermal systems and shale gas wells, and finally compare the 16S-based estimates with Z C of protein sequences inferred from metagenomes.

Methods
Published datasets from studies of hydrothermal systems, stratified water bodies, and microbial mats were selected to represent both local (within datasets) and global (across datasets) redox and salinity gradients (Table 1). Subsequent to the selection of datasets based on environmental gradients, additional datasets were selected for comparison of 16S rRNA and shotgun metagenome data.
Protein sequences from the NCBI Reference Sequence (RefSeq) database were processed to obtain reference proteomes, which are proteomes for individual taxonomic groups in the RefSeq database that are based on automatic translation of genomic sequences. Then, we generated estimated community proteomes, which represent the combination of reference proteomes with 16S-based microbial abundance profiles to estimate the amino acid compositions of proteomes for entire communities. All data processing and visualization was done with R [67] on a laptop with 16 GB of RAM. The methods are described in detail below and summarized in Fig. S1.

RefSeq Proteomes
Reference protein sequences were obtained from the RefSeq database release 206 (2021-05-21) [62] for the 49448 bacterial, archaeal and viral taxa identified by a unique taxonomic ID (taxid). Our estimates of the chemical metrics of microbial communities only use reference proteomes for bacterial and archaeal taxa, but viral taxa were also included in the reference compilation with the rationale that visualizing the trends of chemical metrics in a broader biological context helps to establish the usefulness of a chemical outlook (see Fig. 1A).
For each taxid with a species-level name in the NCBI taxonomy and with at least 500 RefSeq sequences (for Bacteria and Archaea only), the amino acid compositions of all available RefSeq protein sequences were summed and divided by the number of proteins to generate the mean amino acid composition for this species (AA species ). For each genus in the NCBI taxonomy, the mean amino acid compositions (AA species ) of every species in that genus were added and divided by the numbers of species to generate the mean amino acid composition of the genus (AA genus ). Similarly, the mean amino acid compositions (AA species ) of every species in each family were combined to generate the mean amino acid composition of that family (AA family ), and analogous calculations were done for all orders, classes, and phyla, and superkingdoms (domains). The resulting mean amino acid compositions, referred to here as reference proteomes, were generated for 4737 genera, 757 families, 299 orders, 138 classes, 68 phyla, and 3 superkingdoms (Bacteria, Archaea, and Viruses) in the NCBI taxonomy.
The rationale for assembling chemical metrics at different taxonomic levels is that some, and in some cases, most, sequences cannot be classified by the RDP Classifier at the genus level (see Table S1 for percentage classification to genus level for each dataset). In order to get more chemical information out of the available classifications, we used the lowest-level classifications for each sequence for mapping to the NCBI taxonomy. The chemical metrics of reference proteomes at each successive level from family to phylum are correlated with genus-level metrics to a decreasing extent (Fig. S2), but metrics at the level of superkingdom (domain) are essentially uncorrelated with genus-level metrics, so no domain-level classifications were used to estimate community proteomes. That using lowestlevel instead of genus-only classifications has a slightly beneficial effect on estimates of carbon oxidation state for communities is shown below through comparisons with metagenomic data. a. 16S rRNA gene sequences were obtained from GenBank using the read.GenBank() function from R package ape [63]. For other datasets, sequences were downloaded from the NCBI Sequence Read Archive or MG-RAST with the accession numbers listed in the table. b. Bacterial domain-specific 16S primers were used in these studies; other studies used universal primers for archaeal and bacterial sequences. c. To minimize the effects of seasonality and depth, we analyzed data for samples taken in the summer from depths of not more than 20 m. d. The analysis includes samples from both surface (0-10 cm) and subsurface (15-30 cm) layers. Normal (low salinity) soil samples are from outside the Qarhan Salt Lake (Water Park, Tianjin, China). e. PCR and molecular cloning methods (not using an amplicon strategy) are described in [55] 16S rRNA Data Sources and Processing 16S rRNA amplicon sequences from 22 studies were downloaded from GenBank, the NCBI Sequence Read Archive (SRA) [45], or MG-RAST [54] (see Table 1). VSEARCH version 2.15.0 [70] was used to merge Illumina pairedend reads with the -fastq mergepairs command with default parameters; for some datasets with low-quality reverse reads, only the forward reads were used as in previous studies [76]. For Illumina datasets, quality and length filtering was done with the options -fastq maxee rate 0.005 (i.e., maximum one expected error for every 200 bases) and -fastq trunclen with a specific length value for each dataset. For 454 datasets, the --clip option for the fastq dump command in the SRA Toolkit was used to remove adapter sequences, and quality and length filtering was done with -fastq truncqual 15 -fastq minlen 200 -fastq maxlen 600. After filtering, the remaining sequences for all samples in each dataset were pooled and singletons (sequences that appear exactly once, but not including subsequences of other sequences) were removed. Then, runs were subsampled to a depth of 10000 sequences; samples with fewer than 10000 sequences were not affected. The subsampling reduces the processing time and retains enough sequences for classifying the major taxonomic groups in the communities. Reference-based chimera detection was performed using the VSEARCH command -uchime ref with the SILVA SSURef NR99 database version 138.1 [66]; sequences identified as chimeras or borderline chimeras   Table S5 were removed (note that the SILVA database was not used for taxonomic classification). The remaining sequences were processed with the Ribosomal Database Project (RDP) [12] Classifier version 2.13 [80] with the default training set (RDP 16S rRNA training set no. 18 07/2020). The sequence classifications for all samples in each dataset were merged using the RDP Classifier command merge-count. Sequence processing statistics and additional details are given in Table S1.

Taxonomy Mapping
To estimate the amino acid compositions of communities from 16S rRNA sequencing data, RDP classifications at only the domain level were discarded because they provide very little chemical information (see Fig. S2). Sequences assigned to RDP domain Eukaryota, class-and familylevel name Chloroplast or genus-level names Chlorophyta and Bacillariophyta were also discarded. All remaining classifications were retained for attempted mapping to the NCBI taxonomy. In general, mapping from the RDP Classifier to the NCBI taxonomy was performed by text matching of both the taxonomic rank and name. Although there is some discordance between the RDP and NCBI taxonomies, and we can not control for the flux of taxonomic names through time, we used a current snapshot of the databases with a limited number of manual name mappings informed by the literature to provide a level of consistency that is adequate for assessing the effects of large-scale changes among taxonomic groups on chemical metrics of estimated community proteomes in redox gradients. In principle, the SILVA taxonomy, which is based on GTDB [64], should have a closer alignment than RDP to the NCBI taxonomy. Nevertheless, there are comparable numbers of genus names in RDP (536) and SILVA (320) that are not represented by RefSeq sequences, although they may be present in the NCBI taxonomy ( Fig. S4; complete list in Table S10), and there is at least one phylum name in SILVA (Aquificota) that is not present in the NCBI taxonomy, even as a synonym for the traditional name Aquificae (that is, https://www.ncbi.nlm.nih.gov/taxonomy/?term=Aquificota displays "No items found"; accessed on 2022-01-17). Therefore, choosing SILVA instead of RDP at this time would still require some amount of manual name mapping, and would not alleviate the more important issue of the limited coverage of organisms by RefSeq.
Taxa that are at least one percent or more abundant in some of the datasets analyzed here and whose names in RDP are not present in the NCBI taxonomy were identified for manual mapping. The RDP phylum Cyanobacteria/Chloroplast, class Planctomycetacia, and genus Escherichia/Shigella were mapped to the NCBI phylum Cyanobacteria, class Planctomycetia, and genus Escherichia, respectively. The RDP order Rhizobiales was mapped to the NCBI order Hyphomicrobiales [35]. The RDP taxon Spartobacteria genera incertae sedis, which is relatively abundant in the Baltic Sea [34], was mapped to NCBI class Spartobacteria. The RDP taxon Subdivi-sion3 genera incertae sedis, which was identified here in some stream datasets [59] was mapped to NCBI family Verrucomicrobia subdivision 3. Among Acidobacteria, the RDP genus-level classifications Gp1 and Gp6 were mapped to NCBI genera Acidobacterium and Luteitalea, respectively, which are members of Acidobacteria subdivisions 1 and 6 [24]. The RDP genus-level cyanobacterial groups GpI, GpIIa, and GpVI were mapped to the NCBI genera Nostoc, Synechococcus, and Pseudanabaena, and the RDP taxon Family II was mapped to the family Synechococcaceae. While the other mappings involve synonyms or representative organisms, the cyanobacterial mappings are based on names used in Bergey's Manual [10], which have a more uncertain correspondence to the NCBI taxonomy. The RDP order Clostridiales and family Ruminococcaceae, which are listed in LPSN [65] as heterotypic synonyms (https://lpsn.dsmz.de, accessed on 2022-01-09), were mapped to the respective correct names, Eubacteriales and Oscillospiraceae, in the NCBI taxonomy. Finally, two taxa with ranks given as classes in RDP (Cyanobacteria and Actinobacteria) were mapped to the respective phyla in NCBI. The percentages of manually mapped classifications in each dataset are listed in Table S4.
Any RDP classifications whose rank and name at any taxonomic level up to phylum could not be mapped to the NCBI taxonomy were removed from the subsequent calculations. Note that the mappings include RDP classifications that are given as "unclassified" at the genus level but have a classification at some higher level up to phylum. Across all datasets, a median of 95.7% of RDP classifications at any level from genus to phylum were mapped to the NCBI taxonomy (Table S1). The taxa mainly responsible for the relatively low mapping rate (Table S2) for one dataset for a horizontally fractured shale gas well in the Denver-Julesburg Basin (genus Cavicella; 9% abundant) and Bison Pool hot spring in Yellowstone National Park (candidate phylum Diapherotrites; 17% abundant) are present in the NCBI taxonomy but have no protein sequences in RefSeq release 206.

Analysis of Elemental Composition
Carbon oxidation state (Z C ) of an organic molecule denotes the average charge on carbon atoms given nominal charges of all other atoms. Also known by other names such as nominal oxidation state of carbon (NOSC), Z C is an established metric for describing chemical transformations of natural organic matter [2,40,78], and has also been applied to analysis of metagenomic sequences [27,43]. The equation for Z C of proteins only requires elemental abundances [16,20].
Stoichiometric hydration state (n H 2 O ) was computed from the number of water molecules in theoretical reactions to form proteins from a set of thermodynamic components (referred to here as basis species). Other than the requirement that they are the minimum number of species that can be combined to represent all possible elemental compositions of the primary sequence of proteins, there is no thermodynamic constraint on the choice of basis species. A particular set of basis species -glutamine, glutamic acid, cysteine, O 2 , and H 2 O (abbreviated QEC) -was chosen to maximize the covariation between Z C and n O 2 (which are both measures of oxidation state) and to reduce that between Z C and n H 2 O ; see [20] for theoretical considerations and [17] for applications of this metric to proteomic data.
Values of Z C and n H 2 O were computed for each taxonomic group in RefSeq whose amino acid compositions were obtained as described above, using the ZCAA() and H2OAA() functions in the canprot package [17]. Note that the calculation of Z C from amino acid composition requires weighting by number of carbon atoms in the amino acids as described previously [20]. To generate mean amino acid compositions of estimated community proteomes (AA community ), counts of taxonomic classifications from the RDP Classifier were multiplied by the mean amino acid compositions of the reference proteomes for the lowestlevel mapped taxa (AA genus to AA phylum ); these were summed and divided by the number of classifications to yield AA community . This was then used to compute Z C and n H 2 O .

Metagenomic Sequence Processing
We used our previous compilation [19] of mean amino acid compositions of protein sequences inferred from shotgun metagenomes for Bison Pool hot spring (Yellowstone National Park, USA), Eastern Tropical North Pacific (ETNP) oxygen minimum zone, and Guerrero Negro mat (Baja California Sur, Mexico) and metatranscriptomes for ETNP and Mono Lake (California, USA). Additional metagenomic datasets for the Marcellus Shale, Manus Basin vents, Black Sea, Human Microbiome Project, soils, and mammalian guts (see Table S3) were analyzed in this study using methods described in detail in [19]. Briefly, this consists of trimming, filtering, and dereplicating reads followed by identification of protein-coding sequences with FragGeneScan [69]. The mean amino acid composition of the inferred protein sequences for each sample was used for calculation of Z C and n H 2 O .

Chemical Differences among Taxonomic Groups
Chemical metrics of reference proteomes for phyla with the greatest number of representative lower-level taxa in the RefSeq database are plotted in Fig. 1A. The first panel includes viruses and archaeal and bacterial phyla; although viruses were not used in the estimation of community metrics, their lower stoichiometric hydration state compared to many groups of cellular organisms is an ancillary result of general interest. The next panel (Fig. 1B) excludes viruses to consider cellular organisms. The proteomes of classes within Bacteroidetes have the lowest overall n H 2 O , and those of Crenarchaeota, Fusobacteria, Actinobacteria, and Euryarchaeota, except for the class Halobacteria, have relatively high n H 2 O . In terms of Z C , Actinobacteria, Planctomycetes, and Halobacteria within the Euryarchaeota are the groups with the most oxidized proteomes, while Crenarchaeota, Thermotogae, Fusobacteria, and Tenericutes have the most reduced proteomes. In the third panel (Fig. 1C), n H 2 O is the primary variable that distinguishes the classes of Proteobacteria, with most orders of Alphaproteobacteria and Gammaproteobacteria at relatively high and low n H 2 O , respectively, although Alphaproteobacteria shows very wide variation. The proteobacterial class with the most reduced proteins is Epsilonproteobacteria, members of which are often identified in hydrothermal vent communities [8,60].
This chemical representation of reference proteomes generates many hypotheses about the effects of oxidationreduction and hydration-dehydration reactions on biochemical evolution. For instance, the chemically reducing conditions of sediments and hydrothermal fluids [47,49] may have shaped the evolutionary processes that resulted in the low Z C of the reference proteomes for Methanococci, Archaeoglobi, and Thermococci, which are the most reduced classes in the Euryarchaeota and among the most reduced of all microbial classes shown in Fig. 1. In an even broader evolutionary context, the relatively low hydration state of viral proteomes may not be surprising given the absence of a cytoplasmic compartment in viruses; in particular, the water content per gram of dry matter of geometrical viruses (those without a viral envelope) is considerably lower than that of E. coli cells [51]. The low n H 2 O of Bacteroidetes, which are relatively abundant in suspended particles in seawater [25], aligns with our previous finding of lower n H 2 O of proteins coded by metagenomes of particle-associated compared to free-living communities [20]. Halobacteria (i.e., haloarchaea) is exceptional for having much higher Z C and lower n H 2 O than other euryarchaeotal classes. The lower water activity associated with hypersaline environments may be one reason for this group to have evolved proteomes with lower n H 2 O . However, the high Z C of Halobacteria is probably not an adaptation to more oxidizing conditions; indeed, the solubility of O 2 decreases at higher salinity [71]. Instead, the proteomes of many halophiles have greater numbers of acidic residues that stabilize the three-dimensional structure of proteins in high-salt conditions; because of the oxygen atoms contained in carboxylic acid groups, this adaptation also results in higher average oxidation state of carbon of the proteins [20].
Although the chemical metrics of taxa within a given phylum generally overlap with those of other phyla at both the class and genus level (Figs. 1 and S3), when considering all prokaryotic taxa in RefSeq, there is still a correlation between genus-level metrics and those of each successive taxonomic rank up to phylum (Fig. S2). This is why we used the lowest-level classification for each 16S amplicon sequence to estimate AA community .

Estimated Community Proteomes Encode Redox and Salinity Gradients
As visualized in Fig. 2, Z C of estimated community proteomes is locally lower in the euxinic zone of the Black Sea and anoxic zone of Lake Fryxell, lower in the hotter water samples for the Manus Basin (which have greater input of reduced hydrothermal fluids), and lower just below the surface of the hypersaline Guerrero Negro microbial mat, which experiences a sharp oxygen gradient during the day [31]. The Lake Fryxell dataset is also for mat samples, but the environmental gradient here is between oxygenated shallow water and anoxic deeper water [38]. These and other examples described below provide evidence that carbon oxidation state is aligned with environmental redox conditions.
Yellowstone hot springs and Manus Basin submarine vents are hydrothermal systems that emit highly reduced fluids. The communities in these hydrothermal systems are distributed toward lower Z C than those in many lake and seawater environments (Fig. 2 center), thereby revealing a geobiochemical signature of oxidation-reduction conditions on a global scale.
Estimated proteomes of archaeal communities in alkaline hot springs in Yellowstone National Park have higher n H 2 O than bacterial communities. This can be attributed to the relatively high n H 2 O of reference proteomes (Fig. 1B) [20]. Descriptions of redox conditions are taken from the literature and are distinct from the operational definitions of oxic and anoxic samples used in Fig. 4. The practical salinity values reported for the Baltic Sea (supplementary information of [34]) have no units. References for sequencing data are listed in Table 1, and plotting data for this figure are in Table S6 include Crenarchaeota and Euryarchaeota [7]. In contrast, fauna samples in the Manus Basin dataset have lower n H 2 O than the majority of fluid and rock samples, but in this case no archaeal sequences were reported, so differences in bacterial abundances are the reason for the observed chemical differences. We previously documented decreasing n H 2 O at higher salinities by analyzing shotgun metagenomic sequences for various marine and freshwater environments [20], and this trend is recapitulated in 16S-based estimates of community proteomes for the Baltic Sea (Fig. 2). In contrast, communities in Tibetan Plateau lakes [85] exhibit strongly increasing Z C in hypersaline conditions, which is again consistent with our previous findings for other hypersaline systems based on metagenomic data [20]. Compared to low-salinity normal soil samples from Tianjin, China, saline soils from a playa (Qarhan Salt Lake [82]) are distinguished by lower n H 2 O and, for some samples, higher Z C , showing that salinity gradients can be mapped to two dimensions of chemical variation.

Coupling Between Oxygen Concentration and Community Oxidation State in Water Columns
Many water bodies around the world develop vertical redox gradients as a result of microbial respiration of organic matter derived from the photic zone that leads to oxygen depletion with depth. Besides the Black Sea, we analyzed data for permanently stratified lakes in Switzerland [52], the Sansha Yongle Blue Hole in the South China Sea [33], Ursu Lake in Central Romania [4], and the ETNP oxygen minimum zone [28]. At each location, the profile of Z C of the estimated community proteomes shows an overall decreasing trend with depth (Fig. 3). Spearman correlation coefficients between Z C and measured O 2 concentrations are positive for all sites except outside the Blue Hole (the only site analyzed here where the deep water column is oxic) and the particle-associated communities at ETNP. The relatively low correlation for the Black Sea can be attributed to the fluctuating Z C near the top of the profile in contrast to rapidly dropping O 2 concentrations from 50 to 85 m; below this depth Z C decreases sharply, but O 2 is already below detection. This might suggest downward transport of some cells from the near-surface layers. For ETNP, the Z C decreases strongly with depth in the freeliving communities (0.2-1.6 μm size fraction), but to a lesser extent in particle-associated communities (1.6-30 μm size fraction). This might reflect environmental microniches and cell-cell interactions that to some extent reduce the sensitivity of particle-associated communities to external redox conditions.
There is a moderate rise of Z C in the deepest euxinic zone of the Black Sea. At a much smaller scale in a microbial mat instead of a water column, the low-to high-sulfide layers of the Guerrero Negro mat (below 3 mm) [31] have higher Z C than the minimum at 3 mm (Fig. 2). Therefore, despite the absence of oxygen, Z C appears to rise locally in some sulfide-rich environments.
A study of Ursu Lake in Central Romania provides temporal data from summer 2015 to spring 2016 [4]. At all times, both O 2 and Z C show an overall decline with depth. The profile of Z C exhibits a broad maximum at around 2 m depth in November that becomes narrower and deeper through the winter and spring. The development of the Z C maximum precedes that of an O 2 maximum in February, and the two profiles exhibit a remarkable meter-scale correspondence in April.

Application to Datasets for Shale Gas Extraction
To improve hydrocarbon recovery from unconventional reservoirs (shale gas), hydraulic fracturing fluid is injected into shale formations to create extensive horizontal fracture networks. The injected fluid mixes and reacts with natural formation waters and the fractured rock surfaces. Water that returns to the surface is initially referred to as flowback and later as produced water during the hydrocarbon production stage of the well [39]. Changes in the chemistry and biology of streams that are in proximity to or may be affected by shale gas wells raise important environmental issues. The chemistry of affected surface streams might reflect the possible input of methane and/or chemical additives in fracking fluid from nearby wells, although the overall strength of these associations has been debated [59].
Chemical oxidants are commonly added to injected fracturing fluid, but even without such additives, injected fluids generally consist of large amounts of water from surface sources, which makes them highly oxidized compared to the reducing conditions in organic-rich shale. Other authors have noted that little quantitative information is available about the changes in redox conditions of flowback and produced fluid over time [46,57], but available evidence suggests major temporal changes in redox conditions of these fluids. In one case, oxygen was stated to be rapidly depleted in flowback and produced waters in the Duvernay Formation in Alberta, Canada [84]. In a study of the Marcellus Shale in Pennsylvania, USA, the abundances of S-bearing organic compounds determined from FT-ICR-MS measurements exhibited a decrease in carbon oxidation state compared to injected fluids [50]. Furthermore, oxidationreduction potential (ORP) measured using a multiprobe is lower (more reducing) in groundwater samples with higher concentrations of CH 4 [44]. Therefore, we predicted that estimated community proteomes in produced water would be shifted toward a more reduced state compared to source water. A related prediction is that putative hydrocarbon input to surface streams has a similar reducing effect, but the effect may be smaller because of the greater extent of dilution.
Analysis of 16S rRNA gene sequence data for water from streams in Northwestern Pennsylvania [76] shows that both Z C and n H 2 O are lower in streams potentially impacted by Marcellus Shale activity compared to unaffected streams (Fig. 4A). Smaller differences, but in the same direction, are found for stream sediments in the same study and for stream water in Pennsylvania State Forests (Fig. 4B). Much larger shifts, toward lower Z C and higher n H 2 O , occur in produced waters compared to source waters or injected fracturing fluids. This trend is evident for not only the Marcellus Shale but also the Denver-Julesburg Basin in Colorado, USA and the Duvernay Formation ( Fig. 4C and D). The magnitudes of the Z C differences between source or injected water and produced fluids are among the largest for all datasets considered in this study (Fig. 4E).
The highly saline produced waters from many shales converge toward a common profile dominated by the halophilic and anaerobic Halanaerobium [57], but in the Denver-Julesburg Basin Thermoanaerobacter, which has similar metabolic capabilities, is present instead [36]. The very low oxidation states of the reference proteomes for these abundant groups (Z C values of −0.199 and −0.227, respectively), which are both members of Clostridia, accounts for much of the decrease in Z C in produced fluids. Notably, hypersaline conditions in non-reducing lakes and soils are

Fig. 3
Lower carbon oxidation state is tied to oxygen depletion in water columns. Z C values were calculated for estimated community proteomes using published microbial 16S rRNA gene sequences (see Table 1). Oxygen concentrations were taken from the same publications, except for locations inside and outside the Blue Hole [83]. All sites except for ETNP and station C4 outside the Blue Hole are permanently stratified. One set of metagenomic data for the Black Sea [77] is shown for comparison. No Z C value is shown for 1 m depth in Ursu Lake in April 2016 because fewer than 200 sequences remained for this sample after all sequence processing steps. Spearman correlation coefficients (ρ) are shown in the legends. Plotting data for this figure are in Table S7 [76] and Pennsylvania State Forests (PASF; water samples in spring and fall) [59]. (C) Injected fluids and produced water from hydraulically fractured wells in the Marcellus Shale [11]. (D) Mean values for samples of produced water compared to injected fluids or source water for hydraulically fractured wells in the Marcellus Shale (multiple wells) [11], Denver-Julesburg Basin (one horizontally fractured well) [36], and Duvernay Formation (well 2) [84]. Numbers of samples used for calculating group means are shown next to points in (B) and (D). Note that only injected fluids and selected produced water samples (starting with day 49 [11], day 22 [36], or day 18 [84]) are plotted in (C) and (D); data for earlier flowback samples are not shown. Plotting data for panels (A)-(D) are in Table S8. (E) Mean difference of carbon oxidation state of estimated community proteomes between relatively oxidized and reduced sample groups (Z C ) for selected datasets. Mean differences are computed as mean of Z C values for reduced samples minus mean of Z C values for oxidized samples. Criteria for assigning samples to oxidized and reduced groups are shown together with numbers of samples in each group (in parentheses). "Oxic" and "anoxic" are operational definitions that refer to samples with O 2 concentrations > 0.5 or < 0.5, respectively, in the units shown in Fig. 3, and mg L −1 for Bison Pool [74], Lake Fryxell [38], and Mono Lake [23]. Abbreviations: SW -source water; IF -injected fluids; PW -produced water. MSA+, MSA-, highest, and lowest are terms used by other authors to describe the possible disturbance of streams by shale gas extraction [59,76]. Datasets are ordered by decreasing magnitude of Z C , and those for shale gas wells and hydrothermal systems are highlighted in red characterized by relatively high Z C (this study and [20]), so the finding of an opposite trend for shale gas wells despite the high salinity of produced fluids strengthens the interpretation that reducing conditions are a primary driver of community structure in produced water.

Discussion
We have found that that the largest differences of carbon oxidation state of estimated community proteomes occurs along redox gradients associated with hydrothermal environments and injected and produced fluids of shale gas wells, and that smaller differences are apparent for oxygen gradients in stratified water bodies (Fig. 4E). Although an overall depthwise decrease of protein Z C is apparent for both 16S rRNA and metagenomic data from the Black Sea (Fig. 3), the same trend is not exhibited by the carbon oxidation state (NOSC) of dissolved organic matter [73]. Therefore, compared to that of dissolved organic matter, the carbon oxidation state of proteins in microbial communities may be a stronger indicator of redox conditions in the water column.
Our results imply that redox conditions predictably impact the carbon oxidation state of microbial communities regardless of their specific taxonomic composition. Different reducing environments are known to be inhabited by distinct taxonomic groups, including but not limited to Epsilonproteobacteria and methanogenic Archaea in some submarine hydrothermal ecosystems [47,53], Aquificae in alkaline hot springs [74], and Clostridia in produced waters of shale gas wells [36]. The reference proteomes of these groups have relatively low Z C (Fig. 1 and Fig. S3). Analogous to the functional redundancy of some microbial communities with different taxonomic composition in the same type of environment [48], these patterns suggest widespread chemical convergence. Because of the thermodynamic basis for these predictions, we can not immediately say whether the trends are consequences of particular processes such as niche differentiation or natural selection. Conceptually, there is no absolute distinction between the timescales of ecology and evolution but instead an overlap of timescales [30], and functional structures may emerge from both bottom-up and top-down mechanisms [5]. In this context, thermodynamics only predicts that an association between oxidation state of environments and proteomes has the potential to emerge. The actual mechanistic linkage can occur through ecology (organisms with reduced proteomes tend to dominate the communities in reducing environments) or evolution (organisms that inhabit reducing environments tend to evolve more reduced proteomes).
There are various limitations to the methods used here arising from the use of 16S rRNA sequences, a taxonomybased mapping, and reference sequences. These are briefly discussed in turn, and it is argued that these limitations do not affect the conclusions of this study. Comparisons with metagenomic data provide additional support for our findings and yield clues about processes that may differentiate taxonomy-based and metagenomic estimates of carbon oxidation state.

Limitations
Primer bias is an inherent problem with the analysis of 16S rRNA sequences. Because different primers for the 16S rRNA gene are used by different laboratories (Table 1), different taxonomic groups are selectively amplified during PCR, and it is not possible in principle to compare the results across studies. This issue may affect the comparison of chemical metrics made in Fig. 2 (center), but the trends in this plot are qualitatively consistent with those previously found for metagenomes [19,20]. Moreover, the trends in each vertical profile of Z C (Fig. 3) and the graded response to size of redox gradients (Fig. 4E) are based on comparisons between samples within individual datasets and are therefore unlikely to be artifacts of primer bias.
A taxonomy-based mapping to reference proteomes has several disadvantages: text mapping of names requires manual adjustment and is somewhat subjective, unclassified sequences cannot be used to obtain chemical information, and a decision has to be made about the taxonomic levels that should be included. Phylogenetic inference using hidden-state prediction is available in PICRUSt [42] for predicting functional traits of communities from 16S rRNA data and a reference database of traits (gene functions). Sequence-based inference (as in PICRUSt2; [21]) instead of taxonomic inference would provide even more accurate estimates of community properties. These tools have been designed for functional microbial ecology and cannot be used without substantial modification for the chemical metrics that are the focus of this study. By making firstorder estimates of community proteomes from taxonomic abundances, and more importantly, showing that chemical metrics are relevant variables for understanding community composition, we have prepared the ground for future work that can refine these estimates by using phylogenetic inference, which would yield a more rigorous prediction of amino acid compositions of communities from 16S rRNA sequences.
The manual mapping of taxonomic names used in this study is tedious and must be adjusted for future updates to taxonomies. The combination of text matching and manual mapping currently captures > 95 % of classifications made by the RDP Classifier for most datasets; the main exceptions include soils and some hydrothermal systems and shale gas studies (see last column of Table S1). Even with the lower proportion of mapped sequences in hydrothermal and shale gas datasets, their Z C ranges are among the largest (Fig. 4E), so resolving the unmapped classifications would have relatively little impact on the estimated Z C differences between oxidized and reduced samples within datasets. Furthermore, as shown in Fig. S2, the regression lines between chemical metrics for genus-and higherlevel taxa become flatter at higher taxonomic levels, which means that the use of supra-genus-level classifications, where necessary, to estimate community proteomes tends to decrease the apparent differences of chemical metrics between samples. It follows that the associations described here between Z C and redox conditions are unlikely to be artifacts from sequences that have no close relatives in the RDP 16S rRNA database and are classified at high taxonomic levels.
The reference proteomes were constructed from the average amino acid compositions of proteins for species in the RefSeq database. This does not account for actual protein abundances in cells, which is the primary reason why we have referred to this as an estimation, and not a prediction, of community proteomes. The only direct validation of the estimated chemical metrics would be by comparison with metaproteomes, which currently have limited availability for the wide range of environments considered in this study. Therefore, below we make comparisons with chemical metrics inferred from metagenomic data, which are the gold standard for assessing the functional potential of communities, but which also must be regarded as another type of estimate of the actual community proteome. Figure 5A compares 16S-based estimates of Z C with those from metagenomically inferred protein sequences for several shotgun metagenomic and metatranscriptomic datasets previously analyzed by us [19]. Samples with high concentrations of O 2 have 16S-based estimated community proteomes that are themselves relatively oxidized. There is not a one-to-one correspondence with metagenomic data, and the sub-vertical trends for Mono Lake and ETNP indicate larger differences between 16S-based estimates than between metagenomic values for each dataset. The largest within-dataset Z C difference occurs between the source pool and distal samples of the outflow channel in Bison Pool hot spring (Fig. 5A); a somewhat smaller range of Z C is found for the injected and day 82 produced fluids from the Marcellus Shale (Fig. 5B). Interestingly, 16S-based and metagenomic estimates reveal coordinated temporal fluctuations of Z C except for the last two time points. The results for Manus Basin (Fig. 5C) show that Z C estimated from 16S rRNA sequences is lower in higher-temperature samples, which have less dissolved O 2 . Although the metagenomic data concur that the highesttemperature sample has the most reduced inferred proteins, the Z C range between samples is compressed compared to 16S-based estimates. We therefore conclude that 16S-based estimates yield a stronger signal of redox gradients than do metagenomes.

Comparison of Z c Estimated from 16S rRNA and Metagenomic Data
Additional comparisons are made in Fig. 5D for data from the Human Microbiome Project (HMP) and in Fig. 5E for soils and mammalian guts. These datasets were chosen because they were used by the authors of the PICRUSt [42] and Tax4Fun [1] papers. The HMP data exhibit an overall strong correlation between 16S-based and metagenomic estimates of Z C (regression lines and R 2 values are shown in Fig. S5). There is considerable variation between 16S-based and metagenomic estimates of Z C within each of the soil and mammalian gut datasets, but the amplicon and shotgun data both show that soil communities have more oxidized proteins than gut communities.
Because the soil samples are characterized by differences in pH and salinity [26] rather than a redox gradient, this dataset represents a negative control for our study. Despite considerable taxonomic variability ( [26]; their Fig. S1), there is relatively little variation of Z C within this dataset (our Fig. 5E), which is the expected result if there is no large redox gradient between the samples. Moreover, changing the taxonomic mapping to use only genus-level classifications (with a consequent reduction in number of used classifications per sample) slightly weakens the overall correlations between metagenomic and 16S-based estimates of Z C (Fig. S5). This supports the current method of selecting the lowest-level taxonomic classification for each sequence for mapping to the NCBI taxonomy.
The increasing trend of protein Z C inferred from shotgun metagenomes for the produced fluids between days 82 and 328 in the Marcellus Shale without a concomitant change in the 16S-based estimates (Fig. 5B) evokes a process that affects shotgun DNA sequences without necessarily altering taxonomic compositions. Building on our previous explanation for the unexpected rise of protein Z C inferred from shotgun metagenomes despite more reducing conditions at depth in some sediments and stratified water bodies [19], we describe several independent pieces of evidence that fit together to yield a provisional explanation for this pattern. (1) Abundant extracellular DNA (exDNA) is present in near-surface sediments and soils, and to a lower extent in water samples [3], and the presence of exDNA can affect microbiome studies based on amplicon sequences [3,9]. Extracellular DNA, if present, can also affect the results of shotgun metagenomic sequencing [61]; (2) heterotrophic bacteria, if present and active, can degrade exDNA, as shown for sediments [81]; (3) heterotrophic bacteria in a ferruginous (reduced but non-sulfidic) lake sediment selectively degrade AT-rich exDNA, such that residual exDNA has higher GC content [79]. Such an outcome is broadly consistent with earlier work showing faster degradation of adenosine monophosphate than cytidine monophosphate by bacteria in marine sediments [15]; and (4) GC-rich DNA sequences code for proteins with relatively high oxidation state [19]. In settings where exDNA is abundant and has been affected by heterotrophic degradation, the putative preferential preservation of GC-enriched exDNA and its incorporation into metagenomic samples would result in higher Z C of the coded proteins.
Other authors have inferred high levels of viral predation and viral-mediated cell lysis from the presence of the CRISPR-Cas system in assembled microbial genomes for late produced fluids from the Marcellus Shale [13] and showed that viral lysis releases cellular material from a Halanaerobium species grown in culture [14]. Halanaerobium is the predominant taxon in later produced fluids from the Marcellus Shale and is associated with heterotrophic and fermentative activity [11,13,14]. Fusibacter is another genus that has been identified in incubations of produced fluids from Marcellus Shale wells [56], and bacteria in its family (Fusibacteraceae) can uptake DNA components that might be liberated by other organisms with nuclease activity [81]. Taken together, these observations suggest a plausible scenario for the release of DNA from microbial cells and its   Table S1 for details). Only samples with at least 50000 metagenomically inferred protein sequences are shown; additional analyses were done to ensure that the anomalous Z C values are not due to insufficient numbers of processed reads (see Table S3). (E) Soils and mammalian guts (including humans and other species). Sources of metagenomic data are listed in Table S3, and plotting data for this figure are in Table  S9. GI-gastrointestinal; UG-urogenital; ND-not determined heterotrophic degradation prior to metagenomic sampling, particularly at the late timepoints. For the Marcellus Shale, we find a very close correlation between inferred protein Z C and GC content of shotgun metagenome reads; in contrast, estimated proteome Z C and GC content of 16S amplicon reads show a weaker but still positive correlation (Fig. S6A). Therefore, although selective degradation of AT-rich exDNA would likely affect all types of DNA, including the 16S rRNA gene, the differential Z C -GC correlations imply a greater impact by putative degradation processes on metagenomically inferred Z C values than on 16S-based estimates of community Z C . The Z C of the reference proteome of the genus Halanaerobium (-0.199) corresponds with the 16S-based estimates of Z C at days 82 and 328 (Fig. 5B). The range of Z C for the reference proteomes of the nine Halanaerobium species in the Ref-Seq database is (−0.214, −0.191), so it is unlikely that phylogenetic inference, which still depends on a reference database for amino acid compositions of sequenced organisms, would greatly alter the estimated community Z C . The considerable difference between the taxonomically estimated and metagenomically inferred protein Z C at day 328 is therefore a robust result; the hypothesized selective degradation of AT-rich exDNA is a provisional explanation for this result.
The selective degradation hypothesis cannot consistently explain the trends in the HMP dataset. This is because there is relatively little correlation between inferred protein Z C and GC content of shotgun metagenome reads (Fig. S6B), suggesting that even if preferential preservation of GCrich exDNA did occur, it would not explain the trends in Z C . Instead, the anomalous results for the urogenital tract and nasal cavity, which exhibit very high metagenomically inferred Z C (Fig. 5D) despite relatively low GC content, could perhaps be related to the low abundance of microbial DNA in human microbiome samples, which together with privacy issues necessitates the computational identification of human DNA sequences [75] and their exclusion from the public datasets that were analyzed here. We also emphasize that the discrepancies between Z C from metagenomic and 16S-based estimates for mammalian guts and soils (Fig. 5E) could be due to limitations of the taxonomy-based approach used here and are not necessarily an indication of a contribution by extracellular DNA.

Z c as a Geobiochemical Indicator of Redox Gradients
The inner plot of Fig. 2 makes comparisons among datasets showing that hydrothermal systems have lower Z C and hypersaline systems have higher Z C . Similar trends were first identified through analysis of metagenomic data in our previous studies [19,20]. Relatively low Z C of protein sequences inferred from metagenomes was also noted in a separate study of the Old City hydrothermal vent field [43]. Although the trends are not new, the ability to replicate them through analysis of 16S rRNA data is a new result of this study. Moreover, we find that the response of Z C is graded; that is, the differences are larger for larger redox gradients (Fig. 4E).
Our findings for shale gas systems suggest that in addition to previously recognized factors such as salinity, the steep redox gradient between the surface (oxidizing) and subsurface (reducing) is a primary factor that shapes the structure of microbial communities. Other authors have commented on the dearth of redox and oxygen measurements in samples collected from black shale well sites [57], and a lack of such measurements is also apparent in the USGS National Produced Waters Geochemical Database [6]. Of the 114943 records in the database, there are only 66 with measurements of dissolved oxygen (O 2 ) or oxidation-reduction potential (ORP), and these are only for conventional or geothermal wells. The multivariate analyses performed in previous studies [11,36,84] identified variables including dissolved inorganic and organic carbon, salinity, pH, and temperature as major environmental drivers of the differences in microbial community composition between injected and produced fluids. Because oxygen or redox measurements of produced waters from unconventional wells were not available, those analyses could not have been used to identify an association between microbial communities and oxidation-reduction state of the fluids. However, our chemical analysis of estimated community proteomes reveals that community structure and redox state should be significantly related.
An important theme of this study is that redox gradients are a nearly ubiquitous feature of microbial habitats.
In the context of human hosts, the exposure of skin to oxygen in the atmosphere, in contrast to internal environments of the gastrointestinal and urogenital tracts, leads to the prediction that skin would have bacterial communities with more oxidized proteins. Our analysis of selected 16S rRNA amplicon datasets supports this prediction, but more of the available datasets from Human Microbiome Project and other studies should be analyzed to quantify these trends with greater confidence. In an environmental context, hydrothermal systems and shale gas wells have relatively large redox gradients owing to geological sources of reductants (i.e., hydrothermal fluids and organic-rich shales). Although stratified water bodies have vertical oxygen gradients driven by microbial respiration, there is no additional geological source of reducing species at depth. It is evident in Fig. 4E that estimated community proteomes for hydrothermal systems and shale gas wells have very large Z C differences, while stratified water bodies have smaller Z C differences. Thus, the magnitude of Z C differences reflects the size of environmental redox gradients. These results therefore provide the foundation for a conceptual bridge between microbial community composition and redox conditions in a wide range of environment types.

Data Availability Statement
All data analyzed in this study were obtained from public databases with accession numbers listed in Table 1 and Table S3. The JMDplots package version 1.2.11 deposited at Zenodo [18] has the scripts used for processing RefSeq proteins (refseq) and 16S rRNA gene sequences (chem16S; all directory names are under inst/extdata in the package) and processed data files including amino acid compositions of taxa compiled from the RefSeq database (refseq/protein refseq.csv.xz), chemical metrics calculated for taxonomic groups (chem16S/taxon metrics.csv), RDP Classifier results (geo16S/RDP), and sample IDs and metadata (geo16S/metadata). The geo16S.Rmd vignette in the package runs the code to make each of the figures and Tables S5-S10.