Two dimensions of chemical variation of the human microbiome across body sites and in COVID-19 patients

A better understanding of dysbiosis is a major goal of human microbiome studies, but more knowledge about chemical effects on microbial communities is needed. Oxidation-reduction and hydration-dehydration reactions are chemical processes that are important for physiological functions and, it is hypothesized here, may also inﬂuence the elemental composition of microbial proteins. Chemical metrics of biomolecules relevant to these processes are carbon oxidation state ( Z C ) and stoichiometric hydration state ( n H 2 O ). I calculated these metrics for protein sequences derived from microbial genomes (multiplied by 16S rRNA-based taxonomic abundances to obtain community reference proteomes), shotgun metagenomes, and metaproteomes. Metaproteomes of gut communities are reduced (i.e., have lower Z C ) compared to oral communities. In contrast, community reference proteomes have lower n H 2 O in gut compared to nasal, skin, and oral communities, and metagenomes for gut and oral communities exhibit the same trend. The chemical differences for metaproteomes may be explained by physiological adjustment of protein expression levels to anaerobic, reducing conditions in the gut, whereas metagenomes and reference proteomes may reﬂect evolutionary adaptation to dehydrating conditions brought on by intestinal absorption of water. Community reference


Introduction
The microbiota that inhabit human body sites are major contributors to health and disease [1]. The gut has the largest population of bacterial cells, which aid digestion by providing metabolic functions that are absent from the human genome [2]; moreover, a healthy gut microbiome benefits the host by producing anti-inflammatory molecules and has a barrier function against pathogen infection [3,4]. The taxonomic composition of the healthy microbiome varies across body sites [5] and is disrupted in some diseases, a condition known as dysbiosis. Coronavirus disease 2019 (COVID-19) is primarily a respiratory disease, but the causative agent (SARS-CoV-2) also infects the gastrointestinal tract [6], and population-based studies have revealed dysbiosis of the gut microbiota in COVID-19 patients [7]. Dysbiosis associated with  has been proposed to impair oxygen sensing and contribute to hypoxemia (i.e., blood oxygen concentrations that are lower than the normal physiological range) [8]. However, establishing causal linkages between dysbiosis and disease is challenging [1], and new insight is needed into the interactions between microbial communities and host-associated environments.
One promising but underexplored approach is based on patterns of elemental abundance. For instance, environmental oxygen levels are potential drivers of elemental variation of protein sequences [9,10]. This is relevant from a clinical standpoint because oxygen concentration differs among anatomical sites in normal conditions and is altered in several diseases [11,12]. The skin and nares are sites with more aerobic niches compared to interior sites such as the oral cavity and gut [10], and O 2 concentrations in physiologically normal tissues are lower than those in standard culture conditions [13]. Geochemical variables in the external environment influence the evolution and assembly of free-living communities, but how analogous variables in the internal environment of the body may affect host-associated communities has received far less attention [14]. Therefore, several remarks on the conceptual background for a chemical analysis of multi-omics data are timely.
First, the eco-evolutionary context of different meta-omics techniques is crucial.
Communities assemble and mature at timescales of human lives, but microbial lineages have co-evolved with the ancestors of modern humans for millions of years [15,16]. It follows that the chemical differences of proteins encoded by genomes and metagenomes provide insight into adaptation on evolutionary timescales, while metaproteomes reflect changes that occur over a range of timescales. That is, protein expression levels depend on host and microbial physiology, but the possible protein sequences are still constrained by evolution.
Second, the structure of ecosystems can be represented in terms of not only taxonomic but also chemical attributes [17]. Many previous studies have performed taxonomic and functional characterization of metagenomic DNA sequences but have left out a chemical analysis. Compared to DNA, proteins have a richer repertoire of chemical variability (20 amino acids compared to 4 nucleotides); in addition, because of their abundance in cells, proteins rather than DNA make a larger contribution to cellular energy demand due to environmental variation. Therefore, all multi-omics data in this study are presented in terms of chemical metrics of proteins.
Third is the hypothesis that protein sequences have evolved toward incomplete metastable equilibrium in an environmental context during microbe-host coevolution over millions of years. I refer to this as the chemical adaptation hypothesis. This hypothesis acknowledges two key features of biomolecular systems: 1) metastability: in most environments (ultramafic-hosted submarine hot springs are one exception [18]), proteins are unstable with respect to amino acids and small inorganic molecules, so the comparisons here are limited to metastable protein molecules compared with each other, and 2) incomplete equilibrium: the postulated thermodynamic effects are just one of many factors that structure microbial ecosystems; other factors include enzyme function, kinetics, metabolic networks, and community assembly. Accordingly, the chemical adaptation hypothesis taken on its own only states how environmental gradients, in the absence of other effects, may explain chemical differences among protein sequences.
Fourth and finally, the choice of chemical variables to analyze should be guided by biological considerations. Redox state, which includes not only oxygen tension but other measures of redox disequilibria, is an important variable for host and microbial physiology; as emphasized by Pfister et al., "the redox environment underpins all microbial metabolisms and should be a key component of research across systems" [14]. A previous analysis of the human microbiome was made in terms of O content and C:O ratios of proteins [10]. However, the oxidation state of carbon in organic molecules is affected not only by C-O but also by C-H and C-S bonds [19], and an informative metric based on this reasoning is the average oxidation state of carbon (Z C ) [20]. Next, I identified hydration state as another important variable. Water is a substrate for numerous metabolic reactions (more than even oxygen) [21], tissue water content in mammals decreases during development from embryo to adult [22,23] and differs between tumors and normal tissue [24], eukaryotic cells gain or lose water during transitions to cellular proliferation and dormancy [25,26], intestinal absorption of water from chyme is a major function of the mammalian gut [27], and some microbes can tolerate extremes of hydrostatic pressure and osmotic strength, both of which affect biomolecular hydration state monitored by biophysical techniques [28]. In contrast to detecting physical interactions of water molecules, stoichiometric hydration state (n H 2 O ) is a chemical metric that is calculated by balancing chemical reactions using the elemental composition of proteins (see below and [29]).
Associations between proteomic n H 2 O and the environment or physiology are evident in various systems, including 1000 km-scale salinity gradients, cancer compared to normal tissue, and growth stages of Bacillus subtilis biofilms [29][30][31].
Research in geochemical biology investigates the chemical and thermodynamic links between geochemical environments and genome evolution [32]. As an extension of geochemical biology, this study asks whether chemical conditions in human body sites and disease may influence the eco-evolutionary and physiological processes of microbial communities. Thermodynamics predicts that reducing conditions favor the evolution of reduced molecules and that dehydrating conditions favor the evolution of dehydrated molecules. The first aim of this study is to test these predictions by comparing chemical metrics for proteins calculated from multi-omics data with the chemical conditions of body sites. These outcomes of these tests establish the relevance of the chemical adaptation hypothesis to the human microbiome. This justifies the second aim of the study, which is to calculate chemical metrics for proteins in COVID-19 patients in order to infer changes in the internal chemical environment of the body.

Two chemical metrics
Average oxidation state of carbon (Z C ) in organic molecules in general can be obtained by counting electrons in Lewis structures [33]. For molecules in which the heteroatoms O and S are only bonded to C or H and not to one another (this includes the standard amino acids and primary sequences of proteins but not some post-translational modifications such as disulfide bonds, none of which are considered here), Z C can be calculated from the elemental formula [20]. Counting water molecules, instead of electrons, is needed to calculate hydration state, so its formulation requires writing chemical reactions. Stoichiometric hydration state (n H 2 O ) is the number of water molecules in a theoretical reaction to form a protein from thermodynamic components, normalized by the number of amino acid residues. This definition leverages the notion of thermodynamic components in order to project a vector of elemental composition (for the five elements C, H, N, O, and S) into a vector of chemical composition with the same number of components. The specific components -also known as basis species -used in this study are glutamine, glutamic acid, cysteine, H 2 O, and O 2 , which have been referred to as the QEC basis species [29]. The inclusion of amino acids rather than simple inorganic species permits a better separation of trends of oxidation and hydration state among proteins in representative genomes [31]; these particular amino acids were chosen because they are relatively abundant and highly connected metabolites and yield a relatively low covariation between n H 2 O and Z C [29].
I calculated chemical metrics of reference proteomes for genera in the "List of Prokaryotes according to their Aerotolerant or Obligate Anaerobic Metabolism" taken from Table S1 of [34]. There is significantly higher Z C for aerotolerant organisms compared to obligate anaerobes, but no significant difference for n H 2 O (Fig.   1A). This shows that evolutionary adaptation to oxygen is reflected more strongly in changes of oxidation state than hydration state of proteins.
Differences between metaproteomes and community reference    Table S1). Delta values denote mean differences of Z C or n H 2 O between anaerobic and aerotolerant genera. (B) Comparison of chemical metrics for metaproteomes and community reference proteomes. Theoretical 1:1 correspondence is indicated by dashed gray lines, and the number of samples in each dataset is listed in the legend.
a phylogenetically consistent taxonomy [35]. In this study, the GTDB was used to retrain the RDP Classifier and to generate reference proteomes for taxa (see Methods). Taxonomic classifications from high-throughput 16S rRNA data were then combined with the reference proteomes for all identified taxa in order to obtain the amino acid compositions of community reference proteomes.
I analyzed several published datasets that provide both 16S rRNA gene sequences and metaproteomes for the same samples. The higher Z C for metaproteomes than for community reference proteomes ( Various studies − see Table 1 Community reference proteomes (controls in COVID-19 studies) 22 Zuo'20 Proteins from metagenomes (controls and COVID-19 patients)   (Table 1). (D) Proteins predicted from metagenomes for controls and COVID-19 patients: nasal [37], oral [38], and gut [39]. (E) Metaproteomes for controls and patients in non-COVID-19 studies: ulcerative colitis [40] and dietary resistant starch [41] for gut microbiomes and oral cancer [42] and lung cancer [43] for oral microbiomes. Each point represents a sample, except for (C), where each point represents the mean of samples in a single dataset. For visual comparison, the dashed triangle representing the convex hull around the points in (A) is replicated in (C-E). p-values on the horizontal and vertical axes were calculated for Z C and n H 2 O , respectively. p-values in (B) were calculated using the Wilcoxon paired test for samples from all body sites; p-values in other panels were calculated using the Wilcoxon unpaired test comparing data for gut and oral or oropharyngeal samples only. large spread in n H 2 O of metaproteomes that is not captured by community reference proteomes.

Chemical variation of the human microbiome across body sites
The 16S rRNA dataset of Boix-Amorós et al. [36] includes samples from four body sites (nasal, skin, oral, and fecal communities); furthermore, community DNA from each body site was sequenced before and after treatment with viral inactivation by ethanol, formaldehyde, heat, psoralen, or trizol. Considering reference proteomes for untreated communities, oral and nasal sites exhibit the lowest and highest ranges of Z C compared to other body sites, and the skin and gut have intermediate Z C whereas the gut has lower n H 2 O than other body sites ( Fig. 2A). Ethanol treatment is associated with a shift to lower n H 2 O compared to untreated samples for communities from most body sites except the mouth (Fig. 2B). A milder signal of dehydration in community reference proteomes is apparent for heat and trizol treatment, but not for formaldehyde or psoralen.
Next, I asked whether these trends are recapitulated in independent 16S rRNA datasets from other studies as well as metagenomes and and metaproteomes. I calculated chemical metrics for community reference proteomes derived from 16S rRNA data for control subjects in COVID-19 studies (sources of data are listed Table 1).
The few datasets for gut communities that fall outside of the triangular area that circumscribes values for the Boix-Amorós et al. dataset exhibit relatively low n H 2 O (Fig.   2C). Furthermore, proteins coded by metagenomes for various body sites of controls and COVID-19 patients largely overlap the range of chemical metrics for community reference proteomes; the main exception is again lower n H 2 O of gut communities ( Fig. 2D). Finally, selected metaproteomes for gut communities have significantly lower Z C than those from oral communities ( Fig. 2E; none of these studies were for COVID-19). The distinct ranges n H 2 O of gut metaproteomes from two studies [40,41] may be due to technical differences, so the apparently significant difference from oral metaproteomes is excluded from further discussion.
Chemical variation of the microbiome in COVID-19 patients I analyzed publicly available 16S rRNA gene sequences for oral or oropharyngeal, nasopharyngeal, and gut (i.e., fecal) samples in COVID-19 studies (Table 1). For the naso-and oropharyngeal/oral studies, various datasets exhibit either positive or negative mean differences of chemical metrics of community reference proteomes between COVID-19 patients and controls, with no clear trend in the data (Fig. 3A).
Although the majority of datasets (6 of 9) for oropharyngeal communities show lower mean Z C in COVID-19 patients, the difference is not significant for all datasets combined. In contrast, a large majority of datasets (10 of 13) for gut communities exhibit a lower mean Z C in COVID-19 patients and the difference is significant. It is also notable that the most extreme points correspond to large negative differences; four gut datasets have DZ C < −0.003 but no gut dataset has DZ C > 0.003.
Metaproteomic data for gut communities in controls and COVID-19 patients were reported by He et al. [71]. Carbon oxidation state for gut metaproteomes is significantly lower in COVID-19 patients than controls when considering only bacterial sequences, and marginally lower with consideration of all mapped UniProt sequences (this includes bacterial and human sequences) (Fig. 3B).

Discussion
Metagenomes and community reference proteomes (which are derived from genomes) only indicate the sequences of possibly expressed proteins, but metaproteomes identify some of the proteins that are actually present in a sample. The analysis of metaproteomes is not necessarily the best choice for all purposes in biology, because each -omics technique provides a different representation of biological processes. Specifically, the chemical metrics calculated from (meta)genomically coded proteins represent chemical differences that occur due to evolution and ecology, while metaproteomes include the effects of both eco-evolutionary and physiological processes.
On the whole, community reference proteomes and proteins predicted from metagenomes show similar ranges of Z C and n H 2 O (see Fig. 2D and [74]), but metaproteomes are relatively oxidized (Fig. 1C-E and Fig. 2E). Two factors may explain this result. First, at least in Escherichia coli, cytoplasmic proteins are more abundant than membrane proteins [75]. Second, some metaproteomic workflows are biased against extraction and detection of membrane proteins [76]. Because membrane proteins tend to be enriched in hydrophobic amino acid residues, they have  lower Z C than cytoplasmic proteins [20], and a low proportion of membrane proteins in cells or technical biases against their detection would both contribute to higher metaproteomic Z C compared to community reference proteomes and metagenomes.

Body sites
The main findings of this study are summarized in Fig. 4. In the first chemical dimension, the finding that community reference proteomes for skin and nasal communities are more oxidized than those for oral and gut communities (see also Fig. 2A) is congruent with previous analyses of metagenomic data from the Human Microbiome Project (HMP) by this laboratory [74] and another group [10]. This finding supports the notion that oxygenated habitats on the skin and in the airway are environmental filters for the evolution and growth of bacteria with genomes that code for relatively oxidized proteins. Community reference proteomes for oral communities are more reduced than those for the gut, but metaproteomes show the opposite trend ( Fig. 2A and E). Although oxygen is depleted locally in some oral habitats, the lumen of the large intestine has the lowest oxygen concentrations within the body [13,14]. It therefore appears that low oxygen concentrations in the gut may have a larger chemical effect on protein expression (detected by metaproteomes) than on protein evolution (represented by community reference proteomes and metagenomes).
In the second chemical dimension, community reference proteomes are more dehydrated for gut communities compared to other body sites and are more dehydrated for samples treated with ethanol compared to untreated samples. The intestine has a physiological function of water absorption [27], and the dehydrating effect of ethanol is described in protocols for preparing tissue sections for histological analysis [77]. In line with metagenomic trends in salinity gradients [29], these results supply additional evidence for chemical adaptation of proteins at evolutionary timescales to dehydrating forces in microbial habitats.

COVID-19
It was previously hypothesized that dysbiosis in inflammatory bowel disease (IBD) may be driven by higher oxygen levels in the gut [12]. Likewise, gut communities in HIV patients exhibit higher abundances of aerotolerant organisms [78], and dysbiosis in several pathological conditions (but not COVID-19) has been linked with greater availability of electron acceptors [14]. In contrast to the oxidative trend that has been described for dysbiosis in IBD and HIV, the analysis of multi-omics datasets for COVID-19 in this study indicates a shift toward more reduced proteins. According to the chemical adaptation hypothesis, environmental selection for genomes with more reduced proteins requires relatively reducing, rather than oxidizing conditions. Chemical analysis of multi-omics datasets for IBD and HIV is beyond the scope of this study, but should be performed to determine whether trends of protein Z C for dysbiotic communities in those conditions differ from COVID-19.
These findings have important implications for understanding how gut communities respond to pathological oxygen concentrations. The intestine maintains a steep radial gradient of O 2 between the vascularized mucosa and anaerobic lumen [79]. If the radial O 2 gradient in the intestine is sensitive to blood oxygen levels, then the occurrence of hypoxemia in some COVID-19 patients may help explain the identification of relatively reduced proteins in multi-omics datasets for gut communities.
However, electron acceptors other than oxygen should also be examined for a more complete understanding of the effects of the redox environment on gut dysbiosis [14].
In the second chemical dimension, proteins coded by MAGs exhibit lower n H 2 O for gut communities in COVID-19 patients. This provides evidence for chemical adaptation through eco-evolutionary processes to dehydrating conditions, but it is too early to tell if chemical variation of the gut microbiome can be mechanistically connected to the pathophysiology of COVID-19 that may include whole-body dehydration [80,81]. The absence of this trend from community reference proteomes suggests possible contributions from horizontal gene transfer or species-or strainlevel variability that cannot be detected by taxonomic classification of 16S rRNA sequences.

Limitations
Micro-scale gradients of oxygen concentration may be key determinants of microbiome composition [14], but the macro-scale datasets analyzed in this study do not permit resolving these features. Furthermore, the analysis in this study only compares controls and COVID-19 patients without classification according to disease severity (although some source datasets include metadata for disease severity), and no longitudinal data were analyzed here.

This analysis inherits other limitations intrinsic to the source studies, including
all types of bias that affect amplicon-based, metagenomic, and metaproteomic studies [76,82]; no corrections for bias were applied. Furthermore, cross-comparisons between different data types, such as metaproteomes and metagenomes, compounds the bias intrinsic to each experimental technique.
This study does not provide clinical guidance. Other research is needed to reveal how dietary or other interventions may alter the body's internal chemical environment and interact with microbiota (e.g. [83]).

16S rRNA data processing
Publicly available 16S rRNA gene sequencing datasets were downloaded from the NCBI Sequence Read Archive (SRA). FASTQ sequence files for paired-end sequences were merged using the "fastq_mergepairs" command of VSEARCH version 2.15.0 [84]. Quality filtering was done with maximum expected error rate of 0.005 and a sequence truncation length specific for each dataset (see Table S1 for details and sequence processing statistics). Singletons were removed and remaining sequences were subsampled at a depth of 10000. Reference-based chimera detection with the VSEARCH command "uchime_ref" was performed using the SILVA SSURef NR99 database version 138.1 [85].
To generate reference proteomes of taxa, species representatives for bacterial and archaeal genomes with at least 500 proteins were taken from the GTDB (release 207). Taxonomic classification was performed using the RDP Classifier version 2.13 [86] retrained with the GTDB database (bac120_ssu_reps and ar53_ssu_reps in GTDB release R07-RS207). The lowest-level taxonomic classification (from genus to phylum) for each sequence was used; the numbers of taxonomic classifications were multiplied by the amino acid compositions of reference proteomes for taxa and summed to obtain the amino acid composition of the community reference proteome.

Metaproteomic data processing
For datasets with an available study-specific protein sequence database, amino acid  Table S2 of [92].
Community reference proteomes were based on 16S rRNA data from publications listed above with the exception of [93] for Manus Basin Active Chimneys.

Metaproteomes for body sites and COVID-19 gut samples
For the saliva microbiome from [42], Majority Protein IDs and LFQ intensity for saliva cells were taken from  [95], and partial protein sequences were predicted using FragGeneScan version 1.18 [96]. The amino acid compositions of all protein sequences in each run were summed and used to calculate chemical metrics. Dataset accession numbers and processing statistics are listed in Table S2.

Metagenome-assembled genomes for COVID-19 patients and controls
To analyze the MAGs generated by Ke et al. [72], which are based on metagenomic data originally reported in [39] and [73]  For each MAG, protein sequences were predicted using Prodigal [97], and the amino acid compositions of all predicted proteins were summed and used to calculate chemical metrics.
Chemical metrics Z C was calculated from where n Gln , n Glu , n Cys , n H 2 O , and n O 2 are stoichiometric coefficients of a theoretical reaction to form a protein with formula C c H h N n O o S s , and the stoichiometric matrix represents the number of elements in each of the basis species (glutamine, glutamic acid, cysteine, water, and oxygen). Amino acid compositions of community reference proteomes, metagenome-coded proteins, or metaproteomes were combined with precomputed values of Z C and n H 2 O for amino acid residues (see [29]) in order to calculate the respective chemical metrics for proteins.

Statistics
Wilcoxon tests were used to calculate p-values. Tests were made for unpaired observations except for untreated and viral-inactivated samples from the same subject and body site (Fig. 2B) and means for COVID-19 and control samples from the same study (Fig. 4A), for which paired Wilcoxon tests were used. p < 0.05 was considered significant. Table S1. 16S rRNA processing statistics. Table S2. Metagenome processing statistics.

Ethics approval and consent to participate
Not applicable.