Soil, ocean, hot spring, and host-associated environments reveal unique selection pressures on genomic features of bacteria in microbial communities

Free-living bacteria in nutrient limited environments often exhibit small genomes which curb the cost of reproduction – a phenomenon known as genomic streamlining. Streamlining has been associated with a suite of traits such as reduced GC content, fewer 16S rRNA copies, and a lower abundance of regulatory genes, such as sigma (σ) factors. Here, we analyzed these traits from 116 publicly available metagenomes derived from marine, soil, host associated, and thermophilic communities. In marine and thermophilic communities, genome size and GC content declined in parallel, but GC content was higher in thermophilic communities. In soils, the relationship between genome size and GC content was negative, suggesting a different selection pressure on genome size and GC content in soil bacteria. The abundance of σ-factors varied with average genome size, ecosystem type, and the specific functions regulated by the sigma factor. In marine environments, housekeeping and heat-shock σ-factor genes (rpoD and rpoH respectively) increased as genome size declined, and σ-factor responsible for flagella biosynthesis (fliA) decreased, suggesting a trade-off between nutrient conservation and chemotaxis. In soils, a high abundance of fliA and the stress response σ-factor gene (rpoS) was associated with smaller average genome size and often located in harsh and/or carbon-limited environments such as deserts or agricultural fields – suggesting an increased capacity for stress response and mobility in nutrient-poor soils. This work showcases how ecosystem-specific environmental constraints force trade-offs which are then embedded in the genomic features of bacteria in microbial communities, specifically genome size, GC content, and regulatory genes, and further highlights the importance of considering these features in microbial community analysis.

as represented by the recently described Candidatus Udaeobacter copiosus, a ubiquitous taxon 61 in soils, with a genome size of 2.81 Mbp [25]. 62 Temperature can also influence genome size due to increased fitness of small cells at high 63 temperatures [14]. Accordingly, small cells and smaller genomes are typically associated with 64 higher optimal growth temperatures. This relationship is most pronounced in thermophilic 65 communities [26], but has also been demonstrated in marine systems [27][28][29] and more recently 66 in soils [30]. These patterns between genome size, GC content, and number of 16S rRNA gene 67 copies as a result of temperature-induced genome reduction often resemble patterns in 68 streamlined genomes [14]. 69 approach is to examine genome size on a community level in situ. While metagenome-assembled 84 genomes can provide astounding insights into evolutionary processes [38], assembly can be 85 difficult and computationally expensive, making this approach difficult to apply across a large 86 number of communities. Fortunately, genomic traits like GC content, and average genome size, 87 often can be easily estimated, and do not require an extensive knowledge of the taxa within the 88 community. Here we present a comparison of genomic traits from 118 metagenomes from soil, 89 marine, host-associated, and thermophilic systems. We hypothesize that the average genome size 90 in soil microbial communities will be larger than in marine, host-associated, or thermophilic 91 communities, consistent with findings from isolates [14]. We also predict that average genome 92 size and GC content will be positively correlated in free-living soil, marine, and thermophilic 93 communities, consistent with streamlining. Finally, we predict that while both free-living and 94 host-associated communities with small average genome sizes will demonstrate a low GC 95 number of individuals in a population, which is then divided by the total number of read base-124 pairs to provide an estimate of the average genome size in a metagenome. 125 From IMG/M, we accessed the size of the metagenomic sample (bp), GC-%, total 126 number of 16S rRNA gene copies, and the total number of σ factors identified by the KEGG 127 Orthology database (Table 1 [ To ensure that any observed trends were not heavily influenced by the abundance of 135 Multiple regression was used to determine the relationship between genome size and 158 genomic characteristics -specifically, GC content, 16S rRNA gene relative abundance, the 159 relative abundance of σ-factor genes, and the relative abundance of specific σ-factor variants. 160 Models were constructed with the command lm or lmer from the R (v3.6.1 [66]) package lme4 161 [67]. For each response variable, we constructed multiple models considering all parameters and 162 interactions. Final models were selected using Akaike information criterion (AIC) values. The 163 addition of a new parameter resulting in a reduction of the AIC value by at least 4 indicated a 164 significantly better fit with increased model complexity. 165 To assess the abundance of σ-factor genes between different ecosystems, we used both 166 the multi-response permutation procedure (MRPP) as well as the permutational multivariate 167 analysis of variance (PERMANOVA). The MRPP was conducted using all samples while 168 PERMANOVA was conducted using 11 randomly selected genomes from each ecosystem to ensure balanced design. Both analyses were conducted using Bray-Curtis dissimilarity matrices 170 constructed from the relative abundance of each σ-factor. To visualize differences in the 171 distribution of different types σ-factors between ecosystems we used nonmetric 172 multidimensional scaling (NMDS) on Bray-Curtis distances. MRPP, PERMANOVA and NMDS 173 were done using the vegan package [68] in R (v3.6.1). 174

175
Isolates 176 To compare relationships between genomic characteristics of a microbial community 177 with characteristics of isolates, we accessed over 6,000 isolates of bacteria, archaea, and fungi 178 from the IMG/M system in June of 2020. Isolates were selected if they were (1) publicly 179 available; (2) previously published; (3) sequenced by JGI. Metadata was used to group samples 180 into one of three ecosystem types: soil, marine, thermophilic, or host-associated. To avoid 181 potential bias introduced by large studies selecting for specific taxa, we randomly selected no 182 more than 20 isolates from a single study. Relationships between genomic characteristics were 183 analyzed using multiple regression analyses as described above for the analysis of community-184 level traits. ANOVA was used to assess differences in the distribution of genomic characteristics 185 between isolates and metagenomic averages.

RESULTS 187
Average Genome Size and GC Content 189 Average genome size was significantly different between ecosystems (ANOVA; F 4,111 = 190 135.9, p < 0.01). Specifically, average genome size was higher in soils compared to marine, host-191 associated, or thermophilic communities (Fig. 1a, Tukey's HSD p < 0.01). GC content (%) 192 varied between each ecosystem (ANOVA; F 4,111 = 140.3, p < 0.01), and was highest in soil, 193 followed by thermophilic, host-associated, and then marine communities (Fig. 1b). Overall, GC 194 content and average genome size were positively correlated; however, this trend varied between 195 ecosystems (Fig. 1c). A comparison of multiple models, using AIC values as selection criteria, 196 indicated that GC content was best predicted by average genome size, ecosystem, and their 197 interaction (F 9,106 = 136.1, p < 0.01, Supplemental Table 2). Specifically, GC content was 198 positively correlated with average genome size in marine and thermophilic communities, 199 negatively correlated in soil communities, and not significantly related in host-associated 200 communities (Fig. 1c). The relationship between average genome size and GC content was offset 201 between marine and thermophilic communities: thermophilic communities had a higher GC 202 content than marine communities with the same average genome size (Fig. 1c). The relationship 203 between GC content and average genome size was strongly driven by the abundance of archaea 204 in the mixed thermophilic samples (Supplemental Fig. 2). In soils, average genome size and GC 205 content were significantly different between ecosystem types (ex. Deserts, grasslands, forests; 206 ANOVA: Mbp -F 7,38 = 24.35, p < 0.01; GC-% -F 7,38 = 4.986, p < 0.01; Fig. 2). 207 The average genome size and GC content of the metagenomes fell within the range of 208 isolates from each ecosystem (Supplemental Fig. 3). However, the mean genome size and GC content derived from metagenomes varied from isolates in both soil and thermophilic 210 environments (ANOVA; p < 0.05), but not in marine environments. 211

16S rRNA gene copies and Sigma factors 213
Host-associated communities had the highest number of 16S rRNA gene copies per 214 genome, followed by soils and then thermophilic and marine communities (Supplemental Fig. 4). 215 A comparison of AIC values indicated that ecosystem type alone was the best predictor of 16S 216 rRNA gene copies per genome (Supplemental Fig. 4, Supplemental Table 2). 217 The relative abundance of σ-factors genes per metagenome changed with estimates of 218 average genome size, however this relationship varied significantly between ecosystems ( Fig. 4a; 219 Supplemental Table 2). Average genome size was significantly correlated with the relative 220 abundance of σ-factors in thermophilic environments (R 2 = 0.49), but not in soil, marine, or host 221 associated environments (R 2 < 0.2; Fig. 4a). The distribution of σ-factor types within a 222 metagenome varied more between ecosystems than within (Fig 3; Fig. 4b; MMRP, A = 0.34, p < 223 0.01), and ecosystems differed significantly (Fig 3; Fig. 4b; PERMANOVA, R 2 = 0.50, p < 224 0.01). 225 The relationship between average genome size and the relative abundance of individual 226 σ-factors was dependent on both ecosystem type and the type of σ-factor (Fig. 4c, Supplemental 227 Table 3). In host-associated communities, the relative abundance of only one σ-factor, sigH, was 228 significantly (P = 0.018) negative correlated with average genome size. Abundance of all other 229 sigma factors were unchanged with genome size in host-associated communities (Supplemental 230   Table 3). In soil communities the relative abundance of rpoH per metagenome significantly 231 increased (P < 0.01) with larger average genome size, while the relative abundance per metagenome of rpoS, sigH, sigB, and fliA decreased (P < 0.01). In marine communities, we 233 found that the relative abundance of fliA, rpoE, and sigH significantly increased (P < 0.01) with 234 genome size, and the abundance of rpoH, and rpoD significantly decreased (P < 0.01). Due to 235 the small samples size of thermophilic communities, we did not include the relationships 236 between σ-factors and average genome size for thermophilic environments; however, correlation 237 coefficients and statistics for all linear regressions between average genome size and σ-factor 238 abundance for each ecosystem can be found in Supplemental Table 3 Our hypotheses were that: 1) average genome size in soil microbial communities would 245 be higher than in marine, thermophilic, and host-associated microbial communities; 2) GC 246 content would be positively correlated with the average genome size of free-living (not host-247 associated) microbial communities; and 3) host-associated microbial communities exhibiting 248 smaller average genome sizes, although AT-rich, will lack other traits associated with 249 streamlining, such as a reduction in regulatory σ-factors. The first and third hypotheses were 250 generally supported, however the second hypothesis was rejected since trends between genomic 251 traits varied between environments. 252 As expected, microbial communities in marine, host, and thermophilic environments had 253 a smaller average genome size than those in soil, consistent with previous findings from studies 254 using bacterial isolates and single-amplified genomes [8, 11, 69]. Since smaller genomes tend to 255 have lower GC content [70], we expected to find a positive correlation between GC content and 256 average genome size for each ecosystem. Surprisingly we only found this relationship in marine 257 and thermophilic communities. In thermophilic communities, this relationship appeared 258 confounded with the presence of archaea, thus making it impossible to distinguish between 259 archaeal abundance or streamlining as a driver for smaller genome size in these extreme 260 environments. It is worth noting that the relationship between genome size and GC content in 261 thermophilic communities was offset higher from marine systems, even for bacterial dominated 262 thermophilic communities. This offset is likely the result of a requirement for thermal stability in 263 hot environments which is provided by the GC triple-hydrogen bonds versus the AT double-264 bond [71, 72].
Both GC content and average genome size in host-associated communities were low, a 266 common feature of symbiotic bacteria [32]. Although host-associated bacteria in small 267 populations often have AT-rich genomes [9], the relationship between GC content and average 268 genome size was not significant for host-associated communities. Reduced genetic flow in these 269 communities could mean that changes in nucleotide frequency and genome size develop 270 independently in populations. Therefore, these trends might exist within, but not between, 271 communities. In other words, host-associated environments might produce small AT-rich 272 genomes, but these two traits do not covary between communities. 273 Soil communities exhibited a negative relationship between average genome size and GC 274 content. This does not necessarily exclude streamlining as a driver of genome size in soils but 275 suggests other drivers of genome size and GC content. For instance, fungal reads may reduce the 276 overall GC content of a metagenome while raising estimates of average genome size. We found 277 that fungal isolates generally have a lower GC content than bacteria with a similarly sized 278 genome (Supplemental Fig. 6). Although we attempted to avoid the influence of fungal genomes 279 by limiting our dataset to metagenomes which were dominated by bacteria, it is possible that 280 even a low abundance of large fungal genomes affected our estimates. 281 Another explanation is that soil microbial communities in nutrient limited environments 282 select for smaller genomes with a higher GC content, which may be advantageous when carbon 283 is limited [73]. A GC basepair has carbon to nitrogen ratio of 9:8 while an AT basepair has a 284 ratio of 10:7. A reduction in GC content therefore decreases the amount nitrogen required for 285 DNA synthesis, which has been suggested as an explanation of the low GC content that is 286 commonly exhibited in marine systems, where nitrogen is limiting [74]. Similarly, small 287 genomes in soil, where C is generally limiting [75,76], might preferentially select for GC rich DNA. In this dataset, communities from deserts, agricultural fields, and grasslands had a smaller 289 average genome size and higher GC content (Fig. 2). These environments tend to have lower soil 290 and microbial carbon to nitrogen ratios than forests [77]. Similarly, bacterial communities in 291 forests tended to have larger average genome sizes and lower GC content. Other environmental 292 factors fall along this gradient which might also influence GC content, such as temperature and 293 moisture, which has been shown to influence nucleotide composition in terrestrial plants [78]. 294 Our work shows a potential connection between nucleotide frequency and ecosystem properties, 295 emphasizing the need to develop a more complete understanding of genomic features across soil 296 microbial communities. 297 We did not find that the relative abundance of σ-factors was associated with average 298 genome size. However, we did observe that marine communities maintained a lower abundance 299 of σ-factor gene copies in comparison to other ecosystems, even when average genome size was 300 comparable. One explanation is that the reduction of σ-factor gene copies is particularly effective 301 in reducing reproductive costs in marine systems. Marine systems are considered to be nutrient 302 poor relative to soils and a general reduction in the proportion of σ-factors in bacterial genomes 303 may function as an adaptation to nutrient constraints. We also found many trends between 304 average genome size and the abundance of specific σ-factor genes in marine communities. In 305 marine metagenomes, the relative abundance per genome of rpoD and rpoH, which encode for 306 σ D and σ H respectively, was negatively correlated with average genome size. These trends are 307 perhaps caused by the abundance of the streamlined SAR11 clade, which only contain σ D and σ H 308 [24]. Conversely, the abundance of the gene fliA, which encodes for the σ 28 and regulates flagella 309 biosynthesis [79], increased with average genome size. This relationship tracks a copiotroph-streamlined, individuals while increased nutrient availability selects for larger individuals 312 capable of chemotaxis [17,80]. 313 In soils, the relative abundance of many σ-factors were negatively correlated with 314 estimates of average genome size. Most notably, we observed a decrease in the relative 315 abundance of rpoS (σ S ) but no significant change in the abundance of rpoD (σ D ) with increasing 316 average genome size. The balance between rpoS and rpoD may be a trade-off between stress 317 tolerance and growth [81,82]. A higher ratio of rpoS to rpoD has been shown to increase the 318 cell's capacity to cope with stress but limit its ability to grow on a variety of carbon sources [82-319 84]. We see this reflected in the environments from which the metagenomes were samples, with 320 microbial communities from high stress environments, such as deserts, having a higher 321 abundance or rpoS compared to lower-stress carbon-rich environments, such as forests 322 (Supplemental Fig. 7). 323 Surprisingly, we found a high abundance of fliA gene copies in soil communities with 324 smaller genomes, several of which were sourced from desert environments. Motility may be 325 more valuable in nutrient limited environments, whereas in environments with high nutrient 326 inputs, nutritional competency may be more paramount. However, these results contrast with the 327 commonly held notion that chemotaxis is most prevalent in mesic soils. One explanation is that 328 motility may be especially important when water availability is ephemeral. A greater number of 329 regulatory mechanisms would therefore be advantageous as it would allow for a rapid response 330 to periodic pulses of moisture. Another possibility is that bacteria utilize biofilms surrounding 331 fungal hyphae, or "fungal highways" [85], which could explain the persistence of flagellated