The relationship between land cover and microbial community composition in European lakes

Microbes such as bacteria, archaea, and protists are essential for element cycling and ecosystem functioning, but many questions central to the understanding of the role of microbes in ecology are still open. Here, we analyze the relationship between lake microbiomes and the land cover surrounding the lakes. By applying machine learning methods, we quantify the covariance between land cover categories and the microbial community composition recorded in the largest amplicon sequencing dataset of European lakes available to date. We identify microbial bioindicators for these land cover categories. Combining land cover and physico-chemical bioindicators identified from the same amplicon sequencing dataset, we develop two novel similarity metrics that facilitate insights into the ecology of the lake microbiome. We show that the bioindicator network, i.e., the graph linking OTUs indicative of the same environmental parameters, corresponds to microbial co-occurrence patterns. Taken together, we demonstrate the strength of machine learning approaches to identify correlations between microbial diversity and environmental factors, potentially opening new approaches to integrate environmental molecular diversity into monitoring and water quality assessments.

as apparent proxy measurement for the ecological variables they are indicative for (and therefore of use for 54 biomonitoring schemes), but are not necessarily in a biologically relevant relationship with it (Landres et al., versity. To this end, we extracted the relative area covered by different land cover categories in circular areas 143 around the sampling sites of the European lake dataset from both the CORINE land cover (CLC) dataset 144 as well as the OpenStreetMap (OSM) project (see Methods). These two datasets differ in the way they were 145 generated and their categorization of land cover. While the former is derived from satellite data, the latter 146 is annotated in a community-driven manner, based on landscape features observed "on the ground". To 147 distinguish between effects present at shorter or longer geographic ranges, we extracted and analyzed areas 148 surrounding the sampling points within radii ranging from 1 km to 10 km, in steps of 1 km, as well as 25 149 km and 50 km. We then assessed the degree to which the relative sub-areas of the land cover categories con-150 tained in the extracted areas can be used to predict a set of biodiversity metrics calculated for the microbial 151 communities of the sampled lakes using Random Forest models. 152 To calculate biodiversity metrics, the OTU table was rarefied using the rrarefy function from the R   For the bioindicator network, an edge was created between all pairs of bioindicator OTUs that are indica-198 tive of at least one common environmental parameter. Null-hypothesis networks were created the same way 199 based on resampled indicator lists; for these, each lake parameter is assigned the same number of randomly 200 selected OTUs as pseudo-indicators in such a way that the distribution of cardinalities is the same as that of 201 the real OTUs. Node properties (degree, closeness centrality, eigenvector centrality, page rank, and authority 202 score) were calculated using the igraph package in R 4.0.3 (v1.2.6, ref. Csardi and Nepusz, 2006   Our results show that there is, at best, a marginal relationship between land cover and microbial bio-234 diversity, as no combination of radius, biodiversity metric, and dataset results in an R 2 > 0.2 (see figure   235 1A, supplementary figure 1 and 2). For all radii studied here, the lake microbiome's Renyi entropy is most 236 predictable from land cover, followed by species richness. Furthermore, we found no significant difference 237 between R 2 values obtained for the same biodiversity metric at different radii (see figure 1B), indicating that 238 the results presented here are most likely due to statistical artifacts rather than processes that shape micro-239 bial biodiversity based on surrounding land cover. In general, the land cover data collected from the OSM 240 dataset is less predictive for microbial biodiversity than the CLC datasets (see figures 1A and B). Therefore, 241 we focus our further analysis on the CLC dataset. Taken together, these results suggest that if land cover 242 has a structuring effect on lake microbiomes, this relationship is not reflected in alpha diversity metrics.  i.e., the relative area of the land use category in question. On level 1 of CLC category hierarchy, we observe 250 covariation of R 2 > 0.05 for "artificial surfaces (1)" and "agricultural areas (2)" at very low radii as well 251 as increasing covariation for "forest and semi-natural areas (3)" with increasing radii (see figure 2A). In 252  Table 1: Numbers of microbial indicators for land cover categories (with respective radius, left) and physicochemical parameters (right) and the respective R 2 resulting from the covariation framework (results for physico-chemical parameters taken from Sperlea et al. (2021) The covariation observed between the lake microbiome and land cover categories at level 2 of the CLC 255 hierarchy paints a more nuanced picture (see figure 2C). For example, we observe increasing covariation with 256 the lake microbiomes at increasing radii for the land cover categories "arable land (21)" and "scrub and/or 257 herbaceous vegetation associations (32)". In contrast, for "urban fabric (11)" and "inland waters (51)", we 258 observe a peak in covariation at radii of 7 km and 6 km, respectively, with lower R 2 for the other radii.  To identify general spatial trends, we separately calculated the mean covariation of all land cover categories 271 for each radius and land cover hierarchy level. For all but one radius-hierarchy level combination, the average R 2 value is below 0.15 (see supplementary figure 3) and throughout all combinations, the relative standard 273 deviation is close to or higher than 100%. This shows that there are neither general spatial trends nor 274 a generally higher covariation at higher levels of the CLC hierarchy. Instead, to observe specific, radius-275 dependent effects of land cover on the lake microbiome, it is important to differentiate between land cover 276 categories.  The node properties of the bioindicator network are significant and comparable to co-occurrence networks. A. Distribution of multitask bioindicators across numbers of indicated land cover categories. Inset: Distribution of multitask bioindicators for ten or more parameters. B. Correlation between the cardinality of each bioindicator OTU and the node properties of the respective node in the bioindicator network. Black squares: Results for the bioindicator network. Grey box plots: Results for resampled nullhypothesis networks (for details, see methods). For all metrics, the results for the bioindicator network are significantly different from those of the null-hypothesis networks (one-sample t-test, P< 2.2e−16). C.
Comparison of the bioindicator network and established methods for the network inference from OTU tables. Networks are compared by the coefficient of determination, R 2 , between a property of the nodes in the bioindicator network and the network created using a network inference method. As some of the methods create fully connected networks by default, edges with weights smaller than a cut-off were removed; this cut-off ranges between the respective minimum and maximum edge weight in 100 equidistant steps.
bioindicator network and the respective nodes' degree, closeness centrality, eigenvector centrality, page rank, for details). We find that the nodes in the bioindicator network have statistically significant properties (see 313 figure 3B), which suggests a biological relevance of the bioindicator network's structure.

314
Furthermore, we compared the bioindicator network to networks inferred from the original OTU table.

315
More specifically, we asked whether the node properties generated using network inference methods correlate 316 with the node properties of the bioindicator network. We chose this approach to comparing the two network 317 structures as it can capture relative differences between node properties and might thus be robust with regard 318 to global effects of a network method, e.g., a consistently lower degree. Our results show that applying a 319 high cut-off to co-occurrence and checkerboard score similarity matrices results in networks similar to the 320 bioindicator network. In contrast, neither correlation-based nor compositionality-aware methods do so (see 321 figure 3C).

322
The second data structure presents the similarity of pairs of environmental parameters in terms of the  for European lake microbiomes published to date, but still contains ∼1000 times more OTUs than samples. ). The independence of features is, however, necessary for many statistical approaches.

352
In this paper, we study the relationship between lake microbiomes and the land cover surrounding the respectively) indicate that changes in these categories of land cover are not reflected in the microbiome.

369
On a more speculative level, these results could indicate that there are more ecological processes linking 370 the lake microbiome to areas from the former group of land cover categories than to areas from the latter.

371
More notable, the covariation between the microbiome and land cover categories are, generally, lower than pollution generated through land use as, e.g., tire wear on roads and urban areas, but we do not find evidence 376 for this in our analyses. In addition, areas that are too small to appear in the CLC dataset but that have a 377 strong impact on lake ecology, such as a small group of trees at the shore of a lake, are not represented in 378 our analysis.

379
Aggregation of similar parameters is a necessary step for the analysis of ecological processes but comes with