Iterative Subtractive Binning of Freshwater Chronoseries Metagenomes Identifies over Four Hundred Novel Species and Their Ecologic Preferences

Recent advances in sequencing technology and accompanying bioinformatic pipelines have allowed unprecedented access to the genomes of yet-uncultivated microorganisms from a wide array of natural and engineered environments. However, the catalogue of available genomes from uncultivated freshwater microbial populations remains limited, and most genome recovery attempts in freshwater ecosystems have targeted only a few specific taxa. Here, we present a novel genome recovery pipeline that incorporates iterative subtractive binning, and apply it to a time series of metagenomic datasets from seven connected locations along the Chattahoochee River (Southeastern USA). Our set of Metagenome-Assembled Genomes (MAGs) represents over four hundred genomospecies yet to be named, which substantially increases the number of high-quality MAGs from freshwater lakes and represents about half of the total microbial community sampled. We propose names for two novel species that were represented by high-quality MAGs: “Candidatus Elulimicrobium humile” (“Ca. Elulimicrobiota” in the “Patescibacteria” group) and “Candidatus Aquidulcis frankliniae” (“Chloroflexi”). To evaluate the prevalence of these species in the chronoseries, we introduce novel approaches to estimate relative abundance and a habitat-preference score that control for uneven quality of the genomes and sample representation. Using these metrics, we demonstrate a high degree of habitat specialization and endemicity for most genomospecies observed in the Chattahoochee lacustrine ecosystem, as well as wider species ecological ranges associated with smaller genomes and higher coding densities, indicating an overall advantage of smaller, more compact genomes for cosmopolitan distributions.


Introduction
Freshwater environments represent a major microbial habitat on Earth, hosting an estimated 1.3×10^26 prokaryotic cells worldwide [1,2]. The level of diversity in microbial freshwater communities is orders of magnitude lower than that of other major environments such as soil and seawater [3], making them a tractable but globally important model for studying microbial community ecology. However, the lack of comprehensive sets of reference genomes and low cultivation rates [...] belong to yet-uncultured phyla, with an additional two thirds belonging to uncultured genera, families, or classes [4]. In fact, only a tenth of freshwater microbial cells belong to cultivated species or genera, the smallest cultivated fraction among all major environments on Earth (i.e., environments with over 10^25 [...] (Ganges River) and Greece (Kalamas River) [14,15]. The fraction of the communities captured by these MAGs or other reference genomes is typically moderate to low due to the high diversity of freshwater communities as well as the limitations of the underlying binning methods, which are not optimized for chronoseries datasets from natural habitats but rather for single or small sets of samples from the exact same microbial community. Temporal and spatial series from freshwater ecosystems are even sparser; yet, such data could provide a more complete picture of seasonal and biogeographic patterns of the corresponding microbial communities that are important for human activities.

We introduce here a pipeline for the recovery of MAGs from sets of metagenomes through iterative subtractive binning and apply it to a metagenomic chronoseries from freshwater lakes and estuaries along the Chattahoochee River (Southeastern USA).
The abundance distribution of these population genomes in the meta-community was studied using two methodological innovations: an estimation of relative abundance controlling for completeness and micro-diversity issues in the genomes, and an ecologic preference score controlling for uneven sample representation. The collection of [...]

Additional information on software versions and parameters used is available in Table 1, and additional details are provided in Text S1. [...] and processed typically within 1-4 h, and no more than a day post collection.

Water was sequentially filtered with a peristaltic pump through 2.5 µm and 1.6 µm porosity glass microfiber filters (Whatman), to capture large particles and [...] Sterivex filters (Millipore). Thus, all sequenced metagenomes represent the 1.6-[...] bacterial cells. Those viral metagenomes were included in the binning process, but not in subsequent analyses.

All sequenced metagenomic datasets were subjected to quality control, and those not passing minimum requirements were re-sequenced. Sequencing reads were trimmed and clipped using SolexaQA++ [18] and Scythe. Abundance-weighted average coverage of the datasets was estimated using Nonpareil [19]. A minimum dataset size of 1 Gbp after trimming and 50% coverage were required for all samples in this study (Table S1).

An initial binning methodology was implemented using metadata-dependent grouping of samples to recover high-quality metagenome-assembled genomes (MAGs; Fig. 1, top row). Specifically, we grouped and co-assembled all cell-metagenomic samples from Lake Lanier (34 samples, 120 Gbp in total). The co-[...]

Next, we implemented a strategy to recover MAGs using the complete collection of samples (Fig. 1) [...] MAGs with estimated genome quality above 50 were considered of high quality (see the genome quality definition below), and the first resulting set was labeled WB4. The resulting set of high-quality MAGs (LLD + WB4) was used as a reference database to map reads from all samples (Bowtie), and unmapped reads (SAMtools [29]) were used as input for Mash/MCL clustering, iterating the process described above to produce sets WB5-WBB (Fig. 1). The number of iterations was determined by saturation of phylogenetic breadth and fraction of reads mapping (Fig. 2). Finally, two corrections were implemented targeting groups that typically generate quality underestimations. First, a correction for [...]

1
The quality and taxonomic classification of MAGs were evaluated using MiGA.

Briefly, a composite index of genome quality was used, defined as "Completeness − 5 × Contamination", where both completeness and contamination were determined by the presence and copy number of genes typically found in genomes of Archaea and Bacteria in single copy [21,28]. Taxonomy was determined by MiGA with the NCBI Genome Database, Prokaryotic section (henceforth NCBI_Prok; MiGA Online; Jan-2019) [28]. MiGA also performs a de-replication of the collection by generating groups of genomes with ANI ≥ 95% using ogs.mcl.rb [21,24]. These clusters, analogous to bacterial or archaeal [...]

Finally, most of the analyses described above required a reliable estimation of the relative abundance of genomospecies in each dataset. [...] technical artifacts like non-overlapping assemblies. We applied a novel approach to estimate MAG abundance in metagenomes that sidesteps these limitations.
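As an illustration of the composite quality index quoted above (Completeness − 5 × Contamination, with values above 50 treated as high quality), the following minimal Python sketch computes the score and applies the threshold. It is not the MiGA implementation: the `Mag` record, the function names, and the example values are hypothetical, and completeness and contamination are assumed to be percentages already estimated elsewhere.

```python
from dataclasses import dataclass

# Threshold taken from the text: quality above 50 counts as high quality.
HQ_THRESHOLD = 50.0


@dataclass
class Mag:
    """Hypothetical record for one MAG; not a MiGA data structure."""
    name: str
    completeness: float   # percent (0-100), assumed to be pre-computed
    contamination: float  # percent, assumed to be pre-computed


def genome_quality(completeness: float, contamination: float) -> float:
    """Composite index as quoted: Completeness - 5 x Contamination."""
    return completeness - 5.0 * contamination


def is_high_quality(mag: Mag) -> bool:
    """True when the composite index is strictly above the threshold."""
    return genome_quality(mag.completeness, mag.contamination) > HQ_THRESHOLD


if __name__ == "__main__":
    # Illustrative values only; these are not results from the study.
    for mag in (Mag("bin_a", 95.0, 1.0), Mag("bin_b", 70.0, 6.0)):
        score = genome_quality(mag.completeness, mag.contamination)
        print(mag.name, score, is_high_quality(mag))
```

The factor of 5 penalizes contamination heavily: under this index, a MAG that is 70% complete with 6% contamination scores 40 and falls below the high-quality threshold.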

Two key corrections include (1) truncation of sequencing depth before averaging to exclude highly conserved regions (overestimating depth), regions with gene-content micro-diversity (underestimating depth), and contamination (both); and (2) normalization of sequencing depth by genome equivalents in the metagenome, allowing relative abundance estimates. Note that this approach aims to estimate the relative abundance of the species in the community (i.e., number of cells per total cells), not the more common metric of relative [...]

The authors declare no competing interests.

Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil P-A, et al. A standardized bacterial taxonomy based on genome phylogeny substantially [...]

Figure 1: Diagram of the iterative subtractive binning methodology applied in this study. Input data (bold) and processes are depicted as light grey boxes, data flow as arrows, and output sets of MAGs as dark grey boxes. The initial non-iterative binning of Lake Lanier metagenomes corresponds to the set LLD, and the 8 iterations including all datasets correspond to the sets WB4-WBB. After the iterative approach, two targeted corrections were applied corresponding to WBC (Archaea) and the empty set WBD (CPR). QC stands for Quality Control, and HQ stands for High Quality.

Text S3: Additional methodology, results, and protologues for novel lineages described here: "Ca. Elulimicrobium humile" gen. nov. sp. nov. and "Ca.