Abstract
Microbes are the main drivers of biogeochemical cycles in oceans and lakes, yet surprisingly few bacterioplankton genomes have been sequenced, partly due to difficulties in cultivating them. Here we used automatic binning to reconstruct a large number of bacterioplankton genomes from a metagenomic time-series from the Baltic Sea. The genomes represent novel species within freshwater and marine clades, including clades not previously genome-sequenced. Their seasonal dynamics followed phylogenetic patterns, but with fine-grained lineage-specific adaptations. Signs of streamlining were evident in most genomes, and estimated genome sizes correlated with abundance variation across filter size fractions. Comparing the genomes with globally distributed aquatic metagenomes suggested the existence of a global brackish metacommunity whose populations diverged from freshwater and marine relatives >100,000 years ago, hence long before the Baltic Sea was formed (8000 years ago). This contrasts markedly with most Baltic Sea multicellular organisms, which are locally adapted populations of freshwater or marine counterparts.
Microorganisms in aquatic environments play a crucial role in determining global fluxes of energy and turnover of elements essential to life. To understand these processes through comprehensive analyses of microbial ecology, evolution and metabolism, sequenced reference genomes of representative native prokaryotes are crucial. If these are obtained from isolates, the encoded information can be complemented by phenotypic assays and ecophysiological response experiments to provide insights into the factors that regulate the activity of these populations in particular biogeochemical processes. However, obtaining and characterising new pure cultures is invariably a slow process, even with recent advances in high-throughput dilution culturing approaches 1. Most notoriously, the highly-abundant slow-growing oligotrophic lineages typical of pelagic environments 2,3 remain severely underrepresented in current culture collections 4.
Metagenomics offers a potential shortcut to much of the information obtained from pure culture genome sequencing 5,6. The last decade’s revolution in DNA sequencing throughput and cost has provided researchers with the unprecedented possibility of obtaining sequences corresponding to thousands of genomes at a time without the need for isolation, cultivation or enrichment. However, despite vast amounts of sequence data allowing inferences on global distribution of phylogenetic lineages and metabolic potentials 5–8, the issue of structuring the data into genomes has remained. This is critical because, while individual genes or genome fragments provide useful information on the metabolic potential of a community, in practice most biochemical transformations take place inside a cell, involving sets of genes structured in controlled pathways. Gaining insight into these pathways requires understanding which genes coexist inside individual microbes. Furthermore, reconstructed genomes from naturally abundant microbes serve as references that allow high-quality annotations to be made in environmental sequencing efforts where otherwise a majority of sequences would remain unclassified 7,8. Single-cell sequencing has emerged as a very powerful approach to obtain coherent data from individual lineages 3,9,10. This approach allows researchers to select certain targets of interest, based on e.g. cell characteristics or genetic markers, to address specific research questions 9,11,12. However, single-cell sequencing requires a highly specialized laboratory facility, and single-amplified genomes (SAGs) typically have low genome coverage, due to the small amount of DNA in each cell and associated whole-genome amplification biases 13.
While metagenomics can offer, in principle, unlimited amounts of starting material and little amplification bias, it has until recently been impossible to automatically reconstruct full genomes from the mass of genome fragments (contigs or scaffolds) generated from complex natural communities. Approaches based on sequence composition, e.g. tetranucleotide frequencies, have been successfully used to reconstruct near-complete genomes from metagenomic contigs without the use of reference genomes, but can generally only discriminate down to the genus-level 14,15. More recently, coverage variation across multiple samples has been used, allowing binning down to species and sometimes strain level 16–19. At the same time as genomes are reconstructed, the abundance distribution of these genomes across the samples is obtained, allowing ecological inferences. The CONCOCT (Clustering of contigs based on coverage and composition) program does automatic binning using a combination of these two data sources and was shown to give high accuracy and recall on both model and real human gut microbial communities 20, but has not yet been applied to aquatic communities.
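The two data sources mentioned above, sequence composition and coverage across samples, can be combined into one feature vector per contig. The sketch below only illustrates that idea and is not CONCOCT's implementation; the function names and the simple normalisation are assumptions made for this example.

```python
from collections import Counter
from itertools import product

# All 256 tetranucleotides in a fixed (lexicographic) order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(seq):
    """Relative frequency of each tetranucleotide in a contig.

    Windows containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)
    return [counts[k] / total if total else 0.0 for k in KMERS]


def contig_features(seq, coverages):
    """Concatenate a normalised coverage profile with composition.

    `coverages` holds the contig's mean coverage in each sample
    (hypothetical input; one value per sample in the time-series).
    """
    cov_total = sum(coverages)
    profile = [c / cov_total if cov_total else 0.0 for c in coverages]
    return profile + tetranucleotide_frequencies(seq)
```

Contigs from the same genome are expected to have similar vectors, so a clustering algorithm applied to these features can group them into bins.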
The Baltic Sea is, in many aspects, one of the most thoroughly studied aquatic ecosystems 21. It presents unique opportunities for obtaining novel understanding of how environmental forcing determines ecosystem structure and function, due to its strong gradients in salinity (North-Southwest), redox (across depths) and organic and nutrient loading (from coasts to center), as well as pronounced seasonal changes in growth conditions. 16S rRNA gene-based studies have revealed prominent shifts in the microbial community composition along these dimensions 22–24. The community composition of surface waters changes gradually along the 2000 km salinity gradient, from mainly freshwater lineages in the low salinity North to mostly marine lineages in the higher salinity South-West, and a mixture in the mesohaline central Baltic Sea 22. The phylogenetic resolution of 16S amplicons, however, does not permit determining whether prokaryotic lineages are locally adapted freshwater and marine populations or represent distinct brackish strains. A recent Baltic Sea metagenomic study showed how a shift in genetic functional potential along the salinity gradient paralleled this phylogenetic shift in bacterial community composition 8. However, since genes were not binned into genomes, different sets of distinguishing gene functions could not be assigned to the genomic context of specific taxa. Reference genomes would therefore be invaluable for a richer exploration of available and future omics data.
Here, we used metagenome time-series data from a sampling station in the central Baltic Sea to generate metagenome-assembled genomes (MAGs) corresponding to several of the most abundant, and mostly uncultured, lineages in this environment. We use these data to compare functional potentials between phylogenetic lineages and relate functionality with seasonal succession. By comparing the MAGs with metagenome data from globally distributed sites, we propose that these are specialised brackish populations that evolved long before the formation of the Baltic Sea and whose closest relatives are today found in other brackish environments across the globe.
Results and Discussion
Metagenome-assembled genomes
We conducted shotgun metagenomic sequencing on 37 surface water samples collected from March to December in 2012 at the Linnaeus Microbial Observatory, 10 km east of Öland in the central Baltic Sea. On average, 14.5 million read-pairs were assembled from each sample, yielding a total of 1,443,953,143 bp across 4,094,883 contigs. In order to bin contigs into genomes, the CONCOCT software 20 was run on each assembled sample separately, using information on the contigs’ coverages across all samples (Supplementary Fig. 1). Single-copy genes (SCGs) were used to assess completeness and purity of the bins. We approved bins having at least 30 of 36 SCGs present (Supplementary Dataset 1), of which not more than two in multiple copies. This resulted in the identification of 83 genomic bins, hereafter referred to as metagenome assembled genomes (MAGs). The completeness of these MAGs was further validated by assessing the presence and uniqueness of a set of phylum- and class-specific SCGs (Supplementary Dataset 1). Based on these, we estimate the MAGs to be on average 82.7% complete with 1.1% of bases misassembled or wrongly binned. Some MAGs were estimated to be 100% complete. In comparison, recent single-amplified genome studies of free-living aquatic bacteria have obtained average completeness of 55-68% 3,10. Importantly, the number of MAGs reconstructed from each sample correlates with the number of reads generated from it and there is no sign of saturation in this trend (Supplementary Fig. 2), meaning that more genomes can easily be reconstructed by deeper sequencing of the same samples. Every sample with over 20 million reads passing quality control yielded at least 3 approved bins. Further, while only highly complete genomes were selected for this study, other research questions might be adequately addressed with partial genomes, many more of which were generated.
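The bin approval rule described above (at least 30 of 36 SCGs present, no more than two of them in multiple copies) can be written as a small filter. This is a minimal sketch: the function name, the input format and the SCG identifiers are hypothetical, and only the thresholds come from the text.

```python
def approve_bin(scg_counts, min_present=30, max_multicopy=2):
    """Decide whether a bin passes the single-copy gene (SCG) criteria.

    scg_counts maps each SCG found in the bin to its copy number
    (hypothetical input format; SCGs not found may be absent or 0).
    """
    present = sum(1 for copies in scg_counts.values() if copies >= 1)
    multicopy = sum(1 for copies in scg_counts.values() if copies > 1)
    return present >= min_present and multicopy <= max_multicopy


# Made-up example: 32 SCGs found, one of them duplicated.
example_bin = {f"SCG{i:02d}": 1 for i in range(32)}
example_bin["SCG00"] = 2
print(approve_bin(example_bin))
```

The number of SCGs present serves as a proxy for completeness, while SCGs found in multiple copies indicate possible contamination of the bin.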
In the original CONCOCT study 20, binning was done on a coassembly of all samples. Here we employed an alternative strategy where binning was run on each sample separately. This way, community complexity was minimized and binning accuracy increased. Since this strategy may reconstruct the same genome multiple times over the time-series, the 83 complete MAGs were further clustered based on sequence identity. Thirty distinct clusters (BACL [BAltic CLuster] 1-30) with >99% intra-cluster sequence identity were formed (<70% between-cluster identity; 95% sequence identity is a stringent cut-off for bacterial species definition 25), that included between one and 14 MAGs each (Supplementary Fig. 3; Table 1; Supplementary Dataset 1). Having several MAGs in the same cluster increases the reliability of the analyses performed, especially in the case of results based on the absence of a sequence, such as missing genes.
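Grouping MAGs that were reconstructed repeatedly from different samples can be illustrated with a simple single-linkage clustering on pairwise identities. The text does not specify the clustering procedure, so the single-linkage choice, the function name and the input format below are assumptions for illustration only.

```python
def cluster_by_identity(names, identities, threshold=0.99):
    """Single-linkage clustering of MAGs by pairwise sequence identity.

    identities maps (name_a, name_b) pairs to identity fractions in [0, 1].
    Pairs above `threshold` are joined into the same cluster.
    """
    parent = {n: n for n in names}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), identity in identities.items():
        if identity > threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in clusters.values())


# Made-up identities for four hypothetical MAGs.
mags = ["m1", "m2", "m3", "m4"]
pairs = {("m1", "m2"): 0.995, ("m2", "m3"): 0.992, ("m1", "m4"): 0.65}
print(cluster_by_identity(mags, pairs))
```

With a wide gap between within-cluster identity (>99%) and between-cluster identity (<70%), as reported here, the exact linkage method is unlikely to change the resulting clusters.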
Overview of clusters, sorted by taxonomy.
The genome clusters generated represent environmentally abundant strains, together corresponding to on average 13% of the shotgun reads in each sample (range: 4 - 23%). This shows that the CONCOCT approach successfully reconstructs novel genomes of environmentally relevant bacteria.
Phylogeny and functional potential of MAGs
The reconstructed genomes belong to Actinobacteria, Bacteroidetes, Cyanobacteria, Verrucomicrobia, Alpha-, Beta- and Gammaproteobacteria and Thaumarchaeota (Table 1, Fig. 1; Supplementary Fig. 4, Supplementary Dataset 2). Phylogenetic reconstruction using concatenated core proteins placed all MAGs consistently within clusters, lending further support to the binning (Supplementary Fig. 4). Based on average nucleotide identity, only BACL8 was estimated to have >70% DNA identity with its nearest neighbour in the phylogenetic tree. In this and many other cases, the closest relative was not an isolate, but a SAG, reflecting these methods’ ability to recover genomes from abundant, but potentially uncultivable, species.
Phylogenetic tree of reconstructed genomes. Phyla and proteobacterial classes for which MAGs were generated are highlighted with coloured branches: Thaumarchaeota (dark blue), Cyanobacteria (blue-green), Actinobacteria (lime green), Alphaproteobacteria (yellow), Bacteroidetes (orange), Verrucomicrobia (purple), Gammaproteobacteria (red) and Betaproteobacteria (dark green).
The broad phylogenetic representation allowed us to compare functional potential between taxonomic groups in this ecosystem. Non-metric multidimensional scaling based on abundance of functional gene categories grouped the MAG clusters according to their phylogeny (Fig. 2; Supplementary Fig. 5; Supplementary Dataset 3), which was confirmed by ANOSIM analysis (Supplementary Table 1). Specifically, Alphaproteobacterial clusters encoded a significantly higher proportion of genes in the “Amino acid transport and metabolism” COG category compared to all other clusters (Welch’s t-test p<0.001). Actinobacteria were significantly enriched in genes in the “Carbohydrate transport and metabolism” COG category (p=0.04). Enzymes involved in carboxylate degradation were significantly more abundant in Gammaproteobacteria compared to all other clusters (p=0.019). Carboxylate degradation enzymes were also abundant in Alphaproteobacteria and Bacteroidetes, but significantly lower in proportion among the Actinobacteria (p<0.01). Bacteroidetes and Verrucomicrobia had the largest number of glycoside hydrolase genes, including xylanases, endochitinases and glycogen phosphorylases (Supplementary Fig. 6), and thus appear well suited for degradation of polysaccharides such as cellulose, chitin and glycogen, in line with previous findings 9,26,27.
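The group comparisons above rely on Welch’s t-test, which does not assume equal variances in the two groups. A minimal sketch of the test statistic and the Welch-Satterthwaite degrees of freedom follows; the function name and the input values are made up for illustration and are not data from this study.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom.

    a and b are per-cluster proportions (e.g. of genes in one COG
    category) for two taxonomic groups; each needs at least two values.
    """
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up proportions for two hypothetical groups of clusters.
t_stat, dof = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
print(t_stat, dof)
```

A p-value is then obtained from the t distribution with `dof` degrees of freedom; where SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the whole test in one call.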
Ordination of MAG clusters based on COG abundances. Non-metric multidimensional scaling was applied to a pairwise distance matrix of the genomes and the first two dimensions are shown. MAG clusters are displayed with abbreviated lineage names and cluster numbers in parentheses and further colored according to Phyla/Class.
Transporter proteins mediate many of the interactions between a cell and its surroundings, thus providing insights into an organism’s niche. A detailed analysis of transporter genes in the 30 MAG clusters (Supplementary Fig. 7; Supplementary Dataset 4) revealed important general patterns, such as a high diversity of genes for amino acid uptake in Actinobacteria and Alphaproteobacteria, a lack of genes for carboxylic acid uptake and a multitude of genes for polyamine uptake in Actinobacteria, and a high diversity of ABC-type sugar transport genes in Actinobacteria and Alphaproteobacteria. The Gammaproteobacteria, Bacteroidetes and Verrucomicrobia encoded a large number of TonB-dependent transporter genes, likely involved in carbohydrate, vitamin and iron chelator uptake 28. Phosphate uptake systems, such as the high affinity PstS transporter, were highly abundant in the Betaproteobacterial BACL14, while Thaumarchaeon BACL13 had the highest proportion of phosphonate transporter genes, followed by the Cyanobacterium BACL30 and acIV genome clusters. The BACL30 and the Nitrosopumilus BACL13 were sparse in uptake systems for organic molecules in general, consistent with these organisms’ photoautotrophic and chemoautotrophic lifestyles, respectively. In addition to the amoA ammonia monooxygenase gene, BACL13 encoded urease genes, indicating the capability to utilise urea for nitrification, as previously observed in Arctic Nitrosopumilus 29. In line with the OM43 clade comprising simple obligate methylotrophs with extremely small genomes 30, the Betaproteobacterial OM43 cluster (BACL14), which encoded a methanol dehydrogenase gene and genes for formaldehyde assimilation, was also sparse in uptake systems.
All clusters belonging to the Bacteroidetes and Gammaproteobacteria lineages contained the Na+-transporting NADH:ubiquinone oxidoreductase (NQR) enzyme. This enzyme is involved in the oxidative respiration pathway in some bacteria and is similar to the typical H+-transporting ndh NADH dehydrogenase 31. However, the NQR enzyme exports sodium from the cell and thereby creates a gradient of Na+ ions, in contrast to the proton gradient generated by the ndh enzyme. The use of the NQR enzyme has been shown to be correlated with salinity (increasing Na+ concentrations) in bacterial communities 8. Accordingly, NQR-containing MAG clusters were generally the ones with closest relatives in the marine environment (e.g. Bacteroidetes, see section on biogeography below), while genome clusters more closely related to freshwater bacteria (e.g. Actinobacteria) contained the H+-transporting enzyme. The SAR11 MAGs were an exception, harboring the H+-transporting enzyme despite having predominantly marine relatives. The genomes containing NQR enzymes in our dataset also contained a significantly higher proportion of Na+ symporters and antiporters (for e.g. dicarboxylates, disaccharides and amino acids), as well as TonB-dependent transporters, compared to the other genomes (Welch’s t-test p<0.001). In contrast, ATP-driven ABC-transporters were significantly less abundant in these clusters (p<0.001), strongly indicating that these bacteria have reduced their energy requirement by making use of the sodium motive force generated by the NQR enzyme to drive transport processes, a strategy that has been suggested previously 31. TonB-dependent transporters require energy derived from charge separation across cellular membranes, generally in the form of a proton gradient 32. The significant enrichment of TonB transporters in NQR-containing genomes suggests that these proteins may also utilize the sodium motive force.
Novelly sequenced lineages
The MAG approach has previously proven useful for closing gaps in the tree of life by the reconstruction of genomes from uncultivated species (e.g. 33,34). Here we report the first draft genomes for the oligotrophic marine Gammaproteobacteria OM182, and for the typically freshwater Verrucomicrobia subdivision LD19 and Actinobacteria clade acIV. Annotations for these genomes are found in Supplementary Dataset 3 and Supplementary Dataset 4.
OM182 is a globally abundant Gammaproteobacteria which has been grown in enrichment culture, but never sequenced. BACL3 includes a 16S gene 99% identical to that of the OM182 isolate HTCC2188 35. This MAG cluster shares common features with other Gammaproteobacteria, such as a variety of glycoside hydrolases and carboxylate degradation enzymes. It also encodes the cysA sulfate transporter and a complete set of genes for assimilatory sulfate reduction to sulfide and for production of cysteine from sulfide and serine via cysK and cysE. Genes for sulfite production from both thiosulfate (via glpE) and taurine (via tauD) are also encoded in the genome, and this is the only MAG cluster to encode the full set of genes for intracellular sulfur oxidation (dsrCEFH). BACL3 thus appears remarkably well-suited for metabolising different inorganic and organic sulfur sources, the latter potentially originating from phytoplankton blooms 36, even more so than previously sequenced isolates of oligotrophic marine Gammaproteobacteria 37.
Two verrucomicrobial MAG clusters were reconstructed. BACL9 MAGs include 16S rRNA genes 99% identical to that of the globally distributed freshwater clade LD19 38, a subdivision within the Verrucomicrobia still lacking cultured or sequenced representatives. Previous 16S-based analyses placed LD19 as a sister group to a subdivision with acidophilic methanotrophs 39. Accordingly, BACL9 is placed as a sister clade to the acidophilic methanotroph Methylacidiphilum infernorum 40 in the genome tree, but does not encode methane monooxygenase genes and thus lacks the capacity for methane oxidation seen in M. infernorum. Interestingly, BACL9 contains a set of genes that together allow for production of 2,3-butanediol from pyruvate (via acetolactate and acetoin). Butanediol plays a role in regulating intracellular pH during fermentative anaerobic growth and biofilm formation 41. This is also the only MAG with the genetic capacity to synthesize hopanoid lipids, which have been implicated in enhanced pH tolerance in bacteria by stabilizing cellular membranes 42. This indicates adaptation to withstanding lowered intracellular pH such as that induced by fermentative growth under anaerobic conditions. Such conditions occur in biofilms 41, and it remains to be shown whether these planktonic bacteria can form biofilms to grow attached to particles in the water column.
BACL 6, 17, 19 and 27 all belong to actinobacterial clade acIV, of the order Acidimicrobiales. Most isolates of the order Acidimicrobiales are acidophilic, and no genomes have been reported for acIV, despite its numeric importance in lake water systems 43. Compared to the other typically freshwater clades acI (BACL 2, 4, 15) and Luna (BACL 25, 28), that belong to the order Actinomycetales, acIV MAG clusters have larger genome sizes and contain a significantly lower proportion of genes in the Carbohydrate transport and metabolism COG category (p<0.01), particularly ABC-type sugar transporters (Supplementary Fig. 7). AcIV and acI are also impoverished for phosphotransferase (PTS) genes and amino acid transporters, compared to Luna MAGs. In contrast, acIV MAG clusters contain a significantly higher proportion of genes in the Lipid transport and metabolism COG category (p=0.02), and a significantly higher total proportion of enzymes involved in fatty-acid oxidation (p<0.001), indicating that these Actinobacteria may use lipids as carbon source.
The only cyanobacterial genome assembled was BACL30. While it is placed in the phylogenetic tree as a distant neighbour to Cyanobium gracile, its 16S rRNA gene is only 97% identical with it, the same identity as with Synechococcus and Prochlorococcus. This genome contains genes for the pigments phycocyanin (PC) and phycoerythrin (PE) and harbors the Type IIB pigment gene organization recently identified as being dominant in Baltic Sea picocyanobacteria 44. The PC genes cpcBA and the intergenic spacer are 100% identical to sequences in the Type IIB pigment clade. Phylogenies of PC and PE subunits as well as 6 ribosomal proteins consistently placed this cyanobacterial MAG within the Type IIB pigment clades and within the clade of picocyanobacteria whose members are abundant in the Baltic Sea, but for which a reference genome has been unavailable (Supplementary Fig. 8). BACL30 contains the high affinity pstS phosphate transporter, but lacks the phoU regulatory gene as well as an alkaline phosphatase. In this respect the genome is similar to the coastal strain Synechococcus CC9311 45, likely reflecting adaptation to higher phosphorus loads compared to the open oceans.
Genome streamlining and inferred cell sizes
Oligotrophic bacterioplankton are characterised by streamlined genomes, i.e. small genomes with high coding densities and low numbers of paralogs 46. For the few cultured oligotrophs, such as Prochlorococcus 47 and SAR11 48, this coincides with small cell sizes. The small cells render high surface-to-volume ratios, beneficial for organisms that compete for very-low-concentration nutrients 49. SAG sequencing has shown that genomic streamlining is a widely distributed feature among abundant bacterioplankton 3, contrasting to most cultured marine bacteria. Lauro et al identified genome features for predicting whether an organism or community is oligotrophic or copiotrophic 2. Ordination using some of these features (coding density, GC-content and proportion of five COG categories; Swan, 2013) separated our MAG clusters from marine isolate genomes (Fig. 3a). The exceptions were isolates of picocyanobacteria, SAR11 and OM43, that overlapped with our MAG clusters, and the SAR92, OM182 and Opitutaceae MAG clusters that overlapped with the isolates. Hence, most of the MAGs displayed pronounced signs of streamlining. These features, with the exception of GC-content, were found to be highly correlated with genome size (Supplementary Fig. 9), and genome size alone gave equally strong separation (Fig. 3b-c).
Genome properties and filter size fraction distributions of MAGs. (a) Principal Components Analysis (PCA) on our 30 MAG clusters and 135 genomes from marine isolates 4 based on log-transformed percentages of non-coding DNA, GC-content, COG category Transcription [K], Signal transduction [T], Defense mechanism [V], Secondary metabolites biosynthesis [Q] and Lipid transport and metabolism [I]. Only isolates belonging to the phyla and proteobacterial classes as represented by MAGs were included. (b) Genome size vs. percentage of non-coding DNA plotted for the same set of genomes, (c) with a zoom-in on the smaller and denser genomes. (d-e) Number of sequence reads matching to our reconstructed genomes from different filter fractions from Dupont et al 8. (f) The ratio of matches between the 0.8 - 3.0 and the 0.1 - 0.8 μm fraction plotted against genome size, both in log scale.
Interestingly, several of the Bacteroidetes MAG clusters appear to be streamlined, despite Bacteroidetes being generally described as copiotrophic 46. One of them (BACL11), which represents a novel branch in the Cryomorphaceae (Fig. 1), has a particularly small genome (1.19 Mbp [range 1.16 - 1.21] MAG size, at 75% estimated completeness) with only 4% non-coding DNA. It encodes a smaller number of transporters than the other Bacteroidetes MAG clusters and only one type of glycoside hydrolase. It also has a comparatively low GC-content (33%). However, the Polaribacter MAG cluster (BACL22), which has the largest genome and lowest gene density of the Bacteroidetes MAG clusters, has equally low GC-content (32%), as previously observed in planktonic and algae-attached Polaribacter isolates 50. Since, in general, GC-content correlates very weakly with either genome size or gene density (Supplementary Fig. 9), this may not be an optimal marker for genome streamlining. Supporting the impression that MAGs represent small and streamlined genomes, with little metabolic flexibility, most MAG clusters (25 of 30) encode bacteriorhodopsins (PF01036, Supplementary Dataset 3), which allows them to adopt a photoheterotrophic lifestyle when their required substrates for chemoheterotrophy are not available.
By mapping shotgun reads from different filter fractions (0.1 - 0.8, 0.8 - 3.0 and >3.0 μm) from a previous spatial metagenomic survey of the Baltic Sea 8, we could investigate how MAG cluster cells were distributed across size fractions. Comparing counts of mapped reads between the 0.8 - 3.0 and 0.1 - 0.8 μm fractions showed that Bacteroidetes tended to be captured on the 0.8 μm filter to a higher extent than Actinobacteria (Fig. 3d). This bias could be driven by Bacteroidetes being, to a higher extent, attached to organic matter particles or phytoplankton. However, comparing the >3.0 μm with the 0.8 - 3.0 μm fraction showed a clear bias only for one of the Bacteroidetes clusters (BACL12; Fig. 3e). This cluster has the largest genomes (2.5 and 2.8 Mbp) of the reconstructed Bacteroidetes and is the only representative of the Sphingobacteriales (Fig. 1). Sphingobacteria have previously been suggested to bind to algal surfaces with the assistance of glycosyltransferase genes 51. We did not find significantly more glycosyltransferases in BACL12 than in the other Bacteroidetes. Rather, it encodes a greater number of genes containing carbohydrate-binding module (CBM) domains than the other clusters (mean = 12 in BACL12 vs. 1.3 in the other Bacteroidetes and 1.4 in all clusters), which may facilitate adhesion to particles or phytoplankton 52.
Since only one Bacteroidetes MAG cluster was biased toward the >3 μm filter, attachment to organic matter does not seem to be the main reason behind the difference in filter capture between Bacteroidetes and Actinobacteria, unless the particles are mainly in the 0.8 - 3.0 μm size range. Another possibility is that the bias reflects cell size distributions; each population has a specific size distribution that will influence what proportion of cells will pass through the membrane. Interestingly, the (0.8 - 3.0 μm)/(0.1 - 0.8 μm) read count ratio is correlated to the size of the MAGs (Spearman rho = 0.76; p = 10⁻⁵; Fig. 3f), indicating a positive correlation between cell size and genome size.
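The rank correlation between the filter-fraction read count ratio and genome size can be computed as sketched below. This is an illustrative, dependency-free version of Spearman’s rho (Pearson correlation of ranks, with average ranks for ties); the function names and the input values are made up and are not the data behind Fig. 3f.

```python
from math import sqrt


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up genome sizes (Mbp) and read count ratios for four hypothetical MAGs.
genome_size = [1.2, 1.5, 2.1, 2.8]
fraction_ratio = [0.1, 0.3, 0.2, 0.9]
print(spearman_rho(genome_size, fraction_ratio))
```

Because it uses ranks only, the statistic is unaffected by the log scaling applied to both axes in the figure.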
The reason for the streamlining of genomes in oligotrophs is not known 46. Lowered energetic costs for replication is one possibility. Despite the energetic requirements for DNA replication being low (<2% of the total energy budget 53), the extremely large effective population sizes of oligotrophic pelagic bacteria could explain selection for this trait 46. Another possibility is spatial constraints. In Pelagibacter the genome occupies 30% of the cell volume 48, so that cell size minimisation may be constrained by the genome size. A strong correlation between cell- and genome size for oligotrophic microbes would favor such an explanation. Further analyses with more reconstructed genomes and higher resolution of filter sizes could shed more light on the mechanisms behind genome streamlining.
Seasonal dynamics
Pronounced seasonal changes in environmental conditions with phytoplankton spring blooms are characteristic of temperate coastal waters. As is typical for the central Baltic Sea, in 2012 an early spring bloom of diatoms was followed by a dinoflagellate bloom, causing inorganic nitrogen to decrease rapidly; later in summer, diazotrophic filamentous cyanobacteria bloomed (Fig. 4 and Supplementary Fig. 10). The only reconstructed picocyanobacteria genome (BACL30) peaked in early summer, between the spring and summer blooms of the larger phytoplankton. A similar pattern was previously observed for an operational taxonomic unit identical to the 16S rRNA gene of the reconstructed genome 23.
Seasonal dynamics of MAG clusters and phytoplankton. The heatmap plot shows the relative abundance of each MAG cluster in the time-series, based on calculated coverage from read-mappings. In addition, the relative abundance of phytoplankton groups, assessed by microscopy, is shown for the same samples. MAG clusters and eukaryotic groups were hierarchically clustered using Spearman correlations as shown in the dendrogram on the left. Coloured rows at the top indicate month and season of each sample.
The seasonal dynamics of heterotrophic MAGs were highly influenced by the phytoplankton blooms, with different populations co-varying with different phytoplankton (Fig. 4). Phylum-level patterns were present, with a Bacteroidetes-dominated community in spring and early summer (7/9 Bacteroidetes MAG clusters), coinciding with the spring phytoplankton blooms, and Actinobacteria being more predominant in the second half of the year (8/10 actinobacterial MAG clusters). This is largely in agreement with what is known for Bacteroidetes, which are better adapted to feeding on the complex carbohydrates that are abundant during phytoplankton blooms 27. This was also reflected in the functional annotations, where Bacteroidetes MAGs contained several enzymes for degradation of polysaccharides and were enriched for certain aminopeptidases. For Actinobacteria, no such general correlation pattern to phytoplankton has been shown, but there are indications of association with and active uptake of photosynthates from cyanobacterial blooms 54,55. Actinobacteria MAGs, which were enriched in genes for the uptake and metabolism of monosaccharides such as galactose and xylose, became abundant as levels of dissolved organic carbon increased in the water (Supplementary Fig. 10).
Besides these phylum- and order-level trends, temporal patterns were also observed at finer phylogenetic scales. The peaks of Luna clades coincide with phytoplankton blooms, while acI and acIV are more abundant in autumn, after the blooms. AcI also appears to have a more stable abundance profile than acIV, in agreement with Newton et al 43. As previously reported for acI SAGs 10, cyanophycinase was found in two of the three acI MAG clusters, potentially allowing degradation of the storage compound cyanophycin synthesized by cyanobacteria. The two acI MAG clusters (BACL2 and 4) encoding cyanophycinase became abundant in late July, as filamentous cyanobacteria, which typically produce cyanophycin, started to peak in abundance (Fig. 4). In contrast, all acIV and Luna MAGs lacked this gene.
Furthermore, contrasting dynamics between members of the same clade, as exemplified by one acIV population blooming in spring, highlight that, despite the general similarities in their functional repertoire, lineage-specific adaptations allow different microniches to be occupied by different strains (Fig. 2, Fig. 4). As an example, the spring blooming acIV BACL6 contained several genes for nucleotide degradation that were missing in the summer blooming acIV MAG clusters, such as adenine phosphoribosyltransferase, thymidine phosphorylase and pyrimidine utilization protein B. In addition, BACL6 contained genes sulP and phnA for uptake of sulfate and uptake and utilization of alkylphosphonate, respectively. These genes were also found in the spring blooming BACL25 (Luna clade), but were notably absent from the summer blooming acI, acIV and Luna MAG clusters. The capacity to utilize nucleotides and phosphonates as major carbon and phosphorus sources thus probably sets BACL6 and 25 apart from other closely related lineages.
The two SAR11 MAG clusters also showed contrasting seasonal patterns, with BACL20 being abundant in spring and peaking in early summer, while BACL5 appeared later and showed a stable profile from July onwards. Functional analysis showed that, of these two populations, BACL5 contained several genes related to phosphate acquisition and storage that were missing from BACL20. These included the high-affinity pstS transporter, polyphosphate kinase and exopolyphosphatase, as well as the phosphate starvation-inducible gene phoH. BACL5 therefore appears better adapted to the low concentrations of phosphate found in mid-to late summer (Supplementary Fig. 10). In addition, proteorhodopsin was found in BACL5, but not in BACL20. However, since the latter consists of only one MAG, this gene may have been missed due to incomplete genome assembly.
Biogeography of the brackish microbiome
To assess how abundant the MAGs presented here are in other marine and freshwater environments around the globe, fragment recruitment was performed from a collection of samples comprising a wide range of salinity levels. At intermediate levels of sequence identity, different phylogenetic lineages recruit preferentially fresh or marine water fragments. Most markedly, SAR11 displays a clear marine profile, while acI and acIV actinobacteria have a distinct freshwater signature (Fig. 5a, Supplementary Fig. 11a, Supplementary Fig. 12). In addition, MAGs belonging to Bacteroidetes and Gammaproteobacteria show signs of a marine rather than a freshwater signature that fits with the presence of the Na+-transporting NADH dehydrogenase in these lineages. However, at a high identity level (99%) only reads from brackish environments are recruited, including estuaries in North America, to the exclusion of fresh and marine waters much closer geographically to the Baltic Sea (Fig. 5b, Supplementary Fig. 11b, Supplementary Fig. 12). Indeed, it is remarkable that BACL8 is placed phylogenetically as a single clade together with a SAG sampled in the brackish Chesapeake Bay (Supplementary Fig. 4). Despite being separated by thousands of kilometers of salt water, these cells share >92% similarity over Blast+ high-scoring pairs. Overall, our analysis showed that the reconstructed genomes recruited sequences primarily from brackish estuary environments at various levels of sequence identity (Fig. 6).
Biogeographical abundance profiles of MAGs. Heatmap plots showing the abundance of recruited reads from various samples and sample groups to each of the 30 MAG clusters at the (a) 85% and (b) 99% identity cutoff levels. Shown values represent number of recruited reads/kb of genome per 10,000 queried reads. For clarity, several sample groups have been collapsed with recruitment values averaged over samples in the group. Sample groups are indicated by the lower color strip above the plot and samples are ordered by salinity (shown in the upper color strip). See Supplementary Fig. 11 for full visualizations of samples. ‘BalticAsm’ represents a metagenomic co-assembly of all the samples in the time-series.
Recruitment profiles at different nucleotide identity. Fragment recruitment values were calculated at various percent identity cutoffs for different sample groups. For each sample in each group, the average recruitment over all 30 MAG clusters was calculated and boxes show the variation of these averages from all samples in each group. The number of samples used in each group is indicated in the legend.
The Baltic Sea is a young system, formed by the opening of the Danish straits to the North Sea in a long process between 13,000 and 8,000 years ago. The initially high salinity has slowly decreased due to the influx of freshwater from the surrounding area and the narrow connection to the open ocean, forming a stable brackish system around 4,000 years ago. Even considering fast rates of evolution for bacteria, the high degree of separation observed at the whole genome level between the Baltic metagenome and global fresh and marine metagenomes cannot be explained by isolation in the Baltic alone. Based on the rates of evolution presented in 56, it would take over 100 thousand years for free-living bacteria to accumulate 1% genome divergence. These specialists must therefore have evolved before current stable bodies of brackish water, such as the Baltic, the Black Sea and the Caspian, were formed at the end of the last glacial period. Intriguingly, brackish-typical green sulfur bacteria have been observed in 217,000-year-old sediment layers of the now highly saline Mediterranean 57, suggesting that brackish populations might migrate between these transient environments as salinity shifts. This is in agreement with the well-documented separation between freshwater and marine species, which indicates that salinity level is a main barrier isolating populations (reviewed in 58). Strains previously adapted to brackish environments and transported through winds, currents or migratory animals can thus proliferate and saturate all available niches before fresh and marine strains can effectively adapt to the new environment.
The prokaryotic populations of the Baltic Sea thus appear to have adapted to its intermediate salinity levels via a different mode than its multicellular species, which are only recently adapted to brackish environments from the surrounding fresh and marine waters 59,60. Indeed, while there is low multicellular species-richness and intra-species diversity in the Baltic 61, suggestive of a recent evolutionary bottleneck, no such observation has been made for bacteria in the region 8,22.
Conclusion
Here we have presented 83 genomes, corresponding to 30 clusters at >99% nucleotide identity, reconstructed from metagenomic shotgun sequencing assemblies using an unsupervised binning approach. Many of these belong to lineages with no previous reference genome, including lineages known from 16S-amplicon studies to be highly abundant. We show that the seasonal dynamics of these bacterioplankton follow phylogenetic divisions, but with fine-grained lineage specific adaptations. We confirm previous observations on the prevalence of genome streamlining in pelagic bacteria. Finally, we propose that brackish environments exert such strong selection for tolerance to intermediate salinity that lineages adapted to it flourish throughout the globe with little influence of surrounding aquatic communities. The new genomes are now available to the wider research community to explore further questions in microbial ecology and biogeography, solidly placing the automated reconstruction of genomes from metagenomes as an invaluable tool in ocean science.
Methods
Sample collection, library preparation and sequencing
Water samples were collected on 37 occasions between March and December of 2012, at 2 m depth, at the Linnaeus Microbial Observatory (N 56° 55.851, E 17° 03.640), 10 km off the coast of Öland (Sweden), using a Ruttner sampler. All samples are referred to in the text and figures by their sampling date, in the format yymmdd. Samples were filtered successively at 3.0 μm and 0.22 μm. The 0.22 μm fraction was used for DNA extraction. The procedures for DNA extraction, phytoplankton counts and chlorophyll a and nutrient concentration measurement are described in 24.
2-10 ng of DNA from each sample were prepared with the Rubicon ThruPlex kit (Rubicon Genomics, Ann Arbor, Michigan, USA) according to the instructions of the manufacturer. Cleaning steps were performed with MyOne™ carboxylic acid-coated superparamagnetic beads (Invitrogen, Carlsbad, CA, USA). Finished libraries were sequenced at SciLifeLab/NGI (Solna, Sweden) on a HiSeq 2500 (Illumina Inc, San Diego, CA, USA). On average, 31.9 million paired-end reads of 2x100 bp were generated.
Sequence data quality filtering and assembly
Reads were quality trimmed using sickle (https://github.com/najoshi/sickle) to eliminate stretches where average quality scores fall below 30. Cutadapt 62 was used to eliminate adapter sequences from short fragments detected by FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc). Finally, FastUniq 63 was used to eliminate reads which were, on both forward and reverse strands, identical prefixes of longer reads (on average, 49% of the reads from each sample). Each sample was then assembled separately, using Ray 64 with k-mer lengths of 21, 31, 41, 51, 61, 71 and 81. Contigs from each of these assemblies were cut into 1000 bp fragments in sliding windows every 100 bp using Metassemble (https://github.com/inodb/metassemble) and reassembled using 454 Life Sciences’ software Newbler (Roche, Basel, Switzerland).
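As an illustration of the fragmentation step described above, the following is a minimal Python sketch of cutting a contig into 1000 bp windows with a 100 bp step. It is not the Metassemble implementation; the function name `sliding_windows` is hypothetical, and the handling of short contigs and of the contig tail (one extra window flush with the contig end) is an assumption made for the example.

```python
def sliding_windows(sequence, window=1000, step=100):
    """Yield (start, fragment) pairs of overlapping windows over a contig.

    Assumptions (not stated in the text): contigs no longer than `window`
    are returned whole, and one extra window is emitted flush with the
    contig end when the step does not land there exactly.
    """
    if len(sequence) <= window:
        yield 0, sequence
        return
    last_start = len(sequence) - window
    for start in range(0, last_start + 1, step):
        yield start, sequence[start:start + window]
    if last_start % step != 0:
        # Cover the tail so that no bases at the contig end are dropped.
        yield last_start, sequence[last_start:]


if __name__ == "__main__":
    contig = "ACGT" * 400  # toy 1600 bp contig
    fragments = list(sliding_windows(contig))
    print(len(fragments), "fragments, first starts at", fragments[0][0])
```

Overlapping fragments of uniform length give the downstream assembler redundant coverage of each contig, which is presumably why a short step relative to the window was chosen.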
Binning of sequencing data and construction of MAGs
The quality-filtered reads of each sample were mapped against the contigs of all other samples using Bowtie2 65, Samtools 66, Picard (http://broadinstitute.github.io/picard) and BEDTools 67. Contigs from each sample were then binned based on their tetranucleotide composition and covariation across all samples using CONCOCT 20 and accepting contigs over 1000, 3000 or 5000 bp in length. Prodigal 68 was used to predict proteins on contigs for each bin, and these were compared to the COG database with RPS-BLAST. The resulting hits were compared to a small set of 36 single-copy genes (SCG) used by CONCOCT, only considering a protein hit if it covered more than half of the reference length. Bins were considered good if they presented at least 30 of the 36 SCG, no more than two of which were present in multiple copies. Another set of phylum-specific SCG was used to evaluate each selected bin more carefully. Both the general prokaryotic SCG and phylum-specific SCG were selected such that they were present in at least 97% of sequenced representatives within that taxon and had an average gene count of less than 1.03. For the phylum-specific SCG, Proteobacteria was divided down to class level for increased sensitivity.
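The bin acceptance rule above can be written out as a short Python sketch. The thresholds (hit covering more than half of the reference, at least 30 of 36 SCG present, at most two in multiple copies) come from the text; the input layout, a list of `(scg_id, aligned_length, reference_length)` tuples, and the function names are assumptions for illustration, not part of CONCOCT.

```python
from collections import Counter


def count_scg(hits, min_cov=0.5):
    """Count accepted hits per single-copy gene (SCG).

    `hits` is a hypothetical list of (scg_id, aligned_length, reference_length)
    tuples; a hit counts only if it covers more than `min_cov` of the reference.
    """
    counts = Counter()
    for scg_id, aligned_len, ref_len in hits:
        if aligned_len / ref_len > min_cov:
            counts[scg_id] += 1
    return counts


def is_good_bin(counts, min_present=30, max_multicopy=2):
    """Apply the rule: >= 30 SCG present, <= 2 of them in multiple copies."""
    present = sum(1 for n in counts.values() if n >= 1)
    multicopy = sum(1 for n in counts.values() if n > 1)
    return present >= min_present and multicopy <= max_multicopy


if __name__ == "__main__":
    toy_hits = [(f"COG{i:04d}", 90, 100) for i in range(32)]
    print(is_good_bin(count_scg(toy_hits)))
```

The number of SCG present serves as a completeness estimate, and the number in multiple copies as a contamination estimate.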
For each sample, only one CONCOCT run was chosen for downstream analysis. For most samples, the 1000 bp cutoff provided the maximum number of good bins, but samples 120705, 120828 and 121004 had best results with 3000 bp. This resulted in 83 good bins in total. As the same genome could have been independently found in more than one sample, MUMmer 69 was used to compare all good bins against each other. The distance between two bins was set as one minus average nucleotide identity, given a minimum of 50% bin coverage of the larger bin in each pair. This procedure yielded 30 clearly distinct clusters (BACL), independently of the clustering method used (average-, full- or single-linkage).
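To make the clustering criterion concrete, here is a minimal Python sketch that groups bins by the distance defined above (one minus average nucleotide identity, requiring at least 50% coverage of the larger bin). It uses a simple single-linkage grouping via union-find rather than the hierarchical clustering of the original analysis. The 0.01 distance cutoff is taken from the >99% identity stated in the Conclusion. Treating pairs below the coverage minimum as maximally distant, the input layout and the function name are all assumptions.

```python
def cluster_bins(bins, pairwise, max_distance=0.01, min_coverage=0.5):
    """Group bins whose pairwise distance (1 - ANI) is below `max_distance`.

    `pairwise` maps (bin_a, bin_b) -> (ani, coverage_of_larger_bin), with
    both values as fractions. Pairs below `min_coverage` are treated as
    maximally distant (an assumption for this sketch).
    """
    parent = {b: b for b in bins}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), (ani, coverage) in pairwise.items():
        distance = 1.0 - ani if coverage >= min_coverage else 1.0
        if distance < max_distance:
            parent[find(a)] = find(b)

    clusters = {}
    for b in bins:
        clusters.setdefault(find(b), []).append(b)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    toy = {("bin1", "bin2"): (0.997, 0.85), ("bin2", "bin3"): (0.93, 0.90)}
    print(cluster_bins(["bin1", "bin2", "bin3"], toy))
```

When clusters are clearly separated, as reported above, single-, average- and complete-linkage clustering give the same partition, which is what makes this simplification reasonable for illustration.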
Abundance estimation and comparison of MAGs
The relative abundance of each MAG was estimated using the fraction of reads in each sample mapping to the respective MAG. Normalized on the size of that bin, this yielded the measure fraction of reads per nucleotide in bin. This measure was chosen since it is comparable across samples with varying sequencing output and different bin sizes.
Using the CONCOCT input table, multiplying the average coverage per nucleotide with the length of the contig in question and summing over all contigs within a bin and within a sample gave the number of reads per bin within a sample. The fraction of reads in each sample mapping to each bin was then calculated by dividing this value with the total number of reads from each sample, after having removed duplicated reads.
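The calculation described in the two paragraphs above can be sketched in Python as follows. The function and argument names are hypothetical, and the input is assumed to be a list of `(contig_length, mean_coverage)` pairs for one bin in one sample, as would be read from a coverage table. The sketch follows the text in treating coverage times contig length as the read count for a contig; if coverage is expressed in bases rather than reads, that product counts mapped bases and would need to be divided by read length.

```python
def bin_abundance(contigs, total_reads):
    """Return the fraction of reads per nucleotide for one bin in one sample.

    `contigs` is a list of (length_bp, mean_coverage) pairs for the bin;
    `total_reads` is the number of reads in the sample after duplicate
    removal. Both names are illustrative assumptions.
    """
    bin_length = sum(length for length, _ in contigs)
    # Coverage x length, summed over contigs, as described in the text.
    reads_in_bin = sum(length * coverage for length, coverage in contigs)
    fraction_of_reads = reads_in_bin / total_reads
    # Normalise by bin size so that bins of different length are comparable.
    return fraction_of_reads / bin_length


if __name__ == "__main__":
    toy_bin = [(12000, 3.5), (8000, 4.0)]
    print(bin_abundance(toy_bin, total_reads=16_000_000))
```

Dividing by both the sample read total and the bin length is what makes the measure comparable across samples with different sequencing output and across bins of different size.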
Functional analysis
Contigs in each genome cluster were annotated using PROKKA (v. 1.7, 70), modified so that partial genes covering edges of contigs were included, to suit metagenomic datasets, and extended with additional annotations so that Pfam (http://pfam.xfam.org/), TIGRFAMs (http://www.jcvi.org/cgi-bin/tigrfams/index.cgi), COG (http://www.ncbi.nlm.nih.gov/COG/) and Enzyme Commission (http://enzyme.expasy.org/) numbers were given for all sequences where applicable. The extended annotation was performed using homology search with RPS-Blast. Metabolic pathways were predicted in MAGs using MinPath (v. 1.2, 71) with the Metacyc database (v. 18.1, 72) as a reference. Non-metric multidimensional scaling (NMDS) analysis was applied to the genome clusters based on their annotations as well as on a subset of transporter genes and predicted metabolic pathways. Abundances of functional features were explored, and statistical analyses of functional differences between groups of MAGs performed using STAMP (v. 2.0.9, 73) with multiple test correction using the Benjamini-Hochberg FDR method.
Taxonomic and phylogenetic annotation
Initial taxonomy assignment for each MAG was done with Phylosift 74. To improve the resolution of annotations, classification of 16S rRNA genes was also used. Complete or partial 16S genes were identified on contigs using WebMGA 75. Further, since rRNA is difficult both to assemble and to bin, a complementary approach was used where partial 16S rRNA genes were assembled for each MAG using reads classified as SSU rRNA by SortMeRNA 76, but whose paired-end read was assembled in another contig belonging to the same MAG. The identified and reconstructed 16S fragments were classified with stand-alone SINA 1.2.13 77 and by Blasting against the data by Newton et al 43.
Using the information provided by Phylosift and 16S analysis, relevant isolate genomes and single-amplified genomes (SAG) were selected. These were combined with all complete prokaryotic genomes in NCBI. Prodigal was used for protein prediction in each genome. These proteomes, together with the proteomes of our MAGs, were used for phylogenetic tree reconstruction using Phylophlan 78. Phylophlan’s reference database was not used as we noticed that, in instances where genomes that were already present in the reference were processed by us and added, they tended to branch closer to the MAG than otherwise, thus indicating a role of protein prediction method in the phylogeny. The tree visualizations displayed here were generated with iTOL v2 79. For the sake of clarity, not all species included in the tree are maintained in the overview or clade-specific insets. Since the distances between MAGs and their nearest neighbours in NCBI were as a rule too large for ANI calculation, we adopted Genome BLAST Distance for this comparison 80.
Genome streamlining analysis
The dataset of marine microbial isolates from 4 was downloaded from CAMERA (http://camera.calit2.net/). These were functionally annotated in the same way as the MAGs. For streamlining analysis, the GC content, genome length, and average fraction of non-coding nucleotides were calculated. To avoid bias of shorter contigs, the average fraction of non-coding nucleotides was only based on sequences longer than 5000 nucleotides. For clarity, only genomes belonging to the same phyla as our reconstructed MAGs were included in the analysis. For quantifying how MAG cluster cells were distributed across filter size fractions in 8, 10,000 random reads were sampled from each size fraction from 21 samples and aligned to the MAGs by BLAST, using 95% identity and alignment length of 100 bp as cutoff.
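The streamlining metrics named above (GC content and the fraction of non-coding nucleotides on contigs longer than 5000 nucleotides) can be illustrated with a small Python sketch. The data layout and function names are assumptions. The text does not say whether the "average fraction" is a mean over contigs or a pooled fraction, so the mean of per-contig fractions is used here as one possible reading.

```python
def gc_content(contigs):
    """GC fraction over all sequences in `contigs` (dict: name -> sequence)."""
    total = sum(len(seq) for seq in contigs.values())
    gc = sum(seq.upper().count("G") + seq.upper().count("C")
             for seq in contigs.values())
    return gc / total


def coding_bases(intervals):
    """Bases covered by gene intervals (1-based, inclusive), overlaps merged."""
    covered, current_end = 0, 0
    for start, end in sorted(intervals):
        start = max(start, current_end + 1)
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered


def mean_noncoding_fraction(contig_lengths, genes, min_length=5000):
    """Mean per-contig non-coding fraction, using contigs > `min_length` only.

    Averaging per contig is an assumption of this sketch; returns None when
    no contig passes the length filter.
    """
    fractions = [
        1.0 - coding_bases(genes.get(name, [])) / length
        for name, length in contig_lengths.items()
        if length > min_length
    ]
    return sum(fractions) / len(fractions) if fractions else None


if __name__ == "__main__":
    print(gc_content({"contig_1": "ATGCGCGTTA"}))
```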
Fragment recruitment
Fragment recruitment 7 was used to estimate the presence of the reconstructed MAGs in various locations around the globe. We selected a total of 86 metagenomic samples obtained from a wide range of salinity levels and geographic locations (Table 2). The missing salinity value for Delaware Bay (GS011) was set to 15 PSU after consulting the Delaware Bay Operational Forecast System (DBOFS) (http://tidesandcurrents.noaa.gov/ofs/dbofs/dbofs.html).
Metagenomic projects used as queries for biogeographic fragment recruitments.
All samples were sub-sampled to 10,000 sequences, each 350 bp in length, and all reads were queried against a database of the reconstructed genome bins using Blast+ (v. 2.2.30). Non-coding intergenic sequences were excluded by using only the nucleotide sequences of predicted open reading frames. Only samples comprising the 0.1-0.8 μm filter fraction were used and only hits with e-value < 1e-5 and alignment length > 200 bp were considered. For visualizations, the number of hits for MAGs in each sample was normalized against the total size (in bp) of the MAG. These normalized counts were then averaged over the MAGs of each MAG cluster.
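The filtering and normalization just described can be sketched in Python as below. The e-value and alignment length cutoffs, the per-size normalization and the averaging over MAGs within a cluster follow the text, and the scaling to hits per kb per 10,000 queried reads follows the figure legend. The input layout (one pre-selected hit per read as a `(mag, percent_identity, alignment_length, evalue)` tuple), the identity cutoff argument and the function name are assumptions for illustration.

```python
from collections import defaultdict


def cluster_recruitment(hits, mag_sizes_bp, mag_to_cluster, min_identity=85.0,
                        n_queries=10_000, max_evalue=1e-5, min_aln_len=200):
    """Return recruited reads per kb of MAG per 10,000 queries, per cluster.

    `hits` holds (mag, percent_identity, alignment_length, evalue) tuples,
    assumed to be at most one hit per query read.
    """
    counts = defaultdict(int)
    for mag, pident, aln_len, evalue in hits:
        if pident >= min_identity and evalue < max_evalue and aln_len > min_aln_len:
            counts[mag] += 1

    per_cluster = defaultdict(list)
    for mag, size_bp in mag_sizes_bp.items():
        per_kb = counts[mag] / (size_bp / 1000)
        per_cluster[mag_to_cluster[mag]].append(per_kb * 10_000 / n_queries)

    # Average the normalized counts over the MAGs of each cluster.
    return {cluster: sum(vals) / len(vals) for cluster, vals in per_cluster.items()}


if __name__ == "__main__":
    toy_hits = [("mag_a", 99.2, 340, 1e-60), ("mag_a", 88.0, 150, 1e-20)]
    print(cluster_recruitment(toy_hits, {"mag_a": 1_500_000}, {"mag_a": "BACL1"}))
```

Running the same function with `min_identity=99.0` instead of the default would correspond to the stricter of the two identity levels shown in the heatmaps.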
Acknowledgements
We thank Anders Månsson and Kristofer Bergström for their knowledgeable and persistent sampling effort, and Sabina Arnautovic and Emmelie Nilsson for their careful processing of samples. This work was funded by the EC BONUS project BLUEPRINT, partially funded by FORMAS, by the Swedish Research Council VR (grant 2011-5689) through a grant to A.F.A, as well as by Formas project ECOCHANGE (Strategic Grant for Marine Research) through a grant to C.L. and J.P. All computations were performed on resources provided by the Swedish National Infrastructure for Computing (SNIC) through the Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) and the PDC Center for High Performance Computing at KTH. Sequencing was conducted at the Swedish National Genomics Infrastructure (NGI) at SciLifeLab in Stockholm.