Abstract
Molecular and genomic approaches that target mixed microbial communities (e.g., metagenomics or metatranscriptomics) provide insight into the ecological roles, evolutionary histories, and physiological capabilities of the microorganisms and the processes in the environment. Computational tools that harness large-scale sequence surveys have become a valuable resource for characterizing the genetic make-up of the bacterial and archaeal component of the marine microbiome. Yet, fewer studies have focused on the unicellular eukaryotic fraction of the community. Here, we developed the EukHeist automated computational pipeline, to retrieve eukaryotic and prokaryotic metagenome assembled genomes (MAGs). We applied EukHeist to the eukaryote-dominated large-size fraction data (0.8-2000μm) from the Tara Oceans survey to recover both eukaryotic and prokaryotic MAGs, which we refer to as TOPAZ (Tara Oceans Particle-Associated MAGs). The TOPAZ MAGs consisted of more than 900 eukaryotic MAGs representing environmentally-relevant microbial and multicellular eukaryotes in addition to over 4,000 bacterial and archaeal MAGs. The bacterial and archaeal TOPAZ MAGs retrieved with EukHeist complement previous efforts by expanding the existing phylogenetic diversity through the increase in coverage of many likely particle- and host-associated taxa. We also demonstrate how the novel eukaryotic genomic content recovered from this study might be used to infer functional traits, such as trophic mode. By coupling MAGs and metatranscriptomic data, we explored ecologically-significant protistan groups, such as the Stramenopiles. A global survey of both eukaryotic and prokaryotic MAGs enabled the identification of ecological cohorts, driven by specific environmental factors, and putative host-microbe associations. Accessible and scalable computational tools, such as EukHeist, are likely to accelerate the identification of meaningful genetic signatures from large datasets, ultimately expanding the eukaryotic tree of life.
Introduction
Unicellular microbial eukaryotes, or protists, play a critical part in all ecosystems found on the planet. In addition to their vast morphological and taxonomic diversity, protists exhibit a range of functional roles and trophic strategies (Caron et al., 2011). Protists are centrally important to global biogeo-chemical cycles, mediating the pathways for the synthesis and processing of carbon and nutrients in the environment (Mitra et al., 2014; Caron et al., 2017; Strom, 2008). Despite their importance across ecosystems and in the global carbon cycle, research on microbial eukaryotes typically lags behind that of bacteria and archaea (Caron and Countway, 2009; Keeling and Campo, 2017). Consequently, fundamental questions surrounding microbial eukaryotic ecological function remain unresolved. Novel approaches that enable genome retrieval from meta’omic data provide a means of bridging that knowledge gap.
Assembled genetic fragments (derived from metagenomic reads) can be grouped together based on their abundances, co-occurrences, and tetranucleotide frequency to reconstruct likely genomic collections, often called bins (Alneberg et al., 2014; Wu et al., 2014; Kang et al., 2019; Graham et al., 2017). These bins can then be refined through a series of steps to ultimately represent metagenome assembled genomes or MAGs (Parks et al., 2017; Delmont et al., 2018; Tully et al., 2018; Almeida et al., 2019). Binning metagenomic data into MAGs has revolutionized how researchers ask questions about microbial communities and has enabled the identification of novel bacterial and archaeal taxa and functional traits (Rinke et al., 2019; Tully, 2019), but the recovery of eukaryotic MAGs is less well established. The reason for this being arguably twofold: (1) eukaryotic genomic complexity (Zhang et al., 2011) complicates both metagenome assembly and MAG retrieval; and (2) there is a bias in currently available metagenomic computational tools towards the study of bacterial and archaeal members of the community. Much can be learned about the diversity and role of eukaryotes in our environment from eukaryotic MAG retrieval (Olm et al., 2019).
Here we developed and applied EukHeist, a scalable and reproducible pipeline to facilitate the reconstruction, taxonomic assignment, and annotation of prokaryotic and eukaryotic metagenome assembled genomes (MAGs) from mixed community metagenomes. The EukHeist pipeline was applied to a metagenomic dataset from the Tara Oceans expedition protist-size fractions samples (Carradec et al., 2018), which encompasses more than 20Tb of raw sequence data. From these large-size fraction metagenomic samples, we recovered over 4,000 prokaryotic MAGs and 900 eukaryotic MAGs.
Results and Discussion
We developed the EukHeist metagenomic pipeline to automate the recovery and classification of eukaryotic and prokaryotic MAGs from large-scale environmental metagenomic datasets. EukHeist was applied to the metagenomic data from the large-size fraction metagenomic samples (0.8-2000μm) from Tara Oceans (Carradec et al., 2018), which is dominated by eukaryotic organisms. We generated 94 co-assembled metagenomes based on the ocean region, size fraction, and depth of the samples (Figure S1), which totaled 180 Gbp in length (Supplementary Table 1). A total of 988 eukaryotic MAGs and 4,022 prokaryotic MAGs were recovered; considering our efforts to target the larger size fractions, these MAGs have been made available under the name Tara Oceans Particle Associated MAGs, or TOPAZ (Supplementary Tables 2 and 3). The TOPAZ MAGs expand the current repertoire of publicly available eukaryotic genomic references for the marine environment and shed light on the biogeographical and functional potential of these eukaryotic-enriched marine communities.
Eukaryotic genome recovery from metagenomes covers major eukaryotic supergroups
The EukHeist classification pipeline identified 988 putative eukaryotic MAGs following the refinement of recovered metagenomic bins based on length (> 2.5 Mbp) and proportion of base pairs predicted to be eukaryotic in origin by EukRep (West et al., 2018) (Figure S4). Protein coding regions in the eukaryotic MAGs were predicted using the EukMetaSanity pipeline (Neely et al., 2021), and the likely taxonomic assignment of each bin was made with MMSeqs (Steinegger and Söding, 2018) and EUKulele (Krinos et al., 2021) (Supplementary Table 2). Of the 988 eukaryotic MAGs recovered, 713 MAGs were estimated to be more than 10% complete based on the presence of core eukaryotic BUSCO orthologs (Simao et al., 2015). For the purposes of our subsequent analyses, we only consider the highly complete eukaryotic TOPAZ MAGs, or those that were greater than 30% complete based on BUSCO ortholog presence (n=485) (Figure 1).
Eukaryotic genomes are known to be both larger and have higher proportions of non-coding DNA than bacterial genomes (Zhang et al., 2011). On average across sequenced eukaryotic genomes, 33.1% of genomic content codes for genes (2.6% - 59.8% for the 1st and 3rd quartiles) (Hou and Lin, 2009), while bacterial genomes have a higher proportion of coding regions (86.9%; 83.9% - 89.3%) (Hou and Lin, 2009). The high-completion TOPAZ eukaryotic MAGs have an average of 73.7% ± 14.3% gene coding regions (Figure S9). This trend of a higher proportion of coding regions was consistent across eukaryotic groups, where Haptophyta and Ochrophyta TOPAZ MAGs had an average coding region of 80.3 ± 4.9% and 78.1 ± 6.3%, respectively. Genomes from cultured Haptophyta (Emiliania huxleyi CCMP1516 with 31 Mb or 21.9% (Read et al., 2013)), and Ochrophyta (Phaeodactylum tricornutum with 15.4 Mb or 57.3% (Bowler et al., 2008)) had significantly lower proportions of protein coding regions within their genomes compared to TOPAZ MAGs. The lowest percentages of gene coding were within Metazoan and Fungal TOPAZ MAGs, with 52.6 ± 9.8% and 58.8 ± 6.7%, respectively. As a point of comparison, the human genome is estimated to have ≈ 34 Mb or ≈ 1.2% of the genome coding for proteins (Consortium, 2004). Globally, the higher gene coding percentages for the recovered eukaryotic TOPAZ MAGs likely reflect biases caused by the use of tetranucleotide frequencies in the initial binning (Kang et al., 2019) as well as challenges inherent in the assembly of non-coding and repeat-rich regions of eukaryotic genomes.
Phylogenetic placement of TOPAZ MAGs aligned with estimated taxonomy based on protein-consensus annotation (Figure 1). The recovered MAGs spanned 8 major eukaryotic supergroups: Archaeplastida (Chlorophyta), Opisthokonta (Metazoa, Choanoflagellata, and Fungi), Amoebozoa, Apusozoa, Haptista (Haptophyta), Cryptista (Cryptophyta), and the SAR supergroup (Stramenopiles, Alveolata, and Rhizaria) (Burki et al., 2020). Eukaryotic MAGs were retrieved from all ocean regions surveyed, with the largest number of high-completion TOPAZ MAGs recovered from the South Pacific Ocean Region (SPO) (n=143) and the fewest recovered from the Southern Ocean (SO) (n=11) and Red Sea (RS) (n=12) (Figure S8). These regional trends in MAG recovery and taxonomy aligned with the overall sequencing depth at each of these locations (Supplementary Table 1), with fewer, less diverse MAGs recovered from the SO and RS (Figures S5 and S8).
The largest number of MAGs was recovered from the smallest size fraction (0.8 − 5μm) (n=311) (Figures 1 and S5), and yielded the highest taxonomic diversity, including MAGs from all the major supergroups listed above (Figure S5). Chlorophyta (n=133), Ochrophyta (n=57), and taxa placed within the SAR group (Stramenopiles, Alveolata, and Rhizaria) (n=56) made up the the largest pro-portion of small size fraction MAGs. Chlorophyta MAGs were smaller and had fewer predicted proteins relative to other eukaryotic MAGs, despite demonstrating comparable completeness metrics; the average Chlorophyta MAG size was 13.9 Mbp with 7525 predicted proteins (Figure S9). By contrast, Cryptophyta and Haptophyta had the largest average MAG size with 50.8 Mbp and 44.4 Mbp with an average of 23500 and 24400 predicted proteins, respectively (Figure S9). Fewer eukaryotic MAGs were recovered from the other size fractions 5 − 20μm (n=20), 20 − 180μm (n=87), and 180 − 2000μm (n=39) (Figure S5), instead these larger size fractions recovered a higher total number of metazoan MAGs. Metazoan MAGs had the lowest average completeness (50 ± 13%) (Figures S7 and S9); where the average size of recovered metazoan MAGs was 43.2 Mb (6.5-177Mbp), encompassing an average of 14600 proteins (Figure S9). 76 of the 123 metazoan MAGs likely belong to the Hexanauplia (Copepoda) class; copepod genomes have been estimated to be up to 2.5 Gb with high variation (10-fold difference) across sequenced members (Jørgensen et al., 2019).
MAGs were also retrieved from all discrete sampling depths: surface, SRF (n=315), deep chlorophyll max, DCM (n=133), mesopelagic, MES (n=13), as well as samples with no discrete depth, MIX (n=21) and the filtered seawater controls, FSW (n=3). Notably, the FSW included 1 Chlorophyta MAG (TOPAZ_IOF1_E003) that was estimated to be 100% complete with no contamination (Supplementary Table 2).
The composition of TOPAZ MAGs from basin-scale mesopelagic co-assemblies recovered a higher percentage of fungi relative to other depths. This is similar to other mesopelagic and bathypelagic molecular surveys, where the biomass of fungi is thought to outweigh other eukaryotes (Morales et al., 2019; Pernice et al., 2015; Edgcomb et al., 2010). Further, fungal MAGs had the highest overall average completeness (87 ± 15%) (Figures S7 and S9). A total of 16 highly complete fungal MAGs were also recovered, of those, 11 originated from the MES (Figures 1 and S8). Putative fungal TOPAZ MAGs were recovered from the phyla Ascomycota (n=10) and Basidiomycota (n=1) and ranged in size from 12.5-47.8 Mb (Figure S9), which are within range of known average genome sizes for these groups, 36.9 and 46.5 Mb, respectively (Mohanta and Bae, 2015).
The metagenome read recruitment to these TOPAZ MAGs paralleled MAG recovery, where metazoan MAGs dominated the larger size fractions (20 − 180μm and 180 − 2000μm) across both the surface and DCM for all stations, and Chlorophyta MAGs were dominant across most of the small size fraction stations (0.8 − 5μm) (Figure S11). A notable exception are the stations from the Southern Ocean, where Haptophyta and Ochrophyta were most abundant in all size fractions. Compared to the samples from the photic zone, SRF and DCM, the average recruitment of reads from the MES was far lower (24500 ± 34450 average CPM in the MES compared to 131000 ± 104000 and 136000 ± 85000 for the SRF and DCM, respectively (Figure S11). This suggests that the mesopelagic have high variability across communities (Pernice et al., 2015) and that we did not fully capture the eukaryotic MAGs that adequately describe all surveyed communities. Alternatively, this might suggest that the communities sampled were dominated by prokaryotic biomass (Pernice et al., 2014).
Trophic mode can be predicted from MAG gene content
Eukaryotic microbes can exhibit a diversity of functional traits and trophic strategies in the marine environment (Worden et al., 2015; Caron et al., 2011), including phototrophy, heterotrophy, and mixotrophy. Phototrophic protists are responsible for a significant fraction of the organic carbon synthesis via primary production; these phototrophs dominate the microbial biomass and diversity in the sunlit layer of the oceans (Worden et al., 2015; de Vargas et al., 2015). Phagotrophic protists (heterotrophs), which ingest bacteria, archaea, and smaller eukaryotes, and parasitic protists are known to account for a large percentage of mortality in food webs (Sherr and Sherr, 2002; Caron et al., 2011; Worden et al., 2015). Protists are also capable of mixed nutrition (mixotrophy), where a single-cell exhibits a combination of phototrophy and heterotrophy (Stoecker et al., 2017). Typically, the identification of trophic mode has relied upon direct observations of isolates within an lab setting, with more recent efforts including transcriptional profiling as a means of assessing trophic strategy (Keeling et al., 2014; Liu et al., 2016). Scaling up these culture-based observations to environmentally-relevant settings (Alexander et al., 2015; Hu et al., 2018; Gong et al., 2016) has been an important advance in the field for exploring complex communities without cultivation. An outcome from these studies has been the realization that trophic strategies are not governed by single genes (Labarre et al., 2020); in reality, trophic strategy will be shaped by an organisms’ physiological potential and environmental setting. Therefore larger genomic and transcriptomic efforts to predict or characterize presumed trophic strategies among mixed microbial communities will greatly contribute to our understanding of the role that microorganisms play in global biogeochemical cycles, by enabling the observation of functional traits and strategies in situ.
Large scale meta’omic results, such the TOPAZ MAGs recovered here, can be leveraged alongside presently available reference data to enable the prediction of biological traits (such as trophic mode) without a priori information. Machine learning (ML) applications can be implemented to access the potential of these large datasets. ML approaches have been recently shown to be capable of accurate functional prediction and cell type annotation using genetic input, in particular for cancer cell prediction (Shipp et al., 2002; Bashiri et al., 2017; Tabl et al., 2019), and functional gene and phenotype prediction in plants (Mahood et al., 2020). Recently, these approaches have been applied to culture and environmental transcriptomic data to predict trophic mode using currently available trophy annotations (Lambert et al., 2021; Burns et al., 2018; Jimenez et al., 2021). Here, we apply an independent machine learning model to the eukaryotic TOPAZ MAGs to predict each organisms’ capacity for various metabolisms.
Using a reference set built from protistan transcriptomic data, we predicted the trophic mode of the highly complete TOPAZ MAGs using machine learning and direct estimation via presence of important KEGG pathways (eq. (8)). As the gradient of trophic mode among protists is not strictly categorical, we calculated a Heterotrophy Index (H-index) that places the TOPAZ MAGs on a scale of highly phototrophic (negative values) to highly heterotrophic (positive values) (Figures 1 and 2). Thus, for all sufficiently complete (≥ 30%) TOPAZ MAGs we have predicted both a gross trophic category (heterotrophic (n = 227), mixotrophic (n = 0), or phototrophic (n = 258) as well as the quantitative extent of heterotrophy (H-index, eq. (8)). Broadly, the trophic predictions aligned well with the putative taxonomy of each MAG (Figures 1 and 2). For example, TOPAZ MAGs that had taxonomic annotation of well known heterotrophic lineages (Metazoa, Fungi), were predicted as heterotrophs based on our model. Further, our data-driven trophic mode predictions correlate well with an independent model designed to identify the presence of photosynthetic machinery and capacity for phagotrophy (Burns et al., 2018) (Figures S26 and S27).
Despite evidence that many lineages recovered include known mixotrophs, no TOPAZ MAGs were identified as mixotrophic using this approach. However, the utility of the H-index enables us to still consider mixotrophic-capable MAGs. We explore the likely reasons for this more deeply in the Section 1.3, but one potential explanation is that MAG recovery targets the genomic content of a eukaryotic lineage and the evolutionary history of phototrophy and heterotrophy is complicated and varies with respect to species (Flynn et al., 2019). Therefore, the genetic composition of MAGs may reflect encoded metabolisms that are not necessarily exhibited in situ. Additionally, mixotrophy is not a singular trait, but rather a spectrum of metabolic abilities that are largely driven by the microorganisms’ nutritional needs and surrounding environment. Continued culturing combined with large-scale ’omic efforts will continue to improve such ML models focused on complex traits and ultimately our ability to predict trophic mode. We suggest that the integration of metagenomic and metatranscriptomic datasets might better reflect the active strategies being used.
Ecological niches of heterotrophic and phototrophic Stramenopiles using meta-transcriptomic evidence
A total of 24 high-completion TOPAZ eukaryotic MAGs were subset as a case study to explore the utility of pairing metatranscriptome data with MAG results to characterize the ecological context of less resolved protistan lineages. Selected MAGs included 11 taxonomically-assigned as Dictyochophyceae (silicoflagellates), which were putatively classified as phototrophs with our trophic model (Figure 2), and 13 MAGs within a phylogenetically-related stramenopile clade and classified as heterotrophs (Figures 1 and 3 a).
All 24 MAGs had a cosmopolitan distribution primarily originating from surface samples from the smallest size fraction, but some individual MAGs had relatively higher CPM abundances suggesting environmental selection based on oceanographic region (Figure 3a, Figure S18, Figure S17). The biogeography of the Dictyochophyceae aligned with existing literature, where Dictyochophyceae are globally distributed and typically found in the euphotic layer of the world ocean (Vaulot et al., 2008; Obiol et al., 2020; Massana, 2011). Global sampling efforts have also recovered genetic signatures of Dictyochophyceae as a prominent, but not abundant, member of the Stramenopile group that does not demonstrate remarkable seasonality (Giner et al., 2019; Obiol et al., 2020). While grazing on bacteria and picocyanobacteria has been observed within mixotrophic Dictyochophyceae, previous work to quantify grazing rates were unsuccessful to due to the low cell abundance (Unrein et al., 2014). Therefore, metagenomic and metatranscriptomic datasets, such as the recovered TOPAZ MAGs, are well suited to illuminate the biogeography and functional potential of the Dictyochophyceae. Dictyochophyceae and heterotrophic Stramenopiles have previously been reported in analyses of the Tara Oceans data (Carradec et al., 2018; Pierella Karlusich et al., 2020; Vorobev et al.; Sieracki et al., 2019).
This analysis enabled both targeted and untargeted approaches to investigate the physiology of SAR TOPAZ MAGs. First, we assessed ortholog co-occurrence across the MAGs to identify genes that were present only among the phototrophy-predicted Dictyochophyceae MAGs (Phototrophy Exclusive; Figure 3b). Genes deemed Phototrophy Exclusive (Figure 3b) were related to chlorophyll biosynthesis (Por; protochlorophyllide reductase), enzymes integral to the pentose phosphate pathways, and acyltransferases and carboxylases, which are involved in de novo fatty acid biosynthesis functions (e.g., ACACA; acetyl-CoA carboxylase biotin, ACSS; acetyl-coA synthetase). Inversely, genes detected only within putative stramenopile MAGs classified as heterotrophs (i.e., Heterotrophy Exclusive; Figure 3 b) included enzymes integral for the breakdown of large sugar molecules such as glycosaminoglycans (e.g., IDUA; L-iduronidase, UDP-glucose:O-linked fucose beta-1,3-glucosyltransferase, NAGLU; alpha N acetylglucosaminidase, IDS; iduronate-2sulfatase, and GALC; galactosylceramidases). Enzymes associated with glycosaminoglycan metabolism may be associated with cell adhesion or the intracellular re-processing of glycosaminoglycan; the latter of which may be a genetic attribute for more heterotrophic lifestyles. By isolating the presence and absence of specific genes across MAGs with varied predicted trophic modes, we can identify sets of genes that may be indicative of a species’ ecological role.
Ordination results based on the expression of transcripts common across all 24 MAGs (Figure 3 c; based on ‘Shared in all MAGs’ in Figure 3b) clustered by H-index and TOPAZ MAG identity (Figure S19). While there was some overlap among MAGs predicted to be heterotrophic versus phototrophic, trends dictating the PCA results appeared to be driven primarily by the trophic mode of individual MAGs, rather than region sampled (Figure 3 c, Figure S19). Further, ordination results resembled previously observed trends from transcript-based efforts to separate phototrophic, mixotrophic, and heterotrophic protistan species from cultivation (Koid et al., 2014; Beisser et al., 2017) and the environment (Hu et al., 2018). Results from the mapped metatranscriptome reads revealed populations to exhibit more heterotrophic or phototrophic traits depending on the environment (Figure S21, Figure S22). For instance, among the Dictyochophyceae-predicted MAGs variable phototrophic versus heterotrophic relative abundances may reflect a mixotrophic-capable population responding to the environment (Figure S21). The 13 MAGs classified as heterotrophic (ML model in this study) were phylogenetically similar to other Stramenopiles (Figure 3 b & Figure 1) and H-indices reported for each MAG aligned with what was seen in the metatranscriptome signal, where MAGs with the highest heterotrophy scores (e.g., TOPAZ_SAS1_E035 and TOPAZ_NAX_E011; Figure S19) had higher CPM associated with heterotrophic traits in all samples (Figure S22). While the identity of the Dictyochophyceae MAGs was further supported by phylogenetic similarity with other Dictyochophyceae MAGs and single-cell genome-informed MAGs (Figure S20), the taxonomic identity of the presumed heterotrophic Stramenopile TOPAZ MAGs was less resolved. While we cannot confidently annotate beyond the taxonomic classification of ’SAR’, these MAGs were distinct from those in culture and likely include a mixotrophic-capable group of protists distinct from the MA-rine STramenopiles (MAST; Figure S20). These findings demonstrate the value of large untargeted genetic approaches to gain insight into the in situ metabolisms of less explored branches of the eukaryotic tree of life. Paired metagenomic and metatranscriptome results, alongside the environmental context provided by a large-scale global sampling effort, and the predicted nutritional strategies we can gain a more comprehensive understanding of protists in our oceans.
TOPAZ prokaryotic MAGs distinct from previous marine MAG recovery efforts
The vast majority of the retrieved prokaryotic MAGs belonged to Bacteria. High-quality non-redundant TOPAZ (HQ-NR-TOPAZ) MAGs were comprised of 711 bacterial and 5 archaeal MAGs belonging to 30 different phyla (Figure 4 and Supplementary Table 4); an additional 15 phyla were recovered in the medium quality (MQ) MAGs. Of the 716 HQ-NR-TOPAZ MAGs, 507 were unique based on a 99% ANI comparison threshold with MAGs generated from previous binning efforts from Tara Oceans metagenomic data, including Delmont et al. (2018) (TARA), Tully et al. (2018) (TOBG), and Parks et al. (2017) (UBA) (Figure 4). The phylogenetic diversity captured by the TOPAZ MAGs was quantified by a comparison to a "neutral" reference set of genomes; these neutral references approximate the state of marine microbial genomes, dominated by isolate genomes, previous to the incorporation of the Tara Oceans-derived MAGs (Table 1). Relative to the neutral genomic references, the entire TOPAZ NR (includes both HQ and MQ) set represented a 42.8% phylogenetic gain (as measured by additional branch length contributed by a set of data) and 59.9% phylogenetic diversity (as measured by the total branch length spanned by a set of taxa), as compared to efforts focused solely on the smaller size fractions such as TARA and UBA, which had a smaller degree of gain (31.0% and 25.8%, respectively) and diversity (44.4% and 40.5%, respectively) (Table 1). An inclusive tree containing the neutral reference and all Tara Oceans MAGs (TOBG + UBA + TARA + TOPAZ), the TOPAZ NR MAGs represented 14.4% of the phylogenetic gain and 44.7% phylogenetic diversity, suggesting that the TOPAZ MAGs offer the largest increase in phylogenetic novelty when compared to MAGs reconstructed from the metagenomes of the smaller size fractions (< 5.00μm). The TOPAZ MAGs primarily originated from the larger Tara Ocean size fraction samples, and thus include a higher proportion of more complex host- and particle-associated bacterial communities. The novelty of the HQ- and MQ-NR-TOPAZ MAGs here, suggests that these particle-associated MAGs are overlooked and current genome databases are largely skewed towards free-living bacteria.
To confirm the hypothesis that the prokaryotic TOPAZ MAGs included particle-associated members, we examined the genomic features of several selected groups that were well-recovered here and in single-cell amplified genomic datasets (i.e., GORG) (Pachiadaki et al., 2019). To avoid potential biases related to completeness and contamination of the genomes, only the HQ-NR MAGs were compared to the GORG SAGs, and analyses were limited to groups with sufficient representation within both datasets (Bacteriodota, Cyanobacteria, and Proteobacteria). For these well represented groups, the average GC% and estimated genome size of the TOPAZ MAGs were significantly higher than the ones typically reported in free-living marine bacteria (Dufresne et al., 2005; Swan et al., 2013; Luo et al., 2015) and those observed within the GORG dataset (Pachiadaki et al., 2019). TOPAZ MAGs were found to encode more tRNAs on average per genome than GORG (39.5 vs 30). Additionally, Carbohydrate-Active Enzymes (CAZy) and peptidases were enriched within the TOPAZ MAGs relative to GORG (Figure S29). Larger genomes have been considered diagnostic for a copiotrophic lifestyle in bacteria (Okie et al., 2020), since the more extended and flexible gene repertoire can facilitate substrate catabolism in organic rich niches such as particles. Genomes of copiotrophs are also commonly found to have higher copy numbers of genes associated with replication and protein biosynthesis such as tRNAs and rRNAs (Rocha, 2004) which facilitate higher growth rates. In contrast, the streamlined genomes of SAR11 and other groups that have free-living oligotrophic lifestyles require fewer resources to maintain and replicate their genomes and have higher carbon-use efficiency (Giovannoni et al., 2014). Similarly, G and C have higher energy cost of production and more limited intracellular availability compared to A and T (Moore et al., 2013; Luo et al., 2015). The genomic trends observed support our findings that TOPAZ MAGs represent both particle associated and free-living microbes, and are relatively enriched for copiotrophic microbes.
Environmental factors structure TOPAZ MAG co-occurrence
The co-retrieval of eukaryotic and prokaryotic MAGs from across the global ocean allows the unique opportunity to assess the biogeographical and ecological associations and potential co-occurrence of these organisms while also being able to infer likely function. To identify communities of associated organisms that co-occur across the surface ocean metagenomes, we performed a correlation clustering based on the abundances of the eukaryotic TOPAZ MAGS and the HQ-NR-TOPAZ MAGs (Figure 5 a). We employed a modularity optimization algorithm to the correlation analysis Blondel et al. (2008) to identify distinct communities of co-occurring organisms. This approach identified seven distinct communities, which included 379 of the non-redundant TOPAZ MAGs (Figure 5 b). The communities were variably connected to each other, as defined by Equation (11), with high connectedness among Communities 1, 2, and 3, and patchy connectedness among Communities 4-7 (Figure 5 b). Community 6 (the smallest of the communities, which consisted of a single Bacillariophyta TOPAZ MAG and five distinct Synechococcus TOPAZ MAGs) had the highest inter-connectedness (0.733 connectedness), suggesting that members of this community co-occurred across samples with high fidelity. Moreover, this community was largely distinct from other communities and only shared significant connections to Community 7. The other six communities showed lesser degrees of inter-connectedness (range: 0.181 − 0.550; mean: 0.365 ± 0.204), suggesting that they co-occur less consistently across samples.
The seven communities that we identified based on metagenomic abundance correlations also significantly correlated with environmental factors, which consequently define the environmental niches where the communities were most abundant (Figure 5 c, Supplementary Table 13). Temperature was a primary factor defining the community correlations, significantly correlating with five of the seven communities. Community 4 correlated with colder temperatures and Communities, 1, 3, 5, and 6 correlated with warmer temperatures (Figure 5 c). In Community 4, we found significant positive correlations with chlorophyll and net primary productivity (Chla: ρ = 0.401, p = 2.34e − 35, NPP: ρ = 0.166, p = 1.28e− 3), while we found negative correlations with “residence time” (ρ = −0.347, p = 1.98e − 14), indicating a likely occurrence in newly formed eddies. Thus, Community 4 was largely found within colder, productive regions, and had enhanced metagenomic abundance in the Southern Ocean and the North Atlantic (Figure S32). Community 4 was comprised of MAGs from Chlorophyta, Cryptophyta, Haptophyta, and Ochrophyta, the major groups containing primarily phototrophic eukaryotic microbes. 18 prokaryotic MAGs were also contained in this community, including both photosynthetic (Synechococcales) and non-photosynthetic lineages (e.g. Myxococcota and Planctomycetota). All told, this guild of MAGs comprises likely photosynthesizers often found in cold, but not necessarily nutrient-rich, environments.
The four communities (1, 3, 5, and 6) that were positively correlated with temperature were distinguished by different correlations with nutrients and physical features. Communities 1 and 3 tended to be found in longer-lived eddies (according to the calculation by d’Ovidio et al. (2010) as reported in the Tara Oceans metadata Tara Oceans Consortium and Tara Oceans Expedition (2016) as “residence time”) (Figure 5 c; Community 1: ρ = 0.274, p = 1.74e− 8, Community 3: ρ = 0.216, p = 7.15e− 5). However, these two communities differed both in their association with nutrients and their taxonomic compositions. Community 1 was dominated by Metazoa and bacteria and correlated with oligotrophic conditions (nitrate and nitrite: ρ = −0.234, p = 1.21e− 10, phosphate: ρ = −0.269, p = 1.76e− 14, silica: ρ = −0.190, p = 1.02e − 6), and was most abundant in the larger size fraction samples (20-2000 μm) (Figure S32). By contrast, Community 3 was largely comprised of phototrophic chlorophytes and bacteria and was more likely to be found in high-nutrient environments (nitrate and nitrite: ρ = 0.230, p = 3.04e− 10, phosphate: ρ = −0.266, p = 3.82e− 14, silica: ρ = 0.182, p = 4.39e− 6), and was most abundant in smaller size fraction samples (0.8-20 μm), particularly around the tropics (Figure S32). The other two warm-associated communities were comprised by SAR and bacteria (Community 5), and Ochrophyta and SAR (Community 6) (Figure 5 b). These communities were negatively correlated with nutrient concentrations (Community 5: nitrate and nitrite: ρ = −0.140, p = 3.17e − 3, phosphate: ρ = −0.147, p = 1.28e − 3, silica not significant; Community 6: nitrate and nitrite: ρ = −0.355, p = 1.68e− 26, phosphate: ρ = −0.418, p = 6.01e− 38, silica: ρ = −0.187, p = 1.92e− 6;), suggesting communities that thrive in oligotrophic regions.
While many of the communities recovered appeared to be driven largely by environmental forces, the genomic signatures of Community 1 members suggest that this community was comprised by MAGs from hosts and likely microbiome-associated microbes. Community 1 was comprised primarily of Metazoa, specifically Hexanauplia, and bacterial MAGs (Figure 5 b). Many of the bacterial MAGs in Community 1 had genes that suggest adaptations to microaerophic niches such as those which might be experienced when living in close host association (e.g. high affinity oxygen cytochromes, and reductases) (Figure S33). The bacterial MAGs in Community 1 could be broadly broken into two apparent functional types: those with larger genomes typical of copiotrophic bacteria and those with small genomes indicative potentially of reductive evolution. The first group was comprised of MAGs from family Saprospiraceae in phylum Bacteriodota (n=2, 3.0 Mbp average genome size), the family UBA2386 in phylum Plactomycetota, which lacks cultured representatives (n=2, 3.3 Mbp), the order Opitutales in phylum Verrucomicrobiota (n=2, 3.4Mbp), and the family Vibrionaceae (n=2, 4.5 Mbp) and order Pseudomonadales (n=2; 3.2 Mbp), both in phylum Gammaproteobacteria (Figure S33). In addition to their relatively large size, the Saprospiraceae, Plactomycetota and Vibrionaceae MAGs were found to encode for genes involved in the hydrolysis and utilization of various complex carbon sources including chitin and other carbohydrates (Figure S33), such as those that might be shed or excreted by zooplankton such as copepods (Corte et al., 2017). By contrast, the second group of bacterial MAGs within Community 1 with smaller genomes, included MAGs from the Proteobacteria order Rickettsiales (n=3, 0.6-1.2 Mbp), Gammaproteobacteria family Francisellaceae (n=1, 1.2 Mbp), and the Bacteriodota family Amoebophilaceae (n=1; 0.8Mbp) Figure S33). The smaller genome sizes exhibited by these groups may be indicative of a genome streamlining which occurred with reductive evolution due to obligate or facultative symbiosis (Giovannoni et al., 2014). Rickettsiales, Francisellaceae, and Amoebophilaceae all contain well-described obligate intracellular symbionts (Santos- Garcia et al., 2014; Darby et al., 2007; Li et al., 2021) and zoonotic pathogens (Celli and Zahrt, 2013; Darby et al., 2007).
Conclusion
Sequence datasets are revolutionizing how we form new hypotheses and explore environments on the planet. Here, we demonstrated a critical advance in the recovery of MAGs from environmentally-relevant eukaryotic organisms with EukHeist. The retrieval and study of MAGs to study the role of microorganisms in environmentally significant biogeochemical cycling is promising; however the current lack of eukaryotic reference genomes and transcriptomes complicates our ability to interpret the eukaryotic component of the microbial community. We recovered 988 total eukaryotic MAGs, 485 of which were deemed highly complete. Our findings demonstrate that specific branches of the eukaryotic tree were more likely to be resolved at the MAG-level due to their smaller genome size, distribution in the water column, and biological complexity. A substantial portion of the recovered eukaryotic MAGs were distinct from existing sequenced representatives, demonstrating that these large-scale surveys are a critical step towards characterizing less-resolved branches of the eukaryotic tree of life.
The continuing expansion of global-scale meta’omic surveys, such as BioGeoTraces Biller et al. (2018) and Bio-GO-SHIP (Ustick et al., 2021), highlights the importance of developing scalable and automated methods to enable more complete analysis of these data. Metagenomic pipelines that specifically integrate steps for handling eukaryotic biology, such as the EukHeist pipeline, are vital as eukaryotes are important members of microbial communities, ranging from the ocean, to soil (Bailly et al., 2007) and human- (Lukeš et al., 2015) and animal-associated (Campo et al., 2019) environments. The application of eukaryotic-sensitive methods such as EukHeist to other systems stands to greatly increase our understanding of the diversity and function of the "eukaryome".
Materials and Methods
Data acquisition
The metagenomic and metatranscriptomic data corresponding to the size fractions dominated by eukaryotic organisms ranging from microbial eukaryotes and zooplankton (0.8 − 2000μm) as originally published by Carradec et al. (2018) were retrieved from European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) under the accession numbers PRJEB4352 (large size fraction metagenomic data) and PRJEB6603 (large size fraction metatranscriptomic data) on November 20, 2018. Only samples with paired end reads (forward and reverse) were used in the subsequent analyses (Supplementary Table 1). After an initial sample-to-sample comparison with sour-mash (sourmash compare -k 31 -scaled 10000) (Brown and Irber, 2016) (Figure S3), it was determined that samples largely clustered by depth and size fraction. Samples were grouped for co-assembly by size fraction (0.8 − 5μm, 5 − 20μm, 20 − 180μm, and 180 − 2000μm) as per Carradec et al. (2018), depth or sample type (surface (SRF), deep chlorophyll maximum (DCM), mesopelagic (MES), mixed surface sample (MIX), and filtered seawater (FSW)), and geographic location (Supplementary Table 1). In cases where a sample did not fall directly within one of the size classes, it was assigned to an existing size class based on the upper μm limit of the sample. This grouping resulted in the combination of 824 cleaned, paired FASTQ files samples into 94 distinct co-assembly groups, which were used downstream for co-assembly (Supplementary Table 1).
EukHeist pipeline for metagenome assembly and binning
The metagenomic analysis, assembly, binning, and all associated quality control steps were carried out with a bioinformatic pipeline, EukHeist, that enables user-guided analysis of stand-alone metagenomic or paired metagenomic and metatranscriptomic sequence data. EukHeist is a streamlined and scalable pipeline currently based on the Snakemake workflow engine (Koster and Rahmann, 2012) that is configured to facilitate deployment on local HPC systems. Figure S2 outlines the structure and outputs of the existing EukHeist pipeline. EukHeist is designed to retrieve and identify both eukaryotic and prokaryotic MAGs from large, metagenomic and metatranscriptomic datasets (Figure S2). EukHeist takes input of sequence meta-data, user-specified assembly pairings (co-assembly groups), and raw sequence files, and returns MAGs that are characterized as either likely eukaryotic or prokaryotic.
Here, all raw sequences accessed from the EMBL-EBI were quality assessed with FastQC and MultiQC (Andrews, 2010). Sequences were trimmed using Trimmomatic (v. 0.36; parameters: ILLUMINACLIP: 2:30:7, LEADING:2, TRAILING:2, SLIDINGWINDOW:4:2, MINLEN:50) (Bolger et al., 2014). Passing mate paired reads were maintained for assembly and downstream analyses. Quality trimmed reads co-assembled based on assembly groups (Supplementary Table 1) with MEGAHIT (v1.1.3, parameters: k= 29,39,59,79,99,119) (Li et al., 2015). Basic statistics were assessed for all assemblies with Quast (v. 5.0.2) (Gurevich et al., 2013) (Supplementary Table 1). Cleaned reads from assembly-group-associated metagenomic and metatranscriptomic samples were mapped back against the assemblies with bwa mem (v.0.7.17) (Li and Durbin, 2010). The bwa-derived abundances were summarized with MetaBat2 (v. 2.12.1) script jgi_summarize_bam_contig_depths (with default parameters). The output contig abundance tables were used along with tetranucleotide frequencies to associate contigs into putative genomic bins using MetaBat2 (v. 2.12.1) (Kang et al., 2019). The Snakemake profile used to conduct this analysis is available at https://www.github.com/alexanderlabwhoi/ tara-euk-metag. A generalized version of the Snakemake pipeline (called EukHeist) that might be readily applied to other datasets is available at https://www.github.com/alexanderlabwhoi/ EukHeist. MAGs here are subsequently named and referred to as Tara Oceans Particle Associated MAGs (TOPAZ) and are individually named based on their assembly group (Supplementary Tables 2 and 3).
Identification of putative Eukaryotic MAGs
The binning process described above recovered a total of 16,385 putative bins. These bins were screened to identify high completion eukaryotic and prokaryotic bins. All bins were first screened for length, assuming that eukaryotic bins would likely be greater than 2.5Mbp in size (modeled off of the size of the smallest known eukaryotic genome, ∼ 2.3Mbp Microsporidian Encephalitozoon intestinalis (Corradi et al., 2010)). Bins larger than 2.5Mbp were screened for relative eukaryotic content using EukRep (West et al., 2018), a k-mer based strategy that estimates the likely domain-origin of metagenomic contigs. EukRep was used to classify the relative proportion of eukaryotic and prokaryotic content in each bin in a contig-by-contig manner. This approach identified 907 candidate eukaryotic bins that were greater than 2.5Mb in length and estimated to have more than 90% eukaryotic content by length. Protein coding domains were predicted in all 907 putative eukaryotic bins using EukMetaSanity (Neely et al., 2021).
Protein prediction in Eukaryotic MAGs with EukMetaSanity
Taxonomy
The MMseqs2 v12.113e3 (Steinegger and Söding, 2017, 2018; Mirdita et al., 2019) taxonomy module (parameters: -s 7 –min-seq-id 0.40 -c 0.3 –cov-mode 0) was used to provide a first-pass taxonomic assignment of the input MAG for use in a downstream element of Euk-MetaSanity pipeline that requires an input NCBI taxon id or a taxonomic level (i.e. Order, Family, etc.). We created a custom database comprising both OrthoDB (Kriventseva et al., 2018) and MMETSP (Keeling et al., 2014) protein databases (OrthoDB-MMETSP) that integrates NCBI taxon ids. MMseqs2 was used to query each MAG against the OrthoDB-MMETSP database to identify a first-pass taxonomic assignment. The lowest common ancestor of top scoring hits was identified to provide taxonomic assignment to each candidate eukaryotic bin. The taxonomyreport module generates a taxon tree that includes the percent of MMseqs mappings that correspond to each taxonomic level. A taxonomic identifier and scientific name are selected to the strain level or when total mapping exceeds 8%, whichever comes first. The assigned NCBI taxon id is retained for downstream analyses.
Repeats identification
RepeatModeler (Flynn et al., 2020a; Smit and Hubley, 2008-2015) was used to provide ab initio prediction of transposable elements, including short and long interspersed nuclear repeats, as well as other DNA transposons, small RNA, and satellite repeats. RepeatMasker (Smit et al., 2013-2015) was then used to hard-mask these identified regions, as well as any Family-level (as identified above) repeats from the DFam 3.2 database (Flynn et al., 2020b). RepeatMasker commands ProcessRepeats (parameter: -nolow) and rmOutToGff3 (parameter: -nolow) were used to output masked sequences (excluding low-complexity repeat DNA from the mask) as FASTA and gene-finding format (GFF3) files, respectively.
Ab initio prediction
GeneMark (Lomsadze et al., 2005) was used to generate ab initio gene predictions with the repeat-masked eukaryotic candidate bin sequences output from the prior step. The Gen-eMark subprogram ProtHint attempts to use Order-level proteins from OrthoDB-MMETSP database to generate intron splice-site predictions for ab initio modeling using GeneMark EP (Bruna et al., 2020). If ProtHint fails to generate predictions, then GeneMark will default to ES mode. Due to the fragmented nature of metagenomic assemblies, the prediction parameter stringency was drastically reduced relative to what is recommended for draft genome projects (parameters: –min_contig 500 –min_contig_in_predict 500 –min_gene_in_predict 100). These parameters can be easily modified within the EukMetaSanity config file. GeneMark outputs predictions of protein coding sequences (CDS) and exon/intron structure as GFF3 files.
Integrating protein evidence
MetaEuk (Levy Karin et al., 2020) was used to directly map the repeat-masked eukaryotic candidate bins sequences against proteins from the MMETSP (Keeling et al., 2014; Johnson et al., 2018) and eukaryotes included in the OrthoDB v10 dataset (Kriventseva et al., 2018), hereafter referred to as the OrthoDB-MMETSP database. MetaEuk easy-predict (parameters: –min-length 30 –metaeuk-eval 0.0001 -s 7 –cov-mode 0 -c 0.3 -e 100 –max-overlap 0) used Order-level proteins to identify putative CDS and exon/intron structure. MetaEuk encodes this output as headers in FASTA sequences that are then parsed into GFF3 files.
Merging final results
GFF3 output from the previous ab initio and MetaEuk protein evidence steps were input into Gffread (Pertea and Pertea, 2020) (parameters: -G –merge) to localize predictions from both lines of evidence into a single GFF3 output file. Each locus was then merged together using a Python (Foundation) script and the BioPython API (Cock et al., 2009) within EukMetaSanity. The set of ab initio generated exons in each locus is used as a prediction of the underlying exon/intron structure of the gene locus to which it is assigned. If there are any protein-evidence-generated exons present at the same locus, and if the total numbers of exons predicted by each line of evidence have ≥ 70% agreement, ab initio generated exons lacking a corresponding protein-evidence-generated exon are removed (the first and last exon(s) of a locus are not removed). Conversely, any protein-evidence-generated exon present that lacks a corresponding ab initio generated exon is added to the predicted exon/intron structure. The final gene structure for each locus is then processed into GFF3 and FASTA format.
Functional and taxonomic annotation of eukaryotic MAGs
Predicted proteins from EukMetaSanity were annotated for function against protein families in Pfam with PfamScan (Finn et al., 2014) and KEGG using kofamscan (Kanehisa, 2019; Aramaki et al., 2019) (Supplementary Tables 7 and 8). The relative completeness and contamination of each putative Eukaryotic MAG was assessed based on protein content using BUSCO v 4.0.5 against the eukaryota_odb10 gene set using default parameters (Simao et al., 2015) and EukCC v 0.2 using the EukCC database (created 22 October 2019 (Saary et al., 2020)). Annotation and completeness assessment were carried out using a EukHeist-Annotate (https://www.github.com/halexand/ EukHeist-annotate). EukCC (Saary et al., 2020) was also used to calculate MAG completeness and contamination. The average completeness across groups increased in all cases with EukCC except for metazoans, which on average had a lower estimated completeness (Figure S10).
The taxonomic affiliation of the high- and low-completion bins was estimated using MMSeqs taxonomy through EukMetaSanity and EUKulele (Krinos et al., 2021), an annotation tool that takes a protein-consensus approach, leveraging a Last Common Ancestor (LCA) estimation of protein taxonomy, as well as MMSeqs2 taxonomy module (Steinegger and Söding, 2017, 2018; Mirdita et al., 2019). Taxonomic level estimation in EUKulele was assessed based on e-value derived best-hits, where percent id was used as a means of assessing taxonomic level, with the following cutoffs: species, >95%; genus, 95-80%; family, 80-65%; order, 65-50%; class, 50-30% modeled off of Carradec et al. (2018). All MAGs were searched against the MarMetZoan combining the MarRef, MMETSP, and metazoan orthoDB databases (Johnson et al., 2018; Keeling et al., 2014; Kriventseva et al., 2018; Klemetsen et al., 2017). This database is available for download through EUKulele.
Phylogeny of eukaryotic MAGs
A total of 49 BUSCO proteins were found to be present across 80% or more of the highly complete eukaryotic TOPAZ MAGs and were selected for the construction of the tree. Amino acid sequences from all genomes and transcriptomes of interest were collected and aligned individually using mafft (v7.471) (parameters: –thread -8 –auto) (Katoh and Standley, 2013). Individual protein alignments were trimmed to remove sections of the alignment that were poorly aligned with trimAl (v1.4.rev15) (parameters: -automated1) (Capella-Gutierrez et al., 2009a). Protein sequences were then concatenated and trimmed again with trimAl (parameters: -automated1). A final tree was then constructed using RAxML (v 8.2.12; parameters: raxmlHPC-PTHREADS-SSE3 -T 16 -f a -m PROTGAMMAJTT -N 100 -p 42 -x 42) (Stamatakis, 2014). The amino acid alignment and construction was controlled with a Snakemake workflow: https://github.com/halexand/BUSCO-MAG-Phylogeny/. Trees were visualized and finalized with iTOL (Letunic and Bork, 2016).
Prokaryotic MAG assessment and analysis
The 15,478 bins that were not identified as putative eukaryotic bins based on length and EukRep metrics were screened to identify quality prokaryotic bins. The quality and phylogenetic-association of these bins was assessed with a modified version of MAGpy (Stewart et al., 2019), which was altered to include taxonomic annotation with GTDB-TK v.0.3.2 (Chaumeil et al., 2019). Bins were assessed based on single copy ortholog content with CheckM v (Parks et al., 2015) to identify 2 different bin quality sets: 1) high-quality (HQ) prokaryotic bins (>90% completeness, <5% contamination), and 2) medium-quality (MQ) prokaryotic bins (90-75% completeness, <10% contamination). A total of 4022 prokaryotic MAGs met the above criteria. A final set of 2,407 non-redundant (NR) HQ-MQ MAGs were identified using dRep v2.6.2 (Olm et al., 2017), which performs pairwise genome comparisons in two steps. First, a rapid primary algorithm, Mash v1.1.1 (Ondov et al., 2016) is applied. Genomes with Mash values equivalent to 90% Average Nucleotide Identity (ANI) or higher were then compared with MUMmer v3.23 (Marçais et al., 2018). Genomes with ANI ≥ 99% were considered to belong to the same cluster. The best representative MAGs were selected based on the dRep default scoring equation (Olm et al., 2017). Out of the final set of 2,407 NR MAGs, 716 were HQ. The same pipeline was used to determine the HQ and MQ NR MAGs reconstructed from the Tara Oceans metagenomes in previous studies (Tully et al., 2018; Parks et al., 2017; Delmont et al., 2018).
Phylogeny of bacterial non-redundant high-quality MAGs
Only 5 out of the 716 HQ NR MAGs were found to belong to Archaea, thus only bacterial MAGs were used for the construction of the phylogenetic tree with GToTree v.1.4.10 (Lee, 2019) and the gene set (HMM file) for Bacteria (74 targets). GToTree pipeline uses Prodigal v2.6.3 (Hyatt et al., 2010) to retrieve the coding sequences in the genomes, and HMMER3 v3.2.1 (Eddy, 2011) to identify the target genes based on the provided HMM file. MUSCLE v3.8 (Edgar, 2004) was then used for the gene alignments, and Trimal v1.4 (Capella-Gutierrez et al., 2009b) for trimming. The concatenated aligned is used for the tree constructions using FastTree v2.1 (Price et al., 2010). Three genomes were excluded from the analysis due to having too few of the target genes. The tree was visualized using the Interactive Tree of Life (iToL) (Letunic and Bork, 2016).
Prokaryote MAG phylogeny comparison
A set of 8,644 microbial genomes were collected from the MarDB database (Klemetsen et al., 2017)(ac- cessed 31 May 2018) encompassing the publicly available marine microbial genomes. Genomes were assessed using CheckM v1.1.1 (Parks et al., 2015)(parameters: lineage_wf) and genomes estimated to be <70% complete or >10% contamination were discarded. The remaining genomes (n = 5,878) were assessed using CompareM v0.0.23 (parameters: aai_wf; https://github.com/dparks1134/CompareM) and near identical genomes were identified using a cutoff of ≥ 95% average amino acid identity (AAI) with ≥ 85% orthologous fraction (determined as one standard deviation from the average orthologous fraction for genomes with 97 − 100% AAI). Based on CheckM quality, the genome with the highest completion and/or lowest contamination were retained. From the remaining genomes (n = 3,843), all MAGs derived from the Tara Oceans dataset, specifically from Tully et al. Tully et al. (2018) and Parks et al. Parks et al. (2017), were removed. The remaining genomes (n = 2,275) would be used to form the base of a phylogenetic tree representing the available genome diversity prior to the release of previous Tara Oceans related MAG datasets Tully et al. (2018); Parks et al. (2017); Delmont et al. (2018), termed the “neutral“ component of subsequent phylogenetic trees.
For the comparisons, phylogenetic trees were constructed using GToTree v1.4.7 (Lee, 2019) (default parameters; 25 Bacteria_and_Archaea markers). Any genome added to a tree that did not meet the default 50% marker presence requirement was excluded from that tree. Five iterations of phylogenetic trees were constructed using the neutral genomes paired with each Tara Oceans MAG dataset, the high-quality TOPAZ prokaryote MAGs, and the medium-quality TOPAZ prokaryote MAGs, individually, and two larger trees were constructed containing all neutral genomes and Tara Oceans MAGs, with additions of either high- or medium-quality TOPAZ MAGs. Phylogenetic trees were assessed using genometreetk (parameter: pd; https://github.com/dparks1134/GenomeTreeTk) to determine the phylogenetic diversity (i.e., the total branch length traversed by a set of leaves) and phylogenetic gain (i.e., the additional branch length added by a set of leaves) (Parks et al., 2017) for each set of MAGs compared against the neutral genomes and for the TOPAZ prokaryote MAGs compared against the neutral genomes and the other Tara Oceans MAGs.
MAG abundance profiling
Raw reads from all metagenomic and metatranscriptomic samples were mapped against the eukaryotic and prokaryotic TOPAZ MAGs to estimate relative abundances with CoverM (v. 0.5.0; parameters: -min-read-percent-identity 0.95 -min-read-aligned-percent 0.75 -min-covered-fraction 0 -contig-end-exclusion 75 -trim-min 0.05 -trim-max 0.95 -proper-pairs-only; https://github.com/wwood/CoverM). The total number of reads mapped to each MAG was then used to calculate Reads Per Kilobase Million (RPKM), where for some MAG, i : RPKMi = Xi/liN109, with X = total number of reads recruiting to a MAG, l = length of MAG in Kb, N = total number of trimmed reads mapping to a sample in millions. We also calculated counts per million (CPM), a normalization of the RPKM to the sum of all RPKMs in a sample. CPM, a modification of transcripts per million (TPM) was first proposed by Wagner et al. (2012) as an alternative to RPKM that reduces statistical bias. The metric has since been applied to metagenomics data, sometimes called GPM (genes per million) (Gradoville et al., 2017).
Nutritional modelling
To predict the trophic mode of the high quality TOPAZ eukaryotic MAGs (n=485), a Random Forest model (Breiman, 2001) was constructed and calibrated using the ranger (Wright and Ziegler, 2017) and tuneRanger packages in R (Probst et al., 2018), respectively. The model was trained using KEGG Orthology (KO) annotations (Kanehisa, 2019) from a manually-curated reference trophic mode transcriptomic dataset consisting of the MMETSP (Keeling et al., 2014) and EukProt (Richter et al., 2020) (Supplementary Table 5). 644 of the transcriptomes in this reference dataset came from the MMETSP (Keeling et al., 2014), after 22 transcriptomes were removed due to low coverage of KEGG and Pfam annotations (Finn et al., 2014). The remaining 266 came from the EukProt database (Richter et al., 2020), after 162 were removed due having fewer than 500 present KOs. Nutritional strategy (phototrophy, heterotrophy, or mixotrophy) was assessed for each reference transcriptome individually based on the literature, 25% of the combined reference transcriptomes were excluded from model training as testing data.
A subset of KEGG Orthologs (KOs) that were predictive for trophic mode classification was determined computationally with the vita variable selection package in R (Janitza et al., 2016) (Supplementary Table 6), which was tested and justified by Degenhardt et al. Degenhardt et al. (2017). This process was carried out by the algorithm without regard to the predicted function of the KOs, but we found that many of these KOs were implicated in carbohydrate and energy metabolism, with preference for those KOs that differ strongly between heterotrophs and phototrophs (particularly for energy metabolism; Figure S25). The model was built using the selected KOs (n = 1787 of a total 21585 KOs) with the 75% of the combined database assigned as training data.
Additionally, we developed a secondary metric for assessing the extent of heterotrophy of a transcriptome or MAG. As opposed to the trinary classification scheme of the Random Forest model, this approach quantifies the extent that the MAG aligns with heterotrophic, phototrophic, or mixotrophic references by assigning a composite score. We calculated the likelihood of vita selected KOs used in the Random Forest model above to be present within heterotrophic, phototrophic, or mixotrophic reference transcriptomes. Three scores (h, p, m), one corresponding to each trophic mode, were hence calculated for each vita-selected KO (k) (n = 1787) (Supplementary Table 6). In Equation (1), K is the number of references the KO was present in for each trophic mode category, while n is the total number of references available for each trophic mode category.
If a given KO occurred in fewer than 50% of the reference transcriptomes for a trophic mode, it was considered not to be characteristic of that trophic mode and as such the score, which we represent as the variable a, the ratio of the present KOs to the total for the subset of transcriptomes annotated some trophic mode (Equation (11)), was transformed (−(0.5 −a), if a < 0.5), to reflect the absence without valuing absence over presence. In the test transcriptome dataset, the ratio-transformed scores were negated when a given KO was absent from the transcriptome. For instance, if a KO was absent from 90% of reference transcriptomes assigned to heterotrophy (a = 0.1), and absent in the MAG or transcriptome being evaluated, it would receive a score of hk = −1 ∗ (−(0.5 − 0.1)) = 0.4 (Equation (1)) for that KO. This reflects that the absence of the KO in the evaluated MAG or transcriptome aligned well with the high probability that the KO was absent among the reference transcriptomes.
The scores for all KOs selected by vita were then used to scale the presence/absence patterns observed across transcriptomes and MAGs. Thus, for each transcriptome or MAG a single score was calculated for each trophic mode heterotrophy (H), phototrophy (P), and mixotrophy (M) for all KOs present within the transcriptome or MAG (K):
These calculated values can then be aggregated to a composite heterotrophy score (Hind ) (Supplementary Table 9). The score was computed as follows:
Ecological analysis of SAR and Dictyochophyceae MAGs
24 highly-complete TOPAZ eukaryotic MAGs were subset to characterize the biogeography and putative trophic modes of Dictyochophyceae and closely-related stramenopiles. MAGs included 11 Dictyochophyceae MAGs and 13 MAGs belonging to a phylogenetically similar branch (Eukaryotic SAR; derived from the MMSeqs and EUKulele assignment). Metagenomic CPM abundance of MAGs was used to compare the biogeography and distribution of Dictyochophyceae and closely-related stramenopile MAGs. MAGs were further classified based on trophic prediction and heterotrophy score.
To investigate physiological potential, quality trimmed metatranscriptome reads were mapped using Salmon Patro et al. (2017) to the 24 MAGs. Comparison of of the putative metabolic capabilities of the predicted phototrophic versus heterotrophic MAGs was conducted in R3.6.1, where MAGs were clustered using the average distance between MAGs based on the presence and absence of known orthologs (KO annotations (Kanehisa, 2019)). Principle coordinate analysis of center log-ratio transformed TPM abundances of mapped reads from the smallest size fraction (0.8 − 5.00μm) of all surface samples revealed the degree of overlap between Dictyochophyceae and closely-related Stramenopile MAGs. Specific genes shared among all MAGs, shared across predicted trophic modes, and those vita selected KOs used in the trophic model (described above) were further targeted. All analyses described above are available at https://alexanderlabwhoi.github.io/ 2021-TOPAZ-MAG-Figures/.
Network Analysis
To identify co-occurring MAGs across the stations surveyed by Tara Oceans, the CPM abundance of each highly-complete eukaryotic MAG (> 30% BUSCO completeness) and each non-redundant, highly complete bacterial MAG was assessed at each station at all available depths and size fractions as described above. CPM was used because of the power of this metric for comparing samples directly: the sum of all CPM values per sample will be the same, as sequencing depth is accounted for after gene length. This makes it easier to compare the abundances of MAGs originally recovered from different sites (Gradoville et al., 2017). A Spearman correlation matrix was generated to identify monotonic relationships between MAGs. Correlations were filtered based first on p-value, using the Šidák correction (Šidák, 1967), a slightly less stringent metric than the Bonferroni correction. The Šidák correlation adjusts for multiple comparisons and is given by p < 1 − (1 − α)1/n, where n is the total number of comparisons, and α is the significance value, in this case 0.05. We considered only those correlations within the 90th percentile of CPM correlations, thus correlations with absolute value less than 0.504 were removed from the analysis. Subsequently, we further filtered interactions to those with coefficient of correlation > 0.70 for the construction of the network diagram. Because it was expected for several of the eukaryotic MAGs to be closely related (based on ANI), the relationships in the network were further filtered to exclude interactions between MAGs of exceedingly high similarity (having both 99% ANI similarity and > 0.70 coefficient of correlation in the network analysis) (Supplementary Table 12). ANI-based group members tended to have identical taxonomic classifications: only 2 of 94 clusters had different classifications at the order level per EUKulele (Figure S30).
We generated a network from this reduced set of labeled interactions (cut off at > 0.70 coefficient of correlation, focusing on interactions between eukaryotes and prokaryotes or eukaryotes and eukaryotes, and using ANI-based clusters instead of MAG names when applicable) using igraph (Csardi and Nepusz, 2006; Team) (Supplementary Table 11). Communities of highly associated MAGs were identified using a modularity optimization algorithm introduced in Blondel et al. (2008) and implemented in igraph (Csardi and Nepusz, 2006).
We assessed the connectedness within and between communities by calculating a connectedness metric as follows. For the connectedness within a community (one community to itself), we identified the number of “dense” connections by counting up the total number of links found between community members, regardless of how many times the particular MAG had been connected to its own community, and divided that number by the total possible “dense”, meaning the number of connections which would exist if all community members were connected to all other community members. Between different communities, we defined connectedness by qualifying that a “connection” is made the first time each MAG from a given community is linked to another community, and calculated this quantity by dividing the number of realized links between community members by the maximum total size of the two involved communities (Figure 5 b; Equation (9) - Equation (11)).
We calculated Spearman correlation coefficients for the relationship between the abundance of communities between stations and several environmental parameters of interest from the Tara Oceans metadata (Pesant et al., 2015; Tara Oceans Consortium and Tara Oceans Expedition, 2016) (Figure 5). We considered the measured physical and chemical parameters, the modeled mesoscale physical oceanographic parameters, and averaged remote sensing products (Tara Oceans Consortium and Tara Oceans Expedition, 2016; d’Ovidio et al., 2010; Pesant et al., 2015). We adjusted the p-value of these comparisons using a Bonferroni adjustment within the statistics package in R (Team).
Data and code availability
The eukaryotic and prokaryotic TOPAZ MAGs and Supplementary Tables 1-13 are available through the Open Science Framework (OSF) at https://osf.io/gm564/ with the DOI: 10.17605/OSF.IO/GM564. EukHeist, which was used to recover the reported TOPAZ MAGs can be found at https://github. com/AlexanderLabWHOI/EukHeist and EukMetaSanity which was used for protein prediction in eukaryotic MAGs can be found at https://github.com/cjneely10/EukMetaSanity. Code used to generate the figures in this paper can be found at https://github.com/AlexanderLabWHOI/ 2021-TOPAZ-MAG-Figures. An interactive visualizer for the TOPAZ eukaryotic MAGs is available at https://share.streamlit.io/cjneely10/tara-analysis/main/TARAVisualize/main.py with source code at https://github.com/cjneely10/TARA-Analysis.
Author contributions statement
HA and SKH conceived of and designed the study. HA carried out the assembly and binning. HA, AIK, SKH, MP, TR, and CJN, and BJT analyzed the data. HA and SKH wrote the manuscript with input from all authors. All authors edited and commented on the manuscript.
Ethics Declaration
The authors declare no conflicts of interest.
Acknowledgements
This research would not have been possible without the community driven efforts to provide open and freely available data by the Tara Oceans Consortium. This research was supported by a National Science Foundation grant (NSF-OCE-1948025) to HA and a WHOI Independent Research and Development award to HA. SKH was supported through a Postdoctoral Fellowship (OCE-0939564) provided through the NSF Center for Dark Energy Biosphere Investigations and through an NSF grant (OCE-1947776). AIK was supported by the Computational Science Graduate Fellowship (DOE; DE-SC0020347). BJT was supported by the Center for Dark Energy Biosphere Investigations (C-DEBI) through NSF-OCE-0939654.