Abstract
Marine planktonic eukaryotes play a critical role in global biogeochemical cycles and climate. However, their poor representation in culture collections limits our understanding of the evolutionary history and genomic underpinnings of planktonic ecosystems. Here, we used 280 billion Tara Oceans metagenomic reads from polar, temperate, and tropical sunlit oceans to reconstruct and manually curate more than 700 abundant and widespread eukaryotic environmental genomes ranging from 10 Mbp to 1.3 Gbp. This genomic resource covers a wide range of poorly characterized eukaryotic lineages that complement long-standing contributions from culture collections while better representing plankton in the upper layer of the oceans. We performed the first comprehensive genome-wide functional classification of abundant unicellular eukaryotic plankton, revealing four major groups connecting distantly related lineages. Neither trophic modes of plankton nor its vertical evolutionary history could explain the functional repertoire convergence of major eukaryotic lineages that coexisted within oceanic currents for millions of years.
Cover Navigating on the map of plankton genomics with Tara Oceans and anvi’o: a comprehensive genome-resolved metagenomic survey dedicated to eukaryotic plankton.
Introduction
Plankton in the sunlit ocean contributes about half of Earth’s primary productivity, impacting global biogeochemical cycles and food webs1,2. Plankton biomass appears to be dominated by unicellular eukaryotes and small animals3–6 that encompass a phenomenal evolutionary and morphological biodiversity5,7–9. The composition of planktonic communities is highly dynamic and shaped by biotic and abiotic variables, some of which are changing abnormally fast in the Anthropocene10–12. Our understanding of marine eukaryotes has progressed substantially in recent years with the transcriptomic (e.g.,13,14) and genomic (e.g.,15–17) analyses of organisms isolated in culture, and the emergence of efficient culture-independent surveys (e.g.,18,19). However, the genomic content of most eukaryotic lineages remains uncharacterized20,21, limiting our understanding of their evolution, functioning, ecological interactions, and resilience to ongoing environmental changes.
Over the last decade, the Tara Oceans program has generated a homogeneous resource of marine plankton metagenomes and metatranscriptomes from the sunlit zone of all major oceans and two seas22. Critically, most of the sequenced plankton size fractions correspond to eukaryotic organismal sizes, providing a prime dataset to survey genomic traits and expression patterns from this domain of life. More than 100 million eukaryotic gene clusters have been characterized by the metatranscriptomes, half of which have no similarity to known proteins5. Most of them could not be linked to a genomic context23, limiting their usefulness to gene-centric insights. The eukaryotic metagenomic dataset (the equivalent of ∼10,000 human genomes) on the other hand has been partially used for plankton biogeographies24,25, but remains unexploited for the characterization of genes and genomes due to a lack of robust methodologies to make sense of its diversity.
Genome-resolved metagenomics26 has been extensively applied to the smallest Tara Oceans plankton size fractions, unveiling the ecology and evolution of thousands of viral, bacterial, and archaeal populations abundant in the sunlit ocean27–32. This approach may thus be appropriate also to characterize the genomes of the most abundant eukaryotic plankton. However, very few eukaryotic genomes have been resolved from metagenomes thus far27,33–36, in part due to their complexity (e.g., high density of repeats37) and extended size38 that might have convinced many of the unfeasibility of such a methodology. With the notable exception of some photosynthetic eukaryotes27,33,36, metagenomics is lagging far behind cultivation for eukaryote genomics, contrasting with the two other domains of life. Here we fill this critical gap using hundreds of billions of metagenomic reads generated from the eukaryotic plankton size fractions of Tara Oceans and demonstrate that genome-resolved metagenomics is well suited for marine eukaryotic genomes of substantial complexity and length exceeding the emblematic gigabase. We used this new genomic resource to place major eukaryotic planktonic lineages in the tree of life and explore their evolutionary history based on both phylogenetic signals from conserved gene markers and present-day genomic functional landscape.
Results and discussion
A new resource of environmental genomes for eukaryotic plankton from the sunlit ocean
We performed the first comprehensive genome-resolved metagenomic survey of microbial eukaryotes from polar, temperate, and tropical sunlit oceans using 798 metagenomes (265 of which were released through the present study) derived from the Tara Oceans expeditions. They correspond to the surface and deep chlorophyll maximum layer of 143 stations from the Pacific, Atlantic, Indian, Arctic, and Southern Oceans, as well as the Mediterranean and Red Seas, encompassing eight eukaryote-enriched plankton size fractions ranging from 0.8 µm to 2 mm (Figure 1, Table S1). We used the 280 billion reads as inputs for 11 metagenomic co-assemblies (6-38 billion reads per co-assembly) using geographically bounded samples (Figure 1, Table S2), as previously done for the Tara Oceans 0.2–3 μm size fraction enriched in bacterial cells27. In addition, we used 158 eukaryotic single cells sorted by flow cytometry from seven Tara Oceans stations (Table S2) as input to perform complementary genomic assemblies (see Methods).
The map displays Tara Oceans stations used to perform genome-resolved metagenomics, summarizes the number of metagenomes, contigs longer than 2,500 nucleotides, and eukaryotic MAGs characterized from each co-assembly, and outlines the stations used for single-cell genomics. ARC: Arctic Ocean; MED: Mediterranean Sea; RED: Red Sea; ION: Indian Ocean North; IOS: Indian Ocean South; SOC: Southern Ocean; AON: Atlantic Ocean North; AOS: Atlantic Ocean South; PON: Pacific Ocean North; PSE: Pacific South East; PSW: Pacific South West. The bottom panel summarizes mapping results from the SMAGs across 939 metagenomes organized into four size fractions. The mapping projection of complete SMAGs is described in the Methods and Supplemental Material.
We thus created a culture-independent, non-redundant (average nucleotide identity <98%) genomic database for eukaryotic plankton in the sunlit ocean consisting of 683 metagenome-assembled genomes (MAGs) and 30 single-cell genomes (SAGs), all containing more than 10 million nucleotides (Table S3). These 713 MAGs and SAGs (hereafter dubbed SMAGs) were manually characterized and curated using a holistic framework within anvi’o39 that relied heavily on differential coverage across metagenomes (see Methods and Supplemental Material). Nearly half the MAGs did not have vertical coverage >10x in any of the metagenomes, emphasizing the relevance of co-assemblies to gain sufficient coverage for relatively large eukaryotic genomes. Moreover, one-third of the SAGs remained undetected by Tara Oceans’ metagenomic reads, emphasizing cell sorting’s power to target less abundant lineages. Absent from the SMAGs are DNA molecules physically associated with the focal eukaryotic populations, but that did not correlate with their nuclear genomes across metagenomes. They include chloroplasts, mitochondria, and viruses generally present in multi-copy. Finally, some highly conserved multi-copy genes such as the 18S rRNA gene were also missing due to technical issues associated with assembly and binning, following the fate of 16S rRNA genes in bacterial MAGs27.
This new genomic database for eukaryotic plankton has a total size of 25.2 Gbp and contains 10,207,450 genes according to a workflow combining metatranscriptomics, ab-initio, and protein-similarity approaches (see Methods). Tara Oceans SMAGs are, on average, ∼40% complete (redundancy of 0.5%) and 35.4 Mbp long (up to 1.32 Gbp for the first Giga-scale eukaryotic MAG), with a GC-content ranging from 18.7% to 72.4% (Table S3). They are affiliated to Alveolata (n=44), Amoebozoa (n=4), Archaeplastida (n=64), Cryptista (n=31), Haptista (n=92), Opisthokonta (n=299), Rhizaria (n=2), and Stramenopiles (n=174). Only three closely related MAGs could not be affiliated to any known eukaryotic supergroup (see the phylogenetic section below). Among the 713 SMAGs, 271 contained multiple genes corresponding to chlorophyll a-b binding proteins and were considered phytoplankton (Table S3). Genome-wide comparisons with 484 reference transcriptomes from isolates of marine eukaryotes (the METdb database40 which improved data from MMETSP13 and added new transcriptomes from Tara Oceans, see Table S3) linked only 24 of the SMAGs (∼3%) to a eukaryotic population already in culture (average nucleotide identity >98%). These include well-known Archaeplastida populations within the genera Micromonas, Bathycoccus, Ostreococcus, Pycnococcus, Chloropicon and Prasinoderma and a few taxa amongst Stramenopiles (e.g., the diatom Minutocellus polymorphus) and Haptista (e.g., Phaeocystis cordata). Thus, we found metagenomics, single-cell genomics, and culture highly complementary with very few overlaps for marine eukaryotic plankton’s genomic characterization.
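The figures quoted in this paragraph can be cross-checked with simple arithmetic. The short sketch below uses only numbers stated in the text; the variable names are illustrative and not part of the original study.

```python
# Counts per supergroup as reported in the paragraph above.
smags_per_supergroup = {
    "Alveolata": 44,
    "Amoebozoa": 4,
    "Archaeplastida": 64,
    "Cryptista": 31,
    "Haptista": 92,
    "Opisthokonta": 299,
    "Rhizaria": 2,
    "Stramenopiles": 174,
}
unaffiliated = 3  # the three closely related MAGs without a known supergroup

total_smags = sum(smags_per_supergroup.values()) + unaffiliated

mean_length_mbp = 35.4  # reported average SMAG length
approx_total_gbp = total_smags * mean_length_mbp / 1000

cultured_percent = 100 * 24 / total_smags  # SMAGs linked to a cultured population

print(total_smags)                 # 713
print(round(approx_total_gbp, 1))  # 25.2
print(round(cultured_percent, 1))  # 3.4
```

The tally recovers the 713 SMAGs, the 25.2 Gbp total size, and the roughly 3% overlap with cultured populations.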
The SMAGs recruited 39.1 billion reads with >90% identity (average identity of 97.4%) from 939 metagenomes, representing 11.8% of the Tara Oceans metagenomic dataset dedicated to unicellular and multicellular organisms ranging from 0.2 µm to 2 mm (Table S4). In contrast, METdb with a total size of ∼23 Gbp recruited less than 7 billion reads (average identity of 97%), indicating that the collection of Tara Oceans SMAGs reported herein better represents the diversity of open ocean eukaryotes as compared to genomic data from decades of culture efforts worldwide. The majority of Tara Oceans metagenomic reads were still not recruited, which could be explained by eukaryotic genomes that our methods failed to reconstruct, the occurrence of abundant bacterial, archaeal, and viral populations in the large size fractions we considered (e.g., Trichodesmium41), and the incompleteness of the SMAGs. Indeed, with the assumption of correct completion estimates, complete SMAGs would have recruited ∼26% of all metagenomic reads, including >50% of reads for the 20-180 µm size fraction alone due in part to an important contribution of hundreds of large copepod MAGs abundant within this cellular range (see Figure 1 and Table S4).
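One simple way to read the projection for complete SMAGs is to scale the reads recruited by each partial genome by its estimated completion. The sketch below illustrates that reading only. The function name and the toy values are hypothetical, and the procedure actually used is the one described in the Methods and Supplemental Material, which may differ.

```python
def project_complete_recruitment(recruited_reads, completion):
    """Scale reads recruited by a partial genome to a hypothetical complete genome.

    Assumes recruitment is proportional to the recovered genome fraction,
    which is a simplification made for this illustration.
    """
    if not 0 < completion <= 1:
        raise ValueError("completion must be in (0, 1]")
    return recruited_reads / completion


# Toy values (not from the study): (reads recruited, estimated completion).
toy_smags = [(1_000_000, 0.5), (400_000, 0.25), (900_000, 0.9)]
toy_total_reads = 20_000_000

observed_fraction = sum(r for r, _ in toy_smags) / toy_total_reads
projected_fraction = (
    sum(project_complete_recruitment(r, c) for r, c in toy_smags) / toy_total_reads
)

print(f"observed: {observed_fraction:.1%}, projected: {projected_fraction:.1%}")
```

Because the scaling is applied per genome, the projected total depends on how recruitment is distributed across SMAGs of different completion, not only on the average completion.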
Expanding the genomic representation of the eukaryotic tree of life
We then determined the phylogenetic distribution of the new ocean SMAGs in the tree of eukaryotic life. METdb was chosen as a taxonomically curated reference transcriptomic database from culture collections, and the two largest subunits of the three DNA-dependent RNA polymerases (six multi-kilobase genes found in all modern eukaryotes and hence already present in the Last Eukaryotic Common Ancestor) were used as evolutionary marker genes given their relevance to our understanding of eukaryogenesis42. Protein sequences for these genes were manually extracted and curated for the SMAGs (n=2,150) and METdb (n=2,032) (see Methods and Supplemental Material). BLAST results provided a novelty score for each of them (see Methods and Table S3), expanding our analysis scope to eukaryotic genomes stored in NCBI as of August 2020. Our final phylogenetic analysis included 416 reference transcriptomes and 576 environmental SMAGs that contained at least one of the six genes (Figure 2). The concatenated DNA-dependent RNA polymerase protein sequences effectively reconstructed a coherent tree of eukaryotic life, comparable to previous large-scale phylogenetic analyses based on other gene markers43, and to a complementary BUSCO-centric phylogenomic analysis using protein sequences corresponding to hundreds of smaller gene markers (Figure S1). As a notable difference, the Haptista were most closely related to Archaeplastida, while Cryptista were most closely related to the TSAR supergroup (Telonemia not represented here, Stramenopiles, Alveolata and Rhizaria), albeit with weaker support. This view of the eukaryotic tree of life using a previously underexploited universal marker is by no means conclusive by itself but contributes to ongoing efforts to understand deep evolutionary relationships amongst eukaryotes while providing an effective framework to assess the phylogenetic positions of a large number of the Tara Oceans SMAGs.
The maximum-likelihood phylogenetic tree of the concatenated two largest subunits from the three DNA-dependent RNA polymerases (six genes in total) included Tara Oceans SMAGs and METdb transcriptomes and was generated using a total of 7,243 sites in the alignment and LG+F+R10 model; Opisthokonta was used as the outgroup. Supports for selected clades are displayed. Phylogenetic supports were considered high (aLRT>=80 and UFBoot>=95), medium (aLRT>=80 or UFBoot>=95) or low (aLRT<80 and UFBoot<95) (see Methods). The tree was decorated with additional layers using the anvi’o interface. The novelty score layer (see Methods) was set with a minimum of 30 (i.e., 70% similarity) and a maximum of 60 (i.e., 40% similarity). Branches and names in red correspond to main lineages lacking representatives in METdb.
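The support categories and the novelty score scale given in this caption can be written as two small rules. The sketch below follows the thresholds stated above. The function names are illustrative, and the novelty-to-similarity conversion is inferred from the two examples in the caption (30 corresponding to 70% similarity, 60 to 40%).

```python
def support_level(alrt, ufboot):
    """Classify a branch as high, medium, or low support from aLRT and UFBoot."""
    alrt_ok = alrt >= 80
    ufboot_ok = ufboot >= 95
    if alrt_ok and ufboot_ok:
        return "high"
    if alrt_ok or ufboot_ok:
        return "medium"
    return "low"


def similarity_from_novelty(novelty_score):
    """Convert a novelty score to percent similarity (novelty = 100 - similarity)."""
    return 100 - novelty_score


for alrt, ufboot in [(92.0, 98), (85.0, 90), (70.0, 96), (60.0, 80)]:
    print(alrt, ufboot, support_level(alrt, ufboot))

print(similarity_from_novelty(30), similarity_from_novelty(60))
```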
Amongst small planktonic animals, the Tara Oceans SMAGs recovered one lineage of Chordata related to the Oikopleuridae family, and Crustacea including a wide range of copepods (Figure 2, Table S3). Copepods dominate large size fractions of plankton8 and represent some of the most abundant animals on the planet44,45. They actively feed on unicellular plankton and are a significant food source for larger animals such as fish, thus representing a key trophic link within the global carbon cycle46. For now, fewer than ten copepod genomes have been characterized from isolates47,48. The additional 8.4 Gbp of genomic material unveiled herein is split into 217 MAGs, themselves organized into two main phylogenetic clusters that we dubbed marine Hexanauplia clades A and B. The two clades were equally abundant and detected in all oceanic regions. Copepod MAGs typically had broad geographic distributions, being detected on average in 25% of the globally distributed Tara Oceans stations. In comparison, Opisthokonta MAGs affiliated to Chordata and Choanoflagellatea (Acanthoecida) were, on average, detected in less than 10% of sampling sites.
Generally occurring in smaller size fractions, SMAGs corresponding to unicellular eukaryotes considerably expanded our genomic knowledge of known genera within Alveolata, Archaeplastida, Haptista and Stramenopiles (Table S3). Just within the diatoms for instance (Stramenopiles), MAGs were reconstructed for Fragilariopsis (n=5), Pseudo-nitzschia (n=7), Chaetoceros (n=11), Thalassiosira (n=5) and seven other genera, all of which are known to contribute significantly to photosynthesis in the sunlit ocean49. Beyond this genomic expansion of known planktonic genera, the SMAGs covered various lineages lacking representatives in METdb. These included (1) sister clades to the Cryptophyta division (putative Katablepharidophyta division50 according to their relatively abundant 18S amplicons in small size fractions of Tara Oceans8), to the class Chrysophyceae, and to the genera Phaeocystis and Pycnococcus, (2) basal lineages of Oomycota within Stramenopiles and Myzozoa within Alveolata, (3) multiple branches within the MAST-4 lineage, and (4) a small cluster possibly at the root of Rhizaria we dubbed “putative new group” (Figure 2). The BUSCO-centric phylogenomic analysis placed this cluster at the root of Haptista (Figure S1), supporting its high novelty while stressing the difficulty of placing it accurately in the eukaryotic tree of life. The novelty score of individual DNA-dependent RNA polymerase genes was supportive of the topology of the tree. Significantly, the sister clade to the Cryptophyta division, the diverse MAST-4 lineage and the putative new group all displayed a deep branching distance from cultures and a high novelty score.
The most conspicuous lineage lacking any SMAGs was the Dinoflagellata, a prominent and extremely diverse phylum in small and large eukaryotic size fractions of Tara Oceans8. These organisms harbor very large and complex genomes51 that likely require much deeper sequencing efforts to be recovered by genome-resolved metagenomics.
A complex interplay between the evolution and functioning of marine eukaryotes
SMAGs provided a broad genomic assessment of the eukaryotic tree of life within the sunlit ocean by covering a wide range of marine plankton eukaryotes distantly related to cultures but abundant in the open ocean. Thus, the resource provided an opportunity to explore the interplay between the phylogenetic signal and functional repertoire of eukaryotic plankton with genomics. With EggNOG52–54, we identified orthologous groups corresponding to known (n=15,870) and unknown functions (n=12,567, orthologous groups with no assigned function at http://eggnog5.embl.de/) for 4.7 million genes (nearly 50% of the genes, see Methods) and used their genomic distributions to classify the SMAGs based on their functional profiles (Table S5). Our hierarchical clustering analysis using Euclidean distance and Ward linkage (an approach to organize genomes based on pangenomic traits55) first split the SMAGs into small animals (Chordata, Crustacea, copepods) and putative unicellular eukaryotes (Figure 3). Fine-grained functional clusters exhibited a highly coherent taxonomy within the unicellular eukaryotes. For instance, SMAGs affiliated to the coccolithophore Emiliania and the sister clade to Phaeocystis formed distinct clusters. The sister clade to Cryptophyta was also confined to a single cluster that could be explained partly by a considerable radiation of genes related to dioxygenase activity (up to 644 genes). Most strikingly, the Archaeplastida SMAGs not only clustered with respect to their genus-level taxonomy, but the organization of these clusters was highly coherent with their evolutionary relationships (see Figure 2), confirming not only the novelty of the sister clade to Pycnococcus, but also the sensitivity of our framework to draw the functional landscape of unicellular marine eukaryotes.
The figure displays a hierarchical clustering (Euclidean distance with Ward’s linkage) of 681 SMAGs based on the occurrence of ∼28,000 functions identified with EggNOG52–54, rooted with small animals (Chordata, Crustacea and copepods) and decorated with layers of information using the anvi’o interactive interface. Layers include the occurrence (in log10) of the 100 functions with the lowest p-values when performing Welch’s ANOVA between the functional groups A, B, C and D (see nodes in the tree). Removed from the analysis were Ciliophora MAGs (gene calling is problematic for this lineage), two less complete MAGs affiliated to Opisthokonta, and functions occurring more than 500 times in the gigabase-scale MAG and linked to retrotransposons connecting otherwise unrelated SMAGs.
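To make the clustering criterion concrete, the following is a minimal pure-Python sketch of agglomerative clustering with Euclidean distance and Ward's minimum-variance criterion. It is a naive illustration and not the implementation used in the study: the genome names and function counts are hypothetical toy values, and any transformation applied to the real occurrence matrix is not reproduced here.

```python
from itertools import combinations


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
    return a["size"] * b["size"] / (a["size"] + b["size"]) * sq_dist


def ward_clustering(profiles):
    """Merge clusters greedily; return merges as (members_a, members_b, cost)."""
    clusters = [
        {"members": (name,), "size": 1, "centroid": list(vector)}
        for name, vector in profiles.items()
    ]
    merges = []
    while len(clusters) > 1:
        # Pick the pair whose merge increases within-cluster variance the least.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        a, b = clusters[i], clusters[j]
        size = a["size"] + b["size"]
        centroid = [
            (a["size"] * x + b["size"] * y) / size
            for x, y in zip(a["centroid"], b["centroid"])
        ]
        merges.append((a["members"], b["members"], ward_cost(a, b)))
        merged = {"members": a["members"] + b["members"], "size": size, "centroid": centroid}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical occurrence counts of three functions in four genomes.
toy_profiles = {
    "genome_A": [10, 0, 3],
    "genome_B": [9, 1, 3],
    "genome_C": [0, 8, 1],
    "genome_D": [1, 9, 0],
}

merges = ward_clustering(toy_profiles)
for members_a, members_b, cost in merges:
    print(members_a, "+", members_b, f"(cost {cost:.2f})")
```

In this toy case the two genomes with similar profiles merge first in each pair, and the two pairs are joined last. The pairwise search is quadratic at every step, so a dedicated library routine would be preferable for hundreds of genomes and tens of thousands of functions.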
Four major functional groups of unicellular eukaryotes emerged from the hierarchical clustering (Figure 3). Importantly, the taxonomic coherence observed in fine-grained clusters vanished when moving towards the root of these functional groups. Group A was an exception since it only covered the Haptista (including the highly cosmopolitan sister clade to Phaeocystis). Group B, on the other hand, encompassed a highly diverse and polyphyletic group of distantly related heterotrophic (e.g., MAST-4 and MALV) and mixotrophic (e.g., Myzozoa and Cryptophyta) lineages of various genomic sizes, suggesting that broad genomic functional trends may not only be explained by the trophic mode of plankton. Group C was mostly photosynthetic and covered the diatoms (Stramenopiles of various genomic sizes) and Archaeplastida (small genomes) as sister clusters. This finding likely reflects that diatoms are the only group with an obligatory photoautotrophic lifestyle within the Stramenopiles, like the Archaeplastida. Finally, Group D encompassed three distantly related lineages of heterotrophs (these systematically lacked gene markers for photosynthesis) exhibiting rather large genomes: Oomycota, Acanthoecida choanoflagellates, and the Cryptophyta’s sister clade. The four functional groups had similar numbers of detected functions and contained both cosmopolitan and rarely detected SMAGs across the Tara Oceans stations. While attempts to classify marine eukaryotes based on genomic functional traits have been made in the past (e.g., using a few SAGs56), our resource provided a broad enough spectrum of genomic material for a first genome-wide functional classification of abundant lineages of unicellular eukaryotic plankton in the upper layer of the ocean.
A total of 2,588 known and 680 unknown functions covering 1.94 million genes (∼40% of the annotated genes) were significantly differentially occurring between the four functional groups (Welch’s ANOVA tests, p-value < 1e-05, Table S5). We displayed the occurrence of the 100 functions with the lowest p-values in the hierarchical clustering presented in Figure 3 to illustrate and help convey the strong signal between groups. However, more than 3,000 functions contributed to the basic partitioning of SMAGs. They cover all high-level functional categories identified in the 4.7 million genes with similar proportions (Figure S2), indicating that a wide range of functions related to information storage and processing, cellular processes and signaling, and metabolism contribute to the partitioning of the groups. As a notable difference, functions related to transcription (−50%) and RNA processing and modification (−47%) were less represented, while those related to carbohydrate transport and metabolism were enriched (+43%) in the differentially occurring functions. Interestingly, we noticed within Group C a scarcity of various functions otherwise occurring in high abundance among unicellular eukaryotes. These included functions related to ion channels (e.g., extracellular ligand-gated ion channel activity, intracellular chloride channel activity, magnesium ion transmembrane transporter activity, calcium ion transmembrane transport, calcium sodium antiporter activity) that may be linked to flagellar motility and the response to external stimuli57, reflecting the lifestyle of true autotrophs. Group D, on the other hand, had significant enrichment of various functions associated with carbohydrate transport and metabolism (e.g., alpha and beta-galactosidase activities, glycosyl hydrolase families, glycogen debranching enzyme, alpha-L-fucosidase), denoting a distinct carbon acquisition strategy.
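For readers unfamiliar with Welch's ANOVA, the sketch below computes its F statistic and degrees of freedom for one function across several groups, using the standard textbook formulas. It is an illustration under assumptions and not the analysis code of the study: the function name and the toy occurrence values are hypothetical, and groups with zero variance are not handled.

```python
from statistics import mean, variance


def welch_anova(groups):
    """Return (F, df1, df2) for Welch's one-way ANOVA across groups of observations."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    # Each group is weighted by n_i / s_i^2, so noisy groups count less.
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    total_weight = sum(weights)
    grand_mean = sum(w * m for w, m in zip(weights, means)) / total_weight
    between = sum(w * (m - grand_mean) ** 2 for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / total_weight) ** 2 / (n - 1) for w, n in zip(weights, sizes))
    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, k - 1, df2


# Hypothetical occurrences of one function in genomes of four groups (A, B, C, D).
toy_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
f_stat, df1, df2 = welch_anova(toy_groups)
print(f"F = {f_stat:.3f}, df1 = {df1}, df2 = {df2:.3f}")
```

A p-value follows from the upper tail of the F distribution with (df1, df2) degrees of freedom, for instance with `scipy.stats.f.sf(f_stat, df1, df2)` if SciPy is available. The resulting p-values would then be compared with the 1e-05 threshold quoted above.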
Overall, the properties of thousands of differentially occurring functions suggest that eukaryotic plankton’s complex functional diversity is vastly intertwined within the tree of life, as inferred from phylogenies. This reflects the complex nature of the genomic structure and phenotypic evolution of organisms, which rarely fit their evolutionary relationships.
To this point, our analysis focused on the 4.4 million genes that were functionally annotated to EggNOG, which discarded more than half of the genes we identified in the SMAGs. Our current lack of understanding of many eukaryotic functional genes even within the scope of model organisms58 can explain the limits of reference-based approaches to study the gene content of eukaryotic plankton. Thus, to gain further insights and overcome these limitations, we partitioned and categorized the eukaryotic gene content with AGNOSTOS59. AGNOSTOS grouped 5.4 million genes in 424,837 groups of genes sharing remote homologies, adding 2.3 million genes left uncharacterized by the EggNOG annotation. AGNOSTOS applies a strict set of parameters for the grouping of genes, discarding 575,053 genes by its quality controls and 4,264,489 genes in singletons. The integration of the EggNOG annotations into AGNOSTOS resulted in a combined dataset of 25,703 EggNOG orthologous groups (singletons and gene clusters) and 271,464 AGNOSTOS groups of genes, encompassing 6.4 million genes, 45% more genes than the original dataset (see Methods). The genome-wide functional classification of SMAGs based on this extended set of genes supported most trends previously observed with EggNOG annotation alone (Figure S3; Table S6), strengthening our observations. But most interestingly, classification based solely on 23,674 newly identified groups of genes of unknown function (Table S7, a total of 1.3 million genes discarded by EggNOG) was also supportive of the overall trends, including notable links between diatoms and green algae and between the sister clade to Cryptophyta and Acanthoecida (Figure S4). Thus, we identified a functional repertoire convergence of distantly related eukaryotic plankton lineages in both the known and unknown coding sequence space, the latter representing a substantial amount of biologically relevant gene diversity.
Niche and biogeography of individual eukaryotic populations
Besides insights into organismal evolution and genomic functions, the SMAGs provided an opportunity to evaluate the present and future geographical distribution of eukaryotic planktonic populations (close to species-level resolution) using the genome-wide metagenomic read recruitments. Here, we determined the niche characteristics (e.g., temperature range) of 374 SMAGs (∼50% of the resource) detected in at least five stations (Table S8) and used climate models to project world map distributions (https://gigaplankton.shinyapps.io/TOENDB/) based on climatologies for the periods of 2006-2015 and 2090-209925 (see Methods and Supplemental Material).
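As an illustration of the filtering step mentioned above, the sketch below keeps only SMAGs detected in at least five stations and reports the temperature range over the stations where each was detected. The SMAG names, station labels, temperatures, and function name are all hypothetical, and the actual niche modelling is the one described in the Methods and Supplemental Material.

```python
MIN_STATIONS = 5  # detection threshold stated in the text


def temperature_niche(detections, min_stations=MIN_STATIONS):
    """Return (min, max) temperature over detection stations, or None if too few."""
    if len(detections) < min_stations:
        return None
    temperatures = list(detections.values())
    return min(temperatures), max(temperatures)


# Toy detections: station label -> sea surface temperature in degrees Celsius.
toy_detections = {
    "SMAG_widespread": {"st1": 4.0, "st2": 12.5, "st3": 18.0, "st4": 21.0, "st5": 26.5},
    "SMAG_rare": {"st1": 15.0, "st2": 16.0},
}

niches = {name: temperature_niche(d) for name, d in toy_detections.items()}
for name, niche in niches.items():
    print(name, niche)
```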
Each of these SMAGs was estimated to occur in a surface averaging 42 and 39 million km2 for the first and second period, respectively, corresponding to ∼12% of the surface of the ocean. Our data suggest that most eukaryotic populations in the database will remain widespread for decades to come. However, many changes in biogeography are projected to occur. For instance, the most widespread population in the first period (a MAST-4 MAG) would still be ranked first at the end of the century but with a surface area increasing from 37% to 46% (Figure 4), a gain of 28 million km2 corresponding to the surface of North America. Its expansion from the tropics towards more temperate oceanic regions regardless of longitude is mostly explained by temperature and reflects the expansion of tropical niches due to global warming, echoing recent predictions made with amplicon surveys and imaging data60. As an extreme case, the SMAG benefiting the most between the two periods (a copepod) could experience a gain of 55 million km2 (Figure 4), more than the surface of Asia and Europe combined. On the other hand, the SMAG losing most ground (also a copepod) could undergo a decrease of 47 million km2. Projected changes in these two examples correlated with various variables (including a notable contribution of silicate), an important reminder that temperature alone cannot explain plankton’s biogeography in the ocean. Our integration of genomics, metagenomics, and climate models provided the resolution needed to project individual eukaryotic population niche trajectories in the sunlit ocean.
The probability of presence ranges from 0 (purple) to 1 (red), with green corresponding to a probability of 0.5. The bottom row displays first-rank region-dependent environmental parameters driving the projected shifts of distribution (in regions where |ΔP|>0.1). Noticeably, projected decreases of silicate in equatorial regions drive 34% of the expansion of TARA_PSW_MAG_00299 while driving 34% of the reduction of TARA_PSE_93_MAG_00246, possibly reflecting different life strategies of these copepods (e.g., grazing). In contrast, the expansion of TARA_IOS_50_MAG_00098 is mostly driven by temperature (74%).
Conclusion
Following methodological advances for viral, bacterial and archaeal lineages, we are experiencing a shift from cultivation to metagenomics for the genomic characterization of marine eukaryotes en masse. Our culture-independent and manually curated genomic characterization of unicellular eukaryotic populations and small animals abundant in the sunlit ocean covered a wide range of poorly characterized lineages and provided the first gigabase-scale metagenome-assembled genome, a landmark for both genome-resolved metagenomics and plankton genomics. These lineages cover multiple trophic levels (e.g., copepods and their prey, mixotrophs, autotrophs, and parasites) and appear to be abundant and widespread in the sunlit ocean. In summary, most eukaryotic genomes we characterized with different degrees of completion are not only different from past genomic surveys of isolated marine organisms but also better represent eukaryotic plankton in the open photic ocean. As a result, our survey represents an innovative step towards using genomics to explore in concert the ecological and evolutionary underpinnings of environmentally relevant eukaryotic organisms, using metagenomics to fill critical gaps in our remarkable culture portfolio22.
Phylogenetic gene markers such as the DNA-dependent RNA polymerases (the basis of our phylogenetic analysis) provide a critical understanding of the origin of eukaryotic lineages and allowed us to place most environmental genomes in a comprehensible evolutionary framework. However, this framework is based on sequence variations within core genes that in theory are inherited from the last eukaryotic common ancestor representing the vertical evolution of eukaryotes, disconnected from the structure of genomes. As such, it does not recapitulate the functional evolutionary journey of plankton, as demonstrated in our genome-wide functional classification of unicellular eukaryotes in both the known and unknown coding sequence space. The dichotomy between phylogeny and function was already well described with morphological and other phenotypic traits and could be explained in part by secondary endosymbiosis events that have spread plastids and genes for their photosynthetic capabilities across the eukaryotic tree of life61–64. Here we moved beyond morphological inferences and disentangled the phylogeny of gene markers and broad genomic functional repertoire of a comprehensive collection of marine eukaryotic lineages. We identified four major genomic functional groups of unicellular eukaryotes made of distantly related lineages. The Stramenopiles proved particularly effective in terms of genomic functional diversification, possibly explaining part of their remarkable success in this biome8,65.
The topology of phylogenetic trees compared to the functional clustering of a wide range of eukaryotic lineages has revealed contrasting evolutionary journeys for widely scrutinized gene markers of evolution and less studied genomic functions of plankton. The apparent functional convergence of distantly related lineages that coexisted in the same biome for millions of years could not be explained by either a vertical evolutionary history of unicellular eukaryotes or their trophic modes (phytoplankton versus heterotrophs), shedding new light on the complex functional dynamics of plankton over evolutionary time scales. Convergent evolution is a well-known phenomenon of independent origin of biological traits such as molecules and behaviors66,67 that has been observed in the morphology of microbial eukaryotes68 and is often driven by common selective pressures within similar environmental conditions. However, an independent origin of similar functional profiles is not the only possible explanation for organisms sharing the same habitat. Indeed, one could wonder if lateral gene transfers between eukaryotes69,70 have played a central role in these processes, as previously observed between eukaryotic plant pathogens71 or grasses72. As a case in point, secondary endosymbiosis events are known to have resulted in massive gene transfers between endosymbionts and their hosts in the oceans61,62. In particular, these events involved transfers of genes from green algae to diatoms73, two lineages clustering together in our genomic functional classification of eukaryotic plankton. However, lineages sharing the same secondary endosymbiotic history did not always fall in the same functional group. This was the case for diatoms, Haptista and Cryptista, which have different functional trends yet originate from a common ancestor that likely acquired its plastid from red and green algae61,62,74.
Surveying phylogenetic trends for functions derived from the ∼10 million genes identified here will likely contribute to new insights regarding the extent of lateral gene transfers between eukaryotes75,76, the independent emergence of functional traits (convergent evolution), as well as functional losses between lineages77, that altogether might have driven the functional convergences of distantly related eukaryotic lineages abundant in the sunlit ocean.
Regardless of the mechanisms involved, the functional repertoire convergences we observed likely highlight primary organismal functioning, which has fundamental impacts on plankton ecology and on the functions of plankton within marine ecosystems and biogeochemical cycles. Thus, the apparent dichotomy between phylogenies (a vertical evolutionary framework) and genome-wide functional repertoires (genome structure evolution) depicted here should be viewed as a fundamental attribute of marine unicellular eukaryotes. We suggest it warrants a new rationale for studying the structure and state of plankton, one also based on present-day genomic functions rather than on phylogenetic and morphological surveys alone.
STAR Methods
Tara Oceans metagenomes
We analyzed a total of 943 Tara Oceans metagenomes available at the EBI under project PRJEB402 (https://www.ebi.ac.uk/ena/browser/view/PRJEB402). 265 of these metagenomes have been released through this study. Table S1 reports accession numbers and additional information (including the number of reads and environmental metadata) for each metagenome.
Genome-resolved metagenomics
We organized the 798 metagenomes corresponding to size fractions ranging from 0.8 µm to 2 mm into 11 ‘metagenomic sets’ based upon their geographic coordinates. We used those 0.28 trillion reads as inputs for 11 metagenomic co-assemblies using MEGAHIT78 v1.1.1, and simplified the scaffold header names in the resulting assembly outputs using anvi’o39 v.6.1 (available from http://merenlab.org/software/anvio). Co-assemblies yielded 78 million scaffolds longer than 1,000 nucleotides for a total volume of 150.7 Gbp. We performed a combination of automatic and manual binning on each co-assembly output, focusing only on the 11.9 million scaffolds longer than 2,500 nucleotides, which resulted in 837 manually curated eukaryotic metagenome-assembled genomes (MAGs) longer than 10 million nucleotides. Briefly, (1) anvi’o profiled the scaffolds using Prodigal79 v2.6.3 with default parameters to identify an initial set of genes, and HMMER80 v3.1b2 to detect genes matching 83 single-copy core gene markers from BUSCO81 (benchmarking is described in a dedicated blog post82), (2) we used a customized database including both NCBI’s NT database and METdb to infer the taxonomy of genes with a Last Common Ancestor strategy5 (results were imported as described in http://merenlab.org/2016/06/18/importing-taxonomy), (3) we mapped short reads from the metagenomic set to the scaffolds using BWA v0.7.1583 (minimum identity of 95%) and stored the recruited reads as BAM files using samtools84, and (4) anvi’o profiled each BAM file to estimate the coverage and detection statistics of each scaffold, and combined mapping profiles into a merged profile database for each metagenomic set. We then clustered scaffolds with the automatic binning algorithm CONCOCT85 by constraining the number of clusters per metagenomic set to a number ranging from 50 to 400 depending on the set. Each CONCOCT cluster (n=2,550, ∼12 million scaffolds) was manually binned using the anvi’o interactive interface.
The interface considers the sequence composition, differential coverage, GC-content, and taxonomic signal of each scaffold. Finally, we individually refined each eukaryotic MAG >10 Mbp as outlined in Delmont and Eren86, and renamed scaffolds they contained according to their MAG ID. Table S2 reports the genomic features (including completion and redundancy values) of the eukaryotic MAGs. The supplemental material provides more information regarding this workflow and describes examples for CONCOCT clusters’ binning and curation.
A first Giga scale eukaryotic MAG
We performed targeted genome-resolved metagenomics to confirm the biological relevance and improve the statistics of the single MAG longer than 1 Gbp, using an additional co-assembly (five Southern Ocean metagenomes for which this MAG had vertical coverage >1x) and considering contigs longer than 1,000 nucleotides, which led to a gain of 181.8 million nucleotides. To our knowledge, we describe here the first successful characterization of a Gigabase-scale MAG (1.32 Gbp with 419,520 scaffolds), which we could identify using two distinct metagenomic co-assemblies.
MAGs from the 0.2–3 μm size fraction
We incorporated into our database 20 eukaryotic MAGs longer than 10 million nucleotides previously characterized from the 0.2–3 μm size fraction27, providing a set of MAGs corresponding to eukaryotic cells ranging from 0.2 µm (picoeukaryotes) to 2 mm (small animals).
Single-cell genomics
We used 158 eukaryotic single cells sorted by flow cytometry from seven Tara Oceans stations as input to perform genomic assemblies (up to 18 cells with identical 18S rRNA genes per assembly to optimize completion statistics, see Table S2), providing 34 single-cell genomes (SAGs) longer than 10 million nucleotides. Cell sorting, DNA amplification, sequencing and assembly were performed as described elsewhere19. In addition, manual curation was performed using sequence composition and differential coverage across the 100 metagenomes in which the SAGs were most detected, following the methodology described in the genome-resolved metagenomics section. For SAGs with no detection in Tara Oceans metagenomes, only sequence composition and taxonomical signal could be used, limiting the scope of this curation effort. Notably, manual curation of SAGs using the genome-resolved metagenomic workflow turned out to be highly valuable, leading to the removal of more than one hundred thousand scaffolds for a total volume of 193.1 million nucleotides. This metagenomic-guided decontamination effort contributes to previous efforts characterizing eukaryotic SAGs from the same cell sorting material19,56,87–89 and provides new guidelines for marine eukaryotic SAGs. The supplemental material provides more information regarding this workflow and describes an example for the curation of SAGs using metagenomics.
Characterization of a non-redundant database of SMAGs
We determined the average nucleotide identity (ANI) of each pair of SMAGs using the dnadiff tool from the MUMmer package90 v.4.0b2. SMAGs were considered redundant when their ANI was >98% (minimum alignment of >25% of the smaller SMAG in each comparison). We then selected the longest SMAG to represent a group of redundant SMAGs. This analysis provided a non-redundant genomic database of 713 SMAGs.
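The dereplication rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `dereplicate`, its input layout, and the use of single-linkage grouping (two SMAGs end up in the same group if a chain of redundant pairs connects them) are hypothetical choices, since the text does not specify how groups of redundant SMAGs were formed. The ANI and alignment lengths are assumed to have been parsed beforehand from dnadiff reports.

```python
from collections import defaultdict


def dereplicate(lengths, pairs, min_ani=98.0, min_aligned_fraction=0.25):
    """Return one representative (the longest genome) per group of redundant genomes.

    lengths: dict mapping genome id -> genome length in nucleotides
    pairs:   iterable of (genome_a, genome_b, ani_percent, aligned_nucleotides)
    Grouping is single-linkage, which is an assumption of this sketch.
    """
    parent = {genome: genome for genome in lengths}

    def find(genome):
        # Union-find lookup with path halving.
        while parent[genome] != genome:
            parent[genome] = parent[parent[genome]]
            genome = parent[genome]
        return genome

    for a, b, ani, aligned in pairs:
        smaller = min(lengths[a], lengths[b])
        # Redundant when ANI > 98% over more than 25% of the smaller genome.
        if ani > min_ani and aligned > min_aligned_fraction * smaller:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for genome in lengths:
        groups[find(genome)].append(genome)

    # The longest genome represents each group.
    return sorted(max(members, key=lambda g: lengths[g]) for members in groups.values())


example_lengths = {"A": 50_000_000, "B": 30_000_000, "C": 20_000_000, "D": 15_000_000}
example_pairs = [
    ("A", "B", 99.1, 10_000_000),  # redundant
    ("B", "C", 98.5, 6_000_000),   # redundant
    ("C", "D", 99.9, 1_000_000),   # alignment too short: not redundant
]
representatives = dereplicate(example_lengths, example_pairs)
```

In this toy example, A, B and C collapse into one group represented by A, while D remains on its own.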
Taxonomical inference of SMAGs
We manually determined the taxonomy of SMAGs using a combination of approaches: (1) taxonomical signal from the initial gene calling (Prodigal), (2) phylogenetic approaches using the RNA polymerase and METdb, (3) ANI within the SMAGs and between SMAGs and METdb, (4) local blasts using BUSCO gene markers, and (5) the functional clustering of SMAGs, to gain insight into the very few SMAGs lacking gene markers and ANI signal.
Protein coding genes
Protein coding genes for the SMAGs were characterized using three complementary approaches: protein alignments using reference databases, metatranscriptomic mapping from Tara Oceans and ab-initio gene predictions. While the overall framework was highly similar for MAGs and SAGs, the methodology slightly differed to take the best advantage of those two databases when they were processed (see the two following sections).
Protein-coding genes for the MAGs. Protein alignments
Since aligning a large protein database against all the MAG assemblies is time-consuming, we first used MetaEuk91 with default parameters to detect the proteins of Uniref90 + METdb that could potentially be aligned to the assembly. This subset of proteins was aligned using BLAT with default parameters, which localized each protein on the MAG assembly. The exon/intron structure was refined using GeneWise92 with default parameters to detect splice sites accurately. Each MAG’s GeneWise alignments were converted into a standard GFF file and given as input to gmove.
Metatranscriptomic mapping from Tara Oceans: A total of 905 individual Tara Oceans metatranscriptomic assemblies (mostly from large planktonic size fractions) were aligned on each MAG assembly using Minimap293 (version 2.15-r905) with the “-ax splice” flag. BAM files were filtered as follows: low complexity alignments were removed and only alignments covering at least 80% of a given metatranscriptomic contig with at least 95% of identity were retained. The BAM files were converted into a standard GFF file and given as input to gmove.
Ab-initio gene predictions: A first gene prediction for each MAG was performed using gmove and the GFF file generated from metatranscriptomic alignments. From these preliminary gene models, 300 gene models with a start and a stop codon were randomly selected and used to train AUGUSTUS94 (version 3.3.3). AUGUSTUS was then run on each MAG assembly using the dedicated calibration file, and output files were converted into standard GFF files and given as input to gmove.
Each individual line of evidence was used as input for gmove (http://www.genoscope.cns.fr/externe/gmove/) with default parameters to generate the final protein-coding genes annotations.
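The coverage and identity filter applied to the metatranscriptomic alignments can be illustrated with the following minimal sketch. The `Alignment` record and the `passes_filter` function are hypothetical names introduced here, and the sketch assumes that the contig length, the number of aligned contig bases and the number of matching bases have already been extracted from the BAM records. Removal of low complexity alignments is not covered.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one alignment, precomputed from a BAM record."""
    query_id: str
    query_length: int    # length of the metatranscriptomic contig (or read)
    aligned_length: int  # number of query bases included in the alignment
    matches: int         # number of identical bases within the aligned part


def passes_filter(aln, min_coverage=0.80, min_identity=0.95):
    """Keep alignments covering >= 80% of the query with >= 95% identity."""
    if aln.query_length <= 0 or aln.aligned_length <= 0:
        return False
    coverage = aln.aligned_length / aln.query_length
    identity = aln.matches / aln.aligned_length
    return coverage >= min_coverage and identity >= min_identity


alignments = [
    Alignment("contig_1", 1000, 900, 880),  # 90% coverage, ~97.8% identity
    Alignment("contig_2", 1000, 700, 700),  # coverage too low
    Alignment("contig_3", 1000, 900, 820),  # ~91.1% identity: too low at 95%
]
retained = [aln.query_id for aln in alignments if passes_filter(aln)]
```

The thresholds are keyword arguments so that the same function can be reused with a different identity cut-off, for example `min_identity=0.90`.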
Protein coding genes for the SAGs. Protein alignments
The Uniref90 + METdb database of proteins was aligned using BLAT95 with default parameters, which localized proteins on each SAG assembly. The exon/intron structure was refined using GeneWise92 with default parameters to detect splice sites accurately. The GeneWise alignments of each SAG were converted into a standard GFF file and given as input to gmove.
Metatranscriptomic mapping from Tara Oceans: The 905 Tara Oceans metatranscriptomic individual fastq files were filtered with kfir (http://www.genoscope.cns.fr/kfir) using a k-mer approach to select only reads that shared a 25-mer with the input SAG assembly. This subset of reads was aligned on the corresponding SAG assembly using STAR96 (version 2.5.2.b) with default parameters. BAM files were filtered as follows: low complexity alignments were removed and only alignments covering at least 80% of the metatranscriptomic reads with at least 90% of identity were retained. Candidate introns and exons were extracted from the BAM files and given as input to gmorse97.
Ab-initio gene predictions: Ab-initio models were predicted using SNAP98 (v2013-02-16) trained on complete protein matches and gmorse models, and output files were converted into standard GFF files and given as input to gmove.
Each line of evidence was used as input for gmove (http://www.genoscope.cns.fr/externe/gmove/) with default parameters to generate the final protein-coding genes annotations.
BUSCO completion scores for protein-coding genes in SMAGs
We used BUSCO81 v.3.0.4 with the set of eukaryotic single-copy core gene markers (n=255) to assess the protein-coding genes of SMAGs. Completion and redundancy (number of duplicated gene markers) of SMAGs were computed from this analysis.
Biogeography of SMAGs
We performed a final mapping of all metagenomes to calculate the mean coverage and detection of the SMAGs (Table S4). Briefly, we used BWA v0.7.15 (minimum identity of 90%) and a FASTA file containing the 713 non-redundant SMAGs to recruit short reads from all 943 metagenomes. We considered SMAGs to be detected in a given filter when >25% of their length was covered by reads, to minimize non-specific read recruitments27. The number of recruited reads below this cut-off was set to 0 before determining vertical coverage and percent of recruited reads. To project the number of reads that SMAGs would have recruited if they were complete, we used BUSCO completion scores. Note that we preserved the actual number of mapped reads for the SMAGs with completion <10% to avoid substantial errors in the projections.
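The detection cut-off and the completion-based projection can be sketched as follows. The function names are hypothetical, and the projection formula is an assumption of this sketch: the text states that BUSCO completion scores were used for the projection but does not give the formula, so a simple linear scaling (mapped reads divided by the completion fraction) is used here for illustration.

```python
def filtered_read_count(mapped_reads, covered_positions, genome_length, min_detection=0.25):
    """Set the read count to 0 unless more than 25% of the genome is covered by reads."""
    detection = covered_positions / genome_length
    return mapped_reads if detection > min_detection else 0


def projected_read_count(mapped_reads, completion_percent, min_completion=10.0):
    """Project the read count to a complete genome (assumed linear scaling).

    Genomes with completion < 10% keep their actual read count, as in the text.
    """
    if completion_percent < min_completion:
        return float(mapped_reads)
    return mapped_reads * 100.0 / completion_percent


# A 10 Mbp SMAG with 3 Mbp covered is detected (30% > 25%).
detected = filtered_read_count(1000, 3_000_000, 10_000_000)
# The same SMAG with only 2 Mbp covered is not (20% <= 25%).
undetected = filtered_read_count(1000, 2_000_000, 10_000_000)
# With 50% BUSCO completion, 1,000 mapped reads project to 2,000.
projected = projected_read_count(detected, 50.0)
```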
Identifying the environmental niche of SMAGs
Seven physicochemical parameters were used to define environmental niches: sea surface temperature (SST), salinity (Sal), dissolved silica (Si), nitrate (NO3), phosphate (PO4), iron (Fe), and a seasonality index of nitrate (SI NO3). Except for Fe and SI NO3, these parameters were extracted from the gridded World Ocean Atlas 2013 (WOA13)99. Climatological Fe fields were provided by the biogeochemical model PISCES-v2100. The seasonality index of nitrate was defined as the range of nitrate concentration in one grid cell divided by the maximum range encountered in WOA13 at the Tara sampling stations. All parameters were co-located with the corresponding stations and extracted at the month corresponding to the Tara sampling. To compensate for missing physicochemical samples in the Tara in situ data set, climatological data (WOA) were favored. More details are available in the supplemental material.
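As an illustration of the seasonality index of nitrate defined above, the sketch below computes the nitrate range per station and divides it by the largest range among stations. The function names and the input layout (a dictionary of monthly climatological nitrate values per station) are hypothetical, and taking the range as the maximum minus the minimum of the monthly values is an assumption of this sketch.

```python
def nitrate_range(monthly_values):
    """Range of nitrate concentration in one grid cell (max minus min)."""
    return max(monthly_values) - min(monthly_values)


def seasonality_index(monthly_nitrate_by_station):
    """Divide each station's nitrate range by the maximum range across stations."""
    ranges = {
        station: nitrate_range(values)
        for station, values in monthly_nitrate_by_station.items()
    }
    max_range = max(ranges.values())
    if max_range == 0:
        # No seasonal variation anywhere: the index is 0 for every station.
        return {station: 0.0 for station in ranges}
    return {station: value / max_range for station, value in ranges.items()}


example = {
    "station_1": [1.0, 5.0, 3.0],  # range of 4.0
    "station_2": [2.0, 4.0, 2.0],  # range of 2.0
}
index = seasonality_index(example)
```

The index is bounded between 0 and 1, with 1 assigned to the station showing the largest nitrate range.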
Cosmopolitan score
Using metagenomes from the Station subset 1 (n=757), SMAGs were assigned a “cosmopolitan score” based on their detection across 119 stations (see the supplemental material for more details).
A database of manually curated DNA-dependent RNA polymerase genes
A eukaryotic dataset101 was used to build HMM profiles for the two largest subunits of the DNA-dependent RNA polymerase (RNAP-a and RNAP-b). These two HMM profiles were incorporated within the anvi’o framework to identify RNAP-a and RNAP-b genes (Prodigal79 annotation) in the SMAGs and METdb transcriptomes. Alignments, phylogenetic trees and blast results were used to organize and manually curate those genes. Finally, we removed sequences shorter than 200 amino-acids, providing a final collection of DNA-dependent RNA polymerase genes for the SMAGs (n=2,150) and METdb (n=2,032) with no duplicates (see the supplemental material for more details).
Novelty score for the DNA-dependent RNA polymerase genes
We compared both the RNA-Pol A and RNA-Pol B peptide sequences identified in SMAGs and METdb to the nr database (retrieved on October 25, 2019) using blastp, as implemented in blast+102 v.2.10.0 (e-value of 1e-10). We kept the best hit and considered it as the closest sequence present in the public database. For each SMAG, we computed the average percent identity across RNA polymerase genes (up to six genes) and defined the novelty score by subtracting this number from 100. For example, with an average percent identity of 64%, the novelty score would be 36%.
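The novelty score reduces to a one-line computation, shown here as a minimal sketch. The function name `novelty_score` is hypothetical, and the input is assumed to be the list of best-hit percent identities for the RNA polymerase genes of one SMAG.

```python
def novelty_score(percent_identities):
    """Return 100 minus the average best-hit percent identity for one SMAG."""
    if not percent_identities:
        raise ValueError("at least one RNA polymerase gene is required")
    average_identity = sum(percent_identities) / len(percent_identities)
    return 100.0 - average_identity


# Two genes with best hits at 60% and 68% identity average to 64%.
score = novelty_score([60.0, 68.0])
```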
Phylogenetic analyses of SMAGs
The protein sequences included for the phylogenetic analyses (either the DNA-dependent RNA polymerase genes we recovered manually or the BUSCO set of 255 eukaryotic single-copy core gene markers we recovered automatically from the ∼10 million protein coding genes) were aligned with MAFFT103 v.764 and the FFT-NS-i algorithm with default parameters. Sites with more than 50% of gaps were trimmed using Goalign v0.3.0-alpha5 (http://www.github.com/evolbioinfo/goalign). The phylogenetic trees were reconstructed with IQ-TREE104 v1.6.12, and the model of evolution was estimated with the ModelFinder105 Plus option: for the concatenated tree, the LG+F+R10 model was selected. Supports were computed from 1,000 replicates for the Shimodaira-Hasegawa (SH)-like approximate likelihood ratio test (aLRT)106 and ultrafast bootstrap approximation (UFBoot)107. As per the IQ-TREE manual, we deemed the supports good when SH-aLRT >= 80% and UFBoot >= 95%. Anvi’o v.6.1 was used to visualize and root the phylogenetic trees.
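Two of the rules above lend themselves to a short illustration: trimming alignment sites with more than 50% of gaps, and deciding whether a branch is well supported. The sketch below is a plain Python re-statement of those rules, not the Goalign or IQ-TREE implementation; the function names and the dictionary-based alignment layout are hypothetical.

```python
def trim_gappy_sites(alignment, max_gap_fraction=0.5, gap_char="-"):
    """Remove alignment columns in which more than 50% of sequences have a gap.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths)
    """
    if not alignment:
        return {}
    names = list(alignment)
    sequences = [alignment[name] for name in names]
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")

    kept_columns = [
        i for i in range(length)
        if sum(seq[i] == gap_char for seq in sequences) / len(sequences) <= max_gap_fraction
    ]
    return {
        name: "".join(seq[i] for i in kept_columns)
        for name, seq in zip(names, sequences)
    }


def is_well_supported(sh_alrt, ufboot):
    """Support rule stated in the text: SH-aLRT >= 80% and UFBoot >= 95%."""
    return sh_alrt >= 80.0 and ufboot >= 95.0


toy_alignment = {"a": "MK-L", "b": "M--L", "c": "MR-L", "d": "M--I"}
trimmed = trim_gappy_sites(toy_alignment)
```

In the toy alignment, the second column has exactly 50% of gaps and is kept, whereas the third column is entirely made of gaps and is removed.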
EggNOG functional inference of SMAGs
We performed the functional annotation of protein-coding genes using the EggNog-mapper53,54 v2.0.0 and the EggNog5 database52. We used Diamond108 v0.9.25 to align proteins to the database. We refined the functional annotations by selecting the orthologous group within the lowest taxonomic level predicted by EggNog-mapper.
Eukaryotic SMAGs integration in the AGNOSTOS-DB
We used the AGNOSTOS workflow to integrate the protein coding genes predicted from the SMAG into a variant of the AGNOSTOS-DB that contains 1,829 metagenomes from the marine and human microbiomes, 28,941 archaeal and bacterial genomes from the Genome Taxonomy Database (GTDB) and 3,243 nucleocytoplasmic large DNA viruses (NCLDV) metagenome assembled genomes (MAGs)59.
AGNOSTOS functional aggregation inference
AGNOSTOS partitioned protein coding genes from the SMAGs in groups connected by remote homologies, and categorized those groups as members of the known or unknown coding sequence space based on the workflow described in Vanni et al. 202059. To combine the results from AGNOSTOS and the EggNOG classification we identified those groups of genes in the known space that contain genes annotated with an EggNOG and we inferred a consensus annotation using a quorum majority voting approach. AGNOSTOS produces groups of genes with low functional entropy in terms of EggNOG annotations as shown in Vanni et al. 202059 allowing us to combine both sources of information. We merged the groups of genes that shared the same consensus EggNOG annotations and we integrated them with the rest of AGNOSTOS groups of genes, mostly representing the unknown coding sequence space. Finally, we excluded groups of genes occurring in less than 2% of the SMAGs.
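The consensus annotation and the final prevalence filter can be sketched as below. This is an illustration only: the function names are hypothetical, and the quorum used here (the most frequent EggNOG annotation must be carried by more than half of the annotated genes in a group) is an assumption, since the exact quorum of the majority voting approach is not given in this section.

```python
from collections import Counter


def consensus_annotation(annotations, quorum=0.5):
    """Return the majority EggNOG annotation of a group of genes, or None.

    annotations: list with one EggNOG identifier per annotated gene in the group
    quorum:      assumed minimum fraction (exclusive) the top annotation must reach
    """
    if not annotations:
        return None
    label, count = Counter(annotations).most_common(1)[0]
    return label if count / len(annotations) > quorum else None


def prevalent_groups(occurrence, n_genomes, min_fraction=0.02):
    """Keep groups of genes occurring in at least 2% of the genomes.

    occurrence: dict mapping group id -> set of genome ids in which it occurs
    """
    return {
        group for group, genomes in occurrence.items()
        if len(genomes) / n_genomes >= min_fraction
    }


majority = consensus_annotation(["COG0001", "COG0001", "COG0002"])
no_majority = consensus_annotation(["COG0001", "COG0002"])
kept = prevalent_groups(
    {"group_1": {"smag_1", "smag_2", "smag_3"}, "group_2": {"smag_1"}},
    n_genomes=100,
)
```

Groups whose genes share the same consensus annotation could then be merged by collecting group identifiers under each returned label.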
Differential occurrence of functions
We performed a Welch’s ANOVA test followed by a Games-Howell test for significant ANOVA comparisons to identify EggNog functions occurring differentially between functional groups of SMAGs. All statistics were generated in R 3.5.3.
Functional clustering of SMAGs
We used anvi’o to cluster SMAGs as a function of their functional profile (Euclidean distance with Ward’s linkage), and the anvi’o interactive interface to visualize the hierarchical clustering in the context of complementary information.
Data availability
All data our study generated are publicly available at http://www.genoscope.cns.fr/tara/. The link provides access to the 11 raw metagenomic co-assemblies, the FASTA files for 713 SMAGs, the ∼10 million protein-coding sequences (nucleotides, amino acids and gff format), and the curated DNA-dependent RNA polymerase genes (SMAGs and METdb transcriptomes). This link also provides access to the supplemental figures and the supplemental material.
Contributions
Damien D. Hinsinger, Morgan Gaia, Eric Pelletier, Patrick Wincker, Olivier Jaillon and Tom O. Delmont conducted the study. Tom O. Delmont and Morgan Gaia characterized the SMAGs and RNA polymerase genes, respectively. Damien D. Hinsinger (analysis of the ∼10 million genes), Morgan Gaia (phylogenies), Paul Fremont (climate models and world map projections), Eric Pelletier (METdb database, mapping results) and Tom O. Delmont performed the primary analysis of the data. Artem Kourlaiev, Leo d’Agata, Quentin Clayssen and Jean-Marc Aury assembled and annotated the single cell genomes and helped processing metagenomic assemblies. Emilie Villar, Marc Wessner, Benjamin Noel, Corinne Da Silva, Damien D. Hinsinger, Olivier Jaillon and Jean-Marc Aury identified the eukaryotic genes in the MAG assemblies. Antonio Fernandez Guerra and Chiara Vanni characterized the repertoire of functions in the unknown coding sequence space. Tom O. Delmont wrote the manuscript, with critical inputs from the authors.
Supplemental material
Available at http://www.genoscope.cns.fr/tara/
Supplemental figures
The maximum-likelihood phylogenomic tree of the BUSCO gene markers (255 genes) included Tara Oceans MAGs and METdb transcriptomes (minimum of 25% of completion) and was generated using a total of 19,785 sites in the alignment and LG+F+R10 model; Opisthokonta was used as the outgroup. The tree was decorated with additional layers using the anvi’o interface. Branches and names in red correspond to lineages lacking representatives in METdb.
Relative proportion of known COG categories in annotated functions versus those that were significantly differentially occurring between the four functional groups.
The figure displays a hierarchical clustering (Euclidean distance with Ward’s linkage) of 681 SMAGs based on the occurrence of ∼39,705 groups of genes (total of 5,178,829 genes) identified by combining EggNOG52–54 with Agnostos109, rooted with MAGs dominated by small animals (Chordata, Crustacea and copepods) and decorated with layers of information using the anvi’o interactive interface. Removed from the analysis were Ciliophora MAGs (gene calling is problematic for this lineage), and functions occurring more than 1,000 times in the gigabase-scale MAG and linked to retrotransposons connecting otherwise unrelated SMAGs, or occurring in less than 2% of the SMAGs.
The figure displays a hierarchical clustering (Euclidean distance with Ward’s linkage) of 681 SMAGs based on the occurrence of ∼28,000 gene clusters of unknown function (total of 1.3 million genes) identified solely with Agnostos109 (environmental unknowns plus genomic unknowns), rooted with MAGs dominated by small animals (Chordata, Crustacea and copepods) and decorated with layers of information using the anvi’o interactive interface. Removed from the analysis were Ciliophora MAGs (gene calling is problematic for this lineage), and functions occurring more than 1,000 times in the gigabase-scale MAG and linked to retrotransposons connecting otherwise unrelated SMAGs, or occurring in less than 2% of the SMAGs.
Supplemental tables
Available at http://www.genoscope.cns.fr/tara/
Table S1. Summary of 939 Tara Oceans metagenomes that include their station ID, size fraction and number of quality filtered reads.
Table S2. Summary of the genome-resolved metagenomics and single cell genomics outcomes. The table includes statistics for the metagenomic co-assemblies, redundant MAGs and SAGs, and targeted efforts regarding the one giga scale MAG.
Table S3. Statistics of non-redundant SMAGs and METdb transcriptomes. The table includes genomic statistics and taxonomical inferences, the occurrence of RNA polymerase genes, and distribution patterns across stations and size fractions.
Table S4. Mapping result for the non-redundant SMAGs and METdb transcriptomes.
Table S5. Functional profiling of the non-redundant SMAGs based on EggNOG.
Table S6. Functional profiling of the non-redundant SMAGs based on EggNOG and Agnostos.
Table S7. Functional profiling of the non-redundant SMAGs solely based on Agnostos genomic and environmental unknowns not covered by EggNOG.
Table S8. Niche partitioning and world map projection statistics for 374 SMAGs.
Tara Oceans Coordinators
Shinichi Sunagawa1, Silvia G. Acinas2, Peer Bork3,4,5, Eric Karsenti6,7,11, Chris Bowler6,7, Christian Sardet7,9, Lars Stemmann7,9, Colomban de Vargas7,19, Patrick Wincker7,18, Magali Lescot7,26, Marcel Babin7,20, Gabriel Gorsky7,9, Nigel Grimsley7,24,25, Lionel Guidi7,9, Pascal Hingamp7,26, Olivier Jaillon7,18, Stefanie Kandels3,7, Daniele Iudicone10, Hiroyuki Ogata12, Stéphane Pesant13,14, Matthew B. Sullivan15,16,17, Fabrice Not19, Lee Karp-Boss21, Emmanuel Boss21, Guy Cochrane22, Michael Follows23, Nicole Poulton27, Jeroen Raes28,29,30, Mike Sieracki27 and Sabrina Speich31,32.
1 Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, Zürich, Switzerland.
2 Department of Marine Biology and Oceanography, Institute of Marine Sciences–CSIC, Barcelona, Spain.
3 Structural and Computational Biology, European Molecular Biology Laboratory, Heidelberg, Germany.
4 Max Delbrück Center for Molecular Medicine, Berlin, Germany.
5 Department of Bioinformatics, Biocenter, University of Würzburg, Würzburg, Germany.
6 Institut de Biologie de l’ENS, Département de Biologie, École Normale Supérieure, CNRS, INSERM, Université PSL, Paris, France.
7 Research Federation for the Study of Global Ocean Systems Ecology and Evolution, FR2022/Tara GOSEE, Paris, France.
8 Université de Nantes, CNRS, UMR6004, LS2N, Nantes, France.
9 Sorbonne Université, CNRS, Laboratoire d’Océanographie de Villefranche, Villefranche-sur-Mer, France.
10 Stazione Zoologica Anton Dohrn, Naples, Italy.
11 Directors’ Research, European Molecular Biology Laboratory, Heidelberg, Germany.
12 Institute for Chemical Research, Kyoto University, Kyoto, Japan.
13 PANGAEA, University of Bremen, Bremen, Germany.
14 MARUM, Center for Marine Environmental Sciences, University of Bremen, Bremen, Germany.
15 Department of Microbiology, The Ohio State University, Columbus, OH, USA.
16 Department of Civil, Environmental and Geodetic Engineering, The Ohio State University, Columbus, OH, USA.
17 Center for RNA Biology, The Ohio State University, Columbus, OH, USA.
18 Génomique Métabolique, Genoscope, Institut de Biologie François Jacob, Commissariat à l’Énergie Atomique, CNRS, Université Evry, Université Paris-Saclay, Evry, France.
19 Sorbonne Université and CNRS, UMR 7144 (AD2M), ECOMAP, Station Biologique de Roscoff, Roscoff, France.
20 Département de Biologie, Québec Océan and Takuvik Joint International Laboratory (UMI 3376), Université Laval (Canada)–CNRS (France), Université Laval, Quebec, QC, Canada.
21 School of Marine Sciences, University of Maine, Orono, ME, USA.
22 European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
23 Department of Earth, Atmospheric, and Planetary Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA.
24 CNRS UMR 7232, Biologie Intégrative des Organismes Marins, Banyuls-sur-Mer, France.
25 Sorbonne Universités Paris 06, OOB UPMC, Banyuls-sur-Mer, France.
26 Aix Marseille Université, Université de Toulon, CNRS, IRD, MIO UM 110, Marseille, France.
27 Bigelow Laboratory for Ocean Sciences, East Boothbay, ME, USA.
28 Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium.
29 Center for the Biology of Disease, VIB KU Leuven, Leuven, Belgium.
30 Department of Applied Biological Sciences, Vrije Universiteit Brussel, Brussels, Belgium.
31 Department of Geosciences, Laboratoire de Météorologie Dynamique, École Normale Supérieure, Paris, France.
32 Ocean Physics Laboratory, University of Western Brittany, Brest, France.
Acknowledgments
Our survey was made possible by two scientific endeavors: the sampling and sequencing efforts by the Tara Oceans Project, and the bioinformatics and visualization capabilities afforded by anvi’o. We are indebted to all who contributed to these efforts, as well as other open-source bioinformatics tools for their commitment to transparency and openness. Tara Oceans (which includes the Tara Oceans and Tara Oceans Polar Circle expeditions) would not exist without the leadership of the Tara Ocean Foundation and the continuous support of 23 institutes (https://oceans.taraexpeditions.org/). We are grateful for the commitment of the following people and sponsors who made this singular expedition possible: CNRS (in particular Groupement de Recherche GDR3280 and the Research Federation for the Study of Global Ocean Systems Ecology and Evolution FR2022/Tara GOSEE), the European Molecular Biology Laboratory (EMBL), Genoscope/CEA, the French Ministry of Research and the French Government ‘Investissement d’Avenir’ programs Oceanomics (ANR-11-BTBR-0008), FRANCE GENOMIQUE (ANR-10-INBS-09), ATIGE Genopole postdoctoral fellowship, HYDROGEN/ANR-14-CE23-0001, MEMO LIFE (ANR-10-LABX-54), PSL Research University (ANR-11-IDEX-0001-02) and EMBRC-France (ANR-10-INBS-02), Fund for Scientific Research—Flanders, VIB, Stazione Zoologica Anton Dohrn, UNIMIB, ANR (projects ALGALVIRUS ANR-17-CE02-0012, PHYTBACK/ANR-2010-1709-01, POSEIDON/ANR-09-BLAN-0348, PROMETHEUS/ANR-09-PCS-GENM-217, TARA-GIRUS/ANR-09-PCS-GENM-218), EU FP7 (MicroB3/No. 287589, IHMS/HEALTH-F4-2010-261376), Genopole, CEA DRF Impulsion program. The authors also thank agnès b. and E. Bourgois, the Prince Albert II de Monaco Foundation, the Veolia Foundation, the EDF Foundation EDF Diversiterre, Region Bretagne, Lorient Agglomeration, Worldcourier and Illumina for support and commitment.
The global sampling effort was made possible by countless scientists and crew who performed sampling aboard the Tara from 2009 to 2013. The authors are also grateful to the countries that graciously granted sampling permission. Part of the computations were performed using the platine, titane and curie HPC machines provided through GENCI grants (t2011076389, t2012076389, t2013036389, t2014036389, t2015036389 and t2016036389). We also thank Noan Le Bescot (TernogDesign) for artwork on the figures.
This article is contribution number XX of Tara Oceans.
Footnotes
The functional repertoire classification of unicellular eukaryotes is now performed in both the known and unknown coding sequence space. This additional analysis allowed us to identify a functional repertoire convergence of distantly related eukaryotic plankton lineages in both the known and unknown coding sequence space, the latter representing a substantial amount of biologically relevant gene diversity.