Functional repertoire convergence of distantly related eukaryotic plankton lineages revealed by genome-resolved metagenomics

Marine planktonic eukaryotes play a critical role in global biogeochemical cycles and climate. However, their poor representation in culture collections limits our understanding of the evolutionary history and genomic underpinnings of planktonic ecosystems. Here, we used 280 billion metagenomic reads from 143 Tara Oceans stations to reconstruct and manually curate more than 700 abundant and widespread eukaryotic metagenome-assembled genomes ranging from 10 Mbp to up to 1.3 Gbp. The resulting non-redundant genomic resource of 25 billion nucleotides that describe 10 million genes covers a wide range of poorly characterized unicellular and multicellular eukaryotic lineages that complement the long-standing contributions of culture efforts to survey the tree of marine life while better representing plankton from the open ocean. Phylogeny of the DNA-dependent RNA polymerase placed this genomic resource in a comprehensive evolutionary framework that provided insights into the relationships of eukaryotic supergroups. From there, classification of unicellular eukaryotic plankton based on functions encoded in their genes revealed four major groups connecting distantly related lineages such as the diatoms and green algae. There has been a recurrent problem in understanding the interplay between eukaryotes’ vertical evolution and their phenotype. By disentangling phylogenetic signals from functional trends with genomics, we found that neither the classical trophic mode of plankton nor its vertical evolutionary history could fully explain the genomic functional landscape of marine eukaryotes that coexisted for millions of years. Cover Navigating on the map of plankton genomics with Tara Oceans and anvi’o: a comprehensive genome-resolved metagenomic survey dedicated to eukaryotic plankton.


Introduction
Plankton in the sunlit ocean contributes about half of Earth's primary productivity, impacting global biogeochemical cycles and food webs 1,2 . Plankton biomass appears to be dominated by unicellular eukaryotes and small animals 3-6 including a phenomenal evolutionary and morphological biodiversity 5,[7][8][9] . The composition of planktonic communities is highly dynamical and shaped by biotic and abiotic variables, some of which are changing abnormally fast in the Anthropocene [10][11][12] . Our understanding of marine eukaryotes has progressed substantially in recent years with the transcriptomic (e.g., 13,14 ) and genomic (e.g., [15][16][17] analyses of organisms isolated in culture, and the emergence of efficient culture-independent surveys (e.g., 18,19 ). However, most eukaryotic lineages' genomic content remains uncharacterized 20,21 , limiting our understanding of their evolution, functioning, ecological interactions, and resilience to ongoing environmental changes.
Over the last decade, the Tara Oceans program has generated a homogeneous resource of marine plankton metagenomes and metatranscriptomes from the sunlit zone of all major oceans and two seas 22 . Critically, most of the sequenced plankton size fractions correspond to eukaryotic organismal sizes, providing a prime dataset to survey genomic traits and expression patterns from this domain of life. More than 100 million eukaryotic gene clusters have been characterized by the metatranscriptomes, half of which have no similarity to known proteins 5 . Most of them could not be linked to a genomic context 23 , limiting their usefulness to genecentric insights. The eukaryotic metagenomic dataset (the equivalent of ~10,000 human genomes) on the other hand has been partially used for plankton biogeographies 24 , but remains unexploited for the characterization of genes and genomes due to a lack of robust methodologies to make sense of its diversity.
Genome-resolved metagenomics 25 has been extensively applied to the smallest Tara Oceans plankton size fractions, unveiling the ecology and evolution of thousands of viral, bacterial, and archaeal populations abundant in the sunlit ocean [26][27][28][29][30][31] . This approach may thus be appropriate also to characterize the genomes of the most abundant eukaryotic plankton. However, very few eukaryotic genomes have been resolved from metagenomes thus far 26,[32][33][34][35] , in part due to their complexity (e.g., high density of repeats 36 ) and extended size 37 that might have convinced many of the unfeasibility of such a methodology. With the notable exception of some photosynthetic eukaryotes 26,32,35 , metagenomics is lagging far behind cultivation for eukaryote genomics, contrasting with the two other domains of life. Here we fill this critical gap using hundreds of billions of metagenomic reads generated from the eukaryotic plankton size fractions of Tara Oceans and demonstrate that genomeresolved metagenomics is well suited for marine eukaryotic genomes of substantial complexity and length exceeding the emblematic gigabase. We used this new genomic resource to place major eukaryotic planktonic lineages in the tree of life and explore their evolutionary history based on both phylogenetic signals from conserved gene markers and present-day genomic functional landscape.

Results and discussion A new resource of environmental genomes for eukaryotic plankton from the sunlit ocean
We performed the first comprehensive genome-resolved metagenomic survey of microbial eukaryotes from polar, temperate, and tropical sunlit oceans using 798 metagenomes (265 of which were released through the present study) derived from the Tara Oceans expeditions. They correspond to the surface and deep chlorophyll maximum layer of 143 stations from the Pacific, Atlantic, Indian, Arctic, and Southern Oceans, as well as the Mediterranean and Red Seas, encompassing eight eukaryote-enriched plankton size fractions ranging from 0.8 µm to 2 mm ( Figure 1, Table S1). We used the 280 billion reads as inputs for 11 metagenomic coassemblies (6-38 billion reads per co-assembly) using geographically bounded samples ( Figure 1, Table S2), as previously done for the Tara Oceans 0.2-3 μm size fraction enriched in bacterial cells 38 . In addition, we used 158 eukaryotic single cells sorted by flow cytometry from seven Tara Oceans stations (Table S2) as input to perform complementary genomic assemblies (see Methods).
We thus created a culture-independent, non-redundant (average nucleotide identity <98%) genomic database for eukaryotic plankton in the sunlit ocean consisting of 683 metagenome-assembled genomes (MAGs) and 30 single-cell genomes (SAGs), all containing more than 10 million nucleotides (Table S3). These 713 MAGs and SAGs (hereafter dubbed SMAGs) were manually characterized and curated using a holistic framework within anvi'o 39 (see Methods and Supplemental Material). Nearly half the MAGs did not have vertical coverage >10x in any of the metagenomes, emphasizing the relevance of co-assemblies to gain sufficient coverage for large eukaryotic genomes. Moreover, one-third of the SAGs remained undetected by Tara Oceans' metagenomic reads, emphasizing cell sorting's power to target less abundant lineages. Absent from the SMAGs are DNA molecules physically associated with the focal eukaryotic populations, but that did not correlate with their nuclear genomes across metagenomes. They include chloroplasts, mitochondria, and viruses generally present in multi-copy. Finally, some highly conserved multi-copy genes such as the 18S rRNA were also missing due to technical issues associated with assembly and binning, following the fate of 16S rRNA genes absent in bacterial MAGs 38 .
This new genomic database for eukaryotic plankton has a total size of 25.2 Gbp and contains 10,207,450 genes according to a workflow combining metatranscriptomics, ab-initio, and protein-similarity approaches (see Methods). Tara Oceans SMAGs are, on average, ~40% complete (redundancy of 0.5%) and 35.4 Mbp long (up to 1.32 Gbp for the first Giga-scale eukaryotic MAG), with a GC-content ranging from 18.7% to 72.4% (Table S3). They are affiliated to Alveolata (n=44), Amoebozoa (n=4), Archaeplastida (n=64), Cryptista (n=31), Haptista (n=92), Opisthokonta (n=299), Rhizaria (n=2), and Stramenopiles (n=174). Only three closely related MAGs could not be affiliated to any known eukaryotic supergroup (see the phylogenetic section below). Among the 713 SMAGs, 271 contained multiple genes corresponding to chlorophyll a-b binding proteins and were considered phytoplankton (Table S3). Genome-wide comparisons with 484 reference transcriptomes from isolates of marine eukaryotes (the METdb database 40 which improved data from MMETSP 13 and added new transcriptomes from Tara Oceans, see Table S3) linked only 24 of the SMAGs (~3%) to a eukaryotic population already in culture (average nucleotide identity >98%). These include well-known Archaeplastida populations within the genera Micromonas, Bathycoccus, Ostreococcus, Pycnococcus, Chloropicon and Prasinoderma and a few taxa amongst Stramenopiles (e.g., the diatom Minutocellus polymorphus) and Haptista (e.g., Phaeocystis cordata). Thus, we found metagenomics, single-cell genomics, and culture highly complementary with very few overlaps for marine eukaryotic plankton's genomic characterization. The SMAGs recruited 39.1 billion reads with >90% identity (average identity of 97.4%) from 939 metagenomes, representing 11.8% of the Tara Oceans metagenomic dataset dedicated to unicellular and multicellular organisms ranging from 0.2 µm to 2 mm (Table S4). In contrast, METdb with a total size of ~23 Gbp recruited less than 7 billion reads (average identity of 97%), indicating that the collection of Tara Oceans SMAGs reported herein better represents the diversity of open ocean eukaryotes as compared to genomic data from decades of culture efforts worldwide. The majority of Tara Oceans metagenomic reads were still not recruited, which could be explained by eukaryotic genomes that our methods failed to reconstruct, the occurrence of abundant bacterial, archaeal, and viral populations in the large size fractions we considered (e.g., Trichodesmium 41 ), and the incompleteness of the SMAGs. Indeed, with the assumption of correct completion estimates, complete SMAGs would have recruited ~26% of all metagenomic reads, including >50% of reads for the 20-180 µm size fraction alone due in part to an important contribution of hundreds of large copepod MAGs abundant within this cellular range (see Figure 1 and Table S4).

Expanding the genomic representation of the eukaryotic tree of life
We then determined the phylogenetic distribution of the new ocean SMAGs in the tree of eukaryotic life. METdb was chosen as a taxonomically curated reference transcriptomic database from culture collections, and the two largest subunits of the three DNA-dependent RNA polymerases (six multi-kilobase genes found in all modern eukaryotes and hence already present in the Last Eukaryotic Common Ancestor) were used as evolutionary marker genes given their relevance to our understanding of eukaryogenesis 42 . Protein sequences for these genes were manually extracted and curated for the SMAGs (n=2,150) and METdb (n=2,032) (see Methods and Supplemental Material). BLAST results provided a novelty score for each of them (see Methods and Table S3), expanding our analysis scope to eukaryotic genomes stored in NCBI as of August 2020. Our final phylogenetic analysis included 416 reference transcriptomes and 576 environmental SMAGs that contained at least one of the six genes ( Figure 2). The concatenated DNA-dependent RNA polymerase protein sequences effectively reconstructed a coherent tree of eukaryotic life, comparable to previous large-scale phylogenetic analyses based on other gene markers 43 , and to a complementary BUSCO-centric phylogenomic analysis using protein sequences corresponding to hundreds of smaller gene markers ( Figure S1). As a noticeable difference, the Haptista were most closely related to Archaeplastida, while Cryptista was most closely related to the TSAR supergroup (Telonemia not represented here, Stramenopiles, Alveolata and Rhizaria), albeit with weaker supports. This view of the eukaryotic tree of life using a previously underexploited universal marker is by no means conclusive by itself but contributes to ongoing efforts to understand deep evolutionary relationships amongst eukaryotes while providing an effective framework to assess the phylogenetic positions of a large number of the Tara Oceans SMAGs. The maximum-likelihood phylogenetic tree of the concatenated two largest subunits from the three DNA-dependent RNA polymerases (six genes in total) included Tara Oceans SMAGs and METdb transcriptomes and was generated using a total of 7,243 sites in the alignment and LG+F+R10 model; Opisthokonta was used as the outgroup. Supports for selected clades are displayed. Phylogenetic supports were considered high (aLRT>=80 and UFBoot>=95), medium (aLRT>=80 or UFBoot>=95) or low (aLRT<80 and UFBoot<95) (see Methods). The tree was decorated with additional layers using the anvi'o interface. The novelty score layer (see Methods) was set with a minimum of 30 (i.e., 70% similarity) and a maximum of 60 (i.e., 40% similarity). Branches and names in red correspond to main lineages lacking representatives in METdb.
Amongst small planktonic animals, the Tara Oceans SMAGs recovered one lineage of Chordata related to the Oikopleuridae family, and Crustacea including a wide range of copepods ( Figure 2, Table S3). Copepods dominate large size fractions of plankton 8 and represent some of the most abundant animals on the planet 44,45 . They actively feed on unicellular plankton and are a significant food source for larger animals such as fish, thus representing a key trophic link within the global carbon cycle 46 . For now, less than ten copepod genomes have been characterized by isolates 47,48 . The additional 8.4 Gbp of genomic material unveiled herein is split into 217 MAGs, and themselves organized into two main phylogenetic clusters that we dubbed marine Hexanauplia clades A and B. The two clades were equally abundant and detected in all oceanic regions. Copepod MAGs typically had broad geographic distributions, being detected on average in 25% of the globally distributed Tara Oceans stations. In comparison, Opisthokonta MAGs affiliated to Chordata and Choanoflagellatea (Acanthoecida) were, on average detected in less than 10% of sampling sites.
Generally occurring in smaller size fractions, SMAGs corresponding to unicellular eukaryotes considerably expanded our genomic knowledge of known genera within Alveolata, Archaeplastida, Haptista and Stramenopiles (Table S3). Just within the diatoms for instance (Stramenopiles), MAGs were reconstructed for Fragilariopsis (n=5), Pseudo-nitzschia (n=7), Chaetoceros (n=11), Thalassiosira (n=5) and seven other genera, all of which are known to contribute significantly to photosynthesis in the sunlit ocean 49 . Beyond this genomic expansion of known planktonic genera, the SMAGs covered various lineages lacking representatives in METdb. These included (1) sister clades to the Cryptophyta division (putative Katablepharidophyta division 50 according to their relatively abundant 18S amplicons in small size fractions of Tara Oceans 8 ), to the class Chrysophyceae, and the genera Phaeocystis and Pycnococcus, (2) basal lineages of Oomycota within Stramenopiles and Myzozoa within Alveolata, (3) multiple branches within the MAST-4 lineage, (4) and a small cluster possibly at the root of Rhizaria we dubbed "putative new group" (Figure 2). The BUSCO-centric phylogenomic analysis placed it at the root of Haptista ( Figure  S1), supporting its high novelty while stressing the difficulty placing it accurately in the eukaryotic tree of life. The novelty score of individual DNA-dependent RNA polymerase genes was supportive of the topology of the tree. Significantly, the sister clade to the Cryptophyta division, diverse MAST-4 lineage and putative new group all displayed a deep branching distance from cultures and a high novelty score.
The most conspicuous lineage lacking any SMAGs was the Dinoflagellata, a prominent and extremely diverse phylum in small and large eukaryotic size fractions of Tara Oceans 8 . These organisms harbor very large and complex genomes 51 that likely require much deeper sequencing efforts to be recovered by genome-resolved metagenomics.

A complex interplay between the evolution and functioning of marine eukaryotes
SMAGs provided a broad genomic assessment of the eukaryotic tree of life within the sunlit ocean by covering a wide range of marine plankton eukaryotes distantly related to cultures but abundant in the open ocean. Thus, the resource provided an opportunity to explore the interplay between the phylogenic signal and functional repertoire of eukaryotic plankton with genomics. We identified orthologous groups corresponding to known (n=15,870) and unknown functions (n=12,567, orthologous groups with no assigned function at http://eggnog5.embl.de/) for 4.7 million genes (nearly 50% of the genes, see Methods) and used their genomic distributions to classify the SMAGs based on their functional profiles (Table S5). Our hierarchical clustering analysis using Euclidean distance and Ward linkage (an approach to organize genomes based on pangenomic traits 52 ) first split the SMAGs into small animals (Chordata, Crustacea, copepods) and putative unicellular eukaryotes ( Figure 3). Fine-grained functional clusters exhibited a highly coherent taxonomy within the unicellular eukaryotes. For instance, SMAGs affiliated to the coccolithophore Emiliana and the sister clade to Phaeocystis formed distinct clusters. The sister clade to Cryptophyta was also confined to a single cluster that could be explained partly by a considerable radiation of genes related to dioxygenase activity (up to 644 genes). Most strikingly, the Archaeplastida SMAGs not only clustered with respect to their genus-level taxonomy, but the organization of these clusters was highly coherent with their evolutionary relationships (see Figure 2), confirming not only the novelty of the sister clade to Pycnococcus, but also the sensitivity of our framework to draw the functional landscape of unicellular marine eukaryotes.
Four major functional groups of unicellular eukaryotes emerged from the hierarchical clustering ( Figure 3). Importantly, the taxonomic coherence observed in fine-grained clusters vanished when moving towards the root of these functional groups. Group A was an exception since it only covered the Haptista (including the highly cosmopolitan sister clade to Phaeocystis). Group B, on the other hand, encompassed a highly diverse and polyphyletic group of distantly related heterotrophic (e.g., MAST-4 and MALV) and mixotrophic (e.g., Myzozoa and Cryptophyta) lineages of various genomic size, suggesting that broad genomic functional trends may not only be explained by the trophic mode of plankton. Group C was mostly photosynthetic and covered the diatoms (Stramenopiles of various genomic size) and Archaeplastida (small genomes) as sister clusters. This finding likely reflects that diatoms are the only group with an obligatory photoautotrophic lifestyle within the Stramenopiles, like the Archaeplastida. Finally, Group D encompassed three distantly related lineages of heterotrophs (those systematically lacked gene markers for photosynthesis) exhibiting rather large genomes: Oomycota, Acanthoecida choanoflagellates, and the Cryptophyta's sister clade. Those four functional groups have similar amounts of detected functions and contained both cosmopolite and rarely detected SMAGs across the Tara Oceans stations. While attempts to classify marine eukaryotes based on genomic functional traits have been made in the past (e.g., using a few SAGs 53 ), our resource therefore provided a broad enough spectrum of genomic material for a first genome-wide functional classification of abundant lineages of unicellular eukaryotic plankton in the upper layer of the ocean. Removed from the analysis were Ciliophora MAGs (a more advanced gene calling workflow dedicated to this lineage is underway -those MAGs will then be incorporated into the analysis), two less complete MAGs affiliated to Opisthokonta, and functions occurring more than 500 times in the gigabase-scale MAG and linked to retrotransposons connecting otherwise unrelated SMAGs.
A total of 2,588 known and 680 unknown functions covering 1.94 million genes (~40% of the annotated genes) were significantly differentially occurring between the four functional groups (Welch's ANOVA tests, p-value <1.e -05 , Table S5). We displayed the occurrence of the 100 functions with lowest p-values in the hierarchical clustering presented in Figure 3 to illustrate and help convey the strong signal between groups. However, more than 3,000 functions contributed to the basic partitioning of SMAGs. They cover all high-level functional categories identified in the 4.7 million genes with similar proportions ( Figure S2), indicating that a wide range of functions related to information storage and processing, cellular processes and signaling, and metabolism contribute to the partitioning of the groups. As a notable difference, functions related to transcription (-50%) and RNA processing and modification (-47%) were less represented, while those related to carbohydrate transport and metabolism were enriched (+43%) in the differentially occurring functions. Interestingly, we noticed within Group C a scarcity of various functions otherwise occurring in high abundance among unicellular eukaryotes. These included functions related to ion channels (e.g., extracellular ligand-gated ion channel activity, intracellular chloride channel activity, magnesium ion transmembrane transporter activity, calcium ion transmembrane transport, calcium sodium antiporter activity) that may be linked to flagellar motility and the response to external stimuli 57 , reflecting the lifestyle of true autotrophs. Group D, on the other hand, had significant enrichment of various functions associated with carbohydrate transport and metabolism (e.g., alpha and beta-galactosidase activities, glycosyl hydrolase families, glycogen debranching enzyme, alpha-L-fucosidase), denoting a distinct carbon acquisition strategy. Overall, the properties of thousands of differentially occurring functions suggest that eukaryotic plankton's complex functional diversity is vastly intertwined within the tree of life, as inferred from phylogenies. This reflects the complex nature of the genomic structure and phenotypic evolution of organisms, which rarely fit their evolutionary relationships.

Niche and biogeography of individual eukaryotic populations
Besides insights into organismal evolution and genomic functions, the SMAGs provided an opportunity to evaluate the present and future geographical distribution of eukaryotic planktonic populations (close to species-level resolution) using the genome-wide metagenomic read recruitments. Here, we determined the niche characteristics (e.g., temperature range) of 374 SMAGs (~50% of the resource) detected in at least five stations (Table S6)  Each of these SMAGs was estimated to occur in a surface averaging 42 and 39 million km 2 for the first and second periods, respectively, corresponding to ~12% of the surface of the ocean. Our data suggest that most eukaryotic populations in the database will remain widespread for decades to come. However, many changes in biogeography are projected to occur. For instance, the most widespread population in the first period (a MAST-4 MAG) would still be ranked first at the end of the century but with a surface area increasing from 37% to 46% (Figure 4), a gain of 28 million km 2 corresponding to the surface of North America. Its expansion from the tropics towards more temperate oceanic regions regardless of longitude is mostly explained by temperature and reflects the expansion of tropical niches due to global warming, echoing recent predictions made with amplicon surveys and imaging data 59 . As an extreme case, the SMAG benefiting the most between the two periods (a copepod) could experience a gain of 55 million km 2 (Figure 4), more than the surface of Asia and Europe combined. On the other hand, the SMAG losing most ground (also a copepod) could undergo a decrease of 47 million km 2 . Projected changes in these two examples correlated with various variables (including a notable contribution of silicate), an important reminder that temperature alone cannot explain plankton's biogeography in the ocean. Our integration of genomics, metagenomics, and climate models provided the resolution needed to project individual eukaryotic population niche trajectories in the sunlit ocean. Noticeably, projected decreases of silicate in equatorial regions drive 34% of the expansion of TARA_PSW_MAG_00299 while driving 34% of the reduction of TARA_PSE_93_MAG_00246, possibly reflecting different life strategies of these copepods (e.g., grazing). In contrast, the expansion of TARA_IOS_50_MAG_00098 is mostly driven by temperature (74%).

Conclusion
Following the path of viruses, bacterial and archaeal populations, we are experiencing a shift from cultivation to metagenomics for the genomic characterization of marine eukaryotes en masse. Our culture-independent and manually curated genomic characterization of unicellular eukaryotic populations and small animals abundant in the sunlit ocean covered a wide range of poorly characterized lineages and provided the first gigabase-scale metagenomeassembled genome, a landmark for both genome-resolved metagenomics and plankton genomics. These lineages cover multiple trophic levels (e.g., copepods and their prey, mixotrophs, autotrophs, and parasites) and better represent eukaryotic plankton in the open photic ocean compared to genomic surveys from decades of culturing efforts. In summary, most eukaryotic genomes we characterized with different degrees of completion are not only different from past genomic surveys of isolated plankton but also appear to be abundant and widespread in the sunlit ocean. As a result, our survey represents an innovative step towards using genomics to explore in concert the ecological and evolutionary underpinnings of environmentally relevant eukaryotic organisms, using metagenomics to fill critical gaps in our remarkable culture porfolio 22 .
Phylogenetic gene markers such as the DNA-dependent RNA polymerases (the basis of our phylogenetic analysis) provide a critical understanding of the origin of eukaryotic lineages and allowed us to place most environmental genomes in a comprehensible evolutionary framework. However, this framework is based on sequence variations within core genes inherited from the last eukaryotic common ancestor representing the vertical evolution of eukaryotes, disconnected from the structure of genomes. As such, it does not recapitulate the functional evolutionary journey of plankton, as demonstrated in our genome-wide functional classification of unicellular eukaryotes. The dichotomy between phylogeny and function was already well described with morphological and other phenotypic traits. This could be explained in part by secondary endosymbiosis events that have spread plastids and genes for their photosynthetic capabilities across the eukaryotic tree of life [60][61][62][63] .
Here we moved beyond morphologies and disentangled the phylogeny of gene markers and broad genomic functions of a comprehensive collection of marine eukaryotic lineages. We identified four major genomic functional groups of unicellular eukaryotes made of distantly related lineages. The Stramenopiles proved particularly effective in terms of genomic functional diversification, possibly explaining part of their remarkable success in this biome 8,64 .
The topology of the phylogenetic tree compared to the functional clustering of a wide range of eukaryotic lineages has revealed contrasting evolutionary journeys for widely scrutinized gene markers of evolution and less studied genomic functions of plankton. The apparent functional convergence of distantly related lineages that coexisted in the same biome for millions of years could not be explained by neither the vertical evolutionary history of unicellular eukaryotes nor their classical trophic mode (phytoplankton versus heterotrophs), shedding new lights into the complex functional dynamics of plankton over evolutionary time scales. Convergent evolution is a well-known phenomenon of independent origin of biological traits such as molecules and behaviors 65,66 that has been observed in the morphology of microbial eukaryotes 67 and is often driven by common selective pressures within similar environmental conditions. However, an independent origin of similar functional profiles is not the only possible explanation for organisms sharing the same habitat. Indeed, one could wonder if lateral gene transfers between eukaryotes 68,69 have played a central role in these processes, as previously observed between eukaryotic plant pathogens 70 or grasses 71 . As a case in point, secondary endosymbiosis events are known to have resulted in massive gene transfers between endosymbionts and their hosts in the oceans 60,61 . In particular, these events involved transfers of genes from green algae to diatoms 72 , two lineages clustering together in our genomic functional classification of eukaryotic plankton. However, lineages sharing the same secondary endosymbiotic history did not always fall in the same functional group. This was the case for diatoms, Haptista and Cryptista that have different functional trends yet originate from a common ancestor that likely acquired its plastid from red and green algae 60,61,73 . Surveying phylogenetic trends for functions derived from the ~10 million genes identified here will likely contribute to new insights regarding the extent of lateral gene transfers between eukaryotes 74,75 , the independent emergence of functional traits (convergent evolution), as well as functional losses between lineages 76 , that altogether might have driven the functional convergences of distantly related eukaryotic lineages abundant in the sunlit ocean.
Regardless of the mechanisms involved, the functional repertoire convergences we observed likely highlight primary organismal functioning, which have fundamental impacts on plankton ecology, and their functions within marine ecosystems and biogeochemical cycles. Thus, the genome-scale dichotomy between phylogenies (a vertical evolutionary framework) and functional repertoires (genome structure evolution) depicted here should be viewed as a fundamental attribute of marine unicellular eukaryotes that we suggest warrants a new rationale for studying the state of plankton, a rationale also based on present-day genomic functions rather than phylogenetic surveys alone.

Methods
Tara Oceans metagenomes. We analyzed a total of 943 Tara Oceans metagenomes available at the EBI under project PRJEB402 (https://www.ebi.ac.uk/ena/browser/view/PRJEB402). 265 of these metagenomes have been released through this study. Table S1 reports accession numbers and additional information (including the number of reads and environmental metadata) for each metagenome.
Genome-resolved metagenomics. We organized the 798 metagenomes corresponding to size fractions ranging from 0.8 µm to 2 mm into 11 'metagenomic sets' based upon their geographic coordinates. We used those 0.28 trillion reads as inputs for 11 metagenomic co-assemblies using MEGAHIT 77 v1.1.1, and simplified the scaffold header names in the resulting assembly outputs using anvi'o 39 v.6.1 (available from http://merenlab.org/software/anvio). Co-assemblies yielded 78 million scaffolds longer than 1,000 nucleotides for a total volume of 150.7 Gbp. We performed a combination of automatic and manual binning on each co-assembly output, focusing only on the 11.9 million scaffolds longer than 2,500 nucleotides, which resulted in 837 manually curated eukaryotic metagenome-assembled genomes (MAGs) longer than 10 million nucleotides. Briefly, (1) anvi'o profiled the scaffolds using Prodigal 78 v2.6.3 with default parameters to identify an initial set of genes, and HMMER 79 v3.1b2 to detect genes matching to 83 single-copy core gene markers from BUSCO 80 (benchmarking is described in a dedicated blog post 81 ), (2) we used a customized database including both NCBI's NT database and METdb to infer the taxonomy of genes with a Last Common Ancestor strategy 5 (results were imported as described in http://merenlab.org/2016/06/18/importing-taxonomy), (3) we mapped short reads from the metagenomic set to the scaffolds using BWA v0.7.15 82 (minimum identity of 95%) and stored the recruited reads as BAM files using samtools 83 , (4) anvi'o profiled each BAM file to estimate the coverage and detection statistics of each scaffold, and combined mapping profiles into a merged profile database for each metagenomic set. We then clustered scaffolds with the automatic binning algorithm CONCOCT 84 by constraining the number of clusters per metagenomic set to a number ranging from 50 to 400 depending on the set. Each CONCOCT clusters (n=2,550, ~12 million scaffolds) was manually binned using the anvi'o interactive interface. The interface considers the sequence composition, differential coverage, GC-content, and taxonomic signal of each scaffold. Finally, we individually refined each eukaryotic MAG >10 Mbp as outlined in Delmont and Eren 85 , and renamed scaffolds they contained according to their MAG ID. Table S2 reports the genomic features (including completion and redundancy values) of the eukaryotic MAGs. The supplemental material provides more information regarding this workflow and describes examples for CONCOCT clusters' binning and curation.
A first Giga scale eukaryotic MAG. We performed targeted genome-resolved metagenomics to confirm the biological relevance and improve statistics of the single MAG longer than 1 Gbp with an additional co-assembly (five Southern Ocean metagenomes for which this MAG had vertical coverage >1x) and by considering contigs longer than 1,000 nucleotides, leading to a gain of 181,8 million nucleotides. To our knowledge, we describe here the first successful characterization of a Gigabase-scale MAG (1.32 Gbp with 419,520 scaffolds), which we could identify using two distinct metagenomic co-assemblies.
MAGs from the 0.2-3 μm size fraction. We incorporated into our database 20 eukaryotic MAGs longer than 10 million nucleotides previously characterized from the 0.2-3 μm size fraction 38 , providing a set of MAGs corresponding to eukaryotic cells ranging from 0.2 µm (picoeukaryotes) to 2 mm (small animals).

Single-cell genomics:
We used 158 eukaryotic single cells sorted by flow cytometry from seven Tara Oceans stations as input to perform genomic assemblies (up to 18 cells with identical 18S rRNA genes per assembly to optimize completion statistics, see Supplementary Table 2), providing 34 single-cell genomes (SAGs) longer than 10 million nucleotides. Cell sorting, DNA amplification, sequencing and assembly were performed as described elsewhere 19 . In addition, manual curation was performed using sequence composition and differential coverage across 100 metagenomes in which the SAGs were most detected, following the methodology described in the genome-resolved metagenomics section. For SAGs with no detection in Tara Oceans metagenomes, only sequence composition and taxonomical signal could be used, limiting this curation effort's scope. Notably, manual curation of SAGs using the genome-resolved metagenomic workflow turned out to be highly valuable, leading to the removal of more than one hundred thousand scaffolds for a total volume of 193.1 million nucleotides. This metagenomic-guided decontamination effort contributes to previous efforts characterizing eukaryotic SAGs from the same cell sorting material 19,53,[86][87][88] and provides new marine eukaryotic guidelines SAGs. The supplemental material provides more information regarding this workflow and describes an example for the curation of SAGs using metagenomics.
Characterization of a non-redundant database of SMAGs. We determined the average nucleotide identity (ANI) of each pair of SMAGs using the dnadiff tool from the MUMmer package 89 v.4.0b2. SMAGs were considered redundant when their ANI was >98% (minimum alignment of >25% of the smaller SMAG in each comparison). We then selected the longest SMAG to represent a group of redundant SMAGs. This analysis provided a non-redundant genomic database of 713 SMAGs.
Taxonomical inference of SMAGs. We manually determined the taxonomy of SMAGs using a combination of approaches: (1) taxonomical signal from the initial gene calling (Prodigal), (2) phylogenetic approaches using the RNA polymerase and METdb, (3) ANI within the SMAGs and between SMAGs and METdb, (4) local blasts using BUSCO gene markers, (5) and lastly the functional clustering of SMAGs to gain knowledge into very few SMAGs lacking gene markers and ANI signal.
Protein coding genes. Protein coding genes for the SMAGs were characterized using three complementary approaches: protein alignments using reference databases, metatranscriptomic mapping from Tara Oceans and ab-initio gene predictions. While the overall framework was highly similar for MAGs and SAGs, the methodology slightly differed to take the best advantage of those two databases when they were processed (see the two following sections).

Protein-coding genes for the MAGs. Protein alignments:
Since the alignment of a large protein database on all the MAG assemblies is time greedy, we first detected the potential proteins of Uniref90 + METdb that could be aligned to the assembly by using MetaEuk with default parameters. This subset of proteins was aligned using BLAT with default parameters, which localized each protein on the MAG assembly. The exon/intron structure was refined using genewise 90 with default parameters to detect splice sites accurately. Each MAG's GeneWise alignments were converted into a standard GFF file and given as input to gmove. Metatranscriptomic mapping from Tara Oceans: A total of 905 individual Tara Oceans metatranscriptomic assemblies (mostly from large planktonic size fractions) were aligned on each MAG assembly using Minimap2 91 (version 2.15-r905) with the "-ax splice" flag. BAM files were filtered as follows: low complexity alignments were removed and only alignments covering at least 80% of a given metatranscriptomic contig with at least 95% of identity were retained. The BAM files were converted into a standard GFF file and given as input to gmove. Ab-initio gene predictions: A first gene prediction for each MAG was performed using gmove and the GFF file generated from metatranscriptomic alignments. From these preliminary gene models, 300 gene models with a start and a stop codon were randomly selected and used to train AUGUSTUS 92 (version 3.3.3). A second time, AUGUSTUS was launched on each MAG assembly using the dedicated calibration file, and output files were converted into standard GFF files and given as input to gmove. Each individual line of evidence was used as input for gmove (http://www.genoscope.cns.fr/externe/gmove/) with default parameters to generate the final protein-coding genes annotations.
Protein coding genes for the SAGs. Protein alignments: The Uniref90 + METdb database of proteins was aligned using BLAT 93 with default parameters, which localized protein on each SAG assembly. The exon/intron structure was refined using GeneWise 90 and default parameters to detect splice sites accurately. The GeneWise alignments of each SAG were converted into a standard GFF file and given as input to gmove. Metatranscriptomic mapping from Tara Oceans: The 905 Tara Oceans metatranscriptomic individual fastq files were filtered with kfir (http://www.genoscope.cns.fr/kfir) using a k-mer approach to select only reads that shared 25-mer with the input SAG assembly. This subset of reads was aligned on the corresponding SAG assembly using STAR 94 (version 2.5.2.b) with default parameters. BAM files were filtered as follows: low complexity alignments were removed and only alignments covering at least 80% of the metatranscriptomic reads with at least 90% of identity were retained. Candidate introns and exons were extracted from the BAM files and given as input to gmorse 95 . Ab-initio gene predictions: Ab-initio models were predicted using SNAP 96 (v2013-02-16) trained on complete protein matches and gmorse models, and output files were converted into standard GFF files and given as input to gmove. Each line of evidence was used as input for gmove (http://www.genoscope.cns.fr/externe/gmove/) with default parameters to generate the final protein-coding genes annotations.
BUSCO completion scores for protein-coding genes in SMAGs. BUSCO 80 v.3.0.4 with the set of eukaryotic single-copy core gene markers (n=255). Completion and redundancy (number of duplicated gene markers) of SMAGs were computed from this analysis.

Biogeography of SMAGs.
We performed a final mapping of all metagenomes to calculate the mean coverage and detection of the SMAGs (Table S4). Briefly, we used BWA v0.7.15 (minimum identity of 90%) and a FASTA file containing the 713 nonredundant SMAGs to recruit short reads from all 943 metagenomes. We considered SMAGs were detected in a given filter when >25% of their length was covered by reads to minimize non-specific read recruitments 26 . The number of recruited reads below this cut-off was set to 0 before determining vertical coverage and percent of recruited reads. Regarding the projection of mapped reads, if SMAGs were to be complete, we used BUSCO completion scores to project the number of mapped reads. Note that we preserved the actual number of mapped reads for the SMAGs with completion <10% to avoid substantial errors to be made in the projections.
Identifying the environmental niche of SMAGs. Seven physicochemical parameters were used to define environmental niches: sea surface temperature (SST), salinity (Sal), dissolved silica (Si), nitrate (NO3), phosphate (PO4), iron (Fe), and a seasonality index of nitrate (SI NO3). Except for Fe and SI NO3, these parameters were extracted from the gridded World Ocean Atlas 2013 (WOA13) 97 . Climatological Fe fields were provided by the biogeochemical model PISCES-v2 98 . The seasonality index of nitrate was defined as the range of nitrate concentration in one grid cell divided by the maximum range encountered in WOA13 at the Tara sampling stations. All parameters were co-located with the corresponding stations and extracted at the month corresponding to the Tara sampling. To compensate for missing physicochemical samples in the Tara in situ data set, climatological data (WOA) were favored. More details are available in the supplemental material.
Cosmopolitan score. Using metagenomes from the Station subset 1 (n=757), SMAGs were assigned a "cosmopolitan score" based on their detection across 119 stations (see the supplemental material for more details).
A database of manually curated DNA-dependent RNA polymerase genes. A eukaryotic dataset 99 was used to build HMM profiles for the two largest subunits of the DNA-dependent RNA polymerase (RNAP-a and RNAP-b). These two HMM profiles were incorporated within the anvi'o framework to identify RNAP-a and RNAP-b genes (Prodigal 78 annotation) in the SMAGs and METdb transcriptomes. Alignments, phylogenetic trees and blast results were used to organize and manually curate those genes. Finally, we removed sequences shorter than 200 amino-acids, providing a final collection of DNA-dependent RNA polymerase genes for the SMAGs (n=2,150) and METdb (n=2,032) with no duplicates (see the supplemental material for more details).
Novelty score for the DNA-dependent RNA polymerase genes. We compared both the RNA-Pol A and RNA-Pol B peptides sequences identified in SMAGs and MetDB to the nr database (retrieved on October 25, 2019) using blastp, as implemented in blast+ 100 v.2.10.0 (e-value of 1e -10 ). We kept the best hit and considered it as the closest sequence present in the public database. For each SMAG, we computed the average percent identity across RNA polymerase genes (up to six genes) and defined the novelty score by subtracting this number from 100. For example, with an average percent identity is 64%, the novelty score would be 36%.
Phylogenetic analyses of SMAGs. The protein sequences included for the phylogenetic analyses (either the DNA-dependent RNA polymerase genes we recovered manually or the BUSCO set of 255 eukaryotic single-copy core gene markers we recovered automatically from the ~10 million protein coding genes) were aligned with MAFFT 101 v.764 and the FFT-NS-i algorithm with default parameters. Sites with more than 50% of gaps were trimmed using Goalign v0.3.0-alpha5 (http://www.github.com/evolbioinfo/goalign). The phylogenetic trees were reconstructed with IQ-TREE 102 v1.6.12, and the model of evolution was estimated with the ModelFinder 103 Plus option: for the concatenated tree, the LG+F+R10 model was selected. Supports were computed from 1,000 replicates for the Shimodaira-Hasegawa (SH)-like approximation likelihood ratio (aLRT) 104 and ultrafast bootstrap approximation (UFBoot) 105 . As per IQ-TREE manual, we deemed the supports good when SH-aLRT >= 80% and UFBoot >= 95%. Anvi'o v.6.1 was used to visualize and root the phylogenetic trees.
Functional inference of SMAGs. We performed the functional annotation of protein-coding genes using the EggNog-mapper 55,56 v2.0.0 and the EggNog5 database 54 . We used Diamond 106 v0.9.25 to align proteins to the database. We refined the functional annotations by selecting the orthologous group within the lowest taxonomic level predicted by EggNog-mapper.
Differential occurrence of functions. We performed a Welch's ANOVA test followed by a Games-Howell test for significant ANOVA comparisons to identify functions occurring differentially between functional groups of SMAGs. All statistics were generated in R 3.5.3.
Functional clustering of SMAGs. We used anvi'o to cluster SMAGs as a function of their functional profile (Euclidean distance with ward's linkage), and the anvi'o interactive interface to visualize the hierarchical clustering in the context of complementary information.
Data availability. All data our study generated are publicly available at http://www.genoscope.cns.fr/tara/. The link provides access to the 11 raw metagenomic co-assemblies, the FASTA files for 713 SMAGs, the ~10 million protein-coding sequences (nucleotides, amino acids and gff format), and the curated DNA-dependent RNA polymerase genes (SMAGs and METdb transcriptomes). This link also provides access to the supplemental figures and the supplemental material.

Acknowledgments
Our survey was made possible by two scientific endeavors: the sampling and sequencing efforts by the Tara Oceans Project, and the bioinformatics and visualization capabilities afforded by anvi'o. We are indebted to all who contributed to these efforts, as well as other open-source bioinformatics tools for their commitment to transparency and openness. Tara Oceans (which includes the Tara Oceans and Tara Oceans Polar Circle expeditions) would not exist without the leadership of the Tara Ocean Foundation and the continuous support of 23 institutes (https://oceans.taraexpeditions.org/). We thank the commitment of the following people and sponsors who made this singular expedition possible: CNRS The authors also thank agnès b. and E. Bourgois, the Prince Albert II de Monaco Foundation, the Veolia Foundation, the EDF Foundation EDF Diversiterre, Region Bretagne, Lorient Agglomeration, Worldcourier, Illumina, the EDF Foundation EDF Diversiterre, for support and commitment. The global sampling effort was made possible by countless scientists and crew who performed sampling aboard the Tara from 2009 to 2013. The authors are also grateful to the countries that graciously granted sampling permission. Part of the computations were performed using the platine, titane and curie HPC machine provided through GENCI grants (t2011076389, t2012076389, t2013036389, t2014036389, t2015036389 and t2016036389). We also thank Noan Le Bescot (TernogDesign) for artwork on Figures. This article is contribution number XX of Tara Oceans.  Figure S1: Phylogenomic analysis of the protein sequences of 255 BUSCO genes markers from eukaryotic plankton. The maximum-likelihood phylogenomic tree of the BUSCO gene markers (255 genes) included Tara Oceans MAGs and METdb transcriptomes and was generated using a total of 19,785 sites in the alignment and LG+F+R10 model; Opisthokonta was used as the outgroup. The tree was decorated with additional layers using the anvi'o interface. Branches and names in red correspond to lineages lacking representatives in METdb. Please note that this analysis is preliminary, and that more advanced phylogenomic analyses using the same markers are underway.  Table S1. Summary of 939 Tara Oceans metagenomes that include their station ID, size fraction and number of quality filtered reads. Table S2. Summary of the genome-resolved metagenomics and single cell genomics outcomes. The table includes statistics for the metagenomic co-assemblies, redundant MAGs and SAGs, and targeted efforts regarding the one giga scale MAG. Table S3. Statistics of non-redundant SMAGs and METdb transcriptomes. The table includes genomic statistics and taxonomical inferences, the occurrence of RNA polymerase genes, and distribution patterns across stations and size fractions.