Unlocking the Genomic Taxonomy of the Prochlorococcus Collective

Prochlorococcus is the most abundant photosynthetic prokaryote on our planet. The extensive ecological literature on the Prochlorococcus collective (PC) is based on the assumption that it comprises a single genus with a single species, Prochlorococcus marinus, which in turn contains a collective of ecotypes. Ecologists adopt the distributed genome hypothesis of an open pan-genome to explain the observed genomic diversity and evolution patterns of the ecotypes within the PC. Novel genomic data for the PC prompted us to revisit this group, applying the current methods used in genomic taxonomy. As a result, we were able to distinguish five genera: Prochlorococcus, Eurycolium, Prolificoccus, Thaumococcus, and Riococcus. The novel genera have distinct genomic and ecological attributes.


Introduction
Prochlorococcus cells are the smallest cells that emit red chlorophyll fluorescence, which is how they were discovered approximately 30 years ago [1,2]. Prochlorococcus is the smallest and most abundant photosynthetic microbe on Earth. The oxytrophic free-living marine Prochlorococcus marinus contains divinyl chlorophyll a (chl a2) and both monovinyl and divinyl chlorophyll b (chl b2) as its major photosynthetic pigments rather than chlorophyll a and phycobiliproteins, which are typical of Cyanobacteria. These pigments allow it to adapt to different light intensities [1,3-6], particularly to capture the blue wavelengths prevailing in oligotrophic waters [7]. Prochlorococcus strains have been cultured and are available in culture collections, but the limited number of cultures does not reflect the true phenotypic and genomic diversity found in the ocean [8]. Prochlorococcus is a smaller, differently pigmented variant of marine Synechococcus [2]. The pioneering work on the eco-genomics of Prochlorococcus was performed by Chisholm's group at MIT [2,9-15].
The taxonomy of the Prochlorococcus collective (PC) has been neglected in the last three decades. The designation PC was first coined by Kashtan et al. [11]. This collective comprises the genera Prochlorococcus, Prolificoccus, Eurycolium, and Thaumococcus [16][17][18], a group of picocyanobacteria that radiated from Synechococcus millions of years ago. We prefer to use the neutral term "collective" here instead of "community," "complex," or any other term.
Most literature on PC is based on the assumption that it comprises a single genus, in which the species Prochlorococcus marinus is within a "collective" of ecotypes. The main reason for this concept is that 16S rRNA gene sequence similarities within this group are > 97%. To describe the observed diversity in the PC, ecologists adopt the distributed genome hypothesis of an open bacterial pan-genome (supragenome) [12,14]. In this hypothesis, the pan-genome is the full complement of genes that each member of the population contributes to and draws genes from. No single isolate contains the full complement of genes, which contributes to high diversity and provides selective advantages [19]. Described in its extreme form, the pan-genome of the bacterial domain is of infinite size, and the bacteria as a whole present an open pan-genome (supragenome), in which the constant "rain" of genetic material in genomes from a cloud of frequently transferred genes enhances the chance of survival of a species [20]. The PC case illustrates the continuously growing gap between prokaryotic taxonomy as it is practiced today and in-depth eco-genomic studies. Genomic clusters are referred to as ecotypes, ecospecies, sub-ecotypes, clades, populations, subpopulations, OTUs, or CTUs. This informal nomenclature finds its way into databases, adding further nomenclatural chaos to "terrae incognitae" [21,22] comprising these picocyanobacteria.
Diogo Tschoeke and Vinicius W. Salazar contributed equally to this work.
Electronic supplementary material: The online version of this article (https://doi.org/10.1007/s00248-020-01526-5) contains supplementary material, which is available to authorized users.
In their theory of self-amplification and self-organization of the biosphere, Braakman et al. [3] hypothesize that Prochlorococcus clades living in surficial oceanic waters adapted to high light and low nutrient levels have evolved from more ancient Prochlorococcus subpopulations (~800 million years ago) inhabiting deeper oceanic areas, with lower light and higher nutrient concentrations. They refer to these evolutionary paths as "niche constructing adaptive radiations." Sun and Blanchard [23] hypothesize three phases in the evolutionary history of the PC: genome reduction, genome diversification, and ecological niche specialization, all due to different evolutionary forces. The PC has adapted to live under different light intensities, seawater temperatures, and nutrient levels, and the abundance of the different ecotypes varies markedly along gradients of these parameters [24][25][26][27][28][29][30]. However, only high light (HL)-adapted and low light (LL)-adapted ecotypes were initially distinguished [31]. At present, twelve phylogenetic ecotypes are delineated: six HL-adapted ecotypes and six LL-adapted ecotypes [32]. These ecotypes are based on the phylogeny of the 16S/23S rRNA internal transcribed spacer region, differences in behavior towards light, whole-genome analysis, and physiological properties, combined with their environmental distribution [28,30,33]. Not only light adaptation but also temperature adaptation has been recognized in the two dominant near-surface HL ecotypes, the HLI ecotype Eurycolium eMED4 and the HLII ecotype Eurycolium eMIT9312. Both exhibit growth maxima at approximately 25°C, but Eurycolium eMED4 appears to be adapted to thrive at much lower temperatures (approx. 14°C), whereas Eurycolium eMIT9312 is able to grow at 30°C [34]. These ecotypes appear to have nearly identical genomes [9], in which the major differences are restricted to five genomic islands, but show marked phenotypic differences.
The relative abundance of cells belonging to the different ecotypes has been estimated in several oceanic regions, revealing patterns that agree for the most part with the HL/LL phenotypes. HL-adapted cells dominate the surface mixed layer, and LL-adapted cells most often dominate deeper waters [32,35-38]. By combining the physiological features of isolates and ecotype abundance studies in the ocean, it was shown that temperature, in addition to light, is an important determinant of the abundance of these ecotypes in oceans [39,40]. There is also tremendous variability within ecotypes, particularly with regard to nutrient acquisition [41], susceptibility to predation or phage infection [42], and modes of interaction with other members of the community [43], including interactions with Alteromonas for protection against ROS [44,45]. Prochlorococcus can use different organic compounds containing key elements to survive in oligotrophic oceans, and can also take up glucose and use it as a source of carbon and energy [46].
Kashtan et al. [11] studied the genomic diversity of single Prochlorococcus cells in samples collected in the Bermuda-Atlantic Time-Series Study (BATS) site in summer, spring, and winter, and hundreds of new Prochlorococcus "subpopulations" with distinct genomic backbones were found. These subpopulations are estimated to have diverged millions of years ago, suggesting ancient, stable niche partitioning followed by selection (not by genetic drift). The authors suggested that the global PC may be conceived as an assortment of thousands of species. They argued that such a large set of coexisting species with distinct genomic backbones is a characteristic feature of free-living prokaryotes with very large population sizes living in highly mixed habitats. In another study by these authors focused on the Pacific Ocean, they concluded that the Atlantic and Pacific Oceans are occupied by different PC populations that experienced separate evolutionary paths for a few million years [13].
From 1970 to the present, the need to classify, name, and identify prokaryotes has been based on a pragmatic polyphasic consensus approach [47]. This non-theory-based approach aims to integrate phenotypic, genotypic, and phylogenetic and ecological characteristics to establish stable and informative classification systems. The central unit is the prokaryotic species. In this procedure, the definition of prokaryotic species is based mainly on DNA-DNA hybridization (DDH) and 16S rRNA sequence similarity: a species is defined as a monophyletic group of strains sharing high phenotypic and genomic similarity (> 70% DDH, > 98.3% 16S rRNA sequence similarity) [48][49][50][51][52].
Prokaryote taxonomy is becoming more genome based, considering the genomes themselves as the ultimate end products of microbial evolution [50,53,54]. Such an approach offers the opportunity to obtain a more natural taxonomic system and allows the definition of consistent and stable monophyletic species and genera with maximal utility for its users [47,48,. Cohan has extensively developed the ecotype or ecospecies theory for prokaryotes, in which "ecotypes are the most newly divergent populations that are distinct ecologically from one another" [76,77]. In this theory, the splitting of one ecotype into two is the fundamental diversity-creating process of speciation in bacteria. In his stable ecotype model, ecotypes (ecospecies) are long lived, and different ecotypes coexist indefinitely. They are cohesive by virtue of periodic selection events that purge the ecotype of sequence diversity. In this case, the ecotype has all the characteristics of a species [77,78]. Ecotypes have been discerned among the PC.
As new habitats and niches are found for Prochlorococcus, it is important to establish a solid taxonomic system to contribute to the interpretation of the ecology of the PC. Novel PC genomic data allowed us to address the hypothesis that novel genera and species are present within the PC. In this review, we revisit the current taxonomy of the PC and propose the creation of a new genus to encompass the clearly recognized clusters. The novel genera have distinct genomic and ecological attributes.

A Proposal of a New Taxonogenomic Framework for the PC
The availability of genomes labelled Prochlorococcus in public databases prompted us to test the hypothesis that the PC comprises a greater number of stable genera and species than have been previously recognized (Table 1 and S1). We first collected all the publicly available genomes designated as Prochlorococcus and applied the theory-free genomic methodologies and principles discussed previously. The Prochlorococcus genomes came from samples distributed around the world, especially in the Pacific and Atlantic Oceans (Table S1). First, the GC% content of the genomes was determined. A phylogenomic framework was then established using genomic signatures, i.e., average amino acid identity (AAI), multilocus sequence analysis (MLSA), and a core genome-based tree. We then classified the genomes into different genera and species using relevant AAI and phenotypic information obtained directly from the genomes.

Genome Quality Assessment
A total of 232 genomes and the corresponding metadata were selected from NCBI GenBank [98] because they fit our quality control criteria. Contigs or scaffold files were input into Prodigal software v2.6.3 [55] for gene and protein prediction. To check the quality of the genomes, completeness, contamination, and GC%, the genomes were submitted to CheckM v1.0.11 [56] using default parameters for this estimation. CheckM uses a broader set of marker genes specific to the position of a genome within a reference genome tree. The completeness is measured using a set of marker genes that are expected to be present. This expectation can be defined along an evolutionary gradient, ranging from highly conserved genes to species-specific genes. The genomes used are of high quality and have ≥ 85% of completeness, ≥ 1 Mb of size, and ≤ 5% of contamination (Table S1).
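The quality thresholds stated above can be illustrated with a short filter. This is a minimal sketch and not the authors' pipeline: the record layout and the field names (`completeness`, `contamination`, `size_bp`) are assumptions made for illustration and do not reproduce the CheckM output columns; only the three cutoffs come from the text.

```python
# Cutoffs taken from the text: >= 85% completeness, <= 5% contamination, >= 1 Mb.
MIN_COMPLETENESS = 85.0      # percent
MAX_CONTAMINATION = 5.0      # percent
MIN_SIZE_BP = 1_000_000      # 1 Mb


def passes_quality(record):
    """Return True when a genome record meets all three quality criteria."""
    return (
        record["completeness"] >= MIN_COMPLETENESS
        and record["contamination"] <= MAX_CONTAMINATION
        and record["size_bp"] >= MIN_SIZE_BP
    )


# Hypothetical records (invented values, illustrative field names).
genomes = [
    {"name": "genome_A", "completeness": 99.2, "contamination": 0.3, "size_bp": 1_650_000},
    {"name": "genome_B", "completeness": 72.0, "contamination": 1.1, "size_bp": 1_200_000},
    {"name": "genome_C", "completeness": 91.5, "contamination": 7.4, "size_bp": 1_900_000},
    {"name": "genome_D", "completeness": 88.0, "contamination": 2.0, "size_bp": 800_000},
    {"name": "genome_E", "completeness": 85.0, "contamination": 5.0, "size_bp": 1_000_000},
]

selected = [g["name"] for g in genomes if passes_quality(g)]
print(selected)
```

Because the thresholds are inclusive, a genome sitting exactly on all three cutoffs (genome_E here) is retained.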

Average Amino Acid Identity Analyses
The average amino acid identity was calculated as described previously for all 232 genomes [57]. All PC genomes were used to calculate the average amino acid identity using the CompareM tool (https://github.com/dparks1134/CompareM). The result was used to plot a dendrogram. The CompareM output was converted to a square dissimilarity matrix. The spatial.distance.squareform function from the scipy package v1.1.0 was used to obtain a condensed list from the lower triangular elements of the matrix. Complete linkage clustering was performed using the cluster.hierarchy.linkage function from scipy with the output of squareform and the cityblock distance metric. The figure that represents the linkage matrix was plotted with the cluster.hierarchy.dendrogram function from scipy. The intraspecies limit is assumed as ≥ 95%, whereas genera delimitation is assumed as ≥ 70%. Genera were delineated only for those lineages exhibiting at least 3 representative genome sequences.
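The clustering steps described above can be sketched with scipy as follows. This is an illustration under stated assumptions rather than the authors' script: the 5 x 5 AAI matrix is invented, the dissimilarity is assumed to be 100 minus AAI, and the flat-cluster cut with `fcluster` is added here only to show how the ≥ 95% and ≥ 70% limits translate into distance cutoffs of 5 and 30. Since the input to `linkage` is already a condensed distance matrix, no metric argument is passed.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical pairwise AAI values (percent) for five genomes.
names = ["g1", "g2", "g3", "g4", "g5"]
aai = np.array([
    [100.0,  97.0,  82.0,  60.0,  61.0],
    [ 97.0, 100.0,  81.0,  59.0,  60.0],
    [ 82.0,  81.0, 100.0,  62.0,  63.0],
    [ 60.0,  59.0,  62.0, 100.0,  96.0],
    [ 61.0,  60.0,  63.0,  96.0, 100.0],
])

# Assumed conversion from similarity to dissimilarity.
dissimilarity = 100.0 - aai

# Square symmetric matrix -> condensed vector of pairwise distances.
condensed = squareform(dissimilarity, checks=True)

# Complete linkage: the distance between two clusters is their largest pairwise distance.
Z = linkage(condensed, method="complete")

# With complete linkage, cutting at distance t keeps all within-cluster pairs at AAI >= 100 - t.
species = fcluster(Z, t=5.0, criterion="distance")   # AAI >= 95%
genera = fcluster(Z, t=30.0, criterion="distance")   # AAI >= 70%

for name, sp, ge in zip(names, species, genera):
    print(name, "species cluster", sp, "genus cluster", ge)
```

The same linkage matrix `Z` can be passed to `scipy.cluster.hierarchy.dendrogram(Z, labels=names)` to draw the tree, which requires matplotlib.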

Phylogenomic Reconstruction
The MLSA approach was based on the concatenated sequences of four housekeeping genes (gyrB, pyrH, recA, rpoB), with intraspecies relatedness ranging from > 95 to 100% MLSA similarity [16]. The gene sequences were obtained using tBLASTn [58]. A nucleotide alignment was then made for each of the four genes using MAFFT [59] with default parameters, and these alignments were concatenated using a custom Perl script. The tree was built from the concatenated alignment using FastTree 2 [60] with default parameters and rooted on S. elongatus PCC 6301 using ETE 3 [61]. To build the core genome tree, we used the GToTree package [62] with default parameters. The 232 selected genomes were searched against a Hidden Markov Model of 250 cyanobacterial marker genes using HMMER3 [63]. A concatenated protein alignment from the 250 genes was constructed using Muscle [64] and subsequently trimmed using TrimAl [65]. The alignment was then used to construct a tree using FastTree 2 [60] with default parameters, and the pairwise distance matrix was computed using MEGA 6.0 [66]. The tree was rooted on S. elongatus PCC 6301 using ETE 3 [61]. Part of the processing was done with GNU parallel [67].
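The concatenation of per-gene alignments into a single MLSA supermatrix can be sketched as follows. This is a Python illustration of the general idea, not the custom Perl script used in the study; the function name, the toy sequences, and the choice to pad missing genes with gap characters are assumptions.

```python
def concatenate_alignments(alignments, gene_order):
    """Join per-gene alignments into one sequence per taxon.

    alignments: dict mapping gene name -> dict mapping taxon -> aligned sequence.
    A taxon missing from a gene alignment is padded with gaps of that gene's width
    (an assumption of this sketch).
    """
    taxa = sorted(set().union(*(alignments[gene].keys() for gene in gene_order)))
    parts = {taxon: [] for taxon in taxa}
    for gene in gene_order:
        widths = {len(seq) for seq in alignments[gene].values()}
        if len(widths) != 1:
            raise ValueError(f"sequences in the {gene} alignment differ in length")
        width = widths.pop()
        for taxon in taxa:
            parts[taxon].append(alignments[gene].get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy alignments for the four housekeeping genes named in the text.
gene_order = ["gyrB", "pyrH", "recA", "rpoB"]
toy = {
    "gyrB": {"s1": "ATG-", "s2": "ATGC"},
    "pyrH": {"s1": "GG", "s2": "GA"},
    "recA": {"s1": "TTT"},                 # s2 lacks this gene in the toy data
    "rpoB": {"s1": "C", "s2": "C"},
}

supermatrix = concatenate_alignments(toy, gene_order)
for taxon, seq in supermatrix.items():
    print(taxon, seq)
```

Keeping a fixed gene order guarantees that every column of the concatenated alignment refers to the same gene position in all taxa.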

In Silico Phenotype
The phenotype was predicted utilizing the sequences of the nitrogen-related (narB, nirA, ntcA, glnB, amt, cynS, ureA, glnA, glsF, gdhA), phosphorous-related (phoB, ptrA, phoR), arsenate-related (arsC), iron-related (piuC, som), and photosynthetic pigment-related (psaAB, pcbABCDEFGH, CP47, and katG) genes of Prochlorococcus obtained from UniProt and GenBank as queries for a local blastp [68] search against all PC ORF-predicted genomes. The results were parsed with Python scripts, and the sequences corresponding to the blastp hits were extracted from the respective genomes. The extracted sequences were used to perform a blastp search against the nr database to confirm the functional annotation. Manual verification of these results was performed. The confirmed results were used to build a matrix of presence/absence for the genes related to the indicated proteins in each PC genome.
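Building a presence/absence matrix from blastp hits can be sketched as below. This is a hedged illustration, not the authors' parsing scripts: it assumes the standard 12-column tabular BLAST output (query in column 1, subject in column 2, e-value in column 11), a hypothetical subject naming scheme of the form `genome|orf_id`, and an e-value cutoff of 1e-5 that is not stated in the text.

```python
MAX_EVALUE = 1e-5  # hypothetical cutoff; the study does not state one


def presence_absence(blast_lines, genes, genomes, max_evalue=MAX_EVALUE):
    """Return {genome: {gene: 0 or 1}} from tabular blastp lines."""
    matrix = {genome: {gene: 0 for gene in genes} for genome in genomes}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        genome = subject.split("|", 1)[0]  # assumed "genome|orf_id" naming
        if query in matrix.get(genome, {}) and evalue <= max_evalue:
            matrix[genome][query] = 1
    return matrix


# Invented hits in 12-column tabular format.
hits = [
    "narB\tgenomeA|orf_0012\t88.1\t500\t59\t0\t1\t500\t1\t500\t1e-150\t900",
    "arsC\tgenomeA|orf_0400\t75.0\t130\t32\t0\t1\t130\t1\t130\t3e-60\t210",
    "arsC\tgenomeB|orf_0031\t30.0\t40\t28\t0\t1\t40\t5\t44\t2.0\t18",
    "phoB\tgenomeB|orf_0200\t80.0\t230\t46\t0\t1\t230\t1\t230\t1e-90\t350",
]

table = presence_absence(hits, ["narB", "arsC", "phoB"], ["genomeA", "genomeB"])
for genome, row in table.items():
    print(genome, row)
```

In the workflow described above, such a matrix would only be filled in after the hits have been confirmed against the nr database and checked manually; the sketch skips those steps.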

GC Content Reveals Great Genomic Diversity in the PC
One of the most striking findings of our study came from the analysis of GC content, which ranged from 30 to 50.7% (Fig. 1 and Table 1). The distribution of the GC values of all the genomes comprised five groups: (i) the majority (75% of the genomes) exhibited 30 to 33% GC, comprising the genus Eurycolium; (ii) a second group (12% of the genomes) exhibited 34 to 35% GC, comprising the genus Prolificoccus; (iii) a third group, Prochlorococcus (4.8% of the genomes), presented 36 to 37% GC; (iv) a fourth group, Riococcus (1.4% of the genomes), exhibited 37 to 38% GC; and (v) the fifth group (6.7% of the genomes) exhibited 50 to 51% GC, which was characteristic of the genus Thaumococcus. The average genome sizes of the first four GC groups were similar, whereas the genome sizes of the fifth GC group were almost twofold larger. This GC analysis provides a clear indication of the taxonomic width of the PC. Mutational bias is dominated by GC to AT transitions even in bacteria with a high % GC content [69,70]. The species and genus delimitations were based on clusters with three or more genomes that shared ≥ 95% or ≥ 70% average amino acid identity (AAI), respectively. Here, we revisit the current taxonomy of the PC and propose the creation of a new genus, Riococcus, and 128 novel species (N = 137 species) (Table 1 and S1). We confirmed the previous proposal of Walter et al. [18], and we now expand the classification of the PC (Fig. 2). Eurycolium (N = 112) and Prolificoccus (N = 16) were the genera with the greatest numbers of species, followed by Prochlorococcus (N = 4), Riococcus (N = 3), and Thaumococcus (N = 2) (Fig. 2, Table 1 and S1). The AAI among PC members varied considerably, between 57 and 99.9%. The closest genus to Prochlorococcus marinus CCMP1375 T was Riococcus MIT9211 T, with 69.4% AAI. The percentages of AAI variability between genera and species were 57 to 69.9% and 70 to 94%, respectively. The percentages of AAI variability within genera and species were > 70 to 94% and 95 to 99.9%, respectively.
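The GC% values discussed above are a simple property of the assembled sequences. The sketch below shows one way to compute it from FASTA text; it is only an illustration (in the study GC% was reported by CheckM), and the decision to ignore ambiguous bases such as N in the denominator is an assumption of this example.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def gc_percent(fasta_text):
    """GC content (percent) over all contigs, counting only A, C, G, and T."""
    gc = at = 0
    for _, seq in parse_fasta(fasta_text):
        gc += seq.count("G") + seq.count("C")
        at += seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("no unambiguous bases found")
    return 100.0 * gc / (gc + at)


# Toy two-contig assembly; the N bases are excluded from the calculation.
toy_fasta = """>contig1
ATGCATGC
NNAT
>contig2
ggcc
"""

print(round(gc_percent(toy_fasta), 2))
```

Pooling all contigs before dividing gives the genome-wide GC%, which weights long contigs more heavily than averaging per-contig values would.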

Phylogenomic Framework for the PC
To establish the phylogenetic structure of the five PC genera, phylogenetic analysis was performed using a four-locus-based tree (MLSA) and 250 core protein sequences. Bootstrap analysis indicated that most of the branches were highly significant. These analyses have a much higher resolution than the 16S rRNA phylogeny. The PC harbors very similar 16S rRNA gene sequences with similarity ranging from 97 to 99% [2]. The MLSA and the core protein tree showed the relationship of the five genera (Fig. 3). The MLSA and core protein trees were congruent. The MLSA and the core protein sequence identities among PC members varied considerably, from 49 to 100% and from 56 to 100%, respectively. The type species of the genus Prochlorococcus, P. marinus CCMP1375 T, showed identity maxima of 80% and 70% MLSA and core protein sequence similarity, respectively, with all other genera. The similarity between species was < 95%, while that within species ranged from 96 to 100%. The phylogenetic analysis based on MLSA and core protein sequences (Fig. 3) demonstrated congruence between the genetic and ecological features of the five novel genera. For instance, Eurycolium comprises high-light (HLI/II/VI) ecotypes with 30-33% GC contents, whereas the genera Prolificoccus and Prochlorococcus comprise low-light (LLI and LLII/III, respectively) ecotypes with 34-35% and 36-37% GC contents, respectively. Riococcus and Thaumococcus consist of low-light ecotypes (LLII/III and LLIV, respectively) with 37 to 38% and 50 to 50.7% GC contents, respectively.

In Silico Phenotype Prediction
Useful phenotypic features that may be used for the identification and classification of the PC were identified, including nitrogen, phosphorous, arsenate, and iron utilization (Table 2). The genus Prochlorococcus does not present some nitrogen-related genes (narB, nirA, cynS, ureA, and gdhA) and phosphorous-related genes (phoB, ptrA, and phoR), but some genomes do present other nitrogen-related genes (ntcA, glnB, amt, glnA, glsF), an arsenate-related gene (arsC), and iron-related genes (piuC and som). Prochlorococcus is differentiated from its closest genus Riococcus by the presence of the arsC gene. The genes related to photosynthetic pigment synthesis (psaAB, pcbABCDEFGH, and CP47) were conserved among the genera (Table S2). All genera present the gene Pro1404 that confers the capability for glucose uptake in the PC [46].
Overall, we demonstrated that the methods used (AAI, core protein sequences, and MLSA) correlated and provided significant taxonomic resolution for differentiation of genera and species in the PC (Figs. 2 and 3). The proposed genera are in complete agreement with previous literature and support previous theories on the evolution of Prochlorococcus [11,13,14,31,33,36,71]. The PC genomic signatures were more  [103][104][105]. Taxon nomenclature within this group has long been a topic of discussion, but there is currently no consensus [106][107][108][109]. As a result, more than 50 genera of Cyanobacteria have been described since 2000, and many of them remain unrecognized in the List of Prokaryotic Names with Standing in Nomenclature (LPSN, http://www.bacterio.net) [110] and in databases (e.g., NCBI) or have been the target of high criticism [111,112]. Cyanobacterial taxonomy is based on morphologic traits and may not reflect the results of phylogenetic analyses [72,113-118]. The analysis of 16S rRNA gene sequences is useful for charting and characterizing microbial communities [119] but lacks sensitivity to evolutionary changes that occur in association with ecological dynamics, in which microbial diversity is organized by physicochemical parameters [120,121]. A recent study proposed that there should be 170 genera of Cyanobacteria based only on 16S rRNA sequences [119].

Farrant et al. delineated 121 ecologically significant taxonomic units (ESTUs) of Prochlorococcus and 15 ESTUs of Synechococcus in the global ocean using single-copy petB sequences and environmental cues [122]. Although it was assumed that all the disclosed Prochlorococcus belonged to a single genus, it was not clear how many phylogenetic groups

Fig. 2 Hierarchical clustering based on average amino acid identity (AAI) analysis of the 232 PC genomes. The numbers on the branches correspond to the percentage divergence among genomes and the red dots highlight it. The intraspecies limit is assumed as ≥ 95%, whereas genera delimitation is assumed as ≥ 70%. The colors (green, red, blue, purple, and yellow) represent the five genera

Unlocking the Genomic Taxonomy of the PC
Despite the astonishing advances in understanding the ecology of the PC, its taxonomic structure has remained puzzling until recently. The Prochlorococcus collective is thought to present a high degree of panmixis due to horizontal gene transfer and is composed of one genus and one species comprising two subspecies, P. marinus subsp. marinus and P. marinus subsp. pastoris [33,123-126]. We remark that at least 12

The current (type) genus Prochlorococcus contains two species: the type species Prochlorococcus marinus and its type strain SS120 T (= CCMP 1375 T ) and P. ceticus with its type strain MIT9211 T . The genus Eurycolium seems to form a large diverse eco-genomic group related to high temperature and oligotrophic environments, whereas the genera Prochlorococcus and Prolificoccus seem to be abundant at low temperature; Thaumococcus appears to thrive at low temperature and in copiotrophic environments [18].
On the basis of the eco-genomic and in silico phenotype evidence amassed here, we propose five new genera within the PC. The taxonomic structure of the PC now comprises the following type species and type strains or type sequences: (1) the genus Prochlorococcus (Pro'chlo ro coc"cus. Gr. pref. pro, before (primitive); Gr. adj chloros, green; M.L. masc n. coccus, berry. Primitive green sphere) (the type genus). The type species Prochlorococcus marinus (ma'ri"nus. marinus L. adj., marine) with its type strain CCMP1375 T (=SS120 T ). The genome of this strain presents a GC content of 36.4%; (2) the genus Prolificoccus (Pro.li.fi.co.ccus. L. prolificus, productive, abundant, numerous; Prolificoccus, referring to an abundant coccus). The type species Prolificoccus proteus (L. n. proteus, of protos, "first," is an early sea god, the old man of the sea) with its type strain NATL2A T . The genome of this strain presents a GC content of 35.1%; (3) the genus Eurycolium (Eur.y.co.lium. Gr. adj. eury, wide, broad; L. cole, inhabit; Gr. ium, quality or relationship, Eurycolium, referring to the spread inhabiting trait in marine habitats). The type species Eurycolium pastoris (pa.sto'ris. Latinized form of Pasteur) with its type strain CCMP1986 T (=MED4 T ). The genome of this strain presents a GC content of 30.8%; (4) the genus Thaumococcus (Thau.mo.co.ccus. Referring to Thaumas, Greek god of the wonders of the sea; N.L. masc. n. coccus (from Gr. Masc. n. kokkos, grain, seed, kernel); N.L. masc. n. Thaumococcus, referring to a coccus living in the sea). The type species Thaumococcus swingsii (L. gen. masc. n. swingsii, of Swings, in honor of the Belgian microbiologist Jean Swings) with its type strain MIT 9313 T . The genome of this strain presents a GC content of 50.7%; and (5) the genus Riococcus (Rio.coc'cus. N.L. masc. n. Rio, of Rio de Janeiro, a city in Brazil; N.L. masc. n. kokkos, a grain; N.L. masc. n. Riococcus, referring to the city of Rio de Janeiro). The type species Riococcus ceticus (L. n. ceticus, of Cetus, denotes a large fish) with its type strain MIT9211 T . The genome of this strain presents a GC content of 38%. We suggest that a PC species may be defined as a group of strains that share > 95% DNA identity in core protein sequences, and > 95% AAI. Strains of the same species will form monophyletic groups on the basis of core protein sequences.
+, presence; -, absent; v, variable. Nitrogen-related genes: narB, nirA, ntcA, glnB, amt, cynS, ureA, glnA, glsF, gdhA. Phosphorous-related genes: phoB, ptrA, phoR. Arsenate-related gene: arsC. Iron-related genes: piuC, som. a Data extracted from Berube et al. [14,71]
Our study further refines previous taxonomic proposals for the Prochlorococcus collective [18]. A recent novel bacterial taxonomy has also been proposed on the basis of phylogenomics [54,127]. This study proposes the split of Prochlorococcus into 5 distinct genera. GTDB-tk [128] is a quick and easy way to assess how our proposed taxonomy compares with what has been previously published. When we compare our proposal with the GTDB proposal, we have the following correspondence: Prochlorococcus = Prochlorococcus, Eurycolium = Prochlorococcus_A, Prolificoccus = Prochlorococcus_B, Thaumococcus = Prochlorococcus_C, and Riococcus = Prochlorococcus_D, suggesting an excellent agreement between these different studies. These studies provide insights into the evolutionary relationships among the different genera and species, as well as the in silico phenotypes of the PC.

Conclusions
The PC represents an excellent model for further theory development in ecology, evolution, phylogeny, and taxonomy. The PC refers to a group of picocyanobacteria that radiated from Synechococcus during millions of years of evolution in the oceans. The taxonogenomic analysis of 232 genomes revealed at least 5 new genera in addition to the original genus, Prochlorococcus [129]. As we delineated genera only for those groups with at least three available genome sequences, we anticipate that new genera will be included in our new proposal when more representative genomes are added to groups with only 1 or 2 genomes. On average, 20% of the genes of PC genomes have unknown functions and await further characterization. The pan-genome of the PC is estimated to include over 80,000 genes [14,32].
The PC has evolved in multiple ecologic niches in the global ocean and participates in relevant biogeochemical cycles, such as the recently postulated parasitic arsenic cycle between Eurycolium and Pelagibacter ubique [130,131]. We demonstrated here that the current PC is a diverse group of genera that are clearly distinguishable by eco-genomics. Ecological studies on the PC may need to consider the diversity of genera and species when novel ecologic and evolutionary hypotheses are tested.