Estimating the maximal growth rates of eukaryotic microbes from cultures and metagenomes via codon usage patterns

Microbial eukaryotes are ubiquitous in the environment and play important roles in key ecosystem processes, including accounting for a significant portion of global primary production. Yet, our tools for assessing the functional capabilities of eukaryotic microbes in the environment are quite limited because many microbes have yet to be grown in culture. Maximum growth rate is a fundamental parameter of microbial lifestyle that reveals important information about an organism’s functional role in a community. We developed and validated a genomic estimator of maximum growth rate for eukaryotic microbes, enabling the assessment of growth potential for both cultivated and yet-to-be-cultivated organisms. We produced a database of over 700 growth predictions from genomes, transcriptomes, and metagenome-assembled genomes, and found that closely related and/or functionally similar organisms tended to have similar maximal growth rates. By comparing the maximal growth rates of existing culture collections with environmentally-derived genomes we found that, unlike for prokaryotes, culture collections of microbial eukaryotes are only minimally biased in terms of growth potential. We then extended our tool to make community-wide estimates of growth potential from over 500 marine metagenomes, mapping growth potential across the global oceans. We found that prokaryotic and eukaryotic communities have highly correlated growth potentials near the ocean surface, but that this relationship disappears deeper in the water column. This suggests that fast growing eukaryotes and prokaryotes thrive under similar conditions at the ocean surface, but that there is a decoupling of these communities as resources become scarce deeper in the water column.

Among marine primary producers alone, protists account for a third of total biomass [5].
Marine systems account for about half of all global primary production [6], so that the abundance of protists in these systems suggests an important overall role for protists in regulating global carbon cycles [7], among other biogeochemical cycles.And yet, our tools for studying the ecology and evolution of eukaryotic microbes are still quite limited, at least in comparison to their prokaryotic neighbors [8].
Several recent developments have greatly advanced our ability to survey the ecology of microbial eukaryotes directly from the environment using metagenomics.Large-scale efforts to augment the sizes of our existing genomic and transcriptomic databases, specifically the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP; [9]), have expanded our ability to use database-dependent approaches for metagenomic analysis for both taxonomic and functional classification (e.g., [10][11][12][13]).
At the same time, novel approaches for binning and validation have been applied by multiple groups to reconstruct high-quality metagenome-assembled genomes (MAGs) from environmental datasets [14][15][16].
With these new environmentally-derived genomes come new challenges -specifically that of inferring features of an organism's physiology and ecology from its genome sequence, a persistent challenge in metagenomics [17][18][19].One trait of particular interest is the maximal growth rate of an organism, a fundamental parameter of microbial lifestyle that can tell us a great deal about an organism's ecology [20][21][22].Among microbial eukaryotes, minimal doubling times range over two orders of magnitude, from hours (e.g., [23,24]) to days (e.g., [25]), and potentially even weeks (e.g., [26,27]).
Temperature sets a well-studied upper-bound on the maximal growth rates of microbial eukaryotes (see work on the Eppley Curve, e.g., [28][29][30][31][32][33]), but there is a great deal of variation among species below this threshold.For prokaryotes, genomic signals of translation optimization can be leveraged in order to predict the maximal growth rates of an organism [20,22,34].Here, we show that the same signals, specifically the codon usage bias of highly expressed genes, can be used to estimate the growth potential of eukaryotic microbes directly from their genome sequences.We compiled a database of 178 species of eukaryotic microbes with recorded growth rates in culture and either publicly available genomes or transcriptomes.We then used this database to build a genomic predictor of growth potential for eukaryotic microbes.We applied this tool to a set of 465 MAGs and 517 metagenomes to derive ecological insights about the variation of eukaryotic growth potential across organisms and environments.

Predicting maximal growth rates of eukaryotic microbes
We compiled an initial dataset of maximal growth rates and optimal growth temperatures recorded in culture for 178 species with either genomic or transcriptomic information publicly available (S1 Fig, S1 Table).A sizeable portion of this dataset corresponded to marine eukaryotic microbes, with 101 entries corresponding to organisms in the MMETSP, though eukaryotic microbes from other environments were also represented, including important human pathogens (e.g., Giardia intestinalis, Entamoeba histolytica, Leshmania spp., etc.).In general, eukaryotic microbes with genomes in GenBank, which tended to be human associated, had faster maximal growth October 15, 2021 2/17 rates than the marine eukaryotic microbes found in MMETSP (S1 Fig) .This pattern is similar to that found in prokaryotes, where human-associated bacteria and archaea typically had much faster growth rates than those found in marine systems [20].
One of the most reliable signals of optimization for rapid growth in prokaryotic genomes is high codon usage bias (CUB) in highly expressed genes [22].The degeneracy of the genetic code means that multiple codons may code for the same amino acid, but not all organisms use alternative codons at equal frequencies.In fact, many organisms, both prokaryotic and eukaryotic, are biased in their usage of alternative codons.The codon usage patterns of genes are thought to be optimized to the relative abundance of tRNAs within the cell, and this optimization is particularly apparent among highly-expressed genes in fast-growing species [35][36][37][38][39][40][41][42].The basic intuition here is that CUB is a result of optimization of genes for rapid translation, which in turn facilitates rapid growth.We wanted to see whether such patterns could be leveraged to predict the growth rates of eukaryotic microbes [43].For each genome or transcriptome, we calculated the CUB of a set of highly expressed genes relative to the expected CUB calculated from all other coding sequence in that genome or transcriptome (details of these calculations can be found in Materials and Methods; [20,39,44]).Because ribosomal proteins are expected to have high expression across species and many physiological conditions [20,22], we take these as our set of highly expressed genes for all analyses (see Methods and S2 Table ).We found a significant negative relationship between CUB of highly expressed genes calculated in this way and the minimum doubling time of an organism (Pearson's correlation with log-transformed doubling times, ρ = −0.400,p = 3.14 × 10 −8 ).Thus we confirmed that high CUB is a signal of growth rate optimization among microbial eukaryotes.
We then built a linear model relating CUB of highly expressed genes and optimal growth temperature to the doubling times of eukaryotic microbes (Fig 1a).We found that such a model could explain about a third of variation in doubling time among organisms (linear regression, r 2 = 0.328), and that both CUB (p = 1.16 × 10 −7 ) and optimal growth temperature (p = 1.89 × 10 −10 ) were significant predictors in the model.Interestingly, similar to what we have previously reported for prokaryotes [20], we found that for eukaroytic microbes the relationship between CUB and doubling time saturated at a threshold doubling time, after which CUB no longer changed with increasing doubling time (Fig 1b).In our earlier work on prokaryotes, we took the presence of this threshold as evidence that for slow-growing organisms who experienced little selection for optimized codon usage, drift would overwhelm the evolutionary process [20].Thus prokaryotes could be divided into two distinct evolutionary regimes related to their growth strategies.It appears that a similar dynamic may be at work among eukaryotes, although the relatively small number of sequenced eukaryotic microbes with minimal doubling times greater than 40 hours makes this hard to assess.If such a threshold does exist for eukaryotes, it is at a much higher doubling time than the one seen in prokaryotes (40 hours versus 5 hours respectively; [20]), likely due to constraints on eukaryotic growth related to cell size and complexity.In any case, similar to the threshold effect seen for prokaryotes, we note that above this threshold, predicted maximum growth rates are likely to be overestimated.Thus, our model can only reliably predict minimum doubling times up to 40 hours, after which we can only infer that a microbe grows "very slowly".
Finally, we asked if our eukaryotic model of growth improved predictions for eukaryotic organisms relative to the predictions made by previous tools developed for prokaryotes.Consider that tools able to predict growth rate from CUB already exist, though they have been trained exclusively on prokaryotic organisms [20,22].We applied these prediction tools to our eukaryotic dataset and found that they systematically overestimated the growth rates of eukaryotic organisms, often by more than an order of October 15, 2021 3/17 magnitude, leading to quite poor performance on eukaryotes (Fig 1c-d).Thus our eukaryote-specific model is an important development, as prokaryote-specific models cannot accurately predict eukaryotic growth rates.We have incorporated this eukaryote-specific model into the open-source and user-friendly growth prediction R package gRodon, which we previously developed to predict prokaryotic growth rates (https://github.com/jlw-ecoevo/gRodon;[20]).

Environmentally derived genomes reveal biases in culture collections and ecological patterns
We obtained a large set of 1669 eukaryotic MAGs assembled and binned from the Tara Oceans metagenomes by two separate groups [15,16].Of these, we were able to predict the growth rates of 465 MAGs in which we found at least 10 ribosomal proteins (see Materials and Methods for details).These MAGs were uniformly slow-growing, with an average minimum doubling time of approximately one day, and none with a minimum doubling time less than 10 hours long (Fig 2a).These MAGs provide a baseline expectation of the maximal growth rates of eukaryotic microbes in marine environments, and while the reconstruction of MAGs from the environment is not a wholly unbiased process, we expect these MAGs to be more representative of the distribution of organisms living in the environment than what we find within our culture collections [20].In fact, we found that MAGs were estimated to have only slightly longer doubling times than cultured organisms in MMETSP (27.5 vs 24.5 hours respectively; t-test, p = 9.25 × 10 −4 ; Fig 2a).The differences between these two datasets were most apparent when looking at the tails of the distributions of growth rates, where the MMETSP had a long tail of fast-growing organisms that was absent among the MAGs (Fig 2a,b).Altogether the data suggest that our culture databases of eukaryotes do a relatively good job at capturing an accurate distribution of growth rates among organisms, though they are slightly enriched for fast growing organisms that are rare in the environment.This result is in stark contrast to the pattern seen among marine prokaryotic organisms where culture collections were shown to be systematically biased towards fast-growers [20].
Within the set of MAGs several patterns were apparent.First, while organisms classified as phototrophic and heterotrophic had largely overlapping growth rate distributions (Fig 2c), heterotrophs tended to grow faster than phototrophs (t-test, p = 1.58 × 10 −3 ; trophic classification on the basis of the presence of metabolic pathways in a MAG, taken from Alexander et al. [15]).This reflects previous findings that at higher temperatures heterotrophic marine eukaryotic microbes had faster growth rates than phototrophic ones, though phototrophs outgrew heterotrophs at lower temperatures because their growth rates decreased less dramatically with decreases in temperature [33].
Just as growth rates varied among functional groups, they also systematically varied among taxonomic groups (Fig 2d).Overall, marine fungi had the fastest average estimated growth rates.MAGs belonging to the Chlorophyta also seemed to be relatively fast growing, with a somewhat narrow range of growth rates clustered around a doubling time of about a day.By contrast Dinoflagellata, Haptophyta, and to some degree Ochrophyta all had a considerable number of very slow growing representatives (minimal doubling time > 40 hours), though these groups had very broad distributions of growth rates and included many faster growing members as well.The diversity of growth rates in these groups is perhaps not surprising, as the cell sizes of diatom and dinoflagellate species vary over two orders of magnitude, indicating a wide diversity of morphologies and environmental niches [45,46].Overall, the distribution of maximal growth rates varied across taxonomic groups, likely a product of both specialization for October 15, 2021 4/17 different niches and historical contingency.
In accordance with the variation of growth potential across taxonomic groups, we found that closely related organisms had more similar maximal growth rates than distantly related organisms (S2 Fig) .That is, the absolute difference in doubling times between two organisms increased as a function of the patristic distance (distance from tip-to-tip on a phylogeny) between the two organisms, though this relationship had little explanatory value (linear regression, p = 2 × 10 −16 , β = 0.26, r 2 = 0.01).In any case, below a threshold distance of 0.1 substitutions per site maximum growth rates could be extrapolated between relatives with an average absolute error under six hours (S2 Fig) , which may be taken as an acceptable level of error for a general guess at overall lifestyle.This means that for 18S rRNA amplicon sequence variants (ASVs) where a genus-level relative has a genome or transcriptome available, rough estimates of growth potential may be inferred, similar to prokaryotes [20].
Predicting the growth potential of prokaryotic and eukaryotic communities from metagenomes It is often difficult to reconstruct high-quality MAGs for many organisms, both prokaryotic and eukaryotic, from the environment.Even when we cannot easily obtain MAGs representative of the entire microbial community in a particular environment, it is possible to apply CUB-based predictors to a metagenome to estimate the average growth potential of a community [22].The prokaryotic growth predictor previously implemented in the gRodon package allowed the user to predict the median community-wide maximal growth rate of the prokaryotic community [20].Our eukaryotic model can be similarly applied to calculate the median maximal growth rate of the eukaryotic community represented in a metagenomic sample.To demonstrate this application, we acquired assemblies of 610 globally-distributed marine metagenomic samples from the BioGEOTRACES dataset [47].This dataset is particularly useful for our purposes because samples were not size-fractionated, allowing both prokaryotic and eukaryotic communities to be assessed simultaneously.We sorted these metagenomes into prokaryotic and eukaryotic contigs and applied prokaryotic gRodon (using "metagenome" mode) to the prokaryotic sequences and eukaryotic gRodon (using "eukaryote" mode) to the eukaryotic sequences.Because our eukaryotic model only uses one measure of codon usage bias applied on a gene-by-gene basis, it is similar to "metagenome" mode from prokaryotic gRodon (v1.0.0) as well as growthpred, and can be applied as-is directly to mixed-species metagenomic data [20,22].See Materials and Methods for details of this analysis.
Overall, we were able to predict the average community-wide maximal growth rates of the prokaryotic and eukaryotic communities in 517 samples with at least 10 ribosomal proteins each that could be classified as eukaryotic or prokaryotic (Fig 3 It is perhaps not surprising that the growth potentials of eukaryotic and prokaryotic communities would be correlated, since conditions favorable to more copiotrophic lifestyles (e.g., high nutrients) should be similar across both prokaryotes and eukaryotes.The observed decoupling of the growth potential of eukaryotic and prokaryotic communities with depth is consistent with a model of high productivity at the surface linked through particle sinking to productivity at deeper depths.We found that eukaryotic growth potential at depth (> 100 meters) was correlated with eukaryotic, but not prokaryotic, growth potential at the surface (< 100 meters), suggesting that eukaryotic productivity at the surface is a primary driver of community composition at deeper depths (S6 Fig), likely in part due to the relationship between cell size and sinking rate [48,49].The decoupling of eukaryotic and prokaryotic growth potential with depth is also reflected in the increasing heterotrophy of the eukaryotic community as depth increases.Leveraging the MAGs discussed in the last section which had been previously mapped to a globally distributed set of marine metagenomes [15,50], we found that MAGs that were predicted to be phototrophic dropped off in abundance relative to heterotrophic MAGs after 100 meters (S7 Fig).

Conclusions
We developed and validated a new tool to estimate the growth potential of eukaryotic microbes directly from genomic and transcriptomic sequences.Using this tool, we were able to predict the maximal growth rates of a large set of uncultured marine organisms directly from reconstructed MAGs.We found distinct patterns in growth potential across functional and taxonomic groups and assessed existing culture collections for functional bias.We then applied our tool to a large set of marine metagenomes to predict the community-wide growth potential of eukaryotes along large ocean transects.We found a clear positive relationship between eukaryotic and prokaryotic growth potential at the ocean surface, suggesting that fast growing organisms from multiple domains of life thrive under similar conditions, and the same for slow growing organisms.With an increasing number of environmental metagenomes published each year, for many environments it will now be possible to build high-resolution maps of microbial growth potential across domains, yielding insights into the drivers of microbial community structure and function.
Our tool demonstrates the clear utility of genomic and metagenomic trait estimators for eukaryotic microbes.Yet, when working with eukaryotic microbes there are relatively few bioinformatic resources both in terms of methods and databases.Moving forward, as the complexity and subtlety of our bioinformatic tool-set increases, eukaryotic microbes represent a new frontier for methods development and ecological investigations with molecular data (e.g., [11,12,[14][15][16]).

Materials and Methods
The code to generate all figure and analysis in the paper can be found at https://github.com/jlw-ecoevo/eeggo.The new gRodon v2 R package with the eukaryotic growth rate model implemented can be found at https://github.com/jlw-ecoevo/gRodon2.All visualizations were made using the ggplot2 [51] and ggpubr [52]  Training Data From a list of protist species with transcriptomes in MMETSP and/or genomes in GenBank, we searched for recorded growth rates in culture, alongside the temperatures at which these rates were recorded, from the scientific literature (S1 Fig, S1 Table).If multiple rates were reported in culture, we always took the fastest rate we were able to find.We converted between doubling time and specific growth rate using the equation: growth rate .When growth rate was reported in terms of divisions per day we instead used the conversion equation: doubling time = 1 growth rate .A considerable number of growth rates recorded for marine organisms came from previous database compilation efforts by others [33,53].
All MMETSP transcriptome assemblies with annotated coding sequence were obtained from https://zenodo.org/record/3247846[54].Species that had a eukaryotic prey species listed under experimental conditions were excluded from further analysis to reduce the possibility of contamination.For each species in MMETSP, we selected the single largest transcriptome assembly.We then removed any potential cross-contamination between species by identifying possible contaminants using CroCo v1.1 [55] with a threshold of 97% identity and otherwise default parameters (--suspect-id 97), removing any transcripts listed as "suspect".We additionally classified transcripts using kraken2 v2.1.1 [13,56] with the 'nt' database and default parameters, and removed any transcripts classified as viruses or prokaryotes in order to remove any potential contaminants.
For assemblies from GenBank with growth rates listed in our dataset we first ran EukMetaSanity v1.0.0 [11], which incorporates repeat prediction [57], reference protein selection [12,58], and ab-initio gene predictions to determine putative gene loci.The output GeneMark-EP/ES predictions were used for analysis [59,60].We then classified coding sequences using kraken2 v2.1.1 [13,56] with the 'nt' database and default parameters, and removed any transcripts classified as viruses or prokaryotes in order to remove any potential contaminants.

Fitting the Model
For each transcriptome in our dataset we used the annotations provided [54] (generated using dammit [61]) to locate coding sequence corresponding to ribosomal proteins.For each genome in our dataset we searched among translated coding sequences for ribosomal proteins using blastp v2.10.1 [62] against a custom blast database of ribosomal proteins of eukaryotic microbes drawn from the Ribosomal Protein Gene Database (all genes coding for ribosomal proteins available from Dictyostellium discoideum, Giardia lamblia, Phaeodactylum tricornutum, Plasmodium falciparum, Thalassiosira pseudonana, and Toxoplasma gondii ; S2 Table; [63]).In all downstream analyses we omitted any genomes or transcriptomes with fewer than 10 ribosomal proteins detected [20,22].
For each coding sequence corresponding to a ribosomal protein in each genome or transcriptome we calculated the MILC statistic of codon usage bias [39] using the coRdon R package [44], the same as done for prokaryotic gRodon [20].This statistic is both GC-content and length corrected and should be insensitive to both factors.For these calculations the expected codon usage was taken as the genome-wide average (across all coding sequences in a genome or transcriptome; [20]).As recommended in the coRdon documentation, in order to get a reliable estimate of codon bias we removed all genes with fewer than 80 codons.We then calculated the median bias across all genes coding for ribosomal proteins for each genome or transcriptome.
We then fit a linear model (lm() function from the base R stats package [64]) to Box-Cox transformed doubling times (with the optimal λ chosen using the boxcox() function from the MASS package [65]) using (1) optimal growth temperature, and ( October 15, 2021 7/17 the median codon usage bias of genes coding for ribosomal proteins (see above) as predictors.We then implemented this model into the existing gRodon package for prokaryotic growth rate prediction, expanding the package's predictive range to eukaryotic organisms (using the new mode=''eukaryotes'' setting; https://github.com/jlw-ecoevo/gRodon2).

Estimating Growth Rates from MAGs
We obtained a set of 1669 eukaryotic MAGs assembled from the Tara Oceans metagenomic surveys by two groups [15,16].Previously, EukMetaSanity had been run on these MAGs to call genes and find coding sequences [11].We used these annotations, and (similar to above) searched for ribosomal proteins using blastp v2.10.1 [62] against a custom blast database of ribosomal proteins of eukaryotic microbes drawn from the Ribosomal Protein Gene Database [63].We ran EUKulele v1.0.6 to classify these MAGs taxonomically and omitted any organisms identified as Metazoa from downstream analyses [10].Division-level classifications were taken as the division assigned as most likely by eukulele.After removing any MAGs with less than 10 ribosomal proteins detected or that were classified as Metazoa, we were left with a total of 465 eukaryotic MAGs.
To infer the optimal growth temperatures of each MAG we used distributional data across the Tara Oceans metagenomes.For MAGs from Delmont et al. [16], optimal temperatures had already been predicted by the authors as part of a machine-learning pipeline implemented to discover each MAGs niche.For the Alexander et al. [15] MAGs we took a simpler approach.For each MAG we took the top 1% of samples in terms of MAG relative abundance and calculated the mean temperature recorded for those samples (S8 Figure ).For closely related MAGs found in both Delmont et al. [16] and Alexander et al. [15], we found that the two methods for estimating optimal growth temperature agreed well (S9 Figure ).
Ribosomal proteins were identified from prokka annotations.We then predicted the average community-wide maximal growth rate of the prokaryotic community using gRodon v2.0.0 in metagenome mode using the temperature metadata provided with the samples [20].
In order to call genes from eukaryotic contigs we ran MetaEuk v4.a0f584d (using setting easy-predict; [12]).Similar to our procedure with individual genomes, we searched for ribosomal proteins among translated proteins output from MetaEuk using blastp v2.10.1 [62] against a custom blast database of ribosomal proteins of eukaryotic microbes drawn from the Ribosomal Protein Gene Database [63].We then predicted the October 15, 2021 8/17 average community-wide maximal growth rate of the eukaryotic community using gRodon v2.0.0 in eukaryote mode using the temperature metadata provided with the samples.
We visualized our phylogeny using R package ggtree [73] and calculated patristic distance using R package ape [74].
; S3 Fig).The correlation between the growth potentials of prokaryotic and eukaryotic communities at the ocean surface was striking (Pearson correlation of samples from < 100 meters, ρ = 0.566, p = 1.08 × 10 −27 ; Fig 3a), though this relationship disappeared among deeper samples (Pearson correlation of samples from > 100 meters, ρ = −0.101,p = 0.146; Fig 3b).A linear model confirmed a significant interaction between depth and the relationship between eukaryotic and prokaryotic growth rates (linear regression of prokaryotic growth rates, β eukaryotes = 0.0980, p eukaryotes = 3.05 × 10 −7 , β depth = −0.0106,p depth = 5.72 × 10 −9 , β eukaryotes:depth = 1.43 × 10 −4 , p eukaryotes:depth = 1.74 × 10 −5 ).Notably, this was not simply an effect of temperature in the model, as the CUB of eukaryotic and prokaryotic communities co-varied across samples (S4 Fig).Additionally, these patterns cannot be attributed to differences in coverage.While doubling time did decrease with the relative abundance of eukaryotic October 15, 2021 5/17 contigs in a sample, as would be expected, samples with a lower proportion of eukaryotes were not particularly skewed in their estimated growth rates (S5 Fig).

Fig 1 .Fig 2 .Fig 3 .
Fig 1. Codon usage bias (CUB) and temperature predict the maximum growth rates of microbial eukaryotes.(a) Predictions from a linear model of minimum doubling time with CUB and temperature as predictors on our training set generally reflect empirically observed doubling times (r 2 = 0.328).(b) The relationship between CUB and minimum doubling time is roughly linear and negative until approximately 40 hours, after which the relationship levels off (ρ = −0.400,p = 3.14 × 10 −8 ).(c) Predictions of the maximum growth rates of microbial eukaryotes on the basis of CUB and temperature using models trained on prokaryotes are systematically biased towards faster growth predictions and (d) perform much worse than a model trained directly on eukaryotes in terms of mean squared error (MSE).Dashed vertical lines denotes 40 hours and dashed diagonal line denotes where x = y.
R packages.