Macroevolutionary constraints on global microbial diversity

Biologists have long sought to quantify the number of species on Earth. Often missing from these efforts is the contribution of microorganisms. Despite recent large-scale sampling efforts, estimates of global microbial diversity span many orders of magnitude. To reconcile this uncertainty, it is important to consider how speciation and extinction over the last four billion years constrain inventories of biodiversity. We parameterized macroevolutionary and mass-extinction event models to determine how diversification limits present-day microbial diversity. We find that while 106-107 taxa is most probable, much larger values (≥1012) are feasible. Allowing for mass extinction events does not greatly alter these conclusions. Along with empirical predictions, our models provide support for a massive global-scale microbiome while shedding light on the upper limits of life on Earth.

while 10 6 -10 7 taxa is most probable, much larger values (³10 12 ) are feasible. Allowing for mass 22 extinction events does not greatly alter these conclusions. Along with empirical predictions, our models provide support for a massive global-scale microbiome while shedding light on the upper 24 limits of life on Earth.

INTRODUCTION 30
Global biodiversity is impossible to completely census, given the large number of individuals on the planet across a diverse range of habitats (Whitman et al. 1998). Various 32 approaches have been used instead to approximate biodiversity, including diversity estimation based on partial censuses (Costello et al. 2012;Louca et al. 2019), ratios of taxonomic groupings 34 (Mora et al. 2011), and many other macroecological and biogeographical methods (May 1988).
The total number of species on the planet, when focusing on multicellular life, has been estimated 36 to be ~10 6 -10 8 species (Costello et al. 2012;Mora et al. 2011;Thompson et al. 2017). Despite their ubiquity, microorganisms have historically been overlooked in studies that attempt to 38 estimate global biodiversity (Larsen et al. 2017). This oversight was largely due to technological reasons, as there were no comprehensive methods to systematically describe microbial diversity 40 (Woese 1987).
With the advent of high-throughput DNA amplicon sequencing, large-scale quantification 42 of microbial diversity became possible, yet the various approaches used to predict the total number of microbial species have generated highly divergent estimates (Larsen et al. 2017;Locey & 44 Lennon 2016;Louca et al. 2019). Bacteria and archaea can be clustered into operational taxonomic units (OTUs) according to their similarity in the 16S rRNA gene, with the similarity cutoff often 46 being set to 97% (Stackebrandt & Goebel 1994). While this clustering approach is often considered a conservative measure of bacterial and archaeal species (Eren et al. 2015;Poretsky et al. 2014), 48 the 16S rRNA has served as a powerful tool, allowing microbiologists to survey the diversity of bacteria and archaea in a range of ecosystems across the planet (Thompson et al. 2017). Using the 50 massive amount of data collected, several studies have attempted to quantify global bacterial and archaeal species diversity (S). One approach using collector's curves estimated that microbes may only add 10 6 species to the global estimates of plant and animal diversity (Louca et al. 2019), which means they would not fundamentally change current inventories, which are ~10 7 at most 54 (Mora et al. 2011). Another approach using the average number of unique bacterial species per host species estimated S to be approximately 10 9 (Larsen et al. 2017;Wiens 2021). Last, a 56 combination of scaling laws and biodiversity theory predicts that there are 10 12 or more microbial taxa on Earth (Lennon & Locey 2020;Locey & Lennon 2016), including potentially 10 9 in 58 activated sludge systems alone (Wu et al. 2019). This ongoing debate is unresolved, as these predictions and estimates are difficult to directly test, but it may be possible to deem some values 60 of present-day diversity as more likely than others.
These estimations and predictions should be considered in the context of constraints on 62 present microbial diversity. The abundance of microorganisms at a global scale (N) has approached a steady state of 10 29 −10 30 individuals (Kallmeyer et al. 2012;Whitman et al. 1998), and given 64 that global taxon richness S cannot exceed the number of total individuals N, !"#$#%& ≤ 10 30 is a hard upper constraint on microbial richness. However, there may also be a soft upper constraint of 66 10 22 -10 23 due to neutral drift, assuming (1) a constant 10 30 bacterial and archaeal individuals and (2) a neutral drift rate of 4 -5×10 -9 substitutions per site per generation (Louca et al. 2019). A hard 68 lower constraint of !"#$#%& ≥10 6 is also in place due to the number of reported 97% 16S rRNA OTUs (Schloss et al. 2016). Therefore, it is reasonable to surmise that the present number of 70 bacterial and archaeal taxa !"#$#%& is between 10 6 and 10 23 .
Within this range, diversity is further constrained by macroevolutionary processes 72 occurring over geological time scales. Speciation and extinction rates of lineages, the difference of which is the net diversification rate, should directly influence total present microbial diversity diversity. The simplest diversification models are birth-death processes, which assume constant 76 and universal speciation and extinction rates (Raup 1985), but more complicated models should address realistic variation in these rates, such as clade-specific diversification rates (Moran et al. 78 1995;Scholl & Wiens 2016). In macro-organisms, well-documented mass extinction events are another such way the assumption of constant diversification is not upheld (Raup & Sepkoski 1982;80 Rohde & Muller 2005). These include the "Big Five" mass extinction events that eliminated 50-90% of marine invertebrate genera (Raup & Sepkoski 1982), as well as the Great Oxidation Event 82 (GOE) (Gumsley et al. 2017), which likely caused the mass extinction of many lineages (Hodgskiss et al. 2019). Each of these mass extinction events may have reduced microbial 84 diversity, thus constraining contemporary microbial richness, as a large portion of bacterial diversity is likely host-associated, (Hernández-Hernández et al. 2021;Thompson et al. 2017;Xie 86 et al. 2005). The same factors causing the mass extinction of macro-organisms may also have elevated free-living microbial extinction (Newby et al. 2021), though the diversity of host-88 associated taxa should have been most greatly reduced. Models accounting for these phenomena may minimize uncertainty about the number of modern-day microbial taxa. 90 To understand how macroevolutionary rates constrain present-day species diversity, unbiased estimates of speciation and extinction are necessary. Any existing estimates of microbial 92 diversification are derived from phylogenetic data (Louca et al. 2018;Scholl & Wiens 2016). Due to the nearly non-existent microbial fossil record, these phylogenies are constructed solely from 94 molecular data, which may lead to incorrect rate estimation when diversification rates vary among lineages (Rabosky 2010;Stadler 2009). These phylogenies can also be generated by highly 96 dissimilar birth-death processes that have divergent speciation and extinction dynamics (Louca & Pennell 2020). Such methods also require providing estimates for total microbial richness and the number of unsampled taxa to calculate diversification rates (Louca et al. 2018), which would run counter to the aim of using diversification rates to constrain present-day microbial richness. 100 Therefore, diversification rate estimates that do not estimate unsampled taxa and that are not derived from molecular data alone are necessary to understand macroevolutionary constraints on 102 species richness.
In this study, we seek to understand how speciation and extinction rates put additional 104 constraints on present-day microbial diversity. To do so, we estimated speciation rates without phylogenetic inference to avoid the biases discussed above. With a simple model of diversification, 106 we show the probability of various levels of present-day diversity. We then modify this model to account for mass extinction events to explore their potential effects on global diversity. 108

Rate estimation
To consider bacterial and archaeal species, we must first consider our species definition. 112 We phylogenetically defined a species as a cluster of strains with 97% 16S rRNA sequence similarity. While the 16S rRNA has limitations differentiating between certain species (Poretsky 114 et al. 2014), its broad conservation across bacteria and archaea and relatively slow rate of evolution make it a convenient biomarker for considering global bacterial and archaeal diversity (Woese 116 1987). From this perspective, speciation occurs as the accumulation of 16S rRNA substitutions causes a focal sequence and an ancestral sequence to diverge by at least 3%. Thus, speciation rate 118 can be calculated using 16S rRNA nucleotide substitution rates (K16S) as follows: 120 λ = '() +,-./0 * 2 !"# 3% * '() +,-./0 (Eq.1).

122
In Eq. 1, the numerator represents the total number of substitutions a 16S sequence undergoes over a million years, and the denominator represents the total number substitutions necessary for a 3% 124 divergence in sequence, which is a speciation event by our OTU species definition.
To calculate λ values, we used a range of K16S values (0.025-0.091% divergence/nt/My) 126 based on the divergence of endosymbiotic bacteria in preserved and dated insects (Kuo & Ochman 2009). These K16S values were calculated by calibrating the bacterial phylogenies with the age of 128 their insect hosts, which possess a tractable fossil record (Moran et al. 1993). In this way, the ages of internal nodes of the bacterial phylogeny were mapped to corresponding ages in the insect 130 phylogeny. These ages and the divergence between two bacterial 16S rRNA sequences were then used to directly calculate K16S. Using these K16S values (Kuo & Ochman 2009), we calculated 132 speciation rates of 0.0083-0.030 My -1 . However, because substitution rates of endosymbiont bacteria are potentially twice that of their free-living relatives (Moran et al. 1995), we also 134 considered speciation rates 50% smaller than the minimum endosymbiont-based speciation rate, producing a final range of 0.004-0.03 My -1 . As an analogous technique cannot be used to estimate 136 extinction rates (µ), we used values of relative extinction rates e, the ratio of extinction to speciation (µ/λ), between 0 and 1 to account for various extinction scenarios. 138

Expectations of birth-death process 140
The process of lineage diversification is often modeled as a stochastic birth-death process (Magallón & Sanderson 2001;Nee et al. 1994;Raup 1985), where speciation and extinction events 142 are analogous to births and deaths of individuals, respectively. In diversification scenarios where present-day S ≥ 10 12 taxa, the simulation of a stochastic birth-death process that stores times of birth and death events becomes computationally intractable. Due to this limitation, we first analyzed the expectations of birth-death processes E [St] with constant speciation and extinction 146 rates, which can be simply described by exponential growth when assuming the initial number of species to be 1: 148 To compare the amount of diversity across various levels of λ and e, we manipulated Eq. 2 into 152 the following:

156
We plotted several contours of e(λ) with various levels of E [St] with t = 4000 My, a reasonable estimate of the time passed since the last universal common ancestor (LUCA) (Weiss et al. 2018). 158 To calculate the probability of various ranges of present-day diversity, we calculated the area between contours via integration and normalized by the total area of feasible parameter space (10 6 160 ≤ E[St] < 10 23 ).

Mass extinction events
To understand the potential influence of mass extinction and its ability to constrain present-164 day microbial diversity, we considered the following mass extinction events: the Great Oxidation Cretaceous (K-T, 66 Mya) (Gumsley et al. 2017;Raup & Sepkoski 1982). Our model of mass 168 extinction uses the expression for the expectations of a birth-death process (Eq. 2) and adds additional terms accounting for mass extinction events. We consider two new parameters 170 influencing extinction beyond constant extinction rate µ: the intensity of mass extinction (p) and the proportion of taxa potentially affected by mass extinction (q). For each mass extinction event, 172 there is a single reduction in the total number of species according to the magnitudes of p and q.
To obtain the number of species at time t, we multiply the birth-death expectations (Eq. 2) by the 174 proportion of taxa surviving each mass extinction event (1 -pq) for as many mass extinction events occurring by time t, where t is some nonnegative integer: 176 Let M be the set of the six timesteps where mass extinction occurred, and MH be the set of 180 timesteps with host-associated mass extinction. Mass extinction intensity p(i) is equal to some value p during a mass extinction event and 0 for all other timesteps: 182 if ∈ 0 otherwise (Eq. 5). 184 We consider situations where p is 0.0, 0.5, or 0.9 to model situations without mass extinction, with 186 moderate mass extinction, and with intense mass extinction, respectively. Likewise, the proportion of taxa vulnerable to mass extinction q(i) at timestep i is set to some value q during a host-188 associated mass extinction event and 1 for all other timesteps: 190 ( ) = 6 if i ∈ B 1 otherwise (Eq. 6).

192
Therefore, the proportion of species removed due to mass extinction is pq during host-associated mass extinction, p during non-host-associated mass extinction, and 0 for all other timesteps. When (Eq. 8).

204
To obtain informed estimates of q, we assumed that host-associated bacteria and archaea would be more likely to go extinct from macro-organismal mass extinction than would free-living 206 microbes. In place of the unmeasurable ancient proportion of host-associated species, we calculated the present-day proportion of host-associated microbial species, as well as obligately 208 host-associated and preferentially host-associated proportions, using species and site data from the Earth Microbiome Project (EMP) (Thompson et al. 2017). We defined obligately host-associated 210 taxa as OTUs only sampled from hosts and preferentially host-associated taxa as those found in hosts for over 50% of their total occurrences.

Expectations of birth-death process 214
To evaluate how diversification parameters constrain present-day microbial species richness, we expressed relative extinction rate (e) as a function of speciation rate (λ) at various 216 contours of [ & ] (Eq. 3) and with t = 4000 My (present-day) within the bounds of the speciation rate range described above (Fig. 1). Expected diversity increases as extinction decreases and 218 speciation rise, though the relationship between e and λ is non-linear (Eq. 5). Approximately 50% of the combinations of and lead to infeasibly low (<10 6 ) or high (>10 23 ) diversity (Table S1). 220 Certain high levels of diversity require to be sufficiently low and λ to be sufficiently large (Fig.   1A, B). For instance, 10 12 species are only possible for e < 0.78 and λ > 0.007 sp./My. However, 222 there are much less strict limitations on e or λ for reaching 10 6 species. In terms of amount of feasible parameter space, lower diversity outcomes are somewhat more likely than high diversity 224 outcomes (Fig. 1C). For instance, the probability of 10 6 -10 7 species is ~9.0%, while the probability of 10 12 -10 13 species is ~6.2%. A reason for the decrease in probability for higher levels of diversity 226 is that it is impossible to reach them at relatively low values of λ (Fig. 1B). However, once λ reaches ~0.013 sp./My, any further increase in λ does not alter the relative probabilities of each 228 outcome (Fig. 1B). However, each of these ranges of diversity is well within these constraints and far from extreme outcomes, and no outcome is far more probable than the others. With this simple 230 analysis, we show that vast diversity is indeed possible within the specified speciation constraints.

Mass extinction events
Our simulations show that the intensity of mass extinction events determines their effect 234 on present microbial biodiversity (Fig. 2). Let us first consider scenarios where mass extinction affects all microbes equally (q = 1) and strongly (p = 0.9; Fig. 2A, B). Compared to scenarios 236 without mass extinction, much higher speciation rates and lower extinction rates are required to reach equivalent levels of species diversity (Fig. 1A, 2B). For example, net diversification 238 parameters resulting in 10 12 species when p = 0 lead to ~10 6 species when p = 0.9 ( Fig. 2A).
Additionally, the proportion of total feasible parameter space decreases from 50.6% in the birth-240 death expectation model to 36.5% in the mass extinction model (Table S1). The proportion of parameter space leading to >10 6 species increases to 50.2% from 26.7%, as well (Table S1). 242 However, the relative probabilities of each diversity outcome within the feasible parameter space are primarily unchanged (Fig. S1). Setting p = 0.9 is comparable to the degree of extinction in 244 macro-organisms during the Permian-Triassic, the most severe mass extinction event (Sepkoski 1990). If these events lead to even a 50% diversity reduction, present-day diversity still is 246 decreased, though the effect is diminished compared to the 90% scenario. However, it is clear that severe mass extinction can greatly change the outcomes for individual parameter combinations. 248 To obtain an informed estimate of the proportion of taxa vulnerable to mass extinction (q), we used species and site data from the Earth Microbiome Project (EMP) to calculate the present-250 day proportion of host-associated 16S OTUs. We found that ~90% of EMP bacterial OTUs were free-living to some degree, meaning that only ~10% of EMP OTUs were obligately host-associated 252 (Table 2). However, about half of all OTUs were host-associated to some degree. Therefore, if mass extinction events of macro-organisms only resulted in the extinction of host-associated 254 organisms, only between 10% and 50% of microbial taxa would be vulnerable to mass extinction, depending on if all host-associated species, only obligately host-associated species, or some 256 mixture is considered.
When we consider scenarios when only a fraction of microbial species is affected by mass 258 extinction, the model begins to converge to the expectations of the birth-death processes (Fig. 2C,   D). With q = 0.1, which corresponds to only 10% of microbial lineages being vulnerable to mass 260 extinction, mass extinction has a much more muted effect. The same net diversification parameters described above resulting in 10 12 species when p = 0 now leads to >10 10 species at p = 0.9. The 262 feasible parameter space also converges to that of the birth-death expectations as well (Table S1, Fig. 1A, 2D). We also modeled scenarios where only certain groupings of lineages were vulnerable 264 to mass extinction and identified scenarios where this allows for present-day diversity on the order of the birth-death expectations (Fig. S2), illustrating another scenario where adding biological 266 detail can reduce the effect of mass extinction. Therefore, while extreme mass extinction scenarios may constrain microbial diversity, more conservative scenarios suggest that mass extinction may 268 have only moderately decreased present-day diversity.

DISCUSSION
In this study, we modeled how macroevolutionary rates influence present-day microbial 272 species diversity in an attempt to further constrain the estimates and predictions from previous studies ranging from 10 6 to 10 12 species (Larsen et al. 2017;Locey & Lennon 2016;Louca et al. 274 2019;Wiens 2021). The values suggested in these studies all can be generated from feasible combinations of macroevolutionary rates (Fig. 1,4). In fact, our results introduce the possibility 276 that bacterial and archaeal diversity may outstrip the largest predictions (Lennon & Locey 2020;Locey & Lennon 2016). Given the diversification parameters and the model we used, 10 6 species 278 is more likely than 10 9 , both of which are more likely than 10 12 (Fig. 2,3). However, while our study finds that it is most likely that total present-day diversity is not orders of magnitude larger than current inventories, it does not deem any previously made prediction or estimate vanishingly unlikely or even improbable (Larsen et al. 2017;Lennon & Locey 2020;Locey & Lennon 2016;282 Louca et al. 2019;Wiens 2021).
The simple approach we use in this study is not without its caveats and assumptions. This 284 study only uses substitution rate data from obligately host associated taxa, which may not be representative of the overall rate of molecular evolution of free-living microbes (Espejo & Plaza 286 2018;Moran et al. 1995). While a more representative sample of substitution rates from free-living lineages or taxa with multiple 16S rRNA copies may improve upon this study, such data are scarce. 288 Thus, our analysis provides a basis for feasible levels of microbial diversity with backing from the fossil record and including speciation values that account for the differences between free-living 290 and obligate endosymbiont substitution rates (Moran et al. 1995). Additionally, we did not attempt to directly estimate extinction rates, as microbial extinction cannot reasonably be estimated apart 292 from phylogenetic approaches relying on a priori assumptions of microbial richness (Louca et al. 2018), which given the objectives of our study would introduce circular reasoning. If an unbiased 294 method for estimating relative extinction was found, then it could be used to further constrain the diversification parameter space. Our analysis also assumes no biogeographical or niche association 296 with diversification (Li & Wiens 2022). It is quite likely that clades associated with certain environments have higher diversification rates than those of other environments (Rabosky 2020). 298 While clade-specific diversification rates were modeled here, a more thorough modeling process including diversification dynamics of specific bacterial lineages may provide more insight into 300 global diversification.
Our simulations of mass extinction events showed that while severe mass extinction can 302 constrain present-day diversity, there are many scenarios that result in little change compared to our model without mass extinction. This convergence to birth-death expectations occurs as the 304 proportion of lineages affected by mass extinction decreases. In fact, the true proportion of bacteria affected by host mass extinction may have been smaller than the proportion of obligately host-306 associated taxa depending on the host range of the microbial lineages. For instance, if one microbial taxon is present in several host taxa, extinction is unlikely if only one host taxon becomes 308 extinct. However, the publicly available 16S rRNA databases do not typically contain information regarding whether OTUs were found in a narrow or broad range of host taxa, only the general 310 source of each sample. It is also possible for plant and animal mass extinction to affect more than just host-associated microbes if higher-order effects of extinction of macro-organisms had 312 downstream effects on free-living microbes, thus increasing the possible percentage of microbial taxa vulnerable to mass extinction. However, explicitly modeling such effects here is unnecessary, 314 as the outcomes with high q will simply converge to our first mass extinction scenario with q = 1.0. 316 Our mass extinction model contains other assumptions and caveats as well. To simplify the model, we implemented mass extinction as a one-time reduction in diversity per event. These 318 events might be more realistically modeled as occurring over the span of several million years. We implemented each of the "Big Five" mass extinction events as equal in extinction magnitude, but 320 some of these events had larger effects on host diversity than others (Raup & Sepkoski 1982), which likely would have scaled onto microbial extinction. Additionally, there may have been other 322 microbe-specific mass extinction events besides the GOE that could have had a profound impact on diversity. Our models also do not take into consideration increases in diversification via 324 adaptive radiation following mass extinction (Stroud & Losos 2016). Despite these caveats, our models provide a foundation for how losing large proportions of diversity several times may have 326 altered present-day diversity by examining extreme scenarios.
Our study finds vast diversity beyond 10 12 species is indeed possible and only marginally 328 less likely than lower levels of diversity. While this analysis suggests the globe is most likely to contain fewer than 10 8 microbial species, our approach cannot make a precise prediction on 330 microbial diversity, nor can it rule out the predictions and estimates made by previous studies (Larsen et al. 2017;Lennon & Locey 2020;Locey & Lennon 2016;Louca et al. 2019;Wiens 332 2021). The simple models described here use speciation rates calculated from endosymbiotic bacterial 16S rRNA substitution rates, which do not have the inherent bias of requiring estimates 334 of unsampled taxa. These models provide a new angle with which to address the question of global microbial diversity. New approaches will be necessary to confront the lack of consensus in the 336 field as we seek to reconcile the estimations and results put forth, such as methods going beyond 16S rRNA-based species definitions and embracing the ecological and functional differences 338 among micro-organisms (Arevalo et al. 2019). Such approaches may reveal levels of diversity greater than any suggested by 16S rRNA approaches. 340 342 TABLES  450   452  Table 1. Defining key model parameters used in birth-death process and mass extinction models.

Variable/Parameter Description
Global species diversity (St) The total number of 97% 16S rRNA bacterial and archaeal OTUs present on the planet at time t.
Speciation rate (λ) The number of species an extant species generates per million years (My -1 ).
Extinction rate (µ) Number of species extinctions per extant species per million years (My -1 ).
Relative extinction rate (e) Ratio of extinction rate to speciation rate (µ / ).

Mass extinction intensity (p)
The proportion of vulnerable species removed at a certain timestep from a mass extinction event.
Vulnerable proportion of taxa (q) The proportion of species that are vulnerable to mass extinction.

Mass extinction events (M)
The set of timesteps where mass extinction occurs.
Host-associated mass extinction events (MH) The set of timesteps where mass extinction of hosts occurs.

3) with values of E[S4000]
between 10 6 and 10 23 species. (B) The probability of diversity outcome bins across speciation rate. 478 The probability is calculated as the proportion of values leading to a diversity outcome (e.g., 10 6 -10 9 species) at a given value of . (C) The overall probability of diversity outcomes spanning one 480 order of magnitude. This probability was calculated by creating contours of e(λ) (Eq. 3) with E[S4000] set from 10 6 to 10 23 and calculating the area between each contour and normalizing by the 482 total area of the feasible parameter space. speciation (0.004 ≤ λ ≤ 0.03) and relative extinction rates (0 ≤ ≤ 1) with mass extinction intensity 494 p = 0.9. Each labeled region (same coloration scheme as Fig. 1A) is separated by contours of ( )