Empirical abundance distributions are more uneven than expected given their statistical baseline

Exploring and accounting for the emergent properties of ecosystems as complex systems is a promising horizon in the search for general processes to explain common ecological patterns. For example, the ubiquitous hollow-curve form of the species abundance distribution is frequently assumed to reflect ecological processes structuring communities, but can also emerge as a statistical phenomenon from the mathematical definition of an abundance distribution. Although the hollow curve may be a statistical artefact, ecological processes may induce subtle deviations between empirical species abundance distributions and their statistically most probable forms. These deviations may reflect biological processes operating on top of mathematical constraints and provide new avenues for advancing ecological theory. Examining ∼22,000 communities, we found that empirical SADs are highly uneven and dominated by rare species compared to their statistical baselines. Efforts to detect deviations may be less informative in small communities – those with few species or individuals – because these communities have poorly-resolved statistical baselines. The uneven nature of many empirical SADs demonstrates a path forward for leveraging complexity to understand ecological processes governing the distribution of abundance, while the issues posed by small communities illustrate the limitations of using this approach to study ecological patterns in small samples.

Introduction feasible set. The feasible set therefore allows us to define statistical baselines for assessing deviations 122 between observed SADs and what is likely to occur due to mathematical constraints (Locey and White 123 2013). 124 The feasible set can also be used to explore how the characteristics of the statistical baseline, and the 125 presence and nature of any deviations that occur, vary over ranges of values for S and N. Although most 126 feasible sets are dominated by the hollow-curve shape, variation in S, N, and the ratio of N to S modulate 127 the detailed attributes of the SADs in a feasible set (Locey and White 2013). For example, if the ratio of N 128 to S is close to 1, all possible SADs are mathematically constrained to be fairly even (Locey and White 129 2013). Although an SAD that is very even would be highly unusual in most cases, it would be expected in 130 this situation. The feasible set therefore allows us to appropriately calibrate our expectations for what 131 types of observations would be surprising for an SAD given the specific constraints imposed by its S and N. 132 Additionally, accounting for variation in the specificity, or vagueness, of the expectations derived from the 133 statistical baseline may be critically important for disentangling the aspects of the SAD that can be 134 attributed to statistical constraints from those that result from other processes. If the vast majority of 135 mathematically possible SADs are similar in shape -generating a very specific, narrowly defined statistical 136 baseline -then even small deviations between an observed SAD and this baseline can signal the 137 operation of ecological processes. However, if many different shapes occur with more even frequency in 138 the feasible set, the statistical baseline is less specific and less well defined, and our sensitivity for 139 distinguishing biological signal from statistical constraints is greatly reduced. This is more likely to occur 140 when the size of the community, in terms of S and N, is small, because in such cases the feasible set may 141 be too small for a particular shape to emerge as the most common shape. These statistical baselines with 142 broad distributions may therefore impede our ability to assess whether observed deviations are 143 ecologically generated or expected to emerge randomly (Jaynes 1957). This general concern has been 144 acknowledged in efforts to compare ecological observations to statistical baselines (Harte 2011, White et al. 2012, Locey and White 2013) but there has not yet been a quantification of these effects for the SAD 146 or an identification of the range of community sizes most strongly affected. Because ecologists study the 147 SAD for communities varying in size from the very small -S and N < 5 -to the enormous -S and N >> 148 1000 -identifying the community sizes for which we can and cannot confidently detect deviations from 149 the statistical baseline is necessary to appropriately contextualize our interpretations. 150 Here we use the feasible set to define statistical baselines for empirical SADs for 22,000 communities of 151 birds, mammals, trees, and miscellaneous other taxa. We then compare observed SADs to their 152 corresponding statistical baselines and evaluate 1) if the shapes of observed SADs consistently deviate 153 from their statistical baseline, 2) how the characteristics and specificity of the statistical baseline vary 154 over ranges of S and N, and 3) whether this variation appears to be associated with variation in our 155 capacity to detect deviations between observations and the corresponding baselines. 156

Methods 157
Data and code for all of our analyses can be accessed at www.github.com/diazrenata/scadsanalysis. 158

Datasets 159
We used a compilation of community abundance data for trees, birds, mammals, and miscellaneous expectation of the SAD is computationally intractable for very large communities, we filtered our datasets 167 to remove 4 communities that had more than 40714 individuals, which was the largest community we have fewer than 10 species. Rather than analyze all these small communities, we randomly selected 170 10,000 small communities to include in the analysis. We also included all FIA communities with more than 171 10 species, which added 10,355 FIA communities to the analysis and resulted in a total of 20,355 FIA 172 communities. Finally, for sites that had repeated sampling over time, we followed White et al. (2012) and 173 Baldridge (2016) and analyzed only a single, randomly selected, year of data, because samples taken from 174 a single community at different time points are likely to covary. It should be noted that our analyses 175 include data from the Mammal Community Database and Miscellaneous Abundance Database that were 176 collected over longer timescales and cannot be disaggregated into finer units of time. Our final dataset 177 consisted of ~22,000 communities with S and N ranging from 2 to 250 and 4 to 40714, respectively 178 ( Figure 1). Details and code for the filtering process can be found in Appendix S1 in Supporting 179

Information. 180
Accounting for empirical sampling error 181 Because it is logistically impossible to exhaustively census all individuals present in most empirical 182 systems, SADs derived from field sampling will inevitably be subject to some degree of sampling error 183 (Bonar et al. 2011). Therefore, in addition to analyzing the raw SADs in our database, we employed two 184 resampling schemes to test if, and how, different forms of observation error affect our results. 185 First, we explored the possibility that empirical sampling systematically undercounts the true number of 186 rare species in a community (Preston 1948;Gotelli and Colwell 2011). Rare species are more likely to 187 escape detection during sampling, leading to an underestimate of both the total species richness of a 188 community and the proportion of species in the rare tail of the SAD (Preston 1948). We used a procedure 189 based on species richness estimators to adjust for this possibility (see also Ulrich et al. 2010 for the use of 190 richness estimators to distinguish between completely and incompletely censused communities). We took the mean of the two results. This yields a generous estimate of the true number of species in the 195 system. If this estimate exceeded the observed species richness, we added the missing species each with 196 abundance 1. These adjusted SADs allowed us to explore the consequences of undersampling rare 197 species while making the smallest possible changes to S and N. 198 Second, we tested the sensitivity of our results to sampling variability across all species in the SAD -not 199 just rare species -using subsampling. For each observed community, we constructed subsamples by 200 randomly drawing 60% of the observed number of individuals from the total pool of individuals in the 201 community, without regard to species and without replacement. The precise proportion of individuals 202 drawn in each subsample should not dramatically affect the qualitative outcome. We selected 60% so as 203 to introduce appreciable room for sampling error between the raw and subsampled SADs, but to produce 204 subsampled SADs with N (and presumably S) in a comparable size range to the raw ones. Extremely small 205 subsamples (e.g. 10%) could introduce complications related to small N and S that could obscure the 206 effects of sampling error, while very large subsamples (e.g. 90%) could recapture the raw distributions 207 too closely to be informative. We generated 10 resampled communities for each observed community. 208 We ran our computational pipeline using all raw SADs and all SADs adjusted for undersampling of rare 209 species. Because the subsampling approach increased computational effort approximately tenfold, we 210 analyzed all subsampled communities for the Mammal Community, Miscellaneous Abundance, and 211 if they differ in the number of species that have a particular abundance (Locey and White, 2013). 219 Operationally, this means that for S = 3 and N = 9, the SADs (1, 3, 5) and (2, 2, 5) count as distinct 220 partitions, but (1, 3, 5) and (3, 1, 5) do not, because they each contain one species with an abundance 1, 221 3, and 5, respectively, and differ only in the order of the numbers. In the absence of justification for 222 additional assumptions regarding the distinguishability of species and/or individuals, we adopted this 223 simple set of assumptions that has previously been shown to generate realistic statistical baselines (Locey 224 and White 2013). partitions must correspond to a partition of the feasible set with S = 2 and N = 6 (but with a species of 249 abundance equal to 1 removed), and (b) 1 of the 4 partitions must correspond to a partition of the 250 feasible set with S = 3 and N = 4 (but with 1 individual removed from each species). Thus, we can 251 determine the probability that a partition drawn at random from the feasible set for S = 3 and N = 4 is in 252 case (a) -probability ¾ -or case (b) -probability ¼. To generate a partition in case (a), we sample a 253 partition for S = 2 and N = 6 and then add a species with abundance equal to 1; for case (b), we sample a 254 partition for S = 3 and N = 4 and then add 1 individual to each species. In this way, we use the recurrence 255 relation to transform the problem of sampling from a large feasible set into the problem of sampling from 256 a smaller, different feasible set. This procedure continues until a partition is uniquely determined, after 257 which some back-transformation yields a unique partition for the feasible set of interest. A detailed 258 description of the algorithm we use, based on a slightly different recurrence relation, is available in 259 Appendix S2 and is implemented in the R package feasiblesads available on GitHub at 260 www.github.com/diazrenata/feasiblesads. 261 For every community in our database, we drew 4000 samples from the feasible set to characterize the 262 distribution of statistically probable shapes for the SAD. We filtered the 4000 samples to unique 263 elements. For small values of S and N, it can be impossible or highly improbable for the 4000 samples 264 from the feasible set to all be unique, but for large communities, all 4000 are usually unique. We refer to 265 this as the sampled feasible set. 266

Comparing observed SADs to their statistical baselines 267
We compared SADs to their statistical baselines using several metrics, including a general measure of 268 dissimilarity, as well as skewness, Simpson's evenness, Shannon's index, and the proportion of rare 269 species (species with abundance = 1). These metrics represent just a few of the vast array of possible 270 summary metrics to describe the shape of the SAD, each of which emphasize different aspects of the 271 distribution. In this first effort to compare empirical distributions to a statistical baseline, we selected a 272 suite of complementary metrics and explored whether our overall results were consistent between 273 metrics. By calculating these metrics for each the community's sampled feasible set (see Generating the 274 statistical baseline, above), we generated a portfolio of measures describing the shapes expected from 275 randomly sampled SADs. 276 First, as a general characterization of whether observed SADs have rare or common shapes relative to 277 their feasible sets, we computed a dissimilarity score comparing SADs to the central tendencies of their 278 feasible sets (following Locey and White, 2013). We defined the degree of dissimilarity between two SADs 279 with the same S and N as the proportion of individuals allocated to species with different abundances 280 between the two SADs, calculated as: 281 where n1i is the abundance at rank i for one SAD and n2i is the abundance at rank i for the other SAD. This 283 value ranges from 0 to 1, with 1 being high dissimilarity. To find the central tendency of a given sampled 284 feasible set, we identified the sampled SAD with the lowest mean dissimilarity compared to the rest of 285 the SADs in the feasible set. We calculated the dissimilarity between every sample drawn from the 286 feasible set and a random set of 500 other samples, using a subset of samples for comparisons because it 287 is computationally impractical to make all pairwise comparisons between large numbers of samples. To 288 assess whether an observed SAD was highly dissimilar to its central tendency, we calculated the degree of 289 dissimilarity between the central tendency of the corresponding feasible set and all other samples from 290 that feasible set, and between the central tendency and the observed SAD. Although the dissimilarity 291 score is scaled from 0 to 1, the distributions of dissimilarity scores for samples from the feasible set can 292 vary over broad ranges in S and N. We therefore used the percentile rank of the observed dissimilarity 293 scores, relative to the distribution of dissimilarity scores from the corresponding sampled feasible sets, to 294 quantify how likely or unlikely observed dissimilarity scores are across the range of S and N in our 295 datasets. For a single community, an observed percentile score of 95 indicates that there is a 5% chance 296 of drawing a value greater than the observed value from the distribution of values from the sampled 297 feasible set. Aggregating across communities, if observed SADs reflect random draws from their feasible 298 sets, their percentile rank values should be uniformly distributed from 0 to 100. However, if observed 299 SADs are consistently more dissimilar to their feasible sets that expected at random, the percentile values 300 will be disproportionately concentrated at high values. We used a one-tailed 95 confidence interval and 301 tested whether the percentile values for the dissimilarity scores of observed SADs fell above 95 more 302 than 5% of the time. We note that it is impossible for an observation fall above the 95 th percentile if there 303 are fewer than 20 values in the sampled distribution. We therefore excluded from this analysis 304 communities with fewer than 20 unique SADs in their feasible sets, yielding a total of 22,490 305 communities. Finally, note that, if the observed dissimilarity scores for individual communities are not systematically higher than the distributions of dissimilarity scores from the corresponding feasible sets, 307 increasing the number of communities in the analysis will not increase the frequency of extreme 308 percentile scores. 309 While the degree of dissimilarity between SADs and the central tendency of the feasible set provides an 310 overall sense of deviations among possible SADs, it does not describe how observed SADs may differ from 311 their feasible set. We therefore used a set of more targeted, ecologically interpretable metrics to explore 312 how observed SADs compare to their feasible sets in their shape and proportion of rare species. We 313 examined three metrics for the shape of the SAD -skewness, Simpson's evenness (1-D), and Shannon's 314 index. Skewness measures the asymmetry of a distribution around its mean. The Simpson and Shannon 315 indices are commonly used metrics for assessing how equitably abundance is distributed across species 316 (Maurer and McGill 2011). We also calculated the proportion of rare species (species with abundance = 1) 317 in each SAD, because the proportion of rare species in a community is comparable across different 318 community sizes and is of special interest to ecologists. 319 As with the degree of dissimilarity score, to assess whether the shape of an observed SAD was statistically 320 unlikely, we used percentile ranks to compare the observed values of the summary metrics to the 321 distributions of values for those metrics obtained from each community's sampled feasible set. The actual 322 ranges and values of summary metrics vary widely over the ranges of S and N in our data and thus cannot 323 directly compared, but percentile ranks are comparable across different community sizes and allow 324 assessment across our entire dataset. We used two-tailed 95% intervals to test whether observed 325 communities' percentile values for each metric were disproportionately concentrated below 2.5 or above 326 97.5. In all cases, in testing for unusually high percentile scores, we defined the percentile score as the 327 proportion of values in the sampled distribution strictly less than the observed value, while in testing for 328 low values, we defined it as the proportion of sampled values less than or equal to the observed value. distribution. Because it is impossible for an observed percentile score to be above or below the 97.5 th or 331 2.5 th percentile if there are fewer than 40 values in the sample distribution, we excluded from these 332 analyses communities with fewer than 40 SADs in their feasible sets. Finally, note that skewness, as 333 implemented in the R package "e1071" (Meyer et al. 2019), always evaluates to 0 for distributions with 334 only two species, and we therefore excluded those cases from analyses of skewness. Our final analysis 335 included 21,395 communities for skewness and 21,403 communities for all other shape metrics. 336 The narrowness of the expectation 337 We also used the distributions of dissimilarity scores and shape metrics to quantify the relative specificity 338 of the statistical baseline, in order to assess when there could be challenges in determining whether 339 observed communities differ from their statistical baselines. For an overall sense of how tightly elements 340 of the feasible set were clustered around its central tendency, we calculated the mean dissimilarity score 341 between all samples from a feasible set and the central tendency of that feasible set. For the shape 342 metrics, we calculated a breadth index defined as the ratio of the range of values encompassed within a 343 two-sided 95% density interval relative to the full range of values in the distribution (Figure 2). This 344 breadth index for the statistical baseline ranges from 0 (a very narrow distribution and well-resolved 345 baseline) to 1 (a very broad distribution), and is comparable across feasible sets for varying combinations 346 of S and N. These approaches correspond qualitatively to more computationally-intensive approaches to 347 measuring the self-similarity of the elements of feasible sets (see Appendix S3). We explored how the 348 narrowness of the statistical baseline varies with the size of the feasible set and the ratio of N to S. 349

Results 350
Comparing observed SADs to their statistical baselines 351 For four of the five datasets we analyzed -BBS, Gentry, Mammal Communities, and Misc. Abund -352 observed SADs are more dissimilar to their statistical baselines than would be expected by chance ( Figure than are 95% of samples from the corresponding feasible sets (Table 1). If observed SADs reflected 355 random draws from the feasible set, we would expect only 5% to be that dissimilar. These highly unlikely 356 SADs have dissimilarity scores from 1.5 to 9.7 times greater than the mean dissimilarity between the 357 central tendency and samples from the feasible set, an absolute increase ranging from .04 to .6 on a scale 358 from 0-1 ( Figure S4). These datasets also contain highly unlikely observed SADs in terms of their shape 359 metrics. At random, roughly 2.5% of observed percentile scores for these metrics should be very high 360 (>97.5) or very low (<2.5). Compared to their feasible sets, these four datasets contain a disproportionate 361 number of communities with very low values for Simpson's evenness and Shannon diversity, and very 362 high skewness, relative to their feasible sets (Table 1)

. The Mammal Community and Miscellaneous 363
Abundance databases also have high proportions of rare species, but this tendency is weaker for BBS and 364 nonexistent for Gentry -in fact, the Gentry dataset has a high representation of sites with low 365 proportions of rare species (20% of sites; Table S5). The Gentry dataset also has a disproportionate 366 number of communities with the opposite tendencies to the other datasets for the other shape metrics-367 i.e., an overrepresentation of communities with high Simpson's evenness and Shannon diversity, and low 368

skewness. 369
In contrast to the other datasets, percentile scores for sites from the FIA dataset are more uniformly 370 distributed, and the proportions of extreme values are closer to what would be expected by chance 371 (Figure 3, Table 1). Only 7% of FIA communities are highly dissimilar to their feasible sets (compared to a 372 random expectation of 5%). Among the shape metrics, only 2.7% (compared to 2.5% at random) of sites 373 have high values for skewness, 1.3% have high proportions of rare species, 5.7% have low Simpson's 374 evenness, and 5.4% have low Shannon diversity. 375 The narrowness of the expectation the feasible set. Overall, as the size of the feasible set increases, the SADs in a feasible set become more 378 narrowly clustered around the central tendency of that feasible set, and the sampled distributions for 379 shape metrics generally become less variable (Figure 4). In small communities, the breadth indices are In most cases, and most pronouncedly for the Breeding Bird Survey, Mammal Community, and 416 Miscellaneous Abundance databases, our results suggest that the prevailing processes cause abundance 417 distributions to be highly uneven, rather than those that produce more even abundances across species. 418 For these communities, observed SADs tended to be unusually skewed and uneven, and to have a high 419 proportion of rare species, compared to their feasible sets. Accounting for undersampling of rare species 420 strengthened these effects, while subsampling weakened them. Perhaps unsurprisingly, the effect of 421 these two resampling approaches was especially noticeable for the proportion of rare species; enriching 422 the SAD directly adds rare species, while subsampling is likely to drop rare species even if it otherwise recaptures the general shape of a distribution. The long tail of rare species in the SAD has been a 424 consistent focus in SAD research, and our results highlight that the rare tails of observed SADs are 425 extraordinary, even among the hollow-curve shapes that dominate the feasible set. Ecological processes 426 may lengthen the rare tail and decrease the evenness of the SAD, for example by promoting the 427 persistence of rare species at very low abundances (Yenni et al. 2012). Or, they could drive abundant 428 species to have larger populations than would be statistically expected, without also driving other species 429 entirely to extinction (Chesson 2000). 430 While the Gentry database also exhibits deviations tending towards high unevenness, an even greater 431 proportion of its communities are more even, and have a lower proportion of rare species, than would be 432 SADs for these communities also have high proportions of rare species, taking the statistical baseline into 441 account would suggest that the extraordinary thing about these SADs is that they do not have even more 442 rare species. Simultaneously, there may be biological reasons why the species-rich but relatively low-443 abundance tropical tree communities of the Gentry database differ from those in other datasets. The 444 same mechanisms that promote high diversity may manifest in high evenness, and/or ecological features 445 particular to these forests may produce unusual shapes for the SAD. Because no communities from our 446 other datasets are comparable in S and N, we cannot disentangle these statistical and biological explanations. This is an excellent opportunity to develop additional theoretical and empirical approaches 448 to predict and explain variation in the deviations between SADs and their feasible sets, in particular for 449

species-rich communities across ecosystems. 450
Unlike the other four datasets, communities in the FIA dataset showed weak or no evidence of deviations 451 from their feasible sets. We entertained two general classes of explanation for why the FIA dataset differs 452 from the others in our analysis: first, that biological attributes of the FIA communities cause the SADs for 453 these communities to differ from the others in our database, and second, that statistical phenomena 454 related to S and N may modulate the capacity to detect deviations for these communities. To distinguish 455 between possible biological drivers causing FIA to differ from the other datasets, and factors intrinsic to S 456 and N, we compared a subset of ~300 FIA communities to communities from other datasets with directly 457 matching S and N. We did not find differences in the distribution of percentile scores for any metrics 458 between communities from FIA and communities from other datasets, confirmed via Kolmogorov-459 Smirnov tests (Appendix A9). Although 300 communities constitute a small sample relative to the 20,355 460 FIA communities we analyzed, these results point to statistical phenomena, and not biological attributes 461 unique to FIA, as the likely explanation for the differences. To meaningfully draw inferences using deviations in these small communities, we will need more 493 sensitive metrics than those used here, and/or theories that generate more specific predictions for the individuals is important -might reduce the representation of long-tailed, highly uneven SADs within the 507 feasible set, and make the rare tail observed for real SADs appear more unlikely than it does here. Under 508 our assumptions, the SADs (1,2,3,4) and (1, 1, 1, 7) each count as only one unique SAD. Taking species 509 order into account would mean that (1,2,3,4) would count as 24 (4!) unique SADs, because there are 4! 510 ways to assign the abundances to each species. However, an SAD containing species with equal 511 abundances, such as (1, 1, 1, 7), would only count as 4 unique SADs. For SADs, equal abundances are 512 likely most prevalent among rare species. If this is true, then this set of assumptions would generate 513 feasible sets where rare-tailed SADs are relatively scarce, making observed SADs with rare tails seem even 514 more extraordinary. Additional formulations for the statistical baseline exist, including those that 515 approximate exponential, Poisson, or log-series distributions in the limit (Harte et al. 2008, Favretti 2018). 516 Investigating and comparing the results that emerge from different baselines will be an important next 517 step towards reinvigorating the use of the SAD as a diagnostic tool. entropy and the feasible set are promising horizons for macroecology, the small size of some ecological 522 communities may present difficulties that are rare in the domains for which these tools were originally 523 developed (Jaynes 1957, Haegeman andLoreau 2008). When the observed numbers of species and 524 individuals are too small to generate highly resolved statistical baselines, these approaches will be less 525 informative than we might hope -as appears to be the case for the smallest communities in our analysis. 526 In larger communities, where mathematical constraints have more resolved effects on the form of the 527 SAD, our results show that these constraints alone do not fully account for the extremely uneven SADs we 528 often observe in nature -leaving an important role for ecological processes. This ability to detect and 529 diagnose the specific ways in which empirical SADs deviate from randomness can generate new avenues 530 for understanding how and when biological drivers affect the SAD. There are, of course, still many 531 elements to be improved in our ability to distinguish biological signal from randomness, including 532 assessing alternative statistical baselines and calibrating our expected power to detect deviations, 533 especially for small communities. Indeed, more sensitive metrics could also enable identification of 534 processes that operate through time. Continuing to explore and account for the interplay between 535 statistical constraint and biological process constitutes a promising and profound new approach to our 536 understanding of this familiar, yet surprisingly mysterious, ecological pattern.   state space due to differences in their sampling intensity, design, and underlying biology 662 (e.g. productivity, regional richness, taxonomic group, or other factors that influence the capacity of a 663 community to support large numbers of species and individuals). In particular, note that the FIA dataset 664 comprises very small communities, and communities from the Gentry dataset are extreme in both their 665 high species richness and low average abundance. 666   Gentry dataset also contains communities with small feasible sets, these communities also have a very 701 low ratio of N to S, meaning their entire feasible sets may be constrained to be more self-similar than 702 small feasible sets in general (see Dissimilarity to central te∂åndency). There is, however, substantial 703 additional variation in the dissimilarity and breadth indices not accounted for by the size of the feasible 704 set or the ratio of N to S. 705