Phylogenetic community structure metrics and null models: a review with new methods and software

Competitive exclusion and habitat filtering are believed to have an important influence on the assembly of ecological communities, but ecologists and evolutionary biologists have not reached a consensus on how to quantify patterns that would reveal the action of these processes. No fewer than 22 phylogenetic community structure metrics and nine null models can be combined, providing 198 approaches to test for such patterns. Choosing statistically appropriate approaches is currently a daunting task. First, given random community assembly, we assessed similarities among metrics and among null models in their behavior across communities varying in species richness. Second, we developed spatially explicit, individual-based simulations where communities were assembled either at random, by competitive exclusion or by habitat filtering. Third, we quantified the performance (type I and II error rates) of all 198 approaches against each of the three assembly processes. Many metrics and null models are functionally equivalent, more than halving the number of unique approaches. Moreover, an even smaller subset of metric and null model combinations is suitable for testing community assembly patterns. Metrics like mean pairwise phylogenetic distance and phylogenetic diversity were better able to detect simulated community assembly patterns than metrics like phylogenetic abundance evenness. A null model that simulates regional dispersal pressure on the community of interest outperformed all others. We introduce a flexible new R package, metricTester, to facilitate robust analyses of method performance. The package is programmed in parallel to readily accommodate integration of new row-wise matrix calculations (metrics) and matrix-wise randomizations (null models) to generate expectations and quantify error rates of proposed methods.

Introduction

"…we may be studying an attribute about which we cannot be sure what measurements can actually represent it or even whether a hypothesized attribute actually exists" (Houle et al. 2011).

The idea that competition among species increases with relatedness goes back at least to Darwin (1859), who noted that more closely related species tend to be more ecologically similar and should therefore compete more intensely (reviewed in Cavender-Bares et al. 2009). Under this competition-relatedness hypothesis (Cahill et al. 2008), competitive exclusion is predicted to result in the co-occurrence of less closely related species than would be expected if communities were assembled entirely via stochastic processes such as speciation and dispersal (phylogenetic overdispersion; Elton 1946, Webb et al. 2002, Mayfield and Levine 2010). In contrast to competitive exclusion, habitat filtering is the process whereby only those species possessing similar traits are able to survive and reproduce within a given abiotic environment (Harper 1977, Keddy 1992). Thus, to the extent that such traits are evolutionarily conserved, habitat filtering results in local assemblages of species more closely related than expected by chance (phylogenetic clustering; Webb 2000, Cavender-Bares et al. 2009). Habitat filtering operates largely independently of individual interactions, whereas competitive exclusion occurs via either direct or indirect agonistic interactions among individuals of different species. Until the 1990s, few methods existed to test for patterns of relatedness within communities, and those available took a taxonomic rather than a phylogenetic approach (Elton 1946, Vane-Wright et al. 1991).
Miller, Farine & Trisos. Phylogenetic community structure methods

Over the past 25 years a large number of metrics have been developed to quantify phylogenetic patterns in community structure, by which one might infer the action of community assembly processes. However, misconceptions about the relationships of these metrics to each other and to species richness (reviewed in Box 1) have reduced their impact on our understanding of community assembly. The link between our theories of community assembly and our ostensible measures of it is tenuous, and the measures themselves are not well understood (Houle et al. 2011). Furthermore, while the metrics introduced by Webb (Webb 2000, Webb et al. 2002) have been most influential in community ecology, many other metrics have also received widespread use, and yet their mathematical properties and performance across different community assembly processes have not been comprehensively assessed. Recent reviews (Kraft et al. 2007, Kembel 2009, Vamosi et al. 2009, Vellend et al. 2011, Pearse et al. 2014) have addressed metric performance, but have evaluated only partially overlapping sets of metrics, often using different methods and classification schemes. Consequently, results cannot easily be compared among studies, making the selection of appropriate metrics for empirical research difficult.

Statistically evaluating the significance of an observed phylogenetic community structure metric requires a null expectation. Thus, since their introduction, phylogenetic community metrics have been linked to null models (Webb 2000), when in fact they are independent concepts. This conceptual link has led to the creation of redundant metrics and to frequent and continuing confusion in the literature (Box 1).
We suggest that practitioners should consider phylogenetic community structure methods as a set of possible metrics (e.g., row-wise calculations) and a set of possible null models (e.g., repeated matrix-wise randomizations), any of which can be combined to create a unique metric + null model approach. Thus, the metric value for a particular community and phylogeny is fixed, but the significance of that metric varies according to which null model is used (Connor and Simberloff 1979, Diamond and Gilpin 1982, Gotelli 2000). A good null model randomizes those structures in the observed data (e.g., individual co-occurrence patterns) relevant to the null hypothesis, and maintains structures in the dataset unrelated to the null hypothesis (e.g., species' abundance distributions) (Gotelli and Graves 1996). In practice, null model performance, specifically type I (false positive) and type II (false negative) error rates, and redundancy among null models are rarely tested (but see e.g., Gotelli 2000, Kembel 2009).

Here, we compare the performance of 22 phylogenetic community structure metrics (Table 1) and 9 null models (Table 2). We develop spatially explicit, individual-based simulations of community assembly due to habitat filtering, competitive exclusion or the random placement of individuals, and then compare the ability (type I and II error rates) of each metric + null model combination to identify the correct assembly process. We document cases of equivalency among metrics and null models. We also assess the response of both metrics and null models to variation in species richness. We conclude by discussing the implications of our findings for future tests of community assembly processes.
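The separation of metric from null model described above can be illustrated with a minimal sketch. It is not the metricTester implementation: the function names, the toy two-clade distance matrix and the equal-richness random draw used as the null are all assumptions made for illustration. The metric (here MPD) is computed once for the observed community, and its standardized effect size (SES) depends entirely on the null distribution supplied.

```python
import random
from itertools import combinations
from statistics import mean, pstdev

def mpd(species, dist):
    """Mean pairwise phylogenetic distance among the given species."""
    pairs = combinations(sorted(species), 2)
    return mean(dist[a][b] for a, b in pairs)

def ses(observed, pool, dist, metric=mpd, n_rand=999, seed=1):
    """SES of `metric` against random draws of equal richness from `pool`.

    The null used here (any species can occur anywhere) is an assumption
    chosen for simplicity; swapping in another null changes the SES but
    not the observed metric value.
    """
    rng = random.Random(seed)
    obs = metric(observed, dist)
    null = [metric(rng.sample(pool, len(observed)), dist) for _ in range(n_rand)]
    return (obs - mean(null)) / pstdev(null)

# Hypothetical toy phylogeny: two clades, distance 2 within and 10 between.
clade = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2, "F": 2}
pool = sorted(clade)
dist = {a: {b: 0 if a == b else (2 if clade[a] == clade[b] else 10)
            for b in pool} for a in pool}

clustered = ["A", "B", "C"]          # all members drawn from one clade
print(mpd(clustered, dist), ses(clustered, pool, dist))
```

A negative SES here corresponds to phylogenetic clustering; a positive one would correspond to overdispersion.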

Methods
We adopt the following terminology. The community is the spatial extent (i.e. study area) of interest. The quadrat is the sampling unit. For instance, twenty 1-ha forest plots in the Ecuadorian Amazon would be considered 20 quadrats of the rainforest community. We refer to the quadrat (row) by species (column) data matrix as the community data matrix (CDM).

Null model background

We tested the performance of nine null models (Table 2). While distinctions are often drawn between models that randomize phylogenetic tip labels and those that randomize the CDM (e.g., Hardy 2008), this distinction is false; all tip-shuffling null models can be performed by matrix shuffling (Table 2, Appendix S1).

Perhaps the simplest of the null models we tested is the richness model, which shuffles species occurrences or abundances randomly within quadrats (rows), thereby maintaining species richness (row totals) and, for abundance data, total abundance and the rank-abundance curve of each quadrat.

In contrast, the frequency null model shuffles species occurrences within species (columns) in the CDM, which maintains the occurrence frequency or total abundance of each species (column totals), but not quadrat species richness. Instead, per-quadrat randomized species richness values are distributed around the mean per-quadrat species richness in the observed CDM. Species-poor quadrats will therefore tend to be compared with randomized quadrats of higher species richness, a comparison that ignores the large variance expected of randomized metric scores derived from repeated small draws from the species pool (Efron 1979), and thus this model may exhibit elevated error rates. We refer to this null model as the "frequency by quadrat" null, to distinguish it from another model described below.
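The two shuffles just described can be sketched as follows. This is a minimal illustration on a small hypothetical CDM stored as a list of rows, with function names of our own choosing; it is not the code used in metricTester or picante.

```python
import random

def richness_null(cdm, rng):
    """Shuffle values within each quadrat (row).

    Row contents, and hence richness and rank-abundance curve, are kept;
    species (column) totals are not.
    """
    return [rng.sample(row, len(row)) for row in cdm]

def frequency_null(cdm, rng):
    """Shuffle values within each species (column).

    Column totals are kept; quadrat (row) richness is not.
    """
    shuffled_cols = [rng.sample(col, len(col)) for col in zip(*cdm)]
    return [list(row) for row in zip(*shuffled_cols)]

# Hypothetical CDM: 3 quadrats (rows) by 4 species (columns), abundances.
cdm = [[5, 0, 0, 2],
       [0, 3, 1, 0],
       [1, 1, 0, 0]]
rng = random.Random(42)
by_row = richness_null(cdm, rng)
by_col = frequency_null(cdm, rng)
```

Comparing row and column sums of `by_row` and `by_col` with those of `cdm` shows which structure each shuffle preserves.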
The independent swap null model was developed to reduce error rates by maintaining both species occurrence frequencies and quadrat species richness (row and column totals) (Gotelli 2000, Gotelli and Entsminger 2001). The trial swap (Miklós and Podani 2004) was subsequently introduced as a more efficient approach to maintain the same structures in the null model. We used 10^5 swaps for these algorithms (Fayle and Manica 2010). In addition, Miller et al. (2013, Appendix 3 of that paper) developed the "frequency by richness" null model which, like the frequency null, shuffles occurrences within species but then concatenates the randomized quadrats by their species richness values, thereby maintaining species occurrence frequencies and quadrat species richness.

Prior to the development of abundance-weighted metrics, few null models intentionally maintained features of abundance distributions. For example, a species might occur infrequently but in large numbers. Hardy (2008) introduced the 2x and 3x null models to maintain both species richness and occurrence frequency, as well as either the species- or quadrat-level structure of abundance data. The 2x maintains the total abundance and rank-abundance curve of each quadrat, but neither species' abundances nor the set of species-specific abundance distributions. In contrast, the 3x maintains species' abundances and the set of species-specific abundance distributions, but not the abundance distributions of each quadrat. No null model that we know of maintains species richness, species occurrence frequency, and both species-specific and quadrat-specific abundance distributions (it is likely not possible via matrix shuffling).

We developed (Appendix S1) and tested a model that approximates this behavior, which we call the regional null.
The regional null simulates dispersal of individuals into the local community from the regional pool, where local dynamics have no influence on [...]

[...] range of expected metric values, we explored how null model expectations changed with increasing numbers of randomizations of the CDM (Appendix S1). We did this by plotting the expected CI across the corresponding species richness while increasing the randomization of a given, initial CDM and phylogeny.

Individual-based spatial simulations of community assembly to assess the performance of metric + null combinations

The first two sets of analyses illustrated the general behavior of each metric and null model. In this third analysis, we assessed the ability of each metric + null model combination to detect a given assembly process. Because of the large number of steps in this analysis, we include a schematic to aid the following explanation (Appendix S3). Total computing time required to run these tests (>7 years) precluded systematic examination of sensitivity to simulation parameters, but results were very similar across preliminary exploration of parameter space (Appendix S4).

To generate test cases against which to assess each metric + null approach, we simulated three types of spatially explicit communities, intended to model random assembly and the extremes of habitat filtering and competitive exclusion. Each spatial simulation produced a 316 x 316 m (10 ha) community, and 1,009 such communities of each type were generated. We began by generating a phylogeny of 100 species using a pure-birth model (birth = 0.1) and a log-normal rank-abundance curve, and randomly assigned species abundances from this distribution. We expanded assigned abundances to create a vector of individuals with species identities. In the random assembly spatial simulation, these individuals were then randomly placed within the community.
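The random assembly step can be sketched as below. Only the arena size (316 x 316 m) and the number of species (100) come from the text; the log-normal parameters, the seeds and the function name are assumptions for illustration, and the sketch makes no attempt to match the stem densities used in the actual simulations.

```python
import random

def random_assembly(abundances, side=316.0, seed=0):
    """Expand species abundances into individuals and place them uniformly.

    Returns a list of (species, x, y) tuples inside a side x side arena.
    """
    rng = random.Random(seed)
    individuals = [sp for sp, n in abundances.items() for _ in range(n)]
    return [(sp, rng.uniform(0, side), rng.uniform(0, side))
            for sp in individuals]

# Assumed log-normal abundances (mu = 3, sigma = 1 are illustrative values).
draw = random.Random(1)
abundances = {f"sp{i}": max(1, round(draw.lognormvariate(3.0, 1.0)))
              for i in range(100)}
arena = random_assembly(abundances)
print(len(arena), "individuals placed")
```

Sampling the arena with quadrats then yields a CDM whose structure reflects random placement only.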

In habitat filtering simulations, we independently evolved two traits according to a Brownian motion evolutionary process (σ² = 0.1). These traits are meant to mimic two independently evolving environmental preferences, e.g., soil moisture and pH. In our case, we treated these as spatial preferences (i.e. x and y-axis preferences), and scaled the simulated traits to match community bounds. We further smoothed species' spatial preferences, which initially approximated a normal distribution, to a uniform distribution, such that species' preferences were evenly distributed but phylogenetically conserved across the arena. We then placed individuals near their spatial preference, with a controllable degree of variation (exact parameters in Appendix S4). This simulation has the effect of placing related individuals near each other in space.

In competitive exclusion simulations, we first placed individuals using the random assembly process. Following this, each generation, we calculated the mean relatedness of every individual in the community to all individuals within 20 m, which we term the "interaction distance". We then identified the 20% of individuals with the highest mean relatedness. For each of these individuals, we identified the individual within their interaction distance to which they were most closely related, and then randomly selected one of the two individuals to be removed from the community. At the end of each generation, the same number of individuals as was removed was drawn from the original vector of individuals, and situated randomly in the community. This was repeated for 60 generations for each competitive exclusion simulation. Preliminary analyses indicated that results were similar across different interaction distances and percentages of individuals considered (Appendix S4).
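One generation of the competitive exclusion process can be sketched as follows. This is a simplified reading of the steps above rather than the simulation code itself: relatedness is represented by a hypothetical pairwise distance lookup (small distance = high relatedness), individuals without neighbours are skipped, conspecific neighbours count as the closest relatives, and the toy arena is much smaller than the 10-ha communities in the study. All names are ours.

```python
import math
import random

def exclusion_generation(arena, dist, pool, rng,
                         radius=20.0, frac=0.2, side=316.0):
    """Run one generation on `arena`, a list of (species, x, y) tuples."""
    def neighbours(i):
        _, xi, yi = arena[i]
        return [j for j, (_, x, y) in enumerate(arena)
                if j != i and math.hypot(x - xi, y - yi) <= radius]

    # Mean phylogenetic distance of each individual to its neighbours.
    scores = []
    for i, (sp, _, _) in enumerate(arena):
        nb = neighbours(i)
        if nb:
            mean_d = sum(dist[sp][arena[j][0]] for j in nb) / len(nb)
            scores.append((mean_d, i, nb))
    scores.sort(key=lambda t: t[0])        # most closely related first

    removed = set()
    for _, i, nb in scores[:int(frac * len(arena))]:
        closest = min(nb, key=lambda j: dist[arena[i][0]][arena[j][0]])
        removed.add(rng.choice([i, closest]))   # one of the pair is removed

    survivors = [ind for k, ind in enumerate(arena) if k not in removed]
    # Replace removals with draws from the original vector of individuals.
    recruits = [(rng.choice(pool), rng.uniform(0, side), rng.uniform(0, side))
                for _ in removed]
    return survivors + recruits

# Hypothetical toy setup: 4 species in two sister pairs, 200 individuals.
rng = random.Random(3)
species = ["A", "B", "C", "D"]
sisters = ({"A", "B"}, {"C", "D"})
dist = {a: {b: 0 if a == b else (2 if {a, b} in sisters else 10)
            for b in species} for a in species}
pool = [rng.choice(species) for _ in range(200)]
arena = [(sp, rng.uniform(0, 100), rng.uniform(0, 100)) for sp in pool]
for _ in range(5):
    arena = exclusion_generation(arena, dist, pool, rng, side=100.0)
```

Because removals are replaced one for one, community size is constant across generations; the neighbour search here is quadratic in the number of individuals and would need a spatial index at the scale of the real simulations.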
All spatial simulations employed 200-400 individuals/ha, which is somewhat less than stem-density in Australian tropical rain [...]

[...] metrics focus on the relationship between "evolutionary distinctiveness and abundance" (Cadotte et al. 2010); Group 3 metrics focus on patterns of phylogenetic relatedness among nearest relatives; and Group 4 metrics quantify the total relatedness in an assemblage, and are most closely correlated with species richness.

[...] (Table 1) across variation in species richness. Panels are color-coded from blue (good) to red (poor) according to the mean of type I and II errors across all simulated assembly processes. (B) Dendrogram of intercorrelations among the phylogenetic community structure metrics (and species richness itself). Closely correlated metrics are annotated along branches. Group 1 metrics focus on "mean relatedness"; Group 2 metrics on the relationship between "evolutionary distinctiveness and abundance"; Group 3 on "nearest-relative" measures of community relatedness; and Group 4 on "total community diversity" and are particularly closely correlated with species richness.

The CIs from the richness, 1s, independent swap, trial swap, frequency by richness, and regional null models exhibited confidence funnels, with more variance observed in smaller (less species rich) samples of the regional species pool (Fig. 2; Fig. S1.7). In contrast, the CI of the frequency by quadrat null model did not account for the anticipated increased variance in null model expectations at low species richness, and the value beyond which an observed metric needed to deviate to be considered significant was approximately the same for all quadrats, irrespective of underlying species richness of the quadrat (Fig. 2).

Figure 2. Confidence intervals (95%) for the richness, both forms of the frequency, 2x and 3x null models (Table 2) across variation in species richness.
Expectations shown here are the result of 10^5 randomizations.

Because the 2x and 3x nulls follow identical distributions (Fig. S1.5), only a single layer is included in this figure. The arrow indicates a region of particular concern for type I error when using the frequency by quadrat null. Other null model behavior (including the independent swap, trial swap, and regional models) is summarized in Appendix S1.

[...] reasonably well in habitat filtering simulations when used with some metrics (PD, MPD, MNTD, Fig. 3), but poorly in competitive exclusion simulations with most metrics (Fig. 4). Finally, the richness, 1s and regional nulls performed well with most metrics in both the habitat filtering and competitive exclusion simulations, but the richness and 1s exhibited high type I error rates (Fig. 5).

Focusing on the metrics, PD, PD_c, MNTD and AW MNTD had the greatest power to detect habitat filtering, though Group 1 metrics also performed well (Fig. 3). PD and PD_c were also relatively powerful at detecting the signature of competitive exclusion (Fig. 4), though here they were outperformed by Group 1 metrics. Group 3 metrics exhibited relatively less power to detect phylogenetic overdispersion, particularly with some null models (3x, independent swap). If we take overall metric performance as the mean of the type I error rates across all null models for the random simulations, and the type II error rates across all null models for the habitat filtering and competitive exclusion simulations, then Group 1 metrics performed best overall, followed closely by PD and PD_c, and then by Group 3 metrics (Fig. 5). Some metrics (E_AED, PAE, IAC, H_AED) exhibited type I error rates similar to those of the more successful metrics (i.e. 10-11%), but also failed more often than they succeeded in detecting simulated community assembly processes.

Figure 3. Performance of metric + null model approaches at detecting phylogenetic clustering given habitat filtering, arranged in order from best-performing to worst, with the best approaches in the bottom left corner. Blue bars summarize the proportion of the total 1,009 simulations where the mean of the standardized effect sizes was significantly less than zero (one-way Wilcoxon signed-rank test). Gray bars summarize the proportion where the mean did not differ from zero (type II errors). Equivalent metrics (e.g., PSC, MNTD) performed identically and are combined.

[...]

The unification of phylogenetic community structure methods with age-old questions of community assembly has revolutionized the fields of ecology and evolution. Since Webb's seminal papers (Webb 2000, Webb et al. 2002), there has been an explosion of interest in these matters, including a wide variety of "improvements" upon existing [...] (Fig. 1B). Our objective was to assess a wide range of available methods in order to identify those with demonstrable utility, and to identify those that measure unique aspects of phylogenetic community structure.

Which metrics are best? The results of our study suggest that the answer depends in part on which community assembly processes are of interest, and which null models are used. However, some clear and general answers did emerge. Across most null models and all community assembly simulations, PD (Faith 1992) consistently performed well (Fig. 5), showing low type I error rates and more power than most other metrics; it was particularly good at detecting the effects of habitat filtering (Fig. 3). Group 1 ("mean relatedness") metrics (Fig. 1) also performed well, particularly at detecting effects of competitive exclusion (Fig. 4). Like Kembel (2009), and unlike Kraft et al.
(2007), we found that Group 3 ("nearest-relative") metrics were not as powerful as Group 1 metrics at detecting competitive exclusion, though we did not directly probe changes in community size as did Kraft et al. Instead, we found that Group 3 metrics slightly outperformed Group 1 metrics at detecting habitat filtering.

We expected that because non-abundance-weighted metrics can be strongly influenced by the presence or absence of a single individual, such metrics would more frequently exhibit type I errors. However, abundance-weighted forms of both Group 1 metrics and MNTD showed slightly higher type I error rates than non-abundance-weighted forms. This may be because abundance-weighted metrics appear to require more randomizations before expectations stabilize (results not shown). However, increased randomizations (from 10^3 to 10^4) of CDMs did not alter our main conclusions

(Fig. S4.1). We encourage additional exploration of the circumstances under which abundance-weighted versions of these metrics yield type I errors, and emphasize that these differences in error rates among the Group 1 metrics were small.

Some of the metrics introduced by Cadotte et al. (2010) showed poor performance, particularly PAE and H_AED. The metric E_ED, which out-performed other Group 3 metrics, was a notable exception, as was PD_c (though see Box 1). As suggested (Cadotte et al. 2010), these metrics do indeed measure unique aspects of phylogenetic community structure (Fig. 1B). Some of these aspects, however, do not seem to be related to traditionally recognized community assembly processes. What these metrics (PAE, H_AED) quantify may yet prove useful in certain contexts (Houle et al. 2011), but they showed poor performance with the simulations in this study. When used with the regional null, IAC showed strong power to detect non-random patterns, but this did not extend to other null models. H_ED was closely correlated with PD (r = 0.94), but did not perform as well. We recommend use of either PD or Group 1 metrics.

Which null models are best? Again, our results suggest that the answer depends in part on the choice of metric and the community assembly process of interest. In general, we recommend against use of a frequency by quadrat null. The CI for this null model accounts for neither the increased variance in expectations at smaller samples of the regional species pool, nor the correlation of many metrics with species richness (Fig. 1). Under certain parameters (e.g., low observed quadrat species richness as compared with that of randomized quadrats), this is expected to result in high rates of type I errors, particularly for metrics that are correlated with species richness (Fig. 1A), and we suggest this null should be used with prudence.

The 2x and 3x null models showed mixed performance. While they exhibited fairly low type I error rates (Hardy 2008), they also exhibited limited power to detect expected phylogenetic community structure. When these nulls are concatenated by richness, they exhibit elevated type II error rates (Appendix S6). We suspect that the extreme constraints imposed on matrix randomizations by these nulls result in biased exploration of reasonable phylogenetic space (Appendix S6). Regardless of the reason for this lack of power, the instability across species richness shown by the CI for the 2x and 3x null models (Fig. 2) means that the expectations for a given metric can change dramatically based on whether N or N+1 species are present in an observed community. Nevertheless, these null models are intended to be concatenated by quadrat, and when used in this manner, they performed better than all but the regional null.

The regional null (Appendix S1) was designed to simulate dispersal, proportional to species abundance in a regional pool, into a local community (study area) of interest, such that deviations from these dispersal pressures (e.g., the product of environmental filters) can be readily detected, and local community dynamics (e.g., competition) do not obfuscate expectations. For instance, given strong competitive exclusion, local communities may show widespread phylogenetic overdispersion, where certain species are generally excluded. When these observed occurrence frequencies are taken as regional occurrence frequencies and randomized accordingly (as in the independent swap), it becomes difficult to detect phylogenetic overdispersion, since the randomized CDMs will tend to contain distantly related species, and confidence intervals are accordingly shifted up from those expected given a model like the richness null (Fig. S1.6). The regional null avoids this issue by using expectations from a larger, fixed pool

as the standard against which to compare observations from the study area. However, it is difficult to quantify dispersal pressure on a community of interest, and this model may not be practical for many researchers. Future studies should investigate what information might be used to construct these expectations (e.g., range sizes), and whether this null can be of widespread utility (Lessard et al. 2012).

Null model choice cannot be driven entirely by statistical properties. There may be sound biological reasons to employ a given null, even if its statistical performance is not on par with others (Gotelli and Graves 1996). However, such reasoning should not come at the expense of common sense. For instance, if the quadrats from a CDM are not thought to be representative of the study area (e.g., biased sampling across study areas), then a null model like the independent swap that maintains these observed occurrence frequencies will only confuse interpretation of results. In short, we recommend use of a model that randomizes data structures relevant to the hypothesis, while maintaining structures unrelated to that hypothesis, and while being cognizant of null model performance (Figs. 3-5) and behavior (Fig. 2). The behavior of any metric + null approach with any CDM can be elucidated with use of the expectations function from metricTester. More recently, efficient algorithms for directly calculating the richness-standardized forms of MPD and PD (i.e. SES after randomization with richness null) have been developed that do not require lengthy randomizations (Tsirogiannis and Sandel 2015), and there is room to extend such an approach to additional metrics and nulls.

What combined approach do we suggest?
The richness null with PD or Group 1 metrics may offer the simplest results to interpret by making the clearest assumptions (any species can occur anywhere); more constrained null models raise questions of sampling artifacts and the efficiency of swap algorithms. We emphasize that little should be made of the deviation of any single CDM beyond expectations; the high type I error rates of most approaches cast doubt on the interpretation of single community tests. However, if a metric that is uncorrelated with species richness is used (e.g., PSV), then quadrats from that CDM can be arranged along an environmental gradient to test hypotheses (Graham et al. 2009). Here, the slope is of interest, rather than the significance of any individual community (e.g., quadrats are increasingly phylogenetically clustered along a gradient of decreasing precipitation). Hypothesis testing in this manner minimizes the necessity of a null model, and raw metric values, which often have intrinsic meaning, can then be used instead of SES. For instance, the MPD of a community, given a time-calibrated phylogeny, is equal to the mean evolutionary time separating co-occurring taxa. Other metrics like PD are correlated with species richness, and should be used with a null model (or otherwise standardized, e.g., Nipperess and Matsen 2013) if the focus is on phylogenetic community structure (as opposed to e.g., PD itself). Researchers need to consider what they are measuring with their metric(s) of choice, whether they need to standardize those metrics, and why they might or might not obtain significant results.

By making the assumption that the traits responsible for community assembly covary with phylogeny, this study maintains the sometimes questionable dogma that habitat filtering leads to phylogenetic clustering, and that competitive exclusion leads to phylogenetic overdispersion (Webb et al. 2002, Mayfield and Levine 2010). If trait data
If trait data 22 are available, we encourage researchers who use these methods to fit explicit models of 23 evolution to traits pertinent to the assembly processes in question (Butler and King 2004), 1 and to also investigate patterns of community structure in functional traits. In this study 2 we did not test approaches that account for variation among quadrats in species co-3 occurrence probabilities (e.g., Cavender- Bares et al. 2004;Hardy & Senterre 2007), but 4 metricTester could be adapted to investigate these metrics. There is also an expansive 5 assortment of existing (and yet to be created), hypothetically useful null models whose 6 behavior and performance remains to be tested (e.g., Ulrich and Gotelli 2010). 7 Ultimately, advanced approaches (Ives and Helmus 2011) may prove more powerful and 8 gain wider use than current phylogenetic community structure metrics, but the existing 9 arsenal remains well suited to addressing a wide variety of questions. 10 11 Acknowledgements 12 We thank Vincenzo Ellis, Matt Pennell, Amy Zanne and the Ricklefs lab for input and 13 feedback, the Harmon lab for help with coding, and the Oslo Bioportal, the University of 14 Missouri Lewis Cluster, and the Domino Data Lab for providing computing resources. 15 We thank Alex Kharbush for providing logistical and technical advice with Amazon Web 16 Services. 17 18

Data accessibility
metricTester is available from GitHub (https://github.com/eliotmiller/metricTester), and can be directly installed into an active R session using the devtools package. It requires the package ecoPDcorr, which can also be directly installed with devtools (https://github.com/eliotmiller/ecoPDcorr).

Table 1. The 22 phylogenetic community structure metrics reviewed in this paper. We paraphrase (or sometimes directly quote) the original description of the metric. While some metrics we discuss are in fact equivalent, these original descriptions often emphasized their uniqueness. IAC is a node-based metric. We multiplied it by -1 such that decreases in its value corresponded with increased clustering.

Table 1 (column headers recovered: Metric, Abbreviation; table body not recovered).

* Denotes three metrics not directly assessed here due to equivalency with other metrics (see Appendix S2), leaving 19 focal metrics in this paper.

Table 2 (only the final row and the footnotes recovered).

Regional: Samples individuals at random from regional pool of individuals. Described in detail in Appendix S1 of this paper. Remaining cells: X†; strictly maintains total abundance, approximately maintains rank-abundance curve; Approx.; Approx.; Appendix S1.

* Because columns are moved as a unit, each randomized CDM contains the same set of species-specific abundance distributions as the original CDM, though these abundance distributions are disassociated from their original species (i.e. the set of columns is the same, but each column is now associated with a different species).

† The randomized matrices do not always contain quadrats with species richness values the same as those of the original CDM, but by concatenating randomization results by the species richness of the randomized quadrats, observed quadrats are compared to random quadrats of the same species richness.

‡ Intended for use with presence/absence data; the fact that the picante (and metricTester) implementations also maintain column sums (and not just the sum of non-zero elements), and therefore also maintain species-specific abundance distributions, is an unintentional consequence of the way these null models are coded.
Miller, Farine & Trisos. Phylogenetic community structure methods

Box 1: Abbreviated history of phylogenetic community structure metrics.
Faith (1992) introduced PD, a metric that quantifies the unique evolutionary history represented by co-occurring taxa. It was intended (and is often used) as a conservation tool. While PD built upon previous work by Vane-Wright et al. (1991) and others, it was the first to explicitly incorporate phylogeny. Since PD is the sum of all branch lengths connecting the species in a community (Table 1), the assumption that it increases with additional species, and is therefore correlated with species richness, was implicit (exact solution provided by Nipperess & Matsen 2013).

Subsequently, Clarke and Warwick introduced metrics (Δ, Δ+, Δ*) focused on the average branch length among a group of taxa or individuals, again linking their methodology to conservation decisions (Warwick and Clarke 1995, Clarke and Warwick 1998, 1999). Their pioneering papers explored some statistical properties of the metrics, including the fact that mean expected Δ+ is not correlated with species richness, but the width of its confidence intervals decreases with increasing species richness (creating a "confidence funnel"). Yet, the conservation-specific scope of their papers limited their impact on community ecology.

Webb (2000) introduced two new metrics, MPD and MNTD, and the standardized forms of these, NRI (net relatedness index) and NTI (nearest taxon index). Initially, MPD was slightly different than Clarke and Warwick's metrics, only incorporating nodal distances, but by Webb et al. (2002) the definition had expanded to incorporate branch length, and was therefore equivalent to Δ+ (Appendix S2). Yet, by linking community assembly processes with these phylogenetic patterns, it was MPD and MNTD that revolutionized the field of community ecology. Moreover, despite the equivalency of MPD and Δ+, Webb stated that both MPD and MNTD are correlated with species richness when only MNTD is (Fig. 1A), and devised standardization procedures to "correct" for this. This misperception occasionally persists to the present (e.g., […]

[…] to be similar but superior to NRI and NTI (PSV and PSC). The noted advantage to these is the lack of need for a reference species pool, and therefore the ability of these metrics to transcend the particulars of the phylogeny and community data matrix at hand, and allow raw metric values to be directly compared. However, these should therefore have been […]

[…] metrics ranked communities differently than each other and than metrics like PSV and MNTD, but offered no discussion of the metrics' statistical properties, nor has any subsequent paper. The metrics are available in ecoPD (http://r-forge.r-project.org/projects/ecopd/).

We discuss six additional metrics in this paper: QE (Rao 1982), SimpsonsPhy (Hardy and Senterre 2007), abundance-weighted (AW) MNTD, and three variants of AW MPD (Table 1, Appendix S2). Both complete AW MPD and AW MNTD were introduced in Phylocom (Webb et al. 2008) and picante without accompanying publication, and their statistical properties and relationship to other metrics remain essentially unknown. Interspecific AW MPD was introduced in […], and intraspecific AW MPD is "first" described in the current paper (Appendix S2), though as we subsequently discovered, it is equivalent to Δ […]. Similarly, after exploring the behavior of QE and SimpsonsPhy and finding them equivalent, we realized this was already known (Hardy and Senterre 2007, Allen et al. 2009).
Appendix S1. Null models: behavior across variation in species richness, documenting equivalency, and regional null model description.

Behavior of existing null models across species richness
As described in the main text, we were interested in quantifying the behavior of the null models (Table 2) across varying species richness. Basic principles of bootstrapping (Efron 1979) suggest that there should be more variance when small subsamples of a larger pool are taken. If two random taxa are drawn from a phylogeny, they could be close sister species, or they could span the root. The calculated phylogenetic community structure metrics from these two extremes would vary greatly.
Alternatively, if all the species from a phylogeny are present in a community, we know what the calculated metric will be (± some slight variation for abundance-weighted metrics). This should lead to a confidence funnel.

The richness null (=SIM3, Gotelli (2000)) we tested swaps abundances within quadrats. In other words, given a quadrat by species community data matrix (CDM), this null shuffles the contents of each row (a quadrat). Accordingly, each species is sampled with equal frequency in the randomized matrices (i.e. it does not maintain species' observed occurrence frequencies). We would expect that for metrics like mean pairwise phylogenetic distance (MPD) that are uncorrelated with species richness (Fig. 1), the mean expected value would not change with species richness. Simulations show this is the case (Fig. 2). This is a useful null to use as a benchmark against which to understand other more constrained nulls. A slight variation on this, the 1s null model (Hardy 2008), converges on the same expectations as the richness null (Fig. S1.1). The 2s null (Hardy 2008) is Hardy's implementation of the richness null, and we examined it here simply to confirm that different R packages do indeed give similar solutions (Fig. S1.1).
The frequency null we tested (=SIM2, Gotelli (2000)) swaps abundances within species. We refer to this as the frequency by quadrat null. Given a quadrat by species CDM, this null shuffles the contents of each column (a species). This means that individual species are not sampled with equal frequency. Importantly, it also means that the randomized quadrats tend to contain the mean number of species observed per quadrat in the input CDM. For example, given a CDM with four quadrats, one of species richness 2, two of species richness 5, and one of species richness 8, randomized quadrats will tend to contain 5 species. Based on the principles of bootstrapping mentioned above, it should be clear why this would be problematic; the larger expected variance at low species richness will not be incorporated in the null model, and high type I error rates are expected (black arrow in Fig. 1 points to the region of concern). To account for this, a method was developed in which the per-quadrat raw metric values and associated species richness from a frequency null are retained. These values are concatenated by species richness, and observed values are compared to those expected at their corresponding species richness. We refer to this as the frequency by richness null.
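The row-wise and column-wise shuffles described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the picante or metricTester R implementation; the function names and the toy CDM (four quadrats of species richness 2, 5, 5 and 8, mirroring the example in the text) are hypothetical.

```python
import random

def richness(row):
    """Species richness of a quadrat: the number of non-zero cells."""
    return sum(1 for n in row if n > 0)

def richness_null(cdm, rng):
    """Shuffle the contents of each row (quadrat) independently."""
    shuffled_rows = []
    for row in cdm:
        new_row = list(row)
        rng.shuffle(new_row)
        shuffled_rows.append(new_row)
    return shuffled_rows

def frequency_null(cdm, rng):
    """Shuffle the contents of each column (species) independently."""
    columns = [list(col) for col in zip(*cdm)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]

# Hypothetical CDM: 4 quadrats (rows) x 8 species (columns).
cdm = [
    [3, 1, 0, 0, 0, 0, 0, 0],  # richness 2
    [2, 1, 1, 4, 1, 0, 0, 0],  # richness 5
    [0, 0, 0, 1, 2, 1, 1, 3],  # richness 5
    [1, 2, 1, 1, 5, 1, 2, 1],  # richness 8
]

rng = random.Random(1)
row_shuffled = richness_null(cdm, rng)
col_shuffled = frequency_null(cdm, rng)

print([richness(r) for r in row_shuffled])  # per-quadrat richness is kept
print([richness(r) for r in col_shuffled])  # per-quadrat richness may change
```

Shuffling within rows keeps each quadrat's richness but not the species' occurrence frequencies. Shuffling within columns keeps each species' occurrence frequency, so the mean richness across randomized quadrats stays at 5, but individual quadrats are free to depart from their observed richness.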
Like the frequency by richness null, the independent swap (Gotelli and Entsminger 2001) and trial swap (Miklós and Podani 2004) null models aim to derive a CDM in which the randomized quadrats contain the same number of species as those of the input CDM, and individual species occur with the same frequency as in the input CDM. The trial swap null model has been considered a more efficient implementation of the independent swap (Miklós and Podani 2004). In our simulations this was not the case (although we are not entirely sure whether convergence on stable confidence intervals is necessarily equivalent to the "equidistribution" of randomized matrices discussed by those authors). With increasing randomizations of a given CDM, the independent swap, trial swap and frequency by richness nulls all show increasingly stable expectations, but the trial swap seems to stabilize at a slower rate (Fig. S1.2).
Regardless of the reason for this result, all three nulls converge on the same solution (Figs. S1.3 and S1.4).
The 2x and 3x nulls (Hardy 2008) were developed to maintain not only aspects of species richness and occurrence frequency, but also either the quadrat-specific rank abundance curve or the species-specific abundance distribution, respectively. While these are aspects of a dataset that a researcher most certainly might wish to maintain, in practice, the extreme constraints imposed on the matrix randomizations seem to result in inefficient exploration of phylogenetic space (Appendix S6). Both nulls also gave identical solutions when concatenated by richness (Fig. S1.5). We were unable to determine why these nulls behaved as they did, but the fact that their expectations wobble across species richness seems to be an undesirable property. Biologically, it is hard to construct a reason why one should expect dramatically different phylogenetic community structures with the presence or absence of a single species. Indeed, when using a non-abundance-weighted metric like MPD, a null model that maintains species richness and occurrence frequency, such as the independent swap, and a null model that maintains both of these aspects of the data as well as aspects of abundance distributions, like the 2x and 3x null models, should converge on similar expectations. Using MPD, quadrat-specific and species-specific abundance distributions should not influence the expectations from two quadrats of similar species richness (e.g., there is no reason to expect the randomized values from two observed quadrats, each with five species, to converge on different expectations). Instead, quadrats of the same species richness converged on different expectations, suggesting poor exploration of possible phylogenetic community space (Appendix S6). Concatenating expectations from quadrats of the same species richness resulted in odd-shaped randomized distributions and a total loss of power to detect simulated community assembly processes (Appendix S6).
That said, when concatenated by quadrat, as they were intended to be, these null models did perform better than most we tested (Fig. 5).
What determines how expectations for the independent swap (or frequency by richness or trial swap) vary from those given the richness null? It may not be intuitive to all readers that species within a phylogeny vary in their mean phylogenetic distance to other species in the phylogeny. In an ultrametric tree, all species are equidistant from the root. How can one differ from another in its mean relatedness to other species? Consider the case of a single species that is sister to the rest of the phylogeny. This species is separated by larger average evolutionary distances than are the other species. The relationship between species' occurrence frequencies and their mean relatedness determines how the expectations for the independent swap shift from those of the richness null.
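A small worked example may make this concrete. The sketch below uses a hypothetical four-species ultrametric tree, (((A:1,B:1):1,C:2):1,D:3), in which every tip is 3 units from the root and D is sister to the rest; the tree, names and distances are invented for illustration only.

```python
# Pairwise (cophenetic) distances for the hypothetical tree
# (((A:1,B:1):1,C:2):1,D:3). All tips are equidistant from the root.
PAIR_DIST = {
    ("A", "B"): 2.0,
    ("A", "C"): 4.0,
    ("B", "C"): 4.0,
    ("A", "D"): 6.0,
    ("B", "D"): 6.0,
    ("C", "D"): 6.0,
}
SPECIES = ["A", "B", "C", "D"]

def distance(a, b):
    """Look up the distance between two different species, in either order."""
    return PAIR_DIST.get((a, b), PAIR_DIST.get((b, a)))

# Mean phylogenetic distance from each species to all other species.
mean_distance = {
    sp: sum(distance(sp, other) for other in SPECIES if other != sp)
    / (len(SPECIES) - 1)
    for sp in SPECIES
}

for sp in SPECIES:
    print(sp, round(mean_distance[sp], 3))
```

In this toy tree A and B have the smallest mean distance to the other species and D the largest, even though all four tips are equally far from the root. A null model that samples D more or less often than the others will therefore shift the expected MPD up or down.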
To illustrate this point, we generated a CDM as described in the main text. For every species in the CDM, we next calculated both its mean relatedness to the rest of the species and its occurrence frequency in the CDM. In the first simulation (Fig. S1.6, "sim1"), we then replaced species identities in the CDM such that species that were more closely related to the rest became the most frequently occurring species in the CDM. In other words, the most closely related species in the phylogeny also became the most common in the new CDM. We performed the opposite procedure in the second simulation ("sim2"). When distant relatives are also the least frequently observed species, the expectations are shifted downwards from those given a richness null. When distant relatives are the most frequently observed species, the expectations are shifted upwards (Fig. S1.6). Moreover, mean expected MPD, which is uncorrelated with species richness, begins to show some correlation with species richness when using a null model like this. This is because the probability of including rare species in the randomized matrices increases with larger samples. Thus, the expected MPD is positively correlated with species richness in the first simulation, and negatively correlated in the second.

Development of the regional null model.
No null model of community assembly that we know of maintains species richness, species occurrence frequency, and species abundance. The null models that come closest to achieving these objectives are the 2x and 3x nulls of Hardy (2008). We developed a regional null model aimed at achieving these goals. We did this in particular because our competitive exclusion simulations led us to recognize the importance of local interactions on species occurrence frequencies (Appendix S3), though the importance of considering the regional pool has also been recognized widely in the literature (Ricklefs 1987, Lessard et al. 2011, 2012). Specifically, our competitive exclusion simulations produce a local effect where some species that are abundant in the regional species pool become locally less so (Fig. S4.6). Such species are closely related to species that are more abundant in the local community (i.e. the simulation arena and resulting CDM).
When these local occurrence frequencies are used to inform a null model such as the independent swap, short phylogenetic distances (like those between sister species) tend not to occur in the randomized matrices, which results in the expected phylogenetic community structure being shifted upwards from that given a null that maintains only species richness (Fig. S1.7). Accordingly, it becomes difficult to detect phylogenetic overdispersion.
In empirical situations, researchers are likely interested in testing for the effects of community assembly processes in a focused area (e.g., a forest plot, a grid cell on a map, a soil sample, etc.). The thought, likely, is that the focal area was historically or is currently subject to community assembly processes (e.g., competitive exclusion) that operate on a smaller scale than regional dispersal dynamics. The regional null is intended to simulate these regional dispersal probabilities into the focal area. The regional null model largely accomplishes the objectives of maintaining species richness, occurrence frequency, and abundance distributions, and was associated with lower error rates than the other null models (Fig. 5). It requires, however, that a regional abundance vector (in the form of "sp1, sp1, sp1, sp2, sp2, …") be provided. Developing a vector like this is easy in our simulations, but may be more difficult in empirical situations. If a dataset consisted of evenly sampled sites, so as not to introduce biases in species occurrence frequencies, and the assumption was made that species abundances reflected their dispersal probability, then a vector of all individuals across the entire dataset could be used (use the function abundanceVector in metricTester to do so). Most real-world situations are more complicated than this, and the practicality of the regional null remains to be demonstrated.
The regional null takes as input a regional abundance vector and, for each quadrat in the randomized CDM, it then samples with equal probability from this vector the same number of individuals as were in the given quadrat in the observed CDM. The metric of interest is calculated on the quadrats from this randomized CDM, and these values are retained, along with the associated species richness from each quadrat. This process is repeated many times. The randomized values are then concatenated by their associated species richness. Thus, species richness is strictly maintained, as observed quadrats are only ever compared with randomized sites of corresponding species richness.
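The procedure in the preceding paragraph can be sketched as follows. This is a hedged illustration in plain Python rather than the metricTester implementation: the function names, the four-species distance matrix, the regional abundance vector and the observed CDM are all hypothetical, and the sketch assumes individuals are drawn with replacement, which the text does not specify.

```python
import random
from collections import Counter, defaultdict

SPECIES = ["sp1", "sp2", "sp3", "sp4"]
# Hypothetical pairwise phylogenetic distances among the four species.
DIST = [
    [0, 2, 4, 6],
    [2, 0, 4, 6],
    [4, 4, 0, 6],
    [6, 6, 6, 0],
]

def richness(row):
    return sum(1 for n in row if n > 0)

def mpd(row):
    """Unweighted mean pairwise distance among the species present."""
    present = [i for i, n in enumerate(row) if n > 0]
    pairs = [(i, j) for k, i in enumerate(present) for j in present[k + 1:]]
    return sum(DIST[i][j] for i, j in pairs) / len(pairs)

def regional_null(observed_cdm, regional_vector, metric, n_iter, rng):
    """Return {species richness: [randomized metric values]}."""
    by_richness = defaultdict(list)
    for _ in range(n_iter):
        for row in observed_cdm:
            n_individuals = sum(row)  # same abundance as the observed quadrat
            # Assumption: sampling with replacement from the regional vector.
            drawn = Counter(rng.choices(regional_vector, k=n_individuals))
            random_row = [drawn[sp] for sp in SPECIES]
            if richness(random_row) >= 2:  # MPD needs at least two species
                by_richness[richness(random_row)].append(metric(random_row))
    return by_richness

# Regional abundance vector in the "sp1, sp1, sp1, sp2, sp2, ..." form.
regional_vector = ["sp1"] * 5 + ["sp2"] * 3 + ["sp3"] * 2 + ["sp4"] * 1
observed_cdm = [
    [2, 1, 0, 0],
    [3, 1, 1, 0],
    [1, 2, 1, 1],
]

expected = regional_null(observed_cdm, regional_vector, mpd, 200,
                         random.Random(42))
for rich in sorted(expected):
    print(rich, len(expected[rich]))
```

Each observed quadrat would then be compared with the pooled randomized values stored under its own species richness, regardless of which quadrat generated them.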
Species occurrence frequencies are also approximately maintained with the regional null. For instance, after 1,000 randomized CDMs were generated with the regional null, we calculated the mean occurrence frequency across all randomized CDMs for each of the 50 species in the CDM. These values were closely correlated with the observed occurrence frequencies for the same 50 species (r² = 0.83, p < 0.001, Fig. S1.8).
The abundance at which a species occurs in any given quadrat is also approximately maintained with the regional null. For instance, within a given quadrat from these same randomizations, a randomly selected species was mostly found as a single individual, occasionally as two individuals, and very infrequently at higher abundances (Fig. S1.9A). This is similar to the abundance distribution of the same species in the original CDM (Fig. S1.9B).

[Figure residue. Axes: Species richness; Mean pairwise phylogenetic distance. Legend (Null): Richness, Independent swap, Regional. Surviving caption fragment:] […] in the random quadrats, used to generate the CDM. Thus, the expectations given an independent swap null, which accounts for occurrence frequency, are shifted notably upwards from those given a richness null. Moreover, some species are lost from the arena entirely, and the mean expectations for the richness null are therefore also shifted slightly up from those given the regional null.

Figure S1.8. Mean occurrence frequency of 50 species, after 1000 randomizations with the regional null model, as compared with their initial occurrence frequency (axes: observed species occurrence frequency; randomized occurrence frequency). Species tended to occur with a frequency proportional to their occurrence frequency in the observed matrix (r² = 0.83, p < 0.001).

Appendix S2. Three forms of abundance-weighted MPD, and equivalency of some forms to Clarke and Warwick's metrics.

Three forms of abundance-weighted MPD
Abundance-weighted mean pairwise phylogenetic distance (MPD) and mean nearest taxon distance (MNTD) were introduced in Phylocom (Webb et al. 2008) without accompanying publications. These methods have entered into common usage in the literature, but they have not been discussed at any length. A variation on abundance-weighted MPD was recently introduced that only accounts for interspecific phylogenetic distances. This is different than the implementation in Phylocom and picante (Kembel 2009).
There are at least three different possible forms of abundance-weighted MPD (Fig. S2.1). Consider a local assemblage of three species drawn from a regional species pool. Qualitatively, species A, B, and C are clustered in the phylogeny. But, how should the abundances of these three species affect the metric? In the simple case of an assemblage of two individuals of species A, and one each of species B and C, all of the potential interactions among individuals can be visualized schematically (Fig. S2.1).
If we include only interactions among heterospecific individuals to derive a matrix of abundance weights for the MPD calculation (Fig. S2.1, "interspecific"), we obtain the MPD among heterospecific individuals within the community. This is the same as the MPD among species, weighted by the number of individuals of each interacting species. It is also the same as Δ* of Clarke and Warwick (see below). The resulting MPD calculated with this metric is slightly less than the unweighted version. This slight decrease is due to the down-weighting, relative to unweighted MPD, of the contribution of the phylogenetic distance between individuals of the rarer species, B and C (Fig. S2.1).
The interspecific metric will be useful when it is the phylogenetic distances among individuals of different species that are of interest. For example, when testing for habitat filtering or interspecific competition, given an increase in the number of individuals of species A, a researcher might prefer not to have the metric show a dramatic increase in the degree of clustering (as happens with alternative versions of the metric, see below and Fig. S2.2d). This is because it is the phylogenetic distances among individuals of different species that are hypothesized to be clustered and/or overdispersed. As another example, a researcher studying phylogenetic niche conservatism might be interested in how phylogenetic community structure changes along an environmental gradient. Given abundance data, he or she could study these changes along the gradient, down-weighting the importance of rarely recorded species (e.g., vagrants) and up-weighting the importance of abundant species.
Alternatively, one might wish to account for both inter- and intraspecific interactions to obtain the mean pairwise phylogenetic distance between any two individuals within the community (Fig. S2.1, "intraspecific"). Here, the two intraspecific interactions for species A, which correspond to phylogenetic distances of zero, are given weight when calculating MPD, considerably decreasing the resulting metric from the unweighted version. This intraspecific abundance-weighted MPD is equal to Δ of Clarke & Warwick (1998) (see below). It will likely be preferred when examining patterns in community phylogenetic structure predicted to arise from processes generating negative density-dependence mediated by phylogenetic relatedness. For example, in the case of pathogen-mediated species co-occurrence, the inclusion of both intra- and interspecific phylogenetic distances is important as both con- and heterospecific individuals represent potential hosts, and the expectation may be not only of even spacing among species, but even abundance distributions of individuals among species.
Lastly, abundance-weighted MPD, as currently implemented in Phylocom and picante, is calculated by accounting for all possible interactions, including those of an individual with itself (Fig. S2.1, "complete") (Webb et al. 2008, Kembel et al. 2010). The biological interpretation of this metric seems more complicated than those of the interspecific or intraspecific methods. The complete method might be likened, biologically, to including an individual's impact both on others and on itself; for example, an individual's use of environmental resources reducing availability for all individuals, including itself. The diagonal element in the abundance weight matrix of the complete method is equal to n², where n is the number of individuals of a species, while that in the intraspecific method is n² - n. Thus, MPD values calculated with either the intraspecific or complete versions will converge rapidly as n increases (Fig. S2.3). Only at low total local assemblage abundance is the difference in MPD values between these metrics notable. Nevertheless, it seems that intraspecific MPD is a more accurate implementation of abundance-weighted MPD, which Webb et al. (2008) defined as the average phylogenetic distance between any two individuals drawn from a sample.
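The three weighting schemes differ only in the diagonal of the abundance weight matrix, which the following sketch makes explicit. It is an illustration in plain Python, not the Phylocom, picante or metricTester code. The assemblage (two individuals of A, one each of B and C) follows the text; the A-B and B-C distances are assumptions chosen for illustration, and the function name is hypothetical.

```python
def aw_mpd(abund, dist, form):
    """Abundance-weighted MPD under one of three weighting schemes.

    Off-diagonal weight for species i and j: n_i * n_j (all forms).
    Diagonal weight for species i:
      interspecific -> 0
      intraspecific -> n_i**2 - n_i  (pairs of distinct conspecifics)
      complete      -> n_i**2        (also pairs an individual with itself)
    """
    numerator = 0.0
    denominator = 0.0
    for i, n_i in enumerate(abund):
        for j, n_j in enumerate(abund):
            if i != j:
                weight = n_i * n_j
            elif form == "interspecific":
                weight = 0
            elif form == "intraspecific":
                weight = n_i * n_i - n_i
            elif form == "complete":
                weight = n_i * n_i
            else:
                raise ValueError("unknown form: " + form)
            numerator += weight * dist[i][j]
            denominator += weight
    return numerator / denominator

# Species order: A, B, C. Distances are illustrative assumptions.
DIST = [
    [0, 2, 4],
    [2, 0, 4],
    [4, 4, 0],
]
abund = [2, 1, 1]  # two individuals of A, one each of B and C

unweighted = aw_mpd([1, 1, 1], DIST, "interspecific")
interspecific = aw_mpd(abund, DIST, "interspecific")
intraspecific = aw_mpd(abund, DIST, "intraspecific")
complete = aw_mpd(abund, DIST, "complete")

print(unweighted, interspecific, intraspecific, complete)
```

With these assumed distances the interspecific value falls slightly below the unweighted one, while the intraspecific and complete values are lower still because zero-length conspecific distances enter the mean. Doubling every abundance leaves the interspecific and complete values unchanged and moves the intraspecific value toward the complete one.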
Each of these methods corresponds to a different biological interpretation, and they performed similarly overall (Figs. 3-5). A few points should still be understood about the intraspecific and complete methods. Both intraspecific and complete abundance-weighted MPD will correlate with assemblage species richness, since at lower richness, proportionally more intraspecific phylogenetic distances (i.e. distances of zero) are included in the mean (Fig. 1). Also, assemblages of uniform species abundances will have different MPD scores depending on whether they are abundance-weighted or not (Fig. S2.2). Finally, abundance-weighted MPD will always be less than the unweighted form (except in the unique case where all species in the assemblage are represented by a single individual, Fig. S2.2).
It is instructive to consider how these three different MPD metrics change as species abundances vary. If all species' abundances are increased, keeping relative abundances the same, the resulting metric is unchanged for the interspecific and complete methods, but decreases for the intraspecific method (it converges on the complete method with increasing total assemblage abundance, Fig. S2.3). If individuals of both species A and C are increased in tandem towards infinity, holding B constant, then the interspecific method converges on the phylogenetic distance between species A and C (4 in this example), while the latter two methods converge on the mean of the phylogenetic distance between species A and C and their intraspecific phylogenetic distance (2 in this example; the mean of 4 and zero). Similarly, with the interspecific method, adding individuals of species A only to the assemblage will increase the contribution of the phylogenetic distances between species A and other species, while with either of the other two methods, it will increase the contribution of both interspecific distances involving species A, and distances within species A (Fig. S2.2).

Some forms of MPD are equivalent to Clarke and Warwick's earlier metrics
While writing this manuscript, we became aware of three additional phylogenetic community structure metrics that were not incorporated in the main simulations. This oversight was due in large part to the fact that these metrics have been more frequently used by conservation biologists than by community ecologists (Box 1). As we show here, they are equivalent to other metrics that we did assess, and consequently are expected to perform equivalently. Specifically, non-abundance-weighted MPD is equal to Δ+, interspecific MPD is equal to Δ*, and intraspecific MPD is equal to Δ (Fig. S2.4).
R code to demonstrate the equivalency of the metrics is provided below.
metricTester can be installed directly from GitHub using the devtools package (username = "eliotmiller"; note that the dependency ecoPDcorr must also be installed using the same username).

9b. Use of CIs discussed in Appendix S6
During the competitive exclusion simulations, some species that were initially common in the community became less so with each generation (Fig. S4.6). These species were those with many close relatives in the phylogeny. A null model like the independent swap that incorporates species occurrence frequencies derives these from occurrence frequencies in the observed community data matrix (CDM). After the competitive exclusion simulations, therefore, longer than average branch lengths end up being frequently sampled in the randomized CDMs. Accordingly, the expected phylogenetic community structure is shifted upwards from that given a richness null, and it becomes difficult to detect phylogenetic overdispersion (Fig. S1.7). This occurs despite the fact that, throughout the competitive exclusion simulations, removed individuals are settled from the initial regional abundance pool. Our development of the regional null model (Appendix S1) was motivated in large part by this complication.
R code for example communities simulated with metricTester (and the specific parameters we used in the main results) is given below.

Simulate a phylogeny of 100 species with geiger:

    library(geiger)
    tree <- sim.bdtree(b=0.1, d=0, stop="taxa", n=100)

Generate an object of class "simulations.input": […]

[…] all individuals within the interaction distance is derived. The mean genetic neighborhood is then defined as the community mean of these individual means. Across a wide range of percent killed parameters, the general pattern of increasing phylogenetic overdispersion is evident. Based on these preliminary results, it appears that removing ("killing") a small percentage (e.g., 10%) of individuals each generation would ultimately generate a similar pattern to removing a large percentage (e.g., 30%).

Figure S4.5. On the left, an example of a 300 x 300 m random assembly community, created using similar parameters to those in the study. Here, a random individual was selected near the center of the community (marked with a white asterisk). Individuals were then color-coded as a function of their relatedness to the focal individual, where bright red indicates a member of the same species. The size of individual dots was scaled according to their mean relatedness to all other species in the phylogeny, such that large dots indicate a member of a species with many close relatives. In this random community, bright red dots occasionally occur close together, and on average the plot is "redder" than the right panel. Also, the dots in the plot appear to be more uniform in size. On the right, an example of a 300 x 300 m competitive exclusion community. The community from the left panel was used as a starting point. An individual of the same focal species as that panel was selected near the center of the community (marked with a white asterisk). Size and color scaling are the same as in the left panel. In this community, bright red dots appear regularly spaced, and on average the plot is "darker" than the left panel. Also, the dots in the plot appear to be more heterogeneous in size.

Figure S4.6. Changes in the rank abundance curve after 25 generations of the competitive exclusion assembly simulations. The initial rank abundance curve is shown in black. Increasing the interaction distance results in increasingly large deviations from the initial rank abundance curve. Some species (e.g., columns 8 and 9) change abundance dramatically during these competition simulations. Four separate simulations with the same initial community and phylogeny are shown here.

Appendix S6. Expanded methods and results for approaches that concatenate randomized values by richness and approaches that assess significance on a per-quadrat basis.
As in the main text, we here define a community data matrix (CDM) as a quadrat (rows)-by-species (columns) matrix, where cells are filled according to the abundance of a given species in a given quadrat. Traditionally, with null models used in empirical studies of phylogenetic community structure, the CDM is randomized according to some algorithm (e.g., the independent swap), and the metric in question is recalculated row-wise after each randomization. These randomizations are performed some large number of times; the quadrat-specific expectations are then compared to observed values and used to derive standardized effect sizes (SES), and the significance of the observed matrix-wide deviation of SES from expectations is assessed with a test such as a Wilcoxon signed-rank test.
A slight deviation from this approach is to assess significance on a per-quadrat basis. Here, a given quadrat is considered significantly overdispersed or clustered if it deviates above or below, respectively, the 95% confidence intervals (CI) for that quadrat based on the randomized values from the null model. An additional deviation from the traditional approach is to retain randomized values, along with the species richness (the row-wise sum of non-zero elements) of the corresponding quadrat from the randomized matrix, and concatenate and summarize randomized metric values by these species richness values.
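The three ingredients described so far (SES, per-quadrat CI tests, and concatenation by richness) can be sketched as below. This is an illustrative Python sketch, not the metricTester code: the function names and the artificial null values are hypothetical, and the 95% interval is taken from empirical 2.5% and 97.5% quantiles, which is one reasonable reading of the text rather than a documented choice.

```python
import statistics
from collections import defaultdict

def pool_by_richness(records):
    """Concatenate randomized metric values by species richness.

    `records` is an iterable of (richness, metric value) pairs collected
    from the quadrats of many randomized CDMs.
    """
    pooled = defaultdict(list)
    for rich, value in records:
        pooled[rich].append(value)
    return pooled

def ses(observed, null_values):
    """Standardized effect size: (observed - null mean) / null SD."""
    return (observed - statistics.mean(null_values)) / statistics.stdev(null_values)

def classify(observed, null_values):
    """Per-quadrat call against the empirical 95% interval of the null."""
    cuts = statistics.quantiles(null_values, n=40)  # 2.5% steps
    lower, upper = cuts[0], cuts[-1]
    if observed > upper:
        return "overdispersed"
    if observed < lower:
        return "clustered"
    return "not significant"

# Artificial null values: a wide spread at richness 5, narrower at 8.
records = [(5, float(v)) for v in range(1, 101)]
records += [(8, 40.0 + 0.2 * v) for v in range(1, 101)]
pooled = pool_by_richness(records)

observed_value = 70.0
call_at_5 = classify(observed_value, pooled[5])
call_at_8 = classify(observed_value, pooled[8])
print(call_at_5, call_at_8, round(ses(observed_value, pooled[8]), 2))
```

The same raw metric value can fall inside the null interval at one species richness and outside it at another, which is why observed quadrats are compared only with randomized values of matching richness.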
Thus, null models such as the richness, independent swap and trial swap (Table 2, main text) that maintain row-wise sums of non-zero elements are expected to perform similarly whether results are concatenated by richness or quadrat. Indeed, since a given species richness may occur more than once in an observed CDM (e.g., two sampled quadrats from a given community might both contain 20 species), running a null model in this manner can effectively increase the number of randomizations against which observed values are compared. In other words, the expectations for a given observed quadrat of 20 species should be no different than those from a different quadrat of 20 species. However, other null models like the frequency null, which do not maintain row-wise sums of non-zero elements, are expected to behave differently when concatenated by richness or quadrat. This was the impetus for the development of the frequency by richness null, which we showed here to be equivalent to the independent and trial swap null models.
During our simulations, we performed all analyses according to both a traditional, community-wide SES framework and a per-quadrat significance framework. We also performed both of these types of analyses after concatenating randomized metric values both by quadrat and by species richness. Thus, in addition to the overall results presented in the main text (Fig. 5), there were three other ways to consider the overall results: (1) an SES framework where results were concatenated by species richness, (2) a per-quadrat framework where results were concatenated by quadrat, and (3) a per-quadrat framework where results were concatenated by species richness.
For the latter two approaches, we summarized type I errors for a given metric + null approach as the sum of all clustered or overdispersed quadrats from the random simulation, overdispersed quadrats from the filtering simulation, and clustered quadrats from the competitive exclusion simulation, divided by the total number of quadrats across these three simulations. We summarized type II errors for a given metric + null approach as the sum of all non-clustered quadrats from the filtering simulation and all non-overdispersed quadrats from the competitive exclusion simulation, divided by the total number of quadrats across these two simulations.
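These two summaries can be written out as a short sketch. The function name `error_rates`, the string labels, and the example counts are assumptions made for illustration; the arithmetic follows the definitions in the preceding paragraph.

```python
def error_rates(random_calls, filtering_calls, competition_calls):
    """Per-quadrat type I and type II error rates for one metric + null approach.

    Each argument is a list of per-quadrat labels from one simulation:
    "clustered", "overdispersed" or "not significant".
    """
    type1_count = (
        sum(c in ("clustered", "overdispersed") for c in random_calls)
        + sum(c == "overdispersed" for c in filtering_calls)
        + sum(c == "clustered" for c in competition_calls)
    )
    type1 = type1_count / (
        len(random_calls) + len(filtering_calls) + len(competition_calls)
    )
    type2_count = (
        sum(c != "clustered" for c in filtering_calls)
        + sum(c != "overdispersed" for c in competition_calls)
    )
    type2 = type2_count / (len(filtering_calls) + len(competition_calls))
    return type1, type2


# Hypothetical labels for ten quadrats from each simulation.
random_calls = ["not significant"] * 9 + ["clustered"]
filtering_calls = ["clustered"] * 8 + ["not significant", "overdispersed"]
competition_calls = (
    ["overdispersed"] * 5 + ["not significant"] * 4 + ["clustered"]
)
type1, type2 = error_rates(random_calls, filtering_calls, competition_calls)
print(type1, type2)
```

Note that, under these definitions, a quadrat with the wrong significant label (e.g., overdispersed in the filtering simulation) contributes to both error rates.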
While the traditional approach for calculating the significance of observed phylogenetic community structure is sufficient, there are at least three reasons researchers might choose to take one of the other approaches. First, a researcher might wish to use unstandardized metrics (e.g., MPD instead of NRI). Second, a researcher might wish to assess significance for a given quadrat instead of across an entire matrix. Of course, using an SES approach, a single quadrat with an absolute SES greater than 1.96 can be considered significant at an alpha of 0.05. However, if the researcher wishes to use unstandardized metrics, assessing significance as per-quadrat deviations beyond CI is a more appropriate approach. Third, we believe it is not well appreciated that null model expectations vary in metric-specific manners across species richness, and significance testing in this manner may provide insight into how such underlying expectations vary in the empirical dataset in question.
Under an SES framework where results were concatenated by species richness, results were, as expected, similar to those concatenated by quadrat (Fig. 5, main text) for the trial swap, independent swap, richness and 1s null models (Fig. S6.1).
They were identical for the regional and frequency by richness null models, since these were also concatenated by richness in the main text, and they were identical for the frequency by quadrat null model, since we only concatenated those randomized values by quadrat (Fig. S6.1). Unexpectedly, given that both null models maintain species richness (i.e., row-wise sums of non-zero elements), results differed dramatically for both the 2x and 3x null models when randomizations were concatenated by richness.
We investigated this surprising result by creating a CDM of 12 quadrats and 20 species total, where quadrats either contained 5, 7, or 10 species each (4 quadrats of each richness). When we randomized the matrix according to the 2x and 3x null models and plotted per-quadrat expectations as a function of the species richness of that quadrat, we discovered that quadrats of similar species richness did not appear to converge on similar expectations. Thus, when randomizations from a given species richness were concatenated, instead of being (at least somewhat) normally distributed, the randomized values often exhibited notably multi-modal distributions (e.g., Fig. S6.2). Moreover, even randomized values from a given quadrat were sometimes multi-modal (e.g., Fig. S6.3).
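A test matrix with this layout can be built as in the sketch below. The source does not state how cells were filled, so this version is presence/absence only and draws each quadrat's species at random; the function name `toy_cdm` and the seed are likewise assumptions. The 2x and 3x randomizations themselves are not reproduced here.

```python
import random


def toy_cdm(richness_per_quadrat, n_species=20, seed=0):
    """Build a presence/absence CDM (quadrats as rows, species as columns).

    Each row receives the requested number of species, drawn at random
    without replacement from the species pool.
    """
    rng = random.Random(seed)
    cdm = []
    for richness in richness_per_quadrat:
        present = set(rng.sample(range(n_species), richness))
        cdm.append([1 if sp in present else 0 for sp in range(n_species)])
    return cdm


# Four quadrats at each of three richness levels, 12 quadrats in total.
richness_levels = [5] * 4 + [7] * 4 + [10] * 4
cdm = toy_cdm(richness_levels)

# Species richness is the row-wise count of non-zero elements.
row_richness = [sum(1 for cell in row if cell != 0) for row in cdm]
print(row_richness)  # [5, 5, 5, 5, 7, 7, 7, 7, 10, 10, 10, 10]
```

Randomized metric values from such a matrix can then be plotted against `row_richness` to check whether quadrats of equal richness converge on similar expectations.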
This explains the significant drop in power of the 2x and 3x null models when concatenated by richness: expectations at a given species richness tend to be more platykurtic than when concatenated by quadrat.
The drop in power of the 2x and 3x null models continued when the significance of observed metrics was assessed on a per-quadrat basis. With this manner of significance testing, the type II error rates of both null models were greater than 99.9% when randomizations were concatenated either by quadrat (Fig. S6.4) or by richness (Fig. S6.5).
Both the 2x and 3x null models performed well when used in a traditional manner. However, when results were concatenated by richness, when significance was assessed on a per-quadrat basis, or both, these models essentially lacked any power to detect simulated community assembly processes. More study of the behavior of these models is warranted. Qualitatively, the issue appears to be biased exploration of plausible phylogenetic community structure space, possibly due to highly constrained randomization algorithms that maintain numerous aspects of the initial CDM.
rank test). Gray bars summarize the mean type II error rates from the habitat filtering and competitive exclusion simulations. Blue bars provide an indication of the success of each approach, and are defined as one minus the mean type I and II error rates. Metrics and nulls are arranged in order from overall best-performing to worst, with the best approaches in the bottom left corner.

[Figure panel: 2x null model; axis label: Randomized MPD values]