Neither global nor consistent: a technical comment on the tree diversity-productivity analysis of Liang et al. (2016)

The publication of Liang et al. (2016, Science) seems to demonstrate very clearly that increasing tree species richness substantially increases forest productivity. To combine data from very different ecoregions, the authors constructed a relative measure of tree species richness. This relative richness however confounds plot-level tree species richness and the polar-tropical gradient of tree species richness. We re-analysed their orginal data, computing a regional measure of tree species richness and addressing several other issues in their analysis. We find that there is virtually no effect of relative tree species richness on productivity when computing species richness at the local scale. Also, different ecoregions have very different relationships between tree species richness and productivity. Thus, neither the “global” consistency nor the actual effect can be confirmed.

Observational studies of the species-richness-ecosystem functioning relationship are correlational and not causal. Observing a consistent correlative e ect of tree species richness on forest productivity (TSP-P) across a wide range of biomes, however, would make this relationship a reliable principle on which to build recommendations for forestry and forest conservation. The publication of Liang et al. (2016) report on such a consistent, global TSP-P-relationship, based on forest inventory data from over 600,000 plots in 12 di erent ecoregions.
There are several logical and technical aws in their analysis; correcting them also removes the reported global relationship and thus its role as principle for forest management. We identi ed the following problems with the original analysis: (1) The authors computed "relative tree species richness" as proportion of the maximum value. Thereby it represents a gradient from boreal to tropical plots, rather than in local species richness. When instead computing species richness relative to the maximum value in the region the e ect of species richness on productivity is dramatically reduced. (2) Plots are overwhelmingly from temperate forest; indeed only some 2500 plots are from the tropics (equivalent to 0.4%), despite these forests representing around 30% of the world's forest. Stratifying the plots accordingly weakens the TSR-P-relationship. (3) In the spatial regression model, distances between plots were computed without taking the spherical nature of earth into account. This had little e ect on the slope estimate of the TSR-P-relationship. (4) The computational burden of the spatial model required subsampling the data to 500 data points. The authors did not correctly compute con dence intervals for this approach, wrongly interpreting subsampling as bootstrapping and additionally incorrectly computing bootstrap standard errors. A correct subsampling-based estimation led to approximate trippling of the reported con dence interval. (5) As noted earlier (Schulze et al., 2018), some 4% of the plots had productivity values (far) beyond what is biologically plausible (Stape et al., 2010). The likely reason is that small plots with large inventory errors in the productivity may lead to erratically high values. Not taking this into account in the analysis, e.g. by down-weighting plots with productivities above 30 m 2 ha −1 y −1 at least indicates an unre ected use of data. It also leads to an overestimation of the absolute productivity. Whether that is the main reason for productivity increases dramatically higher than any reported from experimental setups (Zhang et al., 2012a;Vilà et al., 2013;Huang et al., 2018;Jactel et al., 2018; Ammer, 2019) is unclear.  Figure 2: Biome-speci c observed tree-species richness-productivity relationships (and 95% con dence interval). (Left) Species richness expressed as percentage of regional maximum. (Right) Productivity as changing with absolute species richness. Note that the shape of the curves can change when re-scaling species richness, as the values enter the analysis log-transformed.
Combining corrections for these points led to a new global analysis, with a negligible e ect of tree species richness on a site's volumetric productivity (Fig. 1 left; see supplementary material for the full analysis including R-code). Furthermore, tting the model to the 11 ecoregions with sucient data separately shows that the global model is not consistent and provides a poor description of the ecoregions' speci c TSR-P-relationships ( Fig. 1 right). Also, there is no obvious bene t of scaling species richness, as the curves are more di cult to interpret when displayed as percentage of some maximal value (Fig. 2). Indeed, using a relative scale makes the species richness-e ect (the θ-estimate) harder to compare across biomes and insigni cant in several cases.
In summary, the analysis presented by Liang et al. (2016) is awed in several respects, leading to a spuriously strong e ect of tree species richness on forest productivity at the global level. While a re-analysis correlatively con rms a positive TSR-P-relationship for most ecoregion, these e ects vary substantially in their strength and should be examined for each system separately. Overall, the original publication has thus added very little to our understanding, as both positive and biomespeci c e ects of species richness were already known before their publication (see, e.g., Whittaker & Heegaard, 2003;Vilà et al., 2007;Paquette & Messier, 2010;Whittaker, 2010;Zhang et al., 2012b), and the global consistency could not be upheld.
Appendix: Reproducible analysis with comments and conclusions for each step

Introduction
The positive e ect of species richness on various measure of ecosystem functioning is generally undisputed (Cardinale et al., 2012), even though this e ect levels o fairly rapidly as the number of species beyond "few" (Cardinale et al., 2006;Isbell et al., 2011). Such fundamental understanding does not necessarily translate into applicable knowledge, and hence the publication of Liang et al. (2016) is a landmark step in converting academic biodiversity-ecosystem functioning into on-theground forest management. They claim to describe a globally applicable relationship between tree species richness and forest productivity, which, if correct, could guide forest management around the world without speci c understanding of local or regional speci cs.
On re-analysing the excellent data base compiled by Liang et al. (2016), we were originally curious to investigate an unanswered question, namely whether the global tree-richness-productivityrelationship is actually anywhere near as good as one built, on the same data, for a speci c biome. Furthermore, we were also concerned with describing a partial dependence plot for the e ect of tree species richness, rather than the conditional plot presented by Liang et al. (2016); the di erence being that in their plot, the environmental predictors used in the model (such as basal area, temperature, precipitation and elevation) must be known, while in a partial dependency plot the marginal e ect of tree species richness, independent of the actual local setting, would be depicted.
During the analysis, we realised that these initial questions were of secondary importance, as we discovered conceptual and statistical issues that cast substantial doubt on the results presented by Liang et al. (2016). Since the rst author initially did not provide his code, we had to rely on the method description and guessing to reproduce the analysis. 1 From the reconstruction we modi ed the original model to correct for some mistakes, as detailed below, to derive our new version of the global relationship. In the next step, we predicted from the global model to the average conditions of each biome, and compared this prediction with those of a biome-speci c model from the same data base.
In the following, we present rst the original analysis, then the modi cations we introduce. We close with a new global and a biome-speci c analysis.

B
Reproducing the original tree richness-productivity relationship

B.1 Data preparation
The data are available under https://figshare.com/articles/GFB1_data_figshare_xlsx/ 4286552 for all countries except New Zealand, whose data are available from https://datastore. landcareresearch.co.nz/dataset/new-zealand-forest-plot-data-in-global-forestbiodiversity-dataset, both as Excel les. Upon downloading and merging the les, here is a summary of its content (in the following, code used for checking is outcommented): l i b r a r y ( r e a d x l ) b p r _ d a t a _ g l o b a l ← r e a d _ e x c e l ( " . . / L i a n g 2 0 1 6 S c i e n c e _ G F B 1 _ d a t a _ f i g s h a r e . x l s x " ) # dim(bpr_data_global) # check dimensions: 773100 x 21 # summary(bpr_data_global) b p r _ d a t a _ n z ← r e a d _ e x c e l ( " . . / g f b 1 d a t a n z . x l s x " , s h e e t = 2 ) The meaning of the di erent column titles is provided in the meta-data to the gshare download. Most importantly, the response, productivity [m 3 ha −1 y −1 ] is given in column P, and the number of tree species observed in a plot is given in column S. As we can see from the summary, productivity ranges from −360 to 1090 m 3 ha −1 y −1 , with most data between 1 and 9 m 3 ha −1 y −1 . Species numbers range from 0 to 405 (we cannot provide a density here, as plot sizes are not provided and range from < 100 to > 1100 m 2 , only the 95% sampling interval was provided in the paper), with most plots between 2 and 7 species. Liang et al. (2016) excluded plots with harvest volume > 50% of stocking volume (page 5), but do not provide the IDs of these plots. We thus excluded all plots with negative productivity and also those with productivity of 0 (as our model including them has a very di erent AIC to those reported by the authors).
Additionally, we remove plots with a species richness of 0, basal area of 0, those with a T3-value of 0 (temperature seasonality as standard deviation times 1000, making a value of 0 indicating no data, rather than no seasonality), IAA (indexed annual aridity times 10 −4 )-values of 0, and manually assigned four plots with an ecoregion value > 13 based on their coordinates in googleEarth. (Although there are 14 ecoregions in WWF's de nition, the 14th in this data set was wrongly coded in the data and in fact plots were in rainforests.)  Some 1139 plots were recorded as ecoregion 0, which, when plotted, were distributed along the coasts across the world, i.e. belonging to very di erent ecoregions. We have no idea why these are classi ed as "0", but excluded them rather than manually and labouriously assigning them to appropriate ecoregions. Some plots were recorded repeatedly in the database (e.g. plot 0 occurs 370 times), but we kept them all in the data, as FIDs were unique, and so are their geographical coordinates. At the end we are left with a data set of 636616 rows, heavily relying on the forest inventory data of temperate forests (ecoregions 4 and 5 comprise 615576 plots, i.e. 97% of all data points). Since for tundra, ecoregion 11, only 9 data points are available, we exclude it from the biome-speci c analyses lateron.
Before analysing these data, let's have a look at them, separately for each ecoregion: It is visible, at rst glance, that these data are suspicious. The highest volume productivity ever measured in situ (rather than estimated from inventories), in a fertilised Eucalypt plantation in Brazil, was around 30 m 3 ha −1 y −1 (Stape et al., 2010). In this data set, the highest value is over 1000 m 3 ha −1 y −1 , and there are 23561 points with values larger than 30 m 3 ha −1 y −1 (over 15000 of those are from the US, but Germany, France and Japan also contribute thousands of data points, respectively): sum ( b p r _ d a t a $P > 3 0 ) [1] 23561 t a b l e ( b p r _ d a t a $ Country [ which ( b p r _ d a t a $P > 3 0 ) ] ) Moreover, as the last gure shows, some unlikely ecosystems contributed to high productivity sites (even after excluding obviously wrong data points): tundra, Mediterranean forest and deserts contributed dozens to thousands of plots with productivity exceeding 10 m 3 ha −1 y −1 . It is possible that the authors did remove some of these plots before the analysis without mentioning it explicitly in the methods (p. 5: "Intensively managed forests with harvests exceeding 50 percent of the stocking volume were excluded from this study. "; there is no indication in the data, which plots are concerned). 2 Nevertheless, as we are able to reproduce the analysis to a large extent (see below), we assume that many of these data points were still in the data set analysed.
B.2 Spatial model Liang et al. (2016) employ a Generalised Least Squares (GLS) approach to accommodate spatial autocorrelation. Since the GLS is rather time-consuming, and it cannot handle data sets beyond a few thousand data points (depending on computer memory), they took random subsamples of 500 plots from the above data and ran the GLS on that subsample. 3 For each such subsample, a new variable Š was computed as Š i = Si max(S subsample ) . The actual model is at the log-log-scale (i.e. both P and Š were log-transformed). Š is interpreted as "relative species richness" and is claimed to "facilitate inter-biome comparison" (p. 7). 4 So the actual steps of the analysis are: 1. Draw a random subsample of 500 data points (without replacement).
4. Run a GLS with log(P) as response and log(Š), G (basal area), T3 (standard deviation of temperature), C1 (annual precipitation), C3 (precipitation of warmest quarter), PET (potential evapotranspiration), IAA (indexed annual aridity) and E (elevation) as predictors. 5 The spatial distances are implicitly computed from geographic coordinates using the corSpher correlation structure. 6 Since coordinates of the released data were rounded, we jitter them slightly to avoid distances of zero.

Store model estimates and repeat.
While Liang et al. (2016) run 10000 bootstraps, we only do 50, as the results do not change substantially anymore after that. (Note that our code is not made for speed but for intellegibility.) Also, as the authors kept the nugget and range xed in their nal run, we use these values here (0.8 and 50, respectively). As illustration, here is a single model run: R-code of (the main part of) their analysis avaiable under https://github.com/jingjingliang2018/GFB1/blob/ master/R_script_Github10312008.R. In that code it is clear that plots with productivity larger than 533 m 3 ha −1 yr −1 and with species richness larger than 270 species were removed. The reasoning behind these values is not communicated in that document. Going through their code, we are pleased to notice that we have correctly interpreted their methods section and that our points of content are not mere misunderstandings. 3 Liang et al. (2016) invent the name "spatial random forest" for this procedure, which is rather misleading. First, the randomForest algorithm (Breiman, 2001) uses classi cation and regression trees as building blocks, hence the name "forest" as a large group of "trees". Second, in randomForest the actual model is altered in each tree, by randomly selecting the variables trialed at each node. Third, while randomForest samples data points, it does so with replacement, i.e. as a bootstrap. As a result, the number of data points are the same as the full data set in each tree. What the subsampling approach of Liang et al. (2016) and the randomForest have in common is the use of aggregating many models. Since these are not actual bootstraps (because Liang et al. subsample), this is not proper "bagging" either. Thus, the term "spatial random forest" is misleading and suggests a sampling theory which actually does not apply to this approach. Accordingly, we shun this term, and also will not use the term "bootstrapping" unless sampling is carried out with replacement.
The part after setting the seed will be repeated below, and from the summary we only keep the coe cient estimates.  Table 2, with the exception of the standard errors for the model coe cients, which Liang computed as incorrectly as standard deviation divided by the square root of the number of replicate models: correctly computed, the standard deviation of a bootstrap is the standard error of the statistic (here the goodness of t measures and the coe cients). Hence, the row "SE" in the authors' table 2 must be multiplied by 100 to get to the correct values. 7 Finally, we plot this relationship with the "correct" error bars. As parameter estimates are correlated, we cannot simply use the above standard errors to compute an error interval around the line. Instead, we predict with each of the 50 parameter sets the curve and compute 95% quantiles from them as con dence intervals. 8 b e t a ← c o e f M a t [ , 5 : 1 3 ] # set all values but log(Sbreve) to their mean: X ← a s . m a t r i x ( d a t a . f r a m e ( I n t e r c e p t = 1 , S b r e v e = l o g ( s e q ( 0 . 0 1 , 1 , by = 0 . 0 1 ) ) , t ( c o l M e a n s ( b p r _ d a t a [ , c ( "G" , " T3 " , " C1 " , " C3 " , " PET " , " IAA " , " E " ) ] ) ) ) ) p r e d s ← X % * % t ( b e t a ) p a r ( mar = c ( 5 , 4 , 1 , 1 ) ) p l o t ( 1 : 1 0 0 , exp ( rowMeans ( p r e d s ) ) , t y p e = " l " , l a s = 1 , lwd = 3 , y l i m = c ( 0 , 1 0 ) , x l a b = " r e l a t i v e s p e c i e s r i c h n e s s [ % ] " , y l a b = " p r o d u c t i v i t y " ) b e t a L i a n g ← c ( 3 . 8 1 6 , 0 . 2 6 2 5 , 0 . 0 1 4 6 , −0.00011 , 0 . 0 0 1 6 , 0 . 0 0 1 7 4 , −0 . 0 0 25 6 6 , −0 . 0 0 01 3 4 , − 0 . 0 0 0 8 0 9 ) X l i a n g ← X X l i a n g [ , 2 ] ← l o g ( s e q ( 0 . 0 1 , 1 , by = 0 . 0 1 ) * 1 0 0 ) # uses %, not fraction l i a n g L i n e ← X l i a n g % * % b e t a L i a n g l i n e s ( 1 : 1 0 0 , exp ( l i a n g L i n e ) , c o l = " r e d " , lwd = 2 ) l e g e n d ( " t o p l e f t " , l e g e n d = c ( " L i a n g " , " r e − a n a l y s i s " ) , c o l = c ( " r e d " , " b l a c k " ) , lwd = 2 , l t y = 1 , b t y = " n " , c e x = 1 . 5 ) We were thus able to reproduce the analysis of Liang et al. relatively faithfully, at least with respect to the expectation (we shall return to the error envelop in due course). 7 We shall return to the computation of error bars in a few sections. For now, please note that the classical boostrap's idea is to use the entire data set when bootstrapping; only then will the bootstrapped standard errors be correct. Using the bootstrap on subsamples, in the way done by Liang et al. (2016), does not constitute a valid statistical approach for computing standard errors for the full data set, neither in their way, nor in the way presented here (with the multiplication by 100). What this subsampling-bootstrapped standard errors deliver is an estimate of where the line would be with 500 data points, not for the full data set. There are subsampling procedures (see Politis et al., 1999), but they require additional corrections on top of the subsampling itself. 8 We do not know how Liang computed these, as he did not reply to our emails on this topic.
Since their log-likelihood is 50 units higher, we suspect that in their analysis further plots were excluded.

Corrections and improvements
In summary, we have been able to reproduce the analysis of Liang et al. (2016).
During this rst step, we may already have noticed some issues with the analysis. Here is our list of things we improve on in the next sections, starting with the most important ones rst: 1. Evaluate the e ect of unreasonably large values of productivity on the result.
2. Aggregating analyses of subsets, which comprise only a permille of the full data set, may introduce a bias relative to the full dataset. The mechanism behind this bias is the same that we criticised as rst point, i.e. the computation of Š on the basis of highest tree species richness in the subset, thereby ranging Š di erently for each subset.
3. Minor point: The GLS can easily accommodate more than 500 points, so we can run subsamples of 1000, 2000 and 5000 points. This will slightly increase the chance of plots being relatively close to each other, and hence allows for a better estimation of the short-range spatial autocorrelation, possibly improving the estimates for nugget and range of the variogram.
4. Strati ed draws of plots in proportion to the area this ecoregion's forests cover on the world. Currently only some 2500 points (out of 636000) are representing the tropical forests (labelled 1 and 2 below), despite their proportion of the world's forests being over 30%.
6. Calculate relative species richness Š relative to what is the maximal species richness in that region. Currently, the de nition of relative species richness is not intuitive. Since only the tropics have plots with dozens to hundreds of species, Š is e ectively representing a gradient from cold to tropical plots. Thus, the so-called species richness gradient is in fact a latitudinal gradient. This is, in our opinion, a serious problem and leads to fundamental misinterpretation of the results. We propose to compute Š di erently, namely relative to what is possible for a given plot. We could choose the maximal value for an ecoregion to scale tree species richness of each plot, or even more locally, the maximal species richness observed in, say, 100 km radius around a plot. This would lead to plots from any ecoregion being able to occupy the upper end of the relative species richness gradient, if they are species rich relative to their wider neighbourhood. 10 In the following, we demonstrate the e ect of each of these improvements separately, always comparing it to our analysis of the data presented above. Arguably, as the data set changed, we may have to re-estimate the coe cients of the spatial autocorrelation, which we did not.

C.1 Do implausible productivity values distort the analysis?
As shown above, some 23,000 data points, around 4% of the entire data set, have productivity values above the largest measured in the eld. The probably reason is that forest inventories measure with a certain error, and if the error is large relative to the increase in biomass, it is easy to get unrealistically high productivity estimates, particularly in small plots with harvesting.
There are, essentially, two ways to accommodate values that are clearly implausible: (i) omit them, or (ii) give the lower weight in the model. We only investigate the e ect of omitting all values where P>30 as the strongest, not necessarily most appropriate, way. ) X l i a n g ← X X l i a n g [ , 2 ] ← l o g ( s e q ( 0 . 0 1 , 1 , by = 0 . 0 1 ) * 1 0 0 ) # uses %, not fraction l i a n g L i n e ← X l i a n g % * % b e t a L i a n g l i n e s ( 1 : 1 0 0 , exp ( l i a n g L i n e ) , c o l = " r e d " , lwd = 2 ) l e g e n d ( " t o p l e f t " , l e g e n d = c ( " L i a n g " , " P > 30 o m i t t e d " ) , c o l = c ( " r e d " , " b l a c k " ) , lwd = 2 , l t y = 1 , b t y = " n " , c e x = 1 . 5 ) C.2 Aggregated subsets vs using the entire data set: do parameter estimates match?
As sample size of a regression problem increases, standard errors of model parameters decrease, typically as a function of √ n. Thus, the 500-data point subset is likely to have much wider error bars than an analysis of the full 636616 data points. How much wider is somewhat di cult to estimate, as data are spatially autocorrelated, making the √ n an optimistic estimate. Following the logic of data cloning (Lele et al., 2010), who show that repeating their data set 100 times yields √ 100 = 10 times to narrow standard errors, one can argue that the standard error computed from the subset-models will be √ 636616/ √ 500 = 34.6 times too wide. We cannot test this idea on the GLS, as we have no computer at our disposal that can invert a 600000 × 600000 matrix as required by the GLS. Thus, we cannot t a GLS on the full data set (otherwise Liang et al. would have done that, too). Instead, we use the non-spatial linear model, which is biased due to spatial autocorrelation, as we shall see. Still, if our 35-fold correction worked for the non-spatial regression (the ordinary least square: OLS), we would be happy to use it for the GLS.  Resampling OLS and full data set analysis yield similar parameter estimates (apart from the intercept). As result, the estimated e ect of increasing "relative species richness" is substantially lower in the subsampling approach than in the full model. We are con dent that this e ect will be similarly present in the GLS analysis and return to this issue in the section on increasing sample size of the subsampling.

C.2.1 Scaling standard errors from subsamples to full data
To take the subsampling correction a step further, we now run subsamples of di erent size and try to identify a relationship between their (subsampling-derived) standard error forθ and size of the subsample, B. As it happens, this relationship is best presented as a straight line on a log-log-plot. To apply this to the GLS, we have to make the assumption that the same proportionality holds for the GLS as for the OLS. If so, we can compute the standard error of the GLS, based on the subsampling of size B = 500, as follows (as a simple rule of three): This is (obviously) dramatically better than the "bootstrap" estimate presented above, but still thrice the 0.0009 of Liang et al. (2016).

C.2.2 Increasing the size of the subsamples for the GLS
As another step, we can brie y check whether increasing the size of the subsample from 500 to 5000 for the GLS makes an appreciable di erence. The estimate of θ and most other model parameters is similar for sample size 500 and 5000, while the intercept and the model's R 2 goes down. What does that mean?
1. Using small subsamples of a large data set leads to dramatic increases in parameter uncertainty. Thus, the (corrected) "bootstrap" approach of Liang et al. (2016) is likely to be very pessimistic. Since they did an error in the computation of their standard errors, the uncertainty limits provided in the original paper cannot be trusted.
2. Using the full data set led to very di erent estimates for the model (comparing OLS for the resampled and the full data set, i.e. the grey and the blue curves above). By analogy, also the resampling-based estimates of the GLS are likely to be substantially biased. 12 3. Correction for spatial autocorrelation leads to lower estimates for productivity, roughly 80% of the expected value of P along the tree species richness gradient (orange line).
Although we have introduced a way to estimate the uncertainty around the predicted relationship from the data by approximation in the previous section, we have not used it until the very end.

C.3 Strati ed sampling of subsets
Each ecoregion has a certain share of the terrestrial surface, but within each ecoregion, not all area is forested. Crowther et al. (2015) provide estimates of the total land area as well as the number of trees for each ecoregion. We use the latter as means of strati cation, i.e. draw plots proportional to the number of trees in that ecoregion.

Liang stratified
The e ect is moderate, with slightly lower values than the original non-strati ed approach. This result suggests that also with non-strati ed sampling always some tropical plots with high species richness are drawn, making the original Š robust to unrepresentative sampling.

C.4 Compute distances between plots as Great Circle distances, rather than Euclidean
(thereby accommodating the spherical nature of earth) In the spatial model, the GLS, the distances between plots are computed from x-y-coordinates as Euclidean distance. 13 On earth, the distance between two points cannot be computed as Euclidean distance, and instead requires the computation of the so-called Great Circle distance. 14 We employ the "halversine" correlation structure provided on stackexchange to implement a correct distance measure within GLS. 15 We de ne the model and run it repeatedly. In this case, the correct representation of spatial distances increases overall productivity estimates a bit, but leaves the TSP-P-relationship unchanged.

C.5 De ne species richness relative to what is regionally possible
The de nition of Š in the subsets is relative to the largest value in the subset. When analysing the full data set, by the same logic, all tree species richness values are a fraction of the most diverse plot with 405 species. Thus, the reported relationship must be read as: "If we increase whatever number of species we currently have in a plot to 400, then we are moving on this curve. " Ecologically, this makes no sense. Whatever the reason why there are only a handful species in a plot of 1/10th of a hectare in temperate and boreal forests, it also prevents a "tropic richness" in these sites. Thus, one cannot meaningfully increase species richness outside the tropics to a tropical level. As a consequence, the depicted gure lacks ecological interpretability. What the reader, and we suspect also some of the authors, probably interpret into the x-axis is "species richness relative to what would be possible at this site". There are di erent ways to de ne "what is possible at this site": (a) relative to what the maximum recorded value for this biome is, (b) relative to the richness of plots in the vicinity, (c) relative to the number of trees in the local/regional species pool, and possibly others. In the following, we use approach (b), relative to other plots in the region. The biome is, in our opinion, too large, while the third approach would require an analysis far beyond what current data allow us to do.
We de ne the "region" around a plot as being within a raster cell of 100 × 100 km 2 .

Counts
The local relative species richness (Š local ) shows substantially more variability for low-richness plots. Now we can repeat the above analysis with a di erent relative tree species richness as response: Representing species richness relative to what is possible in that region substantially reduces the e ect of species richness on productivity, as well as the error margin of this relationship.
To repeat the message of this last plot: When moving from a plot that only comprises, say, 10% of the potential local tree species to one that contains all 100%, productivity increases from around 2.2 to 4.5 m 3 ha −1 y −1 (much less than the increase from 2.5 to 8 m 3 ha −1 y −1 reported by Liang et al. 2016).
C.6 Use relative, rather than absolute, productivity It appears a bit odd to scale species richness relativ to what is locally possible, but not also productivity. Tropical forests can be expected to be far more productive than boreal ones, but the e ect of relative species richness on relative productivity may be similar.
To explore this idea, we follow the same approach as with local relative species richness and compute local relative productivity, P local . This we then analyse with the local relative species richness (Š local ).
Again we de ne the "region" around a plot as being within a raster cell of 100 × 100 km 2 . This does not look like an improvement, as the model t is poorer (R 2 of 0.135 compared to 0.34). Interestingly, however, (a) the e ect of species richness is stronger (larger value of θ); and (b) even at maximal species richness only 10% of maximal productivity is reached. Note that we used the local relative species richness ((S) local ); the model with the original Š is similarly bad, and with a lower value for θ (around 0.213). While this may actually be interesting in itself, we do not pursue this line of thought further here.

A new analysis
In summary, our explorations above have shown substantial e ects of (in that order) 1. computing Š relative to maximal local species richness; 2. stratifying the data proportional to an ecoregion's share of global forest cover; 3. computing spatial distances on a sphere; 4. strati cation of samples by ecoregion forest cover; 5. correcting for spatial autocorrelation; 6. correcting standard errors for subsampling.
The relative scaling of productivity did not improve the global model and is not further considered here.
It is quite possible that these alterations of the original model interact, i.e. that the substantial bias we have seen introduced by strati cation may be caused by the way Š is computed, and would be removed with a locally computed Š local . Exploring all these option combinations is beyond the scope of this re-analysis.
So, in a nutshell, the new global model uses a locally-scaled species richness as predictor, computes spatial distances on a sphere when correcting for spatial autocorrelation in the data and strati es sampling by each ecoregion's forest cover.    Here we add the correction for subsampling, speci cally computing SDs of the predictions from the subsamples and correcting them to the full data. We rst do that for the ts provided by the re-analysis in the original form. As we can clearly see, the pattern observed by Liang et al. (2016) does not hold up to scrutiny when correcting the mistakes of their analysis. The overall e ect of increasing species richness to the maximum possible at that site is negligible and levels o at less than 10%. 16 The absolute levels of productivity presented in the above gure are conditional on the values choosen for environmental covariates and are thus not interpretable "globally". To do so would require a marginal, rather than a conditional approach. As already the conditional e ect at the mean value of the covariates is practically absent, so would probably be the marginal. 17 Although that was one of our initial questions, it is not worth pursuing under these conditions. p r e d s R e g S ← XReg % * % t ( b e t a R e g ) predsRegMeansS ← rowMeans ( p r e d s R e g S , n a . r m =T ) C I s d R e g S ← a p p l y ( p r e d s R e g S , 1 , sd , n a . r m =T ) C I s d c o r r R e g S ← exp ( l o g ( 5 0 0 ) / l o g ( s a m p l e S i z e [ names ( s a m p l e S i z e ) == i ] ) * l o g ( C I s d R e g S ) ) } CIsNewRegS ← r b i n d ( predsRegMeansS + 2 * C I s d c o r r R e g S , predsRegMeansS − 2 * C I s d c o r r R e g S ) y l i m U p p e r ← c e i l i n g ( exp ( max ( CIsNewRegS ) ) ) p l o t ( 1 : Smax , exp ( predsRegMeansS ) , t y p e = " n " , l a s = 1 , y l i m = c ( 0 , y l i m U p p e r ) , main= " " , x l a b = " " , y l a b = " " ) p o l y g o n ( c ( 1 : Smax , Smax : 1 ) , c ( exp ( CIsNewRegS [ 1 , ]  As a summary, we can plot the estimates and their 95%-con dence interval for each ecoregion. We see that in almost all ecoregions higher tree species richness is correlated with higher productivity. We also see the common pattern of a quick decrease of a distinguishable diversity e ect beyond a few species (in tropical moist forests, "few" in this case means a few dozen, while in all other systems "few" means less than 10). 18

F Conclusion
The "global" relationship between "relative tree species richness" and productivity reported by Liang et al. (2016) hinges on an analysis with several logical and technical aws. Removing them yields essentially a at line and no such global relationship. It is entirely possible that there are mistakes in the above analysis, and that is why we provide this document. 19 This relationship is not globally consistent; rather, ecosystems vary substantially, but most show a positive tree-species richness-productivity relationship. Analysing the data with a relative species richness axis provides no bene t or insight when compared to absolute species richnes analyses.