Generating and testing hypotheses about the fossil record of insect herbivory with a theoretical ecospace

A typical fossil flora examined for insect herbivory contains a few hundred leaves and a dozen or two insect damage types. Paleontologists employ a wide variety of metrics to assess differences in herbivory among assemblages: damage type diversity, intensity (the proportion of leaves, or of leaf surface area, with insect damage), the evenness of diversity, and comparisons of the evenness and diversity of the flora to the evenness and diversity of damage types. Although the number of metrics calculated is quite large, given the amount of data that is usually available, the study of insect herbivory in the fossil record still lacks a quantitative framework that can be used to distinguish among different causes of increased insect herbivory and to generate null hypotheses of the magnitude of changes in insect herbivory over time. Moreover, estimates of damage type diversity, the most common metric, are generated with inconsistent sampling standardization routines. Here we demonstrate that coverage-based rarefaction yields valid, reliable estimates of damage type diversity that are robust to differences among floral assemblages in the number of leaves examined, average leaf surface area, and the inclusion of plant organs other than leaves such as seeds and axes. We outline the potential of a theoretical ecospace that combines various metrics to distinguish between potential causes of increased herbivory. We close with a discussion of the most appropriate uses of a theoretical ecospace for insect herbivory, with the overlapping damage type diversities of Paleozoic gymnosperms and Cenozoic angiosperms as a brief case study.


28
In recent years, the number of fossil plant assemblages examined for insect herbivory has increased markedly. 29 The wealth of available data has already been used to inform a variety of biotic and abiotic phenomena equals 1 (Figure 2). Thus, with coverage-based rarefaction, damage type datasets are rarefied not to a 176 particular number of leaves but to a particular slope of the rarefaction curve. If a high proportion of damage 177 types are observed on only one specimen, this indicates that coverage is relatively low. In contrast, if a low 178 proportion of damage types are observed on only one specimen, coverage is relatively high. Coverage-based 179 rarefaction is performed by removing the rarest damage types from the dataset until coverage is reduced to 180 a predetermined threshold such as 0.8 (Figure 3).  Labandeira et al. (2018). For each assemblage, the sample coverage for the complete dataset is listed in the title. The lightest gray boxes denote the damage type diversity at each assemblage when rarefied to a sample coverage of 0.8, and the darker gray boxes denote damage type diversity rarefied to a sample coverage of 0.7. The assemblages are organized from the lowest to highest levels of sample coverage. Each column of symbols represents a damage type, and the number of times each symbol is illustrated represents the number of specimens on which the damage type was observed. (A) has relatively low sample coverage because the majority of damage types are observed on only one specimen. In contrast, (E) has relatively high sample coverage because so few damage types are observed on only one specimen. same research group, with some workers using a wider definition of foliage that includes needles, liverworts, 210 phyllids, photosynthetic wings of seeds, and even flattened horsetail axes (Prevec et al., 2009;Labandeira 211 et al., 2018), and others using a narrower definition restricted to multi-veined broad leaves or leaves with a 212 defined midvein (Schachat et al., 2014(Schachat et al., , 2015(Schachat et al., , 2018(Schachat et al., , 2020. (For additional information, see "criteria for 213 inclusion of leaves and damage types" in the supplemental information.)

214
Because non-broadleaf specimens typically contain little or no evidence of insect herbivory, they contribute 215 little or nothing to metrics of damage type sampling completeness and will therefore have minimal or no 216 impact on estimates of damage type diversity at an assemblage. And, because some herbivory datasets code 217 plants as morphotypes only, with no information on whether the morphotype represents a dicot or a conifer 218 (Maccracken, 2020), these datasets cannot be analyzed with traditional size-based rarefaction. However, 219 these datasets can indeed be analyzed with coverage-based rarefaction. Plant organs that contain little or no 220 insect damage, such as axes and pine needles, contribute little or nothing to estimates of sample coverage, 221 and thus their inclusion in coverage-based damage diversity estimates does not bias the estimates downward.

222
A remaining question is the confidence level that should be used in confidence intervals for rarefied 223 damage type diversity. Schachat et al. (2018) used 84% confidence intervals, rather than 95%, because the 224 comparison of two curves with 84% confidence intervals yields a Type I error rate below 5%. However, the 225 confidence intervals generated with the iNEXT package rely in part on extrapolated diversity, which is shown 226 above to be biased by sample size in the case of damage type diversity datasets. As a result, the confidence 227 intervals generated by iNEXT are narrower than those generated with the iterative procedure of Schachat is no predetermined number of leaves or amount of surface area at which an assemblage is assured to have 237 sufficient coverage.

238
The impossibility of assessing all assemblages that have been sampled to a certain size, such as 400 leaves 239 or 1000 cm 2 of surface area, is perhaps the only disadvantage of using coverage-based rarefaction instead of 240 size-based rarefaction. Five of the assemblages in Tables 1 and 2  Boesmanshkoek 112 assemblage from the same Triassic study contains only two damage types, but these 248 occur frequently enough that the assemblage has sufficient coverage. Thirty-nine of the 44 assemblages 249 with fewer than 1,000 leaves listed in Table 3 have sufficient coverage as well. Of the ten assemblages with

253
The best way to handle assemblages with insufficient coverage depends on the number of leaves examined 254 and the temporal scope of the study. In the case of assemblages with only a few hundred leaves, estimates 255 of damage type diversity comparable to those generated with coverage-based simply cannot be calculated.

256
For studies such as that of Labandeira et al. (2002b) with a limited temporal scope and a large quantity of 257 assemblages, many of which contain small numbers of leaves, the exclusion of assemblages that do not reach 258 sufficient sample coverage can decrease the power of the study to detect changes in herbivory through time 259 by excluding assemblages whose low sample coverage may be due to insufficient sampling or a true scarcity 260 of damage types. In cases such as this, size-based rarefaction curves scaled by the amount of surface area 261 sampled rather than the number of leaves sampled (Schachat et al., 2018) are the most appropriate choice.

262
These size-based estimates of damage type diversity cannot be incorporated into meta-analyses that use 263 coverage-based rarefaction. We conducted simulations to evaluate extrapolated estimates of damage type 264 diversity in cases when the raw dataset does not reach a sample coverage of 0.8, and found that extrapolated 265 estimates are not reliable and are often invalid ( Figure S4). specimen; these data are rarely collected in studies of fossil herbivory (Robledo et al., 2018;Ma et al., 2020).

276
The first additional dimension is the prevalence of external foliage feeding, a functional feeding group 277 containing generalist modes of herbivory in which an insect typically chews on a leaf. The proportion of 278 feeding occurrences belonging to external foliage feeding-as opposed to piercing and sucking or specialized 279 herbivory such as mining and galling-is perhaps the metric with the greatest potential to distinguish between 280 insect evolution and nutrient dilution, the two most common explanations of increased herbivory invoked by 281 paleontologists.

282
When nutrient dilution occurs, the prevalence of piercing-and-sucking feeding damage-whether measured 283 as number of damage types, percentage of leaf area damaged, or number of feeding occurrences-will likely 284 remain constant. This is because piercing-and-sucking insects often feed by puncturing individual phloem 285 cells (Will et al., 2013). Whereas the number of phloem cells may increase in response to nutrient dilution, 286 the content of each cell and the pressure within each cell will most likely remain the same. Therefore, 287 piercing-and-sucking insects that feed on phloem will not need to feed more in order to meet their nutritional 288 requirements. Similarly, when nutrient dilution occurs, the number of specialized mining and galling damage 289 types is unlikely to increase, as is the number of occurrences of these damage types. within the confidence interval for rarefied floral diversity, dividing the former by the latter, and iterating this 328 process 1,000 times. The number of specimens required to rarefy floral diversity to sample coverage of 0.8 329 may differ from the number of specimens required to rarefy damage type diversity to sample coverage of 0.8.

330
Lastly, the prevalences of each functional feeding group within an assemblage can comprise yet more 331 13 dimensions of the ecospace. Ideally, the prevalence of each functional feeding group would be quantified by 332 the absolute and relative amounts of herbivorized surface area that it represents. However, as noted above, 333 surface area measurements are unavailable for nearly all herbivory datasets. Thus, the proportion of damage 334 type occurrences attributable to each functional feeding group can instead be used as a proxy. Confidence 335 intervals for the proportions of either herbivorized area or damage type occurrences can be calculated by 336 iteratively resampling with replacement as discussed above.

337
Additional measures, such as bipartite network metrics (sensu Blüthgen et al., 2008) and host specificity, 338 require so much data that it may not be possible to precisely and accurately estimate them with a typical 339 fossil herbivory dataset. This is further discussed in the supplemental material ( Figure S6; "evaluation of 340 other metrics"). The ability to determine a priori how much sampling is needed for a given assemblage-or whether an but is much more difficult to estimate prior to data collection.

360
One advantage of coverage-based rarefaction is that it can be integrated with a "stopping rule". In the case 361 14 of the sample coverage threshold of 0.8 used here, sample coverage can be re-calculated whenever new data 362 are added to a dataset and sampling can be considered sufficient when the threshold of 0.8 has been reached.

363
Although stoppage of data collection at this threshold would allow a maximum number of assemblages to be 364 examined and compared, it is worth noting that the confidence intervals surrounding damage type diversity 365 and the herbivory index will narrow if sampling continues beyond a sample coverage of 0.8.

366
As discussed in a recent contribution, estimates of damage type diversity require more sampling than The study of insect herbivory in the fossil record has become vastly more popular over the past few decades.

385
Early studies that examined patterns of insect herbivory caused by a major event, such as a mass extinction, 386 had only one available option: to quantify the magnitude of the differences in insect herbivory before and 387 after the event in question. A question that could not be addressed at the time those early studies were 388 conducted, but can be addressed now, is whether the amount of change in insect herbivory after a major 389 event exceeds the amount of change that one would expect if the event had not happened. 2020) nor is its damage type diversity unusually high ( Figure 5).

480
Rising pCO   temporal distribution of studied assemblages adds far more variability to reconstructed long-term trends 513 than one would expect if assemblages were chosen for study in an unbiased manner, and has the potential 514 to obscure these trends. As the proliferation of free software packages and online databases continues, it becomes easier to run analyses 517 and generate graphs with insect herbivory data-regardless of whether the methods are appropriate for the 518 data and regardless of whether the datasets are sufficiently complete to meet the assumptions of the methods.

519
As shown here, the Chao1 estimator, a method that is particularly well-suited to handle the sparseness of 520 insect herbivory datasets, rarely provides estimates of damage type diversity that are both precise enough to 521 minimize the frequency of false negative results and accurate enough to contain the true, asymptotic value.

522
Moreover, asymptotic damage type diversity calculated through a combination of rarefaction and the 523 Chao1 estimator is formulated as an unbiased estimator but is severely biased by sample size, presumably 524 due to the exceptional sparsity of herbivory datasets. This finding highlights the need to regress estimators 525 of insect herbivory against the number of leaves and damage types in raw datasets to verify that they are 526 not biased by sampling effort.

527
Whereas size-based rarefaction and the Chao1 estimator present insurmountable limitations for insect 528 herbivory data, coverage-based rarefaction holds the most promise for generating estimates of damage type 529 diversity that are unbiased, and robust to leaf size and sample size. Combined with the herbivory index, 530 which measures the intensity of insect herbivory, estimates of damage type diversity can be leveraged as the 531 foundation of a theoretical ecospace. This ecospace holds the potential to address two overarching questions 532 that remained unresolved for two decades. First is how to distinguish the minor variation in insect herbivory 533 that inevitably occurs among assemblages from slightly different times or places (null hypothesized amount of 534 variation) from the changes in insect herbivory that accompany major abiotic events or innovations in insect 535 evolution (alternative hypothesis). Second is how to distinguish among abiotic causes (e.g., mass extinctions) 536 21 and biotic causes (insect and plant evolution) of major changes in the amount of insect herbivory.

537
There are two ways to build upon the many fossil herbivory datasets amassed over the past few 538 decades: collecting more data, and interrogating existing data to develop and refine analytical frameworks.

539
By advancing the latter, a theoretical ecospace for insect herbivory will hopefully underscore the 540 importance the former.   Anderson and Anderson (1983, 1985, 1989, 2003, 2008, 2017, that were analyzed in this study. a These assemblages do not have sufficient sample coverage to be rarefied to a coverage of 0.8.   et al. (2018) that were analyzed in this study. a These assemblages are not included in the sensitivity analysis presented in Figure S2 because damage type data are not available for each individual specimen. b These assemblages do not have sufficient sample coverage to be rarefied to a coverage of 0.8.   Mitchell Creek Flats at the amount of leaf area that has been examined for this assemblage. Comparisons 996 of damage type diversity, therefore, require a method such as rarefaction to control for differences in 997 sampling completeness, and require fairly complete sampling of all localities-which can be difficult to 998 achieve due to both the availability of fossil material and the investigator effort required.

999
The difficulty of estimating damage type diversity in light of sampling incompleteness is further  are attributable to the level of sample coverage to which the assemblages were rarefied.

1131
For each assemblage we calculated the difference in damage type diversity when rarefied to a sampling 1132 coverage of 0.8 of 0.9 ( Figure S7). We found that the differences are smallest for the Molteno assemblages 1133 listed in Table 1 routine are performed, but stabilizes around 5,000 iterations ( Figure S8). Therefore, we recommend that 1157 future studies use 5,000 resampling iterations to generate confidence intervals. A shows the rarefaction curves for all data from both assemblages. In B through D, the rarefaction curves for Colwell Creek Pond are calculated from a randomly subsampled set of leaves with nearly equal surface area to that measured at Mitchell Creek Flats.
T r öl la tu n g a -G a u ts h a m a r W il le r s h a u s e n B r ján s lae k u r -S e lj aB iĺ in a -D S H B rě št a n y -L C H R o tt M K -3 T e le m a c h u s S p r u it 1 1 1 K a n n a s k o p 1 1 1 C y p h e r g a t 1 1 1 A G r e e n v a le 1 2 1 B ir d s R iv e r 1 1 1 K o m m a n d a n ts k o p 1 1 1 W a ld e c k 1 1 1 L u th e r s k o p 3 1 1 K r a a i R iv e r 1 1 1 K r a a i R iv e r 3 1 1 E la n d s p r u it 1 1 1 V in e y a r d 1 1 1 E la n d s p r u it 1 1 2 A K a p p o k r a a l 1 1 1 K le in  Tables 1 and 2. The black diamond denotes the raw damage type diversity observed at each assemblage and the lines denote the mean 95% confidence interval of the Chao1 estimator when used on a randomly sampled subset of leaves from the assemblage, as follows: magenta, 100 leaves; yellow, 500 leaves; green, 1,000 leaves; blue, 5,000 leaves. Not all assemblages contain sufficient material to subsample to 5,000 leaves. At low levels of sampling, the Chao1 estimator clearly underestimates damage type diversity. At higher levels of sampling, the Chao1 estimator becomes more accurate but far less precise. The black lines represent the interpolated rarefaction curve for the raw dataset, the blue lines represent the raw dataset subsampled down to 2,000 leaves and then extrapolated to the number of leaves present in the raw dataset, and the red lines represent the raw dataset subsampled down to 1,000 leaves and then extrapolated to the number of leaves present in the raw dataset. 49 Extrapolated diversity Figure S4: The variability of estimates of damage type diversity at a sample coverage of 0.8 when extrapolated from 1,000 subsampled leaves that yield a sample coverage below 0.8. 1,000 2,000 3,000 4,000 5,000 6,000 7,000 8,000 9,000 10,000 Figure S8: Changes in the width of the 84% confidence interval as the resampling procedure is further iterated.