Automated assessment reveals extinction risk of reptiles is widely underestimated across space and phylogeny

The Red List of Threatened Species, published by the International Union for Conservation of Nature (IUCN), is a crucial tool for conservation decision making. However, despite substantial effort, numerous species remain unassessed, or have insufficient data available to be assigned a Red List threat category. Moreover, the Red Listing process is subject to various sources of uncertainty and bias. The development of robust automated assessment methods could serve as an efficient and highly useful tool to accelerate the assessment process and offer provisional assessments. Here we aimed to: 1) present a machine learning based automated threat assessment method that can be used on less known species; 2) offer provisional assessments for all reptiles - the only major tetrapod group without a comprehensive Red List assessment; and 3) evaluate potential effects of human decision biases on the outcome of assessments. We use the method presented here to assess 4,369 reptile species that are currently unassessed or classified as Data Deficient by the IUCN. Our models range in accuracy from 88% to 93% for classifying species as threatened/non-threatened, and from 82% to 87% for predicting specific threat categories. Unassessed and Data Deficient reptiles were more likely to be threatened than assessed species, adding to mounting evidence that they should be considered threatened by default. The overall proportion of threatened species greatly increased when we included our provisional assessments. Assessor identities strongly affected prediction outcomes, suggesting that assessor effects need to be carefully considered in extinction risk assessments. Regions and taxa we identified as likely to be more threatened should be given increased attention in new assessments and conservation planning. Lastly, the method we present here can be easily implemented to help bridge the assessment gap on other less known taxa.


Introduction
and 2). A model excluding assessor/reviewer effects, but retaining spatial and phylogenetic 119 autocorrelation, and one excluding spatial and phylogenetic autocorrelation but retaining 120 assessor/reviewer effects, achieved similar results ( Table 1). The model excluding both 121 autocorrelations and assessor/reviewer effects, and the models including either spatial or 122 phylogenetic autocorrelation, were less accurate (Table 1). However, the model obtained the 123 highest accuracies when excluding threatened species classified under criteria other than B from 124 the training data set (Table 1, details below). We predicted threat categories for DD and NE 125 species using the model that excluded assessor/reviewer effects but retained spatial and 126 phylogenetic data, since we cannot know the identity of assessors who will evaluate currently 127 unassessed species. For analyses regarding potential assessor/reviewer effects, we used the 128 complete model. Detailed accuracy metrics are presented in Table 2. The lowest accuracy across 129 models was in separating NT from LC categories (Table 2). respectively. More detailed metrics are presented in Table 2. across models see S2 Table). However, it should be noted that machine learning methods such as 155 XGBoost, are geared primarily towards prediction, not inference [28]. Therefore, we assessed the 156 contribution of features by comparing the predictions of different model configurations (Table 1) 157 and report the feature importance above for their methodological utility. Future studies should 158 explore the underlying mechanisms of reptile extinction risk using more appropriate methods, 159 and clarify which proportion of variation in extinction risk can be attributed to unknown sources 160 of threat correlated in space and phylogeny.

161
The hyperparameter configuration for the model chosen for predictions is summarized in 162 S3 assigning them to lower threat categories than observed (S4 Table). This is probably because 175 species are only classified under non-B criteria if such criteria assign them to a similar, or higher, 12 threat category. Thus, we proceeded with models trained on all species for the remaining 177 analysis. Our model correctly classified 93.8% of previously assessed species (6,112 of 6,520 178 species). The 6.2% misclassified species (408 of 6,520 species) were nearly twice as likely to be 179 assigned to non-threatened categories than to shift in the opposite direction, and generally to shift  Comparison with previous methods 184 We compared our method to similar past endeavors. Our method obtained higher 185 accuracy (90%) than methods based on Random Forest (85%) and Neural Networks (79%) (S5 186   Table). The extreme class imbalance in the dataset greatly hindered both methods, especially 187 Neural Networks (S5 Table), despite the use of supersampling to account for uneven class 188 distributions. In fact, Neural Networks are known to be sensitive to such imbalances [29] autocorrelation, none does so in combination, as our method does (S6 Table). Our method is also 193 the first to account for assessor bias (Table 3).

195
DD and NE species were significantly more likely to be assigned threatened categories  Table). 197 13 DD species were more likely than assessed species to be predicted as VU, EN or CR, and less 198 likely to be predicted as NT or LC. NE species were more likely than assessed species to be VU, 199 and EN, and less likely to be predicted as NT or LC (Fig 1b, S7 Table and S8 Table).  Table).

208
Phylogenetic and spatial patterns 209 The proportion of threatened species increased overall for Squamata and Crocodylia but  Table). Effect of assessor/reviewer identities on predictions 235 We permuted the identity of assessors and reviewers until we identified the group of  Table) and 240 across most biogeographical realms. In the Nearctic and Madagascar, the observed and 241 pessimistic scenarios were similar, and in Oceania no differences were detected (Fig 4b, S12 242 Table). Species that changed category between the observed assessments and the optimistic  Palearctic. Significant differences in a Pearson's Χ 2 test are indicated by asterisks, colored 254 according to which proportions are being compared (S11 Table).

257
Our model successfully assigned IUCN categories, and threat statuses, to the 40% of the 258 world's reptiles that currently lack published assessments or are classified as Data Deficient. Our 259 novel modeling approach enabled classifying specific threat categories with high accuracy using 260 only readily available data (ranges and body sizes). Our methods also gained better accuracy 261 16 then previously explored methods (S5 Table). We predicted that the prevalence of threatened 262 reptile species is significantly higher than currently depicted by IUCN assessments. This pattern 263 is widespread across space and phylogeny. Our results show that, while high prediction accuracy 264 can be achieved without accounting for assessor/reviewer identities, the identity of 265 assessor/reviewers greatly affects predictions.

267
General model results

268
The classification accuracy of more extreme categories (CR, EN, LC) was higher than 269 categories straddling the threatened/non-threatened threshold (VU and NT, S1 Phylogenetic and spatial patterns 369 Our results revealed an overall decrease in the proportion of threatened turtle species 370 after the addition of our predictions for DD and NE species (Fig 2). This could be due to the 371 21 more complete assessment of turtles than of squamates. Data on population sizes and trends are 372 much more readily available for testudines in than for squamates [38]. Only 19% of squamates 373 were classified as threatened based (at least in part) on criteria other than Bcompared to 83% 374 of turtles. The proportion of threatened species tended to increase in squamate groups, especially 375 in small, fossorial, rare and endemic groups (Fig 2, S9 Table), which is consistent with 376 previously reported patterns of data deficiency [8]. Our method is thus better suited for data poor 377 clades than for extremely data-rich ones. The latter have already been assessed or are easy to 378 assess, but the former comprise most of global biodiversity. Thus, our method could be 379 especially useful for other data poor and underassessed groups, such as most invertebrate clades.

380
Our results suggest that the world's unknown and rich biodiversity is at even greater risk 381 than previously perceived. This finding adds to accumulating evidence that geographical and  Our models achieved high levels of accuracy even without accounting for 393 assessor/reviewer effects (Table 1). Nonetheless, the composition of assessors may greatly 394 influence predictions across all categories (Fig 4a, S3 Fig, S8). A possible explanation for this 395 pattern is that such effects could be implicitly accounted for in spatial and phylogenetic 396 autocorrelation since assessors usually assess only particular taxa and locations ( handling uncertainty and assessor's attitudes to risk [11,24]. We further recommend that the 409 IUCN, and local or regional agencies wishing to assess extinction risk of species or populations, assessments. 416 We also recommend, as further research avenues, the development of 1) analytical 417 methods to identify which assessment criteria and subcriteria are more subject to ambiguities, 418 and how can they be refined; 2) applications for quick automated assessments using methods 419 such as the one proposed here; 3) automated assessment methods specifically geared towards 420 modeling population sizes and trends (e.g., based on spatial distribution of threats such as land continue to be the gold-standard for categorizing species threat, we recommend caution is 453 necessary and that assessor/reviewer effects should be considered when using them. Altogether, 454 our models predict that the state of reptile conservation is far worse than currently estimated, and 455 that immediate action is necessary to avoid the disappearance of reptile biodiversity. Incorporating spatial and phylogenetic autocorrelation 480 We used Moran's Eigenvector Maps and Phylogenetic Eigenvector Maps to represent 481 spatial and phylogenetic structure in our models [45,46] this, we used an autocorrelative approach similar to our spatial autocorrelation 506 detection/correction method, to incorporate potential assessor/reviewer effects in our models. We 507 considered assessors/reviewers that worked together on a species assessment to be neighbors in 508 the neighborhood matrix, with the number of species each pair assessed together as the weight of 509 each pair's association. Therefore, frequently associated assessors had more similar scores than 510 those that associated occasionally. Assessors/reviewer scores were averaged for each eigenvector 511 on each species. Therefore, species that were evaluated by a similar set of assessors/reviewers 512 had more similar scores than species evaluated by more distinct sets of assessors/reviewers. We 513 performed a priori selection based on eigenvalues, as described above, using the same 514 thresholds, which resulted in 216 eigenvectors being retained for assessors and 39 for reviewers. The range size of a species (as measured by extent of occurrence) can be used as an  Since supervised machine learning methods, such as XGBoost, are primarily predictive, 548 rather than mechanistic, features contributing to better predictions are not necessarily useful for 549 making causal inferences [28]. Thus, we evaluated the contribution of phylogenetic eigenvectors,

550
Moran's eigenvectors, and assessor/reviewer effects, by comparing models without these factors  species for each ecoregion, before and after the addition of predictions for DD and NE species. 587 We also compared the percentage of threatened species before and after the inclusion of 588 31 predictions for the eight terrestrial biogeographical realms: Afrotropics, Australasia, Indomalaya,

589
Madagascar, Nearctic, Neotropics, Oceania, and Palearctic. Each species was assigned to all 590 realms intersecting its range. The difference between proportions of threatened species in each 591 biogeographical realm, before and after the inclusion of predictions, was tested using a Chi-592 square test, with p-values corrected for multiple comparisons, using False Discovery Rate [60].

593
Effect of assessor/reviewer identities on predictions 594 We evaluated the effect of assessor/reviewer identities on predictions for each threat 595 category. We sequentially permuted the assessor/reviewer eigenvector scores of each species to 596 all other species, ran the modeling procedure described above, and retained the scores that 597 resulted in least threatened (optimistic), and most threatened (pessimistic) categorizations. This 598 procedure represents the potential results that would be obtained if the most "optimistic" and the 599 most "pessimistic" group of assessors/reviewers assessed every species. This was done using the 600 complete model using spatial and species-level predictors, spatial and phylogenetic 601 autocorrelations, and assessor/reviewer effects, to minimize the effect of spatial and phylogenetic 602 structure in assessor/reviewer assignments. We then tested if the resulting "optimistic" and 603 "pessimistic" predictions were significantly different from the observed categories, and from 604 each other, using Chi-square tests, with p-values corrected for multiple comparisons, using False 605 Discovery Rate. We performed a similar analysis to explore differences in assessor effects with 606 in each biogeographical realm for the binary classification task (of threatened/non-threatened 607 categories).