Imprecisely georeferenced specimen data provide unique information on species’ distributions and environmental tolerances: Don’t let the perfect be the enemy of the good

Aim: Conservation assessments frequently use occurrence records to estimate species’ geographic distributions and environmental tolerances. Typically, records with imprecise geolocality information are discarded before analysis because they cannot be matched confidently to environmental conditions. However, removing records can artificially truncate species’ environmental and geographic distributions. Here we evaluate the trade-offs between using versus discarding imprecise records when estimating species’ ranges and climatic


Introduction
"precise" and "imprecise", based on the confidence with which they could be matched to the resolution of our climate data. Precise records were those with an associated coordinate uncertainty of ≤5000 m, in accordance with guidelines used for calculating climate change vulnerability (Young et al. 2016; see also Graham et al. 2008). Imprecise records comprised records that had a coordinate uncertainty >5000 m, or that could only be located to a geopolitical unit (county/parish or state/province). Records that could not be assigned to either of these categories were removed. We also removed records whose coordinates had an area of uncertainty larger than San Bernardino County (the largest "county"-level geopolitical unit in the coterminous US at 52,104.5 km²), which we considered too imprecise to convey salient information about the species' relationship to the environment. Maps of each species' records were then visually inspected, and geographic outliers were scrutinized and removed if appropriate (e.g., one georeferenced record was from a bridal bouquet, and several others were purchased or cultivated). We defined duplicate records in different ways depending on the type of record, then kept just one record from each set of duplicates (Appendix S2). Finally, we removed species for which we had fewer than 5 precise records, resulting in 44 total species for analysis.

Extent of occurrence
The geographic extent of occurrence (EOO) of a species is defined as the area of the minimum convex polygon circumscribing all occurrences. EOO is used as an index of the degree to which risks from threats are spread among populations (IUCN 2019).
For each species, we compared the point on the border of the polygon that represented its likely collection locality that was closest to the geographic centroid of the precise records (Fig. 2a). This yielded the smallest possible EOO for precise plus imprecise records. We removed areas covered by major water bodies (oceans and the Great Lakes) before calculating areas of EOOs. Differences in EOO calculated without or with imprecise records were evaluated with a paired Wilcoxon signed-rank test.

Environmental data
We use interpolated climate coverages from WORLDCLIM Ver.

Univariate and multivariate niche breadth
We compared climatic niche breadth in mean annual temperature (MAT) and mean annual precipitation (MAP) using just precise records versus using precise plus imprecise records. Niche variation. We compared differences in univariate and multivariate niche breadth between precise and precise plus imprecise records using paired Wilcoxon signed-rank tests.
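The extent-of-occurrence comparison described above can be illustrated with a minimal sketch. It assumes occurrence coordinates are already in a projected, equal-area system (here in km), and it does not subtract water bodies as the text describes. The function names are hypothetical and are not taken from the authors' code.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Minimum convex polygon (Andrew's monotone chain), counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices: List[Point]) -> float:
    """Planar polygon area by the shoelace formula."""
    n = len(vertices)
    if n < 3:
        return 0.0
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def extent_of_occurrence(points: List[Point]) -> float:
    """Area of the minimum convex polygon around all occurrence points."""
    return polygon_area(convex_hull(points))


# Hypothetical coordinates in km (illustrative values, not data from the study).
precise = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0), (5.0, 5.0)]
imprecise_nearest_point = [(20.0, 5.0)]  # nearest border point of a county polygon

eoo_precise = extent_of_occurrence(precise)
eoo_all = extent_of_occurrence(precise + imprecise_nearest_point)
```

In this toy example a single imprecise record, placed at its nearest plausible location, still enlarges the polygon, which is the mechanism behind the increase in EOO reported in the Results.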
Assigning climate data to imprecise records requires making a decision about how to "locate" a record in climate space from the climates in the area from which the record was collected. For the analyses presented in the main text, we chose climate values that were closest to the mean value estimated across all precise records. This method provides the most conservative estimate of the true niche breadth, since the true value will either be the same as or more extreme than the estimated value. However, to compare with other methods, we repeated calculations of niche breadth with precise plus imprecise records by matching imprecise records to: 1) the value at the
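As a rough illustration of the conservative assignment used in the main text, the sketch below picks, from the climate values found within an imprecise record's area, the one closest to the mean of the precise records. Measuring univariate niche breadth as the simple range (maximum minus minimum) is an assumption made here for illustration, because the definition of breadth is not recoverable from this section. All names and numbers are hypothetical.

```python
from typing import Sequence


def conservative_value(candidates: Sequence[float],
                       precise_values: Sequence[float]) -> float:
    """Climate value within the record's area closest to the precise-record mean."""
    center = sum(precise_values) / len(precise_values)
    return min(candidates, key=lambda v: abs(v - center))


def niche_breadth(values: Sequence[float]) -> float:
    """Univariate breadth as the range of values (an assumed definition)."""
    return max(values) - min(values)


# Hypothetical mean annual temperatures (deg C).
precise_mat = [10.0, 12.0, 14.0]
county_cells_mat = [15.5, 16.0, 18.0]  # all grid cells within one county

assigned = conservative_value(county_cells_mat, precise_mat)
breadth_precise = niche_breadth(precise_mat)
breadth_all = niche_breadth(precise_mat + [assigned])
```

Because the assigned value is the least extreme one available in the record's area, any other choice from the same area would give a breadth at least as large, which is the sense in which the estimate is conservative.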

Climate change exposure
We used ecological niche models to estimate exposure to anticipated climate change. Full  variables. Background data were obtained from either an area delineated by a 300-km buffer around the minimum convex polygon of the occurrences used in each model, or from an area delineated using 300-km buffers around each individual occurrence. The former method assumes that a species can disperse to all intermediate locations between known occurrences, but potentially overestimates the total area to which a species could disperse (e.g., when the range is disjunct). The latter method obviates issues with disjunct distributions, but by excluding environments that occur in gaps within a species' range, assumes that the species cannot occupy these environments or disperse to them (Barve et al. 2011). Both buffers were calculated separately for models using only precise records and for all records combined.
We modeled species' niches using Maxent ver. 3.3.3k (Phillips et al. 2006; Phillips & Dudík 2008) using precise and precise plus imprecise records. Models were then projected to present-day and future climate scenarios. For each species, we constructed 10 models (Fig. S2.1). For models based only on precise records, we calibrated one for each of the two definitions of background area. For models based on precise plus imprecise records, we calibrated eight models (2 background definitions × 4 methods for assigning climate variables to imprecise records). Each model was then projected to RCPs 4.5 and 8.5. We thresholded present-day and future predictions using the value that excluded 10% of the occurrences from the set considered "present" (i.e., a 90% sensitivity rate). We then compared several metrics indicative of area of habitat (Brooks et al. 2014) and exposure to climate change (Young et al. 2016) between models with and without imprecise records: 1) present-day area with suitable climate; 2) future suitable area; 3) stable area that is suitable at present and in the future; and 4) loss and 5) gain in suitable climate area. Each of these metrics was calculated within a) the minimum convex polygon drawn around the precise records and b) the minimum convex polygon drawn around all records, which was delineated as described above for calculation of EOO. Comparisons between models with and without imprecise records were conducted within each combination of background definition, emissions scenario, and method for assigning climate values to imprecise records. We compared differences in area of present-day suitable climate, stable suitable climate, and loss and gain in climatically suitable area with paired Wilcoxon signed-rank tests.
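The thresholding and area metrics described above can be sketched as follows. This is a simplified illustration, not the authors' workflow. It assumes suitability rasters are flattened to equal-area cells, that the cut-off is found by dropping the lowest floor(10% × n) occurrence scores, and that the same threshold is applied to present and future predictions. All function names and values are hypothetical.

```python
from typing import Dict, Sequence


def sensitivity_threshold(suitability_at_occurrences: Sequence[float],
                          omission: float = 0.10) -> float:
    """Lowest score still counted as 'present' after omitting a fraction of occurrences."""
    ranked = sorted(suitability_at_occurrences)
    n_omitted = int(len(ranked) * omission)  # floor of the omitted count
    return ranked[n_omitted]


def change_metrics(present: Sequence[float], future: Sequence[float],
                   threshold: float, cell_area: float = 1.0) -> Dict[str, float]:
    """Present, future, stable, lost, and gained suitable area from two predictions."""
    now = [s >= threshold for s in present]
    later = [s >= threshold for s in future]
    return {
        "present": cell_area * sum(now),
        "future": cell_area * sum(later),
        "stable": cell_area * sum(a and b for a, b in zip(now, later)),
        "loss": cell_area * sum(a and not b for a, b in zip(now, later)),
        "gain": cell_area * sum(b and not a for a, b in zip(now, later)),
    }


# Hypothetical predicted suitability at 10 occurrences and across 5 grid cells.
occ_scores = [0.05, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
present_map = [0.1, 0.5, 0.7, 0.15, 0.9]
future_map = [0.3, 0.1, 0.8, 0.25, 0.95]

threshold = sensitivity_threshold(occ_scores)
metrics = change_metrics(present_map, future_map, threshold)
```

With ten occurrences the single lowest score is omitted, so nine of ten occurrences (90%) fall at or above the threshold.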

Collection date of precise and imprecise records
Older records may be less precise and so may tend to be discarded in favor of newer records with more precise locality information. To assess whether imprecise records were generally older than precise records, we tested for differences between the median collection year of precise and imprecise records for each species using a Mann-Whitney U test.
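For readers unfamiliar with the test, the Mann-Whitney U statistic underlying the comparison above can be computed by counting pairwise "wins". The sketch below computes only the statistic, with ties counted as half. Obtaining a P-value additionally requires the null distribution of U, which is omitted here. The collection years are invented for illustration.

```python
from typing import Sequence


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> float:
    """U for sample a: number of (x, y) pairs with x > y, counting ties as 0.5."""
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in a
        for y in b
    )


# Hypothetical collection years for one species.
precise_years = [1990, 2001, 2005]
imprecise_years = [1950, 1995, 2010, 2012]

u_precise = mann_whitney_u(precise_years, imprecise_years)
u_imprecise = mann_whitney_u(imprecise_years, precise_years)
```

The two statistics always sum to the product of the sample sizes, so either one fully determines the other.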

Specimen data
The data downloaded from GBIF comprised 112,730 records (Table 1). Approximately one third of these records were missing coordinates entirely, and approximately another third were missing a value for coordinate uncertainty. Following application of our data cleaning procedures, removal of duplicate records, and elimination of species with fewer than 5 geographically unique precise records, we were left with 7.5% of the initial number of records, representing 32% of the species (44 of 137) that occurred in the original data (Table 1). Records that could only be assigned to a geopolitical unit were the most abundant (Fig. 3a; median across retained species: 75 records, range: 6 to 519), followed by precise records (median: 25, range: 5 to 239), then records possessing coordinates with coordinate uncertainty >5000 m (median: 10, range: 0 to 101; Fig. 3 and Table S2.1).

Univariate and multivariate niche breadth
Including imprecise records increased estimated niche breadth for nearly all species, even when we conservatively used the climate value most similar to the mean across the precise records (Fig. 4b). Including imprecise records increased univariate niche breadth in MAT by a median value of 25% (range across species: 0 to 353%; P < 10⁻⁶, Wilcoxon V ≈ 0) and in MAP by 28% (range: 0 to 292%; P < 10⁻⁶, Wilcoxon V ≈ 0). Using other methods for assigning climate values to imprecise records increased niche breadths even more (Appendix S4). Species with the fewest precise records tended to have the greatest increase in univariate niche breadth when imprecise records were included (Fig. 4b).
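To help interpret the reported V statistics, the sketch below computes the paired Wilcoxon signed-rank V as the sum of the ranks of the positive differences, with zero differences dropped and tied absolute differences given average ranks. This is the convention used by R's wilcox.test. Whether the authors used that software, and in which direction they took the differences, is not stated in this section, so both are assumptions. The breadth values are invented.

```python
from typing import List, Sequence


def signed_rank_v(x: Sequence[float], y: Sequence[float]) -> float:
    """Sum of ranks of positive (x - y) differences; zeros dropped, ties averaged."""
    diffs: List[float] = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Find the run of tied absolute differences starting at position i.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return sum(r for r, d in zip(ranks, diffs) if d > 0)


# Hypothetical niche breadths for four species.
precise_only = [10.0, 12.0, 8.0, 15.0]
with_imprecise = [12.5, 12.0, 11.0, 16.0]

v_forward = signed_rank_v(precise_only, with_imprecise)
v_reverse = signed_rank_v(with_imprecise, precise_only)
```

When adding imprecise records never decreases breadth, every non-zero difference of precise minus precise-plus-imprecise is negative and V is zero, which would be consistent with the near-zero V values reported above under that assumed direction.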

Species with the fewest records tended to have the greatest increase in niche volume and area when imprecise records were included (Fig. 4c).

Climate change exposure
We evaluated the effect of using different backgrounds for calibrating the niche models (buffered  Including imprecise records increased the area predicted to remain climatically suitable (Fig. 5a) by 14% (median; range: -33 to 326%; P < 10⁻⁵, Wilcoxon V = 120). The median percentage of current suitable area that was expected to be lost was similar (4% when using precise records versus 2% when using precise plus imprecise records; P = 0.32, Wilcoxon V = 531). However, models using only precise records predicted a greater range of loss than models using both kinds of records (Fig. 5b). In contrast, the area that is currently unsuitable but predicted to become suitable was larger when using just precise records (Fig. 5c; P = 0.0129, Wilcoxon V = 623).

Collection date of precise and imprecise records
We found no significant difference in the median age of the precise and imprecise occurrence records for 36 of the 44 species. For the other 8 species, the median age of the imprecise occurrences was more recent than that of the precise occurrences (Mann-Whitney U test, P < 0.05; Appendix S8 Table S8.1).

We found that including geospatially imprecise records increased estimates of EOO, climatic tolerances, and the area of suitable habitat in the present and the future (Figs. 4 and 5). Since we do not know the true locations of imprecise records, assigning a location or value of a climate variable will in most cases lead to an inaccurate match. Hence, our estimates of EOO, niche breadth, and exposure to climate change are admittedly inaccurate. Indeed, aversion to inaccuracy is the primary reason why imprecise records are commonly discarded before conducting conservation  However, our findings are based on using a conservative method for assigning locations and values of climate variables to imprecise records. Thus, it is reasonable to assume that our estimates still underrepresent the true geographic and environmental distributions of species. As a result, we contend that our results are a more accurate indicator of true EOO, climatic tolerances, and exposure to climate change because they underestimate the true values of these metrics less severely than metrics calculated using only precise records.

Implications for conservation
We found that even when using a conservative method for assigning imprecise records to spatial locations, EOO calculated using precise plus imprecise records was substantially larger (median increase across species: 86%) than when using precise records alone (Fig. 4a). If threats to species tend to act in a spatially autocorrelated manner (Fisher 2011), increasing the spatial spread of populations will reduce the probability that threats affect a large proportion of a imprecise records (Fig. 4a), it is reasonable to assume that excluding imprecise records could increase the chance that a species would have an EOO smaller than this threshold, thus making it appear to meet this criterion.

Our intent here is to provoke a reconsideration of the benefits and costs of discarding spatially imprecise records when assessing species' conservation status (Table 2). The trade-offs between these choices are based on the philosophical approach under which any particular conservation assessment is conducted. First, when decisions must be made that affect whether or not a species is designated as vulnerable, many assessments adopt a precautionary strategy that errs on the side  2016a) vulnerability. Since discarding spatially imprecise records reduces apparent niche breadth and extent of distribution (Fig. 4), ignoring imprecise records inherently aligns with a precautionary approach to assessment. The advantage of a precautionary approach is that it ensures few truly vulnerable species are mistakenly classified as "not vulnerable" because species tend to be assigned a vulnerability status that is at least as severe as their true status. The opposite of a precautionary approach is an evidentiary approach, which aims to classify species as vulnerable only if there is strong evidence to support such a designation (IUCN 2019). The evidentiary approach would thus seem to align with inclusion of imprecise records because their use broadens species' apparent geographic and environmental distributions. As a result, ignoring imprecise records would generally seem advisable.
We contend that spatially imprecise records, despite their limitations, can represent valid information on species' distributions and environmental tolerances. As such, unless there is sound reason for doing so, removing imprecise records seems arbitrary and capricious.  (Table 2).

Methodological considerations
The manner in which imprecise records can be accommodated depends on the approach used to  2020). This was indeed true in our case. For example, we found that including imprecise records increased EOO by more than five-fold for some species with <80 records (Fig. 4a). The degree and about accounting for uncertainty, so decisions should be made in that regard. In our analysis, we did not find that imprecise specimens were older than precise specimens (Appendix S8), but this may not always be the case.
Remarkably, only about half of the papers that we reviewed described their data cleaning process. This is troubling because it means either that data were used without proper screening Introducing a practical protocol to prepare species occurrence records for spatial analysis.     used specimen records and ecological niche modeling (full methods described in Appendix S1).

Of 285 relevant articles, only 52% described any methods for cleaning data. Across all publications, 45% addressed issues related to coordinate uncertainty by removing records before the analysis. Bases for removing records included being "unnatural" (e.g., cultivated or purchased), having imprecise or unknown coordinate uncertainty, lying outside the species' presumed range, or falling outside the environmental tolerances of the species. No studies we reviewed used modeling methods that could account for uncertainty in spatial location or explicitly indicated that coarser resolution environmental data were used to accommodate spatially imprecise records.  Imprecise records are defined as records with a coordinate uncertainty >5000 m, or records that can only be located with confidence to a geopolitical unit.  predicted to be lost due to climate change. c) Area predicted to be gained.