Reducing error and increasing reliability of wildlife counts from citizen science surveys: counting Weddell Seals in the Ross Sea from satellite images

Citizen science programs can be effective at collecting information at large temporal and spatial scales. However, sampling bias is a concern in citizen science datasets and can lead to unreliable estimates. We address this issue with a novel approach in a first-of-its-kind citizen science survey of Weddell seals for the entire coast of Antarctica. Our citizen scientists inspected very high-resolution satellite images to tag any presumptive seals hauled out on the fast ice during the pupping period. To assess and reduce the error in counts in terms of bias and other factors, we ranked surveyors on how well they agreed with each other in tagging a particular feature (presumptive seal), and then ranked these features based on the ranking of the surveyors placing tags on them. We assumed that features with higher rankings, as determined by the "wisdom of the crowd," were likely to be seals. By comparing citizen science feature ranks with an expert's determination, we found that non-seal features were often highly ranked. Conversely, some seals were ranked low or not tagged at all. Ranking surveyors relative to their peers was not a reliable means to filter out erroneous or missed tags; therefore, we developed an effective correction factor for both sources of error by comparing surveyors' tags to those of the expert. Furthermore, counts may underestimate true abundance due to seals not being present on the ice when the image was taken. Based on available on-the-ground haul-out location counts in Erebus Bay, the Ross Sea, we were able to correct for the proportion of seals not detected through satellite images after accounting for year, time-of-day, location (islet vs. mainland), and satellite sensor effects. We show that a prospective model performed well at estimating seal abundances at appropriate spatial scales, providing a suitable methodology for continent-wide Weddell seal population estimates.


INTRODUCTION
with the covariates hypothesized to explain them, we may draw misleading conclusions regarding covariate effects, or patterns of change in space and time [6,12,14,16].

For citizen science data to be reliable, especially for wildlife monitoring programs, we must provide assurances that the patterns elicited from these data do not reflect, or are not altered by, unidentified sampling bias. Several methodologies have been proposed to reduce potential bias in the data. Isaac et al. [12] classified these into two general approaches: one is to filter out outliers and highly imprecise records, while the other is to use the data or additional information on how data were collected to control for potential sources of bias. For example, Kelling et al. [14] show that the "semi-structured surveys" of eBird provide information on how the data were collected, as reported by the citizen scientists.

We used an online platform to enable volunteers to inspect high-resolution images of fast ice [37,38] and to "tag" (i.e., mark in the images) all presumed seals. Because the seals suckle their young while hauled out on the fast ice, they appear as "gray commas" on a white icy backdrop; thus, they are distinct enough to be detected and counted, especially as they are large enough to occupy several pixels in high-resolution images [8,33,36,37]. Moreover, while they do aggregate at predictable locations to raise their pups (henceforth "haul-out" aggregations), unlike most pinnipeds they space themselves rather than clump closely. Thus, they lend themselves to be counted using very high resolution satellite imagery. Establishing the precision and accuracy of the citizen scientist counts would allow us to determine whether our approach is useful for monitoring Weddell seal populations, and at what spatial scale it is then possible to make statistical inferences about the impact, if any, of the fishery and other biophysical processes.

The work presented here has two objectives: estimating and correcting the bias and increasing the precision of citizen scientists' counts, and then calibrating the estimated counts to actual numbers on the ground.

Regarding surveyor bias, our primary concern is the error associated with the possible non-random distribution of surveyor skills and effort across space, and the consequent bias in estimating seal numbers.

In our study of Weddell seals, the bias arises from the differential skill of taggers in correctly identifying seals.

The citizen scientist platform

We used the Tomnod crowd-sourcing platform (now called "GeoHive", DigitalGlobe, Inc.).

Through peer-to-peer ranking, the probability of a surveyor finding a "feature" (a presumptive seal) in a map is determined through comparison with data from his/her peers (an approach similar to that of [43]). We call this probability the "surveyor CrowdRank." Formally, the surveyor CrowdRank is the probability that the surveyor will identify a feature on a map as being a seal that at least one other surveyor has also identified as a seal (i.e., the probability of tagging a feature that others have also tagged). A surveyor's CrowdRank increases the more features s/he tags that others have also tagged.

Conversely, a surveyor's CrowdRank is reduced if s/he identifies features that others did not. (There is no penalty for missing features.) Notably, the surveyor's CrowdRank is not the probability of the surveyor identifying a seal. Since we asked that the surveyors find and tag seals, the features are presumed to be seals. We assumed that surveyors can err both in the detection of "features" (as defined above) and in the correct identification of these features as seals.

Similarly, once at least two surveyors have placed a tag on a feature, the probability of the feature j being detected (the "feature CrowdRank") is calculated from all surveyors (i) that tagged it, under the assumption that the surveyors made their determinations independently of each other, as:

fCR_j = 1 − ∏_i (1 − sCR_i)     (1)

In equation 1, fCR and sCR are the feature and surveyor CrowdRanks, respectively. (A surveyor is never exposed to other surveyors' tags previously placed on a map.) Therefore, only features with tags from at least two different surveyors are considered in the counts; that is, a feature only tagged once is not included. The estimation of the feature CrowdRank does not account for "negative" votes, i.e., the surveyor CrowdRanks of those who inspected the map and did not tag the feature. Since the surveyors are looking for seals, the feature CrowdRank is really the probability that the crowd of surveyors will agree on a feature being a seal. It is intuitive to see that if two surveyors have high CrowdRank scores and both place tags on the same feature, there is a high likelihood that the feature will be deemed to be a seal by the crowd.

Because the feature CrowdRank is the probability that a feature will be deemed a seal by the crowd, we expected that both the accuracy and precision of the slope of the regression increase as the threshold used for features increases. That is, the slope of the regression approaches 1 and its standard error drops as the threshold value increases. An alternative outcome is that increasing the threshold filter would exclude an increasing number of possible features, shrinking the crowd count toward only the features with the highest probabilities of being identified as seals by the crowd. Too high a threshold incurs the risk of failing to count seals that are present, which would represent a decrease in accuracy (coefficient less than 1). Our goal was to find a compromise between the accuracy in the number of seals detected by the crowd and the precision of the estimate through the selection of an adequate threshold for the features' CrowdRank.
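The peer-to-peer ranking described above can be sketched in a few lines of Python. This is a minimal sketch under two assumptions of ours, not a reproduction of the platform's code: that the surveyor CrowdRank is estimated as the fraction of a surveyor's tags shared with at least one peer, and that the feature CrowdRank combines the taggers' CrowdRanks independently as 1 − ∏(1 − sCR_i). All function names are hypothetical.

```python
def surveyor_crowdrank(tags_by_surveyor, surveyor):
    """Fraction of this surveyor's tagged features that at least one
    other surveyor also tagged (no penalty for missed features)."""
    own = tags_by_surveyor[surveyor]
    if not own:
        return 0.0
    others = set().union(*(t for s, t in tags_by_surveyor.items() if s != surveyor))
    return len(own & others) / len(own)


def feature_crowdrank(feature, tags_by_surveyor, scr):
    """Probability the crowd deems the feature a seal, combining its
    taggers' CrowdRanks under independence; features tagged by fewer
    than two surveyors are excluded from the counts (returns None)."""
    taggers = [s for s, tags in tags_by_surveyor.items() if feature in tags]
    if len(taggers) < 2:
        return None
    p_missed_by_all = 1.0
    for s in taggers:
        p_missed_by_all *= 1.0 - scr[s]
    return 1.0 - p_missed_by_all
```

A threshold on the returned feature CrowdRank then acts as the filter discussed above: only features scoring above the threshold enter the crowd count.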

A novel approach to correct bias in citizen science counts

As an alternative to just filtering out features, we explored estimating the number of seals from the tags placed by each surveyor. A detailed description of how we calculated the probabilities is provided in the Supporting Information S1 document.

We can then define the ratio of probabilities in equation 2 as the "shrinkage factor" for surveyor i.

Unlike the previous "wisdom of the crowd" approach, with this method we obtain a shrinkage factor for each surveyor. This permits the calculation of a mean or median shrinkage factor from the sample of surveyors who inspected maps in common with the expert, and a measure of its precision too (e.g., the standard deviation or inner 95th quantile of the empirical distribution of individual shrinkage factors).

To do this, we assume that the set of surveyors for which we can calculate a shrinkage factor is a representative sample of all surveyors involved in the campaign. This is justified since we found no evidence of any bias in the selection of surveyors.

Effect of covariates on accuracy of counts

The two methods ("wisdom of the crowd" and "expert correction") were used to estimate the number of seals in the maps, and the resulting estimates were then compared to the counts by our expert for the same maps. The expert correction method proved the better approach (as demonstrated below), and is the only one we used in the second part of this study. In order to estimate the actual number of seals at haul-out locations, once we found a suitable surveyor CrowdRank threshold to be used to shrink the counts, we obtained the necessary shrinkage factor by taking the median of the sample. We used the median because of possible outlier effects that could skew the mean value due to our low sample size.
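The expert-correction step can be sketched as follows. Since equation 2 is not reproduced in this excerpt, the sketch assumes one plausible reading: a surveyor's shrinkage factor is the ratio of the expert's seal count to that surveyor's tag count, summed over the maps both inspected; the median is then taken across surveyors. Function names and the exact ratio are our assumptions.

```python
import statistics


def shrinkage_factor(surveyor_tag_counts, expert_counts):
    """Per-surveyor shrinkage factor: expert seals counted divided by
    the surveyor's tags placed, over the maps both inspected."""
    shared_maps = surveyor_tag_counts.keys() & expert_counts.keys()
    tags_placed = sum(surveyor_tag_counts[m] for m in shared_maps)
    expert_seals = sum(expert_counts[m] for m in shared_maps)
    return expert_seals / tags_placed if tags_placed else None


def median_shrinkage(factors):
    """Median across surveyors; robust to outliers at low sample size."""
    return statistics.median(f for f in factors if f is not None)
```

A factor below 1 shrinks an over-counting surveyor's tags downward; a factor above 1 inflates the counts of a surveyor who tends to miss seals.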

We used the shrinkage factor to correct the sum of counts from the set of maps that encompass every known haul-out location in the study area to obtain estimates for each date surveyed. Different image date-times were used to sum the estimates from all maps for all locations and date-times, and these sums were considered replicates of the estimated count of seals for a location if they were taken within 3 days of the ground count (see Results for details). We then ran the regression of our estimates versus the respective ground counts. The regression used the ratio of the crowd-based shrunken estimate to the actual ground count (the "estimated detection rate") as the response variable.
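Constructing the regression response amounts to a one-line ratio; the sketch below is ours, with `shrinkage` standing for the median shrinkage factor described above and the function name hypothetical.

```python
def estimated_detection_rate(map_tag_counts, shrinkage, ground_count):
    """Ratio of the crowd-based shrunken estimate (shrinkage factor
    times the summed tags over all maps covering a haul-out location)
    to the on-the-ground count: the regression response variable."""
    shrunken_estimate = shrinkage * sum(map_tag_counts)
    return shrunken_estimate / ground_count
```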

We include the effects of sensor type (images come from three satellite sensors, including QuickBird-2).

All other locations were associated with peripheral islets. We distinguish these two models hereafter as the "haul-out location" and "islet/mainland" models.

The models predict the ratio of seals counted by the crowd (using the expert correction method) to seals counted on the ground, i.e., the detection rate, after accounting for all relevant effects. Thus, to obtain the predicted number of seals, we divide the crowd count by the predicted detection rate:

predicted number of seals = crowd count / predicted detection rate

The use of our expert correction approach to shrink counts resulted in increased precision and accuracy.

Table 3 also shows that, contrasted with Big Razorback (the reference site in the model, an "islet" location), the "mainland" sites, Hutton Cliffs and Turks Head-Tryggve, showed very strong negative effects on detection rate. That is, detection rate estimates from these "mainland" sites (i.e., near Ross Island in these two cases) are much lower than those of other locations (Table 3), supporting the dichotomy explored in the islet/mainland model.

Table 4 shows the results of the final islet/mainland model. Under this model, there is no longer an effect of number of maps with tags, or of its interaction with number of tags placed. This indicates a confounding effect between number of maps and the islet/mainland binary. Indeed, the mainland locations had larger numbers of seals. The model with islet/mainland effects had an adjusted R-squared value of 0.31, which was less than half that of the haul-out location model (Table 3) (see also goodness-of-fit plots shown in the Supporting Information S1 document).
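Inverting the fitted detection rate is then a single division; for instance, a shrunken crowd count of 100 with a predicted detection rate of 2/3 yields a predicted abundance of 150 seals (numbers illustrative only, function name ours).

```python
def predicted_abundance(crowd_count, predicted_detection_rate):
    """Divide the (shrunken) crowd count by the model-predicted
    detection rate to estimate the number of seals on the ground."""
    return crowd_count / predicted_detection_rate
```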

The solid gray line is the fitted linear model of estimated abundance vs ground counts.

properly correct the error introduced by the citizen scientists.

We show that the calibration against an expert to develop a correction factor (involving a "shrinkage" factor) is able to account for the imperfect detection and the probability of false positive assignments among surveyors (see the Supporting Information S1 document for details). This approach provides reasonably accurate and unbiased estimates of seals in maps. However, the number of seals detected in maps, even by the expert, is only about 2/3 of all seals actually on the ground (Fig 4). This difference is not necessarily due to taggers (or the expert) failing to detect seals in an image, but more likely due to the fact that some fraction of seals will be in the water, and not hauled out on the ice (e.g., Fig 7).

There are many more years of imagery available, thus employing the crowd to assess a longer time series appears feasible. Such an approach would be needed, for example, to adequately monitor seal populations within the context of conservation and impact assessment of long-term drivers, such as a fishery or climate change. CCAMLR is mandated to monitor "related or dependent species" relative to any extraction activities, so as to minimize or avoid adverse impacts.

Correcting citizen science-based wildlife counts

Our "shrinkage factor" calculation can be applied to a broad range of wildlife count data. Any citizen science count dataset is likely to be subject to variance in surveyor skill and the potentially biased distribution of these skills in place and time. As long as an expert assessment is available and can be contrasted to a representative subset of the surveyors to determine the accuracy of their counts, our method can be used to remove the bias and correct the counts. We show that this approach produces estimates that can be sufficiently accurate at the appropriate spatial scales.
We also show that it is a much more reliable method than those that rely on how surveyors rank each other ("wisdom of the crowd").

There is nothing intrinsic to the data that guarantees it is free of bias in surveyor skill. Careful examination and estimation of the bias to correct it must be a fundamental component of citizen science-based wildlife counts. The method presented here is a step in that direction.