Data fusion for abundance estimation: community science augments systematically collected removal-in-time distance sampling data

Ecologists use a variety of systematically and opportunistically sampled count data to estimate bird abundance, and integrating or fusing different datasets has emerged as a critical challenge in recent years. While previous work provides data integration methodology for occupancy (presence/absence) estimation, methods for abundance estimation that account for imperfect detection and disparate survey protocols remains an active area of research. Here we show how to integrate systematically collected removal-in-time distance sampling data from the Integrated Monitoring in Bird Conservation Regions (IMBCR) program with North American Breeding Bird Survey (BBS) point counts and eBird community science observations. Using the Grasshopper Sparrow (Ammodramus savannarum) in the Great Plains of the United States as a focal species, we demonstrate that BBS and eBird data improve predictive performance for IMBCR count data, providing more spatially refined and precise estimates of abundance at regional scales. Data fusion increased predictive performance even despite relatively weak spatial correlations among data sets. The methodology developed here provides a principled way to fuse data when estimating abundance with distance sampling, that accounts for imperfect detection and variable effort.

with 20 km spatial jitter to retain spatial anonymity, BBS routes, eBird checklists, and the hexagonal spatial grid used to model spatial autocorrelation.
The secondary sampling units are routes selected along roads within the primary blocks. Historically, blocks 82 were selected using a random sampling design stratified by state, but now uses state by BCR stratification 83 (John R. Sauer, Fallon, and Johnson 2003). Stops within routes are tertiary sampling units. A route consists 84 of 50 roadside stops 800 m apart along a transect (a connected stretch of road) surveyed sequentially by 85 an observer. Typically observers survey the same routes from year to year, and counts tend to be low for 86 observers' first year surveying a route (Kendall, Peterjohn, and Sauer 1996). While specific stop locations 87 are not known precisely, route paths are available for 164 routes in the focal region. Thus we sum counts 88 across stops within routes. The BBS uses experienced volunteers, and the resulting data are expected to be 89 moderate or intermediate quality, suitable for regional conservation planning (Rosenberg et al. (2017); John 90 R. Sauer et al. (2017), but see Thogmartin et al. (2006) and Sólymos et al. (2020)). 91 eBird data are community science observations generated by many different observers using various protocols.

92
For consistency with IMBCR an BBS data we focus on stationary protocols only, where observers count 93 all individuals encountered at a spatial location over a known time interval. These data contain a wide 94 range of survey durations, and variable observer skill. Further, eBird data exhibit spatial bias in availability 95 (Callaghan et al. 2019). We expect these data to be of mixed quality, less reliable than IMBCR or BBS data. 96 To explain local variation in density across the region, we extracted land cover data in a 125 m buffer from 97 each IMBCR point from the National Land Cover Database. We summarized this data by computing the 98 fraction of area within 125 m radius comprised of shrubland, grassland, and cultivated crop, generating one 99 fractional summary per cover type for each IMBCR point. These three land cover summaries were used to 100 explain local spatial variation in bird density for the IMBCR data. 101 Regional variation in density was modeled on a hexagonal grid over the spatial domain of the observed 102 data, buffered by 50 km in every direction. The resulting grid has 1111 cells that are 779 km 2 in area, 103 spaced approximately 30 km apart. Spatial discretization is useful for constructing spatial random effects, 104 because sparsity in the neighborhood structure permits computationally efficient representations of spatial 105 autocorrelation (Besag and Kooperberg 1995). The size of the cells represents a tradeoff between computational 106 efficiency and spatial grain. We selected cell sizes that led to reasonable computation times (< 1 day), and 107 spatial scales similar to the largest sampling units (BBS routes, with all but one BBS route being completely 108 contained in one cell and one BBS route spanning two cells). (2014), and we integrate auxiliary data using latent multivariate spatial fields, the "correlation approach" 113 described in Pacifici et al. (2017). Models differ in terms of which auxiliary data augment IMBCR data. 114 2.3.1 IMBCR model 115 We index secondary sampling points by k, so that we have point-transect data from points k = 1, ..., K. 116 When point k is surveyed, the radial distance and time-to-detection is recorded for each detected bird. The 117 data consist of the total number of individuals detected at each point y k , along with the time period t i and 118 distance d i for every detected individual i = 1, ..., n, where n = k y k . The original description of this model 119 in Amundson, Royle, and Handel (2014) used an N-mixture model for abundance, and we use an equivalent 120 integrated formulation (details in appendix): where λ k is the expected population size in the sampled area around point k, and p k is the probability that an 122 individual is available and detected, given that it is present (Amundson, Royle, and Handel 2014). Expected 123 population size is modeled as a function of known covariates and unknown spatial adjustments: where X k is a row vector of known covariates (a 1 to represent an intercept, and land cover fractions around 125 point k), β is a vector of coefficients, and θ (1) c[k] is a spatial adjustment corresponding to the hexagonal cell 126 containing point k, denoted c [k]. Expected density can be computed by dividing by sampled area, e.g., 127 expected number of individuals per square km at point k is would be λ k πr 2 when using data in a radius of r 128 km.

129
To be detected, individuals must be available (e.g., calling or visible), and perceived (e.g., heard or seen) by

133
Individuals are only counted once during a survey, and the probability of availability at point k is the sum of 134 the probabilities of availability in each of T time intervals: tk is the probability of availability in the t th time interval (e.g., the t th minute). These interval-wise 136 probabilities are computed as: where α k is the probability that an individual is available (e.g., calling and/or visible) during one time interval 138 at point k. Heterogeneity in availability that can arise from survey conditions (e.g., temperature) can be 139 incorporated in the model via a k (Amundson, Royle, and Handel 2014). Here, we assume that α is constant, 140 and remove the k subscript from subsequent notation.

143
The probability of perception is modeled as a function of distance, discretised into 10 equidistant bins 144 b = 1, ..., B from 0 m to 125 m. The distance bin d i of the i th detected individual is modeled as: bk is the probability of being detected in distance bin b given availability, and p bk . We 146 use a half-normal detection function to obtain the cell probabilities (Amundson, Royle, and Handel 2014): where r b is the radial distance from the spatial coordinates of point k to the midpoint of distance class b, 148 σ obs is an observer-specific spatial decay parameter that represents how quickly the detection probabilities 149 decrease as a function of distance (with σ obs[k] denoting the particular value for the observer who surveyed 150 point k), δ b is the the width of distance bin b, and d max is the maximum truncation distance for detections 151 (Buckland et al. 2001).

152
By modeling the distance and time interval data separately we are assuming that they are (conditionally) 153 independent. We assume no interaction between distance and time, such as would result if birds tended to 154 move toward or away from observers once a survey begins. Following Amundson, Royle, and Handel (2014), 155 we used ANOVA to test the assumption that distance and time of detection are independent. We failed to 156 reject the assumption of independence (p = 0.28). 158 We use a simple count model for BBS data, aggregating counts among all 50 stops within each route. The

159
BBS data consist of counts z r at routes r = 1, ..., R. These counts are modeled with a negative binomial 160 distribution: is the expected count at BBS route r and φ (z) is an overdispersion parameter. The expected 162 counts are a function of an overall mean and a spatial adjustment: is a row vector of known covariates (a 1 to represent an intercept, and a binary indicator for 164 whether it was an observer's first time on a route (Kendall, Peterjohn, and Sauer 1996)), β (z) is a parameter 165 vector, and θ (2) c[r] is a spatial adjustment.
is the expected value, and φ (q) is an overdispersion parameter. The expected value depends on 171 covariates and varies spatially: is a row vector of covariates (a 1 to represent an intercept, and a measure of "effort" or person-173 hours, which we compute as the product of duration and number of observers), β (q) is a parameter vector, 174 and θ is a spatial adjustment.
where 0 is a vector of zeros of length 3C, Σ is a 3 × 3 covariance matrix for IMBCR, BBS, and eBird data,  BBS data provided additional spatial information, increasing the posterior precision of the spatial random 205 effects relative to the model trained only with IMBCR data (Figure 2). Spatial adjustments associated with 206 eBird and BBS were weakly positively correlated with the spatial adjustments for the IMBCR submodel.

207
The correlation between IMBCR and BBS adjustments was 0.19 (posterior median), with a 95% credible   Grasshopper Sparrows were more abundant at points with a high fraction of grasslands, with a posterior 216 median effect of 1.79 (95% CI: 0.94, 2.64). We did not observe as strong relationships between abundance 217 and cultivated crop cover (posterior median: 0, 95% CI: (-0.8, 0.83)) or shrub cover (posterior median: 1, 218 95% CI: (-0.03, 1.8)) ( Figure 3). As a consequence, the predicted densities and occupancy probabilities across 219 the study region reflect both the large-scale spatial effects and fine-scale grassland cover effects (Figure 4). 220 Figure 4: Spatial predictions of posterior median expected density across the study region.
Predictive performance of the final model on the test set was satisfactory. Empirical 95% credible interval 221 coverage for counts in the test set was 99.7%. The fraction of zero counts in the test set was 0.60, and the 222 95% CI from the full model for the fraction of zeros in the test set was (0.48, 0.84) with a median value of 223 0.69. Similarly, the maximum count in the test set was 5, and the 95% CI for the maximum was (4, 80), with 224 a median of 11.

225
Detection probabilities in the IMBCR point transect data varied by observers, such that the best observers 226 were able to still detect calls at 125 m, and the worst observers had very low detection beyond 80 m ( Figure 5).

227
While it is not possible to separate abundance and detection effects in the BBS and eBird data, the covariates 228 that we included did explain variation in expected counts. Specifically, we found a weak negative effect of a 229 surveyor's first year on a route in the BBS data. The posterior median for the first year effect on BBS counts 230 was -0.35, and the 95% CI was (-1.1, 0.48). The posterior probability that this effect was negative was ≈ 231 0.82. We also found weak evidence for eBird counts being positively related to effort (posterior probability 232 that the effect was positive: 0.8). The posterior median for the eBird effort effect was 0.1, and the 95% CI  gains for observed counts when augmenting systematic survey data with eBird using random forest count 245 models. A key difference between the model described in this paper and the approach of Robinson et al.

246
(2020), is here we estimate density using a submodel that capitalizes on a removal-in-time distance sampling 247 protocol that accounts for imperfect detection. However, machine learning and structured observation models setting, all of these data would be unknown such that neither the BBS or eBird data could be treated as 282 observed covariates.

283
While the current approach borrows spatial information from BBS and eBird to inform the IMBCR submodel, 284 it does so using a single correlation parameter that is applied across the entire region. This might not 285 be ideal, given that the BBS is a roadside survey, and the eBird checklists tend to concentrate in areas 286 of high population density and/or around major interstate roads (Figure 1). This correspondence might 287 underlie the relatively strong positive correlations between BBS and eBird reported here. As a consequence, 288 it may be that these auxiliary data are most informative for IMBCR data from similar locations (e.g., point 289 transects near roads). Future work might evaluate whether there are advantages to allowing for more flexible 290 spatial correlation structures, that might share more or less information among datasets depending on spatial 291 similarity. A "covariate" or conditional data integration approach with spatially varying coefficients might be 292 particularly amenable to this (Hanks, Hooten, and Baker 2011;Pacifici et al. 2017), to allow the relationship 293 between auxiliary and focal data to depend on land cover and road density (Finley 2011).

294
A variety of data are available to understand the distribution and abundance of birds, but integrating 295 different datasets remains a key area for development (La Sorte et al. 2018). Here, we demonstrate that the 296 combination of IMBCR, BBS, and eBird data provides better predictive performance than using IMBCR 297 data alone. Having outlined some future directions for multi-species and multi-temporal settings, we hope 298 that future work can integrate additional data to improve estimates of bird abundance, understand ecological 299 processes, and improve conservation outcomes.

445
We can collapse these three stages into one. First, recognize that this is a hierarchical Poisson-binomial-446 binomial compound distribution The last two binomial components can be collapsed into one using the 447 binomial-binomial hierarchy, to obtain a marginal distribution for y k : This factorizes: Plugging in the binomial probability mass functions: