A general deep learning model for bird detection in high resolution airborne imagery

Advances in artificial intelligence for computer vision hold great promise for increasing the scales at which ecological systems can be studied. The distribution and behavior of individuals is central to ecology, and computer vision using deep neural networks can learn to detect individual objects in imagery. However, developing supervised models for ecological monitoring is challenging because it requires large amounts of human-labeled training data, advanced technical expertise, and computational infrastructure, and is prone to overfitting. This limits application across space and time. One solution is developing generalized models that can be applied across species and ecosystems. Using over 250,000 annotations from 13 projects from around the world, we develop a general bird detection model that achieves over 65% recall and 50% precision on novel aerial data without any local training, despite differences in species, habitat, and imaging methodology. Fine-tuning this model with only 1000 local annotations increases these values to an average of 84% recall and 69% precision by building on the general features learned from other data sources. Retraining from the general model improves local predictions even when moderately large annotation sets are available and makes model training faster and more stable. Our results demonstrate that general models for detecting broad classes of organisms using airborne imagery are achievable. These models can reduce the effort, expertise, and computational resources necessary for automating the detection of individual organisms across large scales, helping to transform the scale of data collection in ecology and the questions that can be addressed.


Introduction
Airborne image capture is revolutionizing data collection in ecology by providing information on the distribution and behavior of individuals at spatial extents that are difficult to achieve with ground-based surveys.

Held-out test images were kept separate from training images. For example, if there were multiple flights in an area, each flight would occur only in the train or the test set. Similarly, if multiple islands were surveyed, each island would occur only in the train or the test set. See Appendix S1 for information on each dataset, including camera specifications, species lists, and flight information. Code, processed images, and annotations are made available on Zenodo (Weinstein 2021). The final model (https://github.com/weecology/BirdDetector) and evaluation procedures are made available through the DeepForest Python package, allowing users to extend, train, and evaluate models with minimal difficulty (Weinstein et al. 2020).
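The group-wise splitting described above (keeping every flight or island entirely within either the train or the test set) can be sketched as follows. This is a minimal illustration, not the released code; the `group_key` field names and the `group_split` helper are assumptions for the example.

```python
import random

def group_split(annotations, group_key, test_fraction=0.25, seed=0):
    """Split annotations so that every group (e.g., a flight or an island)
    falls entirely in either the train set or the test set, never both."""
    groups = sorted({a[group_key] for a in annotations})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [a for a in annotations if a[group_key] not in test_groups]
    test = [a for a in annotations if a[group_key] in test_groups]
    return train, test

# Example: four flights with three annotations each; no flight spans both sets.
annotations = [{"flight": f, "box": i} for f in "ABCD" for i in range(3)]
train, test = group_split(annotations, "flight")
assert {a["flight"] for a in train}.isdisjoint({a["flight"] for a in test})
```

Splitting by group rather than by individual image avoids leakage, since images from the same flight share lighting, altitude, and habitat and are therefore not independent samples.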

Models and Analysis
A general model for bird detection should predict birds in novel environments, allow customization to new datasets using local annotations, and perform better than models trained solely on local data. To test these characteristics, we trained a suite of models for analysis (Table 2). All models shared the same architecture and general workflow but differed in input data. The architecture was initially developed for identifying trees in airborne imagery by Weinstein et al. (2019, 2020b). The model was a one-stage RetinaNet object detector with a convolutional neural network backbone (Lin et al. 2017), implemented in the DeepForest Python package (Weinstein et al. 2020a). RetinaNet uses focal loss to up-weight difficult-to-predict samples, reducing overfitting to easy-to-predict ones. The backbone was a ResNet-50 network pretrained on the ImageNet classification benchmark.

One of the major challenges in building generalized models for airborne bird detection is that airborne data from different sources vary in the height of image capture and the resolution of the camera. This leads to differences in the size, contrast, and detail of birds among datasets. To synthesize bird detections at different resolutions, we performed on-the-fly data augmentation during training, changing the sizes of individual annotations to represent variation in the height and resolution of image capture (Zoph et al. 2019). During each batch, a random annotation was selected, and a randomly sized box was placed with this focal annotation at its center. To avoid over-zooming on the annotation and filling the entire image, we set a minimum box size of 0.15 times the original image size. This technique reduces overfitting but cannot fully remedy differences in image resolution, since it is possible to downscale images realistically but difficult to upscale them.
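The random-zoom augmentation described above can be sketched as the geometry of choosing a crop window. This is a minimal sketch of the idea, assuming boxes in (xmin, ymin, xmax, ymax) pixel coordinates; the function name and exact sampling scheme are illustrative, not the DeepForest implementation.

```python
import random

def random_zoom_box(image_size, annotation, min_frac=0.15, rng=None):
    """Pick a random crop window centered on a focal annotation.

    image_size: (width, height) of the source image in pixels.
    annotation: (xmin, ymin, xmax, ymax) of the focal bird.
    min_frac: minimum crop side as a fraction of the image (0.15 in the
        text), preventing over-zooming on a single annotation.
    """
    rng = rng or random.Random()
    width, height = image_size
    # Random crop side length between the minimum fraction and the full image.
    side = rng.uniform(min_frac, 1.0) * min(width, height)
    cx = (annotation[0] + annotation[2]) / 2
    cy = (annotation[1] + annotation[3]) / 2
    # Center the window on the annotation, clamped to stay inside the image.
    xmin = min(max(cx - side / 2, 0), width - side)
    ymin = min(max(cy - side / 2, 0), height - side)
    return xmin, ymin, xmin + side, ymin + side
```

Because the crop side varies between 0.15 and 1.0 of the image, resizing each crop back to a fixed input resolution effectively rescales the birds it contains, simulating variation in flight altitude and camera resolution.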
In addition to random zooming, each image was randomly flipped over the x or y axis with a probability of 0.5.

To evaluate the ability of the general model to predict birds in novel locations, we performed a leave-one-out cross-validation analysis. For each large dataset, we trained a 'cross-validation' model using all other datasets and then predicted the test images of the withheld dataset (Table 2). While the final 'general' model available to users is trained on every dataset, this leave-one-out strategy is a conservative proxy for future use because it represents how well a general model works when it has not been trained on data from a new monitoring effort. Each cross-validation model was trained with a batch size of 32 for 12 epochs. Training used stochastic gradient descent (SGD) with a momentum of 0.9 and an initial learning rate of 0.001. The learning rate was reduced by 50% when validation loss had not decreased by more than 0.001 over a period of 10 epochs, down to a minimum learning rate of 0.00001.

To determine whether starting from the general model improved performance for new datasets, we fine-tuned each cross-validation model using local data. For example, to test the ability to customize to the Atlantic Seaduck dataset (ID = 8), we started from the cross-validation model trained on all other datasets and added Atlantic Seaduck annotations. We trained multiple versions of each fine-tuned model, each with a subset of local data containing 1000, 5000, 10000, or 20000 annotations. We repeated the sub-sampling for the fine-tuned model three times to evaluate the effect of image sampling. Finally, to determine whether the fine-tuned model benefited from pretraining on all other datasets, we trained a 'local-only' model that used the same annotations as the fine-tuned model but started from standard ImageNet weights.
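The learning rate schedule above (halve on a validation-loss plateau, with a 0.001 improvement threshold, 10-epoch patience, and a 1e-5 floor) follows the semantics of a standard reduce-on-plateau scheduler, such as PyTorch's `ReduceLROnPlateau`. A dependency-free sketch of that logic, with illustrative class and method names:

```python
class PlateauScheduler:
    """Halve the learning rate when validation loss fails to improve by
    more than `threshold` for `patience` epochs; never go below `min_lr`.
    Defaults mirror the values reported in the text."""

    def __init__(self, lr=0.001, factor=0.5, patience=10,
                 threshold=0.001, min_lr=0.00001):
        self.lr, self.factor = lr, factor
        self.patience, self.threshold, self.min_lr = patience, threshold, min_lr
        self.best = float("inf")   # best validation loss seen so far
        self.bad_epochs = 0        # epochs without sufficient improvement

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the lr."""
        if val_loss < self.best - self.threshold:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr
```

For example, twelve consecutive epochs with no improvement exhaust the 10-epoch patience and drop the learning rate from 0.001 to 0.0005, while any improvement larger than the threshold resets the counter.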
The same test dataset was used for the fine-tuned and local-only models. The zoom data augmentation strategy was not used in these models, since flight height and object size are largely conserved within each dataset. The local-only models were more sensitive to initial conditions, and for each dataset we attempted to find the number of epochs that led to model convergence and the highest performance.

For all analyses, we used precision and recall on held-out images for model evaluation. The most common evaluation metric in object detection is intersection-over-union (IoU), defined as the area of intersection between the true and predicted bounding boxes divided by the area of their union. Using this metric, we assessed model recall, defined as the proportion of ground truth boxes overlapping a predicted box with an IoU greater than 0.2, and model precision, defined as the proportion of predicted boxes overlapping a ground truth box with an IoU greater than 0.2. We selected this threshold because the vast majority of annotations were automatically created from original points placed on individual birds; the exact outline of individuals is therefore approximate and secondary to the goal of detection and enumeration. To rank models, we also calculated the F1-score for each model, which combines precision and recall as F1 = 2 * (precision * recall) / (precision + recall).
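The evaluation metrics above can be sketched directly from their definitions. The greedy one-to-one matching of predictions to ground truth is an assumption for this illustration, not necessarily the exact matching procedure used by DeepForest:

```python
def iou(a, b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def precision_recall_f1(truth, preds, threshold=0.2):
    """Precision, recall, and F1 at the IoU threshold used in the text
    (0.2), matching each ground truth box to at most one prediction."""
    matched = set()
    true_positives = 0
    for p in preds:
        best, best_i = 0.0, None
        for i, t in enumerate(truth):
            if i in matched:
                continue
            score = iou(p, t)
            if score > best:
                best, best_i = score, i
        if best_i is not None and best > threshold:
            matched.add(best_i)
            true_positives += 1
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

The permissive 0.2 threshold reflects the point-derived annotations: a prediction that substantially overlaps a bird counts as a detection even if its box outline is loose.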

Results and Discussion
General models for ecological object detection will be most useful if they can detect individuals in novel environments, allow customization to new datasets using local annotations, and produce better detections than models developed with limited local annotations alone. To test for these characteristics, we trained a suite of local and general models for analysis (Table 2).

To evaluate the ability of a general model to predict birds in novel locations, we performed a leave-one-out cross-validation analysis. For each large dataset, we trained a 'cross-validation' model using all other datasets and then predicted the test images of the withheld dataset (Table 2). Each model was judged to have converged by visually assessing the validation loss during training (Figure S3). The mean recall on the held-out datasets was 67.9% (range = 29.8-95.4%), the mean precision was 52.9% (range = 18.8-79.7%; Figure 2), and the mean F1 score was 54.5% (range = 30.4-81.5%; Table 3). In general, performance was better for datasets with high-resolution imagery, such as the West African Terns (ID = 4; recall = 87.7%; resolution < 1 cm), whereas lower-resolution datasets like the Antarctic Chinstrap Penguins (ID = 5; resolution > 2 cm) had lower values (recall = 29.8%). Datasets with forested backgrounds similar to the Everglades dataset (ID = 1), which forms the backbone of the training annotations, had higher precision, such as the South Pacific Seabirds (ID = 2; precision = 74.0%), whereas datasets with complex aquatic backgrounds had lower precision (e.g., Canadian Marshbirds, ID = 7; precision = 18.9%). These results suggest that a generalized model has the potential to make accurate predictions for completely novel species and environments, but that its performance will depend on having sufficiently diverse training data to obtain highly accurate predictions across all novel environments.
General models can be refined to local conditions by fine-tuning the model using small amounts of local human-labeled data. Using ~1000 annotated birds from the local site improved the mean recall, precision, and F1-score to 84.3%, 66.0%, and 74.5%, respectively (for details on each dataset, see Appendix S1).
In addition to making effective predictions with little or no local training data, building on general models may result in more straightforward model development and better predictions for ecological studies even in cases with moderate amounts of training data (Figure 4). This is due to their ability to learn robust general features, thus avoiding overfitting and producing more accurate predictions on images that deviate from the training set, a common occurrence when scaling up monitoring efforts. We compared the performance of local-only models with the cross-validation models that had the same structure but were initially trained on the training data from all other datasets. In 11 of the 13 test datasets, starting from the general model weights improved overall performance (F1-score), often by large margins (Table 3). While this difference was largest when using small amounts of local data, it persisted for some datasets even when using >10,000 local annotations, and the fine-tuned general model always performed at least as well as the local-only model (Figure 4). Local-only models were also highly variable, with large changes in performance among runs; they were sensitive to learning rate and other training hyperparameters and required more computationally intensive training. Fine-tuned models required only 20 epochs of training, whereas local-only models needed to be trained for at least 70 epochs to produce reasonable results (Table 2). Even among local-only models there was large variation in the amount of training needed. For example, the West African Terns dataset (ID = 4) had 0% recall and 0% precision after 70 epochs, even when using 20,000 local annotations.
Extending training to 110 epochs resulted in a rapid increase to 84% recall and 87% precision, but the potential for good predictions on datasets like this one would often be missed given the consistently poor performance at shorter training times. This idiosyncratic behavior was difficult to anticipate, since the tern dataset is similar to other datasets in terms of bird density, image resolution, and background complexity. Compared to local-only models, the fine-tuned models were more uniform, exhibiting significantly less between-run variance (Figure 4).

While the datasets used in this study differed in capture altitude, angle, and sensor specifications, they were broadly similar in using RGB data with resolutions <3 cm. Generalizing to spatial resolutions >3 cm and to non-RGB remote sensing (e.g., hyperspectral imagery) requires further study across sensors and data acquisition strategies. For example, fixed-wing aircraft surveys covering hundreds of miles are unlikely to capture images at ultra-high resolution due to storage and processing limitations. It is unknown how well features learned from 2 cm imagery will transfer to 10 cm airborne imagery or high-resolution satellite imagery (~30 cm). One approach to this type of generalization is to downsample the higher-resolution data and train a series of models that bridge the features learned from high-resolution to low-resolution data. This approach is known as 'curriculum learning' (Graves et al. 2017) and can be useful in transferring information among spatial resolutions.

Conclusion
Aerial imagery is a powerful tool for studying species and ecosystems at temporal frequencies and spatial extents that are difficult to achieve with traditional methods, but it comes with computational and analysis challenges that have limited its widespread application. General computer vision models provide a way to simplify the processing of aerial imagery, allowing researchers to extract ecological data from large amounts of imagery more easily, efficiently, and accurately. We showed that general models can provide accurate predictions for novel ecosystems and novel species, either with no local training data or by retraining with very small numbers of annotations. Even when large amounts of local data are available, starting with general models produces more stable results with less computational expense, and often performs better than local-only models because of the general features these models learn from other ecosystems and taxa. The ability of general computer vision models to make accurate predictions in novel circumstances will make them an essential tool for monitoring dynamic ecosystems, where species and habitats may change over time or space. Because the need for local hand-annotations is limited, general models can be rapidly deployed in new environments and support aerial monitoring of rare species, which can be difficult to study and have limited annotations available for local model development. By reducing the effort, expertise, and computational resources necessary to develop computer vision models for image processing, general models have the potential to revolutionize the types of data ecology can collect.