Taxonomic Classification of Ants (Formicidae) from Images using Deep Learning

Ants (Formicidae) are a well-documented, species-rich, and diverse group and important ecological bioindicators for species richness, ecosystem health, and biodiversity, but ant species identification is complex and requires specific knowledge. In recent years, insect identification from images has seen increasing interest and success, with processing speeds improving and costs decreasing. Here we propose deep learning, in the form of a convolutional neural network (CNN), to classify ants at species level using AntWeb images. We used an Inception-ResNet-V2-based CNN to classify ant images across three shot types, with 10,204 images for 97 species, in addition to a multi-view approach, for training and testing the CNN, while also testing a worker-only set and an AntWeb protocol-deviant test set. Top-1 accuracy reached 62%-81%, top-3 accuracy 80%-92%, and genus accuracy 79%-95% on species classification for the different shot type approaches. The head shot type outperformed the other shot type approaches. Genus accuracy was broadly similar to top-3 accuracy. Removing reproductives from the test data improved accuracy only slightly. Accuracy on AntWeb protocol-deviant data was very low. In addition, we make recommendations for future work concerning image thresholds, distribution, and quality, multi-view approaches, metadata, and protocols, potentially leading to higher accuracy with less computational effort.

top97species Qmed def. The distribution of images per species for the dorsal shot type is shown in Table 1 on page 34. We partitioned the images randomly into non-overlapping sets: approximately 70%, 20%, and 10% for training, validation, and testing, respectively (see Table 1 on page 34). This 70%-20%-10% split was used in every subsequent dataset involving training. We downloaded images in medium quality, measuring 233 pixels in width and ranging from 59 pixels to 428 pixels in height (for sample images see Figure 2 on page 28).

Cleaning the data. This initial data set still contained specimens that miss a gaster and/or head, or that are close-ups of body parts (e.g. thorax, gaster, or mandibles). A small group of other specimens showed damage by fungi or were affected by glue, dirt, or other substances. These images were removed from the dataset, as they are not suitable for training.

Multi-view data set. In order to create a multi-view dataset, we only included specimens in top97species Qmed def clean with all three shot types. A total of 95 specimens (151 images) had two or fewer shot types and thus could not be used. This list was combined with the bad specimen list for a total of 115 specimens (as there was some overlap between the one/two-shot specimens and the bad specimens). We removed these 115 specimens from the initial dataset so that 3,322 specimens remained, all with three images per specimen, one per shot type, in a dataset named top97species Qmed def clean multi (see Table 1 on page 34). The most imaged species, Camponotus maculatus (Fabricius, 1782), had 223 three-shot specimens and the least imaged species, Camponotus christi (Forel, 1886), only 18. Before stitching, we scaled all images to the same width, using the width of the widest image. If after scaling an image had fewer pixels in height than the tallest image, black pixels were added to the bottom of this image to complement the height of the tallest image (example in Figure 3 on page 29). We did not consider the black pixels a problem, as the model will learn that these black pixels do not represent discriminating features between species.
The images were then combined into a horizontally stacked dorsal-head-profile image, followed by normalizing pixel values to [−1, 1]. A test data set of seven species with 28 images per shot type (see Table 1 on page 34) is used to assess whether the model can be applied to images that deviate from the AntWeb protocol, indicating whether an application will be of practical use to natural history museums and collections with existing image banks.
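The scaling, padding, and stacking steps above can be sketched as follows. This is a minimal illustration, not the authors' exact code; the function name and the use of Pillow with NumPy are assumptions.

```python
import numpy as np
from PIL import Image

def stitch_shot_types(dorsal, head, profile):
    """Stitch the dorsal, head, and profile images of one specimen into
    a single horizontal dorsal-head-profile image (a sketch of the
    procedure described in the text)."""
    images = [dorsal, head, profile]
    # 1. Scale every image to the width of the widest image.
    target_w = max(im.width for im in images)
    scaled = [im.resize((target_w, round(im.height * target_w / im.width)))
              for im in images]
    # 2. Pad shorter images with black pixels at the bottom.
    target_h = max(im.height for im in scaled)
    panels = []
    for im in scaled:
        arr = np.asarray(im.convert("RGB"), dtype=np.float32)
        pad = np.zeros((target_h - im.height, target_w, 3), dtype=np.float32)
        panels.append(np.vstack([arr, pad]))
    # 3. Stack horizontally and 4. normalize pixel values to [-1, 1].
    stitched = np.hstack(panels)
    return stitched / 127.5 - 1.0
```

Because the black padding is constant across classes, it carries no discriminating signal, which is why the text argues the model will learn to ignore it.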

We initialized Nadam with standard Keras settings (e.g. decay = 0.004), except one: the learning rate was set to 0.001 and allowed to change if model improvement stagnated.
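In the standalone Keras API of that era, this setup corresponds roughly to the sketch below. The `ReduceLROnPlateau` callback is an assumption: the paper says the learning rate was allowed to change on stagnation but does not name the mechanism or its parameters.

```python
from keras.optimizers import Nadam
from keras.callbacks import ReduceLROnPlateau

# Standard Keras Nadam settings (schedule_decay = 0.004 is the default),
# with the learning rate lowered from the default 0.002 to 0.001.
optimizer = Nadam(lr=0.001, schedule_decay=0.004)

# One plausible mechanism for changing the learning rate when model
# improvement stagnates (monitor, factor, and patience are assumptions).
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=5)
```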

Preprocessing. Before training, we normalized pixel values to [−1, 1] to meet the requirements of Inception-ResNet-V2 with a TensorFlow backend. Furthermore, we resized images to 299 × 299 pixels in width and height with the "nearest" interpolation method from the Python Pillow library. We kept the images in RGB, as for some specimens color could be important, giving them 3 channels in depth. In the end, input was formed as n × 299 × 299 × 3 with n as the batch size.
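The preprocessing described above can be expressed compactly; this is a sketch, with the function name an assumption.

```python
import numpy as np
from PIL import Image

def preprocess(image):
    """Resize to 299 x 299 with nearest-neighbor interpolation and scale
    pixel values from [0, 255] to [-1, 1], as Inception-ResNet-V2 with a
    TensorFlow backend expects."""
    resized = image.convert("RGB").resize((299, 299), Image.NEAREST)
    arr = np.asarray(resized, dtype=np.float32)
    return arr / 127.5 - 1.0
```

A batch is then formed as `np.stack([preprocess(im) for im in images])`, giving the n × 299 × 299 × 3 input tensor mentioned in the text.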

We configured the model to train for a maximum of 200 epochs if not stopped early, and evaluated it with test data including a worker-only and an AntWeb protocol-deviant test set.
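A training call matching this configuration might look as follows. This is a sketch under stated assumptions: the paper gives the 200-epoch cap and early stopping, but the monitored quantity and patience below are not stated there.

```python
from keras.callbacks import EarlyStopping

def train(model, train_data, val_data):
    """Train for at most 200 epochs, stopping early when validation
    performance stagnates (patience and monitored metric are
    assumptions)."""
    x_train, y_train = train_data
    return model.fit(
        x_train, y_train,
        validation_data=val_data,
        epochs=200,  # maximum number of epochs if not stopped early
        callbacks=[EarlyStopping(monitor="val_loss", patience=10,
                                 restore_best_weights=True)],
    )
```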

Consistently throughout our experiments, shot type accuracies were found to rank in the same order, with the head shot type performing best. As the confusion matrix is alphabetically sorted on genus, false predictions near the yellow diagonal line are most of the time found within the correct genus for these three big genera. Therefore we speculate that inter-genus features are better distinguished than intra-genus features.

Because the majority of specimens are workers, there is most probably a bias towards learning the workers of a species. We therefore speculate that the model has acquired an improved understanding and representation of workers. However, accuracy for workers increased only slightly when reproductives were removed from the test set. We see a slight increase in dorsal and profile worker accuracy over reproductives accuracy, but the increase is small. The only noticeable and interesting increase is for the head shot type, where workers were classified 15.60% more accurately (Table 3).

The image number threshold for the species in this data set was 68 images, which is approximately 23 images per shot type. That accounts for 16 images in the training set, which nonetheless achieved good accuracy. This means that the threshold could potentially be lower, and thus more species (with fewer than 68 images) could be incorporated.

However, more species (classes) will also complicate training and lower test accuracy. Nevertheless, it would be worthwhile to include these underrepresented specimens in automatic ant identification.

Results are not shown, but species in a species complex (i.e. species with subspecies) did not complicate training and did not cause accuracy problems. This was measured using the F1-score, calculated as the harmonic mean of precision and recall.

With an increasing number of species in a complex, the F1-score did not increase or decrease significantly; variation in the data could not be explained by a linear relation.
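For reference, the F1-score used above is computed as the harmonic mean of precision and recall:

```python
def f1_score(precision, recall):
    """F1-score: the harmonic mean of precision and recall.
    Returns 0.0 when both inputs are zero (the harmonic mean is
    undefined there)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, a class with precision 0.5 and recall 1.0 has an F1-score of 2/3, lower than the arithmetic mean of 0.75; the harmonic mean penalizes imbalance between the two.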

Of note is the labeling of this data set, as this was not managed by the author. Identifications and labels were taken directly from AntWeb, assuming that they were correct. However, there is always a chance that identifications are less accurate and certain than expected (e.g. Boer (2016)), despite being an expert-labeled data set. The reality is that ant identification is more complex work than labeling a cat and dog dataset.

To the best of our knowledge, this is the first time ants were classified to species level using computer vision, which also means that there is a lot to improve. In this section we will discuss some possible improvements for future research in the form of recommendations.

To start, focus should lie on creating a benchmark data set that is easy to enlarge and improve. To do that, it is first important to find the minimum image threshold for the model to perform well. Classification could also be organized taxonomically in consecutive CNNs (e.g. first subfamily, then genus, then species), but also in three parallel CNNs learning simultaneously. However, for this we first need to work on a (phylogenetic) tree and molecular data, which is a different study in itself. Moreover, there is also the option to classify on caste before classifying species, using a caste-trained CNN, and then make use of specialized worker-, male-, and queen-trained CNNs.

Another option is to incorporate metadata, e.g. biogeographic region, caste, country, collection locality coordinates, or even body size (using the scale bar included in images). Metadata could be very important, especially for species that are endemic to a specific region, and could provide underlying knowledge of species characteristics. Most of this information is already present on AntWeb and ready for use.

In order to improve the multi-view approach, multiple solutions have been tried (Zhao et al. 2017). A first option is to use just one CNN with all images as input, with the catalog number added as a label. The next option could be to train three shot-type CNNs in parallel and combine the output. The output can be processed as the average of the three shot-type predictions, or by using the highest prediction.

[Fragment of Table 1: numbers of validation and test images per shot type (dorsal, head, profile, and stitched) across the datasets. Footnote a: 2,843 specimens were marked as valid worker specimens and were therefore possible specimens for the worker-only test set.]
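The two ways of combining the parallel shot-type CNN outputs (averaging the three prediction vectors, or taking the single highest prediction) can be sketched as follows; the function name and interface are assumptions for illustration.

```python
import numpy as np

def combine_predictions(probs_by_shot, method="average"):
    """Combine class probabilities from three shot-type CNNs into one
    predicted class index (a sketch of the parallel-CNN options in the
    text).

    probs_by_shot: array of shape (3, n_classes), one softmax vector
    per shot-type CNN (dorsal, head, profile).
    """
    probs = np.asarray(probs_by_shot, dtype=float)
    if method == "average":
        # Average the three shot-type predictions, then pick the best class.
        return int(np.argmax(probs.mean(axis=0)))
    if method == "max":
        # Use the single highest prediction across all three CNNs.
        return int(np.unravel_index(np.argmax(probs), probs.shape)[1])
    raise ValueError(f"unknown method: {method}")
```

Note the two rules can disagree: one very confident CNN can dominate under "max" while being outvoted under "average".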