Accurate detection and tracking of ants in indoor and outdoor environments

Monitoring social insects' activity is critical for biologists researching their group mechanisms. Manually labelling individual insects in a video is labour-intensive, and automated tracking of social insects is particularly challenging: (1) individuals are small and similar in appearance; (2) frequent interactions with each other cause severe and long-term occlusion. We propose a detection and tracking framework for ants that: (1) adopts a two-stage object detection framework, using ResNet-50 as the backbone and encoding the position of regions of interest to locate ants accurately; (2) uses a ResNet model to develop appearance descriptors of ants; (3) constructs long-term appearance sequences and combines them with motion information to achieve online tracking. To validate our method, we build a video database of an ant colony captured in both indoor and outdoor scenes. We achieve state-of-the-art performance of 95.7% mMOTA and 81.1% mMOTP on indoor videos, and 81.8% mMOTA and 81.9% mMOTP on outdoor videos. Our method runs 6-10 times faster than existing methods for insect tracking. The datasets and code are made publicly available; we aim to contribute an automated tracking tool for biologists in relevant domains.

Author summary

Research on the group behaviour of social insects is of great interest to biologists, but before analysis, each insect needs to be tracked separately in a video, which is time-consuming and labour-intensive work. In this manuscript, we introduce a detection and tracking framework that can automatically track the movement of ants in a video scene. The software first uses a residual network to detect the positions of ants, then learns an appearance descriptor for each ant via another residual network. Furthermore, we obtain motion information for each ant using the Kalman filter. Combining appearance and motion information, we can accurately track every ant in the colony.
We validate the performance of our framework on 4 indoor and 5 outdoor videos, each containing multiple ants. We invite interested readers to apply these methods using our freely available software.

maintain correct tracking during a long period of occlusion. Once trajectory drift occurs, the accumulated errors will result in tracking failure [4][5][6]. In recent years, with the popularity of computer vision, many advanced object detection and tracking methods have emerged.

Object detection

Existing methods in object detection are categorised as one-stage or two-stage, according to whether there is a separate region-proposal stage. One-stage frameworks (e.g., YOLO [7]) are fast, but their accuracy is typically slightly inferior to that of two-stage detection. The popularity of two-stage detection frameworks was boosted by R-CNN [8], which proposes candidate regions via the selective search (SS) algorithm [9], so that the detector can focus on these RoIs. However, using the SS algorithm [9] to generate region proposals is the main cause of slow inference. Fast R-CNN [10] reduces the computational complexity of region proposals by downsampling the original image, while Faster R-CNN [11] proposes an RPN, which further improves the speed of training and inference.

Given the success of deep learning in general object detection tasks, researchers have also applied it to detect specific groups of animals, such as a single mouse [12] or fruit flies [13]. These methods are limited to tracking either a single object or a fixed number of objects. General tools [14,15] also offer the functionality to detect and track unmarked animals in images. However, most existing methods focus on ideal laboratory set-ups, and no existing work has reported the detection of ants in outdoor environments, which contain diverse backgrounds and arbitrary terrains. Our goal is to develop a framework for robust ant colony detection and tracking.
Our work focuses on accurate detection and tracking of ants in both indoor and outdoor scenes, and thus follows a two-stage detection framework, R-FCN [16]. Based on ResNet-50, we use position-sensitive score maps to encode the position information of the candidate bounding boxes proposed by the RPN, and then perform classification and regression, respectively.

Multi-object tracking (MOT)

In the last two decades, vision-based detection and tracking models have been widely used to study social insects [17,18]. Appearance (particularly colour) and motion information are the main metrics used in this category of methods. Due to the high similarity of ants' appearance, researchers either use pigment marking to create more distinct appearance features [19], or limit the observation to a laboratory setup [20,21]. State-of-the-art methods for insect tracking, such as Ctrax [20] and idTracker [21], are tested in a laboratory setup and use background subtraction, whose foreground information is integrated as a metric.

The DAT method is a mainstream method for ant colony tracking [4]. It allows a combination of multiple metrics and uses the Hungarian algorithm [23] to assign detections to trajectories. The PF method is suitable for solving nonlinear problems [2], but the growth in the number of particles leads to an exponential increase in computational cost, preventing effective multi-object tracking. Markov Chain Monte Carlo sampling can reduce the computational complexity [24], and a GPU-accelerated semi-supervised framework can further improve tracking accuracy and performance [3]. When the methods above are applied to tracking ant colonies, they are greatly disturbed by background noise and struggle to overcome the severe occlusion problem in dense scenes. Long short-term memory (LSTM) [25] and spatial-temporal attention mechanisms [26] have been developed to tackle the problem of long-term occlusion. A bilinear LSTM structure couples a linear predictor with input detection features, thereby modelling long-term appearance features [25]. The spatial-temporal attention mechanism, introduced with a cyclic structure classifier [27], is also suitable for the MOT task. We propose a complete detection and tracking framework based on the TBD paradigm. We construct a gallery for each trajectory to store the sequence of historical appearance descriptors, which is used as an online association metric. This strategy significantly mitigates the effects of long-term occlusion.

In this paper, we use a deep learning method to build a detection and tracking framework. Our main contributions are as follows:

• We adopt a two-stage object detection framework, using ResNet-50 as the backbone and position-sensitive score maps to encode regions of interest (RoIs).

• During the tracking stage, we use a ResNet network to obtain the appearance descriptors of ants and then combine them with motion information to achieve online association.

• Our method proves to be robust in both indoor and outdoor scenes. Furthermore, only a small amount of training data is required to achieve this goal in our experiments.

Ant colony database

We establish a video database of an ant colony, which contains a total of 10 videos. Five videos are from an existing published work [28] and were captured in an indoor (laboratory) environment. The videos in our database have a total of 4983 frames. There are 10 ants per frame in the indoor videos, and 18-53 ants per frame in the outdoor videos. The number of objects in this scenario is significant. Some video characteristics present challenges for detection and tracking algorithms, for example over-exposure in indoor videos and diverse backgrounds in outdoor ones.

There are caves and rugged terrains in the outdoor scenes, and ants may enter or leave the scene. Unlike multi-human tracking, ants are visually similar to one another, and this causes significant challenges for tracking. We manually label the videos frame by frame. To facilitate training and reduce labelling cost, the aspect ratio of each bounding box is 1:1. Considering the posture and scale of ants, we set the size of the bounding box to 96*96 for indoor videos and 64*64 for outdoor videos. The database and code will be made publicly available. In our ant database, we set up five groups of training sets. The original outdoor training set is insufficient to cover the wide range of diversity in environmental backgrounds and ant appearances.

In the subsequent experiments, we integrate the images of all outdoor scenes into the outdoor training set and dramatically improve the accuracy of outdoor testing. Fig 2 clearly shows the effects of using different training sets. Further increasing the number of images from outdoor videos improves the detection accuracy of outdoor scenes only slightly. For indoor environments, the detection accuracy is impervious to the choice of training set. Moreover, reducing the number of images to 50 (I5 has a total of 351 frames) does not reduce the detection accuracy. This shows that only a small number of training samples are needed to achieve satisfactory results when the training and testing scenarios are the same.

The frame rate is around 12 FR for indoor videos and 16 FR for outdoor ones. The difference in image resolution likely accounts for this performance gap. In practical applications, if accuracy is guaranteed, we tend to use smaller training sets to reduce labelling costs. Therefore, we use the model trained on "I5(50)+O1-5(50)" for the subsequent experiments.

Table 3. Tracking performance evaluation. The last two rows indicate that we use the ground truth of detection for tracking, which leads to a boost in tracking performance.
We add a set of comparative experiments in the last two rows of Table 3. There are two widely used insect tracking software packages: idTracker [21] and Ctrax [22]. idTracker needs the number of objects to be specified before tracking, in order to create a reference image set for each object. Meanwhile, Ctrax assumes that objects rarely enter or leave the arena. Thus, neither is capable of tracking in outdoor scenes, where the number of ants varies. Therefore, we compare these two methods only on videos depicting indoor scenes. To compare them with our method, we convert their representations into square boxes matching our ground truth.
Table 4 shows the tracking results. In addition to a significant improvement in tracking accuracy, our method is 6 and 10 times faster than idTracker and Ctrax, respectively (see the FR column). idTracker uses the intensity and contrast of the segmented foreground area to extract appearance features and construct a reference image set for each individual. However, it cannot track motionless individuals. Fig 4(a) shows that only a minority of the ants are successfully tracked over the course of the video. Further, there are some trajectory drifts.

Our method classifies and regresses twice to locate ants accurately. During the tracking stage, we use the historical appearance sequence as a reference and update it frame by frame. Compared with idTracker, our method effectively handles the long-term and short-term dependence of motion states, thereby reducing FM. Although we also assume a linear distribution of motion states, it is used only to filter out impossible associations and does not contribute to the association cost. We take the appearance distance between trajectories and detection boxes as the association cost, so the model is robust even when ant movement is complicated. The number of ants in outdoor scenes is on average 33 per frame. It is also typical for ants to engage in close body contact with each other for the purpose of information sharing. Naturally, their extremely close interactions are highly likely to cause mis-detection (Fig 6(a)). Additionally, the entrances and exits of ants in outdoor scenes are more prone to mis-detection (Fig 6(b)). Moreover, the dramatic non-rigid deformation of ants is another factor causing detection failure (Fig 6(c)). These three scenarios are all challenging cases that deserve our future efforts.

Overview

Following the TBD paradigm, we propose a unified framework for detection and tracking to efficiently and accurately track an ant colony in both indoor and outdoor scenes (Fig 8). In the detection phase, we adopt a two-stage object detection framework, using ResNet-50 as the backbone and encoding the RoIs proposed by the RPN via position-sensitive score maps.

The RPN was proposed in Faster R-CNN [11] to generate RoIs. Compared to SS [9], the RPN is based on a CNN structure and can be connected to the backbone with shared weights, significantly improving detection speed. We use ResNet-50 as the backbone and replace the fully connected layer with a 1*1 convolution to reduce the dimensions of the feature maps. The classification branch uses softmax to determine whether there is an object in an anchor, so this branch has 2*k outputs. The regression branch performs regression on the 4D position parameters of the anchors (i.e., centre coordinates, width, and height), so there are 4*k outputs. Given a w*h feature map, the RPN proposes k*w*h anchors, called RoIs. We use the non-maximum suppression (NMS) algorithm [29] to filter duplicate anchors, with the IoU threshold set to 0.7.
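As a concrete illustration, the greedy NMS filtering described above can be sketched in a few lines of Python; the (x1, y1, x2, y2) box format and the list-based loop are our own simplifications for illustration, not the paper's implementation:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

With the 0.7 threshold used in the text, two anchors are treated as duplicates only when they overlap heavily, which suits densely packed ants better than a lower threshold would.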

PSRoI-based detection

On the basis of the RoIs, the two-stage detection framework classifies the bounding boxes and fine-tunes their locations. In Faster R-CNN, RoIs are scaled to the last feature maps, which are focused on these areas through RoIPooling. Next, each RoI is classified and regressed through two fully connected layers, causing high computational complexity.

To reduce the number of parameters, we follow R-FCN [16] and generate position-sensitive score maps via a convolutional layer connected to the backbone. The classification and regression tasks each have independent position-sensitive score maps, forming three parallel branches together with the RPN.

For the classification task, since we only need to distinguish ants from background, we use k*k*2 convolution kernels to generate the score maps. k*k indicates that each RoI is divided into k*k regions to encode position information, and each region is encoded by a specific two-dimensional feature map. Similarly, we use k*k*4 convolution kernels to fine-tune the positions of RoIs in the regression task.

To focus on the RoIs, we perform average pooling on each region to get the feature maps, called position-sensitive region of interest (PSRoI) pooling, as the following formula shows:

r_c(i, j | Θ) = Σ_{(x,y) ∈ bin(i,j)} z_{i,j,c}(x + x0, y + y0 | Θ) / n

Here, r_c(i, j | Θ) is the downsampled result of the (i, j)-th region for the c-th category, and z_{i,j,c} is one score map among the k*k*2 position-sensitive score maps. (x0, y0) represents the top-left corner of the RoI, Θ is the set of network parameters, and n is the number of pixels in the region.

For the feature maps, we vote over the k*k regions to get the overall score of the RoI on the classification or regression task, as the following formula shows:

r_c(Θ) = Σ_{i,j} r_c(i, j | Θ)

In the formula, r_c(Θ) represents the overall score over all regions.

Next, we use softmax to implement the binary classification, as the following formula shows:

s_c(Θ) = e^{r_c(Θ)} / Σ_{c'} e^{r_{c'}(Θ)}

Here, s_c(Θ) is the probability of the c-th category. Finally, we use the non-maximum suppression algorithm to filter the bounding boxes.
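The three steps above (bin-wise average pooling, voting over the k*k regions, and softmax) can be sketched together in numpy. The (k*k, 2, H, W) layout of the score maps, the integer bin boundaries, and averaging as the voting rule are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def psroi_pool_scores(score_maps, roi, k=3):
    """PSRoI pooling followed by voting and softmax.

    score_maps: array of shape (k*k, 2, H, W) -- one 2-channel
    (background/ant) map per spatial bin, mirroring the k*k*2 kernels.
    roi: (x0, y0, w, h) with (x0, y0) the top-left corner of the RoI.
    """
    x0, y0, w, h = roi
    bin_w, bin_h = w / k, h / k
    votes = np.zeros(2)
    for i in range(k):
        for j in range(k):
            # Bin (i, j) reads only its own dedicated score map: r_c(i, j | Θ).
            xs = slice(int(x0 + i * bin_w), int(x0 + (i + 1) * bin_w))
            ys = slice(int(y0 + j * bin_h), int(y0 + (j + 1) * bin_h))
            region = score_maps[i * k + j, :, ys, xs]
            votes += region.mean(axis=(1, 2))   # average pooling within the bin
    votes /= k * k                              # vote over all k*k regions
    exp = np.exp(votes - votes.max())           # softmax -> class probabilities
    return exp / exp.sum()
```

Because each bin consults its own score map, the pooled score is high only when the object's parts appear in the expected relative positions, which is what makes the encoding position-sensitive.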

Since object detection includes classification and regression, we require a multitask loss function, in which we weight the loss functions of the two tasks. Because softmax is used for the binary classification task, it is natural to adopt the cross-entropy loss for classification. For the regression task, we calculate the matching degree between the four position parameters and the ground truth:

L(s, t) = L_cls(s_{c*}) + λ [c* = 1] L_reg(t, t*)

where c* is the ground-truth category label of the RoI, and c* = 1 represents ants. L_cls(s_{c*}) represents the cross-entropy loss:

L_cls(s_{c*}) = -log s_{c*}

L_reg(t, t*) represents the loss of the regression task over the 4 dimensions:

L_reg(t, t*) = Σ_{m ∈ {x, y, w, h}} smooth_L1(t_m - t*_m)

In the formula, t* is the predicted position, and t is the ground truth after translation and scaling.
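A minimal sketch of the weighted multitask loss follows; the smooth-L1 form of the regression term is an assumption (the text does not name the exact regression loss, and smooth L1 is the usual choice in this family of detectors), as is the weighting factor `lam`:

```python
import numpy as np

def smooth_l1(t, t_star):
    """Smooth-L1 regression loss over the 4 box parameters (x, y, w, h)."""
    d = np.abs(np.asarray(t, dtype=float) - np.asarray(t_star, dtype=float))
    return float(np.sum(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)))

def detection_loss(probs, c_star, t, t_star, lam=1.0):
    """Weighted multitask loss: cross-entropy plus, for ants only, regression.

    probs: softmax output (p_background, p_ant); c_star: ground-truth label,
    with c_star == 1 meaning ant; lam: the task-weighting factor.
    """
    l_cls = -np.log(probs[c_star])                       # cross-entropy L_cls
    l_reg = smooth_l1(t, t_star) if c_star == 1 else 0.0  # [c* = 1] gating
    return l_cls + lam * l_reg
```

Note that the regression term is switched off for background RoIs, matching the indicator [c* = 1] in the formula above.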

MOT framework

Offline ResNet network architecture

We adopt a 15-layer ResNet architecture to extract the appearance descriptors of objects, as Fig 8 shows. After downsampling eight times, the network eventually obtains a 128-dimensional feature vector through a fully connected layer. The specific parameters are consistent with [28].

Cosine similarity metric classifier

We modify the parameters of softmax to get a cosine similarity metric classifier, which can measure the similarity within the same category or between different categories. First, the output of the fully connected layer is normalised by batch normalisation, ensuring that it is expressed as a unit-length vector: ||f_Θ(x)||_2 = 1, ∀x ∈ R^D. Second, we normalise the weights, that is, ω̃_k = ω_k / ||ω_k||_2, ∀k = 1, . . . , C. The cosine similarity metric classifier is constructed as follows:

p(y = k | r) = exp(κ ω̃_k^T r) / Σ_{n=1}^{C} exp(κ ω̃_n^T r)

Here, κ is a free scaling parameter.
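The classifier can be sketched as follows. The weight-matrix layout (one row per class), the value of `kappa`, and the use of plain L2 normalisation in place of the batch-normalisation step are assumptions for illustration:

```python
import numpy as np

def cosine_softmax(features, weights, kappa=10.0):
    """Cosine-similarity classifier: both the feature vector and the class
    weights are L2-normalised, so the logits are cosine similarities scaled
    by the free parameter kappa."""
    f = features / np.linalg.norm(features)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    logits = kappa * (w @ f)                    # kappa * cos(theta_k)
    exp = np.exp(logits - logits.max())         # numerically stable softmax
    return exp / exp.sum()
```

Because both sides are unit-length, the dot product is exactly the cosine of the angle between a descriptor and a class prototype, which is why the same network can later be reused to compare descriptors by cosine distance.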

Because the cosine similarity classifier follows the structure of softmax, we use the cross-entropy loss for training:

L(D) = -Σ_{i=1}^{N} Σ_{k=1}^{C} 1{y_i = k} log p(y_i = k | r_i)

Here, L(D) represents the sum of the cross-entropy losses over N images, p(y_i = k | r_i) is the prediction for the i-th image on the k-th label, and 1{y_i = k} is the ground-truth indicator.

Motion matching

We use the KF model to predict the positions of trajectories in the current frame. Then, we calculate the square of the Mahalanobis distance between the predicted position and the detected bounding box position as the degree of motion matching [30]:

d^(1)(i, j) = (d_j - y_i)^T S_i^{-1} (d_j - y_i)

Here, d_j is the position of the j-th detection box, y_i is the position of the i-th trajectory predicted by the KF, and S_i is the covariance matrix between the i-th trajectory and the detected bounding box.

We use a 0-1 variable to indicate whether a trajectory and a detection meet the association conditions. If the Mahalanobis distance meets the threshold t^(1), (i, j) will be added to the association set:

b^(1)_ij = 1[d^(1)(i, j) ≤ t^(1)]

Here, b^(1)_ij is the motion association signal.

Appearance matching

We use the appearance descriptors to measure the appearance similarity between ants. Furthermore, we create a gallery for each trajectory, and each gallery stores the latest 100 appearance descriptors. Then, we calculate the cosine distances of the appearance descriptors between the gallery and the candidate bounding boxes. The smallest distance is used as the appearance matching degree:

d^(2)(i, j) = min{ 1 - r_j^T r^(i)_k | r^(i)_k ∈ R_i }

where r_j is the appearance descriptor of the j-th detection box, r^(i)_k is the k-th appearance descriptor of the i-th trajectory, and d^(2)(i, j) represents the appearance matching degree between the i-th trajectory and the j-th bounding box.
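The two matching degrees can be sketched as follows. The gating value `t1 = 9.4877` (the 0.95 chi-square quantile for 4 degrees of freedom, a common choice in trackers of this family) is our assumption, since the text does not state the value of t^(1):

```python
import numpy as np

def motion_gate(d_j, y_i, S_i, t1=9.4877):
    """Squared Mahalanobis distance d^(1)(i, j) between the KF-predicted
    trajectory position y_i and detection d_j, plus the 0-1 signal b^(1)_ij."""
    diff = np.asarray(d_j, dtype=float) - np.asarray(y_i, dtype=float)
    d1 = float(diff @ np.linalg.inv(S_i) @ diff)
    return d1, float(d1 <= t1)

def appearance_distance(r_j, gallery):
    """d^(2)(i, j): smallest cosine distance between detection descriptor r_j
    and the (up to 100) stored descriptors of trajectory i."""
    r_j = r_j / np.linalg.norm(r_j)
    dists = [1.0 - float(r_j @ (r_k / np.linalg.norm(r_k))) for r_k in gallery]
    return min(dists)
```

Taking the minimum over the whole gallery is what makes the metric robust to long occlusions: an ant that re-emerges only has to resemble one of its stored historical descriptors, not the most recent one.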

Similarly, we introduce a 0-1 variable as an association signal. If the appearance matching degree for a pair of trajectory and detection box meets the threshold, we add it to the association set:

b^(2)_ij = 1[d^(2)(i, j) ≤ t^(2)]

where b^(2)_ij represents the appearance association signal. In this paper, t^(2) is set to 0.2.

Comprehensive matching

To combine the motion and appearance information, we set a comprehensive association signal b_ij. Only when both the motion and appearance matching degrees meet their thresholds will the (i, j) pair be considered for matching:

b_ij = b^(1)_ij · b^(2)_ij

However, the KF can hardly track accurately over long periods, because the motion of ants is complicated. Therefore, we use the appearance matching degree (Section Appearance matching) as the association cost.
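Combining the two gates into b_ij and turning the appearance distances into a cost matrix for the assignment step might look like this; the `INFEASIBLE` sentinel value, used to block gated-out pairs from being matched, is an implementation assumption:

```python
import numpy as np

INFEASIBLE = 1e5  # large cost assigned to pairs that fail either gate

def association_cost(app_dist, b_motion, b_app):
    """Combine the two 0-1 gates: a pair (i, j) is matchable only when both
    the motion and appearance signals fire (b_ij = b^(1)_ij * b^(2)_ij).
    Feasible pairs cost their appearance distance d^(2)(i, j); the Hungarian
    algorithm is then run on the resulting matrix."""
    b = np.asarray(b_motion) * np.asarray(b_app)          # comprehensive b_ij
    return np.where(b == 1, np.asarray(app_dist), INFEASIBLE)
```

The key design choice the text describes is visible here: the motion gate only decides feasibility, while the actual cost being minimised is purely the appearance distance.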

Track update

First, we use matching cascade to give matching priority to the most recently associated trajectories, avoiding the trajectory drift caused by long-term occlusion [30]. During the matching, we use the Hungarian algorithm to find the minimum-cost matches in the association cost matrix. For unmatched trajectories and detection boxes, we calculate the IOU; if it meets the threshold, they are associated.

After that, the trajectories need to be updated. They have three states: unconfirmed, confirmed, and deleted. We assign a new trajectory to each unmatched detection box. If the duration of a trajectory is less than three frames, it is set to the unconfirmed state. Unconfirmed trajectories need to be successfully associated for three consecutive frames before being converted into the confirmed state; otherwise, they will be deleted.
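The three-state lifecycle can be sketched as a small state machine; the class and method names are ours, and we assume a track counts its own creation as its first successful association:

```python
UNCONFIRMED, CONFIRMED, DELETED = "unconfirmed", "confirmed", "deleted"
A_MAX = 30  # max consecutively lost frames before a confirmed track is dropped

class Track:
    """Minimal lifecycle sketch of the unconfirmed/confirmed/deleted states."""
    def __init__(self):
        self.state = UNCONFIRMED
        self.hits = 1          # consecutive successful associations
        self.misses = 0        # consecutive frames without a match

    def mark_matched(self):
        self.hits += 1
        self.misses = 0
        # Three consecutive associations promote the track to confirmed.
        if self.state == UNCONFIRMED and self.hits >= 3:
            self.state = CONFIRMED

    def mark_missed(self):
        self.hits = 0
        self.misses += 1
        # Unconfirmed tracks die immediately; confirmed ones survive
        # up to A_MAX consecutively lost frames.
        if self.state == UNCONFIRMED or self.misses > A_MAX:
            self.state = DELETED
```

This mirrors the rules in the text: a new detection spawns an unconfirmed track, three consecutive matches confirm it, a miss while unconfirmed deletes it, and a confirmed track is deleted only after more than Amax = 30 consecutively lost frames.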

For the unmatched confirmed trajectories, if they were successfully matched in the previous frame, we use the KF to estimate and update their motion state in the current frame; otherwise, we suspend tracking. Moreover, if the number of consecutively lost frames of a confirmed trajectory exceeds the threshold (Amax = 30), it will be deleted.