Vision-based monitoring and measurement of bottlenose dolphins’ daily habitat use and kinematics

This research presents a framework to enable computer-automated observation and monitoring of bottlenose dolphins (Tursiops truncatus) in a professionally managed environment. Results from this work provide insight into the dolphins' movement patterns, kinematic diversity, and how changes in the environment affect their dynamics. Fixed overhead cameras were used to collect ∼100 hours of observations, recorded over multiple days and including time both during and outside of formal training sessions. Animal locations were estimated using convolutional neural network (CNN) object detectors and Kalman filter post-processing. The resulting animal tracks were used to quantify habitat use and animal dynamics. Additionally, Kolmogorov-Smirnov analyses of the swimming kinematics were used for high-level behavioral mode classification. The detectors achieved a minimum Average Precision of 0.76. Performing detections and post-processing yielded 1.24×10⁷ estimated dolphin locations. Animal kinematic diversity was found to be lowest in the morning and to peak immediately before noon. Regions of the habitat displaying the highest activity levels correlated with locations associated with animal care specialists, conspecifics, or enrichment. The work presented here demonstrates that CNN object detection is not only viable for large-scale marine mammal tracking, but also enables automated analyses of dynamics that provide new insight into animal movement and behavior.

The ability to quantify animal motion and location, both in the environment and with respect to other animals, is therefore critical in understanding their behavior. Here we present an automated computer vision framework, inspired by methods from the field of robotics, for persistently and robustly tracking animal position and kinematics.

Biomechanics and behavioral studies depend on animal-based measurements that are considered reliable and repeatable for the species of interest [2,6-8], but direct measurements of animals in the marine environment can be challenging. As a result, researchers tend to use direct observation and expert knowledge to classify and parameterize animal behavior in both wild and managed settings. In the wild, measurements of animal motion are often made using animal-borne tracking systems. The sensors used to collect data from animals tend to be packaged together into minimally-invasive (removable) tagging systems [9]. These tags can be used to directly measure parameters such as animal speed, acceleration, position at the surface, or orientation in the environment without introducing significant modifications to the animals' swimming dynamics [10]. When combined with direct observations of behavior, tag data can be used to quantify the animals' behaviors during a period of interest, such as foraging [11]. Sensor data and behavioral observations have also been used to train algorithms to automatically detect behavioral states [12,13]. These trained algorithms can then be used to detect and parameterize behavioral states from large amounts of sensor data that lack direct observations of animal behavior.

In contrast, tag-based measurements of marine mammals in managed settings are less common, and location measurements in indoor habitats are not possible with GPS. Instead of tags, animals in these environments tend to be monitored using external sensors, such as cameras and hydrophones, placed in the environment [14,15]. These sensor networks can be used to observe a majority of the animals' environment with a relatively small number of sensors. While it is possible to continuously record the animals' environmental use and social interactions, these videos must be heavily processed to convert them into useful information. This processing is often performed by a trained expert, who watches the data and scores behavioral or tracking information [2,16-18]. This hand-tracking is time consuming and can be inefficient when hundreds of hours of data have been collected from multiple sensors. Recent efforts have been made to automate this process for cameras, primarily through heuristically-crafted computer-vision techniques [19,20]. However, these techniques have had limitations.

To address these gaps, this work investigates day-scale swimming kinematics using a neural-network-based, computer-automated framework to quantify the positional states of multiple animals simultaneously in a managed environment. Neural networks have demonstrated flexibility and robustness in extracting information on biological systems from image and video data [21-23], and were chosen for use in this research for these properties. In this study, video recordings of the animals from a two-camera system were analyzed using convolutional neural network (CNN) object-detection techniques and were post-processed via Kalman filtering to extract animal kinematics.
The resulting kinematic states were used to quantify bottlenose dolphin habitat usage, kinematic diversity, and movement profiles during daily life. The framework and results presented here demonstrate the capabilities of robotics/computer vision-inspired techniques in extracting dynamic information from biological systems that can be used to gain new insights into behaviors and biomechanics.

In this work, camera data were used to monitor the behavior of a group of marine mammals both qualitatively and quantitatively in a managed setting. Camera-based animal position data were used to quantify habitat usage, as well as where and how the group of animals moved throughout the day. The position data were decomposed into kinematics metrics, which were used to discriminate between two general movement states (static and dynamic) using the velocities of the tracked animals. A general ethogram of the animals' behaviors monitored in this research is presented in Table 1.
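As a minimal illustration of this velocity-based movement-state split, the MATLAB sketch below thresholds tracked speeds; the 0.5 m s⁻¹ threshold and the variable names are assumptions for illustration, not values from this study.

    % Sketch: label each tracked sample as static or dynamic from its speed.
    % The 0.5 m/s threshold is an illustrative assumption, not the study's value.
    speed = vecnorm(velocities, 2, 2);   % velocities: N-by-2 [vx vy] in m/s
    isDynamic = speed > 0.5;             % logical movement-state label per sample
    fracDynamic = mean(isDynamic);       % fraction of samples in the dynamic state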

The kinematics metrics were further used to refine our understanding of the behavioral states the animals experienced both in and out of training sessions through a combination of Kolmogorov-Smirnov statistical analyses and joint differential entropy computations. The study protocol was approved by the University of Michigan.

[Fig. 1, bottom zoom (right): vector illustrations of the two example tracks. Example notation for tracklet j (red): position p^(j,t), velocity v^(j,t), yaw θ^(j,t), and yaw rate θ̇^(j,t). Fig. 1, bottom zoom (left): illustration of tracklet generation, with detections (stars) and tracklet proximity regions (dashed), showing position p^(j,t), velocity v^(j,t), Kalman-predicted future position p̂^(j,t+1), true future position p^(j,t+1), and future animal detection u^(j,t+1,i).]

The two cameras were mounted approximately 2 m apart on a frame attached to a support beam directly above the main habitat, with the cameras angled to give full coverage of the area when combined. Figure 1, top, illustrates the habitat, camera placement, and field-of-view coverage. For data collection, the cameras were connected through the Gigabit Ethernet protocol to a central computer with an Intel i7-7700K CPU. Recordings were executed using the MATLAB Image Acquisition Toolbox, in the RGB24 color format at a frame rate of 10 Hz.

Neural network methods

The first step in the analysis process was dolphin detection in the captured video frames using Faster R-CNN, a machine-learning object-detection method [24]. The method outputs bounding boxes, each tightly enclosing an object's location within an image. For a more complete explanation of the method, please refer to [24].
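As a sketch of how such a detector is applied using MATLAB's Computer Vision Toolbox (the file names, confidence threshold, and variable names below are illustrative assumptions, not the study's actual configuration):

    % Sketch: run a trained Faster R-CNN dolphin detector over recorded frames.
    load('dolphinDetector.mat', 'detector');   % hypothetical pretrained detector
    vid = VideoReader('overheadCam1.avi');     % hypothetical recording file
    allBoxes = {};
    while hasFrame(vid)
        frame = readFrame(vid);
        [bboxes, scores] = detect(detector, frame);  % boxes as [x y w h]
        allBoxes{end+1} = bboxes(scores > 0.7, :);   % assumed confidence cutoff
    end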

Running the detector on both cameras yielded multiple sets of conflicting detection bounding boxes spanning the two fields of view, which necessitated associating the most likely left/right box pairs. Before conflict identification was performed, the detection boxes were first transformed into a common plane of reference, termed the world frame. Using known world point coordinates, homographies from each camera to the world frame were generated using the normalized Direct Linear Transform method [27]. These homographies were used to convert the vertices of the bounding boxes to the world frame using a perspective transformation, with a complementary weight defined for the far camera as w_f = 1 − w_n. This ensured that if detection u was on l_s, then w_n = w_f = 0.5, and as u moved closer to b_n, we would have w_n → 0 and w_f → 1.
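To make the world-frame transformation concrete, the sketch below fits a projective homography from hypothetical image/world point correspondences and maps a bounding box's vertices. MATLAB's fitgeotrans is used here in place of a hand-rolled normalized DLT, and the linear ramp for w_n is an assumed form, since the paper's exact weighting definition is not restated in this excerpt.

    % Sketch: image-to-world homography and near/far camera weighting.
    % All point values are illustrative; fitgeotrans performs a projective
    % (DLT-style) fit given known correspondences.
    imagePts = [102 58; 940 75; 918 690; 131 705];  % landmark pixels (example)
    worldPts = [0 0; 10 0; 10 6; 0 6];              % world frame, meters (example)
    tform = fitgeotrans(imagePts, worldPts, 'projective');

    bbox = [400 300 80 40];                         % detector box [x y w h]
    verts = [bbox(1)         bbox(2);
             bbox(1)+bbox(3) bbox(2);
             bbox(1)+bbox(3) bbox(2)+bbox(4);
             bbox(1)         bbox(2)+bbox(4)];
    [wx, wy] = transformPointsForward(tform, verts(:,1), verts(:,2));

    % Near/far weighting: w_n = 0.5 on the symmetry line l_s, w_n -> 0 at the
    % near boundary b_n, and w_f = 1 - w_n. The linear ramp is an assumption.
    x_bn = 0;  x_ls = 5;                            % assumed world-frame positions
    cx = mean(wx);                                  % detection centroid (world x)
    w_n = 0.5 * min(max((cx - x_bn) / (x_ls - x_bn), 0), 1);
    w_f = 1 - w_n;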

In specific circumstances, the shapes of the drains at the bottom of the habitat were warped by light passing through the rough water surface, resulting in false dolphin detections. Separate (smaller) image classifiers for each camera were trained to identify these false-positive drain detections, and were run on any detections that occurred in the regions of the video frames containing the drains. These detectors were strictly CNN image classifiers and were each trained on over 350 images and tested on over 150 images. For the drain detector, the input layer had the format (l_d, l_d, 3), where l_d is the mean side length of the detection bounding boxes being passed through the secondary classifiers. The feature-detection layers had the same general structure as the Faster R-CNN classifier network, except that in this case the convolution layers had, in order: 32, 48, 64, and 64 filters each. In the classification layers, the first fully connected layer had a length of 256.

Each tracklet was propagated forward in time under a constant-velocity model,

    p̂^(k,t+1) = p^(k,t) + Δt · v^(k,t),    (1)

where p̂^(k,t+1) is the Kalman-predicted position of the k-th tracklet at frame t+1, p^(k,t) and v^(k,t) are its current position and velocity, and Δt is the frame interval. Using the predicted position, the k-th tracklet checked whether there existed a closest detection in the next frame that was within the proximity region of the predicted position. If true, that detection, denoted u^(k,t+1,i) for the i-th detection in frame t+1 associated with the k-th tracklet, was used as the reference signal of the Kalman filter to update the state (position and speed) of tracklet T^(k). If false, the unassociated tracklet continued propagating forward, assuming the animal maintained a constant velocity. If a tracklet remained unassociated for 5 consecutive frames (empirically determined), it was considered inactive and was truncated at the last confirmed association. All information related to the k-th tracklet was saved after its deactivation.

As illustrated in Fig. 1, there were errors in the world-frame x-y location estimates (caused by camera perspective and light-refraction effects) that could not be corrected. In this work, the detection uncertainty was represented as a 2D probability density function (PDF), whose size and shape depended on the location of the detection with respect to the cameras (Fig. 2).

Lastly, the direction of motion of the animals throughout the monitored region was described using a quiver-plot representation. To formulate the quiver plot, two separate heatmaps were generated, Q_x and Q_y, one each for the x and y components of the animals' velocities. Q_x was created using a similar method to the speed heatmap, but in this case F was scaled by the x-component of the animal's velocity (summing F · v cos(θ) into Q_x, centered at [x_u, y_u]), where θ was the heading of the animal corresponding to detection u. Similarly, for Q_y, F was scaled by the y-component of the animal's velocity (summing F · v sin(θ) into Q_y, centered at [x_u, y_u]). The vector components Q_x and Q_y were then combined to form the quiver plot.

Kolmogorov-Smirnov (K-S) statistics were computed for each of the following metrics: speed (m s⁻¹), yaw (rad), yaw rate (rad s⁻¹), and the standard deviations of each [29]. These comparisons were made between randomly-sampled subsets of each time block, with each subset consisting of 10⁴ data samples per metric. Only time blocks of similar type were compared; that is, no in-training-session (ITS) blocks were compared to out-of-training-session (OTS) blocks, and vice-versa. The computations were performed using the MATLAB Statistics Toolbox function kstest2.
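A minimal sketch of a drain-classifier network following the layer widths given above (32/48/64/64 convolution filters, then a 256-unit fully connected layer); the kernel sizes, pooling choices, value of l_d, and training setup are assumptions not specified in this excerpt:

    % Sketch: secondary CNN image classifier for drain false positives.
    % Layer widths follow the text; kernel/pool sizes and ld are assumed.
    ld = 64;                                  % assumed mean detection side length
    layers = [
        imageInputLayer([ld ld 3])
        convolution2dLayer(3, 32, 'Padding', 'same')
        reluLayer
        maxPooling2dLayer(2, 'Stride', 2)
        convolution2dLayer(3, 48, 'Padding', 'same')
        reluLayer
        maxPooling2dLayer(2, 'Stride', 2)
        convolution2dLayer(3, 64, 'Padding', 'same')
        reluLayer
        maxPooling2dLayer(2, 'Stride', 2)
        convolution2dLayer(3, 64, 'Padding', 'same')
        reluLayer
        fullyConnectedLayer(256)
        reluLayer
        fullyConnectedLayer(2)                % two classes: drain vs. dolphin
        softmaxLayer
        classificationLayer];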
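The tracklet-association logic described above can be sketched with MATLAB's configureKalmanFilter; the noise magnitudes, gate radius, and variable names below are illustrative assumptions, while the constant-velocity model and 5-frame deactivation follow the text:

    % Sketch: constant-velocity Kalman tracklet with proximity gating and
    % deactivation after 5 consecutive unassociated frames.
    kf = configureKalmanFilter('ConstantVelocity', firstDetection, ...
        [1 1], [0.25 0.25], 0.5);              % assumed error/noise parameters
    gateRadius = 1.0;                          % assumed proximity-region radius (m)
    missed = 0;
    for t = 2:numFrames
        pHat = predict(kf);                    % Kalman-predicted position
        dets = detectionsByFrame{t};           % M-by-2 world-frame detections
        if isempty(dets)
            dMin = inf;
        else
            [dMin, i] = min(vecnorm(dets - pHat, 2, 2));
        end
        if dMin <= gateRadius
            correct(kf, dets(i, :));           % associate nearest detection
            missed = 0;
        else
            missed = missed + 1;               % coast at constant velocity
            if missed >= 5
                break;                         % deactivate and truncate tracklet
            end
        end
    end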
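Similarly, the Q_x/Q_y construction and the block-wise K-S comparisons can be sketched as follows; here the kernel F is simplified to plain bin accumulation, and the grid edges, sample vectors, and block variables are illustrative assumptions:

    % Sketch: accumulate velocity components into Q_x and Q_y bins and plot
    % the direction-of-motion quiver; then compare two time blocks via kstest2.
    xe = 0:0.5:20;  ye = 0:0.5:10;             % assumed world-frame bin edges (m)
    Qx = zeros(numel(ye)-1, numel(xe)-1);  Qy = Qx;
    for n = 1:numel(xu)                        % centers [xu yu], speed v, heading theta
        ix = discretize(xu(n), xe);  iy = discretize(yu(n), ye);
        if ~isnan(ix) && ~isnan(iy)
            Qx(iy, ix) = Qx(iy, ix) + v(n) * cos(theta(n));
            Qy(iy, ix) = Qy(iy, ix) + v(n) * sin(theta(n));
        end
    end
    [xc, yc] = meshgrid(xe(1:end-1) + 0.25, ye(1:end-1) + 0.25);
    quiver(xc, yc, Qx, Qy);                    % direction of motion per bin

    % K-S comparison of one metric between two time blocks (10^4 samples each).
    s1 = speedBlock1(randperm(numel(speedBlock1), 1e4));
    s2 = speedBlock2(randperm(numel(speedBlock2), 1e4));
    [h, p, ksDistance] = kstest2(s1, s2);      % MATLAB Statistics Toolbox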

Detector and filter performance

Figure 6, top, displays the overlaid PDFs of the speed and yaw metrics during OTS, and Figure 6, middle, displays the PDFs during ITS. The K-S distances for all six metrics are reported in Table 3, with all values rounded to 3 digits of precision. For OTS, the K-S results showed that Blocks 1 and 2 varied the most with respect to the others in terms of speed, as observed in Figure 6, top, while the yaw values were generally not significantly different, again as observed in Fig. 6, top.

For ITS, we note that the significant differences in metrics generally followed the structure type of each ITS block: comparisons between Blocks 1 vs. 3 and 2 vs. 4 were found to be significantly different the least often. This was to be expected, given that Blocks 1 and 3 were animal care sessions, and 2 and 4 were presentations. Of particular note are the yaw std. dev. and yaw rate std. dev. metrics, with order-of-magnitude differences in K-S distances when comparing similar vs. different types of ITS blocks.

This research presents a framework that enables the persistent monitoring of managed dolphins through external sensing, performed on a scale that would otherwise require a prohibitively high amount of human effort. Both the Faster R-CNN dolphin detection and CNN drain detection methods displayed reliable performance in testing, and enabled large-scale data processing at rates not achievable by humans. Given that the total duration of video processed was ∼199 hours (2 cameras × 99.5 hours each), an inference time of ∼202 hours (1.013×) represents at minimum an order-of-magnitude increase in processing speed when compared to human data annotation. This estimate was obtained from the authors' prior experience in manual animal tracking, which could take over 10 hours of human effort per hour of video (frame rate of 10 Hz) annotated for a single animal. As such, the performance of this detection framework presents new opportunities in long-term animal monitoring, and enables the automated processing of longer-duration and more frequent recording sessions. In this research, use of the monitoring framework provided the large-scale animal position and kinematic state data necessary to yield insights into animal behavior and spatial use within their environment.

OTS time blocks (Fig. 4, right). When combined with the results from the position distributions (Fig. 4, left), this implies that these dolphins not only focused their attention on these attractors, but that the attractors' presence also correlated with higher activity levels in the dolphins swimming in their vicinity.

Behavior classification from dynamics metrics

During ITS blocks, animal care specialists (ACSs) asked for specific behaviors from the dolphins, and these behaviors were often repeated. Elements of public educational presentations (ITS 2/4) were varied to include a mixture of both high- and low-energy segments, and this blend resulted in similar dynamic patterns for the public sessions. In contrast, the non-public animal husbandry and training sessions (ITS 1/3) were less dynamic overall, and yielded similar dynamic patterns for these sessions. Qualitative similarities in the pairs of animal training sessions were observable in both the position and speed/quiver plots in Fig. 5, and the probability density functions presented in Fig. 6.

The K-S statistics were used to quantify the similarities and differences between time blocks within both OTS and ITS. As the ACSs requested similar behaviors during ITS blocks of the same type, we expected similarities in the dynamics metrics for Blocks 1 vs. 3 and Blocks 2 vs. 4, and differences between the metrics for blocks of different types. The pattern displayed by the K-S statistics in Table 3 (particularly in the std. devs.) showed by far the most significant differences between time blocks of different types, and the fewest for blocks of the same type. Without prior knowledge of the block types, it would be possible to use this pattern to identify that Blocks 1 and 3 were likely of the same type, as were 2 and 4. This demonstrated that the presented method of obtaining and analyzing the dolphins' dynamics metrics was sufficient to differentiate between general behavior types.
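As an illustration of how block types could be recovered from the K-S pattern alone, the sketch below clusters blocks from a pairwise K-S distance matrix; the distance values are invented for illustration only (see Table 3 for the real statistics):

    % Sketch: group ITS time blocks by pairwise K-S distance. The matrix
    % entries are made-up values mimicking the reported pattern.
    D = [0    0.30 0.02 0.28;
         0.30 0    0.29 0.03;
         0.02 0.29 0    0.27;
         0.28 0.03 0.27 0   ];                 % symmetric K-S distances
    Z = linkage(squareform(D), 'average');     % hierarchical clustering
    blockType = cluster(Z, 'MaxClust', 2);     % recovers {1,3} and {2,4} groups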

This differentiation capability was useful for analyzing the OTS results, as the position and speed/quiver plots in Fig. 4 only showed patterns in the animals' location preferences within their habitat. In contrast, the K-S statistics gave a clearer view of the differences between OTS time blocks. Block 2 separated itself significantly from all other time blocks in nearly every metric, while Block 1 was in a similar position (though not as pronounced). Blocks 3-5 showed few significant differences in metric comparisons among one another. This indicated that the dolphins had more distinct dynamics during Blocks 1 and 2, and maintained similar dynamics patterns throughout Blocks 3-5. When combined with the joint differential entropy values, these results indicated that there may be three general OTS behavior types for the dolphins in this dataset (in terms of kinematic diversity [KD]): "Low KD" at the beginning of the day (Block 1), "High KD" immediately after the first training session (Block 2), and "Medium KD" for the remainder of the day (Blocks 3-5). A fine-scale temporal analysis of animal kinematic diversity should reveal whether these behavior transitions are dependent on the ACSs or other factors.

Limitations and future work

Using a limited number of cameras meant that full stereo coverage of the habitat was not possible, preventing a direct estimate of animal depth. Additionally, camera placements resulted in region-specific glare on the surface of the water that impeded the Faster R-CNN detector. To address these problems, cameras could be added in locations that allow for fully overlapping coverage, at angles that avoid glare in the same regions. Further, installing cameras capable of low-light recording could enable night monitoring sessions. An inherent problem with camera-based tracking is that similarities between dolphin profiles make it challenging to identify individuals. This problem has been addressed in [28], where kinematic data from dolphin-mounted biologging tags were used to filter camera-based animal location data. This filtering process made it more feasible to identify which location data points corresponded to specific tagged individuals, coupling the kinematic and location data streams for these animals. Fusing the coupled tag and camera data through methods similar to [28] or [31] would then provide high-accuracy localization information to contextualize the detailed kinematics data produced by the tags.

Through this research we have demonstrated a monitoring framework that offers new options for long-term managed dolphin observation, while significantly enhancing the efficiency of both data collection and analysis. This work demonstrated the feasibility of a camera-based, computer-automated marine animal tracking system, and explored its capabilities by analyzing the behavior and habitat use of a group of managed dolphins over a large time scale. From the results, we were able to quantify day-scale temporal trends in the dolphins' spatial distributions, dynamics patterns, and kinematic diversity modes. These in turn revealed that habitat features associated with particular attractors served as focal points for this group of dolphins: these features correlated with closer physical proximity among the animals, greater kinematic diversity (specifically in the presence of ACSs), and higher activity levels.