The PAIR-R24M Dataset for Multi-animal 3D Pose Estimation

Understanding the biological basis of social and collective behaviors in animals is a key goal of the life sciences, and may yield important insights for engineering intelligent multi-agent systems. A critical step in interrogating the mechanisms underlying social behaviors is a precise readout of the 3D pose of interacting animals. While approaches for multi-animal pose estimation are beginning to emerge, they remain challenging to compare due to the lack of standardized training and benchmark datasets. Here we introduce the PAIR-R24M (Paired Acquisition of Interacting oRganisms - Rat) dataset for multi-animal 3D pose estimation, which contains 24.3 million frames of RGB video and 3D ground-truth motion capture of dyadic interactions in laboratory rats. PAIR-R24M contains data from 18 distinct pairs of rats and 24 different viewpoints. We annotated the data with 11 behavioral labels and 3 interaction categories to facilitate benchmarking in rare but challenging behaviors. To establish a baseline for markerless multi-animal 3D pose estimation, we developed a multi-animal extension of DANNCE, a recently published network for 3D pose estimation in freely behaving laboratory animals. As the first large multi-animal 3D pose estimation dataset, PAIR-R24M will help advance 3D animal tracking approaches and aid in elucidating the neural basis of social behaviors.


Introduction
Social behaviors are core components of an animal's behavioral repertoire. Understanding their neural, biological, and evolutionary basis has long been a focus of the life sciences [1,2] and may inform treatments for psychiatric diseases, such as autism spectrum disorder and schizophrenia, where social interactions are impaired [3,4].
Precisely phenotyping social behaviors and identifying their neural basis requires reliable and quantitative measures of social behavior in animal models [5]. Currently, studies largely rely on scoring performance in highly structured assays, for instance the tube test, three-chamber test, or resident-intruder test [6]. While these provide interpretable readouts, they are ethologically limited and compress complex behavioral processes into scalar variables of questionable biological significance [7]. In contrast, assays in unrestrained animals that use computer vision and behavioral classification can profile a richer range of social behaviors between animals, but are more challenging to quantify and interpret [8][9][10][11][12].
To improve behavioral quantification, convolutional neural networks for automated detection of an animal's 2D pose [13][14][15], and more recently 3D pose [16][17][18], have been developed. However, in comparison to single animal tracking, methods for multi-animal postural tracking, especially in 3D, are only beginning to emerge. Existing 2D pose recognition techniques employ a mixture of 'top-down' multi-animal tracking, in which pose is reconstructed within identified bounding boxes of multiple animals (e.g. [8,14]), and 'bottom-up' architectures that first detect all body landmarks and then assign them to animals [19][20][21]. Both top-down and bottom-up multi-animal tracking approaches are promising, but need substantial amounts of training data to accurately track animal pose in the face of challenging occlusions generated by socially interacting animals.
Development of new data-efficient and occlusion-robust multi-animal tracking approaches requires standardized pose estimation datasets and benchmarks, which do not exist in 3D. To address this, we introduce PAIR-R24M, a novel dataset relating multi-view color video and ground-truth 3D kinematics in behaving rats. We collected over 24 million frames of 30 Hz color video across 24 camera views in 18 different pairs of rats interacting in a behavioral arena. In each frame, a motion capture system provides the 3D positions of 12 body landmarks on each individually identified animal, describing the movement of its head, trunk, shoulders, and hips. Each frame is associated with a behavioral label, denoting which of 11 behavioral categories and 3 inter-animal interaction categories it matches best. These labels can be used to balance datasets during training, rigorously assess pose estimation performance over a wide variety of poses, provide labels for action recognition approaches, and perform detailed analyses of behavioral patterns.
There exists a small collection of publicly available 3D animal pose benchmark datasets. The Acino dataset contains 7,588 frames of hand-labeled 3D poses (20 keypoints) from cheetahs, capturing mostly running behaviors [22]. The Open Monkey Studio dataset contains 195,228 hand-labeled frames (13 keypoints) of macaques in a large, enriched enclosure across 62 camera views [16]. Two other approaches use motion capture systems to provide expanded 3D ground-truth datasets. RGBD-Dog includes 3D keypoint data (63-82 keypoints from motion capture) and depth maps along with 8-10 RGB video views in canines, although it is limited to 5 behaviors [23]. Rat 7M contains nearly 7 million frames and 3D keypoints across a wide range of rat poses, providing a powerful substrate for training and testing algorithms in rodents, the most common model organisms in biomedicine [18]. While valuable, each of these datasets is limited to individual animals.
Thus far, multi-animal datasets exist only for 2D. Graving et al. released videos and 2D annotations for large groups of locusts and zebras filmed from a single top-down view [14], providing valuable datasets for benchmarking 2D collective behavior tracking algorithms. Pereira et al. published a set of labeled fruit fly courtship data [20]. Lauer et al. released annotated multi-animal datasets from mice, mouse pups, marmoset, and zebrafish [21]. By far the most extensive multi-animal 2D dataset is CalMS21, which was released as part of the Multi-Agent Behavior Challenge 2021 and consists of 6 million frames of unlabeled and over 1 million frames of tracked poses and behavioral annotations of pairs of interacting mice [24].
In the more mature field of 3D human pose estimation, many multi-human 3D datasets are available, which vary broadly in the number of behaviors tracked, number of cameras used, means of marker tracking, and environmental context. The CMU Panoptic dataset provides 480 camera views during a wide range of social behaviors in a laboratory environment, with 3D poses obtained via pose estimation [25]. The Campus, Shelf (manually annotated) and MuPoTS-3D (derived from pose estimation) datasets offer 3D poses and multi-view video in real-world scenes [26,27], while 3DPW offers monocular footage with 3D pose labels derived from inertial measurement units [28]. The MuCo-3DHP dataset [27] is a large, multi-human 3D dataset generated by splicing together individual subjects, and their ground-truth markerless annotations, from the expansive MPI-INF-3DHP dataset [29]. Other benchmark datasets exist in specific domains, such as stores [30] and operating rooms [31]. Others use synthetically rendered humans [32][33][34][35][36] or body surfaces [37]. Together these datasets have fueled a productive era of 3D pose tracking, but their domain is drastically different from laboratory animals. Developing the type of 3D animal tracking algorithms required to accelerate progress in neuroscience, biomedicine, and ecology will require in-domain datasets that permit relevant training and benchmarking over a diversity of body plans and behaviors.

Algorithms and benchmarks for animal 3D Pose Estimation
To our knowledge there is only one example of multi-animal 3D pose estimation in the literature [16], likely due to the lack of large training and benchmark datasets in this domain. There are several existing algorithms for 3D pose in individual animals. DANNCE [18] and Freipose [17] use volumetric representations of multi-view inputs to combine image features across cameras and enable 3D supervision, similar to the current state-of-the-art for multi-view human pose [38]. 3D DeepLabCut [22,39] uses triangulation of 2D detections across multiple views, which GIMBAL [40] and Anipose [41] further refine using spatiotemporal constraints. Open Monkey Studio uses a triangulation-based method but with a larger set of cameras and, in addition to using spatiotemporal constraints, makes use of reprojections into unlabeled views to increase its labeled training pool [16]. DeepFly3D uses triangulation, bundle adjustment, and pictorial structures to provide robust 3D pose estimation in tethered flies [42]. For monocular 3D pose estimation, "lifting" approaches using a fully connected network to infer 3D pose from 2D estimates [43,44] have been extended from work in humans [45] to tethered flies and lab mammals. In addition to lifting, Bolaños et al. [43] use synthetic data to improve 3D pose detection in restrained mice. As of yet, none of these methods have been extended to multi-animal 3D pose estimation. In this study we extend the DANNCE volumetric approach because it has demonstrated superior performance on rodents compared to multi-view triangulation, and also because multi-view triangulation would be further complicated by errors in multi-animal identity tracking.

Multi-animal action recognition
We follow the lead of human 3D pose datasets and group our data into standardized behavioral categories to aid the training and benchmarking of pose-estimation and action-recognition algorithms. However, unlike traditional 3D pose datasets acquired using human actors given explicit instructions, here we needed to infer behavioral categories from movement by extending 3D action recognition methods to the multi-animal setting. Multi-animal action recognition has remained challenging due to a lack of ground-truth and, relatedly, a lack of intuition about the definitions and structure of animal behavior, especially in social contexts. Existing methods for multi-animal action recognition employ supervised learning using human-labeled behavior categories such as mounting or attacking, classified using a variety of features describing behavior: pixels [46], the set of 2D body landmarks visible from a single top-down view [8,24], hand-designed features of 2D body landmarks [47] (sometimes supplemented with depth imaging information [48]), shapes fit to 2D or 3D body contours [9,49], or quantities derived from movement trajectories, such as velocity and heading direction [50]. Other methods use unsupervised learning techniques, again on a range of behavioral features: pixels [51], 2D pose features [12] or both [30]. While no approach has performed unsupervised analysis of multiple animals using 3D pose, Marshall et al. [52] designed an approach for identifying behaviors in single animals based on 3D pose features. Here we extend this approach to multiple animals, and create inter-individual features to define new interaction behavioral categories. This straightforward, yet effective, unsupervised action recognition approach allows us to segment and balance the PAIR-R24M dataset and introduce a foundational algorithm applicable to new multi-subject 3D pose data across species.

The PAIR-R24M dataset
To collect the PAIR-R24M dataset we used CAPTURE, a technique that uses body piercing to chronically attach retro-reflective markers to small animals, allowing their pose to be reconstructed using motion capture [52]. We attached 12 markers to the dorsal surface of each animal at identical locations to label their head, trunk, hips, and shoulders. We additionally added 1-2 markers to the head and trunk of animals to differentiate individuals. If interacting animals bore identical marker sets, we masked a marker on the head using whiteout to disambiguate them.
We used a 12-camera motion capture array to record the position of the markers at 300 Hz with sub-mm precision (Fig. 1A). We used commercial Cortex (Motion Analysis) software, which utilizes pairwise distances between markers and a parametric body model, to assign marker identities to each animal. We concurrently recorded animals at 30 Hz using 6 RGB video cameras. We calibrated the video cameras into the same world coordinate system as the motion capture array, which allowed us to label video frames automatically by projecting the 3D marker positions into each camera view.
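As an illustration of how calibrated cameras allow 3D marker positions to be projected into video frames, the following is a minimal sketch of an ideal pinhole projection. It is not the dataset's calibration code: the function name, the parameter layout (rotation, translation, focal lengths, principal point), and the numeric values are assumptions chosen for the example, and lens distortion is ignored.

```python
def project_point(point_3d, rotation, translation, focal, principal):
    """Project one 3D world point (mm) to pixel coordinates with an ideal pinhole model.

    Hypothetical parameterization: `rotation` is a 3x3 nested list, `translation`
    a length-3 list, `focal` = (fx, fy) in pixels, `principal` = (cx, cy) in pixels.
    Lens distortion is not modeled.
    """
    # World -> camera coordinates: X_cam = R * X_world + t
    cam = [
        sum(rotation[i][j] * point_3d[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if cam[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division, then scaling and shifting into pixel units
    u = focal[0] * cam[0] / cam[2] + principal[0]
    v = focal[1] * cam[1] / cam[2] + principal[1]
    return u, v


# Example with made-up values: a camera at the world origin looking down +z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_point(
    [100.0, -50.0, 1000.0], identity, [0.0, 0.0, 0.0], (800.0, 800.0), (640.0, 512.0)
)
print(pixel)  # (720.0, 472.0)
```

Applying such a projection to every marker in every camera view yields 2D labels for each video frame without manual annotation, provided calibration and synchronization are accurate.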
We then performed simultaneous CAPTURE and video recordings for 18 pairs of animals (n=7 subjects bearing markers, n=2 markerless subjects), for 1 hour each (108,000 timepoints). To increase viewpoint diversity, we moved each of the video cameras to 4 different locations across recordings (Fig. 1B). On a subset of camera views and frames in which animals were rapidly moving, we noted discrepancies between motion capture and video due to slight errors in synchronization and calibration (Appendix 3). While these errors could in the long term pose limits in the precision of the dataset as a benchmark, they occur on a limited subset of frames, and similar discrepancies exist in commonly used human 3D pose datasets [38].
We recorded from a subset of animal pairs in each recording condition, yielding a total of 26 hours of data of paired animals bearing markers. We also recorded 14 hours of data from animals bearing markers when paired with animals not bearing markers, to add additional markerless video frames to the dataset. These single-markerset recordings also allowed us to assess the fidelity of animal identity assignment in the dataset. Head segment lengths, which were constant within subjects but differed slightly between subjects due to small changes in head marker placement during headcap construction, were stable across individual animals when compared over single- and double-markerset paired recordings (Appendix 4). Additionally, we recorded from each subject alone for 30 minutes to facilitate the construction of single animal tracking models, and recorded from individual and paired animals not bearing markers. Single animal and paired markerless video recordings are not included in the present dataset but may be added at a later date to facilitate transfer and benchmarking of semi-supervised tracking approaches.
Occasionally, self-, animal-animal, or environmental occlusions prevented 3D marker tracking by the motion capture system. As most of these periods were brief, we imputed missing data using linear interpolation within an egocentrically aligned reference frame anchored on the animal's center of mass and rotated to place the front of the animal's spine along the y-axis. The center of mass and orientation of the animals were estimated from the remaining markers if spine markers were absent. We also sometimes observed other errors in which the motion capture system assigned marker positions incorrectly. We addressed incorrect assignment by flagging potential errors using a 4σ threshold on z-scores of inter-marker distance, although we note these frames still appeared to possess accurate behavioral categorization. Our official 24M dataset size is calculated after excluding any frame with at least one flagged marker. In the released dataset, we provide all recorded frames, together with z-scores for each marker, permitting researchers to use partially tracked frames if desired.
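The flagging step can be illustrated with a short sketch. The code below z-scores the distance between one pair of markers across frames and flags frames beyond 4σ. It is only an assumed reading of the procedure: the function name, the use of a single marker pair, and the population standard deviation are choices made for the example, and the toy trajectory is fabricated for demonstration.

```python
import math
import statistics


def flag_outlier_frames(marker_a, marker_b, threshold=4.0):
    """Z-score the per-frame distance between two markers and flag |z| > threshold.

    `marker_a` and `marker_b` are lists of (x, y, z) positions, one per frame.
    Returns (z_scores, flags). Hypothetical helper, not the released pipeline.
    """
    distances = [math.dist(a, b) for a, b in zip(marker_a, marker_b)]
    mean = statistics.fmean(distances)
    std = statistics.pstdev(distances)
    if std == 0:
        # A perfectly rigid segment has no outliers by this criterion.
        return [0.0] * len(distances), [False] * len(distances)
    z_scores = [(d - mean) / std for d in distances]
    flags = [abs(z) > threshold for z in z_scores]
    return z_scores, flags


# Toy data: a 20 mm segment for 99 frames, then one frame with a 60 mm "swap".
marker_a = [(0.0, 0.0, 0.0)] * 100
marker_b = [(20.0, 0.0, 0.0)] * 99 + [(60.0, 0.0, 0.0)]
z_scores, flags = flag_outlier_frames(marker_a, marker_b)
print(flags.count(True), flags[-1])  # 1 True
```

Because the released dataset includes per-marker z-scores, a user could apply a stricter or looser threshold than 4σ in the same way.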

Action recognition
The performance of human and animal pose tracking algorithms can vary widely depending on the behaviors animals perform (for instance, highly occlusive rodent grooming behaviors are often challenging to reconstruct), making it important to assess the performance of tracking algorithms in an action-specific manner. There remains no standard taxonomy of rodent behaviors [53], and there is often disagreement among human observers about what defines a behavior and when it begins and ends (e.g. [24,54]). We therefore used an unsupervised approach to identify behaviors by first clustering pose dynamics in a reduced dimensional behavioral feature space, and then manually inspecting samples from each cluster to assign cluster names post hoc, following previously published approaches [52,55]. To cluster the animals' behavior, we first performed principal component analysis on the all-to-all marker distances across all frames. We applied a Morlet wavelet transform to the top 10 principal components at 25 dyadically spaced frequencies from 0.5-20 Hz. These features, along with the z-heights and local smoothed velocities of each marker, composed a feature vector. To balance the clustering, we applied tSNE separately to each recording and sampled 1,000 frames distributed evenly across the behavioral embedding of each reduced dataset [12]. We then concatenated the sampled frames from each dataset and embedded them with tSNE, resulting in a comprehensive, balanced embedding space of all animal behavior in the dataset. We then re-embedded wavelet values from each movie using convex optimization, as described in [55], transformed the map into a density distribution after smoothing it with a Gaussian kernel, and applied a watershed transform to divide the data into discrete clusters.
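Two of the ingredients above are simple enough to sketch directly: the all-to-all marker distances for one frame, and a set of 25 dyadically spaced frequencies between 0.5 and 20 Hz. The sketch assumes that "dyadically spaced" means evenly spaced in log2 frequency, which is a common convention but is not spelled out in the text; the function names and the toy marker layout are illustrative only. The PCA, wavelet transform, tSNE, and watershed steps are not shown.

```python
import itertools
import math


def pairwise_distances(markers):
    """All-to-all Euclidean distances between the markers of one animal in one frame."""
    return [math.dist(a, b) for a, b in itertools.combinations(markers, 2)]


def dyadic_frequencies(f_min=0.5, f_max=20.0, n=25):
    """Frequencies evenly spaced in log2 between f_min and f_max (assumed convention)."""
    step = math.log2(f_max / f_min) / (n - 1)
    return [f_min * 2 ** (i * step) for i in range(n)]


# Toy frame: 12 markers placed along a line, 10 mm apart.
markers = [(10.0 * i, 0.0, 0.0) for i in range(12)]
distances = pairwise_distances(markers)
frequencies = dyadic_frequencies()
print(len(distances))    # 66 unique marker pairs for 12 markers
print(len(frequencies))  # 25
```

With 12 markers there are 66 unique pairwise distances per frame, which is the input dimensionality that the principal component analysis would reduce before the wavelet transform.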
The number of behavioral clusters identified in the embedding space can be varied by changing the density kernel used to create the space. We provide two resolutions of behavioral labeling in the dataset: a set of 11 coarse behavioral categories that can be used to balance the dataset and benchmark algorithms across different behaviors, and a set of 84 fine behavioral categories that can be used for a more detailed analysis of the animal's behavior.
The coarse behavioral categories reflected common classes of rodent behavior, including rearing, locomotion, and investigation (Fig. 2), each of which results from a manual clustering of fine-grained clusters. Within these fine behavioral categories across the full dataset, behaviors varied in frequency by several orders of magnitude, from 6,000 to 6,000,000 time-points (Fine Behavior 62, a side-to-side head sweep, vs. Fine Behavior 35, a high-frequency sniff). This class imbalance highlights the importance of obtaining large datasets to train and benchmark behavioral tracking algorithms, especially if algorithm performance on rare behaviors is desired.
To isolate different classes of inter-animal interactions, we further divided periods in which animals' centroids were within one body length of one another (200 mm) into three different interaction behavioral categories: synchronized locomotion ("Chase"), stationary exploration ("Explore"; when both animals were in any coarse behavioral category among HeadTilt, Groom, Sniff, Investigate, Rears, and CrouchExplore), or other times when animals were adjacent ("Close"). Because inter-animal interactions contain numerous occlusions, they represent a challenging use case for multi-animal tracking algorithms. The more than 5.3 million frames of animal interactions in the dataset offer an ample diversity of frames to train and benchmark new pose tracking algorithms in social settings.
Figure 2: Example reprojections of ground-truth motion capture onto single camera views, shown for specific behavioral categories (pink and white labels corresponding to each rat skeleton) and interaction behavior categories (IB). Trailing points illustrate past 1-second trajectories for each marker. Example Movies.
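The interaction rule described in this section can be written as a small decision function. This is a sketch under stated assumptions: the label "Locomotion" is a hypothetical name for the locomotion category, and "synchronized locomotion" is approximated here as both animals being in that category at once, since the text does not specify how synchrony was measured. The stationary labels and the 200 mm threshold come from the text.

```python
import math

# Coarse categories counted as stationary exploration in the text.
STATIONARY = {"HeadTilt", "Groom", "Sniff", "Investigate", "Rears", "CrouchExplore"}
# Assumed label name for the locomotion category (not given in the text).
LOCOMOTION = {"Locomotion"}


def interaction_category(centroid_a, centroid_b, label_a, label_b, max_distance_mm=200.0):
    """Return 'Chase', 'Explore', 'Close', or None if the animals are not adjacent."""
    if math.dist(centroid_a, centroid_b) > max_distance_mm:
        return None
    if label_a in LOCOMOTION and label_b in LOCOMOTION:
        return "Chase"  # simplification: both locomoting stands in for synchrony
    if label_a in STATIONARY and label_b in STATIONARY:
        return "Explore"
    return "Close"


far_apart = interaction_category((0, 0, 0), (500, 0, 0), "Sniff", "Groom")
explore = interaction_category((0, 0, 0), (100, 0, 0), "Sniff", "Groom")
chase = interaction_category((0, 0, 0), (100, 0, 0), "Locomotion", "Locomotion")
close = interaction_category((0, 0, 0), (100, 0, 0), "Locomotion", "Sniff")
print(far_apart, explore, chase, close)  # None Explore Chase Close
```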
The video frames, 3D pose estimates, and behavioral annotations are continuous in time, with only moderate interruptions in pose estimates due to flagged tracking errors. The median length of continuously tracked snippets is 138 frames (∼4.5 s), with a long tail such that 5% of all continuous snippets are greater than 86 s in length (Table 1). This continuity will be useful both for benchmarking video-based pose tracking algorithms that use local temporal information [56] and for building statistical models of single and multi-animal behavior [57,58]. As an example of their use for analyzing the mathematical structure of behavior, we can visualize the ethograms of each animal's behavior, which show that animals transition over many individual and interacting behaviors during a recording session (Fig. 1D).
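Snippet lengths of this kind can be computed from a per-frame list of tracking flags with a run-length pass. The sketch below assumes a boolean list in which True marks a frame with at least one flagged marker; the function name and the toy flag sequence are illustrative only.

```python
import itertools
import statistics

FRAME_RATE_HZ = 30  # video frame rate stated in the text


def snippet_lengths(flagged):
    """Lengths (in frames) of each run of consecutive unflagged frames."""
    return [
        sum(1 for _ in run)
        for is_flagged, run in itertools.groupby(flagged)
        if not is_flagged
    ]


# Toy sequence: three clean runs separated by flagged frames.
flagged = [False, False, False, True, False, False, True, True,
           False, False, False, False]
lengths = snippet_lengths(flagged)
median_seconds = statistics.median(lengths) / FRAME_RATE_HZ
print(lengths)         # [3, 2, 4]
print(median_seconds)  # 0.1
```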

DANNCE benchmark
To establish baseline benchmarks for pose estimation to which future algorithms should be compared, we used a multi-animal extension of DANNCE [18], the current state-of-the-art for rat 3D pose estimation. Because DANNCE's standard mechanism is to encapsulate a subject in a 3D volumetric bounding box via geometric sampling of multi-view image content, multi-animal inference was performed simply by running each animal's 3D volume through the network independently (see Appendix 5 for details). When animals are separated in space, such that their 3D volumes are non-overlapping, this approach trivially reduces to the single animal case. When animals are nearby and overlapping, however, DANNCE must overcome significant animal-animal occlusion and infer correct landmark-subject associations. Our dataset provides a large library of interacting behavior examples that DANNCE, and other approaches, can use to learn social-specific poses and complex, multi-animal image features.
Table 2: DANNCE 3D multi-animal pose estimation benchmarks. In DANNCE.X, X indicates the type of loss function used for training. (*) was trained from a random initialization of weights, and the others from a network pre-trained on Rat 7M [18]. (PJPE_50) 50th percentile of the per joint prediction error (in mm), i.e. the Euclidean distance between predicted and ground-truth markers. (MPJPE) mean PJPE, also broken down by head (MPJPE_H) and trunk (MPJPE_T). (PCK@0.5) percent correct keypoints using a distance threshold of 50% of the distance between two head markers. (PCK@0.75) PCK using a threshold of 75% of the distance. (mPCK) mean PCK over 11 equally spaced thresholds.
We trained DANNCE for 30 epochs, using 420k images (70k poses) per epoch, and varied the pretraining conditions and type of loss function to measure the influence of these parameters on performance. Our results on withheld validation subject 5 are presented in Table 2. When using DANNCE's previously published L2 loss function, DANNCE performance improved with pretraining on Rat 7M. However, training with an L1 loss, with or without pretraining, ultimately minimized the mean per joint prediction error (MPJPE) across all markers (additionally broken down by head, MPJPE_H, and trunk, MPJPE_T) and maximized percent correct keypoints (PCK) at all distance thresholds (expressed as fractions of the distance between two head markers, 19.4 mm). Across behaviors, DANNCE tracked Investigate with the smallest error and CrouchExplore with the largest error, although error was within 10% across most behavioral categories (Appendix Table 3). DANNCE performed similarly well on all close social interaction behaviors (Appendix Table 4). Qualitatively, DANNCE generally made remarkably consistent landmark predictions even in periods of spatial overlap between animals, but it did sometimes briefly assign head landmarks to the wrong animal during specific close interaction poses (Fig. 3).

Limitations
Our dataset will already be a valuable resource for social behavioral tracking, but there are several present limitations that could be addressed in future work. First, due to frequent occlusions in the multi-animal settings, there are periods without accurate landmark tracking that we dropped from the dataset. Future datasets could incorporate a larger number of cameras to reduce the number of missing data periods. Second, the ground-truth motion capture data comes from a reduced 12-marker set that does not capture points on the distal limbs, and this could contribute to a loss of precision in behavioral identification. One potential solution for limb tracking is to train using a combination of the 20-marker Rat7M, which includes multiple limb markers, and PAIR-R24M datasets. Limb keypoints could also be added to the dataset using a combination of manual labeling, e.g. through crowdsourced annotation, and inference, similar to datasets like CMU Panoptic [25]. However, annotating keypoints in animals is generally more challenging for non-primate species, where identification of body parts requires more domain knowledge, making the use of crowd-sourced annotation platforms challenging.

Discussion
The PAIR-R24M dataset is the largest and most diverse benchmark dataset for the rapidly growing field of multi-animal behavioral measurement and analysis. We make the dataset available for researchers interested in training new multi-animal tracking and action recognition algorithms, and for researchers interested in mining the data for new quantitative insights on the nature of social behavior. Specifically, we expect that this dataset will help to address the problems of multi-animal 3D pose estimation and instance segmentation.
In our dataset we solve instance segmentation by identifying individuals using known differences in their respective marker sets. These ground-truth animal identities will assist in the development and evaluation of deep learning algorithms that identify individuals through either top-down inference, such as convolutional networks for identity detection or center-of-mass tracking (e.g. [20,59,60]), or bottom-up inference such as 3D extensions of part affinity fields [61].
The PAIR-R24M dataset should also help develop new approaches for multi-animal 3D pose estimation. Here, we performed pose estimation using a state-of-the-art volumetric animal pose tracking approach. While our approach was generally effective, it made mistakes on some types of close interaction, a relevant concern considering that most interesting social behaviors are characterized by profound animal-animal overlap and contorted poses. Our results may be improved by newer architectures that employ semi-supervised learning or temporal convolutions [56] in addition to previously discussed bottom-up methods. Additionally, while highly performant, the use of volumetric convolution is computationally expensive, limiting inference speeds. PAIR-R24M will aid the development and evaluation of new, fast and performant multi-view 3D pose estimation algorithms.
While the PAIR-R24M dataset is an important step in the collection and dissemination of benchmarks for animal pose estimation, it can be extended in many ways. While we used motion capture as a high-throughput means of collecting training data, labels for animal hands, feet, and other appendages will be necessary for training algorithms that predict more complete descriptions of animal movement. These labels could come from human annotators [18], and crowd-sourcing efforts have begun to assemble such detailed annotations for animals in 2D (e.g. [62]; although see Section 4). Datasets extending beyond keypoints to capture an animal's full 3D body surface, as is now possible in human subjects, will also be valuable. While 3D scans have been used to assemble parametric body models of animals in specific poses [63], the databases that are available are still small compared to those available in humans [36,64] and do not contain data from freely moving subjects. While cross-domain adaptation approaches [43,62,65] may facilitate some progress in 3D surface estimation, ground-truth databases are needed to appropriately benchmark and train these techniques. Finally, future datasets from other species, environments, and social contexts will help to build algorithms that are flexible across a rich array of tasks and contexts, with the ultimate goal of enabling methodologies for full reconstruction of animal kinematics in complex, occlusive environments, with as few as one camera.
Figure 5: (D) An example of synchronization error (left) and calibration error (right) with arrows pointing to a head marker and a body marker for comparison. In the frame with synchronization error, the head markers show larger error than the body markers, likely due to the animal moving its head quickly and the RGB video lagging behind. In the frame with calibration error, the head and body marker errors are more uniform, making an issue with calibration parameters more likely.

Appendix 3 | Discrepancies in Motion Capture and Video Tracking
In a subset of video frames and camera views, we observed a discrepancy between the marker positions, as tracked using motion capture, and the apparent marker positions in the video frames. Such a discrepancy could be caused by either noisy camera calibration or temporally localized variability in RGB video camera synchronization with motion capture. To quantify the magnitude and extent of these discrepancies, we hand labeled the position of the markers on the head in 2078 video frames and compared them with the projections of points tracked using motion capture. Differences varied across cameras and positions of the animal in the arena (Fig. 5A). On average, differences (7 px mean, 5 px median) were well below both the marker size (9-14 px) and measured precision of hand-labelers (12 px [52]; Fig. 5B-C). Nevertheless, on 10% of frames these differences were greater than the marker diameter, although they rarely exceeded two marker diameters (∼1% of frames). Motion capture discrepancies appeared notably smaller for markers on the body, which are less sensitive to slight variability in synchronization (Fig. 5D). Discrepancies are nearly unavoidable in large datasets [38], and can in principle add robustness to 3D markerless pose detection models [18]. Nevertheless, these deviations may present a noise ceiling for 3D pose tracking, and could be removed, if desired, when running benchmarks [38].
Figure 6: Normalized histograms of head segment lengths for Subjects 1-5, measured from all recorded motion capture data and broken down by recording type: paired recordings in which only one subject had markers (red lines) and paired recordings in which both subjects had markers (blue lines). For each subject, histograms for each of the three head segments are plotted together on one graph.

Appendix 4 | Constant Head Segment Lengths Suggest Accurate Animal Identity Tracking
Motion capture measurements are so precise that they enable fingerprinting of each subject via quantification of small subject-specific differences in head segment lengths; these differences arise from variability in marker placement during headcap construction. We established reference head segment lengths for each subject by examining their distributions in marker + markerless recordings, where identity swapping is impossible. In marker + marker recordings, swaps in animal identity should manifest as frames with head segment lengths deviating from each animal's reference. We see little support for such swaps in the data.

Appendix 5 | DANNCE Training and Evaluation
Multi-animal DANNCE (https://github.com/spoonsso/dannce/) training and evaluation was performed in Python 3.7 using TensorFlow (for the network) and PyTorch (for parallel 3D volume generation). For efficiency, we trained multi-animal DANNCE using 4 NVIDIA V100 16 GB GPUs on the Harvard Odyssey compute cluster. We used training frames and ground-truth poses from 4 unique animal pairs, distributed over 7 1-hour recordings at 30 Hz. To form the training set, 10,000 time points (60,000 frames) were sampled randomly without replacement from the time points in each recording having a complete motion capture marker set without imputation, resulting in 70,000 training samples total. We chose at the outset to train each DANNCE network for 30 epochs using a batch size of 4, and at the end of training we evaluated the performance of each network on the full validation dataset just once (results in Table 2 and Appendix Tables 3 and 4). For the benchmarks presented here, we used all samples with a complete motion capture marker set without imputation from a 1-hour recording of subjects 3 and 5, evaluated over withheld validation subject 5 only (43,285 samples; 259,710 frames). For each animal, we anchored its image volume to the 3D position of its "SpineM" marker in each frame.
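The training-set construction for the benchmark (10,000 time points sampled without replacement from each of 7 recordings, each time point seen from 6 camera views) can be sketched as follows. This is an illustration, not the DANNCE data loader: the function name, the seeds, and the stand-in recordings, in which every one of the 108,000 time points is treated as fully tracked, are assumptions.

```python
import random

N_CAMERAS = 6                    # RGB cameras per recording, from the text
SAMPLES_PER_RECORDING = 10_000   # time points drawn per recording, from the text


def sample_time_points(complete_time_points, n_samples=SAMPLES_PER_RECORDING, seed=0):
    """Draw time points without replacement from those with a complete marker set."""
    rng = random.Random(seed)
    return sorted(rng.sample(complete_time_points, n_samples))


# Hypothetical stand-in: 7 one-hour recordings with all 108,000 time points complete.
recordings = {f"recording_{i}": range(108_000) for i in range(7)}
training_set = {
    name: sample_time_points(points, seed=i)
    for i, (name, points) in enumerate(recordings.items())
}
n_poses = sum(len(points) for points in training_set.values())
n_images = n_poses * N_CAMERAS
print(n_poses, n_images)  # 70000 420000
```

The totals agree with the 70k poses and 420k images per epoch reported for training.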
For the benchmarks, we varied the loss function used for DANNCE training, using either mean squared error (L2) or mean absolute error (L1). We also tested training DANNCE from a random weight initialization, or from previously published weights found by training over images of single animals behaving in the Rat 7M dataset (https://github.com/spoonsso/dannce/) [18]. In all cases, we used DANNCE in the "AVG" architecture configuration (a 3D U-Net with a soft-argmax output layer) and trained using the Adam optimizer with lr = 0.001 and default parameters. We list the full set of DANNCE training parameters used in Table 5. Full architecture details and parameter definitions can be found on the dannce github.
To quantify DANNCE performance, we calculated standard 3D pose estimation error metrics, using a Procrustes alignment to ground truth before calculations (translation and rotation only; no scaling). MPJPE was calculated as the mean Euclidean error across all markers after alignment. PJPE_50 is the median error across all markers. PCK metrics reflect accuracy over all markers after binarizing all predictions using the indicated threshold distances, expressed as fractions of the distance between two Head markers (19.4 mm). For the mPCK metric, we calculated PCK for each threshold in [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1] and took the mean.
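The metric definitions above translate into a few lines of code. The sketch below assumes predictions have already been rigidly aligned to ground truth (the Procrustes step is not shown) and counts a keypoint as correct when its error is less than or equal to the threshold, which is an assumption since the text does not say whether the comparison is strict. Function names and the toy poses are illustrative only.

```python
import math
import statistics

HEAD_DISTANCE_MM = 19.4  # distance between two head markers, from the text
MPCK_FRACTIONS = [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0]


def joint_errors(predicted, truth):
    """Per-joint Euclidean error (mm) between already-aligned 3D points."""
    return [math.dist(p, t) for p, t in zip(predicted, truth)]


def pck(errors, fraction, reference_mm=HEAD_DISTANCE_MM):
    """Percent of keypoints whose error is within fraction * reference distance."""
    limit = fraction * reference_mm
    return 100.0 * sum(e <= limit for e in errors) / len(errors)


def summarize(errors):
    """Collect the benchmark metrics for one list of per-joint errors."""
    return {
        "MPJPE": statistics.fmean(errors),
        "PJPE_50": statistics.median(errors),
        "PCK@0.5": pck(errors, 0.5),
        "PCK@0.75": pck(errors, 0.75),
        "mPCK": statistics.fmean(pck(errors, f) for f in MPCK_FRACTIONS),
    }


# Toy example: four markers with errors of 3, 4, 12, and 20 mm.
truth = [(0.0, 0.0, 0.0)] * 4
predicted = [(3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 12.0), (0.0, 16.0, 12.0)]
metrics = summarize(joint_errors(predicted, truth))
print(metrics["MPJPE"], metrics["PJPE_50"])      # 9.75 8.0
print(metrics["PCK@0.5"], metrics["PCK@0.75"])   # 50.0 75.0
```

In the toy example the 12 mm error is counted as correct only from the 0.65 threshold upward (12.61 mm), which is why mPCK falls between PCK@0.5 and PCK@0.75.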