ABSTRACT
Social interactions powerfully impact both the brain and the body, but high-resolution descriptions of these important physical interactions are lacking. Currently, most studies of social behavior rely on labor-intensive methods such as manual annotation of individual video frames. These methods are susceptible to experimenter bias and have limited throughput. To understand the neural circuits underlying social behavior, scalable and objective tracking methods are needed. We present a hardware/software system that combines 3D videography, deep learning, physical modeling and GPU-accelerated robust optimization. Our system is capable of fully automatic multi-animal tracking during naturalistic social interactions and allows for simultaneous electrophysiological recordings. We capture the posture dynamics of multiple unmarked mice with high spatial (∼2 mm) and temporal (60 frames/s) precision. This method is based on inexpensive consumer cameras and is implemented in Python, making it cheap and straightforward to adopt and customize for studies of neurobiology and animal behavior.
INTRODUCTION
Objective quantification of natural social interactions is difficult. The majority of our knowledge about animal social behavior comes from hand-annotation of videos, yielding ethograms of discrete social behaviors such as ‘social following’, ‘mounting’, or ‘anogenital sniffing’1. It is widely appreciated that these methods are susceptible to experimenter bias and have limited throughput. There is an additional problem with these approaches, in that manual annotation of video frames yields no detailed information about movement kinematics and physical body postures. This shortcoming is especially critical for studies relating neural activity patterns or other physiological signals to social behavior. For example, neural activity in many areas of the cerebral cortex is strongly modulated by movement and posture2,3, and activity profiles in somatosensory regions can be difficult to analyze without understanding the physics and high-resolution dynamics of touch. Important aspects of social behavior, from gestures to light touch and momentary glances, can be transient and challenging to observe in most settings, but are critical to capturing the details of and changes to social relationships and networks4,5. Together, the potential for false positives and false negatives can be high, and to date these issues have thwarted our understanding of the neural basis of somatic physiology and social behavior.
The use of deep convolutional networks to recognize objects in images has revolutionized computer vision and has consequently also led to major advances in behavioral analysis. Drawing upon these methodological advances, several recent publications have developed algorithms for tracking, such as ‘DeepLabCut’6, ‘(S)LEAP’7 and ‘DeepPoseKit’8. These methods function by detecting key-points in 2D videos, and estimating 3D postures from them is not straightforward in interacting animals9. Spatiotemporal regularization is needed to ensure that tracking is stable and error-free, even when multiple animals are closely interacting. During mounting or allo-grooming, for example, interacting animals block each other from the camera view and tracking algorithms can fail. Having a large number of cameras film the animals from all sides can solve these problems9,10, but this has required extensive financial resources for equipment, laboratory space and processing power, which renders widespread use infeasible.
In parallel, other studies have used depth-cameras for animal tracking, fitting a physical body-model of the animal to 3D data11,12. These methods are powerful because they explicitly model the 3D movement and poses of multiple animals. However, due to technical limitations of depth imaging hardware (frame rate, resolution, motion blur), it is to date only possible to extract partial posture information about small and fast-moving animals, such as lab mice. Consequently, when applied to mice, these methods are prone to tracking mistakes when interacting animals get close to each other and the tracking algorithms require continuous manual supervision to detect and correct errors. This severely restricts throughput, making tracking across long time scales infeasible.
Here we describe a novel system for multi-animal tracking that combines ideal features from both approaches. Our method fuses physical modeling of depth data and deep learning-based analysis of synchronized color video to estimate the body postures, enabling us to reliably track multiple mice during naturalistic social interactions. Our method is fully automatic (i.e., quantitative, scalable, and free of experimenter bias), is based on inexpensive consumer cameras, and is implemented in Python, a simple and widely used computing language. Together, this makes our method inexpensive to adopt and easy to use and customize, paving a way for more widespread study of naturalistic social behavior in neuroscience and experimental biomedicine.
RESULTS
Raw data acquisition
We established an experimental setup that allowed us to capture synchronized color images and depth images from multiple angles, while simultaneously recording synchronized neural data (Fig. 1a). We used inexpensive, state-of-the-art ‘depth cameras’ developed for computer vision and robotics. These cameras contain several imaging modules: one color sensor, two infrared sensors and an infrared laser projector (Fig. 1b). Imaging data pipelines, as well as intrinsic and extrinsic sensor calibration parameters, can be accessed over USB through a C/C++ SDK with Python bindings. We placed four depth cameras, as well as four synchronization LEDs, around a transparent acrylic cylinder which served as our behavioral arena (Fig. 1c).
Each depth camera projects a static dot pattern across the imaged scene, adding texture in the infrared spectrum to reflective surfaces (Fig. 1d). By imaging this highly-textured surface simultaneously with two infrared sensors per depth camera, it is possible to estimate the distance of each pixel in the infrared image to the depth camera by stereopsis (by locally estimating the binocular disparity between the textured images). Since the dot pattern is static and only serves to add texture, multiple cameras do not interfere with each other and it is possible to image the same scene from multiple angles. This is one key aspect of our method, not possible with depth imaging systems that rely on actively modulated light (such as the Microsoft Kinect system and earlier versions of the Intel Realsense cameras).
Since mouse movement is fast13, it is vital to minimize motion blur in the infrared images and thus the final 3D data (‘point-cloud’). To this end, our method relies on two key features. First, we use depth cameras where the infrared sensors have a global shutter (e.g., Intel D435) rather than a rolling shutter (e.g., Intel D415). Using a global shutter reduces motion blur in individual image frames, but also enables synchronized image capture across cameras. Without synchronization between cameras, depth images are taken at different times, which adds blur to the composite point-cloud. We set custom firmware configurations in our recording program, such that all infrared sensors on all four cameras are hardware-synchronized to each other by TTL-pulses via custom-built, buffered synchronization cables (Fig. 1b).
We wrote a custom multithreaded Python program with online compression that allowed us to capture the following types of raw data from all four cameras simultaneously: 8-bit RGB images (320 x 210 pixels, 60 frames/s), 16-bit depth images (320 x 240 pixels, 60 frames/s) and the 8-bit intensity trace of a blinking LED (60 samples/s, automatically extracted in real-time from the infrared images). Our program also captures camera meta-data, such as hardware time-stamps and frame numbers of each image, which allows us to identify and correct for possible dropped frames. On a standard desktop PC, the recording system dropped very few frames, and both the video recording frame rate and the imaging and USB image transfer pipeline were stable (Fig. 1e,f).
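As an illustration, the core of the image capture step with the pyrealsense2 bindings could look like the minimal sketch below (a single camera, no threading or compression; the stream settings and the master/slave values for the inter-camera sync option are assumptions to be checked against the librealsense documentation):

```python
import pyrealsense2 as rs

ctx = rs.context()
serial = ctx.query_devices()[0].get_info(rs.camera_info.serial_number)

cfg = rs.config()
cfg.enable_device(serial)
cfg.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 60)
cfg.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 60)

pipeline = rs.pipeline(ctx)
profile = pipeline.start(cfg)

# Hardware sync over the TTL cable: one camera acts as master, the others as slaves.
depth_sensor = profile.get_device().first_depth_sensor()
depth_sensor.set_option(rs.option.inter_cam_sync_mode, 1)  # assumed: 1 = master, 2 = slave

try:
    frames = pipeline.wait_for_frames()
    depth, color = frames.get_depth_frame(), frames.get_color_frame()
    # Hardware timestamps and frame numbers are logged to detect dropped frames.
    print(depth.get_timestamp(), depth.get_frame_number())
finally:
    pipeline.stop()
```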
Temporal stability and temporal alignment
In order to relate tracked behavioral data to neural recordings, we need precise temporal synchronization. Digital hardware clocks are generally stable, but their internal speed can vary, introducing drift between clocks. Thus, even though all depth cameras provide hardware timestamps for each acquired image, a secondary synchronization method is required for long-term recordings across behavioral time scales (hours to days).
For synchronization to neural data, our recording program uses a USB-controlled Arduino microprocessor to output a train of randomly-spaced voltage pulses during recording. These voltage pulses serve as TTL triggers for our neural acquisition system (sampled at 30 kHz) and drive LEDs, which are filmed by the depth cameras (Fig. 1a). The cameras sample the LED state at 60 frames/s from an automatically detected ROI, integrating across a full infrared frame exposure (Fig. 1g). We use a combination of cross-correlation and robust regression to automatically estimate and correct for shift and drift between the depth camera hardware clocks and the neural data. Since we use random pulse trains for synchronization, alignment is unambiguous and we can achieve super-frame-rate precision. In a typical experiment, we estimated that the depth camera time stamps drifted by ∼49 µs/min. We corrected for this drift to yield stable residuals between TTL flips and depth frame exposures (Fig. 1h). Note that the neural acquisition system is not required for synchronization: for a purely behavioral study, we can run the same LED-based protocol to correct for potential shift and drift between cameras by choosing one camera as a reference.
Detection of body key-points by deep learning
We preprocessed the raw image data to extract two types of information for the tracking algorithm: the 3D locations of body key-points and the 3D point-cloud corresponding to the body surface of the animals. We used a deep convolutional neural network to detect key-points in the RGB images, and extracted the 3D point-cloud from the depth images (Fig. 2a). For key-point detection (nose, ears, base of tail, and neural implant for implanted animals), we used a ‘stacked hourglass network’14. This type of network architecture combines residuals across successive upsampling and downsampling steps to generate its output, and has been successfully applied to human pose estimation14 and limb tracking in immobilized flies15 (Fig. 2b, Supplementary Fig. 1).
We used back-propagation to train the network to output four ‘target maps’, each indicating the pseudo-posterior probability of one type of key-point, given the input image. The target maps were generated by manually labeling the key-points in training frames, followed by down-sampling and convolution with Gaussian kernels (Fig. 2c, ‘target maps’). We selected the training frames using image clustering to avoid redundant training on very similar frames8. The manual key-point labeling can be done with any labeling software. We customized a version of the lightweight, open source labeling GUI from the ‘DeepPoseKit’ package8 for the four types of key-points, which we provide as supplementary software (Supplementary Fig. 2).
In order to improve key-point detection, we used two additional strategies. First, we also trained the network to predict ‘affinity fields’16. We used ‘1D’ affinity fields8, generated by convolving the path between labeled body key-points that are anatomically connected in the animal. With our four key-points, we added seven affinity fields (‘nose-to-ears’, ‘nose-to-tail’, etc.), that together form a skeletal representation of each body (Fig. 2c, ‘affinity fields’). Thus, from three input channels (RGB pixels), the network predicts eleven output channels (Fig. 2d). As the stacked hourglass architecture involves intermediate prediction, which feeds back into subsequent hourglass blocks (repeated encoding and decoding, Fig 2b), prediction of affinity fields feeds into downstream predictions of body key-points. This leads to improvement of downstream key-point predictions, because the affinity fields give the network access to holistic information about the body. The intuitive probabilistic interpretation is that instead of simply asking questions about the keypoints (e.g., ‘do these pixels look like an ear?’), we can increase predictive accuracy by considering the body context (e.g., ‘these pixels sort of look like an ear, and those pixels sort of look like a nose – but does this path between the pixels also look like the path from an ear to a nose?’).
The second optimization approach was image data augmentation during training17. Instead of only training the network on manually-labeled images, we also trained the network on morphed and distorted versions of the labeled images (Supplementary Fig. 3). Training the network on morphed images (e.g., rotated or enlarged), gives a similar effect to training on a much larger dataset of labeled images, because the network then learns to predict many artificially generated, slightly different views of the animals. Training the network on distorted images is thought to reduce overfitting on single pixels and reduce the effect of motion blur17.
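For illustration, a minimal augmentation pipeline of this kind, built with the ‘imgaug’ library that we use (the specific augmenters and parameter ranges below are placeholders, not the exact pipeline of Supplementary Fig. 3), could be:

```python
import numpy as np
import imgaug.augmenters as iaa

# Stand-in labeled data: 4 RGB frames and 4 key-points (x, y) per frame.
images = np.random.randint(0, 255, (4, 210, 320, 3), dtype=np.uint8)
keypoints = (np.random.rand(4, 4, 2) * [320, 210]).astype(np.float32)

augmenter = iaa.Sequential([
    iaa.Affine(rotate=(-180, 180), scale=(0.8, 1.2)),  # morphing: rotation and zoom
    iaa.Fliplr(0.5),                                   # mirror the arena
    iaa.MotionBlur(k=5),                               # simulate motion blur
    iaa.CoarseDropout(0.02, size_percent=0.1),         # pixel dropout
])

# imgaug transforms images and key-points together, so the labels stay aligned.
images_aug, keypoints_aug = augmenter(images=images, keypoints=keypoints)
```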
Using a training set of 526 images, and by automatically adjusting learning rate during training, the network was well-trained (plateaued) within one hour of training on a standard desktop computer (Fig. 2e), yielding good predictions of both body key-points and affinity fields (Fig. 2f).
Pre-processing of 3D video
By aligning the color images to the depth images, and aligning the depth images in 3D space, we could assign three dimensional coordinates to the detected key-points. We pre-processed the depth data to accomplish two goals. First, we wanted to align the cameras to each other in space, so we could fuse their individual depth images to one single 3D point-cloud. Second, we wanted to extract only points corresponding to the animals’ body surfaces from this composite point-cloud.
To align the cameras in space, we filmed the trajectory of a sphere that we moved around the behavioral arena. We then used a combination of motion filtering, color filtering, smoothing and thresholding to detect the location of the sphere in the color frame, extracted the partial 3D surface from the aligned depth data, and used a robust regression method to estimate the center coordinate (Fig. 3a). This procedure yielded a 3D trajectory in the reference frame of each camera (Fig. 3b) that we could use to robustly estimate the transformation matrices needed to bring all trajectories into the same frame of reference (Fig. 3c). This robust alignment is a key aspect of our method, as errors can easily be introduced by moving the sphere too close to a depth camera or out of the field of view during recording (Fig. 3b,c, arrow). After alignment, the median camera-to-camera difference in the estimate of the center coordinate of the 40-mm-diameter sphere was only 2.6 mm across the entire behavioral arena (Fig. 3d,e).
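The underlying computation is a rigid (rotation plus translation) alignment of matched sphere-center trajectories. As a simple non-robust sketch (our pipeline additionally uses robust regression to reject bad sphere fits), a least-squares Kabsch solution would be:

```python
import numpy as np

def rigid_transform(traj_cam, traj_ref):
    """Estimate R, t such that R @ p_cam + t approximates p_ref.
    traj_cam, traj_ref: (T, 3) arrays of matched sphere centers."""
    mu_cam, mu_ref = traj_cam.mean(axis=0), traj_ref.mean(axis=0)
    H = (traj_cam - mu_cam).T @ (traj_ref - mu_ref)              # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = mu_ref - R @ mu_cam
    return R, t
```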
We used a similar robust regression method to automatically detect the base of the behavioral arena. We detected planes in the composite point-cloud (Fig. 3f) and used the location and normal vector, estimated across 60 random frames (Fig. 3g), to transform the point-cloud such that the base of the behavioral arena lay in the xy-plane (Fig. 3h). To remove imaging artifacts stemming from light reflection and refraction due to the curved acrylic walls, we automatically detected the location and radius of the acrylic cylinder (Fig. 3i). With the location of both the arena base and the acrylic walls, we used simple logic filtering to remove all points associated with the base and walls, leaving only points inside the behavioral arena (Fig. 3j). Note that if there is no constraint on laboratory space, an elevated platform can be used as a behavioral arena, eliminating imaging artifacts associated with the acrylic cylinder.
Loss function design
The pre-processing pipeline described above takes color and depth images as inputs, and outputs two types of data: a point-cloud, corresponding to the surface of the two animals, and the 3D coordinates of detected body key-points (Fig. 4a, Supplementary Video 1). To track the body postures of interacting animals across space and time, we developed an algorithm that incorporates information from both data types. The basic idea of the tracking algorithm is that for every frame, we fit the mouse bodies by minimizing a loss function of both the point-cloud and key-points, subject to a set of spatiotemporal regularizations.
For the loss function, we made a simple parametric model of the skeleton and body surface of a mouse. The body model consists of two prolate spheroids (the ‘hip ellipsoid’ and ‘head ellipsoid’), with dimensions based on an average adult mouse (Fig. 4b). The head ellipsoid is rigid, but the hip ellipsoid has a free parameter (s) modifying the major and minor axes to allow the hip ellipsoid to be longer and narrower (e.g., during stretching, running, or rearing) or shorter and wider (e.g., when still or self-grooming). The two ellipsoids are connected by a joint that allows the head ellipsoid to turn left/right and up/down within a cone corresponding to the physical movement limits of the neck.
Keeping the number of degrees of freedom low is vital to make loss function minimization computationally feasible18. Due to the rotational symmetry of the ellipsoids, we could choose a parametrization with 8 degrees of freedom per mouse body: the central coordinate of the hip ellipsoid (x, y, z), the rotation of the major axis of the hip ellipsoid around the y- and z-axis (β, γ), the left/right and up/down rotation of the head ellipsoid (θ, φ), and the stretch of the hip ellipsoid (s). For the implanted animal, we added an additional sphere to the body model, approximating the surface of the head-mounted neural implant (Fig. 4b). The sphere is rigidly attached to the head ellipsoid and has one degree of freedom: a rotational angle (ψ) that allows the sphere to rotate around the head ellipsoid, capturing head tilt of the implanted animal. Thus, in total, the joint pose (the body poses of both mice) was parametrized by only 17 variables.
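To make the parametrization concrete, a joint pose can be stored as a single 17-vector and split into per-animal parameters, for example as in the sketch below (variable names are illustrative, not those used in our code):

```python
import numpy as np

PARTNER_KEYS = ['x', 'y', 'z',     # hip ellipsoid center
                'beta', 'gamma',   # hip rotation around the y- and z-axis
                'theta', 'phi',    # head rotation (left/right, up/down)
                's']               # hip stretch
IMPLANTED_KEYS = PARTNER_KEYS + ['psi']   # implant rotation around the head axis

def unpack_joint_pose(pose):
    """Split a 17-vector into the 8-parameter partner and 9-parameter implanted mouse."""
    pose = np.asarray(pose, dtype=float)
    assert pose.shape == (17,)
    partner = dict(zip(PARTNER_KEYS, pose[:8]))
    implanted = dict(zip(IMPLANTED_KEYS, pose[8:]))
    return partner, implanted
```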
To fit the body model, we adjusted these parameters to minimize a weighted sum of two loss terms: (i) the shortest distance from every point in the point-cloud to the body model surface, and (ii) the distance from detected key-points to their corresponding location on the body model surface (e.g., nose key-points near the tip of one of the head ellipsoids, tail key-points near the posterior end of a hip ellipsoid).
We then used several different approaches for optimizing the tracking. First, for each of the thousands of points in the point-cloud, we needed to calculate the shortest distance to the body model ellipsoids. Calculating these distances exactly is not computationally feasible, as this requires solving a sixth-degree polynomial for every point19. As an approximation, we instead used the shortest distance to the surface along a path that passes through the centroid (Supplementary Fig. 4a,b). Calculating this distance could be implemented as pure tensor algebra20, which could be executed efficiently on a GPU in parallel for all points simultaneously. Second, to reduce the effect of imaging artifacts in the color and depth imaging (which can affect both the point-cloud and the 3D coordinates of the key-points), we clipped distance losses at 3 cm, such that distant ‘outliers’ do not skew the fit (Supplementary Fig. 4c). Third, because pixel density in the depth images depends on the distance from the depth camera, we weighted the contribution of each point in the point-cloud by the squared distance to the depth camera (Supplementary Fig. 4d). Fourth, to ensure that the minimization does not converge to unphysical joint postures (e.g., where the mouse bodies are overlapping), we added a penalty term to the loss function if the body models overlap. Calculating overlap between two ellipsoids is computationally expensive21, so we computed overlaps between the implant sphere and spheres centered on the body ellipsoids with a radius equal to the minor axis (Supplementary Fig. 4f). Fifth, to ensure spatiotemporal continuity of body model estimates, we also added a penalty term to the loss function, penalizing overlap between the mouse body in the current frame and other mouse bodies in the previous frame. This ensures that the bodies do not switch place, something that could otherwise happen if the mice are in joint poses with certain mirror symmetries (Supplementary Fig. 4g,h).
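As a sketch of the first three points (the centroid-path distance approximation, clipping and distance weighting), the point-cloud loss term for one body model could be written in PyTorch roughly as follows (the exact weighting and constants in our code may differ):

```python
import torch

def approx_dist_to_ellipsoid(points, center, Q):
    """Distance from each point to an ellipsoid surface, measured along the ray
    through the ellipsoid centroid. The surface satisfies (x-c)^T Q (x-c) = 1.
    points: (N, 3), center: (3,), Q: (3, 3)."""
    diff = points - center
    norm = diff.norm(dim=1)                               # ||p - c||
    quad = torch.einsum('ni,ij,nj->n', diff, Q, diff)     # (p-c)^T Q (p-c)
    return (norm * (1.0 - quad.clamp(min=1e-9).rsqrt())).abs()

def pointcloud_loss(points, weights, centers, Qs, clip=0.03):
    """Weighted, clipped point-cloud loss for one body model built from several
    ellipsoids. weights: per-point squared distance to the camera; distances are
    clipped at 3 cm so far-away outliers cannot dominate the fit."""
    d = torch.stack([approx_dist_to_ellipsoid(points, c, Q)
                     for c, Q in zip(centers, Qs)])        # (K, N)
    d_min = d.min(dim=0).values                            # nearest body part per point
    return (weights * d_min.clamp(max=clip)).sum()
```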
GPU-accelerated robust optimization
Minimizing the loss function requires solving three major challenges. The first challenge is computational speed. The number of key-points and body parts is relatively low (∼tens), but the number of points in the point-cloud is large (∼thousands), which makes the loss function computationally expensive. For minimization, we need to evaluate the loss function multiple times per frame (at 60 frames/s). If loss function evaluation is not fast, tracking becomes unusably slow. The second challenge is that the minimizer has to properly explore the loss landscape within each frame and avoid local minima. In early stages of developing this algorithm, we were only tracking interacting mice with no head implant (Supplementary Video 2). In that case, for the small frame-to-frame changes in body posture, the loss function landscape was nonlinear, but approximately convex, so we could use a fast, derivative-based minimizer to track changes in body posture (geodesic Levenberg-Marquardt steps18). For use in neuroscience experiments, however, one or more mice might carry a neural implant for recording or stimulation. The implant is generally at a right angle and offset from the ‘hinge’ between the two hip and head ellipsoids, which makes the loss function highly non-convex22. The final challenge is robustness against local minima in state space. Even though a body posture minimizes the loss in a single frame, it might not be an optimal fit, given the context of other frames (e.g., spatiotemporal continuity, no unphysical movement of the bodies).
To solve these three challenges – speed, state space exploration, and spatiotemporal robustness – we designed a custom GPU-accelerated minimization algorithm, which incorporates ideas from annealed particle filters23 and online Bayesian filtering24. To maximize computational speed, the algorithm was implemented as pure tensor algebra in PyTorch, a high-performance GPU computing library25. Annealed particle filters are suited to explore highly non-convex loss surfaces23, which allowed us to avoid local minima within each frame. Between frames, we used online Bayesian filtering to avoid being trapped in low-probability solutions given the preceding tracking. For every frame, we first proposed the state of the 17 parameters using kernel-recursive least-squares tracking (‘KRLS-T’24) from a Bayesian filter bank based on preceding frames. After particle filter-based loss function minimization within a single frame, we updated the Bayesian filter bank, and proposed a particle filter starting point for the next frame. This strategy has three major advantages. First, by proposing a solution that takes into account previous variables and their covariances, we often already started loss function minimization close to the new minimum. Second, if the Bayesian filter deems that the fit for a single frame is unlikely, based on the preceding frames, this fit will only weakly update the Bayesian filter bank, and thus only weakly perturb the upcoming tracking. This gave us a convenient way to balance the information provided by the fit of a single frame against the ‘context’ provided by previous frames. Third, the Bayesian filter-based approach depends only on previously tracked frames, not future frames. This is in contrast to other approaches to incorporating context that rely on versions of backwards belief propagation5,15,26. Since our algorithm only uses past data, it is in principle possible to optimize our algorithm for real-time use in closed-loop experiments.
For each frame, we explored the loss surface with 200 particles (Fig. 4b,c). We generated the particles by perturbing the proposed minimum, based on the previous frames, by quasi-random, low-discrepancy sampling27 (Supplementary Fig. 5). We exploited the fact that the loss function structure allowed us to execute several key steps in parallel, across multiple independent dimensions, and implemented these calculations as vectorized tensor operations. This allowed us to leverage the power of CUDA kernels for fast tensor algebra on the GPU25. Specifically, to efficiently calculate the point-cloud loss (the shortest distance from a point in the point-cloud to the surface of a body model), we calculated the distance to all five body model spheroids for all points in the point-cloud and for all 200 particles, in parallel (Fig. 4c). We then applied fast minimization kernels across the two body models to generate the smallest distance to either mouse, for all points in the point-cloud. Because the mouse body models are independent, we only had to apply a minimization kernel to calculate the smallest distance, for every point, to the 40,000 (200 x 200) joint poses of the two mice. These parallel computation steps are a key aspect of our method, which allows our tracking algorithm to avoid the ‘curse of dimensionality’ by not exploring the full 17-dimensional space, but rather exploring two independent 8- and 9-dimensional subspaces in parallel.
Tracking algorithm performance
To ensure that the tracking algorithm did not get stuck in suboptimal solutions, we forced the particle filter to explore a large search space within every frame (Fig. 5a-c). In successive iterations, we gradually made the perturbations to the particles smaller and smaller by annealing the filter23, to approach the minimum. At the end of each iteration, we ‘resampled’ the particles by picking the 200 joint poses with the lowest losses in the 200-by-200 matrix of losses. This resampling strategy has two advantages. First, it can be done without fully sorting the matrix28, the most computationally expensive step in resampling29. Second, it provides a kind of quasi-‘importance sampling’. During resampling, some poses in the next iteration might be duplicates (picked from the same row or column in the 200-by-200 loss matrix), allowing particles in each subspace to collapse at different rates (if the particle filter is very certain about one body pose, but not the other, for example).
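The broadcast-and-resample step can be expressed compactly in PyTorch; the sketch below uses random stand-in losses and omits the pairwise terms (such as the body-overlap penalty) that are added to the 200-by-200 matrix in the full algorithm:

```python
import torch

loss_a = torch.rand(200)   # losses of the 200 candidate poses of mouse A
loss_b = torch.rand(200)   # losses of the 200 candidate poses of mouse B

# All 200 x 200 = 40,000 joint poses; entry [i, j] combines pose i of A with pose j of B.
joint_loss = loss_a[:, None] + loss_b[None, :]

# Resample: keep the 200 joint poses with the lowest loss without fully sorting.
vals, idx = torch.topk(joint_loss.flatten(), k=200, largest=False)
idx_a = torch.div(idx, 200, rounding_mode='floor')   # row index = pose of mouse A
idx_b = idx % 200                                    # column index = pose of mouse B
```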
By investigating the performance of the particle filter across iterations, we found that the filter generally converged within five iterations (Fig. 5d), providing good tracking across frames (Fig. 5e). In every frame, the particle filter fit yields a noisy estimate of the 3D location of the mouse bodies. The transformation from the joint pose parameters (e.g., rotation angles, spine scaling) to 3D space is highly nonlinear, so simple smoothing of the trajectory in pose parameter space would distort the trajectory in real space. Thus, we filtered the tracked trajectories by a combination of Kalman-filtering and maximum likelihood-based smoothing30,31 and 3D rotation smoothing in quaternion space32 (Supplementary Fig. 6c-e, Supplementary Video 3).
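As a minimal illustration of the rotation-smoothing idea (not the full Kalman/maximum-likelihood pipeline), per-frame body rotations can be converted to quaternions, averaged over a short window and re-normalized, for example:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def smooth_rotations(euler_angles, window=5):
    """euler_angles: (T, 3) per-frame rotations in radians; returns smoothed angles."""
    quats = Rotation.from_euler('xyz', euler_angles).as_quat()      # (T, 4)
    for t in range(1, len(quats)):                                  # sign continuity:
        if np.dot(quats[t], quats[t - 1]) < 0:                      # q and -q encode the
            quats[t] *= -1                                          # same rotation
    kernel = np.ones(window) / window
    smoothed = np.column_stack([np.convolve(quats[:, i], kernel, mode='same')
                                for i in range(4)])
    smoothed /= np.linalg.norm(smoothed, axis=1, keepdims=True)     # back to unit length
    return Rotation.from_quat(smoothed).as_euler('xyz')
```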
Representing the joint postures of the two animals with this parametrization was highly data efficient, reducing the memory footprint from ∼3.7 GB/min for raw color/depth image data, to ∼0.11 GB/min for pre-processed point-cloud/key-point data to ∼1 MB/min for tracked body model parameters. On a regular desktop computer with a single GPU, we could do key-point detection in color image data from all four cameras in ∼2x real time (i.e. it took 30 mins to process a 1 hr experimental session). Depth data processing (point-cloud merging and key-point deprojection) ran at ∼0.7x real time, and the tracking algorithm ran at ∼0.2x real time (if the filter uses 200 particles and 5 filter iterations per frame). Thus, for a typical experimental session (∼ hours), we would run the tracking algorithm overnight, which is possible because the code is fully unsupervised.
Note that this version of the algorithm is written for active development, not pure speed. For example, a large part of the processing time is spent reading and writing data to disk, and – while convenient for modifying and experimenting with the code – it is not necessary to first process the color data, then the depth data, and then run the tracking algorithm as separate steps. In its present form, the code is fast enough to be useful, but not optimized to the theoretical maximum speed.
Error detection
Error detection and correction is a critical component of behavioral tracking. Even if error rates are nominally low, errors are non-random, and errors often happen exactly during the behaviors in which we are most interested: interactions. In multi-animal tracking, two types of tracking error are particularly fatal, as they compound over time: identity errors and body orientation errors (Supplementary Fig. 7a). In conventional tracking approaches using only 2D videos, it is often difficult to maintain correct identities when mice are closely interacting, allo-grooming, or passing over and under each other. Although swapped identities can be corrected later once the mice are well-separated again, this still leaves individual behavior during the actual social interaction unresolved5,26. We found that our tracking algorithm was robust against both identity swaps (Supplementary Fig. 7b-e) and body direction swaps (Supplementary Fig. 8). This observation agrees with the fact that tracking in three-dimensional space (subject to our implemented spatiotemporal regularizations) a priori ought to allow better identity tracking: in full 3D space it is easier to determine who is rearing over whom during an interaction, for example.
To test our algorithm for more subtle errors, we manually inspected 500 frames, randomly selected across an example 21 minute recording session. In these 500 frames, we detected one tracking mistake, corresponding to 99.8% correct tracking (Supplementary Fig. 9a). The identified tracking mistake was visible as a large, transient increase in the point-cloud loss function (Supplementary Fig. 9b). After the tracking mistake, the robust particle filter quickly recovered to correct tracking again (Supplementary Fig. 9c). By detecting such loss function anomalies, or by detecting ‘unphysical’ postures or movements in the body models, potential tracking mistakes can be automatically ‘flagged’ for inspection (Supplementary Fig. 9c,d). After inspection, errors can be manually corrected or automatically corrected in many cases, for example by tracking the particle filter backwards in time after it has recovered. As the algorithm recovers after a tracking mistake, it is generally unnecessary to actively supervise the algorithm during tracking, and manual inspection for potential errors can be performed after running the algorithm overnight.
Automated analysis of movement kinematics and social behavior
Despite the high level of data compression (from raw images to pre-processed data to only 17 dimensions), a human observer can clearly distinguish social events in the tracked data (Fig. 5f). The major motivation behind developing our method, however, was to eschew manual labeling, especially for large-scale datasets on the order of days to months of video tracking of the same animals. As a validation of our tracking method, we demonstrate that our method can automatically extract both movement kinematics and behavioral states (movement patterns, social events) during spontaneous social interactions. Moreover, data generated by our tracking method are compatible with two types of analyses: (i) modern data-mining methods for unsupervised discovery of behavioral states (specifically, state space modeling) and (ii) template-based analysis, detecting behaviors of interest based on prior knowledge. Template-based methods are better suited than unsupervised methods for detecting certain types of behaviors (see Discussion), so it is a major advantage that our data are amenable to both types of analysis. Both types of analysis are quantitative and fully automatic, solving two major issues with manual labeling (subjective experimenter bias and limited throughput).
To demonstrate template-based analysis, we defined social behaviors of interest as templates and matched these templates to tracked data. We know that anogenital sniffing33 and nose-to-nose touch34 are prominent events in rodent social behavior, so we designed a template to detect these events. In this template, we exploited the fact that we could easily calculate both body postures and movement kinematics in the reference frame of each animal’s own body. For every frame, we first extracted the 3D coordinates of the body model skeleton (Supplementary Fig. 5). From these skeleton coordinates, we calculated the position (Fig. 6a) and a three-dimensional speed vector for each mouse (‘forward speed’, along the hip ellipsoid, ‘left speed’, perpendicular to the body axis, and ‘up speed’, along the z-axis; Fig. 6b, Supplementary Fig. 8). We also calculated three instantaneous ‘social distances’, defined as the 3D distance between the tips of the two animals’ noses (‘nose-to-nose’; Fig. 6b), and from the tip of each animal’s nose to the posterior end of the conspecific’s hip ellipsoid (‘nose-to-tail’; Fig. 6b). From these social distances, we could automatically detect when the mouse bodies were in a nose-to-nose or a nose-to-tail configuration, and in a single 20 min experimental session, we observed multiple nose-to-nose and nose-to-tail events (Fig. 6c). It is straightforward to further subdivide these social events by body postures and kinematics, to, e.g., separate stationary nose-to-tail configurations (anogenital sniffing/grooming) from nose-to-tail configurations during locomotion (social following).
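As a sketch of such a template (the threshold and minimum duration below are illustrative values, not the exact criteria used for Fig. 6c), nose-to-nose events can be detected from the skeleton coordinates as follows:

```python
import numpy as np

def nose_to_nose_events(nose_a, nose_b, fps=60, thresh_m=0.02, min_dur_s=0.25):
    """nose_a, nose_b: (T, 3) nose-tip coordinates from the fitted body models.
    Returns a list of (start_s, end_s) intervals where the noses are close."""
    dist = np.linalg.norm(nose_a - nose_b, axis=1)          # nose-to-nose distance
    close = (dist < thresh_m).astype(int)
    edges = np.diff(np.concatenate(([0], close, [0])))      # run starts/ends
    starts, ends = np.where(edges == 1)[0], np.where(edges == -1)[0]
    return [(s / fps, e / fps) for s, e in zip(starts, ends)
            if (e - s) / fps >= min_dur_s]
```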
To demonstrate unsupervised behavioral state discovery, we used GPU-accelerated probabilistic programming35 and state space modeling to automatically detect and label movement states. To discover types of locomotor behavior, we fitted a ‘sticky’ multivariate hidden Markov model36 to the two components of the speed vector that lie in the xy-plane (Supplementary Fig. 9a-h). With five hidden states, this model yielded interpretable movement patterns that correspond to known mouse locomotor ‘syllables’: resting (no movement), turning left and right, and moving forward at slow and fast speeds (Fig. 6d). Fitting a similar model with three hidden states to the z-component of the speed vector (Supplementary Fig. 9i-n) yielded interpretable and known ‘rearing syllables’: rest, rearing up and ducking down (Fig. 6e). Using the maximum a posteriori probability from these fitted models, we could automatically generate locomotor ethograms and rearing ethograms for the two mice (Fig. 6b).
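The structure of the locomotion model can be illustrated with a small stand-in: a Gaussian hidden Markov model on the planar speed components with a ‘sticky’ Dirichlet prior on the transition matrix (we used GPU-accelerated probabilistic programming for the actual fits; hmmlearn and the random stand-in data below are only for illustration):

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

speeds_xy = np.random.randn(10_000, 2)   # stand-in for (forward, left) speed per frame
n_states, kappa = 5, 50.0                # kappa biases the model towards self-transitions

model = GaussianHMM(n_components=n_states,
                    covariance_type='full',
                    transmat_prior=1.0 + kappa * np.eye(n_states),  # 'sticky' prior
                    n_iter=100)
model.fit(speeds_xy)
states = model.predict(speeds_xy)        # most likely (Viterbi) state sequence
```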
In line with previous observations, we found that movement bouts were short (medians, rest/left/right/fwd/ffwd: 0.83/0.50/0.52/0.45/0.68 s, a ‘sub-second’ timescale13). In the locomotion ethograms, bouts of rest were longer than bouts of movement (all p < 0.05, Mann-Whitney U-test; Fig. 6f) and bouts of fast forward locomotion were longer than other types of locomotion (all p < 0.001, Mann-Whitney U-test; Fig. 6f). In the rearing ethograms, the distribution of rests was very wide, consisting of both long (∼seconds) and very short (∼tenths of a second) periods of rest (Fig. 6g). As expected, by plotting the rearing height against the duration of rearing syllables, we found that short rests in rearing were associated with standing high on the hind legs (the mouse rears up, waits for a brief moment before ducking back down), while longer rests happened when the mouse was on the ground (Spearman rank correlation; Fig. 6h). Like the movement types and durations, the transition probabilities from the fitted hidden Markov models were also in agreement with known behavioral patterns. In the locomotion model, for example, the most likely transition from “rest” was to “slow forward”. From “slow forward”, the mouse was likely to transition to “turning left”, “fast forward” or “turning right”; it was unlikely to transition directly from “fast forward” to “rest” or from “turning left” to “turning right”, and so on (Supplementary Fig. 9o,p).
Finally, our method recovered the 3D head direction of both animals. The head direction of the implanted animal was given by the skeleton of the body model (the implant is fixed to the head). As mentioned above, we exploited the rotational symmetry of the body model of the conspecific to decrease the dimensionality of the search space during tracking (Fig. 4c). However, from the 3D coordinates of the detected key-points, we could still infer the 3D head direction (Supplementary Fig. 10) and it matched known mouse behavior (Supplementary Fig. 11). This feature is of particular interest to social neuroscience, since – while rodents clearly respond to the behavior of conspecifics – we are still only beginning to discover how the rodent brain encodes the gaze direction and body postures of others37.
DISCUSSION
We combined 3D videography, deep learning and GPU-accelerated robust optimization to estimate the posture dynamics of multiple freely-moving mice, engaging in naturalistic social interactions. Our method is cost-effective (requiring only inexpensive consumer depth cameras and a GPU), has high spatiotemporal precision, is compatible with neural implants for continuous electrophysiological recordings, and tracks unmarked animals of the same coat color (e.g., enabling behavioral studies in transgenic mice). Our method is fully unsupervised, which makes the method scalable across multiple PCs or GPUs. Unsupervised tracking allows us to investigate social behavior across long behavioral time scales – beyond what is feasible with manual annotation – to elucidate developmental trajectories, dynamics of social learning, or individual differences among animals38,39, among other types of questions. Finally, our method uses no message-passing from future frames, but only relies on past data, which makes the method a promising starting point for real-time tracking.
Reasons to study naturalistic social interactions in 3D
Social dysfunctions can be devastating symptoms in a multitude of mental conditions, including autism spectrum disorders, social anxiety, depression, and personality disorders40. Social interactions also powerfully impact somatic physiology, and social interactions are emerging as a promising protective and therapeutic element in somatic conditions, such as inflammation41 and chronic pain42. These disorders have high incidence but generally lack effective treatment options, largely because even the neurobiological basis of ‘healthy’ social behavior is poorly understood.
In neuroscience and experimental biomedicine, there has been major technical progress in recording techniques for freely moving animals, with high-density electrodes43,44, and head-mounted multi-photon microscopes45,46. Moreover, newer methods are being developed for tracking complex patterns of animal behavior47,48. Here we provide a new method to complement these approaches for feasible, quantitative, and automated behavioral analysis. A major next step for future work is to apply such algorithms to animal behavior in different conditions. For example, the algorithm can easily be adapted to track other animal body shapes such as juvenile mice or other species, or movable, deformable objects that might be important for foraging or other behaviors in complex environments.
What is the advantage of a body model?
In automated analysis of behavioral states, there are three main approaches: nonlinear clustering7,49–55, probabilistic state space modeling13,56–59 and template matching5,11,26. In nonlinear clustering, tracked body coordinates (and derived quantities, such as time derivatives or spectral components) are segmented into discrete behaviors by density-based clustering, typically after nonlinear projection down to a low-dimensional 2D space7,49–53,55 or 3D space60. Density-based clusters are manually inspected and curated, such that clusters judged as similar are merged and clusters are assigned names (e.g., ‘locomotion’, ‘grooming’, etc.). This approach is simple and robust, but still flexible enough to discover behavioral changes due to interactions with conspecifics52,53. A limitation of this approach, however, is that nonlinear clustering directly on the tracked kinematic features does not allow explicit modeling of history dependence or hierarchical structure.
In principle, state space models are highly expressive, allowing for complex nested structures of hidden states, observational models, covariance structures and temporal dependencies, e.g., autoregressive terms13, linear dynamics61, and hierarchies39. In practice, however, state space models are not easily fit to data. Complex models quickly become prohibitively computationally expensive and approximate fitting strategies, such as variational inference, show poor convergence for complex models62. Finally, and most importantly, model comparison is still an unsolved problem and methods guaranteed to discover a ‘true’ latent structure from data (e.g., number of states or transition graph structure) do not exist36,63.
Neither state space modeling nor nonlinear clustering requires an explicit body model of the animal. In principle, any tracked points on the body can be used, as long as their statistics differ enough between distinct behaviors that they still form significantly different clusters. What is the advantage of an actual body model? First, even though in principle any tracked points on an animal can be used for unsupervised behavioral analysis, body posture features based on the actual body geometry are often both more powerful and more interpretable in practice64. For example, a body model lets us specify appropriate generative models for state space modeling. Specifying appropriate probability distributions over variables with unknown densities and covariances is not straightforward, whereas specifying priors over the parameters of an understandable body model with known physical constraints is much easier, which is an advantage for this type of analysis.
Second, unsupervised methods are not well-suited for the discovery of behaviors that happen rarely or are kinematically similar to other behaviors. In order for rare or kindred behaviors to form identifiable clusters, it may be necessary to collect extremely large datasets, beyond what is realistic to collect in typical neuroscience experiments, and beyond what it is computationally feasible to analyze. These complications get even worse when considering the joint statistics of multiple animals. A physical model of the bodies allows us to overcome these problems by simple template matching and we can easily specify templates that flag social poses (nose-to-nose and nose-to-tail touch, in our example). Moreover, of particular interest to neural recordings, a body model lets us regress out proprioceptive and movement-related signals, known to align to an egocentric, body-centered reference frame3,65–67.
The tracking algorithm is easy to update
Our data acquisition pipeline and tracking algorithm are open source and implemented in Python, a widely used programming language in machine learning. This makes it easy to update the algorithm with methodological advances in the field. Deep learning in 3D is still thought to be in its infancy68, but along with technical developments in depth imaging hardware (for video games, self-driving cars and other industrial applications), there are exciting developments in analysis, including deep learning methods for detection of deformable objects from image69–71 and point-cloud72,73 data, geometric and graph-based tricks for GPU-accelerated analysis of 3D data74–76, and methods for physical modeling of deformable bodies77–79.
Our pre-processing pipeline generates a highly compressed data representation, consisting of a 3D point-cloud and key-points, sampled at 60 fps. Storing and sharing the raw depth and color video frames from multiple cameras, across long behavioral time scales (hours to days) is not feasible, but storing and sharing the compressed, pre-processed data format is possible. This is a major advantage for two reasons. First, as deep learning methods for 3D data improve, old datasets can be re-analyzed and mined for new insights. Second, abundantly available 3D datasets of animal behavior let machine learning researchers test and benchmark new methods on experimental data. This enables a development cycle for improving behavioral analysis, by which adoption of 3D-videography-based behavioral tracking in biology can contribute positively to future methodological developments in the machine learning community.
Moving towards real-time tracking
Our algorithm is unsupervised, does not use any message-passing from future frames, and robustly recovers from tracking mistakes. Since the algorithm relies purely on past data, it is in principle possible to run the algorithm in real-time. Currently, the processing time per frame is higher than the camera frame rate (60 frames/s). However, our algorithm is not fully optimized and there are multiple speed improvements, which are straightforward to implement. For example, in the current version of the algorithm, we first record the images to disk, and then read and pre-process the images later. This is convenient for algorithm development and exploration, but writing and reading the images to disk, and moving them onto and off a GPU are time-intensive steps. During pre-processing, it is possible to increase the speed and precision of key-point detection by implementing peak detection as a convolutional layer8 and it may be possible to perform key-point detection directly on the 1-channel infrared images instead of the 3-channel color images. Grayscale infrared images are faster to process and we would be able to perform experiments in visual darkness, which is less stressful and more appropriate for nocturnal mice. As shown in Figure 1d, the infrared images are ‘contaminated’ with a dotted grid of points from the infrared laser projector, but – as evidenced by the usefulness of pixel dropout in image data augmentation17 (Supplementary Fig. 3) – it should be possible to train a network to ‘disregard’ the dotted grid during key-point detection. We may also be able to reduce the number of required particle filtering steps between frames. For example, we could force a network to learn to draw particle samples more intelligently, based on learned covariance patterns in natural mouse movement. Additionally, we could try to combine our particle filter with gradient-based optimization methods (start with a particle filter step to search for loss function basins and then use fast, parallel Levenberg-Marquardt steps18 to quickly move particles to the bottom of the basins, for example).
Beyond these optimizations, tracking at a lower frame rate would allow more data processing time per frame. Our robust, particle-filter-based tracker with online forecasting is an ideal candidate for this task. Going forward, we will investigate the performance of the algorithm at lower frame rates and explore ways to increase tracking robustness further, by implementing other recently described tracking tricks, such as using the optical flow between video frames to link key-points together in multi-animal tracking (‘SLEAP’7,80), real-time in-painting of depth artifacts81 and even better online trajectory forecasting, for example using deep Gaussian processes82 or a neural network trained to propose trajectories based on mouse behavior. Experimentation and optimization are clearly needed, but our fully unsupervised algorithm – requiring data transfer from only a few cameras, with deep convolutional networks, physical modeling and particle filter tracking implemented as tensor algebra on the same GPU – is a promising starting point for the development of real-time, multi-animal 3D tracking.
AUTHOR CONTRIBUTIONS
C.L.E. designed and implemented the system, performed experiments, and analyzed the data. R.C.F. supervised the study. C.L.E and R.C.F. wrote the manuscript.
COMPETING INTERESTS
The authors declare no competing interests.
DATA AND CODE AVAILABILITY
All code and an example dataset were submitted with this manuscript and will be made available on GitHub before or upon publication.
METHODS
Hardware
Necessary hardware:
General lab electronics (tape, wire, soldering equipment, etc.) and:
Software
Our system uses the following software: Linux (tested on Ubuntu 16.04 LTS, but other distributions should work, https://ubuntu.com/), the Intel Realsense SDK (https://github.com/IntelRealSense/librealsense), and Python (tested on Python 3.6; we recommend Anaconda, https://www.anaconda.com/distribution/). Required Python packages are installed with pip or conda (script in supplementary software). All required software is free and open source.
Animal welfare
All experimental procedures were performed according to animal welfare laws under the supervision of local ethics committees. Animals were kept on a 12 hr/12 hr light cycle with ad libitum access to food and water. Mice presented as partner animals were housed socially in same-sex cages, and post-surgery implanted animals were housed in single-animal cages. Neural recording electrodes were implanted on the dorsal skull under isoflurane anesthesia, with a 3D-printed electrode drive and a hand-built mesh housing. All procedures were approved under NYU School of Medicine IACUC protocols.
Recording data structure
The Python program is set to pull raw images at 640 x 480 (color) and 640 x 480 (depth), but only saves 320 x 210 (color) and 320 x 240 (depth). We do this to reduce noise (multi-pixel averaging), save disk space and reduce processing time. Our software also works for saving images up to 848 x 480 (color) and 848 x 480 (depth) at 60 frames/s, in case the system is to be used for a bigger arena, or to detect smaller body parts (eyes, paws, etc). Images were transferred from the cameras with the python bindings for the Intel Realsense SDK (https://github.com/IntelRealSense/librealsense), and saved as 8-bit, 3-channel PNG files with opencv (for color images) or as 16-bit binary files (for depth images).
3D data structure
For efficient access and storage of the large datasets, we save all pre-processed data to hdf5 files. Because the number of data points (point-cloud and key-points) per frame varies, we save every frame as a jagged array. To this end, we pack all pre-processed data to a single array. If we detect N points in the point-cloud and M key-points in the color images, we save a stack of the 3D coordinates of the points in the point-cloud (Nx3, raveled to 3N), the weights (N), the 3D coordinates of the key-points (Mx3, raveled to 3M), their pseudo-posterior (M), an index indicating key-point type (M), and the number of key-points (1). Functions to pack and unpack the pre-processed data from a single line (‘pack_to_jagged’ and ‘unpack_from_jagged’) are provided.
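A minimal sketch of this packing scheme is shown below (the provided ‘pack_to_jagged’ and ‘unpack_from_jagged’ functions implement the actual format; the field order here follows the text, but details are illustrative):

```python
import numpy as np

def pack_frame(points, weights, keypoints, pseudo_posterior, keypoint_type):
    """points: (N, 3), weights: (N,), keypoints: (M, 3),
    pseudo_posterior: (M,), keypoint_type: (M,) -> one flat array per frame."""
    m = len(keypoints)
    return np.concatenate([points.ravel(), weights, keypoints.ravel(),
                           pseudo_posterior, keypoint_type, [m]]).astype(np.float64)

def unpack_frame(flat):
    m = int(flat[-1])
    body = flat[:-1]
    n = (len(body) - 5 * m) // 4
    points, rest = body[:3 * n].reshape(n, 3), body[3 * n:]
    weights, rest = rest[:n], rest[n:]
    keypoints, rest = rest[:3 * m].reshape(m, 3), rest[3 * m:]
    return points, weights, keypoints, rest[:m], rest[m:2 * m]
```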
Temporal synchronization
LED blinks were generated with voltage pulses from an Arduino (on digital pin 12), controlled over USB with a Python interface for the Firmata protocol (https://github.com/tino/pyFirmata). To receive the Firmata messages, the Arduino was flashed with the ‘StandardFirmata’ example that comes with the standard Arduino IDE. TTL pulses were 150 ms long and spaced by ∼U(150,350) ms. We recorded the emitted voltage pulses in both the infrared images (used to calculate the depth image) and on a TTL input on an Open Ephys Acquisition System (https://open-ephys.org/). We detected LED blinks and TTL flips by threshold crossing and roughly aligned the two signals by the first detected blink/flip. We first refined the alignment by cross-correlation in 10 ms steps, and then identified pairs of blinks/flips by detecting the closest blink, subject to a cutoff (z-score < 2, compared to all blink-to-flip time differences) to remove blinks missed by the camera (because an experimenter moved an arm in front of a camera to place a mouse in the arena, for example). The final shift and drift were estimated by a robust regression (Theil-Sen estimator) on the pairs of blinks/flips.
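A minimal sketch of the final shift/drift estimate (assuming matched blink/flip times in seconds have already been extracted) using SciPy’s Theil-Sen estimator:

```python
import numpy as np
from scipy.stats import theilslopes

def estimate_shift_and_drift(blink_times_cam, flip_times_neural):
    """Fit flip_time ~ slope * blink_time + offset with a Theil-Sen estimator."""
    slope, offset, _, _ = theilslopes(flip_times_neural, blink_times_cam)
    drift_us_per_min = (slope - 1.0) * 60e6     # clock drift, microseconds per minute
    return slope, offset, drift_us_per_min

def camera_to_neural_time(t_cam, slope, offset):
    """Map camera timestamps onto the neural acquisition clock."""
    return slope * np.asarray(t_cam) + offset
```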
Deep neural network
We used a stacked hourglass network14 implemented in PyTorch25 (https://github.com/pytorch/pytorch). The network architecture code is from the implementation in ‘PyTorch-Pose’ (https://github.com/bearpaw/pytorch-pose). The full network architecture is shown in Supplementary Fig. 1. Image augmentation during training was done with the ‘imgaug’ library (https://github.com/aleju/imgaug). Our augmentation pipeline is shown in Supplementary Fig. 3. The network was trained by RMSProp (α = 0.99, ε = 10−8) with an initial learning rate of 0.00025. During training, the learning rate was automatically reduced by a factor of 10 if the training loss decreased by less than 0.1% for five successive steps (using the built-in learning rate scheduler in PyTorch). After training, we used the final output map of the network for key-point detection, and used a maximum filter to detect key-point locations as local maxima in network output images with a pseudo-posterior probability of at least 0.5.
Image labeling and target maps
For training the network to recognize body parts, we need to generate labeled frames by manual annotation. For each frame, 1-5 body parts are labeled on the implanted animal and 1-4 body parts on the partner animal. This can be done with any annotation software; we used a modified version of the free ‘DeepPoseKit-Annotator’8 (https://github.com/jgraving/DeepPoseKit-Annotator/) included in the supplementary code. This software allows easy labeling of the necessary points, and pre-packages training data for use in our training pipeline. Body parts are indexed by i/p for implanted/partner animal (‘nose_p’ is the nose of the partner animal, for example). Target maps were generated by adding a Gaussian function (σ = 3 px for implant, σ = 1 px for other body parts, scaled to peak value = 1) to an array of zeros (at 1/4th the resolution of the input color image) at the location of every labeled body key-point. 1D part affinity maps were created by connecting labeled keypoints in an array of zeros with a 1 px wide line (clipped to max value = 1), and blurring the resulting image with a Gaussian filter (σ = 3 px).
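A minimal sketch of the map generation (SciPy’s Gaussian filter stands in for the Gaussian functions and blurring described above; the exact rasterization in our code may differ):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_target_map(shape_hw, keypoints_xy, sigma):
    """Gaussian bump (peak scaled to 1) at each labeled key-point, in a
    down-sampled array of zeros. keypoints_xy are (x, y) pixel coordinates."""
    target = np.zeros(shape_hw, dtype=np.float32)
    for x, y in keypoints_xy:
        target[int(round(y)), int(round(x))] = 1.0
    target = gaussian_filter(target, sigma)
    return target / (target.max() + 1e-12)

def make_affinity_map(shape_hw, p0, p1, sigma=3, n_samples=200):
    """Rasterize the straight path between two connected key-points, then blur."""
    amap = np.zeros(shape_hw, dtype=np.float32)
    for t in np.linspace(0.0, 1.0, n_samples):
        x, y = (1 - t) * np.asarray(p0) + t * np.asarray(p1)
        amap[int(round(y)), int(round(x))] = 1.0
    return gaussian_filter(amap, sigma)
```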
Aligning depth and color data
The camera intrinsics (focal lengths f, optical centers p, depth scale d_scale) and extrinsics (rotation matrix R, translation vector t) for both the color and depth sensors can be accessed over the SDK. Depth and color images were aligned to each other using a pinhole camera model. For example, the z coordinate of a single depth pixel with row/column indices (i, j) and 16-bit depth value d_ij is given by:

z = d_scale · d_ij

And the x and y coordinates are given by:

x = z · (j − p_x) / f_x,   y = z · (i − p_y) / f_y

Using the extrinsics between the depth and color sensors, we can move the coordinate to the reference frame of the color sensor:

(x′, y′, z′)ᵀ = R · (x, y, z)ᵀ + t

Using the focal length and optical center of the color sensor, we can project the point onto the color image:

j′ = f′_x · x′ / z′ + p′_x,   i′ = f′_y · y′ / z′ + p′_y
For assigning color pixel values to depth pixels, we simply rounded the projected color pixel indices (i′, j′) to the nearest integer and copied the corresponding value. More computationally intensive methods based on ray-tracing exist (‘rs2_project_color_pixel_to_depth_pixel’ in the Librealsense SDK, for example), but the simple pinhole camera approximation yielded good results (small jitter averages out across multiple key-points), which allowed us to skip the substantial computational overhead of ray tracing in our data pre-processing.
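Putting the equations above together, the per-pixel alignment can be sketched as follows (intrinsics are passed as (f_x, f_y, p_x, p_y) tuples; this mirrors the pinhole approximation rather than the SDK’s ray-traced routine):

```python
import numpy as np

def depth_pixel_to_color_pixel(i, j, d_ij, depth_intr, color_intr, R, t, d_scale):
    """Deproject depth pixel (row i, column j) to 3D, move it into the color
    sensor's frame, and project it onto the color image (nearest pixel)."""
    fx, fy, px, py = depth_intr
    z = d_scale * d_ij
    xyz = np.array([(j - px) / fx * z, (i - py) / fy * z, z])   # deproject
    xc, yc, zc = R @ xyz + t                                    # depth -> color frame
    fxc, fyc, pxc, pyc = color_intr
    j_color = fxc * xc / zc + pxc                               # project
    i_color = fyc * yc / zc + pyc
    return int(round(i_color)), int(round(j_color))
```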
3D calibration and alignment
To align the cameras in space, we first mounted a blue ping-pong ball on a stick and moved it around the behavioral arena while recording both color and depth video. For each camera, we used a combination of motion filtering, color filtering, smoothing and thresholding to detect the location of the ping-pong ball in the color frame (details in code). We then aligned the color frames to depth frames and extracted the corresponding depth pixels, yielding a partial 3D surface of the ping-pong ball. By fitting a sphere to this partial surface, we could estimate the 3D coordinate of the center of the ping-pong ball (Fig. 3a). This procedure yielded a 3D trajectory of the ping-pong ball in the reference frame of each camera (Fig. 3b). We used a robust regression method (RANSAC routines to fit a sphere with a fixed radius of 40 mm, modified from routines in https://github.com/daavoo/pyntcloud), insensitive to errors in the calibration ball trajectory, to estimate the transformation matrices needed to bring all trajectories into the same frame of reference (Fig. 3c).
Body model
We model each mouse as two prolate ellipsoids. The model is specified by the 3D coordinate of the center of the hip ellipsoid, chip, and the major and minor axes of the hip ellipsoid are scaled by a coordinate, s ∈ [0,1], that can morph the ellipsoid from long and narrow to short and fat:
The ‘neck’ (the joint of rotation between the hip and nose ellipsoids) sits a distance, dhip = 0.75·ahip, along the central axis of the hip ellipsoid. In the frame of reference of the mouse body (with the major axis of the hip ellipsoid along the x-axis), a unit vector pointing from the ‘neck’ to the center of the nose ellipsoid, along the nose ellipsoid’s major axis, is:
In the frame of reference of the laboratory (‘world coordinates’), we allow the hip ellipsoid to rotate around the z-axis (‘left’/’right’) and around the y-axis (‘up’/’down’, in the frame of reference of the mouse). We define R(αx, αy, αz) as a 3D rotation matrix specifying the rotation by an angle α around each of the three axes, and a second 3D rotation matrix that rotates one vector onto another. Then we can define the rotated nose direction, where ēx is a unit vector along the x-axis. In the frame of reference of the mouse body, the center of the nose ellipsoid is:
So, in world coordinates, the center is:
The center of the neural implant is offset from the center of the nose ellipsoid by a distance ximpl along the major axis of the nose ellipsoid, and a distance zimpl orthogonal to the major axis. We allow the implant to rotate around the nose ellipsoid by an angle, ψ. Thus, in the frame of reference of the mouse body, the center of the implant is:
And in world coordinates, calculated in the same way as the center of the nose ellipsoid:
We calculated other skeleton points (tip of the nose ellipsoid, etc.) in a similar manner. We can use the rotation matrices for the hip and the nose ellipsoids to calculate the characteristic ellipsoid matrices:
Calculating the shortest distance from a point to the surface of an ellipsoid in 3 dimensions requires solving a computationally expensive polynomial19. Doing this for each of the thousands of points in the point-cloud, multiplied by four body ellipsoids, multiplied by 200 particles per fitting step, is not computationally tractable. Instead, we use the shortest distance to the surface along a path that passes through the centroid (Supplementary Fig. 4a-b). This is a good approximation to d (especially when averaged over many points), and it can be implemented as pure vectorized linear algebra, which can be calculated very efficiently on a GPU20. Specifically, to calculate the distance from any point in the point-cloud, we just center the points on the center of an ellipsoid, and – for example – calculate:
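As a sketch of this approximation (not our exact loss code), assume the characteristic matrix Q of an ellipsoid is defined so that surface points x satisfy xᵀQx = 1 (e.g. Q = R·diag(1/a², 1/b², 1/b²)·Rᵀ for a prolate ellipsoid with semi-axes a, b, b and rotation R). The ray from the centroid through a point p then crosses the surface at p/√(pᵀQp), which gives the vectorized distance below:

```python
# Approximate point-to-ellipsoid distance along the ray through the centroid.
import torch

def approx_distance_to_ellipsoid(points, center, Q):
    """points: (N, 3) point cloud; center: (3,); Q: (3, 3) characteristic matrix."""
    p = points - center                           # center points on the ellipsoid
    norm = torch.linalg.norm(p, dim=1)            # |p|
    quad = torch.einsum('ni,ij,nj->n', p, Q, p)   # p^T Q p for every point
    return norm * torch.abs(1.0 - quad.rsqrt())   # distance along the centroid ray
```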
In fitting the model, we used the following constants.
Loss function evaluation and tracking
The joint pose of the two mice is represented as a particle in 17-dimensional space. For each data frame, we start with a proposal particle (leftmost green block, based on previous frames), from which we generate 200 particles by pseudo-random perturbation within a search space (next green block). For each proposal particle, we calculate three types of weighted loss contributions: loss associated with the distance from the point-cloud to the surface of the mouse body models (top path, green), loss associated with body key-points (middle path, key-point colors), and loss associated with overlap of the two mouse body models (bottom path, purple). We broadcast the results in a way that allows us to consider all 200 × 200 = 40,000 possible joint postures of the two mice. After calculation, we pick the 200 joint postures with the lowest overall loss and anneal the search space, or – if converged – continue to the next frame. When we continue to a new frame, we add the fitted frame to a KRLS-T filter bank (an online adaptive filter for prediction), which proposes the position of the particle for the next frame, based on previous frames. All loss function calculations and KRLS-T predictions are pure tensor algebra that can be fully vectorized and executed on a GPU.
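A minimal sketch of the broadcasting step, with hypothetical per-particle losses (`loss_a`, `loss_b`) and a hypothetical pairwise overlap penalty (`overlap`), neither of which are our exact loss terms or weights:

```python
# Evaluate all 200 x 200 joint postures of the two mice by broadcasting.
import torch

n = 200
loss_a = torch.rand(n)          # per-particle loss for mouse A (point-cloud + key-points)
loss_b = torch.rand(n)          # per-particle loss for mouse B
overlap = torch.rand(n, n)      # overlap penalty for every A/B pose combination

# Entry [i, j] is the joint loss of (pose i for mouse A, pose j for mouse B)
joint_loss = loss_a[:, None] + loss_b[None, :] + overlap

# Keep the 200 joint postures with the lowest overall loss for the next annealing step
best = torch.topk(joint_loss.flatten(), k=n, largest=False)
idx_a, idx_b = best.indices // n, best.indices % n
```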
State space filtering of raw tracking data
After tracking, the coordinates of the skeleton points (chip, cnose, etc.) were smoothed with a 3D kinematic Kalman filter tracking the 3D position (p), velocity (v) and (constant) acceleration (a). For example, for the center of the hip coordinate, Q’ is the Q matrix for a discrete constant white noise model, with σmeasurement = 0.015 m and σprocess = 0.01 m. The σ’s were the same for all points, except for the slightly more noisy estimate of the center of the implant, where we used σmeasurement = 0.02 m and σprocess = 0.01 m. From the frame rate (60 fps), the filter time step was Δt = 1/60 s. The maximum-likelihood trajectory was estimated with the Rauch-Tung-Striebel method30 with a fixed lag of 16 frames. The filter and smoother were implemented using the ‘filterpy’ package (https://github.com/rlabbe/filterpy). The spine scaling, s, was smoothed with a similar filter in 1D, except that we did not model acceleration, only s and a (constant) s ‘velocity’, with σmeasurement = 0.3 and σprocess = 0.05.
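A sketch of a single-axis constant-acceleration filter plus RTS smoother with ‘filterpy’ is shown below; the actual filter is 3D (one such block per axis), and the measurement array, initial state and exact matrices here are illustrative rather than our exact configuration:

```python
# One-axis constant-acceleration Kalman filter + Rauch-Tung-Striebel smoother.
import numpy as np
from filterpy.kalman import KalmanFilter
from filterpy.common import Q_discrete_white_noise

dt = 1.0 / 60.0                                   # 60 frames/s
kf = KalmanFilter(dim_x=3, dim_z=1)               # state: position, velocity, acceleration
kf.F = np.array([[1, dt, 0.5 * dt**2],
                 [0, 1,  dt],
                 [0, 0,  1]])                     # constant-acceleration dynamics
kf.H = np.array([[1.0, 0.0, 0.0]])                # only position is measured
kf.R *= 0.015 ** 2                                # measurement noise (sigma = 0.015 m)
kf.Q = Q_discrete_white_noise(dim=3, dt=dt, var=0.01 ** 2)  # process noise
kf.x = np.zeros((3, 1))
kf.P *= 10.0                                      # broad initial state uncertainty

zs = np.random.randn(100, 1) * 0.01               # stand-in for one tracked coordinate
mu, cov, _, _ = kf.batch_filter(zs)
smoothed, _, _, _ = kf.rts_smoother(mu, cov)      # Rauch-Tung-Striebel smoothing
```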
After filtering the trajectories of the skeleton points, we recalculated the 3D rotation matrices of the hip and head ellipsoids from the vectors pointing from chip to cmid (from the middle of the hip ellipsoid to the neck joint), and from cmid to cnose (from the neck joint to the middle of the nose ellipsoid). We then converted the 3D rotation matrices to unit quaternions, and smoothed the 3D rotations by smoothing the quaternions with a 10-frame boxcar filter, essentially averaging the quaternions by finding the eigenvector associated with the largest eigenvalue of a matrix composed of the quaternions within the boxcar32. After smoothing the ellipsoid rotations, we re-calculated the coordinates of the tip of the nose ellipsoid and the posterior end of the hip ellipsoid from the smoothed central coordinates, rotations, and – for the posterior end of the hip ellipsoid – the smoothed spine scaling. A walkthrough of the state space filtering pipeline is shown in Supplementary Fig. 6.
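The quaternion averaging inside the boxcar can be sketched with the standard largest-eigenvalue method; the function and array names below are illustrative:

```python
# Average unit quaternions within one boxcar window via the eigenvector of
# the largest eigenvalue of the accumulator matrix sum(q q^T).
import numpy as np

def average_quaternions(quats):
    """quats: (N, 4) array of unit quaternions within one boxcar window."""
    M = quats.T @ quats                     # 4x4 accumulator, sum of q q^T
    eigvals, eigvecs = np.linalg.eigh(M)    # symmetric matrix, so eigh applies
    avg = eigvecs[:, np.argmax(eigvals)]    # eigenvector of the largest eigenvalue
    return avg / np.linalg.norm(avg)        # re-normalize to a unit quaternion
```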
Template matching
To detect social events, we calculated three instantaneous ‘social distances’: the 3D distance between the tips of the two animals’ noses (‘nose-to-nose’), and the distance from the tip of each animal’s nose to the posterior end of the conspecific’s hip ellipsoid (‘nose-to-tail’; Fig. 6c). From these social distances, we could automatically detect when the mouse bodies were in a nose-to-nose configuration (nose-to-nose distance < 2 cm and nose-to-tail distance > 6 cm) or a nose-to-tail configuration (nose-to-nose distance > 6 cm and nose-to-tail distance < 2 cm). The events were detected by these logical conditions, and single threshold crossings due to noise were then removed by binary opening with a 3-frame kernel, followed by binary closing with a 30-frame kernel.
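For illustration, the thresholding and morphological cleanup could be written as below; the thresholds and kernel sizes follow the text, while the distance arrays are hypothetical stand-ins:

```python
# Detect nose-to-nose events and remove spurious threshold crossings.
import numpy as np
from scipy.ndimage import binary_opening, binary_closing

nose_to_nose = np.random.rand(1000) * 0.2      # per-frame distance in metres (placeholder)
nose_to_tail = np.random.rand(1000) * 0.2      # per-frame distance in metres (placeholder)

raw_events = (nose_to_nose < 0.02) & (nose_to_tail > 0.06)   # nose-to-nose condition
events = binary_opening(raw_events, structure=np.ones(3))    # drop blips shorter than 3 frames
events = binary_closing(events, structure=np.ones(30))       # bridge short gaps
```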
State space modeling of mouse behavior
State space modeling of the locomotion behavior was performed in Pyro35, a GPU-accelerated probabilistic programming language built on top of Pytorch25. We modeled the (centered and whitened) locomotion behavior as a hidden Markov model with discrete latent states, z, and an associated transition matrix, T.
To make the model ‘sticky’ (discourage fast switching between latent states), we draw the transition probabilities, pij, from a Dirichlet prior with high mass near the ‘edges’, and initialize Tinit = (1 − η)I + η/nstates, where η = 0.05.
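A small sketch of this initialization (the number of states below is a placeholder, not from the text):

```python
# 'Sticky' transition-matrix initialization: diagonal dominates, rows sum to 1.
import torch

n_states, eta = 4, 0.05                      # n_states = 4 is a placeholder value
T_init = (1 - eta) * torch.eye(n_states) + eta / n_states
```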
Each state emits a forward speed and a left speed, drawn from a two-dimensional Gaussian distribution with a full covariance matrix.
We draw the means of the states from a normal distribution and use an LKJ Cholesky prior for the covariance:
The up speed was modeled in a similar way, except that the latent states were just one-dimensional normal distributions. The means and variances of the latent states were initialized by k-means clustering of the locomotion speeds. The model was fit in parallel to 600-frame snippets of a subset of the data by stochastic variational inference62. We used an automatic delta guide function (‘AutoDelta’) and an evidence lower bound (ELBO) loss function. The model was fitted by stochastic gradient descent with a learning rate of 0.0005. After model fitting, we generated ethograms by assigning latent states by maximum a posteriori probability with the Viterbi algorithm.
3D head direction estimation
We use the 3D positions of the ear key-points to determine the 3D head direction of the partner animal. We assign the ear key-points to a mouse body model by calculating the distance from each key-point to the center of the nose ellipsoid of both animals (cutoff: closest to one mouse and < 3 cm from the center of the head ellipsoid, Supplementary Fig. 10a). To estimate the 3D head direction, we calculate the unit rejection (vrej) between a unit vector along the nose ellipsoid (vnose) and a unit vector from the neck joint (cmid) to the average 3D position of the ear key-points associated with that mouse (v_ear_direction, Supplementary Fig. 10b). If no ear key-points were detected in a frame, we linearly interpolate the average 3D position. To average out jitter, the estimates of the average ear coordinate and the center of the nose coordinate were smoothed with a Gaussian (σ = 3 frames). The final head direction vector was also smoothed with a Gaussian (σ = 10 frames).
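As a sketch of the rejection step (assuming vnose and v_ear_direction are already unit vectors, as described above; the function name is illustrative):

```python
# Unit rejection: the component of v_ear_direction orthogonal to vnose, normalized.
import numpy as np

def unit_rejection(v_ear_direction, vnose):
    rej = v_ear_direction - np.dot(v_ear_direction, vnose) * vnose
    return rej / np.linalg.norm(rej)
```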
Supplementary material
SUPPLEMENTARY FIGURES
ACKNOWLEDGEMENTS
This work was supported by The Novo Nordisk Foundation (C.L.E.), the BRAIN Initiative (NS107616 to R.C.F.) and a Howard Hughes Medical Institute Faculty Scholarship (R.C.F.).
REFERENCES