## Abstract

Noninvasive behavioral tracking of animals is crucial for many scientific investigations. Recent transfer learning approaches for behavioral tracking have considerably advanced the state of the art. Typically these methods treat each video frame and each object to be tracked independently. In this work, we improve on these methods (particularly in the regime of few training labels) by leveraging the rich spatiotemporal structures pervasive in behavioral video — specifically, the spatial statistics imposed by physical constraints (e.g., paw to elbow distance), and the temporal statistics imposed by smoothness from frame to frame. We propose a probabilistic graphical model built on top of deep neural networks, Deep Graph Pose (DGP), to leverage these useful spatial and temporal constraints, and develop an efficient structured variational approach to perform inference in this model. The resulting semi-supervised model exploits both labeled and unlabeled frames to achieve significantly more accurate and robust tracking while requiring users to label fewer training frames. In turn, these tracking improvements enhance performance on downstream applications, including robust unsupervised segmentation of behavioral “syllables,” and estimation of interpretable “disentangled” low-dimensional representations of the full behavioral video.

## 1 Introduction

Animal pose estimation (APE) is a critical scientific task, with applications in ethology, psychology, neuroscience, and other fields. Recent work in neuroscience, for example, has emphasized the degree to which neural activity throughout the brain is correlated with movement [1, 2, 3]; i.e., to understand the brains of behaving animals we need to extract as much information as possible from behavioral video recordings. State-of-the-art APE methods, such as DeepLabCut (DLC) [4], DeepPoseKit (DPK) [5], and LEAP [6], have transferred tools from the human pose estimation (HPE) deep learning literature to the APE setting [7, 8], opening up an exciting array of new applications and new scientific questions to be addressed.

However, even with these advances in place, hundreds of labels may still be needed to achieve tracking at the desired level of precision and reliability. Providing these labels requires significant user effort, particularly in the common case that users want to track multiple objects per frame (e.g., all the fingers on a hand or paw). Unlike HPE algorithms [9], APE algorithms are applied to a wide variety of different body structures (e.g., fish, flies, mice, or cheetahs) [10], compounding the effort required for collecting labeled datasets. Moreover, even with hundreds of labels, users still often see occasional “glitches” in the output (i.e., frames where tracking is briefly lost), which typically interfere with downstream analyses of the extracted behavior.

To reduce the prevalence of these “glitches,” several state-of-the-art human pose estimation algorithms have employed graphical models as regularizers to enforce spatial or temporal constraints [11, 12, 13, 14, 15]. Most of these methods use neural network regularizers [16, 17, 18], exploiting the abundance of large labeled training datasets to prevent overfitting, which is not feasible in the APE setting, where training data is far scarcer. Works such as [13, 14] instead combine undirected graphical models with deep neural networks, assign tracked locations discrete values, and use discrete message passing algorithms during the inference step. More recently, [19] employed a convolutional neural network (CNN) with a fully connected conditional random field (CRF) model to learn spatial relationships from the data and avoid specifying a particular tree-structured model for the CRF, but this method does not exploit temporal information. Incorporating temporal information would turn the model into a CRF with latent variables, which requires a completely different inference method. Meanwhile, among state-of-the-art APE algorithms, DLC, DPK, and LEAP do not incorporate any prior temporal structural knowledge into their models, and therefore treat each video frame as an independent sample during training. DPK incorporates spatial “part affinity fields” [20] to predict the targets as well as the edges among those targets, but estimating these terms again requires large labeled training sets.

To improve APE performance in the sparse-labeled-data regime, we propose a probabilistic graphical model built on top of deep neural networks, Deep Graph Pose (DGP), to leverage both spatial and temporal constraints, and develop an efficient structured variational approach to perform inference in this model. DGP models the targets as continuous random variables (unlike [13, 14]), resulting in a semi-supervised model that takes advantage of both labeled and unlabeled frames to achieve significantly more accurate and robust tracking, using fewer labels. Finally, we demonstrate that these tracking improvements enhance performance in downstream applications, including robust unsupervised segmentation of behavioral “syllables,” and estimation of interpretable low-dimensional representations of the full behavioral video.

## 2 Model

The graphical model of DGP is summarized in Figure 1. We observe frames *x*_{t} indexed by *t*, along with a small subset of labeled markers *y*_{t,j} (where *j* indexes the different targets we would like to track). The target locations *y*_{t,j} on most frames are unlabeled, but we have several sources of information to constrain these latent variables: temporal smoothness constraints between the targets *y*_{t,j} and *y*_{t+1,j}, which we capture with potentials *ϕ*_{t}; spatial constraints between the targets *y*_{t,i} and *y*_{t,j}, which we model with spatial potentials *ϕ*_{s}; and information from the image *x*_{t}, modeled by *ϕ*_{n}.

We use the notation *ϕ*_{n} to indicate that this potential is parameterized by a neural network. A number of architectures could potentially be employed for *ϕ*_{n} [6, 5]; we chose to adapt the architecture used in DLC [4] here. We use a simple quadratic potential *ϕ*_{t} to impose temporal smoothness:

$$\phi_t(y_{t,j}, y_{t+1,j}) = w_t \, \lVert y_{t+1,j} - y_{t,j} \rVert_2^2,$$

which penalizes the distance between targets in consecutive frames; the weights in general may depend on the target index *j*, but are constant in time.

The spatial potential *ϕ*_{s} is more dataset-dependent and can be chosen depending on the constraints that the markers should satisfy. Typical examples include a soft constraint that the paw marker should never be greater than some distance from the elbow marker, or that the nose should always be within some radius of a water spout with a fixed position. Again, we use a simple quadratic potential to encode these soft constraints:

$$\phi_s(y_{t,i}, y_{t,j}) = w_s^{ij} \, \lVert y_{t,i} - y_{t,j} \rVert_2^2,$$

which penalizes the distance between “connected” targets *y*_{t,i} and *y*_{t,j} (where the user can pre-specify pairs of connected targets that should have neighboring locations in the frame, e.g., paw and elbow).
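To make the two potentials concrete, here is a minimal numpy sketch of the quadratic temporal and spatial penalties; the function names, array shapes, and weight values are illustrative choices, not the released implementation:

```python
import numpy as np

def temporal_penalty(y, w_t):
    """Sum of w_t * ||y[t+1, j] - y[t, j]||^2 over frames and targets.

    y: (T, J, 2) array of target coordinates; w_t: scalar weight.
    """
    diffs = np.diff(y, axis=0)  # (T-1, J, 2) frame-to-frame displacements
    return w_t * np.sum(diffs ** 2)

def spatial_penalty(y, edges, w_s):
    """Sum of w_s[i, j] * ||y[t, i] - y[t, j]||^2 over frames and edges.

    edges: list of (i, j) pairs of "connected" targets (e.g., paw and elbow);
    w_s: (J, J) array of pairwise weights.
    """
    total = 0.0
    for i, j in edges:
        total += w_s[i, j] * np.sum((y[:, i] - y[:, j]) ** 2)
    return total
```

Both terms enter the joint distribution with a negative sign, so larger displacements are penalized more heavily.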

We want to “let the data speak” and avoid oversmoothing, so the penalty weights *w*_{s} and *w*_{t} should be small. In practice we found that *w*_{t} could be fixed to a small constant value (independent of dataset and target index *j*) with good results; similarly, setting *w*_{s}^{ij} = *c*/*d*_{ij}^{2}, where *d*_{ij} is a rough estimate of the average distance (in pixels) between targets *i* and *j* and *c >* 0 is a small scalar (again independent of dataset and target indices *i, j*), led to robust results without any need to fit extra parameters.

We summarize the parameter vector as *β* = {*θ, w*_{t}, *w*_{s}}, where *θ* denotes the neural net parameters in *ϕ*_{n}. Given *β*, the joint probability distribution over targets *y* is

$$p(y \mid x, \beta) = \frac{1}{Z(x, \beta)} \exp\left( \sum_{t=1}^{T} \sum_{j=1}^{J} \phi_n(y_{t,j}, x_t) \;-\; \sum_{j=1}^{J} \sum_{t=1}^{T-1} \phi_t(y_{t,j}, y_{t+1,j}) \;-\; \sum_{t=1}^{T} \sum_{(i,j) \in \varepsilon} \phi_s(y_{t,i}, y_{t,j}) \right),$$

where *ε* denotes the edge set of constrained targets (i.e., the pairs of markers *i, j* with a nonzero potential function), *Z*(*x, β*) is the normalizing constant obtained by integrating the exponentiated sum of potentials over *y*, *T* denotes the total number of frames, and *J* denotes the total number of targets.

## 3 Structured variational inference

Our goal is to estimate *p*(*y*^{h} | *y*^{v}, *x, β*), the posterior over locations of unlabeled targets *y*^{h}, given the frames from the video *x*, the locations of the labeled markers *y*^{v}, and the parameters *β*. (Here *h* denotes hidden, for the unlabeled data, and *v* denotes visible, for the labeled data.) Calculating this posterior distribution exactly is intractable, due to the highly nonlinear potentials *ϕ*_{n}. We chose to use structured variational inference [21, 22] to approximate this posterior. We approximate *p*(*y*^{h}, *y*^{v} | *x, β*) with a Gaussian graphical model (GGM) with the same graphical model as Figure 1, leading to a Gaussian posterior approximation *q*(*y*^{h} | *y*^{v}, *x, β*) for *p*(*y*^{h} | *y*^{v}, *x, β*) in which the inverse covariance (precision) matrix is block tridiagonal, with one block per frame *t*. (Since the potentials *ϕ*_{t} and *ϕ*_{s} are quadratic, yielding Gaussian distributions, the neural-network image potential *ϕ*_{n} is the only term that needs to be replaced with a new quadratic potential to form a Gaussian *q*.)

Updating the parameters of this GGM scales as *O*(*TJ* ^{3}) in the worst case, due to the chain structure of the graphical model (and the corresponding block tridiagonal structure of the precision matrix). If the edge graph *ε* defined by the user-specified spatial potential function set is disconnected, this *J* ^{3} factor can be replaced by *K*^{3}, where *K* is the size of the largest connected component in *ε*.
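To illustrate why the chain structure yields linear-time inference, the following sketch solves for the posterior mean of a scalar (single 1-D target) chain GGM via a tridiagonal solve; this is an illustrative analogue of the block-tridiagonal computation, not the actual DGP code, and the parameterization is our own:

```python
import numpy as np

def chain_posterior_mean(obs_mean, obs_prec, w_t):
    """Posterior mean of a scalar chain GGM, solved in O(T).

    Energy: sum_t obs_prec[t]*(y_t - obs_mean[t])^2 + w_t*sum_t (y_{t+1}-y_t)^2.
    The stationarity conditions form a tridiagonal linear system, solved here
    with the Thomas algorithm (forward elimination + back substitution),
    mirroring the block-tridiagonal solve used for the full model.
    """
    T = len(obs_mean)
    nbrs = np.full(T, 2.0); nbrs[0] = nbrs[-1] = 1.0   # chain endpoints have one neighbor
    diag = obs_prec + w_t * nbrs                        # main diagonal of the precision
    off = np.full(T - 1, -w_t)                          # sub/super diagonals
    rhs = obs_prec * obs_mean

    # Thomas algorithm: forward elimination
    dd = diag.astype(float).copy()
    d = rhs.astype(float).copy()
    for t in range(1, T):
        m = off[t - 1] / dd[t - 1]
        dd[t] -= m * off[t - 1]
        d[t] -= m * d[t - 1]
    # back substitution
    y = np.empty(T)
    y[-1] = d[-1] / dd[-1]
    for t in range(T - 2, -1, -1):
        y[t] = (d[t] - off[t] * y[t + 1]) / dd[t]
    return y
```

With `w_t = 0` the posterior mean reduces to the per-frame observation means; larger `w_t` pulls the trajectory toward smoothness.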

We use a structured inference network approach [22] to estimate the model and variational parameters. We compute gradients of the evidence lower bound (ELBO) for this model using standard automatic differentiation tools, and perform standard stochastic gradient updates or quasi-Newton methods to estimate the parameters. Full details regarding the ELBO derivation and optimization can be found in the appendix.

### 3.1 Conceptual comparison against fully-supervised approaches

Standard fully-supervised approaches like DeepLabCut [4] learn a neural network (or more precisely, use transfer learning to adjust the parameters of an existing neural network) to essentially perform a classification task: the network is trained to output large values at the known location of the markers (i.e., the “positive” training examples), and small values everywhere else (the “negative” training examples). Given a small number of training examples, these methods are prone to overfitting.

In contrast, the approach we propose here is semi-supervised: it takes advantage of both the labeled and unlabeled frames to learn better model parameters *θ*. On labeled frames, the posterior distribution *p*(*y*^{v} | *y*^{v}, *x, β*) is deterministic, and the objective function reduces to the fully supervised case. On the other hand, on unlabeled frames we have new terms in the objective function (see section S2.2.1 for more details). Clearly, the spatial and temporal potentials *ϕ*_{s} and *ϕ*_{t} encourage the outputs to be temporally smooth and to obey the user-specified spatial constraints (at least on average). But in addition the objective function encourages *ϕ*_{n} to output large values where *p*(*y*^{h} | *y*^{v}, *x, β*) is large, and small values where *p*(*y*^{h} | *y*^{v}, *x, β*) is small. Since we approximate *p*(*y*^{h} | *y*^{v}, *x, β*) as Gaussian, the resulting ELBO encourages *ϕ*_{n} to be (on average) unimodal on unlabeled frames — a constraint that is not enforced in standard approaches. This turns out to be a powerful regularizer and can lead to significant improvements even in cases where the spatial and temporal constraints *ϕ*_{s} and *ϕ*_{t} are weak, as we will see in the next section.
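The semi-supervised objective described above can be sketched as follows: on every frame the network's confidence map is pulled toward a unimodal Gaussian bump, centered at the human label when one exists and at the approximate posterior mean otherwise. This is a simplified illustration; the function names and bump parameterization are ours, not the released code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gaussian_bump(shape, center, var=1.0):
    """Unnormalized 2D Gaussian target map with peak 1 at `center`."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    return np.exp(-((rows - center[0]) ** 2 + (cols - center[1]) ** 2) / (2 * var))

def cross_entropy(logits, target):
    """Summed sigmoid cross entropy between a logit map and a target map."""
    p = sigmoid(logits)
    eps = 1e-12
    return -np.sum(target * np.log(p + eps) + (1 - target) * np.log(1 - p + eps))

def semi_supervised_loss(logits, label=None, q_mean=None, var=1.0):
    """Pull the confidence map toward a unimodal Gaussian bump: centered at
    the label on labeled frames, and at the approximate posterior mean on
    unlabeled frames (the term that encourages unimodality)."""
    center = label if label is not None else q_mean
    return cross_entropy(logits, gaussian_bump(logits.shape, center, var))
```

On labeled frames this reduces to the usual supervised loss; on unlabeled frames the bump center comes from inference rather than a human annotation.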

## 4 Results

We applied the DGP method summarized above to a variety of datasets, including behavioral videos from three different species, in a variety of poses and environments (see Table 1 in the appendix for a summary). The new model (DGP) consistently outperformed the baseline (DLC). In each example video analyzed here, DLC outputs occasional “glitch” frames where tracking of at least one target is lost (e.g., around frame index 100 in the lower right panel); these glitches are much less prevalent in the DGP output.

We experimented with running Kalman smoothers and total variation denoisers to post-process the DLC output, but were unable to find any parameter settings that could reliably remove these glitches without oversmoothing the data (results not shown). (The frequency of these “glitches” can be reduced by increasing the training set through labeling more data — but this is precisely the user effort we aim to minimize here.) See the full videos summarizing the performance of the two methods here. An example screenshot for the mouse-wheel dataset [23] is shown in Figure 2. More example screenshots for all other datasets can be found in Figures S2-S5 in the appendix.

We also examined the “confidence maps” generated by visualizing the output of the neural network *ϕ*_{n} as an image; large values of the confidence map indicate the regions where the network “believes” the target is located with high confidence. Comparing the confidence maps output by DLC versus DGP, we see that the latter tend to be more unimodal (see Figure 2, small panels in the middle column). Nonetheless, DGP does occasionally output multi-modal confidence maps (e.g., in frames where the target is occluded), since the ELBO objective function used to train DGP encourages unimodality but does not impose unimodality as a hard constraint.

To develop more quantitative comparisons, we manually labeled 1000 training frames in the mouse-wheel dataset^{2} and randomly assigned 55 labels to the training set. Next we randomly subsampled 10–90% of this training set and retrained the models to quantify the dependence of the test error on the number of labeled frames. Figure 3 shows the test error averaged over five random subsamples. Focusing on the DGP (red) and DLC (blue) bars, we see that DGP outperforms DLC uniformly over the training set fraction (i.e., the number of labeled frames used to train the model). Similar results were obtained using an *ε*-insensitive loss that ignored errors below a threshold *ε* (on the order of 5–10 pixels here), below which the “true” marker location becomes somewhat subjective.
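For reference, a hinge-style threshold-insensitive error of the kind mentioned above can be written in a few lines; this is a standard definition, and the exact variant used for the figures may differ:

```python
import numpy as np

def eps_insensitive_error(pred, true, eps=5.0):
    """Per-marker Euclidean error, ignoring deviations below eps pixels
    (below which the 'true' marker location is itself subjective).

    pred, true: (..., 2) arrays of predicted and labeled coordinates.
    """
    err = np.linalg.norm(pred - true, axis=-1)
    return np.maximum(err - eps, 0.0)
```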

Next we performed an ablation study, to better understand the source of the performance gains exhibited by DGP. We examined two approaches that can be considered as conceptual waypoints between DLC and DGP. First, we experimented with a model in which the spatial and temporal potentials are turned off (i.e., *w*_{s} = *w*_{t} = 0) during both training and testing. The resulting graphical model is now independent over targets *j* and frames *t*, and variational inference in this model reduces to standard “mean-field” variational inference, in which the variational approximation is a product of Gaussian distributions, one for each *y*_{t,j}. We call the resulting model DLC-semi, since the resulting ELBO objective function combines a usual supervised loss (as in DLC) with an unsupervised term that encourages the output of the image potential *ϕ*_{n} to match its Gaussian approximation for each (*t, j*) pair (i.e., the resulting loss can be considered a semi-supervised DLC hybrid). Surprisingly, although this model does not enforce any spatial constraints or temporal smoothness, the extra regularization from the unsupervised term in the ELBO encourages the model output to be more unimodal, leading to significantly improved predictions compared to DLC, as shown in Figure 3. Indeed, DLC-semi performs almost as well as the full DGP model in some cases, although some significant gaps between DLC-semi versus DGP still remain. Similar results are seen in all of the other datasets examined here; see Figures S6-S10 in the appendix.

As a final comparison, we considered training the full DGP model but then turning off the spatial and temporal potentials (setting *w*_{s} = *w*_{t} = 0) at test time. This means that the model can exploit temporal and spatial regularities in the data to improve its estimate of the neural network parameters *θ* during training, but at test time the output of the model depends solely on the estimated image potential *ϕ*_{n}. In practice, this results in a speedup at test time, since we simply have to perform one *ϕ*_{n} evaluation per marker *y*_{t,j} (thus this approach has the same computational complexity as DLC), with no final E step required to enforce the spatial and temporal potentials. We call the resulting model DGP-NN. Figure 3 and the appendix Figures S6-S10 illustrate that DGP-NN performs similarly to DGP across datasets and a wide range of training dataset sizes. Thus, given the speed benefits at test time, DGP-NN seems to be the method of choice in practice.
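At test time, DGP-NN prediction then reduces to a single forward pass followed by an argmax per target; a minimal sketch with illustrative shapes:

```python
import numpy as np

def predict_markers(confidence_maps):
    """DGP-NN-style test-time readout: one confidence map per target,
    with each marker taken as the map's argmax location.

    confidence_maps: (J, H, W) array; returns a (J, 2) array of
    (row, col) coordinates, one per target.
    """
    J, H, W = confidence_maps.shape
    flat = confidence_maps.reshape(J, -1).argmax(axis=1)
    return np.stack(np.unravel_index(flat, (H, W)), axis=1)
```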

### 4.1 Downstream analyses

The above results demonstrate that DGP provides improved tracking performance compared to DLC. Next we show that these accuracy improvements can in turn lead to more robust and interpretable results from downstream analyses based on the tracked output.

#### Unsupervised temporal segmentation

We begin with a simple segmentation task: given the estimated trace for the paw, can we use unsupervised methods to determine, e.g., when the paw is moving versus still? Figure 4A shows that the answer is yes if we use the DGP output: a simple two-state auto-regressive hidden Markov model (ARHMM; fit via Gibbs sampling on 1000 frames’ output from either DGP or DLC; [25]) performs well with no further pre- or post-processing. In contrast, the multiple DLC “glitches” visible in Figure 2 contaminate the segmentation based on the DLC traces, resulting in unreliable segmentation. See the video for further details. Similar results were obtained when fitting models with more than two states (data not shown).
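As a simplified stand-in for the ARHMM analysis (which was fit via Gibbs sampling), the following sketch segments a 1-D marker trace into "still" vs. "moving" states using a two-state Gaussian HMM on frame-to-frame speed, decoded with Viterbi; all parameter values and function names here are illustrative, not the actual analysis code:

```python
import numpy as np

def viterbi_two_state(log_lik, log_trans, log_init):
    """Most likely state sequence for a 2-state HMM, in the log domain."""
    T = log_lik.shape[0]
    delta = np.zeros((T, 2)); psi = np.zeros((T, 2), dtype=int)
    delta[0] = log_init + log_lik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # (prev_state, next_state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_lik[t]
    states = np.zeros(T, dtype=int)
    states[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1, states[t + 1]]
    return states

def segment_still_vs_moving(trace, sigma=(0.1, 1.0), stay=0.95):
    """Toy 'still vs moving' segmentation of a 1-D marker trace: model the
    frame-to-frame speed as zero-mean Gaussian with a small (still) or large
    (moving) scale, then decode with Viterbi."""
    speed = np.abs(np.diff(trace, prepend=trace[0]))
    sigma = np.asarray(sigma)
    log_lik = -0.5 * (speed[:, None] / sigma) ** 2 - np.log(sigma)
    log_trans = np.log(np.array([[stay, 1 - stay], [1 - stay, stay]]))
    return viterbi_two_state(log_lik, log_trans, np.log([0.5, 0.5]))
```

The key point is that smooth, glitch-free traces (as produced by DGP) let even this simple model recover clean state boundaries, whereas tracking glitches inject spurious "moving" frames.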

#### Conditional convolutional autoencoder (CAE) for more interpretable low-dimensional representation learning

As a second downstream application, we consider unsupervised dimensionality reduction of behavioral videos [3, 1, 26, 27]. This approach, which typically uses linear methods like singular value decomposition (SVD), or nonlinear methods like convolutional autoencoders (CAEs), does not require user effort to label video frames. However, interpreting the latent features of these models can be difficult [28, 29], limiting the scientific insight gained by using these models. A hybrid approach that combines supervised (or semi-supervised) object tracking with unsupervised CAE training has the potential to ameliorate this problem [30, 31, 32, 33] – the tracked targets encode information about the location of specific body parts, while the estimated CAE latent vectors encode the remaining sources of variability in the frames. We refer to this ideal partitioning of variability into more interpretable subspaces as “disentangling.” Below we show that these hybrid models produce features that are more disentangled when trained with the output from DGP compared to DLC.

We fit conditional CAEs that take the markers output by DLC or DGP (hereafter referred to as CAE-DLC and CAE-DGP, respectively) as conditional inputs into both the encoding and decoding networks of the CAE, using the mouse-wheel dataset with 13 randomly chosen labeled frames (see Section S3 for implementation details). For this analysis, to obtain useful information across the full image, we labeled the left paw, right paw, tongue, and nose, rather than the four fingers on the left paw as in the previous section. Incorporating the tracking output from either method decreases the mean square error (MSE) of reconstructed test frames, for a given number of latents (Figure 4B). Furthermore, the networks trained with DGP outputs show improved performance over those trained with DLC outputs. Subsequent analyses are performed on the 2-latent networks, for easier visualization.

To test the degree of disentanglement between the CAE latents and the DGP or DLC output markers, we performed two different manipulations. First, we asked how changing individual markers affects the CAE reconstructions. We manipulated the x/y coordinates of a single marker while holding all other markers and all latents fixed. If the markers are disentangled from the latents, we would expect to see the body part corresponding to the chosen marker move around the image while all other features remain constant. We randomly chose a test frame and simultaneously varied the x/y marker values of the left paw (Figure 5, left). This manipulation results in realistic looking frames with clear paw movements in the CAE-DGP reconstructions, demonstrating that this marker information has been incorporated into the decoder. For the CAE-DLC reconstructions, however, this manipulation does not lead to clear movements of the left paw, indicating that the decoder has not learned to use these markers as effectively (a claim which is also supported by the higher MSE in the CAE-DLC networks, Figure 4B).

Second, we asked how changing the latents (rather than the markers) affects the reconstructed frames. In this manipulation we simultaneously changed the values of the two latents while holding all markers fixed. If the latents are disentangled from the markers, we expect the tracked features to remain constant while other, untracked features change. For the CAE-DGP network this latent manipulation has very little effect on the tracked body parts, as desired (Figure 5, top center); instead, the manipulation leads to small changes in the configuration of the left paw (rather than its absolute location; Figure 5, top right). On the other hand, for the CAE-DLC network this latent manipulation has a large effect on the left paw location (Figure 5, bottom center), which should instead be encoded by the markers. These results qualitatively demonstrate that the CAE-DGP networks have better learned to disentangle the markers and the latents, a desirable property for more in-depth behavioral analysis. Furthermore, we find, through an unbiased quantitative assessment of disentangling, that using DGP markers in these models leads to higher levels of disentangling between latents and markers than using DLC markers, across many different animal poses present in this dataset (see Figure S1).

## 5 Discussion

In this work, we proposed a probabilistic graphical model built on top of deep neural networks, Deep Graph Pose (DGP), which leverages the rich spatial and temporal structures pervasive in behavioral videos. We also developed an efficient structured variational approach to perform inference in this model. The resulting semi-supervised model exploits information from both labeled and unlabeled frames to achieve significantly more accurate and robust tracking, using fewer labels. Our results illustrate how the smooth behavioral trajectories from DGP lead to improved downstream applications, including the discovery of behavioral “syllables,” and interpretable or “disentangled” low-dimensional features from the behavioral videos.

An important direction for future work is to optimize the code to perform online inference for real-time experiments, as in [34]. As emphasized in the context of Figure 3, the version of DGP without the E-step at test time, DGP-NN, has the same computational complexity as DLC or DPK, and should therefore be amenable to online tracking while usually achieving the same accuracy as the full DGP approach. Extending our method to operate in 3D, fusing information from multiple cameras, would be another important direction for future work. Our variational inference approach should be extensible to this case, using epipolar constraints similar to those in [35, 36] (which use different inference approaches) to perform semi-supervised inference across views.

In addition, [4, 5, 6] all use slightly different architectures and achieve similar accuracies. We plan to perform more experiments with the architectures from [5, 6] in the future. Finally, we would like to incorporate our model into existing toolboxes and GUIs to facilitate user access.

### Code

Open source code is available here.

## 6 Broader impacts

We propose a new method for animal behavioral tracking. As highlighted in the introduction and in [10], recent years have seen a rapid increase in the development of methods for animal pose estimation, which need to operate in a different regime than methods developed for human pose estimation. Our work significantly improves the state of the art for animal pose estimation, and thus advances behavioral analysis for animal research, an essential task for scientific discovery in fields ranging from neuroscience to ecology. Finally, our work represents a compelling fusion of deep learning methods with probabilistic graphical model approaches to statistical inference, and we hope to see more fruitful interactions between these rich topic areas in the future.

## S1 Related Work

### Animal pose estimation

The proposed approach fills a void between state-of-the-art human pose estimation algorithms, which often rely on large quantities of manually labeled samples (see [9] for a recent review), and their counterparts in animal pose estimation [37, 4, 6, 5, 38, 39]. Among these animal pose estimation algorithms, DeepLabCut (DLC) [4], LEAP Estimates Animal Pose (LEAP) [6], and DeepPoseKit (DPK) [5] stand out, as they can achieve near human-level accuracy using a modest number of labels.

As emphasized above, the major difference between these approaches and the method we propose here is in the treatment of the large number of unlabeled frames that are typically available in behavioral videos. DLC and LEAP do not incorporate any temporal or spatial structural knowledge into their models, and DLC, LEAP, and DPK all treat frames independently, learning only from labeled samples. DPK incorporates some skeleton structure from labeled frames to predict the targets as well as the edges among those targets, similar to the part affinity fields described by [40]. However, this skeleton model must again be learned from labeled data, and it is not clear whether it can be learned from few data points; it may therefore not be useful in the low-data regime we are most interested in.

In our case, we encode spatial structure in a mild spatial potential that does not need to be learned, simply applying a soft constraint on the distance between pairs of targets that are known to be close (e.g., the fingers on a hand). It would be interesting in future work to explore combinations of the simple approach we use here with the more detailed skeletal model employed by [20, 5], since we expect the benefits of these two terms to be complementary, with the former most helpful in the low-training-data regime and the latter potentially more helpful once large training sets are acquired.

### Graphical models

Previous work on human pose estimation has employed graphical models as network regularizers [11, 12, 13, 14]. Among these, [13] and [14], like DGP, build an undirected graphical model (UGM) on top of deep neural networks. However, [13] and [14] assign tracked locations discrete values, which allows for (discrete) message passing algorithms during the inference step. DGP models the targets as continuous random variables, and estimates the unknown targets using variational inference. Another critical difference between the proposed method and similar previous approaches is that the latter focused on labeled frames. Therefore, their UGMs were conditional random fields without additional hidden variables. Again, our proposed model includes a mixture of hidden and visible variables, leading to a semi-supervised learning framework.

### Semi-supervised learning

Semi-supervised learning aims to fully utilize unlabeled or weakly-labeled data to gain additional insights into the structure of the data [41, 42, 43]. Many pose estimation algorithms have adopted such learning schemes to enhance the performance given limited training data [44, 35]. One conceptually similar “weakly-supervised” approach is described in [45], who trained a network to extract flying objects (obeying Newtonian acceleration) simply by constraining the output to resemble a parabola. As discussed above, the ELBO objective function in DGP encourages the output to follow a Gaussian distribution on each frame; this can be seen as a form of weak supervision that leads to improved accuracy even when the temporal and spatial soft constraints are removed (Figure 3).

## S2 Expanded methods

In this section we present our model and inference approach in fuller detail than was possible given space limitations in the main text. (To maintain the logical flow, in some cases we repeat points that were made in the main text methods.)

### S2.1 Deep Graph Pose model

The graphical model of DGP is summarized in Figure 1. We observe frames *x*_{t} indexed by *t*, along with a small subset of labeled markers *y*_{t,j} (where *j* indexes the different targets we would like to track). The target locations *y*_{t,j} on most frames are unlabeled, but we have several sources of information to constrain these latent variables: temporal smoothness constraints between the targets *y*_{t,j} and *y*_{t+1,j}, which we capture with quadratic potentials *ϕ*_{t}; spatial constraints between the targets *y*_{t,i} and *y*_{t,j}, modeled with quadratic potentials *ϕ*_{s}; and information from the image *x*_{t}, modeled by potentials *ϕ*_{n} parametrized by neural networks.

First let’s define the potential function *ϕ*_{n} between the input image *x* and the target’s 2D location *y*. We define *f*_{θ}(·) as a stack of a fixed pretrained ResNet-50 network and a trainable ConvNet parametrized by *θ*. The function *f*_{θ}(·) takes a frame *x* as input and outputs a 2D affinity map, which ideally has a sharp peak at the most likely coordinates of the target. We then denote the sigmoid function as *σ*(·), and refer to *σ*(*f*_{θ}(*x*)) as a “confidence map.”

With the potential *ϕ*_{n}, our goal is to match this 2D confidence map to a 2D Gaussian bump centered at *y* by minimizing the sigmoid cross entropy. Now let’s define the Gaussian bump. We construct a bivariate Gaussian function with mean *y* = [*y*_{m}, *y*_{n}] and variance *l*^{2}; its value at the *m*th row and the *n*th column is

$$G(y, l^2)_{m,n} = \exp\left( -\frac{(m - y_m)^2 + (n - y_n)^2}{2 l^2} \right).$$

The variance parameter was set to *l*^{2} = 1 in practice.
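The Gaussian target map is straightforward to construct; a minimal numpy sketch (the peak-1 normalization and shapes are our illustrative choices):

```python
import numpy as np

def gaussian_bump(shape, y, var=1.0):
    """2D Gaussian target map G(y, l^2) with peak value 1 at the (row, col)
    location y = (y_m, y_n), variance l^2 = var, over a (height, width) grid."""
    m, n = np.mgrid[0:shape[0], 0:shape[1]]
    return np.exp(-((m - y[0]) ** 2 + (n - y[1]) ** 2) / (2.0 * var))
```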

The potential function *ϕ*_{n} for the *m*th row and the *n*th column entry in *σ*(*f*_{θ}(*x*)) is defined as

$$\phi_n^{mn}(y, x) = G(y, l^2)_{m,n} \log \sigma\!\left(f_\theta(x)\right)_{m,n} + \left(1 - G(y, l^2)_{m,n}\right) \log\left(1 - \sigma\!\left(f_\theta(x)\right)_{m,n}\right).$$

Summing over all entries in the confidence map, we get the neural network potential *ϕ*_{n} as

$$\phi_n(y, x) = \sum_{m,n} \phi_n^{mn}(y, x).$$

We will write everything in vector form hereafter. We define **f** as the vectorized *f*_{θ}(*x*), define **h** as the vectorized *f*_{θ}(*x*) + log(1 + exp(*−f*_{θ}(*x*))), and define **G**(*y, l*^{2}) as the vectorized *G*(*y, l*^{2}), which is a function of mean *y* and variance *l*^{2}. Thus, for each target *j* we can rewrite the *j*-th image-based potential as

$$\phi_n(y_{t,j}, x_t) = \mathbf{G}(y_{t,j}, l^2)^\top \mathbf{f}_t - \mathbf{1}^\top \mathbf{h}_t,$$

where *t* indexes the frame and *j* indexes the target.
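The vectorized form of the image potential can be checked numerically against the entry-wise sigmoid cross-entropy definition; in this sketch (random illustrative values), G^T f − 1^T h matches the summed entry-wise terms:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=20)            # vectorized affinity map f_theta(x)
G = rng.uniform(size=20)           # vectorized Gaussian bump G(y, l^2)

sigma = 1.0 / (1.0 + np.exp(-f))   # confidence map
h = f + np.log1p(np.exp(-f))       # h = log(1 + exp(f)), computed stably

# entry-wise sigmoid cross-entropy terms, summed over the map
entrywise = np.sum(G * np.log(sigma) + (1 - G) * np.log(1 - sigma))
# vectorized form G^T f - 1^T h
vectorized = G @ f - np.sum(h)
print(np.allclose(entrywise, vectorized))  # → True
```

The identity follows from log σ(f) = f − h and log(1 − σ(f)) = −h, so each entry contributes G·f − h.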

We use a simple quadratic potential *ϕ*_{t} to impose temporal smoothness:

$$\phi_t(y_{t,j}, y_{t+1,j}) = w_t \, \lVert y_{t+1,j} - y_{t,j} \rVert_2^2,$$

which penalizes the distance between targets in consecutive frames. The weights in general may depend on the target index *j*, but are constant in time.

The spatial potential *ϕ*_{s} is dataset-dependent and can be selected according to the spatial constraints the markers should satisfy. Typical examples include a soft constraint that the paw should never be greater than some distance from the elbow. Again, we use a simple quadratic potential to encode these soft constraints:

$$\phi_s(y_{t,i}, y_{t,j}) = w_s^{ij} \, \lVert y_{t,i} - y_{t,j} \rVert_2^2,$$

which penalizes the distance between “connected” targets *y*_{t,i} and *y*_{t,j}. The user pre-specifies pairs of connected targets that should remain close across time, e.g., paw and elbow.

We want to “let the data speak” and avoid oversmoothing, so the penalty weights *w*_{s} and *w*_{t} should be small. In practice we found that *w*_{t} could be fixed to a small constant value (independent of dataset and target index *j*) with good results; similarly, setting *w*_{s}^{ij} = *c*/*d*_{ij}^{2}, where *d*_{ij} is a crude estimate of the average distance (in pixels) between targets *i* and *j* and *c >* 0 is a small scalar (again independent of dataset and target indices *i, j*), led to robust results without any need to fit extra parameters.

We summarize the parameter vector as *β* = {*θ, w*_{n}, *w*_{t}, *w*_{s}}, where *θ* denotes the neural net parameters. Given *β* and the full collection of images *x*, the joint probability distribution over targets *y* is

$$p(y \mid x, \beta) = \frac{1}{Z(x, \beta)} \exp\Big( -w_n \sum_{t=1}^{T} \sum_{j=1}^{J} \phi_n(y_{t,j}, x_t) - w_t \sum_{t=1}^{T-1} \sum_{j=1}^{J} \phi_t(y_{t,j}, y_{t+1,j}) - w_s \sum_{t=1}^{T} \sum_{(i,j) \in \varepsilon} \phi_s(y_{t,i}, y_{t,j}) \Big),$$

where *ε* denotes the edge set of constrained targets (i.e., the pairs *i, j* with a nonzero potential function), *Z*(*x, β*) is the normalizing constant obtained by integrating the unnormalized density over *y*, *T* denotes the total number of frames, and *J* denotes the total number of targets. The joint distribution can be described as a combination of a neural network component and a probabilistic graphical model over the latent variables (the unobserved targets *y*).

### S2.2 Structured variational inference

Our goal is to estimate *p*(*y*^{h} | *y*^{v}, *x, β*), the posterior over locations of unlabeled targets *y*^{h}, given the frames from the video *x*, the locations of the labeled markers *y*^{v}, and the parameters *β*. (Here *h* denotes hidden, for the unlabeled data, and *v* denotes visible, for the labeled data.) Calculating this posterior distribution exactly is intractable, due to the highly nonlinear potentials *ϕ*_{n}. We chose to use structured variational inference, similar to [22], to approximate this posterior. We approximate *p*(*y*^{h}, *y*^{v} | *x, β*) with a Gaussian graphical model (GGM) with the same graph structure as Figure 1. We denote the approximate posterior as *q*(*y*^{h}, *y*^{v} |*x, β*_{q}) (*β*_{q} encodes variational parameters). To obtain a fully Gaussian variational approximation, we replace the neural network potentials with quadratic terms

$$\phi_{n,q}(y_{t,j}) = w_{n,q}\, \| y_{t,j} - \mu_{n,t,j} \|_2^2.$$

Here the precision variables *w*_{n,q} and means *μ*_{n} are variational parameters that we could optimize over independently. However, we found it more efficient to parameterize the means *μ*_{n} via an inference neural network *f*(·) with parameters *γ*, whose output is a 2D affinity map, similar to *f*_{θ}(·). Putting the pieces together, we have the fully Gaussian approximate posterior

$$q(y^h, y^v \mid x, \beta_q) = \frac{1}{Z_q(x, \beta_q)} \exp\Big( -w_{n,q} \sum_{t,j} \| y_{t,j} - \mu_{n,t,j} \|_2^2 - w_t \sum_{t,j} \phi_t(y_{t,j}, y_{t+1,j}) - w_s \sum_{t,(i,j)\in\varepsilon} \phi_s(y_{t,i}, y_{t,j}) \Big),$$

where *Z*_{q}(*x, β*_{q}) is the normalizing constant (which can be computed explicitly, due to the fully Gaussian form of *q*), and *β*_{q} = {*γ, w*_{n,q}, *w*_{t}, *w*_{s}}.

Since *q*(*y*^{h}, *y*^{v}|*x, β*_{q}) is a GGM, we can rewrite eq. S10 in the standard Gaussian form

$$q(y^h, y^v \mid x, \beta_q) = \mathcal{N}(y;\, \mu_a, \Sigma_a), \qquad \Sigma_a^{-1} = \Lambda_n + \Lambda_t + \Lambda_s, \qquad \mu_a = \Sigma_a \Lambda_n \mu_n,$$

where Λ_{n}, Λ_{t}, and Λ_{s} are the precision matrices corresponding to the neural network, temporal, and spatial potentials in *q*(*y*^{h}, *y*^{v}|*x, β*_{q}); these have the form

$$\Lambda_n = 2 w_{n,q} I, \qquad \Lambda_t = 2 w_t L_t, \qquad \Lambda_s = 2 w_s L_s,$$

where *L*_{t} and *L*_{s} denote the graph Laplacians of the temporal chain and of the spatial edge set *ε*, respectively. Thus the mean and covariance of the variational distribution *q*(*y*^{h}, *y*^{v} |*x, β*_{q}) are *μ*_{a} and Σ_{a}, where *μ*_{a} is a function of *γ, w*_{n,q}, *w*_{t}, and *w*_{s}, and Σ_{a} is a function of *w*_{n,q}, *w*_{t}, and *w*_{s}.
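As a toy illustration of this structure, the sketch below assembles a precision matrix that combines independent per-target terms with temporal-chain and spatial-edge graph Laplacians, and checks that it is block tridiagonal in time. The 1D coordinates, sizes, and weights are illustrative assumptions, not the paper's exact matrices:

```python
import numpy as np

def chain_laplacian(T):
    """Graph Laplacian of a length-T chain (temporal smoothness)."""
    L = np.zeros((T, T))
    for t in range(T - 1):
        L[t, t] += 1; L[t + 1, t + 1] += 1
        L[t, t + 1] -= 1; L[t + 1, t] -= 1
    return L

def edge_laplacian(J, edges):
    """Graph Laplacian of the spatial edge set over J targets (one frame)."""
    L = np.zeros((J, J))
    for i, j in edges:
        L[i, i] += 1; L[j, j] += 1
        L[i, j] -= 1; L[j, i] -= 1
    return L

T, J, edges = 6, 3, [(0, 1), (1, 2)]
w_nq, w_t, w_s = 1.0, 0.1, 0.05
# Precision = 2 * (w_nq * I + w_t * L_t (x) I_J + w_s * I_T (x) L_s),
# with y vectorized frame-major (J targets per frame).
Lam = 2 * (w_nq * np.eye(T * J)
           + w_t * np.kron(chain_laplacian(T), np.eye(J))
           + w_s * np.kron(np.eye(T), edge_laplacian(J, edges)))
# Only the diagonal and first off-diagonal J x J blocks are nonzero.
for a in range(T):
    for b in range(T):
        if abs(a - b) > 1:
            assert np.all(Lam[a*J:(a+1)*J, b*J:(b+1)*J] == 0)
```

The block-tridiagonal band is exactly what makes the message-passing computations discussed in S2.3 efficient.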

Let *P*_{h} and *P*_{v} denote the permutation matrices that map the vector *y* to *y*^{h} and *y*^{v} respectively, i.e.

$$y^h = P_h y, \qquad y^v = P_v y.$$

Due to the Gaussianity of the joint distribution, we can write down the closed-form expression for *q*(*y*^{h} | *y*^{v}, *x, β*_{q}) as

$$q(y^h \mid y^v, x, \beta_q) = \mathcal{N}\big(y^h;\, \mu_{h|v}, \Sigma_{h|v}\big),$$

where

$$\mu_{h|v} = P_h \mu_a + \big(P_h \Sigma_a P_v^\top\big)\big(P_v \Sigma_a P_v^\top\big)^{-1}\big(y^v - P_v \mu_a\big), \qquad \Sigma_{h|v} = P_h \Sigma_a P_h^\top - \big(P_h \Sigma_a P_v^\top\big)\big(P_v \Sigma_a P_v^\top\big)^{-1}\big(P_v \Sigma_a P_h^\top\big).$$
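This conditioning step is the standard Gaussian formula. The NumPy sketch below builds selector matrices for a randomly chosen hidden/visible split (a toy setup, not the paper's data) and verifies a basic sanity check: when the visible markers sit exactly at their marginal means, the conditional mean of the hidden markers equals their marginal mean:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8                                  # total number of target coordinates
A = rng.normal(size=(D, D))
Sigma_a = A @ A.T + D * np.eye(D)      # a valid (positive definite) covariance
mu_a = rng.normal(size=D)

idx = rng.permutation(D)
hid, vis = idx[:5], idx[5:]            # which coordinates are hidden/visible
P_h = np.eye(D)[hid]                   # selector ("permutation") matrices
P_v = np.eye(D)[vis]

def condition(y_v):
    """Conditional mean and covariance of y^h given y^v under N(mu_a, Sigma_a)."""
    S_vv = P_v @ Sigma_a @ P_v.T
    S_hv = P_h @ Sigma_a @ P_v.T
    K = S_hv @ np.linalg.inv(S_vv)     # Gaussian gain matrix
    mu = P_h @ mu_a + K @ (y_v - P_v @ mu_a)
    Sig = P_h @ Sigma_a @ P_h.T - K @ S_hv.T
    return mu, Sig

mu_c, Sig_c = condition(P_v @ mu_a)    # observe the visible means themselves
assert np.allclose(mu_c, P_h @ mu_a)
assert np.all(np.linalg.eigvalsh(Sig_c) > -1e-9)  # conditional cov is PSD
```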

#### S2.2.1 Evidence Lower Bound (ELBO)

Given the approximate posterior (eq. S17), and abbreviating *q*(*y*^{h}) = *q*(*y*^{h} |*y*^{v}, *x, β*_{q}), we can now write down the evidence lower bound (ELBO) as

$$\log p(y^v \mid x, \beta) \geq \mathbb{E}_{q(y^h)}\big[ \log p(y^h, y^v \mid x, \beta) \big] + H[q(y^h)] = -w_n \sum_{(t,j)\in\mathcal{V}} \phi_n(y_{t,j}, x_t) - w_n \sum_{(t,j)\in\mathcal{H}} \mathbb{E}_{q}\big[\phi_n(y_{t,j}, x_t)\big] - w_t \sum_{t,j} \mathbb{E}_{q}\big[\phi_t(y_{t,j}, y_{t+1,j})\big] - w_s \sum_{t,(i,j)\in\varepsilon} \mathbb{E}_{q}\big[\phi_s(y_{t,i}, y_{t,j})\big] - \log Z(x, \beta) + H[q(y^h)],$$

where *𝒱* and *ℋ* denote the sets of visible targets in visible frames and hidden targets in all frames respectively (expectations over visible targets reduce to point evaluations).

#### S2.2.2 Semi-supervised DLC

To understand the various terms in the ELBO above it is helpful to start with a simpler special case. If we turn off the temporal and spatial potentials in eq. S20 (i.e., set *w*_{t} = *w*_{s} = 0) we arrive at the DLC-semi model discussed in the Results section. The corresponding ELBO is

$$\mathrm{ELBO}_{\text{semi}} = -w_n \sum_{(t,j)\in\mathcal{V}} \phi_n(y_{t,j}, x_t) - w_n \sum_{(t,j)\in\mathcal{H}} \mathbb{E}_{q(y^h)}\big[\phi_n(y_{t,j}, x_t)\big] - \log Z(x, \beta) + H[q(y^h)],$$

where *q*(*y*^{h}) now factorizes into independent Gaussians over targets and frames, with mean *μ*_{h} = *P*_{h}*μ*_{n}. The first term is a conventional DLC-type cross entropy for labeled frames. The second term is a semi-supervised cross entropy for unlabeled frames: instead of the true marker locations for unobserved frames, we construct the Gaussian bump using the 2D locations output by the neural net. This second term encourages the confidence map to be unimodal, to match the Gaussian approximate posterior. This semi-supervised term leads to better performance of DLC-semi compared to the original fully supervised DLC (Figure 3).
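A minimal sketch of the semi-supervised term (NumPy, with a hypothetical confidence map): for an unlabeled frame the target bump is placed at the network's own predicted location, so the loss pushes the map toward a unimodal Gaussian. In the model the pseudo-location comes from the variational mean; this sketch uses the map's argmax as a simple stand-in:

```python
import numpy as np

def bump(center, shape, l2=1.0):
    """Unnormalized 2D Gaussian bump centered at (row, col)."""
    m, n = np.mgrid[0:shape[0], 0:shape[1]]
    return np.exp(-((m - center[0]) ** 2 + (n - center[1]) ** 2) / (2 * l2))

def sigmoid_ce(f, target):
    """Summed sigmoid cross entropy between sigmoid(f) and a target map."""
    return np.sum(f + np.logaddexp(0.0, -f) - target * f)

def dlc_semi_loss(f, y_label=None):
    """Labeled frames use the human label; unlabeled frames use the map's
    own argmax as a pseudo-location (the semi-supervised term)."""
    if y_label is None:
        y_label = np.unravel_index(np.argmax(f), f.shape)
    return sigmoid_ce(f, bump(y_label, f.shape))

f = -3.0 * np.ones((24, 24))
f[5, 7] = 4.0   # unimodal map
g = f.copy()
g[18, 2] = 4.0  # bimodal map: a second, spurious peak
# The unsupervised term penalizes the bimodal map more heavily.
assert dlc_semi_loss(f) < dlc_semi_loss(g)
```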

The log normalization term log *Z*(*x, β*) deserves a bit of additional explanation. In this DLC-semi model the graphical model factorizes over markers *j* and frames *t*, which means that we can calculate log *Z*(*x, β*) directly by summing over pixels for each *t* and *j*; in the full non-factorized DGP model, this term is not directly tractable. However, in practice, dropping the log *Z* term from the optimization in the DLC-semi model did not affect the results significantly. Therefore we also dropped log *Z* from the optimization in the full DGP model.

### S2.3 Implementation details

The bottlenecks of the ELBO computation in the full DGP model are log |Σ_{a}| and diag(Σ_{a}), where Σ_{a} *∈* ℝ^{TJ×TJ} is a block tridiagonal matrix. Both terms can be computed via message passing with *O*(*TJ* ^{3}) time complexity, due to the chain structure of the graphical model (and the corresponding block tridiagonal structure of the precision matrix). We used standard message passing algorithms to handle the required block tridiagonal matrix computations [46, 47, 48].
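For instance, the log determinant of a block tridiagonal matrix can be accumulated in a single forward sweep of Schur complements (the same recursion underlying Kalman-style message passing), in *O*(*TJ*^{3}) rather than *O*((*TJ*)^{3}) time. A NumPy sketch on random diagonally dominant blocks (toy values, not the model's matrices):

```python
import numpy as np

def blocktridiag_logdet(D, O):
    """log|M| for a block tridiagonal matrix with diagonal blocks D[0..T-1]
    (each J x J) and lower off-diagonal blocks O[0..T-2] (block (t+1, t)).

    Forward recursion: S_0 = D_0, S_t = D_t - O_{t-1} S_{t-1}^{-1} O_{t-1}^T;
    then log|M| = sum_t log|S_t|."""
    logdet, S = 0.0, D[0]
    for t in range(len(D)):
        if t > 0:
            S = D[t] - O[t - 1] @ np.linalg.solve(S, O[t - 1].T)
        logdet += np.linalg.slogdet(S)[1]
    return logdet

rng = np.random.default_rng(2)
T, J = 20, 4
D = [np.eye(J) * 5 + rng.normal(size=(J, J)) * 0.1 for _ in range(T)]
D = [0.5 * (d + d.T) for d in D]            # symmetrize diagonal blocks
O = [rng.normal(size=(J, J)) * 0.1 for _ in range(T - 1)]

# Assemble the dense matrix and compare against the O(T J^3) recursion.
full = np.zeros((T * J, T * J))
for t in range(T):
    full[t*J:(t+1)*J, t*J:(t+1)*J] = D[t]
    if t < T - 1:
        full[(t+1)*J:(t+2)*J, t*J:(t+1)*J] = O[t]
        full[t*J:(t+1)*J, (t+1)*J:(t+2)*J] = O[t].T
assert np.isclose(blocktridiag_logdet(D, O), np.linalg.slogdet(full)[1])
```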

The unknown parameters in the ELBO are {*θ, γ, w*_{n}, *w*_{n,q}, *w*_{t}, *w*_{s}}. As stated above, in practice we fixed *w*_{t} to a small constant value 0.1, and used *w*_{s} = *c/d*_{ij}^{2}, where *d*_{ij} is the average distance (in pixels) between targets *i* and *j* and *c >* 0 is a small scalar (independent of dataset and target indices *i, j*); this led to robust results without any need to fit extra parameters. We also split *w*_{n} into separate values for visible and hidden frames. Empirically, setting the hidden-frame weight smaller than the visible-frame weight by a factor depending on *T*_{v} (the number of visible frames) led to good results; this upweights the strength of labeled frames relative to unobserved frames. Moreover, *θ* and *γ* govern the neural network potential and the variational approximation, respectively. We found that setting *γ* = *θ* led to good results and reduced the number of parameters. Therefore, we only optimized the ELBO w.r.t. {*θ, w*_{n,q}}, using coordinate ascent. We calculated the gradients for *w*_{n,q} and *θ* using standard automatic differentiation tools, and performed standard stochastic gradient updates or quasi-Newton methods to estimate these parameters.

In the experiments, we compared DLC with our DGP model. We used the publicly available DLC implementation. We ran DLC with learning rate 0.002 for 50k iterations using stochastic gradient descent (SGD) with batch size 1. Note that DLC used only the small number of labeled training frames. We initialized our DGP model with the DLC network parameters estimated after 5k iterations, and ran the model on all frames using limited-memory BFGS (L-BFGS). The public DLC implementation trains a 2048-channel ResNet-50 followed by a one-layer ConvNet. We reduced the number of channels to 200 and froze the ResNet weights (avoiding training its large number of parameters), and still obtained good performance.

## S3 Conditional convolutional autoencoder

### S3.1 Implementation details

We fit conditional convolutional autoencoders (conditional CAEs) on 192×192 grayscale images from [23]. In addition, we used 4 markers output by DLC/DGP: left paw, right paw, tongue, and nose. To condition the encoder network on these values we turned each marker into a one-hot 2D array and concatenated these with the corresponding frame, so that the input to the encoder was of size (192, 192, 5). To condition the decoder network on these values we first centered the marker values by subtracting their median (computed over the entire dataset) and then concatenated these values to the latents before feeding them into the decoder. See Table S1 for network architecture details. We trained the autoencoders by minimizing the MSE between original and reconstructed frames using the Adam optimizer [49] with a learning rate of 10^{−4}, a batch size of 100, and no regularization. Models were trained for 300 epochs.
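The encoder input construction described above can be sketched as follows (NumPy; the frame size and marker names match the text, while the coordinates are made up for illustration):

```python
import numpy as np

H = W = 192
markers = {"left_paw": (50, 60), "right_paw": (55, 130),
           "tongue": (100, 96), "nose": (80, 96)}  # (row, col), made up

def encoder_input(frame, markers):
    """Stack the grayscale frame with one one-hot 2D map per marker."""
    maps = []
    for (r, c) in markers.values():
        onehot = np.zeros((H, W), dtype=np.float32)
        onehot[r, c] = 1.0
        maps.append(onehot)
    return np.stack([frame] + maps, axis=-1)  # shape (192, 192, 5)

frame = np.zeros((H, W), dtype=np.float32)
x = encoder_input(frame, markers)
assert x.shape == (192, 192, 5)
assert x[..., 1:].sum() == 4.0   # exactly one "hot" pixel per marker map
```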

### S3.2 Disentangling analysis

The disentangling analyses presented in Figure 5 require fixing some inputs to the network while varying others. Below we describe this manipulation in more detail. We performed these manipulations on 2-latent networks to make visualization in the latent space easier.

#### Manipulating markers

We chose a random test frame and varied the x/y coordinates for a specific marker (left paw). The limits of the x/y values were the 10^{th} (minimum) and 90^{th} (maximum) percentiles of the DGP outputs on the test set for the specified marker. We used the same limits for the DLC and DGP networks, in order to make the comparison more direct. After choosing x/y values for the specified marker we converted these into a one-hot 2D array, as with the other (unchanged) markers from the chosen frame. These one-hot 2D arrays were concatenated with the original frame and then fed into the CAE encoder to produce the latents. The latents were then concatenated with the median-subtracted marker values (one of which is being changed, while the rest stay the same). This vector was then pushed through the decoder network to produce the reconstructions.

Note that in this conditional architecture the latents themselves are marker-dependent, so are not truly held fixed. We also fit conditional CAE architectures where just the decoder was conditioned on the markers, and the encoder only used the frame as input. We found the results from the disentangling analysis to be qualitatively similar, though reconstructions generally looked cleaner with the architectures that incorporated conditioning in both the encoder and decoder networks (data not shown).

#### Manipulating latents

We chose a random test frame and this time varied the latents while keeping the marker values fixed. Similar to above, we used the 10^{th} (minimum) and 90^{th} (maximum) percentiles of the latents on the test set as limits (this time allowing different limits for each DLC/DGP architecture). We then concatenated the new latent values with the marker values from the original frame, and pushed this vector through the decoder network to produce the reconstructions.

#### Quantifying disentanglement

To quantify the disentanglement results from Figure 5 (center, right panels), we chose the left paw as a (tracked) target of interest that should ideally not undergo large changes when manipulating the latents. If disentanglement is high (which we desire), the differences between the generated paw and the original paw should be small. For each image generated from the latent manipulation, we take a small crop around the original location of the left paw and compute the MSE between this generated paw and the original paw.
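The crop-based MSE metric can be sketched as below (NumPy; the crop size is a hypothetical choice, not specified in the text):

```python
import numpy as np

def crop(img, center, half=8):
    """Small square crop around (row, col); assumes the crop stays in bounds."""
    r, c = center
    return img[r - half:r + half, c - half:c + half]

def paw_mse(generated, original, paw_loc, half=8):
    """MSE between generated and original frames in a crop around the paw."""
    a, b = crop(generated, paw_loc, half), crop(original, paw_loc, half)
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(3)
original = rng.random((192, 192))
assert paw_mse(original, original, (96, 96)) == 0.0      # identical frames
shifted = np.roll(original, 5, axis=1)                   # paw appears to move
assert paw_mse(shifted, original, (96, 96)) > 0.0
```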

We repeat this process for an unbiased sample of test frames. To obtain these frames, we performed k-means clustering on the unconditional CAE latents. From each of 64 clusters we take the frame closest to the cluster centroid (Figure S1A), and generate frames by evenly sampling the latent space (4 grid points along each of 2 dimensions, for a total of 16 generated frames). We compute the MSE in crops around the labeled paw position as described above (Figure S1B, C). We find that on average the CAE-DGP networks have lower MSE, indicating that these disentanglement results generalize to many other paw positions (and therefore marker values) found in this dataset.

## S4 Results on additional datasets

We first show the comparison between DLC and DGP on the mouse-reach, fly-run, twomice-top-down, and fish-swim datasets (Table S2) in Figures S2-S5. In each case, results were similar to those seen in Figure 2.

The remaining figures (S6-S10) show the same datasets, but with additional panels and traces illustrating the output of the intermediate models DLC-semi and DGP-NN, along with a reimplementation of DLC (labeled as DLC-ours) using the same optimizer, 200 ResNet channels, etc., as in our implementation of DGP. (We include these comparisons to rule out minor differences caused by these distinct implementation details.) Results are consistent with Figure 3: DGP-NN and DGP achieve similar performance across all five datasets; each tends to outperform DLC-semi, which in turn outperforms either implementation of DLC. Full videos are also available online.

## Footnotes

aw3236{at}columbia.edu, ekb2154{at}columbia.edu, mw3323{at}columbia.edu, er2934{at}columbia.edu, cpe2108{at}columbia.edu, aln2128{at}columbia.edu, ess2129{at}columbia.edu, nm2786{at}columbia.edu, cds2005{at}columbia.edu, ab4463{at}columbia.edu, jpc2181{at}columbia.edu, lmp2107{at}columbia.edu, Michael.Schartner{at}unige.ch, guido.meijer{at}research.fchampalimaud.org, jpn5{at}nyu.edu, da93{at}nyu.edu, info{at}internationalbrainlab.org


^{2}This exhaustive labeling was labor-intensive and we have not yet performed the same analysis for the other datasets in Table 1. As is visible in the appendix figures, our qualitative results are similar across all the datasets analyzed here; we plan to perform more exhaustive comparisons on other datasets in the future.