Anatomically-based skeleton kinetics and pose estimation in freely-moving rodents

Forming a complete picture of the relationship between neural activity and body kinetics requires quantification of skeletal joint biomechanics during behavior. However, without detailed knowledge of the underlying skeletal motion, inferring joint kinetics from surface tracking approaches is difficult, especially for animals where the relationship between surface anatomy and skeleton changes during motion. Here we developed a videography-based method enabling detailed three-dimensional kinetic quantification of an anatomically defined skeleton in untethered freely-behaving animals. This skeleton-based model was constrained by anatomical principles and joint motion limits, and provided skeletal pose estimates for a range of rodent sizes, even when limbs were occluded. Model-inferred joint kinetics for both gait and gap-crossing behaviors were verified by direct measurement of limb placement, showing that complex decision-making behaviors can be accurately reconstructed at the level of skeletal kinetics using our anatomically constrained model.


The relationship between neural activity patterns and body motion is complex, as neuronal activity patterns depend on factors such as the intended behavioral outcome [1], task familiarity [2], and changes in the environment, but also on exact limb trajectories [3-5] and motion kinetics [6]. Much of the motion kinematic data forming our view of the sensorimotor control of movement was collected during short behavioral epochs where the animal was in various forms of restraint [3-8]

To reconstruct behavioral sequences using the ACM, we first tracked 2D surface marker locations in the recorded movies using DeepLabCut [24], which is specifically designed for surface landmark detection in laboratory animals. As the ACM contained both joint angle limits and temporal constraints, we evaluated the role of these by reconstructing poses without either the joint angle

The robustness and accuracy of limb tracking were even more apparent when analyzing joint velocities (Fig. 3e-g), joint angles (Fig. 3h-j), and joint angular velocities (Fig. 3k-m), as traces generated without the ACM constraints were dominated by noise in individual examples (Fig. 3f,i,l, lower), and the cyclic nature of gait was less prominent when compared to traces obtained from the ACM (Fig. 3f,i,l, top). Consistent with this, ACM averaged traces (Fig. 3g

We next used the ACM to analyze motion kinetics and segment a more complex decision-making behavior, the gap-crossing task, in which the distance between two separate platforms is changed, forcing the animal to re-estimate the distance to jump on each trial (Fig. 4a).

As rats jumped stereotypically, we next tested whether the jump-related pattern of movements could be analyzed using the ACM to objectively define decision points in the behavior, such as the time of jump, from each individual trial. The changes in joint angles in the spine segments and hind limbs around the time of the jump were highly consistent. Averaging these joint angles to give an averaged joint-angle trace provided a metric with a global minimum (Fig. 4c) during the jump that was independent of whether the animal crossed the gap immediately, paused and waited at the track edge, or reached across the gap (Supplementary Fig. 11). This approach enabled objective identification of jump start-, mid- and end-points from each individual jump.

Traces of joint angles averaged across joints and trials (Fig. 4d) and average ACM poses (Fig. 4e) illustrate the consistency of the pose changes through the jump. We next used this to quantify relationships between joints and changes in joint kinetics throughout a jump sequence. Auto-correlations for spatial and angular limb velocities allowed quantification of the interdependency of joint movements at any point within the jumping behavior, for example at the start-point of a jump (Fig. 4f,g). This revealed relationships such as a significant correlation between the spatial velocity of the right elbow and wrist joints (Fig. 4h, left; Spearman correlation coefficient: 0.95, two-tailed p-value testing non-correlation: 5.40x10^-24), as well as joint interactions across the midline, such as a significant correlation between the spatial velocity of the right and left knee joints (Fig. 4h, right; Spearman correlation coefficient: 0.93, two-tailed p-value testing non-correlation: 6.79x10^-

In addition, we generated ground truth data to quantify both the accuracy of the algorithm used to

Lastly, the ACM remained accurate over a large range of animal sizes (72 g - 735 g), with the expectation that the ACM approach would also work for smaller rodents, such as mice. Our

Besides the world coordinate system, each edge also had its own right-handed coordinate system located at the start vertex of the corresponding edge, e.g. the coordinate system of the edge

The only exception to this was the upper bound of the left-sided surface marker on the shoulder in the z-direction for the two large animals (animals #5 and #6), which was also set to 0 in order to prevent the bone lengths of the collarbones from becoming zero during learning.

Comparison of skeleton parameters with MRI data. To estimate the quality of these skeleton parameters, we aligned learned 3D skeleton models to manually labeled 3D locations of surface markers obtained from an MRI scan for each animal (Fig. 1b, bottom). To determine the 3D

Fig. 7 | Averaged traces from the model not constrained by either temporal or joint limit constraints, as in Figure 3d,g,j,m (right), aligned to velocity peaks for the right ankle (left column), left wrist (center column) and right wrist joint (right column), instead of aligning to the velocity peak of the left ankle joint.

Parameterizing rotations
We choose to parameterize rotations with Rodrigues vectors as they are well suited for the description of bone rotations with three rotational degrees of freedom [10]. A Rodrigues vector r is formed by combining the axis of rotation ω ∈ R^3 and the rotation angle θ ∈ R:

r = θω,

where ||ω|| = 1. To calculate the associated rotation matrix R from a given Rodrigues vector r we can use the following function:

f_r→R(r) = I + sin(θ) ω̂ + (1 - cos(θ)) ω̂^2,

where I ∈ R^3x3 is the identity matrix and ω̂ ∈ R^3x3 is the skew-symmetric cross-product matrix of ω:

ω̂ = [[0, -ω_3, ω_2], [ω_3, 0, -ω_1], [-ω_2, ω_1, 0]].
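As a minimal sketch of this parameterization (NumPy; the helper name `rodrigues_to_matrix` is illustrative, not the authors' code), the rotation matrix follows from the axis-angle decomposition of r:

```python
import numpy as np

def rodrigues_to_matrix(r):
    """Rotation matrix from a Rodrigues vector r = theta * omega, ||omega|| = 1."""
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        return np.eye(3)                     # zero rotation: identity
    w = r / theta                            # unit rotation axis omega
    w_hat = np.array([[0.0, -w[2], w[1]],    # skew-symmetric cross-product matrix
                      [w[2], 0.0, -w[0]],
                      [-w[1], w[0], 0.0]])
    # f_{r->R}(r) = I + sin(theta) w_hat + (1 - cos(theta)) w_hat^2
    return np.eye(3) + np.sin(theta) * w_hat + (1.0 - np.cos(theta)) * (w_hat @ w_hat)

# 90 degrees about the z-axis maps the x-axis onto the y-axis
R = rodrigues_to_matrix(np.array([0.0, 0.0, np.pi / 2]))
print(np.round(R @ np.array([1.0, 0.0, 0.0]), 6))  # → [0. 1. 0.]
```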

Camera calibration

Pinhole camera model
To project an arbitrary three-dimensional joint or surface marker location m_3D ∈ R^3 onto a camera sensor to obtain the corresponding two-dimensional data point m_2D ∈ R^2, we use a pinhole camera model [6], which gives the following relationship between the two:

m_2D = Ã (f_distort(f_r→R(r̃) m_3D + t̃)^T, 1)^T,

where r̃ ∈ R^3 is the Rodrigues vector and t̃ ∈ R^3 the translation vector of the respective camera, such that the expression f_r→R(r̃) m_3D + t̃ maps m_3D from the world coordinate system into the coordinate system of the camera. Given the camera's distortion vector k̃ ∈ R^2, the function f_distort applies radial distortions according to

f_distort(y) = (1 + k̃_1 ||c||^2 + k̃_2 ||c||^4) c,

with y = (y_1, y_2, y_3)^T and c = (y_1/y_3, y_2/y_3)^T. The final mapping onto the two-dimensional camera sensor is done using the camera matrix Ã ∈ R^2x3, given by

Ã = [[Ã_11, 0, Ã_13], [0, Ã_22, Ã_23]],

where Ã_11 and Ã_22 are the focal lengths and Ã_13 and Ã_23 are the x- and y-location of the camera's optical center.
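The projection chain described above (world-to-camera transform, perspective division, radial distortion, camera matrix) can be sketched as follows; the helper names and the zero-distortion example are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def rodrigues_to_matrix(r):
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        return np.eye(3)
    w = r / theta
    W = np.array([[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]])
    return np.eye(3) + np.sin(theta) * W + (1.0 - np.cos(theta)) * (W @ W)

def project_point(m3d, r_cam, t_cam, k_cam, A_cam):
    """Pinhole projection with two-parameter radial distortion."""
    y = rodrigues_to_matrix(r_cam) @ m3d + t_cam   # world -> camera coordinates
    c = y[:2] / y[2]                               # perspective division
    r2 = c @ c                                     # squared radial distance
    c_dist = (1.0 + k_cam[0] * r2 + k_cam[1] * r2**2) * c  # radial distortion
    return A_cam @ np.append(c_dist, 1.0)          # focal lengths and optical center

# Camera at the origin looking down +z, focal length 1000 px, center (640, 512)
A = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 512.0]])
m2d = project_point(np.array([0.0, 0.0, 2.0]), np.zeros(3), np.zeros(3), np.zeros(2), A)
print(m2d)  # → [640. 512.]
```

A point on the optical axis lands exactly on the optical center, which makes the example easy to check by hand.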

Calibration of multiple cameras
Given a multi-camera setup with several cameras and overlapping fields of view, we need to infer the initially unknown location and camera parameters of every individual camera in the setup, as this allows us to predict where a three-dimensional point in space will appear on each camera sensor. This can be achieved by recording a sequence of images showing an object whose physical structure and dimensions are known. The images are taken synchronously in all cameras, such that the spatial location and orientation of the shown object is identical within a given set of images at a certain time point. For this purpose checkerboards are well-suited objects, as the edges of individual tiles can be detected automatically in recorded image frames and the description of their spatial structure requires only a single parameter, i.e. the side length of a square tile. Given a multi-camera setup with n_cam cameras and n_time time points at which we used each camera to record images showing a checkerboard with a total of n_edge detectable edges, we can calibrate the setup by minimizing a respective objective function via gradient-based optimization using the Trust Region Reflective algorithm [2]:

arg min over {r̃_i, t̃_i, k̃_i, Ã_i, r̃_τ, t̃_τ} of Σ_τ=1..n_time Σ_i=1..n_cam Σ_j=1..n_edge δ_τij ||m_τij - m̃_τij||_2^2,

where m̃_τij is the projection of checkerboard edge j, posed according to r̃_τ and t̃_τ, into camera i; r̃_i is the Rodrigues vector, t̃_i is the translation vector, k̃_i is the distortion vector and Ã_i is the camera matrix of camera i. The Rodrigues vector r̃_τ and the translation vector t̃_τ encode the orientation and translation of the checkerboard at time point τ. Since the checkerboard is a planar object, each edge j is given by a three-dimensional point m̃_j = c_tile (x_j, y_j, 0)^T with the known length of a single tile c_tile and x_j ∈ N as well as y_j ∈ N. Furthermore, the two-dimensional edge j in camera i at time point τ is denoted as m_τij ∈ R^2 and the delta function δ_τij indicates whether this edge was detected successfully, i.e. δ_τij = 1, or not, i.e. δ_τij = 0.
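A reduced toy version of this calibration idea, assuming a single distortion-free camera with fixed known extrinsics so that only the focal lengths and optical center are fit with SciPy's Trust Region Reflective solver (`method="trf"`), might look like:

```python
import numpy as np
from scipy.optimize import least_squares

# Synthetic planar checkerboard: edges m_j = c_tile * (x_j, y_j, 0), c_tile = 2 cm
c_tile = 2.0
grid = np.array([[x, y, 0.0] for x in range(5) for y in range(4)]) * c_tile

def project(params, pts):
    """Distortion-free pinhole camera with fixed, known extrinsics:
    the board sits 50 cm in front of the camera along the optical axis."""
    fx, fy, cx, cy = params
    cam = pts + np.array([0.0, 0.0, 50.0])
    c = cam[:, :2] / cam[:, 2:3]              # perspective division
    return np.stack([fx * c[:, 0] + cx, fy * c[:, 1] + cy], axis=1)

true_params = np.array([1000.0, 1000.0, 640.0, 512.0])
observed = project(true_params, grid)         # simulated detected edges
delta = np.ones(len(grid), dtype=bool)        # detection mask delta_ij (all found)

def residuals(params):
    return (project(params, grid) - observed)[delta].ravel()

fit = least_squares(residuals, x0=np.array([800.0, 800.0, 600.0, 500.0]), method="trf")
print(np.round(fit.x))  # → [1000. 1000.  640.  512.]
```

The full objective additionally optimizes extrinsics, distortion, and the board pose per time point; the δ mask shows how undetected edges drop out of the residual.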

Skeleton model

Modifying the skeleton model to obtain new poses
Given a three-dimensional skeleton model, we need to adjust joint locations by rotating each bone of the model, such that the resulting three-dimensional positions of rigidly attached surface markers match the respective two-dimensional locations in our video data. Assuming our skeleton model has a total of n_bone bones and n_marker surface markers, we want to generate the three-dimensional locations of the joints p ∈ R^n_bone x3 and surface markers m ∈ R^n_marker x3, which can be obtained according to Algorithm 1.

Algorithm 1
1: function f_pose(t, r, l, v)
2: for i ∈ {1, ..., n_bone} do
3:   for j ∈ {bones below bone i in the skeleton graph, including i} do
4:     R_j ← f_r→R(r_i)^T R_j   ▷ Update rotation of bone j
5: for j ∈ {1, ..., n_bone} do
6:   R_j ← R_j^T R̄_j   ▷ Apply bone rotation R_j to resting pose R̄_j
7: p_1_0 ← t   ▷ Initialize root joint location p_1_0
8: for j ∈ {1, ..., n_bone} do
9:   p_j_1 ← p_j_0 + (R_j,13, R_j,23, R_j,33)^T l_j   ▷ Calculate end joint location p_j_1 of bone j

The surface marker locations m then follow from the joint locations and the relative marker positions v. Here, it is assumed that the set {1, ..., n_bone} is sorted such that one iterates through the skeleton graph beginning with the bone whose start joint is the root joint 1_0 and then proceeds with the bones further down the skeleton graph. Thus, it is always guaranteed that for j > i, the start joint i_0 of bone i is never a child of the start joint j_0 of bone j. It is also assumed that the bone coordinate systems of the skeleton model are constructed such that their z-directions encode the directions in which the respective bones are pointing. Furthermore, the global translation vector t ∈ R^3 corresponds to the three-dimensional location of the skeleton's root joint, the rows of the tensor r ∈ R^n_bone x3 contain Rodrigues vectors encoding the bone rotations, the vector l ∈ R^n_bone contains the bone lengths, and the rows of the tensor v ∈ R^n_marker x3 contain the relative marker locations, i.e. the locations of the markers when the positions of the attached joints are assumed to be the origin. The resting pose R̄ ∈ R^n_bone x3x3 of the animal describes the orientation of the bones when no additional rotations are applied, i.e. r_i = (0, 0, 0)^T ∀ i ∈ {1, ..., n_bone}. Here, the frequent usage of the transpose operation allows us to first rotate the bones closest to the leaf joints of the skeleton graph [4]. This has the advantage that we can enforce constraints on bone rotations with reference to a global coordinate system that corresponds to the three main axes of the animal's body. Assume we only model a single front limb with rotations around the shoulder, elbow and wrist, i.e. R_shoulder, R_elbow and R_wrist, and would like to obtain the new orientation R_new of the bone whose start joint is the animal's wrist, given its resting pose R̄_wrist, while iterating through the skeleton graph starting from the root joint, i.e. the shoulder. Then we can obtain R_new according to

R_new = (R_wrist^T R_elbow^T R_shoulder^T)^T R̄_wrist = R_shoulder R_elbow R_wrist R̄_wrist.

Thus, we can iterate through the skeleton graph from the root to the leaf joints but actually apply the respective bone rotations in the reversed order.
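A simplified forward-kinematics sketch in the spirit of Algorithm 1 is shown below; note that it composes each bone's rotation relative to its parent rather than using the paper's transpose-based global scheme, and all names are illustrative:

```python
import numpy as np

def rodrigues_to_matrix(r):
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        return np.eye(3)
    w = r / theta
    W = np.array([[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]])
    return np.eye(3) + np.sin(theta) * W + (1.0 - np.cos(theta)) * (W @ W)

def forward_kinematics(t_root, rodrigues, lengths, parents):
    """Joint positions for a bone tree; bone j points along the z-axis of its
    coordinate system, rotated by its own and all ancestor rotations.
    parents[j] is the index of bone j's parent bone (-1 for the root bone)."""
    n = len(lengths)
    R = [None] * n
    p = np.zeros((n + 1, 3))
    p[0] = t_root                                      # root joint location
    for j in range(n):                                 # assumes parents[j] < j
        R_local = rodrigues_to_matrix(rodrigues[j])
        R[j] = R_local if parents[j] < 0 else R[parents[j]] @ R_local
        start = 0 if parents[j] < 0 else parents[j] + 1
        p[j + 1] = p[start] + R[j][:, 2] * lengths[j]  # end joint = start + z-axis * l
    return p

# Two-bone chain: first bone along +z, second bent 90 degrees about the x-axis
p = forward_kinematics(np.zeros(3),
                       np.array([[0.0, 0.0, 0.0], [np.pi / 2, 0.0, 0.0]]),
                       np.array([1.0, 1.0]), parents=[-1, 0])
print(np.round(p, 6))  # joints at (0,0,0), (0,0,1), (0,-1,1)
```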

Inferring bone lengths and surface marker positions
Reconstructing poses for n_time time points can be achieved equivalently to the calibration of a multi-camera setup as discussed in Section 2.2, i.e. we need to minimize a respective objective function via gradient-based optimization using the L-BFGS-B algorithm [3]:

arg min over {t_τ, r_τ, l, v} ∀ τ ∈ {1, ..., n_time} of Σ_τ=1..n_time Σ_i=1..n_cam Σ_j=1..n_marker δ_τij ||m_τij - m̃_τij||_2^2,   (9)

where m_τij is the two-dimensional location of marker j in camera i at time point τ and δ_τij indicates whether this marker location was successfully detected, i.e. δ_τij = 1, or not, i.e. δ_τij = 0. The corresponding projected two-dimensional marker location m̃_τij can be obtained by propagating the absolute marker positions m calculated via Algorithm 1 through the projection function f_3D→2D of camera i:

m̃_τij = f_3D→2D(m_j(t_τ, r_τ, l, v)),   (10)

where t_τ ∈ R^3 and r_τ ∈ R^n_bone x3 denote the translation vector and the bone rotations at time point τ. Note that there is a set of pose-encoding parameters t_τ and r_τ for each time point τ, whereas the bone lengths l and the relative surface marker positions v, which encode the animal's skeletal structure and configuration, are shared across all time points. Thus, if we provide enough time points where the animal is visible in many different poses, which ideally cover the entire spectrum of the animal's behavioral space, we can not only reconstruct the pose of the animal for the given time points but are also able to learn the structure of the animal's skeleton by inferring the unknown parameters l and v.
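The key property of equation 9, per-time-point pose parameters alongside parameters shared across all time points, can be illustrated with a one-bone toy problem fit with L-BFGS-B (a hypothetical reduced example, not the paper's objective):

```python
import numpy as np
from scipy.optimize import minimize

# Toy analogue of equation 9: one bone of unknown length l observed as a 2D
# endpoint over T frames, with an unknown planar angle theta_t per frame.
rng = np.random.default_rng(0)
T, l_true = 20, 3.0
theta_true = rng.uniform(0.0, np.pi, T)
obs = l_true * np.stack([np.cos(theta_true), np.sin(theta_true)], axis=1)
obs += rng.normal(0.0, 0.01, obs.shape)            # detection noise

def objective(params):
    l, theta = params[0], params[1:]               # l shared, theta_t per frame
    pred = l * np.stack([np.cos(theta), np.sin(theta)], axis=1)
    return np.sum((pred - obs) ** 2)

x0 = np.concatenate([[1.0], np.full(T, 0.5)])
fit = minimize(objective, x0, method="L-BFGS-B")
print(round(float(fit.x[0]), 2))  # recovered shared bone length, close to 3.0
```

Just as in the paper's objective, adding more frames improves the estimate of the shared parameter without changing the per-frame parameter count per frame.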

Scaling of input and output variables
In general, we always scale the translation vector t and the bone rotations r as well as the resulting two-dimensional marker locations m̃, such that all of them roughly lie within the same range, i.e. [-1, 1]. Specifically, we define the normalization constants c_t = 50 cm and c_r = π/2 rad as well as c_1 = 640 px and c_2 = 512 px, which we use to normalize t and r as well as m̃. The choice of c_t was based on the dimensions of the largest arena we used in our experiments, where the maximum distance from the origin of the world coordinate system, located at the center of the arena, to an arena edge was around 50 cm. The choice of c_r was based on the maximum bone rotation of the naively constrained spine and tail joints in our skeleton model, which was equal to π/2 rad. The choices of c_1 and c_2 were based on the sensor sizes of the cameras we used in our experiments, which were all equal to 1280 x 1024 px^2. Using the normalization constants we obtain the normalized translation vector t* = t/c_t, the normalized bone rotations r* = r/c_r, and the normalized two-dimensional marker locations m̃* = (m̃_1/c_1, m̃_2/c_2)^T for a single two-dimensional marker location m̃ ∈ R^2, such that m̃_1 represents its x- and m̃_2 its y-coordinate. These normalized variables were used instead of their non-normalized counterparts in all depicted optimization and pose reconstruction steps.

Enforcing body symmetry
To improve the inference of bone lengths and surface marker positions we took advantage of the symmetry of an animal's body, i.e. for every left-sided limb there exists a corresponding limb on the right side. Furthermore, we also placed the surface markers onto the animal's fur such that the marker pattern itself was symmetrical, e.g. for a marker placed close to the left hip joint there was a corresponding marker on the right side of the animal. By incorporating this knowledge into Algorithm 1 we reduced the number of free parameters, i.e. we only optimized the reduced bone lengths l* ∈ R^n*_bone and relative marker positions v* ∈ R^n*_marker x3, where n*_bone is the number of asymmetrical bones, i.e. bones along the head, spine and tail, plus the number of limb bones on the animal's left side and, equivalently, n*_marker denotes the number of asymmetrical and left-sided markers. The excluded right-sided limb bones were then enforced to have the same lengths as the corresponding limb bones on the left side. Additionally, we applied the same concept to the relative marker locations by mirroring the x-component of the left-sided markers at the yz-plane to obtain the relative marker locations of the markers on the right side. To implement this we defined Algorithm 2, which maps the reduced bone lengths l* to the original parameter l.

Algorithm 2
1: function f_l*→l(l*)
2: c ← 1   ▷ Initialize counter c for right-sided bones
3: for i ∈ {1, ..., n*_bone} do
4:   l_i ← l*_i   ▷ Copy axial and left-sided bone lengths
5:   if i is left-sided limb bone then   ▷ Check if bone i belongs to a left-sided limb
6:     l_(n*_bone + c) ← l*_i   ▷ Right-sided bone reuses the corresponding left-sided length
7:     c ← c + 1   ▷ Increase counter c for right-sided bones
8: return l

Equivalently, we also defined the corresponding Algorithm 3, which maps the reduced relative marker positions v* to their original counterpart v.

Algorithm 3
1: function f_v*→v(v*)
2: c ← 1   ▷ Initialize counter c for right-sided markers
3: for j ∈ {1, ..., n*_marker} do
4:   v_j ← v*_j   ▷ Copy axial and left-sided relative marker positions
5:   if j is left-sided marker then   ▷ Check if marker j is on the left side
6:     v_(n*_marker + c) ← (-v*_j,1, v*_j,2, v*_j,3)^T   ▷ Mirror the x-component at the yz-plane
7:     c ← c + 1   ▷ Increase counter c for right-sided markers
8: return v

To learn the underlying three-dimensional skeleton model while also enforcing body symmetry, we then redefined m̃_τij from equation 10 by evaluating Algorithm 1 with the expanded parameters f_l*→l(l*) and f_v*→v(v*) in place of l and v, and minimized equation 9 with respect to the parameters l* and v* instead of l and v.
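A sketch of the symmetry expansion (hypothetical bone labels; for brevity the markers are keyed per bone here, unlike the paper's per-marker bookkeeping):

```python
import numpy as np

# Hypothetical labels: axial bones appear once, limb bones as left/right pairs.
bones = ["spine1", "spine2", "upper_left", "lower_left", "upper_right", "lower_right"]
mirror_of = {"upper_right": "upper_left", "lower_right": "lower_left"}
reduced = ["spine1", "spine2", "upper_left", "lower_left"]  # axial + left side only

def expand_lengths(l_reduced):
    """f_{l*->l}: right-sided bones reuse the length of their left counterpart."""
    lookup = dict(zip(reduced, l_reduced))
    return np.array([lookup[mirror_of.get(b, b)] for b in bones])

def expand_markers(v_reduced):
    """f_{v*->v}: mirror the x-component of left-sided relative marker
    positions at the yz-plane to obtain the right-sided ones."""
    lookup = dict(zip(reduced, v_reduced))
    out = []
    for b in bones:
        v = np.array(lookup[mirror_of.get(b, b)], dtype=float)
        if b in mirror_of:
            v[0] = -v[0]                     # flip x for the mirrored side
        out.append(v)
    return np.array(out)

print(expand_lengths([10.0, 8.0, 5.0, 4.0]))                            # → [10.  8.  5.  4.  5.  4.]
print(expand_markers([[1, 0, 0], [1, 0, 0], [2, 1, 1], [2, 1, 1]])[4])  # → [-2.  1.  1.]
```

Only the reduced parameters enter the optimizer; the expansion runs inside the objective, so symmetry is enforced by construction.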

Using a state space model to describe behavioral time series
To allow for probabilistic pose reconstruction of entire behavioral sequences of length T, which ensures that poses of consecutive time points are similar to each other, we deploy a state space model, given by a transition and an emission equation:

z_t = z_t-1 + ε_z,   (13)
x_t = g(z_t) + ε_x,   (14)

where at time point t ∈ {1, ..., T} the state variable z_t ∈ R^n_z encodes the position of the animal as well as the bone rotations, and the measurement variable x_t ∈ R^n_x represents the two-dimensional surface marker locations in all cameras, given by a trained neural network. Thus, the state variable z_t contains the global translation vector t as well as the pose-encoding tensor r for time point t, and the measurement variable x_t is a known quantity given for all time points t. The function g, given by Algorithm 4, computes the noise-free measurements of the two-dimensional surface marker locations x*_t given the state variable z_t. At this point the bone lengths l and relative marker locations v are already inferred and therefore given. The same applies to the Rodrigues vector r̃_i, the translation vector t̃_i, the distortion vector k̃_i and the camera matrix Ã_i of camera i, which we obtained from calibrating the multi-camera setup. The normalization constants c_t, c_r, c_1 and c_2 are the same as in Section 3.3. The probabilistic nature of the model is given by the two normally distributed random variables ε_z ∼ N(0, V_z) and ε_x ∼ N(0, V_x), simulating small pose changes over time and measurement noise, as well as the initial state z_0 ∼ N(μ_0, V_0), which is also assumed to be a normally distributed random variable. Thus, the state space model is entirely described by the model parameters Θ = {μ_0, V_0, V_z, V_x}. This allows for inferring a set of expected state variables z = {z_1, ..., z_T} given our measurements x = {x_1, ..., x_T}, provided we have a good estimate of the model parameters Θ.
Alternatively, we are also able to calculate a set of model parameters Θ, which, given an estimate for the state variables z, maximizes a lower bound of the model's evidence, i.e. the evidence lower bound (ELBO). The former is equivalent to the expectation step (E-step) of the expectation-maximization (EM) algorithm, which can be performed by applying the unscented Rauch-Tung-Striebel (RTS) smoother, whereas the latter is identical to the algorithm's maximization step (M-step), in which new model parameters are calculated in closed form to maximize the ELBO [8].
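A scalar stand-in for this generative model can be simulated directly; here g(z) = 2z is a hypothetical placeholder for the full projection of the posed skeleton into all cameras:

```python
import numpy as np

# Scalar stand-in for the state space model: z_t performs a Gaussian random
# walk (small pose changes) and is observed through a measurement function g
# with additive noise (marker detection).
rng = np.random.default_rng(1)
T, Vz, Vx = 100, 0.01, 0.04
mu0, V0 = 0.0, 0.01
z = rng.normal(mu0, np.sqrt(V0))                  # initial state z_0 ~ N(mu_0, V_0)
states, measurements = [], []
for _ in range(T):
    z = z + rng.normal(0.0, np.sqrt(Vz))          # transition equation
    x = 2.0 * z + rng.normal(0.0, np.sqrt(Vx))    # emission equation
    states.append(z)
    measurements.append(x)
print(len(states), len(measurements))  # → 100 100
```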
The marginal log-likelihood ln p(x) can be decomposed into the KL divergence KL(q||p) and the ELBO L, starting from equation 19:

ln p(x) = L(q) + KL(q||p),

with L = ∫ q(z) ln [p(x, z)/q(z)] dz and KL(q||p) = -∫ q(z) ln [p(z|x)/q(z)] dz. The KL divergence is a distance measure between the probability density functions q and p, and as such is always larger than or equal to zero:

KL(q||p) ≥ 0.

If we now acknowledge that we also require the model parameters Θ to compute the above quantities, i.e.
ln p(x|Θ) = L(q, Θ) + KL(q||p),   (30)

we can start building an understanding of how the EM algorithm works. In the E-step we hold Θ constant and maximize L(q, Θ) with respect to q, i.e. given a current estimate of the model parameters Θ_k we infer the probability density functions of our state variables p(z|x, Θ_k), such that q(z) = p(z|x, Θ_k), making the KL divergence become zero, i.e. KL(q||p) = KL(p||p) = 0, and the marginal log-likelihood ln p(x|Θ_k) become equal to the ELBO L(q, Θ). Here, setting q(z) = p(z|x, Θ_k) maximizes the ELBO L(q, Θ) due to the decomposition above and the previously mentioned fact that the marginal log-likelihood ln p(x) is independent of the probability density function q(z). Subsequently, in the M-step we hold q constant and maximize L(q, Θ) with respect to Θ in order to obtain a new set of model parameters Θ_k+1, leading to an increased marginal log-likelihood ln p(x|Θ_k+1), as the KL divergence becomes greater than zero again, i.e. KL(q||p) ≥ 0 and L(q, Θ_k+1) ≥ L(q, Θ_k). Thus, the starting point in the M-step is the following:

L(q, Θ) = ∫ p(z|x, Θ_k) ln p(x, z|Θ) dz - ∫ p(z|x, Θ_k) ln p(z|x, Θ_k) dz = Q(Θ, Θ_k) - ∫ p(z|x, Θ_k) ln p(z|x, Θ_k) dz,

with Q(Θ, Θ_k) = ∫ p(z|x, Θ_k) ln p(x, z|Θ) dz. We note that the latter term is independent of Θ and can be omitted, since our goal is to optimize the ELBO L(q, Θ) with respect to Θ. Therefore, instead of maximizing the ELBO L(q, Θ) directly, we can just maximize the function Q(Θ, Θ_k). We furthermore notice that Q(Θ, Θ_k) has the form of an expectation value, i.e. we can obtain Q(Θ, Θ_k) by taking the expectation of ln p(x, z|Θ) with respect to z:

Q(Θ, Θ_k) = E_z~p(z|x,Θ_k)[ln p(x, z|Θ)].

The unscented transformation
We are required to approximate expectation values to perform the E-step, i.e. when applying the unscented Kalman filter and the unscented RTS smoother (Algorithms 7 and 9), as well as the M-step, i.e. when maximizing Q(Θ, Θ_k) (equation 39), as we cannot compute them analytically [8].
These expectation values are of the form:

E[h(y)] = ∫ h(y) N(y|m, Σ) dy,

where h is an arbitrary function and y ∈ R^d an arbitrary normally distributed random variable, i.e. y ∼ N(m, Σ). We can obtain such approximations using the unscented transformation f_ut, which was first introduced by Julier et al. [7] and is defined in Algorithm 5. Given the mean m and the covariance Σ, the unscented transformation f_ut generates so-called sigma points Y ∈ R^(2d+1)xd, whose locations are systematically spread around the mean m based on the covariance Σ:

Y_1 = m,
Y_1+i = m + (f_cholesky((d + λ) Σ))_i for i ∈ {1, ..., d},
Y_1+d+i = m - (f_cholesky((d + λ) Σ))_i for i ∈ {1, ..., d},

where (·)_i denotes the i-th column. Here f_cholesky(Σ) denotes the Cholesky decomposition of matrix Σ, which computes a lower triangular matrix L such that LL^T = Σ, and λ can be calculated as follows:

λ = α^2 (d + κ) - d,

where we set the parameters α = 1 and κ = 0 [8], such that λ = 0. Using the sigma points Y, we can approximate E[h(y)] as follows:

E[h(y)] ≈ Σ_i=1..2d+1 w_i h(Y_i),

with the weights w:

w_1 = λ/(d + λ), w_i = 1/(2(d + λ)) for i ∈ {2, ..., 2d+1},

which due to our choice of α and κ simplifies to:

w_1 = 0, w_i = 1/(2d) for i ∈ {2, ..., 2d+1}.
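With α = 1 and κ = 0 (so λ = 0), the unscented transformation reduces to the following sketch; it reproduces the mean of a Gaussian and recovers its covariance:

```python
import numpy as np

def unscented_transform(m, S):
    """Sigma points for y ~ N(m, S) with alpha = 1, kappa = 0, hence lambda = 0."""
    d = len(m)
    L = np.linalg.cholesky(d * S)          # f_cholesky((d + lambda) * Sigma)
    Y = np.vstack([m, m + L.T, m - L.T])   # 2d + 1 sigma points
    w = np.full(2 * d + 1, 1.0 / (2 * d))  # w_i = 1 / (2d) for i > 1
    w[0] = 0.0                             # w_1 = lambda / (d + lambda) = 0
    return Y, w

def expect(h, m, S):
    """Approximate E[h(y)] by a weighted sum over the sigma points."""
    Y, w = unscented_transform(m, S)
    return sum(wi * h(yi) for wi, yi in zip(w, Y))

m, S = np.array([1.0, 2.0]), np.diag([0.25, 1.0])
print(expect(lambda y: y, m, S))                        # → [1. 2.]
print(expect(lambda y: np.outer(y - m, y - m), m, S))   # recovers the covariance S
```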

Expectation step
In the E-step we need to infer the probability density function of the latent variable z_t for all time points t of a behavioral sequence, given the set of all measurements x and the model parameters Θ, denoted p(z_t|x, Θ). Since all random variables of the model are assumed to be normally distributed, this property is maintained for the latent variable z_t as well. Therefore, z_t is drawn from a normal distribution with mean μ_t and covariance V_t, i.e. z_t ∼ N(μ_t, V_t). By using all measurements x of the sequence for the inference of p(z_t|x, Θ) at time point t, information from the past as well as from the future is processed, which is what the unscented RTS smoother is used for. However, to derive the equations of the smoother we first need to focus on inference when only information from the past is available, i.e. we want to infer p(z_t|x_1, ..., x_t, Θ) where only measurements up to time point t are given, which can be achieved by utilizing the unscented Kalman filter. To avoid confusion, we denote mean values and covariance matrices obtained from the unscented Kalman filter as μ̃_t and Ṽ_t, whereas those calculated via the unscented RTS smoother are denoted as μ̂_t and V̂_t.

The unscented Kalman filter
The unscented Kalman filter is an iterative algorithm, which calculates the filtered values for the mean μ̃_t and covariance Ṽ_t at a time point t, based on the filter output for these values, μ̃_t-1 and Ṽ_t-1, at the previous time point t-1, as well as the measurement variable x_t for time point t. The inference scheme for obtaining p(z_t|x_1, ..., x_t, Θ) is given by Algorithms 6 and 7 [11,12]:
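As a toy illustration of the filtering recursion (a scalar model with identity transition, not the paper's Algorithms 6 and 7), the predict/update steps with sigma points look like:

```python
import numpy as np

def ukf(x, g, Vz, Vx, mu0, V0):
    """Unscented Kalman filter for a scalar state with identity transition:
    z_t = z_{t-1} + eps_z, x_t = g(z_t) + eps_x."""
    mu, V = mu0, V0
    w = np.array([0.0, 0.5, 0.5])                 # sigma-point weights (lambda = 0, d = 1)
    mus, Vs = [], []
    for xt in x:
        # Predict: the identity transition just adds the transition noise
        s = np.sqrt(V)
        Z = np.array([mu, mu + s, mu - s])        # sigma points of N(mu, V)
        mu_p = w @ Z
        V_p = w @ (Z - mu_p) ** 2 + Vz
        # Update: propagate fresh sigma points through the emission g
        s = np.sqrt(V_p)
        Z = np.array([mu_p, mu_p + s, mu_p - s])
        X = g(Z)
        x_hat = w @ X                             # predicted measurement
        S = w @ (X - x_hat) ** 2 + Vx             # innovation covariance
        C = w @ ((Z - mu_p) * (X - x_hat))        # state-measurement cross-covariance
        K = C / S                                 # Kalman gain
        mu = mu_p + K * (xt - x_hat)
        V = V_p - K * S * K
        mus.append(mu)
        Vs.append(V)
    return np.array(mus), np.array(Vs)

# Simulate the scalar model and filter the noisy measurements
rng = np.random.default_rng(0)
T, Vz, Vx = 200, 0.01, 0.25
z = np.cumsum(rng.normal(0.0, np.sqrt(Vz), T))
x = 2.0 * z + rng.normal(0.0, np.sqrt(Vx), T)
mu, _ = ukf(x, lambda s: 2.0 * s, Vz, Vx, 0.0, 1.0)
print(np.mean(np.abs(mu - z)) < np.mean(np.abs(x / 2.0 - z)))  # → True
```

For a linear emission the result coincides with the standard Kalman filter; the sigma points only pay off when g is nonlinear, as for the full projection function.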

The unscented RTS smoother
The unscented RTS smoother is also an iterative algorithm, which calculates the smoothed values for the mean μ̂_t and covariance V̂_t at a time point t, based on the smoother output for these values, μ̂_t+1 and V̂_t+1, at the next time point t+1, as well as the corresponding output from the unscented Kalman filter, μ̃_t and Ṽ_t, for time point t. The inference scheme for obtaining p(z_t|x, Θ) is given by Algorithms 8 and 9 [12]:

Algorithm 8
1: function f_uks_0(μ̃_t, Ṽ_t, μ̂_t+1, V̂_t+1, V_z)
2: Z ← f_ut(μ̃_t, Ṽ_t)   ▷ Form sigma points Z (1)
3: z̄ ← Σ_i w_i Z_i   ▷ Predicted mean (2)
4: P ← Σ_i w_i (Z_i - z̄)(Z_i - z̄)^T + V_z   ▷ Predicted covariance (3)
5: D ← Σ_i w_i (Z_i - μ̃_t)(Z_i - z̄)^T   ▷ Cross-covariance (4)
6: G_t ← D P^-1   ▷ Compute smoother gain G_t (4.1)
7: μ̂_t ← μ̃_t + G_t (μ̂_t+1 - z̄)   ▷ Compute smoothed mean μ̂_t (4.2)
8: V̂_t ← Ṽ_t + G_t (V̂_t+1 - P) G_t^T   ▷ Compute smoothed covariance V̂_t (4.3)
9: return μ̂_t, V̂_t, G_t

To obtain the smoothed means μ̂ = {μ̂_0, ..., μ̂_T} and covariances V̂ = {V̂_0, ..., V̂_T} for all time points, one runs the forward filtering pass and then iterates backwards through the entire behavioral sequence.

Enforcing joint angle limits
Here, b_0ij and b_1ij denote the lower and upper bound corresponding to entry j of the Rodrigues vector r_i, which encodes the rotation of bone i, and erf is a sigmoidal function, i.e. the error function, given by:

erf(y) = (2/√π) ∫_0^y e^(-s^2) ds

for y ∈ R. In order to enforce joint angle limits we replace the original transition and emission equations of our state space model, given by equations 13 and 14, with:

z*_t = z*_t-1 + ε_z,   (49)
x_t = g(f_z*→z(z*_t)) + ε_x.   (50)

In the following we always refer to the state space model given by equations 49 and 50, and therefore also to the redefined state variables z*, but we drop the * in the notation.
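The bounded mapping f_z*→z can be sketched as follows; the exact affine form around erf is an assumption consistent with the bounds b_0 and b_1, not the paper's published definition:

```python
import numpy as np
from math import erf

def z_star_to_z(z_star, b0, b1):
    """Map an unconstrained value smoothly onto the interval [b0, b1]
    via the error function (assumed affine form around erf)."""
    return b0 + (b1 - b0) * (erf(z_star) + 1.0) / 2.0

# An elbow rotation limited to [0, pi/2] rad stays in bounds for any real z*
for z in (-10.0, 0.0, 10.0):
    print(round(z_star_to_z(z, 0.0, np.pi / 2), 4))  # → 0.0, 0.7854, 1.5708
```

Because erf is smooth and monotonic, the smoother can operate on the unconstrained z* while the reconstructed joint angles can never leave their anatomical range.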

Maximization step
In the M-step we find a new set of model parameters Θ_k+1 by maximizing the ELBO L, given the smoothed means μ̂ and covariances V̂ as well as the smoother gains G, which we obtained in the E-step using the current estimate of the model parameters Θ_k.

Obtaining new model parameters by maximizing the evidence lower bound
We can take advantage of the specific structure of the state space model when maximizing the ELBO L [8]. In the state space model the state variables fulfill the Markov property, i.e. each state variable z_t only depends on the previous one, z_t-1. Based on this we can compute the model's joint distribution:

p(x, z) = p(z_0) Π_t=1..T p(z_t|z_t-1) p(x_t|z_t).
When we now take the logarithm of the joint distribution and acknowledge that the model parameters Θ are also required for computing the joint distribution, we obtain:

ln p(x, z|Θ) = ln p(z_0|μ_0, V_0) + Σ_t=1..T ln p(z_t|z_t-1, V_z) + Σ_t=1..T ln p(x_t|z_t, V_x).
However, to maximize Q(Θ, Θ_k) we actually need to consider the expectation value of ln p(x, z|Θ):

Q(Θ, Θ_k) = I_0 + I_z + I_x,

with I_0 = E[ln p(z_0|μ_0, V_0)], I_z = Σ_t=1..T E[ln p(z_t|z_t-1, V_z)] and I_x = Σ_t=1..T E[ln p(x_t|z_t, V_x)]. If we now acknowledge that all random variables in our state space model are normally distributed, i.e. z_t ∼ N(μ̂_t, V̂_t), it becomes clear that computing Q(Θ, Θ_k) only involves evaluating the expectation values of log-transformed normal distributions (see Appendix A). Consequently, we can obtain simplified terms for the individual components I_0, I_z and I_x of Q(Θ, Θ_k) using the smoothed means μ̂ and covariances V̂ as well as the smoother gains G. To obtain new model parameters we then set the respective derivatives of Q(Θ, Θ_k) to zero and solve for μ_0, V_0, V_z and V_x, respectively. The required derivatives have the following form (see Appendix B).

To assess convergence of the EM algorithm, we compute the relative change of each model parameter at iteration k:

Δμ_0,i = abs((μ_0,i^k - μ_0,i^k-1) / μ_0,i^k-1) ∀ i ∈ {1, ..., n_z},   (82)

where abs is a function returning the absolute value of its input argument, and μ_0^k, V_0^k, V_z^k and V_x^k are the model parameters at iteration k, whereas μ_0^k-1, V_0^k-1, V_z^k-1 and V_x^k-1 are those at iteration k-1. We only focus on the diagonal entries of the covariances V_0 and V_z, since a fraction of their off-diagonal entries is expected to be zero. Using these relative changes we construct a vector Δv ∈ R^3n_z+n_x containing all relative changes via concatenation, and assume convergence is reached when the mean Δv̄ of Δv falls below a threshold ε_tol:

Δv̄ < ε_tol,

where we set ε_tol = 0.05.
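The convergence criterion can be sketched directly (illustrative parameter values):

```python
import numpy as np

def converged(params_k, params_km1, tol=0.05):
    """Mean relative parameter change across all tracked entries (cf. equation 82)."""
    dv = np.concatenate([np.abs((a - b) / b).ravel()
                         for a, b in zip(params_k, params_km1)])
    return float(dv.mean()) < tol

mu0_k, mu0_km1 = np.array([1.02, 0.99]), np.array([1.0, 1.0])
vz_k, vz_km1 = np.array([0.011]), np.array([0.010])
# Relative changes: 0.02, 0.01, 0.1 -> mean ~0.043 < 0.05
print(converged([mu0_k, vz_k], [mu0_km1, vz_km1]))  # → True
```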

Implementation of the expectation-maximization algorithm
We initialize the mean of the state variables μ_0 by minimizing the objective function given by equation 9, but keep the bone lengths l and the surface marker positions v constant and set n_time = 1, i.e. we only include a single time point in the optimization, which is identical to the first time point of the respective behavioral sequence. The covariances V_0, V_x and V_z are initialized as matrices whose diagonal elements all equal 0.001 and whose off-diagonal entries are set to zero. To learn new model parameters μ_0, V_0, V_x and V_z we run the EM algorithm, given by Algorithm 11, with the stated initial values, using measurements x obtained from the behavioral sequence. Finally, once the EM algorithm has converged, we use the unscented RTS smoother with the resulting learned model parameters to reconstruct the poses of the behavioral sequence.