## Abstract

Naturalistic behavior is highly complex and dynamic. Approaches aiming at understanding how neuronal ensembles generate behavior require robust behavioral quantification in order to correlate the neural activity patterns with behavioral motifs. Here, we present Variational Animal Motion Embedding (VAME), a probabilistic machine learning framework for discovery of the latent structure of animal behavior given an input time series obtained from markerless pose estimation tools.

To demonstrate our framework, we perform unsupervised behavior phenotyping of APP/PS1 mice, an animal model of Alzheimer's disease. Using markerless pose estimates from open-field exploration as input, VAME uncovers the distribution of detailed and clearly segmented behavioral motifs. Moreover, we show that the recovered distribution of phenotype-specific motifs can be used to reliably distinguish between APP/PS1 and wildtype mice, while human experts fail to classify the phenotype based on the same video observations. We propose VAME as a versatile and robust tool for unsupervised quantification of behavior across organisms and experimental settings.

## 1 Introduction

Behavior is defined as the way in which an animal responds to a particular situation or stimulus, shaped by experience and knowledge (Carew, 2005). As of today, most studies investigating behavioral changes in model organisms rely on ethological classification performed by human observers, mostly based on standardized protocols (Crawley, 2008). While standardization is important and ensures generalizability, the variability introduced by human annotators remains a potential confounding factor in the interpretation of behavioral phenotyping results from different laboratories (McIlwain, Merriweather, Yuva-Paylor, & Paylor, 2001). Moreover, in most currently used tests the behavioral repertoire of animals is reduced to easily quantifiable behavioral choices by extensive pre-test training.

Thus, the need for robust unbiased behavioral quantification methods has been widely recognized, and innovative approaches in this direction are currently being introduced (Gomez-Marin, Paton, Kampff, Costa, & Mainen, 2014; Brown & de Bivort, 2018). Unsupervised behavior quantification may not only provide a more unbiased description of naturalistic behavior, it may also be sensitive enough to detect subtle differences that would otherwise remain undetectable or unquantifiable by a human experimenter (Anderson & Perona, 2014; Datta, Anderson, Branson, Perona, & Leifer, 2019). Furthermore, behavioral quantification based on high temporal resolution time series data may permit the computing-intensive analysis of correlations between behavior and neuronal activity. It may thus serve as an important tool that facilitates the discovery of causal relationships between brain activity and behavior (Markowitz et al., 2018; Musall, Kaufman, Juavinett, Gluf, & Churchland, 2019).

Several computational approaches for unsupervised behavior quantification have been introduced (Berman, Choi, Bialek, & Shaevitz, 2014; Wiltschko et al., 2015; Batty et al., 2019). These methods advanced the field of unsupervised behavior quantification and established an increasing awareness of the necessity to improve objectivity. Most approaches operate on a dimensionality-reduced signal extracted directly from the tracking video. The signal is then learned by a machine learning model in the time-domain (Wiltschko et al., 2015; Batty et al., 2019) or frequency-time domain (Berman, 2018) and segmented into discrete blocks containing similar chunks of input data.

Recently, pose estimation tools such as *DeepLabCut* (Mathis et al., 2018) and *LEAP* (T. D. Pereira et al., 2019) enabled efficient tracking of animal body parts via supervised deep learning. The robustness of deep neural networks allows the application of these tools for pose estimation in many model systems, such as mice, zebrafish and flies, and allows for a high generalization between datasets (Mathis et al., 2018). However, while such tools provide a continuous representation of the animal body motion, the extraction of underlying discrete states as a basis for classification (Tinbergen, 1951) remains a challenge.

Here we introduce Variational Animal Motion Embedding (VAME), a probabilistic machine learning framework for clustering of behavioral signals obtained from pose estimation tools in both the spatial and the temporal domain. We propose that these continuous spatiotemporal signals can be grouped into discrete states via clustering of the latent vector obtained from a recurrent neural network autoencoder. Our approach is inspired by recent advances in the field of temporal action segmentation (Kuehne, Richard, & Gall, 2020), representation learning (Chung et al., 2015; Chen et al., 2016; Higgins et al., 2017; Jiang, Zheng, Tan, Tang, & Zhou, 2017) and unsupervised learning of multivariate time series (J. Pereira & Silveira, 2019; Ma, Zheng, Li, & Cottrell, 2019).

Our machine learning model is built within the framework of variational autoencoders (VAE) (Kingma & Welling, 2014; Rezende, Mohamed, & Wierstra, 2014), which has previously been applied to the field of neuroscience (Speiser et al., 2017; Pandarinath et al., 2018). Based on this work, it has been suggested that learning a disentangled representation of the data generative factors can be helpful for a large variety of tasks (Bengio, Courville, & Vincent, 2013). Using the recurrent autoencoder in a variational setting allows a model to learn a complex distribution of the data and to generalize well to previously unseen data. Moreover, VAEs enable the generation of synthetic samples from the learned distribution, which can be used to validate the learning process.

In this manuscript we first demonstrate the model and how it can be applied to behavioral data obtained during open-field exploration. We then validate our model against labels generated by human annotators and inspect the variability in clustering depending on the model parameters. Furthermore, we evaluate and quantify the added value of incorporating temporal information in comparison to relying exclusively on spatial representations of the egocentric body pose of mice.

Finally, we demonstrate the sensitivity of our approach in a use case, in which we apply VAME to investigate behavioral differences in a mouse model of beta-amyloidosis (APP/PS1dE9) (Jankowsky et al., 2004). This mouse model and comparable humanized models are commonly used for preclinical studies. Detecting body pose signatures as functional “biomarkers” of underlying pathophysiology would enable early non-invasive detection and potentially facilitate preclinical research on early therapeutic intervention. We show that our method robustly identifies differences in the distribution of behavioral motifs in transgenic versus wildtype animals. Furthermore, we used the learned representation to predict the phenotype of individual mice based exclusively on their behavioral motif distribution. Lastly, we show that this classification outperforms classification by human experts who were presented the same video data.

## 2 Results

### 2.1 VAME: Variational Animal Motion Embedding

There is a broad agreement in recent work on computational behavioral quantification that observable behavior can be encoded in a low-dimensional subspace or manifold (Wiltschko et al., 2015; Brown & de Bivort, 2018; Berman, 2018). Within this latent structure the identification of different behavioral motifs ranging from stereotyped behavior to rare or spontaneous events is a realistic goal.

To investigate behavioral structure we let animals move freely inside an open-field arena (Figure 1 B, top). During the experiment the animal movement was recorded by a camera mounted below the arena. Another behavioral setup that can be straightforwardly equipped with tracking cameras is the restrained setup, in which head-fixed animals behave on a linear treadmill. The quantification of such data using a related method has been discussed in conference proceedings previously (Luxem, Fuhrmann, Remy, & Bauer, 2019).

In order to identify the postural dynamics of the animal from the video recordings we used a markerless pose estimation tool (Mathis et al., 2018). From pose estimation we obtained a time-dependent series of marker positions **X** which captures the movement of relevant body parts. Our goal was to extract useful information from the time series data that allows for an effective behavioral quantification given spatial and temporal information of body movement. We aligned the marker positions egocentrically to the mouse body. The aligned data was then used as input to our machine learning model (Figure 1 B).
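Egocentric alignment removes the animal's absolute position and heading from the marker time series. The following is a minimal sketch, assuming a simple two-marker convention (nose and tail root) that we introduce here purely for illustration; the actual preprocessing may use a different reference axis:

```python
import numpy as np

def egocentric_align(markers, nose_idx=0, tail_idx=1):
    """Align each frame's marker coordinates egocentrically:
    translate the tail root to the origin and rotate so that the
    tail-to-nose body axis points along the positive x-axis.

    markers: array of shape (n_frames, n_markers, 2) with (x, y) positions.
    Returns an array of the same shape in egocentric coordinates.
    """
    aligned = np.empty_like(markers, dtype=float)
    for t, frame in enumerate(markers):
        centered = frame - frame[tail_idx]             # translate tail root to origin
        body_axis = centered[nose_idx]                 # vector from tail root to nose
        angle = np.arctan2(body_axis[1], body_axis[0])
        c, s = np.cos(-angle), np.sin(-angle)          # rotate body axis onto +x
        rot = np.array([[c, -s], [s, c]])
        aligned[t] = centered @ rot.T
    return aligned

# Toy example: two frames, markers ordered [nose, tail root].
frames = np.array([[[1.0, 1.0], [0.0, 0.0]],
                   [[2.0, 3.0], [2.0, 2.0]]])
out = egocentric_align(frames)
# After alignment the tail root sits at the origin and the nose lies on the +x axis.
```

The rotation per frame discards heading information, so the downstream model sees only the posture relative to the body axis, not the position of the animal in the arena.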

Within the framework of variational autoencoders (VAEs) (Kingma & Welling, 2014) we built a bidirectional recurrent neural network (RNN) encoder which was trained in an unsupervised fashion (Figure 1 A). Gated recurrent units (GRUs) (Cho et al., 2014) were used as the basic building block of the RNNs. The encoder receives a sample **x**_{i} of the time series and learns to embed the relevant information into a lower dimensional representation **z**_{i}. Learning is achieved by passing **z**_{i} to another RNN which decodes the lower dimensional vector into an approximation of the input chunk (Figure 1 B, bottom). Additionally, a second RNN decoder learns to anticipate the structure of the subsequent time series chunk from **z**_{i} (Srivastava, Mansimov, & Salakhudinov, 2015), thereby regularizing **z**_{i} and forcing the encoder to learn a richer representation of the behavior. The prior of the VAE followed the standard normal distribution. However, inspired by Ma and colleagues (Ma et al., 2019), we introduced an additional prior on **z**_{i} with the aim to improve the clusterability of the latent space (see Methods 4.3 for details).
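The architecture just described (a bidirectional GRU encoder with a variational bottleneck feeding both a reconstruction decoder and a prediction decoder) can be sketched in PyTorch as follows. All layer sizes and the per-step conditioning of the decoders on **z** are illustrative assumptions, not the published configuration:

```python
import torch
import torch.nn as nn

class VAMESketch(nn.Module):
    """Minimal sketch of a variational recurrent autoencoder with a
    bidirectional GRU encoder, a reconstruction decoder, and a second
    decoder that anticipates the subsequent time series chunk."""

    def __init__(self, n_features=16, hidden=64, zdim=30):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True, bidirectional=True)
        self.to_mu = nn.Linear(2 * hidden, zdim)
        self.to_logvar = nn.Linear(2 * hidden, zdim)
        self.dec_rec = nn.GRU(zdim, hidden, batch_first=True)
        self.dec_pred = nn.GRU(zdim, hidden, batch_first=True)
        self.out_rec = nn.Linear(hidden, n_features)
        self.out_pred = nn.Linear(hidden, n_features)

    def forward(self, x, pred_len):
        # Encode: concatenate the final forward and backward hidden states.
        _, h = self.encoder(x)                       # h: (2, batch, hidden)
        h = torch.cat([h[0], h[1]], dim=-1)          # (batch, 2 * hidden)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        # Decode: feed z at every time step (one common conditioning scheme).
        recon = self.out_rec(self.dec_rec(z.unsqueeze(1).repeat(1, x.size(1), 1))[0])
        future = self.out_pred(self.dec_pred(z.unsqueeze(1).repeat(1, pred_len, 1))[0])
        return recon, future, mu, logvar

model = VAMESketch()
x = torch.randn(4, 30, 16)         # batch of 4 chunks, 30 time steps, 16 features
recon, future, mu, logvar = model(x, pred_len=15)
```

Training would minimize the reconstruction and prediction errors plus the KL divergence between the approximate posterior and the prior; the additional k-means-inspired prior on **z** (Ma et al., 2019) would enter the loss as a further regularization term.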

In order to investigate whether our model learned a meaningful representation of the input time series, we visualized **z** using Uniform Manifold Approximation and Projection (UMAP) (Figure 1 D, E). Compared to the UMAP embedding of the egocentrically aligned spatial time series (Figure 1 C), the visualization of **z** suggested that the information from **x** is mapped into a dense manifold representing the spatiotemporal dynamics of the animal’s behavior. We then assigned points of **z** to clusters while minimizing their within-cluster variance (k-Means algorithm). In this way we grouped input chunks by spatiotemporal similarity, i.e. created behavioral motifs (Figure S.1).
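The clustering step, assigning each latent vector to a motif by minimizing within-cluster variance, can be sketched with scikit-learn. The latent vectors below are synthetic stand-ins for the model output:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical latent vectors: one 30-dimensional embedding per time chunk,
# drawn here from three well-separated synthetic groups.
rng = np.random.default_rng(0)
z = np.vstack([rng.normal(loc=c, scale=0.3, size=(200, 30))
               for c in (-2.0, 0.0, 2.0)])

# k-Means minimizes the within-cluster variance of the latent vectors,
# yielding one motif id per input chunk.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(z)
motif_labels = kmeans.labels_

# For visualization, z can additionally be projected to 2-D, e.g. with
# umap.UMAP(n_components=2).fit_transform(z) from the umap-learn package.
```

Note that the number of clusters *k* is a free parameter; the results section compares *k* = {15, 30, 45} against manual annotation.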

### 2.2 Validation of VAME

In order to validate our machine learning model we created a manually labeled dataset that was annotated by three human experts with training in behavioral neuroscience. The experts annotated a video of a freely moving wildtype animal consisting of 20,000 frames (≈6 minutes length) with 5 coarse behavioral labels (Walk, Pause, Groom, Rear, Exploratory behavior) (Figure 2 A, see Methods 4.5). When quantifying agreement between individual experts, we observed that 71.93% of the video frames were labeled identically by all three experts. Of the remaining frames, 13.61% received two different labels and 14.47% received three different labels across the experts (Figure 2 C). This indicates that behavioral annotation is subject to considerable inter-observer variability and that behavior is not trivially assignable to discrete labels (Anderson & Perona, 2014; Datta et al., 2019).

Next, we trained our machine learning model with the egocentrically aligned marker time series and validated how the obtained clusters matched the manual annotation (Figure 2 B, bottom). We found that although most VAME motifs predominantly overlapped with a single manually assigned label, a portion of the VAME motifs overlapped with two or more manually assigned labels. Interestingly, we found more disagreement between human experts for motifs that overlapped with several human-assigned labels, indicating uncertainty of the annotators. This suggests that VAME clusters consistently, but that the scores computed against manual annotation are bounded by the experts' own inability to uniquely identify the underlying behavior.

We further used two different metrics for clustering validation to quantify the model accuracy relative to the manual annotation. Purity was used as a measure of the extent to which clusters contain a single manually assigned label (Manning, Raghavan, & Schütze, 2008). Normalized Mutual Information (NMI) was introduced as an information-theoretic metric that quantifies the mutual information between VAME clusters and the manually assigned labels, normalized to the range [0, 1] (see Methods 4.5 for details).
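Both metrics are straightforward to compute. Purity is not part of scikit-learn, so a small sketch of it follows, together with scikit-learn's NMI implementation and a toy example:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def purity(cluster_ids, true_labels):
    """Purity: for each cluster, count its most frequent manual label,
    sum these counts over clusters, and divide by the number of samples."""
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels)
    total = 0
    for c in np.unique(cluster_ids):
        members = true_labels[cluster_ids == c]
        _, counts = np.unique(members, return_counts=True)
        total += counts.max()
    return total / len(true_labels)

# Toy example: 3 clusters scored against 2 manual labels.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
labels   = ['walk', 'walk', 'pause', 'pause', 'pause', 'walk', 'walk', 'walk']
print(purity(clusters, labels))                        # 0.875
print(normalized_mutual_info_score(labels, clusters))  # value in [0, 1]
```

Purity rewards clusters dominated by one label but increases trivially with the number of clusters, which is why NMI, which penalizes fragmentation, is reported alongside it.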

Compared to k-Means clustering of the egocentrically aligned spatial input signal, our best model (Spatio-temporal + Prediction) achieved a relative increase in the Purity score of 7.03%, 8.56% and 8.61% for cluster numbers *k* = {15, 30, 45}, respectively. Likewise, the NMI score increased by 40.14%, 47.23% and 43.23% for each choice of *k*. Absolute values for each setting and metric employed in the comparison are given in Table 5.2. Furthermore, we compared the scores obtained for VAME with scores obtained for clustering of the singular values of the spatiotemporal signal (Table 5.2). For all settings of *k*, the Purity and NMI scores of this baseline were lower than those obtained with our model.

Finally, we visualized the trajectories of three randomly chosen behavioral sequences with a length of 1.5 seconds within a lower-dimensional space, projected by UMAP based either on the spatial input signal (Figure 2 F) or on the spatiotemporal representation learned by our model (Figure 2 G). We observed that the course of trajectories within the embedding of the spatiotemporal representation followed a smooth path through the projected manifold, while the course of trajectories through the embedding of the spatial input signal consisted of several scattered jumps through the projected space. This suggests that our machine learning model captures the spatiotemporal dependencies of the input data and thereby unravels the development of observable behavior on a low-dimensional manifold.

### 2.3 Unsupervised detection of behavioral differences between phenotypes

To demonstrate how VAME can be applied to the detection of behavioral differences at the level of individual animals as well as cohorts, we performed behavioral tracking on a mouse model of beta-amyloidosis harboring human mutations in the APP and presenilin 1 genes (Jankowsky et al., 2004). Mice heterozygous for the transgenic allele were compared to wildtype littermate controls. For this mouse line, several age-dependent behavioral differences have been reported (Huang et al., 2016). For example, age/disease related motor and coordination deficits (Onos et al., 2019), changes in anxiety levels (Lalonde, Kim, & Fukuchi, 2004) and spatial reference memory deficits (Janus, Flores, Xu, & Borchelt, 2015) were observed.

We placed N=8 mice into a novel open-field environment with a transparent bottom, in which the mice were allowed to freely explore the arena for a duration of 25 minutes after an initial habituation period of 10 minutes (Figure 3 A). During the experiment, the movements of the nose, tail root, hind paws and front paws were captured by a camera mounted below the arena (Figure 1 B, top).

The average speed during the trial was 2.29 ± 1.57 cm/s for control animals and 2.49 ± 2.45 cm/s for test animals, while the average distance travelled was 9187.44 ± 1266.4 cm and 9937.07 ± 1367.08 cm, respectively (Figure 3 C). No statistically significant group differences were found for either measure. Observing the total occupancy for control and test animals, we found that both groups preferred the boundary of the arena over the middle, while the test animals showed an additional bias towards the southern border over the northern (Figure 3 B).

We extracted marker coordinates from pose tracking and clustered the multivariate signal into 30 VAME motifs. From the transition probabilities between individual motifs we created a hierarchical representation of mouse behavior (Figure 3 D). We iteratively grouped the two motifs that had the largest transition probability between each other as well as the smallest joint probability of occurrence (see Methods 4.4 for details). In this way, the motifs with the largest spatiotemporal similarity were grouped at the lowest levels of the hierarchy. We then cut the tree-like hierarchical structure at the second and third hierarchical levels in order to obtain communities of motifs. Figure 3 D shows the obtained representation for a single wildtype animal. Note that the structure of the obtained representation was similar for all animals.
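The iterative grouping can be illustrated as follows. This is a simplified sketch in which the pair with the highest mutual transition probability is merged first, with ties broken toward low joint usage; it does not reproduce the exact criterion of Methods 4.4:

```python
import numpy as np

def merge_motifs(trans, usage, n_merges):
    """Illustrative hierarchical grouping of motifs: repeatedly merge the
    pair with the highest mutual transition probability, breaking ties
    toward the pair with the lowest joint probability of occurrence."""
    trans = trans.astype(float).copy()
    usage = usage.astype(float).copy()
    groups = [[i] for i in range(len(usage))]
    merges = []
    for _ in range(n_merges):
        best, best_score = None, (-np.inf, np.inf)
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                score = (trans[i, j] + trans[j, i], usage[i] + usage[j])
                if score[0] > best_score[0] or (
                        score[0] == best_score[0] and score[1] < best_score[1]):
                    best, best_score = (i, j), score
        i, j = best
        merges.append((groups[i], groups[j]))
        # Collapse row and column j into i, then remove j.
        trans[i] += trans[j]
        trans[:, i] += trans[:, j]
        trans = np.delete(np.delete(trans, j, 0), j, 1)
        usage[i] += usage[j]
        usage = np.delete(usage, j)
        groups[i] = groups[i] + groups[j]
        groups.pop(j)
    return groups, merges

# Toy example: motifs 0 and 1 transition heavily into each other.
trans = np.array([[0.0, 0.8, 0.1],
                  [0.7, 0.0, 0.1],
                  [0.1, 0.1, 0.0]])
usage = np.array([0.3, 0.3, 0.4])
groups, merges = merge_motifs(trans, usage, n_merges=1)
# groups -> [[0, 1], [2]]: motifs 0 and 1 form the first community.
```

Repeating the merge step until few groups remain yields the tree-like structure that is then cut at a chosen level to obtain communities.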

We post-hoc assigned the communities to the coarse labels that had been used for manual annotation of behavior, with the addition of a *Turn* community that emerged in the hierarchical representation. In both motifs included in the *Turn* community we could detect bending and turning behavior of the animal that occurred in a stationary position or at slow movement speed. Moreover, we identified two communities containing motifs related to exploratory behavior. In both communities, the animal exhibited head movement strongly resembling undirected sniffing. However, in one community the animal was mostly in an otherwise stationary position, while in the other the animal was moving, typically at a slow pace. Note that the *Groom* community also contained motifs exhibiting consummatory behavior, as we had randomly placed three chocolate flakes in the center of the arena prior to the experiment in order to motivate full coverage of the open field.

Inspecting the difference in motif usage between control and test animals, we detected 5 VAME motifs that significantly differed between the groups. Of those, two were post-hoc categorized into the *Groom* community (motifs 4 and 22, *p* = 0.013, *p* = 0.007), two into the *Turn* community (motifs 8 and 21, *p* = 0.047, *p* = 0.0058) and one as walk behavior (motif 28, *p* = 0.0004) (Figure 3 E). Grooming motifs 4 and 22 are characterized by no movement of the hind paws, upper body and head movements, and slightly lifted front paws, as well as consummatory behavior (Figure 4, Supplementary Video 1). Sequences within motifs 8 and 21 display slow body transitions with hind paw movement, upper body and head turns, as well as low rears (Figure 4, Supplementary Video 2). Motif 28 shows moderate walking with the snout approaching the floor (Figure 4, Supplementary Video 3).

Furthermore, observing the stationary probabilities of the motif transitions, we found two more VAME motifs that differed between groups, in the *Moving exploration* community (motif 20, p = 0.04) and the *Stationary exploration* community (motif 2, p = 0.001) (Figure 3 F). Motif 20 shows walking with a slightly bent upper body and active sniffing behavior with head movement (Figure 4, Supplementary Video 4), while in motif 2 the mice perform intense sniffing behavior while sitting on their hind paws (Figure 4, Supplementary Video 5). Interestingly, the pronounced appearance of exploratory motifs could be related to deficits of spatial orientation that have been previously reported for comparable mouse models (Lalonde et al., 2004; Janus et al., 2015).

Finally, we investigated whether the wildtype and APP/PS1 transgenic mice could be distinguished based on the motif usage and stationary distribution obtained from VAME. When measuring the Kullback-Leibler (KL) divergence between the motif usage distributions as well as the stationary distributions per animal, we found a block-diagonal structure in the pairwise dissimilarity matrix (Figure 5 A, B), suggesting that two separable clusters exist. Indeed, when applying k-Means clustering with cluster size *k* = 2 to the underlying distributions, a decision boundary could be found that separated each animal into a cluster corresponding to the correct genotype. The robustness of the classification was underpinned by leave-p-out validation. When leaving *p* = 1 datasets out of the clustering procedure the mean accuracy was 0.96 ± 0.06, while it was 0.91 ± 0.12 and 0.88 ± 0.14 for *p* = 2 and *p* = 3, respectively. This finding was further confirmed by statistical testing, comparing within-genotype KL-distances to between-genotype KL-distances (Figure 5 C: KS-test *p* = 0.02, t-test wt to tg *p* = 0.52, t-test wt to between-group KL-distance *p* = 0.0002, t-test tg to between-group KL-distance *p* = 0.000002; Figure 5 D: KS-test *p* = 0.002, t-test wt to tg *p* = 0.49, t-test wt to between-group KL-distance *p* = 0.0004, t-test tg to between-group KL-distance *p* = 0.00002).
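The genotype separation step can be sketched as follows, using hypothetical motif usage distributions; the symmetrized KL dissimilarity matrix and the *k* = 2 clustering mirror the analysis described above, but all numbers below are synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans

def kl_divergence(p, q, eps=1e-10):
    """Kullback-Leibler divergence between two motif usage distributions;
    eps guards against log(0) for motifs never used by one animal."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical motif usage (5 motifs) for 4 wildtype-like and
# 4 transgenic-like animals, drawn from two distinct Dirichlet priors.
rng = np.random.default_rng(1)
wt = rng.dirichlet([80, 40, 20, 10, 10], size=4)
tg = rng.dirichlet([10, 10, 20, 40, 80], size=4)
usage = np.vstack([wt, tg])

# Pairwise symmetrized KL dissimilarity matrix (block-diagonal structure).
n = len(usage)
D = np.array([[kl_divergence(usage[i], usage[j]) + kl_divergence(usage[j], usage[i])
               for j in range(n)] for i in range(n)])

# Clustering the usage distributions with k = 2 recovers the two groups.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(usage)
```

Leave-p-out validation would repeat the clustering with *p* animals withheld and score how often the held-out animals fall on the correct side of the learned decision boundary.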

In order to demonstrate the strength of our approach in separating behavioral phenotypes by genotype, we asked 11 human experts to classify the behavior based on the video recordings that were used as input to the machine learning model. We constructed an online questionnaire for blinded classification in which each participant was allowed to watch all videos for an unlimited time before making a decision. All experts had previous experience with behavioral video recordings in an open-field and/or treadmill setting. In addition, six of the participants also had previous experience with behavioral experiments on APP/PS1 mice. We found that the latter group showed slightly higher classification accuracy than experts inexperienced with the mouse model (50.98% ± 11.04% versus 42.5% ± 15.61%). However, the overall human classification accuracy was at chance level across all participants (46.61% ± 8.41%, Figure 5 E). Together, these results demonstrate the usefulness of VAME for constructing a motif distribution that captures even subtle phenotype-specific differences in behavior that were not detectable by human experts given the same input data.

## 3 Discussion

Animal movements occur over a broad range of spatial and temporal scales. The detection of stereotypical patterns is a classical ethological approach. However, within these patterns subpatterns exist which may continuously change in response to behavioral and environmental challenges and usually remain undetected. As manual scoring cannot capture the full complexity of the behavioral dynamics required for detecting causal relationships between neural activity and behavior, there is a pressing need for unsupervised behavior quantification in neuroscience.

To address these issues, we presented a probabilistic machine learning framework for clustering of spatiotemporal motion dynamics embedded in a lower dimensional space (Variational Animal Motion Embedding, VAME). VAME allows for the detection of behavioral motifs in time series data obtained with recently established deep-learning based pose estimation tools. Moreover, our approach offers the opportunity to embed other complementary modalities acquired at a similar temporal scale, for example body temperature, blood oxygen levels, or other physiological parameters. Since the framework can be useful for the wider community of behavioral neuroscientists, we provide open-source access to all required code and documentation. Moreover, we designed and used a rodent observation setup that is easily transferable to other laboratories. We demonstrated that a single camera observing the animal from below provides sufficient information for behavior quantification.

VAME requires the setting of only a few key parameters, which is advantageous compared to previously introduced unsupervised quantification methods based on segmentation via autoregressive hidden Markov models (Wiltschko et al., 2015; Batty et al., 2019). One important problem of hidden Markov models (HMMs) applied to behavioral data is the short switching times between modules, which follow an exponential distribution. To circumvent this problem, “sticky” autoregressive parameters have been introduced (Fox, Sudderth, Jordan, & Willsky, 2011). This step, however, requires multiple additional, potentially confounding parameters that are not trivial to set for users unfamiliar with advanced machine learning techniques. Moreover, RNNs have more expressive power than HMMs, as their activation functions enable them to capture non-linearities in the input data.

Our model of observable behavioral dynamics is a deep-learning based model that is trained in a fully unsupervised fashion. Owing to the capability of deep neural networks to extract higher-order features, they can identify complex patterns within and across raw data points, thereby considerably reducing the human effort needed to parametrize the model (Salinas, Flunkert, Gasthaus, & Januschowski, 2019). To achieve a high performance level they typically require larger amounts of training data, as fewer structural assumptions are made than in parametric models, including state space models. However, the availability of extended datasets is no longer a major limiting factor, as with the advent of markerless pose estimation tools continuous long-term monitoring of behavior is developing into a state-of-the-art approach in behavioral neuroscience.

Moreover, we decided to build our model within the class of variational autoencoders (VAEs) (Kingma & Welling, 2014). VAEs are generative models that combine probabilistic modeling and deep learning into one framework, enabling the learning of the latent variable distribution underlying the input data. We adapted the classical VAE model with the addition of a prediction decoder that anticipates the evolution of the learned time series and thereby regularizes the learning problem (Srivastava et al., 2015). Furthermore, we introduced an additional prior on the latent space in the form of a k-means objective (Ma et al., 2019). We found that although this prior did not significantly improve the Purity and NMI scores with reference to the manual labeling, it increased the quality of the obtained clusters, which we validated via generated video sequences (Supplementary Videos 1–5).

Our method was inspired by the approach proposed by Berman and colleagues (Berman et al., 2014), which has previously been applied for unsupervised behavior quantification from marker time series (Günel et al., 2019). In the original paper (Berman et al., 2014), the authors applied signal processing techniques to extract relevant features describing animal movement, such as the leg segments of a fruit fly. Comparably, our approach relies on the pre-selection of features that either capture the full range of observed animal behavior or are of specific interest for a given study, e.g. pupil dynamics or facial muscle movements. The previously published approach (Berman et al., 2014) then transformed the behavioral time series into a spectrogram, embedded it into a two-dimensional space via t-distributed stochastic neighbor embedding (t-SNE) and obtained discrete modules using a watershed transform of the continuous density map. In contrast, our approach uses a variational recurrent autoencoder, an alternative technique for non-linear dimensionality reduction. Both approaches aim at finding a lower dimensional embedding of the input data, but optimize different objectives to achieve this goal. While variational autoencoders learn the parameters of the probability distribution that underlies the input data, t-SNE applies different transformations in different regions of the data in order to find a low-dimensional map that roughly preserves the distances in the high-dimensional space. For t-SNE, a set of hyperparameters has to be tuned beforehand; otherwise the algorithm is likely to produce low-dimensional distributions that misrepresent the global geometry of the input data (Kobak & Berens, 2019). For this reason, t-SNE is usually preferred for visualization purposes, while variational autoencoders are preferentially applied for learning a deterministic and reversible mapping from data to the embedding space. Moreover, the quadratic computational complexity of t-SNE (van der Maaten & Hinton, 2008) may preclude the creation of joint embeddings from large datasets, while the complexity of recurrent neural networks is asymptotically linear in the length of the input (Goodfellow, Bengio, & Courville, 2016).

In our approach we avoided transforming the input signal into the time-frequency domain. This allowed us to treat the signal in its raw form instead of finding a convenient balance of time and frequency resolution, as would be necessary for an effective Fourier decomposition. However, it is also possible to extend our framework by incorporating, for example, a multimodal time series consisting of signals representing the wavelet power at a given frequency band (Berman et al., 2014) or traces obtained from filtering for other specific features of the input signal.

A relevant parameter in our model is the size of the temporal window that slides over the input data during training and prediction. Shorter time windows lead to the learning of finer nuances within the signal, while larger time windows are required to capture long-term dependencies. In our analysis, we used temporal windows of 500 ms as input to our model, a setting that was proposed based on changepoint analysis of mouse behavior in previous work (Wiltschko et al., 2015). Furthermore, the prediction decoder predicts the evolution of the next 250 ms of the input signal, serving as a regularization term. Depending on the time scale of movements across model organisms, this parameter has to be set specifically to capture relevant features. Our approach could potentially be further improved by using the recently proposed dilated RNNs (Chang et al., 2017) as an encoder model, which could lead to an improved treatment of the signal on multiple temporal scales. However, as this approach may require precise tuning of hyperparameters, the comparison was not a feasible strategy for this study.
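Extracting the 500 ms input windows and the subsequent 250 ms prediction targets from a pose time series might look as follows; the frame rate and the one-frame stride are illustrative assumptions:

```python
import numpy as np

def sliding_windows(series, fps, window_ms=500, pred_ms=250):
    """Cut a (n_frames, n_features) pose time series into overlapping
    training chunks: a ~500 ms input window plus the subsequent ~250 ms
    chunk that the prediction decoder must anticipate. The window
    lengths follow the settings used in the text; the stride of one
    frame is an illustrative choice."""
    win = int(round(fps * window_ms / 1000))
    pred = int(round(fps * pred_ms / 1000))
    inputs, targets = [], []
    for start in range(len(series) - win - pred + 1):
        inputs.append(series[start:start + win])
        targets.append(series[start + win:start + win + pred])
    return np.stack(inputs), np.stack(targets)

series = np.random.randn(200, 8)   # e.g. 8 egocentric coordinates at 60 fps
x, y = sliding_windows(series, fps=60)
# At 60 fps this yields 30-frame input windows and 15-frame prediction targets.
```

Changing `window_ms` directly trades temporal resolution against the ability to capture longer dependencies, which is the tuning consideration discussed above.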

Another relevant parameter requiring optimization is the size of the latent vector, which sets the amount of information compression between the encoder and decoder networks. If the latent vector is too large, the encoder performs poorly, as it can bypass the inference step and store the complete input information directly in the latent vector. If it is too small, it may not offer sufficient capacity for encoding the relevant information, even after extensive training of the model. This parameter was set empirically to **z** ∈ ℝ^{30} by comparing the difference between input and reconstructed signals. For the appropriate setting, high-frequency noise was removed while the reconstructed signal captured the main characteristics of the input signal. Clearly, this setting requires adjustment for specific use cases and needs to be re-evaluated when the dimensionality of the input signal changes.

Finally, we validated VAME based on manual labeling and addressed the issue of behavior identifiability that has been raised previously (Anderson & Perona, 2014; Datta et al., 2019). Here, we labeled behavioral data based on a composition of stereotypical movements, such as walking, pausing and sniffing, which may also appear in combination with each other. However, we found considerable disagreement in labeling between individual human experts, even though the basic set of motifs was grouped into five coarse behavioral classes; disagreement was most pronounced for motifs jointly representing, for example, walking and exploratory behavior. While undirected sniffing in a rigid body pose is typically interpreted as exploratory behavior, sniffing during walking may be identified as regular walking behavior. At this point our approach reaches a limit. Possibly, this limit may only be overcome by identifying the neuronal correlates of the true behavioral states (Krakauer, Ghazanfar, Gomez-Marin, MacIver, & Poeppel, 2017).

Lastly, we have demonstrated how our approach can be applied in practice for the robust detection of differences between mice of two different genotypes. Classification of behavioral phenotypes based on defined features has been demonstrated previously, for example in *Caenorhabditis elegans* (Baek, Cosman, Feng, Silver, & Schafer, 2002). We would like to point out, however, that our classification is based on unsupervised discovery of behavioral motifs, with the only manual selection of features being the keypoints defined for pose estimation. Our model thus outperformed human classification performance, demonstrating the practicability of our approach and, more generally, of machine learning based behavior quantification approaches in neuroscience.

We are thus convinced that our framework will be useful for robustly detecting behavioral representations across organisms and experimental settings. Moreover, we anticipate that VAME will stimulate the development of further machine learning models that may be benchmarked using the data and methodology presented in this paper. Finally, with the introduction of VAME we aim to facilitate studies investigating causal relationships between naturalistic behavior and neuronal activity.

## 4 Methods

### 4.1 Animals

For all experiments we used 12 month old male transgenic and non-transgenic APPSwe/PS1dE9 (APP/PS1) mice (Jankowsky et al., 2001) on a C57BL/6J background (Jackson Laboratory). Mice were group housed under standard laboratory conditions with a 12-h light-dark cycle with food and water ad libitum. All experimental procedures were performed in accordance with institutional animal welfare guidelines and were approved by the state government of North Rhine-Westphalia, Germany.

### 4.2 Experimental setup, data acquisition and preprocessing

For the open field exploration experiment, mice were placed in the center of a circular arena (transparent Plexiglas floor with a diameter of 50 cm, surrounded by a transparent Plexiglas wall with a height of 50 cm) and were left to habituate for 10 minutes. Afterwards, sessions of 25 minutes were recorded in which the mice could freely behave in the arena. To encourage better coverage of the arena, three chocolate flakes were placed, uniformly distributed, in its central part prior to the experiment.

Mouse behavior was recorded by a CMOS camera (Basler acA2000-165umNIR) equipped with a wide-angle lens (CVO GM24514MCN, Stemmer Imaging) that was placed centrally 35 cm below the arena. Three infrared light sources (LIU780A, Thorlabs) were placed 70 cm away from the center, providing homogeneous illumination of the recording arena from below. All recordings were performed under dim room-light conditions.

For behavioral pose extraction, *m* virtual markers were placed on relevant body parts in 650 uniformly picked video frames from 14 videos in the restrained setup and in 500 uniformly picked video frames from 16 videos in the freely behaving setup. A residual neural network (ResNet-50) was trained to assign the virtual markers to every video frame (Mathis et al., 2018).

To obtain egocentric time series of (*x, y*) marker coordinates, we aligned every animal egocentrically. The alignment is done by taking the nose and tail coordinates and cropping the frame to these coordinates. In order to obtain a tail-to-nose orientation from left to right, we compute a rotation matrix and rotate the resulting frame around the center between nose and tail. This results in egocentrically aligned frames and marker coordinates **X** ∈ ℝ^{2m×N}, where *N* represents the sequence length. To fit our machine learning model, we subdivide this sequence into smaller subsequences **x**_{i} by applying a sliding window of length *T*. Furthermore, we created a second set of subsequences that stores the time points subsequent to each **x**_{i}, which serves as the target of the prediction decoder.
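The alignment step can be sketched as follows; the marker ordering, function name and toy frame are illustrative assumptions, not the exact implementation used in the toolbox.

```python
import numpy as np

def egocentric_align(markers, nose_idx=0, tail_idx=-1):
    """Rotate one frame of (m, 2) marker coordinates so the
    tail-to-nose axis points from left to right.

    `nose_idx`/`tail_idx` are assumed marker positions; adapt them
    to your own keypoint ordering.
    """
    nose, tail = markers[nose_idx], markers[tail_idx]
    center = (nose + tail) / 2.0
    # angle of the tail-to-nose vector relative to the x-axis
    dx, dy = nose - tail
    theta = np.arctan2(dy, dx)
    # rotation matrix that cancels this angle
    c, s = np.cos(-theta), np.sin(-theta)
    R = np.array([[c, -s], [s, c]])
    # rotate all markers around the nose-tail center
    return (markers - center) @ R.T

# toy frame: tail at the origin, nose up and to the right
frame = np.array([[2.0, 2.0],   # nose
                  [1.0, 1.0],   # body center
                  [0.0, 0.0]])  # tail
aligned = egocentric_align(frame, nose_idx=0, tail_idx=2)
```

After alignment the nose-tail axis lies on the x-axis with the nose to the right, so all frames share a common egocentric reference.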

For low-dimensional visualization of spatial as well as spatiotemporal data we employed Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) from the `umap-learn` Python package. All embeddings were created with the parameters `min_dist` set to 0.2 and `n_neighbors` set to 20.

### 4.3 Variational Animal Motion Embedding

Given is a set of *n* multivariate time series **X** = {**x**_{1}, **x**_{2}, …, **x**_{n}}, where each time series **x**_{i} contains 2*m* × *T* ordered real values. The objective of our model is to learn a *d*-dimensional vector **z**_{i} ∈ ℝ^{d} which contains the latent representation of the input sequence **x**_{i}. **z**_{i} is learned via the non-linear mappings *f*_{enc}: **x**_{i} → **z**_{i} and *f*_{dec}: **z**_{i} → **x̂**_{i}, where *f*_{enc}, *f*_{dec} denote the encoding and decoding process, respectively, defined by

**x̂**_{i} = *f*_{dec}(*f*_{enc}(**x**_{i})).  (1)

In order to learn the spatiotemporal latent representation, the encoder of our model is parameterized as a two-layer bi-directional RNN with parameters *ϕ*. Furthermore, our model uses two decoders: a one-directional RNN with parameters *θ* and a bi-directional RNN with parameters *η*.

As our input data is temporally dependent, RNNs are a natural choice to capture temporal dynamics by recursively processing each input and updating their internal state **h**_{t} at each timestep via

**h**_{t} = *f*(**h**_{t−1}, **x**_{t}; *θ*),  (2)

where *f* is a deterministic non-linear transition function and *θ* is the parameter set of *f*. The transition function *f* is usually modelled as a long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) or gated recurrent unit (GRU) (Cho et al., 2014). Here, we use GRUs as the transition function in both the encoder and the decoders.
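The recurrence above can be made concrete with a single GRU update written in plain NumPy; the weight shapes and random inputs are purely illustrative, not the trained model.

```python
import numpy as np

def gru_cell(x_t, h_prev, W, U, b):
    """One GRU update h_t = f(h_{t-1}, x_t) in plain NumPy.

    W, U, b hold the stacked update/reset/candidate weights; this is
    an illustrative re-implementation, not the PyTorch kernel.
    """
    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))
    Wz, Wr, Wh = W
    Uz, Ur, Uh = U
    bz, br, bh = b
    z = sigmoid(x_t @ Wz + h_prev @ Uz + bz)              # update gate
    r = sigmoid(x_t @ Wr + h_prev @ Ur + br)              # reset gate
    h_tilde = np.tanh(x_t @ Wh + (r * h_prev) @ Uh + bh)  # candidate state
    return (1 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
d_in, d_h = 4, 8                       # toy input/hidden sizes
W = rng.normal(size=(3, d_in, d_h))
U = rng.normal(size=(3, d_h, d_h))
b = np.zeros((3, d_h))
h = np.zeros(d_h)
for t in range(10):                    # unroll over a short sequence
    h = gru_cell(rng.normal(size=d_in), h, W, U, b)
```

The gating keeps the hidden state bounded, which is what lets the encoder summarize a whole time window into a fixed-length vector.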

The joint probability of a time series **x**_{i} is factorized by an RNN as a product of conditionals,

*p*(**x**_{1}, …, **x**_{T}) = ∏_{t=1}^{T} *p*(**x**_{t} | **x**_{<t}).  (3)

In order to learn a joint distribution over all variables, or, more precisely, the underlying generative process of the data, we apply the framework of variational autoencoders (VAE) introduced by (Kingma & Welling, 2014; Rezende et al., 2014). VAEs have been shown to effectively model complex multivariate distributions and can generalize much better to new situations, e.g. spontaneous events, than their discriminative counterparts.

#### 4.3.1 Variational Autoencoder

In brief, by introducing a set of latent random variables **Z**, the VAE model is able to learn variations in the observed data and can generate **X** through conditioning on **Z**. Hence, the joint probability distribution is defined as

*p*_{θ}(**X**, **Z**) = *p*_{θ}(**X**|**Z**) *p*_{θ}(**Z**),  (4)

and parameterized by *θ*.

Obtaining the data distribution *p*(**X**) by marginalization is intractable due to the non-linear mappings between **X** and **Z** and the integration over **Z**. In order to overcome the problem of intractable posteriors, the VAE framework introduces an approximation of the posterior *q*_{ϕ}(**Z**|**X**) and optimizes a lower bound on the marginal log-likelihood,

log *p*_{θ}(**X**) ≥ 𝔼_{q_{ϕ}(**Z**|**X**)}[log *p*_{θ}(**X**|**Z**)] − *KL*(*q*_{ϕ}(**Z**|**X**) || *p*_{θ}(**Z**)),  (5)

where *KL*(*Q* || *P*) denotes the Kullback-Leibler divergence between two probability distributions *Q* and *P*. The prior *p*_{θ}(**Z**) and the approximate posterior *q*_{ϕ}(**Z**|**X**) are typically chosen to be in a simple parametric form, such as a Gaussian distribution with diagonal covariance. The generative model *p*_{θ}(**X**|**Z**) and the inference model *q*_{ϕ}(**Z**|**X**) are trained jointly by optimizing Eq. 5 w.r.t. their parameters. Using the *reparameterization trick* (Eq. 6) introduced by (Kingma & Welling, 2014), the whole model can be trained through standard backpropagation techniques for stochastic gradient descent.
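For the Gaussian prior/posterior pair described above, the KL term of Eq. 5 has a well-known closed form, sketched here (the function name is ours):

```python
import numpy as np

def kl_diag_gauss(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ),
    summed over the latent dimensions."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

# the KL term vanishes exactly when the posterior equals the prior
kl_zero = kl_diag_gauss(np.zeros(30), np.ones(30))
```

Because the term is differentiable in `mu` and `sigma`, it can be minimized jointly with the reconstruction loss by standard backpropagation.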

#### 4.3.2 Variational lower bound of VAME

In our case, the inference model (or encoder) *q*_{ϕ}(**z**_{i}|**x**_{i}) is parameterized by an RNN. By concatenating the last hidden states **h**_{t} of each layer of the encoder we obtain a global hidden state **h**_{i}, which is a fixed-length vector representation of the entire sequence **x**_{i}. To obtain the probabilistic latent representation **z**_{i} we define a prior distribution over the latent variables *p*_{θ}(**z**_{i}) as an isotropic multivariate Normal distribution *𝒩*(**z**_{i}; **0**, **I**). The parameters *µ*_{z} and Σ_{z} of the approximate posterior distribution *q*_{ϕ}(**z**_{i}|**x**_{i}) are generated from the final encoder hidden state by two fully connected layers with a Linear and a SoftPlus activation, respectively. The latent representation **z**_{i} is then sampled from the approximate posterior and computed via the reparameterization trick,

**z**_{i} = *µ*_{z} + *σ*_{z} ⊙ *ε*,  (6)

where *ε* ∼ *𝒩*(**0**, **I**) is an auxiliary noise variable and ⊙ denotes the Hadamard product.
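The reparameterization trick (Eq. 6) amounts to the following sketch; the latent size of 30 matches the setting used in this study, while the value of `sigma` is arbitrary.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma ⊙ eps with eps ~ N(0, I) (Eq. 6).
    The stochasticity is isolated in eps, so gradients can flow
    through mu and sigma during training."""
    eps = rng.standard_normal(mu.shape)  # auxiliary noise variable
    return mu + sigma * eps

rng = np.random.default_rng(42)
mu = np.zeros(30)           # z in R^30, as used in the paper
sigma = np.full(30, 0.5)    # SoftPlus keeps sigma strictly positive
z = reparameterize(mu, sigma, rng)
```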

The generative model *p*_{θ}(**x**_{i}|**z**_{i}) (or decoder) receives **z**_{i} as input at each timestep *t* and aims to reconstruct **x**_{i}. We use the Mean Squared Error (MSE) as reconstruction loss, defined by

*ℒ*_{MSE} = (1/*T*) ∑_{t=1}^{T} ||**x**_{t} − **x̂**_{t}||².  (7)

The log-likelihood of **x**_{i} can be expressed as in Eq. 5. Since the KL divergence is non-negative, the log-likelihood can be written as

log *p*_{θ}(**x**_{i}) ≥ *ℒ*(*θ, ϕ*; **x**_{i}) = 𝔼_{q_{ϕ}(**z**_{i}|**x**_{i})}[log *p*_{θ}(**x**_{i}|**z**_{i})] − *KL*(*q*_{ϕ}(**z**_{i}|**x**_{i}) || *p*_{θ}(**z**_{i})).  (8)

Here, *ℒ*(*θ, ϕ*; **x**_{i}) is a lower bound on the log-likelihood and therefore called the *evidence lower bound* (ELBO) as formulated by (Kingma & Welling, 2014).

We extend the ELBO in our model by an additional prediction decoder *p*_{η}(**x̃**_{i}|**z**_{i}) to predict the evolution **x̃**_{i} of **x**_{i}, parameterized by *η*. The motivation for this additional model is based on (Srivastava et al., 2015), where the authors propose a composite RNN model which jointly learns important features for reconstructing and predicting subsequent video frames. Here, the prediction decoder serves as a regularization for learning **z**_{i}, so that the latent representation not only memorizes an input time series but also estimates its future dynamics. Thus, we extend Eq. 8 by an additional term and parameter,

*ℒ*(*θ, ϕ, η*; **x**_{i}) = 𝔼_{q_{ϕ}(**z**_{i}|**x**_{i})}[log *p*_{θ}(**x**_{i}|**z**_{i})] + 𝔼_{q_{ϕ}(**z**_{i}|**x**_{i})}[log *p*_{η}(**x̃**_{i}|**z**_{i})] − *KL*(*q*_{ϕ}(**z**_{i}|**x**_{i}) || *p*_{θ}(**z**_{i})).  (9)

In order to improve the performance of the post-hoc clustering, we incorporate a k-means objective based on spectral relaxation into the model, as proposed by (Ma et al., 2019), to guide the learning of the network. Briefly, given a data matrix **Z** ∈ ℝ^{d×N} of latent vectors, (Zha, He, Ding, Gu, & Simon, 2002) transformed the k-means objective into a trace maximization problem associated with the Gram matrix **Z**^{T}**Z**. Thus, the k-means objective has the form

min_{**A**} *Tr*(**Z**^{T}**Z**) − *Tr*(**A**^{T}**Z**^{T}**Z****A**),  (10)

where *Tr* denotes the matrix trace. **A** ∈ ℝ^{N×k} is called the cluster indicator matrix and can be relaxed to an arbitrary orthogonal matrix, which turns the minimization in Eq. 10 into the trace maximization problem

max_{**A**: **A**^{T}**A** = **I**} *Tr*(**A**^{T}**Z**^{T}**Z****A**).  (11)

Eq. 11 has a closed-form solution based on the *Ky Fan* theorem (Fan & Hoffman, 1955), which states that the maximum is attained at the largest *k* eigenvectors of the Gram matrix; this yields a *lower bound* for the minimum of the k-means objective. In practice, we update **A** by computing the sum of the *k* first singular values of **Z**.
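The relaxed k-means term of Eq. 11 can be sketched as follows; note that, following the text, we sum the *k* first singular values of the latent matrix (their squares are the top eigenvalues of the Gram matrix).

```python
import numpy as np

def kmeans_trace_term(Z, k):
    """Spectral-relaxation value used as a k-means regularizer:
    the sum of the k first singular values of the latent matrix Z
    (shape d x N).  Squaring them would give the top eigenvalues of
    the Gram matrix Z^T Z in the trace maximization (Eq. 11)."""
    s = np.linalg.svd(Z, compute_uv=False)  # sorted in descending order
    return np.sum(s[:k])

# diagonal toy matrix with known singular values 3, 2, 1
Z = np.diag([3.0, 2.0, 1.0])
term = kmeans_trace_term(Z, k=2)  # 3 + 2
```

In training, this term is evaluated on the batch of latent vectors and added to the loss, pushing the encoder toward a latent space that k-means can partition cleanly.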

We motivate this trace term as a prior on the latent vector **z**_{i} by assuming a joint probability of the form

*p*_{θ}(**x**_{i}, **z**_{i}, *k*) = *p*_{θ}(**x**_{i}|**z**_{i}) *p*_{θ}(**z**_{i}|*k*) *p*(*k*),  (12)

where *p*_{θ}(**z**_{i}|*k*) is the trace optimization prior. Along with this joint probability we can write

*p*_{θ}(**x**_{i}|**z**_{i}, *k*) = *p*_{θ}(**x**_{i}|**z**_{i}),  (13)

so that **x**_{i} and *k* are independent conditioned on **z**_{i}. Using Bayes' theorem we find that

*p*_{θ}(**z**_{i}|**x**_{i}, *k*) = *p*_{θ}(**x**_{i}|**z**_{i}) *p*_{θ}(**z**_{i}|*k*) / *p*_{θ}(**x**_{i}|*k*),  (14)

where *p*_{θ}(**z**_{i}|**x**_{i}, *k*) is approximated by the encoder *q*_{ϕ}(**z**_{i}|**x**_{i}, *k*).

Using Jensen’s inequality, the log-likelihood of VAME can be written as

log *p*_{θ}(**x**_{i}|*k*) ≥ 𝔼_{q_{ϕ}(**z**_{i}|**x**_{i})}[log *p*_{θ}(**x**_{i}|**z**_{i})] − *KL*(*q*_{ϕ}(**z**_{i}|**x**_{i}) || *p*_{θ}(**z**_{i}|*k*)).  (15)

We can now express the lower bound on the log-likelihood with an additional prior on the latent vector in the form of

*ℒ*(*θ, ϕ*; **x**_{i}, *k*) = 𝔼_{q_{ϕ}(**z**_{i}|**x**_{i})}[log *p*_{θ}(**x**_{i}|**z**_{i})] − *KL*(*q*_{ϕ}(**z**_{i}|**x**_{i}) || *p*_{θ}(**z**_{i})) − *KL*(*q*_{ϕ}(**z**_{i}|**x**_{i}) || *p*_{θ}(**z**_{i}|*k*)).  (16)

As stated by (Ma et al., 2019), **z**_{i} is learned by the model and therefore not static. Therefore, Eq. 10 can be regarded as a regularization term for learning **z**_{i}. Note that the balance between Eq. 10 and Eq. 9 forces the encoder to learn a more defined cluster boundary. Finally, the training objective to minimize is the negative of the extended lower bound,

*ℒ*_{VAME} = −*ℒ*(*θ, ϕ, η*; **x**_{i}, *k*),  (17)

and the overall loss function can be written as

*ℒ* = *ℒ*_{reconstruction} + *ℒ*_{prediction} + *ℒ*_{KL} + *ℒ*_{k-means},  (18)

where *ℒ*_{prediction} is the MSE loss of the prediction decoder.

The full model was trained using the Adam optimizer (Kingma & Ba, 2015) with a fixed learning rate of 0.0005 on a single Nvidia 1080ti GPU. All computing was done with PyTorch (Paszke et al., 2017).

### 4.4 Clustering into behavioral motifs

To determine the set of behavioral motifs *B* = {*b*_{1}, …, *b*_{K}}, we first obtained the latent vectors **z** for a given dataset using the machine learning framework described in Methods 4.3. Given a video containing *N* frames and a spatiotemporal time window of size *T*, inference yields one *d*-dimensional latent vector per window, so the resulting feature matrix *ℱ* is of dimensionality *d* × (*N* − *T*). We then performed k-means clustering on *ℱ* to identify *K* behavioral motifs. Figure 1 (C, Middle) and Figure S.1 show exemplary state sequences obtained from the clustering.
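The post-hoc motif assignment can be sketched with a minimal Lloyd's k-means on toy latent vectors; any off-the-shelf k-means implementation would serve equally well, and the blob data is purely illustrative.

```python
import numpy as np

def kmeans(F, K, iters=50, seed=0):
    """Minimal Lloyd's k-means over a feature matrix F of shape
    (N - T, d): one latent vector per time window, one motif label
    per row of F."""
    rng = np.random.default_rng(seed)
    centers = F[rng.choice(len(F), K, replace=False)]
    for _ in range(iters):
        # assign each window to its nearest motif center
        dist = np.linalg.norm(F[:, None] - centers[None], axis=2)
        labels = dist.argmin(axis=1)
        # recompute centers; keep the old center for empty clusters
        for k in range(K):
            if np.any(labels == k):
                centers[k] = F[labels == k].mean(axis=0)
    return labels, centers

# two well-separated blobs of latent vectors -> two recovered motifs
rng = np.random.default_rng(0)
F = np.vstack([rng.normal(0.0, 0.1, (50, 3)),
               rng.normal(5.0, 0.1, (50, 3))])
labels, centers = kmeans(F, K=2)
```

The resulting label sequence over windows is exactly the "state sequence" that the transition analysis in the next section consumes.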

We can then determine the motif usage as the percentage of video frames assigned to a specific motif. Furthermore, we may model the transitions between behavioral motifs as a discrete-time Markov chain, in which the transition probability into a future motif depends only on the present motif. This results in a *K* × *K* transition probability matrix *𝒯* with elements

*𝒯*_{lk} = *P*(*b*_{t+1} = *b*_{k} | *b*_{t} = *b*_{l}),  (19)

being the transition probabilities from one motif *b*_{l} ∈ *B* to another motif *b*_{k} ∈ *B*, which are empirically estimated from the clustering of *ℱ*.

Next, we can compute the stationary distribution of the Markov chain *𝒯*, which is the probability distribution to which the chain converges as time progresses. By definition, the stationary distribution *π* satisfies

*π𝒯* = *π*.  (20)

In other words, *π* is invariant under the matrix *𝒯*.
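The stationary condition can be solved as an eigenvector problem; the 3-motif transition matrix below is a toy example, not data from the study.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution pi with pi @ P = pi: the left
    eigenvector of P for eigenvalue 1, normalized to sum to one."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

# toy 3-motif transition matrix (each row sums to one)
P = np.array([[0.80, 0.10, 0.10],
              [0.20, 0.60, 0.20],
              [0.25, 0.25, 0.50]])
pi = stationary_distribution(P)
```

For an irreducible, aperiodic chain this eigenvector is unique and strictly positive, so the normalization yields a proper probability distribution.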

In order to obtain a hierarchical representation of behavioral motifs, we can represent the Markov chain (19) as a directed graph 𝔾 consisting of nodes *v*_{1} … *v*_{K} connected by edges with assigned transition probabilities *𝒯*_{lk}. Additionally, the size of each node corresponds to the total occurrence of the behavioral motif throughout all *N* video frames. We can then transform 𝔾 into a binary tree 𝕋 by iteratively merging two nodes (*v*_{i}, *v*_{j}) until only the root node *v*_{R} is left. To select *i* and *j* in each reduction step, we compute the cost function

*C*(*i, j*) = *𝒯*_{ij} *U*_{i} + *𝒯*_{ji} *U*_{j},  (21)

where *U*_{i} is the probability of occurrence of the *i*th motif. Note that after each reduction step the matrix *𝒯* is recomputed in order to account for the merging of nodes.

Lastly, we may obtain *communities* of behavioral motifs by cutting the tree 𝕋 at a given depth, analogous to the hierarchical clustering approach used for dendrograms.

### 4.5 Manually assigned labels and scoring

In order to obtain manually assigned labels of behavioral motifs, we asked three experts to annotate one recording of freely moving behavior with a duration of 6 minutes. All three experts had extensive experience with in-vivo experiments as well as ethogram-based behavior quantification. The experts could scroll through the video in slow motion, forward and backward in time, and annotated the behavior into several atomic motifs as well as compositions of those. For example, the experts were allowed to annotate a behavioral sequence as *walk* or *exploration*, but also as *walk and exploration*. We then summarized the annotated atomic motifs into 5 coarse behavioral labels, as shown in Table 1.

The coarse labels were created with respect to the behavior descriptions taken from the Mouse Ethogram database (www.mousebehavior.org), which provides a consensus of several previously published ethograms. The assignment of coarse labels to the Mouse Ethogram database taxonomy is shown in Table 2.

For scoring the agreement between the human-assigned labels and the VAME motifs, we used the clustering evaluation measures Purity and Normalized Mutual Information (NMI). Purity is defined as

Purity(*U, V*) = (1/*N*) ∑_{v∈V} max_{u∈U} |*v* ∩ *u*|,  (22)

where *U* is the set of manually assigned labels *u*, *V* is the set of labels *v* generated by VAME and *N* is the number of frames in the behavioral video. The Normalized Mutual Information score is written as

NMI(*U, V*) = MI(*U, V*) / √(*H*(*U*) *H*(*V*)),  (23)

where MI(*U, V*) is the mutual information between the sets *U* and *V*, defined as

MI(*U, V*) = ∑_{u∈U} ∑_{v∈V} *P*(*u, v*) log( *P*(*u, v*) / (*P*(*u*) *P*(*v*)) ),  (24)

and *H*(*U*) is the entropy of the set *U*, defined as

*H*(*U*) = −∑_{i} *P*_{i} log *P*_{i},  (25)

where *P*_{i} denotes the probability of the *i*th entry of *U*.

Note that the Purity score (22) tends to be larger when the set *V* is larger than *U*, while the NMI score (23) is generally larger when both sets *U* and *V* are of similar size, i.e. when the number of possible labels in the human-assigned set is roughly the same as in the set generated by VAME.
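Both scores can be computed directly from two label sequences; this sketch uses the square-root normalization for NMI (other normalizations, e.g. the arithmetic mean of the entropies, are also common).

```python
import numpy as np
from collections import Counter

def purity(u, v):
    """Purity: each VAME cluster in v is credited with its most
    frequent co-occurring manual label in u, summed over clusters
    and divided by the number of frames N."""
    total = 0
    for label in set(v):
        overlap = Counter(ui for ui, vi in zip(u, v) if vi == label)
        total += max(overlap.values())
    return total / len(u)

def nmi(u, v):
    """NMI = MI(U, V) / sqrt(H(U) * H(V)) for integer label arrays."""
    u, v = np.asarray(u), np.asarray(v)
    def entropy(x):
        p = np.bincount(x) / len(x)
        p = p[p > 0]
        return -np.sum(p * np.log(p))
    mi = 0.0
    for i in np.unique(u):
        for j in np.unique(v):
            pij = np.mean((u == i) & (v == j))
            if pij > 0:
                mi += pij * np.log(pij / (np.mean(u == i) * np.mean(v == j)))
    return mi / np.sqrt(entropy(u) * entropy(v))

# identical labelings give a perfect score for both measures
manual = [0, 0, 1, 1, 2, 2]
vame = [0, 0, 1, 1, 2, 2]
```

Purity alone rewards over-clustering (many small VAME motifs), which is why it is reported alongside NMI here.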

### 4.6 Human phenotype classification task

For the classification of phenotypes by human experts, we created an online form in which experts could watch all behavior videos and decide which phenotype is shown in each video. The participants were not told how many animals of each group were in the set. For every video, one of five decisions could be made: *APP/PS1 (Very sure), APP/PS1 (Likely), Unsure, Wildtype (Likely), Wildtype (Very sure)*. We counted both variants of a right answer (Very sure and Likely) as a correct classification (1 point), and both wrong answers as well as the choice of the Unsure option as a wrong classification (0 points). We did not impose a time limit but asked the participants how much time they spent on the task; on average, completing the questionnaire for the N=8 animals took around 30 minutes. Eleven experts participated in this classification task. All of them had previous experience with behavioral video recordings in an open-field and/or treadmill setting. In addition, six of the participants also had previous experience with the APP/PS1 phenotype. At the end of the task, we asked the participants how they had tried to identify the APP/PS1 phenotype, in order to uncover whether human experts share a certain strategy.

### 4.7 Code availability

The VAME toolbox is available at https://github.com/LINCellularNeuroscience/VAME.

## 5 Supplemental Materials

### 5.1 Exemplary traces and VAME clusters

Exemplary traces and VAME clusters can be found in Figure S.1. Note that values of the input time series were set to an arbitrary negative value if the used pose estimation tools could not reliably detect the position of the corresponding virtual marker.

### 5.2 Absolute numbers and comparison to SVD

Absolute values for the scoring against the manually annotated labels shown in Figure 2 are given in Table 5.2. Furthermore, we compared the performance of VAME against Singular Value Decomposition (SVD), a linear dimensionality reduction method closely related to Principal Component Analysis (PCA). For this purpose, we obtained the first 12 singular value components for the identical time windows *T* that were otherwise fed to VAME; these components explained more than 95% of the variance of the original input data. We then clustered the resulting features for each time window using k-means and computed the Purity and NMI scores, shown in Table 5.2.
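The SVD baseline can be sketched as follows on toy time windows; the random data is illustrative only, so the explained-variance figure of our recordings is not reproduced here.

```python
import numpy as np

# SVD baseline: represent each time window by its projection onto the
# first 12 right-singular vectors, then cluster these linear features
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 24))        # one row per time window (toy data)
Xc = X - X.mean(axis=0)                # center before the decomposition
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 12
features = Xc @ Vt[:k].T               # 12-dim linear feature per window
explained = (s[:k] ** 2).sum() / (s ** 2).sum()
```

The `features` matrix then takes the place of the VAME latent vectors in the k-means clustering and Purity/NMI scoring.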

### 5.3 Training/test loss-curves

The training error for all loss terms employed in the model is shown in Figure S.2.

## 6 Acknowledgments

We thank J. Macke, E. Restrepo, J. Gall and S. Stober for comments on the manuscript. This work was supported by the European Research Council (CoG;SUBDECODE) and DFG-SFB 1089.