## Abstract

Neuronal manifold learning techniques represent high-dimensional neuronal dynamics in low-dimensional embeddings to reveal the intrinsic structure of neuronal manifolds. Common to these techniques is their goal to learn low-dimensional embeddings that preserve all dynamic information in the high-dimensional neuronal data, i.e., embeddings that allow for reconstructing the original data. We introduce a novel neuronal manifold learning technique, BunDLe-Net, that learns a low-dimensional Markovian embedding of the neuronal dynamics which preserves only those aspects of the neuronal dynamics that are relevant for a given behavioural context. In this way, BunDLe-Net eliminates neuronal dynamics that are irrelevant to decoding behaviour, effectively de-noising the data to better reveal the intricate relationships between neuronal dynamics and behaviour. We demonstrate the quantitative superiority of BunDLe-Net over commonly used and state-of-the-art neuronal manifold learning techniques in terms of dynamic and behavioural information in the learned manifold on calcium imaging data recorded in the nematode *C. elegans*. Qualitatively, we show that BunDLe-Net learns highly consistent manifolds across multiple worms that reveal the neuronal and behavioural motifs that form the building blocks of the neuronal manifold.

## 1 Introduction

Advances in neuronal imaging techniques have increased the number of neurons that can be recorded simultaneously by several orders of magnitude [1, 2]. While these advances greatly expand our abilities to study and understand brain function, the complexities of the resulting high-dimensional data sets pose non-trivial challenges for data analysis and visualisation. Fortunately, individual neurons are embedded in brain networks that collectively organise their high-dimensional neuronal activity patterns into lower-dimensional neuronal manifolds [3, 4].

To understand the collective organisation of individual neurons into brain networks, we require algorithms that learn neuronal manifolds from empirical data.

The goal of neuronal manifold learning is to find low-dimensional representations of data that preserve particular data properties. In neuroscience, a broad range of classical dimensionality reduction techniques is being employed, including but not limited to principal component analysis (PCA), multi-dimensional scaling (MDS), Isomap, locally linear embedding (LLE), Laplacian eigenmaps (LEM), t-SNE, and uniform manifold approximation and projection (UMAP) [5]. More recently, advances in artificial intelligence in general and deep learning methods, in particular, have given rise to a new class of (often non-linear) dimensionality reduction techniques, e.g., based on autoencoder architectures [6, 7, 8] or contrastive learning frameworks [9].

Common to all these techniques is their goal to reduce the data dimensionality while preserving particular properties of or information in the data. For instance, autoencoder-based frameworks typically focus on finding low-dimensional data representations that allow a good (or even perfect) reconstruction of the original, high-dimensional data. In contrast, we argue that reconstruction quality is only one out of several desirable features for neuronal manifold learning. First, and in line with the argument by Krakauer et al. [10] that neuroscience needs behaviour, we argue that a neuronal manifold learning algorithm should aspire to represent not all characteristics of high-dimensional neuronal activity patterns but only those that are relevant in a given behavioural context. For instance, when studying an animal’s ability to navigate a maze using visual cues, neuronal activity patterns that carry auditory or olfactory information are irrelevant in the behavioural context and should be abstracted away to better reveal the intricate relationships between neuronal representations of the visual cues and motor behaviour. Second, we argue that the reconstruction of the dynamics of the neuronal activity patterns should also take into account whether the low-dimensional embedding is causally sufficient in terms of the system’s dynamics. To elaborate on this issue, consider the example of using a dimensionality reduction technique to learn the physical state description of a simple pendulum from a video stream showing the pendulum in action. Ideally, the dimensionality reduction technique should learn to represent the position and momentum of the pendulum for each video frame because these two variables constitute a full description of the system’s physical state.
In contrast, a dimensionality reduction technique that learns to represent the positions of the pendulum in the current and the past video frame only (without representing the pendulum’s momentum) would also allow for a good reconstruction of the dynamics of the pendulum. This is the case because the pendulum’s momentum, which is required to predict in which direction it will swing, can be approximately reconstructed from the difference in position across two video frames. However, this representation would not constitute a complete description of the actual physical state of the system. In analogy, a neuronal manifold learning technique should attempt to learn a complete physical state description of the underlying neuronal dynamics. Mathematically, this goal can be formulated as learning neuronal state trajectories that form a Markov chain because, in a Markov chain, the current state of the chain is causally sufficient for predicting the next state (in mathematical terms, the past and future states of the chain are statistically independent given the current state).

Here, we introduce a novel framework for neuronal manifold learning, termed the Behaviour and Dynamics Learning Network (BunDLe-Net). BunDLe-Net learns a low-dimensional Markovian representation of the neuronal dynamics while retaining all information about a given behavioural context. It is based on the architecture shown in Fig. 1, which consists of two branches. In the lower branch, the high-dimensional neuronal trajectories *X*_{t} are first projected via a mapping *τ* to a lower-dimensional, latent trajectory *Y*_{t}. A first-order transition model *T*_{Y} then predicts the difference between the current and the next state to arrive at an estimate of the latent state at time *t* + 1. Via the loss function *ℒ*_{Markov}, this predicted latent state is compared to the true latent state at time *t* + 1 in the upper branch, which is obtained by mapping the observed neuronal state *X*_{t+1} at time *t* + 1 through the same *τ* as in the lower branch. By jointly learning the mapping *τ* and the first-order transition model *T*_{Y} that minimise the loss function *ℒ*_{Markov}, we obtain a latent, low-dimensional time-series *Y*_{t} that is Markovian by construction. This is the case because the transition model *T*_{Y} acts as a bottleneck that constrains *τ* to the class of functions for which the current state of the system is sufficient to predict the next state, in the sense that previous states do not provide any additional information. However, this architecture is not yet sufficient to learn a meaningful latent data representation because a mapping *τ* that projects the neuronal state trajectories to a constant (*Y*_{t} = *c*) would also fulfil the criterion of Markovianity.
To obtain a meaningful latent representation, we additionally require that the behavioural context be decodable from the latent representation *Y*_{t}. This is achieved by adding the loss function *ℒ*_{Behaviour}, which measures the reconstruction error between the true behavioural labels *B*_{t+1} and those predicted from the latent representation *Y*_{t+1}. By jointly learning the mapping *τ* and the first-order state transition model *T*_{Y} that minimise the two loss functions *ℒ*_{Markov} and *ℒ*_{Behaviour}, the BunDLe-Net architecture learns low-dimensional Markovian representations of those aspects of the high-dimensional neuronal state trajectories that are relevant for a given behavioural context.
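As a concrete illustration, the two loss terms can be sketched in a few lines of numpy. The linear choices for *τ*, *T*_{Y}, and the behaviour predictor below are hypothetical stand-ins (the toolbox realises these as trainable network modules), and the sketch only shows how the losses are computed on toy data, not how they are minimised.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear realisations of BunDLe-Net's modules (illustration only).
W_tau = rng.normal(size=(100, 3)) * 0.1   # tau: R^100 -> R^3
W_T = rng.normal(size=(3, 3)) * 0.1       # T_Y: first-order transition model
W_dec = rng.normal(size=(3, 8))           # behaviour predictor: R^3 -> 8 classes

def tau(X):
    return X @ W_tau                      # latent embedding Y = tau(X)

def T_Y(Y):
    return Y @ W_T                        # predicts the *difference* to the next latent state

def predict_behaviour(Y):                 # softmax over the eight behavioural classes
    logits = Y @ W_dec
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy data: 500 time steps of 100-dim "neuronal" activity with behaviour labels.
X = rng.normal(size=(500, 100))
B = rng.integers(0, 8, size=500)

Y_t, Y_next = tau(X[:-1]), tau(X[1:])     # lower branch / upper branch
Y_pred = Y_t + T_Y(Y_t)                   # predicted latent state at time t+1

L_markov = np.mean((Y_pred - Y_next) ** 2)                    # dynamics loss
probs = predict_behaviour(Y_next)
L_behaviour = -np.mean(np.log(probs[np.arange(499), B[1:]]))  # cross-entropy loss
total_loss = L_markov + L_behaviour       # jointly minimised over all three modules
```

In the actual architecture, gradient descent on `total_loss` adapts *τ*, *T*_{Y}, and the decoder simultaneously, so that Markovianity and behavioural decodability shape the latent space together.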

We remark that BunDLe-Net is a generic architecture in the sense that each of its modules (the mapping *τ*, the state transition model *T*_{Y}, and the prediction model for the behaviour) can be realised by whichever models are most suitable for a certain type of neuronal data, e.g., linear or non-linear mappings implemented via (deep) neural networks or other modelling techniques. The BunDLe-Net architecture is available as a *Python* toolbox at https://github.com/akshey-kumar/BunDLe-Net.

In the following, we compare the BunDLe-Net architecture with other state-of-the-art neuronal manifold learning techniques on calcium imaging data recorded in the nematode *C. elegans* [11] and demonstrate its ability to uncover intricate relationships between neuronal activity patterns and behaviour that are not revealed by competing techniques.

## 2 Results

Here, we demonstrate how BunDLe-Net preserves vital information about behavioural dynamics while simultaneously enabling visually interpretable insights into the data. We start with a quantitative evaluation of BunDLe-Net and compare it with existing state-of-the-art neuronal manifold learning techniques. We then examine the visual interpretability of the embeddings of BunDLe-Net and competing algorithms. To ensure the robustness of our findings, we apply BunDLe-Net to five different worms and analyse the consistency of the embeddings in terms of their topology. The results highlight the generalisation abilities of BunDLe-Net, revealing similar patterns while maintaining individual differences across recordings. Finally, we show that BunDLe-Net is capable of embedding behaviours in distinct motifs based on the neuronal basis of the behaviour and its dynamics.

### 2.1 Description of data

We apply BunDLe-Net to calcium-imaging whole-brain data from the nematode *C. elegans* from the work by Kato et al. [11]. This dataset is ideal for demonstrating the capabilities of BunDLe-Net due to its high-dimensional neuronal recordings labelled with motor behaviour^{1}, multiple animal recordings, eight different behavioural states, and multiple repetitions of behavioural states over time. It includes time-series recordings of neuronal activation from five worms with human-annotated behaviours for each time frame. The recordings consist of approximately 2500–3500 time samples spanning around 18 minutes (sampled at *∼*2.9 Hz), during which around 100–200 neurons are recorded. A low-pass filter with a cut-off frequency of 0.07 Hz is applied to mitigate high-frequency noise in the raw neuronal traces. Not all recorded neurons could be identified; hence, only a subset is labelled for each worm, with different yet overlapping subsets identified across worms. The human-annotated behaviours *B* denote the motor state of the worm at a given instant of time and can take on one of eight states: forward, slowing, dorsal turn, ventral turn, no-state, sustained reversal, reversal-1, and reversal-2.
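For illustration, such a low-pass step can be sketched with a simple FFT-based filter. This brick-wall design is a hypothetical stand-in (the original filter design is not specified here); the sampling rate and cut-off follow the values quoted above.

```python
import numpy as np

def lowpass(trace, fs=2.9, cutoff=0.07):
    """Zero out all Fourier components above `cutoff` Hz.
    `fs` is the sampling rate in Hz; `trace` is a 1-D neuronal trace."""
    F = np.fft.rfft(trace)
    freqs = np.fft.rfftfreq(len(trace), d=1.0 / fs)
    F[freqs > cutoff] = 0.0
    return np.fft.irfft(F, n=len(trace))

# Demo: a slow (0.02 Hz) component survives, a fast (0.5 Hz) one is removed.
t = np.arange(1000) / 2.9
clean = np.sin(2 * np.pi * 0.02 * t)
noisy = clean + np.sin(2 * np.pi * 0.5 * t)
filtered = lowpass(noisy)
```

In practice, a smoother filter (e.g., a Butterworth design with zero-phase filtering) would avoid the ringing artefacts a hard frequency cut can introduce.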

### 2.2 Quantitative evaluation against competing methods

We evaluate a latent space representation based on how well it preserves behavioural and dynamical information. To estimate the *behavioural information* of an embedding, we train a simple feed-forward neural network in a supervised setting^{2} to predict behaviour from the embedding. The decoding accuracy is then used as a metric for the information content about *B* in the embedding, with the decoding accuracy obtained on the raw, high-dimensional neuronal traces serving as the baseline. To evaluate the *dynamical information* in the embedding, we train an ANN autoregressor to predict *Y*_{t+1} from *Y*_{t}. The mean squared error between the predicted and true *Y*_{t+1} is estimated. From this, we compute a predictability metric for the dynamics, defined as 1 *−* MSE_{m}*/*MSE_{io}, where MSE_{m} is the mean squared error of the model, and MSE_{io} is the mean squared error of a trivial autoregressor that copies its input to the output. We trained all evaluation models on a training set of the embedded data and performed the evaluation on a held-out test set to prevent overfitting (for more details, see Model validation in Section 4.3).
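The dynamics predictability metric defined above can be sketched as follows (function and variable names are ours; the trivial autoregressor simply copies *Y*_{t} as its prediction for *Y*_{t+1}):

```python
import numpy as np

def dynamics_predictability(Y_next_true, Y_next_pred, Y_current):
    """1 - MSE_m / MSE_io: MSE_m is the evaluated model's error, MSE_io the
    error of a trivial autoregressor that copies its input to its output."""
    mse_m = np.mean((Y_next_pred - Y_next_true) ** 2)
    mse_io = np.mean((Y_current - Y_next_true) ** 2)
    return 1.0 - mse_m / mse_io

# Sanity checks on a toy latent trajectory (random walk):
Y = np.cumsum(np.random.default_rng(1).normal(size=(100, 3)), axis=0)
perfect = dynamics_predictability(Y[1:], Y[1:], Y[:-1])   # ideal predictor
trivial = dynamics_predictability(Y[1:], Y[:-1], Y[:-1])  # copy-input baseline
```

A perfect predictor scores 1, the copy-input baseline scores 0, and models worse than the baseline score negative.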

With the stage set for evaluation, we compare BunDLe-Net with other algorithms that are commonly used to learn high-level representations in neuroscience, such as PCA, t-SNE, an autoencoder, an ANN autoregressor with an autoencoder architecture (ArAe)^{3}, and CEBRA-hybrid^{4}. A description of these methods can be found in Section 4.4. All embedding spaces were chosen to be 3-dimensional for ease of comparison across algorithms and for visualisation purposes.

Figure 2 presents the outcomes of our quantitative comparison, showcasing dynamical and behavioural prediction metrics in the left and right panels, respectively. Each panel depicts the predictability metric on the y-axis and the manifold learning technique on the x-axis, while the violin plots portray the metric’s distribution across all five worm datasets. The substantial variability across these plots underscores the diverse behavioural and dynamical attributes inherent in the dataset of each worm. For the dynamics evaluations, we compare all the models to a baseline model, which simply copies the input *Y*_{t} to the output as the predicted value for *Y*_{t+1}. For the behaviour evaluation, we compare with a chance level behavioural decoding accuracy obtained by randomly shuffling the behavioural labels. We also compare it with the behavioural decoding accuracy from the raw neuronal traces.

Turning to the results, we see that BunDLe-Net outperforms all other methods, including the state-of-the-art CEBRA, by a large margin. In the left panel, unsupervised methods like PCA, t-SNE, and the autoencoder show limited improvement over the baseline in predicting dynamics. Since they try to preserve maximum variance in the data in a low-dimensional space, they neglect to preserve minor details that may be crucial in determining future time dynamics. CEBRA-hybrid, which also takes temporal information into consideration, does not perform better than the baseline model. The autoregressive-autoencoder, which seeks to reconstruct *X*_{t+1} from *X*_{t}, preserves some dynamical information and is seen to outperform PCA, t-SNE, and the autoencoder. Nonetheless, ArAe’s reconstruction of the entire neuronal state at time *t* + 1 can lead to irrelevant details persisting in latent space embedding. In contrast, BunDLe-Net’s design focuses exclusively on retaining information pertinent to the latent space state at time *t* + 1, which results in a markedly superior performance even compared to ArAe.

Shifting our attention to the right panel, all models surpass chance-level behaviour decoding accuracy. Notably, both CEBRA-h and the unsupervised methods (PCA, t-SNE, autoencoder) exhibit roughly the same performance on average. Despite this, their average decoding accuracy remains notably lower than neuronal-level decoding accuracy, indicating an inability to capture behavioural information at the neuronal level completely. Although ArAe worked slightly better at preserving dynamical information, it falls short in preserving behavioural information. This suggests that unsupervised preservation of dynamical attributes alone does not suffice for constructing behaviourally relevant models. In this regard, BunDLe-Net stands out by retaining all behavioural information, as originally intended. On average, it even rivals the decoding performance achieved with raw neuronal data.

Of particular interest is the comparison between CEBRA-h and BunDLe-Net in terms of their respective performances. Despite incorporating behavioural information in addition to dynamics, CEBRA-h demonstrates only marginal improvements over other models. In contrast, BunDLe-Net rises above all other methods, excelling in both behavioural and dynamical metrics. This highlights BunDLe-Net’s proficiency in effectively retaining crucial neuronal-level information relevant to behaviour analysis and modelling. For further evaluation of behavioural and dynamical performance of BunDLe-Net’s embedding, please refer to Appendix A.

### 2.3 Visual interpretability of embeddings

In this section, we analyse the embeddings of BunDLe-Net and other competing neuronal manifold learning techniques. We visualise the embeddings of Worm-1 in 3D and evaluate them qualitatively based on their structure and interpretability. We generalise the insights to all worms in the next section. Figure 3 shows the embeddings of Worm-1 by a) PCA, b) t-SNE, c) Autoencoder, d) Autoregressor-Autoencoder (ArAe), e) CEBRA-hybrid, and f) BunDLe-Net. In a), b) and c), we observe a noticeable drift in the PCA, t-SNE, and autoencoder embeddings. This drift drags out the dynamics in time, which is undesirable since we are searching for consistent mappings independent of time. The drift is also seen to obscure the recurrent nature of the dynamics to a large extent in b). The source of this drift could be a calcium imaging artefact or some neuronal dynamics irrelevant to our behaviour of interest. Since these models aim to preserve maximum variance for full-state reconstruction, they inadvertently embed the drift.

In contrast, in Figure 3 d), e), f), we see that this drift is largely absent, and the recurrent dynamics are more evident. These models have a common characteristic: they consider dynamics without attempting to reconstruct the entire neuronal state. Among the three methods shown, ArAe is unsupervised, while CEBRA and BunDLe-Net take behaviour into account. In both d) and e), we observe reasonably separated behaviours with minor trajectory overlaps. However, both embeddings demonstrate high variance *within* a trajectory of a given behaviour. In contrast, BunDLe-Net produces compact bundles that are well-separated from one another. The variance is low within each bundle, while a high variance is observed between different bundles. Consequently, BunDLe-Net’s embedding exhibits distinct behavioural trajectories that are well-separated and along which the dynamics recur in an orbit-like fashion.

Additionally, in e), we observe that CEBRA-h tends to embed the neuronal activity on the surface of a sphere, which may be an artefact resulting from the contrastive learning paradigm. As a consequence, trajectories may be forced to intersect at certain points. Such intersection points are generally undesirable because they introduce ambiguity about the future trajectory. Ideally, intersection points should only occur when there is genuinely no information available about the subsequent behavioural trajectory.

In stark contrast, BunDLe-Net’s trajectories demonstrate a markedly different pattern, characterised by high compactness and sparse intersections. Figure 3 f) reveals precisely three intersection points: sustained reversal, ventral turn, and forward. (See supplementary material https://github.com/akshey-kumar/BunDLe-Net/tree/main/figures/rotation_comparable_embeddings for rotating 3-D plots.) These intersections and bifurcations could be interpreted as instances where BunDLe-Net encountered a lack of information about future trajectories.

### 2.4 Consistency of neuronal manifolds across worms

Here, we apply BunDLe-Net to all five worms in the dataset to visually compare the embeddings regarding their consistency and/or any differences that arise across worms.

To produce comparable embeddings^{5}, we first trained a model on each worm separately. We then extracted the *T*_{Y} layer and behaviour predictor layer from the model with the lowest loss (Worm-1, in this case). We then trained fresh models on each worm, with the chosen *T*_{Y} and behaviour predictor layers from Worm-1 frozen in, until the losses converged. Thus, the new models only had to learn the mapping *τ* for each worm while the other layers remained unchanged throughout the learning process. Notably, this approach was feasible even though different neurons were recorded in each worm. By adopting this strategy, we ensured consistent geometries across the worms, allowing us to effectively compare differences in topology, should they be present.

The embeddings are illustrated in Figure 4. A latent dimension of three was again chosen for ease of visualisation, and can also be justified by a graph-theoretical argument detailed in Section 4.3. Examining Figure 4, we observe a branching structure in the trajectories of all the worms. For now, let us consider Worm-1. The dynamics exhibit bundling of several segments, leading to recurring patterns along these bundles. Within each branch, the dynamics are predominantly deterministic, while probabilistic *decisions* occur only at specific bifurcation points in the trajectories. We disregard bundles consisting of only one or two segments and identify five prominent bundles in Worm-1, which can be described as follows:

(*C*_{1}) : sustained reversal *→* ventral turn

(*C*_{2}) : ventral turn *→* slowing *→* reversal-1 *→* sustained reversal

(*C*_{3}) : ventral turn *→* forward

(*C*_{4}) : sustained reversal *→* dorsal turn *→* forward

(*C*_{5}) : forward *→* slowing *→* reversal-2 *→* sustained reversal

These five motifs define the generic building blocks of the neuronal manifold in the sense that the neuronal trajectories are almost deterministic within each motif, and probabilistic bifurcations occur at the transitions between motifs. As can be readily checked in Figure 4, these building blocks are highly consistent across worms, with similar behavioural motifs emerging across all worms. For example, motif *C*_{2} is consistently present in the embeddings of all worms, forming a loop. The same holds true for motifs *C*_{1} and *C*_{5}. However, motif *C*_{4} is not present in all worms and is notably absent in Worm-4. Instead, both Worm-4 and Worm-5 exhibit a slightly different motif (sustained reversal *→* dorsal turn *→* slowing). This variation in motifs may be due to the recording times, which may have been too short to capture all possible transitions for a given animal.

It is noteworthy that even though the individual worm recordings do not share an identical subset of neurons, the embeddings share a basic topological structure with only minor variations in transitions and bifurcation points. These results demonstrate consistency in the embeddings across worms while preserving individuality in the behavioural dynamics in each worm and recording session.

### 2.5 Embedding of states in distinct behavioural motifs

Behaviour can be modelled at different levels of granularity. In the present data set, the worms’ behaviour is described in terms of high-level behavioural patterns such as forward and reversal movements. Alternatively, one could analyse the angular positions and velocities of the various segments of the worms’ bodies, resulting in a more fine-grained representation. Both fine-grained and coarse-grained models hold value in specific contexts. However, it is crucial to maintain consistency within a model’s state space to describe the dynamics accurately. If we utilise a model to understand fine-grained elements but only have access to coarse-grained information, the resulting model will be incomplete or inconsistent in the sense that it lacks the essential information required to predict features of the behavioural dynamics at the desired level of granularity. Here, we demonstrate how BunDLe-Net adeptly handles the coarse-graining of data while still preserving the crucial distinctions between states that are instrumental in explaining the overall dynamics.

We present the discovery of two distinct behavioural states with identical labels, based on BunDLe-Net’s neuronal embedding with respect to the given set of behaviours. Consider branches *C*_{2} and *C*_{5} of the trajectory in Figure 5. The *slowing* behaviour (in pink) occurs in both of these branches, yet the two occurrences are represented distinctly in the latent space and are not fused together even though they have been assigned the same behavioural label. Branch *C*_{2} has a much shorter *slowing* segment than branch *C*_{5}. We name the new behavioural states corresponding to *C*_{2} and *C*_{5} *slowing 1* and *slowing 2*, respectively. These different types of slowing movements are embedded in distinct behavioural motifs since they differ in their neuronal realisation and their relevance for the model dynamics, i.e., one would predict different future trajectories depending on whether the state is *slowing 1* or *slowing 2*. We note that this is not the case for other behavioural states, e.g., the sustained reversal (in brown), for which all trajectories form one coherent bundle in the embedding. This implies that in the behavioural state of a sustained reversal, BunDLe-Net found no information at the neuronal level to predict whether a dorsal or ventral turn is more likely to occur next. In summary, BunDLe-Net can maintain distinct representations or fuse trajectories depending on whether dynamical information about future behaviours is present. Accordingly, if provided with a set of behaviours that is not consistent or complete enough for the construction of a full dynamical model, BunDLe-Net can discover extra distinctions or *states* that complete this set of behaviours, provided this information is present at the neuronal level.

## 3 Discussion

We have demonstrated the superiority of BunDLe-Net to other neuronal manifold learning techniques on calcium imaging data recorded in *C. elegans*. However, BunDLe-Net can easily be extended to other imaging modalities and model organisms by adapting its learning modules (for the latent embedding function *τ*, the state transition model *T*_{Y}, and the behavioural decoding layer) while maintaining the overall structure shown in Figure 1. As such, BunDLe-Net is not one algorithm but a generic architecture for learning consistent state representations from neuronal data based on simple but vital principles. In the following, we further elaborate on the relevance of these principles for neuronal manifold learning.

On a fundamental level, the concept of a neuronal manifold can be interpreted as a scientific discovery that sheds new light on how large numbers of neurons coordinate their activities to represent information, implement computations, and generate behaviour. In this view, the goal of neuronal manifold learning techniques is to reveal the true, intrinsic structure of the neuronal manifold from empirical data. Alternatively, neuronal manifold learning algorithms can be interpreted as data compression and visualisation techniques. In this view, the particular shape of a neuronal manifold results from a model-based dimensionality-reduction technique that attempts to preserve certain data properties. Notably, these two viewpoints are not mutually exclusive, i.e., the observed shape of the neuronal manifold may be influenced by its intrinsic structure as well as by the particularities of the dimensionality reduction technique.

Indeed, our results in Figure 3 show substantial qualitative differences in the manifolds across various learning techniques, indicating that different model assumptions inherent to the various algorithms influence the shapes of the learned manifolds. On the other hand, the results obtained by BunDLe-Net shown in Figure 4 demonstrate that highly consistent manifolds can be learned across multiple animals, supporting the concept of an intrinsic structure of the neuronal manifold.

Remarkably, BunDLe-Net achieves this consistency despite only 22 out of more than 100 neurons per animal being shared across the five data sets. We attribute this ability to reconstruct consistent manifolds to the time-delayed embedding of the neuronal dynamics for learning the latent dynamics (cf. Section 4.1), which, due to Takens’ theorem [12], allows the reconstruction of a Markovian representation of a dynamical system (i.e., the neuronal dynamics on the manifold) regardless of the specific observation function (i.e., the recorded neurons for each worm). We note that the number of time lags that need to be considered in this embedding is determined in BunDLe-Net by minimising the Markovian loss function *ℒ*_{Markov}, i.e., the number of time lags is increased until no further decrease in the loss function is observed.
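A minimal sketch of such a time-delay embedding is given below; the lag count is left as a free parameter here, whereas BunDLe-Net selects it by monitoring *ℒ*_{Markov}.

```python
import numpy as np

def delay_embed(X, n_lags):
    """Concatenate each sample with the n_lags samples that follow it, so
    row i holds [X[i], X[i+1], ..., X[i+n_lags]].
    Shapes: (T, d) -> (T - n_lags, d * (n_lags + 1))."""
    T = X.shape[0]
    return np.hstack([X[lag: T - n_lags + lag] for lag in range(n_lags + 1)])

# Example: 5 time steps of 2-dim data, embedded with two lags.
X = np.arange(10, dtype=float).reshape(5, 2)
E = delay_embed(X, n_lags=2)   # shape (3, 6)
```

Each row of `E` is a short window of the recording; per Takens’ theorem, such windows can recover the Markovian state of the underlying dynamics even when only a subset of neurons is observed.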

Together with the constraint that the behavioural information must be preserved, BunDLe-Net’s ability to learn a Markovian latent embedding results in almost deterministic trajectories that only exhibit a high degree of randomness at a discrete number of bifurcation points (cf. Section 2.3 and Figure 4). This distinction in the neuronal dynamics between periods of high certainty with apparently random behaviour at a discrete number of bifurcation points renders the neuronal manifold of *C. elegans* particularly interesting. Specifically, we interpret the almost deterministic trajectory bundles as the basic building blocks of the neuronal manifold that are fused together at the bifurcation points to create the manifold’s intrinsic structure.

The bifurcations act as decision points regarding the worm’s future behaviour. However, it is presently unclear how *C. elegans* makes these decisions. In general, the randomness in the bifurcation points could be due to intrinsic randomness in the neuronal activity or due to latent, unobserved neurons, i.e., observing these neurons could disentangle the bifurcation points and again result in deterministic trajectories. However, BunDLe-Net’s ability to learn Markovian representations would disentangle the bifurcation points if such information were present in the time delay embeddings of the neuronal dynamics. Because this is not the case, our empirical results align with an interpretation in which the randomness in the bifurcation points is intrinsic neuronal noise. However, we remark that such randomness might be overwritten by external stimuli, which were not part of the experimental design.

Regardless of the nature of the noise in the bifurcation points, the learned neuronal manifolds reveal the behavioural flexibility of *C. elegans* in the context of its neuronal dynamics. In particular, they reveal when, i.e., at which points on the neuronal manifold, *C. elegans* makes decisions about its future behaviour. As such, we predict that external perturbations of the neuronal activity, e.g., by optogenetic stimulation, are most effective when applied at times when the neuronal state is in one of the bifurcation points. Conversely, we hypothesise that the neuronal dynamics are more robust against external perturbations if these are applied when the neuronal dynamics follow one of the highly deterministic trajectory bundles. To generalise from this argument, we consider neuronal manifold learning algorithms in general and BunDLe-Net in particular to be of extraordinary utility in neuroscience because these methods allow us to make empirically testable predictions on how large-scale neuronal dynamics are coordinated to generate behavioural flexibility.

To conclude this article, we outline several potential extensions of BunDLe-Net. First, we note that we have only presented the application of BunDLe-Net to discrete behaviours. Extensions to continuous behaviours can be implemented by adapting the behavioural prediction layer or, in a less elegant fashion, by discretising continuous behaviours. Second, it would be interesting to consider the extension of BunDLe-Net to multiple non-mutually exclusive behaviours to study how large-scale neuronal activity coordinates multi-dimensional behaviours. Naturally, this approach could be extended to include stimuli to study how external information is encoded in neuronal manifolds and translated into behaviour. Each of these changes would merely require adapting the behavioural prediction layer. Regarding the learning module for the latent embedding, we note the growing body of literature on the topic of (causal) representation learning. Representation learning addresses the problem of learning high-level (causal) variables from low-level observations [13, 14]; a topic with potentially rich synergies with neuronal manifold learning that are yet to be explored.

## 4 Methods

In this section, we first provide further information on the theoretical principles that motivate BunDLe-Net. Subsequently, we elaborate on the architectural framework that arises from these principles. We then proceed to provide a comprehensive overview of BunDLe-Net’s implementation, encompassing the learning modules and the details of the training process. Finally, we present the competing methods that serve as benchmarks for evaluating the performance of BunDLe-Net.

### 4.1 Theoretical principle

BunDLe-Net employs a fundamental theoretical principle to embed neuronal data with respect to a given set of behaviours. The core idea is to ensure that the resulting embedding *Y* contains all information about the dynamics and behaviour that is present at the neuronal-level *X*. To elucidate this concept, consider the diagram in Figure 6, where *T*_{X} denotes a transition model at the *X* level. For illustrative purposes, we presently assume that the *X* level is Markov, but will later relax this assumption. The embedding *Y* is obtained by applying a function *τ* on the *X* level. Generally, the resulting transition model at the *Y* level may not be Markov, implying that *Y*_{t} might not fully capture the information about *Y*_{t+1} present in the system, whether at the *X* level or in the past states *Y*_{t−n}, where *n* ∈ ℤ^{+}. Such an embedding would be of limited use since one might need to refer back to the *X* level to answer certain questions about the *Y* level.

To ensure a more comprehensive and self-contained embedding, we aim for *Y* to be Markov and independent of the *X* level. This requires the diagram (Figure 6) to commute, i.e., it should not make a difference if we first time-evolve and then transform with *τ*, or the other way round. Put in terms of conditional independence, our requirement takes the form *Y*_{t+1} *⊥ X*_{t}|*Y*_{t}, meaning that knowledge of *X*_{t} provides no additional information about *Y*_{t+1} beyond what is already known from *Y*_{t}. In this way, the dynamics at the *Y* level are self-contained and *sealed-off* from the details at the *X* level. This is what makes our embedding so useful and interpretable: our embedding has all the relevant information from the *X* level, enabling it to be viewed as a distinct and meaningful dynamical process in its own right.
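In symbols, the commutativity and conditional-independence requirements read:

```latex
\tau\!\left(T_X(X_t)\right) \;=\; T_Y\!\left(\tau(X_t)\right)
\qquad \Longleftrightarrow \qquad
Y_{t+1} \,\perp\, X_t \mid Y_t .
```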

#### Non-Markovian neuronal dynamics

To handle non-Markov neuronal dynamics at *X*, we consider time windows that include the previous *n* time steps, i.e., (*X*_{t}, …, *X*_{t−n}) as input to our model. By choosing a large enough value for *n*, we can ensure that the resulting process becomes Markov [12], allowing us to model it in the same way as described above. Note that while earlier we were mapping a single time slice to a point in latent space, now we are mapping an entire time window of length *n* to a single point in latent space. Such a transformation does not merely coarse-grain over the neuronal or *spatial* level of granularity but also over the *temporal* domain of patterns.
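As a concrete illustration, the windowing step can be sketched in a few lines (a minimal NumPy sketch; the function name `time_windows` is ours, not from the BunDLe-Net code):

```python
import numpy as np

def time_windows(X, n):
    """Map each time point t >= n to the window (X_{t-n}, ..., X_t).

    X is a (T, d) array of neuronal traces; the result has shape
    (T - n, n + 1, d): one model input per remaining time point, so the
    encoder coarse-grains over both neurons and time.
    """
    T, _ = X.shape
    return np.stack([X[t - n:t + 1] for t in range(n, T)])
```

Each window, rather than a single time slice, is then mapped by *τ* to one point in the latent space.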

#### Learning meaningful embeddings

While the requirement of a Markov embedding may be very useful in terms of elegance and interpretability, it is not sufficient to ensure meaningful embeddings. For example, consider a transformation *τ* that uniformly maps every neuronal state to a constant. In this scenario, the resultant process would exhibit Markov dynamics as a single-state process. However, such an embedding fails to yield any meaningful insights regarding the underlying dynamics or behaviour. Remarkably, for BunDLe-Net, such a process would yield a perfect *ℒ* _{Markov} loss, irrespective of the input data.

An additional requirement must be imposed to avoid such *trivial* embeddings. We demand that the behaviour *B* can be decoded from the embedding, thereby preventing the transformation from reducing everything to a mere constant. By upholding this crucial condition, we preserve the behavioural intricacies that render the embedding purposeful and informative, aligning with the ideals espoused by Krakauer et al. [10].

### 4.2 BunDLe-Net architecture

Here, we explain how the BunDLe-Net’s architecture in Figure 1 arises from the commutativity diagram of Figure 6. The upper and lower arms in the architecture correspond to the possible paths from *X*_{t} to *Y*_{t+1} in the commutativity diagram. The lower arm in the architecture involves first coarse-graining *X*_{t}, followed by implementing a transition model on the Y-level. In practice, the transition model outputs Δ*Y*_{t}, from which *Y*_{t+1} is estimated as *Y*_{t} + Δ*Y*_{t}. Since the transition model *T*_{Y} outputs *Y*_{t+1} with only *Y*_{t} as input, the Y-level is first-order Markov by construction. The upper arm of BunDLe-Net coarse-grains the time-evolved *X*_{t+1}^{6}. Both arms result in estimates of *Y*_{t+1}, which we distinguish by the upper indices *U* (upper arm) and *L* (lower arm). We add a mean-squared error term to our loss function that forces *Y*^{U}_{t+1} and *Y*^{L}_{t+1} to be equal, thus ensuring that our requirement of commutativity in Figure 6 is satisfied,

$$\mathcal{L}_{\text{Markov}} = \left\| Y^{U}_{t+1} - Y^{L}_{t+1} \right\|_2^2 .$$

The estimated *Y*_{t+1} is then passed through a predictor layer which learns to output the behaviour *B*_{t+1} given *Y*_{t+1}. Correspondingly, we add a term *ℒ*_{Behaviour} to our loss function, which forces the predicted behaviour to match the true behaviour. This ensures that *Y*_{t} contains the same amount of information about *B*_{t} as *X*_{t}. Here, we use the cross-entropy loss

$$\mathcal{L}_{\text{Behaviour}} = -\sum_{j} B^{(j)}_{t+1} \log \hat{B}^{(j)}_{t+1},$$

where *B*^{(j)}_{t+1} represents the *j*-th component of a one-hot encoded label vector of *B*_{t+1}, and *B̂*^{(j)}_{t+1} is the softmax output of the predictor.

Both terms are weighted by a hyper-parameter *γ*, and the loss function is given as

$$\mathcal{L} = \gamma\,\mathcal{L}_{\text{Markov}} + (1-\gamma)\,\mathcal{L}_{\text{Behaviour}}.$$
All the layers in BunDLe-Net are learned simultaneously, and both loss terms ensure that the learned *τ* and *T*_{Y} preserve information about the behavioural dynamics. An open-source Python implementation of the BunDLe-Net architecture is available at https://github.com/akshey-kumar/BunDLe-Net.
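The two arms and the combined loss can be sketched with plain NumPy (a schematic, not the released implementation; the weight matrices `W`, `A`, `P` and single-layer forms are illustrative stand-ins for the actual learned layers):

```python
import numpy as np

def tau(x, W):
    """Encoder sketch: ReLU layer followed by normalisation to unit length."""
    h = np.maximum(x @ W, 0.0)
    return h / (np.linalg.norm(h, axis=-1, keepdims=True) + 1e-8)

def transition(y, A):
    """T_Y sketch: outputs dY_t, so the next latent state is Y_t + dY_t."""
    return y + y @ A

def bundle_loss(x_t, x_t1, b_onehot, W, A, P, gamma=0.9):
    y_lower = transition(tau(x_t, W), A)   # lower arm: embed, then time-evolve
    y_upper = tau(x_t1, W)                 # upper arm: embed the time-evolved data
    L_markov = np.mean((y_upper - y_lower) ** 2)   # commutativity (MSE) term
    logits = y_lower @ P                   # behaviour predictor layer
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)     # softmax over behaviour classes
    L_behaviour = -np.mean(np.sum(b_onehot * np.log(p + 1e-12), axis=-1))
    return gamma * L_markov + (1 - gamma) * L_behaviour
```

Since both terms are minimised jointly, gradients flow through *τ*, *T*_{Y}, and the predictor simultaneously, as described above.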

### 4.3 Learning modules

#### Architecture parameters

The *τ* layer (encoder) of our network consists of a series of ReLU layers [15], followed by a normalisation layer. An encoder of identical architecture is used later in the autoregressor-autoencoder (ArAe) model to facilitate comparison across models. For the *predictor* and *T*_{Y} layer, we use a single dense layer each. In the case of our dataset, this sufficed to achieve good performance. For other data sets, more complex layers may be required. For *T*_{Y}, we also add a normalisation layer so that the output remains on the scale of the latent space learned by *τ*. The details of the individual layers are provided in the Python code in Appendix B.

#### Gaussian noise against overfitting

To safeguard against overfitting of the model, we introduce Gaussian white noise in the latent space by incorporating it in the *τ* layer. Injecting Gaussian white noise is a well-established regularisation technique that makes the model robust to overfitting [16, 17]. Since we are working with relatively limited data in the context of artificial neural networks, guarding against overfitting becomes particularly crucial.

#### Latent space dimensionality

We choose the dimensionality of the *Y*-level to be three. This is because, in 3-D, we can connect any finite number of points without the edges crossing each other. This allows for embeddings of neuronal activity in the form of trajectories with nodes and edges that do not intersect. This might not always be possible in 2-D, where one can have a constellation of data points that cannot be connected without crossings. It is, however, possible to embed any arbitrary graph in three dimensions without the edges having to intersect [18].

Intersection points are undesirable for the embedding of a dynamical process due to the ambiguity they introduce. A meaningful embedding should exhibit smooth trajectories without self-intersections. An intersection point of two trajectories would mean that the past state at time (*t −* 1) contains additional information about the future state (*t* + 1) beyond what the present state at (*t*) provides, thus rendering the dynamics non-Markovian. Avoiding such intersections and non-Markovian dynamics enhances the interpretability of the embedded data and allows an enhanced prediction of future dynamics.

#### Model validation / parameter tuning

To determine the optimal parameters for the model, including the number and types of layers, we use a held-out validation set on Worm-1. The neuronal and behavioural data of Worm-1 are partitioned into seven contiguous folds along the time axis, and one fold is randomly selected as the validation set; the remaining data form the training set. By holding out an entire contiguous fold, we ensure that measured performance reflects generalisation to unseen data. This would not be the case if we created the validation set by *iid* (independent and identically distributed) sampling, due to the high temporal correlations in the time series. After selecting the optimal model parameters through validation on Worm-1, we train models with the same parameters on the other worms. Since we only use Worm-1 for parameter tuning, good performance on the other worms indicates that the model's success is not due to overfitting.
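The fold construction can be sketched as follows (a minimal NumPy sketch; `contiguous_split` is our illustrative name, not a function from the BunDLe-Net code):

```python
import numpy as np

def contiguous_split(n_timesteps, n_folds=7, val_fold=0):
    """Partition time indices into contiguous folds and hold one out.

    Unlike iid sampling, the validation set is a single contiguous block,
    so temporally correlated neighbours of validation points do not leak
    into the training set (except at the two fold boundaries).
    """
    folds = np.array_split(np.arange(n_timesteps), n_folds)
    val = folds[val_fold]
    train = np.concatenate([f for i, f in enumerate(folds) if i != val_fold])
    return train, val
```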

#### Training details

Since the neuronal data was found to be non-Markovian^{7}, we use time-windows of length 15 as input to BunDLe-Net. Reducing the window length decreased model performance while increasing it further had no significant effect. Training was performed with the ADAM optimiser [19] with a learning rate of 0.001 and batch size of 100. The *γ* parameter of BunDLe-Net was chosen to be 0.9 to ensure that *ℒ*_{Markov} and *ℒ*_{Behaviour} are of roughly the same order of magnitude during training (see Figure 7). We trained BunDLe-Net until the losses converged.

### 4.4 Description of competing methods

Here, we describe the other commonly-used neuronal manifold learning algorithms used in the comparison. All models project the *C. elegans* data to a three-dimensional space to enable a fair comparison. A full implementation of the various models, training process, and evaluation procedures can be found at https://github.com/akshey-kumar/comparison-algorithms.

#### PCA

Principal component analysis [20] has been applied to neuronal datasets to enable visualisation and interpretation of the data. It is a linear transformation that aligns the data along the directions of maximum variance. Typically, the first three principal components are chosen and plotted in 3-D space [11]. The resulting trajectories can provide a rough perspective of the neuronal dynamics at a high level. Since this is a commonly-used method to coarse-grain data, we use PCA as our first baseline model.
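For reference, projecting a (time × neurons) trace matrix onto its first three principal components amounts to the following (a minimal NumPy/SVD sketch, equivalent to standard PCA implementations):

```python
import numpy as np

def pca_3d(X):
    """Project traces X (time x neurons) onto the top-3 variance directions."""
    Xc = X - X.mean(axis=0)                        # centre each neuron's trace
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:3].T                           # (time x 3) scores for plotting
```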

#### t-SNE

t-distributed stochastic neighbour embedding (t-SNE) is a popular tool for visualising high-dimensional data, including neuronal data [21, 22]. It is a non-linear dimensionality reduction method that aims to preserve the local neighbourhood structure of the data points.

#### Autoencoder

Arguably, autoencoders (or some variant thereof) are currently among the most widely used methods for learning low-dimensional representations of data [6, 7]. Typically, an autoencoder learns a representation by attempting to reconstruct the training data using an ANN composed of an encoder and decoder [23]. Here, we consider the deterministic vanilla autoencoder with a deep encoder and decoder. The depth of the layers, number of neurons, and other training-related hyperparameters were tuned to obtain reasonably optimal performance.

#### Autoregressor-autoencoder (ArAe)

An autoregressor is generally used on time-series data to predict the future state from the past. Here, we implement an autoregressor as an ANN with an autoencoder-like architecture^{8} and refer to it as ArAe. Such architectures have been used before to learn low-dimensional representations of time-series data [6, 24]. Our ArAe is an ANN with a deep encoder and decoder that predicts *X*_{t+1} given *X*_{t} as input, with *Y*_{t} as the latent space, as seen in Figure 9.

#### CEBRA

CEBRA [9] is a state-of-the-art neuronal manifold learning technique. It uses contrastive learning to optimise the encoding of data by maximising the similarity between related samples and minimising the similarity between unrelated samples. The algorithm employs neural network encoders and a similarity measure to optimise the embeddings based on user-defined or time-only labels. In our experiments, we used CEBRA-hybrid, which takes both behaviour and time dynamics into account for the embedding.

## A Further evaluation of BunDLe-Net’s embedding

In the following, we provide further information to build an intuition for the behavioural and dynamic prediction performance of BunDLe-Net. In Figure 8 a), we present the confusion matrix for BunDLe-Net’s behavioural *prediction layer* from the ANN architecture. BunDLe-Net achieves a decoding accuracy of 94.3%, with the few decoding errors dominated by confusion of forward and slowing, two behaviours that are qualitatively similar and only quantitatively differ in the speed of the motion. To evaluate the dynamical performance of the model, we use the *transition model layer T*_{Y} to estimate *Y*_{t+1} from *Y*_{t} and compare it with the true *Y*_{t+1}, obtained as *τ*(*X*_{t+1}). Figure 8 b) shows that the predicted dynamics indeed track the true dynamics rather well. These results indicate that the behaviour predictor and transition model within BunDLe-Net preserve dynamical and behavioural information as intended.

## B BunDLe-Net architecture

### B.1 BunDLe-Net loss function

## C Other architectures of ANN models

## D Learning process

## 5 Acknowledgements

We would like to thank Sebastian Tschiatschek and Simon Rittel for enriching discussions at the Causal Representation Workshop 2021, which was hosted at the Faculty of Computer Science, University of Vienna. We would also like to thank Manuel Zimmer and his lab, especially Kerem Uzel, for collaborating with us and providing the neuronal calcium imaging data from *C. elegans*. We also thank Verity Cook for discussions about the illustrations and figures.

## Footnotes

This work was supported under the CHIST-ERA grant (CHIST-ERA-19-XAI-002), by the Austrian Science Fund (FWF) (grant reference I 5211-N) and the Engineering and Physical Sciences Research Council United Kingdom (grant reference EP/V055720/1), as part of the Causal Explanations in Reinforcement Learning (CausalXRL) project.

A citation (CEBRA [9]) in the introduction was slightly modified: it is now listed as an algorithm from the contrastive learning framework. The order of author affiliations was changed, and another person was acknowledged for figure discussions.

https://github.com/akshey-kumar/BunDLe-Net/tree/main/figures/rotation_comparable_embeddings


^{1}The motor behavioural labels were inferred from the activity of the neurons AVAR, AVAL, SMDVR, SMDVL, SMDDR, SMDDL, RIBR, RIBL while the worms were immobilised. Hence, we removed these neurons from the dataset to ensure we are not inferring behaviours directly from these neurons.

^{2}We use a simple architecture consisting of a single linear layer since it already demonstrated a high decoding accuracy (∼ 0.94) on the raw neuronal traces. Hence, more complex models are not required to evaluate the embeddings.

^{3}The ArAe would preserve dynamical information and embed it in a lower dimensional space due to the autoencoder architecture.

^{4}Note that CEBRA as an algorithm was designed for continuous-valued behaviours. We cast our categorical behaviour (int) into a continuous behaviour (floating-point) and ran CEBRA on it.

^{5}We could also simply fit separate models on each worm’s data, as was done for the evaluation in Figure 2. Due to differing initialisations of BunDLe-Net, this would result in visually different embeddings. These embeddings, however, share the same topology independent of the initialisation. For ease of visual comparison between embeddings, we adopt the above procedure to have latent spaces that can be mapped to one another.

^{6}Since we have time-series data, we need not learn *T*_{X} of the commutativity diagram, but simply feed *X*_{t+1} directly into the network.

^{7}We tested for non-Markovianity using an autoregressor model and found that including multiple time steps from the past boosted the prediction performance of the model.

^{8}We use an autoencoder architecture since an autoregressor, in general, need not map the data to a low-dimensional space. Hence, we use an encoder to obtain an embedding that can be compared with the other methods.