Summary
When faced with imminent danger, animals must rapidly take defensive actions to reach safety. Mice can react to innately threatening stimuli in less than 250 milliseconds [1] and, in simple environments, use spatial memory to quickly escape to shelter [2,3]. Natural habitats, however, often offer multiple routes to safety, which animals must rapidly identify and choose from to maximize the chances of survival [4]. This is challenging because while rodents can learn to navigate complex mazes to obtain rewards [5,6], learning the value of different routes through trial-and-error during escape from threat would likely be deadly. Here we have investigated how mice learn to choose between different escape routes to shelter. By using environments with paths to shelter of varying length and geometry, we find that mice prefer options that minimize both path distance and path angle relative to the shelter. This choice strategy is already present during the first threat encounter and after only ~10 minutes of exploration in a novel environment, indicating that route selection does not require experience of escaping. Instead, an innate heuristic is used to assign threat survival value to alternative paths after rapidly learning the spatial environment. This route selection process is flexible and allows quick adaptation to arenas with dynamic geometries. Computational modelling of different classes of reinforcement learning agents shows that the observed behavior can be replicated by model-based agents acting in an environment where the shelter location is rewarding during exploration. These results show that mice combine fast spatial learning with innate heuristics to choose escape routes with the highest survival value. They further suggest that integrating priors acquired through evolution with knowledge learned from experience supports adaptation to changing environments while minimizing the need for trial-and-error when errors are very costly.
Results
Escape route choice is determined by path distance and angle to shelter
To investigate escape route choice, we placed mice in elevated arenas with a threat and a shelter platform connected by runways of different configurations and lengths. Previous work has shown that in simple arenas mice escape along a direct vector toward a memorized shelter location [2,7,8], and that they can form subgoal memories to avoid obstacles when navigating to shelter [3]. To determine whether mice learn the value of alternative routes to shelter, we first built an arena where the direct path to shelter led to a dead end while two other open paths of equal length were available (Figure 1A). Mice explored the entire arena over a period of ~10 minutes (Figure S1A), after which they were exposed to innately threatening auditory and visual stimuli [8,9] when they were on the threat platform, facing away from the shelter. Mice reliably escaped from threat in these conditions (Figure S1B) and had a preference for the two side paths (Pside path = 0.79, Pdead-end = 0.21, 34 trials from 7 mice; Figure 1A; Video 1). The average time to leave the threat platform for one of the side arms was 2.49±0.81 s, and mice accelerated directly toward the left or right paths from the start of the flight (Figure 1B, 1C). This shows that mice quickly learn to overcome the innate preference for escaping along the shelter direction and use knowledge of which paths lead to shelter when committing to escape trajectories.
Next, we tested escape route selection in four different arenas where the length of the left arm was progressively increased while keeping constant the initial angle between each path and the threat platform (arenas 1-4, Figure 1D). In this experiment, the relative value of the left-hand path decreases from arena 1 to arena 4, as mice should in principle escape along the shortest path to minimize exposure to danger [10–12]. The differences in geodesic distance (path length) between the threat and shelter platforms translated not only into differences in distance travelled but also into differences in the time taken to traverse each path during escape (Figure S1C, D). While path choice was probabilistic, when presented with threat, mice preferentially escaped via the shortest path in each of the three asymmetric arenas (overall Pright path = 0.79, 600 trials from 83 mice; Figure 1D,E; Video 2). Across all arenas, the probability of taking the shortest path was significantly dependent on the geodesic distance ratio between the two paths (arena 1 Pright path = 0.486, arena 2 Pright path = 0.731, arena 3 Pright path = 0.807, arena 4 Pright path = 0.829, p = 0.0006, one-way ANOVA). The preference for the shortest path could not be explained by differences in arm familiarity arising from biases during arena exploration, nor by mice simply choosing the arm they entered the threat platform from (Figure S1E,F). This suggests that mice are choosing to escape along the shortest path. In addition, the time to leave the threat platform was independent of the path chosen (2.53±1.51 and 2.58±1.37 s for the left and right arms, respectively) and path choice could be predicted from the escape trajectory before mice left the threat platform (Figure S2). This indicates that mice commit to one path from escape onset. These data therefore suggest that mice evaluate either geodesic distance or escape duration to shelter when choosing escape routes. These two quantities are strongly correlated in our experimental setup (Figure S1C) and we cannot disambiguate between these two alternatives. In either case, however, mice quickly learn to select the fastest escape routes to safety.
To assess whether other aspects of path geometry influence escape route selection, we next built an arena where the two arms had the same length but the initial angle relative to the shelter was larger for the right arm (Figure 1F). In this configuration, when mice escape along the right arm they initially must navigate away from the shelter direction, but the path length and escape duration are the same for both arms (Figure S1C). Thus, if mice selected escape paths based on path length or travel duration alone, they should choose each path with equal probability, as in arena 1. Instead, we found that mice had a clear preference for the left arm (Pleft path = 0.69, 81 trials, 23 mice, p = 0.02, chi-squared test; Figure 1G). This suggests that in addition to geodesic distance, mice also consider shelter direction when choosing escape paths. To further quantify the relative weights of different variables on escape path preference, we used a generalized linear model (GLM) to predict escape path choice across trials. The predictors included geodesic distance and shelter angle, as well as time in session and the arm from which the mouse arrived at the threat platform. The model fitted to the data could explain more than 90% of the variance in path choice in cross-validated tests and showed approximately equal weighting of geodesic distance and shelter angle, with minimal weight for time in session and arm of origin (Figure 1H). Together these results show that when faced with two possible paths to shelter, mice can quickly learn distances and angles to shelter and escape along the route that minimizes both.
Route learning does not require escape experience and is flexible
Computing escape route choices requires at least two steps: learning the properties of the available paths to safety, and applying a function that maps those properties onto their value for escape (e.g., favoring escape along the shortest path). While the first is likely learned during natural exploration of the environment, the second could in principle be learned through repeated encounters with threat, or it could be innate (i.e., the animal is born with a value function that links path properties to escape values). To distinguish between these two alternatives, we computed path choice probabilities for the first trial of threat presentation (naïve) and compared them with the probabilities for trials after experience with threats (experienced). We found that the preference for shorter paths and smaller angles to shelter was already present in naïve trials, and that the path choice probabilities were not different between naïve and experienced trials (Figure 2A). In addition, choice probabilities did not significantly change over the course of the experimental session and repeated threat presentations (Figure 2B), in agreement with the low GLM weight for the time-in-session variable (Figure 1H). This analysis suggests that the strategy for selecting escape routes does not develop through experience of escaping. Instead, the evaluation of path length and angle to shelter represents an innate heuristic for assigning escape value to the different route options. In addition, the preference for the shortest path upon the first exposure to threat implies that mice were able to learn the relevant environment properties during natural exploration of the arena. In our experiments, mice spent on average 11.03±3.8 min exploring before threat presentation, during which time they performed only 4.2±0.9 complete trips between the threat platform and the shelter (Figure 2C, D). Mice thus require a very small amount of exploration to learn the spatial properties relevant for escape, and have an innate function to assign escape value to alternative paths based on the learned spatial relationships between the paths and the shelter location.
The combination of rapidly learning path properties and having an innate value function allows mice to effectively select escape paths shortly after entering a novel environment. Next, we aimed to establish whether this can also support flexible and adaptive escape route selection in a dynamic environment, where path preference must be rapidly updated to reflect changes in the arena. We built a version of arena 4 in which the path lengths could be quickly and automatically flipped between the left and right sides (Figure 2E; Video 3). After exploration and 2-3 threat presentation trials, we flipped the arena and let the animals explore the maze again (median 14.6±6.1 min, 6.05±2.77 threat platform-to-shelter trips). We then presented threats and found that the path preference during escape changed to reflect the new arena geometry: mice now took the left arm with a higher probability (baseline Pright path = 0.641, flipped Pright path = 0.321, p = 0.0014, Fisher's exact test; Figure 2F), while the time to leave the threat platform and the orientation movement profiles were similar between baseline and flipped trials (2.16±1.10 and 1.77±0.93 s respectively, p = 0.06, t-test). These data suggest that after initially learning the arena geometry and developing an escape route preference, mice remain in a flexible learning state in which they can incorporate new information at a rate similar to that of naïve animals. This ability enables the selection of the fastest escape routes in changing environments.
Model-based reinforcement agents with limited experience choose the shortest escape route
Learning the shortest escape route in our experiments was a fast process, which contrasts with the large amount of training needed for some spatial navigation and decision-making tasks [13,14], as well as for training artificial intelligence agents [15,16]. To gain further insight into the type of learning algorithms that mice might be using, we compared the performance of different reinforcement learning (RL) algorithms [17] on a task similar to our experiments. We selected three algorithms representing different classes of RL models: Q-learning (model-free; [17]), DYNA-Q (model-based; [17]) and influence zones (IZ; [18]). The latter is a model-based algorithm in which several state-action values are updated simultaneously according to a topological mapping between states, and is thus particularly appropriate for spatial navigation tasks [18]. These models were trained to navigate a grid-world representation of arena 4, from a starting location (corresponding to the threat platform in our experiments) to a goal where they received a positive reward (corresponding to the shelter location; Figure 3A). As the goal of the agents is to maximize the time-discounted cumulative expected reward, training should result in a policy that selects the shortest route to the goal, thereby mimicking the innate preference of mice for shorter escape paths.
We trained the RL agents under two regimes: in the free-exploration regime, agents were allowed to explore the environment freely under an epsilon-greedy policy for 250 episodes with a maximum of 500 steps each; in the guided-exploration regime, RL agents explored the environment in a single episode, moving through the maze along the exploration trajectories of individual mice recorded during our experiments (see Methods). The free-exploration regime is therefore analogous to the standard practice in the RL field [17], where agents are allowed a large number of steps to learn (up to 125,000 in our conditions). The guided-exploration regime poses in principle a more challenging learning problem: real exploration trajectories in our experiments have a mean of 754±267 steps, over two orders of magnitude fewer (Figure 3B, C). Under free exploration, all models successfully learned to navigate to the goal location (Figure 3D). The short arm was chosen by 64% of the Q-learning agents that reached the shelter at the end of the test trial, and by more than 95% of agents trained with DYNA-Q or influence zones (Figure 3E). In contrast, under guided exploration, the Q-learning algorithm entirely failed to learn to reach the goal (Figure 3F).
The two model-based agents, however, performed significantly better, with the influence zones algorithm outperforming DYNA-Q (Figure 3F). Both learned to navigate to the goal for more than half of the training trajectories and chose the shorter arm in >94% of these cases (Figure 3E,F). These results suggest that rapidly learning to navigate the arena environment with limited exploration requires a learning algorithm that goes beyond naïve model-free rules and incorporates elements such as internal replay or the topology of the environment.
Discussion
This work shows that mice in a novel environment learn to choose the shortest escape route to shelter when more than one option is available. This learning process is fast and happens during spontaneous exploration, before mice have experienced any threats. The choice is made by selecting the route that minimizes both path length and angle to shelter.
Minimizing the path length during escape agrees with the straight flight trajectories observed in open arenas and appears to serve the purpose of minimizing exposure to danger. While in an open arena the shortest path is the direct one, here mice had to learn the path lengths of the different routes to choose the shortest path. The minimization of shelter angle is in line with the observation that mice keep track of a vector to the shelter in open arenas [2,7]. In the arenas tested in this work, mice seem to keep track of this vector and use it as a variable for choosing escape paths even though there was never a direct route to shelter available. While this decision strategy does not minimize exposure to risk in these arenas, selecting the path with the smallest shelter angle minimizes the Euclidean distance to the shelter. Should a shortcut suddenly appear along the escape path, following a default policy of moving closer to the shelter might provide an advantage.
While it may seem trivial that animals would choose the fastest escape route, this need not be the case. Escape strategies in the animal kingdom are diverse and there is often an advantage to using alternative strategies, such as outpacing the predator while not revealing where shelter is [4,19]. Perhaps surprisingly, we found that escape path choice was probabilistic despite the relative path lengths being fixed for each experimental session. This could reflect imperfect learning of the environment geometry, noise in the sensory and decision-making systems, or the effect of unmeasured variables. Alternatively, it could also provide an advantage by maintaining some amount of exploration while exploiting the fastest known route, or by increasing the unpredictability of the escape trajectory, which has been suggested to be advantageous for many species [4,20–22].
Our results highlight the close interplay between two distinct timescales of learning in generating adaptive behavior: individual experience and evolution. In our experiments, mice had to learn the spatial properties of the arena, which they did through their natural drive and behavior. Even though no task structure or explicit instructions were provided, all animals explored the arenas efficiently, identifying the shelter and the available paths, and extracting relevant arena features such as path length and angle to shelter. They then immediately relied on estimates of path length and the direct shelter vector, treating the routes that minimize these quantities as having the highest escape value. Thus, in addition to not having to learn that small, enclosed places offer shelter from threat, mice also did not have to learn the value of the different escape routes through experience of threat, or through any form of punishment associated with taking the longer routes (e.g., being exposed to an unpleasant stimulus for longer). Instead, they have an innate policy that guides escape path selection upon the first encounter with threat, thus removing the need for trial-and-error learning in a scenario in which errors could be fatal. This finding is in agreement with mice not needing to be exposed to threat to learn the direct shelter vector [2] or even a more complex subgoal shelter route [3], and suggests that mapping escape value onto the spatial environment is a priority of naturally behaving mice. A likely explanation is that, as a prey species, during free exploration mice give high value to sheltering locations and to routes that lead to these safe places, even in the absence of explicit threat. They extract relevant knowledge from the environment when it is safe to do so and identify a set of possible defensive actions. When threat does come and defensive actions need to be selected, an innate heuristic is leveraged to assign value to each alternative with no need for further learning. This combination of acquired and innate knowledge ensures that the animals can discover and select the most adaptive defensive action in the safest way, thus maximizing their chances of survival. From a neurobiological perspective, these findings provide a new paradigm for investigating the mechanisms of value assignment, how the estimation of spatial properties is linked to route decisions, and how learned and innate information are integrated to guide decision-making.
A key finding in this study is that route learning was fast and required minimal exploration. This builds on previous work showing fast learning of shelter location [2,8,23] and supports recent findings showing that mice rapidly learn to navigate a maze for reward through natural exploration [11]. A picture that emerges from these studies is that natural exploration of space is the fundamental way in which mice learn about the environment and express their behavioral choices, and that evolution has therefore ensured that spatial learning is fast and prioritizes survival needs. While it is unclear what learning algorithms mice use in the settings explored here, our reinforcement learning modeling suggests that a simple model-free algorithm is not sufficient to generate the observed behavior. Instead, a more sophisticated learning process seems to be required to extract the necessary information from very limited experience. Alternatively, a simple learning algorithm could act on prior knowledge that is useful for solving the problem, such as a model for quickly estimating distances from self-motion. Our results invite future work investigating the biological basis of how innate and acquired knowledge interact to generate behavior, as well as how abstractions of this interaction can be leveraged to develop efficient learning algorithms for machine learning applications [24].
Experimental Procedures
Resource availability
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Tiago Branco (t.branco@ucl.ac.uk).
Materials availability
All arena design files are available upon request.
Data and code availability
All data reported in this paper will be shared by the lead contact upon request.
Analysis code can be found at the following GitHub repository: https://github.com/FedeClaudi/EscapePathSelection.
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
Experimental model and subject details
Animals
Male adult C57BL/6J mice (8-12 weeks old) were singly housed on a 12 h light cycle and tested during the light phase. Mice from the same litter were randomly assigned to different experiments when possible. Most animals (78/119) were used only once, and the remaining ones were used for experiments on two arenas on different days. Mice that were used twice did not show any difference in escape response or arm choice probability compared to animals that were used only once. All experiments were performed under the UK Animals (Scientific Procedures) Act 1986 (PPL 70/7652 and PFE9BCE39).
Methods details
Behavioral arena
The behavioral arenas consisted of white acrylic platforms elevated 30cm from the floor. Each arena was composed of octagonal platforms (24cm in diameter) and connecting bridges of various lengths (10cm wide). For the experiment shown in Figure 2E, some bridge sections were fitted with a computer-controlled servo motor which rotated a 20cm long bridge section by 90 degrees in the downward direction and created a gap that the mice could not traverse. The servo motors were controlled with custom Arduino code and activated manually.
Auditory and visual stimulation
Mice were presented with auditory stimuli consisting of three frequency-modulated sweeps from 17 to 20 kHz, each 3 seconds long, at a sound pressure of 70-85 dB as measured at the arena floor. In some experiments overhead visual stimuli were used. These were projected onto a screen positioned 1.8 m above the arena floor and consisted of a dark circle (Weber contrast = −0.98) expanding over a period of 250 ms [2]. The visual stimulus was repeated five times in rapid sequence with an interval of 500 ms between repeats. No difference in behavior was observed between auditory and visual stimuli and therefore the data were pooled. Stimuli were triggered manually and controlled with software custom-written in LabVIEW (2015 64-bit, National Instruments). While manual stimulation could be a source of bias, the arena design ensured that mice were in the same position and a similar orientation across trials and experiments. No systematic difference in position or orientation was observed based on the selected escape path. The sound was played from the computer through an amplifier (TOPAZ AM10, Cambridge Audio) and speaker (L60, Pettersson). The audio signal was fed in parallel through a breakout board (BNC-2110, National Instruments) into a multifunction I/O board (PCIe-6353, National Instruments) and sampled at 10 kHz. To synchronize the audio and video, this signal was compared to the 30/40 Hz pulse triggering video frame acquisition, which was also fed as an input to the I/O board and sampled at 10 kHz. The visual stimuli and video were synchronized using a light-dependent resistor whose voltage output depends on the amount of light it detects and thus reflected the presence or absence of visual stimuli. The resistor's output was fed as an input to the I/O board and sampled at 10 kHz.
Behavioral assay
Experimental sessions were filmed at 30 or 40fps with an overhead camera (positioned 1.8m above the arena floor). At the start of each experimental session mice were allowed to freely explore the arena for a period of ~10 minutes, during which they spontaneously found the shelter. After the exploration period, threats (either auditory or visual) were presented repeatedly while the animals were on a designated threat platform and facing in the direction opposite from the shelter platform. A stimulus response was considered an escape if the mouse reached the shelter within 10 seconds from stimulus onset. The number of trials in each experimental session varied across mice. Experiments were terminated when mice either remained in the shelter continuously for 30 minutes or failed to escape in response to three consecutive stimuli. Some experiments were performed in total darkness (with auditory stimuli only). No difference in behavior was observed between lights on and lights off experiments and thus the two datasets were pooled.
Animal tracking
The position and orientation of mice in the arena were reconstructed using DeepLabCut [25] to track the position of several body parts (snout, neck, body and tail base) in all recorded videos with a custom-trained network. Post-processing of the tracking data included median filtering (window width: 7 frames), removal of low-confidence tracking (likelihood < 0.995) and affine transformation to a standard template arena to facilitate comparison across experiments [3]. Processed tracking data were stored in a custom DataJoint database [26], which also stored experimental metadata (e.g., mouse number, arena type, stimulus times) and was used for all subsequent analysis.
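As an illustration, the filtering and confidence-thresholding steps could be implemented as in the following sketch (the affine registration step is omitted; the array names and the use of SciPy here are our assumptions, not the original pipeline):

```python
import numpy as np
from scipy.signal import medfilt

def postprocess_bodypart(xy, likelihood, window=7, min_likelihood=0.995):
    """Median-filter one body part trajectory and mask low-confidence frames.

    xy: (n_frames, 2) float array of DeepLabCut coordinates for one body part.
    likelihood: (n_frames,) DeepLabCut confidence values for the same part.
    """
    filtered = np.column_stack(
        [medfilt(xy[:, i], kernel_size=window) for i in range(2)]
    )
    filtered[likelihood < min_likelihood] = np.nan  # drop low-confidence tracking
    return filtered
```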
Analysis code
All analysis was carried out using custom Python code and used several software packages from Python's scientific software ecosystem: NumPy [27], Matplotlib [28], Scikit [29], Pandas [30], OpenCV [31], and StatsModels [32]. To calculate the animal's orientation, we computed the vectors between the tail base and the body and between the body and the snout (body part locations as reconstructed by DeepLabCut), and then took the average of the two (we found this to be more stable than either vector alone). We set the shelter direction to be at 0 degrees in allocentric coordinates.
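A minimal sketch of this orientation computation, assuming (n_frames, 2) coordinate arrays for each body part and a coordinate frame rotated so that the shelter direction lies along the positive x axis (an illustrative choice):

```python
import numpy as np

def heading_direction(tail_base, body, snout):
    """Heading as the average of the tail_base->body and body->snout vectors.

    Returns allocentric angles in degrees; 0 degrees corresponds to the
    shelter direction under the coordinate-frame assumption above.
    """
    v1 = body - tail_base
    v2 = snout - body
    mean_vec = (v1 + v2) / 2.0  # averaging is more stable than either vector alone
    return np.degrees(np.arctan2(mean_vec[:, 1], mean_vec[:, 0]))
```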
Reinforcement learning modelling
Three classes of Reinforcement Learning (RL; [17]) models were trained to navigate a grid world representation of Arena 4 from the experimental study. All RL simulation, analysis and data visualization work was done in custom Python code. The grid world representation of Arena 4 consisted of a 50×50 array of quadrilateral cells, with zeros corresponding to locations on the arena (126 cells in total) and ones to locations inaccessible to the agent. Agents could move in 8 directions (up, up-right, right, down-right, down, down-left, left, up-left) by one cell at a time and had to learn to navigate the environment from a starting location to a goal location (corresponding to the threat and shelter locations in Arena 4, respectively). All agents received a reward of 1 for reaching a cell fewer than 3 cells away from the goal location and a penalty of −0.1 for attempting a move toward an inaccessible cell (upon which the agent did not move). For the QTable agent only (described below), a small (1e-8) reward was also delivered for any training step in which the agent moved to a new cell, to encourage exploration.
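A sketch of such an environment step function is shown below; the arena layout, the use of Manhattan distance for the goal radius, and the action ordering are illustrative assumptions rather than details taken from the original implementation:

```python
import numpy as np

# 1 = inaccessible, 0 = arena cell; the real layout (126 accessible cells)
# would be derived from the registered arena template.
GRID = np.ones((50, 50), dtype=int)

# 8 movement directions: up, up-right, right, down-right, down, down-left, left, up-left
ACTIONS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def env_step(state, action_idx, goal, grid=GRID):
    """One step of the grid world: returns (next_state, reward, done)."""
    dr, dc = ACTIONS[action_idx]
    nxt = (state[0] + dr, state[1] + dc)
    inside = 0 <= nxt[0] < grid.shape[0] and 0 <= nxt[1] < grid.shape[1]
    if not inside or grid[nxt] == 1:
        return state, -0.1, False  # illegal move: penalty, agent stays put
    if abs(nxt[0] - goal[0]) + abs(nxt[1] - goal[1]) < 3:  # within 3 cells of goal
        return nxt, 1.0, True
    return nxt, 0.0, False
```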
Reinforcement learning models
Three different reinforcement learning models were used. All models shared the same environment (state space), action space and reward function (with the exception of QTable, as noted above). The three models were: QTable (model-free RL; [17]), DYNA-Q (model-based; [17]) and Influence Zones (IZ, model-based; [18]). All three models were implemented in custom Python code. For all simulations the following parameters were used:
The DYNA-Q model includes a planning step in which randomly sampled entries from the agent's model are used to update the value function. In all simulations the number of samples used for each planning step was set to 20.
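The planning step can be sketched as follows; the learning rate, discount factor and data structures are illustrative assumptions, and only the 20 planning samples per step are taken from the text:

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # (state, action) -> value
world_model = {}        # (state, action) -> (reward, next_state)

def dyna_q_step(s, a, r, s2, alpha=0.1, gamma=0.99, n_planning=20, n_actions=8):
    """One DYNA-Q update: direct Q-learning plus n_planning replayed updates."""
    def td_update(s, a, r, s2):
        best_next = max(Q[(s2, a2)] for a2 in range(n_actions))
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    td_update(s, a, r, s2)          # learn from the real transition
    world_model[(s, a)] = (r, s2)   # update the agent's model of the world
    for _ in range(n_planning):     # planning: replay random model entries
        ps, pa = random.choice(list(world_model.keys()))
        pr, ps2 = world_model[(ps, pa)]
        td_update(ps, pa, pr, ps2)
```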
The IZ model had additional parameters. These included a TD error threshold for one-step updates of the value function (1e-10), a threshold for n-step updates (0.1), and two parameters for the Instantaneous Topological Map (ITM; [33]) model used by IZ: the ITM learning rate (0.2) and the maximum error (1.2).
Free exploration training
In the free exploration training regime, agents were trained for 250 episodes of at most 500 training steps each. For each training episode the agent was initialized at a random location on the grid world arena and had to navigate to the goal location. The episode terminated when the agent had taken 500 steps or when it reached the goal location. During training, at each step agents selected the action to perform using an epsilon-greedy strategy: a random action was chosen with probability equal to the exploration rate parameter, otherwise the action with the current highest value was selected. At the end of each episode, the exploration rate decayed by a factor set by the exploration rate decay parameter.
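A sketch of the epsilon-greedy selection and its decay schedule; the initial exploration rate and decay factor values are illustrative, as they are not specified above (Q is a defaultdict mapping (state, action) to value, as in the DYNA-Q sketch):

```python
import random

def epsilon_greedy(Q, state, epsilon, n_actions=8):
    """Random action with probability epsilon, otherwise the greedy action."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

# Free-exploration schedule (illustrative values):
# epsilon = 1.0
# for episode in range(250):
#     ...  # up to 500 steps, each choosing epsilon_greedy(Q, state, epsilon)
#     epsilon *= exploration_rate_decay  # decay at the end of each episode
```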
To assess the agent’s performance during learning, at the end of each training episode the agent was initialized at the start location and allowed to act greedily (i.e., with no randomly selected actions). If the agent reached the shelter location the simulation was marked successful, otherwise it was labelled as a failure. If the agent attempted an illegal move (i.e., trying to move to an inaccessible cell) the simulation was terminated and considered a failure. The agents were not allowed to use the experience from this evaluation simulation for learning.
Guided exploration training
In the guided exploration training, RL agents followed the exploration trajectories of the experimental animals. Tracking data from each experiment's exploration phase were registered to the grid world arena through an affine transformation (scaling and shift). The tracking data were represented at a higher spatial resolution than the grid world arena and did not match the grid world layout perfectly (due to imperfect registration of the tracking data to the standard template). The first issue was resolved by assigning each frame in the tracking data to the closest grid world cell. Imperfect alignment and tracking errors could not be corrected in some experimental sessions; these were discarded, leaving 42 valid sessions. As mice often remained in the same location for extended periods of time during natural exploration (e.g., in the shelter), these periods were eliminated from the tracking data and only frames in which the mouse moved from one arena cell to another were kept. The tracking data were then used to guide the movement of all agents during the training phase. For a given session's data, the agent was initialized at the arena location corresponding to the first frame in the tracking data. For each step, the location of the next arena cell was identified and the action leading from the agent's current cell to that cell was selected. The agent then performed the selected action, experienced rewards and learned, as it would have during free exploration. Thus, the main difference between the free and guided exploration paradigms was that in the guided exploration regime agents were not allowed to select which action to perform, as this was determined by the tracking data.
Once the agent had followed every step of the tracking data, the training phase was concluded. The agent was then initialized at the start location and allowed to act greedily following the value function it learned during training, with the goal of reaching the shelter location. If the goal was reached the simulation was classified as successful, otherwise it was classified as a failure. If the agent attempted an illegal move (i.e., trying to move to an inaccessible cell) the simulation was terminated and classified as a failure.
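Deriving the action sequence from a discretized trajectory might look like the sketch below, which assumes that stationary frames have already been removed and that consecutive cells are always adjacent (i.e., connected by one of the 8 moves):

```python
# Same 8-direction action set as the grid world sketch above.
ACTIONS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def trajectory_to_actions(cells):
    """Map consecutive visited grid cells to the actions connecting them.

    cells: list of (row, col) tuples from the registered tracking data,
    with repeated (stationary) cells already removed.
    """
    actions = []
    for (r0, c0), (r1, c1) in zip(cells[:-1], cells[1:]):
        actions.append(ACTIONS.index((r1 - r0, c1 - c0)))
    return actions
```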
Quantification and statistical analysis
Quantification of escape probability
To calculate the probability of escaping in response to the threat stimulus, the movement trajectories during the first 10 s after stimulus onset were analyzed, and the fraction of trajectories terminating on the shelter platform was computed. For comparison, randomly selected time points at which the same animals were on the threat platform but not presented with a threatening stimulus were selected, and the fraction of shelter arrivals in this random sample was also estimated. The number of randomly selected time points matched the number of trials. Fisher's exact test was used to determine the significance of the difference in the number of shelter-arrival trajectories between the stimulus and control groups.
Quantification of the probability of escaping along a dead-end
To determine the probability of escape along the dead-end arm in arena 1, all trials from experiments in the arena were pooled. The trajectory in the 10 seconds following threat presentation was analyzed to determine which arm the mouse first moved to after leaving the threat platform, and the probability of escape along each of the three arms (left, right and dead end) was then computed. To distinguish trials in which the mice escaped from trials in which the mouse ignored the stimulus but still moved away from the threat platform, we used two criteria. First, since escapes are characterized by a higher running speed than normal locomotion [2], we only included trials in which the mouse was moving at a speed higher than 35 cm/s when it left the threat platform. Second, as escapes have fast reaction times and mice leave the threat platform within the first 3 seconds from stimulus onset, only trials in which the mouse left the threat platform within 4 seconds from stimulus onset were considered escapes.
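Expressed as code, the two inclusion criteria amount to a simple predicate (the variable names are illustrative):

```python
def is_escape(exit_speed_cm_s, exit_latency_s):
    """Escape trial: platform exit faster than 35 cm/s, within 4 s of stimulus onset."""
    return exit_speed_cm_s > 35.0 and exit_latency_s <= 4.0
```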
Quantification of heading direction
To quantify the average heading direction during escape, movement trajectories from each arena were selected and grouped into left or right path escapes. The trajectories were then truncated to the frame in which the mouse left the threat platform and their durations normalized. The average heading direction at regular intervals across all trajectories in the same group was then computed.
Quantification of arm choice probability
To estimate the probability of escape along a given path, all trials from experiments on each arena were pooled and the probability of selecting the right or left arm was computed. In experiments in which the arena design changed during the course of an experiment, trials were pooled across animals and grouped into baseline (before the change) and ‘flipped’ (after the change). Fisher’s exact test was used to determine whether the number of escapes along the right path differed between the baseline and flipped conditions.
In addition to estimating the probability of selecting an arm, the posterior distribution of the probability value was estimated using a Bayesian model. The model had a Beta distribution prior (parameters: a = 1, b = 1) and a Binomial distribution as the likelihood (n = total number of trials, k = total number of right path escapes). The resulting posterior distribution is then a Beta distribution whose parameters are given by a′ = a + k and b′ = b + n − k.
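A minimal sketch of this posterior computation using SciPy; the trial counts in the usage example correspond to the overall right-path preference reported in the Results:

```python
from scipy import stats

def right_path_posterior(n_trials, n_right, a=1, b=1):
    """Posterior over p(right path): Beta(a + k, b + n - k) by conjugacy."""
    return stats.beta(a + n_right, b + n_trials - n_right)

# Example: 600 trials with 474 right-path escapes (p ~ 0.79)
posterior = right_path_posterior(600, 474)
lo, hi = posterior.interval(0.95)  # 95% credible interval for p(right)
```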
To compare the probability of selecting the right arm between naïve and experienced trials, for each arena the first escape trial of each mouse was classified as naïve while all other trials were classified as experienced. When a mouse was used for more than one experiment, only the first trial of the first experiment was considered naïve. The probability of escaping along the right path was computed for the naïve trials. Because the number of experienced trials is larger than the number of naïve trials, to compare the probability of taking the right path between the two groups, experienced trials were randomly sampled without replacement to match the number of naïve trials in the same arena, and the probability of taking the right path was then computed. This procedure was repeated 10 times to generate a distribution of probability values, and the mean and 95th percentile interval were computed from this distribution.
Arm choice probability for naïve vs experienced trials
To determine whether mice required repeated experience with threat to select the preferred escape path, we identified the very first stimulus presentation for each animal. We grouped such "naïve" trials across individuals tested on the same experimental arena and estimated the probability of escape along the right path in this subset of the data. To compare naïve with experienced (i.e., following the first encounter with threat) trials, we randomly sampled experienced trials from each arena to match the number of naïve trials in the same arena and computed the probability of escape along the right arm. We repeated the sampling procedure 100 times to obtain a distribution of probability values for different random subsets.
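This subsampling procedure could be sketched as follows (the array names and the use of a fixed seed are illustrative):

```python
import numpy as np

def subsampled_p_right(experienced_right, n_naive, n_repeats=100, seed=0):
    """Distribution of p(right) from repeated subsampling without replacement.

    experienced_right: boolean array, True where the escape took the right arm.
    n_naive: number of naive trials in the same arena to match.
    """
    rng = np.random.default_rng(seed)
    return np.array([
        rng.choice(experienced_right, size=n_naive, replace=False).mean()
        for _ in range(n_repeats)
    ])
```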
Change in path preference with time
To quantify the change in probability of escape along the right path over time, for each arena we pooled all trials that occurred during the first 60 minutes from the start of the experiment. We then binned the trials based on the time since experiment start (bin spacing: 120 s, bin width: 300 s) and computed the posterior distribution of p(R) as described above.
Quantification of shelter-threat trips during exploration
To quantify the number of trips between the shelter and the threat platform during exploration, the tracking trajectory corresponding to the exploration period (from the start of the experiment to one frame before the first stimulus) was analyzed. For each frame, the mouse was assigned to one of four regions of interest ('shelter', 'right arm', 'threat', 'left arm') based on the coordinates of the tracking data registered to a standard template image as described above. A trip started at the last frame in which the mouse was on the threat (shelter) platform and ended when the mouse arrived at the shelter (threat) platform. Incomplete trips (e.g., the mouse left the shelter platform and returned to it without first reaching the threat platform) were discarded.
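One way to implement this trip extraction, assuming a per-frame sequence of ROI labels (the helper name and interface are ours, not the original code):

```python
def complete_trips(roi_per_frame, start_roi="threat", end_roi="shelter"):
    """Extract complete start_roi -> end_roi trips as (start_frame, end_frame).

    A trip starts at the last frame on the start platform before departure and
    ends on arrival at the end platform; returning to the start platform in
    between simply restarts the trip, discarding the incomplete one.
    """
    trips, start = [], None
    for i, roi in enumerate(roi_per_frame):
        if roi == start_roi:
            start = i                # keep the most recent frame on the start platform
        elif roi == end_roi and start is not None:
            trips.append((start, i))
            start = None             # a new trip requires revisiting the start platform
    return trips
```

Shelter-to-threat trips would be counted with the same function and the ROI arguments swapped.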
Predicting escape arm with GLM
To predict the probability of escaping along the right arm from trial data, a binomial generalized linear model (GLM) with a logit link function was used (implemented in StatsModels; [32]). All trials from all arenas were pooled and split between train and test sets (stratified k-fold repeated cross-validation; five different splits of the trials were used, and the data were split such that the test set was roughly balanced with respect to the number of trials from the corresponding arena; this procedure was repeated four times with different random splits each time, yielding a total of 20 model fits). The GLM estimated the probability of escape along the right path for each trial based on: 1) the geodesic ratio of the trial's arena, 2) the angle ratio of the trial's arena, 3) the trial time (in seconds) since the start of the experimental session, and 4) the identity of the origin arm. Categorical variables were one-hot encoded, and all variables were normalized to the 0-1 range. The accuracy of the model's predictions on the test data was estimated with the Pearson correlation between the predicted probability of escape along the right arm and the arm chosen by the animal. This accuracy measure was compared to the accuracy of models fitted on randomly shuffled data. The full model described above was then compared to models lacking one or two of the input variables to estimate the effect of each variable. Each model was fitted on k-fold cross-validated data (k = 5) and the procedure was repeated four times using a different random number generator seed for each repeat. The coefficient weights of each parameter for each fit of the full model were used to estimate the average and standard deviation of the coefficient weights.
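The core model fit reduces to a few lines of StatsModels; the feature construction and cross-validation loop are omitted, and the function and variable names are ours:

```python
import statsmodels.api as sm

def fit_choice_glm(X_train, y_train):
    """Binomial GLM with a logit link predicting right-arm escapes.

    X_train: (n_trials, n_features) predictor matrix scaled to [0, 1]
    (geodesic ratio, angle ratio, time in session, one-hot origin arm).
    y_train: binary outcomes, 1 for right-arm escapes.
    """
    X_train = sm.add_constant(X_train)
    return sm.GLM(y_train, X_train, family=sm.families.Binomial()).fit()

# result = fit_choice_glm(X_train, y_train)
# p_right = result.predict(sm.add_constant(X_test))  # per-trial P(right arm)
```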
Decoding escape path from threat trajectory
To decode the escape arm from the trajectory on the threat platform we used a logistic regression model (implemented in Scikit; [29]). Trajectories from all trials in each experimental arena were pooled and their durations normalized. To assess how the model performed as mice moved away from the threat location and towards the escape arms, 8-9 time points were selected corresponding to different average positions along the axis between the threat and shelter platforms. For each time point, the animal's average orientation in the five frames after the time point and the trial's escape arm were used. The data were randomly split between a training and a test set (test set: 33% of the trials) and the training set was used to fit a model predicting the escape arm from the orientation value. The model's accuracy score on the held-out test data was then computed. The procedure was repeated 100 times for each time point with a different random split of the data, and the average accuracy was computed.
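For one time point, the decoding procedure might look like this scikit-learn sketch (the function and variable names are ours, and the default logistic regression settings are an assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def decode_accuracy(orientation, escape_arm, n_repeats=100, test_size=0.33):
    """Mean held-out accuracy decoding escape arm from average orientation.

    orientation: (n_trials, 1) mean heading in the 5 frames after the time point.
    escape_arm: (n_trials,) binary labels (0 = left, 1 = right).
    """
    scores = []
    for i in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            orientation, escape_arm, test_size=test_size, random_state=i)
        clf = LogisticRegression().fit(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))
    return float(np.mean(scores))
```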
Quantification of RL models performance
To quantify the performance of RL agents trained to navigate the grid world arena under the free exploration regime, we trained 64 repetitions of each model. At the end of each training episode, each repetition was tested on its ability to navigate to the goal location and returned a 1 for success and a 0 otherwise. A vector of outcomes was thus constructed from the values returned by all repetitions, and the overall score was given by the mean and standard error of the mean (SEM) of the outcomes vector. For visualization, the mean and SEM accuracy at each training episode were displayed after smoothing with a rolling-mean filter with a window width of 6 episodes.
To quantify the performance of RL agents trained under the guided exploration regime, we trained 10 repetitions of each RL model on the tracking data from each experimental session. Under this regime, unlike in free exploration, training is fully deterministic because the actions are specified by the tracking data; the only variability emerges from the DYNA-Q model's probabilistic sampling of its internal model at each training step. After training, all repetitions of each model were tested on their ability to navigate from the start (threat) location to the goal (shelter) location. The model was considered successful if at least 8 of the 10 repetitions reached the goal location.
To produce the state occupancy heatmaps in Figure 3B we recorded all cell visits for one example agent trained under the free exploration regime and one example agent trained under the guided exploration regime. The total number of visits to each cell was then used to produce the heatmap. To visualize the preferred action at each cell (Figure 3E) we trained one example agent for each class of RL algorithm and training regime and displayed the highest-valued action for each cell.
Author Contributions
F.C. and T.B. designed the study and experiments and wrote the manuscript. F.C. performed all experiments and analyses.
D.C. contributed to experiment design, data analysis design and interpretation.
Acknowledgments
This work was funded by a Wellcome Senior Research Fellowship (214352/Z/18/Z) and by the Sainsbury Wellcome Centre Core Grant from the Gatsby Charitable Foundation and Wellcome (090843/F/09/Z) (T.B.), and the SWC PhD Programme (F.C.). We thank members of the Branco lab for discussions, and Ruben Vale and Phillip Shamash for comments on the manuscript.