Mice in a labyrinth: Rapid learning, sudden insight, and efficient exploration

Animals learn certain complex tasks remarkably fast, sometimes after a single experience. What behavioral algorithms support this efficiency? Many contemporary studies based on two-alternative-forced-choice (2AFC) tasks observe only slow or incomplete learning. As an alternative, we study the unconstrained behavior of mice in a complex labyrinth and measure the dynamics of learning and the behaviors that enable it. A mouse in the labyrinth makes ~2000 navigation decisions per hour. The animal quickly discovers the location of a reward in the maze and executes correct 10-bit choices after only 10 reward experiences – a learning rate 1000-fold higher than in 2AFC experiments. Many mice improve discontinuously from one minute to the next, suggesting moments of sudden insight about the structure of the labyrinth. The underlying search algorithm does not require a global memory of places visited and is largely explained by purely local turning rules.


How can animals or machines acquire the ability for complex behaviors from one or a few experiences? In laboratory studies, one prominent instance of one-shot learning is the Bruce effect (Bruce, 1959). Here the female mouse forms an olfactory memory of her mating partner that allows her to terminate the pregnancy if she encounters another male that threatens infanticide. Another form of rapid learning accessible to laboratory experiments is fear conditioning, where a formerly innocuous stimulus gets associated with a painful experience, leading to subsequent avoidance of the stimulus (Fanselow and Bolles, 1979; Bourtchuladze et al., 1994). These learning systems appear designed for special purposes: they perform very specific associations and govern binary behavioral decisions. They are likely implemented by specialized brain circuits, and indeed great progress has been made in localizing these operations to the accessory olfactory bulb (Brennan and Keverne, 1997) and the cortical amygdala (LeDoux, 2000). In the attempt to identify more generalizable mechanisms of learning and decision making, one route has been to train laboratory animals on abstract tasks with tightly specified sensory […]

[Figure 1: (A) Top and (B) side views of a home cage, connected via an entry tunnel to an enclosed labyrinth. The animal's actions in the maze are recorded via video from below using infrared illumination. (C) The maze is structured as a binary tree with 63 branch points (in levels numbered 0,...,5) and 64 end nodes. One end node has a water port that dispenses a drop when it gets poked. Blue line in A and C: path from maze entry to water port. (D) A mouse considering the options at the maze's central intersection. Colored keypoints are tracked by DeepLabCut: nose, mid body, tail base, 4 feet.]

[…] period of exploratory experiments.
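The maze geometry described in the caption lends itself to a compact computational representation. As a sketch, the 127 nodes can be given heap-style indices so that parents and children are computed arithmetically; this numbering scheme is our own convention for illustration, not taken from the paper, which states only that the maze is a binary tree with 63 branch points and 64 end nodes.

```python
# Sketch of the maze graph as a binary tree using heap-style node indexing.
# (Illustrative convention: node 0 is the entry; nodes 63..126 are end nodes.)

FIRST_LEAF = 63            # nodes 63..126 are the 64 end nodes (deepest level)

def children(i):
    """The two corridors reachable by turning left or right at junction i."""
    return (2 * i + 1, 2 * i + 2) if i < FIRST_LEAF else ()

def parent(i):
    """The corridor leading back toward the maze entry (node 0)."""
    return (i - 1) // 2 if i > 0 else None

def path_from_entry(leaf):
    """Node sequence from the entry to a given end node: 6 binary decisions."""
    path = [leaf]
    while path[-1] != 0:
        path.append(parent(path[-1]))
    return path[::-1]

print(path_from_entry(63))   # [0, 1, 3, 7, 15, 31, 63]: 7 nodes, 6 steps
```

Under this indexing, reaching any end node from the entry takes exactly 6 successive decisions, matching the "perfect run" length of 6 steps used later in the text.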
Ten of these animals had been mildly water-deprived for 24 hours; they received food in the home cage and water only from the port hidden in the maze.

The other ten animals were sated and had free access to food and water in the cage. Each animal's behavior in the maze was recorded continuously for 7 h during the first night of its experience with the maze, starting the moment the connection tunnel was opened (watch a sample video here). The investigator played no role during this period, and the animal was free to act as it wished, including travel between the cage and the maze.

All of the mice except one passed between the cage and the maze readily and frequently (Figure 1-figure supplement 1). The single outlier animal barely entered the maze and never progressed past the first junction; we excluded this mouse from subsequent analysis. On average over the entire period of study the animals spent 46% of the time in the maze (Figure 1-figure supplement 2). This fraction was similar whether or not the animal was motivated by water rewards (47% for rewarded vs 44% for unrewarded animals). Over time the animals appeared […]

The nature of the animal's forays into the maze changed over time. We call each foray from entrance to exit a "bout". After a few hesitant entries into the main corridor, the mouse engaged in one or more long bouts that dove deep into the binary tree to most or all of the leaf nodes (Figure 2A). For a water-deprived animal, this typically led to discovery of the reward port. After ~10 bouts, the trajectories became more focused, involving travel to the reward port and some additional exploration (Figure 2B). At a later stage still, the animal often executed perfect exploitation bouts that led straight to the reward port and back with no wrong turns (Figure 2C). Even at this late stage, however, the animal continued to explore other parts of the maze (Figure 2D). Similarly, the unrewarded animals explored the maze throughout the night (Figure 1-figure supplement 2).
While the length and structure of the animal's trajectories changed over time, the speed remained remarkably constant after ~50 s of adaptation (Figure 2-figure supplement 1).

Whereas Figure 2 illustrates the trajectory of a mouse's nose in full spatio-temporal detail, a convenient reduced representation is the "node sequence". This simply marks the events when the animal arrives at one of the 127 nodes of the binary tree that describes the maze (see Figure 1C); at an end node, the only way to continue is to reverse course. We call the transition from one node to the next a "step". The following investigations all apply to an animal's node sequence.

Few-shot learning of a reward location

We now examine early changes in the animal's behavior that reveal how it rapidly acquires and remembers information needed for navigation. First we focus on navigation to the water port.

The ten water-deprived animals had no indication that water would be found in the maze. Yet all 10 discovered the water port in less than 2000 s, requiring fewer than 17 bouts (Figure 3A). The port dispensed only a drop of water followed by a 90-s timeout before rearming. During the timeout the animals generally left the port location to explore other parts of the maze or return home. For each of the water-deprived animals, the frequency at which it consumed rewards in the maze increased rapidly as it learned how to find the water port, then settled after a few reward experiences (Figure 3A).

How many reward experiences are sufficient to teach the animal reliable navigation to the water port? To establish a learning curve one wants to compare performance on the identical task over successive trials. Recall that this experiment has no imposed trial structure. Yet the animals naturally segmented their behavior through discrete visits to the maze. Thus we focused on all the instances when the animal started at the maze entrance and walked to the water port (Figure 3B).

On the first few occasions these paths to water can involve hundreds of steps between nodes, and their length scatters over a wide range. However, after a few rewards, the animals began taking the perfect path without detours (length 6, Figure 3-figure supplement 1), and soon that became the norm. Note the path length plotted here is directly related to the number of "turning errors": every time the mouse turns away from the shortest path to the water port, that adds two steps to the path length (Equation 7). The rate of these errors declined steeply over the first ~10 rewards consumed (Figure 3B). Late in the night ~75% of the paths to water were perfect. The animals executed them with increasing speed; eventually these fast "water runs" took as little as 2 s (Figure 3B).
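The relation between path length and turning errors invoked above (Equation 7) amounts to simple bookkeeping, sketched here; the function name is ours.

```python
# Each turn away from the shortest path must be backtracked, so every
# turning error adds exactly two steps to the run (Equation 7):
#     path_length = 6 + 2 * turning_errors

def turning_errors(path_length, shortest=6):
    """Infer the number of turning errors from a run's length in node steps."""
    assert path_length >= shortest and (path_length - shortest) % 2 == 0
    return (path_length - shortest) // 2

print(turning_errors(6))    # a perfect water run: 0 errors
print(turning_errors(12))   # three detours of two steps each: 3 errors
```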
Many of these visits went unrewarded owing to the 90-s timeout period on the water port.

In summary, after ~10 reward experiences on average the mice learn to navigate efficiently to the water port, which requires making 6 correct decisions, each among 3 options. Note that even at late times, long after they have perfected the "water run", the animals continue to take some extremely long paths: a subject for a later section (Figure 6).

[Figure 3: (A) Reward events for the 10 water-deprived mice (red dots; every fifth reward has a blue tick mark). (B) The length of runs from the entrance to the water port, measured in steps between nodes and plotted against the number of rewards experienced. Main panel: all individual runs (cyan dots) and the median over 10 mice (blue circles); an exponential fit decays by 1/e over 10.1 rewards. Right panel: histogram of the run length (note log axis); red: perfect runs with the minimum length 6; green: longer runs. Top panel: the fraction of perfect runs (length 6) plotted against the number of rewards experienced, along with the median duration of those perfect runs.]

Discontinuous learning

While an average across animals shows evidence of rapid learning (Figure 3), one wonders whether the knowledge is acquired gradually or through moments of "sudden insight". To […]

Over time, the animals learned the path to water not only from the entrance of the maze but from many locations scattered throughout the maze. The largest distance between the water port and an end node in the opposite half of the maze involves 12 steps through 11 intersections (Figure 4A). Thus we included as another behavioral variable the occurrence of long direct paths to the water port, which reflects how directedly the animals navigate within the maze.
Figure 4B shows for one animal the cumulative occurrence of water rewards and that of long direct paths to water. The animal discovers the water port early on, at 75 s, but at 1380 s the rate of water rewards jumps suddenly by a factor of 5. The long paths to water follow a rather different timeline. At first they occur randomly, at the same rate as the paths to the unrewarded control nodes. At 2070 s the long paths suddenly increase in frequency by a factor of 5. Given the sudden change in rates of both kinds of events, there is little ambiguity about when the two steps happen, and they are well separated in time (Figure 4B). The animal behaves as though it gains a new insight at the time of the second step that allows it to travel to the water port directly from elsewhere in the maze. Note that the two behavioral variables are independent: the long paths don't change when the reward rate steps up, and the reward rate doesn't change when the rate of long paths steps up. Another animal (Figure 4C) similarly showed an early step in the reward rate (at 860 s) and a dramatic step in the rate of long paths (at 2580 s). In this case the emergence of long paths coincided with a modest increase (factor of 2) in the reward rate.

Similar discontinuities in behavior were seen in at least 5 of the 10 water-deprived animals (Figure 4-figure supplement 1, Figure 4-figure supplement 2), and their timing could be identified to a precision of ~200 s. We varied the criterion of performance by asking for even longer error-free paths, but the results were largely unchanged and no additional discontinuity emerged.

One-shot learning of the home path

For an animal entering an unfamiliar environment, the most important path to keep in memory may be the escape route. In the present case that is the route to the maze entrance, from which the tunnel leads home to the cage. We expected that the mice would begin by penetrating into the maze gradually and return home repeatedly so as to confirm the escape route. This might help build a memory of the home path gradually, level-by-level into the binary tree. Nothing could be further from the truth.

At the end of any given bout into the maze, there is a "home run", namely the direct path without reversals that takes the animal to the exit (see Figure 3-figure supplement 1).
Figure 5A shows the nodes where each animal started its first home run, following the first penetration into the maze. With few exceptions that first home run began from an end node, as deep into the maze as possible. Recall that this involves making the correct choice at six successive 3-way intersections, an outcome that is unlikely to happen by chance.

The above hypothesis regarding gradual practice of home runs would predict that short […]

Once the animal has learned to perform long uninterrupted paths to the water port, one can categorize its behavior by three states: (1) walking to the water port; (2) walking to the exit; and (3) exploring the maze. Operationally we define exploration as all periods in which the animal is in the maze but not on a direct path to water or to the exit. For the ten sated animals this includes all times in the maze except for the walks to the exit. […] (Figure 6-figure supplement 1). The rewarded mice began about half their bouts into the maze with a trip to the water port and the other half by exploring (Figure 6A). After a drink, the animals routinely continued exploring, about 90% of the time.

For water-deprived animals the dominance of exploration persisted even at a late stage of the night, when they routinely executed perfect exploitation bouts to and from the water port:

Over the duration of the night the 'explore' fraction dropped slightly from 0.92 to 0.75, with the balance accrued to the 'drink' and 'leave' modes as the animals executed many direct runs to the water port and back. The unrewarded group of animals also explored the maze throughout the night even though it offered no overt rewards (Figure 6-figure supplement 1). One suspects that the animals derive some intrinsic reward from the act of patrolling the environment itself.

One can presume that a goal of the exploratory mode is to rapidly survey all parts of the environment for the appearance of new resources or threats. We will measure the efficiency of exploration as

E = 32 / N32,  (Equation 1)

where N32 is the number of end-node visits the animal requires to discover half (32) of the 64 end nodes. This mouse explores with efficiency E = 32/76 = 0.42. For comparison, Figure 7A plots the performance of the optimal agent (E = 1.0) and that of a random walker that makes random decisions at every 3-way junction (E = 0.23). Note the mouse is about half as efficient as the optimal agent, but twice as efficient as a random walker.

[Figure 8: (A) Definition of turning biases at a T-junction. Top: an animal arriving via the stem of the T; SF is the probability that it will move forward rather than reversing, and given that it moves forward, SA is the probability that it will take an alternating turn from the preceding one (gray), i.e. left-right or right-left. Bottom: an animal arriving from the bar of the T may either reverse, go straight, or turn into the stem of the T; BF is the probability that it will move forward through the junction rather than reversing, and given that it moves forward, BS is the probability that it turns into the stem. (B) Scatter graph of the biases SF and BF (left) and SA and BS (right); every dot represents a mouse; cross: values for an unbiased random walk. (C) Exploration curve of new end nodes discovered vs end nodes visited, displayed as in Figure 7A, including results from a biased random walk with the 4 turning biases derived from the same mouse, as well as a more elaborate Markov-chain model (see Figure 10C).]
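Equation 1 can be computed directly from the sequence of end-node visits. A minimal sketch, with function and variable names of our own choosing:

```python
# Efficiency of exploration (Equation 1): E = 32 / N32, where N32 is the
# number of end-node visits needed to discover half (32) of the 64 end nodes.

def exploration_efficiency(end_node_visits, n_total=64):
    """end_node_visits: iterable of end-node ids in the order they were visited."""
    target = n_total // 2
    seen = set()
    for n_visits, node in enumerate(end_node_visits, start=1):
        seen.add(node)
        if len(seen) == target:
            return target / n_visits
    return None  # fewer than half the end nodes were ever discovered

# The optimal agent discovers a new end node on every visit:
print(exploration_efficiency(range(64)))   # -> 1.0
```

For the example mouse in the text, 32 distinct end nodes were first reached on the 76th end-node visit, giving E = 32/76 = 0.42.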
[Figure 8, continued: (D) Efficiency of exploration (Equation 1) in 19 mice compared to the efficiency of the corresponding biased random walk.]

Rules of exploration

What allows the mice to search much more efficiently than a random walking agent? We inspected more closely the decisions that the animals make at each 3-way junction. It emerged that these decisions are governed by strong biases (Figure 8) that depart markedly from an unbiased random walk (Figure 8B, Figure 8-figure supplement 1). First, the animals have a strong preference for proceeding through a junction rather than returning to the preceding node (SF and BF in Figure 8B). Second, there is a bias in favor […]

Qualitatively, one can see that these turning biases will improve the animal's search strategy.

The forward biases SF and BF keep the animal from re-entering territory it has covered already.

The bias BS favors taking a branch that leads out of the maze. This allows the animal to rapidly cross multiple levels during an outward path and then enter a different territory. By comparison, the unbiased random walk tends to get stuck in the tips of the tree and revisits the same end nodes many times before escaping. To test this intuition we simulated a biased random agent whose turning probabilities at a T-junction followed the same biases as measured from the animal (Figure 8C). These biased agents did in fact search with much higher efficiency than the unbiased random walk. They did not fully explain the behavior of the mice (Figure 8D), accounting for ~87% of the animal's efficiency (compared to 60% for the random walk). A more […]
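The simulation just described can be sketched as follows. This is a simplified stand-in, not the authors' code: it uses a heap-indexed tree of our own convention, omits the alternation bias SA (the left/right choice from the stem is made uniformly), treats the entry junction as having no stem, and the bias values used in the comparison are illustrative rather than the fitted ones.

```python
import random

FIRST_LEAF = 63   # nodes 63..126 of the 127-node tree are end nodes

def step(node, prev, SF, BF, BS):
    """One step of a turning-biased walk (SF, BF, BS as defined in Figure 8A)."""
    parent = (node - 1) // 2 if node > 0 else None
    kids = (2 * node + 1, 2 * node + 2) if node < FIRST_LEAF else ()
    if not kids:                       # end node: the only option is to reverse
        return prev
    if node == 0:                      # entry junction, treated as having no stem
        options = [k for k in kids if k != prev] + ([prev] if prev is not None else [])
        return random.choice(options)
    if prev == parent:                 # arrived via the stem of the T
        return random.choice(kids) if random.random() < SF else parent
    if random.random() < BF:           # arrived along the bar; move forward
        return parent if random.random() < BS else next(k for k in kids if k != prev)
    return prev                        # reverse

def n32(SF, BF, BS, n_steps=50_000, seed=0):
    """End-node visits needed to discover 32 distinct end nodes (fewer = better)."""
    random.seed(seed)
    node, prev, seen, visits = 0, None, set(), 0
    for _ in range(n_steps):
        node, prev = step(node, prev, SF, BF, BS), node
        if node >= FIRST_LEAF:
            visits += 1
            seen.add(node)
            if len(seen) == 32:
                return visits
    return None

# Unbiased walk: all three corridors equally likely, i.e. SF = BF = 2/3, BS = 1/2.
# A forward- and outward-biased agent (illustrative values such as SF = BF = 0.9,
# BS = 0.6) reaches 32 distinct end nodes in fewer visits, i.e. higher E = 32/n32.
```

The design point mirrors the text: the forward biases keep the agent from dithering at junctions, and the stem bias lets it climb out of an exhausted subtree instead of bouncing among its tips.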

Clearly some components of efficient search in these mice remain to be understood.

Systematic node preferences

A surprising aspect of the animals' explorations is that they visit certain end nodes of the binary tree much more frequently than others (Figure 9). This effect is large: more than a factor of 10 difference between the occupancy of the most popular and least popular end nodes (Figure 9A-B). This was surprising given our efforts to design the maze symmetrically, such that in principle all end nodes should be equivalent. Furthermore, the node preferences were very consistent across animals and even across the rewarded and unrewarded groups. Note that the standard error across animals of each node's occupancy is much smaller than the differences between the nodes (Figure 9B).

The nodes on the periphery of the maze are systematically preferred. Comparing the outermost ring of 26 end nodes (excluding the water port and its neighbor) to the innermost 16 end nodes, the outer ones are favored by a large factor of 2.2. This may relate to earlier reports of a "centrifugal tendency" among rats patrolling a maze (Uster et al., 1976). Interestingly, the biased random walk using four bias numbers (Figure 8, Figure 10D) replicates a good amount of the pattern of preferences. For unrewarded animals, where the maze symmetry is not disturbed by the water port, the biased random walk predicts 51% of […]

Ideally, the depth of these action trees would be very large, so as to take as much of the prior history into account as possible. However, one soon runs into a problem of over-fitting:

Because each T-junction in the maze has 3 neighboring junctions, the number of possible histories grows as 3^k. As the history depth k increases, this quickly exceeds the length of the measured node sequence, so that every history appears only zero or one times in the data. At this point one can no longer estimate any probabilities, and cross-validation on a different segment of data fails catastrophically. In practice we found that this limitation sets in already beyond k = 2 (Figure 10-figure supplement 1A). To address this issue of data limitation we developed a […]

[Figure 10: schematic of the candidate models: a Markov chain over a "fixed tree" of histories (fixed depth), a Markov chain over a "variable tree" (variable depth), and the "biased walk".]
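The combinatorial explosion is easy to demonstrate on surrogate data. In this sketch (our own illustration, not the paper's analysis), a random action sequence of roughly one night's length is scanned for history strings of depth k; once 3^k exceeds the data length, nearly every observed history is a singleton and transition probabilities can no longer be estimated.

```python
import random
from collections import Counter

def history_counts(actions, k):
    """Occurrence counts of each length-k history in a discrete action sequence."""
    return Counter(tuple(actions[i:i + k]) for i in range(len(actions) - k + 1))

random.seed(1)
actions = random.choices("LRB", k=10_000)   # surrogate: ~one night of junction choices
for k in (2, 6, 10):
    counts = history_counts(actions, k)
    singletons = sum(1 for v in counts.values() if v == 1)
    print(f"k={k}: 3^k={3**k}, observed histories={len(counts)}, singletons={singletons}")
```

At k = 2 all 9 histories are seen thousands of times; at k = 10, with 3^10 = 59049 possible histories and only ~10^4 samples, the vast majority of observed histories occur exactly once.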

With these methods we focused on the portions of trajectory when the mouse was in 'explore' mode, because the segments in 'drink' and 'leave' mode are fully predictable. Furthermore, we evaluated the models only at nodes corresponding to T-junctions, because the decision from an end node is again fully predictable. Figure 10E compares the models' performance; the Markov-chain models that produced the best fits to the behavior used history strings with an average length of ~4. We also evaluated the predictions obtained from the simple biased random walk model (Figure 10D). Recall that this attempts to capture the history-dependence with just 4 bias parameters (Figure 8A). As expected this produced considerably higher cross-entropies than the more sophisticated Markov chains (by about 18%, Figure 10E). Finally we used several professional file compression routines to try and compress the mouse's node sequence. In […]
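Cross-entropy per decision, the yardstick used for these comparisons, can be sketched as follows. This is a hypothetical minimal implementation: the model is just a lookup from history to outcome probabilities, with a uniform fallback for unseen histories.

```python
from math import log2

def cross_entropy(decisions, model, fallback=1/3):
    """Mean -log2 q(action | history) over held-out decisions, in bits.

    decisions: iterable of (history, action) pairs.
    model: dict mapping history -> dict of action probabilities.
    """
    total, n = 0.0, 0
    for history, action in decisions:
        q = model.get(history, {}).get(action, fallback)
        total -= log2(q)
        n += 1
    return total / n

# A model that spreads probability uniformly over the 3 options at a T-junction
# scores log2(3) ~ 1.58 bits; the best models in the text reach ~1.24 bits.
```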

This finally is a quantitative answer to the question, "How well can one predict the animal's behavior?" Whether the remainder represents an irreducible uncertainty (akin to "free will" of the mouse) remains to be seen. Readers are encouraged to improve on this number by applying their own models of behavior to our published data set.

Summary of contributions

We present a new approach to the study of learning and decision-making in mice. We give the animal access to a complex labyrinth and leave it undisturbed for a night while monitoring its movements. The result is a rich data set that reveals new aspects of learning and the structure of exploratory behavior. With these methods we find that mice learn a complex task that requires 6 correct 3-way decisions after only ~10 experiences of success (Figure 2, Figure 3). Along the way the animal gains task knowledge in discontinuous steps that can be localized to within a few minutes of resolution (Figure 4). Underlying the learning process is an exploratory behavior that occupies 90% of the animal's time in the maze, persists long after the task has been mastered, and continues even in the complete absence of an extrinsic reward (Figure 6). The decisions the animal makes at choice points in the labyrinth are constrained in part by the history of its actions (Figure 8, Figure 10), in a way that favors efficient searching of the maze (Figure 7). This microstructure of behavior is surprisingly consistent across mice, with variation in parameters of only a few percent (Figure 8). Our most expressive models to predict the animal's choices still leave a remaining uncertainty of ~1.24 bits per decision (Figure 10), a quantitative benchmark by which competing models can be tested. Finally, as discussed […]

[…] where the odor appeared together with sugar (Bitterman et al., 1983). Similarly, rodents will start digging for food in a scented bowl after just a few pairings with that odor (Cleland et al., 2009). Again, these are 1-bit tasks learned rapidly after one or a few experiences.

By comparison the tasks that a mouse performs in the labyrinth are more complex. For example, the path from the maze entrance to the water port involves 6 junctions, each with 3 options. At a minimum, 6 different contexts must be mapped correctly into one of 3 actions each, which involves 6 · log2(3) ≈ 9.5 bits of complexity. The animals begin to execute perfect paths from the entrance to the water port well within the first hour (Figure 2C, Figure 3B). At a later stage during the night the animal learns to walk direct paths to water from many different locations in the maze (Figure 4); by this time it has consumed 10-20 rewards. In […] Yet the animals take a long time to learn these simple tasks. For example, the mouse with the steering wheel requires about 10,000 experiences before performance saturates. It never gets particularly good, with a typical hit rate only 2/3 of the way from random to perfect.
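The complexity figure is straightforward arithmetic:

```python
from math import log2

# 6 junctions, each mapping a context to one of 3 possible actions:
bits = 6 * log2(3)
print(round(bits, 1))   # -> 9.5
```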

All this training takes 3-6 weeks; in the case of monkeys, several months. The rate of learning, measured in task complexity per unit time, is surprisingly low: < 1 bit/month compared to ~10 bits/h observed in the labyrinth. The difference is a factor of 6,000. Similarly when measured in complexity learned per reward experience: the 2AFC mouse may need 5,000 rewards to learn a contingency table with 1 bit complexity; the mouse in the maze needs ~10 rewards to learn 10 bits. Given these differences in learning rate spanning many orders of magnitude, it seems likely that whatever neural process underlies ultra-slow 2AFC learning is different from the implementation of fast learning in the labyrinth and other complex environments.

Furthermore, the ultra-slow mode of learning may have little relevance for an animal's natural condition. In the month that the 2AFC mouse requires to finally report the location of a light, its relative in the wild has developed from a baby to having its own babies. Along the way, that wild mouse had to make many decisions, often involving high stakes, without the benefit of 10,000 trials of practice.

[…] The average of many discontinuous curves will certainly look continuous and incremental, but that reassuring shape may miss the essence of the learning process. A recent reanalysis of many Pavlovian conditioning experiments suggested that discontinuous steps in performance are the rule rather than the exception (Gallistel et al., 2004). Here we found that the same applies to navigation in a complex labyrinth. While the average learning curve presents like a continuous function (Figure 3B), the individual records of water rewards show that each animal improves rather quickly, but at different times (Figure 3A).

Owing to the unstructured nature of the experiment, the mouse may adopt different policies for getting to the water port. In at least half the animals we observed a discontinuous change in that policy, namely when the animal started using efficient direct paths within the maze (Figure 4, Figure 4-figure supplement 2). This second switch happened considerably after the animal started collecting rewards, and did not greatly affect the reward rate. Furthermore, the animals never reverted to the less efficient policy, just as a child rarely unlearns to balance a bicycle.

Presumably this switch in performance reflects some discontinuous change in the animal's […]

By all accounts the animals spent a large fraction of the night exploring the maze (Figure 1-figure supplement 2). The water-deprived animals continued their forays into the depths of the maze long after they had found the water port and learned to exploit it regularly. The sated animals experienced no overt reward from the maze, yet they likewise spent nearly half their time in that environment. As has been noted many times, animals, like humans, derive some form of intrinsic reward from exploration (Berlyne, 1960). Some have suggested that there exists a homeostatic drive akin to hunger and thirst that elicits the information-seeking activity, and that the drive is in turn sated by the act of exploration (Hughes, 1997). If this were the case, then the drive to explore should be weakest just after an episode of exploration, much as the drive for food-seeking is weaker after a big meal.

Our observations are in conflict with this notion. The animal is most likely to enter the maze within the first minute of its return to the cage (Figure 1-figure supplement 3), a strong trend […]

[…] answer. This would manifest as an unexpectedly high error rate on unambiguous stimuli, sometimes called the "lapse rate" (Carandini and Churchland, 2013). The fact that the lapse rate decreases only gradually over weeks to months of training (Burgess et al., 2017) suggests that it is difficult to crush the animal's drive to explore.

The animals in our experiments had never been presented with a maze environment, yet they quickly settled into a steady mode of exploration. Once a mouse progressed beyond the first intersection it typically entered deep into the maze to one or more end nodes (Figure 5). Within 50 s of the first entry the animals adopted a steady speed of locomotion that they would retain throughout the night (Figure 2-figure supplement 1). Within 250 s of first contact with the maze the average animal already spent 50% of its time there (Figure 1-figure supplement 2). Contrast this with a recent study of "free exploration" in an exposed arena: those animals required several hours before they even completed one walk around the perimeter (Fonio et al., 2009). Here the drive to explore is clearly pitted against fear of the open space, which may not be conducive to observing exploration per se.

The persistence of exploration throughout the entire duration of the experiment suggests that the animals are continuously surveying the environment, perhaps expecting new features to arise. These surveys are quite efficient: the animals cover all parts of the maze much faster than expected from a random walk (Figure 7). Effectively they avoid re-entering territory they surveyed just recently. It is often assumed that this requires some global memory of places visited in the environment (Nagy et al., 2020; Olton, 1979). Such memory would have to […]

This question will require future directed experiments, but we can exclude a few candidate explanations based on observations so far. Early workers already concluded that rodents in a maze will use whatever sensory cues and tricks are available to accomplish their tasks (Munn, 1950b). Our maze was designed to restrict those options somewhat.
The goal of the study was to observe mice as they explored a complex environment for the first time, with little or no human interference and no specific instructions. In preliminary experiments we tested several labyrinth designs and water reward schedules. Eventually we settled on the protocol described here, and tested 20 mice in rapid succession. Each mouse was observed only over a 7-hour period during the first night it encountered the labyrinth.

[…] shortening of each branch, ranging from ~12 inch to 1.5 inch (Figure 1 and Figure 2). A single end node contained a 1.5 cm circular opening with a water delivery port (described below).

The maze included provision for two additional water ports not used in the present report. In […]

Rates of transition between cage and maze
This section relates to Figure 1-figure supplement 3. We entertained the hypothesis that the animals become "thirsty for exploration" as they spend more time in the cage. In that case one would predict that the probability of entering the maze in the next second will increase with time spent in the cage. One can compute this probability from the distribution of residency times in the cage, as follows.

Say $t = 0$ when the animal enters the cage. The probability density that the animal will next leave the cage at time $t$ is

$$p(t) = r(t) \, e^{-\int_0^t r(t') \, dt'}$$

where $r(t)$ is the instantaneous rate for entering the maze. So

$$\int_0^t r(t') \, dt' = -\ln\!\left( 1 - \int_0^t p(t') \, dt' \right)$$

This relates the cumulative of the instantaneous rate function to the cumulative of the residency-time distribution.

The rate of entering the maze is highest at short times in the cage (Figure 1-figure supplement 3A). It peaks after ~15 s in the cage and then declines gradually by a factor of 4 over the first minute. So the mouse is most likely to enter the maze just after it returns from there.
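The instantaneous rate can be sketched numerically as a standard hazard estimate. In this stand-in (our own illustration, using synthetic exponential residency times), the rate in each time bin is the number of departures in the bin divided by the number of stays still ongoing at the bin's start, per unit time; for a constant-rate Poisson process the estimate comes out flat.

```python
import random

def hazard(times, edges):
    """Instantaneous rate per bin: departures / (stays at risk * bin width)."""
    rates = []
    for t0, t1 in zip(edges, edges[1:]):
        departures = sum(t0 <= t < t1 for t in times)
        at_risk = sum(t >= t0 for t in times)
        rates.append(departures / at_risk / (t1 - t0) if at_risk else float("nan"))
    return rates

random.seed(0)
residencies = [random.expovariate(1 / 30) for _ in range(20_000)]  # mean 30 s
r = hazard(residencies, list(range(0, 65, 5)))
# r is approximately constant near 1/30 per second, as expected for a
# memoryless process; the mice instead showed a peak at ~15 s in the cage.
```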

This runs opposite to the expectation from a homeostatic drive for exploration, which should be sated right after the animal returns. We found no evidence for an increase in the rate at late times. These effects were very similar in rewarded and unrewarded groups, and in fact the tendency to return early was seen in every animal.

By contrast, the rate of exiting the maze is almost perfectly constant over time (Figure 1-figure supplement 3B). In other words, the exit from the maze appears like a constant-rate Poisson process. There is a slight elevation of the rate at short times among rewarded animals (Figure 1-figure supplement 3B top). This may come from the occasional brief water runs they perform. Another strange deviation is an unusual number of very short bouts (duration 2-12 s) among unrewarded animals (Figure 1-figure supplement 3B bottom). These are brief excursions in which the animal runs to the central junction, turns around, and runs to the exit.

Several animals exhibited these, often several bouts in a row, and at all times of the night.

In Figure 4 and Figure 5 we count the occurrence of direct paths leading to the water port (a "water run") or to the exit (a "home run"). A direct path is a node sequence without any reversals. Figure 3-figure supplement 1 illustrates some examples.
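Given this definition, direct paths can be detected with a single pass over the node sequence. A minimal sketch; the water-run sequence is taken from Figure 3-figure supplement 1, while the `detour` example (a wrong step to node 14 that gets backtracked) is hypothetical:

```python
def is_direct(path):
    """A direct path is a node sequence without any reversals: the
    animal never steps back to the node it just came from."""
    return all(path[i] != path[i - 2] for i in range(2, len(path)))

# The water run from Figure 3-figure supplement 1 is direct:
water_run = [0, 2, 6, 13, 28, 57, 116]
# A hypothetical bout with one wrong step that must be backtracked:
detour = [0, 2, 6, 14, 6, 13, 28, 57, 116]
```

Because the maze is a tree, excluding immediate backtracking is equivalent to requiring that no node is visited twice along the path.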

If the animal makes one wrong step from the direct path, that step needs to be backtracked, which introduces a reversal, so the path no longer counts as direct.

At some point the animal starts to collect rewards at a steady rate: this is when the green curve rises up. At a later time the long direct paths to the water port become much more frequent than to the comparable control nodes: this is when the red and blue curves diverge. For almost all animals these two events are well separated in time (Figure 4-figure supplement 1). In many cases the rate of long paths seems to change discontinuously: a sudden change in slope of the curve.

The efficiency of exploration was calculated as follows: We restricted the animal's node trajectory to clips of exploration mode, excluding the direct paths to the water port or the exit. All subsequent steps were applied to these clips.

The parameter $t_{1/2}$ is the number of visits required to survey half of the end nodes, whereas $\delta$ reflects a relative acceleration in discovering the last few end nodes. This function was found by trial and error and produces absurdly good fits to the data (Figure 7-figure supplement 1). The values quoted in the text for the efficiency of exploration are $E = 32/t_{1/2}$ (Equation 1). The value of $\delta$ was generally small (~0.1), with no difference between rewarded and unrewarded animals. It declined slightly over the night (Figure 7-figure supplement 1B), along with the decline in $E$ (Figure 7C).

Biased random walk

For the analysis of Figure 8 we considered only the parts of the trajectory during 'exploration' mode. Then we parsed every step between two nodes in terms of the type of action it represents.
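The node numbering of Figure 3-figure supplement 1 follows the standard binary-heap convention (the children of node i are 2i+1 and 2i+2, consistent with the example sequences 0, 2, 6, 13, 28, 57, 116 and 8, 3, 1, 0), so each step can be classified directly from the node indices. A minimal sketch; the 'in'/'out' labels are our shorthand for moving deeper into the maze versus back toward the exit:

```python
def link_type(a, b):
    """Classify a step between adjacent maze nodes a -> b. Children of
    node i are 2*i + 1 (left branch) and 2*i + 2 (right branch)."""
    parent, child = (a, b) if a < b else (b, a)
    if child == 2 * parent + 1:
        branch = "left"
    elif child == 2 * parent + 2:
        branch = "right"
    else:
        raise ValueError(f"nodes {a} and {b} are not adjacent in the maze")
    direction = "in" if b == child else "out"  # in = deeper into the maze
    return branch, direction

def parse_steps(trajectory):
    """Parse a node trajectory into the action type of every step."""
    return [link_type(a, b) for a, b in zip(trajectory, trajectory[1:])]
```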

Note that every link between nodes in the maze is either a 'left branch' or a 'right branch'.

Models of decisions during exploration
The general approach is to develop a model that assigns probabilities to the animal's next action, namely which node it will move to next, based on its recent history of actions. All the analysis was restricted to the animal's 'exploration' mode and to the 63 nodes in the maze that are T-junctions. During the 'drink' and 'leave' modes the animal's next action is predictable.

Similarly, when it finds itself at one of the 64 end nodes, it only has one action available.
For every mouse trajectory we split the data into 5 segments, trained the model on 80% of the data, and tested it on 20%, averaging the resulting cross-entropy over the 5 possible splits.

As one pushes to longer histories, i.e. larger $k$, the analysis quickly becomes data-limited, because the number of possible histories grows exponentially with $k$. Soon one finds that the counts for each history-action combination drop so low that one can no longer estimate the probabilities reliably. In an attempt to offset this problem we pruned the history tree such that each surviving branch had more than some minimal number of counts in the training data.
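A minimal sketch of the fixed-depth evaluation: train counts of history-action pairs, then score held-out data by cross-entropy in bits per decision. The additive smoothing is our assumption for handling unseen history-action pairs (the text instead addresses data limitation by pruning rare histories), and trajectories are assumed to be lists of node indices:

```python
import math
import random
from collections import defaultdict

def cross_entropy_bits(train, test, k=2, alpha=1.0):
    """Fixed-depth model: predict the next node from the previous k
    nodes, with additive smoothing (pseudo-count alpha).
    Returns the cross-entropy in bits per decision on `test`."""
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(k, len(train)):
        counts[tuple(train[i - k:i])][train[i]] += 1.0
    alphabet = set(train) | set(test)
    total, n = 0.0, 0
    for i in range(k, len(test)):
        c = counts[tuple(test[i - k:i])]
        p = (c.get(test[i], 0.0) + alpha) / (sum(c.values()) + alpha * len(alphabet))
        total -= math.log2(p)
        n += 1
    return total / n

# A perfectly alternating sequence is almost fully predictable...
ce_alt = cross_entropy_bits([0, 1] * 500, [0, 1] * 100, k=1)
# ...while an unpredictable binary sequence costs close to 1 bit/decision.
rng = random.Random(0)
seq = [rng.randint(0, 1) for _ in range(5000)]
ce_rand = cross_entropy_bits(seq[:4000], seq[4000:], k=2)
```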

As expected, this model is less prone to over-fitting and degrades more gently as one extends to longer histories (Figure 10-figure supplement 1A). The lowest cross-entropy was obtained with an average history length of ~4.0, but including some paths of up to length 6.
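The count-based pruning of the history tree described above can be sketched as follows: a history context is kept only if it occurs often enough in the training data (and its shorter suffix was itself kept), and at test time each decision is conditioned on the longest surviving context. Thresholds and function names are ours:

```python
from collections import Counter

def pruned_histories(seq, min_count=20, max_depth=6):
    """Keep a history context only if it occurs at least `min_count`
    times in the training sequence and its one-step-shorter suffix
    was also kept -- a count-based pruning of the history tree."""
    kept = {()}
    for depth in range(1, max_depth + 1):
        counts = Counter(tuple(seq[i - depth:i]) for i in range(depth, len(seq)))
        new = {h for h, c in counts.items() if c >= min_count and h[1:] in kept}
        if not new:
            break
        kept |= new
    return kept

def context(recent, kept, max_depth=6):
    """Longest suffix of the recent node sequence that survived pruning."""
    for d in range(min(max_depth, len(recent)), 0, -1):
        if tuple(recent[-d:]) in kept:
            return tuple(recent[-d:])
    return ()
```

The resulting model has variable depth: common pasts are modeled with long contexts, rare ones fall back to shorter contexts, which is what produces an *average* history length such as ~4.0.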

Of all the algorithms we tested, this produced the lowest cross-entropies, although the gains relative to the fixed-depth model were modest (Figure 10-figure supplement 1C).

Pooling across symmetric nodes in the maze

If the animal behaved identically at all equivalent junctions, one would be justified in pooling across these nodes, leading to a better estimate of the action probabilities, and perhaps less over-fitting. This particular procedure was unsuccessful, in that it produced higher cross-entropy than without pooling.
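Under the heap-style numbering of Figure 3-figure supplement 1 (children of node i are 2i+1 and 2i+2), the symmetry class of a junction can be computed directly from its index. A minimal sketch of such pooling keys (function names are ours):

```python
def node_level(i):
    """Depth of node i in the binary tree (root 0 is at level 0),
    assuming children of node i are numbered 2*i + 1 and 2*i + 2."""
    return (i + 1).bit_length() - 1

def branch_type(i):
    """'L' if node i is reached by a left branch from its parent
    (odd index), 'R' for a right branch; the root has neither."""
    if i == 0:
        return None
    return "L" if i % 2 == 1 else "R"

def pooling_key(i):
    """Pool action counts across all junctions sharing level and
    branch type, instead of keeping every junction separate."""
    return (node_level(i), branch_type(i))
```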

However, one may want to distinguish two types of junctions within a given level: L-nodes are reached by a left branch from their parent junction one level lower in the tree, R-nodes by a right branch. For example, in Figure 3-figure supplement 1, node 1 is L-type and node 2 is R-type. When we pooled histories over all the L-nodes at a given level and separately over all the R-nodes, the cross-entropy indeed dropped, by about 5% on average. This pooling greatly reduced the amount of over-fitting (Figure 10-figure supplement 1B).

Figure 3-figure supplement 1. A numbering scheme for all 127 nodes of the maze. Green: a direct path from the entrance to the water port ("water run") with the node sequence (0, 2, 6, 13, 28, 57, 116), involving 6 decisions. Magenta: a direct path from end node 83 to the exit ("home run"). Orange: a path from end node 67 to the exit that includes a reversal; here the home run starts only from node 8, namely (8, 3, 1, 0).

For every bout we compared the start of the node sequence leading into the maze with the final portion leading back out to the exit. The number of nodes of the entry sequence that match the time-reverse of the exit sequence is called the "overlap". This figure histograms the overlap for all bouts of all animals. Note the minimum overlap is 1, because all paths into and out of the maze have to pass through the central junction (node 0 in Figure 3-figure supplement 1). This is also the most frequent overlap. The peak at overlap 6 for rewarded animals results from the frequent direct paths to the water port and back, a sequence of 6 nodes in each direction.

Figure 7-figure supplement 1. (A) Fits to the data from the mouse's exploration. Animals with best fit (top) and worst fit (bottom). The relative uncertainty in the two fit parameters was only 0.0038 ± 0.0020 (mean ± SD across animals). (B) The fit parameter for all animals, comparing the first to the second half of the night.
(C) The efficiency (Equation 1) predicted from two models of the mouse's trajectory: the 4-bias random walk (Figure 10D) and the optimal Markov chain (Figure 10C).

Figure 10-figure supplement 1. (A) The cross-entropy of the model's prediction is plotted as a function of the average depth of history. In both cases we compare the results obtained on the training data ('train') vs those on separate testing data ('test'). Note that at larger depth the 'test' and 'train' estimates diverge, a sign of over-fitting the limited data available. (B) As in (A), but to combat the data limitation we pooled the counts obtained at all nodes that were equivalent under the symmetry of the maze (see Methods). Note considerably less divergence between 'train' and 'test' results, and a slightly lower cross-entropy during 'test' than in (A). (C) The minimal cross-entropy (circles in (B)) produced by variable vs fixed history models for each of the 19 animals. Note the variable history model always produces a better fit to the behavior.