## Abstract

Problem-solving and reasoning involve mental exploration and navigation in sparse relational spaces. A physical analogue is spatial navigation in structured environments such as a network of burrows. Recent experiments with mice navigating a labyrinth show a sharp discontinuity during learning, corresponding to a distinct moment of ‘sudden insight’ when mice figure out long, direct paths to the goal. This discontinuity is seemingly at odds with reinforcement learning (RL), which involves a gradual build-up of a value signal during learning. Here, we show that biologically-plausible RL rules combined with persistent exploration generically exhibit discontinuous learning. In structured environments, positive feedback from learning generates a traveling ‘reinforcement wave’. The discontinuity occurs when the wave reaches the starting point. Task difficulty and the learning rule alter its profile and speed, which are determined by the nonlinear dynamics between the nose and bulk of the wave. Predictions explain existing data and motivate specific experiments to isolate the phenomenon. Additionally, we characterize the exact learning dynamics of various RL rules for a complex sequential task.

## INTRODUCTION

As we walk the streets of a city, we rapidly figure out paths to new spots after a few times visiting them. For nesting animals, foraging between new locations and their nests in structured environments is an essential aspect of their survival. Rats constantly navigate within a complex underground network of burrows to expand their stores of food [1]. Navigating from point A to point B in a structured space requires different strategies compared to a similar task on a flat, open field. In the latter, navigation often involves geometric calculations of distances and angles based on celestial cues, compasses or landmarks. In a burrow, on the other hand, a rat needs to learn which way to turn at each intersection and benefits from understanding the relationship between places within the network.

The relational structure of mazes offers a well-controlled experimental paradigm to identify biological algorithms for navigating structured environments. Early laboratory experiments on learning algorithms, and animal behavior at large, involved rats navigating a maze [2–7]. Rats rapidly learn to navigate to a rewarding location within the maze, which often develops into a habitual action sequence resistant to subsequent changes such as the addition of a shortcut. These experiments and others led to the hypothesis that learning entailed the fixation of stimulus-response relationships due to a reward [6–9]. A parallel set of experiments showed that the structure of the maze could be learned during exploration without any significant reward, termed latent learning [10]. Latent learning presumably proceeds through the formation of a ‘cognitive map’, which can be flexibly re-used when the animal needs to generalize to a novel situation [11–13]. This dichotomy between behavioral stereotypy and flexibility is analogous to the modern dichotomy in computational reinforcement learning (RL) between direct and indirect learning, often implemented using model-free and model-based methods respectively [14–16]. However, the specific learning algorithms that animals use to navigate and the circumstances under which one system or the other is employed remain unclear.

Recent developments in deep-learning-based behavioral tracking methods [17–19] allow for following mice in labyrinthine mazes for extended periods of time. In an elegant experiment [20], mice were allowed to freely navigate (in the dark) a maze structured as a depth-six binary tree (Figure 1a). Ten of the twenty mice were water-deprived and a water reward was renewed every 90 seconds from a port at one end of the maze. Results recapitulate aforementioned studies: mice exhibit rapid learning and eventually execute a quick action sequence from home to the water port. In addition, mice persistently explore the maze with exploration biases which are remarkably consistent across rewarded and unrewarded animals.

Intriguingly, the probability that mice take a direct path of > 6 correct binary choices towards the water port exhibits a sharp discontinuity, similar to an ‘a-ha’ moment of sudden insight (Figure 1b). This moment can occur well after the animal learns about the reward and persists for the rest of the experiment. Discontinuous learning curves have also been measured in a variety of other experiments [21]. RL algorithms gradually reinforce correct actions through accumulated experience, which suggests that an alternative mechanism may be at play. The availability of the full history of decisions made by mice within the maze presents a unique opportunity to identify the mechanism behind step-like learning curves.

In this work, we show using agent-based simulations that persistent exploration combined with biologically-plausible RL rules reproduce discontinuous learning. We first develop a generally applicable framework based on inverse reinforcement learning to decouple and analyze the influence of reward-based learning on naturalistic exploratory behavior. Next, to identify the mechanistic origin of discontinuous learning, we develop a model for RL-based sequence learning on sparse relational graphs. We show that the structure of navigation on a sparse graph and the local, feedback-based propagation of the reinforcement signal lead to the formation of a traveling wave of reinforcement, which we term a ‘reinforcement wave’. We develop a nonlinear, continuous-time model, which accurately captures the full learning dynamics and identifies the mechanism behind wave propagation in different exploration regimes. This model extends to commonly used model-free and model-based variants of RL, whose dynamics are quantified in detail. Finally, a re-analysis of experimental data lends further support for the theory and motivates specific experiments to isolate the phenomenon.

## RESULTS

### Discontinuous learning in RL simulations

We begin by specifying an RL model closely following the experimental setup of [20] (Figure 1a). The model is defined by the states (*s*), how the state changes when a certain action (*a*) is taken and the expected reward for each state-action pair, *r*(*s, a*). The states determine the information the agent can use to make a decision. Consistent with the history-dependence of exploratory behavior measured in experiments, we assume the agent knows which specific intersection it is currently at and where it is coming from. When the agent arrives at an intersection along a certain corridor, it can choose to go along any of the corridors at that intersection (including back where it came from). A fixed reward (*r*) is delivered in the corridor leading to water in Figure 1a, upon which the agent is reset and the simulation is repeated.

Actions are chosen randomly with probability

$$\pi(a \mid s) \propto \exp\big(q_{\varepsilon}(s,a) + q_{r}(s,a)\big),$$

up to a normalization constant. *π*(*a*|*s*) is a modified version of the commonly used softmax policy [14], which chooses actions with a log-probability proportional to their expected long-term reward or value, *q*(*s, a*). Here, we split *q*(*s, a*) into two terms, *q*_{ε}(*s, a*) and *q*_{r}(*s, a*). *q*_{ε} is the *intrinsic* value (or a ‘curiosity drive’ [22]) the agent receives on taking an action at that state and is kept fixed throughout learning. *q*_{r} is the *extrinsic* value, which is initially set to zero and is modulated by reward-based learning. Before learning, the agent makes stochastic exploratory choices based on *q*_{ε}(*s, a*), which could be set by an innate bias or guided by knowledge external to the present task. As learning progresses, these exploratory choices are influenced by the reward, which biases the agent towards rewarding actions (or away from costly ones). The randomness of the policy is set by the magnitudes of *q*_{ε} and *q*_{r}, whereas the influence of the reward on exploration is set by their ratio.
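The policy described above can be illustrated with a minimal sketch. It assumes the action probabilities are proportional to exp(*q*_{ε} + *q*_{r}), as the text indicates; the function names `policy` and `sample_action` and the numerical values are hypothetical and chosen only for illustration.

```python
import math
import random

def policy(q_eps, q_r):
    """Action probabilities proportional to exp(q_eps[a] + q_r[a]) (assumed form)."""
    logits = [e + x for e, x in zip(q_eps, q_r)]
    m = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

def sample_action(q_eps, q_r, rng=random):
    """Draw one action index from the policy."""
    probs = policy(q_eps, q_r)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Before learning (q_r = 0) the choice reflects only the intrinsic bias:
# action 0 is disfavoured because its intrinsic value is -1.
before = policy([-1.0, 0.0, 0.0], [0.0, 0.0, 0.0])

# Once action 0 carries an extrinsic value of 2, it overcomes the bias.
after = policy([-1.0, 0.0, 0.0], [2.0, 0.0, 0.0])
```

In this toy case the reinforced action becomes the most likely one only after *q*_{r} + *q*_{ε} exceeds the values of the alternatives.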

This split between intrinsic and extrinsic rewards allows us to examine *in silico* the influence of a learning rule on natural behavior. We first determine *q*_{ε} from the behavior of unrewarded mice in experiments using maximum entropy inverse reinforcement learning (MaxEnt IRL) [23, 24]. MaxEnt IRL finds the maximum entropy policy and the associated reward function that best explain observed behavioral trajectories (see Methods and SI for a brief overview of MaxEnt IRL). Next, we enable learning by specifying a biologically-plausible temporal-differences learning rule [14, 25, 26]. Specifically, *q*_{r} is updated using the learning rule:

$$q_{r}(s,a) \leftarrow q_{r}(s,a) + \alpha\,\delta, \qquad \delta = r(s,a) + \mathbb{E}\big[q_{r}(s',a')\big] - q_{r}(s,a). \tag{1}$$

*δ* is the reward prediction error and the expectation above is with respect to the policy the agent uses at the next state (*s*′). By comparing the best fit *q* values obtained from MaxEnt IRL for rewarded and unrewarded mice, we estimate the reward as *r* ≈ 2 (Figure S1). The remaining free parameter, *α*, scales the rate of learning. Similar to the learning curves from experiments shown in Figure 1b, we track the cumulative number of rewards acquired by the agent and the cumulative number of direct paths to the goal from distant locations in the maze.
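A single application of a temporal-difference rule of this kind can be sketched as follows. The sketch assumes an undiscounted expected-SARSA form, in which the prediction error is the reward plus the policy-averaged extrinsic value at the next state minus the current value; this form is inferred from the description of *δ* and from the statement in Methods that values near the reward saturate at *r*. The names `softmax` and `expected_sarsa_update` and the two-state example are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    return [x / z for x in w]

def expected_sarsa_update(q_r, q_eps, s, a, reward, s_next, alpha, terminal=False):
    """Update q_r[s][a] in place and return the prediction error delta.

    Assumed rule: delta = reward + E_pi[q_r(s', a')] - q_r(s, a), no discounting.
    """
    if terminal:
        expected_next = 0.0  # nothing follows the rewarded transition
    else:
        # Policy at the next state uses intrinsic plus extrinsic values.
        probs = softmax([e + x for e, x in zip(q_eps[s_next], q_r[s_next])])
        expected_next = sum(p * x for p, x in zip(probs, q_r[s_next]))
    delta = reward + expected_next - q_r[s][a]
    q_r[s][a] += alpha * delta
    return delta

# Toy chain A -> B -> goal with two actions per state (action 0 is correct).
q_r = {"A": [0.0, 0.0], "B": [0.0, 0.0]}
q_eps = {"A": [0.0, 0.0], "B": [0.0, 0.0]}

# The rewarded step is reinforced first ...
expected_sarsa_update(q_r, q_eps, "B", 0, reward=2.0, s_next=None, alpha=0.33, terminal=True)
# ... and only then can the preceding step pick up value from it.
expected_sarsa_update(q_r, q_eps, "A", 0, reward=0.0, s_next="B", alpha=0.33)
```

The toy chain shows the backward direction of reinforcement: the value at `A` stays at zero until the value at `B` has become positive.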

Simulated RL agents exhibit rapid learning similar to that observed in experiments. Importantly, the rate of taking a long direct path deviates discontinuously from the default rate (from pure exploration) at a distinctive moment during learning, reproducing the ‘sudden insight’ phenomenon observed in experiments (Figure 1c). This phenomenon is reproduced during re-runs with variability comparable to the variability observed across mice in experiments (Figure S2a). Fitting the rate of direct paths using a logistic function, we find that the transition can be localized to within fewer than three trials in about half of the runs (Figure S2b).

### Goal-oriented navigation on sparse relational graphs

To identify the mechanism that underpins the sharp transition in learning, we develop a general, tractable framework for goal-oriented navigation on sparse relational graphs. In this task, the agent traverses a relational graph (a directed graph whose edge labels specify the action or relationship between two states) from a fixed starting point to a goal where it receives a reward (Figure 2a). We track its progress in finding the direct path (highlighted in Figure 2a) by accumulating experience across multiple episodes. The graph can be very large with possibly many paths to the goal. We consider graphs that capture the core features of a structured environment such as roads on a university campus or abstract knowledge graphs [27]. Specifically, we require: 1) discrete decision points and choices, 2) a sparse graph, namely, one in which the number of paths of comparable length to the direct path is small (unlike a Manhattan-like grid), and 3) long, branching side paths which lead to dead-ends.

Graphs that have the above three features and are sufficiently simple to allow for an in-depth quantitative analysis are tree-structured graphs (Figure 2a), which include the maze architecture from the experiments. In such graphs, the side paths necessarily lead to dead-ends. On encountering a dead-end, the agent will turn back and eventually re-encounter the direct path. The agent’s movements in a side path can thus be represented as a single node, keeping in mind that if the agent goes in, it will surely return (Figure 2b). When the agent returns from the side path, it can choose to go either towards or away from the goal. This choice implies that the agent can make it to the goal via long detours through side paths. As discussed later, the discontinuous learning phenomenon is most salient when detours through side paths are much longer than the direct path (the second of the three core features stated above).

The agent’s exploration biases (specified by *q*_{ε}) play an important role in determining the qualitative character of the learning dynamics. A key parameter is the probability of continuing towards the goal along the direct path whose corresponding *q*_{ε} value we denote *ε* (Figure 2b). *ε* can be interpreted as an analog measure of task difficulty. When *e*^{ε} ≫1 (easy), the agent continues on the direct path for long stretches and rapidly reaches the goal. In the opposite limit, *e*^{−ε} ≫1 (difficult), correct actions along the direct path are rare. To make progress, the agent would have to take constant detours towards the goal through side paths, whose probability is set by the corresponding intrinsic reward *q*_{ε} = *ε*′ (Figure 2b). Clearly, if the probability of going towards the goal both along the direct path and through side paths is small (*e*^{−ε′}, *e*^{−ε} ≫ 1), the agent is very unlikely to make it to the goal. Whether the agent makes any learning progress whatsoever will thus depend on the exploration parameters. We find that the exploration statistics display three sharply delineated regimes depending on the net probability of going towards the goal *vs* back towards the start (SI). If this net probability is negative, the ‘cautious’ agent constantly returns to the starting point and does not learn the task. When the net probability is positive, the ‘adventurous’ agent on average ventures closer to the goal. The marginal case of zero net probability leads to diffusive exploration.

### Reinforcement waves lead to discontinuous learning curves

We now examine the learning dynamics generated by the rule, (1), beginning with RL simulations followed by a theoretical analysis. Since actions that lead the agent away from the goal are never reinforced during learning, only two sets of *q*_{r} values need to be tracked at each intersection *n*: the value for continuing along the direct path towards the goal (*q*_{n}), and the value for turning towards the goal when exiting the side path (we use *n* = 0 and *n* = *N* for the goal and start respectively, see Figure 2b). Figure 2c shows these values at three time points (in units of 1*/α* episodes), highlighting the wave-like propagation of *q*_{n} (Movie S1). The learning curves show a sharp discontinuity (Movie S1, Figure 2d), which occurs precisely when the wave reaches the starting point. Learning time is determined by the wave’s speed, which we measure as the number of intersections on the direct path the wave crosses every 1*/α* episodes. Tracking the half-maximum of *q*_{n}, we find that the wave travels at a constant speed, *v* (Movie S1). Simulations across a range of *ε* show the speed saturating at *v* = 1 for *ε* ≳ 1 and decreasing to zero with decreasing *ε* (Figure 2e), hinting at distinct regimes. The factors that determine the speed and profile of the wave will be discussed further below.
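A simulation of this kind can be sketched in a few dozen lines. The version below is a reduced model and not the simulation used for the figures: it assumes two decision states per intersection (arriving along the direct path, or exiting the side path), a single alternative action with total value 0 in each state, extrinsic value carried only by the two goal-directed actions, and an undiscounted expected-SARSA target. The names `p_goal` and `simulate` and the values *N* = 10, *ε* = −1 and *ε*′ = 0 are hypothetical; *r* = 2 and *α* = 0.33 follow Methods.

```python
import math
import random

def p_goal(q_int, q_ext):
    """Probability of the goal-directed action against one alternative
    whose total value is fixed at 0 (two-action softmax)."""
    return 1.0 / (1.0 + math.exp(-(q_int + q_ext)))

def simulate(N=10, eps=-1.0, eps_side=0.0, r=2.0, alpha=0.33,
             episodes=200, seed=0, max_steps=100_000):
    rng = random.Random(seed)
    q = [0.0] * N        # q[n]: continue along the direct path at intersection n
    q_side = [0.0] * N   # q_side[n]: turn towards the goal when leaving side path n
    history = []         # snapshot of q after every episode
    for _ in range(episodes):
        n, on_direct = N - 1, True   # first decision is at intersection N-1
        for _ in range(max_steps):
            values = q if on_direct else q_side
            bias = eps if on_direct else eps_side
            if rng.random() < p_goal(bias, values[n]):
                # Goal-directed action: target is r at the last step, otherwise the
                # policy-averaged value at the next intersection on the direct path.
                target = r if n == 1 else p_goal(eps, q[n - 1]) * q[n - 1]
                values[n] += alpha * (target - values[n])
                if n == 1:
                    break            # goal reached, episode ends
                n, on_direct = n - 1, True
            elif on_direct:
                on_direct = False    # enter the side path, return at the same n
            elif n < N - 1:
                n += 1               # detour one intersection back towards the start
            else:
                on_direct = True     # reflect at the start node
        history.append(list(q))
    return q, q_side, history

q, q_side, history = simulate()
```

Plotting the rows of `history` against *n* at a few episode counts is one way to look for a front that moves from the goal (*n* = 1) towards the start; the location where a row crosses *r*/2 gives a rough position of the front.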

The origin of discontinuous learning through reinforcement waves can be intuitively understood by examining how learning operates at each intersection. We isolate three factors: 1) the correct action at an intersection is only reinforced if the action at the subsequent intersection is reinforced, implying that the chain of reinforcement has to travel backwards from the goal; 2) when an intersection is sufficiently reinforced, the probability of the correct action at that intersection increases by a large factor as long as the reward is sufficiently large (*e*^{r+ε} ≫ 1), and since the rate of traveling directly from start to goal is the product of the probabilities of taking the correct action at each intersection, this rate will increase rapidly when the wave reaches the start; and 3) if the agent is unlikely to take the correct action at a certain intersection (*q*_{ε} < 0 for that action), reinforcement is applied through a few rare events until the intrinsic bias is overcome, *q*_{r} + *q*_{ε} > 0. Since the probability of taking the correct action in turn increases rapidly with reinforcement, the learning curve for taking the correct action at *each* intersection will appear step-like. These three factors together imply that when the task is non-trivial, the wave of reinforcement marches backward from the goal, reinforcing correct actions one intersection at a time with step-like learning at each intersection, until it reaches the starting point.

This intuitive picture can be made mathematically precise by examining the effects of the learning rule, (1), on *q*_{n} and its side-path counterpart. We summarize the results here and refer to the SI for full details. When *α* ≪ 1, their expected change over 1*/α* episodes is governed by a pair of coupled rate equations, referred to as (2) below (see SI). These equations involve *µ*_{n} and its side-path counterpart, the average number of times per episode the agent crosses intersection *n* through the direct path or the side path respectively, and *σ*_{n}, the probability of continuing along the direct path at intersection *n*. In general, the two crossing rates depend on the transition probabilities and thus the values at every intersection in the graph. The analysis is made tractable by noticing, first, that the ratio of the two crossing rates is determined by the relative probability of taking the correct action at intersection *n* through the direct path *vs* the side path. Second, no learning occurs outside of the front and bulk of the wave. Lastly, learning at the front of the wave only happens when subsequent intersections are already sufficiently reinforced, which implies that the agent is likely to go directly to the goal immediately after crossing the front. Thus, in each episode, the intersection at the wave’s front is crossed just once on average. This relation combined with the expression for the ratio fixes both crossing rates. The values obtained from numerical integration of (2) are in excellent agreement with the ones from full-scale RL simulations (Figure 3a). An analysis of (2) reveals two qualitatively distinct regimes of wave propagation with *e*^{ε} ≫ 1 and *e*^{−ε} ≫ 1 as their asymptotic limits. We term these the expanding and marching regimes respectively. Maze architectures that could exhibit these two regimes are illustrated in Figure 3b,c.

In the expanding regime (*e*^{ε} ≫ 1), (2) leads to linear dynamics, which can be solved exactly. We find *q*_{n}(*t*) = *rP*(*n, t*), where *P*(*n, t*) is the regularized lower incomplete gamma function. For large *n*, the half-maximum is at *n*_{1/2} = *t*, which explains the speed *v* = 1 observed in simulations for *ε* ≳ 1, and the width of the profile expands with time as √*t*.
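The exact solution quoted above can be checked numerically. The sketch below assumes that the linear dynamics take the form d*q*_{n}/d*t* = *q*_{n−1} − *q*_{n} with *q*_{0} = *r*, which is the linear chain whose solution is *rP*(*n, t*); this specific form is an assumption, since only the solution is stated here. The function names are hypothetical, and for integer *n* the regularized lower incomplete gamma function is evaluated from its finite-sum representation.

```python
import math

def reg_lower_gamma_int(n, t):
    """P(n, t) for integer n >= 1: 1 - exp(-t) * sum_{k<n} t^k / k!."""
    term, total = 1.0, 0.0
    for k in range(n):
        total += term
        term *= t / (k + 1)
    return 1.0 - math.exp(-t) * total

def integrate_linear_wave(N, r, t_end, dt=1e-3):
    """Forward-Euler integration of dq_n/dt = q_{n-1} - q_n with q_0 = r."""
    q = [r] + [0.0] * N
    steps = int(round(t_end / dt))
    for _ in range(steps):
        dq = [0.0] + [q[n - 1] - q[n] for n in range(1, N + 1)]
        q = [x + dt * d for x, d in zip(q, dq)]
    return q

r, t_end = 2.0, 10.0
q = integrate_linear_wave(N=20, r=r, t_end=t_end)
exact = [r * reg_lower_gamma_int(n, t_end) for n in range(1, 21)]

# Largest deviation between the integrated profile and r * P(n, t).
max_err = max(abs(a - b) for a, b in zip(q[1:], exact))
```

At *t* = 10 the exact profile crosses *r*/2 between *n* = 10 and *n* = 11, consistent with a half-maximum near *n*_{1/2} = *t*.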

In the marching regime (*e*^{−ε} ≫ 1), a negative *ε* leads to qualitatively different behavior compared to the expanding regime. Any step on the direct path that has previously been reinforced beyond |*ε*| is more likely to be traversed. When the reinforcement wave reaches an intersection *p* on the direct path that is yet to be reinforced to |*ε*|, the reinforcement of that step occurs through rare events until *q*_{p} ≃ |*ε*|. Meanwhile, the direct path for *n* < *p* is rapidly reinforced. The rare events at *p* combined with rapid reinforcement for *n* < *p* lead to a bottleneck at *p* and a steep wave profile. Once *q*_{p} reaches |*ε*|, it is subsequently reinforced rapidly and *q*_{p+1} in turn begins to be slowly reinforced through rare events. Thus, the wave ‘marches’ forward reinforcing one step at a time. Computing the duration *τ* it takes to march one step will let us estimate the speed of the wave, *v* = *τ*^{−1}.

The duration *τ* can be calculated by examining the nonlinear dynamics in the front (*n* = *p*) and bulk (*n* < *p*) of the wave (SI). The full dynamics in the bulk plays a role as the reinforcement received at the intersection *n* = *p* depends on the temporal dynamics of *q*_{p−1}, which in turn depends on *q*_{p−2}, and so on. However, it can be shown that the dynamics in the bulk are linear and exhibit self-similarity with period *τ*. Exploiting a conservation equation that results from these properties, we compute the wave speed (SI), which is in excellent agreement with the speed measured in RL simulations (Figure 3d). The wave profile in the bulk is given by *q*_{n−1}(*t*) = *r* − *β*(*r* − *q*_{n}(*t*)), where *β* = −*τ* ^{−1}*W* (*τe*^{−τ}) and *W* (*x*) is the Lambert *W* function. Most of the learning at a certain intersection occurs in ≲ 1*/α* episodes (Figure 3e). Since the wave speed is less than one in the marching regime, each intersection is almost fully reinforced before the wave marches to the next one.

The results are summarized in Figure 3f, which depicts the expanding and marching regimes in addition to the ‘stalled’ regime corresponding to the exploration parameters where learning is largely absent.

### Other learning rules lead to reinforcement waves with altered speeds and profiles

Common variants of the *SARSA* rule [14] in (1) also lead to reinforcement waves, highlighting the generality of the phenomenon. A detailed analysis of each of these variants is presented in the SI, which we summarize here.

We find Watkins’ *Q*-learning, which uses a slightly modified version of the rule (1), leads to largely similar speeds and profiles. The advantage of *Q*-learning is that the *q*_{r} values can be learned off-policy, i.e., the agent’s behavior is not necessarily derived from the learned *q*_{r} values. To decouple the influence of learning on behavior, we use *Q*-learning together with an explorative agent that disregards the learned *q*_{r} values. We find expanding waves irrespective of the task difficulty, suggesting that expanding waves are the ‘default’ dynamics without feedback in the structured environments considered here. Feedback due to learning leads to traveling waves with steeper profiles as observed in the marching regime. Both *Q*-learning and *SARSA* learn values from local updates, which constrains the wave speed to be at most one.

An alternative class of models builds a model of the environment from experience, similar to a cognitive map, and updates the values offline by sampling from the model (planning). We consider *Dyna-Q*, which implements a simple version of this general idea. Specifically, *Dyna-Q* first learns a model of future states and rewards for every state-action pair it encounters during the task. At each step, it samples *n*_{p} state-action-state-reward transitions from the model and updates their corresponding values. We show that *Dyna-Q* applied to our setting leads to the same behavior as (1) with an enhanced learning rate (1 + *n*_{p})*α*. Intuitively, when the agent plans, learning which otherwise occurs only through physical exploration is sped up due to mental exploration. However, since both physical and mental exploration employ the same search process, the result is a simple scaling of the learning rate.

Another common variant with non-local updates is *SARSA* combined with eligibility traces, which are an efficient, biologically-plausible mechanism for enhancing learning speed when rewards are sparse. Instead of updating the value of the current state-action pair, eligibility traces effectively use the current reward prediction error to also update the *k* most recent state-action pairs. The exact learning dynamics can be calculated (SI) and are qualitatively similar to the *SARSA* case. In the expanding regime, eligibility traces scale the wave speed by a factor 1 + *k*. The speed in the marching regime also increases since the front of the wave receives reinforcement from the intersection 1 + *k* steps along the direct path, which has a larger value compared to the subsequent intersection. In the limit *k* → ∞, we show that the speed converges to a maximum *v*_{∞} = *r/*(|*ε*| + *e*^{|ε|} − 1).
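As a worked illustration of the limiting speed quoted above, the short sketch below evaluates *v*_{∞} = *r*/(|*ε*| + *e*^{|ε|} − 1) for a few task difficulties. The function name `v_infinity` and the chosen values of *ε* are hypothetical; *r* = 2 is the reward estimate used elsewhere in the text.

```python
import math

def v_infinity(r, eps):
    """Limiting marching-regime speed for k -> infinity, as quoted in the text:
    r / (|eps| + exp(|eps|) - 1)."""
    a = abs(eps)
    return r / (a + math.exp(a) - 1.0)

# Speeds (intersections per 1/alpha episodes) for increasingly difficult tasks.
speeds = {eps: v_infinity(2.0, eps) for eps in (-1.0, -2.0, -3.0)}
```

For *ε* = −1 the expression reduces to 2/*e* ≈ 0.74, and the speed falls off roughly exponentially as |*ε*| grows, reflecting the rarity of correct actions at the front of the wave.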

The theoretical predictions for the various learning rules are verified in simulations (Figures S3,S4).

## EXPERIMENTAL TESTS

In addition to reproducing the discontinuous learning curves observed in experiments, the theory provides predictions which can be immediately tested by re-analyzing the data from ref. [20]. Specifically, note that the learning curves in Figure 1b correspond to the number of direct paths greater than a certain length, namely, six. If the discontinuity in the learning curves is due to a reinforcement wave, this discontinuity should occur at a later time for direct paths beginning from farther nodes. This prediction should be contrasted with an alternative mechanism where sudden insight corresponds to the singular moment when the mouse has figured out the global structure of the environment and uses this knowledge to find direct paths from distant sections of the maze. The experimental data support the former hypothesis: the discontinuity is delayed for longer direct paths (Figure 4a). The time delay between these discontinuities provides an estimate of the wave speed. The smaller rate of taking direct paths for longer paths observed in Figure 4a can also be explained in our framework. The reward (estimated as *r* ≈ 2 previously) is not sufficiently large to fully overcome the stochastic, exploratory drive of the agent, leading to a significantly smaller probability of taking a longer direct path. This decreasing probability provides an estimate for the range of wave propagation, *N*_{range}. The theory predicts that *N*_{range} and the speed of wave propagation should increase with increasing reward for moderate (*ε* ≃ 0) to difficult (*ε* < 0) tasks, which can be tested in future experiments. An intriguing possibility is to observe the transition in speed from the expanding to marching regimes by manipulating the task difficulty, for example by modifying the inclinations of the T-junctions in a complex maze (as illustrated in Figure 3b,c) or manipulating the number of branches at each intersection.

A potentially important confounding factor for observing a single, distinct discontinuity in the learning curves is when multiple paths of length comparable to the direct path are available. The speed at which the wave propagates along these competing paths depends on a number of factors, including their number, lengths and the exploration statistics within each path. If a competing path is fully reinforced earlier than the direct path, it can interfere with learning the direct path. Multiple paths can explain the variability observed in experimental trajectories. Indeed, the learning curves in Figure 1b,4 effectively average over all direct paths of certain lengths. If paths of similar lengths from distant nodes exhibit discontinuities with only slight delays, the averaged curve will appear smoother than when each path is observed separately. Consistent with this intuition, considering paths from specific locations in the experiment highlights the variability across mice in which of these paths contributes most to the discontinuity (Figure 4b). Additional experiments designed similar to our setting in Figure 2b will provide crucial data to resolve sources of variability.

## DISCUSSION

The discontinuous learning phenomenon observed in complex mazes and other learning tasks clashes with the intuition that RL-based algorithms make learning progress by incrementally reinforcing rewarding actions. Here, we have shown that a standard biologically-plausible RL rule consistently reproduces this phenomenon in simulations designed to reflect maze experiments and more generally during goal-oriented navigation in large, sparse relational graphs. The key result is that the value signal propagates as a steep, traveling reinforcement wave, which sequentially reinforces correct actions along the path towards the goal. ‘Sudden insight’ occurs the moment the wave reinforces all the correct actions along the main path. The reinforcement wave arises due to a combination of the effectively one-dimensional task structure in such structured environments, the local propagation of reinforcement and the positive feedback of reinforcement on behavior. These factors together with the agent’s innate exploration biases determine the character of the wave, including its speed and profile. The exploration biases play an important role as they determine if any learning occurs in the first place (the stalled regime), and if learning does progress, whether the learning dynamics are limited by the learning rule (expanding regime) or due to the low probability of taking the correct action (marching regime). While common model-free and model-based variants of the RL rule may enhance the learning speed and alter the wave’s profile, the qualitative characteristics of wave propagation are preserved.

While existing experimental data is consistent with the propagation of a reinforcement wave, further experiments should reveal and verify the generality of the phenomenon. The sharp transition in learning is most salient in highly complex mazes where the direct path is nontrivial and paths other than the direct path are present but are poor solutions. Competing paths lead to additional complexity, analogous to when a multitude of local minima compete with the global solution in non-convex optimization problems. Easily accessible competing paths which are of comparable length to the direct path may lead to non-trivial exclusion effects, effectively average out the learning curves and amplify variability due to minor differences in exploration biases across animals. In addition, our model predicts that the magnitude of the reward plays an important role in determining the learning range and speed for difficult tasks.

Our analysis also provides a complete characterization of the learning dynamics of various RL rules for a non-trivial sequential decision-making task, which is currently lacking. A key challenge in the theoretical analysis of RL algorithms is the feedback of learning on behavior, which makes the data distribution inherently non-stationary. In our setting, the non-stationarity is reflected by the dynamics of the wave during learning. We have shown that the front of the wave effectively acts as an absorbing boundary, which simplifies the analysis considerably. The learning speed is determined by the number of times the learning rule updates the value at the nose of the wave. Since this number itself depends on the value at the nose, the dynamics are nonlinear. In turn, since the value of the subsequent action depends on the value of the later actions within the bulk, the full interactions between the nose and the bulk of the wave will influence learning speed. We show that the learning speed cannot exceed a certain value due to the locality of the learning rule. Relaxing the locality constraint using eligibility traces enhances the learning speed by widening the value differential between the unreinforced action and the distal action from which it receives reinforcement. A model-based method which uses planning scales up the speed simply by scaling up the number of times it updates each action rather than due to a qualitative change in how reinforcement is propagated. An exciting future direction is to extend our analysis framework to spatial navigation tasks with other graph topologies or when learning of actions is intertwined with the learning of representations.

## METHODS

### Extracting exploration statistics from data and hyperparameters for RL simulations

We use MaxEnt IRL (see SI for a brief introduction) to infer the exploration biases of unrewarded mice. As discussed in the Results section, the state space was chosen as the directed edges of the graph that delineate the maze in experiments, where the root of the tree corresponds to ‘home’. We pooled trajectories from all unrewarded mice, set *γ* = 0.8 and split the trajectories to length *T* = 12 (*T* should be at least the effective horizon ∼ (1 − *γ*)^{−1} = 5 and choosing a large *T* slows inference). The choice of *γ* was motivated by the analysis in [20], which showed that a variable length Markovian model typically chooses ≲ 5 previous states to predict mouse behavior. The *q*_{ε} values are obtained from maximum likelihood estimation, specifically, from log *p*_{λ,0}(*s, a*) after optimizing for ***λ*** (SI). Note that due to normalization the *q*_{ε} values are determined only up to a constant additive term for each state.

To estimate *r*, we apply the above procedure to both unrewarded and rewarded mice. For each state, we compute the difference between the correct action’s *q* value and the effective *q* values of the two incorrect actions, and then take the difference of this quantity between rewarded and unrewarded animals. A subset of these values is shown in Figure S1, which shows that the correct actions leading to reward have a value differential of ≈ 2. Since the values of actions close to the reward saturate at *r* after learning, the value differential is an estimate of the reward, *r* ≈ 2. To ensure that this estimate is not significantly influenced by the habitual paths that go directly from home to goal, we repeat the above procedure excluding these paths (Figure S1). The estimate decreases slightly to *r* ≈ 1.5. In the RL simulations of the depth-6 binary tree maze, we use *r* = 2 and *α* = 0.33.

### Setup and notation for the RL framework for navigation on tree-structured graphs

A tree-structured graph can be cast as a linear track, as argued in the Results section and illustrated in Figure 2a,b. The linear track consists of *N* − 1 nodes on the direct path, *n* = 1, 2, …, *N* − 1. The agent starts each episode at node *n* = *N* and the reward is at the goal node *n* = 0. In addition to these nodes, the nodes from *n* = 1 to *N* − 1 each have a side path, which we label as 1_{b}, 2_{b}, …, (*N* − 1)_{b}. The state space of the Markov decision process is the set of *directed edges* that connect the various nodes and the side paths as shown in Figure 2b. In other words, both the agent’s location in the graph and the direction in which it is headed matter. We denote (*n*_{1}, *n*_{2}) as the directed edge from *n*_{1} to *n*_{2}.

The transition dynamics *P* (*s*′|*s, a*) are deterministic (note however that the policy *π*(*a*|*s*) is stochastic). At each directed edge, the agent can choose to go along the directed edges emanating from its current node, except for turning back, for e.g., the transition (*n* + 1, *n*) → (*n, n*+1) is disallowed. This simplifying assumption does not affect the results as the agent can effectively turn back by going into a side path and returning (*n* + 1, *n*) → (*n, n*_{b}) → (*n*_{b}, *n*) → (*n, n* + 1). The episode begins with the agent at the directed edge (*N, N* −1). The directed edge pointing towards the goal node, (1, 0), is an absorbing state, i.e., the agent receives a reward *r* and the episode ends once the agent traverses that edge. We impose reflecting conditions at edges going into the side paths (*n, n*_{b}) and the start node (*N*− 1, *N*).

The agent receives identical intrinsic exploration rewards at every intersection on the direct path. There are three directed edges leading to any node *n*, and we thus consider three cases at each node. These three cases are shown pictorially in Figure 2b. Since the agent can take two actions at each step and the policy only depends on differences of *q* values, we specify the *q* values for only one of the actions. The notation used for the three cases is introduced (see also Figure 2b).

1. The agent is on the direct path and going towards the goal, (*n* + 1, *n*): for the action corresponding to the agent continuing towards the goal, (*n* + 1, *n*) → (*n, n* − 1), we denote *q*_{ε} ≡ *ε*, *q*_{r} ≡ *q*_{n}.
2. The agent is on the side path *n*_{b} and going towards *n*, (*n*_{b}, *n*): for the action corresponding to the agent turning towards the goal, (*n*_{b}, *n*) → (*n, n* − 1), we denote *q*_{ε} ≡ *ε*′, *q*_{r} ≡ *q*′_{n}.
3. The agent is on the direct path and going towards the start, (*n* − 1, *n*): for the action corresponding to the agent continuing towards the start, (*n* − 1, *n*) → (*n, n* + 1), we denote *q*_{ε} ≡ *ε*′′, *q*_{r} ≡ *q*′′_{n}.

The probabilities of taking the action described in each of the three cases are denoted *σ*_{n} ≡ *σ*(*ε* + *q*_{n}), *σ*′_{n} ≡ *σ*(*ε*′ + *q*′_{n}) and *σ*′′_{n} ≡ *σ*(*ε*′′ + *q*′′_{n}), respectively, where *σ*(*x*) = 1/(1 + *e*^{−x}) is the logistic function.

## ACKNOWLEDGMENTS

G.R thanks Andrew Murray and Venkatesh Murthy for useful comments on the manuscript. G.R was partially supported by the NSF-Simons Center for Mathematical & Statistical Analysis of Biology at Harvard (award number #1764269) and the Harvard Quantitative Biology Initiative.

## APPENDIX

### 1. Maximum entropy inverse reinforcement learning

In this Section, we present a brief introduction to Maximum Entropy Inverse Reinforcement Learning (MaxEnt IRL). See refs. [23, 24] and references therein for more details. MaxEnt IRL aims to find a reward function *r*(*s, a*) for each state-action pair (*s, a*) that guides a policy consistent with a set of *L* observed behavioral trajectories 𝒟 = {*d*_{1}, *d*_{2}, …, *d*_{L}}. Each trajectory has *T* + 1 state-action pairs: *d* = (*s*_{0}, *a*_{0}, *s*_{1}, *a*_{1}, …, *s*_{T}, *a*_{T}). The transition matrix *P*(*s*′|*s, a*) is assumed to be known.

While the end goal is to estimate *r*(*s, a*), MaxEnt IRL proceeds by formulating an unsupervised learning task. Specifically, consider a generative model which assigns a probability *p*(*τ*) to each trajectory *τ* such that the expected discounted reward from the model, $\sum_{\tau} p(\tau) \sum_{t=0}^{T} \gamma^{t} r(s_t, a_t)$, matches the empirical discounted reward (note 0 ≤ *γ* < 1 is the standard RL discount factor). To do this, it is sufficient to ensure that the expected discounted frequencies of all (*s, a*) pairs match the empirical ones:

$$\sum_{\tau} p(\tau)\, f_{\tau}(s,a) = \frac{1}{L} \sum_{d \in \mathcal{D}} f_{d}(s,a) \quad \text{for all } (s,a), \tag{A1}$$

where the sum over *τ* is over all possible trajectories with *T* + 1 state-action pairs, $f_{d}(s,a) \equiv \sum_{t=0}^{T} \gamma^{t}\, \delta_{(s,a),(s_t,a_t)}$ for a trajectory *d* = (*s*_{0}, *a*_{0}, *s*_{1}, *a*_{1}, …, *s*_{T}, *a*_{T}) and *δ* is the indicator function. Since multiple reward functions satisfy (A1), the inverse RL problem is ill-posed without additional constraints. MaxEnt IRL introduces an additional constraint by choosing the generative model *p*(*τ*) which satisfies (A1) *and* minimizes the relative entropy between *p*(*τ*) and a model *u*(*τ*) whose trajectories are generated by a random policy. In other words, MaxEnt IRL selects the “most random” policy which also satisfies (A1).

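*p*(*τ*) is found by minimizing a variational objective with constraints imposed using Lagrange multipliers:

$$\mathcal{C} = \sum_{\tau} p(\tau) \ln \frac{p(\tau)}{u(\tau)} - \sum_{s,a} \lambda(s,a) \left[ \sum_{\tau} p(\tau) \sum_{t=0}^{T} \gamma^{t} \delta_{(s,a),(s^{\tau}_t,a^{\tau}_t)} - \frac{1}{L} \sum_{d \in \mathcal{D}} \sum_{t=0}^{T} \gamma^{t} \delta_{(s,a),(s^{d}_t,a^{d}_t)} \right] + \mu \left[ \sum_{\tau} p(\tau) - 1 \right], \tag{A2}$$

where *u*(*τ*) is defined below, the *λ*(*s, a*)’s are Lagrange multipliers which enforce (A1) and *µ* enforces normalization. Minimizing 𝒞 w.r.t. *p*(*τ*) gives

$$p(\tau) = \frac{u(\tau)}{Z} \exp\left( \sum_{t=0}^{T} \gamma^{t} \lambda(s_t, a_t) \right), \tag{A3}$$

where *Z* is a normalization constant,

$$u(\tau) = p(s_0) \prod_{t=0}^{T-1} P(s_{t+1} | s_t, a_t), \tag{A4}$$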
and *p*(*s*_{0}) is the probability of the initial state. *u*(*τ*) is (up to a constant factor) the probability of trajectory *τ* under a policy which picks actions with equal probability. Observe that the discounted sum in the exponent of *p*(*τ*) is precisely the discounted long-term reward along *τ*. The Lagrange multipliers **λ** are thus interpreted as the rewards, *r*(*s, a*) → *λ*(*s, a*) for each (*s, a*). *p*(*τ*) implicitly places an exponentially larger weight on policies that lead to rewarding trajectories relative to the uniform random policy corresponding to *u*.

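The rewards **λ** are obtained by maximizing the log-likelihood ℒ of the model over the sample trajectories:

$$\mathcal{L}(\boldsymbol{\lambda}) = \frac{1}{L} \sum_{d \in \mathcal{D}} \ln p_{\boldsymbol{\lambda}}(d), \tag{A5}$$

where the subscript **λ** is introduced to highlight the dependence on **λ**. From (A3), we find

$$\frac{\partial \mathcal{L}}{\partial \lambda(s,a)} = \frac{1}{L} \sum_{d \in \mathcal{D}} \sum_{t=0}^{T} \gamma^{t} \delta_{(s,a),(s^{d}_t,a^{d}_t)} - \sum_{t=0}^{T} \gamma^{t} p_{\boldsymbol{\lambda},t}(s,a), \tag{A6}$$

where *p*_{λ,t}(*s, a*) is the marginal probability of encountering state-action pair (*s, a*) at time *t* w.r.t. *p*_{λ}. The Markovianity of the process enables efficient calculation of *p*_{λ,t}(*s, a*) using a forward-backward algorithm (described below) similar to the one used to train Hidden Markov Models (HMM). When ℒ attains its optimum, we observe from (A6) that the constraint (A1) is automatically satisfied.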

The forward-backward equations are similar to the HMM case except for an exponential weight:
for *k* > 0 and . We have
In practice, we compute the logarithm of the various quantities and use the log-sum-exp trick to prevent underflow/overflow. The *λ*s are obtained by gradient ascent on ℒ using standard optimization methods.
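The log-domain bookkeeping mentioned above can be sketched as follows. This is a minimal illustration, not the code used for the inference; the helper name `logsumexp` is ours, and it only shows how a sum of exponentially weighted terms is accumulated without overflow.

```python
import math

def logsumexp(log_terms):
    """Return log(sum(exp(x) for x in log_terms)) without overflow or underflow."""
    largest = max(log_terms)
    if largest == float("-inf"):
        return largest  # every term is exp(-inf) = 0
    # Factor out the largest term so that every exponent is <= 0.
    return largest + math.log(sum(math.exp(x - largest) for x in log_terms))

# A naive evaluation would fail here because math.exp(1000.0) overflows.
log_total = logsumexp([1000.0, 1000.0])
```

In a forward or backward recursion, each sum over predecessor (or successor) states would be replaced by one such call on the corresponding log-quantities.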

### 2. The dynamics of sequence learning on tree-structured relational graphs

In this Section, we expand on the dynamics of learning in the goal-oriented navigation task introduced in the main text. The dynamics are analyzed in full detail for the SARSA learning algorithm used in the main text and other commonly used RL rules. These alternative rules also display wave-like propagation of the value signal, with differences in the speed and front profiles for certain variants. Part of this Section is reproduced in the Methods, and is included here for completeness.

### Learning framework

An RL algorithm is defined by the learning rule and the (possibly non-stationary and stochastic) policy executed during learning. We consider a stochastic policy, where for each state-action pair (*s, a*), we have ln *π*(*a* |*s*) = *q*(*s, a*) − ln *Z*(*s*), where *q*(*s, a*) = *q*_{ε}(*s, a*) + *q*_{r}(*s, a*) and *Z*(*s*) is a normalization constant. *q*_{ε}(*s, a*) represents fixed intrinsic rewards that drive exploration and determine the agent’s exploratory biases. The extrinsic rewards *q*_{r}(*s, a*) (initially set to zero) are modulated by the learning rule. At the beginning of learning, the agent’s behavior depends solely on its exploratory biases. As learning proceeds and the agent acquires rewards, its behavior is modified by the increasing effect of the extrinsic rewards. We consider the learning dynamics for various learning rules discussed further below.
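As a concrete illustration of the policy defined above, the sketch below evaluates ln *π*(*a*|*s*) = *q*_{ε}(*s, a*) + *q*_{r}(*s, a*) − ln *Z*(*s*) at a single state. The function names and the numerical values are illustrative assumptions. With two actions, and the second action taken as the zero reference, the probability of the first action reduces to the logistic function of *q*_{ε} + *q*_{r}.

```python
import math

def sigma(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def policy(q_eps, q_r):
    """Action probabilities at one state, from ln pi(a|s) = q_eps + q_r - ln Z."""
    q = [e + v for e, v in zip(q_eps, q_r)]
    q_max = max(q)  # subtract the maximum for numerical stability
    weights = [math.exp(x - q_max) for x in q]
    z = sum(weights)
    return [w / z for w in weights]

# Two actions: the first is biased by eps and reinforced to q_n (illustrative
# values), the second has q_eps = q_r = 0 and serves as the reference.
eps, q_n = -1.0, 2.5
probs = policy([eps, 0.0], [q_n, 0.0])
```

Only differences of *q* values enter the result, so adding a constant to all *q* values at a state leaves `probs` unchanged.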

### Setup and notation

The linear track consists of *N* − 1 nodes on the direct path, *n* = 1, 2, …, *N* − 1. The agent starts each episode at node *n* = *N* and the reward is at the goal node *n* = 0. In addition to these nodes, the nodes from *n* = 1 to *N* −1 each have a side path, which we label as 1_{b}, 2_{b}, …, (*N* −1)_{b}. The state space of the Markov decision process is the set of *directed edges* that connect the various nodes and the side paths as shown in Figure 2b. In other words, both the agent’s location in the graph and the direction in which it is headed matter. We denote (*n*_{1}, *n*_{2}) as the directed edge from *n*_{1} to *n*_{2}.

The transition dynamics *P* (*s*′| *s, a*) are deterministic (note however that the policy *π*(*a|s*) is stochastic). At each directed edge, the agent can choose to go along the directed edges emanating from its current node, except for turning back, for e.g., the transition (*n* + 1, *n*) → (*n, n*+1) is disallowed. This simplifying assumption does not affect the results as the agent can effectively turn back by going into a side path and returning (*n* + 1, *n*) → (*n, n*_{b}) → (*n*_{b}, *n*) → (*n, n* + 1). The episode begins with the agent at the directed edge (*N, N* − 1). The directed edge pointing towards the goal node, (1, 0), is an absorbing state, i.e., the agent receives a reward *r* and the episode ends once the agent traverses that edge. We impose reflecting conditions at edges going into the side paths (*n, n*_{b}) and the start node (*N* − 1, *N*).

The agent receives identical intrinsic exploration rewards at every intersection on the direct path. There are three directed edges leading to any node *n*, and we thus consider three cases at each node. These three cases are shown pictorially in Figure 2b. Since the agent can take two actions at each step and the policy only depends on differences of *q* values, we specify the *q* values for only one of the actions. The notation used for the three cases is introduced (see also Figure 2b).

1. The agent is on the direct path and going towards the goal, (*n* + 1, *n*): for the action corresponding to the agent continuing towards the goal, (*n* + 1, *n*) → (*n, n* − 1), we denote *q*_{ε} ≡ *ε*, *q*_{r} ≡ *q*_{n}.
2. The agent is on the side path *n*_{b} and going towards *n*, (*n*_{b}, *n*): for the action corresponding to the agent turning towards the goal, (*n*_{b}, *n*) → (*n, n* − 1), we denote *q*_{ε} ≡ *ε*′, *q*_{r} ≡ *q*′_{n}.
3. The agent is on the direct path and going towards the start, (*n* − 1, *n*): for the action corresponding to the agent continuing towards the start, (*n* − 1, *n*) → (*n, n* + 1), we denote *q*_{ε} ≡ *ε*′′, *q*_{r} ≡ *q*′′_{n}.

The probabilities of taking the action described in each of the three cases are denoted *σ*_{n} ≡ *σ*(*ε* + *q*_{n}), *σ*′_{n} ≡ *σ*(*ε*′ + *q*′_{n}) and *σ*′′_{n} ≡ *σ*(*ε*′′ + *q*′′_{n}), respectively, where *σ*(*x*) = 1/(1 + *e*^{−x}) is the logistic function.

### Dynamics of exploration

We first consider the dynamics of exploration with no learning. An important byproduct of the calculation is the identification of the values of *ε, ε*′, *ε*′′ for which the agent finds the goal within a reasonable amount of time. These values delineate the parameter regime where learning occurs (the marching and expanding regimes in Figure 3c) *vs* the regime in which no learning occurs as the agent rarely finds the goal in the asymptotic limit *N* ≫1 (the stalled regime in Figure 3c). The upshot is that in these two cases the agent on average either drifts linearly towards the goal or constantly reverts back to the start. The marginal case corresponds to diffusive behavior.

To identify these regimes, we calculate the expected time for the agent, starting at (*N* − 1, *N*), to find the goal state. We additionally calculate the expected number of times the agent visits each state which is used further below for the analysis of learning dynamics. The calculation of both of these quantities is simplified by considering the dynamics of a simpler, equivalent Markov chain. Specifically, we observe that when the agent enters a node *n*, say along (*n* + 1, *n*), regardless of whether it goes into the side path or not, it either continues along its path, (*n, n* −1), or turns back (*n, n* + 1). The probability of the former possibility, *k*_{+}, is the sum of the probability that it moves to (*n, n*− 1) along the direct path, *σ*(*ε*), and the probability that it takes a detour through the side path, (1 − *σ*(*ε*))*σ*(*ε*′). The probability of turning back is 1 − *k*_{+}. Similarly, the probability of continuing towards the start state when the agent is at (*n* − 1, *n*) is *k*_{−} = *σ*(*ε*′′) + (1 − *σ*(*ε*′′))(1 − *σ*(*ε*′)), and the probability of turning back to (*n, n* −1) is 1 − *k*_{−}. In summary, the probabilities that the agent continues towards the goal and towards the start are
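$$k_+ = \sigma(\varepsilon) + \left(1 - \sigma(\varepsilon)\right)\sigma(\varepsilon'), \qquad k_- = \sigma(\varepsilon'') + \left(1 - \sigma(\varepsilon'')\right)\left(1 - \sigma(\varepsilon')\right), \tag{A9}$$

respectively. The dynamics of the Markov chain defined by these two parameters are easily computed.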
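The two continuation probabilities described above are simple to evaluate numerically. The sketch below is illustrative only; the function names `continue_probs` and `regime` are ours, and the regime labels paraphrase the drift, diffusive and reverting behaviors discussed in this section.

```python
import math

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

def continue_probs(eps, eps_side, eps_back):
    """Return (k_plus, k_minus) for exploration biases eps, eps', eps''."""
    s, s_side, s_back = sigma(eps), sigma(eps_side), sigma(eps_back)
    # Continue towards the goal: directly, or after a detour through the side path.
    k_plus = s + (1.0 - s) * s_side
    # Continue towards the start: directly, or after a detour through the side path.
    k_minus = s_back + (1.0 - s_back) * (1.0 - s_side)
    return k_plus, k_minus

def regime(k_plus, k_minus, tol=1e-12):
    if k_plus - k_minus > tol:
        return "drifts towards the goal"
    if k_minus - k_plus > tol:
        return "reverts to the start"
    return "diffusive"

unbiased = continue_probs(0.0, 0.0, 0.0)
side_biased = continue_probs(0.0, 2.0, 0.0)
```

An unbiased agent (*ε* = *ε*′ = *ε*′′ = 0) sits exactly at the marginal case *k*_{−} = *k*_{+}, while a positive *ε*′ tips the chain towards the goal.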

#### (a) Expected time to the goal

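Let the expected time to the goal starting from (*n, n* − 1) and (*n* − 1, *n*) be denoted *T*_{+}(*n*) and *T*_{−}(*n*) respectively. We aim to compute *T*_{+}(*N*) with an absorbing boundary at (1, 0), so that *T*_{+}(1) = 0. The reflecting boundary conditions at (*N* − 1, *N*) imply *T*_{−}(*N*) = *T*_{+}(*N*) + 1. We have

$$T_+(n) = 1 + k_+ T_+(n-1) + (1 - k_+)\, T_-(n), \tag{A10}$$

$$T_-(n) = 1 + k_- T_-(n+1) + (1 - k_-)\, T_+(n), \tag{A11}$$

for 1 < *n* ≤ *N* and 1 < *n* < *N* respectively. Expressing *T*_{−}(*n*) and *T*_{−}(*n* + 1) in terms of *T*_{+}(*n* − 1), *T*_{+}(*n*), *T*_{+}(*n* + 1) using (A10) and plugging into (A11), we obtain the second-order difference equation

$$k_+ \left[ T_+(n) - T_+(n-1) \right] - k_- \left[ T_+(n+1) - T_+(n) \right] = 2 - k_+ - k_-, \tag{A12}$$

for 1 < *n* < *N*. Defining *U*(*n*) = *T*_{+}(*n* + 1) − *T*_{+}(*n*) gives

$$U(n-1) - \frac{k_-}{k_+}\, U(n) = \frac{2 - k_+ - k_-}{k_+}. \tag{A13}$$

Since *T*_{−}(*N*) = *T*_{+}(*N*) + 1, setting *n* = *N* in (A10) gives *U*(*N* − 1) = −1 + 2/*k*_{+}. Multiplying both sides of (A13) by (*k*_{−}/*k*_{+})^{n−1} and summing from *n* + 1 to *N* − 1, we get

$$\left( \frac{k_-}{k_+} \right)^{n} U(n) = \left( \frac{k_-}{k_+} \right)^{N-1} U(N-1) + \frac{2 - k_+ - k_-}{k_+} \sum_{m=n+1}^{N-1} \left( \frac{k_-}{k_+} \right)^{m-1}, \tag{A14}$$

which after simplification leads to

$$U(n) = \frac{2 - k_+ - k_-}{k_+ - k_-} - \frac{2 k_- (1 - k_+)}{k_+ (k_+ - k_-)} \left( \frac{k_-}{k_+} \right)^{N-1-n} \tag{A15}$$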
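for *k*_{−} ≠ *k*_{+}. Summing *U*(*n*) from *n* = 1 to *N* − 1 and using *T*_{+}(1) = 0, we obtain the mean time to find the goal

$$T_+(N) = \frac{2 - k_+ - k_-}{k_+ - k_-}\,(N - 1) - \frac{2 k_- (1 - k_+)}{(k_+ - k_-)^2} \left[ 1 - \left( \frac{k_-}{k_+} \right)^{N-1} \right]. \tag{A16}$$

If *k*_{−} > *k*_{+}, the agent takes time exponential in *N*, which is infeasible for large *N*. For *k*_{−} < *k*_{+}, the agent travels linearly towards the goal with drift (*k*_{+} − *k*_{−})/(2 − *k*_{+} − *k*_{−}) nodes per step. The marginal case of *k*_{−} = *k*_{+} leads to diffusive behavior. In this case, from (A14), we have

$$U(n) = \frac{2}{k_+} - 1 + \frac{2 (1 - k_+)}{k_+}\,(N - 1 - n). \tag{A17}$$

Summing from *n* = 1 to *N* − 1 gives *T*_{+}(*N*) = (*N* − 1)(2/*k*_{+} − 1) + (1 − *k*_{+})(*N* − 1)(*N* − 2)/*k*_{+}, which grows as *N*^{2} for large *N*.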
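The expected time to the goal can also be checked numerically without any closed form. The sketch below is an illustration under stated assumptions: it treats each move between directed edges of the reduced chain as one time step, uses the first-step relations implied by the definitions of *k*_{+} and *k*_{−} together with the boundary conditions *T*_{+}(1) = 0 and *T*_{−}(*N*) = *T*_{+}(*N*) + 1, and solves them by repeated sweeps. The function name `expected_times` and its tolerances are ours.

```python
def expected_times(N, k_plus, k_minus, tol=1e-10, max_sweeps=200_000):
    """Expected steps to the goal on the reduced chain, by fixed-point sweeps.

    T_plus[n]  : expected steps starting on edge (n, n-1), heading to the goal.
    T_minus[n] : expected steps starting on edge (n-1, n), heading to the start.
    """
    T_plus = [0.0] * (N + 1)   # T_plus[1] = 0 stays fixed (absorbing boundary)
    T_minus = [0.0] * (N + 1)
    for _ in range(max_sweeps):
        biggest_change = 0.0
        for n in range(N, 1, -1):
            if n == N:
                # Reflection at the start node costs one step.
                new_minus = T_plus[N] + 1.0
            else:
                new_minus = 1.0 + k_minus * T_minus[n + 1] + (1.0 - k_minus) * T_plus[n]
            new_plus = 1.0 + k_plus * T_plus[n - 1] + (1.0 - k_plus) * new_minus
            biggest_change = max(biggest_change,
                                 abs(new_plus - T_plus[n]),
                                 abs(new_minus - T_minus[n]))
            T_plus[n], T_minus[n] = new_plus, new_minus
        if biggest_change < tol:
            break
    return T_plus, T_minus

T_plus, T_minus = expected_times(N=3, k_plus=0.5, k_minus=0.5)
```

The sweeps converge slowly when *k*_{−} > *k*_{+}, which reflects the exponentially long search times in that regime; for large *N* a direct linear solve would be preferable.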

#### (b) Expected number of visits

We calculate the expected number of times per episode the agent visits each state on the direct path. Denote the expected number of times the agent visits states (*n* + 1, *n*) and (*n, n* + 1) as *M*_{+}(*n*) and *M*_{−}(*n*) respectively. We have

$$M_+(n) = k_+ M_+(n+1) + (1 - k_-)\, M_-(n), \tag{A18}$$

$$M_-(n) = k_- M_-(n-1) + (1 - k_+)\, M_+(n). \tag{A19}$$

We observe that setting *n* → *n* + 1 in (A19) and adding (A18) gives a conservation equation, *M*_{+}(*n*) + *M*_{−}(*n* + 1) = *M*_{−}(*n*) + *M*_{+}(*n* + 1). At the boundaries we have *M*_{−}(1) = (1 − *k*_{+})*M*_{+}(1) and *M*_{+}(*N* − 1) = *M*_{−}(*N* − 1) + 1. Combining the latter boundary condition and the conservation equation, we see that *M*_{+}(*n*) = *M*_{−}(*n*) + 1 for all *n*. Plugging this into the former boundary condition then leads to *M*_{+}(1) = 1/*k*_{+} and into (A18) leads to

$$M_+(n+1) = \frac{k_-}{k_+}\, M_+(n) + \frac{1 - k_-}{k_+}. \tag{A20}$$

Dividing both sides by (*k*_{−}/*k*_{+})^{n+1}, summing from 1 to *n* and simplifying gives

$$M_+(n+1) = \frac{1}{k_+} \left( \frac{k_-}{k_+} \right)^{n} + \frac{1 - k_-}{k_+ - k_-} \left[ 1 - \left( \frac{k_-}{k_+} \right)^{n} \right] \tag{A21}$$

for *k*_{−} ≠ *k*_{+}. When *k*_{−} = *k*_{+}, we have *M*_{+}(*n*) = 1 + *n*(1 − *k*_{+})/*k*_{+}.
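The visit counts can be generated by a one-line recursion, which gives a quick check of the marginal-case formula quoted above. The sketch assumes the recursion *M*_{+}(*n* + 1) = (*k*_{−}/*k*_{+})*M*_{+}(*n*) + (1 − *k*_{−})/*k*_{+}, obtained by inserting *M*_{−}(*n*) = *M*_{+}(*n*) − 1 into the balance of visits at node *n*, together with *M*_{+}(1) = 1/*k*_{+}. The function name is ours.

```python
def expected_visits(n_max, k_plus, k_minus):
    """M_plus[n] for n = 1..n_max: expected visits per episode to edge (n+1, n)."""
    M_plus = [None, 1.0 / k_plus]  # index 0 is unused; M_plus(1) = 1 / k_plus
    for n in range(1, n_max):
        M_plus.append((k_minus / k_plus) * M_plus[n] + (1.0 - k_minus) / k_plus)
    return M_plus

# Marginal (diffusive) case: visits grow linearly with distance from the goal.
k = 0.75
M_diffusive = expected_visits(6, k, k)

# Drift towards the goal: visits saturate far from the goal.
M_drift = expected_visits(200, 0.9, 0.6)
```

In the marginal case the recursion reproduces *M*_{+}(*n*) = 1 + *n*(1 − *k*_{+})/*k*_{+}, and for *k*_{−} < *k*_{+} the counts approach the fixed point (1 − *k*_{−})/(*k*_{+} − *k*_{−}) of the recursion.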

### Dynamics of learning

We now analyze the dynamics of reinforcement propagation for five learning rules, beginning with the SARSA rule considered in the main text. Throughout, we consider the slow-learning limit *α* ≪ 1. In this limit, we can compute the learning dynamics averaged over the agent’s behavior in each episode. Simulations show that the analytical results are accurate up to *α* ≲ 0.1, with wave-like propagation of the reinforcement signal observed for even larger *α*. In all cases analyzed below, we set the standard discount factor in RL (*γ*) to unity. This discount factor introduces an effective length scale for the influence of the reward and regularizes the Bellman equation, neither of which is necessary in our setting due to the regularization and effective horizon provided by the exploration-based stochastic policy. In our analysis below, it is easy to see that the effective horizon is *N*_{range} ≃ (− ln *σ*(*ε* + *r*))^{−1}. For *N* ≪ *N*_{range}, the reinforcement signal propagates as a wave. The wave slows down and the reinforcement signal eventually decays to zero at steady state when *N* ≫ *N*_{range}. We consider the 1 ≪ *N* ≪ *N*_{range} limit in our analysis below, which thus requires that *ε* + *r* be appropriately large. In the linear track setting, reward is acquired only when the agent reaches the goal state. This is set by the boundary condition *q*_{0} = *r* (see the setup and notation section above). The initial conditions are *q*_{n}(0) = *q*′_{n}(0) = *q*′′_{n}(0) = 0 for *n* > 0. Also note that *q*′′_{n}(*t*) = 0 at all times, i.e., the *q*_{r} value corresponding to going *away* from the goal is never reinforced and stays fixed at zero. We thus consider the dynamics of *q*_{n} and *q*′_{n} in what follows.

#### (a) SARSA

Whenever the agent makes a state-action-state transition *s*_{t}, *a*_{t}, *s*_{t+1}, the SARSA(0) (state-action-reward-state-action) learning rule updates the corresponding *q*_{r} values as

$$q_r(s_t, a_t) \leftarrow q_r(s_t, a_t) + \alpha\, \delta_t, \qquad \delta_t = r_t + \mathbb{E}_{a \sim \pi}\left[ q_r(s_{t+1}, a) \right] - q_r(s_t, a_t). \tag{A22}$$

Here 0 < *α* < 1 is the learning rate, *r*_{t} is the reward obtained at time *t*, *δ*_{t} is the reward prediction error (written for *γ* = 1) and the expectation is over future actions drawn from the policy *π*. As discussed previously, ln *π*(*a*|*s*) = *q*_{ε}(*s, a*) + *q*_{r}(*s, a*) − ln *Z*(*s*). Note that the learning rule described above is strictly speaking the “expected SARSA” rule [14], which is an efficient variant of the SARSA(0) rule. The 0 in the parenthesis corresponds to the eligibility trace parameter *λ* = 0; setting *λ* > 0 converts the local SARSA(0) rule into a non-local rule. We discuss eligibility traces further below.
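A single update of the expected SARSA rule described above can be sketched as follows. The table layout (dictionaries keyed by state and action), the state and action labels, and the convention that a terminal transition passes `s_next=None` are illustrative assumptions, and *γ* is set to 1 as in the analysis.

```python
import math

def softmax_policy(q_eps_s, q_r_s):
    """pi(a|s) proportional to exp(q_eps(s, a) + q_r(s, a))."""
    q = {a: q_eps_s[a] + q_r_s[a] for a in q_r_s}
    q_max = max(q.values())
    w = {a: math.exp(v - q_max) for a, v in q.items()}
    z = sum(w.values())
    return {a: w[a] / z for a in w}

def expected_sarsa_step(q_r, q_eps, s, a, reward, s_next, alpha):
    """Update q_r[s][a] in place and return the reward prediction error."""
    if s_next is None:  # terminal transition: no future value
        expected_next = 0.0
    else:
        pi = softmax_policy(q_eps[s_next], q_r[s_next])
        expected_next = sum(pi[b] * q_r[s_next][b] for b in pi)
    delta = reward + expected_next - q_r[s][a]
    q_r[s][a] += alpha * delta
    return delta

# Hypothetical two-state example: the action "go" at s1 is already reinforced.
q_eps = {"s0": {"go": 0.0, "side": 0.0}, "s1": {"go": 0.0, "side": 0.0}}
q_r = {"s0": {"go": 0.0, "side": 0.0}, "s1": {"go": 2.0, "side": 0.0}}
delta = expected_sarsa_step(q_r, q_eps, "s0", "go", reward=0.0, s_next="s1", alpha=0.1)
```

Here the expected value at `s1` is *σ*(2) × 2, since the unreinforced alternative contributes zero; this is the same structure as the *σ*_{n−1}*q*_{n−1} term that appears in the averaged dynamics.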

Applying the SARSA rule (A22), the expected changes Δ*q*_{n} and Δ*q*′_{n} *in each episode* are given by

$$\Delta q_n = \alpha\, \mu_n \left( \sigma_{n-1} q_{n-1} - q_n \right), \qquad \Delta q'_n = \alpha\, \mu'_n \left( \sigma_{n-1} q_{n-1} - q'_n \right), \tag{A23}$$

for *n* > 1, and with *σ*_{0} ≡ 1, *q*_{0} = *r* for *n* = 1. The *σ*_{n−1}*q*_{n−1} term is the expected value of *q*_{r} at the subsequent state, i.e., once the agent crosses node *n*. Here *µ*_{n} and *µ*′_{n} are the expected number of times per episode the agent crosses node *n* towards the goal along the direct path or by making a detour through the side path at *n*, respectively. The relative probability of these two events determines their ratio, *µ*_{n}/*µ*′_{n} = *σ*_{n}/((1 − *σ*_{n})*σ*′_{n}), and the total expected crossings through node *n* towards the goal determines their sum, *µ*_{n} + *µ*′_{n}. The expected crossings at each node depend in general on the *q*_{ε} and *q*_{r} values at all states. It is possible to calculate the expected crossings recursively, similar to the case of pure exploration above. However, we note that for the nodes where learning occurs (near the front of the wave), we should expect a single crossing, i.e., *µ*_{n} + *µ*′_{n} ≃ 1. As argued in the main text, the key idea is that learning at node *n* only occurs when *σ*_{n−1}*q*_{n−1} is non-negligible, i.e., the probability of staying on the direct path is sufficiently reinforced at *n* − 1 and thus also for *n*′ ≤ *n* − 1. Since these actions are reinforced, the agent is very likely to take the direct path to the goal immediately after crossing *n*. Equivalently, whenever the agent crosses *n* towards the goal, the probability of it cycling back to *n*′ > *n* and making another attempt at crossing *n* is small, leading to *µ*_{n} + *µ*′_{n} ≃ 1. We thus obtain

$$\mu_n = \frac{\sigma_n}{\sigma_n + (1 - \sigma_n)\, \sigma'_n}, \qquad \mu'_n = \frac{(1 - \sigma_n)\, \sigma'_n}{\sigma_n + (1 - \sigma_n)\, \sigma'_n}. \tag{A24}$$

We can convert the discrete dynamics equations above to a continuous-time equation in the limit *α* ≪ 1, to get

$$\dot{q}_n = \mu_n \left( \sigma_{n-1} q_{n-1} - q_n \right), \qquad \dot{q}'_n = \mu'_n \left( \sigma_{n-1} q_{n-1} - q'_n \right), \tag{A25}$$

where the dot represents a time derivative and the unit of time is 1/*α*. Integrating the above equations over a unit time interval Δ*t* = 1 corresponds to expected changes in *q*_{n} and *q*′_{n} over 1/*α* episodes.
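The averaged dynamics can be explored with a simple forward-Euler integration. The sketch below rests on explicit assumptions that should be checked against the equations of this section: one crossing per episode at the front, so that the direct and detour crossings are weighted by *σ*_{n} and (1 − *σ*_{n})*σ*′_{n} normalized to sum to one; the drive at node *n* is *σ*_{n−1}*q*_{n−1}; and the goal edge simply pays *r*. The function name, step size and parameter values are ours.

```python
import math

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

def integrate_wave(n_nodes, r, eps, eps_side, t_end, dt=1e-3):
    """Forward-Euler integration of the averaged SARSA dynamics (time unit 1/alpha)."""
    q = [0.0] * (n_nodes + 1)       # q[n]: direct-path action at node n
    q_side = [0.0] * (n_nodes + 1)  # q_side[n]: turn towards the goal from side path n
    q[0] = r                        # boundary condition q_0 = r
    for _ in range(int(round(t_end / dt))):
        new_q, new_q_side = q[:], q_side[:]
        for n in range(1, n_nodes + 1):
            direct = sigma(eps + q[n])
            detour = (1.0 - direct) * sigma(eps_side + q_side[n])
            mu = direct / (direct + detour)       # assumed: one crossing per episode
            mu_side = detour / (direct + detour)
            # Expected q_r at the next state; the goal edge simply pays r.
            target = r if n == 1 else sigma(eps + q[n - 1]) * q[n - 1]
            new_q[n] += dt * mu * (target - q[n])
            new_q_side[n] += dt * mu_side * (target - q_side[n])
        q, q_side = new_q, new_q_side
    return q, q_side

# Strong bias towards the direct path (e^eps >> 1), illustrative parameters.
q, q_side = integrate_wave(n_nodes=5, r=2.0, eps=10.0, eps_side=0.0, t_end=3.0)
```

With a strong bias towards the direct path the side-path values stay near zero and the profile of `q` decreases monotonically away from the goal, as expected for a front that has only partly arrived.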

As shown in Figure 3a, integrating the equations (A25) provides an accurate approximation of the values obtained through full-scale RL simulations. We expand on the analysis of (A25) presented in the main text. Below, we consider exploration parameter values, *ε*, *ε*′, *ε*′′, such that *k*_{−} < *k*_{+}, i.e., in the regime where the agent consistently gets to the goal and learning occurs. We set *ε*′′ = 0 and vary *ε*, *ε*′. The key parameter which determines the character of the wave is *ε*; *ε*′ is chosen such that *k*_{−} < *k*_{+}. It is useful to consider the asymptotic limits *e*^{ε} ≫ 1 (expanding regime) and *e*^{−ε} ≫ 1 (marching regime). The motivation for the names will become apparent.

### Expanding regime

The expanding regime has straightforward linear learning dynamics. To see this, note that when *e*^{ε} ≫ 1, we have *σ*_{n} ≃ 1 for all *n*. This implies from (A24) that *µ*′_{n} is negligible and thus *q*′_{n} remains at 0. From (A25),

$$\dot{q}_n = q_{n-1} - q_n, \tag{A26}$$

with boundary conditions *q*_{0}(*t*) = *r* and initial conditions *q*_{n}(0) = 0 for *n* > 0. Defining $\tilde{q}_n(s) \equiv \int_0^{\infty} q_n(t)\, e^{-st}\, dt$, i.e., the Laplace transform of *q*_{n}(*t*), (A26) leads to

$$(s + 1)\, \tilde{q}_n(s) = \tilde{q}_{n-1}(s) + q_n(0). \tag{A27}$$

Multiplying both sides of the equation by (*s* + 1)^{n−1} and summing the recursion gives

$$(s + 1)^{n}\, \tilde{q}_n(s) = \tilde{q}_0(s) + \sum_{m=1}^{n} (s + 1)^{m-1} q_m(0). \tag{A28}$$

Using $\tilde{q}_0(s) = r/s$ and performing the inverse Laplace transform leads to

$$q_n(t) = r P(n, t) + \sum_{m=1}^{n} q_m(0)\, \frac{t^{n-m} e^{-t}}{(n-m)!}, \tag{A29}$$

where *P*(*n, t*) is the regularized lower incomplete gamma function. Setting the initial conditions *q*_{n}(0) = 0, we have *q*_{n}(*t*) = *rP*(*n, t*). For large *n* it is well known that the gamma distribution is approximated by a normal distribution: *P*(*n, t*) ≃ Φ((*t* − *n*)/√*n*), where Φ is the standard normal cdf. The half-maximum of the propagating signal thus travels as *n*_{1/2} = *t*, i.e., with speed *v*_{0} = 1 corresponding to 1 node every 1/*α* episodes. The width of the front expands as √*t*.
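The solution *q*_{n}(*t*) = *rP*(*n, t*) is easy to evaluate for integer *n* using the finite series *P*(*n, t*) = 1 − *e*^{−t} Σ_{j<n} *t*^{j}/*j*!. The sketch below is illustrative; the function names are ours and the series form is a standard identity for integer first argument.

```python
import math

def reg_lower_gamma(n, t):
    """Regularized lower incomplete gamma function P(n, t) for integer n >= 1."""
    term, partial_sum = 1.0, 0.0
    for j in range(n):
        partial_sum += term      # adds t**j / j!
        term *= t / (j + 1)
    return 1.0 - math.exp(-t) * partial_sum

def expanding_wave(n_nodes, r, t):
    """Value profile q_n(t) = r * P(n, t) for n = 1..n_nodes (q_0 = r)."""
    return [r] + [r * reg_lower_gamma(n, t) for n in range(1, n_nodes + 1)]

# At t = 50 (in units of 1/alpha episodes) the half-maximum sits near node 50.
profile = expanding_wave(n_nodes=100, r=2.0, t=50.0)
```

Consecutive values satisfy *P*(*n* − 1, *t*) − *P*(*n, t*) = *t*^{n−1}*e*^{−t}/(*n* − 1)!, which is the time derivative of *P*(*n, t*), so the profile obeys the linear dynamics of the expanding regime term by term.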

### Marching regime

Qualitatively different behavior is observed in the opposite asymptotic limit, *e*^{−ε} ≫ 1. In this case, the nonlinear dynamics of (A25) come into play. We show below that the dynamics consist of a self-similar wavefront which marches forward one node at a time, with a fixed time interval, *τ*_{0}, between each node (and thus its speed is 1/*τ*_{0}). We first present the qualitative picture and then proceed to calculate *τ*_{0}. It turns out that a complete characterization of the wave dynamics is feasible, which leads to analytical expressions for *τ*_{0}, the shape of the wavefront and the dynamics of reinforcement at each node, *q*_{n}(*t*).

Since *e*^{−ε} ≫ 1, the probability that the agent takes the direct path through node *n*, *µ*_{n}, is negligible before any reinforcement occurs at node *n*. Any reinforcement that occurs is due to rare events which gradually increase *q*_{n} provided that *q*_{n−1} > |*ε*|. This bottleneck remains until *q*_{n} itself gradually increases to *q*_{n} ≃ |*ε*|. Meanwhile, all the actions on the direct path for *n*′ < *n* are continuously reinforced. This reinforcement is possible as the agent can bypass crossing *n* through the direct path by instead taking a detour through the side path at *n*. Indeed, learning occurs (*k*_{−} < *k*_{+}) in the marching regime only if *ε*′ is sufficiently large to allow for frequent detours through the side path. As a consequence, while *q*_{n} is gradually being reinforced, *q*′_{n} is sufficiently reinforced so that *σ*′_{n} ≃ 1 and *µ*_{n} ≃ *σ*_{n} (from (A24)). The detours through the side paths ensure that *q*_{n−1} is rapidly reinforced and *r* ≫ |*ε*| implies that *σ*_{n−1} displays switch-like behavior once *q*_{n−1} > |*ε*|. Since *q*_{n−1} is increasing, successive rare events through the direct path at node *n* receive increasing amounts of reinforcement on *q*_{n}. Calculating the amount of reinforcement on *q*_{n} thus requires knowing the dynamics of *q*_{n−1}. However, the dynamics of *q*_{n−1} depend on the dynamics of *q*_{n−2}, which in turn depend on *q*_{n−3} and so on, leading to cascading dependencies. The duration *τ*_{0} for *q*_{n} to be reinforced from 0 to |*ε*| will thus depend on interactions with the bulk of the wave. We now calculate *τ*_{0} by solving the equation hierarchy. We will see that the solution is feasible due to the linear dynamics in the bulk of the wave (A26), which considerably simplify the analysis.

We begin (*t* = 0) from the moment when the wave has just reached node *n*, i.e., *q*_{n−1}(0) = |*ε*| and *q*_{n}(0) = 0 begins to be reinforced. *τ*_{0} is the time it takes for one cycle to complete, which is when *q*_{n}(*τ*_{0}) = |*ε*| is reached and *q*_{n} begins to influence *q*_{n+1}. From (A25), the (approximate) evolution of *q*_{n} during this interval is given by

$$\dot{q}_n = \sigma(\varepsilon + q_n)\, q_{n-1}, \tag{A30}$$

where we have used *µ*_{n} ≃ *σ*_{n} and *σ*_{n−1} ≃ 1. Since *r* ≫ |*ε*| implies *q*_{n} ≪ *q*_{n−1} in this period, we have also ignored the negative feedback due to *q*_{n}. This argument can be made rigorous by computing *q*_{n} from (A30) (see below), which will lead to an upper bound on *q*_{n}, and showing that this upper bound is ≪ *q*_{n−1}. Integrating (A30) for *t* ≤ *τ*_{0}, we get

$$\int_0^{q_n(t)} \frac{dq}{\sigma(\varepsilon + q)} = x_t, \tag{A31}$$

where $x_t \equiv \int_0^{t} q_{n-1}(t')\, dt'$ is to be calculated. Plugging in *σ*(*x*) = 1/(1 + *e*^{−x}) and integrating gives

$$q_n(t) + e^{-\varepsilon} \left( 1 - e^{-q_n(t)} \right) = x_t. \tag{A32}$$

The solution of the above equation is expressed in terms of the Lambert *W* function [28], *W*(*x*),

$$q_n(t) = x_t - e^{-\varepsilon} + W\!\left( e^{-\varepsilon} \exp\left( e^{-\varepsilon} - x_t \right) \right). \tag{A33}$$

Since *q*_{n}(*τ*_{0}) = |*ε*|, *τ*_{0} is obtained from solving

$$x_{\tau_0} = |\varepsilon| + e^{|\varepsilon|} - 1. \tag{A34}$$
We now calculate *x*_{t}. In the bulk (*m* < *n*), we have

$$\dot{q}_m = q_{m-1} - q_m. \tag{A35}$$

From (A29),

$$q_m(t) = r P(m, t) + \sum_{m'=1}^{m} q_{m'}(0)\, \frac{t^{m-m'} e^{-t}}{(m-m')!}. \tag{A36}$$

Note that the initial conditions *q*_{m′}(0) here are the unknown values of *q*_{m′} when *q*_{n−1} = |*ε*| and *q*_{n} = 0, which are to be computed self-consistently under the marching dynamics with time step *τ*_{0}. Using the series expansion of *P*(*m, t*) and *m*′ → *m* − *m*′, we re-write the above equation as

$$q_m(t) = r - e^{-t} \sum_{m'=0}^{m-1} \left[ r - q_{m-m'}(0) \right] \frac{t^{m'}}{m'!}. \tag{A37}$$

Here increasing values of *m*′ correspond to the nodes reinforced in earlier stages of the learning process. Thus, the terms *r* − *q*_{m−m′}(0) are decreasing in magnitude and the terms of large *m*′ do not matter for *τ*_{0} > 1. We may then take the sum to infinity in the equation above.

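To solve for *q*_{m}(0) for *m* < *n*, we notice that the self-similarity of the wavefront with period *τ*_{0} implies *q*_{m}(*τ*_{0}) = *q*_{m−1}(0). Plugging *t* = *τ*_{0} into (A37) and using *q*_{m}(*τ*_{0}) = *q*_{m−1}(0), we get

$$g(m-1) = e^{-\tau_0} \sum_{m'=0}^{\infty} g(m - m')\, \frac{\tau_0^{m'}}{m'!}, \tag{A38}$$

where *g*(*m*) ≡ 1 − *q*_{m}(0)/*r*. This difference equation applies for all *m* < *n*. Since the difference equation has constant coefficients, solutions are of the form *g*(*m*) = *cb*^{m}, where *c* is a constant [29]. We have

$$b^{m-1} = e^{-\tau_0}\, b^{m} \sum_{m'=0}^{\infty} \frac{(\tau_0 / b)^{m'}}{m'!} = b^{m}\, e^{-\tau_0 (1 - 1/b)}. \tag{A39}$$

Thus *b* satisfies

$$\frac{1}{b} = e^{-\tau_0 \left( 1 - 1/b \right)}, \tag{A40}$$

which leads to

$$b = -\frac{\tau_0}{W\!\left( -\tau_0 e^{-\tau_0} \right)}, \tag{A41}$$

where *W* is the non-trivial branch of the Lambert *W* function (the trivial branch evaluates to 1). Evaluating (A37) for *m* = *n* − 1 then yields

$$q_{n-1}(t) = r - r\, c\, b^{n-1}\, e^{-t (1 - 1/b)}. \tag{A42}$$

Since *q*_{n−1}(0) = |*ε*|, we get *cb*^{n−1} = 1 − |*ε*|/*r*. The evolution of *q*_{n−1}(*t*) is thus

$$q_{n-1}(t) = r - \left( r - |\varepsilon| \right) e^{-\beta t}, \tag{A43}$$

where *β* ≡ 1 − 1/*b*, which determines the wave profile at any instant *t* since the periodicity implies *q*_{n−n′}(*t*) = *q*_{n−1}(*t* + (*n*′ − 1)*τ*_{0}).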

Note that *W*(*x*) is real-valued for *x* ≥ −1/*e*, which corresponds to *τ*_{0} ≥ 1. For *τ*_{0} = 1, using *W*(−1/*e*) = −1 we see that *b* = 1 and the power solution *g*(*m*) = *cb*^{m} cannot be used. It can be checked that *g*(*m*) = *cm* satisfies the recursion (*c* is again a constant). The initial conditions yield *c* = (1 − |*ε*|/*r*)/(*n* − 1), whose dependence on *n* violates the self-similarity of the wavefront. Thus, the dynamics when *τ*_{0} = 1 cannot be described by a self-similar traveling wave. Indeed, as shown in the expanding regime section, when *τ*_{0} = 1 the width of the wavefront expands as √*t*.

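Using the identity *e*^{W(x)} = *x*/*W*(*x*), we obtain $e^{-\tau_0 (1 - 1/b)} = 1/b$, so that integrating *q*_{n−1}(*t*) over one period gives $x_{\tau_0} = r \tau_0 - (r - |\varepsilon|)$. Plugging this into (A34), we finally have

$$\tau_0 = 1 + \frac{e^{|\varepsilon|} - 1}{r}. \tag{A44}$$

An interesting alternative derivation of *τ*_{0} exploits a conservation equation for the time-integrated values. Since $\dot{q}_{n-1} = q_{n-2} - q_{n-1}$, integrating both sides from 0 to *τ*_{0} and re-arranging yields

$$\int_0^{\tau_0} q_{n-1}\, dt = \int_0^{\tau_0} q_{n-2}\, dt - \left[ q_{n-1}(\tau_0) - q_{n-1}(0) \right] = \int_0^{\tau_0} q_{n-2}\, dt - \left[ q_{n-2}(0) - q_{n-1}(0) \right], \tag{A45}$$

where we have used the periodicity in the bulk, *q*_{n−2}(0) = *q*_{n−1}(*τ*_{0}), in the last step. Repeating this sequence of steps, we are led to

$$\int_0^{\tau_0} q_{n-1}\, dt = r \tau_0 - \left[ r - q_{n-1}(0) \right] = r \tau_0 - \left( r - |\varepsilon| \right). \tag{A46}$$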

#### (b) Eligibility traces

In RL, eligibility traces are used to enhance learning speed by efficiently propagating errors backwards in time. Specifically, the method uses the reward prediction error (*δ*_{t} in (A22)) at the current state-action pair to update the *q*_{r} values of recently visited state-action pairs, in addition to the *q*_{r} values of the current state-action pair. The resultant *q*_{r} update of the state-action pair visited *j* steps before the current one is

$$q_r(s_{t-j}, a_{t-j}) \leftarrow q_r(s_{t-j}, a_{t-j}) + \alpha\, (\gamma \lambda)^{j}\, \delta_t, \tag{A47}$$

where 0 ≤ *λ* < 1 is the eligibility traces parameter and *γ* is the discount factor (recall that *γ* is set to 1 throughout our analysis). *λ* sets the effective number of previous state-action pairs which are affected by the update at the current state-action pair. We consider a slightly modified version of this learning rule, where *λ* is set to 1 and instead a fixed number *k* of previous state-action pairs are updated according to (A47).

Let us examine the effect of this rule when updating *q*_{n} (the case of *q*′_{n} follows). If, after crossing node *n* towards the goal, the agent goes into the side path at *n* − 1, then since all the “hidden” state-action pairs inside the side path have *q*_{r} = 0 and the side paths are assumed to be long detours (≫ *k*), the updates within the side path and after exiting the side path do not update *q*_{n}. In this case, *q*_{n} is updated just as in (A25). If the agent, instead of turning into the side path at *n* − 1, continues along the direct path towards node *n* − 2, and if *k* ≥ 1, then *q*_{n} is updated as

$$q_n \leftarrow q_n + \alpha \left[ \left( \sigma_{n-1} q_{n-1} - q_n \right) + \left( \sigma_{n-2} q_{n-2} - q_{n-1} \right) \right], \tag{A48}$$

where the first term in the parenthesis is the *δ*-error from when the agent crosses node *n* and the second term is from when the agent crosses *n* − 1 along the direct path. If, at this point, the agent turns into the side path at *n* − 2, there are no further updates of *q*_{n} for the same reason stated above. Similarly, if it instead continues along the direct path to *n* − 3 and *k* ≥ 2, then *q*_{n} is again updated using the *δ*-error at that transition. Extending this argument, we see that *q*_{n} (and *q*′_{n} by the same argument) receives non-local updates as long as the agent continues along the direct path without taking turns into side paths. In the limit of *α* ≪ 1, we take expectations to get

$$\dot{q}_n = \mu_n \sum_{j=0}^{k} \left( \prod_{i=1}^{j} \sigma_{n-i} \right) \left( \sigma_{n-j-1} q_{n-j-1} - q_{n-j} \right) = \mu_n \left[ \left( \prod_{i=1}^{k+1} \sigma_{n-i} \right) q_{n-k-1} - q_n \right], \tag{A49}$$

and the same expression for *q*′_{n} with *µ*_{n} → *µ*′_{n} and *q*_{n} → *q*′_{n} in the last term. The product, $\prod_{i=1}^{k} \sigma_{n-i}$, corresponds to the probability of taking *k* contiguous steps along the direct path. The second equation is obtained by noticing that all the terms except the ones involving *q*_{n} and *q*_{n−k−1} cancel out. We verify the validity of (A49) by comparing to full RL simulations (Figure S3a).
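The cancellation described above can be checked numerically. The sketch below assumes that the *δ*-error generated *j* steps after crossing node *n* is weighted by the probability of having taken *j* contiguous steps along the direct path, i.e., by the product of *σ*_{n−1}, …, *σ*_{n−j}; under that assumption the sum of weighted errors collapses to a single difference involving only *q*_{n} and *q*_{n−k−1}. The function names and the test values are illustrative.

```python
def k_step_drive(q, sig, n, k):
    """Sum of delta-errors reaching q[n], each weighted by the probability
    of staying on the direct path long enough to generate it."""
    total, reach = 0.0, 1.0  # reach = sig[n-1] * ... * sig[n-j]
    for j in range(k + 1):
        total += reach * (sig[n - j - 1] * q[n - j - 1] - q[n - j])
        reach *= sig[n - j - 1]
    return total

def telescoped_drive(q, sig, n, k):
    """Closed form after cancellation: only q[n] and q[n-k-1] survive."""
    reach = 1.0
    for i in range(1, k + 2):
        reach *= sig[n - i]
    return reach * q[n - k - 1] - q[n]

# Arbitrary illustrative profile; index 0 is the goal.
q_vals = [2.0, 1.7, 1.1, 0.6, 0.2, 0.0]
sig_vals = [1.0, 0.9, 0.8, 0.6, 0.3, 0.1]
full = k_step_drive(q_vals, sig_vals, n=5, k=3)
short = telescoped_drive(q_vals, sig_vals, n=5, k=3)
```

Setting `k=0` recovers the local SARSA drive *σ*_{n−1}*q*_{n−1} − *q*_{n}, so the window length interpolates between the local rule and reinforcement drawn from nodes deep in the bulk.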

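As in the SARSA case, we consider the asymptotic limits *e*^{ε} ≫ 1 and *e*^{−ε} ≫ 1. For *e*^{ε} ≫ 1, we have *σ*_{n} ≃ 1, *µ*_{n} ≃ 1 for all *n* and the sums in (A49) collapse into the simple set of linear equations

$$\dot{q}_n = q_{n-k-1} - q_n. \tag{A50}$$

For convenience, we assume *n* = *𝓁*(1 + *k*) where *𝓁* is an integer. Similar results with minor modifications are obtained for the other cases. Taking the Laplace transform, we get

$$(s + 1)\, \tilde{q}_{\ell(1+k)}(s) = \tilde{q}_{(\ell-1)(1+k)}(s) + q_{\ell(1+k)}(0), \tag{A51}$$

which after following steps similar to the SARSA case leads to

$$q_{\ell(1+k)}(t) = r P(\ell, t) + \sum_{\ell'=1}^{\ell} q_{\ell'(1+k)}(0)\, \frac{t^{\ell-\ell'} e^{-t}}{(\ell-\ell')!}. \tag{A52}$$

For the initial conditions *q*_{n}(0) = 0 for *n* > 0, we have *q*_{n}(*t*) = *rP*(*n*/(1 + *k*), *t*) ≃ *r*Φ((*t* − *n*/(1 + *k*))/√(*n*/(1 + *k*))) for large *n*. From here, we obtain the speed, *v*_{k} = 1 + *k*, and the width of the wavefront at time *t*, (1 + *k*)√*t*. Thus, eligibility traces enhance the speed of reinforcement propagation by breaking the locality constraint of SARSA.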

In the marching regime, *e*^{−ε} ≫ 1, suppose, as in the SARSA case, that the wave has just reached node *n*, i.e., *q*_{n−1}(0) = |*ε*|, *q*_{n}(0) = 0. We compute the time *τ*_{k} it takes for *q*_{n} to be reinforced to |*ε*|. When *q*_{n−1} ≥ |*ε*|, from (A41), the product of *σ*’s equals 1 and *q*_{n}(*t*) subsequently evolves as
*τ*_{k} can be calculated by integrating the above equation from *t* = 0 to *t* = *τ*_{k}, which gives
The integral is determined by the dynamics in the bulk of the wave, which are governed by (A50). The Laplace transform of (A50) leads to a relationship similar to (A37) for *m* = *𝓁*(1 + *k*) < *n*:
Defining *h*(*𝓁*) ≡ 1 − *q*_{𝓁(1+k)}(0)/*r* and using the periodicity of the wave with respect to *τ*_{k}, *q*_{𝓁(1+k)}((1 + *k*)*τ*_{k}) = *q*_{(𝓁−1)(1+k)}(0), we have
It can be checked that with
satisfies the above recursion and the boundary conditions *q*_{n−1}(0) = |*ε*|. Defining *β*_{k} ≡ 1 − 1*/b*_{k}, we obtain
From (A54), we have
which upon rearranging leads to
This implicit equation can be solved numerically for *τ*_{k}. We verify the result by comparing our analytical calculation to RL simulations (Figure S3b,c). Importantly, it can be shown that the second term on the right-hand-side goes to zero as *k*→ ∞. This limit corresponds to the situation in which the node *n* effectively receives reinforcement from future states that have already been fully reinforced to *r* (i.e., . From here, we obtain the upper bound on the speed,

#### (c) Dyna-Q

Dyna-Q (tabular) is a model-based RL algorithm which combines “planning” and *Q*-learning. It is useful to examine the learning dynamics of a model-based learning algorithm when applied to our setting. The Dyna-Q agent learns a model of the environment, *P*(*s*′, *r* | *s, a*), corresponding to the distribution of subsequent rewards and states for every state-action pair. At every step during navigation, in addition to updating its *q*-values using (A63) for the current state-action pair, the agent also randomly samples *n*_{p} previously visited state-action pairs from memory, draws the subsequent state and reward from its learned model, and updates the corresponding *q*-values.

Dyna-Q does not significantly influence the learning dynamics in our setting because the state transitions and rewards are deterministic. Its effect is simply to scale the learning rate relative to SARSA. The scaling arises because the probability, *p*(*s, a*), that a particular state-action pair (*s, a*) is reinforced in each of the additional *n*_{p} planning updates is proportional to the number of times (*s, a*) is expected to be traversed in an episode, say *n*(*s, a*). The expected number of times (*s, a*) is reinforced in a single planning step is then *n*_{p}*n*(*s, a*)/*n*_{steps}, where *n*_{steps} is the total number of state-action pairs visited in each episode. Since this planning step is applied at each of the *n*_{steps} steps, the total *additional* number of times *q*_{r}(*s, a*) is updated per episode is *n*_{p}*n*(*s, a*), on top of the *n*(*s, a*) regular SARSA updates. Thus (*s, a*) is updated *n*_{p} + 1 times as often as in plain SARSA, which can be interpreted as a rescaling of the learning rate, *α* → (1 + *n*_{p})*α*. Here, we have assumed that the memory of the Dyna-Q agent is not much larger than *α*^{−1}, so that *n*(*s, a*) does not change significantly over a timescale comparable to the memory size. We confirm this prediction in RL simulations of Dyna-Q agents (Figure S4a).
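The counting argument can be checked with a short Monte Carlo sketch. It is illustrative only and makes several assumptions that are not taken from the simulations reported here: the memory holds exactly the state-action pairs of one fixed episode (with multiplicity), planning samples are drawn uniformly from that memory, and the function name `update_ratio` and the toy episode are hypothetical.

```python
import random
from collections import Counter


def update_ratio(episode, n_p, n_episodes=2000, seed=0):
    """Ratio of total q-updates to direct (on-trajectory) updates per pair."""
    rng = random.Random(seed)
    direct = Counter()   # regular updates made while traversing a pair
    planned = Counter()  # extra updates made during planning
    for _ in range(n_episodes):
        for pair in episode:
            direct[pair] += 1
            # Planning (assumed form): draw n_p remembered pairs uniformly
            # from memory, so (s, a) is drawn with probability
            # n(s, a) / n_steps on each draw.
            for sampled in rng.choices(episode, k=n_p):
                planned[sampled] += 1
    return {p: (direct[p] + planned[p]) / direct[p] for p in direct}


# Hypothetical episode: three state-action pairs visited 3, 1 and 6 times.
episode = [("s1", "a")] * 3 + [("s2", "a")] * 1 + [("s3", "b")] * 6
ratios = update_ratio(episode, n_p=3)
for pair, ratio in ratios.items():
    print(pair, round(ratio, 2))
```

Under these assumptions every pair, whether visited often or rarely, should show a ratio close to 1 + *n*_{p}, which is the factor by which the effective learning rate is rescaled.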

#### (d) Q-learning

Watkins’ *Q*-learning rule is closely related to SARSA with an important difference that enables off-policy learning, i.e., the agent learns the optimal *q* values while executing an arbitrary policy. This rule is defined by
Note the max instead of the expectation in *δ*_{t}. We consider *Q*-learning using the stochastic policy considered for SARSA above. The equivalent of (A25) follows by replacing *σ*_{n−1}*q*_{n−1} in (A25) with max(0, *q*_{n−1}) = *q*_{n−1}, since the alternative action of turning into the side path has *q*_{r} = 0 and *q*_{n−1} ≥ 0. We thus have
The *σ*_{n−1} prefactor does not play a significant role, and the results obtained for SARSA are directly applicable in both the expanding and marching regimes, which we verify through RL simulations (Figure S4b).
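The difference between the two bootstrap terms can be made concrete with a small sketch. It is a simplified illustration, not the simulation code: the discount factor `gamma = 1.0`, the two-action next state, and the use of `q_prev + eps` as the log-preference for continuing along the direct path are all assumptions introduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of log-preferences."""
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)
    return [w / z for w in weights]


def bootstrap_expected(q_next, logits):
    """Expectation of next-state q-values under the stochastic policy."""
    return sum(p * q for p, q in zip(softmax(logits), q_next))


def bootstrap_max(q_next):
    """Off-policy bootstrap used by Q-learning: best next-state q-value."""
    return max(q_next)


def td_error(reward, q_current, bootstrap, gamma=1.0):
    """Generic TD error; gamma = 1 is an assumption of this sketch."""
    return reward + gamma * bootstrap - q_current


# Next state: continue on the direct path (q = q_prev) or turn into a
# side path (q = 0). eps < 0 makes the direct step intrinsically costly.
q_prev, eps = 0.4, -1.0
q_next = [q_prev, 0.0]
logits = [q_prev + eps, 0.0]

expected = bootstrap_expected(q_next, logits)   # sigma * q_prev
greedy = bootstrap_max(q_next)                  # max(0, q_prev) = q_prev
print(td_error(0.0, 0.0, expected), td_error(0.0, 0.0, greedy))
```

With a non-negative *q*_{n−1}, the max-based bootstrap equals *q*_{n−1} itself, whereas the expectation is reduced by the probability of continuing along the direct path.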

#### (e) Q-learning (pure exploration)

Next, we consider *Q*-learning with a purely explorative policy that is independent of the learned *q*_{r} values. Specifically, the policy is fixed and is given by ln *π*(*a*|*s*) = *q*_{ε}(*s, a*) − ln *Z*(*s*), where *q*_{ε}(*s, a*) are the intrinsic rewards as in the previous cases. This case is useful to understand the effects of the learning rule in isolation, without the effects of the feedback on behavior due to learning coming into play.
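As a minimal sketch of such a fixed policy, the snippet below normalizes the intrinsic rewards at a single state. The two-action state with intrinsic rewards *ε* and 0 is an assumption made for illustration, and the helper names are hypothetical.

```python
import math


def fixed_policy(q_eps):
    """Policy with ln pi(a|s) = q_eps(s, a) - ln Z(s) at a single state."""
    log_z = math.log(sum(math.exp(q) for q in q_eps))
    return [math.exp(q - log_z) for q in q_eps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Assumed two-action state: intrinsic reward eps for stepping along the
# direct path and 0 for the alternative action.
eps = -1.5
probs = fixed_policy([eps, 0.0])
print(probs[0], sigmoid(eps))
```

In this two-action case the probability of the first action reduces to the logistic function of *ε*, and it does not change as the *q*_{r} values are learned.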

The dynamics of *q*_{n} and are given by (A63). However, the factors *µ*_{n} are no longer the ones in (A24). Recall that *µ*_{n} is the expected number of times per episode the agent performs the action of traversing towards the goal at node *n* on the direct path. This number is the expected visits to state (*n* + 1, *n*), *M*_{+}(*n*) (from (A21)), multiplied by the probability, *σ*(*ε*), that the agent takes the action leading to (*n, n* − 1). We thus have *µ*_{n} = *σ*(*ε*)*M*_{+}(*n*) and, similarly, . Since *µ*_{n}’s are constant over time, the differential equations (A63) are linear and can be exactly solved for *q*_{n}(*t*) in the Laplace domain. We obtain
which unfortunately does not yield much insight. Instead, we consider the evolution of *q*_{n}(*t*) for large *n* when *k*_{−} < *k*_{+}. From (A21), we have that is a constant. The equations lead to *q*_{n}(*t*) ≈ *rP*(*n, µt*). This relation is approximate as *µ*_{n} is not constant for small *n*. The reinforcement signal thus propagates with speed
which is verified in simulations (Figure S4c). This analysis further highlights that when learning is decoupled from behavior, similar to the *e*^{ε} ≫ 1 expanding regime, the propagation of the signal is simply constrained by the local learning rule with speed proportional to the number of times the learning rule is applied at each state-action pair.
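The relation *q*_{n}(*t*) ≈ *rP*(*n, µt*) can be illustrated numerically under explicit assumptions that are not taken from the text: that *P* denotes the regularized lower incomplete gamma function, that the dynamics reduce to the linear chain d*q*_{n}/d*t* = *µ*(*q*_{n−1} − *q*_{n}) with *q*_{0} = *r* held fixed and a constant *µ*, and that the learning rate is absorbed into the time variable. The parameter values are arbitrary.

```python
import math


def erlang_cdf(n, x):
    """Regularized lower incomplete gamma P(n, x) for integer n >= 1."""
    term, total = 1.0, 1.0
    for j in range(1, n):
        term *= x / j
        total += term
    return 1.0 - math.exp(-x) * total


def integrate_chain(n_nodes, mu, r, t_end, dt=1e-3):
    """Forward-Euler solution of dq_n/dt = mu * (q_{n-1} - q_n), q_0 = r."""
    q = [r] + [0.0] * n_nodes
    for _ in range(round(t_end / dt)):
        dq = [mu * (q[n - 1] - q[n]) for n in range(1, n_nodes + 1)]
        for n in range(1, n_nodes + 1):
            q[n] += dt * dq[n - 1]
    return q


mu, r, t_end = 0.8, 1.0, 6.0  # arbitrary illustrative values
q = integrate_chain(n_nodes=8, mu=mu, r=r, t_end=t_end)
max_gap = max(abs(q[n] - r * erlang_cdf(n, mu * t_end)) for n in range(1, 9))
print(max_gap)
```

Under these assumptions the numerical solution and the closed form agree up to the discretization error of the Euler scheme, and the half-maximum of the profile advances by roughly *µ* nodes per unit time.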