Abstract
The successor representation (SR) is a candidate principle for generalization in reinforcement learning, computational accounts of memory, and the structure of neural representations in the hippocampus. Given a sequence of states, the SR learns a predictive representation for every given state that encodes how often, on average, each upcoming state is expected to be visited, even if it is multiple steps ahead. A discount or scale parameter determines how many steps into the future the SR's generalizations reach, enabling rapid value computation, subgoal discovery, and flexible decision-making in large decision trees. However, an SR with a single scale can discard information about both the sequential order of and the distance between states, which are common problems in navigation for animals and artificial agents. Here we propose a solution: an ensemble of SRs with multiple scales. We show that the derivative of the multi-scale SR can reconstruct both the sequence of expected future states and the distance to goal. This derivative can be computed linearly: we show that a multi-scale SR ensemble is the Laplace transform of future states, and the inverse of this Laplace transform is a biologically plausible linear estimation of the derivative. The multi-scale SR and its derivative could lead to a common principle for how the medial temporal lobe supports both map-based and vector-based navigation.
Introduction
The reinforcement learning problem is one faced by biological and computational agents alike: finding a series of actions in an environment to maximize long-run reward (Dayan & Balleine, 2002). Many environments have sparse rewards, making step-by-step simulation of entire trajectories to choose a reward-maximizing policy (Figure 1, top left) costly and sometimes intractable. One way to make the solution tractable is to generalize representations of nearby states along long paths, allowing simulations to hop over nearby states rather than traverse them one by one. The successor representation offers such a solution (Dayan, 1993). The key idea is that, given a stream of experience and actions, the SR represents a given state in terms of the states that will follow it in the near future. Importantly, the definition of “near” depends on a discount parameter, which imposes a timescale on the generalization and hence on predictions over successor states (Figure 1, top right). It has been shown that the successor representation (SR) offers a candidate principle for generalization in reinforcement learning (Dayan, 1993; Momennejad, Russek, et al., 2017; Russek, Momennejad, Botvinick, Gershman, & Daw, 2017) and computational accounts of episodic memory and temporal context (Gershman, Moore, Todd, Norman, & Sederberg, 2012), with implications for neural representations in the medial temporal lobe (Stachenfeld, Botvinick, & Gershman, 2017) and the midbrain dopamine system (Gardner, Schoenbaum, & Gershman, 2018).
Most real-world planning problems require agents to plan over long timescales. Classic reinforcement learning solutions, namely model-free, model-based, and hybrid agents, provide unsatisfactory solutions to this problem. A model-free (MF) agent simply stores the long-term value of a given action without storing information about the states or the map of the environment. An MF agent solves the RL problem quickly, but is inflexible in the face of changes in either the rewards or the map structure of the environment. A typical model-based (MB) solution to finding action policies that maximize reward is to learn and use a representation of the environment that stores relationships between states that are one step away from one another, i.e., a one-step transition map of the environment. When the MB agent is about to make a decision, it retrieves different trajectories to reward, then computes and compares their expected values by simulating possible sequences one step at a time. Thus, MB agents compute expected value (long-term cumulative reward) for each possible sequence. This one-step MB solution offers flexibility in the face of sudden changes, but it is computationally expensive and sometimes intractable for environments with large state spaces. In contrast, the SR offers a more computationally efficient solution to the RL problem by storing generalizations (temporal abstractions) of multi-step relationships between states (Dayan, 1993; Gershman, 2018).
To make its computations efficient, the SR relies on a weighted sum over the probabilities of visiting future states, storing the expected future occupancy of each future state (those that follow from the current state) at a given discount scale. This precomputed information about multi-step predictive relationships is later combined with reward to compute value in a linear operation, enabling rapid and flexible adaptation to changes in reward, as in reward revaluation. The SR is as flexible as MB when rewards change in the environment (reward revaluation) but less flexible when the map of the environment changes (transition revaluation), predicting behaviorally asymmetric flexibility. Recent experiments have compared these models to human behavior and shown that the asymmetry in human behavior is most consistent with the predictions of SR agents that update their models via replay (SR-Dyna) (Momennejad, Russek, et al., 2017). It has also been shown that the SR offers a computational account of optimal behavior in a variety of RL problems, such as policy revaluation and detours (Russek et al., 2017), and explains how context repetition enhances memory-driven predictions in human behavior (Smith, Hasinski, & Sederberg, 2013). The SR has also been proposed as a principle of neural organization for place cells and grid cells in the hippocampus and the entorhinal cortex, playing a crucial role in rodent navigation (Stachenfeld et al., 2017). Stachenfeld and colleagues reviewed the literature on hippocampal-entorhinal encoding of spatial maps during navigation and modeled the evidence using the successor representation. They concluded that the SR is a candidate organizational principle governing the neural firing of place cells and grid cells in learning the spatial maps that guide navigation. Since the eigenvectors of the transition matrix and the SR are the same, they suggested that grid cells in the medial entorhinal cortex may provide an eigendecomposition of the graph of the states.
Furthermore, human neuroimaging implicates the SR in neural representations underlying event segmentation in the statistical learning of non-spatial relational structures (Schapiro, Rogers, Cordova, Turk-Browne, & Botvinick, 2013). Computational models show that the SR can partition the state space, enabling sub-goal processing in large decision trees (Botvinick & Weinstein, 2014). Finally, a recent human fMRI study showed that the SR governs the implicit encoding and later retrieval of non-spatial relational knowledge (Garvert, Dolan, & Behrens, 2017). The authors presented a series of images in an order that was implicitly determined by discrete, non-spatial graphs, and later probed participants about the relations between the different images. They were able to recover the structure of non-spatial relationships from blood oxygen level-dependent adaptation in the hippocampus and entorhinal cortex, and showed that the map that best captured behavioral and medial temporal lobe (MTL) results was a weighted sum of future states, as in the SR (Garvert et al., 2017).
Taken together, these computational, behavioral, neuroimaging, and electrophysiological findings in humans and rodents suggest the successor representation as a candidate principle underlying the organization of hippocampal and entorhinal firing in spatial navigation and non-spatial relational learning.
The purpose of our paper is to clarify limitations of existing SR models and propose a solution. Briefly, the limitation is that estimating the sequential order of and distance between states from an SR with a single discount is nontrivial. Intuitively, in every row of the n×n SR matrix, information about the sequential order of and distance between states is lost. The SR generalizes over successor states, and this temporal abstraction relies on a weighted sum of future states. The weights of successor states decay exponentially the further they are in the future, depending on a discount parameter γ. The parameter γ determines the horizon over which the SR generalizes states. Low γ values discount states that are further in the future more steeply, leading to a shorter temporal horizon over which states are abstracted, while high γ values lead to less steep discounting, and hence a larger time-scale of abstraction. This weighted sum also depends on the number of times a future state is expected to be visited, starting from the present state.
This can perhaps best be illustrated with an example. Consider the choice between actions that lead from a starting state (e.g., my work) to two different successor states (e.g., two bistros). Depending on how often one or the other is expected to be visited, it is possible that the successor state that is furthest (my favorite bistro) has a higher SR value (expected frequency of future visits) than the successor state nearby (the bistro close to work). Depending on the goals of the organism (e.g., how hungry I am and how much time I have for lunch), the optimal choice of a successor state may require separate information about the distances and the frequency of future visits from the present state (e.g., work) to the optimal successor state (e.g., bistro). Thus, due to temporal abstraction, the weighted sum imposes two limitations. The SR vector starting in a given state, computed with one discount parameter, does not always tell us which future state is closer to the starting location, nor what the sequence of states leading to each is. That is, the distance and sequential steps between a starting state and its successor states cannot be computed from that state's SR vector of size n alone. Instead, this estimation requires nonlinear operations involving the entire n × n SR matrix, i.e., computation on n vectors, each starting in a different state. Given large SR matrices this operation can be costly, and it is not likely to be biologically plausible. Our proposed theoretical framework overcomes this limitation by assuming that the brain stores an ensemble of SRs at multiple discounts, and by estimating the derivative of the multi-scale SRs.
The simple intuition behind our proposed solution is as follows. Consider an ensemble of n_g SR matrices estimated with different γs, all between 0 and 1, where n_g is the number of γs and hence of SR matrices. Then consider a line going through the ijth cell of all SR matrices. This is a vector of expected future visitations from state i to state j across all the SR matrices, with size n_g. Each index in this vector corresponds to temporal abstraction at a different γ. The distance between state i and state j determines at which index (or γ) of this vector the expected future occupancy changes. For an intermediate distance, this relationship could be near zero for low γs but increase for higher γs. For instance, suppose the two states are 20 steps apart; their relationship will then only register for SRs whose abstraction horizon reaches 20 states into the future, but not for SRs with smaller horizons.
Mathematically, this means that the derivative of this vector of SR entries from i to j across discount rates can identify at which distance horizon the relationship between the two states changes, thus recovering their distance. From any given starting state, if we know at which horizon the relationship to every other state changes, it is possible to recover the sequential order and distance of all successor states (see Figure 1). This is a powerful intuition since, absent multiple SRs and their derivative, computing distance and order in the bistro example from a single SR requires computing the one-step transitions from a given state to all other states using the entire matrix, which is nontrivial.
In short, the derivative across multiple SR matrices can identify at which scales the relationship between two given states changes. If we know at which horizon the relationship between any two states i and j changes, it is possible to identify the distance between them (see Figure 1). In this paper we show that linear operations on an ensemble of multi-scale SRs can estimate these derivatives. Using the derivative, we can reconstruct the sequence of expected future states following a given starting state, and recover order and distance between states, merely by computing a vector as opposed to entire matrices.
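To make this intuition concrete, the following minimal sketch (in Python, with invented distances and visit probabilities) plays out the bistro example: at a single large scale the farther but frequently visited state has the larger SR value, while the profile of the same SR entry across scales reveals which state is actually closer.

```python
import numpy as np

# Minimal numeric sketch of the bistro intuition. The numbers are invented
# for illustration: state A is 2 steps away but visited on only 30% of
# trajectories; state B is 10 steps away but always visited.
def sr_entry(gamma, distance, visit_prob):
    # Expected discounted future occupancy of a state at the given distance.
    return visit_prob * gamma ** distance

gammas = np.array([0.3, 0.5, 0.7, 0.9, 0.99])
A = sr_entry(gammas, distance=2, visit_prob=0.3)
B = sr_entry(gammas, distance=10, visit_prob=1.0)

# At a single large scale the farther state has the larger SR value:
print("B > A at gamma = 0.99:", B[-1] > A[-1])      # True, despite B being farther

# But the scale at which each entry rises from ~0 is set by distance, not by
# visit frequency, so the profile across scales recovers which state is closer.
print("A across scales:", np.round(A, 4))           # non-negligible already at gamma = 0.3
print("B across scales:", np.round(B, 4))           # only registers at large gamma
```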
In the remainder of the manuscript we offer a more detailed account of our proposal, and show qualitative fit between model predictions and recent neural findings. We conclude with a discussion of the proposal, its neural plausibility, and its possible implications for a unifying account of map-based and vector-based representations using SR.
An ensemble of successor states is the Laplace transform of the likely future
Where might one look for a gradient of multi-scale SRs, which, as we show below, are equivalent to a Laplace transform? It has been hypothesized that place cells in the hippocampus reflect a predictive map of the environment generated by the SR (Stachenfeld et al., 2017). It is well known that there is a gradient of spatial scales along the long axis of the hippocampus (Jung, Wiener, & McNaughton, 1994; Kjelstrup et al., 2008a; Brunec et al., 2018; Collin, Milivojevic, & Doeller, 2015). For instance, all things equal, CA1 place fields grow in size as the recording location is moved systematically from the dorsal end of the hippocampus to the ventral end (e.g., Kjelstrup et al., 2008a). Furthermore, it has been shown that grid cells are topographically organized in the medial entorhinal cortex (MEC) as well, with the scale of grids increasing with distance from the dorsal border of MEC (Brun et al., 2008). It is possible that this gradient of observed place and grid field sizes corresponds to a gradient of planning horizons over which the SR is computed. Much of the neurophysiology of the MTL and hippocampal formation seems oriented to computations along this gradient, suggesting that computing a derivative along it is plausible. For instance, theta oscillations are traveling waves that traverse the long axis of the hippocampus (Lubenov & Siapas, 2009; Patel, Fujisawa, Berényi, Royer, & Buzsáki, 2012; Zhang & Jacobs, 2015), enabling different spatial scales (and perhaps planning horizons) to be processed sequentially. We discuss the neural plausibility of our proposal in more detail in the Discussion section.
In what follows we show that a neurally plausible linear operation, namely the inverse of the Laplace transform, can be used to compute the derivative of the multi-scale SR. We first demonstrate that an ensemble of successor representations with different discount rates is the Laplace transform of a timeline of future states. We then show that a simple inverse of the Laplace transform identifies the derivative of the multiple SRs, indicating where the relationships between states change. We show that this operation recovers the sequential order of states and predicts cells that fire at specific distances to goal states, as shown in the recent animal and human literature (Sarel, Finkelstein, Las, & Ulanovsky, 2017a; Qasim et al., 2018). We then discuss the significance of this model for learning and navigation, evidenced by the qualitative match between the model's predictions and recent findings across species.
The successor representation, described above, provides an exponentially-weighted estimate of future occupancy in going from one state i to a state j. That is, when the agent is in its current state i, the representations of its successor states are co-activated. The extent of this co-activation depends on distance: successor states that are nearby are co-activated to a larger extent than those further away, leading to an exponentially-weighted representation of successor states. However, if the cumulative probability of transitioning from state i to a given successor state j is higher than that of transitioning to an equidistant successor state l, then in spite of similar discounting, state j will have the higher SR value. Therefore, on the basis of a single row of the SR alone, it is not possible to tell whether state j or state l is closer.
It is worth noting that the SR is policy dependent: every SR matrix is computed under a given action policy. If the agent expects to visit a state often simply out of habit, the successor representation of that state will have a high magnitude due to this policy choice alone, even if that state does not contain higher reward. This makes it possible, in principle, that SR matrices at different time-scales are learned under different policies, since the agent may be caching SRs while pursuing different policies at different scales. This feature may enable the multi-scale SR to dynamically select different policies based on changing temporal horizons. In other words, caching a multi-scale ensemble of SRs enables selecting policies with a flexible temporal horizon, which would be more adaptive to task demands than using a single predictive horizon regardless of the planning horizon required by the task. We return to this point in the Discussion.
The choice of discounting and abstraction is controlled by a parameter γ, with values between 0 and 1. Choosing different values of γ sets the temporal horizon over which predictions can be made. Each successor representation matrix is constructed using a discount rate or scale of generalization, and under a certain policy. Here we propose that the brain constructs multiple SR matrices along the MTL with a broad set of values of γ in parallel.
Here we make the point that the choice to use many values of γ in parallel also enables a very different form of representation, one that directly estimates a timeline over future events. This is possible because an ensemble of successor representations with a continuous (or near-continuous) spectrum of values of γ encodes the Laplace transform of the future. Perhaps an analogy to the better-known Fourier transform provides a good intuition for what a Laplace transform of sequential future trajectories entails. While a Fourier transform decomposes a signal using sinusoidal functions with different frequencies, the Laplace transform can decompose a signal (e.g., the flow of experience) using exponential curves with different decay rates (note that here we only consider the real part of the Laplace transform).
The Laplace transform has powerful properties, enabling extraction of important information using simple linear operations, such as inversion. The Laplace transform can be inverted with a linear operator that has been extensively studied in computational cognitive neural models of memory (Shankar & Howard, 2012, 2013; Liu, Tiganj, Hasselmo, & Howard, in press). Importantly, inverting the transform is equivalent to computing a derivative of the relation between two given states i and j across different SR matrices, i.e., across different time-scales. Intuitively, knowing at which scales the relationship between two states changes, as indicated by the derivative, enables us to estimate their distance from each other. In short, the multi-scale SR ensemble and its derivative are equivalent, respectively, to the Laplace transform of expected sequential future states and its inverse. Thus, we arrive at a cached representation that can also recover the order of and distance between states in estimated future trajectories. To see how this works in more detail, we start with a formal description of the successor representation.
Let us suppose that the environment consists of a finite number of states 1..n, or S. Depending on the states that an agent visits at each time point, we have a time series of visited states s_t, where each visited state belongs to S. The agent visits these states according to a Markov process specified by a one-step transition matrix, where the ijth element is the probability of transitioning from state i to state j within one step under action a:

[Tᵃ]_{ij} = Pr(s_{t+1} = j | s_t = i, a_t = a)        (1)
The successor representation (SR) with a given γ value and a given policy, which we denote M(γ), constructs an associative matrix between each pair of states, whether they are one step or multiple steps apart. The entries in this SR matrix estimate the exponentially-discounted expected future occupancy of any destination state j (any column) when starting at a particular state i, over all future time. That is, the entries of M(γ) estimate:

[M(γ)]_{ij} = E[ Σ_{t′=t}^{∞} γ^{t′−t} 𝟙[s_{t′} = j] | s_t = i ]        (2)

That is, summing over all future lags τ, with T the one-step transition matrix under the agent's policy,

M(γ) = Σ_{τ=0}^{∞} γ^{τ} T^{τ}        (3)
In the case where the statistics of the world are governed by a Markov process, the successor representation can be efficiently computed via temporal difference (TD) learning (Dayan, 1993).
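As an illustration, the sketch below learns the SR by TD(0) on a random walk around a ring of states and checks the result against the closed-form solution; the environment, learning rate, and step count are illustrative choices of ours rather than values from the literature.

```python
import numpy as np

# Minimal sketch: TD(0) learning of the SR (Dayan, 1993) on a random walk
# around a ring of n states. All parameters are illustrative.
n, gamma, alpha, steps = 10, 0.9, 0.05, 300_000
rng = np.random.default_rng(0)
M = np.zeros((n, n))
I = np.eye(n)

s = 0
for _ in range(steps):
    s_next = (s + rng.choice((-1, 1))) % n          # random-walk policy
    # TD update toward the one-step bootstrap target: current-state
    # occupancy plus the discounted SR of the successor state.
    M[s] += alpha * (I[s] + gamma * M[s_next] - M[s])
    s = s_next

# Compare against the closed form M(gamma) = (I - gamma * T)^(-1) for the
# random-walk transition matrix T; the TD estimate is approximate.
T = np.zeros((n, n))
for i in range(n):
    T[i, (i - 1) % n] = T[i, (i + 1) % n] = 0.5
print("max abs error:", np.abs(M - np.linalg.inv(I - gamma * T)).max())
```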
Note that Eq. 2 describes a sum over future events. Naturally, the sum contains less information than a complete description of all future events. One can readily see this by noting that there are many possible futures that would give the same value for the sum. For instance, given the same starting state (e.g., work), a nearby state that the agent does not visit often (e.g., the nearby bistro) might end up with the same (or even a lower) weighted SR value as a state many more steps away with more likely transitions (e.g., the favorite bistro miles away). As such, given that the probabilities of one-step transitions along the trajectory are not all 1 and different policies may be adopted, a higher SR value does not necessarily imply proximity and smaller distance, nor can sequential order be inferred from a comparison of SR values. However, we will show that an ensemble of successor representations with different values of γ can be used to recover the expected future trajectory.
To simplify notation and show the equivalence to the Laplace transform, let us define a function that returns the true, un-discounted expected trajectory of all future states (whether one or multiple steps away) that are expected to follow a given starting state i, at a lag of time τ:

f̃_i(τ) ≡ [T^{τ}]_i        (4)

where [T^{τ}]_i denotes the ith row of T^{τ}, so that f̃_{i→j}(τ) = Pr(s_{t+τ} = j | s_t = i). For a successor representation matrix with a single discount rate, M(γ), we can now rewrite Eq. 2 using the function defined above as

[M(γ)]_i = Σ_{τ=0}^{∞} γ^{τ} f̃_i(τ)        (5)
Let us take the continuum limit and define σ ≡ − log γ. Importantly, σ is the real part of the Laplace domain variable. The variable σ is understandable as a rate; the time constant 1/σ has units of time and is understandable as the temporal horizon over which predictions are evaluated. Now, we can rewrite Eqs. 3 and 5 as the Laplace transforms of the transition matrix and of the future following state i, respectively:

M(σ) = ∫₀^∞ e^{−στ} T^{τ} dτ,    [M(σ)]_i = ∫₀^∞ e^{−στ} f̃_i(τ) dτ        (6)
Here we understand the left-hand sides of Eq. 6 to refer to many instances of M(σ) with a wide range of (real) values of σ, i.e., multiple SRs with different values of γ. From Eq. 6 it is clear that [M(σ)]_i is the Laplace transform of the expected states that will follow i as a function of future time, f̃_i(τ).
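This equivalence is easy to check numerically in the discrete-time case, where the transform is a sum rather than an integral; the sketch below verifies it on an assumed random-walk ring.

```python
import numpy as np

# Check that each member of the SR ensemble is the (discrete) Laplace
# transform of the future-trajectory function: [M(gamma)]_i equals
# sum_tau gamma**tau * f_i(tau) with f_i(tau) = [T**tau]_i (cf. Eqs. 3-6).
n = 8
T = np.zeros((n, n))
for i in range(n):
    T[i, (i - 1) % n] = T[i, (i + 1) % n] = 0.5     # random walk on a ring

for gamma in (0.3, 0.6, 0.9):
    M = np.linalg.inv(np.eye(n) - gamma * T)        # closed-form SR
    transform = sum(gamma ** tau * np.linalg.matrix_power(T, tau)
                    for tau in range(500))          # truncated transform sum
    print(f"gamma = {gamma}: max deviation = {np.abs(M - transform).max():.2e}")
```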
The insight that M(σ) is the real Laplace transform of f̃ is very powerful. The Laplace transform is invertible; if we could invert the Laplace transform we could explicitly estimate the expected sequential trajectory of future states (Figure 2). The real part of the Laplace transform, which is given by Eq. 6, is sufficient to uniquely specify functions defined over the interval from 0 to ∞. Recall that this sequence of future states was formalized in the future-trajectory function above (see Eq. 4). This means that inverting the Laplace transform could recover an estimate of the function over future events itself.
Inverting the Laplace transform in a neurally-plausible way
Put another way, Eq. 6 says that the timeline over the sequence of future states is distributed across different values of σ. Our strategy is to invert the transform, recovering this information about the function, and write the answer onto a set of neurons that estimate the future-trajectory function directly. Different neurons in this representation would then estimate the states that are expected to occur at different points in the future. Therefore, if we find a neurally plausible way to invert the Laplace transform, then we will have a powerful tool for recovering a function of future states from a multi-scale but static stack of cached predictive representations.
Fortunately, there has been a great deal of progress on neurally plausible methods to encode and invert the Laplace transform (Shankar & Howard, 2012, 2013). The Post approximation provides a means to estimate the inverse transform using a set of fixed weights that resemble center-surround receptive fields with respect to σ. The comparison to receptive fields offers an intuition about how a coarse estimate of the future-trajectory function is arrived at by averaging over a temporal neighborhood in the vicinity of each moment/state along the trajectory.
We denote the set of weights that invert the transform as L⁻¹_k, which is defined as

L⁻¹_k ≡ C_k σ^{k+1} (d^k / dσ^k)        (7)

where C_k = (−1)^k / k! is a normalization constant that depends on k. The kth derivative on the right-hand side can be approximated numerically. It can be understood by analogy to center-surround receptive fields. There is a long-standing tradition of identifying center-surround receptive fields with spatial derivatives (e.g., Marr & Hildreth, 1980). In this view, the so-called “Mexican hat” form of the receptive field suggests computation of the second derivative of Gaussian receptive fields.
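As a minimal illustration of this point (using an assumed one-sided finite-difference grid, not the exact weights of Shankar & Howard), the rows of the discretized operator already show this center-surround structure: for k = 2, the weights feeding one output unit form the classic 1, −2, 1 second-difference stencil.

```python
import numpy as np

# The inverse operator as a fixed set of linear weights over sigma. D is a
# one-sided finite-difference approximation to d/dsigma on a grid with
# spacing h; D @ D approximates the second derivative. (Illustrative sketch.)
n_sigma, h = 9, 0.1
D = (np.diag(np.ones(n_sigma - 1), 1) - np.eye(n_sigma)) / h

stencil = (D @ D)[4] * h ** 2     # weights feeding the middle output unit
print(stencil)                    # [0 0 0 0 1 -2 1 0 0]: center-surround (up to sign)
```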
Putting this together, we construct an estimate of the future following state i by first probing the ensemble M(σ) with i. This yields a vector-valued (over states) estimate for each value of σ. We then operate on this with L⁻¹_k, which estimates derivatives with respect to σ. The result is a set of units, indexed by τ* ≡ k/σ. Intuitively, τ* designates the unit of time: the temporal horizon, or horizon of generalization. As we will demonstrate in more detail later, the future trajectory f̃_i is estimated using many units, each with a different value of τ*:

f̂_i(τ*) ≡ [ L⁻¹_k M(σ) ]_i
         ≈ f̃_i(τ)        (8)
The right-hand side of the first line refers to the prediction of all future states following from i via successor representations with different values of σ, [M(σ)]_i. As previously discussed, this is the Laplace transform of the future expected to follow i under a given policy (Eq. 6). This multi-scale successor representation is operated on by L⁻¹_k, which approximates the inverse Laplace transform (Eq. 7). The second line simply makes the claim that inverting the transform recovers the original function that was transformed, here f̃_i. Because L⁻¹_k provides an approximation to the inverse transform, this equivalence is only approximate.
The Post approximation ensures that in the limit as k → ∞, the approximation of future states becomes perfect (Post, 1930). When k is finite, f̂_i is a coarse-grained estimate of f̃_i, averaging over a temporal neighborhood in the vicinity of τ. Following Shankar and Howard (2012), it can be shown that the width of this coarse-graining is exactly proportional to each unit's value of τ*, implementing a scale-invariant estimate of the future.
Simulating points along a future trajectory
The form of the curves representing the future can be readily understood analytically, starting from the Laplace transform of a delta function. Consider the case where some state perfectly predicts an outcome some time τ₀ in the future. This would obtain if a set of states were deterministically experienced in a long sequence. Let us write f̃(τ) = δ(τ − τ₀) to describe the distance to the goal state from the present. In this case, the Laplace transform of this function, generated by the successor representation, is given by

F(σ) = e^{−στ₀}        (10)
The left panel of Figure 3 shows this expression for different values of σ (different curves), with the distance to the goal state τ₀ on the x-axis. This describes the activation of these units unfolding as the agent comes closer to the goal state, moving left to right.
Now, to compute f̂ we must take derivatives of F(σ) with respect to σ. Each derivative of Eq. 10 spits out another factor of τ₀, and we find (including the remainder of the definition in Eq. 7)

f̂_{τ*}(τ₀) = C_k σ^{k+1} (−τ₀)^k e^{−στ₀}
           = (k^{k+1} / k!) (1/τ*) (τ₀/τ*)^k e^{−k τ₀/τ*}        (12)

where we have used the definition τ* = k/σ in the second line. The right panel of Figure 3 shows this expression for units with different values of τ* (different curves, each corresponding to a different horizon unit) as the goal state is approached. Rather than increasing monotonically as the goal is approached, neurons in f̂ peak in their activation when the goal is a characteristic distance in the future.
The form of Equation 12 makes several properties of the representation clear. The expression is the product of two terms, an increasing power-law function (τ₀/τ*)^k and a decreasing exponential term e^{−kτ₀/τ*}. The product is zero when the ratio τ₀/τ* is zero and also when the ratio is infinity. As k increases, each of these terms becomes more steep. An elementary calculation shows that Eq. 12 is maximal at τ₀ = τ*. Considering the behavior of Eq. 12 as one varies τ₀, the unit coding for τ* steps in the future is maximally activated when τ₀ = τ*, independent of k. Because τ₀ appears only as a ratio with τ*, we can see that the model is scale-invariant. That is, starting from any particular value of τ₀ (and k), we find some pattern of activation as a function of τ*. If we change the value of τ₀ to ατ₀, we will find the same pattern of activation across units (up to a constant factor) by remapping τ* to ατ*.
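These properties can be verified directly. The sketch below evaluates Eq. 12 (up to constant factors, with an illustrative k) as a function of τ₀ for units with different τ*: each unit's activation peaks exactly when the goal is τ* steps away, at every scale.

```python
import numpy as np

# Activation of a horizon unit tuned to tau_star when the goal is tau0 steps
# away, per Eq. 12 (constant factors dropped; k is illustrative).
def activation(tau0, tau_star, k=8):
    return (tau0 / tau_star) ** k * np.exp(-k * tau0 / tau_star) / tau_star

tau0 = np.linspace(0.1, 60.0, 6000)
for tau_star in (5.0, 10.0, 20.0):
    peak = tau0[np.argmax(activation(tau0, tau_star))]
    print(f"unit tau* = {tau_star}: peak activation at tau0 ~ {peak:.1f}")
```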
The goal of Figure 4 is to provide mechanistic insight into why f̂ has the properties it does. Rather than showing the activation of units unfolding in time, Figure 4 shows the activation across all units in the multi-scale SR (x-axis) at different moments, when the goal state is predicted at different distances in the future (different lines). The top left panel shows the multi-scale SR for each possible value of γ. In each case, one obtains an exponential function; when the goal is closer, the curve is shallow (lighter lines); when it is further in the future, the curve is steeper (darker lines). The curvature of this function over values of γ encodes information about when the outcome is expected to occur. In order to decode this information in units of distance to the goal, we need to change variables from the discount factor to units of distance. σ is understandable as a rate; k/σ has the same units as τ₀. The bottom left panel shows the mapping between γ and k/σ. The y-axis of this graph is truncated because, as γ goes to 1, the “time constant” k/σ grows without bound, corresponding to an infinitely broad planning horizon.
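The mapping itself is a single line of arithmetic; the sketch below (with an illustrative k) makes the unbounded growth of the time constant concrete.

```python
import numpy as np

# Mapping each discount factor to its time constant k/sigma = -k / log(gamma).
# As gamma approaches 1 the implied planning horizon grows without bound.
k = 4
for gamma in (0.1, 0.5, 0.9, 0.99, 0.999):
    print(f"gamma = {gamma:6}:  k/sigma = {-k / np.log(gamma):10.1f}")
```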
The upper right plot in Figure 4 replots the data from the upper left panel, but as a function of k/σ rather than γ (the axes are also rescaled). Note that each of these curves starts at zero with a very shallow slope. Moving to the right, they shift to a regime with a non-zero slope. The second derivative of each curve is thus very small at both the left and the right of the plot; in between, the curve changes rapidly from the shallow slope to the higher slope. The inflection point where this happens depends on how far in the future the goal state is predicted. Because each unit in f̂, indexed by τ*, computes derivatives with respect to σ in the neighborhood of σ = k/τ*, these units are maximally activated when the goal state is expected τ* units in the future. With many values of τ*, the pattern of f̂ over different τ* values provides a veridical but coarse-grained estimate of when the goal state will be observed in the future, subject to the policy that generated the multi-scale SR.
Predictive maps and distance to goal firing patterns
Multi-scale successor representations can be thought of as a set of predictive maps, each with a different scale or predictive horizon. The analytic expression above (Eq. 12) can be readily calculated for simple environments for which the multi-scale SR has a simple closed-form solution. More generally, operating on the multi-scale SR with L⁻¹_k constructs an estimate of expected future outcomes as a function of distance across expected paths given a particular policy. Figure 5 illustrates the multi-scale SR and the estimate of the future timeline for a gridworld environment with a barrier.
To construct the SR, one first starts with the transition matrix T, which is non-zero for each transition between adjacent points that is not prevented by a barrier. This can be a policy-independent or policy-dependent transition matrix. As an agent navigates the environment, SRs should be estimated based on the agent's experience and in a policy-dependent manner, where successor states depend on the agent's action policies. For the present purposes, however, we will use the closed form that obtains for a random-walk version of the SR. For a random policy, it is possible to compute the SR directly from the one-step transition matrix using the equation below; this is equivalent to computing the SR while the agent navigates the environment under a random policy, if rewards were equally distributed across the environment. Following previous work (e.g., Dayan, 1993; Momennejad, Russek, et al., 2017; Stachenfeld et al., 2017), here we computed successor representations at multiple scales from the one-step transition probability matrix T according to

M(γ) = (I − γT)⁻¹
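For concreteness, the sketch below builds such an ensemble from this closed form on a small assumed gridworld with a wall (the layout and γ values are our own illustrative choices): the SR mass assigned to states on the far side of the wall becomes appreciable only at large scales.

```python
import numpy as np

# Random-walk SR ensemble, M(gamma) = (I - gamma * T)^(-1), on a 7x7
# gridworld with a wall segment. Layout and scales are illustrative.
size = 7
blocked = {(3, 1), (3, 2), (3, 3), (3, 4)}          # the wall
states = [(r, c) for r in range(size) for c in range(size)
          if (r, c) not in blocked]
idx = {s: i for i, s in enumerate(states)}

T = np.zeros((len(states), len(states)))
for (r, c), i in idx.items():
    nbrs = [idx[(r + dr, c + dc)]
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if (r + dr, c + dc) in idx]
    for j in nbrs:
        T[i, j] = 1.0 / len(nbrs)                   # uniform random walk

start, far_side = idx[(1, 2)], idx[(5, 2)]          # separated by the wall
for gamma in (0.5, 0.9, 0.99):
    M = np.linalg.inv(np.eye(len(states)) - gamma * T)
    print(f"gamma = {gamma}: SR mass across the wall = {M[start, far_side]:.4f}")
```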
Figure 5 displays a multi-scale ensemble of SRs with different γs, as well as the corresponding derivatives (inverses of the transform for various values of σ), for a gridworld with a π-shaped barrier in the middle and reward at position (2, 2). The plots on the left show an ensemble of predictive maps, i.e., M(γ), predicting the successors of a state in the upper left of the gridworld with different values of γ. As γ increases, the gradient across states becomes more shallow; at large horizons one can make out the π-shaped barrier. We applied L⁻¹_k, computed with a discrete approximation to the derivative operator (Shankar & Howard, 2013), to these predictive maps and obtained the panels on the right of Figure 5. States of the same shade in these panels are states at the same expected distance from the goal state, subject to the policy used to generate the M(γ). This procedure naturally takes barriers into account: note that one can read off that states close to the goal but on the other side of the barrier are the same distance, in terms of expected paths, as states in the lower right corner of the environment.
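To close the loop, the following sketch runs the full pipeline on the simplest possible case, a deterministic corridor, where the future-trajectory function from the start to a state j is a delta at lag j. It builds the ensemble from the closed form, differentiates numerically across σ, and reads off the distance; the σ grid and k are illustrative, and for finite k the peak of the estimate sits near (k/(k+1))·j, sharpening as k grows.

```python
import math
import numpy as np

# End-to-end: ensemble of SRs -> numerical inverse Laplace transform ->
# estimated distance, on a deterministic corridor (state i leads to i+1).
n, j, k = 40, 12, 6                                  # j: true distance to goal
T = np.zeros((n, n))
T[np.arange(n - 1), np.arange(1, n)] = 1.0           # corridor policy
T[-1, -1] = 1.0                                      # absorbing end state

sigma = np.linspace(0.05, 1.5, 3000)                 # ensemble of scales
F = np.array([np.linalg.inv(np.eye(n) - np.exp(-s) * T)[0, j]
              for s in sigma])                       # equals exp(-sigma * j) here

deriv = F.copy()
for _ in range(k):                                   # k-th derivative in sigma
    deriv = np.gradient(deriv, sigma)
f_hat = (-1) ** k / math.factorial(k) * sigma ** (k + 1) * deriv
tau_star = k / sigma                                 # horizon of each unit

print("true distance:", j)
print("peak of estimate: %.1f" % tau_star[np.argmax(f_hat)])  # ~ k/(k+1) * j
```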
There is some evidence that these computational properties are reflected in the firing patterns of cells in the hippocampal formation and medial temporal lobe in animal models. As shown above, computing the inverse (or the derivative) of the multi-scale SR ensemble yields firing patterns at specific distances to any given destination state. This predicts sequential activation of different units/cells as a function of distance to the goal (this ‘goal’ could be a goal object, a reward state, a frequently visited location or destination, a specific boundary, a subgoal, a remembered location, etc.). Whereas the successor representation itself gives monotonically decreasing gradients into the future (like border cells), the inverse of a multi-scale SR ensemble predicts cells that are sequentially activated as a function of distance to goal (these can be thought of as distance-to-goal cells).
These theoretical predictions offer a qualitative fit (though we did not compute a quantitative fit) to three recent empirical findings in the bat, rodents, and humans. These studies have discovered cells that display sequential firing as a function of distance to a goal state. First, Sarel and colleagues (Sarel, Finkelstein, Las, & Ulanovsky, 2017b) investigated the representation of spatial goals in the hippocampus of bats (Figure 6, right). They reported that CA1 neurons of flying bats displayed angular tuning to the goal direction, many of which were tuned to the goal even when it was occluded. Importantly, some of these cells encoded distance to goal as well as goal direction. The distance to goal aspect is in line with earlier reports of goal-approach cells in rodents (Eichenbaum, Kuperstein, Fagan, & Nagode, 1987). These firing patterns are consistent with the idea of a vectorial representation of goals in the hippocampus. Our simulation of distance to goal firing based on our theoretical proposal in Figure 6 resembles the distance to goal cells in the bats.
The second piece of evidence consistent with distance-to-goal firing comes from Gauthier and Tank (2018), who report mouse CA1 cells that tune to specific distances from reward locations (which they refer to as reward cells). The study reports that these reward cells were tuned to similar distances to reward states even when the animal was moved to another environment. This is what one would expect from this conception of a future timeline. Because the same outcome is predicted at the same distance in the two environments, this should result in the same cells being activated to code for the same future outcome.
The third piece of evidence comes from human electrophysiology (Qasim et al., 2018). Qasim and colleagues report cells in the human entorhinal cortex (EC) that fire at certain distances as people approach remembered locations during cued object-location recall. In this case, the goal location is not visible in the environment, and is instead an internally generated goal related to the memory retrieval cue.
In short, consistent with our predicted distance-to-goal firing, recent empirical evidence suggests cells that fire at specific distances to goal in the hippocampus and entorhinal cortex of bats (Sarel et al., 2017b), rodents (Aronov, Nevers, & Tank, 2017; Gauthier & Tank, 2018), and humans (Qasim et al., 2018). As indicated in Figure 6, the inverse (or derivative with respect to σ) of a multi-scale SR ensemble simulates the reported distance-to-goal effect. One difference between the bat goal-cell findings and our distance-to-goal units is that, in the bat results, more cells fire in close proximity to the goal state (see Figure 6, right). Furthermore, in the model the units that fire at closer distances to the goal state show less uncertainty. This could be because our model overestimates certainty about what state the agent is in, which could be addressed by including a certainty term. Taken together, this correspondence suggests that vectorial representations could emerge from static cached maps (Johnson & Redish, 2007; Pfeiffer & Foster, 2013).
Our multi-scale successor representation framework could thus offer a unified principle that supports the map-like representations elicited by place cells, asymmetric firing skewed toward the goal location (Stachenfeld et al., 2017), and the vectorial representation of direction and distance to goal (Kubie & Fenton, 2009; Bush, Barry, Manson, & Burgess, 2015). The predictions of this proposal are in line with the observation of distance-to-goal cells (Fig. 6).
Discussion
The successor representation (SR) offers a principle for abstract organization in reinforcement learning (Dayan, 1993; Momennejad, Russek, et al., 2017; Russek et al., 2017), computational accounts of episodic memory and temporal context (Gershman et al., 2012), and predictive representations in place cells and grid cells in the hippocampus and medial entorhinal cortex (Stachenfeld et al., 2017). The SR also offers a solution to planning at large temporal horizons and optimal sub-goal discovery (Botvinick & Weinstein, 2014). Combined with offline replay, a model known as SR-Dyna is superior to classic model-free and model-based reinforcement learning mechanisms in explaining human behavior (Momennejad, Russek, et al., 2017), outperforming hybrid MF-MB models and varieties of earlier Dyna models across other problems as well (Russek et al., 2017). Neurally, the SR can also explain asymmetric firing toward the goal state in hippocampal place cells (Stachenfeld et al., 2017). However, on its own, a single row of the SR discards fine-grained information about the order of and distance from a starting state to its expected future states. This information is important to animals in many real-world decision-making problems, and calls for an adequate account in any computational proposal.
Here we have shown that a multi-scale ensemble of successor representations can overcome these limitations: the derivative of the SR ensemble can estimate the expected sequence of future states following a starting state, recovering both sequential order and distance between states. Importantly, this derivative marks changes in the relationship between two given states across the timescales of abstraction. How can the brain compute this derivative in a neurally plausible manner? We have shown that a multi-scale SR ensemble is equivalent to the real Laplace transform of a given state's timeline of successor states. The inverse of this Laplace transform computes the derivative of the SR ensemble, recovering which future states lie within given temporal horizons of a given state (e.g., the present state, or the goal state). Importantly, the Laplace formulation and its inverse are closely related to an established, neurally plausible proposal for a scale-invariant representation of the past in the medial temporal lobe (Howard et al., 2014). In short, we mathematically show the neural plausibility of the idea that the brain may store multi-scale SRs and compute their derivative, leading to both predictive map representations and neural firing akin to vectorial representations.
Notably, the inverse of the Laplace transform (i.e., the derivative of the multi-scale SR) predicts sequential firing of “distance-to-goal” neurons when the agent is in certain temporal neighborhoods of (or distances to) the goal state (Figures 3 and 6). We show that this analytic model prediction resembles recent findings in bats, rodents, and humans (Sarel et al., 2017b; Gauthier & Tank, 2018; Qasim et al., 2018) that are consistent with the idea of vectorial navigation. The rodent study calls these cells reward cells, the bat findings have been taken as evidence for goal-vector cells, and the human study refers to them as trace cells. What these results and our proposal have in common are units/cells that each fire at a specific predictive distance to a goal state, such as reward in the rodent study, resting locations (or destination states) in the bat study, or remembered goal locations in the human study. What our proposal further provides is that, computationally, all these varieties of vectorial representation supervene on underlying predictive maps. Thus, our proposal combines vector-based and map-based navigation in a neurally plausible fashion.
Neural plausibility: Evidence from place cells, grid cells, and time cells
We have suggested that the derivative of a multi-scale SR ensemble offers a linear solution for computing the distance and sequential order of a given state's future trajectory. But is the multi-scale SR neurally plausible? The proposal that the SR can be stored at multiple scales along the long axis of the hippocampus has been suggested elsewhere as well (Stachenfeld et al., 2017) and is in line with the following observations in the rodent literature. First, it has been shown that hippocampal place cells, i.e., neurons whose activity corresponds to the animal's current location, fire over a larger spatial radius in more ventral (anterior in humans) hippocampal neurons compared to more dorsal (posterior in humans) hippocampal neurons. In other words, the size of place fields varies along the long axis of the hippocampus, with topographically more anterior hippocampal regions representing larger place fields (Kjelstrup et al., 2008b). Furthermore, grid cells are topographically organized in the medial temporal lobe (MTL) as well, and the scale of the grids has been shown to increase with distance from the dorsal border of the medial entorhinal cortex (Brun et al., 2008).
It has also been shown that so-called place cells also respond to non-spatial relations, e.g., auditory and social state-spaces (Aronov et al., 2017; Omer, Maimon, Las, & Ulanovsky, 2018). This suggests a potentially non-spatial underlying computational role for the place and grid systems that is not restricted to spatial navigation but can support the organization of non-spatial relational knowledge as well (Garvert et al., 2017). Based on these findings, it is possible that the brain learns an ensemble of successor representations along the long axis of the hippocampus, each with a different discount parameter corresponding to a different scale of abstraction, providing information about progressively larger horizons along the long axis of the hippocampus (see Figure 1). Notably, topographically organized place fields have also been observed in cortical regions such as the retrosplenial cortex, but it has been shown that these fields develop slowly over time and are hippocampally dependent: they are attenuated with hippocampal damage (Mao et al., 2018). Similarly, while multi-scale SRs can also be encoded or consolidated in the neocortex, it is likely that the hippocampus plays a crucial role in their formation.
In addition to encoding spatial representations, MTL neurons also carry explicit information about the time at which events were experienced in the past. Neurons referred to as time cells have been observed in many brain regions (Pastalkova, Itskov, Amarasingham, & Buzsaki, 2008; Jin, Fujii, & Graybiel, 2009; MacDonald, Lepage, Eden, & Eichenbaum, 2011; Kraus et al., 2015; Mello, Soares, & Paton, 2015; Akhlaghpour et al., 2016; Tiganj, Kim, Jung, & Howard, 2017; Tiganj, Cromer, Roy, Miller, & Howard, 2018). These neurons appear to show receptive fields in time for past events; as a stimulus recedes into the past, different neurons become sequentially activated, presumably as the stimulus sequentially enters their receptive fields. Each neuron has a temporally circumscribed “time field,” enabling the animal to learn associations between stimuli separated in time via a simple association. Looking across neurons, it is possible to directly read off an estimate of the time at which different events were experienced. Time cells may rely on representations of events that decay at different rates, allowing an estimate of past events that extends backward from the present moment. If one could construct an analogous estimate of future events that extends forward from the present, this would provide explicit sequential information about future events. Such an estimate would resemble distance-to-goal cells, which fire at particular distances to a goal location (Eichenbaum et al., 1987; Sarel, Finkelstein, Las, & Ulanovsky, 2017c; Gauthier & Tank, 2018; Qasim et al., 2018). As shown above, the derivatives of multi-scale successor representations may offer a theoretical account for the emergence of such firing from predictive maps.
While sufficient neurophysiological data are not available to strongly constrain our account, progress in understanding the neurobiology of the Laplace transform for memory of the past may provide some insight. Shankar and Howard (2012, 2013) argued that the brain could construct a memory as a function of how far in the past events were experienced via the Laplace transform. Analogous to the present paper, a set of cells code the Laplace transform of the past. These cells project via L⁻¹_k to another set of cells that approximate functions of past time.
Neurons coding for the time of past events would behave like the so-called “time cells” that have been observed in the medial temporal lobe (Pastalkova et al., 2008; MacDonald et al., 2011). Since the initial reports, time cells have been observed in a wide range of regions, tasks, and species, confirming several qualitative predictions of the theoretical account via the inverse Laplace transform (e.g., Tiganj et al., 2017; Tiganj, Cromer, et al., 2018; Mello et al., 2015; Akhlaghpour et al., 2016). Of course, time cells that align with the properties predicted for the inverse Laplace transform of the past could have been constructed via some other mechanism. However, recent preliminary work shows evidence for neurons that decay exponentially with a spectrum of rates in the entorhinal cortex (Tsao et al., 2017; Meister & Buffalo, 2017), just the form predicted for the Laplace transform of time. Notably, “the Laplace transform of the past” could be understood in the context of reinforcement learning as “a set of eligibility traces with different decay rates.”
Given the neural plausibility of the Laplace transform itself, the only other requirement to neurally construct the inverse is to find a biological basis for the operator L⁻¹_k. This operator can be understood as a set of receptive fields conveying information from cells encoding the transform (indexed by σ) to cells estimating the inverse (indexed by τ*). It has long been understood that center-surround receptive fields compute derivatives (e.g., Marr & Hildreth, 1980), much like those required for Eq. 7. The entries in L⁻¹_k can be understood as describing center-surround receptive fields with respect to σ (Liu et al., in press). In the visual system, it has been understood that general computational principles could enable learning of center-surround receptive fields from natural statistics (Bell & Sejnowski, 1997; Olshausen & Field, 1996). These findings suggest that the operator L⁻¹_k used here is neurally plausible.
The distribution of scales and discretization of time
We have attributed time-scales to the parameter γ. It is not obvious whether the distribution of γ scales is approximately continuous or discrete. The mammalian brain provides both discrete and continuous distributions of scales for related quantities. For instance, the distribution of spatial frequencies of grid cells appears to be organized into discrete modules along the entorhinal cortex (Barry, Hayman, Burgess, & Jeffery, 2007; Stachenfeld et al., 2017). These modules consist of many grid cells with the same spatial frequency but different spatial phases, and are organized along the dorso-ventral axis. In contrast, the distribution of time cell receptive fields appears to be continuous even within dorsal CA1 (Mau et al., 2018). That is, within the population of time cells in CA1, the set of time cells continuously maps the delay interval (up to at least a few dozen seconds). On another note, a recent study has shown that temporal information is robustly encoded across time scales (seconds to hours) in lateral entorhinal cortex populations (Tsao et al., 2018), while similar information was not found in either CA3-CA1 or the medial entorhinal cortex. It is, however, computationally possible that a small number of discrete nodes over future time provides sufficient information to solve computational problems related to time-scales. For instance, a deep Q-learning network performing a video game task was able to learn efficiently with just a handful of logarithmically-sampled points of a temporal history (Spears, Jacques, Howard, & Sederberg, 2017). Future work is required to shed light on the distribution of scales and the discretization of time.
Relationship to prior modeling work
We have proposed that the brain maintains successor representations at a spectrum of scales, or γs. Previous modeling work has made spectral assumptions about learning rates and eligibility traces. Kurth-Nelson and Redish (2009) used a set of “micro-agents,” each with a different learning rate, and showed that the set of micro-agents, with a distribution of learning rates across agents, was able to simulate behavior akin to hyperbolic discounting. Ludvig, Sutton, and Kehoe (2008) proposed a model in which the decaying trace of a stimulus is approximated by a series of basis functions with receptive fields spread across trace heights. The time course of the trace results in “microstimuli” that get shorter and wider with time. Future modeling work can better illuminate the relationship between spectral proposals about learning rates, eligibility traces, and the scale parameter of the SR.
Other recent models have proposed methods for constructing a compressed estimate of future events by exploiting properties of the Laplace transform. Tiganj, Gershman, Sederberg, and Howard (2018) constructed an estimate of future events as a function of time using a simple associative account in which a compressed estimate of the past—analogous to a set of sequentially-activated hippocampal time cells—was associated to the present. The associative operator learns simple Hebbian associations between the past at each lag and the present. In order to estimate events at a given lag in the future, the present stimulus is used to probe the appropriate association. Tiganj, Gershman, et al. (2018) described a number of computational properties of this method, including scale-invariance. Shankar, Singh, and Howard (2016) exploited the properties of the Laplace domain to translate the current estimate of the past into an estimate of the future. That model made a detailed mapping between function translation and theta phase precession. Although the endpoint of these models—a scale-invariant estimate of future events that makes order and distance explicit—is quite similar to the endpoint of the method developed in this paper, the mechanisms used to generate them are quite different. The multi-scale SR described here can use efficient temporal-difference learning algorithms to construct the SRs. In environments that are well-approximated by a Markov process, this property can represent a considerable computational advantage over associative models.
Relationship between map-based and vector-based navigation
An interesting implication of our account is that it supports properties of both map-based and vector-based navigation. Map-based navigation relies on computations over an underlying cognitive-map-like representation of states, which in our proposal takes the form of multi-scale predictive representations. Vector-based navigation, on the other hand, enables rapid planning of direct trajectories to goals via estimates of distance and direction to the goal state (Kubie & Fenton, 2009; Bush et al., 2015). Previous work has contrasted map-based navigation with vector-based navigation for finding optimal routes to goals in path-integration problems (Kubie & Fenton, 2009). It has been proposed that vector-based navigation requires computing a “shortcut matrix” in memory: a set of shortcut vectors computed for pairs of visited locations (Kubie & Fenton, 2009). Kubie and Fenton contrast insect navigation, where heading vectors are computed by path integration, with mammalian navigation, where hippocampal place cells are typically considered to govern map-based navigation. However, when considering map-based navigation, most studies focus on representations of one-step relationships between visited locations.
Indeed, if mammalian path integration were merely computed online using a map of one-step transitions, one would not expect firing of distance-to-goal cells, like the goal-vector cells shown in bats (Sarel et al., 2017b), the reward cells shown in rodents (Gauthier & Tank, 2018), or the trace cells in humans representing distances to remembered locations (Qasim et al., 2018). However, recasting the idea of cognitive maps in terms of predictive representations of multi-step relationships can offer a different view of the relationship between map-based and vector-based navigation. The correspondence between our simulated distance-to-goal firing and the finding of vectorial representations in the bat suggests that a linear function over multi-scale predictive maps is, at least partly, consistent with a vector-based representation of goals (Kubie & Fenton, 2009; Bush et al., 2015). Intuitively, this makes sense in light of previous computational proposals for vectorial representations.
It has been proposed that in order to accomplish vector-based navigation the brain computes and stores a “shortcut matrix” using path integration (Kubie & Fenton, 2009), leading to egocentric and immediate firing of direction and distance to goal without further online computation. While a shortcut matrix does not readily arise from traditional one-step cognitive maps in the tradition of Tolman (Tolman, 1948), the sorts of representations computed for a shortcut matrix are already learned in the multi-step dependencies stored in the successor representation (Dayan, 1993; Momennejad, Russek, et al., 2017). First, consistent with the idea of a shortcut matrix, relationships between states that are multiple steps apart are cached; and second, as we show here, predictions consistent with the firing of cells at specific distances to goal states can be derived from the multi-scale SR account. Our proposal could potentially extend the shortcut-vector view to multi-scale or hierarchical shortcut vectors, where rows of SRs at different scales correspond to shortcut vectors at hierarchically different levels of abstraction, well-suited to complex navigation. Future work is required to systematically compare the mathematical relationship between the two accounts, as well as how they compare in simulating empirical data.
Our proposal is also consistent with recent work by Banino, Barry, and colleagues, who showed that training a recurrent network to perform path integration leads to the emergence of grid-like representations (Banino et al., 2018). Importantly, deep reinforcement learning agents using such grid-like representations and vectorial navigation outperformed other models (and humans) in goal-directed navigation in unfamiliar environments, and exhibited mammal-like discovery of shortcuts. We showed that the distance-to-goal aspect of vectorial representations emerges readily from the derivative of multi-scale SR, through the activation of cells tuned to different horizons around the destination state (different σ values). The dependence on head direction can perhaps be captured in the policy-dependent aspect of successor representations, as well as in their relationship to the grid system and basis-set/eigenvector accounts (Gustafson & Daw, 2011; Stachenfeld et al., 2017). However, although every SR matrix is computed under a specific policy toward the goal state, here we have assumed SRs with a single policy for simplicity. The direction-to-goal aspect of the vector-based representation is thus less readily apparent in the present version of multi-scale SR. Future accounts can expand this model to systematically simulate policy-dependent SRs and their relationship to head-direction cells during navigation.
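To illustrate how a derivative across scales can recover distance to goal, here is a small, self-contained sketch in the spirit of the linear inverse-Laplace operator of Shankar and Howard (2013). It samples the SR entry from a start state to a terminal goal over a grid of scales (γ = e⁻ˢ), approximates the k-th derivative linearly with finite differences, and applies Post's inversion formula; the deterministic eight-state track, the grid of s values, and k = 8 are all illustrative assumptions.

```python
import math
import numpy as np

def sr_entry(T, gamma, start, goal):
    """M_gamma[start, goal] = sum_t gamma^t P(visit goal at step t | start)."""
    M = np.linalg.inv(np.eye(T.shape[0]) - gamma * T)
    return M[start, goal]

# Deterministic 8-state track; the goal (state 7) is terminal (zero row),
# so it is visited exactly once, 7 steps from the start.
n, start, goal = 8, 0, 7
T = np.zeros((n, n))
for s in range(n - 1):
    T[s, s + 1] = 1.0  # the goal row stays zero: the episode ends there

# Sample the SR ensemble over a dense grid of Laplace rates s (gamma = e^-s).
s_grid = np.linspace(0.05, 3.0, 400)
F = np.array([sr_entry(T, np.exp(-s), start, goal) for s in s_grid])

# Post's inversion: f(t) ~ ((-1)^k / k!) * s^(k+1) * F^(k)(s), read out at
# t = k / s. The k-th derivative is a linear operation (repeated finite
# differences), consistent with a biologically plausible linear readout.
k = 8
ds = s_grid[1] - s_grid[0]
Fk = np.diff(F, n=k) / ds**k                 # k-th finite difference of F
s_mid = s_grid[k // 2 : k // 2 + len(Fk)]    # midpoints of each difference
f_est = ((-1) ** k / math.factorial(k)) * s_mid ** (k + 1) * Fk
t_est = k / s_mid                            # each rate s maps to a delay t

# Peaks near the true distance of 7 (about 6.2 at k = 8; the finite-k bias
# of k/(k+1) vanishes and the peak sharpens as k grows).
print("estimated distance to goal:", t_est[np.argmax(f_est)])
```

In this toy case the SR entry is exactly e⁻⁷ˢ, so the reconstruction can be checked analytically: the readout peaks near the true seven-step distance, and the blur of the peak narrows as k increases, mirroring the scale-dependent precision of the reconstructed timeline.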
Future directions
Policy dependence
One of the main features of the SR is that it is policy dependent. That is, SR matrices are computed under the particular action policies the agent follows, rather than from an allocentric map of the environment. The SR is therefore not optimal for “full model-based,” off-policy, allocentric, action-outcome contingent analyses. Due to this policy dependence, the SR over-represents parts of the state space that are often visited under its corresponding policy. A rough analogy is walking in snow: paths that are taken more often are carved deeper and more clearly than paths less traveled. This analogy can explain why the direction to often-visited states (or goal states with high value) should readily arise from the neural representation of multi-scale SRs. That said, in principle SRs at different timescales could be formed under different policies as well, corresponding to the statistics of the environment at different timescales. If so, this could allow flexible policy choice depending on environmental statistics and the temporal horizon of the problem at hand. That is, policy dependence could potentially support selecting different policies depending on the timescale or planning horizon.
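As a concrete illustration of policy dependence (a toy example, not one of our simulations), the sketch below computes SRs under two habitual policies at a Y-junction; the three-state environment and the left/right visit probabilities are assumptions for illustration. The same start state yields different predictive rows under each policy, with the often-visited arm “carved deeper,” as in the snow analogy above.

```python
import numpy as np

def sr(T_pi, gamma=0.9):
    """SR under a fixed policy's state-to-state transition matrix T_pi."""
    return np.linalg.inv(np.eye(T_pi.shape[0]) - gamma * T_pi)

# A 3-state Y-junction: from state 0 the agent turns left (state 1) or
# right (state 2), then returns to 0. Two policies differ only in how
# often each arm is taken.
def T_for(p_left):
    T = np.zeros((3, 3))
    T[0, 1], T[0, 2] = p_left, 1.0 - p_left
    T[1, 0] = T[2, 0] = 1.0
    return T

M_left = sr(T_for(0.9))   # a habit of turning left
M_right = sr(T_for(0.1))  # a habit of turning right

# The same start state yields different predictive rows under each policy:
# the often-visited arm is carved deeper, like a well-trodden path in snow.
print("left-biased policy, row 0: ", np.round(M_left[0], 2))
print("right-biased policy, row 0:", np.round(M_right[0], 2))
```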
One possibility is that different policies mark the relevant SR at different scales of abstraction, and that at the moment of decision making, arbitration between representations learned under different policies is required. In the case of navigation, the policy dependence of SRs, and the potentially different policies used for abstraction at different levels, may yield interesting relationships to head-direction cells and grid cells when choosing policies at different scales or horizons for planning future actions. Future studies can be designed to model and test specific predictions of a multi-scale, policy-dependent model.
Expanding the SR ensemble via offline replay
To overcome a limited predictive and planning horizon, we have proposed that the brain simultaneously caches multiple successor representations with different discount rates (Figure 1). However, attentional and learning resources may be limited to specific horizons during experience and online learning. As the number of scales (γs) at which SR matrices are computed increases, so does the resolution of the estimate over future time points. This resolution comes at a high cost: the amount of resources committed to constructing the representation online grows in proportion to the number of scales. This means that, under a fully online strategy, the system must commit the resources necessary to construct the multi-scale representations at the outset of learning. In the absence of a priori expectations about the meta-parameters of the environment, or information about which scale or scales will be relevant in a particular learning problem, the cost of learning representations at multiple scales may be too high.
To overcome this limitation, one possibility is that offline replay enables the brain to learn and cache additional successor representations, with a larger set of discount rates, offline. Depending on the parameters of the task, this could potentially enable better separation of different categories, better integration and inference (Momennejad, Otto, Daw, & Norman, 2017), or clustering and generalization of items from the same context. We have elsewhere proposed SR-Dyna, which combines the successor representation with offline replay (Momennejad, Russek, et al., 2017; Russek et al., 2017). A multi-scale extension of SR-Dyna could use offline replay to cache representations at scales beyond those learned during direct experience. This further proposal for a multi-scale SR-Dyna remains to be computationally fleshed out and empirically tested in future studies.
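One way such offline expansion could work computationally is sketched below: SRs at additional discount rates are learned purely from a replay buffer of previously experienced transitions, using the standard TD(0) update for the SR. The buffer contents, the split between online and replayed scales, and the learning parameters are illustrative assumptions standing in for a mechanism that, as noted above, remains to be fleshed out.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Transitions experienced online (a rightward walk on a 5-state track),
# stored in a replay buffer as (s, s') pairs.
buffer = [(s, s + 1) for s in range(n - 1)] + [(n - 1, n - 1)]

# Scales learned online versus extra scales filled in by offline replay
# (this split is an illustrative assumption).
online_gammas, replayed_gammas = [0.5], [0.8, 0.95]

def replay_sr(gamma, buffer, alpha=0.1, sweeps=2000):
    """Learn an SR at a new scale purely from replayed transitions,
    using the standard TD(0) update for the successor representation."""
    M = np.eye(n)
    for _ in range(sweeps):
        s, s_next = buffer[rng.integers(len(buffer))]
        M[s] += alpha * (np.eye(n)[s] + gamma * M[s_next] - M[s])
    return M

extra_scales = {g: replay_sr(g, buffer) for g in replayed_gammas}
print(np.round(extra_scales[0.95][0], 2))  # long-horizon row, learned offline
```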
Summary
We propose a multi-scale model of predictive representations that overcomes limitations of existing models incorporating the successor representation. Computing the order and sequence of future states following a starting state is non-trivial and costly: it requires nonlinear operations on entire SR matrices, which can be large. Here we show that the derivatives of multiple successor representations, computed at different scales of abstraction, can be used to recover the distance and sequential order of successor states, reconstructing entire expected future trajectories. Our proposal offers a mechanism for constructing an estimated timeline of sequential future events using representations abstracted at multiple timescales. This proposal expands previous work on predictive representations in reinforcement learning (Momennejad, Russek, et al., 2017; Russek et al., 2017; Stachenfeld et al., 2017) and shares its properties with an analogous representation for remembering or reconstructing a timeline of the past (Shankar & Howard, 2013). The model's prediction of distance-to-goal cells is in line with recent findings in the bat, rodent, and human literature (Sarel et al., 2017b; Gauthier & Tank, 2018; Qasim et al., 2018), and may offer promising insight into a unified model of cognitive-map and vector-based representations (Kubie & Fenton, 2009; Banino et al., 2018) used in navigation and planning. This proposal applies to planning in the context of spatial navigation as well as non-spatial cognition. We expect the model to inspire future theoretical and empirical studies in reinforcement learning, navigation, and planning.
Acknowledgments
We thank Sam Gershman for influential early discussions and thoughtful reactions to an earlier draft of the manuscript. We especially thank Kim Stachenfeld for helpful conversations and Per Sederberg, Zoran Tiganj, Salman Qasim, Nathaniel Daw, and Dylan Rich for useful discussions. This work was funded by the John Templeton Foundation (IM), NIBIB R01EB022864, and ONR MURI N00014-16-1-2832 (MWH).
Footnotes
ida.m@columbia.edu, marc777@bu.edu