## Abstract

In computer science, reinforcement learning is a powerful framework with which artificial agents can learn to maximize their performance for any given Markov decision process (MDP). Advances over the last decade, in combination with deep neural networks, have enjoyed performance advantages over humans in many difficult task settings. However, such frameworks perform far less favorably when evaluated in their ability to generalize or transfer representations across different tasks. Existing algorithms that facilitate transfer typically are limited to cases in which the transition function or the optimal policy is portable to new contexts, but achieving “deep transfer” characteristic of human behavior has been elusive. Such transfer typically requires discovery of abstractions that permit analogical reuse of previously learned representations to superficially distinct tasks. Here, we demonstrate that abstractions that minimize error in predictions of reward outcomes generalize across tasks with different transition and reward functions. Such reward-predictive representations compress the state space of a task into a lower dimensional representation by combining states that are equivalent in terms of both the transition and reward functions. Because only state equivalences are considered, the resulting state representation is not tied to the transition and reward functions themselves and thus generalizes across tasks with different reward and transition functions. These results contrast with those using abstractions that myopically maximize reward in any given MDP and motivate further experiments in humans and animals to investigate if neural and cognitive systems involved in state representation perform abstractions that facilitate such equivalence relations.

**Author summary** Humans are capable of transferring abstract knowledge from one task to another. For example, in a country with right-hand traffic, the steering wheel is on the left side of the car and a driver has to use the right arm to operate the shifter. A driver who learned how to drive in such a country can adapt to operating a car with the steering wheel on the right and use the other arm for shifting instead of re-learning how to drive. Although the two tasks require different coordination of motor skills, they are the same in an abstract sense: In both tasks, a car is operated and there is the same progression from 1st to 2nd gear and so on. We study distinct algorithms by which a reinforcement learning agent can discover state representations that encode knowledge about a particular task, and evaluate how well they can generalize. Through a sequence of simulation results, we show that state abstractions that minimize errors in prediction about future reward outcomes generalize across tasks, even those that superficially differ in both the goals (rewards) and the transitions from one state to the next. This work motivates biological studies to determine if distinct circuits are adapted to maximize reward vs. to discover useful state representations.

## 1 Introduction

A central question in reinforcement learning (RL) [1] is which representations facilitate re-use of knowledge across different tasks. Existing deep reinforcement learning algorithms, such as the DQN algorithm [2], construct latent representations to find a reward-maximizing policy in tasks with complex visual inputs. While these representations may be useful for abstracting across states in the service of optimal performance in a specific task, this article considers representations that facilitate re-use across different tasks. Humans are adept at such flexible transfer. As a lay example, consider shifting gears in a manual transmission car. In a country with right-hand traffic, the steering wheel is on the left side of the car and the right arm is used for shifting, whereas the opposite is the case in a country with left-hand traffic. A person who has learned in one scenario can quickly generalize to the other, despite the fact that the two tasks require different coordination of motor skills. Both tasks are the same in an abstract sense: In each case, there is a progression from 1st to 2nd gear and so on, which should be coordinated with the clutch pedal and steering. This structure can be generalized from a left-hand-drive car to a right-hand-drive car [3, 4], so a driver does not have to learn how to drive from scratch.

By treating two different sensory inputs or states as equivalent, an agent can generalize what it has learned from one state to another and speed up learning [4, 5]. Such equivalences can be modeled using *state abstractions* [3, 6], which map states to a compressed latent representation. The usual RL framework considers Markov Decision Processes (MDPs) [7] in which an agent’s sole objective is to maximize reward. In contrast, in transfer or lifelong learning, an agent observes a sequence of MDPs and attempts to learn a state abstraction that can be re-used to speed up learning in a previously unseen task. State abstractions can be constructed in different ways, for example by merging states with the same optimal action or Q-values into the same latent or *abstract state*. This article considers two types of state abstractions:

- *reward-maximizing state abstractions*, which allow an agent to maximize total reward, and
- *reward-predictive state abstractions*, which allow an agent to predict future reward sequences.

While many different RL transfer algorithms have been proposed (see [8] for a survey), this article demonstrates that, while reward-maximizing state abstractions are useful for compressing states within a given task, they fail to generalize across tasks that differ in reward and transition functions. In contrast, reward-predictive state abstractions can be leveraged to improve generalization even when both transition and reward functions change across tasks. The presented analysis and simulations motivate the design of new RL algorithms that can discover such state abstractions as well as further experiments to investigate whether neural mechanisms in biological agents facilitate learning of such representations.

Previous work [9] shows that reward-predictive state abstractions can be extracted from the *successor representation* (SR) [10], which predicts the discounted expected frequency of visiting future states given the current state. While re-using a previously learned SR has been shown to speed up learning when reward functions change [11–13], these methods are only suitable if transition functions are shared (e.g., if one is in the same maze but only the location of the goal changes). Further, if the optimal decision-making strategy differs between two tasks, the SR has to be re-learned [14]. In contrast, this article shows that reward-predictive state abstractions afford “zero-shot” transfer across tasks with variations in transition functions, reward functions, and optimal policies and do not have to be adjusted or re-learned for each task. Such “deep transfer” across environments, even in the absence of prior experience with specific transition or reward functions, is predicted by behavioral and neural signatures of human structure learning [4, 15, 16] but not afforded by alternative algorithms that compress the transition function itself directly [17, 18].

To unpack the relative advantages of distinct state abstraction algorithms for generalization, we proceed as follows. In Section 2, we begin with a simple illustration of the state abstraction framework and then present the conceptual utility of reward-predictive state abstractions in Section 3. Next, we examine this advantage quantitatively when a single abstraction is possible for re-use across a range of task settings and assumptions about the number of latent states (Section 3.1). Subsequently, Section 3.2 considers a curriculum learning situation where multiple state abstractions might apply to different MDPs and the agent has to select amongst them when learning a new MDP. In Section 3.3 we show that this advantage is preserved even when the agent has to simultaneously learn the transitions and rewards of the new MDP and perform inference. Finally, in Section 3.4 we show how this advantage can be leveraged in a guitar playing task, whereby an agent can reapply learned structure about the fret-board while learning a musical scale to quickly learn to play other scales that differ in transitions, rewards, and policy.

## 2 Generalization across states

An MDP [1] is a quintuple $M = \langle \mathcal{S}, \mathcal{A}, p, r, \gamma \rangle$ with a state space $\mathcal{S}$, an action space $\mathcal{A}$, a transition function *p*, a reward function *r*, and a discount factor *γ*. (In this article, we will always refer to the full transition function *p* of an MDP.) The interaction of an agent in a particular task is modeled by a sequence of transitions between different states. Each transition from state *s* to state *s′* is initiated by an action *a* and is rewarded with a scalar number *r*(*s, a, s′*). The probability of reaching state *s′* from state *s* using action *a* is specified by the transition function *p*(*s, a, s′*). How strongly short-term rewards are favored over long-term rewards is controlled by the discount factor *γ* ∈ [0,1).

In *model-free learning*, for example Q-learning [19], the optimal decision-making strategy, called a *policy*, is learned through trial and error interactions in an MDP. Throughout these interactions, a policy is incrementally improved. During learning, only the policy *π* and some form of cached values of the policy *π* are stored at any point in time. In other words, the agent only learns and represents the net predicted reward value of an action in a given state, without needing to represent the specific outcomes of each action in terms of the subsequent states that will be encountered. In *model-based learning* [20, 21] an agent attempts to build a model of the task’s transition and reward function and uses this model to predict sequences of future reward outcomes (*r*_{1}, *r*_{2},…) given a start state *s* and a particular sequence of actions (*a*_{1}, *a*_{2},…). While more computationally intensive, this approach is orthogonal to model-free learning, because using this model an RL agent can predict the value of any arbitrary policy, and it can flexibly adjust its policy if the reward changes. In this case, the agent’s “knowledge” is sufficient to generalize across the space of all possible policies [20].

For example, in an Atari game [22], the number of states in an MDP is large and it may be inefficient to learn and memorize an optimal action for every possible image pixel configuration. In this case, state abstractions [6, 23] provide a framework for simplifying the input space into a lower-dimensional latent space. A state abstraction, also called state representation, is a function *ϕ* mapping the state space to some other latent space. Because state representations are many-to-one relations, they can map different states to the same latent state and create a partitioning of the state space. A state partition is a subset of the state space that maps to the same latent or abstract state. An agent using a state abstraction *ϕ* operates on the space of state partitions and generalizes knowledge learned in one state across the entire state partition. For example, in Q-learning, a value update is applied to the entire state partition even if the update is computed only from one specific state transition, resulting in faster learning if the state abstraction is appropriate [4, 5].
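The way a value update spreads across a state partition can be illustrated with a minimal sketch. It assumes a hypothetical 3 × 3 grid whose states are indexed 0 to 8 in row-major order and an abstraction that merges each column; the function names, the learning rate, and the single transition used here are illustrative choices, not part of the experiments in this article.

```python
from collections import defaultdict

ACTIONS = ("up", "down", "left", "right")


def column_abstraction(state):
    """Hypothetical abstraction: map grid index 0..8 (row-major) to its column."""
    return state % 3


def q_update(q, phi, s, a, r, s_next, learning_rate=0.5, gamma=0.9):
    """One tabular Q-learning update applied to latent states phi(s)."""
    z, z_next = phi(s), phi(s_next)
    target = r + gamma * max(q[(z_next, b)] for b in ACTIONS)
    q[(z, a)] += learning_rate * (target - q[(z, a)])
    return q


# Q-values are stored per latent state, so they are shared within a partition.
q = defaultdict(float)

# A single observed transition: state 1 (middle column) moves right into
# state 2 (right column) and receives a reward of 1.
q_update(q, column_abstraction, s=1, a="right", r=1.0, s_next=2)

# States 1, 4, and 7 lie in the same column and therefore read the same value,
# although only state 1 was ever visited.
shared = [q[(column_abstraction(s), "right")] for s in (1, 4, 7)]
```

Because the table is indexed by *ϕ*(*s*) rather than *s*, one transition updates the estimate for every state in the partition. This speeds up learning when the merged states are in fact equivalent and introduces errors when they are not.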

While several approaches exist for constructing a useful state abstraction *ϕ* for complex MDPs (e.g., Atari games), this article investigates which state abstractions facilitate re-use across different tasks. Specifically, we consider the question of which algorithm should be used to learn a state abstraction *ϕ* from a hypothesis space of all possible state abstractions to maximize the agent’s ability to reuse knowledge in future tasks. A state abstraction is a function *ϕ* mapping states to a smaller latent abstract state space $\mathcal{S}_\phi$. The state-abstraction hypothesis space is then

$$\mathcal{H} = \{\phi : \mathcal{S} \to \mathcal{S}_\phi\} \tag{1}$$

and a representation learning algorithm searches this space to identify a state abstraction *ϕ*. An agent that uses a state abstraction *ϕ* operates directly on the latent space $\mathcal{S}_\phi$ rather than the underlying state space $\mathcal{S}$. Depending on how *ϕ* constructs the latent state space $\mathcal{S}_\phi$, the agent may or may not be able to distinguish between a rewarding and a non-rewarding state.

Fig 1 presents an example of how state representations simplify a task. The column world task (Fig 1A) is a grid-world navigation task where an agent only receives reward by entering the right column. For this task, Fig 1B illustrates a reward-predictive state representation that generalizes across different columns, as indicated by the colouring. In this case, the 3 × 3 grid world is compressed into a smaller 3 × 1 grid world where a reward is given for entering the latent state *ϕ*_{3} (green) and no reward is given for the latent states *ϕ*_{1} (blue) and *ϕ*_{2} (orange). While the compressed version of the grid world does not preserve all information about the task, it is still possible to predict future reward outcomes. For example, the path indicated by the black boxes is mapped to a latent state sequence of (*ϕ*_{1}, *ϕ*_{2}, *ϕ*_{2}, *ϕ*_{3}). This latent state sequence could then be mapped to a reward sequence of (0, 0, 0, 1). In this example, every possible path in the original task is mapped to a path in the compressed task that produces the same reward sequence. Hence, the smaller compressed task can be used to predict future reward outcomes of the original task and the state representation is thus reward predictive.
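The claim that every path in the original task yields the same reward sequence as its image in the compressed task can be checked by brute force. The sketch below is an illustration under stated assumptions rather than the code used for Fig 1: transitions are deterministic, moving into a wall leaves the agent in place, and a reward of 1 is given whenever the next state lies in the right column.

```python
from itertools import product

ACTIONS = ("up", "down", "left", "right")
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def clamp(value):
    """Keep a row or column index inside the 3 x 3 grid."""
    return min(max(value, 0), 2)


def step(state, action):
    """Original task: state is (row, column); reward 1 in the right column."""
    d_row, d_col = MOVES[action]
    nxt = (clamp(state[0] + d_row), clamp(state[1] + d_col))
    return nxt, (1.0 if nxt[1] == 2 else 0.0)


def phi(state):
    """Column abstraction: three latent states, one per column."""
    return state[1]


def latent_step(z, action):
    """Compressed 3 x 1 task that only tracks the column."""
    z_next = clamp(z + MOVES[action][1])
    return z_next, (1.0 if z_next == 2 else 0.0)


def rollout(start, actions, step_fn):
    """Reward sequence obtained by executing an action sequence from start."""
    rewards, current = [], start
    for action in actions:
        current, reward = step_fn(current, action)
        rewards.append(reward)
    return rewards


def is_reward_predictive(max_len=3):
    """Compare reward sequences for all start states and short action sequences."""
    for state in product(range(3), repeat=2):
        for length in range(1, max_len + 1):
            for seq in product(ACTIONS, repeat=length):
                if rollout(state, seq, step) != rollout(phi(state), seq, latent_step):
                    return False
    return True
```

Calling `is_reward_predictive()` compares the two models on every start state and every action sequence up to the chosen length. In this deterministic setting the expected reward sequence reduces to a single rollout, which is why no averaging is needed.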

A reward-predictive state abstraction allows an agent to best predict which expected reward sequence *r*_{1},…, *r*_{t} will be observed after executing a decision sequence *a*_{1},…, *a*_{t} starting at a specific state *s*. If the random variable *R*_{t} describes the reward that is observed after following the action sequence *a*_{1},…, *a*_{t} starting at state *s*, then the expected reward sequence is

$$\left(\mathbb{E}_p[R_1 \mid s, a_1], \dots, \mathbb{E}_p[R_t \mid s, a_1, \dots, a_t]\right). \tag{2}$$

The expectation in Eq (2) is conditioned on the start state *s* and is computed over all possible trajectories in an MDP that follow the action sequence *a*_{1},…, *a*_{t}. A reward-predictive state abstraction needs to satisfy for any start state *s* and action sequence *a*_{1},…, *a*_{n} that

$$\mathbb{E}_p[R_n \mid s, a_1, \dots, a_n] = \mathbb{E}_\phi[R_n \mid \phi(s), a_1, \dots, a_n]. \tag{3}$$

The expectation on the right-hand side is conditioned on the latent state *ϕ*(*s*). Consequently the expectation on the right-hand side is computed over all possible trajectories in a latent space, while the expectation on the left-hand side is computed over all possible trajectories in an MDP [9]. While this model is different to learning an explicit approximation of the MDP’s transition and reward function, learning a reward-predictive state abstraction is akin to model-based RL, because both systems are used to predict sequences of future reward outcomes.

Fig 1C presents a reward-maximizing state representation. Because the optimal action is to move right in this specific task, the task can be compressed into a single state in this example. If rewards are only given when an agent enters the right column by moving right, then only the right action is directly rewarded. In this case, the compressed task can still be used to find an optimal policy, because only the move right action is rewarded; the remaining three actions are not rewarded in any case. Nevertheless, the compressed task cannot be used to make accurate predictions of future reward outcomes, because this state representation simplifies the task into only one latent state (the agent does not know which column it is in: it is as if moving right simply produces stochastic rewards). Because this state abstraction allows an agent to recover the optimal policy, this state abstraction is reward maximizing in this example.

In this article, state abstractions are generated in one of two ways:

1. Enumerating all possible state abstractions using Algorithm U [24]. This method is used in Sec 3.1 and 3.2.
2. Learning a state-abstraction function from transition data. This method is used in Sec 3.3.
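To give a sense of the size of the hypothesis space in the first approach, the following sketch enumerates every partition of a small state space. It is not Algorithm U itself; it is a simple recursive enumeration of restricted-growth label strings, and the function name and the `max_latent` parameter are illustrative assumptions.

```python
def enumerate_abstractions(num_states, max_latent=None):
    """Yield every partition of range(num_states) as a tuple of latent labels.

    Labels form a restricted-growth string: state 0 always gets label 0 and
    each later state may reuse an existing label or open the next unused one,
    so every partition is generated exactly once.
    """

    def extend(labels, num_used):
        if len(labels) == num_states:
            yield tuple(labels)
            return
        for label in range(num_used + 1):
            if max_latent is not None and label >= max_latent:
                break  # would create more latent states than allowed
            labels.append(label)
            yield from extend(labels, max(num_used, label + 1))
            labels.pop()

    yield from extend([], 0)


# All abstractions of a nine-state task (the Bell number B_9 = 21147).
all_abstractions = list(enumerate_abstractions(9))

# Abstractions that use exactly three latent states.
three_state = [
    labels
    for labels in enumerate_abstractions(9, max_latent=3)
    if max(labels) == 2
]
```

Each yielded tuple assigns a latent label to every state, so `labels[s]` plays the role of *ϕ*(*s*). For nine states the full space contains 21147 abstractions, of which 3025 use exactly three latent states, which keeps exhaustive evaluation feasible for tasks of this size.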

Reward-predictive state abstractions can be learned using Linear Successor Feature Models (LSFMs) [9]. Successor Features (SFs) [25] are a generalization of the SR [10], and predict the expected visitation frequencies in some latent feature space:

$$\psi^{\pi}(s, a) = \mathbb{E}_{p,\pi}\left[\sum_{t=1}^{\infty} \gamma^{t-1} \phi(s_t) \,\middle|\, s_1 = s, a_1 = a\right] \tag{4}$$

where the expectation is computed over all infinite length trajectories that start in state *s* with action *a* and then follow the policy *π*. The discount factor *γ* ∈ [0,1) is used such that states in the more distant future are weighted to a lesser degree in the summation in Eq (4). Intuitively, SFs incorporate information about which latent state features are observed along a trajectory, including their relative temporal positions, because for each time step *t* a different weight *γ*^{t−1} is associated with the latent state feature vector *ϕ*(*s*_{t}) (Eq (4)). This weighting allows the state abstraction to distinguish between a reward sequence “+1, –1” and “–1, +1”, for example. LSFMs extract this temporal property from SFs and construct a state abstraction *ϕ* that is predictive of the order in which particular latent state features are observed. Critically, the LSFM latent space is constructed so as to most efficiently predict reward sequences without being tied to the specific transitions or rewards, and thus permits a “deeper” form of transfer loosely akin to analogical reasoning.
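The role of the discount weights in encoding temporal order can be seen in a minimal sketch. It assumes a single deterministic trajectory, so the expectation in the SF definition reduces to one discounted sum, and it uses hypothetical one-hot feature vectors; it is not an LSFM implementation.

```python
def discounted_feature_sum(features, gamma):
    """Discounted sum of feature vectors along one trajectory.

    The first visited feature vector has weight 1 and each later one is
    weighted by one further factor of gamma.
    """
    total = [0.0] * len(features[0])
    for step_index, feature in enumerate(features):
        weight = gamma ** step_index
        for i, value in enumerate(feature):
            total[i] += weight * value
    return total


# Two trajectories that visit the same two latent features in opposite order.
first_then_second = discounted_feature_sum([[1, 0], [0, 1]], gamma=0.5)
second_then_first = discounted_feature_sum([[0, 1], [1, 0]], gamma=0.5)
```

The two results differ (`[1.0, 0.5]` versus `[0.5, 1.0]`) even though both trajectories visit the same features, which is the property that lets SFs separate sequences such as “+1, –1” and “–1, +1”.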

If such a state abstraction also associates each feature vector *ϕ*(*s*_{t}) with one-step reward outcomes, then this state abstraction is reward-predictive. Mathematically, this intuition can be generalized to predict reward outcomes for any start state *s* and action sequence *a*_{1},…, *a*_{t}, and indeed, learning LSFMs is equivalent to learning reward-predictive state abstractions [9]. Please refer to supporting text S.3 for a detailed description of how LSFMs are used.

Note that this LSFM approach contrasts with the typical application of SFs in which Q-values are expressed as a dot-product between the SF vectors *ψ*^{π}(*s, a*) and a reward-model vector. While that approach allows an agent to re-use SFs when rewards and the associated reward-model vector change, it does not afford analogical transfer when transitions change. In fact, because SFs depend on the transition function and a particular policy, a transferred SF has to be relearned and adjusted to a specific task. In contrast to SFs, reward-predictive state abstractions are independent of a specific policy and can be used to generalize across all policies that are defined in terms of the latent states. More concretely, a reward-predictive state abstraction can be used to predict the value of any arbitrary abstract policy by first predicting which reward sequence a specific policy generates and then computing the discounted sum over this reward sequence [9, Theorem 4]. Fig 2 presents an intuitive transfer example and plots different SFs for each task. Consequently, an agent would have to adjust a previously learned SF.

An alternative to using LSFMs is Linear Action Models (LAMs), which predict the expected next state instead of SFs. LAMs are very closely tied to LSFMs and can also be used to learn reward-predictive state abstractions [9]. Because we found that LSFMs are easier to use than LAMs in practice, this article focuses on LSFMs. Please refer to S2 for a more detailed description of LSFMs and the connection to LAMs.

## 3 Generalization across tasks

To generalize knowledge across different tasks, a compressed state abstraction is needed that preserves key aspects of the tasks even if the details of transitions or rewards change. Consider the transfer example in Fig 2, where an agent is first presented with Task A, and then transfers a state abstraction to Task B. The key similarity between the tasks is evident in that they both have columnar structure, but transitions, rewards, and the optimal policy can all differ. In this example, the reward-predictive state abstraction (Fig 2B) can be re-used to plan a different policy in Task B, while the reward-maximizing state abstraction (Fig 2C) cannot be re-used in Task B. Of course, such a benefit is only possible if the two tasks share an abstract relation: This columnar state abstraction would not be useful in subsequent MDPs that are arranged in rows. Below, we consider how multiple state abstractions can be learned and where generalization involves an inference process to select which one of them is most applicable [3].

In principle, there always exists one state abstraction that is both reward-maximizing and reward-predictive in a model-based agent: Trivially, if the identity map is used to map nine distinct states into a latent space of nine distinct states, then such a state representation is always reward maximizing and reward predictive. However, such a state representation is not really “abstract” in that it does not inform an agent across which states information can be generalized. But, for the same reason, this representation preserves information that might be needed in other tasks. We will further discuss this trade-off in Section 3.3.

### 3.1 Transfer with single state abstractions

The above example was illustrative for a single MDP designed to show the potential utility of reward-predictive state abstractions. We next systematically assess the generalization potential of reward-maximizing or reward-predictive state abstractions across a range of different tasks. The goal of this experiment is to be algorithm agnostic: Rather than focusing on how a particular algorithm performs at transfer with a single learned state abstraction, we enumerated the entire hypothesis space for all possible partitions of the state space and evaluated them in all transfer tasks. (Out of a set of tasks, one task was randomly chosen for evaluation of a single state abstraction *ϕ*. Subsequently, this state abstraction *ϕ* was evaluated in all other remaining transfer tasks. In all simulations, the evaluation and transfer tasks are distinct.) For each state abstraction in the hypothesis space, we computed a compressed abstract MDP [23] for every tested MDP and solved it using value iteration [1, Chapter 4.4] (please also refer to supporting text S.1). A reward-maximizing state abstraction is then identified by testing the computed policy in a single randomly selected task for *N* trials over *T* time steps and computing the total reward

$$R_{\text{total}} = \sum_{n=1}^{N} \sum_{t=1}^{T} r_{n,t}, \tag{5}$$

where *r*_{n,t} is the reward incurred in trial *n* at time step *t*. A reward-predictive state abstraction is identified by sampling *N* random state-and-action sequence pairs (*s*_{n}, *a*_{n,1},…, *a*_{n,T}) and predicting the reward sequence using the abstract MDP. The reward-sequence prediction error is

$$RS_{\text{error}} = \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} \left| r_{n,t} - \hat{r}_{n,t} \right|, \tag{6}$$

where $\hat{r}_{n,t}$ is the reward that the abstract MDP predicts for step *t* of the *n*-th sampled sequence and *r*_{n,t} is the corresponding reward in the original MDP.

We considered three types of tasks: column-worlds (like those in the motivating example), 100 randomly generated MDPs, and grid worlds (Figs 3A, 3B, and 3C). For each transfer experiment, all possible state abstractions are enumerated and the top 5% scoring state abstractions were re-evaluated on the remaining transfer MDPs and the total rewards generated by these state abstractions are plotted as histograms in Fig 3. In all cases, state abstractions with low reward-sequence prediction errors *RS*_{error} generate a higher total reward at transfer than state abstractions that were selected based on their ability to construct a well performing policy (produce a high *R*_{total} score) on the original MDP. Note that restricting the hypothesis space to abstractions that construct three latent states (second row of histograms in Fig 3) does not change the overall result. This result indicates that reward-predictive state abstractions encode information about an MDP that can be generalized across different MDPs that share the same abstract structure. In Sections 3.2 and 3.3, we will present an extension to environments in which multiple structures are possible and have to be inferred.
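The two selection criteria can be written down compactly. The sketch below assumes that rewards are stored as one list per trial and that the prediction error is averaged over all sampled steps; the function names and the normalization are illustrative assumptions and do not affect how abstractions are ranked.

```python
def total_reward(trials):
    """Sum of all rewards over N trials of T time steps each."""
    return sum(sum(rewards) for rewards in trials)


def reward_sequence_error(observed, predicted):
    """Mean absolute difference between observed and predicted rewards.

    Both arguments hold N reward sequences of length T: `observed` comes
    from the original MDP, `predicted` from the abstract MDP.
    """
    num_steps = sum(len(sequence) for sequence in observed)
    absolute_error = sum(
        abs(r - r_hat)
        for obs_seq, pred_seq in zip(observed, predicted)
        for r, r_hat in zip(obs_seq, pred_seq)
    )
    return absolute_error / num_steps


# Two hypothetical trials of three steps each.
observed = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
predicted = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.5]]

score_total = total_reward(observed)                       # 3.0
score_error = reward_sequence_error(observed, predicted)   # 0.25
```

A reward-maximizing abstraction is one with a high `total_reward` under the policy computed from its abstract MDP, whereas a reward-predictive abstraction is one with a low `reward_sequence_error`, regardless of how much reward the sampled action sequences collect.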

Figs 3D and 3G present the results for the transfer experiment discussed in Fig 2. Both histograms indicate that state abstractions with low reward-sequence prediction errors outperform on average state representations that only maximize total reward in one of the tasks. Because all three MDPs can be compressed into three latent states, constraining the hypothesis space to only contain state abstractions that create three latent states does not impact the total reward generated at transfer time significantly. In this case, both histograms have equal support.

To further control for a potential dependency between the constructed MDPs and a particular state-representation type, the experiment in Fig 3B randomly generates transitions and rewards. This experiment is similar to the previous test case in that all 100 randomly generated MDPs can be compressed with the same state representation. These MDPs are constructed by generating random three-state MDPs and then “inflating” the state space size based on this common but randomly generated state abstraction. Aside from this common “hidden” state representation, these 100 MDPs differ in both transition and reward functions. The histograms confirm the claim that abstractions yielding low reward-sequence prediction errors perform best in generalization across different MDPs (Figs 3E and 3H). In contrast, state representations that result in high total reward in any of the original MDPs generate on average less reward in any of the remaining MDPs. Again, constraining the hypothesis space to only include abstractions that construct three latent states does not change the support of the histogram in Fig 3H, but the shape changes and the median shifts. This shift can be explained by the fact that incorrectly compressing a task and incurring approximation errors can quickly degrade an agent’s ability to perform optimally. If a state abstraction does not maximally compress a task, for example from nine to eight states, then performance may not degrade as quickly.

The above simulation assumed lossless compression (given that the abstraction was selected and then inflated to generate larger state spaces). To test which state abstractions generalize across tasks when no “hidden” state abstraction is embedded in the tasks, we next considered situations in which state spaces could not be compressed without some information loss. Fig 3C presents a transfer experiment where two reward locations and four different wall placements are permuted in a grid world. These changes in reward and wall locations resemble changes in the transition and reward functions. In this experiment, the MDPs cannot be compressed without incurring some loss, because the grid location is important for predicting where the goal locations are and what action is optimal at each location. However, both histograms in Figs 3F and 3I indicate that state abstractions that minimize the reward-sequence prediction error criterion still perform better than those that maximize total reward. By nature, grid worlds have a specific topology of the state space and state representations that cluster only neighbouring states approximately preserve the grid location information and would be expected to perform relatively well across all MDPs. If the hypothesis space is constrained to abstractions that compress nine states into three latent states, then the advantage shrinks. This difference can be explained by the fact that for arbitrary navigation tasks grid worlds should ideally not be compressed (for efficient navigation an agent needs to be aware of its position), and hence neither abstraction yields an optimal policy. However, the histogram in Fig 3F suggests that there exist several state abstractions that compress nine states into six or seven latent states that can still lead to (close to) optimal performance.

Note that the identity map, which does not compress the state space and can always be used to construct an optimal policy, is also included in this histogram and occurs in the bin with highest total reward. The identity map is included exactly once in each histogram that plots the distribution for all abstractions, because this experiment tests each possible state abstraction once. This experiment highlights a trade-off between the ability to obtain an optimal policy in a task and re-use of a particular state abstraction that compresses a task.

### 3.2 Transfer with multiple state abstractions

The previous experiment assumes that all tasks share a common “hidden” state abstraction that can be learned and re-used by an agent. In this section, we consider the situation in which different MDPs might correspond to different abstractions. A non-parametric Bayesian model maintains a belief space of possible state abstractions [3, 4], which it can use for inference. Fig 4 illustrates how the curriculum of tasks is randomly generated. This task curriculum is observed in sequence by the non-parametric Bayesian model and the model is signalled when a switch between tasks occurs. Each task can be compressed in one of two different ways (this approach can be expanded to larger numbers without loss of generality; two is used here for clarity of exposition). Critically, this state abstraction is hidden from the learning agent. After observing an MDP sequence *M*_{1},…, *M*_{t}, the agent updates its belief space using a posterior over which state abstraction is most suitable to solve a given task *M*_{t}:

$$\Pr(\phi \mid M_t, c_t) \propto \Pr(\phi \mid M_t) \Pr(\phi \mid c_t), \tag{7}$$

where *c*_{t}(*ϕ*) is the count of how often an abstraction *ϕ* was used in the previous *t* – 1 tasks. These counts are used to construct a Chinese Restaurant Process (CRP) [26, 27] prior for an intensity *α* ≥ 0:

$$\Pr(\phi \mid c_t) = \begin{cases} \dfrac{c_t(\phi)}{t - 1 + \alpha} & \text{if } \phi \text{ was used in a previous task,} \\[2ex] \dfrac{\alpha}{t - 1 + \alpha} & \text{if } \phi \text{ is a new abstraction.} \end{cases} \tag{8}$$

The posterior is also conditioned on the MDP *M*_{t} through the factor Pr(*ϕ*|*M*_{t}). Using a loss function *l* (we consider both reward-maximizing and reward-predictive losses), each state abstraction *ϕ* can be scored and for *β* ≥ 0, the probability of this state abstraction being suitable to solve *M*_{t} is the soft-max probability

$$\Pr(\phi \mid M_t) \propto \exp\left(-\beta\, l(\phi, M_t)\right). \tag{9}$$

To determine which state abstraction should be added into the abstraction belief set, the non-parametric Bayesian agent has access to the best scoring state abstraction *ϕ*_{next-best} not yet included in this set. The posterior is computed over the state abstractions in the belief set together with *ϕ*_{next-best}. (The goal is not to design an algorithm that can solve a sequence of tasks efficiently, but to analyze which state abstractions generalize across different tasks. Thus, for the moment, we assume that the agent has access to an oracle that knows the transition function of each new MDP and can score the loss for each compression. Using an oracle that tabulates all possible state abstractions gives insight into which state abstractions generalize across different tasks, while being algorithm agnostic; below we relax the need for an oracle.) In contrast to the previously presented simulation, this non-parametric Bayesian agent is constrained to only use state abstractions that compress nine-state MDPs to three-state MDPs. Consequently, the model is forced to generalize across different states and cannot default to only using the identity state abstraction, which does not compress an MDP and is both reward-predictive and reward-maximizing. If *α* increases, the resulting prior and posterior assign a higher probability to adding the next-best state abstraction *ϕ*_{next-best} to the belief set. In this case, the CRP prior influences the posterior more strongly. If *β* increases, then more emphasis is placed on the loss function *l* to determine which state abstraction should be used from the belief set, and the CRP prior is effectively ignored.
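The interplay between the CRP prior and the soft-max factor can be sketched in a few lines. The code below is an illustration, not the implementation used for the simulations: abstractions are represented by hypothetical string keys, the key `"next-best"` stands for *ϕ*_{next-best}, and the losses are assumed to be supplied by the oracle described above. It also assumes that at least one count or *α* is positive.

```python
import math


def crp_prior(counts, alpha):
    """CRP prior over previously used abstractions plus the next-best one."""
    normalizer = sum(counts.values()) + alpha
    prior = {phi: count / normalizer for phi, count in counts.items()}
    prior["next-best"] = alpha / normalizer
    return prior


def posterior(counts, losses, alpha, beta):
    """Posterior proportional to soft-max(-beta * loss) times the CRP prior."""
    prior = crp_prior(counts, alpha)
    scores = {phi: math.exp(-beta * losses[phi]) for phi in prior}
    score_sum = sum(scores.values())
    unnormalized = {phi: scores[phi] / score_sum * prior[phi] for phi in prior}
    normalizer = sum(unnormalized.values())
    return {phi: value / normalizer for phi, value in unnormalized.items()}


# Hypothetical belief set: abstraction "A" was used three times, "B" once.
counts = {"A": 3, "B": 1}
losses = {"A": 0.5, "B": 0.0, "next-best": 0.2}

prior_only = posterior(counts, losses, alpha=1.0, beta=0.0)
loss_driven = posterior(counts, losses, alpha=1.0, beta=100.0)
```

With *β* = 0 the soft-max factor is uniform and the posterior equals the CRP prior, so the frequently used abstraction `"A"` dominates. With a large *β* almost all probability mass moves to `"B"`, the abstraction with zero loss, which mirrors the statement that the prior is effectively ignored.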

Rather than using the empirical scores *R*_{total} or *RS*_{error}, the agent is allowed to observe a tabulation of all possible transitions and rewards to obtain a ground truth score for each abstraction. In these experiments, reward-maximizing state abstractions are identified by assessing how much using a state abstraction impacts the value of the policy *π*_{ϕ} relative to that of the optimal policy in the abstract MDP:

$$l_{\text{maximizing}}(\phi) = \max_{s} \left| V^{\pi^*}(s) - V^{\pi_\phi}(s) \right|, \tag{10}$$

where *V*^{π*} is the optimal discounted value function [1], *π*^{*} is the optimal policy, and *V*^{πϕ} is the discounted value function of the policy *π*_{ϕ} evaluated in the task itself. Reward-predictive state abstractions are scored by the loss function *l*_{predictive} bounding the reward-sequence prediction error

$$\left| \mathbb{E}_p[R_t \mid s, a_1, \dots, a_t] - \mathbb{E}_\phi[R_t \mid \phi(s), a_1, \dots, a_t] \right| \le C_{\gamma,t}\, l_{\text{predictive}}, \tag{11}$$

where *C*_{γ,t} is a constant that depends on the action-sequence length *t* and discount factor *γ*. The loss function *l*_{predictive} is computed using the SF model [9]. Supporting text S.2 presents all details on how to compute *l*_{predictive}. If either of the two loss functions evaluates to zero for a state abstraction *ϕ*, then *ϕ* is a globally optimal reward-maximizing or reward-predictive state abstraction, respectively. For reward-predictive state abstractions, this property holds because each tested task in Fig 1 can be compressed (by construction) and the LSFM Bisimulation Theorem [9, Theorem 2] applies. In this case, if *l*_{predictive} = 0, then the state abstraction *ϕ* can be used to predict reward-sequences accurately. (Alternatively, one could also use *RS*_{error} as defined in Eq. (6).)

Fig 5 plots the results from testing the agent with each loss function for various *α* and *β* settings. The agent selects its policy by using the posterior to mix the policies that would be optimal in the respective abstract MDPs (as described in Sec. 3.1, policies are computed using value iteration on the abstract MDP). Setting *β* = ∞ means that the probability Pr(*ϕ*|*M*_{t}) is deterministic: The highest scoring state abstraction is assigned a probability of one and all other state abstractions are assigned a probability of zero. This case is equivalent to only using the loss function to select a state abstraction while ignoring the CRP prior, because the factor Pr(*ϕ*|*M*_{t}) is either zero or one in Eq (7). For low *β* settings, the prior is used to determine which state abstraction is used. If *α* is high, then up to 20 state abstractions are added into the belief set. Because the prior then dominates the posterior, the agent is poorly informed about which state abstraction should be used on a given task, and its total reward is comparably low. For *β* = ∞, the loss function influences the posterior strongly.
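The interplay between *α* and *β* described above can be illustrated with a minimal sketch. It assumes that the posterior is proportional to a standard CRP prior multiplied by a factor exp(−*β* · loss); this functional form, along with the function names `crp_prior` and `abstraction_posterior`, is an assumption made for illustration and is not taken from Eq (7).

```python
import numpy as np

def crp_prior(counts, alpha):
    """CRP prior over [existing abstractions..., next-best candidate].

    Existing abstractions are weighted by their usage counts and the
    candidate by alpha (requires alpha > 0).
    """
    counts = np.asarray(counts, dtype=float)
    return np.append(counts, alpha) / (counts.sum() + alpha)

def abstraction_posterior(counts, losses, alpha, beta):
    """Posterior over abstractions (assumed form: exp(-beta * loss) * prior).

    `losses` holds one loss per existing abstraction plus a final entry
    for the next-best candidate.
    """
    losses = np.asarray(losses, dtype=float)
    prior = crp_prior(counts, alpha)
    if np.isinf(beta):
        # Deterministic limit: only minimum-loss abstractions keep mass.
        likelihood = (losses == losses.min()).astype(float)
    else:
        # Subtracting the minimum loss avoids underflow for large beta.
        likelihood = np.exp(-beta * (losses - losses.min()))
    unnormalized = likelihood * prior
    return unnormalized / unnormalized.sum()
```

With *β* = 0 the loss is ignored and the posterior equals the CRP prior; with *β* = ∞ all probability mass falls on the lowest-loss abstraction, which mirrors the two regimes discussed in the text.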

The key difference between the two loss functions becomes apparent when analyzing how the agent maintains the belief space. Using the loss function *l*_{predictive}, which identifies reward-predictive state abstractions, the agent identifies the correct ground truth state abstractions that were used to generate the task sequences. Fig 5F shows that the agent correctly learns that one state abstraction occurs with a frequency of 75%. Because the agent only maintains two belief abstractions, the agent correctly estimates that the other abstraction occurs with a frequency of 25%.

In contrast, when the loss function *l*_{maximizing} is used, Figs 5A and 5B demonstrate that the agent can only achieve optimal reward by isolating a significantly higher number of state abstractions than the reward-predictive model. At best, using *l*_{maximizing} and a small *α* value, the agent is capable of isolating between four and five state abstractions. For high *α* settings, the agent effectively memorizes a solution for almost every task by increasing the size of its belief set, because a previously used state abstraction does not generalize to the next task. The model is thus able to achieve optimal reward only if it constructs a new reward-maximizing abstraction for each MDP. Note that this experiment does not account for any cost associated with learning or constructing a state abstraction for each task from scratch. In Section 3.3, this assumption is removed and the presented results illustrate how constructing a reward-maximizing state abstraction results in slower learning. When using the loss function *l*_{predictive}, the agent can correctly identify which state abstraction to use for which MDP and obtain an optimal reward level while only using two different state abstractions (green curves in Figs 5D and 5E). This confirms the claim that reward-predictive state abstractions generalize across different tasks.

### 3.3 Learning to transfer multiple state abstractions

While the previous transfer experiment presents evidence that reward-predictive state abstractions generalize across different tasks, for exposition these previously presented simulations assumed that a full tabulation of all transitions and rewards is accessible to the agent to score the loss. In addition, it was possible to configure both reward-predictive and reward-maximizing models such that an optimal reward level is always obtained, given that the agent can always construct a new reward-maximizing abstraction. This section presents an experiment where an intelligent system has to learn through trial-and-error interactions with a novel sequential decision-making task, thereby simultaneously learning the transitions and making inferences about which abstraction is appropriate to reuse. These inferences are made in a Bayesian mixture of experts scheme [28] that is updated after interacting for a certain number of episodes in a particular task. If an intelligent agent is capable of extracting a particular structure from one task to accelerate learning in another task, then this agent will generate more reward in certain tasks than an agent that does not transfer any latent structure.

In the following simulation experiments, an agent is allowed to interact with a task for a certain number of trials, called *episodes*. The interaction data itself is a data set of transition quadruples of the form (*s, a, r, s′*) that describe a transition from state *s* to state *s′* that occurs by selecting action *a* and is rewarded with a scalar reward *r*.

The generalized non-parametric Bayesian model maintains a belief space of state abstractions that is updated after interacting for 200 episodes in a task. (We found that 200 episodes allow each tested algorithm to converge to an optimal policy.) Subsequently, the collected data is used to learn a new state abstraction. A reward-predictive state abstraction is obtained using the LSFM discussed in Sec 2. A reward-maximizing state abstraction is obtained by clustering states with approximately equal Q-values into latent states. Please refer to supporting text S.3 for a detailed description of the algorithmic implementation and how hyper-parameters were tuned. After interacting with a task and learning a new state abstraction, the belief set is updated using the posterior probabilities. During learning in the next task, the state abstractions stored in the belief set are used to generalize Q-values across different states. While the agent observes transition data in a task *M*_{t}, a separate Q-learning agent is maintained for each state abstraction and another for the identity state abstraction *ϕ*_{identity}: *s* ↦ *s*. (The motivation here is that the agent should consider not only the Q-values of actions that pertain to previously seen abstractions but that it should also have potential to learn Q-values in the full observable state space. We consider biological implications of this assumption in the discussion.) Using a state abstraction *ϕ*, a state is mapped to a latent state *ϕ*(*s*) and this latent state *ϕ*(*s*) is given as input to the Q-learning algorithm. (If Q-learning would normally observe a transition (*s, a, r, s′*), the algorithm now observes a transition (*ϕ*(*s*), *a, r, ϕ*(*s′*)). Because Q-learning caches Q-values for latent states and multiple states map to the same latent state, the agent now generalizes Q-values across multiple states and can thus converge faster.) The Q-learning algorithm thus generalizes Q-values to multiple states that map to the same latent state.

As in the Bayesian mixture of experts scheme [3, 28], the agent selects its overall policy by mixing the policies of each Q-learning agent using the posterior probabilities. Specifically, the probability of selecting action *a* at state *s* is a posterior-weighted mixture of the probabilities *π*_{ϕ}, where *π*_{ϕ} are the action-selection probabilities of the Q-learning algorithm corresponding to the state abstraction *ϕ*. For example, if the posterior probabilities place a high weight on a previously learned state abstraction *ϕ*, then the agent will effectively select actions similar to a Q-learning algorithm that is run on the latent state space constructed by the state abstraction *ϕ*. In this case, an optimal policy should be obtained more quickly in comparison to not using any state abstraction, assuming the state abstraction *ϕ* is constructed properly for the given task. If the posterior places a high weight on the identity state abstraction *ϕ*_{identity}, then the model will effectively select actions similarly to the usual Q-learning algorithm.
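The following minimal sketch illustrates how Q-learning on latent states and the posterior-weighted policy mixture could fit together. The class `LatentQLearner`, the function `mixture_policy`, the softmax action selection, and the learning-rate and discount values are all assumptions chosen for illustration; supporting text S.3 describes the actual implementation.

```python
import numpy as np

class LatentQLearner:
    """Tabular Q-learning that only ever sees latent states phi(s)."""

    def __init__(self, phi, num_latent, num_actions, lr=0.1, gamma=0.9):
        self.phi = np.asarray(phi)                  # phi[s] = latent index
        self.q = np.zeros((num_latent, num_actions))
        self.lr, self.gamma = lr, gamma

    def update(self, s, a, r, s_next, done):
        # The transition (s, a, r, s') is observed as (phi(s), a, r, phi(s')).
        z, z_next = self.phi[s], self.phi[s_next]
        target = r if done else r + self.gamma * self.q[z_next].max()
        self.q[z, a] += self.lr * (target - self.q[z, a])

    def action_probs(self, s, temperature=1.0):
        # Softmax over Q-values (an assumed action-selection rule).
        prefs = self.q[self.phi[s]] / temperature
        exp_prefs = np.exp(prefs - prefs.max())
        return exp_prefs / exp_prefs.sum()

def mixture_policy(learners, posterior, s):
    """Mix the action probabilities of all learners by posterior weight."""
    probs = sum(w * l.action_probs(s) for w, l in zip(posterior, learners))
    return probs / probs.sum()
```

In this sketch, a single update at one state changes the Q-values of every state that maps to the same latent state, which is the generalization effect described above. An identity learner is obtained by passing `phi = np.arange(num_states)`.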

While training, the non-parametric Bayesian model also uses all observed transitions (*s, a, r, s′*) to construct a transition and reward table. After 200 training episodes on a particular task, these transition and reward tables are used to construct either a reward-maximizing or reward-predictive state abstraction *ϕ*_{next-best}. Then, the posterior probabilities are computed as described in Eq. 7 and a state abstraction is sampled using this posterior distribution. Depending on the parameter settings for *α* and *β*, the newly learned state abstraction *ϕ*_{next-best} may be added into the belief set for the next task, or a previously learned state abstraction is re-used and its count *c*_{t}(*ϕ*) is increased. When training on the first task, the belief set is initialized to the empty set.
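As a rough illustration of the tabulation step, the sketch below estimates transition and reward tables from a list of (*s, a, r, s′*) quadruples by counting. The function name `build_tables` and the array layout (action index first) are assumptions; rows for state-action pairs that were never visited are simply left at zero.

```python
import numpy as np

def build_tables(transitions, num_states, num_actions):
    """Estimate P[a, s, s'] and R[a, s] from (s, a, r, s') quadruples."""
    counts = np.zeros((num_actions, num_states, num_states))
    reward_sum = np.zeros((num_actions, num_states))
    for s, a, r, s_next in transitions:
        counts[a, s, s_next] += 1
        reward_sum[a, s] += r
    visits = counts.sum(axis=2)                 # visits[a, s]
    safe_visits = np.maximum(visits, 1)         # avoid division by zero
    P = counts / safe_visits[:, :, None]        # empirical transition table
    R = reward_sum / safe_visits                # empirical mean reward
    return P, R, visits
```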

This model is tested on the task sequence illustrated in Fig 6. The top row depicts two different maze maps that are used to construct a curriculum of five tasks. Each map is a 10 × 10 grid world where the agent has to navigate from the blue start state to the green goal location. Once the green goal location is entered, the agent receives a reward of +1 and the episode is ended. The transition dynamics are the same in each task with the difference that the agent cannot cross the black barrier. These two mazes are mirror images of one another and the optimal action is different at each grid cell. Consequently, transitions, rewards, and the optimal policy of Maze A and Maze B are different at every state and cannot be immediately transferred from one maze to another.

Using these two mazes, a task sequence is constructed by adding a “light/dark” variable or a “red/green/blue” colour variable into the state that is irrelevant for navigation. This task sequence is designed to demonstrate that if an algorithm learns to correctly generalize across different states, then the algorithm can learn to solve the maze navigation task faster than an algorithm that does not generalize correctly.

The schematic in Fig 6 illustrates how the task sequence is constructed and how additional state variables are introduced. In the bottom left of Fig 6, the task “Maze A Light-Dark” is constructed by augmenting each state of the Maze A task with a binary “light/dark” variable. As an agent transitions between different grid locations, this binary variable switches with equal probability. By adding this variable, the state space is doubled to 200 states. Note that the state *s* will be communicated to the agent as an index that ranges from 0 to 199. The agent is not informed about the fact that states are augmented by a binary variable. To determine how this 200 state light-dark maze can be compressed, the agent would have to infer that state 0 and state 100 are equivalent and can be compressed to one latent state, for example.
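To make the index construction concrete, the sketch below shows one way a grid position and the binary light/dark variable could be folded into a single index between 0 and 199, and what the corresponding compressing state abstraction looks like. The layout `position + 100 * light` is an assumption that is consistent with the example of states 0 and 100 being equivalent; the names `to_index`, `resample_light`, and `phi_light_dark` are hypothetical.

```python
import numpy as np

NUM_POSITIONS = 100  # cells of the 10 x 10 grid

def to_index(position, light):
    """Flat state index shown to the agent (assumed layout)."""
    return position + NUM_POSITIONS * light

def resample_light(rng):
    """Draw the light/dark variable with equal probability on a transition."""
    return int(rng.integers(2))

# State abstraction that ignores the light/dark variable: index s and
# index s + 100 are mapped to the same latent state.
phi_light_dark = np.arange(2 * NUM_POSITIONS) % NUM_POSITIONS
```

The agent never sees `position` or `light` separately; it would have to discover a mapping like `phi_light_dark` from interaction data alone.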

The task “Maze B Coloured” (second map in bottom row of Fig 6) is constructed by augmenting the right half of the maze with a “red/green/blue” colour variable. In this case, states corresponding to the left half of the maze are not changed, but states that correspond to the right half are augmented with either the colour red, green, or blue. Intuitively, as the agent transitions into the right half of the maze, it will observe a coloured grid cell and colours will randomly switch between either red, green, or blue. Conforming to the light-dark maze construction, the state is presented to the agent as an index ranging from 0 to 199. The agent is not given a state in a factored form, for example a grid position and colour.

The bottom row of Fig 6 depicts the five-task curriculum. In this experiment, an agent can either learn how to maximize reward in each of the 200-state tasks or learn how to compress each task into 100 latent states, generalize information across different tasks, and ultimately learn an optimal policy faster and generate higher total reward.

Fig 7 presents the results of the learning experiment conducted on the maze-task curriculum. The average-per-task episode length of each algorithm is plotted in Fig 7A. Because each task is a navigation problem, a low average episode length indicates that an algorithm reaches the rewarding goal using fewer time steps and can generate on average more reward per time step. For Q-learning, the average episode length per task remains roughly constant (blue curve in Fig 7A), because Q-learning does not transfer information across tasks. In comparison, the reward-predictive non-parametric Bayesian model achieves a significantly lower average episode length on tasks three through four. This behaviour is explained by the posterior plotted in Figs 7C and 7D. On the first two tasks, the reward-predictive model adds two new state abstractions into its belief set (Fig 7C). During training on task 1 and task 2, this model uses the identity state abstraction and does not (and cannot) generalize across two different states. Consequently, there is no difference in performance between the reward-predictive model and Q-learning. Task 1 and task 2 expose the agent for the first time to a light-dark and a coloured maze and after learning in these two tasks the reward-predictive model adds a new state abstraction into its belief set (orange boxes in Fig 7C, left panel). From task 3 onward, the agent detects within the first few episodes which state abstraction to re-use in which task, resulting in faster learning and consequently shorter average episode lengths on these tasks. These results demonstrate that the reward-predictive model is capable of extracting two state abstractions, one for the light-dark scenario and one for the coloured scenario, and re-using these state abstractions.

In contrast, the reward-maximizing model only performs comparably to the reward-predictive model on the third task (orange curve in Fig 7A). The posterior probability plot for this model (Fig 7D) indicates that only on task 3 a previously learned state abstraction is re-used. This re-use occurs because the first and third tasks are identical and the first task’s solution can be repeated on the third task. For all other tasks, the reward-maximizing model introduces a new state abstraction into its belief set. This supports the hypothesis that the reward-maximizing model effectively memorizes a solution for each task and can only repeat previously learned solutions.

Fig 7B compares the average episode length of the reward-predictive model with transferring and adjusting SFs, the system used in prior work [11–14, 17, 18, 29–31]. (Supporting text S.3 provides a description and re-production of how re-using previously learned SFs leads to faster convergence.) In the tested grid-world tasks, we found that our SF-learning algorithm implementation in combination with the used initialization heuristics converges faster to an optimal policy than the Q-learning algorithm, resulting in shorter average episode lengths. The reward-predictive model can be adapted to use the SF-learning algorithm instead of the Q-learning algorithm and this model is presented in Fig 7B. The blue curve in Fig 7B plots the average episode length when the SF-learning algorithm is used to find an optimal policy. In this simulation, the SF-learning algorithm does not transfer a representation and instead resets its weights when switching between tasks. The orange curve plots the average episode length when the SF-learning algorithm re-uses previously learned SFs instead of resetting its representation. Fig 7B demonstrates that re-using SFs degrades performance on the maze task sequence while the reward-predictive model outperforms an SF-learning baseline. On tasks 3 and 5, the reward-predictive model outperforms the SF transfer method because the reward-predictive model identifies which state abstraction to use in which task, as previously discussed. Note that when transitioning from task 3 to task 4, the underlying light-dark state abstraction is not changed whereas in all other task changes the underlying state abstraction is changed as well. This result suggests that SFs themselves implicitly incorporate parts of the state abstraction, helping the SF-learning algorithm to converge faster in task 4.

### 3.4 Comparison to transferring successor features

Lastly, in this section, we illustrate the differences between re-using reward-predictive state abstractions and re-using successor features themselves [11, 12, 18]. Although reward-predictive state abstractions can be extracted from successor features, the resulting abstraction is a more abstract aspect of an MDP than simply reusing the successor features. Fig 8 presents a guitar-playing example to illustrate this idea. In this example, the task is to play a guitar scale, a sequence of notes such as C-D-E-F-G-A-B. On a guitar, the note “C” can be played by holding down a finger at one out of multiple possible locations on the fret board, as illustrated in Fig 8A. (Even within the same octave, the note “C” can be played in up to five different ways.) A skilled guitarist has internalized a representation that links fret-board positions to the notes they produce. In this example, a reward-predictive state abstraction captures this aspect of mapping all positions on the fret board to a latent state of playing the note “C”.

The guitar-scale task illustrated in Fig 8A is constructed such that the agent always starts at a separate start state. To play a scale correctly, the agent has to select an action sequence that corresponds to playing the note sequence correctly. The state is represented as a bit matrix, where each entry corresponds to a position on the fret board. In the guitar-scale task the agent transitions through a sequence of fret-board locations by playing a sequence of notes. Rewards are only maximized across time if the agent plays the correct scale (Fig 8A, bottom schematic).

For a sequence of two guitar-scale tasks, Fig 8B compares the performance of a reward-predictive model with that of transferring previously learned SFs. Note that these two guitar-scale tasks differ in their transitions, rewards, and optimal policy. While all algorithms perform similarly in learning the first scale (given that they have to learn the abstraction), only the reward-predictive model (green curve) exhibits transfer to the second scale. Fig 8C plots the reward obtained in each episode for both the reward-predictive model and the SF transfer algorithm and illustrates that the reward-predictive model obtains an optimal policy faster on the second task. This performance improvement can be attributed to the fact that the reward-predictive model builds an internal representation that more closely models how to generalize across different fret-board locations, which is invariant to the scale (i.e., the reward sequence is identical if the agent correctly plays the scale in any of the octaves). In contrast, SFs encode the visitation frequencies of future (latent) states under a specific policy, a property that changes between the two tasks. Thus, the performance discrepancy in Fig 8B comes about because SFs and reward-predictive state abstractions model different aspects of an MDP.
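The policy dependence of SFs can be seen in a small tabular example. The sketch below computes the successor representation (the special case of SFs with one-hot state features) in closed form as (I − γ P_π)^{-1} for two different policies on the same transition function. The three-state ring MDP and the function name `successor_representation` are illustrative assumptions and are unrelated to the guitar-scale task.

```python
import numpy as np

def successor_representation(P, policy, gamma=0.9):
    """Closed-form SR for transition tables P[a, s, s'] and policy[s, a]."""
    # State-to-state transition matrix induced by the policy.
    P_pi = np.einsum("sa,ast->st", policy, P)
    return np.linalg.inv(np.eye(P.shape[1]) - gamma * P_pi)

# Three states on a ring: action 0 moves left, action 1 moves right.
num_states = 3
P = np.zeros((2, num_states, num_states))
for s in range(num_states):
    P[0, s, (s - 1) % num_states] = 1.0
    P[1, s, (s + 1) % num_states] = 1.0

always_left = np.tile([1.0, 0.0], (num_states, 1))
always_right = np.tile([0.0, 1.0], (num_states, 1))

sr_left = successor_representation(P, always_left)
sr_right = successor_representation(P, always_right)
```

Although both policies act on identical transitions, `sr_left` and `sr_right` differ, so an SR learned under one policy is a poor initialization when the optimal policy changes. A state abstraction, by contrast, is a mapping on states and is not indexed by a policy.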

## 4 Discussion

In reinforcement learning, the agent’s goal is to find a reward-maximizing policy. But, whereas typical RL applications pertain to a single MDP, in a lifelong learning scenario (such as that confronted by biological agents), the objective is to maximize reward across a variety of environments. For this purpose, it is critical to discover state abstractions that can be efficiently re-used and transferred across a variety of situations. While several approaches exist for discovering useful state abstractions that reduce the complexity of a high dimensional task environment (e.g., using deep neural networks) in an attempt to maximize reward, this article demonstrates that, for longer term benefits, an agent should focus on learning reward-predictive state abstractions. Our findings indicate that such abstractions permit an agent to discover state spaces that can be re-used by way of analogy to previously seen state spaces, without requiring the details of the task (transitions, reward functions, or optimal policy) to be preserved.

Our initial simulations considered situations in which a single abstraction could be transferred to a subsequent MDP. However, in a lifelong learning scenario, one must consider multiple possible abstract structures that may pertain to any novel situation. When a musician picks up a banjo, they may quickly recognize its similarity to other string instruments (even those with alternate tuning) and efficiently learn to play a scale; the same musician may re-use a different structure when attempting to master the accordion. Previous theoretical work relied on non-parametric Bayesian clustering models that assess which of several previously seen structures might apply to a novel situation and be flexibly combined in a compositional fashion [3], a strategy supported by empirical studies in humans [16]. However, such an approach still requires the agent to recognize that the specific transition function and/or the reward function is portable to new situations. Here, we applied a similar non-parametric Bayesian agent to cluster reward-predictive state abstractions, affording “zero-shot” transfer of state representations to novel environments that are only similar by way of analogy to previously seen scenarios. Because the reward-predictive model can identify which state abstractions are embedded in a task and re-use these state abstractions to accelerate learning, the presented results suggest that reward-predictive state representations generalize across tasks.

Biologically, our findings motivate studies to investigate whether brain systems involved in representing state spaces, such as the hippocampus and orbitofrontal cortex [32–34], have learning rules that are guided by minimizing reward-predictive loss, rather than simply minimizing the Bellman error as in classical temporal difference learning rules leveraged by striatal dopaminergic systems [35, 36]. Indeed, dopaminergic learning signals themselves are diverse, not only conveying reward prediction errors used for optimizing actions, but with some signals (perhaps projecting to distinct circuits) appearing to be used to learn about state transitions that permit subsequent transfer [11, 18, 37, 38]. Our simulations motivate more tailored experiments to investigate the potential role of such signals in compressing state representations such that they can be analogically reused.

Existing experiments searching for neural and behavioral correlates of the SR [11, 18] have not varied both rewards and transitions, because (unlike the reward-predictive model), the SR is not robust to these changes across environments. Our work motivates the development of targeted experimental designs that would test if human subjects can reuse a latent structure that is present in a set of tasks despite variations in transitions and rewards. For example, one could design a human subject study similar to [16] where participants solve a sequence of grid-world navigation problems, but augment the design to test if subjects reuse a latent structure present in a set of tasks despite variations in transitions and rewards, similar to the task sequence presented in Fig 5. As illustrated in Fig 6, the specific pattern of generalization across tasks is predicted to vary depending on whether agents use reward-predictive state abstractions or re-use SR abstractions. Thus, our work provides a concrete testable behavioral prediction that would discriminate between our work and existing work.

Offline hippocampal replay has been proposed to reflect sampling from a model to train model-free RL and facilitate planning [39–42]. Our work provides a predicted amendment to this notion: we suggest that replay may be prioritized in such a way that facilitates the construction of reward-predictive state abstractions. In our work on learning (Sec 3.3), while the agent is first interacting in a novel MDP, it retains an identity (i.e., un-compressed) state abstraction. Only after sufficient learning and interaction in this task can the agent construct a new state abstraction that can be used for planning in the future. Indeed, for efficient learning and generalization, retaining the identity map while learning is critical; otherwise the agent is likely to create a sub-optimal abstraction that will not generalize. We suggest that the online use of the identity matrix may depend on the highly pattern-separated and conjunctive representations in the hippocampus, whereas the more abstract representations that facilitate generalization and transfer may be cortical [43]. Moreover, we speculate that one way this abstraction could be learned offline would be if, during replay, hippocampal events could be sequentially sampled from regions of the state space that are most similar in a reward-predictive sense (i.e., those that incur the least reward-predictive loss). In this way, an abstract graph-like structure suitable for future planning could be constructed [44, 45] but further augmented so that it does not depend on temporal adjacency of transitions within the graph itself, but rather on the ability to predict future expected reward sequences, facilitating a deeper form of transfer.
This reward-predictive loss function for guiding replay may also shed light on recent studies in rodents demonstrating that replay is biased toward recently received rewards (e.g., food) rather than those that are currently desired (e.g., water) after revaluation, even though behavior is directed toward the desired one [46]. While this pattern is counter-intuitive from the perspective that replay is used for future planning, it accords with that expected from an algorithm that compresses the state space based on reward-predictive representations, where reward is defined by the previously experienced reward function. Consequently, these representations do not generalize to any arbitrary task and are restricted to variations in transitions, rewards, and optimal policy. This restriction of reward-predictive state abstractions can be observed in Fig 1, where a representation learned for the light-dark maze would not be re-used on the coloured maze. Because the presented model demonstrates that generalization across different rewards and transitions is possible, future studies on replay would test subjects for generalization across different tasks instead of only testing for recall of a previously observed task structure.

The Tolman-Eichenbaum machine [47, 48] presents a model for generalization in the hippocampal-entorhinal system [49]. Similar to reward-predictive state abstractions, this model learns a latent representation that is predictive of future outcomes or stimuli but is also tied to a fixed transition function. While this model is not formulated in the usual RL framework, predicting future outcomes or stimuli can also be understood as a form of reward prediction. However, this model is trained directly on entire interaction sequences to predict future outcomes, and the learned representations are thus tied to the transition function. The presented transfer examples and simulations illustrate that reward-predictive state abstractions are not restricted by these limitations and can be directly re-used, assuming certain state equivalences are preserved.

Our approach also stands in contrast to prior attempts to leverage SFs [11–14, 17, 18, 29, 30] in which the SFs themselves are used to initialize learning in novel environments. Such an approach can accelerate learning in some situations, but it can be fragile to changes in the optimal policy [14] and transition function. A similar effect has also been shown for variations of reward-maximizing state abstractions [5], but these abstractions are also adjusted to each task, similar to SFs. While prior work mitigates this re-learning by associating a novel task with one out of multiple previously learned SFs [13, 31], these methods still rely on initializing learning with a previously learned representation to obtain a performance gain over solving a task from scratch. Universal successor feature approximators (USFA) [50] mitigate the dependency of previously learned SFs on a single policy by defining SFs as a function ***ψ***^{π}(*s, a*; ***w***), where the weight vector ***w*** describes a particular MDP. While this approach only requires learning one SF representation function for a family of policies, this model also assumes fixed transition functions. In contrast, reward-predictive representations have the ability to abstract away irrelevant task features and these abstractions can be re-used without re-learning them. While the presented reward-predictive model transfers state abstractions across tasks, this model has to re-learn how individual latent states are associated with one-step rewards or SFs for each task. In fact, the presented abstraction transfer models could be combined with prior work [3, 13, 16, 31] that transfers SFs, latent transition functions, or latent reward functions to integrate the benefits of each transfer system.

In related work [17], the SR of an MDP was compressed using PCA, and the obtained representations were demonstrated to be suitable for transfer and were connected to place cells and grid cells in the hippocampus. However, this compressed SR constructs a representation of the transition function itself, and hence transfer is again limited to environments that share the same transition function. In contrast, reward-predictive state abstractions separate the transition dynamics (and the SR) from the compression on the state space itself, and thus generate a latent state representation of a task exploiting analogical task equivalences. Latent state abstractions are not tied to particular transitions [3, 4], and can thus circumvent this dependency without adjusting the transferred representation itself.

While reward-predictive state abstractions do not limit an agent’s ability to obtain an optimal policy for an MDP [9, 51], the solution space of possible reward-predictive state abstractions is far more constrained. Prior deep learning models [52] construct latent state representations as part of a model-free and model-based hybrid model that constructs a latent state representation and extracts the underlying state-transition dynamics. In contrast to their method, reward-predictive state abstractions compress the state space by generalizing across states that generate identical future expected reward sequences. While this article uses an existing SF model [9] to compute *l*_{predictive}, several other methods exist to evaluate reward-predictive state abstractions [53–55].

### 4.1 Limitations and Future Directions

With the exception of Fig 3C, each simulation experiment assumes that a given task has an (unknown) state-abstraction embedding. In this case, there always exists a state abstraction which, if discovered, would allow any learning algorithm to find an optimal policy. A case that has not been studied in this article and is left for future work is the case when a task is over-compressed (i.e., lossy compression). Over-compressing a task induces approximation errors, because the compression removes too much information or detail from the state space such that accurate predictions are no longer possible [5, 9]. If only the latent state is given as state input to an algorithm like Q-learning, the algorithm may not converge and learn an optimal policy because the latent state is only providing partial information about the actual state of the task. One could analyze the problem as a partially observable MDP (POMDP) [56], but algorithms that can solve POMDPs also maintain a belief about which actual state they are in. In this literature, the actual state is assumed to be unknown to the agent. Because this work assumes that the actual state is known to the agent, the benefit of using such an algorithm is not clear in the case where a task is over-compressed by a state abstraction. Under what assumptions algorithms like Q-learning can be combined with state abstractions that over-compress a task is left for future work.

The presented results consider finite MDPs, allowing the algorithm to tabulate a value or latent state for each possible state. Another direction of future work is to extend the presented models and algorithms to larger state spaces, such as images. Such an extension would integrate neural networks or deep learning techniques, and allow the presented models to be applied to more complex tasks, such as computer games [22] or visual transfer tasks that can also be used in a human subject study [4].

## 5 Conclusion

The presented results suggest that reward-predictive state abstractions generalize across tasks with different transition and reward functions, motivating the design of future transfer algorithms. The discussed connections to predictive representations in the brain and generalization in human and animal learning motivate further experiments to investigate if biological systems learn reward-predictive representations.

## Supporting information

### S.1 State Abstractions

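For finite state and action MDPs, the transition function can be represented as a set of row-stochastic transition matrices $\{\boldsymbol{P}_a\}_{a \in \mathcal{A}}$ where

$$\boldsymbol{P}_a(i, j) = p(s_i, a, s_j) \tag{13}$$

and a set of expected reward vectors $\{\boldsymbol{r}_a\}_{a \in \mathcal{A}}$ where

$$\boldsymbol{r}_a(i) = \mathbb{E}_p\left[ r(s_i, a, s') \mid s_i, a \right]. \tag{14}$$

A state abstraction is an MDP homomorphism [23] mapping an MDP *M* into an abstract MDP *M*_{ϕ}. Using a weighting function *ω*, the abstract MDP “aggregates” the transition and reward function across states that map to the same abstract state *s*_{ϕ}. Specifically, the abstract reward function *w* is constructed as

$$w(s_\phi, a) = \sum_{s \in \phi^{-1}(s_\phi)} \omega(s)\, \mathbb{E}_p\left[ r(s, a, s') \mid s, a \right].$$

The transition function between latent states *m* is constructed as

$$m(s_\phi, a, s'_\phi) = \sum_{s \in \phi^{-1}(s_\phi)} \; \sum_{s' \in \phi^{-1}(s'_\phi)} \omega(s)\, p(s, a, s').$$

The weighting function is assumed to be non-negative and the sum across a state partition evaluates to one: $\sum_{s \in \phi^{-1}(s_\phi)} \omega(s) = 1$. The simulations presented in Sections 3.1 and 3.2 assume a uniform weighting function which averages across state partitions. Similar to Eq (13) and Eq (14), the abstract transition matrices $\{\boldsymbol{M}_a\}_{a \in \mathcal{A}}$ and abstract reward vectors $\{\boldsymbol{w}_a\}_{a \in \mathcal{A}}$ can be computed by evaluating *m* and *w* for every pair of latent states and every latent state, respectively.

Note that function *m* defines transition probabilities between latent states in the same way the transition function *p* defines transition probabilities between states [6]. Consequently, the latent model described by the matrices $\{\boldsymbol{M}_a\}_{a \in \mathcal{A}}$ and vectors $\{\boldsymbol{w}_a\}_{a \in \mathcal{A}}$ can be used as a normal MDP, with the only difference that this latent MDP is defined on latent states. A policy that is optimal with respect to this compressed or latent MDP can be computed by performing value iteration [1, Chapter 4.4] on these matrices and vectors. Such a policy is then used as the optimal policy for an agent that generalizes across states according to the state abstraction *ϕ*.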
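As a concrete illustration of the construction above, the following minimal Python sketch aggregates a small MDP under a uniform weighting function and then runs value iteration on the resulting latent MDP. The four-state example task, the array layout, and the function names `abstract_mdp` and `value_iteration` are assumptions made for illustration; they are not taken from the published code base.

```python
import numpy as np

def abstract_mdp(P, R, phi, n_latent):
    """Aggregate an MDP under a state abstraction with uniform weighting.

    P[a, i, j] = p(s_i, a, s_j), R[a, i] = expected reward r(s_i, a),
    phi[i] = latent state index of s_i.
    Assumes every latent state has at least one member state.
    """
    n_states = len(phi)
    Phi = np.zeros((n_states, n_latent))
    Phi[np.arange(n_states), phi] = 1.0          # one-hot membership matrix
    W = Phi / Phi.sum(axis=0)                    # uniform weights per partition
    M = np.einsum("si,asu,uj->aij", W, P, Phi)   # abstract transition matrices
    w = np.einsum("si,as->ai", W, R)             # abstract reward vectors
    return M, w

def value_iteration(M, w, gamma=0.9, tol=1e-8):
    """Value iteration on the latent MDP; returns Q with Q[a, i]."""
    Q = np.zeros_like(w)
    while True:
        V = Q.max(axis=0)
        Q_new = w + gamma * (M @ V)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

# Hypothetical example: states {0, 1} and {2, 3} behave identically.
P = np.zeros((2, 4, 4))
P[0, [0, 1, 2, 3], [1, 0, 3, 2]] = 1.0   # action 0 stays inside a partition
P[1, [0, 1, 2, 3], [2, 3, 0, 1]] = 1.0   # action 1 switches partitions
R = np.zeros((2, 4))
R[1, [0, 1]] = 1.0                       # action 1 is rewarded in states 0 and 1
phi = np.array([0, 0, 1, 1])

M, w = abstract_mdp(P, R, phi, n_latent=2)
Q = value_iteration(M, w, gamma=0.9)
policy = Q.argmax(axis=0)                # greedy action per latent state
```

Because the example abstraction merges only equivalent states, the two-state latent MDP reproduces the dynamics of the original task, and the policy computed on latent states can be applied to every state *s* through *ϕ*(*s*).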

### S.2 Successor Features Identify Reward-Predictive State Abstractions

Linear Successor Feature Models (LSFMs) can be used to identify reward-predictive state abstractions [9]. Critically, LSFMs construct a state abstraction *ϕ* so as to most accurately predict the SFs at every state *s* and action *a* by requiring that

$$\boldsymbol{\phi}(s)^\top \boldsymbol{F}_a = \boldsymbol{\psi}^{\pi}(s, a) \tag{18}$$

holds for a set of real-valued square matrices $\{\boldsymbol{F}_a\}_{a \in \mathcal{A}}$. Intuitively, Eq. (18) asserts that a state feature vector *ϕ*(*s*) has to be constructed such that only a linear transform is needed to predict the SFs **ψ**^{π}(*s, a*). By [9, Lemma 1], requiring a state abstraction to predict the SFs **ψ**^{π} is equivalent to requiring a state abstraction to predict the expected next state feature vector (the vector that is reached by selecting action *a* at state *s*). In fact, this model, called Linear Action Model (LAM), can also be used to learn reward-predictive state abstractions [9]. While both models can be used to learn reward-predictive state abstractions, LSFMs and LAMs are not equivalent to one another and we have found it easier to use LSFMs in practice. If such a state abstraction also associates each feature vector *ϕ*(*s*) with one-step reward outcomes, then this state abstraction is reward predictive. Similar to Eq. (18), this property can be enforced by requiring for some set of real-valued vectors $\{\boldsymbol{w}_a\}_{a \in \mathcal{A}}$ that

$$\boldsymbol{\phi}(s)^\top \boldsymbol{w}_a = r(s, a) \tag{19}$$

where *r*(*s, a*) is the one-step expected reward that can be obtained by selecting action *a* at state *s*. To obtain a state abstraction that satisfies Eqs. (18) and (19), the differences between the left-hand and right-hand sides of Eqs. (18) and (19) are minimized using the loss objective

$$l_{\text{LSFM}} = \sum_{s, a} \left( \boldsymbol{\phi}(s)^\top \boldsymbol{w}_a - r(s, a) \right)^2 + c \left\| \boldsymbol{\phi}(s)^\top \boldsymbol{F}_a - \boldsymbol{\psi}^{\pi}(s, a) \right\|_2^2 \tag{20}$$

where *c* is a real-valued constant (we denote the Euclidean L2 norm with ‖·‖_{2}). For finite MDPs, the summation in Eq. (20) is computed over all possible states and actions. (Please refer to supporting text S.2 for a detailed description.) Finding a state abstraction *ϕ* that minimizes the loss objective *l*_{LSFM} is equivalent to learning reward-predictive state abstractions [9, Theorem 2 and Lemma 1], because the error of the resulting reward-sequence predictions is bounded in terms of *l*_{LSFM} and a constant *C*_{γ,T} that depends on the discount factor *γ* and sequence length *T*.

In this article all abstract states *ϕ*(*s*) are encoded as one-hot bit vectors, a vector of zeros with only one entry set to one. For an action sequence *a*_{1},…,*a*_{t}, the product $\boldsymbol{\phi}(s)^\top \boldsymbol{M}_{a_1} \cdots \boldsymbol{M}_{a_{t-1}}$ of the feature vector with the abstract transition matrices evaluates to a row probability vector specifying with which probability an abstract state is reached when starting at the abstract state *ϕ*(*s*) and following the action sequence *a*_{1},…,*a*_{t−1}. By computing the dot-product between this vector and the reward vector **w**_{a_t}, the prediction of the expected reward is calculated after following the action sequence *a*_{1},…,*a*_{t} starting at state *s* with

$$\hat{r}_t(s, a_1, \dots, a_t) = \boldsymbol{\phi}(s)^\top \boldsymbol{M}_{a_1} \cdots \boldsymbol{M}_{a_{t-1}} \boldsymbol{w}_{a_t}.$$

The loss function *l*_{predictive} is then computed from the differences between these predictions and the expected reward outcomes of the task.

If *l*_{predictive}(*ϕ*) = 0, then *ϕ* is a reward-predictive state abstraction. A more detailed description of these bounds and LSFMs was previously presented by [9].
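The reward-sequence prediction described above can be sketched in a few lines of Python. The sketch rolls a one-hot latent vector forward through abstract transition matrices, reads out expected rewards with the reward vectors, and compares the result with the expected rewards of the original task. The example MDP, the two hand-computed latent models, and the error measure (absolute error summed over time steps and averaged over all start states and action sequences of length `T`) are assumptions chosen for illustration and need not match the exact definition of *l*_{predictive} used in the simulations.

```python
import itertools
import numpy as np

def latent_rewards(M, w, z, actions):
    """Expected rewards predicted from latent vector z for an action sequence."""
    preds = []
    for a in actions:
        preds.append(z @ w[a])   # dot product with the reward vector w_a
        z = z @ M[a]             # row probability vector over latent states
    return np.array(preds)

def true_rewards(P, R, s, actions):
    """Expected rewards of the original MDP when starting at state s."""
    d = np.zeros(P.shape[1])
    d[s] = 1.0
    out = []
    for a in actions:
        out.append(d @ R[a])
        d = d @ P[a]
    return np.array(out)

def prediction_error(P, R, M, w, phi, T):
    """Mean absolute reward-sequence error over all states and action sequences."""
    n_actions, n_states = P.shape[0], P.shape[1]
    eye = np.eye(M.shape[1])
    errors = []
    for s in range(n_states):
        for actions in itertools.product(range(n_actions), repeat=T):
            diff = latent_rewards(M, w, eye[phi[s]], actions) - true_rewards(P, R, s, actions)
            errors.append(np.abs(diff).sum())
    return float(np.mean(errors))

# Hypothetical four-state task: states {0, 1} and {2, 3} are equivalent.
P = np.zeros((2, 4, 4))
P[0, [0, 1, 2, 3], [1, 0, 3, 2]] = 1.0
P[1, [0, 1, 2, 3], [2, 3, 0, 1]] = 1.0
R = np.zeros((2, 4))
R[1, [0, 1]] = 1.0

swap = np.array([[0.0, 1.0], [1.0, 0.0]])

# Latent model for phi = [0, 0, 1, 1] (uniform averaging within partitions).
phi_good = [0, 0, 1, 1]
M_good = np.stack([np.eye(2), swap])
w_good = np.array([[0.0, 0.0], [1.0, 0.0]])

# Latent model for phi = [0, 1, 0, 1], which merges non-equivalent states.
phi_bad = [0, 1, 0, 1]
M_bad = np.stack([swap, np.eye(2)])
w_bad = np.array([[0.0, 0.0], [0.5, 0.5]])

err_good = prediction_error(P, R, M_good, w_good, phi_good, T=3)
err_bad = prediction_error(P, R, M_bad, w_bad, phi_bad, T=3)
```

In this toy example the first abstraction predicts every reward sequence exactly, whereas the second abstraction averages over states with different rewards and therefore incurs a positive error.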

#### S.2.1 Transferring SFs Across Tasks

To establish a baseline and investigate to what extent re-using previously learned Q-values or SFs accelerates learning, Fig S4 reproduces the navigation experiment presented in prior work [14] and demonstrates the benefit of transferring SFs between different tasks as in prior work [11, 12, 18, 25]. In this experiment, the Q-learning [19] and SF-learning [12] algorithms are tested on a sequence of four grid-world navigation tasks (Fig S4A). SF-learning is an algorithm that learns a SF vector **ψ**^{π}(*s, a*) for each state *s* and action *a*. Similar to Q-learning, SF-learning iteratively improves its current behaviour policy until an optimal policy is found [12]. Supporting text S.3 presents the details of how these algorithms are implemented and evaluated. Each agent was allowed to attempt each navigation task for 200 episodes. After 200 episodes, each agent is signalled that the grid-world map has changed. While interacting with a task, each algorithm updates its internal values and attempts to find an optimal policy that reaches the goal location as fast as possible.

Fig S4B plots the episode length of the SF-learning algorithm as a function of the number of episodes. The gray lines indicate when the interaction in one of the four tasks begins. This plot shows that at the end of learning in each task, a close to optimal policy is found because the agent can repeatedly navigate from start to goal location within about ten time steps. However, at the beginning of learning in each map (episodes 0, 200, 400, and 600), the episode length spikes up. The first spike at episode zero occurs because the SF-learning algorithm is initialized and has to learn an optimal policy from scratch. At episodes 200, 400, and 600, the SF-learning algorithm is reset but is allowed to preserve its previously learned SFs and reward weight vector. Because the change in goal location also leads to a change in optimal policy, the episode length spikes up when the goal location is changed, but the SF-learning algorithm can subsequently recover an optimal policy for that new task [13].

Fig S4C plots the average episode length for each task and tested algorithm. In this plot, each point is averaged across the 200 episodes an agent spent in each task. Because the task is to navigate across a grid world, a shorter episode or trial length is indicative of a policy that generates more reward per time step. For both Q-learning and SF-learning the average episode length does not vary significantly across the different tasks. This behaviour is expected, because both algorithms are reset when the grid map is changed and an optimal policy is learned from scratch. Furthermore, these simulations indicate that SF-learning converges to an optimal policy faster than Q-learning in this particular experiment, because the average episode length is significantly lower for SF-learning than for Q-learning. The orange curve in Fig S4C plots the episode length when Q-learning re-uses previously learned Q-values. While re-using previously learned Q-values reduces the average episode length on tasks two and three, re-using Q-values does not result in a significant performance improvement and this algorithm is outperformed by the SF-learning algorithm. The red curve in Fig S4C plots the episode length when the SF-learning algorithm can re-use previously learned SFs and a previously learned reward vector. In this case, the average episode length is significantly lower on tasks two through four when the SF-learning algorithm is allowed to transfer previously learned SFs. While SFs have to be re-learned [12, 13], re-using previously learned SFs leads to faster convergence. This experiment aligns with previous findings demonstrating that re-using SFs can speed up learning [11–14, 17, 18, 29–31].

### S.3 Hyper-Parameter Selection and Implementation of Learning Experiments

In this section we present implementation details necessary to reproduce the learning experiments presented in Figures 6, 7, and S4. Because the Q-learning algorithm and SF-learning algorithm use a learning rate parameter, the resulting episode lengths are sensitive to hyper-parameters and learning rates. To account for this dependency, each algorithm was tested on a range of different hyper-parameter combinations and the hyper-parameter setting for each algorithm that leads to the best performance was used for the presented experiments.

All implementations and the datasets generated by each simulation are publicly available through this link: https://github.com/lucaslehnert/rewardpredictive.

#### S.3.1 Optimistic Initialization and Exploration

To ensure that each algorithm learns an optimal policy using only interaction data, each algorithm was initialized optimistically to ensure efficient exploration [57]. In optimistic initialization, all Q-values [1, 19] are initialized to a specified maximum value *Q*_{max}, which upper-bounds the actual Q-values of the task. A Q-learning agent then learns to incrementally decrease its value predictions until prediction errors are low. By combining this initialization scheme with an exploration policy that is greedy with respect to the current Q-value predictions, the Q-learning agent will repeat actions where Q-value predictions overestimate the actual return. All tested tasks are navigation tasks where an agent always receives a reward of one once the goal location is reached and a reward of zero otherwise. Because the reward of one is incurred only once, all Q-values lie in the interval [0, 1] and *Q*_{max} = 1.
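The interplay between optimistic initialization and greedy action selection can be illustrated with a minimal Python sketch. The five-state corridor task, the tie-breaking rule (the first maximal action is chosen), and the step cap per episode are assumptions made for illustration; only the initialization to *Q*_{max} = 1, the greedy policy, and the reward of one at the goal follow the description above.

```python
import numpy as np

n_states, goal = 5, 4                 # corridor 0..4, goal at the right end
gamma, lr, q_max = 0.9, 0.9, 1.0
Q = np.full((n_states, 2), q_max)     # optimistic initialization

def step(s, a):
    """Action 0 moves left (blocked at state 0), action 1 moves right."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    done = s_next == goal
    return s_next, (1.0 if done else 0.0), done

episode_lengths = []
for episode in range(50):
    s, t, done = 0, 0, False
    while not done and t < 1000:      # step cap is a safety guard only
        a = int(Q[s].argmax())        # greedy w.r.t. current Q-values
        s_next, r, done = step(s, a)
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += lr * (target - Q[s, a])
        s, t = s_next, t + 1
    episode_lengths.append(t)
```

Because every untried action still carries the value *Q*_{max}, the greedy agent is drawn towards actions it has not yet evaluated, and value predictions only decrease until they match the observed returns. In this toy task the agent first explores both directions and then settles on the shortest path to the goal.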

Fig S5 demonstrates that optimistic initialization combined with a greedy action selection strategy significantly outperforms initializing Q-values or SFs to zero and using an ε-greedy exploration policy as prior work did [12, 14, 25]. This result confirms that both Q-learning and SF-learning benefit from the optimistic initialization strategy used in the presented simulation experiments.

#### S.3.2 Q-learning

The Q-learning algorithm [19] was tested for three different learning rate settings: 0.1, 0.5, and 0.9. In all experiments, using a learning rate of 0.9 leads to the fastest learning and overall best performance, even when the algorithm was combined with a state abstraction. Only in Fig S5A, the Q-learning baseline (without transfer) used a learning rate of 0.7. When Q-learning was used in combination with a state abstraction, each transition (*s, a, r, s′*) was mapped to a transition (*ϕ*(*s*), *a, r′, ϕ*(*s′*)) between latent states.
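The mapping of transitions onto latent states can be sketched as follows. The function name `latent_q_update`, the tuple layout of a transition, and the example abstraction are hypothetical; the update itself is the standard Q-learning rule applied to a table indexed by latent states.

```python
import numpy as np

def latent_q_update(Q, phi, transition, lr=0.9, gamma=0.9):
    """Apply one Q-learning update to a table indexed by latent states.

    transition = (s, a, r, s_next, done); phi maps state indices to latent indices.
    """
    s, a, r, s_next, done = transition
    z, z_next = phi[s], phi[s_next]          # map both states to latent states
    target = r if done else r + gamma * Q[z_next].max()
    Q[z, a] += lr * (target - Q[z, a])
    return Q

phi = [0, 0, 1, 1]                           # states 0, 1 share a latent state
Q = np.ones((2, 2))                          # optimistic start with Q_max = 1
Q = latent_q_update(Q, phi, (0, 1, 0.0, 2, False))
shared_value = Q[phi[1], 1]                  # state 1 inherits the update made at state 0
```

A single update made at state 0 changes the value that is looked up for state 1, which is how the abstraction generalizes experience across the states of a partition.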

#### S.3.3 SF-learning

Algorithm 1 outlines an adaptation of the SF-learning algorithm [25] that incorporates optimistic initialization. Our implementation of SF-learning maintains a matrix **G** of size *SA* × *SA*, where *S* is the number of states and *A* the number of actions. Each row **G**(*s, a*) of this matrix approximates the SF

$$\boldsymbol{G}(s, a) \approx \mathbb{E}_{p, \pi}\left[ \sum_{t=1}^{\infty} \gamma^{t-1} \boldsymbol{e}_{s_t, a_t} \,\middle|\, s_1 = s, a_1 = a \right] \tag{23}$$

where the expectation in Eq (23) is defined over all infinite length trajectories that start with state *s* and action *a* and then follow the policy *π*. Each vector **e**_{s_i,a_i} is a one-hot bit vector associated with the state-action pair (*s*_{i}, *a*_{i}). This vector has a dimension of *SA* and the entry corresponding to the state-action pair (*s*_{i}, *a*_{i}) is set to one with all other entries being zero. The definition presented in Eq (23) is equivalent to the definition presented by [25], with the distinction that one-hot bit vector features are used. Consider a vector **v** of length *SA* storing one-step reward predictions in each entry:

$$\boldsymbol{v}(s, a) = \mathbb{E}_p\left[ r(s, a, s') \mid s, a \right]. \tag{24}$$

Algorithm 1 incrementally learns a matrix **G** and vector **v** such that Eq (23) and Eq (24) are satisfied. The iterates in lines 8 and 9 of Algorithm 1 are obtained by substituting the one-hot features **ϕ**_{t} = **e**_{s_t,a_t} into [14, Algorithm 1]. If the matrix **G** is constructed such that Eq (23) exactly holds, then [25]

$$Q^{\pi}(s, a) = \boldsymbol{G}(s, a)\, \boldsymbol{v}. \tag{25}$$

The SF-learning algorithm learns an optimal policy by constructing Q-value predictions using Eq (25). Optimistic initialization can be integrated into SF-learning by initializing the matrix **G** and vector **v** such that *Q*(*s, a*) = *Q*_{max} at the beginning of training. This can be accomplished by initializing **G** = **I** and every entry *v*_{0}(*s, a*) = *Q*_{max} (line 2 in Algorithm 1). Hence, we have for the Q-values at initialization

$$\boldsymbol{G} \boldsymbol{v}_0 = \boldsymbol{I}\, (Q_{\max} \boldsymbol{1}) = Q_{\max} \boldsymbol{1}$$

where **1** is a vector with each entry being set to one. By combining this initialization scheme and selecting actions greedily with the predicted Q-values, we found that the SF-learning algorithm converges faster than the Q-learning algorithm. Table 1 lists the tested learning rates and the best performing learning rate setting for each simulation.
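The optimistic initialization of the SF matrix and reward vector can be checked numerically with the short Python sketch below. The update function is a generic temporal-difference update for one-hot state-action features and is an assumption made for illustration; it is not a transcription of lines 8 and 9 of Algorithm 1. The helper names `idx` and `sf_update`, the problem size, and the learning rates are likewise hypothetical.

```python
import numpy as np

S, A, gamma, q_max = 5, 2, 0.9, 1.0

def idx(s, a):
    """Position of the state-action pair (s, a) in the one-hot basis."""
    return s * A + a

G = np.eye(S * A)                     # optimistic start: G = I
v = np.full(S * A, q_max)             # every reward prediction set to Q_max
Q0 = (G @ v).reshape(S, A)            # Q-values at initialization, Q = G v

def sf_update(G, v, s, a, r, s_next, done, lr_sf=0.5, lr_r=0.9):
    """One TD-style update of the SF row and reward prediction for (s, a)."""
    i = idx(s, a)
    e = np.zeros(S * A)
    e[i] = 1.0
    v[i] += lr_r * (r - v[i])         # move reward prediction towards r
    if done:
        target = e
    else:
        q_next = (G @ v).reshape(S, A)[s_next]
        a_next = int(q_next.argmax())  # greedy next action
        target = e + gamma * G[idx(s_next, a_next)]
    G[i] += lr_sf * (target - G[i])   # move SF row towards the TD target

sf_update(G, v, s=0, a=1, r=0.0, s_next=1, done=False)
Q1 = (G @ v).reshape(S, A)            # Q-values after one update
```

At initialization every Q-value equals *Q*_{max}; after a single unrewarded transition only the visited state-action pair is lowered, so the greedy policy continues to favour untried actions.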

#### S.3.4 Learning of Reward-Maximizing and Reward-Predictive State Abstractions

The experiments presented in Section 3.3 learn reward-maximizing state abstractions by clustering Q-values. These Q-values are computed by performing value iteration on the constructed transition and reward tables. By clustering Q-values a state abstraction *ϕ* is obtained with a low *l*_{maximizing}(*ϕ*) value^{1} [58].
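One simple way to cluster Q-values is sketched below in Python. The binning rule (states are merged when all of their Q-values fall into the same bins of width `eps`) is an assumption chosen for brevity, since the clustering method is not specified here; the function name `cluster_by_q_values` and the example values are hypothetical. The final line evaluates the bound 2*ε*/(1 − *γ*)^{2} stated in the footnote.

```python
import numpy as np

def cluster_by_q_values(Q, eps):
    """Assign states to latent states by binning their Q-values.

    Q has shape (S, A). States whose Q-values fall into the same eps-wide
    bin for every action differ by less than eps and share a latent state.
    """
    keys = [tuple(np.floor(q / eps).astype(int)) for q in Q]
    labels = {}
    # setdefault assigns the next free index the first time a key is seen
    return np.array([labels.setdefault(k, len(labels)) for k in keys])

Q_values = np.array([[0.73, 0.66],
                     [0.73, 0.66],
                     [0.81, 0.66],
                     [0.82, 0.67]])
eps, gamma = 0.1, 0.9
phi = cluster_by_q_values(Q_values, eps)
bound = 2 * eps / (1 - gamma) ** 2    # upper bound on l_maximizing from the footnote
```

With these example values the first two states and the last two states are merged, giving a two-state abstraction.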

A reward-predictive state abstraction is obtained by fitting an LSFM with low prediction errors. Such an LSFM is obtained by minimizing a rescaled version of the loss objective presented by [9], evaluated on the transition and reward tables that the non-parametric Bayesian model computes from the observed interaction data. Each row of the matrix **Φ** stores a latent feature vector that is associated with a particular state. The loss objective is optimized using the Adam gradient optimizer [59] with a learning rate of 0.01 and *α*_{F} = 0.01. Otherwise TensorFlow’s [60] default parameters were used and the optimization process was initialized as described by [9]. After 1000 gradient update steps, the rows of the real-valued matrix **Φ** were clustered using agglomerative clustering to construct a partitioning of the state space and a state abstraction.

The resulting state representation will then have a low one-step-reward-prediction error *ε*_{r} and SF error *ε*_{ψ} and a low loss value *l*_{predictive}(*ϕ*) = *ε*_{r} + *ε*_{ψ}. During learning in a particular task, these loss values are computed on an incomplete transition table to compute the posterior over state abstractions needed for the mixture policy described in Eq (12) in the main article.

#### S.3.5 Hyper-Parameters Influence Belief Space Size

Fig S6 plots the average episode length and belief space size for different *α* and *β* settings for each model for the learning experiment presented in Section 3.3. Figs S6A and S6C indicate that a decrease in *α* corresponds in most cases to a smaller belief space size. As the parameter *α* is decreased, the CRP prior will emphasize re-use of previously learned state representations, leading to a smaller belief space size after observing all five tasks. Nevertheless, the *β* parameter also influences the belief space size. For high *β* parameters, the scores *l*_{predictive} or *l*_{maximizing} influence the posterior more strongly and if the loss values are low enough another state abstraction is added into the belief set despite the CRP prior. Consequently, a higher *β* parameter corresponds to a larger belief set size as plotted in the right panels of Fig S6. Figs S6B and S6D show that a small belief space (where *β* = 1) results in a high average episode length and thus poor performance on the tested task sequence. If the *α* parameter is increased for *β* = 1, then the average episode length decreases because the model tends to use multiple state abstractions in its belief set instead of enforcing the re-use of the same state abstraction. If the same state abstraction is re-used on all tasks, then the used state abstraction may remove information from the state that is needed to find an optimal policy, resulting in poor performance. For both models, the average episode length is lowest when the learned belief space has a small but not too small size. This suggests that learning and re-using a small set of state abstractions results in a shorter average episode length and better performance. If the belief space is too large, then the average episode length also increases because state abstractions are not re-used to accelerate learning in subsequent tasks and the model does not transfer previously learned structures.
Note that for a setting of *β* = 100 only the reward-predictive model constructs a belief set of size two for *α* ≤ 0.01 for all ten repeats. Because the task sequence was constructed by adding irrelevant state variables in one of two ways, this result suggests that the reward-predictive model detects these two generalization patterns.
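The qualitative effect of *α* and *β* described above can be reproduced with a small Python sketch. The sketch assumes that the posterior over candidate state abstractions is proportional to a CRP prior multiplied by exp(−*β* · *l*), where *l* is the loss of a candidate; this functional form, the function name `crp_posterior`, and all numbers are assumptions for illustration and should be checked against the model definition in the main article.

```python
import numpy as np

def crp_posterior(counts, losses, alpha, beta):
    """Posterior over previously used abstractions plus one new candidate.

    counts: how often each existing abstraction was used before.
    losses: loss of each existing abstraction, followed by the loss of the
            newly learned abstraction (one more entry than counts).
    """
    counts = np.asarray(counts, dtype=float)
    prior = np.append(counts, alpha) / (counts.sum() + alpha)   # CRP prior
    log_post = np.log(prior) - beta * np.asarray(losses, dtype=float)
    log_post -= log_post.max()                                  # numerical stability
    post = np.exp(log_post)
    return post / post.sum()

counts = [3, 1]                 # two abstractions already in the belief set
losses = [0.5, 0.4, 0.0]        # the new abstraction fits the current task best
low_beta = crp_posterior(counts, losses, alpha=0.01, beta=1.0)
high_beta = crp_posterior(counts, losses, alpha=0.01, beta=100.0)
```

With a small *β* the prior dominates and a previously used abstraction is preferred, whereas with a large *β* the loss dominates and the newly learned abstraction receives most of the posterior mass, which enlarges the belief set.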

Table 2 outlines the tested and best mixture model parameters for the simulations presented in Sec 3.3.

### S.4 Hyper-Parameters for Guitar Task Experiment

In the simulation presented in Sec 3.4, each learning algorithm interacted with each task for 50 episodes (trials). We make the assumption that the same note cannot be repeated in the scale. This assumption ensures for simplicity that the fret board can be used as a Markovian state. For each algorithm, learning on the first task was initialized by setting both the **v** vector and the **G** matrix (defined in Eq (23) and (24)) to zero. On task 2, values were also reset to zero for the baseline algorithm. The reward-predictive model (green curve in Fig 8) only re-used a previously learned reward-predictive state abstraction and all latent SFs and reward weight vectors were also reset to zero at the beginning of training in task 2. The SF transfer algorithm (orange curve in Fig 8) re-used the previously learned SF matrix **G** and the reward weight vector **w** was reset to zero, similar to [14].

The SF transfer algorithm and the reward-predictive model used a learning rate of *α*_{SF} = 0.9 and *α*_{r} = 0.9. The SF-learning baseline algorithm used a learning rate of *α*_{SF} = 0.1 and *α*_{r} = 0.9. A total of nine learning rate combinations were tested by setting each learning rate to either 0.1, 0.5, or 0.9.

**S7 Audio File.** Sound version of the optimal policy in the scale task 1. The sound file is available for download at this link: https://github.com/lucaslehnert/rewardpredictive/blob/master/audiofile/S7.wav.

**S8 Audio File.** Sound version of the optimal policy in the scale task 2. The sound file is available for download at this link: https://github.com/lucaslehnert/rewardpredictive/blob/master/audiofile/S8.wav.

**S9 Audio File.** Sound version of the SF transfer algorithm’s policy after learning for 25 episodes in scale task 2. The sound file is available for download at this link: https://github.com/lucaslehnert/rewardpredictive/blob/master/audiofile/S9.wav.

## Acknowledgments

We would like to thank Rex Liu, Nicholas T. Franklin, and Alana Jaskir for insightful comments on previous drafts of this article. Lucas Lehnert was supported by the NIH T32MH115895 Training program for Interactionist Cognitive Neuroscience (ICoN). Michael J. Frank was supported by grant NIH MH084840.

## Footnotes

^{1} If each state partition has Q-values that differ by at most *ε* from one another, then *l*_{maximizing}(*ϕ*) ≤ 2*ε*/(1 − *γ*)^{2}.