## Abstract

Human curiosity has been interpreted as a drive for exploration and modeled by intrinsically motivated reinforcement learning algorithms. An unresolved challenge in machine learning is that several of these algorithms get distracted by reward-independent stochastic stimuli. Here, we ask whether humans get distracted by the same stimuli as the algorithms. We design an experimental paradigm where human participants search for rewarding states in an environment with a highly ‘stochastic’ but reward-free sub-region. We show that (i) participants get repeatedly and persistently distracted by novelty in the stochastic part of the environment; (ii) optimism about the availability of other rewards increases this distraction; and (iii) the observed distraction pattern is consistent with the predictions of algorithms driven by novelty but not with ‘optimal’ algorithms driven by information-gain. Our results suggest that humans use suboptimal but computationally cheap curiosity-driven policies for exploration in complex environments.

## Introduction

Curiosity drives humans and animals to explore their environments^{1–3} and to search for potentially more valuable sources of reward (e.g., more nutritious foods or better-paid jobs) than those currently available^{4,5}. In computational neuroscience and psychology, intrinsically motivated reinforcement learning (RL) algorithms^{6,7} have been proposed as models of curiosity-driven behavior^{8–11} with novelty, surprise, or information-gain as intrinsic motivation in addition to the extrinsic motivation by nutritious or monetary reward^{11}. These algorithms have been successful not only in explaining aspects of exploration in humans and animals^{12–18} but also in solving complex machine learning tasks with sparse or even no (extrinsic) rewards such as in computer games^{19–21} and high-dimensional control problems^{22–25}. Despite their successes, these algorithms face a serious challenge: Intrinsically motivated agents are prone to distraction by reward-independent stochasticity (the so-called ‘noisy TV’ problem)^{26,27}, i.e., they are attracted to novel, surprising, or just noisy states independently of whether or not these states are rewarding^{28}.

The extent of distraction varies between different algorithms, and designing efficient noise-robust algorithms is an ongoing line of research in machine learning^{29–32}. In particular, it is well-known that artificial RL agents seeking information-gain eventually lose their interest in stochasticity when exploration yields no further information, whereas RL agents seeking surprise or novelty exhibit a persistent attraction to stochasticity^{26,27,33}. Here, we ask (i) whether humans get distracted in the same situations as intrinsically motivated RL agents and, if so, (ii) whether this distraction vanishes (similar to seeking information-gain) or persists (similar to seeking surprise or novelty) over time.

To answer these questions, we bring ideas from machine learning^{26,27} to behavioral neuroscience and design a novel experimental paradigm with a highly stochastic part. We test the predictions of three different intrinsically motivated algorithms (i.e., driven by novelty, surprise, and information-gain) against the behavior of human participants and show that human behavior is both qualitatively and quantitatively consistent with that of novelty-driven RL agents: Human participants exhibit a persistent distraction by novelty, and the degree of this distraction correlates with their degree of ‘reward optimism’, where reward optimism is defined by the experimental procedure. Our results provide evidence for (i) novelty-driven RL algorithms as models of human curiosity even when novelty-seeking is suboptimal and (ii) the influence of reward optimism on the relative importance of novelty-seeking versus reward-seeking in human decision making.

## Results

### Experimental Paradigm

We first design an experimental paradigm for human participants that allows us to dissociate predictions of different intrinsically motivated RL algorithms. We employ a sequential decision-making paradigm^{34–36} for navigation in an environment with 58 states plus three goal states (Fig. 1A-B). Three actions are available in each non-goal state, and agents can move from one state to another by choosing these actions (arrows in Fig. 1A-B). We use the term ‘agents’ to refer to either human participants or agents simulated by RL algorithms. In the human experiments, states are represented by images on a computer screen and actions by three disks below each image (Fig. 1C); for simulated participants, both states and actions are abstract entities (i.e., we consider RL in a tabular setting^{37}). The assignment of images to states and disks to actions is random but fixed throughout the experiment. Agents are informed that there are three different goal states in the environment (*G*^{∗}, *G*_{1}, and *G*_{2} in Fig. 1A) and that their task is to find a goal state 5 times; see Methods for how this information is incorporated in the RL algorithms. Importantly, neither human participants nor RL agents are aware of the total *number* of states or the *structure* of the environment (i.e., how states are connected to each other).

The 58 states of the environment can be classified into three groups: Progressing states (1 to 6 in Fig. 1A), trap states (7 and 8 in Fig. 1A), and stochastic states (S-1 to S-50 in Fig. 1B, shown as a dashed oval in Fig. 1A). In each progressing state, one action (‘progressing’ action) takes agents one step closer to the goals and another action (‘bad’ action) takes them to one of the trap states. The third action in states 1-3 and 5-6 is a ‘self-looping’ action that makes agents stay at the same state. Except for the progressing action in state 6, all these actions are deterministic, meaning that they always lead to the same next state. The progressing action in state 6 is *almost* deterministic: It takes agents to the ‘likely’ goal state *G*^{∗} with a probability of 1 − *ε* and to the ‘unlikely’ goal states *G*_{1} and *G*_{2} with equal probabilities of *ε*/2, where *ε* ≪ 1. In state 4, instead of a self-looping action, there is a ‘stochastic’ action that takes agents to a randomly chosen (with equal probability) stochastic state (Fig. 1B1). In each stochastic state, one action takes agents back to state 4 and two stochastic actions take them to *another* randomly chosen stochastic state (Fig. 1B2). In each trap state, all three actions are deterministic: Two actions bring agents to either the same or the other trap state and one action to state 1.
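The transition structure described above can be sketched in code. This is an illustrative reconstruction from the verbal description, not the authors' implementation: the integer state labels, the value of `EPS`, and the exact assignment of actions to indices are assumptions.

```python
import random

# Illustrative state labels (assumptions): 1-6 progressing, 7-8 traps,
# 100-149 the 50 stochastic states, and -1/-2/-3 standing in for the
# goal states G*, G1, and G2.
EPS = 0.01  # hypothetical small epsilon; the paper only requires it be small
STOCHASTIC = list(range(100, 150))

def step(state, action, rng=random):
    """Return the next state for (state, action); action is in {0, 1, 2}."""
    if state in (1, 2, 3, 5):  # progressing states with a self-looping action
        return {0: state + 1, 1: 7, 2: state}[action]
    if state == 4:             # self-loop replaced by a 'stochastic' action
        return {0: 5, 1: 8, 2: rng.choice(STOCHASTIC)}[action]
    if state == 6:             # almost-deterministic progressing action
        if action == 0:
            r = rng.random()   # G* with prob 1 - EPS, G1/G2 with EPS/2 each
            return -1 if r < 1 - EPS else (-2 if r < 1 - EPS / 2 else -3)
        return {1: 7, 2: 6}[action]
    if state in (7, 8):        # trap states: two trap actions, one escape
        return {0: 1, 1: 7, 2: 8}[action]
    # stochastic states: one action back to state 4, two stochastic actions
    return 4 if action == 0 else rng.choice(STOCHASTIC)
```

Under this sketch, repeatedly taking action 0 from state 1 progresses toward the goal region, while action 2 in state 4 enters the stochastic part.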

The stochastic part of the environment – which is inspired by the machine-learning literature and mimics the main features of a ‘noisy TV’^{28} – is a crucial difference from existing paradigms in behavioral neuroscience^{18,38,39}. Without the stochastic part, intrinsic motivation helps agents to avoid the trap states and find the goal^{18}; hence, it aids exploration before a goal is found and does not harm exploitation afterward. By adding the stochastic part, we aim to quantify how much exploitation of the discovered goal is reduced because of the distraction by the stochastic states.

We organize the experiment into 5 episodes: Agents are randomly initialized at state 1 or 2 and are instructed to find a goal 5 times. After finding a goal, agents are randomly re-initialized at state 1 or 2. We choose a small enough *ε* (Fig. 1A) so that we can safely assume that all agents visit only *G*^{∗} while being aware that *G*_{1} and *G*_{2} exist (Methods).

### Simulating intrinsically motivated agents with efficient algorithms

To formulate qualitative predictions for human behavior, we simulate three intrinsically motivated RL algorithms. Intrinsic motivation is described in each algorithm by ‘intrinsic rewards’ that agents give to themselves upon visiting ‘novel’, ‘surprising’, or ‘informative’ states (see Methods for details). Extrinsic rewards, on the other hand, are received only when visiting the three goal states. Agents simulated by each algorithm are able to navigate in an environment with an unknown number of states by seeking a combination of extrinsic and intrinsic rewards (Fig. 1D): At each time *t*, an agent observes state *s*_{t} and evaluates an extrinsic reward value *r*_{ext,t} (which is zero except at the goal states) and an intrinsic reward value *r*_{int,t} (e.g., novelty of state *s*_{t}). Extrinsic and intrinsic reward values are then passed to two parallel blocks of RL, each working with a single reward signal. Independently of each other, the two blocks use efficient model-based planning^{40,41} to propose a policy *π*_{ext,t} that maximizes future extrinsic rewards and *π*_{int,t} that maximizes future intrinsic rewards^{18,26}, respectively. The two policies are combined into a hybrid policy *π*_{t} for taking the next action *a*_{t}, controlled by a set of free parameters that indicate the relative importance of intrinsic over extrinsic rewards (Methods). The degree of exploration is high if *π*_{int,t} dominates *π*_{ext,t} during action-selection.
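The parallel-policy architecture of Fig. 1D can be sketched as follows. The convex-mixture rule and the single weight `w` are illustrative assumptions; the authors' exact combination rule and parameterization are given in their Methods.

```python
import numpy as np

def hybrid_policy(pi_ext, pi_int, w):
    """Combine extrinsic and intrinsic action policies into a hybrid policy.

    pi_ext, pi_int: action-probability vectors over the available actions.
    w in [0, 1]: relative importance of intrinsic rewards
    (w = 1: pure exploration, w = 0: pure exploitation).
    """
    pi_ext = np.asarray(pi_ext, dtype=float)
    pi_int = np.asarray(pi_int, dtype=float)
    pi = (1 - w) * pi_ext + w * pi_int
    return pi / pi.sum()  # renormalize for numerical safety
```

For example, with `w = 0.5` and fully opposed policies, the hybrid policy splits probability mass equally between the two preferred actions.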

For the intrinsic reward *r*_{int,t}, we choose one option from each of the three main categories of intrinsic rewards in machine learning^{26,27}: (i) novelty^{18–20} quantifies how infrequent the state *s*_{t} has been until time *t*; (ii) information-gain^{23,25,42,43} quantifies how much the agent updates its belief about the structure of the environment upon observing the transition from the state-action pair (*s*_{t−1}, *a*_{t−1}) to state *s*_{t}; and (iii) surprise^{21,28,44} quantifies how unexpected it is to observe state *s*_{t} after taking action *a*_{t−1} at state *s*_{t−1}.
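The three intrinsic-reward categories can be illustrated with count-based estimators. These are generic textbook-style definitions under a Dirichlet-style prior, not the paper's exact formulas (see their Methods); the prior strength and the use of `-log` frequencies are assumptions.

```python
import numpy as np

# C_s[s] counts visits to state s; C_sas[(s, a)][s_next] counts observed
# transitions from state-action pair (s, a) to state s_next.

def novelty(C_s, s, n_states, prior=1.0):
    """Count-based novelty of a *state*: -log of its estimated visit frequency."""
    total = sum(C_s.values()) + prior * n_states
    return -np.log((C_s.get(s, 0) + prior) / total)

def surprise(C_sas, s, a, s_next, n_states, prior=1.0):
    """Surprise of a *transition*: -log of the estimated probability of
    s_next after taking action a in state s."""
    counts = C_sas.get((s, a), {})
    total = sum(counts.values()) + prior * n_states
    return -np.log((counts.get(s_next, 0) + prior) / total)

def information_gain(C_sas, s, a, s_next, n_states, prior=1.0):
    """Information gain: KL divergence between the belief about the
    transition model for (s, a) after vs. before observing s_next."""
    counts = dict(C_sas.get((s, a), {}))
    total = sum(counts.values()) + prior * n_states
    old = {k: (counts.get(k, 0) + prior) / total for k in range(n_states)}
    counts[s_next] = counts.get(s_next, 0) + 1
    new = {k: (counts.get(k, 0) + prior) / (total + 1) for k in range(n_states)}
    return sum(new[k] * np.log(new[k] / old[k]) for k in range(n_states))
```

Note the structural difference that matters later in the paper: `novelty` depends only on state visit counts, whereas `surprise` and `information_gain` require a transition model over state-action pairs.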

The three different intrinsic reward signals lead to three efficient intrinsically motivated algorithms and to three groups of simulated efficient agents: those (i) seeking novelty, (ii) seeking information-gain, and (iii) seeking surprise. In the following section, we focus on episodes 2-5 and formulate the *qualitative* predictions of different intrinsically motivated RL algorithms for human behavior by characterizing the behavior of these simulated efficient agents. We note that these predictions are made by using efficient RL algorithms with perfect memory and high computational power. Thus, these efficient agents are a starting point and used only (i) to test whether our experimental paradigm dissociates action-choices of different intrinsically motivated RL algorithms and (ii) to gain insights about their principal differences; a more realistic simulation of human behavior is presented in a later section.

### Different intrinsically motivated algorithms exhibit principally different behavioral patterns

To avoid arbitrariness in the choice of parameters, we tune the parameters of each algorithm to minimize the average number of actions needed in episode 1 (i.e., for the most efficient exploration; Methods). As a result, different algorithms achieve a similar performance during episode 1 and find the goal *G*^{∗} almost equally fast (Supplementary Materials). Hence, exploration policies driven by different intrinsic rewards cannot be qualitatively distinguished during episode 1.

Given the same set of parameters, we study how different simulated efficient agents behave in episodes 2-5 (Fig. 2). After finding the goal *G*^{∗} for the 1st time, an agent has two options: (i) return to the discovered goal state *G*^{∗} (exploitation) or (ii) search for the other goal states *G*_{1} and *G*_{2} (exploration). In our simulations, we consider three choices for the trade-off between exploration and exploitation by changing the relative importance of *π*_{int,t} over *π*_{ext,t} (Fig. 1D): pure exploitation (the action policy does not depend on intrinsic rewards, i.e., *π*_{t} = *π*_{ext,t}), pure exploration (the action policy does not depend on extrinsic rewards, i.e., *π*_{t} = *π*_{int,t}), and a mixture of both (different shades of each color in Fig. 2). If the extrinsic reward value assigned to *G*_{1} or *G*_{2} is higher than the one assigned to *G*^{∗}, then the policy *π*_{ext,t} for seeking extrinsic rewards can also contribute to exploration in episodes 2-5 (Methods). In order to characterize qualitative features essential to exploration driven by different *intrinsic* rewards, we assume a symmetry between the three goal states in the simulated efficient agents and assign the same extrinsic reward value to all goals (Methods); we drop this assumption in the next sections and quantify the additional but negligible contribution of *π*_{ext,t} to explaining human exploration.

For all three groups of simulated efficient agents, decreasing the relative importance of intrinsic rewards decreases both the search duration (Fig. 2A1, C1, and E1) and the fraction of time spent in the stochastic part (Fig. 2A2, C2, and E2). This observation implies that intrinsically motivated exploration leads to an attraction to the stochastic part of the environment, effectively keeping the simulated efficient agents away from the goal region beyond state 6 (Fig. 1A). Our results thus confirm earlier findings in machine learning^{26,28} that intrinsically motivated agents get distracted by noisy reward-independent stimuli.

While all three groups of simulated efficient agents get distracted by the stochastic part, their degree of distraction is different (different colors in Fig. 2A3, C3, and E3). For efficient agents that purely seek information-gain (i.e., pure exploration), the time spent in the stochastic part decreases over episodes (Fig. 2C3), whereas we observe the opposite pattern for efficient agents that purely seek novelty (Fig. 2A3) or surprise (Fig. 2E3). In particular, efficient agents that purely seek surprise most often get stuck in the stochastic part (i.e., in >50% of simulations in episode 5) and do not escape it within 3000 actions (Fig. 2E3). These observations confirm the inefficiency of seeking surprise and the efficiency of seeking information-gain in dealing with noise^{26}.

In order to further dissociate action-choices of different algorithms, we analyze the action preferences of simulated efficient agents in state 4 during episode 2 (Fig. 2B, D, and F). For all three groups of efficient agents, increasing the relative importance of intrinsic rewards increases their preference for the stochastic action. However, for the highest importance of intrinsic rewards, the probability of choosing the progressing action is substantially lower than the probability of choosing the stochastic action for seeking surprise or information-gain (15% vs. 85%; Fig. 2D and F), whereas this difference is much smaller for seeking novelty (40% vs. 60%; Fig. 2B).

This distinct behavior of novelty-seeking is due to the fact that novelty is defined for states, whereas surprise and information-gain are defined for transitions (i.e., state-action pairs; see Methods): By the end of episode 1, the goal state has been observed only once and remains, during episode 2, relatively novel (and hence attractive for an efficient novelty-seeking agent) compared to most stochastic states, whereas there are many actions between the stochastic states that have rarely or potentially never been chosen and are, thus, attractive for an efficient agent seeking surprise or information-gain.

To summarize, different intrinsically motivated algorithms exhibit principally different behavioral patterns in our experimental paradigm. We consider these behavioral patterns as *qualitative* predictions for human behavior.

### Human participants

To test the predictions of intrinsically motivated algorithms, we first compare the exploratory behavior of human participants with that of simulated efficient agents. For simulated efficient agents, the relative importance of the intrinsic reward for action-selection (Fig. 1D) determines the balance of exploration versus exploitation. A challenge in human experiments is that we do not have explicit control over the variable that controls the relative importance of intrinsic rewards compared to extrinsic rewards. Inspired by earlier studies^{45–47}, we conjecture that human participants who are more optimistic about finding a goal with a high value of reward are more curious to explore the environment than human participants who are less optimistic. In other words, we hypothesize that the relative importance of intrinsic rewards in human participants is positively correlated with their degree of ‘reward optimism’, where we define reward optimism as the expectancy of finding a goal of higher value than those already discovered.

Based on this hypothesis, we include a novel reward manipulation in the instructions given at the beginning of the experiment: We inform human participants that there are three different possible reward states corresponding to values of 2 Swiss Franc (CHF), 3 CHF, and 4 CHF, represented by three different images (Methods). At the beginning of the experiment, we randomly assign the three different reward values to the goal states *G*^{∗}, *G*_{1}, and *G*_{2} in Fig. 1A, separately for each participant (without informing them), and keep the assignment fixed throughout the experiment. After this random assignment, *G*^{∗} has a different value for different participants. Even though all participants receive the same instructions, participants who are randomly assigned to an environment with 4 CHF reward value for *G*^{∗} do not have any monetary incentive to further explore in episodes 2-5 (= a low degree of reward optimism), whereas participants who are assigned to an environment with 2 CHF reward value for *G*^{∗} are likely to keep searching for more valuable goals in episodes 2-5 (= a high degree of reward optimism). Therefore, we have three different groups of participants with three different levels of reward optimism in episodes 2-5; see Methods for how this information is incorporated in the RL algorithms. We note that our definition of reward optimism in the context of our experiment is in line with, but independent of, the notion of general optimism that is quantified for individual participants in psychology^{48}.

Following a power analysis based on the data of simulated efficient agents (Methods), we recruited 63 human participants and collected their action-choices during the 5 episodes of our experiment: 23 participants in an environment with 2 CHF reward value for *G*^{∗} and 20 participants each in environments with 3 CHF and 4 CHF reward value for *G*^{∗}, respectively. In the rest of the manuscript, we refer to each group by their reward value of *G*^{∗}, e.g., the 3 CHF group is the group of human participants who were assigned to 3 CHF reward value for *G*^{∗} (as in Fig. 1C). We excluded the data of 6 human participants from further analyses since they either did not finish the experiment or had an abnormal performance (Methods).

### Human participants exhibit a persistent distraction by stochasticity

We perform the same series of analyses on the behavior of human participants as those performed on the behavior of simulated efficient agents (Fig. 3). In episodes 2-5, the search duration of human participants (Fig. 3A1) and the fraction of time they spend in the stochastic part (Fig. 3A2) are both negatively correlated with the goal value of their environment, e.g., the 2 CHF group has a longer search duration and spends more time in the stochastic part than the other two groups. Moreover, increasing the goal value increases the preference of human participants for the progressing action in state 4 during episode 2 (Fig. 3B). These observations support our hypothesis that increasing the degree of reward optimism influences the behavior of human participants in the same way as increasing the relative importance of intrinsic rewards influences the behavior of simulated efficient agents (e.g., compare Fig. 3A1, A2, and B with Fig. 2A1, A2, and B, respectively).

The behavior of the 2 CHF group is particularly interesting since they are the most optimistic group of participants. The 2 CHF group exhibits a constant search duration over episodes 2-5 (zero correlation accepted by Bayesian hypothesis testing^{49}; Fig. 3A3). This implies that they persistently explore the stochastic part. Moreover, during episode 2, the 2 CHF group chooses the progressing and the stochastic actions equally often (no difference in means accepted by Bayesian hypothesis testing^{49}; Fig. 3B). If we assume that the high degree of reward optimism in the 2 CHF group results in a policy that is driven dominantly by intrinsic rewards (driving exploration) and only marginally by extrinsic rewards, then these observations are more similar to the qualitative predictions of seeking novelty than those of seeking information-gain or surprise (compare Fig. 3B against Fig. 2B, D, and F).

### Novelty-seeking is the most probable model of human exploration

In the previous section, we observed that human participants exhibit patterns of behavior *qualitatively* similar to those of novelty-seeking simulated efficient agents. However, the qualitative predictions in Fig. 2 were made based on the assumptions of (i) using efficient RL algorithms with perfect memory and high computational power, (ii) using parameters that were optimized for the best performance in episode 1, and (iii) assigning the same extrinsic reward value to different goal states. In this section, we use a more realistic model of behavior than that of efficient agents in Fig. 2: In order to model the behavior of human participants, we use a hybrid RL model^{18,36,38,51} combining model-based planning^{41} and model-free habit-formation^{52}, account for imperfect memory and suboptimal choice of parameters, and allow our algorithms to assign different extrinsic reward values to different goal states (Methods). We fit the parameters of our three intrinsically motivated algorithms to the action-choices of each individual participant by maximizing the likelihood of data given parameters (Methods). Such a flexible modeling approach allows each of the three algorithms to find its closest version to the behavior of human participants, constrained to use one specific intrinsic reward signal (i.e., novelty, surprise, or information-gain).
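The maximum-likelihood fitting step can be sketched in miniature. This toy version assumes a softmax policy over hybrid Q-values with a single free parameter `w` (the relative importance of intrinsic rewards) fit by grid search; the authors fit a richer parameter set per participant (see Methods), so this is only a conceptual illustration.

```python
import numpy as np

def neg_log_likelihood(w, trials, beta=1.0):
    """Negative log-likelihood of observed choices under a softmax policy.

    trials: list of (q_ext, q_int, chosen_action) tuples, one per decision;
    q_ext and q_int are per-action value vectors (here taken as given).
    """
    nll = 0.0
    for q_ext, q_int, a in trials:
        q = (1 - w) * np.asarray(q_ext, float) + w * np.asarray(q_int, float)
        logits = beta * q
        m = logits.max()  # log-sum-exp trick for numerical stability
        logp = logits - m - np.log(np.sum(np.exp(logits - m)))
        nll -= logp[a]
    return nll

def fit_w(trials, grid=np.linspace(0.0, 1.0, 101)):
    """Grid-search maximum-likelihood estimate of w."""
    nlls = [neg_log_likelihood(w, trials) for w in grid]
    return grid[int(np.argmin(nlls))]
```

A participant who consistently chooses the intrinsically preferred action is fit with a `w` close to 1, i.e., a policy dominated by intrinsic rewards.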

Given the fitted algorithms, we use Bayesian model-comparison^{53,54} to *quantitatively* test whether human behavior is explained better by seeking novelty than seeking information-gain or surprise (Methods). Our model-comparison results show that seeking novelty is the most probable model for the majority of human participants, followed by seeking information-gain as the 2nd most probable model (Fig. 4A; Protected Exceedance Probability ^{54} = 0.99 and 0.01 for seeking novelty and information-gain, respectively). This result shows that seeking novelty describes the behavior of human participants better than seeking information-gain and surprise, but it does not tell us which aspects of data statistics cannot be explained by algorithms driven by information-gain or surprise. To investigate this question, we use our three intrinsically motivated algorithms with their fitted parameters and simulate new participants, i.e., we perform Posterior Predictive Checks (PPC)^{55,56}. As opposed to the simulations in Fig. 2, we do not freely choose the level of exploration in simulations for PPC. Rather, the level of exploration of each newly simulated participant is completely determined by the previously fitted parameters from one of the 57 human participants; specifically, each simulated participant belongs to one of the three groups of human participants (e.g., the 3 CHF group), and its action-choices are simulated using a set of parameters fitted to the action-choices of one human participant randomly selected from the participants in that group (Methods).
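As a rough illustration of model comparison, per-participant model evidence can be approximated with the Bayesian Information Criterion and turned into model probabilities. This fixed-effects toy sketch does not reproduce the random-effects analysis and protected exceedance probabilities^{54} used in the paper; the uniform model prior is an assumption.

```python
import numpy as np

def bic(nll, n_params, n_obs):
    """Bayesian Information Criterion from a fitted negative log-likelihood."""
    return 2.0 * nll + n_params * np.log(n_obs)

def model_posterior(bics):
    """Approximate posterior model probabilities from per-model BICs
    (summed over participants), assuming a uniform prior over models.
    Uses log-evidence ~ -BIC/2."""
    b = np.asarray(bics, dtype=float)
    log_ev = -0.5 * (b - b.min())  # shift by the minimum for stability
    p = np.exp(log_ev)
    return p / p.sum()
```

A model with a lower summed BIC receives a higher approximate posterior probability; the full Bayesian analysis additionally accounts for between-participant variability in which model generated the data.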

Given the PPC results, we first perform model-recovery^{55} on the data from the simulated participants: Model recovery confirms that we can infer which algorithm generated the action-choices of simulated participants (by repeating our model-fitting and -comparison; Fig. 4B). This implies that even the versions of different algorithms that are closest to human data can be dissociated in our experimental paradigm (average Protected Exceedance Probability^{54} *≥* 0.98 for the true model in Fig. 4B). Next, we perform a systematic comparison between the statistics of the action-choices of human participants and those of the simulated participants (the two most discriminating statistics are reported in Fig. 5A-B and a systematic analysis in Supplementary Materials). Our results show that simulated participants using novelty as intrinsic rewards reproduce all data statistics (including the zero correlation observed in Fig. 3A3; see Supplementary Materials), whereas simulated participants using information-gain or surprise fail to do so. The failure of algorithms using information-gain or surprise is most evident regarding the fraction of time spent in the stochastic part during episodes 2-5: (1) the 2 CHF group of simulated participants who seek information-gain or surprise spends a significantly smaller fraction of their time (less than half) in the stochastic part of the environment than the 2 CHF group of human participants (Fig. 5A); (2) simulated participants using information-gain or surprise fail to reproduce the observed negative correlation between the goal value and the fraction of time spent in the stochastic part (Fig. 5B). We emphasize that both shortcomings are observed despite the fact that the parameters of the algorithms had been previously optimized to explain the sequence of action-choices across the whole experiment as well as possible.

The failure of surprise-seeking algorithms to reproduce these statistics is due to the detrimental consequences of seeking surprise in the presence of stochasticity (e.g., as observed for the simulated efficient agents in episode 5 of Fig. 2E3). Hence, to stop the simulated participants from spending an enormous amount of time during episode 5 in the stochastic part of the environment, fitting surprise-seeking to action-choices of human participants yields a set of parameters that causes action-choices to be dominated by extrinsic reward (relative importance of surprise-seeking about 0.3 for the 2 CHF group; Fig. 5C), which in turn cannot explain the overall high level of exploration observed in the 2 CHF group of human participants (Fig. 5A). Similarly, the relative importance of information-gain is around 0.65 when parameters of a hybrid algorithm driven by information-gain are optimized to fit human behavior. A higher value of relative importance would make, during episode 2, the algorithm too attracted to the stochastic action in state 4 compared to humans (compare Fig. 2D with Fig. 3B). With such reduced importance of information-gain, the hybrid algorithm cannot, however, explain the specific behavioral features in Fig. 5A and B. Therefore, the attraction of human participants to the stochastic part has specific characteristics that are explained by seeking novelty but not by seeking surprise or information-gain.

Taken together, our results with simulated participants provide strong quantitative evidence for novelty-seeking as a model of human exploration in our experiment.

### Reward optimism correlates with relative importance of novelty

Using novelty-seeking as the most probable model of human behavior, we can now explicitly test our hypothesis that reward optimism increases the relative importance of intrinsic rewards. By analyzing the parameters of our novelty-seeking algorithm fitted to the behavioral data, we observe, in agreement with our hypothesis, a significant negative correlation between the relative importance of novelty during action-selection (in episodes 2-5) and the goal value participants found in episode 1 (Fig. 6A; parameter-recovery^{55} in Fig. 6C). Moreover, the participants in the 2 CHF group continue with an almost fully exploratory policy in episodes 2-5 indicating that they have only a small bias towards exploiting the small but known reward (Fig. 6A).

Since our simulated participants are informed that there are three different goal states in the environment, the reward-seeking component *π*_{ext,t} of the action-policy can also contribute to exploratory behavior, e.g., through optimistic initialization of *Q*-values^{37} or prior assumptions about the state-transitions (see Methods for a theoretical analysis). To study the extent of this contribution, we focus on episode 1 where this effect is most easily detectable: We observe a dominant influence of novelty-seeking on action-selection (Fig. 6B). This implies that, to explain human behavior, the knowledge of the existence of different goal states must drive exploration through a novelty-seeking policy instead of the optimistic initialization of a reward-seeking policy.

## Discussion

We designed a novel experimental paradigm to study human curiosity-driven exploration in the presence of stochasticity. We made two main observations: (i) Human participants who are optimistic about finding higher rewards than those already discovered are persistently distracted by stochasticity; and (ii) this persistent pattern of distraction is explained better by seeking novelty than seeking information-gain or surprise, even though seeking information-gain is theoretically more robust in dealing with stochasticity.

How humans deal with the exploration-exploitation trade-off has been a long-lasting question in neuroscience and psychology^{57,58}. Experimental studies have shown that humans use a combination of random and directed exploration^{13,16}, potentially linked to different neural mechanisms^{59–61}. However, there are multiple distinct theoretical models to describe directed exploration^{5,10,26,62–65}, and it has been debated which one is best suited to explain human behavior. In a general setting, human exploration is driven by multiple motivational signals^{15,65}, but it has also been shown that a particular signal can dominate exploration in specific tasks^{3,17,45,66–68}. In an earlier work^{18}, we have shown that novelty signals dominantly drive human exploration in situations where one needs to search for rewarding states in unknown but deterministic environments. Observations (i) and (ii) above provide further evidence for novelty as the dominant drive of human search strategy even in situations where seeking novelty is not optimal and leads to distraction by reward-independent stochasticity. Further experimental studies are needed to investigate the role of novelty in other types of human exploratory behavior.

Observation (ii) is particularly surprising as it has been believed that humans are not prone to the ‘noisy TV’ problem^{4,31,33}. Our results with human participants challenge the idea of defining curiosity as a normative solution to the exploration-exploitation trade-off^{3,6}; hence, algorithmic advances in machine learning do not necessarily help find better models of human exploration. However, we note that, for computing novelty, an agent only needs to track the state frequencies over time and does not need any knowledge of the environment’s structure (Methods); hence, computing novelty is computationally cheaper than computing information-gain. This suggests that a potentially higher level of distraction by noise in humans may be the price of reduced computational effort. In other words, novelty-seeking in the presence of stochasticity may not be a globally optimal strategy for exploration but can be an optimal strategy given a set of prior assumptions and computational constraints, i.e., a ‘resource rational’ policy^{69–71}.

The core assumption of using intrinsically motivated algorithms as models of human curiosity is the existence of an intrinsic exploration reward parallel to the extrinsic reward. There are, however, multiple ways to incorporate intrinsic rewards into the RL framework^{26}. The common practice is to use a weighted sum of the intrinsic and the extrinsic reward as a single scalar reward signal driving action-selection^{8,19,28}. An alternative approach is to treat different reward signals in parallel and compute separate action-policies which are combined to drive action-selection later in the processing stream^{18,72,73}. The latter approach provides higher flexibility to arbitrate between different policies based on changes in the relative importance of intrinsic versus extrinsic rewards because it does not need re-planning or re-evaluation of already learned policies. Parallel processing paths are compatible with the rapid change of behavior observed in our human participants after finding a goal at the end of episode 1 (Fig. 3B1-2) and also consistent with experimental evidence for partially separate neural pathways of novelty- and reward-induced behaviors^{18,74–78}.

We found that the relative importance of novelty- and reward-induced behaviors in human participants is correlated with the degree of reward optimism. This is in line with the known influence of environmental variables on an agent’s preference for novelty^{45,46,76}. In particular, theories of ‘motivation crowding effect’^{79} and ‘undermining effect’^{80,81} suggest that the absolute value of extrinsic reward might contribute, in addition to the reward optimism, to the observed negative correlation in Fig. 6A, predicting that even if participants were confident that there is no other goal state in the environment, the 2 CHF group would spend more time in the stochastic part than the 4 CHF group – simply because 2 CHF is not an attractive reward anyway. A potential future direction to investigate the interplay of novelty and reward is to study human behavior in various environments with different reward distributions and different sources of stochasticity.

Optimism in psychology has been defined as a ‘variable that reflects the extent to which people hold generalized favorable expectancies for their future’^{48} and has been linked to several neural and behavioral characteristics^{48,82,83}. While the traditional approach to measuring optimism is through self-report tests^{84}, statistical inference using RL^{85} and Bayesian^{86,87} models of behavior has more recently been proposed to quantify variables correlated with traditional measurements. While there are multiple ways to incorporate the notion of optimism into the RL framework (Methods), seeking intrinsic rewards has also been interpreted in the machine learning community as an ‘optimistic policy’ for exploration^{62}. Our results show that the preference for an intrinsic reward is indeed correlated with a notion of optimism defined in the context of our experiment as the expectancy of finding a goal of higher value in episodes 2-5 (‘reward optimism’ in Fig. 6A). Moreover, the persistent exploration of the stochastic part of our environment observed in the behavior of human participants (Fig. 3B3) is consistent with the known phenomena of optimism bias^{88} and optimistic belief updating in humans^{82,89,90}.

Even though the notions of ‘novelty’, ‘surprise’, and ‘information-gain’ are frequently used in neuroscience^{18,91,92}, psychology^{93,94}, and machine learning^{26,27,33}, there is no consensus on the precise definitions of these notions as scientific terms^{44,95}. Our results in this paper are based on the specific mathematical formulations that we have chosen (Methods), but we expect our conclusions to be invariant with respect to the precise choice of definitions as long as (i) novelty quantifies the infrequency of *states*^{18}, e.g., as defined by density models in machine learning^{19,20}; (ii) surprise quantifies mismatches between observations and an agent’s expectations, where the expectations are based on the previous *state-action* pair, including all measures of prediction surprise^{44} and typical measures of prediction error in machine learning^{21,28}; and (iii) information-gain quantifies improvements in the agent’s *world-model* and vanishes with the accumulation of experience, e.g., including Bayesian surprise^{91} and Postdictive surprise^{92} as well as measures of disagreement and progress-rate in machine learning^{23–25,29}.

In conclusion, our results show (i) that human decision-making is influenced by an interplay of intrinsic with extrinsic rewards that is controlled by reward optimism and (ii) that novelty-seeking RL algorithms can successfully model this interplay in tasks where humans search for rewarding states.

## Methods

### Ethics Statement

The data for the human experiment were collected under CE 164/2014, and the protocol was approved by the ‘Commission cantonale d’éthique de la recherche sur l’être humain’. All participants were informed that they could quit the experiment at any time, and they all signed a written informed consent. All procedures complied with the Declaration of Helsinki (except for pre-registration).

### Experimental procedure for human participants

63 participants joined the experiment. The data of 6 participants were removed (see below); thus, the data of 57 participants (27 female, mean age 24.1 ± 4.1 years) were included in the analyses. All participants were naïve to the purpose of the experiment and had normal or corrected-to-normal visual acuity. The experiment was scripted in MATLAB using the Psychophysics Toolbox^{96}.

Before starting the experiment, participants were informed that they needed to find one of the 3 goal states 5 times. They were shown the 3 goal images and informed that different images had different reward values of 2 CHF, 3 CHF, and 4 CHF. Specifically, they were given the example that ‘if you find the 2 CHF goal twice, the 3 CHF goal once, and the 4 CHF goal twice, then you will be paid 2 × 2 + 1 × 3 + 2 × 4 = 15 CHF’; see ‘Informing RL agents of different goal states and modeling optimism’ for how simulated efficient agents and simulated participants were given this information. At each trial, participants were presented with an image (state) and three grey disks below the image (Fig. 1C). Clicking on a disk (action) led participants to a subsequent image, chosen based on the underlying graph of the environment in Fig. 1A-B (which was unknown to the participants). Participants clicked through the environment until they found one of the goal states, which finished an episode (Fig. 1C).

The assignment of images to states and disks to actions was random but kept fixed throughout the experiment and among participants. Exceptionally, we did not make the assignment for the actions in state 4 before the start of the experiment. Rather, for each participant, we assigned the disk that was chosen in the 1st encounter of state 4 to the stochastic action and the other two disks randomly to the bad and progressing actions, respectively (Fig. 1A). With this assignment, we made sure that all human participants would visit the stochastic part at least once during episode 1. The same protocol was used for simulated efficient agents and simulated participants.

Before the start of the experiment, we randomly assigned the different goal images (corresponding to the three reward values) to the different goal states *G*^{∗}, *G*_{1}, and *G*_{2} in Fig. 1A, separately for each participant. The image and hence the reward value were then kept fixed throughout the experiment. In other words, we randomly assigned different participants to different environments with the same structure but different assignments of reward values. We therefore ended up with 3 groups of participants: 23 in the 2 CHF group, 20 in the 3 CHF group, and 20 in the 4 CHF group. The probability of encountering a goal state other than *G*^{∗} is controlled by the parameter *ε*. We considered *ε* to be around machine precision 10^{−8}, so we have (1 − *ε*)^{5×63} > 1 − 10^{−5} *≈* 1, meaning that all 63 participants would almost surely be taken to the goal state *G*^{∗} in all 5 episodes. We note, however, that a participant could in principle observe any of the 3 goals by choosing the progressing action at state 6 sufficiently many times, because lim_{t→∞}(1 − *ε*)^{t} = 0.
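As a quick numerical sanity check of this argument (a minimal sketch; *ε* = 10^{−8}, 63 participants, and 5 episodes are the values from the text):

```python
# Numerical check of the epsilon argument: with epsilon near machine
# precision, the probability that all 63 participants are routed to
# G* in all 5 episodes stays essentially at 1.
eps = 1e-8
p_all = (1 - eps) ** (5 * 63)   # all participants, all episodes
print(p_all > 1 - 1e-5)
```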

2 participants (in the 2 CHF group) did not finish the experiment, and 4 participants (1 in the 3 CHF group and 3 in the 4 CHF group) took more than 3 times the group-average number of actions in episodes 2-5 to finish the experiment. We considered the latter a sign of inattentiveness and removed these 6 participants from further analyses.

The sample size was determined by a power analysis performed on the data of the efficient simulations done for Fig. 2 (see ‘Efficient model-based planning for simulated participants’ for the simulation details). Our goal was to have a statistical power of more than 80% (with a significance level of 0.05) for the correlations in Fig. 2A, C, and E as well as for the differences at the highest importance of intrinsic rewards in Fig. 2D and F.

The correction for multiple hypothesis testing was done by controlling the False Discovery Rate at 0.05^{50} over all 22 null hypotheses tested in Fig. 3, Fig. 5, and Fig. 6 (p-value threshold: 0.034). All Bayes Factors (abbreviated BF in the figures) were evaluated using the Schwarz approximation^{49} to avoid any assumptions on the prior distribution. We note that evaluating the Bayes Factors using the priors suggested by ref.^{97,98} does not change our conclusions. We also note that using the Spearman correlation instead of the Pearson correlation in Fig. 2A, C, and E, Fig. 3A, and Fig. 6A does not change our conclusions.
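The FDR control above is the standard Benjamini-Hochberg step-up procedure; a minimal sketch with hypothetical p-values (not the paper's 22 actual p-values):

```python
# Sketch of the Benjamini-Hochberg procedure for controlling the
# False Discovery Rate at level q over m null hypotheses: sort the
# p-values and find the largest one below its stepped threshold q*k/m.
def bh_threshold(p_values, q=0.05):
    """Return the largest p-value declared significant by BH (or None)."""
    m = len(p_values)
    threshold = None
    for k, p in enumerate(sorted(p_values), start=1):
        if p <= q * k / m:
            threshold = p   # largest p passing the stepped criterion so far
    return threshold

# Hypothetical example with m = 5 tests:
ps = [0.001, 0.01, 0.02, 0.20, 0.80]
print(bh_threshold(ps, q=0.05))   # every p up to this value is significant
```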

### Full hybrid model

We first present the most general case of our algorithm as visualized in Fig. 1D and then explain the special cases used for simulating efficient agents (Fig. 2) and for modeling human behavior (Fig. 4-Fig. 6). We used ideas from non-parametric Bayesian inference^{99} to design an intrinsically motivated RL algorithm for environments where the total number of states is unknown. We present the final results here and provide the derivations and pseudo-code in the Supplementary Materials.

We indicate the sequences of states and actions until time *t* by *s*_{1:t} and *a*_{1:t}, respectively, and define the **set of all known states** at time *t* as

𝒮^{(t)} = {*s*_{1}, …, *s*_{t}} *∪* {*Ĝ*_{2CHF}, *Ĝ*_{3CHF}, *Ĝ*_{4CHF}}, (Eq. 1)

where *Ĝ*_{2CHF}, *Ĝ*_{3CHF}, and *Ĝ*_{4CHF} are our three different goal states: *Ĝ*_{2CHF} corresponds to the 2 CHF goal, *Ĝ*_{3CHF} to the 3 CHF goal, and *Ĝ*_{4CHF} to the 4 CHF goal. Note that *Ĝ*_{2CHF}, *Ĝ*_{3CHF}, and *Ĝ*_{4CHF} represent the images of the goal states and not their locations *G*^{∗}, *G*_{1}, and *G*_{2}, and that the assignment of images to locations is unknown to the model. Hence, from *t* = 0 on, the simulated efficient agents and the simulated participants are aware of the existence of multiple goal states in the environment. In a more general setting, {*Ĝ*_{2CHF}, *Ĝ*_{3CHF}, *Ĝ*_{4CHF}} should be replaced by the set of all states whose images were shown to participants prior to the start of the experiment. After a transition to state *s*_{t+1} = *s*′ resulting from taking action *a*_{t} = *a* at state *s*_{t} = *s*, the reward functions *R*_{ext} and *R*_{int,t} evaluate the reward values *r*_{ext,t+1} and *r*_{int,t+1}. We define the **extrinsic reward function** *R*_{ext} as

*R*_{ext}(*s, a → s*′) = *δ*_{s′,Ĝ_{2CHF}} + *r*_{3CHF} *δ*_{s′,Ĝ_{3CHF}} + *r*_{4CHF} *δ*_{s′,Ĝ_{4CHF}}, (Eq. 2)

where *δ* is the Kronecker delta function, and we assume (without loss of generality) a subjective extrinsic reward value of 1 for *Ĝ*_{2CHF} (the 2 CHF goal) and subjective extrinsic reward values of *r*_{3CHF} and *r*_{4CHF} for *Ĝ*_{3CHF} and *Ĝ*_{4CHF}, respectively. The prior information of human participants about the difference in the monetary reward values of different goal states can be modeled in simulated participants by varying *r*_{3CHF} and *r*_{4CHF} (see ‘Informing RL agents of different goal states and modeling optimism’). We discuss choices of *R*_{int,t} in the next section.

As a general choice for the RL algorithm in Fig. 1D, we consider a hybrid of model-based and model-free policies^{18,36,38,52}. The **model-free (MF) component** uses the sequence of states *s*_{1:t}, actions *a*_{1:t}, extrinsic rewards *r*_{ext,1:t}, and intrinsic rewards *r*_{int,1:t} (in the two parallel branches in Fig. 1D) and estimates the extrinsic and intrinsic *Q*-values *Q*_{MF,ext}^{(t)} and *Q*_{MF,int}^{(t)}, respectively. Traditionally, MF algorithms do not need to know the total number of states^{37}; thus, the MF component of our algorithm remains similar to that of previous studies^{18,35}: At the beginning of episode 1, we initialize the *Q*-values at *Q*_{MF,ext}^{(0)}(*s, a*) = *Q*_{0,ext} and *Q*_{MF,int}^{(0)}(*s, a*) = *Q*_{0,int} for all state-action pairs. Then, the estimates are updated recursively after each new observation. After the transition (*s*_{t}, *a*_{t}) *→ s*_{t+1}, the agent computes extrinsic and intrinsic reward prediction errors *RPE*_{ext,t+1} and *RPE*_{int,t+1}, respectively:

*RPE*_{ext,t+1} = *r*_{ext,t+1} + *λ*_{ext} *V*_{ext}^{(t)}(*s*_{t+1}) − *Q*_{MF,ext}^{(t)}(*s*_{t}, *a*_{t}) and *RPE*_{int,t+1} = *r*_{int,t+1} + *λ*_{int} *V*_{int}^{(t)}(*s*_{t+1}) − *Q*_{MF,int}^{(t)}(*s*_{t}, *a*_{t}), (Eq. 3)

where *λ*_{ext} and *λ*_{int} *∈* [0, 1) are the discount factors for extrinsic and intrinsic reward seeking, respectively, and *V*_{ext}^{(t)}(*s*_{t+1}) and *V*_{int}^{(t)}(*s*_{t+1}) are the extrinsic and intrinsic *V*-values^{37} of the state *s*_{t+1}, respectively. We use two separate eligibility traces^{35,37} for the update of the *Q*-values, one for the extrinsic reward (*e*_{ext,t}) and one for the intrinsic reward (*e*_{int,t}), both initialized at zero at the beginning of each episode. The update rules for the eligibility traces after taking action *a*_{t} at state *s*_{t} are

*e*_{ext,t+1}(*s, a*) = *µ*_{ext} *λ*_{ext} *e*_{ext,t}(*s, a*) + *δ*_{s,s_{t}} *δ*_{a,a_{t}} and *e*_{int,t+1}(*s, a*) = *µ*_{int} *λ*_{int} *e*_{int,t}(*s, a*) + *δ*_{s,s_{t}} *δ*_{a,a_{t}}, (Eq. 4)

where *λ*_{ext} and *λ*_{int} are the discount factors defined above, and *µ*_{ext} and *µ*_{int} *∈* [0, 1] are the decay factors of the eligibility traces for the extrinsic and intrinsic rewards, respectively. The update rule for the *Q*-values is then *Q*^{(t+1)}(*s, a*) = *Q*^{(t)}(*s, a*) + *ρ* *e*_{t+1}(*s, a*) *RPE*_{t+1}, where *e*_{t+1} is the eligibility trace (i.e., either *e*_{ext,t+1} or *e*_{int,t+1}), *RPE*_{t+1} is the corresponding reward prediction error (i.e., either *RPE*_{ext,t+1} or *RPE*_{int,t+1}), and *ρ* *∈* [0, 1) is the learning rate.
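The MF branch can be sketched as follows for one reward channel; the three-action state space, the parameter values, and the use of the maximum over actions as the *V*-value are illustrative assumptions:

```python
# Minimal sketch of the model-free TD(lambda) update used for each of
# the two parallel reward channels (extrinsic and intrinsic): compute
# a reward prediction error, mark the visited state-action pair in the
# eligibility trace, then credit all eligible pairs.
from collections import defaultdict

def mf_update(Q, e, s, a, s_next, r, lam, mu, rho, terminal=False):
    """One TD(lambda) step; actions are assumed to be 0, 1, 2 (the disks)."""
    V_next = 0.0 if terminal else max(Q[(s_next, b)] for b in range(3))
    rpe = r + lam * V_next - Q[(s, a)]   # reward prediction error
    e[(s, a)] += 1.0                     # mark the visited pair
    for key in list(e):                  # credit all eligible pairs
        Q[key] += rho * e[key] * rpe
        e[key] *= lam * mu               # decay the trace
    return rpe

Q_ext = defaultdict(float)   # extrinsic Q-values, initialized at 0 here
e_ext = defaultdict(float)   # extrinsic eligibility trace
# An unrewarded transition followed by a rewarded one: the trace lets
# the reward propagate back to the earlier state-action pair.
mf_update(Q_ext, e_ext, s=0, a=1, s_next=1, r=0.0, lam=0.9, mu=0.9, rho=0.1)
mf_update(Q_ext, e_ext, s=1, a=0, s_next=2, r=1.0, lam=0.9, mu=0.9, rho=0.1)
print(round(Q_ext[(1, 0)], 3), round(Q_ext[(0, 1)], 3))
```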

The **model-based (MB) component** builds a world-model that summarizes the structure of the environment by estimating the probability *p*^{(t)}(*s*′|*s, a*) of the transition (*s, a*) *→ s*′. To do so, the agent counts the transitions (*s, a*) *→ s*′ recursively, using a leaky integration^{100,101}:

*C*_{t+1}(*s, a → s*′) = *κ* *C*_{t}(*s, a → s*′) + *δ*_{s,s_{t}} *δ*_{a,a_{t}} *δ*_{s′,s_{t+1}}, with *C*_{0}(*s, a → s*′) = 0, (Eq. 5)

where *δ* is the Kronecker delta function, and *κ* *∈* [0, 1] is the leak parameter accounting for imperfect memory and model-building in humans. If *κ* = 1, then *C*_{t}(*s, a → s*′) is the exact count of the transition (*s, a*) *→ s*′. These counts are used to estimate the transition probabilities

*p*^{(t)}(*s*′|*s, a*) = (*C*_{t}(*s, a → s*′) + *ϵ*_{obs}) / (*C*_{t}(*s, a*) + |𝒮^{(t)}| *ϵ*_{obs} + *ϵ*_{new}) for *s*′ *∈* 𝒮^{(t)}, (Eq. 6)

with the remaining probability mass assigned to the event of transitioning to a state outside 𝒮^{(t)}, where *C*_{t}(*s, a*) = Σ_{s′} *C*_{t}(*s, a → s*′) is the count of taking action *a* at state *s*, *ϵ*_{obs} *∈* ℝ^{+} is a free parameter for the prior probability of transition to a known state (i.e., states in 𝒮^{(t)}), and *ϵ*_{new} *∈* ℝ^{+} is a free parameter for the prior probability of transition to a new state (i.e., states not in 𝒮^{(t)}) – see Supplementary Materials for derivations. Choosing *ϵ*_{new} = 0 is equivalent to assuming that there is no unknown state in the environment, in which case the estimate in Eq. 6 reduces to the classic Bayesian estimate of transition probabilities in bounded discrete environments^{18,36}. The transition probabilities are then used in a novel variant of prioritized sweeping^{37,40} adapted to deal with an unknown number of states. The prioritized sweeping algorithm computes a pair of *Q*-values, i.e., *Q*_{MB,ext}^{(t)} for extrinsic and *Q*_{MB,int}^{(t)} for intrinsic rewards, by solving the corresponding Bellman equations^{37} with *T*_{PS,ext} and *T*_{PS,int} iterations, respectively; see Supplementary Materials for details.
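The MB counting and probability estimate can be sketched as follows, assuming a Dirichlet-style smoothing in which each known successor receives pseudo-count `eps_obs` and unseen states jointly receive `eps_new` (the paper's exact convention is derived in its Supplementary Materials):

```python
# Sketch of the world-model: leaky transition counts plus a smoothed
# transition-probability estimate over the currently known states.
class WorldModel:
    def __init__(self, kappa=1.0, eps_obs=1e-4, eps_new=1e-5):
        self.kappa, self.eps_obs, self.eps_new = kappa, eps_obs, eps_new
        self.counts = {}   # counts[(s, a, s_next)] -> leaky count
        self.known = set() # the set S^(t) of known states

    def observe(self, s, a, s_next):
        self.known.update([s, s_next])
        for key in self.counts:        # leaky integration: old counts decay
            self.counts[key] *= self.kappa
        key = (s, a, s_next)
        self.counts[key] = self.counts.get(key, 0.0) + 1.0

    def prob(self, s, a, s_next):
        c_sa = sum(v for (s0, a0, _), v in self.counts.items()
                   if s0 == s and a0 == a)
        c_sas = self.counts.get((s, a, s_next), 0.0)
        denom = c_sa + len(self.known) * self.eps_obs + self.eps_new
        return (c_sas + self.eps_obs) / denom

m = WorldModel(kappa=1.0)   # kappa = 1 means exact (non-leaky) counts
m.observe(0, 1, 2)
m.observe(0, 1, 2)
print(round(m.prob(0, 1, 2), 4))   # close to, but never exactly, 1
```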

Finally, actions are chosen by **a hybrid softmax policy**^{37}: The probability of taking action *a* in state *s* at time *t* is

*π*^{(t)}(*a*|*s*) *∝* exp(*β*_{MB,ext} *Q*_{MB,ext}^{(t)}(*s, a*) + *β*_{MF,ext} *Q*_{MF,ext}^{(t)}(*s, a*) + *β*_{MB,int} *Q*_{MB,int}^{(t)}(*s, a*) + *β*_{MF,int} *Q*_{MF,int}^{(t)}(*s, a*)), (Eq. 7)

where *β*_{MB,ext} *∈* ℝ^{+}, *β*_{MF,ext} *∈* ℝ^{+}, *β*_{MB,int} *∈* ℝ^{+}, and *β*_{MF,int} *∈* ℝ^{+} are free parameters (i.e., inverse temperatures of the softmax policy^{37}) expressing the contribution of each *Q*-value to action-selection. For Fig. 1D, we defined the total extrinsic and intrinsic *Q*-values

*Q*_{ext}^{(t)}(*s, a*) = *β*_{MB,ext} *Q*_{MB,ext}^{(t)}(*s, a*) + *β*_{MF,ext} *Q*_{MF,ext}^{(t)}(*s, a*) and *Q*_{int}^{(t)}(*s, a*) = *β*_{MB,int} *Q*_{MB,int}^{(t)}(*s, a*) + *β*_{MF,int} *Q*_{MF,int}^{(t)}(*s, a*), (Eq. 8)

and, as a result,

*π*^{(t)}(*a*|*s*) *∝* exp(*Q*_{ext}^{(t)}(*s, a*) + *Q*_{int}^{(t)}(*s, a*)). (Eq. 9)
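A minimal sketch of the hybrid softmax policy; the *Q*-values and inverse temperatures below are illustrative, not fitted:

```python
# Sketch of the hybrid softmax policy: the four Q-values are combined
# linearly with their inverse temperatures, then passed through a
# softmax over the three disks (actions).
import math

def hybrid_softmax(q_mb_ext, q_mf_ext, q_mb_int, q_mf_int,
                   b_mb_ext, b_mf_ext, b_mb_int, b_mf_int):
    """Return action probabilities; each q_* is a list over actions."""
    logits = [b_mb_ext * q1 + b_mf_ext * q2 + b_mb_int * q3 + b_mf_int * q4
              for q1, q2, q3, q4 in zip(q_mb_ext, q_mf_ext, q_mb_int, q_mf_int)]
    z = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(l - z) for l in logits]
    total = sum(exps)
    return [x / total for x in exps]

# Hypothetical Q-values: action 1 is most novel, action 0 most rewarding.
p = hybrid_softmax([1.0, 0.0, 0.0], [0.5, 0.5, 0.0],
                   [0.0, 2.0, 0.0], [0.0, 0.0, 0.0],
                   b_mb_ext=1.0, b_mf_ext=1.0, b_mb_int=1.0, b_mf_int=1.0)
print([round(x, 3) for x in p])   # novelty dominates with these weights
```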

In general, the contribution of seeking extrinsic reward and seeking intrinsic reward as well as the MB and MF branches to action-selection depends on different factors, including time passed since the beginning of the experiment^{39,52}, cognitive load^{102}, and whether the location of reward is known^{18}. Here, we make a simplistic assumption that these contributions (expressed as the 4 inverse temperatures) are constant within but potentially different between the two phases of the experiment:

Phase 1: Before finding the goal state in episode 1, we consider *β*_{MB,ext}^{(1)}, *β*_{MF,ext}^{(1)}, *β*_{MB,int}^{(1)}, and *β*_{MF,int}^{(1)} as four independent free parameters chosen independently for each agent.

Phase 2: After finding the goal, i.e., in all episodes after episode 1, we consider *β*_{MB,ext}^{(2)}, *β*_{MF,ext}^{(2)}, *β*_{MB,int}^{(2)}, and *β*_{MF,int}^{(2)} as another four independent free parameters chosen independently for each agent.

See ‘Relative importance of novelty in action-selection’ for how these inverse temperatures relate to the influence of intrinsic and extrinsic rewards on action-choices (Fig. 5C and Fig. 6).

#### Summary of free parameters

To summarize, the full hybrid algorithm has 22 free parameters:

Φ = {*r*_{3CHF}, *r*_{4CHF}, *Q*_{0,ext}, *Q*_{0,int}, *λ*_{ext}, *λ*_{int}, *µ*_{ext}, *µ*_{int}, *ρ*, *κ*, *ϵ*_{new}, *ϵ*_{obs}, *T*_{PS,ext}, *T*_{PS,int}, *β*_{MB,ext}^{(1)}, *β*_{MF,ext}^{(1)}, *β*_{MB,int}^{(1)}, *β*_{MF,int}^{(1)}, *β*_{MB,ext}^{(2)}, *β*_{MF,ext}^{(2)}, *β*_{MB,int}^{(2)}, *β*_{MF,int}^{(2)}},

where *r*_{3CHF} and *r*_{4CHF} are the subjective values of the 3 CHF goal and the 4 CHF goal, respectively (with the 2 CHF goal being the reference goal with a value of 1), *Q*_{0,ext} and *Q*_{0,int} are the initial values of the MF *Q*-values, *λ*_{ext} and *λ*_{int} are the discount factors, *µ*_{ext} and *µ*_{int} are the decay factors of the eligibility traces, *ρ* is the MF learning rate, *κ* is the leak parameter for model-building, *ϵ*_{new} and *ϵ*_{obs} are the prior parameters for model-building, *T*_{PS,ext} and *T*_{PS,int} are the numbers of iterations for prioritized sweeping, and the eight *β*'s are the phase-dependent inverse temperatures of the softmax policy.

### Different choices of intrinsic reward

The intrinsic reward function *R*_{int,t} maps a transition (*s, a*) *→ s*^{′} to an intrinsic reward value, i.e., *r*_{int,t+1} = *R*_{int,t}(*s*_{t}, *a*_{t} *→ s*_{t+1}). In this section, we present our 3 choices of *R*_{int,t}.

#### Novelty

For an agent seeking novelty (red in Fig. 2, Fig. 4, and Fig. 5), we define the intrinsic reward function as

*R*_{int,t}(*s, a → s*′) = −log *P*^{(t)}(*s*′), (Eq. 10)

where *P*^{(t)}(*s*′) *∝* *C*_{t}(*s*′) + 1 is the state frequency, with a pseudo-count of 1 added to the count *C*_{t}(*s*′) of encounters of state *s*′ up to time *t* (computed similarly to Eq. 5). With this definition, which generalizes earlier work^{18} to the case where the number of states is unknown, the least novel states are those that have been encountered most often (i.e., those with the highest *C*_{t}(*s*′)). Moreover, novelty is at its highest value for unobserved states, since *C*_{t}(*s*′) = 0 for any unobserved state *s*′ *∉* 𝒮^{(t)}. Similar intrinsic rewards have been used in machine learning^{19,20}.
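A pseudo-count novelty reward of this flavor can be sketched as follows; the +1 pseudo-count and the normalization constant are illustrative assumptions rather than the paper's exact convention:

```python
# Sketch of a pseudo-count based novelty reward: novelty is the
# negative log of a state's estimated frequency, so rarely visited
# and never-visited states yield the largest intrinsic reward.
import math

class NoveltyReward:
    def __init__(self):
        self.counts = {}   # counts of encountered states
        self.t = 0         # total number of observations so far

    def observe(self, s):
        self.counts[s] = self.counts.get(s, 0.0) + 1.0
        self.t += 1

    def reward(self, s_next):
        # +1 pseudo-count; the extra +1 in the denominator reserves
        # probability mass for not-yet-observed states (assumption)
        c = self.counts.get(s_next, 0.0)
        freq = (c + 1.0) / (self.t + len(self.counts) + 1.0)
        return -math.log(freq)

nov = NoveltyReward()
for s in [0, 0, 0, 1]:
    nov.observe(s)
# rarer states are more novel; the unseen state 2 is the most novel:
assert nov.reward(2) > nov.reward(1) > nov.reward(0)
print(round(nov.reward(2), 3))
```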

#### Surprise

For an agent seeking surprise (orange in Fig. 2, Fig. 4, and Fig. 5), we define the intrinsic reward function as the Shannon surprise (a.k.a. surprisal)^{44}

*R*_{int,t}(*s, a → s*′) = −log *p*^{(t)}(*s*′|*s, a*), (Eq. 11)

where *p*^{(t)}(*s*′|*s, a*) is defined in Eq. 6. With this definition, the expected (over *s*′) intrinsic reward of taking action *a* at state *s* is equal to the entropy of the distribution *p*^{(t)}(*s*′|*s, a*)^{103}. If *ϵ*_{new} *< ϵ*_{obs}, then the most surprising transitions are those to unobserved states. Similar intrinsic rewards have been used in machine learning^{21,28}.
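A minimal sketch of the Shannon surprise and of the entropy identity noted above (the predicted next-state distribution is hypothetical):

```python
# Sketch of the Shannon surprise reward: the negative log of the
# probability the world-model assigned to the observed transition.
import math

def shannon_surprise(p_pred):
    """Surprise of observing a next state the model predicted with p_pred."""
    return -math.log(p_pred)

# A likely transition is unsurprising; a rare one is highly surprising:
assert shannon_surprise(0.9) < shannon_surprise(0.01)

# The expected surprise of an action equals the entropy of its
# predicted next-state distribution:
p = [0.7, 0.2, 0.1]   # hypothetical p(s'|s,a)
expected = sum(pi * shannon_surprise(pi) for pi in p)
entropy = -sum(pi * math.log(pi) for pi in p)
assert abs(expected - entropy) < 1e-12
print(round(expected, 4))
```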

#### Information-gain

For an agent seeking information-gain (green in Fig. 2, Fig. 4, and Fig. 5), we define the intrinsic reward function as

*R*_{int,t}(*s, a → s*′) = D_{KL}(*p*^{(t+1)}(·|*s, a*) || *p*^{(t)}(·|*s, a*)), (Eq. 12)

where D_{KL} is the Kullback-Leibler divergence^{103}, and *p*^{(t+1)} is the updated world-model upon observing (*s, a*) *→ s*′. The dots in Eq. 12 denote the dummy variable over which we integrate to evaluate the Kullback-Leibler divergence. Note that if *s*′ *∉* 𝒮^{(t)}, the naïve computation of D_{KL} runs into technical problems, since *p*^{(t)} and *p*^{(t+1)} have different supports. We deal with these problems by using a more fundamental definition of D_{KL} based on the Radon–Nikodym derivative; see Supplementary Materials for derivations and ref.^{42} for an alternative heuristic solution. Note that the information-gain in Eq. 12 has also been interpreted as a measure of surprise (called ‘Postdictive surprise’^{92}), but its behavior is radically different from that of the Shannon surprise introduced above (Eq. 11); see ref.^{44} for an elaborate treatment of the topic.

Importantly, the expected (over *s*′) information-gain corresponding to a state-action pair (*s, a*) converges to 0 as the count *C*_{t}(*s, a*) of taking action *a* in state *s* goes to infinity (see Supplementary Materials for the proof). Similar intrinsic rewards have been used in machine learning^{23,29,33,42}.
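The vanishing of information-gain with accumulated experience can be illustrated on a fixed support; the Dirichlet-style update with concentration `alpha` is an assumption, and keeping the support fixed sidesteps the support-mismatch issue discussed above:

```python
# Sketch of information-gain as the KL divergence between the
# transition model after and before one observation.
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def info_gain(counts, observed, alpha=0.1):
    """KL( p_after || p_before ) for one extra count on `observed`."""
    def normalize(c):
        total = sum(c) + alpha * len(c)
        return [(ci + alpha) / total for ci in c]
    before = normalize(counts)
    after_counts = list(counts)
    after_counts[observed] += 1   # the single new observation
    return kl(normalize(after_counts), before)

# Early observations are informative; later ones barely move the model:
g_early = info_gain([1, 1, 1], observed=0)
g_late = info_gain([100, 100, 100], observed=0)
assert g_early > g_late > 0
print(round(g_early, 4), round(g_late, 6))
```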

### Informing RL agents of different goal states and modeling optimism

Human participants had been informed that there were different goal states in the environment with different monetary reward values. This information was meant to motivate participants to further explore the environment after they received the first reward at the end of episode 1. This information is incorporated into our hybrid algorithms through a few mechanisms, some of which include explicit information about the goal states while others include only an implicit notion of optimism.

Our main focus throughout the paper has been on modeling reward optimism by balancing intrinsic against extrinsic rewards (Fig. 2, Fig. 5, and Fig. 6). In particular, assigning different values to *β*_{MB,ext}, *β*_{MF,ext}, *β*_{MB,int}, and *β*_{MF,int} (cf. Eq. 7) during the two phases of the experiment enables us to implicitly make the relative importance of intrinsic rewards depend on the difference between the reward value *r*_{G^∗} of the discovered goal and the known reward values of the other goal states (Eq. 2). Our results for the fitted relative importance of intrinsic reward across different groups of human participants (Fig. 6A) support this very assumption, which implies that the influence of reward optimism on action-choices acts via regulation of the balance between two separate policies, one for seeking intrinsic and one for seeking extrinsic rewards.

However, there are two other alternative mechanisms, purely based on seeking extrinsic rewards, that can contribute to reward-optimism in our hybrid algorithms: The model-based and model-free optimistic initialization. In this section, we discuss these mechanisms and how they balance exploration versus exploitation. We note that our results in Fig. 4 and Fig. 6 (particularly Fig. 6B) show that these two mechanisms alone are not enough and that a novelty-seeking module is necessary to explain the behavior of human participants; otherwise, all three intrinsically motivated algorithms would have the same probability of generating human data – because the purely reward-seeking algorithm with optimistic initialization is a special case of all three intrinsically motivated algorithms that we compared. In other words, if optimistic initialization alone were sufficient to explain human behavior, then all three algorithms would perform equally well in Fig. 4 and the best fit would indicate a relative importance of 0 for novelty in Fig. 6.

### Model-based optimistic initialization

MB optimistic initialization is an explicit approach to model reward-optimism through the design of the world-model. The MB branch of the hybrid algorithm finds the extrinsic *Q*-values by solving the Bellman equations

*Q*_{MB,ext}^{(t)}(*s, a*) = *R̄*_{ext}^{(t)}(*s, a*) + *λ*_{ext} Σ_{s′} *p*^{(t)}(*s*′|*s, a*) max_{a′} *Q*_{MB,ext}^{(t)}(*s*′, *a*′), (Eq. 13)

where *p*^{(t)}(*s*′|*s, a*) is the estimated transition probability in Eq. 6, and

*R̄*_{ext}^{(t)}(*s, a*) = Σ_{s′} *p*^{(t)}(*s*′|*s, a*) *R*_{ext}(*s, a → s*′) (Eq. 14)

is the average immediate extrinsic reward expected to be collected by taking action *a* in state *s* (see Eq. 2). Hence, the knowledge of the existence of three different goal states with three different rewards has an explicit influence on the MB branch of our algorithms. For example, because no transitions to any of the goal states have been experienced during episode 1, we have

*R̄*_{ext}^{(t)}(*s, a*) = (1 + *r*_{3CHF} + *r*_{4CHF}) *ϵ*_{obs} / (*C*_{t}(*s, a*) + |𝒮^{(t)}| *ϵ*_{obs} + *ϵ*_{new}). (Eq. 15)

This equation has two important implications. First, *R̄*_{ext}^{(t)}(*s, a*) is an increasing function of *ϵ*_{obs}. This implies that the expected reward of a transition during episode 1 increases with the prior probability of transition to states in 𝒮^{(t)}. This is a direct consequence of our Bayesian approach to estimating the world-model. Second, *R̄*_{ext}^{(t)}(*s, a*) is a decreasing function of *C*_{t}(*s, a*). This implies that the expected reward of a state-action pair decreases with experience. Importantly, *R̄*_{ext}^{(t)}(*s, a*) converges to 0 as *C*_{t}(*s, a*) *→ ∞*, which makes a link between exploration driven by the MB optimistic initialization and exploration driven by information-gain.

During episodes 2-5, the exact theoretical analysis of the MB optimistic initialization is rather complex. However, using a few approximation steps for episode 2, we can find a condition for whether the MB extrinsic *Q*-values show a preference for exploring or leaving the stochastic part (Supplementary Materials). The condition involves a comparison between the discounted reward value of the discovered goal state and an optimistic estimate of a reward-to-be-found in the stochastic part that depends on *r*_{3CHF} and *r*_{4CHF} (the subjective values of the goals not yet found) and on the average pseudo-count of state-action pairs in the stochastic part (Supplementary Materials). We can show that, for a fixed average pseudo-count, increasing *r*_{3CHF} and *r*_{4CHF} eventually results in a preference for staying in the stochastic part: if the reward value of a goal state to-be-found is much greater than the value of the discovered goal state, then the agent prefers to keep exploring the stochastic part. However, for any value of *r*_{3CHF} and *r*_{4CHF}, increasing the average pseudo-count eventually results in a preference for leaving the stochastic part and going towards the already discovered goal: after a sufficiently long and unsuccessful exploration phase, agents eventually give up exploration. This is another qualitative link between exploration based on the MB optimistic initialization and exploration driven by information-gain, and it leads to the conclusion that an agent with only the MB optimistic initialization cannot explain human behavior, for the same reason that an agent with an intrinsic reward based on information-gain cannot.

### Model-free optimistic initialization

As opposed to the MB branch of the hybrid algorithm, the MF branch does not have any explicit knowledge about the existence of different goal states and their values. However, the initial value *Q*_{0,ext} of the MF extrinsic *Q*-values quantifies an expectation about the reward values in the environment prior to any interaction with it. During episode 1, no extrinsic reward is received by the agent; hence, for a small enough learning rate *ρ* and an optimistic initialization *Q*_{0,ext} > 0, the extrinsic reward prediction errors are always negative (Eq. 3). As a result, *Q*_{MF,ext}^{(t)}(*s, a*) decreases as an agent keeps taking action *a* in state *s*, which motivates the agent to try new actions. This is a well-known mechanism for directed exploration in the machine learning community^{37}. Similar to the MB optimistic initialization, the effect of the MF optimistic initialization fades out over time – which makes both similar to exploration driven by information-gain.
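The fading exploration bonus from optimistic initialization can be sketched in a two-action toy example (all parameter values illustrative):

```python
# Sketch of exploration by model-free optimistic initialization: with
# Q-values initialized above the true (zero) reward, every unrewarded
# visit produces a negative prediction error, so the value of a
# repeatedly chosen action decays and the alternatives start to look
# relatively better.
lam, rho = 0.9, 0.1           # discount factor and learning rate
q0 = 1.0                      # optimistic initial value
Q = {"left": q0, "right": q0}

for _ in range(10):           # keep taking "left", never rewarded
    rpe = 0.0 + lam * max(Q.values()) - Q["left"]   # r = 0 throughout
    Q["left"] += rho * rpe    # negative RPE pulls the value down

assert Q["left"] < Q["right"] == q0   # the tried action now looks worse
print(round(Q["left"], 3))
```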

During episodes 2-5, the exact theoretical analysis of the MF optimistic initialization is complex and depends on an agent’s exact trajectory (because of the eligibility traces). However, whether the MF extrinsic *Q*-values show a preference for exploring or leaving the stochastic part essentially depends on the reward value *r*_{G^∗} of the discovered goal state and the initialization value *Q*_{0,ext}. For example, if an agent, starting at *s*1, takes the perfect trajectory *s*1 *→ s*2 *→ s*3 *→ s*4 *→ s*5 *→ s*6 *→ G*^{∗} in episode 1, then, given a unit decay rate of the eligibility traces (i.e., *µ*_{ext} = 1), it is easy to see that, at the 1st visit of state 4 in episode 2, the agent prefers the stochastic/bad action over the progressing action if *Q*_{0,ext} > *λ*_{ext}^{2} *r*_{G^∗}. This implies that, even though the MF branch is not explicitly aware of the different goal states and their reward values, it can still model a type of reward optimism through the initialization of the *Q*-values. Nevertheless, since model fitting reveals an importance factor significantly greater than 0.5 (Fig. 6), the effective reward optimism generated by optimistic initialization alone is not strong enough to explain human behavior.

### Efficient model-based planning for simulated participants

For simulating efficient agents in Fig. 2, we set *ε* = 0 (see Fig. 1A) and used a pure MB version of our algorithm with 13 parameters:

Φ_{MB} = {*r*_{3CHF}, *r*_{4CHF}, *λ*_{ext}, *λ*_{int}, *κ*, *ϵ*_{new}, *ϵ*_{obs}, *T*_{PS,ext}, *T*_{PS,int}, *β*_{MB,ext}^{(1)}, *β*_{MB,int}^{(1)}, *β*_{MB,ext}^{(2)}, *β*_{MB,int}^{(2)}}. (Eq. 16)

We considered perfect model-building by assuming *κ* = 1 and almost perfect planning by assuming *T*_{PS,ext} = *T*_{PS,int} = 100. We chose the discount factors *λ*_{ext} and *λ*_{int} as well as the prior parameters *ϵ*_{new} and *ϵ*_{obs} in the range of fitted parameters reported by Xu et al. (2021)^{18}: *λ*_{ext} = 0.95, *λ*_{int} = 0.70, *ϵ*_{new} = 10^{−5}, and *ϵ*_{obs} = 10^{−4}. To separate the effect of optimistic initialization^{37} from that of seeking intrinsic reward in episodes 2-5, we assumed the same value of reward for all goals, i.e., *r*_{3CHF} = *r*_{4CHF} = 1. Finally, we set *β*_{MB,ext}^{(1)} = 0 to have pure intrinsic reward seeking in episode 1.

After fixing parameter values for 10 out of the 13 parameters in Eq. 16, we fine-tuned *β*_{MB,int}^{(1)} to minimize the average length of episode 1 (i.e., to find the goal as fast as possible; see Supplementary Materials). For episodes 2-5, we then chose reference values of *β*_{MB,ext}^{(2)} and *β*_{MB,int}^{(2)} yielding a non-deterministic policy after the 1st encounter of the goal. Different shades of color in Fig. 2 correspond to different choices of *ω* *∈* [0, 1] weighting *β*_{MB,int}^{(2)} relative to *β*_{MB,ext}^{(2)}. More precisely, we used *ω* = 0 for the darkest color (pure extrinsic reward seeking), *ω* = 1 for the lightest color (pure intrinsic reward seeking), and *ω* = 0.7 for the shade in between. Higher values of *ω* indicate a higher relative importance of the intrinsic reward.

### Model-fitting and model-comparison

To compare seeking different intrinsic rewards based on their explanatory power, we considered our full hybrid algorithm (with both MF and MB components), except that we set *T*_{PS,ext} = *T*_{PS,int} = 100 to decrease the number of parameters, based on the results of Xu et al. (2021)^{18} showing the negligible importance of these parameters. As a result, we had 20 free parameters for each of the three intrinsic rewards (i.e., novelty, information-gain, and surprise). For each intrinsic reward *R* *∈* {novelty, inf-gain, surprise} and each participant *n* *∈* {1, …, 57}, we estimated the algorithm’s parameters by maximizing the likelihood of the data given the parameters:

Φ̂_{n}^{R} = argmax_{Φ} *P*(*D*_{n}|Φ, *R*_{n} = *R*), (Eq. 17)

where *D*_{n} is the data of participant *n*, *R*_{n} is the intrinsic reward assigned to participant *n*, *P*(*D*_{n}|Φ, *R*_{n} = *R*) is the probability of *D*_{n} being generated by our intrinsically motivated algorithm seeking *R*_{n} = *R* with its parameters equal to Φ (see Eq. 9), and Φ̂_{n}^{R} is the set of estimated parameters that maximizes that probability. For optimization, we used the Subplex algorithm^{104} as implemented in the Julia NLopt package^{105}.

Because all algorithms have the same number of parameters, we considered the maximum log-likelihood as the model log-evidence, i.e., for intrinsic reward *R* and participant *n*, we consider log *P*(*D*_{n}|*R*_{n} = *R*) = log *P*(*D*_{n}|Φ̂_{n}^{R}, *R*_{n} = *R*), which is equal to a shifted Schwarz approximation of the model log-evidence (also called BIC)^{49,53}. Fig. 4A1 shows the total log-evidence Σ_{n} log *P*(*D*_{n}|*R*_{n} = *R*). Under the fixed-effects assumption at the level of models (i.e., assuming that *R*_{1} = *R*_{2} = … = *R*_{57} = *R*^{∗}), the total log-evidence is equal to the log-posterior probability log *P*(*R*^{∗} = *R*|*D*_{1:57}) of *R* being the intrinsic reward used by all participants (plus a constant). See ref.^{53,55} for tutorials.

We also considered the Bayesian model selection method of ref.^{54} with the random-effects assumption, i.e., assuming that participant *n* uses the intrinsic reward *R*_{n} = *R*, which is not necessarily the same as the one used by other participants, with probability *P*_{R}. We performed Markov Chain Monte Carlo sampling (using the Metropolis-Hastings algorithm^{50} with a uniform prior and 40 chains of length 10'000) for inference and estimated the joint posterior distribution *P*(*P*_{novelty}, *P*_{inf-gain}, *P*_{surprise}|*D*_{1:57}).
Fig. 4A2 shows the expected posterior probability 𝔼[*P*_{R}|*D*_{1:57}] as well as the protected exceedance probabilities *P*(*P*_{R} > *P*_{R′} for all *R*′ ≠ *R*|*D*_{1:57}) computed by using the participant-wise log-evidences.

The boxplots of the fitted parameters of novelty-seeking are shown in the Supplementary Materials. The same set of parameters was used for model-recovery in Fig. 4B, posterior predictive checks in Fig. 5, computing the relative importance of novelty in Fig. 6A-B (see ‘Relative importance of novelty in action-selection’), and parameter recovery in Fig. 6C.

### Posterior predictive checks, model-recovery, and parameter-recovery

For each intrinsic reward *R* *∈* {novelty, inf-gain, surprise} and participant group 𝒢 *∈* {2 CHF, 3 CHF, 4 CHF}, we repeated the following two steps 500 times: 1. We sampled a participant *n* from group 𝒢 uniformly at random, i.e., with probability 1/|𝒢|. 2. We ran a 5-episode simulation in our environment using the intrinsic reward *R* and the fitted parameters Φ̂_{n}^{R}, i.e., we sampled a trajectory *D* from *P*(·|Φ̂_{n}^{R}, *R*_{n} = *R*) (with the *G*^{∗} of the environment corresponding to the group 𝒢). As a result, we ended up with 1500 simulated participants (with randomly sampled parameters) for each algorithm.

We considered the simulated participants who took more than 3000 actions in any of the 5 episodes to be similar to the human participants who quit the experiment and excluded them from further analyses: 238 (*∼* 16%) of the simulated participants seeking novelty, 166 (*∼* 11%) of those seeking information-gain, and 374 (*∼* 25%) of those seeking surprise. We note that, even with the marginal influence of surprise on action-selection (Fig. 5C), one fourth of the simulated participants seeking surprise could not escape the stochastic part in fewer than 3000 actions. Moreover, we excluded, separately for each algorithm, the simulated participants who took more than 3 times the group-average number of actions in episodes 2-5 to finish the experiment (i.e., the same criterion that we used to detect non-attentive human participants): 45 (*∼* 3%) of the simulated participants seeking novelty, 77 (*∼* 5%) of those seeking information-gain, and 27 (*∼* 2%) of those seeking surprise. We then analyzed the remaining participants (1217 simulated participants seeking novelty, 1257 seeking information-gain, and 1099 seeking surprise) as if they were real human participants. Fig. 5 and its supplements in the Supplementary Materials show the data statistics of simulated participants in comparison to human participants.

Given the participants simulated by each of the three intrinsically motivated algorithms, we fitted all three algorithms to the action-choices of 150 simulated participants (50 from each participant group, i.e., 2 CHF, 3 CHF, and 4 CHF). Then, we applied the Bayesian model selection method of ref.^{54} to 5 randomly chosen sub-populations of these 150 simulated participants (each with 60 participants, i.e., 20 from each participant group). Fig. 4B shows the results of the model comparison averaged over these 5 repetitions. Fig. 6C shows the relative importance of novelty in action-selection (see Eq. 18) for each of the 150 simulated participants estimated using the original parameters (which were used for simulations) and the recovered parameters (which were found by re-fitting the algorithms to the simulated data).
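The sub-population sampling step can be sketched as below; the Bayesian model selection itself (ref.^{54}) is not re-implemented here, and the function and variable names are hypothetical:

```python
import random

def subpopulations(ids_by_group, n_per_group=20, n_reps=5, seed=0):
    """Draw random sub-populations for repeated model comparison:
    `n_reps` sub-populations, each with `n_per_group` simulated
    participants sampled without replacement from every group.
    """
    rng = random.Random(seed)
    subs = []
    for _ in range(n_reps):
        sub = []
        for ids in ids_by_group.values():
            sub.extend(rng.sample(ids, n_per_group))  # 20 per group -> 60 total
        subs.append(sub)
    return subs

# Toy usage with 50 fitted simulated participants per group.
groups = {"2CHF": [f"a{i}" for i in range(50)],
          "3CHF": [f"b{i}" for i in range(50)],
          "4CHF": [f"c{i}" for i in range(50)]}
subs = subpopulations(groups)
```

Each of the 5 sub-populations would then be fed to the model selection procedure, and the resulting model probabilities averaged across repetitions.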

### Relative importance of novelty in action-selection

The relative importance of novelty in action-selection depends not only on the inverse-temperatures *β*_{MB,ext}, *β*_{MF,ext}, *β*_{MB,int}, and *β*_{MF,int} but also on the variability of *Q*-values; for example, if the extrinsic *Q*-values *Q*_{MB,ext} and *Q*_{MF,ext} are the same for all state-action pairs, then, independently of the values of the inverse-temperatures, the action is taken by a pure novelty-seeking policy, because the policy in Eq. 7 can then be re-written as a softmax policy that depends only on the intrinsic *Q*-values. Thus, to measure the contribution of different components of action-selection to the final policy, we need to consider the variations in their *Q*-values as well.

In this section, we propose a variable *ω*_{i2e} *∈* [0, 1] for quantifying the relative importance of seeking intrinsic reward in comparison to seeking extrinsic reward. We first define the total intrinsic and extrinsic *Q*-values as *Q*_{int} = *β*_{MB,int}*Q*_{MB,int} + *β*_{MF,int}*Q*_{MF,int} and *Q*_{ext} = *β*_{MB,ext}*Q*_{MB,ext} + *β*_{MF,ext}*Q*_{MF,ext}, respectively. We further define the state-dependent variations in *Q*-values as Δ*Q*_{int}(*s*) = max_{a} *Q*_{int}(*s, a*) *−* min_{a} *Q*_{int}(*s, a*) and Δ*Q*_{ext}(*s*) = max_{a} *Q*_{ext}(*s, a*) *−* min_{a} *Q*_{ext}(*s, a*), as well as their temporal averages ⟨Δ*Q*_{int}⟩ and ⟨Δ*Q*_{ext}⟩, where ⟨·⟩ denotes the temporal average. ⟨Δ*Q*_{ext}⟩ and ⟨Δ*Q*_{int}⟩ show the average difference between the most and least preferred action with respect to seeking extrinsic and intrinsic reward, respectively.

Therefore, a feasible way to measure the influence of seeking intrinsic reward on action-selection is to define *ω*_{i2e} as

*ω*_{i2e} = ⟨Δ*Q*_{int}⟩ / (⟨Δ*Q*_{int}⟩ + ⟨Δ*Q*_{ext}⟩).

Fig. 6A shows the value of *ω*_{i2e} in episodes 2-5 computed for each human participant (dots) and averaged over different groups (bars), and Fig. 6B shows the same for episode 1. Fig. 5C shows the value of *ω*_{i2e} in episodes 2-5 for the 2 CHF group of simulated participants. See Supplementary Materials for a similar approach for quantifying the relative importance of the MB and MF policies in action-selection.
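The *ω*_{i2e} computation can be sketched from recorded *Q*-values as follows; this is a minimal illustration assuming the total intrinsic and extrinsic *Q*-values have already been computed per time step, with toy arrays standing in for the paper's *β*-weighted sums of MB and MF *Q*-values:

```python
def omega_i2e(q_int, q_ext):
    """Relative importance of intrinsic reward in action-selection.

    q_int[t][a], q_ext[t][a]: total intrinsic / extrinsic Q-values of the
    available actions a at time step t along a trajectory.
    """
    def avg_spread(q):
        # State-dependent variation: max minus min over actions,
        # then averaged over time steps.
        spreads = [max(row) - min(row) for row in q]
        return sum(spreads) / len(spreads)

    d_int, d_ext = avg_spread(q_int), avg_spread(q_ext)
    return d_int / (d_int + d_ext)  # normalized to [0, 1]
```

By construction, *ω*_{i2e} = 1 when only the intrinsic *Q*-values discriminate between actions, 0 when only the extrinsic ones do, and 0.5 when both spreads are equal.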

## Author Contributions

AM, HAX, MHH, and WG developed the study concept and designed the experiment. HAX and WL conducted the experiment and collected the data. AM designed the algorithms, did the formal analyses, and analyzed the data. AM, MHH, and WG wrote the paper.

## Competing Interests statement

The authors declare no competing interests.

## Code and data availability

All code and data needed to reproduce the results reported in this manuscript will be made publicly available after publication acceptance.

## Acknowledgement

AM thanks Vasiliki Liakoni, Johanni Brea, Sophia Becker, Martin Barry, Valentin Schmutz, and Guillaume Bellec for many useful discussions on relevant topics. This research was supported by Swiss National Science Foundation No. CRSII2 147636 (Sinergia, MHH and WG), No. 200020 184615 (WG), and No. 200020 207426 (WG) and by the European Union Horizon 2020 Framework Program under grant agreement No. 785907 (Human Brain Project, SGA2, MHH and WG).

## Footnotes

New statistical analyses and new results were added.

## References
