Abstract
Human curiosity has been interpreted as a drive for exploration and modeled by intrinsically motivated reinforcement learning algorithms. An unresolved challenge in machine learning is that several of these algorithms get distracted by reward-independent stochastic stimuli. Here, we ask whether humans get distracted by the same stimuli as the algorithms. We design an experimental paradigm where human participants search for rewarding states in an environment with a highly ‘stochastic’ but reward-free sub-region. We show that (i) participants get repeatedly and persistently distracted by novelty in the stochastic part of the environment; (ii) optimism about the availability of other rewards increases this distraction; and (iii) the observed distraction pattern is consistent with the predictions of algorithms driven by novelty but not with ‘optimal’ algorithms driven by information-gain. Our results suggest that humans use suboptimal but computationally cheap curiosity-driven policies for exploration in complex environments.
Introduction
Curiosity drives humans and animals to explore their environments1–3 and to search for potentially more valuable sources of reward (e.g., more nutritious foods or better-paid jobs) than those currently available4,5. In computational neuroscience and psychology, intrinsically motivated reinforcement learning (RL) algorithms6,7 have been proposed as models of curiosity-driven behavior8–11 with novelty, surprise, or information-gain as intrinsic motivation in addition to the extrinsic motivation by nutritious or monetary reward11. These algorithms have been successful not only in explaining aspects of exploration in humans and animals12–18 but also in solving complex machine learning tasks with sparse or even no (extrinsic) rewards such as in computer games19–21 and high-dimensional control problems22–25. Despite their successes, these algorithms face a serious challenge: Intrinsically motivated agents are prone to distraction by reward-independent stochasticity (the so-called ‘noisy TV’ problem)26,27, i.e., they are attracted to novel, surprising, or just noisy states independently of whether or not these states are rewarding28.
The extent of distraction varies between different algorithms, and designing efficient noise-robust algorithms is an ongoing line of research in machine learning 29–32. In particular, it is well-known that artificial RL agents seeking information-gain eventually lose their interest in stochasticity when exploration yields no further information, whereas RL agents seeking surprise or novelty exhibit a persistent attraction by stochasticity26,27,33. Here, we ask (i) whether humans get distracted in the same situations as intrinsically motivated RL agents and, if so, (ii) whether this distraction vanishes (similar to seeking information-gain) or persists (similar to seeking surprise or novelty) over time.
To answer these questions, we bring ideas from machine learning26,27 to behavioral neuroscience and design a novel experimental paradigm with a highly stochastic part. We test the predictions of three different intrinsically motivated algorithms (i.e., driven by novelty, surprise, and information-gain) against the behavior of human participants and show that human behavior is both qualitatively and quantitatively consistent with that of novelty-driven RL agents: Human participants exhibit a persistent distraction by novelty, and the degree of this distraction correlates with their degree of ‘reward optimism’, where reward optimism is defined by the experimental procedure. Our results provide evidence for (i) novelty-driven RL algorithms as models of human curiosity even when novelty-seeking is suboptimal and (ii) the influence of reward optimism on the relative importance of novelty-seeking versus reward-seeking in human decision making.
Results
Experimental Paradigm
We first design an experimental paradigm for human participants that allows us to dissociate the predictions of different intrinsically motivated RL algorithms. We employ a sequential decision-making paradigm34–36 for navigation in an environment with 58 states plus three goal states (Fig. 1A-B). Three actions are available in each non-goal state, and agents can move from one state to another by choosing these actions (arrows in Fig. 1A-B). We use the term ‘agents’ to refer to either human participants or agents simulated by RL algorithms. In the human experiments, states are represented by images on a computer screen and actions by three disks below each image (Fig. 1C); for simulated participants, both states and actions are abstract entities (i.e., we consider RL in a tabular setting37). The assignment of images to states and disks to actions is random but fixed throughout the experiment. Agents are informed that there are three different goal states in the environment (G∗, G1, or G2 in Fig. 1A) and that their task is to find a goal state 5 times; see Methods for how this information is incorporated in the RL algorithms. Importantly, neither human participants nor RL agents are aware of the total number of states or the structure of the environment (i.e., how states are connected to each other).
A. Structure of the environment with the stochastic states merged together (dashed oval; see B). Each circle represents a state and each solid arrow an action. All actions except for the ones to the stochastic part or to the goal states are deterministic. Dashed arrows indicate random transitions; values (e.g., 1 − ε) show probabilities of each transition. We choose ε ≪ 1 (see Methods). B. Structure of the stochastic part of the environment (states S-1 to S-50), i.e., the dashed oval in A. B1. In state 4, one action takes agents randomly (with uniform distribution) to one of the stochastic states. B2. In each stochastic state (e.g., state S-1 in the figure), one action takes agents back to state 4 and two actions to another randomly chosen stochastic state. C. Time-line of one episode in human experiments. The states are represented by images on a computer screen and actions by disks below each image. An episode ends when a goal image (i.e., ‘3 CHF’ image in this example) is found. D. Block diagram of the intrinsically motivated RL algorithm. Given the state st at time t, the intrinsic reward rint,t (i.e., novelty, information-gain, or surprise) and the extrinsic reward rext,t (i.e., the monetary reward value of st) are evaluated by a reward function and passed to two identical (except for the reward signals) parallel RL algorithms. The two algorithms compute two policies, one for seeking intrinsic reward πint,t and one for seeking extrinsic reward πext,t. The two policies are then weighted according to the relative importance of the intrinsic reward and are combined to make a single hybrid policy πt. The next action at is selected by sampling from πt. See Methods for details.
The 58 states of the environment can be classified into three groups: Progressing states (1 to 6 in Fig. 1A), trap states (7 and 8 in Fig. 1A), and stochastic states (S-1 to S-50 in Fig. 1B, shown as a dashed oval in Fig. 1A). In each progressing state, one action (‘progressing’ action) takes agents one step closer to the goals and another action (‘bad’ action) takes them to one of the trap states. The third action in states 1-3 and 5-6 is a ‘self-looping’ action that makes agents stay at the same state. Except for the progressing action in state 6, all these actions are deterministic, meaning that they always lead to the same next state. The progressing action in state 6 is almost deterministic: It takes participants to the ‘likely’ goal state G∗ with a probability of 1 − ε and to the ‘unlikely’ goal states G1 and G2 with equal probabilities of ε/2 (where ε ≪ 1). In state 4, instead of a self-looping action, there is a ‘stochastic’ action that takes agents to a randomly chosen (with equal probability) stochastic state (Fig. 1B1). In each stochastic state, one action takes agents back to state 4 and two stochastic actions take them to another randomly chosen stochastic state (Fig. 1B2). In each trap state, all three actions are deterministic: Two actions bring agents to either the same or the other trap state and one action to state 1.
The stochastic part of the environment – which is inspired by the machine-learning literature and mimics the main features of a ‘noisy TV’28 – differs crucially from existing paradigms in the behavioral neuroscience literature 18,38,39. Without the stochastic part, intrinsic motivation helps agents to avoid the trap states and find the goal18; hence, it aids exploration before a goal is found and does not harm exploitation afterwards. By adding the stochastic part, we aim to quantify how much exploitation of the discovered goal is reduced because of the distraction by the stochastic states.
We organize the experiment in 5 episodes: Agents are randomly initialized at state 1 or 2 and are instructed to find a goal 5 times. After finding a goal, agents are randomly re-initialized at state 1 or 2. We choose a small enough ε (Fig. 1A) to safely assume that all agents visit only G∗ while being aware that G1 and G2 exist (Methods).
Simulating intrinsically motivated agents with efficient algorithms
To formulate qualitative predictions for human behavior, we simulate three intrinsically motivated RL algorithms. Intrinsic motivation is described in each algorithm by ‘intrinsic rewards’ that agents give to themselves upon visiting ‘novel’, ‘surprising’, or ‘informative’ states (see Methods for details). Extrinsic rewards, on the other hand, are received only when visiting the three goal states. Agents simulated by each algorithm are able to navigate in an environment with an unknown number of states by seeking a combination of extrinsic and intrinsic rewards (Fig. 1D): At each time t, an agent observes state st and evaluates an extrinsic reward value rext,t (which is zero except at the goal states) and an intrinsic reward value rint,t (e.g., novelty of state st). Extrinsic and intrinsic reward values are then passed to two parallel blocks of RL, each working with a single reward signal. Independently of each other, the two blocks use efficient model-based planning40,41 to propose a policy πext,t that maximizes future extrinsic rewards and πint,t that maximizes future intrinsic rewards18,26, respectively. The two policies are combined into a hybrid policy πt for taking the next action at, controlled by a set of free parameters that indicate the relative importance of intrinsic over extrinsic rewards (Methods). The degree of exploration is high if πint,t dominates πext,t during action-selection.
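As a concrete illustration of this architecture, the sketch below (Python; function names and the specific weighting scheme are our own simplifications, not the exact model, which is specified in Methods) combines an extrinsic-reward policy and an intrinsic-reward policy into a single hybrid policy:

```python
import numpy as np

def softmax(preferences):
    """Numerically stable softmax over action preferences."""
    z = preferences - np.max(preferences)
    p = np.exp(z)
    return p / p.sum()

def hybrid_policy(q_ext, q_int, beta_ext, beta_int):
    """Combine an extrinsic-reward policy and an intrinsic-reward policy
    (cf. Fig. 1D): each branch is a softmax over its own Q-values, and the
    two branches are merged by multiplying and renormalizing."""
    pi_ext = softmax(beta_ext * np.asarray(q_ext))  # policy seeking extrinsic reward
    pi_int = softmax(beta_int * np.asarray(q_int))  # policy seeking intrinsic reward
    pi = pi_ext * pi_int                            # hybrid policy (unnormalized)
    return pi / pi.sum()

# Example: three actions; the extrinsic branch prefers action 0,
# the intrinsic branch prefers action 2.
print(hybrid_policy([1.0, 0.2, 0.1], [0.1, 0.3, 1.5], beta_ext=2.0, beta_int=4.0))
```

Increasing beta_int relative to beta_ext shifts the hybrid policy toward the action preferred by the intrinsic branch, i.e., toward more exploration.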
For the intrinsic reward rint,t, we choose one option from each of the three main categories of intrinsic rewards in machine learning26,27: (i) novelty18–20 quantifies how infrequent the state st has been until time t; (ii) information-gain23,25,42,43 quantifies how much the agent updates its belief about the structure of the environment upon observing the transition from the state-action pair (st−1, at−1) to state st; and (iii) surprise21,28,44 quantifies how unexpected it is to observe state st after taking action at−1 at state st−1.
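For intuition, the snippet below gives simplified count-based stand-ins for the three signals (our own illustrative formulas; the exact definitions used in this study are given in Methods): novelty from state visit counts, surprise as the negative log-probability of the observed transition, and information-gain as the change (KL divergence) in the predictive transition distribution caused by the new observation.

```python
import numpy as np

def novelty(state_counts, s_next):
    """Count-based novelty: -log of the (smoothed) empirical frequency of s_next."""
    freq = (state_counts[s_next] + 1.0) / (state_counts.sum() + len(state_counts))
    return -np.log(freq)

def surprise(trans_counts, s, a, s_next):
    """Surprise: -log of the predicted probability of s_next after (s, a)."""
    probs = (trans_counts[s, a] + 1.0) / (trans_counts[s, a].sum() + trans_counts.shape[-1])
    return -np.log(probs[s_next])

def information_gain(trans_counts, s, a, s_next):
    """Information-gain proxy: KL divergence between the predictive distribution
    over successors of (s, a) after vs. before observing s_next."""
    n = trans_counts.shape[-1]
    before = (trans_counts[s, a] + 1.0) / (trans_counts[s, a].sum() + n)
    updated = trans_counts[s, a].copy()
    updated[s_next] += 1.0
    after = (updated + 1.0) / (updated.sum() + n)
    return float(np.sum(after * np.log(after / before)))

# Example with a tiny 5-state, 3-action world before any observations:
state_counts = np.zeros(5)
trans_counts = np.zeros((5, 3, 5))
print(novelty(state_counts, 2), surprise(trans_counts, 0, 1, 2),
      information_gain(trans_counts, 0, 1, 2))
```

Note that the information-gain of a repeatedly observed transition shrinks toward zero as counts accumulate, whereas the surprise of a transition in a genuinely stochastic part of the environment stays bounded away from zero – the property underlying the dissociations discussed below.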
The three different intrinsic reward signals lead to three efficient intrinsically motivated algorithms and to three groups of simulated efficient agents: those (i) seeking novelty, (ii) seeking information-gain, and (iii) seeking surprise. In the following section, we focus on episodes 2-5 and formulate the qualitative predictions of different intrinsically motivated RL algorithms for human behavior by characterizing the behavior of these simulated efficient agents. We note that these predictions are made by using efficient RL algorithms with perfect memory and high computational power. Thus, these efficient agents are a starting point and are used only (i) to test whether our experimental paradigm dissociates action-choices of different intrinsically motivated RL algorithms and (ii) to gain insights about their principal differences; a more realistic simulation of human behavior is presented in a later section.
Different intrinsically motivated algorithms exhibit principally different behavioral patterns
To avoid arbitrariness in the choice of parameters, we tune the parameters of each algorithm to minimize the average number of actions in episode 1, i.e., to achieve the most efficient exploration (Methods). As a result, different algorithms achieve a similar performance during episode 1 and find the goal G∗ almost equally fast (Supplementary Materials). Hence, exploration policies driven by different intrinsic rewards cannot be qualitatively distinguished during episode 1.
Given the same set of parameters, we study how different simulated efficient agents behave in episodes 2-5 (Fig. 2). After finding the goal G∗ for the 1st time, an agent has two options: (i) return to the discovered goal state G∗ (exploitation) or (ii) search for the other goal states G1 and G2 (exploration). In our simulations, we consider three choices for the trade-off between exploration and exploitation by changing the relative importance of πint,t over πext,t (Fig. 1D): pure exploitation (the action policy does not depend on intrinsic rewards, i.e., πt = πext,t), pure exploration (the action policy does not depend on extrinsic rewards, i.e., πt = πint,t), and a mixture of both (different shades of each color in Fig. 2). If the extrinsic reward value assigned to G1 or G2 is higher than the one assigned to G∗, then the policy πext,t for seeking extrinsic rewards can also contribute to exploration in episodes 2-5 (Methods). In order to characterize qualitative features essential to exploration driven by different intrinsic rewards, we assume a symmetry between the three goal states in the simulated efficient agents and assign the same extrinsic reward value to all goals (Methods); we drop this assumption in the next sections and quantify the additional but negligible contribution of πext,t to explaining human exploration.
We consider three levels of importance for intrinsic rewards (Fig. 1D): high (dark colors), medium (shaded colors), and no (light colors). For each level, we run 500 simulations of each algorithm. A1, C1, and E1. Median number of actions over episodes 2-5. Error bars show the standard error of the median (SEMed; evaluated by bootstrapping). Single dots show the data of 20 (randomly chosen out of 500) individual simulations to illustrate variabilities among simulations. Simulations stopped after 3000 actions even if a goal state was not reached. The Pearson correlation between the search duration and the degree of exploitation is negative (red numbers), indicating that search duration decreases if the degree of exploitation increases (Methods). A2, C2, and E2. Average fraction of time spent in the stochastic part of the environment during episodes 2-5. The Pearson correlation between the fraction of time spent in the stochastic part and the degree of exploitation is negative (Methods). Error bars show the standard error of the mean (SEMean) and single dots the data of 20 individual simulations. A3, C3, and E3. Median number of actions in episodes 2-5 for simulated efficient agents purely driven by intrinsic rewards (i.e., pure exploration). The Pearson correlation between the search duration and episode number is positive for seeking novelty or surprise but is negative for seeking information-gain. Error bars show the SEMed and single dots the data of 20 individual simulations. B, D, and F. Fraction of time taking the progressing action (PA) and the stochastic action (SA) when encountering state 4 during episode 2. Purely seeking novelty shows a smaller difference between the preference for SA and PA in state 4 compared to purely seeking information-gain or surprise. Error bars show the SEMean.
For all three groups of simulated efficient agents, decreasing the relative importance of intrinsic rewards decreases both the search duration (Fig. 2A1, C1, and E1) and the fraction of time spent in the stochastic part (Fig. 2A2, C2, and E2). This observation implies that intrinsically motivated exploration leads to an attraction to the stochastic part of the environment, effectively keeping the simulated efficient agents away from the goal region beyond state 6 (Fig. 1A). Our results thus confirm earlier findings in machine learning 26,28 that intrinsically motivated agents get distracted by noisy reward-independent stimuli.
While all three groups of simulated efficient agents get distracted by the stochastic part, their degree of distraction is different (different colors in Fig. 2A3, C3, and E3). For efficient agents that purely seek information-gain (i.e., pure exploration), the time spent in the stochastic part decreases over episodes (Fig. 2C3), whereas we observe the opposite pattern for efficient agents that purely seek novelty (Fig. 2A3) or surprise (Fig. 2E3). In particular, efficient agents that purely seek surprise most often (i.e., in > 50% of simulations in episode 5) get stuck in the stochastic part and do not escape it within 3000 actions (Fig. 2E3). These observations confirm the inefficiency of seeking surprise and the efficiency of seeking information-gain in dealing with noise 26.
In order to further dissociate action-choices of different algorithms, we analyze the action preferences of simulated efficient agents in state 4 during episode 2 (Fig. 2B, D, and F). For all three groups of efficient agents, increasing the relative importance of intrinsic rewards increases their preference for the stochastic action. However, for the highest importance of intrinsic rewards, the probability of choosing the progressing action is substantially lower than the probability of choosing the stochastic action for seeking surprise or information-gain (15% vs. 85%; Fig. 2D and F), whereas this difference is much smaller for seeking novelty (40% vs. 60%; Fig. 2B).
This distinct behavior of novelty-seeking is due to the fact that novelty is defined for states, whereas surprise and information-gain are defined for transitions (i.e., state-action pairs; see Methods): By the end of episode 1, the goal state has been observed only once and remains, during episode 2, relatively novel (and hence attractive for an efficient novelty-seeking agent) compared to most stochastic states, whereas there are many actions between the stochastic states that have rarely or potentially never been chosen and are, thus, attractive for an efficient agent seeking surprise or information-gain.
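To make this argument concrete, here is a toy calculation with made-up visit counts (purely illustrative, not data from the experiment):

```python
import numpy as np

# Toy illustration with made-up counts: after episode 1, suppose the goal state
# was visited once, while each of the 50 stochastic states was visited ~4 times.
total_visits = 1 + 50 * 4
for name, count in [("goal state", 1), ("typical stochastic state", 4)]:
    print(name, "novelty ≈", round(-np.log(count / total_visits), 2))
# -> the once-visited goal remains *more* novel than a typical stochastic state,
#    so a novelty-seeker is also pulled back toward the goal region.
# In contrast, the stochastic part offers 50 states x 3 actions = 150 transitions,
# most of them still untried, so transition-based signals (surprise,
# information-gain) keep pointing into the stochastic part.
print("transitions available in the stochastic part:", 50 * 3)
```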
To summarize, different intrinsically motivated algorithms exhibit principally different behavioral patterns in our experimental paradigm. We consider these behavioral patterns as qualitative predictions for human behavior.
Human participants
To test the predictions of intrinsically motivated algorithms, we first compare the exploratory behavior of human participants with that of simulated efficient agents. For simulated efficient agents, the relative importance of the intrinsic reward for action-selection (Fig. 1D) determines the balance of exploration versus exploitation. A challenge in human experiments is that we do not have explicit control over the variable that controls the relative importance of intrinsic rewards compared to extrinsic rewards. Inspired by earlier studies45–47, we conjecture that human participants who are more optimistic about finding a goal with a high value of reward are more curious to explore the environment than human participants who are less optimistic. In other words, we hypothesize that the relative importance of intrinsic rewards in human participants is positively correlated with their degree of ‘reward optimism’, where we define reward optimism as the expectancy of finding a goal of higher value than those already discovered.
Based on this hypothesis, we include a novel reward manipulation in the instructions given at the beginning of the experiment: We inform human participants that there are three different possible reward states corresponding to values of 2 Swiss Franc (CHF), 3 CHF, and 4 CHF, represented by three different images (Methods). At the beginning of the experiment, we randomly assign the three different reward values to the goal states G∗, G1, and G2 in Fig. 1A, separately for each participant (without informing them), and keep the assignment fixed throughout the experiment. After this random assignment, G∗ has a different value for different participants. Even though all participants receive the same instructions, participants who are randomly assigned to an environment with 4 CHF reward value for G∗ do not have any monetary incentive to further explore in episodes 2-5 (= a low degree of reward optimism), whereas participants who are assigned to an environment with 2 CHF reward value for G∗ are likely to keep searching for more valuable goals in episodes 2-5 (= a high degree of reward optimism). Therefore, we have three different groups of participants with three different levels of reward optimism in episodes 2-5; see Methods for how this information is incorporated in the RL algorithms. We note that our definition of reward optimism in the context of our experiment is in line with, but independent of, the notion of general optimism that is quantified for individual participants in psychology48.
Following a power analysis based on the data of simulated efficient agents (Methods), we recruited 63 human participants and collected their action-choices during the 5 episodes of our experiment: 23 participants in an environment with 2 CHF reward value for G∗ and 20 human participants each in environments with 3 CHF and 4 CHF reward values for G∗. In the rest of the manuscript, we refer to each group by their reward value of G∗, e.g., the 3 CHF group is the group of human participants who were assigned to 3 CHF reward value for G∗ (as in Fig. 1C). We excluded the data of 6 human participants from further analyses since they either did not finish the experiment or had an abnormal performance (Methods).
Human participants exhibit a persistent distraction by stochasticity
We perform the same series of analyses on the behavior of human participants as those performed on the behavior of simulated efficient agents (Fig. 3). In episodes 2-5, the search duration of human participants (Fig. 3A1) and the fraction of time they spend in the stochastic part (Fig. 3A2) are both negatively correlated with the goal value of their environment, e.g., the 2 CHF group has a longer search duration and spends more time in the stochastic part than the other two groups. Moreover, increasing the goal value increases the preference of human participants for the progressing action in state 4 during episode 2 (Fig. 3B). These observations support our hypothesis that increasing the degree of reward optimism influences the behavior of human participants in the same way as increasing the relative importance of intrinsic rewards influences the behavior of simulated efficient agents (e.g., compare Fig. 3A1, A2, and B with Fig. 2A1, A2, and B, respectively).
A. Search duration in episodes 2-5. A1. Median number of actions over episodes 2-5 for the three different groups: 2 CHF (dark), 3 CHF (medium), and 4 CHF (light). Error bars show the SEMed (evaluated by bootstrapping) and single dots the data of individual participants. The Pearson correlation between the search duration and the goal value is negative (correlation test; t = −4.2; 95% Confidence Interval (CI) = (−0.67, −0.27); Degrees of Freedom (DF) = 55; Methods). A2. Average fraction of time spent in the stochastic part of the environment during episodes 2-5. The Pearson correlation between the fraction of time spent in the stochastic part and the goal value is negative (correlation test; t = −4.7; 95%CI = (−0.70, −0.32); DF = 55; Methods). Error bars show the SEMean and single dots the data of individual participants. A3. Median number of actions in episodes 2-5 for the 2 CHF group. A Bayes Factor (BF) of 1/3.7 in favor of the null hypothesis 49 suggests a zero Pearson correlation between the search duration and the episode number (one-sample t-test on individual correlations; t = 0.63; 95%CI = (−0.20, 0.37); DF = 20). Error bars show the SEMed and single dots the data of individual participants. B. Fraction of time choosing the progressing action (PA) and the stochastic action (SA) when encountering state 4 during episode 2; see Supplementary Materials for other progressing states. Error bars show the SEMean. The difference between PA and SA for the 4 CHF group is significant (one-sample t-test; t = 2.99; 95%CI = (0.14, 0.81); DF = 16). A BF of 1/4.6 in favor of the null hypothesis 49 suggests an equal average between PA and SA for the 2 CHF group (one-sample t-test; t = 0.039; 95%CI = (−0.25, 0.26); DF = 20). The test for the 3 CHF group is inconclusive (one-sample t-test; t = 1.17; 95%CI = (−0.13, 0.47); DF = 18). Red p-values: Significant effects with False Discovery Rate controlled at 0.05 50 (see Methods). Red BFs: Significant evidence in favor of the alternative hypothesis (BF≥ 3). Blue BFs: Significant evidence in favor of the null hypothesis (BF≤ 1/3).
The behavior of the 2 CHF group is particularly interesting since they are the most optimistic group of participants. The 2 CHF group exhibits a constant search duration over episodes 2-5 (zero correlation accepted by Bayesian hypothesis testing49; Fig. 3A3). This implies that they persistently explore the stochastic part. Moreover, during episode 2, the 2 CHF group chooses the progressing and the stochastic actions equally often (no difference in means accepted by Bayesian hypothesis testing49; Fig. 3B). If we assume that the high degree of reward optimism in the 2 CHF group results in a policy that is driven dominantly by intrinsic rewards (driving exploration) and only marginally by extrinsic rewards, then these observations are more similar to the qualitative predictions of seeking novelty than those of seeking information-gain or surprise (compare Fig. 3B against Fig. 2B, D, and F).
Novelty-seeking is the most probable model of human exploration
In the previous section, we observed that human participants exhibit patterns of behavior qualitatively similar to those of novelty-seeking simulated efficient agents. However, the qualitative predictions in Fig. 2 were made based on the assumptions of (i) using efficient RL algorithms with perfect memory and high computational power, (ii) using parameters that were optimized for the best performance in episode 1, and (iii) assigning the same extrinsic reward value to different goal states. In this section, we use a more realistic model of behavior than that of efficient agents in Fig. 2: In order to model the behavior of human participants, we use a hybrid RL model18,36,38,51 combining model-based planning41 and model-free habit-formation52, account for imperfect memory and suboptimal choice of parameters, and allow our algorithms to assign different extrinsic reward values to different goal states (Methods). We fit the parameters of our three intrinsically motivated algorithms to the action-choices of each individual participant by maximizing the likelihood of the data given the parameters (Methods). Such a flexible modeling approach allows each of the three algorithms to find its closest version to the behavior of human participants, constrained to use one specific intrinsic reward signal (i.e., novelty, surprise, or information-gain).
Given the fitted algorithms, we use Bayesian model-comparison53,54 to quantitatively test whether human behavior is explained better by seeking novelty than seeking information-gain or surprise (Methods). Our model-comparison results show that seeking novelty is the most probable model for the majority of human participants, followed by seeking information-gain as the 2nd most probable model (Fig. 4A; Protected Exceedance Probability 54 = 0.99 and 0.01 for seeking novelty and information-gain, respectively). This result shows that seeking novelty describes the behavior of human participants better than seeking information-gain and surprise, but it does not tell us which aspects of data statistics cannot be explained by algorithms driven by information-gain or surprise. To investigate this question, we use our three intrinsically motivated algorithms with their fitted parameters and simulate new participants, i.e., we perform Posterior Predictive Checks (PPC)55,56. As opposed to the simulations in Fig. 2, we do not freely choose the level of exploration in simulations for PPC. Rather, the level of exploration of each newly simulated participant is completely determined by the previously fitted parameters from one of the 57 human participants; specifically, each simulated participant belongs to one of the three groups of human participants (e.g., the 3 CHF group), and its action-choices are simulated using a set of parameters fitted to the action-choices of one human participant randomly selected from the participants in that group (Methods).
A. Human participants’ action-choices are best explained by novelty-seeking (see Methods for details). A1. Model log-evidence summed over all participants (i.e., assuming that different participants have the same exploration strategy but can have different parameters 53) is significantly higher for seeking novelty than seeking information-gain or surprise. High values indicate good performance, and differences greater than 10 are traditionally 50 considered as strongly significant. A2. The expected posterior model probability with random effects assumption 54 (i.e., assuming that different participants can have different exploration strategies and different parameters) given the data of all participants. PXP stands for Protected Exceedance Probability 54, i.e., the probability of one model being more probable than the others. Error bars show the standard deviation of the posterior distribution. B. Confusion matrix from the model recovery procedure: Each row shows the results of applying our model-fitting and -comparison procedure (as in A2) to the action-choices of simulated participants by one of the three algorithms (with their parameters fitted to human data; see Methods). Color-code shows the expected posterior probability and numbers in parentheses the PXP (both averaged over 5 sets of 60 simulated participants). We could always recover the model that had generated the data (PXP ≥ 0.98), using almost the same number of simulated participants (60) as human participants (57).
Given the PPC results, we first perform model-recovery55 on the data from the simulated participants: Indeed, model recovery confirms that we can infer which algorithm has generated the action-choices of simulated participants (by repeating our model-fitting and -comparison; Fig. 4B). This implies that even the versions of different algorithms that are closest to human data can be dissociated in our experimental paradigm (average Protected Exceedance Probability54 ≥ 0.98 for the true model in Fig. 4B). Next, we perform a systematic comparison between the statistics of the action-choices of human participants and those of the simulated participants (the two most discriminating statistics are reported in Fig. 5A-B and a systematic analysis in Supplementary Materials). Our results show that simulated participants using novelty as intrinsic rewards reproduce all data statistics (including the zero correlation observed in Fig. 3A3; see Supplementary Materials), whereas simulated participants using information-gain or surprise fail to do so. The failure of algorithms using information-gain or surprise is most evident regarding the fraction of time spent in the stochastic part during episodes 2-5: 1. We observe that the 2 CHF group of simulated participants who seek information-gain or surprise spends a significantly smaller fraction of their time (less than half) in the stochastic part of the environment than the 2 CHF group of human participants (Fig. 5A). 2. Simulated participants using information-gain or surprise fail to reproduce the observed negative correlation between the goal value and the fraction of time spent in the stochastic part (Fig. 5B). We emphasize that both shortcomings are observed despite the fact that parameters of the algorithms had been previously optimized to explain as best as possible the sequence of action choices across the whole experiment.
For each of the three intrinsic rewards, we run 1500 simulations of algorithms with parameters fitted to individual human participants; random seeds are different in each simulation. We divide the simulated participants into three groups (corresponding to the 2 CHF, 3 CHF, and 4 CHF goal values) and use the same criteria as we used for human participants to detect and remove outliers among simulated participants (Methods). A. Average fraction of time during episodes 2-5 spent by the 2 CHF group of human participants (blue circles, same data as in Fig. 3A2) and the simulated participants (bars). Error bars: SEMean. P-value and BF: Comparison between the simulated and human participants (unequal variances t-test). Human participants spend a significantly greater fraction of their time in the stochastic part than simulated participants seeking information-gain (t = 4.4; 95%CI = (0.08, 0.23); DF = 21.2) or surprise (t = 6.3; 95%CI = (0.15, 0.30); DF = 21.6). No significant difference was observed for novelty-seeking (t = 1.0; 95%CI = (−0.04, 0.11); DF = 22.3). B. Pearson correlation between the fraction of time during episodes 2-5 spent in the stochastic part and the goal value. Human participants’ data shows the same correlation value as reported in Fig. 3A2. Error bars: Standard deviation evaluated by bootstrapping. P-values are from permutation tests (1000 sampled permutations; Bayesian testing was not applicable). C. The relative contribution of intrinsic rewards (i.e., dominance of πint,t over πext,t; Eq. 18 in Methods) in episodes 2-5 for the 2 CHF group of simulated participants. P-value and BF: Comparison with 0.5 (one-sample t-test). We observe a dominance of πint,t for seeking novelty (t = 58.2; 95%CI = (0.88, 0.91); DF = 416) and information-gain (t = 12.7; 95%CI = (0.63, 0.67); DF = 379) but a dominance of πext,t for seeking surprise (t = −14.6; 95%CI = (0.27, 0.32); DF = 327). Red p-values: Significant effects with False Discovery Rate controlled at 0.05 50 (see Methods). Red BFs: Significant evidence in favor of the alternative hypothesis (BF≥ 3).
The failure of surprise-seeking algorithms to reproduce these statistics is due to the detrimental consequences of seeking surprise in the presence of stochasticity (e.g., as observed for the simulated efficient agents in Episode 5 of Fig. 2E3). Hence, to stop the simulated participants from spending an enormous amount of time during episode 5 in the stochastic part of the environment, fitting surprise-seeking to the action-choices of human participants yields a set of parameters that causes action-choices to be dominated by extrinsic reward (relative importance of surprise-seeking about 0.3 for the 2 CHF group; Fig. 5C), which in turn cannot explain the overall high level of exploration observed in the 2 CHF group of human participants (Fig. 5A). Similarly, the relative importance of information-gain is around 0.65 when parameters of a hybrid algorithm driven by information-gain are optimized to fit human behavior. A higher value of the relative importance would make the algorithm, during episode 2, more attracted to the stochastic action in state 4 than human participants are (compare Fig. 2D with Fig. 3B). With such reduced importance of information-gain, the hybrid algorithm cannot, however, explain the specific behavioral features in Fig. 5A and B. Therefore, the attraction of human participants to the stochastic part has specific characteristics that are explained by seeking novelty but not by seeking surprise or information-gain.
Taken together, our results with simulated participants provide strong quantitative evidence for novelty-seeking as a model of human exploration in our experiment.
Reward optimism correlates with relative importance of novelty
Using novelty-seeking as the most probable model of human behavior, we can now explicitly test our hypothesis that reward optimism increases the relative importance of intrinsic rewards. By analyzing the parameters of our novelty-seeking algorithm fitted to the behavioral data, we observe, in agreement with our hypothesis, a significant negative correlation between the relative importance of novelty during action-selection (in episodes 2-5) and the goal value participants found in episode 1 (Fig. 6A; parameter-recovery55 in Fig. 6C). Moreover, the participants in the 2 CHF group continue with an almost fully exploratory policy in episodes 2-5, indicating that they have only a small bias towards exploiting the small but known reward (Fig. 6A).
A. The relative importance of novelty-seeking in episodes 2-5 is computed for each participant after fitting the model to data (similar to Fig. 5C but using action-choices of human participants instead of simulated participants; Methods). Error bars show the SEMean and single dots the data of individual participants. We observe a significant negative correlation between the relative importance of novelty and the goal value (correlation test; t = −3.6; 95%CI = (−0.63, −0.20); DF = 55). P-values and BFs on top: Comparison with 0.5 (one-sample t-test). We observe a significant dominance of πint,t for the 2 CHF group (t = 5.9; 95%CI = (0.70, 0.92); DF = 20). A BF of 1/4.0 in favor of the null hypothesis 49 suggests an equal contribution of πext,t and πint,t for the 4 CHF group (t = −0.23; 95%CI = (0.35, 0.62); DF = 16). The test for 3 CHF group is inconclusive (t = 1.8; 95%CI = (0.48, 0.80); DF = 18). B. The relative importance of novelty-seeking in episode 1 implies a significant dominance of novelty-seeking against optimistic initialization for exploration (t = 7.3; 95%CI = (0.68, 0.82); DF = 56). C. Parameter-recovery 55 using the action-choices of 150 (= 50 per group) simulated participants seeking novelty (Methods). The comparison between the true contribution of novelty-seeking to action-selection (computed with the parameters used for simulations) and the recovered contribution (computed with the parameters fitted to the simulated action-choices) shows that the relative importance of novelty-seeking is on average identifiable in our experimental paradigm: Positive correlations both for episode 1 (t = 9.0; 95%CI = (0.48, 0.70); DF = 148) and episodes 2-5 (t = 12; 95%CI = (0.63, 0.79); DF = 148). Red p-values: Significant effects with False Discovery Rate controlled at 0.05 50 (see Methods). Red BFs: Significant evidence in favor of the alternative hypothesis (BF≥ 3). Blue BFs: Significant evidence in favor of the null hypothesis (BF≤ 1/3).
Since our simulated participants are informed that there are three different goal states in the environment, the reward-seeking component πext,t of the action-policy can also contribute to exploratory behavior, e.g., through optimistic initialization of Q-values37 or prior assumptions about the state-transitions (see Methods for a theoretical analysis). To study the extent of this contribution, we focus on episode 1 where this effect is most easily detectable: We observe a dominant influence of novelty-seeking on action-selection (Fig. 6B). This implies that, to explain human behavior, the knowledge of the existence of different goal states must drive exploration through a novelty-seeking policy instead of the optimistic initialization of a reward-seeking policy.
Discussion
We designed a novel experimental paradigm to study human curiosity-driven exploration in the presence of stochasticity. We made two main observations: (i) Human participants who are optimistic about finding higher rewards than those already discovered are persistently distracted by stochasticity; and (ii) this persistent pattern of distraction is explained better by seeking novelty than seeking information-gain or surprise, even though seeking information-gain is theoretically more robust in dealing with stochasticity.
How humans deal with the exploration-exploitation trade-off has been a long-lasting question in neuroscience and psychology 57,58. Experimental studies have shown that humans use a combination of random and directed exploration13,16, potentially linked to different neural mechanisms59–61. However, there are multiple distinct theoretical models to describe directed exploration5,10,26,62–65, and it has been debated which one is best suited to explain human behavior. In a general setting, human exploration is driven by multiple motivational signals 15,65, but it has also been shown that a particular signal can dominate exploration in specific tasks3,17,45,66–68. In earlier work18, we showed that novelty signals dominantly drive human exploration in situations where one needs to search for rewarding states in unknown but deterministic environments. Observations (i) and (ii) above provide further evidence for novelty as the dominant drive of the human search strategy even in situations where seeking novelty is not optimal and leads to distraction by reward-independent stochasticity. Further experimental studies are needed to investigate the role of novelty in other types of human exploratory behavior.
Observation (ii) is particularly surprising as it has been believed that humans are not prone to the ‘noisy TV’ problem4,31,33. Our results with human participants challenge the idea of defining curiosity as a normative solution to the exploration-exploitation trade-off3,6; hence, algorithmic advances in machine learning do not necessarily lead to better models of human exploration. However, we note that, for computing novelty, an agent only needs to track the state frequencies over time and does not need any knowledge of the environment’s structure (Methods); hence, computing novelty is computationally cheaper than computing information-gain. This suggests that a potentially higher level of distraction by noise in humans may be the price of spending less computational power. In other words, novelty-seeking in the presence of stochasticity may not be a globally optimal strategy for exploration but can be an optimal strategy given a set of prior assumptions and computational constraints, i.e., a ‘resource rational’ policy69–71.
The core assumption of using intrinsically motivated algorithms as models of human curiosity is the existence of an intrinsic exploration reward parallel to the extrinsic reward. There are, however, multiple ways to incorporate intrinsic rewards into the RL framework26. The common practice is to use a weighted sum of the intrinsic and the extrinsic reward as a single scalar reward signal driving action-selection8,19,28. An alternative approach is to treat different reward signals in parallel and compute separate action-policies which are combined to drive action-selection later in the processing stream18,72,73. The latter approach provides higher flexibility to arbitrate between different policies based on changes in the relative importance of intrinsic versus extrinsic rewards because it does not need re-planning or re-evaluation of already learned policies. Parallel processing paths are compatible with the rapid change of behavior observed in our human participants after finding a goal at the end of episode 1 (Fig. 3B1-2) and also consistent with experimental evidence for partially separate neural pathways of novelty- and reward-induced behaviors18,74–78.
We found that the relative importance of novelty- and reward-induced behaviors in human participants is correlated with the degree of reward optimism. This is in line with the known influence of environmental variables on an agent’s preference for novelty45,46,76. In particular, theories of ‘motivation crowding effect’79 and ‘undermining effect’80,81 suggest that the absolute value of extrinsic reward might contribute, in addition to the reward optimism, to the observed negative correlation in Fig. 6A, predicting that even if participants were confident that there is no other goal state in the environment, the 2 CHF group would spend more time in the stochastic part than the 4 CHF group – simply because 2 CHF is not an attractive reward anyway. A potential future direction to investigate the interplay of novelty and reward is to study human behavior in various environments with different reward distributions and different sources of stochasticity.
Optimism in psychology has been defined as a ‘variable that reflects the extent to which people hold generalized favorable expectancies for their future’48 and has been linked to several neural and behavioral characteristics48,82,83. While the traditional approach to measure optimism is through self-tests84, more recently statistical inference using RL85 and Bayesian86,87 models of behavior have been proposed to quantify variables correlated with traditional measurements. While there are multiple traditional ways to incorporate the notion of optimism into the RL framework (Methods), seeking intrinsic rewards has also been interpreted in the machine learning community as an ‘optimistic policy’ for exploration62. Our results show that the preference for an intrinsic reward is indeed correlated with a notion of optimism defined in the context of our experiment as the expectancy of finding a goal of higher value in episodes 2-5 (‘reward optimism’ in Fig. 6A). Moreover, the persistent exploration of the stochastic part of our environment observed in the behavior of human participants (Fig. 3B3) is consistent with the known phenomena of optimism bias88 and optimistic belief updating in humans 82,89,90.
Even though notions of ‘novelty’, ‘surprise’, and ‘information-gain’ are frequently used in neuroscience18,91,92, psychology93,94, and machine learning26,27,33, there is no consensus on the precise definitions of these notions as scientific terms44,95. Our results in this paper are based on the specific mathematical formulations that we have chosen (Methods), but we expect our conclusions to be invariant with respect to the precise choice of definitions as long as (i) novelty quantifies the infrequency of states 18, e.g., defined based on density models in machine learning19,20; (ii) surprise quantifies mismatches between observations and agents’ expectations, where the expectations are formed based on the previous state-action pair, including all measures of prediction surprise44 and typical measures of prediction error in machine learning21,28; and (iii) information-gain quantifies improvements in the agents’ world-model and vanishes with the accumulation of experience, e.g., including Bayesian surprise91, postdictive surprise92, and measures of disagreement and progress-rate in machine learning23–25,29.
In conclusion, our results show (i) that human decision-making is influenced by an interplay of intrinsic with extrinsic rewards that is controlled by reward optimism and (ii) that novelty-seeking RL algorithms can successfully model this interplay in tasks where humans search for rewarding states.
Methods
Ethics Statement
The data for the human experiment were collected under CE 164/2014, and the protocol was approved by the ‘Commission cantonale d’éthique de la recherche sur l’être humain’. All participants were informed that they could quit the experiment at any time, and they all signed a written informed consent. All procedures complied with the Declaration of Helsinki (except for pre-registration).
Experimental procedure for human participants
63 participants joined the experiment. Data of 6 participants were removed (see below) and, thus, data of 57 participants (27 female, mean age 24.1 ± 4.1 years) were included in the analyses. All participants were naïve to the purpose of the experiment and had normal or corrected-to-normal visual acuity. The experiment was scripted in MATLAB using the Psychophysics Toolbox96.
Before starting the experiment, the participants were informed that they needed to find any one of the 3 goal states 5 times. They were shown the 3 goal images and informed that different images had different reward values of 2 CHF, 3 CHF, and 4 CHF. Specifically, they were given the example that ‘if you find the 2 CHF goal twice, 3 CHF goal once, and 4 CHF goal twice, then you will be paid 2 × 2 + 1 × 3 + 2 × 4 = 15 CHF’; see ‘Informing RL agents of different goal states and modeling optimism’ for how simulated efficient agents and simulated participants were given this information. On each trial, participants were presented with an image (state) and three grey disks below the image (Fig. 1C). Clicking on a disk (action) led participants to a subsequent image, which was chosen based on the underlying graph of the environment in Fig. 1A-B (which was unknown to the participants). Participants clicked through the environment until they found one of the goal states, which ended the episode (Fig. 1C).
The assignment of images to states and disks to actions was random but kept fixed throughout the experiment and among participants. Exceptionally, we did not make the assignment for the actions in state 4 before the start of the experiment. Rather, for each participant, we assigned the disk that was chosen in the 1st encounter of state 4 to the stochastic action and the other two disks randomly to the bad and progressing actions, respectively (Fig. 1A). With this assignment, we made sure that all human participants would visit the stochastic part at least once during episode 1. The same protocol was used for simulated efficient agents and simulated participants.
Before the start of the experiment, we randomly assigned the different goal images (corresponding to the three reward values) to different goal states G∗, G1, and G2 in Fig. 1A, separately for each participant. The image and hence the reward value were then kept fixed throughout the experiment. In other words, we randomly assigned different participants to different environments with the same structure but different assignments of reward values. We, therefore, ended up with 3 groups of participants: 23 in the 2 CHF group, 20 in the 3 CHF group, and 20 in the 4 CHF group. The probability of encountering a goal state other than G∗ is controlled by the parameter ε. We considered ε to be around machine precision (10^−8), so we have (1 − ε)^(5×63) ≥ 1 − 10^−5 ≈ 1, meaning that all 63 participants would be taken almost surely to the goal state G∗ in all 5 episodes. We note, however, that a participant could in principle observe any of the 3 goals if they could choose the progressing action at state 6 sufficiently many times because lim_(t→∞) (1 − ε)^t = 0.
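As a worked check of this claim (our own bound, using Bernoulli's inequality):

```latex
(1-\varepsilon)^{5\times 63} \;\ge\; 1 - 5\times 63\,\varepsilon \;=\; 1 - 315\times 10^{-8} \;>\; 1 - 10^{-5}.
```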
2 participants (in the 2 CHF group) did not finish the experiment, and 4 participants (1 in the 3 CHF group and 3 in the 4 CHF group) took more than 3 times the group-average number of actions in episodes 2-5 to finish the experiment. We considered this a sign of inattentiveness and removed these 6 participants from further analyses.
The sample size was determined by a power analysis performed on the data of the efficient simulations done for Fig. 2 (see ‘Efficient model-based planning for simulated participants’ for the simulation details). Our goal was to have a statistical power of more than 80% (with a significance level of 0.05) for correlations in panels Fig. 2A, C, and E as well as for the differences for the highest importance of intrinsic rewards in Fig. 2D and F.
The correction for multiple hypothesis testing was done by controlling the False Discovery Rate at 0.05 50 over all 22 null hypotheses that are tested in Fig. 3, Fig. 5, and Fig. 6 (p-value threshold: 0.034). All Bayes Factors (abbreviated BF in the figures) were evaluated using the Schwarz approximation49 to avoid any assumptions on the prior distribution. We note that evaluating the Bayes Factors using priors suggested by ref.97,98 does not change our conclusions. We also note that using the Spearman correlation instead of the Pearson correlation in Fig. 2A, C, and E, Fig. 3A, and Fig. 6A does not change our conclusions.
Full hybrid model
We first present the most general case of our algorithm as visualized in Fig. 1D and then explain the special cases used for simulating efficient agents (Fig. 2) and for modeling human behavior (Fig. 4-Fig. 6). We used ideas from non-parametric Bayesian inference 99 to design an intrinsically motivated RL algorithm for environments where the total number of states is unknown. We present the final results here and give the derivations and pseudo-code in the Supplementary Materials.
We indicate the sequences of states and actions until time t by s1:t and a1:t, respectively, and define the set of all known states at time t as

𝒮(t) = {s1, ..., st} ∪ {g2CHF, g3CHF, g4CHF},

where g2CHF, g3CHF, and g4CHF are our three different goal states: g2CHF corresponds to the 2 CHF goal, g3CHF to the 3 CHF goal, and g4CHF to the 4 CHF goal. Note that {g2CHF, g3CHF, g4CHF} represents the images of the goal states and not their locations G∗, G1, and G2, and that the assignment of images to locations is unknown to the model. Hence, from t = 0 onward, the simulated efficient agents and the simulated participants are aware of the existence of multiple goal states in the environment. In a more general setting, {g2CHF, g3CHF, g4CHF} should be replaced by the set of all states whose images were shown to participants prior to the start of the experiment. After a transition to state st+1 = s′ resulting from taking action at = a at state st = s, the reward functions Rext and Rint,t evaluate the reward values rext,t+1 and rint,t+1. We define the extrinsic reward function Rext as

Rext(s, a → s′) = δ(s′, g2CHF) + r3CHF δ(s′, g3CHF) + r4CHF δ(s′, g4CHF),

where δ is the Kronecker delta function, and we assume (without loss of generality) a subjective extrinsic reward value of 1 for g2CHF (the 2 CHF goal) and subjective extrinsic reward values of r3CHF and r4CHF for g3CHF and g4CHF, respectively. The prior information of human participants about the difference in the monetary reward values of different goal states can be modeled in simulated participants by varying r3CHF and r4CHF (see ‘Informing RL agents of different goal states and modeling optimism’). We discuss choices of Rint,t in the next section.
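A minimal sketch of this extrinsic reward function (Python; the goal labels and default parameter values are placeholders, not fitted values):

```python
def extrinsic_reward(s_next, r_3chf=1.2, r_4chf=1.5):
    """Extrinsic reward R_ext for arriving in state s_next: 1 for the 2 CHF goal
    image (the reference), the free parameters r_3chf and r_4chf for the 3 CHF
    and 4 CHF goal images, and 0 for every non-goal state.
    Goal labels and default values are illustrative placeholders."""
    goal_values = {"g_2chf": 1.0, "g_3chf": r_3chf, "g_4chf": r_4chf}
    return goal_values.get(s_next, 0.0)
```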
As a general choice for the RL algorithm in Fig. 1D, we consider a hybrid of model-based and model-free policy 18,36,38,52. The model-free (MF) component uses the sequence of states s1:t, actions a1:t, extrinsic rewards rext,1:t, and intrinsic rewards rint,1:t (in the two parallel branches in Fig. 1D) and estimates the extrinsic and intrinsic Q-values QMF,ext(s, a) and QMF,int(s, a), respectively. Traditionally, MF algorithms do not need to know the total number of states37; thus, the MF component of our algorithm remains similar to that of previous studies 18,35: At the beginning of episode 1, we initialize the Q-values at

QMF,ext(s, a) = Qext,0 and QMF,int(s, a) = Qint,0 for all state-action pairs (s, a).

Then, the estimates are updated recursively after each new observation. After the transition (st, at) → st+1, the agent computes extrinsic and intrinsic reward prediction errors RPEext,t+1 and RPEint,t+1, respectively:

RPEext,t+1 = rext,t+1 + λext VMF,ext(st+1) − QMF,ext(st, at),
RPEint,t+1 = rint,t+1 + λint VMF,int(st+1) − QMF,int(st, at),

where λext and λint ∈ [0, 1) are the discount factors for extrinsic and intrinsic reward seeking, respectively, and VMF,ext(st+1) and VMF,int(st+1) are the extrinsic and intrinsic V-values37 of the state st+1, respectively. We use two separate eligibility traces35,37 for the update of the Q-values, one for extrinsic reward eext,t and one for intrinsic reward eint,t, both initialized at zero at the beginning of each episode. The update rules for the eligibility traces after taking action at at state st are

eext,t+1(s, a) = µext λext eext,t(s, a) + δ(s, st) δ(a, at),
eint,t+1(s, a) = µint λint eint,t(s, a) + δ(s, st) δ(a, at),

where λext and λint are the discount factors defined above, and µext and µint ∈ [0, 1] are the decay factors of the eligibility traces for the extrinsic and intrinsic rewards, respectively. The update rule for the Q-values is then

QMF,t+1(s, a) = QMF,t(s, a) + ρ RPEt+1 et+1(s, a),

where et+1 is the eligibility trace (i.e., either eext,t+1 or eint,t+1), RPEt+1 is the reward prediction error (i.e., either RPEext,t+1 or RPEint,t+1), and ρ ∈ [0, 1) is the learning rate.
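The following sketch implements one MF branch under the update rules reconstructed above (Python; class and variable names are our own, and the V-value of the next state is passed in rather than computed, since its exact definition is not restated here):

```python
import numpy as np
from collections import defaultdict

class ModelFreeBranch:
    """One model-free branch (used once for extrinsic and once for intrinsic
    reward): Q-learning-style update with eligibility traces, as sketched above."""

    def __init__(self, n_actions, q0, discount, decay, learning_rate):
        self.q = defaultdict(lambda: q0 * np.ones(n_actions))  # Q_MF(s, .)
        self.e = defaultdict(lambda: np.zeros(n_actions))      # eligibility traces
        self.lam = discount       # lambda (discount factor)
        self.mu = decay           # mu (trace decay factor)
        self.rho = learning_rate  # rho (learning rate)

    def new_episode(self):
        self.e.clear()  # traces are reset to zero at the start of each episode

    def update(self, s, a, reward, v_next):
        # Reward prediction error: RPE = r + lambda * V(s') - Q(s, a).
        rpe = reward + self.lam * v_next - self.q[s][a]
        # Decay all traces, then mark the visited state-action pair.
        for state in list(self.e.keys()):
            self.e[state] *= self.mu * self.lam
        self.e[s][a] += 1.0
        # Q-update: Q(s, a) <- Q(s, a) + rho * RPE * e(s, a), for all (s, a).
        for state, trace in self.e.items():
            self.q[state] += self.rho * rpe * trace
        return rpe
```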
The model-based (MB) component builds a world-model that summarizes the structure of the environment by estimating the probability p(t)(s′|s, a) of the transition (s, a) → s′. To do so, an agent counts the transition (s, a) → s′ recursively and using a leaky integration100,101:

C(t+1)(s, a, s′) = κ C(t)(s, a, s′) + δ((s, a, s′), (st, at, st+1)),     (5)

where δ is the Kronecker delta function, C(1)(s, a, s′) = 0 for all triplets, and κ ∈ [0, 1] is the leak parameter and accounts for imperfect memory and model-building in humans. If κ = 1, then C(t)(s, a, s′) is the exact count of transition (s, a) → s′. These counts are used to estimate the transition probabilities p(t)(s′|s, a) (Eq. 6), where n(t)(s, a) = Σs′ C(t)(s, a, s′) is the count of taking action a at state s, ϵobs ∈ ℝ+ is a free parameter for the prior probability of transition to a known state (i.e., states in 𝒮(t)), and ϵnew ∈ ℝ+ is a free parameter for the prior probability of transition to a new state (i.e., states not in 𝒮(t)) – see Supplementary Materials for derivations. Choosing ϵnew = 0 is equivalent to assuming there is no unknown state in the environment, for which the estimate in Eq. 6 is reduced to the classic Bayesian estimate of transition probabilities in bounded discrete environments18,36. The transition probabilities are then used in a novel variant of prioritized sweeping37,40 adapted to deal with an unknown number of states. The prioritized sweeping algorithm computes a pair of Q-values, i.e., Q(t)MB,ext for extrinsic and Q(t)MB,int for intrinsic rewards, by solving the corresponding Bellman equations37 with TPS,ext and TPS,int iterations, respectively. See Supplementary Materials for details.
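A sketch of the leaky-count world model follows. The exact normalization of Eq. 6 (how ϵnew spreads probability mass over unobserved states) is derived in the Supplementary Materials; the code below assumes a simple Dirichlet-style form with a single aggregate 'new state' category and should be read as an approximation, not the authors' estimator.

```python
from collections import defaultdict

class WorldModel:
    """Sketch of the leaky-count world model.

    Assumption: a Dirichlet-like normalization with one pseudo-count eps_obs per
    known successor state and one aggregate 'new-state' category with mass eps_new.
    """
    def __init__(self, known_states, eps_obs=1e-4, eps_new=1e-5, kappa=1.0):
        self.known = set(known_states)    # S^(t): states shown before the experiment
        self.eps_obs = eps_obs
        self.eps_new = eps_new
        self.kappa = kappa
        self.counts = defaultdict(float)  # leaky counts C^(t)(s, a, s')

    def observe(self, s, a, s_next):
        # Leaky integration (Eq. 5): decay all counts, increment the observed one.
        for key in list(self.counts):
            self.counts[key] *= self.kappa
        self.counts[(s, a, s_next)] += 1.0
        self.known.add(s_next)

    def prob(self, s, a, s_next):
        n_sa = sum(c for (s_, a_, _), c in self.counts.items() if (s_, a_) == (s, a))
        denom = n_sa + len(self.known) * self.eps_obs + self.eps_new
        if s_next in self.known:
            return (self.counts.get((s, a, s_next), 0.0) + self.eps_obs) / denom
        return self.eps_new / denom       # aggregate mass for transitions to a new state
```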
Finally, actions are chosen by a hybrid softmax policy37: The probability of taking action a in state s at time t is

πt(a|s) ∝ exp( βMB,ext Q(t)MB,ext(s, a) + βMF,ext Q(t)MF,ext(s, a) + βMB,int Q(t)MB,int(s, a) + βMF,int Q(t)MF,int(s, a) ),     (7)

where βMB,ext ∈ ℝ+, βMF,ext ∈ ℝ+, βMB,int ∈ ℝ+, and βMF,int ∈ ℝ+ are free parameters (i.e., inverse temperatures of the softmax policy37) expressing the contribution of each Q-value to action-selection. For the schematic in Fig. 1D, we grouped the corresponding MB and MF terms, so that the policy appears as a combination of an extrinsic and an intrinsic branch. In general, the contribution of seeking extrinsic reward and seeking intrinsic reward as well as the MB and MF branches to action-selection depends on different factors, including time passed since the beginning of the experiment39,52, cognitive load102, and whether the location of reward is known18. Here, we make the simplifying assumption that these contributions (expressed as the 4 inverse temperatures) are constant within but potentially different between the two phases of the experiment:
Phase 1: Before finding the goal state in episode 1, we consider β(1)MB,ext, β(1)MF,ext, β(1)MB,int, and β(1)MF,int as four independent free parameters chosen independently for each agent.
Phase 2: After finding the goal, i.e., in all episodes after episode 1, we consider β(2)MB,ext, β(2)MF,ext, β(2)MB,int, and β(2)MF,int as another four independent free parameters chosen independently for each agent.
See ‘Relative importance of novelty in action-selection’ for how these inverse temperatures relate to the influence of intrinsic and extrinsic rewards on action-choices (Fig. 5C and Fig. 6).
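A sketch of the softmax action-selection of Eq. 7 with phase-dependent inverse temperatures is given below; the numerical values of the β's are arbitrary placeholders.

```python
import numpy as np

def hybrid_policy(q_mb_ext, q_mf_ext, q_mb_int, q_mf_int, betas):
    """Sketch of the hybrid softmax policy (Eq. 7) for one state.

    Each q_* is a 1-D array over the available actions; `betas` holds the four
    inverse temperatures of the current phase.
    """
    logits = (betas["mb_ext"] * q_mb_ext + betas["mf_ext"] * q_mf_ext
              + betas["mb_int"] * q_mb_int + betas["mf_int"] * q_mf_int)
    logits = logits - logits.max()        # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Phase-dependent inverse temperatures: one set before the first goal encounter
# (phase 1) and another afterwards (phase 2); the numbers are placeholders.
BETAS = {1: {"mb_ext": 1.0, "mf_ext": 0.5, "mb_int": 4.0, "mf_int": 1.0},
         2: {"mb_ext": 4.0, "mf_ext": 1.0, "mb_int": 2.0, "mf_int": 0.5}}

rng = np.random.default_rng(0)
qs = [rng.standard_normal(4) for _ in range(4)]   # fake Q-values over 4 actions
print(hybrid_policy(*qs, BETAS[2]))
```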
Summary of free parameters
To summarize, the full hybrid algorithm has 22 free parameters:

Φ = ( r3 CHF, r4 CHF, Qext,0, Qint,0, λext, λint, µext, µint, ρ, κ, ϵnew, ϵobs, TPS,ext, TPS,int, β(1)MB,ext, β(1)MF,ext, β(1)MB,int, β(1)MF,int, β(2)MB,ext, β(2)MF,ext, β(2)MB,int, β(2)MF,int ),

where r3 CHF and r4 CHF are the subjective values of the 3 CHF goal and the 4 CHF goal, respectively (with the 2 CHF goal being the reference goal with a value of 1), Qext,0 and Qint,0 are the initial values for MF Q-values, λext and λint are the discount factors, µext and µint are the decay rates of the eligibility traces, ρ is the MF learning rate, κ is the leak parameter for model-building, ϵnew and ϵobs are prior parameters for model-building, TPS,ext and TPS,int are the numbers of iterations for prioritized sweeping, and the eight β parameters (four per phase) are the inverse temperatures of the softmax policy.
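For bookkeeping in simulations, the 22 parameters can be collected in a single container, as in the sketch below; the field names and default values are illustrative only (a few defaults echo the values used later in the Methods for the efficient agents).

```python
from dataclasses import dataclass

@dataclass
class HybridParams:
    """Container for the 22 free parameters (field names are ours, not the authors')."""
    # subjective goal values (2 CHF goal is the reference with value 1)
    r_3chf: float = 1.5
    r_4chf: float = 2.0
    # MF initial Q-values, discount factors, trace decays, learning rate
    q0_ext: float = 0.5
    q0_int: float = 0.5
    lam_ext: float = 0.95
    lam_int: float = 0.70
    mu_ext: float = 0.5
    mu_int: float = 0.5
    rho: float = 0.1
    # world-model parameters
    kappa: float = 1.0
    eps_new: float = 1e-5
    eps_obs: float = 1e-4
    t_ps_ext: int = 100
    t_ps_int: int = 100
    # inverse temperatures, phase 1 (before the first goal) and phase 2 (after)
    b1_mb_ext: float = 1.0
    b1_mf_ext: float = 1.0
    b1_mb_int: float = 1.0
    b1_mf_int: float = 1.0
    b2_mb_ext: float = 1.0
    b2_mf_ext: float = 1.0
    b2_mb_int: float = 1.0
    b2_mf_int: float = 1.0
```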
Different choices of intrinsic reward
The intrinsic reward function Rint,t maps a transition (s, a) → s′ to an intrinsic reward value, i.e., rint,t+1 = Rint,t(st, at → st+1). In this section, we present our 3 choices of Rint,t.
Novelty
For an agent seeking novelty (red in Fig. 2, Fig. 4, and Fig. 5), we define the intrinsic reward of a transition (s, a) → s′ as the novelty of the state s′, i.e., as a decreasing function of the state frequency freq(t)(s′), where freq(t)(s′) is computed from the pseudo-count C(t)(s′) of encounters of state s′ up to time t, updated by leaky integration (similar to Eq. 5) and initialized at zero. With this definition, which generalizes earlier works18 to the case where the number of states is unknown, the least novel states are those that have been encountered most often (i.e., those with the highest pseudo-counts C(t)(s′)). Moreover, novelty is at its highest value for the unobserved states, because freq(t)(s′) takes its smallest possible value for any unobserved state s′ ∉ 𝒮(t). Similar intrinsic rewards have been used in machine learning19,20.
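The sketch below assumes the common choice of novelty as the negative logarithm of a pseudo-count-based state frequency; this matches the qualitative properties stated above (most-visited states are least novel, unobserved states are most novel) but may differ in detail from the exact definition used in the paper.

```python
import math
from collections import defaultdict

class NoveltyReward:
    """Sketch of a pseudo-count-based novelty signal.

    Assumption: novelty = -log(state frequency), with one pseudo-count eps per
    known state and an aggregate floor for unobserved states.
    """
    def __init__(self, known_states, eps=1.0, kappa=1.0):
        self.known = set(known_states)
        self.eps = eps
        self.kappa = kappa
        self.C = defaultdict(float)   # leaky pseudo-counts of state visits
        self.t = 0.0                  # leaky total count

    def observe(self, s):
        for key in list(self.C):
            self.C[key] *= self.kappa
        self.t = self.kappa * self.t + 1.0
        self.C[s] += 1.0
        self.known.add(s)

    def reward(self, s_next):
        # Frequency of s_next; unobserved states receive only the prior mass eps.
        freq = (self.C.get(s_next, 0.0) + self.eps) / (self.t + self.eps * (len(self.known) + 1))
        return -math.log(freq)        # least-visited states are most novel
```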
Surprise
For an agent seeking surprise (orange in Fig. 2, Fig. 4, and Fig. 5), we define the intrinsic reward function as the Shannon surprise (a.k.a. surprisal)44

Rint,t(s, a → s′) = −log p(t)(s′|s, a),     (11)

where p(t)(s′|s, a) is defined in Eq. 6. With this definition, the expected (over s′) intrinsic reward of taking action a at state s is equal to the entropy of the distribution p(t)(s′|s, a)103. If ϵnew < ϵobs, then the most surprising transitions are the ones to unobserved states. Similar intrinsic rewards have been used in machine learning21,28.
Information-gain
For an agent seeking information-gain (green in Fig. 2, Fig. 4, and Fig. 5), we define the intrinsic reward function as

Rint,t(s, a → s′) = DKL( p(t+1)(·|s, a) ∥ p(t)(·|s, a) ),     (12)

where DKL is the Kullback-Leibler divergence103, and p(t+1) is the updated world-model upon observing (s, a) → s′. The dots in Eq. 12 denote the dummy variable over which we integrate to evaluate the Kullback-Leibler divergence. Note that if s′ ∉ 𝒮(t), then there are some technical problems in the naïve computation of DKL – since p(t) and p(t+1) have different supports. We deal with these problems using a more fundamental definition of DKL based on the Radon–Nikodym derivative; see Supplementary Materials for derivations and see ref.42 for an alternative heuristic solution. Note that the information-gain in Eq. 12 has also been interpreted as a measure of surprise (called ‘Postdictive surprise’92), but it has a behavior radically different from that of the Shannon surprise introduced above (Eq. 11) – see ref.44 for an elaborate treatment of the topic.
Importantly, the expected (over s′) information-gain corresponding to a state-action pair (s, a) converges to 0 as the count n(t)(s, a) goes to infinity (see Supplementary Materials for the proof). Similar intrinsic rewards have been used in machine learning23,29,33,42.
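The following sketch contrasts the two signals for a small categorical transition model. The Dirichlet-style count update and the direction of the KL divergence (updated model relative to old model, following the postdictive-surprise reading) are assumptions made only for illustration; the point is that, under reward-free stochasticity, surprise stays high while information-gain decays toward zero.

```python
import numpy as np

def shannon_surprise(p_old, idx):
    """Surprise of observing outcome idx under the current model p_old (Eq. 11 style)."""
    return -np.log(p_old[idx])

def information_gain(p_old, p_new):
    """KL divergence between updated and old model (Eq. 12 style, assumed direction)."""
    return float(np.sum(p_new * np.log(p_new / p_old)))

# Illustration with an assumed Dirichlet-style count model over 3 successor states
# of a 'stochastic' state-action pair whose outcomes are uniformly random.
counts = np.ones(3)                     # prior pseudo-counts
rng = np.random.default_rng(0)
for t in range(1, 1001):
    p_old = counts / counts.sum()
    obs = rng.integers(3)
    counts[obs] += 1
    p_new = counts / counts.sum()
    if t in (1, 10, 100, 1000):
        print(t, round(shannon_surprise(p_old, obs), 3),
              round(information_gain(p_old, p_new), 6))
# Surprise stays around log(3) for the uniform stochastic outcomes, while the
# information gain shrinks toward 0 as the counts (and hence confidence) grow.
```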
Informing RL agents of different goal states and modeling optimism
Human participants had been informed that there were different goal states in the environment with different monetary reward values. This information was intended to motivate participants to further explore the environment after they received the first reward at the end of episode 1. This information is incorporated into our hybrid algorithms through a few mechanisms, some of which include explicit information about the goal states, whereas others include only an implicit notion of optimism.
Our main focus throughout the paper has been on modeling reward optimism by balancing intrinsic rewards against extrinsic rewards (Fig. 2, Fig. 5, and Fig. 6). In particular, assigning different values to βMB,ext, βMF,ext, βMB,int, and βMF,int (cf. Eq. 7) during the two phases of the experiment enables us to implicitly make the relative importance of intrinsic rewards depend on the difference between the reward value rG∗ of the discovered goal and the known reward values
of the other goal states (Eq. 2). Our results for the fitted relative importance of intrinsic reward across different groups of human participants (Fig. 6A) support this very assumption, which implies that the influence of reward optimism on the action-choices is via regulation of the balance between two separate policies, one for seeking intrinsic and one for seeking extrinsic rewards.
However, there are two other alternative mechanisms, purely based on seeking extrinsic rewards, that can contribute to reward-optimism in our hybrid algorithms: The model-based and model-free optimistic initialization. In this section, we discuss these mechanisms and how they balance exploration versus exploitation. We note that our results in Fig. 4 and Fig. 6 (particularly Fig. 6B) show that these two mechanisms alone are not enough and that a novelty-seeking module is necessary to explain the behavior of human participants; otherwise, all three intrinsically motivated algorithms would have the same probability of generating human data – because the purely reward-seeking algorithm with optimistic initialization is a special case of all three intrinsically motivated algorithms that we compared. In other words, if optimistic initialization alone were sufficient to explain human behavior, then all three algorithms would perform equally well in Fig. 4 and the best fit would indicate a relative importance of 0 for novelty in Fig. 6.
Model-based optimistic initialization
MB optimistic initialization is an explicit approach to model reward-optimism through the design of the world-model. The MB branch of the hybrid algorithm finds the extrinsic Q-values by solving the Bellman equations

Q(t)MB,ext(s, a) = r̄(t)ext(s, a) + λext Σs′ p(t)(s′|s, a) V(t)MB,ext(s′),

where p(t)(s′|s, a) is the estimated transition probability in Eq. 6, and r̄(t)ext(s, a) = Σs′ p(t)(s′|s, a) Rext(s, a → s′) is the average immediate extrinsic reward expected to be collected by taking action a in state s (see Eq. 2). Hence, the knowledge of the existence of three different goal states with three different rewards has an explicit influence on the MB branch of our algorithms. For example, because no transitions to any of the goal states have been experienced during episode 1, we have

r̄(t)ext(s, a) = p(t)(G2 CHF|s, a) + r3 CHF p(t)(G3 CHF|s, a) + r4 CHF p(t)(G4 CHF|s, a),

where, because the corresponding counts are zero, each of these transition probabilities is determined by the prior parameters and the count n(t)(s, a). This equation has two important implications. First, r̄(t)ext(s, a) is an increasing function of ϵobs. This implies that the expected reward of a transition during episode 1 increases by increasing the prior probability of transition to states in 𝒮(t). This is a direct consequence of our Bayesian approach to estimating the world-model. Second, r̄(t)ext(s, a) is a decreasing function of n(t)(s, a). This implies that the expected reward of a state-action pair decreases by experience. Importantly, r̄(t)ext(s, a) converges to 0 as n(t)(s, a) goes to infinity, which makes a link between exploration driven by the MB optimistic initialization and exploration driven by information-gain.
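A numerical illustration of these two implications, assuming the same Dirichlet-style transition estimate as in the world-model sketch above (the number of known states and the goal values are placeholders):

```python
def expected_goal_reward(n_sa, eps_obs, n_known=61, r3=1.5, r4=2.0, eps_new=1e-5):
    """Prior-predictive immediate reward of (s, a) during episode 1 (sketch).

    Assumes a Dirichlet-style estimate of the transition probabilities and that
    no transition to any goal has been observed, so each goal's probability is
    eps_obs / (n_sa + n_known * eps_obs + eps_new). n_known, r3, r4 are placeholders.
    """
    denom = n_sa + n_known * eps_obs + eps_new
    return (1.0 + r3 + r4) * eps_obs / denom

for n in (0, 1, 5, 20, 100):
    print(n, expected_goal_reward(n, eps_obs=1e-4))
# The value increases with eps_obs and decreases toward 0 as the count n_sa grows,
# mirroring the fading exploration drive of the MB optimistic initialization.
```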
During episodes 2-5, the exact theoretical analysis of the MB optimistic initialization is rather complex. However, using a few approximation steps for episode 2, we can find a condition for whether the MB extrinsic Q-values show a preference for exploring or leaving the stochastic part (Supplementary Materials). The condition involves a comparison between the discounted reward value of the discovered goal state and an optimistic estimate of a reward-to-be-found in the stochastic part; this estimate depends on the subjective reward values r3 CHF and r4 CHF and on the average pseudo-count of state-action pairs in the stochastic part (Supplementary Materials). We can show that, for a fixed average pseudo-count, increasing the reward values of the yet-undiscovered goals would eventually result in a preference for staying in the stochastic part: If the reward value of a goal state is much greater than the value of the discovered goal state, then the agent prefers to keep exploring the stochastic part. However, for any value of r3 CHF and r4 CHF, increasing the average pseudo-count would eventually result in a preference for leaving the stochastic part and going towards the already discovered goal: After a sufficiently long and unsuccessful exploration phase, agents will eventually give up exploration. This is another qualitative link between exploration based on the MB optimistic initialization and exploration driven by information-gain. This qualitative link leads to the conclusion that an agent with only the MB optimistic initialization cannot explain human behavior, for the same reason that an agent with intrinsic reward based on information-gain cannot explain human behavior.
Model-free optimistic initialization
As opposed to the MB branch of the hybrid algorithm, the MF branch does not have any explicit knowledge about the existence of different goal states and their values. However, the initial value of the MF extrinsic Q-values quantifies an expectation of the reward values in the environment prior to any interaction with the environment. During episode 1, no extrinsic reward is received by the agent; hence, for a small enough learning rate ρ and an optimistic initialization Qext,0 > 0, the extrinsic reward prediction errors are always negative (Eq. 3). As a result, Q(t)MF,ext(s, a) decreases as an agent keeps taking action a in state s, which motivates the agent to try new actions. This is a well-known mechanism for directed exploration in the machine learning community37. Similar to the MB optimistic initialization, the effect of the MF optimistic initialization fades out over time – which makes them both similar to exploration driven by information-gain.
During episodes 2-5, the exact theoretical analysis of the MF optimistic initialization is complex and dependent on an agent's exact trajectory (because of the eligibility traces). However, whether the MF extrinsic Q-values show a preference for exploring or leaving the stochastic part essentially depends on the reward value of the discovered goal state and the initialization value Qext,0. For example, if an agent, starting at s1, takes the perfect trajectory s1 → s2 → s3 → s4 → s5 → s6 → G∗ in episode 1, then, given a unit decay rate of the eligibility traces (i.e., µext = 1), it is easy to see that, at the 1st visit of state 4 in episode 2, the agent prefers the stochastic/bad action over the progressing action if the initialization Qext,0 exceeds the discounted reward value of the discovered goal as seen from state 4. This implies that, even though the MF branch is not explicitly aware of different goal states and their reward values, it is still able to model a type of reward optimism through the initialization of Q-values. Nevertheless, since model fitting reveals an importance factor significantly greater than 0.5 (Fig. 6), the effective reward optimism generated by optimistic initialization is not strong enough to explain human behavior.
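The example above can be replayed numerically. The sketch below assumes accumulating traces with µext = 1, intermediate states still holding their initial value, and a V-value of 0 at the terminal goal; under these assumptions the untried 'stochastic' action at state 4 is preferred whenever the initialization exceeds roughly the discounted goal value λext² rG∗.

```python
def q_s4_after_episode1(q0, lam, rho, r_goal=1.0, steps_to_goal=3):
    """Q-value of the progressing action at state 4 after the perfect episode-1
    trajectory s4 -> s5 -> s6 -> G*, under a TD(lambda) update with mu = 1,
    accumulating traces, and V(G*) = 0 (illustrative assumptions)."""
    q = q0            # Q(s4, progressing)
    trace = 1.0       # eligibility of (s4, progressing)
    for step in range(steps_to_goal):
        last = (step == steps_to_goal - 1)
        # RPE at each transition: intermediate states still hold their initial value q0.
        rpe = (r_goal - q0) if last else (lam * q0 - q0)
        q += rho * rpe * trace
        trace *= lam  # trace decays before the next transition's update
    return q

q0, lam, rho = 0.9, 0.9, 0.1
q_prog = q_s4_after_episode1(q0, lam, rho)
# The untried stochastic action keeps its initial value q0, so the MF extrinsic
# branch alone prefers it whenever q_prog < q0, i.e. roughly when
# q0 exceeds the discounted goal value lam**2 * r_goal seen from state 4.
print(q_prog, q_prog < q0)
```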
Efficient model-based planning for simulated participants
For simulating efficient agents in Fig. 2, we set ε = 0 (see Fig. 1A) and used a pure MB version of our algorithm with 13 parameters:

ΦMB = ( r3 CHF, r4 CHF, λext, λint, κ, ϵnew, ϵobs, TPS,ext, TPS,int, β(1)MB,ext, β(1)MB,int, β(2)MB,ext, β(2)MB,int ).     (16)

We considered perfect model-building by assuming κ = 1 and almost perfect planning by assuming TPS,ext = TPS,int = 100. We chose the discount factors λext and λint as well as the prior parameters ϵnew and ϵobs in the range of fitted parameters reported by Xu et al. (2021)18: λext = 0.95, λint = 0.70, ϵnew = 10⁻⁵, and ϵobs = 10⁻⁴. To separate, at least partially, the effect of optimistic initialization37 from seeking intrinsic reward in episodes 2-5, we assumed the same value of reward for all goals, i.e., r3 CHF = r4 CHF = 1. Finally, we considered β(1)MB,ext = 0 to have pure intrinsic reward seeking in episode 1.
After fixing parameter values for 10 out of 13 parameters in Eq. 16, we fine-tuned β(1)MB,int to minimize the average length of episode 1 (to find the goal as fast as possible; see Supplementary Materials). For episodes 2-5, we first set β(2)MB,int = 0 and chose β(2)MB,ext > 0 to have a non-deterministic policy for purely seeking extrinsic reward after the 1st encounter of the goal (the darkest shade of colors in Fig. 2). Different shades of color in Fig. 2 correspond to different choices of ω ∈ [0, 1] for the relative scaling of β(2)MB,ext and β(2)MB,int. More precisely, we used ω = 0 for the darkest color (pure extrinsic reward seeking), ω = 1 for the lightest color (pure intrinsic reward seeking), and ω = 0.7 for the one in between. Higher values of ω indicate a higher relative importance of the intrinsic reward.
Model-fitting and model-comparison
To compare seeking different intrinsic rewards based on their explanatory power, we considered our full hybrid algorithm (with both MF and MB components) – except that we set TPS,ext = TPS,int = 100 to decrease the number of parameters, based on the results of Xu et al. (2021)18 showing the negligible importance of this parameter. As a result, we had 20 free parameters for each of the three intrinsic rewards (i.e., novelty, information-gain, and surprise). For each intrinsic reward R ∈ {novelty, inf-gain, surprise} and for each participant n ∈ {1, …, 57}, we estimated the algorithm's parameters by maximizing the likelihood of the data given the parameters:

Φ̂n,R = argmaxΦ P(Dn|Φ, Rn = R),

where Dn is the data of participant n, Rn is the intrinsic reward assigned to participant n, P(Dn|Φ, Rn = R) is the probability of Dn being generated by our intrinsically motivated algorithm seeking Rn = R with its parameters equal to Φ (see Eq. 9), and Φ̂n,R is the set of estimated parameters that maximizes that probability. For optimization, we used the Subplex algorithm104 as implemented in the Julia NLopt package105.
Because all algorithms have the same number of parameters, we considered the maximum log-likelihood as the model log-evidence, i.e., for intrinsic reward R and participant n, we consider log P(Dn|Rn = R) := log P(Dn|Φ̂n,R, Rn = R) – which is equal to a shifted Schwarz approximation of the model log-evidence (also called BIC)49,53. Fig. 4A1 shows the total log-evidence Σn log P(Dn|Rn = R). With the fixed-effects assumption at the level of models (i.e., assuming that R1 = R2 = … = R57 = R∗), the total log-evidence is equal to the log posterior probability log P(R∗ = R|D1:57) of R being the intrinsic reward used by all participants (plus a constant). See ref.53,55 for tutorials.
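As an illustration of the fixed-effects comparison, per-participant log-evidences can be turned into a posterior over models by a softmax of the total log-evidences (uniform prior); the numbers below are placeholders, not the study's log-evidences.

```python
import numpy as np

def fixed_effects_posterior(log_evidence):
    """Posterior P(R* = R | D_1:N) under the fixed-effects assumption.

    `log_evidence` maps each model name to an array of per-participant
    log-evidences log P(D_n | R_n = R); with a uniform prior over models,
    the posterior is a softmax of the total log-evidences.
    """
    models = list(log_evidence)
    totals = np.array([np.sum(log_evidence[m]) for m in models])
    w = np.exp(totals - totals.max())       # subtract max for numerical stability
    return dict(zip(models, w / w.sum()))

# Placeholder log-evidences for 5 participants and the three intrinsic rewards.
rng = np.random.default_rng(1)
fake = {m: -200 + 5 * rng.standard_normal(5) for m in ("novelty", "inf-gain", "surprise")}
print(fixed_effects_posterior(fake))
```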
We also considered the Bayesian model selection method of ref.54 with the random-effects assumption, i.e., assuming that participant n uses the intrinsic reward Rn = R, which is not necessarily the same as the one used by other participants, with probability PR. We performed Markov Chain Monte Carlo sampling (using the Metropolis-Hastings algorithm50 with a uniform prior and 40 chains of length 10,000) for inference and estimated the joint posterior distribution P(Pnovelty, Pinf-gain, Psurprise|D1:57).
Fig. 4A2 shows the expected posterior probability 𝔼[PR|D1:57] as well as the protected exceedance probabilities P(PR > PR′ for all R′ ≠ R|D1:57) computed by using the participant-wise log-evidences.
The boxplots of the fitted parameters of novelty-seeking are shown in the Supplementary Materials. The same set of parameters was used for model-recovery in Fig. 4B, posterior predictive checks in Fig. 5, computing the relative importance of novelty in Fig. 6A-B (see ‘Relative importance of novelty in action-selection’), and parameter recovery in Fig. 6C.
Posterior predictive checks, model-recovery, and parameter-recovery
For each intrinsic reward R ∈ {novelty, inf-gain, surprise} and participant group 𝒢 ∈ {2 CHF, 3 CHF, 4 CHF}, we repeated the following two steps 500 times: 1. We sampled a participant n from group 𝒢 uniformly at random. 2. We ran a 5-episode simulation in our environment using the intrinsic reward R and the fitted parameters Φ̂n,R of participant n, i.e., we sampled a trajectory D from P(·|Φ̂n,R, Rn = R) (with the G∗ of the environment corresponding to the group 𝒢). As a result, we ended up with 1500 simulated participants (with randomly sampled parameters) for each algorithm.
We considered the simulated participants who took more than 3000 actions in any of the 5 episodes to be similar to the human participants who quit the experiment and excluded them from further analyses: 238 (∼ 16%) of simulated participants seeking novelty, 166 (∼ 11%) of those seeking information-gain, and 374 (∼ 25%) of those seeking surprise. We note that, even with the marginal influence of surprise on action-selection (Fig. 5C), one fourth of the participants seeking surprise cannot escape the stochastic part in less than 3000 actions. Moreover, we excluded, separately for each algorithm, the simulated participants who took more than 3 times the group-average number of actions in episodes 2-5 to finish the experiment (i.e., the same criterion that we used to detect non-attentive human participants): 45 (∼ 3%) of simulated participants seeking novelty, 77 (∼ 5%) of those seeking information-gain, and 27 (∼ 2%) of those seeking surprise. We then analyzed the remaining participants (1217 simulated participants seeking novelty, 1257 seeking information-gain, and 1099 seeking surprise) as if they were real human participants. Fig. 5 and its supplements in the Supplementary Materials show the data statistics of simulated participants in comparison to human participants.
Given the participants simulated by each of the three intrinsically motivated algorithms, we fitted all three algorithms to the action-choices of 150 simulated participants (50 from each participant group, i.e., 2 CHF, 3 CHF, and 4 CHF). Then, we applied the Bayesian model selection method of ref.54 to 5 randomly chosen sub-populations of these 150 simulated participants (each with 60 participants, i.e., 20 from each participant group). Fig. 4B shows the results of the model-comparison averaged over these 5 repetitions. Fig. 6C shows the relative importance of novelty in action-selection (see Eq. 18) for each of the 150 simulated participants estimated using the original parameters (which were used for simulations) and the recovered parameters (which were found by re-fitting the algorithms to the simulated data).
Relative importance of novelty in action-selection
The relative importance of novelty in action-selection depends not only on the inverse temperatures βMB,ext, βMF,ext, βMB,int, and βMF,int but also on the variability of the Q-values; for example, if the extrinsic Q-values QMB,ext and QMF,ext are the same for all state-action pairs, then, independently of the values of the inverse temperatures, the action is taken by a pure novelty-seeking policy – because the policy in Eq. 7 can be re-written as

πt(a|s) ∝ exp( βMB,int Q(t)MB,int(s, a) + βMF,int Q(t)MF,int(s, a) ),

since the constant extrinsic terms cancel in the softmax. Thus, to measure the contribution of different components of action-selection to the final policy, we need to consider the variations in their Q-values as well.
In this section, we propose a variable ωi2e ∈ [0, 1] for quantifying the relative importance of seeking intrinsic reward in comparison to seeking extrinsic reward. We first define the total intrinsic and extrinsic Q-values as

Qint,tot(s, a) = βMB,int Q(t)MB,int(s, a) + βMF,int Q(t)MF,int(s, a)   and   Qext,tot(s, a) = βMB,ext Q(t)MB,ext(s, a) + βMF,ext Q(t)MF,ext(s, a),

respectively. We further define the state-dependent variations in Q-values as ∆Qext(s) = maxa Qext,tot(s, a) − mina Qext,tot(s, a) and ∆Qint(s) = maxa Qint,tot(s, a) − mina Qint,tot(s, a), as well as their temporal averages ⟨∆Qext⟩ and ⟨∆Qint⟩, where ⟨·⟩ shows the temporal average. ⟨∆Qext⟩ and ⟨∆Qint⟩ show the average difference between the most and least preferred action with respect to seeking extrinsic and intrinsic reward, respectively. Therefore, a feasible way to measure the influence of seeking intrinsic reward on action-selection is to define ωi2e as

ωi2e = ⟨∆Qint⟩ / (⟨∆Qint⟩ + ⟨∆Qext⟩).     (18)
Fig. 6A shows the value ωi2e in episodes 2-5 computed for each human participant (dots) and averaged over different groups (bars), and Fig. 6B shows the same for episode 1. Fig. 5C shows the value ωi2e in episodes 2-5 for the 2 CHF group of simulated participants. See Supplementary Materials for a similar approach for quantifying the relative importance of the MB and MF policies in action-selection.
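A sketch of how ωi2e can be computed from the β-weighted Q-values along a trajectory, assuming (as described above) that the state-dependent variation is the range between the most and least preferred action; array shapes and values are illustrative.

```python
import numpy as np

def relative_importance(q_ext_tot, q_int_tot):
    """Sketch of omega_i2e (Eq. 18 style).

    q_ext_tot and q_int_tot are arrays of shape (T, n_actions): the beta-weighted
    total extrinsic and intrinsic Q-values of the available actions at each
    visited state/time step.
    """
    d_ext = q_ext_tot.max(axis=1) - q_ext_tot.min(axis=1)   # spread of extrinsic values
    d_int = q_int_tot.max(axis=1) - q_int_tot.min(axis=1)   # spread of intrinsic values
    return d_int.mean() / (d_int.mean() + d_ext.mean())

# Example: intrinsic values vary a lot across actions, extrinsic values barely do,
# so omega_i2e is close to 1 and choices are effectively novelty-driven.
rng = np.random.default_rng(2)
q_ext = 0.5 + 0.01 * rng.standard_normal((100, 4))
q_int = rng.standard_normal((100, 4))
print(relative_importance(q_ext, q_int))
```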
Author Contributions
AM, HAX, MHH, and WG developed the study concept and designed the experiment. HAX and WL conducted the experiment and collected the data. AM designed the algorithms, did the formal analyses, and analyzed the data. AM, MHH, and WG wrote the paper.
Competing Interests statement
The authors declare no competing interests.
Code and data availability
All code and data needed to reproduce the results reported in this manuscript will be made publicly available after publication acceptance.
Acknowledgement
AM thanks Vasiliki Liakoni, Johanni Brea, Sophia Becker, Martin Barry, Valentin Schmutz, and Guillaume Bellec for many useful discussions on relevant topics. This research was supported by Swiss National Science Foundation No. CRSII2 147636 (Sinergia, MHH and WG), No. 200020 184615 (WG), and No. 200020 207426 (WG) and by the European Union Horizon 2020 Framework Program under grant agreement No. 785907 (Human Brain Project, SGA2, MHH and WG).
References