The curse of optimism: a persistent distraction by novelty

Human curiosity has been interpreted as a drive for exploration and modeled by intrinsically motivated reinforcement learning algorithms. An unresolved challenge in machine learning is that several of these algorithms get distracted by reward-independent stochastic stimuli. Here, we ask whether humans get distracted by the same stimuli as the algorithms. We design an experimental paradigm where human participants search for rewarding states in an environment with a highly ‘stochastic’ but reward-free sub-region. We show that (i) participants get repeatedly and persistently distracted by novelty in the stochastic part of the environment; (ii) optimism about the availability of other rewards increases this distraction; and (iii) the observed distraction pattern is consistent with the predictions of algorithms driven by novelty but not with ‘optimal’ algorithms driven by information-gain. Our results suggest that humans use suboptimal but computationally cheap curiosity-driven policies for exploration in complex environments.

Figure 1 caption (partial; the text for panel A was not recovered). B. Structure of the stochastic part of the environment (states S-1 to S-50), i.e., the dashed oval in A. B1. In state 4, one action takes agents randomly (with uniform distribution) to one of the stochastic states. B2. In each stochastic state (e.g., state S-1 in the figure), one action takes agents back to state 4, and two actions take them to another randomly chosen stochastic state. C. Time-line of one episode in the human experiments. The states are represented by images on a computer screen and actions by disks below each image. An episode ends when a goal image (i.e., the '3 CHF' image in this example) is found. D. Block diagram of the intrinsically motivated RL algorithm. Given the state s_t at time t, the intrinsic reward r_int,t (i.e., novelty, information-gain, or surprise) and the extrinsic reward r_ext,t (i.e., the monetary reward value of s_t) are evaluated by a reward function and passed to two parallel RL algorithms that are identical except for their reward signals. The two algorithms compute two policies, one for seeking intrinsic reward (π_int,t) and one for seeking extrinsic reward (π_ext,t). The two policies are then weighted according to the relative importance of the intrinsic reward and combined into a single hybrid policy π_t. The next action a_t is selected by sampling from π_t. See Methods for details.
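To make the transition structure of Fig. 1B concrete, the following is a minimal Python sketch of the stochastic sub-region, assuming the uniform transitions described in the caption; the state and action names are our own illustrative choices, not identifiers from the study's actual code.

```python
import random

N_STOCH = 50  # stochastic states S-1 ... S-50 (the dashed oval in Fig. 1A)

def step(state, action):
    """Transition sketch for the stochastic part (Fig. 1B).

    `state` is either 4 (the entry state) or ('S', i) with 0 <= i < N_STOCH.
    In state 4, the 'stochastic' action jumps uniformly into the stochastic
    part (B1). In a stochastic state, action 0 leads back to state 4, while
    actions 1 and 2 lead to a randomly chosen stochastic state (B2); whether
    the current state is excluded from that draw is not specified in the
    caption, so we do not exclude it here.
    """
    if state == 4 and action == 'stochastic':
        return ('S', random.randrange(N_STOCH))
    if isinstance(state, tuple):
        if action == 0:
            return 4
        return ('S', random.randrange(N_STOCH))
    raise ValueError('transition not covered by this sketch')

# Example: enter the stochastic part, take one random jump, then return.
s = step(4, 'stochastic')
s = step(s, 1)
s = step(s, 0)  # back to state 4
```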
These results reflect the inefficiency of seeking surprise and the efficiency of seeking information-gain in dealing with noise 26. In order to further dissociate the action-choices of the different algorithms, we analyze their action preferences in state 4.

Figure 2 caption (partial): Purely seeking novelty shows a smaller difference between the preference for the stochastic action (SA) and the progressing action (PA) in state 4 than purely seeking information-gain or surprise. Error bars show the SEMean.
Across the groups of efficient agents, increasing the relative importance of intrinsic rewards increases their preference for the stochastic action. However, for the highest importance of intrinsic rewards, the probability of choosing the progressing action is substantially lower than the probability of choosing the stochastic action for seeking surprise or information-gain (15% vs. 85%; Fig. 2D).

Following a power analysis based on the data of simulated efficient agents (Methods), we recruited 63 human participants and collected their action-choices during the 5 episodes of our experiment: 23 participants in an environment with a 2 CHF reward value for G*, and two further groups of 20 participants in environments with 3 CHF and 4 CHF reward values for G*, respectively. In the rest of the manuscript, we refer to each group by its reward value of G*; e.g., the 3 CHF group is the group of human participants who were assigned the 3 CHF reward value for G* (as in Fig. 1C). We excluded the data of 6 human participants from further analyses since they either did not finish the experiment or showed abnormal performance (Methods).

Human participants exhibit a persistent distraction by stochasticity

We perform the same series of analyses on the behavior of human participants as those performed on the behavior of simulated efficient agents (Fig. 3). In episodes 2-5, the search duration of human participants (Fig. 3A1) and the fraction of time they spend in the stochastic part (Fig. 3A2) are both negatively correlated with the goal value of their environment; e.g., the 2 CHF group has a longer search duration and spends more time in the stochastic part than the other two groups.

Moreover, increasing the goal value increases the preference of human participants for the progressing action in state 4 during episode 2 (Fig. 3B). These observations support our hypothesis that increasing the degree of reward optimism influences the behavior of human participants in the same way as increasing the relative importance of intrinsic rewards influences the behavior of simulated efficient agents (e.g., compare Fig. 3A1, A2, and B with Fig. 2A1, A2, and B, respectively).

The behavior of the 2 CHF group is particularly interesting since they are the most optimistic group.

Figure 4 caption: Human participants' action-choices are best explained by novelty-seeking (see Methods for details). A1. Model log-evidence summed over all participants (i.e., assuming that different participants have the same exploration strategy but can have different parameters 53) is significantly higher for seeking novelty than for seeking information-gain or surprise. High values indicate good performance, and differences greater than 10 are traditionally considered strongly significant 50. A2. The expected posterior model probability under the random-effects assumption 54 (i.e., assuming that different participants can have different exploration strategies and different parameters), given the data of all participants. PXP stands for Protected Exceedance Probability 54, i.e., the probability of one model being more probable than the others. Error bars show the standard deviation of the posterior distribution. B. Confusion matrix from the model-recovery procedure: each row shows the results of applying our model-fitting and -comparison procedure (as in A2) to the action-choices of participants simulated by one of the three algorithms (with their parameters fitted to human data; see Methods). The color code shows the expected posterior probability, and the numbers in parentheses show the PXP (both averaged over 5 sets of 60 simulated participants). We could always recover the model that had generated the data (PXP ≥ 0.98), using almost the same number of simulated participants (60) as human participants (57).
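As a hedged illustration of the fixed-effects comparison in panel A1, the sketch below sums hypothetical participant-wise log-evidences per model and applies the 'difference greater than 10' rule of thumb; all numbers are made up, and the real values come from the model fits described in Methods.

```python
import numpy as np

# Hypothetical participant-wise log-evidences, one entry per participant;
# real values come from fitting each algorithm to each participant's choices.
log_evidence = {
    'novelty':   np.array([-310.0, -295.0, -280.0]),
    'info-gain': np.array([-325.0, -300.0, -290.0]),
    'surprise':  np.array([-330.0, -305.0, -295.0]),
}

# Fixed-effects comparison: sum the log-evidence over participants per model.
totals = {model: le.sum() for model, le in log_evidence.items()}
best = max(totals, key=totals.get)
runner_up = max((m for m in totals if m != best), key=totals.get)

# A difference in summed log-evidence greater than 10 is traditionally
# read as strongly significant (panel A1).
delta = totals[best] - totals[runner_up]
print(best, runner_up, delta, delta > 10)
```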
Novelty-seeking is the most probable model of human exploration

In the previous section, we observed that human participants exhibit patterns of behavior qualitatively similar to those of novelty-seeking simulated efficient agents. However, the qualitative predictions in Fig. 2 were made based on the assumptions of (i) using efficient RL algorithms with perfect memory and high computational power, (ii) using parameters that were optimized for the best performance in episode 1, and (iii) assigning the same extrinsic reward value to different goal states. In this section, we use a more realistic model of behavior than that of efficient agents (Methods).

Figure 5 caption: For each of the three intrinsic rewards, we run 1500 simulations of algorithms with parameters fitted to individual human participants; random seeds differ across simulations. We divide the simulated participants into three groups (corresponding to the 2 CHF, 3 CHF, and 4 CHF goal values) and use the same criteria as we used for human participants to detect and remove outliers among simulated participants (Methods). A. Average fraction of time during episodes 2-5 spent in the stochastic part by the 2 CHF group of human participants (blue circles, same data as in Fig. 3A2) and by the simulated participants (bars). Error bars: SEMean. P-value and BF: comparison between the simulated and human participants (unequal-variances t-test). Human participants spend a significantly greater fraction of their time in the stochastic part than simulated participants seeking information-gain (t = 4.4; 95% CI = (0.08, 0.23); DF = 21.2) or surprise (t = 6.3; 95% CI = (0.15, 0.30); DF = 21.6). No significant difference was observed for novelty-seeking (t = 1.0; 95% CI = (−0.04, 0.11); DF = 22.3). B. Pearson correlation between the fraction of time during episodes 2-5 spent in the stochastic part and the goal value. Human participants' data shows the same correlation value as reported in Fig. 3A2. Error bars: standard deviation evaluated by bootstrapping. P-values are from permutation tests (1000 sampled permutations; Bayesian testing was not applicable). C. The relative contribution of intrinsic rewards (i.e., dominance of π_int,t over π_ext,t; Eq. 18 in Methods) in episodes 2-5 for the 2 CHF group of simulated participants. P-value and BF: comparison with 0.5 (one-sample t-test). We observe a dominance of π_int,t for seeking novelty (t = 58.2; 95% CI = (0.88, 0.91); DF = 416) and information-gain (t = 12.7; 95% CI = (0.63, 0.67); DF = 379) but a dominance of π_ext,t for seeking surprise (t = −14.6; 95% CI = (0.27, 0.32); DF = 327). Red p-values: significant effects with the False Discovery Rate controlled at 0.05 50 (see Methods). Red BFs: significant evidence in favor of the alternative hypothesis (BF ≥ 3).
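The human-versus-simulated comparisons in Fig. 5A are unequal-variances (Welch) t-tests; below is a minimal sketch with synthetic data standing in for the fractions of time spent in the stochastic part (the group sizes and distributions are illustrative assumptions).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the fraction of time spent in the stochastic part.
human = rng.beta(8, 4, size=17)       # e.g., the 2 CHF group of humans
simulated = rng.beta(6, 6, size=400)  # simulated participants for one model

# Welch's t-test (equal_var=False), as used in Fig. 5A.
t, p = stats.ttest_ind(human, simulated, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3g}")
```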
Our model-comparison results show that seeking novelty is the most probable model for the majority of human participants, followed by seeking information-gain as the second most probable model. To test whether the fitted models also reproduce human behavior, we perform posterior predictive checks (PPC) with simulated participants: specifically, each simulated participant belongs to one of the three groups of human participants (e.g., the 3 CHF group), and its action-choices are simulated using a set of parameters fitted to the action-choices of one human participant randomly selected from the participants in that group (Methods).

Given the PPC results, we first perform model recovery 55 on the data from the simulated participants: indeed, model recovery confirms that we can infer which algorithm has generated the action-choices of simulated participants (by repeating our model-fitting and -comparison; Fig. 4B).

This implies that even the versions of the different algorithms that are closest to human data can be dissociated from one another by our model-comparison procedure.

For seeking information-gain, the relative importance of intrinsic rewards remains low when the parameters are optimized to fit human behavior: a higher value of relative importance would make the algorithm, during episode 2, too attracted to the stochastic action in state 4 compared to humans (compare Fig. 2D with Fig. 3B). With such reduced importance of information-gain, however, the hybrid algorithm cannot explain the specific behavioral features in Fig. 5A and B. Therefore, the attraction of human participants to the stochastic part has specific characteristics that are explained by seeking novelty but not by seeking surprise or information-gain.

Taken together, our results with simulated participants provide strong quantitative evidence for novelty-seeking as a model of human exploration in our experiment.

Reward optimism correlates with relative importance of novelty

Using novelty-seeking as the most probable model of human behavior, we can now explicitly test our hypothesis that reward optimism increases the relative importance of intrinsic rewards. By analyzing the parameters of our novelty-seeking algorithm fitted to the behavioral data, we observe, in agreement with our hypothesis, a significant negative correlation between the relative importance of novelty during action-selection (in episodes 2-5) and the goal value participants found in episode 1 (Fig. 6A; parameter recovery 55 in Fig. 6C). Moreover, the participants in the 2 CHF group continue with an almost fully exploratory policy in episodes 2-5, indicating that they have only a small bias towards exploiting the small but known reward (Fig. 6A).
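A hedged sketch of this correlation analysis, pairing each participant's fitted relative importance of novelty with the goal value found in episode 1 and assessing significance with a permutation test (as in Fig. 5B); the data below are synthetic and the effect size is invented.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins: goal value found in episode 1 (2, 3, or 4 CHF) and the
# fitted relative importance of novelty in episodes 2-5 for 57 participants.
goal_value = rng.choice([2.0, 3.0, 4.0], size=57)
importance = 1.2 - 0.2 * goal_value + rng.normal(0.0, 0.15, size=57)

def pearson_r(x, y):
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    return (zx * zy).mean()

r_obs = pearson_r(goal_value, importance)

# Permutation test with 1000 sampled permutations, as in Fig. 5B.
r_null = np.array([pearson_r(rng.permutation(goal_value), importance)
                   for _ in range(1000)])
p = (np.sum(np.abs(r_null) >= np.abs(r_obs)) + 1) / (len(r_null) + 1)
print(f"r = {r_obs:.2f}, p = {p:.3f}")
```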

Since our simulated participants are informed that there are three different goal states in the environment, the reward-seeking component π_ext,t of the action-policy can also contribute to exploratory behavior, e.g., through optimistic initialization of Q-values 37 or prior assumptions about the reward distribution. To isolate this contribution, we focus on episode 1, where this effect is most easily detectable: we observe a dominant influence of novelty-seeking on action-selection (Fig. 6B). This implies that, to explain human behavior, the knowledge of the existence of different goal states must drive exploration through a novelty-seeking policy instead of the optimistic initialization of a reward-seeking policy.
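For readers unfamiliar with the mechanism, the sketch below illustrates how optimistic initialization of Q-values produces exploration in a generic tabular agent; this is a textbook illustration, not the paper's model, and all constants are arbitrary.

```python
import numpy as np

n_states, n_actions = 10, 3
R_MAX = 4.0  # optimistic guess: no reward exceeds the largest goal value (4 CHF)

# Initializing every Q-value at an optimistic bound makes untried actions look
# better than tried-but-unrewarded ones, so even a greedy agent keeps exploring.
Q = np.full((n_states, n_actions), R_MAX)

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.9):
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# After one unrewarded step, the taken action drops below the untried ones.
q_update(0, 0, r=0.0, s_next=1)
print(Q[0])  # action 0 is now less attractive than actions 1 and 2
```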

We designed a novel experimental paradigm to study human curiosity-driven exploration in the presence of stochasticity. We made two main observations: (i) human participants who are optimistic about finding higher rewards than those already discovered are persistently distracted by stochasticity; and (ii) this persistent pattern of distraction is explained better by seeking novelty than by seeking information-gain or surprise, even though seeking information-gain is theoretically more robust in dealing with stochasticity.

However, there are multiple distinct theoretical models of directed exploration 5,10,26,62-65, and it has been debated which one is best suited to explain human behavior. In a general setting, human exploration is driven by multiple motivational signals 15,65.

We note, however, that for computing novelty an agent only needs to track the state frequencies. Novelty-seeking may not be a globally optimal strategy for exploration but can be an optimal strategy given a set of prior assumptions and computational constraints, i.e., a 'resource rational' policy 69-71.
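To illustrate why novelty is computationally cheap, here is a minimal count-based sketch in which novelty decreases with the empirical frequency of a state; the −log-frequency form and the smoothing constants are our assumptions, not the exact definition used in Methods.

```python
from collections import Counter
from math import log

counts = Counter()  # visit counts per state: the only statistic needed
t = 0               # total number of observations so far

def observe(state):
    global t
    counts[state] += 1
    t += 1

def novelty(state, n_states=61):
    """Count-based novelty: -log of a smoothed empirical state frequency.
    The +1 smoothing keeps never-visited states at maximal novelty;
    n_states is an assumed total state count, used only for smoothing."""
    return -log((counts[state] + 1) / (t + n_states))

for s in [4, 4, ('S', 0), 4]:
    observe(s)
print(novelty(4), novelty(('S', 7)))  # the frequent state 4 is less novel
```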

The core assumption of using intrinsically motivated algorithms as models of human curiosity is consistent with participants' behavior after finding a goal at the end of episode 1 (Fig. 3B1-2) and also consistent with experimental evidence for partially separate neural pathways of novelty- and reward-induced behaviors 18,74-78. We found that the relative importance of novelty- and reward-induced behaviors in human participants varies with reward optimism.

Before starting the experiment, the participants were informed that they needed to find one of the 3 goal states 5 times. They were shown the 3 goal images and informed that different images had different reward values of 2 CHF, 3 CHF, and 4 CHF. Specifically, they were given the example that 'if you find the 2 CHF goal twice, 3 CHF goal once, and 4 CHF goal twice, then you will be paid 2 × 2 + 1 × 3 + 2 × 4 = 15 CHF'; see 'Informing RL agents of different goal states and modeling optimism' for how simulated efficient agents and simulated participants were given this information. At each trial, participants were presented with an image (state) and three grey disks below the image (Fig. 1C). Clicking on a disk (action) led participants to a subsequent image, which was chosen based on the underlying graph of the environment in Fig. 1A-B (which was unknown to the participants). Participants clicked through the environment until they found one of the goal states, which ended the episode (Fig. 1C).

The assignment of images to states and of disks to actions was random but kept fixed throughout the experiment and across participants. The one exception was the actions in state 4, which we did not assign before the start of the experiment. Rather, for each participant, we assigned the disk that was chosen at the first encounter of state 4 to the stochastic action and the other two disks randomly to the bad and progressing actions, respectively (Fig. 1A). With this assignment, we made sure that all human participants would visit the stochastic part at least once during episode 1. The same protocol was used for simulated efficient agents and simulated participants.
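A short sketch of this assignment protocol for state 4, assuming the three disks are indexed 0-2 (the indices and action names are illustrative):

```python
import random

def assign_disks_state4(first_clicked_disk):
    """Assign the three disks of state 4 to actions, per the protocol above:
    the disk clicked first becomes the stochastic action, and the other two
    disks are assigned at random to the bad and progressing actions."""
    others = [d for d in (0, 1, 2) if d != first_clicked_disk]
    random.shuffle(others)
    return {first_clicked_disk: 'stochastic',
            others[0]: 'bad',
            others[1]: 'progressing'}

# Example: the participant first clicks the middle disk at state 4.
print(assign_disks_state4(first_clicked_disk=1))
```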

Before the start of the experiment, we randomly assigned the different goal images (corresponding to the three reward values) to the different goal states G*, G_1, and G_2 in Fig. 1A, separately for each participant. The images, and hence the reward values, were then kept fixed throughout the experiment. In other words, we randomly assigned different participants to different environments with the same structure but different assignments of reward values. We therefore ended up with 3 groups of participants: 23 in the 2 CHF group, 20 in the 3 CHF group, and 20 in the 4 CHF group. The probability of encountering a goal state other than G* is controlled by the parameter ε. We considered ε to be around machine precision (10^-8), so (1 − ε)^(5×63) ≈ 1 − 10^-5 ≈ 1, meaning that all 63 participants would almost surely be taken to the goal state G* in all 5 episodes. We note, however, that a participant could in principle observe any of the 3 goals if they chose the progressing action at state 6 sufficiently many times, because lim_{t→∞} (1 − ε)^t = 0.

2 participants (in the 2 CHF group) did not finish the experiment, and 4 participants (1 in the 3 CHF group) took abnormally long in episodes 2-5 to finish the experiment. We considered this a sign of being non-attentive and removed these 6 participants from further analyses.

The sample size was determined by a power analysis performed on the data of the efficient simulations done for Fig. 2 (see 'Efficient model-based planning for simulated participants' for the simulation details). Our goal was to have a statistical power of more than 80% (with a significance level of 0.05) for the correlations in Fig. 2A, C, and E as well as for the differences at the highest importance of intrinsic rewards in Fig. 2D and F.
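A quick numerical check of the ε approximation above (63 participants × 5 episodes):

```python
eps = 1e-8                       # ε around machine precision
p_all = (1.0 - eps) ** (5 * 63)  # probability that all 63 participants reach
                                 # G* in all 5 episodes
print(p_all)                     # ≈ 0.9999968, i.e., within 10^-5 of 1
```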

The correction for multiple hypothesis testing was done by controlling the False Discovery Rate at 0.05 50 over all 22 null hypotheses that are tested in Fig. 3, Fig. 5, and Fig. 6.

Full hybrid model

We first present the most general case of our algorithm, as visualized in Fig. 1D, and then explain the special cases used for simulating efficient agents (Fig. 2) and for modeling human behavior (Fig. 4-Fig. 6). We used ideas from non-parametric Bayesian inference 99 to design an intrinsically motivated algorithm that computes, at each time step, both r_ext,t+1 and r_int,t+1. We define the extrinsic reward function R_ext as the monetary reward value of the encountered state (see 'Informing RL agents of different goal states and modeling optimism'). We discuss choices of R_int,t in the next section.
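A minimal sketch of such an extrinsic reward function, assuming (as in the Fig. 1D caption) that the extrinsic reward is simply the monetary value of the current state; the goal-value assignment shown is one example, since it varied across participants.

```python
# Example goal-value assignment for one participant (2 CHF found at G*).
GOAL_VALUES_CHF = {'G*': 2.0, 'G1': 3.0, 'G2': 4.0}

def r_ext(state):
    """Extrinsic reward sketch: the monetary reward value of the state
    (0 for all non-goal states), as described in the Fig. 1D caption."""
    return GOAL_VALUES_CHF.get(state, 0.0)

print(r_ext('G*'), r_ext(4))  # 2.0 at the goal, 0.0 elsewhere
```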

As a general choice for the RL algorithm in Fig. 1D, we use the formulation of Eq. 3, where λ_ext and λ_int ∈ [0, 1) are the discount factors for extrinsic and intrinsic reward seeking, respectively, and V is the corresponding value function. The action-policy combines model-based (MB) and model-free (MF) components, where β_MB,ext ∈ R+, β_MF,ext ∈ R+, β_MB,int ∈ R+, and β_MF,int ∈ R+ are free parameters (i.e., inverse temperatures). See 'Relative importance of novelty in action-selection' for how these inverse temperatures relate to the influence of intrinsic and extrinsic rewards on action-choices (Fig. 5C and Fig. 6).
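The combination rule itself is not reproduced in this excerpt; below is a minimal sketch of one plausible form, a softmax over the β-weighted sum of the four Q-vectors, offered as an assumption consistent with the four inverse temperatures, not as the paper's exact equation.

```python
import numpy as np

def softmax(x):
    z = x - x.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def hybrid_policy(q_mb_ext, q_mf_ext, q_mb_int, q_mf_int,
                  b_mb_ext, b_mf_ext, b_mb_int, b_mf_int):
    """Sketch of a plausible action-policy: softmax over the β-weighted sum of
    model-based/model-free, extrinsic/intrinsic Q-values for the current state.
    This is an assumed form; the paper's exact equation is in Methods."""
    logits = (b_mb_ext * q_mb_ext + b_mf_ext * q_mf_ext
              + b_mb_int * q_mb_int + b_mf_int * q_mf_int)
    return softmax(logits)

pi = hybrid_policy(np.array([0.1, 0.5, 0.2]), np.array([0.0, 0.3, 0.1]),
                   np.array([1.0, 0.2, 0.4]), np.array([0.8, 0.1, 0.3]),
                   b_mb_ext=2.0, b_mf_ext=0.5, b_mb_int=3.0, b_mf_int=1.0)
print(pi)  # probabilities over the three actions, summing to 1
```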
We considered perfect model-building by assuming κ = 1 and almost perfect planning, and we set β_MB,ext = 0 to have pure intrinsic reward seeking in episode 1.

After fixing the values of 10 out of the 13 parameters in Eq. 16, we fine-tuned the remaining β parameters.

We also considered the Bayesian model selection method of ref. 54 with the random-effects assumption, i.e., assuming that participant n uses the intrinsic reward R_n = R, which is not necessarily the same as the one used by other participants, with probability P_R. We performed Markov Chain Monte Carlo sampling (using the Metropolis-Hastings algorithm 50 with a uniform prior and 40 chains of length 10,000) for inference and estimated the joint posterior distribution P(R_1:57, P_novelty, P_inf-gain, P_surprise | D_1:57). The exceedance probabilities P(P_R > P_R' for all R' ≠ R | D_1:57) were computed by using the participant-wise log-evidences.
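A simplified, hedged sketch of this random-effects inference: given participant-wise log-evidences, we sample the model-frequency vector (P_novelty, P_inf-gain, P_surprise) with an independence Metropolis-Hastings sampler, marginalizing each participant's model assignment; we use one chain instead of 40, and the log-evidences are synthetic.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(2)

# Synthetic participant-wise log-evidences, shape (57 participants, 3 models),
# ordered as [novelty, info-gain, surprise]; real values come from model fitting.
log_ev = rng.normal(-300.0, 5.0, size=(57, 3))
log_ev[:, 0] += 3.0  # pretend novelty fits slightly better

def log_lik(P):
    # Marginalize each participant's model assignment R_n over frequencies P.
    return logsumexp(log_ev + np.log(P), axis=1).sum()

# Independence MH with a uniform Dirichlet(1,1,1) proposal; since the proposal
# equals the uniform prior, the acceptance ratio reduces to the likelihood ratio.
P = np.ones(3) / 3.0
ll = log_lik(P)
samples = []
for _ in range(10_000):
    P_new = rng.dirichlet(np.ones(3))
    ll_new = log_lik(P_new)
    if np.log(rng.random()) < ll_new - ll:
        P, ll = P_new, ll_new
    samples.append(P)

post = np.array(samples[2000:])  # discard burn-in
print(post.mean(axis=0))         # expected posterior model probabilities
print((post[:, 0] > post[:, 1:].max(axis=1)).mean())  # exceedance prob., model 0
```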

The boxplots of the fitted parameters of novelty-seeking are shown in Supplementary Materials.

The same set of parameters was used for model recovery in Fig. 4B, posterior predictive checks in Fig. 5, computing the relative importance of novelty in Fig. 6A-B (see 'Relative importance of novelty in action-selection'), and parameter recovery in Fig. 6C.

Posterior predictive checks, model-recovery, and parameter-recovery

where ⟨·⟩ denotes the temporal average. ∆Q_ext and ∆Q_int denote the average difference between the most and least preferred action with respect to seeking extrinsic and intrinsic reward, respectively.

Therefore, a feasible way to measure the influence of seeking intrinsic reward on action-selection is to define ω_i2e as
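The defining equation is not reproduced in this excerpt, so the following is only one plausible form consistent with the quantities just introduced (an assumption, not the paper's exact equation): the share of the β-weighted intrinsic Q-range in the total Q-range.

```python
import numpy as np

def omega_i2e(q_ext, q_int, beta_ext, beta_int):
    """Hypothetical ω_i2e: the share of action-selection influence attributable
    to intrinsic reward, using ΔQ = max − min over the actions of one state
    (the temporal average ⟨·⟩ is omitted for brevity). This functional form is
    our assumption; the paper's exact definition is given in Methods."""
    dq_ext = q_ext.max() - q_ext.min()
    dq_int = q_int.max() - q_int.min()
    return beta_int * dq_int / (beta_int * dq_int + beta_ext * dq_ext)

# Example: intrinsic values spread much more than extrinsic ones, so ω > 0.5.
print(omega_i2e(np.array([0.00, 0.10, 0.05]), np.array([1.0, 0.2, 0.6]),
                beta_ext=2.0, beta_int=3.0))
```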