A Behavioral Association Between Prediction Errors and Risk-Seeking: Theory and Evidence

Reward prediction errors (RPEs) and risk preferences have two things in common: both shape decision-making behavior, and both are commonly associated with dopamine. RPEs drive value learning and are thought to be represented in the phasic release of striatal dopamine. Risk preferences bias choices towards or away from uncertainty; they can be manipulated with drugs that target the dopaminergic system. The common neural substrate suggests that RPEs and risk preferences might also be linked at the level of behavior, but this has never been tested. Here, we aim to close this gap. First, we apply a recent theory of learning in the basal ganglia to predict how exactly RPEs might influence risk preferences. We then test our behavioral predictions using a novel bandit task in which value and risk vary independently across options. Critically, the task includes conditions in which options vary in risk but are matched for value. We find that subjects become more risk seeking if choices are preceded by positive RPEs, and more risk averse if choices are preceded by negative RPEs. These findings cannot be explained by other known effects, such as nonlinear utility curves or dynamic learning rates. Finally, we show that RPE-induced risk seeking is indexed by pupil dilation: participants with stronger pupillary correlates of RPE also show more pronounced behavioral effects.

Author's summary

Many of our decisions are based on expectations. Sometimes, however, surprises happen: outcomes are not as expected. Such discrepancies between expectations and actual outcomes are called prediction errors. Our brain recognises such prediction errors and uses them to modify our expectations and make them more realistic--a process known as reinforcement learning. In particular, neurons that release the neurotransmitter dopamine show activity patterns that strongly resemble prediction errors. Interestingly, the same neurotransmitter is also known to regulate risk preferences: dopamine levels control our willingness to take risks. We theorised that, since learning signals cause dopamine release, they might change risk preferences as well. In this study, we test this hypothesis. We find that participants are more likely to make a risky choice just after they have experienced an outcome that was better than expected, which is precisely what our theory predicts. This suggests that dopamine signalling can be ambiguous--a learning signal can be mistaken for an impulse to take a risk.


Introduction

Reward-guided learning in humans and animals can often be modelled simply as reducing the difference between the obtained and the expected reward--a reward prediction error. This well-established behavioral phenomenon (Rescorla and Wagner 1972) has been linked to the neurotransmitter dopamine (Schultz, Dayan et al. 1997). Dopamine neurons project to brain areas relevant for reward learning, such as the striatum, the cortex and the amygdala (Wise 2004, Björklund and Dunnett 2007). Dopamine activity is known to change synaptic efficacy in the striatum (Reynolds, Hyland et al. 2001) and has been causally linked to learning (Steinberg, Keiflin et al. 2013). This and other biological evidence have led to a family of mechanistic theories of learning within the basal ganglia network (Collins and Frank 2014, […]) people's moment-to-moment risk preferences (Chew, Hauser et al. 2019).

In summary, ample evidence suggests that dopamine bursts are related to distinct behavioral phenomena--learning and risk-taking--by way of 1) acting as reward prediction errors, affecting […]

We further hypothesized that participants who experience stronger prediction errors also show stronger risk preferences. Thus, we tracked participants' pupil dilation, which is known to reflect surprising events such as prediction errors (Preuschoff, 't Hart et al. 2011, Cavanagh, Wiecki et al. 2014, […] Behrens et al. 2015, Lawson, Mathys et al. 2017). We expected that increased perception of prediction errors should lead to increased pupil responses as well as stronger risk preferences, and hence predicted a correlation between the pupil response to prediction errors and the risk seeking associated with prediction errors. Overall, we found effects that were consistent with our predictions: risk seeking was higher when choices followed positive prediction errors than when they followed negative prediction errors. These preferences emerged gradually over the course of learning and could not be explained by any of several other known mechanisms. In addition, they were associated with pupil dilation in a way predicted by our theory.

Task & Theory

In this first section, we introduce our task and provide a detailed theoretical analysis of the behavior we expect from our participants. This analysis is based on models of the basal ganglia network and allows us to derive concrete predictions and models for behavior, which we use for data analysis below.

Task

Our task consisted of sequences of two-alternative forced-choice trials. On each trial, after an inter-trial interval (ITI) of 1 s, two stimuli (fractal images, Fig 1A) were drawn from a set of four stimuli and shown to the participant, who had to choose one. Following the choice, after a short delay of 0.8 s, a numerical reward between 1 and 99 was displayed under the chosen stimulus for 1.5 s. Then, the next trial began. Participants were instructed to try to maximize the total number of reward points throughout the experiment.

B) Reward distributions. Each stimulus (top) is associated with a different reward distribution (bottom). The distributions differ in mean and standard deviation.

The reward on each trial depended on the participant's choice: each stimulus was associated with a specific reward distribution from which rewards were sampled. The four reward distributions associated with the four stimuli were approximately Gaussian and followed a two-by-two design: the mean of the Gaussian could be either high or low (60 or 40), and the standard deviation could be either large or small (20 or 5), resulting in four reward distributions in total (risky-high, risky-low, safe-high and safe-low, Fig 1B). The names derive from the idea that it is "risky" to pick a stimulus associated with a broad reward distribution, since outcomes might deviate a lot from the expected outcome. Correspondingly, it is "safe" to pick a stimulus with a narrow distribution, since the outcomes will mostly be as expected.

We organize trials into three conditions: 1) "different": trials in which the shown stimuli have different average rewards (for example risky-high and safe-low, with average rewards of 60 and 40 respectively), 2) "both-high": trials in which both stimuli have a high average reward (risky-high and safe-high, both with an average reward of 60 points) and 3) "both-low": trials in which both stimuli have a low average reward (risky-low and safe-low, both with an average reward of 40 points).
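The two-by-two design and the condition labels can be sketched in a few lines. This is a minimal illustration using the numbers from the text (means 60/40, standard deviations 20/5, rewards shown as values between 1 and 99); function names are ours, and clipping the Gaussian to the displayed range is our reading of "approximately Gaussian".

```python
import numpy as np

# Means and standard deviations of the four reward distributions (from the text).
DISTRIBUTIONS = {
    "risky-high": (60, 20),
    "risky-low": (40, 20),
    "safe-high": (60, 5),
    "safe-low": (40, 5),
}

rng = np.random.default_rng(0)

def sample_reward(stimulus):
    """Draw one reward for the chosen stimulus, clipped to the displayed 1-99 range."""
    mean, sd = DISTRIBUTIONS[stimulus]
    return int(np.clip(round(rng.normal(mean, sd)), 1, 99))

def condition(stim_a, stim_b):
    """Classify a stimulus pairing into 'different', 'both-high' or 'both-low'."""
    means = {DISTRIBUTIONS[stim_a][0], DISTRIBUTIONS[stim_b][0]}
    if len(means) == 2:
        return "different"
    return "both-high" if means == {60} else "both-low"
```

With six possible pairings of four stimuli, two pairings are matched-mean ("both-high", "both-low") and four have different means.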

Theoretical analysis of learning and decision making
In this section, we sketch a mechanistic theory of learning and decision-making in our task. This theory is used to derive the computational model we use in our modelling analysis (see Results/Modelling). We also use it to derive behavioral predictions (see Results/Task & Theory/Behavioral Predictions).

Our theory is based on (Mikhael and Bogacz 2016, Möller and Bogacz 2019). Its premise is that choices are governed by competing action channels in the basal ganglia (BG) network, an assumption common to many models of the basal ganglia (Gurney, Prescott et al. 2001, Frank 2006). We assume that for each option in our task, there is one such action channel, and that the probability of choosing an option depends on the total activation of that action channel. There are two contributions to this activation: excitation of magnitude G through the direct pathway (also called Go pathway), and inhibition of magnitude N through the indirect pathway (also called No-Go pathway). These contributions are differentially modulated by dopamine (DA), mediated through the D1 and D2 receptors expressed in the direct and indirect pathway respectively. Let D represent the deviation of DA levels from baseline (i.e., D = 0 means baseline DA levels, D > 0 means DA is above baseline, etc.). Then, the activation A of an action channel can be approximated as

A = G × (1 + D) − N × (1 − D).    Eq. 1

What determines G and N for each option? According to (Mikhael and Bogacz 2016, Möller and Bogacz 2019), the direct and indirect pathways are subject to DA-dependent plasticity. Hence, G and N change through reward-driven learning--roughly, G tracks the upper end of the reward distribution, while N tracks the lower end. More precisely, we may assume that G − N tracks Q, the mean of the reward distribution, while G + N tracks S, its spread, where G > N. Using this notation, Eq. 1 reads A = Q + D × S--the impact of reward spread on action activation is gated by the level of DA.

In our task, the Q and S of the four action channels should converge to the means and spreads of the four reward distributions. With DA at baseline (i.e., D = 0), activation is proportional to Q, and hence to the mean reward. Choices should thus be biased towards options with high mean rewards. If DA levels are increased (i.e., D > 0), the learned spread contributes positively to action activation, biasing choices towards risky options. If, on the other hand, DA levels are below baseline (i.e., D < 0), the learned spread reduces action activation, biasing choices towards safe options.
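The DA-gated activation rule can be illustrated numerically. Below is a minimal sketch assuming A = Q + D × S per channel and a softmax read-out of the activations; the softmax and its temperature are our additions for illustration, not part of the theory as stated.

```python
import numpy as np

def activation(Q, S, D):
    """Channel activation, Eq. 1 rewritten with Q = G - N and S = G + N."""
    return Q + D * S

def choice_probs(Q, S, D, temperature=10.0):
    """Softmax over channel activations (the softmax itself is our assumption)."""
    A = activation(np.asarray(Q, dtype=float), np.asarray(S, dtype=float), D)
    z = (A - A.max()) / temperature
    p = np.exp(z)
    return p / p.sum()

# Both-high condition: risky option first (spread 20), safe option second (spread 5).
Q = [60.0, 60.0]
S = [20.0, 5.0]
p_baseline = choice_probs(Q, S, D=0.0)[0]   # baseline DA: indifferent
p_high_da = choice_probs(Q, S, D=0.5)[0]    # elevated DA: biased to the risky option
p_low_da = choice_probs(Q, S, D=-0.5)[0]    # lowered DA: biased to the safe option
```

With matched means, the spread term is the only thing that distinguishes the options, so the sign of D alone decides the direction of the bias.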

Theoretical analysis of prediction errors

In this section, we focus on how the participant's reward prediction should theoretically change over the course of a trial. This is based on the theory of temporal difference learning (Sutton and Barto 2018), which has been applied to describe dopaminergic responses to rewards (Schultz, Dayan et al. 1997). The basic assumption of these theories is that participants maintain a prediction of upcoming rewards at all times. This prediction is based on learned estimates of the average rewards associated with the four stimuli. At the beginning of a trial (before the stimuli appear), participants do not have any specific information to base their prediction on. Depending on whether they anticipate the appearance of the stimuli or not, they might either predict the average reward over all trials (which we take to be the average learned value across all options), or no reward at all. Neither of these two possibilities can be ruled out on theoretical grounds alone. We will therefore consider them both, and develop two versions of our hypothesis, ultimately resulting in two slightly different models (see section Results/Models).

When the stimuli appear, the reward prediction is updated to reflect the values of the options on screen (approximated by the higher of the two learned values Q_1 and Q_2). The resulting discrepancy is the stimulus prediction error δ_stim. If participants predict the overall average reward Q̄ at the beginning of the trial (the first variant of our model), its magnitude is given by

δ_stim = max(Q_1, Q_2) − Q̄.    Eq. 4

If, on the other hand, participants don't predict any reward at the beginning of the trial (this corresponds to the second variant of our model), the magnitude would be given by

δ_stim = max(Q_1, Q_2).    Eq. 5

We would then expect a magnitude of 60 in the both-high condition and a magnitude of 40 in the both-low condition. Next, participants make a choice; their reward expectation is now the value of the option they chose. Finally, a reward is displayed, forcing participants to update their reward estimate again. This second prediction error--the difference between the actual reward r received and the learned value of the chosen option--we call the outcome prediction error δ_out. Its magnitude is given by

δ_out = r − Q_chosen.    Eq. 6

It is the outcome prediction error that drives learning about the stimuli (see Eq. 2 and Eq. 3 above). In summary, our analysis reveals that two prediction errors (and two corresponding DA responses) should occur in each trial: first the stimulus prediction error shortly after the presentation of the options, and second the outcome prediction error after the presentation of the reward.
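The two within-trial prediction errors can be written out directly. This is a sketch under our reading of Eqs. 4-6 (variable names are ours; `Q` holds the learned values of all four stimuli and `shown` indexes the two options on screen):

```python
import numpy as np

def stimulus_pe_peirs(Q, shown):
    """Eq. 4: best shown value minus the average learned value over all options."""
    return max(Q[i] for i in shown) - float(np.mean(Q))

def stimulus_pe_pirs(Q, shown):
    """Eq. 5: no reward expected during the ITI, so the PE equals the best shown value."""
    return max(Q[i] for i in shown)

def outcome_pe(Q, chosen, reward):
    """Eq. 6: received reward minus the learned value of the chosen option."""
    return reward - Q[chosen]

# Converged values: options 0-1 are high-mean (60), options 2-3 low-mean (40).
Q = np.array([60.0, 60.0, 40.0, 40.0])
pe_high = stimulus_pe_peirs(Q, (0, 1))  # both-high: positive, as predicted
pe_low = stimulus_pe_peirs(Q, (2, 3))   # both-low: negative, as predicted
```

With the overall average at 50, the Eq. 4 variant yields +10 in the both-high and −10 in the both-low condition, while the Eq. 5 variant yields 60 and 40, matching the magnitudes stated in the text.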

Behavioral Predictions

Here, we combine the results of the analyses above to derive task-specific behavioral predictions from our theory (see Fig 1C for a schematic representation). Let us first assume that the stimulus prediction error is given by Eq. 4. We have seen that in the matched-mean conditions (both-high and both-low), the presentation of the options should cause a prediction error (positive and negative, respectively), and hence a transient change in DA levels (increase and decrease, respectively) in the striatum during the choice period (Fig 1C, mechanistic level). We have also seen that DA levels affect choices through modulation of the BG pathways: increased DA makes people risk seeking, decreased DA makes them risk averse. If the average reward is similar for two options (as it is in the both-high and the both-low condition), these risk preferences should be the decisive factor in decisions (Fig 1C, task level and past rewards level).

Taken together, these premises suggest that we should see a preference for the risky stimulus in the both-high condition (depicted in Fig 1C), and a preference for the safe stimulus in the both-low condition. These effects should appear gradually, since they require that both the mean and the spread of the reward distributions are learned. Specifically, risk preferences in the both-high and both-low conditions should emerge more slowly than value preferences in the different condition. This follows from the underlying plasticity rules: one can show that the learning rate for spread must always be lower than the learning rate for value (Möller and Bogacz 2019). In addition, a reasonably accurate value estimate is required for the spread estimate to converge; this also contributes to a higher learning speed for value compared to spread.

Relaxing now our assumption that the stimulus prediction error is given by Eq. 4 and allowing stimulus prediction errors of the type of Eq. 5, we can still be sure that the stimulus prediction error would be stronger in the both-high condition than in the both-low condition. We should thus see a difference in risk preference between conditions independent of the exact nature of the stimulus prediction error.

In summary, we have derived two predictions: 1) we should see a difference in risk preference between conditions, and 2) we should see gradually emerging risk seeking in the both-high and risk aversion in the both-low condition (prediction 2 depends on a stronger assumption about the nature of the stimulus prediction error). To arrive at these predictions, we took neural mechanisms into account. However, the predictions are purely on the level of behavior. In the empirical part of this study, we hence focus on behavior, aware that this will only provide indirect evidence for the neural underpinnings of the proposed mechanism. However, it might reveal a previously unknown effect with a clear, plausible biological explanation. We provide further predictions based on neural signals in the Discussion.
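The claim that spread is learned more slowly than value can be illustrated with a toy simulation. The delta rules below are a simplified stand-in with arbitrary learning rates, not the actual plasticity equations of Möller and Bogacz (2019); the point is only that the spread estimate S cannot settle until the value estimate Q is reasonably accurate.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha_q, alpha_s = 0.1, 0.05        # spread learning rate chosen lower than value rate
Q, S = 50.0, 0.0                    # initial estimates
for _ in range(500):
    r = rng.normal(60.0, 20.0)      # rewards from the risky-high stimulus
    delta = r - Q                   # outcome prediction error
    Q += alpha_q * delta            # value estimate converges first...
    S += alpha_s * (abs(delta) - S) # ...S tracks |delta|, which is only meaningful
                                    # once Q is close to the true mean

# After learning, Q sits near the true mean (60) and S near the mean absolute
# deviation of the rewards (sd * sqrt(2/pi), roughly 16 for sd = 20).
```

Early in learning, |delta| is dominated by the value error rather than the reward spread, which is one intuition for why spread-driven risk preferences should emerge after value preferences.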

Behavior
In the previous section, we introduced a novel reinforcement learning task and performed a theoretical analysis to derive behavioral predictions. In this section, we present the results of testing these predictions.

We recruited a cohort of participants (N = 27, 3 excluded, see Fig S1) and recorded their behavior in the task described above. Each participant performed four blocks of 120 trials. During each block, all six possible stimulus pairings occurred equally often. Each block used a new set of four stimuli, mapped to the same four distributions.

First, we investigated whether participants' performance improved during the task. Based on the premise of associative learning, we expected choice accuracy (i.e., the likelihood of choosing the option with the higher average reward) to increase gradually over trials. To confirm this, we focused on choices in the different condition, where participants had to choose between stimuli with different average rewards. We found that, indeed, the probability of choosing the stimulus with the higher average reward increased gradually over trials across the population. Average performance differed from chance level with high significance (Fig 2A, t-test, t(27) = 31.9, p < 0.001) and approached its asymptote in the second half of the block. These findings suggest that our participants successfully used associative learning in our task, confirming a basic assumption of our theory.

C) Correlation between risk preferences in the both-high and the both-low condition. Each point represents one participant. Preference for the risky stimulus if mean rewards are high is plotted against preference for the risky stimulus if mean rewards are low. If a point falls below the diagonal, the participant was more risk seeking for high-mean stimuli than for low-mean stimuli. The stars indicate that the population mean is significantly below the diagonal.

Next, we tested our first prediction--that participants would be risk seeking in the both-high condition and risk averse in the both-low condition, and that these effects would emerge gradually and more slowly than increases in task performance. For this, we analyzed choices in the both-high and both-low conditions. For each condition, we investigated the likelihood of choosing the stimulus with the broad distribution (risky) over the stimulus with the narrow distribution (safe). Preferring risky over safe was considered risk seeking, preferring safe over risky was considered risk aversion. As predicted, we found significant risk seeking in the both-high condition (Fig 2Bi; two-tailed t-test: p = 0.0343, t(26) = 2.23) and significant risk aversion in the both-low condition (Fig 2Bii; two-tailed t-test: p = 0.0317, t(26) = -2.27). […]

We then tested our second prediction--that there would be a difference in risk preferences (i.e., in the likelihood of choosing the risky option over the safe option) between the both-high and the both-low condition. Specifically, we predicted higher risk seeking in both-high than in both-low. To test this, we computed the difference in risk preference between conditions for each participant. We found that most of the participants were more risk seeking in the both-high condition than in the both-low condition (Fig 2C; two-tailed paired t-test: t(27) = 3.58, p = 0.0016), which confirmed our second prediction.

In summary, in data recorded from the task described above, we found the two behavioral effects we predicted. The second effect was clearer than the first (t = 3.58 versus t = 2.23 and t = -2.27). This is consistent with our theory: the first effect rests on more assumptions than the second. We will thus focus on the second effect in the analyses below.

Our findings provide initial evidence for a behavioral link between prediction errors and risk preferences. In the next section, we use computational modelling to compare our explanation of the measured effects to alternative explanations.
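The per-participant analysis behind Fig 2C can be sketched as follows. The choice data below are synthetic and seeded, with effect sizes invented purely for illustration; only the analysis logic (per-participant risky-choice probabilities, then a paired t-test on their difference) mirrors the text.

```python
import numpy as np

rng = np.random.default_rng(2)
n_sub, n_trials = 27, 80
# A hypothetical population that is mildly risk seeking in both-high (p = 0.6)
# and mildly risk averse in both-low (p = 0.4), as the theory predicts:
p_risky_high = rng.binomial(n_trials, 0.6, n_sub) / n_trials
p_risky_low = rng.binomial(n_trials, 0.4, n_sub) / n_trials

def paired_t(x, y):
    """Paired t statistic for two repeated measures (as in the Fig 2C test)."""
    d = np.asarray(x) - np.asarray(y)
    return float(d.mean() / (d.std(ddof=1) / np.sqrt(len(d))))

t_stat = paired_t(p_risky_high, p_risky_low)  # large and positive for this population
```

In practice one would use a library routine such as `scipy.stats.ttest_rel` to obtain the p-value as well; the hand-rolled statistic above just makes the computation explicit.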

Modelling

In the previous section, we showed that effects like those predicted by our theory can be found in experimental data. Our analysis so far rested on the assumption that participants know the ground-truth means and standard deviations of the reward distributions. However, these statistics have to be learned while participants perform the task. To capture this learning process, trial-by-trial learning models can be fit to the data. These models allow us to answer two important questions that have not yet been addressed:

1. Are there alternative explanations for the observed effects?
2. Does our theory fit the data better than existing theories?

In this section, we use computational modelling to answer these questions. To test our theory against alternative explanations, we use simulations as well as model comparison techniques (Palminteri, Wyart et al. 2017).

Models

Associative learning in tasks like ours is commonly described with the Rescorla-Wagner (RW) model (Rescorla and Wagner 1972). All models we use here are variants of this base model. RW is also the first type of explanation for the effects we observed--it has been shown that even basic associative learning can yield risk preferences through sampling biases (Niv, Edlund et al. 2012).

The second type of explanation involves the utility of reward points: risk aversion as well as risk seeking have been explained as consequences of nonlinear utility functions (Kahneman and Tversky 2013). For our analysis, we considered RW-type learning in combination with the two most common families of utility functions: the concave utility of expected utility theory and the s-shaped utility of prospect theory. The corresponding models are called concave UTIL and s-shaped UTIL.

The third type of explanation is based on RW with variable learning rates. It has been observed that humans use different learning rates for positive and negative outcomes (Gershman 2015), and that this can lead to risk preferences (Niv, Edlund et al. 2012). We implement this in the model pos-neg RATES. It has also been observed that when tracking a reward signal, humans adapt to the signal statistics (Daw, O'Doherty et al. 2006), akin to the Kalman filter used in engineering. In Kalman filters, the learning rate depends on the variance of the tracked signal. We thus included a model with different learning rates for high-variance and low-variance rewards, referred to as variance RATES.

Finally, the explanation that we propose in this study is that prediction errors induce transient risk preferences, as detailed above. We represent our hypothesis with two very similar models, called PEIRS (Prediction Errors Induce Risk Seeking) and PIRS (Predictions Induce Risk Seeking). These models differ only in how the stimulus prediction error is computed--PEIRS assumes that the initial reward expectation corresponds to the average reward over all options (Eq. 4), while PIRS assumes that the initial reward expectation is zero (Eq. 5).
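The ingredients of the PEIRS family can be sketched compactly. This is our reading of how Eqs. 1, 4 and 6 fit together, not the fitted model specification: the parameter names (`alpha`, `alpha_s`, `w`) and the exact update forms are our assumptions.

```python
import numpy as np

def peirs_activations(Q, S, shown, w):
    """Activation of each shown option: value plus PE-gated spread (cf. Eq. 1)."""
    delta_stim = max(Q[i] for i in shown) - float(np.mean(Q))  # Eq. 4 (PEIRS variant)
    return {i: Q[i] + w * delta_stim * S[i] for i in shown}

def rw_update(Q, S, chosen, reward, alpha=0.1, alpha_s=0.05):
    """Rescorla-Wagner learning driven by the outcome prediction error (Eq. 6)."""
    delta = reward - Q[chosen]
    Q[chosen] += alpha * delta
    S[chosen] += alpha_s * (abs(delta) - S[chosen])  # spread estimate
    return delta

# Converged estimates in the both-high condition (option 0 risky, option 1 safe):
Q = np.array([60.0, 60.0, 40.0, 40.0])
S = np.array([20.0, 5.0, 20.0, 5.0])
acts = peirs_activations(Q, S, shown=(0, 1), w=0.05)
# The positive stimulus PE (+10) inflates the risky option's activation more,
# because its spread is larger: 70 for risky versus 62.5 for safe.
```

The PIRS variant would replace the Eq. 4 line with `delta_stim = max(Q[i] for i in shown)` (Eq. 5); everything else stays the same.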

Alternative explanations

First, we wanted to know which of the models could reproduce (and hence explain) the observed effects. To check this, we first fitted all models to our dataset and extracted maximum-likelihood parameters for each participant. We then used these parameters to simulate behavior in our task with all candidate models, generating synthetic datasets. Finally, we analyzed these synthetic datasets in the same way as the experimental dataset. In our analysis, we focused on the difference in risk preferences between conditions (Fig 2C), as this was the most pronounced of the predicted effects.

We found that the models representing our hypothesis reproduced the effect best: the PEIRS model captures the overall distribution of risk preferences across the population best (Fig 3A; difference between experimental data and simulated data, two-tailed paired t-test, t(26) = -1.72, p = 0.0976). The PIRS model, on the other hand, is the only model that captures the part of our population that is risk seeking in both the both-high and the both-low condition (Fig 3A). […]

We concluded that among the tested models, our theory (specifically the PEIRS variant) was the best explanation for the observed risk preferences (both the distribution and the difference between conditions). None of the other models provided a convincing alternative explanation.

Model fit

After showing that the PEIRS model can reproduce the risk preferences that we measured, we wanted to use model comparison techniques to test more formally whether our dataset provides evidence for or against the PEIRS / PIRS models. One might be tempted to do this by simply comparing all candidate models directly and selecting the best-fitting one as the winner. However, this is not the right strategy here. Consider for example a population with strongly concave utility functions and, in addition, moderate PEIRS. A direct comparison between a utility model and a PEIRS model would favor the utility model for this population, even though PEIRS was present. More generally, if it is possible that more than one of the considered effects (nonlinear utility, variable learning rates, PEIRS / PIRS) is present in the same participant, a direct model comparison between the effects will only determine the largest effect. It cannot tell us whether an effect is present or not.

The correct way to test whether there is PEIRS in our dataset is to compare models with PEIRS (extended models) against models without PEIRS (base models). This should be done for different base models, as they constitute different alternative explanations. We therefore performed a fit for all conventional models (RW, concave UTIL, s-shaped UTIL, pos-neg RATES, variance RATES), and for each fit computed the Bayesian Information Criterion (BIC) across the entire population. We then added the PEIRS mechanism to each of these models, performed a fit with the extended models, and again computed the BIC across the population. Finally, we compared the BICs with and without the PEIRS extension. We repeated the same analysis with PIRS as well.

We found that adding the PEIRS extension improved parsimony for all but one model (variance RATES, Fig 3C), and that adding PIRS improved all models (Fig 3D). This suggests that PIRS and PEIRS tend to improve model fits enough to compensate for the increased complexity, and hence capture some non-trivial feature of the data that the other models cannot capture.

We concluded that in all but one case, our theory helps to explain the empirical data better than existing theories. The variance RATES model was not improved by PEIRS, perhaps because it already reproduced the distribution of risk preferences fairly well (Fig 3A, variance RATES). However, variance RATES predicted an effect size that differed significantly from the empirical one (Fig 3B) and was improved by the PIRS extension (Fig 3D). This suggests that variable learning rates cannot capture all features within the realm of our theory.

Finally, we considered the possibility that prediction errors might influence choices even across trials--that the outcome prediction error of one trial might induce risk preferences in the next. To test this, we defined another model, called Outcome Errors Induce Risk Seeking (OEIRS), and performed an analysis similar to Fig 3C and 3D. We found no evidence for across-trial effects, likely due to the comparatively large delays (Fig S3, see supplementary discussion). This negative result is interesting in itself, but also provides additional confidence in our positive results for PEIRS and PIRS--these models did not win simply because they had more parameters (PEIRS, PIRS and OEIRS have the same number of parameters), but because they explained the data better.
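The base-versus-extended comparison rests on the standard BIC formula, which penalizes each extra parameter by log(n). The sketch below uses invented placeholder log-likelihoods, not our fits; it only shows the mechanics of the comparison.

```python
import numpy as np

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: -2*logL + k*log(n). Lower is better."""
    return -2.0 * log_likelihood + n_params * np.log(n_obs)

n_obs = 480  # four blocks of 120 trials per participant
# Hypothetical per-participant maximum log-likelihoods for a 3-parameter base
# model and the same model with the PEIRS mechanism (one extra parameter):
ll_base = [-310.0, -305.0, -320.0]
ll_peirs = [-300.0, -298.0, -311.0]

bic_base = sum(bic(ll, 3, n_obs) for ll in ll_base)
bic_peirs = sum(bic(ll, 4, n_obs) for ll in ll_peirs)
peirs_preferred = bic_peirs < bic_base  # extension pays for its extra parameter
```

The extension "improves parsimony" exactly when the summed likelihood gain exceeds the summed penalty for the additional parameter, which is the criterion behind Fig 3C and 3D.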

Pupillometry

Above, we used choice data to look for the effects we predicted, and computational modelling to rule out alternative explanations. In this section, we use an additional physiological marker--pupil dilation--to provide further evidence for our hypothesis. It is known that pupils dilate in response to surprise. […]

We exploit this in three steps: first, we confirm that we can read out the magnitude of prediction errors by computing the pupil response to the well-documented outcome prediction errors. Second, we establish the occurrence of stimulus prediction errors by extracting the corresponding pupil response. Finally, we show that the strength of the stimulus prediction error, as measured through pupil dilation, is connected to the strength of risk seeking extracted from choices.

Pupil dilation reflects outcome prediction errors

Can we measure prediction error magnitude through pupil dilation? To test this, we took the absolute value of the outcome prediction error |δ_out| as a measure of surprise. Trial-by-trial estimates of this prediction error were extracted from the PEIRS model fits. Regression analyses were used to determine whether pupil dilation after reward presentation encoded |δ_out|. In this analysis, we focused on the first half of the block (trials 1 to 60), where the learning from feedback occurs (Fig 2A). We found a phasic response which peaked 0.9 s after reward presentation (Fig 4A; t-test: […]; statistical significance was established through leave-one-out unbiased peak detection and confirmed through a cluster-based permutation test, see Fig S4A). This confirmed pupil dilation as a proxy for prediction error magnitude.

Next, we sought to confirm the occurrence of stimulus prediction errors, which are essential for our theory of prediction-error-induced risk seeking. For this, we extracted estimates of |δ_stim| for every trial and used an analysis similar to the previous step to extract the corresponding pupil response. Here, we aligned the pupil signal at stimulus presentation, and censored all data points collected after reward presentation to avoid confounding factors such as reward or outcome prediction errors. We also focused on the second half of the block (trials 61 to 120), where we observe pronounced risk preferences (Fig 2B). We found a phasic response which peaked 1.6 s after stimulus onset (Fig 4B; t-test: t(26) = 2.89, p = 0.0079; statistical significance was established through leave-one-out unbiased peak detection and confirmed through a cluster-based permutation test, see Fig S4B). This confirms the existence of a stimulus prediction error that occurs after stimulus onset, as hypothesized.

The response to the stimulus prediction error was roughly similar to the response to the outcome prediction error, except for a longer delay between prediction error onset and the peak of the pupil response. There might be many reasons for this difference in delay. Among those, differences in information processing might play a role: generating a stimulus prediction error involves two stimuli, and hence attention mechanisms, in addition to retrieval of value estimates from memory. Generating the outcome prediction error, on the other hand, just requires the processing of a number.
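The time-resolved regression used above can be sketched on synthetic data. Everything below (sampling grid, amplitudes, noise level) is a seeded stand-in for the recorded pupil time series; only the analysis logic mirrors the text: regress the pupil trace at every time point on trial-wise |PE| plus an intercept, then locate the peak of the |PE| regression weight.

```python
import numpy as np

rng = np.random.default_rng(3)
n_trials, n_time = 60, 50
abs_pe = rng.uniform(0.0, 30.0, n_trials)       # trial-wise |outcome PE| regressor
true_peak = 20                                  # e.g. the sample ~0.9 s after reward
kernel = np.exp(-0.5 * ((np.arange(n_time) - true_peak) / 5.0) ** 2)
pupil = (np.outer(abs_pe, kernel) * 0.05        # |PE|-locked dilation component
         + rng.normal(0.0, 0.3, (n_trials, n_time)))  # measurement noise

X = np.column_stack([np.ones(n_trials), abs_pe])  # design: intercept + |PE|
betas, *_ = np.linalg.lstsq(X, pupil, rcond=None)  # OLS at all time points at once
pe_beta = betas[1]                                 # |PE| weight per time point
peak_idx = int(np.argmax(pe_beta))                 # recovered peak latency (samples)
```

In the actual analysis, peak significance would additionally be assessed with leave-one-out peak detection and a cluster-based permutation test, as described above; the sketch only covers the regression step.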

Stimulus prediction errors are linked to risk seeking

Our theory suggests that risk seeking is caused by transiently increased dopamine levels that reflect reward prediction errors, specifically the stimulus prediction error. If this is the case, we should expect to see stronger risk preferences in individuals who respond more strongly to stimulus prediction errors. […]

[…] shifts in behavioral risk preference as a function of reward prediction errors. Second, we tested it using a task where reward prediction errors are immediately followed by decisions that involve risk. We found that reward prediction errors and the probability of subsequent risk-taking are positively correlated: positive reward prediction errors induce risk seeking, negative ones inhibit it. Finally, we showed that the magnitude of the reward prediction error (as indexed by pupil dilation) determines its effect on risk preferences.

Our results are consistent with our initial hypothesis: the two roles of dopamine (teaching signal and risk modulator) interfere with each other. This study hence provides evidence against the conjecture that the […]

Predictions or Prediction errors?
In our theoretical analysis, we considered two possibilities for how the stimulus PE might be calculated. The first possibility was to take the difference between the value of the options and the overall average reward. This was codified in the PEIRS model. The other possibility was to take the difference between the value of the options and zero, which corresponds to no reward expectation during the ITI. This was codified in the PIRS model, short for Predictions Induce Risk Seeking. We chose this name because in this case, there is no way to distinguish between prediction and prediction error. Our task design is generally not suitable to empirically differentiate between these two possibilities, because predictions and prediction errors are strongly correlated at stimulus presentation. We did not aim to empirically distinguish the two mechanisms, but rather to show that either of them explains our data better than alternative explanations (with one exception, see Fig 3C and D). Differentiating between the two mechanisms is an interesting direction for future work.

Another question that might arise in this context concerns the role of the outcome prediction error of the previous trial. According to our theory, that prediction error should be broadcast by the dopamine system just like the stimulus prediction error and might therefore also affect risk preferences. Of course, there is a difference in timing: the choice follows the outcome prediction error of the previous trial with a 3.47 s delay on average (standard deviation 0.51 s). In contrast, there is only an average delay of 0.97 s (standard deviation 0.51 s) between stimulus onset and choice. We might thus expect the impact of the outcome prediction error, if observable at all, to be much weaker than that of the stimulus prediction error.
A supplementary analysis similar to those displayed in Fig 3C and 3D confirmed this: there is no evidence for an association between risk preferences and the outcome prediction error of the previous trial in our dataset (Fig S3).
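As a minimal sketch of the two candidate computations discussed above (assuming, purely for illustration, that the prediction at stimulus onset is the mean of the two option values; names are ours):

```python
def stimulus_pe(v_left, v_right, avg_reward, variant="PEIRS"):
    """Candidate stimulus prediction error at option presentation.

    PEIRS: PE relative to the running average reward.
    PIRS:  PE relative to a zero baseline (no reward expected in the ITI),
           making the 'prediction error' identical to the prediction itself.
    """
    prediction = 0.5 * (v_left + v_right)  # expected value of the upcoming choice
    baseline = avg_reward if variant == "PEIRS" else 0.0
    return prediction - baseline

# With option values straddling the average reward, PEIRS yields a small PE,
# while PIRS yields a large one -- yet the two are perfectly correlated
# whenever avg_reward is roughly constant, which is why the task cannot
# empirically separate them.
print(stimulus_pe(55.0, 45.0, 50.0, "PEIRS"))  # 0.0
print(stimulus_pe(55.0, 45.0, 50.0, "PIRS"))   # 50.0
```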

Relation to behavioral economics
Decision making under uncertainty has been extensively studied in behavioral economics. One main finding in this field, codified in prospect theory, is that humans tend to be risk-averse if decisions concern gains, and risk-seeking if decisions concern losses (Kahneman and Tversky 2013). However, those classic findings rely on explicit knowledge about the probabilities involved in the decisions. Several more recent studies indicate that those tendencies reverse when risks and probabilities are learned from experience. This phenomenon has been termed the description-experience gap and is considered a "major challenge" for neuroeconomics (Garcia, Cerrotti et al. 2021).

Our task differs somewhat from the tasks studied in the description-experience gap literature, since we only use gains (reward points are always positive). However, it is also known that humans often evaluate outcomes with respect to a reference point (Kahneman and Tversky 2013). In the context of our task, it seems plausible that rewards are evaluated relative to the average of previous rewards, or relative to the middle of the experienced reward range, which rapidly converges to 50 points within the first few trials.

Our findings also differ from those of (Niv, Edlund et al. 2012). This might be due to the degree of implicitness of the knowledge that is gained during the task: Niv et al. used classical bimodal reward distributions (e.g., 40 points with probability 50 %, 0 points otherwise), which participants might be able to recognize as such after a few trials. Here, we used high-entropy reward distributions (normal distributions, see Fig 1B), which could not be mapped onto bimodal gambles, and thus made anything but implicit learning intractable.

Relation to other theories
For our behavioral results, interpretations other than our dopaminergic explanation may be invoked. The behavior in a similar task (Madan, Ludvig et al. 2014) was interpreted as the result of memory replay: experiences ("obtained reward X after choosing option Y") might not only be used for immediate value updates but might also be stored in a memory buffer. This buffer can then be used for offline learning from past experiences in times of inactivity, such as during the inter-trial interval. (Madan, Ludvig et al. 2014) proposed that experiences are more likely to enter the buffer if they are extreme. If entering the buffer is biased in this way, then so are the values learned from replaying those experiences. In our task, extreme might mean that the reward was extremely high or low. The corresponding bias would drive choice towards the stimuli that produce the highest rewards, and away from those that produce the lowest, and thereby lead to a pattern similar to the one we observed.

Which theory is closer to the truth? One relevant piece of data is the gradual emergence of the risk preferences (Fig 2B). We detailed above that our theory predicts slowly emerging risk preferences; in particular, risk preferences should emerge after value preferences. The memory theory, on the other hand, holds that risk preferences stem from biased values. If that were the case, risk preferences should emerge at the same rate as value preferences, which is not what we observed.

Beyond this, it is difficult to compare the memory theory directly to prediction-error-induced risk seeking: it is unclear how to obtain trial-by-trial choice predictions from the memory model, which rules out a formal model comparison. Indeed, the memory model has so far only been fitted to and assessed based on summary statistics of a large collection of trials.
Further, the memory model has so far not been equipped with a mechanistic underpinning and was therefore not validated on physiological variables such as pupil dilation. In contrast, prediction-error-induced risk seeking can be fitted trial-by-trial, allowing it to make predictions not only about summary statistics but also about the evolution of preferences during the task, as well as about the immediate impact of extreme events such as large prediction errors. The corresponding latent variables can be correlated with physiological variables, showing that they can explain aspects of pupil dilation in addition to behavior (Fig 4C and 4D).

Further experimental predictions
In this paper, we present a theory based on neural effects and test some of its behavioral predictions. The results could have falsified our theory, but they did not: all our predictions were confirmed. This suggests that our theory is valid on the level of behavior, but it does not prove that our theory is correct on the level of neural mechanisms. More direct measurements are needed to establish this. Correlational studies are possible; they should use methods with sufficient time resolution to differentiate the prediction errors at different times during the trial, such as EEG, electrophysiology, voltammetry or similar. With those, one could obtain more direct measurements of the PEs, and hence resolve more clearly how they are associated with subsequent risk seeking. Another possibility are causal studies. For example, optogenetic tools could be used to elicit or suppress prediction errors just before choices involving risk, for instance by stimulating VTA dopamine neurons at the time of stimulus onset.

Conclusion
In summary, we demonstrate that a biologically inspired theory of basal ganglia learning predicts an interaction between prediction errors and risk seeking. This is based on dopamine's dual role in learning and action selection. We present behavioral data that matches these predictions. We further show that between-participant variability in behavior can be linked to differences in pupil responses: the stronger the pupil response to stimulus prediction errors, the stronger the prediction-error-induced risk seeking.

Methods

Participants

We tested 30 participants (15 female, median age: 26, range: 18-42). Our participants did not suffer from visual, motor or cognitive impairments. Participants were recruited from the Oxford Participant Recruitment system and gave informed written consent. All experimental procedures were approved by the Oxford Central University Research Ethics Committee, approval number R45265/RE004.

Participants were given a written set of instructions, as well as an oral instruction. They were first provided with a description of the task sequence. Then, they were told that the rewards were random, but nevertheless higher on average for some shapes than for others. Finally, they were advised to gather as much total reward as possible and were told that their compensation would be between 8 and 12 GBP, depending on their performance. After the task, all participants received a compensation of 10.50 GBP.

Our results are based on 27 of the 30 participants. Three participants were excluded from the analysis due to their failure to understand the task: we evaluated the participants' understanding of the task by scoring their preferences in different-mean choices during the second half of the blocks. Participants were included in the analysis if they chose the high-valued option in more than 70 % of those trials (Fig S1).

Emerging preferences
To test whether value preferences emerged faster than risk preferences, we used a linear mixed effects model to model choices. As predictors, we included fixed effects of decision type, trial number, and their interaction. Here, decision type was defined as value decision (difference condition) versus risk decision (both-high and both-low conditions). We also included random effects for all predictors and a random intercept, by participant. A likelihood ratio test for the fixed effect of the interaction between trial number and decision type revealed a significant positive effect (p < 0.001; based on comparing the empirical likelihood to a distribution of likelihoods obtained from 1000 Monte Carlo simulations of data from the model without the fixed effect, as implemented in MATLAB's 'compare' function). This confirms that value preferences emerged faster than risk preferences (Fig 2A and 2B).
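The structure of this model can be sketched as follows (our notation; choice denotes participant p's choice on trial t, type codes value versus risk decisions, and the b terms are the by-participant random effects):

```latex
\text{choice}_{pt} = \beta_0 + \beta_1\,\text{type}_{pt} + \beta_2\,\text{trial}_{pt}
    + \beta_3\,(\text{type}_{pt}\times\text{trial}_{pt})
    + b_{0p} + b_{1p}\,\text{type}_{pt} + b_{2p}\,\text{trial}_{pt}
    + b_{3p}\,(\text{type}_{pt}\times\text{trial}_{pt}) + \varepsilon_{pt}
```

The likelihood ratio test described above targets the fixed interaction coefficient beta_3.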

Models
The models we used in this study are all variants of the RW model (Rescorla and Wagner 1972). They consist of learning rules for latent variables, such as value and spread, and choice rules that convert latent variables into choice probabilities. Below, we provide these rules for all models, along with some auxiliary rules as necessary.

The RW model computes an outcome prediction error, the difference between the received reward r and the value V of the chosen option c:

δ = r − V(c)   (Eq. 6)

The value of the chosen option is then updated with learning rate α:

V(c) ← V(c) + α δ   (Eq. 7)

Choices are modeled with a softmax rule with inverse temperature β:

P(c = a) = exp(β V(a)) / Σ_b exp(β V(b))   (Eq. 8)

The RW model has free parameters α ∈ [0,1], β > 0 and fixed parameter V₀ = 50.
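These rules can be written as a minimal runnable sketch (our variable names; the numbers are illustrative, not fitted values):

```python
import math

def softmax(values, beta):
    """Eq. 8: convert option values into choice probabilities."""
    exps = [math.exp(beta * v) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

def rw_update(values, chosen, reward, alpha):
    """Eqs. 6-7: prediction error and value update for the chosen option."""
    delta = reward - values[chosen]          # Eq. 6
    values[chosen] += alpha * delta          # Eq. 7
    return delta

values = [50.0, 50.0]   # both options start at V0 = 50
delta = rw_update(values, chosen=0, reward=70.0, alpha=0.3)
print(delta)        # 20.0
print(values[0])    # 56.0
probs = softmax(values, beta=0.1)
print(probs[0] > probs[1])  # True: the higher-valued option is preferred
```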

Concave utility
To allow for concave subjective utility of reward points in our experiment, we used an exponential family of functions (Guyaguler and Horne 2001), which we adapted to our reward range through appropriate scaling and shifting. The nonlinear utility of reward enters through the computation of the prediction error: the reward in Eq. 6 is replaced by its subjective utility u(r):

δ = u(r) − V(c)   (Eq. 10)

Updates are computed with Eq. 7, and choices are modeled using the softmax rule in Eq. 8. The concave UTIL model has free parameters α ∈ [0,1], β > 0, a curvature parameter > 0, and fixed parameters V₀ = 50 and the scaling and shifting constants, both set to 50.
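The exact parameterization was not preserved in this text; the following sketch uses a generic member of such an exponential (saturating) utility family, scaled and shifted to the task's reward range, purely to illustrate the concavity (the form and names are our assumption):

```python
import math

def utility(r, kappa, scale=50.0, shift=50.0):
    """Illustrative concave (exponential-family) utility of reward points.
    The paper's exact form and symbols differ; this is a generic stand-in."""
    return shift * (1.0 - math.exp(-kappa * r / scale))

# Concavity: equal reward increments yield diminishing utility gains.
gain_low = utility(50, kappa=1.0) - utility(25, kappa=1.0)
gain_high = utility(75, kappa=1.0) - utility(50, kappa=1.0)
print(gain_low > gain_high)  # True
```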

Different learning rates for positive and negative prediction errors
It has been shown that learning from positive outcomes can differ from learning from negative outcomes (Gershman 2015). We model this by letting the learning rate depend on the sign of the prediction error (computed according to Eq. 6): updates follow Eq. 7 with learning rate α⁺ if the prediction error is positive and α⁻ if it is negative. Choices are modeled using the softmax rule in Eq. 8. The pos-neg RATES model has free parameters α⁺ ∈ [0,1], α⁻ ∈ [0,1] and β > 0.

The PEIRS model has four free parameters, α ∈ [0,1], a second parameter in [0,1], β > 0 and S₀ > 0, and the fixed parameter V₀ = 50. The initial spread estimate S₀ was allowed to vary since participants were given no prior information about the spread magnitude. Initial estimates might thus have differed strongly across individuals.
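The sign-dependent update can be sketched as follows (assuming, as is standard, that the positive rate applies to positive prediction errors and the negative rate otherwise; names are ours):

```python
def rates_update(values, chosen, reward, alpha_pos, alpha_neg):
    """RW update (Eq. 7) with a learning rate that depends on the sign
    of the prediction error (Eq. 6)."""
    delta = reward - values[chosen]
    alpha = alpha_pos if delta > 0 else alpha_neg
    values[chosen] += alpha * delta
    return delta

values = [50.0]
rates_update(values, 0, 80.0, alpha_pos=0.5, alpha_neg=0.1)  # positive PE: fast
print(values[0])  # 65.0
rates_update(values, 0, 45.0, alpha_pos=0.5, alpha_neg=0.1)  # negative PE: slow
print(values[0])  # 63.0
```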

OEIRS
If risk preferences are associated with stimulus prediction errors, they might also be associated with outcome prediction errors from the previous trial. This is the assumption of the OEIRS model, which differs from the PEIRS model in that the outcome prediction error of the previous trial (computed according to Eq. 6) is substituted for the stimulus prediction error in Eq. 13. Otherwise, OEIRS is identical to PEIRS and PIRS.

Parameter transformations and Priors
We used exponential and sigmoid transformations to constrain the parameters to their appropriate ranges. Priors were specified as multivariate normal distributions over the untransformed parameters. All but the diagonal elements of the covariance matrices of those normal distributions were set to zero. Hence, the prior distributions could be factorized into univariate normal distributions (one for each parameter). Below, we provide the statistics of those prior distributions. For parameters that occur in more than one

baseline. All traces were divided and shifted by that baseline, resulting in traces reflecting the relative change of pupil diameter after the alignment point. Finally, traces were downsampled to 10 Hz.

Pupil responses to prediction errors were obtained using linear models: for each participant, we regressed pupil dilation against a trial-by-trial estimate of the prediction error magnitude of interest, which was obtained from a model fit. This was done for each time bin of the pupil signal and resulted in a pupil response time course per participant.

To uncover the pupil response to the stimulus prediction error, we aligned the pupil time courses at stimulus onset. After stimulus onset, participants would eventually make a choice (with variable delay; the median reaction time was 0.86 s) and receive a reward (with a 1 s delay) after their choice. Since the reward or the resulting outcome prediction error might confound our regression analysis, we censored out all data after reward presentation. This means that the number of observations on which regressions can be based rapidly declines after the median reward presentation time, which is 1.86 s after stimulus onset. Estimates obtained later are increasingly unreliable, since they are based on insufficient data. We hence conducted our analyses for the interval 0 s to 1.9 s after stimulus onset.
This allows us to obtain reliable estimates of the statistics, while still avoiding confounding effects related to reward presentation.

To test whether the pupil responses to the prediction errors are statistically significant, we performed a test at a single time point corresponding to the peak of the effect. To avoid circularity, the time of the peak was identified using a leave-one-out method: for each participant, we used the individual pupil response time courses of all other participants to determine the time of the peak. This was achieved by executing t-tests on the response strength (t-statistic) across participants in each time bin and selecting the bin with the smallest p-value. We then took the left-out participant's response strength from that time bin, considering it to be their response strength at the peak of the group response. In a final step, we pooled all those individual response strengths and used a t-test to check whether they deviated significantly from zero.
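The leave-one-out peak selection can be sketched as follows (a toy reimplementation with made-up numbers; picking the bin with the largest absolute t-statistic is equivalent to picking the smallest p-value for a one-sample t-test with fixed sample size):

```python
import math

def t_stat(xs):
    """One-sample t-statistic of xs against zero."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / (n - 1)
    return m / math.sqrt(var / n)

def loo_peak_strengths(responses):
    """responses[i][b]: response strength of participant i in time bin b.
    For each participant, find the peak bin from all OTHER participants'
    data, then read out the held-out participant's value at that bin."""
    n_bins = len(responses[0])
    out = []
    for i in range(len(responses)):
        others = [r for j, r in enumerate(responses) if j != i]
        peak_bin = max(range(n_bins),
                       key=lambda b: abs(t_stat([r[b] for r in others])))
        out.append(responses[i][peak_bin])
    return out

# Toy data: bin 1 carries a consistent effect, bin 0 does not.
responses = [[0.0, 0.9], [0.1, 1.1], [-0.1, 1.0], [0.05, 0.8]]
print(loo_peak_strengths(responses))  # [0.9, 1.1, 1.0, 0.8]
```

The pooled values returned by `loo_peak_strengths` would then enter the final one-sample t-test against zero.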