Prefrontal signals precede striatal signals for biased credit assignment to (in)actions

Actions are biased by the outcomes they can produce: Humans are more likely to show action under reward prospect, but hold back under punishment prospect. Such motivational biases derive not only from biased response selection, but also from biased learning: humans tend to attribute rewards to their own actions, but are reluctant to attribute punishments to having held back. The neural origin of these biases is unclear; in particular, it remains open whether motivational biases arise primarily from the architecture of subcortical regions or also reflect cortical influences, the latter being typically associated with increased behavioral flexibility and emancipation from stereotyped behaviors. Simultaneous EEG-fMRI allowed us to track which regions encoded biased prediction errors in which order. Biased prediction errors occurred in cortical regions (dACC, PCC) before subcortical regions (striatum). These results highlight that biased learning is not a mere feature of the basal ganglia, but arises through prefrontal cortical contributions, revealing motivational biases to be a potentially flexible, sophisticated mechanism.


Introduction 51
Human action selection is biased by potential action outcomes: reward prospect drives us to 52 invigorate action, while threat of punishment holds us back 1-3 . These motivational biases have been 53 evoked to explain why humans are tempted by reward-related cues signaling the chance to gain food, 54 drugs, or money, as they elicit automatic approach behavior. Conversely, punishment-related cues 55 suppress action and lead to paralysis, which may even lie at the core of mental health problems such as 56 phobias and mood disorders 4,5 . While such examples highlight the potential maladaptiveness of biases 57 in some situations, they confer benefits in other situations: Biases could provide sensible "default" 58 actions before context-specific knowledge is acquired 1,6 . They may also provide ready-made alternatives 59 to more demanding action selection mechanisms, especially when speed has to be prioritized 7 . 60 Previous research has assumed that motivational biases arise because the valence of prospective 61 outcomes influences action selection 8 . However, we have recently shown that not only action selection, 62 but also the updating of action values based on obtained outcomes is subject to valence-dependent 63 biases 3,9,10 : humans are more inclined to ascribe rewards to active responses, but have problems with 64 attributing punishments to having held back. On the one hand, such biased learning might be adaptive 65 in combining the flexibility of instrumental learning with somewhat rigid "priors" about typical action-66 outcome relationships. Exploiting lifetime (or evolutionary) experience might lead to learning that is 67 faster and more robust to environmental "noise". On the other hand, biases might be responsible for 68 phenomena of "animal superstition" like negative auto-maintenance. Studies of this phenomenon used 69 strict omission schedules in which reward were never delivered on trials on which animals showed an 70 action (key peck, button press), but only when animals inhibited responding over a given time period. 71 Still, animals showed continued key picking in such paradigms, which might either reflect a strong 72 "prior belief" that any situation in which rewards were available requires active work to obtain those, or 73 vice versa an inability to attribute rewards to having held back one's actions 1,11,12 . While reward 74 attainment can lead to an illusory sense of control over outcomes, control is underestimated under threat 75 of punishment: Humans find it hard to comprehend how inactions can cause negative outcomes, which 76 makes them more lenient in judging harms caused by others' inactions 13,14 . Taken together, also credit 77 126 Figure 1. Motivational Go/ NoGo learning task design. A. On each trial, a Win or Avoid cue appeared; valence of the cue was not signaled but should be learned. Cue offset was also the response deadline. Response-dependent feedback followed after a jittered interval. Each cue had only one correct action (GoLEFT, GoRight, or NoGo), which was followed by the positive outcome 80% of the time. For Win cues, actions χ 2 (1) = 19.732, p < .001). When selectively analyzing trials with salient outcomes only, rewards 152 (compared to punishments) led to a higher proportion of choice repetitions following Go relative to 153 NoGo responses (valence x response: b = 0.308, SE = 0.064, χ 2 (1) = 17.798, p < .001; valence effect for 154 Go only: b = 1.276, SE = 0.115, χ 2 (1) = 53.932, p < .001; valence effect for NoGo only: b = 0.637, SE 155 = 0.127, χ 2 (1) = 18.228, p < .001; see full results in S02). 156 Taken together, these results suggested that behavioral adaptation following rewards and 157 punishments was biased by the type of action that led to this outcome (Go or NoGo). However, this 158 analysis only considered behavioral adaptation on the next trial, and could not pinpoint the precise 159 algorithmic nature of this learning bias. More importantly, it did not provide trial-by-trial estimates of 160 action values as required for model-based fMRI and EEG analyses to test for regions or time points that 161 reflected biased learning. We thus analyzed the impact of past outcomes on participants' choices using 162 computational RL models. 163 Computational modeling of behavior 164 In line with previous work 3,9 , we fitted a series of increasingly complex RL models. We started with 165 a simple Rescorla Wagner model featuring learning rate and feedback sensitivity parameters (M1). We 166 next added a Go bias, capturing participants' overall propensity to make Go responses (M2), and a showed that this model captured key behavioral features (Fig. 2E, F), including motivational biases and a greater behavioral adaptation after Go responses followed by salient outcomes than after NoGo 179 responses followed by salient outcomes (Fig. 2G). This pattern could not be captured by an alternative 180 learning bias model based on the idea that active responses generally enhance credit assignment 35 (see 181 S05). 182 One feature of the behavioral data that was not well captured by the asymmetric pathways model 183 was a high tendency of participants to repeat responses ("stay") to the same cue irrespective of outcomes 184 (see Fig. 2C and G). This tendency was stronger for Win than Avoid cues. We explored three additional 185 models featuring supplementary mechanisms to account for this behavioral pattern (see S06). All these 186 models fitted the data well and captured the propensity of staying better than M5; however, these models 187 overestimated the proportion of incorrect Go responses. Model-based fMRI analyses based on these 188 models led to results largely identical to those obtained with M5 (see S06). We thus focused on M5, 189 which relied on only a single mechanism (i.e., biased learning from rewarded Go and punishment NoGo 190 actions). See S06 for further details. 191 192 Figure 2. Behavioral performance. A. Trial-by-trial proportion of Go responses (±SEM across participants) for Go cues (solid lines) and NoGo cues (dashed lines). The motivational bias was already present from very early trials onwards, as participants made more Go responses to Win than Avoid cues (i.e., green lines are above red lines). Additionally, participants clearly learn whether to make a Go response or not (proportion of Go responses increases for Go cues and decreases for NoGo cues). B. Mean (±SEM across participants) proportion Go responses per cue condition (points are individual participants' means). C. Probability to repeat a response ("stay") on the next encounter of the same cue as a function of action and outcome. Learning was reflected in higher probability of staying after positive outcomes than after negative outcomes (main effect of outcome valence). Biased learning was evident in learning from salient outcomes, where this valence effect was stronger after Go responses than NoGo responses. Dashed line indicates chance level choice (pStay = 0.33). D. Log-model evidence favors the asymmetric pathways model (M5) over simpler models (M1-M4). E-G. Trial-by-trial proportion of Go responses, mean proportion Go responses, and probability of staying based on one-step-ahead predictions using parameters (hierarchical Bayesian inference) of the winning model (asymmetric pathways model, M5). H. Model frequency and protected exceedance probability indicate best fit for model M5 (asymmetric pathways model), in line with log model evidence.

193
fMRI: Basic quality control analyses 194 First, we performed a GLM as a quality-check to test which regions encoded positive (rewards, 195 no punishments) vs. negative (no reward/ punishment) outcomes in a "model-free" way, independent of 196 any model-based measure derived from a RL model (for full description of the GLM regressors and 197 contrasts, see S08). Positive outcomes elicited a higher BOLD response in regions including vmPFC, We also assessed which regions encoded Go vs. NoGo as well as GoLEFT vs. GoRIGHT responses. 202 There was higher BOLD for Go than NoGo responses at the time of response in dorsal ACC (dACC), 203 striatum, thalamus, motor cortices, and cerebellum, while BOLD was higher for NoGo than Go To test which brain regions were involved in biased learning, we performed a model-based GLM 210 featuring the trial-by-trial PE update as a parametric regressor (see GLM notation in S08). We used the 211 group-level parameters of the best fitting computational model (M5) to compute trial-by-trial belief 212 updates (i.e., prediction error * learning rate) for every trial for every participant. In assessing neural signal in these regions was better described (i.e., more variance explained) by biased learning than by 233 standard prediction error learning. 234 Figure 3. BOLD signal reflecting outcome processing. BOLD effects displayed using a dual-coding visualization: color indicates the parameter estimates and opacity the associated z-statistics. Significant clusters are surrounded by black edges. A. Significantly higher BOLD signal for positive outcomes (rewards, no punishments) compared with negative outcomes (no rewards, punishments) was present in a range of regions including bilateral ventral striatum and vmPFC. Bar plots show mean parameter estimates per condition (±SEM across participants; dots indicating individual participants) B. BOLD signals correlated positively to "standard" RL prediction errors in several regions, including the ventral striatum, dACC, vmPFC, and PCC. C. Left panel: Regions encoding both the standard PE term and the difference term to biased PEs (conjunction) at different cluster-forming thresholds (1 < z < 5, color coding; opacity constant). Clusters significant at a threshold of z > 3.1 are surrounded by black edges. In bilateral striatum, dACC, pgACC, PCC, left motor cortex, left inferior temporal gyrus, and primary visual cortex, BOLD was significantly better explained by biased learning than by standard learning. Right panel: 3D representation with all seven regions encoding biased learning (and used in fMRI-informed EEG analyses).
EEG: Biased learning in midfrontal delta, theta, and beta power 236 Similar to the fMRI analyses, we next tested whether midfrontal power encoded biased PEs 237 rather than standard PEs. While fMRI provides spatial specificity of where PEs are encoded, EEG power 238 provides temporal specificity of when signals encoding prediction errors occur 29,34 . In line with our fMRI 239 analysis, we used the standard PE term and the difference to the biased PE term as trial-240 by-trial regressors for EEG power at each channel-time-frequency bin for each participant and then 241 performed cluster-based permutation tests across the b-maps of all participants. Note that differently 242 from BOLD signal, EEG signatures of learning typically do not encode the full prediction error. Instead, 243 PE valence (better vs. worse than expected) and PE magnitude (saliency, surprise) have been found 244 encoded in the theta and delta band, respectively, but with opposite signs [31][32][33] . When testing for 245 parametric correlates of PE magnitude, we therefore controlled for PE valence, thereby effectively 246 testing for correlations with the absolute PE magnitude (i.e., degree of surprise). Note that PE valence 247 was identical for standard and biased PEs. Thus, only PE magnitude could distinguish both learning 248

models. 249
Both midfrontal theta and beta power reflected outcome (PE) valence: Theta power was higher 250 for negative (non-reward and punishment) than for positive (reward and non-punishment) outcomes 251 (225-475 ms, p = .006; Fig. 4A-B), while beta power was higher for positive than for negative outcomes 252 (300-1,250 ms, p = .002; Fig. 4A, C). Differences in theta power were clearly strongest over frontal 253 channels, while differences in the beta range were more diffuse, spreading over frontal and parietal 254 channels ( Fig. 4B-C). All results held when the condition-wise ERP was removed from the data (see 255 S10), suggesting that differences between conditions were due to induced (rather than evoked) activity 256 (for results in the time domain, see S11). 257 When testing for correlates of PE magnitude, we controlled for PE valence given that previous 258 studies have reported TF correlates of both PE valence and PE magnitude in a similar time and frequency 259 range, but with opposite signs 31-33 . Midfrontal delta power was indeed positively correlated with the 260 term (225-475 ms; p = .017; Fig. 4D). Decomposition of the term into its constituent 261 terms showed that this correlation was not significant for the term (p = 0.074, Fig. 4E) nor for 262 the term (p = 0.185; Fig. 4F). This result does not imply that the term explained delta 263 power significantly better than the term; it only implies significant encoding of the term 264 as suggested by the model that best fitted the behavioral data, with no significant evidence for a similar 265 encoding of the conventional term. For a similar observation in the time-domain EEG signal, see 266 S12. Beyond delta power, beta power correlated positively, though not significantly with (p = 267 0.110, Fig. 4E) and significantly negatively with (p = .001, 425-850 ms). Given these oppositely-268 signed correlations of its constituents, the term did not significantly correlate with beta power 269 (p = 0.550, Fig 4D). 270 In sum, both midfrontal theta power (negatively) and beta power (positively) encoded PE 271 valence. In addition, delta power encoded PE magnitude (positively). This encoding was only significant 272 for biased PEs, but not standard PEs. Taken together, as was the case for BOLD signal, midfrontal EEG 273 power also reflected biased learning. As a next step, we tested whether the identified EEG phenomena 274 were correlated with trial-by-trial BOLD signal in identified regions. Crucially, this allowed us to test 275 whether EEG correlates of cortical learning precede EEG correlates of subcortical learning. 276 th Figure 4. EEG time-frequency power over midfrontal electrodes (Fz/ FCz/ Cz). reflecting outcome processing. A. Time-frequency plot (logarithmic y-axis) displaying higher theta (4-8 Hz) power for negative (non-reward for Win cues and punishment for Avoid cues) outcomes and higher beta power (16-32 Hz) for positive (reward and non-punishment) outcomes. This contrast reflects EEG correlates of PE valence (better vs. worse than expected). Black square dot boxes indicate clusters above threshold that drive significance in a-priori defined frequency ranges. B. Theta power transiently increases for any outcome, but more so for negative outcomes (especially punishments) around 225-475 ms after feedback onset. Black horizontal lines indicate the time range for which the cluster driving significance was above threshold. (C) Beta power was higher for positive than negative outcomes over a long time period around 300-1,250 ms after feedback onset. D-F. Correlations between midfrontal EEG power and model-based trial-by-trial PE magnitudes controlling for PE valence (thus effectively testing for correlates of "absolute" PEs). Panel D displays the correlates of biased prediction errors , which are decomposed into (E) based on the non-biased learning model M1, and (F) their difference . Solid black lines indicate clusters above threshold. Biased PEs were significantly positively correlated with midfrontal delta power (D). The correlations of delta with the standard PEs (E) and the difference term to biased PEs (F) were positive as well, though not significant. Beta power only significantly encoded the difference term to biased PEs (F). ** p < 0.01.
Combined EEG-fMRI: Prefrontal cortex signals precede striatum during biased 277 outcome processing

278
The observation that also cortical areas (dACC, pgACC, PCC) show biased PEs is consistent with 279 the "external model" of cortical signals biasing learning processes in the striatum. However, this model 280 makes the crucial prediction that these biased learning signals should be present first in cortical areas 281 and only later in the striatum. Next, we used trial-by-trial BOLD signal from those regions encoding 282 biased PE to predict midfrontal EEG power. By determining the time points at which different regions 283 correlated with EEG power, we were able to infer the relative order of biased PE processing across 284 cortical and subcortical regions, revealing whether cortical processing preceded striatal processing. We 285 used trial-by-trial BOLD signal from the seven regions encoding biased PEs, i.e., striatum, dACC, 286 pgACC, PCC, left motor cortex, left ITG, and primary visual cortex (see masks in S07) as regressors on 287 average EEG power over midfrontal electrodes (Fz/ FCz/ Cz; see S13 for a graphical illustration of this 288 approach). We performed analyses with and without PEs included in the model, which yielded identical 289 results and suggested that EEG-fMRI correlations did not merely result from PE processing as a 290 "common cause" driving signals in both modalities. Instead, EEG-fMRI correlations reflected 291 incremental variance explained in EEG power by the BOLD signal in selected regions (even beyond 292 variance explained by the model-based PE estimates), providing the strongest test for the hypothesis 293 that BOLD and EEG signal reflect the same neural phenomenon. As the timeseries of all seven regions 294 were included in one single regression, their regression weights reflected each region's unique 295 contribution, controlling for any shared variance. In line with the "external model", BOLD signal from 296 prefrontal cortical regions correlated with midfrontal EEG power earlier after outcome onset than did 297 striatal BOLD signal: 298 First, dACC BOLD was significantly negatively correlated with alpha/ theta power early after 299 outcome onset (100-575 ms, 2-17 Hz, p = .016; Fig. 5A). This cluster started in the alpha/ theta range 300 and then spread into the theta/delta range (henceforth called "lower alpha band power"). It was not 301 observed in the EEG-only analyses reported above. 302 Second, while pgACC BOLD did not correlate significantly with midfrontal EEG power (p = .184), 303 BOLD in PCC was negatively correlated with theta/ delta power ( Fig. 5B; 175-500 ms, 1-6 Hz, p = 304 .014). This finding bore resemblance in terms of time-frequency space to the cluster of (negative) PE 305 valence encoding in the theta band and (positive) PE magnitude encoding in the delta band identified in 306 the EEG-only analyses (Fig. 4A). Complementary to the fMRI-informed EEG analyses, we also 307 performed independent EEG-informed fMRI analyses, which showed the robustness of this EEG-fMRI 308 correlation. We used the trial-by-trial EEG signal in the cluster identified in the EEG-only analyses (see 309 Fig. 4 A, B) to predict BOLD signal across the brain (see S13 for a graphical illustration of this 310 approach). The EEG time-frequency-mask used to create the EEG regressor was defined based on the 311 EEG-only analyses (Fig. 4A, B) and thus blind to the result of the fMRI-informed EEG analysis. We 312 observed significant clusters of negative EEG-BOLD correlation in vmPFC and PCC ( Fig. 5F; S16). 313 We thus discuss vmPFC and PCC together in the following. 314 Third, there was a significant positive correlation between striatal BOLD and midfrontal beta/ alpha 315 power (driven by a cluster at 100-800 ms, 7-23 Hz, p = .010; Fig. 5C). This finding bore resemblance 316 in time-frequency space to the cluster of positive PE valence encoding in beta power identified in the 317 EEG-only analyses (Fig. 4A, C). Again, to support the robustness of this finding, we used trial-by-trial 318 midfrontal beta power in the cluster identified in the EEG-only analyses (see Fig. 4A, C) to predict 319 BOLD signal across the brain. Clusters of positive EEG-BOLD correlations in right dorsal caudate (and 320 left parahippocampal gyrus) as well as clusters of negative correlations in bilateral dorsolateral PFC 321 (dlPFC) and supramarginal gyrus (SMG; Fig. 5G; see S16) confirmed the positive striatal BOLD-beta 322 power association. Given that the striatum is far away from the scalp and thus unlikely to be the source 323 of midfrontal beta power over the scalp, and given the assumption that trial-by-trial variation in an 324 oscillatory signal should correlate with BOLD signal in its source 38,39 , we speculate that dlPFC and SMG 325 (identified in the EEG-informed fMRI analyses) are the sources of beta power over the scalp and act as 326 an "antenna" for striatal signals. In line with this idea, previous studies have found beta power to be 327 correlated with dlPFC and SMG BOLD 40,41 and single units recordings in these regions have found 328 modulations of beta power by cognitive processes 42 . 329 Finally, regarding the other three regions that showed a significant BOLD signature of biased PEs, 330 BOLD in left motor cortex was significantly negatively correlated with midfrontal beta power (p = .002; 331 around 0-625 ms; see S14). There were no significant correlations between midfrontal EEG power and 332 left inferior temporal gyrus or primary visual cortex BOLD (see S14). All results were robust to different 333 analysis approaches including shorter trial windows, different GLM specifications, inclusion of task-334 condition and fMRI motion realignment regressors, and individual modelling of each region. TF results 335 were not reducible to phenomena in the time domain (see S15). 336 In sum, there were negative correlations between dACC BOLD and midfrontal lower alpha band 337 power early after outcome onset, negative correlations between PCC BOLD and midfrontal theta/ delta 338 power at intermediate time points, and positive correlations between striatal BOLD and midfrontal beta 339 power at late time points. This temporal dissociation was especially clear in the time courses of the test 340 statistics for each region (thresholded at |t| > 2 and summed across frequencies), for which the peaks of 341 the cortical regions preceded the peak of the striatum (Fig. 5D, H). Note that time-frequency power is 342 estimated over temporally extended windows (400 ms in our case), which renders any interpretation of 343 the "onset" or "offset" of such correlations more difficult. In sum, these results are consistent with an 344 "external model" of motivational biases arising from early cortical processes biasing later learning 345 processes in the striatum. 346 Figure 5. fMRI-informed EEG analyses. Unique temporal contributions of BOLD signal in (A) dACC, (B) PCC, and (C) striatum to average EEG power over midfrontal electrodes (Fz/ FCz/ Cz). Group-level t-maps display the modulation of the EEG power by trial-by-trial BOLD signal in the selected ROIs. dACC BOLD correlated negatively with early alpha/ theta power, PCC BOLD negatively with theta/ delta power, and striatal BOLD positively with beta/ alpha power. Areas surrounded by a black edge indicate clusters of |t| > 2 with p < .05 (cluster-corrected). Topoplots indicate the topography of the respective cluster. D. Time course of dACC, PCC, and striatal BOLD correlations, normalized to the peak of the time course of each region. dACC-lower alpha band correlations emerged first, followed by (negative) PCC-theta correlations and finally positive striatum-beta correlations. The reverse approach using lower alpha (E), theta (F) and beta (G) power as trial-by-trial regressors in fMRI GLMs corroborated the fMRI-informed EEG analyses: Lower alpha band power correlated negatively with the dACC BOLD, theta power negatively with vmPFC and PCC BOLD, and beta power positively with striatal BOLD. H. Schematic overview of the main EEG-fMRI results: dACC encoded the previously performed response and correlated with early midfrontal lower alpha band power. vmPFC/ PCC (correlated with theta power) and striatum (correlated with beta power) both encoded outcome valence, but had opposite effects on subsequent behavior. Note that activity in these regions temporally overlaps; boxes are ordered in temporal precedence of peak activity. dACC BOLD and midfrontal lower alpha band power encode the previously performed 348 action during outcome presentation 349 While the clusters of EEG-fMRI correlation in the theta/ delta and beta range matched the 350 clusters identified in EEG-only analyses, the cluster of negative correlations between dACC BOLD and 351 early midfrontal lower alpha band power was novel and did not match our expectations. Given that these 352 correlations arose very soon after outcome onset, we hypothesized that dACC BOLD and midfrontal 353 lower alpha band power might reflect a process occurring even before outcome onset, such as the 354 maintenance ("memory trace") of the previously performed response to which credit may later be 355 assigned. We therefore assessed whether information of the previous response was present in dACC 356 BOLD and in the lower alpha band around the time of outcome onset. 357 First, we tested for BOLD correlates of the previous response at the time of outcomes (eight 358 outcome-locked regressors for every Go/ NoGo x reward/ no reward/ no punishment/ punishment 359 combination) while controlling for motor-related signals at the time of the response (response-locked 360 regressors for left-hand and right-hand button presses). At the time of outcomes, there was higher BOLD 361 signal for NoGo than Go responses across several cortical and subcortical regions, peaking in both the 362 dACC and striatum (Fig. 6D). This inversion of effects-higher BOLD for Go than NoGo responses at 363 the time of response (see quality checks), but the reverse at the time of outcome-was also observed in 364 the upsampled raw BOLD and was independent of the response of the next trial (S17). In sum, large 365 parts of cortex, including the dACC, encoded the previously performed response at the moment 366 outcomes were presented, in line with the idea that the dACC maintains a "memory trace" of the 367 previously performed response. 368 Second, we tested for differences between Go and NoGo responses at the time of outcomes in 369 midfrontal broadband EEG power. Power was significantly higher on trials with Go than on trials with 370 NoGo responses, driven by clusters in the lower alpha band (spreading into the theta band; around 371 0.000-0.425 sec., 1-11 Hz, p = .012) and in the beta band (around 0.200-0.450 sec., 18-27 Hz, p = 372 .022; Fig. 6A, B). The first cluster matched the time-frequency pattern of dACC BOLD-alpha power 373 correlations (Fig. 5A). 374 If this activity cluster contained a signature of the previously performed response, it might have 375 been present throughout the delay between cue offset and outcome onset. When repeating the above 376 permutation test including the last second before outcome onset, there were significant differences again, 377 driven by a sustained cluster in the beta band (-1-0 sec., 13-33 Hz, p = .002) and two clusters in the 378 alpha/ theta band (Cluster 1: -1.000--0.275 sec., 1-10 Hz, p = 0.014; Cluster 2: -0.225-0.425 sec., 1-379 11 Hz, p = .022; Fig. 6B). These findings suggest that lower alpha band power might reflect a sustained 380 memory of the previously performed response. Supplemental analyses (S17) yielded that this Go-NoGo 381 trace during outcome processing did not change over the time course of the experiment, suggesting that 382 it did not reflect typical fatigue/ time-on task effects often observed in the alpha band. 383 Again, we performed the reverse EEG-fMRI analysis using trial-by-trial power in the identified 384 lower alpha band cluster (Fig. 6B) as an additional regressor in the quality-check fMRI GLM. Clusters 385 of negative EEG-BOLD occurred correlation in a range of cortical regions, including dACC and 386 precuneous ( Fig. 5E; see S16). In sum, both dACC BOLD signal and midfrontal lower alpha band power 387 contained information about the previously performed response, consistent with the idea that both 388 signals reflect a "memory trace" of the response to which credit is assigned once an outcome is obtained. 389 Figure 6. Exploratory follow-up analyses on dACC BOLD signal and midfrontal lower alpha band power. A. Midfrontal time-frequency response-locked (left panel) and outcome-locked (right panel). Before and shortly after outcome onset, power in the lower alpha band was higher on trials with Go actions than on trials with NoGo actions. The shape of this difference resembles the shape of dACC BOLD-EEG TF correlations (small plot; note that this plot depicts BOLD-EEG correlations, which were negative). Note that differences between Go and NoGo trials occurred already before outcome onset in the alpha and beta range, reminiscent of delay activity, but were not fully sustained throughout the delay between response and outcome. B. Midfrontal power in the lower alpha band per action x outcome condition. Lower alpha band power was consistently higher on trials with Go actions than on trials with NoGo actions, starting already before outcome onset.

C. BOLD signal differences between Go and NoGo actions (left panel) and left vs. right hand responses (right panel) at the time or responses.
Response-locked dACC BOLD signal was significantly higher for Go than NoGo actions. D. BOLD signal differences between Go and NoGo actions at the time of outcomes. Outcome-locked dACC BOLD signal (and BOLD signal in other parts of cortex) was significantly lower on trials with Go than on trials with NoGo actions.

390
Striatal and vmPFC/ PCC BOLD differentially relate to action policy updating 391 EEG correlates of PCC BOLD and striatal BOLD occurred later than for the dACC BOLD, and 392 overlapped with classical feedback-related midfrontal theta and beta power responses. We hypothesized 393 that those neural signals might be more closely related to updating of action policies (i.e., which action 394 to perform for each cue) and might thus predict the next response to the same cue 30,43 . We thus used the 395 trial-by-trial BOLD responses in dACC, vmPFC, PCC, and striatum to predict whether participants 396 would repeat the same response on the next trial with the same cue ("stay") or switch to another response 397 ("shift"). Mixed-effects logistic regression yielded that dACC BOLD did not significantly predict 15.559, p < .001). We also inspected the raw upsampled HRF shapes per region per condition, 406 confirming that differential relationships were not driven by differences in HRF shapes across regions. 407 We also tested whether trial-by-trial midfrontal lower alpha band, theta, or beta power (within 408 the clusters identified in the EEG-only analyses) predicted action policy updating. Participants were 409 significantly more likely to repeat the same response when beta power was high (b = 0.145, SE = 0.041, 410 χ 2 (1) = 11.886, p < .001), but more likely to switch when theta power was high (b = -0.099, SE = 0.047, 411 χ 2 (1) = 4.179, p = .041). Notably, unlike its BOLD correlate in ACC, lower alpha band power did predict 412 response repetition, with more repetition when alpha power was high (b = .0.179, SE = 0.052, χ 2 (1) = 413 10.711, p = .001; for plots, see S18). 414 In sum, high striatal BOLD and midfrontal beta power predicted that the same response would 415 be repeated on the next encounter of a cue, while high vmPFC and PCC BOLD and high theta power 416 predicted that participants would switch to another response. Thus, although both striatal and vmPFC/ 417 PCC BOLD positively encoded biased prediction errors, these two sets of regions had opposite roles in 418 learning: while the striatum reinforced previous responses, vmPFC/ PCC triggered the shift to another 419 response strategy (Fig. 5H). 420

421
We investigated neural correlates of biased learning for Go and NoGo responses. In line with 422 previous research 3,9 , participants' behavior was best described by a computational model featuring faster These results suggest that the architecture of the asymmetric striatal pathways might not be the only 430 neural structure that gives rise to motivational learning biases; instead, the PFC might critically 431 contribute to these biases. 432 The observation that both PFC and striatal BOLD signal reflected biased PEs might be explained 433 by three different models. One model assumes that both PFC and striatal processes arrive at biased 434 learning independently of each other, which is highly unlikely given strong recurrent connections 435 between both regions 18,19,44 . Another model incorporates such interconnections, but assumes that 436 striatum leads the PFC. While such a model is in line with past animal studies 45 and modeling work 46 , it 437 would predict EEG correlates of the PFC to trail after EEG correlates of the striatum-or at least to 438 occur with considerable delay after outcome onset. This model is not supported by our findings, which 439 showed EEG correlates of PFC regions soon after outcome onset, preceding striatal EEG correlates. The dominant idea about the origin of motivational biases has been that these biases are an 447 emergent feature of the asymmetric direct/ indirect pathway architecture in the basal ganglia 2,19 . We 448 find that these biases are present first in prefrontal cortical areas, notably dACC and PCC, which argues 449 against biases being purely driven by subcortical circuits. Rather, motivational learning biases might be 450 an instance of sophisticated, even "model-based" learning processes in the striatum instructed by the 451 prefrontal cortex 49,50 . An influence of PFC on striatal RL has prominently been observed in the case of 452 model-based vs. model-free learning 23,24 and has been stipulated as a mechanism of how instructions 453 can impact RL 20,21 . Although there are reports of striatal processes preceding prefrontal processes within 454 learning tasks 45,51 , the opposite pattern of PFC preceding striatum has been observed as well 52 and a 455 causal impact of PFC on striatal learning is well established 53,54 . In particular, we have previously 456 observed that motivational biases in action selection might arise from early prefrontal inputs to the 457 striatum, as well 17 . Prefrontal influences on striatal processes might thus be a common signature of both 458 motivational response and learning biases. 459 The particular subregion of PFC showing the earliest EEG correlates was the dACC. This 460 observation is in line with an earlier EEG-fMRI study reporting dACC to be part of an early valuation 461 system preceding a later system comprising vmPFC and striatum 55 . The dACC has been suggested to 462 encode models of agents' environment 56,57 that are relevant for interpreting outcomes, with BOLD in 463 this region scaling with the size of PEs 25,26 and indexing how much should be learned from new 464 outcomes. We hypothesize that, at the moment of outcome, dACC maintains a "memory trace" of the 465 previously performed response 58 which might modulate the processing of outcomes as soon as they 466 become available 59,60 . Notably, dACC exhibited stronger BOLD signal for Go than NoGo responses at 467 the time of participants' response, but this pattern reversed at the time of outcomes. This reversal rules 468 out the possibility that response-locked BOLD signal simply spilled over into the time of outcomes. 469 Future research will be necessary to corroborate such a motor "memory trace" in dACC. In sum, the 470 dACC might be in a designated position to inform subsequent outcome processing in downstream 471 regions by modulating the learning rate as a function of the previously performed response and the 472 obtained outcome. Rather than striatal circuits being sufficient for the emergence of motivational biases, 473 the more "flexible" PFC seems to play an important role in instructing downstream striatal learning 474 processes. The frequency-specific nature of these EEG-fMRI correlations further suggests that they are signatures 485 of task-induced events that are specific to the trial phase and unlikely to reflect general anatomical 486 connectivity. In sum, while these EEG-fMRI findings on outcome processing resemble our previous 487 EEG-fMRI findings on action selection in that prefrontal signals precede striatal signals, they are 488 dissociated in terms of the frequency specificity, highlighting the distinct roles of the striatum in these 489

processes. 490
Positive encoding of prediction errors in striatal BOLD signal is a well-established phenomenon 37,65 . 491 Striatal BOLD was better described by biased PEs than by standard PEs, corroborating the presence of 492 motivational learning biases also in striatal learning processes. Notably, EEG correlates of striatal 493 BOLD peaked rather late, suggesting that these processes are informed by early sources in PFC which 494 are connected to the striatum via recurrent feedback loops 18,44 . Positive prediction errors increase the 495 value of a performed action and thus strengthen action policies. Hence, it is not surprising that high 496 striatal BOLD signal and midfrontal beta power predicted action repetition 66,67 .
In contrast to striatal learning signals, the PCC and vmPFC BOLD as well as midfrontal theta 498 and delta power signals were more complicated: Theta encoded PE valence, delta encoded PE 499 magnitude. Both correlates showed opposite polarities. This observation is in line with previous 500 literature suggesting that midfrontal theta and delta power might reflect the "saliency" or "surprise" 501 aspect of PEs 31,32,68 . Surprises have the potential to disrupt an ongoing action policy 69 and motivate a 502 shift to another policy, which might explain why these signals predicted switching to another 503 response 70,71 . Notably, this EEG surprise signal was only significantly correlated with the biased (but 504 not the standard) PE term, corroborating that the surprise attributed to outcomes depends on the 505 previously performed response in line with motivational learning biases. In sum, both vmPFC and 506 striatum encode biased PEs, though with different consequences for future action policies. 507 Taken together, distinct brain regions processed outcomes in a biased fashion at distinct time 508 points with distinct EEG power correlates. Simultaneous EEG-fMRI recordings allowed us to infer when 509 those regions reached their peak activity 72 . However, the correlational nature of BOLD-EEG links 510 precludes strong statements about these regions actually generating the respective power phenomena. whether scalp EEG could in principle reflect striatal activity, at all 73,74 . Intracranial recordings have 515 observed beta oscillations during outcome processing in the striatum before 67,75,76 . Also, our analysis 516 controlled for BOLD signal in motor cortex, an alternative candidate source for beta power, suggesting 517 that late midfrontal beta power did not merely reflect motor cortex beta. Even if the striatum is not the 518 generator of the beta oscillations over the scalp, their true (cortical) generator might be tightly coupled 519 to the striatum and thus act as a "transmitter" of striatal beta oscillations. In fact, the analyses using trial-520 by-trial beta power to predict BOLD yielded significant clusters in dlPFC and SMG, two candidate 521 regions for such a "transmitter". 522 We observed EEG correlates of striatal BOLD at a rather late time point after outcome onset. 523 While we conclude that biased outcome processing occurs much earlier in cortical regions than the 524 striatum, it is possible that the modulating influence of the striatum on cortical sources of beta 525 synchronization (possibly dlPFC and SMG) observed over the scalp takes time to surface. However, 526 speaking against any delay, some single studies have reported maximal correlations between striatal 527 LFPs and scalp EEG at a time lag of 0 77 . Regardless, even in the presence of a non-zero lag, our main 528 conclusion would hold: Biased learning is present in cortical regions early after outcome onset, which 529 cannot be a consequence of striatal input, but must constitute an independent origin of motivational 530 learning biases. 531 Finally, the correlational nature of the study prevents strong statements over any causal 532 interactions between the observed regions. We assume here that a region showing an earlier midfrontal 533 EEG correlate influences other regions showing later midfrontal EEG correlates, and such an influence 534 is plausible given findings of feedback loops between prefrontal regions and the striatum 44 . Future 535 studies targeting those regions via selective causal manipulations will be necessary to test for the causal 536 role of PFC in informing striatal learning. 537 In conclusion, biased learning-increased credit assignment to rewarded action, decreased 538 credit assignment to punished inaction-was visible both in behavior and in BOLD signal in a range of 539 regions. EEG correlates of prefrontal cortical regions, notably dACC and pgACC, preceded correlates 540 of the striatum, consistent with a model of the PFC biasing RL in the striatum. The dACC appeared to 541 hold a "motor memory trace" of the past response, biasing early outcome processing. Subsequently, 542 biased learning was also present in vmPFC/ PCC and striatum, with opposite roles in adjusting vs. 543 maintaining action policies. These results refine previous views on the neural origin of these learning 544 biases, suggesting they might not only rely on subcortical parts of the brain typically associated with 545 rigid, habit-like responding, but rather incorporate frontal inputs that are associated with counterfactual 546 reasoning and increased behavioral flexibility 78,79 . The PFC is typically believed to facilitate goal-547 directed over instinctive processes. Hence, PFC involvement into biased learning suggests that these 548 biases are not necessarily agents' inescapable "fate", but rather likely act as global "priors" that facilitate 549 learning of more local relationships. They allow for combining "the best of both worlds"-long-term or corrected-to-normal vision) took part in a single 3-h data collection session, for which they received 556 €30 flat fee plus a performance-dependent bonus (range €0-5, Mbonus = €1.28, SDbonus = 1.54). The 557 study was approved by the local ethics committee (CMO2014/288; Commissie Mensengeboden 558 Onderzoek Arnhem-Nijmegen) and all participants provided written informed consent. Exclusion 559 criteria comprised claustrophobia, allergy to gels used for EEG electrode application, hearing aids, 560 impaired vision, colorblindness, history of neurological or psychiatric diseases (including heavy 561 concussions and brain surgery), epilepsy and metal parts in the body, or heart problems. Sample size 562 was based on previous EEG studies with a comparable paradigm 9,80 . 563 Behavioral and modeling results include all 36 participants. The following participants were 564 excluded from analyses of neural data: For two participants, fMRI functional-to-standard image 565 registration failed; hence, all fMRI-only results are based on 34 participants (Mage = 23.47, 25 women). 566 Four participants exhibited excessive residual noise in their EEG data (> 33% rejected trials) and were 567 thus excluded from all EEG analyses; hence, all EEG-only analyses are based on 32 participants (Mage 568 = 23.09, 23 women). For combined EEG-fMRI analyses, we excluded the above-mentioned six 569 participants plus one more participant whose regression weights for every regressor were about ten times 570 larger than for other participants, leaving 29 participants (Mage = 23.00, 22 women). Exclusions were in 571 line with a previous analysis of this data set 17 . fMRI-and EEG-only results held when analyzing only 572 those 29 participants (see S01). 573 Participants performed a motivational Go/ NoGo learning task 3,9 administered via MATLAB 575 2014b (MathWorks, Natick, MA, United States) and Psychtoolbox-3.0.13. On each trial, participants 576 saw a gem-shaped cue for 1300 ms which signaled whether they could potentially win a reward (Win 577 cues) or avoid a punishment (Avoid cues) and whether they had to perform a Go (Go cue) or NoGo 578 response (NoGo cue). They could press a left (GoLEFT), right (GoRIGHT), or no (NoGo) button while the 579 cue was presented. Only one response option was correct per cue. Participants had to learn both cue 580 valence and required action from trial-and-error. After a variable inter-stimulus-interval of 1,400-1,600 581 ms, the outcome was presented for 750 ms. Potential outcomes were a reward (symbolized by coins 582 falling into a can) or neutral outcome (can without money) for Win cues, and a neutral outcome or 583 punishment (symbolized by money falling out of a can) for Avoid cues. Feedback validity was 80%, 584 i.e., correct responses were followed by positive outcomes (rewards/ no punishments) on only 80% of 585 trials, while incorrect responses were still followed by positive outcomes on 20% of trials. Trials ended 586 with a jittered inter-trial interval of 1250-2000 ms, yielding total trial lengths of 4700-6650 ms. 587 Participants gave left and right Go responses via two button boxes positioned lateral to their 588 body. Each box featured four buttons, but only one button per box was required in this task. When 589 participants accidentally pressed a non-instructed button, they received the message "Please press one 590 of the correct keys" instead of an outcome. In the analyses, these responses were recoded into the 591 instructed button on the respective button box. In the fMRI GLMs, such trials were modeled with a 592 separate regressor. 593 Before the task, participants were instructed that each cue could be followed by either reward 594 or punishment, that each cue had one optimal response, that feedback was probabilistic, and that the 595 rewards and punishments were converted into a monetary bonus upon completion of the study. They 596 performed an elaborate practice session in which they got familiarized first with each condition 597 separately (using practice stimuli) and finally practiced all conditions together. They then performed 598 640 trials of the main task, separated into two sessions of 320 trials with separate cue sets. Introducing 599 a new set of cues allowed us to prevent ceiling effects in performance and investigate continuous 600 learning throughout the task. Each session featured eight cues that were presented 40 times. After every 601 100-110 trials (~ 6 min.), participants could take a self-paced break. The assignment of the gems to cue 602 conditions was counterbalanced across participants, and trial order was pseudo-random (preventing that 603 the same cue occurred on more than two consecutive trials). 604 Behavior analyses 605 We used mixed-effects logistic regression (as implemented in the R package lme4) to analyze 606 behavioral responses (Go vs. NoGo) as a function of required action (Go/ NoGo), cue valence (Win/ 607 Avoid), and their interaction. We included a random intercept and all possible random slopes and 608 correlations per participant to achieve a maximal random-effects structure 81 . Sum-to-zero coding was 609 employed for the factors. Type 3 p-values were based on likelihood ratio tests (implemented in the R 610 package afex). We used a significance criterion of α = .05 for all the analyses. 611 Furthermore, we used mixed-effects logistic regression to analyze "stay behavior", i.e., whether 612 participants repeated an action on the next encounter of the same cue, as a function of outcome valence 613 (positive: reward or no punishment/ negative: no reward or punishment), outcome salience (salient: 614 reward or punishment/ neutral: no reward or no punishment), and performed action (Go/ NoGo). We 615 again included all possible random intercepts, slopes, and correlations. 616 Computational modeling 617 We fit a series of increasingly complex RL models to participants' choices to decide between different 618 algorithmic explanations for the emergence of motivational biases in behavior. We employed the same 619 set of nested models as in previous studies using this task 3,9 . For tests of alternative biases specifications, 620 see S05 and S06. 621

Model space 622
To determine whether a Pavlovian response bias, a learning bias, or both biases jointly predicted 623 behavior best, we fitted a series of increasing complex computational models. In each trial (t), choice 624 probabilities for all three response options (a) given the displayed cue (s) were computed from their 625 action weights (modified Q-values) using a softmax function: 626 After each response, action values were updated with the prediction error based on the obtained 628 outcome ∈ {−1; 0; 1}. As the starting model (M1), we fitted an standard delta-learning model 82 in 629 which action values were updated with prediction errors, i.e., the deviation between the experienced outcome and expected outcome. This model contained two free parameters: the learning rate (ε) scaling 631 the updating term and the feedback sensitivity (ρ) scaling the received outcome (i.e., higher feedback 632 sensitivity led to choices more strongly guided by value difference, akin to the role of the inverse 633 temperature parameter frequency used in reinforcement learning models): 634 (2) 635 In this model, choice probabilities were fully determined by action values, without any bias. We 636 initialized action values Q0 such that they reflected a "neutral" expected value for each action. Win cues 637 could lead to reward (+1) or neutral (0) outcomes and Avoid cues to neutral (0) or punishment (-1) 638 outcomes. A neutral expected value would assign equal probability to either possible outcome, leading 639 to expectations of +1/2 and -1/2, respectively. In addition, because participants' feedback sensitivity 640 parameter ρ reflected how participants weighed the outcomes they received, also the initial values had 641 to be multiplied with the feedback sensitivity to stay neutral between 0 and participants' re-weighted 642 positive/ negative outcome of +/-1*ρ. Thus, initial action values Q0 were set to 1/2*ρ (Win cues) and -643 1/2*ρ (Avoid cues). 644 Unlike previous versions of the task 3 , cue valences were not instructed, but had to be learned 645 from outcomes, as well 9 . Thus, until experiencing the first non-neutral outcome (reward or punishment) 646 for a cue, participants could not know its valence and thus not learn from neutral feedback. Hence, for 647 these early trials, action values were multiplied with zero when computing choice probabilities 9 . After 648 the first encounter of a valenced outcome, action values were "unmuted" and started to influence choices 649 probabilities, retrospectively considering all previous outcomes 9 . 650 In M2, we added the Go bias parameter b, which accounted for individual differences in 651 participants' overall propensity to make Go responses, to the action values Q, resulting in action weights 652 w: 653 In M3, we added a Pavlovian response bias π, scaling how positive/ negative cue valence 655 (Pavlovian values) increased/ decreased the weights of Go responses: 656

) 657
Participants were instructed that a cue was either a Win cue (affording rewards or neutral 658 outcomes) or an Avoid cue (affording neutral outcomes or punishments). Hence, cue valence (Win/ 659 Avoid) did not have to be learned instrumentally; instead, it could be inferred as soon participants 660 experienced a non-neutral outcome. Until that moment, cue valence ( ) was set to zero. Afterwards, 661 ( ) was set to +0.5 for Win cues and -0.5 for Avoid cues. Note that choosing different values than 0.5 662 would merely rescale the bias parameter π (e.g., halving π with cue valences of +1 and -1) without any 663 changes in the model's predictions. The Pavlovian response bias affected left-hand and right-hand Go 664 responses similarly and thus reflected generalized activation/ inactivation by the cue valence. 665 In M4, we added a learning bias κ, increasing the learning rate for rewards after Go responses 666 and decreasing it for punishments after NoGo responses: 667 The learning bias was specific to the response shown, thus reflecting a specific enhancement in 669 action learning/ impairment in unlearning for that particular response. 670 In the model M5, we included both the Pavlovian response bias and the learning bias. 1] 3,9 . We made sure that the effect of κ on ε was symmetrical by computing it as: 677 with random-effects model comparison 84 . The fitting procedure involves two steps, starting with the 683 Laplace approximation of the model evidence to compute the group evidence, which quantifies how 684 well each model fits the data while penalizing for model complexity. Both group-level and individual-685 level parameters are estimated using an iterative algorithm. We used wide Gaussian priors (see 686 hyperpriors above) and exponential and sigmoid transforms to constrain parameter spaces. Subsequent 687 random-effects model selection allows for the possibility that different models generated the data for 688 different participants. Participants contribute to the group-level parameter estimation in proportion to 689 how well a given model fits their data, quantified via a responsibility measure (i.e., the probability that 690 the model at hand is responsible for generating data of the respective participant). This model-691 comparison approach has been shown to be less susceptible to the influence of outliers 83 . We selected 692 the "winning" model based on the protected exceedance probability. 693 694 We assured that the winning model was able to reproduce the data, using the sampled 695 combinations of participant-level parameter estimates to create 3600 agents that "played" the task. We choices. This method ignores participant-specific choice histories and can thus yield choice/ outcome 700 sequences that diverge considerably from participants' actual experiences. In contrast, one-step-ahead 701 predictions use participants' actual choices and experienced outcomes in each trial to update action 702 values. We simulated choices for each participant using both methods, which confirmed that the winning 703 model M5 ("asymmetric pathways model") was able to qualitatively reproduce the data, while an 704 alternative implementation of biased learning ("action priming model") failed to do so (see S05). 705 fMRI data acquisition 706 fMRI data were collected on a 3T Siemens Magnetom Prisma fit MRI scanner with a 64-channel 707 head coil. During scanning, participants' heads were restricted using foam pillows and strips of adhesive 708 tape were applied to participants' forehead to provide active motion feedback and minimize head 709 movement 85 . After two localizer scans to position slices, we collected functional scans with a whole-710 brain T2*-weighted sequence (68 axial-oblique slices, TR = 1400 ms, TE = 32 ms, voxel size 2.0 mm 711 isotropic, interslice gap 0 mm, interleaved multiband slice acquisition with acceleration factor 4, FOV 712 210 mm, flip angle 75°, A/ P phase encoding direction). The first seven volumes of each run were 713 automatically discarded. This sequence was chosen because of its balance between a short TR and 714 relatively high spatial resolution, which was required to disentangle cue and outcome-related neural 715 activity. Pilots using different sequences yielded that this sequence performed best in reducing signal 716 loss in striatum. 717 All fMRI pre-processing was performed in FSL 6.0.0. After cleaning images from non-brain 726 tissue (brain-extraction with BET), we performed motion correction (MC-FLIRT), spatial smoothing 727 (FWHM 3 mm), and used fieldmaps for B0 unwarping and distortion correction in orbitofrontal areas. 728

Model validation
We used ICA-AROMA 86 to automatically detect and reject independent components associated with 729 head motion. Finally, images were high-pass filtered at 100 s and pre-whitened. After the first-level 730 GLM analyses, we computed and applied co-registration of EPI images to high-resolution images (linearly with FLIRT using boundary-based registration) and to MNI152 2mm isotropic standard space 732 (non-linearly with FNIRT using 12 DOF and 10 mm warp resolution). 733

ROI selection 734
For fMRI-informed EEG analyses, we first created a functional mask as the conjunction of the 735 PESTD and PEDIF contrasts by thresholding both z-maps at z > 3.1, binarizing, and multiplying them (see 736 In the model-based GLM, we used two model-based regressors that reflected the trial-by-trial 756 prediction error (PE) update term. The update term was computed by multiplying the prediction-error 757 with the condition-specific learning rate. As described above, in the winning model M5, the learning 758 bias term κ leads to altered learning from "congruent" action-outcome pairs, with faster learning of Go 759 actions followed by rewards, but slower unlearning of NoGo actions followed by punishments. To 760 compute trial-by-trial updates, we extracted the group-level parameters of the best fitting computational 761 model M5 (asymmetric pathways model) and used those parameters to compute the prediction error on 762 every trial for every participant. Using the same parameter for each participant is warranted when testing 763 for the same qualitative learning pattern across participants 87 . Given that both standard (base model M1) 764 and biased (winning model M5) PEs were highly correlated (mean correlation of 0.921 across 765 participants, range 0.884-0.952), it appeared difficult to distinguish standard learning from biased 766 learning. As a remedy, we decomposed the biased PE into the standard PE plus a difference term as 767 = + 22,36 . Any region displaying truly biased learning should significantly encode 768 both the standard PE term and the difference term. The standard PE and difference term were much less 769 correlated (mean correlation of -0.020, range -0.326-0.237). To control for cue-related activation, we 770 furthermore added four regressors spanned by crossing cue valence and performed action (Go response 771 to Win cue, Go response to Avoid cue, NoGo response to Win cue, NoGo response to Avoid cue). 772 The model-free GLM included a separate regressor for each of the eight conditions obtained 773 when crossing performed action (Go/ NoGo) and obtained outcome (reward/ no reward/ no punishment/ 774 punishment). We fitted four contrasts: 1) one contrast comparing conditions with positive (reward/ no 775 punishment) and negative (no reward/ punishment) outcomes, used as a quality check to identify regions 776 that encoded outcome valence; 2) one contrast comparing Go vs. NoGo responses at the time of the 777 outcome; 3) one contrast summing of left-and right-hand responses, reflecting Go vs. NoGo responses 778 at the time of the response; and 4) one contrast subtracting right-from left-handed responses, reflecting 779 lateralized motor activation. As this GLM resulted in empty regressors for several participants when 780 fitted on a block level, making it impossible to use the data of the respective blocks on a higher level, 781 we instead concatenated blocks and performed a single GLM per participant. We therefore registered 782 the data from all blocks to the middle image of the first block (default reference volume in FSL) using 783 MCFLIRT. The first and last 20 seconds of each block did not feature any task-related events, such that 784 carry-over effects of task events in the design matrix from one block to another were not possible. 785 In both GLMs, we added four regressors of no interest: one for the motor response (left = +1, 786 right = -1, NoGo = 0), one for error trials, one for outcome onset, and one for trials with invalid motor 787 response (and no outcome respectively). We also added nine or more nuisance regressors: the six 788 realignment parameters from motion correction, mean cerebrospinal fluid (CSF) signal, mean out-of-789 brain (OBO) signal, and a separate spike regressor for each volume with a relative displacement of more 790 than 2 mm (occurred in 10 participants; in those participants: M = 7.40, range 1-29). For the model-free 791 GLM, nuisance regressors were added separately for each block as well as an overall intercept per block. 792 We convolved task regressors with double-gamma haemodynamic response function (HRF) and high-793 pass filtered the design matrix at 100 s. 794 First-level contrasts were fit in native space. Afterwards, co-registration and reslicing was 795 applied to participants' contrast maps, which were then combined on a (participant and) group level 796 using FSL's mixed effects models tool FLAME with a cluster-forming threshold of z > 3.1 and cluster-797 level error control at α < .05 (i.e., two one-sided tests with α < .025). 798 EEG data acquisition 799 We recorded EEG data with 64 channels (BrainCap-MR-3-0 64Ch-Standard; Easycap GmbH; 800 Herrsching, Germany; international 10-20 layout, reference electrode at FCz) plus channels for 801 electrocardiogram, heart rate, and respiration (used for MR artifact correction) at a sampling rate of 1000 802 Hz. We placed MRI-compatible EEG amplifiers (BrainAmp MR plus; Brain Products GmbH, Gilching, 803 Germany) behind the MR scanner and attached cables to the participants once they were located in final 804 position in the scanner. Furthermore, we fixated cables using sand-filled pillows to reduce artifacts 805 induced through cable movement in the magnetic field. During functional scans, the MR helium pump 806 was switched off to reduce EEG artifacts. After the scanning, we recorded the exact EEG electrode 807 locations on participants' heads relative to three fiducial points using a Polhemus FASTRAK device. 808 For four participants, no such data were available due to time constraints/ technical errors, in which case 809 we used the average electrode locations of the remaining 32 participants. 810 EEG pre-processing 811 First, raw EEG data were cleaned from MR scanner and cardioballistic artifacts using 812 BrainVisionAnalyzer 88 . The rest of the pre-processing was performed in Fieldtrip 89 . After rejecting 813 channels with high residual MR noise (mean 4.8 channels per participant, range 1-13), we epoched trials 814 into time windows of -1,400-2,000 ms relative to the onset of outcomes. Timing of this epochs was 815 determined by the minimal inter-stimulus interval beforehand until the minimal inter-trial interval 816 afterwards. Data was re-referenced to the grand average, which allowed us to recover the reference as 817 channel FCz, and then band-pass filtered using a two-pass 4th order Butterworth IIR filter (Fieldtrip action). We also tested for differences between Go and NoGo responses in the lower alpha band (6-10 859 Hz). For all contrasts, we employed two-sided cluster-based permutation tests in a window from 0-860 1,000 ms relative to outcome onset. For beta power, results were driven by a cluster that was at the edge 861 of 1,000 ms; to more accurately report the time span during which this cluster exceeded the threshold, 862 we extended the time window to 1,300 ms in this particular analysis. Such tests are able to reject the 863 null hypothesis of exchangeability of two experimental conditions, but they are not suited to precisely 864 locate clusters in time-frequency space. Hence, interpretations were mostly based on the visual 865 inspection of plots of the signal time courses. 866 For model-based analyses, similar to fMRI analyses, we used the group-level parameters from 867 the best fitting computational model M5 to compute the trial-by-trial biased PE term and decomposed 868 it into the standard PE term and the difference to the biased PE term. We used both terms as predictors 869 in a multiple linear regression for each channel-time-frequency bin for each participant, and then 870 performed one-sample cluster-based permutation-tests across the resultant b-maps of all participants 90 . 871 For further details on this procedure, see fMRI-inspired EEG analyses. 872

fMRI-informed EEG analyses 873
The BOLD signal is sluggish. It is thus hard to determine when different brain regions become 874 active. In contrast, EEG provides much higher temporal resolution. A fruitful approach can be to identify 875 distinct EEG correlates of the BOLD signal in different regions, allowing to test hypotheses about the 876 temporal order in which regions might become active and modulated EEG power 17,72 . Furthermore, by 877 using the BOLD signal from different regions in a multiple linear regression, one can control for 878 variance shared among regions (e.g., changes in global signal; variance due to task regressors) and test 879 which region is the best unique predictor of a certain EEG signal. In such an analysis, any correlation 880 between EEG and BOLD signal from a certain region reflects an association above and beyond those 881 induced by task conditions. 882 We used the trial-by-trial BOLD signal in selected regions in a multiple linear regression to predict 883 EEG signal over the scalp 17,72 (building on existing code from https://github.com/tuhauser/TAfT; see 884 S13 for a graphical illustration). As a first step, we extracted the volume-by-volume signal (first 885 eigenvariate) from each of the seven regions identified to encode biased PEs (conjunction of PESTD and 886 PEDIF: striatum, dACC, pgACC, left motor cortex, PCC, left ITG, and primary visual cortex). We applied 887 a highpass-filter at 128 s and regressed out nuisance regressors (6 realignment parameters, CSF, OOB, 888 single volumes with strong motion, same as in the fMRI GLM). We then upsampled the signal by a 889 factor 10, epoched it into trials of 8 s duration, and fitted a separate HRF (based on the SPM template) 890 to each trial (58 upsampled data points), resulting in trial-by-trial regression weights reflecting the 891 respective BOLD response. We then combined the regression weights of all trials and regions of a certain 892 participant into a design matrix with trials as rows and the seven ROIs as columns, which we then used 893 to predict power at each time-frequency-channel bin. As further control variables, we added the 894 behavioral PESTD and PEDIF regressors to the design matrix. All results were identical with and without 895 the inclusion of PEs as covariates in the regression, suggesting that EEG-fMRI correlations did not 896 merely arise from both modalities encoded PEs as a "common cause" that induced correlations. Instead, 897 these correlations reflected the incremental variance explained in EEG power that was afforded by the 898 BOLD signal even beyond the PEs. All predictors and outcomes were demeaned such that the intercept 899 became zero. Such a multiple linear regression was performed for each participant, resulting in a time- participants (results in t-map), thresholded these maps at |t| > 2, and finally computed the maximal cluster 909 mask statistic (sum of all t-values) for any cluster (adjacent voxels above threshold). Afterwards, we 910 computed the same t-map for the real data, identified the cluster with the biggest cluster-mass statistic, 911 and computed the corresponding p-value as number of permutations in the null distribution that were 912 larger than the maximal cluster mass statistic in the real data. 913

EEG-informed fMRI analyses 914
For the EEG-informed fMRI analyses, we fit three additional GLMs for which we entered the 915 trial-by-trial theta/ delta power (1-8 Hz), beta power (13-30 Hz), and lower alpha band power (6-10 916 Hz) as parametric regressors on top of the task regressors of the model-free GLM. These measures were 917 created by using the 3-D (time-frequency-channel) t-map obtained when contrasting positive vs. 918 negative outcomes (theta/ delta and beta; Fig. 4 A, B) and Go vs. NoGo conditions (lower alpha band) 919 as a linear filter ( Fig. 4; see S13 for a graphical illustration of this approach). Note that these signals 920 were selected based on the EEG-only results and not informed by the fMRI-informed EEG analyses. 921 We enforced strict frequency cut-offs. For lower alpha band and beta, we used midfrontal channels (Fz/ 922 FCz/ Cz). For theta/ delta power, given the topography that reached far beyond midfrontal channels and 923 over the entire frontal scalp, we used a much wider ROI (AF3/ AF4/ AF7/ AF8/ F1/ F2/ F3/ F4/ F5/ F6/ 924 F7/ F8/ FC1/ FC2/ FC3/ FC4/ FC5/ FC6/ FCz/ Fp1/ Fp2/ Fpz/ Fz). We extracted those maps and retained 925 all voxels with t > 2. These masks were applied to the trial-by-trial time-frequency data to create 926 weighted summary measures of the average power in the identified clusters in each trial. For trials for 927 which EEG data was rejected, we imputed the participant mean value of the respective action (Go/ 928 NoGo) x outcome (reward/ no reward/ no punishment/ punishment) condition. Note that this approach 929 accentuates differences between conditions, which were already captured by the task regressors in the 930

GLM, but decreases trial-by-trial variability within each condition, which is of interest in this analysis. 931
This imputation approach is thus conservative. While trial-by-trial beta and theta power were largely 932 uncorrelated, mean r = 0.104, range -0.118-0.283 across participants, and so were beta and alpha, mean Analyses of behavior as a function of BOLD signal and EEG power 937 We used mixed-effects logistic regression to analyze "stay behavior", i.e., whether participants 938 repeated an action on the next encounter of the same cue, as a function of BOLD signal and EEG power 939 in selected regions. For analyses featuring BOLD signal, we used the trial-by-trial HRF amplitude also 940 used for fMRI-informed EEG analyses. For analyses featuring EEG, we used the trial-by-trial EEG 941 power also used in the EEG-informed fMRI analyses.  Committee and the Radboud University security officer, potentially identifying data (such as imaging 1183 data) can only be shared to identifiable researchers. Hence, researchers requesting access to the data have to register and accept a data user agreement; access will then automatically be granted via a "click-1185 through" procedure (without involvement of authors or data stewards). included in EEG-fMRI analyses 35 We repeated the behavioral, fMRI, and EEG analyses reported in the main text while excluding 36 the seven participants that were also not included in the fMRI-inspired EEG analyses in the main text: 37 (a) two participants due to fMRI co-registration failure, which were also not included in the fMRI-only 38 analyses; (b) four further participants who exhibited excessive residual noise in their EEG data (> 33% 39 rejected trials) and were thus also not included in the EEG-only analyses, and finally (c) one more 40 participant who (together with four other participants already excluded) exhibited regression weights 41 for every regressor about ten times larger than for other participants.

78
Regarding fMRI findings, we first repeated the model-free GLM just contrasting positive and 79 negative outcomes. BOLD signal was higher for positive than negative outcomes in five clusters, namely 80 in vmPFC, striatum, amygdala, and hippocampus (zmax = 5.  When computing the conjunction between both (positive) contrasts, BOLD signal encoded both 131 the standard and the difference in four clusters, namely in vmPFC, bilateral striatum, bilateral ITG, and 132 V1. Clusters in ACC, left motor cortex, and PCC were not significant any more (because they were z > 133 3.1, but not significant after cluster correction in the standard PE contrast). However, new (though rather 134 small) clusters of biased PE encoding emerged in right insula, left amygdala, and left OFC. In sum, 135 results when analyzing only this subgroup of only 29 participants were largely similar to results based 136 on the full sample; however, clusters of biased PE encoding in left motor cortex, ACC, and PCC were 137 small and thus did not survive cluster correction in this subgroup. 138 Figure S01B. BOLD signal reflecting outcome processing in the subgroup of 29 participants included in the fMRI-inspired EEG analyses. A. BOLD signal was higher for positive outcomes (rewards, no punishments) compared with negative outcomes (no rewards, punishments) in a range of regions including bilateral ventral striatum and vmPFC. BOLD effects displayed using a dual-coding data visualization approach with color indicating the parameter estimates and opacity the associated z-statistics. Significant clusters are surrounded by black edges. Bar plots show parameter estimates per action x outcome condition (±SEM across participants) B. When using the trial-by-trial PEs participants experienced as model-based regressors in our GLM, positive PE correlations occurred in several regions including importantly the ventral striatum, vmPFC, dACC, and PCC. C. Left panel: Regions encoding both the standard PE term and the difference term to biased PEs (conjunction) at different cluster-forming thresholds (color). Clusters significant at a threshold of z > 3.1 are surrounded by black edges. In bilateral striatum, pgACC, bilateral ITG, and primary visual cortex, BOLD was significantly better explained by biased learning than by standard learning. Clusters in dACC, left motor cortex, and PCC were not significant any more.

139
Regarding EEG findings in this subgroup, both midfrontal theta and beta power reflected 140 outcome valence: Theta power was higher for negative than positive outcomes (driven by a cluster 141 around 225-500 ms, p = .002), while beta power was higher for positive than negative outcomes (driven 142 by a cluster around 325-1000 ms, p = .002). When using PE terms as regressor for midfrontal EEG 143 power while controlling for PE valence, delta power did not encode positively, though not 144 significant (p = .056), and also the positive encoding of was non-significant (p = .053). The 145 positive correlation of beta power with was not significant anymore (p = .059), while the negative 146 correlation with remained (p = .001, 450-950 ms). When adding and together to 147 achieve , theta/delta power indeed significantly encoded , first positively (p = .032, 224-148 475 ms) and then negatively (p = .019, 600 -1,000 ms; around 8 Hz and thus rather in the alpha band). 149 Also, beta power was significantly negatively correlated with (p = .008, 450 -975 ms). 150 In sum, all findings reported in the main text also held when analyzing only this subgroup of 151 only 29 participants. In addition, also late beta power and theta/alpha power appeared to negatively 152 encode the term. 153 Figure S01C. EEG time-frequency power midfrontal electrodes (Fz/ FCz/ Cz) reflecting outcomes processing in the subgroup of 29 participants included in the fMRI-inspired EEG analyses. A. Time-frequency plot (logarithmic y-axis) displaying high theta (4-8 Hz) power for negative outcomes and higher beta power (16-32 Hz) for positive outcomes. B. Theta power transiently increases for any outcome, but more so for negative outcomes (especially punishments) around 225-475 ms after feedback onset. C. Beta was higher for positive than negative outcomes (especially punishments) over a long time period around 300-1,250 ms after feedback onset. D-F. Correlations between midfrontal EEG power and trialby-trial PEs. Solid black lines indicate clusters above threshold. Biased PEs were significantly positively correlated with midfrontal theta power, but also negatively correlated with later alpha and beta power (D). The correlations of theta with the standard PEs (E) and the difference term to biased PEs (F) were also positive, though not significant. Beta power only encoded the difference term to biased PEs (F). ** p < 0.01.** p < 0.01.

154
Regarding fMRI correlates of the past action, similar to the original analysis comprising 34 155 participants, there were no clusters with higher BOLD after Go than NoGo actions at the time of 156 outcomes, but vice versa, large parts of cortex and subcortex showed higher BOLD after NoGo than Go 157 actions, highly similar to the original analysis (zmax = 7.65, p = 0, 124629 voxels, xyz = [-58 18 22]). 158 Furthermore, there were four clusters with higher BOLD for Go than NoGo actions at the time 159 of the response, namely one large cluster across lateral prefrontal cortex, anterior cingulate cortex, 160 striatum, thalamus, angular gyrus, cerebellum, left operculum and motor cortex, intracalcarine cortex, Regarding EEG time-frequency correlates of the past action, when testing for differences in 183 broadband after outcome onset, there was no significant difference after Go and NoGo responses, p = 184 .283. When restricting analyses to the low alpha range, the permutation test was marginally significant, 185 p = .056, driven by a cluster around 0-100 ms around 7-10 Hz). When repeating the permutation test 186 for the broadband signal including the last second before outcome onset, there was a significant 187 difference after Go and NoGo responses, driven by clusters in the beta band. p = 0.002, -1000 --275 188 ms, 13-32 Hz, and in the theta/ low alpha band, p = 0.020, -1000 --525 ms, 4-10 Hz. 189 Figure S01D. Exploratory follow-up analyses on dACC BOLD signal and midfrontal low-alpha power in the subgroup of 29 participants included in the fMRI-inspired EEG analyses. A. Midfrontal time-frequency response-locked (left panel) and outcome-locked (right panel). Before and shortly after outcome onset, power in the lower alpha band was higher on trials with Go actions than on trials with NoGo actions. The shape of this difference resembles the shape of dACC BOLD-EEG TF correlations (small plot; note that this plot depicts BOLD-EEG correlations, which were negative). Note that differences between Go and NoGo trials occurred already before outcome onset in the alpha and beta range, reminiscent of delay activity; but were not fully sustained since the actual response. B. Midfrontal power in the lower alpha band per action x outcome condition. Lower alpha band power was consistently higher on trials with Go actions than on trials with NoGo actions, starting already before outcome onset. C. BOLD signal differences between Go and NoGo actions (left panel) and left vs. right hand responses (right panel) at the time or responses. Response-locked dACC BOLD was significantly higher for Go than NoGo actions. D. BOLD signal differences between Go and NoGo actions at the time of outcomes. Outcomelocked dACC BOLD signal (and BOLD signal in other parts of cortex) was significantly lower on trials with Go than on trials with NoGo actions.

197
When linking trial-by-trial midfrontal EEG TF power to response repetition on the next trial 198 with the same cue, participants in this subgroup were more likely to repeat the same response when beta 199 power was high, b = 0.124, SE = 0.036, χ 2 (1) = 3.502, p < .001, or when low alpha power was high, b = 200 0.135, SE = 0.044, χ 2 (1) = 8.789, p = .003, but more likely to switch to another response when theta 201 power was high, b = -0.090, SE = 0.040, χ 2 (1) = 4.812, p = .028 .  202  203  204  205  206  207  208  209  210  211  212  213  214  215  216  217  218  219  220  221  222  223  224  225  226  227  228  229  230  231  232  233  234  235  236  237  238  239  240 Table S02. Full report of model of stay behavior. Mixed-effects logistic regression of stay vs. switch behavior (i.e., repeating vs. changing an action on the next occurrence of the same cue) as a function of performed action (Go vs. NoGo), outcome salience (salient: reward or punishment vs. neutral: no reward or no punishment), and outcome valence (positive: reward or no punishment vs. negative: no reward or punishment). Follow-up analyses were performed on trials with salient vs. neutral outcomes separately, and then separately based on Go vs. NoGo actions and salient vs. neutral outcomes. P-values were computed using likelihood ratio tests using the mixed-function (option "LRT") from package afex.  266  267  268  269  270  271  272  273  274  275  276  277  278  279  280  281  282  283  284  285  286  287  288  289  290  291 S04: Parameter recovery analyses for model M5 292 We performed parameter recovery analyses to assess the identifiability of the model parameters 293 in the winning "asymmetric pathways" model M5. We simulated 100 new data sets based on the best 294 fitting parameters of each participant, fitted a separate model to each simulated data set (using first 295 Laplace approximation and then hierarchical Bayesian inference), and finally averaged parameters 296 across the 100 fitted models. 297 Parameter recovery was excellent for the feedback sensitivity ρ (r = .91), the baseline learning 298 rate ε0 (r = .98), the Go bias b (r > .99), and the Pavlovian response bias π (r > .99), with between-299 participant differences in ground-truth parameters correlating at high levels (all r > .90) with between-300 participant differences in the recovered parameters. Note that, due to shrinkage to the mean as a 301 consequence of hierarchical Bayesian inference, extreme parameter values tended to be shrunk to the 302 overall group-level mean in the recovered parameters. Correlations for the learning bias parameter κ 303 were considerably lower, though still strongly positive (r = 0.50; r = 0.51 when removing one outlier 304 participant). Note however that the effect of κ on learning depended on participants' baseline learning 305 rate ε0. When computing increased learning rates for rewarded Go actions and decreased learning rates 306 for punished NoGo actions-the parameters that determine the effective degree of trial-by-trial 307 learning-these learning rates were again highly correlated with the ground truth parameters 308 ( : r = 0.96; ℎ : r = 0.85 resp. r = 0.86 when removing one outlier participant). 309 Further parameter recovery analyses on the models explored in S06 yielded that the recovery of 310 κ was improved (r = 0.78) when adding perseveration parameters (which themselves had recovery 311 performances of r's > 0.99). This observation suggested that models featuring such perseveration 312 parameters might be better suited for quantifying individual differences in the learning bias. 313 In sum, parameter recovery was excellent for all parameters but the learning bias κ. More 314 relevant than recovery of κ, however, was that we could recover the effective learning rate well 315 (combining baseline learning rate ε0 and the learning bias κ). However, when combining the baseline 316 learning rate ε0 and the learning bias κ, recovery was high, as well. Note that the ability to accurately 317 capture individual differences in biased learning is not of interest in this study, nor relevant to the 318 imaging analyses. In fact, we used a single set of parameters (the group-level parameters) to compute 319 trial-by-trial regressors for the EEG and fMRI analyses. This is a standard approach in model-based 320 fMRI for two main reasons. First, it has been shown that the exact parameter values for relatively simple 321 RL models like the ones used here have little impact on the results of fMRI analyses 1 . For the current 322 study, of most relevance is the qualitatively differential pattern of learning updates after Go and NoGo 323 responses 2-4 , as embodied by the algorithmic specification of the model. This pattern drives the EEG 324 and fMRI results and indeed, using a different set of parameter values, we obtain essentially identical 325 fMRI results (see S06).  by rewards, lead to long-term potentiation in the striatal direct "Go" pathway (and long term depression 362 in the indirect pathway), allowing for a particularly effective acquisition of Go actions to obtain rewards. 363 Conversely, negative PEs, elicited by punishments, lead to long term potentiation in the NoGo pathway, 364 impairing the unlearning of NoGo actions in face of punishments. 365 An alternative account has recently suggested that self-generated (Go) actions lead to 366 preferential learning (relative to non-self-generated actions, including inaction), more generally 367 (henceforth called "action priming model") 7 . A self-generated action could "prime" basal ganglia 368 circuits and lead to subsequently larger PEs and thus faster learning. The main differential prediction 369 between these two models is how they account for the failure to learn "Go" actions to avoid punishment: 370 In the first model, this is due to a failure to unlearn punished "NoGo" actions, while in the second model, 371 this is due to increased unlearning of punished "Go" actions. 372 Here, we directly tested both models against each other. We specified an alternative model M6 373 7 with two separate learning rates, one learning rate for trials where self-generated (Go) action selection 374 should prime the processing of any following salient outcome (i.e., Go actions followed by rewards/ 375 punishments), and one learning rate for any other action-outcome combination. In this model, equation 376 (6) was substituted by equation (7): When comparing all models M1-M6 using Bayesian model selection, M5 (the asymmetric pathways 381 model) received highest support (model frequency: 68.15%; protected exceedance probability: 99.70%), 382 also compared to M6 (the action priming model; model frequency: 24.19%; protected exceedance 383 probability: 0.30%). In fact, as visible in Fig. S04E-H, the action priming did not reproduce the 384 motivational biases in learning curves and bar plots, which constitutes a case of qualitative model 385 falsification 2,3 . If anything, it seemed that the action priming model traded off both biases, leading to 386 negative response biases for a majority of participants. In contrast, the asymmetric pathways model (M5) 387 was well able to capture the qualitative patterns observed in the data (Fig. S04A-D). We conclude that 388 only the asymmetric pathways model is able to qualitatively reproduce core characteristics of our data. While the winning model M5 captured learning curves and the proportion of (correct/ incorrect) 433 Go and NoGo responses well, it did not fully capture the propensity to stay (i.e., repeat the same response 434 to the subsequent presentation of the same cue) following different action-outcome combinations (see 435 Fig. 2G). Specifically, M5 underestimated the overall propensity to stay and predicted a higher 436 probability of repeating a Go response after a positive (neutral) outcome for Avoid cues, relative to the 437 negative (neutral) outcome for Win cues. In contrast, in the data, there was no such significant 438 difference. We thus explored three extensions of M5 that had the potential to capture this behavioral 439 pattern. Specifically, we considered mechanisms that would make the model more likely to repeat a 440 given response. Furthermore, any such mechanism should boost repetition of Go responses to non-441 rewarded Win cues particular. We hypothesized that two potential mechanisms could account for these 442 data features, and present three new models to test these mechanisms.

444
As a first mechanism, we considered overall "response stickiness" or "perseveration" 8 , a 445 process that leads participants to repeat a previous response independent of the obtained outcome. This 446 mechanism could explain participants' overall higher propensity to stay, which we tested in model M7.

447
Model M7, called "single perseveration model", featured the same parameters as M5 plus a 448 perseveration parameter that was added as a "bonus" to the action weight ( , ) of the specific 449 action shown on the last occurrence of the respective cue 8 : 450 In M7 equation 7 in the main manuscript was replaced by equation 8 above, such that parameter 453 captured the propensity to repeat the action from the last time this cue was presented.

455
However, to account for the fact that staying was not different and numerically even higher for 456 a non-rewarded Go response (to a Win cue), relative to a non-punished Go response to an Avoid cue, 457 we tested whether separate perseveration parameters for Win and Avoid cues could capture this 458 behavioral difference (M8), as such a pattern of results could result from an overall higher propensity to 459 stay for Win cues. This "cue valence-dependent perseveration model" (M8), contained two separate 460 perseveration parameters, one for Win cues , and one for Avoid cues . The respective 461 perseveration parameter was added to the action weight ( , ) of the specific action shown on the 462 last occurrence of respective the cue: 463 In M8, equation 7 in the main manuscript was replaced by equation 9 above, such that parameter 466 and captured the propensity to repeat the action from the last time this cue was presented, 467 separately for Win and Avoid cues.

469
As an alternative mechanism that could potentially capture the p(stay) pattern in the data, we 470 considered the possibility that participants might "re-interpret" neutral outcomes in line with the cue 471 valence: although a non-reward after a Win cue constitutes negative feedback, the positive cue valence 472 might "overshadow" this feedback and give participants the impression that they received a reward. 473 Similarly, a non-punishment after an Avoid cue constitutes positive feedback, but the negative cue 474 valence might overshadow this feedback and give participants the impression that they received a 475 punishment. 476 477 Following this idea, lastly, we considered M9, called the "neutral outcome reinterpretation 478 model", which featured a single perseveration parameter φ as in equation (8), but in addition replaced 479 neutral outcomes (coded as zero) with what we term the "effective reward" , which allows the 480 neutral outcome to take on a value in the direction of the cue valence ( ). The degree to which this 481 happens is scaled by the parameter η: 482 In sum, the three additional models provided a better quantitative fit to the data compared to the 497 winning model M5 reported in the main text. Also, these additional models predicted the propensity 498 more accurately than the base models did. However, their qualitative fit (i.e. the ability to capture 499 relevant aspects of the data) was worse: These additional models systematically underestimated the 500 proportion of incorrect Go responses. Furthermore, although the predicted patterns of the propensity to 501 stay matched the data more closely than M5, these predicted patterns still mis-matched some aspects of 502 the data, particularly now over-estimating the tendency to stay following a punishment. Taken together, 503 these models could capture certain qualitative patterns in the data, but not others, which is a core feature 504 of computational modelling, which by definition constitutes a data reduction procedure that necessarily 505 loses some details of the data. In terms of qualitative model validation/ falsification 2,3 , M5 and M8/M9 506 capture different qualitative features of the data, but no model captured all features well. 507 508 Figure S06A. Model comparison and validation of the single perseveration (M7), dual perseveration (M8) and cue valence-based outcome reinterpretation models. First row. Trial-by-trial proportion of Go responses (±SEM across participants) for Go cues (solid lines) and NoGo cues (dashed lines). Second row. Mean (±SEM across participants) proportion Go responses per cue condition (points are individual participants' means). Third row. Probability to repeat a response ("stay") on the next encounter of the same cue as a function of action and outcome. Fourth row. Log-model evidence, model frequency, and protected exceedance probability all favored the dual perseveration model (M8) over the other models. In sum, the additional models M7-9 provided a better quantitative fit to the data compared to the asymmetric pathways model M5 reported in the main text. They also predicted the propensity of staying overall more accurately than M5. However, these additional models all overestimated the proportion of incorrect Go responses. Furthermore, although the predicted patterns of the propensity of staying mimicked the data more closely than M5, these predicted patterns still mismatched some aspects of the empirical data. Taken together, these models could capture certain qualitative patterns in the data, but not others, which was expectable given the data reduction that comes with fitting a learning model with few parameters only.

510
To confirm that neural correlates of biased prediction-error updating were not altered under 511 these alternative model specifications, we repeated the model-based fMRI analyses for both the cue 512 valence-dependent perseveration model M8 and the neutral outcomes interpretation model M9. In 513 summary, the results are effectively unchanged, as we present in more detail below. 514 515 Notably, M8 does not make different predictions about trial-by-trial learning updates; the only 516 difference to M5 consisted in slightly different best fitting parameter estimates for ε and κ (leading a 517 slightly different BOLD regressors. Neural correlates of learning typically reflect the qualitative learning 518 pattern, which is the same for M5 and M8, but are hardly sensitive to the exact parameter values 1 . 519 Indeed, when repeating the fMRI analyses with those different parameter values, we found almost 520 identical results, with significant encoding of both PESTD and PEDIF in striatum, dACC, pgACC, PCC, 521 left motor cortex, left ITG, and V1 (see Fig. S06B panel A). The only exception was the cluster in dACC, 522 which under M8 was not significant at a whole-brain level, but significant when using small-volume 523 correction with an anatomical ACC mask (from the Harvard-Oxford Atlas), warranted by our a-priori 524 hypotheses based on previous literature 9 . 525 When we repeated our fMRI analyses with learning updates predicted by M9, we again found 526 significant encoding of both PESTD and PEDIF in striatum, dACC, pgACC, PCC, left motor cortex, left 527 ITG, and V1 (see Fig. S06B panel B). However, the pgACC cluster was much larger and extended into 528 the vmPFC. Similarly, the PCC cluster was much larger. In addition, BOLD signal in left inferior frontal 529 gyrus and in multiple clusters in superior and inferior lateral occipital cortex encoded both PESTD and 530 PEDIF significantly. Using trial-by-trial BOLD signal from the extended vmPFC and PCC clusters 531 identified with M9 regressors to predict midfrontal EEG power, we obtained results that were highly 532 similar to the results for the pgACC and PCC clusters identified with M5 regressors.

533
In sum, model-based fMRI analyses based on PEs derived from M8 and M9 replicated the 534 findings based on M5 reported in the main text. In addition, M9 led to larger clusters in vmPFC and 535 PCC, tentatively suggesting that these regions might potentially contribute to "reinterpreting" neutral 536 outcomes in light of the previously presented cue valence (see also Fig. 2 in the main text). 537 538 Figure S06B. BOLD correlates of biased prediction errors as predicted by the asymmetric pathways model (M5), the cue valencedependent perseveration model (M8) and the neutral outcomes reinterpretation model (M9). (A) Regions encoding both the standard PE term and the difference term to biased PEs (conjunction) as predicted from the asymmetric pathways model (M5) at different clusterforming thresholds (1 < z < 5, color coding; opacity constant; replotted from Fig. 3C main text). Clusters significant at a threshold of z > 3.1 are surrounded by black edges. This is a version of Fig. 3C reprinted with a color scheme consistent with the other two panels. (B) Regions encoding both the standard PE term and the difference term to biased PEs (conjunction) as predicted from the cue valencedependent perseveration model (M8) at different cluster-forming thresholds (1 < z < 5, color coding; opacity constant). Clusters significant at a threshold of z > 3.1 are surrounded by black edges. In line with correlates of biased PEs as predicted by M5, BOLD signal in bilateral striatum, dACC (small-volume corrected), pgACC, PCC, left motor cortex, left inferior temporal gyrus, and primary visual cortex was significantly better explained by biased learning than by standard learning. This finding was not surprising given that adding perseveration to the model did not change the learning mechanism, but only led to slightly different best fitting parameter values. (B) Regions encoding both the standard PE term and the difference term to biased PEs (conjunction) as predicted from the neutral outcomes reinterpretation model (M9). In addition to the regions in which BOLD signal was significantly better explained by biased than standard PEs as derived from M5 and M8, biased PEs derived from M9 also explained BOLD signal in vmPFC (larger cluster than M5), PCC (larger cluster than M5), left inferior frontal gyrus and multiple clusters in superior and inferior lateral occipital cortex significantly better than standard PEs. These results tentatively suggested that vmPFC, PCC, and these other occipital regions might implement an additional mechanism besides biased learning which encodes the cue valence also at the time of the outcome, biasing the processing of neutral outcomes. Anatomical masks were based on the Harvard-Oxford Atlas. Functional contrasts involve outcome valence and conjunction of PESTD and PEDIF. A. vmPFC outcome valence contrast (dark blue, conjunction of frontal pole, frontal medial cortex, and paracingulate gyrus). B. striatum outcome valence contrast (yellow, conjunction of bilateral nucleus accumbens, caudate, and putamen). C. vmPFC PESTD ∩ PEDIF contrast (dark blue, results in a cluster in pgACC). D. striatum PESTD ∩ PEDIF contrast (yellow). All anatomical masks were extracted from the probabilistic Harvard-Oxford Atlas, thresholded at 10%. Note that images are in radiological orientation (i.e., left brain hemisphere presented on the right and vice versa).

Figure S07B. Conjunctions of anatomical masks with functional contrasts from fMRI GLM analyses used for fMRI-informed EEG analyses:
A. AAC PESTD ∩ PEDIF contrast (red, cingulate gyrus, anterior division, resulting in a cluster in dACC; B. PCC PESTD ∩ PEDIF contrast (light blue, cingulate gyrus, posterior division); C. Left motor cortex PESTD ∩ PEDIF contrast (orange, conjunction of precentral and postcentral gyrus). D. Left inferior temporal gyrus PESTD ∩ PEDIF contrast (turquoise, conjunction of inferior temporal gyrus, posterior division, and inferior temporal gyrus, temporooccipital part). E. Primary visual cortex PESTD ∩ PEDIF contrast (green, conjunction of lingual gyrus, occipital fusiform gyrus, occipital pole). All anatomical masks were extracted from the probabilistic Harvard-Oxford Atlas, thresholded at 10%. Note that images are in radiological orientation (i.e., left brain hemisphere presented on the right and vice versa). Model-based GLM with PESTD and PEDIF regressor: 558 • WinGoOnset: for every trial with Win cue and Go action, at cue onset, duration 1, value +1 559 • AvoidGoOnset: for every trial with Avoid cue and Go action, at cue onset, duration 1, value +1 560 • WinNoGoOnset: for every trial with Win cue and NoGo action, at cue onset, duration 1, value 561 +1 562 • AvoidNoGoOnset: for every trial with Avoid cue and NoGo action, at cue onset, duration 1, 563 value +1 564 • Handedness: for every trial, at cue onset, duration 1, value +1 for left hand response, 0 for NoGo 565 10 response, -1 for right hand response 11 566 • Error: for every trial, at cue onset, duration 1, value +1 for incorrect response, 0 for correct 567 response 568 • OutcomeOnset: for every trial, at outcome onset, duration 1, value +1 for every trial 569 • Finally, we tested whether after ERP subtraction, low alpha (and beta power) still encoded the 751 previously performed action. When testing for differences in broadband power after Go and NoGo 752 responses, power was indeed significantly different between conditions, driven by clusters in beta band 753 (p = 0.002, 0.125 -625 ms; p = 0.052, 700 -1000 ms, 23 -29 Hz) and theta/ low alpha band (p = 0.024, 754 575 -1000 ms, 5-9 Hz; p = 0.056, 0 -225 ms, 6-11 Hz). For power before outcome onset, there were 755 again broadband differences between Go and NoGo (p = 0.002, -1000 -+225 ms, 1 -33 Hz), but note 756 that there was no ERP subtracted before outcome onset. We thus conclude that the differences between 757 Go and NoGo responses were attributable to differences in induced rather than evoked activity. 758 759 Figure S10. EEG time-frequency power over midfrontal electrodes (Fz/ FCz/ Cz) after the (action x outcome) condition-wise ERPs has been removed. A. Time-frequency plot (logarithmic y-axis) displaying high theta (4-8 Hz) power for negative outcomes and higher beta power (16-32 Hz) for positive outcomes. B. Theta power transiently increases for any outcome, but more so for negative outcomes (especially punishments) around 225-475 ms after feedback onset. C. Beta was higher for positive than negative outcomes (especially punishments) over a long time period around 300-1,250 ms after feedback onset. D-F. Correlations between midfrontal EEG power and trial-by-trial PEs. Solid black lines indicate clusters above threshold. There still was a visible positive correlation between biased PEs and midfrontal delta power, but this correlation was not significant (D). The correlation of delta with the standard PEs (E) was also positive, though not significant; in fact, at a later time point around stimulus offset, delta power correlated significantly negatively with standard PEs. The difference term to biased PEs (F) also correlated positively, though not significantly with delta power. Beta power encoded the difference term and biased PEs themselves (F). ** p < 0.01. NoGo responses. If anything, differences after Go and NoGo responses were maximal over right 822 occipital electrodes, with a larger P3 after Go than after NoGo responses. Signal at these channels also 823 differed between positive and negative outcomes, most notably with a bigger P1 after positive than 824 negative outcomes. In sum, we replicate classical reward learning ERP effects, which shows that the 825 motivational Go/NoGo learning task taps into reward learning processes reported before, but these 826 processes appeared to be unaffected by the previously performed action. 827 828 Figure S11A. ERPs reflecting outcome valence and performed action. A. Voltage (±SEM) over midfrontal electrodes (Fz/FCz/Cz) was lower for negative than positive outcomes around 246-294 ms (stronger N2, FRN) and higher for positive than negative outcomes around 344 -414 ms (stronger P3/ RewP). B. Over right occipital electrodes, the P3 was slightly bigger for positive than negative outcomes. ** p < 0.01. * p < .05 C. Topoplots of difference in voltage between trials with positive and negative outcomes over selected time windows. D. There was no difference in voltage over midfrontal electrodes between trials with Go and NoGo responses. E. Over right occipital electrodes, the P3 was slightly stronger after Go than NoGo actions (no p-value because ROI selected based on visual inspection). F. Topoplots of difference in voltage between trials with Go and NoGo actions over selected time windows.  838  839  840  841  842  843  844  845  846  847  848  849  850  851  852  853  854  855  856  857  858  859  860  861  862  863 Figure S13A. Graphical illustration of the fMRI-informed EEG analysis approach. A. Regions are identified to encode biased PEs via a model-based GLM on BOLD data (see Fig. 2). B. The volume-by-volume time-series of the signal in each ROI is extracted and upsampled. C. Time series are epoched into trials and the HRF amplitude is estimated for every trial. D. HRF amplitudes in every ROI for every trial are combined into a design matrix. E. The design matrix is applied in a multiple linear regression for each participant at each channel, frequency, and time point across trials. F. Regressions yield a sensor-frequency-time map of b regression weights for each ROI for each participant. Maps are combined across participants using a one-sample t-test. Figure S13B. Graphical illustration of the EEG-informed fMRI analysis approach. A. 3D clusters of channel-frequency-time points where power significantly distinguishes trials with positive from trials with negative outcomes are identified via a cluster-based permutation test (see Fig. 3A). The t-values above a threshold |2| are retained, weights at all other grid points are set to zero. B. The 3D t-value cluster is multiplied with the trial-by-trial channel-frequency-time data, yielding a single average value of power in the cluster at each trial. C. Trialby-trial average power in the cluster is added as a parametric regressor in the GLM on BOLD-data and fitted with FSL. Besides the results for striatum, ACC, and PCC reported in the main text, there were also 957 significant EEG correlates over midfrontal electrodes for trial-by-trial BOLD signal from left motor 958 cortex (p = .002, around 0-625 ms, 16-27 Hz; Fig. S11A). There were however no significant EEG 959 correlates over midfrontal electrodes for BOLD signal from pgACC (p = .174; Fig. S11B), left inferior 960 temporal gyrus (p = .097; Fig. S11C), and primary visual cortex (p = .170; Fig. S11D). 961 As quality checks, we checked whether visual cortex BOLD correlated negatively with alpha 962 over occipital electrodes 24,25 and whether motor cortex BOLD correlated negatively with beta power 963 over central electrodes 26,27 . Both was the case (see Fig. S11E and F), showing that our data was of 964 sufficient quality to detect these well-established associations. 965 Figure S14. Supplementary fMRI-informed EEG results in the time-frequency domain. Unique temporal contributions of BOLD signal in (A) left motor cortex, (B) pgACC, (C) left ITG and (D) primary visual cortex to midfrontal EEG power. Group-level t-maps display the modulation of the EEG power over midfrontal electrodes (Fz/ FCz/ Cz) by trial-by-trial BOLD signal in the selected ROIs. There significant correlations between midfrontal EEG TF power in the beta range and left motor cortex BOLD signal (p = .002), but no significant midfrontal EEG correlates for BOLD signal from other ROIs. E. Topoplots displaying t-values of left motor cortex BOLD over the entire scalp between 13 and 30 Hz (beta band) in steps of 100 ms from 0 to 800 ms. There were significant negatively correlates over central electrodes, especially round 300-500 ms. F. Topoplot displaying t-values of primary visual cortex BOLD over the entire scalp between 8 and 13 Hz (alpha band) in steps of 100 ms from 0 to 800 ms. There were significantly negatively correlations over occipital electrodes throughout outcome presentation.  Figure S15. fMRI-informed EEG analyses in the time-domain. Group-level t-value time courses display the modulation of the EEG voltage over midfrontal electrodes (Fz/ FCz/ Cz) by trial-by-trial BOLD signal in the selected ROIs. A. Correlations between midfrontal voltage and trial-by-trial BOLD signal from core value regions, i.e., striatum, dACC, pgACC, and PCC. Striatal BOLD modulates the amplitude of the N1 and P3, while the P3 amplitude was also modulated by pgACC BOLD. B. Correlations between midfrontal voltage and trial-by-trial BOLD signal from other regions, i.e., left motor cortex, left inferior temporal gyrus, and primary visual cortex. Visual cortex BOLD modulates the amplitude of the P3, as well. C-E. Midfrontal voltage split up for high vs. low BOLD signal (median split) from regions significantly modulating voltage. Striatal BOLD modulated N1 and P2 amplitude, while pgACC BOLD and visual cortex BOLD modulated N2 (FRN) amplitude. F-L. Topoplots displaying t-values of correlations between midfrontal voltage and trial-by-trial BOLD for all regions in steps of 100 ms from 0 to 800 ms.