Task-Level Value Affects Trial-Level Reward Processing

Despite disagreement about how anterior cingulate cortex (ACC) supports decision making, a recent hypothesis suggests that activity in this region is best understood in the context of a task or series of tasks. One important task-level variable is average reward, because it is both a known driver of effortful behaviour and an important determinant of the tasks in which we choose to engage. Here we asked how average task value affects reward-related ACC activity. To answer this question, we measured a reward-related signal said to be generated in ACC, called the reward positivity (RewP), while participants gambled in three tasks of differing average value. The RewP was reduced in the high-value task, an effect that was not explainable by either reward magnitude or outcome expectancy. This result suggests that ACC does not evaluate outcomes and cues in isolation, but in the context of the value of the current task.

The role of anterior cingulate cortex (ACC) in decision making has been hotly debated (Holroyd & Verguts, 2021). According to one view, ACC supports reward-guided behaviours like foraging by encoding the value of switching from one task to another (Kolling et al., 2016). Others argue that ACC selects from among competing tasks by computing the expected value of control, balancing effort and task value (EVC: Shenhav et al., 2013). Computationally, ACC has been proposed to follow the principles of hierarchical reinforcement learning, exerting control over lower-level systems when ongoing rewards are worse than the average task value (Holroyd & McClure, 2015).

In each of these approaches, ACC activity depends on task value. This claim has empirical support; neurons in monkey ACC, for example, encode both trial-by-trial reward and average task value (Amiez et al., 2006). But how does trial-level reward processing interact with task-level value processing in ACC? One possibility is via a reward prediction error (RPE), a concept borrowed from reinforcement learning (RL) that indicates whether ongoing events are better or worse than expected (Sutton & Barto, 2018). RPEs are proposed to be carried by the midbrain dopamine system to striatal and cortical targets (Schultz et al., 1997), including ACC (Amiez et al., 2005; Holroyd & Coles, 2002). According to the standard "model-free" RL approach, called temporal difference learning, the RPE elicited by reward delivery is computed by subtracting the expected value of an action from the just-received reward. The standard reinforcement learning view is that RPEs are "one-step" computations, comparing the value of the current state to the value of the previous state only. Under this view, dopaminergic RPEs depend only on action/reward values, not the average task value.

Alternatively, midbrain dopamine activity may index a more general RPE rather than just an immediate, "one-step" RPE. Indeed, there is evidence that midbrain RPEs are sensitive to task context. For example, monkey dopamine neurons appear to track patterns in reward history beyond what would be predicted by a standard RL algorithm (Nakahara et al., 2004). Reinforcement learning algorithms have therefore been augmented with internal models of the environment, resulting in "model-based" RL algorithms. Both model-free and model-based reinforcement learning computations appear to drive activity of midbrain dopamine neurons (Collins & Cockburn, 2020; Daw et al., 2011). There is therefore good reason to believe that midbrain RPE signals may be sensitive to task-level factors such as average task value.

In humans, midbrain RPE signals are thought to modulate ACC activity in a way that is measurable at the scalp. It has been proposed that a component of the event-related potential (ERP) called the reward positivity (RewP) varies as a function of RPE magnitude (Holroyd & Coles, 2002; Sambrook & Goslin, 2015). In other words, the RewP provides a convenient readout of the degree to which outcomes differ from expectations. Our goal here was to investigate whether and how reward-related ACC activity, as indexed by the RewP, varies with task value. Importantly, action value and task value are partially dissociable in reinforcement learning frameworks: high-value actions can occur in low-value tasks, and vice versa.
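To make the distinction concrete, the following is a minimal illustration (ours, not from the paper) contrasting the two prediction-error computations; all variable names and values are hypothetical:

```matlab
% One-step ("model-free") RPE versus an average-reward RPE.
% Values are hypothetical, chosen only for illustration.
r = 1;            % just-received reward (a win)
V_action = 0.8;   % learned expected value of the chosen action
rbar_task = 0.9;  % average reward of the current task

rpe_onestep = r - V_action;   % model-free RPE: depends only on action value (0.2)
rpe_average = r - rbar_task;  % average-reward RPE: depends on task value (0.1)
```

On the one-step view the two tasks are interchangeable; on the average-reward view the same win produces a smaller prediction error in a richer task.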
To answer this question, we manipulated task value and action value by varying the proportion of "low-value" and "high-value" actions in three probabilistic learning tasks. EEG was recorded from participants as they attempted to learn correct actions for six predictive cues. Some of the cues were "high-value", indicating that a correct response would likely yield a reward. Other cues were "low-value", indicating that reward and non-reward outcomes were equally likely regardless of the response. We did not vary the value of the reward itself, which can affect the amplitude of the RewP (Kreussel et al., 2012; Sambrook & Goslin, 2015). Rather, we varied the proportion of high- and low-value cues in each task: either all low-value, all high-value, or an even split. Our goal here was to vary reward at the task level while keeping trial-level rewards constant. Importantly, the same cue type (same reward expectancy and reward magnitude) appeared in multiple tasks, allowing us to isolate the effect of task value on feedback-locked signals.

In addition to the RewP, we also examined the cue-locked ERP. There is evidence that reward-predicting cues can elicit a RewP-like signal (Holroyd et al., 2011; Krigolson et al., 2014). The computational explanation for this is that RPEs propagate backward in time to the earliest indicator that things are better or worse than expected. In a task with mixed high- and low-value cues, we might therefore expect high-value cues to elicit a positive prediction error (a positive RewP deflection) relative to low-value cues. Conversely, we would not expect a cue-locked prediction error to be elicited in a task with uniform cue values, because the cues all make the same prediction about upcoming rewards within the task. To summarize, trial cues ought to elicit positive/negative prediction errors in the "mixed value" task and no prediction error in the "uniform value" tasks.

We tested 36 participants with no known neurological impairments and with normal or corrected-to-normal vision. All participants gave written informed consent. This experiment required that participants learn to make optimal responses when this was possible, i.e., in response to high-value cues. As such, we applied the following a priori criterion: only the data of participants who made a correct response on at least 60% of the learnable trials in both the mid-value task and the high-value task were included in the main EEG analysis. Twelve of the 36 participants did not meet this criterion and their data were therefore removed from the main analysis. However, the data of these 12 participants were included in a correlational analysis relating performance to the RewP. Finally, the data of one participant were excluded from all analyses due to excessive EEG artifacts (across all conditions, the average trial rejection rate for this participant was 87%). This left a total of 24 participants for the main analysis and 35 participants for the correlational analysis.

Participants were seated 60 cm in front of an LCD display (60 Hz, 1024 by 1280 pixels). Visual stimuli were presented using the Psychophysics Toolbox extension (Brainard, 1997; Kleiner et al., 2007; Pelli, 1997) for MATLAB (Version 8.2, MathWorks, Natick, USA). Participants were given both verbal and written instructions in which they were asked to minimize head and eye movements.

Participants were told that they would be gambling in three different casinos.
Each casino contained six different slot machines, and each slot machine was represented by a unique, coloured shape. Prior to "entering" a casino, the message "New Casino, New Coloured Shapes" was displayed. Shapes were reused across casinos, but with different colours (randomly chosen for each participant). The slot machines were described to participants as having two arms: a left arm and a right arm. Participants were instructed to, upon the appearance of a slot machine (a coloured shape), select one of the arms by pressing the corresponding key on a keyboard (the s-key to select the left arm, or the k-key to select the right arm). Participants were also told that their gamble would result in a win or a loss, each outcome represented by a randomly assigned fruit, and that for each slot machine "pulling one arm may be more likely to result in a win compared to pulling the other arm". Wins resulted in a gain of $0.03, while losses resulted in no gain ($0.00). Participants were informed that their goal was to win as much money as possible and that they would be paid their total at the end of the experiment.

Trials were grouped by casino, and each casino was entered only once. Casino order was counterbalanced, with six possible casino orderings and four participants assigned to each ordering. Within a casino, participants encountered the six slot machines 24 times each, in random order (144 gambles). Unknown to participants, there were two types of slot machine: high-value and low-value. Each high-value slot machine had one arm (randomly chosen) that, when selected, would result in a win with 80% probability (selecting the other arm resulted in a win with 20% probability). In contrast, low-value slot machines had no correct answer: there was a 50% probability of winning, regardless of which arm was pulled. The casinos contained either only low-value slot machines (the low-value task), only high-value slot machines (the high-value task), or an even split between low- and high-value slot machines (the mid-value task). As each slot machine was encountered the same number of times, there was no learning advantage related to the number of exposures. Upon leaving a casino, the total amount won within that casino was displayed.
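For concreteness, one casino's reward structure can be sketched as follows (a reconstruction under the stated probabilities, not the authors' task code; the mid-value casino is shown, and the simulated "participant" presses keys at random):

```matlab
% Sketch of one mid-value casino; names and structure are our assumptions.
nCues = 6; nRepeats = 24;
isHigh = [true(1,3), false(1,3)];    % three high-value, three low-value machines
correctArm = randi(2, 1, nCues);     % arm rewarded with p = .8 (high-value only)
cueOrder = repmat(1:nCues, 1, nRepeats);
cueOrder = cueOrder(randperm(numel(cueOrder)));   % 144 gambles in random order
total = 0;
for t = 1:numel(cueOrder)
    c = cueOrder(t);
    arm = randi(2);                  % stand-in for the participant's key press
    if isHigh(c)
        pWin = 0.2 + 0.6 * (arm == correctArm(c));  % .8 if correct, .2 otherwise
    else
        pWin = 0.5;                  % low-value machines: 50/50 for either arm
    end
    total = total + 0.03 * (rand < pWin);           % wins pay $0.03
end
fprintf('Total won in this casino: $%.2f\n', total);
```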

Each trial began with the appearance of a white fixation cross presented against a black background (Figure 1). The fixation cross, and all other visual stimuli, subtended 2° of visual angle. After 400-600 ms, the fixation cross was replaced by the coloured shape representing the current trial's slot machine (the "cue"). After 1000 ms, a 50 ms, 400 Hz sine tone signalled participants to choose an arm (left or right) by pressing the corresponding key on a keyboard. The purpose of the delay between the cue and the tone was to isolate cue-related neural activity from response-related neural activity. The coloured shape/slot machine remained on the display until the participant responded (or until 2000 ms if no valid response was made). Finally, another fixation cross appeared for 400-600 ms, followed by the feedback stimulus for 1000 ms. Two images of fruit were used as feedback stimuli, chosen at random for each participant from six possible images. If the participant responded prior to the onset of the tone, the message "too fast" was displayed instead. If no response was made within the 2000 ms response window, or if a non-response key was pressed, the message "invalid" was displayed. In both cases (fast/invalid responses) the trial was excluded from both the behavioural analysis and the EEG analysis. For each trial, we recorded casino value (low, mid, or high), slot machine value (low or high), trial outcome (win or lose), and, for high-value slot machines, whether or not the correct arm was chosen (that is, the arm more likely to result in a win).

EEG was recorded from 35 Ag/AgCl electrode locations using Brain Vision Recorder (Brain Products GmbH, Gilching, Germany). The electrodes were mounted in a fitted cap (EASYCAP GmbH, Wörthsee, Germany) with a standard 10-20 layout and were recorded with respect to an average reference. Electrode impedances were lowered below 20 kΩ prior to recording, and the EEG data were sampled at 250 Hz and amplified (UVic: Quick Amp; Oxford: actiCHamp Plus; Brain Products GmbH, Gilching, Germany).

For all three tasks (low-value, mid-value, high-value) we computed the mean proportion of trials resulting in a win. For the mid-value and high-value tasks we computed, for each participant and task, the likelihood of a correct response to the high-value cues. (Recall that low-value cues had no correct response.) As mentioned previously, participants for whom this likelihood fell below 60%, in either the mid-value task or the high-value task, were excluded from the main analyses (seven participants in total).

The EEG was analyzed in MATLAB 2022a (MathWorks, Natick, USA) using the EEGLAB library (Delorme & Makeig, 2004). EEG data were first bandpass filtered (lower cutoff 0.1 Hz), and ocular artifact components were identified and removed from the continuous data: we removed independent components that ICLabel determined were more likely to be eye-related than brain-related. We then constructed 800 ms epochs around the appearance of each cue and each feedback stimulus; epochs containing artifacts were excluded. We also excluded the first ten trials of each task, during which participants were becoming familiar with the new cues. We then averaged over the remaining epochs to create mean cue-locked, win-locked, and loss-locked waveforms for each task (low-value, mid-value, high-value), cue (low-value, high-value), and participant.
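A minimal sketch of this pipeline, assuming standard EEGLAB/ICLabel conventions (the upper filter cutoff, file name, event code, and epoch window are placeholders; the text does not specify them):

```matlab
% Assumed preprocessing sketch; not the authors' exact pipeline.
EEG = pop_loadset('filename', 'participant01.set');  % hypothetical file
EEG = pop_eegfiltnew(EEG, 0.1, 30);           % bandpass filter; 30 Hz cutoff assumed
EEG = pop_runica(EEG, 'icatype', 'runica');   % ICA on the continuous data
EEG = iclabel(EEG);                           % classify the components
probs = EEG.etc.ic_classification.ICLabel.classifications;
eyeComps = find(probs(:, 3) > probs(:, 1));   % more likely eye (col 3) than brain (col 1)
EEG = pop_subcomp(EEG, eyeComps, 0);          % remove eye-related components
EEG = pop_epoch(EEG, {'win'}, [-0.2, 0.6]);   % 800 ms feedback-locked epochs
erpWin = mean(EEG.data, 3);                   % average over epochs: channels x time
```

Averaging per condition in this way yields the win- and loss-locked waveforms whose subtraction, described next, gives the RewP difference wave.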
To quantify the RewP, we used the difference wave method, subtracting the mean loss ERP from the corresponding mean win ERP for each task and cue (see Supplemental Table 1). A similar approach was used for the cue-locked ERPs (see Supplemental Material).

Inferential Statistics

To confirm our task value manipulation, we compared the mean proportion of wins in each task using a one-way repeated-measures ANOVA. Partial and generalized eta-squared were computed as:

$$\eta_p^2 = \frac{SS_V}{SS_V + SS_{SV}}, \qquad \eta_G^2 = \frac{SS_V}{SS_V + SS_{SV} + SS_S}$$
where SS_V is the sum of squares of the value effect (low, mid, high), SS_SV is the error sum of squares of the value effect, and SS_S is the sum of squares between subjects. To test whether performance differed between the mid-value and high-value blocks (that is, whether participants learned about high-value cues differently), we used a repeated-measures t-test.

To determine the effect of average task value on the RewP, we compared RewP scores using repeated-measures t-tests. To avoid possible outcome-frequency confounds, we only compared RewP scores from conditions with similar outcome frequencies (Krigolson, 2017). Specifically, we compared the low-task, low-cue RewP (infrequent rewards in an infrequent-reward context) to the mid-task, low-cue RewP (infrequent rewards in a mid-frequency reward context). We then compared the mid-task, high-cue RewP (frequent rewards in a mid-frequency reward context) to the high-task, high-cue RewP (frequent rewards in a frequent-reward context).

For all t-tests, we first checked the assumption of normality of each variable using the Shapiro-Wilk test. The assumption was met for all variables except for the RewP score in the high-task, high-cue condition. As the t-test is robust to non-normality, no correction was made. For each comparison, we also computed Cohen's d for paired-samples t-tests as:

$$d = \frac{M_{diff}}{s_{diff}}$$

where M_diff is the mean difference between the scores being compared and s_diff is the standard deviation of the differences.

Modelling

We used computational modelling to explore the possibility that participants may have used a different strategy in each task, either due to differences in average task value or due to the performance differences described above. We also tested whether the optimized model parameters (α, τ) differed by cue value.
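As a sketch of the kind of model this implies (our assumptions, not necessarily the authors' exact model: a delta-rule learner with learning rate α and a softmax choice rule with temperature τ, fit by maximum likelihood; the function name modelNLL is hypothetical):

```matlab
% Hypothetical model-fitting sketch. cues, choices (1 or 2), and
% rewards (0 or 1) are per-trial vectors for one participant.
function nll = modelNLL(params, cues, choices, rewards)
    alpha = params(1);               % learning rate
    tau   = params(2);               % softmax temperature
    Q = 0.5 * ones(max(cues), 2);    % initial values for each cue's two arms
    nll = 0;
    for t = 1:numel(cues)
        c = cues(t); a = choices(t);
        p = exp(Q(c, :) / tau) / sum(exp(Q(c, :) / tau));   % softmax probabilities
        nll = nll - log(p(a));                              % accumulate -log likelihood
        Q(c, a) = Q(c, a) + alpha * (rewards(t) - Q(c, a)); % delta-rule update
    end
end
```

Per-participant parameters could then be obtained with, for example, fminsearch(@(p) modelNLL(p, cues, choices, rewards), [0.2, 0.5]) and compared across cue types and tasks.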

The proportion of winning trials differed by task. Recall that mean performance for both the mid-value blocks and the high-value blocks was defined as the proportion of high-value cues that were responded to correctly.

Feedback-Locked EEG (RewP)

After collapsing across conditions, the average feedback-locked waveforms (Figure 3a) revealed a scalp topography difference consistent with a RewP (Figure 3b). After constructing conditional difference waves (Figure 3c), we observed a significant effect of task value on the RewP.

Cue-Locked EEG (RewP)

No cue-locked RewP difference was observed (see Supplemental Material).

In the mid-value task, the fit learning rate parameter α was greater for low-value cues than for high-value cues, and low-value cues were associated with a larger temperature parameter τ.

Here we show evidence that trial-to-trial ACC responses, as indexed by the RewP, are influenced by task value. We measured these signals in three tasks of varying average value. Despite matching outcomes for expectancy and trial-level value, the RewP was reduced when average task value was high.

This result is in line with a growing body of literature highlighting the importance of task context in understanding ACC activity. By "context" we mean task value (as in the accounts described above). Traditional RL models learn action values without representing task context (such as average task value) explicitly. However, here we controlled for trial-level expectancy and yet observed a task-level influence on the RewP. This result, a reduced RewP when task value is high, is inconsistent with a model that learns at the level of actions only, but may be consistent with an average-reward learning model in which prediction errors are defined as the difference between trial rewards and the average task value (e.g., Holroyd & McClure, 2015). Under this view, the RewP ought to be reduced when trial outcomes are closer in value to the average task value, as we observed for outcomes following high-value cues in the high-value task. Note that we did not observe a similar effect for low-value outcomes (outcomes following low-value cues). This result may be due to a difference in strategy: participants used a larger learning rate when attempting to learn about low-value cues, suggesting they weighed recent outcomes more heavily compared to high-value cues. Additionally, low-value cues were associated with a larger temperature parameter, indicative of enhanced exploration for the more difficult problems.
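To illustrate with the task's own reward probabilities (assuming, for simplicity, perfect learning of high-value cues, so that the expected win rate is roughly .80 in the high-value task and .65 in the mid-value task), an average-reward prediction error for a win following a high-value cue would be:

$$\delta_{\text{high task}} = 1 - 0.80 = 0.20, \qquad \delta_{\text{mid task}} = 1 - 0.65 = 0.35$$

that is, a smaller prediction error, and hence a smaller RewP, in the high-value task, consistent with what we observed.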

There is empirical evidence that the RewP is sensitive to average reward. For example, we have previously observed that a simple RL model fails to account for RewP learning effects. This observation is consistent with the hierarchical reinforcement learning (HRL) model of ACC (Holroyd & McClure, 2015). Under this framework, ACC signals a need for control when outcomes are worse than the average task value.

Besides the RewP, there is other evidence that the brain tracks average reward over time. In the short term, unexpected rewards (RPEs) elicit phasic midbrain dopaminergic activity that is linked to trial-to-trial improvements in behaviour (Pessiglione et al., 2006). Subjectively, RPEs are associated with momentary changes in happiness (Rutledge et al., 2014, 2015). However, the brain also tracks rewards over the long term. There is fMRI evidence that BOLD activity in the midbrain correlates with average reward, even when controlling for RPEs (Rigoli et al., 2016). The proposed mechanism behind this effect is tonic midbrain dopamine, which is thought to track average reward rate in order to support motivational vigour (Niv et al., 2007).

Relevant here is the possibility that phasic and tonic dopamine interact. In particular, a high level of tonic dopamine is thought to suppress phasic activity (Bilder et al., 2004; Grace, 1991, 2000). This interaction bears on EEG because the RewP is thought to reflect phasic dopaminergic signalling (Holroyd & Coles, 2002). An interaction between tonic and phasic dopamine may therefore provide a dopamine-based mechanism for our ERP result: increased tonic activity in the high-value task relative to the mid-value task resulted in reduced phasic activity (and a concomitant decrease in RewP amplitude).

The ACC has long been known to respond to trial-level events such as cues and feedback (e.g., Holroyd & Yeung, 2012). Here we provide evidence that ACC activity depends not only on trial-level features (is this a good outcome/cue?) but also on task-level variables (how good is the task?). We suggest that a global view of ACC function will prove more fruitful than studying its neural responses at a molecular level.

Data Availability Statement

Participants at one testing site did not consent for their raw or preprocessed EEG data files to be publicly shared. Task and analysis scripts are available at https://github.com/chassall/averagetaskvalue.