Computational modeling of reinforcement learning and functional neuroimaging of probabilistic reversal dissociates compulsive behaviors in Gambling and Cocaine Use Disorders

Cognitive flexibility refers to the ability to adjust to changes in the environment and is essential for adaptive behavior. It can be investigated using laboratory tests such as probabilistic reversal learning (PRL). In individuals with both Cocaine Use Disorder (CUD) and Gambling Disorder (GD), overall impairments in PRL flexibility are observed. However, it is poorly understood whether this impairment depends on the same brain mechanisms in cocaine and gambling addictions. Reinforcement learning (RL) is the process by which rewarding or punishing feedback from the environment is used to adjust behavior, to maximise reward and minimise punishment. Using RL models, a deeper mechanistic explanation of the latent processes underlying cognitive flexibility can be gained. Here, we report results from a re-analysis of PRL data from control participants (n=18) and individuals with either GD (n=18) or CUD (n=20) using a hierarchical Bayesian RL approach. We observed significantly reduced ‘stimulus stickiness’ (i.e., stimulus-bound perseveration) in GD, which may reflect increased exploratory behavior that is insensitive to outcomes. RL parameters were unaffected in CUD. We relate the behavioral findings to their underlying neural substrates through an analysis of task-based fMRI data. We report differences in tracking reward and punishment expected values (EV) in individuals with GD compared to controls, with greater activity during reward EV tracking in the cingulate gyrus and amygdala. In CUD, we observed reduced responses to positive punishment prediction errors (PPE) and increased activity following negative PPEs in the superior frontal gyrus compared to controls. Thus, an RL framework serves to differentiate behavior in a probabilistic learning paradigm in two compulsive disorders, GD and CUD.


Introduction
The diagnostic criteria for both Substance Use Disorders (SUD) and Gambling Disorder (GD) in the Diagnostic and Statistical Manual of Mental Disorders (fifth edition) (DSM-5) include unsuccessful attempts to stop substance abuse or gambling, jeopardizing relationships and educational/career opportunities, and financial troubles arising as a consequence of the disorder (APA 2013). Compulsivity, a key feature of both GD and SUDs, is defined as persistent actions inappropriate to a given situation, which have no clear relationship to the overall goal and frequently result in undesirable consequences (Dalley et al. 2011). GD and SUDs are disorders of compulsivity and their behavioral phenotypes may thus overlap, but also diverge in certain aspects (Leeman and Potenza 2012;Robbins et al. 2012). Gaining a clearer definition of these phenotypes could inform the development of new treatments for disorders of compulsivity.
A further common feature of GD and SUD is behavioral inflexibility, defined as a deficit in adjusting behavior based on changes in environmental feedback (Jara-Rizzo et al. 2020;Smith et al. 2020;Perandrés-Gómez et al. 2021). Individuals with SUDs to a range of specific substances exhibit an increase in perseverative responding following a contingency change during probabilistic reversal learning (PRL), a paradigm used to investigate cognitive flexibility (Ersche et al. 2011). Increased perseveration during reversal is observed in individuals with Cocaine Use Disorder (CUD) (Ersche et al. 2008;Robinson et al. 2021).
Indeed, reversal learning is impaired in rats and monkeys following prolonged exposure to cocaine (Jentsch et al. 2002;Schoenbaum et al. 2004).
Patients with GD, in comparison, show difficulties in learning novel stimulus-outcome associations following contingency changes during reversal learning (Jara-Rizzo et al. 2020).
Following repeated negative feedback, patients with GD tend to stay rather than switch their response, or switch prematurely after little or no negative feedback during PRL (Perandrés-Gómez et al. 2021). Individuals with GD perform significantly worse than healthy controls (HCs) on the Intra-/Extra-Dimensional Set Shifting test (IED), which assays higher order cognitive flexibility, with impairments observed at the extra-dimensional shift stage (requiring the most flexibility) (Ornstein et al. 2000;Leppink et al. 2016). In a meta-analysis of nine studies that investigated performance of participants diagnosed with GD on the related Wisconsin Card Sorting Test (WCST), patients made more perseverative errors than healthy individuals (van Timmeren et al. 2018). Overall, it is evident that individuals with GD are impaired on cognitive flexibility tasks and have greater perseverative tendencies, similar to individuals with SUD.
CUD has been associated with fronto-striatal neuroadaptations that are linked to altered reward processing. For example, a study employing functional magnetic resonance imaging (fMRI) has found that individuals diagnosed with CUD exhibited lower blood-oxygen level dependent (BOLD) signals in the orbitofrontal cortex (OFC) than control participants following monetary gains on a forced-choice task containing three monetary value conditions (Goldstein et al. 2007). Neural activity is also known to be altered in patients with SUD during PRL, such as in the middle frontal gyrus (MFG) and caudate nucleus, areas known to contribute to performance on this task (Cools et al. 2002;Ersche et al. 2011). A meta-analysis of 52 studies reported that the OFC is hypoactive following detoxification in participants with CUD across different decision-making tasks (Dom et al. 2005). Thus, it is evident that activity of striatal and prefrontal cortical (PFC) regions is altered in CUD.
Functional MRI studies in individuals with GD have also found differential recruitment of PFC areas during reward-based tasks (Leeman and Potenza 2012). The ventromedial PFC (vmPFC), an area activated during monetary reward tasks in healthy individuals that is important for reward processing, shows decreased task-related activation in GD (Campbell-Meiklejohn et al. 2008;Habib and Dixon 2010;Li et al. 2010). On the Iowa Gambling Task, greater activity in individuals with GD during high-risk choices has been reported in the right caudate, OFC, vmPFC, superior frontal gyrus (SFG), amygdala, and hippocampus (Power et al. 2012).
Furthermore, lower activity in the right ventrolateral PFC (vlPFC) has been linked to increased perseveration on a PRL task (De Ruiter et al. 2008). These findings point to altered reward processing in GD and suggest the involvement of cortical areas such as the vmPFC and OFC as well as subcortical structures; several areas overlap with those also affected in CUD.
Reinforcement learning (RL) is the process by which positive and negative feedback from the environment is used to adjust behavior to maximize rewards and minimize punishment (Sutton and Barto 1998). In recent years, RL models have increasingly been used to gain deeper insight into the latent mechanisms (represented by model parameters) underlying PRL. With RL models, for example, it is possible to interrogate how exploration versus exploitation (of learned values) contribute to choice behavior, or the degree of simple choice repetition unrelated to outcomes and value (stickiness). Reward and punishment learning rates can also be determined via RL models, which index the speed at which the expected value of a choice is updated after a better than or worse than expected outcome (reward or punishment prediction error). Indeed, RL impairments following drug use and withdrawal have been demonstrated in rodents and humans. In rats, increased exploitation and stickiness have been reported after cocaine self-administration, consistent with perseveration (Zhukovsky et al. 2019). Humans with SUD, meanwhile, have been found to be higher in stickiness and punishment learning rate, and have a lower reward learning rate (Kanen et al. 2019). RL modeling has also revealed that LSD, a non-selective serotonin 2A receptor agonist, increases reward and punishment learning rates and decreases stimulus stickiness in healthy volunteers (Kanen et al. 2022). Critically, the RL fingerprint during PRL in GD has not been elucidated. Furthermore, the neural substrates underlying these changes in RL parameters are not clearly defined. In rats, stickiness positively correlated with resting-state fMRI activity between the medial OFC (mOFC), PFC and subcortical structures (Zühlsdorff et al. 2023). In humans, the link between RL behavior and neural activity has not yet been established.
Here, we present a re-analysis of a previously published dataset (Verdejo-Garcia et al. 2015) using novel computational methods. Individuals with CUD, GD and controls completed a PRL task in an fMRI scanner. In the previous publication arising from this dataset, conventional PRL measures were calculated and compared between the groups. There, it was reported that a behavioral variable reflecting the perseveration error rate was increased in GD, with no changes observed in the CUD group. Additionally, both patient groups had lower vlPFC activation when shifting responding following a reversal. When perseverating, participants with CUD had greater activity in the dorsomedial PFC (dmPFC) than the GD group. In the new analysis presented here, RL models are employed to reveal latent processes underlying behavior on the PRL task, via a potentially more sensitive trial-by-trial approach. Through the fMRI data, the RL parameters can be linked to their associated neural substrates. To our knowledge, no previous studies have investigated PRL data from GD patients using RL models.
Based on our recent work that showed the concept of stickiness (the tendency to repeat actions regardless of value) was critical for dissociating other disorders of compulsivity (Kanen et al. 2019), we hypothesized that stickiness would be central to the RL modeling results here and would be increased in GD and CUD. Neurally, we predicted that activity in the amygdala and OFC would be linked to the reward learning rate, and that medial PFC and dorsal striatal activity would reflect the stickiness parameter.

Participants
Fifty-six participants took part in this study. These comprised: 18 healthy control subjects who did not meet any of the criteria for an Axis I or II disorder; 18 subjects who met the DSM-IV-TR criteria for Gambling Disorder and 20 individuals that met the criteria for Cocaine Use Disorder. Here, we use the term Substance Use Disorder as used in the DSM-V, rather than stimulant dependence, which is used in the DSM-IV-TR (APA 2013). Basic behavioral data in association with fMRI findings from this study have previously been published (Verdejo-Garcia et al. 2015).
Individuals with CUD were recruited in the outpatient clinic Centro Provincial de Drogodependencias, Granada, Spain, and participants diagnosed with GD were recruited from Asociación Granadina de Jugadores en Rehabilitación, Granada, Spain. Participants from these two groups had met the following inclusion criteria: 1) between 18-45 years old; 2) estimated IQ level above 80; 3) meeting the DSM-IV-TR criteria for cocaine dependence or pathological gambling; 4) having commenced psychological treatment; 5) having been abstinent for more than 15 days. It was confirmed that individuals with CUD were abstinent using two urine tests per week and an additional test on the imaging day. Gambling abstinence was confirmed by relatives and checked through self-assessment. The following exclusion criteria were applied: 1) diagnosis of another Axis I or II disorder, except alcohol or nicotine addiction; 2) history of head injury, neurological disease or any other diseases affecting the central nervous system; 3) having undertaken other treatments in the two years prior to the study; 4) court-mandated treatment. The Structured Clinical Interview for DSM-IV Axis I Disorders (SCID-I-CV) was used to assess Axis I disorders, whereas Axis II disorders were assessed through the International Personality Disorders Examination (IPDE) (Loranger 1994;First 1997).
Diagnoses were made through registered clinical psychologists. Control participants were recruited from local agencies. The study was approved by the Ethics Committee for Research in Humans, University of Granada, Spain. Participants signed an informed consent form to confirm their voluntary participation and were all equally reimbursed for their participation.

Probabilistic Reversal Learning Task
This task was similar to the PRL task used by (Cools et al. 2002). Two abstract, colored stimuli were presented on the right and left side of the visual display. Stimulus location was randomized. At the beginning of the tasks, everyone was informed that one stimulus was the 'correct' stimulus (CS+), and the other stimulus was the 'incorrect' stimulus (CS−). Subjects had to learn the correct and incorrect stimulus through a trial-and-error approach. The CS+ resulted in a reward on only 85% of the trials, whereas the CS− was rewarded 15% of the time.
Following 10 to 15 correct trials, the contingencies were reversed. All participants were trained on the PRL task outside the scanner before the initial scan, for which different stimuli were used. During scanning, there were three consecutive blocks that consisted of 10 discriminations (9 reversals), with a duration of 11 min per block.
Magnetic-resonance-compatible liquid-crystal display goggles were used to present the stimuli (Resonance Technology Inc., Northridge, CA, USA). All responses were recorded using the Evoke Response Pad System (Resonance Technology Inc.). This button box was located on the subject's chest. The duration of stimulus presentation was 2000 ms. If participants failed to respond during this time, a 'too late' message was presented. Following a 'correct' response, a green smiley face was presented, and following an 'incorrect' response, a red sad face was shown. Feedback was presented for 500 ms, during which time the stimulus remained on the screen. Following feedback presentation, there was a variable inter-trial interval, which was adjusted by the program, for a final interstimulus interval duration between stimuli of 3253 ms.
This interstimulus interval duration was selected to enable a precise desynchronization from the repetition time (2000 ms).

Reinforcement learning modeling
The PRL data was modelled with RL models using a hierarchical Bayesian approach. Seven different models containing different combinations of RL parameters (described in more detail below) were tested, implemented through Stan (Stan Development Team 2020).
The highest hierarchical level contained a group-specific mean and a common standard deviation for every RL parameter of the respective model. Priors for these values are shown in Table 1. The RL parameters were drawn for each subject from a normal distribution having the relevant mean/standard deviation. Predicted choices were fit to behavior according to an RL algorithm (described below), and the highest posterior density interval (HDI) was calculated for group mean differences of interest (Kruschke 2014). Q values were updated on a trial-by-trial basis according to the following equation: Qt+1(c t ) is the expected value for the next trial based on the stimulus that is chosen on the current trial, Qt(ct) is the expected value of the choice taken on the current trial, α is the learning rate and rt is the reinforcement on trial t (1 for reward and 0 for punishment). The learning rate influences how much the subject updates the Q value based on the prediction error rt − Q t (c t ), with higher α driving faster learning.
The probability of making one of two choices given the Q values for each was calculated using the softmax decision rule: Q t (L) and Qt(R) are the Q values of the left and right stimuli, and β is the reinforcement sensitivity parameter, which determines to what extent the subject is driven by its reinforcement history (versus random choice).
The seven models tested were as follows: 1. Two parameters: α and β, the learning rate and reinforcement sensitivity parameter.
2. Three parameters: α, β, stimulus stickiness parameter κ stim . κ stim is the tendency to respond to the same stimulus as on the previous trial, irrespective of its location and outcome (i.e., whether it was rewarded or not), and was used to update the Q value as follows: , +1 = κ , . , represents the stimulus chosen by the subject on the last trial. This value is 1 if the same stimulus was chosen, and 0 if another stimulus was chosen. The final Q value is the sum of , +1 and the Q value as calculated in equation 1.

Four parameters: αrew, αnon-rew, β and κ stim .
5. Four parameters: αrew, αnon-rew, β and κ side . The stimulus stickiness parameter was replaced with the side stickiness parameter, representing the tendency to choose the same side as on the previous trial, irrespective of the outcome produced, and was used to update the Q value as follows: , +1 = κ , . , represents the side chosen by the subject on the last trial.
This value is 1 if the same side was chosen, and 0 if the other side was chosen. The final Q value is the sum of , +1 and the Q value as calculated in equation 1.
In this model, the value of incoming information is compared against the individual's beliefs.
The parameters included are experience weight  for each stimulus, which modulates learning from reinforcement. This value changes over time according to the decay factor . This model also includes the parameter β.
The models were fitted through Hamiltonian Markov Chain Monte Carlo sampling via Stan 2.17.2 (Carpenter et al. 2017). Convergence was ensured using the potential scale reduction factor (Brooks and Gelman 1998;Gelman 2013). A potential scale reduction factor value close to 1 indicated perfect convergence. A cut-off of 1.1 was selected as a stringent criterion for convergence. Models were compared using a bridge sampling estimate of the marginal likelihood using the "bridgesampling" R package (Gronau et al. 2017(Gronau et al. , 2020.
Between-group differences were sampled to give a posterior probability distribution for each quantity of interest. These posterior distributions were interpreted using the 95% and 75% HDI, which are 'credible intervals' in Bayesian statistics. At 95% HDI, more evidence is provided for there being group differences than at 75% HDI. However, findings at 75% HDI are also considered to provide sufficient evidence for there being group differences.

Data simulation
Data were simulated using the posterior group mean parameters from the winning model, with the aim of determining whether the winning model could reproduce the behavioral observations. The simulated data were then analyzed using a conventional PRL analysis as described in (Verdejo-Garcia et al. 2015). One hundred virtual "subjects" were simulated for each group, with each "subject" performing the PRL task in silico.

Image pre-processing
The FMRIB Software Library (FSL) and FMRIPREP were used to pre-process the data Esteban et al. 2018 Subsequently, brain extracted T1w images were segmented into cerebrospinal fluid, white matter and grey matter using fast (FSL) (Zhang et al. 2001). Functional MRI scans were slicetiming-corrected using slicetimer (FSL) and then motion-corrected with mcflirt (FSL) (Jenkinson et al. 2002). For scans with associated field maps, distortion correction was performed using fugue (FSL) (Jenkinson 2003). Next, the fMRI images were co-registered to their corresponding T1w scan using boundary-based registrations with six degrees of freedom with flirt (FSL) (Greve and Fischl 2009). The field distortion correcting warp, BOLD-to-T1w transformation and T1w-to-template (MNI) warp were concatenated and applied in a single step using antsApplyTransforms using Lanczos interpolation. Nipype was used to calculate the frame-wise displacement (Power et al. 2014). The first five volumes were discarded to avoid T1 saturation effects. fMRI images were high-pass filtered (128 s) and spatially smoothed with a 6 mm full-width, half-maximum 3D Gaussian kernel. A canonical hemodynamic response function was modelled to the onsets of the explanatory event types. Multiple criteria were used to ensure successful registration, including checking successful registration, ensuring that none of the participants showed excessive motion using DVARS (root mean square of the temporal change of the voxel-wise signal at each time point (Yang et al. 2019)) and framewisedisplacement measures (excessive motion threshold being 10% of the total number of volumes) and by inspecting their respective carpet plots.

First-level models
First-level linear models were fit through FEAT (FSL) (Woolrich et al. 2001). A first-level model was fit for each run and included the following event types: (1) reward Expected Value PEs were between 0 and -1. The model was based on an analysis presented previously (Murray et al. 2019). Six movement parameters (x, y, z, pitch, roll, yaw) were incorporated into the model, which resulted from the image realignment to control for movement artefacts.

Higher-level models
The first-level models were averaged across the three runs for each subject, resulting in the second-level models. Third-level mixed-effects whole-brain analyses involving one-sample ttests with cluster thresholding with a Z threshold of ±3.1 and p<0.05 were used to investigate the contrasts for each event type ). The contrasts included control vs GD, control vs CUD and GD vs CUD. Subsequently, an analysis of covariance (ANCOVA) was run as an additional exploratory analysis. In the ANCOVA, model parameters from the bestfitting RL model were extracted for each subject and included as predictors. The aim of this analysis was to investigate group differences in the correlation between activity in a given region and a RL parameter (i.e., a group×RL parameter interaction). RL parameters were also correlated with BOLD signal from all participants, regardless of group. FSLeyes was used to generate figures ). In all figures, the right and left sides are inverted from the observer's perspective (according to standard radiological convention).

Demographic information
There were no significant differences in age, gender, IQ, handedness, or years of education between the groups ( Table 2) (Verdejo-Garcia et al. 2015). is the tendency to select the same stimulus regardless of outcome, and side stickiness κ , the tendency to select the same side regardless of outcome.

Selecting the winning model
Reinforcement learning results Figure 1 shows results of the hierarchical Bayesian RL analysis. Neither the reward learning rate nor the punishment learning rate were affected in GD or CUD when compared with healthy controls. However, there was evidence that the reward learning rate was lower in the CUD group than the GD group (difference in parameter per-group mean, posterior 75% HDI excluding zero). Reinforcement sensitivity was lower in the CUD group compared to the GD group, reflecting more exploratory behavior in CUD (group difference, 0 ∉ 75% HDI). Side stickiness, meanwhile, was not different in either patient group compared to the control group (no group differences, 0 ∈ 75% HDI). There was evidence for a decrease in stimulus stickiness at 75% HDI in the GD group compared to HCs (group difference, 0 ∉ 75% HDI). There were no changes in the CUD group when compared to the control group (no group differences, 0 ∈ 75% HDI). To summarize, we found evidence for the stimulus stickiness parameter κ being decreased in the GD group. Moreover, when comparing the GD group to the CUD group, there was support for the reward learning rate and reinforcement sensitivity parameter being greater in the GD group compared to the CUD group. No differences at 95% HDI were observed.

Simulations
The parameters from the winning RL model were used to simulate the behavioral data and determine whether this model could replicate the behavior observed initially via raw data measures. When these data were analyzed using a conventional approach to extract raw data measures such as win-stay and lose-shift, no statistically significant differences between the groups were found. These findings thus align with the results for the conventional behavioral measures presented in (Verdejo-Garcia et al. 2015), suggesting that the model was able to reproduce the behavioral dynamics on this task.

Disorder
The model fitted to the task-based fMRI data included seven explanatory variables, as above: (1) reward EV; (2) positive RPE; (3) negative RPE; (4) punishment EV; (5) positive PPE; (6) negative PPE and (7) response/feedback presentation. We found differences in the neural responses to reward and punishment expected value in the GD group compared to controls.
Specifically, we observed that when tracking reward EV, that individuals with GD had greater activations in the amygdala, hippocampus, parahippocampal gyrus, lateral occipital cortex, superior, inferior, and middle temporal gyri, as well as the precuneus than HCs (Figure 2, Table 4). These effects were only observed in the left hemisphere.
For punishment EV, we observed the opposite trend: individuals with GD showed lower activity in the superior parietal lobule, pre-and postcentral gyri, precuneus, parietal operculum, supramarginal gyrus and angular gyrus compared to control subjects (Figure 3, Table 5).
Activations were seen in both hemispheres but were more pronounced in the right hemisphere.

Use Disorder
We observed aberrant neural responses in CUD as well, specifically in response to positive and negative PPEs. Compared to control participants, individuals with CUD exhibited lower activity in the paracingulate gyrus and left SFG in response to positive PPEs. Conversely, individuals with CUD showed greater activity in the left SFG and MFG in response to negative PPEs (Figures 4, 5; Tables 6, 7, respectively).

Neural responses to feedback presentation
During feedback presentation, the GD group overall showed increased activity (versus controls) in the lateral occipital cortex, cingulate gyrus, parahippocampal gyrus, precuneus, middle temporal gyrus and supramarginal gyrus (supplementary materials, Figure S1). There were also significantly greater activations during simultaneous cue and feedback presentation in the CUD group (versus controls), which were instead in the frontal pole, SFG, inferior frontal gyrus (IFG), precentral gyrus, superior parietal lobule, supramarginal gyrus, precuneus, angular gyrus, and lateral occipital cortex ( Figure S2). Moreover, we observed differences between the CUD and GD groups: individuals with CUD had greater activity than those with GD in the insular cortex, IFG, and frontal operculum.
No significant differences were found in response to positive and negative RPEs. Thus, there appear to be widespread differences in both CUD and GD groups when the feedback was presented. However, this response was altered in different areas of the brain in the two disorder groups.

Whole-brain correlation analyses
The five parameters from the winning RL model were used in a whole-brain correlation analysis to identify whether they correlated with the BOLD signal during each event type in any of the brain regions. This was done to identify the brain regions underlying RL parameters.
The first analysis related the parameters to activity from all subjects.
This analysis highlighted that the parameter correlated negatively with activity in the cingulate and paracingulate gyri, IFG, middle and superior temporal gyri, insular cortex, and mOFC during reward EV tracking as well as responses to positive PPEs. This parameter also correlated negatively with activity in the putamen, mOFC, and insula during positive RPEs (see Supplementary Materials).
Next, an ANCOVA was run to compare connectivity patterns among the different groups. In the GD group, correlated more strongly with activity in the SFG, MFG, postcentral gyrus during reward EV tracking compared to the other two groups (Figure S3). In the CUD group, the correlation between and activity during the positive PPE was greater in the frontal pole, SFG, cingulate and paracingulate gyri compared to the HC and GD groups ( Figure S4).
In both patient groups, stimulus stickiness ( κ ) had a stronger positive correlation with activity in the right MFG and IFG during cue/feedback presentation compared to control participants, suggesting that there are increased activations in these areas in patients when repeating a response regardless of previous outcomes (Figure 6). No other correlations with RL parameters were found.

Discussion
In this study, we examined RL processes during a classic test of behavioral flexibility (PRL) in individuals with GD and CUD. Our computational modeling approach enabled the assessment of how both value-based (learning rates, reinforcement sensitivity) and value-free in this case GD and CUD. However, we note that group differences were only observed at 75% HDI, but not at 95%.
We provide a novel and unexpected insight into how RL parameters are affected in GDthat stimulus stickiness was reduced in this group. A similar reduction in stimulus stickiness has also been observed in another compulsive disorder, OCD (Kanen et al. 2019). However, in GD, the reduction in stimulus stickiness was accompanied by a slight increase in side stickiness (below 75% HDI), whereas in OCD there was additionally a mild reduction in side stickiness (Kanen et al. 2019). In other words, the computational profile of GD and OCD was distinct. Perseveration is not a unitary construct (see also (Sandson and Albert 1984)): side stickiness may be representative of motor perseveration, whereas stimulus stickiness reflects stimulus perseveration. Side stickiness may therefore represent excessive motor perseveration.
In contrast, the reduction in stimulus stickiness may reflect another form of behavioral inflexibility that is overly exploratory yet outcome insensitive. Low stimulus stickiness in GD detected during trial and error learning in a laboratory setting may therefore reflect a real-life increase in exploration of choices in an attempt to identify an optimal strategy, e.g., tracking new stimuli in a casino game (Clark 2010). Once the 'optimal' stimulus has been chosen following exploration, greater side stickiness may result in motor perseveration resulting in excessive losses. Whilst one interpretation of low stimulus stickiness in OCD is that it is a manifestation of increased checking behavior, this too can be thought of as maladaptive exploration albeit pertinent to a different real-life setting (Hauser et al. 2017;Kanen et al. 2019). Exploration (particularly of stimuli) is presumably meant to collect information; however, when such behavior becomes disconnected from outcomes it may contribute to compulsions in GD and OCD. It remains to be determined how the neural mechanisms supporting low stimulus stickiness in GD and OCD differ or overlap. Overall, value-free contributors to choice behavior have allowed for novel dissociations of GD, OCD, and SUD, and point to possible computational fingerprinting, which could eventually be useful for informing psychiatric classification.
At the neural level, group differences were also observed during ongoing RL processes.
Differences in brain activity when tracking reward and punishment EVs were seen in participants with GD. In these individuals, there was greater activity in response to reward EVs in the amygdala, hippocampus and cingulate gyrus compared to HCs. When tracking punishment EV, on the other hand, there was lower activity in the postcentral gyrus, superior parietal lobule and occipital areas, suggesting that individuals with GD differentially track EVs of stimuli in their surroundings in favor of reward-related expectancies. In the CUD group, there was also an altered balance in RL, instead with lower responses to positive PPEs and greater responses to negative PPEs in the SFG and neighboring regions compared to control participants, which suggests preferential processing of punishment. This aligns with our recent finding that individuals with SUD show increased punishment learning rates (Kanen et al. 2019). In summary, there appear to be uniquely aberrant neural signals in each patient group when tracking value-related information important for RL processes.
By linking the computational modeling parameters to the fMRI data, we also identified regions involved in the modulation of RL measures, which has not been investigated in previous human studies. We found that the learning rate parameter for reward ( ) was correlated with areas that responded to RPEs and PPEs, including the SFG, MFG, cingulate and paracingulate gyri.
Therefore, these regions appear to be of key importance for RL and are likely to be involved in the modulation of the reward learning rate ( ). The SFG and ACC are key areas underlying error and action monitoring, providing support for their involvement in reward learning (Carter et al. 1998;Botvinick et al. 1999). Moreover, a meta-analysis including 35 studies reported that these areas are consistently activated when there is a prediction error (Garrison et al. 2013).
At least two previous studies have reported reduced learning rates, reinforcement sensitivity and increased stimulus stickiness in individuals with SUD compared with HCs (Kanen et al. 2019;Lim et al. 2021). In the present study, meanwhile, we observed diminished reward learning rates and increased stimulus stickiness in CUD only when contrasted with GD.
Duration of substance abuse may be a key factor underlying the less pronounced RL results in CUD when compared to these two previous studies. Whereas the CUD sample in the present study had an average duration of substance use of 3.7 years (Verdejo-Garcia et al. 2015), the participants with SUD in previous studies reporting more pronounced RL deficits had been using for an average of 11.7 (Ersche et al. 2011) and 13.7 years (Lim et al. 2021). Additionally, a criterion in our study was abstinence, which was not the case in the other two investigations.
These differences in sample suggest longer exposure to substances may have more pronounced effects on RL processes, possibly due to neurotoxicity, and may therefore help reconcile the RL findings between these studies. As GD itself does not involve substance use, we would not expect the same magnitude or mechanism of change in RL effects related to disease duration.
At the same time, such contrasts between GD and SUD may inform which aspects of RL in SUD are more or less likely to be tied to neurotoxic effects.
Another consideration when reconciling this series of studies is differences in study design.
Lim and colleagues, for example, used a probabilistic task which had separate conditions for reward and punishment, tested in individuals with CUD (Lim et al. 2021), while Kanen et al.
included individuals with any type of SUD (Kanen et al. 2019)this may also explain the decreased punishment learning rate observed in the former compared to an increased punishment rate in the latter. It has been shown that individuals with Cocaine and Amphetamine Use Disorders perform differently during reversal learning, with increased perseveration seen preferentially in those with CUD (Ersche et al. 2008). This difference may relate to elevated stimulus stickiness in CUD observed here only when contrasted with GD.
Based on the neural results presented here, individuals with GD appear to be less sensitive to punishment EV but more sensitive to reward EV than controls. A study of performance on a two-choice lottery task found that choice behavior in GD patients was less sensitive to EVs for both reward and punishment, with this group using information about magnitude and probability information less than HCs (Limbrick-Oldfield et al. 2021 In individuals with SUD, reduced responses to PEs in the VS and mOFC on the IGT have been reported previously (Tanabe et al. 2013). In a separate study using electroencephalography, impaired RPE signaling in CUD was also found (Parvaz et al. 2015). In contrast, we found increases in responding to PPEs, rather than reduction in RPEs. Following cocaine abstinence in individuals with CUD, enhanced signals to positive PEs, regardless of whether reward or punishment was predicted, have been observed (Wang et al. 2019). Although we report reduced activity following positive PPEs, this may be because we separated reward and punishment PEs and suggests that the two PEs are differentially altered in CUD. Altered responses to PE related to both reward and punishment could be a contributor to compulsive drug use, as it persists despite negative outcomes. In patients with OCD, RPE responses were altered in the nucleus accumbens and anterior cingulate cortex, further highlighting that RL can be used to distinguish disorders of compulsivity, both through behavior and its associated neural substrates (Murray et al. 2019).
We report that stimulus stickiness ( κ ) was positively correlated with activity in the dorsolateral PFC (dlPFC) and ventrolateral PFC (vlPFC), areas important for cognitive control, including conflict monitoring and motor inhibition, respectively (Badre and Wagner 2004; Levy and Wagner 2011). In the results presented here, patients with GD and CUD showed a stronger positive correlation with stimulus stickiness ( κ ) in these regions. This result was contrary to our expectations and previous studies, as it was predicted that stickiness would be related to reduced activity in these regions. A possible interpretation of this finding is that stimulus stickiness reflects bias towards one of the presented stimuli, ideally the majority reinforced one, and that the MFG and IFG are active in order to overcome this response following a reversal. However, this hypothesis would need to be explored further in future studies.
It has been demonstrated previously that both the dlPFC and vlPFC are affected in GD and CUD (Goldstein and Volkow 2011;Raimo et al. 2021); here we provide a novel computational mechanism pertinent to compulsions that is linked to these regions in GD and CUD. Previous studies have demonstrated that response shifting on the PRL task is associated with vlPFC activation in control participants (Cools et al. 2002). Consistent with the present results, a prior analysis of this dataset showed the vlPFC was engaged during response shifting, yet both clinical groups showed lower vlPFC activity than HCs (Verdejo-Garcia et al. 2015). Reduced vlPFC activity during shifting has been also reported in OCD patients (Remijnse et al. 2006).
These findings from previous studies, however, focus on response shifting on certain trials, whereas our analysis investigated stickiness across all trials, reflecting an overall tendency.
Additionally, stickiness represents repeated responses, rather than response shifts. In rats, it has been shown that side stickiness (stimulus stickiness was not studied) is correlated with activity in medial PFC and dorsal striatal regions (Zühlsdorff et al. 2023). It is therefore possible that side and stimulus stickiness recruit different neural circuits, but this requires further analysis in the same species.
In summary, we provide novel behavioral and neural insights into GD through computational modeling of RL processes. Critically, we demonstrate that individuals with GD and CUD display perseverative behavior during PRL that differs both qualitatively and quantitatively, advancing the notion that compulsivity is not a unitary construct. We also provide evidence that individuals with GD and CUD display aberrant and opposing neural responses to rewards and punishments, in relation to expected value and PPEs. Furthermore, we link RL parameters to regions that may be involved in their modulation, which has not previously been investigated in the human literature, such as the finding that stimulus stickiness is positively correlated with activity in the dlPFC and vlPFC, areas involved in modulating the balance between goaldirected and habitual behaviors. We demonstrate that RL modeling combined with fMRI may provide new insights into the mechanisms underlying compulsive disorders and therefore refine our understanding of compulsivity transdiagnostically.

Funding
The collection of data for this study was originally funded by the Spanish Ministry of Health /   Whole-brain analysis involving one-sample t tests with cluster thresholding with a Z threshold of 3.1 and p<0.05. The areas indicated show greater activity in participants with CUD than control participants. BA, Brodmann area; MNI, Montreal Neurological Institute template. Whole-brain analysis involving one-sample t tests with cluster thresholding with a Z threshold of 3.1 and p<0.05. The areas indicated show lower activity in participants with CUD than control participants. BA, Brodmann area; MNI, Montreal Neurological Institute template. Whole-brain analysis involving one-sample t tests with cluster thresholding with a Z threshold of 3.1 and p<0.05. The areas indicated show lower activity in participants with CUD than control participants. BA, Brodmann area; MNI, Montreal Neurological Institute template. Whole-brain analysis involving one-sample t tests with cluster thresholding with a Z threshold of 3.1 and p<0.05. The areas indicated show greater activity in participants with CUD than control participants. BA, Brodmann area; MNI, Montreal Neurological Institute template.

Figure 2.
Reward EV tracking: differences between healthy controls and participants with GD (MNI coordinates: X=-16, Y=58, Z=34). Activity was higher in the GD group in the areas indicated. Color bar on the right-hand side represents the t statistic. Figure 3. Punishment EV tracking: differences between healthy controls and participants with GD (MNI coordinates: Y=-24 to -17). Activity was lower in the GD group in the areas indicated. Color bar on the right-hand side represents t.

Figure 4.
Response to positive PPE: differences between healthy controls and participants with CUD (MNI coordinates: X=-5, Y=17, Z=48). Activity was lower in the CUD group in the areas indicated. Color bar on the right-hand side represents t.

Figure 5.
Response to positive PPE: differences between healthy controls and participants with CUD (MNI coordinates: X=-31, Y=30, Z=56). Activity was higher in the CUD group in the areas indicated. Color bar on the right-hand side represents t. Figure 6. Top: Areas that have a stronger positive correlation with κ stim in the GD group than in healthy controls (MNI coordinates: X=48, Y=29, Z=22). Bottom: Areas that have a stronger positive correlation with κ stim in the CUD group than in healthy controls (MNI coordinates: X=48, Y=29, Z=20). Color bar on the right-hand side represents t.