Abstract
Nobody wants to experience anxiety. However, anxiety may be induced by our own implicit choices that are mis-reinforced by some imbalance in reinforcement learning. Here we focused on obsessive-compulsive disorder (OCD) as a candidate for implicitly learned anxiety. Simulations in the reinforcement learning framework showed that agents implicitly learn to become anxious when the memory trace signal for past actions decays differently for positive and negative prediction errors. In empirical data, we confirmed that OCD patients showed extremely imbalanced traces, which were normalized by serotonin enhancers. We also used fMRI to identify the neural signature of OCD and healthy participants with imbalanced traces. Beyond the spectrum of clinical phenotypes, these behavioral and neural characteristics can be generalized to variations in the healthy population.
Introduction
Humans sometimes experience anxiety. Given that no one likes to feel anxious, we tend to believe that anxiety is passively driven by external factors. However, is it possible that our own choices implicitly induce anxiety? We make mental choices when thinking or mind-wandering, as well as in response to the external world. We are always learning appropriate choices through our experiences, and some choices are learned in subliminal situations1, 2. Indeed, some evidence suggests that many of our decisions and actions in everyday life are shaped by implicit learning processes3. Hence, our anxiety might be induced by our own choices that are reinforced through implicit learning.
When investigating the possibility of implicit choices leading to anxiety, we must consider its theoretical and biological implementation. Implicit learning requires repetitive experiences1, 2. Reinforcement learning theory provides a framework for learning appropriate choices through repetitive experiences4. Reinforcement learning reinforces or punishes recent choices based on the difference in the actual outcome from the prediction, called the prediction error. A positive prediction error reinforces recent choices, whereas a negative error punishes them. When examining the biological implementation of implicit learning, it is essential to consider previous work demonstrating that the activity of dopamine-projecting neurons in the midbrain resembles prediction error5. Dopaminergic neurons widely project to cortical and subcortical areas. One of the main targets of dopamine is the striatum, which is located in the basal ganglia6. In the striatum, distinct types of dopamine receptors—D1 and D2 receptors—are expressed in exclusive neuron groups, which are considered to be involved in distinct circuit patterns, namely, direct and indirect pathways6. These distinct neural systems are thought to play different roles in reinforcement learning. Artificial activations of D1- and D2-expressing neurons after a certain behavioral action reinforce and punish the action, respectively7. Direct and indirect pathway neurons respond to positive and negative outcomes, respectively8. Dopamine-dependent synaptic plasticity in corticostriatal synapses, which underlies reinforcement learning, shows opposing dopamine dependence for D1 and D2 neurons9, 10. These results imply that reinforcement and punishment of recent actions may be reflected through different neural systems: the direct and indirect pathways.
In the implementation of implicit learning, we should also consider delay as well as reinforcement/punishment. In general, the outcome of a certain choice is available after a certain delay. Therefore, reinforcement and punishment should be assigned to recent choices within a certain time scale. This is called credit assignment, which is implemented as eligibility traces in reinforcement learning4. An eligibility trace can be implemented as a memory trace in each synapse10-12. In this case, the trace time scales for reinforcement and punishment should be separately controlled in direct and indirect pathways. Reinforcement learning theory requires that both trace time scales be equal. However, the requirement cannot be completely realized in distinct neural systems.
Here, we propose a computational model that enables the implementation of imbalanced learning, and we aimed to determine whether the imbalance induces some defects in specific situations. We focused on obsessive-compulsive disorder (OCD) as a candidate for such an imbalanced condition for two reasons. First, abundant evidence suggests that imbalance between direct and indirect pathways is central to the pathophysiology of OCD13-15. Second, obsession in OCD can be an implicit choice leading to anxiety. In patients with OCD, an intrusive thought becomes lodged in their mind and drives obsessive anxiety16. To neutralize and relieve the anxiety, they tend to perform a compulsive action, despite its cost (e.g., excessively washing their hands or repeatedly checking their keys), even though the anxiety will spontaneously diminish before long if no action is taken17.
This article proposes a computational model to reproduce the abnormal repetition of obsession and compulsion in OCD symptoms, incorporating two hypotheses: (1) obsession is induced by patients’ implicit choices, and (2) eligibility trace time scales are imbalanced for reinforcement and punishment in the reinforcement learning framework. We demonstrate that the imbalance in trace time scales mis-reinforces implicit choices leading to obsession, resulting in a spiral of repetitive obsession and compulsion. We tested our hypothesis in a behavioral task for healthy participants and participants with OCD and evaluated the neural substrates of imbalanced eligibility trace time scales using resting-state functional magnetic resonance imaging (rs-fMRI).
Results
OCD-like behavior in a separate eligibility trace model
Eligibility traces determine the credit assignment of the outcome prediction error to recent actions4. Each trace represents the recent frequency of an action in a specific state within a time scale determined by trace decay factor ν (see Online Methods). When a prediction error occurs, the choice probability of each action in each state is updated by the product of the eligibility trace and the prediction error (Figure 1a). Thus, a positive prediction error reinforces recent actions within a time scale, whereas a negative error punishes them. We assumed that the eligibility traces for reinforcement and punishment might be implemented in distinct neural systems (red and blue in Figure 1a; see Online Methods). Theoretically, the trace factors in the two systems, ν±, should be balanced: ν+ = ν−. However, the separate neural implementation makes it difficult to perfectly maintain the balance. Here, we assumed that the balance was not maintained: ν+ ≠ ν−.
We modeled mental states involved in anxiety as stochastic transitions between two states: relief and anxiety (Figure 1b). We assumed an action to relieve anxiety as an option in the anxiety state, which could stochastically permit a transition from the anxiety state to the relief state at a certain cost. The action might become compulsive in OCD, and we labeled it “compulsion”. We unified any other options in the anxiety state, which could relieve the anxiety at a lower probability without any cost, into the option “other”. Our fundamental assumption was that the transition from the relief state to the anxiety state would be caused by the individual’s own choice, which might become obsessive in OCD. We labeled this option “obsession”. We also unified any other options in the relief state, which maintained the relief state, into “other”. We assumed that every stay in the anxiety state would produce a negative outcome.
No positive outcome was introduced in the anxiety-relief transition model. Hence, maintaining “other” in the relief state is clearly optimal. Normal reinforcement learning should reduce the obsession rate. However, if the learning system has some defects, such as an imbalance in trace factors, then the learning system may fail to reduce the obsession and fall into a spiral of obsession and compulsion. To show this possibility, we simulated actor-critic learning (see Online Methods), a typical method of reinforcement learning models4 in the anxiety-relief state-transition (Figure 1b), incorporating a separate eligibility trace (Figure 1a). When the imbalance in the trace factors was moderate, actor-critic learning was able to reduce the obsession rate, even when the simulation started at a high obsession rate (Figure 1c). In contrast, extreme imbalance induced an increase in the obsession rate that culminated in a spiral of repetitive obsession and compulsion (Figure 1d). Such OCD-like behavior was reduced by preventing the compulsion during exposure to the anxiety state (green bar in Figure 1d). This is the behavioral therapy of exposure and response prevention (ERP) and is one of the first-line treatments of OCD18.
We identified the condition for the trace factors ν± in which OCD-like behavior might emerge by numerical simulations in a representative set of other model parameters (Figure 1e; see Online Methods). In addition to actor-critic learning (leftmost in Figure 1e), we simulated SARSA (State-Action-Reward-State-Action) and Q-learning (right in Figure 1e), other typical reinforcement learning models4. The color map represents the fraction of 100 simulation runs in which obsession was reinforced at each pair of (ν+, ν−). We identified the conditions of OCD-like behavior in the region ν+ > ν− (Figure 1e). Namely, the trace scale for reinforcement was longer than that for punishment. We also identified the condition in which the behavioral therapy of ERP would not work and localized it to a region with more robust imbalance (Figure 1f).
We obtained the theoretical conditions as functions of the other parameters (see the Supplementary Note). Each OCD patient presents a certain compulsive action for specific anxieties among many types of anxiety. Therefore, the corresponding parameters of the anxiety-relief transition model (Figure 1b) should differ by the type of anxiety. Next, we derived the condition for the learning parameters in which OCD-like behavior emerges in some type of anxiety in actor-critic, SARSA, and Q-learning and obtained the common condition that ν+ > ν− (see Section 3.3 in the Supplementary Note). This result suggests that a person with ν+ > ν− has a risk of OCD.
Participants and behavioral task
To test the suggestion that ν+ > ν−, derived from our computational model of OCD, we applied the delayed feedback task to patients with OCD (n = 33) and healthy controls (HCs) (n = 168). Fifteen HCs were excluded for medical or experimental reasons, and the subsequent analysis was conducted in 33 OCD patients and 153 HCs (Supplementary Table 1). Given the therapeutic effect of serotonin reuptake inhibitors (SRIs) in OCD18, we reasoned that SRIs might normalize the imbalanced setting of ν+ > ν−. Because a meta-analysis of SRI treatment suggested that higher doses of SRI are more effective in the treatment of OCD versus other psychiatric conditions, such as major depressive disorder19, we divided the OCD patients into two groups by the dose equivalence of SRIs20 (Online Methods and Supplementary Table 2): OCD patients with higher SRI doses were grouped into OCDHighSRI (n = 10), and those with lower SRI doses or no psychotropic medications were OCDLow-NoSRI (OCDLowSRI, n = 10; OCDNoSRI, n = 13). There were no significant differences in obsessive-compulsive and depressive symptoms evaluated by the Yale-Brown Obsessive-Compulsive Scale21 and the 17-item Hamilton Depression Rating Scale (HDRS)22 between the OCDHighSRI and OCDLow-NoSRI groups (Supplementary Table 1). In addition, no patients had current major depressive disorder as a comorbidity (Online Methods).
We used the delayed feedback task to evaluate the trace factors ν± as in our previous research23 except for the presented stimuli (abstract cues). Briefly, participants chose one of two options displayed on the screen by pressing a left or right button in each trial (Figure 2a). Each stimulus represented different outcomes (+10, +40, −10, or −40 yen) and a delay to the feedback (immediately after a button press or three trials later) (Figure 2b and c). For example, +40(0) represented a gain of 40 yen within the current trial (immediate reward) and −10(3) represented a loss of 10 yen after three trials (delayed punishment). The participants were not told about the stimulus-outcome associations shown in Figure 2b. They received money after the experiment in proportion to the total outcome obtained in the delayed feedback task. To maximize the total outcome, the participants needed to learn the appropriate stimulus choices by correctly assigning the credit of the current feedback to the recent choices that caused the current feedback. The difference in learning effects between immediate and delayed feedbacks reflected the trace factors ν±.
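As an illustration of why delayed feedback probes the trace factors, assume (a sketch only, not the fitted model) that a trace decays by a factor ν on every trial, so that a choice made k trials before the feedback is credited with relative weight w(k) = ν^k. The symbol w(k) is introduced here for illustration, and the values 0.8, 0.6, and 0.1 are the illustrative simulation settings given in the Online Methods, not estimates from participants:

```latex
w(k) = \nu^{k}, \qquad
w(3)\big|_{\nu = 0.8} = 0.512, \quad
w(3)\big|_{\nu = 0.6} = 0.216, \quad
w(3)\big|_{\nu = 0.1} = 0.001
```

Under this assumption, immediate feedback (k = 0) always receives full weight, whereas the credit that feedback delayed by three trials assigns to the responsible choice shrinks rapidly as ν decreases.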
Behavioral results
Based on our simulation of OCD-like behavior with trace factors ν+ > ν− (Figure 1e), patients with OCD would show impaired learning in the stimuli with delayed feedback. Therefore, the optimal choice rates between the representative four pairs of stimuli with the same delay and different magnitudes [pairs with no delays: +40(0) vs. +10(0) and −10(0) vs. −40(0); pairs with delays: +40(3) vs. +10(3) and −10(3) vs. −40(3)] were compared among groups. Consistent with our trace factor hypothesis, the OCDLow-NoSRI group showed impaired learning of stimuli with delays and performed at chance levels (Figure 3). Notably, OCDHighSRI patients exhibited intact learning similar to HCs in all pairs of interest.
A mixed-design two-way repeated-measures ANOVA with a within-participant factor of sessions (1–6 sessions) and a between-participant factor of groups (OCDLow-NoSRI patients, OCDHighSRI patients, and HCs) was conducted to clarify the between-group differences. There were significant interactions in the learning of stimulus pairs with delays. In the +10(3) vs. +40(3) pair, the interaction between session and group (F(7.05, 645.34) = 2.32, p = 0.024, ηp2 = 0.025) and the simple main effects of group were significant in sessions 4 (F(2, 183) = 3.97, p = 0.021, ηp2 = 0.042) and 5 (F(2, 183) = 6.27, p = 0.0023, ηp2 = 0.064). Bonferroni-Holm–corrected post-hoc comparisons confirmed OCDLow-NoSRI patients < HCs (t(183) = 2.81, padjusted = 0.016) in session 4 and OCDLow-NoSRI patients < HCs (t(183) = 3.54, padjusted = 0.0015) in session 5 (Figure 3).
Regarding the −10(3) vs. −40(3) pair, the interaction between session and group (F(7.42, 678.8) = 2.09, p = 0.039, ηp2 = 0.022) and the simple main effects of group were significant in sessions 5 (F(2, 183) = 7.59, p = 0.0007, ηp2 = 0.077) and 6 (F(2, 183) = 5.83, p = 0.0035, ηp2 = 0.060). Post-hoc comparisons confirmed significantly impaired learning in the OCDLow-NoSRI group: OCDLow-NoSRI patients < HCs (t(183) = 3.87, padjusted = 0.0005) and OCDLow-NoSRI patients < OCDHighSRI patients (t(183) = 2.38, padjusted = 0.036) in session 5, and OCDLow-NoSRI patients < HCs (t(183) = 3.41, padjusted = 0.0024) in session 6 (Figure 3).
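For readers unfamiliar with the Bonferroni-Holm correction applied to the post-hoc comparisons, the step-down procedure can be sketched as follows. This is a generic textbook implementation, not the analysis code used in this study; the function name `holm_adjust` is hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the input order."""
    m = len(p_values)
    # Visit the raw p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

With three pairwise group comparisons, the smallest raw p-value is multiplied by 3, the next by 2, and the largest by 1, and each adjusted value is at least as large as the one before it.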
Computational model-based behavioral analysis and results
We fitted the behavioral data with the actor-critic learning model using the learning rate (α), exploration-exploitation degree (β), and separation of the eligibility trace for reinforcement and punishment (ν±) (see Online Methods). Because we were specifically interested in the individual variance represented by reinforcement learning parameters, we fitted parameters to each participant’s data independently using maximum a posteriori estimation rather than to pooled data as a group24.
To clarify the ν+/ν− distribution of each group, we projected each participant’s estimated parameters to ν+/ν− space and visualized them using kernel density estimation (Figure 4a). Consistent with our computational model simulation of OCD, OCDLow-NoSRI participants were distributed to the ν+ > ν− imbalanced area, whereas the distribution was balanced in HCs and OCDHighSRI participants (Figure 4a). To compare the ν+/ν− distribution among groups, we confirmed the multivariate homogeneity of group dispersions (F(2, 183) = 1.06, p = 0.35) and applied permutational multivariate analysis of variance (PERMANOVA). The ν+/ν− distribution was significantly different (PERMANOVA: F(1, 184) = 6.41, p = 0.0039), with Bonferroni-Holm–corrected post-hoc pairwise comparisons revealing a significant difference between OCDLow-NoSRI participants and HCs (padjusted = 0.014) (Figure 4a). Parameters α and β were not significantly different among groups (Kruskal-Wallis test, p > 0.05).
We conducted clustering analysis using hierarchical density-based spatial clustering of applications with noise (HDBSCAN)25 to evaluate the diversity in HCs. HDBSCAN revealed two clusters in HCs: a balanced ν cluster (Figure 4a, blue circle; n = 83) and an imbalanced ν cluster (Figure 4a, white circle; n = 59); the remaining 11 HCs were not clustered. Obsessive-compulsive trends and a propensity to adhere to fine-grained details were compared between clusters using five subscales of the Padua Inventory (PI; “Checking”, “Dirt”, “Doubt”, “Impulse”, “Precision”)26 and the Attention to Detail subscale of the Autism Quotient (AQ)27, respectively. The imbalanced HC cluster showed significantly higher scores in the PI Checking score (Brunner-Munzel test, statistic = −2.11, p = 0.019) and the AQ Attention to Detail score (Brunner-Munzel test, statistic = −2.79, p = 0.0030) than the balanced HC cluster (Figure 4b). Significant differences were still found after false discovery rate correction for multiple comparisons among scales (PI Checking, padjusted = 0.046; AQ Attention to Detail, padjusted = 0.015). In OCD participants taking SRIs (n = 20), the equivalent dose of SRIs was significantly correlated with the imbalanced settings of ν (ν+–ν−) [Spearman’s rank correlation (20); r = −0.61, p = 0.0045; Figure 4c]. That is, higher doses of SRIs normalized the imbalanced settings of ν in participants with OCD.
Neural signatures of OCD and the imbalanced ν+ > ν− HC cluster
The above behavioral results revealed the presence of imbalanced clusters in both the OCD patients and the HCs. We next explored such imbalanced conditions of ν+ > ν− in neural circuits. Recently, various behavioral and demographic characteristics and psychiatric conditions have been thought to be represented in a brain network, even in the resting state14, 15, 28-30. Here, we explored the neural signature of nonmedicated patients with OCD hypothetically related to ν+ > ν− and extended it into an imbalanced (ν+ > ν−) HC cluster using rs-fMRI. First, we constructed a whole-brain functional connectivity (FC) matrix using Cole-Anticevic Brain-wide Network Partition (CAB-NP)31 and compared 49 OCDNoSRI patients and 53 HCs (dataset A; Online Methods and Supplementary Table 4) using network-based statistics (NBS)32 (initial threshold, t = 3.87; 10,000 random permutations; see Online Methods). We detected a significant network component related to OCD (OCD network) with significantly increased connectivity in OCDNoSRI patients compared with HCs (padjusted = 0.022, Figure 5a). The OCD network comprises nodes in the dorsolateral prefrontal cortex (DLPFC), parietal cortex, retrosplenial cortex, and the hippocampal formation and parahippocampal region. These brain regions mainly belong to the default mode network (DMN) or frontoparietal network (FPN) (Supplementary Figure 1). To confirm the robustness of the OCD network, we compared the mean FC of the OCD network between 10 OCDNoSRI patients and 18 HCs in an entirely independent dataset (dataset B; Online Methods and Supplementary Table 5). Similarly, the mean FC within the OCD network was significantly higher in OCDNoSRI patients than in HCs (Figure 5b, Brunner-Munzel test, statistic = −3.11, p = 0.0027). Finally, we explored whether the imbalanced (ν+ > ν−) HC cluster also showed OCD-like characteristics at the level of the functional network.
We compared all FCs of the OCD network between 10 HCs in the imbalanced and balanced clusters (dataset C: independent of datasets A and B; Online Methods and Supplementary Table 6) detected in our delayed feedback task. We found that the FC between the DLPFC and presubiculum was significantly increased in the imbalanced (ν+ > ν−) HC cluster, similar to OCD (Figure 5c, Brunner-Munzel test, statistic = −2.94, p = 0.0051). After false discovery rate correction for multiple comparisons among the FCs of the OCD network, this difference remained only at trend level (padjusted = 0.066).
Discussion
In this study, we showed that implicit choices in our mind can induce anxiety, which is related to imbalance in eligibility trace time scales for reinforcement and punishment. Specifically, we constructed a computational model of OCD using a separate eligibility trace model (Figure 1a–d) and found extremely imbalanced trace factors ν+ > ν− in OCD (Figure 4a) and its neural substrate (Figure 5a) in our empirical data. In addition, behavioral therapy (ERP) and psychotropic medication (SRIs), which are the first-line treatments for OCD, were reflected in our computational model (Figure 1d) and behavioral results (Figure 4c), respectively.
While the theoretical framework of the eligibility trace has long been conceptualized in the field of reinforcement learning4, its empirical evidence has been reported relatively recently10, 33-35. The latest theories regarding synaptic plasticity have proposed that the co-activation of pre- and postsynaptic neurons sets a flag at the synapse (eligibility trace) that leads to a weight change only if additional factors (i.e., reinforcement or punishment) are present while the flag is set12. These additional factors could be implemented by the phasic activity of some neuromodulators, such as dopamine, which is supposed to represent the prediction error5. Although the detailed mechanisms concerning how the trace time scales for reinforcement and punishment are modulated remain unclear, our theoretical consideration of the anxiety-relief transition model showed that the imbalanced trace factors ν+ > ν− could lead to the condition of repetitive choices of anxiety and its relief in the model, similar to obsession and compulsion in OCD (Figure 1a–d). Our experimental data from OCD patients and HCs strongly support the predictions from our computational model, namely, the apparent impairment in the learning with delayed feedback (Figure 3) and the extremely imbalanced trace factors ν+ > ν− in OCD (Figure 4a). It is noteworthy that the imbalanced trace factors ν+ > ν− are quite convincing because the conventional pathophysiological model of OCD suggests excess tone in the direct pathway over the indirect pathway13-15, which are supposed to be related to ν+ and ν−, respectively.
There have been several computational models of OCD, all of which primarily focused on compulsive behaviors36-38. Compulsive actions are thought to be reinforced by the rewarding effect of relief from anxiety17. Excessive anxiety will induce excessive reinforcement of compulsion leading to habit formation38. This excessively habitual compulsion is considered a cause of OCD38, 39. However, obsessive thoughts that drive anxiety also increase with the severity of OCD symptoms17. Although the essence of OCD symptoms is the abnormal repetition of obsession and compulsion17, there is so far no unified model to explain the growth of both obsession and compulsion. Our unified model can represent the vicious circle of obsession and compulsion, which is the phenomenological characteristic of OCD17 (Figure 1a–d). Moreover, our model extends our understanding of the therapeutic effects of first-line treatments of OCD. While ERP seems to promote the appropriate choices to prevent obsession, even under imbalanced trace factors (Figure 1d), SRIs resolve the imbalance itself (Figure 4a and c). In practice guidelines for the treatment of OCD18, the combination therapy of ERP and SRI is recommended if patients do not respond to ERP monotherapy. We identified such a condition where ERP would fail in our computational model (Figure 1f). SRI add-on therapy in this ERP-refractory condition can be viewed as assisting in the normalization of trace factors. Regarding the relationships between the neuromodulator serotonin and the time scales of the eligibility trace, serotonin seems to modulate the synaptic plasticity in many brain regions directly or indirectly through its regulatory effects on other neuromodulatory systems12, 33, 40, 41. In our previous study, we also found similar modulatory effects on the trace factor time scales: depletion of the serotonin precursor tryptophan increased the ν+ > ν− imbalance compared with the control condition23.
There is still abundant scope for further research aimed at elucidating how serotonin modulates the time scale of trace factors.
Using rs-fMRI, we found that patients with OCD exhibited significantly increased connectivity of the OCD network component, which consists of the DLPFC, parietal cortex, retrosplenial cortex, hippocampal formation, and parahippocampal region in two independent datasets (Figure 5a and b). Specifically, the OCD network is mainly composed of FCs within the DMN (7 of 13 FCs) and FCs between the DMN and FPN (4 of 13 FCs) (Supplementary Figure 1). These results are quite consistent with a recent meta-analysis of rs-fMRI findings in OCD, in which the authors reported FC alterations within and between the DMN and FPN42. Moreover, the DLPFC, parietal cortex, and hippocampus have been implicated in encoding the eligibility traces related to reward-based decision-making for different time scales12, 43-45. These brain regions might help to deal with the multiple eligibility trace time scales required for reinforcement learning as a functional network, together with the retrosplenial cortex, which has dense anatomical connections with all of the other detected brain regions46.
Beyond the range of clinical phenotypes seen in patients with OCD, a broader continuum of the obsessive-compulsive trait is also observed in behavioral and neural characteristics. Specifically, the imbalanced (ν+ > ν−) HC cluster showed a significantly greater propensity for Checking and Attention to Detail (Figure 4b) and increased FC between the DLPFC and presubiculum, which belongs to the OCD network (Figure 5c). These results not only increase the reliability of our clinical findings, but also support the generalizability of our findings to a broader population on the obsessive-compulsive continuum. Further study with greater focus on OCD patients and their unaffected healthy first-degree relatives should be performed to evaluate the potential of our findings as an endophenotype of OCD47.
In this study, we provided evidence that one’s own implicit choices can induce anxiety in OCD. However, anxiety can itself manifest in many different forms. In some situations, anxiety may motivate people to take action or strive to meet a goal. In other situations, people may experience anxiety as a symptom of an anxiety disorder such as social anxiety disorder and generalized anxiety disorder. We previously demonstrated the commonality of different types of anxiety, including obsessive-compulsive anxiety, using fMRI29. Because we cannot cover all forms of anxiety in this research, further study with greater focus on the relationships between implicit choices and various representations of anxiety from computational aspects should be performed.
Conclusion
Anxiety can feel like a calamity that befalls us. However, our research shows that our own implicit choices can trigger anxiety. Psychiatric symptoms are often thought of as alterations in the mind that are not directly quantifiable, but they can be directly assessed through the creation of appropriate computational models. Although it is currently difficult to identify treatment-resistant patients from their clinical symptoms, our computational model suggests that patients with extremely imbalanced trace scale parameters may not respond to behavioral therapy alone. These results suggest that our findings could one day be applied to the appropriate selection of OCD treatment. In addition, psychiatric symptoms have been regarded in recent years as a symptom dimension common to various mental diseases, rather than being specific to a disease. In this study, we focused on patients with OCD and healthy participants, but our approach could be applied to the assessment of the anxiety dimension in various populations48. Future research is needed to address these hypotheses in prospective longitudinal cohorts or in larger cohorts with various psychiatric symptoms.
Online Methods
Separate eligibility trace model
Eligibility traces determine the weight assignment of outcome prediction errors for recent actions (Figure 1a). If the eligibility traces for positive and negative prediction errors are implemented in distinct neural systems, the respective eligibility traces and at time step t for action a (a = 0, 1, …) in state s (s = 0, 1, …) obey the following equation4, where Xas(t) = 1 when action a is chosen in state s at time step t and Xas(t) = 0 otherwise. The factors ν± (0 < ν± < 1) determine the decaying time scales of the eligibility traces. The policy parameters qas(t) (a = 0, 1, …) determine choice probability in state s as a soft-max function, where β represents the degree of the exploration-exploitation balance. Each policy parameter is updated as below, where ε(t) denotes the outcome prediction error and max{0, ε(t)} and min{0, ε(t)} represent positive and negative components of the prediction error, respectively (Figure 1a). Theoretically, the trace factors ν± should be balanced as ν+ = ν−. However, the separate neural implementation makes it difficult to perfectly maintain the balance. Here, we assumed that the balance could be broken: ν+ ≠ ν−.
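One form of the trace, soft-max, and update rules that is consistent with the definitions in this subsection is sketched below. The accumulating form of the trace, the symbols e±as(t) for the two traces and πas(t) for the choice probability, and the use of the learning rate α as the update gain are assumptions made for illustration:

```latex
\begin{align}
e^{\pm}_{as}(t) &= \nu^{\pm}\, e^{\pm}_{as}(t-1) + X_{as}(t), \\
\pi_{as}(t) &= \frac{\exp\bigl(\beta\, q_{as}(t)\bigr)}{\sum_{a'} \exp\bigl(\beta\, q_{a's}(t)\bigr)}, \\
q_{as}(t+1) &= q_{as}(t) + \alpha \Bigl[ e^{+}_{as}(t)\, \max\{0, \varepsilon(t)\} + e^{-}_{as}(t)\, \min\{0, \varepsilon(t)\} \Bigr].
\end{align}
```

In this form, setting ν+ = ν− makes the two traces identical, and the update reduces to the usual single-trace rule in which the full prediction error ε(t) multiplies one trace.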
The outcome prediction error ε(t) is determined as a function of the current outcome r(t) and the policy parameters {qas}. The form depends on the learning algorithm. In actor-critic learning, the outcome prediction is based on the state value vs = ∑a qas, and the prediction error is ε(t) = r(t) + γvs(t+1) − vs(t), where s(t) and s(t + 1) denote the current and next states. The parameter γ denotes the discount factor that determines the weight of future prediction. In SARSA and Q-learning, the outcome prediction is based on the action value estimated as each policy parameter qas, and the prediction error is ε(t) = r(t) + γVs(t+1) − qa(t)s(t). The difference is in the term of the next state value: Vs(t+1) = qa(t+1)s(t+1) in SARSA and Vs(t+1) = maxa qas(t+1) (the maximum over actions a in the next state) in Q-learning.
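The three prediction errors described above can be written compactly in code. The following is a minimal sketch, assuming the policy parameters are stored as a nested list `q[a][s]`; the function names are hypothetical and are not taken from the study's code.

```python
def td_error_actor_critic(r, gamma, q, s, s_next):
    """epsilon = r + gamma * v[s_next] - v[s], with v[s] = sum over a of q[a][s]."""
    v = sum(q[a][s] for a in range(len(q)))
    v_next = sum(q[a][s_next] for a in range(len(q)))
    return r + gamma * v_next - v


def td_error_sarsa(r, gamma, q, a, s, a_next, s_next):
    """Next-state value is the parameter of the action actually chosen next."""
    return r + gamma * q[a_next][s_next] - q[a][s]


def td_error_q_learning(r, gamma, q, a, s, s_next):
    """Next-state value is the maximum parameter over actions in the next state."""
    v_next = max(q[b][s_next] for b in range(len(q)))
    return r + gamma * v_next - q[a][s]
```

The only difference between the SARSA and Q-learning functions is the next-state term, which mirrors the distinction drawn in the text.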
Anxiety-relief transition model
We modeled the mental states involved in anxiety as stochastic transitions between two states: relief (s=0) and anxiety (s=1) (Figure 1b). We assumed two options in each state: “compulsion” (a=1) and “other” (a=0) in the anxiety state, and “obsession” (a=1) and “other” (a=0) in the relief state. We defined a matrix b, with elements bas, to determine the state-transition probabilities. We assumed that every stay in the anxiety state produced a negative outcome normalized to −1. The relative cost of compulsion was denoted as c. Although b00>0 allows passive anxiety, we focused on b00=0 in the main article. More general cases, including passive anxiety, are considered in the Supplementary Note.
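One reading of the matrix b that is consistent with the parameter values used for Figure 1 (b00 = 0, b10 = 1, b01 = 0.9, b11 = 0.5) is that each element gives the probability that the next state is anxiety after action a in state s. This interpretation is an assumption made here for illustration:

```latex
\begin{align}
P\bigl(s(t+1) = 1 \mid a(t) = a,\ s(t) = s\bigr) &= b_{as}, \\
P\bigl(s(t+1) = 0 \mid a(t) = a,\ s(t) = s\bigr) &= 1 - b_{as}.
\end{align}
```

Under this reading, b10 = 1 means that obsession always leads to anxiety, b11 = 0.5 and b01 = 0.9 mean that compulsion relieves anxiety more often than "other" does, and b00 > 0 corresponds to anxiety arising without an obsessive choice.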
Demonstration of OCD-like behavior
For the example in Figure 1, we fixed the parameters of the anxiety-relief transition as b00=0, b10=1, b01=0.9, b11=0.5, and c=0.01 and the learning parameters as α=0.1, β=1, γ = 0.5, and ν+=0.8, except for ν−. For Figure 1c, the trace factor for punishment was set as ν−=0.6 and the initial values of variables as . The simulation consisted of 100,000 time steps. For Figure 1d, we set the trace factor as ν−=0.1 and the initial values as and q10(0) = −3, to show that reinforcement of obsession started from even a low obsession rate. After 50,000 time steps, compulsion (a=1 in s=1) was always prevented, and other (a=0 in s=1) was forced regardless of the choice probability, demonstrating the behavioral therapy of ERP.
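The simulation described above can be sketched in a few lines of Python. This is a minimal illustration under several labeled assumptions, not the code used for Figure 1: the traces are assumed to accumulate (decay, then increment), bas is assumed to be the probability that the next state is anxiety, the outcome is assumed to be −1 on entering or staying in anxiety minus c when compulsion is chosen, and all initial values other than q10(0) are assumed to be zero. The function names `sigmoid` and `simulate` are hypothetical, and whether this sketch reproduces Figure 1c and d has not been verified.

```python
import math
import random


def sigmoid(x):
    """Numerically safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def simulate(nu_plus, nu_minus, steps=100_000, alpha=0.1, beta=1.0, gamma=0.5,
             cost=0.01, q10_init=0.0, erp_from=None, seed=0):
    """Return the probability of choosing obsession in relief at each step."""
    rng = random.Random(seed)
    # ASSUMPTION: b[a][s] is the probability that the next state is anxiety.
    b = [[0.0, 0.9], [1.0, 0.5]]
    q = [[0.0, 0.0], [q10_init, 0.0]]   # q[a][s]; the zeros are assumed
    e_pos = [[0.0, 0.0], [0.0, 0.0]]    # trace paired with positive errors
    e_neg = [[0.0, 0.0], [0.0, 0.0]]    # trace paired with negative errors
    s = 0                                # start in the relief state
    history = []
    for t in range(steps):
        history.append(sigmoid(beta * (q[1][0] - q[0][0])))
        # Two-action soft-max: P(a = 1) = sigmoid(beta * (q1 - q0)).
        a = 1 if rng.random() < sigmoid(beta * (q[1][s] - q[0][s])) else 0
        if erp_from is not None and t >= erp_from and s == 1:
            a = 0                        # ERP: compulsion is prevented
        s_next = 1 if rng.random() < b[a][s] else 0
        # ASSUMPTION: -1 for landing in anxiety, minus the cost of compulsion.
        r = (-1.0 if s_next == 1 else 0.0) - (cost if (s == 1 and a == 1) else 0.0)
        # ASSUMPTION: accumulating traces, decayed and then incremented.
        for i in range(2):
            for j in range(2):
                hit = 1.0 if (i == a and j == s) else 0.0
                e_pos[i][j] = nu_plus * e_pos[i][j] + hit
                e_neg[i][j] = nu_minus * e_neg[i][j] + hit
        # Actor-critic prediction error with v_s = sum over a of q_as.
        v = q[0][s] + q[1][s]
        v_next = q[0][s_next] + q[1][s_next]
        eps = r + gamma * v_next - v
        # Positive and negative error components use separate traces.
        for i in range(2):
            for j in range(2):
                q[i][j] += alpha * (e_pos[i][j] * max(0.0, eps)
                                    + e_neg[i][j] * min(0.0, eps))
        s = s_next
    return history
```

For example, `simulate(0.8, 0.1, q10_init=-3.0, erp_from=50_000)` corresponds to the settings stated for Figure 1d, with the total number of steps left at the default as an assumption.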
For Figure 1e and f, we evaluated the fraction of 100 simulation runs in which the obsession rate was reinforced from a low obsession rate on 40 × 40 grids in (ν+, ν−) space. In each simulation, 200 instances of forced obsessions were intermittently caused in the relief state because spontaneous obsession scarcely occurred at a low obsession rate. The simulation was judged to be obsession-reinforced if the choice probability of obsession became larger than the initial value. The initial values of variables were set as , and q10(0) = −5. Each forced obsession was caused when sufficiently approached zero in the relief state .
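The simulation loop can be sketched as follows. This is an illustrative reconstruction under several assumptions, not the authors' MATLAB code: the traces are accumulating traces that decay by ν+ or ν− at every step, the trace used for an update is chosen by the sign of ε, choices follow a softmax with inverse temperature β, transitions and outcomes follow the reading of b and c described in the lead-in assumptions here (b as the probability of entering anxiety, outcome −1 when the next state is anxiety, minus c for compulsion), and all variables start at zero, which is a placeholder rather than the initial values used for the figures.

```python
import math
import random


def simulate(n_steps, nu_plus, nu_minus, alpha=0.1, beta=1.0, gamma=0.5,
             b=((0.0, 0.9), (1.0, 0.5)), c=0.01, seed=0):
    """Actor-critic with separate traces; returns (q, obsession rate)."""
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]       # q[a][s]; zero start is hypothetical
    e_pos = [[0.0, 0.0], [0.0, 0.0]]   # trace used when epsilon > 0
    e_neg = [[0.0, 0.0], [0.0, 0.0]]   # trace used when epsilon <= 0
    s, relief_visits, obsessions = 0, 0, 0
    for _ in range(n_steps):
        # Softmax over two actions; the argument is clipped to avoid overflow.
        x = beta * (q[1][s] - q[0][s])
        p1 = 1.0 / (1.0 + math.exp(-max(-50.0, min(50.0, x))))
        a = 1 if rng.random() < p1 else 0
        if s == 0:
            relief_visits += 1
            obsessions += a
        # Assumed transition and outcome rule.
        s_next = 1 if rng.random() < b[a][s] else 0
        r = (-1.0 if s_next == 1 else 0.0) - (c if (s == 1 and a == 1) else 0.0)
        # Actor-critic prediction error with v_s = sum over a of q[a][s].
        eps = r + gamma * (q[0][s_next] + q[1][s_next]) - (q[0][s] + q[1][s])
        # Decay both traces and mark the action just taken.
        for i in range(2):
            for j in range(2):
                hit = 1.0 if (i == a and j == s) else 0.0
                e_pos[i][j] = nu_plus * e_pos[i][j] + hit
                e_neg[i][j] = nu_minus * e_neg[i][j] + hit
        trace = e_pos if eps > 0 else e_neg
        for i in range(2):
            for j in range(2):
                q[i][j] += alpha * eps * trace[i][j]
        s = s_next
    return q, obsessions / max(relief_visits, 1)


# Short run with the trace factors of Figure 1c (far fewer steps than 100,000).
q_final, obsession_rate = simulate(2000, nu_plus=0.8, nu_minus=0.6)
```

Scanning nu_plus and nu_minus over a grid and repeating the run with different seeds would give the kind of map described for Figure 1e and f, although the forced-obsession procedure is not reproduced here.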
Participants
In total, 33 patients with OCD and 168 HCs participated in the behavioral task. Fifteen HCs were excluded for medical or experimental reasons, and the subsequent analysis was conducted in 33 OCD patients and 153 HCs. There were no significant differences in age, sex, or handedness (Supplementary Table 1). All OCD patients and 13 of the 153 HCs were recruited at Kyoto Prefectural University of Medicine (KPUM), whereas 140 of the 153 HCs were recruited at Advanced Telecommunications Research Institute International (ATR). The Medical Committee on Human Studies at KPUM and the Ethics Committee at ATR approved all procedures in this study. All participants gave written informed consent after receiving a complete description of the study. All methods were carried out following the approved guidelines and regulations. Trained, experienced clinical psychiatrists and psychologists assessed all participants.
All patients were primarily diagnosed using the Structured Clinical Interview for DSM-IV Axis I Disorders-Patient Edition (SCID)49. Exclusion criteria were 1) cardiac pacemaker or other metallic implants or artifacts; 2) significant disease, including neurological diseases, disorders of the pulmonary, cardiac, renal, hepatic, or endocrine systems, or metabolic disorders; 3) prior psychosurgery; 4) psychotropic medication except for SRIs; 5) DSM-IV diagnosis of mental retardation and pervasive developmental disorders based on a clinical interview and psychosocial history; and 6) pregnancy. We excluded patients with current DSM-IV Axis I diagnosis of any significant psychiatric illness except OCD as much as possible, and only two patients with trichotillomania, one patient with a tic disorder, one patient with panic disorder, and one patient with bulimia nervosa were included as patients with comorbidities. The experienced clinical psychiatrists or psychologists applied the Yale-Brown Obsessive-Compulsive Scale (Y-BOCS)21 for clinical evaluation of obsessive-compulsive symptoms in patients with OCD. Handedness was classified based on a modified 25-item version of the Edinburgh Handedness Inventory.
We divided OCD into two groups by medication status. Thirteen patients with OCD were drug-free for all types of psychotropic medication, and the remaining 20 patients were taking SRIs only. The SRI and imipramine equivalent doses20 are summarized in Supplementary Table 2. OCD patients with higher SRI doses (higher than or equal to imipramine equivalent dose 150 mg) were grouped into the OCDHighSRI group (n = 10) and those with lower SRI doses (less than imipramine equivalent dose 150 mg) or no psychotropic medications were grouped into OCDLow-NoSRI (OCDLowSRI, n = 10; OCDNoSRI, n = 13). The threshold was determined by considering a meta-analysis of the SRI treatment19 and common clinical doses used in Japan. To evaluate the antidepressant effects of SRI, we compared the depressive symptoms evaluated by the HDRS22 between OCDHighSRI and OCDLow-NoSRI patients. There was one missing value of the HDRS in the OCDHighSRI group.
Behavioral task and statistical analysis
The delayed feedback task was similar to the one in our previous research23, except for the presented stimuli. Participants chose one of the two options (abstract cues) displayed on the screen by pressing a left or right button in each trial within 1 s after an auditory cue (Figure 2a). Depending on the selected stimulus, monetary feedback with different outcomes (+10, +40, −10, or −40 yen) was displayed either immediately after the button press or three trials later (Figure 2b and c). We did not offer monetary feedback for the first five trials because participants can learn the relationships between stimuli and outcomes or delays quickly. If participants pressed a button before the auditory cue or more than 1 s passed without any button press, −50 yen was displayed as a punishment. Such trials were considered error trials, and delayed outcomes were not considered.
At each trial, two abstract cues were displayed side by side on the screen (Figure 2a). We prepared 16 pairs (Figure 2b), counterbalancing the number of appearances of each stimulus. The 16 pairs of stimuli were presented in pseudorandom order. Each pair was presented for a scheduled number of trials in each session: each of six pairs [+10(0) vs. +40(0); +10(3) vs. +40(3); −10(0) vs. −40(0); −10(3) vs. −40(3); +10(0) vs. +40(3); −10(0) vs. −40(3)] was presented in 10 trials during a single session, and each of 10 pairs [+40(0) vs. +40(3); +10(0) vs. +10(3); −10(0) vs. −10(3); −40(0) vs. −40(3); +10(3) vs. +40(0); −10(3) vs. −40(0); +10(0) vs. −10(0); +40(0) vs. −40(0); +10(3) vs. −10(3); +40(3) vs. −40(3)] was presented in five trials during a single session. Each participant thus performed 110 trials (6 × 10 + 10 × 5) during a single session and six sessions in each experiment. About 28 min was required for participants to complete six sessions. At the beginning of each session, the session number was displayed on the screen for 2.5 s. Before the task, each participant practiced the test session under the same task settings except for stimuli, and we confirmed that all participants understood the task set.
The total outcome (except for punishments related to button press errors), reaction time, and the number of error trials were compared between the OCD and HC groups. Patients with OCD showed a significantly lower total monetary outcome compared with HCs [median (interquartile range): OCD patients, 2890 (1590–3600) yen; HCs, 4550 (3020–5950) yen; Brunner-Munzel test, statistic = 4.08, p = 0.0002], whereas there were no group differences in reaction time [median (interquartile range): OCD patients, 524.6 (487.1–565.4) ms; HCs, 522.6 (475.3–575.9) ms; Brunner-Munzel test, p > 0.05] or in the number of button press errors [median (interquartile range): OCD patients, 4 (2–9); HCs, 4 (2–8); Brunner-Munzel test, p > 0.05]. These results suggested impaired learning in OCD individuals compared with HCs. Based on our hypothesis of trace factors ν+/ν−, patients with OCD would show impaired learning in the stimuli with delayed feedback. Therefore, the optimal choice rates of the representative four pairs of stimuli with the same delay and different magnitude [pairs with no delays: +40(0) vs. +10(0) and −10(0) vs. −40(0); pairs with delays: +40(3) vs. +10(3) and −10(3) vs. −40(3)] were compared among groups. A mixed-design two-way repeated-measures ANOVA with a within-participants factor of sessions (1–6 sessions) and a between-participants factor of groups (OCDLow-NoSRI patients, OCDHighSRI patients, and HCs) was conducted to clarify the between-group differences. Degrees of freedom were corrected using Chi-Muller’s epsilon because Mendoza’s multisample sphericity test indicated that the assumption of sphericity had been violated. Bonferroni-Holm-corrected post-hoc comparisons were conducted to clarify the between-group differences.
Model comparison and parameter estimation for the delayed feedback task
We fitted the behavioral data with actor-critic learning with separate eligibility traces for positive and negative prediction errors. We defined 16 states for possible pairs of stimuli presented at each trial. The available actions in each state were the choices between the two presented stimuli. To facilitate model-fitting in the face of limited experimental data from each participant, we used regularizing priors that favored realistic values and maximum a posteriori estimation rather than maximum likelihood estimation24. The learning rate α and trace factor ν were constrained to the range of 0 ≤ α ≤ 0.95 and 0 ≤ ν ≤ 0.95 with a uniform prior. The exploration-exploitation degree β was constrained to the range of 0 ≤ β ≤ 100 with a gamma (2,3) prior distribution that favored relatively lower values. We fixed the discount factor γ = 0, because the term of the next state value was just noise in our delayed feedback task in which the state of each trial was randomly selected. We optimized parameters by minimizing the negative log posterior of the data with different parameter settings using the hyperopt package50. Likelihood ratio tests to assess the contribution of the additional parameter in our model (four parameters: α, β, ν+, ν−) compared with the standard actor-critic learning model (three parameters: α, β, ν) showed that the additional parameter was justified in 128 of 186 participants (χ2 test with one degree of freedom, p < 0.05). Because our target model was validated in behavioral data, we applied the separate eligibility trace model in the substantive analysis.
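The likelihood ratio test for one additional parameter can be illustrated with a few lines of arithmetic. The sketch assumes that the test statistic is twice the difference between the minimized negative log-likelihoods of the three-parameter and four-parameter models; the function name and the example values are hypothetical. For one degree of freedom, the upper tail of the χ2 distribution equals erfc(√(x/2)), so no statistics library is needed.

```python
import math


def lrt_one_df(nll_restricted, nll_full):
    """Likelihood ratio test for nested models differing by one parameter.

    Both arguments are minimized negative log-likelihoods (an assumption
    about how the fits are summarized). Returns (statistic, p_value).
    """
    stat = max(2.0 * (nll_restricted - nll_full), 0.0)
    # Chi-square survival function with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical fits: the four-parameter model lowers the negative
# log-likelihood by about 1.92, which sits exactly at the 5% boundary.
stat, p = lrt_one_df(nll_restricted=100.0, nll_full=100.0 - 1.920729410347062)
```

A participant counts as favoring the separate-trace model when p falls below 0.05, that is, when the statistic exceeds about 3.84.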
We evaluated the performance of parameter estimation in actor-critic learning with separate eligibility traces. We created 200 simulation data points using a model with the following parameters: α, 0.1 ± 0.05 (mean ± SD); β, 1 ± 0.2; ν+ and ν−, randomly selected within 0.01–0.95. The parameter estimation was quite accurate51 (Supplementary Figure 2). Specifically, Pearson’s r and the mean absolute error between the true and estimated ν+ or ν− were 0.99 and 0.03, respectively (Supplementary Figure 2).
We compared the ν+/ν− distribution in our model among three groups (OCDLow-NoSRI patients, OCDHighSRI patients, and HCs) using PERMANOVA with the adonis function and 10,000 permutations using the Euclidean distance implemented in the statistical package R52. The multivariate homogeneity of group dispersions was confirmed with the betadisper function with 10,000 permutations in R52. The learning rate α and inverse temperature β were compared using the Kruskal-Wallis test. To further investigate relationships between clinical characteristics and estimated parameters in the HC group, we conducted clustering analysis using HDBSCAN25. We detected two clusters (balanced cluster, n = 83; imbalanced cluster, n = 59; the remaining 11 HCs were not clustered) and evaluated their obsessive-compulsive trait using the five PI subscales: “Checking”, “Dirt”, “Doubt”, “Impulse”, and “Precision”26. In addition, the propensity to adhere to fine-grained details was evaluated using the Attention to Detail subscale of AQ27. There were 15 missing values in PI (n = 127) and 1 missing value in AQ (n = 141). The therapeutic effects of SRIs were evaluated by Spearman’s rank correlation between the SRI dose and the imbalance of ν (ν+ − ν−).
Imaging data acquisition, preprocessing, and statistical analysis
rs-fMRI data were collected using three different MRI scanners: 49 OCDNoSRI participants and 53 HCs at Kajiicho Medical Imaging Center (dataset A), 10 OCDNoSRI participants and 18 HCs at Kyoto Prefectural University of Medicine (dataset B) for replication of the findings of dataset A, and 20 HCs with the delayed feedback task (10 HCs for the imbalanced cluster and the remaining 10 HCs for the balanced cluster) in ATR (dataset C). There was no overlap of participants among datasets. All demographic distributions were matched between groups (dataset A, Supplementary Table 4; dataset B, Supplementary Table 5; dataset C, Supplementary Table 6). Some of the participants in dataset A and B were included in our previous studies conducted for a different purpose using a different method14, 15, 28, 29. All fMRI imaging protocols using gradient EPI sequences are summarized in Supplementary Table 3. High-resolution T1-weighted structural images were also acquired.
Preprocessing of rs-fMRI data was conducted using fmriprep_ciftify 1.3.0.post2-2.3.053,54. Briefly, typical preprocessing steps such as slice timing correction, motion correction, and spatial normalization into Montreal Neurological Institute space were conducted using fmriprep. We then converted the data from the volumetric NIfTI format to the surface-based CIFTI format (https://www.nitrc.org/projects/cifti/) using ciftify54. To remove artifacts and increase the signal/noise ratio, the time course of the rs-fMRI data was detrended and bandpass-filtered (0.01–0.08 Hz), and nuisance variables (the temporal fluctuations of the entire brain and six head motion parameters, their derivatives, and six principal components of anatomical CompCor55) were linearly regressed out. With respect to motion artifacts, framewise displacement (FD) was not significantly different between groups in all datasets [median (interquartile range); dataset A: OCD patients, 0.082 (0.070–0.097) mm; HCs, 0.087 (0.067–0.10) mm; Brunner-Munzel test, p > 0.05; dataset B: OCD patients, 0.095 (0.067–0.12) mm; HCs, 0.079 (0.062–0.094) mm; Brunner-Munzel test, p > 0.05; dataset C: balanced cluster, 0.13 (0.11–0.18) mm; imbalanced cluster, 0.11 (0.094–0.14) mm; Brunner-Munzel test, p > 0.05]. The first six functional scans were discarded to allow magnetization to reach equilibrium. For each participant, mean time series were extracted from 360 cortical and 358 subcortical parcels using CAB-NP31, which is the comprehensive whole-brain solution for large-scale functional networks based on the cortical parcellation developed by Glasser et al.56 (Human Connectome Project Multi-Modal Parcellation). Pearson correlation coefficients were calculated between each pair of parcels and transformed to Fisher’s Z scores to obtain the FC matrix.
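The last step, from parcel time series to the FC matrix, can be sketched in plain Python. The function names and the toy time series are hypothetical, the diagonal is set to zero as a convention chosen for this example, and the preprocessing described above (detrending, filtering, nuisance regression) is assumed to have been applied already.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fc_matrix(time_series):
    """Fisher z-transformed correlations between every pair of parcels.

    time_series: one list of samples per parcel. Perfectly correlated
    pairs (|r| = 1) are not handled, because atanh diverges there.
    """
    n = len(time_series)
    fc = [[0.0] * n for _ in range(n)]   # diagonal left at 0 by convention
    for i in range(n):
        for j in range(i + 1, n):
            z = math.atanh(pearson(time_series[i], time_series[j]))
            fc[i][j] = fc[j][i] = z
    return fc


# Three toy "parcels" with four time points each.
fc = fc_matrix([[1, 2, 3, 4], [1, 3, 2, 4], [2, 1, 4, 3]])
```

With 718 parcels, the same function would return a 718 × 718 symmetric matrix, of which only the upper triangle carries independent values.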
To evaluate the neural substrate of nonmedicated patients with OCD hypothetically related to ν+ > ν−, we compared the FC matrices between 49 OCDNoSRI patients and 53 HCs (dataset A) using NBS32. We chose NBS because it facilitates the detection of a subnetwork related to the condition of interest while controlling the family-wise error rate. Briefly, NBS was performed in the following two steps. First, the between-group comparison for every possible FC was conducted and thresholded at t = 3.87 (corresponding to p = 0.0001) by considering the FD as a nuisance variable. We detected the thresholded subnetworks (networks of nodes interconnected by significant FCs), and their network size was calculated. Second, the significance of subnetworks was tested using 10,000 random permutations of groups, which determined the null distribution of the largest subnetwork size. Only subnetworks whose size was large enough to reach a family-wise error-corrected p-value below 0.05 were identified as networks that differed significantly between OCDNoSRI patients and HCs. The detected subnetwork was visualized using BrainNet Viewer57. To confirm the robustness of the OCD network, we compared the mean FC of the detected OCD network between 10 OCDNoSRI patients and 18 HCs in the entirely independent dataset (dataset B) using the Brunner-Munzel test. To further explore whether the imbalanced (ν+ > ν−) HC cluster also showed OCD-like characteristics in the functional network, we compared every FC of the OCD network between 10 HCs each in the imbalanced and balanced clusters (dataset C) detected in our delayed feedback task using the Brunner-Munzel test.
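The two NBS steps can be sketched without any imaging library. This is an illustration of the general procedure rather than the NBS toolbox itself: the edgewise t-statistics are taken as given, the threshold is applied one-sided, subnetwork size is counted as the number of suprathreshold edges, and the null distribution is passed in as a list of largest-subnetwork sizes from permuted group labels. All names and numbers below are hypothetical.

```python
def component_edge_counts(t_stats, threshold):
    """Edge counts of connected subnetworks formed by suprathreshold edges.

    t_stats: symmetric matrix (list of lists) of edgewise t-statistics.
    Returns the edge counts sorted from largest to smallest.
    """
    n = len(t_stats)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if t_stats[i][j] > threshold]
    for i, j in edges:
        parent[find(i)] = find(j)
    counts = {}
    for i, _ in edges:
        root = find(i)
        counts[root] = counts.get(root, 0) + 1
    return sorted(counts.values(), reverse=True)


def fwe_p_value(observed_size, null_max_sizes):
    """Fraction of permutations whose largest subnetwork is at least as big."""
    hits = sum(1 for m in null_max_sizes if m >= observed_size)
    return hits / len(null_max_sizes)


# Toy example: nodes 0-1-2 form one subnetwork, nodes 3-4 another.
t = [[0.0] * 5 for _ in range(5)]
for i, j, val in [(0, 1, 4.0), (1, 2, 5.0), (3, 4, 4.2), (0, 4, 1.0)]:
    t[i][j] = t[j][i] = val
sizes = component_edge_counts(t, threshold=3.87)
p_fwe = fwe_p_value(sizes[0], null_max_sizes=[0, 1, 1, 3])
```

In a full analysis, the null list would hold one value per permutation (10,000 here), each obtained by recomputing the t-statistics after shuffling the group labels and calling component_edge_counts again.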
Data Availability
The patients’ data supporting the conclusions of this paper are not publicly available because they contain information that could compromise research participant privacy or consent. The theoretical derivation of our computational model is in the Supplementary Note along with the MATLAB source code.
Author Contributions
Yuki S., Yutaka S., J.N., and S.C.T. designed the study. Yutaka S. developed the theory and performed computational modeling. Yuki S., Y.A., J.N., and S.C.T. collected the data. Yuki S. conducted the computational modeling and statistical analysis of the data. Yuki S. and Yutaka S. wrote the manuscript, which was edited by all authors. S.C.T. acquired funding to support theory development and data analysis.
Competing Interests Statement
The authors declare no competing financial interests.
Acknowledgments
All authors had full access to the data used in this study, and the corresponding author S.C. Tanaka takes responsibility for the integrity of the data and the accuracy of the data analysis. These research results have been achieved by “Research and development of technology for enhancing functional recovery of elderly and disabled people based on non-invasive brain imaging and robotic assistive devices”, the Commissioned Research of National Institute of Information and Communications Technology (NICT), JAPAN (S.C. Tanaka). The following also partly supported this study: JSPS KAKENHI Grant Number JP16K01958 and JP16H06396 (S.C. Tanaka), JP16H01516 and JP18H05524 (Yutaka Sakai); 16H01512 and the Sakamoto Research Foundation of Psychiatric Diseases (Yuki Sakai); the Joint Usage/Research Center (“Behavioral economics”) of Institute of Social and Economic Research, Osaka University (J. Narumoto).