Consequence assessment and behavioral patterns of inhibition in decision-making: modelling its underlying mechanisms

Learning to make adaptive decisions depends on exploring options, experiencing their consequences, and reassessing one's strategy for the future. Although several studies have analyzed various aspects of value-based decision-making, most have focused on decisions in which gratification is cued and immediate. By contrast, how the brain gauges delayed consequence for decision-making remains poorly understood. To investigate this, we designed a decision-making task in which each decision altered future options. The task was organized in groups of consecutively dependent trials, and participants were instructed to maximize the cumulative reward value within each group. In the absence of any explicit performance feedback, participants had to test and internally assess specific criteria to make decisions. This task was designed to study specifically how the assessment of consequence forms and influences decisions as learning progresses. We analyzed the behavioral results to characterize individual differences in reaction times, decision strategies, and learning rates, and we formalized this operation mathematically by means of a multi-layered decision-making model. Using a mean-field approximation, the first layer of the model described the dynamics of two populations of neurons characterizing the binary decision-making process. The other two layers modulated the decision-making policy by dynamically adapting an oversight learning mechanism. The model was validated by fitting each individual participant's behavior, and it faithfully predicted non-trivial patterns of decision-making, regardless of performance level. These findings provide an explanation of how delayed consequence may be computed and incorporated into the neural dynamics of decision-making, and of how learning occurs in the absence of explicit feedback.

As shown in Figure 2d, the RT gradually increased with the difficulty of discriminating the stimuli, exhibiting a gradual and significant sensitivity to VD (p = 10^-25, F-stat = 160).

The dependency of PF and RT on VD, together with the other variables, must be established statistically. To assess the learning process, we quantified the relationship of PF and RT with the horizon nH, the trial within episode TE, and the episode E. To obtain consistent results, we adjusted these variables as follows. The trial within episode was counted in reverse, from last to first, because the optimal choice for the last TE (large) is the same regardless of the horizon; the variable representing the trial within episode counted backwards is denoted T̃E. Furthermore, regarding the model for PF, to consider trials within episode independently, we adapted the notion of PF (defined as a summary measure per episode) to an equivalent per trial, i.e., the percentage of optimal choices PO. To be able to calculate this percentage, we grouped the episodes in blocks of 10 and used their average; this new grouping variable is called Ẽ. Regarding the model for RT, since we considered each episode separately, and not aggregates of 10 of them, we also checked the dependency on the DbS (d). Finally, to assess the difference between learning groups, we introduced the categorical variable L, which identifies the participants who learned the optimal strategy and those who did not, according to Figure 2a. We then used a linear mixed-effects model (59,60) to predict PF and RT. The independent variables for the fixed effects were the horizon nH, the trial within episode T̃E (counted backwards), and the passage of time, expressed as groups of 10 episodes Ẽ for PF, or as the episode E and the DbS d for RT (Table 2).

In panel (e), PO decreases with T̃E, suggesting that the first trial(s) within the episode are less likely to be guessed right, i.e., favoring the smaller of the two stimuli. This makes sense, since only the early trials within the episode required inhibition. Moreover, judging by the amplitude of the regression coefficients, this effect has a larger impact in the no-learning case. The same argument can be made for the dependency on nH. The major difference between learning and no-learning appears in the time dependence: for the learners' group, PO increases as time goes by, i.e., as Ẽ increases, whereas the effect is not significant for the group that did not learn the optimal strategy. The two learning groups are globally statistically different (p = 10^-12). In panel (f), RT shows opposite effect directions between the learning and no-learning groups for the dependencies on both T̃E and nH. The participants who learned the optimal strategy exhibited longer RTs in the earlier trials within the episode, consistent with the need to inhibit the selection of the larger stimulus. Also, the larger the horizon, the longer the RT, opposite to the no-learning group.
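For concreteness, the regression structure described above can be specified with the Python package statsmodels as in the sketch below; the data file and column names are illustrative placeholders, and the per-participant random intercept is one simple choice of random-effects structure consistent with (59,60).

    import pandas as pd
    import statsmodels.formula.api as smf

    # Long-format behavioral table; column names are placeholders:
    #   PO     - percentage of optimal choices (per-trial equivalent of PF)
    #   RT     - reaction time; E - episode; d - DbS
    #   nH     - horizon; TE_rev - trial within episode, counted backwards
    #   E10    - group of 10 episodes (passage of time)
    #   L      - learning group; subj - participant identifier
    df = pd.read_csv("behavior.csv")

    # Mixed-effects model for PO with a random intercept per participant
    m_po = smf.mixedlm("PO ~ nH + TE_rev + E10 + L", df, groups=df["subj"]).fit()
    print(m_po.summary())

    # Mixed-effects model for RT, using per-episode predictors
    m_rt = smf.mixedlm("RT ~ nH + TE_rev + E + d + L", df, groups=df["subj"]).fit()
    print(m_rt.summary())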
Out of all 28 participants we analyzed, Figure 3 shows the data from 3 sample participants: their associated PF, CH, and RT metrics, and the order of execution of the different blocks and horizons. Each column corresponds to a participant and each row to a different horizon level. Note that all three participants performed the nH=0 task correctly (Figure 3a,b). The first two participants also performed nH=1 correctly, while participant 3 did not learn the correct strategy until executing nH=2. Note that participants 1 and 2 performed nH=1 before nH=2: they learned during nH=1 and then applied the same strategy to nH=2. Because of this, a very fast learning process can be noted during the first nH=2 block.

In this section, we describe our mathematical formalization of consequential decision-making, which incorporates a variable foresight mechanism, adaptive to the specifics of how reward is distributed across the trials of each episode. We formalized these processes by means of a three-layer neural model, described next. In brief, we used a mean-field model for binary decision-making, driven by a system able to learn the optimal strategy and, consequently, to dictate the choices to the decision-making process. The reason we chose to build such a model, instead of employing, for example, a classic reinforcement learning model, is that our model not only describes behavioral patterns of learning but is also biophysically plausible: the neural dynamics in the mean-field approximation have been derived analytically from a network of spiking neurons used for making binary decisions (61).

To describe the neural dynamics at each trial, we used a mean-field approximation of a biophysically based binary decision-making model (23,58,61,62). This approximation has often been used to study neuronal dynamics analytically, through analyses of population averages. It includes a simplified version that reproduces most features of the original spiking neuron model while using only two internal variables (21).

The core of the model consists of two populations of excitatory neurons: one sensitive to the stimulus on the left-hand side of the screen (L), and the other to the stimulus on the right (R). The intensity of the evidence is the size of each stimulus, which is directly proportional to the amount of reward displayed; in the model, this is captured by the parameters λL and λR, respectively. Although in the interest of our task we distinguish between the bigger and the smaller stimulus, the two populations encode the stimuli by their position on the screen. A decision is made when the difference between the L and R pool activity crosses a threshold D (69); see Figure 4b. This event defines the trial's decision time. Note that the decision time and the likelihood of picking the larger stimulus are conditioned by the evidence associated with the two stimuli (λL, λR), i.e., by how easy it is to distinguish between them. Namely, the larger the difference between the stimuli, the more likely and the quicker the selection of the larger stimulus.
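As an illustration of this first layer, the sketch below integrates two competing rate populations until the difference of their activities crosses the threshold D; it is a simplified stand-in for the mean-field equations of (61), with made-up parameter values rather than the fitted ones.

    import numpy as np

    def simulate_trial(lam_L, lam_R, D=0.3, tau=0.02, w_inh=1.2,
                       sigma=0.05, dt=1e-3, t_max=4.0, rng=None):
        rng = rng or np.random.default_rng()
        rL = rR = 0.0
        for step in range(int(t_max / dt)):
            noise = sigma * np.sqrt(dt) * rng.standard_normal(2)
            # each pool integrates its own evidence and inhibits the other
            rL += dt / tau * (-rL + lam_L - w_inh * rR) + noise[0]
            rR += dt / tau * (-rR + lam_R - w_inh * rL) + noise[1]
            if abs(rL - rR) > D:                # threshold crossing
                choice = "L" if rL > rR else "R"
                return choice, (step + 1) * dt  # choice and decision time
        return None, t_max                      # no decision within the trial

    # The larger the evidence difference, the faster and more reliable the choice.
    print(simulate_trial(lam_L=0.6, lam_R=0.4))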

This type of decision-making model is built such that the larger stimulus is always favored. Although the target with the stronger evidence in Eq. 1 is the most likely to be selected, this behavior becomes a particular case when this first layer interacts with the middle layer of our model, as described in the next section.

Specifically, the intended decision mechanism at each trial is represented as a two-attractor system. The initial condition (ψ0) and the noise intensity (σψ) are interdependent: the closer the initial condition is to one of the attractors, the larger the noise required to escape that basin of attraction. Behaviorally, the role of the initial condition is to capture the a-priori bias of the participant toward the smaller or the larger stimulus.

Although the small/big stimulus may be favored at each trial, the final decision still depends on the ratio of the stimuli intensities. More specifically, if the evidence associated with the small/large stimulus is higher/lower than that of its counterpart, the dynamics of the system will evolve as described in the previous section (see Eq. 1). The middle layer can switch preference between the large and the small stimulus, thereby setting the preference (size-wise) with which a decision is made in the consequential task, as described by Eq. 2.
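The following minimal sketch illustrates this two-attractor mechanism: a double-well system whose initial condition ψ0 encodes the a-priori bias and whose noise intensity σψ determines how easily the preference escapes a basin of attraction. The potential and the parameter values are illustrative, not those of Eq. 2.

    import numpy as np

    def intended_decision(psi0, sigma_psi, dt=1e-3, t_max=1.0, rng=None):
        rng = rng or np.random.default_rng()
        psi = psi0                        # a-priori bias
        for _ in range(int(t_max / dt)):
            drift = psi - psi**3          # -dV/dpsi for V = psi^4/4 - psi^2/2
            psi += drift * dt + sigma_psi * np.sqrt(dt) * rng.standard_normal()
        return "larger" if psi > 0 else "smaller"  # attractors at +1 / -1

    # Deep inside a basin, a large sigma_psi is needed to switch preference.
    print(intended_decision(psi0=0.1, sigma_psi=0.4))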

Although the intended decision layer endowed our model with the ability to target a specific type of stimulus at each trial, a second mechanism is required to internally oversee performance and to promote only beneficial strategies. The overall goal for each participant of the consequential task is to maximize the cumulative reward value throughout an episode. As shown by the previous analyses, most participants attained the optimal strategy after an exploratory phase, gradually improving their performance until the optimum was reached. Inspired by the same principle of exploration and reinforcement, we incorporated a strategy learning layer into our model.

The internal dynamics of an episode are such that selecting the small/large stimulus in a trial implies an increase/decrease of the mean value of the stimuli presented in the next trial (Figure 1). Consequently, the strategy that maximizes the cumulative reward value must vary as a function of the horizon, and the model must be able to learn the optimal one over time. Mathematically, we describe the dynamics of learning as an update of the strategy driven by a reinforcement signal R, which must be positive/negative if the subjective reward is considered beneficial or not.
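Schematically, this learning rule can be viewed as a signed update of the a-priori bias that seeds the intended-decision layer, as in the toy sketch below; k denotes the learning rate, and the exact functional form of R in the model differs.

    def update_bias(psi0, R, k=0.5, bound=1.0):
        """Nudge the a-priori bias by the reinforcement signal R (toy rule)."""
        psi0 += k * R                         # R > 0 reinforces, R < 0 suppresses
        return max(-bound, min(bound, psi0))  # keep the bias within the two wells

    # A beneficial outcome (R > 0) pushes the bias toward the current strategy.
    psi0 = update_bias(0.0, R=+0.2)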

In the absence of explicit performance feedback, as is the case in the current task, participants must look for clues that convey some indirect information about their performance and that can feed their internal criterion of assessment. In our case, the relevant clue was the change in the mean value of the presented stimuli from one trial to the next within an episode.

Our simulations showed that the learning time tL decreases as the learning rate k increases and as the decisional uncertainty σψ decreases. Note that for these simulations we used nH=1 with 50 episodes; therefore, any tL larger than 50 means that the optimal strategy was not learned. As a consequence of this analysis, and to obtain a large variety of behavioral results, we explored a broad range of values for k and σψ in what follows.

To demonstrate the behavior of the model, Figure 8 shows the results of a typical simulation of a horizon nH=1 experiment. Figure 8a shows example dynamics of the neural layer.

The fitting of the model parameters to each participant's behavioral metrics was performed in stages. First, we considered the neural dynamics layer, fitting each parameter of Eq. 3. The first metric to fit was each participant's RT. Note that, due to anticipation of the GO signal, the experimental RTs could be negative in a few cases (see Figure 3c). A free parameter was incorporated into the model to control for this temporal shift.
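As an illustration of this correction, the sketch below aligns simulated decision times with observed RTs through a single shift parameter delta, which can be negative when the participant anticipated the GO signal; fitting the shift by matching means is an illustrative criterion, not necessarily the one we used.

    import numpy as np

    def fit_rt_shift(observed_rt, model_decision_times):
        # delta < 0 captures systematic anticipation of the GO signal
        return float(np.mean(observed_rt) - np.mean(model_decision_times))

    observed = np.array([-0.05, 0.12, 0.30, 0.21])  # seconds; toy data
    simulated = np.array([0.25, 0.40, 0.55, 0.38])  # model decision times
    delta = fit_rt_shift(observed, simulated)
    model_rt = simulated + delta                    # shifted RTs to compare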

The second metric to fit was the VD, i.e., the ability to distinguish between the stimuli. We assumed the VD to be specific to each participant and constant across the blocks of each session. As a means of assessment, we checked how often the larger stimulus had been selected over the last 50 trials.

For participant 1, the fitting process reported a fast learning rate (k=2.6). Interestingly, even though participant 3 did not learn the correct strategy, the parameters obtained from the fitting process still reported a slow learning process (k=0.2). In addition, we noticed that a slightly higher learning rate was reported for participant 2.

To summarize, we first found the best fit for the RT and the VD by varying all the free parameters of Eq. 3, i.e., τ, σ, D, and α. Then, we calculated the subjective initial bias ψ0. Finally, employing these parameters, we found the best fit for the decisional uncertainty σψ and the learning rate k.
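The staged procedure can be summarized as the pipeline sketched below, with toy stand-ins for the stage-specific fits; the real versions run model simulations and compare them with each participant's behavior.

    import numpy as np

    def fit_neural_layer(rt, choices):
        # Stage 1 (toy): Eq. 3 parameters from the RT distribution and the VD
        return {"rt_shift": float(np.mean(rt)), "vd": float(np.mean(choices))}

    def estimate_initial_bias(choices):
        # Stage 2 (toy): early-trial choices mapped to a bias in [-1, 1]
        return 2.0 * float(np.mean(choices[:10])) - 1.0

    def fit_strategy_layer(rt, choices, neural_params, psi0):
        # Stage 3 (toy): decisional uncertainty and learning rate
        return 0.3, 1.0

    def fit_participant(rt, choices):
        neural_params = fit_neural_layer(rt, choices)
        psi0 = estimate_initial_bias(choices)
        sigma_psi, k = fit_strategy_layer(rt, choices, neural_params, psi0)
        return neural_params, psi0, sigma_psi, k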

To summarize, we performed an individual fit of each participant's behavioral metrics. We first used the RT distribution and the VD of each participant to fit the parameters in Eq. 3. Once these parameters were fixed, we calculated the initial bias and ran simulations to fit the decisional uncertainty σψ and the learning rate k.

Remarkably, most participants attained the optimal strategy, demonstrating that they could learn it in the absence of explicit performance feedback.

The number of trials per episode equals the horizon nH plus 1. In brief, within an episode, the decision in the initial trial influences the stimuli shown in the next trial(s) in a specific fashion, unbeknownst to our participants. Although a reward value is gained by selecting one of the stimuli presented in each trial, the goal is not to gain the largest possible amount per trial, but rather per episode.
Each participant performed 100 episodes for each horizon nH = 0, 1, and 2. In the interest of comparing results, we generated a list of stimuli for each nH and used it for all participants.
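For illustration, such a shared list can be generated once per horizon with a fixed random seed, as sketched below; the value ranges are made up, and the sketch ignores the within-episode dependence of later trials on the participant's choices, which the task applies on top.

    import numpy as np

    def make_stimulus_list(n_horizon, n_episodes=100, seed=0):
        rng = np.random.default_rng(seed)  # fixed seed: same list for everyone
        trials_per_episode = n_horizon + 1
        # one (left, right) stimulus pair per trial of every episode
        return rng.uniform(0.0, 1.0, size=(n_episodes, trials_per_episode, 2))

    stimuli = {nH: make_stimulus_list(nH) for nH in (0, 1, 2)}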

To avoid fatigue and keep the participants focused, we divided the experiment into 6 blocks, performed on the same day, each consisting of approximately 100 trials.

At the beginning of the session, participants were given instructions on how to perform the task. Specifically, using some sample trials, we demonstrated how to select a stimulus by moving the mouse. Step by step, we showed that a target appears in the center of the screen, indicating the start of an episode. We told the participants that they had 4 seconds to move the cursor to it.

The participants were instructed to maximize the cumulative reward value throughout each episode, namely the sum of the water contained by the selected targets across the trials of the episode. If trials within an episode were independent, the optimal choice would be to always choose the largest stimulus. Since one of the major goals of our study was to investigate delayed consequence, however, consecutive trials within an episode were made dependent on the participant's previous choices.
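To see why, consider a toy nH=1 episode with made-up numbers: choosing the smaller stimulus first sacrifices immediate reward but raises the values offered in the final trial, yielding a larger episode total.

    def episode_reward(first_choice):
        pair1 = (4.0, 6.0)                         # made-up first-trial stimuli
        if first_choice == "small":
            gain1, pair2 = min(pair1), (7.0, 9.0)  # mean of next pair shifts up
        else:
            gain1, pair2 = max(pair1), (1.0, 3.0)  # mean of next pair shifts down
        return gain1 + max(pair2)                  # last trial: always take large

    print(episode_reward("small"), episode_reward("large"))  # 13.0 vs 9.0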

Exploratory strategy
We analyzed the exploratory strategy participants used. In particular, we tested whether the order in which the horizons were performed influenced learning.

In Materials and Methods, Sec. Consequential Decision-Making task, we described the structure of the task and mentioned that we randomized the order in which participants performed the horizons. This means that, for example, some participants performed nH 2 before nH 0. We wondered whether the order of the horizons had an influence on the learning time. For example, were the participants who started with nH 1 faster at learning the optimal strategy than the ones who started with nH 2? To address this, we performed a thorough investigation of the learning time for each block of the session, splitting participants according to whether nH 2 was performed before or after nH 1. Note that these plots show all participants, including those who never learned nH 1 or 2, but did learn nH 0. From the results in (b-c), we can conclude that, in the horizons with consequence (i.e., nH 1, 2), participants needed less time to learn the optimal strategy of one horizon if they had already performed the other one. We speculate that once the optimal strategy in any consequential block was understood, participants generalized the rule and, by abstraction, applied it to the other horizon.