A Theoretical Formalization of Consequence-Based Decision-Making

Learning to make adaptive decisions depends on exploring options, experiencing their consequences, and reassessing one's strategy for the future. Although several studies have analyzed various aspects of value-based decision-making, most have focused on decisions in which gratification is cued and immediate. By contrast, how the brain gauges delayed consequence for decision-making remains poorly understood.

To investigate this, we designed a novel decision-making task in which each decision altered the future options to decide upon. The task was organized in groups of inter-dependent trials, and the participants were instructed to maximize the cumulative reward value within each group. In the absence of any explicit performance feedback, the participants had to test and internally assess specific criteria to make decisions. The absence of explicit feedback was key to studying specifically how the assessment of consequence forms and influences decisions as learning progresses.

We formalized this operation mathematically by means of a multi-layered decision-making model. It uses a mean-field approximation to describe the dynamics of two populations of neurons that characterize the binary decision-making process. The resulting decision-making policy is dynamically modulated by an internal oversight mechanism based on the prediction of consequence. This policy is reinforced by rewarding outcomes. The model was validated by fitting each individual participant's behavior. It faithfully predicted non-trivial patterns of decision-making, regardless of performance level.

These findings provide an explanation of how delayed consequence may be computed and incorporated into the neural dynamics of decision-making, and of how adaptation occurs in the absence of explicit feedback.

,b). The first three participants also performed nH=1 correctly, while participant 4 did not learn the correct strategy until executing nH=2. Note that participant 2 performed nH=2 before nH=1: they learned during nH=2 and then applied the same strategy to nH=1. Because of this, no learning process can be detected during the nH=1 block. In Figure 3c, note that some RTs are negative. In these cases, the participant did not wait for the presentation of the GO signal to start the movement.

The core of the model consists of two populations of excitatory neurons: one sensitive to the stimulus on the left-hand side of the screen (L), and the other to the stimulus on the right (R).

The intensity of the evidence is the size of each stimulus, which is directly proportional to the amount of reward displayed. In the model this is captured by the parameters λL and λR, respectively.

Although in the interest of our task we distinguish between the bigger and smaller stimulus

The time constant and the noise are set to τ = 80 and σ = 0.003, respectively.

The neural dynamics described in this section refer to the time-course of a single trial and relate to the discrimination of the two stimuli. The model commits to a perceptual decision when the difference between the activities of the L and R pools crosses a threshold D (51); see Figure 4b. This event defines the trial's decision time. Note that the decision time and the likelihood of picking the larger stimulus are conditioned by the evidence associated with the two stimuli (λL, λR), i.e., by how easy it is to distinguish between them. Namely, the larger the difference between the stimuli, the more likely and the more quickly the larger stimulus is selected.
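As an illustration, the following is a minimal sketch of this single-trial mechanism. Since Eq. 1 is not reproduced in this excerpt, the mutual-inhibition form of the dynamics, the weight w_inh, and the integration scheme are assumptions; only τ, σ, and the threshold rule follow the text.

```python
import numpy as np

# Minimal sketch of the single-trial dynamics: two excitatory pools (L, R)
# driven by evidence lam_L, lam_R proportional to stimulus size; a decision
# is committed when the difference in pool activity crosses the threshold D.
# The mutual-inhibition form and w_inh are assumptions (Eq. 1 is not shown
# here); tau = 80 and sigma = 0.003 follow the values given in the text.

def simulate_trial(lam_L, lam_R, D=0.05, tau=80.0, sigma=0.003,
                   w_inh=1.0, dt=1.0, t_max=2000.0, rng=None):
    """Return (choice, decision_time); choice is None if no crossing occurs."""
    rng = rng or np.random.default_rng()
    r_L, r_R = 0.0, 0.0
    t = 0.0
    while t < t_max:
        r_L += dt / tau * (-r_L + lam_L - w_inh * r_R) \
               + sigma * np.sqrt(dt) * rng.standard_normal()
        r_R += dt / tau * (-r_R + lam_R - w_inh * r_L) \
               + sigma * np.sqrt(dt) * rng.standard_normal()
        t += dt
        if abs(r_L - r_R) >= D:
            return ("L" if r_L > r_R else "R"), t
    return None, t_max
```

With a large gap between lam_L and lam_R, the threshold crossing occurs sooner and almost always favors the larger stimulus, which is exactly the relation between evidence, speed, and accuracy described above.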

This type of decision-making model is constructed such that the larger stimulus is always favored.

Although the target with the stronger evidence in Eq. 1 is the most likely to be selected, this behavior becomes a particular case when this first layer interacts with the middle layer of our model, as described in the next section.

Specifically, the intended decision mechanism at each trial is represented as a two-attractor system. The initial condition (f0) and the noise intensity (σψ) are interdependent: the closer an initial condition is to one of the attractors, the larger the noise required to escape that basin of attraction. Behaviorally, the role of the initial condition is to capture the a-priori bias toward choosing the smaller/bigger target. Though this is true, please note that a strong initial bias

Although the small/big stimulus may be favored at each trial, the final decision still depends on the ratio of stimulus intensities. More specifically, if the evidence associated with the small/large stimulus is higher/lower than that of its counterpart, the dynamics of the system will evolve as described in the previous section (see Eq. 1). For this reason, we incorporated the intention term, yielding a novel set of equations that exhibit the competence of switching the preference (size-wise) with which a decision is made in the consequential task between the large and the small stimulus, as described by Eq. 2.

Figure 7e shows that learning time decreases as the learning rate k increases and as the decisional uncertainty σψ decreases. Note that for these simulations we used nH=1 with 50 episodes; therefore, any tL larger than 50 means that the optimal strategy was not learned.

To demonstrate the behavior of the model, Figure 8 shows the results of a typical simulation of a horizon nH=1 experiment. Figure 8a shows example traces of the neural dynamics

The fitting of the model parameters to each participant's behavioral metrics was performed in stages. We started by considering the neural dynamics layer and fitting each parameter of Eq. 3. The first metric to fit was each participant's RT. Note that, due to anticipation of the GO signal, the experimental RTs could be negative in a few cases (see Figure 3c). A free parameter was incorporated into the model to control for this temporal shift.
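Before turning to the remaining fits, a minimal sketch of the two-attractor intention mechanism described above may help. The double-well drift used below is an assumption (Eq. 2 is not reproduced in this excerpt); only the roles of f0 and σψ follow the text.

```python
import numpy as np

# Hypothetical double-well dynamics for an intention variable psi in [0, 1]:
# attractors near 0 ("prefer smaller") and 1 ("prefer larger"), with an
# unstable fixed point at 0.5.  f0 is the initial condition (a-priori bias)
# and sigma_psi the noise intensity, as in the text.

def intended_decision(f0, sigma_psi, dt=1.0, t_max=1000.0, rng=None):
    rng = rng or np.random.default_rng()
    psi = f0
    for _ in range(int(t_max / dt)):
        drift = psi * (1.0 - psi) * (psi - 0.5)  # pushes psi toward 0 or 1
        psi += dt * drift + sigma_psi * np.sqrt(dt) * rng.standard_normal()
        psi = min(max(psi, 0.0), 1.0)            # keep psi in [0, 1]
    return "larger" if psi > 0.5 else "smaller"
```

The interdependence noted in the text is visible here: the closer f0 lies to 0 or 1, the larger sigma_psi must be for the trajectory to escape that basin of attraction.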

The second metric to fit is the VD, i.e., the ability to distinguish between stimuli. We assumed VD to be specific to each participant and constant across the blocks of each session. As a means

Panels (a-c) in Figure 9 show the optimal parameters for the RT and VD of the four sample participants introduced in the Behavioral Results Section. Figure 9a depicts the CDF of the RT for the participants and for the best-fit model simulation. Figure 9b presents the KSD between

We set a vector of initial conditions ψ(t = 1, TE) = f0(TE), one for each trial within episode (TE).
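Assuming KSD here denotes the Kolmogorov-Smirnov distance between the empirical and simulated RT distributions, a sketch of that comparison could look as follows; the function name, data arrays, and shift handling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy import stats

# Hypothetical comparison of empirical and simulated RT distributions via the
# Kolmogorov-Smirnov distance (KSD).  `rt_participant` and `rt_model` are
# illustrative arrays of reaction times; `shift` stands for the free temporal
# parameter that absorbs anticipation of the GO signal (negative RTs).

def rt_ksd(rt_participant, rt_model, shift=0.0):
    statistic, _ = stats.ks_2samp(rt_participant, np.asarray(rt_model) + shift)
    return statistic  # smaller values indicate a closer match of the two CDFs
```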

To quantify f0, we selected the first 3 episodes for each participant and calculated the frequency f with which the larger stimulus was selected. The parameter f0 works as an initial condition for the intended decision process (see Eq. 2). In agreement with the attractor dynamics, if the initial condition coincides with one of the basins of attraction, the system will be locked in that state. To prevent this (since f0 should only be an initial bias), we rescaled the frequency of the selected choices f to bring the value closer to 0.5, i.e., f0 = (1 + f)/3 (other rescaling factors could be used and would not change the results). Figure 9d shows the values obtained for f0 for each trial within episode TE. Note that we have selected one block from nH=2 for participant 2 and nH=1 for the others.
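As a small worked example of this rescaling (the boolean choice encoding is an illustrative assumption):

```python
# f0 = (1 + f) / 3 maps a choice frequency f in [0, 1] into [1/3, 2/3],
# keeping the initial condition away from the two basins of attraction.
# `chose_larger` is a hypothetical list of booleans, one per trial of the
# participant's first three episodes.

def initial_bias(chose_larger):
    f = sum(chose_larger) / len(chose_larger)  # frequency of picking the larger stimulus
    return (1.0 + f) / 3.0

# Example: a participant who always picked the larger stimulus (f = 1)
# starts at f0 = 2/3: biased, but not locked into the "larger" attractor.
```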
process began under similar conditions for the model and for the participant. The second factor is the time required to learn the strategy. As already introduced in the Behavioral Results

Section, we defined the time at which the strategy was learned as the moment after which the optimal strategy was employed in at least 9 out of the following 10 episodes. To ensure that a low success rate was not due to errors caused by visual discrimination, we excluded the

To summarize, the problem of having explicit feedback is that learning the optimal strategy could be reduced to testing rule-based sequences until the one that gives the optimal feedback is found. Although the optimal strategy consists of the same rule-based sequence, the crucial element of the task is that, to reach that stage, the participant must first forego a phase

The crucial element of the aforementioned process is that, in the absence of explicit performance feedback, learning depends on first building up a subjective criterion of reward.

This criterion necessarily depends on cognitive processes implementing an oversight mechanism that monitors whether the correct decision criterion is being used, and whether the proper association between the choice and subsequent stimuli is being correctly perceived (63-66).

Moreover, despite the participants being able to find the optimal strategy and to diminish the uncertainty of their behavior, the fact that they never get an explicit

Each participant performed 100 episodes for each horizon nH = 0, 1, and 2. In the interest of comparing results, we generated a list of stimuli for each nH and used it for all participants.

To avoid fatigue and keep the participants focused, we divided the experiment into 6 blocks, to be performed on the same day, each consisting of approximately 100 trials. More