## Abstract

Models based on normative principles have played a major role in our understanding of how the brain forms decisions. However, these models have typically been derived for simple, stable environments, and their relevance to decisions under more naturalistic, dynamic conditions is unclear. We previously derived a normative decision model in which evidence accumulation is adapted to environmental dynamics (*Glaze et al., 2015*), but the evolution of commitment rules (e.g., thresholds on the accumulated evidence) under such dynamic conditions is not fully understood. Here we derive a normative model for decisions based on changing evidence or reward. In these cases, performance (reward rate) is maximized using adaptive decision thresholds that best account for diverse environmental changes, in contrast to predictions of many previous decision models. These adaptive thresholds exhibit several distinct temporal motifs that depend on the specific, predicted and experienced changes in task conditions. These adaptive decision strategies perform robustly even when implemented imperfectly (noisily) and can account for observed response times on a task with time-varying evidence better than commonly used constant-threshold or urgency-gating models. These results further link normative and neural decision-making while expanding our view of both as dynamic, adaptive processes that update and use expectations to govern both deliberation and commitment.

## Introduction

Even simple decisions can require us to adapt to a changing world. Should you go through the park or through town on your walk? The answer can depend on each route's length, the weather, and/or the time of day. Some of these factors can change quickly and affect our deliberations in real time; e.g., an unexpected shower will send us hurrying down the faster route (**Figure 1A**), whereas spotting a new ice cream store can make the longer route more attractive. Despite the ubiquity of such dynamics in the real world, they are often neglected in models used to understand how the brain makes decisions. For example, many commonly used models assume that decision commitment occurs when the accumulated evidence for an option reaches a fixed, predefined value or threshold (*Wald, 1945; Ratcliff, 1978; Bogacz et al., 2006; Gold and Shadlen, 2007; Kilpatrick et al., 2019*). The value of this threshold can account for inherent trade-offs between decision speed and accuracy found in many tasks: lower thresholds generate faster but less accurate decisions, whereas higher thresholds generate slower but more accurate decisions (*Gold and Shadlen, 2007; Chittka et al., 2009; Bogacz et al., 2010*). However, these models do not adequately describe decisions made in environments with unknown or stochastically changing contexts (*Thura et al., 2014; Thura and Cisek, 2016; Palestro et al., 2018; Cisek et al., 2009; Drugowitsch et al., 2012; Thura et al., 2012; Tajima et al., 2019; Glickman et al., 2022*).

Efforts to model decision-making thresholds under dynamic conditions have focused largely on heuristic strategies. For instance, "urgency-gating models" (UGMs) use thresholds that collapse monotonically over time (equivalent to dilating the belief in time) to explain decisions based on time-varying evidence quality (*Cisek et al., 2009; Carland et al., 2015; Evans et al., 2019*). Attempts to extend normative theory to such dynamic environments typically assume that individuals set decision thresholds to maximize trial-averaged reward rate (*Simen et al., 2009; Balci et al., 2011; Drugowitsch et al., 2012; Tajima et al., 2016; Malhotra et al., 2018; Boehm et al., 2020*), resulting in adaptive, time-varying thresholds similar to those assumed by heuristic UGMs. However, as in fixed-threshold models, these time-varying thresholds are typically defined before the evidence is accumulated, preceding the formative stages of the decision, and thus cannot account for environmental changes that may occur during deliberation.

To identify how environmental changes impact decision rules, we developed normative models of decision-making that adapt to dynamic changes in expectations or evidence. Specifically, we used Bellman's equation (*Bellman, 1957; Mahadevan, 1996; Sutton et al., 1998; Bertsekas, 2012; Drugowitsch, 2015*) to identify decision strategies that maximize trial-averaged reward rate under dynamic conditions. We show that for simple tasks that include within-trial changes in the reward or the quality of observed evidence, these normative decision strategies involve non-trivial, time-dependent changes in decision thresholds. These rules take several different forms that outperform their heuristic counterparts, are identifiable from behavior, and perform robustly even when implemented imperfectly. We also show that, compared to fixed-threshold models or UGMs, these normative, adaptive thresholds provide a better account of human behavior on a "tokens task," in which both the value of commitment and evidence quality change at predictable times within each trial (*Cisek et al., 2009; Thura et al., 2014*). These results provide new insights into the behavioral relevance of a diverse set of adaptive decision thresholds in dynamic environments and tightly link the details of such environmental changes to threshold adaptations.

## Results

### Normative Theory for Dynamic Context 2AFC Tasks

Normative decision rules that maximize trial-averaged reward rate can be obtained by solving an optimization problem using dynamic programming (* Bellman, 1957; Sutton et al., 1998; Drugowitsch et al., 2012; Tajima et al., 2016*). To do so, we assign specific values (i.e., economic utilities) to correct and incorrect choices (reward and/or punishment) and the time required to arrive at each choice (i.e., evidence cost). Given a defined task structure, we discretize the time during which the decision is formed and define the observer’s actions during each timestep. An observer gathers evidence (measurements) during each timestep prior to a decision and uses each increment of evidence to update their belief about the correct choice. Then, the observer has the option to either commit to a choice or make another measurement at the next timestep. By assigning a utility to each of these actions, we find the specific belief values where the optimal action changes from gathering evidence to commitment, defining thresholds on the ideal observer’s belief that trigger decisions.
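The backward-induction recipe described above can be sketched in a few lines. The sketch below is ours, not code from the paper: it assumes illustrative values for the rewards, evidence cost, observation distributions, and a fixed trial length, and it discretizes the belief and observation spaces to approximate the expectation over the next measurement.

```python
import numpy as np

# Illustrative sketch of finding decision thresholds by backward induction.
# All parameter values are assumptions for demonstration, not the paper's.
R_CORRECT, R_WRONG = 1.0, 0.0   # assumed utilities of correct/incorrect choices
COST, DT = 0.1, 0.1             # assumed evidence cost per unit time and timestep
MU, SIGMA = 0.5, 1.0            # assumed observation distributions N(+/-MU, SIGMA^2)
N_STEPS = 50                    # fixed trial horizon (number of timesteps)

p = np.linspace(1e-4, 1 - 1e-4, 401)   # grid over belief Pr(s = s+)
xi = np.linspace(-6, 6, 801)           # observation grid for quadrature
dxi = xi[1] - xi[0]
f_plus = np.exp(-(xi - MU) ** 2 / (2 * SIGMA ** 2))
f_minus = np.exp(-(xi + MU) ** 2 / (2 * SIGMA ** 2))
f_plus /= f_plus.sum() * dxi           # normalize densities on the grid
f_minus /= f_minus.sum() * dxi

v_plus = R_CORRECT * p + R_WRONG * (1 - p)   # value of committing to s+
v_minus = v_plus[::-1]                        # committing to s-, by symmetry

V = np.maximum(v_plus, v_minus)   # terminal step: the observer must commit
thresholds = []
for _ in range(N_STEPS):
    # expected future value of waiting: average V over the Bayes update p -> p'
    marginal = np.outer(p, f_plus) + np.outer(1 - p, f_minus)   # P(xi | p)
    p_next = np.outer(p, f_plus) / marginal                     # posterior
    v_wait = (np.interp(p_next, p, V) * marginal).sum(axis=1) * dxi - COST * DT
    V = np.maximum(np.maximum(v_plus, v_minus), v_wait)
    # threshold: smallest belief at which committing to s+ beats waiting
    commit = v_plus >= v_wait
    thresholds.append(p[commit & (p > 0.5)].min())
thresholds = thresholds[::-1]   # index 0 = start of trial
print(round(thresholds[0], 3))
```

The recorded beliefs trace out the upper decision threshold over time; the lower threshold follows by symmetry about a belief of 0.5.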

**Figure 1B** shows a schematic of this process.

To understand how normative decision thresholds adapt to fluctuating conditions, we derived them for several different forms of two-alternative forced-choice (2AFC) tasks in which we controlled changes in evidence or reward. For each task, the evidence was provided by observations drawn from a Gaussian distribution with one of two different means and signal-to-noise ratio (SNR) *m* (**Figure 2–Figure Supplement 1**). The SNR measures evidence quality: a smaller (larger) *m* implies that evidence is of lower (higher) quality, resulting in harder (easier) decisions. An observer must determine which of the two means was used to generate a finite number of observations. We introduced changes in the reward for a correct decision ("reward-change task") or the SNR ("SNR-change task") within a single decision, where the time and magnitude of the changes are known in advance to the observer (**Figure 1A**, **Figure 2–Figure Supplement 2**). For example, changes in SNR arise naturally throughout a day as animals choose when to forage and hunt given variations in light levels and therefore target-acquisition difficulty (*Combes et al., 2012; Einfalt et al., 2012*).

Under these dynamic conditions, dynamic programming produces normative thresholds with rich non-monotonic dynamics (**Figure 2A,B** and **Figure 2–Figure Supplement 2**). For the reward-change task, these normative threshold dynamics exhibited several motifs that in some cases resembled the fixed or collapsing thresholds characteristic of previous decision models, but in other cases exhibited novel dynamics. We characterized five different dynamic motifs in response to single changes in expected reward for different combinations of pre- and post-change reward values (**Figure 2Ci-v**). For tasks in which reward is initially very low, thresholds are infinite until the reward increases, ensuring that the observer waits for the larger payout regardless of how strong their belief is (**Figure 2Ci**). In contrast, when reward is initially very high, thresholds collapse to zero just before the reward decreases, ensuring that all responses occur while the payout is high (**Figure 2Cv**). Between these two extremes, optimal thresholds exhibit rich, non-monotonic dynamics (**Figure 2Cii,iv**), promoting early decisions in the high-reward regime or preventing early, inaccurate decisions in the low-reward regime. **Figure 2C** shows the regions in pre- and post-change reward space where each motif is optimal, including broad regions with non-monotonic thresholds. Thus, even simple context dynamics can evoke complex decision strategies in ideal observers that differ from those predicted by constant-threshold models and heuristic UGMs.

We also formulated an "inferred reward-change task," in which reward fluctuations are governed by a two-state Markov process and the observer infers these changes on-line. For this task, decision thresholds always changed monotonically with monotonic shifts in expected reward (see **Figure 2–Figure Supplement 3**). These results contrast with our findings in the reward-change task, in which changes can be anticipated and monotonic changes in reward can produce non-monotonic changes in decision thresholds.
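For intuition about the on-line inference such an observer must perform, here is a minimal, hypothetical sketch of a single belief update for a two-state Markov reward context; the hazard rate and Gaussian cue statistics are assumptions for illustration, not the task's actual values.

```python
import math

# Hypothetical sketch of one belief update for a two-state Markov reward
# context; hazard rate and cue statistics are assumed for illustration.
def update_reward_belief(q_high, cue, h=0.05, mu_high=1.0, mu_low=0.0, sigma=1.0):
    """Propagate the belief through the Markov switching dynamics, then
    apply Bayes' rule to a Gaussian cue about the current reward state."""
    prior = (1 - h) * q_high + h * (1 - q_high)   # state may have switched
    lik_high = math.exp(-(cue - mu_high) ** 2 / (2 * sigma ** 2))
    lik_low = math.exp(-(cue - mu_low) ** 2 / (2 * sigma ** 2))
    return prior * lik_high / (prior * lik_high + (1 - prior) * lik_low)

q = 0.5                         # start uncertain about the reward state
for cue in [1.2, 0.9, 1.1]:     # cues consistent with the high-reward state
    q = update_reward_belief(q, cue)
print(0.5 < q < 1.0)            # belief moves toward the high-reward state
```

Because the hazard term caps how certain the observer can become, expected reward tracks the environment smoothly, consistent with the monotonic threshold changes described above.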

For the SNR-change task, optimal strategies are characterized by threshold dynamics adapted to changes in evidence quality in a way similar to changes in reward (**Figure 3A,B** and **Figure 3–Figure Supplement 1**). However, in this case monotonic changes in evidence quality always produce monotonic changes in response behavior. This observation holds across all of parameter space for evidence-quality schedules with single change points (**Figure 3C**), with only three optimal behavioral motifs (**Figure 3Ci-iii**). This contrasts with our findings in the reward-change task, where monotonic changes in reward can produce non-monotonic changes in decision thresholds. Strategies arising from known dynamical changes in context tend to produce sharper response distributions around reward changes than around quality changes, which may be measurable in psychophysical studies. These findings suggest that changes in reward can have a larger impact on normative strategy thresholds than changes in evidence quality.

#### Performance and Robustness of Non-monotonic Normative Thresholds

The normative solutions that we derived for dynamic-context tasks by definition maximize reward rate. This maximization assumes that the normative solutions are implemented perfectly. However, a perfect implementation may not be possible, given the complexity of the underlying computations, biological constraints on computation time and energy (*Louie et al., 2015*), and the synaptic and neural variability of cortical circuits (*Ma and Jazayeri, 2014; Faisal et al., 2008*). Given these constraints, subjects may employ heuristic strategies like the UGM over the normative model if noisy or mistuned versions of both models result in similar reward rates. We used synthetic data to better understand the relative benefits of different imperfectly implemented strategies. Specifically, we corrupted the internal belief state and simulated response times with additive Gaussian noise (see **Figure 4–Figure Supplement 1C**) for three models: 1) the normative model, resulting in a noisy Bayesian (NB) model; 2) a constant-threshold (Const) model, which uses the same belief as the normative model but a constant, non-adaptive decision threshold (**Figure 4–Figure Supplement 1A**); and 3) the UGM, which low-pass filters the normative observer's belief and commits to a decision when this output crosses a hyperbolically collapsing threshold (**Figure 4–Figure Supplement 1B**). We compared their performance in terms of reward rate achieved on the same set of reward-change tasks shown in **Figure 2**.

When all three models were implemented without additional noise, the relative benefits of the normative model depended on the exact task condition. The performance differential between models was highest when reward changed from low to high values (**Figure 4A**, dotted line; **Figure 4B**). Under these conditions, normative thresholds are initially infinite and become finite after the reward increases, ensuring that most responses occur immediately once the high reward becomes available (**Figure 4D**). In contrast, response times generated by the constant-threshold and UGM models tend not to follow this pattern. For the constant-threshold model, many responses occur early, when the reward is low (**Figure 4E**). For the UGM, a substantial fraction of responses are late, leading to higher time costs (**Figure 4F**). In contrast, when the reward changes from high to low values, all models exhibit similar response distributions and reward rates (**Figure 4A**, dashed line; **Figure 4–Figure Supplement 2**). This result is not surprising, given that the constant-threshold model produces early peaks in the reaction-time distribution, and the UGM was designed to mimic collapsing bounds that hasten decisions in response to imminent decreases in reward (*Cisek et al., 2009*). We therefore focused on the robustness of each strategy when corrupted by noise and responding to low-to-high reward switches – the regime that differentiates strategy performance in ways that could be identified in subject behavior.

Adding noise to the internal belief state (which tends to trigger earlier responses) and to the simulated response distributions (which tends to smooth out the distributions) does not alter the advantage of the normative model: across a range of added noise strengths, the normative model outperforms the other two when encountering low-to-high reward switches (**Figure 4C**). This robustness arises because, prior to the reward change, the normative model uses infinite decision thresholds that prevent early noise-triggered responses when reward is low (**Figure 4D**). In contrast, the heuristic models have finite collapsing or constant thresholds and thus produce more suboptimal early responses as belief noise is increased (**Figure 4E,F**). Thus, adaptive decision strategies can result in considerably higher reward rates than heuristic alternatives even when implemented imperfectly, suggesting that subjects may be motivated to learn such strategies.

### Adaptive Normative Strategies in the Tokens Task

To determine the relevance of the normative model to human decision-making, we analyzed previously collected data from a "tokens task" (*Cisek et al., 2009*). For this task, human subjects were shown 15 tokens inside a center target flanked by two empty targets (see **Figure 5A** for a schematic). Every 200 ms, a token moved from the center target to one of the neighboring targets with equal probability. Subjects were tasked with predicting which flanking target would contain more tokens by the time all 15 moved from the center. Subjects could respond at any time before all 15 tokens had moved. Once the subject made the prediction, the remaining tokens would finish their movements to indicate the correct alternative. Because the total number of tokens was finite and known to the subject, token movements varied in their informativeness within a trial, yielding a dynamic and history-dependent evidence quality that, in principle, could benefit from adaptive decision processes (e.g., a token's movement into a target is informative only if the difference in token counts between targets is lower than the number of tokens still in the center). In addition, the task included two different post-decision token movement speeds, "slow" and "fast", that dynamically modulated the utility of decision commitment by altering the duration of the inter-trial interval, and hence the average rate at which rewards could be obtained. Given that costs and rewards can be subjective, we quantified how normative decision thresholds change with different combinations of rewards and costs, for both the slow (**Figure 5B**) and fast (**Figure 5C**) versions of the task.

We identified four distinct motifs of normative decision threshold dynamics for the tokens task (**Figure 5i-iv**). Some combinations of rewards and costs produced collapsing thresholds (**Figure 5ii**) similar to the UGM developed by *Cisek et al.* (*2009*) for this task. In contrast, large regions of task parameter space produced rich non-monotonic threshold dynamics (**Figure 5iii,iv**) that differed from any found in the UGM. In particular, as in the case of reward-change tasks, normative thresholds were often infinite for the first several token movements, preventing early and weakly informed responses. These motifs are similar to those produced by low-to-high reward switches in the reward-change task, but here they result from the low relative cost of early observations. These non-monotonic dynamics also appear if we measure belief in terms of the difference in tokens between the top and bottom targets, which we call "token lead space" (see **Figure 5–Figure Supplement 1**).
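The counting logic in the parenthetical above can be made concrete. The following sketch (our illustration, not the task code) computes the probability that the "up" target wins given the current token counts, and shows why a large lead makes further token movements uninformative.

```python
from math import comb

# Probability that the "up" target wins the tokens task, given the counts so
# far; each remaining center token moves up or down with probability 1/2.
def p_up_wins(n_up, n_down, n_center):
    """Sum binomial outcomes of the remaining center tokens: 'up' wins when
    n_up + k > n_down + (n_center - k), with k the number of future up-moves."""
    total = 0
    for k in range(n_center + 1):
        if n_up + k > n_down + (n_center - k):
            total += comb(n_center, k)
    return total / 2 ** n_center

# A lead smaller than the number of remaining tokens leaves the outcome open:
print(p_up_wins(5, 3, 7))   # lead of 2 with 7 tokens left: still uncertain
# A lead larger than the number of remaining tokens decides the trial:
print(p_up_wins(9, 4, 2))   # lead of 5 with 2 left: probability 1.0
```

Tracking this quantity token-by-token reproduces the history-dependent evidence quality described above.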

#### Adaptive Normative Strategies Best Fit Subject Response Data

To determine the relevance of these adaptive decision strategies to human behavior, we fit discrete-time versions of the noisy Bayesian (four free parameters), constant-threshold (three free parameters), and urgency-gating (five free parameters) models to response-time data from the tokens task collected by *Cisek et al.* (*2009*). All models included belief and motor noise, as in our analysis of the dynamic-context tasks (**Figure 4–Figure Supplement 1C**). The normative model tended to fit the data better than the heuristic models (see **Figure 6–Figure Supplement 1**), based on three primary analyses. First, both corrected AIC (AICc), which accounts for goodness-of-fit and model degrees-of-freedom, and average root-mean-squared error (RMSE) between the predicted and actual trial-by-trial response times favored the noisy Bayesian model for most subjects for both the slow (**Figure 6A**) and fast (**Figure 6D**) versions of the task. Second, when considering only the best-fitting model for each subject and task condition, the noisy Bayesian model tended to better predict subjects' response times (**Figure 6B,E**). Third, most subjects whose data were best described by the noisy Bayesian model had best-fit parameters that corresponded to non-monotonic decision thresholds, which cannot be produced by either of the other two models (**Figure 6C,F**). Together, our results strongly suggest that these human subjects tended to use an adaptive, normative strategy instead of the kinds of heuristic strategies often used to model response data from dynamic-context tasks.

## Discussion

The goal of this study was to build on previous work showing that in dynamic environments, the most effective decision processes do not necessarily use relatively simple, pre-defined computations as in many decision models (*Bogacz et al., 2006; Cisek et al., 2009; Drugowitsch et al., 2012*), but instead adapt to learned or predicted features of the environmental dynamics (*Drugowitsch et al., 2014a*). Specifically, we used new "dynamic context" task structures to demonstrate that normative decision commitment rules (i.e., decision thresholds, or bounds, in "accumulate-to-bound" models) adapt to reward and evidence-quality switches in complex, but predictable, ways. Comparing the performance of these normative decision strategies to the performance of classic heuristic models, we found that the advantage of normative models is maintained when computations are noisy. We extended these modeling results to include the "tokens task", in which evidence quality changes in a way that depends on stimulus history and the utility of commitment increases over time. We found that the normative decision thresholds for the tokens task are also non-monotonic and robust to noise. By reanalyzing human subject data from this task, we found that most subjects' response times were best explained by a noisy normative model with non-monotonic decision thresholds. Taken collectively, these results show that ideal observers and human subjects use adaptive and robust normative decision strategies in relatively simple decision environments.

Our results can aid experimentalists investigating the nuances of complex decision-making in several ways. First, we demonstrated that normative behavior varies substantially across task parameters for relatively simple tasks. For example, the reward-change task structure produces five distinct behavioral motifs, such as waiting until reward increases (**Figure 2i**) and responding before reward decreases unless the accumulated evidence is ambiguous (**Figure 2iv**). Using these kinds of modeling results to inform experimental design can help us understand the possible behaviors to expect in subject data. Furthermore, extending our work and considering the sensitivity of performance to both model choice and task parameters (*Barendregt et al., 2019; Radillo et al., 2019*) will help to identify regions of task parameter space where models are most identifiable from observables like response time and choice. In general, our work suggests that experimentalists can design more informative tasks by using normative theory to determine what subject strategies are plausible, the volume and diversity of tasks needed to identify them, and the relationship between task dynamics and decision rules.

Real subjects likely do not rely on a single strategy when performing a sequence of trials (*Ashwood et al., 2022*) and instead rely on a mix of near-normative, sub-normative, and heuristic strategies. In fitting subject data, experimentalists are thus presented with the difficult task of constructing a library of possible models to use in their analysis. More general approaches have been developed for fitting response data to a broad class of models (*Shinn et al., 2020*), but these model libraries are typically built on pre-existing assumptions about how subjects accumulate evidence and make decisions. Because the potential library of decision strategies is theoretically limitless, a normative analysis can both expand and provide insights into the range of possible subject behaviors in a systematic and principled way. Understanding this scope will assist in developing a well-groomed candidate list of near-normative and heuristic models. For example, if a normative analysis of performance on a dynamic reward task produces threshold dynamics similar to those in **Figure 2B**, then the fitting library should include a piecewise-constant threshold (or urgency signal) model. Combining these model-based investigations with model-free approaches, such as rate-distortion theory (*Berger, 2003; Eissa et al., 2021*), can also aid in identifying commonalities in performance and resource usage within and across model classes without the need for pilot experiments.

Our work complements the existing literature on optimal decision thresholds by demonstrating the prevalence of behaviors reflective of non-monotonic decision thresholds. Most studies describing decision strategies with time-varying decision thresholds focus on environments with fixed structure, in which dynamic decision thresholds are adapted as the observer acquires knowledge of the environment. Using dynamic programming (*Drugowitsch et al., 2012, 2014b; Tajima et al., 2016*) or policy iteration (*Malhotra et al., 2017, 2018*), normative strategies in these environments typically have monotonically collapsing decision thresholds that can be approximated by a standard UGM (*Tajima et al., 2019*). While recent work has started to generalize notions of urgency-gating behavior (*Trueblood et al., 2021*), we have shown that novel response behaviors need to be considered even for simple tasks.

The neural mechanisms responsible for implementing and controlling decision thresholds are not well understood. Recent work has identified several cortical regions that may contribute to threshold formation, such as prefrontal cortex (*Hanks et al., 2015*), dorsal premotor area (*Thura and Cisek, 2020*), and superior colliculus (*Crapse et al., 2018; Jun et al., 2021*). Urgency signals are a complementary way of dynamically changing decision thresholds via a commensurate scaling of the belief, which *Thura and Cisek* (*2017*) suggest is detectable in recordings from basal ganglia. The normative decision thresholds we derived do not employ urgency signals, but analogous UGMs may involve non-monotonic signals. For example, the switch from an infinite to a constant decision threshold typical of low-to-high reward switches would correspond to a signal that suppresses responses until the reward change. Measurable signals predicted by our normative models would therefore correspond to zero mean activity during low reward, followed by constant mean activity during high reward. While more experimental work is needed to test this hypothesis, our work has expanded the view of normative and neural decision-making as dynamic processes for both deliberation and commitment.

## Methods and Materials

### Normative Decision Thresholds from Dynamic Programming

We outline the general mathematical structure of a two-alternative forced-choice (2AFC) task we use throughout this work and introduce the dynamic programming tools required to find normative decision thresholds. Consider an observer inferring an initially unknown environmental state, *s* ∈ {*s*_{+}, *s*_{-}}, that uniquely determines one of two "correct" choices. To determine the environmental state, this observer makes measurements, *ξ*, that follow a distribution *f*_{±}(*ξ*) = *f*(*ξ*|*s*_{±}) that depends on the state. Determining the correct choice is thus equivalent to determining the generating distribution, *f*_{±}. An ideal Bayesian observer uses the log-likelihood ratio (LLR), *y*, to track their "belief" over the correct choice (*Wald, 1945; Bogacz et al., 2006; Veliz-Cuba et al., 2016*). After *n* discrete observations *ξ*_{1:n}, the discrete-time LLR *y*_{n} is given by

$$y_n = \ln \frac{\Pr(s_+ \mid \xi_{1:n})}{\Pr(s_- \mid \xi_{1:n})} = y_{n-1} + \ln \frac{f_+(\xi_n)}{f_-(\xi_n)}, \qquad y_0 = 0.$$

For the free-response tasks we consider, an observer sets their potentially time-dependent decision thresholds, *θ*_{±}(*t*), that determine when they will stop accumulating evidence and commit to a choice: when *y* ≥ *θ*_{+}(*t*) (respectively, *y* ≤ *θ*_{-}(*t*)), the observer chooses the state *s*_{+} (*s*_{-}). In general, an observer is free to set *θ*_{±}(*t*) any way they wish. However, a normative observer sets these thresholds to optimize an objective function, which we assume throughout this study to be the trial-averaged reward rate, *ρ*, given by (*Gold and Shadlen, 2002; Drugowitsch et al., 2012*)

$$\rho = \frac{\langle R \rangle - \langle C(T_d) \rangle}{\langle T_t \rangle + \langle t_i \rangle}, \tag{1}$$

where ⟨*R*⟩ is the average reward for a decision, *T*_{d} is the decision time, ⟨*C*(*T*_{d})⟩ is the average total accumulated cost given an incremental cost function *c*(*t*), ⟨*T*_{t}⟩ is the average trial length, and ⟨*t*_{i}⟩ is the average inter-trial interval (*Drugowitsch, 2015*). All averages in **Equation 1** are taken over trials. The addition of the incremental cost function *c*(*t*) accounts for both explicit costs (e.g., paying for observed evidence, metabolic costs of storing belief in working memory) and implicit costs (e.g., opportunity cost). We assume symmetry in the problem (in terms of prior, rewards, etc.) that guarantees the thresholds are symmetric about *y* = 0, so that *θ*_{±}(*t*) = ±*θ*(*t*). We derive the optimal threshold policy for a general incremental cost function *c*(*t*), but in our results we consider only constant cost functions *c*. Although the space of possible cost functions is large, restricting to a constant value ensures that threshold dynamics are governed purely by task and reward structure and not by an arbitrary evidence cost function.
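As a concrete illustration of the discrete LLR update, the sketch below accumulates evidence from Gaussian observations; the means and variance are assumed values, and the closed-form increment 2*μξ*/*σ*² is specific to the Gaussian case.

```python
import random

# Concrete sketch of the discrete LLR update for Gaussian observations;
# mu and sigma are assumed, illustrative values (not from the paper).
def llr_increment(xi, mu=0.5, sigma=1.0):
    # ln f+(xi) - ln f-(xi) reduces to 2*mu*xi/sigma^2 for Gaussian f+/f-
    return 2.0 * mu * xi / sigma ** 2

random.seed(1)
y = 0.0                              # symmetric prior: y_0 = 0
for _ in range(200):                 # true state is s_plus
    observation = random.gauss(0.5, 1.0)
    y += llr_increment(observation)
print(y > 0)                         # accumulated belief favors s_plus
```

In the free-response setting, this running sum would be compared against the time-dependent thresholds ±*θ*(*t*) after each observation.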

To find the thresholds ±*θ* that optimize the reward rate given by **Equation 1**, we start with a discrete-time task where observations arrive every *δt* time units, and we simplify the problem so that the length of each trial is fixed and independent of the decision time *T*_{d}. This simplification makes the denominator of *ρ* constant with respect to trial-to-trial variability, meaning we can optimize reward rate by maximizing the numerator ⟨*R*⟩ − ⟨*C*(*T*_{d})⟩. Under this simplified task structure, we suppose the observer has just drawn a sample *ξ*_{n} and updated their state likelihood to *p*_{n}. At this moment, the observer takes one of three possible actions:

1. *Stop accumulating evidence and commit to choice s*_{+}. This action has value equal to the average reward for choosing *s*_{+}, which is given by

   $$V_+(p_n) = R_c\,p_n + R_i\,(1 - p_n), \tag{2}$$

   where *R*_{c} is the value for a correct choice and *R*_{i} is the value for an incorrect choice.

2. *Stop accumulating evidence and commit to choice s*_{-}. By assuming that the reward for correctly (or incorrectly) choosing *s*_{+} is the same as for choosing *s*_{-}, the value of this action is obtained by symmetry from **Equation 2**:

   $$V_-(p_n) = V_+(1 - p_n). \tag{3}$$

3. *Wait to commit to a choice and draw an additional piece of evidence*. Choosing this action means the observer expects their future overall value *V* to be greater than their current value, less the cost incurred by waiting for additional evidence. Therefore, the value of this choice is given by

   $$V_w(p_n) = \big\langle V(p_{n+1}) \,\big|\, p_n \big\rangle - c(\delta t), \tag{4}$$

   where *c* is the incremental evidence cost function; because we assume that the incremental cost is constant, this simplifies to *c*(*δt*) = *cδt*.
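Written out directly, the three action values compare as follows; the rewards, cost, and supplied future value are placeholders for illustration (in the full model the future value comes from the backward-induction pass).

```python
# The three action values (Equations 2-4) for a single belief p_n; reward,
# cost, and future-value numbers are illustrative placeholders.
def action_values(p, future_value, R_c=1.0, R_i=0.0, c=0.1, dt=0.1):
    """Return (commit to s_plus, commit to s_minus, wait) values; in the
    full model `future_value` comes from the backward-induction pass."""
    v_plus = R_c * p + R_i * (1 - p)      # Equation 2
    v_minus = R_c * (1 - p) + R_i * p     # Equation 3, by symmetry
    v_wait = future_value - c * dt        # Equation 4, constant cost
    return v_plus, v_minus, v_wait

vp, vm, vw = action_values(0.95, future_value=0.9)
print(vp > vw > vm)   # near-certain belief: committing to s_plus is optimal
```

The belief values at which the maximal entry switches between "commit" and "wait" are exactly the decision thresholds.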

Given the action values from * Equation 2-Equation 4*, the observer takes the action with maximal value, resulting in their overall value function

Because the value-maximizing action depends on the state likelihood, *p _{n}*, the regions of likelihood space where each action is optimal divide the space into three disjoint regions. The boundaries of these regions are exactly the optimal decision thresholds, which can be mapped to LLR-space to obtain ±

*θ*. To find these thresholds numerically, we used backward induction starting at the total trial length

*=*

**t***. At this moment in time, it impossible to wait for more evidence, so the value function in*

**T**_{t}*does not depend on the future. Once the value is calculated at this time point, it can be used as the future value at time point*

**Equation 5***=*

**t***–*

**T**_{t}*.*

**δt**To find the decision thresholds for the desired tasks where * T_{t}* is not fixed, we must optimize both the numerator and denominator of

*. To account for the variable trial length, we adopt techniques from average reward reinforcement learning (*

**Equation 1***) and penalize the waiting time associated with each action by the waiting time itself scaled by the reward rate*

**Mahadevan, 1996***ρ*(i.e., 〈

*t*〉

_{i}*ρ*for committing to

*s*

_{+}or

*s*

_{-}and

*ρδt*for waiting). This modification makes all trials effectively the same length and allows us to use the same approach used to derive

*(*

**Equation 5***). The new overall value function is*

To use this new value function to numerically find the decision thresholds, we must note two new complications that arise from moving away from fixed-length trials. First, we no longer have a natural end time from which to start backward induction. We remedy this issue by following the approach of *Drugowitsch et al*. (*2012*) and artificially setting a final trial time *T*_{f} that is far enough in the future that decision times of this length are highly unlikely and do not impact the response distributions. If we desire accurate thresholds up to a time *t*, we set *T*_{f} = 5*t*, which produces an accurate solution while avoiding the large numerical overhead incurred by a longer simulation time. In our simulations, we set *t* based on when we expect most decisions to be made. Second, the value function now depends on the unknown quantity *ρ*, resulting in a co-optimization problem. To address this complication, note that when *ρ* is maximized, our derivation requires *V*(0; *ρ*) = 0 for a consistent Bellman's equation (*Drugowitsch et al., 2012*). We exploit this consistency requirement by fixing an initial reward rate *ρ*_{0}, solving the value function through backward induction, calculating *V*(0; *ρ*_{0}), and updating the value of *ρ* via a root-finding scheme. For more details on numerical implementation, see https://github.com/nwbarendregt/AdaptNormThresh.
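The co-optimization can be sketched as a bisection search on the consistency condition *V*(0; *ρ*) = 0. All numerical settings below (rewards, cost, mean commitment time ⟨*t*_{i}⟩, and the Gaussian-smoothing stand-in for the belief transfer function) are illustrative assumptions, not the paper's values.

```python
import numpy as np

# Sketch of the reward-rate co-optimization: fix rho, solve the value function
# by backward induction with reward-rate-penalized action values (Equation 6),
# evaluate the value at zero accumulated evidence (p = 0.5), and update rho by
# bisection until V(0; rho) = 0. All settings here are illustrative.
R_C, R_I, C = 1.0, 0.0, 0.1   # correct/incorrect rewards and evidence cost
T_F, DT = 5.0, 0.05           # artificial final time, set far in the future
T_I = 0.5                     # mean time <t_i> charged to each commitment
P = np.linspace(0.001, 0.999, 201)
K = np.exp(-0.5 * ((P[None, :] - P[:, None]) / 0.05) ** 2)  # toy transfer kernel
K /= K.sum(axis=1, keepdims=True)

def value_at_zero(rho):
    """Backward induction for V(p; rho); returns V at p = 0.5 (LLR y = 0)."""
    v_plus = R_C * P + R_I * (1 - P) - rho * T_I
    v_minus = R_C * (1 - P) + R_I * P - rho * T_I
    V = np.maximum(v_plus, v_minus)        # waiting is impossible at t = T_F
    for _ in np.arange(T_F - DT, 0, -DT):
        V = np.maximum.reduce([v_plus, v_minus, K @ V - (C + rho) * DT])
    return np.interp(0.5, P, V)

# Bisection on the consistency condition V(0; rho) = 0: the value at zero
# evidence decreases as rho grows, so we shrink a bracketing interval.
lo, hi = 0.0, R_C / T_I
for _ in range(40):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if value_at_zero(mid) > 0 else (lo, mid)
rho_opt = 0.5 * (lo + hi)
```

Any standard one-dimensional root-finder works here; bisection is used only because the consistency function is monotone in *ρ*.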

### Dynamic Context 2AFC Tasks

For all dynamic context tasks, we assume that observations follow a Gaussian distribution. Using the Functional Central Limit Theorem, one can show (*Bogacz et al., 2006*) that in the continuous-time limit, the belief *y* evolves according to a stochastic differential equation:

d*y* = ±*m* d*t* + √(2*m*) d*W*_{t}. (*Equation 7*)

In *Equation 7*, *m* is the scaled signal-to-noise ratio (SNR), d*W*_{t} is a standard increment of a Wiener process, and the sign of the drift ±*m* d*t* is given by the sign of the correct choice *s*_{±}. To construct Bellman's equation for this task, we must also determine the average value gained by waiting:

The main difficulty in computing this expectation is computing the likelihood transfer function *f*_{p}(*p*_{n+1}|*p*_{n}). To compute this transfer function, we can start by using the definition of the LLR and leveraging the relationship between *p*_{n} and *y*_{n} to find *p*_{n+1} as a function of *p*_{n} and the observation *ξ*_{n+1}:

Note that we used the fact that, in continuous time, the observations are Gaussian-distributed. The relationship between *ξ*_{n+1} and *p*_{n+1} in *Equation 8* can be inverted to obtain

With this relationship established, we can find the likelihood transfer function *f*_{p}(*p*(*ξ*_{1:n+1})|*p*(*ξ*_{1:n})) by finding the observation transfer function *f*_{ξ}(*ξ*(*p*_{n+1})|*ξ*(*p*_{n})) and performing a change of variables, which by independence of the samples is simply *f*_{ξ}(*ξ*_{n+1}). With probability *p*_{n}, *ξ*_{n+1} will be drawn from the normal distribution with positive drift, and with probability 1 − *p*_{n}, *ξ*_{n+1} will be drawn from the normal distribution with negative drift. This immediately provides the observation transfer function by marginalizing:

Performing the change of variables using the derivative yields the transfer function

Combining *Equation 7* and *Equation 9*, we can construct Bellman's equation for any dynamic context task.
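To make the continuous-time belief dynamics concrete, here is a minimal Euler–Maruyama simulation of the drift–diffusion belief in *Equation 7*; the SNR, time step, and trial count are arbitrary illustrative choices.

```python
import numpy as np

# Euler-Maruyama simulation of the continuous-time belief, assuming the
# drift-diffusion form dy = ±m dt + sqrt(2m) dW_t with the drift sign set by
# the correct choice. The SNR m, time step, and trial count are illustrative.
rng = np.random.default_rng(1)
m, dt, n_steps, n_trials = 2.0, 0.001, 2000, 500

s = rng.choice([-1.0, 1.0], size=n_trials)   # correct choice s± on each trial
y = np.zeros(n_trials)                       # belief (LLR) starts at zero
for _ in range(n_steps):
    dW = rng.normal(0.0, np.sqrt(dt), size=n_trials)   # Wiener increments
    y += s * m * dt + np.sqrt(2.0 * m) * dW

accuracy = np.mean(np.sign(y) == s)   # sign of y is the observer's best guess
```

The sign of the accumulated belief gives the ideal observer's choice, and accuracy grows with both *m* and elapsed time.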

#### Reward-Change Task Thresholds

For the reward-change task, we fixed the punishment *R*_{i} = 0 and allowed the reward *R*_{c}(*t*) to be a Heaviside function:

*R*_{c}(*t*) = *R*_{1} for *t* < 0.5 and *R*_{c}(*t*) = *R*_{2} for *t* ≥ 0.5. (*Equation 10*)

In *Equation 10*, there is a single switch in rewards between the pre-change reward *R*_{1} and the post-change reward *R*_{2}, which occurs at *t* = 0.5. Substituting this reward function into *Equation 6* allows us to find the normative thresholds for this task as a function of *R*_{1} and *R*_{2}.

For the inferred reward-change task, we allowed the reward *R*(*t*) ∈ {*R*_{H}, *R*_{L}} to be controlled by a continuous-time two-state Markov process with transition (hazard) rate *h* between rewards *R*_{H} ≥ *R*_{L}. In addition, the state of this Markov process must be inferred from an evidence source independent of the environment's state (i.e., the correct choice); for simplicity, we assume that the reward-evidence source is also Gaussian-distributed with quality *m*_{R}. *Glaze et al*. (*2015*); *Veliz-Cuba et al*. (*2016*); *Barendregt et al*. (*2019*) have shown that the belief *y*_{R} for such a dynamic state-inference process is given by the modified DDM

d*y*_{R} = *x*(*t*)*m*_{R} d*t* + √(2*m*_{R}) d*W*_{t} − 2*h* sinh(*y*_{R}) d*t*,

where *x*(*t*) ∈ ±1 is a telegraph process that mirrors the state of the reward process (i.e., *x*(*t*) = 1 when *R*(*t*) = *R*_{H} and *x*(*t*) = −1 when *R*(*t*) = *R*_{L}). With this belief over the reward state, we must also modify the values *V*_{+}(*p*_{n}) and *V*_{-}(*p*_{n}) to account for the uncertainty in *R*_{c}. Defining the likelihood that *R*(*t*) = *R*_{H} as the reward likelihood gives the modified commitment values, where we have fixed *R*_{i} = 0 for simplicity.

#### SNR-Change Task Thresholds

For the SNR-change task, we allowed the task difficulty *m* to vary over a single trial by making *μ*(*t*) a time-dependent step function similar to * Equation 10*:

*μ*(*t*) = *μ*_{1} for *t* < 0.5 and *μ*(*t*) = *μ*_{2} for *t* ≥ 0.5. (*Equation 11*)

In *Equation 11*, there is a single switch in evidence quality between the pre-change quality *μ*_{1} and the post-change quality *μ*_{2}, which occurs at *t* = 0.5. Substituting this quality time series into the likelihood transfer function in *Equation 9* allows us to find the normative thresholds for this task as a function of *μ*_{1} and *μ*_{2}. This modification necessitates that the transfer function *f*_{p} also be a function of time; however, because the quality change points are known in advance to the observer, we can simply switch between the corresponding transfer functions at the specified change times.

### Reward-Change Task Model Performance

Here we detail the three models used to compare observer performance in the reward-change task, as well as the noise-filtering process used to generate synthetic data. For the noisy Bayesian model, the observer uses the thresholds ±*θ*(*t*) obtained via dynamic programming, thus making the observer a noisy ideal observer. For the constant-threshold model, the observer uses a constant threshold ±*θ*(*t*) = ±*θ*_{0}, which is predicted to be optimal only in simple, static decision environments. Both the noisy Bayesian and constant-threshold models also use a noisy perturbation of the LLR as their belief, where *σ*_{y} is the strength of the noise and *Z* is a sample from a standard normal distribution. In continuous time, this perturbation involves adding to *Equation 7* an independent Wiener process with strength *σ*_{y}.

The UGM, being a phenomenological model, behaves differently from the other models. The UGM belief *E* is the output of a noisy low-pass filter,
where *τ* is a relaxation time constant and the noise-free LLR *y* is the filter's input. The UGM accumulates evidence until the belief crosses hyperbolically decreasing thresholds, where *θ*_{0} and *a* control the initial position and the rate of collapse of the thresholds, respectively. To add noise to the UGM's belief variable *E*, we simply allowed *σ*_{y} > 0 in the low-pass filter in *Equation 12*.

In addition to the inference noise, we also filtered each process through a Gaussian response-time filter with strength *σ*_{mn}, so that if the model predicted a response time *T*, the measured response time was drawn from a normal distribution centered at *T* with standard deviation *σ*_{mn}.
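The UGM belief and threshold dynamics can be sketched as follows; the specific hyperbolic threshold form *θ*_{0}/(1 + *at*), the filter discretization, and all parameter values here are illustrative assumptions rather than the fitted settings.

```python
import numpy as np

# Sketch of the urgency-gating model (UGM): the belief E is a noisy low-pass
# filter of the LLR y, and a response occurs when E crosses a hyperbolically
# collapsing threshold. The threshold form theta_0 / (1 + a t), the filter
# discretization, and all parameter values are illustrative assumptions.
rng = np.random.default_rng(3)
tau, sigma_y = 0.2, 0.3       # filter time constant, belief-noise strength
theta_0, a = 3.0, 2.0         # initial threshold height and collapse rate
m, dt, n_steps = 2.0, 0.005, 1000

y, E, rt = 0.0, 0.0, None
for i in range(1, n_steps + 1):
    t = i * dt
    y += m * dt + np.sqrt(2.0 * m) * rng.normal(0.0, np.sqrt(dt))  # LLR input
    # Noisy low-pass filter: tau dE = (y - E) dt + sigma_y dW
    E += ((y - E) * dt + sigma_y * rng.normal(0.0, np.sqrt(dt))) / tau
    if abs(E) >= theta_0 / (1.0 + a * t):    # hyperbolic threshold crossing
        rt = t
        break
```

Because the threshold collapses toward zero, a response is eventually forced even when the filtered evidence is weak, which is the defining "urgency" behavior of the model.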

To compare model performance on the reward-change task, we first fixed the value of the pre-change reward *R*_{1} (and set *R*_{1} + *R*_{2} = 11 to find the post-change reward) and tuned each model to achieve the optimal reward rate with no additional noise in either the inference or response processes. Bellman's equation outputs both the optimal normative thresholds and reward rate, allowing us to find the exact tuning of the normative model. For the constant-threshold model and the UGM, we approximated optimal tuning by using a grid search over each model's parameters. After tuning all models for a given reward structure, we filtered them through the two noise sources. When generating noisy synthetic data from these models, we generated 100 synthetic subjects, each with sampled noise strengths *σ*_{y} and *σ*_{mn}. We defined the "noise strength" of noise samples (*σ*_{y}, *σ*_{mn}) to be the ratio of each sample to the maximum values of belief noise and motor noise considered, respectively. Noise strength is thus defined between 0 and 1, such that a noise strength of 0.5 is approximately equivalent to the fitted noise strength obtained from tokens-task subject data. We plot the response distributions using noise strengths of 0, 0.5, and 1 in our results. We then generated 1000 trials for each subject and had each simulated subject repeat the same block of trials three times, once for each model. This process ensured that the only difference in model performance would come from the models' distinct threshold behaviors, because each model was taken to be equally noisy and was run using the same stimuli.

### Tokens Task

#### Normative Model for the Tokens Task

For the tokens task, observations take the form of token movements, Bernoulli-distributed with parameter *p* = 0.5, that occur every 200 ms. Because of the stimulus structure, one can show using a combinatorial argument (*Cisek et al., 2009*) that the likelihood *p*_{n} can be written in terms of the token counts, where *U*_{n}, *L*_{n}, and *C*_{n} are the number of tokens in the upper, lower, and center targets after token movement *n*, respectively. Constructing the likelihood transfer function *f*_{p} required for Bellman's equation is also simpler than in the Gaussian 2AFC tasks, as there are only two possible likelihoods to which one can transition after observing a token movement:

Combining *Equation 13* and *Equation 14*, we can fully construct Bellman's equation for the tokens task. While the timings of the token movements, post-decision token acceleration, and inter-trial interval are fixed, we let the reward *R*_{c} and cost function *c* be free parameters to control the different threshold dynamics of the model.
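The combinatorial argument can be sketched directly: with *C*_{n} tokens remaining in the center and each future movement equally likely to go up or down, *p*_{n} is a binomial-tail sum over outcomes in which the upper target finishes with the majority. The sketch below assumes the standard 15-token version of the task.

```python
from math import comb

# Combinatorial likelihood for the tokens task, assuming the standard version
# with 15 tokens. With U_n, L_n tokens in the upper/lower targets and C_n in
# the center, each remaining token moves up with probability 1/2 (Bernoulli,
# p = 0.5), so the probability that the upper target wins is a binomial tail.
def tokens_likelihood(U_n, L_n, C_n):
    total = 0
    for k in range(C_n + 1):              # k of the C_n remaining tokens go up
        if U_n + k > L_n + (C_n - k):     # upper target finishes with majority
            total += comb(C_n, k)
    return total / 2 ** C_n

p_start = tokens_likelihood(0, 0, 15)     # symmetric start: 0.5
```

Because each token movement changes the counts by exactly one, only two successor likelihoods are reachable from any state, which is what makes the transfer function trivial to tabulate.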

#### Model Fitting and Comparison

We used three models to fit the subject response data provided by *Cisek et al. (2009)*: the noisy Bayesian model (*k* = 4 parameters), the constant-threshold model (*k* = 3 parameters), and the UGM (*k* = 5 parameters). To fit each model, we used Markov chain Monte Carlo (MCMC) with a standard Gaussian proposal distribution to generate an approximate posterior made up of 10,000 samples. For more details as to our specific implementation of MCMC for this data, see the MATLAB code available at https://github.com/nwbarendregt/AdaptNormThresh. We held out 2 of the 22 subjects to use as training data when tuning the covariance matrix of the proposal distribution for each model, and performed the model fitting and comparison analysis on the remaining 20 subjects. Using the approximate posterior obtained via MCMC for each subject and model, we calculated AICc using the formula

AICc = −2 ln *L* + 2*k* + (2*k*(*k* + 1))/(*n* − *k* − 1). (*Equation 15*)

In *Equation 15*, *k* is the number of parameters of the model, *L* is the likelihood of the model evaluated at the maximum-likelihood parameters, and *n* is the number of responses in the subject data (*Cavanaugh, 1997; Burnham and Anderson, 2002*). Because each subject performed different numbers of trials, using AICc allowed us to normalize results to account for the different data sizes; note that for many responses (i.e., for large *n*), AICc converges to the standard definition of AIC. For the second model-selection metric, we measured how well each fitted model predicted the trial-by-trial responses of the data by calculating the average RMSE between the response times from the data and the response times predicted by each model. To measure the difference between a subject's response-time distribution and the fitted model's distribution (*Figure 6–Figure Supplement 1*), we used the Kullback-Leibler (KL) divergence:

D_{KL}(*RT*_{D} ‖ *RT*_{W}) = Σ_{i} *RT*_{D}(*i*) ln[*RT*_{D}(*i*)/*RT*_{W}(*i*)]. (*Equation 16*)

In *Equation 16*, *i* is a time index representing the number of observed token movements, *RT*_{D}(*i*) is the probability of responding after *i* token movements from the subject data, and *RT*_{W}(*i*) is the probability of responding after *i* token movements from the model's response distribution. Smaller values of the KL divergence indicate that the model's response distribution is more similar to the subject data.
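Both model-selection metrics are straightforward to compute; the sketch below uses made-up likelihoods and response distributions purely for illustration.

```python
import numpy as np

# Sketch of the two model-comparison metrics: AICc (Equation 15) and the KL
# divergence between response-time distributions (Equation 16). The likelihood
# and response distributions below are made-up values for illustration only.
def aicc(log_L, k, n):
    """Corrected AIC; converges to the standard AIC as n grows large."""
    return -2.0 * log_L + 2.0 * k + (2.0 * k * (k + 1)) / (n - k - 1)

def kl_divergence(rt_data, rt_model):
    """D_KL(RT_D || RT_W) summed over token-movement indices i."""
    rt_data, rt_model = np.asarray(rt_data), np.asarray(rt_model)
    mask = rt_data > 0                    # terms with RT_D(i) = 0 contribute 0
    return float(np.sum(rt_data[mask] * np.log(rt_data[mask] / rt_model[mask])))

# The small-sample correction penalizes extra parameters more when n is small.
aicc_small_n = aicc(log_L=-100.0, k=5, n=30)
aicc_large_n = aicc(log_L=-100.0, k=5, n=3000)

# A model distribution closer to the data yields a smaller KL divergence.
rt_d = [0.1, 0.3, 0.4, 0.2]
d_close = kl_divergence(rt_d, [0.1, 0.3, 0.4, 0.2])
d_far = kl_divergence(rt_d, [0.4, 0.3, 0.2, 0.1])
```

Note that the correction term vanishes as *n* → ∞, recovering standard AIC, and that the KL divergence is zero only when the two response distributions coincide.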

## Code Availability

See https://github.com/nwbarendregt/AdaptNormThresh for the MATLAB code used to generate all results and figures.

## Acknowledgments

We thank Paul Cisek for providing response data from the tokens task used in our analysis.

## Footnotes

**Competing interests**: JIG: Senior editor, eLife. The other authors declare that no competing interests exist.

**Funding**: This work was funded by CRCNS/NIH R01-MH-115557. NWB and ZPK were also supported by R01-EB029847-01 and NSF-DMS-1853630. KJ was also supported by NSF DBI-1707400.