A control theoretic model of adaptive behavior in dynamic environments

In order to behave adaptively in environments that are noisy and non-stationary, humans and other animals must monitor feedback from their environment and adjust their predictions and actions accordingly. An under-studied approach for modeling these adaptive processes comes from the engineering field of control theory, which provides general principles for regulating dynamical systems, often without requiring a generative model. The proportional-integral-derivative (PID) controller is one of the most popular models of industrial process control. The proportional term is analogous to the “delta rule” in psychology, adjusting estimates in proportion to each successive error in prediction. The integral and derivative terms augment this update to simultaneously improve accuracy and stability. Here, we tested whether the PID algorithm can describe how people sequentially adjust their predictions in response to new information. Across three experiments, we found that the PID controller was an effective model of participants’ decisions in noisy, changing environments. In Experiment 1, we re-analyzed a change-point detection experiment, and showed that participants’ behavior incorporated elements of PID updating. Experiments 2-3 employed a gradual change-point task that was optimized to detect PID-like adjustments. In both experiments, the PID model offered better descriptions of behavioral adjustments than the classical delta-rule model. We further examined how participants weighted different PID terms in response to salient environmental events. Across experiments, we found that the weights on these control terms were modulated by reward, surprise, and outcome entropy. These experiments provide preliminary evidence that adaptive behavior in dynamic environments resembles PID control.

To behave adaptively, we must adjust our behavior in response to the dynamics of our environment (Ashby, 1956; Pezzulo & Cisek, 2016). Achieving this goal requires us to collect feedback about the outcomes of our recent actions, and to incorporate this feedback into decisions about how to adjust future actions. Within research on learning and decision-making, a popular approach for achieving this feedback-based control is referred to as the delta-rule model. This model adjusts expectations in proportion to the discrepancy between the outcome that was predicted and the outcome that occurred (i.e., the prediction error), weighted by a learning rate (Δ = α × δ; Widrow & Hoff, 1960; Rescorla & Wagner, 1972). While substantial evidence suggests that people implement some form of delta-rule control (Mirenowicz & Schultz, 1994), this algorithm has major limitations. First, it is sensitive to noise that causes persistent errors, leading either to oscillatory behavior (at a high learning rate) or to a sluggish response (at a low learning rate; Aström & Murray, 2010). More seriously, this algorithm performs poorly in any environment that changes over time (Pearce & Hall, 1980; Aström & Murray, 2010).
More elaborate feedback control mechanisms have been developed for industrial applications within a branch of engineering that studies the regulation of dynamical systems, referred to as Control Theory. These mechanisms augment delta-rule-like algorithms with additional control terms that greatly improve their efficacy. The most popular of these control-theoretic models is the proportional-integral-derivative (PID) controller (Fig. 1). This model is simple, accurate, and robust, with response properties that have been well-characterized over the last century (Franklin, Powell, & Emami-Naeini, 1994; Aström & Murray, 2010). The PID controller takes the error from a reference signal as input, and outputs a control signal consisting of a linear combination of signals proportional to the error (P-Term), the integral of the error (I-Term), and the derivative of the error (D-Term; Fig. 1). These three terms minimize deviations from the reference based on errors in the present, past, and expected future, respectively.
Proportional control (cf. delta-rule control) directly minimizes deviation from the reference, and is often the primary driver of the control process. Integral control provides low-frequency compensation for residual steady-state errors, allowing the controller to reduce noise and track gradual changes in the environment. Derivative control provides high-frequency compensation that increases stability, dampening control adjustments when the controller is approaching the reference and increasing these adjustments when the reference or environment changes suddenly. Here we test whether this popular model of industrial control can account for adjustments in human behavior within a dynamic environment.
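The complementary roles of the three terms can be illustrated with a minimal discrete-time sketch; the gains and the constant error history below are illustrative placeholders, not values from any experiment:

```python
import numpy as np

def pid_update(errors, k_p=1.0, k_i=0.1, k_d=0.1):
    """Return the control signal for the latest error in `errors`.

    Minimal discrete-time PID sketch: the proportional term uses the
    current error, the integral term sums all past errors, and the
    derivative term takes the difference between the last two errors.
    The gains (k_p, k_i, k_d) here are illustrative, not fitted values.
    """
    e = np.asarray(errors, dtype=float)
    p_term = e[-1]
    i_term = e.sum()
    d_term = e[-1] - e[-2] if len(e) > 1 else 0.0
    return k_p * p_term + k_i * i_term + k_d * d_term

# With a constant error, P and D stay flat while I keeps accumulating,
# so the controller pushes harder against a persistent bias.
print(pid_update([2.0, 2.0, 2.0]))  # ≈ 2.6
```

Note how a delta-rule (P-only) controller would emit the same signal on every trial of a constant-error run, whereas the integral term lets the controller escalate its correction.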

Figure 1. Left: The Proportional-Integral-Derivative (PID) Controller uses the current error (P-Term), the integral of the error (I-Term), and the derivative of the error (D-Term), in order to provide compensation based on the present, past, and future, respectively. Right: The PID controller specifies a control signal based on the weighted sum of the PID terms, with each term weighted by its respective gain. Similar to a thermostat-controlled furnace, the plant implements the control signal, moving the measured process closer to the reference (e.g., the desired temperature). Left panel adapted from Aström & Murray (2010), permission pending.

PID control has algorithmic properties that make it useful for most control systems. For instance, relative to algorithms that require an explicit representation of task dynamics, PID can provide an effective, and computationally cheaper, model-free alternative for adjusting cognitive or behavioral processes over time, particularly in natural environments that would require especially complex world models. Moreover, convergent evidence suggests that the PID algorithm may help account for the variety of feedback-related findings observed in humans and monkeys. Behavioral and neural correlates of feedback-controlled choice provide preliminary evidence that participants transform decision-relevant variables in a manner predicted by the PID algorithm.
Consistent with proportional control, there is substantial evidence that participants adjust their behaviors based on recent errors or conflict (Rescorla & Wagner, 1972; Lau & Glimcher, 2005; Rabbitt, 1966; Gratton, Coles, & Donchin, 1992; Ullsperger, Danielmeier, & Jocham, 2014), with corresponding signals observed most prominently in the striatum and anterior cingulate cortex (ACC; Mirenowicz & Schultz, 1994; Garrison, Erdeniz, & Done, 2013; Niki & Watanabe, 1979; Ito, Stuphorn, Brown, & Schall, 2003; Matsumoto, Matsumoto, Abe, & Tanaka, 2007; Kennerley, Walton, Behrens, Buckley, & Rushworth, 2006; Seo & Lee, 2007; Smith et al., 2015). Consistent with integral control, experiments and simulations have suggested that people are sensitive to the extended history of errors or conflict (Laming, 1968; Logan & Zbrodoff, 1979; Botvinick, Braver, Barch, Carter, & Cohen, 2001; Aben, Verguts, & Van den Bussche, 2017; Alexander & Brown, 2011; Bugg & Crump, 2012; Wittmann et al., 2016), and experiments have found neural signals in the prefrontal cortex and ACC that reflect this history (Carter et al., 2000; Blais & Bunge, 2010; Bernacchia, Seo, Lee, & Wang, 2011; Wittmann et al., 2016). Recent models of the ACC have emphasized the role that integrative, recurrent activity in this region plays in executive control (Wang, 2008; Hunt & Hayden, 2017; Shahnazian & Holroyd, 2017). Finally, consistent with derivative control, prior work has found that participants track the environmental rate of change when making decisions, with associated neural correlates in the anterior prefrontal cortices and ACC (Behrens, Woolrich, Walton, & Rushworth, 2007; Jiang, Beck, Heller, & Egner, 2015; Bernacchia et al., 2011; Kovach et al., 2012; McGuire, Nassar, Gold, & Kable, 2014; Wittmann et al., 2016).
Despite the success of the PID model as a simple and effective algorithm for implementing control in other fields, and evidence of relevant neural signatures in circuits involved in adaptive control, PID has yet to be formally tested as a model of human adaptive behavior. In the current set of experiments, we directly tested whether a PID model can describe human performance in adaptive learning tasks. In Experiment 1, we re-analyzed a recent study that examined predictive inference in an environment with discrete change points. Behavior on this task confirmed key predictions of the PID model, but was limited in its ability to adjudicate between candidate models. Informed by our findings in Experiment 1, for Experiments 2-3 we developed a novel task that was optimized for PID control, using gradual rather than sudden change-points. We found that the PID model was a strong predictor of participants' choices in both experiments. Experiment 3 replicated the predictive power of our model, and further examined whether participants dynamically adjust their control terms based on rewards, surprise, and outcome entropy. Across these tasks, participant performance was consistent with key predictions of the PID model, demonstrating that this simple model provides a promising account of adaptive behavior.

Experiment 1
The PID model is designed to adapt the behavior of a system in response to changes in the environment. We therefore began by testing whether this model could explain behavioral adjustments in an existing change-point detection task, one that was designed to assess how humans can adapt their learning rate to uncertain and volatile outcomes (McGuire et al., 2014). In this experiment, participants predicted where a target stimulus would appear (horizontal location on the screen), and then received feedback about where the true location of the outcome had been (See Fig. 2A). Outcome locations were normally distributed around a mean, and the mean of this distribution changed suddenly throughout the experiment (change points), according to a predetermined hazard rate. This task allows us to measure participants' choices and feedback in a continuous space with high precision, making it desirable for studying PID control. We can therefore use the PID model to predict trial-to-trial adjustments in participant behavior (predicted locations) based on their history of error feedback. In other respects, this task is not ideally suited for testing our model: the dramatic changes in target distributions may 'reset' adaptive control processes (Bouret & Sara, 2005;Tervo et al., 2014), and so this experiment serves as a preliminary test of our hypothesized control dynamics. We will address these concerns in Experiments 2-3.

Participants and Procedure
Experiment 1 consisted of a re-analysis of a change-point task used by McGuire and colleagues (2014). Briefly, thirty-two participants (17 female, mean(SD) age = 22.4(3.0)) selected a location on a line using a joystick, and then were shown the correct location for that trial. On a random subset of trials, participants were rewarded if their accuracy was within a criterion, which de-correlated trial-wise errors and rewards. Target locations were drawn from a Gaussian distribution with a variable mean and variance. The mean was stable for three trials, and then on a weighted coin flip (hazard rate: 0.125) was uniformly redrawn from the line coordinates; the variance alternated between high and low levels across blocks. Participants performed 160 training trials followed by 4 blocks of 120 trials during fMRI. Two participants were excluded for having an inconsistent number of trials per block, leaving 30 participants for the final analysis. See McGuire et al. (2014) for additional details.
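The generative process of this task can be sketched as follows. The hazard rate and three-trial stability period come from the text; the line range, outcome noise, and fixed-variance simplification (the actual task alternated high and low variance across blocks) are illustrative assumptions:

```python
import numpy as np

def simulate_changepoint_block(n_trials=120, hazard=0.125, sd=10.0,
                               line_range=(0.0, 300.0), stable_min=3,
                               seed=0):
    """Sketch of the Experiment 1 generative process.

    The mean stays fixed for at least `stable_min` trials; afterwards,
    on each trial, it is redrawn uniformly with probability `hazard`.
    Outcomes are Gaussian around the current mean. The line range and
    outcome noise here are placeholders, not the task's actual values.
    """
    rng = np.random.default_rng(seed)
    mean = rng.uniform(*line_range)
    trials_since_change = 0
    outcomes, means = [], []
    for _ in range(n_trials):
        if trials_since_change >= stable_min and rng.random() < hazard:
            mean = rng.uniform(*line_range)
            trials_since_change = 0
        outcomes.append(rng.normal(mean, sd))
        means.append(mean)
        trials_since_change += 1
    return np.array(outcomes), np.array(means)

outcomes, means = simulate_changepoint_block()
```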

Lagged Regression Analysis
A critical prediction of the PID model is that a participant's updates (i.e., the change in their location guesses) should depend on their history of feedback. While a delta-rule model predicts that only the error on the current trial will directly influence updates, a PID controller integrates errors over a longer history, enabling the controller to correct for a consistent bias in errors. In order to measure the independent contribution of each trial's feedback in the recent past, we regressed the update in location estimates between two trials (current vs. previous) on the errors from the current and 10 leading trials (u_t ~ 1 + e_t + e_{t-1} + ⋯ + e_{t-10}; Wilkinson & Rogers (1973) notation). We assessed the influence of previous trials' feedback by testing whether the sum of previous trials' betas was significantly different from zero, using a nonparametric sign-randomization test at the group level (comparing the observed results to a null distribution that we generated by randomly assigning positive or negative signs to each participant's summed betas). Throughout the paper, all randomization tests used 10^5 simulations, and all statistical tests were two-tailed with α = .05.

Figure 2. A) A representative block of trials from an example participant. The mean correct location was stable for a variable number of trials, and then was uniformly resampled. Inset histogram: The distribution of errors for this participant across their session. B) Left: Participants in Experiment 2 selected a location on the circle with their mouse, and then were shown the correct location. Right: A representative block of trials, demonstrating that the mean correct location changed gradually over time. As seen in the histogram for an example participant (inset), the gradual changes in location for this task resulted in error distributions that were less leptokurtic than in Experiment 1 (compare Panel A inset). C) Left: Experiment 3 was identical to Experiment 2 but participants were rewarded based on their accuracy, according to one of four reward-error functions. They were informed of the current reward mode during Fixation, and during Feedback they received the reward corresponding to their accuracy on that trial (conditional on their reward mode). Right: Error-reward slopes for the four potential reward modes.
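As a concrete sketch, the lagged design matrix and regression can be constructed as follows; ordinary least squares stands in for the regression used in the paper, and the variable names are ours:

```python
import numpy as np

def lagged_error_betas(updates, errors, n_lags=10):
    """Regress each trial's update on the current error and `n_lags`
    leading (previous) errors via ordinary least squares.

    Returns the coefficient vector [intercept, e_t, e_{t-1}, ...,
    e_{t-n_lags}]. Trials without a full lag history are dropped.
    """
    updates = np.asarray(updates, float)
    errors = np.asarray(errors, float)
    rows, y = [], []
    for t in range(n_lags, len(updates)):
        # intercept, then the current error and the n_lags preceding it
        rows.append(np.concatenate(([1.0], errors[t - n_lags:t + 1][::-1])))
        y.append(updates[t])
    X = np.array(rows)
    betas, *_ = np.linalg.lstsq(X, np.array(y), rcond=None)
    return betas
```

On purely delta-rule data (updates proportional to the current error alone), only the current-error beta should be non-zero; history-dependent updating shows up as non-zero betas on the leading errors.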

PID Model
The PID algorithm controls a process in order to maintain a desired reference signal (Fig. 1). It takes as input the signed error relative to this reference (e_t = reference − output), and produces a control signal (u_t) that specifies the adjustment for the next time point (here, the next trial). The control signal is defined by a linear combination of three terms: the P-Term (reflecting the error), the I-Term (reflecting the leaky integral of the error), and the D-Term (reflecting the derivative of the error). Each of these terms is weighted by its own gain parameter (K_P, K_I, K_D). For trial t, the control signal (u_t) is therefore determined by transforming the error (e_t) as follows:

u_t = K_P·e_t + K_I·I_t + K_D·(e_t − e_{t−1}), where I_t = λ·I_{t−1} + e_t

Here λ represents a memory persistence parameter, with larger values leading to longer retention. On the first trial of a block, the I-Term is e_t and the D-Term is 0, producing a control action similar to proportional control. In the following tasks, u_t was defined as the difference in the choice location between trial t+1 and trial t (hereafter, the 'update'), and e_t was the difference between the correct location and the chosen location.
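The trial-wise construction of the three terms, including the leaky integral and the stated first-trial conventions, can be sketched as:

```python
import numpy as np

def pid_terms(errors, lam=0.9):
    """Compute trial-wise P, I, and D regressors from a block's errors.

    I is a leaky integral (I_t = lam * I_{t-1} + e_t) and D is the
    first difference of the error; on the first trial I = e_1 and
    D = 0, as described in the text. The value of `lam` here is
    illustrative.
    """
    e = np.asarray(errors, float)
    p = e.copy()
    i = np.empty_like(e)
    d = np.zeros_like(e)
    i[0] = e[0]
    for t in range(1, len(e)):
        i[t] = lam * i[t - 1] + e[t]
        d[t] = e[t] - e[t - 1]
    return p, i, d
```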

PID Model Fit
We used each participant's timecourse of errors within each block to generate hypothesized P, I, and D values based on the raw errors, the leaky integral of the errors, and the first derivative of the errors, respectively. Our regression model consisted of an intercept and the three PID terms (u_t ~ 1 + P + I + D), and we fit this model with iteratively reweighted robust regression (using MATLAB's fitlm function; bisquare weight function), in order to minimize overfitting to the outliers that can occur when participants make scalar responses. Fit statistics were generated based on the non-reweighted residuals from the robust model, in order to avoid undue bias in favor of complex models.
Because the λ parameter (memory persistence) interacted with the identifiability of our PID terms when estimated jointly (e.g., when λ = 0, P and I are identical), we chose to fit this single term at the group rather than individual level. We fit λ with a grid search (range: [0.5-1], increments of 0.001), using median R 2 across participants as our measure of fit (in order to normalize individual differences in update variability). Regression models were estimated at the individual level, and regression weights were tested for deviance from zero at the group level with a sign-randomization test (see above).
We compared the P (i.e., delta-rule), PI, PD, and PID models, as these are the most common instantiations of the PID algorithm. To compare model performance, we used each participant's Akaike Information Criterion (AIC; Akaike, 1983), an index of goodness-of-fit that is penalized for model complexity. We compared the AIC at the group level using Bayesian model selection (Rigoux, Stephan, Friston, & Daunizeau, 2014), quantifying the Bayesian omnibus risk (the likelihood that all models are equally good) and each model's protected exceedance probability (PXP; the likelihood that the proportion of participants generated using a given model exceeds the proportion generated using the other models, controlling for the chance rate).
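For reference, a Gaussian-residual AIC can be computed from a model's residuals as follows (additive constants dropped; the paper's exact likelihood specification is not given, so this is the standard textbook form):

```python
import numpy as np

def aic_from_residuals(residuals, n_params):
    """AIC for a Gaussian regression model, up to an additive constant.

    Uses the profile log-likelihood with sigma^2 = RSS / n, giving
    AIC = n * log(RSS / n) + 2 * n_params. Lower is better; the 2k
    term penalizes model complexity.
    """
    r = np.asarray(residuals, float)
    n = len(r)
    rss = np.sum(r ** 2)
    return n * np.log(rss / n) + 2 * n_params

# A model with smaller residuals gets a lower (better) AIC at equal k.
```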

PID Controller Simulations
To determine whether participants were behaving similarly to the best-performing PID controller for this environment, we simulated PID controller performance for the outcome locations that each participant encountered during the experiment. We used a restricted maximum likelihood estimation procedure (MATLAB's fmincon) to determine the values of the PID gains, λ, and choice bias (i.e., intercept) that perform best given the outcomes of each participant's task. We then compared these best-performing PID gains to the gains estimated from participants' behavior.
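The objective that such a fitting routine optimizes can be sketched as a closed-loop simulation of a PID agent predicting a one-dimensional outcome sequence. Mean squared prediction error stands in here for the paper's objective, and the function is a simplified stand-in for the fmincon-based procedure:

```python
import numpy as np

def simulate_pid_tracker(outcomes, k_p, k_i, k_d, lam, bias=0.0):
    """Closed-loop simulation of a PID agent tracking `outcomes`.

    The agent's next prediction is its current prediction plus the
    control signal u_t (P, leaky-I, and D terms plus a bias); returns
    the mean squared prediction error, the quantity an optimizer
    would minimize over (k_p, k_i, k_d, lam, bias).
    """
    y = np.asarray(outcomes, float)
    pred = y[0]            # start at the first outcome
    integral, prev_err = 0.0, 0.0
    sq_errs = []
    for t, outcome in enumerate(y):
        err = outcome - pred
        sq_errs.append(err ** 2)
        integral = err if t == 0 else lam * integral + err
        deriv = 0.0 if t == 0 else err - prev_err
        pred += k_p * err + k_i * integral + k_d * deriv + bias
        prev_err = err
    return float(np.mean(sq_errs))
```

A proportional gain of 1 lets the agent jump to each observed outcome; richer gain settings trade responsiveness against noise sensitivity, which is what the best-performing simulated gains capture.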
We also tested whether our fitted model would produce the same pattern of behavior that we measured with our lagged regression. We simulated an ideal observer that used each participant's estimated PID parameters and outcome history in order to generate a sequence of updates, and then fit our lagged regression to this simulated behavior, separately for the P, PI, and PID candidate controllers. This analysis allows us to qualitatively determine the extent to which the PID model can act as a generative model of participants' decision-making behavior (Gelman, Meng, & Stern, 1996;Nassar & Frank, 2016). If participants are using PID control, then simulated updates from a PID controller should similarly weight the feedback received over previous trials.

Model-Agnostic Analysis
To identify the degree to which behavioral adjustments were influenced by recent feedback, we regressed participants' updates on their current and previous errors. We found that the current trial's error was the strongest predictor of the current update, but that updates were also influenced by feedback from leading trials (Fig. 3A). The sum of leading trials' betas was reliably less than zero (Mean(SD) summed beta: -0.040(0.064), p = .00017). This suggests that while immediate feedback was the most influential factor for participants' updates, they also incorporated an extended history of error feedback into their adjustments. While the current trial had a positive influence on updates, these previous trials instead had a negative influence on the current update, potentially compensating for the extreme errors that participants made at change-points (see Fig. 2A).
To verify that our model can generate performance that captures the behavior observed in this task, we simulated behavior on this task using parameters estimated for our PID model and reduced versions thereof (P, PI). We then performed the same lagged regression on these simulated data that we used on real data (Fig. 3A). As expected, we found that the simulated PI and PID models captured the influence of leading errors, unlike the P-only model (which predicts that there should be no influence of leading errors).

PID Model Fit
We first simulated the best-performing PID agent for each participant's outcome history, finding that the simulated PID gains were all reliably different from zero (Mean(SD) PID gain: K_P = 1.0(0.18), K_I = -0.16(0.12), K_D = -0.070(0.077), λ = 0.85(0.071); randomization test; all ps ≤ 10^-5; see Fig. 3B). Consistent with the lagged-regression analysis, we found that the simulated integral and derivative gains were negative.

Figure 3. Experiment 1 Results. A) For our model-agnostic analyses, we regressed participants' current updates on the errors made on the current and ten leading trials (black: participants' regression weights). Next, we used the PID parameters estimated in our regression analysis to generate behavior from P, PI, and PID controllers, and fit our lagged regression to this simulated behavior (colored lines). Inset: the regression weights from only the leading errors, controlling for the effect of the current error. B) For our model-based analyses, we regressed participants' updates on the trial-wise P-, I-, and D-Terms, and found that all three terms were significantly different from zero. Colored circles indicate individual participants' regression weights. Gray diamonds indicate the mean gains from ideal PID controllers fit to each participant's outcome history. C) We used Bayesian model selection to adjudicate between our candidate models, and found that the PI model best explained the data, albeit with moderate support (see text). Error bars throughout indicate mean and between-participant bootstrapped 95% confidence intervals.

Fitting our PID model to participants' updates, we found that the best-fit models accounted for a substantial amount of this variance (median R^2 = 0.92), with parameters for all terms being significantly different from zero (Mean(SD) standardized betas: β_P = 0.62(0.27), p ≤ 10^-5; β_I = -0.019(0.022), p ≤ 10^-5; β_D = -0.0068(0.016), p = 0.022; see Fig. 3B). The group-level λ (memory persistence) was also relatively high (0.9430), suggesting that participants retained a great deal of information regarding past feedback. Participants' estimated gains qualitatively resembled the gains produced by the simulated PID controller, sharing the same sign and approximate magnitude (compare gray diamonds and circles in Fig. 3B).
Among these four candidate models, we found that the PI model had the highest protected exceedance probability (PXP_P = 0.14, PXP_PI = 0.46, PXP_PD = 0.14, PXP_PID = 0.26; Fig. 3C), but there was altogether insufficient evidence to support one model over another (Bayesian omnibus risk = 0.55, providing moderate evidence for the 'null' explanation that all models are equally likely). These data therefore do not allow us to rule out the possibility that a simple delta-rule (P-only) model parsimoniously accounts for participant behavior.

Discussion
We found preliminary evidence that participants performing a change-point detection task are influenced by their history of error feedback, consistent with the predictions of a PID controller. Participants' updates could also be predicted from the integral and derivative of their errors. Despite these promising indications of PID control, we were unable to confidently differentiate between candidate models.
While this experiment offers mixed evidence in favor of the PID algorithm, this may be because the task was designed for change-point models, and has sudden, dramatic shifts in the outcome distribution. These change-points introduce extreme errors, which participants might treat categorically differently from normal prediction errors, evoking a 'reset' in their decision process or the representation of a different context (Bouret & Sara, 2005; Nassar, Wilson, Heasly, & Gold, 2010; McGuire et al., 2014; O'Reilly et al., 2013). With this concern in mind, we developed a novel adaptive learning task in which outcomes changed smoothly over time, encouraging participants to treat outcomes as arising from a single, changing context.

Experiment 2
While Experiment 1 provided promising evidence that participants use their history of feedback in a way that resembles PID control, it did not provide definitive evidence that this is the best explanation for the data. However, this may have been caused by participants strategically resetting their predictions at extreme change-points, making it more difficult to measure history-dependent predictions. To address this, we developed a gradual change-point task in which participants tracked outcome distributions whose mean gradually changed from one location to another, and whose variance changed randomly throughout the block. In order to make these location transitions seem more continuous, locations were represented along a circle rather than a straight line, thus also avoiding edge effects that can occur at either end of a screen. This design allowed us to precisely measure participants' predictions, errors, and adjustments within an environment whose dynamics are more fluid and predictable than Experiment 1's. This task was explicitly designed to emulate an environment for which a PID controller is well-suited, and specifically to maximize our power to detect differences between PID control and proportional (delta-rule) control.

Participants & Procedure
Twenty-nine Brown University undergraduate students (25 female; mean(SD) age = 18.6(0.825)) performed a supervised learning task in which they predicted an outcome location on a circular display (see Fig. 2B).
Participants completed 5 blocks of 100 trials in which they used a mouse cursor to guess a location on the circumference of the circle. They were then shown the correct location, with an arc indicating the magnitude and direction of their error. Participants completed 50 training trials before the main experiment. Participants had up to 3 seconds to make their guess, or else their final cursor angle would be chosen as the guess for that trial, and feedback was presented for 2 seconds. Our final analysis excluded any trials where participants did not move their cursor to the edge of the circle, as well as a subset of trials following aberrant feedback due to a technical issue (1.8% of total trials).
The target location for each trial was drawn from a Gaussian distribution over arc degrees, with a mean and standard deviation that systematically changed over time. On a weighted coin flip (hazard rate: 0.80), the distribution's mean shifted based on a random draw from U(-180, 180) degrees. After the new mean was drawn, the mean would transition from the old mean to the new mean over U(8, 20) trials, with the means during transition trials linearly interpolated between the old and new means. The standard deviation varied independently of the mean, and was redrawn from U(1, 8) degrees on a weighted coin flip (hazard rate: 0.40). These task parameters were selected through simulation in order to maximally differentiate the performance of PID and delta-rule models. Unless otherwise indicated, methods of analysis and model selection for this study are identical to Experiment 1.
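This generative process for the mean can be sketched as follows. The hazard rate, shift distribution, and transition lengths come from the text; drawing a new change-point only once the previous transition completes, and ignoring circular wrap-around to [0, 360), are simplifications of this sketch:

```python
import numpy as np

def gradual_changepoint_means(n_trials=100, hazard=0.80, seed=0):
    """Sketch of the Experiment 2 generative process for the mean.

    With probability `hazard` (per eligible trial) a shift is drawn
    from U(-180, 180) degrees, and the mean then moves linearly toward
    the new target over U(8, 20) trials. Wrapping onto the circle is
    omitted here for clarity.
    """
    rng = np.random.default_rng(seed)
    mean = rng.uniform(-180, 180)
    means = []
    transition = []          # queue of interpolated means still to play out
    for _ in range(n_trials):
        if transition:
            mean = transition.pop(0)
        elif rng.random() < hazard:
            target = mean + rng.uniform(-180, 180)
            n_steps = int(rng.integers(8, 21))   # U(8, 20) trials
            transition = list(np.linspace(mean, target, n_steps + 1)[1:])
            mean = transition.pop(0)
        means.append(mean)
    return np.array(means)
```

Because each shift of at most 180 degrees is spread over at least 8 trials, the per-trial change in the mean never exceeds 22.5 degrees, which is what makes the drift feel gradual rather than abrupt.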

Model-Agnostic Analysis
Regressing the current update on the current and ten leading errors (see Experiment 1 Methods), we again found that the sum of leading errors' betas was significantly different from zero (Mean(SD) summed betas: 0.29(0.34), p ≤ 10^-5; Fig. 4A). This replicates the observation in Experiment 1 that participants incorporate the extended history of errors into their prediction process. Fitting our lagged model to behavior generated from our models produced a similar pattern of predictions as in Experiment 1: the P-only model categorically failed to capture the influence of leading errors. The PI and PID models were similar in their ability to recreate participants' use of previous errors; however, the PID model seemed to better capture participants' weighting of recent leading errors (i.e., over the previous three trials). We examined whether this trend was reliable across participants by fitting the linear and quadratic trends over trials to each participant's leading betas. We found a significant quadratic trend (Mean(SD) quadratic trend beta: -0.26(0.28), p < 10^-4) but not a linear one (Mean(SD) linear trend beta: 0.020(0.56), p = .15). While this finding is broadly compatible with derivative control's non-linear weighting of previous errors, the observed trend extended further backwards in time than the previous-trial effects predicted by a simple derivative.

Figure 4. Experiment 2 results; see Figure 3 for detailed graph legends. Error bars indicate mean and between-participant bootstrapped 95% confidence intervals.
Our model comparison found that there was a marginal difference between our models (Bayesian omnibus risk = 0.061), and Bayesian model selection favored the PI model (PXP P = .018, PXP PI = .91, PXP PD = .015, PXP PID = .058; Fig. 4C). While there was only a marginal omnibus difference between models, this was mostly due to the similarity in likelihood between the PI and PID model (excluding PID: Bayesian omnibus risk < .001, PXP PI > .99). Therefore, our model selection provides strong evidence that an elaborated PID variant explains behavior better than strict proportional control.

Discussion
Using a novel variant of a change-point task, we provide strong evidence that the PID control model can usefully describe participants' predictions. Our model-free analysis showed that participants incorporated previous errors in their adjustments, in a way that is not predicted by a proportional control model, but is well approximated by our PID model. We also found that all of the PID terms were significant predictors of participants' updates, and that they were qualitatively similar to the gains of an ideal PID controller. Building on Experiment 1, this experiment provided strong evidence that PI was the best-fitting model. These data further support a role for control processes that extend beyond immediate errors.
These first two experiments have provided promising evidence that the PID controller predicts adaptive behavior better than the classical delta-rule model. A striking feature of these experiments is that participants had very different estimated control gains across the two experiments, consistent with the differential gains of the best-performing PID agents. These differences suggest that participants may set their control gains in a context-specific manner, although at an unknown timescale. Popular delta-rule models have suggested that participants may in fact rapidly adjust their control gain in response to changes in the local context (Pearce & Hall, 1980). This prompted us to develop a third experiment, in order to replicate our results from Experiment 2 and to test whether participants adaptively tune their PID gains to their local context.

Experiment 3
In Experiment 3, we sought to replicate the findings from Experiment 2, while at the same time manipulating the incentives for performing accurately on the task. We additionally sought to examine three factors that might influence the weights that individuals might place on each of the PID terms over the course of an experiment.
First, we examined the influence of surprise (absolute error) on these control weights, given classic findings that such surprise signals modulate learning, indicating the degree to which the environment has been learned (Pearce & Hall, 1980; see also Hayden, Heilbronner, Pearson, & Platt, 2011;Nassar et al., 2010;McGuire et al., 2014;O'Reilly et al., 2013). Second, given evidence that learning can be influenced by uncertainty over recent feedback (Yu & Dayan, 2005;Nassar et al., 2010), or related estimates of volatility (Behrens et al., 2007), we examined how PID gains were influenced by an index of the outcome entropy over the past several trials. This measure of uncertainty indexes both expected uncertainty (the variance in the generative distribution) and unexpected uncertainty (changes in the mean of the generative distribution, i.e., drift), the latter of which is more dominant in our tasks. Finally, we examined the influence of reward on PID gains, given previous evidence that these can impact learning in a dissociable fashion from surprise or uncertainty alone (McGuire et al., 2014), and more generally that rewards may compensate for the costs of effortful control policies (Hayden, Pearson, & Platt, 2009;Padmala & Pessoa, 2011;Manohar et al., 2015;Kool, Gershman, & Cushman, 2017), including learning in particular (Hayden et al., 2009;Shenhav, Botvinick, & Cohen, 2013). Importantly, Experiments 1 and 3 were unique in the literature in that they were designed to deconfound reward from errors, providing us the ability to measure their influences on PID gains separately from one another and from our measure of uncertainty.

Participants and Procedure
Forty-seven Brown University subject pool participants (32 female; mean(SD) age = 21.30(4.07)) performed a rewarded supervised learning task (without monetary compensation). Apart from the reward manipulation, the structure of this task was similar to Experiment 2. On each trial, the reward magnitude depended on the accuracy of the participant's guess (i.e., the absolute error between guess and outcome location; Fig. 2C). These rewards decreased exponentially with increasing error magnitude. In order to de-correlate rewards and errors, and to vary overall motivation to perform the task, we adjusted the steepness (mean) of this exponential (gamma) function over trials, resampling one of four possible means (1, 1.5, 2.5, 4.5) at random time points, chosen with a flat hazard rate of .20 across all trials (Fig. 2C, right). These different levels of steepness defined four 'reward modes,' which participants were instructed about at the beginning of the experiment. The reward mode for a given trial was indicated by the color of the fixation cross (one of four colors from equally spaced locations on a heat colormap). The inputs (errors) to these reward functions were divided by 3.5 in order to approximately match the reward that these functions returned to the mean performance level of Experiment 2.
Participants completed 50 training trials, followed by 6 blocks of 75 trials. On each trial, participants had up to 5 seconds to make their guess, feedback was presented for 2.5 seconds, and then the reward mode for the next trial was displayed during an ITI that was drawn from U(1, 1.5). At the end of each block, participants were shown the mean reward earned during that block. Our final analysis excluded any trials where participants did not move their cursor to the edge of the circle (.07% of total trials). The lagged and trial-wise regression analyses were performed as described in Experiments 1 and 2.
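The reward scheme described above can be sketched as follows. Only the mode means, the hazard rate, and the error divisor come from the text; the exponential (gamma, shape = 1) density form and the function names are illustrative assumptions, not the exact task code:

```python
import numpy as np

# Illustrative sketch of the Experiment 3 reward scheme. The density form
# below is an assumption; the mode means, hazard rate, and error scaling
# are taken from the text.
REWARD_MODES = [1.0, 1.5, 2.5, 4.5]  # steepness (mean) of each reward mode
HAZARD = 0.20                        # per-trial probability of resampling the mode
ERROR_SCALE = 3.5                    # errors were divided by 3.5 (see text)

def reward(abs_error, mode_mean):
    """Reward decays exponentially with the scaled absolute error."""
    x = abs_error / ERROR_SCALE
    return (1.0 / mode_mean) * np.exp(-x / mode_mean)

def next_mode(rng, current_mode):
    """Resample the reward mode with a flat hazard rate across trials."""
    return rng.choice(REWARD_MODES) if rng.random() < HAZARD else current_mode
```

Under this parameterization, a perfectly accurate guess in the steepest mode earns the maximum reward, and reward falls off smoothly as the error grows.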

Gain Modulation Analysis
In order to examine the influence of reward (Experiments 1 and 3; n = 77), errors (Experiments 1-3; n = 106), and outcome entropy (Experiments 1-3; n = 106) on the gains of the PID terms, we re-ran our PID regression analysis, including interaction terms for gain modulation. In Experiment 1, the reward modulator consisted of binary reward feedback that was given on a random subset of trials, conditional on participants' error being within a pre-specified threshold. This feedback was orthogonal to absolute error on the task. In Experiment 3, the reward modulator was the number of points that participants received on each trial. In all three experiments, the error modulator was the absolute prediction error, and outcome entropy was defined as the natural logarithm of the outcome sample standard deviation over the current and 10 previous trials within each block (with a truncated window for the first 10 trials in each block). In Experiment 3, participants received both error and reward feedback on every trial, and the absolute error was moderately correlated with the reward amount (median r = -.68). We ran Belsley collinearity diagnostics (Belsley, Kuh, & Welsch, 1980), and found that the collinearity between absolute error and reward was below standard tolerances.
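The outcome-entropy regressor described above can be sketched directly from its definition (the function name and the handling of the first trial, where a standard deviation is undefined, are illustrative assumptions):

```python
import numpy as np

# Sketch of the outcome-entropy regressor: the natural log of the outcome
# sample standard deviation over the current and 10 previous trials, with
# a truncated window early in the block.
def outcome_entropy(outcomes, window=11):
    outcomes = np.asarray(outcomes, dtype=float)
    entropy = np.full(len(outcomes), np.nan)  # undefined with < 2 samples
    for t in range(len(outcomes)):
        w = outcomes[max(0, t - window + 1):t + 1]
        if len(w) >= 2:
            entropy[t] = np.log(np.std(w, ddof=1))
    return entropy
```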
A robust regression (bisquare-weighted) was run for every participant in every experiment, with the reward modulator excluded for Experiment 2. The regression model included all main effects, as well as the interactions between the PID terms and the gain modulator M (Update ~ 1 + P + I + D + M + P×M + I×M + D×M). We mean-centered betas within their respective experiment and then re-centered them on their grand mean, in order to remove between-experiment variance (Cousineau, 2005).
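As a rough illustration of this analysis, the following sketch fits a model of this form with bisquare-weighted iteratively reweighted least squares on simulated data. All variable names and the generating gains are placeholders, not values from the experiments:

```python
import numpy as np

# Sketch of the gain-modulation regression on simulated data.
def bisquare_irls(X, y, c=4.685, n_iter=30):
    """Robust regression via iteratively reweighted least squares
    with Tukey bisquare weights."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary least squares start
    for _ in range(n_iter):
        r = y - X @ beta
        scale = np.median(np.abs(r - np.median(r))) / 0.6745  # robust scale (MAD)
        u = r / (c * max(scale, 1e-12))
        w = np.where(np.abs(u) < 1, (1 - u**2) ** 2, 0.0)  # bisquare weights
        Xw = X * w[:, None]
        beta = np.linalg.solve(Xw.T @ X, Xw.T @ y)  # weighted normal equations
    return beta

rng = np.random.default_rng(1)
n = 500
P, I, D, M = (rng.standard_normal(n) for _ in range(4))
# Simulated updates: placeholder gains, with one reward-by-P interaction.
y = 0.7 * P + 0.12 * I + 0.02 * D + 0.10 * (P * M) + 0.05 * rng.standard_normal(n)
# Design: 1 + P + I + D + M + P*M + I*M + D*M
X = np.column_stack([np.ones(n), P, I, D, M, P * M, I * M, D * M])
beta = bisquare_irls(X, y)  # beta[1] ~ P gain; beta[5] ~ P-by-modulator interaction
```

The bisquare weighting down-weights trials with large residuals, so occasional outlier updates do not dominate the estimated gains.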

Model-Agnostic Analysis
Regressing the current and ten leading errors onto the current update (see Experiment 1 Methods), we replicated the observation that participants were influenced by past errors (Mean(SD) summed betas: 0.25(0.21), p ≤ 10⁻⁵; see Fig. 5A). Our model-generated behavior again showed that the delta-rule model categorically fails to capture the influence of leading errors. Unlike Experiment 2, here we found that the weighting of previous errors was best fit as a linear decay from the current trial, resembling PI control (Mean(SD) trend beta: linear = 0.24(0.35), p < 10⁻⁴; quadratic = -0.03(0.39), p = .40; sign-randomization test). This discrepancy from Experiment 2 may be due to the fact that rewards in Experiment 3 were highly dependent on accuracy, which may have biased participants more toward integral control (which favors accuracy) and away from derivative control (which favors stability; Aström & Murray, 2010).
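The model-agnostic lagged-error regression can be sketched as follows; a pure delta-rule (P-only) learner would place weight only on the current error, so reliable weight on earlier lags indicates integral- or derivative-like control. The function name and array layout are illustrative assumptions:

```python
import numpy as np

# Sketch of the model-agnostic lagged-error regression: the update on each
# trial is regressed on the current error and the ten preceding errors.
def lagged_error_betas(errors, updates, n_lags=10):
    errors = np.asarray(errors, dtype=float)
    updates = np.asarray(updates, dtype=float)
    rows, y = [], []
    for t in range(n_lags, len(errors)):
        rows.append(errors[t - n_lags:t + 1][::-1])  # [e_t, e_{t-1}, ..., e_{t-10}]
        y.append(updates[t])
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    return np.linalg.lstsq(X, np.array(y), rcond=None)[0][1:]  # drop the intercept
```

Applied to a simulated learner whose updates mix the current and previous error, this recovers weight at both lags, whereas a delta-rule learner yields weight only at lag zero.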

Figure 5. Experiment 3 Results. A) Participants adjusted their choices based on previous errors, unlike the predictions of a proportional controller (i.e., delta-rule model). B) P, I, and D control significantly predicted participants' updates. C) The PI controller best explained participants' behavior. See Figure 3 for detailed graph legends. Error bars indicate mean and between-participant bootstrapped 95% confidence intervals.

PID Model Fit
Replicating Experiments 1 and 2, we found that our standard PID model accounted for most of the variance in participants' updates (median r² = 0.81). The parameters for the P-, I-, and D-terms were all significantly different from zero (standardized beta = Mean(SD); β_P = 0.75(0.12), p ≤ 10⁻⁵; β_I = 0.12(0.085), p < 10⁻⁵; β_D = 0.015(0.049), p = .038; Fig. 5B). The group-level λ was .8020. Participants' estimated gains were similar to those of the ideal PID controller, but they over-weighted proportional control and under-weighted integral control. We found that there were likely differences between the model likelihoods (Bayesian Omnibus Risk < 10⁻⁴), and Bayesian Model Selection strongly favored the PI model (PXP_PI > 0.99) over the alternate models (all other PXPs < 0.01; Fig. 5C).
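Under the standard PID parameterization assumed throughout (proportional, leaky-integral, and first-difference terms), a trial-by-trial update can be sketched as below. The gains are set to the group-mean standardized betas and the leak to the group-level λ reported above, purely for illustration; implementation details not stated in the text (e.g., whether the integral includes the current error) are assumptions:

```python
import numpy as np

# Sketch of the PID update with a leaky error integral. Gains and lam are
# the group-level values reported above, used only for illustration.
def pid_updates(errors, kp=0.75, ki=0.12, kd=0.015, lam=0.80):
    integral, prev_e, updates = 0.0, 0.0, []
    for e in errors:
        integral = lam * integral + e  # leaky integration of past errors
        derivative = e - prev_e        # first difference of the error
        updates.append(kp * e + ki * integral + kd * derivative)
        prev_e = e
    return np.array(updates)
```

Note that with ki = kd = 0 this reduces to the delta-rule (P-only) model, which is the nested comparison used in the model selection.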

Discussion
In Experiment 3, we found confirmatory evidence that the PID model accurately describes participants' predictions, and that participants adjust their weighting of different PID terms based on trial-wise task dynamics. We found that each PID term was uniquely sensitive to changes in reward, absolute error, and outcome entropy, extending previous observations of the role of these modulators in proportional control, and providing further evidence that the PID terms represent distinct control processes.

Figure 6. Gain Modulation. Interaction terms between the trial-wise reward (A), absolute error (B), and outcome entropy (C) and each of the PID terms, for each of the experiments. All models included all main effects. Colored shapes indicate individual participants' standardized betas in each experiment (see legend). Error bars indicate mean and between-participant bootstrapped 95% confidence intervals, uncorrected for between-experiment variance.
When participants received larger rewards, they modulated their gains in a way that is consistent with a preference for accuracy (P- and I-Terms) over stability (D-Term; Ang, Chong, & Li, 2005), potentially indicating an exploitative strategy in high-reward environments (Kovach et al., 2012). While participants' P gain was already larger than the best-performing gain, this may reflect the unique role of reward modulation, controlling for the environmental changes (e.g., entropy) that make a high P gain less desirable. Another possibility is that the P- and/or I-Terms are effortful to implement, with rewards 'paying the cost' of these control policies (Kool, Gershman, et al., 2017; Manohar et al., 2015). Further work will be necessary to dissociate the roles of salience and motivation in reward-modulated gain adjustments.
In response to absolute errors (i.e., surprise), participants increased their immediate adjustment (P-Term) and relied less on previous feedback (I- and D-Terms). This is consistent with the idea that large errors may indicate changes in the environment (Nassar et al., 2010), and with filtering mechanisms in industrial PID control that improve robustness by limiting the long-term influence of noisy samples (Ang et al., 2005).
While outcome entropy (and uncertainty more generally) has traditionally been thought to increase the gain on proportional control (Behrens et al., 2007; Nassar et al., 2010), in our experiments the gain on the P-Term decreased, and the gains on the I- and D-Terms instead increased. Interestingly, when we instead implemented this gain modulation in a P-only model, we did find that outcome entropy increases the gain of the P-Term (data not shown). Unlike previous experiments studying uncertainty, environmental change in Experiments 2 and 3 required tracking gradually changing outcomes, which accounted for most of the outcome entropy and for which integral and derivative control are particularly useful (Kovach et al., 2012; Wittmann et al., 2016). In Experiment 1, where these gradual changes were not present, we found that uncertainty after a change-point increased control gains (McGuire et al., 2014), which may reflect integration over the trials since the change-point.

General Discussion
Across three experiments, we found that the PID model successfully captured participants' prediction updating in a stochastic environment. By incorporating a richer model of control monitoring and adjustment, the PID controller was able to account for ways in which performance in such environments deviates from the predictions of standard error-driven (delta-rule) learning models. We also replicated previous findings showing that learning parameters are themselves modulated by environmental signals (e.g., reward), and extended these findings to show that signals related to the magnitude of reward, error, and outcome entropy can differentially affect the gains on the PID model parameters.
Our findings suggest that PID control offers a good account of behavior across two fairly different task environments. Indeed, while we found that normative PID gains differed substantially between Experiment 1 (standard change-point task) and Experiments 2-3 (gradual change-point task), participant behavior continued to qualitatively match the behavior predicted by this normative controller across studies, in each case matching the sign of the best-performing control gain. This suggests that these gains adapted to the specific environment that participants were acting in. Specifically, when outcomes were prone to shift sharply and dramatically (Experiment 1), participants tended to rely less on history-dependent control processes like integral and derivative control, especially on trials in which large errors may have indicated a state shift.
While we have focused our discussion of the PID controller on all three of its components, in industrial settings the D-Term is often given the lowest gain or not included (Aström & Murray, 2010), as it is highly sensitive to noise. Accordingly, our own data supported little to no role for the derivative term in the current experiments, both normatively and in our model fits to participants' behavior. While the derivative control term was significant in all of the experiments, and interacted with the absolute error, it did not account for sufficient variance to outweigh complexity penalties in model comparison. This may have been compounded by the fact that the derivative term was negatively modulated by absolute error, which may have caused it to explain less of the variance on trials where there were large updates. While the outcomes in Experiments 2 and 3 were designed to differentiate PID control from the delta-rule model, they were not designed to specifically detect derivative control. Future research should investigate cases where derivative control is especially beneficial for good performance. Since derivative control provides high-frequency compensation to improve responsivity, it may be generally poorly suited for tasks that depend on inter-trial adjustments and favor accuracy over speed. Relative to Experiment 2, Experiment 3 emphasized accuracy through its reward structure, and deemphasized responsivity due to its longer trial length. While there were several differences between these experiments, these factors may have contributed to the differences in derivative control between them. Some of the most promising evidence for derivative control was that, in Experiment 2, participants down-weighted recent errors (from t-3 to t-1) relative to what would be expected from error integration alone.
While basic derivative control would only compare the current and previous errors, participants' behavior resembles a common practice in control engineering of low-pass filtering the derivative term in order to improve robustness (Ang et al., 2005). The discrepancy between the observed nonlinear influence of previous errors (predicted by the full PID model) and the model-selection preference for the PI model may therefore be accounted for by alternative forms of derivative control.
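A minimal sketch of such a low-pass-filtered derivative term, assuming a simple exponential smoothing filter (the filter constant alpha is an illustrative choice, not a fitted value):

```python
import numpy as np

# Sketch of a low-pass-filtered derivative term: errors are exponentially
# smoothed before differencing, so one noisy sample cannot dominate the
# D-Term. The filter constant alpha is an illustrative choice.
def filtered_derivative(errors, alpha=0.5):
    smoothed, prev, out = 0.0, 0.0, []
    for e in errors:
        smoothed = alpha * e + (1 - alpha) * smoothed  # exponential filter
        out.append(smoothed - prev)                    # difference of filtered errors
        prev = smoothed
    return np.array(out)
```

With alpha = 1 this reduces to the basic first-difference derivative; smaller alpha spreads the derivative's sensitivity over several past errors, which is one way to produce the nonlinear lag profile discussed above.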
We found that the PID terms were independently modulated by reward (Experiments 1 and 3), absolute error (Experiments 1-3), and outcome entropy (Experiments 1-3) on a trial-to-trial basis. While we have proposed speculative explanations for the role of each modulating factor, at a minimum, the unique pattern of interactions for each of the PID terms suggests that P, I, and D represent dissociable forms of control. Future experiments should examine the extent to which gain modulation depends on the structure of the task and environment, for instance whether the task rewards consistency in addition to accuracy.
The PID model provides robust control without relying on an explicit model of the environment, offering a parsimonious explanation of the behavior we observed in these experiments. However, there have been notable successes for algorithms that instead learn generative models of the environment (e.g., using Bayesian updating) and can represent the uncertainty about upcoming choices (e.g., Daw, Niv, & Dayan, 2005; Franklin & Frank, 2015; Griffiths, Lieder, & Goodman, 2015; McGuire et al., 2014; Nassar et al., 2010; though see Geana & Niv, 2014; Duverne & Koechlin, 2017). Given the ability of model-based predictions to explain participants' behavior in certain environments, and participants' ability to report confidence in their choices, it is likely that predictions reflect some mixture of model-free control (e.g., PID) and model-based control (Daw, Gershman, Seymour, Dayan, & Dolan, 2011; Gläscher, Daw, Dayan, & O'Doherty, 2010; Kool, Cushman, & Gershman, in press; Momennejad et al., 2017). Model-based control may serve to modulate the PID controller itself (e.g., to tune gain parameters or reset control processes; Bouret & Sara, 2005; Behrens et al., 2007; Nassar et al., 2010; McGuire et al., 2014); may be selectively engaged in environments that are highly constrained or familiar; and/or may operate in parallel over longer timescales (Denève, Alemi, & Bourdoukan, 2017). Future experiments should compare the PID controller to model-based approaches, and examine how these models can be combined.
These experiments together provide strong evidence for the viability of control theoretic models as mechanisms of human prediction updating in dynamic environments. This class of models has been highly influential in research on motor control, including the PID controller in particular (e.g., Kawato & Wolpert, 1998). Motor control models typically describe the rapid regulation of limb movements to produce trajectories that are fast, accurate, and robust. In contrast, participants in our experiments were not motivated to make fast or accurate trajectories, and instead may have used analogous control processes to adapt their predictions from trial to trial. Control theoretic algorithms (like PID control) may constitute a domain-general class of neural functions, involved in a diverse array of cognitive processes (Powers, 1973; Pezzulo & Cisek, 2016), including the cognitive control functions that have been suggested to operate using both closed-loop (Botvinick et al., 2001) and open-loop (Shenhav et al., 2013) control principles. The architecture of these executive control algorithms, and the nature of the references that they regulate, are important areas for further research.