Further Perceptions of Probability: In Defence of Associative Models – A Commentary on Gallistel et al. 2014

Extensive research in the behavioural sciences has addressed people’s ability to learn stationary probabilities, which stay constant over time, but only recently have there been attempts to model the cognitive processes whereby people learn – and track – non-stationary probabilities. In this context, the old debate on whether learning occurs by gradual formation of associations or by occasional shifts between hypotheses representing beliefs about distal states of the world has resurfaced. Gallistel et al. (2014) pitched the two theories against each other in a non-stationary probability learning task. They concluded that various qualitative patterns in their data were incompatible with trial-by-trial associative learning and could only be explained by a hypothesis-testing model. Here, we contest that claim and demonstrate that it was premature. First, we argue that their experimental paradigm consisted of two distinct tasks: probability tracking (an estimation task) and change detection (a decision-making task). Next, we present a model that uses the (associative) delta learning rule for the probability tracking task and bounded evidence accumulation for the change-detection task. We find that this combination of two highly established theories accounts well for all qualitative phenomena and outperforms the alternative model proposed by Gallistel et al. in a quantitative model comparison. In the spirit of cumulative science, we conclude that current experimental data on human learning of non-stationary probabilities can be explained as a combination of associative learning and bounded evidence accumulation and does not require a new model.


INTRODUCTION
lever (Robinson, 1964) or a slider on a computer screen (Gallistel et al., 2014; Khaw et al., 2017; Ricci & Gallistel, 2017). In the latter case, this was framed as the proportion of green rings in a hypothetical box visualised on a computer screen (Figure 1A). On each trial, the participant could adjust a slider between 0 and 100 percent to produce an estimate, before locking in their guess and initiating the next draw from the box (i.e., the next trial). The participants performed 10,000 trials and, importantly, were free to revise their estimate or to leave it unchanged on any trial. The data of interest are the realised outcomes from the Bernoulli process, the underlying true probabilities of the outcomes, and the participant's estimates of these underlying probabilities (Figure 1B). Most participants exhibited stepwise updating behaviour: for long periods they did not adjust their estimates, at other times they adjusted more often, but never on every trial. One of the studies (Gallistel et al., 2014) included a button labelled "I think the box has changed" that allowed participants to indicate that they believed that there had been a change in the parameter of the Bernoulli process. Half of those participants were also provided with the option to retract their decisions by pressing a button labelled "I take that back" (so-called "second thoughts", see Figure 1A).

Two Approaches to Learning: Forming Associations vs. Testing Hypotheses

As in many areas of the psychology of learning, there are two different ways of explaining how people infer probabilities from experience. Models with their origin in the associationist traditions of behaviourism, reinforcement learning, and connectionism emphasise the continuous updating of associations "trial-by-trial", while models with their origin in cognitive psychology often emphasise the testing of, and discrete shifting between, hypotheses.
The most famous associationist theory is perhaps the Rescorla-Wagner model of classical conditioning (Rescorla & Wagner, 1972), which has been adopted in many other domains too (Behrens et al., 2007; Busemeyer & Myung, 1988; Neal & Dayan, 1997; Verguts & Van Opstal, 2014). This model is based on the delta learning rule introduced by Widrow and Hoff (1960), an algorithm for updating the weights of nodes in a neural network (see Widrow & Lehr, 1993, for a review). A defining feature of the delta rule is that it is an "eager" algorithm: the associations are updated each time a new data point is observed ("trial-by-trial updating"). For the task of learning a binomial probability from observed outcomes, the delta rule can be implemented as

p_t = p_{t−1} + λ(X_t − p_{t−1}),

where λ is the learning rate and X_t is the observed outcome (e.g., "0 = red", "1 = green"). The delta rule accordingly abstracts an online running estimate of the underlying probability in the form of an associative expectation of the event. It has the advantage of being recursive: it operates without requiring access to memories going back further than the latest observation.

By contrast, models in the rationalist tradition typically assume that people construct richer representations of reality than mere associations between contiguous features, instead referring to beliefs about distal states of the world that can be either true or false. One way to embody this is in hypothesis-testing models. These assume that people learn about the world by testing between explicit hypotheses about the state of the world based on confirming or disconfirming feedback (Brehmer, 1974; Bruner et al., 1956). A common feature of such models is that beliefs are updated in a discrete rather than gradual fashion, because observers hold on to a belief until sufficiently strong evidence has accumulated against it.
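To make the delta-rule update described above concrete, here is a minimal sketch in Python. The function name, the learning-rate value, and the starting estimate are our own illustrative choices, not values from the paper:

```python
def delta_rule_track(outcomes, lam=0.1, p0=0.5):
    """Track a Bernoulli probability with the delta rule.

    outcomes: sequence of 0/1 observations (e.g., 0 = red, 1 = green)
    lam:      learning rate (lambda), an illustrative value
    p0:       initial estimate
    Returns the running estimate p_t after each trial.
    """
    p = p0
    estimates = []
    for x in outcomes:
        p = p + lam * (x - p)  # p_t = p_{t-1} + lambda * (X_t - p_{t-1})
        estimates.append(p)
    return estimates
```

Note that only the latest estimate needs to be stored between trials, which illustrates the recursiveness noted above.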
Hypothesis-testing models have been applied to, for example, research on reasoning (e.g., Klayman & Ha, 1987; Oaksford & Chater, 1994; Wason & Johnson-Laird, 1970), categorisation (Ashby & Valentin, 2017; Bruner et al., 1956), and function learning (Brehmer, 1974, 1980). Because a single data point typically provides little evidence about a hypothesis, these models predict that beliefs sometimes stay unchanged over many outcome observations.

Gallistel et al. (2014) identified a number of empirical phenomena that any viable model should be able to reproduce. Table 1 provides an overview of these phenomena, which we divide into two categories: those related directly to slider updates and those related to participants' stated beliefs about the generative function. We identified an additional phenomenon that has not been reported before but may be informative about the underlying mechanisms: participants regularly make changes to the slider in the opposite direction of the colour of the last observation (e.g., decrease their estimate of the probability of a red outcome after observing a red outcome).

Table 1. Empirical phenomena and the mechanisms by which each model accounts for them.

Phenomena related directly to slider updates (IIAB model vs. delta-rule model):

Stepwise ("staircase") slider updating
IIAB model: Beliefs are updated only occasionally, when the current hypothesis is replaced; the slider follows the belief.
Delta-rule model: Beliefs are updated on each trial, but they are accompanied by a slider update only when the discrepancy between the current belief and the slider value exceeds the response threshold.

Unimodal step height distributions
IIAB model: Small adjustments happen when the current hypothesis is refined in the troubleshooting stage or when a new hypothesis is close to the present one. Large updates happen when a new hypothesis is very different from the present one.
Delta-rule model: The response threshold varies across trials. Small slider updates happen when the response threshold is low; large ones may happen when the response threshold is high.

Rapid adjustment to changes (*)
IIAB model: A sufficiently low threshold on hypothesis updating.
Delta-rule model: A sufficiently low response threshold on initiating a slider update.

Median response close to true p
IIAB model: Maximum-likelihood updating of the internal estimate of the tracked probability.
Delta-rule model: Gradient-descent updating of the internal estimate of the tracked probability.

Inconsistent updating (present paper)
IIAB model: Unexplained by the original model, but can be accounted for by adding a variable response threshold.
Delta-rule model: Accounted for by the variable response threshold.

Phenomena related to the conception of the generative function (IIAB model vs. delta-rule model combined with an evidence accumulation mechanism):

Rapid detection of change points
IIAB model: A sufficiently low decision threshold during hypothesis testing (see "rapid adjustment to changes").
Delta-rule model: A sufficiently low bound in the drift-to-bound change detection mechanism.

High hit rates on change-point reports
IIAB model: A consequence of "rapid adjustment to changes".
Delta-rule model: A consequence of "rapid adjustment to changes".

High false discovery rates on change-point reports (**)
IIAB model: A consequence of "rapid adjustment to changes".
Delta-rule model: A consequence of "rapid adjustment to changes".

Declarative perception of periodicity (Ricci & Gallistel, 2017)
IIAB model: Unexplained by the original IIAB model, but can be accounted for by adding a separate function learning process.
Delta-rule model: Unexplained by the original delta-rule model, but can be accounted for by adding a separate function learning process.

(**) Using an event-level definition for both metrics, only the hit rate is high and the false-alarm rate is (trivially) close to zero. In this task, we believe it is more informative to look at the false-discovery rate: the number of false alarms divided by the total number of change reports.

The Main Arguments Against Associative Models

Gallistel et al. (2014) argue that associative models, which update an internal estimate trial-by-trial, are unable to account for several of the phenomena listed above. The first one is the stepwise manner in which participants tend to adjust their estimates of tracked probabilities: they often leave the slider unchanged for long periods of time (Figure 1B), which seems in direct contradiction to any model that updates on a trial-by-trial basis. An additional and closely related argument against those models is based on the distribution of adjustment sizes. Besides making many small adjustments, participants also regularly make large adjustments. Large adjustments are hard to reconcile with the idea of a gradual build-up of an associative value, because a single observation should rarely cause a large change in that estimate. Gallistel et al. (2014) argue that large adjustments and periods of constancy instead reflect discrete belief changes as the participant moves from conclusion to conclusion.

One potential way to make an associationist model account both for large slider adjustments and for periods with no adjustment is to assume that participants have a "response threshold" that prevents them from making slider updates when the difference between the current slider value and their internal belief is not sufficiently large to justify the effort. Such a threshold could reflect simple "laziness" (recall that participants typically performed thousands of trials) or have a more sophisticated basis. While a response threshold produces stepwise response behaviour, it has another problem: it is unable to explain adjustments smaller than the threshold. Gallistel et al. (2014) attempted to find parameter settings that produced step height distributions resembling the empirically observed distributions.
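As a rough illustration of how such a threshold produces stepwise behaviour, the following sketch combines a trial-by-trial delta-rule belief with a per-trial threshold draw. All names and parameter values here are hypothetical, chosen for demonstration rather than taken from either model's fitted implementation:

```python
import random

def simulate_slider(outcomes, lam=0.1, thr_mean=0.1, thr_sd=0.05, seed=0):
    """Delta-rule belief with a variable response threshold on slider updates."""
    rng = random.Random(seed)
    belief, slider = 0.5, 0.5
    trace = []
    for x in outcomes:
        belief += lam * (x - belief)                  # belief updated on every trial
        thr = max(0.0, rng.gauss(thr_mean, thr_sd))   # per-trial threshold draw
        if abs(belief - slider) > thr:                # slider moved only past threshold
            slider = belief
        trace.append(slider)
    return trace
```

Because the threshold varies from trial to trial, the simulated slider shows both long stretches of constancy and occasional small steps, despite the underlying belief changing on every trial.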
Importantly, however, they seem to have done this by manually trying out different parameter settings, rather than by exploring the space exhaustively. In the present paper, we perform a more systematic search and find that a delta-rule model with a variable response threshold does, in fact, accurately reproduce the empirical distributions.

Another major argument that Gallistel et al. (2014) make against associative models is based on their observation that participants are able to detect changes in the Bernoulli parameter (Phenomena 6-9 in Table 1). They demonstrated this using a version of the task where participants were asked to press an "I think the box has changed" button whenever they thought that there had been a change in the generative process (see Figure 1A). Some of these participants were also given the option to report "second thoughts" about those reports by pushing a button labelled "I take that back" (Figure 1A). Gallistel et al. (2014) interpret the ability to detect and reconsider changes as evidence that participants store a record of the previous change points in memory, over and above a summary representation of the outcomes thus far observed. They argue that such a record is incompatible with associative models, which have a more condensed knowledge state. In the present work, we argue that estimating a probability and detecting changes in the underlying process are two separate tasks that may be driven by separate mechanisms. We show that a delta-rule model extended with a bounded evidence accumulation mechanism tracks changes in the underlying Bernoulli parameter and can account for human reports of, and second thoughts about, such changes.

Based on the above arguments, Gallistel et al. (2014) ruled out the entire class of associative, trial-by-trial estimation models as a possible explanation for human behaviour on probability estimation tasks.
They argued that one instead needs a model with the conceptual richness of a hypothesis-testing, "troubleshooting" process that identifies the most likely state of the world to have produced the data. They proposed such a model under the name "If It Ain't Broke, don't fix it" (IIAB, described in more detail later) and used simulations to show that there are parameter settings that produce data patterns qualitatively similar to the phenomena observed in the human data. Importantly, however, they did not fit the model to any data, and they did not perform any quantitative model comparison against alternative models.

In the present paper, we revisit this conclusion. We first simulate a delta-rule model with a variable response threshold to test whether it can reproduce the qualitative patterns in the human data (Table 1). Unlike Gallistel et al. (2014), we find that it accurately reproduces those patterns.

We further note that to account for all of the phenomena in Table 1, a model may need two separate mechanisms: one for tracking the proximal stimulus distribution (the recently observed frequencies) and another one that uses this information to make decisions about the likely distal state of the world (the underlying generative probability). We demonstrate that combining the delta rule's associative account of the descriptive tracking of the probability with a sequential evidence accumulator on the delta rule's prediction error is sufficient to account for the explicitly stated inferences about the generative process.

Having established that there are no qualitative reasons to rule out the delta-rule model, we next examine how well both models account for actual data by fitting them to data from the three previous studies. We find that both models account well for most of the data, with a consistent quantitative advantage for the delta-rule model.

The IIAB Model

A key characteristic of this model is that it has a relatively stable internal belief about the tracked probability: it only updates this belief when there is sufficient evidence against the current value. It proceeds in two stages.
In the first stage, it tests whether the currently held belief about the tracked probability is "broke". This test is performed by computing the discrepancy between the belief and the outcomes observed since the last registered change point. If the discrepancy, measured as the Kullback-Leibler divergence, exceeds a decision threshold T1, the model concludes that something is "broke". Each time this happens, the model enters a "troubleshooting" stage, in which it considers three hypotheses on why the current estimate may be "broke": (i) there was a change in the generative process ("I think the box has changed"), in which case the model will register a new change point and update its estimate of p_true accordingly; (ii) the previously registered change point was a mistake ("I take that back"), in which case the model will expunge the last recorded change point and update its estimate accordingly; (iii) the previous estimate of p_true was wrong but the change-point record is correct, in which case the model will update its estimate of p_true but not register or expunge any change point. Hypothesis (iii) corresponds to concluding that the estimate was "broke" due to sampling error, but it is not assumed that such beliefs are recorded in memory.

The original version of the IIAB model has just two parameters: the threshold T1 mentioned above and an additional threshold T2 that is used in the troubleshooting stage. While both thresholds are fixed, the evidence in the first stage is scaled by the number of trials since the last change (the sample size). The IIAB model will therefore become increasingly sensitive to small discrepancies between the current belief and the most recently observed evidence when no change point has been detected for a while.

Predictions related to slider updates can be derived directly from the model's internal belief state about the tracked probability.
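As we read it, the first-stage "is it broke?" test compares the sample-size-scaled Kullback-Leibler divergence between the observed outcome proportion (since the last registered change point) and the current belief against the threshold T1. The sketch below is our own rendering of that mechanism, not the authors' implementation; the function names and the value of T1 are hypothetical:

```python
from math import log

def bernoulli_kl(p, q):
    """KL divergence D(Bernoulli(p) || Bernoulli(q)), with clipping for safety."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

def is_broke(outcomes_since_change, belief, T1=2.0):
    """First IIAB stage: does the scaled discrepancy exceed threshold T1?"""
    n = len(outcomes_since_change)
    p_obs = sum(outcomes_since_change) / n
    return n * bernoulli_kl(p_obs, belief) > T1
```

Multiplying by the sample size n is what makes the test increasingly sensitive to small discrepancies as more trials pass without a registered change.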
Predictions related to a participant's reports of suspected changes in the generative process, and changes of mind about those reports, can be derived directly from the model's "troubleshooting" stage. Hence, this is a rich model that makes predictions about all of the empirical phenomena listed in Table 1.

The Delta-Rule Model

The delta-rule model proceeds by constantly adjusting its estimate of the tracked probability in the direction of the latest observed outcome: seeing a green ring slightly increases the observer's estimate of the proportion of green rings in the box, and seeing a red one decreases it. The higher the value of the learning rate, λ, the larger the trial-by-trial adjustments. In environments with frequent, abrupt changes in the generative process, it is beneficial to have a high learning rate, because that allows the model to catch up quickly with those changes. However, in stable or very slowly changing environments it is better to have a low learning rate, to avoid the estimates being overly sensitive to occasional unexpected outcomes.

By itself, however, the delta-rule model makes no predictions about participants' reports of change points (Table 1). This is not surprising, since the delta rule is a learning mechanism for proximal stimuli. To explain change-point reports, it needs to be combined with a decision-making mechanism that discriminates between hypotheses about distal states of the world. One of the most established mechanisms for perceptual decision-making to date is bounded evidence accumulation (Bogacz et al., 2006; Ditterich, 2006; Ratcliff, 1978), which finds broad support in behavioural, neurophysiological, and computational studies. In the present task, the trial-by-trial prediction error, X_t − p_{t−1}, is indicative of changes in the tracked probability (Figure 2) and could thus be a suitable input variable for an evidence accumulator.
When the generative process is stable and the observer's estimate has homed in on a value close to the true value of the tracked variable, prediction errors tend to cancel each other out over trials, and the cumulative evidence will hover around 0 (Figure 2, first 100 trials). After an abrupt change in the generative process (Figure 2, trial 100), however, there will typically be a burst of prediction errors that all have the same sign (determined by the direction of the change) and thus do not cancel each other out. Hence, the cumulative prediction error is indicative of changes in the generative process: a value close to zero suggests a stable process; a large positive value suggests a recent increase in the Bernoulli parameter; a large negative value suggests a recent decrease. We propose that observers may use the cumulative prediction error to detect changes in the generative process when tasked to do so. We model this by adding a bounded evidence accumulator to the delta-rule model and letting it trigger an "I think the box has changed" response whenever the cumulative prediction error exceeds a decision bound (Figure 2). A nice feature of this combination is that the evidence accumulator has the same computational efficiency as the delta rule: the cumulative error can be updated iteratively with negligible memory requirements.

Importantly, bounded evidence accumulators can also explain "second thoughts", which are known as "changes of mind" in the decision-making literature. This is done by introducing a temporary second bound at the moment that an initial decision has been made (e.g., Resulaj, Kiani, Wolpert, & Shadlen, 2009).

Figure 2 caption: For the first 100 trials, the cumulative prediction error hovers around 0, because positive and negative errors cancel each other out (note, however, that around trial 50 there is almost a false alarm).
At trial 100, the value of p_true is changed to 0.90. This causes the cumulative prediction error to increase quickly, because more positive errors are now experienced than negative ones. The cumulative prediction error crosses decision bound B1 = 3.0 at trial 109, which triggers an "I think the box has changed" response, resets the cumulative prediction error to 0, and instates a temporary change-of-mind bound (which is not crossed in this example). The shape of the cumulative prediction error looks different after the change in p_true because the trial-by-trial prediction errors changed from approximately −0.50 and 0.50 (each on 50% of the trials) before the change to around 0.90 (on 90% of the trials) and −0.10 (on 10% of the trials) after it.

Variable Response Threshold

Participants' levels of motivation and attention might fluctuate over time, which could make the discrepancy required for an update vary across trials. We will first model this resistance to updating in the same way as Gallistel et al. (2014), namely by using a response threshold whose value is drawn on each trial from a constrained Gaussian distribution. We will then test models where the threshold is drawn from a beta distribution. We parameterise both thresholds by their mean and variance.
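A minimal sketch of the change-detection mechanism described above (delta-rule learning, a bounded accumulator on the prediction errors, and a temporary change-of-mind bound) might look as follows. The parameter values and the handling of the change-of-mind window are our own simplifying assumptions, not the fitted model:

```python
def detect_changes(outcomes, lam=0.1, bound=3.0, com_window=5):
    """Delta rule plus bounded evidence accumulation on prediction errors.

    Returns (reports, retractions): trial indices of "I think the box has
    changed" responses and of "I take that back" responses.
    """
    belief, acc = 0.5, 0.0
    reports, retractions = [], []
    com_deadline, com_sign, com_acc = -1, 0, 0.0
    for t, x in enumerate(outcomes):
        err = x - belief                      # trial-by-trial prediction error
        belief += lam * err                   # delta-rule belief update
        if t <= com_deadline:
            com_acc += err
            if com_acc * com_sign < -bound:   # strong evidence against the report
                retractions.append(t)         # "I take that back"
                com_deadline = -1
        acc += err                            # cumulative prediction error
        if abs(acc) > bound:                  # decision bound crossed
            reports.append(t)                 # "I think the box has changed"
            com_sign = 1 if acc > 0 else -1
            acc, com_acc = 0.0, 0.0
            com_deadline = t + com_window     # temporary change-of-mind bound
    return reports, retractions
```

Like the delta rule itself, the accumulator is recursive: only the running sums need to be stored between trials.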

Response Noise and Lapse Rate
To account for inaccuracies in predicted slider settings (due to factors such as motor noise and model mismatch), we included response noise in all models. This noise was implemented as a beta distribution centred on the model's predicted response, m, and was applied to trials on which a slider update was predicted. Since the variance of the beta distribution has an upper bound (equal to m − m²), we parameterised it as a relative value between 0 (no variance) and 1 (maximum variance). The (relative) variance was fitted as a free parameter. Moreover, we included a small lapse rate (1/1000) to account for lapses in attention and to avoid numerical instabilities in model variants without any other sources of stochasticity (such as the original IIAB model). We performed several robustness checks (reported in Results) to verify that the results do not critically depend on these auxiliary assumptions.
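Concretely, this parameterisation can be implemented by converting the mean m and a relative variance v_rel into the beta distribution's standard shape parameters. The function names below are our own; this is a sketch of the parameterisation, not the fitted implementation:

```python
import random

def beta_noise_params(m, v_rel):
    """Convert mean m and relative variance v_rel (in (0, 1)) to (alpha, beta).

    The beta distribution's variance is at most m * (1 - m) for a given
    mean m, so the actual variance used is v_rel * m * (1 - m).
    """
    var = v_rel * m * (1 - m)
    nu = m * (1 - m) / var - 1   # concentration; equals 1 / v_rel - 1
    return m * nu, (1 - m) * nu

def noisy_response(m, v_rel, rng):
    """Draw a noisy slider response centred on the model's prediction m."""
    a, b = beta_noise_params(m, v_rel)
    return rng.betavariate(a, b)
```

For example, beta_noise_params(0.7, 0.1) gives alpha = 6.3 and beta = 2.7, whose mean is 0.7 and whose variance is 10% of the maximum possible value 0.7 × 0.3.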

Model Evaluation Methods
We will evaluate the models by using maximum-likelihood fitting with five-fold cross-validation.

RESULTS

Gallistel et al. (2014) reported that they were unable to find parameter settings under which a delta-rule model with a response threshold reproduced the empirical step height distributions (Table 1). From this, they concluded that delta-rule models are fundamentally unfit to account for human estimation of non-stationary probabilities. Here, we reconsider this finding by using an approach that differs from theirs in an important way: instead of manually trying out parameter settings, we systematically explore parameter space using an optimisation method. Specifically, we let the algorithm search for the setting that minimises the discrepancy between the predicted and the empirically observed step height distributions.

[Figure: step width and step height distributions for participants and model fits.]

Figure caption (model predictions were obtained by minimising the root mean squared deviation, RMSD, with the data, shown in black): (A) Total number of "I think the box has changed" reports (solid) and "I take that back" reports (dashed). (B) Cumulative distribution of the number of trials between a change in p_true and the next "I think the box has changed" report. (C) Hit rates, false discovery rates, and false alarm rates on change-point detections. Following Gallistel et al. (2014), the hit rate was defined as the proportion of box changes followed by at least one change report before the next change took place, and the false-alarm rate was defined as the number of 'extra' change reports (i.e., the number of reports after the first correct report after a change and before the next change) divided by the number of trials on which a change call would have been scored as a false alarm. The false-discovery rate was computed as the proportion of change reports that were 'extra' rather than 'correct'. Results are shown for the (only) five participants in Gallistel et al. (2014) who were asked to report both box changes and second thoughts.

A problem, as also noted by Gallistel et al.
(2014), is that there is no obvious way to weight misestimates in one summary statistic against misestimates in another, which makes it impossible to formulate a single measure to base judgements on.

To compare the models quantitatively, we instead use maximum-likelihood fitting with five-fold cross-validation (see Supplemental Materials for details). We fit the models to the raw data from four experiments (Table 4) reported in three previous studies. The number of trials per participant varied from 9,000 to 10,000, divided over 9 or 10 sessions, with a grand total of 286,890 trials performed by 29 participants over 287 sessions. In each cross-validation fold, a block of 20% of consecutive trials is left out, which for most datasets amounts to exactly two entire experimental sessions. For now, we limit these analyses to the slider update data, because "I think the box has changed" and "I take that back" responses were collected for only some of the participants (we come back to those data below).

The delta-rule model accounts for the data better than the IIAB model (Figure 5A): for each of the 29 participants, the delta-rule model is favoured over the IIAB model by a difference of at least 18,020 log-likelihood points (M±SE: 28,654 ± 904). (When fitting the models separately to each session, the average difference is 285 ± 40 in favour of the delta-rule model. Considering that log likelihoods scale linearly with the number of trials, this difference is comparable to that obtained by fitting the full datasets.) Hence, not only is the delta-rule model viable from a qualitative perspective, its quantitative account of the raw data is better than that of the alternative model proposed by Gallistel et al. (2014).
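The blocked cross-validation scheme described above, in which each fold holds out one contiguous block of 20% of the trials, can be sketched as follows (the function name is our own):

```python
def blocked_cv_folds(n_trials, k=5):
    """Split trial indices 0..n_trials-1 into k folds of consecutive trials.

    Each fold is a (train_indices, test_indices) pair where the test set
    is one contiguous block of roughly n_trials / k trials.
    """
    edges = [round(i * n_trials / k) for i in range(k + 1)]
    folds = []
    for i in range(k):
        test = list(range(edges[i], edges[i + 1]))
        train = [t for t in range(n_trials) if t < edges[i] or t >= edges[i + 1]]
        folds.append((train, test))
    return folds
```

For a 10,000-trial dataset this holds out trials 0-1999, then 2000-3999, and so on; for most datasets here, each held-out block would correspond to two entire sessions.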
There are two big differences between the models that could potentially explain why the IIAB model performs so much worse than the delta-rule model in this comparison: it uses a different mechanism for belief updating and it does not have a threshold on slider reports. To examine how important the assumption of a response threshold is, we next fit a variant of the IIAB model with the same threshold mechanism as in the delta-rule model. This variant has a much better goodness of fit than the original IIAB model (an increase of 370 ± 114 log-likelihood points), but it is still outperformed by the delta-rule model for 25 out of 29 participants, with an average log-likelihood difference of 271 ± 44 across all participants (Figure 5B). This substantial change in the log-likelihood difference suggests that a response threshold is of primary importance to quantitatively account for the data.

A response threshold can be implemented in many ways, and which version is chosen can strongly affect the model fit (see Khaw et al., 2017). So far, we have followed Gallistel et al. (2014) by assuming a variable threshold in the shape of a Gaussian distribution with a constraint on the magnitude of the noise. We will now test an alternative version by making two changes. First, we remove the constraint on the amount of variance (cv_τ ≤ 0.33), because its justification is unclear to us and it may limit both models' ability to account for participants' response behaviours. Second, we switch to a beta distribution because, unlike the Gaussian distribution, it produces responses that are bounded between 0 and 1, just like the response scale used by the participants. This change increases the goodness of fit substantially for both the IIAB and the delta-rule model, by 650 ± 130 and 495 ± 121 log-likelihood points, respectively.
The delta-rule model still outperforms the IIAB model for 27 out of 29 participants, with an average difference of 125 ± 20 (Figure 5C). Because the beta distribution provides a better fit, we will employ it in the remaining analyses, unless mentioned otherwise.

So far, we have evaluated the models only based on how well they predict participants' slider responses. Besides setting the slider, the ten participants in Gallistel et al. (2014) also reported when they thought the box had changed, and five of them also reported their second thoughts (using the "I think the box has changed" and "I take that back!" buttons, see Figure 1A). We now also evaluate how well the two models perform when fitting them jointly to slider data and those additional responses (see Supplemental Materials for details and results). We find that both models account reasonably well for the jointly fitted data, with an advantage for the delta-rule-and-evidence-accumulation model in terms of cross-validated log likelihoods: it is the selected model for nine of the ten participants, with a mean difference of 260 ± 74 points. While this shows that bounded evidence accumulation is a viable candidate for explaining change reports in probability estimation tasks, it should be kept in mind that the currently available data is extremely limited (1,316 change reports by ten participants and 115 second-thought reports by five participants). Therefore, we view these results mainly as a proof of concept and refrain from drawing strong conclusions from them at this point.

In summary, from a quantitative model comparison perspective, the delta-rule model accounts better for the data than the IIAB model.
We checked that this conclusion is robust to changes in the assumed lapse rate, the presence of response noise, and the shape of the response noise distribution.

Previous research (e.g., Gonzalez & Dutt, 2011) has suggested a discrepancy-independent mechanism called "inertia", where the decision to update is entirely determined by the flip of a weighted coin. We tested this mechanism by replacing the response threshold with a constant probability of updating on each trial, implemented as a free parameter. This mechanism makes the fits substantially worse for 27 of the 29 participants, by an average of 69 ± 14 log-likelihood points over all participants. This suggests that the update decision at least in part depends on the discrepancy between the internal belief and the current slider value.

Response noise in the delta-rule model. The median estimate of the (relative) variance of the beta response-noise distribution is 0.058 (IQR: 0.041). To get an intuition of the magnitude of this noise, we performed two simulations. In the first simulation, we generated predicted responses from the delta-rule model for each participant by using the maximum-likelihood parameter estimates. We found that the participant-averaged RMSD between the model predictions and the true value of the tracked probability was 0.192 ± 0.012. In the second simulation, we removed the response noise while keeping the other parameters at the same values. As expected, the RMSD in this simulation was lower (0.165 ± 0.009), but not by much. Hence, the response noise seems to play only a modest role in accounting for participants' estimation errors.

Decision threshold in the IIAB model. The decision threshold parameter in the IIAB model, which controls when the model considers the current belief to be "broke" and in need of an update, is estimated to be close to 0 for every participant (M = 0.032, SE = 0.018).
Importantly, this means that the IIAB model captures the data best when its parameters are set such that it considers its current probability estimate "broke" on every trial, and it thus transforms into a trial-by-trial updating model in which the stepwise behaviour is entirely explained by the response threshold.

Our combination of delta-rule learning and bounded evidence accumulation accounts for all of the qualitative phenomena. Moreover, quantitative model comparison showed that this combination of established theories actually accounts better for the data than the IIAB hypothesis-testing model that was specifically developed for this task. These conclusions hold across all tested data sets and are robust to changes in the modelling assumptions about the shape of the response threshold distribution, the assumed lapse rate, the presence of response noise, and the choice of model comparison method. We thus conclude that associative models of probability estimation are still viable candidates. However, we do not take this to imply that there is no place for hypothesis-testing models. On the contrary, our point is that the experimental task requires both descriptive tracking of the observed frequencies and inferences about unobservable states of the world (hypotheses about whether the box has changed). The mind can do both, and previous research has suggested models of those processes. Our central claim here is that these previous models can explain participants' ability to "do both" in the present task too.

We note that we are not the first to combine associative learning and evidence-accumulation decision making. Fontanesi et al. (2019) have shown how drift rates can be updated associatively in a multi-armed bandit task. Like them, we believe that this kind of approach might prove fruitful: take processes that we know the mind is capable of and investigate how they together might generate more complex behaviour.
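This style of combination can be illustrated with a minimal sketch. It is not the fitted model from our comparison: the fast/slow-tracker evidence rule, the reset behaviour, and all parameter values are simplifying assumptions chosen for clarity. A delta-rule tracker supplies the proximal estimate, and a bounded accumulator collects evidence that recent outcomes are better explained by the fast-adapting estimate than by a slower reference belief.

```python
import math

def combined_model(outcomes, alpha=0.1, bound=3.0, p0=0.5):
    """Illustrative combination of delta-rule tracking and bounded
    evidence accumulation: report a change whenever the accumulated
    log-likelihood ratio favouring the fast tracker over a slow
    reference belief crosses the bound."""
    fast, slow = p0, p0        # fast tracker vs slow reference belief
    evidence, changes = 0.0, []
    eps = 1e-6                 # keep probabilities away from 0 and 1
    for t, x in enumerate(outcomes):
        fast += alpha * (x - fast)            # delta-rule update
        slow += (alpha / 10.0) * (x - slow)   # slower reference update
        pf = min(max(fast, eps), 1.0 - eps)
        ps = min(max(slow, eps), 1.0 - eps)
        # log-likelihood ratio of this outcome: fast vs slow estimate
        if x == 1:
            evidence += math.log(pf / ps)
        else:
            evidence += math.log((1.0 - pf) / (1.0 - ps))
        evidence = max(evidence, 0.0)         # accumulator bounded below
        if evidence >= bound:                 # upper bound crossed
            changes.append(t)                 # "I think the box has changed"
            slow = fast                       # adopt the new belief
            evidence = 0.0
    return changes
```

Running this on a sequence whose Bernoulli parameter jumps produces change reports shortly after the jump, while the tracker itself continues to update trial by trial, illustrating how the two processes can operate on the same stream of outcomes.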
Theoretical importance and implications of a variable response threshold

Adding a variable response threshold greatly improved the fits of both models. One reason is that participants make inconsistent updates, which are incompatible with the original models; the variable response threshold allows the models to account for this. Hence, this threshold does not merely "soak up noise" but is required by both the IIAB and the delta-rule model to explain inconsistent updating and other empirical phenomena (Table 1). We therefore emphasise that a variable response threshold does not represent a "nuisance term", akin to adding an error term to a regression, but constitutes a theoretical proposition that is tentatively supported by our results.

Evaluation of the fitted response thresholds revealed that many distributions were so broad that the choice of whether or not to update on any given trial becomes partly stimulus-independent. Completely stimulus-independent thresholds have elsewhere (Gonzalez & Dutt, 2011) been termed "inertia". For two of the 29 participants, a coin-flip mechanism did indeed provide a better quantitative fit than the response threshold mechanism. However, for the vast majority of participants it did not, which suggests that updating is at least in part driven by stimulus-dependent factors (as also concluded by Khaw et al., 2017). For other participants, we obtained threshold distributions such that the probability of updating the response increased with the discrepancy between the current response and the internal estimate: updates were less likely under very small discrepancies and more likely under very large discrepancies. We interpret this as a resistance to updating, as opposed to a suppressive threshold, the term we have hitherto used.
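A discrepancy-dependent update decision of this kind can be sketched by drawing a fresh threshold on each trial. The gamma distribution and the parameter values below are illustrative assumptions, not fitted quantities:

```python
import random

def should_update(internal, slider, mean_thresh=0.1, shape=2.0, rng=random):
    """Variable response threshold: draw a threshold from a gamma
    distribution (an illustrative choice) and update the slider only
    if the discrepancy between the internal estimate and the current
    slider setting exceeds it."""
    threshold = rng.gammavariate(shape, mean_thresh / shape)  # mean = mean_thresh
    return abs(internal - slider) > threshold
```

Because the threshold varies from trial to trial, the probability of an update rises smoothly with the discrepancy rather than switching deterministically at a fixed cutoff.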
Participants are reluctant to update, perhaps due to the motor or time cost, but balance this against their wish to respond correctly. They care about not being very wrong, but not so much about being exactly right. In economics, the idea that learning can be influenced by a trade-off between the costs of updating and the gains from a more accurate belief has been formalised in the "rational inattention" literature (Sims, 2003). The stepwise response pattern in the present Bernoulli distribution task has been taken to support this idea (Khaw et al., 2017). The reluctance interpretation of our threshold distributions stated above differs from rational inattention in that it supposes that the overt response, and not the covert estimate, is affected by the trade-off. Our modelling here was not aimed at answering which version is correct, which could be an interesting avenue for future work.

There are other models that may also be worth considering when addressing the more general question of how to best explain the data. One such candidate, which was also tested by Gallistel et al. (2014), predicts that updates from re-estimation will be biased towards 0.5 and updates from acceptance of a new hypothesis will be biased towards the extremes. These effects should cancel out, making the estimates accurate on average (phenomenon 4, Table 1). If descriptive estimates are actually made associatively, and hypothesis testing is a separate evidence-accumulation process, Costello and Watts's (2018) framework predicts a constant bias towards 0.5, which seems inconsistent with the available data.

Interference from distal beliefs as an alternative to proximal learning rate adaptation

Previous research has suggested various mechanisms for adapting the learning rate as part of the associative process. Our combined model highlights a complementary solution.
One can imagine a process where the proximal learning not only feeds evidence "upwards" to the accumulator that affects beliefs, but where beliefs reached by the accumulator are also signalled back "down" to the online learning, which adapts to new beliefs about the world. For example, a decision that the box has changed could prompt a momentary increase in the learning rate to adapt to the new reality. Lee et al. (2020) found that people adapted their learning rate depending on verbal information about the generative function. Using our vocabulary, proximal learning in their task was affected by distal beliefs informed by written instructions. Our conjecture here is that a similar effect might be observed when the distal beliefs (what Lee et al. called the "cognitive map") are arrived at through experience.

In the approximately Bayesian delta-rule model (Nassar et al., 2010), the perceived probability that there has been a change point determines the learning rate. In other words, it explains learning rate adaptation as part of the associative process. By contrast, we speculate that delta-rule learning might be interfered with by semantic cognition (Rogers & McClelland, 2008). These ideas are not mutually exclusive, but the latter seems necessary to satisfactorily account for any effects of external information. Outside the laboratory, we often have a rich semantic understanding of the generative processes we "draw" from. Future work could consider how such information can affect online estimation.

Unexplained phenomena

Phenomena 10 and 11 (Table 1) remain unexplained by both models. In Ricci and Gallistel (2017), some participants were able to correctly report during the debriefing that they had been drawing from a sinusoidal function (phenomenon 11). A central theoretical proposition of the IIAB (see p.
106, Gallistel et al., 2014) is that people do not perceive probabilities per se but "deduce" them from a (sparse) memory of change points. To generate a declarative belief of a continuous functional form from a discrete set of memories, the IIAB would require some function-learning mechanism (e.g., Brehmer, 1974) which interpolates between the "data points". For the delta-rule model, we need the mechanism to be recursive. There exist several recursive function-learning models, some of which are specifically adapted to non-stationary environments (Speekenbrink & Shanks, 2010) and some of which use a version of delta-rule learning (DeLosh, Busemeyer, & McDaniel, 1997). Perhaps the most famous of the latter is the EXAM model (McDaniel & Busemeyer, 2005).

In sum, we do not view phenomena 10 and 11 as evidence against either model but rather as avenues of future research. Investigating phenomenon 10 involves opening a black box by trying to establish a structured explanation of aspects which we here model as free parameters.

Concluding remarks
Previous research has highlighted that a complete theory of probability perception must account for both online estimates of the proximal frequencies and detection of changes in the distal generative process. We have demonstrated that it was premature for the previous literature to rule out a role for associative models in this. We have shown here that the perception of probability is consistent with a combination of mechanisms from both the empiricist and rationalist traditions that are already well established in the literature.