## ABSTRACT

Extensive research in the behavioural sciences has addressed people’s ability to learn probabilities of stochastic events, typically assuming them to be stationary (i.e., constant over time). Only recently have there been attempts to model the cognitive processes whereby people learn – and track – non-stationary probabilities, reviving the old debate on whether learning occurs trial-by-trial or by occasional shifts between discrete hypotheses. Trial-by-trial updating models – such as the delta-rule model – have been popular in describing human learning in various contexts, but it has been argued that they are inadequate for explaining how humans update beliefs about *non-stationary* probabilities. Specifically, it has been claimed that these models cannot account for the discrete, stepwise updating that has been observed in data. Here, we demonstrate that the rejection of trial-by-trial models was premature for two reasons. First, our experimental data suggest that the stepwise behaviour depends on details of the experimental paradigm. Hence, discreteness in response data does not necessarily imply discreteness in internal belief updating. Second, previous studies have dismissed trial-by-trial models mainly based on qualitative arguments rather than quantitative model comparison. To evaluate the models more rigorously, we performed a likelihood-based model comparison between stepwise and trial-by-trial updating models. Across eight datasets collected in three different labs, human behaviour is consistently best described by trial-by-trial updating models. Our results suggest that trial-by-trial updating plays a prominent role in the cognitive processes underlying learning of non-stationary probabilities.

## INTRODUCTION

When making decisions, people often rely on their estimates of the probabilities that certain events will occur. Not surprisingly, at least since the Enlightenment the issue of how people assess – and should assess – probabilities has been pivotal to the behavioural sciences. How people learn, estimate, and reason with probability has thus been studied extensively, especially in psychology and behavioural economics. Typically, this research has assumed *stationary probabilities* in the environment (i.e., probabilities that stay constant over time).

This research suggests that people are good at learning stationary probabilities from experience with relative frequencies (e.g. Edwards, 1961; Estes, 1976; Fiedler, 2000; Peterson & Beach, 1967), and it has been suggested that frequencies are among the few properties of the environment that are encoded automatically (Zacks & Hasher, 2002). At the same time, the research on heuristics-and-biases shows that probability assessments are sometimes also swayed by subjective (“intentional”) aspects, like prototype-similarity (representativeness) or ease of retrieval, leading to biased judgements (Kahneman & Frederick, 2005). People also appear to over-weight extreme probabilities in their decisions when encountering them in numeric form (Tversky & Kahneman, 1992), but under-weight them when they are learned inductively from trial-by-trial experience (Hertwig & Erev, 2009). People frequently have problems with reasoning according to probability theory, leading to phenomena like base-rate neglect and conjunction fallacies (Kahneman & Frederick, 2005; Tversky & Kahneman, 1983), at least if they cannot benefit from natural frequency formats (Gigerenzer & Hoffrage, 1995) that highlight the set-relations between the events (Barbey & Sloman, 2007).

However, not all probabilities are stable (stationary), as when, for example, the risks of default in a mortgage market fluctuate over time or the risk of hurricanes changes with a changing global climate. A small and mostly recent literature has started to model the cognitive processes by which people learn – and track – *non-stationary probabilities* that may change over time (Gallistel, Krishan, Liu, Miller, & Latham, 2014; Khaw, Stevens, & Woodford, 2017; Ricci & Gallistel, 2017; Robinson, 1964). Because this research addresses changes in people’s beliefs about probability, it has (once again) highlighted the classical issue of whether learning occurs by trial-by-trial updating or by occasional shifts between discrete hypotheses (Bruner, Goodnow, & Austin, 1956), with the initial studies reporting support for processes of explicit hypothesis testing. In this article, we complement this previous literature in two ways. First, we report an experiment that investigates the robustness of the stepwise learning patterns that previous studies have taken as evidence for hypothesis-testing models over trial-by-trial updating models. Second, for the first time, we report a formal comparison between the competing models, applied to our own data as well as data from two other laboratories.

### Tracking Probabilities in Non-Stationary Environments

Several previous studies have started to address how people learn and reason with non-stationary probabilities. They used tasks in which participants were presented with outcomes from a Bernoulli distribution that changed over time. Participants were asked to estimate the hidden Bernoulli parameter, by having them adjust a physical lever (Robinson, 1964) or a slider on a computer screen (Gallistel et al., 2014; Khaw et al., 2017; Ricci & Gallistel, 2017), with the option to change their estimate after each new observation.

Most versions of this paradigm have asked participants to estimate the proportion of items of a certain colour in a hypothetical box visualised on a computer screen (Gallistel et al., 2014; Khaw et al., 2017; Ricci & Gallistel, 2017) (Figure 1A). The participants drag a slider to a value between 0 and 100 percent to indicate their current estimate, before locking in their guess, which initiates another draw of an item from the box. The participant may then choose to revise their estimate or leave it unchanged. This procedure is repeated for many trials. The data of interest are the realised outcomes, the underlying true probabilities of the outcomes, and the participant’s estimates of these probabilities (Figure 1B). Most participants in previous studies exhibited stepwise updating behaviour: for long periods they did not adjust their estimates at all, at other times they adjusted them more frequently, but they never updated on every trial.

As in many areas of the psychology of learning, there are two different ways of explaining how people infer probabilities from experience: models with their origin in the associationist traditions of behaviourism, reinforcement learning, and connectionism emphasise the continuous updating of beliefs “trial-by-trial”, while models with their origin in cognitive psychology emphasise the testing of, and discrete shifts between, explicit hypotheses.

A defining feature of trial-by-trial models is that the internal beliefs are updated each time a new data point is observed. They can be further separated into at least two kinds: delta-rule and memory-based models. The delta learning rule was introduced by Widrow and Hoff (1960) as an algorithm for updating the weights of nodes in a connectionist network (see Widrow & Lehr, 1993, for a review). In psychology, the most famous model based on this rule is the Rescorla-Wagner model of classical conditioning (Rescorla & Wagner, 1972), but it has also been adopted in many other domains (Behrens, Woolrich, Walton, & Rushworth, 2007; Busemeyer & Myung, 1988; Neal & Dayan, 1997; Verguts & Van Opstal, 2014).

In the context of probability estimation, delta-rule learning can be implemented as

*p̂*_{t} = *p̂*_{t−1} + *γδ*_{t−1},

where *p̂*_{t} is the probability estimate at time *t*, *p̂*_{t−1} the previous estimate, *δ*_{t−1} the prediction error at time *t*−1, and *γ* the learning rate. This rule has the advantage of being recursive: it can operate without access to memories going back any further than the latest observation.
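As an illustration, the update rule above can be sketched in a few lines of Python; the function name, learning-rate value, and initial estimate below are illustrative choices, not part of any published implementation:

```python
def delta_rule_estimates(outcomes, gamma=0.1, p0=0.5):
    """Track a Bernoulli parameter with the delta rule.

    outcomes: sequence of 0/1 observations
    gamma:    learning rate
    p0:       initial probability estimate
    """
    p = p0
    estimates = []
    for x in outcomes:
        delta = x - p          # prediction error on this trial
        p = p + gamma * delta  # recursive update: needs no memory
        estimates.append(p)    #   beyond the current estimate
    return estimates
```

Note that the estimate changes on every trial on which the outcome deviates from the current estimate, which is why a pure delta rule cannot, by itself, produce long stretches of identical internal beliefs.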

Memory-based models, on the other hand, rely on the memory of previously observed outcomes. These models encode and then retrieve memories of events, often in the form of recency-constrained samples, to calculate beliefs on-line and have been applied to a variety of domains, including perceptual classification (Nosofsky & Palmeri, 1997), decision making (Lebiere, Stewart, & West, 2009), probability judgments (Costello & Watts, 2014; Juslin & Persson, 2002; Juslin, Winman, & Hansson, 2007), speech recognition (Gemmeke, Virtanen, & Hurmalainen, 2011), and consumption decisions (Mullainathan, 2002). Memory-based models have the advantage that, although they potentially draw on an extensive long-term memory, they are flexible in the sense that nothing needs to be pre-computed, but the computations are primarily performed at the time of judgement.
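To make the contrast with the delta rule concrete, a minimal memory-based estimator can be sketched as the relative frequency of the target outcome within a recency-constrained sample; the window size is an arbitrary illustrative choice, not a commitment of any specific model cited above:

```python
from collections import deque

def window_estimates(outcomes, window=20):
    """Estimate the Bernoulli parameter as the relative frequency of
    1s among the most recent `window` outcomes (a recency-constrained
    sample retrieved from memory at the time of judgement)."""
    sample = deque(maxlen=window)  # old outcomes drop out automatically
    estimates = []
    for x in outcomes:
        sample.append(x)
        estimates.append(sum(sample) / len(sample))
    return estimates
```

Nothing is pre-computed here: the estimate is calculated from the retrieved sample only when a judgement is required.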

By contrast, hypothesis-testing models assume that people learn about the world by testing between explicit hypotheses about the state of the world based on the confirming or disconfirming feedback (Brehmer, 1974; Bruner et al., 1956). Hypothesis testing models have been applied to, for example, research on reasoning (e.g. Klayman & Ha, 1987; Oaksford & Chater, 1994; Wason & Johnson-Laird, 1970), categorisation (Ashby & Valentin, 2017; Bruner et al., 1956), and function learning (Brehmer, 1974, 1980). Because a single data point typically provides little evidence about a hypothesis, these models predict that the beliefs may sometimes stay unchanged over many trials.

While trial-by-trial models have been successful in explaining various types of human learning and inference, Gallistel et al. (2014) have argued that these models are unable to account for the stepwise patterns found in experiments where participants track non-stationary probabilities (Figure 1B). Instead, they proposed that the stepwise response pattern is caused by discreteness in how the participants update their beliefs, which they formalised in their “If it ain’t broke, don’t fix it” (IIAB) model. According to this model, participants assess whether their current belief is “broke” after each new observation and only update their belief if the answer is in the affirmative. The suggestion is that humans do not estimate probabilities directly: they estimate changes in the hidden Bernoulli parameter and infer probabilities from this.
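The step-hold logic can be caricatured as follows. This toy sketch is emphatically *not* the IIAB model itself (the actual model detects change points in the hidden Bernoulli parameter with a Bayesian analysis); it only shows how a test-then-update scheme yields long holds punctuated by discrete steps. Window and threshold values are arbitrary:

```python
from collections import deque

def step_hold_estimates(outcomes, window=30, threshold=0.2, p0=0.5):
    """Toy 'if it ain't broke, don't fix it' learner: hold the current
    estimate until the recent relative frequency deviates from it by
    more than `threshold`, then jump to that frequency."""
    sample = deque(maxlen=window)
    p = p0
    estimates = []
    for x in outcomes:
        sample.append(x)
        freq = sum(sample) / len(sample)
        if abs(freq - p) > threshold:  # current belief judged 'broke'
            p = freq                   # discrete step to a new belief
        estimates.append(p)            # otherwise: hold
    return estimates
```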

### Purpose of this study

In the present work, we address four potential weaknesses that we identified in previous studies. The first one is related to the available data. Four previous studies (Gallistel et al., 2014; Khaw et al., 2017; Ricci & Gallistel, 2017; Robinson, 1964) have reported stepwise response updating in probability learning experiments with non-stationary probabilities. In three of those experiments (Gallistel et al., 2014; Khaw et al., 2017; Robinson, 1964), the underlying probability changed discretely. As noted by Ricci and Gallistel (2017), this is problematic, because it could mean that the discreteness in response patterns simply reflects the discreteness in the true underlying function, rather than discreteness in belief updating. Therefore, competing models of probability learning should primarily be tested using data from experiments in which the Bernoulli parameter changes in a *continuous* fashion. To the best of our knowledge, the study by Ricci and Gallistel (2017) is the only one so far that has performed such an experiment. However, for three^{*} of their nine subjects, the Bernoulli processes consisted of long periods of no change followed by a quite abrupt change, thus closely resembling a discretely changing parameter. Altogether, this means that current theories about human learning of non-stationary probabilities rely heavily on data from only six participants. The first purpose of the present study is therefore to examine the robustness of previous findings by using a larger subject sample.

A second potential weakness of previous studies is that the experimental design may unintentionally have invited stepwise behaviour. In all previous studies, participants were informed that the distribution they were inferring would change over the course of the experiment. If participants had reason to believe that the changes in the probability that they were tracking were discrete (e.g., because they were told that the box will be replaced “*from time to time*”), then this may have invited stepwise response behaviour. In addition, the bodily effort required to change one’s estimate was in all previous studies greater than that needed to maintain it. Robinson (1964) had the participants adjust a lever, while Gallistel et al. (2014), Ricci and Gallistel (2017) and Khaw et al. (2017) required them to move the computer mouse, adjust a slider and move the mouse back again before clicking “Next”. By contrast, maintaining one’s previous guess merely required pressing the left mouse button once (Gallistel et al., 2014; Khaw et al., 2017; Ricci & Gallistel, 2017) or no action at all (Robinson, 1964). The asymmetry between the effort required to maintain or change the estimate may have affected the rate of re-estimations, especially when considering that subjects performed 10,000 trials.^{†} In Gallistel et al. (2014) and Ricci and Gallistel (2017) a further asymmetry existed in that a participant could move the slider by clicking right or left of its current position, which would make it jump a set distance. This made it easier to move the slider in large steps than in small ones. The second and third purposes of our study are thus to examine whether instructions and response effort, respectively, affect the degree of discreteness in response patterns.

A fourth and perhaps the most important weakness of previous work is that competing models have never been tested against each other using formal quantitative model comparison methods. Gallistel et al. (2014) compared models mainly based on visual comparisons of summary statistics in the subject data with those produced by the models. Khaw et al. (2017) performed model comparison with the Bayesian Information Criterion (Schwarz, 1978) but only between trial-by-trial models from the economic literature. The second part of our paper presents a comprehensive, formal comparison of competing models.

To summarise, the two main contributions of the present article are as follows. First, we perform an experiment to test whether response effort and instructions affect the degree of discreteness in people’s response patterns. Second, we perform a rigorous, likelihood-based comparison of hypothesis-testing and trial-by-trial updating models on all available data, which has not been attempted before.

## EXPERIMENT

Previous studies on human learning and tracking of non-stationary probabilities interpreted stepwise response behaviour as evidence that participants update their internal beliefs in a discrete manner (Gallistel et al., 2014; Ricci & Gallistel, 2017). This interpretation rests on the assumption that the discrete learning pattern constitutes a fairly stable and robust phenomenon that derives from the participant’s mental shifts between discrete hypotheses. In the present experiment we investigate the extent to which these results are sensitive to superficial specifics of the task, by experimentally varying two factors that we believe may affect the rate of re-estimations in the observed response behaviour. The first factor is the amount of information provided in the instructions to the participants about the non-stationarity of the probability they are asked to estimate. The second factor is the amount of effort required to adjust the response slider.

### Method

#### Participants

Sixty-two participants were recruited using posters advertising the study at several university campuses in Uppsala. Data from two participants were excluded from the analysis since they chose to terminate early. The mean age of those who completed the experiment was 24.7 (SD = 6.3). Forty-seven of these participants identified as female, eleven as male, and two as other. Participants were rewarded with gift vouchers for a major Swedish book shop chain (Akademibokhandeln). The total reward value depended on a participant’s task accuracy, with a minimum fixed to the approximate equivalent of USD 11 and the maximum being approximately equivalent to USD 28.^{‡} Two participants in Condition 1, six in Condition 2, six in Condition 3 and five in Condition 4 received a signature on a participation form instead of gift cards.

#### Stimulus and task

We replicated the visual design of the experiment described by Gallistel et al. (2014) to the best of our ability. The stimulus consisted of a screen showing a box labelled “Box of RINGS”, a bar with a slider, and a rectangle filled with red and blue dots (Figure 1A). At the beginning of each trial, a ring would move out of the box and then stay beside it until the end of the trial. The task of the participant was to estimate the proportion of blue rings in the box by changing the value indicated by a slider on a bar that was labelled with “0% - No blue” and “100% - Only blue” on the left and right ends, respectively. Adjusting the slider caused the proportion of red and blue dots in the square labelled “My current estimate: % blue rings” to change to reflect the new proportion indicated by the slider position, which was intended as a visual aid to help participants “see” their currently chosen estimate.

#### Conditions

The experiment followed a two-by-two factorial design, with “response mode” and “instruction mode” as the independent variables (see Table 1). The first variable had two levels: “Low effort” and “High effort”. In the High effort response mode, subjects revised their estimate by first clicking on the slider and then dragging it to adjust its value. When they were finished, they would click a “next” button to the right of the slider to initiate the next trial. In the Low effort response mode of our experiment, no cursor or “next” button was visible, and the slider value would change whenever the mouse was moved. Participants initiated the next trial by a mouse click. The second independent variable also had two levels. In the “Informed” instruction mode, participants were explicitly informed about the non-stationarity of the generative process: they were told that the contents of the box might change after each draw and that these changes would occur throughout the task. They were also told that the changes could be fast or slow and that their task was to track the proportion as it changed. Participants in the “Uninformed” instruction mode were not provided with this information. In all four conditions, the hidden Bernoulli parameter was a sinusoidal function of the trial number, with a minimum of 0, a maximum of 1, and a period of 500 trials; its value on the very first trial was 0.50. Condition 2 is almost identical to the design described in Gallistel et al. (2014). To the best of our knowledge, the only difference is that in the original study, the slider would jump a set distance when the participant clicked to the left or right of it^{§}.
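Given these design parameters, the generative process can be sketched as below. The text fixes only the minimum (0), maximum (1), period (500 trials), and starting value (0.50); the sine phase and the seeded random draws are illustrative assumptions:

```python
import math
import random

def sinusoidal_p(trial, period=500):
    """Hidden Bernoulli parameter: a sinusoid ranging from 0 to 1 with
    a period of 500 trials and a value of 0.50 on the first trial
    (trial index 0). The phase is an assumption, not given in the text."""
    return 0.5 + 0.5 * math.sin(2 * math.pi * trial / period)

def generate_outcomes(n_trials=2000, period=500, seed=1):
    """Draw one Bernoulli outcome per trial from the moving parameter."""
    rng = random.Random(seed)
    return [int(rng.random() < sinusoidal_p(t, period))
            for t in range(n_trials)]
```

With 2,000 trials per participant, this yields four full cycles of the underlying probability.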

#### Procedure

At the start of the experiment, participants read an information sheet detailing that they were allowed to discontinue their participation at any stage; that the experiment would be divided into two sessions with a break in between; that the average difference between each of their guesses and the correct answer would determine their reward; and what the highest possible reward was. Meanwhile, a Swedish translation of the instructions found in Appendix A in Gallistel et al. (2014) was displayed on the screen, but without the passages relating to reporting that the box had changed. In the Low effort conditions, the relevant parts of the instructions were altered to explain how to answer using the low-effort response mechanism. In the Informed conditions, paragraphs were added to explain that the box could be swapped every time a ring was put back into it, that these changes could be large or small, and that their task was to estimate the proportion of blue rings in the box and track it as it changed throughout the task. Participants were not told anything about how often they were supposed to make a change to the slider.

When the participant indicated that they had read everything, the experimenter would approach them to ask if they had understood all that they had read and if they had any further questions. If asked a question regarding anything not revealed in the instructions, the experimenter would respond that he was unable to provide that information at this stage of the experiment. Any question pertaining to the practicalities of how to carry out the task would be clarified upon request. The participants then completed 1,000 trials before a pause screen was displayed, inviting them to take a break. At their leisure, participants were allowed to commence the second session of 1,000 trials. The length of the break varied widely across subjects, ranging from 12 seconds to 17 minutes, with a mean of 3 minutes and 6 seconds.

After finishing the experiment, the participants filled out post-test questionnaires with questions concerning their beliefs about the generative function, self-assessed statistics proficiency, age, gender and education. Finally, they were asked to draw the probability of drawing a blue ring as a function of trial count into a graph. The questionnaires were administered on paper and filled in with pen. However, we found little use for the questionnaire data and did not analyse them.

#### Analysis

All statistical analyses are performed using the JASP software package with default settings (JASP Team, 2019) and R (R Core Team, 2014).

### Results

#### Accuracy

A visual inspection of the mean estimations (Figure 2A) shows that, on average, the participants tracked the wave-like pattern of the underlying probability reasonably well in all four conditions of the experiment. However, average accuracy is clearly highest in the condition where the participants were informed about the non-stationary generative function and making changes to the slider involved more effort (Figure 2B). We next perform statistical tests to determine if there is evidence for effects of “Information” and “Effort” on the root mean squared error (RMSE) between the generating probability and the participant’s estimate.

Since the data violate the normality assumption of standard ANOVA analyses (Kolmogorov-Smirnov test, *p*<10^{−13}), we apply a Kruskal-Wallis and a Friedman test, with the two between-subject conditions as fixed factors and repeated measurement across blocks of 500 trials each. An initial main effects analysis suggests a main effect of Information (*H*(1) = 8.919, *p* = 0.003) but not of Effort (*H*(1) = 0.685, *p* = 0.408) or Block (χ^{2}(3) = 1.043, *p* = .791). However, Dunn’s post hoc test between the four between-subject cells indicates that this main effect is secondary to the interaction between Information and Effort presented in Figure 2C, with significantly lower median RMSE (approximately 0.13) in the informed high effort condition than in the other three conditions (median RMSE > 0.30; *p*_{holm} < .020; see Appendix A for details on the Dunn’s post hoc test).

To get an indication of how well participants performed in an absolute sense, we compare their accuracy to that of fictive observers who always respond 0.50 (Figure 2C, dashed lines) or respond randomly (Figure 2C, dotted lines). Although the average estimates track the functions in all conditions in Figure 2A, in three of the four conditions the trial-by-trial accuracy in terms of RMSE is no better than that expected from a participant who always responds with the probability 0.50. In sum: although the average estimates tracked the underlying function, the trial-by-trial accuracy was poor in all conditions except when the participants were informed about the non-stationary process and used the more effortful response method, and participants did not improve with more training.

Following earlier work (Gallistel et al., 2014; Ricci & Gallistel, 2017), we consider the Kullback-Leibler (KL) divergence as an alternative measure of accuracy. We perform the same analyses with the KL divergence as the dependent variable and find an initial main effect of Information (*H*(1) = 8.656, *p* = 0.003) but not of Effort (*H*(1) = 0.367, *p* = 0.544) or Block (χ^{2}(3) = 2.187, *p* = 0.534). Dunn’s post hoc test shows that it is secondary to the interaction between Information and Effort (Figure 2C). The median KL divergence in the informed high effort condition (approximately 0.064) is significantly lower than in the other three conditions (median KL divergence > 0.267; *p*_{holm} ≤ .030; see Appendix A for details on the Dunn’s post hoc test). Hence, the results are consistent between the RMSE and KL divergence.
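For concreteness, the two accuracy measures can be computed per participant as sketched below. We assume here that the RMSE is taken over trials and that the KL measure is the mean per-trial Bernoulli divergence from the generating probability to the estimate; the clipping constant guards against log(0) and is an implementation detail, not part of the reported analyses:

```python
import math

def rmse(true_p, est_p):
    """Root mean squared error between the generating probabilities
    and a participant's per-trial estimates."""
    n = len(true_p)
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(true_p, est_p)) / n)

def mean_kl(true_p, est_p, eps=1e-6):
    """Mean per-trial Bernoulli KL divergence D(true || estimate),
    with values clipped away from 0 and 1 to keep the logs finite."""
    total = 0.0
    for t, e in zip(true_p, est_p):
        t = min(max(t, eps), 1 - eps)
        e = min(max(e, eps), 1 - eps)
        total += t * math.log(t / e) + (1 - t) * math.log((1 - t) / (1 - e))
    return total / len(true_p)
```

Unlike the RMSE, the KL divergence penalises confident misestimates near 0 and 1 especially heavily.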

#### Step width

We next examine whether the experimental manipulations affect the average number of trials between slider updates, in previous studies referred to as “step width” (Gallistel et al., 2014; Ricci & Gallistel, 2017). The initial main effects analyses, with the same non-parametric tests as we applied to the RMSE, suggest significant main effects of Information (*H*(1) = 9.46, *p* = 0.002), Effort (*H*(1) = 15.12, *p* < 0.001) and Block of trials (χ^{2}(3) = 69.33, *p* < 0.001). The main effect of Block of trials is an increasing step width, and thus decreasing rate of re-estimation, with additional training. The main effects of Information and Effort are qualified by the interaction illustrated in Figure 2C. Dunn’s post hoc test shows that the median step width is significantly higher in the condition with no information about the non-stationarity of the process and a high effort response mode (approximately 39) as compared to the other three conditions (medians between approximately 2 and 9: *p*_{holm} < 0.020, see Appendix A for details on Dunn’s post hoc test). In sum: with more training the step width increased somewhat, and it was much larger in the condition without information about nonstationary and a high-effort response mode. In other words, when the participants were uninformed that the probability would change over time and the response required more effort, they were more reluctant to change their estimate.

#### Step height

Finally, we test if Information and Effort affected the average magnitude of the slider adjustments on trials when the estimate was updated, referred to as the “step height” in Gallistel et al. (2014) and Ricci and Gallistel (2017). Applying the same statistical tests as above, the results suggest main effects of Information (*H*(1) = 14.633, *p* < 0.001) and Effort (*H*(1) = 11.363, *p* < 0.001), but not of Block (χ^{2}(3) = 6.766, *p* = 0.080). Dunn’s post hoc test supports both a main effect of Information and an interaction between Information and Effort, as illustrated in Figure 2C. The median step height was significantly greater with information about the non-stationarity than without, both with the low effort response mode (medians 0.0312 *vs.* 0.0177; *p*_{holm} = 0.043) and the high effort response mode (medians 0.107 *vs.* 0.0445; *p*_{holm} = 0.003), suggesting a main effect of information regardless of the amount of effort required to update the response. In addition, the informed high effort condition had a higher median than all of the other three conditions, suggesting a (catalytic) interaction for this specific condition (see Appendix A for the full results of Dunn’s post hoc test). In sum, Block had no effect on the step height, but information about non-stationarity of the process increased it, especially when the high-effort response mode was used. Thus, when the participants were told that the underlying probability could change over time, the changes they made were larger, and this was especially the case if the response mode required more effort.
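One plausible operationalisation of the two statistics, given a participant’s per-trial series of slider positions, is sketched below; how the first trial and exact ties are handled is an assumption on our part, as the text does not specify it:

```python
def step_stats(estimates):
    """Step width: mean number of trials between successive slider
    changes. Step height: mean absolute size of a change, computed
    over the trials on which the estimate changed."""
    changes = [t for t in range(1, len(estimates))
               if estimates[t] != estimates[t - 1]]
    if not changes:
        return None, None  # the slider was never moved
    heights = [abs(estimates[t] - estimates[t - 1]) for t in changes]
    gaps = [b - a for a, b in zip(changes, changes[1:])]
    width = sum(gaps) / len(gaps) if gaps else None
    height = sum(heights) / len(heights)
    return width, height
```

For example, a series that is updated on every trial has a step width of 1, whereas pronounced step-hold behaviour yields a large step width and, typically, a large step height.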

### Discussion

Although the average estimates track the sinusoid function in all conditions (Figure 2A), in absolute terms the trial-by-trial accuracy was poor in three of the four conditions, in the sense that the deviation from the true probability on a given trial was no smaller than expected from a participant who responds with 0.50 on each trial (median RMSE approximately 0.35, see Figure 2C). In part, of course, this reflects the relative complexity of the task the participants are faced with. It takes at least a few observations to get a reliable estimate of the underlying probability. When this probability changes on each trial – as in our experiment – the observer’s estimate will always lag behind the generating value. Optimal performance would require participants to infer the abstract function that relates the trial number to the true probability and to use this function to *predict the true probability on the next trial*. To induce this function from the “foggy” output of a constantly changing Bernoulli distribution is difficult, especially so if the observer is provided with only minimum information about the generative process. For this reason, some previous studies have assessed subject performance by comparing their responses to those of an optimal observer rather than to the true generating value (Gallistel et al., 2014; Khaw et al., 2017; Ricci & Gallistel, 2017). These analyses are helpful when investigating the degree of optimality of participants. However, here we are primarily interested in the relative performance between groups, for which any measure of accuracy seems suitable.

The high accuracy and distinctly stepwise re-estimation behaviour observed in Ricci and Gallistel (2017) and the other previous studies were only replicated when the participants were informed about the non-stationarity of the process beforehand and used the more effortful response mode, which are the conditions under which it has previously been observed. Better performance with more accurate prior information about the task is obviously no surprise. But this effect interacted with the effort required by the response mode in an interesting way. With a low effort response mode, there are frequent but small adjustments (median step width of approximately 5, suggesting about 100 re-estimations per block of 500 trials, of a median size of 0.03), and this holds regardless of whether participants are informed about non-stationarity or not. With the high effort response mode, the pattern with relatively rare, large re-estimations only occurred with prior information that the process is non-stationary. The behavioural differences are indeed large. Participants without information about the non-stationarity and with the more effortful response mode rarely re-estimate and make rather small adjustments when they do (median step width of 39 trials, suggesting approximately 13 re-estimations per block of 500 trials, with a median size of 0.04). The participants with information about the non-stationarity and with the more effortful response mode often change their estimates (median step width of 9 trials, suggesting approximately 56 re-estimations per block of 500 trials) and usually by quite a lot (median step height of 0.11). The characteristic step-hold patterns of the predictions of the IIAB-model (Gallistel et al., 2014) were thus observed in only one cell and appear to arise under specific conditions, suggesting that rare but large re-estimations are not necessarily intrinsic to the cognitive process.

An alternative explanation of the effects of the independent variables on step width and step height is that they merely reflect an increase in the number of small, accidental adjustments under the low effort response mode. When the slider is “stuck” to the mouse cursor, participants might occasionally produce unintended adjustments. When the slider has to be dragged, this is less likely to occur. This kind of “shaky hand” error would decrease both the average step width and the average step height. There are relatively small negative main effects of having a low effort response mode on both of these dependent variables. Since we cannot rule out that the shaky hand effect exists, these effects should be interpreted with some caution. However, the substantial interaction between high effort and information mode cannot be attributed to such error. If unintentional adjustments resulting from the low effort response mechanism were a pervasive phenomenon, they should affect the results equally regardless of what information is provided. We would therefore argue that the main result of our experiment – that the previously observed stepwise updating arises as a result of particular combinations of circumstances – holds regardless of whether the low effort response mode increases the number of accidental adjustments.

A tentative interpretation of the results is that people spontaneously tend to be “myopic”, only considering small samples of the most recent observations, which they project onto the next trial as an estimate of the probability. This estimate can, in principle, change from trial to trial, as is consistent with the small and frequent adjustments produced by the participants in several conditions, and their overt expression of the estimate is affected by the effort required to produce the response, as is consistent with the significant effects of response mode. Intriguingly, the effortful response mode seems to have invited participants to consider larger sample sizes, allowing them to better track changes in the underlying probability.

To conclude, a key implication of these results is that the discreteness of the response data seems highly sensitive to external factors, which calls into question whether it should be thought of as inherent to human probability inference as has been done in previous literature. Instead, the pattern may reflect adaptations to the particulars of the task at hand. In other words, it is possible that the internal belief updating is continuous and only the slider adjustments occur discretely.

## MODELLING

Earlier studies (Gallistel et al., 2014; Ricci & Gallistel, 2017) have concluded that human behaviour in probability estimation tasks is consistent with hypothesis-testing models and cannot be explained by any trial-by-trial updating model. Above, we presented experimental evidence that questions the first part of this claim; the remainder of this paper is dedicated to evaluating the plausibility of the second part, by using formal model comparison techniques. Our approach makes four important methodological improvements on previous studies. First, instead of setting parameters manually, we use maximum-likelihood fitting to determine parameter values. Second, instead of fitting models to summary statistics, we fit them to the raw data. This way, we use all available information and avoid having to decide which statistics to look at and how to weight them against each other. Third, instead of evaluating goodness of fit through visual inspection of plots, we use formal model comparison techniques. Fourth, instead of evaluating the models only against our own data, we also include data from other studies in our analyses.

### Factorial model design

When models differ from each other in multiple ways, it is hard to identify which factor explains the success of one model over another. To circumvent such identifiability problems, we apply a method known as *factorial model comparison* (van den Berg, Awh, & Ma, 2014). Just as in factorial experimental designs and factorial ANOVAs, this means that we pair every choice in one factor with every possible choice in the other factors. The goal is not only to identify the model that best captures the underlying process, but also to quantify evidence for each factor level, much as an ANOVA quantifies the evidence for each of the main effects. We deconstruct the models that we consider here into two factors: the updating mechanism and the threshold mechanism. For convenience, Table 2 provides an overview of the most important mathematical terms and symbols appearing in the model specifications.

#### Factor 1: Updating mechanism

This factor determines how and when the observer updates its belief about the hidden Bernoulli probability, *p*_{true}. We consider three options: the IIAB mechanism, a delta-rule mechanism, and a memory-based averaging mechanism. The essence of the IIAB mechanism (Gallistel et al., 2014) is that it maintains a list of “change points” that is updated through hypothesis testing. The change points summarise at which earlier time points there was, according to the model, a change in *p*_{true} and how large each supposed change was. After making a new observation, the mechanism tests the hypothesis that “something is broke”. It does so by computing how much the currently held belief about *p*_{true} – as encoded in the most recently registered change point – deviates from the estimate based on all observations since the last change point. When this discrepancy exceeds a threshold *T*_{1}, it is concluded that “something is broke” and that it “needs fixing.” The updating mechanism then proceeds to a second stage, in which three further hypotheses are tested about what might be wrong: the last registered change point was incorrect and must be expunged; it was placed at the wrong point and should be moved; or there has been a new change point after the last one encoded, which now needs to be registered. Once a decision has been made on this, the mechanism updates the list of change points accordingly and adjusts the slider value, *p*_{slider}, to make it consistent with what is now the latest estimated change point. For a detailed description of the mechanism, see Gallistel et al. (2014). It is important to note that since it can take many observations before it is detected that “something is broke”, slider updates in this type of model tend to happen in a discrete fashion.

The second updating mechanism that we consider is the delta rule, which we abbreviate as “Delta”. Unlike the IIAB mechanism, the delta rule has no notion of hypothesis testing and, therefore, has no threshold on its belief updating. Instead, it updates its estimate of the hidden Bernoulli parameter after each new observation. It does so by computing a weighted average of the previous estimate, *p*_{observed,t−1}, and the latest observation, *O*_{t}, through

*p*_{observed,t} = (1 − *λ*) *p*_{observed,t−1} + *λ* *O*_{t},

where parameter *λ* is the learning rate. Another difference to the IIAB mechanism is that since an update is made on each trial, the magnitude of the updates will often be very small. Considering that it is effortful in both time and energy to adjust the slider value, it seems reasonable to assume that observers only do so when the discrepancy between slider and belief has grown sufficiently large. Therefore, we impose a response threshold *T*_{1} on this discrepancy, such that a slider update is only made when it is considered to be worth the effort.

The third and final updating mechanism that we consider is a memory-based weighted average, which we abbreviate as “M-Avg”. This mechanism works in largely the same way as the Delta mechanism, except that the probability estimate is computed as

*p*_{observed,t} = (Σ_{i=0}^{t−1} *w*_{i} *O*_{t−i}) / (Σ_{i=0}^{t−1} *w*_{i}),

where the weights decrease exponentially in history, *w*_{i} = *α*^{i}. Parameter *α* is constrained to the range [0,1] and can be thought of as a history weight: the larger its value, the more weight is given to observations further back in time. If *α* = 0, then *p*_{observed} is equal to the last observation; if *α* = 1, then *p*_{observed} equals a plain average of all observations; if 0 < *α* < 1, then *p*_{observed} is a weighted average of all observations, with higher weight given to more recent observations. Just as in the Delta mechanism, we include a response threshold such that slider updates are made only when the discrepancy between belief and current slider value is sufficiently large.
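As an illustration, here is a minimal sketch of the memory-based average, assuming weights *α*^{i} for the outcome *i* trials back (our reading of “decrease exponentially in history”): for *α* = 1 it reduces to the plain mean and for *α* = 0 to the last observation.

```python
def m_avg_estimate(outcomes, alpha):
    """Exponentially weighted average of past outcomes.

    outcomes: list of 0/1 observations, oldest first.
    alpha: history weight in [0, 1]; weight alpha**i for the outcome i trials back.
    """
    weights = [alpha ** i for i in range(len(outcomes))]  # i = 0: most recent
    num = sum(w * o for w, o in zip(weights, reversed(outcomes)))
    return num / sum(weights)
```

(In Python, `0 ** 0 == 1`, so the *α* = 0 case correctly returns the most recent outcome.)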

#### Factor 2: Threshold mechanism

All three updating mechanisms described above involve a threshold, denoted as *T*_{1}: the IIAB mechanism has an “is it broke” threshold that prevents hypothesis updating when there is too little evidence that something is wrong, and the other two updating mechanisms have a response threshold that prevents slider updating when it is not worth the effort. In the original formulation of the IIAB model, the “is it broke” discrepancy is measured as KL divergence, *ε*=KL(*p*_{observed} || *p*_{slider})×*n*, where *p*_{observed} is an estimate of *p*_{blue} based on the outcomes observed since the last change point, *p*_{slider} is the currently held belief, and *n* is the number of trials since the last update. For the response threshold in the other two mechanisms, however, a more obvious measure of discrepancy is the absolute difference, *ε*=|*p*_{observed} − *p*_{slider}|. This is indeed what Gallistel et al. (2014) used in their implementations of delta-rule models. These two proposals differ from each other in two ways: the discrepancy is either measured as KL divergence (*ε*=KL(Δ)) or as an absolute difference (*ε*=|Δ|), and it is either multiplied by *n* (*ε*=KL(Δ)×*n*; *ε*=|Δ|×*n*) or not. To dissociate the effects of threshold choice from effects of updating mechanism on goodness of fit, we cross these options factorially, which gives rise to four different threshold mechanisms. Combining each updating mechanism with each threshold mechanism results in a total of 12 models (see Table 3).
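The four threshold measures can be summarised in a few lines of Python; the Bernoulli KL divergence is spelled out here for illustration, and the measure label strings are our own.

```python
import math

def bernoulli_kl(p, q):
    """KL(p || q) between two Bernoulli distributions; p, q strictly in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def discrepancy(p_observed, p_slider, n, measure):
    """The four measures from crossing {KL, |difference|} with {scaled by n, unscaled}.

    measure: one of "KL", "KLxn", "abs", "absxn"; n = trials since the last update.
    """
    base = (bernoulli_kl(p_observed, p_slider) if measure.startswith("KL")
            else abs(p_observed - p_slider))
    return base * n if measure.endswith("xn") else base
```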

#### Threshold variability

Since cognitive processes are generally noisy, it seems plausible that threshold *T*_{1} varies from trial to trial. Therefore, following the proposal by Gallistel et al. (2014), we draw the value of *T*_{1} on each trial from a normal distribution with a mean *μ*_{T1} and standard deviation *σ*_{T1}, both of which are fitted as free parameters.

### Model fitting methods

Due to the existence of latent variables in the IIAB models and the presence of trial dependencies, the proper likelihood function is intractable for some of the models. Therefore, we use a simplified, “custom” likelihood function for model fitting (Appendix B). We use the Bayesian Adaptive Direct Search (BADS) method (Acerbi & Ma, 2017) to find the parameters that maximise this function. In order to reduce the risk of terminating in local maxima, we run BADS thirty times with different initial parameter values. Prior to each run, we evaluate the likelihood function for five hundred randomly drawn parameter vectors and choose the vector that gives the highest outcome as the initial parameter vector for BADS. Results from a model recovery analysis confirm that these methods allow for reliable model comparison (see Appendix C).
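The initialisation-plus-restart procedure can be sketched as follows. We use `scipy.optimize.minimize` purely as a generic stand-in for BADS, and the function and argument names are our own.

```python
import numpy as np
from scipy.optimize import minimize

def multi_start_fit(neg_log_lik, bounds, n_runs=30, n_screen=500, seed=0):
    """Maximum-likelihood fitting with random-screen initialisation and restarts.

    Before each optimiser run, evaluate n_screen random parameter vectors and
    start the run from the best one; keep the best result over n_runs runs.
    """
    rng = np.random.default_rng(seed)
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    best = None
    for _ in range(n_runs):
        screen = rng.uniform(lo, hi, size=(n_screen, len(bounds)))
        x0 = min(screen, key=neg_log_lik)  # best candidate from the random screen
        res = minimize(neg_log_lik, x0, bounds=bounds, method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    return best
```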

### Benchmark dataset

To get the most out of the model comparison, we fit the models to both our own data and data from three previous studies, which were kindly made available to us by the respective authors (Gallistel et al., 2014; Khaw et al., 2017; Ricci & Gallistel, 2017; see Table 4).^{**} The number of trials per subject varied from 2,000 to 10,000 across experiments, with a grand total of 408,000 trials. To the best of our knowledge, all experiments were conducted in sessions of 1,000 trials each, with breaks between consecutive sessions. Because of these breaks, we suspect that parameter values might not be stable across sessions. Therefore, we divide the data into sets of 1,000 trials each, giving a total of 408 datasets.

### Model comparison

We fit the twelve models (Table 3) separately to each of the 408 datasets (Table 4) for a total of 4,896 fits. In doing so, we include only the first 750 trials from each dataset, so that we can use the remaining 250 trials for cross validation.

Model comparison based on AIC values shows a large heterogeneity between subjects (Figure 3A, left): there is not a single model that provides a good fit to all datasets and every model seems to perform well on at least one dataset. Despite this heterogeneity, it is clear that some models perform better overall than others. In particular, the IIAB models generally seem to fit worse than the Delta and M-Avg models. When averaging the relative AIC values across datasets (Figure 3A, right), the most successful model is the one with a memory-based updating mechanism and a threshold mechanism based on the absolute difference (M-Avg with *ε*=|Δ|). All other models have average AIC values at least 50 points higher, which even under a very conservative criterion would be reason to reject them as serious contenders. However, given the heterogeneity at the individual level, it seems unwarranted to rule out individual models at this stage.

Instead of looking at individual models, it may be more informative to look at the success of each factor level. To this end, we compute the *log factor likelihood* as proposed by Shen and Ma (2019) to quantify the evidence for each factor level (Figure 4). Consistently across experiments, the results reveal strong evidence against the IIAB updating mechanism, while the two trial-by-trial mechanisms perform approximately equally well in most experiments. In terms of threshold mechanisms, we observe that there is evidence against models that incorporate the number of trials since the last slider update, while there is approximately equal evidence for mechanisms based on the absolute difference and mechanisms based on KL divergence.

While AIC is widely used as a measure of *fit*, it is not necessarily a good measure of *prediction* because of possible overfitting. Therefore, we next compare models based on the log likelihood of the last 250 trials of each session, which were not included during model fitting. The results of this cross-validation analysis (Appendix D) show a pattern that is largely similar to the AIC-based results: there is large heterogeneity at the level of individual datasets, models with an IIAB updating mechanism perform poorly, and there is no strong evidence for or against specific threshold mechanisms. The evidence is more even between the Delta and M-Avg mechanisms, and both of the two absolute difference variants (*ε*=|Δ|; *ε*=|Δ|×*n*) are now better supported than the unscaled Kullback-Leibler divergence (*ε*=KL(Δ)).

### Model fits

The model comparison results provide insight into how well the models perform in relation to each other. However, those results would be of little value if all models were extremely poor descriptions of the data. Visual inspection of the fits indicates that the best model overall (M-Avg with *ε*=|Δ|) generally does a good job in describing the subject responses (see Figure 5 for a few examples). Across all 408 datasets, the average RMSE between the maximum-likelihood fit of this model and the subject data is 0.139 ± 0.004. Consistent with the results of the formal model comparison, we find that the RMSE is higher for the best-fitting Delta model (0.142 ± 0.004) and the best-fitting IIAB model (0.153 ± 0.004).

### Parameter estimates

An overview of the maximum-likelihood parameter estimates for each model is found in Appendix E. The estimate of *σ*_{unexplained} is on average smaller in the M-Avg and Delta models than in the IIAB models, suggesting that the latter kind of model leaves more variance unexplained than the former two, which is consistent with the model comparison results. In the best-fitting model (M-Avg with *ε*=|Δ|), the median value of this parameter is 5.64×10^{−2}. This is rather small in relation to the response scale (0 to 1), which corroborates our earlier conclusion that the model provides a reasonably good account of the data. For parameters *μ*_{T1} and *σ*_{T1} we find median values equal to 0.470 and 0.207, respectively. These values suggest quite a high degree of trial-by-trial variability in the response threshold, with a relatively high mean value. We speculate that the variance captured by these parameters also includes other sources of variability in response behaviour (e.g., noise in the calculation of *ε* and variability in the applied learning rate or memory weight) which are not specified in the models.

Finally, we estimate how much outcome history the winning M-Avg model takes into account in its trial-by-trial estimates of *p*_{true}. The memory weight in this model drops exponentially with history length, at a rate that is determined by parameter *α*. We quantify the history length as the number of trials that cover 95% of the total weight mass. Based on the maximum-likelihood estimates of *α*, we find a median length of 33 trials (25% quantile: 19 trials; 75% quantile: 97 trials).
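Under the assumption of geometric weights *α*^{i} with an effectively infinite horizon, this history length has a closed form: the *n* most recent trials carry a fraction 1 − *α*^{n} of the total weight, so *n* = ⌈log(0.05)/log(*α*)⌉. A small sketch (the finite sessions in the actual fits may make the exact numbers differ slightly):

```python
import math

def history_length(alpha, mass=0.95):
    """Smallest n such that the n most recent trials carry `mass` of the
    total geometric weight: 1 - alpha**n >= mass, for 0 < alpha < 1."""
    return math.ceil(math.log(1 - mass) / math.log(alpha))
```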

### Model comparison with fixed thresholds

All models that we have tested so far had a variable threshold. We next address two questions regarding this variability. First, how much do the fits suffer if the variable threshold is replaced by a fixed one? Second, do the conclusions that we draw from the model comparison depend on the existence of threshold variability? To answer these questions, we re-fit the twelve models with *σ*_{T1} fixed to 0. While the AIC value worsens for each of the twelve models – by a minimum of 728±38 points – the model order is near-identical to the order we found with the models with variable thresholds (Supplementary Figure S1). Hence, while the assumption of variability in thresholds contributes strongly to the success of all tested models, our main conclusions do not critically depend on it.

### IIAB with a response threshold

The IIAB models have a threshold at the belief updating stage, while the trial-by-trial updating models have a threshold at the response stage. This creates a potential interpretation problem regarding the model comparison results: is the relatively poor performance of the IIAB models due to their belief updating mechanism or due to their lacking a threshold at the response stage? Or, put differently: can the IIAB model be salvaged by adding a response threshold? To answer this question, we add a response threshold to the IIAB models and fit them again to all 408 datasets. We find that this modification improves the average AIC values of the IIAB models by 200±6 points. However, despite this substantial improvement, the models still perform poorly compared to the trial-by-trial models (Figure 6A).

### Two-kernel delta-rule model

Under conditions where there are large and infrequent changes, as in much of the experimental data considered in this study, a regular delta-rule model faces a problem. If the most recent observation is assigned a lot of weight, the model will quickly catch on to changes but exhibit exaggerated volatility during the long periods where the true probability is unchanged. If, on the other hand, it is only given a little weight, overfitting of noise will be avoided but the model will be slow to catch on to sudden changes. To solve this problem, Gallistel et al. (2014) considered a two-kernel variant that keeps track of two running averages. One kernel has a fast learning rate and the other a slow one. When there is a sudden change, the discrepancy between the two estimates is large, which is used as a signal that there has been a change and that the fast kernel should be trusted. After some observations, the slow kernel will catch up and the discrepancy will decrease, signalling that the fast kernel is no longer relevant. The model will then revert to reporting the slow kernel’s estimate.

We next test this model as a contender to the other models we have considered so far. The model keeps two estimates of the Bernoulli probability,

*p*_{slow,t} = (1−*λ*_{slow})*p*_{slow,t−1} + *λ*_{slow}*O*_{t} and *p*_{fast,t} = (1−*λ*_{fast})*p*_{fast,t−1} + *λ*_{fast}*O*_{t}.

On trials where the absolute difference between the two estimates is larger than a threshold Δ_{c}, the model takes *p*_{fast} as its estimate of the Bernoulli probability; otherwise it uses *p*_{slow} as its estimate. The model thus has two additional parameters compared to the standard delta-rule model tested above. As in the main analysis, we combine this updating mechanism with all four thresholding mechanisms (Table 3). We find that across all 1,632 fits, the additional kernel improves the AIC value of the delta-rule models on average by 133±5 points. In terms of model comparison, the two-kernel delta-rule model with *ε*=KL(Δ) outperforms all other tested models (Figure 6B).
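In code, the two-kernel mechanism amounts to a few lines; this is an illustrative sketch, and the variable names are ours.

```python
def two_kernel_estimates(outcomes, lam_slow, lam_fast, delta_c, p0=0.5):
    """Two-kernel delta rule: report the fast kernel's estimate when the two
    running averages disagree by more than delta_c, otherwise the slow one's."""
    p_slow = p_fast = p0
    estimates = []
    for o in outcomes:  # o is 0 or 1
        p_slow = (1 - lam_slow) * p_slow + lam_slow * o
        p_fast = (1 - lam_fast) * p_fast + lam_fast * o
        # a large discrepancy signals a recent change: trust the fast kernel
        estimates.append(p_fast if abs(p_fast - p_slow) > delta_c else p_slow)
    return estimates
```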

### Fits to full datasets

In the analyses presented above, we have been fitting models to sessions of 1,000 trials each to allow for the possibility that parameters can vary between sessions. To verify that our conclusions do not critically depend on this choice, we next fit the models to the full datasets, i.e., with only one set of parameters per subject. Although there are small differences in the model order (Figure 6C), the overall findings are the same as before: the M-Avg model with *ε*=|Δ| comes out as the overall best model and the four IIAB models perform poorly. Hence, the general conclusions of our model comparison do not seem to critically depend on whether we fit the models to single sessions or to full datasets.

### Fits to summary statistics

So far, we have been comparing models based on log likelihoods computed from fits to the raw data. One might argue, however, that it is also important that a model captures key summary statistics derived from the raw data. In the context of probability estimation, Gallistel et al. (2014) argued that two important summary statistics are the step widths and step heights. While we agree with this, we are not convinced by their conclusion that it is impossible for *any* trial-by-trial updating model to account for the empirical joint distributions of these statistics. The problem is that their conclusion was based on visual inspection of model behaviour for a small number of manually picked parameter settings, rather than on a systematic exploration of the parameter space.

To investigate more formally how well the models are able to account for the empirical joint distributions of step widths and step heights, we use an optimisation algorithm to find the parameters that minimise the Jensen-Shannon divergence^{‡‡} (JSD) between the empirical and the predicted distributions. Since repeated computation of joint distributions makes this optimisation very time-consuming, we fit the models with only one threshold variant in the second model factor. To ensure that our choice does not bias the results in favour of the trial-by-trial models, we choose *ε*=|Δ|×*n* for all three models, which was the most successful variant for the IIAB model in the main analysis (Figure 3). We fit these models to full datasets, because joint distributions for session-based data often contain too few data points for reliable fitting.

The left panel of Figure 7A presents the empirical data that led Gallistel et al. (2014) to conclude that there are serious discrepancies between the kind of patterns generated by subjects and those generated by trial-by-trial models. In contrast to their conclusion, however, we find that the three models perform approximately equally well, both visually (Figure 7A) and in terms of JSD (IIAB: 0.22±0.03; Delta: 0.22±0.04; M-Avg: 0.19±0.04). Also at the individual level, visual inspection of the fits does not indicate an advantage of the IIAB model over the M-Avg and Delta models in any of the experiments (Figure 7B). In fact, when averaging the JSD across all 89 subjects (Figure 8A), the IIAB model accounts for the distributions substantially worse than the M-Avg and Delta models (IIAB: 0.28±0.017; Delta: 0.17±0.013; M-Avg: 0.17±0.011).

At the level of individual experiments, the IIAB model has the worst JSD in seven of the eight cases (Figure 8B); the only exception is E3, which has just three subjects and where all models have approximately equal JSD. Overall, these results are consistent with our main analysis in the sense that the Delta and M-Avg mechanisms perform roughly equally well and better than the IIAB mechanism. However, it has to be noted that the JSD differences are very small in comparison to the AIC differences (Figure 3). This is because a summary statistic can never contain more information than the raw data from which it is derived, which follows from a theorem known as the data processing inequality (Cover & Thomas, 2005). We quantified this difference in a previous study (albeit in a different context), where we found that the summary statistics contained only 0.15% of the evidence present in the raw data (van den Berg & Ma, 2014). In light of this, we prefer to give more weight to likelihood-based comparisons than comparisons based on summary statistics.

In conclusion, even if one considers the joint distribution of step widths and step heights as the sole criterion to evaluate models on, there seems to be no ground for ruling out trial-by-trial models. If anything, the trial-by-trial models explain the data better than the hypothesis-testing model.

### Slider updating consistency

The three updating mechanisms considered in this study (IIAB, Delta, M-Avg) have in common that belief updates are always consistent with the most recent observation: observing a blue ring increases the estimate of *p*_{blue} and observing a ring of the other colour decreases it. However, we find that across all 89 subjects in our dataset, on average only 75.8±1.8% of the updates were consistent with the most recent observation (range: 68.9% to 80.3%). Hence, about 1 in every 4 updates was made in the direction opposite to the most recent observed outcome. Threshold variability may be one source of these inconsistencies. To see why this is the case, suppose that a participant observes three blue rings followed by a red one. If the updating threshold happened to be high in the first three trials and low in the last one, a slider update may be made only on the fourth trial; the resulting upward adjustment then coincides with the observation of a red ring and is thus counted as inconsistent.
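The consistency measure itself is straightforward to compute; here is a sketch with our own helper function, with 1 coding a blue ring.

```python
def update_consistency(outcomes, slider):
    """Fraction of slider changes that move in the direction of the most
    recent observation (up after a 1, down after a 0)."""
    consistent = total = 0
    for t in range(1, len(slider)):
        change = slider[t] - slider[t - 1]
        if change != 0:
            total += 1
            if (change > 0) == (outcomes[t] == 1):
                consistent += 1
    return consistent / total if total else float("nan")
```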

In agreement with our intuitions, we find that updating consistency in the fits to full datasets is 100% for all M-Avg and Delta models without threshold variability. However, somewhat to our surprise, for the IIAB model we find that a small proportion of the updates (1.4±0.3% across all 89 subjects) is inconsistent with the last observation. We suspect that this may have to do with the ability of the model to have “second thoughts”, i.e., to take back an earlier made update. In any case, models without threshold variation predict much higher updating consistency than what is observed in the data.

For models with threshold variation, we find substantially lower consistency values in the fits: 91.6±0.8% (IIAB with *ε*=|Δ|), 83.7±1.6% (Delta with *ε*=|Δ|), and 83.6±1.0% (M-Avg with *ε*=|Δ|). These results show that threshold variability may be one explanation for participants’ updating inconsistencies. However, since the models still somewhat overestimate the consistency rates, it is likely that there are other sources too. Participants could, for example, be inferring local sequential dependencies in the data. This would lead to beliefs of the form “the next ring will surely be red since I have just drawn three blue ones” as opposed to “there is a high chance of drawing a blue ring given that I have just drawn several of them”, and thus to inconsistent updating.

### Generalisation to a different task

So far, we have been evaluating models based on data from a single task. However, a good model should also be able to explain data from other tasks that supposedly tap into the same cognitive process. Therefore, we next assess how well the three updating mechanisms account for data from a binary prediction task. In a study by Luthra and Todd, 131 subjects performed three sessions of a probability learning experiment of 100 trials each (Luthra & Todd, 2019)^{§§}. On each trial, they were asked to guess which of two light bulbs – A or B – would light up next. The probability that light bulb A would light up – denoted *p*_{A} – differed between the three sessions (values: 0.60, 0.70, and 0.80; randomly ordered). Since the true value of *p*_{A} was always larger than 0.50, the accuracy-maximising strategy would be to always report “A”. Luthra and Todd found that the proportion of “A” responses indeed increased as a function of trial number (Figure 9A), indicating learning. Moreover, they observed that subjects – on average – chose the two options in proportion to their probability, especially in the first session (Figure 9B). This extensively studied phenomenon – known as *probability matching* (e.g., Vulkan, 2000) – is not restricted to human choice behaviour, but has also been observed in response data from a range of other animals, from bees (Keasar, Rashkovich, Cohen, & Shmida, 2002) to birds (Krebs, Kacelnik, & Taylor, 1978). Nevertheless, it is important to note that probability matching behaviour observed at the level of aggregated data does not always have to imply probability matching at the level of individual subjects. Moreover, even if there was probability matching at the level of individual subjects, it is unclear what caused it: it could originate from the way that subjects updated their probability estimates or from the way they mapped these estimates to a response.

While the stimulus and response type were quite different between this experiment and the probability estimation experiments considered above, we may expect that subjects used highly similar cognitive strategies to estimate the hidden Bernoulli processes in both tasks. Therefore, the data by Luthra and Todd allow us to further evaluate the plausibility of the three updating mechanisms (IIAB, Delta, M-Avg): plausible models should be able to do well on both tasks. Conversely, modelling the data may also shed some light on the origin of the probability matching behaviour.

To make our models applicable to this task, we make several small modifications. Since subjects were required to provide a response on each trial, there is no need for a response threshold. Hence, we can ignore the second model factor, which leaves us with only three models: the original IIAB model with parameters *T*_{1} and *T*_{2}, a delta-rule model with parameter *λ*, and a memory-based averaging model with parameter *α*. To apply these models to the binary prediction task, we need to add a response mechanism that maps the observer’s estimate of the hidden Bernoulli parameter, *p*_{observed}, to a binary response (“A” or “B”). We try three such mechanisms, which we implement as levels in a second model factor:

- *Maximisation mechanism.* This strategy maximises expected accuracy, by responding “A” if *p*_{observed} > 0.5 and “B” otherwise.
- *Probability matching mechanism.* In this mechanism, the probability of responding “A” is equal to the observer’s current estimate of the hidden Bernoulli parameter, *p*_{observed}.
- *Generalised response mechanism.* When using this response mechanism, the observer responds “A” with a probability equal to *p*_{observed}^{β} / (*p*_{observed}^{β} + (1 − *p*_{observed})^{β}), where *β* is a free parameter. This mechanism contains the maximisation and probability matching strategies as special cases (*β*→∞ and *β*=1, respectively).
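In code, a generalised rule of this kind can be sketched as follows, assuming the power-function form P(“A”) = p̂^β / (p̂^β + (1 − p̂)^β), in which *β* = 1 gives probability matching and large *β* approaches maximisation; this functional form is our own illustrative assumption.

```python
def p_respond_A(p_hat, beta):
    """Probability of responding "A" under a power-function response rule.

    beta = 1 reproduces probability matching; large beta approaches maximisation.
    """
    a = p_hat ** beta
    b = (1 - p_hat) ** beta
    return a / (a + b)
```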

We thus have nine models in total. As before, we assume a lapse rate of 0.1% to circumvent numerical issues in fitting deterministic models. We fit each model separately to each subject dataset. Just as in our main analysis, we allow the parameters to differ between sessions.

All models that use the maximisation or probability matching response mechanisms provide a poor fit to the empirically observed proportions of “lightbulb A” responses (Figures 9A and 9B). Models with the generalised response mechanism fit these data much better, but only when combined with either the Delta or IIAB updating mechanism. Interestingly, none of the M-Avg models accounts well for the data. These visual results are corroborated by AIC-based model comparison results (Figure 9C), where the best-fitting model is the Delta model with the generalised response mechanism, followed by the IIAB model with the generalised response mechanism (ΔAIC = 14.5±2.1). All other models are outperformed by at least 30 AIC points.

### Discussion

The most important point to take away from the modelling analyses is that – contrary to previous claims – we find no compelling evidence against trial-by-trial updating in human estimation of non-stationary probabilities. In fact, we find this class of models to be more successful at explaining behaviour than the hypothesis-testing models, with very high consistency: it holds across all eight available datasets; it holds for models with and without threshold variability; it is independent of whether model comparison is based on AIC values or on cross-validation; it is independent of whether model comparison is based on raw data or summary statistics; it is independent of whether we fit the models to full data sets or per session; and it still holds if we add a second variable threshold to the IIAB model. Differences in model evidence were smaller when we applied the models to data from a binary prediction task. However, importantly, we did not find any evidence against trial-by-trial updating there either.

It is difficult to say which of the two types of trial-by-trial models is the more successful one. When applied to data from probability estimation tasks, M-Avg models have a slight advantage over Delta models in AIC-based model comparison. However, the results are reversed in model comparison based on cross-validation and in the results from the binary prediction task. Altogether, these results suggest to us that the two classes of models make very similar predictions, but that M-Avg models may be more susceptible to overfitting.

Allowing the threshold to vary is important for any model to describe the participants’ behaviour well. This variability could have multiple origins. For example, it could be that the neural representation of the threshold varies due to neural noise. Another possibility is that revisions of the threshold depend on the subject’s level of attention, which may fluctuate over time, especially in long experiments of the type considered here. Similarly, the threshold as such can be interpreted in several ways. Gallistel et al. (2014) assumed any threshold to be an integral part of the estimation procedure, while Khaw et al. (2017) suggested that it arises from rational adaptation to the cognitive costs of updating. Yet others may envisage it as the result of motor “laziness”, which could be an equally rational outcome of a trade-off between motor cost and expected reward. All in all, the psychological interpretation of the updating threshold requires further study.

Our finding that the two-kernel delta-rule model outperformed all other models on the probability estimation task suggests that subjects may have been keeping track of both slow and fast changes in the probability that they were estimating. Another possible explanation is that they were in fact behaving as described by a single-kernel model that updates its learning rate as a function of the prediction errors, as suggested by Behrens et al. (2007). Intuitively, this mechanism should be able to solve the problem which a regular trial-by-trial model will face when tracking a function with large but infrequent changes: that the estimate sometimes needs to be highly sensitive to new observations and at other times less sensitive in order to track it well. This is an interesting question for future work.
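Such a two-kernel scheme can be sketched as two delta-rule estimators running in parallel on the same outcome sequence. The learning rates, the equal-weight combination rule, and the parameter names below are illustrative assumptions, not the fitted specification from this study:

```python
import numpy as np

def two_kernel_delta(outcomes, alpha_fast=0.3, alpha_slow=0.02, w=0.5, p0=0.5):
    """Track a Bernoulli probability with a fast and a slow delta-rule kernel.

    Hypothetical parameter values; the combination rule (a fixed weighted
    average of the two kernels) is one of several possibilities.
    """
    p_fast, p_slow = p0, p0
    estimates = []
    for o in outcomes:  # o is 0 or 1
        p_fast += alpha_fast * (o - p_fast)  # fast kernel: reacts to abrupt changes
        p_slow += alpha_slow * (o - p_slow)  # slow kernel: stable long-run average
        estimates.append(w * p_fast + (1 - w) * p_slow)
    return np.array(estimates)

rng = np.random.default_rng(0)
# true probability jumps from 0.2 to 0.8 halfway through the sequence
p_true = np.concatenate([np.full(500, 0.2), np.full(500, 0.8)])
outcomes = rng.random(1000) < p_true
est = two_kernel_delta(outcomes)
```

With a large jump in the true probability halfway through the sequence, the fast kernel lets the combined estimate react within a few trials, while the slow kernel stabilises it between changes.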

Lastly, we made an interesting observation that to the best of our knowledge has not been reported before: a rather large proportion of the slider updates was inconsistent with the most recent draw from the Bernoulli distribution. While threshold variability may be part of the explanation, we suspect that there are other sources too. Since the origin of these inconsistencies could be informative about the underlying belief updating mechanism, further investigation of this issue could lead to important improvements of the theories.

## GENERAL DISCUSSION

While there is an extensive literature on human estimation of stationary probabilities (Edwards, 1961; Estes, 1976; Fiedler, 2000; Peterson & Beach, 1967), research on estimation of non-stationary probabilities has only just begun. An important observation made by the studies that have been pioneering this area is that humans tend to report their probability updates in a stepwise manner (Gallistel et al., 2014; Khaw et al., 2017; Ricci & Gallistel, 2017; Robinson, 1964). Ricci and Gallistel (2017) posited that explaining this kind of behaviour is the number one challenge for any model based on trial-by-trial updating. In this article, we took up this challenge and scrutinised the claim in two ways. First, we reported empirical data which investigated the malleability of these observed stepwise behaviours, and which expanded the empirical data base for distinguishing between the different models considerably. Second, we evaluated the different models using more rigorous likelihood-based model comparisons, applying them both to our new data and to the data sets from three previously published studies.

In the experiment, using two novel manipulations, we found evidence that particulars of the experimental design affect the discreteness in the response patterns, in turn suggesting that the stepwise behaviours need not exclusively or mainly be a signature of hypothesis testing. In particular, the finding that the extent of stepwise behaviours is strongly affected by the effort required to produce the response indicates that there are covert changes in beliefs that are not disclosed when there are asymmetric costs of maintaining vs. changing the response. The rate of stepwise behaviour was also affected by instructions about the non-stationarity of the process, indicating that there are a priori adaptations of the process that are responsive to instructions (e.g., changes in the priors across a hypothesis space or changes in the sampling window effectively used for estimation). The characteristic patterns of rare and large changes observed in the previous studies were not general, but mainly observed in one of the four experimental cells.

Furthermore, using rigorous model comparison methods, we found that not only our own data, but also all of the previous data sets are better accounted for by models based on trial-by-trial updating than by models based on hypothesis testing. This conclusion held across eight data sets and across a variety of different criteria for evaluating the fit of the models. However, we should immediately point out that the ambition of this article is not to proclaim the “death” of hypothesis testing models, but rather to suggest that the reports of the “death” of trial-by-trial learning models have been greatly exaggerated. Ultimately, we would expect that – as is true in most areas of cognitive science – the mind is able to draw on several different cognitive processes for learning about a property as fundamental to adaptation as probability.

### More challenges

While the modelling results presented above may appear conclusive, Ricci and Gallistel (2017) raised several additional challenges for trial-by-trial models beyond the question of how to explain stepwise updating. Here, we briefly address these. The first is to explain that “subjects perceive the changes themselves” when there are abrupt and large changes. The authors considered the possibility of a trial-by-trial model with both a slow and a fast kernel, the latter of which should be able to detect abrupt changes. However, they rejected that model because they were unable to find parameter settings that produced summary statistics matching the patterns in subject data. Here, we performed a rigorous model comparison and found that the two-kernel delta-rule model actually beats all other models that we tested. Based on this finding, we believe that it would be interesting for future work to examine to what extent abrupt changes detected by a two-kernel delta-rule model coincide with those perceived by subjects.

Another challenge is to explain that subjects sometimes have “second thoughts about previously perceived changes in the hidden parameter”. An elegant property of the IIAB model is that the prediction of second thoughts is integral to its updating mechanism. However, we believe that it would be wrong to reject trial-by-trial models on the grounds that they need additional assumptions to account for second thoughts, because second thoughts might very well be governed by a separate process. A circumstance (in this case a button) which explicitly invites people to re-evaluate their previous beliefs might induce them to do so, but that is not to say that such behaviour must be integral to the iterated online estimation which the present paradigm investigates.

A final challenge is to explain that participants are able to extract abstract information about the function that guides the true value of the probability that they are tracking. In line with their findings, we observed in the post-experiment questionnaires that many participants produced something that resembled a sinusoidal function when asked to draw the function they believed they had been tracking. Just as was the case with the issue of second thoughts, we believe that inference of the underlying function may be governed by a mechanism that is separate from the updating mechanism. Yet again, a nice feature of the IIAB is that it requires less of an extension^{***} to account for this since keeping a history of previous change points is an integral part of the updating mechanism. However, the drawings that our participants made were almost always smooth and rarely resembled the kind of stepwise function encoded in lists of change points. An integral feature of M-Avg models is that they keep extended histories of previous observations. In principle, this history could be used to infer both second thoughts and the underlying function, even though that would require the assumption of additional mechanisms. Finally, while the Delta models are recursive and, therefore, do not require any sequence memory, it is important to note that the original delta rule was devised as a linear combination of weights on input signals (Widrow & Hoff, 1960; Widrow & Lehr, 1993). It was designed as a shortcut for a human supervisor to manually update the weights assigned to a set of information nodes. More generally, any recursive function can be reformulated as an iterative one (Church, 1936b, 1936a; Turing, 1937). Hence, whether people retain the sequences they have observed is not a question that can be answered by model comparison – all three tested classes of models are compatible with that proposition. 
Rather, it will require experimental manipulations which investigate what cognitive mechanisms mediate the behaviour.

### Heterogeneity in updating strategies

Our model comparison results were quite unambiguous when considered at the group level: the M-Avg mechanism accounted best for the data, followed by first the Delta mechanism and then the IIAB mechanism (Figures 3 and 4). However, at the level of individual subjects, we observed substantial heterogeneity in the results (Figure 3A). There are two possible explanations for this. First, there may be true heterogeneity in the underlying cognition, in which case it would be misleading to consider only group-level results. Second, the heterogeneity could be an artefact caused by limitations of the analysis, such as the finite size of the dataset, the use of a custom likelihood function, and the lack of guarantee that the optimisation algorithm always converged to the maximum of this function. Indeed, the model recovery analysis (Appendix C) showed some misclassifications even when the true model was in the set of fitted models, although never between updating mechanisms. We can, at present, neither rule out nor confirm that different individuals used different updating strategies.

### Generalisation to a binary prediction task

To test the generalisability of the models of probability estimation, we examined whether they were also able to account for data from a binary prediction task. An important aspect of those data was that they contained indications of probability matching, i.e., subjects choosing each option in proportion to the true frequency of its occurrence. This kind of behaviour is suboptimal in the sense that it does not maximise expected accuracy. As we suspected, all models failed to describe this behaviour when assuming that participants used a response strategy that maximises expected accuracy. More surprisingly, the models were also unable to account for the data when assuming that participants used a probability matching response strategy. Instead, to account for the data, the models needed a more flexible, generalised response mechanism that allows for strategies lying essentially in between maximisation and matching. Assuming that participants rely on one of the updating mechanisms considered here, it would thus seem that they are neither simply maximising nor probability matching, but using some other strategy.
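One simple way to formalise a response rule that interpolates between matching and maximisation is a power-law transformation of the current probability estimate. This is offered as an illustration of the general idea, not necessarily the specific generalised mechanism fitted in the study:

```python
def p_choose_A(p_hat, gamma):
    """Probability of predicting outcome A, given the estimate p_hat of P(A).

    gamma = 1 reproduces probability matching; large gamma approaches
    maximisation; gamma = 0 gives random guessing. This power-law form
    is one plausible instantiation of a generalised response mechanism.
    """
    return p_hat**gamma / (p_hat**gamma + (1 - p_hat)**gamma)
```

For example, with an estimate of `p_hat = 0.7`, `gamma = 1` reproduces matching (respond “A” on 70% of trials), while `gamma = 20` responds “A” virtually always (near-maximisation).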

Alternatively, it may be that there are probability updating mechanisms that produce apparent probability matching behaviour when coupled with a maximisation response mechanism. One such model is the instance-based learning (IBL) model (Gonzalez & Dutt, 2011; Lejarraga, Dutt, & Gonzalez, 2012), which is based on the ACT-R architecture (Anderson & Lebiere, 1998; Lebiere, 1999). It manages to account for probability matching by including high memory activation noise in the belief-updating mechanism, which makes the probability estimate vary considerably from trial to trial but converge on the true probability in expectation. Since the probability estimation data do not generally exhibit such high variability (Figure 5), it seems unlikely that the IBL model can explain them without making hugely different assumptions about the memory noise.

Although our analysis here is only rudimentary, it points to the issues that our results pose for binary prediction models and that human behaviour on binary prediction tasks poses for our models. To explain probability matching *and* probability estimation, one must propose that participants are using a different belief updating mechanism (one that does not assume excessively high noise) or a different response strategy (one that is neither pure maximisation nor pure probability matching). There are such suggestions (Gaissmaier & Schooler, 2008; Koehler & James, 2010; Wozny, Beierholm, & Shams, 2010), and future studies should investigate these more rigorously using data from both paradigms.

### Limitations

A first limitation of the present study is that we did not test hybrid models. Since the main goal was to scrutinise previous conclusions drawn about the viability of trial-by-trial models, we considered the testing of hybrid models outside the scope of the present work. However, since hypothesis-testing and trial-by-trial updating are not necessarily mutually exclusive, the most promising models might be ones that combine the two processes.

We also mentioned above that there remain unexplained differences between the observed consistency rates and those predicted by the models. Intuitively, one possible cause is that participants infer sequential dependencies within random processes (Ayton & Fischer, 2004). A participant who is under the impression that, say, three blues in a row indicate that the next ring is most likely going to be red should update inconsistently after observing that sequence. This has not been addressed in our experimentation or modelling, but experimental data exist from a paradigm similar to our own. Toda (1958) rigged the Bernoulli sequence in his probability estimation task in such a way that there were sequential patterns in the outcomes, allowing him to study whether these patterns were inferred, by observing the participants’ subjective probabilities. He inferred from the data that subjects estimate probabilities in a way that approximates the Bayesian solution of a higher-order Markov process – a non-trivial trial-by-trial model. We are, however, reluctant to accept this conclusion. The problem is that the probability estimates in Toda’s task were derived indirectly from decisions in an ultimatum bargaining game and were thus likely to have been affected by first-mover advantage and people’s fairness concerns (Güth, 1995; Güth & Van Damme, 1998; Slembeck, 1999; Thaler & Camerer, 1995). This may have biased his estimates. Future studies could adapt the present task with Toda’s (1958) rigged sequences to see if this increases the inconsistency rates beyond those in a non-rigged control condition.

Moreover, we performed model comparisons based on a custom likelihood function, because the proper likelihood function was intractable. Even though model recovery analysis confirmed that the chosen function allowed for reliable model comparison, better choices might have been possible and could have led to more conclusive results in terms of distinguishing the four threshold mechanisms in the second model factor. We constructed the custom likelihood function mainly based on “educated guesses” about which aspects are important to take into account. An alternative and probably better approach would be to *derive* a likelihood function by starting with the proper one and then making simplifications until it becomes tractable.

Lastly, during our debriefings, some participants reported that they counted or chunked the observations. This suggests a trivial dual-strategy hypothesis: some people attempt to solve the task by counting, a strategy which is highly inefficient in the chaotic world outside the laboratory, while intuitive updating relies on a different system that does not require working-memory retention of observations. Manipulating working memory capacity could confirm or reject this hypothesis and inform future studies that want to use similar tasks: since most scientists will presumably be more interested in the second, intuitive system, we need to know whether counting must be controlled for.

### Relation to behavioural economics

In their seminal work “Theory of Games and Economic Behavior”, originally published in 1944, von Neumann and Morgenstern (2007) begin by recognising that a “universal system” of economic theory is not achievable in the foreseeable future, largely due to the lack of a sufficient body of empirical observations. In anticipation of that, they make do with “some commonplace experience of human behavior” to demonstrate the mathematical framework we today recognise as game theory. These behavioural assumptions have been criticised by behavioural economics and cognitive psychology (e.g. Mullainathan & Thaler, 2015; Schoemaker, 1982; Tversky, 1975). Some studies have introduced modifications (e.g. Caplin & Leahy, 2001; O’Donoghue & Rabin, 1999), but there have been few comprehensive replacements. A well-validated, robust theory of probability perception would be an important step towards such an end. We believe that the present work is a contribution to the construction of such a theory.

### Concluding remarks

To the best of our knowledge, the first direct study of human estimation of non-stationary probabilities was performed in 1964 (Robinson, 1964). After that, it took another 50 years before a serious modelling attempt was initiated to obtain an understanding of the mechanism behind this important cognitive function (Gallistel et al., 2014). That attempt culminated in a rejection of the entire class of trial-by-trial models and the proposal that humans instead use hypothesis testing to track non-stationary probabilities. Here, we scrutinised that proposal and found that there is actually much stronger evidence for trial-by-trial models than for hypothesis testing. Hence, the rejection of trial-by-trial models seems to have been premature. Considering the young state of this subfield of research, we believe that it would be equally wrong to rule out any other class of model at this point. In the end, it may turn out that humans use a mix of strategies. Therefore, future studies might benefit from starting to look into hybrid models instead of continuing to restrict themselves to one particular class. In doing so, they should strive to bring all the findings – from function learning through binary choice to probability inference – under one umbrella. That way, applied researchers such as economists may find important uses for the work.

## CREDIT AUTHOR STATEMENT

**Mattias Forsgren:** Conceptualization, Methodology, Validation, Formal Analysis, Investigation, Data Curation, Writing – Original Draft, Writing – Reviewing & Editing. **Peter Juslin:** Conceptualization, Methodology, Formal Analysis, Writing – Original Draft, Writing – Reviewing & Editing, Supervision, Project Administration, Funding Acquisition. **Ronald van den Berg:** Conceptualization, Methodology, Software, Validation, Formal Analysis, Data Curation, Writing – Original Draft, Writing – Reviewing & Editing, Visualization, Supervision, Project Administration, Funding Acquisition.

## ACKNOWLEDGEMENTS

We thank Randy Gallistel, Matthew Ricci, Mahi Luthra and Peter Todd for sharing their data with us and for several helpful discussions about their experiments. The research was funded by the Swedish Research Council (Grant 2018-01947) and the Marcus and Amalia Wallenberg Foundation.

## APPENDIX A Dunn’s post hoc comparisons

## APPENDIX B Custom likelihood function

In its most general form, the log likelihood function for the models considered in this study takes the form

$$\log L(\boldsymbol{\theta}; \mathbf{R}, \mathbf{O}) = \log p(\mathbf{R} \mid \boldsymbol{\theta}, \mathbf{O}) = \log \int p(\mathbf{R} \mid \boldsymbol{\psi}, \boldsymbol{\theta}, \mathbf{O})\, p(\boldsymbol{\psi} \mid \boldsymbol{\theta}, \mathbf{O})\, d\boldsymbol{\psi},$$

where **R**={*R*_{1}, *R*_{2}, …, *R*_{n}} is a vector with subject responses for all *n* trials, **θ** is a vector with parameter values, **ψ** is a matrix with latent variables, and **O**={*O*_{1}, *O*_{2}, …, *O*_{m}} is a vector with all Bernoulli outcomes observed by the subject. The IIAB model has multiple time-varying latent variables, including a list of change points and parameters of a beta distribution representing the observer’s prior belief that any given trial is a change point (see Table 1 in Gallistel et al., 2014). The existence of these latent variables, in combination with the fact that the model predictions are not independent across trials, makes evaluation of the likelihood function computationally prohibitive.

To circumvent this problem, we construct a “custom” likelihood function that captures the main aspects of the likelihood function proper in a computationally tractable way, yet still allows for reliable model comparison, which will be verified by a model recovery analysis (Appendix C).

We believe that there are two important aspects that the likelihood function should capture in order to allow for reliable model fitting and comparison. First, obviously, it should punish models for discrepancies between the predicted slider value and the slider value chosen by the subject. Second, since one of the main differences between the models is when they predict slider updates, it is probably also important that the likelihood function punishes models that predict slider updates on trials where the subject made no update and vice versa. With this in mind, we choose to compute the likelihood of parameters **θ** for model *M* as follows. Let **R**_{subject} denote the vector with subject responses and **O** the vector with observed Bernoulli outcomes. First, we compute the model’s predicted response vector **R**_{M}. Assuming for the moment that there is no threshold noise, **R**_{M} is a deterministic function of **θ** and **O** for all models that we consider here. We can obtain **R**_{M} efficiently using a forward simulation of the model, feeding it with **O** while fixing the parameters to **θ**. After obtaining **R**_{M}, we compute the probability of the subject response on each trial *t* as follows,

$$
p\left(R_{\text{subject},t} \mid R_{M,t}\right) = (1-\lambda)\,\tilde{p}_t + \lambda, \qquad
\tilde{p}_t =
\begin{cases}
0 & \text{if } R_{M,t} \neq R_{M,t-1} \text{ and } R_{\text{subject},t} = R_{\text{subject},t-1} \\
0 & \text{if } R_{M,t} = R_{M,t-1} \text{ and } R_{\text{subject},t} \neq R_{\text{subject},t-1} \\
N\left(R_{\text{subject},t};\; R_{M,t},\; \sigma_{\text{unexplained}}\right) & \text{otherwise}
\end{cases}
\tag{5}
$$

where *N*(*x*; *μ*, *σ*) is a normal distribution with mean *μ* and standard deviation *σ*, evaluated at point *x*. This function strongly punishes models that predict an update when the subject did not make an update (first line of last expression in Eq. (5)) or vice versa (second line). If, on the other hand, the updating behaviour is consistent between model and subject (third line), the probability of the subject response is measured as a draw from a normal distribution centred on the response predicted by the model. This normal distribution can be thought of as a way to capture variance in the data that is left unexplained by the model: the better the model, the smaller the estimate of *σ*_{unexplained}. Part of this variance could be due to variability in motor responses, but there may be other sources too. To avoid log likelihoods equal to negative infinity, we assume in each model that the observer sometimes produces a random response drawn from a uniform distribution on [0,1]. We fix the rate of such random responses to 1 in 1,000 trials (*λ* = 0.001 in Eq. (5)).
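As a concrete sketch, the per-trial probability described above might be computed along the following lines; the handling of mismatched updates and the lapse constant are assumptions based on the verbal description, not the authors’ actual code:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def response_likelihood(r_subj, r_subj_prev, r_model, r_model_prev,
                        sigma_unexplained, lapse=0.001):
    """Per-trial probability of the subject's slider response.

    An 'update' means the response differs from the previous trial.
    Mismatched updating (model updates while the subject does not, or
    vice versa) is punished by assigning probability zero before the
    lapse term; the lapse term (a uniform density on [0, 1]) keeps the
    log likelihood finite.
    """
    model_updates = r_model != r_model_prev
    subject_updates = r_subj != r_subj_prev
    if model_updates != subject_updates:
        p = 0.0                                   # inconsistent updating
    else:
        p = normal_pdf(r_subj, r_model, sigma_unexplained)
    return (1 - lapse) * p + lapse * 1.0          # uniform density on [0,1] equals 1
```

A consistent update close to the model’s prediction yields a density well above the lapse floor, whereas any mismatch in updating collapses the probability to the floor of 0.001.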

So far, we have assumed fixed thresholds in our construction of the likelihood function. However, all models that we consider here have a variable threshold, which makes the predictions non-deterministic: for a fixed set of parameters **θ** and input vector **O**, prediction **R**_{M} varies from run to run. To approximate the probability of the subject’s response under a variable response threshold, we average the model prediction over 100 runs. We thus obtain the following custom log likelihood function:

$$\log L_{\text{custom}}(\boldsymbol{\theta}; \mathbf{R}_{\text{subject}}, \mathbf{O}) = \sum_{t=1}^{n} \log \left( \frac{1}{100} \sum_{k=1}^{100} p\left(R_{\text{subject},t} \mid R^{(k)}_{M,t}\right) \right),$$

where $R^{(k)}_{M,t}$ is the model’s predicted response on trial *t* in the *k*-th run and *p*(*R*_{subject,t} | *R*_{M,t}) is as specified in Eq. (5).

## APPENDIX C Model recovery

We created a group of five synthetic data sets from each of the twelve models with threshold noise, giving a total of sixty synthetic datasets. Next, we used maximum-likelihood estimation to fit the twelve main models twenty times to each dataset. For each fit, we computed the Akaike Information Criterion (AIC; Akaike, 1974). At the level of individual data sets, AIC-based model comparison picks out the correct model in forty-six of the sixty cases (Figure C, panel A). In the remaining fourteen cases, the mistake concerned the second modelling factor, i.e., the threshold mechanism. This indicates that at the individual level, our methods are adequate for selecting the right updating mechanism (IIAB, Delta, or M-Avg) but have some difficulty in selecting the right threshold mechanism. At the group level, on the other hand, the correct model was selected in all cases (Figure C, panel B). These results also indicated that the quality of fit improved very little after about ten runs of the optimizer (Figure C, panel C).
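For reference, the AIC penalises the maximised log likelihood by the number of free parameters, and summing AICs across datasets gives the group-level comparison. The sketch below illustrates the bookkeeping; the numbers are made up:

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def group_winner(fits):
    """Pick the model with the lowest summed AIC across datasets.

    `fits` maps model name -> list of (log_likelihood, n_params), one
    entry per dataset. Names and numbers here are purely illustrative.
    """
    totals = {m: sum(aic(ll, k) for ll, k in runs) for m, runs in fits.items()}
    return min(totals, key=totals.get), totals

# toy example: model B fits better on both datasets despite one extra parameter
fits = {
    "A": [(-1200.0, 3), (-1150.0, 3)],
    "B": [(-1180.0, 4), (-1130.0, 4)],
}
winner, totals = group_winner(fits)
```

Note that summing AICs across datasets is equivalent to comparing the models on the pooled log likelihood with a pooled parameter count, which is why group-level selection can be more reliable than dataset-level selection.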

## APPENDIX D Cross validation results

In our main analysis, we fitted the models to only the first 750 trials in each dataset. Model comparison based on the log likelihood of the remaining trials (Figure D) is largely consistent with the AIC-based results (Figure 3).
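The procedure amounts to a simple out-of-sample evaluation. The sketch below uses a stationary Bernoulli model as a stand-in for the actual models, purely to illustrate the train/test split:

```python
import numpy as np

def cross_validated_ll(fit_model, trial_ll, data, n_train=750):
    """Fit on the first n_train trials, return summed log likelihood of the rest."""
    train, test = data[:n_train], data[n_train:]
    params = fit_model(train)
    return sum(trial_ll(params, o) for o in test)

# toy stand-in model: a stationary Bernoulli probability, fit by its ML estimate
def fit_bernoulli(train):
    return min(max(np.mean(train), 0.001), 0.999)  # clip to avoid log(0)

def bernoulli_ll(p, outcome):
    return np.log(p) if outcome == 1 else np.log(1 - p)

data = np.array([1, 0] * 500)  # 1,000 alternating outcomes
held_out_ll = cross_validated_ll(fit_bernoulli, bernoulli_ll, data)
```

Because the held-out trials are never used for fitting, models with more flexibility gain no automatic advantage, which is why this criterion complements the AIC-based comparison.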

## APPENDIX E Maximum-likelihood parameter estimates

## Footnotes

\* Subjects S1, S3, and S4 in the “aperiodic” condition.

† We do not know the exact number of trials in Robinson (1964), but each of his subjects performed the task for about 15 hours, which is a substantial amount of time.

‡ Calculated using 2017 OECD purchasing power parity estimates.

§ This subtlety was not mentioned in the methods of the original study; we only became aware of it when scrutinising the methods of Khaw et al. (2017), who mention it in relation to their own experiment.

\*\* There is one other study using the same paradigm (Robinson, 1964), but no record of its data is known to us.

‡‡ The Jensen-Shannon divergence is a symmetric variant of the Kullback-Leibler divergence and has the advantage that it is always finite, even when one of the inputs is zero.

§§ The authors excluded data from eight subjects “due to failure to perform at least one of the tasks”. We did not receive the data from the excluded subjects, which means that our analysis is limited to the remaining 123 subjects.

\*\*\* We say “less” since a continuous function obviously cannot be inferred from discrete points without some kind of interpolation.