## Abstract

Human decisions are known to be systematically biased. A prominent example of such a bias occurs during the temporal integration of sensory evidence. Previous empirical studies differ in the nature of the bias they observe, ranging from favoring early evidence (primacy) to favoring late evidence (recency). Here, we present a unifying framework that explains these biases and makes novel neurophysiological predictions. By explicitly modeling both the approximate and the hierarchical nature of inference in the brain, we show that temporal biases depend on the balance between “sensory information” and “category information” in the stimulus. Finally, we present new data from a human psychophysics task that confirm that temporal biases can be robustly changed within subjects as predicted by our models.

## Introduction

Imagine a doctor trying to infer the cause of a patient’s symptoms from an x-ray image. Unsure about the evidence in the image, she asks a radiologist for a second opinion. If she tells the radiologist her suspicion, she may bias his report. If she does not, he may not detect a faint diagnostic pattern. If the evidence in the image is hard to detect or ambiguous, the radiologist’s second opinion, and hence the final diagnosis, may be swayed by the doctor’s initial hypothesis. We argue that the brain faces a similar problem during perceptual decision-making: any decision-making area combines sequential signals from sensory brain areas, not directly from sensory input. If those signals themselves reflect inferences that combine both prior expectations and sensory evidence, we suggest that this can then lead to an observable confirmation bias [30].

Formalizing this idea in the context of approximate Bayesian inference requires extending classic evidence-integration models to include an explicit intermediate sensory representation (Figure 1b). We explicitly model the inferences of the intermediate sensory representation and find that task difficulty is modulated by two distinct types of information: the information between the stimulus and sensory representation (sensory information), and the information between sensory representation and category (category information) (Figure 1b). The balance between these distinct types of information can indeed explain puzzling discrepancies in the literature with regard to how subjects weigh evidence over time across a wide range of studies. Even in tasks where all evidence is equally informative about the correct category, existing studies typically report one of three distinct motifs: some find that early evidence is weighted more strongly (a primacy effect) [23, 31], some that information is weighted equally over time (as would be optimal) [47, 9, 37], and some find late evidence being weighted most heavily (a recency effect) [12] (Figure 1a,c). There are myriad differences between these studies such as subject species, sensory modality, stimulus parameters, and computational frameworks [23, 9, 16, 12]. However, none of these aspects can explain their different findings. We show that the differences arise naturally in a hierarchical approximate inference framework.

## Results

### “Sensory Information” vs “Category Information”

Normative models of decision-making in the brain are typically based on the idea of an *ideal observer*, who uses Bayes’ rule to infer the most likely category on each trial given the stimulus. On each trial in a typical task, the stimulus consists of multiple “frames” presented in rapid succession. (By “frames” we refer to discrete independent draws of stimulus values that are not necessarily visual.) If the evidence in each frame, *e*_{f}, is independent and the categorical identity of the stimulus is a binary variable *C* ∈ {−1, +1}, then the evidence in favor of *C* = +1 after *F* independent frames is $\sum_{f=1}^{F} \log \frac{p(e_f \mid C=+1)}{p(e_f \mid C=-1)}$. The ideal observer reports the most likely category, for instance by reporting the sign of log p(*C* = +1|*e*_{1}, …, *e*_{F}) − log p(*C* = −1|*e*_{1}, …, *e*_{F}).

The ideal observer’s performance is limited only by (i) the information about *C* available on each frame, *p*(*e*_{f}|*C*), and (ii) the number of frames per trial. In the brain, however, a decision-making area computing a belief about the correct choice only has access to the sensory representation of the stimulus, which we call *x*, not to the outside stimulus *e* directly. For example, in a visual task each *e*_{f} would be the image on the screen while inferences about *x*_{f} are represented by the concurrent activity of relevant neurons in visual cortex. This implies that the information between the stimulus and category can be partitioned into the information between the stimulus and the sensory representation, and the information between sensory representation and category, which we call “sensory information” and “category information,” respectively (Figure 1b). These two kinds of information span a two-dimensional space, with a task being defined by a single point (Figure 1c).
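As a minimal illustration of the ideal-observer computation above, the following sketch (our own, not the paper’s code) assumes hypothetical Gaussian frame evidence p(*e*_{f}|*C*) = N(*C*·μ, σ²), accumulates per-frame log-likelihood ratios, and reports the sign of the sum:

```python
import numpy as np

# Ideal-observer sketch with assumed Gaussian evidence per frame:
# e_f ~ N(C * mu, sigma^2), C in {-1, +1}. The per-frame log-likelihood
# ratio is log N(e; mu, sigma) - log N(e; -mu, sigma) = 2 * mu * e / sigma^2,
# so the observer reports the sign of the summed LLRs (flat prior).

rng = np.random.default_rng(0)
mu, sigma, n_frames, n_trials = 0.3, 1.0, 10, 20000

C = rng.choice([-1, 1], size=n_trials)                 # true category per trial
e = C[:, None] * mu + sigma * rng.standard_normal((n_trials, n_frames))

llr_per_frame = 2 * mu * e / sigma**2                  # log p(e|C=+1) - log p(e|C=-1)
choice = np.sign(llr_per_frame.sum(axis=1))            # sign of log posterior odds

accuracy = np.mean(choice == C)
```

Performance is limited exactly as described: by the per-frame information (μ/σ) and the number of frames.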

To illustrate this difference, consider the classic dot motion task [29] and the Poisson clicks task [9], which occupy opposite locations in the space spanned by sensory and category information. In the classic low-coherence dot motion task, subjects view a cloud of moving dots, some percentage of which move “coherently” in one direction. Here, sensory information is low since evidence about the net motion direction at any time is weak. Category information, on the other hand, is high, since knowing the “true” motion on a single frame would be highly predictive of the correct choice (and of motion on subsequent frames). In the Poisson clicks task, subjects hear a random sequence of clicks in each ear and must report the side with the higher rate. Here, sensory information is high since each click is well above sensory thresholds, but category information is low since knowing the side on which a single click was presented provides only little information about the correct choice (and the side of the other clicks). Another way to think about category information is as “temporal coherence” of the stimulus: the more each frame of evidence is predictive of the correct choice, the more the frames must be predictive of each other, whether a frame consists of visual dots or of auditory clicks. Note that our distinction between sensory and category information is different from the well-studied distinction between internal and external noise; in general, both internal and external noise will reduce the amount of sensory and category information.

If we assume that the sensory representation – which itself is an inference about the actual stimulus – incorporates prior expectations [25, 18, 40], then, as we show below, approximate inference models predict that this will lead to a primacy effect when sensory information is low and category information is high, but not when sensory information is high and category information is low. Indeed, a qualitative placement of prior studies in the space spanned by these two kinds of information demonstrates that studies that find early weighting are located in the upper left quadrant (low-sensory/high-category or LSHC) and studies with equal or late weighting in the lower right quadrant (high-sensory/low-category or HSLC) (Figure 1c). This suggests that the different trade-off between sensory information and category information may indeed underlie differences in temporal weighting seen in previous studies. Further, with this framework it is straightforward to predict how simple changes in stimulus statistics of previous studies should change the temporal weighting they find (Supplemental Table S1).

### Visual Discrimination Task

To test this critical model prediction, we designed a visual discrimination task with two stimulus conditions that correspond to the two opposite sides of this task space, while keeping all other aspects of the design the same (Figure 2a). If our theory is correct, then we should be able to change individual subjects’ temporal weighting strategy simply by changing the sensory-category information trade-off.

The stimulus in our task consisted of a sequence of ten visual frames (83ms each). Each frame consisted of band-pass-filtered white noise with excess orientation power either in the −45° or the +45° orientation [1] (Figure 2b,d). On each trial, there was a single true orientation category, but individual frames might differ. At the end of each trial, subjects reported whether the stimulus was oriented predominantly in the −45° or the +45° orientation. The stimulus was presented as an annulus around the fixation marker in order to minimize the effect of small fixational eye movements (Materials and Methods).

If the brain represents the orientation in each frame, then sensory information in our task is determined by how well each frame determines the orientation of that frame (i.e. the amount of “noise” in each frame), and category information is determined by the probability that any given frame’s orientation matches the trial’s category. For a ratio of 5 : 5, a frame’s orientation does not predict the correct choice and category information is zero. For a ratio of 10 : 0, knowledge of the orientation of a single frame is sufficient to determine the correct choice and category information is high. For a more detailed discussion, see Supplementary Text.
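To make the two information axes concrete, here is a toy generative model of the stimulus in the spirit of the description above (our own sketch, not the experiment code; `p_match` and `frame_noise` are illustrative stand-ins for category and sensory information):

```python
import numpy as np

# Toy two-level stimulus model: a trial category C, frame orientations that
# match C with probability p_match (category information), and noisy frame
# evidence whose reliability is set by frame_noise (sensory information).

def make_trial(rng, p_match=0.9, frame_noise=1.0, n_frames=10):
    C = rng.choice([-1, 1])                            # trial category
    match = rng.random(n_frames) < p_match             # does each frame match C?
    frame_orientation = np.where(match, C, -C)         # +-45 deg coded as +-1
    e = frame_orientation + frame_noise * rng.standard_normal(n_frames)
    return C, frame_orientation, e

rng = np.random.default_rng(1)
# LSHC-like: frames almost always match the category, but each is noisy.
C1, ori1, e1 = make_trial(rng, p_match=0.9, frame_noise=2.0)
# HSLC-like: frames are nearly noise-free, but only weakly match the category.
C2, ori2, e2 = make_trial(rng, p_match=0.6, frame_noise=0.1)
```

A 5:5 frame ratio corresponds to `p_match = 0.5` (zero category information) and 10:0 to `p_match = 1`.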

Using this stimulus, we tested 12 human subjects (9 naive and 3 authors), comparing two conditions intended to probe the difference between the LSHC and HSLC regimes. Starting with both high sensory and high category information, we either ran a staircase lowering the sensory information while keeping category information high, or we ran a staircase lowering category information while keeping sensory information high (Figure 2a). These are the LSHC and HSLC conditions, respectively (Figure 2b,d). For each condition, we used logistic regression to infer, for each subject, the influence of each frame on their choice. Subjects’ overall performance was matched in the two conditions by defining threshold performance as 70% correct (Materials and Methods).

In agreement with our hypothesis, we find predominantly flat or decreasing temporal weights when sensory information is low and category information is high (Figure 3a). When the information is partitioned differently – into high sensory and low category information – we find flat or increasing weights (Figure 3b). Despite variability between subjects in each condition, a within-subject comparison revealed that the change in slope between the two conditions was as predicted for nearly all subjects (Figure 3c,d) (*p* < 0.05 for 9 of 12 subjects, bootstrap). This demonstrates that the trade-off between sensory and category information in a task robustly changes subjects’ temporal weighting strategy as we predicted, and further suggests that the sensory-category information trade-off may resolve the discrepant results in the literature.

### Approximate inference models

We will now show that these significant changes in evidence weighting for different stimulus statistics arise naturally in common models of how the brain might implement approximate inference. In particular, we show that both a neural sampling-based approximation [20, 13, 18, 34] and a parametric (mean-field) approximation [2, 36] to exact inference can explain the observed pattern of changing temporal weights as a function of stimulus statistics.

The crucial assumption in both models is that the brain computes a posterior belief over both *C* and *x* given the external evidence, i.e. p(*x, C*|*e*), not just over the categorical variable *C*. This assumption differs from some models of approximate inference in the brain that assume populations of sensory neurons strictly encode the *likelihood* of the stimulus [26], but is consistent with other models from both sampling and parametric families [4, 18, 36, 40].

In our models, the brain’s belief about *x* depends both on the external evidence, *e*, via the likelihood, and on the brain’s current belief about *C*, via the prior. For a decision-making area in the brain to update its belief about *C* based on current sensory responses, it needs to account for, or “subtract out”, its influence on those sensory responses. Failure to do so will result in “double-counting” evidence presented early in the trial, inducing a positive feedback loop between the sensory area and the decision-making area (Figure 4a). The stronger the decision-making area’s belief in a particular choice, the more likely the sensory representation of *x* will concur with that belief through the influence of the prior. We call this feedback loop a “perceptual confirmation bias.”

Importantly, the strength of this confirmation bias depends on the relative amount of sensory and category information in the stimulus (Figure 4a). It is weakest when the posterior over *x* is dominated by the likelihood, a case that occurs when the category information is much weaker than the sensory information. Conversely, the feedback loop is strongest when the category information is high compared to the sensory information, as assumed in [18] who found a primacy effect in their model.

To demonstrate and quantify the intuitions laid out above, we implemented approximate online inference (where “online” means observing a single frame at a time) for a discrimination task using two previously proposed frameworks for how inference might be implemented in neural circuits: neural sampling [20, 13, 18, 34] and mean field variational inference [2, 36] (Figure 4).

As in the Sequential Probability Ratio Test [17], both models recursively compute a running estimate of the log posterior odds (LPO) of the latent category *C*:

$$\mathrm{LPO}_f = \mathrm{LPO}_{f-1} + \mathrm{LLO}_f, \qquad (1)$$

where LPO_{0} is the log prior odds and LLO_{f} is the log likelihood odds for frame *f*, i.e. the change in belief about *C* implied by the new evidence *e*_{f}.
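Equation (1) is a one-line update; the sketch below (ours, with made-up LLO values) makes the “online” property explicit and checks that the final LPO equals the batch sum of log likelihood odds:

```python
import numpy as np

# Exact online integration per equation (1): a running sum of per-frame
# log likelihood odds (LLO), initialized at the log prior odds.

def integrate_online(llo_per_frame, log_prior_odds=0.0):
    lpo = log_prior_odds
    trajectory = []
    for llo in llo_per_frame:          # one update per frame
        lpo = lpo + llo                # LPO_f = LPO_{f-1} + LLO_f
        trajectory.append(lpo)
    return np.array(trajectory)

llo = np.array([0.4, -0.1, 0.3, 0.2])  # illustrative per-frame log odds
traj = integrate_online(llo)
# Exact online integration is unbiased and weighs every frame equally:
# the final LPO is simply the sum of all LLO terms.
```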

Equation (1) describes exact, unbiased online evidence-integration. It is “online” because it describes an *update* to a running estimate of LPO each frame. The bias in our models comes from two additional assumptions: first, that sensory areas of the brain represent the *posterior* p(*x*_{f}|*e*_{1}, …, *e*_{f}) over *x*_{f} given all evidence so far rather than its *likelihood* p(*e*_{f}|*x*_{f}), and second, that they do so only approximately.

### Sampling model

The neural sampling hypothesis states that variable neural activity over brief time periods can be interpreted as a sequence of samples from the brain’s posterior over *x*. In our model, the prior belief about *C* biases the distribution from which samples are generated (Materials and Methods). The canonical way to compute an expectation with respect to one distribution (the likelihood) using samples from another (the posterior) is ‘importance sampling’, which weights each sample so as to “subtract out” the prior as described above. While this approach is unbiased in the limit of infinitely many samples, it incurs a bias for a finite number – the relevant regime for the brain. The bias is such that it under-corrects for the prior that has been fed back, resulting in a positive feedback loop. Figure 4b and Supplemental Figure S5a-c show performance for the ideal observer and for the resulting sampling model, respectively, across all combinations of sensory and category information. White lines show threshold performance (70% correct) as in Figure 1c. This model reproduces both the primacy effect seen in previous studies and the way temporal weighting changes as the stimulus information changes. Importantly, it predicted the same within-subject change seen in our data [18]. However, double-counting the prior alone cannot explain recency effects (Supplemental Figure S5a-c,j-l).
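The finite-sample bias of self-normalized importance sampling can be demonstrated in a few lines. In this toy example (ours, not the paper’s model), the “posterior” proposal q = N(1, 1) is shifted by a prior relative to the “likelihood” target p = N(0, 1); with few samples, the importance-weighted estimate under-corrects for the prior:

```python
import numpy as np

# Self-normalized importance sampling (SNIS) of E_p[x] = 0, where p = N(0, 1)
# is the likelihood-only target, using samples from a prior-biased
# "posterior" q = N(1, 1). Weights p/q should subtract the prior back out,
# but for finite sample counts the estimate is biased toward q.

def snis_estimate(rng, n_samples):
    x = rng.standard_normal(n_samples) + 1.0        # samples from q = N(1, 1)
    log_w = -0.5 * x**2 + 0.5 * (x - 1.0)**2        # log p(x) - log q(x)
    w = np.exp(log_w - log_w.max())                 # normalization constant cancels
    return np.sum(w * x) / np.sum(w)

rng = np.random.default_rng(0)
reps = 20000
bias_small = np.mean([snis_estimate(rng, 2) for _ in range(reps)])
bias_large = np.mean([snis_estimate(rng, 200) for _ in range(reps)])
# With few samples the estimate is pulled toward the biased proposal
# (here, toward +1): the prior is under-corrected, seeding the feedback loop.
```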

There are two simple and biologically-plausible explanations for the observed recency effect. First, the brain may try to actively compensate for the prior influence on the sensory representation by subtracting out an estimate of that influence. That is, the brain could do approximate bias correction to mitigate the effect of the confirmation bias. In particular, we modeled linear bias correction by explicitly subtracting out a fraction of the running posterior odds at each step:

$$\mathrm{LPO}_f = (1-\gamma)\,\mathrm{LPO}_{f-1} + \widehat{\mathrm{LLO}}_f, \qquad (2)$$

where 0 ≤ *γ* ≤ 1 and $\widehat{\mathrm{LLO}}_f$ is the model’s (biased) estimate of the log likelihood odds. Second, the brain may assume a non-stationary environment, i.e. that *C* is not constant over a trial. Interestingly, Glaze et al. (2015) showed that optimal inference in this case implies equation (2), which can be interpreted as a noiseless, discrete-time version of the classic drift-diffusion model [17] with *γ* as a leak parameter.

Incorporating equation (2) into our model reduces the primacy effect in the upper left of the task space and leads to a recency effect in the lower right (Figure 4c-e, Supplemental Figure S5), as seen in the data.
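The recency effect follows directly from the algebra of equation (2): unrolling the recursion shows that frame *f* enters the final log posterior odds with weight (1 − *γ*)^(F − f), so late frames dominate. A quick numerical check (our sketch):

```python
import numpy as np

# Leaky integration per equation (2). The effective weight of frame f on the
# final LPO is the derivative of the final LPO w.r.t. that frame's LLO,
# which unrolling predicts to be (1 - gamma)^(F - 1 - f) for 0-indexed f.

def leaky_integrate(llo_per_frame, gamma):
    lpo = 0.0
    for llo in llo_per_frame:
        lpo = (1.0 - gamma) * lpo + llo    # equation (2), unbiased LLO for simplicity
    return lpo

F, gamma = 10, 0.2
# Probe each frame with a unit LLO to read off its effective weight.
weights = np.array([leaky_integrate(np.eye(F)[f], gamma) for f in range(F)])
predicted = (1.0 - gamma) ** (F - 1 - np.arange(F))
# weights increase with f: a recency effect, growing with the leak gamma.
```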

### Variational model

The second major class of models for how probabilistic inference may be implemented in the brain – based on mean-field parametric representations [26, 2] – behaves similarly. These models commonly assume that distributions are encoded *parametrically* in the brain, but that the brain explicitly accounts for dependencies only between subsets of variables, e.g. within the same cortical area [36]. We therefore make the assumption that the joint posterior p(*x*, *C*|*e*) is approximated in the brain by a product of parametric distributions, q(*x*)q(*C*) [2, 36]. Inference proceeds by iteratively minimizing the Kullback-Leibler divergence between q(*x*)q(*C*) and p(*x*, *C*|*e*) (Materials and Methods). As in the sampling model, the running estimate of the category *C* acts as a prior over *x*. Because this model is unable to explicitly represent posterior dependencies between sensory and decision variables, it is forced to commit early either to both *x* and *C* being positive or to both being negative. This yields the same behavior as the sampling model: a transition from primacy to flat weights as category information decreases, with recency effects emerging only when approximate bias correction is added (Figure 4f-h, Supplemental Figure S5j-r). Whereas the limited number of samples was the key deviation from optimality in the sampling model, here it is the assumption that the brain represents its beliefs about *x* and *C* separately, in a factorized form, and that its instantaneous belief about *x* is unimodal.
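The early-commitment behavior can be seen in a minimal discrete example (ours, not the paper’s model): for binary *x* and *C* with a coupled posterior p(*x*, *C*|*e*) ∝ exp(J·x·C + h·x), coordinate ascent on a factorized q(*x*)q(*C*) ends up far more confident about *C* than the true marginal warrants (J and h are illustrative parameters):

```python
import numpy as np

# Mean-field toy: x, C in {-1, +1}, posterior p(x, C | e) ∝ exp(J*x*C + h*x).
# J plays the role of category information (x-C coupling), h of the evidence.

J, h = 2.0, 0.5

# True marginal p(C = +1 | e) by exact enumeration of the 2x2 joint.
def joint(x, c):
    return np.exp(J * x * c + h * x)

Z = sum(joint(x, c) for x in (-1, 1) for c in (-1, 1))
p_C_true = sum(joint(x, 1) for x in (-1, 1)) / Z

# Coordinate ascent on the factorized means m_x = E_q[x], m_C = E_q[C].
m_x, m_C = 0.0, 0.0
for _ in range(50):
    m_x = np.tanh(J * m_C + h)     # update q(x) given q(C)
    m_C = np.tanh(J * m_x)         # update q(C) given q(x)

q_C = (1.0 + m_C) / 2.0            # mean-field belief that C = +1
# q_C greatly exceeds p_C_true: the factorized posterior commits to one mode.
```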

### Predictions for Neurophysiology

Both the sampling and variational models induce a confirmation bias by creating an “attractor” dynamic between different levels of the cortical hierarchy – the decision-making area and the relevant sensory areas. Our model therefore makes a number of novel and testable neurophysiological predictions.

First, our model predicts that both “choice probabilities” [7, 10] and “differential correlations” [27] in populations of task-relevant sensory neurons will be stronger in contexts where category information is high and sensory information is low, i.e. when subjects exhibit primacy effects [46, 18]. This is because the feedback from the decision-making to sensory areas in our model explicitly biases the sensory representation *in the direction that encodes the stimulus strength*, which is the *f*′–direction [40, 24]. Our model is thus consistent with recent evidence that noise correlations are dominated by a task-dependent component in the *f*′ direction [5].

Second, our model predicts that apparent attractor-dynamics measured in both sensory and decision-making areas are in fact driven by inter- rather than within-area dynamics, and will depend on the decision-making context. In particular, categorization tasks should induce a stronger confirmation bias, and hence stronger attractor-like dynamics, than equivalent estimation tasks, as was recently reported [41]. This observation, as well as our above prediction, contrasts with classic attractor models which posit a recurrent feedback loop *within* a decision-making area [45, 46].

## Discussion

We have shown that online inference in a hierarchical model can result in characteristic task-dependent temporal biases, and further that such biases are unavoidable in two specific families of biologically-plausible approximate inference algorithms. Explicitly modeling the mediating sensory representation allowed us to partition the information in the stimulus about the category into two parts – “sensory information” and “category information” – defining a novel two-dimensional space of possible tasks. We explicitly probed this space using a standard visual discrimination task and showed that individual subjects’ temporal biases change as predicted by our model. We argue that the discrepancy in biases reported by previous studies is resolved by considering how their tasks trade off sensory and category information.

The “confirmation bias” emerges in our model as the result of two key assumptions: (1) intermediary sensory areas represent posterior beliefs and as a result incorporate expectations fed back from decision areas, and (2) those expectations cannot be completely accounted for in subsequent updates to the decision-variable.

Our work is in line with converging evidence that populations of sensory neurons encode posterior distributions of corresponding sensory variables [25, 49, 4, 2, 40, 41, 34, 18] incorporating dynamic prior beliefs via feedback connections [25, 49, 2, 33, 40, 41, 18, 24], which contrasts with other probabilistic theories in which only the likelihood is represented [26, 3, 35, 44].

Optimal inference often requires computing the likelihood of each data point independently. However, in the streaming setting – when data are encountered sequentially – an agent must simultaneously make accurate judgments of current data (based on the current posterior) and make adjustments to a model of long-term trends (based on all likelihoods). Basing model updates on posteriors rather than likelihoods will further entrench existing biases [15, 50].

One recently proposed solution to the general problem of online updating is to make two separate inferences about each data point: one using the current prior for decision-making, and one using a “counter-factual” flat prior for model updates (i.e. using the likelihood) [50]. Since the likelihood can in principle be recovered by dividing the posterior by the prior (or “subtracting out” the log prior), our models can be seen as approximations to recovering the likelihood without explicitly making a second “counterfactual” inference.

In line with our model, it has been proposed that post-decision feedback biases subsequent perceptual estimations [39, 42]. While similar in spirit to our confirmation bias model, there are two conceptual differences between these models and our own: First, the feedback from decision area to sensory area in our model is both continuous and online, rather than conditioned on a single choice after a decision is made. Second, our models are derived from an ideal observer and only incur bias due to approximations, while previously proposed “self-consistency” biases are not normative and require separate justification.

Alternative models have been previously proposed to explain primacy and recency effects in evidence accumulation. Kiani et al. (2008) suggested that an integration-to-bound process is more likely to ignore later evidence even when task-relevant stimuli are of a fixed duration [23]. Deneve (2012) showed that simultaneous inference about stimulus strength and choice in tasks with trials of variable difficulty can lead to either a primacy or a recency effect [11]. However, both models of evidence integration are based entirely on total information per frame (i.e. *p*(*C*|*e*_{f})) and hence cannot explain the difference between the data for the LSHC and the HSLC conditions, since both conditions are matched in terms of total information. In general, *any* model based only on *p*(*C*|*e*_{f}) cannot explain the pattern in our data. While such a model can coexist with the confirmation bias dynamic proposed by our model, it is not sufficient on its own, since the trade-off between sensory and category information is crucial.

It has also been proposed that primacy effects could be the result of near-perfect integration of an adapting sensory population [46, 48]. For an analogous mechanism to explain recency effects, however, the sensory population would need to become *less* adapted over time in our HSLC condition. While these circuit dynamics of sensory neurons could in principle explain our behavioral results, this would make different predictions both for the task dependence of the dynamics of sensory populations [41] and for the origin and prevalence of differential correlations [5], both of which are consistent with our model, as described above.

Models of “leaky” evidence accumulation are known to result in recency effects [43, 23, 9, 16]. Interestingly, leaky evidence accumulation has also been shown to be optimal in non-stationary environments [16] and could thus in principle indicate that subjects assume such non-stationarity in our HSLC condition. However, leak alone cannot account for the presence of primacy effects in the LSHC condition. While our data are well-explained by a consistent leak parameter *γ* across conditions (Figure 4), it remains to be seen whether subjects’ apparent evidence-weighting strategies would approach the optimal strategy given further experience in each condition (i.e. *γ* may be learned slowly).

In the brain, decisions are not based directly on external evidence but on intermediate representations. If those intermediate representations themselves in part reflect prior beliefs, and if inference in the brain is approximate, then this is likely to result in a bias. The nature of this bias is directly related to the integration of internal “top-down” beliefs and external “bottom-up” evidence previously implicated in clinical dysfunctions of perception [21]. Importantly, we have shown how the strength of this effect depends on the nature of the information in the task in a way that may generalize to cognitive contexts where the confirmation bias is typically studied. For instance, our model makes predictions about when beliefs will be updated in line with the presented evidence, and when they will paradoxically be updated in contradiction to the presented evidence [22]. Finally, the differential effect of sensory and category information may be useful in diagnosing clinical conditions that have been hypothesized to be related to abnormal integration of sensory information with internal expectations [14].

## Materials and Methods

### Visual Discrimination Task

We recruited students at the University of Rochester as subjects in our study. All were compensated for their time, and methods were approved by the Research Subjects Review Board. We found no difference between naive subjects and authors, so all main-text analyses are combined, with data points belonging to authors and naive subjects indicated in Figure 3d.

Our stimulus consisted of ten frames of band-pass filtered noise [1, 32] masked by a soft-edged annulus, leaving a “hole” in the center for a small cross on which subjects fixated. The stimulus subtended 2.6 degrees of visual angle around fixation. Stimuli were presented using Matlab and Psychtoolbox on a 1920×1080px 120 Hz monitor with gamma-corrected luminance [6]. Subjects kept a constant viewing distance of 36 inches using a chin-rest. Each trial began with a 200ms “start” cue consisting of a black ring around the location of the upcoming stimulus. Each frame lasted 83.3ms (12 frames per second). The last frame was followed by a single double-contrast noise mask with no orientation energy. Subjects then had a maximum of 1s to respond, or the trial was discarded (Supplemental Figure S1). The stimulus was designed to minimize the effects of small fixational eye movements: (i) small eye movements do not provide more information about either orientation, and (ii) each 83ms frame was too fast for subjects to make multiple fixations on a single frame.

The stimulus was constructed from white noise that was then masked by a kernel in the Fourier domain to include energy at a range of orientations and spatial frequencies but random phases [1, 32, 5] (a complete description and parameters can be found in the Supplemental Text). We manipulated sensory information by broadening or narrowing the distribution of orientations present in each frame, centered on either +45° or −45° depending on the chosen orientation of each frame. We manipulated category information by changing the proportion of frames that matched the orientation chosen for that trial. The range of spatial frequencies was kept constant for all subjects and in all conditions.

Trials were presented in blocks of 100, with typically 8 blocks per session (about 1 hour). Each session consisted of blocks of only HSLC or only LSHC trials (Figure 2). Subjects completed between 1500 and 4400 trials in the LSHC condition, and between 1500 and 3200 trials in the HSLC condition. After each block, subjects were given an optional break and the staircase was reset to *κ* = 0.8 and *p*_{match} = 0.9. *p*_{match} is defined as the probability that a single frame matched the category for a given trial. Psychometric curves were fit to the concatenation of all trials from all sessions using the Psignifit Matlab package [38], and temporal weights were fit to all trials below each subject’s threshold, separately in each condition.

### Low Sensory-, High Category-Information (LSHC) Condition

In the LSHC condition, a continuous 2-to-1 staircase on *κ* was used to keep subjects near threshold (*κ* was incremented after each incorrect response, and decremented after two correct responses in a row). *p*_{match} was fixed to 0.9. On average, subjects had a threshold (defined as 70% correct) of *κ* = 0.17 ± 0.07 (1 standard deviation).
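For concreteness, a continuous 2-to-1 staircase of this kind can be sketched as follows (our own illustration; the simulated observer’s psychometric function and the step size are hypothetical, not the values used in the experiment):

```python
import numpy as np

# 2-to-1 staircase on kappa: increment after each error (easier), decrement
# after two correct responses in a row (harder). Such a rule converges near
# the ~71%-correct point of the psychometric function.

def p_correct(kappa):
    # Hypothetical psychometric function for a simulated observer.
    return 1.0 - 0.5 * np.exp(-kappa / 0.3)

rng = np.random.default_rng(0)
kappa, step, streak = 0.8, 0.02, 0
history = []
for _ in range(5000):
    correct = rng.random() < p_correct(kappa)
    if correct:
        streak += 1
        if streak == 2:                        # two correct in a row
            kappa = max(kappa - step, 0.0)
            streak = 0
    else:                                      # one error
        kappa += step
        streak = 0
    history.append(kappa)

kappa_converged = np.mean(history[2000:])      # average after burn-in
```

For this assumed observer, the ~70.7%-correct point sits at κ ≈ 0.16, and the staircase settles around it.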

### High Sensory-, Low Category-Information (HSLC) Condition

In the HSLC condition, the staircase acted on *p*_{match} while keeping *κ* fixed at 0.8. Although *p*_{match} is a continuous parameter, subjects always saw 10 discrete frames, hence the true ratio of frames ranged from 5:5 to 10:0 on any given trial. Subjects were on average 69.5% ± 4.7% (1 standard deviation) correct when the ratio of frame types was 6:4, after adjusting for individual biases in the 5:5 case. Regression of temporal weights was done on all 6:4 and 5:5 ratio trials for all subjects.

### Logistic Regression of Temporal Weights

We constructed a matrix of per-frame signal strengths **S** on sub-threshold trials by measuring the empirical signal level in each frame. This was done by taking the dot product of the Fourier-domain energy of each frame as it was displayed on the screen (that is, including the annulus mask applied in pixel space) with a difference of Fourier-domain kernels at +45° and −45°. This gives a scalar value per frame that is positive when the stimulus contained more +45° energy and negative when it contained more −45° energy. Signals were z-scored before performing logistic regression, and weights were normalized to have a mean of 1 after fitting.
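A minimal version of this regression step (ours, using simulated signals and choices rather than real data) recovers per-frame weights by maximum-likelihood logistic regression:

```python
import numpy as np

# Simulate choices from known per-frame weights, then recover the weights by
# logistic regression fit with full-batch gradient ascent on the
# log-likelihood. true_w here is an illustrative primacy-like profile.

rng = np.random.default_rng(0)
n_trials, n_frames = 20000, 10
true_w = np.linspace(1.5, 0.5, n_frames)            # early frames weighted more

S = rng.standard_normal((n_trials, n_frames))       # z-scored per-frame signals
p_choice = 1.0 / (1.0 + np.exp(-S @ true_w))
choice = (rng.random(n_trials) < p_choice).astype(float)

w = np.zeros(n_frames)
for _ in range(500):                                # gradient ascent
    pred = 1.0 / (1.0 + np.exp(-S @ w))
    w += 0.5 * S.T @ (choice - pred) / n_trials

w_normalized = w / w.mean()                         # we plot shape, not magnitude
```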

Temporal weights were first fit using regularized logistic regression with two types of regularization: an AR0 (ridge) prior and an AR2 (curvature penalty) prior. We did not use an AR1 prior, since it would bias the slope of the recovered weights, which is central to our analysis.

To visualize regularized weights in Figure 3, the ridge and AR2 hyperparameters were chosen using 10-fold cross-validation for each subject; the optimal hyperparameters were then averaged across subjects for each task condition. This cross-validation procedure was used only for display purposes for individual subjects in Figure 3a-c of the main text, while the linear and exponential fits (described below) were used for statistical comparisons. Supplemental Figure S4 shows individual subjects’ weights with no regularization.

We used two methods to quantify the shape (or slope) of **w**: constraining **w** to be either an exponential or a linear function of time, but otherwise optimizing the same maximum-likelihood objective as logistic regression. Cross-validation suggests that both of these methods perform similarly to the unregularized and regularized logistic regression defined above, with insignificant differences (Supplemental Figure S3). The exponential is defined as

*w _{f}* = *α* exp(*βf*), (3)

where *f* refers to the frame number. *β* gives an estimate of the shape of the weights **w** over time, while *α* controls their overall magnitude. *β* > 0 corresponds to recency and *β* < 0 to primacy. The *β* parameter is reported for human subjects in Figure 3d, and for the models in Figure 4e,h.
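
A minimal version of the exponential-shape fit could look like the following. Grid search is used here purely for simplicity; a gradient-based optimizer over (*α*, *β*) would serve equally well, and the grid ranges are arbitrary.

```python
import numpy as np

def fit_exponential_shape(S, choices, betas=None, alphas=None):
    # Fit w_f = alpha * exp(beta * f) by maximizing the same logistic
    # likelihood as the regression, via coarse grid search (illustrative).
    # beta > 0 indicates recency; beta < 0 indicates primacy.
    betas = np.linspace(-1.0, 1.0, 81) if betas is None else betas
    alphas = np.geomspace(0.05, 5.0, 41) if alphas is None else alphas
    f = np.arange(S.shape[1])
    y = (choices + 1) / 2.0                      # map {-1,+1} -> {0,1}
    best_nll, best_alpha, best_beta = np.inf, None, None
    for beta in betas:
        proj = S @ np.exp(beta * f)              # signal along this shape
        for alpha in alphas:
            logit = alpha * proj
            # numerically stable logistic negative log-likelihood
            nll = np.sum(np.logaddexp(0.0, logit) - y * logit)
            if nll < best_nll:
                best_nll, best_alpha, best_beta = nll, alpha, beta
    return best_alpha, best_beta
```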

The second method was to constrain the weights to be a linear function of time:

*w _{f}* = *α* + *slope* × *f*,

where *slope* > 0 corresponds to recency and *slope* < 0 to primacy.

Figure 3d shows the median exponential shape parameter (*β*) after bootstrapped resampling of trials 500 times for each subject. Both the exponential and linear weights give comparable results (Supplemental Figure S2).

To compute the combined temporal weights across all subjects (in Figure 3a-c), we first estimated the mean and variance of the weights for each subject by bootstrap-resampling of the data 500 times without regularization. The combined weights were computed as a weighted average across subjects at each frame, weighted by the inverse of the bootstrap-estimated variance.
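
This combination step amounts to an inverse-variance weighted average; a short sketch (the array shapes are assumptions for illustration):

```python
import numpy as np

def combine_subject_weights(boot_weights):
    # boot_weights: (n_subjects, n_boot, n_frames) bootstrap estimates of
    # each subject's normalized temporal weights. Subjects with noisier
    # (higher-variance) estimates contribute less at each frame.
    means = boot_weights.mean(axis=1)             # (n_subjects, n_frames)
    inv_var = 1.0 / boot_weights.var(axis=1)      # inverse bootstrap variance
    return (means * inv_var).sum(axis=0) / inv_var.sum(axis=0)
```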

Because we are not explicitly interested in the magnitude of **w** but rather its *shape* over stimulus frames, we always plot a “normalized” weight **w**/mean(**w**), both for our experimental results (Figure 3a-c) and for the model (Figure 4d,g).

### Approximate inference models

We model evidence integration as Bayesian inference in a three-variable generative model (Figure 4a) that distills the key features of online evidence integration in a hierarchical model [18]. The variables in the model are mapped onto the sensory periphery (*e*), sensory cortex (*x*), and a decision-making area (*C*) in the brain.

In the generative direction, on each trial, the binary value of the correct choice *C* ∈ {−1, +1} is drawn from a 50/50 prior. Each *x _{f}* is then drawn from a mixture of two Gaussians:

p(*x _{f}*|*C*) = *p*_{match} 𝒩(*x _{f}*; *C*, *σ*_{x}²) + (1 − *p*_{match}) 𝒩(*x _{f}*; −*C*, *σ*_{x}²).

Finally, each *e _{f}* is drawn from a Gaussian around *x _{f}*:

p(*e _{f}*|*x _{f}*) = 𝒩(*e _{f}*; *x _{f}*, *σ*_{e}²).

When we model inference, we assume that the subject has learned the correct model parameters, even as those parameters change between the two conditions. This is why we ran our subjects in blocks of only LSHC or HSLC trials on a given day.
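
Assuming the generative equations take the standard mixture form described above (modes at ±*C* with weight *p*_{match}, and Gaussian noise with standard deviations *σ*_{x} and *σ*_{e}; all parameter values below are arbitrary), sampling a trial can be sketched as:

```python
import numpy as np

def generate_trial(p_match=0.9, sigma_x=0.5, sigma_e=0.5, n_frames=10, rng=None):
    # One trial of the three-variable generative model: category C,
    # sensory representation x_f, and peripheral evidence e_f.
    rng = np.random.default_rng() if rng is None else rng
    C = rng.choice([-1, 1])                        # 50/50 prior over category
    match = rng.random(n_frames) < p_match         # mode assignment per frame
    modes = np.where(match, C, -C)                 # category information
    x = rng.normal(modes, sigma_x)                 # cortical representation
    e = rng.normal(x, sigma_e)                     # sensory information
    return C, x, e
```

Under this sketch, E[*x _{f}* · *C*] = 2*p*_{match} − 1, so *p*_{match} directly controls how reliably each frame points at the correct category.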

Category information in this model can be quantified by the probability that *x _{f}* is drawn from the mode that matches *C*, that is, by *p*_{match}. We quantify sensory information as the probability with which an ideal observer can recover the sign of *x _{f}*. That is, in our model sensory information is equivalent to the area under the ROC curve for two univariate Gaussian distributions separated by a distance of 2, which is given by

Φ(√2/*σ*_{e}),

where Φ is the cumulative normal distribution.

Because the effective time per update in the brain is likely faster than our 83ms stimulus frames, we included an additional parameter *n*_{U} for the number of online belief updates per stimulus frame. In the sampling model described below, we amortize the evidence from each frame over *n*_{U} successive belief updates. In the variational model, we interpret *n*_{U} as the number of coordinate ascent steps per frame.

Simulations of both models were done with 10000 trials per task type and 10 frames per trial. To quantify the evidence-weighting of each model, we used the same logistic regression procedure that was used to analyze human subjects’ behavior. In particular, temporal weights in the model are best described by the exponential weights (equation (3)), so we use *β* to characterize the model’s biases.

### Sampling model

The sampling model estimates p(*e _{f}*|*C*) using importance sampling of *x*, where each sample is drawn from a pseudo-posterior using the current running estimate of *p*_{f−1}(*C*) ≡ p(*C*|*e*_{1}, …, *e*_{f−1}) as a marginal prior:

q(*x*) ∝ p(*e _{f}*|*x*) *p*_{f−1}(*x*), where *p*_{f−1}(*x*) ≡ Σ_{C′} p(*x*|*C′*) *p*_{f−1}(*C′*).

Using this distribution, we obtain the following unnormalized importance weights:

*w _{s}* = 1/*p*_{f−1}(*x _{s}*).

This yields the following estimate for the log-likelihood ratio needed for the belief update rule in equation (2):

log Σ_{s} *w _{s}* p(*x _{s}*|*C* = +1) − log Σ_{s} *w _{s}* p(*x _{s}*|*C* = −1).

In the case of infinitely many samples, these importance weights exactly counteract the bias introduced by sampling from the posterior rather than likelihood, thereby avoiding any double-counting of the prior, and hence, any confirmation bias. However, in the case of finite samples, *S*, biased evidence integration is unavoidable.

The full sampling model is given in Supplemental Algorithm S1. Simulations in the main text were done with *S* = 5, *n*_{U} = 5, normalized importance weights, and *γ* = 0 or *γ* = 0.1.
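
A schematic single-frame update, assuming the pseudo-posterior proposal and 1/prior importance weights described above, could look like the following. This is only a sketch with arbitrary generative parameters; the full model, with *n*_{U} amortized updates and the leak *γ*, is Supplemental Algorithm S1.

```python
import numpy as np

def sampling_update(log_odds, e, p_match=0.9, sigma_x=0.5, sigma_e=0.5,
                    n_samples=5, rng=None):
    # One belief update: draw samples of x from a pseudo-posterior mixing
    # the likelihood p(e|x) with the current marginal prior over x, then
    # reweight by unnormalized importance weights 1/prior(x). With few
    # samples the prior leaks into the evidence term (confirmation bias).
    rng = np.random.default_rng() if rng is None else rng
    pC = 1.0 / (1.0 + np.exp(-log_odds))           # current belief P(C=+1)

    def normpdf(v, mu, s):
        return np.exp(-0.5 * ((v - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

    def p_x_given_C(x, C):                         # mixture of two Gaussians
        return (p_match * normpdf(x, C, sigma_x)
                + (1.0 - p_match) * normpdf(x, -C, sigma_x))

    def prior_x(x):                                # marginal prior over x
        return pC * p_x_given_C(x, 1) + (1.0 - pC) * p_x_given_C(x, -1)

    # sample the pseudo-posterior p(x|e) ∝ p(e|x) * prior_x(x) by resampling
    # proposals drawn from the likelihood term (sampling-importance-resampling)
    proposals = rng.normal(e, sigma_e, size=50 * n_samples)
    resample_p = prior_x(proposals)
    resample_p /= resample_p.sum()
    xs = rng.choice(proposals, size=n_samples, p=resample_p)

    w = 1.0 / prior_x(xs)                          # importance weights
    llr = (np.log(np.sum(w * p_x_given_C(xs, 1)))
           - np.log(np.sum(w * p_x_given_C(xs, -1))))
    return log_odds + llr
```

With many samples the weights cancel the prior exactly; with few samples the cancellation is imperfect, which is the source of the bias discussed above.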

### Variational model

The core assumption of the variational model is that while a decision area approximates the posterior over *C* and a sensory area approximates the posterior over *x*, no brain area explicitly represents posterior dependencies between them. That is, we assume the brain employs a *mean field approximation* to the joint posterior by factorizing p(*C*, *x*_{1}, …, *x _{F}*|*e*_{1}, …, *e _{F}*) into a product of approximate marginal distributions, and minimizes the Kullback-Leibler divergence between q and p using a process that can be modeled by the Mean-Field Variational Bayes algorithm [28].

By restricting the updates to be online (one frame at a time, in order), this model can be seen as an instance of “Streaming Variational Bayes” [8]. That is, the model computes a sequence of approximate posteriors over *C* using the same update rule for each frame. We thus only need to derive the update rules for a single frame and a given prior over *C*; this is extended to multiple frames by re-using the posterior from frame *f* − 1 as the prior on frame *f*.

As in the sampling model, this model is fundamentally unable to completely discount the added prior over *x*. Intuitively, since the mean-field assumption removes explicit correlations between *x* and *C*, the model is forced to commit to a marginal posterior in favor of *C* = + 1 or *C* = −1 and *x* > 0 or *x* < 0 after each update, which then biases subsequent judgments of each.

To keep conditional distributions in the exponential family (which is only a matter of mathematical convenience and has no effect on the ideal observer), we introduce an auxiliary variable *z _{f}* ∈ {−1, +1} that selects which of the two modes *x _{f}* is in:

p(*x _{f}*|*z _{f}*) = 𝒩(*x _{f}*; *z _{f}*, *σ*_{x}²), p(*z _{f}*|*C*) = *p*_{match} if *z _{f}* = *C* and 1 − *p*_{match} otherwise,

such that p(*x _{f}*|*C*) is unchanged. We then optimize the factorized posterior q(*x _{f}*) q(*z _{f}*) q(*C*).

Mean-Field Variational Bayes is a coordinate ascent algorithm on the parameters of each approximate marginal distribution. To derive the update equations for each step, we begin with the following standard result [28]:

log q(*x _{f}*) = E_{q(*z _{f}*) q(*C*)}[log p(*e _{f}*, *x _{f}*, *z _{f}*, *C*)] + const,

and analogously for q(*z _{f}*) and q(*C*).

After simplifying, the new q(*x _{f}*) term is a Gaussian with mean given by equation (14),

*μ _{x}* = *σ*_{q}² (*e _{f}*/*σ*_{e}² + *μ _{z}*/*σ*_{x}²), (14)

and constant variance *σ*_{q}² = (1/*σ*_{e}² + 1/*σ*_{x}²)^{−1}, where *μ _{C}* and *μ _{z}* are the means of the current estimates of q(*C*) and q(*z _{f}*).

For the update to q(*z _{f}*), expressed as the log odds of *z _{f}*, we obtain:

log [q(*z _{f}* = +1)/q(*z _{f}* = −1)] = 2*μ _{x}*/*σ*_{x}² + *μ _{C}* log [*p*_{match}/(1 − *p*_{match})]. (15)

Similarly, the update to q(*C*) is given by:

log [q(*C* = +1)/q(*C* = −1)] = log [p(*C* = +1)/p(*C* = −1)] + *μ _{z}* log [*p*_{match}/(1 − *p*_{match})]. (16)

Note that the first term in equation (16) – the log prior – will be replaced with the log posterior estimate from the previous frame (see Supplemental Algorithm S2). Comparing equations (16) and (1), we see that in the variational model, the log-likelihood odds estimate is given by

*μ _{z}* log [*p*_{match}/(1 − *p*_{match})]. (17)

Analogously to the sampling model we assume a number of updates *n*_{U} reflecting the speed of relevant computations in the brain relative to how quickly stimulus frames are presented. Unlike for the sampling model, naively amortizing the updates implied by equation (17) *n*_{U} times results in a stronger primacy effect than observed in the data. Allowing for an additional parameter *η* scaling this update (corresponding to the step size in Stochastic Variational Inference [19]) seems biologically plausible because it simply corresponds to a coupling strength in the feedforward direction. Decreasing *η* both reduces the primacy effect and improves the model’s performance. Here we used *η* = 0.05 in all simulations based on a qualitative match with the data. The full variational model is given in Algorithm S2.
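
Putting the pieces together, a minimal online mean-field loop might look like the following. This is based on our reconstruction of the coordinate-ascent updates, with arbitrary parameter values; the published equations (14)–(17) and Supplemental Algorithm S2 may be parameterized differently.

```python
import numpy as np

def variational_trial(evidence, p_match=0.9, sigma_x=0.5, sigma_e=0.5,
                      n_updates=5, eta=0.05):
    # Online mean-field inference over one trial. For each frame, run
    # n_updates coordinate-ascent sweeps over q(x), q(z), q(C); eta scales
    # the q(C) update (step size / feedforward coupling strength).
    lam = np.log(p_match / (1.0 - p_match))        # strength of z-C coupling
    var_q = 1.0 / (1.0 / sigma_e**2 + 1.0 / sigma_x**2)  # constant q(x) variance
    log_odds_C = 0.0                               # 50/50 prior over C
    for e_f in evidence:
        mu_z = 0.0                                 # reset per-frame latents
        for _ in range(n_updates):
            mu_C = np.tanh(log_odds_C / 2.0)       # mean of q(C)
            # q(x): precision-weighted combination of evidence and mu_z
            mu_x = var_q * (e_f / sigma_e**2 + mu_z / sigma_x**2)
            # q(z): pulled by x (bottom-up) and by the category belief (top-down)
            mu_z = np.tanh((2.0 * mu_x / sigma_x**2 + lam * mu_C) / 2.0)
            # q(C): accumulate the scaled log-likelihood odds lam * mu_z
            log_odds_C += eta * lam * mu_z
    return log_odds_C
```

The top-down pull of *μ _{C}* on *μ _{z}*, which in turn feeds back into *μ _{x}*, is the loop that produces the confirmation bias in this model.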

## Author Contributions

RL and RH wrote the manuscript. RL coded the experiment and data analysis. JY and AC helped design the experiment and temporal weighting analysis. AC recruited and ran subjects. RL, AC, and RH created the sampling model. RL and JB created the variational model.

## Acknowledgements

This work was supported by NEI/NIH awards R01 EY028811-01 (RMH) and T32 EY007125 (RDL, JLY), as well as an NSF/NRT graduate training grant NSF-1449828 (RDL).

## Footnotes

2 rhaefne2{at}ur.rochester.edu