Abstract
Decision confidence reflects our ability to evaluate the quality of decisions and guides subsequent behaviors. Experiments on confidence reports have almost exclusively focused on two-alternative decision-making. In this realm, the leading theory is that confidence reflects the probability that a decision is correct (the posterior probability of the chosen option). There is, however, another possibility, namely that people are less confident if the best two options are closer to each other in posterior probability, regardless of how probable they are in absolute terms. This possibility has not previously been considered because in two-alternative decisions, it reduces to the leading theory. Here, we test this alternative theory in a three-alternative visual categorization task. We found that confidence reports are best explained by the difference between the posterior probabilities of the best and the next-best options, rather than by the posterior probability of the chosen (best) option alone, or by the overall uncertainty (entropy) of the posterior distribution. Our results upend the leading notion of decision confidence and instead suggest that confidence reflects the observer’s subjective probability that they made the best possible decision.
Introduction
Confidence refers to the “sense of knowing” that comes with a decision. Confidence affects the planning of subsequent actions after a decision1,2, learning3, and cooperation in group decision making4. Failures in utilizing confidence information have been linked to psychiatric disorders5.
While human observers can report their self-assessment of the quality of their decisions6, 7, 8, 9, 10, 11, 12, the computations underlying confidence reports are still insufficiently understood. The leading theory of confidence suggests that confidence reflects the probability that a decision is correct7, 8, 13, 14, 15, 16, 17. We refer to this idea as the “Bayesian confidence hypothesis”, meaning that the decision-maker uses the posterior probability of the chosen category (i.e. the probability that the decision is correct) for their confidence reports. In neurophysiological studies, a brain region or a neural process is considered to represent confidence if its responses correlate with the probability that a decision is correct18, 19, 20. Behavioral studies testing whether human confidence reports follow the Bayesian confidence hypothesis have shown mixed results: While some studies found resemblances between Bayesian confidence and empirical data e.g. 18, 19, 21, 22, others have suggested that confidence reports deviate from the Bayesian confidence hypothesis e.g. 23, 24, 25.
Even though the Bayesian confidence hypothesis is the leading theory of confidence, there is currently no evidence to rule out the possibility that confidence is affected by unchosen options. Specifically, people could be less confident if the next-best option is very close to the best option. In other words, confidence could depend on the difference between the posterior probabilities of the best and the next-best options, rather than on the absolute value of the posterior of the best option. This idea has not been tested because previous studies of decision confidence have predominantly used two-alternative decision tasks; in such tasks, the alternative hypothesis is equivalent to the Bayesian confidence hypothesis, because the difference between the two posterior probabilities in a two-alternative task is a monotonic function of the highest posterior probability. Thus, to dissociate these two models of confidence, we need more than two alternatives. Therefore, we use a three-alternative decision task. To preview our main result, we find that the difference-based model accounts well for the data, whereas the model corresponding to the Bayesian confidence hypothesis and a third, entropy-based model do not.
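The equivalence in two-alternative tasks, and its breakdown with three alternatives, can be made explicit with a short worked example. The symbols p_max and p_2nd (the highest and second-highest posterior probabilities) and the numerical posteriors below are introduced here for illustration only and are not taken from the experiments.

```latex
% Two alternatives: the posteriors sum to one, so the difference is a
% monotonic (linear) function of the highest posterior.
\[
  p_{\max} - p_{\mathrm{2nd}} = p_{\max} - (1 - p_{\max}) = 2\,p_{\max} - 1 .
\]
% Three alternatives: the same highest posterior can coexist with
% different differences.
\[
  (0.5,\ 0.5,\ 0):\quad p_{\max} = 0.5,\ \ p_{\max} - p_{\mathrm{2nd}} = 0 ;
  \qquad
  (0.5,\ 0.25,\ 0.25):\quad p_{\max} = 0.5,\ \ p_{\max} - p_{\mathrm{2nd}} = 0.25 .
\]
```

In the second case the two posteriors share the same highest value but differ in the gap to the next-best option, which is what allows a three-alternative task to separate the two hypotheses.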
Results
To investigate the computations underlying confidence reports in the presence of multiple alternatives, we designed a three-alternative categorization task. On each trial, participants viewed a large number of exemplar dots from each of the three categories (color-coded), along with one target dot in a different color (Figure 1A). Each category corresponded to an uncorrelated, circularly symmetric Gaussian distribution in the plane. We asked participants to regard the stimulus as a bird’s eye view of three groups of people. People within a group wear shirts of the same color, and the target dot represents a person from one of the three groups. Participants made two responses: the category of the target, and their confidence in their decision on a four-point Likert scale.
To manipulate participants’ beliefs (posterior probability distribution), we used different configurations of the category distributions and varied the position of the target dot within each configuration (Figure 1B and 1C). This design allowed us to test quantitative models of how the posterior distribution gives rise to confidence reports (see an illustration of this idea in Supplementary Figure 1).
Model
Generative model
Each category is equally probable. We assume that the observer makes a noisy measurement x of the position s of the target dot. We model the noise as obeying a circularly symmetric Gaussian distribution centered at the target dot.
Decision model
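We now consider a Bayesian observer. We assume that the observer knows that each category is equally probable, and knows the distribution associated with each category (group) based on the exemplar dots. Given a measurement x, the posterior probability of category C is then
$$p(C \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C)}{\sum_{C'} p(\mathbf{x} \mid C')}, \qquad (1)$$
where $p(\mathbf{x} \mid C)$ is the likelihood of the measurement under category C (see Methods).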
We further assume that due to decision noise or inference noise, the observer might not maintain the exact posterior distribution, p(C|x), but instead a noisy version of it. This type of decision noise is consistent with the notion that a portion of variability in behavior is due to “late noise” at the level of the decision variable26, 27, 28. We modeled decision noise by drawing a noisy posterior distribution from a Dirichlet distribution around the true posterior (Figure 2A-B; see details in Methods). In our case, the true posterior, which we denote by p, consists of the three posterior probabilities from Eq. (1): p=(p(C=1|x), p(C=2|x), p(C=3|x)). The magnitude of the decision noise, the amount of variation around p, is (inversely) controlled by a concentration parameter α>0. When α⟶∞, the variation vanishes and the posterior is noiseless. In general, the “noisy posterior”, which we denote as a vector pnoisy, satisfies
$$\mathbf{p}_{\text{noisy}} \sim \mathrm{Dirichlet}(\alpha \mathbf{p}).$$
We assume that when reporting the category of the target, the observer chooses the category C with the highest pnoisy(C|x). Unless otherwise specified, from now on we will refer to the noisy posterior distribution as simply the posterior distribution.
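The decision model described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the fitting code used in the study: the function names, the example category centers, the noise values, and the Dirichlet(α·p) parameterization (chosen so that the mean of the noisy posterior equals p) are assumptions made for illustration.

```python
import math
import random

def posterior(x, centers, sigma_s, sigma):
    """Posterior over categories for a 2-D measurement x under equal priors.

    Each likelihood is an isotropic Gaussian centered on a category center
    with total variance sigma_s**2 + sigma**2 (stimulus plus sensory noise).
    """
    var = sigma_s ** 2 + sigma ** 2
    log_lik = [-((x[0] - m[0]) ** 2 + (x[1] - m[1]) ** 2) / (2 * var)
               for m in centers]
    peak = max(log_lik)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in log_lik]
    total = sum(weights)
    return [w / total for w in weights]

def noisy_posterior(p, alpha, rng):
    """Draw from Dirichlet(alpha * p) by normalizing gamma variates.

    The alpha * p parameterization is an assumption of this sketch.
    """
    eps = 1e-12  # keeps the gamma shape parameters strictly positive
    draws = [rng.gammavariate(alpha * max(pi, eps), 1.0) for pi in p]
    total = sum(draws)
    return [d / total for d in draws]

def choose(p_noisy):
    """Index of the category with the highest noisy posterior."""
    return max(range(len(p_noisy)), key=lambda c: p_noisy[c])

rng = random.Random(0)
centers = [(-3.0, 0.0), (0.0, 0.0), (3.0, 0.0)]  # example configuration
# Measurement halfway between the left and central categories.
p = posterior((-1.5, 0.0), centers, sigma_s=2.0, sigma=0.5)
p_noisy = noisy_posterior(p, alpha=50.0, rng=rng)
choice = choose(p_noisy)
```

Because the example measurement lies midway between two category centers, the noiseless posterior assigns them equal probability, and the category choice is then determined by the decision noise alone.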
We introduce three models of confidence reports: the Max model, the Entropy model and the Difference model. Each of these models contains two steps: a) mapping the posterior distribution (pnoisy) to a real-valued confidence variable; b) applying three criteria to this confidence variable to divide its space into four regions, which then map in increasing order to the four confidence ratings. The second step accounts for every possible monotonic mapping from the confidence variable to the four-point confidence rating. The three models differ in the first step.
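The second step, shared by all three models, can be illustrated as follows. This is a sketch only: the function name and the criterion values are hypothetical, and in the models the three criteria are free parameters.

```python
from bisect import bisect_right

def rating(c_star, criteria):
    """Map a confidence variable to a rating from 1 to 4.

    The three criteria split the real line into four regions; the rating
    is one plus the number of criteria at or below c_star.
    """
    return bisect_right(sorted(criteria), c_star) + 1

criteria = [0.2, 0.5, 0.8]  # hypothetical criterion values
ratings = [rating(c, criteria) for c in (0.1, 0.3, 0.6, 0.9)]
```

Because the criteria are only required to be ordered, any monotonic transformation of the confidence variable can be absorbed into the criteria without changing the predicted ratings.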
The Max model corresponds to the Bayesian confidence hypothesis. In this model, the confidence variable is the probability that the chosen category is correct, or in other words, it is the highest of the three posterior probabilities (Figure 2C). In this model, the observer is least confident when the posterior distribution is uniform. Importantly, confidence is never influenced by the posterior probabilities of the categories that were not chosen.
In the Difference model, the confidence variable is the difference between the highest and second-highest posterior probabilities. In this model, confidence is low if the evidence for the next-best option is strong, and the observer is least confident whenever the two most probable categories are equally probable. One interpretation of this model is that confidence reflects the observer’s subjective probability that they made the best possible choice, regardless of the actual posterior probability of that choice. An alternative interpretation is that decision-making consists of an iterative process in which the observer reduces a multiple-choice task to simpler (binary) choices (see Discussion).
In the Entropy model, the confidence variable is the negative entropy of the posterior distribution, so that confidence decreases with the uncertainty conveyed by the entire posterior distribution. High confidence is associated with low entropy, and vice versa. As in the Max model, the observer is least confident when the posterior distribution is uniform. Unlike in the Max model, however, the posterior probabilities of the non-chosen categories affect confidence. See the details of the models in Methods.
Note that all three models are Bayesian in the sense that they compute the posterior probability distribution, and categorize the target dot by choosing the category with the highest posterior. The three models differ in how the confidence variable is read out from the posterior distribution. Only the Max model corresponds to the Bayesian confidence hypothesis. Only the Max model assumes that the posterior of the unchosen categories does not affect confidence. Importantly, in our three-alternative task, these models generate qualitatively different mappings from the posterior distribution to the confidence variable (Figure 2C). In a standard two-alternative task, however, the models would have been indistinguishable, because the probability of the non-chosen category would be determined by the probability of the chosen category.
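The three read-outs can be compared directly on example posteriors. The following is a minimal sketch; the function names and the two example posteriors are chosen for illustration and are not taken from the data.

```python
import math

def max_confidence(p):
    """Max model: the highest posterior probability."""
    return max(p)

def difference_confidence(p):
    """Difference model: highest minus second-highest posterior probability."""
    top, second = sorted(p, reverse=True)[:2]
    return top - second

def entropy_confidence(p):
    """Entropy model: negative entropy, using the convention 0 * log(0) = 0."""
    return sum(pi * math.log(pi) for pi in p if pi > 0)

a = [0.5, 0.5, 0.0]    # two options tied, third excluded
b = [0.5, 0.25, 0.25]  # same highest posterior, clear runner-up gap
for name, f in [("Max", max_confidence),
                ("Difference", difference_confidence),
                ("Entropy", entropy_confidence)]:
    print(name, f(a), f(b))
```

For these two posteriors the Max model assigns equal confidence, the Difference model assigns higher confidence to b, and the Entropy model assigns higher confidence to a, which illustrates how the three read-outs come apart once a third alternative is present.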
We fitted the free parameters to the data of each individual subject using maximum-likelihood estimation, where the data on a given trial consist of a decision-confidence pair. Thus, we accounted for the joint distribution of decisions and confidence ratings24, 25, 29 (see Methods). We compared models using the Akaike Information Criterion (AIC; Akaike, 1998). A model recovery analysis suggests that if the true model is among our tested models, our model comparison procedure is able to identify the correct model (see Methods and Supplementary Figure 3).
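For reference, the AIC of a fitted model is 2k − 2 ln L, where k is the number of free parameters and L is the maximized likelihood; lower values indicate a better model. The sketch below shows how per-participant AIC differences can be summarized as a group mean and standard error. The log-likelihood values and the parameter count are hypothetical and serve only to illustrate the arithmetic.

```python
import math
import statistics

def aic(log_likelihood, n_params):
    """Akaike Information Criterion; lower values indicate a better model."""
    return 2 * n_params - 2 * log_likelihood

def mean_and_sem(values):
    """Group mean and standard error of the mean."""
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))

# Hypothetical per-participant maximized log-likelihoods, for illustration only.
ll_difference = [-500.0, -480.0, -520.0]
ll_max = [-505.0, -490.0, -535.0]
n_params = 6  # hypothetical; assumed equal for both models in this sketch

# Positive values favor the Difference model.
delta = [aic(m, n_params) - aic(d, n_params)
         for d, m in zip(ll_difference, ll_max)]
mean_delta, sem_delta = mean_and_sem(delta)
```

When two models have the same number of free parameters, as assumed here, the AIC difference reduces to twice the difference in maximized log-likelihood.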
Experiment 1
In Experiment 1, the centers of the three category distributions were aligned vertically (Figure 1B). There were four conditions: In the first two conditions, the centers were evenly spaced horizontally. In the last two conditions, the center of the central distribution was closer to the center of either the left or the right distribution. The vertical position of the target dot was sampled from a normal distribution, and the horizontal position of the target dot was sampled uniformly between the centers of the leftmost and rightmost categories plus an extension to the left and the right (see Methods).
We plotted the psychometric curves (mean confidence rating as a function of the horizontal position of the target dot) by averaging confidence reports across trials using a sliding window (Figure 3). Mean confidence rating varied as a function of the horizontal position of the target. In the first two conditions (Figure 3), where the three distributions were evenly spaced, the psychometric curves showed two dips, with the lowest confidence attained at two positions symmetric around 0°.
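A sliding-window average of this kind can be sketched as follows. The window width, the evaluation positions, and the example trials are assumptions made for illustration; the text does not specify the window used for Figure 3.

```python
def sliding_window_mean(positions, ratings, centers, width):
    """Mean rating of trials whose position lies within width / 2 of each center.

    Returns NaN for windows that contain no trials.
    """
    curve = []
    for c in centers:
        in_window = [r for x, r in zip(positions, ratings)
                     if abs(x - c) <= width / 2]
        curve.append(sum(in_window) / len(in_window) if in_window else float("nan"))
    return curve

# Hypothetical trials: horizontal target position (deg) and confidence rating.
positions = [-3.0, -1.0, 0.0, 1.0, 3.0]
ratings = [4, 2, 3, 2, 4]
curve = sliding_window_mean(positions, ratings, centers=[-2.0, 0.0, 2.0], width=2.5)
```

A wider window gives a smoother curve at the cost of blurring narrow dips, so the choice of width trades off noise against resolution.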
We simulated the predicted psychometric curves using the best-fitting parameters of each model (Figure 3B). The fits of the Max and the Difference models resembled the data, but the best fit of the Entropy model showed a dip at the center in the first condition.
In the third and fourth conditions, in which the three distributions were unevenly spaced, mean confidence was lowest around the centers of the two distributions that were closest to each other. Only the Difference model exhibited this pattern, while the Max and the Entropy models deviated more clearly from the data.
The models not only make predictions for confidence ratings, but also for the category decisions (Supplementary Figure 2). Participants categorized the target dot based on its location, and when the target dot was close to the boundary between two categories (the location where two categories have equal likelihood), they assigned the target to those two categories with nearly equal probabilities. In general, this pattern is consistent with an observer who chooses the category associated with the highest posterior probability. The Entropy model fits worst, even though all three models used the same rule for the category decision; this is because the confidence data also need to be accounted for.
Using the Akaike Information Criterion for model comparison (Figure 4A and Supplementary Table 1), we found that the Difference model outperformed the Max model by a group-averaged AIC score of 27.3 ± 7.0 (mean ± s.e.m.) and the Entropy model by 149 ± 25 (mean ± s.e.m.).
We further tested reduced versions of each of the three confidence models by removing either the sensory noise or the decision noise from the model. The Difference model outperformed the Max model and the Entropy model regardless of these manipulations (Supplementary Figure 4 and Supplementary Table 1). The sensory noise played a minor role in this task compared to the decision noise. For example, removing the sensory noise from the Difference model increased the AIC by 9.9 ± 3.2, while removing the inference noise increased the AIC by 57.3 ± 6.5. Using the Bayesian information criterion30 for model comparison led to the same conclusions (Supplementary Figure 5).
Experiment 2
In Experiment 2, we aimed to test whether the findings in Experiment 1 could be generalized to other stimulus configurations, where the centers of the categories varied in a two-dimensional space. We tested four conditions in which the centers of the three groups varied along both the horizontal and vertical axes (Figure 1C). We sampled the target dot positions uniformly within a circular area centered on the screen. In addition, the distribution of the categories used in Experiment 2 allowed us to probe confidence reports in a wider range of posterior distributions (Supplementary Figure 1B). For example, we can probe the confidence report when the target dot had the same distance to all three categories in Experiment 2, but not in Experiment 1.
The “psychometric curve” now is a heat map in two dimensions (Figure 5). The fits to these psychometric curves showed different patterns among the three models: When the three groups formed an equilateral triangle (Figure 5, the first and second columns), the confidence (as a function of target location) estimated by the Entropy model exhibited contours that were more convex than those in the data. In the last two conditions (Figure 5, the third and fourth columns), compared to the other two models, the Difference model showed stronger resemblance to the data, as the model exhibited an extended low confidence region at the side where two categories were positioned closely. The results of model comparisons were consistent with Experiment 1. The Difference model outperformed the Max model by a group-averaged AIC score of 45.9 ± 8.5 (mean ± s.e.m.) and the Entropy model by 152 ± 25 (mean ± s.e.m.) (Figure 4B and Supplementary Table 1). The model with both sensory and inference noise explained the data the best, and the inference noise had a stronger influence on the model fit than the sensory noise (Supplementary Figure 4B, Supplementary Figure 5B and Supplementary Table 1).
Experiment 3
So far, we found that the Difference model fits the data better than the Max and the Entropy models. However, whether participants report the probability that a decision is correct (the Max model) might depend on the experimental design. In Experiments 1 and 2, participants received no feedback on their category decision. Thus, the probability of being correct in the task could be difficult to learn. To investigate this issue, in Experiment 3, using the same four stimulus configurations as those in Experiment 1 (Figure 1B), we randomly chose one of the three groups as the true target category in each trial, and sampled the target position from the distribution of the true category. Feedback was presented at the end of each trial, informing participants of the true category.
The results of model comparison were consistent with Experiment 1. The Difference model outperformed the Max model by a group-averaged AIC score of 10.3 ± 2.9 (mean ± s.e.m.) and the Entropy model by 93 ± 18 (mean ± s.e.m.) (Supplementary Figure 6 and Supplementary Table 1). The model with both sensory and inference noise explained the data the best, and the inference noise had a stronger influence on the model fit than the sensory noise (Supplementary Figure 4C and 5C).
Discussion
To distinguish the leading model of perceptual confidence (the Bayesian confidence hypothesis) from a new alternative model in which confidence is affected by the posterior probabilities of unchosen options, we studied human confidence reports in a three-alternative perceptual decision task. We found that confidence is best described by the Difference model, in which confidence reflects the difference between the strength of observers’ belief (posterior probability) of the top two options in a decision. The Max model (which corresponds to the Bayesian confidence hypothesis) and the Entropy model (in which confidence is derived from the entropy of the posterior distribution) fell short in accounting for the data. Our results were robust under changes of stimulus configurations (Experiments 1 and 2), and when trial-by-trial feedback was provided (Experiment 3). Our results demonstrate that the posterior probabilities of the unchosen categories impact confidence in decision-making.
Decision tasks with multiple alternatives not only allow us to dissociate different computational models of confidence, they are also ecologically important. In the real world, humans and other animals often face decisions with multiple alternatives, such as identifying the color of a traffic light, recognizing a person, categorizing a species of an animal, online shopping, or making a medical diagnosis.
Our models can be generalized to categorical choice with more than three alternatives. Specifically, the Difference model predicts that besides the posterior probabilities of the top two options, the posterior of the other options does not matter as long as they add up to the same total. A special type of categorical choice is when the world state variable is continuous (e.g. in an orientation estimation task) but gets discretized for the purpose of the experiment. Consider the specific case that the posterior distribution is Gaussian. An observer following the Difference model would compute the difference between the posteriors of the two discrete options closest to the peak. This serves as a very coarse approximation to the curvature of the posterior distribution at its peak, which, for Gaussians, is monotonically related to its inverse variance, consistent with an earlier model in which confidence is based on the precision parameter of the posterior29. Outside the realm of Gaussian and similar distributions, the Difference model and van den Berg et al.’s model (2017) might be distinguishable. For example, when the posterior distribution is bimodal, with the modes slightly different in height, the variance of the posterior is dominated by the separation between the modes, whereas the Difference model will use the difference in height for confidence reports.
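The link between peak curvature and inverse variance for a Gaussian can be written out explicitly. The notation below (a one-dimensional Gaussian density f with mean μ and standard deviation σ) is introduced here for illustration.

```latex
\[
  f(x) = \frac{1}{\sigma\sqrt{2\pi}}
         \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
  \qquad
  f''(x) = \left(\frac{(x-\mu)^2}{\sigma^4} - \frac{1}{\sigma^2}\right) f(x),
\]
\[
  f''(\mu) = -\frac{f(\mu)}{\sigma^{2}} = -\frac{1}{\sqrt{2\pi}\,\sigma^{3}} .
\]
```

The magnitude of the curvature at the peak, 1/(√(2π) σ³), increases monotonically with the inverse variance 1/σ², so a read-out based on the local drop-off around the peak orders Gaussian posteriors in the same way as a read-out based on precision.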
Although many behavioral studies have emphasized similarities between human confidence reports and predictions of Bayesian models e.g. 18, 19, 21, 22, the Bayesian confidence hypothesis has been questioned before8, 13, 14, 15, 16. In addition to the probability of being correct, confidence is influenced by various factors such as reaction time31, post-decision processing32, 33, 34, 35, and the magnitude of positive evidence36, 37, 38, 39. Two model comparison studies have shown deviations from the Bayesian confidence hypothesis in two-alternative decision tasks24, 25. However, in one study24, the experimental design did not allow the authors to strongly distinguish the model that was based on the Bayesian confidence hypothesis from those that were not. Moreover, in both studies24, 25, the alternative models were based on heuristic decision rules without a broader theoretical interpretation. Here, we have identified a type of deviation from the Bayesian predictions that is not only of a qualitatively different nature, but that also raises new theoretical questions.
Specifically, the Difference model is currently a descriptive model. We have two suggestions to interpret it as an outcome of approximate inference. First, the Difference model might be an approximation to a model in which confidence depends on the probability that an observer made the best possible decision. Specifically, the observer is “aware” that their decision is based on the noisy posterior pnoisy rather than the true posterior p. Thus, it is possible that the chosen category is not the category with the highest probability in the true posterior. Confidence would be derived from the probability that the chosen category has the highest probability in the true posterior distribution. The observer achieves this computation using the evidence for the next-best option: The stronger the evidence for the next-best option, the more likely that the chosen category is not the top choice in the true posterior, thus leading to lower confidence. Recent work has shown that subjective confidence guides information seeking during decision-making40. Under the Difference model, during information seeking, the observer’s goal is to make sure that the best option is better than the alternative options. Low confidence would encourage the observer to collect more information in order to strengthen the belief that the best option is better than the next-best option.
Second, the finding that confidence is best described by the relative strength of the evidence of the top two options might be related to other findings in multiple-alternative decision-making. For example, in one experiment, observers watched columns of bricks build up on the screen, and reported which column had the highest accumulation rate41. A heuristic model in which the observer makes a decision when the height of the tallest column exceeds the height of the next-tallest column by a fixed threshold captured the overall pattern of people’s behavior. In a study on self-directed learning in a three-alternative categorization task, observers had to learn the category distributions by sampling from the feature space and receiving feedback. Instead of choosing the most informative samples, human observers chose ones for which the likelihoods of two categories were similar, namely those located at boundaries between pairs of categories42. This literature allows us to speculate that observers might decompose a multiple-alternative decision into several simpler (perhaps binary) choices. This notion is reminiscent of the concept in prospect theory that before a phase of evaluation, extremely unlikely outcomes might be first discarded in an “editing” phase43. Hence, an alternative interpretation of our results is that confidence reports deviate from the Bayesian confidence hypothesis (the Max model) because the observer estimates the probability of being correct in a way that ignores the options that are discarded before final evaluation. In the Difference model, the least favorite option is not completely discarded because it decreases the posterior probabilities of the other two options (and thus their difference) by contributing to the normalization pool44, 45.
Therefore, we consider an extreme version of editing, the Ratio model, in which the least-favorite option does not even participate in normalization, and thus confidence solely depends on the likelihood ratio between the top two options. The Difference model and the Ratio model are not distinguishable in Experiments 1 and 2 (Supplementary Figure 7). In Experiment 3, the Difference model was very similar to the Ratio model in group-averaged AIC (3.8 ± 1.4 in favor of the Difference model). Testing variable numbers of categories within an experiment might help to differentiate between these two models.
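The contrast between the two read-outs can be illustrated numerically. In this sketch the function names and example posteriors are hypothetical; with equal priors, the likelihood ratio between the top two options equals the ratio of their posterior probabilities, which is what the sketch computes.

```python
import math

def top_two(p):
    """Highest and second-highest posterior probabilities."""
    top, second = sorted(p, reverse=True)[:2]
    return top, second

def difference_confidence(p):
    """Difference model: the least-favored option matters through normalization."""
    top, second = top_two(p)
    return top - second

def ratio_confidence(p):
    """Ratio model: depends only on the top two options, not on the third."""
    top, second = top_two(p)
    return math.inf if second == 0 else top / second

# Same 2:1 ratio between the top two options, different third option.
p_weak_third = [0.6, 0.3, 0.1]
p_strong_third = [0.5, 0.25, 0.25]
```

Here the Ratio model assigns the same confidence variable to both posteriors, whereas the Difference model assigns a lower value when the least-favored option takes up more of the probability mass.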
We found that compared to the sensory noise, the noise associated with the computation of posterior probability plays a more important role in our task. This is consistent with the findings of a recent study26. The relative unimportance of sensory noise could be partly due to our experimental designs, which used stimuli with strong signal strength (saturated color and unlimited duration). Different from our study, Drugowitsch et al. (2016) devised an evidence accumulation task and further distinguished two types of decision noises: First, the inference noise that was added (and thus increased) with each new stimulus sample. Second, the selection noise that was injected only once at the final response. Because our experiment only had one stimulus in each trial, these two sources of variability were indistinguishable.
Do our results generalize beyond perceptual decision-making? In a two-alternative value-based decision task, observers reported confidence in a way that was similar to that in perceptual decision tasks10: When observers were asked to choose the good with the higher value, confidence increased with the posterior probability that a decision is correct, which in turn increased with the difference in value between the two goods. In addition, choice accuracy was higher in high-confidence trials than in low-confidence trials, reflecting observers’ ability to evaluate their own performance. It is unknown how observers compute confidence when there are more than two goods. In three-alternative value-based tasks, the Difference model would predict that confidence is determined by the difference between the probability that the chosen item is the most valuable and the probability that the next-best item is the most valuable.
How does the present study advance our understanding of the neural basis of confidence? Most neurophysiological studies of confidence have considered the neural activity that correlates with the probability of being correct as the neural representation of confidence (but see 48). Neural responses in parietal cortex19, orbitofrontal cortex18 and pulvinar20 have been associated with that representation of confidence. These studies all used two-alternative decision tasks. Multiple-alternative decision tasks have been used in neurophysiological studies on non-human primates but not with the objective of studying confidence45, 49, 50, 51. By utilizing multiple-alternative tasks, neural studies could dissociate the neural correlates of probability correct from those of the “difference” confidence variable in the Difference model, which according to our results might be the basis of human subjective confidence. A potentially important difference between human and non-human animal studies is that in the latter, confidence is not explicitly reported but operationalized through some aspect of behavior, such as the probability of choosing a “safe” (opt-out) option19, 20, 46, 47, 48, or the time spent on waiting for reward18. Thus, one should be careful when directly comparing these implicit reports with explicit confidence reports in human studies.
Methods
Setup
Participants sat in a dimly lit room with the chin rest positioned 45 cm from the monitor. The stimuli and the experiment were controlled by customized programs written in JavaScript. The monitor had a resolution of 3840 by 2160 pixels and a refresh rate of 30 Hz. The spectrum and the luminance of the monitor were measured with a spectroradiometer.
Participants
Thirteen participants took part in Experiment 1. Eleven participants took part in Experiment 2. Eleven participants took part in Experiment 3. All participants had normal or corrected-to-normal vision. The experiments were conducted with the written consent of each participant. The University Committee on Activities involving Human Subjects at New York University approved the experimental protocols.
Stimulus
On each trial, three categories of exemplar dots (375 dots per category) were presented along with one target dot, a black dot (Figure 1A). The dots within a category were distributed as an uncorrelated, circularly symmetric Gaussian distribution with a standard deviation of 2° (degree visual angle) along both horizontal and vertical directions. Exemplar dots from the different categories were coded with different colors. The three colors were randomly chosen on each trial, and were equally spaced in Commission Internationale de l’Eclairage (CIE) L*a*b* color space. The three colors were at a fixed lightness of L*=70 and were equidistant from the gray point (a*=0, and b*=0).
In Experiments 1 and 3, the centers of the three categories were aligned vertically to the center of the screen, and were located at different horizontal positions (Figure 1B). In the four configurations, the horizontal positions of the centers of the three categories were (−3°, 0°, 3°), (−4°, 0°, 4°), (−3°, −2°, 3°), and (−3°, 2°, 3°) from the center of the screen, respectively. In Experiment 2, the centers of the three categories varied in a two-dimensional space (Figure 1C). In the four configurations, the horizontal positions of the centers of the three categories were (−2°, 0°, 2°), (−1.59°, 0°, 1.59°), (−2°, −2°, 2°), and (−2°, 2°, 2°) from the center of the screen, respectively. The vertical positions of the centers were (1.16°, −2.31°, 1.16°), (0.94°, −1.84°, 0.94°), (1.16°, 0°, 1.16°), and (1.16°, 0°, 1.16°) from the center of the screen, respectively.
Procedures
We told participants that the three groups of exemplar dots represented a bird’s eye view of three groups of people. The three groups contained equal numbers of people. The black dot (the target) is a person from one of the three groups, but we do not know the color of her/his T-shirt. We asked participants to categorize the target to one of the three groups based on the (position) information conveyed by the dots, and report their confidence on a four-point Likert scale.
Each trial started with the onset of the stimulus and three rectangular buttons positioned at the bottom of the screen (Figure 1A). On each trial, participants first categorized the target to one of the three groups (based on the position information conveyed by the dots) by using the mouse to click on one of the three buttons. After participants reported their decision, the three buttons were replaced by four buttons (labeled as “very unconfident”, “somewhat unconfident”, “somewhat confident”, and “very confident”) for participants to report their confidence on the decision they made. The stimuli were presented throughout each trial. Reaction time (for both decision and confidence reports) was unlimited. After participants reported their confidence, all the exemplar dots and the rectangular buttons disappeared from the screen, and the next trial started after a 600 ms inter-trial-interval.
In Experiment 1, the vertical position of the target dot was sampled from a normal distribution (2° std), and the horizontal position of the target dot was sampled uniformly between the center of the leftmost and rightmost categories plus a 0.2° extension to the left and the right. In Experiment 2, the target dot was uniformly sampled from a circular area (2.6° radius) positioned at the center of the screen. No feedback was provided in Experiment 1 and Experiment 2.
In Experiment 3, in each trial, we randomly chose one of the three categories with equal probability as the true category. We then positioned the target dot by sampling from the distribution of the true category. Feedback regarding the true category was provided at the end of each trial: After participants reported their confidence, all exemplar dots disappeared except that the exemplar dots from the true category remained on the screen for an extra 500 ms. In each experiment, participants completed one 1-hr session (84 trials per configuration in Experiment 1 and 120 trials per configuration in Experiments 2 and 3). All the trials in one session were separated into 8 blocks with equal numbers of trials. Different configurations were randomized and interleaved within each block.
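The three target-sampling schemes can be sketched as follows. This is an illustration rather than the experiment code (which was written in JavaScript): the function names are hypothetical, and the vertical mean of 0° in Experiment 1 is an assumption based on the category centers being vertically aligned to the center of the screen.

```python
import math
import random

def sample_target_exp1(centers_x, rng, extension=0.2, vertical_sd=2.0):
    """Experiment 1: uniform horizontal position, normal vertical position."""
    x = rng.uniform(min(centers_x) - extension, max(centers_x) + extension)
    y = rng.gauss(0.0, vertical_sd)  # vertical mean of 0 deg is an assumption
    return x, y

def sample_target_exp2(rng, radius=2.6):
    """Experiment 2: uniform over a disk centered on the screen."""
    r = radius * math.sqrt(rng.random())  # sqrt makes the density uniform in area
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return r * math.cos(theta), r * math.sin(theta)

def sample_target_exp3(centers, rng, category_sd=2.0):
    """Experiment 3: pick a true category, then sample from its Gaussian."""
    true_category = rng.randrange(len(centers))
    mx, my = centers[true_category]
    return true_category, (rng.gauss(mx, category_sd), rng.gauss(my, category_sd))

rng = random.Random(1)
exp1 = [sample_target_exp1([-3.0, 0.0, 3.0], rng) for _ in range(1000)]
exp2 = [sample_target_exp2(rng) for _ in range(1000)]
exp3 = [sample_target_exp3([(-3.0, 0.0), (0.0, 0.0), (3.0, 0.0)], rng)
        for _ in range(1000)]
```

Taking the square root of the uniform variate in the Experiment 2 sampler avoids the clustering near the center that would result from sampling the radius uniformly.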
Models
Generative model
The target belongs to category C ∈ {1, 2, 3}. The two-dimensional position s of a target in category C is drawn from a two-dimensional Gaussian $p(\mathbf{s}|C) = \mathcal{N}(\mathbf{s}; \mathbf{m}_C, \sigma_s^2 \mathbf{I})$, where $\mathbf{m}_C$ is the center of category C, $\sigma_s^2$ is the variance of the stimulus distribution, and I is the 2-dimensional identity matrix. We assume that the observer makes a noisy sensory measurement x of the target position. We model the sensory noise using a Gaussian distribution centered at s with covariance matrix $\sigma^2 \mathbf{I}$. Thus, the distribution of x given category C is $p(\mathbf{x}|C) = \mathcal{N}(\mathbf{x}; \mathbf{m}_C, (\sigma_s^2 + \sigma^2) \mathbf{I})$.
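The generative model above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `sample_trial`, the argument name `sigma_s` (standard deviation of the stimulus distribution), and the zero-based category index are assumptions made here for the example.

```python
import random

def sample_trial(centers, sigma_s, sigma, rng):
    """Draw (C, s, x): category, target position, noisy measurement."""
    # Category chosen with equal probability (zero-based index, an assumption).
    C = rng.randrange(len(centers))
    # Target position s ~ N(m_C, sigma_s^2 I): independent draws per dimension.
    s = tuple(rng.gauss(m, sigma_s) for m in centers[C])
    # Measurement x ~ N(s, sigma^2 I): sensory noise added to the target.
    x = tuple(rng.gauss(v, sigma) for v in s)
    return C, s, x
```

Because both stages are Gaussian and independent, the variance of x around the category center is the sum of the two variances, which is the marginal distribution stated in the text.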
Inference on a given trial
We assume that the observer knows the mean and standard deviation of each category based on the exemplar dots, and that the observer assumes that the three categories have equal probabilities. The posterior probability of category C given the measurement x is then the normalized likelihood, $p(C|\mathbf{x}) = p(\mathbf{x}|C) / \sum_{C'=1}^{3} p(\mathbf{x}|C')$. Instead of the true posterior p(C|x), the observer makes decisions based on $p_{\text{noisy}}(C|\mathbf{x})$, a noisy version of the posterior probability. We obtain the noisy posterior by drawing from a Dirichlet distribution. The Dirichlet distribution is a generalization of the beta distribution. Just as the beta distribution is a continuous distribution over the probability parameter of a Bernoulli random variable, the Dirichlet distribution is a distribution over a vector that represents the probabilities of any number of categories. The Dirichlet distribution is parameterized as $\mathbf{p}_{\text{noisy}} \sim \text{Dir}(\alpha \mathbf{p})$. Here, p is a vector consisting of the three posterior probabilities, p = (p(C=1|x), p(C=2|x), p(C=3|x)), and $\mathbf{p}_{\text{noisy}}$ is a vector consisting of the three posterior probabilities perturbed by the decision noise, $\mathbf{p}_{\text{noisy}}$ = ($p_{\text{noisy}}$(C=1|x), $p_{\text{noisy}}$(C=2|x), $p_{\text{noisy}}$(C=3|x)). The mean of $p_{\text{noisy}}(C|\mathbf{x})$ is p(C|x). The concentration parameter α inversely determines the magnitude of the decision noise. To make a category decision, the observer chooses the category that maximizes the noisy posterior probability: $\hat{C} = \arg\max_C \, p_{\text{noisy}}(C|\mathbf{x})$.
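A minimal sketch of the inference step follows. It assumes the Dirichlet is parameterized so that its mean equals p and α acts as the concentration, i.e. Dir(αp), which is consistent with the description above but is an assumption of this example. The function names and `sigma_s` (standard deviation of the stimulus distribution) are hypothetical. The Dirichlet draw uses the standard construction from normalized gamma variates.

```python
import math
import random

def posterior(x, centers, sigma_s, sigma):
    """p(C | x) under equal priors and isotropic Gaussian categories."""
    var = sigma_s ** 2 + sigma ** 2
    log_lik = [-((x[0] - m[0]) ** 2 + (x[1] - m[1]) ** 2) / (2 * var)
               for m in centers]
    top = max(log_lik)                      # subtract the max for stability
    w = [math.exp(v - top) for v in log_lik]
    total = sum(w)
    return [v / total for v in w]

def noisy_posterior(p, alpha, rng):
    """Draw p_noisy ~ Dir(alpha * p) (assumed parameterization)."""
    # The small floor keeps the gamma shape parameter strictly positive.
    g = [rng.gammavariate(max(alpha * v, 1e-12), 1.0) for v in p]
    total = sum(g)
    return [v / total for v in g]

def decide(p_noisy):
    """Index of the category with the highest noisy posterior."""
    return max(range(len(p_noisy)), key=p_noisy.__getitem__)
```

Larger values of `alpha` concentrate the draws around p, so the decision noise shrinks as α grows.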
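We considered three models of confidence reports. In each model, we first specify an internal continuous confidence variable c*. In the Max (maximum a posteriori) model, c* is the posterior probability of the chosen category $\hat{C}$: $c^* = p_{\text{noisy}}(\hat{C}|\mathbf{x})$. In the Difference model, c* is the difference between the two highest posterior probabilities: $c^* = p_{\text{noisy}}(\hat{C}|\mathbf{x}) - p_{\text{noisy}}(\hat{C}_2|\mathbf{x})$, where $\hat{C}_2$ is the category with the second-highest posterior probability. In the Entropy model, c* is the negative entropy of the posterior distribution: $c^* = \sum_{C=1}^{3} p_{\text{noisy}}(C|\mathbf{x}) \log p_{\text{noisy}}(C|\mathbf{x})$.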
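To make the three definitions concrete, the sketch below computes all three confidence variables from one (noisy) posterior vector. The function name and the dictionary keys are hypothetical; the convention that 0 log 0 = 0 is an assumption of this example.

```python
import math

def confidence_variables(p_noisy):
    """Return c* under the Max, Difference, and Entropy models."""
    ranked = sorted(p_noisy, reverse=True)
    return {
        # Posterior probability of the chosen (best) category.
        "max": ranked[0],
        # Gap between the best and the next-best category.
        "difference": ranked[0] - ranked[1],
        # Negative entropy; zero-probability terms contribute nothing.
        "entropy": sum(p * math.log(p) for p in p_noisy if p > 0),
    }
```

The posteriors (0.5, 0.5, 0) and (0.5, 0.25, 0.25) show how the models come apart: the Max variable is 0.5 for both, the Difference variable is 0 for the first and 0.25 for the second, and the Entropy variable is higher (less negative) for the first.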
In each model, the continuous confidence variable c* is converted to a four-point confidence report c by imposing three confidence criteria b1, b2 and b3. For example, c=3 when b2<c*<b3. We also included a lapse rate λ in each model; on a lapse trial, the observer presses a random button for both the decision and the confidence report. In addition to the models that included both sensory and decision noise, we took a factorial approach and tested various combinations of confidence model and sources of variability 52, 53, 54. For each confidence model, we tested two reduced models by removing either the sensory noise (by setting σ=0) or the decision noise (by setting pnoisy(C|x) = p(C|x)) from the model.
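The mapping from c* to a button press can be sketched as follows. The function names are hypothetical, and two details are assumptions of this example rather than statements from the text: lapse responses are drawn uniformly over the available buttons, and a value exactly equal to a criterion falls into the higher rating.

```python
import bisect
import random

def confidence_report(c_star, criteria):
    """Map c* to a rating 1..4 given ascending criteria (b1, b2, b3)."""
    # Number of criteria at or below c*, plus one: b2 < c* < b3 gives 3.
    return bisect.bisect_right(criteria, c_star) + 1

def respond(decision, c_star, criteria, lapse_rate, rng, n_categories=3):
    """Return (decision, rating), replaced by random presses on lapse trials."""
    if rng.random() < lapse_rate:
        # Lapse: random category and random rating (uniform is an assumption).
        return rng.randrange(n_categories), rng.randint(1, 4)
    return decision, confidence_report(c_star, criteria)
```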
Response probabilities
So far, we have described the mapping from a measurement x to a decision $\hat{C}$ and a confidence report c. The measurement, however, is internal to the observer and unknown to the experimenter. Therefore, to obtain model predictions for a given parameter combination (σ, α, b1, b2, b3, λ), we perform a Monte Carlo simulation. For every true target position s that occurs in the experiment, we simulate a large number (10,000) of measurements x. For each of these measurements, we compute the posterior p(C|x), add decision noise to obtain $p_{\text{noisy}}(C|\mathbf{x})$, and finally obtain a category decision $\hat{C}$ and a confidence report c. Across all simulated measurements, we obtain a joint distribution $p(\hat{C}, c \mid \mathbf{s})$ that represents the response probabilities of the observer under that parameter combination.
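The simulation described above might look like the following sketch, shown here for the Difference model. It is an illustration under stated assumptions, not the authors' implementation: the function name, `sigma_s` (standard deviation of the stimulus distribution), the Dir(αp) parameterization of the decision noise, and the analytic mixing of a uniform lapse distribution are all choices made for this example.

```python
import bisect
import math
import random

def response_probabilities(s, centers, sigma_s, sigma, alpha, criteria, lapse,
                           n_sim=10_000, seed=0):
    """Estimate a table P[decision][rating - 1] for one target position s."""
    rng = random.Random(seed)
    var = sigma_s ** 2 + sigma ** 2          # variance of x given a category
    n_cat = len(centers)
    counts = [[0] * 4 for _ in range(n_cat)]
    for _ in range(n_sim):
        # Noisy measurement of the target position.
        x = (rng.gauss(s[0], sigma), rng.gauss(s[1], sigma))
        # Posterior under equal priors (normalized Gaussian likelihoods).
        log_lik = [-((x[0] - m[0]) ** 2 + (x[1] - m[1]) ** 2) / (2 * var)
                   for m in centers]
        top = max(log_lik)
        w = [math.exp(v - top) for v in log_lik]
        w_sum = sum(w)
        p = [v / w_sum for v in w]
        # Decision noise: p_noisy ~ Dir(alpha * p) via normalized gammas.
        g = [rng.gammavariate(max(alpha * v, 1e-12), 1.0) for v in p]
        g_sum = sum(g)
        p_noisy = [v / g_sum for v in g]
        # Decision (best category) and Difference-model confidence variable.
        order = sorted(range(n_cat), key=lambda k: p_noisy[k], reverse=True)
        c_star = p_noisy[order[0]] - p_noisy[order[1]]
        rating_index = bisect.bisect_right(criteria, c_star)   # 0..3
        counts[order[0]][rating_index] += 1
    # Mix in lapses as a uniform distribution over all buttons (assumption).
    uniform = 1.0 / (n_cat * 4)
    return [[(1 - lapse) * n / n_sim + lapse * uniform for n in row]
            for row in counts]
```

Fixing the random seed makes the estimated table a deterministic function of the parameters, which is convenient when the table is evaluated repeatedly inside an optimizer.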
Model fitting and model comparison
We denote the parameters (σ, α, b1, b2, b3, λ) collectively by θ. We fit each model to individual-subject data by maximizing the log likelihood of θ, log L(θ) = log p(data|θ). We assume that the trials are conditionally independent. We denote the target position, category response, and four-point confidence report on the ith trial by $\mathbf{s}_i$, $\hat{C}_i$, and $c_i$, respectively. Then, the log likelihood becomes $\log L(\theta) = \sum_i \log p(\hat{C}_i, c_i \mid \mathbf{s}_i, \theta)$, where $p(\hat{C}_i, c_i \mid \mathbf{s}_i, \theta)$ is obtained from the Monte Carlo simulation described above. We optimized the parameters using a new method called Bayesian Adaptive Direct Search 55. We used AIC and BIC for model comparison. To report the AIC (or BIC) index, we computed the AIC (or BIC) for each individual and then averaged it across participants.
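The log likelihood and the two information criteria can be written compactly. In this sketch the data layout (a list of `(position_key, decision, rating)` tuples and a dictionary of response-probability tables) is hypothetical, and all table entries are assumed to be strictly positive. The AIC and BIC expressions are the standard definitions, with k free parameters and n trials.

```python
import math

def log_likelihood(trials, response_probs):
    """Sum of log p(decision, rating | s, theta) over independent trials.

    trials: iterable of (position_key, decision_index, rating in 1..4).
    response_probs: dict mapping position_key to a table P[decision][rating - 1].
    """
    return sum(math.log(response_probs[s][d][c - 1]) for s, d, c in trials)

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k log(n) - 2 log L."""
    return k * math.log(n) - 2 * log_lik
```

For the full models described in this paper, k would be 6 (σ, α, b1, b2, b3, λ); lower AIC or BIC indicates a better model.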
Parameterization
The full versions of the three confidence models (Max, Difference and Entropy models reported in Figure 4) have the same set of free parameters: the magnitude of sensory noise (σ), the magnitude (concentration parameter) of decision noise (α), three boundaries for converting the continuous confidence variable to a button press (b1, b2, b3), and a lapse rate (λ).
For each of the three confidence models, we tested two versions of the reduced models (reported in Supplementary Figure 4 and Supplementary Figure 5). In one version, we kept the sensory noise (σ) in the model while removing the decision noise (α). In the other version, we kept the decision noise (α) in the model while removing the sensory noise (σ).
Model Recovery
To evaluate our ability to distinguish the three models, we performed a model recovery analysis. Based on the design of Experiment 1, we synthesized 10 datasets for each of the confidence models. To ensure that the synthesized data resembled our experimental data, we synthesized the data using the group-averaged best-fitting parameter values obtained in Experiment 1. We then fit each of the 30 datasets (3 generating models with 10 datasets each) with the 3 models. Supplementary Figure 3 illustrates the results averaged over the 10 datasets for each of the generating models.
Data visualization
For Experiments 1 and 3, we used a sliding window to visualize the psychometric curves, defined as the confidence ratings as a function of the horizontal location of the target dot. The sliding window had a width of 0.6°. We moved the window horizontally (in steps of 0.1°) from the left to the right of the screen center. At each step, we computed the mean confidence rating by averaging the confidence reports c of all the trials that fell within the window (based on the horizontal target location of each trial). We first applied this procedure to individual data, and then averaged the individual psychometric curves across subjects (the black curves in Figure 3B and Supplementary Figure 6B). For Experiment 1, we visualized the data ranging from −3.5° to +3.5° from the screen center. For Experiment 3, we visualized the data ranging from −5° to +5° from the center. These ranges were chosen so that each step along the black curves in Figure 3B and Supplementary Figure 6B contained at least 5 trials per subject on average. To visualize the model fit, we sampled a series of target dot locations along the horizontal axis (in steps of 0.1°), and we used the best-fitting parameters to compute the confidence rating predicted by the models for each target location. We then used the same procedure (a sliding window) to compute the mean confidence rating predicted by the models (the blue curves in Figure 3B and Supplementary Figure 6B).
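The sliding-window average for a single participant can be sketched as below. The function name and arguments are hypothetical, and two details are assumptions of this example: the window is centered on each step, and trials exactly on the window edge are included.

```python
def sliding_window_mean(positions, ratings, lo, hi, width=0.6, step=0.1):
    """Mean confidence rating in a window moved from lo to hi (degrees)."""
    n_steps = int(round((hi - lo) / step)) + 1
    window_centers, means = [], []
    for i in range(n_steps):
        center = lo + i * step
        # Ratings of trials whose horizontal position falls inside the window.
        inside = [r for x, r in zip(positions, ratings)
                  if abs(x - center) <= width / 2]
        window_centers.append(center)
        # Empty windows are marked as NaN rather than dropped.
        means.append(sum(inside) / len(inside) if inside else float("nan"))
    return window_centers, means
```

Applying the function to each participant and then averaging the resulting curves point by point corresponds to the two-stage procedure described above.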
For Experiment 2, the “psychometric curve” became a heat map in a two-dimensional space (Figure 5). We tiled the two-dimensional space with non-overlapping hexagonal spatial windows (with a radius of 0.25°) positioned from −3° to +3° (Figure 5A) along both the horizontal and vertical axes. To compute the mean confidence rating for each hexagonal window, we averaged the confidence ratings across all the trials that fell within that window for each participant. If the number of trials was zero among all the participants for a window, that window was left white in Figure 5A. To visualize the model fit, we used the best-fitting parameters and computed the confidence rating predicted by the models for an array of target locations (a grid tiling the two-dimensional space with a step of 0.1° along both the horizontal and vertical axes). The predicted confidence rating was then averaged within each hexagonal window.
Acknowledgement
We thank members of the Ma Lab, Hui-Kuan Chung, Rachel Denison, and Michael Landy for helpful comments on the manuscript.