Confidence as a noisy decision reliability estimate

Decisions vary in difficulty. Humans know this and typically report more confidence in easy than in difficult decisions. However, confidence reports do not perfectly track decision accuracy; they also reflect response biases and misjudgments of difficulty. To isolate the quality of confidence reports, we developed a model of the decision-making process underlying choice-confidence data. In this model, confidence reflects a subject's estimate of the reliability of their decision. The quality of this estimate is limited by the subject's uncertainty about the uncertainty of the variable that informs their decision ("meta-uncertainty"). This model provides an accurate account of choice-confidence data across a broad range of perceptual and cognitive tasks, revealing that meta-uncertainty varies across subjects, is stable over time, generalizes across some domains, and can be manipulated experimentally. The model offers a parsimonious explanation for the computational processes that underlie and constrain the sense of confidence.

[Figure caption fragment, panels (c)-(g): (c) Lowering the confidence criterion yields more "confident" reports at all stimulus values. (d) Increasing meta-uncertainty increases the fraction of "confident" reports for weak stimuli, but has the opposite effect for strong stimuli. (e) The confidence-consistency relation for two levels of meta-uncertainty; all other model parameters held equal. (f) The psychometric function, split out by confidence report ("confident" in green vs "not confident" in red), for three levels of meta-uncertainty. (g) The confidence-consistency relation under a liberal vs a conservative confidence criterion; all other model parameters held equal, σm = 0.25.]

[…] level of meta-uncertainty therefore requires directly fitting the model to data.

Evaluating the model architecture

We have motivated our framework on the basis of a qualitative observation (the lawful confidence-consistency relationship) and first principles (the inherent noisiness of perceptual and cognitive processes). To further test the central tenets of the CASANDRE model, we quantitatively examined the choice-confidence data collected by Adler and Ma (2018). We conducted several model comparisons designed to interrogate the framework's second-stage operations. To this end, we began by fitting the first-stage parameters to each subject's choice data and then kept these parameters constant across all model variants (see example in Fig. 3a). We first asked whether a simpler computation can account for confidence reports. We compared […] (Fig. 3c, top). We then asked whether meta-uncertainty is a necessary model component, and found this to be the case (Fig. 3b, middle). Including meta-uncertainty improved model quality for all 19 subjects (median difference in AIC = 285.2; Fig. 3c, middle). These model comparisons thus provide strong and consistent support for the hypothesis that confidence reflects a subject's noisy estimate of the reliability of their decision.
Further attempts to improve the model architecture yielded comparatively weak and inconsistent results. In particular, we wondered whether model performance would benefit from allowing criterion-asymmetry (meaning that the confidence criteria depend on the primary decision) and from adopting a different second-stage noise distribution (the Gamma distribution). Allowing […]

Estimating meta-uncertainty from sparse data

We seek to quantify a subject's ability to introspect about the reliability of a decision. Our method consists of interpreting human choice-confidence data through the lens of a principled two-stage process model. What kind of measurements are required to obtain robust and reliable estimates of meta-uncertainty, the model's parameter that governs metacognitive ability? We verified that Adler and Ma's experimental design affords solid parameter recovery (see Supplementary Fig. 3). However, their design is exceptional for its large number of stimulus conditions 25. Many studies use as few as two conditions 23. To test whether our approach generalizes to such experiments, we performed a recovery analysis. We used the CASANDRE model to generate synthetic data sets for five model subjects performing a 2-AFC discrimination task with binary confidence report options (see Methods). The model subjects differed only in their level of meta-uncertainty, which ranged from negligible to considerable (Fig. 4a, colored lines). We simulated data for each model subject using experimental designs that varied in the number of trials (100 vs 1,000) and in the number of conditions (2 vs 20; Fig. 4a, top). Figure 4b summarizes an example synthetic experiment.

The model parameters (σd, Cd, σm, Cc) specify the relation between stimulus value and the probability of each response option (Fig. 4b, left). We used these probabilities to simulate a synthetic dataset of 1,000 trials distributed across 20 conditions (Fig. 4b, middle). We then identified the set of parameter values that best describes these data (Fig. 4b, right). We repeated this procedure 100 times for each simulated experiment. Our method yields robust estimates of meta-uncertainty: for all model subjects and all experimental designs, the median estimate closely approximates the ground truth value (Fig. 4c, symbols). The reliability of these estimates is higher for more trials and somewhat higher for denser stimulus sampling (Fig. 4c, error bars).
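The two-stage generative process behind such a synthetic dataset can be sketched in a few lines. The sketch below is illustrative, not the paper's fitting code: it assumes a log-normal uncertainty estimate centered on the true noise level, and the parameter values are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_trials(s, n, sigma_d=1.0, C_d=0.0, sigma_m=0.25, C_c=0.75):
    """Sketch of a CASANDRE-style two-stage trial generator.

    Stage 1: a noisy decision variable determines the binary choice.
    Stage 2: the confidence variable is the distance to the decision
    criterion divided by a noisy (log-normal) estimate of the noise level.
    """
    r = rng.normal(s, sigma_d, n)                 # decision variable
    choice = r > C_d                              # binary category choice
    # noisy estimate of sigma_d; sigma_m is the meta-uncertainty
    sigma_hat = np.exp(rng.normal(np.log(sigma_d), sigma_m, n))
    conf_var = np.abs(r - C_d) / sigma_hat        # confidence variable
    confident = conf_var > C_c                    # binary confidence report
    return choice, confident

choice, confident = simulate_trials(s=0.5, n=10_000)
print(choice.mean(), confident.mean())
```

Repeating this for a grid of stimulus values and tabulating the four response options (choice × confidence) yields the kind of synthetic experiment the recovery analysis fits.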

Estimation error in σm covaried with estimation error in Cc (Supplementary Fig. 7). We conclude that the CASANDRE model typically can be identified in sparse experimental designs.

[…] (Fig. 5c). This suggests that meta-uncertainty is largely, but not fully, independent of the absolute level of stimulus uncertainty.

Whether metacognitive ability is domain-specific or domain-general is a debated question 8,42-44. We analyzed data from 20 subjects who performed both a perceptual and a cognitive confidence task. Both tasks had the same experimental design. Stimulus categories were defined either by the average orientation of a series of rapidly presented gratings, or by the average value of a series of rapidly presented numbers 24. Subjects' performance level was correlated across both tasks (condition-specific proportion correct choices: r = 0.69, P < 0.001), and so were their reported confidence levels, albeit to a lesser degree (r = 0.53, P < 0.001). We used the CASANDRE model to analyze both data sets (see Methods). Meta-uncertainty estimates were strongly correlated (r = 0.64, P = 0.003; Fig. 5d). Thus, meta-uncertainty appears to capture an aspect of confidence-reporting behavior that generalizes across at least some domains.

Comparison with other metrics for metacognitive ability

Our method to analyze choice-confidence data is built on the hypothesis that metacognitive ability is determined by meta-uncertainty. It is natural to ask how this metric of metacognitive ability relates to alternatives. We approach this question in […]

One historically popular approach to quantify metacognitive ability consists of measuring the trial-by-trial correlation between choice accuracy and the confidence report (this metric is sometimes termed "phi") 12. Consider an analysis of the choice- […] can be estimated by fitting this model to choice-confidence data 28. In a practical sense, correlated criteria noise resembles meta-uncertainty in that it solely impacts confidence reports. However, assuming noisy uncertainty estimates versus noisy confidence criteria results in metrics that behave somewhat differently (Fig. 6a, bottom).
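The phi metric itself is simple to compute from raw trial data; a minimal sketch (with made-up toy data, purely for illustration):

```python
import numpy as np

def phi(accuracy, confidence):
    """Trial-by-trial Pearson correlation between choice accuracy (0/1)
    and the confidence report -- the classic "phi" metric."""
    accuracy = np.asarray(accuracy, float)
    confidence = np.asarray(confidence, float)
    return np.corrcoef(accuracy, confidence)[0, 1]

# toy data: confidence is informative about accuracy, but imperfectly so
acc  = np.array([1, 1, 0, 1, 0, 1, 1, 0])
conf = np.array([3, 4, 1, 4, 2, 2, 3, 1])
print(round(phi(acc, conf), 2))  # → 0.81
```

Because phi is computed directly on observed responses, it inherits the influence of task difficulty and response bias, which is exactly the entanglement the variance decomposition below quantifies.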

Now consider the relationship between these metrics and meta-uncertainty for the three experiments performed by Navajas et al. […] Finally, meta-uncertainty and criteria noise are well correlated (r = 0.64, P < 0.001; Fig. 6b, bottom). But the confidence criterion is also correlated with criteria noise (r = 0.37, P < 0.001). Variability in each of these metrics of metacognitive ability thus in part reflects variability in meta-uncertainty, and in part variability in other components of the CASANDRE model.

To identify the relative importance of the different model components, we decomposed the variance of these metrics using the averaging-over-orderings technique (see Methods) 46,47. We first asked whether variability in meta-uncertainty could be explained by other model components, but found this not to be the case (fraction of explained variance: 13%; Fig. 6c). In contrast, variability in phi is predominantly explained by stimulus uncertainty (27%), followed by meta-uncertainty (22%). For meta-d'/d' and criteria noise, most variance is explained by meta-uncertainty (24% and 26%), while the contribution of the other model components is rather small (Fig. 6c). In summary, for all three alternative metrics, about three quarters of the variance arises from factors other than meta-uncertainty.
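The averaging-over-orderings idea can be sketched as a Shapley-style decomposition of regression R²: each predictor is credited with its increment in explained variance, averaged over every order in which predictors could be entered. The example data below are synthetic and illustrative, not the paper's model components.

```python
import numpy as np
from itertools import permutations

def r_squared(y, cols):
    """R^2 of an OLS fit of y on the given predictor columns (plus intercept)."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

def shapley_r2(y, predictors):
    """Averaging-over-orderings: credit each predictor with its average
    R^2 increment over all orders in which predictors can be added."""
    names = list(predictors)
    orders = list(permutations(names))
    shares = {k: 0.0 for k in names}
    for order in orders:
        used, prev = [], 0.0
        for name in order:
            used.append(predictors[name])
            cur = r_squared(y, used)
            shares[name] += (cur - prev) / len(orders)
            prev = cur
    return shares

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = 0.5 * x1 + rng.normal(size=500)       # correlated with x1
y = x1 + 0.5 * x2 + rng.normal(size=500)
shares = shapley_r2(y, {"x1": x1, "x2": x2})
print(shares)  # the shares sum to the full-model R^2
```

Because the increments telescope within each ordering, the shares always sum to the full-model R², which makes the decomposition well suited to apportioning a metric's variance among correlated model components.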

Our analysis suggests that phi, meta-d'/d', and criteria noise do not isolate the factors that limit metacognitive ability but instead measure a complex mixture of factors underlying choice-confidence data. We wondered how the performance of these mixtures in benchmarking experiments compares to that of meta-uncertainty. We computed phi, meta-d'/d', and criteria noise for the data sets shown in Fig. 5a […] conditions. Correlations ranged from weak to strong levels, with three tests failing to reach statistical significance (uncertainty […]

Manipulating meta-uncertainty
Can metacognitive ability be manipulated experimentally? Key to our framework is that confidence judgements require a subject to estimate uncertainty on a trial-by-trial basis. This becomes more difficult when experiments involve more confusable levels of stimulus uncertainty. We therefore expect that meta-uncertainty will grow with the number of stimulus uncertainty levels.

To appreciate our logic, consider the ideal Bayesian uncertainty estimation strategy, which consists of combining information obtained from ambiguous sensory measurements with prior task-specific knowledge. Specifically, the sensory measurement informs the uncertainty likelihood function, while knowledge of task statistics (i.e., the distribution of stimulus uncertainty levels) is summarized in a prior uncertainty belief function (Fig. 7a). The combination of both yields a posterior uncertainty belief function, the maximum of which is the "best possible" uncertainty estimate (Fig. 7a). Due to noise, repeated presentations of the same condition will yield different likelihood functions (Fig. 7a; see Methods). If the task involves only one level of stimulus uncertainty, the prior is a fixed delta function, and so is the posterior. Consequently, the maximum posterior estimate will not vary across trials and the ideal estimation strategy results in zero meta-uncertainty. However, when a task involves multiple levels of stimulus uncertainty, the prior will be more dispersed, causing the resulting maximum posterior estimate to be more variable across trials. Under an ideal Bayesian estimation strategy, meta-uncertainty thus initially grows with the number of uncertainty levels (Fig. 7b). We wondered whether this normative prediction affords insight into human metacognition.
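The argument above can be illustrated with a minimal numerical sketch: a discrete uniform prior over the task's uncertainty levels, a Gaussian likelihood centered on noisy evidence about the true noise level, and a maximum-posterior readout. All parameter values are illustrative assumptions, not fitted quantities.

```python
import numpy as np

rng = np.random.default_rng(0)

def map_uncertainty(levels, true_sigma, sigma_u=0.2, n_trials=2000):
    """Maximum-posterior uncertainty estimates under a discrete prior over
    the task's stimulus-uncertainty levels (illustrative parameter values)."""
    levels = np.asarray(levels, float)
    prior = np.full(levels.size, 1.0 / levels.size)   # uniform over levels
    estimates = np.empty(n_trials)
    for t in range(n_trials):
        mu_u = rng.normal(true_sigma, sigma_u)        # noisy evidence about sigma
        likelihood = np.exp(-0.5 * ((levels - mu_u) / sigma_u) ** 2)
        posterior = prior * likelihood                # unnormalized posterior
        estimates[t] = levels[np.argmax(posterior)]   # MAP estimate
    return estimates

one  = map_uncertainty([1.0], true_sigma=1.0)             # single level: delta prior
many = map_uncertainty([0.5, 1.0, 1.5, 2.0], true_sigma=1.0)
print(one.std(), many.std())  # 0.0 vs > 0: more levels -> more variable estimates
```

With a single uncertainty level the prior is a delta function and the MAP estimate never varies; with four levels the noisy evidence occasionally lands closer to a neighboring level, so the estimate varies across trials, which is the normative basis for meta-uncertainty growing with the number of levels.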

To test this hypothesis, we used the CASANDRE model to analyze six confidence experiments that varied in the number of […]

It has long been known that humans and other animals can meaningfully introspect about the quality of their decisions and actions 5-7,31,48. Quantifying this ability has remained a significant challenge, even for simple binary decision-making tasks 12,13,15,28,40,41. The core problem is that observable choice-confidence data reflect metacognitive ability as well as task difficulty and response bias. To overcome this problem, we introduced a metric that is anchored in an explicit hypothesis about the decision-making process that underlies behavioral reports. Our method is based on likening choice-confidence data to the outcome of an abstract mathematical process in which confidence reflects a subject's noisy estimate of their choice reliability, expressed in signal-to-noise units 14,20,49. This framework allowed us to specify the effects of factors that limit metacognitive ability and to summarize this loss in a single, interpretable parameter: meta-uncertainty. We showed that this process model (which we term the CASANDRE model) can explain the effects of stimulus strength and stimulus reliability on confidence reports and that meta-uncertainty can be estimated from conventional experimental designs. We found that a subject's level of meta-uncertainty is stable over time and across at least some domains. Meta-uncertainty can be manipulated experimentally: it is higher in tasks that involve more levels of stimulus reliability. Meta-uncertainty appears to be mostly independent of task difficulty and confidence reporting strategy. Widely used metrics for metacognitive ability are poor proxies for meta-uncertainty.

As such, the CASANDRE model represents a notable advance toward realizing crucial medium- and long-term goals in the field of metacognition 50.

The mental operations underlying confidence in a decision have long intrigued psychologists. Two key unresolved issues are […] involves, the more variable uncertainty estimates will be. We validated this prediction by analyzing data from six different confidence experiments in which 160 subjects completed a total of 243,000 trials (Fig. 7c). This finding is arguably the strongest piece of empirical evidence that meta-uncertainty is the critical factor that limits human metacognitive ability. It was enabled by the use of modern computational tools to quickly compute the approximate ratio of two distributions (i.e., the confidence variable distribution) and by the availability of the confidence database 23 […] coupled to an explicit goal such as maximizing choice accuracy, process models can be used to derive the optimal task strategy.

The resulting predictions offer a critical point of reference for human behavior 88. This approach has revealed that humans […] (Fig. 2 and 4). Under the CASANDRE model, the predicted probability of each response option is fully specified by five […] Figure 4b shows an example model fit to a synthetic data set whereby we used 5 free parameters (λ, σd, Cd, σm, and Cc) to capture data across 20 experimental conditions. Some studies used a task design that combined a 2-AFC categorization decision with a multi-level confidence rating scale (i.e., refs. 24,25,27,28). To model these data, we used the same approach as described above but with multiple confidence criteria (one less than the number of confidence levels). We modeled the data from ref. 27 using seven free parameters: λ, σd, Cd, σm, and Cc (4-point confidence rating scale, thus three in total) (see Fig. 7c and Supplementary Fig. 5a). We modeled some data from ref. 25 (task 1) using seventeen free parameters: λ, σd (one per contrast level, six in total), Cd (one per contrast level, six in total), σm, and Cc (4-point confidence rating scale, thus three in total). Example fits are shown in Fig. 1b,c and in Supplementary Fig. 1 (also see Fig. 7c, task 1 and Supplementary Fig. 5e). We modeled the data from ref. 24 using twelve free parameters: λ, σd (one per stimulus variance level, four in total), Cd, σm, and Cc (6-point confidence rating scale, thus five in total). Example fits are shown in Supplementary Fig. 4 (also see Fig. 7c, Fig. 6b-d, and Supplementary Fig. 5d). We modeled […] (discriminating categories of orientation distributions with different means but the same standard deviation; their "Task A") and task 2 (discriminating categories of orientation distributions with the same mean but different standard deviations; their "Task B").

Since stimulus orientations were drawn from a continuous distribution, to plot the data we grouped nearby orientations into 9 bins with similar numbers of trials. Data and model fits from two example subjects performing task 1 in experiment 1 are shown in Fig. 1b-c and Supplementary Fig. 1. Fitted parameters from all 19 subjects who performed experiments 1 and 2 are included in Fig. 7c (task 1) and Supplementary Fig. 5f. Subjects in experiment 3 performed only task 2. Data and model fits from an example subject performing task 2 in experiment 3 are shown in Supplementary Fig. 6. Fitted parameters from all 34 subjects who performed task 2 in experiments 1, 2, and 3 are included in Fig. 7c (task 2) and Supplementary Fig. 5e.

We analyzed data from all 3 experiments in ref. 24. 30 subjects performed experiment 1. 14 of those 30 subjects returned about a month after their first session to perform the same task again as experiment 2. Finally, 20 subjects performed experiment […]

To test the independence between stimulus uncertainty and measures of metacognitive ability, we split the experimental data from each session in half 24. We estimated meta-uncertainty independently for the two easiest and the two hardest stimulus conditions.
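Grouping a continuous stimulus variable into equal-count bins is a standard quantile-binning step; a minimal sketch (synthetic orientations, illustrative parameters) might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
orientations = rng.normal(0.0, 5.0, 1000)   # continuous stimulus values (deg)

# quantile edges give 9 bins with (near-)equal trial counts
edges = np.quantile(orientations, np.linspace(0.0, 1.0, 10))
bins = np.clip(np.searchsorted(edges, orientations, side="right") - 1, 0, 8)

counts = np.bincount(bins, minlength=9)
print(counts)   # roughly 1000/9 ≈ 111 trials per bin
```

Binning by quantiles rather than by equal-width intervals keeps the per-bin trial count, and hence the reliability of each plotted proportion, approximately constant.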

To limit the model comparison to the question of whether meta-uncertainty is independent of stimulus reliability, all other model […] primary choice, assuming a normally distributed confidence variable. The ratio of the ground truth sensory noise level and this estimate is plotted in the middle panels of Fig. 6a. Second, to obtain the expected value of phi, we simulated 200,000 trials in an experiment that included 20 levels of stimulus strength. We then calculated the Pearson correlation between the resulting choice accuracy and confidence vectors (Fig. 6a, top panels). We used these same simulated trials to fit the criteria-noise model of Shekhar and Rahnev 28. We downloaded their parameter optimization code and modified it as appropriate to fit our simulated data (available at https://osf.io/s8fnb/). In their procedure, a nested two-step, coarse-to-fine search algorithm is used to optimize the estimated confidence criteria and the confidence criteria noise level. The resulting criteria noise estimates are plotted in the bottom panels of Fig. 6a. The non-smooth appearance of the curves is a consequence of instabilities in the fitting procedure.

We also computed these alternative metrics for each session from ref. 24 (see Fig. 6b-d). As is conventional, we estimated d' […] four stimulus conditions (Fig. 6b-d). […] parameter are plotted in Fig. 6c.

Bayesian uncertainty estimation

We examined a simple model of Bayesian uncertainty estimation (Fig. 7a-b). We modeled the uncertainty likelihood function as a Gaussian function with a mean value, µu, that varied from trial to trial. Each trial, µu was randomly drawn from a Gaussian distribution whose average matched the true level of stimulus uncertainty, Su, and with standard deviation σu. As is typical for […]

[…] illustrates an additional analysis of the trade-off between meta-uncertainty and the other parameters of the CASANDRE model using the same generated data and model fits as in Figure 4c. Although the variance in meta-uncertainty explained by trade-offs with the confidence criterion can reach high levels, this is somewhat mitigated by denser stimulus sampling (Supplementary Fig. 7b, bottom) and is reasonable for datasets with a larger number of trials (Supplementary Fig. 7b, right) and for values of meta-uncertainty that are empirically observed more often (less than 1).

[…] a parameter capturing metacognitive ability from stimulus sensitivity and confidence reporting strategy 28. They refer to the parameter capturing metacognitive ability as "meta-noise" and find that log-normally distributed meta-noise provides a better quantitative and qualitative match to empirical data than normally distributed noise. In the CASANDRE model, the standard deviation of a log-normal distribution also serves as a metric for the metacognitive ability of an observer; however, these two uses of log-normal noise, like the models themselves, are not equivalent. In the CASANDRE model, the confidence variable is distributed according to the ratio of a normally and a log-normally distributed variable, whereas in the model of Shekhar and Rahnev the confidence variable has a normal distribution identical to the decision variable but the positions of the confidence criteria are subject to log-normally distributed noise.
We thus refer to the model of Shekhar and Rahnev as the "criteria-noise" model.

We quantitatively compared the criteria-noise model with the CASANDRE model. Because the criteria-noise model is currently limited to experiments with two stimulus strengths, we did not apply it to data from ref. 25 (as we did for comparing other model variants in Fig. 3 […] Wilcoxon signed-rank test; Supplementary Fig. 8a, top). Second, we compared the CASANDRE model to a simpler version […] (Supplementary Fig. 8a) demonstrate that the CASANDRE model performs quantitatively at least as well as the criteria-noise model in explaining the data reported in ref. 28.

We now turn to several more qualitative considerations that favor the CASANDRE model over the criteria-noise model. First, Shekhar and Rahnev show that empirical, averaged zROC functions have significant curvature compared with the straight zROC functions predicted by signal detection theory (their Fig. 4 and 5b). The criteria-noise model shows curved zROC functions but, as the authors note, they resemble piecewise linear functions rather than the smoothly curving zROC functions of the empirical data (see their Fig. 11, bottom left). […] confidence computation and posits that it is exactly noise in this transformation that can lead to limited metacognitive ability.
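The distributional contrast between the two models can be probed numerically. The Monte Carlo sketch below (illustrative parameters, not fit to any data) samples a CASANDRE-style confidence variable, a normal decision variable divided by a log-normal uncertainty estimate, and shows how meta-uncertainty fattens its tails:

```python
import numpy as np

rng = np.random.default_rng(0)

def tail_mass(sigma_m, n=500_000, thresh=3.0):
    """Monte Carlo tail mass of a CASANDRE-style confidence variable:
    a normal decision variable divided by a log-normal uncertainty
    estimate, for a given level of meta-uncertainty sigma_m."""
    decision = rng.normal(1.0, 1.0, n)                 # normal numerator
    estimate = np.exp(rng.normal(0.0, sigma_m, n))     # log-normal denominator
    return np.mean(np.abs(decision / estimate) > thresh)

low, high = tail_mass(0.1), tail_mass(1.0)
print(low, high)   # larger meta-uncertainty -> heavier tails
```

This ratio distribution has no simple closed form, which is why numerical approximation of the confidence-variable distribution is used when fitting the model; the criteria-noise model instead keeps the confidence variable normal and perturbs the criteria.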

Analogous to stimulus discrimination ability being limited by variation in the estimation of the stimulus, the CASANDRE model posits that metacognition is limited by variation in the estimation of the uncertainty required to compute confidence.

In contrast, the criteria-noise model posits that lower metacognitive ability arises from the inability of subjects to maintain constant confidence criteria. Stochastic confidence criteria cause problems for model tractability, because they can cross both the decision criterion and each other. By casting criteria noise as log-normally distributed, Shekhar and Rahnev avoid the problem of crossovers with the decision criterion, but to solve the problem of crossovers between confidence criteria they make the questionable assumption that noise is perfectly correlated across criteria. The CASANDRE model naturally avoids both of these issues. Fifth, the process captured by the CASANDRE model leads to new predictions about how metacognitive ability can be experimentally manipulated. Figure 7 illustrates how increasing the number of uncertainty levels in a task increases meta-uncertainty. If an inability to maintain stable criteria were instead a source of lower metacognitive ability, increasing the number of confidence response levels on the rating scale used by subjects should lower metacognitive ability (and increase meta-uncertainty estimated from the CASANDRE model). We see no evidence for this prediction when rearranging the estimated meta-uncertainty across tasks according to the number of confidence response levels (Supplementary Fig. 8d).