Sensory Noise Increases Metacognitive Efficiency

Metacognitive efficiency quantifies people’s ability to introspect into their own decision making relative to their ability to perform the primary task. Despite years of research, it is still unclear how visual metacognitive efficiency can be manipulated. Here, we show that a hierarchical model of confidence generation makes a counterintuitive prediction: Higher sensory noise should increase metacognitive efficiency. The reason for this is that hierarchical models assume that although the primary decision is corrupted only by sensory noise, the confidence judgment is corrupted by both sensory and metacognitive noise. Therefore, increasing sensory noise has a smaller negative influence on the confidence judgment than on the perceptual decision, resulting in increased metacognitive efficiency. To test this prediction, we used a perceptual learning paradigm to decrease sensory noise. In Experiment 1, 7 days of training led to a significant decrease in sensory noise and a corresponding decrease in metacognitive efficiency. Experiment 2 showed the same effect in a brief 97-trial learning for each of 2 different tasks. Finally, in Experiment 3, we combined increasingly dissimilar stimulus contrasts to create conditions with higher sensory noise and observed a corresponding increase in metacognitive efficiency. Our findings demonstrate the existence of a robust positive relationship between across-trial sensory noise and metacognitive efficiency. These results could not be captured by a standard model in which decision and confidence judgments are made based on the same underlying information. Thus, our study provides direct evidence for the existence of metacognitive noise that corrupts confidence but not the perceptual decision.


Introduction
When faced with difficult decisions, people not only make an informed choice but can also provide a metacognitive estimate of the likelihood that their response was correct (Metcalfe & Shimamura, 1994). This judgment is usually provided in the form of a confidence rating. The ability of confidence judgments to distinguish between correct and wrong answers determines the degree of visual metacognition.
High metacognitive scores suggest that confidence judgments are informative and should be trusted, while low scores suggest the opposite. Despite the importance of understanding when confidence judgments are particularly useful and when they are less so, the factors determining the quality of metacognition are still not understood.
Research into the determinants of visual metacognition has been hampered by existing measures of metacognition. Traditional metrics include the trial-to-trial Pearson correlation between confidence and accuracy (Nelson, 1984), the area under the Type 2 curve (Fleming, Weil, Nagy, Dolan, & Rees, 2010), and type-2 d' (Higham, Perfect, & Bruno, 2009). The quantities measured by all of these metrics increase trivially as stimulus sensitivity increases.
Consequently, such metrics are said to measure metacognitive sensitivity (Fleming & Lau, 2014): the quality of confidence ratings without regard for stimulus sensitivity.
Recently, Maniscalco and Lau (2012) developed a way to measure metacognitive efficiency (Fleming & Lau, 2014): the quality of confidence ratings normalized by stimulus sensitivity. Their method computes an index of metacognitive sensitivity, meta-d', that can then be divided by the level of stimulus sensitivity d'. The resulting metric is called Mratio. (Note that meta-d' can alternatively be normalized by subtracting d'; the resulting metric is called Mdiff.) By constructing a measure of metacognitive efficiency, the development of Mratio allows researchers to investigate metacognition independent of stimulus sensitivity.
Armed with a measure of metacognitive efficiency, we explored what factors influence metacognitive efficiency and whether it is possible to manipulate it experimentally. To do so, we turned to existing models of confidence generation.
Most current models assume that confidence is based on the exact same information used to make the perceptual decision (Fetsch, Kiani, Newsome, & Shadlen, 2014; Pouget, Drugowitsch, & Kepecs, 2016; Rahnev, Bahdo, de Lange, & Lau, 2012). These models predict that while higher stimulus sensitivity leads to higher metacognitive sensitivity, it results in constant metacognitive efficiency. However, several newer models have included an extra level of metacognitive noise that corrupts the confidence but not the decision judgments (Berg & Ma, 2016; De Martino, Fleming, Garrett, & Dolan, 2013; Jang, Wallsten, & Huber, 2012; Mueller & Weidemann, 2008; Rahnev, Nee, Riddle, Larson, & D'Esposito, 2016). We refer to these models as "hierarchical" models of confidence (Maniscalco & Lau, 2016) since they include two separate stages of noise corruption: the perceptual decision is corrupted by a first-level sensory noise, while the confidence rating is additionally corrupted by a second-level metacognitive noise (Figure 1A). Because the perceptual decision and confidence are based on different information, hierarchical models of confidence allow in principle for dissociations between metacognition and stimulus sensitivity, resulting in non-constant metacognitive efficiency. Still, there has been no theoretical or empirical work on how such dissociations can be achieved.
Here we report on a counterintuitive prediction of hierarchical models of confidence, namely that higher sensory noise should lower stimulus sensitivity but increase metacognitive efficiency. This prediction stems from the differential effect of sensory noise on stimulus and metacognitive sensitivity. Stimulus sensitivity is only corrupted by sensory noise, while metacognitive sensitivity is corrupted by both sensory and metacognitive noise. Therefore, increasing sensory noise is more detrimental to stimulus sensitivity than to metacognitive sensitivity, resulting in higher metacognitive efficiency. Mathematically, stimulus sensitivity d' is the ratio of the signal and sensory noise, while meta-d' is the ratio of the signal and a combination of sensory and metacognitive noise. Therefore, increasing sensory noise levels has a large negative effect on d' but a smaller negative effect on meta-d', thus leading to an increase in their ratio (that is, Mratio; Figure 1B; for a complete proof, see Methods). Importantly, a standard model based on signal detection theory (SDT), which lacks a separate metacognitive noise stage, predicts that metacognitive efficiency remains constant for different sensory noise levels (Figure 1C).
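The ratio argument above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes that d' equals the signal divided by the sensory noise and that meta-d' equals the signal divided by the square root of the summed sensory and metacognitive variances (that is, the two noise sources are treated as independent Gaussians). The function names and parameter values are illustrative.

```python
import math

def d_prime(mu, sigma_sens):
    # Stimulus sensitivity: signal divided by sensory noise.
    return mu / sigma_sens

def meta_d_prime(mu, sigma_sens, sigma_meta):
    # Metacognitive sensitivity: signal divided by the combined noise
    # (assumption: independent Gaussian noise sources add in variance).
    return mu / math.sqrt(sigma_sens ** 2 + sigma_meta ** 2)

def m_ratio(mu, sigma_sens, sigma_meta):
    # Metacognitive efficiency: meta-d' normalized by d'.
    return meta_d_prime(mu, sigma_sens, sigma_meta) / d_prime(mu, sigma_sens)

if __name__ == "__main__":
    # Increasing sensory noise lowers d' but raises Mratio when sigma_meta > 0.
    for sigma_sens in (0.5, 1.0, 2.0):
        print(sigma_sens,
              round(d_prime(1.0, sigma_sens), 3),
              round(m_ratio(1.0, sigma_sens, 1.0), 3))
```

Under these assumptions, setting `sigma_meta` to zero gives an Mratio of 1 at every sensory noise level, which corresponds to the constant efficiency predicted by the standard SDT model.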

Figure 1: Hierarchical model of confidence.
A. Generative model of confidence generation. At the stimulus level, two stimulus categories S1 and S2 (e.g., Gabor patches of counterclockwise and clockwise orientation) are presented. The stimuli are perfectly distinguishable. However, the internal representation at the decision level, rsens, is corrupted by Gaussian noise σsens and thus the two stimulus categories are not perfectly distinguishable at the time of the decision. The confidence judgment is then made at the meta level based on an internal response rmeta that is derived from rsens but is corrupted by additional noise σmeta. B. Depiction of the model predictions. Seven simulations with a gradually decreasing level of sensory noise, σsens, show a gradual increase in sensory sensitivity d' and confidence ratings (given on a 2-point scale such that high confidence was provided when the probability of being correct exceeded 70%), but a decrease in metacognitive efficiency Mratio. C. Depiction of predictions made by a standard model based on signal detection theory (SDT). The SDT-based model is equivalent to the hierarchical model but lacks a metacognitive noise stage. The same decrease in sensory noise leads to similar increases in sensory sensitivity and confidence, but no change in metacognitive efficiency.
We empirically tested and confirmed the hierarchical model's prediction that higher sensory noise leads to higher metacognitive efficiency. In two experiments, we used learning to decrease the level of sensory noise and observed a corresponding decrease in metacognitive efficiency. In a third experiment, we experimentally increased the level of sensory noise and found a corresponding increase in metacognitive efficiency. These results demonstrate that metacognitive efficiency depends on low-level stimulus characteristics and provide strong support for the existence of metacognitive noise assumed by hierarchical models of confidence.

Experiment 1: Perceptual learning decreases metacognitive efficiency
To test the counterintuitive prediction that decreasing sensory noise leads to lower metacognitive efficiency, we employed a perceptual learning paradigm. Twelve subjects participated in a 7-day training on a visual task. Subjects performed a 2-interval forced choice (2IFC) orientation detection task in which they indicated the interval (first or second) that contained a Gabor patch (Figure 2A). Stimulus intensity was adjusted using a 2-down-1-up staircase procedure that allowed us to determine subjects' intensity threshold.
Consistent with a decrease in sensory noise, training gradually decreased subjects' intensity threshold (t(11) = −5.28, p = .0003; one-sample t-test on the slope of change; Figure 2B). Next, we selected the same range of intensity values across all seven days of training (we used intensity values in the 35-65 percentile range; using larger percentile ranges produced similar results; see Supplementary Results). When considering only this range of intensity values, we observed that training increased stimulus sensitivity d' (t(11) = 5.2, p = .0003; Figure 2B) as well as average confidence (t(11) = 2.43, p = .034; Figure 2B).
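The "one-sample t-test on the slope of change" can be illustrated as follows. This is a minimal sketch under the assumption that a least-squares slope across sessions is computed for each subject and the resulting slopes are tested against zero; the function names are hypothetical and the authors' exact procedure may differ.

```python
import math
from statistics import mean, stdev

def slope(values):
    # Least-squares slope of the values against session index 1..n.
    n = len(values)
    xs = range(1, n + 1)
    x_bar, y_bar = mean(xs), mean(values)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

def one_sample_t(samples):
    # t statistic against zero; degrees of freedom are len(samples) - 1.
    return mean(samples) / (stdev(samples) / math.sqrt(len(samples)))

def slope_t_test(per_subject_series):
    # One slope per subject, then a single t statistic across subjects.
    return one_sample_t([slope(series) for series in per_subject_series])
```

With twelve subjects this yields a t statistic with 11 degrees of freedom, matching the form of the statistics reported above.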
Critically, as predicted by our hierarchical model of confidence, the decreased sensory noise also resulted in decreased metacognitive efficiency Mratio (t(11) = −3.06, p = .011; Figure 2B). The same effect was also present for the alternative measure of metacognitive efficiency Mdiff (= meta-d' − d'; t(11) = −2.99, p = .012). Note that while this effect was predicted by our hierarchical model (Figure 1B), it cannot be accounted for by a standard model with no metacognitive noise (Figure 1C).
Further, we examined whether the Mratio decrease was indeed due to the decrease in sensory noise or to some nonspecific effect of training. We found that subjects who showed a larger decrease in Mratio also exhibited a larger decrease in intensity threshold (r = .62, p = .03; Figure 2C) and a larger increase in d' values (r = −.74, p = .005; Figure 2D), thus indicating that the Mratio decrease is directly related to the change in performance on the perceptual task.
Further, one may worry that Mratio has an intrinsic negative relationship with stimulus sensitivity d' independent of sensory noise. To check for this possibility, we computed d' and Mratio across all seven sessions for the lower vs. upper half of intensities used. We found that higher intensities led to a significantly higher d' (average d' = 2.85 and 0.82 for the upper and lower intensity halves, respectively; t(11) = 46.23, p = $5.9 \times 10^{-14}$) but did not affect Mratio (average Mratio = .98 vs. 1.02 for the upper and lower intensity halves, respectively; t(11) = −.38, p = .71; Figure 2E). Thus, the training-induced decrease in Mratio cannot be explained as trivially arising from the corresponding d' increase.
Experiment 2: Brief learning leads to lower across-subject metacognitive efficiency
Experiment 1 provided strong support for a causal link between decreased sensory noise and decreased metacognitive efficiency. It employed a standard perceptual learning design with extensive training over a number of days. In Experiment 2 we tested whether a much shorter learning period can also lead to decreased metacognitive efficiency. To this end, we recruited a large number of subjects (N = 178) to complete 97 trials of two different perceptual tasks. Critically, we inverted our analyses: rather than combining many trials for each subject (the standard way of analyzing psychophysics data), we combined the data across subjects for a given trial (Figure 3A). This approach allowed us to track the evolution of across-subject performance in terms of both stimulus sensitivity and metacognitive efficiency.
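The inverted, trial-based analysis can be sketched as follows for the d' measure. This is an illustrative sketch rather than the authors' implementation: it assumes binary stimulus and response codes, computes a hit rate and a false alarm rate across subjects for each trial number, and clips extreme rates (the clipping rule is an assumption added to keep the z-transform finite).

```python
from statistics import NormalDist

def _clip(rate, n):
    # Keep rates away from 0 and 1 so the inverse normal CDF stays finite.
    return min(max(rate, 0.5 / n), 1 - 0.5 / n)

def trial_based_dprime(stimuli, responses):
    """stimuli[s][t] and responses[s][t] are 0 or 1 for subject s, trial t.

    Returns one d' per trial number, pooling across subjects.
    Assumes every trial number has at least one subject per stimulus class.
    """
    z = NormalDist().inv_cdf
    n_trials = len(stimuli[0])
    dprimes = []
    for t in range(n_trials):
        resp_s2 = [r[t] for s, r in zip(stimuli, responses) if s[t] == 1]
        resp_s1 = [r[t] for s, r in zip(stimuli, responses) if s[t] == 0]
        hit_rate = _clip(sum(resp_s2) / len(resp_s2), len(resp_s2))
        fa_rate = _clip(sum(resp_s1) / len(resp_s1), len(resp_s1))
        dprimes.append(z(hit_rate) - z(fa_rate))
    return dprimes
```

The same pooling logic would apply to Mratio, with the confidence ratings of all subjects on a given trial number entering a single meta-d' fit.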
Subjects engaged in coarse discrimination of low-contrast Gabor patches (Figure 3B) and fine discrimination of high-contrast Gabor patches (Figure 3C).

Figure 3: Visual training decreases across-subject metacognitive efficiency. A. Depiction of standard subject-based analysis techniques (which depend on considering all data for a given subject) and trial-based analysis (which depends on considering all data for a given trial number). We investigated the evolution of the trial-based d' and Mratio. B-C. Depictions of the two tasks. Subjects indicated the tilt (clockwise or counterclockwise from vertical) of a Gabor patch and provided a confidence rating on a 4-point scale. In the coarse discrimination task (B), the stimulus was a Gabor patch of low contrast but large tilt (+/−45°). In the fine discrimination task (C), the stimulus was a Gabor patch of high contrast but small tilt. D-E. Practice resulted in a gradual increase in stimulus sensitivity d' but a decrease in Mratio. Both of these effects were larger for the coarse (D) compared to the fine (E) discrimination task. The timecourses are smoothed with an 11-point moving window for display purposes.
As can be seen in Figures 3D-E, the learning rate was different for the two tasks. Indeed, the d' increase was steeper for the coarse discrimination than for the fine discrimination task (t(190) = 2.53, p = .01). Importantly, we observed a corresponding effect in Mratio, which showed a steeper decrease for the coarse than the fine discrimination task (t(190) = −2.85, p = .005), suggesting a direct relationship between the amount of learning and the decrease in metacognitive efficiency. All effects pertaining to Mratio remained significant with the alternative measure of metacognitive efficiency Mdiff.
Experiment 3: Experimentally increasing sensory noise leads to higher metacognitive efficiency
The results of Experiments 1 and 2 lend strong support to the notion that a training-induced decrease in sensory noise leads to a corresponding decrease in metacognitive efficiency. Nevertheless, it remains possible that the results of both experiments depended on the use of training and that other manipulations of sensory noise would not produce equivalent results.
To investigate the influence of sensory noise independent of visual training, in Experiment 3 we manipulated the level of sensory noise directly. To do so, we used three levels of contrast and combined them in different ways to construct four conditions that vary in the amount of trial-to-trial variability in the perceptual signal. Twelve subjects performed a Gabor patch orientation discrimination task (Figure 4A) and completed 4,200 trials over the course of three testing days. The Gabor patches were presented with three different levels of contrast. By combining more and more dissimilar contrasts in the same analysis, we constructed four different levels of increasing across-trial stimulus variability (Figure 4B). We found that higher levels of stimulus variability led to a decreased d' (t(11) = 4.53, p = .0009; Figure 4C). This result may appear surprising since the different conditions consisted largely of the same actual trials that were simply combined in different ways. The robust but relatively modest decrease in d' can be explained by the non-linear relationship between accuracy and d' (a detailed explanation can be found in Supplementary Figure 1). Indeed, both our hierarchical and an SDT-based model (see Figure 1C) could capture this decrease (Figure 4C).
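The non-linear relationship between accuracy and d' can be illustrated with a short calculation. The sketch below is an assumption-laden illustration, not the analysis from Supplementary Figure 1: it assumes an unbiased equal-variance observer (so that proportion correct equals Φ(d'/2)), equal trial counts per contrast, and made-up d' values. Pooling trials averages accuracy rather than d', and because Φ is concave for positive arguments, the pooled d' falls below the mean of the individual d' values.

```python
from statistics import NormalDist, mean

_norm = NormalDist()

def accuracy_from_dprime(d):
    # Unbiased equal-variance observer: proportion correct = Phi(d'/2).
    return _norm.cdf(d / 2)

def dprime_from_accuracy(acc):
    # Inverse of the mapping above.
    return 2 * _norm.inv_cdf(acc)

def pooled_dprime(dprimes):
    # d' obtained when trials from several contrasts are analyzed together
    # (assumption: the same number of trials at each contrast).
    pooled_accuracy = mean(accuracy_from_dprime(d) for d in dprimes)
    return dprime_from_accuracy(pooled_accuracy)

if __name__ == "__main__":
    print(round(pooled_dprime([2.0, 2.0, 2.0]), 3))  # identical contrasts
    print(round(pooled_dprime([1.0, 2.0, 3.0]), 3))  # dissimilar contrasts
```

In this toy case both sets of conditions have a mean d' of 2, but the set with dissimilar contrasts yields a lower pooled d', in the same direction as the modest decrease reported above.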

Figure 4: Experimentally increasing sensory noise increases metacognitive efficiency. A. Subjects indicated the tilt (clockwise or counterclockwise from vertical) of a noisy Gabor patch and provided a confidence rating (on a 2-point scale) using a
Critically, higher levels of across-trial stimulus variability led to an increased Mratio (t(11) = 6.21, p = .00007; Figure 4D; the same effect was also observed for Mdiff, t(11) = 5.85, p = .0001). This effect was quantitatively accounted for by our hierarchical model but not by the standard SDT model (Figure 4D).

... and can be enhanced pharmacologically via noradrenaline blockade (Hauser et al., 2017). All of these previous findings rely on taxing subjects' "resources" for metacognition. Our findings demonstrate that hierarchical models of confidence can also be used to predict how metacognitive efficiency depends on low-level stimulus characteristics independent of high-level resources.
We modeled the effects of visual perceptual learning as a simple decrease in sensory noise. There is indeed ample evidence that perceptual learning leads to noise attenuation (B. A. Dosher & Lu, 1998, 1999).
Our finding of a positive relationship between sensory noise and metacognitive efficiency raises the question as to how metacognitive scores should be interpreted. Influential theories posit that metacognition stems from second-order monitoring processes (Shimamura, 2000). The contents of these second-order metacognitive processes are often assumed to reflect the contents of consciousness (Kunimoto, Miller, & Pashler, 2001; Persaud et al., 2011). However, our results demonstrate that while metacognitive judgments may indeed be related to consciousness, they cannot generally be used as a direct measure of consciousness (Jachs, Blanco, Grantham-Hill, & Soto, 2015). Indeed, perceptual learning has been argued to increase consciousness (Schwiedrzik et al., 2011) but, as seen here, decreases metacognitive efficiency. We see metacognitive scores as invaluable in constructing and testing models of decision making but remain agnostic about their relationship to constructs such as consciousness and working memory.
An important question for future research is whether metacognitive efficiency can be trained. Given that subjects completed the same metacognitive task for seven days, one may expect that their metacognitive noise would decrease. Our design did not allow us to separate the effects of training on sensory and metacognitive noise, but given the decrease in metacognitive efficiency, putative decreases in metacognitive noise must have been small. Importantly, we did not include trial-to-trial feedback; such feedback may be more important for decreasing metacognitive noise than for decreasing sensory noise.
In conclusion, we showed the existence of a robust positive relationship between the level of sensory noise and metacognitive efficiency. These results point to the existence of independent metacognitive noise and have strong implications about the meaning and interpretation of metacognitive efficiency.

Methods

Experiment 1
Subjects performed a 2-interval forced choice (2IFC) orientation detection task. Two stimuli were shown in quick succession and subjects indicated the interval (first or second) that contained the target (Figure 2A). The target was a Gabor patch of a particular orientation (circular diameter = 5°, standard deviation of Gaussian filter = 2.5°, spatial frequency = 1 cycle/degree, random spatial phase). Noise generated from a sinusoidal luminance distribution was superimposed on the Gabor patch. We varied stimulus intensity by controlling the ratio of noise pixels. The non-target consisted of the superimposed noise only. The target interval was determined randomly on each trial. The center of the Gabor patch was positioned 4° away from the center of the screen in a direction of 45° toward either lower left or lower right.

Subjects
Each trial started with a 500-ms fixation period. The two stimulus intervals lasted 50 ms each, separated by a 300-ms blank period (Figure 2A). Subjects were asked to make two responses: first, to indicate the target interval, and second, to indicate their confidence level. Once the first response was made, the central fixation dot changed color from white to green to signal that the response had been recorded and to cue the need to make a second response. Subjects indicated their confidence using a 4-point scale.
We trained subjects on a specific visual quadrant (either lower left or lower right) and a specific orientation (either 10° or 70°). The trained quadrant and orientation were determined randomly for each subject. Sessions 1 and 7 included testing on the untrained quadrant and orientation (data not reported here). Subjects completed 12 blocks of trials. Each block involved a 2-down 1-up staircase procedure that continuously adjusted the stimulus intensity and terminated after 10 reversals. The intensity threshold for each block was calculated as the geometric mean of the last six reversals per block. In sessions 2-6, all 12 blocks came from the trained condition, while in sessions 1 and 7, four blocks were presented from each of the trained and two untrained conditions (in a randomized order). To keep the sessions as equivalent as possible, data analyses included all four blocks from the trained condition in sessions 1 and 7, as well as the first four blocks in sessions 2-6.
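The staircase logic described above can be sketched as follows. This is a minimal illustration, not the experiment code: the multiplicative step size, the starting intensity, the `max_trials` safety limit, and the `respond` callback (which returns whether the observer was correct at a given intensity) are all assumptions introduced for the example.

```python
import math

def run_staircase(respond, start=1.0, step=0.9, n_reversals=10, max_trials=10000):
    """2-down 1-up staircase: harder after two consecutive correct responses,
    easier after each error. Returns the intensities at which direction reversed."""
    intensity = start
    streak = 0            # consecutive correct responses at this level
    last_direction = 0    # -1 = intensity went down, +1 = went up, 0 = none yet
    reversals = []
    for _ in range(max_trials):
        if len(reversals) >= n_reversals:
            break
        if respond(intensity):
            streak += 1
            if streak < 2:
                continue          # need a second correct response before stepping down
            direction = -1
        else:
            direction = +1
        streak = 0
        if last_direction and direction != last_direction:
            reversals.append(intensity)
        last_direction = direction
        intensity = intensity * step if direction < 0 else intensity / step
    return reversals

def block_threshold(reversals, n_last=6):
    # Geometric mean of the last n_last reversal intensities.
    last = reversals[-n_last:]
    return math.exp(sum(math.log(x) for x in last) / len(last))
```

As a usage example, `block_threshold(run_staircase(lambda x: x >= 0.5))` runs the procedure for a hypothetical deterministic observer who is correct whenever intensity is at least 0.5; the estimate settles near that value.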

Experiment 2
Subjects performed two separate tasks, coarse and fine discrimination, that involved discrimination between clockwise and counterclockwise oriented Gabor patches (circular diameter = 1.91°). In the coarse discrimination task (Figure 3B), the stimulus was a Gabor patch of large tilt (+/−45°) overlaid on a noisy background composed of uniformly distributed intensity values. In the fine discrimination task (Figure 3C), the stimulus was a Gabor patch of small tilt (less than 1°) presented without any additional noise.
Each trial started with a fixation cross appearing at the center of the screen. The first trial of each block was preceded by a longer fixation period of two seconds to allow the subjects time to focus. All other trials had a variable fixation period that was sampled from a uniform distribution with a range of 300-700 ms. The stimulus was then presented for 500 ms. Once the Gabor patch disappeared, subjects were asked to make two responses using their keyboard: first to indicate the tilt of the stimulus and second to rate their confidence on a 4-point scale.
We collected data from three batches of 50 subjects and one batch of 51 subjects. In order to ensure similar average performance on both tasks, we varied the difficulty of each task across the batches. For the coarse discrimination task, difficulty was manipulated by adjusting the contrast level (mean contrast = 5.25%, SD = 0.7%). For the fine discrimination task, difficulty was manipulated by changing the offset from the vertical (mean = 0.69°, SD = 0.09°). Average accuracy was 76.44% for the coarse discrimination task and 74.12% for the fine discrimination task.
Subjects had to complete a total of 100 trials of each task. Each task was divided into five blocks of 20 trials each. Subjects were allowed to take breaks between each block and the order of the tasks was randomized across subjects.
To ensure high data quality, we included six attention check trials, three in each task. These trials were designed to be much easier than the regular trials (contrast for the coarse discrimination task = 15%, offset for the fine discrimination task = 5°) and subjects paying attention to the task were expected to have a high degree of accuracy on such trials. Therefore, we excluded subjects who responded incorrectly to more than two out of six catch trials (total 15 excluded). Additionally, we excluded subjects whose performance was close to chance level (< 55% correct) on the non-catch trials of either task (additional 8 subjects excluded). These criteria led to the exclusion of a total of 23 of the initial 201 subjects (11% exclusion rate). Note that the final analyses were based only on the 97 non-catch trials per task.
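The two exclusion rules can be written compactly. The sketch below is illustrative only; the function and argument names are hypothetical, and accuracies are assumed to be proportions computed on the non-catch trials.

```python
def keep_subject(catch_errors, acc_coarse, acc_fine,
                 max_catch_errors=2, min_accuracy=0.55):
    """Return True if a subject passes both inclusion criteria."""
    # Rule 1: more than two errors on the six catch trials leads to exclusion.
    if catch_errors > max_catch_errors:
        return False
    # Rule 2: accuracy below 55% on either task leads to exclusion.
    return acc_coarse >= min_accuracy and acc_fine >= min_accuracy

if __name__ == "__main__":
    print(keep_subject(catch_errors=1, acc_coarse=0.76, acc_fine=0.74))  # retained
    print(keep_subject(catch_errors=3, acc_coarse=0.76, acc_fine=0.74))  # excluded
```

Applied in this order, the reported counts are consistent: 201 initial subjects minus 15 and then 8 exclusions leaves the 178 subjects analyzed.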
The Gabor stimuli were generated online via in-house code written in JavaScript and the experiment was designed using the jsPsych 5.0.3 library. To account for variability in the resolution and size of screens across subjects, subjects were asked to adjust the size of images of real-life objects displayed on the computer screen to match their dimensions to the actual objects. This calibration ensured that the size of the stimulus displayed was uniform across different screens.

Experiment 3
This study was originally reported as Experiment 2 in Rahnev et al. (2013). All study details can be found in the original publication. Briefly, subjects' task was to indicate the tilt (clockwise or counterclockwise) of a grating presented at fixation. Each trial began with 50 ms presentation of the grating followed by a fixation period of 200 ms ( Figure 4A). On each trial, the orientation of the grating was randomly selected to be tilted 10° clockwise or 10° counterclockwise away from vertical. The grating pattern was presented on an annulus (inner circle radius: 1.5°, outer circle radius: 4.5°) region. The stimulus consisted of a noisy background composed of uniformly distributed intensity values on top of which we added a grating (0.5 cycles/degree).
Subjects were required to fixate on a small white square for the duration of the experiment. They were seated in a dim room 50 cm away from a computer monitor.
After each stimulus presentation, subjects used one of four keys to give their response indicating the perceived orientation of the grating and a wager on whether they were correct. Subjects used the keys 1---4 indicating "certainly left", "guess left", "guess right", and "certainly right," respectively. A correct "certain" (i.e., high confidence) choice was awarded with two points while a correct "guess" (i.e., low confidence) choice was awarded with one point. An incorrect "guess" (i.e., low confidence) choice resulted in no points being won or lost but an incorrect "certain" (i.e., high confidence) choice resulted in a loss of two points. We chose this point structure to ensure that subjects gave a sufficient number of both "guess" and "certain" responses. The optimal strategy for this payoff structure was to choose the "certain" choice only when the probability of being correct exceeded 66.7%. We informed subjects of this contingency in order to guarantee that all subjects were aware of the optimal strategy. To further encourage optimal usage of the wagers, we gave the two subjects with highest final scores an additional cash prize. Since the wagers that subjects used were a proxy for their confidence on each trial, for simplicity we refer to the wagers as confidence ratings in the rest of the manuscript.
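The 66.7% cutoff follows directly from the payoff structure. The short sketch below computes the expected points for each wager from the stated payoffs (+2 / −2 for "certain", +1 / 0 for "guess"); the function names are illustrative. Setting the two expectations equal, 4p − 2 = p, gives p = 2/3.

```python
def expected_points(p_correct, certain):
    """Expected points for one trial given the probability of being correct."""
    if certain:
        # +2 points if correct, -2 points if incorrect.
        return 2 * p_correct - 2 * (1 - p_correct)
    # +1 point if correct, 0 points if incorrect.
    return 1 * p_correct

def best_wager(p_correct):
    # Choose "certain" only when it has the strictly higher expected payoff.
    if expected_points(p_correct, True) > expected_points(p_correct, False):
        return "certain"
    return "guess"
```

For example, `best_wager(0.7)` selects the "certain" response while `best_wager(0.6)` selects the "guess" response.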
Each trial lasted for two seconds. Subjects had 1.8 seconds to give their response after the onset of the stimulus. Once a response was given, the text indicating the four possible answers disappeared and the next trial started. If a response was not given in the 1.8-second period, subjects were penalized by a subtraction of four points and the text was removed at the end of the 1.8-second period in order to avoid any potential interference with the processing of the stimulus in the next trial.
The study consisted of four days: one day of training and three days of testing. In the initial training session on day 1, subjects practiced with the task over the course of five blocks of 120 trials each. Days 2-4 involved theta burst stimulation (TBS) to three different brain areas (visual cortex, Pz, and sham). TBS had a modest effect on subjects' performance (reported in the original publication). Here we combined all sessions regardless of TBS condition in order to increase the power of our analyses, which were orthogonal to the TBS effects. Based on the results of the training session on day 1, we chose a grating contrast for each subject that would produce 80% correct responses. However, we included two more levels of contrast: 75% and 125% of the above contrast. These three contrast levels were used on days 2-4 without further adjustments even if performance deviated from the 80% correct target for the middle contrast. Contrast level was chosen randomly on each trial and subjects were not explicitly informed about the presence of multiple contrast levels.
In each session, subjects completed five blocks of 140 trials each for a total of 4,200 trials. Note that the original publication excluded three of the subjects because they did not see phosphenes. These subjects were included here.

Analyses
To determine observers' performance on the task, we computed the signal detection theory (SDT) measure d' (a measure of stimulus sensitivity) by calculating the hit rate (HR) and false alarm rate (FAR):

$$d' = \Phi^{-1}(\mathrm{HR}) - \Phi^{-1}(\mathrm{FAR}) \qquad (1)$$

where $\Phi^{-1}$ is the inverse of the cumulative standard normal distribution.

... such that the distance between the two distributions was $\mu$. Note that the SDT parameter d' can then be expressed as:

$$d' = \frac{\mu}{\sigma_{sens}} \qquad (2)$$

Perceptual decisions were modeled by specifying a decision criterion $c_0$ and confidence criteria $c_{-n}, c_{-n+1}, \ldots, c_{-1}, c_1, \ldots, c_{n-1}, c_n$, where n is the number of confidence ratings. Importantly, the criteria $c_{-n}, c_{-n+1}, \ldots, c_n$ were constrained to be monotonically increasing with $c_{-n} = -\infty$ and $c_n = \infty$. Counterclockwise (clockwise) decisions were made based on whether the internal response $r_{sens}$ was smaller (larger) than $c_0$. Confidence responses were given such that an internal response $r_{sens}$ falling in the interval $[c_i, c_{i+1})$ resulted in a confidence of $i + 1$ when $i \geq 0$, and of $-i$ when $i \leq -1$.
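The d' computation and the criterion-based response rule can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes the standard z-transform definition of d', and it represents the finite criteria as a sorted list running from the lowest confidence criterion through the decision criterion to the highest confidence criterion, with the infinite outer bounds left implicit. The function names are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    # z(HR) - z(FAR), where z is the inverse standard normal CDF.
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

def decide(r_sens, criteria):
    """Map an internal response to a choice and a confidence rating.

    criteria: sorted finite criteria of length 2n - 1, where n is the number
    of confidence levels; the middle element is the decision criterion.
    """
    n = (len(criteria) + 1) // 2
    # Index i of the interval [c_i, c_(i+1)) that contains r_sens.
    i = bisect_right(criteria, r_sens) - n
    choice = "clockwise" if i >= 0 else "counterclockwise"
    confidence = i + 1 if i >= 0 else -i
    return choice, confidence
```

With a 2-point confidence scale and the illustrative criteria `[-1, 0, 1]`, an internal response of 0.5 maps to a low-confidence clockwise response and a response of −1.5 maps to a high-confidence counterclockwise response.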
The hierarchical model was constructed similarly but with the important addition of an extra layer of noise. The perceptual decision (about stimulus orientation) was made just as in the standard model described above. However, the confidence judgment was made on the internal signal at a metacognitive stage that was additionally corrupted by Gaussian noise with standard deviation $\sigma_{meta}$, such that the signal at the metacognitive stage was given by $r_{meta} \sim N(r_{sens}, \sigma_{meta}^2)$. The confidence response was made equivalently to the standard SDT model. However, in cases in which $r_{sens}$ and $r_{meta}$ fell on different sides of the decision criterion $c_0$, confidence was constrained to always equal 1.
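A small simulation can make the two-stage structure concrete. The sketch below is an illustration under stated assumptions, not the authors' simulation: stimuli are placed at ±μ/2, the decision criterion is fixed at 0, confidence is given on a 2-point scale with a single symmetric confidence criterion (`conf_criterion`, a hypothetical parameter), and the summary statistic is the difference in mean confidence between correct and incorrect trials rather than a fitted meta-d'.

```python
import random
from statistics import mean

def simulate(mu, sigma_sens, sigma_meta, conf_criterion=1.0,
             n_trials=20000, seed=0):
    """Return (accuracy, confidence gap between correct and error trials)."""
    rng = random.Random(seed)
    conf_correct, conf_error = [], []
    for _ in range(n_trials):
        stim = rng.choice((-1, 1))                 # -1: counterclockwise, +1: clockwise
        r_sens = rng.gauss(stim * mu / 2, sigma_sens)
        r_meta = r_sens + sigma_meta * rng.gauss(0.0, 1.0)
        choice = 1 if r_sens > 0 else -1           # decision uses r_sens only
        if (r_meta > 0) != (r_sens > 0):
            confidence = 1                         # opposite sides of the criterion
        else:
            confidence = 2 if abs(r_meta) > conf_criterion else 1
        (conf_correct if choice == stim else conf_error).append(confidence)
    accuracy = len(conf_correct) / n_trials
    return accuracy, mean(conf_correct) - mean(conf_error)
```

Because the decision depends only on the sensory-stage response, running the function with the same seed and different values of `sigma_meta` leaves accuracy unchanged while the confidence gap shrinks, which is the dissociation the hierarchical model relies on.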
The seven simulations shown in Figure 1B

Prediction of hierarchical models of confidence
Here we give the simple mathematical proof for why hierarchical models of confidence predict that higher sensory noise would lead to higher metacognitive efficiency. As seen in Equation 2, stimulus sensitivity d' equals the ratio of the signal and noise present at the decision stage ($d' = \mu / \sigma_{sens}$). Equivalently, metacognitive sensitivity meta-d' equals the ratio of the signal and noise present at the metacognitive stage. According to our hierarchical model of confidence, the signal at the metacognitive stage is still $\mu$ but the noise is a combination of two Gaussian distributions with standard deviations of $\sigma_{sens}$ and $\sigma_{meta}$. Therefore, we can derive that:

$$\text{meta-}d' = \frac{\mu}{\sqrt{\sigma_{sens}^2 + \sigma_{meta}^2}} \qquad (3)$$

Combining Equations 2 and 3, we obtain:

$$M_{ratio} = \frac{\text{meta-}d'}{d'} = \frac{\sigma_{sens}}{\sqrt{\sigma_{sens}^2 + \sigma_{meta}^2}}$$

which is an increasing function of $\sigma_{sens}$. Therefore, as sensory noise $\sigma_{sens}$ increases, metacognitive efficiency Mratio increases as well.

Model fitting
To model the effect of stimulus contrast in Experiment 3, we set (L$( 6789:;<9 e = g , where C was set to .75, 1, and 1.25 for the three levels of contrast (since contrast levels were 75%, 100%, and 125% of the subject---specific contrast threshold). We do not claim that (L$( has a power relationship with contrast. However, since this relationship is not generally known, this way of specifying the relationship allowed us to capture any combination of the two ratios between the sensory noise corresponding to successive contrast levels. Importantly, the parameter was strongly correlated between the fits for the SDT and the hierarchical models (r = .77, p = .004), demonstrating that the superior fits of the hierarchical model were not due to an interaction between and the extra parameter aL%' . Finally, since three of the 12 subjects exhibited Mratio values larger than 1 in at least one condition, we included additional decision---level noise for them and applied it to both the hierarchical and the SDT models.
The SDT and hierarchical models were instantiated with four and five free parameters, respectively. Importantly, the signal corresponding to each contrast level was not treated as a free parameter but was computed directly from Equation 2 using the contrast-specific d' and sensory noise values. The standard SDT model thus had four free parameters: the parameter relating sensory noise to contrast and the criteria $c_{-1}$, $c_0$, and $c_1$ (since confidence was provided on a 2-point scale). The hierarchical model had five free parameters (the four from the SDT model and $\sigma_{\text{meta}}$). The criteria $c_i$ were constrained to be non-decreasing and $\sigma_{\text{meta}}$ was constrained to be ≥ 0.
We fit the models to the data as previously (Rahnev et al., 2011, 2013; Rahnev, Maniscalco, Luber, Lau, & Lisanby, 2012) using a maximum likelihood estimation approach. The models were fit to the full distribution of probabilities of each response type contingent on each stimulus type. Model fitting was done by finding the maximum-likelihood parameter values using simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983). Fitting was conducted separately for each subject's data by first running the fitting five times with a general starting parameter set, and then running the fitting five more times using a starting parameter set derived from the best fit from the previous stage. The best fitting model from the second stage was used for further analyses. The Akaike Information Criterion (AIC) was used for model comparison, though the results remained the same if the Bayesian Information Criterion (BIC) was used instead.
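To make the model comparison step concrete, the following sketch computes a multinomial log-likelihood from response counts and the standard AIC and BIC formulas. The response counts, the predicted probabilities, and the labels "model A" and "model B" are hypothetical placeholders for a four-parameter and a five-parameter model; they are not the study's data or fitted values, and the optimization step itself is not shown.

```python
import math

def log_likelihood(counts, probs):
    # Multinomial log-likelihood (dropping the constant combinatorial term)
    # of observed response counts given model-predicted probabilities.
    return sum(n * math.log(p) for n, p in zip(counts, probs) if n > 0)

def aic(log_lik, n_params):
    # Akaike Information Criterion: lower values indicate a better model.
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    # Bayesian Information Criterion: penalizes parameters more as n_obs grows.
    return n_params * math.log(n_obs) - 2 * log_lik

# Hypothetical counts for the four response types (2 choices x 2 confidence
# levels) given one stimulus type.
counts = [40, 30, 20, 10]
n_obs = sum(counts)

# Hypothetical predicted probabilities from two already-fitted models.
probs_a = [0.30, 0.30, 0.20, 0.20]   # model A, 4 free parameters
probs_b = [0.40, 0.30, 0.20, 0.10]   # model B, 5 free parameters

ll_a = log_likelihood(counts, probs_a)
ll_b = log_likelihood(counts, probs_b)

aic_a, aic_b = aic(ll_a, 4), aic(ll_b, 5)
bic_a, bic_b = bic(ll_a, 4, n_obs), bic(ll_b, 5, n_obs)
print(f"AIC: A = {aic_a:.2f}, B = {aic_b:.2f}")
print(f"BIC: A = {bic_a:.2f}, B = {bic_b:.2f}")
```

In a full analysis the log-likelihood would be summed over all stimulus types, and the probabilities would come from the parameter values returned by the optimizer.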

Data and code availability
All data and code for the analyses are freely available online at https://github.com/DobyRahnev/sensory_noise_metacognitive_efficiency.

Supplementary Results
In Experiment 1, we selected a limited range of stimulus intensity values for analyses in order to investigate how training affected stimulus sensitivity, confidence, and metacognitive efficiency. Specifically, we considered all stimulus intensity values used across the seven days and selected all intensities between the 35th and 65th percentiles. We used a relatively small window in order to avoid excessive stimulus variability. However, this choice may appear arbitrary. Therefore, we tested the robustness of our results to using larger ranges of intensity values. In three more analyses, we selected all intensities between the 30th and 70th, 20th and 80th, and 10th and 90th percentiles of all intensities. We found that when considering these intensity ranges, = .0223). Therefore, our results do not depend on the exact percentile selected.
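The percentile-based selection described above can be sketched as follows. This is only an illustration: the helper name `select_percentile_window`, the use of inclusive percentile interpolation, and the example intensity values are assumptions and may differ from how the original analysis computed its percentiles.

```python
from statistics import quantiles

def select_percentile_window(intensities, lower_pct, upper_pct):
    # Compute the 1st..99th percentile cut points of all intensity values,
    # then keep only the values lying between the two requested percentiles.
    cuts = quantiles(intensities, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [v for v in intensities if low <= v <= high]

# Hypothetical intensity values, used only to show the mechanics.
intensities = list(range(1, 102))

window = select_percentile_window(intensities, 35, 65)   # main analysis window
wider = select_percentile_window(intensities, 10, 90)    # widest robustness window
print(len(window), len(wider))
```

Widening the window keeps every trial from the narrower one and adds more extreme intensities, which is why the wider ranges serve as a robustness check on the main analysis.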

Supplementary Figures
Supplementary Figure 1. A graphical explanation of why the d' value of two combined conditions is lower than the average d' value of those conditions. The figure shows the receiver operating characteristic (ROC) curves for d' of 0, 1, 2, 3, 4, and 5. The ROC curve plots the false alarm rate (FAR) on the x axis vs. the hit rate (HR) on the y axis. Assume that an unbiased observer (who chooses each stimulus category equally often) shows stimulus sensitivity values of d' = 1 and d' = 3 in two different conditions. Such performance would result in the red dots marked on the graph, lying on the diagonal perpendicular to the line of d' = 0. Assuming that the two conditions had an equal number of trials from each category, their combination would result in a point on the ROC curve (marked in blue) lying exactly midway between the two red dots. As can be seen in the graph, this point corresponds to d' of 1.773, which is lower than 2, the average of 1 and 3. Mathematically, the distance from the d' = 0 line, scaled to lie between 0 and 1 (which we can denote with $x$), is

$$x = 2\Phi\left(\frac{d'}{2}\right) - 1 = \operatorname{erf}\left(\frac{d'}{2\sqrt{2}}\right)$$

where $\Phi$ is the cumulative normal distribution, and $\operatorname{erf}$ signifies the error function. Therefore, $d' = 2\sqrt{2}\,\operatorname{erf}^{-1}(x)$ and since $\operatorname{erf}^{-1}$ is convex in [0, 1], then

$$\operatorname{erf}^{-1}\left(\frac{x_1 + x_2}{2}\right) < \frac{\operatorname{erf}^{-1}(x_1) + \operatorname{erf}^{-1}(x_2)}{2}$$

for $0 \le x_1 < x_2 \le 1$, which means that the d' of two combined conditions is lower than the average d' of each of those conditions. The same arguments hold even if the observer is not unbiased and thus points on the ROC curve do not lie on the diagonal perpendicular to the line of d' = 0.

Metacognitive efficiency