Improvement of association between confidence and accuracy after integration of discrete evidence over time

When making decisions in real life, we may receive discrete pieces of evidence over a period of time. Although subjects are able to integrate information from separate cues to improve their accuracy, how confidence is formed under these conditions remains controversial. Given the strong positive relation between accuracy and confidence, we predicted that confidence would follow the same characteristics as accuracy and would improve following the integration of information collected from separate cues. We applied a random-dot-motion discrimination task in which participants had to indicate the predominant direction of dot motion by a saccadic eye movement after receiving one or two brief stimuli (i.e., pulse(s)). The interval between the two pulses (up to 1 s) was selected randomly. Color-coded targets allowed participants to indicate their confidence simultaneously. Using behavioral data, computational models, pupillometry and EEG methodology, we show that in double-pulse trials: (i) participants improve their confidence resolution rather than reporting higher confidence compared with single-pulse trials, and (ii) the observed confidence follows neural and pupillometry markers of confidence, unlike in weak and brief single-pulse trials. Overall, our study showed an improvement in the association between confidence and accuracy in decisions resulting from the integration of evidence separated by different temporal gaps.


1 Introduction

Humans and animals can both make choices based on multiple discrete pieces of information. Imagine that a large bus is passing between you and a faraway car as you cross the street. In this situation, simply by collecting discrete pieces of information about the car's position through the windows of the bus, you can decide whether the car is moving toward or away from you. In this scenario, as the number of pieces of information increases, the interpretation of the car's direction improves. Indeed, research has shown that the accuracy of decisions can be significantly improved by integrating information from separate cues (Kiani, Churchland, & Shadlen, 2013 2009) properties of confidence and its association with perceptual decision-making. However, how confidence is established within a discrete environment is still unclear.

According to the leading computational approach in perceptual decision-making (Gold & Shadlen, 2007; Shadlen & Kiani, 2013), a decision is made when the accumulated evidence for one option, called a decision variable (DV), crosses a threshold or boundary. In addition, confidence is summarized by the probability that a decision relying on the DV is correct ( To test the hypothetical relation between accuracy and confidence in binary decisions, signal detection theory (SDT) provides a method to characterize how well observers report their confidence ratings by introducing metacognitive sensitivity and efficiency (Figure 1B; Fleming, 2017; Maniscalco & Lau, 2012). In fact, for years, SDT has provided a simple yet powerful methodology to distinguish between an observer's ability to categorize the stimulus and the behavioral response (Green & Swets, 1966), and to determine confidence resolution.

Moreover, levels of confidence can be tracked by behavioral, neural and pupillometry signatures.
Higher confidence is accompanied by faster and more accurate decisions ( Figure 1A). Each target was shaped as a gradient rectangle (9° length and 0.5° width). After a variable duration of 200-500 ms (truncated exponential distribution), the RDM was presented. Participants had to indicate their choice after receiving one or two motion pulses of 120 ms each. The gap interval of double-pulse trials was selected randomly from 0, 120, 360, and 1080 ms. On single-pulse trials, motion coherence was randomly selected from six values: 0%, 3.2%, 6.4%, 12.8%, 25.6%, and 51.2%, whereas on double-pulse trials, the motion coherence of each pulse was randomly chosen from three values: 3.2%, 6.4%, and 12.8%. Both pulses had the same net direction of motion, and participants were aware of this. In total, there were 6 single-pulse and 9 × 4 double-pulse trial types. After the offset of one or two motion pulses, a 400 to 1000 ms delay period (truncated exponential) was imposed before the Go signal appeared on the screen. In each trial, participants were required to indicate their response by directing their gaze to one of the targets, with the upper extreme of each target representing full decision confidence and the lower extreme representing guessing (Figure 1A). To keep the conditions approximately balanced across trials, we constructed a list of all possible combinations of motion coherences and gaps. Then, we shuffled the listed conditions and assigned them randomly to the trials in each block. Participants were instructed to achieve high performance. Distinctive auditory feedback (beep tones) was provided for correct and incorrect responses. The type of feedback on 0% coherence trials was selected randomly from a uniform distribution. In Experiment 1, each participant performed the task across multiple blocks on different days (12-20 blocks). Experiment 2 used the same paradigm as Experiment 1.
All stimulus variables remained constant, except that in Experiment 2 the EEG data were also recorded. In Experiment 2, each participant completed a session of 4-5 blocks.

Figure 1. Task paradigm and Signal Detection Theory. (A) Participants had to indicate the predominant direction of motion of moving dots (left or right) by a saccadic eye movement to one of the targets after receiving one or two pulse(s) of 120 ms stimulus. The intervals between the two pulses were selected randomly from 0 to 1080 ms, and the direction of both pulses was the same. Color-coded targets enabled participants to indicate their confidence simultaneously. (B) On each trial, a stimulus generates an internal response $x$ within an observer, who must use $x$ to decide whether the stimulus is

$x$ is drawn from a normal distribution. The distance between these distributions is $d'$, which measures the observer's ability to discriminate
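
As a minimal sketch of the condition list described in the task paragraph above (6 single-pulse types plus 9 × 4 double-pulse types, shuffled within a block), the Python below enumerates and shuffles the trial types. The function names, the tuple layout, and the `n_repeats` and `seed` parameters are illustrative assumptions rather than the authors' code; motion direction and timing jitter are left out.

```python
import itertools
import random

# Values taken from the task description.
SINGLE_COHERENCES = [0.0, 3.2, 6.4, 12.8, 25.6, 51.2]  # percent
DOUBLE_COHERENCES = [3.2, 6.4, 12.8]                    # percent, per pulse
GAPS_MS = [0, 120, 360, 1080]


def build_conditions():
    """Return every trial type as a tuple (kind, coherences, gap_ms)."""
    single = [("single", (c,), None) for c in SINGLE_COHERENCES]
    double = [
        ("double", pair, gap)
        for pair in itertools.product(DOUBLE_COHERENCES, repeat=2)
        for gap in GAPS_MS
    ]
    return single + double


def shuffled_block(n_repeats=1, seed=None):
    """Hypothetical helper: repeat the condition list and shuffle it."""
    rng = random.Random(seed)
    trials = build_conditions() * n_repeats
    rng.shuffle(trials)
    return trials
```

Calling `build_conditions()` yields 42 trial types in total (6 single-pulse and 36 double-pulse), which matches the counts stated in the text.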

EEG Recording and pre-processing

We used a 32-channel amplifier for the EEG signal recording (eWave, produced by ScienceBeam,

Reported confidence was categorized as high or low. Since the participants were told to choose the upper part of the bar for high confidence and the lower part for low confidence, we considered reported confidence above the midline as high confidence and below the midline as low confidence. This categorization allowed us to treat each confidence report as a binary variable comparable to the choice. Using categorical variables also made it possible to compare the current data with our previous work (Vafaei Shooshtari et al., 2019). However, in addition to the midline, we tested several other binary level-setting methods for categorizing participants' high and low confidence ratings. First, the highest 55% and 45% of each participant's confidence reports were considered high confidence (similar to Zylberberg, Wolpert, & Shadlen, 2018). Then, the mean of each participant's confidence was calculated separately, and the confidence ratings above the mean were considered high ratings. Using these methods did not significantly alter the categorization of reported confidence (see Supplementary Figure 6).
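
The three categorization rules described above (midline split, top-fraction split, and mean split) can be sketched as follows. This is an illustration only: it assumes confidence ratings have been normalized to the range 0 to 1 along the target bar, and all function names are hypothetical.

```python
import statistics


def binarize_midline(ratings, midline=0.5):
    """High confidence = rating above the midline of the bar (assumed 0.5)."""
    return [r > midline for r in ratings]


def binarize_top_fraction(ratings, fraction):
    """High confidence = the highest `fraction` of one participant's ratings."""
    n_high = round(fraction * len(ratings))
    # Indices sorted from the highest rating to the lowest.
    order = sorted(range(len(ratings)), key=lambda i: ratings[i], reverse=True)
    high = set(order[:n_high])
    return [i in high for i in range(len(ratings))]


def binarize_mean(ratings):
    """High confidence = rating above the participant's own mean rating."""
    mean_rating = statistics.fmean(ratings)
    return [r > mean_rating for r in ratings]
```

Each function would be applied to one participant's ratings at a time, for example `binarize_top_fraction(ratings, 0.55)` and `binarize_top_fraction(ratings, 0.45)` for the two percentile-based splits.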

Behavioral analyses

Except where otherwise specified, we report behavioral data from the first experiment; all analyses were repeated for the EEG experiment, and any inconsistencies are noted (Experiment 2 results are reported in Supplementary Figures 1, 2, 3, 4, and 5).

We fitted several logistic regression models to measure the impact of stimulus characteristics on binary outcomes after confirming the assumptions of the linear regression were met. For logistic regression models, we used maximum likelihood under a binomial error model (i.e., a GLM) to evaluate the null hypothesis that one or more of the regression coefficients were equal to zero. For single-pulse trials, to determine whether confidence improved consistently with coherence, we used the following function: ). To examine whether confidence judgments were associated with more accurate choices, we fitted a logistic regression model to accuracy where the probability of high confidence is given by where $A$ was the accuracy of the response (0 or 1 for incorrect and correct) and our null hypothesis was that the accuracy would not affect reported confidence ($H_0: \beta_1 = 0$). We also used logistic regression to evaluate the effect of interpulse interval on confidence in double-pulse trials: ). A similar equation was used to assess the relation between accuracy and time interval:  ).

To examine the interaction between the two pulses (e.g., a stronger pulse 1 reduced the effect of pulse 2), we fitted the following regression model to all double-pulse trials: to the second pulse on the decision. In other words, the higher impact of the second pulse on confidence was due to higher sensitivity rather than an interaction of motion pulses.
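
To illustrate the kind of maximum-likelihood logistic fit used throughout this section, here is a minimal pure-Python sketch for a single predictor. It assumes the simplest form, $\mathrm{logit}(P_{high}) = \beta_0 + \beta_1 C$; the exact regressors of each equation are not recoverable from the text, and the function names are hypothetical.

```python
import math


def logistic(z):
    """Logistic (inverse logit) function."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, n_iter=25):
    """Maximum-likelihood fit of P(y = 1) = logistic(b0 + b1 * x).

    Uses Newton-Raphson on the binomial log-likelihood. `y` may hold
    0/1 outcomes or, for grouped data, proportions between 0 and 1.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = logistic(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p            # gradient w.r.t. b0
            g1 += (yi - p) * xi     # gradient w.r.t. b1
            h00 += w                # entries of the 2x2 information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system analytically.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1
```

In practice a statistics package would also report standard errors and p-values for testing $H_0: \beta_1 = 0$; this sketch only recovers the point estimates.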
In addition, to investigate the variation of confidence in double-pulse trials compared to single-pulse trials, we subtracted confidence of double-pulse trials from the corresponding confidence in single-pulse trials for each participant. For example, the confidence of a trial with a 3.2%, 6.4% motion-strength sequence was subtracted separately, once from the 3.2% and once from the 6.4% single-pulse confidence. The process was repeated for the data of each gap as well. Moreover, the same method was used to compare the accuracy of double-pulse and single-pulse trials. To assess the effect of choice accuracy on the variation of confidence in double-pulse and single-pulse trials, we fitted the following:

Response-time analysis

In the current study, response-time refers to the time between the response cue onset and the participant's response. To evaluate the significance of the effect of response-time on confidence, we fitted the following linear regression model separately in double-pulse and single-pulse trials: before the cue onset and response-time, we fit a linear regression model as follows: study the profile of high and low confidence from behavioral data, an equal number of trials from each participant's trials was selected randomly from single/double-pulse trials. The same procedure was repeated 100 times; then individual response-times were rank-ordered and binned into four quantiles. Then, the accuracy of high and low confidence trials in each bin was calculated. We expected to see a significant difference between the accuracy of each bin grouped by levels of confidence. We only included motion strengths of 3.2%, 6.4%, and 12.8% of single-pulse trials (matching the coherences used in double-pulse trials) to control the impact of coherence on response-time.
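
The binning step described above can be sketched as follows: response-times are rank-ordered, split into equal-sized bins, and accuracy is computed separately for high- and low-confidence trials in each bin. The function name and the return layout are assumptions for illustration; the random subsampling and the 100 repetitions are omitted.

```python
def accuracy_by_rt_bin(rts, correct, high_conf, n_bins=4):
    """Accuracy of high- and low-confidence trials within rank-ordered RT bins.

    Returns a list of dicts, fastest bin first. A value is None when a bin
    holds no trial at that confidence level.
    """
    order = sorted(range(len(rts)), key=lambda i: rts[i])
    n = len(order)
    out = []
    for b in range(n_bins):
        # Trial indices belonging to the b-th quantile bin.
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        row = {}
        for label, flag in (("high", True), ("low", False)):
            hits = [correct[i] for i in idx if bool(high_conf[i]) == flag]
            row[label] = sum(hits) / len(hits) if hits else None
        out.append(row)
    return out
```

Repeating this over random equal-sized subsamples and averaging the per-bin accuracies would approximate the resampling procedure described in the text.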

Motion energy analysis

Random dot stimulus is stochastic, so the sensory evidence fluctuated within and across trials around the nominal motion coherence level. To examine the fluctuations in motion during each trial, we filtered the sequence of random dots using two pairs of quadrature spatiotemporal filters, as specified in previous studies ( We used logistic regression to test whether confidence was more influenced by the second pulse's motion energy than by that of the first pulse in double-pulse trials. We tested double-pulse trials with equal motion strength using the following logistic regression model:  . To evaluate the relation of $P_{high}$ and motion energy in single-pulse trials, we fitted a linear regression model as follows: where $M$ was the motion energy of the presented motion stimulus, $C$ was the pulse coherence, and the null hypothesis was that confidence did not depend on the motion energy ($H_0: \beta_2 = 0$).

General computational modeling approach

We implemented a set of computational models based on signal detection theory to provide a mechanistic explanation of the experimental data. According to SDT, observers set a decision criterion ($cr$) to discriminate between two stimuli (e.g., labeled as  research has suggested that the number of trials could bias measures of metacognitive ability (Fleming, 2017). Therefore, in a control analysis, we created 100 sets of trials randomly from the single/double-pulse trials and from trials simulated by the perfect integrator model. Each set contained the same number of trials for each participant. We then averaged the metacognitive scores obtained from these 100 sets and repeated the comparison procedure (see Supplementary Figure 6).
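
For readers less familiar with SDT, the basic type-1 quantities behind this approach can be sketched as below. This uses the standard equal-variance Gaussian formulas for sensitivity ($d'$) and criterion; it does not cover the metacognitive (type-2) measures, and the function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def dprime_and_criterion(hit_rate, false_alarm_rate):
    """Equal-variance SDT sensitivity d' and criterion c.

    Both rates must lie strictly between 0 and 1.
    """
    z_hit = _STD_NORMAL.inv_cdf(hit_rate)
    z_fa = _STD_NORMAL.inv_cdf(false_alarm_rate)
    return z_hit - z_fa, -0.5 * (z_hit + z_fa)


def unbiased_accuracy(d_prime):
    """Proportion correct of an unbiased observer with sensitivity d'."""
    return _STD_NORMAL.cdf(d_prime / 2.0)
```

For example, a hit rate of 0.8 with a false-alarm rate of 0.2 gives a criterion of approximately zero (no bias) and an unbiased accuracy of 0.8.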

Perfect integrator Model

To estimate the expected confidence in double-pulse trials, we assumed that each trial's confidence was based on evidence integrated from both pulses by a perfect integrator model. In the perfect integrator model, the expected accuracy ($P$

We evaluated the models qualitatively (i.e., parameter recovery exercises) and quantitatively (i.e., maximum likelihood estimation).

In the qualitative method, based on the calculated parameters of the model, the probability of choosing high confidence for all combinations of motion strength for each participant was calculated (see Supplementary Appendix 2). We compared the expected high confidence predicted by the models to the observed confidence in double-pulse trials using regression, as follows:

According to the temporal low-pass characteristics of the slow peripheral pupil apparatus (Hoeks & Ellenbroek To evaluate the relation of confidence and pupil response, we fit a linear regression model as follows: ). To control the impact of coherence on pupil response, we only included motion strengths of 3.2%, 6.4%, and 12.8% of single-pulse trials.
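
Returning to the perfect integrator idea introduced at the start of this section, the sketch below shows one common way to compute an expected double-pulse accuracy from two single-pulse accuracies. It assumes that evidence from the two pulses is independent, so that sensitivities combine as $d'_{12} = \sqrt{d'^2_1 + d'^2_2}$; the model's equations are not recoverable from the text, so this rule and the function name are assumptions, and the confidence-criterion part of the model is not covered.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def perfect_integrator_accuracy(p_single_1, p_single_2):
    """Expected accuracy when two independent pulses are perfectly integrated.

    Assumes each single-pulse accuracy is at least 0.5 and below 1, and that
    accuracy relates to sensitivity through the cumulative normal, so any
    constant scale factor between d' and the z-score cancels out.
    """
    z1 = _STD_NORMAL.inv_cdf(p_single_1)
    z2 = _STD_NORMAL.inv_cdf(p_single_2)
    # Independent evidence adds in quadrature.
    return _STD_NORMAL.cdf(math.hypot(z1, z2))
```

Under this rule, pairing a pulse with one that carries no information (accuracy 0.5) leaves the expected accuracy unchanged, while two equally informative pulses always predict a higher accuracy than either alone.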

General statistical analysis

To test our hypotheses, a series of regression analyses were run after confirming that the assumptions of the linear regression were met. Effect sizes were reported and, as suggested, we considered small ($f^2 = .02$), medium ($f^2 = .15$), and large ($f^2 = .35$) effect sizes (see Cohen, 1970) at an alpha level of 5%.

Moreover, we used repeated-measures two-tailed t-tests. As suggested, we considered small (d = .2), medium (d = .5), and large (d = .8) effect sizes for this assessment (see Cohen, 1970), and the statistical significance for t-tests was set to a probability from data ≥ .

For tests of pupil response signals and ERPs between two levels of confidence, statistical inferences were performed using t-tests at each time-point (at a statistical threshold of p < .05).
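
As a small worked illustration of the regression effect size used here, Cohen's $f^2$ is conventionally defined from the coefficient of determination as $f^2 = R^2 / (1 - R^2)$. The sketch below applies that definition and the benchmarks quoted above; the function names and the "negligible" label are assumptions added for illustration.

```python
def cohens_f2(r_squared):
    """Cohen's f^2 for a regression model with coefficient of determination R^2."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return r_squared / (1.0 - r_squared)


def label_f2(f2):
    """Label an f^2 value using the small/medium/large benchmarks."""
    if f2 >= 0.35:
        return "large"
    if f2 >= 0.15:
        return "medium"
    if f2 >= 0.02:
        return "small"
    return "negligible"
```

Under this definition, an $f^2$ of 9.00 corresponds to $R^2 = 0.9$, and an $f^2$ of .08 falls in the small range.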

3 Results

We tested our predictions in two studies that applied the same paradigm (Figure 1A). The first study used behavioral measures and pupillometry analyses, whereas for the second experiment, we recorded EEG signals as well. Participants decided about the direction of the RDM motion based on brief motion pulses. The task design contained different conditions, which allowed us to compare participants' behavior across (i) double-pulse vs. single-pulse trials, (ii) different coherences of the motion stimulus, and (iii) four distinct gap intervals.

Behavioral results

We used the single-pulse trials to benchmark the effect of coherence on choice accuracy and confidence. As shown in Figure 2A, for single-pulse trials, participants were more confident for high coherence stimuli (Figure 2A   Confidence of double-pulse trials was calculated by pooling data across all time intervals. In (D) and (E), each data point reports pooled data from the indicated pulse sequence and its reverse order (e.g., 12.8-3.2% and 3.2-12.8%).

Moreover, in double-pulse trials, accuracy improved with motion strength (Figure 2C, black dots), and participants were more accurate when reporting higher confidence (Figure 2C, green dots). Along with accuracy ( Figure 2D

Motion energy results

To obtain a precise estimate of the decision-relevant sensory evidence contained in the stochastic stimuli, we applied motion energy filtering to the random dot motion stimuli. Figure 3D displays the average motion energy in double-pulse trials when the strength of the pulses was the same. Accordingly, the motion energy profiles for high and low confidence responses were slightly larger for the second pulse than for the first pulse. A logistic regression confirmed the influence of trial-to-trial fluctuations of motion energy on confidence (Eq.11; $\beta_2$ = . 14], $f^2$ = .08). Consequently, motion energy analysis could not provide independent confirmation of an asymmetric effect of pulses on confidence.

Figure 3. Choice confidence did not depend on the sequence of motion pulses. (A) The weak-strong pulse sequence yielded higher accuracy than the strong-weak sequence. (B) The weak-strong pulse sequence did not yield higher confidence than the strong-weak sequence. In all panels, data are represented as group mean ± SEM (*p < 0.05). (C) In single-pulse trials, low and high confidence cannot be distinguished by motion energy profiles in weaker pulses. (D) The second pulse had slightly more impact on confidence. Data were pooled for all nonzero interpulse intervals. Only correct trials with equal pulse strength are included. In (C) and (D), the shaded region around the mean indicates SEM. The black horizontal bars show the duration of the stimulus display. The units of motion energy are arbitrary and the same for all motion strengths.

The Interplay between confidence in single vs double-pulse trials

To address the variation of accuracy and confidence in double-pulse relative to single-pulse trials, we considered $P_{correct}$ or $P_{high}$ of each coherence (3.2%, 6.4% and 12.8%) in single-pulse trials as the baseline and measured the variation of $P_{correct}$ or $P_{high}$ of any corresponding sequence in double-pulse trials. As we expected, in all combinations of the three coherences as the baseline, $P_{correct}$ improved (Figure 4A). Additionally, when considering all the trials, in all combinations of the three coherences as the baseline, $P_{high}$ increased when the other pulse was a strong pulse (12.8%) (Figure 4B, and Figure 4C for correct trials). In contrast, $P_{high}$ decreased or did not change whenever the other pulse had a weak motion strength (3.2%, 6.4%). Interestingly, in incorrect trials, confidence decreased compared to single-pulse trials for all coherences and conditions (Figure 4D). These data did not correlate with the interval duration (Figure 4A, B, C, D).

Figure 4. Variation of accuracy or confidence in double-pulse trials relative to the baseline of the corresponding coherence (3.2%, 6.4% and 12.8% for each column). (A) Considering all the trials, accuracy improved in almost all pulse combinations. (B) Considering all the trials, confidence improved in combination with stronger pulses, while confidence in sequences with a weaker pulse either decreased or remained constant. (C) In correct-choice trials, the increasing effect of stronger pulses is more pronounced, and confidence even slightly improved in combination with weaker pulses compared to the corresponding baseline. (D) Interestingly, in incorrect trials, confidence decreased in every condition. The colored lines represent the matching data for each of the four possible gaps. The data are represented as group mean ± SEM.

In other words, the participants reported lower confidence in double-pulse trials compared to single-pulse trials for incorrect choices but reported higher confidence for correct choices (Figure 4; Eq.8,  Figure 4). Altogether, the perfect integrator was incapable of accounting for the observed metacognitive ability in double-pulse trials.

As a control investigation, we examined whether the differences in estimated metacognitive ability between models could result from the different number of trials. We averaged the metacognitive scores obtained from randomly selected equal numbers of samples and found very similar results. Thus, the difference in the estimated metacognitive efficiency cannot be explained by the difference in the number of trials between the single-pulse, double-pulse, and perfect integrator models (Supplementary Figure 7).

 the observed data. In this way, the predicted confidence improved considerably (Figure 5E right, Eq.16, Experiment 2). Accordingly, this result suggests that the factor that made the perfect integrator model unable to estimate the confidence from single-pulse trials was its inability to predict the proper change in the confidence criteria of double-pulse trials.

In addition, to consider the suboptimality in confidence reporting, we simulated data using the perfect integrator model's parameters while setting higher confidence noise (Eq.15). The predicted $P_{high}$ from this simulation improved (Eq.16, $\beta_1$ = .97, p < .001, 95% CI = [.83, 1.07], $f^2$ = 9.00). Consequently, the perfect integrator model simply highlighted accumulating decision evidence and ignored the effect of confidence noise.

Models' evaluation

We conducted parameter recovery simulations to evaluate models fitted to single/double-pulse trials. We regressed predicted vs. observed confidence for each coherence of each participant. In single-pulse trials, linear regression indicated that there was a significant relation between the predicted and observed confidence, (Eq.16,  (Figure 5D).

In summary, comparing the perfect integrator model and the optimized model both quantitatively (Figure 5E) and qualitatively (Figure 5D) showed that the optimized model predicts confidence better. Accordingly, these investigations indicated that: (i) participants integrated the decision evidence perfectly, but when reporting their confidence, their confidence resolution improved rather than their reported confidence increasing; (ii) the inability to predict the proper change in confidence criteria was the factor that made the perfect integrator model unable to estimate the confidence from single-pulse trials; and (iii) the confidence noise changed after receiving the second pulse in double-pulse trials.

EEG Analysis

We derived the ERPs of averaged signals for two levels of confidence to verify whether there was a significant difference in the centro-parietal ERPs across confidence levels.

Interestingly, the difference between confidence profiles at centro-parietal sites was considerable in double-pulse trials (Figure 8) but not in single-pulse trials (Figure 7).

Pupil responses

We took the mean baseline-corrected pupil signal during the 200 ms before feedback delivery as our measure of pupil response. In line with previous work (

Discussion

The current study was designed to clarify the confidence of decisions in more real-world contexts where the evidence arrives separately. Using an experimental design, we examined how human subjects combined the pieces of information to form their decision and confidence and how the two are related to each other. We performed two experiments with either single or double pulses of RDM stimuli. To this end, we investigated behavioral modeling, EEG responses and pupillometry. In summary, the results across experiments showed that participants used both pulses to decide about their confidence. Also, while their confidence was largely invariant to the gap interval, confidence ratings were not noticeably higher in double-pulse trials than in single-pulse trials. Instead, participants reported their confidence with higher resolution, and their metacognitive sensitivity improved. Furthermore, using RT, EEG and pupillometry analysis, we could clearly track the confidence profiles in double-pulse trials, unlike in single-pulse trials.

Behavioral and motion energy findings

Remarkably, unlike accuracy, confidence ratings in double-pulse trials did not increase significantly compared to single-pulse trials. We hypothesize that participants mainly trust the evidence of one of the pulses and ignore the other one. The trusted pulse can be either the first or the second pulse; it can also simply be the stronger pulse. However, the effect of sequence and interaction of pulses on confidence was examined and no effect was observed. Moreover

Moreover, when comparing confidence in double-pulse trials grouped by accuracy, we show that the participants had lower confidence in double-pulse trials than in single-pulse trials for incorrect choices but higher confidence for correct choices. In other words, compared with single-pulse trials, in double-pulse trials participants adjusted their confidence by enhancing their confidence resolution or metacognitive sensitivity.

Typically, confidence facilitates evidence accumulation and drives a confirmation bias in perceptual decision-making (Rollwage et al., 2020). Likewise, we suggest that an extra brief and weak piece of evidence can validate confidence and improve metacognitive sensitivity.

Computational modeling findings

To understand the nature of the differences in participants' metacognitive sensitivity in double-pulse vs. single-pulse trials, we compared the corresponding estimated metacognitive parameters. Likewise, we included the expected parameters that would be achieved in double-pulse trials under the assumption of perfect integration. double-pulse trials surpasses the value predicted by the perfect integrator model (Figure 5C and Supplementary Figure 4). This effect can be seen in all of our participants (except one participant from Experiment 2) and can be explained by low confidence resolution in single-pulse trials.

Metacognitive noise is the noise that affects confidence estimates but not perceptual decisions (De Martino (Shekhar & Rahnev, 2020). Accordingly, metacognitive noise is a superordinate term for all noise sources that impact the confidence formation process (Shekhar & Rahnev, 2020, 2021), ranging from systematic to nonsystematic input and computation. Nevertheless, the exact source of metacognitive noise remains unclear (Shekhar & Rahnev, 2020

Furthermore, perceptual decisions are often modeled using ideal observers (e.g., SDT). However, a source of suboptimal behavior in decision-making is 'lapse' (Gold & Ding, 2013; Pisupati, Chartarifsky show such an association in the presence of a brief piece of evidence. Moreover, our participants decided much faster in double-pulse trials compared to single-pulse trials. We hypothesized that the decrease of response-time in double-pulse trials would reflect higher internal confidence. However, an alternative hypothesis attributes this variation to the extra time available in double-pulse trials, which can be used to increase readiness to decide. To explore this hypothesis, we regressed response-time on the delay time before response cue onset in both single-pulse and double-pulse trials. If the variation of response-time was primarily dependent on extra delay time, the delay time should have had a considerable effect on response-time, especially in our 120 ms single-pulse trials, where the stimulus duration was short and the delay time varied. Nevertheless, the effects in both double-pulse and single-pulse trials were weak. Accordingly, the hypothesis that faster decisions reflect higher confidence in double-pulse trials is supported. In addition, the confidence profile as a function of response-time was significant in double-pulse trials, unlike in single-pulse trials.

Our findings furthermore suggest that reported confidence might not follow the confidence marker in the EEG response. We focused on the CPP, a neural correlate of perceptual processing believed to reflect evidence accumulation and correlated with confidence (Boldt et

Limitations and future directions

To the best of our knowledge, how evidence accumulation processes improve the accuracy-confidence association has not previously been addressed using a combination of behavioral, neural, and pupillometry signatures. Our results are grounded in the assumption of an integration strategy in decision-making. However, this insight has recently been reconsidered (