Brain signatures of surprise in EEG and MEG data

The brain is constantly anticipating future sensory inputs based on past experience. When new sensory data differ from predictions shaped by recent trends, neural signals are generated to report this surprise. Existing models for quantifying surprise are based on an ideal observer assumption operating under one of three definitions of surprise: the Shannon, Bayesian, and Confidence-corrected surprise. In this paper, we analyze visual and auditory EEG signals and auditory MEG signals recorded during oddball tasks to examine which temporal components in these signals are sufficient to decode the brain's surprise under each of these three definitions. We found that for both recording systems the Shannon surprise is always decoded significantly better than the Bayesian surprise, regardless of the sensory modality and the temporal features selected for decoding.

Author summary

A regression model is proposed for decoding the level of the brain's surprise in response to sensory sequences using selected temporal components of recorded EEG and MEG data. Three surprise quantifications (Shannon, Bayesian, and Confidence-corrected surprise) are compared in the decoding power they offer. Four different regimes for selecting temporal samples of EEG and MEG data are used to evaluate which part of the recorded data may contain signatures of the brain's surprise, in terms of offering high decoding power. We found that both the middle and late components of the EEG response offer strong decoding power for surprise, while the early components are significantly weaker. In the MEG response, we found that the middle components have the highest decoding power, while the late components offer moderate decoding power. When a single temporal sample is used for decoding surprise, samples from the middle segment possess the highest decoding power.
The Shannon surprise is always decoded better than the other definitions of surprise for all four temporal feature selection regimes. A similar superiority of the Shannon surprise is observed for the EEG and MEG data across the entire range of temporal sample regimes used in our analysis.


Introduction
The predictive coding framework [1] states that the brain is constantly predicting its next sensory input. Past inputs are used by the brain to form prior knowledge in the Bayesian brain model [2,3]. In fact, the results of brain functions such as perception, resolving ambiguity, attention, and decision making depend on how the current sensory input and the knowledge gained from previous experiences are combined in the hierarchical inference model of the brain [1] (for a review see [4]).
… according to one of the four temporal sample selection regimes) as its input, and the … not statistically significant (Fig 4).

As shown in Fig 6, the integration coefficient needs to be relatively small in order to obtain high decoding powers.

According to the Bayesian definition of surprise, after the brain has observed enough stimuli to correctly learn the generative distribution, there will not be a high amount of surprise in response to any type of stimulus, because at that point the difference between the two distributions involved in the calculation of the KL divergence diminishes. In addition, for larger integration coefficients, estimates of the underlying parameter will be closer to its true value. This improvement enables the Shannon and Confidence-corrected surprises to produce more accurate predictions of the brain's surprise. In addition, in the Samples regime of the EEG analysis, the best integration coefficient does not depend much on time. In other words, the best integration coefficient is not much different for the middle and late components (Fig 7).

In this study, we aim to examine how the surprise of an ideal observer is reflected in the temporal data recorded by EEG and MEG systems, and which surprise model can … (e.g., as in [13]). These observations are true for all three definitions of surprise used in this study.

Evidence for Bayesian brain and ideal observer

In this work, we have constructed our decoding model based on the Bayesian model of the brain by presuming a generative model for the world [39-43]. The brain is assumed to update its perception of the world after receiving inputs according to the Bayesian updating model. Our surprise decoding approach assumes an ideal observer model for the brain, which postulates that the brain attempts to find the distribution from which the input sequence is generated (parameterized by a transition probability matrix), and updates its estimate of this distribution using the Bayes rule [12-15, 27, 29, 38]. Therefore, the assumptions of the Bayesian brain and the ideal observer … are based on the parameters learned following the Bayesian brain and ideal observer assumptions.

Our results demonstrate the feasibility of decoding these three quantifications of the theoretical surprise on a trial-by-trial basis with noticeable decoding power, and hence provide new evidence for the Bayesian brain and ideal observer assumptions.

Finding the best model for surprise
Studying the phenomena associated with the brain's surprise response is an important aspect of modeling the process of learning in the brain. In these models, surprise has been regarded as a parameter which the brain tries to minimize during learning [14,16,17]. One of this paper's objectives has been to propose an approach to best quantify surprise as a parameter varying with every incoming stimulus that … processes which occur in the anterior cingulate cortex [22]. As the former activity occurs closer to the brain's surface beneath the scalp, its presence may be more strongly reflected in the EEG data and hence be better decoded.

Furthermore, we are using a "Bayesian sampler" approach to describe the behavior of the brain, employing a limited sample size to estimate the parameters of the input distribution [44]. Hence, a possible explanation for the significant difference …

It is important to distinguish between the surprise in the recorded response of the brain and the underlying surprise of the sequence of input stimuli. The former has been studied extensively as a parameter to be extracted from the recorded data; for example, in MEG [38,45], and in EEG as the amplitude of the P300 component [13,20,24,25,46] or as the MMN component [27,28]. The latter notion of surprise is extracted from the sequence of input stimuli based on an assumption about how the brain estimates the parameters of the input stimulus distribution (for example, the ideal observer assumption used in this work). This surprise is hence not dependent on the actual data recorded from the brain. There are two major approaches for quantifying the surprise of the stimuli which have been reported to be related to the surprise of the brain: the Shannon surprise and the Bayesian surprise.
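As a minimal illustration of the "Bayesian sampler" idea above, the sketch below (our own illustrative code, not from the original analysis; the window length and smoothing are assumptions) estimates the item frequency of deviants from only the most recent k observations, so the estimate fluctuates more than one based on the full history:

```python
import random

def sampled_item_frequency(sequence, k):
    """Estimate the deviant probability using only the last k
    observations, mimicking a limited-sample Bayesian sampler."""
    window = sequence[-k:]
    # Laplace smoothing keeps the estimate away from exactly 0 or 1.
    return (sum(window) + 1) / (len(window) + 2)

random.seed(0)
# Oddball-like binary sequence: deviants (1) occur with probability 1/5.
seq = [1 if random.random() < 0.2 else 0 for _ in range(1000)]

full_estimate = sampled_item_frequency(seq, len(seq))
limited_estimate = sampled_item_frequency(seq, 20)
print(full_estimate, limited_estimate)
```

With a small k the estimate tracks recent local fluctuations of the sequence, which is one way a limited sample size can change the surprise values an observer assigns.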

Aside from these two well-known models for the stimulus surprise, we have also … To address this, we applied our decoding methodology to the EEG dataset with no …

This dataset includes four different auditory oddball tasks and is openly accessible at www.bnci-horizon-2020.eu/database/data-sets under the name "auditory oddball paradigm during hypnosis". The dataset contains data from two healthy subjects, one female and one male. Two of the tasks were performed in the normal awake condition, and during the other two tasks the subjects were hypnotized. We used the data in which the subjects passively listen to the auditory stimuli.

The subjects were presented with a random sequence of 420 short complex high tones …

In the datasets analyzed in our work, the probability of the occurrence of the deviants is fixed and small (1/5 in the EEG datasets and 1/3 in the MEG dataset). This probability, called the item frequency in some literature [13,38], along with a second statistical parameter, the alternation probability, fully determines the characteristics of the input sequence generation. To assess the performance of the two-parameter decoding models discussed earlier, experiments in which these two probabilities take … transition probabilities as in [13,38]. In order to analyze both models based on simple parameters like the item frequency and models utilizing higher-level parameters like the transition probability matrix, a more diversified dataset with higher probabilities for the deviants, and including various alternation and repetition probabilities, would be required. One might analyze the other three blocks of MEG data used in this paper to compare the two ideal observers estimating the transition probability matrix and the item frequency [38]. Furthermore, analyzing surprise on a task including more than two types of stimuli (like the tasks …

The visual stimuli consisted of a small green circle as the standard and a large red circle as the deviant.
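To make the two generative statistics concrete, the sketch below (illustrative code, not from the paper) draws a binary oddball sequence with a deviant probability of 1/5 and then measures both the empirical item frequency and the empirical alternation probability of the result:

```python
import random

random.seed(1)
P_DEVIANT = 1 / 5  # item frequency used in the EEG datasets

# Each stimulus is drawn independently: 0 = standard, 1 = deviant.
seq = [1 if random.random() < P_DEVIANT else 0 for _ in range(5000)]

# Item frequency: overall fraction of deviants.
item_frequency = sum(seq) / len(seq)

# Alternation probability: fraction of successive pairs that differ.
alternations = sum(a != b for a, b in zip(seq, seq[1:]))
alternation_probability = alternations / (len(seq) - 1)
print(item_frequency, alternation_probability)
```

For an independent draw the alternation probability is determined by the item frequency (2p(1-p), about 0.32 here); sequences in which the two statistics can be set separately require an explicit transition probability matrix, as discussed in the text.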

In the auditory mode, the standards were single tones of 390 Hz and the deviants were generated as a broadband "laser gun" sound. The first two stimuli of each run were always fixed to be standards.

The EEG signal was sampled at a frequency of 1000 Hz by a conventional MRI-… To ensure that the participants paid attention to the task, they were asked every 12 to 18 sounds to predict the next stimulus (being a standard or a deviant) using one of two buttons.

The brain activity was recorded by 306 channels (102 magnetometers and 204 gradiometers) with a whole-head Elekta Neuromag MEG system using a sampling rate of 1 kHz and a hardware bandpass filter of 0.1 to 330 Hz. Raw MEG data were corrected for head movement and bad channels, and were also cleaned of powerline and muscle movement artifacts. Then a low-pass filter below 30 Hz was applied and the data were down-sampled to 250 Hz. Eye blinks and cardiac artifacts were removed using independent component analysis (ICA). Finally, the data were baseline corrected using a window of 250 ms before the stimulus onset [38].
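As a small illustration of the final baseline-correction step, the numpy sketch below (our own illustrative code; the array shapes and values are assumptions, not taken from the dataset) subtracts the mean of the 250 ms pre-stimulus window from each epoch, assuming data already down-sampled to 250 Hz:

```python
import numpy as np

FS = 250                            # sampling rate after down-sampling (Hz)
BASELINE_SAMPLES = int(0.25 * FS)   # 250 ms pre-stimulus window

rng = np.random.default_rng(0)
# Simulated epochs: (n_trials, n_channels, n_samples); each epoch
# starts 250 ms before stimulus onset.
epochs = rng.normal(5.0, 1.0, size=(10, 306, BASELINE_SAMPLES + 150))

# Baseline correction: subtract each trial/channel's mean over the
# pre-stimulus window from the whole epoch.
baseline_mean = epochs[:, :, :BASELINE_SAMPLES].mean(axis=-1, keepdims=True)
epochs_corrected = epochs - baseline_mean

# After correction the pre-stimulus mean is numerically zero.
print(np.abs(epochs_corrected[:, :, :BASELINE_SAMPLES].mean(axis=-1)).max())
```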

Preprocessing

EEG preparation. The raw EEG signal was processed in our analysis through the following steps (Fig 10.b): In the first step, the InfoMax ICA algorithm [59] was applied to the data of each subject using the EEGLAB toolbox [60]. Our aim was neither the rejection of artifacts nor selecting among the resulting ICA components. Instead, we assumed a … subsequent steps of the analysis. Therefore, the input of our regression model would be a feature matrix … (Fig 11).

Ideal observer model

A fundamental question in the Bayesian brain literature is how the brain learns the distribution of the sensory stimuli. The brain is assumed to be a near-optimal estimator of the probability of the input sequence based on a generative model with Bayesian inference [12,13,15,19,62,63]. To be more precise, the brain uses a prior belief about the environment, and updates it after each stimulus arrives. In addition, in order to initialize the inference process, it is presumed that the brain begins with the assumption of equally probable input types despite exposure to any possible previous blocks of stimuli [11,13,64,65]. … of the next stimulus being a standard or a deviant [19]. The transition probabilities can be stated as a 2 × 2 matrix, which can be estimated by counting the number of successive transitions [13]. It has been demonstrated that utilizing the transition probability matrix to describe the stimulus sequence statistically outperforms the single-parameter approach in describing the brain's response [13]. … The parameter can be computed in different ways [13]. In this paper we have …
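The transition-matrix estimate described above can be sketched as follows (illustrative code; the exponential "leak" standing in for the integration coefficient and the uniform prior are our assumptions, not the paper's exact estimator):

```python
import numpy as np

def transition_matrix_estimate(sequence, leak=0.98):
    """Ideal-observer style estimate of a 2x2 transition probability
    matrix from a binary sequence, using leaky (exponentially decaying)
    transition counts. leak plays the role of an integration
    coefficient: values near 1 integrate over long histories."""
    # counts[i, j]: leaky count of observed transitions from i to j.
    counts = np.ones((2, 2))  # uniform prior: all transitions equally likely
    for prev, curr in zip(sequence, sequence[1:]):
        counts *= leak                # forget old evidence
        counts[prev, curr] += 1.0     # accumulate the new transition
    # Normalize each row to obtain P(next stimulus | previous stimulus).
    return counts / counts.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
seq = (rng.random(3000) < 0.2).astype(int)   # deviant probability 1/5
T = transition_matrix_estimate(seq)
print(T)
```

Each row of the returned matrix is the predictive distribution over the next stimulus given the previous one; the predicted probability that feeds the surprise calculations is simply the matrix entry selected by the last observed stimulus.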

Estimation of the stimulus-generating distribution leads to a prediction about the next stimulus, which if violated may produce a "surprise" response by the brain [12,13,15,27] reflecting the prediction error [19,27]. The Shannon surprise of observing stimulus c is -log p̂(c), where c is 0 or 1 in an oddball test and p̂(c) is the predicted probability of the stimulus being c. For the single-parameter definition (i.e., when the parameter is chosen as the item frequency), p̂(c) is the estimated value of the item frequency at the preceding sample. In the two-parameter …

… surprise. But it is obvious that the first person is surprised less than the second one due to the indifference in the predicted probability of occurrence for all types. In other words, the first person has no commitment to the world.
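For an item-frequency observer with a Beta belief over the deviant probability, the three surprise measures can be written in closed form. The sketch below is our own illustrative implementation (using the standard KL divergence between Beta distributions and a flat naive prior for the confidence-corrected case), not the paper's exact code:

```python
import math

def digamma(x):
    """Digamma function via recurrence plus an asymptotic expansion."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (r + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def kl_beta(a1, b1, a2, b2):
    """Closed-form KL( Beta(a1, b1) || Beta(a2, b2) )."""
    log_beta = lambda a, b: math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (log_beta(a2, b2) - log_beta(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

def shannon_and_bayesian(c, a, b):
    """Shannon and Bayesian surprise of observing c (0 = standard,
    1 = deviant) under a Beta(a, b) belief over the deviant probability."""
    p_deviant = a / (a + b)                        # predictive probability
    p_c = p_deviant if c == 1 else 1.0 - p_deviant
    shannon = -math.log(p_c)
    a_post, b_post = (a + 1, b) if c == 1 else (a, b + 1)
    bayesian = kl_beta(a_post, b_post, a, b)       # KL(posterior || prior)
    return shannon, bayesian

def confidence_corrected(c, a, b, a0=1.0, b0=1.0):
    """Confidence-corrected surprise: KL between the current belief and
    a naive observer's posterior (flat Beta(a0, b0) prior) after seeing c."""
    a_naive, b_naive = (a0 + 1, b0) if c == 1 else (a0, b0 + 1)
    return kl_beta(a, b, a_naive, b_naive)

# A deviant arriving after strong evidence for standards:
sh, ba = shannon_and_bayesian(1, a=2.0, b=20.0)
cc = confidence_corrected(1, a=2.0, b=20.0)
print(sh, ba, cc)
```

Note how the Shannon surprise depends only on the predictive probability of the observed stimulus, whereas the Bayesian surprise depends on how much the belief itself shifts, which is why it vanishes once the generative distribution has been learned.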

The most recent definition of surprise, calculating it in the puzzlement sense, considers how observers commit to their belief about the world, and attempts to address the above issue [14]. It measures the distance between the distribution estimated for the model parameters after observing all the samples, and the posterior belief of a naïve observer for the parameters after observing the …

… best regress the surprise level. Hence, we define four different regimes of sample selection from the temporal data record, and apply our decoder in each regime to find the most significant set of time points which can be used to predict the surprise level of the input stimuli (Fig 13). The four regimes of temporal data selection are described below. Note that the entire set of data covering one … and from 250 ms to 600 ms (Late Components).
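The regime-based decoding described here can be sketched with a simple ridge regression on selected temporal samples. The code below is a self-contained illustration on simulated data; the late-component window (250-600 ms) follows the text, while everything else (the middle window, the signal shape, noise level, and regularization strength) is our own assumption:

```python
import numpy as np

FS = 250                       # sampling frequency (Hz)
n_samples = int(0.6 * FS)      # epoch covers 0-600 ms post-stimulus

def window(t0_ms, t1_ms):
    """Slice of temporal samples between two post-stimulus latencies."""
    return slice(int(t0_ms / 1000 * FS), int(t1_ms / 1000 * FS))

REGIMES = {
    "middle": window(100, 250),  # assumed middle-component window
    "late":   window(250, 600),  # late components, as in the text
}

rng = np.random.default_rng(3)
n_trials = 200
surprise = rng.exponential(1.0, n_trials)   # theoretical surprise per trial
times = np.arange(n_samples) / FS
# Simulated single-channel epochs: a P300-like bump near 300 ms whose
# amplitude scales with the trial's surprise, plus noise.
bump = np.exp(-0.5 * ((times - 0.3) / 0.05) ** 2)
epochs = (surprise[:, None] * bump[None, :]
          + 0.1 * rng.normal(size=(n_trials, n_samples)))

def ridge_decode(X, y, lam=1.0):
    """Closed-form ridge regression (in-sample fit, for illustration only)."""
    X = np.column_stack([X, np.ones(len(X))])   # add an intercept column
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return X @ w

results = {}
for name, sl in REGIMES.items():
    pred = ridge_decode(epochs[:, sl], surprise)
    results[name] = np.corrcoef(pred, surprise)[0, 1]
print(results)
```

In an actual analysis the correlation would be evaluated with cross-validation rather than on the training trials; the point here is only how the four regimes reduce to different column selections of the same trial-by-sample feature matrix.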