Surprise: a unified theory and experimental predictions

Surprising events trigger measurable brain activity and influence human behavior by affecting learning, memory, and decision-making. Currently, however, there is no consensus on the definition of surprise. Here we identify 16 mathematical definitions of surprise in a unifying framework, show how these definitions relate to each other, and prove under what conditions they are indistinguishable. We classify these surprise measures into four main categories: (i) change-point detection surprise, (ii) information gain surprise, (iii) prediction surprise, and (iv) confidence-correction surprise. We design experimental paradigms where different categories make different predictions: we show that surprise-modulation of the speed of learning leads to sensible adaptive behavior only for change-point detection surprise, whereas surprise-seeking leads to sensible exploration strategies only for information gain surprise. However, since neither change-point detection surprise nor information gain surprise perfectly reflects the definition of 'surprise' in natural language, a combination of prediction surprise and confidence-correction surprise is needed to capture intuitive aspects of surprise perception. We formalize this combination in a new definition of surprise with testable experimental predictions. We conclude that there cannot be a single surprise measure with all functions and properties previously attributed to surprise. Consequently, we postulate that multiple neural mechanisms exist to detect and signal different aspects of surprise.

Author note

AM is grateful to Vasiliki Liakoni, Martin Barry, and Valentin Schmutz for many useful discussions over the course of the last few years, and to Andrew Barto for insightful discussions during and after the EPFL Neuro Symposium 2021 on "Surprise, Curiosity and Reward: from Neuroscience to AI". We thank K. Robbins and collaborators for their publicly available experimental data (Robbins et al., 2018). All code needed to reproduce the results reported here will be made publicly available after publication acceptance. This research was supported by the Swiss National Science Foundation (no. 200020_184615). Correspondence concerning this article should be addressed to Alireza Modirshanechi, School of Computer and Communication Sciences and School of Life Sciences, EPFL, Lausanne, Switzerland. E-mail: alireza.modirshanechi@epfl.ch.

Following this notation, we define an agent's belief about the parameter Θ_t at time t as the posterior distribution

π^(t)(θ) = P^(t)(Θ_t = θ). (3)

We can equivalently write π^(t) = P^(t)_{Θ_t}. Another important quantity is the marginal probability of observing y given the cue x and a belief π^(t):

P(y|x; π^(t)) = E_{π^(t)}[P_{Y|X}(y|x; Θ)] = ∫ P_{Y|X}(y|x; θ) π^(t)(θ) dθ; (4)

see Definition 1.

Notation summary:
- P^(t): the probability measure P conditioned on the observations and cues until time t, i.e., x_{1:t} and y_{1:t}
- P^(t)_W: an alternative notation for the distribution of a random variable W given x_{1:t} and y_{1:t}
- π^(0): the prior distribution over the environment parameter; equivalently, the distribution of Θ_t given C_t = 1
- π^(t): the belief about the parameter Θ_t at time t, i.e., π^(t)(θ) = P^(t)(Θ_t = θ)
- P(y|x; π^(t)): the marginal probability of observation y given cue x and belief π^(t); see Eq. 4

Based on how they depend on an agent's belief π^(t), we divide existing surprise measures into three categories: (i) probabilistic mismatch, (ii) observation-mismatch, and (iii) belief-mismatch surprise measures (Fig. 3). Probabilistic mismatch surprise measures depend on the belief π^(t) through the marginal probability P(y_{t+1}|x_{t+1}; π^(t)); an example is the Shannon surprise (Shannon, 1948). Observation-mismatch surprise measures depend on the belief π^(t) through some estimate ŷ_{t+1} of the next observation according to the belief π^(t); an example is the absolute difference between y_{t+1} and ŷ_{t+1} (Rouhani & Niv, 2021). To compute belief-mismatch surprise measures, however, we need the whole distribution π^(t); an example is the Bayesian surprise (Baldi, 2002; Schmidhuber, 2010).

Footnote 1: For example, assume that S' = f(S) for an invertible function f. If an estimator of the variable Z is found using the measure S as Ẑ = g(S), then we can rewrite the same estimator in terms of S' as Ẑ = g̃(S') = g(f^{-1}(S')). Because g(S) and g̃(S') essentially have the same explanatory power given any function g and any measure of performance, the two surprise measures S and S' are equally informative about the variable Z in this regard. However, this is not necessarily true if one restricts the estimators to a particular class of functions, e.g., if the estimators are constrained to be linear with respect to surprise measures while f is nonlinear. Such limitations can be avoided by using non-parametric statistical methods like Spearman or Kendall correlations (Corder & Foreman, 2014).

Probabilistic mismatch surprise 1: Bayes Factor surprise

An abrupt change in the parameters of the environment influences the sequence of observations. Here, we apply their definition to our generative model.
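The footnote's point about indistinguishability can be illustrated with a short sketch (synthetic data and hypothetical variable names; not from the paper's experiments): a rank-based statistic such as the Spearman correlation cannot tell a surprise measure S apart from S' = f(S) for any strictly increasing f, because ranks are invariant under monotone transforms.

```python
import numpy as np

rng = np.random.default_rng(0)

def rankdata(a):
    # simple ranking (no ties expected for continuous data)
    order = np.argsort(a)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(a))
    return ranks

def spearman(x, z):
    # Spearman correlation = Pearson correlation of the ranks
    rx, rz = rankdata(x), rankdata(z)
    return np.corrcoef(rx, rz)[0, 1]

# hypothetical surprise values S and a monotonically related S' = f(S)
S = rng.gamma(2.0, 1.0, size=200)
S_prime = np.log1p(S) ** 3            # strictly increasing transform f
Z = 0.5 * S + rng.normal(0, 1, 200)   # noisy measurement driven by S

# a strictly increasing f preserves ranks, so both measures are
# equally informative about Z under any rank-based analysis
assert np.isclose(spearman(S, Z), spearman(S_prime, Z))
```

The same invariance holds for Kendall's tau, since it too depends only on the ordering of the surprise values.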
Similar to Xu et al. (2021), we define the Bayes Factor surprise of observing y_{t+1} given the cue x_{t+1} as the ratio of the marginal probability of observing y_{t+1} given x_{t+1} and C_{t+1} = 1 (i.e., assuming a change) to the marginal probability of observing y_{t+1} given x_{t+1} and C_{t+1} = 0 (i.e., assuming no change); formally, we write

S_BF(y_{t+1}|x_{t+1}; π^(t)) = P^(t)(y_{t+1}|x_{t+1}, C_{t+1} = 1) / P^(t)(y_{t+1}|x_{t+1}, C_{t+1} = 0) = P(y_{t+1}|x_{t+1}; π^(0)) / P(y_{t+1}|x_{t+1}; π^(t)). (5)
The name arises because S_BF(y_{t+1}|x_{t+1}; π^(t)) is the Bayes Factor (Bayarri & Berger, 1997; Efron & Hastie, 2016; Kass & Raftery, 1995) used in statistics to test whether a change has occurred at time t. For a given P(y_{t+1}|x_{t+1}; π^(0)), the Bayes Factor surprise is a decreasing function of P(y_{t+1}|x_{t+1}; π^(t)): hence, more probable events are perceived as less surprising. The key feature of S_BF(y_{t+1}|x_{t+1}; π^(t)), however, is that it reflects not only how unexpected the observation y_{t+1} is under the current belief but also how expected it would be if the agent had reset its belief to the prior belief. More precisely, for a given P(y_{t+1}|x_{t+1}; π^(t)), the Bayes Factor surprise is an increasing function of P(y_{t+1}|x_{t+1}; π^(0)).

Figure 2 caption: A. A typical question in human and animal experiments is whether a surprise measure S explains the variations of a behavioral or physiological variable Z better than an alternative surprise measure S'. A1. A common experimental paradigm: a sequence of cues x_{1:t} and observations y_{1:t} is presented to participants, the sequence z_{1:t} is measured, and the sequence of surprise values S_{1:t} or S'_{1:t} is predicted by computational modeling. Then statistical tools are used to study whether the sequence S_{1:t} or S'_{1:t} is more informative about the sequence of measurements z_{1:t}. A2. If there exists a strictly increasing function f such that S' = f(S), then the two surprise measures are equally informative about the measurable variable Z. In this case, S and S' are 'indistinguishable'. B. Schematic of the theoretical relation between different measures of surprise. A line connecting two measures indicates that the two measures are indistinguishable, i.e., one is a strictly increasing function of the other under the condition corresponding to the color and the type of the line. The conditions are shown on the bottom right of the panel: a solid black line means the two measures are always indistinguishable; a dashed black line corresponds to the condition p_c = 0; a solid red line corresponds to the prior marginal probability P(.|x_{t+1}; π^(0)) being flat; a dashed red line corresponds to the prior belief π^(0) being flat; a solid green line corresponds to the limit of p_c → 1; and a dashed green line means that the relation holds only for some special cases (e.g., the Gaussian task). Two lines indicate that either of the two conditions is sufficient for the two measures to be indistinguishable. The text beside each line shows where in the text the existence of the mapping is proven; e.g., P3 and C2 stand for Proposition 3 and Corollary 2, respectively. The purple box includes surprise measures that are computed in the parameter (Θ_t) space, whereas the surprise measures outside of the purple box are computed in the space of observations (Y_t).

The Shannon surprise S_Sh1 measures how unexpected or unlikely y_{t+1} is, considering the possibility that there might have been an abrupt change in the environment. As a result, for a fixed P(y_{t+1}|x_{t+1}; π^(t)), the Shannon surprise is a decreasing function of P(y_{t+1}|x_{t+1}; π^(0)) (c.f. Eq. 9), whereas the Bayes Factor surprise is an increasing function of P(y_{t+1}|x_{t+1}; π^(0)) (c.f. Eq. 5). In other words, S_Sh2(y_{t+1}|x_{t+1}; π^(t)) neglects the potential presence of change-points, and, therefore, it is independent of both p_c and P(y_{t+1}|x_{t+1}; π^(0)). For a non-volatile environment that does not allow for abrupt changes (p_c = 0), the two definitions of Shannon surprise are identical: S_Sh1 = S_Sh2 (Fig. 2B).

Proposition 2 shows that the Bayes Factor surprise S_BF is related to S_Sh1 and S_Sh2:

Proposition 2. (Relation between the Shannon surprise and the Bayes Factor surprise) In the generative model of Definition 1, the Bayes Factor surprise S_BF(y_{t+1}|x_{t+1}; π^(t)) can be written as
S_BF(y_{t+1}|x_{t+1}; π^(t)) = exp(ΔS_Sh2(y_{t+1}|x_{t+1}; π^(t))), (12)

where ΔS_Sh2(y_{t+1}|x_{t+1}; π^(t)) = S_Sh2(y_{t+1}|x_{t+1}; π^(t)) − S_Sh2(y_{t+1}|x_{t+1}; π^(0)) is the difference between the Shannon surprise under the current belief π^(t) and under the prior belief π^(0). Proposition 2 states that the Bayes Factor surprise S_BF(y_{t+1}|x_{t+1}; π^(t)) behaves similarly to the difference in Shannon surprise (i.e., ΔS_Sh1 or ΔS_Sh2) as opposed to the Shannon surprise itself (i.e., S_Sh1 or S_Sh2). Corollary 1 states that the modulation of learning as presented in Proposition 1 can also be written in the form of the difference in Shannon surprise (i.e., ΔS_Sh1 or ΔS_Sh2).
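The link between the Bayes Factor surprise of Eq. 5 and differences of Shannon surprise can be checked numerically. The sketch below uses arbitrary categorical marginals (an illustrative toy, not one of the paper's tasks) and verifies that the ratio P(y; π^(0))/P(y; π^(t)) equals the exponentiated difference of the Shannon surprises under π^(t) and π^(0).

```python
import numpy as np

rng = np.random.default_rng(1)

# toy categorical marginals under the prior belief pi^(0) and current belief pi^(t)
p0 = rng.dirichlet(np.ones(5))   # P(y | x; pi^(0))
pt = rng.dirichlet(np.ones(5))   # P(y | x; pi^(t))

for y in range(5):
    S_BF = p0[y] / pt[y]                  # Bayes Factor surprise, Eq. 5
    S_Sh2_t = -np.log(pt[y])              # Shannon surprise under pi^(t)
    S_Sh2_0 = -np.log(p0[y])              # Shannon surprise under pi^(0)
    # the Bayes Factor surprise equals the exponentiated difference
    # in Shannon surprise between the current and the prior belief
    assert np.isclose(S_BF, np.exp(S_Sh2_t - S_Sh2_0))
```

Because exp is strictly increasing, S_BF and the difference in Shannon surprise are indistinguishable in the sense of Fig. 2A2.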
Therefore, the SPE and the Shannon surprise are in principle indistinguishable (Fig. 2).

Before turning to an 'observation-mismatch', we first need to define an agent's prediction for the next observation. Analogously to our two definitions for the Shannon surprise (c.f. Eq. 9 and Eq. 10), we define two different predictions ŷ_{t+1} for the next observation y_{t+1} given the cue x_{t+1}. We note that the observation y_{t+1} is in the general case multi-dimensional (Niv et al., 2015).

Proposition 3 connects the squared prediction error to the Shannon surprise (see Proofs): if the distribution of Y_{t+1} ∈ R^N given the cue x_{t+1} and under the belief π^(t) is a Gaussian distribution with a covariance matrix equal to σ I_{N×N}, where I_{N×N} is the N × N identity matrix, then S_Sq2(y_{t+1}|x_{t+1}; π^(t)) is an invertible function of S_Sh2(y_{t+1}|x_{t+1}; π^(t)). We note that, according to Proposition 3, the SPE is an invertible function of the Shannon surprise.
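A minimal numerical check of the Gaussian case (assuming, as in the statement above, an isotropic predictive distribution with covariance σ I): the squared prediction error is an affine, hence invertible and strictly increasing, function of the negative log density.

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma = 3, 0.5
y_hat = rng.normal(size=N)   # predicted observation under the belief pi^(t)

def shannon_surprise(y):
    # negative log density of an isotropic Gaussian N(y_hat, sigma * I)
    return 0.5 * np.sum((y - y_hat) ** 2) / sigma \
        + 0.5 * N * np.log(2 * np.pi * sigma)

def squared_error(y):
    # squared prediction error between observation and prediction
    return np.sum((y - y_hat) ** 2)

ys = rng.normal(size=(100, N))
sh = np.array([shannon_surprise(y) for y in ys])
sq = np.array([squared_error(y) for y in ys])

# S_Sq = 2*sigma*(S_Sh - const): an invertible, strictly increasing map,
# so the two measures are indistinguishable in the sense of Fig. 2A2
const = 0.5 * N * np.log(2 * np.pi * sigma)
assert np.allclose(sq, 2 * sigma * (sh - const))
```

In particular, the two measures induce the same ordering over observations, so any rank-based analysis cannot separate them.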

where A(y_{t+1}, x_{t+1}) = S_Sh(y_{t+1}|x_{t+1}; π_flat) + C[π_flat] is independent of the current belief π^(t); note that, because p_c = 0, we have S_Sh1 = S_Sh2 and S_Ba1 = S_Ba2. Therefore, in a non-volatile environment (i.e., p_c = 0), S_CC1 is correlated with the sum of the Shannon and the Bayesian surprise, regularized by the confidence of the agent's belief. However, such an interpretation is no longer possible in volatile environments (p_c > 0), and Eq. 30 must be replaced by Proposition 6 below.

In order to account for the information of the true prior π^(0) and to avoid the cases where π_flat(.|y_{t+1}, x_{t+1}) is not a proper distribution, we also give a second definition of the Confidence Corrected surprise as

Hence, the Confidence Corrected surprise should be distinguishable from both the Shannon and the Bayesian surprise (for p_c < 1). An interesting consequence of Proposition 6, however, is that S_CC2 is identical to S_Ba2 when the environment becomes so volatile that its parameter changes at each time step (i.e., in the limit of p_c → 1):

The minimized free energy F* = min_φ F^(t+1)(φ) has been interpreted as a measure of surprise (Friston) and can be seen as an approximation of S_Sh1(y_{t+1}|x_{t+1}; π̃^(t)). The parametric family of q(.; φ) and its relation to the exact belief π^(t+1) determine how well F* approximates S_Sh1(y_{t+1}|x_{t+1}; π̃^(t)) (Fig. 2B). More precisely, the minimized free energy measures both how unlikely the new observation is (i.e., how large S_Sh1(y_{t+1}|x_{t+1}; π̃^(t)) is) and how imprecise the best parametric approximation of the belief π^(t+1) is (i.e., how large D_KL[π^(t+1) || P_{Θ_{t+1}}(.|y_{t+1}, x_{t+1}; π̃^(t))] is). Therefore, the minimized free energy is in the category of belief-mismatch surprise measures (Fig. 3).

A further sub-category contains surprise measures that explicitly depend on confidence, including the Confidence Corrected surprise. The idea is that higher confidence (or higher commitment to a belief) leads to more puzzlement, where the puzzle is either to detect environmental changes or to find the most accurate prediction. We will introduce a new measure in this family, which explicitly captures confidence and defines the agent's puzzle as finding the most accurate predictions, in section 'Regularized Shannon surprise: A new direction' (Fig. 4).
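The decomposition described here can be verified in a small discrete toy model (illustrative, not one of the paper's tasks): the free energy of any approximate posterior q equals the Shannon surprise −log P(y) plus the KL divergence between q and the exact posterior, so the minimum over q recovers the Shannon surprise exactly.

```python
import numpy as np

rng = np.random.default_rng(3)

# discrete toy model: theta in {0, ..., K-1}, one observation y
K = 4
prior = rng.dirichlet(np.ones(K))          # belief before y arrives
lik = rng.dirichlet(np.ones(K))            # P(y | theta) for the observed y
evidence = np.sum(prior * lik)             # P(y); Shannon surprise = -log P(y)
posterior = prior * lik / evidence         # exact Bayesian posterior

def free_energy(q):
    # variational free energy F(q) = E_q[log q(theta) - log p(theta, y)]
    return np.sum(q * (np.log(q) - np.log(prior * lik)))

def kl(q, p):
    return np.sum(q * (np.log(q) - np.log(p)))

q = rng.dirichlet(np.ones(K))              # an arbitrary approximate posterior

# F(q) = Shannon surprise + KL(q || exact posterior) >= Shannon surprise,
# with equality when q equals the exact posterior
assert np.isclose(free_energy(q), -np.log(evidence) + kl(q, posterior))
assert np.isclose(free_energy(posterior), -np.log(evidence))
```

Restricting q to a parametric family keeps the KL term strictly positive in general, which is exactly why the minimized free energy only approximates the Shannon surprise.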

Following this classification, we focus in the following section on one representative example from each of these sub-categories and, whenever there are two definitions of one surprise measure, we take the second definition. Since a categorical task has no cues, the conditional distribution P_{Y|X} is specified as

P_{Y|X}(y = i | x; θ) = p_i, independently of the cue x, for i ∈ {1, ..., N}, where θ = [p_1, ..., p_N].

An example of a categorical task is the oddball paradigm (Squires et al., 1976). In a typical oddball task, participants are exposed to a sequence of binary stimuli (i.e., N = 2), e.g., they listen to an auditory sequence composed of two different musical notes, where one stimulus is less frequent than the other, e.g., p_1 = 0.9 and p_2 = 0.1. We will also consider a generalized oddball task with N = 3 stimuli.

As a natural choice for a categorical task, we assume that the prior and the current beliefs are both Dirichlet distributions, where α^(t) = [α^(t)_1, ..., α^(t)_N] are the parameters of the Dirichlet distribution for the belief at time t, and the initial belief π^(0) is the uniform distribution, i.e., a Dirichlet distribution with parameters α^(0) = 1_{N×1}.

At any time-point t ≥ 0, we can write the estimate of θ given the belief π^(t) as θ̂^(t) = [p̂^(t)_1, ..., p̂^(t)_N] = α^(t)/α^(t)_sum, where α^(t)_sum = Σ^N_{i=1} α^(t)_i. We discuss the parameter α^(t)_sum below when we define confidence. Using this notation, we can write the marginal probability as P(y = i; π^(t)) = p̂^(t)_i; for instance, p̂^(t)_2 is the estimated occurrence probability of the second category (e.g., the deviant note in an auditory oddball task). For the initial belief π^(0), we have α^(0)_i = 1 for all i between 1 and N.
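A minimal sketch of this Dirichlet-categorical setting (hypothetical counts; Python): the marginal probability of category i is the Dirichlet mean α_i/α_sum, and an exact Bayesian update simply increments the count of the observed category.

```python
import numpy as np

# Dirichlet belief over theta = [p_1, ..., p_N], starting from alpha = 1
N = 2
alpha = np.ones(N)                       # uniform prior pi^(0)

def marginal(alpha):
    # P(y = i; pi^(t)) equals the Dirichlet mean alpha_i / alpha_sum
    return alpha / alpha.sum()

def shannon_surprise(y, alpha):
    return -np.log(marginal(alpha)[y])

# oddball-like sequence: category 0 frequent, category 1 rare
seq = [0] * 9 + [1]
for y in seq:
    S = shannon_surprise(y, alpha)       # surprise before seeing y
    alpha[y] += 1.0                      # exact Bayesian update of the belief

# after 9 standards and 1 deviant: alpha = [10, 2], so under the
# current belief the deviant is the more surprising observation
assert np.allclose(alpha, [10.0, 2.0])
assert shannon_surprise(1, alpha) > shannon_surprise(0, alpha)
```

The same update rule applies for any N, which is what the generalized oddball task with N = 3 below relies on.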

Because P(y; π^(0)) is a flat distribution, S_BF and S_Sh are invertible functions of each other and hence indistinguishable (Fig. 2). Therefore, in this section, we do not explicitly include S_BF in our comparisons, as all qualitative results for S_Sh hold true also for S_BF; in particular, if S_Sh increases or decreases, S_BF does as well. Moreover, in the setting described above, we have S_Sh1 = S_Sh2, S_Ba1 = S_Ba2, and S_CC1 = S_CC2 due to the assumptions p_c = 0 and π^(0) = π_flat (Fig. 2B).

General setting: Confidence definition. To study the effect of confidence on the perception of surprise, we first need to agree on how to measure confidence given a belief π^(t). In Eq. 29, we have defined the confidence given a belief π^(t) by the negative entropy C[π^(t)] (Faraji et al., 2018). However, when π^(t) is a Dirichlet distribution, there is no simple analytic expression for C[π^(t)]. For reasons of practicality, we therefore define confidence in a categorical task as the inverse variance

CatConf[π^(t)] = (α^(t)_sum + 1) / (1 − ||θ̂^(t)||²₂),

where CatConf stands for Categorical Confidence and θ̂^(t) = α^(t)/α^(t)_sum is the current estimate of the parameter. The parameter α^(t)_sum has been interpreted as the number of samples the belief π^(t) is worth (Efron & Hastie, 2016). We note that ||θ̂^(t)||²₂ measures how far the estimate θ̂^(t) is from the uniform distribution over N categories: ||θ̂^(t)||²₂ takes its maximum value (corresponding to maximum confidence) when the estimate θ̂^(t) has a probability of 1 for one category and zero for the rest, and it takes its minimum value (corresponding to minimum confidence) when θ̂^(t) is distributed uniformly over all categories. The confidence thus increases with the number of samples the belief is worth (i.e., how large α^(t)_sum is) and depending on our estimate of the coin's bias (i.e., how large |p̂^(t)_2 − 0.5| is). We first fix p̂^(t)_2.
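Assuming confidence is measured as the inverse of the total variance of the Dirichlet belief (one concrete reading of the inverse-variance definition; the exact normalization is an assumption here), a short sketch confirms the two qualitative properties just described.

```python
import numpy as np

def cat_conf(alpha):
    # confidence as the inverse of the total variance of a Dirichlet belief:
    # sum_i Var(p_i) = (1 - ||theta_hat||^2) / (alpha_sum + 1)
    alpha_sum = alpha.sum()
    theta_hat = alpha / alpha_sum
    total_var = (1.0 - np.sum(theta_hat ** 2)) / (alpha_sum + 1.0)
    return 1.0 / total_var

# confidence grows with the number of samples the belief is worth...
assert cat_conf(np.array([20.0, 20.0])) > cat_conf(np.array([2.0, 2.0]))
# ...and with the distance of theta_hat from the uniform distribution
assert cat_conf(np.array([9.0, 1.0])) > cat_conf(np.array([5.0, 5.0]))
```

Both comparisons hold for any N, because the total variance shrinks with α_sum and with ||θ̂||²₂ alike.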

In other words, we assume that we have a fixed estimate of the coin's bias (expressed by p̂^(t)_2) while the number of samples the belief is worth (i.e., α^(t)_sum) varies (grey, Fig. 5A1). We observe that the Shannon surprise S_Sh of Y_{t+1} = 1 is independent of α^(t)_sum, in contrast to the other surprise measures. Moreover, all surprise measures predict that a wrong prediction made with higher confidence leads to higher surprise than a wrong prediction made with little confidence (Fig. 5B1, left).

[Figure 5 caption, fragment: The dashed grey curve shows the confidence. Surprise measures as a function of p̂^(t)_2 for the beliefs in panel B2. The qualitative behavior of the different measures with respect to α^(t)_sum is shown in the corresponding panels.]

The Shannon surprise S_Sh and the Bayesian surprise S_Ba are independent of p̂^(t)_2; the Confidence Corrected surprise S_CC, however, has an interesting U-shape relation with p̂^(t)_2. To summarize, S_CC can capture aspects of surprise perception that are consistent with our intuition and are not captured by S_Sh and S_Ba. However, S_CC also has some non-intuitive behavior in other situations.

[Figure 6 caption, fragment: B. The belief space is shown as a triangle because θ = [p_1, p_2, p_3] lies inside the area specified by p_1 + p_2 + p_3 = 1, p_1 ≥ 0, p_2 ≥ 0, and p_3 ≥ 0. C. The estimate θ̂^(t) = [p̂^(t)_1, p̂^(t)_2, p̂^(t)_3] corresponding to the beliefs in panels A. The qualitative behavior of the different measures with respect to θ̂^(t) is shown in the corresponding panels.]

This opens the door to more theory-driven and principled approaches for future experimental studies.

Proposed experiment: (ii) Generalized oddball task (N = 3). The U-shape relation between the Confidence Corrected surprise and p̂^(t)_2 in the CEO selection experiment (Fig. 6) can be tested in a generalized oddball task with N = 3 stimuli (similar to Mars et al., 2008). The approach is similar to that of the binary oddball task in Fig. 7: we have a physiological measurement Z that is thought to be sensitive to surprise; we want to test whether its behavior is consistent with the predictions of S_CC or with those of S_Sh and S_Ba.

To do so, we design a sequence of stimuli consisting of two phases separated by an abrupt change (Fig. 8A).

The idea is to keep the occurrence frequency p_1 of the 1st stimulus fixed throughout the whole sequence and to change the balance between p_2 and p_3 from Phase 1 to Phase 2 (Fig. 8B). In particular, we consider the case that p_2 = p_3 in Phase 1 (before the change) while p_2 ≠ p_3 in Phase 2 (after the change). In this case, S_Sh and S_Ba predict that the surprise of observing Y_t = 1 is the same in Phase 1 as in Phase 2,

whereas S_CC predicts that observing Y_t = 1 is more surprising in Phase 2 than in Phase 1 (Fig. 8C).

In volatile environments similar to our generative model (Fig. 1A), an unexpected event can occur either because of an abrupt change or simply by chance. The belief π^(t)(θ = 1) = b^(t) ∈ [0, 1] shows how much an agent trusts the oracle, whereas π^(t)(θ = 0) = 1 − b^(t) shows how much it believes in the unpredictability of the outcome.

Thought experiment: Belief in Forecasts. We started the introduction of the paper by discussing election forecasts. Imagine that, each year, an important election takes place between a blue party (y_{t+1} = 1) and a red party (y_{t+1} = 2).

Each year, the media announce, a week before the election, the probability x_{t+1} for the blue party to win. The goal of this thought experiment is to find which surprise measure is most useful to indicate the need for a change of belief.

To do so, let us suppose that citizen A has a high trust (e.g., b^(t) = 0.9) in the forecast and citizen B a low trust (e.g., b^(t) = 0.05). We assume that the media predicted a 90-percent probability of the blue party winning. Note that, in the first two points, we compare the surprise value of two different outcomes for the same belief, whereas, in the last point, we compare the surprise value as a function of the belief b^(t).

For the case of citizen A (strong trust in the media), all definitions of surprise match our Expectation 1 (compare the blue and the red curves in Fig. 9 for high values of b^(t)). If we follow the red curve from high belief to very low belief (i.e., if we decrease b^(t)), then different measures of surprise show different behaviors. As b^(t) decreases, only the Bayes Factor surprise S_BF (Fig. 9A1) and the Shannon surprise S_Sh (Fig. 9A2) decrease and match our Expectation 3. Importantly, the red and blue curves cross for S_BF but not for S_Sh. Hence, after a certain point (b^(t) < b^(0)), S_BF predicts that the media being right is more surprising than the media being wrong, matching our Expectation 2. However, S_Sh behaves differently.

In order to formalize experimental predictions, we consider a task with 150 trials (Fig. 10). At each time t, the oracle announces either X_t = 0.9 or X_t = 0.1, randomly chosen. For the first 50 trials, we assume that the observations are independent of the oracle's prediction, i.e., Θ_t = 0 for 1 ≤ t ≤ 50 and Y_t ∼ Bernoulli(0.5). Then, unknown to the participants, there is an abrupt change at trial 51, and, for the next 50 trials, the observations follow the same distribution as the one predicted by the oracle, i.e., Θ_t = 1 and Y_t ∼ Bernoulli(X_t). After a jump, S_CC decreases quickly before it goes up again; this behavior is due to its U-shape relation with b^(t) in Fig. 9. S_CC is the second most informative measure to detect changes.
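The task just described can be simulated; the sketch below (the hazard rate p_c, the prior trust b^(0), and the leaky filtering update are illustrative assumptions) generates the 100 described trials under the two-hypothesis belief b^(t) and computes the Bayes Factor surprise of Eq. 5 at each step.

```python
import numpy as np

rng = np.random.default_rng(5)
p_c, b0 = 0.05, 0.5                     # hazard rate and prior trust (assumed)
b = b0                                  # current trust b^(t) = pi^(t)(theta = 1)

surprise = []
for t in range(100):
    x = rng.choice([0.9, 0.1])          # oracle's announced probability
    theta = 0 if t < 50 else 1          # abrupt change at trial 51
    p1 = x if theta == 1 else 0.5       # true probability of outcome y = 1
    y = rng.random() < p1               # observed outcome

    lik1 = x if y else 1.0 - x          # P(y | theta = 1): oracle is right
    lik0 = 0.5                          # P(y | theta = 0): pure chance
    marg_t = b * lik1 + (1 - b) * lik0          # P(y; pi^(t))
    marg_0 = b0 * lik1 + (1 - b0) * lik0        # P(y; pi^(0))
    surprise.append(marg_0 / marg_t)            # Bayes Factor surprise, Eq. 5

    post = b * lik1 / marg_t                    # exact Bayesian update
    b = (1 - p_c) * post + p_c * b0             # account for possible change

# the trust remains a valid probability and the surprise stays positive
assert 0.0 < b < 1.0
assert all(s > 0 for s in surprise)
```

Averaging such surprise traces over many seeds, time-locked to the change at trial 51, reproduces the kind of comparison described around Fig. 10B.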

The behavior of S_Sh and S_Ba (Fig. 10B, second and third panels) reflects their dependence on b^(t) in Fig. 9. Therefore, S_Sh and S_Ba are less informative about environmental changes than S_BF and S_CC.

In an actual experiment, a behavioral or biological variable Z can be measured throughout the experiment (Fig. 2A1), similar to our proposed experiments for the example of the oddball tasks (Fig. 7 and Fig. 8).

In order to examine with which measure of surprise Z correlates, or whether Z is involved in the biological mechanism behind adaptive learning, one can compute its average over different sequences of stimuli, time-locked to change-points from unpredictable to predictable environments. Different surprise measures make qualitatively different predictions for this average (c.f. Fig. 10B).

We consider the prior belief to be a Dirichlet distribution.

Using exact Bayesian inference, the belief π^(t)(θ) at time t is also a Dirichlet distribution with parameters α^(t); see Appendix C: Methods for case-studies for details.

Thought experiment: (i) Optimal model-building. Before turning to surprise-seeking exploration policies, we study the optimal exploration policy in our reward-free bandit task. Imagine that an agent is instructed to always choose its next action x_{t+1} in a way to find the best estimate θ̂^(t+1) of the environment parameter Θ at the next time-step, i.e., to build the most accurate model of the environment (Schmidhuber, 2010). We define the accuracy of the estimate θ̂^(t+1) as the squared error between θ̂^(t+1) and the true parameter Θ. Then, to obtain (on average) the best estimate of the parameter at the next time-step t + 1, the optimal exploration policy is by definition to choose the action x_{t+1} = i that has in expectation the lowest mean squared error, where the expectation is conditioned on the previous actions x_{1:t}, the previous observations y_{1:t}, and the hypothetical choice i of the next action X_{t+1}. The optimal policy can be re-written as choosing the action that maximizes an optimal gain function g*(p̂^(t)_i, α^(t)_sum,i):

x_{t+1} = f*(x_{1:t}, y_{1:t}) = arg min_{i ∈ {1,...,N}} E[ ||θ̂^(t+1) − Θ||²₂ | x_{1:t}, y_{1:t}, X_{t+1} = i ].

The optimal gain function g* indicates which action should be chosen (Fig. 11A). To grasp the idea of the optimal policy, let us first imagine that we have chosen arm 1 and arm 2 ten times each. If we observed Y_t = 1 after every single time we chose arm 1 but after only 50% of the times we chose arm 2, then we would naturally be more confident about our estimate of p_1 than about our estimate of p_2. Thus, in order to increase the precision of our estimates, we should keep choosing arm 2. Consistent with this intuition, the optimal gain function g* has an inverted-U relation with the estimated probability p̂^(t)_i (Fig. 11A1). This means that the optimal policy is to pick the arm with the highest stochasticity in its outcome distribution (among the arms with the same α^(t)_sum,i).
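This intuition can be made concrete in a Beta-Bernoulli sketch (hypothetical counts; we use the fact that the mean squared error of the Bayes estimate equals the posterior variance): compare the expected reduction in total squared error obtained by one extra pull of a near-deterministic versus a stochastic arm.

```python
# Beta-Bernoulli sketch of the optimal "model-building" policy: pick the
# arm whose extra pull most reduces the expected total squared error
def beta_var(a, b):
    # variance of a Beta(a, b) belief about an arm's success probability
    return a * b / ((a + b) ** 2 * (a + b + 1))

def expected_var_after_pull(a, b):
    # average posterior variance over the two possible outcomes
    p_hat = a / (a + b)
    return p_hat * beta_var(a + 1, b) + (1 - p_hat) * beta_var(a, b + 1)

# arm 1: ten pulls, all successes -> near-deterministic estimate
# arm 2: ten pulls, half successes -> stochastic estimate
arms = {1: (11.0, 1.0), 2: (6.0, 6.0)}  # Beta(alpha, beta) beliefs

gain = {i: beta_var(a, b) - expected_var_after_pull(a, b)
        for i, (a, b) in arms.items()}

# the stochastic arm promises the larger reduction in expected MSE,
# so the optimal policy keeps choosing arm 2
assert gain[2] > gain[1]
```

By the law of total variance, the gain is always non-negative, and it is largest for arms whose outcome distribution is most uncertain.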

Second, let us imagine that we have chosen arm 1 only once but have chosen arm 2 ten times. In this case, whatever the actual observations, we would still be less confident about our estimate of p_1 than about our estimate of p_2. Thus, in order to increase the precision of our estimates, we should choose arm 1 more often. Consistent with this intuition, the optimal gain function g* is a decreasing function of α^(t)_sum,i, implying that the more often an agent chooses arm i, the less informative it becomes (Fig. 11A2).

Moreover, for any two arms i and j with different estimated probabilities p̂^(t)_i and p̂^(t)_j, each arm still gets selected occasionally, and all arms get selected infinitely many times in the limit t → ∞. We note that g* is independent of the prior parameters α^(0)_sum,i and p̂^(0)_i (Fig. 11A3 and A4).

To summarize, the optimal exploration policy prefers arms that (i) are more stochastic and (ii) have been chosen less often. See Appendix C: Methods for case-studies for details and proofs.

Classic surprise-seeking with the Bayes Factor surprise corresponds to

x_{t+1} = f_{S_BF}(x_{1:t}, y_{1:t}) = arg max_{i ∈ {1,...,N}} E_{P(.|x=i; π^(t))}[S_BF(Y_{t+1}|x_{t+1} = i; π^(t))],

which is the same as seeking the difference in Shannon surprise (c.f. Proposition 2). Note that, because E_{P(.|x=i; π^(t))}[S_BF(Y_{t+1}|x_{t+1} = i; π^(t))] is always and by definition equal to 1, classic surprise-seeking (Eq. 49) with the Bayes Factor surprise is equivalent to uniformly random exploration.
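The claim that the expected Bayes Factor surprise always equals 1 follows directly from Eq. 5: taking the expectation under the current predictive distribution cancels the denominator. A two-line numerical check with arbitrary marginals (illustrative toy):

```python
import numpy as np

rng = np.random.default_rng(6)

# marginals under the prior and the current belief for one arm
p0 = rng.dirichlet(np.ones(3))   # P(y; pi^(0))
pt = rng.dirichlet(np.ones(3))   # P(y; pi^(t))

# expected Bayes Factor surprise under the current predictive distribution:
# E[S_BF] = sum_y pt[y] * (p0[y] / pt[y]) = sum_y p0[y] = 1
expected_S_BF = np.sum(pt * (p0 / pt))
assert np.isclose(expected_S_BF, 1.0)
```

Since this holds for any pair of beliefs, expected-S_BF-seeking cannot rank the arms, which is exactly why it reduces to uniformly random exploration.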

Similar to the case of the optimal exploration policy (Eq. 47), we can define a gain function for each measure of surprise and write the corresponding surprise-seeking policy as the one that maximizes that gain function; i.e., in general, for a measure of surprise S, the surprise-seeking policy can be written as x_{t+1} = arg max_{i ∈ {1,...,N}} g_S(p̂^(t)_i, α^(t)_sum,i), with g_S the corresponding gain function; see Appendix C: Methods for case-studies for details. Different surprise measures give rise to different gain functions and show different preferences over actions (Fig. 11).

According to g_Sh (Fig. 11B1), arms with more stochastic outcomes (i.e., with p̂^(t)_i closer to uniform) are preferred, in agreement with the optimal gain function g* (Fig. 11A1). However, according to g_Ba, the opposite is true (Fig. 11C1). The reason is that improbable events lead to huge changes in the agent's belief, such that more deterministic arms have higher expected S_Ba. On the other hand, g_Ba is a decreasing function of α^(t)_sum,i (Fig. 11C2), in agreement with g* (Fig. 11A2). The preference of g_Ba for deterministic arms also decreases with increasing α^(t)_sum,i; e.g., the maximum of the blue curve in Fig. 11C1 has a smaller value than the minimum of the red curve. This means that, independently of the difference in the stochasticity of their outcomes, g_Ba will eventually choose arms that have been chosen less often, in agreement with g*. However, since S_Sh is a probabilistic mismatch surprise measure (Fig. 3), g_Sh is independent of α^(t)_sum,i: the number of times an arm has been chosen does not change the preference of g_Sh for more stochastic arms (Fig. 11B1 and Fig. 11B2).
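For a categorical arm, the expected Shannon surprise under the predictive distribution is simply the entropy of p̂, so a g_Sh-style preference depends only on p̂^(t)_i and not on α^(t)_sum,i. A short sketch (illustrative counts):

```python
import numpy as np

def expected_shannon(alpha):
    # E[S_Sh] under the predictive distribution is the entropy of
    # p_hat = alpha / alpha_sum, so it depends on p_hat alone
    p = alpha / alpha.sum()
    return -np.sum(p * np.log(p))

# same p_hat, very different numbers of pulls: g_Sh cannot tell them apart
assert np.isclose(expected_shannon(np.array([2.0, 2.0])),
                  expected_shannon(np.array([200.0, 200.0])))
# more stochastic arms (p_hat closer to uniform) have higher expected S_Sh
assert expected_shannon(np.array([5.0, 5.0])) > expected_shannon(np.array([9.0, 1.0]))
```

This makes the contrast with g_Ba concrete: the entropy is blind to how often an arm has been chosen, whereas the expected Bayesian surprise shrinks as α^(t)_sum,i grows.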

This observation is consistent with the different behaviors of S_Sh and S_Ba with respect to confidence in our first case-study (Fig. 5). Therefore, seeking S_Ba has an asymptotic behavior similar to that of the optimal policy, whereas seeking S_Sh remains systematically different from the optimal policy (Fig. 11A-C). We note that both S_Sh and S_Ba are independent of the prior parameters α^(0)_sum,i and p̂^(0)_i (Fig. 11B3-4 and Fig. 11C3-4, respectively). See Appendix C: Methods for case-studies for details and proofs.

S_BF and S_CC rely by definition (Eq. 5 and Eq. 31, respectively) on a comparison between the current belief π^(t) and the prior belief π^(0). As a result, the prior parameters α^(0)_sum,i and p̂^(0)_i have a potential influence on g_BF and g_CC (Fig. 11D3-4 and E3-4, respectively), in contrast to the optimal policy (Fig. 11A3-4). In particular, g_BF prefers arms for which the latest estimate of the outcome probability p̂^(t)_i differs least from the prior estimate p̂^(0)_i (Fig. 11D1 and D3, respectively). If we have p̂^(0)_i = 0.5 for all arms, then g_BF has the same preference as that of g_Sh, since S_BF and S_Sh are indistinguishable in this case (Fig. 2). g_BF is independent of α^(t)_sum,i and α^(0)_sum,i (Fig. 11D2 and Fig. 11D4). See Appendix C: Methods for case-studies for details and proofs.

The gain function for seeking the Confidence Corrected surprise, g_CC, prefers arms for which the latest belief about the outcome probability π^(t)(p_i) differs most from the prior belief about the outcome probability π^(0)(p_i). Therefore, the behavior of g_CC with respect to p̂^(t)_i and p̂^(0)_i (Fig. 11E1 and E3, respectively) is qualitatively opposite to that of g_BF (Fig. 11D1 and D3, respectively). Overall, the preference of g_CC differs from that of the optimal gain function g* with respect to all parameters: p̂^(t)_i, α^(t)_sum,i, p̂^(0)_i, and α^(0)_sum,i (Fig. 11E1-4). See Appendix C: Methods for case-studies for details and proofs.

To summarize, seeking surprise gives rise to different sub-optimal exploration policies depending on the measure of surprise.

Figure 12 caption, fragment: We used Bayesian model comparison (Kass & Raftery, 1995) and, given a sequence of action-choices and observations (for one random seed), computed the posterior probability of different models. Each cell shows the posterior probability of a candidate model (corresponding column) given that the action-choices were made by one of the true models (corresponding row), averaged over 500 random seeds. C. and D. The same as panels A and B, respectively, except that the prior parameter α^(0)_i also plays a role in action-selection.

In the 1st scenario (Fig. 12A), maximizing g_Ba is almost as good as maximizing the optimal gain function g*, and maximizing g_Sh is the same as maximizing g_BF, as expected from the shape of the gain functions (Fig. 11). Maximizing g_CC is the worst policy. Our results for model-recovery (Fig. 12B) show that different exploration policies can be distinguished from each other given the observed action-choices, except for g_Sh and g_BF, which are essentially the same in the 1st scenario. Importantly, despite the similar performance of g* and g_Ba in model-building (Fig. 12A), they are distinguishable at the level of action-choices (Fig. 12B).
In the second scenario, maximizing g_Sh differs from maximizing g_BF (Fig. 12C). By constantly comparing p̂^(t)_i with p̂^(0)_i, seeking S_BF has an indirect preference for arms that have been chosen less often; therefore, seeking S_BF performs better than maximizing g_Sh. While the best surprise-seeking policy is still to maximize g_Ba, its difference with the optimal policy becomes more obvious in the second scenario, where p̂^(t)_i matters. In this scenario, g_CC is again the worst policy, and all policies can be distinguished from each other given their action-choices (Fig. 12D).

To summarize, our proposed experiment can distinguish different exploration policies both at the level of (i) performance and (ii) action-choices with as few as 100 trials. We found that seeking the Bayesian surprise is the best surprise-seeking policy for exploration.

We showed that a surprise measure that is useful to modulate the learning speed should necessarily compare the current belief with the prior belief (Fig. 9 and Fig. 10), while we also showed that such a comparison leads to suboptimal exploration in surprise-seeking policies (Fig. 11 and Fig. 12). Similarly, we showed that a surprise measure that is useful for exploration should increase with increasing uncertainty (Fig. 11 and Fig. 12), whereas we also showed that such a behavior is slow and suboptimal in detecting change-points (Fig. 9 and Fig. 10). In other words, our results show that the very features that make the Bayes Factor surprise an appropriate measure for learning make it an unsuitable measure for exploration, and vice versa for the Bayesian surprise.

Unlike the Bayes Factor and the Bayesian surprise, the Shannon surprise (c.f. Eq. 9 and Eq. 10) always considers less likely events as more surprising. Surprise in natural language is defined as 'the feeling or emotion excited by something unexpected' (Oxford English Dictionary, n.d.). If we focus on the term 'unexpected', identify it with 'unlikely under the current belief', and neglect the terms 'feeling' and 'emotion', then our results suggest that the Shannon surprise measures a quality closely related to the definition of surprise in natural language (i.e., the dictionary definition of surprise). However, we observed that the Confidence Corrected surprise has a more intuitive behavior in some cases where confidence (or commitment to a belief) plays an important role (Fig. 6), despite its counter-intuitive behavior in some other situations (Fig. 5).
We argued that the Confidence Corrected surprise has a more intuitive behavior than the other measures in the CEO selection experiment because it explicitly accounts for confidence (Fig. 6). However, because it treats the confidence for a correct prediction in the same way as the confidence for a wrong prediction, its behavior is against common sense in some other experiments (Fig. 5 and Fig. 7). In this section, we therefore introduce a modified version of the Shannon surprise that addresses this issue.

Figure 13 (caption): Same experiments as in Fig. 5-Fig. 12. Sh (dark blue) corresponds to the Shannon surprise and ShR (light blue) corresponds to the regularized Shannon surprise. The regularized Shannon surprise has the same behavior as the Shannon surprise in panels A-E as well as in panels I-K, but it has the same behavior as the Confidence Corrected surprise in panels F-H. A. Coin flipping experiment: Surprise of Y_{t+1} = 1 as a function of α_sum (c.f. Fig. 5A1). B. Coin flipping experiment: Surprise of Y_{t+1} = 1 as a function of p̂_1^(t) (c.f. Fig. 5B1). C. Classic oddball task: Surprise over time for standard (shown at -0.05) and deviant (shown at 1.05) stimuli (c.f. Fig. 7A). D. Classic oddball task: Average surprise of standard stimuli in the early and the late phase of the task (c.f. Fig. 7B). E. Classic oddball task: Average surprise of deviant stimuli in the early and the late phase of the task (c.f. Fig. 7C). F. CEO selection experiment: Surprise of Y_{t+1} = 1 as a function of p̂ (c.f. Fig. 6A1). G. CEO selection experiment: Surprise over time (c.f. Fig. 8A). H. Generalized oddball task: Average surprise values of Y_{t+1} = 1 in Phase 1 and Phase 2 (c.f. Fig. 8C). I. Belief in forecasts experiment: The regularized Shannon surprise as a function of trust in media b^(t) for media being right (blue) or wrong (red); b^(0) = 0.5 (c.f. Fig. 9B). J. Association learning experiment: Surprise over time for one random seed; stars indicate when the oracle's prediction has been wrong (c.f. Fig. 10A). K. Association learning experiment: Surprise over time (as in panel J) averaged over 500 random seeds (c.f. Fig. 10B). L. Bandit experiment: The gain function g_ShR corresponding to the policy of seeking ShR as a function of p̂_i (c.f. Fig. 11); g_ShR is independent of α_sum,i.

We call the modified measure the Regularized Shannon surprise and define it as

S_ShR(y_{t+1}|x_{t+1}; π^(t)) = S_Sh(y_{t+1}|x_{t+1}; π^(t)) + R( S_Sh(y_{t+1}|x_{t+1}; π^(t)) − min_y S_Sh(y|x_{t+1}; π^(t)) ),

where R : R⁺ → R⁺ can be any continuous function that satisfies: (i) R(0) = 0, and (ii) R(z) is an increasing function of z for all z ∈ R⁺. As a result, we have the following two properties for the regularized Shannon surprise:

1. Whenever y_{t+1} is the most expected observation, i.e., y_{t+1} = arg min_y S_Sh(y|x_{t+1}; π^(t)), we have S_ShR(y_{t+1}|x_{t+1}; π^(t)) = S_Sh(y_{t+1}|x_{t+1}; π^(t)). This means that the confidence for a correct prediction does not penalize surprise.

2. For a fixed Shannon surprise of y_{t+1}, the more we expect another observation y*_{t+1}, the more surprised we are by observing y_{t+1} ≠ y*_{t+1}, consistent with our expectation for the CEO selection experiment (Fig. 6A1). This means that the confidence for a wrong prediction increases surprise.

As a simple choice, let us consider R(z) = z, i.e.,

S_ShR(y_{t+1}|x_{t+1}; π^(t)) = 2 S_Sh(y_{t+1}|x_{t+1}; π^(t)) − min_y S_Sh(y|x_{t+1}; π^(t)).
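For concreteness, the definition above can be sketched in a few lines of Python (a minimal sketch, assuming a discrete observation space with the predictive distribution given as a probability vector; the function names are ours):

```python
import numpy as np

def shannon_surprise(p_pred, y):
    """S_Sh(y) = -log p(y | x; pi) for a predictive probability vector p_pred."""
    return -np.log(p_pred[y])

def regularized_shannon_surprise(p_pred, y, R=lambda z: z):
    """S_ShR(y) = S_Sh(y) + R(S_Sh(y) - min_y' S_Sh(y')).
    With the simple choice R(z) = z, this equals 2*S_Sh(y) - min_y' S_Sh(y')."""
    s = shannon_surprise(p_pred, y)
    s_min = -np.log(np.max(p_pred))  # Shannon surprise of the most expected observation
    return s + R(s - s_min)

p = np.array([0.7, 0.3])  # hypothetical binary belief: outcome 0 is expected
# Property 1: for the most expected observation, S_ShR equals S_Sh.
assert np.isclose(regularized_shannon_surprise(p, 0), shannon_surprise(p, 0))
```

Property 2 needs at least three outcomes to isolate: for p = [0.6, 0.3, 0.1] and p = [0.8, 0.1, 0.1], observing the last outcome gives the same Shannon surprise (-log 0.1), but the more confident belief yields the larger regularized Shannon surprise.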
Then, the predictions of the regularized Shannon surprise S_ShR are the same as the predictions of the Shannon surprise in the coin flipping experiment and the classic oddball task (compare Fig. 13A-E with Fig. 5 and Fig. 7A-C). Importantly, the EEG amplitude at around 450 ms in the visual oddball task we analyzed has a behavior consistent with the predictions of S_ShR but not with the predictions of S_CC (compare Fig. 13D-E with Fig. 7). In the belief in forecasts and association learning experiments, S_ShR, similar to S_Sh, can be interpreted as a measure of unpredictability and unexpectedness in the environment (Fig. 13I-K). In this regard, S_ShR is also a good model of the dictionary definition of surprise. On the other hand, in the CEO selection experiment and the generalized oddball task, the confidence regularization in S_ShR makes it more similar to S_CC than to S_Sh (compare Fig. 13F-H with Fig. 6 and Fig. 8). Finally, our results show that, similar to seeking S_Sh or S_CC, seeking S_ShR is a sub-optimal exploration policy and has a poor performance for model-building (Fig. 13L-N).

Which surprise measures are computed in the brain? And how do they relate to the word 'surprise' in natural language? To address these questions, we reviewed 16 surprise measures in a common mathematical framework and studied their links to perception, learning, and decision-making. We identified the conditions under which they are indistinguishable (Fig. 2) and provided a technical (Fig. 3) and a conceptual (Fig. 4) categorization of these measures.

Our results suggest that the class of prediction surprise measures (Fig. 4) is most closely related to the dictionary definition of surprise.

We found that the very features that make a surprise measure suitable for adaptive learning are in conflict with the ones that make it suitable for exploration. In particular, adaptive behavior as observed in humans is achieved only if a surprise from the class of change-point detection measures modulates learning (Fig. 4), whereas a close-to-optimal exploration strategy is achieved only if a surprise from the class of information-gain surprise measures drives action-selection (Fig. 4).

Proof of Proposition 5

Proof: The remark is a direct consequence of Lemma 1.

Proof of Corollary 1

The corollary is a direct consequence of Eq. 60 and Eq. 62.

Proof of Corollary 2
Both relations are invertible. Therefore, the proof is complete.

Proof of Corollary 3

Using the general formula, the Bayesian surprise (c.f. Eq. 24) can be computed as

where the G_n are the 'Gregory coefficients' and (x)_n = x(x + 1)...(x + n − 1) is the rising factorial. As a result, the Bayesian surprise takes the form given in Eq. 101. We use this representation in some of our proofs below.
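Since Gregory coefficients are less commonly known, here is a short sketch of how they can be computed, assuming the standard definition as the Taylor coefficients of z/log(1+z); the recurrence follows from the Cauchy product with log(1+z)/z = Σ_m (−1)^m z^m/(m+1):

```python
from fractions import Fraction

def gregory_coefficients(n_max):
    """Gregory coefficients G_n, defined by z/log(1+z) = sum_{n>=0} G_n z^n, G_0 = 1.
    The coefficient of z^n (n >= 1) in the product of the two series must vanish."""
    G = [Fraction(1)]
    for n in range(1, n_max + 1):
        G.append(-sum(G[k] * Fraction((-1) ** (n - k), n - k + 1) for k in range(n)))
    return G

def rising_factorial(x, n):
    """(x)_n = x (x+1) ... (x+n-1)."""
    out = 1
    for k in range(n):
        out *= x + k
    return out

G = gregory_coefficients(4)
# G_1 = 1/2, G_2 = -1/12, G_3 = 1/24, G_4 = -19/720
```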
We could not make any general statement about the relation of the Confidence Corrected surprise with α_sum^(t) (c.f. Fig. 5 and Eq. 102). However, using an asymptotic analysis (for α_sum^(t) → ∞), one can show that lim_{α_sum^(t) → ∞} S_CC(y = i; π^(t)) = ∞.
Influence of θ̂^(t). When α_sum^(t) is fixed, the Shannon surprise (Eq. 96) is decreasing with respect to p̂_i^(t). According to Eq. 101, the Bayesian surprise is also always a decreasing function of p̂_i^(t).

We could not make any general statement about the relation of the Confidence Corrected surprise with θ̂^(t) (c.f. Fig. 5, Fig. 6, and Eq. 102).
Simulation details. For Fig. 7A, we used a leaky count update rule, where κ ∈ [0, 1] is the leak parameter that determines how fast old observations are forgotten, and δ is the Kronecker delta function. With this update rule, we have the guarantee that α_sum^(t) remains bounded.

The formulas for surprise calculations and the temporal update rule for the belief for the data shown in Fig. 9 and Fig. 10 are given in this section. To compute the Bayesian surprise, we use the result of Lemma 1; to do so, we compute

E_{π^(t)}[ S_Sh2(y_{t+1} = 1|x_{t+1}; δ_Θ) ] = −b^(t) log(2 x_{t+1}) + log 2,   (109)

from which we obtain the Bayesian surprise via Lemma 1.
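The identity in Eq. 109 can be checked numerically. Below is a minimal sketch (function names are ours), assuming the two-hypothesis belief of the belief-in-forecasts experiment: with probability b^(t) the media is reliable, i.e. p(y = 1) = x_{t+1}, and with probability 1 − b^(t) the outcome is 50/50:

```python
import numpy as np

def expected_sh2_surprise(b, x):
    """E_pi[S_Sh2(y=1 | x; delta_Theta)]: the Shannon surprise of y = 1 under each
    hypothesis (-log x if the media is reliable, log 2 if outcomes are 50/50),
    averaged with weights (b, 1 - b)."""
    return b * (-np.log(x)) + (1.0 - b) * np.log(2.0)

def eq_109(b, x):
    """Closed form of Eq. 109: -b log(2 x) + log 2."""
    return -b * np.log(2.0 * x) + np.log(2.0)

# The two expressions agree for any trust level b and forecast x.
```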

With similar tricks, the Confidence Corrected surprise can be computed as well.

The update rule. Using Proposition 1 for the update of the belief, we have

where γ_{t+1} is the adaptation rate as defined in Proposition 1.

Simulation details. We fixed the parameters θ_{1:150} as shown in Fig. 10: θ_{1:50} = θ_{101:150} = 0 and θ_{51:100} = 1. Then, for each random seed, we randomly sampled the cue variables x_{1:150} independently: X_t ∼ Uniform({0.1, 0.9}), i.e., at each point, the oracle chooses one of the possible outcomes and assigns a probability of 0.9 to it. We then, given the same random seed, sampled the observations y_{1:150} as described before: Y_t ∼ Cat({0.5, 0.5}) whenever Θ_t = 0, and Y_t ∼ Cat({x_t, 1 − x_t}) whenever Θ_t = 1, where Cat stands for the categorical distribution. Fig. 10A shows data generated for one random seed, and Fig. 10B shows the average belief and surprise over 500 random seeds.
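The generative process just described can be reproduced in a few lines (a sketch; the encoding Y ∈ {1, 2} and the function name are ours):

```python
import numpy as np

def simulate_association_learning(seed, T=150):
    """Generative process of the association-learning experiment (Fig. 10):
    theta is 0 for trials 1..50 and 101..150, and 1 for trials 51..100.
    X_t ~ Uniform({0.1, 0.9}) is the oracle's stated probability of outcome 1;
    Y_t = 1 with probability 0.5 if theta_t = 0, and x_t if theta_t = 1."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(T, dtype=int)
    theta[50:100] = 1                   # trials 51..100 in 1-indexed notation
    x = rng.choice([0.1, 0.9], size=T)  # cue variables
    p1 = np.where(theta == 1, x, 0.5)   # p(Y_t = 1)
    y = np.where(rng.random(T) < p1, 1, 2)
    return theta, x, y
```

Running this with 500 seeds and averaging the resulting belief and surprise traces reproduces the setting of Fig. 10B.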

Methods for case-study 3

The formulas for the gain functions and the details of the update rules for Fig. 11 and Fig. 12 are given in this section.

To simplify the notation, we define, for all p ∈ (0, 1), q ∈ (0, 1), and Y ∈ {1, 2}, the quantities used below.

The belief and the update rule. Given the prior belief as in Eq. 44, the belief at time t is given by Eq. 117.

The optimal gain function. Using Eq. 117 and Eq. 118, we can re-write Eq. 46. Therefore, the optimal strategy is x_{t+1} = f*(x_{1:t}, y_{1:t}) = arg min_i (…), and the behavior of g* with respect to the other two variables is as follows:

• With respect to p̂_i^(t): independent of the difference in the stochasticity level (i.e., δp), as an arm gets to be chosen more often (as δα increases), it eventually becomes less informative.

• With respect to α_sum,i^(t): g* is always decreasing.

Simulation details. We first set α_sum,i equal to 2 and 10 for scenarios 1 and 2, respectively. Then, for a given random seed, we randomly sampled the prior parameter p̂_i for each arm i ∈ {1, ..., 10} - note that the p_i are not known to the agents. We then ran 6 different algorithms separately, corresponding to (1) g*, (2) g_Sh, (3) g_BF, (4) g_Ba, (5) g_CC, and (6) g_ShR. At time t, each algorithm computed its corresponding gain function g for all actions. Then, it chose the action with the highest gain as x_t - when there was a tie, the action with the smaller index was chosen, e.g., action 2 was preferred to action 6. Given their actions x_t, different algorithms observed different observations y_t. Then, each algorithm updated its belief, and this procedure was repeated until t = 100. At time t, the mean-squared error Σ_i (p̂_i^(t) − p_i)^2 was computed as a measure of performance (Fig. 12A and Fig. 12C). For model recovery, actions were sampled from a softmax policy, p(x_{t+1} = i | x_{1:t}, y_{1:t}; g, β) ∝ exp(β · g_i^(t)), where g_i^(t) is the gain of action i at time t.

For the EEG analysis, we removed the baseline activity by subtracting the mean calculated over the first 100 ms. We excluded error trials (i.e., the trials where participants either pressed the wrong button or did not press any button; error rate: 3.8% ± 2.4%, range: 0.4% to 11.3%) from further analyses.
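The simulation loop can be sketched as follows (a minimal sketch; we assume a per-arm Beta-count belief with p̂_i = α_i/(α_i + β_i), and `gain_fn` stands in for any of g*, g_Sh, g_BF, g_Ba, g_CC, or g_ShR, whose exact formulas are given in this section):

```python
import numpy as np

def run_bandit(gain_fn, p_true, T=100, alpha0=1.0, rng=None):
    """One surprise-seeking run (c.f. Fig. 12). gain_fn(p_hat, alpha_sum) is a
    placeholder gain function evaluated for every arm at every step."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(p_true)
    alpha = np.full(n, alpha0)
    beta = np.full(n, alpha0)
    for _ in range(T):
        p_hat = alpha / (alpha + beta)
        gains = gain_fn(p_hat, alpha + beta)
        a = int(np.argmax(gains))        # ties broken toward the smaller index
        y = rng.random() < p_true[a]     # observe outcome of the chosen arm
        alpha[a] += y
        beta[a] += 1 - y
    p_hat = alpha / (alpha + beta)
    return np.sum((p_hat - p_true) ** 2)  # the performance measure of the text
```

For model recovery, the greedy argmax would be replaced by sampling actions with probability proportional to exp(β · g_i).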
Analysis. We computed the ERPs of the standard and the deviant stimuli, averaged over all trials. Fig. 7D shows the mean and the standard error of the mean (over participants) of the standard and the deviant ERPs. We used a one-sample t-test (FDR controlled) for statistical testing. Different surprise measures (e.g., the Confidence Corrected surprise) make further distinct predictions, which can be tested by further and more advanced analyses.

The regularized Shannon surprise is constant with respect to α_sum^(t).