A Normative Bayesian Model of Classification for Agents with Bounded Memory

Classification, one of the key ingredients of human cognition, entails establishing a criterion that splits a given feature space into mutually exclusive subspaces. In classification tasks performed in daily life, however, a criterion is often not provided explicitly but instead needs to be guessed from past samples of a feature space. For example, we judge today's temperature to be "cold" or "warm" by implicitly comparing it against a "typical" seasonal temperature. In such situations, establishing an optimal criterion is challenging for cognitive agents with bounded memory because it requires retrieving an entire set of past episodes with precision. As a computational account of how humans carry out this challenging operation, we developed a normative


Classification, an act of organizing entities into mutually exclusive classes according to a criterion, is an essential ingredient of human cognition. Classification is necessary for the emergence of scientific constructs such as taxonomies in biology (Hempel, 1965; Jacob, 2004), is entailed in generating basic linguistic propositions such as predication (Rips and Turnbull, 1980), and is frequently exercised when describing daily events (e.g., 'the train arrived earlier/later than the scheduled time').

A criterion is necessary for classification but often not explicitly provided. Previous modelling efforts offered several algorithms for how a criterion is guessed in such situations (Lapid et al., 2008; Raviv et al., 2012; Treisman and Williams, 1984) and described human classification behavior (Lages and Treisman, 1998). However, they did not specify any process of minimizing classification errors, leaving it unclear how their algorithms can be adopted from the optimality perspective. Furthermore, they did not specify how uncertainty, which prevails in the

Behavioral data acquisition for DS#2-4

DS#2-4 were taken from studies conducted for different purposes in our lab (Table 1).

The specifics of these data sets were as follows: DS#2 was published in our previous

For univariate analysis, the images of individual observers were normalized to the MNI template using the following steps: motion correction, coregistration to whole-brain anatomical images via the in-plane images (Nestares and Heeger, 2000), spike elimination, slice-timing correction, normalization with the SPM DARTEL Toolbox (Ashburner, 2007) to a 3×3×3 mm voxel size, and smoothing with an 8×8×8 mm full-width-at-half-maximum Gaussian kernel. All procedures were implemented with SPM8 and SPM12 (http://www.fil.ion.ucl.ac.uk/spm) (Friston et al., 1996; Jenkinson et al., 2002), except for spike elimination, for which we used the AFNI toolbox (Cox, 1996). The first six frames of each functional scan (the first trial of each run) were discarded to allow hemodynamic responses to reach a steady state. Then, the normalized BOLD time series at each voxel, each run, and each subject were preprocessed using linear detrending and high-pass

To define anatomical masks, probability tissue maps for individual participants were generated from T1-weighted images, normalized to the MNI space, and smoothed as was done for the functional images using SPM12, and then averaged across participants. Finally, the locations of CSF, white matter, and gray matter were defined as the respective groups of voxels in which the probability was greater than 0.5.

We defined the ring size, Z, in a normalized space with an arbitrary unit. Specifically, physical ring sizes in visual-angle units were normalized to Z values by dividing their deviations from the M-ring size by the minimum nonzero deviation from the M-ring size. For example, five ring sizes of 1, 5, 7, 9, and 13 visual degrees were normalized to Z values of −3, −1, 0, 1, and 3 by dividing the deviations from 7 (−6, −2, 0, 2, 6) by the minimum nonzero deviation from 7 (2).

The posterior p(Z|m^(t)) is a normal distribution, conjugate between the prior and the likelihood of Z given the evidence m^(t), whose mean μ^(t) and standard deviation σ^(t) were calculated as follows:

Whenever Z takes one of the ring sizes on a given trial as Z_t, the generative noise in generating m^(t) was assumed to be equal to σ_m. Therefore, σ_m propagates through the Bayesian estimates of the stimulus s^(t), which results in a sampling distribution of estimates whose mean μ_ŝ^(t) and standard deviation σ_ŝ^(t) were calculated as follows:

Much like the stimulus estimates, the sampling distribution of criterion estimates has a mean μ_ĉ^(t) and a standard deviation σ_ĉ^(t) due to generative-noise propagation, and these were calculated as follows:

We could derive this equation analytically because we assumed that, when the criterion is inferred, the retrieved stimulus measurement r^(t−1) is randomly resampled from a normal distribution whose mean is equal to the stimulus that instigated it, Z_(t−1), and whose standard deviation is σ_m^(t−1). In other words, the retrieved sensory measurements (r^(t−1), r^(t−2), …, r^(1)) were assumed to be independent of their initial sensory measurements (m^(t−1), m^(t−2), …, m^(1)) and of one another as well.

For each subject, estimation was carried out in the following steps. First, we found local minima for the parameters using the MATLAB function 'fminsearchbnd.m', with the iterative evaluation number set to 50. We repeated this step for 1,000 different initial parameter sets, randomly sampled within uniform prior bounds, and acquired 1,000 candidate sets of parameter estimates. Second, from these candidate sets, we selected the top 20 in terms of goodness of fit (sum of log likelihoods) and searched for minima using each of those 20 sets as initial parameters, increasing the iterative evaluation number to 100,000 and setting the function and parameter tolerances to 10^−7 for reliable estimation. Finally, using the parameters fitted in the second step, we repeated the second step one more time and selected the parameter set with the largest sum of log likelihoods as the final parameter estimates.

For parameter estimation and all further analyses, we discarded the first trial of each run, trials in which RTs were too short (less than 0.3 s), and trials in which no response was given. We discarded the first trial of each run for two reasons: first, the criterion could not be inferred on the first trial since there is no measurement to be retrieved yet; second, the first trial is susceptible to non-specific fMRI signals that are irrelevant to the task. Note that some aspects of the fitting procedure are arbitrary (e.g., the parameter boundaries and the number of initial
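The exclusion rules can be expressed as a simple trial mask; the RT values below are hypothetical:

```python
import numpy as np

# Hypothetical RTs (s) for one run; NaN marks a missing response.
rt = np.array([0.80, 0.25, np.nan, 0.60, 1.20])
trial_index = np.arange(len(rt))  # 0 is the first trial of the run

# Keep trials that are not the first of the run, have a response,
# and have an RT of at least 0.3 s.
keep = (trial_index > 0) & ~np.isnan(rt) & (rt >= 0.3)
```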

The constant-criterion model

The constant-criterion model has two parameters, a classification-criterion bias μ_c and a measurement noise σ_m. Stimulus estimates, s^(t), were assumed to be sampled from a normal

The independent variables were each standardized into z-scores for each observer, where T is the number of trials in a session. The

The Bayesian observers' choices were also regressed with the logistic regression model. The regression was repeatedly carried out for each simulation, and the coefficients averaged across simulations were taken as the final outcomes. Because of the additional time consumption associated with regression, the number of simulations for regression was reduced to 10^5 repetitions, smaller than that for pL^(t) prediction (10^6). However, we confirmed that this number of simulations was sufficiently large to produce stable outcomes.

by GLMM with a random effect of individual observers, and then linearly re-scaled to those for the human observers using the following GLM, where γ is a GLM parameter for scaling.

To statistically judge whether a given trial pair is significantly distinguishable or indistinguishable, we applied the following procedure to DS#1. First, we grouped individual trials

ΔpL^(i,j), and was estimated by computing the across-subject standard error of ΔpL^(i,j) for each bin pair (and thus differed slightly across the bin pairs). Lastly, for each bin pair, the observed ΔpL^(i,j) was judged to be significantly equal to zero when the Bayes factor was smaller than 1/3 and significantly different from zero when it was greater than 3, following convention (Jeffreys, 1961).

Almost identical analyses were applied to the RT and dACC data, except that the metameric currency for u^(t) was defined in absolute terms, |s^(t) − c^(t)|. Therefore, two trials

To identify brain signals of c^(t), s^(t), and v^(t), we defined three a priori lists of regressions that must be satisfied by the brain signals. We stress that each of these lists consists of conditions that must be satisfied as well as conditions that must not be satisfied, because both significant (β ≠ 0) and non-significant (β = 0) regressions are deduced from the causal structure of the variables in the NBMC. The lists were as follows.

The 13 regressions for the brain signal of c: (c1-3) y_c, the c decoded from brain signals, must be regressed onto c, the variable it represents, even when c is orthogonalized to v or d,

The 13 regressions for the brain signal of s: (s1-3) y_s, the s decoded from brain signals, must

causal graph structure that is prescribed a priori by the NBMC, we searched for the causal graph (G) whose likelihood is maximal given the time series of three brain signals, one for each of the graphs can be created out of three variable nodes (three possible edge states for each of the three node pairs); a total of 6 triplets of brain signals can be used for X_1, X_2, X_3, since we have three (IPL, STG, STG), two (DLPFC, Cereb), and a single (STG) candidate brain signals for c, s, and v,

Having estimated s^(t) and c^(t), the Bayesian observer performs the task by deducing a decision variable v^(t) from s^(t) and c^(t) and translating it into a binary decision d^(t) with a degree of uncertainty u^(t) (Fig. 3E); v^(t) is the probability that s^(t) will be greater than c^(t), p(s^(t) > c^(t)). Figures 3F and 3G graphically summarize how the NBMC works. Figure 3F

According to the c-inference algorithm of the NBMC (Equation 2; Fig. 3D), c is affected only by past stimuli, not by past decisions, and is attracted more strongly toward recent stimuli than toward old ones (Fig. 3D,F). Then, according to the deduction algorithm of the NBMC (Fig. 3E), such attractive shifts of c toward past stimuli are translated into decisions that are repelled away from past stimuli, more strongly from recent ones than from old ones.

The c-inference and deduction algorithms of the NBMC also imply specific history effects of past stimuli on decision uncertainty (u^(t)): the decision uncertainty on a current trial varies as a function of the congruence of the current decision with past stimuli only, not with past decisions. This is because, as c^(t) is attracted toward a past stimulus, an s^(t) that would lead to a decision congruent with that past stimulus (e.g., making a 'large' decision on the current trial after viewing an L ring on the previous trial) tends to be closer to c^(t) and thus to be more uncertain than an s^(t) that would lead to a decision incongruent with that past stimulus (e.g., making a 'small' decision on the current trial after viewing an L ring on the previous trial). In the case

hemodynamic delay of the fMRI signal, were significantly correlated with u^(t) (Table 2; Fig. 6A,C).

Importantly, when we carried out the regression analysis on those fMRI responses as we did on the RT data, the dACC responses were significantly regressed onto the congruence of d^(t) with past stimuli but not onto the congruence of d^(t) with past decisions (Fig. 6D), which corresponds

Rephrased in terms of the apple-classification scenario, the NBMC specifies how much of the variability of classification is ascribed to the farm from which an apple came (Fig. 2B) and to the farm from which a farmer came (Fig. 2C).

Figure 9G, which puts together the panels in Figure 9F, demonstrates that the NBMC readily captures the relativity of classification in a single currency, the difference between s^(t) and c^(t).

The RT and dACC data, though noisier than the decision-fraction data, also supported the Bayesian-human correspondences in u^(t) regarding the relativity of classification.

Initially, we identified the candidate brain regions within which activity patterns satisfy the first 'correlation' requirement using multivoxel pattern analysis (MVPA) in conjunction with a searchlight technique, which is known to be highly sensitive in detecting brain signals in local

To address this specificity problem, we imposed the second 'causality' requirement on the candidate brain regions. Specifically, we deduced an exhaustive set of regressions (Fig. 11D-F) implied by the causal relationships of a target variable with the other variables (Fig. 11A-C), and tested whether a candidate brain signal is consistent with the entire set of regressions. For example, to claim that a candidate brain signal of c satisfies the 'causality' relationship between c and v, we must show that the c decoded from the brain signal is (i) negatively regressed

Six brain regions satisfied both requirements (Table 3;

We also confirmed that the concurrent trial-to-trial variabilities of the variables decoded from the brain signals were consistent with the causal structure defined by the NBMC, as follows.

for BIC calculation). The outcomes of the BIC evaluation were consistent with the NBMC in the following two respects. First, out of the 162 possible causal graph structures, the smallest (best) BIC value was found for 'c → v ← s' (Fig. 12D). This 'common effect' causal graph is exactly the one prescribed by the NBMC (Fig. 11A). Second, note that one critical outcome that would falsify the NBMC is the 'significant' existence of any graph that includes causal arrows between c and s, because the NBMC is built on the assumption that c and s are random variables independent of one another. To evaluate the statistical significance of those graphs, we followed the statistical convention that a BIC difference greater than 2 between any pair of hypotheses is considered sufficient to conclude a significant difference (Kass and Raftery, 1995).

With this critical value of statistical significance (ΔBIC > 2), we found that none of the graphs with causal arrows between c and s passed the criterion. The closest group of such graphs (shown at the bottom of Fig. 12D) was well apart from the winning graph, by a BIC difference of more than 3.

In sum, the outcomes of the fMRI data analyses indicate that there exist brain activities that respectively represent the key latent variables of the NBMC (c, s, and v) and

categorization requires judging which of two distributions (i.e., 'which category') a given entity belongs to (Fig. 13B). In the context of the apple-sorting scenario, classification is deciding whether a given apple is smaller or larger than the typical apple size, while categorization is deciding whether a given apple is from the B farm or from the F farm.

attractively shift the distribution of stimuli and thus the criterion (Fig. 13E). As a result, only past stimuli, not past decisions, bring about negative history effects on current decisions, as observed in the current study (Fig. 4).

probe the history effects on decisions for the NBMC (Fig. 4), the regression coefficients of past stimuli and past decisions were opposite in sign (negative for past stimuli and positive for past decisions) and similar in magnitude (green symbols in Fig. 13G). These history effects differed from those observed for the humans in our experiments (black and gray symbols in Fig. 13G). This difference could not be resolved by changing the model parameters: the parameter set that best captures the stimulus effect fails to capture the decision effect (top panel of Fig. 13G), and vice versa (bottom panel of Fig. 13G).

By specifying the statistical structure of classification, we developed the NBMC, in which a normative observer infers the stimulus and the criterion from current and past sensory evidence, respectively, and commits to a decision with a varying degree of uncertainty.

The black horizontal bars represent a criterion, and the size of the circles represents apple size. The hue and saturation of the circles represent decision identity (blue for 'small', red for 'large') and uncertainty (increasing desaturation with increasing uncertainty), respectively.

were significantly correlated with u^(t) at the fourth fMRI time point (5.5 s after stimulus onset).

Two-tailed tests were applied for evaluating the significance of the GLMM.

Table 3. Specifications of the regions in which the brain signals of c^(t), s^(t), and v^(t) were identified.