A new formalism for perceptual classification with normative inference of internal criteria

Perceptual classification, one canonical form of decision-making, entails assigning stimuli to discrete classes according to internal criteria. Accordingly, the standard formalisms of perceptual decision-making have incorporated both stimulus and criterion as necessary components, but have granted them unequal representational status: the stimulus is treated as a random variable and the criterion as a scalar variable. This representational inconsistency obscures the origins of behavioral and neural variability in perceptual classification. Here, we redress this problem by presenting an alternative formalism in which the criterion, as a latent random variable, plays a causal role in forming the decision variable on an equal footing with the stimulus. By implementing this formalism in a Bayes-optimal algorithm, we could predict, simulate, and explain the key human classification behaviors with high fidelity and coherence. Further, by acquiring concurrent fMRI measurements from humans engaged in classification, we demonstrated an ensemble of brain activities that embodies the causal interactions between stimulus, criterion, and decision variable as the algorithm prescribes.


Introduction
The formation, use, and consequences of criterion in an apple-sorting task. (A) Farmers working at three apple farms that differ in fertility. As the farms differ in fertility, the three farmers experience apples of different sizes and form different criteria based
'large', respectively, by comparing the apple sizes against her "medium-size" criterion. (C) Example cases where the apples from the

Intuition II: dissecting stimulus and criterion contributions to classification
Our model decomposes the trial-to-trial variability in decision and uncertainty into that originating from s and that from c. In terms of the apple-sorting scenario, our model specifies why and how much a current decision is ascribed to the farm from which an apple came (Figure 1B) and to the farm from which a farmer came (Figure 1C).

The model's dissection of the s and c effects is illustrated in the bivariate decision space (Figure 5A). The probability of making the 'large' decision ($pL^{(t)}$), which corresponds to the truncated bivariate distribution that satisfies $s^{(t)} > c^{(t)}$, is greater for 'L' rings than for 'S' rings, and decreases as $c^{(t)}$ increases, regardless of ring size (Figure 5A,B). The decision uncertainty ($u^{(t)}$) of 'large' decisions, which corresponds to how close the mean of the truncated bivariate distribution that satisfies $s^{(t)} > c^{(t)}$ is to the equality of $s^{(t)}$ and $c^{(t)}$, is higher for 'S' rings than for 'L' rings, and increases as $c^{(t)}$ increases, regardless of ring size (Figure 5A,C). For 'small' decisions, the s and c effects on decision uncertainty can be described in similar ways, based on the truncated bivariate distributions that satisfy $s^{(t)} < c^{(t)}$ (Figure 5A,C).
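The dependence of the 'large'-decision region on stimulus and criterion can be illustrated with a small Monte Carlo sketch. It assumes, purely for illustration, that the beliefs about s and c on a trial are independent normal distributions; the function name, the parameter values, and the use of the mean margin s − c as a rough proxy for closeness to the equality line are hypothetical and are not the model's actual definition of decision uncertainty.

```python
import random

def simulate_decision_space(mu_s, sd_s, mu_c, sd_c, n=50_000, seed=0):
    """Sample (s, c) pairs and summarize the region where s > c.

    Assumption (not from the paper): s and c are independent normals.
    Returns (p_large, mean_margin), where mean_margin is the average of
    s - c over the samples that fall in the 'large' region (s > c).
    """
    rng = random.Random(seed)
    n_large = 0
    margin_sum = 0.0
    for _ in range(n):
        s = rng.gauss(mu_s, sd_s)
        c = rng.gauss(mu_c, sd_c)
        if s > c:  # 'large' decision region
            n_large += 1
            margin_sum += s - c
    p_large = n_large / n
    mean_margin = margin_sum / n_large if n_large else float("nan")
    return p_large, mean_margin

# Hypothetical values: an 'L' ring and an 'S' ring, each under a low and a high criterion.
for label, mu_s in (("L", 1.0), ("S", -1.0)):
    for mu_c in (0.0, 0.5):
        p_large, margin = simulate_decision_space(mu_s, 1.0, mu_c, 1.0)
        print(label, mu_c, round(p_large, 3), round(margin, 3))
```

Under these assumptions, the proportion of 'large' samples is higher for the 'L' ring and falls as the criterion mean rises, while the mean margin of 'large' samples is smaller for the 'S' ring, that is, closer to the equality of s and c.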

Observed RTs (E) and dACC activity (F) plotted against the binned absolute difference between the stimulus and criterion estimates for each stimulus-decision congruence condition. (D-F) Observed data juxtaposed with model simulation results (green symbols).

Shaded areas, 95% bootstrap confidence intervals of mean of data. None of the observed data significantly deviated from the model

Intuition III: relativity of classification

Our model explains the relativity of classification at both computational and algorithmic levels (Marr and
between s and c. This inequality computation, at the algorithmic level, is accomplished by forming a probabilistic representation of the difference between $s^{(t)}$ and $c^{(t)}$, which is the decision variable, $v^{(t)}$. As intuited in the apple-sorting scenario, trials are identical in $v^{(t)}$ as long as they are matched in relative difference between $s^{(t)}$ and $c^{(t)}$, and increasingly differ as the relative difference increases. In the bivariate decision space (Figure 6A), this means that decisions are (i) indiscernible as long as the expected values of
responses all constituted single, seamless psychometric, chronometric, and neurometric curves, respectively, which indicates that what governs classification is 's relative to c' (Figure 6D-F). As a consequence, pLs, RTs, and dACC responses became 'metameric' (significantly indiscernible) between the trials in which a physical difference between stimuli was compensated by a counteracting difference in $c^{(t)}$, or 'anti-metameric' (significantly discernible) between the trials in which a physical sameness between stimuli was accompanied by a substantial difference in $c^{(t)}$ (Figure 6-figure supplement 1), which supports the model's prediction of metameric classifications. The model's simulations fell within the 95% confidence intervals defined at all data bins inspected (Figure 6D-F

Causality between s, c, and v in brain activity
So far, we have validated the model by verifying its predictions for d and u in the behavioral data and dACC activity. Next, we set out to verify the model's predictions for its core latent variables, s, c, and v, from which d and u were deduced (Figure 3E), in the patterns of brain activity. These predictions are twofold: first, the model predicts the presence of brain signals that are correlated with the trial-to-trial states of s, c, and v; second, if such brain signals exist, their trial-to-trial variabilities must satisfy their causal relations with all the remaining variables in the model, including the manipulated, latent, and observed variables. Verifying these predictions will lend strong support to our formalism by endowing its latent variables and their causal relations with neural presence.

For a comprehensive search of candidate brain regions within which activity was correlated with s, c, and v, we opted to use multivoxel pattern analysis (MVPA) in conjunction with a searchlight technique, which is known to be highly sensitive in detecting brain signals in local patterns of population fMRI responses
the causal relationships of the target variable with the other variables. For example, the decoded y ! , by

The interplays between brain signals also satisfied the causal structure defined by the model. First, the interplays between the s, c, and v signals were well captured by a multiple regression model,

Bayes-optimal criterion inference
Although the true c is the population median of quantities, without access to quantities that will be encountered in the future, people have to rely on the quantities experienced so far. Further, due to

In this sense, our model is Bayes-optimal, and our results imply that people perform our task with

Third, c, as a legitimate parent to v along with s, precisely defines the relativity and metamerism in classification, with enough rigor to predict when and to what extent decisions become indiscernible or discernible.

Finally, as a new species brings a native ecosystem a new configuration (Richardson and Rejmánek,
leads to better quantitative predictions for the variability in decision and RT, and detection of the brain signal of u, which was undetectable by the constant-criterion model, in the dACC.

Brain signals of inferred criterion
The brain signals of c appeared initially at the IPL and migrated towards the STG, where the signal resides.

We conjecture that working-memory representations of past stimuli are likely to be formed in the IPL and then transferred to the STG, wherein they turn into c to partake in the classification of current stimuli.

This conjecture appears to be consistent with the literature in several aspects. Optical inactivation of
the DLPFC and cerebellum did not represent external stimuli per se but the inferred estimates of external stimuli that partake in classification (Figure 8-figure supplement 1C). In contrast, the signals in the retinotopic visual areas were much weaker than those in the DLPFC and cerebellum (Figure 7

Classification versus categorization
The difference between classification and categorization tasks has rarely been appreciated by previous studies, presumably because the two tasks appear similar on the surface in that they both require translating continuous variables into discrete variables. As footnoted in the Introduction, however, classification (e.g., "Is your cat big?") should not be confused with categorization ("Is that a cat or a dog?") since they fundamentally differ in what is to be computed (Jacob, 2004). In terms of our apple-sorting scenario, categorization would be to answer the question of 'which farm is this apple from?'. Specifically, while classification requires judging the inequality between an instance quantity ('size of your cat') and a criterion ('typical size of cats'), categorization requires judging which category ('cat' or 'dog') an instance ('that particular animal') belongs to (Green and Swets, 1966). In this regard, inference of c is necessary for classification but not for categorization, whereas a generative model of categories (i.e., how instances are generated from respective categories) is necessary for categorization but not for classification.
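The computational difference between the two tasks can be made concrete with a minimal sketch. It is only an illustration of the distinction drawn above: the function names, the normal generative models, the equal category priors, and all numerical values are hypothetical and do not come from the study.

```python
import math

def classify(size, criterion):
    """Classification: judge the inequality between an instance quantity and a criterion."""
    return "large" if size > criterion else "small"

def normal_pdf(x, mean, sd):
    """Density of a normal distribution, used here as a toy generative model."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def categorize(size, categories):
    """Categorization: pick the category whose generative model makes the instance
    most likely (equal priors assumed for simplicity)."""
    return max(categories, key=lambda name: normal_pdf(size, *categories[name]))

# Hypothetical apple sizes (cm): (mean, sd) of apples produced by each farm.
farms = {"fertile farm": (9.0, 1.0), "barren farm": (6.0, 1.0)}

print(classify(8.0, criterion=7.5))  # needs only a criterion
print(categorize(8.0, farms))        # needs a generative model per category
```

The first function needs a criterion but no knowledge of how categories generate instances; the second needs the generative models but no criterion.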

However, this possibility was negated because the categorization model could not provide a coherent account for the history effects in our classification task (Figure 8-figure supplement 2), which supports the view that inference of c is crucial for performing our task.

History effects in classification via dynamic criterion update
inference model offers a normative account for the history effect in a classification task: , which is a and our account for it, is distinct from those in previous studies as follows.

For each observer, we acquired three types of MRI images, as follows: (1) , except for spike elimination for which we used the AFNI toolbox (Cox, 1996). The first six frames of

Finally, the locations of CSF, white matter, and gray matter were defined as the respective groups of voxels in which the probability was more than 0.5. The preprocessing steps for multivoxel analysis were the same, except that spatial smoothing was omitted to prevent blurring of the pattern of activity. Unfortunately, in a few of the subjects, functional images did not cover the entire brain. Voxels in which data were derived from fewer than 17 subjects were excluded from further analysis, which included those in the temporal pole,

Bayesian estimates of stimuli
A Bayesian estimate of the value of Z on a current trial, $s^{(t)}$, was defined as the most probable value of a posterior function given a sensory measurement $m^{(t)}$ (Equation 1). The posterior $p(Z \mid m^{(t)})$ is a conjugate normal distribution of the prior and likelihood of Z given the evidence $m^{(t)}$, whose mean and standard deviation on trial t were calculated as follows (Figure 3C):
given the evidence $\vec{r}$, $p(Z \mid \vec{r}) \propto p(\vec{r} \mid Z)\,p(Z) = p(r^{(t-1)} \mid Z)\,p(r^{(t-2)} \mid Z) \cdots p(r^{(t-n)} \mid Z)\,p(Z)$, whose mean and standard deviation were calculated (Bromiley, 2003) based on the knowledge of how the retrieved stimulus becomes noisier as trials elapse (Figure 3B):
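A product of normal likelihoods and a normal prior has a closed form in which each term is weighted by its precision (inverse variance). The sketch below illustrates that general rule only; it is not the paper's equations. The function names and the rule by which retrieval noise grows with trial lag (`base_sd * (1 + growth * lag)`) are hypothetical assumptions made for illustration.

```python
def combine_normals(means, sds):
    """Mean and sd of the normalized product of normal densities (precision weighting)."""
    precisions = [1.0 / sd ** 2 for sd in sds]
    total = sum(precisions)
    mean = sum(p * m for p, m in zip(precisions, means)) / total
    return mean, total ** -0.5

def infer_criterion(retrieved, base_sd, growth, prior_mean, prior_sd):
    """Combine retrieved past measurements with a prior.

    retrieved[0] is the 1-back measurement, retrieved[1] the 2-back, and so on.
    Hypothetical noise rule: the retrieval sd grows linearly with lag.
    """
    sds = [base_sd * (1.0 + growth * lag) for lag in range(len(retrieved))]
    return combine_normals(list(retrieved) + [prior_mean], sds + [prior_sd])

# Illustrative values: a 1-back measurement of +1 and a 2-back measurement of -1.
mean, sd = infer_criterion([1.0, -1.0], base_sd=1.0, growth=1.0,
                           prior_mean=0.0, prior_sd=10.0)
print(round(mean, 3), round(sd, 3))
```

Because older measurements are assumed to be noisier, they receive less weight, so in this example the estimate is pulled toward the more recent measurement.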

Much like stimulus estimates, the sampling distribution of criterion estimates has a mean and a standard deviation due to generative noise propagation, and they were calculated as follows (Figure 5A):

The constant-criterion model

The constant-criterion model has two parameters, a bias µ and a measurement noise σ. Stimulus estimates, $s^{(t)}$, were assumed to be sampled from a normal distribution, $\mathcal{N}(Z^{(t)}, \sigma)$. Each stimulus sample has

Because regression is time-consuming, the number of simulations for regression ($10^5$ repetitions) was kept smaller than that for $pL^{(t)}$ prediction ($10^6$). However, we confirmed that this number was sufficiently large to produce stable simulation outcomes.

To check the correspondence between Bayesian decision uncertainty $u^{(t)}$ and human RT, we took
past stimuli with a current choice using the following model to obtain regression coefficients by GLMM with a random effect of individual observers (Figure 4D):
For the regression analysis, the independent and dependent variables were both standardized into z-scores for each observer. For comparison, the regression coefficients for Bayesian decision uncertainty were calculated as u (!)~q! + q !!! g ! !!!

Comparing Bayesian decision uncertainty and human BOLD activity
were both standardized into z-scores for each observer. We first localized cortical sites where the regression coefficients β were statistically significant after correcting for the false discovery rate (FDR) (Benjamini and Hochberg, 1995) over all the brain voxels tested at each time point, i. Then, we identified, as regions of interest (ROIs), voxel clusters covering a region larger than 350 mm³ (> 12 contiguous voxels) in which the voxels' FDR-corrected p-values were less than 0.05 and raw p-values were less than $10^{-5}$. For ROI analysis, BOLD signals were averaged over individual voxels, and their correspondences with Bayesian decision uncertainty were calculated using the same procedure as that used for RT data analysis.
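The voxel-wise thresholding step can be sketched with the standard Benjamini-Hochberg step-up adjustment. This is a generic illustration rather than the pipeline actually used: the function names are hypothetical, the default thresholds are illustrative parameters, and the cluster-size criterion (contiguous voxels) is not implemented here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def select_voxels(p_values, fdr_alpha=0.05, raw_alpha=1e-5):
    """Flag voxels passing both the FDR-corrected and the raw p-value thresholds.

    The default thresholds are illustrative assumptions.
    """
    adjusted = benjamini_hochberg(p_values)
    return [a < fdr_alpha and p < raw_alpha for p, a in zip(p_values, adjusted)]

# Toy p-values for four voxels.
print(select_voxels([1e-7, 0.5, 1e-6, 0.03]))
```

In practice the surviving voxels would then be grouped into spatially contiguous clusters before applying the volume criterion.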

Searching for multivoxel patterns of activity signaling latent model

(2014) was developed to decode external stimuli by assigning decoding weights to individual voxels based on the retinotopy (eccentricity) map acquired a priori, the latent variable $s^{(t)}$ does not represent external physical stimuli per se but inferred stimuli that partake with $c^{(t)}$ in causing $v^{(t)}$. For such a latent variable, whose values stochastically vary on a trial-to-trial basis, the time-resolved SVR method is more appropriate because voxel weights can be flexibly learned via training. Second, we did not want to preclude the possibility that $s^{(t)}$ is represented in the early visual cortex, nor to limit our search for brain signals of $s^{(t)}$ to the early visual cortex. In this regard, the SVR method in conjunction with searchlight is more appropriate than the retinotopy-based population decoding method, because the former can be applied impartially to any local region throughout the entire brain whereas the latter can be applied only to those with fine retinotopy maps. Lastly, the spatial resolution of fMRI signals in the current study (
an existing measurement to be retrieved; second, the first trial is susceptible to non-specific fMRI signals that are irrelevant to the task. RT was shorter than 0.3 s (0.0059 s) in one trial, and this was too short to be considered as a task-relevant response.
Note that some aspects of the procedure are arbitrary (e.g., the fitting boundaries and the number of initial parameter randomizations). So, the outcomes of the fitted
Table S2 for specifications of these ROIs), the time courses of the within-ROI averages of BOLD signals (red) are
with P < 0.05 and P < $10^{-5}$, respectively. Error bars, standard errors of the mean across observers.

(A-F) Gray patches represent 95% bootstrap confidence intervals. Just as in the main data (Figure 6), the predictions of the criterion-inference model fell well within the confidence intervals of the observed data at the majority of data bins for each and every auxiliary data set. As for AD#2, we did not plot the pL data for the largest two and the smallest two stimuli and did not plot the RT data for the largest two and the smallest three congruence conditions, because only small numbers of trials were available for several

decision uncertainty, and decision identity) that must be satisfied, respectively, by the brain signals. We stress that each of these lists (1) is a "collectively exhaustive" set, because the constituent regressions encompass all the variables at work in the model as regressors, and (2) consists of "the necessary conditions to be satisfied, and the conditions that must not be satisfied as well", because both significant (β > 0 or β < 0) and non-significant (β = 0) regressions are deduced from the causal structure of variables that is defined by the criterion-inference model (see Figure 7B). Thus, these lists subject the candidate brain signals of the latent variables to strong tests, and, if a given brain signal satisfies the entire list of regressions, it must be considered as one that may be not a mere "neural correlate or signature" of a latent variable, but rather a "neural representation or embodiment" of a latent variable. The meanings of symbols, numbers, and colors used in expressing the regressions are as follows: '+', '−', and '±', signs of regression indicating that the significance test is right-tailed, left-tailed, or two-tailed, respectively; numbers with decimal points, threshold p-values for GLMM regression of a brain signal onto the variable of interest ($P_{FDR}$, FDR-corrected p-values); '<' and '>', significant and non-significant regression; '$A_{\perp B}$', residual from linear regression of A onto B (which will be referred to as 'A orthogonalized to B' from now on); 'D#' and 'S#', decision and stimulus on a #-back trial; red, blue, and gray, positive, negative, and non-significant regressions, respectively. The brain signals of a targeted variable must and must not satisfy the following regressions of the targeted variable:

The brain signal of c ($y_c$; top left column of A). (c1)~(c3), $y_c$ must be regressed onto c (the variable it represents) even when c is orthogonalized to v or d, because it should reflect the variance irreducible to the offspring variables of c; (c4), $y_c$ must not be regressed onto s because c and s are independent; (c5),(c6), $y_c$ must be regressed onto v but not when v is orthogonalized to c because the influence of c on v is removed; (c7),(c8), $y_c$ must be regressed onto d but not onto u because u's relationship with its parents v and c is nonlinear (see Figure 7E); (c9)-(c11), $y_c$ must be regressed onto, not the current stimulus, but the past stimuli, strongly onto the 1-back stimulus and more weakly onto the 2-back stimulus (thus, non-significant regression with one-tailed regression in the opposite sign is modeled
coefficients were consistent with the regressions deduced from the causal structure in the model, as
Correspondence between the causal graph of the latent variables that is prescribed a priori by the model (s → v ← c) and the maximum likelihood causal graph that is inferred from the brain signals representing those variables. To see whether the patterns of s, c, and v signals that were
seriously as a strong support for the existence of brain embodiment of the latent variables and their
Figure 3D), $s^{(t)}$ is an inferred value of the stimulus on trial t (Equation 1; Figure 3C). As defined in Figure 3, this implies that trial-to-trial measurement noise in the sensory system is substantive and propagates to
substantively between the 'small'-decision (blue dots) and the 'large'-decision (red dots) trials, as shown in (C). Correspondingly, the brain signals of s found in the DLPFC and the cerebellum ($\mathrm{DLPFC}_{s3}$, $\mathrm{Cereb}_{s5}$) also varied not only as a function of external stimuli, but also as a function of decisions made by the observers, as shown in (D). Specifically, the multiple linear regressions '$\mathrm{DLPFC}_{s3}$ (or $\mathrm{Cereb}_{s5}$) ~ $\beta_0 + \beta_{S0}\,S0 + \beta_{D0}\,D0$' revealed that $\beta_{S0}$ and $\beta_{D0}$ are both significant ($\mathrm{DLPFC}_{s3}$: $\beta_{S0}$ = 0.051, $P_{S0}$ = 0.017; $\beta_{D0}$ = 0.056, $P_{D0}$ = 0.0028; $\mathrm{Cereb}_{s5}$: $\beta_{S0}$ = 0.062, $P_{S0}$ = 0.0044; $\beta_{D0}$ = 0.061, $P_{D0}$ = 0.0084). P-values, *< 0.05; **< 0.01. Error