Model-based fMRI reveals co-existing specific and generalized concept representations

There has been a long-standing debate about whether categories are represented by individual category members (exemplars) or by the central tendency abstracted from individual members (prototypes). Across neuroimaging studies, there has been neural evidence for either exemplar representations or prototype representations, but not both. In the present study, we asked whether it is possible for individuals to form multiple types of category representations within a single task. We designed a categorization task to promote both exemplar and prototype representations, and we tracked their formation across learning. We found evidence for co-existing prototype and exemplar representations in the brain, in regions that aligned with previous studies: prototypes in ventromedial prefrontal cortex and anterior hippocampus, and exemplars in inferior frontal gyrus and lateral parietal cortex. These findings show that, under the right circumstances, individuals may form representations at multiple levels of specificity, potentially facilitating a broad range of future memory-based decisions.

Prior studies fitting categorization models to fMRI data have done so only during a categorization test that followed extensive training, potentially missing dynamics occurring earlier in concept formation. Notably, memory consolidation research suggests that memories become abstract over time, often at the expense of memory for specific details (McClelland, McNaughton, & O'Reilly, 1995; Moscovitch, Cabeza, Winocur, & Nadel, 2016; Payne et al., 2009; Posner & Keele, 1970), suggesting that early concept representations may be exemplar-based. In contrast, research on schema-based memory shows that abstract knowledge facilitates learning of individual items by providing an organizational structure into which new information can be incorporated (Bransford & Johnson, 1972; Tse et al., 2007; van Kesteren, Ruiter, Fernández, & Henson, 2012). Thus, early learning may instead emphasize formation of prototype representations, with exemplars emerging later.
Finally, abstract and specific representations need not trade off in either direction. Instead, the brain may form these representations in parallel (Collin, Milivojevic, & Doeller, 2015; Schlichting et al., 2015), generating the prediction that both prototype and exemplar representations may grow in strength over the course of learning.

[Figure 1 caption, beginning truncated: ... (prototypes). New items are classified into the category with the most similar prototype. C. Example stimuli. The leftmost stimulus is the prototype of category A and the rightmost stimulus is the prototype of category B, which shares no features with prototype A. Members of category A share more features with prototype A than prototype B, and vice versa. D. During the learning phase, participants completed four study-test cycles while undergoing fMRI. In each cycle, there were two runs of observational study followed by one run of an interim generalization test. During observational study runs, participants saw training examples with their species labels without making any responses. During interim test runs, participants classified training items as well as new items at varying distances. E. After all study-test cycles were complete, participants completed a final generalization test that was divided across four runs. Participants classified training items as well as new items at varying distances.]
In the present study, participants underwent fMRI scanning while learning two novel categories or 'species,' which were represented by cartoon animals varying on eight binary dimensions (Figure 1C). The learning phase consisted of two types of runs: observational study runs and interim generalization test runs (Figure 1D). During study runs, participants passively viewed individual category members with their accompanying species label ('Febble' or 'Badoon'). All of the items presented during study runs differed by two features from their respective prototypes (for example, exemplars depicted in Figure 1A). After completing two runs of observational study, participants underwent an interim generalization test run in which they classified cartoon animals into the two species. Test items included the training items as well as new items at varying distances from category prototypes. Across the entire learning phase, there were four study-test cycles, with different new test items at every cycle. The learning phase was followed by a final generalization test, whose structure was similar to the interim test runs but more extensive (Figure 1E). To test for co-existing prototype and exemplar correlates in the brain during interim and final generalization tests, we used latent metrics generated from each model as trial-by-trial predictors of BOLD activation in six regions of interest (Figure 2): ventromedial prefrontal cortex, anterior hippocampus, posterior hippocampus, lateral occipital cortex, inferior frontal gyrus, and lateral parietal cortex. To identify potential changes with learning, we tested these effects separately in the first half of the learning phase (interim tests 1 and 2) and second half of the learning phase (interim tests 3 and 4) as well as in the final test.
As a complementary approach, we used pattern similarity analyses on the observational study runs to identify regions representing item-level information (i.e., individual category members) and/or category-level information. Item-level information was defined as greater neural pattern similarity for repetitions of a specific stimulus compared to its neural similarity to other stimuli. Category-level information was defined as greater pattern similarity for items in the same category compared to physically equally similar items from different categories. We were interested in whether such a distinct analysis method, applied to a distinct portion of the neuroimaging data (observational study rather than the categorization task), would provide converging or complementary information to the model-based fMRI results. As with neural prototype and exemplar model fits, we compared the first and second half of learning to identify potential representational shifts.

Accuracy
Interim tests. Categorization performance across the four interim tests is presented in Figure 3A. We first tested whether generalization accuracy improved across the learning phase and whether generalization of category labels to new items differed across items of varying distance to category prototypes. There was a significant main effect of interim test number [F(3,84) ...]. Thus, the prototype model provided an overall better fit to behavioral responses throughout the learning phase and final test, and the effect size of the prototype advantage was largest in the final test.

[Figure caption: The final test. Fits are given in terms of negative log likelihood (i.e., model error), such that lower values reflect better model fit. Each dot represents a single subject and the trendline represents equal prototype and exemplar fit. Dots above the line have better exemplar relative to prototype model fit; dots below the line have better prototype relative to exemplar model fit.]

Model-based fMRI
Learning phase. We first tested the degree to which prototype and exemplar information was represented across different brain regions and across different points of the learning phase. Using the data from the interim generalization tests, we compared neural model fits across our six ROIs and across the first and second half of the learning phase. Full ANOVA results are presented in Table 1. Figure 5 presents neural model fits for each ROI: Figure 5A for the first half of the learning phase, Figure 5B for the second half of the learning phase, and Figure 5C collapsed across the entire learning phase (to illustrate the main effects of ROI and model and the ROI x model interaction). We were also interested in whether a shift in representations could be detected across learning. The only effect involving learning phase that approached significance was the three-way ROI x model x learning phase interaction, likely reflecting the more pronounced region x model differences later in learning (Figure 5A vs. Figure 5B).

Final test. Figure 5D presents neural model fits from each ROI during the final test. We tested whether the differences across ROIs identified during the learning phase were also present in the final test. As during the learning phase, we found a significant main effect of ROI. Exemplar correlates did not reach significance in any of the predicted exemplar regions (all t < 1.18, p > .12, d < 0.31).

Pattern similarity
As a complement to model-based fMRI, we used pattern similarity analyses during observational study runs to measure item-level and category-level information. Observational study runs were well suited to pattern similarity analyses because, unlike test runs, there was no required motor response, and the physical similarity of training items was well controlled, allowing us to detect category effects that were not driven by differences in the number of shared features for pairs of items within the same category vs. between categories. First, we computed neural pattern similarity between all pairs of trials from different runs (e.g., trial 1 from run 1 to trial 1 from run 2, trial 1 from run 1 to trial 2 from run 2, etc.) to avoid any possible issues with within-run autocorrelation (Mumford, Davis, & Poldrack, 2014). We then sorted the neural pattern similarity values into three bins, depending on whether they reflected the similarity between activation patterns for repetitions of the same training item, two separate training items that belonged to the same category, or two training items that belonged to different categories. We then computed item-level representations as the difference between pattern similarity for same item and pattern similarity for distinct items from the same category.
Category-level representations were defined as the difference between same category similarity and different category similarity. As all items in the same category differed from one another by four features, we limited the different category pattern similarity analysis to only those pairs of items that also differed from one another by four features, which was by design the vast majority of all possible different category item pairs. This allowed us to measure category representations that were not merely driven by differences in physical similarity as all similarity comparisons were based on items that differed by four features.
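The binning logic described above can be sketched in Python. This is a minimal illustration with hypothetical inputs, not the study's analysis code; the function and variable names are our own, and Pearson correlation is assumed as the pattern similarity measure.

```python
import numpy as np

def similarity_bins(patterns, item_ids, category_ids, run_ids):
    """Sort across-run pattern correlations into same-item, within-category,
    and between-category bins, then form item- and category-level indices.

    patterns     : (n_trials, n_voxels) single-trial activation patterns
    item_ids     : stimulus identity for each trial
    category_ids : category label for each trial
    run_ids      : run number for each trial
    """
    corr = np.corrcoef(patterns)  # correlation between every pair of trial patterns
    bins = {"same_item": [], "within_category": [], "between_category": []}
    n = len(item_ids)
    for i in range(n):
        for j in range(i + 1, n):
            if run_ids[i] == run_ids[j]:
                continue  # only across-run pairs, avoiding within-run autocorrelation
            if item_ids[i] == item_ids[j]:
                bins["same_item"].append(corr[i, j])
            elif category_ids[i] == category_ids[j]:
                bins["within_category"].append(corr[i, j])
            else:
                bins["between_category"].append(corr[i, j])
    means = {k: float(np.mean(v)) for k, v in bins.items()}
    return {
        # item level: same item vs. distinct items from the same category
        "item_level": means["same_item"] - means["within_category"],
        # category level: within- vs. between-category similarity
        "category_level": means["within_category"] - means["between_category"],
    }
```

In the study itself, the between-category bin would additionally be restricted to item pairs differing by four features, as described above.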
Pattern similarity results are presented separately for each ROI and each learning half for category-level information (Figure 6A) and item-level information (Figure 6B). As shown in Figure 6A, these results show little evidence that items belonging to the same category were represented more similarly than equally similar items from different categories during observational study runs. In contrast, comparisons of item-level indices across ROIs and learning halves (Figure 6B) showed a significant main effect of ROI.

Discussion
In the present study, we tested whether exemplar- and prototype-based category representations could co-exist in the brain within a single task, under conditions that favor both exemplar memory and prototype extraction. We found signatures of both types of representations across distinct brain regions when participants categorized items during the learning phase. Consistent with predictions based on prior studies, the ventromedial prefrontal cortex and anterior hippocampus tracked abstract prototype information, and the inferior frontal gyrus and lateral parietal cortex tracked specific exemplar information. Item-level representation in the inferior frontal gyrus and lateral parietal cortex was also apparent in pattern similarity analyses computed during observational study runs. In addition, we tested whether individuals relied on different types of representations over the course of learning. We did not find evidence of representational shifts either from specific to abstract or vice versa.
Instead, results suggested that both types of representations emerged together during learning.
Together, we show that specific and abstract representations need not form at the expense of one another and may instead exist in parallel for the same categories.
A great deal of prior work in the domain of category learning has focused on whether classification of novel category members relies on retrieval of individual category exemplars (Kruschke, 1992; Nosofsky, 1986) or instead on abstract category prototypes (Homa et al., 1973; Posner & Keele, 1968; Reed, 1972). These two representations are often pitted against one another, with one declared the winner over the other based largely on typical model-fitting procedures for behavioral data. Indeed, fitting exemplar and prototype models to behavioral data in the present study generally showed better fit of the prototype model over the exemplar model. However, using neuroimaging allowed us to detect both types of representations across different parts of the brain. These results thus contribute to the ongoing debate about the nature of category representations by showing that individuals may maintain multiple representations simultaneously even when one model shows better overall fit to behavior.
In addition to contributing novel findings to a longstanding debate in the behavioral literature, the present study also helps to adjudicate between prior neuroimaging studies fitting prototype and exemplar models to brain data. Specifically, two prior studies found conflicting results: one study found only exemplar representations in the brain (Mack et al., 2013) whereas another found only prototype representations (Bowman & Zeithamova, 2018). Notably, the brain regions tracking exemplar predictions were different from those identified as tracking prototype predictions, showing that these studies engaged different brain systems in addition to implicating different categorization strategies. However, because the category structures, stimuli, and analysis details also differed between these studies, the between-studies differences in the identified neural systems could not be uniquely attributed to the distinct category representations that participants presumably relied on. The present data newly show that neural prototype and exemplar correlates can exist not only across different task contexts but also within the same task, providing strong evidence that these neural differences reflect distinct category representations rather than different task details. Moreover, our results aligned with those found separately across two studies, replicating the role of the VMPFC and anterior hippocampus in tracking prototype information (Bowman & Zeithamova, 2018) and the role of inferior prefrontal and lateral parietal cortices in tracking exemplar information (Mack et al., 2013).
What causes differences in the types of representations engaged across tasks and why were we able to detect both in the present study? A number of manipulations are known to influence which memory systems are engaged to support categorization, including whether training items are relatively coherent or more variable (Bowman and Zeithamova, in press), whether training is intentional or incidental (Aizenstein et al., 2000;Reber, Gitelman, Parrish, & Mesulam, 2003), whether individuals learn single or contrasting categories (Zeithamova et al., 2008), and whether a category member follows a category rule or is a rule-breaking exception (Davis, Love, & Preston, 2012). In designing the present study, we included a relatively coherent category structure that was likely to promote prototype formation (Bowman and Zeithamova, in press, 2018), but balanced it with an observational rather than feedback-based training task in hopes of emphasizing individual items and promoting exemplar representations.
Detecting exemplar-tracking regions using this design provides new evidence that observational training can enhance formation of exemplar representations relative to feedback-based training, consistent with evidence that these two forms of training engage different neural systems (Cincotta & Seger, 2007; Shohamy et al., 2004).
While co-existing prototype and exemplar representations were clear during the learning phase of our task, they were less clear during the final test phase. The VMPFC and anterior hippocampus continued to track prototypes during the final test, but exemplar-tracking regions no longer emerged. These weaker exemplar correlates in the brain were matched by a weaker exemplar effect in behavior: while we observed a reliable advantage for old items relative to new items matched for distance during the interim tests, the old-item advantage was no longer significant in the final test. The effect size for the prototype advantage in model fits was also larger in the final test than in the learning phase. This finding was unexpected, but we offer several possibilities that can be investigated further in future research. One possibility relates to categorical perception: prior work has shown that category learning can cause items within a category to be perceived as more similar to one another and/or items from different categories to be perceived as less similar to one another following learning (Beale & Keil, 1995; Goldstone, 1994; Livingston, Andrews, & Harnad, 1998). These categorical perception effects are thought to be driven by biases in attention toward relevant features and/or away from irrelevant features (Goldstone & Steyvers, 2001; Nosofsky, 1986).

In addition to identifying multiple, co-existing types of category representations during learning, we sought to test whether there were representational shifts as category knowledge developed. We found no evidence for a shift between exemplar and prototype representations over the course of learning. Both prototype and exemplar correlates showed numerical increases across learning in brain and behavior, suggesting strengthening of both types of representations in parallel.
Prior work has suggested that there may be representational shifts during category-learning, but rather than shifting between exemplar and prototype representations, early learning may be focused on detecting simple rules and testing multiple hypotheses (Johansen & Palmeri, 2002;Nosofsky, Palmeri, & McKinley, 1994;Paniukov & Davis, 2018), whereas similarity-based representations such as prototype and exemplar representations may develop later in learning (Johansen & Palmeri, 2002). Our findings are consistent with this framework, with strong prototype and exemplar representations emerging across distinct regions primarily in the second half of learning. Our results are also consistent with recent neuroimaging studies showing multiple memory representations forming in parallel without need for competition (Collin et al., 2015;Schlichting et al., 2015), potentially allowing individuals to flexibly use prior experience based on current decision-making demands.

Conclusion
In the present study, we found evidence for multiple types of category representations co-existing across distinct brain regions within the same categorization task. Further, the regions identified as prototype-tracking (anterior hippocampus and VMPFC) and exemplar-tracking (IFG and lateral parietal cortex) in the present study align with prior studies that have found only one or the other. These findings shed light on the multiple memory systems that contribute to concept representation and provide novel evidence of how the brain may flexibly represent information at different levels of specificity without the need for competition between representations.

Method

Participants
Forty volunteers were recruited from the University of Oregon and surrounding community and were financially compensated for their research participation. This sample size was determined based on effect sizes for neural prototype-tracking and exemplar-tracking regions estimated from prior studies (Bowman & Zeithamova, 2018;Mack et al., 2013), allowing for detection of these effects with at least 80% power. All participants provided written informed consent, and Research Compliance Services at the University of Oregon approved all experimental procedures. All participants were right-handed, native English speakers and were screened for neurological conditions, medications known to affect brain function, and contraindications for MRI.
A total of 11 subjects were excluded: 3 subjects for chance performance (< .6 by the end of the training phase and/or < .6 for trained items in the final test), 5 subjects for high correlation between regressors (r > .9 for prototype and exemplar regressors and/or rank deficiency across the entire design), 1 subject for excessive motion (> 1.5 mm within multiple runs), and 2 subjects for failure to complete all phases. This left 29 subjects (age: M = 21.9 years, SD = 3.3 years, range 18-30 years; 19 females) reported in all analyses. Additionally, we excluded single runs from three subjects who had excessive motion limited to that single run.

Materials
Stimuli consisted of cartoon animals that differed on eight binary features: neck (short vs. long), tail (straight vs. curled), foot shape (claws vs. round), snout (rounded vs. pig), head (ears vs. antennae), color (purple vs. red), body shape (angled vs. round), and design on the body (polka dots vs. stripes) (Bozoki et al., 2006; Zeithamova et al., 2008; available for download at osf.io/8bph2). The two possible versions of all features can be seen across the two prototypes shown in Figure 1C. For each participant, the stimulus that served as the prototype of category A was randomly selected from four possible stimuli and all other stimuli were recoded in reference to that prototype. The stimulus that shared no features with the category A prototype served as the category B prototype. Physical distance between any pair of stimuli was defined by their number of differing features. Category A stimuli were those that shared more features with the category A prototype than the category B prototype, and vice versa for category B stimuli.
Stimuli equidistant from the two prototypes were not used in the study.
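The stimulus coding described above can be illustrated with a short sketch. This is a hypothetical illustration of the design logic (binary feature vectors, city-block distance, category assignment by the nearer prototype), not the authors' code; the 0/1 coding and all names are our own.

```python
import itertools

N_FEATURES = 8
PROTO_A = (0,) * N_FEATURES   # category A prototype (all features coded 0)
PROTO_B = (1,) * N_FEATURES   # shares no features with prototype A

def distance(x, y):
    """Number of differing binary features (city-block distance)."""
    return sum(xi != yi for xi, yi in zip(x, y))

def category(x):
    """'A' or 'B' by the nearer prototype; None if equidistant (unused in the study)."""
    d_a, d_b = distance(x, PROTO_A), distance(x, PROTO_B)
    if d_a == d_b:
        return None
    return "A" if d_a < d_b else "B"

# full stimulus space: 2^8 = 256 possible cartoon animals
space = list(itertools.product([0, 1], repeat=N_FEATURES))
members_a = [s for s in space if category(s) == "A"]
equidistant = [s for s in space if category(s) is None]
```

Under this coding, 70 of the 256 possible stimuli (those sharing exactly four features with each prototype) are equidistant and excluded, leaving 93 members per category; the four training items per category are drawn from the 28 stimuli at distance 2 from their prototype.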
Training set. The training set included four stimuli per category, each differing from their category prototype by two features (see Table 2 for training set structure). The general structure of the training set with regard to the category prototypes was the same across subjects, but the exact stimuli differed based on the prototypes selected for a given participant. The training set structure was selected to generate many pairs of training items that were four features apart, both within the same category and across the two categories. This design ensured 1) that categories could not be learned via unsupervised clustering based on similarity of exemplars alone and 2) that the number of shared features between items was controlled when computing pattern similarity within and between categories, equating physical similarity when determining the effect of category membership (Bowman & Zeithamova, 2018; Kéri, Kelemen, Benedek, & Janka, 2001; Smith, Redford, & Haas, 2008).

Experimental Design
The study consisted of two sessions: one session of neuropsychological testing and one experimental session. Only results from the experimental session are reported in the present manuscript. In the experimental session, subjects underwent four cycles of observational study and interim generalization tests ( Figure 1D), followed by a final generalization test ( Figure 1E), all while undergoing fMRI.
In each run of observational study, participants were shown individual animals on the screen with a species label (Febbles and Badoons) and were told to try to figure out what makes some animals Febbles and others Badoons without making any overt responses. Each stimulus was presented on the screen for 5 seconds followed by a 7 second ITI. Within each study run, participants viewed the training examples three times in a random order. After two study runs, participants completed an interim generalization test. Participants were shown cartoon animals without their labels and classified them into the two species without feedback.
Each test stimulus was presented for 5 seconds during which time they could make their response, followed by a 7 second ITI. After four study-test cycles, participants completed a final categorization test, split across four runs. As in the interim tests, participants were asked to categorize animals into one of two imaginary species (Febbles and Badoons) using the same button press while the stimulus was on the screen. Following the MRI session, subjects were asked about the strategies they used to learn the categories, if any, and then indicated which version of each feature they thought was most typical for each category. Lastly, subjects were verbally debriefed about the study.

fMRI Data Acquisition
Raw MRI data are available for download via OpenNeuro.

Behavioral accuracies.
Interim tests. To assess changes in generalization accuracy across train-test cycles, we computed a 4 (interim test run: 1-4) x 4 (distance: 0-3) repeated-measures ANOVA on accuracy for new items only. We were particularly interested in linear effects of interim test run and distance. We also tested whether there was a difference across training in accuracy for the training items themselves versus new items at the same distance from their prototypes, which can index how much participants learn about specific items above and beyond what would be expected based on their typicality. We thus computed a 4 (interim test run: 1-4) x 2 (item type: training, new) repeated-measures ANOVA on accuracies for distance 2 items.
Final test. First, to assess the effect of item typicality, classification performance in the final test (collapsed across runs) was assessed by computing a one-way, repeated-measures ANOVA across new items at distances (0-3) from either prototype. Second, we assessed whether there was an old-item advantage by comparing accuracy for training items and new items of equal distance from prototypes (distance 2) using a paired-samples t-test. For all analyses (including fMRI analyses described below), a Greenhouse-Geisser correction was applied whenever the assumption of sphericity was violated as denoted by 'GG' in the results.
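For illustration, the Greenhouse-Geisser epsilon underlying the sphericity correction mentioned above can be computed from the sample covariance matrix of the repeated-measures conditions using the classic Greenhouse and Geisser (1959) formula. This is a generic sketch, not the study's analysis code; in practice the covariance would come from the subjects-by-conditions accuracy matrix (e.g., via `np.cov`).

```python
import numpy as np

def gg_epsilon(S):
    """Greenhouse-Geisser epsilon from a k x k condition covariance matrix.

    epsilon = 1 when sphericity holds, with lower bound 1/(k - 1) under
    maximal violation; ANOVA degrees of freedom are multiplied by epsilon
    before evaluating the F statistic.
    """
    k = S.shape[0]
    mean_diag = np.trace(S) / k   # mean of the diagonal (condition variances)
    grand = S.mean()              # grand mean of all covariance entries
    row = S.mean(axis=1)          # row means
    num = (k * (mean_diag - grand)) ** 2
    den = (k - 1) * ((S ** 2).sum()
                     - 2 * k * (row ** 2).sum()
                     + k ** 2 * grand ** 2)
    return num / den
```

A spherical (e.g., identity) covariance yields epsilon = 1, i.e., no correction, while a rank-one covariance yields the lower bound 1/(k - 1).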

Prototype and exemplar model fitting
As no responses were made during the study runs, prototype and exemplar models were fit only to test runs (interim and final tests). As the number of trials in each interim test was kept low to minimize exposure to non-training items during the learning phase, we concatenated across interim tests 1 and 2 and across interim tests 3 and 4 to obtain more robust model fit estimates for the first vs. second half of the learning phase. Model fits for the final test were computed across all four runs combined. Each model was fit to trial-by-trial data in individual participants.
Prototype similarity. As in prior studies (Bowman & Zeithamova, 2018; Maddox et al., 2011; Minda & Smith, 2001), the similarity of each test stimulus to each prototype was computed, assuming that perceptual similarity is an exponential decay function of physical similarity (Shepard, 1957), and taking into account potential differences in attention to individual features. Formally, similarity between a test stimulus x and the category A prototype was computed as follows:

S_A(x) = \exp\left[-c\left(\sum_{i=1}^{m} w_i \left|x_i - \text{proto}_{A,i}\right|^{r}\right)^{1/r}\right] (1)

where S_A(x) is the similarity of item x to category A, x_i represents the value of item x on the i-th of its m binary dimensions (m = 8 in this study), proto_{A,i} is the value of the category A prototype on the i-th dimension, and r is the distance metric (fixed at 1, the city-block metric, for the binary dimension stimuli). Parameters estimated from each participant's pattern of behavioral responses were w (a vector of 8 weights, one for each of the 8 stimulus features, constrained to sum to 1) and c (sensitivity: the rate at which similarity declines with physical distance, constrained to be 0-100).
Exemplar similarity. Exemplar models assume that categories are represented by their individual exemplars, and that test items are classified into the category with the highest summed similarity across category exemplars (Figure 1A). As in prior studies (Nosofsky, 1987; Zaki, Nosofsky, Stanton, & Cohen, 2003), similarity of each test stimulus to the exemplars of each category was computed as follows:

S_A(x) = \sum_{y \in A} \exp\left[-c\left(\sum_{i=1}^{m} w_i \left|x_i - y_i\right|^{r}\right)^{1/r}\right] (2)

where y represents the individual training stimuli from category A, and the remaining notation and parameters are as in Equation 1.

Parameter estimation. For both models, the probability of assigning a stimulus x to category A is equal to the similarity to category A divided by the summed similarity to categories A and B, formally:

P(A \mid x) = \frac{S_A(x)}{S_A(x) + S_B(x)} (3)

Using these equations, the best fitting w_{1-8} (attention to each feature) and c (sensitivity) parameters were estimated from the behavioral data of each participant, separately for the first half of the learning phase, second half of the learning phase, and the final test, and separately for the prototype and exemplar models. To estimate these parameters for a given model, the trial-by-trial predictions generated by Equation 3 were compared with the participant's observed responses, selecting the parameter values that maximized the likelihood of those responses (i.e., minimized negative log likelihood, the model-error metric reported throughout). We compared prototype and exemplar fits within each half of the learning phase to determine whether there were shifts across learning in which model fit best. We used a paired-samples t-test comparing model fits during the final test to determine whether the group as a whole was better fit by the prototype or exemplar model by the end of the experiment. Finally, we used trial-by-trial similarity metrics generated by each model as regressors for the neuroimaging data (described in the fMRI analysis section) to identify brain regions tracking predictions of each model during each phase.
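Equations 1-3 and the maximum-likelihood fitting procedure can be sketched as follows. This is a minimal Python illustration, not the authors' code: the softmax parameterization of the attention weights and the Nelder-Mead optimizer are our own choices for enforcing the stated constraints (weights summing to 1, sensitivity in 0-100).

```python
import numpy as np
from scipy.optimize import minimize

def similarity(x, ref, w, c):
    """Eqs. 1-2 kernel: exponential decay of weighted city-block distance (r = 1)."""
    return np.exp(-c * np.dot(w, np.abs(x - ref)))

def prob_a(stim, cat_a, cat_b, w, c):
    """Eq. 3: P(respond 'A') from summed similarity to each category's referents.

    For the prototype model, cat_a/cat_b each hold one item (the prototype);
    for the exemplar model, they hold all training items of that category.
    """
    s_a = sum(similarity(stim, ref, w, c) for ref in cat_a)
    s_b = sum(similarity(stim, ref, w, c) for ref in cat_b)
    return s_a / (s_a + s_b)

def neg_log_likelihood(params, stims, resp_a, cat_a, cat_b):
    """Model error for one participant (lower = better fit); resp_a is 1 for 'A'."""
    w = np.exp(params[:8]) / np.exp(params[:8]).sum()  # 8 weights, sum to 1
    c = np.clip(params[8], 0.0, 100.0)                 # sensitivity in [0, 100]
    nll = 0.0
    for x, r in zip(stims, resp_a):
        p = prob_a(x, cat_a, cat_b, w, c)
        nll -= np.log(max(p if r else 1.0 - p, 1e-10))  # guard against log(0)
    return nll

def fit_model(stims, resp_a, cat_a, cat_b):
    """Estimate w and c by maximum likelihood; returns the scipy result object."""
    x0 = np.zeros(9)
    x0[8] = 1.0  # starting sensitivity
    return minimize(neg_log_likelihood, x0,
                    args=(stims, resp_a, cat_a, cat_b), method="Nelder-Mead")
```

Passing the two prototypes as single-item lists yields the prototype model fit; passing each category's training items yields the exemplar model fit, so the two models can be compared on the same negative log likelihood scale.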

fMRI Preprocessing
The raw data were converted from dicom files to nifti files using the dcm2niix function from MRIcron (https://www.nitrc.org/projects/mricron). Functional images were skull-stripped using BET (Brain Extraction Tool), which is part of FSL (www.fmrib.ox.ac.uk/fsl). Within-run motion correction was computed using MCFLIRT in FSL to realign each volume to the middle volume of the run. Across-run motion correction was then computed using ANTS (Advanced Normalization Tools) by registering the first volume of each run to the first volume of the first functional run (i.e., the first training run). Each computed transformation was then applied to all volumes in the corresponding run. Brain-extracted and motion-corrected images from each run were entered into FEAT (fMRI Expert Analysis Tool) in FSL for high-pass temporal filtering (100 s) and spatial smoothing (4 mm FWHM kernel for univariate analyses, 2 mm for pattern analyses).

Regions of interest
Regions of interest (ROIs, Figure 2) were defined anatomically in individual participants' native space using the cortical parcellation and subcortical segmentation from Freesurfer version 6 (https://surfer.nmr.mgh.harvard.edu/) and collapsed across hemispheres to create bilateral masks. Past research has indicated that there may be a functional gradient along the hippocampal long axis, with detailed, fine-grained representations in the posterior hippocampus and increasingly coarse, generalized representations proceeding toward the anterior hippocampus (Brunec et al., 2018; Frank, Bowman, & Zeithamova, 2019; Poppenk, Evensmoen, Moscovitch, & Nadel, 2013). As such, we divided the hippocampal ROI into anterior/posterior portions at the middle slice. When a participant had an odd number of hippocampal slices, the middle slice was assigned to the posterior hippocampus. Based on our prior report (Bowman & Zeithamova, 2018), we expected the anterior portion of the hippocampus to track prototype predictors, together with VMPFC (medial orbitofrontal label in Freesurfer). Based on the prior study by Mack and colleagues (2013), we expected lateral occipital cortex, inferior frontal gyrus (combination of pars opercularis, pars orbitalis, and pars triangularis Freesurfer labels), and lateral parietal cortex (combination of inferior parietal and superior parietal Freesurfer labels) to track exemplar predictors. The posterior hippocampus was also included as an ROI, to test for an anterior/posterior dissociation within the hippocampus.
While one might expect the posterior hippocampus to track exemplar predictors based on the aforementioned functional gradient, our prior report (Bowman & Zeithamova, 2018) found only a numeric trend in this direction, and Mack et al. (2013) did not report any hippocampal findings despite significant exemplar correlates found in the cortex. Thus, we did not have strong predictions regarding the posterior hippocampus, other than that it would be distinct from the anterior hippocampus.

Model-based fMRI analyses
fMRI data were modeled using the GLM. Three task-based regressors were included in the GLM: one for all trial onsets, one that included modulation for each trial by prototype model predictions, and one that included modulation for each trial by exemplar model predictions. The regressor for all trial onsets was included to account for activation that is associated with performing a categorization task generally, but does not track either model specifically. The modulation values for each model were computed as the summed similarity across category A and category B (denominator of Equation 3) generated by the assumptions of each model (from Equations 1 and 2). This summed similarity metric indexes how similar the current item is to the existing category representations as a whole (regardless of which category it is closer to) and has been used by prior studies to identify regions that contain such category representations (Bowman & Zeithamova, 2018;Mack et al., 2013). By including both model predictors in the same GLM, we account for any shared variance between the regressors. Events were modeled with a duration of 5 seconds, which was the fixed length of the stimulus presentation. Onsets were then convolved with the canonical hemodynamic response function as implemented in FSL (a gamma function with a phase of 0 seconds, an SD of 3 seconds, and a mean lag of 6 seconds). Finally, the six standard timepoint-by-timepoint motion parameters were included as regressors of no interest.
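As a rough illustration, the summed-similarity modulators might be computed as below. Equations 1-3 are not reproduced in this section, so the multiplicative similarity function and the sensitivity parameter `s` here are assumptions based on standard prototype/exemplar models, not the paper's exact formulation.

```python
import numpy as np

def sim(x, y, s=0.3):
    """Similarity as s raised to the number of mismatching binary
    features, a common choice in prototype/exemplar models (assumed;
    the paper's Equations 1-2 may differ)."""
    return s ** np.sum(np.asarray(x) != np.asarray(y))

def summed_similarity_prototype(item, proto_a, proto_b, s=0.3):
    # denominator of the choice rule: similarity to both prototypes
    return sim(item, proto_a, s) + sim(item, proto_b, s)

def summed_similarity_exemplar(item, exemplars_a, exemplars_b, s=0.3):
    # summed similarity to every stored exemplar from both categories
    return (sum(sim(item, e, s) for e in exemplars_a)
            + sum(sim(item, e, s) for e in exemplars_b))
```

One such value per trial would serve as the parametric modulator for the corresponding model regressor.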
For region of interest analyses, we obtained an estimate of how much the BOLD signal in each region tracked each model predictor by dividing the mean ROI parameter estimate by the standard deviation of parameter estimates (i.e., computing an effect size measure).
Normalizing the beta values by the variability of their estimates down-weights values associated with large uncertainty, similar to how lower-level estimates are used in group analyses as implemented in FSL (S. M. Smith et al., 2004). These normalized beta values were then averaged across the appropriate runs (interim tests 1-2, interim tests 3-4, all four runs of the final test) and submitted to group analyses.
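This effect-size computation reduces to a one-liner. A minimal sketch, assuming the standard deviation is taken across the voxelwise parameter estimates within the ROI (the text does not spell out the exact denominator):

```python
import numpy as np

def roi_effect_size(voxel_betas):
    """Mean ROI parameter estimate divided by the standard deviation
    of the estimates (effect-size-like normalization; assumes the SD
    is computed across voxels within the ROI)."""
    voxel_betas = np.asarray(voxel_betas, dtype=float)
    return voxel_betas.mean() / voxel_betas.std(ddof=1)
```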
We tested whether prototype and exemplar correlates emerge across different regions and/or at different points during the learning phase. To do so, we computed a 2 (model: prototype, exemplar) x 2 (learning phase: 1st half, 2nd half) x 6 (ROI: VMPFC, anterior hippocampus, posterior hippocampus, lateral occipital, lateral prefrontal, and lateral parietal cortices) repeated-measures ANOVA on parameter estimates from the interim test runs. We were interested in a potential model x ROI interaction effect, indicating differences across brain regions in the type of category information represented. Following any significant interaction effect, we computed one-sample t-tests to determine whether each region significantly tracked a given model. Given a priori expectations about the nature of these effects, we computed one-tailed tests only on the effects of interest: for example, in hypothesized prototype-tracking ROIs (anterior hippocampus and VMPFC), we computed one-sample t-tests to compare prototype effects to zero. We followed a similar procedure in hypothesized exemplar-tracking ROIs (inferior frontal gyrus, lateral parietal cortex, lateral occipital cortex). We were also interested in potential interactions with the learning phase, which would indicate a shift in category representations across learning. Following any such interaction, follow-up ANOVAs or t-tests were performed to better understand drivers of the effect.
We next tested ROI differences in the final generalization phase. To do so, we computed a 2 (model: prototype, exemplar) x 6 (ROI: see above) repeated-measures ANOVA on parameter estimates from the final generalization test. We were particularly interested in the model x ROI interaction effect, which would indicate that regions differ in which model they tracked.

Pattern similarity analyses during observational study runs
As a complementary analysis, we also examined item-level and category-level representations using pattern similarity analyses during observational study runs. Study runs contained many pairs of items, both from the same category and from opposing categories, that were equated on physical similarity (four features apart). To generate beta-series representing a pattern of activation for each trial, a GLM was run in which each trial was modeled as a separate regressor of interest. In addition to the 24 regressors for the trials (8 category examples repeated 3 times), the model included 6 standard motion regressors, totaling 30 regressors for each run. As with the model-based analyses described above, the duration was set to 5 seconds and regressors were convolved with the canonical hemodynamic response function. After estimating the model for each run, we concatenated the single-trial betas across all training runs.
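The single-trial (beta-series) design can be sketched as below: one boxcar regressor per trial, convolved with a gamma HRF with the lag and SD given earlier. The exact FSL kernel parameterization is approximated here, and motion regressors would be appended to the resulting matrix.

```python
import numpy as np

def gamma_hrf(tr, length=30.0, lag=6.0, sd=3.0):
    """Gamma HRF with mean lag 6 s and SD 3 s (shape/scale derived
    from mean = shape*scale and var = shape*scale**2); approximates
    FSL's convolution kernel."""
    t = np.arange(0.0, length, tr)
    shape = (lag / sd) ** 2
    scale = sd ** 2 / lag
    h = t ** (shape - 1) * np.exp(-t / scale)
    return h / h.sum()

def beta_series_design(onsets, n_vols, tr, dur=5.0):
    """One column per trial: a 5-s boxcar at the trial onset,
    convolved with the HRF."""
    t = np.arange(n_vols) * tr
    h = gamma_hrf(tr)
    X = np.zeros((n_vols, len(onsets)))
    for j, onset in enumerate(onsets):
        box = ((t >= onset) & (t < onset + dur)).astype(float)
        X[:, j] = np.convolve(box, h)[:n_vols]
    return X
```

Fitting this design to each voxel's time series yields one beta per trial, which are then concatenated across runs as described.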
Following concatenation, we submitted each beta-series to pattern similarity analyses using the PyMVPA toolbox (www.pymvpa.org) for Python (www.python.org). For each ROI, we extracted the vector of beta parameters across voxels for two trials of interest and correlated them using a Pearson's correlation. Within each half of training (study runs 1-4 vs. study runs 5-8), we computed the pattern similarity score for all trials not in the same run (all run 1 trials compared to all run 2, 3, and 4 trials; all run 2 trials with all run 3 and 4 trials, etc.). Pattern similarity was never computed within a run to avoid issues with local autocorrelations (Mumford et al., 2014). After computing all similarity scores, the resulting r-values were Fisher z-transformed. Three types of similarity scores of interest were then extracted: same item comparisons, same category comparisons, and different category comparisons. For same item comparisons, the similarity score represents the correlation in activation patterns when the exact same item is repeated. For same category comparisons, the similarity score represents the correlation in activation patterns for two items that are in the same category, but are not the exact same item. For different category comparisons, the similarity score represents the correlation in activation patterns for two items that are from opposing categories. To ensure that differences in similarity scores between items in the same vs. different categories were not due only to differences in physical similarity, we only extracted similarity scores for pairs of items that differed by 4 features, in both the same-category and different-category conditions.
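The core of this procedure, correlating trial patterns across runs only and Fisher z-transforming the result, can be sketched in a few lines (a simplified sketch; PyMVPA's own data structures are not used here):

```python
import numpy as np

def cross_run_similarity(betas, runs):
    """Pearson-correlate trial patterns (betas: trials x voxels),
    keep only pairs of trials from different runs (within-run pairs
    are excluded because of local autocorrelation), and return the
    Fisher z-transformed r-values."""
    betas = np.asarray(betas, dtype=float)
    r = np.corrcoef(betas)                     # trial-by-trial correlations
    i, j = np.triu_indices(len(runs), k=1)     # each unordered pair once
    keep = np.asarray(runs)[i] != np.asarray(runs)[j]
    return np.arctanh(r[i[keep], j[keep]])     # Fisher z-transform
```

The same-item, same-category, and different-category scores would then be extracted by labeling each retained pair according to the items involved.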
To measure item-level representations, we then subtracted the same category similarity from same item similarity (same item - same category). To measure category-level representations, we subtracted the different category similarity from the same category similarity (same category - different category). To test for differences in the strength of item and category information across ROIs and across learning, we submitted these difference scores to separate 2 (learning phase) x 6 (ROI) repeated-measures ANOVAs. Following a significant effect of ROI or ROI x learning phase interaction, we computed one-sample t-tests within each ROI to determine whether the pattern similarity effect of interest was present (i.e., item or category representation significantly different than zero). Following a significant interaction, we used paired-samples t-tests to identify differences between the first and second half of learning within each ROI.
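The two difference scores follow directly from the definitions above; a minimal sketch, assuming each argument is a set of Fisher-z similarity values for one participant and ROI:

```python
import numpy as np

def representation_scores(same_item, same_category, different_category):
    """Item-level score: same-item minus same-category similarity.
    Category-level score: same-category minus different-category
    similarity. Inputs are Fisher-z similarity values."""
    item_score = np.mean(same_item) - np.mean(same_category)
    category_score = np.mean(same_category) - np.mean(different_category)
    return item_score, category_score
```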