Abstract
Humans learn object categories without millions of labels, but to date the models with the highest correspondence to primate visual systems are all category-supervised. This paper introduces a new self-supervised learning framework: instance-prototype contrastive learning (IPCL), and compares the internal representations learned by this model and other instance-level contrastive learning systems to the structure of human brain responses. We present the first evidence to date showing that self-supervised systems can show more brain-like representation than category-supervised models. Further, we find that recent substantial gains in top-1 accuracy from instance-wise contrastive learning models do not result in more brain-like representation—instead we find the architecture and normalization scheme are critical. Finally, this dataset reveals substantial representational structure in intermediate and late stages of the human visual system that is not accounted for by any model, whether self-supervised or category-supervised. Considering both neuroscience and machine vision perspectives, these results provide promise for instance-level representation as a key objective of visual system encoding, and highlight the room to grow towards more robust, efficient, human-like object representation.
1 Introduction
A fundamental goal for machine vision is to learn useful, flexible, generalizable, robust visual representations, e.g. with comparable capacities to human vision. Drawing insights from biology has been valuable—the past decade’s breakthroughs using deep convolutional neural networks have leveraged structural and algorithmic parallels to biological neural systems (LeCun et al., 2015). And, in a stunning act of reciprocity, these networks learn hierarchical visual representations that show an emergent match to the structure of visual brain responses to depicted objects (e.g. Cadieu et al., 2014; Khaligh-Razavi and Kriegeskorte, 2014; Schrimpf et al., 2018; Yamins et al., 2014), though certainly with room to improve (Geirhos et al. 2018; Xu and Vaziri-Pashkam 2020, see Serre 2019; Sinz et al. 2019 for recent reviews). In this way, comparing learned visual representations directly to brain representations can be an additional source of biological feedback for model development (Schrimpf et al., 2018).
Beyond the CNN architecture, further biological inspiration can be taken from the nature of the learning: humans do not learn from millions of labeled examples, and it has been argued that in the next decade of advances, neither should machines (e.g. LeCun et al., 2015). And, recently, there have been substantial advances in self-supervised learning, particularly using instance-level contrastive learning (He et al., 2019; Hjelm et al., 2018; Misra and van der Maaten, 2019; Oord et al., 2018; Tian et al., 2019; Wu et al., 2018; Ye et al., 2019). For example, recent representation learning frameworks like SimCLR (Chen et al., 2020a) and MoCo2 (Chen et al., 2020b) are now able to achieve dramatically higher overall categorization accuracy, approaching supervised model performance.
A key insight across these frameworks is that they operate over instances—individual images—where category-level representation develops as an emergent capacity of the learned representational space (see Wu et al., 2018). With a biological lens, this is a critical step towards a more plausible learning framework: categories do not need to be presupposed ahead of time but can be indexed from within a more generically useful representation that is learned solely from the structure of natural images and the architecture-induced prior of a hierarchical CNN. Do such models learn representations that are similar to those of supervised models, with similar matches to the brain, e.g. arrive at the same representation through different mechanisms? Or could these models be learning something different that is even more brain-like?
To examine these possibilities, this paper first presents a new self-supervised, contrastive-learning framework: instance-prototype contrastive learning (IPCL). The IPCL framework trains the model to build an online prototype representation of each instance from multiple samples (augmentations), and to represent each sample as dissimilar from previously viewed samples stored in an offline memory queue (non-indexed). Next, we compare this model and other instance-level contrastive learning systems to the representations measured in the human visual system, considering early, intermediate, and later hierarchical stages.
2 Instance-Prototype Contrastive Learning (IPCL)
In contrastive-learning frameworks, the goal is to learn an embedding function that maps images into a low-dimensional latent space, where visually similar images are close to each other, and visually dissimilar images are far apart. Learning proceeds by organizing the training data into similar pairs (positive samples) and dissimilar pairs (negative samples), where different frameworks make different choices of how positive and negative samples are computed and retained throughout the learning process (Doersch and Zisserman, 2017; Dosovitskiy et al., 2014; He et al., 2019; Ji et al., 2019; Misra and van der Maaten, 2019; Tian et al., 2019; Wu et al., 2018; Ye et al., 2019; Zhuang et al., 2019).
Our instance-prototype contrastive learning framework is depicted in Figure 1. We randomly augment the same image $x$ multiple times (here, $n = 5$), then pass each augmented image $(x_1, \ldots, x_n)$ through an embedding function $f_\theta(x)$ to obtain a low-dimensional representation of each sample $(z_1, \ldots, z_n)$. We then compute an instance prototype by averaging the embeddings of all 5 samples:
$$
\bar{z} = \frac{1}{n} \sum_{i=1}^{n} f_\theta(x_i), \tag{1}
$$
where $n$ is the number of samples, $f_\theta$ is the embedding function, and $x_i$ is the $i$th augmented sample of the image.
For each augmented instance, its prototype serves as its positive pair, and all stored representations serve as negative pairs (implemented with a lightweight, non-indexed memory queue storing the $K = 4096$ most recent samples). The normalized temperature-scaled cross entropy loss for a positive pair $(z_i, \bar{z}_i)$ is defined as:
$$
\ell(z_i, \bar{z}_i) = -\log \frac{\exp\big(\mathrm{sim}(z_i, \bar{z}_i)/\tau\big)}{\exp\big(\mathrm{sim}(z_i, \bar{z}_i)/\tau\big) + \sum_{k=1}^{K} \exp\big(\mathrm{sim}(z_i, q_k)/\tau\big)}, \tag{2}
$$
where the similarity function $\mathrm{sim}$ is the dot product between embeddings, $q_k$ is the embedding of the $k$th item in the memory queue, $\tau$ is a temperature parameter that controls the dynamic range of the similarity function, and $K$ is the total number of instances in the memory queue. In practice, we used Noise Contrastive Estimation (NCE, Gutmann and Hyvärinen, 2010) to approximate sampling from a larger memory store (see Wu et al. 2018; Appendix B.1), though recent work suggests the loss function in equation 2 may suffice (e.g. see Chen et al., 2020a). The final loss is computed across all positive pairs in a minibatch (128 images, 5 samples per image, yielding 640 positive pairs). The queue is updated after every minibatch, with the current samples added to the queue and displacing the oldest samples.
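For concreteness, the following is a minimal PyTorch sketch of this objective, using the full softmax of equation 2 rather than the NCE approximation; tensor shapes, function names, and the batching layout are illustrative assumptions, not the exact training code.

```python
import torch
import torch.nn.functional as F

def ipcl_loss(z, queue, n_samples=5, tau=0.07):
    """Sketch of the IPCL objective (Eqs. 1-2), not the exact training code.

    z:     L2-normalized embeddings for a minibatch, shape (B * n_samples, D),
           arranged so consecutive rows are augmentations of the same image.
    queue: L2-normalized embeddings of the K most recent samples, shape (K, D).
    """
    D = z.shape[1]
    # Eq. 1: instance prototypes, averaged over the n augmentations of each image.
    prototypes = z.view(-1, n_samples, D).mean(dim=1)            # (B, D)
    prototypes = F.normalize(prototypes, dim=1)
    prototypes = prototypes.repeat_interleave(n_samples, dim=0)  # (B * n_samples, D)

    # Similarities to the positive (own prototype) and negatives (queue items).
    pos = (z * prototypes).sum(dim=1, keepdim=True) / tau        # (B * n_samples, 1)
    neg = z @ queue.t() / tau                                    # (B * n_samples, K)

    # Eq. 2: normalized temperature-scaled cross entropy (positive is class 0).
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(z.shape[0], dtype=torch.long, device=z.device)
    return F.cross_entropy(logits, labels)

def update_queue(queue, z, max_size=4096):
    """FIFO queue update: newest samples added, oldest displaced (no instance index)."""
    queue = torch.cat([z.detach(), queue], dim=0)
    return queue[:max_size]
```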
Architecturally, we used an Alexnet as the base image encoder (Krizhevsky et al., 2012), replacing the 1000-dimensional output layer with a 128-dimensional fully-connected layer with an L2 norm, following Wu et al. (2018). Further, we modified the Alexnet to have group norm (gn) rather than batch norm (bn) layers, which enabled successful learning (see Appendix A). And, to preview our results, this slight modification had a consequential effect on emergent brain-like representation.
3 Related instance-level contrastive learning frameworks
Our IPCL model was inspired by Wu et al. (2018), where models were trained to perform instance-level discrimination, with a base CNN encoder, a normalized low-dimensional embedding space, and a ≈1.28M indexed memory bank. In their learning framework, the current representation of an image is compared with its prior representation, which can be directly accessed from the indexed memory, to form the positive pair. Negative pairings are obtained from the other indexed memory representations, estimated using an empirically determined draw of 4096 negative samples from the memory bank. The category structure learned in the latent space of the L2 layer supported what was then state-of-the-art top-1 ImageNet classification for an unsupervised system (e.g., Resnet50 = 42.5%), outperforming other leading self-supervised representation learning systems by a substantial margin (e.g. Jigsaw, Noroozi and Favaro 2016; SplitBrain autoencoder, Zhang et al. 2017).
From both biological and machine vision perspectives, the indexed memory bank is somewhat problematic, as the exact number of images to be represented is fixed, and during training this image index provides perfect memory access, a bit like an (external) supervised label. Our IPCL learning framework makes biologically-inspired modifications, by replacing the indexed memory bank with a non-indexed memory queue, and using multi-augmentation to produce positive pairings, rather than using a perfectly-indexed stored memory trace of the previous encounter with that item.
For comparison with IPCL, we trained a variety of networks using Wu et al.'s indexed memory framework, varying both the base encoding architecture and the dimensionality of the latent space, to examine whether these instance-level contrastive learning frameworks learn visual representations that are as brain-like as those of category-supervised models.
Concurrently, new instance-level contrastive learning frameworks like MoCo2 (Chen et al., 2020b) and SimCLR (Chen et al., 2020a) have emerged, which both avoid an indexed memory bank and have dramatically increased emergent top-1 ImageNet classification accuracy compared to Wu's nets (MoCo2-Resnet50 = 71.1%; SimCLR-Resnet50-4x = 76.5%; Wusnet-Resnet50 = 42.5%). Like our IPCL framework, MoCo2 uses a memory queue, though with a 16x larger queue (4096 vs. 65,536 items) and a completely different framework for generating positive samples, using a dual-network architecture. SimCLR, like our IPCL framework, uses augmentation for the positive pair, but with no instance prototype and a very large batch size for negative samples (4x more items than in IPCL, 4096 vs 16382). In the present work, we also included these pre-trained models in our model-brain comparisons.
4 Methods
Our goal is to compare the representations learned by these self-supervised vision systems with their category-supervised counterparts, assessing how they fit the representational structure of human brain responses in early, intermediate, and later hierarchical stages of the visual system. Below we outline the models, brain dataset, and evaluation metrics, with expanded detail in Appendix B.
4.1 Models
Five self-supervised models were trained with our IPCL framework, with an Alexnet-gn base architecture and a 128-d latent space, with variations in training regimes (Appendix B.1). Twelve more self-supervised models were trained using the indexed-memory framework of Wu et al. (2018) (henceforth "Wusnets"; Appendix B.2), where we varied (1) the base architecture (Alexnet-bn, Alexnet-gn, Resnet18), including a biologically-based model architecture (Cornet-z; Kubilius et al., 2018), and (2) the dimensionality of the latent space (d = 128, 256, and 1000). The 1000-d latent space was selected to more closely match the category-supervised models. We also trained a Resnet50 base architecture with a 128-d latent space. Categorization accuracy was assessed in these models using the weighted k-nearest neighbors (kNN) procedure of Wu et al. (2018); see Appendix B.3.
For the critical comparisons, five category-supervised models were trained with matched Alexnet-bn, Alexnet-gn, Resnet18, Resnet50, and Cornet-z base architectures, with the standard output layer and cross entropy loss function using the same augmentation regime as their self-supervised counterparts.
All models were trained using the ImageNet database (Russakovsky et al., 2015), using the same augmentation policy as in Wu et al. (2018): images were randomly cropped (between 0.2 − 1.0x their original area, and their original aspect ratio), rescaled to 224 × 224 pixels, randomly horizontally flipped (p = .5), with random adjustments to hue, saturation, contrast, and brightness, and random grayscale conversion (p = .2).
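A torchvision sketch of this augmentation policy is shown below; the color-jitter magnitudes and normalization statistics are assumed values, as they are not specified in the text above.

```python
from torchvision import transforms

# Sketch of the augmentation policy described above (following Wu et al., 2018).
# Only the crop scale, flip probability, and grayscale probability are specified
# in the text; the color-jitter magnitudes and normalization stats are assumptions.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),   # crop 0.2-1.0x original area
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.4, hue=0.4),        # assumed magnitudes
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),        # standard ImageNet stats
])
```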
4.2 Evaluating Similarity with Human Brain Responses
We used a new brain dataset from Magri and Konkle (2019), which used functional magnetic resonance imaging to obtain human brain responses to 72 individual images, depicting isolated inanimate objects (Appendix C.1; images shown in Supplementary Figure 1).
To compare the representations evident across the ventral visual pathway to those learned by the model layers, we used a standard representational similarity analysis (RSA; Kriegeskorte et al., 2008a; Appendix C.2). To overview, the visual system was divided into three large-scale sectors reflecting early cortical stages (areas V1-V3), intermediate processing stages (posterior occipitotemporal cortex, pOTC), and later processing stages (anterior occipitotemporal cortex, aOTC). In each brain region, a 72×72 representational similarity matrix was obtained, which reflects the pairwise similarity of the activation patterns across voxels, computed as the Pearson correlation between two images' activation profiles across voxels. Correspondingly, for each layer in each trained model, we measured activations in every unit to the same 72 images, and created layer-wise representational similarity matrices. Finally, the correlation between the model-layer representational similarity matrix and the brain sector's representational similarity matrix was computed as the key outcome measure. The reliability of each brain region was also computed, where this noise ceiling serves as an estimate of the best possible model fit.
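A minimal sketch of this analysis pipeline, assuming activation patterns are stored as images × voxels (or images × units) arrays; function names are illustrative.

```python
import numpy as np

def rsm(activations):
    """Representational similarity matrix: Pearson correlations between the
    response patterns (rows = 72 images, columns = voxels or model units)."""
    return np.corrcoef(activations)

def lower_triangle(matrix):
    """Vectorize the lower triangle (excluding the diagonal) of a symmetric RSM."""
    rows, cols = np.tril_indices(matrix.shape[0], k=-1)
    return matrix[rows, cols]

def brain_model_correlation(brain_patterns, layer_activations):
    """Correlate the representational geometry of a brain sector and a model layer."""
    brain_vec = lower_triangle(rsm(brain_patterns))      # 72 x n_voxels -> 2556 pairs
    model_vec = lower_triangle(rsm(layer_activations))   # 72 x n_units  -> 2556 pairs
    return np.corrcoef(brain_vec, model_vec)[0, 1]
```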
Note that more complex RSA procedures can be used to improve fits between models and brains (e.g. see Storrs et al., 2020). Here we chose to use simple correlation between brain and model-layer representational geometries because it requires a fully emergent relationship, and thus is a more conservative bar (no brain-based fine-tuning or layer-feature-reweighting involved; e.g. Federer et al. 2019; Khaligh-Razavi and Kriegeskorte 2014). These other analytic techniques could be explored in future work.
This fMRI dataset has two distinctive features. First, the data have substantially more reliable image-level responses than other current datasets (e.g. Chang et al., 2019; Cichy et al., 2016; King et al., 2019), due to different experimental design choices. Second, the dataset reflects responses only to inanimate objects, which provides a different lens into neural representation than most other datasets that sample both animate and inanimate categories. The distinction between animates and inanimates is one of the strongest representational divisions for biological visual systems (e.g. Kriegeskorte et al., 2008b), evident in large-scale topographic organization (e.g. Grill-Spector and Weiner, 2014; Konkle and Caramazza, 2013), and also captured in the representations learned by category-supervised deep neural networks (e.g. Long et al. 2018, though see Bracci et al. 2019). Thus, capturing the similarity relationships among only inanimate items is a finer-grained representational challenge.
5 Results
The results are shown in Figure 2. Early, intermediate, and later stage brain regions are plotted along the rows. Layer-by-layer correlations with each brain sector are shown for three selected models: IPCL-Alexnet-gn, Wusnet-Resnet18, and SimCLR-Resnet50 (see Appendix D, Supplementary Figure 2, for the other models). Finally, a summary of all models’ most correlated layer is shown, grouped to highlight the self-supervised vs category-supervised comparison. There are several patterns in the data to highlight.
Self-supervised vs Category-supervised Comparison
The first key result is that the IPCL self-supervised model shows the highest general correspondence with the visual system hierarchy, along with the Wusnet model using the same Alexnet group-norm architecture. Further, in nearly all cases, the self-supervised models showed correlations that were comparable to or higher than those of category-supervised models with the same architecture (Figure 2c, colored vs dark bars). Interestingly, the main cases where a category-supervised model showed a stronger correlation than its self-supervised counterparts were the final model layers of some architectures (Resnet18, Cornet-z) in the later stage brain sector (aOTC), hinting at a more category-like shift in the representational structure of this region. However, it is notable that, for all three brain sectors including the most anterior stage, the highest layer-brain correlation was achieved by a self-supervised (IPCL or Wusnet) Alexnet. These results provide the first empirical demonstration to our knowledge that self-supervised models can perform as well as or better than category-supervised models in the degree to which their layer representations directly correlate with brain representation.
Base Architecture Effects
The second result is that the convolutional neural network architecture is the single factor that matters most for brain fits in this dataset, surprisingly more than the learning framework, the dimensionality of the projection head, or the emergent categorization accuracy. For example, Wusnet training over an Alexnet-gn base encoder produced more brain-like representations than the same training on an Alexnet-bn, Resnet18, Resnet50, or Cornet-z architecture (Figure 2c). In this dataset, the best matching model in each sector was always from the Alexnet model family. Further, variations in the dimensionality of the latent space from 128, 256, to 1000 had a negligible effect on the layer-brain correlations (Figure 2b, dotted, dashed, solid lines; Figure 2c, open circles; see Appendix D.2). Finally, as evident in Figure 3, emergent top-1 classification accuracy was not predictive of brain similarity; indeed, our model variations clearly cluster by CNN backbone architecture. This effect was previously shown for category-supervised systems (e.g. Kubilius et al., 2018, 2019; Schrimpf et al., 2018), and here we extend that result to self-supervised systems as well.
Hierarchical Representation
The third observation relates to the hierarchical stages of the human visual system and the model architectures. Based on initial studies comparing internal CNN representations to the human brain, we expected early model layers to fit early brain regions best, and later model layers to fit later brain regions best (e.g. Güçlü and van Gerven, 2015). Somewhat surprisingly, this correspondence between model-layer hierarchy and brain hierarchy is not particularly clear in these data. The Alexnet-family architectures show slightly clearer hierarchical correspondence than the Resnet18 or Resnet50 architectures, with convolutional layers vs fully-connected layers correlating better with the earlier vs later brain regions, respectively. Further, the category-supervised models show low, or even negative, correlations at late convolutional stages. This pattern of data is somewhat contradictory to previous results (e.g. Long et al., 2018) that used the original split-channel Alexnet architecture (Krizhevsky et al., 2012). These negative correlations were also not evident when we tested the original Alexnet on our dataset (see Appendix D). These results generally support the conclusion that internal layer normalization and channeling (group norm vs batch norm vs channel norm, splitting channels into groups) have a substantial impact on the learned representations and how brain-like they are.
6 Discussion
The present work introduced Instance-Prototype Contrastive Learning, a fully self-supervised framework that takes the approach of Wu et al. (2018) in more biologically plausible directions and shows emergent representations that match human brain responses as well as, or better than, those of category-supervised counterparts. Here we discuss the algorithmic choices of IPCL through a biological lens, and the implications of brain-model comparisons for both the neuroscience and machine vision communities.
IPCL
The most distinctive concept for self-supervised learning introduced by IPCL is the instance prototype based on multiple-augmentations, which can be mapped to the idea of constructing an online prototype of the current scene over multiple fixations. This learning scheme pushes each sample towards the central tendency of the instance representation across variation, effectively learning to encode each view with respect to a prototype. This encoding dovetails with classic prototype theory of object representation proposed by Rosch and Lloyd (1978), wherein category representation is not about necessary and sufficient features per se, but is probabilistic, with each exemplar standing in a more central or distant relationship with other exemplars of the category. In IPCL, we apply this logic at the instance level. This instance-prototype concept invites clear and interpretable variations, e.g. increasing the number of samples, and the kind of augmentation (e.g. simulating eye-movements, and optionally including biologically inspired “efference copy” signals that indicate the magnitude and direction of eye-movements between samples; e.g. Colby et al. 1992; Crapse and Sommer 2008).
The use of a non-indexed memory queue in IPCL also has biological undertones: the human and non-human primate ventral streams are effectively a highway to the hippocampus (Van Essen and Maunsell, 1983), a brain structure supporting long-term memory where more compositional operations can be carried out over usefully factorized visual representations. This suggests that, to some extent, the representations in the ventral visual stream may be optimized to interface with long-term memory systems. Through this lens, the recent memory queue of IPCL is a stand-in for the traces that would be accessible in a hippocampal memory system. Further extensions into the biological realm might draw on hippocampal models of memory (e.g. Schapiro et al., 2017). For example, our memory queue has a temporal order but no temporal decay, unlike biological long-term memory signatures (e.g. see Anderson and Schooler, 1991), inviting further modifications that vary the weight of the contrast with fading negative samples.
Brain-Model Comparisons
In this interdisciplinary intersection between deep learning and neuroscience, comparing the representations of different model layers to different brain sectors can be done towards (at least) two distinct ends. One aim is to find the single best model system with the most emergent brain-like representation. Brain-Score formalizes this endeavor, aggregating brain datasets and automating pipelines for scoring models along these brain-based and behavior-based benchmarks (Schrimpf et al., 2018). This fMRI dataset has clear value towards this endeavor, as there is reliable neural representation in mid- and late stages of the visual system that is not accounted for well by any model, whether category-supervised or not, and across models that vary substantially in object categorization accuracy. These results contribute to the emerging picture that object categorization is not the right task to close this representational gap between models and brains (Schrimpf et al., 2018). Given that Magri and Konkle (2019) found that aOTC representation is well correlated with human judgments of 3-dimensional shape similarity, models that must learn more 3D-aware representations present a provocative alternative to categorization (e.g. Tung et al., 2019; Zamir et al., 2016).
However, finding the best model is only one reason to compare models to brain data. For the cognitive neuroscientist, these models also serve as computational existence proofs for learnability arguments: that is, what kind of representational structure can be learned from natural image inputs and architectural constraints, given specific representational goals (operationalized as loss functions; e.g. Richards et al. 2019). For example, a prominent theory of object representation in the visual system asserts that specialized category-level (or "domain-level") forces are critical for shaping visual category representation (e.g. Mahon and Caramazza, 2011; see also de Beeck et al., 2019). The finding that instance-level contrastive learning can result in emergent categorical representation supports an alternative theoretical point of view, in which category-specialized learning mechanisms are not necessary. On this generalist account, visual mechanisms operate similarly over all kinds of input, and the goal is to learn hierarchical visual features that simply try to discriminate each view from every other view of the world, regardless of the visual content or domain. Here, we add that these instance-level contrastive learning systems can have representations that are as brain-like as those of category-supervised systems, increasing the plausibility of the generalist account.
Broader Impact
This work is aimed at advancing self-supervised vision systems as well as our understanding of the nature of human visual representation. Both scientific communities stand to benefit. Biological systems can be used to help inspire and inform machine vision model development. Models can inform cognitive neuroscience theories, serving as computational existence proofs for learnability arguments, and as testbeds for exploring the links between targeted mechanisms and their representational consequences. Beyond basic science understanding, the work presented in this paper does not raise many ethical issues or have direct consequences at a societal level.
Our paper is part of the larger endeavor of building more human-like models, focusing specifically on the nature of visual representation. Ultimately, a successful model will emulate human perception, providing more flexible and robust machine vision systems and providing insight into the nature of human visual processing. To strive towards these goals, our work leverages a large-scale image database—any sampling bias in this dataset will thus permeate the representational structure learned by the model systems. Further, the goal of making brain-like representation is also worth examining, as human perceptual systems are not without bias (e.g. consider the other-race effect). To address these limitations, it may be possible to develop targeted stimulus sets that can be used to assess both human perception and machine vision systems, and to quantify bias effects (e.g. a visual implicit attitudes test, see Greenwald et al. 2003). This endeavor is particularly important for machine vision of people, actions, and interactions; it is less relevant for the current work focusing on isolated inanimate objects.
Acknowledgments and Disclosure of Funding
This research was supported by an AWS Machine Learning Research Award.
Appendix
A. Model Architecture Details
We used the AlexNet architecture as specified in Krizhevsky et al. (2012), with a few changes. Layers were not divided into groups (in the original AlexNet, early layers were divided into groups that were split across GPUs and then merged at later layers, and some implementations maintain this grouping, e.g., MATLAB). AlexNet (bn) used BatchNorm layers (Ioffe and Szegedy, 2015) after each convolutional and fully-connected layer (except the final fully-connected layer), with eps = 1e-05 added to the denominator for numerical stability and momentum = 0.1 for the running mean and variance. In our AlexNet (gn), we used GroupNorm layers (Wu and He, 2018) after each convolutional layer (using 32 groups; eps = 1e-5, and with learnable per-channel affine parameters initialized to ones for weights and zeros for biases), and BatchNorm1d layers after each fully-connected layer (except the final layer). The final 1000-way output layer was replaced with an N-dimensional latent space with an L2 norm operation (following Wu et al. 2018). The dimensionality of this space was varied across models from 128, 256, to 1000. See Appendix Section E for the exact model architecture specification of AlexNet (bn) and AlexNet (gn).
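The following is a condensed sketch of this normalization scheme, not the full specification in Appendix E; channel counts follow Krizhevsky et al. (2012), and other details (e.g. dropout) are omitted.

```python
import torch.nn as nn
import torch.nn.functional as F

# Sketch of AlexNet (gn): GroupNorm (32 groups) after each convolution, BatchNorm1d
# after each fully-connected layer, and an L2-normalized low-dimensional embedding
# replacing the 1000-way output. Channel sizes follow Krizhevsky et al. (2012).
class AlexNetGN(nn.Module):
    def __init__(self, out_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),
            nn.GroupNorm(32, 96, eps=1e-5), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2),
            nn.GroupNorm(32, 256, eps=1e-5), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.GroupNorm(32, 384, eps=1e-5), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.GroupNorm(32, 384, eps=1e-5), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.GroupNorm(32, 256, eps=1e-5), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
        )
        self.fc = nn.Sequential(
            nn.Linear(256 * 6 * 6, 4096), nn.BatchNorm1d(4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.BatchNorm1d(4096), nn.ReLU(inplace=True),
            nn.Linear(4096, out_dim),   # replaces the 1000-way output layer
        )

    def forward(self, x):
        x = self.features(x).flatten(1)
        z = self.fc(x)
        return F.normalize(z, dim=1)    # L2 norm on the embedding
```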
B. Training Details
B.1 IPCL Model Training
All IPCL models used temperature τ = 0.07, a non-indexed memory queue of size 4096, and multiple augmentations per image (N = 5), with the batch size reduced to 128 (the reduced batch size was needed to fit the 5x images on our GPUs). We found that this architecture would not learn with the standard learning rate schedule, so we made the following changes to the training protocol: (1) we accumulated gradients over 20 batches (i.e., performed the optimizer step every 20 batches), (2) we reduced the number of epochs to 100, and (3) we varied the learning rate using a one-cycle policy (Smith, 2017), with cosine annealing from .03/1000 up to .03 over the first 40 epochs, and from .03 down to .03/(1000 * 1e4) over the remaining epochs. We replicated the model training three times (runs 1-3) using SGD, and once using the Ranger optimizer (RectifiedAdam + LookAhead). The resulting models achieved between 36.5% and 39.1% accuracy on subsequent ImageNet categorization, detailed in the table in Appendix Section D.
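A sketch of this one-cycle cosine schedule, as we understand it from the description above; the exact interpolation function is an assumption.

```python
import math

def ipcl_lr(epoch, total_epochs=100, warmup_epochs=40, base_lr=0.03):
    """Sketch of the one-cycle policy described above: anneal from base_lr/1000 up to
    base_lr over the first 40 epochs, then from base_lr down to base_lr/(1000 * 1e4)
    over the remaining epochs, using a cosine ramp in both phases."""
    if epoch < warmup_epochs:
        lo, hi = base_lr / 1000, base_lr
        t = epoch / warmup_epochs
    else:
        lo, hi = base_lr / (1000 * 1e4), base_lr
        t = 1 - (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    # cosine interpolation between lo (t = 0) and hi (t = 1)
    return lo + (hi - lo) * (1 - math.cos(math.pi * t)) / 2
```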
B.2 Wusnet Model Training
Each of the 12 model variants (4 architectures × 3 latent space dimensionalities) was trained with the same procedure as in Wu et al. (2018; https://github.com/zhirongw/lemniscate.pytorch), using temperature τ = 0.07 and an NCE sample size of 4096. Models were trained for 200 epochs using SGD (momentum = .9; weight decay = .0001) and a batch size of 256. The learning rate was initialized to 0.03 and scaled down by a factor of 0.1 at epochs 120 and 160.
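A minimal sketch of this optimization setup; the Resnet18 backbone with a 128-d output stands in for the various architectures, and the commented training call is a placeholder for the contrastive training step.

```python
import torch
import torchvision

# Sketch of the Wusnet optimization setup described above (backbone is illustrative).
model = torchvision.models.resnet18(num_classes=128)
optimizer = torch.optim.SGD(model.parameters(), lr=0.03,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[120, 160], gamma=0.1)
for epoch in range(200):
    # train_one_epoch(model, optimizer)  # instance-level contrastive training (placeholder)
    scheduler.step()
```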
We used Wu et al. (2018)'s implementation of Noise Contrastive Estimation to approximate sampling, with slight modifications to accommodate our prototype and queue:
$$
\mathcal{L}_i = -\log \frac{\exp\big(\mathrm{sim}(z_i, \bar{z}_i)/\tau\big)\,/\,Z}{\exp\big(\mathrm{sim}(z_i, \bar{z}_i)/\tau\big)/Z + \sum_{k=1}^{K} \exp\big(\mathrm{sim}(z_i, q_k)/\tau\big)/Z + \epsilon},
$$
where $z_i$ is the embedding for the $i$th sample, $\bar{z}_i$ is its corresponding prototype, $\mathrm{sim}$ is the similarity function (dot product between embeddings), $\tau$ is the temperature parameter, $Z$ is a normalization constant (estimated from the first mini-batch of 128 × 5 augmented samples), $q_k$ is the embedding for the $k$th item stored in the queue, and $\epsilon = 1\mathrm{e}{-7}$ is a constant added for numerical stability.
B.3 Evaluating Categorization Accuracy
We tested classification accuracy on the ImageNet validation set for the Wusnet and IPCL models, using the same weighted k-nearest neighbors (kNN) procedure used by Wu et al. (2018). To classify a test image $x$, its embedding was compared to the embedding of each training image using cosine similarity. The top $k = 200$ nearest neighbors were used to make the prediction via cosine-similarity-weighted voting, where class $c$ receives the total weight:
$$
w_c = \sum_{i \in \mathcal{N}_k} \exp(s_i/\tau)\cdot \mathbb{1}(c_i = c),
$$
where $\mathcal{N}_k$ denotes the $k$ nearest neighbors, $s_i$ is the cosine similarity between the test embedding and neighbor $i$, $c_i$ is the class of neighbor $i$, $k = 200$, and $\tau = 0.07$.
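A sketch of this weighted kNN evaluation, assuming L2-normalized embeddings stored as PyTorch tensors; function and variable names are illustrative.

```python
import torch

def weighted_knn_predict(test_emb, train_embs, train_labels, n_classes=1000,
                         k=200, tau=0.07):
    """Sketch of the weighted kNN evaluation (Wu et al., 2018): each of the k nearest
    training embeddings votes for its class with weight exp(similarity / tau).
    Embeddings are assumed L2-normalized, so the dot product is the cosine similarity;
    train_labels is assumed to be a LongTensor of class indices."""
    sims = train_embs @ test_emb                 # (N,) cosine similarities
    top_sims, top_idx = sims.topk(k)             # k nearest neighbors
    weights = (top_sims / tau).exp()
    votes = torch.zeros(n_classes)
    votes.index_add_(0, train_labels[top_idx], weights)  # accumulate per-class weight
    return votes.argmax()                        # predicted class
```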
C. Brain Data and Analysis
C.1 Summary of fMRI Experimental Procedures
The stimulus set consisted of 72 images (Supplementary Figure 1), and was selected to span a range of categories and contexts (e.g. accessories, bags, bathroom items, bedroom items, clothing, food-processed, fruits and vegetables, furniture, household items, kitchen, musical instruments, office supplies, outdoor items, sporting goods, tools, vehicles).
These images were presented in a mini-block design while participants (N = 10) underwent functional magnetic resonance imaging. In each 8-min run, each image was flashed 4 times in a row (600 ms on, 400 ms off) in a 4-s block, with all 72 images presented in separate blocks in each run (randomly ordered), and 4 × 15-s rest periods interleaved throughout. Participants completed 6 runs. Their task was to pay attention to each image and complete a vigilance task (press a button when a red frame appeared around an object, which happened 12 times per run).
Imaging data were acquired on a BioSpin MedSpec 4T scanner (Bruker) using an eight-channel head coil. Functional data were analyzed using Brain Voyager QX software and MATLAB. Standard preprocessing was performed, and a general linear modeling framework was used to estimate the response of each voxel to each of the 72 images, using square-wave regressors for each image condition convolved with a gamma function to approximate the hemodynamic response. Thus, this design allowed us to estimate voxel-level responses to each of the 72 individual images.
All sectors were delineated by hand on the cortical surface of each hemisphere of each participant, using typical procedures (e.g. Cohen et al., 2017; Long et al., 2018). The early visual areas V1-V3 were delineated based on activations from a separate retinotopy protocol, using standard procedures. An occipitotemporal cortex mask was drawn on each hemisphere, within which the 1000 most active voxels were included, based on the contrast [All Objects > Rest] at the group level. To divide this cortex into intermediate and later hierarchical stages, we used the analyses of these data by Magri and Konkle (2019): in this dataset, the brain responses show a systematic dip in local-regional reliability at a particular anatomical location along the posterior-to-anterior axis.
C.2. Representational Similarity Analysis
For each brain region and each participant, we computed a 72×72 representational similarity matrix (RSM), using the Pearson correlation between the activation profiles of all pairs of images. A group-level RSM was created for each sector by averaging across individual participants. The lower triangle of this symmetric matrix was vectorized (representational similarity vector, RSV; 2556 pairs) and used as the target brain data to match. For each layer in each trained model, we presented the same images to the model, recorded activations from every unit in every layer, and computed a 72×72 RSM with the same method. The lower triangles of these matrices were vectorized and correlated with the brain target RSV.
To compute the reliability of this RSV and contextualize the model fits, we split participants into two sets, computed the RSV in each set, correlated the two, and repeated this procedure 1000 times to estimate the average split-half reliability of the brain data.
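A sketch of this split-half procedure, assuming each participant's similarity vector has been precomputed; array names are illustrative.

```python
import numpy as np

def split_half_reliability(subject_rsvs, n_iter=1000, seed=0):
    """Sketch of the noise-ceiling estimate: randomly split participants in half,
    average the similarity vectors within each half, correlate the two halves,
    and average over iterations. subject_rsvs has shape (n_subjects, n_pairs)."""
    rng = np.random.default_rng(seed)
    n_subjects = subject_rsvs.shape[0]
    correlations = []
    for _ in range(n_iter):
        order = rng.permutation(n_subjects)
        half_a = subject_rsvs[order[: n_subjects // 2]].mean(axis=0)
        half_b = subject_rsvs[order[n_subjects // 2:]].mean(axis=0)
        correlations.append(np.corrcoef(half_a, half_b)[0, 1])
    return float(np.mean(correlations))
```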
D. Supplemental Results
D.1. Other trained model plots
D.2. Summary Table
E. Architecture Specification