How well do models of visual cortex generalize to out-of-distribution samples?

Unit activity in particular deep neural networks (DNNs) is remarkably similar to the neuronal population responses to static images along the primate ventral visual cortex. Linear combinations of DNN unit activities are widely used to build predictive models of neuronal activity in the visual cortex. Nevertheless, prediction performance in these models is often investigated only on stimulus sets consisting of everyday objects under naturalistic settings. Recent work has revealed a generalization gap when predicting neuronal responses to synthetically generated out-of-distribution (OOD) stimuli. Here, we investigated how recent progress in improving DNNs' object recognition generalization, as well as various DNN design choices such as architecture, learning algorithm, and dataset, has impacted the generalization gap in neural predictivity. We came to the surprising conclusion that performance on none of the common computer vision OOD object recognition benchmarks is predictive of OOD neural predictivity performance. Furthermore, we found that adversarially robust models often yield substantially higher generalization in neural predictivity, although the degree of robustness itself was not predictive of the neural predictivity score. These results suggest that improving object recognition behavior on current benchmarks alone may not lead to more general models of neurons in the primate ventral visual cortex.


Figure 1: Neural predictivity as a measure of model-brain similarity. a) We used electrophysiological recordings from area V4 in three macaque monkeys. Each animal's data was collected across one or several recording sessions (1-4 sessions). Data corresponding to each session included measured neural responses (firing rate) to a set of naturalistic (i.e. Nat.) as well as a set of synthetically generated (i.e. Syn.) images. b) Illustration of the cross-validation procedure (2-fold) using principal component regression (PCR) for computing the neural predictivity score for naturalistic and synthetic stimuli. The natural dataset (top) is split randomly into two partitions (i.e. folds) along the stimulus dimension. For each fold, one partition is used to fit the PCR model towards predicting the neural data from neural network model unit activations in a given layer. The resulting PCR model is used to predict the responses to both the natural stimuli in the test partition and the synthetic stimuli. The similarity scores for each stimulus domain (S_nat and S_syn) are computed by combining the predictions across the two folds.
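The 2-fold PCR procedure in panel b can be sketched as follows. The array shapes, component count, and variable names here are illustrative assumptions, not the study's exact settings; random arrays stand in for model activations and recorded responses.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
X_nat = rng.normal(size=(640, 512))   # model unit activations, natural images
Y_nat = rng.normal(size=(640, 50))    # recorded responses, 50 neuronal sites
X_syn = rng.normal(size=(200, 512))   # activations for synthetic images
Y_syn = rng.normal(size=(200, 50))    # responses to synthetic images

def pcr_scores(X_nat, Y_nat, X_syn, Y_syn, n_pc=64, n_folds=2):
    """Fit PCR on the natural train fold; predict the natural test fold
    and all synthetic stimuli; combine predictions across folds."""
    pred_nat = np.full_like(Y_nat, np.nan)
    pred_syn = np.zeros_like(Y_syn)
    for train, test in KFold(n_folds, shuffle=True, random_state=0).split(X_nat):
        pca = PCA(n_components=n_pc).fit(X_nat[train])
        reg = LinearRegression().fit(pca.transform(X_nat[train]), Y_nat[train])
        pred_nat[test] = reg.predict(pca.transform(X_nat[test]))
        pred_syn += reg.predict(pca.transform(X_syn)) / n_folds
    # per-site Pearson correlation between predictions and responses
    s_nat = np.array([pearsonr(pred_nat[:, i], Y_nat[:, i])[0]
                      for i in range(Y_nat.shape[1])])
    s_syn = np.array([pearsonr(pred_syn[:, i], Y_syn[:, i])[0]
                      for i in range(Y_syn.shape[1])])
    return s_nat, s_syn

s_nat, s_syn = pcr_scores(X_nat, Y_nat, X_syn, Y_syn)
```

On real data, S_nat and S_syn would come from activations of a given model layer and the recorded firing rates rather than random draws.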
electrophysiological recordings from macaque area V4 to stimuli from two readily distinguishable visual domains from a recent study (24). The first visual domain which we refer to as

limitation in the diversity of our samples within the naturalistic domain. To investigate this hypothesis, we also fitted a regression model on data from the synthetic domain and tested that model's predictions on the naturalistic and synthetic domains (Fig. 1e). This model achieved

Formally, the procedure for assessing representational similarity between a model and a particular brain area commonly involves selecting all units within a layer of the neural network as candidate predictors and using cross-validation to fit a simple mapping function (a variation of a regression function in most cases) to the observed neuronal responses. Importantly, the implicit assumption underlying this procedure is a one-to-one mapping between a layer in the neural network and a particular brain area. Despite its wide usage in the literature, such a strong assumption may be over-restrictive in practice, especially considering that the typical number of layers in neural network models is substantially higher than the known number of areas along the VVS.

A natural alternative to this assumption is to consider a possible correspondence between one brain area and multiple model layers. In this view, neurons in one brain area could correspond to an aggregation of units in multiple neural network model layers. In its most general form and without any prior, it is numerically cumbersome to test all possible layer combinations as candidate predictor sets for representing each brain area, or to consider all layers simultaneously in establishing the mapping between the model and the brain.
However, one feasible alternative to that is to assume that each individual neuron can be approximated by units within a single layer of the neural network model, but also that not all measured neuronal sites from the same brain area necessarily correspond to the same model layer.

We examined the neural prediction generalization properties of neuronal models constructed from individual model layers on naturalistic and synthetic stimuli. We found that 1) many of the model layers had similar levels of in-domain neural predictivity (Fig. 3a). Indeed, for many neuronal sites, the neural predictivity score for the first model layer was almost as high as that obtained for the best model layer; 2) most

Figure 2: The neuronal model's generalization capability is highly variable across neurons. a) Left and right plots show two example neurons with high and low generalization, respectively. b) Scatter plot of Nat. and Syn. predictivity scores for a neuronal model based on ResNet50 unit activations, for all neuronal sites with high internal consistency (larger than 0.7). The corner histogram shows the distribution of the difference between in- and out-of-distribution predictivity scores across neuronal sites. c) Neural predictivity score of the ResNet50 neuronal model and internal consistency of the neural data in naturalistic and synthetic domains, for neural data collected from different animals (M, N, and S) and different recording sessions (S1-4).

Figure 3: Assumptions on brain-model correspondence affect generalization in neural predictivity. a) Neural predictivity scores from unit activity in each layer of the ResNet50 architecture for individual neuronal sites recorded during example sessions from different animal subjects. Colors correspond to the neural predictivity score on the natural (green) and synthetic (blue) domains. Different shades correspond to different neuronal sites in the same animal.
Bold lines correspond to the average predictivity score in each domain across all neuronal sites within that animal's session. b) Number of neurons with highest neural predictivity in a given layer, corresponding to the same subplot in a. Colors are the same as in a. c) Distribution of the difference between the layer numbers in the ResNet50 neural network where each neuronal site is best predicted in-distribution and out-of-distribution. The distribution spans a wide range but has a slightly negative mean (-1.29). d) Comparison of neural predictivity scores on natural and synthetic domains when brain-model correspondence follows a Layer-Area (LA) or Layer-Neuron (LN) mapping assumption.
neuronal sites were best predicted from unit activations in two or three of the model's layers (Fig. 3b); 3) the best predictive layer for each neural site varied across stimulus sets (naturalistic vs. synthetic). For a large number of neurons, models constructed from units in earlier layers of the network had substantially higher generalization capacity to synthetic stimuli while having more or less similar generalization on the naturalistic domain (Fig. 3b). In ResNet50, we found that the layer with the highest generalization to out-of-distribution samples was on average 1.29 layers earlier than that corresponding to in-distribution samples (Fig. 3c).
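Per-site layer assignment under the two mapping assumptions can be sketched as below. The score matrices are simulated stand-ins for the cross-validated predictivity scores, not real data, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, n_sites = 16, 50
# scores[l, i]: cross-validated predictivity of layer l for site i
scores_id  = rng.uniform(0.2, 0.8, size=(n_layers, n_sites))   # natural (ID)
scores_ood = rng.uniform(0.0, 0.6, size=(n_layers, n_sites))   # synthetic (OOD)

# Layer-Neuron (LN) mapping: each site gets its own best layer
best_layer_id  = scores_id.argmax(axis=0)
best_layer_ood = scores_ood.argmax(axis=0)
# Fig. 3c-style distribution: where is each site best predicted OOD vs. ID?
layer_shift = best_layer_ood - best_layer_id

# Layer-Area (LA) mapping: one layer represents the whole area
la_layer = scores_id.mean(axis=1).argmax()
score_la = scores_id[la_layer].mean()
score_ln = scores_id[best_layer_id, np.arange(n_sites)].mean()
# score_ln can only match or exceed score_la in-distribution,
# since the per-site maximum dominates any fixed layer's score
```

On real scores, a negative mean of `layer_shift` would reproduce the observation that OOD generalization peaks earlier in the network than ID predictivity.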

Next, we investigated the effect of these assumptions about model-brain correspondence on neural prediction generalization. For this, we constructed predictive models of the neuronal population in two ways: a) Layer-Area mapping: following the common approach in the literature

benchmarks with the neural predictivity scores within the naturalistic and synthetic domains.

Each of the adopted benchmarks uniquely probed the neural network models on their object recognition generalization behavior, covering natural photos of naturally occurring objects, object renditions, natural objects in difficult settings, drawings, and natural corruptions (Fig. 4a). The recognition performance of typical neural network models varied greatly on each of these benchmarks (Fig. 4b).

Object recognition accuracy on the ImageNet dataset. As a first measure, we considered the classification accuracy on the widely used ImageNet dataset to assess the models' object

Robustness to out-of-distribution samples. While many existing neural network models can perform object recognition at a level similar to or better than human subjects on the ImageNet dataset (33), it was soon discovered that most models' object recognition accuracy substantially

We further evaluated the models' object recognition performance on four out-of-distribution recognition datasets, including a) ImageNet-Adversarial, which contains a set of natural photos selected through an adversarial filtration process that only includes images that can-

in the synthetic domain, we noted that most models that were trained specifically to improve their robustness to adversarial perturbations had higher neural predictivity scores compared to the non-robust variation of the same architecture (ResNet50; Fig. 5b). We also found that the degree of neural prediction accuracy varied systematically as a function of the magnitude of adversarial perturbations to which the models were exposed during training, and the pattern formed an inverse U-shape (Fig. 5c) where the highest neural predictivity was obtained at

Results indicated that models constructed from different layers of the same architecture highly agree with each other on the relative predictivity scores assigned to different neuronal sites. In other words, these models consistently found the same neurons to be easy or hard to predict. Moreover, the consistency was higher for the naturalistic dataset (0.99 ± 0.003) compared to the synthetic one (0.88 ± 0.04). We also found a similar overall pattern in score layers as well). This trend is also evident by visually inspecting the graphs in Fig. 3a and S3.

We next asked how consistent the image-level predictions are for neuronal models that predict a neural site with the same level of predictivity score. To investigate this, for each neuronal site, we selected 5 neuronal models with similar predictivity scores (≤ 2% difference in neural predictivity score) and measured the consistency in image-level predictions across these models by computing the pairwise correlation between vectors of per-image predictions generated by each of the models. Similar to the score consistency analysis, here also we carried out this analysis both within- and across-models. We found that, despite the close similarity in predictivity score across all the considered neuronal models, the distribution of these correlations spans a wide

within-model ensembles (Fig. 7c). On the naturalistic domain, within- and cross-model ensembles improved the neural predictivity score by 1.2 and 3.3 percent respectively above the best single model. Likewise, on the synthetic domain, this improvement was 0.7 and 1.6 percent.
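The per-image consistency measure and the ensemble comparison can be sketched as follows. Simulated predictions stand in for the five similarly scoring neuronal models; the noise level and image count are illustrative assumptions.

```python
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
truth = rng.normal(size=300)                       # one site's responses to 300 images
preds = [truth + rng.normal(scale=1.0, size=300)   # 5 models with similar noise levels,
         for _ in range(5)]                        # hence similar predictivity scores

# consistency: pairwise correlation between per-image prediction vectors
consistency = [pearsonr(a, b)[0] for a, b in combinations(preds, 2)]

# ensemble: average predictions across models, then re-score against the data
single = np.mean([pearsonr(p, truth)[0] for p in preds])
ensemble = pearsonr(np.mean(preds, axis=0), truth)[0]
```

With independent prediction noise, averaging cancels part of it, so the ensemble correlation exceeds the mean single-model correlation, mirroring the within- and cross-model ensemble gains reported above.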

In addition, we found that robustness to adversarial attacks was the only measure which was

Model-brain alignment. There is recent evidence suggesting that the function performed by a single cortical column may be more complex than that performed by a single typical ANN layer (53). Popular neural modeling benchmarks such as BrainScore (28)

We also found many neuronal sites in the V4 cortex to be highly predictable by simple representations forming in the early layers of neural network models. The neuronal models that stemmed from these early layers, in many cases, also generalized better to out-of-distribution stimuli. This suggested that such neuronal sites either correspond to direct projections from earlier stages of processing in the brain (when the neural predictivity score is already close to the

score is substantially lower than the data's internal consistency).

Altogether, our results suggest that naturalistic stimuli alone may not be sufficient to assess the similarity of models of the ventral visual cortex to the brain. Synthetically generated stimuli such as those considered in this work, or others that are specifically generated for such comparisons (e.g. "controversial stimuli" from (55)), may provide a broader and more effective test bed for model-brain comparisons in the future.

Effect of architecture and learning rule on generalization.
We also observed a trend where deeper models consistently improved the model predictions

We also compared the model representations when their parameters were trained on different datasets. We showed that while there is large variance in neural predictivity across models trained on different datasets, the moderately sized (by today's standards) ImageNet dataset remains the best dataset for training neural network models with the highest neural predictivity.

Indeed, we observed that even datasets with orders of magnitude more samples do not yield models with better neural prediction generalization. Our results were also congruent with other recent findings showing that models trained on visual object recognition yield higher brain-

of synthetically generated pixel patterns that were produced using an optimization procedure described in (24). Briefly, a predictive neuronal model of each neural site was constructed from the internal activity of the AlexNet model (56). New stimuli were then synthesized that were predicted to 1) produce high activity in individual neuronal sites (stretch) or 2) produce high activity in an individual neuronal site while suppressing activity in other simultaneously recorded sites (one-hot population). The resulting gray-scale images contained complex curvature and texture-like patterns but importantly did not contain any nameable objects.

cross-validation procedure is repeated 5 times for each candidate representation.
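A toy version of the stretch and one-hot population objectives described above can be sketched as follows. A linear map stands in for the neuronal model so the ascent direction is closed-form; the original procedure in (24) ascends similar objectives through the AlexNet-based neuronal model, and all sizes and step parameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n_sites, n_pixels = 10, 64 * 64
W = rng.normal(size=(n_sites, n_pixels))   # linear stand-in for the neuronal model
img = rng.normal(scale=0.01, size=n_pixels)
target = 0                                 # site to drive

for _ in range(100):                       # projected gradient ascent on the pixels
    # stretch objective: maximize r[target];
    # one-hot population adds a penalty on the other sites' activity
    grad = W[target] - 0.1 * W[np.arange(n_sites) != target].sum(axis=0)
    img = np.clip(img + 0.05 * grad, -1, 1)  # keep pixels in a valid range

r = W @ img                                # predicted responses to the synthesized image
```

With a deep nonlinear neuronal model, `grad` would be recomputed by backpropagation at every step rather than being constant, but the objective structure is the same.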

The neural predictivity score for a neural network model is then computed using one of the following two approaches:

best layer. We used this approach for the majority of our analyses in the paper.

To report the neural predictivity score on the synthetic data, we compute the neural predictions for the synthetic stimuli from the layer with the best natural predictivity score using each of the cross-validation models (10), and average those predictions per stimulus across folds. We then compute the Pearson correlation between the averaged predictions and the neuronal responses for each repetition of the cross-validation procedure, take the median across neuronal sites for each repetition, and finally report the averaged correlation across repetitions of the cross-validation procedure as the neural predictivity score on synthetic stimuli for that neural network model.

Each neuronal model can generate predictions of the neuronal responses to any stimulus that can be formatted as a static image. We assess the generalization ability of each neuronal model by quantifying the gap in the prediction accuracy made on two input domains. In all analyses except those in Fig. 2e, we fit the regression parameters to predict the brain responses to naturalistic images (i.e. in-distribution or ID) and use the fitted model to additionally predict the responses to the synthetic (out-of-distribution or OOD) stimuli. We then quantify the generalization gap in neural predictivity. For this, we first compute the difference between the predictivity scores on ID and OOD domains for each site on each repetition of the cross-validation procedure. Then, for each repetition, we compute the median difference value across neuronal sites. Finally, we compute the mean of the median values across repetitions. Concretely, we compute the following value:

C_gap = C_nat - C_synth

where C_nat, C_synth, and C_gap denote the median prediction accuracy (i.e. correlation) on the naturalistic image set, the median prediction accuracy on the synthetic image set, and the generalization gap in neural predictivity, respectively.
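The median-then-mean aggregation described above can be sketched as follows, with simulated per-repetition, per-site scores standing in for the cross-validated correlations (shapes and value ranges are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n_reps, n_sites = 5, 50
# s_nat[r, i]: ID predictivity of site i on cross-validation repetition r
s_nat = rng.uniform(0.3, 0.9, size=(n_reps, n_sites))
# OOD scores simulated as ID scores minus a non-negative per-site drop
s_syn = s_nat - rng.uniform(0.0, 0.3, size=(n_reps, n_sites))

# median across sites per repetition, then mean across repetitions
c_nat = np.median(s_nat, axis=1).mean()
c_syn = np.median(s_syn, axis=1).mean()
# gap: per-site ID-OOD difference first, then median, then mean
c_gap = np.median(s_nat - s_syn, axis=1).mean()
```

Note that the gap is computed from the per-site differences before taking the median, as the text specifies, which in general differs slightly from subtracting the two median scores.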

The ventral visual stream is hypothesized to serve a critical role in visual object recognition. We use object recognition accuracy as a measure of the usefulness of the representations in each

In-distribution object recognition accuracy. We use the ImageNet dataset to assess each

CORnet-Z and CORnet-S architectures trained supervised on the ImageNet dataset.

• VOneCORnet-S. This network is a variation of the CORnet-S architecture where its first layer is replaced with a biologically plausible model of the primate V1 cortex (52).

• Adversarially robust models. We considered several adversarially trained models in our
