## Abstract

An error was made in including noise ceilings for human data in Khaligh-Razavi and Kriegeskorte (2014). For comparability with the macaque data, human data were averaged across participants before analysis. The noise ceilings, which indicate variability across human participants, therefore do not accurately depict the upper bounds of possible model performance and should not have been shown. Creating noise ceilings appropriate for the fitted models is not trivial. Below we present a method for doing so, along with the results it yields. The corrected results differ from the original in that the best-performing model (a weighted combination of AlexNet layers and category readouts) no longer reaches the lower bound of the noise ceiling; it is, however, not significantly below that bound. The claim that the model “fully explains” the human IT data therefore appears overstated. All other claims of the paper are unaffected.

Noise ceilings should not have been included in the plots of Khaligh-Razavi and Kriegeskorte (2014) showing correlations between models and human inferior temporal (hIT) data (Figures 2A and 7, and Supplementary Figures 2, 3 and 12), as they were not statistically appropriate. The noise ceilings were calculated as implemented in the RSA Toolbox (Nili et al., 2014), by computing the average correlation of each participant’s representational dissimilarity matrix (RDM) with the group-average RDM, either including (upper bound) or excluding (lower bound) that participant’s data. This provides an appropriate upper bound on achievable model performance when model performance is calculated by correlating the model RDM separately with each participant’s RDM and averaging these correlations. However, in the paper, all model evaluations were performed against the single group-average RDM, for better comparability with the macaque RDM (which is an average over the N=2 macaques). Inter-subject variance therefore does not limit possible model performance, and the correct upper bound is 1. The figures for the macaque data do not contain noise ceilings for this reason, and neither should those for the human data.
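For concreteness, the RSA Toolbox noise-ceiling computation described above can be sketched as follows (a minimal Python sketch; the function and variable names are our own, and Spearman correlation is an illustrative choice, as the toolbox supports several correlation measures):

```python
import numpy as np
from scipy.stats import spearmanr

def noise_ceiling(subject_rdms):
    """Noise-ceiling bounds in the style of Nili et al. (2014).

    subject_rdms: array of shape (n_subjects, n_dissimilarities), each row
    the vectorized upper triangle of one subject's RDM.
    """
    grand_mean = subject_rdms.mean(axis=0)
    # upper bound: each subject's RDM vs. the mean RDM including that subject
    upper = np.mean([spearmanr(rdm, grand_mean)[0] for rdm in subject_rdms])
    # lower bound: each subject's RDM vs. the mean RDM of the other subjects
    lower = np.mean([
        spearmanr(rdm, np.delete(subject_rdms, i, axis=0).mean(axis=0))[0]
        for i, rdm in enumerate(subject_rdms)
    ])
    return lower, upper
```

The upper bound uses each subject’s own data twice (once directly, once inside the mean) and therefore overestimates the best achievable performance; the leave-one-out lower bound underestimates it.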

The removal of noise ceilings from Figures 2 and 7 leaves unsupported the claim that reweighted features from a deep supervised model fully explain hIT data. Here we present a new analysis to assess whether this claim holds. One approach would be to estimate the lower bound of the noise ceiling correctly for the bar graphs shown in the paper. However, this is not trivial and requires distributional assumptions, which we would like to avoid. We therefore instead follow the approach of Nili et al. (2014) for the noise ceiling and, for each model, plot the average across subjects of the single-subject RDM correlations (rather than the model’s correlation with the group-average RDM). This approach requires a separate RDM prediction for each subject. A complication is that some of the models have parameters to be fitted, requiring crossvalidation across images and subjects for unbiased estimates of model performance.

Below we describe a novel and flexible analysis method for fitted RDM models. On each fold of crossvalidation, the training and test sets are based on nonoverlapping sets of stimuli and nonoverlapping sets of subjects. Each model is fitted to the training set and evaluated on the test set (different subjects and stimuli). The lower and upper bounds of the noise ceiling are computed in the same crossvalidation loop. The performance estimates and noise-ceiling estimates (RDM correlations) are averaged across folds. The entire crossvalidation procedure can be repeated for many bootstrap samples of stimuli and subjects. Here the bootstrapping was limited to stimuli, because there were only four subjects in the dataset.

## Method for estimating model performances and noise-ceiling bounds

The performance of each model, and the lower and upper bounds of the noise ceiling, were computed concurrently, using the following procedure:

- For each of 1000 stimulus-bootstrap samples:
    - For each of 20 stimulus-crossvalidation folds:
        - Randomly assign 8 unique stimuli present in this bootstrap sample to be test stimuli. The test set therefore always consists of data from exactly 8 unique images, and typically contains repetitions of some of them. Data from the same image never appear in both training and test sets, whether or not the image occurs repeatedly in the bootstrap sample.
        - For each of 4 subject-crossvalidation folds (leaving out each subject in turn):
            - Average the data RDMs across the training subjects for the training images. Using the resulting training RDM, fit the weights of the parameterized models with non-negative least-squares regression (e.g., in Figure 1 below, the components are the 7 AlexNet layer RDMs and 3 SVM-remixed RDMs).
            - For each model, create a predicted RDM for the test images (using the fitted weights for parameterized models). Calculate each model’s performance as the correlation between the model-predicted RDM and the test RDM (the RDM for the test images in the test subject).
            - In the same step, calculate the upper and lower bounds of the noise ceiling as the correlation between the test RDM and the average test-image RDM of either all 4 subjects (upper bound) or the 3 training subjects (lower bound).
        - For each test subject, we now have an estimate of each model’s performance and of the upper and lower bounds of the noise ceiling. Average these across subjects to obtain a single estimate of each measure for this stimulus-crossvalidation fold.
    - At the end of the 20 stimulus-crossvalidation folds, we have 20 estimates of each model’s performance and of the upper and lower bounds of the noise ceiling. Average these to obtain a single estimate of each measure for this bootstrap sample.

At the end of this procedure, we have a distribution of 1000 estimates of each model’s performance and of the upper and lower bounds of the noise ceiling, bootstrapped over the population of image stimuli used in the experiment.
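The core of one stimulus-crossvalidation fold can be sketched as follows (a minimal Python sketch, not the original implementation; the function names, data layout, use of Spearman correlation, and `scipy.optimize.nnls` for the non-negative least-squares fit are all illustrative assumptions):

```python
import numpy as np
from scipy.optimize import nnls
from scipy.stats import spearmanr

def utv(rdm, idx):
    """Vectorized upper triangle of an RDM restricted to images `idx`."""
    sub = rdm[np.ix_(idx, idx)]
    return sub[np.triu_indices(len(idx), k=1)]

def crossval_fold(subject_rdms, component_rdms, test_imgs):
    """One stimulus-crossvalidation fold for a single fitted model.

    subject_rdms:   (n_subjects, n_images, n_images) data RDMs
    component_rdms: (n_components, n_images, n_images) model component RDMs
    test_imgs:      indices of the held-out test images
    Returns fold-average model performance and noise-ceiling bounds.
    """
    n_sub, n_img, _ = subject_rdms.shape
    train_imgs = np.setdiff1d(np.arange(n_img), test_imgs)
    perf, lower, upper = [], [], []
    for test_sub in range(n_sub):  # leave out each subject in turn
        train_subs = [s for s in range(n_sub) if s != test_sub]
        # fit non-negative component weights on training subjects / images
        y = utv(subject_rdms[train_subs].mean(axis=0), train_imgs)
        X = np.stack([utv(m, train_imgs) for m in component_rdms], axis=1)
        w, _ = nnls(X, y)
        # evaluate the prediction on the held-out subject / held-out images
        test_rdm = utv(subject_rdms[test_sub], test_imgs)
        pred = np.stack([utv(m, test_imgs) for m in component_rdms], axis=1) @ w
        perf.append(spearmanr(pred, test_rdm)[0])
        # noise-ceiling bounds, computed in the same fold
        upper.append(spearmanr(test_rdm, utv(subject_rdms.mean(axis=0), test_imgs))[0])
        lower.append(spearmanr(test_rdm, utv(subject_rdms[train_subs].mean(axis=0), test_imgs))[0])
    return np.mean(perf), np.mean(lower), np.mean(upper)
```

In the full procedure this function would be called once per fold, inside the bootstrap loop over stimulus samples, and the returned estimates averaged as described above.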

Figure 1 shows the performance of the deep supervised neural network model from Khaligh-Razavi and Kriegeskorte (2014) in explaining their hIT data, calculated using the above procedure (cf. Figure 7 in the original paper), and Figure 2 does the same for the 27 shallow and not-strongly-supervised models (cf. Figure 2A in the original paper).

## Conclusions

The noise ceilings shown in the figures presenting human fMRI results in Khaligh-Razavi and Kriegeskorte (2014) were statistically inappropriate and should not have been included. A novel analysis method enables us to estimate noise ceilings for fitted and unfitted models, and to inferentially compare the models to each other and to the lower bound of the noise ceiling. The best-performing model no longer reaches the noise ceiling, but is statistically indistinguishable from its lower bound. The main claims of the paper are therefore unaffected.