Task-specific vision models explain task-specific areas of visual cortex

Computational models such as deep neural networks (DNNs) trained for classification are often used to explain responses of the visual cortex. However, not all areas of the visual cortex are involved in object/scene classification. For instance, the scene-selective occipital place area (OPA) plays a role in mapping navigational affordances. Therefore, to explain the responses of such a task-specific brain area, we investigate whether a model that performs a related task serves as a better computational model than a model that performs an unrelated task. We found that a DNN trained on a task (scene parsing) related to the function (navigational affordances) of a brain region (OPA) explains its responses better than a DNN trained on a task (scene classification) that is not explicitly related. In a subsequent analysis, we found that the DNNs showing a high correlation with a particular brain region were trained on tasks consistent with the functions of that region reported in previous neuroimaging studies. Our results demonstrate that the task is paramount when selecting a computational model of a brain area. Further, explaining the responses of a brain area with a diverse set of tasks has the potential to shed light on its functions.

Author summary

Areas in the human visual cortex are specialized for specific behaviors, either through supervision and interaction with the world or through evolution. A standard way to gain insight into the function of these brain regions is to design experiments related to a particular behavior and localize the regions showing significant relative activity corresponding to that behavior. In this work, we investigate whether we can infer the function of a brain area in visual cortex using computational vision models. From our results, we find that explaining the responses of a brain region using DNNs trained on a diverse set of possible vision tasks can help us gain insights into its function.
The consistency of our results with previous neuroimaging studies suggests that a brain region may be specialized for behaviors similar to the tasks for which DNNs showed a high correlation with its responses.

Deep neural networks (DNNs) are currently the state-of-the-art models for explaining cortical responses in the visual cortex [1-11]. DNNs trained on a large dataset of images for the object-classification task have been shown to explain human and monkey cortical responses in the inferior temporal (IT) cortex, an area known for playing a role in [...] explains its responses better than a DNN trained on a less related task. We attempt to bridge this gap by taking the task into account when explaining brain responses.

In this work, we hypothesize that a DNN trained on a task related to the function of a brain region will explain its responses better than a DNN trained on a task which is not explicitly related. Here, we consider two tasks different if they generate a different output structure or if their predictions come from different domains (for instance, the object domain or the scene domain). We validate this hypothesis through two different analyses. In the first analysis, we consider the particular case of OPA and explain its responses with a DNN trained on a task (scene parsing [15]) which we argue is related to navigational affordances. We then compare the results with a DNN trained on a generic scene-classification task. In the second analysis, we select DNNs trained on a diverse set of computer vision tasks from the Taskonomy [16] dataset. We then investigate whether the tasks for which DNNs show a high correlation are consistent with the functions of the scene-selective areas OPA and PPA and the early visual cortex (EVC) reported in previous works [14,17-23].

The navigational affordances (Fig 1A), as described in Bonner and Epstein [12], are computed by localizing the free space available for navigation in the scene.
Thus, a DNN trained on a computer vision task that localizes the free space for navigation can serve as a computational model for explaining the navigational-affordance-related responses in the visual cortex. The scene-parsing task (Fig 1B center), where the aim is to predict the label of each pixel in the image, is therefore suitable for our purpose.

Fig 1. Navigational affordances and scene-parsing task. (A) An example stimulus image (left) presented to the subject for the behavior and fMRI experiments in [12]. A path indicated by a rater, as instructed in [12], to walk through the scene starting from the bottom center of the image (center). Heat map of possible navigational trajectories produced by combining the data across different raters (right). (Reproduced with permission from Bonner and Epstein [12].) (B) Output generated by the scene-parsing model (center). Activation of the floor unit of the scene-parsing model (right).

Fig 2. Computer vision tasks from the Taskonomy [16] dataset. An example stimulus image (top-left) presented to the subject for the behavior and fMRI experiments in [12]. The rest of the images are the outputs generated by pretrained DNNs optimized for the corresponding tasks selected from the Taskonomy dataset, given the stimulus image as input.

1. The brain responses of a particular region show a higher correlation with DNNs trained on a task related to its function than with DNNs trained for the classification task.
2. The tasks on which pretrained DNN activations show a high correlation with a particular brain region's responses are consistent with the functions of that brain region reported in previous studies.
3. Comparing the correlations of a diverse set of task-DNN activations with a brain area's responses provides insight into its previously known/unknown functions.

In this work, we use representational similarity analysis (RSA) [24] to compare the correlation of computational and behavioral models with human brain responses. We present the results through two sets of analyses. In the first set, we select a task (scene parsing) which we argue is similar to the navigational affordances and, hence, to the related responses in the OPA. We compute the correlation of the brain and behavior representational dissimilarity matrices (RDMs) with a scene-parsing DNN (VGG_scene-parse) and compare the results with a scene-classification DNN (VGG_scene-class). The scene-classification task is not as relevant to mapping navigational affordances as the scene-parsing task; a comparison between the two can therefore indicate whether training the DNN on a task related to the function of the brain region under study is required to explain its responses. In the second set, we select a diverse set of computer vision tasks from the Taskonomy dataset and use DNNs trained for these individual tasks to explain the cortical responses of scene-selective brain regions and the early visual cortex. We then compare the correlations of the brain RDMs with the DNNs trained on the above tasks to gain insights into the functions of these brain areas.

Scene-parsing DNN explains OPA responses related to navigational affordances better than scene-classification DNN

DNNs are widely used as candidate computational models of areas in the visual cortex. Here, we consider two DNNs: one optimized on a task (scene parsing) related to mapping navigational affordances, and the other on a task (scene classification) not explicitly related to mapping navigational affordances. The DNN models we consider here are VGG_scene-parse (Fig 3A) and VGG_scene-class (Fig 3B). [...] The correlation of the VGG_scene-parse DNN with the brain and behavior RDMs is higher and significant (p < 0.05 for layers 15 and 16 with OPA, and layer 15 with NAM) in some cases.

The correlation values for all the comparisons are higher and significant for VGG_scene-parse in the deeper layers, validating our hypothesis that task-relevant DNNs explain the task-specific regions of the brain better than a generic classification DNN.

Scene-parsing DNN explains a major portion of the shared variance of the behavior and scene-classification DNN

We combined RSA with variance partitioning [25] to investigate how uniquely each model (behavior, VGG_scene-parse, and VGG_scene-class) explains the responses of OPA. In the variance partitioning approach, using a multiple regression model, we can divide the variance into the unique and shared portions contributed by its predictors. In this case, the OPA RDM was the predictand, the DNN models and behavior were the predictors, and we used the layers showing the highest correlation with the OPA RDM (layer 15 for VGG_scene-parse, and layer 13 for VGG_scene-class).

From the results of this analysis (Fig 4C), we note the following points:

1. VGG_scene-class shares a major portion (96.62%) of its variance with VGG_scene-parse.
2. Behavior shares more than half of its variance with VGG_scene-parse (57.42%) and VGG_scene-class (52.35%).
3. VGG_scene-parse's unique variance is more than one-fourth (25.40%) of the total variance explained by the three models.

The above results suggest that VGG_scene-parse accounts equally well or better for the navigational-affordance-related responses in the OPA than VGG_scene-class, while at the same time uniquely explaining OPA responses that are related neither to navigational affordances nor to scene classification.

Floor and free-space activations explain the behavior, but not the brain responses, better than the scene-parsing output

Navigational affordance is visually related to the free space available for navigation. Therefore, in this analysis, we investigate whether units corresponding to free space show a higher correlation with the behavior and brain RDMs than the readout layer of VGG_scene-parse. The readout layer of VGG_scene-parse consists of 151 channels: 150 channels each containing the output for a particular class in the ADE20k [26] dataset, and 1 channel corresponding to the background. It is therefore straightforward to separate a specific category's activation from the readout layer. We consider 15 such labels (floor, road, earth, rug, grass, sidewalk, field, sand, stairs, runway, stairway, dirt, land, stage, and step) that represent free space [...]. We also use PSP_scene-parse [27], a scene-parsing model that has been shown to achieve higher prediction accuracy than VGG_scene-parse, to ensure that the results are consistent. The results in Fig 5B show that the trend is consistent among the different models and clears the ambiguity due to the poor performance of the VGG_scene-parse model.
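Separating a free-space activation from the readout layer can be sketched as below. The 15 labels come from the text, but the channel indices and the per-pixel max over channels are illustrative assumptions (the real indices come from the ADE20k class list, and the pooling choice is not specified here); `free_space_activation` is a hypothetical helper name.

```python
import numpy as np

# The 15 free-space labels named in the text; the index of each label in the
# 151-channel readout is dataset-specific (hypothetical mapping below).
FREE_SPACE_LABELS = ["floor", "road", "earth", "rug", "grass", "sidewalk",
                     "field", "sand", "stairs", "runway", "stairway", "dirt",
                     "land", "stage", "step"]

def free_space_activation(readout, label_to_channel):
    """readout: (n_channels, H, W) readout-layer activation (150 ADE20k
    classes + 1 background). Returns a single (H, W) free-space map as the
    per-pixel maximum over the available free-space channels."""
    idx = [label_to_channel[l] for l in FREE_SPACE_LABELS if l in label_to_channel]
    return readout[idx].max(axis=0)
```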

Further, it is interesting to note that, due to the more accurate predictions from [...]

In this analysis, we probe further by computing the correlation of each DNN category unit's activation with the brain and behavior RDMs. We investigate the top-10 most highly correlated categories with the brain and behavior RDMs and observe whether this analysis supports the previous works that investigated the functions of OPA and PPA. For this purpose, we use PSP_scene-parse, as its predictions are more accurate than those of the VGG_scene-parse model. [...] out of the 10 most highly correlated categories (including sidewalk, path, and dirt) correspond to free space. The rest of the labels include the object categories plate, vase, sink, kitchen, and barrel. One possible explanation for these categories being highly correlated is the experimental design [14] in which the OPA responses were recorded: the subjects were asked to classify whether the room displayed was a bathroom or not. Objects such as sink, plate, and vase are highly indicative of the room type, and therefore OPA responses may be related to the scene-classification task. Hence, the high correlation of OPA with these objects is explained by assuming that OPA is involved in the scene-classification task. Further, knowing the scene category is also crucial for planning navigation. A related possible explanation is that the objects also suggest the spatial layout of the scene by indicating the presence of obstacles and can therefore be relevant for navigational affordances.
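The per-category correlation analysis described above can be sketched as follows, assuming one RDM per category unit has already been computed; `top_correlated_categories` is a hypothetical helper, not the authors' code.

```python
import numpy as np
from scipy.stats import spearmanr

def top_correlated_categories(channel_rdms, brain_rdm, labels, k=10):
    """channel_rdms: (n_categories, n_stim, n_stim). Spearman-correlates each
    category unit's RDM (upper triangle) with the brain RDM and returns the
    k best-matching labels with their correlations."""
    iu = np.triu_indices(brain_rdm.shape[0], k=1)
    scores = []
    for rdm in channel_rdms:
        rho, _ = spearmanr(rdm[iu], brain_rdm[iu])
        scores.append(rho)
    scores = np.asarray(scores)
    order = np.argsort(scores)[::-1][:k]   # highest correlation first
    return [labels[i] for i in order], scores[order]
```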

PPA, on the other hand, is hypothesized to represent the spatial layout of scenes and is insensitive to navigational affordances, as shown in [14]. The results from this analysis (Fig 6 right) are consistent with [14]: the majority of the highly correlated categories are objects indicative of scene layout and category, and only 2 of the highly correlated categories correspond to free space.

The above analysis demonstrates that the categorical units of the scene-parsing output are consistent with the functions of OPA and PPA reported in previous studies. This result suggests that a categorical analysis has the potential to be used to investigate the functions of brain regions.

[...] that have the same architecture and that are optimized on the same set of images from the Taskonomy dataset [16] to perform different tasks.
The provided Taskonomy DNN architectures consist of an encoder, which is the same for all tasks, and a decoder, which can vary depending on the task. The encoder architecture for all tasks is a fully convolutional ResNet-50 [28] with 4 residual blocks and no pooling layers. The decoder architecture, however, is task-dependent: for example, the decoder of the classification tasks consists of fully connected layers, while the decoders of the tasks with spatial output consist of all-convolutional layers. In this analysis, we consider the tasks in which the output is spatial, so the decoder architecture is the same across all tasks. In this way, the DNN architecture is the same across all the selected tasks, and only the task is the variable.

We argue that deeper layers of the DNN decoder may be more task-specific than early layers. To support this argument, we report the mean and variance of the correlation between the RDMs of several layers of the DNNs and the brain RDM in Fig 7. We observe that in earlier layers of the encoder and decoder, the correlation values do not vary significantly across tasks, while the variance increases as we go deeper (Fig 7A left). The mean correlation remains almost constant and then decreases with depth (Fig 7A center). Yet the maximum correlation with the brain RDMs consistently increases with depth (Fig 7A right). Taken together, these results suggest that in deeper DNN layers, DNNs trained on related tasks show higher correlations while DNNs trained on unrelated tasks start showing lower correlation values with the brain RDMs. We also observed that the ordering of task correlation values is more consistent in the deeper layers. The above analysis provides evidence that, for comparison, we should consider the deeper layers of the DNN decoder for all tasks.
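The layer-depth analysis above (mean, variance, and maximum of task-to-brain correlations per depth) can be sketched as follows. `layerwise_task_stats` is a hypothetical name, and the toy symmetric matrices in the usage below stand in for real DNN-layer RDMs.

```python
import numpy as np
from scipy.stats import spearmanr

def layerwise_task_stats(task_layer_rdms, brain_rdm):
    """task_layer_rdms: dict task -> list of per-depth RDMs (same length for
    every task). Returns, per depth, the mean, variance, and max of the
    Spearman correlations between the task RDMs and the brain RDM."""
    iu = np.triu_indices(brain_rdm.shape[0], k=1)
    depths = len(next(iter(task_layer_rdms.values())))
    stats = []
    for d in range(depths):
        rhos = []
        for layers in task_layer_rdms.values():
            rho, _ = spearmanr(layers[d][iu], brain_rdm[iu])
            rhos.append(rho)
        rhos = np.asarray(rhos)
        stats.append({"mean": rhos.mean(), "var": rhos.var(), "max": rhos.max()})
    return stats
```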
Thus, in the following experiments, we use the prefinal layer of the decoder for all tasks to perform the correlation analysis between each DNN's RDM and the brain RDM. In this analysis, we focus on the correlation with the RDMs of the behavioral model for navigational affordances, the OPA responses related to navigational affordances, and the PPA responses related to spatial layout and scene classification. From [...] task DNN with behavior may be because the categories in this task do not contain floor or free space, in contrast to the scene-parsing task.

The results of the above analysis suggest that, to explain the brain responses of task-specific brain regions using DNNs, the DNN should be optimized on a task related to the function of that brain region. The above analysis also reveals that DNNs with the same architecture trained on the same dataset show differences in correlation due to the task alone.

Task-optimized DNNs provide insights into the functionality of brain regions

Above, we demonstrated through a series of analyses that a DNN optimized to perform a task related to a particular brain region's function better explains the responses of that brain region. In this analysis, we take a different route and ask whether we can gain insight into the functions of a brain region by comparing the correlations of different task DNNs with the brain RDMs.

In the current analysis, we focus on the correlations with the RDMs of OPA, PPA, and EVC. We consider all the single-image tasks from the Taskonomy dataset and use the prefinal-layer RDM as the representative DNN RDM for each task. Then, we compare the correlations between the brain RDMs and the prefinal-layer RDMs of the DNNs trained on the selected tasks.

The results (Fig 8A) for OPA and PPA are similar to the previous analysis. In addition, we now observe a high correlation of both areas with the scene- and object-classification tasks, which was not reported in the previous analysis. These results suggest that while PPA and OPA representations are spatial, supporting tasks that require spatial-layout information, the representations in these areas are also semantic, supporting classification tasks. We observe that EVC is highly correlated with edge2d (ρ = 0.6972, p = 0.0002), [...] The results support previous work [7] showing that the EVC representation is more similar to early layers of a DNN. Further, the tasks that show a very high correlation with EVC in deeper layers are mostly related to low-level visual cues (edge2d, keypoint3d, segment2d, etc.) or to classification (object and scene classification). The high correlation with the classification DNNs may be due to the emergence of object detectors in the early visual cortex, similar to those shown to emerge in DNNs [29].

Thus, the above analysis shows that performing RSA of a brain region against a diverse set of tasks has the potential to shed light on the functionality of that particular brain region in the visual cortex.
Similar task DNNs share more variance than dissimilar task DNNs

We probe further whether tasks that are similar according to the Taskonomy transfer matrix share more variance than tasks that are less similar. This analysis is performed to investigate whether two dissimilar tasks can be used to uniquely explain the responses of brain areas corresponding to different behaviors.

We use the variance partitioning [25] approach to calculate the unique and shared variance of the different models. The brain RDMs (OPA, PPA, and EVC) are the predictands, and three task DNNs (pairs of similar and dissimilar tasks) are the predictors. We used the RDMs of the prefinal layer for all the DNNs tested. The results from the partitioning analysis (Fig 9A) show that the unique variance of the dissimilar task ([...] for PPA, 6.53% vs. 2.10% for EVC) is higher than that of the similar task. This analysis suggests that two DNNs optimized for dissimilar tasks may be used to explain the brain responses uniquely related to each task.

In this work, we demonstrated the importance of task selection when using pretrained DNNs as computational models for task-specific regions of the visual cortex. We list the key findings from our analyses below.

• A DNN trained on a task (scene parsing) related to the function (navigational affordances) of a brain region (OPA) shows a higher correlation with its responses than a DNN trained on a task (scene classification) not explicitly related.

• Category-specific activations generated from the scene-parsing DNN provide insights into the functions of the scene-selective cortex.

• Training DNNs with the same architecture on the same dataset but for different tasks results in different correlations with the brain responses.

• DNNs that show a high correlation with the brain responses are trained on tasks related to the functions of the brain areas reported in previous studies.

In the following paragraphs, we discuss the strengths and limitations of the key analyses and findings of this work.

Finding a DNN trained on a task related to the function of the brain region

In our first analysis, we selected the scene-parsing task and showed how it can be related to navigational affordances in the scene. However, it is not always possible to find a computer vision task that can be explicitly related to the function of a brain area. Further, even if we find a related task, the annotations for such a task may not be readily available; hence, finding a DNN pretrained on that particular task may not be possible. [...] As such datasets become available, the comparison of these DNNs with the brain responses will shed new light on the functions of different brain regions. Through the specific example of OPA, navigational affordances, and the scene-parsing task, we believe our work has paved a way towards future studies using task-optimized DNNs as potential computational models for task-specific brain regions.

RSA with categorical units: a potential method to investigate functions of a semantic brain area

We showed that the responses of the categorical units generated as the output of the scene-parsing task can be used to gain insights into the functions of OPA and PPA. The result was consistent with earlier neuroimaging works investigating the function of OPA [14,35,36]. While Bonner and Epstein [14] showed that OPA is involved in the navigational affordances of scenes, Dilks et al. [36] showed that OPA might also play a role in scene classification. Similarly, for PPA, the top correlated classes mostly contained objects indicative of scene category and layout, and only 2 of the top-10 correlated classes corresponded to free space. These results provide further evidence that PPA responses are insensitive to navigational affordances, which is also consistent with the findings related to PPA in Bonner and Epstein [14]. Thus, the above results suggest that RSA with categorical activations is a potential method for investigating the functions of a brain area. Further, it is important to note that we were unable to distinguish the functions of OPA and PPA through the analysis involving a diverse set of tasks, since both the OPA and PPA RDMs showed high correlations with the same set of tasks. In such cases, where the functional difference is due to semantic categories and not spatial tasks, the categorical activations can be used to distinguish the functions of these brain regions.

However, there are a few potential shortcomings of this approach. First, the number and type of categories are limited by the dataset used for training the DNN. Therefore, on a new set of stimuli containing categories that were not present in the DNN's training dataset, the top correlated classes might not provide any useful insights. Also, it is not always the case that brain areas are categorical, and this approach may therefore not provide meaningful insight into the functionality of those brain areas.

Difference in correlation: Is it because of the task?

In the first analysis, we found that the scene-parsing DNN showed a higher correlation with OPA and navigational affordances than the scene-classification DNN. Yet, it is important to note that there were three differences between the DNNs used for comparison.

First was the architecture difference: while the last 3 layers of the scene-parsing DNN were convolutional, the last 3 layers of the scene-classification DNN were fully connected. Second, the datasets used for training the two DNNs were different: ADE20k [15] for the scene-parsing DNN and Places-365 [37] for the scene-classification DNN. Third, the tasks on which the models were trained were different. Therefore, the difference in the correlations of the 2 DNNs with OPA and navigational affordances could be attributed to any of these factors. To clear this ambiguity, we selected DNNs with the same architecture trained on the same set of images but for different tasks. We then showed that the DNNs trained on tasks related to the function of a brain area were highly correlated, while the DNNs trained on unrelated tasks showed low or insignificant correlation with that brain area. From this analysis, we found that training on different tasks leads to differences in the correlation of DNN activations with the responses of brain regions: the correlation of a DNN with a brain region depends on how similar the task is to the function of that brain region.

DNNs trained on a diverse set of tasks: a potential method to assess unknown functions of a brain area

We compared the correlations of DNNs trained on a diverse set of tasks with different brain areas. This comparison was performed to investigate whether the highly correlated tasks are related to, and consistent with, the previously reported functions of the brain areas. The top-3 task DNNs (3D keypoints, curvature, and 2.5D segmentation) showing the highest correlation with the scene-selective visual areas (OPA and PPA) were related to the 3-D structure of scenes. An electrophysiological study [21] demonstrated the importance of structure-defining contours through investigations of the scene-selective visual cortex in the macaque brain. A related neuroimaging work [22] showed that scene category can be decoded from the PPA even when the stimulus images are just line drawings of the corresponding scenes. Choo and Walther [23] showed that intact contour junctions are crucial for scene-category representation in PPA. Thus, the high correlation of OPA and PPA responses with the DNNs trained to predict 3-D keypoints and curvature demonstrates that our results are consistent with previous studies investigating the representation of the scene-selective visual cortex.

Further, semantic tasks such as scene/object classification and semantic segmentation also showed a high correlation with the scene-selective visual cortex. This is consistent with the results of Bonner and Epstein [14] and Epstein et al. [17], who report that the representation of OPA and PPA is also semantic. Thus, the results of the task-comparison analysis are consistent with previous studies of the scene-selective visual cortex and provide evidence that the representation of the scene-selective visual cortex is both semantic and visual. Further analysis of the early visual cortex showed that tasks requiring low-level visual cues, such as 2D edges and 2D segmentation, are highly correlated with the EVC responses. Taken together, these results suggest a strong potential for using a diverse set of tasks to gain insights into the function of different brain regions.

One counter-argument to our approach might be that humans are never supervised the way these DNNs are. The DNNs were supervised using task-specific annotations; no such annotations are available to humans, who learn to perform these tasks intuitively. However, one should also note that humans learn through interaction with the environment, by moving around, and by learning from others. Therefore, these intermediate vision tasks may have been learned through the supervision of a much more complex goal. Learning complex tasks is still a challenging area in artificial intelligence, and a single model is not yet able to perform all the tasks a human can perform. Therefore, in this work, we focused only on scene-selective regions of the visual cortex and tried to explain their responses with DNNs trained on different tasks. Further, in this work, we are only interested in the correlation of the end state of the representations, not in how either the DNNs or the humans learned these representations.

In this study, we presented evidence supporting our hypothesis that task-specific DNN models can explain the responses of task-specific brain regions. We first validated this hypothesis by considering the particular case of OPA, which has been reported to be associated with navigational affordances. We showed that a scene-parsing DNN, whose task is related to navigational affordances, shows a higher correlation with OPA responses than a DNN trained on a less related task (scene classification). We further validated this hypothesis by comparing the correlations of the responses of scene-selective visual areas with a large and diverse set of task DNNs. Although in this work we only considered scene-selective visual areas, we believe that similar results can be obtained for other higher cognitive brain areas such as the hippocampus and prefrontal cortex. Another limitation of this work is that we only considered tasks that apply to single static images. In future studies, we aim to perform a similar analysis with more complex functions and with models trained on complex tasks in virtual 3-D environments. We believe this study has paved a way towards using task-optimized DNNs as potential computational models for task-specific brain regions.

In the first section, we describe representational similarity analysis (RSA) [24], a standard method to compare the correlation of computational and behavioral models with human brain responses. In the second section, we describe the variance partitioning analysis, which was used to find the unique and shared variance of the computational models used to predict brain responses. In the third section, we briefly describe the dataset used in this work, and in the last section we provide the details of the DNN models used for the analyses.

[...] In [12], where a scene-classification DNN was compared with the navigational affordances, the dissimilarity metric used was the Euclidean distance; we observed that with 1 − ρ as the dissimilarity metric, the correlation was higher. Hence, in this work, 1 − ρ is used as the dissimilarity metric to compute the RDMs of layer activations for all the analyses. We did not perform PCA on the layer activations, as done in [12], because the spatial information in convolutional layer outputs would be lost: to perform PCA as in [12], the convolutional layer output is first flattened and then principal components are selected, due to which information from some spatial locations is never considered in the analysis. For the first set of analyses, with the scene-parsing and scene-classification DNNs, we consider the OPA and PPA RDMs for comparison, as these areas have been hypothesized to represent scene affordances [14] and scene layout [17], respectively. We also compare the DNN RDMs with a behavioral navigational affordance map (NAM) [12] that represents the navigational affordances in a scene. For the second set of analyses, with the Taskonomy DNNs, we consider the OPA, PPA, and EVC RDMs for comparison with the DNN RDMs.
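The RDM construction and RDM-to-RDM comparison described above can be sketched in a few lines. This is an illustrative sketch, not the RSA-toolbox implementation: `compute_rdm` and `rsa_correlation` are hypothetical helper names, and activations are assumed to arrive as a (stimuli × features) matrix.

```python
import numpy as np
from scipy.stats import spearmanr

def compute_rdm(activations):
    """RDM from a (n_stimuli, n_features) activation matrix, using
    1 - Spearman's rho between stimulus rows as the dissimilarity."""
    rho, _ = spearmanr(activations, axis=1)  # (n_stimuli, n_stimuli) rho matrix
    return 1.0 - rho

def rsa_correlation(rdm_a, rdm_b):
    """Spearman correlation of the upper triangles of two RDMs."""
    iu = np.triu_indices_from(rdm_a, k=1)
    rho, _ = spearmanr(rdm_a[iu], rdm_b[iu])
    return rho
```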

Statistical analysis. We use the RSA toolbox [38] to compute RDM correlations and the corresponding p-values and standard deviations, using bootstrapping similar to [12]. [...]

The variance partitioning method is used to determine the unique and shared contributions of individual models when considered in conjunction with the other models. We describe the analysis by considering the case of OPA predicted by the behavioral model related to navigational affordances, the scene-parsing DNN, and the scene-classification DNN.
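The bootstrap mentioned in the statistical analysis above can be sketched as below. This is an illustrative stand-in for the RSA-toolbox procedure, not its actual implementation; resampling stimuli with replacement and dropping duplicated-stimulus pairs are assumptions, and `bootstrap_rdm_correlation` is a hypothetical name.

```python
import numpy as np
from scipy.stats import spearmanr

def bootstrap_rdm_correlation(rdm_a, rdm_b, n_boot=1000, seed=0):
    """Observed Spearman correlation between two RDMs, plus a bootstrap
    standard deviation obtained by resampling stimuli with replacement."""
    rng = np.random.default_rng(seed)
    n = rdm_a.shape[0]
    iu = np.triu_indices(n, k=1)
    observed, _ = spearmanr(rdm_a[iu], rdm_b[iu])
    samples = []
    for _ in range(n_boot):
        idx = rng.choice(n, size=n, replace=True)
        keep = idx[iu[0]] != idx[iu[1]]        # drop zero-dissimilarity self-pairs
        a = rdm_a[np.ix_(idx, idx)][iu][keep]
        b = rdm_b[np.ix_(idx, idx)][iu][keep]
        rho, _ = spearmanr(a, b)
        samples.append(rho)
    return observed, float(np.nanstd(samples))
```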

First, the off-diagonal elements of the OPA RDM are assigned as the dependent variable (predictand). Then, the off-diagonal elements of the behavior RDM and of the layer RDMs representing the scene-parsing and scene-classification tasks are selected as the independent variables. We then perform seven multiple regression analyses: one with all three independent variables as predictors, three with the three possible combinations of two independent variables as predictors, and three with the individual independent variables as predictors. By comparing the explained variance (r²) of a model used alone with the explained variance when it is used together with other models, the amount of unique and shared variance between the different predictors can be inferred. For the other variance partitioning analyses in this work, the predictors and predictands were modified accordingly, and the steps of the analysis were the same. The area-proportional Venn diagrams for the variance partitioning analysis were generated using the eulerAPE software [39].

Dataset
The stimulus images used for the analysis consisted of 50 images of indoor environments. The subjects' fMRI responses were obtained while they performed a category-recognition task (bathroom or not). In this work, we directly use the precomputed subject-averaged RDMs of the navigational affordance map (NAM), PPA, and OPA provided by Bonner and Epstein [12].

To obtain the NAM, an independent group of subjects was first asked to indicate, with a computer mouse, the paths in each image starting from the bottom. Probabilistic maps of the paths were created for each image, followed by the construction of a histogram of navigational probability in one-degree angular bins radiating from the bottom center of the image. This histogram represents a probabilistic map of potential navigation routes from the viewer's perspective. For further details of the navigational affordance map or the dataset, please refer to [12,14].
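The angular-binning step above can be sketched as follows (a toy numpy illustration under our own conventions — image coordinates with y measured down from the top, angles measured from the bottom center — not the authors' code):

```python
import numpy as np

def affordance_histogram(path_points, width, height, bin_deg=1):
    """Histogram of path directions in one-degree angular bins.

    path_points: (n, 2) array of (x, y) pixel coordinates lying on the
    mouse-indicated paths, with y measured down from the top of the image.
    Angles radiate from the bottom center (90 degrees = straight up).
    """
    dx = path_points[:, 0] - width / 2.0   # horizontal offset from bottom center
    dy = height - path_points[:, 1]        # flip so that upward is positive
    angles = np.degrees(np.arctan2(dy, dx))
    hist, _ = np.histogram(angles, bins=np.arange(0, 181, bin_deg))
    return hist / max(hist.sum(), 1)       # normalized probabilistic map

# a path running straight up from the bottom center of a 200x100 image
points = np.array([[100.0, y] for y in range(90, 10, -10)])
nam = affordance_histogram(points, width=200, height=100)
print(np.argmax(nam))  # peak falls in the 90-degree (straight-ahead) bin
```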

Deep Neural Network Models to explain brain responses
In this section, we describe the architecture of the DNN models used in the analysis. The reason we did not use the same architecture as Bonner and Epstein [12] was that we were unable to find a pretrained scene-parsing model with an architecture similar to AlexNet [41]. The VGG16 model (Fig 3A) contains 13 convolutional layers, with a pooling layer after each convolutional block of 2 or 3 convolutional layers (5 pooling layers in total) and 3 fully connected (FC) layers after the last pooling layer.
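For reference, this layer layout can be summarized with the standard published VGG16 configuration (a plain-Python sketch, not code from the paper):

```python
# Output-channel counts for VGG16's convolutional layers; 'M' marks the
# max-pooling layer that closes each block of 2 or 3 convolutions.
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']
n_conv = sum(1 for v in VGG16_CFG if v != 'M')
n_pool = VGG16_CFG.count('M')
n_fc = 3  # the three fully connected layers after the last pooling layer
print(n_conv, n_pool, n_fc)  # 13 5 3
```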

Scene-parsing models
We use a fully convolutional modification of VGG16 [42] trained on ADE20k [15,26] (a scene-parsing dataset) as the scene-parsing model (VGG_scene-parse; pretrained model downloaded from https://github.com/hellochick/semantic-segmentation-tensorflow). In VGG_scene-parse (Fig 3B), the FC layers are replaced by convolutional layers to predict a pixel-wise spatial mask. The model has additional deconvolutional layers to upsample the spatial mask obtained from the intermediate layers. For the analysis of category-specific outputs we use the pyramid scene parsing network (PSP_scene-parse; pretrained model downloaded from https://github.com/hellochick/semantic-segmentation-tensorflow), as it outperforms VGG_scene-parse on the scene-parsing task, and hence its categorical outputs are more accurate and better suited to that particular analysis. The PSP_scene-parse model introduces a pyramid pooling module that fuses features at four different scales to obtain superior performance over VGG_scene-parse.
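The FC-to-convolution replacement can be illustrated with a toy numpy sketch (sizes are illustrative, not the model's actual dimensions): an FC layer applied to a flattened feature map is equivalent to a convolution whose kernel covers the whole map, and sliding that same kernel over a larger input yields a spatial output mask instead of a single vector.

```python
import numpy as np

def conv_valid(x, kernel):
    """Naive 'valid' convolution of kernel (out_ch, in_ch, kh, kw) over
    x (in_ch, H, W), producing an (out_ch, H-kh+1, W-kw+1) spatial map."""
    out_ch, in_ch, kh, kw = kernel.shape
    _, H, W = x.shape
    out = np.zeros((out_ch, H - kh + 1, W - kw + 1))
    w = kernel.reshape(out_ch, -1)
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            out[:, i, j] = w @ x[:, i:i + kh, j:j + kw].reshape(-1)
    return out

rng = np.random.default_rng(2)
w_fc = rng.standard_normal((16, 8 * 3 * 3))  # FC: 16 units on an 8x3x3 map
kernel = w_fc.reshape(16, 8, 3, 3)           # the same weights as a 3x3 kernel

# on an input the size of the kernel, the conv reproduces the FC output
x_small = rng.standard_normal((8, 3, 3))
same = np.allclose(conv_valid(x_small, kernel)[:, 0, 0], w_fc @ x_small.reshape(-1))

# on a larger input, the identical weights produce a spatial output mask
x_large = rng.standard_normal((8, 10, 10))
print(same, conv_valid(x_large, kernel).shape)  # True (16, 8, 8)
```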

Taskonomy models
The Taskonomy dataset is a large-scale image dataset containing 4 million images, with annotations and pretrained DNN models available for 26 vision-related tasks. The tasks included in this dataset cover the most common computer vision tasks related to 2D, 3D, and semantics, ranging from low-level visual tasks such as edge detection to more abstract semantic tasks such as scene/object classification. The DNNs trained on the different Taskonomy tasks (pretrained models downloaded from https://github.com/StanfordVL/taskonomy/tree/master/taskbank) share a common encoder architecture. The encoder is a fully convolutional ResNet-50 [28] without any pooling layers, consisting of 4 residual blocks each containing multiple convolutional layers. The decoder architecture, however, varies according to the output structure of each task. For the