Abstract
Natural scenes are inherently structured, with meaningful objects appearing in predictable locations. Human vision is tuned to this structure: When scene structure is purposefully disrupted (e.g., by jumbling scene images), perception is strongly impaired. Here, we tested how such perceptual effects are reflected in neural sensitivity to natural scene structure. During separate fMRI and EEG experiments, participants passively viewed scenes whose spatial structure (i.e., the position of the scene’s parts) and categorical structure (i.e., the content of the scene’s parts) could be intact or jumbled. Using multivariate decoding analyses, we show that spatial (but not categorical) scene structure has a profound impact on cortical processing: Scene-selective responses in the occipital and parahippocampal cortices (fMRI) and after 255ms (EEG) accurately differentiated between spatially intact and spatially jumbled scenes. Importantly, this differentiation was more pronounced for upright than for inverted scenes, indicating genuine sensitivity to spatial scene structure rather than sensitivity to low-level visual attributes. The cortical sensitivity to spatial structure may reflect perceptual adaptations to real-world statistics, which support efficient scene understanding in everyday situations.
Cortical Sensitivity to Natural Scene Structure
Humans can understand natural scenes from just a single glance (Potter, 1975; Thorpe, Fize, & Marlot, 1996). One reason for this perceptual efficiency lies in the structure of natural scenes: for instance, a scene’s spatial structure tells us where specific objects can be found and its categorical structure tells us which objects are typically encountered within the scene (Bar, 2004; Oliva & Torralba, 2007; Potter, 2012; Võ, Boettcher, & Draschkow, 2019).
The beneficial impact of scene structure on perception becomes apparent in jumbling paradigms, where the scene’s structure is purposefully disrupted by shuffling blocks of information across the scene. Jumbling makes it harder to categorize scenes (Biederman, Rabinowitz, Glass, & Stacy, 1974), recognize objects within them (Biederman, 1972; Biederman, Glass, & Stacy, 1973) or to detect subtle visual changes (Varakin & Levin, 2008; Zimmermann, Schnier, & Lappe, 2010).
These perceptual effects prompt the hypothesis that scene structure impacts perceptual stages of cortical scene processing. However, while there is evidence that real-world structure impacts visual responses to everyday objects (Kim & Biederman, 2011; Kaiser & Cichy, 2018; Kaiser & Peelen, 2018; Roberts & Humphreys, 2010) and people (Bernstein, Oron, Sadah, & Yovel, 2010; Brandman & Yovel, 2016; Chan, Kravitz, Truong, Arizpe, & Baker, 2010; Soria Bauser & Suchan, 2015), it is unclear whether real-world structure has a similar impact on scene-selective responses.
Here, we used multivariate pattern analysis (MVPA) on fMRI and EEG responses to jumbled scenes to demonstrate that cortical scene processing is indeed sensitive to scene structure. We reveal three key characteristics of this sensitivity: (1) Cortical scene processing is primarily sensitive to the scene’s spatial structure, rather than its categorical structure. (2) Spatial structure impacts the perceptual analysis of scenes, in occipital and parahippocampal cortices (Epstein, 2012) and shortly after 200ms (Harel, Groen, Kravitz, Deouell, & Baker, 2016). (3) Spatial structure impacts cortical responses more strongly for upright than inverted scenes, indicating robust sensitivity to scene structure that goes beyond sensitivity to low-level features.
Method
Participants
In the fMRI experiment, 20 healthy adults participated in session 1 (mean age 25.5, SD=4.0; 13 female) and 20 in session 2 (mean age 25.4, SD=4.0; 12 female). Seventeen participants completed both sessions; three further participants completed only session 1, and three others completed only session 2. In the EEG experiment, 20 healthy adults (mean age 26.6, SD=5.8; 9 female) participated in a single session. All participants had normal or corrected-to-normal vision. Participants provided informed consent and received monetary reimbursement or course credits. All procedures were approved by the local ethical committee and were in accordance with the Declaration of Helsinki.
Stimuli and design
Stimuli were 24 scenes from four different categories (church, house, road, supermarket; Figure 1a), taken from an online resource (Konkle, Brady, Alvarez, & Oliva, 2010). We split each image into quadrants and systematically recombined the resulting parts in a 2×2 design, where both the scenes’ spatial structure and their categorical structure could be either intact or jumbled (Figure 1b/c). This yielded four conditions: (1) In the “spatially intact & categorically intact” condition, parts from four scenes of the same category were combined in their correct locations. (2) In the “spatially intact & categorically jumbled” condition, parts from four scenes from different categories were combined in their correct locations. (3) In the “spatially jumbled & categorically intact” condition, parts from four scenes of the same category were combined, and their locations were exchanged in a crisscrossed way. (4) In the “spatially jumbled & categorically jumbled” condition, parts from four scenes from different categories were combined, and their locations were exchanged in a crisscrossed way. For each participant separately, 24 unique stimuli were generated for each condition by randomly drawing suitable fragments from different scenes¹. During the experiment, all scenes were presented both upright and inverted.
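The 2×2 recombination logic can be sketched as follows. This is a minimal illustration, not the authors' stimulus code: the function and variable names are hypothetical, the number of exemplars per category is left as a parameter, and "crisscrossed" is read here as each part moving to the diagonally opposite quadrant, which is an assumption.

```python
import random

CATEGORIES = ["church", "house", "road", "supermarket"]
QUADRANTS = ["top_left", "top_right", "bottom_left", "bottom_right"]

# Assumed reading of "crisscrossed": every part moves to the diagonally
# opposite quadrant. The source does not spell out the exact mapping.
CRISSCROSS = {
    "top_left": "bottom_right",
    "top_right": "bottom_left",
    "bottom_left": "top_right",
    "bottom_right": "top_left",
}


def make_stimulus(spatially_intact, categorically_intact, n_exemplars, rng):
    """Return {display_position: (category, exemplar, source_quadrant)}.

    n_exemplars (scenes available per category) is a hypothetical parameter.
    """
    if categorically_intact:
        # Four different scenes of one category.
        cats = [rng.choice(CATEGORIES)] * 4
        exemplars = rng.sample(range(n_exemplars), 4)
    else:
        # One part from each of the four categories.
        cats = rng.sample(CATEGORIES, 4)
        exemplars = [rng.randrange(n_exemplars) for _ in range(4)]

    stimulus = {}
    for source, cat, exemplar in zip(QUADRANTS, cats, exemplars):
        position = source if spatially_intact else CRISSCROSS[source]
        stimulus[position] = (cat, exemplar, source)
    return stimulus
```

Calling `make_stimulus(False, True, 6, random.Random(0))`, for example, yields a spatially jumbled but categorically intact stimulus; note that even the fully "intact" condition mixes parts from four different scenes.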
fMRI paradigm
The fMRI experiment (Figure 1d) comprised two sessions. In the first session, upright scenes were shown, in the second session inverted scenes were shown; the sessions were otherwise identical. Each session consisted of five runs of 10min. Each run consisted of 25 blocks of 24 seconds. In 20 blocks, scene stimuli were shown with a frequency of 1Hz (0.5s stimulus, 0.5s blank). Each block contained all 24 stimuli of a single condition. In 5 additional fixation-only blocks, no scenes were shown. Block order was randomized within every five consecutive blocks, which contained each condition (four scene conditions and fixation-only) exactly once.
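The block randomization described above can be illustrated with a short sketch. The condition labels and function name are hypothetical; the only constraint taken from the text is that every five consecutive blocks contain each of the five conditions exactly once.

```python
import random

# Four scene conditions (spatial/categorical structure) plus fixation-only.
CONDITIONS = [
    "intact/intact",
    "intact/jumbled",
    "jumbled/intact",
    "jumbled/jumbled",
    "fixation",
]


def run_block_order(n_blocks=25, rng=None):
    """Shuffle conditions within each consecutive chunk of five blocks."""
    rng = rng or random.Random()
    order = []
    for _ in range(n_blocks // len(CONDITIONS)):
        chunk = CONDITIONS[:]
        rng.shuffle(chunk)
        order.extend(chunk)
    return order


BLOCK_DURATION_S = 24
RUN_DURATION_S = 25 * BLOCK_DURATION_S  # 600 s, i.e. the stated 10 min
```

With 25 blocks per run, this yields five blocks per condition, of which five are fixation-only, matching the counts given in the text.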
Scene stimuli appeared in a black grid (4.5° visual angle), which served to mask visual discontinuities between quadrants. Participants monitored a central red crosshair, which darkened for 50ms twice per block (at random times); participants had to press a button when they detected a change. Participants on average detected 80.0% (SE=2.5)² of the changes. Stimulus presentation was controlled using the Psychtoolbox (Brainard, 1997).
In addition to the experimental runs, each participant completed a functional localizer run of 13min, during which they viewed images of scenes, objects, and scrambled scenes. The scenes were new exemplars of the four scene categories used in the experimental runs; objects were also selected from four categories (car, jacket, lamp, sandwich). Participants completed 32 blocks (24 scene/object/scrambled blocks and 8 fixation-only blocks), with parameters identical to the experimental runs (24s block duration, 1Hz stimulation frequency, color change task).
EEG paradigm
In the EEG experiment (Figure 1e), all conditions were randomly intermixed within a single session of 75min (split into 16 runs). During each trial, a scene appeared for 250ms, followed by an inter-trial interval randomly varying between 700ms and 900ms. In total, there were 3072 trials (384 per condition), and an additional 1152 target trials (see below).
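As a quick consistency check on these numbers, the trial counts and session length can be worked through in a few lines. The sketch assumes that the eight conditions are the four scene conditions crossed with the two orientations, and that the inter-trial interval is uniformly distributed (mean 800ms); neither is stated explicitly in the text.

```python
def eeg_trial_budget():
    """Recompute the EEG trial counts and approximate session duration."""
    n_conditions = 4 * 2                 # assumed: 4 scene conditions x upright/inverted
    n_main = 384 * n_conditions          # non-target trials
    n_target = 1152                      # target trials, excluded from analysis
    n_total = n_main + n_target
    trials_per_run = n_total // 16       # 16 runs per session
    # Mean trial length: 250 ms stimulus + mean ITI (assumed uniform 700-900 ms).
    mean_trial_s = 0.250 + (0.700 + 0.900) / 2
    session_min = n_total * mean_trial_s / 60
    return {
        "n_main": n_main,
        "n_total": n_total,
        "trials_per_run": trials_per_run,
        "session_min": session_min,
    }
```

Under these assumptions the pure stimulation time comes to roughly 74 minutes, which is in line with the stated session length of 75min.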
As in the fMRI experiment, stimuli appeared in a black grid (4.5° visual angle) with a central red crosshair. In target trials, the crosshair darkened during the scene presentation; participants had to press a button and blink when detecting this change. Participants on average detected 78.1% (SE=3.6) of the changes. Target trials were not included in subsequent analyses.
fMRI recording and preprocessing
MRI data were acquired using a 3T Siemens Tim Trio Scanner equipped with a 12-channel head coil. T2*-weighted gradient-echo echo-planar images were collected as functional volumes (TR=2s, TE=30ms, 70° flip angle, 3mm³ voxel size, 37 slices, 20% gap, 192mm FOV, 64×64 matrix size, interleaved acquisition). Additionally, a T1-weighted anatomical image (MPRAGE; 1mm³ voxel size) was obtained. Preprocessing was performed using SPM12 (www.fil.ion.ucl.ac.uk/spm/). Functional volumes were realigned, coregistered to the anatomical image, and normalized into MNI-305 space. Images from the localizer run were additionally smoothed using a 6mm full-width-half-maximum Gaussian kernel.
EEG recording and preprocessing
EEG signals were recorded using an EASYCAP 64-electrode³ system and a Brainvision actiCHamp amplifier. Electrodes were arranged in accordance with the 10-10 system. EEG data were recorded at 1000Hz sampling rate and filtered online between 0.03Hz and 100Hz. All electrodes were referenced online to the Fz electrode. Offline preprocessing was performed using FieldTrip (Oostenveld, Fries, Maris, & Schoffelen, 2011). EEG data were epoched from −200ms to 800ms relative to stimulus onset, and baseline-corrected by subtracting the mean pre-stimulus signal. Channels and trials containing excessive noise were removed based on visual inspection. Blinks and eye movement artifacts were removed using independent component analysis and visual inspection of the resulting components. The epoched data were down-sampled to 200Hz.
fMRI region of interest definition
We restricted fMRI analyses to three regions of interest (ROIs): early visual cortex (V1), scene-selective occipital place area (OPA), and scene-selective parahippocampal place area (PPA). V1 was defined based on a functional group atlas (Wang, Mruczek, Arcaro, & Kastner, 2015). Scene-selective ROIs were defined using the localizer data, which were modelled in a general linear model (GLM) with 9 predictors (3 regressors for the scene/object/scrambled blocks and 6 movement regressors). Scene-selective ROI definition was constrained by group-level activation masks for OPA and PPA (Julian, Fedorenko, Webster, & Kanwisher, 2012). Within these masks, we first identified the voxel exhibiting the greatest t-value in a scene>object contrast, separately for each hemisphere, and then defined the ROI as a 125-voxel sphere around this voxel. Left-and right-hemispheric ROIs were concatenated for further analysis.
fMRI decoding
fMRI response patterns for each ROI were extracted directly from the volumes recorded during each block. After shifting the activation time course by three TRs (i.e., 6s) to account for the hemodynamic delay, we extracted voxel-wise activation values from the 12 TRs corresponding to each block of 24s. Activation values for these 12 TRs were then averaged, yielding a single response pattern across voxels for each block. To account for activation differences between runs, the mean activation across all blocks was subtracted from each voxel’s values, separately for each run. Decoding analyses were performed using CoSMoMVPA (Oosterhof, Connolly, & Haxby, 2016), and were carried out separately for each ROI and participant. We used data from four runs to train linear discriminant analysis (LDA) classifiers to discriminate multi-voxel response patterns (i.e., patterns of voxel activations across all voxels of an ROI) and response patterns from the left-out, fifth run to test these classifiers. This was done repeatedly until every run was left out once and decoding accuracy was averaged across these repetitions.
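The pattern extraction and leave-one-run-out scheme described above can be sketched in NumPy. This is an illustrative reimplementation under stated assumptions, not the CoSMoMVPA code used in the study: all function names are hypothetical, the two-class LDA uses a simple shrinkage term (`reg`) that is an assumption, and the source does not specify how the shifted window of the final block in a run was handled.

```python
import numpy as np


def block_patterns(run_data, block_onsets, shift=3, n_trs=12):
    """Average the 12 TRs of each block after a 3-TR hemodynamic shift.

    run_data: (n_timepoints, n_voxels) array for one run.
    block_onsets: block onset indices in TRs.
    Note: slices running past the end of the run are silently truncated.
    """
    patterns = np.stack([
        run_data[on + shift: on + shift + n_trs].mean(axis=0)
        for on in block_onsets
    ])
    # Remove run-specific offsets: subtract each voxel's mean across blocks.
    return patterns - patterns.mean(axis=0, keepdims=True)


def lda_fit(X, y, reg=0.01):
    """Two-class LDA with a regularized pooled covariance (reg is assumed)."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    centered = np.vstack([X[y == 0] - mu0, X[y == 1] - mu1])
    cov = centered.T @ centered / (len(X) - 2)
    # Shrink toward a scaled identity so the covariance stays invertible
    # when there are more voxels than training patterns.
    cov += reg * np.trace(cov) / cov.shape[0] * np.eye(cov.shape[0])
    w = np.linalg.solve(cov, mu1 - mu0)
    b = -w @ (mu0 + mu1) / 2  # decision boundary at the class midpoint
    return w, b


def leave_one_run_out(X, y, runs):
    """Train on all runs but one, test on the left-out run, average accuracy."""
    accuracies = []
    for test_run in np.unique(runs):
        train = runs != test_run
        w, b = lda_fit(X[train], y[train])
        predicted = (X[~train] @ w + b > 0).astype(int)
        accuracies.append((predicted == y[~train]).mean())
    return float(np.mean(accuracies))
```

Here `X` would hold one row per block (voxels of one ROI as columns), `y` the binary condition label (for example, spatially intact versus jumbled), and `runs` the run index of each block; chance level is 0.5.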
EEG decoding
EEG decoding was performed separately for each time point (i.e., every 5ms) from −200ms to 800ms relative to stimulus onset, using CoSMoMVPA (Oosterhof et al., 2016). We used data from all-but-one trials for two conditions to train LDA classifiers to discriminate topographical response patterns (i.e., patterns across electrodes) and data from the left-out trials to test these classifiers. This was done repeatedly until each trial was left out once and decoding accuracy was averaged across these repetitions. Classification time series for individual participants were smoothed using a running average of five time points (i.e., 25ms).
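The time axis and the smoothing step can be illustrated as follows. The sketch assumes that both epoch endpoints are included and that the running average uses a centered window that shrinks at the edges of the time series; the source does not specify the edge handling, and the function names are hypothetical.

```python
def time_axis(start_ms=-200, stop_ms=800, step_ms=5):
    """Time points at 200 Hz (every 5 ms), both endpoints included (assumed)."""
    return list(range(start_ms, stop_ms + 1, step_ms))


def running_average(values, width=5):
    """Centered running average; the window shrinks at the edges (assumed)."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

A five-point window at 5ms resolution spans 25ms, so an isolated single-sample peak in decoding accuracy is spread over its two neighbors on either side.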
Decoding sensitivity to scene structure
For both the fMRI and EEG data, we performed two complementary decoding analyses. In the first analysis, we tested sensitivity for spatial structure by decoding spatially intact from spatially jumbled scenes (Figure 2a). In the second analysis, we tested sensitivity for categorical structure by decoding categorically intact from categorically jumbled scenes (Figure 2d). To investigate whether successful decoding indeed reflected sensitivity to scene structure, we performed both analyses separately for the upright and inverted scenes. Critically, inversion effects (i.e., better decoding in the upright than in the inverted condition) indicate genuine sensitivity to natural scene structure that goes beyond purely visual differences.
Statistical testing
For the fMRI data, we used t-tests to compare decoding against chance and between conditions. To Bonferroni-correct for comparisons across ROIs, all p-values were multiplied by 3. For the EEG data, given the larger number of comparisons, we used a threshold-free cluster enhancement procedure (Smith & Nichols, 2009). Multiple-comparison correction was based on a sign-permutation test (with null distributions created from 10,000 bootstrapping iterations) as implemented in CoSMoMVPA (Oosterhof et al., 2016). The resulting statistical maps were thresholded at z>1.96 (i.e., pcorr<.05).
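Two of these steps lend themselves to a short sketch. The Bonferroni correction multiplies each p-value by the number of ROIs without capping at 1, which is why corrected values above 1 can be reported. The sign-permutation function below is a simplified single-test version: it omits the threshold-free cluster enhancement step and is not the CoSMoMVPA implementation. All names are hypothetical.

```python
import random


def bonferroni(p, n_comparisons=3):
    """Multiply p by the number of comparisons (three ROIs); not capped at 1."""
    return p * n_comparisons


def sign_permutation_p(effects, n_iter=10000, rng=None):
    """One-sided sign-permutation p-value for mean(effects) > 0.

    effects: per-participant values such as decoding accuracy minus chance.
    Under the null hypothesis, each participant's sign is exchangeable.
    """
    rng = rng or random.Random()
    observed = sum(effects) / len(effects)
    count = 0
    for _ in range(n_iter):
        null_mean = sum(
            e if rng.random() < 0.5 else -e for e in effects
        ) / len(effects)
        if null_mean >= observed:
            count += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (count + 1) / (n_iter + 1)
```

In the full procedure, a null distribution of this kind would be built for the cluster-enhanced statistic across all time points rather than for a single mean.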
Results
Sensitivity to spatial scene structure
To uncover where and when cortical processing is sensitive to spatial structure, we decoded between scenes whose spatial structure was intact or jumbled (Figure 2a).
For the fMRI data (Figure 2b), we found highly significant decoding between spatially intact and spatially jumbled scenes. For upright scenes, significant decoding emerged in V1, t(19)=13.03, pcorr<.001, OPA, t(19)=7.61, pcorr<.001, and PPA, t(19)=5.92, pcorr=.002, and for inverted scenes in V1, t(19)=9.92, pcorr<.001, but not in OPA, t(19)=2.08, pcorr=.16, and PPA, t(19)=0.85, pcorr>1. Critically, we observed inversion effects (i.e., better decoding for the upright scenes) in the OPA, t(16)=4.41, pcorr=.001⁴, and PPA, t(16)=3.67, pcorr=.006, but not in V1, t(16)=1.32, pcorr=.62. Therefore, decoding in V1 solely reflects visual differences, whereas OPA and PPA exhibit genuine sensitivity to the spatial scene structure. This result was confirmed by further ROI analyses and a spatially unconstrained searchlight analysis (see Supplementary Information).
For the EEG data (Figure 2c), we also found strong decoding between spatially intact and jumbled scenes. For upright scenes, this decoding emerged between 55ms and 465ms, between 505ms and 565ms, and between 740ms and 785ms, peak z>3.29, pcorr<.001, and for inverted scenes between 65ms and 245ms, peak z>3.29, pcorr<.001. As in scene-selective cortex, we observed inversion effects, indexing stronger sensitivity to spatial structure in upright scenes, between 255ms and 300ms and between 340ms and 395ms, peak z=2.78, pcorr=.005.
Together, these results show that in scene-selective OPA and PPA, and after 255ms, cortical activations are sensitive to the spatial structure of natural scenes.
Sensitivity to categorical scene structure
To uncover where and when cortical processing is sensitive to categorical structure, we decoded between scenes whose categorical structure was intact or jumbled (Figure 2d).
For the fMRI data (Figure 2e), the upright scenes’ categorical structure could be decoded only from V1, t(19)=3.11, pcorr=.017, but not from the scene-selective ROIs, both t(19)<2.15, pcorr>.13. Similarly, for the inverted scenes, significant decoding was only observed in V1, t(19)=4.58, pcorr<.001, but not in the scene-selective ROIs, both t(19)<2.29, pcorr>.10. No inversion effects were observed, all t(16)<0.60, pcorr>1.
For the EEG data (Figure 2f), we found only weak decoding between the categorically intact and jumbled scenes. In the upright condition, decoding was significant between 165ms and 175ms and between 215ms and 265ms, peak z=2.32, pcorr=.02, and in the inverted condition at 120ms, peak z=1.97, pcorr=.049. No inversion effects were observed, peak z=1.64, pcorr=.10.
Together, these results reveal no sensitivity to the categorical structure of a scene, at least when none of the scenes are fully coherent and when they are not relevant for behavior. This is in marked contrast with sensitivity for spatial scene structure, which is observed in the absence of behavioral relevance and is disrupted by stimulus inversion. Similar results were obtained in univariate analyses (see Supplementary Information).
Discussion
Our findings provide the first spatiotemporal characterization of cortical sensitivity to natural scene structure. As the key result, we observed sensitivity to spatial (but not categorical) scene structure, which emerged in scene-selective cortex and after 255ms of processing. By showing that this effect is stronger for upright than for inverted scenes, we provide strong evidence for genuine sensitivity to spatial structure, rather than low-level properties.
Sensitivity to spatial structure may index mechanisms enabling efficient scene understanding. Previous work on object processing shows that in order to efficiently parse the many objects contained in natural scenes, the visual system exploits regularities in the environment, such as regularities in individual objects’ positions (Kaiser & Cichy, 2018; Kaiser, Moeskops, & Cichy, 2018), relationships between objects (Kim & Biederman, 2011; Kaiser & Peelen, 2018; Kaiser, Stein, & Peelen, 2014; Roberts & Humphreys, 2010), and relationships between objects and scenes (Brandman & Peelen, 2017; Faivre, Dubois, Schwartz, & Mudrik, 2019). The current results suggest that cortical scene analysis also uses spatial regularities to efficiently handle complex visual information, in line with the view that real-world structure facilitates processing in the visual system across diverse naturalistic contents.
Our results also shed new light on the temporal processing cascade during scene perception. Sensitivity to spatial structure emerged after 255ms of processing, which is only after scene-selective peaks in ERPs (Harel et al., 2016; Sato et al., 1999)⁵ and after basic scene attributes are computed (Cichy, Khosla, Pantazis, & Oliva, 2017). Interestingly, after 250ms brain responses not only become sensitive to scene structure, but also to object-scene consistencies (Draschkow, Heikel, Fiebach, Võ, & Sassenhagen, 2018; Ganis & Kutas, 2003; Mudrik, Lamy, & Deouell, 2010; Võ & Wolfe, 2013). Together, these results suggest a dedicated processing stage for the structural analysis of objects, scenes, and their relationships, which is different from basic perceptual processing. However, whether these different findings indeed reflect a common underlying mechanism requires further investigation⁶.
Perhaps surprisingly, our findings suggest more pronounced sensitivity to spatial structure than to categorical structure. This is in line with studies showing that scene-selective responses are mainly driven by spatial layout, rather than scene content (Dillon, Persichetti, Spelke, & Dilks, 2018; Harel, Kravitz, & Baker, 2013; Lowe, Rajsic, Gallivan, Ferber, & Cant, 2017; Kravitz, Peng, & Baker, 2011). However, the brain may be less sensitive to categorical structure when, as in our study, all scenes are jumbled to some extent and not behaviorally relevant.
By contrast, it is worth stressing that sensitivity to spatial scene structure emerged in the absence of behavioral relevance, suggesting that spatial structure is analyzed automatically during perceptual processing. Because we cannot explicitly engage with all aspects of a scene concurrently in real-world situations, this automatic analysis of spatial structure may be crucial for rapid scene understanding.
Author Note
We thank Sina Schwarze for help in EEG data collection and manuscript preparation.
D.K. and R.M.C. are supported by Deutsche Forschungsgemeinschaft (DFG) grants (KA4683/2-1, CI241/1-1, CI241/3-1). R.M.C. is supported by a European Research Council Starting Grant (ERC-2018-StG 803370).
Footnotes
1 Note that all scenes were jumbled to some extent, because even in the categorically intact scenes, parts from four different exemplars were intermixed.
2 For two participants, no responses were recorded due to technical problems.
3 For two participants, only data from 32 electrodes were recorded due to technical problems.
4 Statistics for fMRI inversion effects are based on the 17 participants who completed both sessions.
5 In our study, ERP responses in posterior-lateral electrodes peaked at 235ms.
6 One open question concerns whether these effects primarily reflect enhanced processing of consistent structure or responses to inconsistencies.