Viewpoint-Dependence and Scene Context Effects Generalize to Depth Rotated 3D Objects

Viewpoint effects on object recognition interact with object-scene consistency effects. While recognition of objects seen from "accidental" viewpoints (e.g., a cup from below) is typically impeded compared to recognition of objects seen from canonical viewpoints (e.g., the string side of a guitar), this cost is reduced by meaningful scene context information. In the present study, we investigated whether these findings, established using photographic images, generalize to 3D models of objects. Using 3D models further allowed us to probe a broad range of viewpoints and to establish accidental and canonical viewpoints empirically. In Experiment 1, we presented 3D models of objects from six different viewpoints (0°, 60°, 120°, 180°, 240°, 300°) in color (Experiment 1a) and in grayscale (Experiment 1b) in a sequential matching task. Viewpoint had a significant effect on accuracy and response times. Based on performance in Experiments 1a and 1b, we determined canonical (0° rotation) and non-canonical (120° rotation) viewpoints for the stimuli. In Experiment 2, participants again performed a sequential matching task, but now the objects were paired with scene backgrounds that were either consistent (e.g., a cup in a kitchen) or inconsistent (e.g., a guitar in a bathroom) with the object. Viewpoint interacted significantly with scene consistency: object recognition was less affected by viewpoint when consistent scene information was provided than when inconsistent information was provided. Our results show that viewpoint-dependence and scene context effects generalize to depth rotated 3D objects, supporting the important role object-scene processing plays in object constancy.

We encounter objects within certain contexts, which provides us with a pool of complex visual and multimodal information that is integrated during object recognition. Past research has shown that context facilitates object recognition (Biederman et al., 1982; Oliva & Torralba, 2007; for a recent review see Lauer et al., 2021). Evidence from behavioral as well as neurophysiological studies (e.g., Brandman & Peelen, 2017) suggests interactive processing of objects and scenes. For instance, objects placed in semantically consistent contexts are recognized faster and more accurately, which is often referred to as the scene-consistency effect (Davenport & Potter, 2004; Palmer, 1975). Accordingly, models of object recognition have been updated to incorporate the integration of contextual information (Bar, 2004). Further, frameworks incorporating object-scene and object-object relations (e.g., the so-called scene grammar) describe a set of internalized rules, based on regularities found in real-world scenes, that facilitate scene and object perception and guide our attention during different visual cognitive tasks (Draschkow & Võ, 2017; Josephs et al., 2016; Võ et al., 2019; Võ & Henderson, 2009; Võ & Wolfe, 2013a, 2013b).

Sastyin and colleagues (2015) conducted a series of experiments investigating the interaction between viewpoint and scene consistency in object and scene recognition. They used photographic images of objects shown from canonical and accidental viewpoints and paired them with consistent or inconsistent scenes. They found a significant interaction between viewpoint and consistency in which the viewpoint effect was weaker when consistent scene information was provided. From this they concluded that object recognition relied more on context information if the object was presented from an accidental viewpoint. Here, in order to increase the external validity of these findings (Draschkow, 2022), we tested whether they generalize to depth rotated 3D models of objects.

Procedure

To investigate the speed and accuracy of object recognition, while keeping the procedure comparable with previous studies, a word-picture verification task was employed in all experiments (Figure 1). Participants were instructed on screen, as well as through standardized verbal instructions, to decide as quickly and accurately as possible whether the object on screen matched the basic level category label presented to them at the beginning of the trial, using a corresponding "match" or "mismatch" key. Participants were not made aware of the different viewpoint conditions beforehand. Each experiment began with three practice trials during which the instructor stayed in the room with the participant. More detailed procedures and trial sequences are described in the individual Procedure sections of each experiment. Experiments 1a and 1b lasted approximately 30 minutes; Experiment 2 lasted approximately 12 minutes.

Figure 1. Trial procedures for the matching task in Experiments 1a and 1b (A) and Experiment 2 (B). The object was presented in color in Experiment 1a and in grayscale in Experiment 1b. Note that the depicted labels are in English for visualization purposes. Feedback was only provided in case of incorrect responses.

Design

Experiments 1a and 1b consisted of six blocks of 100 trials each. In each block, objects were presented from a different angle (0°, 60°, 120°, 180°, 240°, 300°); the assignment of angles to blocks was chosen randomly and counterbalanced between participants. The order of objects within each block was randomized. Each object appeared three times in the match condition (object image matched the basic level category label) and three times in the mismatch condition (object image did not match the basic level category label), randomized between blocks. In the mismatch condition, the basic level category label stemmed from a different superordinate category than the object image (e.g., the label "chair", part of the superordinate category "furniture", was paired with an image of a car, part of the superordinate category "vehicle").

Because there was no effect of viewpoint in the mismatch condition in Experiments 1a and 1b, most trials in Experiment 2 were match trials (N = 120), with 23% mismatch trials (N = 36) that were later excluded from analysis. In Experiment 2, each object was presented to each participant once, and we counterbalanced consistency (consistent vs. inconsistent) and viewpoint (canonical vs. non-canonical) between participants.

Data Analysis
In Experiments 1a and 1b, we were interested in the effects of viewpoint (how far the object was rotated away from its canonical 0° angle) and match (whether the object matched the basic level category label as part of the experimental design) on reaction times (the time between the onset of the object image and the keypress response) and accuracy. In Experiment 2, we were interested in the interaction between viewpoint (canonical versus non-canonical viewpoint) and scene consistency (consistent versus inconsistent scene) on reaction times and accuracy.

Raw data were pre-processed and analyzed using R (R Core Team, 2021). Objects whose accuracy deviated more than 2.5 SD from the mean (computed for each condition separately) were excluded from analysis. On this basis, we excluded four objects in Experiment 1a, one in Experiment 1b, and two in Experiment 2. We based our reaction time analysis on correctly matched trials only (percent trials removed: Experiment 1a = 4.45%, Experiment 1b = 10.16%, Experiment 2 = 8.55%).

In our data analysis, we employed (generalized) linear mixed-effects models ((G)LMMs) using the lme4 package (Bates et al., 2015). We chose this approach because of its potential advantages over analysis of variance (ANOVA), as it allows us to simultaneously estimate by-participant and by-stimulus variance (Baayen et al., 2008; Bates et al., 2014; Kliegl et al., 2011). The random effects structure of each model was determined using a drop-one procedure, starting with the full model including by-participant and by-stimulus varying intercepts and slopes for the main effects in our design. We then successively removed random slopes that did not contribute significantly to the goodness of fit, as determined by likelihood ratio tests. This allowed us to avoid overparameterization and produce converging models that are supported by the data. Details about the individual analyses and models are described in the Data Analysis sections of each experiment. For each GLMM we report β regression coefficients together with the z statistic and apply a two-tailed 5% error criterion for significance testing. P-values for the binary accuracy variable are based on asymptotic Wald tests. Additionally, reaction times were transformed following the Box-Cox procedure (Box & Cox, 1964) to correct for deviations from normality and thus better meet LMM assumptions (see the individual Data Analysis sections for further details). For the LMMs, regression coefficients are reported with the t statistic, and p-values were calculated with the lmerTest package (Kuznetsova et al., 2017). We defined sum contrasts for match (match vs. mismatch) and consistency (consistent vs. inconsistent), where slope coefficients represent differences between factor levels and the intercept is equal to the grand mean.

We used the ggplot2 package (Wickham, 2016) for graphics and emmeans (Lenth, 2022) for post-hoc comparisons. Data and code are openly available at https://github.com/aylinsgl/2022-Viewpoint_and_Context.
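To illustrate the drop-one procedure, the following R sketch (with hypothetical data frame and variable names; the actual analysis code is available in the repository linked above) compares a full random-effects structure against a reduced one with a likelihood ratio test:

```r
library(lme4)

# Full model: by-participant and by-stimulus intercepts and slopes
# for a main effect of the design (names here are hypothetical).
m_full <- glmer(
  accuracy ~ viewpoint * match +
    (1 + match | participant) + (1 + match | stimulus),
  data = dat, family = binomial
)

# Reduced model: drop the by-participant random slope for match.
m_reduced <- glmer(
  accuracy ~ viewpoint * match +
    (1 | participant) + (1 + match | stimulus),
  data = dat, family = binomial
)

# Likelihood ratio test; the slope is retained only if it
# significantly improves the goodness of fit.
anova(m_reduced, m_full)
```

Slopes failing this test are removed one at a time until all remaining random-effects terms are supported by the data.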

Apparatus
All experimental sessions were carried out in the same six experimental cabins of the department of psychology at Goethe-University Frankfurt am Main, each containing the same experimental setup (computers running Windows 10). Stimulus presentation, response times (RT), and accuracy were systematically controlled and recorded by OpenSesame (Mathôt et al., 2012). Stimuli were presented on a 19-inch monitor (resolution = 1680 × 1050, refresh rate = 60 Hz, viewing distance = approx. 65 cm), with object images subtending approx. 11.13° × 9.28° of visual angle and background images approx. 19° × 15.84°.

Experiments 1a and 1b
In Experiments 1a and 1b, we investigated the effect of viewpoint on object recognition RT and accuracy using 3D models of objects rotated around the pitch axis (0°, 60°, 120°, 180°, 240°, 300°). The only difference between the experiments was that the 3D models were presented in color in Experiment 1a and in grayscale in Experiment 1b. Participants had to indicate whether the object matched the previously presented basic level category label.

Procedure
Participants were presented with a fixation point in the middle of the screen, followed by a basic level object category label (in German; font: Droid Sans Mono; font size: 26; color: black). The target object then appeared in the middle of the screen and remained visible until the participant responded; it could either match or mismatch the label (Figure 1A). Participants were given feedback on screen if their answer was incorrect. The next trial started automatically with a new fixation point.

Data Analysis
After data preprocessing, we employed a binomial GLMM to examine the effects of viewpoint and match on accuracy. As fixed effects, we included viewpoint (0°, 60°, 120°, 180°, 240°, 300°) as first- and second-degree polynomials, the match vs. mismatch comparison, and the interactions between these terms. The second-degree polynomial viewpoint term was added because we expected viewpoint to affect recognition in a non-linear manner (symmetry around 180°). Our final model for Experiment 1a included random intercepts for participants and stimuli, as well as a by-stimulus random slope for the match vs. mismatch effect; for Experiment 1b, it included random intercepts for participants and stimuli, as well as by-stimulus and by-participant random slopes for the match effect.

Based on the power coefficient output of the Box-Cox procedure (λ = 0.22), RTs were log-transformed. We employed the same fixed effects structure for the RT LMMs as for the accuracy GLMMs. As random effects, we entered random intercepts for participants and stimuli, as well as by-participant and by-stimulus random slopes for the effect of match, in both Experiments 1a and 1b.
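In lme4 syntax, the Experiment 1a models described above might be sketched as follows (data frame and variable names are hypothetical; `poly()` supplies the orthogonal first- and second-degree viewpoint terms, and `boxcox()` from MASS estimates the power coefficient that motivated the log transform):

```r
library(lme4)
library(MASS)  # boxcox()

# Accuracy GLMM: polynomial viewpoint terms, match, and their
# interaction, with a by-stimulus random slope for match.
m_acc <- glmer(
  accuracy ~ poly(viewpoint, 2) * match +
    (1 | participant) + (1 + match | stimulus),
  data = dat, family = binomial
)

# Box-Cox on raw RTs from correct trials; a lambda near zero
# (approx. 0.22 here) justifies a log transform.
bc <- boxcox(rt ~ 1, data = dat_correct, plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]
dat_correct$log_rt <- log(dat_correct$rt)

# RT LMM with by-participant and by-stimulus slopes for match.
m_rt <- lmer(
  log_rt ~ poly(viewpoint, 2) * match +
    (1 + match | participant) + (1 + match | stimulus),
  data = dat_correct
)
```

Note that `viewpoint` enters as a numeric variable here, so that `poly()` can construct the linear and quadratic terms.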

Results
In Experiment 1a, we found viewpoint-dependent object recognition for objects rotated around the pitch axis. This effect is best described by a quadratic curve that approximates symmetry around the 120° rotation. We also found that in our sequential matching task, only the match condition produced viewpoint-dependent behavior, while mismatch trials seemed unaffected by viewpoint. Finding a mismatch might rely more on the analysis of global, viewpoint-invariant features, whereas matching might depend more on the analysis of local, viewpoint-dependent features (e.g., Jolicoeur, 1990a); for example, deciding that a shape is not a car might require less viewpoint-dependent information than identifying the shape as a chair. In Experiment 1b, we replicated the results of Experiment 1a. Grayscaling the images seemed to make the overall task slightly more difficult while still producing similarly viewpoint-dependent behavior. The canonical (0°) and non-canonical (120°) viewpoints used in Experiment 2 were the viewpoints that produced the best and worst recognition performance, respectively, derived from the average accuracy obtained in Experiments 1a and 1b.

Experiment 2
In Experiment 2, we paired canonical (0°) and non-canonical (120°) viewpoints with consistent and inconsistent scene contexts. We were specifically interested in the interaction between viewpoint and consistency, with the expectation that meaningful scene context information would reduce the effect of viewpoint on object recognition.

Procedure
In Experiment 2, we used the same word-picture verification task as in Experiments 1a and 1b (Figure 1B). Scene context was provided by first previewing the consistent or inconsistent scene for 300 ms and then overlaying the target object on the scene background until a response was given.

Data Analysis
For both the accuracy GLMM and the response time (RT) LMM, we entered interaction terms between viewpoint and consistency as fixed effects. The GLMM included random intercepts for participants and stimuli, as well as a by-stimulus random slope for the effect of viewpoint. Response time data were log-transformed. The RT LMM included random intercepts for participants and stimuli, a by-participant random slope for the effect of viewpoint, and by-stimulus random slopes for the effects of viewpoint and consistency.
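A sketch of these Experiment 2 models in lme4 syntax might look as follows (hypothetical names again; both factors carry the sum contrasts defined in the general Data Analysis section, and emmeans is used for the post-hoc comparisons):

```r
library(lme4)
library(lmerTest)  # p-values for the LMM
library(emmeans)   # post-hoc comparisons

# Sum contrasts so that the intercept equals the grand mean.
contrasts(dat2$viewpoint)   <- contr.sum(2)
contrasts(dat2$consistency) <- contr.sum(2)

# Accuracy GLMM: viewpoint x consistency interaction with a
# by-stimulus random slope for viewpoint.
m_acc2 <- glmer(
  accuracy ~ viewpoint * consistency +
    (1 | participant) + (1 + viewpoint | stimulus),
  data = dat2, family = binomial
)

# RT LMM on log-transformed RTs from correct trials only.
m_rt2 <- lmer(
  log(rt) ~ viewpoint * consistency +
    (1 + viewpoint | participant) +
    (1 + viewpoint + consistency | stimulus),
  data = subset(dat2, accuracy == 1)
)

# Consistency effect within each level of viewpoint.
emmeans(m_acc2, pairwise ~ consistency | viewpoint)
```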

Results
Accuracy. Accuracy was significantly higher for canonical viewpoints than for non-canonical viewpoints.

In general, object recognition accuracy was viewpoint-dependent; however, there was a significant interaction between viewpoint and consistency. In line with our hypothesis, the viewpoint effect was significantly weaker for consistent scenes, and the scene consistency effect was only observed for non-canonical viewpoints (Figure 3A). Non-canonical viewpoints were recognized significantly more slowly than canonical viewpoints; however, this was unaffected by scene consistency.

General Discussion
In the present study, we investigated how scene context information modulates viewpoint-dependent object recognition using 3D models of everyday objects. While providing meaningful context did not fully eradicate the viewpoint effect, it significantly reduced recognition accuracy costs. In line with previous findings (Sastyin et al., 2015), this supports a model of object recognition that incorporates context (e.g., Bar, 2004) while dynamically adapting to the amount of available information based not only on visual features of the object (Burgund & Marsolek, 2000; Hayward & Tarr, 1997; Jolicoeur, 1990) but also on context. It further motivates models of object constancy (the visual system's ability to produce representations that are robust to changes in, e.g., viewpoint or lighting; DiCarlo & Cox, 2007) that efficiently integrate contextual information and can lead to both viewpoint-dependent and viewpoint-invariant behavior based on the available information and the task at hand.

A key component of the present study was to generalize previous findings on object-scene processing effects and viewpoint-dependence to depth rotated 3D objects. We want to highlight the importance of generalizing findings from traditional 2D settings to more naturalistic settings and stimuli. Kristjánsson and Draschkow (2021) have shown, for a variety of phenomena, that given more naturalistic constraints a system is able to circumvent, for example, capacity limits by drawing on the rich visual experience of natural environments. While we did not use fully immersive environments, using 3D models offers a more realistic encounter with everyday objects and therefore a more precise measure of viewpoint-dependence in real-world object recognition. It should be noted, however, that there is a trade-off between naturalistic-looking stimuli (i.e., photographs) and stimuli that more precisely capture naturalistic properties (i.e., the 3D structure of objects from different viewpoints) in a highly controlled manner while not looking as naturalistic. Here, we opted for providing more naturalistic 3D properties of the displayed objects.

From the present study, it remains unclear what kind of information contained in the scenes was responsible for reducing the viewpoint costs; rapidly accessed global information, such as the gist of the scene (Oliva & Torralba, 2007), is one candidate.

Varying what information is presented during the task (i.e., providing meaningful context vs. showing objects in isolation) is one way to probe the visual system's ability to overcome processing limitations in viewpoint-dependent object recognition. Alternatively, one could keep the visual input constant but vary the level at which participants have to perform the matching task (Hamm & McMullen, 1998). If there are object representations that contain more or less viewpoint-dependent or viewpoint-invariant information, how does this interact with the integration of contextual information in the form of scene context?

Finally, we would like to note that, on average, performance in the matching task was high throughout all our experiments. These ceiling effects are probably due to the type of task we chose, which differs from the tasks usually employed to study scene consistency effects (Davenport & Potter, 2004; Sastyin et al., 2015).
Despite these differences in difficulty, we were able to demonstrate a significant reduction in viewpoint costs by providing meaningful scene context.

Past research has made strong advances towards understanding the computations that underlie invariant object recognition (DiCarlo & Cox, 2007). Understanding these mechanisms in isolation is key to understanding object recognition in general. We argue that understanding how the visual system is able to make use of richly structured naturalistic environments to circumvent computational bottlenecks will ultimately lead to better, more robust models of object recognition and inspire approaches in fields such as computer vision (e.g., Bomatter et al., 2021).

To conclude, in the present study we built upon previous findings on object-scene processing and viewpoint dependence by generalizing these effects to depth rotated 3D objects. We highlight the importance of testing the capacity limits of object recognition in more naturalistic frameworks in order to build more robust and flexible models and move towards a better understanding of vision under naturalistic constraints.