Human shape representations are not an emergent property of learning to classify objects

Humans are particularly sensitive to changes in the relationships between parts of objects. It remains unclear why this is. One hypothesis is that relational features are highly diagnostic of object categories and emerge as a result of learning to classify objects. We tested this by analysing the internal representations of supervised convolutional neural networks (CNNs) trained to classify large sets of objects. We found that CNNs do not show the same sensitivity to relational changes as previously observed for human participants. Furthermore, when we precisely controlled the deformations to objects, human behaviour was best predicted by the amount of relational changes while CNNs were equally sensitive to all changes. Even changing the statistics of the learning environment by making relations uniquely diagnostic did not make networks more sensitive to relations in general. Our results show that learning to classify objects is not sufficient for the emergence of human shape representations.


One classic example is the Kanizsa triangle, where the visual system encodes the multiple collinearities of edges present in the proximal image and uses these to build contours of a triangle even though these contours do not exist in the retinal image. The advantage of distal representations is that they are relevant for a broad range of tasks - the same representation of an object can be used for recognition and visual reasoning (Hummel & Biederman, 1992), amongst other visual skills.

A second view proposes that these biases can emerge as a result of internalisation of the biases present in the environment relevant for classifying objects. According to this view, humans prefer to view objects from a canonical perspective because these perspectives are more frequent in the visual environment, and they prefer to classify objects based on shape because shape is more diagnostic during object classification. In other words, biases are a consequence of performing statistical learning on a large set of objects, with the goal of optimising behaviour on a particular task. We will call this the optimisation-for-classification approach or, more briefly, the optimisation approach.

The goal of this study was to test the second view - whether inferences about distal stimuli can emerge as a result of learning to classify a large set of objects. We tested this by focusing on supervised Convolutional Neural Networks (CNNs) - machine learning models that recognise objects by learning statistical features of their proximal stimuli that can be used to optimally classify each stimulus, given some training data. The learned representations that support object recognition are specialized for image classification. There is no pressure to learn distal representations of objects. As such, CNNs trained using supervised learning to classify objects provide a concrete model to test the optimisation view. If human perceptual biases are acquired purely through internalising the statistics of the environment in order to classify objects, then training CNNs to perform classification on ecologically realistic datasets should lead to perceptual shape biases similar to the ones observed for humans.

Initial studies testing shape-bias in CNNs showed that CNNs trained in a supervised setting on large datasets of naturalistic images (e.g. ImageNet) frequently lacked a shape-bias, instead preferring to classify images based on texture (Geirhos et al., 2018) or other local features (Baker, Lu, Erlikhman, & Kellman, 2018; Malhotra, Dujmovic, & Bowers, 2021). However, it has been argued that CNNs can also be trained to infer an object's shape given the right type of training.

For example, Geirhos et al. (2018) trained standard CNNs on a Style-Transfer image dataset that mixes the shape of images from one class with the texture from other classes, so that only shape was diagnostic of category. CNNs trained on this dataset learned to classify objects by shape. In another study, Feinman and Lake (2018) found CNNs were capable of learning a shape-bias based on a small set of images, as long as the training data was carefully controlled. Similarly, Hermann, Chen, and Kornblith (2020) showed that more psychologically plausible forms of data augmentation, namely the introduction of color distortion, noise, and blur to input images, make standard CNNs rely more on shape when classifying images. Indeed, the authors found that data augmentation was more effective in inducing a shape bias than modifying the learning algorithms or architectures of networks, and concluded: "Our results indicate that apparent differences in the way humans and ImageNet-trained CNNs process images may arise not primarily from differences in their internal workings, but from differences in the data that they see" (Hermann et al., 2020, Abstract).

These results raise the possibility that human biases are indeed a consequence of internalising the statistical properties of the environment relevant to classifying objects rather than the product of heuristics involved in building distal representations of objects. But studies so far have focused on judging whether or not CNNs are able to develop a shape-bias, rather than examining the type of shape representations they acquire. If humans and CNNs indeed acquire a shape-bias through a similar process of statistical optimisation, then CNNs should not only show a shape-bias, but also develop shape representations that are similar to human shape representations.

A key finding about human shape representations is that humans do not give equal weight to all shape-related features. For example, it has been shown that human participants are more sensitive to distortions of shape that change relations between parts of objects than to distortions that preserve these relations (Biederman, 1987; Hummel & Stankiewicz, 1996). These observations have typically been taken to support a heuristic view according to which relations present in the proximal image are used to build distal representations of objects (Hummel, 1994). The question we ask is whether CNNs trained to classify objects learn to encode these relational features of shape. If they do, it would suggest that the relational sensitivity of human shape representations can emerge as a consequence of learning to classify large sets of objects and that shape-biases in object recognition are the product of optimising performance on object classification. But if not, it would suggest that these biases are best characterized as heuristics designed to build distal representations of shape and that learning to classify objects is not sufficient for the emergence of such distal representations.

In the rest of the paper, we discuss a series of experiments (simulation studies with CNNs as well as behavioural experiments with human participants) which show that the shape representations that emerge as a result of classifying images in CNNs are qualitatively different from human shape representations. In the first two experiments, we examine objects that consist of multiple parts, while the following experiments examine objects that consist of a single part. The deformations required to infer the shape representations of these two types of objects are different, but related. Therefore, we begin each section by describing these deformations and how these deformations are predicted to affect shape representations under the two (optimisation and heuristic) views. We then present results of experiments where humans and CNNs were trained on the same set of shapes and were then presented with these deformations. In the final section, we discuss how our findings pose a challenge for developing models of human vision.

Experiment 1

In our first experiment, we asked whether models that learn to optimise their performance by classifying large sets of objects develop a key property of human shape representations - their sensitivity to a subset of object deformations. According to the structural description theory (Biederman, 1987), humans represent objects as collections of convex parts in specific categorical spatial relations. For example, consider two objects - a bucket and a mug - both of which consist of the same parts: a curved cylinder (the handle) and a truncated cone (the body). The encoding of objects through parts and relations between parts makes it possible to support a range of visual skills. For example, it is possible to appreciate the similarity between a mug and a bucket because they both contain the same parts (curved cylinder and truncated cone) as well as their differences (the different relations between the object parts). That is, the representational scheme supports visual reasoning. In addition, the parts themselves are coded so that they can be identified from a wide range of viewing conditions (e.g., invariance to scale, translation and viewing angle, as well as robustness to occlusion), allowing objects to be classified from novel poses and under degraded conditions.

Note that the reliance on categorical relations to build up distal representations of multi-part objects is a built-in assumption of the model (one of the model's heuristics), and it leads to the first hypothesis we test, namely that image deformations that change a categorical relation between an object's parts should have a larger impact on the object's representation than metrically-equivalent deformations that leave the categorical relations intact (as might be produced by viewing a given object from different angles). By contrast, any model that relies only on the properties of the proximal stimulus might be expected to treat all metrically-equivalent deformations as equivalent. Such a model may learn that some distortions are more important - i.e., diagnostic - than others in the context of specific objects, but it is unclear why it would show a general tendency to treat relational deformations as different from metric ones, since there is no heuristic that assumes that categorical relations between parts are a central feature of object shape representations. (Indeed, it may have no explicit encoding of parts at all.) Instead, all deformations are simply changes in the locations of features in the image.

Hummel and Stankiewicz (1996) designed an experiment to test this prediction of structural description theory and compare it to the prediction of view-based models (Poggio & Edelman, 1990) of human vision. They created a collection of shapes modeled on Tarr and Pinker's (1989) simple "objects". Each object consisted of a collection of lines connected at right angles (Figure 1). Hummel and Stankiewicz then created two deformations of each of these Basis objects. One deformation, the relational deformation (Rel), was identical to the Basis object from which it was created except that one line was moved so that its "above/below" relation to the line to which it was connected changed (from above to below or vice-versa). This deformation differed from the Basis object in the coordinates of one part and in the categorical relation of one part to another. The other deformation, the coordinates deformation (Cood), moved two lines in the Basis object in a way that preserved the categorical spatial relations between all the lines composing the object, but changed the coordinates of two lines. Note that both types of deformations can, in principle, indicate a change in distal stimulus. But a system that uses relational changes as a heuristic for changes to distal stimuli will be more sensitive to Rel changes than Cood changes.

Across five experiments, participants first learned to classify a set of base objects and were then tested on their ability to distinguish them from their relational (Rel) and coordinate (Cood) deformations. The experiments differed in the specific set of images used, the specific tasks, and the duration of the stimuli, but across all experiments participants found it easy to discriminate the Rel deformations from their corresponding basis object and difficult to distinguish the Cood deformations. The effects were not subtle. In Experiment 1 (which used the stimuli from Figure 1), participants mistook the Rel and Cood images as the base approximately 10% and 90% of the time, respectively, with similar findings observed across experiments. Hummel and Stankiewicz took these findings to support the claim that humans encode objects in terms of the categorical relations between their parts, consistent with the predictions of the structural description theories that propose a heuristic approach to human shape representation (Hummel, 1994).

Figure 1
Stimuli used by Hummel and Stankiewicz (1996).
Note. The first column shows a set of six (Basis) shapes that participants were trained to recognise. Participants were then tested on shapes in the second and third columns, which were generated by deforming the Basis shape in the corresponding row. In the second column (Rel deformation) a shape is generated by changing one categorical relation (highlighted in red circle). In the third column (Cood deformation) all categorical relations are preserved but the coordinates of some elements are shifted (highlighted in blue ellipse).

However, an optimisation approach may also be able to explain the findings of Hummel and Stankiewicz - a bias for perceiving objects in terms of parts and relations may simply emerge as a result of learning to classify objects. In Experiment 1, we tested this hypothesis by replicating the experimental setup of Hummel and Stankiewicz, replacing human participants with two well-known CNNs - VGG-16 and AlexNet - that have previously been argued to capture human-like representations (Kriegeskorte, 2015; Yamins & DiCarlo, 2016) and an ability to develop a shape-bias (Geirhos et al., 2018; Hermann et al., 2020).

Method
Training Stimuli. We constructed six basis shapes that were identical to the shapes used by Hummel and Stankiewicz (1996) in their Experiments 1-3. Each image was sized 196x196 pixels and consisted of five black line segments on a white background organised into different shapes. All images had one short (horizontal) segment at the bottom and one long (vertical) segment in the middle. This left three segments, two long, which were always horizontal, and one short, which was always vertical. The two horizontal segments could be either left-of or right-of the central vertical segment. Additionally, the short vertical segment could be attached to the left-of or the right-of the upper horizontal segment. This means that there were a total of 8 (2x2x2) possible Basis shapes. We selected six out of these to match the six shapes used by Hummel and Stankiewicz (1996). Each training set contained 5000 images in each category, constructed by randomly translating, scaling and rotating each shape.

Test Stimuli. Following Hummel and Stankiewicz (1996), we constructed Rel (relational) deformations (called V1 variants by Hummel and Stankiewicz) of each Basis shape by shifting the location of the top vertical segment, so that its categorical relation to the upper horizontal segment changed from "above" to "below". Similarly, we constructed Cood (coordinate) deformations (called V2 variants by Hummel and Stankiewicz) by shifting the location of both the top horizontal line and the short vertical segment together, so that the categorical relations between all the segments remained the same but the pixel distance (e.g. cosine distance) was at least as large as the pixel distance for the corresponding Rel deformation. The test set consisted of 1000 triplets of Basis, Rel and Cood images for each category, which were again generated using the same data augmentation method.
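For concreteness, the 2x2x2 stimulus space and the two deformations can be parameterised as in the sketch below. This is our illustration rather than the published stimulus-generation code, and all coordinates (canvas positions, segment lengths, shift sizes) are assumptions; rendering the segments at random translations, scales and rotations then yields the augmented images.

    from itertools import product

    # A shape is a list of line segments; each segment is ((x0, y0), (x1, y1))
    # on a 196x196 canvas. All coordinates below are illustrative guesses.
    def basis_shape(h1_side, h2_side, v_side):
        # h1_side/h2_side place the lower/upper horizontal bar left (-1) or
        # right (+1) of the central vertical; v_side attaches the short
        # vertical segment to the left or right end of the upper bar.
        cx = 98
        segs = [((cx - 20, 170), (cx + 20, 170)),       # short horizontal base
                ((cx, 40), (cx, 170)),                  # long central vertical
                ((cx, 120), (cx + 80 * h1_side, 120)),  # lower horizontal bar
                ((cx, 60), (cx + 80 * h2_side, 60))]    # upper horizontal bar
        xa, xb = sorted((cx, cx + 80 * h2_side))
        x_end = xa if v_side < 0 else xb
        segs.append(((x_end, 60), (x_end, 40)))         # short vertical, "above"
        return segs

    def rel_deformation(segs):
        # Flip the short vertical segment from "above" to "below" the upper
        # bar: one segment's coordinates and one categorical relation change.
        (x0, _), (x1, _) = segs[-1]
        return segs[:-1] + [((x0, 60), (x1, 80))]

    def cood_deformation(segs, dy=30):
        # Shift the upper bar and its short vertical segment together along
        # the central vertical: the coordinates of two segments change, but
        # every categorical relation is preserved.
        moved = [((x0, y0 + dy), (x1, y1 + dy))
                 for (x0, y0), (x1, y1) in segs[3:]]
        return segs[:3] + moved

    # 2 x 2 x 2 = 8 candidate Basis shapes; six were used in the experiment.
    all_basis = [basis_shape(*cfg) for cfg in product((-1, 1), repeat=3)]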

Model architecture and pre-training. We evaluated two deep convolutional neural networks, VGG-16 (Simonyan & Zisserman, 2014) and AlexNet (Krizhevsky, Sutskever, & Hinton, 2012), on the image classification tasks described in the Results section. We obtained qualitatively similar results for both architectures. Therefore, we focus on the results of VGG-16 in the main text and describe the results of AlexNet in Appendix C. Since human participants had a lifetime of experience classifying naturalistic objects prior to the experiment, we used network implementations that had been pre-trained on a set of naturalistic images. Two types of pre-training were used: networks were either pre-trained in the standard manner on ImageNet (a large database of naturalistic images), or pre-trained on a set of images where shape was made more predictive than texture by using style-transfer (Gatys, Ecker, & Bethge, 2016). We used networks pre-trained by Geirhos et al. (2018), who have shown that networks trained in this manner have a greater shape-bias than networks trained on ImageNet.
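Such pre-trained networks are readily available. A loading sketch follows; the torchvision calls are standard, while the Stylized-ImageNet checkpoint name is a placeholder for whichever file the Geirhos et al. repository distributes.

    import torch
    import torchvision.models as models

    # Standard ImageNet pre-training: torchvision distributes these weights.
    vgg16 = models.vgg16(pretrained=True)
    alexnet = models.alexnet(pretrained=True)

    # Stylized-ImageNet pre-training: Geirhos et al. (2018) distribute
    # checkpoints with their paper (https://github.com/rgeirhos/texture-vs-shape).
    # The file name below is a placeholder; check the repository for the
    # actual checkpoint names and key layout before loading.
    # state = torch.load("vgg16_trained_on_SIN.pth.tar", map_location="cpu")
    # vgg16.load_state_dict(state["state_dict"])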

Further training. Networks were either tested in a Zero-shot condition, where no further training was given on any of our datasets and we recorded the response of the pre-trained networks to the test images, or in a Fine-tuned condition, where the pre-trained network was fine-tuned to classify the 5000 Basis images of each category described in the Stimuli section above. This fine-tuning was performed in the standard manner (Yosinski, Clune, Bengio, & Lipson, 2014) by replacing the last layer of the classifier to reflect the number of target classes in each dataset. The models learnt to minimise the cross-entropy error by using the Adam optimiser (Kingma & Ba, 2014) with a small learning rate of 10⁻⁵ and a weight-decay of 10⁻³.

The average cosine similarity between the internal representations of 100 pairs of Basis and Rel test images gave us the Ba-Rel distance. Similarly, the average cosine similarity between 100 pairs of Basis and Cood test images gave us the Ba-Cood distance. These distances were compared against two baseline conditions. The upper limit of similarity was given by the average similarity of 100 pairs of Basis images from the same category. The lower limit was given by the average similarity of 100 pairs of Basis images from different categories (in each pair, one of the images was from one category and the other from one of the five other categories).
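For concreteness, this fine-tuning step can be sketched in PyTorch as follows. This is our illustration rather than the published training code; the number of epochs and the device handling are assumptions, while the loss, optimiser, learning rate and weight decay follow the values reported above.

    import torch
    import torch.nn as nn

    def finetune(model, train_loader, num_classes=6, epochs=10, device="cuda"):
        # Replace the final classifier layer to match the number of target
        # classes (six Basis categories in Experiment 1).
        in_features = model.classifier[-1].in_features
        model.classifier[-1] = nn.Linear(in_features, num_classes)
        model.to(device).train()
        # Hyperparameters as reported above: Adam, lr = 1e-5, weight decay = 1e-3.
        optimiser = torch.optim.Adam(model.parameters(), lr=1e-5,
                                     weight_decay=1e-3)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):  # epoch count is an assumption
            for images, labels in train_loader:
                images, labels = images.to(device), labels.to(device)
                optimiser.zero_grad()
                loss_fn(model(images), labels).backward()
                optimiser.step()
        return model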

Results and Discussion
An analysis of the internal representations of VGG-16 is shown in Figure 2 and its classification performance is shown in Figure A1. Figure 2 plots the average cosine similarity between the internal representations of Basis images and their Rel (Ba-Rel) and Cood (Ba-Cood) deformations at each of the convolutional and fully connected layers within the network. We compared these similarities to two baselines: the average similarity between two Basis images that belong to the same category and the average similarity between two Basis images that belong to different categories. These two baselines provide the upper and lower bounds on similarities (hatched yellow region).

Figure 2
Cosine similarities in internal representations for a VGG-16 network
Note. In each panel, the solid (red) line plots the cosine similarity between the internal representations of a Basis shape and its Rel deformation, while the dashed (blue) line plots the cosine similarity between the internal representations of a Basis shape and its Cood deformation. Layers of the network are along the x-axis, with Conv2d and Linear indicating convolutional and fully connected layers, respectively. Networks were either pre-trained on ImageNet (first row) or on Stylized-ImageNet - a dataset developed to induce a shape-bias in CNNs (second row). Their internal representations were then probed either without any further training (Zero-shot, first column) or after fine-tuning on the Hummel and Stankiewicz dataset (second column). The hatched (yellow) area shows the upper and lower bounds on cosine similarity (obtained by computing the cosine similarity of images from the same and different categories, respectively). Shaded regions around each line show the 95% confidence interval. Based on the results of Hummel and Stankiewicz (1996), we would expect the solid (red) line (Ba-Rel) to be closer to the lower, rather than upper, bound. Instead we observe that it stays at the upper bound throughout the network and overlaps the dashed (blue) line (Ba-Cood).

We observed that in the Zero-shot condition (left-hand column in Figure 2), the similarity between a basis image and its relational variant was the same on average (across seeds) as the similarity between the basis image and its coordinate variant throughout the networks. That is, the networks failed to distinguish between the basis images and their relational and coordinate variants. In fact, networks also failed to distinguish between basis images from different categories (note the narrow hatched (yellow) region in the Zero-shot condition in Figure 2 and the classification performance in Figure A1). Thus pre-training on ImageNet or Stylized-ImageNet was not sufficient for networks to distinguish between the stimuli or their deformations used by Hummel and Stankiewicz - to these models, all line drawings are alike. In contrast, the networks successfully learned to distinguish between stimuli from different categories in the Fine-tuned condition (see classification performance in Figure A1). Examining the internal representations showed that the networks represented all types of images in a similar manner in the early convolution layers (there is no difference between similarities within or between categories in the early layers) but representations begin to separate in the deeper convolution and fully connected layers (the hatched (yellow) region increases in size as we move left to right because images from different categories have lower similarity than images from the same category). However, for both types of pre-trained networks, the basis images were equally distant to their relational and coordinate deformations (see the overlapping Ba-Rel and Ba-Cood lines in Figure 2). Note that this is not because the networks overfit to the training data. In fact, networks showed very good generalisation to both novel Basis images (in unseen combinations of rotations, translation and scale) and the two types of deformations, with the cosine distance between a basis shape and either deformation close to the upper bound of similarity (also see the high classification performance for both deformations in Figure A1). In summary, we did not find any evidence to suggest that the CNN represents a relational change to an image in any privileged manner compared to a coordinate change.
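The layer-wise analysis can be reproduced with standard forward hooks. The sketch below is our reconstruction of such a probe, not the published analysis code; averaging its output over 100 Basis/Rel (or Basis/Cood) pairs yields one Ba-Rel (or Ba-Cood) curve.

    import torch
    import torch.nn.functional as F

    def layer_activations(model, image):
        # Record the flattened output of every convolutional and fully
        # connected layer during a single forward pass.
        acts, hooks = [], []
        def save(_module, _inputs, output):
            acts.append(output.detach().flatten(start_dim=1))
        for layer in model.modules():
            if isinstance(layer, (torch.nn.Conv2d, torch.nn.Linear)):
                hooks.append(layer.register_forward_hook(save))
        with torch.no_grad():
            model(image.unsqueeze(0))  # image: a (3, H, W) tensor
        for h in hooks:
            h.remove()
        return acts

    def pairwise_similarity(model, basis_img, deformed_img):
        # Cosine similarity between a Basis image and one deformation,
        # layer by layer (one point per layer on the curves in Figure 2).
        return [F.cosine_similarity(a, b).item()
                for a, b in zip(layer_activations(model, basis_img),
                                layer_activations(model, deformed_img))]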

Experiment 2
The results of Experiment 1 suggested that learning to classify the naturalistic images in ImageNet or even Stylized-ImageNet is not sufficient for CNNs to perceive objects in terms of their categorical relations. But it could be argued that this is not because of a limitation of the optimisation approach, but due to a limitation of the datasets the model was trained on. It is possible that, if the classification model was trained on datasets where relational differences were diagnostic of object categories, it may have internalised this statistic and started perceiving objects in terms of their categorical relations, just like humans. We tested this hypothesis in the next set of simulations, where we created a training environment with a "relational bias". We show next that when we do this, the network can learn specific changes to relations but it does not generalise this knowledge to novel (but highly similar) relational changes.

Figure 3
Three training sets that try to teach the network to recognise relational changes.
Note. In each set, the first column shows a set of six unique Basis shapes, while the second column shows Rel deformations of the first five shapes (see red arrow). At the bottom are the two test shapes. These test shapes are identical to the eleventh (unpaired) training shape, except for one relational (dashed red circle) or coordinate (dashed blue ellipse) deformation. In Set 1, the difference between the untrained shape and the tested Rel deformation is exactly the same as the relational change distinguishing one pair of shapes and similar to another pair in the training set (both highlighted in solid red rectangles). In Set 2, the exact relational change is not trained; however, there is a similar relational change at a nearby location (pair again highlighted in solid red rectangle). Set 3 is the most challenging, where none of the diagnostic relational changes in the training set occur at similar locations to the tested relational deformation.

Method
Experiment 2 used the same model architectures, pre-training and analysis methods as Experiment 1. However, instead of using the training dataset based on Hummel and Stankiewicz (1996), we created three new datasets where relational changes were diagnostic of image categories.

Training and Test Stimuli. We generated three datasets - shown as Set 1, Set 2 and Set 3 in Figure 3. The test set consisted of 1000 images per category, where each image was constructed by randomly translating, scaling and rotating the Rel and Cood deformations of the unpaired Basis shape.

The difference between Set 1, 2, and 3 lay in the degree of novelty of the test images. In all three datasets the same relation (dashed red circle in Figure 3) was changed between the unpaired Basis shape and its Rel deformation. However, in the first set, there were four other categories (two pairs, highlighted in red rectangles) in the training set where a similar change in relation occurred - that is, for all highlighted categories, there existed another category where the short red segment at the left end of the top bar flipped from "above" to "below" or vice-versa. In the second training set, the exact relational change was not trained, but a similar relational change at a nearby location distinguished one pair of training categories. In the third set, none of the diagnostic relational changes in the training set occurred at locations similar to the tested relational deformation (Figure 3).

Figure 4
Cosine similarity for VGG-16 networks that have been trained on diagnostic relations
Note. Each panel shows cosine similarity in internal representations between Basis images and Rel (solid, red) or Cood (dashed, blue) deformations of those images. The network was either pre-trained on ImageNet (first row) or Stylized-ImageNet (second row) and fine-tuned on Set 1 (left), Set 2 (middle) or Set 3 (right) shown in Figure 3. Like Figure 2, the hatched (yellow) region shows the upper and lower bounds on similarity. We can see that the network fine-tuned on Set 1 represents relational deformations as significantly different from Basis images as well as from coordinate deformations (the solid red line is much lower than the upper bound and the dashed blue line for deeper layers in the network). However, this is not the case for networks fine-tuned on Set 2 or Set 3.

Results and Discussion
Figure 4 shows the cosine similarity in internal representations for VGG-16 trained on these three modified data sets (we obtained a similar pattern of results for AlexNet - see Figure C3). As in previous simulations, we tested networks that were either pre-trained on ImageNet (first row) or on Stylized-ImageNet (second row) and fine-tuned on each training set. We observed that when networks were trained on Set 1 (left column in Figure 4), the cosine similarity Ba-Rel was lower than Ba-Cood in deeper layers of the CNN. That is, the networks treated the relational deformation as less similar to Basis figures than the coordinate deformations. This looks much more like the behaviour of human participants in Hummel and Stankiewicz (1996). But note that Set 1 contained two pairs of categories with the same relational change that distinguishes the tested Rel deformation from the corresponding Basis figure. A stronger test is provided by Set 2, which excludes the pair of categories distinguished by the critical relational change from the training set. Here, we observed that this effect was significantly reduced (middle column in Figure 4) - the cosine similarity Ba-Rel was slightly lower than Ba-Cood but by a much smaller degree, and the difference existed only for the networks pre-trained on ImageNet and only in the fully connected layers (also compare results in Figure C3 in the Appendix for AlexNet, where this effect is slightly more pronounced but qualitatively similar). The strongest test for whether the network learns relational representations is provided by Set 3, where none of the categories in the training set changed the exact relation that distinguishes the Rel deformation from the Basis image in the test set. Here, we observed (Figure 4, right-hand column) that the effect disappeared completely - the cosine similarity Ba-Rel was indistinguishable from Ba-Cood and both similarities were at the upper bound. All networks failed to learn that novel relational changes are more important for classification than coordinate changes, even when the learning environment contained a "relational bias" - i.e., changing relations led to a change in an image's category mapping.

Experiment 3

A limitation of the stimuli used in Experiments 1 and 2 is that the relational deformation involved a discrete, categorical change. This meant that we could not match the extent of relational change with an equivalent coordinate change and compare the sensitivity to each of these changes.

What sorts of deformations of the proximal stimulus should allow us to contrast optimisation and heuristic approaches for identifying the component parts of complex objects or single-part objects? According to the structural description theory (Biederman, 1987), certain shape properties of the proximal image are taken by the visual system as strong evidence that individual parts have those properties. For example, if there is a straight or parallel line in the image, the visual system infers that the part contains a straight edge or parallel edges. If the proximal stimulus is symmetrical, it is assumed that the part is symmetrical (see, for example, Pizlo, Sawada, Li, Kropatsch, & Steinman, 2010). These (and other) shape features used to build a distal representation of the object part are called non-accidental because they would only rarely be produced by accidental alignments of viewpoint. The visual system ignores the possibility that a given non-accidental feature in the proximal stimulus (e.g., a straight line) is the product of an accidental alignment of eye and distal stimulus (e.g., a curved edge). That is, the human visual system uses non-accidental proximal features as a heuristic to infer distal representations of object parts. Critically for our purpose, many of the non-accidental features described by Biederman (1987) are relational features, and indeed, many of the features are associated with Gestalt rules of perceptual organization, such as good continuation, symmetry, and Pragnanz (simplicity).

We therefore contrasted a deformation that alters a non-accidental, relational property (shear) with one that does not (rotation). Note that rotating an object preserves all non-accidental features (Biederman, 1987), while shearing it changes its symmetry - a non-accidental property of the object. To a model that looks only at the proximal stimulus, both deformations lead to an equivalent pixel-by-pixel change, while to a model that uses non-accidental features to build distal representations, the shear constitutes the more significant change.

Test Stimuli. The test set consisted of a grid of shapes that were obtained by deforming the Basis shape of the corresponding category. We used two deformations: rotation, which preserved the internal angles between edges, and shear, which changed internal angles. To shear a shape, its vertices were horizontally moved by a distance that depended on the vertical distance to the apex. For a vertex with coordinates (x_old, y_old), we obtained a new vertex (x_new, y_new) = (x_old + λ·(Δy)², y_old), where λ was the degree of shear and Δy was the distance between y_old and y_apex, the y-coordinate of the vertex at the apex. Images could also be a combination of rotations and shears. To do this, the Basis image was first sheared, then rotated. We measured the (cosine or Euclidean) distance of a deformed image from the Basis image and used this distance to organise the test images on a grid (see Figure 5), where images in each column had the same degree of shear and images along each diagonal had the same (cosine or Euclidean) distance to the Basis image. We then obtained 100 exemplars of each deformed image on the grid by randomly translating and scaling the image.
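The two deformations are simple vertex operations. The sketch below implements the shear equation above together with a rigid rotation (applied shear-first, as in the text); rotating about the centroid is our assumption.

    import numpy as np

    def shear_vertices(vertices, lam, y_apex):
        # x_new = x_old + lam * (dy ** 2), y unchanged: internal angles
        # change, breaking symmetry (the relational, non-accidental change).
        return [(x + lam * (y - y_apex) ** 2, y) for x, y in vertices]

    def rotate_vertices(vertices, theta):
        # Rigid rotation (here about the centroid): internal angles and all
        # other non-accidental properties are preserved.
        v = np.asarray(vertices, dtype=float)
        centre = v.mean(axis=0)
        rot = np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta),  np.cos(theta)]])
        return ((v - centre) @ rot.T + centre).tolist()

    # Combined deformations are built shear-first, then rotated:
    # deformed = rotate_vertices(shear_vertices(polygon, lam, y_apex), theta)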

Model architecture and pre-training. We used the same set of models as Experiments 1 and 2 (VGG-16 and AlexNet), pre-trained in the same manner (either on ImageNet or on Stylized-ImageNet).

Further training. Like Experiments 1 and 2, models were tested either in the Zero-shot condition, where we did not train the model on our training set and simply examined the internal representations in response to test images, or in the Fine-tuned condition, where the pre-trained model underwent further training (with a reduced learning rate) on the training stimuli. We again observed that the models failed to distinguish any shape in the Zero-shot condition; therefore we restrict the presentation of our results to the Fine-tuned condition.

As in Experiment 1, the similarity between Basis images and their rotated deformations (Ba-Rot) is estimated by measuring the average cosine similarity between embeddings of 100 pairs of images from Basis and rotated sets of the same category; the similarity between Basis images and their sheared deformations is estimated in the same way.

Figure 6
Performance of VGG-16 on deformations of single-part objects.
Note. Test stimuli for each category are shown in the top row. Middle and bottom rows show accuracy on the landscape of relational and coordinate deformations for the network pre-trained on ImageNet (middle row) or Stylized-ImageNet (bottom row). In each case, the network was fine-tuned on the set of seven polygons shown in Figure 5(a). Each heatmap (in middle and bottom rows) corresponds to a category and shows the percent of shapes (with a relational and coordinate deformation given by the position on the landscape) accurately classified as the category from which the stimulus was derived. For most categories, accuracy is highest for small deformations (top-left corner) and decreases as a function of the coordinate distance from the basis shape (perpendicular to the diagonal). The relational distance (left-to-right) has no added effect on this decrease in accuracy.

Results and Discussion
The classification performance of VGG-16 for images from the test set is shown in Figure 6 (we obtained a qualitatively similar pattern of results for AlexNet, see Appendix C). For all networks, we observed that test accuracy was highest at the top-left corner (i.e., for the Basis shape) and reduced as the degree of relational and coordinate change was increased. Thus, unlike Experiment 1, where we were able to observe only ceiling performance for both deformations, the design of Experiment 3 allowed us to compare how performance degrades for the two types of deformations. Crucially, we observed that for most categories, accuracy decreased as a function of distance to the Basis shape (perpendicular to the diagonals), rather than relational change (left to right). In fact, for some categories accuracy improved as one moved from left to right along the diagonals. Occasionally, we observed high accuracy for large rotations on one category. This was generally due to false positives, where large rotations for all categories were classified as the same category by the network (see Figure B1 in Appendix B for details). Overall, these results suggest that the network does not represent the shapes in this task in a relational manner. If it did, its performance on relational changes should have been far worse than its performance on coordinate changes.

We also compared the internal representations of each Basis image and two test images that were equidistant from it. An example of these images is highlighted (dashed red squares) in Figure 5(b). These cosine similarities for two VGG-16 networks are plotted in Figure 7 (we again obtained qualitatively similar results for AlexNet - see Figure C6 in the Appendix). At all internal layers, we observed that the average similarity between a Basis image and its relational (shear) deformation was equal to or higher than the average similarity between the Basis image and its coordinate (rotation) deformation (compare solid (red) and dashed (blue) lines in Figure 7). In other words, the relational deformation of an image was closer to the Basis image than its coordinate deformation, and pre-training on the Stylized-ImageNet dataset to give the network a shape-bias did not change this pattern. This is the opposite of what one would expect if the network represented the stimuli in a relational manner.

Figure 7
Cosine similarity in internal representations of VGG-16 in Experiment 3.
Note. The solid (red) and dashed (blue) lines show the average cosine similarity between Basis images and relational (shear) and coordinate (rotation) deformations, respectively. The hatched (yellow) region shows the bounds on this similarity, with the upper bound determined by the average similarity between Basis images from the same category and the lower bound determined by the average similarity between Basis images of different categories. If a relational (shear) deformation had a larger effect on internal representations than a coordinate (rotation) deformation, one would expect the solid (red) line to be below the dashed (blue) line.

Experiment 4

In Experiment 4, human participants were trained to classify the single-part (polygon) shapes and were then tested on rotated and sheared deformations. During the training phase, participants received feedback about what the correct response should have been (1.5 s) if the response was incorrect. The training phase was followed by a test phase consisting of five test blocks. Each block consisted of 20 trials, for a total of 100 test trials (25 per condition). Like the training phase, the order of test trials was randomised for each participant. The procedure for each test trial was the same as in the training phase except that participants were not given any feedback during testing.

Results and Discussion
The average CNN and human accuracy of classification on each of these deformations is shown in Figure 8. We can see that, irrespective of training, VGG-16 was more sensitive to rotation (the coordinate deformation) than to shear (the relational deformation), whereas human participants showed the opposite pattern of sensitivity.

Experiment 5

One response to the difference between CNNs and humans in Experiments 3 and 4 is that it arises due to the difference in experience of the two systems. Humans experience objects in a variety of rotations and consequently represent a novel object in a rotation-invariant manner.

CNNs, on the other hand, have not been explicitly trained on objects in different orientations (although ImageNet includes some objects in various poses). It could therefore be argued that CNNs do not learn relational representations in Experiment 3 because the training set did not provide an incentive for learning such a representation. Indeed, the optimisation view argues that a bias must be present in the training environment for the visual system to internalise it.

To give the network a better chance of learning to classify based on internal relations, we conducted two further simulations. In the first simulation, we trained the networks on some rotations of all Basis shapes and tested them on unseen rotations. This simulation emulates generalising the concept of rotation for each object after observing some of the rotations for that object. In the second simulation, the networks were shown all rotations of some Basis shapes and tested on unseen rotations of the left-out Basis shapes. This simulation emulates generalising the concept of rotation from one object to another.

Method
All methods in Experiment 5 remained the same as Experiment 3, except for the images in the training sets. In the first simulation, the training set now consisted of Basis (polygon) shapes presented at random translations and scales (just like Experiment 3) but additionally, also at a subset of possible rotations.

The network performance for the first simulation is shown in Figure 9. We observed that, despite being trained on this augmented dataset, results remained qualitatively similar. For most categories, performance degraded equally or more with a change in rotation than with an equivalent change in shear. That is, the network was better at generalising to large relational deformations (shears) than to large relation-preserving deformations (rotations). The pattern was different for Category 6, where the network showed good performance on large rotations. But examining the confusion matrix again revealed that the high accuracy at large rotations for this category was misleading, as it was accompanied by large Type I errors: large rotations for shapes of any category were mis-classified as belonging to Category 6. Overall, we did not find any evidence for the network learning shapes based on their internal relations.
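As an illustration, the rotation augmentation used in the first simulation might be implemented as below; the particular rotation subsets and the translation and scale ranges are assumptions, not the published values.

    import random
    import torchvision.transforms.functional as TF

    TRAIN_ROTATIONS = list(range(0, 360, 20))   # assumed training subset
    TEST_ROTATIONS = list(range(10, 360, 20))   # assumed held-out rotations

    def augment(image, train=True):
        # Random translation and scale as in Experiment 3, plus a rotation
        # drawn from the training subset (first simulation of Experiment 5).
        angle = random.choice(TRAIN_ROTATIONS if train else TEST_ROTATIONS)
        dx, dy = random.randint(-20, 20), random.randint(-20, 20)
        scale = random.uniform(0.8, 1.2)
        return TF.affine(image, angle=angle, translate=[dx, dy],
                         scale=scale, shear=0)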

The results of the second simulation are shown in Figure 10. Figure 10a shows the heat-map of accuracy on the test grid for the left-out category. This heat map showed that the network continued showing the pattern observed above - its performance decreases across (perpendicular to) the diagonals, but increases as one moves from left-to-right along these diagonals. Figure 10b shows the performance on the same conditions as the human experiment (see Figure 8). Again, we see that the performance drops less rapidly across the two shear deformations (dashed line) than across the two rotation deformations. In fact, the performance drops more quickly than when none of the categories were rotated in the training set (compare with Figure 8). This is because the network starts classifying novel orientations of the left-out shape as the shapes that it had seen being rotated in the training set.

Figure 10
Performance of VGG-16 trained on all rotations of some categories.

It may be tempting to think that the differences between humans and CNNs can be reconciled by training CNNs that learn rotation-invariant shapes. However, consider how a CNN achieves rotation-invariance. Figure 11, taken from Goodfellow, Bengio, and Courville (2016, chap. 9), illustrates how a network consisting of convolution and pooling layers may learn to recognise digits in different orientations. As a result of training on digits (here, the digit 5) oriented in three different directions, the convolution layer develops three different filters, one for each orientation. A downstream pooling unit then amalgamates this knowledge and fires when any one of the convolution filters is activated. Therefore, this pooling unit can be considered as representing the rotation-invariant digit 5. During testing, when the network is presented the digit 5 in any orientation, the corresponding convolution filter gets activated, resulting in a large response in the pooling unit, and the network successfully recognises the digit 5, irrespective of its orientation.

Figure 11
A proposal for achieving rotation invariance in CNNs (from Goodfellow et al. (2016, chap. 9))
Note. Each panel shows the response of the network to a different rotation of the digit '5'. The network detects different rotations by having a large set of filters, one matching each rotation. The output unit pools across all rotation filters, essentially performing a disjunction over all filter activations.
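The scheme can be made concrete with a few lines of PyTorch. The sketch below is our illustration of the idea in Figure 11 (not code from Goodfellow et al.): one template filter is replicated at several orientations, and a downstream unit max-pools over the copies.

    import torch
    import torch.nn as nn
    import torchvision.transforms.functional as TF

    class RotationPool(nn.Module):
        def __init__(self, template, angles=(0, 90, 180, 270)):
            # template: a (k, k) filter; one rotated copy is made per angle.
            super().__init__()
            rotated = [TF.rotate(template.unsqueeze(0), float(a)).squeeze(0)
                       for a in angles]
            self.conv = nn.Conv2d(1, len(angles), template.shape[-1],
                                  bias=False)
            self.conv.weight.data = torch.stack(rotated).unsqueeze(1)

        def forward(self, x):
            responses = self.conv(x)  # one response map per orientation
            # Disjunction over orientations and spatial positions: the unit
            # fires if any rotated copy of the filter matches anywhere.
            return responses.amax(dim=(1, 2, 3))

The invariance of such a unit is only as broad as its set of rotated filters, which is why it does not extend to orientations for which no filter was learned.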

In contrast, a relational account of shape representation does not rely on developing filters for each orientation of a shape. Indeed, it is not even necessary to observe a shape in all orientations to achieve at least some degree of rotation invariance. All that is needed is to be able to recognise the internal parts of an object and check whether they are in the same relations as the learned shape. Accordingly, many psychological studies have shown that invariance, such as rotation invariance, precedes recognition (Biederman & Cooper, 1991, 1992; Biederman & Gerhardstein, 1995; Hummel, 2013).

General Discussion
In a series of experiments we have shown that humans represent shape in qualitatively different ways to CNNs that learn to classify large datasets of objects using supervised learning. In Experiment 1 we found that CNNs trained to classify objects were entirely insensitive to deformations in categorical relations between object parts. Furthermore, we could not train CNNs to be sensitive to relational changes in general even when we made relational changes diagnostic of category classification (Experiment 2). In Experiments 3 and 4, where we precisely matched the extent of relational and coordinate deformations, we found that humans were highly sensitive to relational deformations of single-part objects, whereas CNNs were only sensitive to coordinate distance, and once again, CNNs could not learn to be sensitive to relational manipulations (Experiment 5).

These findings challenge the hypothesis that humans perceive objects based on similar principles as CNNs trained to classify large sets of objects and that apparent differences arise due to "differences in the data that they see" (Hermann et al., 2020). These results show that even CNNs that have been trained to classify objects on the basis of shape (trained on Stylized-ImageNet) learn the wrong sort of shape representation. These findings add to other studies that also highlight the different types of shape representation used by CNNs and the human visual system. For example, Puebla and Bowers (2021) have found that CNNs fail to support a simple relational judgement with shapes, namely, whether two shapes are the same or different. Again, this highlights how CNNs trained to process shape ignore relational information. In addition, Baker et al. (2018) have shown that CNNs that classify objects based on shape focus on local features and ignore how local features relate to one another in order to encode the global structure of objects.

These failures may reflect a range of processes present in humans but absent in CNNs trained to recognise objects through supervised learning, such as figure-ground segregation, completing objects behind occluders, encoding border ownership, and inferring 3D properties of the object (Pizlo et al., 2010). Consistent with this hypothesis, Jacob, Pramod, Katti, and Arun (2021) and Bowers et al. (2022) have recently highlighted a number of these failures in CNNs, including a failure to represent 3D structure, occlusion, and parts of objects. More broadly, these results challenge the claim that CNNs trained to recognise objects through supervised learning are good models of the ventral visual stream of human vision (see, for example, Cadieu et al., 2014; Mehrer, Spoerer, Jones, Kriegeskorte, & Kietzmann, 2021; Yamins et al., 2014).

One interesting study that provides some evidence to suggest that standard CNNs have similar shape representations to humans was reported by Kubilius, Bracci, and Op de Beeck (2016). In one of their experiments (Experiment 3), they compared the similarity of representations in various CNNs in response to a change in metric and non-accidental features of single-part objects. For instance, they compared a base object that looked like a slightly curved brick to two objects: one object that was obtained by deforming the base object into a straight brick (a non-accidental change) and a second object that was obtained by deforming the base object into a greatly curved brick (a metric change). Kubilius et al. reported that, like humans, CNNs were more sensitive to non-accidental changes. However, it is unclear whether CNNs were more sensitive to one of their manipulations because of the non-accidental change or because of other confounds accompanying these manipulations. For example, when Kubilius et al. modified some of the base shapes to create non-accidental deformations, this was accompanied by a change in local features (such as properties of vertices). Recent research (Baker et al., 2018; Geirhos et al., 2018) has shown that, unlike humans, CNNs are in fact highly sensitive to changes in local and textural features, and it is unclear whether it is these types of changes that are driving the effects observed by Kubilius et al. (2016). More work is required to reconcile their findings with our own.

More generally, our findings raise the question as to whether optimizing CNNs on classification tasks is even the right approach to developing models of human object recognition. It is striking how well our findings are predicted by a classic structural description theory of object recognition that builds a distal representation of objects using heuristics (e.g., Biederman, 1987). As detailed above, on this theory, the visual system encodes specific features of the proximal stimulus that are best suited for making inferences about the distal object. This includes explicitly coding the relations between parts in order to support visual reasoning about objects (e.g., appreciating the similarities and differences of buckets and mugs as discussed above), and encoding parts in terms of non-accidental features that often include relations between features, such as symmetry, in order to infer their 3D distal shape from variable proximal 2D images. Just as predicted, humans are selectively sensitive to these deformations (changes in the relations between parts in Figure 1 and changes in symmetry in Figure 5), whereas CNNs treated these deformations no differently than others.

Of course, it is possible that training CNNs on a range of different tasks (especially tasks where the objective is to approximate the distal representation) or on tasks with objectives other than classification (e.g. unsupervised learning of image sequences (Parker & Serre, 2015), generative modelling (Kingma & Welling, 2013), or "self-supervised" learning (Grill et al., 2020)) may lead to shape representations that are more similar to those formed in human visual cortex. However, here we wanted to focus on CNNs trained to recognise objects through supervised learning for two reasons. Firstly, it has been argued that CNNs trained under these settings learn to classify objects based on human-like shape representations (Geirhos et al., 2018; Hermann et al., 2020; Kubilius et al., 2016). Secondly, these models have had the largest success in predicting neural representations in the human and primate visual system (Cadieu et al., 2014; Schrimpf et al., 2020; Yamins & DiCarlo, 2016), and it has been argued that there is a "strong correlation between a model's categorization performance and its ability to predict individual level IT neural unit response data" (Yamins et al., 2014). Our findings challenge the view that optimizing performance in a classification task can explain shape representations used during human shape perception.

Instead, these findings are well predicted by the classic structural description theory of object recognition that builds a distal representation of objects using heuristics (e.g., Biederman, 1987).

It is also possible that a different Deep Learning architecture may be more successful than CNNs at encoding objects based on relations between their parts. Indeed, previous research indicates that relational reasoning may require a more powerful architecture that can explicitly and separately represent (i) parts and relations, and (ii) their bindings (e.g., to distinguish whether the brick is above the cone or vice-versa; Doumas, Puebla, Martin, & Hummel, in press; Hummel, 2011; Hummel & Holyoak, 1997, 2003). Other Deep Learning architectures such as Capsule Networks (Sabour, Frosst, & Hinton, 2017), Transformers (Vaswani et al., 2017), LSTMs (Hochreiter & Schmidhuber, 1997) or Neural Turing Machines (Graves, Wayne, & Danihelka, 2014) may also provide the representational power necessary to represent structural descriptions. What is clear from our study is that learning to classify objects is not, in and of itself, sufficient for the emergence of human shape representations.

All code for generating the datasets and simulating the models, as well as participant data from Experiment 4, can be downloaded from: https://github.com/gammagit/distal

Appendix B

Figure B1
Confusion matrices for VGG-16 in Experiment 3.
Note. For any heat map, the category label along each row shows the ground truth - i.e., all test shapes used to obtain the heat map were obtained by distorting the basis shape from that category. The category label along the column shows the output category label assigned by the network. Therefore, in each row, the diagonal heat map shows the correct classifications, while the off-diagonal heat maps show how each deformation was misclassified.
In Experiment 3, we observed that performance decreased as a function of coordinate distance for most categories. However, in most simulations, we also observed that there was one category where performance was very high for most deformations, including large rotations. For example, in Figure 6, most categories show a large decrease in performance with increasing rotation of test images, except for Category 7 (both middle and bottom rows). To understand why this was the case, it is useful to look at the errors made by the network. Figure B1 shows confusion matrices for two models (VGG-16 pre-trained on ImageNet and Stylized-ImageNet, respectively). Each heat map shows the number of times an output category was chosen for all deformations of a given input category. This confusion matrix shows that both networks were prone to mis-classify large rotations from any category as belonging to Category 7 (note the large number of responses in the Category 7 column).
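The underlying analysis is a simple tally. A sketch follows (ours, not the published code), assuming a loader that yields deformed test images together with their source-category labels:

    import numpy as np
    import torch

    def confusion_matrix(model, test_loader, n_classes=7, device="cuda"):
        # counts[t, p]: how often a deformation of basis category t was
        # classified as category p (rows: ground truth, columns: output).
        counts = np.zeros((n_classes, n_classes), dtype=int)
        model.to(device).eval()
        with torch.no_grad():
            for images, labels in test_loader:
                preds = model(images.to(device)).argmax(dim=1).cpu()
                for t, p in zip(labels.tolist(), preds.tolist()):
                    counts[t, p] += 1
        return counts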

Figure C2
Performance of AlexNet on the test set for Experiment 1.
Note. Each panel shows accuracy on the Basis shapes as well as the two types of deformations: relational (Rel), which changes a categorical relation, and coordinate (Cood), which preserves all categorical relations.
Compare with performance of VGG-16 in Figure A1.

Figure C3
Cosine similarity for AlexNet, trained on diagnostic relations in Experiment 2
Note. Like the results for VGG-16 (compare with Figure 4), we see that the network learns to distinguish the Rel deformation from the Basis image for Set 1 (left column), when it has seen the specific deformation in the training set. But this sensitivity to the Rel deformation diminishes in Set 2 (middle column), when only one pair of trained shapes has a similar deformation, and is completely lost for Set 3 (right column), when the network has been trained on Rel deformations but the specific deformation tested is novel.

Figure C4
Classification performance for AlexNet trained on ImageNet in Experiment 3
Note. Each heatmap shows accuracy on test items for a particular category for AlexNet pre-trained on ImageNet and fine-tuned on the dataset in Figure 5. Each cell in the heatmap corresponds to a deformation that is a combination of relational (shear) and coordinate (rotation) transformations of the trained Basis shapes (see Figure 5(a)). The grid at the bottom shows the "confusion matrix" - each heatmap in the grid shows the proportion of responses predicted as the category along the column for a deformation with the basis shape taken from the category along the row. Like the results for VGG-16 (compare with Figure 6), we see that accuracy decreases as a function of the coordinate distance from the basis shape, rather than the relational distance.

Figure C5
Classification performance for AlexNet trained on Stylized-ImageNet in Experiment 3
Note. Each heatmap in the top row shows accuracy on test items for a particular category for AlexNet pre-trained on Stylized-ImageNet and fine-tuned on the dataset in Figure 5. The bottom panel shows the confusion matrix. See Figure C4 for explanation.

Figure C6
Cosine similarity for AlexNet in Experiment 3.
Note. Cosine similarity between internal representations for the Basis shapes and two deformations of the basis shape (dashed red squares in Figure 5(b)) from the polygons dataset at each convolutional and fully connected layer of AlexNet. The solid (red) line shows the average similarity between representations for a basis shape and its relational (shear) deformation, while the dashed (blue) line shows the average similarity between a basis shape and its coordinate (rotation) transformation. The hatched area shows the bounds on similarity, with the upper bound determined by the average similarity between two basis shapes from the same category and the lower bound determined by the average similarity between two basis shapes of different categories. Like the results for VGG-16 (compare with Figure 7), we observed that the network treated the relational (shear) deformation as being more similar to the basis shape than the coordinate (rotation) deformation.

Figure C9
Performance of AlexNet in Experiment 5, Simulation 2
(a) Accuracy landscape for the left-out category. (b) Accuracy for conditions in the human experiment.
Note. Performance of AlexNet fine-tuned on an augmented dataset where the basis shapes are not only randomly translated and scaled but also rotated. For six out of seven categories, the network is trained on all rotations ([0, 360°)). We then tested the network on the left-out category (Cat 3, highlighted with a red square in (a)) on untrained rotations and shears. However, we observed that despite being trained in this manner, accuracy degraded as a function of the coordinate deformation, rather than the relational deformation. (b) shows the performance of this network for deformations D1 and D2 used to test human participants.