The role of capacity constraints in Convolutional Neural Networks for learning random versus natural data

Convolutional neural networks (CNNs) are often described as promising models of human vision, yet they show many differences from human abilities. We focus on a superhuman capacity of top-performing CNNs, namely, their ability to learn very large datasets of random patterns. We verify that human learning on such tasks is extremely limited, even with few stimuli. We argue that the performance difference is due to CNNs' overcapacity and introduce biologically inspired mechanisms to constrain it, while retaining the good test-set generalisation to structured images that is characteristic of CNNs. We investigate the efficacy of adding noise to hidden units' activations, restricting early convolutional layers with a bottleneck, and using a bounded activation function. Internal noise was the most potent intervention and the only one which, by itself, could reduce random data performance in the tested models to chance levels. We also investigated whether networks with biologically inspired capacity constraints show improved generalisation to out-of-distribution stimuli; however, little benefit was observed. Our results suggest that constraining networks with biologically motivated mechanisms paves the way for closer correspondence between network and human performance, but the few manipulations we have tested are only a small step towards that goal.

images (due to human capacity limitations). This procedure involved categorising either random pixel images or images with shuffled labels. In both conditions the categories lacked any category-defining features. The label assignment differed across participants (e.g., the labels given to Participant 1 were not the same as those given to Participant 2).
were then upscaled (without smoothing) to 224 by 224 pixels. Some examples of these images are shown in Figure 1, dataset I.
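The upscaling-without-smoothing step can be sketched with nearest-neighbour block replication. This is an illustration, not the paper's code: the base resolution (16 by 16 here) is a hypothetical choice, since the excerpt does not restate the original stimulus size.

```python
import numpy as np

def make_random_pixel_stimulus(rng, base=16, target=224):
    """Draw an RGB image of i.i.d. random pixels at a small base
    resolution, then upscale without smoothing (nearest neighbour,
    via block replication). The base size is an assumption."""
    img = rng.integers(0, 256, size=(base, base, 3), dtype=np.uint8)
    scale = target // base  # 224 // 16 = 14
    up = np.repeat(np.repeat(img, scale, axis=0), scale, axis=1)
    return up

rng = np.random.default_rng(0)
stim = make_random_pixel_stimulus(rng)
print(stim.shape)  # (224, 224, 3)
```

Because no smoothing is applied, each base pixel becomes a constant 14 by 14 block in the upscaled stimulus.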

Participants were instructed to indicate the categories by pressing either 'x' or 'z' on their keyboard. On each trial, a fixation cross was presented on the screen for 250ms. Afterwards, each stimulus was displayed for 300ms. All stimuli were presented on a neutral grey background. Trials were separated by a presentation of an empty neutral grey screen for 500ms. Responses within the first 150ms of presentation were considered too quick, and a warning message was displayed to participants to pay more attention to the stimulus before responding. These trials were not considered for analysis. Similarly, we took responses longer than 3000ms as evidence that participants were not attending, and these trials were also excluded.

Some participants reported relying on global heuristics (e.g., "one had more red overall"), rather than paying attention to individual pixels. Although these heuristics allowed some participants to achieve above-chance performance in a 2-way categorisation task involving a small dataset of 20 images,

the type of data being learned. That is, we did not find pre-training to be preferentially beneficial to the speed of learning the CIFAR-10 dataset compared to the random pixels dataset.
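The exclusion rule described above (responses faster than 150ms or slower than 3000ms are dropped) can be sketched as a simple filter; the trial representation as (reaction time, response) pairs is ours, not the paper's.

```python
def filter_trials(trials, min_rt_ms=150, max_rt_ms=3000):
    """Keep only trials whose reaction time falls inside the
    attention window described in the procedure; `trials` is a
    list of (rt_ms, response) pairs."""
    return [(rt, resp) for rt, resp in trials if min_rt_ms <= rt <= max_rt_ms]

trials = [(120, 'x'), (450, 'z'), (2999, 'x'), (3500, 'z')]
print(filter_trials(trials))  # [(450, 'z'), (2999, 'x')]
```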

These results do not corroborate the hypothesis that experience with natural images restricts the network's ability to learn random data. Indeed, the networks pre-trained on ImageNet learned the random dataset faster than those with randomly initialised weights.

Next, we investigated whether introducing biologically-inspired constraints would limit the capacity of CNNs to learn random data. We experimented with three such constraints (see Figure 4).

all data types. Subsequently, we used 100 epochs as a boundary in further runs. We used a batch size of 128 (39,000 steps).

with new random pixel images and a shuffled-labels permutation for every seed.
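The three constraints named in the abstract (internal noise on hidden activations, a feature-map bottleneck, and a bounded activation function) can be illustrated with minimal numpy sketches. These are not the paper's implementation; the channel layout and parameter values shown are assumptions for illustration.

```python
import numpy as np

def bounded_activation(x):
    """Bounded nonlinearity (a sigmoid here) in place of an
    unbounded ReLU, capping how strongly units can respond."""
    return 1.0 / (1.0 + np.exp(-x))

def internal_noise(h, sigma, rng):
    """Additive Gaussian noise on hidden-unit activations.
    sigma = 0 recovers the unconstrained network."""
    return h + rng.normal(0.0, sigma, size=h.shape)

def bottleneck(feature_maps, k=2):
    """Restrict an early convolutional layer to k feature maps.
    Here we simply keep the first k channels of a
    (batch, channels, H, W) array; in a real network the layer
    would be built with k output channels from the start."""
    return feature_maps[:, :k]
```

In a trained network the noise would be applied on every forward pass, so the network cannot rely on fine-grained activation patterns to memorise individual noise images.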

The same hyperparameters were used for all datasets and seeds.

impact on classifying CIFAR-10 images. We also observed that the bottleneck manipulation was not sufficient to prevent learning of unstructured data on its own, although its effect interacted with the degree of internal noise and the activation function.

While introducing such constraints is a step in the right direction, we have not identified a condition that allows us to dissociate the ability to learn these two forms of random data within the existing CNN framework.

noise that was not seen during training. In the first set, we added uniform (zero-

CNNs when modelling human vision in order to provide better alignment between their performance characteristics.
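Perturbing test images with zero-mean uniform noise unseen during training can be sketched as follows. This assumes images scaled to [0, 1]; the noise widths actually used are not restated in this excerpt.

```python
import numpy as np

def add_uniform_noise(images, width, rng):
    """Perturb test images with zero-mean uniform noise of a
    given total width, clipping back to the valid [0, 1] range.
    The width parameter is an illustrative knob, not a value
    from the paper."""
    noise = rng.uniform(-width / 2, width / 2, size=images.shape)
    return np.clip(images + noise, 0.0, 1.0)
```

Evaluating a trained network on such perturbed images, at several widths, gives the out-of-distribution robustness curve this kind of test produces.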

To summarise, we report five main findings. First, in two behavioural experiments we confirm that humans perform far worse when learning random data compared to standard CNNs (Figure 2). In addition, we observed that humans are able to learn (a small set of) images with randomly assigned labels, but are mostly unable to learn images of random pixels. This behaviour contrasted with CNNs, which learn random pixels either more easily (Zhang et al., 2017) or, as in our simulations, learn the two types of random data at similar rates, with random pixels slightly easier than shuffled labels in some conditions and slightly harder in others.
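The two random-data conditions compared here can be sketched as follows. This is a numpy sketch under our own assumptions about data layout, not the paper's code: the random-pixel condition pairs noise images with arbitrary labels, while the shuffled-label condition keeps real images but permutes the label vector.

```python
import numpy as np

def random_pixels_dataset(n, shape, n_classes, rng):
    """Random-pixel condition: images are i.i.d. noise,
    labels are assigned at random."""
    x = rng.random((n, *shape))
    y = rng.integers(0, n_classes, size=n)
    return x, y

def shuffled_labels_dataset(x, y, rng):
    """Shuffled-label condition: keep the natural images but
    permute the label vector, destroying any category-defining
    features while preserving the label distribution."""
    return x, rng.permutation(y)
```

In both conditions a learner can only succeed by memorisation, which is why performance on them indexes network capacity rather than generalisation.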

Second, we found that exposing CNNs to structured data (pre-training on ImageNet) did not prevent them from learning random data. Surprisingly, pre-trained networks learned the random data faster than randomly initialised ones.

Fourth, we show that these biological constraints worked by reducing network capacity rather than through some alternative mechanism. That is, we showed that CNNs with these constraints could still learn noise patterns, but far fewer of them (Figure 5, Bottom). This is the pattern of results to be expected if the manipulation interferes with capacity.
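The capacity probe implied by this result, training on progressively larger sets of noise patterns and recording the largest set the network still fits, can be sketched as a simple search loop. Here `fits(n)` is a stand-in for a full training run to criterion; its name and signature are ours, not the paper's.

```python
def max_memorised(fits, sizes):
    """Return the largest dataset size in `sizes` (ascending)
    that the model can still fit to criterion. `fits(n)` is
    assumed to train the network on n random patterns and
    report whether training accuracy reaches criterion."""
    best = 0
    for n in sizes:
        if fits(n):
            best = n
        else:
            break
    return best

# Toy stand-in: a "network" whose capacity is 5000 patterns.
print(max_memorised(lambda n: n <= 5000, [100, 1000, 5000, 10000]))  # 5000
```

Under this probe, a constrained network shows a smaller maximum than an unconstrained one, which is the signature of reduced capacity rather than an outright inability to memorise.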

Finally, we did not find that the pre-training and the biological constraints improved generalisation to out-of-distribution stimuli.

In this section we address the concern about whether insufficient training can explain our results from Section 3.

MobileNet | Appendix E | 128 | 100 | 1 × 10⁻² | 0.05 | 0.9

where there is also a more pronounced drop-off from 2 to 1 feature map. We therefore decided to use a bottleneck of 2 feature maps for all further small-inception simulations.

We used a similar method to establish the size of the bottleneck for the small-alexnet

This effect is even larger for networks using Sigmoid activations. Therefore, a bottleneck of 2 feature maps was selected as the optimal size.
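The bottleneck-size selection described above, choosing the smallest width whose CIFAR-10 accuracy has not yet dropped off sharply, can be sketched as follows. The accuracy numbers and the tolerance are illustrative assumptions, not the paper's measurements.

```python
def choose_bottleneck(accuracy_by_width):
    """Pick the smallest bottleneck width whose accuracy stays
    within a tolerance of the widest setting. The mapping
    width -> accuracy would come from the sweep of training
    runs; the tolerance here is a hypothetical choice."""
    widths = sorted(accuracy_by_width)
    reference = accuracy_by_width[widths[-1]]
    tol = 0.02  # hypothetical acceptable accuracy drop
    for w in widths:
        if reference - accuracy_by_width[w] <= tol:
            return w
    return widths[-1]

# Illustrative sweep: a sharp drop-off from 2 to 1 feature map.
accs = {1: 0.55, 2: 0.71, 4: 0.72, 8: 0.72}
print(choose_bottleneck(accs))  # 2
```

With a drop-off pattern like the one above, the rule lands on 2 feature maps, matching the choice reported for the small-inception and small-alexnet networks.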