Deep predictive coding accounts for emergence of complex neural response properties along the visual cortical hierarchy

Predictive coding provides a computational paradigm for modelling perceptual processing as the construction of representations accounting for causes of sensory inputs. Here, we develop a scalable, deep predictive coding network that is trained using a Hebbian learning rule. Without a priori constraints that would force model neurons to respond like biological neurons, the model exhibits properties similar to those reported in experimental studies. We analyze low- and high-level properties such as orientation selectivity, object selectivity and sparseness of neuronal populations in the model. As reported experimentally, image selectivity increases systematically across ascending areas in the model hierarchy. A further emergent network property is that representations for different object classes become more distinguishable from lower to higher areas. Thus, deep predictive coding networks can be effectively trained using biologically plausible principles and exhibit emergent properties that have been experimentally identified along the visual cortical hierarchy.

Significance Statement

Understanding brain mechanisms of perception requires a computational approach based on neurobiological principles. Many deep learning architectures are trained by supervised learning from large sets of labeled data, whereas biological brains must learn from unlabeled sensory inputs. We developed a Predictive Coding methodology for building scalable networks that mimic deep sensory cortical hierarchies, perform inference on the causes of sensory inputs and are trained by unsupervised, Hebbian learning. The network models are well-behaved in that they faithfully reproduce visual images based on high-level, latent representations. When ascending the sensory hierarchy, we find increasing image selectivity, sparseness and generalizability for object classification.
These models show how a complex neuronal phenomenology emerges from biologically plausible, deep networks for unsupervised perceptual representation.

synapses. Second, learning in these models was required to be based on neurobiological principles, which led us to use unsupervised, Hebbian learning instead of back-propagation (Rumelhart et al., 1986) or other AI training methods (Lillicrap et al., 2016; Salimans et al., 2017) incompatible with physiological principles. Third, we aimed to investigate which neuronal response properties evolve emergently in both low- and high-level areas, i.e. without being explicitly imposed a priori by network design constraints.
We paid attention both to low-level visual cortical properties such as orientation selectivity (Hubel and Wiesel, 1961) and to high-level properties such as selectivity for whole images or objects found in, e.g., inferotemporal cortex (Desimone et al., 1984; Gross et al., 1972; Perrett et al., 1985).

It is known that receptive field (RF) size increases from low- to high-level areas in the ventral stream (V1, V2, V4 and inferotemporal cortex (IT)) of the visual system (Kobatake and Tanaka, 1994). To incorporate this characteristic, neurons in the lowermost area of our network (e.g. V1) respond to a small region of visual space. Similarly, neurons in the next area (e.g. secondary visual cortex (V2)) are recurrently connected to a small number of neurons in V1 so that their small RFs jointly represent the larger RF of a V2 neuron. This architectural property is used in all areas of the network, resulting in a model with increasing RF size from lower-level to higher-level areas. Furthermore, there can be multiple neurons in each area having identical RFs (i.e., neurons that respond to the same region in visual space). This property is commonly associated with neurons within cortical microcolumns (Jones, 2000).
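The growth of RF size across stacked areas can be illustrated with a short sketch. This is not the authors' code; `fan_in` (neurons pooled per higher-level neuron) and `stride` (spacing between consecutive RFs) are hypothetical parameters chosen only for illustration.

```python
# Sketch: effective receptive-field (RF) size per area when each neuron in
# area l+1 pools over `fan_in` neurons of area l, with RF centres spaced
# `stride` neurons apart (hypothetical values, not the paper's settings).
def effective_rf_sizes(fan_in, stride, n_areas, rf0=1):
    sizes = []
    rf, jump = rf0, 1            # `jump` = spacing of RF centres in input units
    for _ in range(n_areas):
        rf = rf + (fan_in - 1) * jump   # RF grows by (fan_in - 1) * spacing
        jump *= stride
        sizes.append(rf)
    return sizes

# Four stacked areas, each pooling 3 lower-level units with stride 2:
print(effective_rf_sizes(fan_in=3, stride=2, n_areas=4))  # [3, 7, 15, 31]
```

Even with small per-area fan-in, the effective RF measured in input units grows rapidly with depth, mirroring the V1-to-IT progression described above.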

The model variants described in this paper receive natural images in the RGB color model as sensory input, whose size is described by two dimensions representing the height and width of an image. Similarly, RFs of neurons in visual cortical areas extend horizontally as well as vertically. To simplify the explanation below, we will assume that the input to the network is one-dimensional and, correspondingly, that neurons in the model have receptive fields that can be expressed using a single dimension. Later, we will extend the description to two-dimensional sensory input. Figure 1 shows the architecture of the network. Consider a network with (N + 1) layers, numbered from 0 to N. Layers 1 to N in the network correspond to visual cortical areas; layer 1 represents the lowest area (e.g. primary visual cortex (V1)) and layer N the highest cortical area (e.g. area IT). Layer 0 presents sensory inputs to the network. Below, we will use the term "area" to refer to a distinct layer in the model, in line with the correspondence highlighted above. Each area is recurrently connected to the area below it. Information propagating from a lower-level to a higher-level area constitutes feedforward flow of information (also termed bottom-up input), and feedback (also known as top-down input) comprises information propagating in the other direction.

Conventionally, the term "receptive field" of a neuron describes a group of neurons that send afferent projections to this neuron. In other words, a receptive field characterizes the direction of connectivity between a group of neurons and a "reference" neuron. Here, the term receptive field is used to characterize the hierarchical location of a group of neurons with respect to a reference neuron. Specifically, the receptive field of a neuron represents a group of neurons in a lower-level area that are recurrently connected to that higher-level neuron. Similarly, the group of cells that receive projections from a given neuron represents the projective field of that neuron. In the current paper, the term "projective field" of a neuron describes a group of higher-level neurons that are recurrently connected to that lower-level neuron.
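The duality between receptive and projective fields can be made concrete with a small index-bookkeeping sketch. The sizes here (`k`, `stride`, population counts) are hypothetical, chosen only to illustrate the relationship; they are not the paper's parameters.

```python
# Sketch: in a 1-D hierarchy, higher-level population i has a receptive field
# covering k consecutive lower-level populations starting at i * stride.
# The projective field of lower-level population j is then the set of
# higher-level populations whose receptive fields contain j.
def receptive_fields(n_higher, k, stride):
    return {i: list(range(i * stride, i * stride + k)) for i in range(n_higher)}

def projective_fields(n_higher, k, stride, n_lower):
    pf = {j: [] for j in range(n_lower)}
    for i, rf in receptive_fields(n_higher, k, stride).items():
        for j in rf:
            pf[j].append(i)          # invert the RF map to get projective fields
    return pf

rf = receptive_fields(n_higher=4, k=3, stride=1)                # overlap k - 1 = 2
pf = projective_fields(n_higher=4, k=3, stride=1, n_lower=6)
print(rf[0])   # lower-level populations covered by higher-level population 0
print(pf[2])   # higher-level populations covering lower-level population 2
```

Note that interior lower-level populations appear in exactly `k` projective fields, whereas populations near the boundary of the represented space appear in fewer, which matters for the top-down error computation described later.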
which results in values that are positive or zero. To extend the architecture described above for handling natural images, the populations in each area can be visualized as a two-dimensional grid (Figure 1B). Here, each population has receptive fields that extend both horizontally and vertically.

152
Learning and Inference Rule

The learning rule presented in this section is inspired by the predictive coding approach of Rao and Ballard (1999). Each area of the model infers causes that are used to generate predictions about the causes inferred at the level below. These predictions are sent by a higher-level area to a lower-level area via feedback connections. The lower-level area computes an error in the received predictions, which is conveyed to the higher-level area via feedforward connections. These prediction errors are used to update the inferred causes, which constitutes the inference step of predictive coding, and also to build the brain's internal model of the external environment, which is termed the learning step.
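A minimal Rao-and-Ballard-style sketch of these two steps, for a single input and a single pair of areas, might look as follows. This is our illustrative code, not the paper's exact equations; the layer sizes, learning rates and random weights are all assumptions.

```python
import numpy as np

# Sketch: a lower area holds activity y (here, the sensory input) and a higher
# area holds causes x. The higher area predicts y via generative weights W
# (y_hat = W @ x). The prediction error drives both inference (update x by
# gradient descent on the error) and Hebbian-style learning (update W with a
# product of pre- and post-synaptic terms).
rng = np.random.default_rng(0)
n_in, n_causes = 16, 8
W = rng.standard_normal((n_in, n_causes)) / np.sqrt(n_in)
y = rng.random(n_in)                     # sensory input presented to layer 0

lr_x, lr_w = 0.05, 0.01                  # hypothetical learning rates
x = np.zeros(n_causes)                   # inferred causes, initialized at zero
for _ in range(300):
    err = y - W @ x                      # bottom-up prediction error
    x += lr_x * (W.T @ err)              # inference step: refine the causes
    W += lr_w * np.outer(err, x)         # learning step: Hebbian weight update

# Reconstruction error should now be well below its starting value mean(y**2).
print(float(np.mean((y - W @ x) ** 2)))
```

Both updates descend the same squared-error objective, so inference and learning cooperate: the causes explain the current input while the weights slowly absorb its statistics.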

Along with bottom-up errors, neurons in the l-th area also receive a top-down prediction from neurons in the (l+1)-th area. Due to an overlap of (k_{l+1} − 1) between two consecutive receptive fields in area (l+1), populations in the l-th area will be present in the projective fields of k_{l+1} populations in the (l+1)-th area (Figure 1A). Populations in the l-th area whose receptive fields are closer to the boundary of the visual space are an exception to this property, as these neurons will be present in the projective fields of fewer than k_{l+1} populations. Here, we will focus on the general case. A given population thus receives top-down predictions from the k_{l+1} populations in the (l+1)-th area whose projective fields contain it, and the error between the population's activity and each of these top-down predictions is computed in the l-th area (Figure 2), where the weighting coefficient of this error term was set to one for all models unless specified otherwise (see Discussion). In addition, we employ ℓ1-regularization, given by the sum of the absolute neuronal activities, to prevent high neuronal activities. The neuronal activity of a given population is estimated by performing gradient descent on the sum of these error terms.

Table 1. Hyperparameter settings used for training the network with and without receptive fields.
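One gradient step of this ℓ1-regularized activity update, for interior populations receiving k overlapping top-down predictions, can be sketched as follows. The array shapes, `alpha` and the learning rate are hypothetical, and boundary populations are ignored, as in the general case discussed above.

```python
import numpy as np

# Sketch: each of the n_low populations in area l receives k top-down
# predictions (one from each covering population in area l+1). The activity
# update descends the sum of squared top-down errors plus an l1 penalty
# alpha * |x| (hypothetical alpha), whose gradient is alpha * sign(x).
rng = np.random.default_rng(1)
k, n_low = 3, 8
x_low = rng.random(n_low)                   # current activities in area l
preds = rng.random((n_low, k))              # k top-down predictions per population

alpha, lr = 0.05, 0.1
errs = x_low[:, None] - preds               # one top-down error per predictor
grad = errs.sum(axis=1) + alpha * np.sign(x_low)
x_low = x_low - lr * grad                   # one gradient-descent step
print(x_low.shape)
```

The ℓ1 term pushes weakly driven populations toward zero, which is what later produces the sparseness effects analyzed in the Results.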
The size of the receptive field in the network with receptive fields is equal in both image dimensions. Note that the term receptive field (RF) has been used in this table in line with its conventional definition. For the network without RFs, the RF-size parameters for areas 1, 2, 3 and 4 are equal to the total number of neurons in each area.

Kurtosis is a statistical measure of the "tailedness" of a distribution. It is more sensitive to infrequent, extreme values than to frequent, moderate deviations from the mean.
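As an illustration of why kurtosis is a useful sparseness measure, a short sketch (our code, not the paper's analysis pipeline) compares a Gaussian response distribution with a heavier-tailed one:

```python
import numpy as np

# Excess kurtosis: fourth standardized moment minus 3, so that a Gaussian
# scores ~0. Heavy-tailed (sparse) response distributions score higher.
def excess_kurtosis(r):
    r = np.asarray(r, dtype=float)
    z = (r - r.mean()) / r.std()
    return float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(0)
dense = rng.normal(size=100_000)      # Gaussian: excess kurtosis near 0
sparse = rng.laplace(size=100_000)    # Laplace: heavier tails, near 3
print(excess_kurtosis(dense), excess_kurtosis(sparse))
```

A population whose neurons respond strongly to only a few stimuli, and weakly to the rest, produces a heavy-tailed response distribution and therefore a high kurtosis.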

Below we will first present results from the model without receptive fields. The aim of this first modelling effort was to examine if the network is well-behaved, in the sense that latent representations of causes generated in higher areas can be effectively used to regenerate the sensory input patterns in lower areas, as originally evoked by input images. Following this section we will continue with DHPC networks with receptive fields, because this type of model is better suited to examine response properties of neurons across the respective areas along the visual processing hierarchy.

For the DHPC networks without receptive fields, we used a model that was trained on one image set (the training set) to infer causes for a second image set (the test set) that was never presented to the network during training. The training set contains images of objects from two classes, i.e. airplanes and automobiles, and the test set consists of images of ten object classes, namely airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships and trucks. Note that images of airplanes and automobiles in the test set were different from images of these object classes in the training set. For a given stimulus in the test set, a separate reconstruction of this stimulus is obtained using the causes inferred from each area of the model. For a given area, the inferred causes transmit a prediction along the feedback pathways to the level below. This process is repeated throughout the hierarchy until a predicted sensory input is obtained at the lowest level.
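The top-down reconstruction pass can be sketched as chained generative predictions. This is an illustrative linear sketch with hypothetical layer sizes and random weights, not the trained model:

```python
import numpy as np

# Sketch: regenerate a layer-0 pattern from causes inferred in a higher area.
# Each area's causes predict the causes of the area below via that area's
# generative weights; chaining the predictions down to layer 0 yields the
# reconstruction of the sensory input.
rng = np.random.default_rng(0)
sizes = [32, 16, 8, 4]                        # layer 0 (input) ... layer 3 (top)
Ws = [rng.standard_normal((sizes[l], sizes[l + 1])) for l in range(3)]

top_causes = rng.standard_normal(sizes[-1])   # latent representation in top area
pred = top_causes
for W in reversed(Ws):                        # feedback pass: top -> ... -> layer 0
    pred = W @ pred                           # prediction for the area below
print(pred.shape)                             # reconstruction at the input layer
```

Starting the same loop from an intermediate area's causes yields the per-area reconstructions compared in this section.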

In experiments, sparseness has been compared across two brain regions at most, and our model suggests that results obtained from such studies may not generalize to other brain regions.

Regularization was also a factor that affected whether high-selectivity neurons or high-dynamic-range neurons contributed strongly towards sparseness in a given area (Figure 8). In the absence of regularization, sparseness in lower areas was determined by high-selectivity neurons, but in higher areas sparseness was determined by high-dynamic-range neurons (Figure S3). This can be attributed

With respect to their network, the specific advance of the current study is that it provides a methodology for building scalable, deep neural network models, e.g. to study neuronal properties of higher cortical areas. Spratling (2008) showed that predictive coding models can reproduce various effects associated with attention-like competition between spatial locations or stimulus features for processing. This study employed a network with two cortical regions, each having two to four neurons. A different study (Spratling, 2010) showed that predictive coding models can reproduce response properties of V1 neurons like orientation selectivity. These models consisted of a single cortical region corresponding to V1, and hence a top-down input was lacking. Both studies employed models with predefined synaptic strengths. In contrast, DHPC networks employ a Hebbian rule for adjusting synaptic strengths and estimating representations. They can be trained using images of essentially arbitrary dimensions. Further, DHPC networks not only showed basic properties like orientation selectivity at lower levels but simultaneously showed high stimulus selectivity and sparseness in higher areas, thus unifying these different phenomena in a single model.

Spratling (2012b) presented a predictive coding model in which synaptic strengths were adapted using rules that utilized locally available information. This study used models having one or two areas with specific, pre-set architectural parameters like receptive field size and size of image patches. Using predictive coding, Wacongne et al. (2012) showed that a network model trained to perform an oddball paradigm can reproduce different physiological properties associated with mismatch negativity. This study simulated a network architecture with two cortical columns, each of which had a pre-established selectivity for specific auditory tones.
Unlike these studies (Spratling,

Probably, the approach closest to our work is that of Lotter et al. (2017), who employed networks consisting of stacked modules. This network was specifically designed to predict the next frame in videos and was trained using end-to-end error backpropagation, which is unlikely to be realized in the brain. However, an interesting aspect of this model is the use of recurrent representational units, which allows the network to capture temporal dynamics of the input. This aspect will be an interesting direction of future research for the unsupervised Hebb-based models we proposed here.

shown in blue. For plotting conventions, see Figure S3. As a result of adding regularization to the top area, the contribution of high-dynamic-range neurons to sparseness is weakened in areas 2 and 3 (cf. Figure S3). This effect likely arises because regularization, by definition, reduces neuronal activity; via a top-down spreading effect this leads to lower dynamic ranges in areas 2 and 3. In turn, this reduces the contribution of high-dynamic-range neurons to sparseness in these areas.