Abstract
According to analysis-by-synthesis theories of perception, the primary visual cortex (V1) reconstructs visual stimuli through a top-down pathway, and higher-order cortex reconstructs V1 activity. Experiments have also found that neural representations are generated in a top-down cascade during visual imagination. What code does V1 provide for higher-order cortex to reconstruct or simulate so as to improve perception or imaginative creativity? What unsupervised learning principles shape V1 for reconstructing stimuli, such that the eigenspectrum of V1 activity is a power law with close-to-1 exponent? Using computational models, we reveal that reconstructing the activities of V1 complex cells facilitates higher-order cortex in forming representations smooth to the shape morphing of stimuli, improving perception and creativity. A power-law eigenspectrum with close-to-1 exponent results from the constraints of sparseness and temporal slowness when V1 reconstructs stimuli, at a sparseness strength that best whitens the V1 code and makes the exponent most insensitive to the slowness strength. Our results provide fresh insights into V1 computation.
Introduction
Analysis-by-synthesis is a long-standing perspective on perception, which proposes that instead of passively responding to external stimuli, the brain actively predicts and explains its sensations [1, 2]. Neural network models in this line use a generative network to reconstruct the external stimuli, and use the reconstruction error to guide the updating of synaptic weights [3, 4] or the neuronal dynamic states [5]. This analysis-by-synthesis perspective of perception has received experimental support in both the visual and auditory systems [2, 6]. Additionally, neural representations are generated in a top-down cascade during visual creative imagination [7], and these representations are highly similar to those activated during perception [8, 9]. These results suggest that the brain performs like a top-down generator in both perception and creativity tasks.
According to the generative perspective above, V1 plays two fundamental cognitive roles: (1) V1 provides code for higher-order cortex to reconstruct for perception or to simulate for creativity, and (2) V1 reconstructs the external stimuli for perception. In this paper, we would like to ask two questions regarding these two roles: (1) What code does V1 provide for higher-order cortex to reconstruct or simulate in order to improve perception or creativity? (2) What unsupervised learning principles shape the activity of V1 when V1 reconstructs stimuli, so that the statistics of the V1 code resemble experimental observations?
Since Hubel and Wiesel proposed their V1 circuit model in which complex cells receive input from simple cells [10] (see Ref. [11] for experimental validation), most modeling works have supposed complex cells to be the output of V1 (e.g., Refs. [12, 13]). Deep convolutional neural networks, which consist of alternately stacked convolution and pooling layers and can be regarded as an engineering realization of stacked Hubel-Wiesel circuits, have seen huge success in artificial intelligence [14]. It is believed that the computational function of complex cells is to establish a code invariant to local spatial translation, which benefits object recognition performed in downstream cortices [15, 12]. As far as we know, however, no studies address the computational function of complex cells from a top-down generative perspective.
Understanding the unsupervised learning principles that guide the development of V1 has been a fruitful research direction. It has been shown that sparse coding results in the Gabor-like receptive fields of simple cells [4], that temporal slowness may guide the formation of complex cells [16], and that temporal prediction can explain the spatio-temporal receptive fields of simple cells [17]. A recent experiment showed that the variance of the nth principal component of V1 activity decays as a power law n−α with α → 1+ [18], which poses a new challenge for computational explanation. This eigenspectrum may result from a compromise between efficient and smooth coding [18], but how this compromise can be realized through biologically plausible unsupervised learning principles in neural networks remains unknown.
In this paper, we addressed the above two problems by training neural network models. By modeling the visual system as a variational auto-encoder [19], we show that reconstructing the activities of complex cells facilitates higher-order cortex in forming representations smooth to the shape morphing of stimuli, thereby improving visual creativity and perception. Using a parsimonious generative model in which V1 continuously reconstructs the input temporal stimuli, we show that coding sparseness and temporal slowness together explain the power-law eigenspectrum of V1 activity. The close-to-1 exponent is realized at a sparseness strength that best whitens the V1 code and makes the exponent most insensitive to the slowness strength. Our results provide fresh insights into V1 computation.
Results
Understanding the top-down function of complex cells: toy models
Model setup
Our V1 circuit model is a two-layer network receiving from stimuli on a one-dimensional line (Fig. 1b, right panel). The feedforward connection W2 (magenta arrow in Fig. 1b, right panel) between the simple cell preferring stimulus at position Xsim and the complex cell preferring Xcom Gaussianly decays with |Xsim – Xcom|. Experiments showed that lateral inhibition sharpens the tuning of simple cells [20, 21, 22]. We model this tuning sharpening by a winner-take-all (WTA) effect, so that only the simple cell with the highest activity remains active, while all the others are silent.
(a) Schematic of the input model. Stimulus input (red) is a wavelet along a one-dimensional line. The feedforward connections W1 (green), as a function of the difference Xstim – Xsim between the preferences of the stimulus receptors and simple cells, share the same wavelet form as the stimuli. (b) Schematic of the three V1 models we study. Model 3 has both complex cells and simple cells with lateral inhibition (blue arrows), with the connection W2 from a simple cell to a complex cell Gaussianly decaying with the difference between their stimulus preferences. Model 1 has only simple cells. Model 2 also has only simple cells, but without lateral inhibition. (c) Examples of V1 output in Model 1 (blue), Model 2 (red) and Model 3 (black). (d) Schematic of the variational auto-encoder (VAE). The value z of the bottleneck (which models the prefrontal cortex) is sampled from a Gaussian distribution with μ and σ determined by the two output channels of the bottom-up pathway. This network is optimized so that the output x̂ of the top-down pathway is close to the input stimulus x, and at the same time N(μ, σ²) is close to N(0, 1).
To study the computational function of complex cells, we asked two questions: (1) what happens if the complex cells are removed, leaving only simple cells? (2) why is V1 functionally like a two-layer structure (with simple cells and complex cells), instead of a simple linear filter? To answer these questions, we compared three toy models: (Model 1) a single-layer model with lateral inhibition (which results in WTA) to model the simple-cell-only case (Fig. 1b, left panel); (Model 2) a single-layer model without lateral inhibition to model the case of a linear filter (Fig. 1b, middle panel); and (Model 3) a two-layer model with both simple and complex cells (Fig. 1b, right panel), which is the model we introduced above. In all three models, the stimuli are wavelets parameterized by their positions on a one-dimensional line, and the feedforward connection W1, as a function of Xstim – Xsim (where Xstim and Xsim are respectively the preferred stimulus positions of a stimulus receptor and a simple cell), also takes the shape of the wavelet (Fig. 1a). In response to a stimulus, the V1 output is a delta peak in Model 1, an oscillating wavelet in Model 2, and a Gaussian bump in Model 3 (Fig. 1c).
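The qualitative difference among the three models can be illustrated with a short numpy sketch; the wavelet and pooling parameters below are illustrative placeholders rather than the values used in our simulations:

```python
import numpy as np

def wavelet(d, sigma=1.0, k=3.0):
    # Gabor-like wavelet profile as a function of distance d (illustrative shape)
    return np.exp(-d**2 / (2 * sigma**2)) * np.cos(k * d)

x_grid = np.linspace(-10, 10, 201)   # preferred positions of simple cells
x_stim = 2.0                          # position of the stimulus

# Feedforward drive to simple cells: W1 shares the wavelet shape of the stimuli
drive = wavelet(x_grid - x_stim)

# Model 2: linear filter without lateral inhibition -> oscillating wavelet
out_model2 = drive

# Model 1: winner-take-all lateral inhibition -> a single delta peak
out_model1 = np.where(drive == drive.max(), drive.max(), 0.0)

# Model 3: complex cells pool the WTA simple cells through Gaussian weights W2
sigma_pool = 1.5
W2 = np.exp(-(x_grid[:, None] - x_grid[None, :])**2 / (2 * sigma_pool**2))
out_model3 = W2 @ out_model1          # Gaussian bump centered near x_stim
```

The three arrays reproduce the delta-peak, wavelet and Gaussian-bump outputs sketched in Fig. 1c.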
We model the visual system as a variational auto-encoder (VAE) [19], whose output is optimized to reconstruct the input, under the regularization that the state of the bottleneck layer is encouraged to distribute close to a standard normal distribution in response to any input (Fig. 1d). VAE mimics the visual system in the following four aspects: (1) the bottleneck layer z, which is low-dimensional, can be regarded as the prefrontal cortex, because experiments suggest that working memory is maintained in a low-dimensional subspace of the state of the prefrontal cortex [23, 24]; (2) the input x and output x̂ layers can be regarded as V1, with the x → z and z → x̂ connections (i.e., the encoder and decoder parts) representing the bottom-up and top-down pathways respectively; (3) stochasticity is a designed ingredient of VAE, analogous to neuronal stochasticity; and (4) after training, VAE can generate new outputs through the top-down pathway by sampling states of the bottleneck layer from a Gaussian distribution, consistent with the experimental finding that V1 participates in mental imagination of new images [9]. This fourth point enables us to study the creativity of the visual system using VAE.
Visual creativity
To study the function of complex cells in visual creativity, we trained VAEs to generate the output patterns of V1 in the three models above (Fig. 1b, c), and compared the quality of the generated patterns in the three models. After training, we investigated how the generated pattern changed with the bottleneck state z, which is one-dimensional in our model. In Model 1, sharp peaks are never generated by the VAE, suggesting the failure of pattern generation (Fig. 2a). In Model 2, abrupt changes of the peaking positions sometimes happen with the continuous change of z (Fig. 2b). In Model 3, Gaussian peaks are generated, with the peaking position smoothly changing with z (Fig. 2c).
(a) Upper panel: generated activity patterns as a function of the state z of the bottleneck variable. Lower panel: blue and red curves respectively represent the two generated patterns when z takes the two values (−1 and 1) indicated by the red and blue arrows in the upper panel. VAE is trained using the output of Model 1. (b, c) Similar to panel a, except that VAE is trained using the output of Model 2 (panel b) or Model 3 (panel c). (d) We let the input of VAE (i.e., x in Fig. 1d) be the output of Model-in, but train the output of VAE (i.e., x̂ in Fig. 1d) to approach the output of Model-out. Both Model-in and Model-out can be Model 1, 2 or 3. r2 quantifies how well a pattern generated by VAE resembles an output pattern of Model-out. We do not accurately show the value of r2 when r2 < 0. Arrows indicate the cases when Model-in equals Model-out, which are the cases in panels a-c and e. (e) Euclidean distance d(s(x), s(y)) between two output patterns s as a function of the distance |x – y| between the positions x and y of two stimuli, normalized by the maximal value of d, for Models 1 (blue), 2 (green) and 3 (red). The bottleneck state of VAE has only 1 dimension.
To quantify the quality of the generated patterns, we defined an index r2 that measures how well a pattern generated by VAE resembles an output pattern of the model used to train the VAE (see Methods). We found that r2 is maximal for Model 3, intermediate for Model 2, but small (even negative) for Model 1 (see the bars indicated by arrows in Fig. 2d). Therefore, the VAE trained by Model 3 learns to generate patterns resembling the output of Model 3. This advantage of Model 3 is closely related to the smooth change of the generated bumps with z during shape morphing (i.e., translational movement) (Fig. 2c), because the generation quality is bad around the abrupt change points (blue curve in Fig. 2b, lower panel).
In the study above, we input the output pattern of Model a (a = 1, 2, 3) to VAE and trained VAE to construct the same pattern in the output. Now we ask whether the advantage of Model 3 results from a ‘good’ code that higher-order cortex receives from V1, or from a ‘good’ code that higher-order cortex is to reconstruct through the top-down pathway. To answer this question, we input the output of one model (Model-in) to VAE, but trained VAE to construct the output of another model (Model-out). We found that the quality of the generated images strongly depended on Model-out, but hardly on Model-in (Fig. 2d). Therefore, the advantage of complex cells is a top-down effect, and cannot be understood from a bottom-up perspective.
To understand the reason for the advantage of Model 3, we then studied the Euclidean distance d(s(x), s(y)) between the output patterns s as a function of |x – y|, where x and y are the positions of two stimuli. In Model 1, d(s(x), s(y)) sharply jumps from 0 to a constant value at |x – y| = 0; in Model 2, d(s(x), s(y)) is not monotonic; and in Model 3, d(s(x), s(y)) monotonically and gradually increases with |x – y| (Fig. 2e). This property of Model 3 may be important for its advantage in top-down generation. To see this, suppose two nearby bottleneck states z1 and z2 (z1 ≈ z2) generate two output patterns s1 and s2, which correspond to two stimuli at positions x1 and x2 respectively. For simplicity, we constrain that s1 is fixed during training, and that s2 changes within the manifold {s(x)}x. By introducing stochastic sampling during training, VAE encourages s2 to be closer to s1, so that x2 gets closer to x1, which means that the bottleneck state represents the spatial topology (i.e., the translational movement) of the stimuli. In Model 3, this can be realized using the gradient ∂d(s1, s2)/∂x2. In Model 2, d(s1, s2) is not monotonic with respect to x2, so following ∂d(s1, s2)/∂x2 does not always lead s2 closer to s1, but sometimes farther away. In Model 1, ∂d(s1, s2)/∂x2 remains zero when s1 ≠ s2, so it cannot provide clues for the updating of s2.
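The three distance profiles of Fig. 2e can be reproduced qualitatively from the toy models; the parameters below are again illustrative:

```python
import numpy as np

def wavelet(d, sigma=1.0, k=3.0):
    # Gabor-like wavelet profile (illustrative shape)
    return np.exp(-d**2 / (2 * sigma**2)) * np.cos(k * d)

x_grid = np.linspace(-20, 20, 401)
sigma_pool = 1.5
W2 = np.exp(-(x_grid[:, None] - x_grid[None, :])**2 / (2 * sigma_pool**2))

def outputs(x_stim):
    drive = wavelet(x_grid - x_stim)                 # Model 2 output
    wta = np.where(drive == drive.max(), 1.0, 0.0)   # Model 1 output (WTA)
    bump = W2 @ wta                                  # Model 3 output
    return wta, drive, bump

s_ref = outputs(0.0)
seps = np.arange(0.0, 5.0, 0.5)
d = np.array([[np.linalg.norm(a - b) for a, b in zip(s_ref, outputs(sep))]
              for sep in seps])   # columns: Models 1, 2, 3

# Column 0 (Model 1) jumps to a constant once the peaks no longer coincide;
# column 1 (Model 2) is non-monotonic; column 2 (Model 3) grows gradually.
```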
Visual perception
According to the predictive coding theory of perception [5], higher-order cortex adjusts its state using the error of its reconstruction of the activity of lower-order cortex. In our model, perception is performed by adjusting the bottleneck state z to minimize the error between the pattern generated by the decoder of the trained VAE and the target pattern (Fig. 3a).
(a) Schematic of the perception model. Only the decoder of VAE after training is used. The bottleneck state z is updated by gradient descent to minimize the error between the generated pattern (red) and the target pattern (blue). (b) R1² for the VAEs trained by the three models (Fig. 1b). (c) R2² as a function of σperturb for the three models. Error bars represent s.e.m..
Two conditions are required for good perception performance: (1) there exists a state z0 at which the generated pattern well resembles the complex-cell pattern; and (2) the representation of the complex-cell patterns in the bottleneck state should be ‘smooth’, so that starting from a state z1 near z0, the optimal state z0 can be reached by error updating, in our model through the gradient descent algorithm (Fig. 3a).
Guided by these two understandings, we studied the perception performance of a VAE trained by Model a (a = 1, 2, 3) in two steps. First, we set the target pattern (i.e., the blue curve in Fig. 3a) to be a complex-cell pattern of Model a, and set the initial bottleneck state to be the value of the μ-channel (see Fig. 1d) of the VAE encoder, which is the optimal bottleneck state estimated by the encoder; then we updated the bottleneck state to minimize the error between the target pattern and the decoder output. We denote the bottleneck state at the end of this updating as z0. Second, we set the target pattern to be the decoder output when the bottleneck state took z0, set the initial bottleneck state to be z0 + ϵ, with ϵ being Gaussian noise with standard deviation σperturb, and updated the bottleneck state again.
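The two-step procedure reduces to gradient descent on the bottleneck state. Below is a minimal sketch, with an idealized one-dimensional decoder standing in for the trained VAE decoder; the decoder shape, learning rate and step count are assumptions of this sketch:

```python
import numpy as np

x_grid = np.linspace(-10, 10, 201)

def decoder(z, sigma=1.5):
    # Idealized top-down decoder: the bottleneck state z sets the bump position
    return np.exp(-(x_grid - z)**2 / (2 * sigma**2))

def perceive(target, z_init, lr=0.02, n_steps=300, eps=1e-4):
    # Update z by gradient descent on the reconstruction error (Fig. 3a),
    # here with a finite-difference estimate of the gradient
    z = z_init
    err = lambda zz: np.sum((decoder(zz) - target)**2)
    for _ in range(n_steps):
        grad = (err(z + eps) - err(z - eps)) / (2 * eps)
        z -= lr * grad
    return z

target = decoder(1.0)                      # a complex-cell-like target pattern
z_perceived = perceive(target, z_init=0.0) # converges toward z = 1
```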
We quantified the reconstruction quality in the above two steps using R1² and R2², respectively representing the ratio of variance of the target pattern explained by the decoder output at the end of each step. R1² quantifies how optimal a bottleneck state can be for reconstructing a complex-cell pattern. R2² quantifies how smooth the representation of the output in the bottleneck state is, so that an optimal bottleneck state can be easily found by error updating. We found that Model 2 and Model 3 correspond to high R1² values (Fig. 3b), and that R2² for Model 3 is higher than that for Model 2 (Fig. 3c). These results suggest the advantage of complex cells for visual perception. The low R2² value for Model 2 may be closely related to the fragmentary representation in the bottleneck state (Fig. 2b, upper panel).
Understanding the top-down function of complex cells: skeleton MNIST dataset
Model setup
We then studied the advantage of complex cells for top-down image generation using the skeleton MNIST dataset [25]. Images in this dataset represent digits using lines of 1-pixel width (Fig. 4a, second column). We used this dataset (denoted as dataset D1 below) to replace the output of Model 1 (Fig. 1b, left) in the 2-dimensional-image case, because a 1-pixel-width line is analogous to the activities of simple cells under local winner-take-all inhibition along the direction perpendicular to the line (compare the second column of Fig. 4a with the blue peak in Fig. 1c). Biologically, simple cells are selective to the orientation of bars. Our model does not have the ingredient of orientation selectivity, but uses images of thin lines, mimicking the enhanced representation of contours in V1 due to collinear facilitation [26]. Our model also shares a property with simple cells whose orientation tuning is sharpened by lateral inhibition [20]: the representations of two parallel bars hardly overlap with each other. To replace Model 2 (Fig. 1b, middle) in the 2-dimensional case, we set the pixel intensities to oscillatorily decay along the direction perpendicular to a line in the skeleton MNIST images (dataset D2, Fig. 4a, third column). To replace Model 3 (Fig. 1b, right), we blurred the skeleton images so that pixel intensities Gaussianly decayed along the direction perpendicular to a line (dataset D3, Fig. 4a, fourth column). We trained VAE using the three datasets, and compared the quality of the generated images (see Methods).
(a) Upper panels: example images in the MNIST, skeleton MNIST (D1), D2 and D3 datasets. Lower panels: pixel intensity along the blue bar in the upper panel as a function of the distance from point A (i.e., the middle point of the bar). (b) An example manifesting the post-processing of an image generated by VAE. (c) Upper panels: examples of the generated images when using datasets D1, D2 and D3 to train VAE. Lower panels: the generated images after post-processing. Note that D3 results in better image quality than D1 and D2. (d) Mean conditional entropy ⟨H(p(x|I))⟩ as a function of the binarization threshold θthres when the parameter λKL in VAE (see Methods) takes different values, when VAE is trained using D1, D2 and D3 respectively. (e, f) Similar to panel d, but for the marginal entropy H(p(x)) and the intrinsic dimensionality Dincat. (g) Examples of the generated images when the bottleneck state continuously changes its value; see more examples in Supplementary Fig. 1. In panels d, e, f, error bars represent s.e.m.. The bottleneck state of VAE has 20 dimensions.
Visual creativity
The images generated by VAE trained by different datasets have different styles (Fig. 4c, upper panels). To fairly compare the quality of the generated images by the three datasets, we post-processed the generated images by first binarizing the images using a threshold θthres, such that pixels with intensities larger (or smaller) than θthres were set to 1 (or 0), then skeletonizing the images (see Methods), resulting in binary images with lines of 1-pixel width (Fig. 4b and lower panels of c), similar to the images in the skeleton MNIST dataset.
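The post-processing can be sketched as follows; the binarization step is shown in numpy, and the subsequent thinning to 1-pixel-width lines could be done with an off-the-shelf routine such as scikit-image's `skimage.morphology.skeletonize` (the toy image below is an assumption of this sketch):

```python
import numpy as np

def binarize(img, theta_thres):
    # Pixels with intensity above theta_thres -> 1, all others -> 0
    return (img > theta_thres).astype(np.uint8)

# Toy "generated image": a Gaussian-blurred vertical line on a 28 x 28 canvas
xx, yy = np.meshgrid(np.arange(28), np.arange(28))
img = np.exp(-(xx - 14.0)**2 / (2 * 1.5**2))

binary = binarize(img, theta_thres=0.5)
# Skeletonization (see Methods) would then thin each line in `binary`
# down to 1-pixel width, matching the style of the skeleton MNIST images.
```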
People have proposed that creativity is a mental process of generating worthwhile and novel ideas [27]. Inspired by this definition, we propose that good generation of skeleton-MNIST images should satisfy three conditions. (1) Realisticity: a generated image should look like a digit in the skeleton MNIST dataset. (2) Cross-category variety: the numbers of generated images looking like different digits should be almost the same. In other words, it is not good if the VAE can only generate images looking like the same digit. (3) Within-category variety: the shapes of the images of the same digit should be diverse. To quantify the image-generation quality, we trained a neural network to classify the skeleton MNIST dataset, so that the network output a label distribution p(x|I) (x = 0, 1, ···, 9) after receiving a post-processed generated image I (see Methods). Realisticity requires that p(x|I) have low entropy for each I [28]. Cross-category variety requires that the marginal p(x) = ⟨p(x|I)⟩_{I∈A} have high entropy H(p(x)), with A being the set of all post-processed generated images [28]. To quantify within-category variety, we calculated the intrinsic dimensionality Dincat = (Σi λi)² / Σi λi² [29], where {λi}i is the eigenspectrum of the set A0 of post-processed generated images belonging to the same category. Dincat takes its maximal value if all principal components (PCs) of the set A0 have equal variance, and a small value if a single PC dominates.
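The three indexes can be computed from the classifier outputs and the generated images as in the following sketch (hypothetical helper functions; natural-logarithm entropy is an assumption of this sketch):

```python
import numpy as np

def entropy(p, eps=1e-12):
    # Shannon entropy (in nats) of a probability vector p
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + eps))

def realisticity_and_variety(label_dists):
    # label_dists: (n_images, 10) array, row i holding the classifier
    # output p(x | I_i) for post-processed generated image I_i
    cond_H = np.array([entropy(p) for p in label_dists])  # low -> realistic
    marginal = label_dists.mean(axis=0)                   # p(x) over the set A
    return cond_H.mean(), entropy(marginal)               # high -> cross-category variety

def intrinsic_dimensionality(patterns):
    # Participation-ratio dimensionality (sum_i lam_i)^2 / sum_i lam_i^2 of the
    # eigenspectrum of same-category images A0: maximal when all PCs have
    # equal variance, small when a single PC dominates
    X = patterns - patterns.mean(axis=0)
    lam = np.clip(np.linalg.eigvalsh(np.cov(X.T)), 0.0, None)
    return lam.sum()**2 / np.sum(lam**2)
```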
We investigated ⟨H(p(x|I))⟩, H(p(x)) and Dincat with the change of the binarization threshold θthres (Fig. 4b) and a parameter λKL in VAE which controls the regularization strength onto the distribution of the bottleneck state variable (see Methods). We found that VAEs trained by dataset D3 generated images with better eye-looking quality than VAEs trained by datasets D1 and D2 (Fig. 4c), consistent with the quantitative indication of smaller ⟨H(p(x|I))⟩, larger H(p(x)) and larger Dincat over a large range of the parameters θthres and λKL (Fig. 4d-f). These results suggest that complex cells facilitate the visual system to generate diverse realistic-looking images, supporting the functional role of complex cells in visual creativity.
Similar to the 1-dimensional-image case (Fig. 2a-c), we also investigated the series of generated images when the bottleneck state z was continuously changed (Fig. 4g). The images generated by the VAE trained by D3 have two advantages compared with those resulting from D1 and D2: (1) when the change of z is not large, so that the generated images look like the same digit, the images generated by D3 undergo smoother and more flexible shape morphing, whereas the images generated by D1 and D2 are more rigid; (2) when the change of z is large, so that the generated images experience digit transitions, the images generated by D3 look more realistic during the transition period, whereas the images generated by D1 and D2 are more unrecognizable during the transition period (see Supplementary Fig. 1 for more examples). This investigation provides insight into the facilitation of creativity by complex cells.
Model 1 in Fig. 1 can hardly generate realistic-looking samples (Fig. 2a); the generated digits trained by D1, although worse in quality than those trained by D3, are sometimes recognizable (Fig. 4c). This is possibly because in the 1-dimensional case, there is no delta peak p3 that can interpolate two delta peaks p1 and p2 such that d(p1, p3) < d(p1, p2) and d(p2, p3) < d(p1, p2), where d(·, ·) is the Euclidean distance between the representations of two peaks; in the 2-dimensional case, however, we can draw many 1-pixel-width thin lines that interpolate two parallel thin lines. In both the 1- and 2-dimensional cases, the models corresponding to the complex cells perform best.
Visual perception
We also studied the perception performance of VAEs trained by the three datasets, using a scheme similar to that of the 1-dimensional case (Fig. 3a). We found that both R1² and R2² have the highest values for D3 (Fig. 5), supporting that complex cells facilitate perception.
(a) R1² for the VAEs trained by the three datasets. (b) Examples of the dataset images and the images reconstructed by VAE; note that D3 results in the best reconstruction quality. (c) R2² as a function of σperturb for the three datasets when λKL takes different values. Error bars represent s.e.m..
Sparseness and slowness together explain the power-law eigenspectrum of V1
Experiments found a power-law eigenspectrum of V1 activity with exponent close to 1 [18]. Here we would like to explain this phenomenon using a parsimonious generative model. Our working hypothesis is that V1 is trying to reconstruct the input temporal stimuli through a top-down pathway, keeping the activity of V1 as sparse and slowly changing as possible. Specifically, we minimized the following cost function:

E = Σt ‖It − Σi wi xi,t‖² + λsparse Σi,t |xi,t| + λslow Σi,t (xi,t − xi,t−1)²,    (1)

where It is the input stimulus at time t, xi,t is the activity of the ith neuron in V1 at time t, wi is the top-down reconstruction weight from neuron i, and λsparse and λslow respectively control the strengths of sparseness and temporal slowness. We generated temporal stimuli by sliding a small spatial window on static natural images, imitating the experimental protocol of Ref. [18], in which animals were viewing static natural images and the temporal change of stimuli resulted from head or eye movements.
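The cost function of eq. 1 can be evaluated directly; the array layout below (time along the first axis) is an assumption of this sketch, and the actual Methods may normalize the three terms differently:

```python
import numpy as np

def cost(I, x, w, lam_sparse, lam_slow):
    # I: (T, n_pix) input stimuli; x: (T, n_neur) V1 activities;
    # w: (n_neur, n_pix) top-down reconstruction weights (row i is w_i)
    recon_err = np.sum((I - x @ w) ** 2)             # reconstruction term
    sparse = lam_sparse * np.sum(np.abs(x))          # sparseness penalty
    slow = lam_slow * np.sum((x[1:] - x[:-1]) ** 2)  # temporal-slowness penalty
    return recon_err + sparse + slow
```

Minimizing this cost jointly over x and w (with ‖wi‖2 = 1) would implement the generative model described above.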
At suitable values of λsparse and λslow, the variance of the nth principal component (PC) of samples of {xi,t}i decays as a power law n−α with exponent α ≈ 1 (Fig. 6a), with successive PC dimensions encoding finer spatial features (Supplementary Fig. 4a). Similar to the result of Ref. [18], this power-law scaling is not inherited from the eigenspectrum of the input stimuli (Fig. 6b), because we partially whitened the natural images in the preprocessing step, modeling the function of the retina and lateral geniculate nucleus [30]. After training, wi exhibited Gabor-like shapes under the sparseness constraint (Fig. 6c), which implies Gabor-like receptive fields of the neurons [4]. When the slowness constraint is strong, the curvature of the temporal trajectory of the neuronal population activity {xt}t (where xt = {xi,t}i) can be smaller than the curvature of the temporal trajectory of the stimuli {It}t (Fig. 6d), consistent with experimental observations [13].
(a) Eigenspectrum when λsparse = 0.14, λslow = 0.4. The dashed black line denotes the linear fit of n−α. (b) Eigenspectrum of the partially whitened images used to train the model. Inset: σ1(x) quantifies the whiteness (i.e., equal variance along all PCs) in the subspace spanned by the first x PCs; smaller σ1 indicates more equal variances. (c) Examples of generation weights (wi in eq. 1) after training, when λsparse = 0.14, λslow = 0.4 (upper panels) and λsparse = 0, λslow = 0.4 (lower panels). (d) Curvature of the temporal trajectory of the V1 state as a function of λslow when λsparse = 0.14. The dashed red horizontal line indicates the curvature of the images used to train the model. The blue dot represents the value λslow = 0.4 used in panel a.
We then searched for the exponent α over a range of λsparse and λslow, and found that α is closest to 1 at an optimal sparseness strength (denoted λ*sparse below), at which α is also most insensitive to λslow when λslow ≠ 0 (Fig. 7a, b).
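The exponent α can be estimated by PCA followed by a linear fit in log-log coordinates; the synthetic-data check below (with PC variances built to decay exactly as n−1) is an illustration, not our simulation pipeline:

```python
import numpy as np

def eigenspectrum_exponent(X, n_min=2, n_max=None):
    # X: (n_samples, n_neurons). Fit variance(n) ~ n^(-alpha) of the PCA
    # eigenspectrum by linear regression in log-log coordinates.
    Xc = X - X.mean(axis=0)
    lam = np.linalg.eigvalsh(np.cov(Xc.T))[::-1]   # PC variances, descending
    lam = lam[lam > 0]
    n = np.arange(1, len(lam) + 1)
    n_max = n_max or len(lam)
    sel = (n >= n_min) & (n <= n_max)
    slope, _ = np.polyfit(np.log(n[sel]), np.log(lam[sel]), 1)
    return -slope

# Synthetic check: data whose PC variances decay exactly as n^(-1)
rng = np.random.default_rng(1)
n_dim, n_samp = 50, 5000
Q, _ = np.linalg.qr(rng.normal(size=(n_dim, n_dim)))    # random PC directions
std = (1.0 / np.arange(1, n_dim + 1)) ** 0.5
X = (rng.normal(size=(n_samp, n_dim)) * std) @ Q.T
alpha = eigenspectrum_exponent(X)   # should come out near 1
```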
(a) α as a function of λsparse and λslow. The black dot represents the value pair λsparse = 0.14, λslow = 0.4 used in Fig. 6a. The black square and triangle respectively represent the (λsparse, λslow) value pairs used in panels c and d of this figure. (b) α as a function of λslow at different λsparse values. The blue curve represents the case λsparse = λ*sparse, which is also the λsparse value indicated by the black dot in panel a. The dashed red line represents α = 1. (c, d) Eigenspectra when λsparse and λslow take the indicated values. The blue and red arrows are explained in the text. (e) Eigenspectra when λsparse takes different values with λslow = 0. (f) σ1 and σ2 as functions of λsparse. Smaller σ1 and σ2 values indicate more equal variance along different PCs, which means more whitened V1 activity. The blue dot represents λsparse = λ*sparse. The dashed red line indicates the λsparse value that minimizes σ1 and σ2. (g) r2 quantifies how well the reconstruction explains the variance of the stimuli along a specific PC dimension of the stimuli. The blue line represents λsparse = λ*sparse. (h) The variance of the reconstruction along PC dimensions of the stimuli. (i) σ1 of the reconstructed images as a function of λsparse. Smaller σ1 indicates that the reconstructed images are more whitened. In panels e-i, λslow = 0.
How can we understand the good property of λ*sparse? Ref. [18] proposed that the power-law scaling of PC variances is a compromise between the whitening and the differentiability of the neural code. The slowness constraint improves differentiability, manifested by the straightening of temporal trajectories (Fig. 6d), so as long as λslow is large enough, the optimal code should be obtained when the sparseness constraint most whitens the code. Closer investigation of the eigenspectrum unveiled that at the λsparse and λslow values resulting in α > 1, the eigenspectrum decays in a power-law manner toward the end (blue arrow in Fig. 7c); in this case, the exponent α > 1 manifests the undersized variances of the last several PCs. When α < 1, however, there is a sharp drop toward the end of the eigenspectrum (red arrow in Fig. 7d), also manifesting the undersized variances of the last several PCs. Therefore, both α > 1 and α < 1 result from the non-whiteness of the code. To test this hypothesis, we set λslow = 0 and quantified the sparseness-induced whiteness through two indexes σ1 and σ2 (see Methods). Consistent with our presumption, both indexes take their smallest values around λsparse = λ*sparse (Fig. 7f), indicating optimal whiteness.
To gain some understanding of the optimal sparseness-induced whitening at λ*sparse, we studied the whiteness of the reconstructed stimuli (i.e., Σi wi xi,t in eq. 1) when λslow = 0. The link between the whiteness of the reconstructed stimuli and that of the V1 activity {xi,t}i is strictly valid when the reconstruction weights {wi}i are orthonormal. In our simulation, we constrained ‖wi‖2 = 1; the approximate orthogonality between wi and wj (i ≠ j) is numerically demonstrated in Supplementary Information Section 4 and Supplementary Fig. 4b.
Specifically, we investigated r2(m), the ratio of the variance of the mth PC of the stimuli It explained by the top-down reconstruction Σi wi xi,t. We found that the reshaping of the function r2(m) with the increase of λsparse experiences two stages, separated largely at λ*sparse (see the blue curve in Fig. 7g). At stage 1 (λsparse < λ*sparse), r2(m) remains close to 1 for small m values, but drops to zero for large m values. This means that at this stage, with the increase of λsparse, the model gradually abandons the reconstruction of the PCs with small variances, whereas the reconstruction in the subspace Sd spanned by the dominating PCs remains largely intact. The stimuli in the subspace Sd are well whitened (Fig. 6b, inset). At stage 2 (λsparse > λ*sparse), r2(m) lowers for small m values, suggesting the deterioration of the reconstruction of Sd. This deterioration is not uniform along all PCs in Sd, with the PCs with smaller variances deteriorating more (Fig. 7h). The reconstructed stimuli in the subspace Sd become non-whitened with the increase of λsparse (Fig. 7i). These results imply the following scenario: the improvement of the whiteness of the V1 code with λsparse when λsparse < λ*sparse occurs because V1 gradually focuses on reconstructing a well-whitened subspace Sd of stimuli spanned by the dominating PCs, and the deterioration of whiteness with λsparse when λsparse > λ*sparse occurs because V1 deteriorates the reconstruction of the well-whitened subspace Sd non-uniformly along different PC directions. See Supplementary Information Section 4 and Supplementary Fig. 4c for more support for the notion that the reconstruction of PCs with smaller variances is more impaired with the increase of λsparse.
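The per-PC explained-variance ratio r2(m) described above can be computed as in the following sketch (a hypothetical helper, assuming the stimulus PCs are taken from an SVD of the centered stimuli):

```python
import numpy as np

def r2_per_pc(I, I_hat):
    # r2(m) = ratio of the stimulus variance along the m-th stimulus PC
    # that is explained by the reconstruction: 1 - var(residual)/var(stimulus)
    Ic = I - I.mean(axis=0)
    _, _, Vt = np.linalg.svd(Ic, full_matrices=False)  # rows of Vt = stimulus PCs
    proj_I = Ic @ Vt.T                                 # stimuli along the PCs
    proj_R = (I_hat - I.mean(axis=0)) @ Vt.T           # reconstructions along the PCs
    return 1.0 - (proj_I - proj_R).var(axis=0) / proj_I.var(axis=0)
```

A reconstruction restricted to the leading PCs yields r2(m) ≈ 1 on those PCs and r2(m) ≈ 0 elsewhere, matching the stage-1 behavior described above.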
Discussion
In this paper, we found that complex cells facilitate visual creativity and perception, and showed that the close-to-1 exponent of the power-law eigenspectrum of V1 is realized at a sparseness strength that best whitens the V1 code and makes the exponent most insensitive to the slowness strength. Our work provides fresh insights into the cognitive roles of complex cells from a top-down generative perspective, establishes a link between the V1 eigenspectrum and the principles of sparseness and slowness, and suggests that there is an optimal sparseness strength at which V1 works.
Brain as an organ of active explanation
Predictive coding is the dominant theory among generative theories of perception; it proposes that the brain iteratively updates its explanation of the world using the error of the current explanation, transmitted through the feedforward pathway [5] (Fig. 3a). However, error updating is not the only approach to constructing this explanation. For example, the encoder of a VAE uses a deep feedforward network to construct this explanation in the bottleneck state (Fig. 1d). A similar technique has been used to solve lasso regression [31], where a deep neural network is used to approximate an implicit function, defined by iterative error updating, that maps stimuli to hidden states. The advantage of this deep-network approximation is computational speed. It has been found that information is transmitted feedforwardly at the early stage of perception, but processed recurrently at the later stage [7, 32, 33], reflecting the shift of computational demand from response speed to representation precision over the course of perception. Our perception model (Fig. 3a) also captures these two stages: the μ-channel of the feedforward encoder output of the VAE is regarded as the initial bottleneck state before recurrent updating (see the text explanation related to Fig. 3b).
An important but seldom discussed question is why the brain evolved to use the generative explanation scheme instead of the passive sensation scheme. One possible reason is that the brain has to continuously adapt itself to kaleidoscopic task demands. For example, it is believed that the sharpened feature tuning of simple cells by lateral inhibition improves discrimination [20, 22], and that the position tolerance of complex cells improves classification [15, 12]. These understandings preset the task demands of discrimination and classification for simple and complex cells respectively. The problem is that a feedforward network optimized for one task may perform badly on another: high discrimination may imply strong amplification of intra-class noise, blurring the clustered structure of the inputs [34] and impairing classification; a good classifier may map all elements of the same class onto a single output, impairing discrimination. The generative explanation scheme, however, requires that the activity of the high-level layer reconstruct that of the low-level layer, which ensures that most information in the low-level layer gets represented at higher levels. The high-level layer can have representations optimized for different tasks when different priors are imposed, yet it remains sensitive to the low-level information unused in the current task, ready to switch tasks according to environmental changes.
Complex cells, creativity, and perception
In this paper, we show that temporal slowness is necessary for the power-law eigenspectrum of V1 with exponent α → 1+ (Fig. 7b), and that complex cells facilitate higher-order cortex to form representations smooth to shape morphing of stimuli, improving creativity and perception (Figs. 1–5). Previous studies showed that temporal slowness is necessary for the developmental formation of complex cells [35, 36], and that the α → 1+ power-law exponent ensures the differentiability of the neural code [18]. Together, these results suggest the cognitive role of the close-to-1+ exponent of the V1 eigenspectrum: facilitating higher-order cortex to form representations differentiable to shape morphing of stimuli, thereby improving creativity and perception. To better understand this point, suppose that higher-order cortex represents a stimulus at position x using code z(x), and the V1 code is υ1(x). The generative theory of perception requires that υ1(x) ≈ f(z(x)), where f(·) represents the deep neural network along the top-down pathway, which is differentiable in the biologically plausible context. Therefore, if z(x) is to be differentiable to x, υ1(x) must also be differentiable to x to better fulfill the equation υ1(x) ≈ f(z(x)). Consistently, in Supplementary Information Section 5 and Supplementary Fig. 5, we trained VAEs to generate V1 activity (i.e., {xi,t}i in eq. 1), and found that VAEs trained on V1 activity at λslow ≠ 0 (we kept λsparse = 0.14 so that α ≈ 1 when λslow ≠ 0, see Fig. 7b) perform better on creativity and perception than those trained on V1 activity at λslow = 0.
Engineering neural networks to generate images with high-resolution details is difficult. VAE tends to generate blurred images [37]. Another well-known generative model, the generative adversarial network (GAN), also struggles when generating high-resolution details [38] due to a gradient problem [39]. Our results suggest that blurring these high-resolution details may result in better creativity and perception performance (Figs. 4a and 5c). Implementing this idea when engineering VAEs or GANs for image generation, or deep predictive coding networks for image recognition, is an interesting research direction.
It is believed that mental creativity mostly involves the default mode network and the executive control network [27, 40], which include association cortical areas in the frontal, parietal and cingulate cortices. Our results suggest that low-order sensory cortices such as V1 also play an important role in, or are even designed for, the high-order cognitive task of creativity. But this may not be so surprising: the cost function that VAE optimizes is the variational free energy [19], similar to that optimized during perception and learning [1], two cognitive processes that V1 also participates in. Therefore, creativity, perception and learning are different aspects of the same nature: free-energy minimization.
The neural mechanism of V1 adaptation
In Ref. [18], animals were presented stimulus ensembles with different statistics. Similarly, we also studied our model using stimulus ensembles of a low dimensionality d = 4. Consistent with our result with high-dimensional stimuli, the optimal sparseness strength, at which the exponent α is closest to the experimental value 1.62 and insensitive to the change of λslow, is obtained near a λsparse value that best whitens the V1 code at λslow = 0 (Supplementary Information Section 2 and Supplementary Fig. 2). Therefore, V1 may be able to adapt itself to the statistics of the presented images by adjusting λsparse, on the time scale of the experiment [18] (i.e., minutes).
In our model, this adaptation to stimulus statistics is realized by adjusting λsparse to its optimal value before learning the generation weights wi (eq. 1). λsparse has the physiological meaning of the neuronal firing threshold, and wi is related to the feedforward and recurrent weights to and within V1 [41, 42]. So it is likely that V1 performs this adaptation by first adjusting global inhibition, and then adjusting feedforward and recurrent weights accordingly. This global inhibition may be adjusted by thalamocortical connections or neuromodulators: it has been found that inactivation (or excitation) of pulvinar neurons suppresses (or increases) the responses of superficial V1 neurons to visual input [43], and cholinergic axons from the basal forebrain depolarize cortical interneurons [44]. We found that the optimal λsparse for low-dimensional stimuli is smaller than that for high-dimensional stimuli (Fig. 7f, Supplementary Fig. 2g), which means that this global inhibition is weaker during the presentation of low-dimensional stimuli than during that of high-dimensional stimuli.
What neural mechanism does the brain use to guide λsparse toward its optimal value? Fig. 7g shows that the reconstruction of PCs with small (or large) variance is impaired with the increase of λsparse when λsparse is below (or above) this optimal value. Consistently, the reconstruction error ϵ increases slowly with λsparse below the optimal value, but starts to soar up quickly beyond it (Supplementary Information Section 3 and Supplementary Fig. 3). Therefore, it is possible that the brain monitors ϵ when adjusting λsparse, and stops the adjustment just at the point where ϵ starts to soar up with λsparse. Predictive coding theory suggests that ϵ is encoded by the pyramidal neurons in superficial layers [1], and experiments have found that ϵ is closely related to gamma-band frequencies [45, 46]. More detailed mechanistic insights require experimental studies.
Methods
Manifesting the advantage of complex cells using toy models
A variational auto-encoder (VAE) [19] contains two parts: an encoder and a decoder (Fig. 1d). The encoder is a feedforward network that receives input x and gives two channels of output, μ and log(σ2). Then a random number z is generated according to the Gaussian distribution N(μ, σ2), and is input to the decoder, which outputs the reconstruction x̂. VAE is trained to minimize the following cost function:

E = (1/Dx) ‖x − x̂‖2 + λKL (1/Dz) DKL(N(μ, σ2) ‖ N(0, I)),  (2)

where Dx is the dimension of the input and the output, and Dz is the dimension of the random variable z. Minimizing the first term of this equation reduces the reconstruction error; minimizing the second term (which is the KL-divergence of N(μ, σ2) from the standard normal distribution N(0, I)) makes N(μ, σ2) close to N(0, I). λKL is a parameter controlling the relative strengths of these two terms.
In the VAE used in Figs. 1, 2, the encoder was a multilayer perceptron (MLP) with three hidden layers of sizes 100, 50 and 20 respectively. The input layer was of size 201, and the output layer had two channels, each of size 1. The decoder was another MLP with three hidden layers of sizes 20, 50 and 100 respectively. Adjacent layers were all-to-all connected. We used leaky ReLU as the activation function of the hidden layers. The VAE was trained using Adam optimizer [47].
In Figs. 1, 2, the inputs received by VAE (i.e., the outputs of the three models) are positioned on a one-dimensional line of 201 neurons. In Model 1, this input is a delta peak f1(x; a) = δx,a, in which only a single neuron at position a has activity 1, whereas all the other neurons have zero activity. In Model 2, this input is a Gabor function f2(x; a) = C exp(−(x − a)2/(2σ2)) cos(2π(x − a)/T), where σ = 10, T = 80, and C is a normalization factor such that maxx f2(x; a) = 1. In Model 3, this input is a Gaussian function f3(x; a) = C exp(−(x − a)2/(2σ2)), where σ = 10 and C is a normalization factor such that maxx f3(x; a) = 1. In f1(x; a), f2(x; a) and f3(x; a), a is a random integer in the range [31, 171].
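The three input profiles can be sketched as follows; the Gabor form for f2 is our reconstruction from the stated parameters σ and T and the normalization maxx f2(x; a) = 1:

```python
import numpy as np

N = 201  # neurons on the 1-D line

def model1(a):
    """Delta peak: a single active neuron at position a."""
    f = np.zeros(N)
    f[a] = 1.0
    return f

def model2(a, sigma=10.0, T=80.0):
    """Gabor-like profile (our reconstruction of f2), normalized to max 1."""
    x = np.arange(N)
    f = np.exp(-(x - a) ** 2 / (2 * sigma ** 2)) * np.cos(2 * np.pi * (x - a) / T)
    return f / f.max()

def model3(a, sigma=10.0):
    """Gaussian bump, normalized to max 1."""
    x = np.arange(N)
    f = np.exp(-(x - a) ** 2 / (2 * sigma ** 2))
    return f / f.max()
```

All three profiles peak at value 1 at position a; f2 additionally has negative sidelobes, which is the feature that distinguishes Model 2 from Model 3.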
To quantify the quality of the generated patterns (Fig. 2d), for any generated pattern p, we defined its quality as maxx R2(p; s(x)), where s(x) is the output pattern of Model-out in Fig. 2d in response to the wavelet stimulus at position x, and R2(p; s(x)) is the ratio of the variance of p that can be explained by s(x) (i.e., the coefficient of determination).
In Fig. 3, the bottleneck state was optimized to minimize the error between the target pattern and the generated pattern using Adam optimizer [47].
Manifesting the advantage of complex cells using skeleton MNIST dataset
The dataset in Fig. 4a is the skeleton MNIST dataset [25]. The intensity of a pixel in an image in this dataset is binary (1 or 0), depending on whether this pixel belongs to a line of 1-pixel width.

A blurred image I2 was generated from an image I1 in the skeleton dataset in the following way. To determine the intensity T2(x2, y2) of the pixel at position (x2, y2) in I2, we defined a box centered at (x2, y2) in I1. Within this box, we looked for a pixel (x1, y1) such that its intensity T1(x1, y1) = 1 and the distance d between (x1, y1) and (x2, y2) was minimized. Then we set T2(x2, y2) = a exp(−d2/2), where a = −1 if max(|x1 − x2|, |y1 − y2|) = 1, and a = 1 otherwise. If all pixels in the box had intensity 0, then T2(x2, y2) = 0. A second blurred dataset was generated in a similar way, except that a = 1 all the time.
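A sketch of this blurring procedure in code. The half-width of the search box (here 2 pixels) is our assumption, since the excerpt does not state it:

```python
import numpy as np

def blur_skeleton(I1, box=2, mexican_hat=True):
    """Blur a binary skeleton image following the procedure described above.
    `box` (search-box half-width) is our assumption, not stated in the text."""
    H, W = I1.shape
    I2 = np.zeros((H, W))
    for y2 in range(H):
        for x2 in range(W):
            best_d2, best_cheb = None, None
            # search the box around (x2, y2) for the nearest skeleton pixel
            for y1 in range(max(0, y2 - box), min(H, y2 + box + 1)):
                for x1 in range(max(0, x2 - box), min(W, x2 + box + 1)):
                    if I1[y1, x1] == 1:
                        d2 = (x1 - x2) ** 2 + (y1 - y2) ** 2
                        if best_d2 is None or d2 < best_d2:
                            best_d2 = d2
                            best_cheb = max(abs(x1 - x2), abs(y1 - y2))
            if best_d2 is None:
                continue  # no skeleton pixel in the box: intensity stays 0
            a = -1.0 if (mexican_hat and best_cheb == 1) else 1.0
            I2[y2, x2] = a * np.exp(-best_d2 / 2.0)
    return I2
```

With `mexican_hat=True` the immediate neighbors of a skeleton line get negative intensity (the a = −1 case); with `mexican_hat=False` the procedure reduces to the all-positive variant (a = 1 everywhere).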
The VAE used in Fig. 4 had a similar structure to that used in Fig. 1, except that the size of the input and output layers was 28 × 28 = 784, and the sizes of the three hidden layers of the encoder (or decoder) were 512, 256 and 128 (or 128, 256 and 512) respectively. The size of each of the two output channels of the encoder was 20.
The images generated by VAE were post-processed in two steps. First, images were binarized such that pixels with intensities larger (or smaller) than a threshold θthres were set to 1 (or 0). Second, the images were skeletonized using the ‘skeletonize’ routine of the skimage python package.
To quantify the quality of the post-processed images, we trained an MLP to classify the skeleton MNIST dataset. This MLP contained a hidden layer of size 1000 with leaky-ReLU activation function. After receiving a post-processed image I generated by VAE, this MLP output a label distribution p(x|I) (x = 0, 1, ⋯, 9). In Fig. 4d, the realisticity of the generated images was quantified as EI[maxx p(x|I)], where EI[·] means average over all the generated images [28]; in Fig. 4e, the cross-categorical diversity was quantified from the mean label distribution EI[p(x|I)] [28]. To plot Fig. 4f, we first chose the generated post-processed images with high realisticity (i.e., maxx p(x|I) > 0.9); then for all the images belonging to a category x, we calculated the variance λi(x) of the ith principal component (PC), and the in-category diversity Dincat was defined as the participation ratio Ex[(Σi λi(x))2 / Σi λi(x)2] [29]. Fig. 4d-f show how realisticity, cross-categorical diversity and Dincat change with the binarization threshold θthres and the parameter λKL in eq. 2. Note that if θthres is high, the image after post-processing may be very sparse (i.e., only a few pixels are nonzero), especially when λKL also takes a large value. In this case, the MLP network has an artifact that p(x|I) strongly peaks at x = 1, and p(x ≠ 1|I) has very small values. Because of this artifact, in Fig. 4d-f, we excluded the data points at which the percentage of nonzero pixels in the post-processed images was smaller than 1%. Some data points (at λKL = 0.9 for one of the three datasets, and at λKL = 0.5, 0.7, 0.9 for another) resulted in images with sparsity a little larger than 1%, but we also excluded these data points, because the quality of the generated images was really bad. These artifacts are weak for the third dataset in our range of parameters, so for it we plotted the whole parameter range.
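Under our reading that Dincat is the participation ratio of the PC variances (the precise definition in [29] may differ), it can be computed for one category as:

```python
import numpy as np

def in_category_dimensionality(feats):
    """Participation ratio of PC variances for a set of images in one category.
    feats: (n_images, n_features) array. This formula is our assumption."""
    X = feats - feats.mean(axis=0)
    # PC variances from singular values of the centered data matrix
    lam = np.linalg.svd(X, compute_uv=False) ** 2 / (len(X) - 1)
    return lam.sum() ** 2 / (lam ** 2).sum()
```

The participation ratio equals the number of PCs when variance is spread evenly across them, and approaches 1 when a single PC dominates.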
Fig. 4g was plotted by gradually changing the bottleneck state of VAE from z = [1.5, 1.5, ⋯, 1.5] to [−1.5, −1.5, ⋯, −1.5].
The generative model to explain the eigenspectrum of V1
To explain the eigenspectrum of V1, our working hypothesis is that V1 tries to reconstruct a sequence of input images through a top-down pathway, keeping the activity of V1 as sparse and slowly changing as possible. Specifically, we minimized the following cost function:

E = Σt ‖It − Σi wixi,t‖2 + λsparse Σi,t |xi,t| + λslow Σi,t (xi,t − xi,t−1)2,  (3)

where It is the input image at time t, xi,t is the activity of the ith neuron in V1 at time t, wi is the top-down reconstruction weight from neuron i, and λsparse and λslow respectively control the strengths of sparseness and temporal slowness. However, minimizing eq. 3 for a long sequence {It}t of images is computationally costly, so an approximation has to be used. Specifically, we minimized the following cost function for short sequences {I1, I2, I3} of only three images using an EM algorithm:

Eshort_seq = Σt=1,2,3 ‖It − Σi wixi,t‖2 + λsparse Σi Σt=1,2,3 |xi,t| + λslow Σi Σt=2,3 (xi,t − xi,t−1)2.  (4)
In the E-step, Eshort_seq was minimized with respect to {xi,t}i,t=1,2,3 using a fast iterative shrinkage-thresholding algorithm (FISTA) [48]; in the M-step, Eshort_seq was minimized with respect to {wi}i using Adam optimizer [47]. After training, we fixed {wi}i and inferred {xi,t}i,t from a given image sequence {It}t in a Markovian manner: we inferred {xi,t}i,t temporally sequentially (i.e., starting from {xi,t}i,t=1 to {xi,t}i,t=2 then to {xi,t}i,t=3, …); when inferring {xi,t}i,t=T, we fixed the values of {xi,t}i,t<T. The off-line training and on-line inference algorithms model the replay-driven plasticity [49] and the perception of V1 respectively. To calculate the eigenspectra (Figs. 6a, 7a-e), we prepared a number of triplet sequences of three images {I1, I2, I3} (see below), inferred the states in the Markovian manner above, and collected the states {xi,t}i,t=3 corresponding to I3. To calculate the curvature of the temporal trajectory of states (Fig. 6d), we prepared sequences of four images {I1, I2, I3, I4}, and calculated the curvature c of {xi,t}i,t={2,3,4} by [13]
c = arccos[(x3 − x2) · (x4 − x3) / (‖x3 − x2‖ ‖x4 − x3‖)],

where we have denoted xt = {xi,t}i. In our simulation, the size of image It was 16 × 16 = 256, and the number of hidden units xi was 257.
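The three-term cost of eq. 4 (reconstruction error, sparseness, slowness) and the trajectory curvature (the angle between successive difference vectors [13]) can be sketched as follows; matrix shapes and function names are ours:

```python
import numpy as np

def e_short_seq(I, x, w, lam_sparse, lam_slow):
    """Energy of a 3-frame sequence: reconstruction + sparseness + slowness.
    I: (3, D) images, x: (3, N) V1 states, w: (N, D) generation weights."""
    recon = ((I - x @ w) ** 2).sum()        # || I_t - sum_i w_i x_{i,t} ||^2
    sparse = np.abs(x).sum()                # L1 penalty on V1 activity
    slow = ((x[1:] - x[:-1]) ** 2).sum()    # penalize fast state changes
    return recon + lam_sparse * sparse + lam_slow * slow

def curvature(x2, x3, x4):
    """Curvature of a 3-point state trajectory: angle between successive
    difference vectors."""
    u, v = x3 - x2, x4 - x3
    return np.arccos(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

A straight trajectory gives curvature 0, and a right-angle turn gives π/2, matching the geometric intuition behind Fig. 6d.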
Image sequence preparation
The image sequences {It}t used to train the model (eq. 4) were prepared in the following way. We picked 100 images from van Hateren's natural image dataset [50], avoiding images that contained large areas of sky. We first took the logarithm of the intensities of the image pixels, following the suggestion of Ref. [50], and then partially whitened the images using the method in Ref. [4], modeling the image whitening by the retina or lateral geniculate nucleus upstream of V1 [30]. To get a short sequence {I1, I2, I3} in eq. 4, we picked a 16 × 16 patch from the images preprocessed above, and slid the position of the patch window by the same vector ΔP = (ΔX, ΔY) for two successive steps, where ΔX and ΔY were randomly −1, 0 or 1. A caveat here is a boundary effect. To see this, suppose all pixels in I1 have zero intensity, but after the patch window moves by ΔP, the pixels on a boundary of I2 have strong intensities. In this case, xi,t=1 = 0 for all i, but |xi,t=2| may be large, enlarging the cost for temporal slowness (i.e., the third term on the right-hand side of eq. 4). This large cost term does not represent a fast change of the stimulus itself, but is due to the sudden entrance of high-intensity pixels into the small patch window. To alleviate this boundary effect, we multiplied element-wise each image patch It by a 16 × 16 filter F. A pixel of F took the value 0.05, 0.24, 0.43, 0.62, 0.81 or 1, depending on whether its distance from the boundary was 1, 2, 3, 4, 5 or more than 5 pixels. This filtering also improves the biological plausibility of the model, because it means that the response of a neuron to a stimulus gradually decays (instead of sharply dropping to zero) as the stimulus moves away from the center of the receptive field.
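The boundary taper F can be constructed as below; the convention that the outermost pixels count as distance 1 is our reading of the text:

```python
import numpy as np

def boundary_filter(n=16):
    """Build the n x n taper F: pixel value depends on its distance to the
    nearest image boundary (1..5 -> 0.05..0.81, larger -> 1)."""
    vals = {1: 0.05, 2: 0.24, 3: 0.43, 4: 0.62, 5: 0.81}
    F = np.ones((n, n))
    for i in range(n):
        for j in range(n):
            d = min(i, j, n - 1 - i, n - 1 - j) + 1  # 1-based distance to boundary
            F[i, j] = vals.get(d, 1.0)
    return F
```

Each image patch is then multiplied element-wise by F, so pixel intensities fade smoothly toward the patch edges.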
Power-law exponent estimation
The power-law exponents of the eigenspectra of V1 states (Figs. 6 and 7) were estimated in the following way. We noted that the change of eigenvalue λi with the order i of the principal component largely has three stages (Figs. 6a and 7c, d): when i takes small values, the decay of λi with i is relatively slow and may exhibit zigzag fluctuations; when i takes intermediate values, the decay of λi with i is best approximated by a power law; when i takes large values, λi quickly decays with i. These three stages were also exhibited in experimental results [18]. The power-law exponent of an eigenspectrum was obtained by linearly fitting the intermediate stage in log-log scale. The intermediate stage was determined by visual inspection; it varied slightly when λsparse in eq. 4 took different values, and remained largely the same when λslow changed in the range of our study. When λsparse = 0.14 (the value in Fig. 6a), we let i ∈ [29, 109] be the intermediate range.
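A sketch of this fitting step, with 1-based PC indices matching the i ∈ [29, 109] convention above:

```python
import numpy as np

def powerlaw_exponent(eigvals, fit_range=(29, 109)):
    """Estimate the power-law exponent alpha by a linear fit of log(lambda_i)
    against log(i) over the intermediate range (1-based PC indices)."""
    i = np.arange(fit_range[0], fit_range[1] + 1)
    slope, _ = np.polyfit(np.log(i), np.log(eigvals[i - 1]), 1)
    return -slope  # lambda_i ~ i^(-alpha)
```

For a spectrum that is exactly λi = i^(−α), the fit recovers α regardless of the chosen intermediate range.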
Quantifying the whiteness of eigenspectrum
Suppose the variance of the nth principal component (PC) is υn. To quantify the whiteness of the eigenspectrum, we used two indexes: (1) σ1 = stdn(log(υn)), which is the standard deviation of the logarithms of the variances; (2) σ2 = Σn∈{first 20} log(υn) − Σn∈{last 20} log(υn), which is the difference between the summation of log(υn) over the first 20 PCs and that over the last 20 PCs. In Fig. 7f, we used both σ1 and σ2; in the inset of Fig. 6b, we used σ1.
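Both indexes can be computed directly from the PC variances; for a perfectly white (flat) spectrum both are zero:

```python
import numpy as np

def whiteness_indexes(variances):
    """sigma1: std of log PC variances; sigma2: sum of log variances over the
    first 20 PCs minus that over the last 20 PCs. Smaller values mean whiter."""
    log_v = np.log(variances)
    sigma1 = log_v.std()
    sigma2 = log_v[:20].sum() - log_v[-20:].sum()
    return sigma1, sigma2
```

Any decaying spectrum makes both indexes positive, since the leading PCs then carry more variance than the trailing ones.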
Acknowledgements
Z.B. thanks Prof. Changsong Zhou and Prof. Shuzhi Sam Ge for comments on the manuscript and helpful discussions. Z.B. is supported by the NSF of China (Grant No. 32000694) and the start-up fund of the Institute for Future, Qingdao University.