## Abstract

According to analysis-by-synthesis theories of perception, the primary visual cortex (V1) reconstructs visual stimuli through the top-down pathway, and higher-order cortex reconstructs V1 activity. Experiments have also found that neural representations are generated in a top-down cascade during visual imagination. What code does V1 provide for higher-order cortex to reconstruct or simulate so as to improve perception or imaginative creativity? What unsupervised learning principles shape V1 for reconstructing stimuli, such that the eigenspectrum of V1 activity is a power law with close-to-1 exponent? Using computational models, we reveal that reconstructing the activities of V1 complex cells facilitates higher-order cortex in forming representations smooth to shape morphing of stimuli, improving perception and creativity. A power-law eigenspectrum with close-to-1 exponent results from the constraints of sparseness and temporal slowness when V1 is reconstructing stimuli, at a sparseness strength that best whitens the V1 code and makes the exponent most insensitive to the slowness strength. Our results provide fresh insights into V1 computation.

## Introduction

Analysis-by-synthesis is a long-standing perspective on perception, which proposes that instead of passively responding to external stimuli, the brain actively predicts and explains its sensations [1, 2]. Neural network models in this line use a generative network to reconstruct the external stimuli, and use the reconstruction error to guide the updating of synaptic weights [3, 4] or the neuronal dynamic states [5]. This analysis-by-synthesis perspective of perception has received experimental support in both the visual and auditory systems [2, 6]. Additionally, neural representations are generated in a top-down cascade during visual creative imagination [7], and these representations are highly similar to those activated during perception [8, 9]. These results suggest that the brain performs like a top-down generator in both perception and creativity tasks.

According to the generative perspective above, V1 plays two fundamental cognitive roles: (1) V1 provides code for higher-order cortex to reconstruct for perception or to simulate for creativity, and (2) V1 reconstructs the external stimuli for perception. In this paper, we ask two questions regarding these two roles: (1) What code does V1 provide for higher-order cortex to reconstruct or simulate in order to improve perception or creativity? (2) What unsupervised learning principles shape the activity of V1 when V1 is reconstructing stimuli, so that the statistics of the V1 code resemble experimental observations?

Since Hubel and Wiesel proposed their V1 circuit model, in which complex cells receive input from simple cells [10] (see Ref. [11] for experimental validation), most modeling works have supposed complex cells to be the output of V1 (e.g., Refs. [12, 13]). The deep convolutional neural network, which consists of alternately stacked convolution and pooling layers and can be regarded as an engineering realization of stacked Hubel-Wiesel circuits, has seen huge success in artificial intelligence [14]. It is believed that the computational function of complex cells is to establish a code invariant to local spatial translation, which benefits object recognition performed in downstream cortices [15, 12]. As far as we know, however, no studies address the computational function of complex cells from a top-down generative perspective.

Understanding the unsupervised learning principles that guide the development of V1 has been a fruitful research direction. It has been shown that sparse coding results in Gabor-like receptive fields of simple cells [4], that temporal slowness may guide the formation of complex cells [16], and that temporal prediction can be used to understand the spatio-temporal receptive fields of simple cells [17]. A recent experiment showed that the variance of the *n*th principal component of V1 activity decays as a power law *n*^{−α} with *α* → 1^{+} [18], which provides a new challenge for computational explanation. This eigenspectrum may result from a compromise between efficient and smooth coding [18], but how this compromise can be realized through biologically plausible unsupervised learning principles in neural networks remains unknown.

In this paper, we addressed the above two problems by training neural network models. By modeling the visual system as a variational auto-encoder [19], we show that reconstructing the activities of complex cells facilitates higher-order cortex to form representations smooth to shape morphing of stimuli, thereby improving visual creativity and perception. Using a parsimonious generative model in which V1 is continuously reconstructing the input temporal stimuli, we show that coding sparseness and temporal slowness together explain the power-law eigenspectrum of V1 activity. The close-to-1 exponent is realized at a sparseness strength that best whitens V1 code and makes the exponent most insensitive to slowness strength. Our results provide fresh insights into V1 computation.

## Results

### Understanding the top-down function of complex cells: toy models

#### Model setup

Our V1 circuit model is a two-layer network receiving stimuli on a one-dimensional line (**Fig. 1b, right panel**). The feedforward connection *W*_{2} (magenta arrow in **Fig. 1b, right panel**) between the simple cell preferring stimulus position *X*_{sim} and the complex cell preferring *X*_{com} Gaussianly decays with |*X*_{com} − *X*_{sim}|. Experiments showed that lateral inhibition sharpens the tuning of simple cells [20, 21, 22]. We model this tuning sharpening by a winner-take-all (WTA) effect, so that only the simple cell with the highest activity remains active, while all the others are silent.

To study the computational function of complex cells, we asked two questions: (1) what happens if the complex cells are removed, leaving only simple cells; and (2) why is V1 functionally a two-layer structure (with simple cells and complex cells), instead of a simple linear filter? To answer these questions, we compared three toy models: (Model 1) a single-layer model with lateral inhibition (which results in WTA), modeling the simple-cell-only case (**Fig. 1b, left panel**); (Model 2) a single-layer model without lateral inhibition, modeling the case of a linear filter (**Fig. 1b, middle panel**); and (Model 3) the two-layer model with both simple and complex cells introduced above (**Fig. 1b, right panel**). In all three models, the stimuli are wavelets parameterized by their positions on a one-dimensional line, and the feedforward connection *W*_{1} as a function of *X*_{stim} − *X*_{sim} (where *X*_{stim} and *X*_{sim} are respectively the preferred stimulus positions of a stimulus receptor and a simple cell) is also in the shape of the wavelet (**Fig. 1a**). In response to a stimulus, the V1 output is a delta peak in Model 1, an oscillating wavelet in Model 2, and a Gaussian bump in Model 3 (**Fig. 1c**).
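As a concrete illustration, the three response types can be sketched numerically; the grid size, wavelet width, oscillation frequency, and pooling width below are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

n = 128                       # number of positions / simple cells (assumed)
pos = np.arange(n)
x0 = 64                       # stimulus position

def wavelet(center, width=6.0, freq=0.5):
    """Gabor-like wavelet centered at `center` (illustrative parameters)."""
    d = pos - center
    return np.exp(-d**2 / (2 * width**2)) * np.cos(freq * d)

stimulus = wavelet(x0)

# Feedforward drive to simple cells: rows of W1 are wavelets, so the drive is
# the correlation of the stimulus with a wavelet preferring each position.
W1 = np.stack([wavelet(c) for c in pos])
drive = W1 @ stimulus

# Model 1: lateral inhibition -> winner-take-all: a delta peak.
model1 = np.zeros(n)
model1[np.argmax(drive)] = drive.max()

# Model 2: no inhibition -> pure linear filter: an oscillating wavelet profile.
model2 = drive

# Model 3: WTA simple cells + Gaussian pooling by complex cells -> Gaussian bump.
d = pos[:, None] - pos[None, :]
W2 = np.exp(-d**2 / (2 * 4.0**2))     # Gaussian feedforward pooling (assumed width)
model3 = W2 @ model1
```

The delta peak (`model1`), oscillating profile (`model2`), and Gaussian bump (`model3`) correspond to the three curves sketched in **Fig. 1c**.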

We model the visual system as a variational auto-encoder (VAE) [19], whose output is optimized to reconstruct the input, under the regularization that the state of the bottleneck layer is encouraged to follow a standard normal distribution in response to any input (**Fig. 1d**). The VAE mimics the visual system in the following four aspects: (1) the bottleneck layer, which is low-dimensional, can be regarded as the prefrontal cortex, because experiments suggest that working memory is maintained in a low-dimensional subspace of the state of the prefrontal cortex [23, 24]; (2) the input and output layers can be regarded as V1, with the encoder and decoder connections representing the bottom-up and top-down pathways respectively; (3) stochasticity is a designed ingredient of VAE, analogous to neuronal stochasticity; and (4) after training, the VAE can generate new outputs through the top-down pathway by sampling states of the bottleneck layer from a Gaussian distribution, consistent with the experimental finding that V1 participates in mental imagination of new images [9]. This fourth point enables us to study the creativity of the visual system using VAE.
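A minimal numerical sketch of the VAE objective (not the architecture actually trained in the paper) may help: the loss combines reconstruction error with the closed-form Gaussian KL term that pulls the bottleneck distribution toward a standard normal. The linear encoder/decoder and one-dimensional bottleneck below are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_loss(s, W_enc, W_dec, log_var):
    """One-sample VAE loss with a 1-D bottleneck (illustrative sketch)."""
    mu = W_enc @ s                          # mu-channel of the encoder
    eps = rng.standard_normal()
    z = mu + np.exp(0.5 * log_var) * eps    # reparameterization trick
    s_hat = W_dec * z                       # top-down (decoder) reconstruction
    recon = np.sum((s - s_hat) ** 2)
    # KL( N(mu, sigma^2) || N(0, 1) ), closed form for Gaussians
    kl = 0.5 * (mu**2 + np.exp(log_var) - log_var - 1.0)
    return recon + kl

s = rng.standard_normal(16)                 # stand-in V1 pattern
W_enc = rng.standard_normal(16) * 0.1
W_dec = rng.standard_normal(16) * 0.1
loss = vae_loss(s, W_enc, W_dec, log_var=-2.0)
```

After training, sampling z ~ N(0, 1) and running only the decoder corresponds to the top-down generation used in the creativity experiments.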

#### Visual creativity

To study the function of complex cells in visual creativity, we trained VAEs to generate the output patterns of V1 in the three models above (**Fig. 1b, c**), and compared the quality of the generated patterns in the three models. After training, we investigated how the generated pattern changed with the bottleneck state *z*, which is one-dimensional in our model. In Model 1, sharp peaks are never generated by the VAE, suggesting the failure of pattern generation (**Fig. 2a**). In Model 2, abrupt changes of the peaking positions sometimes happen with the continuous change of *z* (**Fig. 2b**). In Model 3, Gaussian peaks are generated, with the peaking position smoothly changing with *z* (**Fig. 2c**).

To quantify the quality of the generated patterns, we defined an index *r*^{2} that measures how closely a pattern generated by the VAE resembles an output pattern of the model used to train the VAE (see Methods). We found that *r*^{2} is maximal for Model 3, intermediate for Model 2, and small (even negative) for Model 1 (see the bars indicated by arrows in **Fig. 2d**). Therefore, the VAE trained by Model 3 learns to generate patterns resembling the output of Model 3. This advantage of Model 3 is closely related to the smooth mapping from *z* to the shape morphing (i.e., the translational movement) of the generated bumps (**Fig. 2c**), because the generation quality is poor around the abrupt change points (blue curve in **Fig. 2b, lower panel**).

In the study above, we input the output pattern of Model *a* (*a* = 1, 2, 3) to the VAE and trained the VAE to reconstruct that same pattern in its output. Now we ask whether the advantage of Model 3 results from a ‘good’ code that higher-order cortex receives from V1, or from a ‘good’ code that higher-order cortex is to reconstruct through the top-down pathway. To answer this question, we input the output pattern of Model *a* to the VAE but trained the VAE to construct the output pattern of Model *b*. We found that the quality of the generated images strongly depended on Model *b*, but hardly on Model *a* (**Fig. 2d**). Therefore, the advantage of complex cells is a top-down effect, and cannot be understood from a bottom-up perspective.

To understand the reason for the advantage of Model 3, we then studied the Euclidean distance *d*(*s*(*x*), *s*(*y*)) between the output patterns *s* as a function of |*x* − *y*|, where *x* and *y* are the positions of two stimuli. In Model 1, *d*(*s*(*x*), *s*(*y*)) sharply jumps from 0 to a constant value at |*x* − *y*| = 0; in Model 2, *d*(*s*(*x*), *s*(*y*)) is not monotonic; and in Model 3, *d*(*s*(*x*), *s*(*y*)) monotonically and gradually increases with |*x* − *y*| (**Fig. 2e**). This property of Model 3 may be important for its advantage in top-down generation. To see this, suppose two spatially nearby bottleneck states *z*_{1} and *z*_{2} (*z*_{1} ≈ *z*_{2}) generate two output patterns *s*_{1} and *s*_{2}, which correspond to two stimuli at positions *x*_{1} and *x*_{2} respectively. For simplicity, we constrain that *s*_{1} is fixed during training, and *s*_{2} changes within the manifold {*s*(*x*)}_{x}. By introducing stochastic sampling during training, VAE encourages *s*_{2} to be closer to *s*_{1}, so that *x*_{2} gets closer to *x*_{1}, which means that the bottleneck state represents the spatial topology (i.e., the translational movement) of the stimuli. In Model 3, this can be realized using the gradient of *d*(*s*_{1}, *s*_{2}) with respect to *x*_{2}. In Model 2, *d*(*s*_{1}, *s*_{2}) is not monotonic with *x*_{2}, so this gradient does not always lead *s*_{2} closer to *s*_{1}, and sometimes instead pushes it away from *s*_{1}. In Model 1, the gradient remains zero when *s*_{1} ≠ *s*_{2}, so it cannot provide clues to the updating of *s*_{2}.
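The distance profiles described above can be reproduced with idealized versions of the three output patterns; the widths and oscillation frequency below are illustrative assumptions.

```python
import numpy as np

pos = np.arange(128)

def model1_out(x):                     # delta peak (WTA simple cells)
    s = np.zeros(128)
    s[x] = 1.0
    return s

def model2_out(x):                     # oscillating wavelet (linear filter)
    d = pos - x
    return np.exp(-d**2 / 72.0) * np.cos(0.5 * d)

def model3_out(x):                     # Gaussian bump (complex cells)
    return np.exp(-(pos - x)**2 / 32.0)

def dist(f, x, y):
    return np.linalg.norm(f(x) - f(y))

seps = np.arange(0, 30)
d1 = [dist(model1_out, 64, 64 + k) for k in seps]
d2 = [dist(model2_out, 64, 64 + k) for k in seps]
d3 = [dist(model3_out, 64, 64 + k) for k in seps]
```

Numerically, `d1` jumps to a constant as soon as |x − y| > 0, `d2` oscillates non-monotonically, and `d3` increases gradually and monotonically, matching the three curves of **Fig. 2e**.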

#### Visual perception

According to the predictive coding theory of perception [5], higher-order cortex adjusts its state using the error of its reconstruction of the activity of lower-order cortex. In our model, perception is performed by adjusting the bottleneck state *z* to minimize the error between the pattern generated by the decoder of the trained VAE and the target pattern (**Fig. 3a**).

Two conditions are required for good perception performance: (1) there exists a state *z*_{0} at which the generated pattern well resembles the target complex-cell pattern; and (2) the representation of the complex-cell patterns in the bottleneck state should be ‘smooth’, so that starting from a state *z*_{1} near *z*_{0}, the optimal state *z*_{0} can be reached by error updating using, in our model, a gradient descent algorithm (**Fig. 3a**).

Guided by these two understandings, we studied the perception performance of a VAE trained by Model *a* (*a* = 1, 2, 3) in two steps. First, we set the target pattern (i.e., the blue curve in **Fig. 3a**) to be a complex-cell pattern of Model *a*, and set the initial bottleneck state to be the value of the *μ*-channel (see **Fig. 1d**) of the VAE encoder, which is the optimal bottleneck state estimated by the encoder; we then updated the bottleneck state to minimize the error between the target pattern and the decoder output. We denote the bottleneck state at the end of this updating as *z*_{end}. Second, we set the target pattern to be the decoder output when the bottleneck state took *z*_{end}, set the initial bottleneck state to be *z*_{end} + *ϵ*, with *ϵ* being Gaussian noise with standard deviation *σ*_{perturb}, and updated the bottleneck state again.

We quantified the reconstruction quality in the above two steps using two indexes, *r*_{1}^{2} and *r*_{2}^{2}, respectively representing the ratio of variance of the target pattern explained by the decoder output at the end of each step. *r*_{1}^{2} quantifies how well an optimal bottleneck state can reconstruct a complex-cell pattern. *r*_{2}^{2} quantifies how smooth the representation of the output in the bottleneck state is, so that an optimal bottleneck state can be easily found by error updating. We found that Model 2 and Model 3 correspond to high *r*_{1}^{2} values (**Fig. 3b**), and that *r*_{2}^{2} for Model 3 is higher than that for Model 2 (**Fig. 3c**). These results suggest the advantage of complex cells for visual perception. The low *r*_{2}^{2} value for Model 2 may be closely related to the fragmentary representation in the bottleneck state (**Fig. 2b, upper panel**).
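The error-updating step can be illustrated with a stylized decoder: here the trained VAE decoder is replaced by an idealized map from *z* to a Gaussian bump, and the learning rate, step count, and bump width are arbitrary assumptions.

```python
import numpy as np

pos = np.arange(128.0)

def decoder(z):
    """Idealized stand-in for the trained VAE decoder: z -> Gaussian bump."""
    return np.exp(-(pos - z)**2 / 32.0)

def perceive(target, z_init, lr=2.0, n_steps=200, dz=1e-3):
    """Update the bottleneck state z by gradient descent on the
    reconstruction error (numerical gradient for simplicity)."""
    z = z_init
    for _ in range(n_steps):
        e_plus = np.sum((decoder(z + dz) - target)**2)
        e_minus = np.sum((decoder(z - dz) - target)**2)
        z -= lr * (e_plus - e_minus) / (2 * dz)
    return z

target = decoder(70.0)               # target complex-cell-like pattern
z_end = perceive(target, z_init=64.0)  # start from a perturbed initial state
```

Because the bump-to-bump distance grows monotonically with separation (the Model 3 property), gradient descent pulls `z` from 64 to the optimum near 70.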

### Understanding the top-down function of complex cells: skeleton MNIST dataset

#### Model setup

We then studied the advantage of complex cells for top-down image generation using the skeleton MNIST dataset [25]. Images in this dataset represent digits using lines of 1-pixel width (**Fig. 4a, second column**). We used this dataset to replace the output of Model 1 (**Fig. 1b, left**) in the 2-dimensional-image case, because a 1-pixel-width line is analogous to the activities of simple cells under local winner-take-all inhibition along the direction perpendicular to the line (compare the second column of **Fig. 4a** with the blue peak in **Fig. 1c**). Biologically, simple cells are selective to the orientation of bars. Our model does not have the ingredient of orientation selectivity, but uses images of thin lines, mimicking the enhanced representation of contours in V1 due to collinear facilitation [26]. Our model also shares a property with simple cells whose orientation tuning is sharpened by lateral inhibition [20]: the representations of two parallel bars hardly overlap with each other. To replace Model 2 (**Fig. 1b, middle**) in the 2-dimensional case, we set the pixel intensities to decay oscillatorily along the direction perpendicular to a line in the skeleton MNIST images (**Fig. 4a, third column**). To replace Model 3 (**Fig. 1b, right**), we blurred the skeleton images so that pixel intensities Gaussianly decayed along the direction perpendicular to a line (**Fig. 4a, fourth column**). We trained VAEs using the three datasets, and compared the quality of the generated images (see Methods).
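The construction of the two smoothed dataset variants can be sketched as follows; the profile width and oscillation frequency are illustrative assumptions, and real code would operate on the skeleton MNIST images rather than the toy vertical line used here.

```python
import numpy as np

# 1-D intensity profiles applied perpendicular to a line (assumed shapes)
d = np.arange(-3, 4, dtype=float)
gauss_profile = np.exp(-d**2 / 2.0)                    # Model 3 analogue
osc_profile = np.exp(-d**2 / 2.0) * np.cos(2.0 * d)    # Model 2 analogue

def smear(img, kernel):
    """Convolve each row with a 1-D kernel, i.e., spread intensity along
    the direction perpendicular to a vertical line."""
    r = len(kernel) // 2
    padded = np.pad(img, ((0, 0), (r, r)))
    return np.stack([np.convolve(row, kernel, mode="valid") for row in padded])

skeleton = np.zeros((16, 16))
skeleton[:, 8] = 1.0                  # a vertical line of 1-pixel width

blurred = smear(skeleton, gauss_profile)      # intensities decay Gaussianly
oscillating = smear(skeleton, osc_profile)    # intensities decay oscillatorily
```

The skeleton image itself plays the role of the Model 1 analogue; `oscillating` and `blurred` play the roles of the Model 2 and Model 3 analogues respectively.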

#### Visual creativity

The images generated by VAEs trained by the different datasets have different styles (**Fig. 4c, upper panels**). To fairly compare the quality of the images generated from the three datasets, we post-processed the generated images by first binarizing them using a threshold *θ*_{thres}, such that pixels with intensities larger (or smaller) than *θ*_{thres} were set to 1 (or 0), and then skeletonizing the images (see Methods), resulting in binary images with lines of 1-pixel width (**Fig. 4b and lower panels of c**), similar to the images in the skeleton MNIST dataset.

People have proposed that creativity is a mental process of generating worthwhile and novel ideas [27]. Inspired by this definition of creativity, we propose that good generation of skeleton-MNIST images should satisfy three conditions. (1) Realisticity: a generated image should look like a digit in the skeleton MNIST dataset. (2) Cross-category variety: the numbers of generated images looking like different digits should be almost the same. In other words, it is not good if the VAE can only generate images looking like the same digit. (3) Within-category variety: the shapes of the images of the same digit should be various. To quantify the image-generation quality, we trained a neural network to classify the skeleton MNIST dataset, so that the network outputs a label distribution *p*(*x*|*I*) (*x* = 0, 1, ···, 9) after receiving a post-processed generated image *I* (see Methods). Realisticity requires that *p*(*x*|*I*) have low entropy for each *I* [28]. Cross-category variety requires that the marginal distribution of *x* over *A* have high entropy, with *A* being the set of all the post-processed generated images [28]. To quantify within-category variety, we calculated the intrinsic dimensionality *D*_{incat} = (Σ_{i} *λ*_{i})² / Σ_{i} *λ*_{i}² [29], where *λ*_{i} is the eigenspectrum of the post-processed generated images *A*_{0} belonging to the same category. *D*_{incat} has maximal value if all principal components (PCs) of the set *A*_{0} have equal variance, and has small value if a single PC dominates.
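The three indexes can be computed as sketched below, with random stand-ins for the classifier outputs and the generated-image features; the participation-ratio form of *D*_{incat} follows the definition above.

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# p_xI[k] = label distribution p(x | I_k) for generated image I_k
# (random softmax outputs stand in for a trained classifier here)
logits = rng.standard_normal((200, 10))
p_xI = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Realisticity: mean conditional entropy of p(x|I) should be LOW
h_cond = np.mean([entropy(p) for p in p_xI])

# Cross-category variety: entropy of the marginal p(x) should be HIGH
h_marg = entropy(p_xI.mean(axis=0))

# Within-category variety: participation-ratio intrinsic dimensionality
# D_incat = (sum_i lam_i)^2 / sum_i lam_i^2 over one category's images
def intrinsic_dim(samples):
    lam = np.clip(np.linalg.eigvalsh(np.cov(samples.T)), 0, None)
    return lam.sum()**2 / np.sum(lam**2)

imgs = rng.standard_normal((100, 20))   # stand-in image features of one digit
D_incat = intrinsic_dim(imgs)
```

By Jensen's inequality the marginal entropy is always at least the mean conditional entropy, so a good generator is one where the gap between the two is large.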

We investigated these indexes with the change of the binarization threshold *θ*_{thres} (**Fig. 4b**) and a parameter *λ*_{KL} in VAE which controls the regularization strength on the distribution of the bottleneck state variable (see Methods). We found that VAEs trained by the blurred dataset (the analogue of Model 3) generated images with better eye-looking quality than VAEs trained by the other two datasets (**Fig. 4c**), consistent with the quantitative indications of lower conditional entropy, higher marginal entropy, and larger *D*_{incat} over a large range of the parameters *θ*_{thres} and *λ*_{KL} (**Fig. 4d-f**). These results suggest that complex cells facilitate the visual system to generate diverse realistic-looking images, supporting the functional role of complex cells in visual creativity.

Similar to the 1-dimensional-image case (**Fig. 2a-c**), we also investigated the series of generated images when the bottleneck state *z* was continuously changed (**Fig. 4g**). The images generated by the VAE trained by the blurred dataset have two advantages over those resulting from the other two datasets: (1) when the change of *z* is not large, so that the generated images look like the same digit, the images generated from the blurred dataset undergo smoother and more flexible shape morphing, whereas the images generated from the other two datasets are more rigid; (2) when the change of *z* is large, so that the generated images experience digit transition, the images generated from the blurred dataset look more realistic during the transition period, whereas the images generated from the other two datasets are more unrecognizable during the transition period (see **Supplementary Fig. 1** for more examples). This investigation gives insight into the facilitation of creativity by complex cells.

Model 1 in **Fig. 1** can hardly generate realistic-looking samples (**Fig. 1e**); the digits generated after training on the skeleton dataset, although worse in quality than those generated after training on the blurred dataset, are sometimes recognizable (**Fig. 4c**). This is possibly because in the 1-dimensional case, there is no delta peak *p*_{3} that can interpolate two delta peaks *p*_{1} and *p*_{2}, such that *d*(*p*_{1}, *p*_{3}) < *d*(*p*_{1}, *p*_{2}) and *d*(*p*_{2}, *p*_{3}) < *d*(*p*_{1}, *p*_{2}), where *d*(·, ·) is the Euclidean distance between the representations of two peaks; in the 2-dimensional case, however, we can draw many 1-pixel-width thin lines that interpolate two parallel thin lines. In both the 1- and 2-dimensional cases, the models corresponding to the complex cells perform best.

### Sparseness and slowness together explain the power-law eigenspectrum of V1

Experiments found a power-law eigenspectrum of V1 activity with exponent close to 1 [18]. Here we would like to explain this phenomenon using a parsimonious generative model. Our working hypothesis is that V1 tries to reconstruct the input temporal stimuli through a top-down pathway, keeping the activity of V1 as sparse and slowly changing as possible. Specifically, we minimized the following cost function:

$$E = \sum_t \Big\| I_t - \sum_i \boldsymbol{w}_i x_{i,t} \Big\|^2 + \lambda_{\mathrm{sparse}} \sum_{i,t} |x_{i,t}| + \lambda_{\mathrm{slow}} \sum_{i,t} \left( x_{i,t+1} - x_{i,t} \right)^2, \qquad (1)$$

where *I*_{t} is the input stimulus at time *t*, *x*_{i,t} is the activity of the *i*th neuron in V1 at time *t*, **w**_{i} is the top-down reconstruction weight from neuron *i*, and λ_{sparse} and λ_{slow} respectively control the strengths of sparseness and temporal slowness. We generated temporal stimuli by sliding a small spatial window over static natural images, imitating the experimental protocol of Ref. [18], in which animals viewed static natural images and the temporal change of stimuli resulted from head or eye movements of the animals.
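The cost can be transcribed directly into code; the L1 sparseness and squared-temporal-difference slowness penalties below follow the standard sparse-coding and slowness formulations and are meant as a sketch, not the paper's exact implementation.

```python
import numpy as np

def v1_cost(I, x, w, lam_sparse, lam_slow):
    """I: (T, d) stimuli; x: (T, N) V1 activities; w: (N, d) top-down
    reconstruction weights, one row w_i per neuron."""
    recon = np.sum((I - x @ w) ** 2)                # || I_t - sum_i w_i x_{i,t} ||^2
    sparse = lam_sparse * np.sum(np.abs(x))         # sparseness of V1 activity
    slow = lam_slow * np.sum((x[1:] - x[:-1])**2)   # temporal slowness
    return recon + sparse + slow

rng = np.random.default_rng(2)
I = rng.standard_normal((50, 64))    # stand-in stimulus sequence
x = rng.standard_normal((50, 16))    # stand-in V1 activities
w = rng.standard_normal((16, 64))    # stand-in reconstruction weights
cost = v1_cost(I, x, w, lam_sparse=0.14, lam_slow=0.5)
```

In the full model, `x` and `w` would be optimized jointly to minimize this cost over the sliding-window stimulus sequence.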

At suitable values of λ_{sparse} and λ_{slow}, the variance of the *n*th principal component (PC) of samples of {*x*_{i,t}}_{i} decays as a power law *n*^{−α}, with exponent *α* ≈ 1 (**Fig. 6a**), with successive PC dimensions encoding finer spatial features (**Supplementary Fig. 4a**). Similar to the result in Ref. [18], this power-law scaling is not inherited from the eigenspectrum of the input stimuli (**Fig. 6b**), because we partially whitened the natural images in the preprocessing step, modeling the function of the retina and lateral geniculate nucleus [30]. After training, *w*_{i} exhibited Gabor-like shape under the sparseness constraint (**Fig. 6c**), which implies Gabor-like receptive fields of the neurons [4]. When the slowness constraint is strong, the curvature of the temporal trajectory of the neuronal population activity {*x*_{t}}_{t} (we denote *x*_{t} = {*x*_{i,t}}_{i}) can be smaller than the curvature of the temporal trajectory of the stimuli {*I*_{t}}_{t} (**Fig. 6d**), consistent with experimental observations [13].
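The exponent *α* can be estimated by a log-log linear fit to the PC variances, as sketched below on synthetic activity with a known 1/*n* spectrum (a stand-in for the model's V1 activity; dimensions and sample count are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(3)

n_dim, n_samples = 100, 20000
true_vars = 1.0 / np.arange(1, n_dim + 1)      # an exact alpha = 1 spectrum
samples = rng.standard_normal((n_samples, n_dim)) * np.sqrt(true_vars)

# PC variances = eigenvalues of the sample covariance matrix, descending
lam = np.sort(np.linalg.eigvalsh(np.cov(samples.T)))[::-1]

# fit variance ~ n^(-alpha) by least squares in log-log coordinates
n = np.arange(1, n_dim + 1)
slope, _ = np.polyfit(np.log(n), np.log(lam), 1)
alpha = -slope
```

On the model's actual activity samples, the same fit applied over the λ_{sparse}-λ_{slow} grid yields the exponent maps of **Fig. 7a, b**.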

We then searched the exponent *α* over a range of λ_{sparse} and λ_{slow}, and found that *α* is closest to 1 at an optimal sparseness strength λ*_{sparse} at which *α* is most insensitive to λ_{slow} when λ_{slow} ≠ 0 (**Fig. 7a, b**).

How can we understand this good property of λ*_{sparse}? Ref. [18] proposed that the power-law scaling of PC variances is a compromise between the whitening and differentiability of the neural code. The slowness constraint improves the differentiability, manifested by the straightening of temporal trajectories (**Fig. 6d**), so as long as λ_{slow} is large enough, the optimal code should be obtained when the sparseness constraint most whitens the code. Closer investigation of the eigenspectrum unveiled that at the λ_{sparse} and λ_{slow} values resulting in *α* > 1, the eigenspectrum decays in a power-law manner toward the end (blue arrow in **Fig. 7c**). In this case, the exponent *α* > 1 reflects the undersized variances of the last several PCs. When *α* < 1, however, there is a sharp drop toward the end of the eigenspectrum (red arrow in **Fig. 7d**), also reflecting the undersized variances of the last several PCs. Therefore, both *α* > 1 and *α* < 1 result from the non-whiteness of the code. To test this hypothesis, we set λ_{slow} = 0 and quantified the sparseness-induced whiteness through two indexes *σ*_{1} and *σ*_{2} (see Methods). Consistent with our presumption, both indexes reach their smallest values around λ*_{sparse} (**Fig. 7f**), indicating optimal whiteness.

To gain some understanding of the optimal sparseness-induced whitening at λ*_{sparse}, we studied the whiteness of the reconstructed stimuli (i.e., Σ_{i} *w*_{i}*x*_{i,t} in **eq. 1**) when λ_{slow} = 0. The link between the whiteness of the reconstructed stimuli and that of the V1 activity {*x*_{i,t}}_{i} is strictly valid when the reconstruction weights {**w**_{i}}_{i} are orthonormal. In our simulation, we constrained ||**w**_{i}||² = 1; the approximate orthogonality between **w**_{i} and **w**_{j} (*i* ≠ *j*) is numerically demonstrated in **Supplementary Information Section 4** and **Supplementary Fig. 4b**.

Specifically, we investigated *r*²(*m*), which is the ratio of the variance of the *m*th PC of the stimuli *I*_{t} explained by the top-down reconstruction Σ_{i} *w*_{i}*x*_{i,t}. We found that the reshaping of the function *r*²(*m*) with the increase of λ_{sparse} experiences two stages, separated largely at λ*_{sparse} (see the blue curve in **Fig. 7g**). At stage 1 (λ_{sparse} below λ*_{sparse}), *r*²(*m*) remains close to 1 for small *m* values, but drops to zero for large *m* values. This means that at this stage, with the increase of λ_{sparse}, the model gradually abandons the reconstruction of the PCs with small variances, whereas the reconstruction in the subspace of the dominating PCs remains largely intact. The stimuli in this subspace are well-whitened (**Fig. 6b, inset**). At stage 2 (λ_{sparse} above λ*_{sparse}), *r*²(*m*) lowers for small *m* values, suggesting the deterioration of the reconstruction of the dominating PCs. This deterioration is not uniform along all PCs in this subspace, with the PCs of smaller variance deteriorated worse (**Fig. 7h**). The reconstructed stimuli in this subspace become non-whitened with the increase of λ_{sparse} (**Fig. 7i**). These results imply a scenario in which the improvement of the whiteness of the V1 code with λ_{sparse} below λ*_{sparse} arises because V1 gradually focuses on reconstructing a well-whitened subspace of stimuli spanned by the dominating PCs, while the deterioration of whiteness with λ_{sparse} above λ*_{sparse} arises because V1 degrades the reconstruction of this well-whitened subspace non-uniformly along different PC directions. See **Supplementary Information Section 4** and **Supplementary Fig. 4c** for more support for the notion that the reconstruction of PCs with smaller variances is more impaired with the increase of λ_{sparse}.
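The index *r*²(*m*) can be computed by projecting both the stimuli and their reconstructions onto the stimulus PCs; the synthetic stimuli and noisy "reconstructions" below are stand-ins for the trained model's variables.

```python
import numpy as np

rng = np.random.default_rng(4)

I = rng.standard_normal((5000, 32))               # stimuli (T, d)
recon = I + 0.3 * rng.standard_normal(I.shape)    # imperfect top-down reconstruction

# project both onto the stimulus PCs
I_c = I - I.mean(axis=0)
_, _, Vt = np.linalg.svd(I_c, full_matrices=False)   # rows of Vt are stimulus PCs
proj_I = I_c @ Vt.T
proj_R = (recon - recon.mean(axis=0)) @ Vt.T

# r^2(m) = 1 - residual variance / PC variance, for each PC m
resid = proj_I - proj_R
r2 = 1.0 - resid.var(axis=0) / proj_I.var(axis=0)
```

Plotting `r2` against the PC index *m* for increasing λ_{sparse} would trace out the family of curves sketched in **Fig. 7g**.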

## Discussion

In this paper, we found that complex cells facilitate visual creativity and perception, and showed that the close-to-1 exponent of the power-law eigenspectrum of V1 is realized at a sparseness strength that best whitens the V1 code and makes the exponent most insensitive to the slowness strength. Our work provides fresh insights into the cognitive roles of complex cells from a top-down generative perspective, establishes a link between the V1 eigenspectrum and the principles of sparseness and slowness, and suggests that there is an optimal sparseness strength at which V1 works.

### Brain as an organ of active explanation

Predictive coding is the dominating theory among the generative theories of perception; it proposes that the brain iteratively updates its explanation of the world using the error of the current explanation transmitted through the feedforward pathway [5] (**Fig. 3a**). However, one should note that error updating is not the only approach to constructing this explanation. For example, the encoder of VAE uses a deep feedforward network to construct this explanation in the bottleneck state (**Fig. 1d**). A similar technique has been used to solve lasso regression [31], where a deep neural network is used to approximate an implicit function, defined by iterative error updating, that maps stimuli to hidden states. The advantage of this deep-neural-network approximation is computational speed. It has been found that information is transmitted feedforwardly at the early stage of perception, but processed recurrently at the later stage [7, 32, 33], which reflects the shift of computational demand from responding speed to representation precision during the process of perception. Our perception model (**Fig. 3a**) also captures these two stages: the *μ*-channel of the feedforward encoder output of VAE is regarded as the initial bottleneck state before recurrent updating (see the text explanation related to **Fig. 3b**).

An important but seldom-discussed question is why the brain evolved to use the generative explanation scheme instead of the passive sensation scheme. One possible reason is that the brain has to continuously adapt itself to kaleidoscopic task demands. For example, it is believed that the sharpened feature tuning of simple cells by lateral inhibition improves discrimination [20, 22], and that the position tolerance of complex cells improves classification [15, 12]. These understandings preset the task demands of discrimination and classification for simple and complex cells respectively. The problem is that a feedforward network optimized for one task may perform badly on another: high discriminability may imply strong amplification of intra-class noise, blurring the clustered structure of the inputs [34] and impairing classification; a good classifier may map all elements of the same class onto a single output, impairing discrimination. The generative explanation scheme, however, requires that the activity of the high-level layer reconstruct that of the low-level layer, which ensures that most information in the low-level layer gets represented in higher levels. The high-level layer can have representations optimized for different tasks after different priors are imposed, but remains sensitive to the low-level information unused in the current task, ready to switch tasks according to environmental changes.

### Complex cells, creativity, and perception

In this paper, we show that temporal slowness is necessary for the power-law eigenspectrum of V1 with exponent *α* → 1^{+} (**Fig. 7b**), and that complex cells facilitate higher-order cortex to form representations smooth to shape morphing of stimuli, improving creativity and perception (**Figs. 1–5**). Previous studies showed that temporal slowness is necessary for the developmental formation of complex cells [35, 36], and that the *α* → 1^{+} power-law exponent ensures the differentiability of the neural code [18]. These results together suggest the cognitive role of the close-to-1^{+} exponent of the eigenspectrum of V1: facilitating higher-order cortex to form representations differentiable to shape morphing of stimuli, thereby improving creativity and perception. To better understand this point, suppose that higher-order cortex represents a stimulus at position *x* using code *z*(*x*), and the V1 code is *υ*_{1}(*x*). The generative theory of perception requires that *υ*_{1}(*x*) ≈ *f*(*z*(*x*)), where *f*(·) represents the deep neural network along the top-down pathway, which is differentiable in the biologically plausible context. Therefore, if *z*(*x*) is to be differentiable to *x*, then *υ*_{1}(*x*) must also be differentiable to *x* to better fulfill the equation *υ*_{1}(*x*) ≈ *f*(*z*(*x*)). Consistently, in **Supplementary Information Section 5** and **Supplementary Fig. 5**, we trained VAEs to generate V1 activity (i.e., {*x*_{i,t}}_{i} in **eq. 1**), and found that VAEs trained by V1 activity at λ_{slow} ≠ 0 (we kept λ_{sparse} = 0.14 so that *α* ≈ 1 when λ_{slow} ≠ 0, see **Fig. 7b**) perform better on creativity and perception than those trained by V1 activity at λ_{slow} = 0.

Engineering neural networks to generate images with high-resolution details is difficult. VAE tends to generate blurred images [37]. Another well-known generative model, the generative adversarial network (GAN), also struggles when generating high-resolution details [38] due to a gradient problem [39]. Our results suggest that blurring these high-resolution details may result in better creativity and perception performance (**Figs. 4a** and **5c**). Implementing this idea in engineering VAEs or GANs for image generation, or deep predictive coding networks for image recognition, is an interesting research direction.

It is believed that mental creativity mostly involves the default mode network and the executive control network [27, 40], which include association cortical areas in the frontal, parietal and cingulate cortices. Our results suggest that low-order sensory cortices such as V1 also play an important role in, or are even designed for, the high-order cognitive task of creativity. This may not be so surprising: the cost function that VAE optimizes is the variational free energy [19], similar to that minimized during perception and learning [1], two cognitive processes that V1 also participates in. Therefore, creativity, perception and learning are different aspects of the same nature: free-energy minimization.

### The neural mechanism of V1 adaptation

In Ref. [18], animals were presented with stimulus ensembles of different statistics. Similarly, we also studied our model using stimulus ensembles of a low dimensionality *d* = 4. Consistent with our result with high-dimensional stimuli, the optimal λ_{sparse}, at which the exponent *α* is closest to the experimental value 1.62 and insensitive to the change of λ_{slow}, is obtained near the λ_{sparse} value that best whitens the V1 code at λ_{slow} = 0 (**Supplementary Information Section 2** and **Supplementary Fig. 2**). Therefore, V1 may be able to adapt itself to the statistics of the presented images by adjusting λ_{sparse}, on the time scale of the experiment [18] (i.e., minutes).

In our model, this adaptation to stimulus statistics is realized by adjusting λ_{sparse} to its optimal value before learning the generation weights **w**_{i} (**eq. 1**). λ_{sparse} has the physiological meaning of the neuronal firing threshold, and **w**_{i} is related to the feedforward and recurrent weights to and within V1 [41, 42]. So it is likely that V1 performs this adaptation by first adjusting global inhibition, and then adjusting feedforward and recurrent weights accordingly. This global inhibition may be adjusted by thalamocortical connections or neuromodulators: it has been found that inactivation (or excitation) of pulvinar neurons suppresses (or increases) the responses of superficial V1 neurons to visual input [43], and cholinergic axons from the basal forebrain depolarize cortical interneurons [44]. We found that the optimal λ_{sparse} for low-dimensional stimuli is smaller than that for high-dimensional stimuli (**Fig. 7f**, **Supplementary Fig. 2g**), which means that this global inhibition is weaker during the presentation of low-dimensional stimuli than during that of high-dimensional stimuli.

What neural mechanism does the brain use to guide λ_{sparse} to its optimal value? **Fig. 7g** shows that the reconstruction of PCs with small (or large) variance is impaired with the increase of λ_{sparse} when λ_{sparse} is below (or above) this optimal value. Consistently, the reconstruction error *ϵ* increases slowly with λ_{sparse} below the optimal value, but starts to soar up quickly beyond it (**Supplementary Information Section 3** and **Supplementary Fig. 3**). Therefore, it is possible that the brain monitors *ϵ* when adjusting λ_{sparse}, and stops the adjustment just at the point where *ϵ* starts to soar up with λ_{sparse}. Predictive coding theory suggests that *ϵ* is encoded by pyramidal neurons in superficial layers [1], and experiments found that *ϵ* is closely related to gamma-band frequencies [45, 46]. More detailed mechanistic insights require experimental studies.

## Methods

### Manifesting the advantage of complex cells using toy models

A variational auto-encoder (VAE) [19] contains two parts: an encoder and a decoder (**Fig. 1d**). The encoder is a feedforward network that receives input **x** and gives two channels of output, **μ** and log(**σ**^{2}). Then a random number **z** is generated according to the Gaussian distribution 𝒩(**μ**, diag(**σ**^{2})), and is input to the decoder, which outputs the reconstruction **x̂**. VAE is trained to minimize the following cost function:

$$E = \frac{1}{D_x}\left\|\hat{\boldsymbol{x}} - \boldsymbol{x}\right\|^2 + \lambda_{KL}\,\frac{1}{D_z}\,\mathrm{KL}\!\left[\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)\,\big\|\,\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\right], \tag{2}$$

where *D*_{x} is the dimension of the input and the output, and *D*_{z} is the dimension of the random variable **z**. Minimizing the first term of this equation reduces the reconstruction error; minimizing the second term (the KL-divergence of 𝒩(**μ**, **σ**^{2}) from the standard normal distribution 𝒩(**0**, **I**)) makes 𝒩(**μ**, **σ**^{2}) close to 𝒩(**0**, **I**). λ_{KL} is a parameter controlling the relative strengths of these two terms.

In the VAE used in **Figs. 1, 2**, the encoder was a multilayer perceptron (MLP) that had three hidden layers with sizes 100, 50 and 20, respectively. The input layer was of size 201, and the output layer had two channels, each of size 1. The decoder was another MLP that had three hidden layers with sizes 20, 50 and 100, respectively. Adjacent layers were all-to-all connected. We used leaky ReLU as the activation function of the hidden layers. The VAE was trained using the Adam optimizer [47].
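As a concrete reference, the cost of **eq. 2** and the reparameterization step can be sketched in a few lines of numpy. The per-dimension normalization by *D*_{x} and *D*_{z} follows the text; the function names and the closed-form diagonal-Gaussian KL term are our own illustration, not the trained implementation.

```python
import numpy as np

def vae_cost(x, x_hat, mu, log_var, lam_kl=1.0):
    """Sketch of eq. 2: reconstruction error per input dimension plus
    lambda_KL times KL(N(mu, sigma^2) || N(0, I)) per latent dimension."""
    d_x, d_z = x.size, mu.size
    recon = np.sum((x - x_hat) ** 2) / d_x
    # closed-form KL divergence for a diagonal Gaussian from N(0, I)
    kl = 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0) / d_z
    return recon + lam_kl * kl

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, diag(sigma^2)) via the reparameterization trick."""
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
```

Note that the KL term vanishes exactly when **μ** = **0** and **σ**^{2} = **1**, which is why minimizing it pulls the latent distribution toward the standard normal.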

In **Figs. 1, 2**, the inputs received by VAE (i.e., the outputs of the three models) are positioned on a one-dimensional line of 201 neurons. In Model 1, this input is a delta peak *f*_{1}(*x*; *a*) = δ_{x,a}, in which only a single neuron at position *a* has activity 1, whereas all the other neurons have zero activity. In Model 2, this input is a Gabor function

$$f_2(x; a) = C \exp\!\left(-\frac{(x-a)^2}{2\sigma^2}\right)\cos\!\left(\frac{2\pi(x-a)}{T}\right),$$

where *σ* = 10, *T* = 80, and *C* is a normalization factor such that max_{x} *f*_{2}(*x*; *a*) = 1. In Model 3, this input is a Gaussian function

$$f_3(x; a) = C \exp\!\left(-\frac{(x-a)^2}{2\sigma^2}\right),$$

where *σ* = 10 and *C* is a normalization factor such that max_{x} *f*_{3}(*x*; *a*) = 1. In *f*_{1}(*x*; *a*), *f*_{2}(*x*; *a*) and *f*_{3}(*x*; *a*), *a* is a random integer in the range [31, 171].
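The three model outputs can be generated with a short numpy sketch. The peak normalization and the parameters σ = 10 and T = 80 come from the text; the exact functional forms of the Gabor and Gaussian profiles are assumptions consistent with those parameters.

```python
import numpy as np

N = 201            # neurons on the one-dimensional line
SIGMA, T = 10.0, 80.0

def model1(a):
    """Model 1: delta peak, a single active neuron at position a."""
    f = np.zeros(N)
    f[a] = 1.0
    return f

def model2(a):
    """Model 2: Gabor profile centered at a, peak-normalized to 1."""
    x = np.arange(N)
    f = np.exp(-(x - a) ** 2 / (2 * SIGMA ** 2)) * np.cos(2 * np.pi * (x - a) / T)
    return f / f.max()

def model3(a):
    """Model 3: Gaussian bump centered at a, peak-normalized to 1."""
    x = np.arange(N)
    f = np.exp(-(x - a) ** 2 / (2 * SIGMA ** 2))
    return f / f.max()
```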

To quantify the quality of the generated patterns (**Fig. 2d**), for any generated pattern *p*, we defined its quality as max_{x} *R*^{2}(*p*, *s*(*x*)), where *s*(*x*) is the output pattern of Model-out in **Fig. 2d** in response to the wavelet stimulus at position *x*, and *R*^{2}(*p*, *s*(*x*)) is the ratio of the variance of *p* that can be explained by *s*(*x*) (i.e., the coefficient of determination).
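A minimal numpy version of this coefficient-of-determination score could read as follows; the linear fit of the generated pattern onto the template is an assumption about how the explained-variance ratio is computed.

```python
import numpy as np

def r_squared(p, s):
    """Fraction of the variance of pattern p explained by template s,
    under a least-squares linear fit p ~ alpha * s + beta (assumed form)."""
    A = np.vstack([s, np.ones_like(s)]).T   # design matrix [s, 1]
    coef, *_ = np.linalg.lstsq(A, p, rcond=None)
    resid = p - A @ coef
    return 1.0 - resid.var() / p.var()
```

The overall quality of a pattern would then be the maximum of `r_squared(p, s_x)` over all template positions *x*.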

In **Fig. 3**, the bottleneck state was optimized to minimize the error between the target pattern and the generated pattern using Adam optimizer [47].

### Manifesting the advantage of complex cells using skeleton MNIST dataset

The dataset in **Fig. 4a** is the skeleton MNIST dataset [25]. The intensity of a pixel in an image in this dataset is binary (1 or 0), depending on whether this pixel belongs to a line of 1-pixel width.

An image *I*_{2} in a transformed dataset was generated from an image *I*_{1} in the skeleton dataset in the following way. To determine the intensity *T*_{2}(*x*_{2}, *y*_{2}) of a pixel *I*_{2}(*x*_{2}, *y*_{2}) at the position (*x*_{2}, *y*_{2}) in *I*_{2}, we defined a box around the position (*x*_{2}, *y*_{2}) in *I*_{1}. We looked for a pixel (*x*_{1}, *y*_{1}) within this box such that its intensity *T*_{1}(*x*_{1}, *y*_{1}) = 1 and its distance *d* to (*x*_{2}, *y*_{2}) was minimized. Then we set *T*_{2}(*x*_{2}, *y*_{2}) = *a* exp(−*d*^{2}/2), where *a* = −1 if max(|*x*_{1} − *x*_{2}|, |*y*_{1} − *y*_{2}|) = 1, and *a* = 1 otherwise. If all pixels in the box had intensity 0, then *T*_{2}(*x*_{2}, *y*_{2}) = 0.

The other transformed dataset was generated in a similar way to the above, except that *a* = 1 all the time.

The VAE used in **Fig. 4** had a similar structure to that used in **Fig. 1**, except that the size of the input and output layers was 28 × 28 = 784, and the sizes of the three hidden layers of the encoder (or decoder) were 512, 256 and 128 (or 128, 256 and 512), respectively. The size of each of the two output channels of the encoder was 20.

The images generated by VAE were post-processed in two steps. First, images were binarized such that pixels with intensities larger (or smaller) than a threshold *θ _{thres}* were set to 1 (or 0). Second, the images were skeletonized using the ‘skeletonize’ routine of the skimage python package.
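Step one of this post-processing is a simple threshold; a numpy sketch is below. The skeletonization step would then call `skimage.morphology.skeletonize`, which operates on boolean images (not run here).

```python
import numpy as np

def binarize(img, theta_thres):
    """Step 1 of post-processing: pixels with intensity above
    theta_thres are set to 1, all others to 0."""
    return (img > theta_thres).astype(np.uint8)

# Step 2 (sketch, not executed): thin the binary image back to
# 1-pixel-wide lines, e.g.
#   skimage.morphology.skeletonize(binarize(img, theta_thres).astype(bool))
```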

To quantify the quality of the post-processed images, we trained an MLP to classify the skeleton MNIST dataset. This MLP contained a hidden layer of size 1000 with leaky-ReLU activation function. After receiving a post-processed image *I* generated by VAE, this MLP output a label distribution *p*(*x*|*I*) (*x* = 0, 1, ⋯, 9). In **Fig. 4d**, the realisticity of the generated images was defined using the average *E*_{I}[·] over all the generated images [28]; in **Fig. 4e**, the cross-category diversity was defined following Ref. [28]. To plot **Fig. 4f**, we first chose the generated post-processed images with high realisticity (i.e., max_{x} *p*(*x*|*I*) > 0.9); then, for all the images belonging to a category *x*, we calculated the variance λ_{i}(*x*) of the *i*th principal component (PC), and *D*_{incat} was defined from {λ_{i}(*x*)}_{i} following Ref. [29]. **Fig. 4d-f** show how the realisticity, the cross-category diversity and *D*_{incat} change with the binarization threshold *θ*_{thres} and the parameter λ_{KL} in **eq. 2**. Note that if *θ*_{thres} is high, the image after post-processing may be very sparse (i.e., only a few pixels are nonzero), especially when λ_{KL} also takes a large value. In this case, the MLP network has an artifact that *p*(*x*|*I*) strongly peaks at *x* = 1, and *p*(*x* ≠ 1|*I*) has very small values. Because of this artifact, in **Fig. 4d-f**, we excluded the data points at which the percentage of nonzero pixels in the post-processed images was smaller than 1%. Some data points (at λ_{KL} = 0.9 for one transformed dataset, and at λ_{KL} = 0.5, 0.7, 0.9 for the other) resulted in images with sparsity a little larger than 1%, but we also excluded these data points, because the quality of the generated images was very bad. These artifacts are weak for the remaining dataset in our range of parameters, so we plotted its whole parameter range.

**Fig. 4g** was plotted by gradually changing the bottleneck state of VAE from **z** = [1.5, 1.5, ⋯, 1.5] to [−1.5, −1.5, ⋯, −1.5].

### The generative model to explain the eigenspectrum of V1

To explain the eigenspectrum of V1, our working hypothesis is that V1 is trying to reconstruct a sequence of input images through a top-down pathway, keeping the activity of V1 as sparse and slowly changing as possible. Specifically, we minimized the following cost function:

$$E = \sum_t \Big\| \boldsymbol{I}_t - \sum_i x_{i,t}\,\boldsymbol{w}_i \Big\|^2 + \lambda_{sparse}\sum_{i,t}|x_{i,t}| + \lambda_{slow}\sum_{i,t}\left(x_{i,t} - x_{i,t-1}\right)^2, \tag{3}$$

where *I*_{t} is the input image at time *t*, *x*_{i,t} is the activity of the *i*th neuron in V1 at time *t*, **w**_{i} is the top-down reconstruction weight from neuron *i*, and λ_{sparse} and λ_{slow} respectively control the strengths of sparseness and temporal slowness. However, minimizing **eq. 3** for a long sequence {*I*_{t}}_{t} of images is computationally costly, so an approximation has to be used. Specifically, we minimized the following cost function for short sequences {*I*_{1}, *I*_{2}, *I*_{3}} of only three images using an EM algorithm:

$$E_{\mathrm{short\_seq}} = \sum_{t=1}^{3} \Big\| \boldsymbol{I}_t - \sum_i x_{i,t}\,\boldsymbol{w}_i \Big\|^2 + \lambda_{sparse}\sum_{i}\sum_{t=1}^{3}|x_{i,t}| + \lambda_{slow}\sum_{i}\sum_{t=2}^{3}\left(x_{i,t} - x_{i,t-1}\right)^2. \tag{4}$$

In the E-step, *E*_{short_seq} was minimized with respect to {*x*_{i,t}}_{i,t=1,2,3} using a fast iterative shrinkage-thresholding algorithm (FISTA) [48]; in the M-step, *E*_{short_seq} was minimized with respect to {**w**_{i}}_{i} using the Adam optimizer [47]. After training, we fixed {**w**_{i}}_{i} and inferred {*x*_{i,t}}_{i,t} from a given image sequence {*I*_{t}}_{t} in a Markovian manner: we inferred {*x*_{i,t}}_{i} temporally sequentially (i.e., starting from {*x*_{i,t}}_{i,t=1} to {*x*_{i,t}}_{i,t=2}, then to {*x*_{i,t}}_{i,t=3,…}); when inferring {*x*_{i,t}}_{i,t=T}, we fixed the values of {*x*_{i,t}}_{i,t<T}. The off-line training and on-line inference algorithms model the replay-driven plasticity [49] and the perception of V1, respectively. To calculate the eigenspectra (**Figs. 6a**, **7a-e**), we prepared a number of triplet sequences of three images {*I*_{1}, *I*_{2}, *I*_{3}} (see below), inferred the states in the Markovian manner above, and collected the states {*x*_{i,t}}_{i,t=3} corresponding to *I*_{3}. To calculate the curvature of the temporal trajectory of states (**Fig. 6d**), we prepared sequences of four images {*I*_{1}, *I*_{2}, *I*_{3}, *I*_{4}}, and calculated the curvature *c* of {*x*_{i,t}}_{i,t=2,3,4} by [13]

$$c = \arccos\!\left(\frac{(\boldsymbol{x}_3 - \boldsymbol{x}_2)\cdot(\boldsymbol{x}_4 - \boldsymbol{x}_3)}{\|\boldsymbol{x}_3 - \boldsymbol{x}_2\|\,\|\boldsymbol{x}_4 - \boldsymbol{x}_3\|}\right),$$

where we have denoted **x**_{t} = {*x*_{i,t}}_{i}. In our simulation, the size of the image *I*_{t} was 16 × 16 = 256, and the number of hidden units *x*_{i} was 257.
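Assuming that the curvature of Ref. [13] is the angle between successive difference vectors of the state trajectory, its computation is a few lines of numpy:

```python
import numpy as np

def trajectory_curvature(x2, x3, x4):
    """Curvature (in radians) of three consecutive state vectors:
    the angle between the difference vectors x3 - x2 and x4 - x3
    (assumed form of the definition in Ref. [13])."""
    v1, v2 = x3 - x2, x4 - x3
    cos_c = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # clip guards against round-off pushing |cos| slightly above 1
    return np.arccos(np.clip(cos_c, -1.0, 1.0))
```

A perfectly straight trajectory gives *c* = 0, and a right-angle turn gives *c* = π/2.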

#### Image sequence preparation

The image sequences {*I*_{t}}_{t} used to train the model (**eq. 4**) were prepared in the following way. We picked 100 images from van Hateren's natural image dataset [50], avoiding images that contained large areas of sky. We first took the logarithm of the intensities of the image pixels, following the suggestion of Ref. [50], and then partially whitened the images using the method in Ref. [4], modeling the image whitening by the retina or lateral geniculate nucleus in the upstream of V1 [30]. To get a short sequence {*I*_{1}, *I*_{2}, *I*_{3}} in **eq. 4**, we picked a 16 × 16 patch from the images preprocessed above, and slid the position of the patch window by the same vector Δ*P* = (Δ*X*, Δ*Y*) for two successive steps, where Δ*X* and Δ*Y* were randomly −1, 0 or 1. A caveat here is a boundary effect. To see this, suppose all pixels in *I*_{1} have zero intensity, but after the patch window moves by Δ**P**, the pixels on a boundary of *I*_{2} have strong intensities. In this case, *x*_{i,t=1} = 0 for all *i*'s, but |*x*_{i,t=2}| may be large, enlarging the cost for temporal slowness (i.e., the third term on the right-hand side of **eq. 4**). This large cost term does not represent a fast change of the stimulus itself, but is due to the sudden entrance of high-intensity pixels into the small patch window. To alleviate this boundary effect, we multiplied element-wise each image patch *I*_{t} by a 16 × 16 filter **F**. A pixel of **F** took the value 0.05, 0.24, 0.43, 0.62, 0.81 or 1, depending on whether its distance to the boundary was 1, 2, 3, 4, 5 or more than 5 pixels. This filtering also improves the biological plausibility of the model, because it means that the response of a neuron to a stimulus gradually decays (instead of sharply reducing to zero) when the stimulus moves away from the center of the receptive field.
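The taper **F** can be constructed directly from the stated values; here a pixel on the very edge of the patch is taken to have distance 1 to the boundary, which is an indexing assumption.

```python
import numpy as np

def boundary_filter(n=16):
    """The n x n boundary taper F: 0.05 at distance 1 from the boundary,
    rising through 0.24, 0.43, 0.62, 0.81 to 1 beyond distance 5."""
    vals = {1: 0.05, 2: 0.24, 3: 0.43, 4: 0.62, 5: 0.81}
    F = np.ones((n, n))
    for i in range(n):
        for j in range(n):
            # 1-based distance to the nearest patch boundary
            d = min(i, j, n - 1 - i, n - 1 - j) + 1
            F[i, j] = vals.get(d, 1.0)
    return F
```

Each patch would then be masked as `I_t * boundary_filter()` before entering the cost of **eq. 4**.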

#### Power-law exponent estimation

The power-law exponents of the eigenspectra of V1 states (**Figs. 6** and **7**) were estimated in the following way. We noted that the change of eigenvalue λ_{i} with the order *i* of principal component largely has three stages (**Figs. 6a** and **7c, d**): when *i* takes small values, the decay of λ_{i} with *i* is relatively slow and may exhibit zigzag fluctuations; when *i* takes intermediate values, the decay of λ_{i} with *i* is best approximated by a power law; when *i* takes large values, λ_{i} quickly decays with *i*. These three stages were also exhibited in experimental results [18]. The power-law exponent of an eigenspectrum was obtained by linearly fitting the intermediate stage in log-log scale. The intermediate stage was determined by eye; it varied slightly when λ_{sparse} in **eq. 4** took different values, and largely remained the same when λ_{slow} changed in the range of our study. When λ_{sparse} = 0.14 (which is the value in **Fig. 6a**), we let *i* ∈ [29, 109] be the intermediate range.
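A minimal numpy sketch of this estimation follows; the default range [29, 109] is the one quoted for λ_{sparse} = 0.14, PCs are 1-indexed, and the eigenvalues are assumed sorted in decreasing order.

```python
import numpy as np

def powerlaw_exponent(eigvals, i_lo=29, i_hi=109):
    """Estimate alpha by a linear fit of log(lambda_i) against log(i)
    over the intermediate range of PCs (eigvals sorted decreasing,
    PCs 1-indexed)."""
    i = np.arange(i_lo, i_hi + 1)
    lam = eigvals[i - 1]
    slope, _ = np.polyfit(np.log(i), np.log(lam), 1)
    return -slope   # alpha > 0 for a decaying spectrum
```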

#### Quantifying the whiteness of eigenspectrum

Suppose the variance of the *n*th principal component (PC) is *υ*_{n}. To quantify the whiteness of the eigenspectrum, we used two indexes: (1) *σ*_{1} = std_{n}(log(*υ*_{n})), which is the standard deviation of the logarithm of the variances; (2) *σ*_{2} = Σ_{n∈{first 20}} log(*υ*_{n}) − Σ_{n∈{last 20}} log(*υ*_{n}), which is the difference between the summation of log(*υ*_{n}) over the first 20 PCs and that over the last 20 PCs. In **Fig. 7f**, we used both *σ*_{1} and *σ*_{2}; in the inset of **Fig. 6b**, we used *σ*_{1}.
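Both indexes are straightforward to compute from the PC variances; a numpy sketch, assuming the variances are sorted in decreasing order:

```python
import numpy as np

def whiteness_indexes(v):
    """sigma_1 = std over n of log(v_n); sigma_2 = sum of log(v_n)
    over the first 20 PCs minus that over the last 20 PCs
    (v: PC variances sorted in decreasing order)."""
    log_v = np.log(v)
    sigma1 = log_v.std()
    sigma2 = log_v[:20].sum() - log_v[-20:].sum()
    return sigma1, sigma2
```

A perfectly white spectrum (all variances equal) gives *σ*_{1} = *σ*_{2} = 0, and both indexes grow as the spectrum decays more steeply.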

## Acknowledgements

Z.B. thanks Prof. Changsong Zhou and Prof. Shuzhi Sam Ge for comments on the manuscript and helpful discussions. Z.B. is supported by the NSF of China (Grant No. 32000694) and the start-up fund of the Institute for Future, Qingdao University.