Abstract
Hierarchical structures constitute a wide array of brain areas, including the visual system. An important question regarding visual hierarchical structures is which computational principles assign functions that represent the external world to them. Given that visual hierarchical structures contain both bottom-up and top-down pathways, the derived principles should encompass these bidirectional pathways. However, existing principles such as predictive coding do not provide an effective principle for bidirectional pathways. Therefore, we propose a novel computational principle for visual hierarchical structures, spatio-temporally efficient coding, which is underscored by the efficient use of given resources in both neural activity space and processing time. This coding principle optimises bidirectional predictions over hierarchical structures by simultaneously minimising temporal differences in neural responses and maximising entropy in neural representations. Simulations demonstrated that the proposed spatio-temporally efficient coding was able to assign the function of appropriate neural representations of natural visual scenes to visual hierarchical structures. Furthermore, spatio-temporally efficient coding was able to predict well-known phenomena, including deviations in neural responses to unfamiliar inputs and bias in preferred orientations. Our proposed spatio-temporally efficient coding may facilitate deeper mechanistic understanding of the computational processes of hierarchical brain structures.
Author Summary
The visual system in the brain is composed of a hierarchical structure in which neural signals pass along both bottom-up and top-down pathways. To account for how such a hierarchical structure attains a function to represent the external world, previous research has proposed diverse neural computational principles. Predictive coding, one of those principles, has recently attracted much attention and can assign representational functions to hierarchical structures, but its computation relies on hypothetical error units. The present study proposes a novel coding principle for hierarchical structures that does not require such hypothetical entities: spatio-temporally efficient coding, which underscores the efficient use of given resources. Spatio-temporally efficient coding optimises bidirectional predictions in hierarchical structures of the brain and can assign representational functions to visual hierarchical structures, without complex inference systems or hypothetical neuronal entities. We demonstrate that spatio-temporally efficient coding predicts well-known features of neural responses in the visual system, such as deviations in neural responses to unfamiliar inputs and a bias in preferred orientations.
1. Introduction
It is well-established that a wide array of brain areas have a hierarchical structure, including the visual system (Felleman and Van Essen, 1991; Mesulam, 1998; Harris et al., 2019; Hilgetag and Goulas, 2020). Studies have identified a link between hierarchical structures and gene expression (Burt et al., 2018; Hansen et al., 2021), suggesting that hierarchical structures are genetically determined a priori. Given that one of the major functions of the brain is to represent the external world (deCharms and Zador, 2000; Kriegeskorte and Diedrichsen, 2019), an ensuing question arises: How do a priori hierarchical brain structures attain functions to represent the external world? This question can be addressed by identifying a fundamental neural coding principle that assigns representational functions to hierarchical structures.
The traditional view of the neural coding of visual hierarchical structures is bottom-up visual information processing, whereby simple features are processed in a lower visual hierarchy and more complex features created by integrating simple features are processed in a higher visual hierarchy (Hubel and Wiesel, 1962; Hubel and Wiesel, 1968; Riesenhuber and Poggio, 1999; Riesenhuber and Poggio, 2000; Serre et al., 2007; DiCarlo et al., 2012; Yamins et al., 2014). However, this view does not consider the role of top-down pathways that are abundant even in the early visual system, such as the lateral geniculate nucleus (Murphy and Sillito, 1987; Wang et al., 2006) and primary visual cortex (Zhang et al., 2014; Muckli et al., 2015; Huh et al., 2018).
The role of top-down visual processing is especially prominent in predictive coding (Rao and Ballard, 1999; Spratling, 2017). According to predictive coding, a higher hierarchy performs top-down predictions of neural responses in a lower hierarchy. Both inference and learning in predictive coding are based on the minimisation of bottom-up prediction errors. Predictive coding has been used to explain the neural responses corresponding to prediction errors (Friston, 2005) and has been extended from explanations of perception to action (Friston, 2010; Clark, 2013). A recent study combined predictive coding with sparse coding (i.e., sparse deep predictive coding) and demonstrated that it could enhance perceptual explanatory power (Boutin et al., 2021).
Nevertheless, predictive coding has several theoretical shortcomings. Since inference in predictive coding aims to minimise prediction errors, the hierarchical structure would require an additional information processing subsystem to perform this inference. In addition, because bottom-up transmitted information contains only prediction errors, predictive coding requires the presence of error units (biological neurons) in the hierarchical structure to represent this prediction error, yet such error units remain as hypothetical entities and evidence for prediction error responses is limited in some conditions (Solomon et al., 2021).
A possible approach to overcome the shortcomings of both bottom-up processing and predictive coding in visual hierarchical structures is to make bottom-up predictions similarly to top-down predictions across hierarchies, instead of transmitting bottom-up prediction errors. These bidirectional predictions would realise both context-independent bottom-up predictions and context-dependent top-down predictions (Teufel and Fletcher, 2020). Such bidirectional predictions eliminate the necessity for hypothetical error units, while presumably elucidating the neural responses of hierarchical structures underlying bottom-up feature integration and top-down predictive coding. A neural coding principle underlying bidirectional predictions of hierarchical structures can be found in the theory of efficient coding that draws upon the efficient use of given resources (Laughlin, 2001; Bullmore and Sporns, 2012), which crucially include limited time resources related to processing speed (Griffiths et al., 2015; Lieder and Griffiths, 2020). A possible solution to promote the most efficient use of limited time resources by the bidirectional prediction system is to render present neural responses equal to future ones before the occurrence of future neural responses. This can be achieved by minimising the temporal differences between present and future neural responses. Accordingly, we consider this temporal difference minimisation as our learning principle, referred to as temporally efficient coding. A minimal temporal difference indicates that bidirectional prediction is accurate. Here, inference simply refers to a bidirectional prediction mediated by top-down and bottom-up pathways. Unlike inference in predictive coding, which requires further error minimisation, inference in temporally efficient coding involves simple monosynaptic transmission (i.e., a connection that passes via a single step rather than multiple steps).
Temporally efficient coding admits a trivial solution in which neural responses are invariant to changes in external events. We circumvent this issue by adding a complementary neural coding (learning) principle that maximises the informational entropy of neural responses. It maximises the neural response space available to represent the external world under the constraints of both the number of neurons and maximum firing rates. Maximal entropy coding indicates that the system uses spatial resources of neural responses efficiently (Attneave, 1954; Barlow, 1961; Laughlin, 1981), referred to as spatially efficient coding. By combining spatially efficient coding and temporally efficient coding, we propose a neural coding principle termed spatio-temporally efficient coding (Fig 1).
It is noteworthy that minimising temporal differences in upper hierarchies may, at first glance, be reminiscent of slow feature analysis (Wiskott and Sejnowski, 2002; Berkes and Wiskott, 2005; Creutzig and Sprekeler, 2008). However, the upper hierarchies of the proposed spatio-temporally efficient coding need to change dynamically to predict rapidly changing visual inputs and neural responses of the lower hierarchies, whereas the upper hierarchies of slow feature analysis need to change slowly. Minimising temporal differences in spatio-temporally efficient coding does not aim at extracting slow features, but rather aims at representing rapidly changing visual inputs while stabilising neural responses as quickly as possible.
To determine how a visual hierarchical structure with spatio-temporally efficient coding represents external visual events, we repeatedly exposed a visual hierarchical structure to natural scene images, enabling it to learn the parameters of bidirectional prediction networks between top-down and bottom-up hierarchies using spatio-temporally efficient coding (Fig 1). Using simulations, we observed that spatio-temporally efficient coding enabled the visual hierarchical structure to implement desirable neural representations. In addition, the visual hierarchical structure learned to represent natural scene images by spatio-temporally efficient coding and could predict well-known perceptual responses. First, the visual hierarchical structure exhibited deviant neural responses to unfamiliar visual inputs (i.e., neural responses to unfamiliar inputs were smaller or larger than those to familiar ones), indicating selectivity (Margoliash, 1983; Waydo et al., 2006) and/or familiarity suppression (Huang et al., 2018; Issa et al., 2018) of visual cortical neurons. Second, the visual hierarchical structure demonstrated simple cell-like receptive fields with preference for horizontal and vertical orientations over oblique orientations, resembling the orientation preference of visual cortical neurons (Furmanski and Engel, 2000; Li et al., 2003; Girshick et al., 2011). These simulation results demonstrated that the proposed spatio-temporally efficient coding could assign functions representing the external world to hierarchical structures in the visual system without theoretical complexity and hypothetical entities.
2. Results
2.1. Spatio-temporally efficient coding in visual hierarchical structures
In the present study, visual information processing in hierarchical structures was established as biologically inspired temporal processing. Specifically, visual information processing is described as a function ft for both image Ximage and neural responses Xh in each hierarchy h such that it maps Ximage and Xh at time t – 1 to those at time t, where Xh,t–1 and Xh,t are the neural responses Xh at time t – 1 and t, respectively, at hierarchy h. The function values of adjacent hierarchies Xh and Xh+1, as well as those for Ximage and Xh=1, are mutually influenced, representing the hierarchical structure. These influences are expressed as mathematical transformations based on synaptic weights, biases, and sigmoid nonlinearities added to the transformations (see Section 4.2 for details of the transformations). This function ft makes inferences using spatio-temporally efficient coding.
Learning in spatio-temporally efficient coding minimises the objectives of both temporally and spatially efficient coding. The objective of temporally efficient coding is given by: where ∘ is the function composition, Ximage=n indicates that the image is fixed to the nth sample throughout the temporal processing, squaring is applied component-wise, and the outer summation is taken over the range of f, where h = 0 denotes Xh=0 := Ximage. Minimising this objective minimises the temporal differences between present and future neural responses. The objective of spatially efficient coding is given by: where ft(·)|Xh indicates a restriction of the range of the function value ft(·) to Xh, and P(·) is a probability. Because the term within the summations is an estimate of negative informational entropy, minimising the objective maximises the informational entropy of ft. Finally, the objective of spatio-temporally efficient coding is a linear combination of those two objectives: L = LTemporal + λLSpatial, where λ is a regularisation parameter. A smaller λ indicates a greater emphasis on the temporally efficient coding objective, whereas a larger λ indicates the opposite.
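As a reading aid, the mapping and the objectives described verbally above can be written in one possible form as follows. This is a sketch under stated assumptions rather than the authors' exact notation: the shorthand Φ, the symbol K for the number of time steps (K = 5 in this study), the normalisation constants, and the choice to sum the spatial term over h = 1, …, H only are all assumptions. Only the final line, L = LTemporal + λLSpatial, is stated explicitly in Section 4.2.

```latex
% Sketch only: \Phi, K, the normalisation constants and the summation limits are assumptions.
\begin{align}
f_t &: X_{image} \times \prod_{h=1}^{H} X_{h,t-1} \;\to\; X_{image} \times \prod_{h=1}^{H} X_{h,t} \\
\Phi_t^{(n)} &:= \left. f_t \circ f_{t-1} \circ \cdots \circ f_1 \right|_{X_{image}=n} \\
L_{Temporal} &= \frac{1}{N(K-1)} \sum_{h=0}^{H} \sum_{n=1}^{N} \sum_{t=1}^{K-1} \sum_{i}
  \left( \Phi_{t+1}^{(n)}\big|_{X_h} - \Phi_{t}^{(n)}\big|_{X_h} \right)_i^{2} \\
L_{Spatial} &= \frac{1}{NK} \sum_{h=1}^{H} \sum_{n=1}^{N} \sum_{t=1}^{K}
  \log P\!\left( \Phi_{t}^{(n)}\big|_{X_h} \right) \\
L &= L_{Temporal} + \lambda L_{Spatial}
\end{align}
```

In this form, the temporal term vanishes when responses at successive time steps coincide, and the spatial term is a sample estimate of the expected log-probability, that is, of negative entropy.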
In the present study, the depth H of hierarchies was set to 2, the minimum depth to realise both bottom-up and top-down pathways in the same hierarchy (Fig 2A). For a given image, the duration of temporal processing of ft was given as 5 (i.e., t ∈ [1,5]) in both learning and inference, because four time steps are required for the image information to reach the top hierarchy and return over H = 2 hierarchies, in addition to one time step to obtain future neural responses. The minibatch size N was set to 40 (see Section 4.2. for more details).
In our simulations, we repeatedly exposed the hierarchical structure to natural scene images, which enabled it to learn the bidirectional predictions between top-down and bottom-up hierarchies using spatio-temporally efficient coding with a range of the balancing parameter λ. Successful learning was confirmed by observing that L decreased or stabilised during learning. Further, we verified that the learned hierarchical structure could successfully reconstruct an input image, as shown in Fig 2C.
2.2. Balancing between temporally and spatially efficient coding
The temporally efficient coding objective that minimises the temporal difference between present and future neural responses not only renders a neural representation suitable for the objective but also renders the visual information process temporally stable. In contrast, the spatially efficient coding objective that maximises the informational entropy of neural responses renders the process noisy. Based on our simulations, we confirmed this phenomenon by observing that neural responses were temporally stable with smaller λ (i.e., weighting toward temporally efficient coding), whereas they became noisy with larger λ (i.e., weighting toward spatially efficient coding) (Fig 2C). In addition, when the λ values were too small or too large (i.e., strongly biased to either temporally or spatially efficient coding), the hierarchical structure failed to represent visual images appropriately (Fig 2D), highlighting the need for an appropriate balance between spatially efficient coding and temporally efficient coding in the visual hierarchical structure.
A change in balance between the two objectives also altered the relative strengths between bottom-up and top-down synaptic connections. Simulation results demonstrated that top-down synaptic strengths from hierarchies 2 to 1 were increased compared to bottom-up synaptic strengths from images to hierarchy 1 as λ increased, with an emphasis on spatially efficient coding (Fig 2B, see Section 4.2 for details on the computation of synaptic strength). This finding suggests that the balance between spatially and temporally efficient coding could be related to the balance between bottom-up and top-down synaptic strengths in the brain.
2.3. Appropriate neural representations of the external world
As a major function of the brain is to represent the external world (deCharms and Zador, 2000; Kriegeskorte and Diedrichsen, 2019), we investigated whether spatio-temporally efficient coding creates appropriate neural representations of the external world. We quantified the appropriateness of neural representations by examining the correlations between a natural scene image space and a neural response space. Specifically, we first measured the Euclidean distances from a natural scene image (i.e., reference image) to other natural scene images in the image space and the distances from the neural responses for the reference image to those for the other images in the neural response space (Fig 3A). When calculating the distances between neural responses, we used the neural responses at five time steps after receiving a given image, based on the assumption that the neural responses became temporally steady after five steps of temporal processing (see Fig 1). We then measured Pearson’s linear correlation between the distances in the image space and those in the neural response space. A high correlation indicates that neural responses tend to be similar when the visual system perceives natural scenes close to each other (in the sense of Euclidean distance). We set each image sample as a reference and repeated the calculation of the correlation across all image samples. We defined global similarity as the average correlation obtained by using the distances from one image to all other images, and local similarity as the average correlation obtained by using the distances from one image to its neighbouring images only (1% of all images). For illustration, we visualised the distance from the reference image in Fig 3A to others by projecting the images onto a 2D space using t-distributed stochastic neighbour embedding (t-SNE) (van der Maaten and Hinton, 2008). We performed the same visualisation of the corresponding neural responses in each hierarchy. The distances in the 2D image space were largely preserved in the 2D neural response space (Fig 3B).
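The global and local similarity measures described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `similarity`, the array layout, and the handling of the 1% neighbourhood (at least two neighbours, selected by image-space distance) are assumptions.

```python
import numpy as np

def similarity(images, responses, neighbour_fraction=None):
    """Mean Pearson correlation between image-space and response-space distances.

    images:    (n_samples, n_pixels) array of flattened images
    responses: (n_samples, n_units) array, e.g. responses at the fifth time step
    neighbour_fraction: None for global similarity, e.g. 0.01 for local similarity
    """
    n = images.shape[0]
    correlations = []
    for ref in range(n):
        others = np.delete(np.arange(n), ref)
        # Euclidean distances from the reference to every other sample
        d_img = np.linalg.norm(images[others] - images[ref], axis=1)
        d_resp = np.linalg.norm(responses[others] - responses[ref], axis=1)
        if neighbour_fraction is not None:
            # Keep only the nearest images in image space (assumed rule)
            k = max(2, int(round(neighbour_fraction * n)))
            nearest = np.argsort(d_img)[:k]
            d_img, d_resp = d_img[nearest], d_resp[nearest]
        correlations.append(np.corrcoef(d_img, d_resp)[0, 1])
    return float(np.mean(correlations))

# Toy usage: a distance-preserving (orthogonal) map gives a similarity of about 1
rng = np.random.default_rng(0)
toy_images = rng.uniform(size=(100, 8))
rotation = np.linalg.qr(rng.normal(size=(8, 8)))[0]
toy_responses = toy_images @ rotation
global_sim = similarity(toy_images, toy_responses)
local_sim = similarity(toy_images, toy_responses, neighbour_fraction=0.05)
```

A value near 1 means that images that are close in pixel space also evoke close responses, which is the property reported in Fig 3C and 3D.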
From the simulations, we identified high global similarities (> 0.6) between the natural scene image space and neural response space (Fig 3C). Global similarities were higher for the lower hierarchy than for the higher hierarchy, and increased with smaller λ for the lower hierarchy (Fig 3C). Local similarities were close to 1 for all λs and hierarchies, demonstrating that the spatio-temporally efficient coding produced locally well-suited neural representations of natural scene images (Fig 3D).
Neural responses exhibited a certain amount of noise with spatio-temporally efficient coding, especially when λ was high (Fig 2C). We investigated whether this noise interfered with the neural representations of the visual images. To assess the interference by noise, we assumed that the neural responses converged to represent a given image after a certain number of time steps of bidirectional prediction. Here, we set the number of time steps to 10 and measured the variability in neural responses from time step 1 (i.e., when receiving an input image) to time step 9, before convergence at time step 10. Temporal variations were computed as the Euclidean distance from the neural responses at each time step (1, …, 9) to the neural responses at time step 10. We also computed the Euclidean distance between the neural responses at time step 10 and those for the image that was nearest to the given image in the image space. We then calculated the ratio of each temporal variation to this nearest-neighbour distance. A ratio exceeding 1 indicates that the noise could interfere with neural representations strongly enough for the given image to be confused with nearby images. Within the range of our simulation parameters (λ ∈ [10, 100]), the ratio remained below 1 on average for the majority of time steps at each hierarchy, even with higher values of λ (Fig 3E), indicating that the noise caused by maximising entropy with spatially efficient coding did not significantly interfere with neural representations over time.
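The ratio used in this analysis can be illustrated with a short sketch. The function name `noise_ratio` and the array layout are hypothetical, and using the response to the neighbouring image at its own final time step is an assumption, since the text does not specify which time step is used for the neighbour.

```python
import numpy as np

def noise_ratio(trajectory, neighbour_response):
    """Ratio of temporal variation to nearest-neighbour distance.

    trajectory:         (10, n_units) responses to one image at time steps 1..10
    neighbour_response: (n_units,) response to the nearest image in image space
                        (assumed here to be taken at its final time step)
    Returns an array of 9 ratios, one for each of time steps 1..9.
    """
    final = trajectory[-1]
    # Distance of each earlier time step from the response at time step 10
    temporal_variation = np.linalg.norm(trajectory[:-1] - final, axis=1)
    # Distance between the final response and the neighbour's response
    neighbour_distance = np.linalg.norm(final - neighbour_response)
    return temporal_variation / neighbour_distance

# Toy usage: a trajectory that settles towards 0.5, with a neighbour at 0.7
final_state = np.full(4, 0.5)
toy_trajectory = np.stack([final_state + 0.01 * (9 - k) for k in range(10)])
toy_ratios = noise_ratio(toy_trajectory, final_state + 0.2)
```

Ratios below 1 mean that the fluctuation over time is smaller than the gap to the closest competing representation.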
2.4. Deviant neural responses to unfamiliar inputs
The visual system often responds selectively to sensory inputs (Margoliash, 1983; Waydo et al., 2006). Even for the type of sensory inputs to which the visual system is responsive, unfamiliar inputs induce larger neural responses compared to familiar inputs (Huang et al., 2018; Issa et al., 2018). These large neural responses to unfamiliar inputs are thought to be due to prediction errors (Issa et al., 2018). Accordingly, we investigated whether spatio-temporally efficient coding could predict the phenomenon of large neural responses to unfamiliar inputs without the introduction of prediction error responses mediated by error units.
In the simulations, the visual hierarchical structure learned to be familiar with natural scene images, and novel handwritten digit images were used as unfamiliar visual inputs (Fig 4A). The simulation results revealed that neural responses were distributed over middle values for familiar images and over smaller or larger values for unfamiliar inputs (Fig 4B and 4C), suggesting that spatio-temporally efficient coding could predict the phenomenon of deviant neural responses to unfamiliar inputs.
Further, we investigated the neural responses to locally unfamiliar and globally familiar images. We composed new images such that a portion of only one quadrant of the image was unfamiliar, and the rest was familiar (Fig 5A). According to global precedence (Navon, 1977; Rezvani et al., 2020), such partially unfamiliar images are preferentially perceived as familiar ones. The simulation results demonstrated that deviant neural responses were predominantly localised to the hierarchy 1 units connected to unfamiliar portions of the images (Fig 5B). In addition, deviant neural responses rarely propagated to hierarchy 2 units (Fig 5C) and thus did not perturb hierarchy 1 units connected to familiar inputs in return (Fig 5D and 5E), consistent with the predictions of global precedence.
2.5. Preferred orientation biases of receptive fields
Neurons in the visual system prefer horizontal and vertical orientations over oblique orientations (Furmanski and Engel, 2000; Li et al., 2003). Indeed, orientation discrimination is more sensitive to horizontal and vertical orientations than to oblique orientations (Girshick et al., 2011). These biases are due to the environmental statistics of natural scenes (Girshick et al., 2011). We investigated whether hierarchy 1 units in the visual hierarchical structure that learned by spatio-temporally efficient coding of natural scene images exhibited such biases. We presented a moving bar oriented in one of eight angles that moved in the direction perpendicular to the orientation angle (Fig 6A) and defined the response of each unit to that orientation by the largest response during presentation. We then defined the preferred orientation of each unit as the orientation that elicited the largest response. Simulation results revealed that hierarchy 1 units preferred horizontal and vertical orientations over oblique orientations in all λ conditions (Fig 6B), consistent with the orientation bias of visual cortical neurons and in accordance with context-independent bottom-up prediction (Teufel and Fletcher, 2020).
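The definition of preferred orientation used here (the peak response over the presentation, then the orientation with the largest peak) can be sketched as below. The function name, the array layout, and the specific eight angles (0° to 157.5° in 22.5° steps) are assumptions; the text states only that eight angles were used.

```python
import numpy as np

def preferred_orientations(responses, angles_deg):
    """Preferred orientation of each unit.

    responses:  (n_orientations, n_time_steps, n_units) responses to moving bars
    angles_deg: sequence of n_orientations bar angles in degrees
    """
    # Response to each orientation = largest response during the presentation
    peak = responses.max(axis=1)                 # (n_orientations, n_units)
    # Preferred orientation = the orientation with the largest peak response
    return np.asarray(angles_deg)[peak.argmax(axis=0)]

# Toy usage with two units: one tuned to 0 degrees, one to 90 degrees
angles = np.arange(8) * 22.5                     # assumed set of eight angles
toy = np.zeros((8, 3, 2))
toy[0, 1, 0] = 0.9                               # unit 0 peaks for the 0-degree bar
toy[4, 2, 1] = 0.8                               # unit 1 peaks for the 90-degree bar
preferred = preferred_orientations(toy, angles)
```

A histogram of `preferred` across hierarchy 1 units would correspond to the kind of summary shown in Fig 6B.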
3. Discussion
The present study aimed to find computational principles that enable visual hierarchical structures to attain the function of representing external visual information. To address the lack of neural coding principles that encompass both bottom-up and top-down pathways, we propose spatio-temporally efficient coding as a novel computational model. As a principled way of efficiently using given resources in both neural activity space and processing time, this coding principle optimises bidirectional predictions over hierarchical structures by simultaneously minimising temporal differences in neural responses and maximising entropy in neural representations. Simulation results showed that the proposed spatio-temporally efficient coding assigned the function of appropriate neural representations of natural scenes to a visual hierarchical structure and that it could predict deviations in neural responses to unfamiliar inputs and a bias in preferred orientations, which are well-known characteristics of the visual system.
Cortical Hebbian-like spike-timing-dependent plasticity (STDP) (Feldman, 2000; Froemke and Dan, 2002) is a potential mechanism underlying the temporally efficient coding proposed in this study. If bidirectional predictions of the hierarchical structure increase future postsynaptic responses, STDP will strengthen these synaptic connections, which may increase future postsynaptic responses. Indeed, this phenomenon is underscored by a minimisation of the temporal difference in neural responses, rendering near future neural responses closer to distant future neural responses. As such, the concept of temporally efficient coding proposed herein is consistent with cortical Hebbian-like STDP.
Since its initial proposal (Attneave, 1954; Barlow, 1961), spatially efficient coding has been validated experimentally (Laughlin, 1981). However, observed correlations between neurons, which maximise entropy to a lesser extent compared to mere spatially efficient coding assuming no inter-neuronal correlations, have yet to be incorporated into the principle of spatially efficient coding. Empirically observed neuronal correlations may drive computational processes of the brain away from strict spatially efficient coding. Recent studies suggest that biological visual systems are intermediate between strict spatially efficient coding and correlated neural responses (Stringer et al., 2019). Therefore, to create biologically plausible computational models, it is necessary to mitigate the spatially efficient coding objective by combining firing-rate-dependent correlations (de la Rocha et al., 2007). This enables more accurate predictions of visual perception mediated by visual hierarchical structures. As we focused on integrating spatially efficient coding with temporally efficient coding for computation in hierarchical structures, this study did not incorporate the correlations between neurons in spatially efficient coding, which will be pursued in follow-up studies.
Based on our simulations, we observed that the learning of bidirectional prediction networks with spatio-temporally efficient coding was hindered when the balancing parameter λ was too small or too large (Fig 2D). Therefore, it was necessary to confine λ within a certain range, in which the magnitude of λ affected neural responses such that a larger λ rendered responses more variable (Fig 2C). Such increased variability is likely to originate from recurrent responses via higher hierarchies rather than the noise of neural responses. This was supported by the observation that top-down synaptic weights became larger than bottom-up synaptic weights when λ increased (Fig 2B). Although a large λ attenuated the appropriateness of neural representations (Fig 3), it rendered stronger top-down synaptic connections in hierarchy 1 (Fig 2B), which is consistent with the previous finding that top-down synaptic connections are stronger than bottom-up connections in the lateral geniculate nucleus (Sillito et al., 2006). As to why top-down synaptic weights increase with a larger λ value (Fig 2B), we speculate that learning via spatio-temporally efficient coding may increase the range of neural responses to maximise entropy through top-down pathways. While the bottom-up pathways originating from external inputs are invariant during learning, the top-down pathways originating from higher hierarchy neural responses can be adjusted more flexibly to maximise entropy during learning.
The deviant neural responses to unfamiliar inputs observed in this study (Fig 4 and 5) arise from compact, well-organised neural representations for familiar inputs (distributed over the middle value). As such, neural representations for familiar inputs extrude neural responses to unfamiliar inputs into a range of deviant neural responses. We conjecture that the visual system may generate deviant neural responses via a similar mechanism. Simulations in the present study demonstrated that neural responses were distributed around smaller or larger extremes for unfamiliar inputs and around intermediate values for familiar inputs (Fig 4B and 4C). In a separate analysis, we allowed the neural responses to familiar inputs to be distributed only around lower values, similar to sparse coding (Olshausen and Field, 1996; Olshausen and Field, 1997), and observed that neural responses to unfamiliar inputs only exhibited higher values.
Spatio-temporally efficient coding predicted a bias in preferred orientations (Fig 6). In this regard, spatially efficient coding alone has been reported to predict bias in preferred orientations (Ganguli and Simoncelli, 2014). Notably, spatio-temporally efficient coding was able to predict this bias well, even when λ was low, that is, when the spatially efficient coding objective was less weighted (Fig 6B). Therefore, this bias prediction should be viewed as a result of spatio-temporally efficient coding, not as a result of spatially efficient coding alone.
The present study has several limitations. First, for simplicity, our simulation model contained only two hierarchies. However, it is necessary to explore how spatio-temporally efficient coding operates in models with more hierarchies. We also modelled 64 neuronal units at each hierarchy, as we assumed that this would be sufficient to represent the natural scene images used in this study. Nevertheless, the interactions between the number of neuronal units, levels of hierarchy, and spatio-temporally efficient coding require further investigation. Second, we demonstrated that the visual hierarchical structure could learn to represent static natural scene images with spatio-temporally efficient coding, but future follow-up studies will investigate whether the visual hierarchical structure learns to represent moving scenes using the same coding principle. Finally, the scope of the present study was limited to the visual system given that its hierarchical structure is well documented, but spatio-temporally efficient coding may be applied to other systems (e.g., somatosensory system) or to movements and planning.
3.1. Conclusions
In the present study, we proposed spatio-temporally efficient coding, inspired by the efficient use of given resources in neural systems, as a neural coding mechanism to assign representational functions to the hierarchical structures of the visual system. Simulations demonstrated that the visual hierarchical structure could represent the external world (i.e., natural scenes) appropriately using bidirectional predictions (Fig 3). Furthermore, spatio-temporally efficient coding predicted the well-known properties of visual cortical neurons, including deviations in neural responses to unfamiliar images (Fig 4 and 5) and bias in preferred orientations (Fig 6). Our proposed spatio-temporally efficient coding may facilitate deeper mechanistic understanding of the computational processes of hierarchical brain structures.
4. Methods
4.1. Datasets
For the simulations, van Hateren’s natural scene image dataset (van Hateren and van der Schaaf, 1998) was used. The dataset was downloaded from http://bethgelab.org/datasets/vanhateren/. The images were downsized to 64 × 96 pixels. For the comparison tests, the MNIST handwritten digit dataset (Lecun et al., 1998) was used. The dataset was downloaded from http://yann.lecun.com/exdb/mnist/. The images were resized to 64 × 96 pixels to fit the images used in the simulations. All image data were rescaled between 0 and 1.
4.2. Detailed descriptions of spatio-temporally efficient coding
For convenience, Xh=0 denotes Ximage. The details of the visual information processing ft are as follows: If h > 0, then where Xh,t is an Xh value vector at time t, ft(·)|Xh,t indicates restricting the range of the function value ft(·) to Xh,t, Wh+1,h is a synaptic weight matrix from hierarchy h + 1 to h, T is the transpose of a matrix, bh is a bias vector at hierarchy h, and σ(·) is a sigmoid function. In the case of h + 1 > H, the term involving hierarchy h + 1 should be omitted. If h = 0, then where F(·) is a neural network with two hidden layers to enable detailed reconstruction of images. The activation function of the last layer of F(·) is a sigmoid function, as in the case of h > 0; in the hidden layers, it is a rectified linear unit.
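Based on the symbols defined above, one time step of ft for h > 0 might look like the following sketch. It assumes that each hierarchy receives a bottom-up term from hierarchy h − 1 and a top-down term from hierarchy h + 1, each through a transposed weight matrix, plus a bias, followed by a sigmoid. The function `step`, the dictionary layout, and the dense (fully connected) weights are hypothetical. The decoder F(·) for h = 0 and the patch-wise connectivity described later in this section are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def step(x_prev, w_up, w_down, b):
    """One assumed time step of f_t for hierarchies h = 1..H.

    x_prev:    list [X_0, X_1, ..., X_H] of state vectors at time t-1 (X_0 = image)
    w_up[h]:   matrix of shape (dim X_{h-1}, dim X_h), bottom-up weights into h
    w_down[h]: matrix of shape (dim X_{h+1}, dim X_h), top-down weights into h
    b[h]:      bias vector of shape (dim X_h,)
    """
    H = len(x_prev) - 1
    x_next = [x_prev[0]]              # the image is held fixed during processing
    for h in range(1, H + 1):
        drive = w_up[h].T @ x_prev[h - 1] + b[h]
        if h + 1 <= H:                # top-down term is omitted at the top hierarchy
            drive = drive + w_down[h].T @ x_prev[h + 1]
        x_next.append(sigmoid(drive))
    return x_next

# Toy usage: H = 2 hierarchies, five time steps, small random weights
rng = np.random.default_rng(0)
dims = [12, 6, 6]                      # toy sizes, not the sizes used in the study
state = [rng.uniform(size=dims[0]), np.full(dims[1], 0.5), np.full(dims[2], 0.5)]
w_up = {h: 0.1 * rng.normal(size=(dims[h - 1], dims[h])) for h in (1, 2)}
w_down = {1: 0.1 * rng.normal(size=(dims[2], dims[1]))}
b = {h: np.zeros(dims[h]) for h in (1, 2)}
for _ in range(5):
    state = step(state, w_up, w_down, b)
```

Each update uses only the states at the previous time step, so inference is a single pass of monosynaptic transmission per step, without an inner error-minimisation loop.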
To minimise the objective of spatially efficient coding, it is necessary to calculate the probability P(ft|Xh). Instead of calculating the exact probabilities, we obtained pre-normalised densities in the sense of probabilities without a partition function. As the value of the partition function is fixed, it does not affect the minimisation process. Kernel density estimation was used to obtain P(ft|Xh). Using a Gaussian kernel with width 0.1(dim Xh)^(1/2), the neural response density Q(ft|Xh) and compensation density Q′(ft|Xh) at ft|Xh ∈ Xh were obtained. Then, the pre-normalised density of interest is
The compensation density Q′(ft|Xh) is obtained using pseudo-uniformly generated samples on Xh = [0, 1]^(dim Xh) instead of the neural responses. The compensation density is necessary to compensate for the non-uniform intrinsic expectation of Q(·) resulting from the fact that Xh is bounded.
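The density estimation described above can be sketched as follows. This illustration assumes that the pre-normalised density is the ratio Q/Q′ and that the kernel width is the standard deviation of the Gaussian; neither is confirmed by the text. The function names, the number of uniform samples, and the random seed are hypothetical.

```python
import numpy as np

def kernel_density(points, samples, width):
    """Unnormalised Gaussian kernel density of `samples`, evaluated at `points`."""
    sq_dist = ((points[:, None, :] - samples[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * width ** 2)).mean(axis=1)

def pre_normalised_density(responses, n_uniform=2000, seed=0):
    """Assumed form Q / Q' at each response vector (rows of `responses`)."""
    dim = responses.shape[1]
    width = 0.1 * np.sqrt(dim)                        # 0.1 (dim X_h)^(1/2)
    # Pseudo-uniform samples on [0, 1]^dim for the compensation density Q'
    uniform = np.random.default_rng(seed).uniform(size=(n_uniform, dim))
    q = kernel_density(responses, responses, width)   # response density Q
    q_comp = kernel_density(responses, uniform, width)
    return q / q_comp

def spatial_objective(responses):
    """Mean log pre-normalised density: an estimate of negative entropy."""
    return float(np.mean(np.log(pre_normalised_density(responses))))

# Toy usage: tightly clustered responses have lower entropy than spread-out ones
rng = np.random.default_rng(1)
clustered = np.clip(0.5 + 0.02 * rng.normal(size=(200, 4)), 0.0, 1.0)
spread = rng.uniform(size=(200, 4))
objective_clustered = spatial_objective(clustered)
objective_spread = spatial_objective(spread)
```

Under these assumptions, dividing by Q′ corrects for the fact that a kernel centred near the boundary of the unit cube loses part of its mass outside the cube, which would otherwise make uniformly spread responses look non-uniform.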
In the present study, the depth of the hierarchies was 2, and images with 64 × 96 size were divided into four overlapping 38 × 58 patches. Each patch was connected to 16 of 64 hierarchy 1 units. Further, 64 hierarchy 1 units were fully connected to 64 hierarchy 2 units (Fig 1A). Four neural networks F(·) were connected from hierarchy 1 units to four image patches. Each neural network F(·) had two hidden layers with 128 and 1024 units, respectively.
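The division of a 64 × 96 image into four overlapping 38 × 58 patches can be illustrated as below. Anchoring the patches at the four image corners is an assumption (the text gives only the sizes), and `corner_patches` is a hypothetical helper.

```python
import numpy as np

def corner_patches(image, patch_shape=(38, 58)):
    """Four overlapping patches anchored at the image corners (assumed layout)."""
    rows, cols = image.shape
    ph, pw = patch_shape
    return [image[r:r + ph, c:c + pw]
            for r in (0, rows - ph)      # top and bottom
            for c in (0, cols - pw)]     # left and right

# Toy usage on a 64 x 96 array
toy_image = np.arange(64 * 96, dtype=float).reshape(64, 96)
patches = corner_patches(toy_image)
row_overlap = 2 * 38 - 64                # rows shared by top and bottom patches
col_overlap = 2 * 58 - 96                # columns shared by left and right patches
```

With this layout, vertically adjacent patches share 12 rows and horizontally adjacent patches share 20 columns, and each patch would feed its own group of 16 hierarchy 1 units.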
Because L = LTemporal + λLSpatial is differentiable, the minimisation of the objective in spatio-temporally efficient coding was performed with gradient descent. The Adam optimiser (Kingma and Ba, 2015) was used to perform the stochastic gradient descent with momentum. The parameters of the Adam optimiser used in this study were α = 0.001, β1 = 0.9, β2 = 0.999, and ϵ = 10^−8. The optimisation ran for 10^4 iterations per repetition and was restarted for five repetitions. For each iteration, the duration of temporal processing ft was five (i.e., t ∈ [1,5]), and the minibatch size was 40.
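For reference, the standard Adam update with the parameter values listed above can be written as a short NumPy sketch. The toy quadratic objective and the function name `adam_minimise` are illustrative assumptions; in the study the gradient would be that of L with respect to the network parameters, computed on minibatches.

```python
import numpy as np

def adam_minimise(grad, theta, n_iter=10_000,
                  alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam updates (Kingma and Ba, 2015) for a gradient function."""
    m = np.zeros_like(theta)             # first-moment (momentum) estimate
    v = np.zeros_like(theta)             # second-moment estimate
    for t in range(1, n_iter + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)     # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy usage: minimise sum((theta - target)^2), whose gradient is 2 (theta - target)
target = np.array([0.5, -0.5])
theta_final = adam_minimise(lambda th: 2.0 * (th - target), np.zeros(2))
```

Because each Adam step has a magnitude of roughly α, the 10^4 iterations per repetition allow each parameter to move by up to about 10 units in total.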
Simulation codes for spatio-temporally efficient coding are available from [https://github.com/DuhoSihn/Spatio-temporally-efficient-coding] (Sihn, 2021).
Data availability statement
All simulation and analysis codes are available at: https://github.com/DuhoSihn/Spatio-temporally-efficient-coding and on Zenodo (doi: 10.5281/zenodo.5298182)
CRediT authorship contribution statement
Duho Sihn: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data Curation, Writing - Original Draft, Writing - Review & Editing, Visualization, Supervision, Project administration. Sung-Phil Kim: Writing - Original Draft, Writing - Review & Editing, Visualization, Supervision, Project administration, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper.
Acknowledgments
This research was supported by the Brain Convergence Research Programs of the National Research Foundation (NRF) funded by the Korean government (MSIT) (NRF-2019M3E5D2A01058328 and No. 2021M3E5D2A01019542).