Sparse coding models predict a spectral bias in the development of primary visual cortex (V1) receptive fields

It is well known that sparse coding models trained on natural images learn basis functions whose shapes resemble the receptive fields (RFs) of simple cells in the primary visual cortex (V1). However, few studies have considered how these basis functions develop during training. In particular, it is unclear whether certain types of basis functions emerge more quickly than others, or whether they all develop simultaneously. In this work, we train an overcomplete sparse coding model (Sparsenet) on natural images and find that there is indeed order in the development of its basis functions: basis functions tuned to lower spatial frequencies emerge earlier, and higher spatial frequency basis functions emerge later. We observe the same trend in a biologically plausible sparse coding model (SAILnet) that uses leaky integrate-and-fire neurons and synaptically local learning rules, suggesting that this result is a general feature of sparse coding. Our results are consistent with recent experimental evidence that the distribution of optimal stimuli for driving neurons to fire shifts towards higher frequencies during normal development in mouse V1. Our analysis of sparse coding models during training yields an experimentally testable prediction for V1 development: this shift may be due in part to higher spatial frequency RFs emerging later, although a global shift towards higher frequencies across all RFs may also play a role. We also find that at least two explanations could account for the order of RF development: 1) high frequency RFs require more information to be specified accurately, and thus may require more visual experience to learn; and 2) early development of low frequency RFs improves the sparseness and fidelity of the visual representation more than early development of high frequency RFs.

Author summary

We are interested in how visual neurons learn representations of the natural world.
In particular, we want to know whether certain visual features are learned by the visual cortex earlier in development than others. To address this question, we turn to a class of algorithms that can learn to represent natural scenes in a sparse fashion, with only a few neurons active at any given time (population sparseness). While sparse coding has been used extensively to model the response properties of neurons in the visual cortex, we use it here to arrive at a quantitative description of the way neurons might learn to encode visual information during development. We find that receptive fields (RFs) tuned to lower spatial frequencies develop earlier in our sparse coding models compared to high frequency RFs. If our prediction is accurate, such a description would provide a general framework for understanding the development of the functional properties of V1 neurons and serve as a guide for future experimental studies. It could also lead to new computational models that learn from input statistics, as well as advances in the design of devices that can augment or replace human vision.


Introduction
A central goal of systems neuroscience is to establish a precise quantitative description of how neurons learn to encode sensory stimuli. Simple cells in the primary visual cortex (V1) have well-studied response properties [1][2][3][4] and therefore offer a useful model system for understanding how these representations of the visual world are learned during development. In this work, we use computational models of neural encoding to understand how V1 simple cells learn to represent the visual world from a stream of visual input. While many response properties of V1 simple cells can emerge before eye-opening without the need for visual experience (e.g., orientation selectivity and ocular dominance), observations of changes in receptive field (RF) properties that depend on the nature of the visual environment suggest that plasticity in V1 is experience-dependent [5]. Experimental evidence also shows that early postnatal visual experience is necessary for natural scene representation and discriminability in V1 [6]. The process of learning to encode visual information in V1 has been modeled as an unsupervised learning problem in which neurons adapt their tuning properties in order to optimize some objective function based on the statistical structure of stimuli in the natural environment. One coding principle that has proven useful for understanding sensory representations is sparseness, which posits that the neural population should not only maximize fidelity to input stimuli, but also minimize the number of active units (L0 population sparseness) or the amount of neural activity across the population (L1 population sparseness) [7]. Sparseness is an appealing concept for biological systems, both in terms of conserving metabolic costs and in terms of efficiently representing natural scenes, which have sparse structure [8].
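The trade-off between fidelity and sparseness can be written, in the standard Sparsenet-style formulation (the notation below is illustrative and not taken verbatim from this paper's Methods), as:

```latex
% Sparse coding objective: reconstruction error plus a sparseness penalty.
%   x    : input image patch (vectorized)
%   \Phi : matrix whose columns are the basis functions
%   a    : coefficients (analogous to neural activations)
%   S    : sparseness cost, e.g. S(a_i) = |a_i| for L1 sparseness
E(\mathbf{a}, \Phi) \;=\; \left\| \mathbf{x} - \Phi \mathbf{a} \right\|_2^2
  \;+\; \lambda \sum_i S(a_i)
```

Training alternates between inferring the coefficients a for each patch and taking gradient steps on the basis functions Φ; the constant λ sets the relative weight of the sparseness term.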
Indeed, sparse coding models trained on natural image data to jointly optimize both fidelity to the input and sparseness have been shown to learn basis functions whose response properties replicate simple cell receptive fields (RFs) of V1 neurons [9][10][11]. Moreover, Hunt et al. have demonstrated that training sparse coding models with unnatural training images results in basis functions resembling the RFs that arise when animals are reared with abnormal visual input, suggesting that sparse coding is a feature of experience-dependent development [12]. Zylberberg and colleagues studied SAILnet, a biologically plausible sparse coding model, by tracking the changes in sparseness of the learned representations throughout model training [13].

In this work, we analyze sparse coding models during training to answer the following question: do some types of basis functions develop sooner than others, and if so, which ones? Experimental work demonstrates that over the course of development, the distribution of frequency tuning of V1 neurons shifts towards higher spatial frequencies, and this shift requires visual experience [14,15]. However, the question remains whether this shift is due to high spatial frequency RFs emerging later, after the early development of low spatial frequency RFs, or whether there is a global shift during development across all receptive fields towards higher spatial frequencies. We find that the Sparsenet model [9] predicts the former to be true: low spatial frequency basis functions tend to emerge earlier in training, and high spatial frequency basis functions tend to emerge later. We observe the same behavior for the SAILnet model [11] of sparse coding, which implements leaky integrate-and-fire neurons and synaptically local learning rules, suggesting both that this result is a general feature of sparse coding and that it is biologically plausible.

Data and pre-processing

The sparse coding models used in this work are trained on 16×16 patches drawn from 35 images in the van Hateren database [16]. The original images are 1024 rows × 1536 columns, with pixel values linearly proportional to intensity. They are pre-processed using the same procedure described in [17]. The full images are first transformed to log-intensity to account for background luminance [18]. The central 1024×1024 region of each image is extracted and the mean is subtracted to yield a pixel distribution that is roughly symmetric around zero. Each image is then whitened and lowpass filtered in the frequency domain by multiplying with the following filter:

W(⃗f) = |⃗f| exp(−(|⃗f|/f₀)ⁿ),

where ⃗f denotes the two-dimensional spatial frequency. The cutoff frequency f₀ is set to 200 cycles/image, and the steepness parameter n is set to 4 to produce a sharp cutoff without introducing ringing in the space domain [19]. This filter is chosen to attenuate low frequencies while boosting high frequencies. For natural images, which typically have power spectra that roughly obey a 1/f^(2−η) power law with 0 < η < 0.3 [20], the filter flattens out the power spectrum to approximately whiten the images. Consistent with this, we find that the filter tends to overcompensate, giving slightly more power to higher frequencies than to lower frequencies (S1 Fig). The central 512×512 region of the image frequency domain is then extracted and inverse Fourier transformed to yield a 512×512 image that is downsampled by a factor of two from the original. The set of images obtained this way is then multiplied by a single scale factor so that the variance of the entire ensemble is 1.0. The filenames of the exact images used are listed in [17].
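The pre-processing pipeline described above can be sketched as follows (a minimal NumPy sketch; the function names and the small log offset are our own, and the final ensemble-wide variance rescaling is omitted):

```python
import numpy as np

def whitening_filter(n_pix, f0=200.0, n=4):
    """Whitening/lowpass filter W(f) = |f| * exp(-(|f|/f0)^n)."""
    f1d = np.fft.fftfreq(n_pix) * n_pix          # frequencies in cycles/image
    FX, FY = np.meshgrid(f1d, f1d)
    f = np.sqrt(FX**2 + FY**2)
    return f * np.exp(-((f / f0) ** n))

def preprocess(image):
    """Log-transform, center, whiten, and 2x-downsample a 1024x1024 image."""
    img = np.log(image + 1e-6)                    # log-intensity (offset avoids log 0)
    img = img - img.mean()                        # roughly zero-centered pixels
    F = np.fft.fft2(img) * whitening_filter(img.shape[0])
    # Keep the central 512x512 block of the spectrum -> downsample by two.
    Fs = np.fft.fftshift(F)
    c = img.shape[0] // 2
    Fc = Fs[c - 256:c + 256, c - 256:c + 256]
    return np.real(np.fft.ifft2(np.fft.ifftshift(Fc)))
```

Because W(0) = 0, the output image has zero mean by construction; the filter's |f| factor boosts high frequencies while the exponential imposes the sharp cutoff at f₀.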
SAILnet represents each image patch with a population of leaky integrate-and-fire neurons coupled by inhibitory connections. If the net input to a neuron exceeds its threshold value in response to a given image, it will emit some number of spikes; these spike counts are analogous to the coefficients in Sparsenet. The model is trained with Hebbian and anti-Hebbian rules similar to those used in [23], with the additional constraint that learning is localized to each synapse, without information from any other synapses in the network.

The extent to which a basis function is learned at a given time step t in training is measured by the degree of similarity to its final learned shape at the final training time step T. We quantify this using cosine similarity (the cosine of the angle between two vectors) between a given basis function at t and that same basis function at T. This is expressed as

similarity(BF_t, BF_T) = (BF_t · BF_T) / (∥BF_t∥ ∥BF_T∥),

where BF_t denotes the basis function at a given time step t, BF_T denotes the final learned basis function, and ∥·∥ denotes the L2 norm. By definition, the maximum similarity between any two basis functions is 1, which indicates that they are equal up to a re-scaling of the pixel intensities. Two orthogonal basis functions have a similarity of 0.

We divide the basis functions into three equally sized categories (low, mid, or high frequency) based on the peak frequencies in their power spectra (see Methods for details). We plot the similarity over training time steps t and find that, on average, low frequency basis functions converge to their final learned shapes (reach a similarity of 1) first, followed by the mid frequency basis functions, with the high frequency basis functions converging last (Fig 1A).
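The similarity measure and the peak-frequency binning can be computed as in this short sketch (the helper names are our own, and the exact peak-frequency procedure in the Methods may differ in detail):

```python
import numpy as np

def similarity(bf_t, bf_T):
    """Cosine similarity between a basis function at step t and its final shape."""
    a, b = bf_t.ravel(), bf_T.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def peak_frequency(bf):
    """Peak spatial frequency (cycles/patch) of a basis function's power spectrum."""
    P = np.abs(np.fft.fft2(bf)) ** 2
    P[0, 0] = 0.0                                  # ignore the DC component
    iy, ix = np.unravel_index(np.argmax(P), P.shape)
    n = bf.shape[0]
    freqs = np.fft.fftfreq(n, d=1.0 / n)           # frequencies in cycles/patch
    return float(np.hypot(freqs[ix], freqs[iy]))
```

Sorting the basis functions by `peak_frequency` and splitting into three equal groups gives the low/mid/high categories; `similarity` is invariant to rescaling, which is why two basis functions that match up to pixel-intensity scale have similarity 1.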
The effect persists when binning the basis functions by the actual values of the peak frequency (number of cycles) in their power spectra, as opposed to binning into equally sized frequency categories.

We train SAILnet on the same whitened natural image data as Sparsenet for the same number of training iterations. We report the full learned SAILnet dictionary in S2 Fig. We find that on average, SAILnet also learns lower frequency basis functions earlier in training and higher frequency basis functions later in training (Fig 1C).

There are several reasons one might expect low frequency basis functions to develop earlier than high frequency basis functions. We analyze two of these here and demonstrate that they are consistent with our findings.

Higher frequency basis functions require more spatial precision to specify fully

The first potential reason for the observed spectral bias in training is that it requires more spatial precision to specify higher frequency basis functions, and therefore they may take more time to converge. We examine this by applying a small shift ϵ to the phase of each basis function along the direction of its maximum spatial gradient. We then compute similarity(BF, BF + ϵ), where BF + ϵ denotes the phase-shifted basis function, and find that high spatial frequency basis functions are changed more from their original shapes under ϵ-phase shifts (Fig 2A). The sensitivity of high frequency receptive fields to small perturbations suggests they may require more training data to converge to their final shapes.

Fig 2. To illustrate the sensitivity of a basis function's spatial pattern to the parameters one might use to specify its detailed shape, such as its exact location, we computed the similarity between a given basis function and the same pattern shifted by a small amount. Phase shifts result in larger perturbations for higher frequency representations. The distribution of similarity between each basis function and the same basis function under a 1-pixel phase shift, for each frequency category. Vertical lines denote the means for each category. Note that for both models, the mean of the low frequency histogram is closer to 1 (exactly the same under phase shift) than that of the mid frequency or high frequency histograms, as one would expect. (C) Early development of lower frequency basis functions yields faster optimization of the sparse coding objective function. For each of the three categories, the distribution of decreases in the Sparsenet objective function obtained by swapping one learned basis function from a fully converged dictionary into a randomly initialized model. Each objective function was evaluated on the same batch of 100 whitened natural image patches. A negative decrease corresponds to an increase in the objective function. Vertical lines denote the means for each category. Note that the low frequency histogram has a greater mean value than the mid frequency or high frequency histograms.
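The phase-shift sensitivity test can be illustrated with toy Gabor patches standing in for learned basis functions. Everything below, including the Gabor parameters, is our own illustration rather than the paper's actual procedure, and for simplicity we use a 1-pixel horizontal translation rather than a shift along the direction of maximum spatial gradient:

```python
import numpy as np

def similarity(a, b):
    """Cosine similarity between two patches."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def gabor(n, freq, phase=0.0, sigma=4.0):
    """Toy Gabor patch (a stand-in for a learned basis function)."""
    x = np.arange(n) - n / 2
    X, Y = np.meshgrid(x, x)
    envelope = np.exp(-(X**2 + Y**2) / (2 * sigma**2))
    return envelope * np.sin(2 * np.pi * freq * X / n + phase)

def shift_similarity(bf, eps=1):
    """Similarity between a basis function and a copy shifted by eps pixels."""
    return similarity(bf, np.roll(bf, eps, axis=1))

low, high = gabor(16, freq=2), gabor(16, freq=6)
# A 1-pixel shift perturbs the high frequency patch far more than the low one:
# a shift of delta pixels changes the carrier phase by 2*pi*freq*delta/n.
assert shift_similarity(low) > shift_similarity(high)
```

The same 1-pixel displacement corresponds to a much larger fraction of a cycle for the high frequency carrier, which is why its post-shift similarity drops further from 1.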

Early learning of low frequency basis functions leads to faster optimization
Another reason we might expect low frequency basis functions to develop sooner is that they lead to faster optimization. We examine this possibility by comparing the decrease in the Sparsenet objective function obtained by swapping a single learned basis function from a fully converged dictionary into a randomly initialized model (Fig 2C).

The agreement between Sparsenet and SAILnet further suggests that this observed spectral bias may be a feature of RF development in V1.

Our findings are consistent with experimental evidence that V1 becomes more attuned to higher spatial frequencies over the course of experience-dependent development. Chino et al. show a rapid increase in the mean optimal spatial frequency tuning of V1 neurons in macaque monkeys over the first four postnatal weeks [14]. Nishio et al. find that the overall distribution of optimal tuning of V1 neurons in mice shifts towards higher spatial frequencies from postnatal weeks 3-6. They also find that this shift in the distribution of optimal tuning towards higher frequencies does not occur for binocularly deprived animals, suggesting that visual experience is required for neurons to learn higher frequency representations [15].

Our results may provide additional insight into this phenomenon because we are able to observe the development of each individual basis function in the model, as opposed to just sampling from the distribution of tuning across the V1 neuronal population at different timepoints during training. In particular, we propose that this shift in the distribution is due to higher frequency receptive fields emerging later than the low frequency receptive fields. An alternative possibility is that it is due to a global shift across all receptive fields towards higher spatial frequencies during development. Future experimental work can help distinguish whether one or both of these explanations can account for the observations in [14,15]. It may be experimentally challenging to track individual neuronal receptive fields over the full course of development, which would be the ideal way to distinguish whether high frequency RFs emerge late in development, as opposed to all RFs shifting to higher frequencies over time. Whether or not this can be done, it should be possible to sample from the population of receptive fields at various points in development and estimate the relative proportions of low, mid, and high frequency RFs.

Basis function complexity and input data statistics could each account for spectral bias in RF development

We propose two candidate explanations for why lower frequency basis functions emerge earlier in sparse coding models. The first is a statement about the complexity of the learned basis functions: higher frequency functions require more spatial precision to specify, and therefore may require more training data to learn accurately (Fig 1). The second is a statement about the dataset: early development of lower frequency basis functions leads to larger decreases in the sparse coding objective function, suggesting that the data contains more statistically relevant features at lower frequencies.
As it is known that natural images exhibit a roughly 1/f² power spectrum [20], with more power concentrated at low frequencies, one might expect from the second mechanism above that learning lower frequencies first on natural image data is better for optimization.

However, we trained both models in this study on whitened natural images, which have approximately flat power spectra across all frequencies between our upper and lower frequency cutoffs. Moreover, our whitening procedure tends to overcompensate, boosting higher frequencies so that power increases with frequency. This makes our results even more surprising: if anything, because of the dominance of higher frequencies in the dataset, we would expect the model's spectral bias to lean towards learning higher frequencies first. This would seem to implicate the first mechanism over the second, since the first mechanism is not so obviously dependent on stimulus properties.

It could also be the case that some aspects of higher-than-second-order statistics in the data, not captured by the power spectrum, may induce a bias towards learning lower frequencies early in training. Similar phenomena have been observed in other statistical learning paradigms such as deep neural networks [24,25], suggesting that spectral bias towards lower frequencies may be a general characteristic of representation learning.

Finally, there are many factors potentially affecting spectral bias in biological development that we do not consider in our modeling. For example, the optics of the eyes of young animals often change during development after eye opening, so that spatial acuity increases over time [26]. Moreover, the RFs of neurons upstream of V1, such as in the retinas or the lateral geniculate nucleus of the thalamus, are undoubtedly changing during development. We also do not consider extensions of sparse coding such as the model of simple cell receptive field properties and topography described in [27], or the excitatory connections between cells that are a feature of the sparse coding model described in [28]. Future work could analyze the development of the basis functions in these extensions of sparse coding. Other models of V1, such as the probabilistic Bayesian update model [29], would also be interesting to explore in the context of development.

Our spectral bias prediction was derived in the context of experience-dependent development, during which neuronal tuning adapts to natural scene statistics. It is possible that this particular order of development may not hold for experience-independent development, such as occurs in V1 prior to eye opening. This question could be addressed by considering different input data to the model. One possible input could be internally generated spontaneous neural activity, such as retinal waves, which play a role in the wiring of circuitry in early visual areas. For example, Dähne and colleagues implement slow feature analysis to encode retinal wave signals and find that the learned features correspond to the shapes of V1 complex cells [30].