Abstract
Brains must represent the outside world so that animals survive and thrive. In early sensory systems, neural populations have diverse receptive fields structured to detect important features in inputs, yet significant variability has been ignored in classical models of sensory neurons. We model neuronal receptive fields as random, variable samples from parametrized distributions in two sensory modalities, using data from insect mechanosensors and neurons of mammalian primary visual cortex. We show that these random feature neurons perform a randomized wavelet transform on inputs which removes high frequency noise and boosts the signal. Our result makes a significant theoretical connection between the foundational concepts of receptive fields in neuroscience and random features in artificial neural networks. Further, these random feature neurons enable learning from fewer training samples and with smaller networks in artificial tasks. This structured random model of receptive fields provides a unifying, mathematically tractable framework to understand sensory encodings across both spatial and temporal domains.
1 Introduction
It has long been argued that the brain uses a large population of neurons to represent the world [93, 29, 78, 85]. In this view, sensory stimuli are encoded by the responses of the population, which are then used by downstream areas for diverse tasks, including learning, decision-making, and movement control. These sensory areas have different neurons responding to differing stimuli while also providing a measure of redundancy. However, we still lack a clear understanding of what response properties are well-suited for different sensory modalities.
One way to approach sensory encoding is by understanding how a neuron would respond to arbitrary stimuli. Experimentally, we typically present many stimuli to the animal, measure the responses of sensory neurons, then attempt to estimate some kind of model for how the neurons respond to an arbitrary stimulus. A common assumption is that the neuron computes a linear filter of the stimulus, which then drives spiking through a nonlinear spike-generating mechanism. Mathematically, this assumption can be summarized as the number of measured spikes for a stimulus x being equal to σ(wTx) for a weight vector w and nonlinearity σ. Here, the weights w define the filtering properties of the neuron, also known as its receptive field [81]. This model is known as a linear-nonlinear (LN) model [16], and it is also the most common form of artificial neuron in artificial neural networks (ANNs). LN models have been used extensively to describe the firing of diverse neurons in various sensory modalities of vertebrates and invertebrates. In the mammalian visual system, LN models have been used to characterize retinal ganglion cells [77], lateral geniculate neurons [18], and simple cells in primary visual cortex (V1) [43]. They have also been used to characterize auditory sensory neurons in the avian midbrain [45] and somatosensory neurons in the cortex [76]. In insects, they have been used to understand the response properties of visual interneurons [74], mechanosensory neurons involved in proprioception [27, 67], and auditory neurons during courtship behavior [19].
Given the stimulus presented and neural response data, one can then estimate the receptive fields of a population of neurons. Simple visual receptive fields have classically been understood as similar to wavelets with particular spatial frequency and angular selectivity [43]. In mechanosensory areas, receptive fields are selective to temporal frequency over a short time window [27]. Commonly, parametric modeling (Gabor wavelets [85]) or smoothing (regularization, etc. [64]) are used to produce “clean” receptive fields. Yet, the data alone show noisy receptive fields that are perhaps best modeled using a random distribution [8]. A key goal of this work is to understand why receptive fields have the structures that they do and how this structure relates to the kinds of stimuli that are relevant to the animal.
Modeling the filtering properties of a population of LN neurons as samples from a random distribution leads to the study of networks with random weights [73, 14, 51]. In machine learning (ML), such networks are known as random feature networks (RFNs) [11, 39, 69, 52]. The study of RFNs has rapidly gained popularity in recent years, in large part because it offers a theoretically tractable way to study the learning properties of ANNs where the weights are tuned using data [3, 2, 15]. When the RFN contains many neurons, it can approximate functions that live in a well-understood function space. This function space is called a reproducing kernel Hilbert space (RKHS), and it depends on the network details, in particular the distribution of the weights, i.e., of the receptive fields [58, 88, 70]. Learning can then be framed as approximating functions in this space from limited data.
Several recent works highlight the RFN theory’s usefulness for understanding learning in neural systems. Bordelon, Canatar, and Pehlevan, in a series of papers, have shown that neural codes allow learning from few examples when the spectral properties of their second-order statistics align with the spectral properties of the task [9, 10, 13]. When applied to V1, they found that the neural code is aligned with tasks that depend on low spatial frequency components. Harris constructed an RFN model of sparse networks found in associative centers like the cerebellum and insect mushroom body and showed that these areas may behave like additive kernels [34], an architecture also considered by Hashemi et al. [35]. These classes of kernels are beneficial for learning in high dimensions because they can learn from fewer examples and remain resilient to input noise or adversarial perturbation. Xie et al. investigated the relationship between the fraction of active neurons in a model of the cerebellum (controlled by neuron thresholds) and generalization performance for learning movement trajectories [89]. In the vast majority of network studies with random weights, these weights w are drawn from a Gaussian distribution with independent entries. This sampling is equivalent to a fully unstructured receptive field, which looks like white noise.
Closely related to our work, a previous study of ANNs showed that directly learning structured receptive fields could improve image classification in deep networks [41]. Their receptive fields were parametrized as a sum of Gaussian derivatives up to fourth order. This led to better performance against rival architectures in low data regimes.
In this paper, we study the effect of having structured yet random receptive fields and how they lead to informative sensory encodings. Specifically, we consider receptive fields generated by a Gaussian process (GP), which can be thought of as drawing the weights w from a Gaussian distribution with a particular covariance matrix. We show that networks with such random weights project the input to a new basis and filter out particular components. This theory introduces realistic structure of receptive fields into random feature models which are crucial to our current understanding of artificial networks. Next, we show that receptive field datasets from two disparate sensory systems, mechanosensory neurons on insect wings and V1 cortical neurons from mice and monkeys, are well-modeled by GPs with covariance functions that have wavelet eigenbases. Given the success of modeling these data with the GP, we apply these weight distributions in RFNs that are used in synthetic learning tasks. We find that these structured weights improve learning by reducing the number of training examples and the size of the network needed to learn the task. Thus, structured random weights offer a realistic generative model of the receptive fields in multiple sensory areas, which we understand as performing a random change of basis. This change of basis enables the network to represent the most important properties of the stimulus, which we demonstrate to be useful for learning.
2 Results
We construct a generative model for the receptive fields of sensory neurons and use it for the weights of an ANN. We refer to such a network as a structured random feature network. In Section 2.1, we review the basics of random feature networks, the details and rationale behind our generative model, and the process by which we generate hidden weights. Our main theory result is that networks with such weights transform the inputs into a new basis and filter out particular components, thus bridging sensory neuroscience and the theory of neural networks. In Section 2.2, we show that neurons in two receptive field datasets (insect mechanosensory neurons and mammalian V1 cortical neurons) are well-described by our generative model. There is a close resemblance between the second-order statistics, sampled receptive fields, and their principal components for both data and model. Finally, in Section 2.3 we show the performance of structured random feature networks on several synthetic learning tasks. The hidden weights from our generative model allow the network to learn from fewer training examples and with smaller network sizes.
2.1 Theoretical analysis
We consider receptive fields generated by GPs in order to connect this foundational concept from sensory neuroscience to the theory of random features in artificial neural networks. GPs can be thought of as samples from a Gaussian distribution with a particular covariance matrix, and we initialize the hidden weights of RFNs using these GPs. We show that using a GP causes the network to project the input into a new basis and filter out particular components. The basis itself is determined by the covariance matrix of the Gaussian, and is useful for removing irrelevant and noisy components from the input. We use these results to study the space of functions that RFNs containing many neurons can learn by connecting our construction to the theory of kernel methods.
2.1.1 Random feature networks
We start by introducing the main learning algorithm and the neuronal model of our work, the RFN. Consider a two-layer, feedforward ANN. Traditionally, all the weights are initialized randomly and learned through backpropagation by minimizing some loss objective. In sharp contrast, RFNs have their hidden layer weights sampled randomly from some distribution and fixed. Each hidden unit computes a random feature of the input, and only the output layer weights are trained (Fig. 1).
Mathematically, the hidden layer activations and output are given by

$$h = \sigma(Wx), \qquad \hat{y} = \beta^T h + \beta_0, \tag{1}$$

where $x \in \mathbb{R}^d$ is the stimulus, $h \in \mathbb{R}^m$ are the hidden neuron responses, and $\hat{y}$ is the predicted output. We use a rectified linear (ReLU) nonlinearity, σ(x) = max(0, x) applied entrywise in (1). The hidden layer weights $W \in \mathbb{R}^{m \times d}$ are drawn randomly and fixed. Only the readout weights β0 and β are trained in RFNs.
In our RFN experiments, we train the readout weights and offset using a support vector machine (SVM) classifier with squared hinge loss and ℓ2 penalty with regularization strength of 1. Our RFNs do not include a threshold for the hidden neurons.
In the vast majority of studies with RFNs, each neuron’s weights are initialized i.i.d. from a spherical Gaussian distribution $w \sim \mathcal{N}(0, I_d)$. We will call networks built this way classical unstructured RFNs (Fig. 1). We propose a variation where hidden weights are initialized $w \sim \mathcal{N}(0, C)$, where $C \in \mathbb{R}^{d \times d}$ is a positive semidefinite covariance matrix. We call such networks structured RFNs (Fig. 1), to mean that the weights are random with a specified covariance. To compare unstructured and structured weights on equal footing, we normalize the covariance matrices so that Tr(C) = Tr(Id) = d, which ensures that the mean square amplitude of the weights is $\mathbb{E}\|w\|^2 = d$.
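The sampling and normalization described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the smooth covariance used here is a hypothetical placeholder, and the helper names (`normalize_trace`, `sample_weights`, `hidden_features`) are our own. Weights are drawn through the eigendecomposition of C, which avoids numerical issues when C is nearly singular.

```python
import numpy as np

def normalize_trace(C):
    """Rescale a covariance matrix so that Tr(C) = d."""
    return C * (C.shape[0] / np.trace(C))

def sample_weights(C, m, rng):
    """Draw m weight vectors w ~ N(0, C) via the eigendecomposition of C."""
    evals, evecs = np.linalg.eigh(C)
    scale = np.sqrt(np.clip(evals, 0.0, None))   # guard tiny negative eigenvalues
    Z = rng.standard_normal((m, C.shape[0]))
    return (Z * scale) @ evecs.T                 # rows are weight vectors

def hidden_features(W, x):
    """ReLU random features h = max(0, W x), with no threshold."""
    return np.maximum(0.0, W @ x)

rng = np.random.default_rng(0)
d, m = 50, 20000

# Hypothetical smooth covariance, chosen only for illustration.
t = np.arange(d)
C = normalize_trace(np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / 3.0 ** 2))

W_struct = sample_weights(C, m, rng)         # structured weights
W_unstruct = rng.standard_normal((m, d))     # classical unstructured weights

# Both sample averages estimate E||w||^2, which equals Tr(C) = d.
mean_sq_struct = np.mean(np.sum(W_struct ** 2, axis=1))
mean_sq_unstruct = np.mean(np.sum(W_unstruct ** 2, axis=1))

# Hidden responses of the structured network to one random stimulus.
h = hidden_features(W_struct, rng.standard_normal(d))
```

Because E‖w‖² = Tr(C), fixing the trace to d gives structured and unstructured weights the same expected squared norm, so any difference in downstream performance reflects the shape of the covariance rather than its overall scale.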
2.1.2 Receptive fields modeled by linear weights
Sensory neurons respond preferentially to specific features of their inputs. This stimulus selectivity is often summarized as a neuron’s receptive field, which describes how features of the sensory space elicit responses when stimulated [81]. Mathematically, receptive fields are modeled as a linear filter in the stimulus space. Linear filters are also an integral component of the widely used LN model of sensory processing [16]. According to this model, the firing rate of a neuron is a nonlinear function applied to the projection of the stimulus onto the low-dimensional subspace of the linear filter.
A linear filter model of receptive fields can explain responses of individual neurons to diverse stimuli. It has been used to describe disparate sensory systems like visual, auditory, and somatosensory systems of diverse species including birds, mammals, and insects [77, 74, 45, 19, 76]. If the stimuli are uncorrelated, the filters can be estimated by computing the spike triggered average (STA), the average stimulus that elicited a spike for the neuron. When the stimuli are correlated, the STA filter is whitened by the inverse of the stimulus covariance matrix [63]. Often these STAs are denoised by fitting a parametric function to the STA [16], such as Gabor wavelets for simple cells in V1 [43].
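As a concrete illustration of STA estimation, the sketch below simulates a single LN neuron and recovers its filter. Everything here is an assumption made for the example: the filter shape, the ReLU nonlinearity, Poisson spike counts, and the sample sizes are hypothetical and are not taken from the datasets discussed in this paper.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 20, 20000                       # stimulus dimension, number of stimuli

# Hypothetical ground-truth filter: a decaying oscillation with unit norm.
t = np.arange(d)
w_true = np.exp(-t / 5.0) * np.sin(2 * np.pi * t / 8.0)
w_true /= np.linalg.norm(w_true)

X = rng.standard_normal((n, d))        # uncorrelated (white) Gaussian stimuli
rate = np.maximum(0.0, X @ w_true)     # LN model: linear filter, then ReLU
spikes = rng.poisson(rate)             # spike count elicited by each stimulus

# Spike-triggered average: the spike-weighted mean stimulus.
sta = spikes @ X / spikes.sum()

# For correlated stimuli the STA is whitened by the inverse stimulus covariance.
# The stimuli here are white, so this step changes very little.
stim_cov = np.cov(X, rowvar=False)
sta_whitened = np.linalg.solve(stim_cov, sta)

# Cosine similarity between each estimate and the true filter.
cosine = sta @ w_true / np.linalg.norm(sta)
cosine_whitened = sta_whitened @ w_true / np.linalg.norm(sta_whitened)
```

With Gaussian stimuli the STA is proportional to the true filter in expectation, so the estimate improves as more spikes are collected. With few spikes the estimate is noisy, which is one reason measured receptive fields look variable.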
We model the receptive field of a neuron i as its weight vector wi and its nonlinear function as σ. Instead of fitting a parametric function, we construct covariance functions so that each realization of the resulting Gaussian process resembles a biological receptive field (Fig. 1).
2.1.3 Structured weights project and filter input into the covariance eigenbasis
We generate network weights from Gaussian processes (GP) whose covariance functions are inspired by the receptive fields of sensory neurons in the brain. By definition, a GP is a stochastic process where finite observations follow a Gaussian distribution [71]. We design the weight covariance of the GP so that our weights are compatible with our inputs, which is to say, the GP covariance function reflects the statistical regularities within the sensory inputs to the network. We find that networks with such weights project inputs into a new basis and filter out irrelevant components. In Section 2.3, we will see that this adds an inductive bias to classical RFNs and improves learning.
We view our weight vector w as the finite-dimensional discretization of a continuous function w(t) which is a sample from a GP. The continuous function has domain T, a compact subset of $\mathbb{R}^D$, and we assume that T is discretized using a grid of d equally spaced points {t1,…, td} ⊂ T, so that wi = w(ti). Let the input be a real-valued function x(t) over the same domain T, which could represent a finite timeseries (D = 1), an image of luminance on the retina (D = 2), or more complicated spatiotemporal sets like a movie (D = 3). In the continuous setting, the d-dimensional ℓ2 inner product gets replaced by the L2(T) inner product 〈w, x〉 = ∫t∈T w(t)x(t)dt.
Every GP is fully specified by its mean and covariance function C(t, t′). We will always assume that the mean is zero and study different covariance functions. By the Kosambi-Karhunen-Loève theorem [47], each realization of a zero-mean GP has a random series representation

$$w(t) = \sum_{i=1}^{\infty} \lambda_i z_i \phi_i(t) \tag{2}$$

in terms of standard Gaussian random variables $z_i \sim \mathcal{N}(0, 1)$, functions ϕi(t), and weights λi ≥ 0. The pairs $(\lambda_i^2, \phi_i)$ are eigenvalue, eigenfunction pairs of the covariance operator $\mathcal{C}: L^2(T) \to L^2(T)$, which is the continuous analog of the covariance matrix C. If C(t, t′) is positive definite, as opposed to just semidefinite, all $\lambda_i > 0$ and these eigenfunctions ϕi form a complete basis for L2(T). Using (2), the inner product between a stimulus and a neuron’s weights is

$$\langle w, x \rangle = \sum_{i=1}^{\infty} \lambda_i z_i \langle \phi_i, x \rangle. \tag{3}$$
Equation (3) shows that the structured weights compute a projection of the input x onto each eigenfunction 〈ϕi, x〉 and reweight or filter by the weight λi before taking the ℓ2 inner product with the random Gaussian weights zi.
It is illuminating to see what these continuous equations look like in the d-dimensional discrete setting. Samples from the finite-dimensional GP are used as the hidden weights in RFNs, $w \sim \mathcal{N}(0, C)$. First, the GP series representation (2) becomes w = ΦΛz, where Λ is the diagonal matrix of the weights λi, Φ is the matrix of eigenvectors, and $z \sim \mathcal{N}(0, I_d)$ is a Gaussian random vector. By the definition of the covariance matrix, $C = \mathbb{E}[w w^T]$, which is equal to $\Phi \Lambda^2 \Phi^T$ after a few steps of linear algebra. Finally, (3) is analogous to $w^T x = z^T \Lambda \Phi^T x$. Since Φ is an orthogonal matrix, $\Phi^T x$ is equivalent to a change of basis, and the diagonal matrix Λ shrinks or expands certain directions to perform filtering. This can be summarized in the following theorem:
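Theorem 1 (Basis change formula). Assume $w \sim \mathcal{N}(0, C)$ with $C = \Phi \Lambda^2 \Phi^T$ its eigenvalue decomposition. For $x \in \mathbb{R}^d$, define

$$\tilde{x} = \Lambda \Phi^T x. \tag{4}$$

Then $w^T x = z^T \tilde{x}$ for $z \sim \mathcal{N}(0, I_d)$.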
Theorem 1 says that projecting an input onto a structured weight vector is the same as first filtering that input in the GP eigenbasis and doing a random projection onto a spherical random Gaussian. The form of the GP eigenbasis is determined by the choice of the covariance function. If the covariance function is compatible with the input structure, the hidden weights filter out any irrelevant features or noise in the stimuli while amplifying the descriptive features. This inductive bias facilitates inference on the stimuli by any downstream predictor. Because the spherical Gaussian distribution is the canonical choice for unstructured RFNs, there is a simple way to evaluate the effective kernel of structured RFNs as $k_{\text{struct}}(x, x') = k_{\text{unstruct}}(\tilde{x}, \tilde{x}')$ (see Appendix A.1). This provides a concrete connection to the kernel theory of learning using nonlinear neural networks.
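The basis change can be checked numerically. The sketch below is a small illustration under our own assumptions: the covariance is an arbitrary random positive semidefinite matrix, and the closed-form ReLU kernel for spherical Gaussian weights is the standard arc-cosine kernel of order one, E[σ(zᵀu)σ(zᵀv)] = ‖u‖‖v‖(sin θ + (π − θ) cos θ)/(2π), where θ is the angle between u and v. It first verifies wᵀx = zᵀx̃ for one draw, then compares a Monte Carlo estimate of the structured kernel with the unstructured kernel evaluated at the filtered inputs.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
A = rng.standard_normal((d, d))
C = A @ A.T                                   # arbitrary covariance, illustrative only

evals, Phi = np.linalg.eigh(C)                # C = Phi diag(evals) Phi^T
lam = np.sqrt(np.clip(evals, 0.0, None))      # so that C = Phi diag(lam^2) Phi^T

def filt(x):
    """x_tilde = Lambda Phi^T x: change of basis followed by rescaling."""
    return lam * (Phi.T @ x)

# Check w^T x = z^T x_tilde for a single draw of z.
z = rng.standard_normal(d)
w = Phi @ (lam * z)                           # structured weight w = Phi Lambda z
x, x2 = rng.standard_normal(d), rng.standard_normal(d)
lhs, rhs = w @ x, z @ filt(x)

def relu_kernel(u, v):
    """Closed form of E[relu(z.u) relu(z.v)] for z ~ N(0, I)."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    theta = np.arccos(np.clip(u @ v / (nu * nv), -1.0, 1.0))
    return nu * nv * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

# Monte Carlo estimate of the structured kernel from sampled weights w ~ N(0, C).
W = (rng.standard_normal((200000, d)) * lam) @ Phi.T
k_mc = np.mean(np.maximum(0.0, W @ x) * np.maximum(0.0, W @ x2))

# Unstructured kernel evaluated at the filtered inputs.
k_formula = relu_kernel(filt(x), filt(x2))
```

The two kernel values agree up to Monte Carlo error, which is the practical content of the theorem: a structured RFN behaves like a classical RFN applied to inputs that have already been filtered in the covariance eigenbasis.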
2.2 Examples of random yet structured receptive fields
Our goal is to model the weights of artificial neurons in a way that is inspired by biological neurons’ receptive fields. Structured RFNs sample hidden weights from GPs with structured covariance, so we construct covariance functions that make the generated weights resemble neuronal receptive fields. We start with a toy example of a stationary GP with well-understood Fourier eigenbasis and show how the receptive fields generated from this GP are selective to frequencies in timeseries signals. Then, we construct locally stationary covariance models of insect mechanosensory and V1 neuron receptive fields. These models are shown to be a good match for experimental data.
2.2.1 Warm-up: frequency selectivity from stationary covariance
To illustrate some results from our theoretical analysis, we start with a toy example of temporal receptive fields that are selective to particular frequencies. This example may be familiar to readers comfortable with Fourier series and basic signal processing. Let the input be a finite continuous timeseries x(t) over the interval T = [0, L]. We use the covariance function

$$C(t, t') = \sum_{k=0}^{\infty} \lambda_k^2 \cos\big(\omega_k (t - t')\big), \tag{5}$$

where ωk = 2πk/L is the kth natural frequency and $\lambda_k \geq 0$ are the weight coefficients. The covariance function (5) is stationary, which means that it only depends on the difference between the timepoints t – t′. Applying the compound angle formula, we get

$$C(t, t') = \sum_{k=0}^{\infty} \lambda_k^2 \big[ \cos(\omega_k t) \cos(\omega_k t') + \sin(\omega_k t) \sin(\omega_k t') \big]. \tag{6}$$
Since the sinusoidal functions cos(ωkt) and sin(ωkt) form an orthonormal basis for L2(T) (up to normalization), (6) is the eigendecomposition of the covariance, where the eigenfunctions are sines and cosines with eigenvalues $\lambda_k^2$. From (2), we know that structured weights with this covariance form a random series:

$$w(t) = \sum_{k=0}^{\infty} \lambda_k \big( z_k \cos(\omega_k t) + z_k' \sin(\omega_k t) \big), \tag{7}$$

where each $z_k, z_k' \sim \mathcal{N}(0, 1)$. Thus, the receptive fields are made up of sinusoids weighted by λk and the Gaussian variables zk and z′k.
Suppose we want receptive fields that only retain specific frequency information of the signal and filter out the rest. Take λk = 0 for any k where ωk < flo or ωk > fhi. We call this a bandlimited spectrum with passband [flo, fhi] and bandwidth fhi – flo. As the bandwidth increases, the receptive fields become less smooth since they are made up of a wider range of frequencies. The smoothness is also controlled by the overall magnitude of the nonzero eigenvalues.
When these receptive fields act on input signals x(t), they implicitly transform the inputs into the Fourier basis and filter frequencies based on the magnitude of λk. In a bandlimited setting, any frequencies outside the passband are filtered out, which makes the receptive fields selective to a particular range of frequencies while ignoring the others. On the other hand, classical random features weight all frequencies equally, even though in natural settings high frequency signals are the most corrupted by noise.
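A short numerical sketch makes the filtering concrete. It assumes the stationary covariance C(t, t′) = Σk λk² cos(ωk(t − t′)) with λk = 1 inside the passband and 0 outside, and it applies the passband in Hz to the natural frequencies k/L. The window length, sampling rate, and passband are illustrative choices, not fitted values. For w ~ N(0, C), the response wᵀx has variance xᵀCx, so that quadratic form measures how strongly the population responds to an input.

```python
import numpy as np

fs, d = 2000.0, 200                  # 100 ms window sampled at 2 kHz (illustrative)
t = np.arange(d) / fs
freqs = np.arange(d // 2) * fs / d   # natural frequencies k / L in Hz: 0, 10, 20, ...

def bandlimited_cov(f_lo, f_hi):
    """C(t, t') = sum_k lam_k^2 cos(2 pi f_k (t - t')), lam_k = 1 inside the passband."""
    lam = ((freqs >= f_lo) & (freqs <= f_hi)).astype(float)
    diff = t[:, None] - t[None, :]
    C = np.zeros((d, d))
    for lam_k, f_k in zip(lam, freqs):
        if lam_k > 0:
            C += lam_k ** 2 * np.cos(2 * np.pi * f_k * diff)
    return C * d / np.trace(C)       # normalize so that Tr(C) = d

C = bandlimited_cov(10.0, 60.0)

x_in = np.sin(2 * np.pi * 50.0 * t)     # frequency inside the passband
x_out = np.sin(2 * np.pi * 200.0 * t)   # frequency outside the passband

# Response variance of structured weights: Var(w^T x) = x^T C x.
var_in = x_in @ C @ x_in
var_out = x_out @ C @ x_out

# Unstructured weights (C = I) respond equally to both inputs.
var_in_white, var_out_white = x_in @ x_in, x_out @ x_out
```

The out-of-band response variance vanishes up to floating point error, while the in-band variance is amplified relative to white-noise weights of the same total power.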
2.2.2 Insect mechanosensors
We next consider a particular biological sensor that is sensitive to the time-history of forces. Campaniform sensilla (CS) are dome-shaped mechanoreceptors that detect local stress and strain on the insect exoskeleton [21]. They are embedded in the cuticle and deformation of the cuticle through bending or torsion induces depolarizing currents in the CS by opening mechanosensitive ion channels. The CS encode proprioceptive information useful for body state estimation and movement control during diverse tasks like walking, kicking, jumping, and flying [21].
We will model the receptive fields of CS that are believed to be critical for flight control, namely the ones found at the base of the halteres [92] and on the wings [67] (Fig. 2A). Halteres and wings flap rhythmically during flight, and rotations of the insect’s body induce torsional forces that can be felt on these active sensory structures. The CS detect these small strain forces, thereby encoding angular velocity of the insect body [92]. Experimental results show haltere and wing CS are selective to a broad range of oscillatory frequencies [26, 67], with STAs that are smooth, oscillatory, selective to frequency, and decay over time [27] (Fig. 2B).
We model these temporal receptive fields with a locally stationary GP [31] with bandlimited spectrum. Examples of receptive fields generated from this GP are shown in Fig. 2C. The inputs to the CS are modeled as a finite continuous timeseries x(t) over the finite interval T = [0, L]. The neuron weights are generated from a covariance function

$$C(t, t') = \exp\left(-\frac{t + t'}{\gamma}\right) \sum_{k=0}^{\infty} \lambda_k^2 \cos\big(\omega_k (t - t')\big), \qquad \lambda_k = \begin{cases} 1 & \text{if } f_{lo} \leq \omega_k \leq f_{hi}, \\ 0 & \text{otherwise}, \end{cases} \tag{8}$$

where ωk = 2πk/L is the kth natural frequency. As in Section 2.2.1, the frequency selectivity of the weights is accounted for by the parameters flo and fhi. As the bandwidth fhi – flo increases, the receptive fields are built out of a wider selection of frequencies. This makes the receptive fields less smooth (Fig. 2D). Each field is localized near t = 0, and its decay with t is determined by the parameter γ. As γ increases, the receptive field is selective to larger time windows.
The eigenbasis of the covariance function (8) is similar to a Fourier eigenbasis modulated by a decaying exponential. The eigenbasis is an orthonormal basis for the span of λke−t/γ cos(ωkt) and λke−t/γ sin(ωkt), which are a non-orthogonal set of functions in L2(T). The hidden weights transform timeseries inputs into this eigenbasis and discard components outside the passband frequencies [flo, fhi].
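The following sketch generates mechanosensor-like weights. It assumes the covariance takes the form exp(−(t + t′)/γ) Σk cos(ωk(t − t′)), summed over natural frequencies inside the passband, which is consistent with the eigenbasis described above. The passband and decay values are those reported for the hawkmoth fit later in this section. The 10 kHz sampling rate and the function names are our own choices to keep the example small.

```python
import numpy as np

def mechanosensor_cov(d, fs, f_lo, f_hi, gamma):
    """Locally stationary, bandlimited covariance on a grid of d samples at rate fs."""
    t = np.arange(d) / fs
    freqs = np.arange(d // 2) * fs / d            # natural frequencies k / L in Hz
    band = freqs[(freqs >= f_lo) & (freqs <= f_hi)]
    if band.size == 0:
        raise ValueError("passband contains no natural frequency")
    diff = t[:, None] - t[None, :]
    stationary = sum(np.cos(2 * np.pi * f * diff) for f in band)
    envelope = np.exp(-(t[:, None] + t[None, :]) / gamma)
    C = envelope * stationary
    return C * d / np.trace(C), t                 # normalize so that Tr(C) = d

def sample_weights(C, m, rng):
    """Draw m weight vectors w ~ N(0, C) via the eigendecomposition of C."""
    evals, evecs = np.linalg.eigh(C)
    scale = np.sqrt(np.clip(evals, 0.0, None))
    return (rng.standard_normal((m, C.shape[0])) * scale) @ evecs.T

rng = np.random.default_rng(3)

# 40 ms window; passband and decay from the text, sampling rate illustrative.
C, t = mechanosensor_cov(d=400, fs=10000.0, f_lo=75.0, f_hi=200.0, gamma=0.01217)
W = sample_weights(C, 5, rng)                     # five example receptive fields

# Six in-band frequencies, each with a sine and a cosine, so 12 eigenvectors
# carry essentially all of the variance.
evals = np.linalg.eigvalsh(C)[::-1]
top12_fraction = evals[:12].sum() / evals.sum()
```

The covariance is low rank by construction: its dimension is set by the number of passband frequencies, not by the number of time samples.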
We fit the covariance model to receptive field data from 95 CS neurons from wings of the hawkmoth Manduca sexta (data from [67]). Briefly, CS receptive fields were estimated as the spike-triggered average (STA) of experimental mechanical stimuli of the wings, where the stimuli were generated as bandpassed white noise (2–300 Hz).
To characterize the receptive fields of this population of CS neurons, we compute the data covariance matrix Cdata by taking the inner product between the centered receptive fields. We normalize the trace to be the dimension of each receptive field (number of samples), which in this case is 40 kHz × 40 ms = 1600 samples. This normalization sets the overall scale of the covariance matrix. The data covariance matrix shows a tridiagonal structure (Fig. 3A). The main diagonal is positive while the off diagonals are negative. All diagonals decay away from the top left of the matrix.
To fit the covariance model to the data, we optimize the parameters (see Appendix A.2) flo, fhi, and γ, finding flo = 75 Hz, fhi = 200 Hz, and γ = 12.17 ms best fit the sensilla data. The resulting model covariance matrix (Fig. 3B) matches the data covariance matrix (Fig. 3A) remarkably well. Examples of biological receptive fields and random samples from this fitted covariance model are shown in the Appendix (Fig. 17). To simulate the effect of a finite number of neurons, we generate 95 weight vectors (equal to the number of neurons recorded) and recompute the model covariance matrix (Fig. 3C). We call this the finite neuron model covariance matrix Cfinite, and it shows the bump and blob-like structures evident in Cdata but not in Cmodel. This result suggests that these bumpy structures can be attributed to having a small number of recorded neurons. We hypothesize that these effects would disappear with a larger dataset and Cdata would more closely resemble Cmodel.
Comparing the eigenvectors and eigenvalues of the data and model covariance matrices, we find that the spectral properties of both Cmodel and Cfinite are similar to that of Cdata. The eigenvalue curves of the models match that of the data quite well (Fig. 3E); these curves are directly comparable because each covariance is normalized by its trace, which makes the sum of the eigenvalues unity. Further, all of the data and the model covariance matrices are low-dimensional. The first 10 data eigenvectors explain 97% of the variance, and the top 5 explain 90%. The top 5 eigenvectors of the model and its finite sample match that of the data quite well (Fig. 3D).
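The covariance and spectrum computations above can be sketched as follows. The receptive fields in this example are synthetic stand-ins (random decaying sinusoids), not the recorded hawkmoth data, and we assume that "centered" means subtracting the population mean at each time sample. Only the array shapes (95 neurons, 1600 samples) follow the text.

```python
import numpy as np

rng = np.random.default_rng(4)
fs, n_samples, n_neurons = 40000.0, 1600, 95      # 40 kHz x 40 ms, 95 neurons
t = np.arange(n_samples) / fs

# Synthetic stand-in for measured STAs (NOT the real data): random mixtures of
# sinusoids at 75, 100, ..., 200 Hz under a decaying exponential envelope.
freqs = np.arange(75.0, 201.0, 25.0)
basis = np.concatenate([np.cos(2 * np.pi * freqs[:, None] * t),
                        np.sin(2 * np.pi * freqs[:, None] * t)])   # (12, 1600)
coeffs = rng.standard_normal((n_neurons, basis.shape[0]))
fields = (coeffs @ basis) * np.exp(-t / 0.01217)

def data_covariance(R):
    """Covariance of receptive fields (rows of R), normalized so that Tr(C) = d."""
    Rc = R - R.mean(axis=0)                       # center across neurons (assumption)
    C = Rc.T @ Rc
    return C * C.shape[0] / np.trace(C)

C_data = data_covariance(fields)

# Fraction of variance explained by the leading eigenvectors.
evals = np.linalg.eigvalsh(C_data)[::-1]          # sorted in descending order
explained = np.cumsum(evals) / np.sum(evals)
top5, top10 = explained[4], explained[9]
```

Applying the same function to weights sampled from a fitted model, using as many samples as recorded neurons, gives a finite-sample covariance that can be compared directly with the data covariance.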
2.2.3 Primary visual cortex
We now turn to visually driven neurons from the mammalian primary visual cortex. Primary visual cortex (V1) is the earliest cortical area for processing visual information (Fig. 4A). The neurons in V1 can detect small changes in visual features like orientations, spatial frequencies, contrast, and size.
Here, we model the receptive fields of simple cells in V1, which have clear excitatory and inhibitory regions such that light shone on the excitatory regions increases the cell’s response and vice-versa (Fig. 4B). The shape of the regions determines the orientation selectivity, while their widths determine the frequency selectivity. The receptive fields are centered to a location in the visual field and decay away from it. They integrate visual stimuli within a small region of this center [38]. Gabor functions are widely used as a mathematical model of the receptive fields of simple cells [43].
We model these receptive fields using another locally stationary GP [31] and show examples of generated receptive fields in Fig. 4C. Consider the inputs to the cortical cells to be a continuous two-dimensional image x(t), where the domain T = [0, L] × [0, L′] and $x: T \to \mathbb{R}$. Since the image is real-valued, x(t) is the grayscale contrast or single color channel pixel values. The neuron weights are then generated from a covariance function of the following form:

$$C(t, t') = \exp\left(-\frac{\|t - t'\|^2}{2 f^2}\right) \exp\left(-\frac{\|t - c\|^2}{2 s^2}\right) \exp\left(-\frac{\|t' - c\|^2}{2 s^2}\right). \tag{9}$$
The receptive field center is defined by c, and the size of the receptive field is determined by the parameter s. As s increases, the receptive field extends farther from the center c (Fig. 4D). Spatial frequency selectivity is accounted for by the bandwidth parameter f. As f decreases, the spatial frequency of the receptive field goes up, making the weights less smooth (Fig. 4E).
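A minimal sketch of a V1-like covariance on a pixel grid follows. It assumes the covariance is a Gaussian in the distance between pixels, with length scale f, multiplied by Gaussian envelopes of width s around the center c. This form is consistent with the roles of c, s, and f described above. The grid shape and parameter values are the ones quoted in this section for the mouse data, and the function names are hypothetical.

```python
import numpy as np

def v1_cov(shape, size, freq, center):
    """Covariance over all pixel pairs of a (rows, cols) grid, with Tr(C) = d."""
    rows, cols = shape
    yy, xx = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    grid = np.stack([yy.ravel(), xx.ravel()], axis=1).astype(float)    # (d, 2)
    c = np.asarray(center, dtype=float)
    sq_dist = np.sum((grid[:, None, :] - grid[None, :, :]) ** 2, axis=-1)
    envelope = np.exp(-np.sum((grid - c) ** 2, axis=1) / (2 * size ** 2))
    C = np.exp(-sq_dist / (2 * freq ** 2)) * np.outer(envelope, envelope)
    return C * C.shape[0] / np.trace(C)

def sample_weights(C, m, rng):
    """Draw m weight vectors w ~ N(0, C) via the eigendecomposition of C."""
    evals, evecs = np.linalg.eigh(C)
    scale = np.sqrt(np.clip(evals, 0.0, None))
    return (rng.standard_normal((m, C.shape[0])) * scale) @ evecs.T

rng = np.random.default_rng(6)
shape = (14, 36)
C = v1_cov(shape, size=1.87, freq=0.70, center=(7, 18))

W = sample_weights(C, 3, rng)
receptive_field = W[0].reshape(shape)        # one sampled field as an image

# The variance of the weights is largest at the center pixel.
center_index = int(np.argmax(np.diag(C)))    # flat index of pixel (7, 18)
```

Increasing `size` spreads the envelope over more of the grid, and increasing `freq` correlates pixels over longer distances, which makes the sampled fields smoother.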
The eigendecomposition of the covariance function (9) leads to an orthonormal basis of single scale Hermite wavelets [54, 55]. When c = 0, the wavelet eigenfunctions are Hermite polynomials modulated by a decaying Gaussian: where Hk is the kth Hermite polynomial; eigenfunctions for nonzero centers c are just shifted versions of (10). The detailed derivation and values of the constants c1, c2, c3 and normalization are in Appendix A.3.
We use (9) to model receptive field data from 8,358 V1 neurons recorded with calcium imaging from transgenic mice expressing GCaMP6s; the mice were headfixed and running on an air-floating ball. We presented 24,357 unique white noise images of 14×36 pixels using the Psychtoolbox [44], where the pixels were white or black with equal probability. Images were upsampled to the resolution of the screens via bilinear interpolation. The stimulus was corrected for eye-movements online using custom code. The responses of 45,026 cells were collected using a two-photon mesoscope [83] and preprocessed using Suite2p [62]. Receptive fields were calculated from the white noise images and the deconvolved calcium responses of the cells using the STA. For the covariance analysis, we picked cells above the signal-to-noise (SNR) threshold of 0.4; this gave us 8,358 cells. The SNR was calculated from a smaller set of 2,435 images that were presented twice using the method from [85]. As a preprocessing step, we moved the center of mass of every receptive field to the center of the visual field.
We compute the data covariance matrix Cdata by taking the inner product between the receptive fields. We normalize the trace to be the dimension of each receptive field, which in this case is 14 pixels × 36 pixels = 504 pixels². The data covariance matrix resembles a tridiagonal matrix. However, the diagonals are non-zero only at equally spaced segments. Additionally, their values decay away from the center of the matrix. We show Cdata zoomed in at the non-zero region around the center of the matrix (Fig. 5A). The full covariance matrix is shown in Appendix A.8 (Fig. 15).
In the covariance model, the number of off-diagonals, the center, and the rate of decay away from the center are determined by the parameters f, c, and s, respectively. When the frequency parameter f increases, the number of off-diagonals increases. Pixels in the generated weights become more correlated and the weights become spatially smoother. When the size parameter s increases, the diagonals decay more slowly from the center c, increasing correlations with the center pixel and leading the significant weights to occupy more of the visual field.
We again optimize the parameters to fit the data (Appendix A.2.2), finding s = 1.87 and f = 0.70 pixels. We do not need to optimize over the center parameter c, since we preprocess the data so that all receptive fields are centered at c = (7, 18), the center of the 14×36 grid. The resulting model covariance matrix (Fig. 5B) and the data covariance matrix (Fig. 5A) match remarkably well. Examples of biological receptive fields and random samples from this fitted covariance model are shown in Fig. 16 in the Appendix. To simulate the effect of a finite number of neurons, we generate 8,358 weights, equal to the number of neurons in our data, to compute Cfinite shown in Fig. 5C. This finite matrix Cfinite looks even more like Cdata, and it shows that some of the negative covariances far from center result from finite sample size but not all.
Similar spectral properties are evident in the eigenvectors and eigenvalues of Cmodel, Cfinite, Cdata, and the analytical forms derived in (10) (Fig. 5D,E). As in Section 2.2.2, the covariances are normalized to have unit trace. Note that the analytical eigenfunctions are shown on a finer grid than the model and data because the analysis was performed in continuous space; differences between analytical and model results are due to discretization. Examining the eigenvectors (Fig. 5D), we also see a good match, although there are some rotations and differences in ordering. These 10 eigenvectors explain 68% of the variance in the receptive field data. For reference, the top 80 eigenvectors explain 86% of the variance in the data and all of the variance in the model. The eigenvalue curves of both the models and the analytical forms match that of the data (Fig. 5E) reasonably well, although not as well as for the mechanosensors. In Appendix A.8, we repeat this analysis for receptive fields measured with different stimulus sets in the mouse and for a different experimental dataset from non-human primate V1. Our findings are consistent with the results shown above.
2.3 Advantages of structured random weights for artificial learning tasks
Our hypothesis is that neuronal inductive bias from structured receptive fields allows networks to learn with fewer neurons, training examples, and steps of gradient descent. To examine this hypothesis, we compare the performance of structured receptive fields against classical ones on several classification tasks. We find that, for most artificial learning tasks, structured random networks learn more accurately with smaller network sizes, fewer training examples, and fewer gradient steps.
2.3.1 Frequency detection
CS naturally encode the time-history of strain forces acting on the insect body, and sensors inspired by their temporal filtering properties have been shown to accurately classify spatiotemporal data [57]. Inspired by this result, we test sensilla-inspired mechanosensory receptive fields from Section 2.2.2 on a timeseries classification task (Fig. 6A, top). Each example presented to the network is a 100 ms timeseries sampled at 2 kHz so that d = 200, and the goal is to detect whether or not each example contains a sinusoidal signal. The positive examples are sinusoidal signals with f1 = 50 Hz, corrupted by noise so that their SNR = 1.76 (2.46 dB). The negative examples are Gaussian white noise with amplitude matched to the positive examples. Note that this frequency detection task is not linearly separable because of the random phases in positive and negative examples. See Section A.5 for additional details including the definition of SNR and how cross-validation was used to find the optimal parameters flo = 10 Hz, fhi = 60 Hz, and γ = 50 ms.
For the same number of hidden neurons, the structured RFN significantly outperforms a classical RFN. We show test performance using these tuned parameters in Fig. 6A. Even in this noisy task, it achieves 1% test error using only 25 hidden neurons. Meanwhile, the classical network takes 300 neurons to achieve similar error.
Predictably, the performance suffers when the weights are incompatible with the task. We show results when flo = 10 Hz and fhi = 40 Hz and the same γ (Fig. 6A). The incompatible RFN performs better than chance (50% error) but much worse than the classical RFN. It takes 300 neurons just to achieve 19.9% test error. The test error does not decrease below this level even with additional hidden neurons.
2.3.2 Frequency XOR task
To challenge the mechanosensor-inspired networks on a more difficult task, we build a frequency Exclusive-OR (XOR) problem (Fig. 6B, top). XOR is a binary function which returns true if and only if its two inputs differ, and false otherwise. XOR is a classical example of a function that is not linearly separable and thus harder to learn. Our inputs are again 100 ms timeseries sampled at 2 kHz. The inputs either contain a pure frequency of f1 = 50 Hz or f2 = 80 Hz, mixed frequency signals with both f1 and f2, or white noise. In both the pure and mixed frequency cases, we add noise so that the SNR = 1.76. See A.5 for details. The goal of the task is to output true if the input contains either pure tone and false if the input contains mixed frequencies or is white noise.
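The labeling rule of the frequency XOR task can be written in a few lines. The following Python sketch is purely illustrative: the function name and the subclass names are hypothetical, and it only encodes the rule stated above (true when exactly one of the two tones is present).

```python
def frequency_xor_label(has_f1: bool, has_f2: bool) -> bool:
    """Return True when exactly one of the two tones (f1, f2) is present."""
    return has_f1 != has_f2


# The four input subclasses described in the text (names are illustrative).
subclasses = {
    "pure_f1": (True, False),   # only the 50 Hz tone
    "pure_f2": (False, True),   # only the 80 Hz tone
    "mixed": (True, True),      # both tones
    "noise": (False, False),    # neither tone
}

labels = {name: frequency_xor_label(*flags) for name, flags in subclasses.items()}
```

Both pure-tone subclasses map to the positive class, while the mixed and noise subclasses map to the negative class, so no single threshold on the power at one frequency separates the classes.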
We tune the GP covariance parameters flo, fhi, and γ from (8) using cross-validation. The cross-validation procedure and algorithmic details are identical to those of the frequency detection task in Section 2.3.1. Using cross-validation, we find the optimal parameters to be flo = 50 Hz, fhi = 90 Hz, and γ = 40 ms. For incompatible weights, we take flo = 10 Hz, fhi = 60 Hz, and the same γ.
The structured RFN significantly outperforms the classical RFN for the same number of hidden neurons. We show network performance using these parameters in Fig. 6B. Classification error of 1% can be achieved with 25 hidden neurons. In sharp contrast, the classical RFN requires 300 hidden neurons just to achieve 8.1% error. With incompatible weights, the network needs 300 neurons to achieve just 25.5% test error and does not improve with larger network sizes. Out of the four input subclasses, it consistently fails to classify pure 80 Hz sinusoidal signals, which are outside its passband.
2.3.3 Image classification
We next test the V1-inspired receptive fields from Section 2.2.3 on two standard digit classification tasks, MNIST [49] and KMNIST [17]. The MNIST and KMNIST datasets each contain 70,000 images of handwritten characters. In MNIST, these are the Arabic numerals 0–9, whereas KMNIST has 10 Japanese hiragana phonetic characters. Both datasets come split into 60,000 training and 10,000 test examples. With 10 classes, there are 6,000 training examples per class. Every example is a 28×28 grayscale image with centered characters.
Each hidden weight has its center c chosen uniformly at random from all pixels. This ensures that the network’s weights uniformly cover the image space and in fact means that the network can represent any sum of locally-smooth functions (see Section A.4). We use a network with 1,000 hidden neurons and tune the GP covariance parameters s and f from (9) using 3-fold cross validation on the MNIST training set. Each parameter ranges from 1 to 20 pixels, and the optimal parameters are found with a grid search. We find the optimal parameters to be s = 5 pixels and f = 2 pixels. We then refit the optimal model using the entire training set. The parameters from MNIST were used on the KMNIST task without additional tuning.
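The sampling procedure described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation used for the experiments: the covariance form in `v1_covariance` (a smooth correlation of scale f inside a Gaussian envelope of size s around the center c) is assumed here, the function names are hypothetical, and the small 12×12 grid and parameter values are chosen only to keep the example fast.

```python
import numpy as np


def v1_covariance(shape, center, s, f):
    """Assumed covariance: Gaussian envelope of size s around `center`, smoothness scale f."""
    rows, cols = np.indices(shape)
    coords = np.column_stack([rows.ravel(), cols.ravel()]).astype(float)
    sq_dist = np.sum((coords[:, None, :] - coords[None, :, :]) ** 2, axis=-1)
    envelope = np.exp(-np.sum((coords - center) ** 2, axis=-1) / (2 * s**2))
    return np.exp(-sq_dist / (2 * f**2)) * envelope[:, None] * envelope[None, :]


def sample_v1_weights(rng, n_neurons, shape, s, f):
    """Draw each weight vector from N(0, C_c), with center c uniform over the pixels."""
    d = shape[0] * shape[1]
    W = np.empty((n_neurons, d))
    centers = np.column_stack([
        rng.integers(0, shape[0], n_neurons),
        rng.integers(0, shape[1], n_neurons),
    ])
    for i, c in enumerate(centers):
        C = v1_covariance(shape, c, s, f)
        evals, evecs = np.linalg.eigh(C)
        evals = np.clip(evals, 0.0, None)  # guard against tiny negative round-off
        # w = Phi sqrt(Lambda) g with g ~ N(0, I) has covariance C.
        W[i] = (evecs * np.sqrt(evals)) @ rng.normal(size=d)
    return W, centers


rng = np.random.default_rng(0)
W, centers = sample_v1_weights(rng, n_neurons=8, shape=(12, 12), s=3.0, f=1.5)
```

Each row of `W` can be reshaped to the image grid to inspect one receptive field. Recomputing the eigendecomposition per neuron is wasteful for large images; caching one decomposition per distinct center would be a natural optimization.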
The V1-inspired network achieves much lower average classification error than the classical RFN for the same number of hidden neurons. We show learning performance using these parameters on the MNIST task in Fig. 7A. The structured RFN needs 100 neurons to achieve 6% error on the MNIST task, versus 1,000 neurons for the classical RFN, and it achieves 2.5% error with 1,000 neurons. Qualitatively similar results hold for the KMNIST task (Fig. 7B), although the overall errors are larger, reflecting the harder task. The structured RFN needs 100 neurons to achieve 28% error on KMNIST, versus 1,000 neurons for the classical RFN, and it achieves 13% error with 1,000 neurons.
Again, network performance suffers when GP covariance parameters do not match the task. This happens if the size parameter s is smaller than the stroke width or the spatial scale f does not match the stroke variations in the character. Taking the incompatible parameters s = 0.5 and f = 0.5 (Fig. 7A, B), the structured weights perform worse than the classical RFN in both tasks. With 1,000 hidden neurons, they achieve the relatively poor test errors of 8% on MNIST (Fig. 7A) and 33% on KMNIST (Fig. 7B).
2.3.4 Structured weights improve generalization with limited data
Alongside learning with fewer hidden neurons, V1 structured RFNs also learn more accurately from fewer examples. We test few-shot learning using the image classification datasets from Section 2.3.3. The training examples are reduced from 60,000 to 50, or only 5 training examples per class. The test set and GP parameters remain the same.
Structured encodings allow learning with fewer samples than unstructured encodings. We show these few-shot learning results in Fig. 7C and D. The networks’ performance saturates past a few hundred hidden neurons. For MNIST, the lowest error achieved by the V1 structured RFN is 27% versus 33% for the classical RFN and 37% using incompatible weights (Fig. 7C). On the KMNIST task, the structured RFN achieves 61% error, as opposed to 66% for the classical RFN and 67% using incompatible weights (Fig. 7D).
2.3.5 Networks train faster when initialized with structured weights
Now we study the effect of structured weights as an initialization strategy for fully-trained neural networks where all weights in the network vary. We hypothesized that structured initialization allows networks to learn faster, i.e. that the training loss and test error would decrease faster than with unstructured weights. We have shown that the performance of RFNs improves with biologically inspired weight sampling. However, in RFNs (1) only the readout weights β are modified with training, and the hidden weights W are frozen at their initial value.
We compare the biologically-motivated initialization with a classical initialization where the variance is inversely proportional to the number of hidden neurons. This initialization is widely known as the “Kaiming He normal” scheme and is thought to stabilize training dynamics by controlling the magnitude of the gradients [37]. The classical approach ensures that $\mathbb{E}\|w\|^2 = 2$, so for fair comparison we scale our structured weight covariance matrix to have Tr(C) = 2. In our studies with RFNs the trace is equal to d, but this weight scale can be absorbed into the readout weights β due to the homogeneity of the ReLU.
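The scale matching between the two initializations can be illustrated numerically. The sketch below is a hedged example rather than the training code: it assumes the He-normal scheme in its usual form (i.i.d. entries with variance 2/fan-in, with fan-in equal to the input dimension d for this layer), and it uses a placeholder squared-exponential covariance in place of the fitted receptive field covariance. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 50, 20000  # input dimension and number of sampled weight vectors (illustrative)

# Classical He-normal initialization: i.i.d. entries with variance 2 / fan_in.
W_classical = rng.normal(0.0, np.sqrt(2.0 / d), size=(m, d))

# Placeholder structured covariance (NOT the fitted model from the paper).
t = np.arange(d)
C = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / 5.0**2)

# Rescale so that Tr(C) = 2, which makes E||w||^2 = Tr(C) = 2 as well.
C_scaled = 2.0 * C / np.trace(C)

# Sample w ~ N(0, C_scaled) through the eigendecomposition of C_scaled.
evals, evecs = np.linalg.eigh(C_scaled)
evals = np.clip(evals, 0.0, None)  # guard against tiny negative round-off
W_struct = rng.normal(size=(m, d)) @ (evecs * np.sqrt(evals)).T

# Empirical mean squared norms; both should be close to 2.
mean_sq_classical = np.mean(np.sum(W_classical**2, axis=1))
mean_sq_struct = np.mean(np.sum(W_struct**2, axis=1))
```

Matching the trace equalizes the expected squared norm of each weight vector, so any difference in training behavior comes from the shape of the covariance rather than from the overall weight scale.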
We again compare structured and unstructured weights on MNIST and KMNIST tasks, common benchmarks for fully-trained networks. The architecture is a single hidden layer feedforward neural network (Fig. 1) with 1,000 hidden neurons. The cross-entropy loss over the training set is minimized using simple gradient descent (GD) for 3,000 epochs with a learning rate of 0.1. All other parameters are the same as in Section 2.3.3.
In both the MNIST and KMNIST tasks, the V1-initialized network minimizes the loss function faster than the classically initialized network. For the MNIST task, the V1 network achieves a loss value of 0.07 after 3,000 epochs compared to 0.1 for the other network (Fig. 8A). We see qualitatively similar results for the KMNIST task. At the end of training, the V1-inspired network’s loss is 0.09, while the classically initialized network only reaches 0.14 (Fig. 8B). We find that the V1-initialized network performs no better than classical initialization when the covariance parameters do not match the task. With incompatible parameters, the V1-initialized network achieves a loss value of 0.12 on MNIST and 0.17 on KMNIST.
Not only does it minimize the training loss faster, the V1-initialized network also generalizes well and achieves a lower test error at the end of training. For MNIST, it achieves 2% test error compared to 3.5% error for the classically initialized network, and 3.7% using incompatible weights (Fig. 8C). For KMNIST, we see 10% error compared to 14% error with classical initialization and 15% using incompatible weights (Fig. 8D).
We see similar results across diverse hidden layer widths and learning rates (Fig. 18–21), with the benefits most evident for wider networks and smaller learning rates. Furthermore, the structured weights show similar results when trained for 10,000 epochs (rate 0.1; 1,000 neurons; not shown) and with other optimizers like minibatch Stochastic Gradient Descent (SGD) and ADAM (batch size 256, rate 0.1; 1,000 neurons; not shown). Structured initialization facilitates learning across a wide range of networks.
However, the improvement is not universal: no significant benefit was found by initializing the early convolutional layers of the deep network AlexNet [48] and applying it to the ImageNet dataset [75], as shown in Appendix A.10. The large amounts of training data and the fact that only a small fraction of the network was initialized with structured weights could explain this null result.
2.3.6 Improving representation with structured random weights
We have shown how structured receptive field weights can improve the performance of RFNs and fully-trained networks on a number of supervised learning tasks. As long as the receptive fields are compatible with the task itself, performance gains over unstructured features are possible. If they are incompatible, then the networks perform no better or even worse than with classical unstructured weights.
These results can be understood with the theoretical framework of Section 2.1. Structured weights effectively cause the input x to undergo a linear transformation into a new representation following Theorem 1. In all of our examples, this new representation is bandlimited due to how we design the covariance function. By moving to a bandlimited representation, we both filter out noise (high-frequency components) and reduce dimensionality (coordinates outside the passband are zero). In general, noise and dimensionality both make learning harder.
It is easiest to understand these effects in the frequency detection task. For simplicity, assume we are using the stationary features of our warm-up Section 2.2.1 to do frequency detection. In this task, all of the signal power is contained in the f1 = 50 Hz frequency, and everything else is due to noise. If the weights are compatible with the task, this means that w is a sum of sines and cosines of frequencies ωk in some passband which includes f1. The narrower we make this bandwidth while still retaining the signal, the higher the SNR of the transformed input becomes, since more noise is filtered out (Appendix A.6).
3 Discussion
In this paper, we describe a random generative model for the receptive fields of sensory neurons. Specifically, we model each receptive field as a random filter sampled from a Gaussian process (GP) with covariance structure matched to the statistics of experimental neural data. We show that two kinds of sensory neurons—insect mechanosensory and simple cells in mammalian V1—have receptive fields that are well-described by GPs. In particular, the generated receptive fields, their second-order statistics, and their principal components match with receptive field data. Theoretically, we show that individual neurons perform a randomized transformation and filtering on the inputs. This connection provides a framework for sensory neurons to compute input transformations like Fourier and wavelet transforms in a biologically plausible way.
Our numerical results using these structured random receptive fields show that they offer better learning performance than unstructured receptive fields on several benchmarks. The structured networks achieve higher test performance with fewer neurons and fewer training examples, unless the frequency content of their receptive fields is incompatible with the task. In networks that are fully trained, initializing with structured weights leads to better network performance (as measured by training loss and generalization) in fewer iterations of gradient descent. Structured random features may be understood theoretically as transforming inputs into an informative basis that retains the important information in the stimulus while filtering away irrelevant signals.
3.1 Modeling other sensory neurons and modalities
The random feature formulation is a natural extension of the traditional linear-nonlinear (LN) neuron model. This approach may be applied to other brain regions where LN models are successful, for instance sensory areas with primarily feedforward connectivity like somatosensory and auditory regions. The neurons in auditory and somatosensory systems are selective to both spatial and temporal structures in their stimuli [45, 76, 68], and spatial structure emerges in networks trained on artificial tactile tasks [95]. Their receptive fields could be modeled by GPs with spatiotemporal covariance functions [90]; these could be useful for artificial tasks with spatiotemporal stimuli such as movies and multivariate timeseries. Neurons with localized but random temporal responses were found to be compatible with manifold coding in a decision-making task [46].
3.2 Receptive fields in development
Our generative model offers new directions to explore the biological basis and computational principles behind receptive fields. Development lays a basic architecture that is conserved from animal to animal [86, 84], yet the details of every neural connection cannot be specified [94], leading to some amount of inevitable randomness at least initially [14]. If receptive fields are random with constrained covariance, it is natural to ask how biology implements this. Unsupervised Hebbian dynamics with local inhibition can allow networks to learn principal components of their input [59, 65]. An interesting future direction is how similar learning rules may give rise to overcomplete, nonorthogonal structure similar to what has been studied here.
The above assumes that receptive field properties actually lie within synaptic weights. For spatial receptive fields, this assumption is plausible [72], but the temporal properties of receptive fields are more likely a result of neurons’ intrinsic dynamics, for which the LN framework is just a model [61, 87, 23]. Heterogeneous physiological (e.g. resonator dynamics) and mechanical (position and shape of mechanosensor relative to body structure) properties combine to give the diverse temporal receptive field structures [6]. Development thus leverages different mechanisms to build structure into receptive field properties of sensory neurons.
3.3 Connections to compressive sensing
Random projections have seen extensive use in the field of compressive sensing, where a high-dimensional signal can be found from only a few measurements so long as it has a sparse representation [22, 25, 30]. Random compression matrices are known to have optimal properties; however, in many cases structured randomness is more realistic. Recent work has shown that structured random projections with local wiring constraints (in one dimension) were compatible with dictionary learning [24], supporting previous empirical results [5]. Our work shows that structured random receptive fields are equivalent to employing a wavelet dictionary and dense Gaussian projection.
3.4 Machine learning and inductive bias
An important open question for both neuroscience and machine learning is why certain networks, characterized by features such as their architecture, weights, and nonlinearities, are better than others for certain problems. One perspective is that a network is good for a problem if it is biased towards approximating functions that are close to the target, known as an inductive bias, which depends on an alignment between the features encoded by neurons and the task at hand [10]. Our approach shows that structured receptive fields are equivalent to a linear transformation of the input that can build in such biases. Furthermore, we can describe the nonlinear properties of the network using the kernel, which varies depending on the receptive field structure. If the target function has a small norm in this Reproducing Kernel Hilbert Space (RKHS), then there is an inductive bias and it is easier to learn [80, 79].
Networks endowed with principles of neural computation like batch normalization, pooling of inputs, and residual connections have been found to contain inductive biases for certain learning problems [91, 36]. Learning data-dependent kernels is another way to add in inductive bias [82]. We also saw that initializing fully-trained networks from our generative models improved their speed of convergence and generalization compared to unstructured initialization. This result is consistent with known results that initialization has an effect on generalization [4]. The initialization literature has mostly been focused on avoiding exploding/vanishing gradients [37, 32]. Here, we conjecture that the inductive bias in our structured connectivity places the network closer to a good solution in the loss landscape [94].
The random V1-inspired receptive fields that we model can be seen as similar to what happens in convolutional neural networks (CNNs) [60], which have similarities and differences compared to brains [50]. A recent study showed that CNNs with a fixed V1-like convolutional layer are more robust to adversarial perturbations to their inputs [20]. In a similar vein to our work, using randomly sampled Gabor receptive fields in the first layer of a deep network was also shown to improve its performance [40]. The wavelet scattering transform is a multi-layer network where wavelet coefficients are passed through nonlinearities, a model which is similar to deep CNNs [53, 12, 1]. Our framework differs in that it is a randomized model and yields wavelets of a single scale. Adding layers to our model or sampling weights with a variety of spatial frequencies and field sizes would yield random networks that behave similarly to the scattering transform, offering another connection between the brain and CNNs. Directly learning filters in a Hermite wavelet basis led to good performance in ANNs with little data [41], and this idea was extended to multiple scales by [66]. Our structured random features can be seen as an RFN version of those ideas with supporting evidence that these principles are used in biology.
3.5 Limitations and future directions
There are several limitations to the random feature approach. We model neuron responses with scalar firing rates instead of discrete spikes, and we ignore complex neuronal dynamics, neuromodulatory context, and many other details. Like most LN models, the random feature model assumes zero plasticity in the hidden layer neurons. However, associative learning can drive changes in receptive fields of individual neurons in sensory areas like V1 and auditory cortex [33, 28]. Further, our RFN is purely feedforward and cannot account for feedback connections. Recent work suggests that a feedforward architecture lacks sufficient computational power to serve as a detailed input-output model for a network of cortical neurons; it might need additional layers with convolutional filters [7]. It can be difficult to interpret the parameters found from fitting receptive field data and connect them to experimental conditions. Also, the GP model of weights only captures covariance (second moments) and neglects higher-order statistics. It remains to be shown how the theory can yield concrete predictions that can be tested under in vivo experimental conditions.
We see several future directions of structured random features in connecting computational neuroscience and machine learning. As already stated, the auditory, somatosensory, and tactile regions are good candidates for further study, as are developmental principles that could give rise to random yet structured receptive field properties. To account for plasticity in the hidden layer, one could also analyze the neural tangent kernel (NTK) associated with structured features [42]. These kernels are often used to analyze ANNs trained with gradient descent when the number of hidden neurons is large and the step size is small [2]. To incorporate lateral and feedback connections, the weights could be sampled from GPs with recurrent covariance functions [56]. Our theory may also help explain why CNNs with a fixed V1-like convolutional layer are more robust to adversarial input perturbations [20]: the fixed layer may filter out high-frequency corruptions. It seems likely that structured random features will also be more robust. It would be interesting to study intermediate layer weights of fully-trained networks as approximate samples from a GP by studying their covariance structure. Finally, one could try to develop other covariance functions and further optimize these RFNs for more sophisticated learning tasks to see whether high performance (lower error, faster training, etc.) on more difficult tasks is possible.
Author contributions
Conceptualization: K.D.H.; Mathematical Analysis: B.P., K.D.H.; Data Acquisition: M.P.; Writing–original draft: B.P., K.D.H.; Writing–review & editing: B.P., M.P., B.W.B., K.D.H.; Figures: B.P., B.W.B., K.D.H.; Supervision: B.W.B., K.D.H.
Funding
B.P. was supported by a UW Applied Math Frederic Wan Endowed Fellowship, Terry Keegan Memorial ARCS Endowment Fellowship, and National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1762114. M.P. was supported by the Janelia Research Campus, Howard Hughes Medical Institute. B.W.B. was supported by grants FA9550-19-1-0386 & FA9550-18-1-0114 from the Air Force Office of Scientific Research. K.D.H. was supported by the Washington Research Foundation postdoctoral fellowship and Western Washington University.
Acknowledgements
We thank Dario Ringach for providing the macaque V1 data and Brandon Pratt for the hawkmoth mechanosensor data. We are grateful to Ali Weber, Steven Peterson, and Owen Levin for useful discussions. We thank Sarah Lindo, Michalis Michaelos, and Carsen Stringer for help with mouse surgeries, calcium imaging, and data processing, respectively.
A Appendix
A.1 Function spaces for wide networks with structured receptive fields
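RFNs are intimately connected to a popular class of supervised learning algorithms called kernel methods. As the network width grows, the inner product between the feature representations of two inputs x, x′ converges to a reproducing kernel
$$k(x, x') = \mathbb{E}_{w}\left[\sigma(w^T x)\,\sigma(w^T x')\right]. \qquad (11)$$
The kernel defines a reproducing kernel Hilbert space (RKHS) of functions. The explicit form of the kernels corresponding to classical RFNs are known for several non-linear activation functions. For example, with the ReLU nonlinearity, no threshold, and unstructured Gaussian weights $w \sim \mathcal{N}(0, I_d)$, the kernel is $k_{\mathrm{unstruct}}(x, x') = \frac{1}{2\pi}\|x\|\,\|x'\|\left(\sin\theta + (\pi - \theta)\cos\theta\right)$, where $\theta = \cos^{-1}\left(\frac{x^T x'}{\|x\|\,\|x'\|}\right)$ [3].
We derive the kernel induced by our RFNs with hidden weights initialized from GPs. In this section we work in the discrete setting, but the continuous version is analogous. By definition (11), network equation (1), and basis change Theorem 1, the kernel for structured features is
$$k_{\mathrm{struct}}(x, x') = \mathbb{E}_{w}\left[\sigma(w^T x)\,\sigma(w^T x')\right] = \mathbb{E}_{g}\left[\sigma(g^T \tilde{x})\,\sigma(g^T \tilde{x}')\right] = k_{\mathrm{unstruct}}(\tilde{x}, \tilde{x}'), \qquad (12)$$
where $g \sim \mathcal{N}(0, I_d)$ and $\tilde{x}$, $\tilde{x}'$ are the transformed inputs given by Theorem 1.
Thus, the induced kernels from structured weights can be found in terms of unstructured weight kernels acting on the transformed inputs $\tilde{x}$ and $\tilde{x}'$. Taking ReLU as the nonlinearity for example, we get that $k_{\mathrm{struct}}(x, x') = \frac{1}{2\pi}\|\tilde{x}\|\,\|\tilde{x}'\|\left(\sin\tilde{\theta} + (\pi - \tilde{\theta})\cos\tilde{\theta}\right)$, where $\tilde{\theta}$ is the angle between $\tilde{x}$ and $\tilde{x}'$.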
Every RKHS $\mathcal{H}$ comes with an inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ and norm $\|\cdot\|_{\mathcal{H}}$. The norm and inner product can be expressed in terms of eigenvalues and eigenfunctions of the kernel itself, analogous to the eigendecomposition of the covariance function of the GP weights. Although it is beyond the scope of our paper to explain the theory in detail, there are well-established results showing that functions with small $\mathcal{H}$-norm are easier to learn than those with larger norm [80]. In ridge regression, this effect is again equivalent to projection and filtering in the kernel eigenbasis, i.e. linear filtering in function space. Finally, end-to-end trained networks where the weights W are optimized may be studied with the related neural tangent kernel (NTK) when the step size is small [42]. The basis change formula, Theorem 1, and (12) give us a way to understand the RKHS of the structured network in terms of an unstructured network’s RKHS acting on the transformed inputs.
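The relationship between structured and unstructured kernels can be checked numerically. The sketch below is illustrative and rests on two assumptions: the closed form used is the standard degree-1 arc-cosine kernel for ReLU features with standard Gaussian weights, and the matrix `A` is an arbitrary stand-in for the linear map of Theorem 1 (weights w = A g with g standard normal, so that C = A Aᵀ). Function and variable names are hypothetical.

```python
import numpy as np


def relu_kernel_closed_form(x, y):
    """Degree-1 arc-cosine kernel: E[relu(w.x) relu(w.y)] for w ~ N(0, I)."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    cos_theta = np.clip(x @ y / (nx * ny), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return nx * ny * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2 * np.pi)


def relu_kernel_monte_carlo(x, y, W):
    """Average of relu(w.x) relu(w.y) over the rows of W (one row per hidden neuron)."""
    return np.mean(np.maximum(W @ x, 0.0) * np.maximum(W @ y, 0.0))


rng = np.random.default_rng(0)
d, m = 10, 200_000
x, y = rng.normal(size=d), rng.normal(size=d)

# Unstructured weights: w ~ N(0, I).
W_unstruct = rng.normal(size=(m, d))
k_mc = relu_kernel_monte_carlo(x, y, W_unstruct)
k_exact = relu_kernel_closed_form(x, y)

# Structured weights: w = A g with g ~ N(0, I), so the covariance is C = A A^T.
A = rng.normal(size=(d, d)) / np.sqrt(d)  # arbitrary illustrative map
W_struct = W_unstruct @ A.T
k_struct_mc = relu_kernel_monte_carlo(x, y, W_struct)

# Since w^T x = g^T (A^T x), the structured kernel is the unstructured kernel
# evaluated on the transformed inputs A^T x and A^T y.
k_struct_exact = relu_kernel_closed_form(A.T @ x, A.T @ y)
```

The Monte Carlo averages approach the closed-form values as the number of hidden neurons m grows, which is the finite-width counterpart of the kernel limit discussed above.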
A.1.1 Kernel eigenfunctions differ with structured weights
The structured RKHS has eigenfunctions which are different from the eigenfunctions of the unstructured RKHS. To see this, it’s necessary to introduce a probability measure μ(x) for the data x. Kernel learning is often understood [13] in the orthonormal basis for $L^2(\mu)$ given by the eigenfunctions $\psi_i$ of the integral operator defined by
$$(T_k \psi)(x) = \int k(x, x')\,\psi(x')\,d\mu(x').$$
A natural question to ask is, how do the eigensystems of $k_{\mathrm{struct}}$ and $k_{\mathrm{unstruct}}$ compare? The mapping from x to $\tilde{x}$ induces a pushforward measure on $\tilde{x}$, which we will call ν. (Since the mapping is linear, dν(x) = |Λ|⁻¹ dμ(ΦTΛ⁻¹x), and if μ is multivariate Gaussian, then ν is Gaussian with a different covariance.) Thus the integral operator of kstruct under the data measure μ is equivalent to integrating kernel kunstruct under the pushforward measure ν. Because μ and ν are different, there is no general relationship that holds between the eigenfunctions. Different measures leading to different kernel eigenfunctions and eigenvalues can explain why structured weights have strong effects on learning, as shown by recent work demonstrating that areas of low input density are learned more slowly [1].
A.2 Covariance parameter optimization
Here we describe the details of how the GP covariances were fit to our various datasets.
A.2.1 Mechanosensor covariance
We aim to minimize the difference between the matrix generated by the covariance model Cmodel and the data Cdata, while keeping flo smaller than fhi. For simplicity, we measure the covariance mismatch with the Frobenius norm, solving
$$\min_{f_{lo},\, f_{hi},\, \gamma} \; \|C_{\mathrm{model}} - C_{\mathrm{data}}\|_F \quad \text{subject to } f_{lo} \le f_{hi}. \qquad (13)$$
We use the trust-region algorithm provided by scipy.optimize.minimize to solve (13).
A.2.2 V1 covariance
To fit the covariance model to the data, we formulate an optimization problem over the model parameters s and f, where we minimize the Frobenius norm of the difference between the covariance matrix Cmodel and Cdata:
$$\min_{s,\, f} \; \|C_{\mathrm{model}} - C_{\mathrm{data}}\|_F. \qquad (14)$$
We solve (14) using the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm provided by the scipy.optimize.minimize function.
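A fit of this kind can be sketched with `scipy.optimize.minimize`. The example below is a hedged illustration on synthetic data, not the fitting script used for the paper: the covariance form in `v1_covariance` is an assumption, the "data" covariance is generated from known parameters so that recovery can be checked, and the objective is the squared Frobenius norm (same minimizer as the norm, but smooth at the optimum, which suits a gradient-based method such as BFGS). All names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def v1_covariance(params, coords, center):
    """Assumed covariance: smoothness scale f inside a Gaussian envelope of size s."""
    s, f = params
    sq_dist = np.sum((coords[:, None, :] - coords[None, :, :]) ** 2, axis=-1)
    envelope = np.exp(-np.sum((coords - center) ** 2, axis=-1) / (2 * s**2))
    return np.exp(-sq_dist / (2 * f**2)) * envelope[:, None] * envelope[None, :]


def squared_mismatch(params, C_data, coords, center):
    """Squared Frobenius norm of the difference between model and data covariances."""
    return np.sum((v1_covariance(params, coords, center) - C_data) ** 2)


# Small synthetic problem: an 8 x 12 pixel grid and a covariance with known parameters.
rows, cols = np.indices((8, 12))
coords = np.column_stack([rows.ravel(), cols.ravel()]).astype(float)
center = np.array([4.0, 6.0])
true_params = np.array([2.0, 1.0])  # (s, f), chosen for illustration
C_data = v1_covariance(true_params, coords, center)

result = minimize(
    squared_mismatch,
    x0=np.array([3.0, 1.5]),  # initial guess for (s, f)
    args=(C_data, coords, center),
    method="BFGS",
)
# s and f enter only through their squares, so report magnitudes.
s_fit, f_fit = np.abs(result.x)
```

With real data, `C_data` would be the empirical covariance of the measured receptive fields, and several initial guesses may be worth trying since the objective is not guaranteed to be convex.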
A.3 Derivation of eigenfunctions of V1 covariance function
The covariance between two pixel locations t = (t1, t2), is given by
Since the covariance function factors into a product of functions of the variables t1 and t2, the multidimensional eigenfunctions ϕk(t) and eigenvalues λk also factor into products of 1-dimensional eigenfunctions and eigenvalues, i.e. $\phi_k(t) = \phi_{k_1}(t_1)\,\phi_{k_2}(t_2)$ and $\lambda_k = \lambda_{k_1}\lambda_{k_2}$. This holds for d > 2 dimensions as well. So we work in 1-d and search for eigenfunctions $\phi_k(t)$ and eigenvalues $\lambda_k$ such that $\int C(t, t')\,\phi_k(t')\,dt' = \lambda_k\,\phi_k(t)$.
We make the ansatz that $\phi_k(t) = e^{-c_1 t^2} H_k(c_2 t)$, where $H_k$ is the kth Hermite polynomial (physicists’ convention) [4] and $c_1$, $c_2$ are constants. With this guess for the eigenfunctions,
Solving for the unknown constants leads to the equations
The last step is to find the normalization constant for the eigenfunctions:
Therefore, our orthonormal eigenfunctions and eigenvalues for the 1-dimensional covariance are where . Note that with , so that the spectrum decays exponentially.
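The exponential decay of the spectrum and the Hermite-like shape of the eigenfunctions can be checked numerically by discretizing a 1-dimensional covariance on a fine grid. The sketch below assumes a covariance of the form exp(−(t − t′)²/(2f²)) · exp(−(t² + t′²)/(2s²)) with the receptive field centered at the origin; the parameter values, grid, and helper names are illustrative.

```python
import numpy as np

s, f = 2.0, 1.0  # illustrative size and spatial-scale parameters
h = 0.1          # grid spacing
t = np.arange(-10.0, 10.0 + h / 2, h)

# Assumed 1-d covariance with the receptive field centered at the origin.
C = (np.exp(-(t[:, None] - t[None, :]) ** 2 / (2 * f**2))
     * np.exp(-(t[:, None] ** 2 + t[None, :] ** 2) / (2 * s**2)))

# Eigenvalues of the integral operator are approximated by h * eig(C).
evals, evecs = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]
evals, evecs = h * evals[order], evecs[:, order]

# Exponential (geometric) decay means consecutive eigenvalue ratios are nearly constant.
ratios = evals[1:6] / evals[:5]


def sign_changes(v, tol=1e-8):
    """Count sign changes of v, ignoring entries that are numerically zero."""
    v = v[np.abs(v) > tol]
    return int(np.sum(np.sign(v[1:]) != np.sign(v[:-1])))


# The k-th eigenvector should cross zero k times, like exp(-c1 t^2) H_k(c2 t).
zero_crossings = [sign_changes(evecs[:, k]) for k in range(4)]
```

Plotting the leading columns of `evecs` against `t` shows Gaussian-windowed polynomials of increasing degree, in line with the Hermite ansatz.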
A.4 Distributed receptive field centers imply a sum kernel space
To generate our V1-inspired weights, we first sample a center c uniformly at random from the pixels in the image; call this set of pixels S. We will now derive the kernel for this weight sampling.
Suppose that all of the weights are sampled with a single center c. Then (12) tells us that the structured kernel associated with the RFN where we have defined the local basis change
This local basis change projects into a basis of Hermite wavelets Φc centered at c and filters according to the eigenvalues Λc. The reproducing kernel (16) defines an RKHS $\mathcal{H}_c$ of functions which take images as their input and produce a real-valued output. The RKHS is a Hilbert space and thus has a norm $\|\cdot\|_{\mathcal{H}_c}$. Functions with small $\mathcal{H}_c$-norm are, informally, smooth functions of the local wavelet coefficients.
In our experiments, we actually sample weights from all centers c ∈ S with equal probability. Taking the expectation over the centers, this means that the kernel will be an average over all of the local kernels (16),
$$k_{\mathrm{struct}}(x, x') = \frac{1}{|S|} \sum_{c \in S} k_c(x, x'), \qquad (18)$$
where $k_c$ denotes the local kernel (16) with center c.
Let $\mathcal{H}$ be the RKHS associated with kstruct(·, ·), another space of functions that take in images and output a real number. The sum (18) implies that $\mathcal{H} = \bigoplus_{c \in S} \mathcal{H}_c$, i.e. the RKHS is a direct sum of local RKHS’s [80]. This means that any function $f \in \mathcal{H}$ can be written as $f = \sum_{c \in S} f_c$, with every $f_c \in \mathcal{H}_c$. The norm of this function comes from taking a minimum over all such decompositions
We can think of functions with small $\mathcal{H}$-norm, which will be easiest to learn, as sums of smooth functions of local wavelet coefficients.
A.5 Timeseries data generation
We detail how the two frequency classification tasks from Sections 2.3.1 and 2.3.2 are generated. In both tasks, each example is an L ms timeseries sampled at f Hz, making each x a vector of length d = L × f / 1000 (for example, 100 ms at 2 kHz gives d = 200). Thus in the discrete setting, we only have d total frequencies. While the math below shows continuous signals, in our code we generate analogous discrete signals using the discrete Fourier transform basis.
A.5.1 Frequency detection
The frequency detection task from Section 2.3.1 is a binary classification task. The positive examples contain a pure sinusoidal signal with frequency f1 and additive Gaussian noise. The negative examples are just white noise. They are generated in the following way: where ωj = 2πj/L is the j-th natural frequency, a is a parameter that sets the SNR, and the coefficients ηj, ξj, ηf, ξf are random variables uniformly sampled from the unit circle (which gives each frequency component a random phase).
We define $\mathrm{SNR} = a^2/(1 - a^2)$ with a ∈ [0, 1]. Larger a means a larger contribution of the pure tone and smaller amplitude noise. Note that . The generation process ensures that the L2 energy of both the negative and positive examples are matched and that the SNR is equal to the ratio of energy captured in frequency f1 to the total energy in all other components.
We generate a balanced dataset with 7,000 timeseries signals which we split into a training set with 5,600 examples and a test set with 1,400 examples. We tuned the GP covariance parameters flo, fhi, and γ from (8) using 3-fold cross validation on the training set. We found the optimal parameters for a network with 20 hidden neurons and used them for all hidden layer widths. We tested flo and fhi parameters from 10 Hz to 200 Hz at increments of 10 Hz. For γ, we set the parameter range to be from 10 ms to 100 ms and used all parameters at increments of 10 ms. Using grid search, we tested all combinations of these parameters. The optimal model was refit using all 5,600 training samples, and the errors we report were measured on the test set.
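One way to generate examples with these properties is sketched below. It is an illustration under stated assumptions rather than the original data-generation code: it builds real-valued signals from d/2 − 1 orthonormal sinusoids with independent random phases (skipping the DC and Nyquist components), instead of the d complex Fourier modes counted in the text, and it takes SNR = a²/(1 − a²) as the ratio of tone energy to noise energy. The function name and default arguments are hypothetical.

```python
import numpy as np


def make_example(rng, positive, d=200, fs=2000.0, f1=50.0, snr=1.76):
    """Return one unit-energy timeseries: tone plus noise if positive, noise only otherwise."""
    n = np.arange(d)
    js = np.arange(1, d // 2)  # usable frequency indices (skip DC and Nyquist)
    phases = rng.uniform(0.0, 2 * np.pi, size=js.size)
    # Orthonormal sinusoids with independent random phases, one row per frequency.
    U = np.sqrt(2.0 / d) * np.cos(2 * np.pi * np.outer(js, n) / d - phases[:, None])
    if positive:
        a = np.sqrt(snr / (1.0 + snr))  # from SNR = a^2 / (1 - a^2)
        j1 = int(round(f1 * d / fs))    # DFT index of the tone
        amps = np.full(js.size, np.sqrt((1.0 - a**2) / (js.size - 1)))
        amps[js == j1] = a
    else:
        # Equal energy in every component, total energy 1.
        amps = np.full(js.size, np.sqrt(1.0 / js.size))
    return amps @ U


rng = np.random.default_rng(0)
x_pos = make_example(rng, positive=True)
x_neg = make_example(rng, positive=False)
```

Because the sinusoids are orthonormal, both classes have unit energy by construction, so a classifier cannot separate them by total power alone.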
A.5.2 Frequency XOR
We use a similar setup to generate the timeseries for the frequency exclusive-or (XOR) task in Section 2.3.2. The positive examples are either frequency f1 or f2 Hz pure sinusoids with additive Gaussian noise. The negative examples are either mixed frequency timeseries (with both f1 and f2 Hz signals) or pure Gaussian noise. They are generated in the following way: for k ∈ {1, 2} in the x+,k(t) function. The constants, random variables, and details of SNR are identical to the frequency detection section. The datasets we generate have balanced proportions of x+,1, x+,2, x−,noise, and x−,mixed signals.
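The four signal types of the XOR task can be sketched in the same spirit. This is a hypothetical illustration rather than the authors' implementation: the function names and the integer component indices k1 and k2 are made up here, and the choice to split the signal energy a² equally between the two tones in the mixed case is an assumption, since the text only states that the SNR details match the detection task.

```python
import numpy as np

# Label convention assumed here: single tones are positive, mixed/noise negative.
KINDS = {"f1": 1, "f2": 1, "mixed": -1, "noise": -1}

def xor_example(d, k1, k2, a, kind, rng):
    """One unit-energy timeseries of the requested kind."""
    n = np.arange(d)
    j = np.arange(1, (d - 1) // 2 + 1)              # skip DC and Nyquist
    phase = rng.uniform(0.0, 2.0 * np.pi, size=j.size)
    comps = np.sqrt(2.0 / d) * np.cos(2.0 * np.pi * np.outer(j, n) / d + phase[:, None])
    m = comps.shape[0]
    if kind == "noise":
        return comps.sum(axis=0) / np.sqrt(m)
    tones = {"f1": [k1], "f2": [k2], "mixed": [k1, k2]}[kind]
    idx = [k - 1 for k in tones]
    # Assumption: total signal energy a^2, shared equally among the tones present.
    signal = a / np.sqrt(len(idx)) * comps[idx].sum(axis=0)
    noise = np.delete(comps, idx, axis=0).sum(axis=0)
    return signal + np.sqrt((1.0 - a**2) / (m - len(idx))) * noise

def xor_dataset(n_per_kind, d, k1, k2, a, seed=0):
    """Dataset with equal proportions of the four signal kinds."""
    rng = np.random.default_rng(seed)
    kinds = list(KINDS) * n_per_kind
    X = np.stack([xor_example(d, k1, k2, a, kind, rng) for kind in kinds])
    y = np.array([KINDS[kind] for kind in kinds])
    return X, y
```

Because a linear readout of the energy at either frequency alone cannot separate the mixed examples from the single-tone examples, this labeling has the XOR structure described above.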
A.6 SNR amplification via filtering
Let’s consider the simple frequency detection task with stationary bandpass features. Since these features are stationary, their eigenvectors are Fourier modes, i.e. Φ is the discrete Fourier transform (DFT) matrix. Assume the features encode a bandpass filter, which means that Λ = diag(λi), i ∈ {0,…, d – 1}, with λi = 1 for ilo ≤ i ≤ ihi and 0 otherwise.
In frequency detection, a single frequency component (discrete Fourier mode) contains the signal with energy a². The other d – 1 components each have energy (1 – a²)/(d – 1), for a total energy of 1 – a² contained in this white noise. After the basis change (4) is applied to any input x, the transformed vector will have zeros in all entries outside the passband. (We use the conjugate transpose Φ* here rather than the transpose since the DFT matrix is complex; the interpretation is the same.) This makes the new representation effectively d′-dimensional, where d′ = ihi – ilo + 1 is the number of components in the passband.
First, assume that the signal is within the passband. The total noise energy in the transformed representation becomes (1 – a²)(d′ – 1)/(d – 1), since one of the d′ components is still signal, and no energy is lost in the retained components because Φ is unitary. The overall noise is shrunk by a factor of (d′ – 1)/(d – 1), so the SNR gets boosted by a factor of (d – 1)/(d′ – 1). In the limiting case where d′ = 1, the noise energy is 0 and the SNR is infinite. On the other hand, if the signal lies outside the passband, the SNR is reduced to 0.
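A small numerical check of this argument, taking d′ to be the number of retained components: with d = 100, d′ = 10, and a = 0.5, the SNR before filtering is 0.25/0.75 = 1/3, the predicted boost is (d – 1)/(d′ – 1) = 99/9 = 11, and the SNR after filtering is therefore about 3.67. The sketch below reproduces this with a unitary DFT. The function name and arguments are hypothetical, and the input is built directly from its Fourier coefficients for simplicity.

```python
import numpy as np

def snr_after_bandpass(d, i_lo, i_hi, i_sig, a, seed=0):
    """SNR of a tone-plus-white-noise input after an ideal bandpass filter.

    The passband keeps DFT components i_lo..i_hi (inclusive); the tone sits
    at component i_sig with energy a^2.
    """
    rng = np.random.default_rng(seed)
    # Fourier coefficients: random phases, energy a^2 in the signal component
    # and (1 - a^2) / (d - 1) in each of the other d - 1 components.
    amp = np.full(d, np.sqrt((1.0 - a**2) / (d - 1)))
    amp[i_sig] = a
    coeffs_in = amp * np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=d))
    x = np.fft.ifft(coeffs_in, norm="ortho")        # x = Phi c (unitary)
    coeffs = np.fft.fft(x, norm="ortho")            # basis change Phi^* x
    lam = np.zeros(d)
    lam[i_lo:i_hi + 1] = 1.0                        # diagonal of Lambda
    kept = lam * np.abs(coeffs) ** 2                # energy per retained mode
    signal = kept[i_sig]
    noise = kept.sum() - signal
    return signal / noise if noise > 1e-15 else np.inf

snr_before = 0.5**2 / (1.0 - 0.5**2)
snr_after = snr_after_bandpass(d=100, i_lo=5, i_hi=14, i_sig=7, a=0.5)
boost = snr_after / snr_before                      # close to 99 / 9 = 11
```

Moving `i_sig` outside the range `i_lo..i_hi` drives the returned SNR to 0, and shrinking the passband to the single signal component makes it infinite, matching the two limiting cases above.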
A.7 Implementation details and code availability
In all experiments with RFNs, the training algorithm is an SVM classifier with squared hinge loss provided by the sklearn.svm.LinearSVC class, with all other parameters set to their defaults. We used scipy [12] and numpy [5] to construct both classical unstructured and neural-inspired structured weights. For the experiments with fully-trained networks, we used pytorch [8]. The cross entropy loss function was optimized using full-batch gradient descent (GD), implemented with the stochastic gradient descent (SGD) optimizer. Our code is available at https://github.com/BruntonUWBio/structured-random-features.
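A minimal sketch of this training setup is given below, under several assumptions: the hidden layer is taken to be a fixed random projection followed by a ReLU with a zero threshold, the weights are classical unstructured Gaussian draws, and the data are a toy problem invented for the example. The helper names `random_features` and `fit_rfn` are hypothetical; only `LinearSVC` with its default squared hinge loss comes from the description above. The repository linked above contains the actual implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def random_features(X, W, threshold=0.0):
    """Hidden-layer activations relu(X W^T - threshold) with fixed weights W."""
    return np.maximum(X @ W.T - threshold, 0.0)

def fit_rfn(X_train, y_train, W):
    """Train only the linear readout on top of the fixed random features."""
    clf = LinearSVC()  # default loss is the squared hinge
    clf.fit(random_features(X_train, W), y_train)
    return clf

# Toy data (an assumption for illustration): label is the sign of the first input.
rng = np.random.default_rng(0)
d, n_hidden = 20, 200
W = rng.standard_normal((n_hidden, d)) / np.sqrt(d)   # unstructured weights
X = rng.standard_normal((600, d))
y = np.where(X[:, 0] > 0, 1, -1)

clf = fit_rfn(X[:500], y[:500], W)
accuracy = clf.score(random_features(X[500:], W), y[500:])
```

Structured weights would enter only through `W`: replacing the unstructured draw with samples from a neural-inspired covariance leaves the readout training unchanged.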
A.8 Covariance of V1 neurons with other stimuli
We repeat the covariance analysis from Section 2.2.3 on three additional datasets of V1 neurons. Different stimuli were shown in each dataset to calculate the receptive field.
The first dataset was provided by Ringach et al. from their work on characterizing the spatial structure of simple receptive fields in macaque (Macaca fascicularis) V1 [9]. The spikes of 250 neurons were recorded in response to drifting sinusoidal gratings. The receptive fields were calculated from the stimuli and responses using subspace reverse correlation. Because of the bandlimited properties of sinusoidal stimuli, this experiment biases the reconstruction towards smooth receptive fields. The receptive fields were of various sizes: 32 pixels × 32 pixels, 64 pixels × 64 pixels, and 128 pixels × 128 pixels. We resized them to a common dimension of 32 pixels × 32 pixels using local mean averaging. We find the optimal covariance parameters that fit the data to be s = 2.41 and f = 0.95 pixels. The covariance matrices and eigenfunctions are shown in Fig. 9. Examples of biological receptive fields and random samples from the fitted model are shown in Fig. 10 in the Appendix.
The second dataset contains the responses of 69,957 neurons recorded from the primary visual cortex of mice bred to express GCaMP6s. We presented 5,000 static natural images of 24 × 27 pixels in random order for 3 trials each. We calculated the receptive fields from the natural images and calcium responses of cells using ridge regression with an ℓ2 penalty set to 0.1 after each image pixel was z-scored across images. We used the average receptive field over all three trials. For the covariance analysis, we picked cells with SNR > 0.4, which gave us 10,782 cells. The optimal covariance parameters that fit the data are s = 5.40 and f = 1.17 pixels. The covariance matrices and eigenfunctions are shown in Fig. 11. Examples of biological receptive fields and random samples from the fitted model are shown in Fig. 12. Repeating this analysis using receptive fields from individual trials yields identical results.
The third dataset contains the responses of 4,337 neurons also recorded from the primary visual cortex of mice bred to express GCaMP6s. The mice were shown static discrete Hartley transform (DHT, similar to a real-valued discrete Fourier transform) basis functions of size 30 × 80 pixels, and the calcium responses of neurons were recorded. The receptive fields were calculated using linear regression without any ℓ2 penalty (equivalently, ridge regression with the penalty set to zero). Here, we picked cells with SNR > 1 for analysis, which left us with 2,698 cells. The optimal covariance parameters that fit the data are s = 10.46 and f = 1.20 pixels. The covariance matrices and eigenfunctions are shown in Fig. 13. Examples of biological receptive fields and random samples from the fitted model are shown in Fig. 14.
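The regression-based receptive field estimates used for the two mouse datasets can be sketched as follows. This is an illustrative closed-form ridge solution, not the authors' pipeline: the function name, the array layout, the centering of the responses, and the way the penalty enters the normal equations (added directly to XᵀX) are all assumptions.

```python
import numpy as np

def ridge_receptive_fields(images, responses, penalty=0.1):
    """Estimate one linear receptive field per cell by ridge regression.

    images:    array (n_images, height, width)
    responses: array (n_images, n_cells)
    Returns an array (n_cells, height * width).
    """
    X = images.reshape(len(images), -1).astype(float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)        # z-score each pixel across images
    Y = responses - responses.mean(axis=0)          # assumption: center the responses
    gram = X.T @ X + penalty * np.eye(X.shape[1])   # penalty = 0 gives least squares
    return np.linalg.solve(gram, X.T @ Y).T

# Synthetic check (made-up data): recover known filters from noisy linear responses.
rng = np.random.default_rng(0)
images = rng.standard_normal((2000, 6, 7))
true_rf = rng.standard_normal((3, 42))
responses = images.reshape(2000, -1) @ true_rf.T + 0.1 * rng.standard_normal((2000, 3))
rf = ridge_receptive_fields(images, responses, penalty=0.1)
```

Given per-trial estimates stacked in an array of shape (n_trials, n_cells, n_pixels), the trial-averaged receptive field would simply be the mean over the first axis.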
A.9 Initialization of networks with structured weights
We show results of initializing fully trained neural networks across a range of network widths (50, 100, 400, and 1,000) and learning rates (10⁻³, 10⁻², and 10⁻¹) in Figures 18, 19, 20, and 21.
A.10 Deep network experiments
We experimented with using the V1-inspired weight initialization in the first two convolutional layers of AlexNet [48] and training on the ImageNet Large Scale Visual Recognition Challenge from 2012 [75]. Our implementation was based on the example provided by pytorch and torchvision [8] and used the same optimization routine, parameters, and schedule as in https://github.com/pytorch/examples/tree/master/imagenet.
All convolutional layers were initialized with weights drawn from a Gaussian distribution with variance (cin dx dy)⁻¹, where cin is the number of input channels, and dx and dy are the dimensions of the filter. This is equal to the reciprocal of the fan-in. In the case of classical initialization, this Gaussian distribution has covariance proportional to the identity, whereas in the structured case we use the V1-inspired covariance centered on the middle of the filter, with independent draws for each input channel. All biases and weights in the other layers are set to their pytorch defaults. The structured weights were only used in the first two convolutional layers, which have filter dimensions dx × dy = 11 × 11 and 5 × 5. The size parameter was set to s = max(dx, dy) · 3 and the frequency bandwidth to f = max(dx, dy)/5.
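The initialization described above might look roughly like the following numpy sketch. Two points are assumptions rather than statements from this section: the V1-inspired covariance is taken to be a Gaussian smoothness kernel with bandwidth f multiplied by a Gaussian envelope of size s around the filter center, and the structured covariance is rescaled so that the average per-weight variance equals the reciprocal of the fan-in. The function names are hypothetical.

```python
import numpy as np

def v1_covariance(dx, dy, s, f):
    """Assumed V1-inspired covariance over the dx * dy pixels of one filter."""
    ys, xs = np.mgrid[0:dy, 0:dx]
    pts = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    center = np.array([(dy - 1) / 2.0, (dx - 1) / 2.0])
    sq_dist = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
    envelope = np.exp(-((pts - center) ** 2).sum(axis=-1) / (2.0 * s**2))
    return np.exp(-sq_dist / (2.0 * f**2)) * np.outer(envelope, envelope)

def init_conv_weights(c_out, c_in, dx, dy, structured, rng):
    """Draw conv filters of shape (c_out, c_in, dy, dx) with mean variance 1 / fan-in."""
    n_pix = dx * dy
    if structured:
        s, f = 3.0 * max(dx, dy), max(dx, dy) / 5.0   # settings quoted in the text
        cov = v1_covariance(dx, dy, s, f)
    else:
        cov = np.eye(n_pix)                           # classical: identity covariance
    # Rescale so trace(cov) = 1 / c_in, i.e. average variance 1 / (c_in * dx * dy).
    cov = cov * (1.0 / c_in) / np.trace(cov)
    # Sample through the eigendecomposition, which tolerates near-singular cov.
    evals, evecs = np.linalg.eigh(cov)
    root = evecs * np.sqrt(np.clip(evals, 0.0, None))
    z = rng.standard_normal((c_out * c_in, n_pix))    # independent draw per channel pair
    return (z @ root.T).reshape(c_out, c_in, dy, dx)

rng = np.random.default_rng(0)
first_layer = init_conv_weights(64, 3, 11, 11, structured=True, rng=rng)
```

The resulting array could be copied into the weight tensor of the corresponding convolutional layer before training; the remaining layers would keep their default initialization.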
We show training and testing loss over the first 10 epochs for both the classical and structured initializations in Fig. 22. The structured initialization at first shows an advantage over classical, with consistently lower losses for the first 4 epochs, but eventually the classical network catches up. From this point onwards (until the 90 training epochs are complete), the classical network has the same or lower loss. Both networks end up performing well, reaching accuracies close to those reported in [48] and the torchvision documentation (https://pytorch.org/vision/stable/models.html), as shown in Table 3. The classical initialization performs slightly better overall.
These null results are perhaps not surprising: the initial layers of AlexNet contain only 64 and 192 output channels (i.e. filters) respectively, making up only a small fraction of the total weights in the network. The deeper convolutional layers contain many more channels and are built with small 3 × 3 filters where our initialization is unlikely to help. It is also possible that the effects of initialization are less important for overparametrized models or with large amounts of training data.
Footnotes
Small changes to the appendix: discussed change in kernel eigenfunctions under structured weights.
1. The V1 weights have all eigenvalues nonzero, but the spectrum decays exponentially, so it acts as a lowpass filter.